Tesla AI Day: What to Expect for the Future of Self-Driving Cars
The training tile packaging includes multiple layers for power and control, current distribution, the compute plane (25 D1 chips), and the cooling system. The training tile is for use in IT centers, not in autonomous vehicles.

The training tile provides 25× the performance of a single D1 chip: up to 9 petaflops for 16-bit floating-point calculations and up to 565 Tflops for 32-bit floating-point calculations.
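A quick back-of-the-envelope check makes the implied per-chip figures explicit; a sketch in Python, using only the numbers quoted above:

```python
# Derive per-chip throughput from the training-tile figures above.
CHIPS_PER_TILE = 25
TILE_BF16_PFLOPS = 9.0      # 16-bit floating point, per tile
TILE_FP32_TFLOPS = 565.0    # 32-bit floating point, per tile

per_chip_bf16_tflops = TILE_BF16_PFLOPS * 1000 / CHIPS_PER_TILE
per_chip_fp32_tflops = TILE_FP32_TFLOPS / CHIPS_PER_TILE
print(per_chip_bf16_tflops, per_chip_fp32_tflops)  # 360.0 22.6
```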
Twelve training tiles in a 2 × 3 × 2 configuration can be packed into a cabinet, which Tesla calls a training matrix.
EXAPOD
The largest system that Tesla described is the ExaPOD. It is built from 120 training tiles, which add up to 3,000 D1 chips and 1.062 million training nodes. It fits in 10 cabinets and is clearly intended for IT center use.

Maximum performance of the ExaPOD is 1.09 exaflops for 16-bit floating-point calculations and 67.8 Pflops for 32-bit floating-point calculations.
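Those totals follow directly from the tile-level figures; a short sketch of the arithmetic (the 354-nodes-per-chip value is implied by the totals quoted above):

```python
# ExaPOD totals derived from the training-tile figures above.
TILES = 120
CHIPS_PER_TILE = 25
NODES_PER_CHIP = 354        # implied by 1.062 million nodes / 3,000 chips
TILE_BF16_PFLOPS = 9.0
TILE_FP32_TFLOPS = 565.0

chips = TILES * CHIPS_PER_TILE               # 3,000 D1 chips
nodes = chips * NODES_PER_CHIP               # 1,062,000 training nodes
bf16_ef = TILES * TILE_BF16_PFLOPS / 1000    # ~1.08 Eflops (1.09 quoted)
fp32_pf = TILES * TILE_FP32_TFLOPS / 1000    # 67.8 Pflops
print(chips, nodes, bf16_ef, fp32_pf)
```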
DOJO SOFTWARE & DPU
The Dojo software is designed to support training of both large and small neural networks. Tesla has a compiler to create software code that exploits the structure and capabilities of the training nodes, D1 chips, training tiles, and ExaPOD systems. It uses the PyTorch open-source machine-learning library with extensions that target the D1 chip and the Dojo system architecture.
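Tesla has not published its PyTorch extensions, but out-of-tree accelerator backends normally sit behind PyTorch's standard device abstraction, so training scripts change very little. A minimal sketch in stock PyTorch; the "dojo" device string is hypothetical and appears only in a comment:

```python
import torch
import torch.nn as nn

# An ordinary PyTorch model; nothing Dojo-specific in its definition.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Stock PyTorch device selection. A Dojo-enabled build would presumably
# register its own device type, e.g. torch.device("dojo") (hypothetical).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)
x = torch.randn(64, 512, device=device)  # one 64-sample batch
loss = model(x).sum()
loss.backward()  # gradients land on the same device
```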
These capabilities allow big neural networks to be partitioned and mapped to extract model, graph, and data parallelism, speeding up the training of large networks. The compiler uses multiple techniques to extract parallelism: it can transform networks to achieve fine-grained parallelism using data-, model-, and graph-parallelism techniques, and it can optimize to reduce memory footprints.
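The Dojo compiler reportedly extracts these partitionings automatically. To pin down the vocabulary, here is a hand-rolled sketch in plain PyTorch that splits a model across two placements (model parallelism) and a batch into micro-batches (data parallelism); the devices, sizes, and serial loop are illustrative only:

```python
import torch
import torch.nn as nn

# Two stand-in placements; on Dojo these would be separate partitions.
dev0 = torch.device("cpu")
dev1 = torch.device("cpu")

# Model parallelism: the first stage lives on dev0, the second on dev1.
stage0 = nn.Sequential(nn.Linear(256, 256), nn.ReLU()).to(dev0)
stage1 = nn.Linear(256, 10).to(dev1)

def forward(x: torch.Tensor) -> torch.Tensor:
    h = stage0(x.to(dev0))
    return stage1(h.to(dev1))  # activations cross the partition boundary

# Data parallelism: split one large batch into per-replica micro-batches.
batch = torch.randn(128, 256)
outputs = [forward(mb) for mb in batch.chunk(4)]  # 4 replicas, run serially
print(torch.cat(outputs).shape)  # torch.Size([128, 10])
```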
The Dojo interface processors are used to communicate with host computers in IT and data centers. They connect to host computers over PCIe 4.0 and to the D1-based system via the high-bandwidth links explained above. The interface processors also provide high-bandwidth shared DRAM memory for the D1 systems.

D1-based systems can be subdivided and partitioned into units called Dojo Processing Units (DPUs). A DPU consists of one or more D1 chips, an interface processor, and one or more host computers. The DPU is a virtual system that can be scaled up or down as needed by the neural network running on it.
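Tesla described the DPU only at this conceptual level, with no API. Purely to illustrate the scale-to-fit idea, here is a toy model; the class, the sizing rule, and the 360-Tflops-per-chip constant (9 Pflops / 25 chips, from above) are all invented for this sketch:

```python
import math
from dataclasses import dataclass

D1_BF16_TFLOPS = 360  # per-chip 16-bit figure implied by 9 Pflops / 25 chips

@dataclass
class DPU:
    """Toy stand-in for a Dojo Processing Unit: chips + interface + hosts."""
    d1_chips: int
    interface_processors: int = 1
    hosts: int = 1

    def scale_to(self, required_tflops: float) -> None:
        # Grow or shrink the chip count to cover the training workload.
        self.d1_chips = max(1, math.ceil(required_tflops / D1_BF16_TFLOPS))

dpu = DPU(d1_chips=1)
dpu.scale_to(5_000)   # a hypothetical 5-Pflops training job
print(dpu.d1_chips)   # 14
```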
BOTTOM LINE
The Tesla neural network training chip, system, and software are very impressive. There is a lot of innovation, such as retaining tremendous bandwidth and low latency from the chip up to the system level. The packaging of the training tile for power and cooling also looks innovative.

The neural network training systems are for data center use and will certainly be used to improve Tesla's AV software. It is likely that other companies will also use these Tesla neural network training systems.

A key question is how the neural network systems will be used in inferencing applications in AVs. The training tile's power consumption looks too high for automotive use in the current version: one picture in the presentation had a "15-kW heat rejection" label for the training tile, and a slide listed the D1 chip at a 400-W TDP. At 25 chips per tile, the D1 chips alone would account for 10 kW of that heat budget, with the remainder presumably going to power delivery and I/O.

It looks like Tesla is hoping for, or depending on, this neural network training innovation to make Autopilot an L3- or L4-capable system using only camera-based sensors. Is this a good bet? Time will tell, but so far, most of Elon Musk's bets have been good, albeit with some delay. ■

Egil Juliussen is the former director of research for infotainment and ADAS at IHS Automotive; an independent auto industry analyst; and EE Times' "Egil's Eye" columnist. This article was originally published on EE Times and may be viewed at bit.ly/3zO66Z6.

