            SPECIAL REPORT: ARTIFICIAL INTELLIGENCE
Tesla AI Day: What to Expect for the Future of Self-Driving Cars


           By Egil Juliussen


Tesla’s AI Day took place on Aug. 19 and featured the introduction of automotive chips, systems, and software for machine learning and neural network training, which together will advance the training of models for self-driving cars.

Elon Musk and his team of chip and system designers detailed these solutions in a more-than-three-hour presentation that can be viewed at bit.ly/3ofPsjd. Here are the highlights.

NEURAL NETS
Tesla has designed a flexible and expandable distributed computer architecture that is tailored to neural network training. Tesla’s architecture starts with the D1 special-purpose chip, with 354 training nodes, each with a powerful CPU. These training-node CPUs are designed for high-performance machine-learning and neural network tasks and have a maximum performance of 64 gigaflops for 32-bit floating-point operations. For the D1 chip, with 354 CPUs, the max performance is 22.6 teraflops for 32-bit floating-point arithmetic. For 16-bit floating-point calculations, the D1 max performance jumps to 362 Tflops.

Tesla introduced two systems for neural network training: the training tile and the ExaPOD. A training tile has 25 connected D1 chips in a multi-chip package and thus comprises 8,850 training nodes, each with the high-performance CPU summarized above. The maximum performance of a training tile is 565 Tflops for 32-bit floating-point calculations.

The ExaPOD connects 120 training tiles into a system, or 3,000 D1 chips with 1.062 million training nodes. The max performance of an ExaPOD is 67.8 Pflops for 32-bit floating-point calculations.
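
These peak numbers compose multiplicatively up the hierarchy: node × 354 gives the D1 chip, × 25 gives the training tile, × 120 gives the ExaPOD. A quick Python sketch (plain arithmetic on the figures above, not anything Tesla published) confirms the scaling:

```python
# Peak-performance scaling across Tesla's training hierarchy, computed
# from the per-node figures quoted in this article. Illustrative only.

node_fp32 = 64e9       # training node: 64 Gflops, FP32
node_bfp16 = 1024e9    # training node: 1,024 Gflops, BFP16/CFP8

levels = [("D1 chip", 354), ("Training tile", 25), ("ExaPOD", 120)]

nodes, fp32, bfp16 = 1, node_fp32, node_bfp16
for name, factor in levels:
    nodes *= factor
    fp32 *= factor
    bfp16 *= factor
    print(f"{name:13s} {nodes:>9,} nodes  "
          f"FP32 {fp32:.3g} flops  BFP16/CFP8 {bfp16:.3g} flops")
```

The products land on the published figures: 22.6 Tflops and 362 Tflops for the D1 chip, 565 Tflops and about 9 Pflops for a training tile, and 67.8 Pflops and roughly 1.09 Eflops for an ExaPOD.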

TESLA NEURAL NETWORK ANNOUNCEMENT DETAILS
The introduction of the D1 chip and Dojo neural network training system shows Tesla’s direction. The R&D investment to get these products into production is undoubtedly very high. Tesla is likely to share this technology with other companies to create a revenue stream similar to the battery EV (BEV) credits sold to other OEMs.

The table below lists the characteristics of Tesla’s neural network product announcements. The data has been extracted from the video of the August event. I have added my understanding of the chip and system architecture in a few places.

Tesla’s design goal was to scale three system characteristics across its chips and systems: compute performance, high bandwidth, and low-latency communication between compute nodes. High bandwidth and low latency have always been difficult to scale to hundreds or thousands of compute nodes. It looks like Tesla has been successful in scaling all three parameters, organized in a connected 2D mesh format.
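
The 2D mesh is easiest to picture as a grid in which each node links to its north, south, east, and west neighbors. The sketch below models that adjacency; the 16 × 16 grid is hypothetical (Tesla has not published the exact node layout), and the 512-Gbps link rate is the figure given for adjacent training nodes in the next section.

```python
# Four-neighbor (N/S/E/W) adjacency in a 2D mesh, the communication
# pattern described in the article. Grid size here is hypothetical.

LINK_GBPS = 512  # stated bandwidth between adjacent training nodes

def mesh_neighbors(x: int, y: int, width: int, height: int):
    """Mesh coordinates adjacent to (x, y); interior nodes have four."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return [(x + dx, y + dy) for dx, dy in steps
            if 0 <= x + dx < width and 0 <= y + dy < height]

# An interior node has four links; corner and edge nodes have fewer,
# which is why mesh position matters when mapping work onto nodes.
for node in [(7, 7), (0, 0)]:
    links = mesh_neighbors(*node, 16, 16)
    print(f"{node}: {len(links)} links at {LINK_GBPS} Gbps each -> {links}")
```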

TRAINING NODE
The training node is the smallest training unit on the D1 chip. It has a 64-bit processor with four-way scalar and four-way multi-threaded program execution. The CPU also has a two-way vector data path with 8 × 8 vector multiplication.

The instruction set architecture of the CPU is tailored to machine-learning and neural network training tasks. The CPU supports multiple floating-point formats: 32-bit (FP32), 16-bit (BFP16), and 8-bit (configurable FP8, or CFP8).

The processor has 1.25 MB of high-speed SRAM for program and data storage. The memory uses error-correction code for increased reliability.

To get low latency between training nodes, Tesla picked the farthest distance that a signal could travel in one cycle of a 2-GHz+ clock. This defined how close the training nodes should be and how complex the CPU and its support electronics could be. These parameters also allowed a CPU to communicate with four adjacent training nodes at 512 Gbps.

The maximum performance of the training node varies with the arithmetic used. Floating-point performance is commonly used for comparison. The max training-node 32-bit floating-point (FP32) performance is 64 Gflops. The max performance for BFP16 or CFP8 arithmetic is 1,024 Gflops.
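
That timing argument is easy to check with back-of-envelope arithmetic. The sketch below assumes a clock of exactly 2 GHz (Tesla says 2 GHz+) and uses the vacuum speed of light only as an upper bound on signal travel, since on-chip wires are several times slower:

```python
# Back-of-envelope checks on the training-node figures quoted above.

CLOCK_HZ = 2.0e9            # assumed exactly 2 GHz; Tesla states "2 GHz+"
cycle_ns = 1e9 / CLOCK_HZ   # 0.5 ns per clock cycle

# Hard ceiling on per-cycle signal distance: light in vacuum. Real on-chip
# propagation is a fraction of this, shrinking the node-spacing budget.
C_MM_PER_NS = 299.8
print(f"cycle time {cycle_ns} ns; light-speed ceiling "
      f"{C_MM_PER_NS * cycle_ns:.0f} mm per cycle")

# Per-cycle throughput implied by the stated per-node peaks at 2 GHz.
print(f"FP32: {64e9 / CLOCK_HZ:.0f} flops/cycle")          # 64 Gflops -> 32
print(f"BFP16/CFP8: {1024e9 / CLOCK_HZ:.0f} flops/cycle")  # 1,024 Gflops -> 512

# The 16x jump from FP32 to BFP16/CFP8 reflects narrower operands packed
# into the same vector hardware, a common trade of precision for throughput.
```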

D1 CHIP
The impressive Tesla D1 chip is a special-purpose design for neural network training. Manufactured in a 7-nm process, the D1 packs 50 billion transistors in a die measuring 645 mm². The chip has more

Tesla AI Day: Neural Network Summary

D1 chip and training nodes
• D1 chip designed for neural network training | Special-purpose chip
• Each D1 chip has 354 training nodes | Each node CPU designed for ML and neural network tasks
• Training node processor max performance | FP32: 64 Gflops; BFP16/CFP8: 1,024 Gflops
• D1 chip max performance (node × 354) | FP32: 22.6 Tflops; BFP16/CFP8: 362 Tflops

Training tile and ExaPOD
• Training tile: 25 connected D1 chips in MCM | 8,850 training nodes in multi-chip package
• Training tile max performance (D1 × 25) | FP32: 565 Tflops; BFP16/CFP8: 9 Pflops
• ExaPOD: 120 connected training tiles | 3,000 D1 chips; 1.062 million training nodes
• ExaPOD max performance (D1 × 3,000) | FP32: 67.8 Pflops; BFP16/CFP8: 1.09 Eflops

(Source: Egil Juliussen, September 2021)
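
For scale, here is some quick arithmetic on the D1 physicals stated above; these are illustrative whole-die averages, not metrics Tesla published:

```python
# Average silicon budget implied by the stated D1 figures: 50 billion
# transistors on a 645 mm^2 die holding 354 training nodes. These are
# whole-die averages; interconnect and I/O share the same budget.
transistors, die_mm2, nodes = 50e9, 645, 354
print(f"{transistors / die_mm2 / 1e6:.0f}M transistors per mm^2")  # ~78M
print(f"{transistors / nodes / 1e9:.2f}B transistors per node")    # ~0.14B
print(f"{die_mm2 / nodes:.2f} mm^2 per node")                      # ~1.82
```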

