Tesla AI Day: Neural Network Announcements

| Topic | Neural Network Information | Other Information |
| --- | --- | --- |
| Training node | Smallest training unit with single processor | CPU instruction set tailored to neural network training |
| | Processor: 64-bit superscalar processor | Vector 8 × 8 matrix multiply and SIMD architecture |
| | 1.25-MB ECC-protected SRAM | ECC memory |
| | Max floating-point performance | 32-bit: 64 Gflops; 16-bit: 1,024 Gflops |
| | Max transfer rate: 512 Gbps | To each of four adjacent training nodes |
| D1 chip | Tesla D1 AI chip for neural network training | Complexity: 50 billion transistors |
| | D1 chip packaging | Flip-chip ball grid array |
| | Chip manufacturer: TSMC | Manufacturing tech: 7 nm |
| | D1 has a compute array of 354 processors | The 354 training nodes described above |
| | I/O ring: 576 lanes of low-power SerDes | SerDes transfer rate: 112 Gbps |
| | Max on-chip transfer rate: 10 Tbps | Per direction |
| | Max off-chip transfer rate: 4 Tbps | For each side of chip |
| | Max floating-point performance | 32-bit: 22.6 Tflops; 16-bit: 362 Tflops |
| Training tile | 25 connected D1 chips in MCM | Possibly the largest MCM in the chip industry |
| | Retains D1 bandwidth within training tile | Retains D1 bandwidth to other tiles |
| | Max floating-point performance | 32-bit: 565 Tflops; 16-bit: 9 Pflops |
| | Bandwidth: 9 Tbps for each side | Bandwidth for four sides: 36 Tbps |
| | Packaged as a system or building block | For IT and data center use |
| ExaPOD | 120 training tiles for neural network training | 25 D1 chips per training tile |
| | 3,000 D1 chips | 1.062 million training nodes per ExaPOD |
| | Max ExaPOD floating-point performance | FP32: 67.8 Pflops; BFP16/CFP8: 1.09 Eflops |
| Dojo software | Create large and small neural network models | User can choose size of neural network training hardware |
| | Neural network compiler | To leverage parallelism of D1-based hardware |
| | PyTorch with Dojo extensions | Open-source machine-learning library |
| | Dojo interface processors and software drivers | To connect to host computer in data centers |
| DPU | Dojo virtual hardware: from one D1 to ExaPOD | Called Dojo Processing Unit (DPU) |
| | Virtual system for D1-based neural network training | Consists of one or more D1 chips plus Dojo interface processor and data center host(s) |

(Source: Egil Juliussen, September 2021)
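The table reads as a multiplier chain, and the ExaPOD totals can be reproduced in a few lines. The Python sketch below is purely illustrative; every count in it comes from the table.

```python
# Dojo build hierarchy, using the counts in the table above.
# Illustrative sketch only -- not Tesla code.
NODES_PER_D1 = 354       # training nodes (CPUs) in one D1 compute array
D1_PER_TILE = 25         # D1 chips per training tile (MCM)
TILES_PER_EXAPOD = 120   # training tiles per ExaPOD

d1_per_exapod = D1_PER_TILE * TILES_PER_EXAPOD    # 3,000 D1 chips
nodes_per_exapod = NODES_PER_D1 * d1_per_exapod   # 1,062,000 training nodes

print(f"{d1_per_exapod:,} D1 chips and {nodes_per_exapod:,} training nodes per ExaPOD")
```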


The D1 chip packs more than 11 miles of wires, with power consumption in the 400-W range.
The D1 chip has an I/O ring of high-speed, low-power SerDes, with a total of 576 lanes surrounding the chip. Each lane has a transfer rate of 112 Gbps. The maximum D1 on-chip transfer rate is 10 Tbps per direction, and the maximum off-chip transfer rate is 4 Tbps for each side of the chip.
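As a rough plausibility check, the stated lane count and per-lane rate can be multiplied out. The sketch below assumes, purely for illustration, that every lane runs at its full rate concurrently; Tesla has not published lane-utilization or encoding details.

```python
# D1 I/O ring, back-of-the-envelope from the stated specs.
# Assumes all lanes at full rate concurrently -- an illustrative assumption.
LANES = 576                  # low-power SerDes lanes around the chip
LANE_RATE_GBPS = 112         # transfer rate per lane
OFF_CHIP_PER_SIDE_TBPS = 4   # stated max off-chip rate per side

raw_capacity_tbps = LANES * LANE_RATE_GBPS / 1_000   # ~64.5 Tbps raw
off_chip_total_tbps = 4 * OFF_CHIP_PER_SIDE_TBPS     # 16 Tbps over four sides

print(f"Raw SerDes capacity:     {raw_capacity_tbps:.1f} Tbps")
print(f"Off-chip total, 4 sides: {off_chip_total_tbps} Tbps")
```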
Each of the 354 CPUs on a D1 chip has 1.25 MB of SRAM, which adds up to more than 442 MB of SRAM per chip. The maximum performance of the D1 chip likewise scales with the compute array of 354 training nodes: 22.6 Tflops for 32-bit floating-point calculations and 362 Tflops for 16-bit floating-point calculations.
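Both the SRAM total and the chip-level performance figures fall out of multiplying the per-node specs by the 354-node array; a minimal check:

```python
# D1 chip totals derived from the per-node specs (354 training nodes).
NODES = 354
SRAM_MB = 1.25        # ECC-protected SRAM per training node
FP32_GFLOPS = 64      # 32-bit flops per training node
FP16_GFLOPS = 1024    # 16-bit flops per training node

print(f"SRAM per D1: {NODES * SRAM_MB:.1f} MB")                  # 442.5 MB
print(f"FP32 per D1: {NODES * FP32_GFLOPS / 1_000:.2f} Tflops")  # 22.66, ~22.6 as stated
print(f"FP16 per D1: {NODES * FP16_GFLOPS / 1_000:.1f} Tflops")  # 362.5, ~362 as stated
```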
           TRAINING TILE
Tesla’s training tile is the building block for scaling AI training systems. A training tile integrates 25 D1 dies onto a wafer and is packaged as a multichip module (MCM). Tesla believes this may be the largest MCM in the chip industry. The training tile is packaged as a large chip that can be connected to other training tiles via a high-bandwidth connector that retains the bandwidth of the training tile.

Tesla’s training tile system (Source: Tesla)
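The tile-level figures in the table follow from the same scaling logic, 25 D1 chips per tile; a minimal sketch using the chip figures above:

```python
# Training tile totals, scaling the D1 figures by 25 chips per tile.
D1_PER_TILE = 25
D1_FP32_TFLOPS = 22.6
D1_FP16_TFLOPS = 362
TILE_BW_PER_SIDE_TBPS = 9   # stated tile edge bandwidth

print(f"FP32 per tile: {D1_PER_TILE * D1_FP32_TFLOPS:.0f} Tflops")          # 565, as stated
print(f"FP16 per tile: {D1_PER_TILE * D1_FP16_TFLOPS / 1_000:.2f} Pflops")  # 9.05, ~9 as stated
print(f"Tile bandwidth, 4 sides: {4 * TILE_BW_PER_SIDE_TBPS} Tbps")         # 36 Tbps
```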
