            SPECIAL REPORT: ARTIFICIAL INTELLIGENCE
Tesla AI Day: What to Expect for the Future of Self-Driving Cars


           By Egil Juliussen


Tesla’s AI Day took place on Aug. 19 and featured the introduction of automotive chips, systems, and software for machine learning and neural network training, which together will advance the training of models for self-driving cars.

Elon Musk and his team of chip and system designers detailed these solutions in a more-than-three-hour presentation that can be viewed at bit.ly/3ofPsjd. Here are the highlights.

NEURAL NETS
Tesla has designed a flexible and expandable distributed computer architecture that is tailored to neural network training. Tesla’s architecture starts with the D1 special-purpose chip, with 354 training nodes, each with a powerful CPU. These training-node CPUs are designed for high-performance machine-learning and neural network tasks and have a maximum performance of 64 gigaflops for 32-bit floating-point operations. For the D1 chip, with 354 CPUs, the max performance is 22.6 teraflops for 32-bit floating-point arithmetic. For 16-bit floating-point calculations, the D1 max performance jumps to 362 Tflops.

Tesla introduced two systems for neural network training: the training tile and the ExaPOD. A training tile has 25 connected D1 chips in a multi-chip package and thus comprises 8,850 training nodes, each with the high-performance CPU summarized above. The maximum performance of a training tile is 565 Tflops for 32-bit floating-point calculations.

The ExaPOD connects 120 training tiles into a system, or 3,000 D1 chips with 1.062 million training nodes. The max performance of an ExaPOD is 67.8 Pflops for 32-bit floating-point calculations.
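
These peak numbers compose multiplicatively up the hierarchy: node × 354 gives the D1 chip, × 25 gives the training tile, × 120 gives the ExaPOD. A quick Python sketch (plain arithmetic on the figures above, not anything Tesla published) confirms the scaling:

```python
# Peak-performance scaling across Tesla's training hierarchy, computed
# from the per-node figures quoted in this article. Illustrative only.

node_fp32 = 64e9       # training node: 64 Gflops, FP32
node_bfp16 = 1024e9    # training node: 1,024 Gflops, BFP16/CFP8

levels = [("D1 chip", 354), ("Training tile", 25), ("ExaPOD", 120)]

nodes, fp32, bfp16 = 1, node_fp32, node_bfp16
for name, factor in levels:
    nodes *= factor
    fp32 *= factor
    bfp16 *= factor
    print(f"{name:13s} {nodes:>9,} nodes  "
          f"FP32 {fp32:.3g} flops  BFP16/CFP8 {bfp16:.3g} flops")
```

The products land on the published figures: 22.6 Tflops and 362 Tflops for the D1 chip, 565 Tflops and about 9 Pflops for a training tile, and 67.8 Pflops and roughly 1.09 Eflops for an ExaPOD.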

TESLA NEURAL NETWORK ANNOUNCEMENT DETAILS
The introduction of the D1 chip and Dojo neural network training system shows Tesla’s direction. The R&D investment to get these products into production is undoubtedly very high. Tesla is likely to share this technology with other companies to create a revenue stream similar to the battery EV (BEV) credits sold to other OEMs.

The table below lists the characteristics of Tesla’s neural network product announcements. The data has been extracted from the video of the August event. I have added my understanding of the chip and system architecture in a few places.

Tesla’s design goal was to scale three system characteristics across its chips and systems: compute performance, high bandwidth, and low-latency communication between compute nodes. High bandwidth and low latency have always been difficult to scale to hundreds or thousands of compute nodes. It looks like Tesla has been successful in scaling all three parameters, organized in a connected 2D mesh format.
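
The 2D mesh is easiest to picture as a grid in which each node links to its north, south, east, and west neighbors. The sketch below models that adjacency; the 16 × 16 grid is hypothetical (Tesla has not published the exact node layout), and the 512-Gbps link rate is the figure given for adjacent training nodes in the next section.

```python
# Four-neighbor (N/S/E/W) adjacency in a 2D mesh, the communication
# pattern described in the article. Grid size here is hypothetical.

LINK_GBPS = 512  # stated bandwidth between adjacent training nodes

def mesh_neighbors(x: int, y: int, width: int, height: int):
    """Mesh coordinates adjacent to (x, y); interior nodes have four."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return [(x + dx, y + dy) for dx, dy in steps
            if 0 <= x + dx < width and 0 <= y + dy < height]

# An interior node has four links; corner and edge nodes have fewer,
# which is why mesh position matters when mapping work onto nodes.
for node in [(7, 7), (0, 0)]:
    links = mesh_neighbors(*node, 16, 16)
    print(f"{node}: {len(links)} links at {LINK_GBPS} Gbps each -> {links}")
```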

TRAINING NODE
The training node is the smallest training unit on the D1 chip. It has a 64-bit processor with four-way scalar and four-way multi-threaded program execution. The CPU also has a two-way vector data path with 8 × 8 vector multiplication.

The instruction set architecture of the CPU is tailored to machine-learning and neural network training tasks. The CPU supports multiple floating-point formats: 32-bit (FP32), 16-bit (BFP16), and 8-bit (configurable FP8, or CFP8).

The processor has 1.25 MB of high-speed SRAM for program and data storage. The memory uses error-correction code for increased reliability.

To get low latency between training nodes, Tesla picked the farthest distance that a signal could travel in one cycle of a 2-GHz+ clock. This defined how close the training nodes should be and how complex the CPU and its support electronics could be. These parameters also allowed a CPU to communicate with four adjacent training nodes at 512 Gbps.

The maximum performance of the training node varies with the arithmetic used. Floating-point performance is commonly used for comparison. The max training-node 32-bit floating-point (FP32) performance is 64 Gflops. The max performance for BFP16 or CFP8 arithmetic is 1,024 Gflops.
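
That timing argument is easy to check with back-of-envelope arithmetic. The sketch below assumes a clock of exactly 2 GHz (Tesla says 2 GHz+) and uses the vacuum speed of light only as an upper bound on signal travel, since on-chip wires are several times slower:

```python
# Back-of-envelope checks on the training-node figures quoted above.

CLOCK_HZ = 2.0e9            # assumed exactly 2 GHz; Tesla states "2 GHz+"
cycle_ns = 1e9 / CLOCK_HZ   # 0.5 ns per clock cycle

# Hard ceiling on per-cycle signal distance: light in vacuum. Real on-chip
# propagation is a fraction of this, shrinking the node-spacing budget.
C_MM_PER_NS = 299.8
print(f"cycle time {cycle_ns} ns; light-speed ceiling "
      f"{C_MM_PER_NS * cycle_ns:.0f} mm per cycle")

# Per-cycle throughput implied by the stated per-node peaks at 2 GHz.
print(f"FP32: {64e9 / CLOCK_HZ:.0f} flops/cycle")          # 64 Gflops -> 32
print(f"BFP16/CFP8: {1024e9 / CLOCK_HZ:.0f} flops/cycle")  # 1,024 Gflops -> 512

# The 16x jump from FP32 to BFP16/CFP8 reflects narrower operands packed
# into the same vector hardware, a common trade of precision for throughput.
```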

D1 CHIP
The impressive Tesla D1 chip is a special-purpose design for neural network training. Manufactured in a 7-nm process, the D1 packs 50 billion transistors in a die measuring 645 mm². The chip has more

Tesla AI Day: Neural Network Summary

D1 chip and training nodes
• D1 chip designed for neural network training | Special-purpose chip
• Each D1 chip has 354 training nodes | Each node CPU designed for ML and neural network tasks
• Training node processor max performance | FP32: 64 Gflops; BFP16/CFP8: 1,024 Gflops
• D1 chip max performance (node × 354) | FP32: 22.6 Tflops; BFP16/CFP8: 362 Tflops

Training tile and ExaPOD
• Training tile: 25 connected D1 chips in MCM | 8,850 training nodes in multi-chip package
• Training tile max performance (D1 × 25) | FP32: 565 Tflops; BFP16/CFP8: 9 Pflops
• ExaPOD: 120 connected training tiles | 3,000 D1 chips; 1.062 million training nodes
• ExaPOD max performance (D1 × 3,000) | FP32: 67.8 Pflops; BFP16/CFP8: 1.09 Eflops

(Source: Egil Juliussen, September 2021)
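
For scale, here is some quick arithmetic on the D1 physicals stated above; these are illustrative whole-die averages, not metrics Tesla published:

```python
# Average silicon budget implied by the stated D1 figures: 50 billion
# transistors on a 645 mm^2 die holding 354 training nodes. These are
# whole-die averages; interconnect and I/O share the same budget.
transistors, die_mm2, nodes = 50e9, 645, 354
print(f"{transistors / die_mm2 / 1e6:.0f}M transistors per mm^2")  # ~78M
print(f"{transistors / nodes / 1e9:.2f}B transistors per node")    # ~0.14B
print(f"{die_mm2 / nodes:.2f} mm^2 per node")                      # ~1.82
```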

