SPECIAL REPORT: ARTIFICIAL INTELLIGENCE

Memory Bottlenecks: Overcoming a Common AI Problem

By Sally Ward-Foxton


Skeptics of artificial intelligence have criticized the memory bottleneck that exists in the current technology, arguing that the inability to accelerate the data movement between processor and memory is holding back useful real-world applications.

AI accelerators used to train AI models in data centers require the highest memory bandwidth available. While storing an entire model in a processor would eliminate off-chip memory from the equation, it isn’t a feasible solution, as the largest models measure in the billions or trillions of parameters.

Where yesterday’s systems were memory-constrained, today’s data center architectures use a variety of techniques to overcome memory bottlenecks.
Nvidia’s A100 data center GPU with six stacks of HBM2E memory (only five stacks are used, for yield reasons) (Source: Nvidia)

HIGH-BANDWIDTH MEMORY
A popular solution is to use high-bandwidth memory (HBM), which involves connecting a 3D stack of four, eight, or 12 DRAM dies to the processor via a silicon interposer. The latest version of the technology, HBM2E, features faster signaling rates per pin than its predecessor, up to 3.6 Gb/s per pin, thereby boosting bandwidth. Samsung and SK Hynix each offer eight-die HBM2E stacks with a total capacity of 16 GB, providing 460 GB/s of bandwidth (compared with 2.4 GB/s for DDR5 and 64 GB/s for GDDR6, according to SK Hynix). HBM3 is set to push speeds and capacities even higher.
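The 460-GB/s per-stack figure follows directly from the per-pin signaling rate and the width of the stack’s interface. A quick sanity check in Python, assuming the standard 1,024-bit data interface per HBM stack (an assumption; the article does not state the width):

    # Per-stack HBM2E bandwidth, from the figures quoted above
    pin_rate_gbps = 3.6          # Gb/s per pin (HBM2E)
    interface_width_bits = 1024  # data pins per stack (assumed, standard HBM width)

    stack_bandwidth_GBps = pin_rate_gbps * interface_width_bits / 8
    print(f"{stack_bandwidth_GBps:.1f} GB/s per stack")  # ~460.8 GB/s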
The latest version of Nvidia’s flagship data center GPU, the A100, provides 80 GB of HBM2E memory with 2 TB/s of memory bandwidth. The A100 incorporates five 16-GB stacks of DRAM, joining a 40-GB version that uses HBM2 for a total bandwidth of 1.6 TB/s. The difference between the two yields a threefold increase in AI model training speed for the deep-learning recommendation model, a known memory hog.
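The 80-GB variant’s totals follow the same arithmetic; a rough check using the figures above (derived numbers, not an Nvidia specification):

    # A100 80-GB HBM2E variant, using the figures quoted above
    stacks = 5                     # active HBM2E stacks
    stack_capacity_GB = 16
    total_bandwidth_GBps = 2000    # ~2 TB/s package bandwidth

    print(stacks * stack_capacity_GB, "GB total")           # 80 GB
    print(total_bandwidth_GBps / stacks, "GB/s per stack")  # ~400 GB/s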
Meanwhile, data center CPUs are leveraging HBM bandwidth. Intel’s next-generation Xeon data center CPUs, Sapphire Rapids, will introduce HBM to the Xeon family. They are Intel’s first data center CPUs to use new AMX instruction extensions designed specifically for matrix multiplication workloads like AI. They will also be able to use either off-chip DDR5 DRAM or DRAM plus HBM.
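AMX accelerates matrix multiplication by operating on small two-dimensional tiles held in dedicated registers. The sketch below is purely conceptual: it shows the tiled multiply-accumulate loop structure that such instructions are designed to speed up, not Intel’s actual intrinsics, tile sizes, or data types.

    import numpy as np

    def tiled_matmul(A, B, tile=16):
        # Conceptual tiled matrix multiply: C is built up one small tile at a
        # time, the access pattern that AMX-style tile instructions accelerate.
        M, K = A.shape
        K2, N = B.shape
        assert K == K2, "inner dimensions must match"
        C = np.zeros((M, N), dtype=A.dtype)
        for i in range(0, M, tile):
            for j in range(0, N, tile):
                for k in range(0, K, tile):
                    # one tile-level multiply-accumulate step
                    C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
        return C

    # Example: verify against numpy's own matmul
    A = np.random.rand(64, 64).astype(np.float32)
    B = np.random.rand(64, 64).astype(np.float32)
    assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)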
“Typically, CPUs are optimized for capacity, while accelerators and GPUs are optimized for bandwidth,” said Arijit Biswas, an Intel senior principal engineer, during a recent Hot Chips presentation. “However, with the exponentially growing model sizes, we see constant demand for both capacity and bandwidth without tradeoffs. Sapphire Rapids does just that by supporting both, natively.”

The approach is enhanced through memory tiering, “which includes support for software-visible HBM plus DDR, and software transparent caching that uses HBM as a DDR-backed cache,” Biswas added. (A conceptual sketch of the two modes appears at the end of this article.)

However, the HBM versions come at the cost of die area, Sapphire Rapids’ chief engineer, Nevine Nassif, told EE Times.

“The [HBM-compatible] die is slightly different,” she said. “There’s also an HBM controller that is different than the DDR5 controller. On the version of Sapphire Rapids without HBM, there’s an area of the die where we added accelerators for crypto, compression, etc. All of those go away — except for the data-streaming accelerator — and the HBM controller goes in instead.

“On top of that, we had to make some changes to the mesh to support the bandwidth requirements of HBM,” Nassif added.

Beyond CPUs and GPUs, HBM is popular for data center FPGAs. For example, Intel’s Stratix and the Xilinx Versal FPGAs come in HBM versions, and some AI ASICs also use it. Tencent-backed data center AI ASIC developer Enflame Technology uses HBM for its DTU 1.0 device, which is optimized for cloud AI training. The 80-TFLOPS (FP16/BF16)

Enflame’s DTU 1.0 data center AI accelerator chip has two stacks of HBM2 memory. (Source: Enflame Technology)
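As noted above, here is a conceptual sketch of the two HBM tiering modes Biswas described: a flat mode in which software sees HBM and DDR as separate pools and places hot data explicitly, and a caching mode in which hardware transparently keeps recently used data in HBM. This is illustrative Python only, not Intel software; the class and method names are hypothetical.

    # Illustrative only: a toy model of the two tiering modes (names are hypothetical).

    class FlatMode:
        """Software-visible HBM plus DDR: code decides where each buffer lives."""
        def __init__(self, hbm_capacity_gb):
            self.hbm_free = hbm_capacity_gb
            self.hbm, self.ddr = {}, {}

        def allocate(self, name, size_gb, hot=False):
            # Hot, bandwidth-bound buffers go to HBM while it has room.
            if hot and size_gb <= self.hbm_free:
                self.hbm[name] = size_gb
                self.hbm_free -= size_gb
            else:
                self.ddr[name] = size_gb

    class CacheMode:
        """HBM as a transparent, DDR-backed cache: software makes no placement calls."""
        def __init__(self, hbm_capacity_gb):
            self.hbm_capacity_gb = hbm_capacity_gb
            self.ddr = {}
            self.cached = []  # most recently used last

        def allocate(self, name, size_gb):
            self.ddr[name] = size_gb  # placement is invisible to software

        def touch(self, name):
            # Hardware-managed caching, modeled here as whole-buffer LRU.
            if name in self.cached:
                self.cached.remove(name)
            self.cached.append(name)
            while sum(self.ddr[n] for n in self.cached) > self.hbm_capacity_gb:
                self.cached.pop(0)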
