           Memory Bottlenecks: Overcoming a Common AI Problem


competitors "have a very limited number of pins, so they have to go to things like HBM to get very high bandwidth on a small number of pins — but HBM is really expensive, hard to get, and high-power," Ditzel said.

Esperanto's multichip approach makes more pins available for communication with off-chip DRAM. Alongside six processor chips, the company uses 24 inexpensive LPDDR4x DRAM chips designed for cellphones, running at low voltage with "about the same energy per bit as HBM," Ditzel said.

"Because [LPDDR4x] is lower-bandwidth [than HBM], we get more bandwidth by going wider," he added. "We go to 1,500 bits wide on the memory system on the accelerator card, [while one-chip competitors] cannot afford a 1,500-bit–wide memory system, because for every data pin, you've got to have a couple of power and a couple of ground pins, and it's just too many pins.

"Having dealt with this problem before, we said, 'Let's just split it up,'" said Ditzel.

The total memory capacity of 192 GB is accessed via 822-GB/s memory bandwidth. The total across all 24 64-bit DRAM chips works out to a 1,536-bit–wide memory system, split into 96× 16-bit channels to better handle memory latency. It all fits into a power budget of 120 W.

Esperanto claims to have solved the memory bottleneck by using six smaller chips rather than a single large chip, leaving pins available to connect to LPDDR4x chips. (Source: Esperanto Technologies)

PIPELINING WEIGHTS
Wafer-scale AI accelerator company Cerebras Systems has devised a memory bottleneck solution at the far end of the scale. At Hot Chips, the company announced MemoryX, a memory extension system for its CS-2 AI accelerator system aimed at high-performance computing and scientific workloads. MemoryX seeks to enable the training of huge AI models with a trillion or more parameters.

MemoryX is a combination of DRAM and flash storage that behaves as if on-chip. The architecture is promoted as elastic and is designed to accommodate between 4 TB and 2.4 PB (200 billion to 120 trillion parameters) — sufficient capacity for the world's biggest AI models.

Cerebras Systems' MemoryX, an off-chip memory expansion for its CS-2 wafer-scale engine system, behaves as though it were on-chip. (Source: Cerebras Systems)

To make its off-chip memory behave as if on-chip, Cerebras optimized MemoryX to stream parameter and weight data to the processor in a way that eliminates the impact of latency, said Sean Lie, the company's co-founder and chief hardware architect.

"We separated the memory from the compute, fundamentally disaggregating them," he said. "And by doing so, we made the communication elegant and straightforward. The reason we can do this is that neural networks use memory differently for different components of the model. So we can design a purpose-built solution for each type of memory and for each type of compute."

As a result, those components are untangled, thereby "simplify[ing] the scaling problem," said Lie.

During training, latency-sensitive activation memory must be accessed immediately, so Cerebras keeps activations on-chip.

Cerebras stores weights on MemoryX and streams them onto the chip as required. Weight memory is used relatively infrequently and without back-to-back dependencies, said Lie, which can be leveraged to avoid latency and performance bottlenecks. Coarse-grained pipelining also avoids dependencies between layers: weights for a layer start streaming before the previous layer is complete.

Meanwhile, fine-grained pipelining avoids dependencies between training iterations: weight updates in the backward pass are overlapped with the subsequent forward pass of the same layer.

Cerebras uses pipelining to remove latency-sensitive communication during AI training. (Source: Cerebras Systems)

"By using these pipelining techniques, the weight-streaming execution model can hide the extra latency from external weights, and we can hit the same performance as if the weights were [accessed] locally on the wafer," Lie said. ■

Sally Ward-Foxton is editor-in-chief of EE Times Weekend. This article was originally published on EE Times and may be viewed at bit.ly/3uhJSxE.
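
A quick back-of-the-envelope check of the numbers quoted above can be done in a few lines of Python. The per-chip capacity and bytes-per-parameter figures are inferred from the stated totals rather than taken from the vendors:

    # Esperanto accelerator card: memory-system arithmetic from the stated figures
    dram_chips = 24                      # LPDDR4x chips on the card
    bits_per_chip = 64                   # interface width of each DRAM chip
    bus_width = dram_chips * bits_per_chip
    print(bus_width)                     # 1536-bit-wide memory system
    print(bus_width // 16)               # 96 independent 16-bit channels
    print(192 / dram_chips)              # 8 GB per chip, inferred from the 192-GB total

    # Cerebras MemoryX: both ends of the stated range imply roughly 20 bytes per parameter
    print(4e12 / 200e9)                  # 4 TB for 200 billion parameters
    print(2.4e15 / 120e12)               # 2.4 PB for 120 trillion parameters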

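To make the overlap Lie describes a little more concrete, the sketch below prefetches the next layer's weights while the current layer is still computing. It is a schematic illustration only: the function names and the thread-based prefetch are hypothetical stand-ins, not Cerebras's implementation or API.

    from concurrent.futures import ThreadPoolExecutor

    def fetch_weights(layer):
        # Placeholder for streaming one layer's weights in from external memory.
        return f"W{layer}"

    def run_layer(layer, weights, activations):
        # Placeholder for on-chip compute; activations never leave the chip.
        return f"a{layer}({weights},{activations})"

    def forward_pass(num_layers, activations):
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(fetch_weights, 0)              # start streaming layer 0
            for i in range(num_layers):
                weights = pending.result()                       # blocks only if streaming lags compute
                if i + 1 < num_layers:
                    pending = pool.submit(fetch_weights, i + 1)  # coarse-grained overlap: prefetch next layer
                activations = run_layer(i, weights, activations)
            return activations

    print(forward_pass(3, "x"))

The fine-grained variant described above would additionally overlap each layer's weight update in the backward pass with that layer's next forward pass, hiding the write-back latency in the same way.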