           Memory Bottlenecks: Overcoming a Common AI Problem


competitors "have a very limited number of pins, so they have to go to things like HBM to get very high bandwidth on a small number of pins — but HBM is really expensive, hard to get, and high-power," Ditzel said.

Esperanto's multichip approach makes more pins available for communication with off-chip DRAM. Alongside six processor chips, the company uses 24 inexpensive LPDDR4x DRAM chips designed for cellphones, running at low voltage with "about the same energy per bit as HBM," Ditzel said.

"Because [LPDDR4x] is lower-bandwidth [than HBM], we get more bandwidth by going wider," he added. "We go to 1,500 bits wide on the memory system on the accelerator card, [while one-chip competitors] cannot afford a 1,500-bit–wide memory system, because for every data pin, you've got to have a couple of power and a couple of ground pins, and it's just too many pins.

"Having dealt with this problem before, we said, 'Let's just split it up,'" said Ditzel.

The total memory capacity of 192 GB is accessed via 822-GB/s memory bandwidth. The total across all 24 64-bit DRAM chips works out to a 1,536-bit–wide memory system, split into 96× 16-bit channels to better handle memory latency. It all fits into a power budget of 120 W.

Esperanto claims to have solved the memory bottleneck by using six smaller chips rather than a single large chip, leaving pins available to connect to LPDDR4x chips. (Source: Esperanto Technologies)

PIPELINING WEIGHTS
Wafer-scale AI accelerator company Cerebras Systems has devised a memory bottleneck solution at the far end of the scale. At Hot Chips, the company announced MemoryX, a memory extension system for its CS-2 AI accelerator system aimed at high-performance computing and scientific workloads. MemoryX seeks to enable the training of huge AI models with a trillion or more parameters.

MemoryX is a combination of DRAM and flash storage that behaves as if on-chip. The architecture is promoted as elastic and is designed to accommodate between 4 TB and 2.4 PB (200 billion to 120 trillion parameters) — sufficient capacity for the world's biggest AI models.

Cerebras Systems' MemoryX, an off-chip memory expansion for its CS-2 wafer-scale engine system, behaves as though it were on-chip. (Source: Cerebras Systems)

To make its off-chip memory behave as if on-chip, Cerebras optimized MemoryX to stream parameter and weight data to the processor in a way that eliminates the impact of latency, said Sean Lie, the company's co-founder and chief hardware architect.

"We separated the memory from the compute, fundamentally disaggregating them," he said. "And by doing so, we made the communication elegant and straightforward. The reason we can do this is that neural networks use memory differently for different components of the model. So we can design a purpose-built solution for each type of memory and for each type of compute."

As a result, those components are untangled, thereby "simplify[ing] the scaling problem," said Lie.

During training, latency-sensitive activation memory must be accessed immediately, so Cerebras keeps activations on-chip.

Cerebras stores weights on MemoryX and streams them onto the chip as required. Weight memory is used relatively infrequently and without back-to-back dependencies, said Lie, which can be leveraged to avoid latency and performance bottlenecks. Coarse-grained pipelining also avoids dependencies between layers: weights for a layer start streaming before the previous layer is complete.

Meanwhile, fine-grained pipelining avoids dependencies between training iterations: weight updates in the backward pass are overlapped with the subsequent forward pass of the same layer.

Cerebras uses pipelining to remove latency-sensitive communication during AI training. (Source: Cerebras Systems)

"By using these pipelining techniques, the weight-streaming execution model can hide the extra latency from external weights, and we can hit the same performance as if the weights were [accessed] locally on the wafer," Lie said. ■

Sally Ward-Foxton is editor-in-chief of EE Times Weekend. This article was originally published on EE Times and may be viewed at bit.ly/3uhJSxE.
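
A quick back-of-the-envelope check of the numbers quoted above can be done in a few lines of Python. The per-chip capacity and bytes-per-parameter figures are inferred from the stated totals rather than taken from the vendors:

    # Esperanto accelerator card: memory-system arithmetic from the stated figures
    dram_chips = 24                      # LPDDR4x chips on the card
    bits_per_chip = 64                   # interface width of each DRAM chip
    bus_width = dram_chips * bits_per_chip
    print(bus_width)                     # 1536-bit-wide memory system
    print(bus_width // 16)               # 96 independent 16-bit channels
    print(192 / dram_chips)              # 8 GB per chip, inferred from the 192-GB total

    # Cerebras MemoryX: both ends of the stated range imply roughly 20 bytes per parameter
    print(4e12 / 200e9)                  # 4 TB for 200 billion parameters
    print(2.4e15 / 120e12)               # 2.4 PB for 120 trillion parameters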

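To make the overlap Lie describes a little more concrete, the sketch below prefetches the next layer's weights while the current layer is still computing. It is a schematic illustration only: the function names and the thread-based prefetch are hypothetical stand-ins, not Cerebras's implementation or API.

    from concurrent.futures import ThreadPoolExecutor

    def fetch_weights(layer):
        # Placeholder for streaming one layer's weights in from external memory.
        return f"W{layer}"

    def run_layer(layer, weights, activations):
        # Placeholder for on-chip compute; activations never leave the chip.
        return f"a{layer}({weights},{activations})"

    def forward_pass(num_layers, activations):
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(fetch_weights, 0)              # start streaming layer 0
            for i in range(num_layers):
                weights = pending.result()                       # blocks only if streaming lags compute
                if i + 1 < num_layers:
                    pending = pool.submit(fetch_weights, i + 1)  # coarse-grained overlap: prefetch next layer
                activations = run_layer(i, weights, activations)
            return activations

    print(forward_pass(3, "x"))

The fine-grained variant described above would additionally overlap each layer's weight update in the backward pass with that layer's next forward pass, hiding the write-back latency in the same way.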