Memory Bottlenecks: Overcoming a Common AI Problem
[Figure] Graphcore’s comparison of capacity and bandwidth for different memory technologies. While others try to solve both with HBM2E, Graphcore uses a combination of host DDR memory plus on-chip SRAM on its Colossus Mk2 AI accelerator chip. (Source: Graphcore)

[Figure] Graphcore’s cost analysis for HBM2 versus DDR4 memory has the former costing 10× more than the latter. (Source: Graphcore)


chip uses two HBM2 stacks, providing 512-GB/s bandwidth connected through an on-chip network.

PERFORMANCE PER DOLLAR
While HBM offers extreme bandwidth for the off-chip memory needed for data center AI accelerators, a few notable holdouts remain.

Graphcore is among them. During his Hot Chips presentation, Graphcore CTO Simon Knowles noted that faster computation in large AI models requires both memory capacity and memory bandwidth. While others use HBM to boost both capacity and bandwidth, the tradeoffs include HBM’s cost, power consumption, and thermal limitations.

Graphcore’s second-generation intelligence processing unit (IPU) instead uses its large 896 MiB of on-chip SRAM to supply the memory bandwidth required to feed its 1,472 processor cores. That’s enough bandwidth to avoid offloading to DRAM, Knowles said. For memory capacity, AI models too big to fit on-chip use low-bandwidth remote DRAM in the form of server-class DDR attached to the host processor, an arrangement that also allows mid-sized models to be spread over the SRAM of a cluster of IPUs.
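As a rough illustration of that split — fast on-chip SRAM for bandwidth, host DDR for capacity — a minimal placement sketch might look like the following. The threshold logic and the plan_placement helper are hypothetical illustrations for this article, not Graphcore’s actual software stack.

```python
# Hypothetical sketch of the capacity/bandwidth split Knowles describes:
# keep models that fit in fast on-chip SRAM (or a cluster's worth of SRAM)
# on-chip, and stage anything bigger in slower host DDR.
# The decision logic is illustrative only, not Graphcore's tooling.

SRAM_PER_IPU_BYTES = 896 * 2**20   # 896 MiB of on-chip SRAM per IPU


def plan_placement(model_bytes: int, cluster_size: int = 1) -> str:
    """Decide where a model's parameters live, in the spirit of the
    article: SRAM for bandwidth, host DDR for capacity."""
    if model_bytes <= SRAM_PER_IPU_BYTES:
        return "fits in one IPU's SRAM (full on-chip bandwidth)"
    if model_bytes <= cluster_size * SRAM_PER_IPU_BYTES:
        return f"spread over the SRAM of a {cluster_size}-IPU cluster"
    return "too big for the cluster's SRAM; stage in host DDR"


if __name__ == "__main__":
    for size_mib in (500, 3_000, 40_000):
        print(size_mib, "MiB ->", plan_placement(size_mib * 2**20, cluster_size=4))
```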
Given that the company promotes its IPU on a performance-per-dollar basis, Graphcore’s primary reason to reject HBM appears to be cost.

“The net cost of HBM integrated with an AI processor is greater than 10× the cost of server-class DDR per byte,” Knowles said. “Even at modest capacity, HBM dominates the processor module cost. If an AI computer can use DDR instead, it can deploy more AI processors for the same total cost of ownership.”

According to Knowles, 40 GB of HBM effectively triples the cost of a packaged reticle-sized processor. Graphcore’s cost breakdown of 8 GB of HBM2 versus 8 GB of DDR4 reckons that the HBM die is double the size of a DDR4 die (comparing a 20-nm HBM with an 18-nm DDR4, which Knowles argued are contemporaries), thereby increasing manufacturing costs. On top of that come the costs of TSV etching, stacking, assembly, and packaging, along with the profit margins of both the memory maker and the processor maker.

“This margin stacking does not occur for the DDR DIMM, because the user can source that directly from the memory manufacturer,” Knowles said. “In fact, a primary reason for the emergence of a pluggable ecosystem of computer components is to avoid margin stacking.”
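Knowles’s argument is easy to make concrete with back-of-the-envelope arithmetic. In the sketch below, every dollar figure and margin is an invented placeholder chosen only to show how the multipliers compound; none of these numbers comes from Graphcore’s actual breakdown.

```python
# Back-of-the-envelope model of "margin stacking," using made-up inputs.
# Only the structure (adders and margins compounding) reflects the article;
# every figure below is an illustrative assumption, not Graphcore data.

DDR4_COST_PER_GB = 1.0             # normalized baseline: DDR4 sourced directly

die_cost = 2.0 * DDR4_COST_PER_GB  # HBM die roughly 2x the area of a DDR4 die
tsv_and_stacking = 1.0             # assumed adder: TSV etch, stacking, test
assembly_and_packaging = 1.0       # assumed adder: interposer and packaging

memory_maker_margin = 1.6          # assumed margin on the finished HBM stack
processor_maker_margin = 1.5       # assumed margin re-applied at module level

hbm_cost_per_gb = die_cost + tsv_and_stacking + assembly_and_packaging
hbm_cost_per_gb *= memory_maker_margin * processor_maker_margin

print(f"HBM2 vs. DDR4, cost per byte: {hbm_cost_per_gb / DDR4_COST_PER_GB:.1f}x")
# ~9.6x with these placeholders; Knowles's own breakdown put it above 10x.
```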
GOING WIDER
Emerging from stealth mode at Hot Chips, Esperanto Technologies offered yet another take on the memory bottleneck problem. The company’s 1,000-core RISC-V AI accelerator targets hyperscaler recommendation-model inference rather than the AI training workloads mentioned above.

Dave Ditzel, Esperanto’s founder and executive chairman, noted that data center inference does not require huge on-chip memory. “Our customers did not want 250 MB on-chip,” Ditzel said. “They wanted 100 MB — all the things they wanted to do with inference fit into 100 MB. Anything bigger than that will need a lot more.”

Ditzel added that customers prefer large amounts of DRAM on the same card as the processor, not on-chip. “They advised us, ‘Just get everything onto the card once, and then use your fast interfaces. Then, as long as you can get to 100 GB of memory faster than you can get to it over the PCIe bus, it’s a win.’”
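Ditzel’s rule of thumb is straightforward to sanity-check. The sketch below compares streaming a 100-GB working set over a PCIe Gen4 x16 link against reading it from on-card DRAM; the on-card bandwidth figure is an assumed placeholder for illustration, not an Esperanto specification.

```python
# Sanity check of "get to 100 GB of memory faster than over PCIe."
# PCIe Gen4 x16 peaks near 32 GB/s; the on-card DRAM figure below is an
# assumed placeholder, not an Esperanto specification.

WORKING_SET_GB = 100

PCIE_GEN4_X16_GBPS = 32    # theoretical peak, before protocol overhead
ON_CARD_DRAM_GBPS = 200    # assumption: aggregate on-card DRAM bandwidth

t_pcie = WORKING_SET_GB / PCIE_GEN4_X16_GBPS
t_card = WORKING_SET_GB / ON_CARD_DRAM_GBPS

print(f"over PCIe:    {t_pcie:.2f} s per pass")
print(f"on-card DRAM: {t_card:.2f} s per pass")
# With these assumptions, the on-card path is ~6x faster -- the "win"
# Ditzel describes for keeping the whole model on the card.
```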
Comparing Esperanto’s approach with those of other data center inference accelerators, Ditzel said that others focus on a single giant processor consuming the entire power budget. Esperanto’s approach — multiple low-power processors mounted on dual M.2 accelerator cards — better enables the use of off-chip memory, the startup insists. Single-chip
