Page 12 - EE Times Europe Magazine – November 2023
Nvidia’s Michael Kagan: Building on AI’s ‘iPhone Moment’ to Architect Data Processing’s Future
amorphic pool of computing resources was a challenge, and the key technology to enable it was efficient communication and fast networks. We embarked on InfiniBand, [at the time] the newly developed industry standard for high-performance networking, and started to develop products based on the InfiniBand network standard.

The first highlight—a real highlight—was our second-generation network products. We developed a state-of-the-art network solution that knocked out all competition. Our InfiniBand network democratized supercomputers, starting in 2003 with the Virginia Tech team that built the third-fastest computer in the world using 1,000 Apple personal computers connected with our network. As time went by, our network became increasingly prevalent in supercomputers, and today, it is the de facto standard for high-performance computing. Oracle then built its database machine based on Mellanox networking. This was our debut in a much broader market than supercomputers—our entry point to enterprise and cloud.

Another highlight was leveraging InfiniBand technology and delivering its value on top of standard Ethernet. This unlocked a new dimension of opportunities, as almost every cloud provider began to use our network. No matter where you go on the internet, you will go through our network products.

EETE: When did you first hear about Nvidia, and what were your first impressions?

Kagan: Nvidia started in 1993 as a company designing graphics accelerator chips. I'm not sure when the term "accelerated computing" was coined, but this is what Nvidia was all about from the beginning. Nvidia developed world-class programmable technology for highly parallel processing. The programmability was exposed on easy-to-consume interfaces—CUDA—which remained stable across generations.

The combination of faster processors, mobility and the amount of data generated by mobile devices inspired the development of new data processing technology: artificial intelligence. This new data processing was calling for highly parallel computation technology. Nvidia applied the technology that was developed for image processing to AI. More than 20 years ago, the term "GPU" stood for graphics processing unit. In the age of AI, the GPU is actually a general processing unit doing the heavy-lifting data crunching in all AI workloads.

High-performance networking is required to build computers for AI workloads, so Mellanox started working with Nvidia 15 years ago. We worked closely together to build the world's fastest supercomputers. Nvidia GPUs were processing massive amounts of data, and the Mellanox network fed the beast with data.

EETE: Was it clear early on that the GPU would one day power supercomputers?

Kagan: Supercomputer workloads are highly parallel workloads. Since the early days, the primary benchmark to assess the performance of supercomputers has been LINPACK, a software library for performing numerical linear algebra, basically operating on huge matrices. This type of operation calls for accelerators for higher performance and power efficiency, and the GPU was a natural fit for these workloads. With the evolution of AI, linear algebra became mainstream computing. Nvidia identified the opportunity and reinvented the GPU to be a linear algebra accelerator, a GPU with no display port. All the silicon budget is devoted to linear algebra.

With Moore's Law running out of steam and AI workloads driving computing demand at an annual rate of 10×, only accelerated computing can meet this demand. This is where the GPU excels.

EETE: What were some of the reasons Nvidia acquired Mellanox, and how has that turned out?

Kagan: Today's computing demand can only be met by a new unit of computing. The entire data center became a new unit of computing that runs workloads spanning tens of thousands of compute nodes, with each node containing multiple GPUs and CPUs. Accelerated networking is required to feed these GPUs and CPUs. These compute nodes run distributed applications, and even a delay of a few nanoseconds in data delivery affects the entire application, causing wasted computing resources and excess power consumption.

Nvidia builds the largest computers in the world, and a high-performance network is one of the key elements to ensure predictable execution time and power efficiency and to improve TCO [total cost of ownership].

Mellanox worked closely with Nvidia for more than 10 years before the acquisition. At some point, it just made more sense to become one company. Since the acquisition, market developments have proven this was an excellent move for all.

EETE: Now, as CTO of Nvidia, how do you see the industry evolving in the next 10 years?

Kagan: The timing of this question is excellent! We are now experiencing the "iPhone moment" of AI, as ChatGPT has focused the world's attention on the transformative technology. Generative AI will have a monumental impact—probably more than the iPhone or the internet.

My role, as Nvidia CTO, is to architect across the wealth of Nvidia technologies to build the AI factories of the future. We are building an accelerated computing platform for data processing in the 21st century. AI-based computing will be accessed as a cloud service from everywhere—in data centers, edge devices, enterprise, mobile devices, etc. AI and LLMs [large language models] will emerge as mainstream computing platforms very shortly.

EETE: Do you remember being inspired by the person Grace Hopper? Can you say a few words about her and what she means to computing?

Kagan: Grace Hopper was a hugely impressive woman. Creator of the first compiler, she was a trailblazer in computer programming. She even coined the term "bug" for malfunctioning software. To honor her contributions to programming and software development, we named our GH200 Grace Hopper Superchip after her.

EETE: What are the main features of the Grace Hopper family of chips, and how does it break from traditional computing?

Kagan: The Nvidia GH200 Grace Hopper Superchip brings together the groundbreaking performance of the Nvidia Hopper GPU with the power-efficient, high-performance Nvidia Grace CPU, connected with the high-bandwidth, memory-coherent NVLink Chip-2-Chip [C2C] interconnect. This delivers up to 900 GB/s of total bandwidth, 7× higher than the standard PCIe Gen5 lanes commonly used in accelerated systems, and NVLink-C2C delivers this at 5× lower power. GH200 is ideal for the most demanding generative AI and HPC applications.

EETE: What applications are most appropriate for Grace Hopper? Climate modeling? Large language models?

Kagan: Customers need a versatile system to handle the largest AI models and realize the full potential of their infrastructure. GH200 is built to handle the most complex generative AI and accelerated computing workloads, spanning large language models, recommender systems, vector databases and high-performance computing. ■
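As a back-of-the-envelope check on the "7× higher than PCIe Gen5" bandwidth figure quoted above: the numbers are consistent if one assumes a PCIe Gen5 x16 link and counts both directions. The PCIe parameters below (32 GT/s per lane, 128b/130b line encoding, 16 lanes) are standard specification values, not taken from the interview, and the sketch assumes the 900-GB/s NVLink-C2C number is likewise a bidirectional total.

```python
# Sketch: checking the "7x over PCIe Gen5" comparison under stated assumptions.
PCIE_GEN5_GT_PER_LANE = 32          # giga-transfers/s per lane (PCIe 5.0 spec)
ENCODING_EFFICIENCY = 128 / 130     # 128b/130b line-coding overhead
LANES = 16                          # a full x16 link, common for accelerators

# Usable bandwidth per direction, in GB/s (8 bits per byte)
per_direction_gbs = PCIE_GEN5_GT_PER_LANE * ENCODING_EFFICIENCY * LANES / 8
total_pcie_gbs = 2 * per_direction_gbs      # both directions combined

NVLINK_C2C_TOTAL_GBS = 900                  # figure quoted in the interview

ratio = NVLINK_C2C_TOTAL_GBS / total_pcie_gbs
print(f"PCIe Gen5 x16 total: {total_pcie_gbs:.0f} GB/s, ratio: {ratio:.1f}x")
```

Under these assumptions a Gen5 x16 link carries roughly 126 GB/s in total, and 900 / 126 ≈ 7.1, matching the rounded 7× claim.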