Page 12 - EE Times Europe Magazine – November 2023

        Nvidia’s Michael Kagan: Building on AI’s ‘iPhone Moment’ to Architect Data Processing’s Future


amorphic pool of computing resources was a challenge, and the key technology to enable it was efficient communication and fast networks. We embarked on InfiniBand, [at the time] the newly developed industry standard for high-performance networking, and started to develop products based on the InfiniBand network standard.

The first highlight—a real highlight—was our second-generation network products. We developed a state-of-the-art network solution that knocked out all competition. Our InfiniBand network democratized supercomputers, starting in 2003 with the Virginia Tech team that built the third-fastest computer in the world using 1,000 Apple personal computers connected with our network. As time went by, our network became increasingly prevalent in supercomputers, and today, it is the de facto standard for high-performance computing.

Oracle then built its database machine based on Mellanox networking. This was our debut in a much broader market than supercomputers—our entry point to enterprise and cloud.

"In the age of AI, the GPU is actually a general processing unit doing the heavy-lifting data crunching in all AI workloads."
—MICHAEL KAGAN

Another highlight was leveraging InfiniBand technology and delivering its value on top of standard Ethernet. This unlocked a new dimension of opportunities, as almost every cloud provider began to use our network. No matter where you go on the internet, you will go through our network products.

EETE: When did you first hear about Nvidia, and what were your first impressions?

Kagan: Nvidia started in 1993 as a company designing graphics accelerator chips. I'm not sure when the term "accelerated computing" was coined, but this is what Nvidia was all about from the beginning. Nvidia developed world-class programmable technology for highly parallel processing. The programmability was exposed through an easy-to-consume interface—CUDA—which remained stable across generations.

The combination of faster processors, mobility and the amount of data generated by mobile devices inspired the development of a new data processing technology: artificial intelligence. This new data processing was calling for highly parallel computation technology. Nvidia applied the technology that was developed for image processing to AI. More than 20 years ago, the term "GPU" stood for graphics processing unit. In the age of AI, the GPU is actually a general processing unit doing the heavy-lifting data crunching in all AI workloads.

High-performance networking is required to build computers for AI workloads, so Mellanox started working with Nvidia 15 years ago. We worked closely together to build the world's fastest supercomputers. Nvidia GPUs were processing massive amounts of data, and the Mellanox network fed the beast with data.

EETE: Was it clear early on that the GPU would one day power supercomputers?

Kagan: Supercomputer workloads are highly parallel workloads. Since the early days, the primary benchmark for assessing the performance of supercomputers has been LINPACK, a software library for performing numerical linear algebra, basically operating on huge matrices. This type of operation calls for accelerators for higher performance and power efficiency, and the GPU was a natural fit for these workloads. With the evolution of AI, linear algebra became mainstream computing. Nvidia identified the opportunity and reinvented the GPU as a linear algebra accelerator—a GPU with no display port. All the silicon budget is devoted to linear algebra.

With Moore's Law running out of steam and AI workloads driving computing demand at an annual rate of 10×, only accelerated computing can meet this demand. This is where the GPU excels.

EETE: What were some of the reasons Nvidia acquired Mellanox, and how has that turned out?

Kagan: Today's computing demand can only be met by a new unit of computing. The entire data center became a new unit of computing that runs workloads spanning tens of thousands of compute nodes, with each node containing multiple GPUs and CPUs. Accelerated networking is required to feed these GPUs and CPUs. These compute nodes run distributed applications, and even a delay of a few nanoseconds in data delivery affects the entire application, causing wasted computing resources and excess power consumption.

Nvidia builds the largest computers in the world, and a high-performance network is one of the key elements to ensure predictable execution time and power efficiency and to improve TCO [total cost of ownership].

Mellanox worked closely with Nvidia for more than 10 years before the acquisition. At some point, it just made more sense to become one company. Since the acquisition, market developments have proven this was an excellent move for all.

EETE: Now, as CTO of Nvidia, how do you see the industry evolving in the next 10 years?

Kagan: The timing of this question is excellent! We are now experiencing the "iPhone moment" of AI, as ChatGPT has focused the world's attention on the transformative technology. Generative AI will have a monumental impact—probably more than the iPhone or the internet.

My role, as Nvidia CTO, is to architect across the wealth of Nvidia technologies to build the AI factories of the future. We are building an accelerated computing platform for data processing in the 21st century. AI-based computing will be accessed as a cloud service from everywhere—in data centers, edge devices, enterprise, mobile devices, etc. AI and LLMs [large language models] will emerge as mainstream computing platforms very shortly.

EETE: Do you remember being inspired by the person Grace Hopper? Can you say a few words about her and what she means to computing?

Kagan: Grace Hopper was a hugely impressive woman. Creator of the first compiler, she was a trailblazer in computer programming. She even popularized the term "bug" for malfunctioning software. To honor her contributions to programming and software development, we named our GH200 Grace Hopper Superchip after her.

EETE: What are the main features of the Grace Hopper family of chips, and how does it break from traditional computing?

Kagan: The Nvidia GH200 Grace Hopper Superchip brings together the groundbreaking performance of the Nvidia Hopper GPU with the power-efficient, high-performance Nvidia Grace CPU, connected with the high-bandwidth, memory-coherent NVLink Chip-2-Chip [C2C] interconnect. This delivers up to 900 GB/s of total bandwidth—7× higher than the standard PCIe Gen5 lanes commonly used in accelerated systems—and NVLink-C2C delivers it at 5× lower power. GH200 is ideal for the most demanding generative AI and HPC applications.

EETE: What applications are most appropriate for Grace Hopper? Climate modeling? Large language models?

Kagan: Customers need a versatile system to handle the largest AI models and realize the full potential of their infrastructure. GH200 is built to handle the most complex generative AI and accelerated computing workloads spanning large language models, recommender systems, vector databases and high-performance computing. ■
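Kagan's point that LINPACK-style workloads boil down to dense linear algebra can be made concrete with a short sketch (an illustrative example, not from the interview): a matrix-matrix multiply of two n×n matrices costs roughly 2n³ floating-point operations, and timing one multiply yields a rough throughput figure—exactly the kind of operation a display-less GPU accelerates.

```python
import time

import numpy as np

# LINPACK-style workloads are dominated by dense linear algebra.
# Multiplying two n x n matrices costs about 2*n^3 floating-point
# operations; timing it gives a rough GFLOP/s estimate.
n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2 * n**3  # multiply-add count for a dense matrix multiply
print(f"~{flops / elapsed / 1e9:.1f} GFLOP/s for a {n}x{n} multiply")
```

The same 2n³ scaling is why accelerators pay off: doubling the matrix size multiplies the arithmetic by eight while the data only grows fourfold.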
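The 900-GB/s and 7× figures Kagan cites for NVLink-C2C can be sanity-checked with back-of-the-envelope arithmetic (a sketch using commonly quoted PCIe Gen5 parameters, which are assumptions here, not figures from the interview): a Gen5 x16 link signals at 32 GT/s per lane with 128b/130b encoding, giving about 63 GB/s per direction, or roughly 126 GB/s bidirectional—and 900 / 126 ≈ 7.

```python
# Back-of-the-envelope check of the NVLink-C2C vs. PCIe Gen5 comparison.
# Assumed PCIe Gen5 parameters (commonly quoted, not from the interview):
GT_PER_S = 32            # raw signaling rate per lane, in GT/s
LANES = 16               # a full x16 link
ENCODING = 128 / 130     # 128b/130b line-encoding efficiency

per_direction_gbs = GT_PER_S * LANES * ENCODING / 8  # GB/s, one direction
bidirectional_gbs = 2 * per_direction_gbs            # GB/s, both directions

nvlink_c2c_gbs = 900     # total bandwidth cited for NVLink-C2C
print(f"PCIe Gen5 x16: ~{bidirectional_gbs:.0f} GB/s bidirectional")
print(f"NVLink-C2C advantage: ~{nvlink_c2c_gbs / bidirectional_gbs:.1f}x")
```

Under these assumptions the ratio lands at about 7.1×, consistent with the 7× claim in the interview.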
