OPINION | GREENER ELECTRONICS | PROCESSING

The Mind-Boggling Cost of Generative AI Ownership

By Lauro Rizzatti, Vsora
Advancements in large language models (LLMs)—software algorithms driven by transformers—have not been matched by similar progress in the computing hardware tasked with executing them.

ChatGPT-4's LLM, for example, exceeds 1 trillion parameters, posing a challenge for current storage capabilities and performance requirements. Memory storage is already reaching hundreds of gigabytes. Processing throughput needs multiple petaOPS (1 petaOPS = 10^15 operations per second) to deliver query responses in an acceptable timeframe, typically less than a couple of seconds.
While model training and inference share performance requirements, they differ on four other characteristics: memory, latency, power consumption and cost (Table 1).
Attributes                       Training        Inference
Memory                           FP32 or FP64    FP8
High Throughput                  Critical        Critical
Low Latency                      Important       Critical
Constrained Power Consumption    Important       Critical
Affordable Cost                  Important       Critical

Table 1: Algorithm training and inference share some but not all critical attributes. (Source: Vsora)

Model training and inference today are carried out on extensive computing farms. Each job runs for a long time and consumes a sizable amount of electric power, producing copious heat at mind-boggling cost. Nonetheless, the farms deliver what is expected of them.
Training a GPT-4 model on FP32 or FP64 arithmetic may require more than 1 trillion bits stored on the fastest versions of high-bandwidth memory (HBM) DRAM. The performance necessary to train such a massive model calls for tens of petaOPS running for weeks—an annoyance, but not a roadblock. To accomplish the job, computing farms consume megawatts, with a total cost of ownership in the hundreds of billions of dollars. Again, not a perfect scenario, but a working solution.
Vis-à-vis model training, model inference is usually performed on FP8 arithmetic, which still produces large amounts of data (in the hundreds of billions of bits), and must deliver a query response with a latency of no more than a couple of seconds to keep the user's attention and acceptance. Furthermore, considering that a vast potential market for inference encompasses mobile applications at the edge, a viable solution must provide throughput of more than 1 petaOPS with implementation efficiency exceeding 50%.
To ensure mobility, the solution must minimize energy consumption, possibly to less than 50 W/petaOPS, at an acquisition/deployment cost in the ballpark of a few hundred dollars.

These are lofty specifications for feasible inference scenarios running on edge devices.
The crux of the matter centers on the memory bottleneck (also called the memory wall), which increases latency, with a deleterious impact on implementation efficiency, energy consumption and cost.

MEMORY WALL’S IMPLICATIONS FOR GENERATIVE AI

Moving terabytes of data at high speed between memory and computing elements requires data-transfer bandwidths of terabytes/second, which is hardly practicable. If the processor does not receive data on time, it sits idle, decreasing its efficiency. As recently reported, the efficiency of running ChatGPT-4 on leading-edge hardware dives to 3% or less. A generative AI (GenAI) accelerator with 1-petaOPS nominal performance but an actual 3% efficiency delivers a meager 30 teraOPS. Basically, a very costly processor designed to run these algorithms remains inactive 97% of the time.
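For illustration, here is a minimal Python sketch of that arithmetic, using only the figures quoted above:

# Effective throughput of a GenAI accelerator throttled by the memory wall,
# using the article's figures: 1 petaOPS nominal, ~3% realized efficiency.

nominal_ops = 1e15    # 1 petaOPS nominal performance
efficiency = 0.03     # ~3% realized efficiency on ChatGPT-4 workloads

effective_ops = nominal_ops * efficiency
print(f"Effective throughput: {effective_ops / 1e12:.0f} teraOPS")  # 30 teraOPS
print(f"Idle fraction: {1 - efficiency:.0%}")                       # 97%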
To compensate for the low efficiency in processing model training and inference in data centers, cloud providers add more hardware to perform the same task. The approach escalates the cost and multiplies power consumption.

Obviously, such a method is not applicable for inference at the edge.
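To see why, consider a short sketch, assuming the 3% data-center efficiency reported above against the 50% efficiency floor set earlier for edge inference (the resulting petaOPS figures are illustrative, not from the article):

# Nominal compute needed to deliver 1 petaOPS of effective inference
# throughput at the two efficiencies quoted in the article.

target_effective_petaops = 1.0   # edge requirement: more than 1 petaOPS delivered
efficiencies = {"50% (edge target)": 0.50, "3% (reported)": 0.03}

for label, eff in efficiencies.items():
    nominal = target_effective_petaops / eff
    print(f"Nominal compute at {label}: {nominal:.0f} petaOPS")

At 3% efficiency, an edge device would need some 33 petaOPS of nominal compute to hit the 1-petaOPS target, far outside a 50-W, few-hundred-dollar envelope.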
ESTIMATED COST ANALYSIS OF GenAI IN DATA CENTERS PROCESSING GPT-4

McKinsey estimated that in 2022, Google search processed 3.3 trillion queries (~100,000 queries/second) at a cost of US$0.002 per query, considered to be the benchmark. The total annual cost amounted to US$6.6 billion. Google does not charge fees for the search service; instead, it covers the cost via advertising revenues—for now.

The same McKinsey analysis stated that the ChatGPT-3 cost per query hovers at around US$0.03, 15× larger than the benchmark. On an annual basis of 100,000 queries/second, the total cost would exceed US$100 billion.
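The arithmetic behind those totals, using the quoted figures (at exactly US$0.03 per query, the ChatGPT-3 total lands at US$99 billion, which the "around US$0.03" estimate pushes past US$100 billion):

# Annual serving cost at Google-search scale, from the McKinsey figures
# quoted above (3.3 trillion queries/year, ~100,000 queries/second).

queries_per_year = 3.3e12

benchmark_cost = 0.002   # US$ per query: Google search, the benchmark
gpt3_cost = 0.03         # US$ per query: ChatGPT-3, ~15x the benchmark

print(f"Google search: US${queries_per_year * benchmark_cost / 1e9:.1f} billion/year")
print(f"ChatGPT-3:     US${queries_per_year * gpt3_cost / 1e9:.1f} billion/year")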
Let's evaluate the implication of these benchmarks on the cost of ownership of a data center supporting ChatGPT-4 based on a best-in-class GenAI accelerator, including purchasing, operating and system maintenance expenses.

The cost per query comprises two contributors: acquisition cost and energy consumption cost.
Estimated hardware acquisition costs
Assumptions (worked through in the sketch below):
  •  Hardware refresh: three years
  •  Purchasing cost of a leading-edge GenAI accelerator, containing eight accelerator chips and delivering a gross compute power of 16 petaOPS at FP8, processing ChatGPT-4 with a 3% efficiency: ~US$500,000 per system
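Here is a minimal sketch of what those assumptions imply per system. Straight-line amortization over the refresh cycle and the cost-per-effective-petaOPS figure are illustrative extrapolations of the stated numbers, not figures from the analysis, and the energy-consumption contributor is left out:

# Implications of the stated acquisition assumptions for a single system.
# Straight-line amortization is an assumption made here for illustration;
# the energy-consumption contributor to cost per query is not included.

system_price_usd = 500_000   # ~US$500,000 per eight-chip accelerator system
refresh_years = 3            # hardware refresh cycle
gross_petaops = 16           # gross FP8 compute per system
efficiency = 0.03            # ~3% realized efficiency on ChatGPT-4

effective_petaops = gross_petaops * efficiency   # compute actually delivered
annual_cost = system_price_usd / refresh_years   # amortized acquisition cost

print(f"Effective compute: {effective_petaops:.2f} petaOPS per system")
print(f"Amortized cost: US${annual_cost:,.0f}/year, "
      f"or US${annual_cost / effective_petaops:,.0f} per effective petaOPS-year")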
