
Will Machines Ever Fully Understand What They Are Seeing?
Attention could help AI agents better understand what is happening in an image by weighing the relevance of image patches to one another to infer context. (Image: Shutterstock)
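To make the mechanism concrete, here is a minimal NumPy sketch of single-head self-attention over image patches, in the spirit of the vision-transformer paper cited in the reference at the end of this article. It is not Google's or Perceive's implementation; the patch size, embedding width, and random weights are illustrative assumptions. The point to notice is the N × N score matrix, the "gigantic matrix" of pairwise comparisons discussed below.

```python
import numpy as np

def split_into_patches(image, patch_size=16):
    """Flatten an (H, W, C) image into N = (H/patch)*(W/patch) patch vectors."""
    h, w, c = image.shape
    return (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch_size * patch_size * c)
    )

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over N patch embeddings."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # N x N relevance matrix: the quadratic cost
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: how much each patch attends to every other
    return weights @ v                               # context-aware patch representations

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                    # dummy 224 x 224 RGB image (assumption)
patches = split_into_patches(image)                  # N = (224/16)^2 = 196 patches
d_model, d_head = patches.shape[1], 64
w_q, w_k, w_v = (0.02 * rng.standard_normal((d_model, d_head)) for _ in range(3))
out = self_attention(patches, w_q, w_k, w_v)
print(patches.shape, out.shape)                      # (196, 768) -> (196, 64)
```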

the picture: ‘There is a dog in the picture,’ as opposed to, ‘There’s a brown pixel next to a grey pixel, next to …’ which is a terrible description of what’s going on in the picture,” said Teig. “This is what becomes possible as the system describes the pieces of the image in these semantic terms, so to speak. It can then aggregate those into more useful concepts for downstream reasoning.”

The eventual aim, Teig said, would be for the neural network to understand that the picture is a dog chasing a Frisbee.

“Good luck doing that with 16 million colors of pixels,” he said. “This is an attempt to process that down to, ‘There’s a dog; there’s a Frisbee; the dog is running.’ Now I have a fighting chance at understanding that maybe the dog is playing Frisbee.”

A STEP CLOSER
Google’s work on attention in vision systems is a step in the right direction, Teig said, “but I think there’s a lot of room to advance here, both from a theory and software point of view and from a hardware point of view, when one doesn’t have to bludgeon the data with gigantic matrices, which I very much doubt your brain is doing. There’s so much that can be filtered out in context without having to compare it to everything else.”

While the Google research team’s solution used compute resources more sparingly than CNNs do, the way attention is typically implemented in NLP makes networks like transformers extremely resource-intensive. Transformers often build gigantic N × N matrices of syllables (for text) or pixels (for images) that require substantial compute power and memory to process.

“The data center guys out there think, ‘Excellent — we have a data center, so everything looks like a nail to us,’” said Teig, and that’s how we’ve ended up with NLP models like OpenAI’s GPT-3, with its 175 billion parameters. “It’s kind of ridiculous that you’re looking at everything when, a priori, you can say that almost nothing in the prior sentence is going to matter. Can’t you do any kind of filtering in advance? Do you really have to do this crudely just because you have a gigantic matrix multiplier? Does that make any sense? Probably not.”

Recent attempts by the scientific community to reduce the computational overhead for attention have cut the number of operations required from N² to N√N. But those attempts perpetuate “the near-universal belief — one I do not share — that deep learning is all about matrices and matrix multiplication,” Teig said, pointing out that the most advanced neural network research is being done by those with access to massive matrix multiplication accelerators.

Teig’s perspective as CEO of Perceive, an edge-AI accelerator chip company, is that there are more efficient ways of conceptualizing neural network computation. Perceive is already using some of these concepts, and Teig thinks similar insights will apply to the attention mechanism and transformer networks.

“I think the spirit of what attention is talking about is very important,” he said. “I think the machinery itself is going to evolve very quickly over the next couple of years … in software, in theory, and in hardware to represent it.”

Is there an eventual point where today’s huge transformer networks will fit onto an accelerator in an edge device? In Teig’s view, parameter counts like GPT-3’s 175 billion — roughly 1 trillion bits of information (assuming 8-bit parameters, for the sake of argument) — are part of the problem.

“It’s like we’re playing 20 questions, only I’m going to ask you a trillion questions in order to understand what you’ve just said,” he said. “Maybe it can’t be done in 20,000 or 2 million, but a trillion — get out of here! The flaw isn’t that we have a small 20-mW chip; the flaw there is that [having] 175 billion parameters means you did something really wrong.”

Reducing attention-based networks’ parameter count, and representing them efficiently, could bring attention-based embedded vision to edge devices, according to Teig. And such developments are “not far away.” ■

REFERENCE
1. A. Dosovitskiy et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” Preprint, October 2020. arxiv.org/pdf/2010.11929

Sally Ward-Foxton is editor-in-chief of EE Times Weekend.
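As a rough back-of-envelope check of the figures quoted above: the patch count below is an illustrative assumption, while the 175-billion-parameter figure and the 8-bits-per-parameter simplification come from the article.

```python
# Cost of full attention over N patches or tokens: an N x N score matrix,
# versus the roughly N * sqrt(N) operations of the reduced schemes mentioned above.
N = 196                          # e.g. a 224x224 image cut into 16x16 patches (assumption)
full_attention = N ** 2          # 38,416 pairwise comparisons
reduced = round(N * N ** 0.5)    # about 2,744 with an N*sqrt(N) scheme

# GPT-3-scale parameter count expressed in bits, at 8 bits per parameter.
params = 175_000_000_000
bits = params * 8                # 1.4e12, i.e. on the order of a trillion bits
print(full_attention, reduced, bits)
```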
