Will Machines Ever Fully Understand What They Are Seeing?
Attention could help AI agents better understand what is happening in images by capturing the relevance between patches of an image to infer context.
the picture: ‘There is a dog in the picture,’ as opposed to, ‘There’s a brown pixel next to a grey pixel, next to …,’ which is a terrible description of what’s going on in the picture,” said Teig. “This is what becomes possible as the system describes the pieces of the image in these semantic terms, so to speak. It can then aggregate those into more useful concepts for downstream reasoning.”
The eventual aim, Teig said, would be for the neural network to understand that the picture shows a dog chasing a Frisbee.
“Good luck doing that with 16 million colors of pixels,” he said. “This is an attempt to process that down to, ‘There’s a dog; there’s a Frisbee; the dog is running.’ Now I have a fighting chance at understanding that maybe the dog is playing Frisbee.”
A STEP CLOSER
Google’s work on attention in vision systems [1] is a step in the right direction, Teig said, “but I think there’s a lot of room to advance here, both from a theory and software point of view and from a hardware point of view, when one doesn’t have to bludgeon the data with gigantic matrices, which I very much doubt your brain is doing. There’s so much that can be filtered out in context without having to compare it to everything else.”

While the Google research team’s solution used compute resources more sparingly than CNNs do, the way attention is typically implemented in NLP makes networks like transformers extremely resource-intensive. Transformers often build gigantic N × N matrices of syllables (for text) or pixels (for images) that require substantial compute power and memory to process.
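To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head dot-product attention over image patches. The patch count, the embedding width, and the absence of learned query/key/value projections are illustrative simplifications, not a description of Google’s or anyone else’s production implementation.

```python
import numpy as np

def self_attention(x):
    """Single-head dot-product attention without learned projections:
    every patch is scored against every other patch."""
    n, d = x.shape                                    # n patches, d features each
    scores = x @ x.T / np.sqrt(d)                     # the N x N matrix in question
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x                                # context-weighted mix per patch

# A 224 x 224 image cut into 16 x 16 patches, as in the paper cited below [1],
# gives n = (224 // 16) ** 2 = 196 patches and a 196 x 196 score matrix;
# doubling the resolution quadruples n and grows the matrix 16-fold.
patches = np.random.rand(196, 64).astype(np.float32)  # 64-dim embeddings (illustrative)
print(self_attention(patches).shape)                   # (196, 64)
```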
“The data center guys out there think, ‘Excellent — we have a data center, so everything looks like a nail to us,’” said Teig, and that’s how we’ve ended up with NLP models like OpenAI’s GPT-3, with its 175 billion parameters. “It’s kind of ridiculous that you’re looking at everything when, a priori, you can say that almost nothing in the prior sentence is going to matter. Can’t you do any kind of filtering in advance? Do you really have to do this crudely just because you have a gigantic matrix multiplier? Does that make any sense? Probably not.”

Recent attempts by the scientific community to reduce the computational overhead of attention have cut the number of operations required from N² to N√N. But those attempts perpetuate “the near-universal belief — one I do not share — that deep learning is all about matrices and matrix multiplication,” Teig said, pointing out that the most advanced neural-network research is being done by those with access to massive matrix multiplication accelerators.
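The article does not name the techniques behind that N√N figure, but block-local attention is one hedged illustration of how such savings can arise: each token is compared only with the roughly √N tokens in its own block, so total work falls from N² toward N√N. (Published variants typically add a few global or summary tokens to recover long-range context.)

```python
import numpy as np

def block_local_attention(x, block):
    """Attend only within fixed blocks of `block` tokens: about n * block
    comparisons rather than n * n. With block ~ sqrt(n), work scales as n * sqrt(n)."""
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, block):
        xb = x[start:start + block]                   # one block of patches
        scores = xb @ xb.T / np.sqrt(d)               # block x block, never N x N
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
        out[start:start + block] = w @ xb
    return out

n = 1024                                              # e.g., 1,024 patches
x = np.random.rand(n, 64).astype(np.float32)
print(block_local_attention(x, block=int(np.sqrt(n))).shape)  # (1024, 64)
```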
Teig’s perspective as CEO of Perceive, an edge-AI accelerator chip company, is that there are more efficient ways of conceptualizing neural network computation. Perceive is already using some of these concepts, and Teig thinks similar insights will apply to the attention mechanism and transformer networks.

“I think the spirit of what attention is talking about is very important,” he said. “I think the machinery itself is going to evolve very quickly over the next couple of years … in software, in theory, and in hardware to represent it.”

Is there an eventual point where today’s huge transformer networks will fit onto an accelerator in an edge device? In Teig’s view, the sheer size of networks like GPT-3 — 175 billion parameters, or roughly 1.4 trillion bits of information (assuming 8-bit parameters, for the sake of argument) — is part of the problem.
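The back-of-envelope arithmetic behind that figure, using the article’s own 8-bit assumption:

```python
params = 175e9           # GPT-3's parameter count
bits_per_param = 8       # the "for the sake of argument" assumption above
print(f"{params * bits_per_param:.2e} bits")  # 1.40e+12 -- about 1.4 trillion bits
```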
“It’s like we’re playing 20 questions, only I’m going to ask you a trillion questions in order to understand what you’ve just said,” he said. “Maybe it can’t be done in 20,000 or 2 million, but a trillion — get out of here! The flaw isn’t that we have a small 20-mW chip; the flaw there is that [having] 175 billion parameters means you did something really wrong.”

Reducing attention-based networks’ parameter count, and representing those parameters efficiently, could bring attention-based embedded vision to edge devices, according to Teig. And such developments are “not far away.” ■

REFERENCE
1. A. Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” preprint, October 2020. arxiv.org/pdf/2010.11929

Sally Ward-Foxton is editor-in-chief of EE Times Weekend.