SPECIAL REPORT: EMBEDDED VISION
Will Machines Ever Fully Understand
What They Are Seeing?
By Sally Ward-Foxton
Attention-based networks have revolutionized
natural-language processing. They could do the same
for embedded vision, says Perceive’s CEO.
Embedded-vision technologies are giving machines the power of sight, but today's systems still fall short of understanding all the nuances of an image. An approach used for natural-language processing (NLP) could address that.

Attention-based neural networks, particularly transformer networks, have revolutionized NLP, giving machines a better understanding of language than ever before. This technique, which is designed to mimic cognitive processes by giving an artificial neural network an idea of history or context, has produced much more sophisticated AI agents than older approaches that also employ memory, such as long short-term memory and recurrent neural networks. NLP now has a deeper level of understanding of the questions or prompts it is fed and can create long pieces of text in response that are often indistinguishable from what a human might write.

Attention can certainly be applied to image processing, though its use in computer vision has been limited so far. In an exclusive interview with EE Times, AI expert Steve Teig, CEO of Perceive, argued that attention will come to be extremely important to vision applications.

Perceive's Steve Teig

Perceive CEO Steve Teig spoke twice at the Embedded Vision Summit. In "Facing up to Bias," he discussed sources of discrimination in AI systems, and in "TinyML Isn't Thinking Big Enough," he challenged the notions that TinyML models must compromise on accuracy and that they should run on CPUs or MCUs. embeddedvisionsummit.com

ATTENTION-BASED NETWORKS
The attention mechanism looks at an input sequence, such as a sentence, and decides after each piece of data in the sequence (syllable or word) which other parts of the sequence are relevant. This is similar to how you are reading this article: Your brain is holding certain words in your memory even as it focuses on each new word you're reading, because the words you've already read, combined with the word you're reading right now, lend valuable context that helps you understand the text. Teig's example is:

The car skidded on the street because it was slippery.

As you finish reading the sentence, you understand that "slippery" likely refers to the street and not the car, because you've held the words "street" and "car" in memory, and your experience tells you that the relevance connection between "slippery" and "street" is much stronger than the relevance connection between "slippery" and "car." A neural network can try to mimic this ability using the attention mechanism.

The mechanism "takes all the words in the recent past and compares them in some fashion as a way of seeing which words might possibly relate to which other words," said Teig. "Then the network knows to at least focus on that, because it's more likely for 'slippery' to be [relevant to] either the street or the car and not [any of the other words]."

Attention is therefore a way of focusing: it reduces the sequence of presented data to a subset that might possibly be of interest (perhaps the current and previous sentences only) and then assigns probabilities to how relevant each word is likely to be.

"[Attention] ended up being a way of making use of time, in a somewhat principled way, without the overhead of looking at everything that ever happened," Teig said. "This caused people, even until very recently, to think that attention is a trick with which one can manage time. Certainly, it has had a tremendously positive impact on speech processing, language processing, and other temporal things. Much more recently, just in the last handful of months, people have started to realize that maybe we can use attention to do other focusing of information."
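To make that description concrete, the short sketch below computes standard scaled dot-product attention over a toy version of Teig's example sentence. It is an illustration of the general mechanism, not Perceive's implementation: the word embeddings are random stand-ins, and a trained transformer would compare learned query, key, and value projections of the words rather than the raw embeddings.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Standard scaled dot-product attention over a sequence of embeddings."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ values, weights                  # each word mixed with its relevant context

# Toy embeddings for the words in Teig's example sentence. They are random
# stand-ins here; a real model would learn them during training.
rng = np.random.default_rng(0)
words = ["the", "car", "skidded", "on", "the", "street",
         "because", "it", "was", "slippery"]
embeddings = rng.normal(size=(len(words), 16))

# A transformer would project the embeddings into learned queries, keys, and
# values; to keep the sketch short, the raw embeddings are reused for all three.
mixed, weights = scaled_dot_product_attention(embeddings, embeddings, embeddings)

# The row for "slippery" is its relevance weighting over every word in the
# sentence; after training, most of that weight would land on "street" or "car."
for word, weight in zip(words, weights[words.index("slippery")]):
    print(f"{word:>10s}  {weight:.2f}")
```

In a full transformer, several such attention "heads" run in parallel and their weighted mixtures feed further layers, but the relevance-scoring step is the part Teig is describing.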
VISION TRANSFORMER
Neural networks designed for vision have made very limited use of attention techniques so far. Until now, attention has been applied alongside convolutional neural networks (CNNs) or used to replace certain components of a CNN. But a recent paper by Google scientists argues that the concept of attention is more widely applicable to vision.¹ The authors show that a pure transformer network, a type of network widely used in NLP that relies on the attention mechanism, can perform well on image-classification tasks when applied directly to a sequence of image patches. The transformer network built by the researchers, Vision Transformer (ViT), achieved results superior to CNNs' while requiring fewer compute resources to train.

While it may be easy to imagine how attention applies to text or spoken dialogue, applying the same concept to a still image (rather than a temporal sequence such as a video) is less obvious. In fact, attention can be used in the spatial, rather than the temporal, context here. Syllables or words would be analogous to patches of the image.
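Sketched in the same style, the patch analogy looks like this. The snippet is only an illustration under assumed sizes (a placeholder 64 × 64 image, 16 × 16 patches, an untrained random projection), not the code from the Google paper: it slices the image into patches, turns each flattened patch into a token embedding, and lets the same self-attention computation score every patch against every other.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a set of token embeddings x."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)

# Placeholder 64x64 RGB image; a 16x16 patch size gives a 4x4 grid of
# 16 patches -- the visual counterpart of the words in a sentence.
image = rng.random((64, 64, 3))
patch = 16
patches = np.stack([
    image[r:r + patch, c:c + patch].reshape(-1)      # flatten each patch
    for r in range(0, image.shape[0], patch)
    for c in range(0, image.shape[1], patch)
])                                                   # shape: (16, 768)

# ViT projects each flattened patch to a token embedding (plus a position
# encoding, omitted here); the projection weights below are untrained.
projection = rng.normal(size=(patches.shape[1], 64)) / np.sqrt(patches.shape[1])
tokens = patches @ projection                        # 16 patch tokens

# Every patch scores its relevance to every other patch. In a trained model,
# patches belonging to the same object attend strongly to one another, while
# object and background patches largely ignore each other.
context, weights = self_attention(tokens)
print(weights.shape)                                 # (16, 16) patch-to-patch weights
```

Nothing in the attention step cares whether the tokens started life as words or as image patches, which is the observation the ViT work builds on.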
Teig's example is a photo of a dog. The patch of the image that shows the dog's ear might identify itself as an ear, even as a particular type of ear that is found on a furry animal or a quadruped. Similarly, the tail patch knows it is also found on furry animals and quadrupeds. A tree patch in the background of the image knows that it has branches and leaves. The attention mechanism asks the ear patch and the tree patch what they have in common. The answer is: not a lot. The ear patch and the tail patch, however, do have a lot in common; they can confer about those commonalities, and then maybe the neural network can find a larger concept than "ear" or "tail." Maybe the network can understand some of the context provided by the image to work out that ear plus tail might equal dog.

"The fact that the ear and the tail of the dog are not independent allows us to have a terser description of what's going on in