            SPECIAL REPORT: EMBEDDED VISION
Will Machines Ever Fully Understand What They Are Seeing?


           By Sally Ward-Foxton

           Attention-based networks have revolutionized
           natural-language processing. They could do the same
           for embedded vision, says Perceive’s CEO.



Embedded-vision technologies are giving machines the power of sight, but today’s systems still fall short of understanding all the nuances of an image. An approach used for natural-language processing (NLP) could address that.

Attention-based neural networks, particularly transformer networks, have revolutionized NLP, giving machines a better understanding of language than ever before. This technique, which is designed to mimic cognitive processes by giving an artificial neural network an idea of history or context, has produced much more sophisticated AI agents than older approaches that also employ memory, such as long short-term memory and recurrent neural networks. NLP now has a deeper level of understanding of the questions or prompts it is fed and can create long pieces of text in response that are often indistinguishable from what a human might write.

Attention can certainly be applied to image processing, though its use in computer vision has been limited so far. In an exclusive interview with EE Times, AI expert Steve Teig, CEO of Perceive, argued that attention will come to be extremely important to vision applications.

Perceive’s Steve Teig

ATTENTION-BASED NETWORKS
The attention mechanism looks at an input sequence, such as a sentence, and decides after each piece of data in the sequence (syllable or word) which other parts of the sequence are relevant. This is similar to how you are reading this article: Your brain is holding certain words in your memory even as it focuses on each new word you’re reading, because the words you’ve already read, combined with the word you’re reading right now, lend valuable context that helps you understand the text.

Teig’s example is:
The car skidded on the street because it was slippery.

As you finish reading the sentence, you understand that “slippery” likely refers to the street and not the car, because you’ve held the words “street” and “car” in memory, and your experience tells you that the relevance connection between “slippery” and “street” is much stronger than the relevance connection between “slippery” and “car.” A neural network can try to mimic this ability using the attention mechanism.

The mechanism “takes all the words in the recent past and compares them in some fashion as a way of seeing which words might possibly relate to which other words,” said Teig. “Then the network knows to at least focus on that, because it’s more likely for ‘slippery’ to be [relevant to] either the street or the car and not [any of the other words].”

Attention is therefore a way of reducing the sequence of presented data to a subset that might possibly be of interest (perhaps only the current and previous sentences) and then assigning a probability to how relevant each word is likely to be.

“[Attention] ended up being a way of making use of time, in a somewhat principled way, without the overhead of looking at everything that ever happened,” Teig said. “This caused people, even until very recently, to think that attention is a trick with which one can manage time. Certainly, it has had a tremendously positive impact on speech processing, language processing, and other temporal things. Much more recently, just in the last handful of months, people have started to realize that maybe we can use attention to do other focusing of information.”
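To make the mechanism concrete, here is a minimal sketch of plain scaled dot-product attention in Python with NumPy. The word list comes from Teig’s example sentence, but the 16-dimensional embeddings are random stand-ins rather than anything Perceive or the transformer researchers use, so the printed weights illustrate the shape of the computation rather than real relevance scores.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Score every query against every key, turn the scores into
    weights with a softmax, and return a weighted blend of the values."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)        # pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ values, weights

# Toy embeddings (random stand-ins) for the words in Teig's example sentence.
words = ["the", "car", "skidded", "on", "the", "street",
         "because", "it", "was", "slippery"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(words), 16))

# Self-attention: each word asks every other word how relevant it is.
_, weights = scaled_dot_product_attention(embeddings, embeddings, embeddings)

# The row for "slippery" holds its relevance weights; in a trained network,
# the largest weights would land on "street" and "car".
print(list(zip(words, np.round(weights[words.index("slippery")], 2))))
```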
Perceive CEO Steve Teig spoke twice at the Embedded Vision Summit. In “Facing up to Bias,” he discussed sources of discrimination in AI systems, and in “TinyML Isn’t Thinking Big Enough,” he challenged the notions that TinyML models must compromise on accuracy and that they should run on CPUs or MCUs. embeddedvisionsummit.com

VISION TRANSFORMER
Neural networks designed for vision have made very limited use of attention techniques so far. Until now, attention has been applied alongside convolutional neural networks (CNNs) or used to replace certain components of a CNN. But a recent paper by Google scientists¹ argues that the concept of attention is more widely applicable to vision. The authors show that a pure transformer network, a type of network widely used in NLP that relies on the attention mechanism, can perform well on image-classification tasks when applied directly to a sequence of image patches. The transformer network built by the researchers, Vision Transformer (ViT), achieved results superior to CNNs while requiring fewer compute resources to train.
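As a rough illustration of what “applied directly to a sequence of image patches” means, the sketch below turns an image into the kind of token sequence a transformer-style network would attend over. The 16-pixel patch size, the 64-dimensional embedding, and the random projection standing in for ViT’s learned linear embedding are assumptions for illustration only; positional encodings and the class token are omitted.

```python
import numpy as np

def image_to_patch_sequence(image, patch_size=16, embed_dim=64, seed=0):
    """Cut an image into non-overlapping patches, flatten each one,
    and project it to an embedding -- the token sequence a ViT-style
    transformer attends over (positional encodings omitted)."""
    h, w, c = image.shape
    patches = []
    for y in range(0, h - h % patch_size, patch_size):
        for x in range(0, w - w % patch_size, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size, :]
            patches.append(patch.reshape(-1))      # flatten to a vector
    patches = np.stack(patches)                    # (num_patches, P*P*C)
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(patches.shape[1], embed_dim))
    return patches @ projection                    # (num_patches, embed_dim)

# A 224x224 RGB image yields a sequence of 14 x 14 = 196 patch "words."
image = np.zeros((224, 224, 3))
tokens = image_to_patch_sequence(image)
print(tokens.shape)   # (196, 64)
```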
While it may be easy to imagine how attention applies to text or spoken dialogue, applying the same concept to a still image (rather than a temporal sequence such as a video) is less obvious. In fact, attention can be used in the spatial, rather than the temporal, context here: Syllables or words are analogous to patches of the image.

Teig’s example is a photo of a dog. The patch of the image that shows the dog’s ear might identify itself as an ear, even as a particular type of ear that is found on a furry animal, or a quadruped. Similarly, the tail patch knows it is also found on furry animals and quadrupeds. A tree patch in the background of the image knows that it has branches and leaves. The attention mechanism asks the ear patch and the tree patch what they have in common. The answer is: not a lot. The ear patch and the tail patch, however, do have a lot in common; they can confer about those commonalities, and then maybe the neural network can find a larger concept than “ear” or “tail.” Maybe the network can understand some of the context provided by the image to work out that ear plus tail might equal dog.
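A toy sketch of that idea: the patch descriptors below are invented by hand purely to illustrate the analogy, not produced by any real network, yet attention computed over them already pulls the ear and tail patches toward each other and away from the tree.

```python
import numpy as np

# Invented toy descriptors for three patches; the features
# [furry, quadruped-part, leafy, branching] exist only for illustration.
patches = {
    "ear":  np.array([1.8, 1.6, 0.0, 0.0]),
    "tail": np.array([1.6, 1.8, 0.2, 0.0]),
    "tree": np.array([0.0, 0.2, 1.8, 1.8]),
}

names = list(patches)
x = np.stack([patches[n] for n in names])

scores = x @ x.T / np.sqrt(x.shape[1])     # what does each patch share with the others?
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

for i, name in enumerate(names):
    row = {other: round(float(weights[i, j]), 2) for j, other in enumerate(names)}
    print(name, row)
# The "ear" row puts almost half its weight on "tail" and very little on
# "tree" -- the shared context a network could grow into a "dog" concept.
```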
“The fact that the ear and the tail of the dog are not independent allows us to have a terser description of what’s going on in
