Attention Attention Movie

Faustina Bartsch

Aug 5, 2024, 12:00:33 AM

to eradizbo

The seq2seq model was born in the field of language modeling (Sutskever et al., 2014). Broadly speaking, it aims to transform an input sequence (source) into a new one (target), and both sequences can be of arbitrary length. Examples of transformation tasks include machine translation between multiple languages in either text or audio, question-answer dialog generation, or even parsing sentences into grammar trees.

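A seq2seq model typically uses an encoder to compress the whole input sequence into a single fixed-length context vector, from which a decoder generates the output. A critical and apparent disadvantage of this fixed-length context vector design is its inability to remember long sentences: the model has often forgotten the first part by the time it finishes processing the whole input. The attention mechanism was born (Bahdanau et al., 2015) to resolve this problem.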

With the help of attention, the dependencies between source and target sequences are no longer restricted by the in-between distance! Given the big improvement attention brought to machine translation, it was soon extended to computer vision (Xu et al., 2015), and people started exploring various other forms of attention mechanisms (Luong et al., 2015; Britz et al., 2017; Vaswani et al., 2017).

Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. It has been shown to be very useful in machine reading, abstractive summarization, or image description generation.

The long short-term memory network paper used self-attention to do machine reading. In that paper's example, the self-attention mechanism lets the model learn the correlation between the current word and the previous part of the sentence.

In the "Show, Attend and Tell" paper (Xu et al., 2015), the attention mechanism is applied to images to generate captions. The image is first encoded by a CNN to extract features. Then an LSTM decoder consumes the convolutional features to produce descriptive words one by one, where the weights are learned through attention. The visualization of the attention weights clearly shows which regions of the image the model attends to when it outputs a given word.

Alan Turing proposed a minimalistic model of computation in 1936. It is composed of an infinitely long tape and a head that interacts with the tape. The tape has countless cells on it, each filled with a symbol: 0, 1 or blank (" "). The operation head can read symbols, edit symbols and move left or right on the tape. Theoretically a Turing machine can simulate any computer algorithm, irrespective of how complex or expensive the procedure might be. The infinite memory gives a Turing machine the edge of being mathematically limitless. However, infinite memory is not feasible in real modern computers, so we only consider the Turing machine as a mathematical model of computation.
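To make the tape-and-head description concrete, here is a minimal sketch of a one-tape Turing machine simulator. The rule format, the state names ("start", "halt") and the bit-flipping program are illustrative assumptions, not part of the original text, and a dictionary with a blank default stands in for the infinite tape.

```python
from collections import defaultdict

BLANK = " "


def run_turing_machine(rules, tape, state="start", max_steps=1000):
    """Run a one-tape Turing machine and return the final tape as a string.

    rules maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is -1 (left) or +1 (right). The machine stops in state "halt".
    """
    # A defaultdict stands in for the infinite tape: unseen cells read as blank.
    cells = defaultdict(lambda: BLANK, enumerate(tape))
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        write, move, state = rules[(state, cells[head])]
        cells[head] = write   # edit the symbol under the head
        head += move          # move the head left or right
    used = sorted(cells)
    return "".join(cells[i] for i in range(used[0], used[-1] + 1)).strip()


# Hypothetical example program: flip every bit, then halt at the first blank.
FLIP_BITS = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
    ("start", BLANK): (BLANK, +1, "halt"),
}

result = run_turing_machine(FLIP_BITS, "1011")
```

The `max_steps` cap is a practical guard only; a real Turing machine has no such limit.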

A Neural Turing Machine (NTM; Graves et al., 2014) contains two major components, a controller neural network and a memory bank.

Controller: in charge of executing operations on the memory. It can be any type of neural network, feed-forward or recurrent.

Memory: stores processed information. It is a matrix of size $N \times M$, containing $N$ row vectors, each with $M$ dimensions.

Content-based addressing creates attention vectors based on the similarity between the key vector $\mathbf{k}_t$, extracted by the controller from the input, and the memory rows. The content-based attention scores are computed as cosine similarity and then normalized by softmax. In addition, NTM adds a strength multiplier $\beta_t$ to amplify or attenuate the focus of the distribution.
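The content-based step can be sketched in a few lines of NumPy. This is a rough illustration under assumed names (`content_addressing`, `memory`, `key`, `beta`) and toy values, not the reference implementation: cosine similarity between the key and each memory row, scaled by the strength multiplier, then passed through softmax.

```python
import numpy as np


def content_addressing(memory, key, beta):
    """Attention weights over memory rows from cosine similarity with a key.

    memory: (N, M) array, key: (M,) array, beta: positive scalar strength.
    """
    eps = 1e-8  # guards against division by zero for all-zero rows
    similarity = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )
    scores = beta * similarity
    scores -= scores.max()  # stabilise the softmax numerically
    weights = np.exp(scores)
    return weights / weights.sum()


# Toy memory with three rows; the key matches the first row exactly.
memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
key = np.array([1.0, 0.0])
soft = content_addressing(memory, key, beta=1.0)
sharp = content_addressing(memory, key, beta=20.0)
```

With a larger `beta` the distribution concentrates on the best-matching row, which is the "amplify" case; a `beta` below 1 would flatten it instead.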

Location-based addressing sums up the values at different positions in the attention vector, weighted by a distribution over allowable integer shifts. It is equivalent to a 1-d convolution with a kernel $\mathbf{s}_t(\cdot)$, a function of the position offset. There are multiple ways to define this distribution. See Fig. 11 for inspiration.
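One way to picture the shift is as a circular convolution. The sketch below assumes the allowable offsets are -1, 0 and +1 and that positions wrap around the memory; the function and variable names are illustrative.

```python
import numpy as np


def shift_attention(weights, shift_dist, offsets=(-1, 0, 1)):
    """Circularly convolve attention weights with a distribution over shifts.

    shift_dist[k] is the probability of moving the focus by offsets[k] cells.
    """
    shifted = np.zeros_like(weights, dtype=float)
    for prob, offset in zip(shift_dist, offsets):
        # np.roll moves the mass at position j to position (j + offset) mod N.
        shifted += prob * np.roll(weights, offset)
    return shifted


w = np.array([0.0, 1.0, 0.0, 0.0])
moved = shift_attention(w, shift_dist=[0.0, 0.0, 1.0])    # all mass on offset +1
blurred = shift_attention(w, shift_dist=[0.1, 0.8, 0.1])  # mostly stays in place
```

A one-hot shift distribution moves the focus cleanly by one cell, while a spread-out distribution blurs the focus over neighbouring cells.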

The complete process of generating the attention vector $\mathbf{w}_t$ at time step $t$ is illustrated in Fig. 12. All the parameters produced by the controller are unique for each head. If there are multiple read and write heads in parallel, the controller outputs multiple sets.

In problems like sorting or travelling salesman, both input and output are sequential data. Unfortunately, they cannot be easily solved by classic seq2seq or NMT models, because the discrete categories of output elements are not determined in advance but depend on the variable input size. The Pointer Net (Ptr-Net; Vinyals et al., 2015) was proposed to resolve this type of problem, in which the output elements correspond to positions in an input sequence. Rather than using attention to blend hidden units of an encoder into a context vector (see Fig. 8), the Pointer Net applies attention over the input elements to pick one as the output at each decoder step.

The attention mechanism is simplified, as Ptr-Net does not blend the encoder states into the output with attention weights. In this way, the output only responds to the positions but not the input content.
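A single Ptr-Net decoding step might look like the following sketch. It assumes an additive scoring function with hypothetical parameters `W1`, `W2` and `v`, filled with random values here purely for illustration; the point is that the softmax over input positions is itself the output, and no blended context vector is computed.

```python
import numpy as np


def pointer_step(encoder_states, decoder_state, W1, W2, v):
    """One Ptr-Net decoding step: a distribution over input positions.

    encoder_states: (n, h), decoder_state: (h,), W1 and W2: (h, h), v: (h,).
    """
    # Additive attention score for every input position.
    scores = np.tanh(encoder_states @ W1.T + decoder_state @ W2.T) @ v
    scores -= scores.max()  # stabilise the softmax numerically
    probs = np.exp(scores) / np.exp(scores).sum()
    # The softmax itself is the output; no context vector is built from it.
    return probs


rng = np.random.default_rng(0)
n, h = 5, 4  # illustrative sizes: five input elements, hidden size four
encoder_states = rng.normal(size=(n, h))
decoder_state = rng.normal(size=h)
W1, W2, v = rng.normal(size=(h, h)), rng.normal(size=(h, h)), rng.normal(size=h)

probs = pointer_step(encoder_states, decoder_state, W1, W2, v)
chosen_position = int(probs.argmax())  # index into the input sequence
```

Because the distribution has one entry per input element, the output vocabulary grows and shrinks with the input length automatically.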

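The transformer adopts scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot product of the query with all the keys, scaled by the square root of the key dimension $d_k$ and normalized by softmax:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$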
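As a minimal sketch of scaled dot-product attention for a single head, assuming NumPy and randomly generated toy matrices (the shapes and variable names below are illustrative, and masking and batching are left out):

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for queries Q (m, d), keys K (n, d), values V (n, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (m, n) query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                   # weighted sum of the values


rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))  # two queries
K = rng.normal(size=(3, 8))  # three keys
V = rng.normal(size=(3, 4))  # three values
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a distribution over the three keys, and each row of `output` is the corresponding mixture of value vectors.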

The transformer has no recurrent or convolutional structure. Even with the positional encoding added to the embedding vector, the sequential order is only weakly incorporated. For problems sensitive to positional dependency, such as reinforcement learning, this can be a big problem.

The Simple Neural Attention Meta-Learner (SNAIL) (Mishra et al., 2017) was developed partially to resolve the problem with positioning in the transformer model by combining the self-attention mechanism in transformer with temporal convolutions. It has been demonstrated to be good at both supervised learning and reinforcement learning tasks.

SNAIL was born in the field of meta-learning, which is another big topic worthy of a post by itself. In simple words, a meta-learning model is expected to generalize to novel, unseen tasks drawn from a similar distribution.

Self-Attention GAN (SAGAN; Zhang et al., 2018) adds self-attention layers to a GAN so that both the generator and the discriminator can better model relationships between spatial regions.

The classic DCGAN (Deep Convolutional GAN) represents both discriminator and generator as multi-layer convolutional networks. However, the representation capacity of the network is restrained by the filter size, as the feature of one pixel is limited to a small local region. In order to connect regions far apart, the features have to be diluted through layers of convolutional operations, and the dependencies are not guaranteed to be maintained.

As (soft) self-attention in the vision context is designed to explicitly learn the relationship between one pixel and all other positions, even regions far apart, it can easily capture global dependencies. Hence a GAN equipped with self-attention is expected to handle details better, hooray!

The SAGAN adopts the non-local neural network to apply the attention computation. The convolutional image feature maps $\mathbf{x}$ are branched out into three copies, corresponding to the concepts of key, value, and query in the transformer.

The output of the attention layer is multiplied by a scaling parameter $\gamma$ and added back to the original input feature map. Because $\gamma$ is increased gradually from 0 during training, the network first relies on the cues in local regions and then gradually learns to assign more weight to regions that are further away.
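The following sketch shows how such a layer could be wired up, under several assumptions: the feature map is flattened to shape (channels, positions), the key, query and value branches are plain matrix multiplications by hypothetical weights `W_f`, `W_g` and `W_h` (random here), and `gamma` is passed in as a number rather than learned.

```python
import numpy as np


def sagan_self_attention(x, W_f, W_g, W_h, gamma):
    """Self-attention over spatial positions, blended into x by gamma.

    x: (C, N) feature map with N = height * width flattened positions.
    W_f, W_g: (C_bar, C) key/query projections; W_h: (C, C) value projection.
    """
    f, g, h = W_f @ x, W_g @ x, W_h @ x       # key, query, value branches
    scores = f.T @ g                          # scores[i, j] = f(x_i) . g(x_j)
    scores -= scores.max(axis=0, keepdims=True)
    beta = np.exp(scores)
    beta /= beta.sum(axis=0, keepdims=True)   # each column j sums to 1 over i
    o = h @ beta                              # o[:, j] = sum_i beta[i, j] * h(x_i)
    return x + gamma * o                      # gamma = 0 leaves x unchanged


rng = np.random.default_rng(0)
C, C_bar, N = 8, 2, 16  # illustrative sizes, e.g. a flattened 4x4 feature map
x = rng.normal(size=(C, N))
W_f, W_g = rng.normal(size=(C_bar, C)), rng.normal(size=(C_bar, C))
W_h = rng.normal(size=(C, C))

at_init = sagan_self_attention(x, W_f, W_g, W_h, gamma=0.0)  # equals x
later = sagan_self_attention(x, W_f, W_g, W_h, gamma=0.5)
```

With `gamma=0.0` the layer is an identity map, so the surrounding convolutions carry all the signal at the start of training; as `gamma` grows, the attention output contributes more.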

Figure 1. Attention-related brain networks. (A) Orienting attention networks include the DAN and VAN. The DAN is involved in top-down control of attention, while the VAN is responsible for bottom-up attention. When external salient stimuli are detected, the VAN will interrupt the processes of the DAN through the TPJ. (B) Executive attention networks include the CON and FPN. The CON provides a stable maintenance of attention performance, and the FPN is responsible for task initiation and switching. (C) The DMN is involved in self-reflective mental activity and is typically less active during externally oriented tasks, such as tasks that require attentional control. (D) The SN is involved in detecting salient stimuli (including interoceptive stimuli) and facilitates the switch between the FPN and the DMN. aINS, anterior insula; AMY, amygdala; aPFC, anterior prefrontal cortex; CON, cingulo-opercular network; dACC, dorsal anterior cingulate cortex; DAN, dorsal attention network; dFC, dorsal frontal cortex; dlPFC, dorsolateral prefrontal cortex; DMN, default mode network; dmPFC, dorsomedial prefrontal cortex; FEF, frontal eye fields; FPN, frontoparietal network; IFG, inferior frontal gyrus; IPL, inferior parietal lobule; IPS, intraparietal sulcus; MFG, middle frontal gyrus; PCC, posterior cingulate cortex; SN, salience network; TPJ, temporoparietal junction; VAN, ventral attention network; vmPFC, ventromedial prefrontal cortex; VS, ventral striatum; VTA, ventral tegmental area. Straight lines represent connectivity between structures in cases where there is not a definitive directional relationship. Network nodes derived from the Gordon network parcellation scheme (Gordon et al., 2016).

Copyright 2023 Ely, Zundel, Gowatch, Evanski, Bhogal, Carpenter, Shampine and Marusak. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The answer is yes, but also no. Because the best, truest material will always come from deep attention, which must not only begin but also remain long enough in the world around us. Again, our attention must remain long enough on something outside of ourselves for us to see through that thing like a window, as the poet Marie Howe compels us to do. When we leap too quickly into the realm of story, we effectively remove our attention from the thing itself and instead wander inward toward our own thoughts, ideas, interpretations, desires, judgments, etc.