serelory nantin gathering


Judd Eisenhauer

Aug 2, 2024, 10:57:17 PM
to gatsitoore

Machine learning solutions are used to solve a wide variety of problems, but in nearly all cases the core components are the same. Whether you simply want to understand the skeleton of machine learning solutions better or are embarking on building your own, understanding these components - and how they interact - can help.

A transformer is a deep learning architecture developed by Google, based on the multi-head attention mechanism proposed in the 2017 paper "Attention Is All You Need".[1] Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table.[1] At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation,[2][3] and on the Fast Weight Controller of 1992, which is similar to a transformer.[4][5][6]

Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier recurrent neural architectures such as long short-term memory (LSTM).[7] Later variations have been widely adopted for training large language models (LLM) on large (language) datasets, such as the Wikipedia corpus and Common Crawl.[8]

Transformers are currently used in large-scale natural language processing, computer vision (vision transformers), audio,[9] multi-modal processing, robotics,[10] and even playing chess.[11] The architecture has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)[12] and BERT[13] (Bidirectional Encoder Representations from Transformers).

Sequence modelling and generation had been done with plain recurrent neural networks for many years. An early well-cited example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

One key component of the attention mechanism is the use of neurons that multiply the outputs of other neurons. Such neurons were called multiplicative units, and neural networks using them were called sigma-pi networks or second-order networks,[35] but they faced high computational complexity.[7] A key breakthrough was LSTM (1995),[note 1] which incorporated multiplicative units into a recurrent network, along with other innovations that prevented the vanishing gradient problem and allowed efficient learning of long-sequence modelling. It remained the standard architecture for long-sequence modelling until the 2017 publication of transformers.

However, LSTM did not solve a general problem that recurrent networks usually[note 2] have, which is that they cannot operate in parallel over all tokens in a sequence: they must process tokens one at a time, from the first to the last. The fast weight controller (1992) was an early attempt to bypass this difficulty. It used the fast weights architecture,[36] in which one neural network outputs the weights of another neural network. It was later shown to be equivalent to the linear transformer without normalization.[17][4]

In 2014, an attention mechanism was introduced to seq2seq models (using gated recurrent units, a variant of LSTM) for machine translation.[2][3] It was introduced to solve a specific issue encountered in seq2seq: the input is processed sequentially by one recurrent network into a fixed-size vector, which is then processed by another recurrent network into an output. If the input is long, the fixed-size vector cannot contain all the relevant information, and output quality degrades.

The idea of the attention mechanism in recurrent networks is to use all outputs of the first network, not just its last output. At each step, the second network uses an attention mechanism to combine these outputs linearly, then processes the result further.
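This linear combination of encoder outputs can be sketched in a few lines (a minimal NumPy sketch; dot-product scoring is an assumption made for brevity — the 2014 mechanism used a small learned feedforward network to score each encoder output):

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    """Score each encoder output against the current decoder state,
    softmax the scores into weights, and return the weighted (linear)
    combination of all encoder outputs (the "context vector")."""
    scores = encoder_outputs @ decoder_state    # one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over positions
    return weights @ encoder_outputs            # linear combination

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 8))   # 5 positions, dimension 8
context = attend(rng.normal(size=8), encoder_outputs)
print(context.shape)                        # (8,)
```

The context vector has the same dimension as a single encoder output, regardless of input length, which is exactly what lets the second network consume it at every step.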

Previously, seq2seq had no attention mechanism, and the state vector was accessible only after the last word of the source text had been processed. Although in theory such a vector retains information about the whole original sentence, in practice the information is poorly preserved, because seq2seq models have difficulty modelling long-distance dependencies. Reversing the input sentence improved seq2seq translation.[37] With an attention mechanism, the network can model long-distance dependencies more easily.[2]

Seq2seq models with attention still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied the attention mechanism to feedforward networks, which are easy to parallelize.[38]

In 2017, Vaswani et al. proposed replacing recurrent neural networks with self-attention entirely and began evaluating that idea.[1] Transformers process all tokens simultaneously, using an attention mechanism to calculate "soft" weights between them in successive layers. Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel, which leads to improved training speed.

The plain transformer architecture had difficulty converging. In the original paper,[1] the authors recommended using learning rate warmup: the learning rate should scale up linearly from 0 to its maximal value for the first part of training (commonly recommended to be about 2% of the total number of training steps), before decaying again.
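The warmup-then-decay schedule from the original paper can be written as a short function (the defaults follow the paper's base model, d_model = 512 and 4000 warmup steps; treat them as illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from "Attention Is All You Need":
    linear warmup for the first `warmup_steps`, then decay proportional
    to the inverse square root of the step number."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The peak learning rate is reached exactly at `warmup_steps`, where the two branches of the `min` coincide.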

Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a larger dataset than fine-tuning, due to the limited availability of labeled training data. Tasks for pretraining and fine-tuning commonly include:

The transformer has had great success in natural language processing (NLP), for example the tasks of machine translation and time series prediction. Many large language models such as GPT-2, GPT-3, GPT-4, Claude, BERT, XLNet, RoBERTa and ChatGPT demonstrate the ability of transformers to perform a wide variety of such NLP-related tasks, and have the potential to find real-world applications. These may include:

As an illustrative example, Ithaca is an encoder-only transformer with three output heads. It takes ancient Greek inscriptions as input, represented as sequences of characters with illegible characters replaced by "-". Its three output heads respectively output probability distributions over Greek characters, the location of the inscription, and the date of the inscription.[41]

Transformer layers can be one of two types, encoder and decoder. The original paper used both, while later models often include only one type: BERT is an example of an encoder-only model, while the GPT models are decoder-only.

The input text is parsed into tokens by a tokenizer, most often a byte pair encoding tokenizer, and each token is converted into a vector by looking it up in a word embedding table. Then, positional information for the token is added to the word embedding.
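This lookup-and-add step can be sketched as follows, assuming a toy three-word vocabulary and the fixed sinusoidal positional encoding from the original paper (a real model would use a trained byte pair encoding tokenizer and learned embeddings):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary (assumption)
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def positional_encoding(pos, d_model):
    """Fixed sinusoidal positional encoding: sines at even indices,
    cosines at odd indices, with geometrically spaced frequencies."""
    i = np.arange(d_model // 2)
    angles = pos / (10000.0 ** (2 * i / d_model))
    enc = np.empty(d_model)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    return enc

tokens = [vocab[w] for w in "the cat sat".split()]
x = np.stack([embedding_table[t] + positional_encoding(p, d_model)
              for p, t in enumerate(tokens)])
print(x.shape)  # (3, 8)
```

Each row of `x` is one token's embedding plus its position signal; this matrix is what the first attention layer consumes.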

Like earlier seq2seq models, the original transformer model used an encoder-decoder architecture. The encoder consists of encoding layers that process the input tokens iteratively one layer after another, while the decoder consists of decoding layers that iteratively process the encoder's output as well as the decoder's output tokens so far.

The function of each encoder layer is to generate contextualized token representations, where each representation corresponds to a token that "mixes" information from other input tokens via a self-attention mechanism. Each decoder layer contains two attention sublayers: (1) cross-attention for incorporating the output of the encoder (the contextualized input token representations), and (2) self-attention for "mixing" information among the input tokens to the decoder (i.e., the tokens generated so far during inference).[42][43]

The attention calculation for all tokens can be expressed as one large matrix calculation using the softmax function, which is useful for training because matrix operations are heavily optimized on modern hardware. The matrices Q, K, and V are defined as the matrices whose i-th rows are the vectors q_i, k_i, and v_i respectively. Then the attention can be represented as

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the key vectors.
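This full-matrix computation, scaled dot-product attention softmax(QK^T / √d_k)V, can be sketched directly in NumPy (shapes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed for
    all query tokens at once as a single matrix operation."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, key dimension 8
K = rng.normal(size=(6, 8))   # 6 key tokens
V = rng.normal(size=(6, 3))   # one value vector per key token
print(attention(Q, K, V).shape)  # (4, 3)
```

Each row of the softmax output is a probability distribution over the key tokens, so each output row is a weighted average of the value vectors.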

One set of (W_Q, W_K, W_V) matrices is called an attention head, and each layer in a transformer model has multiple attention heads. Each attention head attends to the tokens that are relevant to each token, and multiple attention heads allow the model to do this for different definitions of "relevance". In addition, the influence field representing relevance can become progressively dilated in successive layers. Many transformer attention heads encode relevance relations that are meaningful to humans: for example, some attention heads attend mostly to the next word, while others attend mainly from verbs to their direct objects.[44] The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs of the attention layer are concatenated and passed into the feed-forward neural network layers.
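The per-head projections, independent attention computations, and final concatenation can be sketched as follows (a NumPy sketch with illustrative shapes; the Python loop stands in for computation that would run in parallel, and the output projection W_O is part of the standard multi-head formulation):

```python
import numpy as np

def multi_head_attention(X, heads, W_O):
    """Each head has its own (W_Q, W_K, W_V) projections; head outputs
    are computed independently, concatenated, and projected by W_O."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq_len = 8, 4, 2, 5
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_O).shape)  # (5, 8)
```

The concatenated head outputs have dimension n_heads × d_head, which W_O maps back to d_model so the result can feed the next layer.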
