Test Your Knowledge

8 Questions

What is the Transformer?

A neural network architecture based on self-attention mechanisms

What is the purpose of positional encoding in the Transformer?

To inject information about the relative or absolute position of tokens in a sequence

What is multi-head attention in the Transformer?

A mechanism that allows the model to attend to information from different representation subspaces at different positions

What is the purpose of the fully connected feed-forward network in the Transformer?

To apply two linear transformations with a ReLU activation in between to each position separately and identically

What is the purpose of dropout and label smoothing in the Transformer?

To improve regularization during training

What is the difference between the encoder and decoder in the Transformer?

The encoder only uses self-attention layers, while the decoder uses both self-attention layers and multi-head attention over the output of the encoder stack

What is the benefit of using the Transformer over traditional encoder-decoder architectures with recurrent layers?

Self-attention layers allow for parallel computation and can connect all positions with a constant number of sequentially executed operations

What is the performance of the Transformer on the WMT 2014 English-to-German translation task?

28.4 BLEU, improving over the existing best results, including ensembles, by over 2 BLEU
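
The answers above describe scaled dot-product self-attention and multi-head attention only in words. The NumPy sketch below makes the mechanism concrete; the function names, toy dimensions, and random weights are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    return weights @ V                                  # (heads, seq, d_k)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Project, split into heads, attend per head, concatenate, project back.
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    def split(W):
        return (X @ W).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)        # each (heads, seq, d_k)
    heads = scaled_dot_product_attention(Q, K, V)       # (heads, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                 # (seq, d_model)

# Toy usage: 5 tokens, d_model = 16, 4 heads (illustrative sizes only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=4)
print(out.shape)  # (5, 16)
```

In the decoder's self-attention, scores for future positions would additionally be masked before the softmax so that each position can only attend to earlier positions.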

Study Notes

"Attention Is All You Need" Model Architecture

  • The "Attention Is All You Need" model architecture proposes a new simple network based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

  • The model allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

  • The model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.

  • On the WMT 2014 English-to-French translation task, the model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

  • The Transformer uses multi-head attention in three different ways: encoder-decoder attention, self-attention layers in the encoder, and self-attention layers in the decoder.

  • The model contains a fully connected feed-forward network applied to each position separately and identically, consisting of two linear transformations with a ReLU activation in between.

  • The model uses learned embeddings to convert input and output tokens to vectors of dimension d_model.

  • The model shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation.

  • The model employs positional encodings so that it can make use of the order of the sequence, since it contains no recurrence and no convolution.

  • The encoder is composed of a stack of N=6 identical layers, each with two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network.

  • The decoder is also composed of a stack of N=6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

  • Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

The Transformer - Attention is All You Need
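
The notes above and below both mention the positional encodings that inject order information into the otherwise order-agnostic attention layers. Here is a minimal sketch of the sinusoidal variant; the function name and toy sizes are assumptions for illustration, while the sine/cosine formulas follow the paper's description.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]                  # (seq, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encodings are simply added to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```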

  • The Transformer is a neural network architecture based solely on self-attention mechanisms.

  • The Transformer replaces recurrent layers with multi-head attention and feed-forward layers.

  • Self-attention layers allow for parallel computation and can connect all positions with a constant number of sequentially executed operations.

  • The Transformer uses positional encodings to inject information about the relative or absolute position of tokens in a sequence.

  • Compared to recurrent layers, self-attention layers are faster and have a shorter maximum path length between any two positions in the network.

  • Convolutional layers are generally more expensive than recurrent layers, but separable convolutions decrease complexity considerably.

  • The Transformer achieved state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks.

  • The Transformer is highly customizable, and variations in attention head number, attention key size, and model size can improve performance.

  • The Transformer also performed well on English constituency parsing, a task with strong structural constraints.

  • The Transformer's attention distributions can be inspected for interpretability, and many heads exhibit behavior related to syntactic and semantic structure.

  • The Transformer uses dropout and label smoothing for regularization during training.

  • The Transformer was trained on the standard WMT 2014 English-German dataset and the significantly larger WMT 2014 English-French dataset.

The Transformer: A Model Based Entirely on Attention for Sequence Transduction
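
Before the experiment notes, the sketch below ties together two pieces described earlier: the position-wise feed-forward network (two linear transformations with a ReLU activation in between) and the residual connection plus layer normalization that the paper wraps around each sub-layer. The function names, toy sizes, and random weights are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position identically.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def residual_sublayer(x, sublayer):
    # Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Toy usage: d_model = 16, d_ff = 64 (the base model in the paper uses 512 and 2048).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                     # 5 positions, d_model = 16
W1, b1 = rng.normal(size=(16, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 16)), np.zeros(16)
out = residual_sublayer(x, lambda h: position_wise_ffn(h, W1, b1, W2, b2))
print(out.shape)  # (5, 16)
```

A full encoder layer stacks a multi-head self-attention sub-layer and this feed-forward sub-layer, each wrapped in the same residual-plus-normalization pattern.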

  • The Transformer is a sequence transduction model based entirely on attention, replacing the commonly used recurrent layers in encoder-decoder architectures.

  • The model employs multi-headed self-attention in both the encoder and decoder, allowing for faster training than with recurrent or convolutional layers.

  • The Transformer achieved a new state-of-the-art on the WMT 2014 English-to-German and English-to-French translation tasks.

  • The model was trained on the Wall Street Journal (WSJ) portion of the Penn Treebank, and in a semi-supervised setting using the larger high-confidence and BerkeleyParser corpora.

  • The Transformer outperformed the BerkeleyParser even when training only on the WSJ training set of 40K sentences.

  • The model was trained with a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.

  • The dropout rates (both attention and residual), learning rate, and beam size were selected through a small number of experiments on the Section 22 development set.

  • During inference, the maximum output length was increased to input length + 300, and a beam size of 21 and alpha of 0.3 were used.

  • The Transformer performed surprisingly well despite the lack of task-specific tuning, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar.

  • The authors plan to apply the Transformer to other tasks and extend it to handle input and output modalities other than text, such as images, audio, and video.

  • The code used to train and evaluate the models is available on GitHub.

  • The authors are grateful to Nal Kalchbrenner and Stephan Gouws for their comments and inspiration.
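
The notes mention dropout and label smoothing as the Transformer's regularizers, but do not show what label smoothing does to the training targets. The sketch below is one common formulation, in which the true token keeps probability 1 - epsilon and the remaining mass is spread over the rest of the vocabulary; the helper names and toy sizes are assumptions, and implementations differ in exactly how the smoothing mass is distributed.

```python
import numpy as np

def label_smoothed_targets(token_ids, vocab_size, epsilon=0.1):
    # Replace each one-hot target with (1 - eps) on the true token and
    # eps spread uniformly over the remaining vocabulary entries.
    n = len(token_ids)
    targets = np.full((n, vocab_size), epsilon / (vocab_size - 1))
    targets[np.arange(n), token_ids] = 1.0 - epsilon
    return targets

def cross_entropy(log_probs, targets):
    # Average cross-entropy between model log-probabilities and smoothed targets.
    return float(-(targets * log_probs).sum(axis=-1).mean())

# Toy usage: 3 target tokens from a 10-word vocabulary (illustrative sizes only).
logits = np.random.default_rng(0).normal(size=(3, 10))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
targets = label_smoothed_targets(np.array([2, 7, 4]), vocab_size=10, epsilon=0.1)
print(cross_entropy(log_probs, targets))
```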

Test your knowledge about the "Attention Is All You Need" model architecture with this quiz! Learn about the benefits of using attention mechanisms instead of recurrence and convolutions, the use of self-attention layers, multi-head attention, learned embeddings, and positional encoding. Discover how the Transformer achieved state-of-the-art results on translation tasks, its high customizability, and its potential applications beyond text, such as images, audio, and video. Challenge yourself and see how much you know about this revolutionary neural network architecture!
