Full Transcript

Transformer Architecture [VSPU17]

Sequential processing of RNNs is their greatest strength – and their weakness: sequential models cannot be parallelized well because each hidden state depends on the previous one.
⇒ backpropagation through time becomes expensive with the sequence length
In the transformer architecture, all previous states are used
▶ word position is mixed into the encoding
▶ weighted by an attention mechanism
This has benefits, but also drawbacks:
▶ no “bottleneck” for passing data along time
▶ all tokens can be processed in parallel instead of sequential processing along time
▶ attention requires O(n²) computation

Classic Transformer Architecture [VSPU17]

Many variations: encoder, decoder, attention, positional encoding (∿), …
Today: primarily decoder-only models.

Attention [BaChBe15]

Not all inputs are equally important. Importance may vary across time steps (RNN!):
The dog played with its toy. ⇒ Der Hund spielte mit __
To predict the next word, the LM needs to attend more to the input tokens dog and its than, e.g., to played.
Attention idea: learn to predict which inputs are currently relevant.
Attention provides the LM with weights to focus on certain inputs.

Scaled Dot-Product Attention [VSPU17]

The attention variant used in the transformer architecture:
Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) · V
Q: matrix of queries (inputs)
K: matrix of keys (against which queries are matched)
V: matrix of values (often set to K)
d_k: dimension of queries and keys
The scaling factor 1/√d_k is used to improve performance (it weakens the vanishing gradient problem).
The dot product q·k is a very simple alignment measure. In self-attention, queries, keys, and values are all derived from the same input sequence, and hence we use Q = K = V (up to learned projections).
Masking: by overwriting values with −∞ before the softmax, parts of the input can be masked in training and decoding.

Multi-Head Attention [VSPU17]

Scaled dot-product attention can only capture one level of meaning. Complex tasks require different views, e.g., paying attention to nouns, pronouns, verbs, all at once, but separately.
Multi-head attention therefore runs several scaled dot-product attention heads in parallel on different learned projections of Q, K, and V, and concatenates their outputs before a final linear projection.
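To make the formula and the multi-head idea concrete, here is a minimal PyTorch sketch (not from the lecture; the function and class names, the toy dimensions, and the boolean-mask convention are illustrative assumptions):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with optional masking."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (..., n_q, n_k) alignment scores
    if mask is not None:                                    # mask: True = keep, False = hide
        scores = scores.masked_fill(~mask, float("-inf"))   # -inf -> weight 0 after softmax
    return torch.softmax(scores, dim=-1) @ v                # weighted sum of the values

class MultiHeadSelfAttention(torch.nn.Module):
    """Several attention heads on learned projections of the same input, then concat + project."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.o_proj = torch.nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):                        # x: (batch, n, d_model)
        b, n, _ = x.shape
        def heads(t):                                       # -> (batch, n_heads, n, d_head)
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = heads(self.q_proj(x)), heads(self.k_proj(x)), heads(self.v_proj(x))
        out = scaled_dot_product_attention(q, k, v, mask)   # each head attends separately
        out = out.transpose(1, 2).reshape(b, n, -1)         # concatenate the heads again
        return self.o_proj(out)

x = torch.randn(2, 8, 64)                                   # toy input: 2 sequences of 8 tokens
causal = torch.tril(torch.ones(8, 8)).bool()                # decoder-style "no future tokens" mask
print(MultiHeadSelfAttention(64, 4)(x, causal).shape)       # torch.Size([2, 8, 64])
```

The causal mask in the usage example is the masking mentioned above as used by decoder-only models: positions a token is not allowed to see are set to −∞ before the softmax.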
Flash Attention [Dao23; DFER22]

A naive attention implementation needs O(n²) intermediate memory for the n×n attention matrix.
A 512-token context (BERT, first major transformer) with bf16 data: ≈ 500 kB. A 4k context (LLaMA): 32 MB, a 128k context (GPT-4): 32 GB ⇒ bottleneck.
We need optimized implementations that fuse these operations.
FlashAttention 2 is a hardware-optimized implementation:
▶ scaled dot-product attention in a single CUDA kernel (Ampere/Hopper only)
▶ blockwise processing to avoid the O(n²) storage cost
▶ optimized use of the GPU cache and VRAM
▶ several times faster than the PyTorch implementation
But: it still requires O(n²) computation for full attention.

LongFormer [BePeCo20]

Full attention is O(n²) and does not scale to a large context.
▶ if we use only the w nearest words, this reduces to O(n·w)
▶ nearest words have the highest similarity by the positional encoding
▶ we can design other patterns, e.g., every other word
Because of the stacked transformer layers, information can still propagate over a longer range.

Sparse Transformer [CGRS19]

Factorized self-attention:
▶ different heads can have different attention access patterns
▶ certain access patterns are more efficient to compute
▶ we can connect every position with every other within a few layers at reduced computation cost, e.g., O(n·√n) instead of O(n²) for two factorized steps
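The following sketch shows how such restricted access patterns can be written as boolean masks for the attention function sketched earlier. The helper names, window size, and stride are illustrative assumptions; real implementations (FlashAttention, LongFormer, Sparse Transformer kernels) compute these patterns blockwise instead of materializing a dense n×n mask.

```python
import torch

def sliding_window_mask(n, w):
    """Local attention (LongFormer-style): each token sees tokens at most w positions away."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w         # (n, n) boolean mask

def strided_mask(n, stride):
    """Strided pattern (Sparse-Transformer-style): attend to every stride-th earlier token."""
    rel = torch.arange(n)[:, None] - torch.arange(n)[None, :]
    return (rel >= 0) & (rel % stride == 0)

n = 16
local, strided = sliding_window_mask(n, w=2), strided_mask(n, stride=4)
# Different heads can use different patterns; stacking layers still lets information
# travel beyond the local window. Allowed entries per row: O(w) resp. O(n/stride)
# instead of n, which is where the savings over full attention come from.
print(local.sum(dim=-1)[:4].tolist(), strided.sum(dim=-1)[:4].tolist())
```

Note that a dense boolean mask like this still costs O(n²) memory by itself; the point of the blockwise kernels above is precisely to avoid ever storing such an n×n object.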
Positional Encoding [VSPU17]

Because the transformer processes all inputs at once, we lose the information about the order of tokens.
Solution: add “positional encodings” to the embeddings such that the network can compute position differences.

Rotary Embeddings (RoPE) [SLPW21]

Adding PE to the embeddings changes their lengths and dot products. But we want attention to depend only on the relative, not the absolute, location in the text. Also, PE does not extend well to a context longer than dim/2 to dim of the embeddings.
Rotary embeddings (RoPE, [SLPW21]) are built upon rotations instead of a Fourier expansion. Every pair of dimensions (2i, 2i+1) represents a rotation with rotation speed θ_i = 10000^(−2i/dim), where i = 0, …, dim/2 − 1: similar to the sinusoidal embeddings. We now multiply these rotations onto the query and key vectors, which preserves their lengths, and the dot product of a rotated query and key then depends only on their relative position. Because the matrix is composed of dim/2 submatrices of size 2×2, applying the rotation has only linear cost.
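A minimal sketch of this rotation, assuming the θ_i = 10000^(−2i/dim) schedule above and an adjacent-pair layout of the dimensions; the function name and layout are illustrative, not the exact convention of any particular RoPE implementation.

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Rotate each adjacent dimension pair (2i, 2i+1) of x by the angle pos * theta_i,
    with theta_i = base**(-2i/dim). Only dim/2 independent 2x2 rotations: linear cost."""
    dim = x.size(-1)
    i = torch.arange(dim // 2, dtype=torch.float32)
    theta = base ** (-2.0 * i / dim)                         # rotation speeds, as for sinusoidal PE
    angles = positions[:, None].float() * theta[None, :]     # (n, dim/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                      # the two coordinates of each pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                     # apply the 2x2 rotation pairwise
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(1, 8), torch.randn(1, 8)                  # one query and one key, dim 8
# Lengths are preserved by the rotation:
print(torch.allclose(q.norm(), rope_rotate(q, torch.tensor([3])).norm(), atol=1e-5))
# The dot product depends only on the relative offset (here 2 in both cases):
a = rope_rotate(q, torch.tensor([3])) @ rope_rotate(k, torch.tensor([5])).T
b = rope_rotate(q, torch.tensor([0])) @ rope_rotate(k, torch.tensor([2])).T
print(torch.allclose(a, b, atol=1e-5))
```

The two checks at the end illustrate the two claims from the slide: the rotation preserves vector lengths, and the rotated dot product depends only on the positional offset between query and key.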