06_Transformer.pdf
Artificial Intelligence
6. Transformer and Large Language Models
7 Oct. 2024
Sang-Hoon Lee, Ajou University

RNNs vs. Transformers
▪ To better understand Transformers, it is helpful to compare them with RNNs.
▪ While both are used for sequence modeling, the key differences lie in how they handle long-range dependencies and parallelization.
https://wikidocs.net/167212

Recurrent Neural Networks
▪ Recurrent Neural Networks (RNNs) have a mechanism that deals directly with the sequential nature of language, allowing them to handle the temporal nature of language without the use of arbitrary fixed-sized windows.
▪ The recurrent network offers a new way to represent the prior context, in its recurrent connections, allowing the model's decision to depend on information from hundreds of words in the past.

Limitations of RNNs
▪ Disadvantages
  Recurrent computation is slow (slow training speed)
  In practice, it is difficult to access information from many steps back
▪ Advantages
  Can process input of any length
  Computation for step t can (in theory) use information from many steps back
  Model size doesn't increase for longer input
  The same weights are applied at every timestep, so there is symmetry in how inputs are processed
https://cs231n.stanford.edu/slides/2022/lecture_10_ruohan.pdf

Index
▪ Transformer
  Attention
  Architecture
▪ LLM
  Decoder-only (Next Token Prediction)
  Sampling
  Tokenization
  Pre-training
  ✓ Scaling Law
  Fine-tuning
  ✓ PEFT
  In-context Learning
  ✓ Prompting
  ✓ Chain-of-Thought (CoT)

Introduction to Transformers
▪ Transformer: a specific kind of network architecture, like a fancier feedforward network, but based on attention.
▪ LLMs are built out of transformers.

A very approximate timeline
▪ 1990 Static Word Embeddings
▪ 2003 Neural Language Model
▪ 2008 Multi-Task Learning
▪ 2015 Attention
▪ 2017 Transformer
▪ 2018 Contextual Word Embeddings and Pretraining
▪ 2019 Prompting

Transformer Networks
▪ Feed-forward neural networks (no recurrence) with an attention module

Transformer Networks (Decoder)
▪ Next Token Prediction (Inference)
  Input: text token sequence [x_1 : x_{t-1}]
  Output: next token x_t
▪ Parallel Token Prediction (Training)
  Input: text token sequence [x_1 : x_t]
  Output: text token sequence [x_2 : x_{t+1}]
  ✓ x_{t+1}: EOS token (end of sequence)
[Figure: stacked transformer blocks over the input tokens "So long and thanks for", with input encodings E plus position, a language modeling head U producing logits, and next-token predictions "long and thanks for all"]
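The decoder's parallel training setup above (inputs [x_1 : x_t], targets [x_2 : x_{t+1}]) can be written in a few lines. Below is a minimal sketch, not from the slides: a toy stand-in model (embedding plus linear head) takes the place of the stacked transformer blocks, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50_000, 64

# Toy stand-in "model": embedding + linear head. A real decoder would put
# masked-attention transformer blocks between these two layers.
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

token_ids = torch.randint(0, vocab_size, (8, 129))   # hypothetical batch, length t+1

inputs  = token_ids[:, :-1]   # [x_1 .. x_t]
targets = token_ids[:, 1:]    # [x_2 .. x_{t+1}], shifted by one position

logits = lm_head(embed(inputs))                      # [8, 128, vocab_size]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                      # all positions are trained in parallel
```

Shifting the targets by one position is what lets every position be predicted at once during training, while inference still proceeds token by token.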
Transformer Networks (Encoder)
▪ Next Token Prediction (Inference)
  Input: text token sequence [x_1 : x_t]
  Output: hidden representation sequence [h_1 : h_t]
▪ Parallel Token Prediction (Training)
  Input: text token sequence [x_1 : x_t]
  Output: hidden representation sequence [h_1 : h_t]
[Figure: the same stacked transformer architecture, here viewed as producing a hidden representation for every input token]

Contextual Embedding
The chicken didn't cross the road because it was too tired
▪ What is the meaning represented in the static embedding for "it"?
  Static embedding: Word2Vec
▪ Intuition
  The meaning of a word should be different in different contexts
▪ Contextual Embedding
  Each word has a different vector that expresses different meanings depending on the surrounding words
▪ How do we compute contextual embeddings? Attention

Contextual Embedding
▪ What should be the properties of "it"?
  The chicken didn't cross the road because it was too tired
  The chicken didn't cross the road because it was too wide
▪ At this point in the sentence, it's probably referring to either the chicken or the street

Attention
▪ Build up the contextual embedding for a word by selectively integrating information from all the neighboring words
▪ We say that a word "attends to" some neighboring words more than others

Attention
▪ A mechanism for helping compute the embedding for a token by selectively attending to and integrating information from surrounding tokens (at the previous layer).
▪ More formally: a method for doing a weighted sum of vectors.
[Figure: attention is left-to-right — a self-attention layer maps inputs x1..x5 to outputs a1..a5, with each output attending only to the current and preceding inputs]

Attention
▪ Simplified version of attention: a sum of prior word vectors, weighted by their similarity to the current word
▪ Given a sequence of token embeddings x_1 ... x_7 and a current element x_i:
  How do we compare words to other words? Since our representations of words are vectors, we'll use our old friend the dot product as the similarity score:
  Version 1: score(x_i, x_j) = x_i · x_j
▪ The result of a dot product is a scalar value ranging from −∞ to ∞; the larger the value, the more similar the vectors being compared.
▪ Continuing with our example, the first step in computing a_3 would be to compute three scores: x_3 · x_1, x_3 · x_2, and x_3 · x_3.
▪ To make effective use of these scores, we normalize them with a softmax to create a vector of weights α_{i,j} that indicates the proportional relevance of each input to the current focus of attention, element i:
  α_{i,j} = softmax(score(x_i, x_j))  for all j ≤ i
          = exp(score(x_i, x_j)) / Σ_{k=1}^{i} exp(score(x_i, x_k))
▪ Of course, the softmax weight will likely be highest for the current focus element i, since x_i is very similar to itself, resulting in a high dot product.

Attention
▪ Produce: a_i = a weighted sum of x_1 through x_7 (and x_i), weighted by their similarity to x_i:
  a_i = Σ_{j ≤ i} α_{i,j} x_j
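As a concrete illustration of this "version 1" attention, here is a small sketch that computes a_3 for a toy sequence exactly as in the equations above: dot-product scores against the current and preceding tokens, a softmax over those scores, and a weighted sum of the input vectors. The tensor values and names are illustrative, not from the slides.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 8)            # 5 token embeddings of dimension d = 8 (toy values)

i = 2                            # current position (x_3, using 0-based indexing)
scores = x[: i + 1] @ x[i]       # score(x_i, x_j) = x_i . x_j for all j <= i  -> shape [3]
alpha = torch.softmax(scores, dim=-1)            # weights alpha_{i,j}, which sum to 1
a_i = (alpha.unsqueeze(-1) * x[: i + 1]).sum(0)  # a_i = sum_{j<=i} alpha_{i,j} x_j
print(alpha, a_i.shape)
```

As the slide notes, alpha will usually be largest at position i itself, since x_i has a high dot product with itself.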
Attention
▪ High-level idea: instead of using the vectors (like x_i and x_j) directly, we'll represent 3 separate roles each vector x_i plays:
  Query (Q): as the current element being compared to the preceding inputs
  Key (K): as a preceding input that is being compared to the current element to determine a similarity
  Value (V): as a value of a preceding element that gets weighted and summed
[Figure slides: the query, key, and value roles illustrated for a single step of attention]

Attention
▪ We'll use matrices to project each vector x_i into a representation of its role as query, key, and value:
  query: W^Q, key: W^K, value: W^V
  q_i = x_i W^Q,  k_i = x_i W^K,  v_i = x_i W^V
▪ Given these 3 representations of x_i:
  To compute the similarity of the current element x_i with some prior element x_j, we'll use the dot product between q_i and k_j.
  And instead of summing up the x_j, we'll sum up the v_j.

Attention
▪ The result of a dot product can be an arbitrarily large (positive or negative) value, and exponentiating large values can lead to numerical issues and loss of gradients during training.
▪ To avoid this, we scale the dot product by a factor related to the size of the embeddings, dividing by the square root of the dimensionality of the query and key vectors (d_k):
  score(x_i, x_j) = (q_i · k_j) / √d_k

Attention
[Figure: computing the output of self-attention a_3 from x_1, x_2, x_3 — 1. generate key, query, and value vectors; 2. compare x_3's query with the keys for x_1, x_2, and x_3; 3. divide each score by √d_k; 4. turn the scores into weights α_{3,j} via softmax; 5. weigh each value vector; 6. sum the weighted value vectors]
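Putting these pieces together, the following sketch computes the attention output a_i for one position with learned query/key/value projections and the 1/√d_k scaling. It is a single-head toy example (multi-head attention, described next, runs several such heads in parallel, each with its own projections); all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

d_model, d_k = 8, 8
W_Q = nn.Linear(d_model, d_k, bias=False)   # query projection
W_K = nn.Linear(d_model, d_k, bias=False)   # key projection
W_V = nn.Linear(d_model, d_k, bias=False)   # value projection

x = torch.randn(5, d_model)                 # 5 token embeddings
i = 2                                       # current position

q_i = W_Q(x[i])                             # query for the current element
K   = W_K(x[: i + 1])                       # keys for x_1 .. x_i
V   = W_V(x[: i + 1])                       # values for x_1 .. x_i

scores = (K @ q_i) / d_k ** 0.5             # (q_i . k_j) / sqrt(d_k) for j <= i
alpha = torch.softmax(scores, dim=-1)       # attention weights
a_i = alpha @ V                             # weighted sum of the value vectors
```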
Multi-Head Attention
▪ Actual attention is slightly more complicated: Multi-Head Attention
  Instead of one attention head, we'll have lots of them
  Intuition: each head might be attending to the context for different purposes
  Different linguistic relationships or patterns in the context
[Figure: the multi-head attention computation]

Attention for Contextual Embeddings
▪ Attention is a method for enriching the representation of a token by incorporating contextual information
  The embedding for each word will be different in different contexts
▪ Contextual embeddings
  A representation of word meaning in its context
▪ Attention can also be viewed as a way to move information from one token to another.

Transformer Networks
▪ Feed-forward neural networks (no recurrence) with an attention module

Transformer Details
[Figure: the full decoder architecture again — input tokens, input encodings, stacked transformer blocks, and the language modeling head producing next-token logits]

Transformer Details
▪ The residual stream: each token gets passed up and modified
  Technically this is the prenorm architecture; there is an older "postnorm" architecture with the layer norms after the feedforward layer.

Transformer Details
▪ We'll need nonlinearities, so a feedforward layer
  The weights are the same for each token position i, but are different from layer to layer.
  It is common to make the dimensionality d_ff of the hidden layer of the feedforward network larger than the model dimensionality d.
  ✓ For example, in the original transformer model, d = 512 and d_ff = 2048 (512 → 2048 → 512)

Transformer Details
▪ Layer norm: the vector x_i is normalized twice (once before the attention layer and once before the feedforward layer in the prenorm architecture)

Transformer Details
▪ Layer norm is a variation of the z-score from statistics, applied to a single vector in a hidden layer:
  LayerNorm(x) = γ · (x − μ) / σ + β, where μ and σ are the mean and standard deviation of the components of x

Transformer Details
[Figure: the computation inside a single transformer block]

Transformer Details
▪ Skip connection: the residual stream is added back to the output of each sublayer

Transformer Details
▪ A transformer is a stack of these blocks, so all the vectors are of the same dimensionality d
[Figure: Block 1 feeding into Block 2]

Transformer Details
▪ Residual streams and attention
  Notice that all parts of the transformer block apply to 1 residual stream (1 token)
  Except attention, which takes information from other tokens
  We can view attention heads as literally moving information from the residual stream of a neighboring token into the current stream.

(Revisit) Limitations of RNNs
▪ Disadvantages
  Recurrent computation is slow (slow training speed)
  In practice, it is difficult to access information from many steps back
▪ Advantages (as listed earlier)
  Can process input of any length; computation for step t can (in theory) use information from many steps back; model size doesn't increase for longer input; the same weights are applied at every timestep
https://cs231n.stanford.edu/slides/2022/lecture_10_ruohan.pdf

RNNs vs. Transformers
▪ To better understand Transformers, it is helpful to compare them with RNNs.
▪ While both are used for sequence modeling, the key differences lie in how they handle long-range dependencies and parallelization.
https://wikidocs.net/167212

Parallelizing Attention Computation
▪ For the attention/transformer block we've been computing a single output at a single time step i in a single residual stream.
▪ But we can pack the N tokens of the input sequence into a single matrix X of size [N × d].
▪ Each row of X is the embedding of one token of the input.
▪ X can have 1K-32K rows, each of the dimensionality of the embedding d (the model dimension).

Parallelizing Attention Computation
▪ Q = XW^Q, K = XW^K, V = XW^V
▪ Now we can do a single matrix multiply, QK^T, to compare every query with every key at once.

Parallelizing Attention Computation
▪ Scale the scores, take the softmax, and then multiply the result by V, resulting in a matrix of shape N × d
  A = softmax(QK^T / √d_k) V
  An attention vector for each input token

Parallelizing Attention Computation
▪ Masking out the future
  What is this mask function? QK^T has a score for each query dotted with every key, including those that follow the query.
  Guessing the next word is pretty simple if you already know it.

Parallelizing Attention Computation
▪ Masking out the future
  Add −∞ to the cells in the upper triangle; the softmax will turn them into 0.
[Figure slides: the N × N score matrix QK^T before and after applying the causal mask]

Parallelizing Attention Computation
▪ Parallelizing the attention computation with only a forward pass
[Figure: the decoder architecture, with all positions computed in parallel during training]
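The whole masked, parallel computation described above fits in a few lines. Here is a minimal sketch assuming a single head and an input matrix X of shape [N × d]; it forms Q, K, V, adds −∞ above the diagonal of QK^T/√d_k, and applies the softmax before multiplying by V. Variable names are illustrative.

```python
import torch
import torch.nn as nn

N, d = 6, 16                                  # sequence length, model dimension
X = torch.randn(N, d)                         # one row per token

W_Q, W_K, W_V = (nn.Linear(d, d, bias=False) for _ in range(3))
Q, K, V = W_Q(X), W_K(X), W_V(X)              # all positions at once: [N, d]

scores = Q @ K.T / d ** 0.5                   # [N, N] query-key scores
mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # -inf above the diagonal (the future)

A = torch.softmax(scores, dim=-1) @ V         # [N, d]: one attention output per token
```

After the softmax, every masked (future) cell contributes a weight of exactly 0, so each row of A depends only on the current and preceding tokens, matching the left-to-right attention computed one position at a time earlier.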
Input and output: Position embeddings and the Language Model Head
▪ Token and Position Embeddings
  The matrix X (of shape [N × d]) has an embedding for each word in the context.
  This embedding is created by adding two distinct embeddings for each input:
  ✓ Token embedding
  ✓ Positional embedding
[Figure: X = composite embeddings (word + position) — the word embeddings for "Janet will back the bill" are added elementwise to the embeddings for positions 1-5 before entering the transformer block]

Input
▪ Token Embeddings
  Embedding matrix E has shape [|V| × d]
  ✓ One row for each of the |V| tokens in the vocabulary
  ✓ Each word is a row vector of d dimensions
▪ Position Embeddings
  There are many methods, but we'll just describe the simplest: absolute position.
  Goal: learn a position embedding matrix E_pos of shape [1 × N].
  Start with randomly initialized embeddings
  ✓ one for each integer up to some maximum length
  ✓ i.e., just as we have an embedding for the token fish, we'll have an embedding for position 3 and position 17
  ✓ As with word embeddings, these position embeddings are learned along with the other parameters during training.

Output
▪ Unembedding layer: a linear layer projects from h_N^L (shape [1 × d]) to the logit vector u (shape [1 × |V|])
  Softmax turns the logits into probabilities over the vocabulary, shape [1 × |V|]
[Figure: the language model head takes h_N^L from the final transformer layer L and outputs a distribution over the vocabulary V; the unembedding layer (shape [d × |V|]) is E^T, followed by a softmax over the vocabulary]

Transformer Networks
▪ 1. Input → Token Embeddings
▪ 2. Add Positional Embedding
▪ 3. Transformer
▪ 4. Linear → Softmax
▪ Training
  Parallel computation with masking
▪ Inference
  Autoregressive next token prediction
  ✓ Without masking

Large Language Models

Encoder-only Transformer
▪ Many varieties
  BERT family
  Popular: Masked Language Models (MLMs)
  Trained by predicting words from surrounding words on both sides
  Usually finetuned (trained on supervised data) for classification tasks
[Figure: pretraining for the three types of architecture — the neural architecture influences the type of pretraining: encoders, encoder-decoders, decoders]

Encoder-Decoder Transformer
▪ Trained to map from one sequence to another (sequence-to-sequence models)
▪ Very popular for:
  machine translation (map from one language to another)
  speech recognition (map from acoustics to words)

Introduction to Large Language Models: Decoder-only Transformer
▪ Large Language Models
  Even though pretrained only to predict words,
  they learn a lot of useful language knowledge,
  since they train on a lot of text
▪ Decoder-only models
  Also called:
  ✓ Causal LLMs
  ✓ Autoregressive LLMs
  ✓ Left-to-right LLMs
  ✓ Predict words left to right

Decoder-only Transformer
▪ Conditional Generation: generating text conditioned on previous text
[Figure: left-to-right (also called autoregressive) text completion with a transformer-based large language model — the prefix text "So long and thanks for all the" is encoded, the language modeling head produces logits and a softmax over the vocabulary, and the completion text is generated; as each token is generated, it gets added onto the context as a prefix for generating the next token]
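A sketch of this conditional-generation loop, again using a toy embedding/unembedding stand-in (a real model would run the growing context through its stacked transformer blocks before the unembedding step). The next token is sampled from the model's distribution and appended to the context, exactly as the figure describes; token ids and names are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 64
embed = nn.Embedding(vocab_size, d_model)     # toy stand-in for the full model:
unembed = nn.Linear(d_model, vocab_size)      # a real LLM runs transformer blocks in between

prefix = [21, 480, 13, 977, 305]              # hypothetical token ids for the prefix text
context = torch.tensor([prefix])

for _ in range(5):                            # generate 5 completion tokens
    h = embed(context)                        # [1, len, d_model]
    logits = unembed(h[:, -1])                # logits for the next token, [1, vocab_size]
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)   # sample from the distribution
    context = torch.cat([context, next_token], dim=1)      # appended token becomes new context
```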
Decoder-only Transformer
▪ Many practical NLP tasks can be cast as word prediction!
▪ Sentiment analysis: "I like Jackie Chan"
  1. We give the language model this string:
     The sentiment of the sentence "I like Jackie Chan" is:
  2. And see what word it thinks comes next: we compare the probability of the word "positive" with that of the word "negative" to see which is higher:
     P(positive | The sentiment of the sentence "I like Jackie Chan" is:)
     P(negative | The sentiment of the sentence "I like Jackie Chan" is:)
  If the word "positive" is more probable, we say the sentiment of the sentence is positive; otherwise we say the sentiment is negative.
▪ We can also cast more complex tasks as word prediction. Consider question answering, in which the system is given a question (for example, a question with a simple factual answer) and must give a textual answer; we introduce this task in detail in Chapter 15. We can cast the task of question answering as word prediction by giving a language model a question and a token like A: suggesting that an answer should come next.

Decoder-only Transformer
▪ Framing lots of tasks as conditional generation
▪ QA: "Who wrote The Origin of Species"
  1. We give the language model this string:
  2. And see what word it thinks comes next:
  3. And iterate:

Decoder-only Transformer
[Figure: summarization as conditional generation — the original story ("The only … idea was born.") is followed by the delimiter tl;dr, and the model then generates the summary ("Kyle Waring will …") token by token through the LM head]

Decoder-only Transformer
▪ Decoding
  The task of choosing a word to generate based on the model's probabilities is called decoding.
▪ The most common method for decoding in LLMs: sampling.
  Sampling from a model's distribution over words:
  ✓ choose random words according to the probability assigned by the model
  After each token, we sample the next word to generate according to its probability conditioned on our previous choices
  ✓ A transformer language model will give us this probability

Sampling method for Next Token
▪ Factors in word sampling: quality and diversity
▪ Emphasize high-probability words
  + quality: more accurate, coherent, and factual
  − diversity: boring, repetitive
▪ Emphasize middle-probability words
  + diversity: more creative, diverse
  − quality: less factual, incoherent

Sampling method for Next Token
▪ Top-k sampling:
  1. Choose # of words k
  2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context, p(w_t | w_{<t})
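The section breaks off here, but the two steps above already suggest the shape of the procedure. The sketch below follows standard top-k sampling (keep only the k most probable next tokens, renormalize, and sample among them) rather than the remainder of the slide text, which is not shown; the function name and toy logits are illustrative.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int) -> int:
    """Standard top-k sampling: restrict to the k most likely tokens, renormalize, sample."""
    topk_logits, topk_ids = torch.topk(logits, k)        # steps 1-2: keep the k best words
    probs = torch.softmax(topk_logits, dim=-1)           # renormalize p(w_t | w_<t) over the top k
    choice = torch.multinomial(probs, num_samples=1)     # sample among them
    return topk_ids[choice].item()

# usage with a toy logit vector over a 10-word vocabulary
logits = torch.randn(10)
next_token_id = sample_top_k(logits, k=3)
```

Note that k = 1 reduces to greedy decoding (maximum quality, minimum diversity), while larger k trades quality for diversity, matching the discussion above.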