Lecture 9 Neural Language Modeling PDF
Document Details

Technische Universität Darmstadt
2024
Dr. Thomas Arnold
Summary
This document presents lecture slides from Technische Universität Darmstadt on neural language modeling. The slides review fundamental concepts such as n-gram language models and subword tokenization (Byte Pair Encoding), and then move on to neural network and transformer-based language models. The slides are from the WS 2024/2025 term.
Full Transcript
NLP and the Web – WS 2024/2025
Lecture 9: Neural Language Modeling
Dr. Thomas Arnold, Hovhannes Tamoyan, Kexin Wang
Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt

Syllabus (tentative)
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation
05 IR – Transformer/BERT
06 IR – Dense Retrieval
07 IR – Neural Re-Ranking
08 LLM – Language Modeling Foundations, Tokenization
09 LLM – Neural LLM
10 LLM – Adaptation, LoRA, Prompting
11 LLM – Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam

Outline
- Recap: LM and sub-word tokenization
- Basic Neural Language Models
- RNN and Transformer LM

Recap: Language Models
A language model (LM) is a model that assigns probabilities to sequences of words.
Example: "I love salty chocolate" -> 0.000147

Recap: Bigram LM
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

Example with C(<S>) = 1000:

    Word 1   Word 2            Count   P(W2 | W1)
    <S>      There             256     0.256
    <S>      The               321     0.321
    <S>      Oh                17      0.017
    <S>      I                 69      0.069
    <S>      They              169     0.169
    <S>      Also              123     0.123
    <S>      However           45      0.045
    <S>      (anything else)   …       …

P(There | <S>) = C(<S> There) / C(<S>) = 256 / 1000 = 0.256

Recap: Generation with a Bigram LM
Generate the next token based on the conditional probabilities, e.g. "There" after <S>; then condition on "There":

    Word 1   Word 2   Count   P(W2 | W1)
    There    once     123     0.0246
    There    must     29      0.0058
    There    has      69      0.0138
    …        …        …       …

Recap: OOV Problem
Big problem: zero-probability n-grams.
- In evaluation: the test set might contain n-grams that never appeared in training, so the probability of the entire test set is zero (multiplication with zero) and perplexity is undefined (division by zero).
- In generation: the LM has never seen the words of my prompt before, so there is no way to generate the next token.
Approaches: <UNK> tokens, Laplace or add-k smoothing, backoff, interpolation, …

Recap: Byte Pair Encoding (BPE)
Subword tokenization technique, used for data compression and for dealing with unknown words.
Initialization: Vocabulary = set of all individual characters, V = {A, B, C, …, a, b, c, …, 1, 2, 3, …, !, $, %, …}
Repeat until k merges have been done:
- Choose the two symbols that appear as a pair most frequently (say "a" and "t")
- Add the new merged symbol ("at")
- Replace each occurrence of the pair with the new symbol ("t", "h", "a", "t" -> "t", "h", "at")

De-Tokenization
Greedy longest-prefix matching. Example: "the golden snickers" with vocab {"the", "go", "gold", "##den", "##en", "snicker", "##ers", "##s"}
- "the" is in the vocab -> ["the"]
- "golden" is not in the vocab -> longest matching prefix: "gold" -> ["the", "gold"]; remaining subword "##en" is in the vocab -> ["the", "gold", "##en"]
- "snickers" is not in the vocab -> longest matching prefix: "snicker" is in the vocab -> ["the", "gold", "##en", "snicker"]; remaining subword "##s" is in the vocab -> ["the", "gold", "##en", "snicker", "##s"]

GPT-3/4's Tokenizer
https://platform.openai.com/tokenizer
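The greedy longest-prefix matching from the de-tokenization slide can be written down as a short Python sketch. This is an illustrative implementation using the vocabulary from the slide, not code from the lecture; the fallback to an "[UNK]" token for unmatchable words is an added assumption.

    def tokenize_word(word, vocab):
        """Greedily split a word into subwords, always taking the longest
        vocabulary entry that matches the remaining prefix."""
        tokens = []
        start = 0
        while start < len(word):
            match = None
            # Try the longest possible piece first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    match = (piece, end)
                    break
            if match is None:
                return ["[UNK]"]          # no vocabulary entry matches (assumed fallback)
            piece, start = match
            tokens.append(piece)
        return tokens

    vocab = {"the", "go", "gold", "##den", "##en", "snicker", "##ers", "##s"}
    print([t for w in "the golden snickers".split() for t in tokenize_word(w, vocab)])
    # -> ['the', 'gold', '##en', 'snicker', '##s']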
Outline
Next part: Basic Neural Language Models

N-Gram LMs and Long-Range Dependencies
In general, count-based LMs are insufficient models of language because language has long-distance dependencies:
"The computer which I had just put into the machine room on the fifth floor crashed."

Pre-Computed N-Grams
Google n-gram viewer: https://books.google.com/ngrams/
Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

LM as a Machine Learning Problem
Task: Given the embeddings of the context words, predict the word on the right side. Anything beyond the context window is discarded.
Example: "blah blah blah blah and our problems turning ___" -> the context words in a window of size 4 are "and our problems turning", the target word is "into", and everything before the window ("blah blah blah blah") is discarded.

A Fixed-Window Neural LM
Training this model is basically optimizing its parameters Θ such that it assigns high probability to the target word: the context word embeddings are looked up, passed through f(·, Θ), and the network outputs probabilities over the vocabulary (mat, table, into, ant, chair, …).
It will also lay the foundation for later models (RNNs, Transformers, …). But first we need to figure out how to train neural networks!

Feeding Text to Neural Net
In practice this is implemented as follows:
1. Turn each word into a unique index (0, 1, 2, …)
2. Map each index to a one-hot vector
3. Look up the corresponding word embedding via matrix multiplication
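A tiny NumPy sketch of the three steps above: word to index, index to one-hot vector, and embedding lookup as a matrix multiplication. The toy vocabulary, the embedding dimension and the random values are made-up illustrations, not the lecture's setup.

    import numpy as np

    # Step 1: turn each word into a unique index.
    vocab = {"and": 0, "our": 1, "problems": 2, "turning": 3, "into": 4}
    V, d = len(vocab), 3                       # vocabulary size, embedding dimension

    rng = np.random.default_rng(0)
    E = rng.normal(size=(V, d))                # embedding matrix, one row per word

    # Step 2: map an index to a one-hot vector.
    def one_hot(index, size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    # Step 3: embedding lookup via matrix multiplication.
    x = one_hot(vocab["turning"], V)           # shape (V,)
    embedding = x @ E                          # shape (d,): selects the row for "turning"

    assert np.allclose(embedding, E[vocab["turning"]])   # identical to a direct row lookup
    print(embedding)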
A Fixed-Window Neural LM
Architecture (bottom to top), for the context "and our problems turning" (window of size 4) and target word "into":
- embedding lookup for the context words
- concatenated word embeddings: x = [v1, v2, v3, v4]
- hidden layer: h = f(W1 x)
- softmax output distribution over the vocabulary (mat, table, bed, desk, chair, …): y = softmax(W2 h)

Improvements over n-gram LMs:
- Tackles the sparsity problem
- Model size is O(n), not O(exp(n)), with n being the window size

Remaining problems:
- The fixed window is too small
- Enlarging the window enlarges W, and the window can never be large enough!
- It is not deep enough to capture nuanced contextual meanings

Sun and Iyyer (2021): Revisiting Simple Neural Probabilistic Language Models
[Figure: the concatenated context embeddings are passed through a stack of N blocks (Feed-Forward layer, Add & Norm), then a linear layer and a softmax over the vocabulary.]

What Changed from N-Gram LMs to Neural LMs?
What is the source of the neural LM's strength? Why is sparsity less of an issue for neural LMs?
Answer: n-grams treat all prefixes independently of each other, even those that are semantically similar:
- students opened their ___
- pupils opened their ___
- scholars opened their ___
- undergraduates opened their ___
- students turned the pages of their ___
- students attentively perused their ___
Neural LMs are able to share information across these semantically similar prefixes and overcome the sparsity issue.
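The fixed-window model from the slides above fits into a few lines of NumPy: embedding lookup, concatenation x = [v1, v2, v3, v4], hidden layer h = f(W1 x), and output y = softmax(W2 h). The dimensions, the choice of tanh for f, the omission of bias terms and the random weights are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, H, n = 10, 8, 16, 4                  # vocab size, embedding dim, hidden dim, window size

    E  = rng.normal(size=(V, d))               # embedding matrix
    W1 = rng.normal(size=(H, n * d))           # hidden-layer weights
    W2 = rng.normal(size=(V, H))               # output-layer weights

    def softmax(z):
        z = z - z.max()                        # for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def fixed_window_lm(context_ids):
        """Probability distribution over the next word, given n context word ids."""
        x = np.concatenate([E[i] for i in context_ids])   # x = [v1, v2, v3, v4]
        h = np.tanh(W1 @ x)                               # h = f(W1 x), f assumed to be tanh
        return softmax(W2 @ h)                            # y = softmax(W2 h)

    y = fixed_window_lm([1, 4, 7, 2])          # four arbitrary context word ids
    print(y.shape, round(float(y.sum()), 6))   # (10,) 1.0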
Outline
Next part: RNN and Transformer LM

RNN Language Model
- The input is a sequence (e.g., "Do you like a good cappuccino").
- Each word is turned into a word representation by a lookup in the embedding matrix (x1 … xn).
- A recurrent layer (1x RNN layer) encodes the sequence step by step into states s1 … sn-1; the last state is a single output vector and can be part of a larger network.
- For language modeling, the last state is passed through W2 and a softmax to predict the next word (e.g., "with", "coffee", "at", "today", … ?).

What are the cons?
- While RNNs can in theory represent long sequences, they quickly forget portions of the input.
- Vanishing/exploding gradients.
- Difficult to parallelize.
What can we do?

RNN vs Transformer
[Comparison figure.]

Self-Attention: Back to the Big Picture
Attention is a powerful mechanism to create context-aware representations: a way to focus on select parts of the input. A self-attention layer maps the inputs x1 … x4 to outputs b1 … b4 and is better at maintaining long-distance dependencies in the context. [Attention Is All You Need, Vaswani et al. 2017]

Properties of Self-Attention
With n = sequence length and d = hidden dimension: quadratic complexity in n, but O(1) sequential operations (not linear as in an RNN), and efficient implementations exist. [Attention Is All You Need, Vaswani et al. 2017]

How Do We Make It Deep?
Step 1: Stack more layers! Step 2: … Step 3: Profit!

From Representations to Prediction
To perform prediction, add a classification head on top of the final layer of the transformer. This can be per token (language modeling, e.g. predicting "books") or for the entire sequence (only one token).

Transformer-based Language Modeling
Predict the next token, append it to the input, and continue like that until we reach EOS or we get tired. (Image by http://jalammar.github.io/illustrated-gpt2/)

Training a Transformer Language Model
Goal: Train a transformer for language modeling, i.e. predicting the next word.
Approach: Train it so that each position is a predictor of the next (right) token. We just shift the input by one position and use it as the labels, with an EOS special token at the end:
  X = the cat sat on the mat
  Y = cat sat on the mat <EOS>   (gold output)
For each position, compute the corresponding distribution over the whole vocabulary, then compute the loss between that distribution and the gold output label. Sum the position-wise loss values to obtain a global loss ℒ. Using this loss, do backprop and update the transformer parameters.
[Slide credit: Arman Cohan]

Well, this is not quite right 🤡 … what is the problem with this?
The model would solve the task by simply copying the next input token to the output (data leakage). We need to prevent information leakage from future tokens! How?

Attention Mask
What we have: attention scores between every pair of positions, including future ones. What we want: each position should attend only to itself and to earlier positions. The fix is an attention mask that blocks out the upper triangle (the future positions) of the score matrix; the masked scores are then passed through the softmax. Note that matrix multiplication is quite fast on GPUs.

Training a Transformer Language Model
With masking added to the transformer, training on X = "the cat sat on the mat" with gold output Y = "cat sat on the mat <EOS>" works as intended.
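A small NumPy sketch of the two ingredients just described: a causal attention mask that blocks the upper triangle of the score matrix (realized here in one common way, by adding minus infinity before the softmax), and the shifted next-token loss summed over positions. The token ids, the EOS id, and the toy scores and logits are all made up for illustration; in practice a transformer would produce them.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    # Causal attention mask: each position may attend only to itself and the past.
    T = 6                                               # sequence length ("the cat sat on the mat")
    scores = rng.normal(size=(T, T))                    # toy attention scores (query x key)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal = future positions
    attn = softmax(np.where(future, -np.inf, scores))   # upper triangle becomes exactly 0 after softmax
    print(np.round(attn, 2))

    # Shifted next-token loss: Y is X shifted by one position, with EOS appended.
    V = 8                                               # toy vocabulary size
    x = np.array([0, 3, 5, 2, 0, 6])                    # X: the cat sat on the mat
    y = np.append(x[1:], 7)                             # Y: cat sat on the mat <EOS> (EOS id 7 assumed)
    logits = rng.normal(size=(T, V))                    # per-position outputs of a (toy) transformer
    probs = softmax(logits)
    loss = -np.log(probs[np.arange(T), y]).sum()        # sum of the position-wise cross-entropy terms
    print(float(loss))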
How to Use the Model to Generate Text?
Use the output of the previous step as the input to the next step, repeatedly:
- the cat -> sat
- the cat sat -> on
- the cat sat on -> the
- the cat sat on the -> mat
- …
The probabilities get revised upon adding a new token to the input.

Lessons Learned
- Neural models overcome n-gram limitations (sparsity, fixed windows) and model long-range dependencies.
- Fixed-window models reduce sparsity but lack depth and scalability for larger contexts.
- RNNs handle sequences but struggle with vanishing gradients and parallelization.
- Transformers use self-attention for efficient, context-aware representations, excelling at long-distance dependencies.
- Masking prevents future-token leakage during training, enabling autoregressive text generation.

Next Lecture
Adaptation & Prompting
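To round off the generation slides, here is a minimal greedy decoding loop in the same spirit: feed the sequence generated so far back into the model and append the most probable next token until EOS. The next_token_probs function is only a random stand-in for a trained transformer, and the tiny vocabulary is made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["<EOS>", "the", "cat", "sat", "on", "mat"]

    def next_token_probs(token_ids):
        """Stand-in for a trained transformer LM: returns a probability
        distribution over the vocabulary for the next token."""
        logits = rng.normal(size=len(vocab))
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def generate(prompt_ids, max_len=10):
        """Greedy decoding: the distribution is recomputed each time the
        sequence grows by one token; stop at <EOS> or at max_len tokens."""
        ids = list(prompt_ids)
        while len(ids) < max_len:
            probs = next_token_probs(ids)
            next_id = int(np.argmax(probs))    # greedy choice; sampling would also work
            if vocab[next_id] == "<EOS>":
                break
            ids.append(next_id)
        return " ".join(vocab[i] for i in ids)

    print(generate([vocab.index("the"), vocab.index("cat")]))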