Neural Language Modeling Lecture Notes PDF
Document Details

Technische Universität Darmstadt
2024
Dr. Thomas Arnold
Summary
This document contains lecture slides on Neural Language Modeling from the course "NLP and the Web", taught at Technische Universität Darmstadt in the WS 2024/2025 term. The slides cover reinforcement learning from human feedback, long-context Transformers, and retrieval-augmented language models.
Full Transcript
NLP and the Web – WS 2024/2025
Lecture 12: Neural Language Modeling 4
Dr. Thomas Arnold, Hovhannes Tamoyan, Kexin Wang
Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt

Syllabus (tentative)
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation
05 IR – Transformer/BERT
06 IR – Dense Retrieval
07 IR – Neural Re-Ranking
08 LLM – Language Modeling Foundations, Tokenization
09 LLM – Neural LLM
10 LLM – Adaptation
11 LLM – Prompting, Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam

Outline: Recap – Reinforcement Learning – Long Context – Retrieval-based LMs

In-Context Learning

Language Modeling != Following Human Instructions
▪ There is a mismatch between LLM pre-training and user intents.

Instruction-Tuning [Weller et al. 2020; Mishra et al. 2021; Wang et al. 2022; Sanh et al. 2022; Wei et al. 2022; Chung et al. 2022; many others]
1. Collect examples of (instruction, output) pairs across many tasks and finetune an LM
2. Evaluate on unseen tasks

Outline: Recap – Reinforcement Learning – Long Context – Retrieval-based LMs

Why Reinforcement Learning?
▪ Remember the limits of instruction-tuning?
  1. Difficult to collect diverse labeled data
  2. Rote learning (token by token): limited creativity
  3. Agnostic to the model's knowledge: may encourage hallucinations
▪ Limited/sparse feedback is usually considered a curse, but here it becomes a blessing: "don't give a man a fish; rather, teach him how to fish by himself."
▪ The model itself should be involved in the alignment loop.

Reinforcement Learning: Intuition
▪ Action here: generating responses/tokens
▪ Reward here: whether humans liked the generation (the sequence of actions = tokens) [figure credit]

Intuition
▪ Task: choose the better next message in a conversation
▪ Scoring interface: Likert scale or rankings
▪ A human has a conversation with the LLM, the LLM provides two options for the next response, and the human rates the better response.

Reinforcement Learning: Abridged History
▪ The field of reinforcement learning (RL) has studied these (and related) problems for many years now [Williams, 1992; Sutton and Barto, 1998]
▪ Circa 2013: resurgence of interest in RL applied to deep learning and game-playing [Mnih et al., 2013]
▪ But there is a renewed interest in applying RL. Why?
▪ RL with LMs has commonly been viewed as very hard to get right (it still is!)
▪ We have found successful RL variants that work for language models (e.g., PPO [Schulman et al., 2017])
Reinforcement Learning: Formalism
▪ An agent interacts with an environment by taking actions.
▪ The environment returns a reward for the action and a new state (a representation of the world at that moment).
▪ The agent uses a policy function to choose an action at a given state.
▪ We need to figure out: (1) the reward function and (2) the policy function.

Reinforcement Learning from Human Feedback [Slide credit: Jesse Mu]
▪ Imagine a reward function: R(s; prompt) ∈ ℝ for any output s to a prompt.
▪ The reward is higher when humans prefer the output.
▪ Good generation is equivalent to finding reward-maximizing outputs: maximize the expected reward 𝔼_{ŝ∼p_θ}[R(ŝ; prompt)], where p_θ(s) is a pre-trained model with parameters θ that we would like to optimize (the policy, i.e. the generative model), and the expectation is taken over samples ŝ drawn from that policy.
▪ On the notation: 𝔼 here is an empirical expectation (i.e., an average), and ∼ indicates sampling from a given distribution.
▪ What we need to do:
  (1) Estimate the reward function R(s; prompt).
  (2) Find the best generative model p_θ that maximizes the expected reward: θ̂ = argmax_θ 𝔼_{ŝ∼p_θ}[R(ŝ; prompt)]

Estimating the Reward R
▪ Obviously, we don't want to use human feedback directly for every output, since that could be expensive. Alternatively, we can build a model to mimic their preferences [Knox and Stone, 2009].
▪ Approach 1: get humans to provide absolute scores for each output.
  Challenge: human judgments on different instances and by different people can be noisy and mis-calibrated!
  Example, prompt: "Explain 'space elevators' to a 6-year-old." The LM samples s₁, s₂ ∼ p_θ:
  s₁ = "It is like any typical elevator, but it goes to space. …" → 0.8
  s₂ = "Explain gravity to a 6-year-old. …" → 1.2
▪ Approach 2: ask for pairwise comparisons [Phelps et al. 2015; Clark et al. 2018].
  Pairwise comparison of multiple outputs can be more reliable; the Bradley-Terry paired comparison model turns such comparisons into scalar scores.
  Example: same prompt, the LM samples s₁, s₂ ∼ p_θ, and the annotator only indicates which of the two is better (see the sketch below).
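To make the Bradley-Terry idea concrete, here is a minimal sketch (not from the slides) of how a reward model can be trained from pairwise human preferences. The ToyRewardModel backbone, the vocabulary size, and the tensor shapes are illustrative assumptions; in practice the backbone would be a pre-trained LM with a scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Hypothetical stand-in for an LM backbone with a scalar reward head."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, input_ids):               # input_ids: (batch, seq_len)
        h = self.emb(input_ids).mean(dim=1)     # mean-pool token embeddings
        return self.head(h).squeeze(-1)         # (batch,) scalar reward per output

def bradley_terry_loss(r_chosen, r_rejected):
    # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    # minimize the negative log-likelihood of the observed human preferences.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

reward_model = ToyRewardModel()
chosen = torch.randint(0, 1000, (4, 16))       # token ids of human-preferred outputs
rejected = torch.randint(0, 1000, (4, 16))     # token ids of dispreferred outputs
loss = bradley_terry_loss(reward_model(chosen), reward_model(rejected))
loss.backward()                                # gradients for one reward-model update
```

Training on many such pairs yields the scalar reward R(s; prompt) that the RL step below optimizes against.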
Scaling Reward Models
▪ Large enough reward models trained on large enough data approach human performance. [Stiennon et al., 2020]

Putting it Together
▪ First collect a dataset of human preferences: present multiple outputs to human annotators and ask them to rank the outputs based on preferability.
  [Figure: a policy LM produces outputs for a prompt X; human annotators mark their preferences, e.g. Output 1 ✓, Output 2 ✘.]

Putting it Together (2)
▪ Using this data, we can train a reward model.
▪ The reward model returns a scalar reward which should numerically represent the human preference.

Putting it Together (3)
▪ We want to learn a policy (a language model) that optimizes against the reward model.
  [Figure: prompt X → policy LM → output → reward model R → reinforcement learning update.]

Putting it Together (4)
▪ Periodically re-train the reward model with more samples and human feedback.

One Missing Ingredient
▪ It turns out that this approach doesn't quite work: the policy will learn to "cheat".
▪ It will learn to produce outputs that get a high reward but might be gibberish or irrelevant to the prompt.
▪ Note that, since R is trained on natural inputs, it may not generalize to unnatural inputs.

Regularizing with the Pre-trained Model
▪ Solution: add a penalty term that penalizes too large a deviation from the distribution of the pre-trained LM:
  R̂(s; prompt) := R(s; prompt) − β log( p^RL(s) / p^PT(s) )
  i.e., the policy pays a price whenever p^RL(s) > p^PT(s).
▪ This prevents the policy model from diverging too far from the pre-trained model (see the sketch below).
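A minimal sketch, under stated assumptions, of the penalized reward from the slide above: the reward-model score is combined with the log-ratio between the policy and the frozen pre-trained model. The function name, beta value, and placeholder tensors are illustrative, not part of the lecture.

```python
import torch

def penalized_reward(reward, logp_policy, logp_pretrained, beta=0.1):
    """reward: (batch,) reward-model scores R(s; prompt) for sampled outputs s.
    logp_policy / logp_pretrained: (batch, seq_len) log-probabilities of the
    sampled tokens under the current policy p_RL and the frozen pre-trained p_PT."""
    # log p_RL(s) - log p_PT(s), summed over the tokens of each sampled output
    log_ratio = (logp_policy - logp_pretrained).sum(dim=-1)
    return reward - beta * log_ratio   # the policy pays a price when p_RL(s) > p_PT(s)

# Toy usage with placeholder numbers standing in for real model outputs
reward = torch.tensor([1.3, 0.2])
logp_policy = -torch.rand(2, 8)        # fake per-token log-probs under the policy
logp_pretrained = -torch.rand(2, 8)    # fake per-token log-probs under the pre-trained LM
print(penalized_reward(reward, logp_policy, logp_pretrained))
```

This penalized reward is what the RL update (e.g., PPO) actually maximizes, which keeps the policy close to fluent, natural text.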
The Overall Recipe
▪ Pretrain → Align (instruction-tune) → Align (RLHF: Reinforcement Learning from Human Feedback)

RLHF Gains over Instruction-Tuning [Stiennon et al., 2020]

GPT3 vs. InstructGPT3 (RLHF-ed)

Can Help with Toxicity and Truthfulness
▪ Note: the reward model can be used to induce any desired behavior as needed:
  ▪ avoiding bias
  ▪ avoiding responses outside its scope
  ▪ avoiding toxicity
  ▪ …
▪ [Figure: evaluation results; "higher is better" for one metric, "lower is better" for the other.]

Summary Thus Far
▪ Reinforcement learning can help mitigate some of the problems with supervised instruction tuning.
▪ The reward model is trained via ranking feedback from humans.
▪ Regularization restrains the model from deviating too far from the pre-trained policy.
▪ Limitations:
  ▪ RL can be tricky to get right.
  ▪ Training a good reward model may require a lot of annotations.

Outline: Recap – Reinforcement Learning – Long Context – Retrieval-based LMs

Feeding Lots of Things to the LM
▪ Books, scientific articles, government reports, videos, your daily experience, etc.: they are all much longer than 2k tokens!
▪ How do you enable language models to process massive amounts of data?
▪ One approach: just scale up your model and train it with a much longer context window.
▪ The bottleneck: memory usage and the number of operations in self-attention grow quadratically with input length.

Transformer LMs and Long Inputs
▪ Length generalization: do Transformers work accurately on long inputs?
▪ Efficiency considerations: how efficient are LMs on long inputs?
▪ [Figure reference: Tay et al., 2020]

Length Generalization

Efficiency Consideration: Sparse Attention Patterns
▪ The idea is to make the attention operation sparse. [NAACL 2021 Tutorial: Beltagy, Cohan, Hajishirzi, Min, and Peters]

Pre-specified Sparsity Patterns
▪ A variety of patterns has been explored in past work, e.g. Longformer (Beltagy et al., 2020), Sparse Transformer (Child et al., 2019), …
▪ [Figure: pattern types include sliding window, dilated, global, blocked, and random attention, as used in models such as Sparse Transformer, Longformer, Big Bird, and Sinkhorn.]
▪ Different layers and attention heads can follow different patterns; a common setup is to give earlier layers sparser attention patterns. [Longformer, Beltagy et al., 2020]

Pre-specified Sparsity Patterns: Computations [Longformer (Beltagy et al., 2020)]

A Notable Adoption: GPT-3
▪ Sparse patterns were also used in GPT-3 (Brown et al., 2020). A sliding-window attention mask is sketched below.
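As a concrete illustration of the sliding-window pattern (an assumption of this note, not code from the slides): each token attends only to tokens within a fixed window, so the attention cost grows linearly with sequence length instead of quadratically.

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where |i - j| <= window, i.e. token i is allowed to attend to token j
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.int())                      # banded matrix with O(seq_len * window) allowed entries

# Inside an attention layer, disallowed positions are set to -inf before the softmax:
scores = torch.randn(8, 8)             # query-key similarity scores
scores = scores.masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1)          # each row now only mixes a local window of tokens
```

Dilated, global, blocked, and random patterns from the slide are built the same way, just with different boolean masks.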
"Learning Internal Representations by Error Propagation" (with D. E. Rumelhart and R. J. Williams) - This paper, published in 1986,.. 2. "Deep Boltzmann Machines" (with R. Salakhutdinov) - Published in 2009,.. … 4. "Deep Learning" (with Y. Bengio and A. Courville) - Published as a book in 2016,… 5. "Attention Is All You Need" (with V. Vaswani, N. Shazeer, et al.) - Published in 2017, this paper introduced the Transformer model,… WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 50 Why Retrieval-based LMs? LLMs’ knowledge is easily outdated and hard to update Who is the CEO of Twitter? As of my knowledge cutoff in September 2021, the CEO of Twitter is Jack Dorsey…. WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 51 Why Retrieval-based LMs? LLMs’ output is challenging to interpret and verify WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 52 Why Retrieval-based LMs? LLMs are *large* and expensive to train and run LM vs. Long-term goal: can we possibly reduce the training and inference costs, and scale down the size of LLMs? e.g., RETRO (Borgeaud et al., 2021): “obtains comparable performance to GPT-3 on the Pile, despite using 25x fewer parameters” WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 53 What are the Key Design Questions? What are your memories? ○ Documents, database records, training examples, etc. How to retrieve memories? ○ Use an off-the-shelf search engine (e.g. Google, StackOverflow). ○ How to train your own memory retriever. How to use retrieved memories? ○ "Text fusion" ○ Common failure modes: Underutilization: model ignores retrieved memories. Overreliance: model depends too much on memories! WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 54 Anatomy of a Neural Retreiver Remember our IR lectures? (especially neural retrieval and approximate NN) Prompt How to make a good cappuccino Re-Ranked Nearest Neural Re-Ranking BERTDOT Full text Model documents Neighbor Closest Index storage (Top 10) Documents (Top 1000) First Stage Retriever Second Stage Re-Ranker WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 55 Retrieval-Augmented LM ▪ x = World Cup 2022 was the last before the increase to [MASK] in the 2026 tournament. WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 56 Retrieval-Augmented LM FIFA World Cup 2026 will expand to 48 teams. Encoder In 2022, the 32 national teams involved Encoder in the tournament. Team USA celebrated z = Encoder(z) after winning its match Encoder against Iran … Wikipedia 13M chunks (passages) (called documents in the paper) 7 WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 57 Retrieval-Augmented LM x = World Cup 2022 was … the increase to [MASK] in 2026. FIFA World Cup 2026 will expand to 48 teams. Encoder In 2022, the 32 national teams involved Encoder in the tournament. Team USA celebrated z = Encoder(z) after winning its match Encoder against Iran … Wikipedia 13M chunks (passages) (called documents in the paper) 7 WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 58 Retrieval-Augmented LM x = World Cup 2022 was … the increase to [MASK] in 2026. Encoder FIFA World Cup 2026 will expand to 48 teams. Encoder In 2022, the 32 national teams involved Encoder in the tournament. Team USA celebrated z = Encoder(z) after winning its match Encoder x = Encoder(x) against Iran … Wikipedia 13M chunks (passages) (called documents in the paper) 7 WS24/25 | Computer Science Department | UKP - Dr. 
Retrieval-Augmented LM: Common Variant / Example Variation [Slides: Akari Asai, Sewon Min, Zexuan Zhong, Danqi Chen]
▪ What to retrieve? Chunks, tokens, or others.
▪ How to use retrieval? At the input layer, in intermediate layers, or at the output layer.
▪ When to retrieve? Once, every n tokens (n > 1), or every token.

IR in the Middle of the LM [Borgeaud et al. 2021, "Improving language models by retrieving from trillions of tokens"]
▪ x = "World Cup 2022 was the last with 32 teams, before the increase to" is split into chunks 1, 2, 3.
▪ For each split i, retrieval over the index returns k chunks of text p₁ⁱ … p_kⁱ.
▪ The retrieved chunks are passed through an encoder to produce embeddings E₁, E₂, E₃ that the LM conditions on.

Regular Decoder
▪ [Figure: embedding layer (EMB), Transformer blocks (×L) of self-attention (ATTN) and feed-forward (FFN) layers, and an output head (HEAD).]

Regular Decoder with IR
▪ [Figure: RETRO blocks (×L); each block contains self-attention (ATTN), chunked cross-attention (CCA) over the retrieved-neighbor embeddings E₁, E₂, E₃, and a feed-forward layer (FFN), again between EMB and HEAD.]
▪ A simplified sketch of such a block follows below.
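A simplified sketch of a decoder block with an extra cross-attention slot over retrieved-neighbor embeddings. This is an assumption-level illustration: RETRO's actual chunked cross-attention aligns each input chunk with its own neighbours and uses relative positions, which are omitted here; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class RetroStyleBlock(nn.Module):
    """One decoder block: masked self-attention, cross-attention to retrieved
    neighbour embeddings (the simplified 'CCA' slot), and a feed-forward layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, neighbours):
        # x: (batch, seq, dim) decoder hidden states
        # neighbours: (batch, n_neighbour_tokens, dim) encoded retrieved chunks (E in the slides)
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)  # mask future tokens
        h, _ = self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x), attn_mask=causal)
        x = x + h
        h, _ = self.cross_attn(self.norm2(x), neighbours, neighbours)  # attend to the neighbours
        x = x + h
        return x + self.ffn(self.norm3(x))

block = RetroStyleBlock()
hidden = torch.randn(2, 10, 256)       # hidden states for 2 sequences of 10 tokens
retrieved = torch.randn(2, 32, 256)    # embeddings of retrieved chunks
out = block(hidden, retrieved)         # (2, 10, 256)
```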
Training: End-to-End
▪ There are various ideas in the literature for how to train these models efficiently and in an end-to-end fashion.

Main Takeaways
▪ How do we enable LMs to utilize external knowledge? Retrieval-augmented language models.
▪ A retriever is a function f(input, memory) → score.
▪ What we did not discuss:
  ▪ Attribution: tracing decisions to the source knowledge
  ▪ How to modify the knowledge
  ▪ Conflicting knowledge
  ▪ Editing knowledge
  ▪ More efficient scaling
  ▪ …