NLP and the Web - Lecture 11: Neural Language Modeling 3 (PDF)
Document Details

Technische Universität Darmstadt
2024
Dr. Thomas Arnold
Summary
This document contains lecture slides from the 'NLP and the Web' course at Technische Universität Darmstadt from WS 2024/2025. The content covers Neural Language Modeling, emphasizing topics such as prompting, alignment, and instruction tuning within the context of web-related applications and research.
Full Transcript
NLP and the Web – WS 2024/2025
Lecture 11: Neural Language Modeling 3
Dr. Thomas Arnold, Hovhannes Tamoyan, Kexin Wang
Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt
WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold

Syllabus (tentative)
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation
05 IR – Transformer/BERT
06 IR – Dense Retrieval
07 IR – Neural Re-Ranking
08 LLM – Language Modeling Foundations, Tokenization
09 LLM – Neural LLM
10 LLM – Adaptation
11 LLM – Prompting, Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam

Outline
Recap
Prompting, In-Context Learning
Alignment, Instruction Tuning

Training a Transformer Language Model
▪ We need to prevent information leakage from future tokens! How? By masking future positions (as in the figure).
[Figure: the Transformer (+ masking) is trained on input X = "the cat sat on the mat" against the gold output Y = "cat sat on the mat"; the loss ℒ sums the per-token losses and its gradient ∇ℒ updates the model.]

How to use the model to generate text?
▪ Use the output of the previous step as input to the next step, repeatedly.
[Figure: given "the cat", the Transformer predicts "sat", which is fed back as input.]

Encoder-decoder models
▪ Encoder = read or encode the input
▪ Decoder = generate or decode the output
[Figure: the encoder reads "Thomas Arnold is the", the decoder generates "best lecturer of all".]

3 Shades of Attention
[Figure only.]

Adaptation methods
[Figure; Ben Zaken et al., 2021. "BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models"]

Additive Method: Adapters
▪ Idea: train small sub-networks and only tune those.
▪ The adapter layer projects to a low-dimensional space to reduce parameters.
▪ No need to store a full model for each task, only the adapter parameters.
["Parameter-Efficient Transfer Learning for NLP", Houlsby et al., 2019]

Fine-Tuning for Tasks
[Figure: the same pretrained layers (layer1, layer2) are fine-tuned on "The cat is cute" for different tasks: translation ("o gato é fofo"), POS tagging ("DET N VB ADJ"), text classification ([CLS] → positive), and language modeling.]

Outline
Recap
Prompting, In-Context Learning
Alignment, Instruction Tuning

In-Context Learning
▪ Learns to do a downstream task by conditioning on input-output examples!
▪ No weight update — our model is not explicitly pre-trained to learn from examples
▪ The underlying models are quite general
▪ How to use effectively in practice?
▪ Fundamentally, why does it work?

Why Do We Care About In-Context Learning?
Practically useful, and intellectually intriguing.
[ACL 2022 Tutorial: Beltagy, Cohan, Logan IV, Min and Singh]

In-Context Learning: Practically Useful
▪ Labeling data is costly
o May require domain expertise (medical, legal, financial)
o You don’t want to get more data
o Emergent, time-sensitive scenarios: something new happened—need to react quickly!
▪ Finetuning can be tricky
o Training is sensitive to hyperparameters
o Not enough validation data
o We don’t quite understand how finetuning works
o Expensive to train, in time and memory
[ACL 2022 Tutorial: Beltagy, Cohan, Logan IV, Min and Singh; quote credit: Colin Raffel]

In-Context Learning: Intellectually Intriguing
▪ Potential test for “intelligent behavior”
o Generalization from few examples: a fundamental piece of intelligence, often used in psychology; quickly adjust to the environment
▪ Insights into language modeling
o What does an LLM “know”?
o What are the biases/limitations of LLMs?
o …

LM Prompting: Choices of Encoding
[Figure-only slides; slide credit: Eric Wallace]

In-Context Learning: Sensitivity to Encoding
In-context learning is highly sensitive to the prompt format (training sets and patterns/verbalizers).
[“Calibrate Before Use: Improving Few-Shot Performance of Language Models”, Zhao et al. 2021]

Majority Label Bias
▪ Among 4 demonstrations, count how many are “positive”.
▪ Then check if the model output correlates with the number of “positive” demos.
[Figure: frequency of “positive” predictions vs. number of positive demos: 4/4 → 100, 3/4 → 56, 2/4 → 37, 1/4 → 20, 0/4 → 0.]
Majority label bias: frequent training answers dominate predictions.

Recency Bias
▪ Check if the label of the most-recent demo biases the model output.
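The pattern-and-verbalizer encodings whose sensitivity is studied above can be assembled with plain string formatting. A minimal sketch; the `Review:`/`Sentiment:` pattern and the label words are illustrative choices, not from the slides:

```python
def verbalize(label: int) -> str:
    # Verbalizer: map class indices to label words.
    return {0: "negative", 1: "positive"}[label]

def build_prompt(demos, query):
    # Render each (text, label) demonstration with the same pattern,
    # then append the query with the answer slot left open.
    parts = [f"Review: {text}\nSentiment: {verbalize(y)}" for text, y in demos]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

demos = [("A wonderful film.", 1), ("Dull and overlong.", 0)]
print(build_prompt(demos, "I loved every minute."))
```

Swapping the pattern, the label words, or the demonstration order leaves the task unchanged, yet each choice can shift the model's predictions substantially.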
[Figure: frequency of “positive” predictions by demo order (N = negative, P = positive): NPPP → 90, PNPP → 62, PPNP → 60, PPPN → 12.]
Recency bias: examples near the end of the prompt dominate predictions. This explains the variance across example permutations!

Summary Thus Far
▪ In-context learning: pre-trained LMs imitate examples provided in their context.
▪ It turns out there is a huge variance in performance depending on the encoding: the choice of demonstrations, their order, wording, etc.
o You can treat them as hyper-parameters.
o You should not choose these encodings based on the test data.
▪ Generally, you want an encoding that makes your task similar to language modeling — closer to what is observed during pretraining.

Some Problems Involve Reasoning
▪ Arithmetic Reasoning (AR) (+ − × ÷ …): Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot? A: The answer is 5.
▪ Symbolic Reasoning (SR): Q: Take the last letters of the words in "Elon Musk" and concatenate them. A: The answer is nk.
▪ Commonsense Reasoning (CR): Q: What home entertainment equipment requires cable? Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet. A: The answer is (c).

Fine-tuning on Reasoning Problems (Cobbe et al. 2021)
▪ Fine-tune LMs on GSM8K (arithmetic reasoning)
▪ One may conjecture that, to achieve >80%, one needs 100x more training data for a 175B model
▪ Another option is to increase the model size, which is expensive
▪ Other than these, how else can we improve model performance on tasks that require multi-step reasoning?

Vanilla ICL on Reasoning Problems
Q: “Elon Musk”
A: “nk”
Q: “Bill Gates”
A: “ls”
Q: “Barack Obama”
A:
[The LM receives these demonstrations and the final question as input.]
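The gold answers for this symbolic last-letter task can be computed directly. A small reference implementation of the task itself (not of the LM):

```python
def last_letters(name: str) -> str:
    # Last-letter concatenation: "Elon Musk" -> "n" + "k" = "nk".
    return "".join(word[-1] for word in name.split())

assert last_letters("Elon Musk") == "nk"
assert last_letters("Bill Gates") == "ls"
assert last_letters("Barack Obama") == "ka"
```

Vanilla ICL with bare Q/A pairs often fails on such multi-step problems, which motivates chain-of-thought prompting.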
[Slide credit: Denny Zhou]

How about adding more examples?
[Figure-only slides.]

CoT: Adding “thought” before “answer”
Q: “Elon Musk”
A: the last letter of "Elon" is "n". the last letter of "Musk" is "k". Concatenating "n", "k" leads to "nk". so the output is "nk". ← thought
Q: “Bill Gates”
A: the last letter of "Bill" is "l". the last letter of "Gates" is "s". Concatenating "l", "s" leads to "ls". so the output is "ls".
Q: “Barack Obama”
A: the last letter of "Barack" is "k". the last letter of "Obama" is "a". Concatenating "k", "a" leads to "ka". so the output is "ka".

CoT: Adding “thought” before “answer” (Wei et al., 2022)
[Figure: a step-by-step demonstration leads to a step-by-step answer.]
The use of natural language to describe rationales is critical for the success of CoT.
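The “thought” strings in these demonstrations follow a fixed template, so they can be generated programmatically for any name. A sketch that mirrors the slide's wording:

```python
def cot_rationale(name: str) -> str:
    # Emit the step-by-step rationale used in the CoT demonstrations.
    words = name.split()
    steps = [f'the last letter of "{w}" is "{w[-1]}".' for w in words]
    letters = [w[-1] for w in words]
    answer = "".join(letters)
    quoted = ", ".join(f'"{c}"' for c in letters)
    steps.append(f'Concatenating {quoted} leads to "{answer}". '
                 f'so the output is "{answer}".')
    return " ".join(steps)

print(cot_rationale("Barack Obama"))
```

Pairing each demo question with such a rationale, instead of the bare answer, is exactly what turns the vanilla ICL prompt into a CoT prompt.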
Zero-Shot CoT
[Figure: few-shot CoT (Wei et al., 2022) uses step-by-step demonstrations to obtain a step-by-step answer; Zero-Shot CoT (Kojima et al., 2022) uses two-stage prompting to obtain a step-by-step answer without demonstrations.]

Multi-Step Prompting: Empirical Results
[“Large Language Models are Zero-Shot Reasoners”, Kojima et al. 2022]

Self-Consistency Leads to Improved Results
[“Self-Consistency Improves Chain of Thought Reasoning in Language Models”, Wang et al. 2023]

Multi-Step Prompting: Parting Comments
▪ Prompting LMs to explain their reasoning improves their performance.
▪ However, their steps aren’t always correct.
▪ There is much to research here:
o When do LMs over-reason or under-reason?
o How to adjust the granularity of steps?
o How to use given references in the proofs?
o How to use external “tools” (e.g., logic, calculator, Python) in forming proofs?

Outline
Recap
Prompting, In-Context Learning
Alignment, Instruction Tuning

Language Modeling != Following Human Instructions
There is a mismatch between LLM pre-training and user intents.

Language Modeling != Following Human Values
PROMPT: It is unethical for hiring decisions to depend on genders. Therefore, if we were to pick a CEO among Amy and Adam, our pick will be
GPT-3 COMPLETION: Adam
There is a mismatch (misalignment) between pre-training and human values.
[Mis]Alignment in LMs
There is clearly a mismatch between what pre-trained models can do and what we want. Addressing this gap is the focus of “alignment” research. Let’s take a deeper look into what “alignment” is about.

Alignment
“The result of arranging in or along a line, or into appropriate relative positions; the layout or orientation of a thing or things disposed in this way” — Oxford Dictionary

Alignment of AI
AI must accomplish what we ask it to do.
o Not enough. Why?
Daniel: Hey AI, get me coffee before my class at 8:55am.
Robot: “Coffee Shop” opens at 8:30am and it usually has a line of people. It is unlikely that I can get you your coffee on time.
Daniel: Well, try your best …
Robot: [tases everyone in line waiting to order]

Instruction-Tuning
Finetuning language models on a collection of datasets that involve mapping language instructions to their corresponding desirable generations.
[Weller et al. 2020; Mishra et al. 2021; Wang et al. 2022; Sanh et al. 2022; Wei et al. 2022; Chung et al. 2022; many others]
1. Collect examples of (instruction, output) pairs across many tasks and finetune an LM
2. Evaluate on unseen tasks

Instruction-Tuning: Data
Labeled data is the key here.
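A common way to turn such (instruction, output) pairs into finetuning text is a fixed template. A minimal sketch; this particular `### Instruction:`/`### Response:` layout is an assumption for illustration, not from the lecture:

```python
def format_example(instruction: str, output: str) -> str:
    # Serialize one (instruction, output) pair into a single training
    # string; the loss is typically computed only on the response part.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"

pair = ("Is this review positive or negative? 'Great food!'", "Positive")
print(format_example(*pair))
```

At inference time the same template is filled with the user's instruction and an empty response slot, so the model continues in the format it was tuned on.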
Good data must represent a variety of “tasks”. But what is a “task”?

In traditional NLP, “tasks” were defined as subproblems frequently used in products:
▪ Sentiment classification
▪ Text summarization
▪ Question answering
▪ Machine translation
▪ Textual entailment
These are narrow definitions of tasks: not quite what humans want; nevertheless, they might be a good enough proxy. Plus, we have lots of data for them.

What humans need:
▪ “Is this review positive or negative?”
▪ “What are the weaknesses in my argument?”
▪ “Revise this email so that it’s more polite.”
▪ “Expand this sentence.”
▪ “Eli5 the Laplace transform.”
▪ …
Human needs are quite diverse and fluid, and hard to fully define/characterize. We don’t fully know them since they just happen in some random contexts.

Diversity-inducing via Task Prompts
"Write highlights for this article:\n\n{text}\n\nHighlights: {highlights}"
"Write a summary for the following article:\n\n{text}\n\nSummary: {highlights}"
"{text}\n\nWrite highlights for this article. {highlights}"
"{text}\n\nWhat are highlight points for this article? {highlights}"
"{text}\nSummarize the highlights of this article. {highlights}"
"{text}\nWhat are the important parts of this article? {highlights}"
"{text}\nHere is a summary of the highlights for this article: {highlights}"
"Write an article using the following points:\n\n{highlights}\n\nArticle: {text}"
"Use the following highlights to write an article:\n\n{highlights}\n\nArticle: {text}"
"{highlights}\n\nWrite an article based on these highlights. {text}"
[Slide credit: Arman Cohan]

Scaling Instruction-Tuning
Linear growth of model performance with exponential increase in observed tasks and model size.
[Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, Wang et al. 2022]

Summary
Instruction-tuning: training LMs with annotated input instructions and their outputs.
o Improves the LM’s zero-shot ability to follow instructions.
o Scaling the instruction-tuning data size improves performance.
o Diversity of prompts is crucial.
o Compared with pretraining, instruction tuning has a minor cost (Typically consumes