NLP and the Web - Lecture 10: Neural Language Modeling 2 - WS 2024/2025
Technische Universität Darmstadt
2024
Dr. Thomas Arnold
Summary
This document presents lecture slides for NLP and the Web, focusing on neural language modeling. It covers Transformer language models, adaptation, and parameter-efficient fine-tuning, presenting concepts in natural language processing (NLP) such as large language models (LLMs) and neural networks.
Full Transcript
NLP and the Web – WS 2024/2025
Lecture 10: Neural Language Modeling 2
Dr. Thomas Arnold, Hovhannes Tamoyan, Kexin Wang
Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt

Syllabus (tentative)
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation
05 IR – Transformer/BERT
06 IR – Dense Retrieval
07 IR – Neural Re-Ranking
08 LLM – Language Modeling Foundations, Tokenization
09 LLM – Neural LLM
10 LLM – Adaptation
11 LLM – Prompting, Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam

Outline
▪ Transformer LM (cont.)
▪ Adaptation
▪ Lecture Evaluation, Quiz

A Fixed-Window Neural LM
[Figure: the context words in a window of size 4 ("and our problems turning") are mapped to word embeddings by an embedding lookup and concatenated into x = [v1, v2, v3, v4]; a hidden layer computes h = f(W1 x); the output distribution over the target word is y = softmax(W2 h), shown as probabilities over vocabulary items such as "mat", "table", "bed", "desk", "chair".]
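To make the two equations on this slide concrete, here is a minimal PyTorch sketch of a fixed-window neural LM. The vocabulary size, embedding dimension, hidden dimension, and the choice of tanh for f are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Fixed-window neural LM: h = f(W1 x), y = softmax(W2 h)."""
    def __init__(self, vocab_size=10_000, emb_dim=64, hidden_dim=128, window=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # embedding lookup
        self.W1 = nn.Linear(window * emb_dim, hidden_dim)   # concatenated embeddings -> hidden layer
        self.W2 = nn.Linear(hidden_dim, vocab_size)         # hidden layer -> vocabulary logits

    def forward(self, context_ids):                         # context_ids: (batch, window)
        x = self.emb(context_ids).flatten(start_dim=1)      # concatenate the window's word embeddings
        h = torch.tanh(self.W1(x))                          # h = f(W1 x), with f = tanh here
        return torch.softmax(self.W2(h), dim=-1)            # y = softmax(W2 h)

# Example: predict the word following a window of 4 context tokens.
model = FixedWindowLM()
context = torch.randint(0, 10_000, (1, 4))                  # e.g. token ids for "and our problems turning"
probs = model(context)                                       # distribution over the target word
print(probs.shape)                                           # torch.Size([1, 10000])
```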
RNN Language Model
▪ Sequence as the input, a single vector (the last state) as the output
▪ Word representation: lookup in the embedding matrix
▪ Recurrent sequence encoding (1x RNN layer); can be part of a larger network
[Figure: the input sequence "Do you like a good cappuccino" is embedded (x1, ..., xn), encoded by the RNN into states s1, ..., sn-1, and the last state is fed through W2 and a softmax over the next word (candidates shown include "today", "coffee", "at").]

Training a Transformer Language Model
▪ We need to prevent information leakage from future tokens! How? With masking.
[Figure: the input X = "the cat sat on the mat" is processed by the Transformer with masking; the gold output Y = "cat sat on the mat" is the input shifted by one position, the per-position losses are summed into ℒ, and the gradient ∇ℒ is backpropagated.]

How to use the model to generate text?
▪ Use the output of the previous step as input to the next step, repeatedly.
[Figure: given "the cat", the Transformer predicts "sat", which is appended to the input for the next step.]

Encoder-decoder models
▪ Encoder = read or encode the input
▪ Decoder = generate or decode the output
[Figure: the encoder reads "The cat is cute"; the decoder, given "Le chat est" generated so far, produces "Le chat est mignon".]

Transformer [Vaswani et al. 2017]
▪ An encoder-decoder architecture built with attention modules. [Attention Is All You Need, Vaswani et al. 2017]
▪ The computation of the encoder attends to both sides (bidirectional self-attention).
▪ At any step of the decoder, it attends to the previous computation of the encoder (cross-attention).
▪ At any step of the decoder, it attends to the decoder's previous generations (masked self-attention).

3 Shades of Attention
[Figure: the three attention patterns in the Transformer: encoder self-attention, masked decoder self-attention, and encoder-decoder (cross-)attention.]
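A minimal sketch of these three attention patterns using PyTorch's nn.MultiheadAttention. The dimensions and example lengths are illustrative assumptions, and a single attention module is reused here only for brevity; a real Transformer uses separate modules per role plus residual connections, layer norms, and feed-forward blocks.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

src = torch.randn(1, 5, d_model)   # encoder input states (e.g. "The cat is cute")
tgt = torch.randn(1, 3, d_model)   # decoder states generated so far (e.g. "Le chat est")

# 1) Encoder self-attention: attends to both sides, no mask.
enc_out, _ = attn(src, src, src)

# 2) Masked decoder self-attention: each position may only attend to itself
#    and earlier positions (True entries in the mask are blocked).
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
dec_self, _ = attn(tgt, tgt, tgt, attn_mask=causal_mask)

# 3) Cross-attention: decoder queries attend to the encoder's computation.
cross, _ = attn(tgt, enc_out, enc_out)

print(enc_out.shape, dec_self.shape, cross.shape)
```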
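The masking from the "Training a Transformer Language Model" slide above can be made concrete with a small sketch: the gold output is the input shifted by one position, and a causal mask prevents each position from attending to future tokens. The toy model, vocabulary, and token ids below are illustrative assumptions, not code from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 100, 32
emb = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
block = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

# Toy token ids standing in for "the cat sat on the mat".
x = torch.tensor([[5, 6, 7, 8, 5, 9]])
inputs, targets = x[:, :-1], x[:, 1:]        # X = "the cat sat on the", Y = "cat sat on the mat"

# Causal mask: position i may not attend to positions j > i (no leakage from the future).
T = inputs.size(1)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

h = block(emb(inputs), mask=causal)          # masked self-attention over the prefix
logits = lm_head(h)                          # one next-token distribution per position

# L = sum of the per-position cross-entropy losses; backprop gives the gradient of L.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="sum")
loss.backward()
print(loss.item())
```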
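Likewise, the loop from "How to use the model to generate text?" simply feeds the model's previous output back in as input. Below is a minimal greedy-decoding sketch with an untrained toy model (so the tokens it produces are meaningless); greedy argmax decoding is just one choice, sampling strategies are another.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy decoder-only LM, used only to illustrate the generation loop."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                       # ids: (batch, seq_len)
        T = ids.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.lm_head(self.block(self.emb(ids), mask=causal))

model = TinyLM()
ids = torch.tensor([[5, 6]])                      # prompt, e.g. "the cat"
for _ in range(4):                                # generate 4 more tokens
    logits = model(ids)                           # re-run the model on the current prefix
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
    ids = torch.cat([ids, next_id], dim=1)        # append it and feed it back in the next step
print(ids)
```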
▪ Variants of positional embeddings
▪ Architectural choices
▪ Multi-modal models

Outline
▪ Transformer LM (cont.)
▪ Adaptation
▪ Lecture Evaluation, Quiz

Language Models are not trained to do what you want
▪ There is a mismatch between LLM pre-training and user intents.

Adapting Language Models
You have a language model that is pre-trained on massive amounts of data. It does not necessarily do useful things; it only completes sentences. Now how do you "adapt" it for your use case?
▪ Tuning: adapting (modifying) model parameters
▪ Prompting: adapting model inputs (language statements)

Fine-Tuning for Tasks
[Figure: the same pre-trained layers are fine-tuned with different output heads for different tasks: translation ("The cat is cute" → "o gato é fofo"), POS tagging (DET N VB ADJ), text classification ([CLS] → positive), and language modeling.]

Fine-tuning Pre-trained Models
▪ Whole-model tuning: run an optimization defined on your task data that updates all model parameters.
▪ Head-tuning: run an optimization defined on your task data that updates only the parameters of the model "head" (e.g. a classification head on top of the language model and embeddings; input: [CLS] A three-hour cinema master class.)
[ACL 2022 Tutorial, Beltagy, Cohan, Logan IV, Min and Singh]

Parameter-efficient Fine-tuning
[Figure source: https://arxiv.org/pdf/2303.15647.pdf]

Parameter-efficient Fine-tuning: Adding Modules
▪ Augment the existing pre-trained model with extra parameters or layers and train only the new parameters.
▪ One commonly used method: Adapters

Adapters
▪ Idea: train small sub-networks and only tune those.
▪ The adapter layer projects to a low-dimensional space to reduce parameters.
▪ No need to store a full model for each task, only the adapter parameters.
["Parameter-Efficient Transfer Learning for NLP", Houlsby et al., 2019]

Question
▪ Is parameter-efficient tuning more (1) computationally efficient and (2) memory-efficient than whole-model tuning?
▪ Answer to (1): It is not faster! You still need to do the entire forward and backward pass.
▪ Answer to (2): It is more memory-efficient. You only need to keep the optimizer state for the parameters you are fine-tuning, not for all parameters.

Selective Methods
▪ Selective methods fine-tune a subset of the existing parameters of the model.
▪ The subset can be chosen by layer depth, by layer type, or even as individual parameters.
[Ben Zaken et al., 2021. "BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models"]

BitFit
▪ BitFit tunes only the bias terms in the self-attention and MLP layers.
▪ It updates only about 0.05% of the model parameters.
[Ben Zaken et al., 2021. "BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models"]
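As a sketch of the adapter idea from Houlsby et al. described above: a small bottleneck network with a residual connection is added to each (frozen) transformer layer, and only its parameters are trained. The hidden size and bottleneck size below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> nonlinearity -> up-project, plus a residual."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project to a low-dimensional space
        self.up = nn.Linear(bottleneck, d_model)     # project back up to the model dimension
        self.act = nn.GELU()

    def forward(self, h):
        # The residual keeps the frozen pre-trained computation intact.
        return h + self.up(self.act(self.down(h)))

# Per task, only the adapter parameters need to be stored and trained.
adapter = Adapter()
print("adapter params:", sum(p.numel() for p in adapter.parameters()))  # ~100k per adapter

h = torch.randn(2, 10, 768)          # hidden states coming out of a frozen transformer sub-layer
print(adapter(h).shape)              # torch.Size([2, 10, 768])
```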
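And a minimal sketch of BitFit-style selective tuning: freeze all parameters, then unfreeze only the bias terms. A plain PyTorch transformer encoder stands in here for a real pre-trained checkpoint, so the exact trainable fraction printed differs from the roughly 0.05% the lecture quotes for BERT-style models (whose large, bias-free embedding matrices lower the percentage).

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a pre-trained transformer; in practice you would load a checkpoint.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=12)

# BitFit: freeze everything, then unfreeze only the bias terms.
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if "bias" in name:
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total} = {100 * trainable / total:.2f}%")

# Memory saving: the optimizer only keeps state for the trainable (bias) parameters.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```

The same freeze-then-unfreeze pattern also implements head-tuning: instead of the bias terms, unfreeze only the parameters of the task head.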
Limitations of Pre-training, then Fine-tuning
▪ Often you need a large amount of labeled data
▪ Though more pre-training can reduce the need for labeled data

"I have an extremely large collection of clean labeled data"
-- No one

Outline
▪ Transformer LM (cont.)
▪ Adaptation
▪ Lecture Evaluation, Quiz

Lecture and Exercise Evaluation
▪ Lecture
▪ Exercise

Menti time