NLP and the Web - Neural Language Modeling Lecture PDF
Document Details

Technische Universität Darmstadt
2024
Dr. Thomas Arnold
Summary
This document contains lecture slides on neural language modeling, covering topics such as distributed training, quantization, and computation cost. The lecture is from Technische Universität Darmstadt and covers concepts relevant to NLP and machine learning.
Full Transcript
NLP and the Web – WS 2024/2025
Lecture 13: Neural Language Modeling 5
Dr. Thomas Arnold, Hovhannes Tamoyan, Kexin Wang
Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt

Syllabus (tentative)
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation
05 IR – Transformer/BERT
06 IR – Dense Retrieval
07 IR – Neural Re-Ranking
08 LLM – Language Modeling Foundations, Tokenization
09 LLM – Neural LLM
10 LLM – Adaptation
11 LLM – Prompting, Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam

General info about the exam
▪ Modus: written closed-book exam in Darmstadt (in person)
▪ Date: 25.02.2023
▪ Time slot: 15:00 – 17:00
▪ Where: will be announced on Moodle (~1 week before the exam)

General info about the exam
▪ No books, notes, or other auxiliary material may be used.
▪ For math problems you can use a non-programmable calculator.
▪ Problems are stated in English.
▪ The questions may be answered in either German or English.
▪ You will have ~90–100 minutes to complete the exam.

Questions
Will there be any trial exams or past exams which we can use for preparation?
Answer: We provide an exam from last year. However, the lecture content has changed considerably. Please use this exam only as an example of the question types to expect!

Questions
Are the features / steps / results of research experiment X (mentioned in the lecture) relevant for the exam?
Answer: No, you do not need to remember the specifics of any mentioned experiments.

Questions
What kind of tasks can we expect in the exam (multiple choice / open questions / ...)?
Answer:
- ~10% "Know stuff", examples:
  - Definitions of basic terms (What is a morpheme?)
  - Remember lecture topics (Name two approaches for parameter-efficient fine-tuning.)
- ~30% "Understand stuff", examples:
  - Compare metrics / methods (Why is F1 better than accuracy in IR?)
  - Explain a method (What is the key difference of decoder self-attention compared to encoder self-attention?)
- ~30% "Do stuff", examples:
  - Tokenization, TF-IDF, inverted index, Viterbi, precision/recall, ranked evaluation metrics, Byte-Pair Encoding, ...
- ~30% "Transfer knowledge", examples:
  - Here is a scenario X. Would you use an encoder or a decoder transformer model? Why?
  - Why is tokenization especially hard on Twitter data?
- Max. one multiple-choice question

Questions
Are the home exercises relevant for the exam?
Answer: No, only the class exercises.
How relevant are the practice classes in the overall context of the exam?
Answer: There will be (about) two questions specific to the practice class. (Examples: explain code output, find errors, questions about basic coding, ...)
Outline
Distributed Training – Quantization – Computation Cost

Motivation: Models getting larger
(Figure source: https://infohub.delltechnologies.com/de-de/p/investigating-the-memory-access-bottlenecks-of-running-llms/)

Motivation: How much memory do we need?
Model         Inference Memory   Training Memory
Mistral 7B    28 GB              168 GB
GPT3 175B     700 GB             4200 GB
GPT4 1.8T     7200 GB            43200 GB

Distributed Training: An Overview
- Data Parallelism: every GPU holds a full copy of the LM and processes a different slice of the data.
- Pipeline Parallelism: the model's layers are split across GPUs (Layer 1 on GPU 1, Layer 2 on GPU 2, Layer 3 on GPU 3).

Data Parallelism: Shard Data
Step 1: Shard the dataset into pieces and feed them separately into different GPUs.

Data Parallelism: Aggregate Gradients
Step 2: Each GPU sends its local gradients to a main process (the parameter server) to aggregate them.

Data Parallelism: Update Weights
Step 3: The parameter server performs the gradient update, then replicates the updated weights to each GPU. In practice, the parameter server is often the first GPU.

Data Parallelism: All Together
Step 1: Data Sharding → Step 2: Gradient Aggregation → Step 3: Update and Replicate
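To make the three steps concrete, here is a minimal sketch that simulates data parallelism on CPU with plain PyTorch. The toy model, learning rate, and the three simulated "GPUs" are assumptions for illustration, not the lecture's code; a real setup would use torch.distributed (e.g., DistributedDataParallel, which averages gradients with an all-reduce) instead of an explicit parameter-server loop.

```python
# Minimal sketch of data parallelism, simulated on CPU (toy model, lr = 0.1 assumed).
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)                                 # stand-in for the LM (parameter-server copy)
replicas = [copy.deepcopy(model) for _ in range(3)]      # one replica per simulated GPU

# Step 1: shard one global batch into three pieces, one per replica.
x, y = torch.randn(12, 16), torch.randint(0, 2, (12,))
shards = list(zip(x.chunk(3), y.chunk(3)))

# Step 2: every replica computes local gradients on its own shard.
loss_fn = nn.CrossEntropyLoss()
for replica, (xb, yb) in zip(replicas, shards):
    loss_fn(replica(xb), yb).backward()

# Step 3: the parameter server averages the local gradients, applies one SGD step,
# and replicates the updated weights back to all workers.
with torch.no_grad():
    for name, param in model.named_parameters():
        grads = [dict(r.named_parameters())[name].grad for r in replicas]
        param -= 0.1 * torch.stack(grads).mean(dim=0)
    for replica in replicas:
        replica.load_state_dict(model.state_dict())
```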
Pipeline Parallelism
Splitting the model (instead of the data) across multiple GPUs. (Figure credit: Song Han, MIT)

Pipeline Parallelism: Naive Implementation
With a naive schedule, the GPUs are idle most of the time!
(GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism, Huang et al., NeurIPS 2019)

Pipeline Parallelism: Solution
Split the data into smaller micro-batches: a batch of shape (32, 128, 768) becomes four micro-batches of shape (8, 128, 768), so the GPUs can work on different micro-batches concurrently.

Outline
Distributed Training – Quantization – Computation Cost

Quantization vs. Distillation vs. Pruning
▪ Quantization: stores weights or performs computation with 4/8-bit integers instead of 16/32-bit floating-point numbers. The most effective and practical way to do training/inference of a large model. Can be combined with pruning (GPTQ) and distillation (ZeroQuant).
▪ Distillation: train a small model (the student) on the outputs of a large model (the teacher). In essence, distillation = model ensembling; therefore we can also distill between models with the same architecture (self-distillation). Can be combined with pruning.
▪ Pruning: removing excessive weights to lower the parameter count. A lot of the work is done solely for research purposes. Has cultivated different routes of estimating the importance of parameters.

What is Quantization?
The process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements.

Overview of Quantization Methods
               Floating-Point    K-Means-based Quantization                  Linear Quantization
Storage        Floating Point    Integer Weights, Floating Point Codebook    Integer Weights
Computation    Floating Point    Floating Point                              Integer

Linear Quantization
An affine mapping from floating-point numbers to integers: a 32-bit float tensor (original) is mapped to a low-bit signed integer tensor (quantized; 2-bit in the slide's example) and can be mapped back to a 32-bit float tensor (reconstructed). The mapping is defined by two numbers, the zero point Z and the scale S; in the slide's example Z = -1 and S = 1.07, so the reconstruction is (q - (-1)) × 1.07. How do we find these numbers?
(Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Jacob et al., CVPR 2018)

Linear Quantization
r ≈ S × (q - Z)
where r is the real value (floating point), q is the quantized value (integer), Z is the zero point (integer), and S is the scale (floating point).

Linear Quantization: Zero Point Derivation
Starting from r = S(q - Z) and requiring that the extreme values map onto each other, r_max = S(q_max - Z) and r_min = S(q_min - Z), we get
S = (r_max - r_min) / (q_max - q_min) and Z = round(q_min - r_min / S).

Linear Quantization: "Absmax" Implementation
In practice, the weights are usually centered around zero (Z = 0), so we can find the scale using only the maximum absolute value: S = max|r| / q_max. (The slide shows the weight distribution of the first conv layer of ResNet-50 as an example.) Used in PyTorch and ONNX.
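As a small illustration of the absmax variant, here is a PyTorch sketch of symmetric linear quantization to signed 8-bit integers with the zero point fixed at Z = 0. The function names and the int8 choice are assumptions for this example, not the exact implementation used by PyTorch or ONNX.

```python
# Sketch of absmax (symmetric) linear quantization, assuming Z = 0 and int8.
import torch

def absmax_quantize(w: torch.Tensor, bits: int = 8):
    q_max = 2 ** (bits - 1) - 1              # 127 for signed 8-bit integers
    scale = w.abs().max() / q_max            # S = max|r| / q_max  (zero point Z = 0)
    q = torch.round(w / scale).clamp(-q_max - 1, q_max).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return scale * q.float()                 # r ≈ S * (q - Z), with Z = 0

w = torch.randn(4, 4)                        # a toy weight matrix centered around zero
q, s = absmax_quantize(w)
error = (w - dequantize(q, s)).abs().max()   # rounding error is at most about S / 2
print(q.dtype, float(error))
```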
Outline
Distributed Training – Quantization – Computation Cost

How do you compute the computational cost of a single-layer NN with one matrix multiplication?

FLOPS
▪ Floating point operations per second (FLOPS, flops or flop/s).
▪ Each FLOP can represent an addition, subtraction, multiplication, or division of floating-point numbers.
▪ The total FLOPs of a model (e.g., a Transformer) provide a basic approximation of the computational cost associated with that model.
▪ (Strictly, FLOPS denotes operations per second; below, FLOPs denotes a count of floating-point operations.)

FLOPS: Matrix Multiplication
▪ Matrix-vector multiplications are common in self-attention (e.g., the QKV projections).
▪ Multiplying A ∈ ℝ^(m×n) with b ∈ ℝ^n requires 2mn operations (2 × matrix size).
▪ (The factor 2: one multiplication and one addition.)
▪ For multiplying A ∈ ℝ^(m×n) with B ∈ ℝ^(n×p), one needs 2mnp operations (again, one multiplication and one addition each).
▪ This is just the forward propagation. What about the backward step?

FLOPS: Matrix Multiplication: Backward
▪ The backward pass needs to calculate the derivative of the loss with respect to each hidden state and each parameter. For a layer Y = XW, the upstream gradient ∂L/∂Y gives ∂L/∂W = X^T (∂L/∂Y), which is one matrix multiplication; we also need ∂L/∂X = (∂L/∂Y) W^T, a second matrix multiplication, to continue passing the gradient to the previous layers.
▪ Therefore the FLOPs for the backward pass are roughly twice those of the forward pass.

FLOPS: Matrix Multiplication: Altogether
▪ Multiplying an input by a weight matrix requires 2 × (matrix size) FLOPs.
▪ The FLOPs for the backward pass are roughly twice those of the forward pass.
▪ Training FLOPs for multiplying by a matrix W ≈ 6 × (batch size) × (size of W).

Transformer FLOPs: The Quick Estimate
▪ The weight FLOPs assumption: the FLOPs that matter most are weight FLOPs, i.e. those performed when intermediate states are multiplied by weight matrices. Weight FLOPs make up the majority of Transformer FLOPs.
▪ We can ignore the FLOPs for bias vector additions, layer normalization, residual connections, non-linearities, and the softmax.
(The FLOPs Calculus of Language Model Training, Dzmitry Bahdanau, 2022)

Transformer FLOPs: The Quick Estimate
▪ Let N be the number of parameters (the sum of the sizes of all weight matrices).
▪ Let D be the number of tokens in the pre-training dataset.
▪ Forward pass: roughly 2N FLOPs per token, i.e. roughly 2ND for the entire dataset.
▪ Backward pass: roughly twice the forward pass, i.e. roughly 4ND for the entire dataset.
▪ Total cost of pre-training on this dataset: C ≈ 6ND.

Estimating training time
This is a very practical question in the real world; we can use the formula above to estimate training time.
Consider HyperCLOVA, an 82B-parameter model that was pre-trained on 150B tokens using a cluster of 1024 A100 GPUs.
(Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers, 2021, https://arxiv.org/pdf/2109.04650.pdf)

Training cost (FLOPs): C ≈ 6ND = 6 × 150 × 10^9 × 82 × 10^9 ≈ 7.3 × 10^22
The peak throughput of an A100 GPU is 312 teraFLOPS, i.e. 3.12 × 10^14 FLOP/s. How long would this take?
Duration = model compute cost / cluster throughput = 7.3 × 10^22 / (3.12 × 10^14 × 1024) ≈ 2.3 × 10^5 s ≈ 2.7 days

According to the white paper, training took 13.4 days. Our estimate is about 5 times off, but we did get the order of magnitude right!
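The back-of-the-envelope numbers above are easy to reproduce. The short Python sketch below plugs the slide's HyperCLOVA figures into C ≈ 6ND and divides by the cluster's peak throughput; all constants are taken from the slide, and the rest is plain arithmetic.

```python
# Training-time estimate via C ≈ 6ND, using the HyperCLOVA numbers from the slide.
N = 82e9                     # model parameters
D = 150e9                    # pre-training tokens
C = 6 * N * D                # total training FLOPs (≈ 7.4e22; the slide rounds to 7.3e22)

peak_flops = 312e12          # A100 peak throughput: 312 teraFLOPS
num_gpus = 1024
seconds = C / (peak_flops * num_gpus)

print(f"C ≈ {C:.2e} FLOPs")
print(f"≈ {seconds / 86400:.1f} days at theoretical peak throughput")  # ≈ 2.7 days
```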
Factors We Did Not Consider
Note that these estimates can be off in practice:
▪ Theoretical peak throughput is not achievable with distributed training (unless your model only does large matrix multiplications).
▪ We ignored many additional operations such as softmax, ReLU/GeLU activations, self-attention, layer norm, etc.
▪ Training divergence and restarting from earlier checkpoints are not uncommon.
▪ There are various further factors that contribute to computation latency: communication latency, memory bandwidth, caching, etc.
▪ See https://kipp.ly/transformer-inference-arithmetic/ for an excellent discussion.

Summary