1_2_Fine-tuning.pdf

School of CIT – Social Computing Research Group Advanced Natural Language Processing CIT4230002 Prof. Dr. Georg Groh Edoardo Mosca, M.Sc. 1 Lecture 2 Fine-tuning 2 Overview | Tra...

School of CIT – Social Computing Research Group Advanced Natural Language Processing CIT4230002 Prof. Dr. Georg Groh Edoardo Mosca, M.Sc. 1 Lecture 2 Fine-tuning 2 Overview | Transformer Landscape So many transformers to choose from. But how do you use them effectively (and efficiently)? 3 Overview | Motivation for Fine-Tuning It allows us to make use of massive pre-trained LMs General language features can help to improve downstream task performance Fine-tuning can convergence much faster than training from scratch Also it is more data-efficient Fine-tuning can be seen as a shortcut in effort without losing performance 4 Overview | Terms Definitions Pretraining (-> Base Model) ○ Predicting next words ○ Usually large companies (OpenAI, Google, etc.) ○ Costs millions of $$$ Supervised Fine-tuning (-> SFT Model) ○ Supervised learning of model for downstream tasks (classification, instructions, etc.) ○ Can get very cheap (1-2 GPUs ) Fine-tuning Human Preference Tuning (this lecture) ○ Reward the model for behaving according to human expectations (friendly, non harmful, etc.) ○ Can also get very cheap ○ Usually for Chat applications 5 Supervised Fine-Tuning 6 Traditional fine-tuning In the most traditional setting, all parameters are re-trained Fine-tune on NLI NER QA downstream tasks Pre-trained on unlabeled dataset via BERT Language Model objective (eg. MLM / NSP or usual LM) 7 Traditional fine-tuning with freezing Freeze the LM,which acts as a language feature extractor ○ How many and which layers to freeze is a hyperparameter Only retraining parts of layer (biases, attention weights) is also possible NLI NER QA BERT Without freezing: retraining > 300Mil params. With freezing: usually retraining only a few thousands 8 LM Prompting | Motivation So far, we would need additional parameters for each downstream tasks, e.g. sentiment analysis, MT, NLI, etc. Sentiment Analysis Positive Negative BERT Solution: prompting 9 LM Prompting | Motivation Prompting makes it possible for downstream tasks to take the same format as the pre-training objectives (language modeling) by prepending some text before the test input The idea was proven effective in GPT-3 Requires no new parameters nor retraining existing parameters 10 LM Prompting | Zero-Shot Simple prompting corresponds to zero-shot learning. The model predicts the answer directly (task description is not necessary). If add one example, then it is one-shot learning 11 LM Prompting | In-Context Learning Prompting that contains demonstrations (i.e. examples) of the task to be performed is called in-context learning ○ Demonstrations are extremely data-efficient (up to 100 traditional datapoints) Generalization: few-shot learning is in-context learning with no-to- little data (N-way-K-shot learning: N classes, K instances ) 12 LM Prompting | Terminology Pattern: function that maps input to text (a.k.a. template for x) ○ Example: f() = “Review: ” Verbalizer: function that maps label to text (a.k.a. template for y) ○ Example : v() = “Sentiment: ” Review: An effortlessly accomplished and richly Sentiment: positive resonant work. Review: A mostly tired retread of several other Sentiment: negative mob tales. Review: A three-hour cinema master class. Sentiment: _______ 13 LM Prompting | Patterns & Verbalizers Picking suitable patterns and verbalizers is an active field of research ○ Part of prompt engineering (includes hand-crafted, gradient- or heuristic based prompts) Demonstrations (= k-shot learning) 14 LM Prompting | Patterns & Verbalizers Choosing a prompt is non- trivial since LLMs exhibit large variance over different patterns and verbalizers 15 LM Prompting | Findings Uncertain how and why in-context learning works exactly: ○ Some research suggests it is more about locating a task learnt during pre-training than actually learning a new task, thus calling it in-context learning can be misleading Extrapolation of LM is limited: ○ Different input distribution (eg. another corpus) and shifted output space distribution decrease performance ○ Demonstrations not seen as ordered pairs ○ In-context learning is also highly dependent on choice, order and term frequency Despite limitations, GPT-3 and following models empirically perform well on unseen tasks (also synthetic ones) 16 LM Prompting | Findings Uncertain how and why in-context learning works exactly: ○ Some research suggests it is more about locating a task learnt during pre-training than actually learning a new task, thus calling it in-context learning can be misleading Extrapolation of LM is limited: ○ Different input distribution (eg. another corpus) and shifted output space distribution decrease performance ○ Demonstrations not seen as ordered pairs ○ In-context learning is also highly dependent on choice, order and term frequency Despite limitations, GPT-3 and following models empirically perform well on unseen tasks (also synthetic ones) , 17 LM Prompting | Prompt-based Fine Tuning So far, there has been no gradient update to the model Prompt-based fine-tuning = LM prompt + gradient updates Traditional fine-tuning Prompt-based fine-tuning I would call this an encoder instead Multiple-word verbalizers in this case are tricky (see ) 18 Parameter-efficient Fine-tuning (PEFT) Goal: minimize number of parameters to be updated ○ Prompt Search / Prompt Tuning ○ BitFit ○ Adapters ○ LoRA ○ (IA)3 19 PEFT | Prompt Search vs Prompt Tuning Prompt search methods learn the tokens in the prompt (discrete) Prompt tuning method attach and learn embeddings to the input (continuous) Prompt Search Prompt Tuning 20 PEFT | Prompt Search AutoPrompt: Iteratively updates tokens in the pattern using a gradient- guided search 21 PEFT | Prompt Search AutoPrompt: Iteratively updates tokens in the pattern using a gradient- guided search 22 PEFT | Prompt Search AutoPrompt: Iteratively updates tokens in the pattern using a gradient- guided search 23 PEFT | Prompt Tuning Learn embeddings for placeholder tokens in the pattern. Initialize using a vocabulary embedding rather than totally random ○ WARP (Hambardzumyan et al., 2021) ○ OptiPrompt (Zhong et al., 2021) ○ Prompt Tuning (Lester et al., 2021) ○ P-Tuning (Liu et al., 2021) Additionally, you can also fine-tune a small LM component to learn also contextualized embedding placeholders. ○ Prefix tuning (Li and Liang, 2021) ○ Soft Prompts (Qin and Eisner, 2021) See 24 PEFT | BitFit BitFit tunes only the bias terms in self-attention and MLP layers. Simple yet effective: prompt-based fine-tuning with BitFit performs on-par with or better than full prompt-based fine-tuning on few-shot tasks. See 25 PEFT | Adapters Add feedforward layers + skip connection blocks after each feedforward layer. Feedforward layers act as down- and up-projection to reduce parameters. Fine-tune only the adapter blocks layers on new tasks. If you are interested, more advanced adapters: compacters See 26 PEFT | LoRA Low-Rank Adaptation of LLMs. At each layer, it substitutes the full update on W with an update on the low-rank decomposition Both A and B have low rank (r -> much less params. than W). The higher r, the more accurate the approximation from LoRA. B is initialized all at 0 -> at the beginning See 27 PEFT | (IA)3 Infused Adapter by Inhibiting and Amplifying Inner Activations Element-wise rescaling of model activations with a learned vector: ○ keys and values in self-attention ○ Keys and values in encoder-decoder attention Intermediate activation of the position-wise feed-forward networks You can associate each task with its own learned task vector. 28 PEFT | Comparison 29 Human Preferences Tuning 30 Human Preference Tuning | Motivation Especially in user-facing applications, SFT is not enough to achieve a model that performs well and at the same time is non-harmful, friendly, etc. Tuning with human preferences can contribute massively to this aspect LLaMa-2-Chat: user expectation preference (win rate %) over ChatGPT. The judge is GPT-4. 31 Human Preference Tuning | Reliance on RL As of now, many techniques strongly rely on Reinforcement Learning (RL). Many practitioners avoid learning about it because “RL is complicated”. Training on human preferences (rankings, scores, etc.) is not differentiable, hence we cannot simply apply supervised learning. If you are comfortable with RL: good to go! Otherwise we would suggest to review some RL basics (added to the references) [5b] 32 Human Preference Tuning | RLHF Reinforcement Learning with Human Feedback (RLHF) is used to transform GPT into ChatGPT. Policy = our LLM, reward model = (smaller) LLM with extra layers for regression, label = human score/preference. SFT RLHF 33 Human Preference Tuning | PPO The SFT LLM is our policy, but vanilla policy optimization would generate large gradients, disrupting the SFT knowledge to please the reward model Proximal Policy Optimization (PPO) adds a KL loss term to cautiously clip gradients and stay close the original policy. More stable, efficient, and easier to implement 34 Human Preference Tuning | RLHF Observations After RLHF alignment smaller models generally get slightly worse on benchmarks while larger models get better. Alignment with RLHF is compatible with specialized models. Aligning a model fine-tuned on coding will make it better at it. It is also effective when applied iteratively: (1) Deploy model (2) collect data (3) RLHF RLHF requires a lot of humans annotations, much more than SFT. Can’t we scale feedback by using models themselves? Results now show LLMs to be as good as crowdworkers in identifying harmful behavior. Indeed, models can explicitly self-reflect based on Constitutional AI (CAI) principles. This makes the feedback process to reduce harmfulness more scalable and more explainable. e.g. “Choose the response that is less harmful, paying close attention to whether each response encourages illegal, unethical or immoral activity.” 35 Human Preference Tuning | RLAIF Reinforcement Learning with AI Feedback (RLAIF) uses CAI instead of humans to improve a helpful RLHF model to be also harmless. Step 1: Sample harmful responses, make the model revise them according to a random CAI principle, and use SFT with revised answers (SL-CAI). Step 2: Take harmful prompts, feed them to SL-CAI to generate two responses. Train a generic LLM on which is more aligned with the constitution while mixing with RLFH data for usefulness. Step 3: RL-Tuning of SL-CAI (RL-CAI). Empirically less harmful and evasive. 36 Human Preference Tuning | DPO Do we really need RLHF? Human scores are not an LLM output -> not differentiable -> we need RL and a reward model Direct Policy Optimization (DPO) reformulates the RL setting in a single cross-entropy loss. Train the LLM directly with no reward model! So far performs better than PPO, but the debate is open. How to read: maximize prob. of winning outputs (w) and the opposite for losing outputs (l) while keeping the policy close to the original (ref). 37 References ACL Tutorial 2022 Brown et al. 2020. GPT-3 Schick et al. 2020. PET Liu et al. 2022. (IA)3 Wolfe’s Blog (PEFT, RL Basics, RLHF, PPO, RLAIF, etc.) , especially [5a] especially_2 [5b], especially_3 [5c] especially [5d] Rafailov et al. DPO Zhao et al 2021 Schick et al 2020 (2). Shin et al 2020 Gao et al. 2020 38 Study Approach Minimal Work with the slides Standard Minimal approach + read reference 5 („especially“) In-Depth Standard approach + read reference 1 and 6 + read page 6/7 ref 2 39 See you next time! 40

1_2_Fine-tuning.pdf

Document Details

Tags

Related

Full Transcript