Full Transcript

AI Engineering Chapter 2: Understanding Foundation Models

(1) Pre-training
Takes up ~98% of total compute and resources. Typically uses self-supervision.
Model architecture: seq2seq Transformer.
Model size:
Parameters - adjustable variables between nodes in a neural network that determine how the input is transformed into the output. Neurons between layers of a network are connected through links, and those links have weights assigned to them (parameters = weights + biases). The network learns by adjusting the weights between the nodes. During training, the model initially defines weights between neurons and later adjusts them based on the training data to minimize the gap between predictions and outcomes. This allows it to learn the underlying patterns in the data.
Sparse model - a model with a large percentage of zero-value parameters. These parameters are inactive, so the model requires less compute to run. A common sparse architecture is the mixture of experts (MoE).
Tokens - the unit a model operates on. The number of training tokens can be larger than the number of tokens in the dataset: you can run the model through the training data twice (2 epochs), which means your training tokens are double the dataset tokens. Why multiple epochs? To increase the accuracy of the model and learn better. If the model's learning curve hits a plateau, you stop adding epochs.
Compute - a floating point operation (FLOP) is a standardized unit of compute. It provides a quantitative measure of the mathematical work a model requires.

(2) Post-Training
The trained foundation model is called a pre-trained model. There are two issues with a pre-trained model: (1) it is optimized for token completion and needs to be optimized for your task, such as conversation; (2) it is trained on indiscriminate data from the internet, which may not be fit for purpose and can be biased, racist, and harmful. To address these, you do additional post-training in two steps: (1) supervised finetuning, and (2) alignment.

Supervised Fine-Tuning (SFT)
Finetuning the model on high-quality data to optimize it for conversation. Also called instruction finetuning. You feed the model examples of great responses so it can learn.
Input → high-quality demonstration data consisting of prompt + response across varied tasks such as classification, summarization, text generation, and Q&A. (A small data-format sketch appears at the end of these notes.)
Instruction finetuning → giving examples of prompt + response so the model learns from the instructions.
Dialogue finetuning → giving multi-turn examples of prompt + response so the model learns conversation.
Average cost: ~$10 per prompt + response.

Alignment (YouTube video, Medium)
After SFT, further finetune the model toward output responses that align with human preferences. Also called preference finetuning. The most common alignment algorithm is Reinforcement Learning from Human Feedback (RLHF). There are two steps:
1. Train a reward model to score your SFT model's outputs.
   a. Data collection:
      i. Collect prompts + multiple potential responses.
      ii. Labelers rank these responses based on quality and alignment to desired outcomes.
      iii. Feed each prompt + winning response + losing response (called comparison data) as input into your reward model.
   b. Model training:
      i. Design a neural network and feed it two inputs: (prompt + winning response) and (prompt + losing response). Each input is processed separately through the network.
      ii. The model outputs a single scalar score for each input. It learns to assign higher scores to winning responses and lower scores to losing responses.
      iii. The model calculates a loss based on the difference between the scores for the winning and losing responses, and the objective is to minimize this loss. For example, if completion A was preferred over B, the loss increases if the model gives B a higher score than A (see the sketch below).
      iv. This process teaches the model to predict human preferences: the scalar score represents its prediction of how well a completion aligns with human preferences for that prompt.
      v. The model compares its predictions with the actual human preferences and adjusts its parameters to reduce the loss, making better predictions over time.
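To make this loss concrete, here is a minimal sketch assuming the common pairwise logistic (Bradley-Terry style) formulation used for reward models; the scores below are made-up numbers standing in for the reward model's outputs, not values from any real model.

```python
import math

def pairwise_preference_loss(score_winning: float, score_losing: float) -> float:
    """Loss for one comparison: -log(sigmoid(score_winning - score_losing)).

    The loss is small when the winning response scores higher than the
    losing one, and grows when the model ranks them the wrong way around.
    """
    margin = score_winning - score_losing
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: if the reward model scores the losing response higher, the loss
# is large; once the ordering is correct, the loss shrinks.
print(pairwise_preference_loss(score_winning=0.2, score_losing=1.5))  # ~1.54 (wrong ranking)
print(pairwise_preference_loss(score_winning=1.5, score_losing=0.2))  # ~0.24 (correct ranking)
```

In practice both scores typically come from the same network applied to (prompt + winning response) and (prompt + losing response), and the per-comparison losses are averaged over a batch.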
2. Optimize the foundation model. After training the reward model, the next step is to optimize the foundation model (also called the policy model) to generate responses that maximize the reward scores.
   a. Starting point:
      i. You have a trained foundation model (e.g., a large language model).
      ii. You have a trained reward model that can score responses.
   b. Optimization process. For each training iteration:
      i. Generate responses using the current foundation model.
      ii. Score these responses using the reward model.
      iii. Use the scores to update the foundation model's parameters.
   The goal is to adjust the foundation model's parameters to maximize the expected reward. This is often done using algorithms like Proximal Policy Optimization (PPO).

Questions
How much demonstration data do you need?
Is Claude SFT?

(3) Sampling
The model constructs outputs through a process called sampling: it calculates a probability distribution over all tokens in the vocabulary, then samples the next token from that distribution. If "red" has a 30% chance of being the next token, it is picked roughly 30% of the time.
Temperature → a way of redistributing the probability mass over possible values. A higher temperature reduces the probability of common tokens and increases the probability of rarer tokens, producing more creative outputs (see the sketch below).
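A minimal sketch of temperature sampling in pure Python, using made-up logits for a tiny three-token vocabulary: the logits are divided by the temperature before the softmax, so temperatures above 1 flatten the distribution and temperatures below 1 sharpen it.

```python
import math
import random

def sample_next_token(logits: dict, temperature: float = 1.0) -> str:
    """Convert logits to probabilities with a softmax at the given temperature,
    then sample one token from the resulting distribution."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # random.choices draws one token according to the probability weights
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Made-up logits for illustration: "red" is the most likely next token.
logits = {"red": 2.0, "blue": 1.0, "green": 0.5}
print(sample_next_token(logits, temperature=0.5))  # low temperature: almost always "red"
print(sample_next_token(logits, temperature=1.5))  # high temperature: rarer tokens appear more often
```

As the temperature approaches 0 this reduces to greedy decoding (always picking the highest-probability token); a temperature of 1 leaves the model's original distribution unchanged.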

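Returning to the supervised finetuning step above: a minimal sketch of what prompt + response demonstration data might look like and how it could be flattened into training text. The field names and the "### Instruction:" / "### Response:" template are illustrative assumptions, not any specific model's format.

```python
# Hypothetical instruction-finetuning demonstration data: prompt + response
# pairs spanning varied tasks (classification, summarization, Q&A, ...).
demonstrations = [
    {"prompt": "Classify the sentiment: 'The battery life is fantastic.'",
     "response": "Positive"},
    {"prompt": "Summarize: 'The meeting covered Q3 revenue, hiring plans, ...'",
     "response": "The meeting reviewed Q3 revenue and upcoming hiring plans."},
]

def to_training_text(example: dict) -> str:
    """Flatten one prompt + response pair into a single training string.
    The template below is an assumed convention for illustration only."""
    return (
        "### Instruction:\n" + example["prompt"] + "\n" +
        "### Response:\n" + example["response"]
    )

for ex in demonstrations:
    print(to_training_text(ex))
    print("---")
```

During SFT the training loss is often computed only on the response tokens, so the model learns to produce the response given the prompt rather than to reproduce the prompt itself.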