LLM Recap PDF
Summary
This document provides a recap of large language models (LLMs). It covers key concepts, such as n-grams, deep learning, word embeddings, recurrent neural networks (RNNs), and transformers. The document also contains details about various aspects of LLM architectures and their applications.
Full Transcript
Recap LLM: Contents
- LLM introduction: n-grams; deep learning; multi-class classification; weights; loss
- Word embeddings: how is h actually computed?; how are the similarities (the predictions) computed?; Word2vec; FastText
- RNN: FCNN; RNN problems; seq2seq
- Transformers: encoder; decoder; tokenization; Byte-Pair Encoding (BPE); positional encoding (sinusoidal); attention (K similarity values; final attention); types of attention (encoder self-attention; decoder masked self-attention; encoder-decoder cross-attention); multi-head attention; residual connections; layer normalization; relative positional embeddings; PROs of transformers
- Model families: encoder-decoder (T5); encoder-only (BERT); decoder-only (GPT); sampling approaches
- History: the GPT family (GPT-1, GPT-2, GPT-3); fine-tuning vs in-context; the DeepMind paper (fixed-size models; IsoFLOPs; IsoFLOPs + IsoLoss); LLama
- Metrics, tasks, benchmarks: geometric mean; perplexity; BLEU; BERT Score; other metrics; tasks (LAMBADA; ROCStories, HellaSwag, StoryCloze; question answering; translation and summarization; natural language inference; grammatical acceptability); benchmarks
- Tuning and model alignment: InstructGPT
- Efficient fine-tuning and inference: bias-terms fine-tuning (BitFit); adapters; Low-Rank Adaptation (LoRA); prompt tuning; other optimization techniques (quantization; LLM.int8(); reduced floating-point precision; model distillation)
- Potpourri: Mixture of Experts; Mixtral; CLIP; contrastive learning alignment; LLaVA
- LMSE intro: the waterfall model; the V-model; the iterative model; Agile; SCRUM
- LLM4SE: datasets; evaluation metrics; requirements elicitation
- Prompt engineering: what makes a good prompt; what a prompt is composed of; priming; RGC; "I want you to act as" prompting; generate knowledge prompting; chain-of-thought prompting; self-consistency with CoT (CoT-SC); structured chain-of-thought (SCoT); reducing hallucinations; problems
- Agent architecture: LLM chains; agentic AI systems; reactive AI vs agentic AI; from chains to agents; the memory model; utilization of knowledge; reasoning and planning; agent interaction schemes
- Evaluation: generated requirements; requirements quality measures (INVEST; SRS quality measures); generated design (structural metrics); evaluating code generation (functional correctness; static code quality metrics: cyclomatic complexity, maintainability index, Halstead volume; runtime performance quality metrics; code-specific similarity metrics: CodeBLEU; feedback-based evaluation); evaluating test case generation (test coverage; execution success rate; mutation analysis; test flakiness; developer feedback)

LLM Introduction

N-grams
Generate a new sentence based on known probabilities = autoregressive generation (sketched below).
Markov assumption: with n = 2, the future state is completely determined by the current state; that is, the probability of the current word depends solely on the previous one.
How is the probability actually computed? From corpus counts: P(w_t | w_{t-1}) = count(w_{t-1} w_t) / count(w_{t-1}).
Limitations:
- Data sparsity: as n increases, the number of possible n-grams grows exponentially; a small vocabulary or a small context makes the LM useless.
- Context limitation: with n = 2 only, while longer texts may require remembering what happened "early on".
- Lack of semantics: similar words are treated the same way as completely different ones.
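A minimal sketch of bigram autoregressive generation (the toy corpus is an illustrative assumption, not from the notes):

```python
import random
from collections import defaultdict

# Toy bigram LM (Markov assumption, n = 2): P(w_t | w_{t-1}) is estimated
# from raw counts, and generation is autoregressive.
random.seed(0)
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(lambda: defaultdict(int))
for prev, cur in zip(corpus, corpus[1:]):
    counts[prev][cur] += 1

def next_word(prev):
    # Sample proportionally to the bigram counts (the MLE probabilities).
    words, freqs = zip(*counts[prev].items())
    total = sum(freqs)
    return random.choices(words, weights=[f / total for f in freqs])[0]

word, sentence = "the", ["the"]
while word != "." and len(sentence) < 12:
    word = next_word(word)
    sentence.append(word)
print(" ".join(sentence))
```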
Deep Learning
Note: when we refer to the "number of parameters" of a model, we mean the total number of weights it has (the example in the slides is a "6-parameter" model). Logits = unnormalized probabilities.

Multi-class classification
The output class is one of many (c_1, c_2, ..., c_n), where n = the number of classes. The model produces n logits for a point x (the last layer has n perceptrons).
Weights: found as the values that minimize the loss (easy to find for linear problems, not for complex ones like ours), so θ is updated iteratively to reach a local minimum.

Loss

Word Embeddings
Word embeddings = dense vector representations of words. Each word is mapped to a vector of real numbers; the vectors capture semantic meanings and relationships between words.
One-hot encoding (a precursor of word embeddings): words ordered lexicographically. Problems:
- sparse: scalability issues
- the vector space is not used efficiently
- vectors are orthogonal: no preservation of semantic similarity or relationships; all pairs of words have exactly the same distance (cosine, Euclidean, ...)
These are local representations; deep learning uses distributed representations instead.

How is h actually computed? How are the similarities (i.e., the predictions) computed?

Word2vec
Goal: represent words as dense vectors capturing semantic relationships.
Methods: 1. CBOW: predict a word from its context. 2. Skip-gram: predict the context from a word.
Optimization:
- Hierarchical softmax: uses a Huffman tree for efficient predictions (O(log |V|)).
- Negative sampling: simplifies training by predicting whether a (word, context) pair is "correct" or "incorrect".
Limitations: 1. Out-of-vocabulary words: it cannot represent unseen words. 2. Lack of context: it generates static vectors, so it cannot differentiate word meanings across contexts. Efficient, but surpassed by contextual embeddings (e.g., BERT, ELMo).

FastText
FastText addresses the out-of-vocabulary problem by breaking words up into subwords (e.g., tri-grams). A vector representation is learned for each subword, and subword vectors can be composed to generate vectors for new words.

RNN
FCNN constraints: fixed-size input; each input position has its own set of weights; fixed-size output.
RNN: input and output can have different sizes. The same model (i.e., the same weights!) is applied repeatedly throughout the sequence, and the state provides the model with the context of what happened before. So if the same context occurs at different moments of the sequence, the model's behavior will be the same, regardless of position: no need to learn the same patterns at different positions of the input!
Problems:
- vanishing/exploding gradients
- long-term dependency issues: the hidden state mostly remembers recent inputs
- computational inefficiency: it cannot be parallelized
→ LSTM (gates). Remaining problems: difficulties with long sequences persist; gradients still vanish/explode; it still cannot be parallelized; architectural complexity.
One-to-one mapping: each input word corresponds to one output word, but inputs and outputs may have different lengths, and the LSTM doesn't get to see the entire input sequence until the end.

seq2seq
Tasks that consist in mapping an input sequence to an output sequence → Transformers.

Transformers
seq2seq, with no RNNs.
Encoder: the entire input sequence is encoded, producing one code for each step (a vector representation of the sequence).
Decoder: the first time, we feed a "beginning of sequence" (BOS) token. The output of the transformer is a probability for each possible token. At training time, the loss is computed on all tokens generated at that step!
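A minimal sketch of the decoder's autoregressive loop, starting from BOS and picking one token at a time (the `model` function is a stand-in for a real decoder, not part of the notes):

```python
import numpy as np

# Greedy autoregressive decoding: feed BOS, get a probability for each
# possible token, append the chosen one, repeat until EOS (or a limit).
BOS, EOS, VOCAB = 0, 1, 100
rng = np.random.default_rng(0)

def model(tokens):
    # Stand-in for a real transformer decoder: returns a distribution
    # over the vocabulary for the next token.
    logits = rng.normal(size=VOCAB)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

tokens = [BOS]
while tokens[-1] != EOS and len(tokens) < 20:
    probs = model(tokens)
    tokens.append(int(probs.argmax()))  # greedy: most probable token
print(tokens)
```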
Tokenization
Tokenization is the process of splitting a sentence into units (tokens).
- Characters. PROs: no out-of-vocabulary issues; robust to misspellings and variations. CONs: longer sequences; slower and more complex; semantic information is harder to capture; inefficient for common words.
- Words. PROs: semantic meaning; shorter sequences; more intuitive. CONs: out-of-vocabulary (OOV) issues for rare or new words; does not leverage information shared among words, such as prefixes or suffixes; larger vocabulary needed; vectors for rarer words are trained less.
- Subwords (BERT, fastText; Byte-Pair Encoding, BPE). PROs: a balance between characters and words; handles OOV effectively; compact vocabulary; efficient for both frequent and rare words. CONs: requires defining a subword policy (which may introduce computational overhead).

Byte-Pair Encoding (BPE)
The number of tokens needed to represent the corpus decreases as we increase the number of tokens in the vocabulary. Typically, tens to hundreds of thousands of vocabulary tokens are used. Representing a text with fewer tokens is desirable: shorter sequences, semantics better preserved.
Special tokens: BOS and EOS. Special tokens in BERT: CLS (classification, similar to BOS) and SEP (separator, used to separate sentences).

Positional Encoding
We know that the vector for each token is learned: adjusted with gradient descent (initially random). But there are two problems: the same token is mapped to the same vector regardless of its position, and attention does not really understand sequentiality → we need positional encoding.

Sinusoidal positional encoding
The vector for each position is unique, at least for the first ~60,000 positions (2π · 10,000); after that, the vectors start repeating (for longer sequences, we can just change the 10,000 constant!). This encoding preserves similarity: using trigonometric identities, we can also verify the relationship between, e.g., position 1 and position 2. Some models (GPT) don't use sinusoidal PEs; instead, they learn the positional embeddings along with the other weights.

Attention
The transformer learns to choose these weights (how much attention to pay to each word) based on the input sequence. Weight = attention: K similarity values, combined into the final attention.
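A minimal NumPy sketch of the final attention computation (scaled dot-product attention; the sizes are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # one similarity value per (query, key)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))           # 4 tokens, vectors of size 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```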
Types of Attention
- Encoder self-attention: classic attention. The same input sequence generates queries, keys, and values; hence the name, self-attention.
- Decoder (masked) self-attention: introduces the property of causality. Each token can only see the past (the previously generated tokens, current token included), achieved by masking all invalid attention weights (a mask sketch appears at the end of this section).
- Encoder-decoder cross-attention: used by the decoder to receive information from the encoder (the input sequence). K and V come from the encoder's output sequence, Q from the decoder's sequence. The number of elements in keys/values can differ from the number of queries.

Multi-head attention
Since attention may need to focus on different aspects in different contexts (nouns, verbs, ...), we generally adopt multiple attention heads in parallel. Each attention head has its own W_Q, W_K, W_V and produces its own output. The final attention output is the concatenation of the heads' outputs, passed through a linear layer to go back to the desired vector size. Too many heads → the vectors become more sparse.

Residual connections
Improve gradient flow during backpropagation.

Layer Normalization
The "Norm" in "Add & Norm" is layer normalization: each sample is normalized across all of its dimensions.

Relative positional embeddings
("Attention Is All You Need" uses sinusoidal absolute positional encoding; sinusoidal means fixed.) Positional information is now relative to the key-query distance in the sequence. We no longer encode "1st token of the sequence, 2nd token of the sequence, ..." but rather "2 tokens before the query, 1 token before the query, 0, 1 token after the query, +2, ...".

PROs of transformers
- Parallelization: all tokens are processed simultaneously.
- Long-range relationships, thanks to attention.
- Better memory.

Encoder-Decoder
T5 (Text-to-Text Transfer Transformer): a single framework for multiple tasks. Task prefixes (summarize, translate, etc.) condition the decoder's output, so the input sequence also encodes the task. T5 is only tuned on specific tasks: the model learns to recognize those tasks and address them, with no generalization to new tasks.
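A small sketch of the causal mask used by the decoder (masked) self-attention described above (NumPy assumed; sizes illustrative):

```python
import numpy as np

# Causal mask: position i may attend to positions 0..i (itself and the
# past), never to future positions.
seq_len = 5
forbidden = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
scores[forbidden] = -np.inf       # invalid attention weights are masked out
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))       # upper triangle is 0 after the softmax
```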
Encoder-only
Focus on understanding, without generating output. Generally used to solve downstream tasks: text classification, sentiment analysis, etc.

BERT
Bidirectional Encoder Representations from Transformers: pretrained on two self-supervised tasks (masked LM and next-sentence prediction), it extends to new tasks with fine-tuning. BERT uses bidirectional attention: all tokens can attend to all other tokens, which requires being careful with the task definition (pairs of input sentences (A, B)). In the "masked LM" task, random parts of the input are hidden (~15% in the original work), and the output of BERT is used to reconstruct the masked tokens. The second task provides two sentences as input.
BERT has 12 stacked transformer layers, each with 12 heads. If a sentence has 11 tokens, each head produces an 11×11 attention map, for a total of 12×12 attention maps, each being an 11×11 matrix.

Decoder-only
Receives an input sequence and keeps extending it; the output is generated in an autoregressive manner.
GPT: pretrained on the text-generation task, then fine-tuned on specific, supervised tasks.

Sampling approaches
- Greedy sampling: pick the highest-probability token at each step → repetitive, predictable text.
- Beam search: at each step, expand the k highest-probability sequences; repeat until a stopping criterion is met. Still deterministic.
- Random sampling: sample a token from the probability distribution.
- Top-k sampling: sample at random from the k most probable tokens.
- Top-p (nucleus) sampling: sample from the smallest set of most probable tokens whose cumulative probability reaches a threshold p. For high-entropy distributions there are more tokens to choose from, and vice versa.
- Temperature sampling: sharpen or flatten the probability distribution based on a target temperature, then sample.
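A minimal sketch of the sampling strategies above over a toy next-token distribution (NumPy assumed; the logits are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.5, 0.1, -1.0])  # toy next-token logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def temperature_sample(logits, T=0.7):
    return rng.choice(len(logits), p=softmax(logits / T))  # T < 1 sharpens

def top_k_sample(logits, k=3):
    top = np.argsort(logits)[-k:]                # k most probable tokens
    return rng.choice(top, p=softmax(logits[top]))

def top_p_sample(logits, p=0.9):
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]
    n = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:n]                             # smallest set with mass >= p
    return rng.choice(keep, p=softmax(logits[keep]))

print(temperature_sample(logits), top_k_sample(logits), top_p_sample(logits))
```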
History: the GPT family
GPT-1 (2018): decoder-only. Unsupervised pretraining, which can be done on a large dataset (cheap if unlabeled): 5 GB. Supervised fine-tuning: natural language inference, question answering, semantic similarity, classification (grammatical correctness, ...). 12 decoder-only layers, 12 heads; BPE with 40,000 merges; context: 512 tokens; 117M parameters.
GPT-2 (2019): unsupervised pretraining on 40 GB, with no fine-tuning. Same decoder-only design, scaled up; BPE with 50k merges; context: 1024 tokens; 1.5B parameters. The loss decreases as a power law in model size, dataset size, and amount of compute: "As the computational budget C increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size."
GPT-3 (2020): a few-shot learner. Multiple datasets (scraped from the internet, hence low quality, but also Wikipedia): 570+ GB. 175B parameters (10× more than any previous non-sparse model).

Feature | GPT-1 | GPT-2 | GPT-3
Architecture | Decoder-only | Decoder-only | Decoder-only
Pretraining | Unsupervised | Unsupervised | Unsupervised
Fine-tuning | Yes (e.g., NLI, QA, semantic similarity) | No | No; only in-context (few-shot) learning
Training dataset size | 5 GB | 40 GB | 570+ GB
Parameters | 117M | 1.5B | 175B
Layers | 12 | 48 | 96
Heads | 12 | 25 | 96
BPE merges | 40,000 | 50,000 | -
Context length | 512 tokens | 1024 tokens | -
Year | 2018 | 2019 | 2020
(Layer and head counts for GPT-2 and GPT-3 refer to the largest released variants.)

Fine-tuning vs in-context
- Fine-tuning: update the model weights on a task-specific dataset.
- In-context learning: the model weights are no longer updated; the task is described as part of the prompt, in natural language → few-shot learning.
Scaling the model makes it possible to skip fine-tuning on task-specific datasets and still get competitive results: Jurassic-1 (178B params), Gopher (280B), Megatron (530B), PaLM (540B). These models are oversized but undertrained: indeed, for every doubling of model size, the number of training tokens should also be doubled. Chinchilla, a new, correctly sized model, outperforms larger ones: 70B params, trained on the same compute budget as Gopher, on 1.4T tokens vs Gopher's 300B.

DeepMind paper
Approach 1: fixed-size models
1. Main idea: train models of fixed size (with a defined number of parameters) using different compute budgets.
2. Observations: for each compute budget (the FLOPs axis), the model with the lowest loss is identified. Example: at 10^20 FLOPs, the 1B-parameter model achieves the minimum loss (~2.5). The red line shows that the optimal models follow a nearly linear relationship between parameters and FLOPs.
3. Key result: the "optimal" models are not always the largest. For example, at Gopher's budget, a 67B-parameter model is better than a 280B one.
Approach 2: IsoFLOPs
1. Main idea: for a fixed compute budget (IsoFLOPs), models of different sizes are trained.
2. Observations: the loss follows a parabola: for each budget there is a model size that minimizes it. Example: with a budget of 6 · 10^20 FLOPs, the optimal model is the 2B-parameter one, which achieves a loss of ~2.3. The red line represents the best model for each compute budget.
3. Key result: this approach clearly identifies the optimal model for each budget, confirming that the budget must be split in a balanced way between model size and number of tokens.
Approach 3: IsoFLOPs + IsoLoss
1. Main idea: combine compute budgets (FLOPs) and loss contours (IsoLoss) to identify the best model.
2. Observations: the IsoLoss contours show constant loss levels for different combinations of model size and budget. The efficient frontier (the red line) connects the models with minimum loss for each budget. Along each IsoFLOPs line, the optimal (minimum-loss) model is identified.
3. Key result: this approach provides a global analysis, making it possible to find the best model as a function of budget and loss. E.g., at 10^20 FLOPs, the optimal model is the one that minimizes the loss along that IsoFLOPs line.

LLama
2023; autoregressive, decoder-only, trained on open datasets; different model sizes, from 1B to 90B.
Other families: GPT-Neo/GPT-J: open-source alternatives to the GPT family. Mistral (MistralAI): a wide variety of model sizes, code-tuned versions (for 80+ languages), multimodal versions (Pixtral). GLM (Zhipu AI): General Language Model, more oriented toward Chinese, but it also works well in other languages, including English. Falcon (Technology Innovation Institute): different-sized models; they also released a Mamba-based model (state-space language models!).

Metrics, Tasks, Benchmarks
Geometric mean
The geometric mean penalizes the presence of low values more than the arithmetic mean. Multiplying many values leads to numerical instability, so it is computed in log space instead: GM = exp((1/n) Σ_i log x_i).

Perplexity
Perplexity is a metric that quantifies how uncertain the model is in predicting the (correct) next word. High perplexity: the model is uncertain about the "correct" next word. For a single prediction, it is the reciprocal of the probability of the correct word: with probability 0.25, the perplexity is 4, as if the model were undecided among 4 options of equal probability (0.25). Over a sequence, it is the exponential of the average negative log-probability of the correct tokens. Meaning: if the perplexity is low (close to 1), the model is very sure of its prediction; if it is high, the model spreads its probability over many options, and is less effective. The example above is for an uncertain model (computed normally, the value is 52.1).
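A minimal sketch of the perplexity computation, matching the "4 equally likely options" example above:

```python
import math

# Perplexity: PPL = exp(-(1/N) * sum(log p(w_i | context))).
# If the model assigns probability 0.25 to every correct word, PPL = 4:
# the model is as uncertain as a uniform choice among 4 options.
def perplexity(probs_of_correct_tokens):
    n = len(probs_of_correct_tokens)
    return math.exp(-sum(math.log(p) for p in probs_of_correct_tokens) / n)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```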
BLEU
BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate a generated sequence when a reference one is available. Brevity penalty: if a model generates a very short sequence (w.r.t. the reference), it is easier to obtain a high precision, so short outputs are penalized. BLEU is effective when we need to match exact results. CONs: BLEU does not care about the semantics of the results, so semantically similar sentences are not accepted as valid; no consideration of fluency or meaning; garbage sentences may get relatively high BLEU scores; word order is ignored beyond n-grams.

BERT Score
Tokenize G and R (the generated and reference sequences) and get their output vectors via BERT. Compare each generated token against each reference token with a similarity function, then compute precision and recall from the best matches (sketched at the end of this subsection). PROs: solves the semantic-similarity problem. CONs: does not explicitly consider order; relies on an external model (computationally not ideal, and it inherits that model's limits); does not offer a clear interpretation ("BERT simply says so").

Other metrics
- Exact Match (EM): 1 if the match is correct, 0 otherwise.
- Ranking: for each token, we can assign a rank to the right word.
- Task-specific metrics.
- Human evaluation (to measure coherence, creativity, fluency, ...): rating scales, pairwise comparisons.
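A BERT Score-style sketch of greedy cosine matching; random vectors stand in for real BERT outputs (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 8))   # vectors of 4 generated tokens (stand-in)
R = rng.normal(size=(5, 8))   # vectors of 5 reference tokens (stand-in)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim = np.array([[cosine(g, r) for r in R] for g in G])
precision = sim.max(axis=1).mean()  # best match for each generated token
recall = sim.max(axis=0).mean()     # best match for each reference token
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```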
Tasks
- LAMBADA: a collection of narrative passages; guess the final word. Metrics: accuracy, perplexity, rank.
- ROCStories, HellaSwag, StoryCloze: short stories; provide a story and possible endings (only one is correct). They measure the capability of detecting the right answer (precision, accuracy, ...) and of generating it (PPL, BLEU).
- Question answering: can the model answer questions? Two scenarios. Open book: allow the model to search for the answer (via the prompt or information retrieval). Closed book: measure what the model already knows.
- Translation, summarization: BLEU, ...
- Natural language inference:
  - Entailment (true): the hypothesis is true, given that it is implied by the premise. Example: premise "A man is playing a guitar.", hypothesis "Someone is making music." → entailment.
  - Contradiction (false): the hypothesis is false with respect to the premise. Example: premise "A man is playing a guitar.", hypothesis "No one is playing any instrument." → contradiction.
  - Neutral (undetermined): there is no certain relationship between the premise and the hypothesis (we cannot tell whether the hypothesis is true or false). Example: premise "A man is playing a guitar.", hypothesis "The man is a professional musician." → neutral.
- Grammatical acceptability: a binary classification task.

Benchmarks
- GLUE (General Language Understanding Evaluation): provides a single-number evaluation that combines performance across tasks.
- SuperGLUE: introduces more difficult tasks.
- MMLU (Massive Multitask Language Understanding): focused on question answering.
LLM contamination: benchmarks may end up in the training corpus.

Tuning and Model Alignment
HHH objectives: Helpful, Honest, Harmless.
T0 is a T5-inspired model: pretrained on the masked LM task, then fine-tuned on a mixture of multitask Q/A pairs, with each task phrased as a question, multiple rephrasings of the same task, and sometimes the task inverted → improved performance on zero-shot/new tasks. Instruction tuning improves zero-shot performance on unseen tasks.
LMs are limited by: poor metrics (e.g., ROUGE) that do not capture information about quality; poor objectives (e.g., cross-entropy) that do not distinguish between important errors and minor ones.
How to solve it, a 3-step approach:
1. Collect human feedback.
2. Train a reward model.
3. Fine-tune the model to learn the "human" feedback.
CONs of humans: cost and scalability; inconsistency between humans; simplicity of the feedback. → Train a reward model (possibly an LM itself). Its loss penalizes predicted scores that do not correspond to the human ones. How does this model perform? Larger models achieve better results; more annotated data improves results; results get close to the performance of a single human, but not as good as an ensemble of humans.
Fine-tune the model on human feedback. Procedure:
1. Copy the original model (π_old) to create a new model (π_new).
2. Optimize π_new to maximize the score computed by the reward model (r_θ), which rates how pleasing the output is to the user.
3. Use PPO (Proximal Policy Optimization) to update the policy π_new, balancing stability and improvement.
Note: the "reward" is analogous to a loss, but it must be maximized.
Reward definition:
R(x, y) = r_θ(x, y) - β log[π_new(y|x) / π_old(y|x)]
- r_θ(x, y): the reward model's assessment of the output quality.
- β: the weight of the regularization term (the log-ratio of the policies), whose purpose is to penalize outputs that deviate too much from π_old.
- KL divergence: the log-ratio measures how much π_new differs from π_old, preventing drastic updates and guaranteeing stability.
However, r_θ is a proxy for human preference, not the actual human preference.
Comparing: pretrain only (no fine-tuning); supervised learning (fine-tuned on a dataset); human feedback (the previous one, further fine-tuned to improve the human-feedback reward) → the last is preferred by humans.

InstructGPT
1.3B vs 175B (GPT-3): more truthful, less toxic, aligned to the annotators, generalizes to new tasks. The new trend:
1. Pretrain on large quantities of dirty data.
2. Collect smaller, higher-quality datasets (with human feedback).
3. Use RLHF to align the model to user preferences.

Efficient fine-tuning and inference
Fine-tuning: all of the original model's weights can change. PROs: performance comparable to training from scratch; a smaller dataset can be used. CONs: resource-intensive for large models.
Feature-based transfer: freeze the backbone, train only the head. PROs: less resource-intensive; works well when the original and new tasks are similar. CONs: complex tasks require deeper changes; performance is sub-optimal.
Parameter-Efficient Fine-Tuning (PEFT): techniques used to reduce the computational cost of fine-tuning by reducing the number of parameters to update: BitFit, adapter layers, LoRA, prompt tuning.

Bias-terms Fine-tuning (BitFit)
Only the bias terms of the model are updated (~0.1% of the parameters for BERT): sufficient to get performance similar to full fine-tuning.

Adapters
Introduce additional layers in between the existing layers; only these layers are trained. An adapter is a simple fully-connected model. Adapters injected into a pretrained model initially behave as an identity function (a residual connection is used); then the layers adapt. In BERT: two adapters per layer, ~100k params/layer → ~7M parameters, with performance similar to fine-tuning.

Low-Rank Adaptation (LoRA)
We can store a single instance of the pretrained model and add the appropriate A, B matrices for each task needed. At inference time, we can precompute W′.
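A minimal LoRA sketch (NumPy; the sizes and the "pretend training" step are illustrative assumptions):

```python
import numpy as np

# LoRA: the frozen pretrained weight W gets a low-rank update B @ A
# (rank r << d). Only A and B are trained; at inference time the merged
# W' = W + B @ A can be precomputed once.
d, r = 512, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))           # frozen pretrained weights
A = rng.normal(size=(r, d)) * 0.01    # trainable (Gaussian init)
B = np.zeros((d, r))                  # trainable (zero init: W' = W at start)

B += rng.normal(size=(d, r)) * 0.01   # pretend some training happened
W_merged = W + B @ A                  # precomputed W' for inference
x = rng.normal(size=d)
np.testing.assert_allclose(W_merged @ x, W @ x + B @ (A @ x), atol=1e-8)
print("trainable params:", A.size + B.size, "vs full layer:", W.size)
```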
Prompt tuning
Adds extra information: we add a fixed set of special tokens to the prompt and allow fine-tuning only those special tokens. Instead of choosing the right words, we create them. Instead of retraining the entire model (which is computationally expensive), only the prompt is optimized, to steer the model's behavior toward the desired task.

Other optimization techniques
Making the same model smaller: quantization, reduction of floating-point precision, or building smaller versions of models (distillation).

Quantization
The process of mapping continuous (floating-point) values to discrete ones = reducing the range of allowed values for weights/activations. PROs: reduces storage/memory requirements; improves computational efficiency. CONs: loss in precision.
All mappings are based on a scale and a zero-point.
- Absmax quantization: scales values symmetrically; 0 is preserved (it maps to 0).
- Zero-point quantization: asymmetric; makes more efficient use of the range of possible values for asymmetric distributions.
Post-Training Quantization (PTQ): the model is trained normally, then its weights and/or activations are quantized afterwards.
Quantization-Aware Training (QAT): quantization is incorporated during training. Forward pass: uses a fake-quantized version of the weights/activations (all values are kept in full precision, but rounded). Backward pass: gradients in full precision. The quantization parameters are learned: better performance, but it requires intervening in training.
How do we determine the scale and zero-point? For weights, they can be computed beforehand. For activations: static quantization (pre-compute scale and zero-point; faster) or dynamic quantization (compute scale and zero-point for each activation separately; no calibration step required, but more computationally expensive).

LLM.int8()
- Vector-wise quantization: compute scaling constants for each row/vector of the matrices.
- Mixed-precision decomposition: decompose the few outlier weights/activations with large magnitudes (and handle them at higher precision).

Reduce floating-point precision
This is not quantization, just a reduction in precision: model = model.half()

Model distillation
Distillation is generally done with a teacher/student paradigm. Teacher: the original (larger) model we want to reduce in size, used as ground truth for the student. Student: a smaller model that mimics the teacher; it is trained to predict the teacher's probability distribution. By learning from the teacher, the student receives information unavailable in the ground truth, and can achieve performance comparable to the original model.
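A minimal sketch of the distillation objective (NumPy; the logits and the temperature are illustrative assumptions):

```python
import numpy as np

# Distillation: the student is trained to match the teacher's
# (temperature-softened) probability distribution via cross-entropy.
def softmax(z, T=1.0):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

teacher_logits = np.array([3.0, 1.0, 0.2])   # stand-in teacher outputs
student_logits = np.array([2.0, 1.5, 0.3])   # stand-in student outputs

T = 2.0                                      # softening temperature
p_teacher = softmax(teacher_logits, T)       # soft targets
p_student = softmax(student_logits, T)
loss = -np.sum(p_teacher * np.log(p_student))  # cross-entropy to the teacher
print(round(float(loss), 3))                 # minimized during training
```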
Potpourri
Mixture of Experts
A technique used to increase model size without significantly increasing the computational cost of its execution. Experts = different versions of one layer; only one (or a few) experts are used for any prediction.
Mixtral: a sparse Mixture of Experts, going from the 7B base architecture to ~13B active params at inference.

CLIP
A vision-language model. Text and images are aligned in the same vector space, so that the vector of the image of a dog and the vector of the sentence "a picture of a dog" are close in the shared space. The vision modality (images) is encoded with a vision encoder: a ResNet, or a Vision Transformer (ViT). The text modality is encoded with a decoder-only model (similar to GPT-2).
Contrastive learning alignment. CLIP is not a generative model: it can simply find text-image similarities.

LLaVA
LLaVA is an instruction-tuned, multimodal LLM: a model designed to combine visual and textual information, enabling it to answer questions about images or follow other visual instructions.
1. Image encoding: a Vision Transformer (ViT) converts an image into a numerical representation made of visual tokens.
2. Visual-textual alignment: the visual tokens are projected (aligned) into a space shared with text through a learned matrix W.
3. Multimodal input preparation: the visual tokens are prepended to the textual input, such as a question or a textual instruction. Example: visual tokens (from the image): [VisualToken1, VisualToken2, ...]; text (question): "What is in the image?"
4. Instruction-tuned LLM: a pretrained language model, such as Vicuna, is further fine-tuned to answer questions that combine text and images.

LMSE Intro
Functional properties. Non-functional properties: usability, efficiency, reliability/availability, maintainability, security, safety, dependability, portability.

The Waterfall Model
A linear, sequential approach to the software development lifecycle. Advantages: uses a clear structure; determines the end goals early; transfers information well. Disadvantages: makes change difficult; excludes the client and/or end user; delays testing until after completion.

The V-Model
The V-Model is focused on verification & validation of software. It mandates that, for every stage in the development cycle, an associated testing phase is considered: the testing activities start immediately in any phase (the tests are first prepared, then executed). Advantages: more control and higher software quality. Disadvantages: more expensive than waterfall; still, design only happens once.

Iterative Model
In the iterative model, the process starts with a simple implementation of a small set of the software requirements, which iteratively enhances the evolving versions until the complete system is implemented and ready to be deployed. Advantages: easily adaptable to the ever-changing needs of the project as well as the client. Disadvantages: not suitable for smaller projects; defining increments may require the definition of the complete system.

Agile
1. Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.
2. Welcome changing requirements, even late in development. Agile processes harness change for the customer's competitive advantage.
3. Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale.
4. Business people and developers must work together daily throughout the project.
5. Build projects around motivated individuals. Give them the environment and support they need and trust them to get the job done.
6. The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.
7. Working software is the primary measure of progress.
8. Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.
9. Continuous attention to technical excellence and good design enhances agility.
10. Simplicity - the art of maximizing the amount of work not done - is essential.
11. The best architectures, requirements, and designs emerge from self-organizing teams.
12. At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.
Agile is guided by the principles of the Agile Manifesto (2001), which promotes iterative, collaborative, and flexible development. Its four values:
1. Individuals and interactions over processes and tools.
2. Working software over comprehensive documentation.
3. Customer collaboration over rigid contracts.
4. Responding to change over following a fixed plan.
Key principles: iterative and incremental development; frequent delivery of value (e.g., working software in short sprints); adaptability to change; constant customer involvement.

SCRUM
Scrum is an Agile framework used to manage complex projects, particularly in software development. It is based on short, focused iterations called sprints, which make it possible to deliver incremental value quickly and adaptively.

LLM4SE
- LLM Application: any task or activity that benefits from LLM insights.
- LLM Consumer: any individual, system, or process that utilizes LLM outputs.
- Assured LLMSE: an approach that guarantees the reliability of LLM outputs.
Encoder-only models are used for comprehensive understanding; encoder-decoder models for understanding the input followed by content generation; decoder-only models for generation tasks.

Datasets
- Open-source datasets: disseminated through open-source platforms or repositories; trusted, because they are used in real projects.
- Collected datasets: gathered directly from a multitude of sources (websites, forums, blogs), e.g., collected user stories.
- Constructed datasets: built by modifying or augmenting existing datasets (for specific research); used for test cases because they are well formatted.
- Industrial datasets: from commercial or industrial entities (proprietary or sensitive information); less trusted.
- Graph-based datasets: usable, for example, to represent the GUI states of an application to develop or test.
- Software-repository-based datasets: e.g., Git repositories containing code, documentation, and related artifacts.
- Combined datasets.
Examples. Text-based: Mostly Basic Python Programming (MBPP), a benchmark of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard-library functionality, and so on; each problem consists of a task description, a code solution, and 3 automated test cases. Post2Vec (Stack Overflow posts). Code-based: CodeSearchNet, 2 million (comment, code) pairs from open-source libraries. Software-repository-based: DeHallucinator, which uses full projects mined from GitHub as a dataset to fine-tune an LLM agent. Graph-based: RICO, mining Android apps at runtime.

Evaluation Metrics
For classification tasks, the most commonly used metrics are Precision and F1-score. For recommendation tasks, MRR (Mean Reciprocal Rank) is the most frequent metric. For generation tasks, metrics like BLEU and Pass@k are used.
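A minimal sketch of MRR, the recommendation metric named above (the rankings are illustrative):

```python
# Mean Reciprocal Rank (MRR): the average of 1/rank of the first correct
# item, over all queries.
def mrr(ranked_lists, correct_items):
    total = 0.0
    for ranking, correct in zip(ranked_lists, correct_items):
        for rank, item in enumerate(ranking, start=1):
            if item == correct:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

rankings = [["a", "b", "c"], ["b", "a", "c"]]
correct = ["a", "a"]           # first query: rank 1; second query: rank 2
print(mrr(rankings, correct))  # (1 + 0.5) / 2 = 0.75
```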
Requirements elicitation
Elicitation of a Software Requirements Specification (SRS) from natural language: generation using pretrained models (Code-LLAMA and GPT); interpretation of the quality of the generated requirements; after the quality interpretation, the LLMs are asked again to correct the requirements.
Requirements classification: classify requirements into functional and non-functional.
Code completion. Usability measures: SUS, the System Usability Scale [0, 100]; the User Experience Questionnaire (UEQ) [-3, 3]; the Net Promoter Score (NPS) [0, 10].
Code summarization: generate descriptions for code.
Test generation.

Prompt Engineering
An alternative to fine-tuning that adapts pretrained LMs as if they were fine-tuned language models.
What makes a good prompt: clear and concise language; assigning a persona to the LM; providing examples and information; providing a specific format for the output; continuously refining the prompts (reiteration).
A prompt is composed of: context, instruction, input data, output indicator.
Priming: the practice of providing some initial input to the model before generating a response. Variants: tabular-form prompting, fill-in-the-blank prompting, perspective prompting.
RGC: Role, Result, Goal, Context, Constraint. Role: the LM's persona. Result: the desired output. Goal: the purpose of the output. Context: who, what, where, why. Constraint: limitations and guidelines.
"I want you to act as" prompting: "I want you to act as...", "I will give you...", "You will then...", "In a tone / style...", "The important details are...".
Generate Knowledge Prompting: the knowledge used in the context is generated by another model and included in the prompt to make a prediction; the highest-confidence prediction is then used.
Chain-of-Thought (CoT) prompting: INPUT → STEP 1 → OUTPUT → STEP 2 → OUTPUT → STEP 3 → OUTPUT.
Self-Consistency with CoT (CoT-SC): generate diverse reasoning chains and then identify the most consistent final answer.
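A minimal sketch of CoT-SC's majority vote; `ask_llm` is a hypothetical stand-in for a real model call (not an actual API):

```python
import random
from collections import Counter

random.seed(0)

def ask_llm(prompt, temperature=0.8):
    # Stand-in: a real call would sample a reasoning chain and an answer.
    chain = "step 1 ... step 2 ..."
    answer = random.choice(["42", "42", "41"])  # noisy final answers
    return chain, answer

def cot_self_consistency(prompt, k=5):
    # Sample k diverse chains, keep only the final answers, majority-vote.
    answers = [ask_llm(prompt)[1] for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(cot_self_consistency("Q: ...? Let's think step by step."))
```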
Tree of Thoughts (ToT): the same idea, but structured as a decision tree.
1) Thought decomposition: unlike Chain of Thought, which generates thoughts sequentially without an explicit decomposition, Tree of Thoughts splits the problem into intermediate steps based on the properties of the problem itself. This decomposition makes it possible to generate several options (thoughts) for solving each part of the problem. Why it is useful: it makes the reasoning more structured and increases the probability of finding a valid solution, especially for complex problems.
2) Thought generation: for each intermediate step, k candidate solutions or thoughts are generated; each candidate represents a possible path toward solving the problem. Example: if the problem is a move in a chess game, the model could generate several possible moves and evaluate which one is best.
3) State evaluator: each state (or candidate thought) is evaluated for its progress toward the final solution, in order to decide which states to keep and explore further. Two main evaluation methods: 1. Evaluation: the model assigns a score (e.g., from 1 to 10) or a classification (e.g., "possible" or "impossible") to each state. 2. Voting: the states are compared against each other, and the best ones are chosen through a collective vote (e.g., based on specific prompts). Why it is useful: it works as a heuristic, guiding the search toward more promising states and discarding less useful paths.
4) Search algorithm: once the candidates are generated and evaluated, a search algorithm explores the states in an organized way, following a tree structure. Two main approaches: 1. Breadth-First Search (BFS): explores in breadth, keeping a set of the best states at each level of the tree; suited to problems where many options must be considered simultaneously. 2. Depth-First Search (DFS): explores in depth, focusing on the most promising state until a solution is found or the path is determined to be useless; suited to problems where a single path should be examined at a time. Why it is useful: it balances exploration (testing new possibilities) and exploitation (following the most promising path).

Structured Chain-of-Thought (SCoT)
A technique for code generation, motivated by the fact that human developers follow structured programming with three program structures (sequence, branch, and loop).

Reduce Hallucinations
RAG (Retrieval-Augmented Generation): optimizing the output of an LLM by making it reference an authoritative knowledge base, other than the one it was trained on, before generating an answer. It addresses: hallucinations, obsolescence, dependability, confusion. The model creates a query, which is an elaborated or transformed version of the prompt, to search for relevant information. The system uses the query to retrieve pertinent information from external knowledge sources (e.g., databases, documents, APIs). Knowledge sources: these can include content not directly present in the LLM, such as scientific articles, web pages, or company documents. The system returns the relevant information from the external sources: this is the step where the context is enriched, and the extracted information is selected to be the most useful with respect to the original prompt.

Problems
- Prompt injection (e.g., via Twitter, using the AI to do whatever you want).
- Prompt leaking: confidential or security-related information is intentionally extracted.
- Jailbreaking: circumventing the safety systems or the restrictions imposed by the model's creators.

Agent Architecture
Agentic behavior: the model becomes an "agent" that can make decisions about which steps to take, which tools to use, and when to terminate a process.

LLM Chain
An LLM chain is a series of prompts and operations that guide an LLM through a sequence of tasks.
- Sequential chain: a simple sequence. CONs: not feasible for cases with multiple inputs and multiple outputs.
- Tree chain: may handle multiple inputs and outputs.
- Router chain: for complicated tasks. If we have multiple subchains, each specialized for a particular type of input, we can have a router chain that decides which subchain to pass the input to. It consists of: the router chain, responsible for choosing the next chain; the destination chains, which the router chain can route to; and a default chain, used when the router can't decide.
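A minimal sketch of the router-chain idea; plain functions stand in for LLM prompt pipelines, and the keyword-based router is an illustrative stand-in for an LLM-based one:

```python
# Router chain sketch: a router picks the specialized destination chain;
# a default chain handles everything else.
def math_chain(q):    return f"[math chain] solving: {q}"
def code_chain(q):    return f"[code chain] writing code for: {q}"
def default_chain(q): return f"[default chain] answering: {q}"

def router(q):
    # A real router chain would itself be an LLM call; keywords stand in.
    if any(w in q.lower() for w in ("sum", "integral", "solve")):
        return math_chain
    if any(w in q.lower() for w in ("python", "function", "bug")):
        return code_chain
    return default_chain

for q in ("Solve x + 2 = 5", "Write a Python function", "Tell me a story"):
    print(router(q)(q))
```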
1. Agentic AI System
Definition: an AI system that is able to set its own goals, plan actions to reach them, and make independent decisions based on its understanding of the environment and the task. Main characteristics: initiative (it does not just wait for input, but can act proactively); adaptability (it adapts to changes in the environment); learning (it can learn from its own experience to improve over time).

2. Reactive AI vs Agentic AI
Reactive AI: simply responds to the inputs it receives. Example: a virtual assistant that provides a weather report in response to "What is the weather today?".
Agentic AI: responds to inputs but goes beyond them, taking initiative based on the context. Example: a virtual assistant that not only gives the weather report, but also suggests appropriate clothing, proposes activities based on the weather (indoor or outdoor), and sets reminders for weather-dependent activities.

3. LLM Chains
What they are: sequences of operations or steps that language models (LLMs) execute to solve complex problems. Role: they serve as the foundation for agentic behavior, making it possible to decompose complex tasks into simpler steps, reason through multi-stage problems, and interact with the environment in more sophisticated, contextualized ways.

4. From Chains to Agents
Evolution: LLM chains provide the ability to reason and interact; agentic systems combine these capabilities with the possibility of making autonomous decisions and pursuing their own goals.

Agent: the central decision-making entity that orchestrates the use of tools, memory, and planning. It determines how to tackle a given task by selecting the most suitable tool or planning strategy based on the problem's complexity.
Tools: specialized extensions that expand the agent's capabilities. Examples include the Calendar, which helps schedule events or track deadlines; the Calculator, used for precise mathematical operations; the Code Interpreter, enabling coding or data-analysis tasks; and the Search tool, which fetches real-time information from external sources like the internet. For example, if tasked with planning a project timeline, the agent might use the Calendar to allocate dates and the Calculator to optimize resource allocation.
Memory: divided into short-term memory and long-term memory. Short-term memory retains temporary information, such as recent instructions or intermediate results, while long-term memory stores persistent knowledge for future reference. For instance, the agent might use short-term memory to temporarily store search results or a conversation context, and long-term memory to store a user's preferences, allowing for personalized assistance over time.
Planning: the planning module allows the agent to devise strategies to solve problems; this involves the use or combination of several different strategies.
Reflection: a meta-cognitive process that allows the agent to evaluate its past decisions and actions to identify areas for improvement. This capability ensures that the agent learns from its successes and failures.
Self-critique: enables the agent to analyze its own performance critically and suggest refinements. It operates as an internal feedback mechanism, helping the agent improve its reasoning and outputs.
Chain-of-Thought: involves sequential reasoning to tackle complex, multi-step problems. The agent progresses logically, step by step, ensuring clarity and coherence in problem solving. This approach minimizes errors and makes the problem-solving process transparent, facilitating debugging and iterative refinement.
Subgoal decomposition: the process of breaking down a large, complex problem into smaller, more manageable tasks or milestones. This strategy enables the agent to approach challenges in an organized manner, ensuring steady progress toward the overall objective. This modular approach ensures focus, reduces overwhelm, and enables flexible adjustments if issues arise in individual subgoals. (An application example is in the slides.)
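A minimal sketch tying the components above together: an agent loop that picks a tool per step, observes the result, and stops when done (the tools and the hard-coded decision are illustrative stand-ins for LLM calls):

```python
# Toy agent loop: decide -> act -> observe -> repeat.
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),        # toy tool
    "search":     lambda q: f"top result for '{q}'",   # toy tool
}

def agent_decide(task, history):
    # A real agent would prompt an LLM with the task and the history;
    # here a single plan step is hard-coded for illustration.
    if not history:
        return ("calculator", "2 + 2")
    return ("stop", None)

def run_agent(task):
    history = []  # short-term memory: (tool, argument, observation)
    while True:
        tool, arg = agent_decide(task, history)
        if tool == "stop":
            return history
        history.append((tool, arg, TOOLS[tool](arg)))

print(run_agent("What is 2 + 2?"))
```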
The memory model
Memory retrieval: enhances decision-making by extracting relevant information from an agent's memory, including environmental perception, past interactions, experiential data, and external knowledge. Short-term memory: retrieves the entire body of information. Long-term memory: uses filtering mechanisms to extract only the most relevant memories.
Memory reflection: the process by which agents improve themselves by summarizing, refining, and reflecting on the historical interactions and learned experiences stored in memory; this enhances adaptability to new environments and tasks. Self-updating: agents automatically update their memory with new knowledge for self-recognition. Multi-agent environments: a central LLM-based agent oversees memory reflection for the individual agents.
Memory storage: information is stored primarily as natural-language text but can include multi-modal data (e.g., visual, audio). The storage format is tailored to the task and data modality, enabling agents to utilize information effectively in complex environments.
Memory modification: adjusts memory by assessing new information against existing data to decide whether to add the new information, merge it with existing data, or replace erroneous information.

Utilization of Knowledge
Knowledge utilization: integrates external knowledge (beyond memory) into LLM-based planning. Sources: leverages up-to-date textual, visual, and audio data. Techniques: includes retrieval-augmented generation and real-time web scraping to enhance accuracy and context. Goal: combines internal capabilities with external information to improve planning and decision-making.
Visual knowledge: encoded as continuous embeddings (e.g., visual Transformer encodings, object-centric representations) integrated with text for multi-modal understanding.
Audio knowledge: includes speech and audio events, represented through speech encoders or spectrograms. Speech is discretized and embedded into a shared vector space with text.
Database and knowledge-base queries: access structured data from sources like the Google Knowledge Graph or PubMed to integrate reliable information with LLM outputs, enhancing response accuracy. Example: ChatDB uses SQL queries to fetch logical data for agents.
Web scraping and API calls. Web scraping: automates data extraction from web pages, useful for large-scale, diverse data collection (e.g., news, market trends). API calls: fetch specific, up-to-date data (e.g., weather, news, financial updates) via APIs, enabling real-time analysis.
Retrieval-Augmented Generation (RAG): combines retrieval mechanisms with generative models to produce context-rich responses (a retrieval sketch follows at the end of this section). Effective for open-domain tasks like Q&A or conversational agents, using textual, semi-structured (e.g., PDFs), and structured data sources.
Challenges in knowledge extraction: ensuring timely and accurate information is crucial for LLMs; efficient methods to incorporate new knowledge are needed to keep models up to date.
Hallucination: when LLMs generate inaccurate or unrealistic text. Mitigation strategies include integrating external knowledge bases and fact-checking systems, such as RAG models.
Reducing bias: addressing biases, class imbalances, and issues stemming from the training data by rebalancing datasets, using advanced sampling techniques, and developing new evaluation metrics to improve fairness and robustness.
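A minimal RAG retrieval sketch: word-overlap scoring stands in for a real embedding model, and the documents are illustrative:

```python
# RAG sketch: retrieve the most relevant documents for a query and
# prepend them to the prompt before calling the LLM.
DOCS = [
    "Chinchilla was trained on 1.4T tokens.",
    "BLEU compares n-grams against a reference.",
    "Scrum organizes work into short sprints.",
]

def score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)   # Jaccard overlap as an embedding stand-in

def retrieve(query, k=1):
    return sorted(DOCS, key=lambda d: score(query, d), reverse=True)[:k]

query = "How many tokens was Chinchilla trained on?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # the enriched prompt is then sent to the LLM
```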
Reasoning and Planning
One-step method: agents decompose a complex task into sub-tasks in a single reasoning and planning pass. The sub-tasks are sequentially ordered, with each step logically following the previous one, leading to the final objective.
Multi-step method: involves iterative reasoning, with multiple cycles of LLM invocations. Each cycle generates incremental steps based on the current context, while ensuring consistency with the overall objective.

Agent interaction schemes
1. Cooperative. Goal: agents work together to achieve a shared objective. Process: goal setting and task decomposition; information sharing and collaborative decision-making; task execution with feedback to optimize strategies. Key features: communication and consensus building.
2. Adversarial. Goal: agents compete to maximize their own interests. Process: goal setting and strategy formulation; interaction through competitive "games"; evaluation of the results and strategy adjustment for future competition. Example: ChatEval.
3. Mixed. Goal: a balance between cooperation and competition. Types: parallel (agents collaborate independently on separate tasks, sharing some information, followed by a competitive phase); hierarchical (parent agents set goals, decompose tasks, and delegate to child agents, who execute the tasks and provide feedback in a tree-structured hierarchy).

Evaluation
Generated requirements
Precision and recall:
1. manually or automatically review the extracted requirements;
2. compare them with the ground truth;
3. compute precision, recall, and F1;
4. use the metrics to iterate on the LLM and improve it.
Precision = TP / (TP + FP): high precision means the model generates fewer irrelevant or incorrect requirements.
Recall = TP / (TP + FN): how many of the total relevant requirements were correctly extracted.
F1 Score = 2 · Precision · Recall / (Precision + Recall): a high F1 indicates a good balance (a worked sketch follows below).
Problem: variations in wording may carry the same meaning. Solutions: a predefined synonym list; word embeddings compared against a similarity threshold T; sentence embeddings (BERT); text preprocessing ("Browsed" → "Browse").
TPR: how many of the expected items I found. FPR: the ability to discriminate (negatives incorrectly identified as positives). ROC: TPR vs FPR; the larger the area, the better the classification.
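The worked sketch referenced above: precision, recall, and F1 for generated requirements against a ground-truth set (exact matching is assumed for simplicity; the requirements are illustrative):

```python
def prf1(generated, ground_truth):
    tp = len(set(generated) & set(ground_truth))   # correctly extracted
    fp = len(set(generated) - set(ground_truth))   # irrelevant/incorrect
    fn = len(set(ground_truth) - set(generated))   # missed requirements
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

generated = ["login via email", "export report", "dark mode"]
truth = ["login via email", "export report", "two-factor auth"]
print(prf1(generated, truth))  # approximately (0.667, 0.667, 0.667)
```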
It may also be important to evaluate the quality of the requirements.

Requirements quality measures: INVEST
INVEST: administer a questionnaire asking whether each generated requirement (or user story) is:
- Independent: check whether the requirement stands alone.
- Negotiable: ensure the requirement is open to change and discussion.
- Valuable: ensure the requirement adds tangible value to users or stakeholders.
- Estimable: check whether the requirement is clear enough for estimation.
- Small: verify that the requirement is small and actionable.
- Testable: ensure the requirement can be tested and validated.

SRS quality measures
Per-requirement grading: unambiguous, understandable, correct, verifiable (finite). Document-wide grading: internal consistency, non-redundancy, completeness, conciseness.

Generated Design
Class diagrams, sequence diagrams, use-case diagrams.
Structural metrics
Structural metrics are measurements that focus on the design and architecture of a system, specifically on how different components, modules, or classes are organized and interact.
- Missing dependencies: two components depend on each other, but the dependency is not represented. Missing dependency count = total expected dependencies - total established dependencies.
- Misplaced dependencies.
For quality: syntactic, semantic, pragmatic.

Evaluating code generation
Functional correctness
Assess the accuracy of the generated code by determining how many test cases pass (so a correct test suite is needed):
1. read the requirements;
2. write the skeleton of the classes;
3. write ground-truth test cases.
PassRate = (passed test cases / total number of test cases) · 100
Ensure that the tests cover: positive cases (valid inputs, and the function succeeds); edge cases (boundary values); negative cases (invalid inputs or errors).

Static code quality metrics
Evaluate code without executing it; analyze its structure, complexity, and maintainability.
Cyclomatic Complexity (CC)
Measures complexity by counting the linearly independent paths of a control-flow graph:
CC = E - N + 2P
where E is the number of edges in the flow graph, N the number of nodes, and P the number of connected components (functions calling each other); for a single function, P = 1.
1-10: simple code, easy to maintain. 11-20: moderate complexity, requires careful review and testing. 21-50: high complexity, needs refactoring to improve maintainability. 50+: very complex, refactor or reconsider the design.
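A small worked example of cyclomatic complexity, using the equivalent "1 + number of decision points" counting for a single function (the function itself is illustrative):

```python
def classify(x):
    if x < 0:            # decision point 1
        return "negative"
    elif x == 0:         # decision point 2
        return "zero"
    for _ in range(3):   # decision point 3 (loop condition)
        x -= 1
    return "positive"

# 3 decision points -> CC = 1 + 3 = 4:
# simple code, easy to maintain (the 1-10 band above).
```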
Maintainability Index (MI)
Combines several factors (e.g., cyclomatic complexity, lines of code, and Halstead metrics) into a single score; implemented in Visual Studio. Between 0 and 20: very hard to maintain; above 100: exceptional maintainability.
Halstead Volume. LOC (excludes the def line). Other metrics: code smells (long methods or classes, excessive nesting, duplication), code duplication, indentation, comment density.

Runtime performance quality metrics
Execution time; throughput (the number of operations the program can perform in a given period); memory consumption; CPU time used; error rate (the frequency of errors or exceptions during code execution).

Code-specific similarity metrics
BLEU is not sufficient, as it does not consider the characteristics of programming languages. CodeBLEU combines: a weighted n-gram match; a syntactic AST match (syntactic information that considers the tree structure of the code); a semantic data-flow match (the flows in the code).

Feedback-based evaluation
- Blind peer review: reviewers assess code snippets generated by different models without knowing the identity of the models.
- Real-world evaluation: assess the code in real-world tasks.
- Readability evaluation: reviewers consider naming conventions, comments, and the code logic.
- Maintainability evaluation: reviewers evaluate whether the code is reasonably divided into modules, and check the documentation and comments.

Evaluating test case generation
Test coverage
High coverage reduces the risk of hidden bugs: line coverage; branch coverage (decision points).
Execution success rate
Here we start from the assumption that the reference code is correct:
SuccessRate = (passing test cases / total test cases) · 100
Typical when we create test cases for the purpose of regression testing. Regression testing is a software-testing practice that ensures recent changes to the codebase, such as bug fixes, feature updates, or refactoring, do not introduce new defects into previously tested and functioning areas of the software.

Mutation analysis
Introduce small changes (mutations) into the code; the goal is to determine whether the test suite can "kill" these mutants by failing when encountering them:
MutationScore = (number of mutants killed / total number of mutants) · 100
Surviving mutant: a mutant the test suite doesn't detect.

Test flakiness
A measure over repetitions. Flaky test cases are tests that sometimes pass and sometimes fail, even when there is no change in the code or the environment: intermittent failures; inconsistent behavior (the same test may pass on one run and fail on another); false positives (the test fails although the code is correct) and false negatives (the test passes although the code is faulty). A test is considered flaky if its flakiness rate is above a certain threshold; stable: 80-90% pass rate.

Developer Feedback
Checklist for the developer: