Decoder Architectures and RAG
Decoder-only Models

Early decoder-only models used post-layernorm; as of 2023, pre-layernorm appears to be more popular [XYHZ20].

Softmax with Temperature

We use a slightly modified softmax to control the entropy of the distribution:

softmax_T(x)_i = exp(x_i / T) / Σ_j exp(x_j / T)

where T = 1 yields the regular softmax operation. Because exp(x_i / T) = (e^{1/T})^{x_i}, we see that:
▶ for T < 1, large x_i get boosted (more likely to choose the largest x_i)
▶ for T > 1, large x_i get weaker (more likely to choose another x_i)
▶ choosing a different T corresponds to varying the base of the exponential
⇒ we can control the randomness (entropy) of the result
⇒ control how “creative” vs. “precise” a model is (see the sampling sketch below)

Beam Search, Top-p and Top-k Sampling

If we sample a poor word, subsequent words might also be very poor. Solution: consider more than one candidate (see the code sketch after the GPT slides below):
▶ beam search: expand multiple paths, keeping the k most probable paths
▶ top-k sampling: sample only from the k most probable words [FaLeDa18]
▶ top-p sampling (nucleus sampling): sample only from the smallest set of most probable words whose probability sum is at least p
▶ best-of generation: generate multiple answers, return the best [RaGoGo23]
“Most probable” answers found by beam search tend to be:
▶ rather short, as probability decreases with length
▶ generic
▶ repetitive

Generative Pre-trained Transformer (GPT) [RaNa18; RWCL19]

1. Pre-training: generative language modelling task (“unsupervised”, i.e., no class labels); given a sequence of tokens U = (u_1, …, u_n) and a context window of size k, predict the next word:

L_1(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; Θ)

2. Finetuning: discriminative (supervised); given a labelled data set C consisting of input tokens x^1, …, x^m and a label y per instance, the output h^m of the final decoder block, and a task-specific learned linear layer W_y:

P(y | x^1, …, x^m) = softmax(h^m W_y),  L_2(C) = Σ_{(x,y)} log P(y | x^1, …, x^m)

3. Final objective for better generalization and faster convergence:

L_3(C) = L_2(C) + λ · L_1(C), with weighting parameter λ

4. Start and end/extract special tokens are randomly initialized. A delimiter token is also used for separating structured data, e.g., questions & possible answers (needed because the input is a single sequence).

Generative Pre-trained Transformer (GPT) [RaNa18; RWCL19] /2

Adapt the architecture to various tasks ⇒ task-specific input transformations (e.g., for entailment, premise and hypothesis are concatenated with the delimiter token; for multiple choice, the context is concatenated with each candidate answer).
Note: GPT-2 & GPT-3 use the same architecture but are much larger (1.5B and 175B parameters, respectively).
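The combined fine-tuning objective L_3(C) = L_2(C) + λ · L_1(C) above can be sketched in a few lines of PyTorch-style code. The tensor names, shapes, and the default λ = 0.5 are illustrative assumptions, not values from the slides or the paper.

import torch
import torch.nn.functional as F

# assumed shapes (illustrative):
#   lm_logits:  (batch, seq_len, vocab)   next-token predictions of the decoder
#   clf_logits: (batch, num_classes)      output of the task-specific linear layer W_y
#   tokens:     (batch, seq_len)          input token ids (targets are the shifted tokens)
#   labels:     (batch,)                  class label per instance

def gpt_finetuning_loss(lm_logits, clf_logits, tokens, labels, lam=0.5):
    # L_1: language modelling loss, predict token t+1 from tokens <= t
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    # L_2: supervised classification loss on the task label
    clf_loss = F.cross_entropy(clf_logits, labels)
    # L_3 = L_2 + lambda * L_1
    return clf_loss + lam * lm_loss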
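To make the temperature softmax and the top-k / top-p truncation from the sampling slides above concrete, here is a minimal NumPy sketch. The function names and the toy logits are illustrative, and beam search / best-of generation are omitted for brevity.

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # T = 1 recovers the regular softmax; T < 1 sharpens, T > 1 flattens the distribution
    z = logits / T
    z = z - z.max()                          # numerical stability
    p = np.exp(z)
    return p / p.sum()

def top_k_filter(probs, k):
    # keep only the k most probable tokens, then renormalize
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    # nucleus sampling: keep the smallest set of tokens whose probability mass reaches p
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1    # number of tokens needed to reach mass p
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])     # toy next-token logits
probs = softmax_with_temperature(logits, T=0.7)
probs = top_p_filter(probs, p=0.9)           # or: top_k_filter(probs, k=3)
next_token = rng.choice(len(probs), p=probs)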
Transformer-XL [DYYC19]

1. Introduces recurrence into self-attention to extend the context from a fixed size to a variable size. Instead of re-computing hidden states for each new segment, the hidden states of previous segments are reused ⇒ they act as a memory.
2. New positional encoding to avoid temporal confusion when reusing hidden states. Position is encoded relative to the distance i − j between the query position i and the key position j, up to a maximum length L_max, using sinusoidal encodings R_{i−j} (the standard sinusoid, but indexed by the relative distance instead of the absolute position).

The Hallucination Problem

Generative AI “hallucinates”:
▶ no concept of “facts” in these models
▶ minimizing randomness empirically decreases performance
▶ the most “reliable” answers tend to be short and generic
▶ creative use cases require randomness
Scaling language models further is not satisfactory:
▶ the cost of larger models grows quickly, latency increases
▶ training data needs to be scaled along with the models [HBMB22]
▶ for new information, there may be only little training data available
▶ domain-specific data may be small, but important to use
➜ connect the models to a database / search

Retrieval-Augmented Generation (RAG) [LPPP20]

Originally an encoder-decoder (seq2seq) approach, based on BERT and BART.
▶ index millions of short sequences encoded with BERT
▶ retrieve relevant documents z for the user’s query x, weighted by similarity:

p_η(z | x) ∝ exp(d(z)^T q(x))

▶ condition the word probabilities on the retrieved documents (note: each term conditions on only one document at a time; the mixture then marginalizes over the retrieved documents):

p(y | x) ≈ Σ_{z ∈ top-k} p_η(z | x) · p_θ(y | x, z)

▶ fine-tuning of the query embedding encoder q(x), but not the document encoder d(z) (which would require re-indexing all documents)
➜ modern “RAG” approaches use decoder-only generators, and do not weight results
(A code sketch of this retrieval-weighted generation follows at the end.)

Retrieval-Enhanced Transformer (RETRO) [BMHC22]

Enhanced transformer architecture
▶ extract billions of text chunks from a large corpus; store subsequent pairs [N, F] of a chunk N (the key) and its continuation F
▶ compute BERT embeddings for the keys N
▶ index them in a large vector database
▶ Chunked Cross-Attention (CCA; see the sketch at the end):
▶ find the nearest neighbors to the query chunk in the database
▶ integrate the nearest neighbors as K, V in the attention layer
▶ train a large language model using this architecture
➜ outperformed much larger models
But: appears to have never been used / continued: training effort too large?

Blenderbot [RDGJ21; SXKJ22; XuSzWe22]

Open-domain chatbot effort by Facebook (now Meta)
▶ dialogue retrieval: the bot can store and retrieve information from the chat dialogue
▶ knowledge retrieval: the bot can retrieve from a database (e.g., Wikipedia)
▶ the bot chooses whether to answer with retrieval or not (Blenderbot 1, [RDGJ21])
▶ can store summaries of earlier conversation (Blenderbot 2, [XuSzWe22])
▶ can search the internet and store/retrieve information (Blenderbot 3, [SXKJ22])
You can try it out, but you need a VPN to access https://blenderbot.ai/ from outside the U.S.
▶ tends to give short answers (much like humans!)
▶ often avoids answering and proposes to change topics
▶ in 2022, it was criticized for anti-semitism and similar undesired behavior

Maximum Inner Product Search (MIPS)

To find relevant documents in a large document collection, we need:
▶ a compact vector representation (cf. sentence embeddings)
▶ a similarity measure (Euclidean, cosine, or inner product) to rank by relevance
▶ a scalable search index
Searching in high-dimensional data is hard – approximate search is faster (a brute-force baseline is sketched below for comparison).
Popular search techniques for high-dimensional vector data:
▶ Hierarchical Navigable Small-World graphs (HNSW) [MaYa20]: precompute a nearest-neighbor graph, explore neighbors of neighbors
▶ Product quantization and inverted files [JéDoSc11]: cluster subspaces to obtain codes, concatenate to form a product code
▶ Quantized graph – a combination of both ideas
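Returning to the RAG slide above, the following NumPy sketch illustrates the retrieval-weighted mixture p(y | x) ≈ Σ_z p_η(z | x) · p_θ(y | x, z), assuming the documents are already embedded. The generator is a stand-in callable rather than the actual BART model, and all names are illustrative.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def rag_sequence_probability(query_emb, doc_embs, generator_prob, y, top_k=3):
    # query_emb:      q(x), embedding of the query            (d,)
    # doc_embs:       d(z) for every indexed document         (n_docs, d)
    # generator_prob: stand-in callable returning p_theta(y | x, z) for one document z
    scores = doc_embs @ query_emb                 # inner-product retrieval scores
    top = np.argsort(scores)[-top_k:]             # indices of the k best documents
    p_retrieval = softmax(scores[top])            # p_eta(z | x), normalized over the top-k
    # each term conditions on a single document only; the sum marginalizes over z
    return sum(p_z * generator_prob(y, z_idx) for p_z, z_idx in zip(p_retrieval, top))

# illustrative usage with random embeddings and a dummy generator
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 128))
query = rng.normal(size=128)
dummy_generator = lambda y, z_idx: 0.01           # placeholder for p_theta(y | x, z)
print(rag_sequence_probability(query, docs, dummy_generator, y="some answer"))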
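RETRO’s chunked cross-attention, referenced above, can be sketched as ordinary cross-attention in which the retrieved neighbor chunks supply the keys and values. The projection matrices, shapes, and single-head formulation below are simplifying assumptions, not the published architecture.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_cross_attention(h_chunk, neighbor_states, W_q, W_k, W_v):
    # h_chunk:         (m, d)  hidden states of the current input chunk (queries)
    # neighbor_states: (r, d)  encoded tokens of the retrieved [N, F] neighbors (keys/values)
    q = h_chunk @ W_q                     # (m, d_k)
    k = neighbor_states @ W_k             # (r, d_k)
    v = neighbor_states @ W_v             # (r, d_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v   # (m, d_k), retrieval-conditioned representation

# illustrative shapes: chunk length 64, 2 neighbors of 128 tokens each, model dim 256
rng = np.random.default_rng(0)
d, d_k = 256, 64
W_q, W_k, W_v = (rng.normal(scale=d**-0.5, size=(d, d_k)) for _ in range(3))
h = rng.normal(size=(64, d))
neighbors = rng.normal(size=(2 * 128, d))
out = chunked_cross_attention(h, neighbors, W_q, W_k, W_v)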
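As a reference point for the approximate indexes listed in the MIPS slide, exact maximum inner product search is a single matrix-vector product plus a partial sort. This brute-force baseline (with illustrative array names) is what HNSW and product-quantization indexes approximate at much lower query cost.

import numpy as np

def mips_brute_force(doc_vectors, query, k=10):
    # exact maximum inner product search: score every document, return the top k
    scores = doc_vectors @ query                  # one inner product per document
    top = np.argpartition(scores, -k)[-k:]        # unordered top-k in linear time
    return top[np.argsort(scores[top])[::-1]]     # sort the k winners by score

# illustrative usage: 100k documents embedded in 384 dimensions
rng = np.random.default_rng(0)
index = rng.normal(size=(100_000, 384)).astype(np.float32)
q = rng.normal(size=384).astype(np.float32)
print(mips_brute_force(index, q, k=5))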