Understanding Attention Mechanisms in Transformers
Questions and Answers

Which of the following best describes the role of the attention mechanism in a Transformer model?

  • It assigns a fixed importance to each word in a sequence, regardless of context.
  • It dynamically adjusts the embeddings of words based on their contextual relationships. (correct)
  • It allows the model to ignore irrelevant words and focus solely on the most important ones.
  • It replaces the original word embeddings with completely new ones based on context.

Study Notes

    Understanding Attention Mechanisms in Transformers

    • Goal: Enable a transformer to accurately predict the next word by leveraging contextual information.
    • Tokenization: Input text is divided into tokens (words or parts of words).
    • Embeddings: Each token has a high-dimensional vector embedding representing its meaning.
    • Contextual Meaning: Embeddings are adjusted to represent richer contextual meaning beyond individual words.
    • Attention Benefits: Attention captures semantic connections between words, updating embeddings accordingly.
    • Example: "Mole": The same word carries different meanings in different contexts, which attention disambiguates.
    • Example: "Tower": Embedding refined by preceding words ("Eiffel," "miniature") for specific meaning.
    • Attention Block: Crucial component processing embeddings based on contextual relationships.
    • Final Vector: Predicts the next word, derived from the processed embeddings in the sequence.
    • Single Head of Attention: Simplified example in which adjectives update the nouns they modify.
    • Query Vector: Represents a question about word relationships (e.g., noun's query checks for preceding adjectives).
    • Key, Query, and Value Matrices: Tunable parameter matrices used to compute the attention pattern.
    • Key Matrix: Maps embeddings to a smaller space, providing "answers" to queries.
    • Dot Product: Measures alignment between keys and queries.
    • Attention Pattern: Grid of values showing each word's relevance to others in the context.
    • Softmax Normalization: Normalizes the attention pattern to probabilities.
    • Masking: Prevents later words from influencing earlier ones during training (unidirectional flow).
    • Value Matrix: Provides information to update embeddings based on attention scores.
    • Value Vectors: Represent the change applied to an embedding based on relevance to other words.
    • Weighted Sum: Attention pattern weights combine value vectors from various words, updating embeddings.
    • Multi-headed Attention: Multiple heads operate in parallel, enhancing performance.
    • Parameter Efficiency: Constraining the size of the key, query, and value maps keeps the parameter count manageable, especially with many heads.
    • Value Down and Value Up Matrices: Reduce value matrix parameters while mapping embeddings correctly.
    • Low-Rank Transformation: Value matrix constrained to low rank, enhancing efficiency.
    • Self-Attention: Attention mechanism focusing on relationships within a single sequence.
    • Cross-Attention: Attention considering relationships between two different sequences (e.g., translation).
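The pipeline above (queries, keys, dot products, masking, softmax, and the weighted sum of value vectors) can be sketched end to end. This is a minimal NumPy illustration of one attention head; all dimensions, weights, and the function name `single_head_attention` are illustrative choices, not anything defined in the lesson.

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability, then normalize.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(E, W_Q, W_K, W_V):
    """One head of causal attention over embeddings E (seq_len x d_model)."""
    Q = E @ W_Q                    # queries: "what am I looking for?"
    K = E @ W_K                    # keys: potential "answers" to the queries
    V = E @ W_V                    # values: the change applied if attended to
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # dot products measure alignment
    # Masking: later tokens may not influence earlier ones.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    A = softmax(scores)            # attention pattern; each row sums to 1
    return E + A @ V, A            # weighted sum of values updates embeddings

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 4        # toy sizes, chosen arbitrarily
E = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_model))
E_updated, A = single_head_attention(E, W_Q, W_K, W_V)
```

Note how the mask makes the first row of the attention pattern trivially concentrate on the first token: with every later position masked out, the first token can only attend to itself.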

    Cross-Attention

    • Cross-attention relates two different sequences or data types (e.g., source and target sentences in language translation).
    • Queries come from one sequence; keys and values come from the other.
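The split can be made concrete: queries are built from one sequence while keys and values are built from the other. Below is a hedged NumPy sketch (the name `cross_attention` and all sizes are illustrative); note that no causal mask is applied, since target tokens are usually free to look at the whole source.

```python
import numpy as np

def cross_attention(E_target, E_source, W_Q, W_K, W_V):
    """Queries from one sequence; keys and values from another."""
    Q = E_target @ W_Q             # e.g. tokens of the sentence being generated
    K = E_source @ W_K             # e.g. tokens of the sentence being translated
    V = E_source @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)   # softmax; no causal mask here
    return A @ V                   # each target token gathers source information

rng = np.random.default_rng(1)
d_model, d_head = 8, 4                   # toy sizes
target = rng.normal(size=(3, d_model))   # 3 tokens in the target sequence
source = rng.normal(size=(6, d_model))   # 6 tokens in the source sequence
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_model))
out = cross_attention(target, source, W_Q, W_K, W_V)
```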

    Self-Attention

    • Self-attention captures relationships between words within a single sequence.
    • Keys and queries come from the same sequence.
    • Key, query, and value matrices are learned during training.

    Multi-Headed Attention

    • Multi-headed attention learns multiple word relationships.
    • Uses multiple attention heads with their own key, query, and value matrices.
    • GPT-3 has 96 attention heads per block.
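Running several heads in parallel and merging their outputs can be sketched as follows. This is a simplified NumPy illustration (only 4 tiny heads rather than GPT-3's 96; all names and sizes are made up for the example): each head attends in its own low-dimensional space, and a shared output matrix maps the concatenated head outputs back to model dimension.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(E, heads, W_O):
    """heads: list of (W_Q, W_K, W_V_down) triples, one per head.
    W_O plays the role of the combined value-up / output matrix."""
    outs = []
    for W_Q, W_K, W_Vd in heads:
        Q, K = E @ W_Q, E @ W_K
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # mask omitted for brevity
        outs.append(A @ (E @ W_Vd))                   # per-head low-dim output
    # Concatenate head outputs and map back to model dimension.
    return E + np.concatenate(outs, axis=-1) @ W_O

rng = np.random.default_rng(2)
seq_len, d_model, d_head, n_heads = 4, 12, 3, 4
E = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(E, heads, W_O)
```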

    Value Maps

    • Value maps project embeddings through a smaller space to save parameters.
    • Composed of a value-down matrix and a value-up matrix.
    • The value-up matrices of all heads are frequently combined into a single output matrix for the multi-headed attention block.
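The savings from this low-rank factorization are easy to count. Assuming GPT-3-scale dimensions for illustration (an embedding dimension of 12,288 and a 128-dimensional head space), one full-rank value matrix per head is replaced by a down-projection and an up-projection:

```python
# Parameter count for one head's value map, full-rank vs. low-rank.
d_model, d_head = 12288, 128   # illustrative GPT-3-scale dimensions

full_rank = d_model * d_model        # one square d_model x d_model matrix
low_rank = 2 * d_model * d_head      # value-down plus value-up matrices

print(full_rank, low_rank)
```

The factored form needs roughly 3.1 million parameters per head instead of about 151 million, which is what makes running many heads per block affordable.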

    Transformer Architecture

    • Transformers use multiple layers of multi-headed attention blocks and multi-layer perceptrons (MLPs).
    • Each layer refines understanding of word relationships.
    • Deeper networks capture complex relationships.
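The alternating structure described above can be sketched as a simple loop: each layer adds an attention update (mixing context between tokens) and an MLP update (transforming each token independently). Everything here is a toy illustration with made-up names and sizes; masking and normalization layers are omitted for brevity.

```python
import numpy as np

def transformer_stack(E, layers):
    """Alternate attention blocks and MLPs, each adding a residual update."""
    for attend, mlp in layers:
        E = E + attend(E)   # attention: tokens exchange contextual information
        E = E + mlp(E)      # MLP: per-token transformation
    return E

rng = np.random.default_rng(3)
d = 8

def make_attend(W_Q, W_K, W_V):
    def attend(X):
        S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[-1])
        A = np.exp(S - S.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)
        return A @ (X @ W_V)
    return attend

def make_mlp(W1, W2):
    return lambda X: np.maximum(X @ W1, 0) @ W2   # two-layer ReLU MLP

layers = []
for _ in range(3):   # a 3-layer toy stack; real models use many more
    layers.append((
        make_attend(rng.normal(size=(d, 2)), rng.normal(size=(d, 2)),
                    rng.normal(size=(d, d)) * 0.1),
        make_mlp(rng.normal(size=(d, 16)) * 0.1, rng.normal(size=(16, d)) * 0.1),
    ))
out = transformer_stack(rng.normal(size=(5, d)), layers)
```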

    Training Transformers

    • Attention mechanisms are parallelizable, ideal for training on large datasets.
    • Large-scale training significantly impacts model performance (accuracy and generalization).


    Description

    This quiz explores the intricacies of attention mechanisms in transformers, focusing on how they enable accurate word prediction in text sequences. Learn about tokenization, embeddings, and the contextual meanings captured through attention. Gain insights into the mechanisms that allow transformers to decode semantic connections in language.
