Deep Learning Notes PDF
Summary
These notes cover fundamental concepts in deep learning and big data, including the 5 Vs of big data (volume, velocity, variety, veracity, and value), architectural design decisions for big data, the differences between data lakes and data warehouses, and key ideas such as the CAP theorem, human-in-the-loop AI, and explainability. Later sections cover how neural networks learn, convolutional networks and object detection for computer vision, and NLP topics ranging from word embeddings and recurrent networks to transformers and large language models.
**1. The 5 V's of Big Data with Examples**

1. **Volume**: Refers to the sheer size of data. *Example*: Social media platforms generating petabytes of user data daily.
2. **Velocity**: The speed at which data is generated and processed. *Example*: Real-time data from IoT devices like sensors in smart homes.
3. **Variety**: The diverse types and formats of data. *Example*: Text, images, videos, and structured data in a database.
4. **Veracity**: The quality or reliability of data. *Example*: Social media posts may contain unreliable or biased information.
5. **Value**: The potential insights and benefits derived from data. *Example*: Using transaction data to optimize supply chain operations.

**2. Architectural Design Decisions for Big Data**

- **Horizontal vs. Vertical Scaling**:
  - **Horizontal Scaling**: Adding more servers to distribute load. *Example*: Adding nodes to a Hadoop cluster.
  - **Vertical Scaling**: Adding resources (CPU, RAM) to a single server. *Example*: Upgrading a database server's hardware.
- **Sharding**: Partitioning a database into smaller pieces for better performance. *Example*: Splitting a user database by geographic region (a small routing sketch follows section 5 below).
- **Replication**: Storing copies of data across multiple servers for reliability. *Example*: Keeping multiple replicas of a dataset in a distributed database.
- **CAP Theorem**:
  - **Consistency**: Every read receives the most recent write.
  - **Availability**: Every request receives a response.
  - **Partition Tolerance**: The system keeps working despite network partitions.
  - *Example*: Prioritizing consistency over availability in financial systems.

**3. Data Lake vs. Data Warehouse**

- **Data Lake**: Raw, unstructured data stored in its original form. *Example*: Amazon S3 buckets storing logs and videos.
- **Data Warehouse**: Structured and processed data optimized for analytics. *Example*: Snowflake for business intelligence.
- **On-Prem vs. Cloud Solutions**:
  - **On-Prem**: Infrastructure managed locally, with more control. *Example*: Hadoop clusters within a company's data center.
  - **Cloud**: Managed solutions, scalable and cost-effective. *Example*: Google BigQuery.
- **Data Lakehouse**: Combines a data lake's flexibility with a warehouse's structure. *Example*: Databricks Lakehouse platform.
- **Data Mesh**: Decentralized approach to data management focused on domain ownership. *Example*: Each department manages its data assets independently.

**4. Tools in the Data Lake / Data Warehouse Context**

- **Hadoop**: Framework for distributed storage and processing in a data lake.
- **Spark**: In-memory data processing engine for big data analytics.
- **Qlik (BI)**: Business intelligence tool for reporting and dashboards.
- **BigQuery**: Cloud-based analytics platform for structured data.
- **TensorFlow/PyTorch**: Deep learning frameworks for building AI models.

**5. Data Governance and the Veracity of Big Data**

- **Data Governance**: Policies and practices ensuring data quality, security, and compliance. *Example*: Ensuring only authorized users access sensitive data.
- **Veracity Connection**: Governance ensures data reliability and trustworthiness in a data lake.
- **Data Swamp**: A poorly managed data lake where data becomes unusable due to lack of governance.
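The sharding example above (splitting a user database by region) can be made concrete with a small routing sketch. This is a minimal, hypothetical illustration, not any particular database's API; the region-to-shard mapping and the record format are invented for the example.

```python
# Minimal sketch of region-based sharding: route each user record to a shard
# based on its region. Mapping and record format are hypothetical.
SHARD_BY_REGION = {
    "eu": "shard_eu_1",
    "us": "shard_us_1",
    "apac": "shard_apac_1",
}

def pick_shard(user: dict) -> str:
    """Return the shard name that should store this user record."""
    region = user.get("region", "").lower()
    # Fall back to a default shard for unmapped regions.
    return SHARD_BY_REGION.get(region, "shard_default")

if __name__ == "__main__":
    users = [
        {"id": 1, "region": "EU"},
        {"id": 2, "region": "US"},
        {"id": 3, "region": "LATAM"},  # not mapped -> default shard
    ]
    for u in users:
        print(u["id"], "->", pick_shard(u))
```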
**6. Trustworthy and Ethical AI: Human in-, on-, and out-of-the-loop**

- **Human-in-the-loop (HITL)**: Humans actively participate in AI decision-making. *Example*: Reviewing flagged emails for spam.
- **Human-on-the-loop (HOTL)**: Humans oversee AI decisions and intervene if needed. *Example*: Supervising autonomous vehicle performance.
- **Human-out-of-the-loop**: AI operates autonomously without human oversight. *Example*: AI in high-frequency trading.

**7. Explainability (XAI) in Deep Learning**

- **XAI Definition**: Techniques ensuring AI models' decisions are interpretable by humans. *Example*: LIME or SHAP explaining predictions in ML models.
- **Challenges in Deep Learning**:
  - Complex architectures like deep neural networks are often seen as "black boxes."
  - Balancing model accuracy and interpretability can be difficult.
  - Deep learning models with millions of parameters make interpretability challenging.

Explainability is crucial in domains like healthcare and law, where trust and accountability are essential.

**1. Multilayer Perceptron as a Universal Approximator**

- **Affine Transformation**: Each neuron applies an affine transformation to its input, defined as z = Wx + b, where W is the weight matrix, x is the input, and b is the bias.
- **Activation Function**: Introduces non-linearity, enabling the network to learn complex patterns. Common activations include ReLU, sigmoid, and tanh.
- **Universal Approximation**: Theoretical property that a sufficiently large neural network with non-linear activations can approximate any continuous function.
- **Tensors and GPUs**:
  - **Tensors**: Multidimensional arrays that serve as the core data structures in deep learning.
  - **GPUs**: Accelerate computations (like matrix multiplications) by handling parallel operations efficiently, which is essential for training large networks.

**2. How Neural Networks Learn from Labeled Data**

- **Forward Pass**: Input data propagates through the network, and predictions are made using the current weights.
- **Error/Loss Function**: Measures the difference between predictions and actual labels (e.g., Mean Squared Error for regression, Cross-Entropy for classification).
- **Gradient Descent**: Optimization algorithm that minimizes the loss by updating weights iteratively.
- **Learning Rate**: Determines the step size for weight updates; a critical hyperparameter affecting convergence.
- **Backpropagation**: Uses the chain rule to compute gradients of the loss with respect to each weight, propagating the error backward through the network (see the training sketch after section 4).

**3. Dealing with Local Minima in Loss Optimization**

- **Techniques**:
  - **Stochastic Gradient Descent (SGD)**: Adds randomness to the optimization process, helping escape local minima.
  - **Momentum**: Accelerates optimization by accumulating the gradient's direction over time.
  - **Learning Rate Scheduling**: Adjusts the learning rate dynamically to explore the loss surface effectively.
  - **Advanced Optimizers**: Algorithms like Adam and RMSprop adapt learning rates for individual parameters.
  - **Regularization**: Adds constraints (e.g., an L2 penalty) to smooth the loss surface.

**4. Vanishing Gradients and Covariate Shift**

- **Vanishing Gradients**: Gradients become very small as they propagate backward, hindering weight updates in earlier layers.
  - **Solutions**:
    - Use ReLU activations instead of sigmoid or tanh.
    - Apply Batch Normalization to stabilize activations.
    - Use architectures like LSTMs for sequence data.
- **Covariate Shift**: Changes in the data distribution between layers cause instability.
  - **Solutions**:
    - Batch Normalization: Normalizes layer inputs to reduce the shift.
    - Data Augmentation: Adds variability to training data to make the model robust.
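A minimal PyTorch sketch tying sections 1, 2, and 4 together: an affine layer (z = Wx + b) followed by BatchNorm and ReLU, a forward pass, a loss, and gradient-descent updates. It assumes PyTorch is installed; the layer sizes and random data are illustrative only.

```python
import torch
import torch.nn as nn

# MLP: affine transform (z = Wx + b) -> BatchNorm -> ReLU -> affine output layer
model = nn.Sequential(
    nn.Linear(4, 16),        # W is 16x4, b is 16-dimensional
    nn.BatchNorm1d(16),      # stabilizes activations (mitigates covariate shift)
    nn.ReLU(),               # non-linearity; also helps against vanishing gradients
    nn.Linear(16, 1),
)

loss_fn = nn.MSELoss()                                     # regression-style loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate = step size

x = torch.randn(32, 4)       # a batch of 32 labeled examples (random, for illustration)
y = torch.randn(32, 1)

for step in range(5):
    pred = model(x)          # forward pass with the current weights
    loss = loss_fn(pred, y)  # measure the prediction error
    optimizer.zero_grad()
    loss.backward()          # backpropagation: chain rule computes dLoss/dWeight
    optimizer.step()         # gradient-descent update of W and b
    print(f"step {step}: loss = {loss.item():.4f}")
```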
**5. Regularization Techniques to Prevent Overfitting**

- **Dropout**: Randomly "drops" neurons during training to prevent reliance on specific nodes.
- **L1/L2 Regularization**: Adds penalties on large weights to the loss function.
- **Data Augmentation**: Increases dataset diversity by transforming training samples.
- **Early Stopping**: Stops training when performance on a validation set starts to degrade.
- **Ensemble Methods**: Combines multiple models to average out overfitting tendencies.

**6. Adjusting the Decision Threshold and Its Impact**

- **Threshold Adjustment**: Changes the probability cutoff for classification.
- **Precision**: Precision = TP / (TP + FP); focuses on reducing false positives.
- **Recall**: Recall = TP / (TP + FN); focuses on reducing false negatives.
- **Trade-offs**:
  - Increasing the threshold improves precision but reduces recall.
  - Lowering the threshold improves recall but may lower precision.
- **Impact on Performance**:
  - The right balance depends on the application (e.g., high recall for medical diagnosis, high precision for spam detection).

**7. Precision-Recall Curve vs. ROC Curve**

- **Precision**: Proportion of true positives among predicted positives: Precision = TP / (TP + FP).
- **Recall**: Proportion of true positives among actual positives: Recall = TP / (TP + FN).
- **Why the PR Curve is Better for Imbalanced Datasets**:
  - ROC curves can look deceptively good when true negatives dominate, because the false positive rate is diluted by the large negative class.
  - PR curves focus on the balance between precision and recall, highlighting performance on the positive class.
- **Area Under the PR Curve (AUPRC)**:
  - Measures the model's ability to handle the positive class across thresholds.
  - Higher AUPRC indicates better performance, especially in detecting rare events.

**1. Why Fully Connected Networks Are Not Suitable for Computer Vision**

- **Limitations**:
  - **High Dimensionality**: Images have a large number of pixels; a fully connected layer leads to an explosion of parameters.
  - **Loss of Spatial Information**: Fully connected layers treat all input features equally, ignoring spatial relationships in images.
  - **Overfitting**: Too many parameters require large datasets to train effectively.
- **How CNNs Solve These Issues**:
  - **Parameter Sharing**: Convolutional layers use shared kernels, reducing the number of parameters.
  - **Local Receptive Fields**: Convolutions focus on small regions of the image, preserving spatial hierarchies.
  - **Hierarchical Feature Learning**: Lower layers detect edges, while deeper layers learn complex patterns.

**2. Convolutional Operation Concepts**

- **Kernel**: A small matrix (filter) used to extract features by sliding over the image.
- **Local Receptive Field**: The small region of the image the kernel interacts with at any step.
- **Stride**: The step size by which the kernel moves across the image.
- **Padding**: Adding pixels around the image border to control the output size.
- **Feature Map**: The output of applying a convolutional operation to an input image.
- **2D Convolution**: Used for processing images by applying kernels over two dimensions (width and height).
- **Use Cases for 3D Convolutions**:
  - **Video Analysis**: Extract features across spatial and temporal dimensions.
  - **Medical Imaging**: Analyze 3D scans like MRIs or CT scans.
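A small PyTorch sketch of how kernel size, stride, and padding determine the feature-map size, following output size = (W - K + 2P) / S + 1 per spatial dimension. It assumes PyTorch is installed; the tensor sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

# One input image: batch of 1, 3 channels (RGB), 32x32 pixels
image = torch.randn(1, 3, 32, 32)

# 16 kernels of size 3x3, stride 1, padding 1 -> spatial size preserved:
# (32 - 3 + 2*1) / 1 + 1 = 32
conv_same = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
print(conv_same(image).shape)   # torch.Size([1, 16, 32, 32])

# Same kernels with stride 2 and no padding -> downsampled feature maps:
# (32 - 3 + 0) / 2 + 1 = 15
conv_down = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=0)
print(conv_down(image).shape)   # torch.Size([1, 16, 15, 15])
```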
**3. Typical CNN Architecture for Image Classification**

1. **Convolution + ReLU Layers**: Extract features and apply non-linearity.
2. **Pooling Layers**: Downsample feature maps (e.g., max pooling) to reduce spatial dimensions.
3. **Flatten/GAP Layer**: Convert feature maps into a 1D vector for the fully connected layer.
   - **GAP (Global Average Pooling)**: Reduces each feature map to a single value.
4. **Fully Connected Layer**: Combines extracted features for classification.
5. **Batch Normalization Layer**: Stabilizes learning by normalizing intermediate outputs.
6. **Softmax Layer**: Converts logits into probabilities for each class.
7. **Cross-Entropy Loss**: Measures the error between predicted probabilities and true labels.

**4. Transfer Learning in Computer Vision**

- **Concept**: Leverage a network pre-trained on a large dataset (e.g., ImageNet) for new tasks.
- **Training Modes** (a minimal sketch follows section 7 below):
  - **Feature Extraction**: Use pre-trained convolutional layers as fixed feature extractors; only the classifier is trained.
  - **Fine-Tuning**: Retrain some or all layers, adapting the network to the new dataset.
- **Impact of ImageNet**: Provides a robust starting point with features learned from over a million images, making transfer learning highly effective.

**5. Architectural Enhancements Inspired by the ILSVRC**

- **AlexNet**:
  - **ReLU**: Popularized non-saturating activations, mitigating vanishing gradients.
  - **Dropout**: Randomly drops neurons during training to prevent overfitting.
- **VGG**:
  - **Small Kernels (e.g., 3x3)**: Reduced parameters while increasing depth, improving feature granularity.
- **Inception/GoogLeNet**:
  - **Mixed Kernels**: Combines small (e.g., 1x1) and larger kernels for multi-scale feature extraction.
  - **1x1 Convolutions**: Reduce channel depth before expensive convolutions, preserving computational efficiency.
- **ResNet/DenseNet**:
  - **Skip Connections**: Allow gradients to flow through shortcuts, mitigating vanishing gradients.
  - **Dense Connections**: Reuse features across layers, enhancing efficiency.

**6. Advanced Tweaks with fastai**

- **Learning Rate Finder**: Automatically finds a good learning rate for training.
- **Discriminative Learning Rates**: Applies different learning rates to different layers based on their training needs.
- **One-Cycle Learning Rate**: Dynamically adjusts the learning rate to improve convergence.
- **FP16 Mixed Precision**: Reduces memory usage and speeds up training by using 16-bit precision.
- **timm Library**:
  - Extensive collection of pre-trained models, offering flexibility and advanced architectures.
  - Compatible with fastai, making it easier to integrate state-of-the-art models.

**7. CNNs for Non-Traditional Domains**

- **Applications**:
  - **Sound Classification**: Convert audio signals into spectrograms and use CNNs for classification.
  - **Fraud Detection**: Analyze transaction patterns as images.
  - **Malware Detection**: Represent binary code as grayscale images and classify with CNNs.
- **Pre-trained ResNet on ImageNet**:
  - **Pros**: Useful for tasks where features (e.g., edges, textures) resemble those in natural images.
  - **Cons**: May not generalize well to non-vision domains (e.g., audio or binary data). Pre-training on a domain-specific dataset might be more effective.
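A minimal feature-extraction sketch for the transfer-learning modes in section 4, using a torchvision ResNet pre-trained on ImageNet (assumes torchvision 0.13 or later; the number of target classes is arbitrary). Freezing the backbone gives the feature-extraction mode; leaving some layers trainable would turn it into fine-tuning.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Load a ResNet-18 pre-trained on ImageNet
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Feature extraction: freeze the pre-trained convolutional backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for a new task (here: 5 classes, chosen arbitrarily)
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters will be updated during training
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # ['fc.weight', 'fc.bias']
```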
**1. Evolution of Object Detection Methods**

- **Naïve Template Matching**:
  - Match pre-defined templates against image regions.
  - **Limitations**:
    - Computationally expensive.
    - Poor generalization to varying scales, rotations, and deformations.
- **HOG (Histogram of Oriented Gradients)**:
  - Extracts edge and gradient-based features, making detection robust to small variations in lighting.
  - **Limitations**:
    - Requires careful parameter tuning.
    - Struggles with complex object shapes and cluttered backgrounds.
- **SIFT (Scale-Invariant Feature Transform)**:
  - Detects scale-invariant keypoints and matches them across images.
  - **Limitations**:
    - Computationally intensive.
    - Inefficient for real-time applications and large-scale datasets.
- **Improvements with CNNs**:
  - Automate feature extraction through hierarchical learning.
  - Learn both simple (edges) and complex (shapes, textures) features directly from data.

**2. CNNs and Feature Extraction in Object Detection**

- **Improved Feature Extraction**:
  - CNNs learn hierarchical features using convolutional layers, making detection robust to variations in scale, rotation, and occlusion.
  - They replace hand-crafted features (e.g., HOG, SIFT) with data-driven learning.
- **Regression and Classification Heads**:
  - **Regression Head**: Predicts bounding box coordinates for detected objects.
  - **Classification Head**: Assigns a class label to the detected object within the bounding box.

**3. Challenges in Object Detection**

- **Scale**: Objects can appear at different sizes in an image.
- **Aspect Ratio**: Objects may have varying shapes and orientations.
- **Occlusion**: Objects can be partially obscured by other objects.
- **Mitigation Techniques**:
  - **Region of Interest (ROI)**: Focuses on specific regions likely to contain objects. *Example*: Faster R-CNN uses ROI pooling to extract features for object proposals.
  - **Non-Maximum Suppression (NMS)**: Eliminates overlapping boxes by keeping the box with the highest confidence score.

**4. Two-Stage vs. One-Stage Detectors**

- **Two-Stage Detectors**:
  - **Example**: R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN).
  - **Process**: First generate object proposals, then classify and refine them.
  - **Advantages**:
    - High accuracy, especially for challenging tasks.
    - Better at detecting small objects.
  - **Disadvantages**:
    - Slower due to the two-step process.
    - More computationally intensive.
- **One-Stage Detectors**:
  - **Example**: YOLO (You Only Look Once), SSD.
  - **Process**: Predict object classes and bounding boxes directly in one pass.
  - **Advantages**:
    - Faster and suitable for real-time applications.
    - Simpler architecture.
  - **Disadvantages**:
    - Lower accuracy, especially for small objects or crowded scenes.

**5. Importance of Evaluation Metrics in Object Detection**

- **Intersection over Union (IoU)**:
  - Measures overlap between predicted and ground-truth bounding boxes.
  - **Formula**: IoU = Area of Overlap / Area of Union.
  - Higher IoU indicates better localization.
- **Mean Average Precision (mAP)**:
  - Combines precision-recall across multiple classes and IoU thresholds.
  - Provides a comprehensive measure of model performance.
- **Significance**:
  - IoU ensures accurate localization.
  - mAP balances detection accuracy across classes and thresholds.
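A small Python sketch of the IoU formula above for axis-aligned boxes given as (x1, y1, x2, y2); the two example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area is zero if the boxes do not intersect
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box vs. a ground-truth box
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # 0.1428... (400 / 2800)
```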
**6. Impact of the Non-Maximum Suppression (NMS) Threshold**

- **Purpose**: NMS eliminates redundant detections of the same object.
- **Threshold**: Determines the IoU value below which boxes are considered distinct objects.
- **Effects of Threshold Adjustment**:
  - **High Threshold**: Retains more overlapping boxes, leading to duplicate detections.
  - **Low Threshold**: Removes more boxes, risking missed detections.
  - **Optimal Threshold**: Balances avoiding duplicates against retaining valid detections.

**1. Word Embeddings vs. Dimensionality Reduction Techniques**

- **Word Embeddings**:
  - Dense vector representations of words or tokens that encode their semantic meanings.
  - Generated during training of models like word2vec, GloVe, or transformer-based models.
  - Capture contextual relationships (e.g., "king" - "man" + "woman" = "queen").
- **Advantages over Dimensionality Reduction (e.g., PCA)**:
  - **Semantic Encoding**: Embeddings inherently preserve semantic relationships, unlike PCA, which reduces dimensions without capturing meaning.
  - **Context Awareness**: Modern embeddings (e.g., BERT) are context-dependent, whereas PCA-based methods are static.
  - **Scalability**: Embeddings are optimized for computational efficiency and downstream tasks, while PCA can be computationally intensive for large corpora.

**2. Key Parameters in LLMs**

- **Context Window**:
  - Defines the maximum number of tokens the model can process at a time.
  - **Impact**: A larger context window allows the model to understand and generate text with broader dependencies but requires more memory.
- **Max Tokens**:
  - Sets the maximum length of the output sequence.
  - **Impact**: Longer outputs may provide more detailed responses but increase computation time.
- **Temperature**:
  - Controls the randomness of the output.
  - **Low Temperature (e.g., 0.2)**: Focuses on high-probability outputs, producing more deterministic responses.
  - **High Temperature (e.g., 1.0)**: Encourages diversity, generating more creative or varied outputs.

**3. Supervised Fine-Tuning (SFT) vs. In-Context Learning**

- **Supervised Fine-Tuning (SFT)**:
  - Involves retraining the model on labeled, task-specific data.
  - **Advantages**:
    - Highly task-specific and accurate.
    - Useful for long-term adaptation.
  - **Disadvantages**:
    - Computationally expensive.
    - Requires significant labeled data.
- **In-Context Learning**:
  - Adapts the model using prompts with few-shot or zero-shot examples.
  - **Advantages**:
    - No retraining required.
    - Quick adaptation to new tasks.
  - **Disadvantages**:
    - Limited generalization for highly specialized tasks.
    - Depends on prompt quality.
- **Trade-offs**:
  - SFT excels in specialized, high-accuracy tasks but is resource-intensive.
  - In-context learning is faster and more flexible but less precise for domain-specific requirements.

**4. Challenges with LLMs and Retrieval-Augmented Generation (RAG)**

- **Challenges**:
  - **Memory Limitations**: Restricted context windows limit input size.
  - **Knowledge Staleness**: LLMs trained on static data may lack current information.
  - **Inference Costs**: High computational requirements for generating outputs.
- **RAG Approach**:
  - Combines LLMs with external knowledge bases to provide up-to-date and accurate responses.
  - **Steps** (see the retrieval sketch after this section):
    - Retrieve relevant documents from a database based on the input query.
    - Integrate the retrieved information into the model's output generation.
- **Role of Vector Databases**:
  - Store document embeddings to enable efficient similarity-based retrieval.
  - Facilitate fast and accurate matching of relevant information.
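A bare-bones sketch of the retrieval step in RAG: rank documents by cosine similarity between embeddings and prepend the top hits to the prompt. The `embed` function is a hypothetical placeholder for whatever embedding model or vector database is actually used; here it is stubbed with random vectors purely to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call an embedding model here."""
    return rng.normal(size=128)

docs = [
    "Policy document about travel reimbursement.",
    "FAQ about resetting your password.",
    "Release notes for the latest product version.",
]
doc_vecs = np.stack([embed(d) for d in docs])  # pre-computed document embeddings

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # Cosine similarity between the query and every stored document embedding
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

query = "How do I reset my password?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # the augmented prompt is what would be sent to the LLM
```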
**5. Components of the CrewAI Framework and Chain-of-Thought Reasoning**

**Key Components:**

1. **Crews**: Groups of AI agents collaborating to achieve a shared goal.
2. **Agents**: Individual AI entities with specialized roles and capabilities.
3. **Tasks**: Defined objectives or problems assigned to agents.
4. **Tools**: Resources or APIs agents use to perform tasks (e.g., calculators, web search).
5. **Processes**: Sequences of actions or workflows agents follow to complete tasks.

**Chain-of-Thought Reasoning:**

- Encourages step-by-step logical reasoning.
- **Enhancements**:
  - Improves decision-making by breaking complex tasks into manageable steps.
  - Increases accuracy in multi-step reasoning tasks, such as problem-solving or data analysis.

**1. Tokenization, Token Normalization, and Vectorization in NLP**

**a. Key Concepts:**

- **Tokenization**: Breaking text into smaller units (tokens), such as words or subwords. *Example*: "Deep learning is fun" → ["Deep", "learning", "is", "fun"]
- **Token Normalization**: Cleaning and standardizing tokens.
  - **Stop Words**: Common words (e.g., "is", "the") removed to reduce noise.
  - **Stemming**: Reducing words to their base forms by removing suffixes. *Example*: "running" → "run"
  - **Lemmatization**: Converts words to their dictionary form based on context. *Example*: "better" → "good"
- **Vectorization (Numericalization)**: Converting tokens into numerical representations.

**b. Corpus, Dictionary, and Vocabulary:**

- **Corpus**: A collection of text data used for analysis or model training. *Example*: Wikipedia articles on a specific topic.
- **Dictionary/Vocabulary**: The unique set of tokens in the corpus.

**c. Frequency-Based vs. Embedding Techniques:**

- **BoW (Bag of Words)**: Represents text as a vector of token frequencies.
  - **Limitation**: Ignores word order and context.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Reweights BoW counts by term importance across documents.
  - **Strength**: Highlights rare but important terms.
- **Word Embeddings**: Dense vectors capturing the semantic meanings of words (e.g., word2vec, GloVe).
  - **Advantage**: Context-aware and lower-dimensional.

**2. Dense Vector Representation and Word Embeddings**

**a. Vector Similarity Measurements:**

1. **Cosine Similarity**: Measures the angle between vectors.
2. **Euclidean Distance**: Measures straight-line distance in vector space.
3. **Dot Product**: Quantifies similarity based on magnitude and alignment.

**b. Word2Vec:**

- **CBOW (Continuous Bag of Words)**: Predicts the target word from surrounding context words.
- **Skip-gram**: Predicts context words from the target word.
- **Principle**: Words that appear in similar contexts get similar vector representations.

**c. Self-Supervised Training:**

- **Concept**: Use the data itself as the supervision signal.
- *Example*: Predicting missing words in a sentence ("The ___ is blue") trains embeddings without labeled data.

**3. CNN vs. RNN for Sequential Data in NLP**

- **CNN (Convolutional Neural Network)**:
  - Extracts local features using filters (e.g., over n-grams).
  - Captures short-term patterns efficiently.
  - Limited for long-term dependencies.
- **RNN (Recurrent Neural Network)**:
  - Processes sequences by maintaining hidden states.
  - Outputs depend on previous context, making it better suited for sequential data.
  - Challenges: Struggles with very long sequences due to vanishing gradients.

**4. Long-Term Dependencies and Vanishing Gradients in RNNs**

**Problems:**

- **Long-Term Dependencies**: RNNs struggle to retain information across long sequences.
- **Vanishing Gradients**: Gradients shrink during backpropagation, preventing effective weight updates.

**Solutions:**

- **LSTM (Long Short-Term Memory)**:
  - Introduces gates (input, forget, output) to control information flow.
  - Retains long-term dependencies effectively.
- **GRU (Gated Recurrent Unit)**:
  - A simpler alternative to LSTM with similar performance.
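A short PyTorch sketch of the gated architectures above: an LSTM layer processing a batch of token-embedding sequences and returning per-step outputs plus its final hidden and cell states. The sizes are arbitrary and the input is random, purely to show the shapes involved; swapping nn.LSTM for nn.GRU (which has no cell state) works the same way.

```python
import torch
import torch.nn as nn

# Batch of 8 sequences, 20 time steps each, 32-dimensional token embeddings
x = torch.randn(8, 20, 32)

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 64]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 8, 64])  - final hidden state
print(c_n.shape)     # torch.Size([1, 8, 64])  - final cell state (LSTM only)
```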
**5. Seq2Seq Models and Applications**

**Seq2Seq Model:**

- Transforms one sequence into another (e.g., translating English to French).

**Architecture:**

- **Encoder**: Processes the input sequence and encodes it into a context vector (a summary representation).
- **Decoder**: Takes the context vector and generates the output sequence.

**Advantages over Traditional RNNs:**

- Encoders and decoders allow flexible input and output sequence lengths.
- Better suited for tasks like machine translation, summarization, and question answering.

**Impact on Machine Translation:**

- Revolutionized translation by enabling context-aware, sequence-to-sequence learning.
- **Example**: Google Translate leverages attention mechanisms in seq2seq models to translate phrases accurately by focusing on relevant input words.

**1. Seq2Seq Modeling in NLP: RNN vs. Transformer**

- **Seq2Seq Modeling**: Transforms one sequence into another; commonly used for tasks like translation, summarization, and question answering.

**RNN-Based Seq2Seq:**

- Uses an encoder-decoder structure with recurrent layers (e.g., LSTMs or GRUs).
- Encodes the input sequence into a context vector, which is passed to the decoder to generate the output.
- **Challenges**:
  - Struggles with long sequences due to vanishing gradients.
  - Bottleneck: A fixed-size context vector limits information capacity.

**Transformer-Based Seq2Seq:**

- Replaces recurrence with self-attention and parallelization.
- Encoders and decoders are stacks of layers built from multi-head attention and feedforward sublayers.
- **Advantages**:
  - Captures long-term dependencies efficiently.
  - Faster training due to parallelization.
  - Attention mechanisms focus dynamically on the relevant parts of the input.

**2. Transfer Learning with Language Models (LMs)**

**Transfer Learning:**

- Uses knowledge from a model pre-trained on a large dataset and fine-tunes it for a specific task.
- *Example*: Using GPT for summarization after pre-training on diverse text.

**Language Model (LM):**

- Predicts the next word in a sequence (causal LM) or fills in missing words (masked LM).
- Pre-trained on extensive corpora to learn linguistic patterns.

**Steps for Using an LM in Transfer Learning:**

1. **Pre-training**: Train the model on a large general corpus (e.g., books, Wikipedia).
2. **Fine-tuning**: Adapt the pre-trained model to a specific task or domain using labeled data. Optionally freeze some layers to retain general knowledge.
3. **Task-Specific Training**: Add task-specific heads (e.g., classification, regression) and optimize for the end goal.

**3. Transformer Architecture**

**Core Components:**

- **Encoders and Decoders**:
  - **Encoders**: Process the input sequence into hidden representations.
  - **Decoders**: Use the encoded representations to generate output sequences.
  - Both are stacks of layers consisting of self-attention and feedforward networks.
- **Self-Attention vs. Encoder-Decoder Attention**:
  - **Self-Attention**: Enables each token in a sequence to attend to other tokens within the same sequence.
  - **Encoder-Decoder Attention**: Allows the decoder to focus on specific parts of the encoded input.
- **Multi-Head Attention**:
  - Splits queries, keys, and values into multiple attention heads to capture diverse relationships.
  - The heads' outputs are concatenated and transformed via a linear layer.
- **Positional Encoding**:
  - Adds position information to token embeddings, since transformers have no inherent notion of sequence order.
  - Uses sine and cosine functions to encode positions.
- **Residual Connections**:
  - Shortcut paths between layers that prevent vanishing gradients and improve training stability.
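A compact PyTorch sketch of the computation at the heart of these components, scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V. The tensor sizes are arbitrary; real transformer layers run this per head inside multi-head attention.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # token-to-token similarity
    weights = torch.softmax(scores, dim=-1)            # attention weights sum to 1 per query
    return weights @ v                                  # weighted mix of value vectors

# Self-attention over a sequence of 5 tokens with 64-dimensional Q, K, V
q = k = v = torch.randn(1, 5, 64)   # (batch, sequence length, d_k)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 5, 64])
```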
**4. Main Types of Transformer-Based LLMs**

**1. Encoder-Only Models:**

- Focus on understanding tasks (e.g., classification, named entity recognition).
- **Example**: BERT (Bidirectional Encoder Representations from Transformers).
- **Typical Task**: Sentiment analysis.
- **Models**: BERT, RoBERTa.

**2. Decoder-Only Models:**

- Designed for generative tasks (e.g., text generation, summarization).
- **Example**: GPT (Generative Pre-trained Transformer).
- **Typical Task**: Text generation.
- **Models**: GPT-3, GPT-4.

**3. Encoder-Decoder Models:**

- Combine encoding and decoding for tasks requiring both understanding and generation.
- **Example**: T5 (Text-to-Text Transfer Transformer).
- **Typical Task**: Machine translation.
- **Models**: T5, BART.
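One way to see the three families side by side is through the Hugging Face transformers `pipeline` API (assuming the library is installed and the listed models can be downloaded; the example inputs are made up):

```python
from transformers import pipeline

# Encoder-only (BERT-style): understanding tasks such as masked-word prediction
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Deep learning is [MASK].")[0]["token_str"])

# Decoder-only (GPT-style): open-ended text generation
generate = pipeline("text-generation", model="gpt2")
print(generate("Deep learning is", max_new_tokens=10)[0]["generated_text"])

# Encoder-decoder (T5-style): sequence-to-sequence tasks such as translation
translate = pipeline("translation_en_to_fr", model="t5-small")
print(translate("Deep learning is fun.")[0]["translation_text"])
```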