Deep Learning Notes PDF
Summary
These notes cover fundamental concepts in deep learning and big data, including the 5 Vs of big data (volume, velocity, variety, veracity, and value), architectural design decisions for big data, the differences between data lakes and data warehouses, and key ideas such as the CAP theorem, human-in-the-loop AI, and explainability. Later sections cover how neural networks learn, convolutional networks and object detection for computer vision, and NLP topics ranging from word embeddings and recurrent networks to transformers and large language models.
**1. The 5 V's of Big Data with Examples**

1. **Volume**: Refers to the sheer size of data. *Example*: Social media platforms generating petabytes of user data daily.
2. **Velocity**: The speed at which data is generated and processed. *Example*: Real-time data from IoT devices like sensors in smart homes.
3. **Variety**: The diverse types and formats of data. *Example*: Text, images, videos, and structured data in a database.
4. **Veracity**: The quality or reliability of data. *Example*: Social media posts may contain unreliable or biased information.
5. **Value**: The potential insights and benefits derived from data. *Example*: Using transaction data to optimize supply chain operations.

**2. Architectural Design Decisions for Big Data**

- **Horizontal vs. Vertical Scaling**:
  - **Horizontal Scaling**: Adding more servers to distribute load. *Example*: Adding nodes to a Hadoop cluster.
  - **Vertical Scaling**: Adding resources (CPU, RAM) to a single server. *Example*: Upgrading a database server's hardware.
- **Sharding**: Partitioning a database into smaller pieces for better performance. *Example*: Splitting a user database by geographic region (a small routing sketch follows section 5 below).
- **Replication**: Storing copies of data across multiple servers for reliability. *Example*: Keeping multiple replicas of a dataset in a distributed database.
- **CAP Theorem**:
  - **Consistency**: Every read receives the most recent write.
  - **Availability**: Every request receives a response.
  - **Partition Tolerance**: The system keeps working despite network partitions.
  - *Example*: Prioritizing consistency over availability in financial systems.

**3. Data Lake vs. Data Warehouse**

- **Data Lake**: Raw, unstructured data stored in its original form. *Example*: Amazon S3 buckets storing logs and videos.
- **Data Warehouse**: Structured and processed data optimized for analytics. *Example*: Snowflake for business intelligence.
- **On-Prem vs. Cloud Solutions**:
  - **On-Prem**: Infrastructure managed locally, with more control. *Example*: Hadoop clusters within a company's data center.
  - **Cloud**: Managed solutions, scalable and cost-effective. *Example*: Google BigQuery.
- **Data Lakehouse**: Combines a data lake's flexibility with a warehouse's structure. *Example*: Databricks Lakehouse platform.
- **Data Mesh**: Decentralized approach to data management focused on domain ownership. *Example*: Each department manages its data assets independently.

**4. Tools in the Data Lake / Data Warehouse Context**

- **Hadoop**: Framework for distributed storage and processing in a data lake.
- **Spark**: In-memory data processing engine for big data analytics.
- **Qlik (BI)**: Business intelligence tool for reporting and dashboards.
- **BigQuery**: Cloud-based analytics platform for structured data.
- **TensorFlow/PyTorch**: Deep learning frameworks for building AI models.

**5. Data Governance and the Veracity of Big Data**

- **Data Governance**: Policies and practices ensuring data quality, security, and compliance. *Example*: Ensuring only authorized users access sensitive data.
- **Veracity Connection**: Governance ensures data reliability and trustworthiness in a data lake.
- **Data Swamp**: A poorly managed data lake where data becomes unusable due to lack of governance.
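The sharding example above (splitting a user database by region) can be made concrete with a small routing sketch. This is a minimal, hypothetical illustration, not any particular database's API; the region-to-shard mapping and the record format are invented for the example.

```python
# Minimal sketch of region-based sharding: route each user record to a shard
# based on its region. Mapping and record format are hypothetical.
SHARD_BY_REGION = {
    "eu": "shard_eu_1",
    "us": "shard_us_1",
    "apac": "shard_apac_1",
}

def pick_shard(user: dict) -> str:
    """Return the shard name that should store this user record."""
    region = user.get("region", "").lower()
    # Fall back to a default shard for unmapped regions.
    return SHARD_BY_REGION.get(region, "shard_default")

if __name__ == "__main__":
    users = [
        {"id": 1, "region": "EU"},
        {"id": 2, "region": "US"},
        {"id": 3, "region": "LATAM"},  # not mapped -> default shard
    ]
    for u in users:
        print(u["id"], "->", pick_shard(u))
```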
**6. Trustworthy and Ethical AI: Human in-, on-, and out-of-the-loop**

- **Human-in-the-loop (HITL)**: Humans actively participate in AI decision-making. *Example*: Reviewing flagged emails for spam.
- **Human-on-the-loop (HOTL)**: Humans oversee AI decisions and intervene if needed. *Example*: Supervising autonomous vehicle performance.
- **Human-out-of-the-loop**: AI operates autonomously without human oversight. *Example*: AI in high-frequency trading.

**7. Explainability (XAI) in Deep Learning**

- **XAI Definition**: Techniques ensuring AI models' decisions are interpretable by humans. *Example*: LIME or SHAP explaining predictions in ML models.
- **Challenges in Deep Learning**:
  - Complex architectures like deep neural networks are often seen as "black boxes."
  - Balancing model accuracy and interpretability can be difficult.
  - Deep learning models with millions of parameters make interpretability challenging.

Explainability is crucial in domains like healthcare and law, where trust and accountability are essential.

**1. Multilayer Perceptron as a Universal Approximator**

- **Affine Transformation**: Each neuron applies an affine transformation to its input, defined as z = Wx + b, where W is the weight matrix, x is the input, and b is the bias.
- **Activation Function**: Introduces non-linearity, enabling the network to learn complex patterns. Common activations include ReLU, sigmoid, and tanh.
- **Universal Approximation**: Theoretical property that a sufficiently large neural network with non-linear activations can approximate any continuous function.
- **Tensors and GPUs**:
  - **Tensors**: Multidimensional arrays that serve as the core data structures in deep learning.
  - **GPUs**: Accelerate computations (like matrix multiplications) by handling parallel operations efficiently, which is essential for training large networks.

**2. How Neural Networks Learn from Labeled Data**

- **Forward Pass**: Input data propagates through the network, and predictions are made using the current weights.
- **Error/Loss Function**: Measures the difference between predictions and actual labels (e.g., Mean Squared Error for regression, Cross-Entropy for classification).
- **Gradient Descent**: Optimization algorithm that minimizes the loss by updating weights iteratively.
- **Learning Rate**: Determines the step size for weight updates; a critical hyperparameter affecting convergence.
- **Backpropagation**: Uses the chain rule to compute gradients of the loss with respect to each weight, propagating the error backward through the network (see the training sketch after section 4).

**3. Dealing with Local Minima in Loss Optimization**

- **Techniques**:
  - **Stochastic Gradient Descent (SGD)**: Adds randomness to the optimization process, helping escape local minima.
  - **Momentum**: Accelerates optimization by accumulating the gradient's direction over time.
  - **Learning Rate Scheduling**: Adjusts the learning rate dynamically to explore the loss surface effectively.
  - **Advanced Optimizers**: Algorithms like Adam and RMSprop adapt learning rates for individual parameters.
  - **Regularization**: Adds constraints (e.g., an L2 penalty) to smooth the loss surface.

**4. Vanishing Gradients and Covariate Shift**

- **Vanishing Gradients**: Gradients become very small as they propagate backward, hindering weight updates in earlier layers.
  - **Solutions**:
    - Use ReLU activations instead of sigmoid or tanh.
    - Apply Batch Normalization to stabilize activations.
    - Use architectures like LSTMs for sequence data.
- **Covariate Shift**: Changes in the data distribution between layers cause instability.
  - **Solutions**:
    - Batch Normalization: Normalizes layer inputs to reduce the shift.
    - Data Augmentation: Adds variability to training data to make the model robust.
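A minimal PyTorch sketch tying sections 1, 2, and 4 together: an affine layer (z = Wx + b) followed by BatchNorm and ReLU, a forward pass, a loss, and gradient-descent updates. It assumes PyTorch is installed; the layer sizes and random data are illustrative only.

```python
import torch
import torch.nn as nn

# MLP: affine transform (z = Wx + b) -> BatchNorm -> ReLU -> affine output layer
model = nn.Sequential(
    nn.Linear(4, 16),        # W is 16x4, b is 16-dimensional
    nn.BatchNorm1d(16),      # stabilizes activations (mitigates covariate shift)
    nn.ReLU(),               # non-linearity; also helps against vanishing gradients
    nn.Linear(16, 1),
)

loss_fn = nn.MSELoss()                                     # regression-style loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate = step size

x = torch.randn(32, 4)       # a batch of 32 labeled examples (random, for illustration)
y = torch.randn(32, 1)

for step in range(5):
    pred = model(x)          # forward pass with the current weights
    loss = loss_fn(pred, y)  # measure the prediction error
    optimizer.zero_grad()
    loss.backward()          # backpropagation: chain rule computes dLoss/dWeight
    optimizer.step()         # gradient-descent update of W and b
    print(f"step {step}: loss = {loss.item():.4f}")
```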
**5. Regularization Techniques to Prevent Overfitting**

- **Dropout**: Randomly "drops" neurons during training to prevent reliance on specific nodes.
- **L1/L2 Regularization**: Adds penalties on large weights to the loss function.
- **Data Augmentation**: Increases dataset diversity by transforming training samples.
- **Early Stopping**: Stops training when performance on a validation set starts to degrade.
- **Ensemble Methods**: Combines multiple models to average out overfitting tendencies.

**6. Adjusting the Decision Threshold and Its Impact**

- **Threshold Adjustment**: Changes the probability cutoff for classification.
- **Precision**: Precision = TP / (TP + FP); focuses on reducing false positives.
- **Recall**: Recall = TP / (TP + FN); focuses on reducing false negatives.
- **Trade-offs**:
  - Increasing the threshold improves precision but reduces recall.
  - Lowering the threshold improves recall but may lower precision.
- **Impact on Performance**:
  - The right balance depends on the application (e.g., high recall for medical diagnosis, high precision for spam detection).

**7. Precision-Recall Curve vs. ROC Curve**

- **Precision**: Proportion of true positives among predicted positives: Precision = TP / (TP + FP).
- **Recall**: Proportion of true positives among actual positives: Recall = TP / (TP + FN).
- **Why the PR Curve is Better for Imbalanced Datasets**:
  - ROC curves can look deceptively good when true negatives dominate, because the false positive rate is diluted by the large negative class.
  - PR curves focus on the balance between precision and recall, highlighting performance on the positive class.
- **Area Under the PR Curve (AUPRC)**:
  - Measures the model's ability to handle the positive class across thresholds.
  - Higher AUPRC indicates better performance, especially in detecting rare events.

**1. Why Fully Connected Networks Are Not Suitable for Computer Vision**

- **Limitations**:
  - **High Dimensionality**: Images have a large number of pixels; a fully connected layer leads to an explosion of parameters.
  - **Loss of Spatial Information**: Fully connected layers treat all input features equally, ignoring spatial relationships in images.
  - **Overfitting**: Too many parameters require large datasets to train effectively.
- **How CNNs Solve These Issues**:
  - **Parameter Sharing**: Convolutional layers use shared kernels, reducing the number of parameters.
  - **Local Receptive Fields**: Convolutions focus on small regions of the image, preserving spatial hierarchies.
  - **Hierarchical Feature Learning**: Lower layers detect edges, while deeper layers learn complex patterns.

**2. Convolutional Operation Concepts**

- **Kernel**: A small matrix (filter) used to extract features by sliding over the image.
- **Local Receptive Field**: The small region of the image the kernel interacts with at any step.
- **Stride**: The step size by which the kernel moves across the image.
- **Padding**: Adding pixels around the image border to control the output size.
- **Feature Map**: The output of applying a convolutional operation to an input image.
- **2D Convolution**: Used for processing images by applying kernels over two dimensions (width and height).
- **Use Cases for 3D Convolutions**:
  - **Video Analysis**: Extract features across spatial and temporal dimensions.
  - **Medical Imaging**: Analyze 3D scans like MRIs or CT scans.
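A small PyTorch sketch of how kernel size, stride, and padding determine the feature-map size, following output size = (W - K + 2P) / S + 1 per spatial dimension. It assumes PyTorch is installed; the tensor sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

# One input image: batch of 1, 3 channels (RGB), 32x32 pixels
image = torch.randn(1, 3, 32, 32)

# 16 kernels of size 3x3, stride 1, padding 1 -> spatial size preserved:
# (32 - 3 + 2*1) / 1 + 1 = 32
conv_same = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
print(conv_same(image).shape)   # torch.Size([1, 16, 32, 32])

# Same kernels with stride 2 and no padding -> downsampled feature maps:
# (32 - 3 + 0) / 2 + 1 = 15
conv_down = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=0)
print(conv_down(image).shape)   # torch.Size([1, 16, 15, 15])
```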
**3. Typical CNN Architecture for Image Classification**

1. **Convolution + ReLU Layers**: Extract features and apply non-linearity.
2. **Pooling Layers**: Downsample feature maps (e.g., max pooling) to reduce spatial dimensions.
3. **Flatten/GAP Layer**: Convert feature maps into a 1D vector for the fully connected layer.
   - **GAP (Global Average Pooling)**: Reduces each feature map to a single value.
4. **Fully Connected Layer**: Combines extracted features for classification.
5. **Batch Normalization Layer**: Stabilizes learning by normalizing intermediate outputs.
6. **Softmax Layer**: Converts logits into probabilities for each class.
7. **Cross-Entropy Loss**: Measures the error between predicted probabilities and true labels.

**4. Transfer Learning in Computer Vision**

- **Concept**: Leverage a network pre-trained on a large dataset (e.g., ImageNet) for new tasks.
- **Training Modes** (a minimal sketch follows section 7 below):
  - **Feature Extraction**: Use pre-trained convolutional layers as fixed feature extractors; only the classifier is trained.
  - **Fine-Tuning**: Retrain some or all layers, adapting the network to the new dataset.
- **Impact of ImageNet**: Provides a robust starting point with features learned from over a million images, making transfer learning highly effective.

**5. Architectural Enhancements Inspired by the ILSVRC**

- **AlexNet**:
  - **ReLU**: Popularized non-saturating activations, mitigating vanishing gradients.
  - **Dropout**: Randomly drops neurons during training to prevent overfitting.
- **VGG**:
  - **Small Kernels (e.g., 3x3)**: Reduced parameters while increasing depth, improving feature granularity.
- **Inception/GoogLeNet**:
  - **Mixed Kernels**: Combines small (e.g., 1x1) and larger kernels for multi-scale feature extraction.
  - **1x1 Convolutions**: Reduce channel depth before expensive convolutions, preserving computational efficiency.
- **ResNet/DenseNet**:
  - **Skip Connections**: Allow gradients to flow through shortcuts, mitigating vanishing gradients.
  - **Dense Connections**: Reuse features across layers, enhancing efficiency.

**6. Advanced Tweaks with fastai**

- **Learning Rate Finder**: Automatically finds a good learning rate for training.
- **Discriminative Learning Rates**: Applies different learning rates to different layers based on their training needs.
- **One-Cycle Learning Rate**: Dynamically adjusts the learning rate to improve convergence.
- **FP16 Mixed Precision**: Reduces memory usage and speeds up training by using 16-bit precision.
- **timm Library**:
  - Extensive collection of pre-trained models, offering flexibility and advanced architectures.
  - Compatible with fastai, making it easier to integrate state-of-the-art models.

**7. CNNs for Non-Traditional Domains**

- **Applications**:
  - **Sound Classification**: Convert audio signals into spectrograms and use CNNs for classification.
  - **Fraud Detection**: Analyze transaction patterns as images.
  - **Malware Detection**: Represent binary code as grayscale images and classify with CNNs.
- **Pre-trained ResNet on ImageNet**:
  - **Pros**: Useful for tasks where features (e.g., edges, textures) resemble those in natural images.
  - **Cons**: May not generalize well to non-vision domains (e.g., audio or binary data). Pre-training on a domain-specific dataset might be more effective.
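A minimal feature-extraction sketch for the transfer-learning modes in section 4, using a torchvision ResNet pre-trained on ImageNet (assumes torchvision 0.13 or later; the number of target classes is arbitrary). Freezing the backbone gives the feature-extraction mode; leaving some layers trainable would turn it into fine-tuning.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Load a ResNet-18 pre-trained on ImageNet
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Feature extraction: freeze the pre-trained convolutional backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for a new task (here: 5 classes, chosen arbitrarily)
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters will be updated during training
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # ['fc.weight', 'fc.bias']
```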
**1. Evolution of Object Detection Methods**

- **Naïve Template Matching**:
  - Match pre-defined templates against image regions.
  - **Limitations**:
    - Computationally expensive.
    - Poor generalization to varying scales, rotations, and deformations.
- **HOG (Histogram of Oriented Gradients)**:
  - Extracts edge and gradient-based features, making detection robust to small variations in lighting.
  - **Limitations**:
    - Requires careful parameter tuning.
    - Struggles with complex object shapes and cluttered backgrounds.
- **SIFT (Scale-Invariant Feature Transform)**:
  - Detects scale-invariant keypoints and matches them across images.
  - **Limitations**:
    - Computationally intensive.
    - Inefficient for real-time applications and large-scale datasets.
- **Improvements with CNNs**:
  - Automate feature extraction through hierarchical learning.
  - Learn both simple (edges) and complex (shapes, textures) features directly from data.

**2. CNNs and Feature Extraction in Object Detection**

- **Improved Feature Extraction**:
  - CNNs learn hierarchical features using convolutional layers, making detection robust to variations in scale, rotation, and occlusion.
  - They replace hand-crafted features (e.g., HOG, SIFT) with data-driven learning.
- **Regression and Classification Heads**:
  - **Regression Head**: Predicts bounding box coordinates for detected objects.
  - **Classification Head**: Assigns a class label to the detected object within the bounding box.

**3. Challenges in Object Detection**

- **Scale**: Objects can appear at different sizes in an image.
- **Aspect Ratio**: Objects may have varying shapes and orientations.
- **Occlusion**: Objects can be partially obscured by other objects.
- **Mitigation Techniques**:
  - **Region of Interest (ROI)**: Focuses on specific regions likely to contain objects. *Example*: Faster R-CNN uses ROI pooling to extract features for object proposals.
  - **Non-Maximum Suppression (NMS)**: Eliminates overlapping boxes by keeping the box with the highest confidence score.

**4. Two-Stage vs. One-Stage Detectors**

- **Two-Stage Detectors**:
  - **Example**: R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN).
  - **Process**: First generate object proposals, then classify and refine them.
  - **Advantages**:
    - High accuracy, especially for challenging tasks.
    - Better at detecting small objects.
  - **Disadvantages**:
    - Slower due to the two-step process.
    - More computationally intensive.
- **One-Stage Detectors**:
  - **Example**: YOLO (You Only Look Once), SSD.
  - **Process**: Predict object classes and bounding boxes directly in one pass.
  - **Advantages**:
    - Faster and suitable for real-time applications.
    - Simpler architecture.
  - **Disadvantages**:
    - Lower accuracy, especially for small objects or crowded scenes.

**5. Importance of Evaluation Metrics in Object Detection**

- **Intersection over Union (IoU)**:
  - Measures overlap between predicted and ground-truth bounding boxes.
  - **Formula**: IoU = Area of Overlap / Area of Union.
  - Higher IoU indicates better localization.
- **Mean Average Precision (mAP)**:
  - Combines precision-recall across multiple classes and IoU thresholds.
  - Provides a comprehensive measure of model performance.
- **Significance**:
  - IoU ensures accurate localization.
  - mAP balances detection accuracy across classes and thresholds.
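A small Python sketch of the IoU formula above for axis-aligned boxes given as (x1, y1, x2, y2); the two example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area is zero if the boxes do not intersect
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box vs. a ground-truth box
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # 0.1428... (400 / 2800)
```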
**6. Impact of the Non-Maximum Suppression (NMS) Threshold**

- **Purpose**: NMS eliminates redundant detections of the same object.
- **Threshold**: Determines the IoU value below which boxes are considered distinct objects.
- **Effects of Threshold Adjustment**:
  - **High Threshold**: Retains more overlapping boxes, leading to duplicate detections.
  - **Low Threshold**: Removes more boxes, risking missed detections.
  - **Optimal Threshold**: Balances avoiding duplicates against retaining valid detections.

**1. Word Embeddings vs. Dimensionality Reduction Techniques**

- **Word Embeddings**:
  - Dense vector representations of words or tokens that encode their semantic meanings.
  - Generated during training of models like word2vec, GloVe, or transformer-based models.
  - Capture contextual relationships (e.g., "king" - "man" + "woman" = "queen").
- **Advantages over Dimensionality Reduction (e.g., PCA)**:
  - **Semantic Encoding**: Embeddings inherently preserve semantic relationships, unlike PCA, which reduces dimensions without capturing meaning.
  - **Context Awareness**: Modern embeddings (e.g., BERT) are context-dependent, whereas PCA-based methods are static.
  - **Scalability**: Embeddings are optimized for computational efficiency and downstream tasks, while PCA can be computationally intensive for large corpora.

**2. Key Parameters in LLMs**

- **Context Window**:
  - Defines the maximum number of tokens the model can process at a time.
  - **Impact**: A larger context window allows the model to understand and generate text with broader dependencies but requires more memory.
- **Max Tokens**:
  - Sets the maximum length of the output sequence.
  - **Impact**: Longer outputs may provide more detailed responses but increase computation time.
- **Temperature**:
  - Controls the randomness of the output.
  - **Low Temperature (e.g., 0.2)**: Focuses on high-probability outputs, producing more deterministic responses.
  - **High Temperature (e.g., 1.0)**: Encourages diversity, generating more creative or varied outputs.

**3. Supervised Fine-Tuning (SFT) vs. In-Context Learning**

- **Supervised Fine-Tuning (SFT)**:
  - Involves retraining the model on labeled, task-specific data.
  - **Advantages**:
    - Highly task-specific and accurate.
    - Useful for long-term adaptation.
  - **Disadvantages**:
    - Computationally expensive.
    - Requires significant labeled data.
- **In-Context Learning**:
  - Adapts the model using prompts with few-shot or zero-shot examples.
  - **Advantages**:
    - No retraining required.
    - Quick adaptation to new tasks.
  - **Disadvantages**:
    - Limited generalization for highly specialized tasks.
    - Depends on prompt quality.
- **Trade-offs**:
  - SFT excels in specialized, high-accuracy tasks but is resource-intensive.
  - In-context learning is faster and more flexible but less precise for domain-specific requirements.

**4. Challenges with LLMs and Retrieval-Augmented Generation (RAG)**

- **Challenges**:
  - **Memory Limitations**: Restricted context windows limit input size.
  - **Knowledge Staleness**: LLMs trained on static data may lack current information.
  - **Inference Costs**: High computational requirements for generating outputs.
- **RAG Approach**:
  - Combines LLMs with external knowledge bases to provide up-to-date and accurate responses.
  - **Steps** (see the retrieval sketch after this section):
    - Retrieve relevant documents from a database based on the input query.
    - Integrate the retrieved information into the model's output generation.
- **Role of Vector Databases**:
  - Store document embeddings to enable efficient similarity-based retrieval.
  - Facilitate fast and accurate matching of relevant information.
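A bare-bones sketch of the retrieval step in RAG: rank documents by cosine similarity between embeddings and prepend the top hits to the prompt. The `embed` function is a hypothetical placeholder for whatever embedding model or vector database is actually used; here it is stubbed with random vectors purely to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call an embedding model here."""
    return rng.normal(size=128)

docs = [
    "Policy document about travel reimbursement.",
    "FAQ about resetting your password.",
    "Release notes for the latest product version.",
]
doc_vecs = np.stack([embed(d) for d in docs])  # pre-computed document embeddings

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # Cosine similarity between the query and every stored document embedding
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

query = "How do I reset my password?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # the augmented prompt is what would be sent to the LLM
```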
**5. Components of the CrewAI Framework and Chain-of-Thought Reasoning**

**Key Components:**

1. **Crews**: Groups of AI agents collaborating to achieve a shared goal.
2. **Agents**: Individual AI entities with specialized roles and capabilities.
3. **Tasks**: Defined objectives or problems assigned to agents.
4. **Tools**: Resources or APIs agents use to perform tasks (e.g., calculators, web search).
5. **Processes**: Sequences of actions or workflows agents follow to complete tasks.

**Chain-of-Thought Reasoning:**

- Encourages step-by-step logical reasoning.
- **Enhancements**:
  - Improves decision-making by breaking complex tasks into manageable steps.
  - Increases accuracy in multi-step reasoning tasks, such as problem-solving or data analysis.

**1. Tokenization, Token Normalization, and Vectorization in NLP**

**a. Key Concepts:**

- **Tokenization**: Breaking text into smaller units (tokens), such as words or subwords. *Example*: "Deep learning is fun" → ["Deep", "learning", "is", "fun"]
- **Token Normalization**: Cleaning and standardizing tokens.
  - **Stop Words**: Common words (e.g., "is", "the") removed to reduce noise.
  - **Stemming**: Reducing words to their base forms by removing suffixes. *Example*: "running" → "run"
  - **Lemmatization**: Converts words to their dictionary form based on context. *Example*: "better" → "good"
- **Vectorization (Numericalization)**: Converting tokens into numerical representations.

**b. Corpus, Dictionary, and Vocabulary:**

- **Corpus**: A collection of text data used for analysis or model training. *Example*: Wikipedia articles on a specific topic.
- **Dictionary/Vocabulary**: The unique set of tokens in the corpus.

**c. Frequency-Based vs. Embedding Techniques:**

- **BoW (Bag of Words)**: Represents text as a vector of token frequencies.
  - **Limitation**: Ignores word order and context.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Reweights BoW counts by term importance across documents.
  - **Strength**: Highlights rare but important terms.
- **Word Embeddings**: Dense vectors capturing the semantic meanings of words (e.g., word2vec, GloVe).
  - **Advantage**: Context-aware and lower-dimensional.

**2. Dense Vector Representation and Word Embeddings**

**a. Vector Similarity Measurements:**

1. **Cosine Similarity**: Measures the angle between vectors.
2. **Euclidean Distance**: Measures straight-line distance in vector space.
3. **Dot Product**: Quantifies similarity based on magnitude and alignment.

**b. Word2Vec:**

- **CBOW (Continuous Bag of Words)**: Predicts the target word from surrounding context words.
- **Skip-gram**: Predicts context words from the target word.
- **Principle**: Words that appear in similar contexts get similar vector representations.

**c. Self-Supervised Training:**

- **Concept**: Use the data itself as the supervision signal.
- *Example*: Predicting missing words in a sentence ("The ___ is blue") trains embeddings without labeled data.

**3. CNN vs. RNN for Sequential Data in NLP**

- **CNN (Convolutional Neural Network)**:
  - Extracts local features using filters (e.g., over n-grams).
  - Captures short-term patterns efficiently.
  - Limited for long-term dependencies.
- **RNN (Recurrent Neural Network)**:
  - Processes sequences by maintaining hidden states.
  - Outputs depend on previous context, making it better suited for sequential data.
  - Challenges: Struggles with very long sequences due to vanishing gradients.

**4. Long-Term Dependencies and Vanishing Gradients in RNNs**

**Problems:**

- **Long-Term Dependencies**: RNNs struggle to retain information across long sequences.
- **Vanishing Gradients**: Gradients shrink during backpropagation, preventing effective weight updates.

**Solutions:**

- **LSTM (Long Short-Term Memory)**:
  - Introduces gates (input, forget, output) to control information flow.
  - Retains long-term dependencies effectively.
- **GRU (Gated Recurrent Unit)**:
  - A simpler alternative to LSTM with similar performance.
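A short PyTorch sketch of the gated architectures above: an LSTM layer processing a batch of token-embedding sequences and returning per-step outputs plus its final hidden and cell states. The sizes are arbitrary and the input is random, purely to show the shapes involved; swapping nn.LSTM for nn.GRU (which has no cell state) works the same way.

```python
import torch
import torch.nn as nn

# Batch of 8 sequences, 20 time steps each, 32-dimensional token embeddings
x = torch.randn(8, 20, 32)

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 64]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 8, 64])  - final hidden state
print(c_n.shape)     # torch.Size([1, 8, 64])  - final cell state (LSTM only)
```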
**5. Seq2Seq Models and Applications**

**Seq2Seq Model:**

- Transforms one sequence into another (e.g., translating English to French).

**Architecture:**

- **Encoder**: Processes the input sequence and encodes it into a context vector (a summary representation).
- **Decoder**: Takes the context vector and generates the output sequence.

**Advantages over Traditional RNNs:**

- Encoders and decoders allow flexible input and output sequence lengths.
- Better suited for tasks like machine translation, summarization, and question answering.

**Impact on Machine Translation:**

- Revolutionized translation by enabling context-aware, sequence-to-sequence learning.
- **Example**: Google Translate leverages attention mechanisms in seq2seq models to translate phrases accurately by focusing on relevant input words.

**1. Seq2Seq Modeling in NLP: RNN vs. Transformer**

- **Seq2Seq Modeling**: Transforms one sequence into another; commonly used for tasks like translation, summarization, and question answering.

**RNN-Based Seq2Seq:**

- Uses an encoder-decoder structure with recurrent layers (e.g., LSTMs or GRUs).
- Encodes the input sequence into a context vector, which is passed to the decoder to generate the output.
- **Challenges**:
  - Struggles with long sequences due to vanishing gradients.
  - Bottleneck: A fixed-size context vector limits information capacity.

**Transformer-Based Seq2Seq:**

- Replaces recurrence with self-attention and parallelization.
- Encoders and decoders are stacks of layers built from multi-head attention and feedforward sublayers.
- **Advantages**:
  - Captures long-term dependencies efficiently.
  - Faster training due to parallelization.
  - Attention mechanisms focus dynamically on the relevant parts of the input.

**2. Transfer Learning with Language Models (LMs)**

**Transfer Learning:**

- Uses knowledge from a model pre-trained on a large dataset and fine-tunes it for a specific task.
- *Example*: Using GPT for summarization after pre-training on diverse text.

**Language Model (LM):**

- Predicts the next word in a sequence (causal LM) or fills in missing words (masked LM).
- Pre-trained on extensive corpora to learn linguistic patterns.

**Steps for Using an LM in Transfer Learning:**

1. **Pre-training**: Train the model on a large general corpus (e.g., books, Wikipedia).
2. **Fine-tuning**: Adapt the pre-trained model to a specific task or domain using labeled data. Optionally freeze some layers to retain general knowledge.
3. **Task-Specific Training**: Add task-specific heads (e.g., classification, regression) and optimize for the end goal.

**3. Transformer Architecture**

**Core Components:**

- **Encoders and Decoders**:
  - **Encoders**: Process the input sequence into hidden representations.
  - **Decoders**: Use the encoded representations to generate output sequences.
  - Both are stacks of layers consisting of self-attention and feedforward networks.
- **Self-Attention vs. Encoder-Decoder Attention**:
  - **Self-Attention**: Enables each token in a sequence to attend to other tokens within the same sequence.
  - **Encoder-Decoder Attention**: Allows the decoder to focus on specific parts of the encoded input.
- **Multi-Head Attention**:
  - Splits queries, keys, and values into multiple attention heads to capture diverse relationships.
  - The heads' outputs are concatenated and transformed via a linear layer.
- **Positional Encoding**:
  - Adds position information to token embeddings, since transformers have no inherent notion of sequence order.
  - Uses sine and cosine functions to encode positions.
- **Residual Connections**:
  - Shortcut paths between layers that prevent vanishing gradients and improve training stability.
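A compact PyTorch sketch of the computation at the heart of these components, scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V. The tensor sizes are arbitrary; real transformer layers run this per head inside multi-head attention.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # token-to-token similarity
    weights = torch.softmax(scores, dim=-1)            # attention weights sum to 1 per query
    return weights @ v                                  # weighted mix of value vectors

# Self-attention over a sequence of 5 tokens with 64-dimensional Q, K, V
q = k = v = torch.randn(1, 5, 64)   # (batch, sequence length, d_k)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 5, 64])
```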
**4. Main Types of Transformer-Based LLMs**

**1. Encoder-Only Models:**

- Focus on understanding tasks (e.g., classification, named entity recognition).
- **Example**: BERT (Bidirectional Encoder Representations from Transformers).
- **Typical Task**: Sentiment analysis.
- **Models**: BERT, RoBERTa.

**2. Decoder-Only Models:**

- Designed for generative tasks (e.g., text generation, summarization).
- **Example**: GPT (Generative Pre-trained Transformer).
- **Typical Task**: Text generation.
- **Models**: GPT-3, GPT-4.

**3. Encoder-Decoder Models:**

- Combine encoding and decoding for tasks requiring both understanding and generation.
- **Example**: T5 (Text-to-Text Transfer Transformer).
- **Typical Task**: Machine translation.
- **Models**: T5, BART.
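One way to see the three families side by side is through the Hugging Face transformers `pipeline` API (assuming the library is installed and the listed models can be downloaded; the example inputs are made up):

```python
from transformers import pipeline

# Encoder-only (BERT-style): understanding tasks such as masked-word prediction
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Deep learning is [MASK].")[0]["token_str"])

# Decoder-only (GPT-style): open-ended text generation
generate = pipeline("text-generation", model="gpt2")
print(generate("Deep learning is", max_new_tokens=10)[0]["generated_text"])

# Encoder-decoder (T5-style): sequence-to-sequence tasks such as translation
translate = pipeline("translation_en_to_fr", model="t5-small")
print(translate("Deep learning is fun.")[0]["translation_text"])
```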