Summary

These notes provide an introduction to deep learning, covering fundamental concepts such as neural networks, convolutional neural networks (CNNs), generative adversarial networks (GANs), autoencoders, and recurrent neural networks (RNNs). The focus is on the theoretical underpinnings and on applications in areas such as image processing and sequence modeling, with attention to practical aspects of training and worked examples.

Full Transcript


Lecture 0: Introduction to NN

1. What is a Neural Network?
- A neural network (NN) consists of interconnected neurons (simple processing units).
- It is an alternative computation model, inspired by biology but highly abstracted.
- Key properties:
  - Expressive power comes from the network architecture, not from individual neurons.
  - Trained using input data and queried for predictions.

2. Units and Layers
- Artificial neurons (units):
  - A unit aggregates weighted inputs plus a bias: z = w · x + b.
  - Linear unit: outputs the weighted sum directly.
  - Nonlinear unit: applies an activation function (ReLU, Sigmoid, Tanh) to the weighted sum.
- Layers:
  - Dense (fully connected): each unit is connected to all units of the previous layer; the units have the same arity but separate parameter vectors.
  - Convolutional: units have the same arity and shape and share parameters (weight sharing, important for images).
  - Recurrent (RNN, LSTM): maintains memory/state across time steps.

3. Computational Capabilities
- Important features:
  - Continuous: works with real-valued mathematics; discrete variables require special handling.
  - Parallel and/or distributed: easy to deploy on parallel hardware.
  - Non-symbolic.
- Universal Approximation Theorem (Cybenko, 1989): NNs with a single hidden layer can approximate any continuous function to arbitrary accuracy.
- Deep networks improve expressiveness (multiple layers enable a feature hierarchy).

4. Training Neural Networks
- Supervised learning: learn from labeled data.
- Unsupervised learning: learn patterns without labels.
- Training = optimizing model weights to minimize error on a dataset.
- Loss functions: define the quantity to be minimized by the learner (network). A loss measures how well the model's predictions match the target values, quantifying the difference between predicted output and ground truth and guiding the learning process.
  - Regression:
    - MSE (L2 loss): penalizes squared errors.
    - MAE (L1 loss): penalizes absolute errors.
    - Huber loss: hybrid of L1 and L2.
  - Classification:
    - Binary cross-entropy: for 2-class problems.
    - Categorical cross-entropy: for multi-class problems.
    - Hinge loss: for SVM-style classification.

5. Gradient Descent
- Gradient Descent (GD): optimizes weights by computing gradients of the loss and updating iteratively: w <- w - η ∇L(w).
  - Stochastic GD (SGD): uses one sample at a time.
  - Mini-batch GD: uses small batches (common in DL).
  - Momentum, Adam, RMSprop: advanced optimizers for stability.
- Backpropagation: uses the chain rule to propagate errors backward and adjust weights.

6. Models as Computation Graphs
- Forward pass: data flows from input to output.
- Backward pass: gradients propagate backward using the chain rule.
- Implemented in DL frameworks such as TensorFlow and PyTorch (see the sketch at the end of this lecture).

7. Other Relevant Concepts
- Hyperparameters vs. parameters:
  - Parameters: learned during training (weights, biases).
  - Hyperparameters: set before training (learning rate, number of layers).
- Indeterminism:
  - Random initialization affects training.
  - Batch order shuffling changes the optimization path.

8. Technical Realizations
- Deep learning frameworks:
  - PyTorch (research-oriented, flexible, Pythonic).
  - TensorFlow (production-oriented, scalable).

9. Conclusion
- Neural networks are universal function approximators.
- Deep networks are more expressive than shallow ones.
- Training via gradient descent requires careful choice of loss functions, activation functions, and optimizers.
- Backpropagation is essential for learning.
- Frameworks like TensorFlow and PyTorch make implementation easier.
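To make the gradient-descent, backpropagation, and computation-graph ideas above concrete, here is a minimal PyTorch sketch (not from the lecture; the layer sizes and random data are illustrative): a small dense network trained with MSE loss and plain SGD.

```python
import torch
import torch.nn as nn

# Illustrative toy data: 64 samples, 3 input features, 1 regression target.
x = torch.randn(64, 3)
y = torch.randn(64, 1)

# A small dense network: linear units (weighted sum + bias) with a ReLU nonlinearity.
model = nn.Sequential(
    nn.Linear(3, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

loss_fn = nn.MSELoss()                                    # L2 loss for regression
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # plain gradient descent

for step in range(100):
    optimizer.zero_grad()     # clear gradients from the previous step
    y_hat = model(x)          # forward pass builds the computation graph
    loss = loss_fn(y_hat, y)  # scalar quantity to minimize
    loss.backward()           # backward pass: backpropagation via the chain rule
    optimizer.step()          # gradient-descent update of weights and biases
```
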
Lecture 1: Convolutional Neural Networks (CNNs)

1. Introduction to CNNs
- CNNs are a specialized type of neural network used mainly for image processing.
- Unlike fully connected networks, CNNs preserve the spatial structure of images.
- Instead of processing entire images as flat vectors, they use local receptive fields to detect features.

2. Why Use CNNs?
Motivations for CNNs in computer vision:
- Differentiability: many image operations (e.g., convolution) are differentiable and can be optimized with gradient descent.
- Local processing: CNNs process local areas, reducing the number of parameters and improving computational efficiency.
- Parallel computation: operations can run in parallel, making them well suited for GPUs and TPUs.

3. CNN Architecture - Key Concepts
CNNs consist of several layers that extract and refine features. The core layers include:
a. Convolutional layer:
  - The convolution operation applies a small filter/kernel over an input image to detect patterns/features (e.g., edges, textures).
  - The filter slides over the image, computing a dot product between the filter and the local region of the image.
  - Key hyperparameters:
    - Kernel size (e.g., 3×3, 5×5)
    - Stride (step size for moving the filter)
    - Padding (handling the edges of the image)
b. Activation function (ReLU, Sigmoid, etc.):
  - ReLU (Rectified Linear Unit): commonly used since it helps prevent the vanishing gradient problem; ReLU(x) = max(0, x).
  - Sigmoid and Tanh were used in older networks but cause saturation and slow learning.
c. Pooling layer (downsampling for dimensionality reduction):
  - Max pooling: selects the maximum value from a small region, preserving dominant features.
  - Average pooling: averages values within a region, reducing noise.
  - Purpose: reduces computation, helps prevent overfitting, and makes the model more translation-invariant.
d. Fully connected (dense) layer:
  - Flattened feature maps are connected to dense layers for final classification.
  - Traditional CNNs end with a softmax layer for multi-class classification.

4. Important CNN Properties
- Weight sharing: CNN filters are shared across spatial positions, reducing parameters and improving generalization; a feature is learned once and can be detected anywhere in the image.
- Translation invariance: CNNs can recognize objects regardless of their position in the image.
- Hierarchy of features: earlier layers detect simple features (edges, corners), while deeper layers detect complex structures (objects).

5. Training CNNs
CNNs are trained using backpropagation and gradient descent (see the sketch after this section). The steps include:
- Forward pass: the image flows through the CNN and predictions are made.
- Loss calculation: measures how far the predictions are from the actual labels.
- Backpropagation: computes gradients and updates weights using gradient descent.
Common loss functions:
- Cross-entropy loss for classification.
- Mean squared error (MSE) for regression tasks.
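A minimal PyTorch sketch of the Conv → ReLU → Pool → Dense stack described above, assuming hypothetical 28×28 grayscale inputs and 10 classes (sizes are illustrative, not taken from the lecture):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Conv -> ReLU -> Pool, twice, then a dense classifier (softmax is applied by the loss)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 3x3 kernel, padding keeps 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample to 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample to 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)    # flatten feature maps for the dense layer
        return self.classifier(x)  # raw logits; CrossEntropyLoss applies softmax internally

# One illustrative training step on random tensors standing in for images/labels.
model = SmallCNN()
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
```
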
6. Famous CNN Architectures
Several CNN architectures have been developed for better accuracy and efficiency.
a. LeNet-5 (1998)
  - First successful CNN, designed for digit recognition.
  - Architecture: Conv → Pool → Conv → Pool → Fully Connected.
b. AlexNet (2012)
  - Deep network trained on ImageNet (1.2M images, 1000 classes).
  - Introduced ReLU activations, dropout to prevent overfitting, and large-scale GPU training.
c. VGG-16/VGG-19 (2014)
  - Uses only small 3×3 convolutions to build deep networks in a uniform way.
  - Trade-off: deeper networks, but more computationally expensive.
d. GoogLeNet/Inception (2014)
  - Inception module: uses multiple filters (1×1, 3×3, 5×5) in parallel for multi-scale feature extraction.
  - Lower parameter count than VGG with better performance.
e. ResNet (2015)
  - Residual (skip) connections allow much deeper networks (152+ layers).
  - Mitigates the vanishing gradient problem, enabling very deep learning.
f. DenseNet (2017)
  - Each layer is connected to every subsequent layer, encouraging feature reuse and reducing parameters.

7. Fully Convolutional Networks (FCNs)
- Unlike standard CNNs, FCNs do not use dense layers, making them useful for:
  - Image segmentation (e.g., U-Net, UNet++)
  - Denoising
  - Super-resolution
- Popular FCN architectures:
  - U-Net: used in medical imaging (e.g., cell segmentation).
  - UNet++: an improved version of U-Net with dense skip connections for better segmentation accuracy.

8. Image Segmentation with CNNs
- Types of segmentation:
  - Semantic segmentation: assigns a class label to each pixel.
  - Instance segmentation: differentiates multiple objects of the same class.
  - Panoptic segmentation: combines semantic and instance segmentation.
- Example models:
  - U-Net: uses an encoder-decoder structure with skip connections.
  - FCN (Fully Convolutional Network): used for real-time segmentation.

9. CNNs for Image Denoising
- Noise is common in images, especially in medical imaging (e.g., MRI, CT scans).
- CNNs can learn a mapping from noisy images to clean images.
- Denoising autoencoders: CNN-based models that remove noise while preserving image details.

10. Conclusion
- CNNs outperform traditional ML methods in image tasks.
- Convolutional layers extract hierarchical features, reducing the need for handcrafted features.
- Pooling layers reduce dimensionality while retaining key features.
- Deep networks (e.g., ResNet, DenseNet) solve training challenges like vanishing gradients.
- FCNs (e.g., U-Net) are useful for segmentation and other image-to-image tasks.
- Pretrained models (e.g., ResNet50, VGG16) allow transfer learning for new tasks.

Lecture 2: Generative Adversarial Networks (GANs)

1. Introduction to Generative Adversarial Networks (GANs)
- GANs were introduced by Ian Goodfellow et al. (2014).
- They generate new data samples that resemble real data (e.g., images, text, music).
- Used in image generation, style transfer, super-resolution, and more.
- GANs consist of two networks that compete:
  - Generator (G): creates fake data.
  - Discriminator (D): distinguishes between real and fake data.

2. How GANs Work (Adversarial Setting)
- The goal is to train G to generate data that is indistinguishable from real data.
- Process:
  - The generator G takes a random noise vector z and transforms it into a synthetic sample G(z).
  - The discriminator D evaluates a sample and assigns the probability that it is real rather than fake.
  - Training process:
    - D is trained to correctly classify real vs. fake data.
    - G is trained to "fool" D by generating more realistic samples.
- Objective function: min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]
  - D tries to maximize correct classifications.
  - G tries to minimize the discriminator's success.
  - Result: a game-theoretic minimax game between G and D.

3. GAN Training Algorithm
Iterate over the following steps (see the sketch below):
- Sample real data x from the training set.
- Sample noise z from a prior distribution (e.g., Gaussian).
- Generate fake samples G(z).
- Train D: maximize the probability of correctly classifying real vs. fake samples.
- Train G: minimize the ability of D to detect fakes.
- Repeat until convergence.
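A minimal PyTorch-style sketch of this alternating loop, assuming a generator G, a discriminator D ending in a sigmoid (so its output is a probability), their optimizers, and a batch of real data are already defined; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real_batch, noise_dim=100):
    """One alternating update of the discriminator and the generator (non-saturating loss)."""
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Train D: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch_size, noise_dim)
    fake_batch = G(z).detach()                 # detach so G is not updated in this step
    loss_D = F.binary_cross_entropy(D(real_batch), real_labels) + \
             F.binary_cross_entropy(D(fake_batch), fake_labels)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Train G: fool D, i.e. maximize log D(G(z)).
    z = torch.randn(batch_size, noise_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), real_labels)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```
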
4. Common Problems in GAN Training
- Mode collapse:
  - G generates only a few variations instead of diverse samples.
  - Solution: use alternative loss functions such as the Wasserstein loss.
- Vanishing gradients:
  - If D becomes too good, G stops learning.
  - Solution: train D more slowly or use a Wasserstein GAN (WGAN).
- Divergence and instability:
  - GANs are hard to train due to the unstable adversarial process.
  - Solutions:
    - Batch normalization (stabilizes training).
    - Better architectures (e.g., deep convolutional GANs).
    - Gradient penalty (e.g., WGAN-GP).

5. Variants of GANs
Several improvements and modifications to GANs have been proposed:
a. Conditional GANs (cGANs)
  - Condition the generation on additional input (labels, images, or text).
  - Instead of generating from noise alone, G is guided by the conditioning information.
  - Example: Pix2Pix (translates images from one domain to another, e.g., sketches to realistic images).
b. CycleGAN
  - Unpaired image-to-image translation.
  - Learns mappings between two domains (X → Y and Y → X) without paired examples.
  - A cycle-consistency loss keeps the two mappings consistent.
  - Example: turning horses into zebras, or summer images into winter images.
c. Deep Convolutional GAN (DCGAN)
  - Uses CNNs instead of fully connected networks.
  - Architecture changes for stability: strided convolutions (instead of pooling), batch normalization, ReLU in G and Leaky ReLU in D.
  - Used in image synthesis and super-resolution.
d. Wasserstein GAN (WGAN)
  - Addresses mode collapse and instability issues.
  - Uses the Wasserstein-1 (Earth Mover's) distance instead of the JS/KL divergence.
  - Critic objective: max_D E_x[D(x)] - E_z[D(G(z))], with D constrained to be 1-Lipschitz (via weight clipping).
  - Improved version: WGAN-GP replaces clipping with a gradient penalty (a sketch follows at the end of this lecture).
e. Coulomb GAN
  - Models generated samples as particles in an electric field.
  - Fixes mode collapse by ensuring sample diversity: samples repel each other, avoiding collapse onto a few points.

6. Applications of GANs
a. Image generation
  - StyleGAN: generates high-quality human faces.
  - BigGAN: large-scale image synthesis.
b. Image-to-image translation
  - Pix2Pix: converts sketches to realistic images.
  - CycleGAN: transfers styles between unpaired images.
c. Super-resolution and denoising
  - SRGAN: increases image resolution.
  - Denoising GANs: remove noise from images (e.g., medical imaging).
d. Video and music generation
  - MoCoGAN: generates realistic videos.
  - MuseGAN: composes music.
e. Text-to-image generation
  - AttnGAN: generates images from text descriptions.

7. Evaluating GANs
- GANs are hard to evaluate since there is no exact likelihood measure.
- Common metrics:
  - Inception Score (IS): measures diversity and quality of generated images; higher IS = better results.
  - FrĂ©chet Inception Distance (FID): measures similarity between generated and real images; lower FID = better performance.
  - Amazon Mechanical Turk (AMT): human evaluation; people decide whether images are real or fake.

8. Conclusion
- GANs use adversarial training to generate realistic samples.
- Training is a min-max game between the generator and the discriminator.
- Common problems include mode collapse, instability, and vanishing gradients.
- WGAN improves training stability by using the Wasserstein loss.
- CycleGAN allows domain translation without paired data.
- GANs have applications in image synthesis, video generation, and super-resolution.
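As a reference for the gradient penalty mentioned under WGAN-GP in section 5, a minimal PyTorch sketch; it assumes a critic D that maps an image batch to unbounded scores, and the penalty weight of 10 is simply the commonly used default (all names are illustrative):

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm towards 1 on interpolated samples."""
    batch_size = real.size(0)
    # Random interpolation between real and fake image batches (shape B x C x H x W).
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)
    interpolated = (eps * real + (1 - eps) * fake).requires_grad_(True)

    scores = D(interpolated)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interpolated,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself is differentiable
    )[0]
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Critic loss with the penalty (the generator loss stays -D(G(z)).mean()):
# loss_D = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)
```
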
Lecture 3: Autoencoders (AEs)

1. Introduction to Autoencoders
- Autoencoders (AEs) are a type of neural network used for unsupervised learning.
- They aim to learn a compressed (latent) representation of the input data.
- AEs consist of two main components:
  - Encoder f: X → Z, compresses the input data into a smaller representation.
  - Decoder g: Z → X, reconstructs the input data from the latent space.
- The latent representation Z is smaller than X (dimensionality reduction).

2. Why Use Autoencoders?
- Unsupervised learning → no labeled data required.
- Feature extraction and dimensionality reduction (like PCA but more flexible).
- Anomaly detection → identifies unusual patterns (e.g., fraud detection).
- Denoising → removes noise from images or signals.
- Data compression → efficiently stores and reconstructs high-dimensional data.
- Representation learning → learns useful features for other tasks.

3. Autoencoder Architecture
- A typical autoencoder consists of:
  - Input layer → the original data.
  - Encoder → reduces the input dimension and extracts important features.
  - Latent space (Z) → the compressed representation.
  - Decoder → reconstructs the data from the latent representation.
  - Output layer → the reconstructed input.
- Mathematical representation: z = f(x), x̂ = g(z) = g(f(x)), where:
  - x = original input
  - x̂ = reconstructed input
  - f = encoding function
  - g = decoding function

4. Loss Function for Autoencoders
- Goal: minimize the difference between the input x and the reconstructed output x̂.
- Common loss functions:
  - Mean squared error (MSE): penalizes large differences.
  - Cross-entropy loss: used for binary or normalized data.
- Connection to PCA (Principal Component Analysis): if the encoder and decoder are linear, the autoencoder learns a form of PCA.

5. Types of Autoencoders
There are several variations of autoencoders designed for specific tasks:
a. Denoising Autoencoders (DAE)
  - Purpose: remove noise from data.
  - Training: the input x is corrupted (e.g., by adding Gaussian noise), and the model learns to reconstruct the clean version.
  - Loss: measures the difference between the clean and the reconstructed image.
b. Sparse Autoencoders
  - Purpose: force the network to learn important features by limiting the number of active neurons.
  - Use a sparsity constraint (e.g., L1 regularization) on the hidden-layer activations.
  - Useful for feature selection and representation learning.
c. Variational Autoencoders (VAEs)
  - Purpose: a probabilistic extension of autoencoders that learns a structured latent space.
  - Key idea:
    - Instead of mapping x to a fixed latent vector z, VAEs map x to a probability distribution q(z|x).
    - The encoder outputs a mean ÎŒ and a variance σ².
    - The reparameterization trick, z = ÎŒ + σ ⊙ Δ with Δ ~ N(0, I), allows gradient-based optimization (see the sketch below).
  - The loss function includes:
    - A reconstruction loss (MSE or cross-entropy).
    - A Kullback-Leibler (KL) divergence term, KL(q(z|x) || p(z)), that enforces the latent space structure.
d. Contractive Autoencoders (CAE)
  - Purpose: enforce robustness by penalizing large changes in the latent space.
  - Add a regularization term that penalizes the sensitivity of the encoder function f(x) to its input.
e. Wasserstein Autoencoders (WAEs)
  - Regularize the aggregate latent distribution towards a prior using an optimal-transport objective, often implemented with an MMD penalty.

6. Training Challenges in Autoencoders
- Overfitting:
  - AEs might memorize the input data instead of learning generalizable features.
  - Solutions: reduce the size of Z; use dropout or weight regularization.
- Poor generalization:
  - Problem: the AE may reconstruct seen data well but fail on unseen data.
  - Solution: use variational autoencoders (VAEs) to enforce a structured latent space.
- Mode collapse in VAEs:
  - Problem: all samples map to the same latent vector.
  - Solution: enforce the KL divergence regularization.
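A minimal PyTorch sketch of the VAE ingredients above: the reparameterization trick and the closed-form KL term for a Gaussian q(z|x) against a standard-normal prior. The encoder/decoder modules are assumed to exist and all names are illustrative.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the graph differentiable."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x_hat, x, mu, logvar):
    """Reconstruction term (MSE here) plus the closed-form KL(q(z|x) || N(0, I))."""
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Usage, assuming `encoder(x)` returns (mu, logvar) and `decoder(z)` returns x_hat:
# mu, logvar = encoder(x)
# z = reparameterize(mu, logvar)
# loss = vae_loss(decoder(z), x, mu, logvar)
```
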
7. Applications of Autoencoders
- Image denoising:
  - Denoising AEs remove noise from images (e.g., medical imaging, astrophysics).
- Anomaly detection:
  - Trained on normal data → large reconstruction errors indicate anomalies.
  - Used in fraud detection, cybersecurity, and medical diagnosis.
- Generative modeling:
  - VAEs can generate new samples by sampling from the latent space.
  - Used in image synthesis, music generation, and text modeling.
- Feature extraction for supervised learning:
  - The encoder's output (latent representation) can be used as input features for classification tasks.

8. Evaluating Autoencoders
- Reconstruction error (MSE, cross-entropy).
- Latent space visualization (e.g., PCA, t-SNE).
- Inception Score (IS) and FrĂ©chet Inception Distance (FID) for generative AEs.

9. Conclusion
- Autoencoders learn compact representations through encoding and decoding.
- Denoising autoencoders remove noise from corrupted inputs.
- Sparse autoencoders enforce feature selection with sparsity constraints.
- VAEs learn structured latent representations using probabilistic modeling.
- Regularization (KL divergence, MMD) improves latent space properties.
- Autoencoders are useful for anomaly detection, image denoising, and feature learning.

Lecture 4: Recurrent Neural Networks (RNNs)

1. Introduction to Recurrent Neural Networks (RNNs)
- RNNs are designed to process sequential data, where order matters (e.g., time series, speech, text).
- Unlike traditional feedforward networks, RNNs maintain a hidden state that carries information from previous steps.
- Key property: they process sequences one step at a time, using past computations to influence the current step.

2. Recurrent vs. Recursive Networks
- Recurrent networks → used for sequences (e.g., speech, time series).
- Recursive networks → used for hierarchical structures (e.g., trees).
- Key distinction:
  - Recurrence refers to patterns in the data (e.g., sequences).
  - Recursion refers to computation defined in terms of previous iterations (e.g., applying a rule repeatedly).

3. Learning from Sequences
- RNNs are used for tasks where the order of elements matters, such as:
  - Speech recognition
  - Machine translation
  - Stock price prediction
  - Handwriting recognition
- Key challenges:
  - Variable-length inputs.
  - Need for memory (state accumulation).
  - Long-range dependencies (addressed by LSTMs/GRUs).

4. Challenges in Training RNNs
- Early RNNs had issues with training stability and with remembering long-term dependencies.
- Key problems:
  - Vanishing gradients → gradients become too small, making learning ineffective.
  - Exploding gradients → gradients become too large, leading to instability.
  - Short-term memory → simple RNNs struggle to remember long-term dependencies.
- Solutions (a usage sketch follows this section):
  - Long Short-Term Memory (LSTM) networks → handle long-range dependencies better.
  - Gated Recurrent Units (GRUs) → a simpler, more efficient alternative to LSTMs.
  - Gradient clipping → prevents exploding gradients.
  - Batch normalization and regularization → improve training stability.
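A minimal PyTorch sketch showing how an LSTM layer carries its hidden and cell states across the time steps of a sequence, here for a toy sequence-classification setup (all sizes and the final linear head are illustrative):

```python
import torch
import torch.nn as nn

# A batch of 4 sequences, each 20 time steps long, with 8 features per step.
x = torch.randn(4, 20, 8)

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 3)  # e.g., 3 output classes

# The LSTM returns the output at every time step plus the final (hidden, cell) states.
outputs, (h_n, c_n) = lstm(x)   # outputs: (4, 20, 32); h_n and c_n: (1, 4, 32)

# Use the last hidden state as a summary of the whole sequence.
logits = head(h_n[-1])          # shape (4, 3)
```
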
5. Long Short-Term Memory (LSTM) Networks
- Introduced by Hochreiter & Schmidhuber (1997) to address the vanishing gradient problem in RNNs.
- Key innovation: uses gates to control the flow of information.
- LSTM components:
  - Cell state (c_t) → memory that persists across time steps.
  - Hidden state (h_t) → output at each time step.
  - Three gates:
    - Input gate → controls how much new information is added.
    - Forget gate → controls how much old information is forgotten.
    - Output gate → controls how much of the memory is output.
  - New cell state: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
  - Hidden state output: h_t = o_t ⊙ tanh(c_t)
- Advantages of LSTMs:
  - Mitigate the vanishing gradient problem.
  - Remember long-range dependencies.
  - More robust than simple RNNs.

6. Gated Recurrent Units (GRUs)
- Introduced by Cho et al. (2014) as a simpler alternative to LSTMs.
- GRUs merge the forget and input gates into a single update gate.
- GRU components:
  - Update gate → decides how much past information to keep.
  - Reset gate → decides how much past information to forget.
  - Candidate activation → creates new memory content.
  - Final hidden state: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
- Advantages of GRUs:
  - Fewer parameters than LSTMs (easier to train).
  - Perform similarly to LSTMs on many tasks.
  - Suitable for small datasets.

7. Applications of RNNs
- Sequence-to-sequence (Seq2Seq) models:
  - Used in machine translation, text summarization, and chatbots.
  - Architecture:
    - Encoder RNN → reads the input sequence and compresses it into a fixed-length representation.
    - Decoder RNN → takes the representation and generates the output sequence.
- Speech recognition:
  - Converts raw audio signals into text using LSTM-based architectures.
- Handwriting recognition:
  - IAM Online Handwriting Database:
    - Input: pen trajectory (Δx, Δy, t, pen-up/pen-down stroke).
    - Output: predicted text characters.
- Music generation:
  - Example: Bach Chorales dataset.
    - Task: predict the next note in a sequence.
    - Loss function: negative log-likelihood (NLL).

8. Advanced RNN Concepts
- Bidirectional RNNs (Bi-RNNs):
  - Use two RNNs, one reading forwards and one backwards.
  - Helpful for tasks where context from both past and future is useful (e.g., named entity recognition).
- Attention mechanisms:
  - Instead of compressing the input into a fixed-size vector, attention allows the decoder to focus on different parts of the input at different times.
  - Widely used in Transformer models.
- Orthogonal initialization:
  - Helps prevent exploding/vanishing gradients by keeping the recurrent weight matrices well-conditioned (motivated by the eigenvalue analysis of simple linear recurrences such as the Fibonacci sequence).

9. RNN Training Techniques
- Backpropagation Through Time (BPTT):
  - Standard backpropagation applied to the unrolled sequence.
  - Computes gradients through the entire sequence length.
  - Problem: can cause vanishing/exploding gradients.
- Gradient clipping:
  - Limits the magnitude of the gradients to prevent instability.
  - Implemented by rescaling: if ||g|| > Ξ, set g ← Ξ · g / ||g|| (see the sketch below).
- Signal regularization:
  - Helps prevent overfitting by enforcing sparsity in the activations.

10. Conclusion
- RNNs process sequential data using hidden states.
- LSTMs use gates (input, forget, output) to handle long-term dependencies.
- GRUs are simpler, computationally efficient alternatives to LSTMs.
- Seq2Seq models are used in machine translation and text summarization.
- Gradient clipping and orthogonal initialization improve training stability.
- Attention mechanisms have largely replaced RNNs in modern NLP (Transformers).
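A minimal PyTorch sketch of gradient clipping inside a single training step, as described in section 9; the model, optimizer, loss function, and clipping threshold are assumed/illustrative:

```python
import torch
from torch.nn.utils import clip_grad_norm_

def clipped_training_step(model, optimizer, loss_fn, x, y, max_norm=1.0):
    """One update with gradient-norm clipping: rescale g to max_norm if ||g|| exceeds it."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                                # BPTT if the model is recurrent
    clip_grad_norm_(model.parameters(), max_norm)  # g <- max_norm * g / ||g|| when ||g|| is too large
    optimizer.step()
    return loss.item()
```
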
