Neural Networks Study Guide PDF

Summary

This document is a study guide on neural networks, covering fundamental concepts and techniques: neurons, layers, activation functions, loss functions, and backpropagation, along with CNN architectures (VGG, ResNet, MobileNet), object detection (YOLO), generative models (VAEs, GANs), recurrent networks (RNN/LSTM), and reinforcement learning. Worked parameter and output-size calculations, PyTorch code snippets, and exam-style questions appear throughout.

Full Transcript


Neural Networks Study Guide, January 26, 2025

1 Neural Network Foundations

1.1 Basic Concepts

1. Neurons
   Each neuron computes a weighted sum of its inputs plus a bias term. An activation function is then applied to introduce non-linearity.

2. Layers
   - Input Layer: Size is determined by the number of features (e.g., pixels in an image).
   - Hidden Layers: One or more intermediate layers that learn non-linear transformations.
   - Output Layer: Size corresponds to the prediction targets (e.g., number of classes).

3. Key Components
   - Weights/Biases: Parameters learned during training.
   - Overfitting: When the network memorizes rather than generalizes; mitigated via regularization, dropout, or early stopping.

Example
A neural network with an input size of 4, one hidden layer of 5 neurons, and an output layer of 2 neurons might be used for a classification task with 2 classes. The forward pass computes weighted sums and activations in the hidden layer, then another weighted sum and activation at the output layer.

1.2 Activation Functions

Purpose: Introduce non-linearity so the network can learn complex relationships.

1. Sigmoid
   Range: [0, 1]. Commonly used in binary classification.
   Drawback: Gradients become very small for large |x| (vanishing gradient problem).

2. Tanh
   Range: [-1, 1]. Typically steeper gradients than sigmoid, but still prone to vanishing gradients.

3. ReLU (Rectified Linear Unit)
   ReLU(x) = max(0, x). Fast to compute and helps mitigate vanishing gradients.
   Drawback: Some neurons can "die" (output zero for all inputs if the weights are updated badly).

4. Softmax
   Outputs a probability distribution over multiple classes. Commonly used in the output layer of multi-class classification tasks.

Example
In a multi-class image classification problem (e.g., classifying digits 0–9), you might use ReLU in the hidden layers for efficient training and a softmax in the final layer to get class probabilities.

1.3 Loss Functions

A loss (or cost) function measures how far the network's predictions are from the target values.

1. MSE (Mean Squared Error)
   Typically used for regression tasks. Can be sensitive to outliers.

2. Cross-Entropy
   Preferred for classification problems (both binary and multi-class). Strongly penalizes confident incorrect predictions (where the predicted probability for the correct class is low).

Example
In a binary classification problem with a sigmoid output, one might use Binary Cross-Entropy (BCE) loss. For multi-class tasks, the corresponding function is Categorical Cross-Entropy (often implemented as nn.CrossEntropyLoss in PyTorch).

1.4 Backpropagation and Gradient Descent

1. Gradient Descent
   An optimization method used to find the set of parameters (weights) that minimizes the loss function. It updates the weights in the opposite direction of the gradient of the loss with respect to those weights.

2. Backpropagation
   An algorithm that efficiently computes the gradient of the loss with respect to the weights by applying the chain rule, propagating errors layer by layer from the output back to earlier layers.

Example
During each training iteration in PyTorch (a runnable sketch follows this list):
1. Forward Pass: Compute predictions and loss.
2. Zero Gradients: Clear old gradients (optimizer.zero_grad()).
3. Backward Pass: Calculate gradients (loss.backward()).
4. Update Parameters: Adjust weights (optimizer.step()).
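To make the gradient-descent update concrete, here is a minimal sketch of my own (not from the guide) of the update rule on a single linear neuron using PyTorch autograd; the data shapes and learning rate are illustrative assumptions.

    import torch

    # Toy regression data: 4 input features, 1 target (shapes are illustrative)
    x = torch.randn(8, 4)                        # batch of 8 samples
    y = torch.randn(8, 1)                        # regression targets

    w = torch.zeros(4, 1, requires_grad=True)    # weights
    b = torch.zeros(1, requires_grad=True)       # bias
    lr = 0.1                                     # learning rate

    for step in range(100):
        y_hat = x @ w + b                        # forward pass: weighted sum + bias
        loss = ((y_hat - y) ** 2).mean()         # MSE loss
        loss.backward()                          # backpropagation: compute gradients
        with torch.no_grad():
            w -= lr * w.grad                     # move against the gradient
            b -= lr * b.grad
            w.grad.zero_()                       # clear gradients for the next step
            b.grad.zero_()

In practice the manual update and gradient clearing are handled by an optimizer (optimizer.step() and optimizer.zero_grad()), exactly as in the four steps listed above.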
1.5 Dataset Splits and Validation

1. Typical Splits:
   - Training Set: 60–80% of the data, for learning parameters.
   - Validation Set: 10–20% of the data, for tuning hyperparameters (e.g., learning rate).
   - Test Set: 10–20% of the data, for final performance estimation.

2. Cross-Validation (k-fold)
   The data is split into k parts; each part takes a turn as the validation set, with the others used for training. Provides a more robust estimate of performance.

Example
If you have 1000 samples, a 60/20/20 split yields 600 training samples, 200 validation samples, and 200 test samples. For 5-fold cross-validation, each fold would have 200 samples, rotating as the validation set.

2 Key Calculations and Parameters

Exams often test the ability to calculate the number of parameters in a network, especially when adding layers or changing dimensions.

2.1 Number of Parameters in a Fully Connected Layer

For a layer with n neurons and m inputs, each neuron has m weights plus 1 bias:

    Total parameters = (m + 1) × n

Worked Example
Network Architecture: Input (2) → Hidden (3) → Output (1)
Parameter Counts:
- Between input and hidden: (2 + 1) × 3 = 9
- Between hidden and output: (3 + 1) × 1 = 4
- Total: 9 + 4 = 13

2.2 Layer Sizes and Weight Matrices

- Input Size: Determined by the dimensionality of your features (e.g., 784 for a flattened 28×28 MNIST image).
- Output Size: Typically the number of classes (e.g., 10 for MNIST digits).
- Weight Matrix Dimensions: Between two layers, Layer A with A neurons and Layer B with B neurons, the weight matrix is A × B.

2.3 Exam-Style Questions

- Q: Parameters for layers [4, 5, 2]? A: (4 × 5 + 5) + (5 × 2 + 2) = 25 + 12 = 37.
- Q: Activation for multi-class classification? A: Softmax (output layer).
- Q: Why ReLU over sigmoid? A: Avoids vanishing gradients; faster computation.
- Q: Split 1000 samples into 60/20/20? A: 600 training, 200 validation, 200 test.

3 Convolutional Neural Networks (CNNs)

CNNs are specialized networks that are particularly effective for images and other spatial data. They use convolutional layers to detect patterns (such as edges) and pooling layers to reduce spatial dimensions.

3.1 Convolutional Layers

- Filters/Kernels: Small matrices used to detect features in the input (e.g., edges, textures).
- Padding:
  – valid: No padding; the output shrinks as the kernel slides over the input.
  – same: Zero-padding so that the output size matches the input size.
- Stride:
  – Step size with which the kernel moves.
  – Larger stride ⇒ smaller output.
- Output Size Formula:

    Output Size = ⌊(N − F + 2P) / S⌋ + 1

  where:
  – N = input size (e.g., width/height of the image)
  – F = kernel size
  – P = padding
  – S = stride

Example
If N = 28, kernel F = 5, stride S = 1, and padding P = 0, the output size is ⌊(28 − 5 + 0) / 1⌋ + 1 = 24.

3.2 Pooling Layers

- Max-Pooling: Takes the maximum value in each spatial window, reducing the output size (e.g., MaxPool2d(kernel_size=2, stride=2)).
- No Learnable Parameters: Pooling layers have no weights to learn; they only perform a down-sampling function.

4 PyTorch Essentials

4.1 Tensors

Definition: Multi-dimensional arrays that can run on CPUs or GPUs.
Key Operations (illustrated in the sketch below):
- Mathematical operations: addition, multiplication, matrix multiplication, etc.
- Reshaping with view() or reshape().
- Moving to GPU: tensor.to(device).
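A brief sketch of these tensor operations (my own illustration; the tensor names and shapes are not from the guide):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    a = torch.randn(2, 3)        # 2x3 tensor
    b = torch.randn(3, 4)        # 3x4 tensor

    c = a @ b                    # matrix multiplication -> shape (2, 4)
    d = a + 1.0                  # elementwise addition
    e = c.view(4, 2)             # reshape (requires a contiguous tensor)
    f = c.reshape(-1)            # flatten to a 1-D tensor of 8 elements
    g = c.to(device)             # move to GPU if one is available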
4.2 Model Architecture in PyTorch

Below is a minimal example using nn.Sequential:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3),  # e.g., input: 3 color channels
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),                                   # flatten before Linear
        nn.Linear(16 * some_height * some_width, 10)    # example fully connected layer
    )

Common Error
Forgetting to call optimizer.zero_grad() before loss.backward(). This causes gradients to accumulate across batches.

5 Advanced Calculations to Master

5.1 Parameter Counts in CNN Layers

Conv2d Layer:

    Parameters = (C_in × F_h × F_w + 1) × C_out

where:
- C_in: number of input channels
- F_h, F_w: filter height and width
- C_out: number of output channels
- +1 accounts for a bias term per output channel

Linear (Fully Connected) Layer:

    Parameters = (C_in + 1) × C_out

6 Examples and Exercises

6.1 Example 1: CNN Output Dimension Calculation

Input: 1 channel, 28 × 28 image
Conv2d: 10 output channels, kernel 5 × 5, stride 1, padding 0

Step-by-Step
1. Output Size (H/W): ⌊(28 − 5 + 0) / 1⌋ + 1 = 24, so the output is 24 × 24 for each of the 10 output channels.
2. Number of Parameters: (C_in × F_h × F_w + 1) × C_out = (1 × 5 × 5 + 1) × 10 = (25 + 1) × 10 = 260.

6.2 Example 2: Training Step Code Snippet

    for data, labels in dataloader:
        # 1. Move data to device (GPU/CPU)
        data, labels = data.to(device), labels.to(device)

        # 2. Forward pass
        outputs = model(data)
        loss = criterion(outputs, labels)

        # 3. Zero gradients
        optimizer.zero_grad()

        # 4. Backward pass
        loss.backward()

        # 5. Update parameters
        optimizer.step()

Key Points
- optimizer.zero_grad() prevents gradient accumulation.
- loss.backward() computes gradients using backpropagation.
- optimizer.step() updates the model's parameters.

7 Practice Questions

1. Parameter Calculation (Fully Connected): A layer goes from 8 inputs to 3 outputs. How many parameters (weights + biases) does it have?
2. CNN Output Size: Input size = 64 × 64, kernel size = 3, stride = 2, padding = 1. What is the spatial size of the output?
3. Loss Functions: For a multi-class classification problem with 5 classes, which loss function is most appropriate and why?
4. Activation Functions: Give one reason you might replace a sigmoid activation with a ReLU in a hidden layer.
5. Data Splits: If you have 10,000 samples, propose a 70/15/15 split. How many samples go to each subset?

Hints/Solutions
1. Parameter Calculation: (8 + 1) × 3 = 27.
2. CNN Output Size: ⌊(64 − 3 + 2 × 1) / 2⌋ + 1 = 31 + 1 = 32. Output spatial dimension = 32 × 32.
3. Loss Functions: Typically Cross-Entropy (specifically, Categorical Cross-Entropy).
4. Activation Functions: ReLU reduces vanishing gradients and speeds up training.
5. Data Splits: Training: 7000, Validation: 1500, Test: 1500.

8 Conclusion and Key Takeaways

Neural Networks:
- Build understanding from neurons and layers to activation and loss functions.
- Master parameter-count and dimension calculations to avoid mistakes.

Convolutional Neural Networks:
- Focus on convolution/pooling layers for effective feature extraction.
- Know how to compute output sizes and parameter counts.

PyTorch Workflow:
- Familiarize yourself with the training loop: forward pass, loss, backward pass, and parameter update.
- Pay attention to data reshaping and device allocation.

By thoroughly understanding these concepts and practicing the calculations and coding steps provided, you will be well prepared for exams and practical deep learning tasks.
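Before moving on to specific architectures, a quick way to check parameter-count answers like those above is to count them directly in PyTorch. A sketch of my own, using the layer sizes from the worked examples:

    import torch.nn as nn

    def count_params(module: nn.Module) -> int:
        # Sum the number of elements over all weight and bias tensors
        return sum(p.numel() for p in module.parameters())

    conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5)  # expect (1*5*5 + 1) * 10 = 260
    fc = nn.Linear(in_features=8, out_features=3)                    # expect (8 + 1) * 3 = 27

    print(count_params(conv))  # 260
    print(count_params(fc))    # 27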
9 VGG Network Architecture

9.1 Key Features

- Uses 3x3 convolutions (stacked to mimic larger receptive fields).
- Architecture variants: VGG16 (16 layers) and VGG19 (19 layers).
- Structure: Conv layers → MaxPooling (2x2, stride 2) → Fully Connected (FC) layers.
- Advantages: Fewer parameters (compared to larger kernels with the same receptive field), faster training, reduced overfitting.

9.2 Output Size Calculation

Convolution:

    Output = ⌊(W − K + 2P) / S⌋ + 1

Example: Input=224x224, kernel=3x3, padding=1, stride=1 ⇒ Output=224x224.
MaxPooling: Halves the spatial dimensions (e.g., 224x224 → 112x112).

10 Implementation & Training

10.1 Code Structure

    # Configuration for VGG16: numbers are conv output channels, "M" marks a max-pooling layer
    VGG_types = {
        "VGG16": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                  512, 512, 512, "M", 512, 512, 512, "M"]
    }

10.2 Training Steps

- Data: CIFAR100 (resized to 224x224, normalized).
- Hyperparameters: SGD optimizer (LR=0.005, momentum=0.9).
- Loss: CrossEntropyLoss.

11 Transfer Learning

11.1 Approach

Freeze the pre-trained conv layers and retrain the FC layers:

    model = models.vgg16_bn(weights=models.VGG16_BN_Weights.DEFAULT)
    for param in model.features.parameters():
        param.requires_grad = False        # freeze conv layers
    model.classifier = nn.Sequential(...)  # new FC layers

12 Batch Normalization

12.1 Purpose & Implementation

- Stabilizes training by normalizing layer inputs.
- Added after conv layers: Conv2d → BatchNorm2d → ReLU.

13 Key Calculations

13.1 Parameter Count

For a conv layer:

    Parameters = (kernel_h × kernel_w × in_ch + 1) × out_ch

Example: 3x3 conv, 64 in → 128 out: (3 × 3 × 64 + 1) × 128 = 73,856.

13.2 Output Dimensions

After 3 conv layers (3x3, padding=1) and 2 pooling layers on a 224x224 input:

    224 → (convs) → 224 → (pooling) → 112 → (pooling) → 56

14 Exam-Ready Concepts

14.1 Example Questions

1. Q: Calculate the parameters of a 3x3 conv layer with 256 input and 512 output channels.
   A: (3 × 3 × 256 + 1) × 512 = 1,180,672.
2. Q: Why use 3x3 convolutions?
   A: Fewer parameters, more non-linearity, better feature refinement.

15 ResNet (Residual Networks)

15.1 Key Concepts

- Problem: Vanishing gradients in deep networks (e.g., VGG).
- Solution: Residual blocks with skip connections that learn F(x) = H(x) − x.
- Block Types:
  – Identity shortcut (matching dimensions).
  – 1x1 convolution shortcut (adjusts dimensions).
- Bottleneck Architecture: Uses 1x1 convolutions to reduce/restore channels.

15.2 Code Structure

    class Block(nn.Module):
        def __init__(self, in_channels, intermediate_channels,
                     identity_downsample=None, stride=1):
            super().__init__()
            self.expansion = 4
            self.conv1 = nn.Conv2d(in_channels, intermediate_channels,
                                   kernel_size=1, stride=1, padding=0, bias=False)
            self.bn1 = nn.BatchNorm2d(intermediate_channels)
            # ... (other layers)

16 ResNetv2 (Improved Residual Networks)

- Optimization: Reordered layers to BN → ReLU → Conv.
- Ensures cleaner identity mappings and better gradient flow.

17 ResNeXt (Cardinality Dimension)

- Key Idea: Uses grouped convolutions (parallel processing paths).
- Cardinality: The number of groups (e.g., 32); improves feature diversity.
- Example: By the grouped-convolution formula in Section 18.2, a 3x3 grouped convolution with C_in = 256, C_out = 256, and groups = 32 has only (256/32) × 3 × 3 × (256/32) × 32 = 18,432 parameters, versus 589,824 for the standard convolution (see the sketch after this list).
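To see where the grouped-convolution saving comes from, here is a small sketch of my own (not from the guide) comparing the parameter counts of a standard and a grouped 3x3 convolution in PyTorch:

    import torch.nn as nn

    def count_params(m: nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    # Standard 3x3 convolution: 256 -> 256 channels, no bias
    standard = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

    # Grouped 3x3 convolution with cardinality (groups) = 32
    grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=32, bias=False)

    print(count_params(standard))  # 589824 = 256 * 3 * 3 * 256
    print(count_params(grouped))   # 18432  = (256/32) * 3 * 3 * (256/32) * 32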
18 Key Calculations

18.1 Output Size After Convolution

    Output Size = ⌊(W − K + 2P) / S⌋ + 1

Example: Input=224x224, kernel=7x7, stride=2, padding=3 → Output=112x112.

18.2 Parameter Counts

Standard convolution:

    Params = (C_in × K × K × C_out) + Bias

Grouped convolution (ResNeXt):

    Params = (C_in / groups) × K × K × (C_out / groups) × groups

19 Example Exam Question

Question: Calculate the parameters in a ResNeXt block with 128 input channels, 256 output channels, a 3x3 kernel, and groups=32.
Answer:

    Params = (128 / 32) × 3 × 3 × (256 / 32) × 32 = 9,216

20 MobileNet Overview

- Purpose: Efficient CNN architectures for mobile/resource-constrained devices.
- Versions:
  – MobileNet v1 (2017): Introduced Depthwise Separable Convolutions.
  – MobileNet v2 (2018): Added Inverted Residual Blocks with Linear Bottlenecks.

21 Key Concepts

21.1 Depthwise Separable Convolution

Two Stages:
1. Depthwise Convolution: A single filter per input channel (e.g., 3x3 kernel).
2. Pointwise Convolution: A 1x1 convolution that combines the outputs.

Parameter Formula:

    (input_channels × kernel_size²) + (input_channels × output_channels)

21.2 Inverted Residual Block with Linear Bottleneck

Structure:
1. Expansion: 1x1 convolution (expand channels by factor t = 6) + ReLU6.
2. Depthwise Convolution: Spatial filtering (e.g., 3x3 kernel).
3. Compression: 1x1 convolution (no activation).

Residual Connection: Added only if the input/output dimensions match.

22 MobileNet v2 Architecture

Components:
- 17 Inverted Residual Blocks (configured via base_model).
- Global Average Pooling (output: 1x1x1280).
- Classifier: Dropout (0.2) + Linear layer.

Hyperparameters: Expansion factor t = 6, stride s = 2, kernel size 3 × 3.

23 Implementation Details (PyTorch)

23.1 Code Structure

- CNNBlock: Conv2d + BatchNorm + ReLU6.
- InvertedResidualBlock: Handles expansion, depthwise convolution, compression.
- MobileNet class: Assembles blocks from base_model.

23.2 Configuration (base_model)

    base_model = [
        # expand_ratio, channels, repeats, stride, kernel_size
        [1,  16, 1, 1, 3],
        [6,  24, 2, 2, 3],
        [6,  32, 3, 2, 3],
        [6,  64, 4, 2, 3],
        [6,  96, 3, 1, 3],
        [6, 160, 3, 2, 3],
        [6, 320, 1, 1, 3],
    ]

24 Calculations to Master

24.1 Output Size After Convolution

    Output Size = ⌊(Input Size − Kernel Size + 2 × Padding) / Stride⌋ + 1

Example: Input=224x224, kernel=3x3, stride=2, padding=1 ⇒ Output=112x112.

24.2 Parameter Count

Standard Convolution:

    input_channels × kernel_size² × output_channels

Depthwise Separable:

    (input_channels × kernel_size²) + (input_channels × output_channels)

25 Exam-Ready Insights

- Why MobileNet? Fewer parameters, faster inference, mobile-friendly.
- ReLU6: Bounds activations to [0, 6] for quantization.
- Linear Bottleneck: Prevents ReLU-induced information loss.
- Residual Connections: Improve gradient flow.

Common Exam Questions
- Compare standard vs. depthwise separable convolutions (see the sketch after this list).
- Calculate parameters/output dimensions for given layers.
- Explain expansion/compression in inverted residuals.
- Trace the code flow for the PyTorch implementation.
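For the first comparison question above, a minimal sketch of my own (the channel counts are illustrative assumptions) contrasting a standard 3x3 convolution with its depthwise separable counterpart:

    import torch.nn as nn

    def count_params(m: nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    in_ch, out_ch = 32, 64   # illustrative channel counts

    # Standard 3x3 convolution (no bias): 32 * 3*3 * 64 = 18,432 parameters
    standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

    # Depthwise separable convolution:
    #   depthwise 3x3 (one filter per channel): 32 * 3*3 = 288 parameters
    #   pointwise 1x1 (combine channels):       32 * 64  = 2,048 parameters
    separable = nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )

    print(count_params(standard))   # 18432
    print(count_params(separable))  # 2336

The totals match the parameter formula from Section 21.1: 32 × 3² + 32 × 64 = 2,336 versus 18,432.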
Object Detection Overview

- Classification: Identify the object (e.g., "dog", "car").
- Localization: Predict bounding box coordinates (x, y, w, h).
- Two-stage detectors (e.g., Faster R-CNN):
  – Step 1: Generate region proposals.
  – Step 2: Classify the proposals.
  – Pros: High accuracy. Cons: Slow.
- One-stage detectors (e.g., YOLO, SSD):
  – Single-step prediction (fast, real-time).
  – Pros: Speed. Cons: Struggles with small/overlapping objects.

YOLOv1 Core Mechanics

Grid System:
- The image is divided into a 7 × 7 grid of cells.
- Each cell predicts:
  – 2 bounding boxes (confidence, x, y, w, h).
  – 20 class probabilities (Pascal VOC dataset).
- Output tensor: 7 × 7 × 30.

Responsibility Rule: The grid cell containing the object's center predicts the box/class.

Bounding Box Encoding:
- (x, y): Relative to the grid cell (range [0, 1]).
- (w, h): Normalized to the image size (range [0, 1]).

Model Architecture

Layers:
- 24 convolutional layers (pretrained on ImageNet).
- 2 fully connected (FC) layers.
- Uses Leaky ReLU (α = 0.1) and BatchNorm.

Parameter Calculations:
- Convolutional layer (weights only): Params = kernel_h × kernel_w × in_channels × out_channels.
  Example: 7 × 7 × 3 × 64 = 9,408.
- FC layer: Params = (input_size + 1) × output_size.
  Example: (1024 × 7 × 7 + 1) × 4096 = 205,524,992.

Loss Function

Combines four weighted components:
1. Coordinate Loss: MSE for x, y and √w, √h (λ_coord = 5).
2. Object Confidence Loss: MSE for boxes containing objects.
3. No-Object Confidence Loss: MSE for empty boxes (λ_noobj = 0.5).
4. Classification Loss: MSE for class probabilities.

Non-Max Suppression (NMS)

Purpose: Remove duplicate bounding boxes.
Steps:
1. Filter boxes by a confidence threshold (e.g., 0.5).
2. Sort boxes by confidence (highest first).
3. Iteratively remove overlapping boxes (IoU > threshold).

Mean Average Precision (mAP)

Metric: Average precision across all classes.
Steps:
1. For each class, compute the precision-recall curve.
2. Calculate AP (area under the curve).
3. Average AP across classes.

Key Code Snippets

IoU Calculation:

    def intersection_over_union(boxes_preds, boxes_labels, box_format="midpoint"):
        # Calculate coordinates of the intersection rectangle
        x1 = torch.max(boxes_preds[..., 0], boxes_labels[..., 0])
        y1 = torch.max(boxes_preds[..., 1], boxes_labels[..., 1])
        # ... (clamp and compute area)

Model Architecture:

    class Yolov1(nn.Module):
        def __init__(self, in_channels=3, **kwargs):
            super().__init__()
            self.architecture = architecture_config
            self.darknet = self._create_conv_layers()

Exam Focus: Calculations

Convolution Output Size:

    W_out = ⌊(W_in − kernel + 2 × padding) / stride⌋ + 1

Example: 448 × 448 → 224 × 224 with kernel=7, stride=2, padding=3.

Grid Assignment:
- An object at (0.6, 0.7) → cell (4, 4) in a 7 × 7 grid.

26 Generative Models

Goal: Learn p(x) in order to generate new data samples.
Latent Variables: Assume data x is generated from a low-dimensional latent variable z (e.g., shape/color in images).

26.1 Key Probabilistic Concepts

- Prior p(z): Assumed distribution of z (e.g., N(0, 1)).
- Posterior p(z|x): The intractable true distribution of z given x.
- Approximate Posterior q(z|x): A learned distribution (e.g., Gaussian) that approximates p(z|x).
- Likelihood p(x|z): The decoder's output distribution.

26.2 KL Divergence

Measures the divergence between distributions:
- Forward KL: Encourages q(z|x) to cover all modes of p(z|x).
- Reverse KL: Focuses on matching single modes (used in VAEs).

27 VAE Architecture

27.1 Encoder

Maps input x to the parameters of q(z|x):

    µ, log σ² = Encoder(x)

27.2 Decoder

Reconstructs x from the latent z:

    x̂ = Decoder(z)

27.3 Reparameterization Trick

Samples z while preserving gradients:

    z = µ + σ ⊙ ϵ,   ϵ ∼ N(0, 1)

28 Loss Function: ELBO

Maximize the Evidence Lower Bound:

    ELBO = E_{q(z|x)}[log p(x|z)]  −  KL(q(z|x) ∥ p(z))
           (reconstruction term)     (regularization term)

For Gaussian distributions:

    KL = −(1/2) Σ (1 + log σ² − µ² − σ²)

29 Training Process

1. Encode input x to get µ and log σ².
2. Sample z via reparameterization.
3. Decode z to reconstruct x̂.
4. Compute the total loss: Loss = MSE(x̂, x) + KL (see the sketch after this list).
5. Backpropagate and update weights.
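A short sketch of step 4 (my own illustration; the function and variable names, and the use of a summed MSE, are assumptions rather than the guide's exact implementation):

    import torch
    import torch.nn.functional as F

    def vae_loss(x_hat: torch.Tensor, x: torch.Tensor,
                 mu: torch.Tensor, log_sigma2: torch.Tensor) -> torch.Tensor:
        # Reconstruction term: how well the decoder reproduces the input
        recon = F.mse_loss(x_hat, x, reduction="sum")
        # KL term: -(1/2) * sum(1 + log sigma^2 - mu^2 - sigma^2)
        kl = -0.5 * torch.sum(1 + log_sigma2 - mu.pow(2) - log_sigma2.exp())
        return recon + kl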
30 Generating New Data

- Inference: Sample z ∼ N(µ, σ²) from an encoded input.
- Prior Sampling: Generate novel data via z ∼ N(0, 1).

31 Implementation Details

31.1 Network Design

- Encoder: FC layers with ReLU activations.
- Decoder: FC layers with a sigmoid output (for MNIST).

31.2 Code Snippet (PyTorch)

    class VariationalAutoEncoder(nn.Module):
        def __init__(self, input_dim, h_dim=200, z_dim=20):
            super().__init__()
            # Encoder
            self.img_2hid = nn.Linear(input_dim, h_dim)
            self.hid_2mu = nn.Linear(h_dim, z_dim)
            self.hid_2sigma = nn.Linear(h_dim, z_dim)
            # Decoder
            self.z_2hid = nn.Linear(z_dim, h_dim)
            self.hid_2img = nn.Linear(h_dim, input_dim)
            self.relu = nn.ReLU()

        def encode(self, x):
            h = self.relu(self.img_2hid(x))
            return self.hid_2mu(h), self.hid_2sigma(h)

        def decode(self, z):
            h = self.relu(self.z_2hid(z))
            return torch.sigmoid(self.hid_2img(h))

        def forward(self, x):
            mu, log_sigma = self.encode(x)
            sigma = torch.exp(0.5 * log_sigma)
            epsilon = torch.randn_like(sigma)
            z = mu + sigma * epsilon
            return self.decode(z), mu, log_sigma

32 Why VAEs Work

- Structured latent space via KL regularization.
- Diversity through stochastic sampling.
- Balance between reconstruction and regularization.

33 Applications

Image generation, anomaly detection, semi-supervised learning, and data compression.

34 Limitations

- Blurry outputs compared to GANs.
- Simplistic prior assumptions.

35 GANs: Core Concepts

35.1 Architecture

Generator (G):
- Input: A random noise vector (e.g., 100-dimensional).
- Output: Synthetic data (e.g., images).
- Goal: Fool the discriminator into classifying fake data as real.

Discriminator (D):
- Input: Real or generated data.
- Output: Probability (0–1) that the input is real.
- Goal: Distinguish real vs. fake data.

35.2 Training Dynamics

Adversarial Process:
1. Train D: Freeze G, update D to maximize

    L_D = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]      (1)

2. Train G: Freeze D, update G to minimize

    L_G = −E_{z∼p_z}[log D(G(z))]                                    (2)

36 DCGANs: Enhancing GANs with Convolutions

36.1 Architecture Guidelines

- Replace fully connected layers with convolutional layers.
- Use BatchNorm in both networks.
- Generator: Transposed convolutions for upsampling, ReLU (output: Tanh).
- Discriminator: Strided convolutions, LeakyReLU (slope=0.2).

36.2 Transposed Convolutions

Output Size Formula:

    Output Size = (Input Size − 1) × Stride + Kernel Size − 2 × Padding      (3)

Example: Input=5x5, kernel=3x3, stride=2, padding=1:

    (5 − 1) × 2 + 3 − 2 × 1 = 9, giving a 9 × 9 output.

37 CycleGAN: Unpaired Image Translation

37.1 Loss Functions

Cycle Consistency Loss:

    L_cycle = E_x[∥F(G(x)) − x∥₁] + E_y[∥G(F(y)) − y∥₁]      (4)

Identity Loss:

    L_identity = E_y[∥G(y) − y∥₁] + E_x[∥F(x) − x∥₁]          (5)

38 Parameter Calculations

38.1 Convolutional Layer

    Parameters = (K_w × K_h × C_in + 1) × C_out      (6)

38.2 Example Calculation

Input channels=3, kernel=4x4, output channels=64:

    (4 × 4 × 3 + 1) × 64 = 3,136

39 Code Examples

39.1 GAN Generator for MNIST (PyTorch)

    class Generator(nn.Module):
        def __init__(self, z_dim=64, img_dim=784):
            super().__init__()
            self.gen = nn.Sequential(
                nn.Linear(z_dim, 256),
                nn.LeakyReLU(0.01),
                nn.Linear(256, img_dim),
                nn.Tanh(),   # output in [-1, 1]
            )

        def forward(self, x):
            return self.gen(x)

40 Common Pitfalls & Solutions

Issue                   | Cause                                   | Solution
------------------------|-----------------------------------------|------------------------------------------
Mode Collapse           | G produces limited outputs              | Use WGAN-GP
Checkerboard Artifacts  | Transposed conv. stride/kernel mismatch | Use a kernel size divisible by the stride
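The checkerboard-artifact issue above comes from the transposed-convolution arithmetic in Section 36.2. As a quick numerical check of that formula, a sketch of my own using the same numbers as the 5x5 example:

    import torch
    import torch.nn as nn

    # (Input - 1) * stride + kernel - 2 * padding = (5 - 1) * 2 + 3 - 2 = 9
    upsample = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                                  kernel_size=3, stride=2, padding=1)

    x = torch.randn(1, 1, 5, 5)    # batch of one 5x5 single-channel input
    print(upsample(x).shape)       # torch.Size([1, 1, 9, 9])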
41 Exam-Style Questions

1. Q: Calculate the parameters for a DCGAN discriminator layer with 128 input channels, 256 output channels, and a 4x4 kernel.
   A: (4 × 4 × 128 + 1) × 256 = 524,544.
2. Q: Why is fake.detach() used in GAN training?
   A: To prevent gradients from propagating into the generator when updating the discriminator.

42 Recurrent Neural Networks (RNNs)

42.1 Core Idea

RNNs maintain a "hidden state" to model temporal dependencies via feedback loops. Unlike feedforward networks, they process sequences step by step, using previous outputs as inputs.

42.2 Structure

- Input Layer: Receives sequential data (e.g., words, video frames).
- Hidden State: Updated at each time step:

    h_t = f(W_hh · h_{t−1} + W_xh · x_t + b_h)

  where W_hh and W_xh are weight matrices, b_h is a bias, and f is an activation function (e.g., tanh).
- Output Layer: Generates predictions (e.g., the next word in a sentence).

42.3 Architecture Types

- One-to-Many: Single input → sequence (e.g., image captioning).
- Many-to-One: Sequence → single output (e.g., sentiment analysis).
- Many-to-Many: Sequence → sequence (e.g., machine translation).

42.4 Challenges

- Vanishing/Exploding Gradients: Backpropagation Through Time (BPTT) struggles with long sequences.
- Short-Term Memory: Vanilla RNNs fail to retain distant dependencies.

43 Long Short-Term Memory (LSTM)

43.1 Core Idea

LSTMs address RNN limitations using a cell state (C_t) and gating mechanisms to regulate information flow. Introduced by Hochreiter & Schmidhuber (1997).

43.2 Key Components

- Cell State (C_t): A long-term memory "conveyor belt."
- Gates (sigmoid-activated):
  – Forget Gate (f_t): Discards irrelevant information.
  – Input Gate (i_t): Adds new information.
  – Output Gate (o_t): Exposes relevant parts of C_t.

[Figure 1: LSTM Cell]

43.3 Mathematical Formulation

    f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
    i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
    C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
    C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
    o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
    h_t = o_t ⊙ tanh(C_t)

σ = sigmoid, ⊙ = element-wise multiplication.

43.4 Advantages Over RNNs

- Mitigates vanishing gradients via the cell state.
- Selective memory updates using gates.

44 Variations of LSTMs

- Peephole LSTM: Gates also access C_t for timing-sensitive tasks.
- GRU (Gated Recurrent Unit): Combines the forget/input gates into an "update gate" and merges C_t and h_t.

45 PyTorch Implementation

45.1 RNN vs. LSTM Code

    # RNN definition
    self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)

    # LSTM definition
    self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

45.2 Training Workflow

- Data Preparation: Use pack_sequence for variable-length inputs.
- Forward Pass (see the shape sketch after this section):

    out, (h_n, c_n) = self.lstm(x, (h0, c0))  # LSTM

- Loss & Optimization: Cross-entropy loss with Adam.

45.3 Performance on MNIST

- RNN: ~97.8% test accuracy.
- LSTM: ~98.7% test accuracy.
- GRU: ~97.8% accuracy (faster training).
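To make the tensor shapes in the forward pass above concrete, a minimal sketch of my own (the dimensions are illustrative assumptions, e.g. MNIST rows treated as a length-28 sequence of 28 features):

    import torch
    import torch.nn as nn

    input_size, hidden_size, num_layers, batch = 28, 128, 2, 32   # illustrative
    lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

    x = torch.randn(batch, 28, input_size)                 # (batch, seq_len, features)
    h0 = torch.zeros(num_layers, batch, hidden_size)       # initial hidden state
    c0 = torch.zeros(num_layers, batch, hidden_size)       # initial cell state

    out, (h_n, c_n) = lstm(x, (h0, c0))
    print(out.shape)   # torch.Size([32, 28, 128]) -- hidden state at every time step
    print(h_n.shape)   # torch.Size([2, 32, 128])  -- final hidden state per layer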
46 Applications

- Text/Image/Music Generation.
- Machine Translation.
- Speech Recognition.
- Time Series Forecasting.

47 Key Takeaways

- RNNs struggle with long-term dependencies.
- LSTMs use gated memory for stable gradients.
- GRUs balance efficiency and performance.

1. Conceptual Overviews

Reinforcement Learning (RL)
- Definition: Learning to map states to actions to maximize cumulative reward.
- Key Components:
  – Agent: Decision-maker (policy, value function, reward function).
  – Environment: Where the agent acts.
  – Policy: Strategy for choosing actions (e.g., a neural network).
  – Value Function: Estimates long-term reward from a state.
  – Reward Function: Immediate feedback for actions.

On-Policy vs Off-Policy
- On-Policy (e.g., SARSA):
  – Updates based on actions taken by the current policy.
  – Pros: Adapts to exploration noise.
  – Cons: May get stuck in local optima.
- Off-Policy (e.g., Q-Learning):
  – Updates using hypothetical actions (not necessarily taken).
  – Pros: More flexible, learns the optimal policy.
  – Cons: Higher variance.

Policy Gradient
- Goal: Directly optimize the policy (no explicit value function).
- Key Formula:

    ∇_θ J(θ) = E[ Σ ∇_θ log π_θ(a|s) · A(s, a) ]

  – A(s, a): Advantage function (how much better action a is than average).
  – Interpretation: Adjust the policy parameters θ to favor actions with higher advantage.

Actor-Critic
- Combines an Actor (policy) and a Critic (value function):
  – Actor: Learns the policy via the policy gradient.
  – Critic: Estimates a value function (e.g., V(s) or Q(s, a)).
- Advantage Function:

    A(s, a) = Q(s, a) − V(s)

  – Reduces variance by comparing the action value to the state value.

2. Architecture Insights

Actor Network (example from the Lunar Lander code)
- Layers:
  – Input: State (8-dimensional vector).
  – Hidden Layer: 64 units (ReLU activation).
  – Output: 4 actions (softmax for probabilities).
- Role: Outputs action probabilities (stochastic policy).

Critic Network
- Layers:
  – Input: State (8-dimensional vector).
  – Hidden Layer: 64 units (ReLU activation).
  – Output: Scalar value V(s).
- Role: Estimates the value of the current state.

3. Key Formulas

1. Policy Gradient Update:

    θ ← θ + α · ∇_θ J(θ)

   α: learning rate. Intuition: adjust the policy to increase the probability of high-reward actions.

2. TD Target:

    TD_target = r + γ · V(s′)

   Used to update the Critic by minimizing an MSE loss.

3. Advantage:

    A(s, a) = TD_target − V(s)

4. Parameter Calculation Examples

Actor Network Parameters
- Input → Hidden (64 units): 8 × 64 + 64 = 576 (weights + biases)
- Hidden → Output (4 units): 64 × 4 + 4 = 260
- Total: 576 + 260 = 836

Critic Network Parameters
- Input → Hidden (64 units): 8 × 64 + 64 = 576
- Hidden → Output (1 unit): 64 × 1 + 1 = 65
- Total: 576 + 65 = 641

5. Code Understanding

Key Snippets (PyTorch)

    # Actor network definition
    class Actor(nn.Module):
        def __init__(self, input_size, num_actions):
            super().__init__()
            self.fc1 = nn.Linear(input_size, 64)    # input -> hidden
            self.fc2 = nn.Linear(64, num_actions)   # hidden -> output (softmax)

        def forward(self, x):
            x = F.relu(self.fc1(x))
            x = F.softmax(self.fc2(x), dim=-1)
            return x

Softmax converts the logits into probabilities for stochastic action selection.

Training Loop
- Actor Update:

    log_prob = dist.log_prob(action)
    actor_loss = -log_prob * advantage.detach()

  Purpose: Maximize the log probability of actions with high advantage.
- Critic Update:

    critic_loss = F.mse_loss(value, td_target.detach())

  Purpose: Minimize the error in the value estimate.

6. Application & Pitfalls

Use Cases
- Lunar Lander: Learn to land safely using thrusters.
- Robotics: Continuous control tasks.

Pitfalls
- High Variance: Addressed by advantage normalization.
- Exploration vs Exploitation: Add entropy regularization.
- Credit Assignment: Long episodes delay reward signals.

7. Study Tips & Exam-Style Questions

Study Strategies
- Practice deriving policy gradient updates.
- Simulate the training loop step by step.
- Compare on-policy (SARSA) vs off-policy (Q-learning) in grid-world examples.

Sample Questions
1. Parameter Calculation: Calculate the number of parameters in an Actor network with input size 10, a hidden layer of 128 units, and 5 actions.
   Answer: 10 × 128 + 128 + 128 × 5 + 5 = 2,053.
2. Code Interpretation: What does advantage.detach() do in the Actor loss?
   Answer: It prevents gradients from flowing through the Critic during Actor updates.
3. Conceptual: Why use a softmax layer in the Actor network?
   Answer: To sample actions stochastically, enabling exploration.

Summary

This guide connects RL theory (policy gradients, actor-critic) with code implementations, parameter calculations, and common exam questions. Focus on understanding how gradients flow in policy updates and the role of the advantage function in reducing variance.
