Module 4 - CNN PDF

Module 4 - Convolutional Networks

Introduction to Convolutional Networks (CNNs)

CNNs are a specialized type of neural network designed to handle data with a grid-like structure. Examples of grid-like data:
1D data: time-series data sampled at regular intervals (like stock prices).
2D data: images, represented as grids of pixels.
CNNs have achieved significant success in many practical applications, especially in areas like image recognition.

Convolutional

The name “convolutional neural network” comes from the convolution operation used in these networks. Convolution is a specialized kind of linear operation that replaces general matrix multiplication in some of the network layers.

Structure of Convolutional Networks

CNNs are similar to traditional neural networks but replace general matrix multiplication with convolution in at least one layer. This lets a CNN exploit the grid structure of its input and capture local patterns with far fewer parameters.

The Convolution Operation

In general, convolution is a mathematical operation that combines two functions to produce a new function. In CNNs, the convolution operation extracts features from input data by sliding a filter (kernel) over the input and capturing patterns like edges and textures. Each filter captures a different aspect of the image, and the output is a feature map that highlights where these features appear in the image.
At each position, multiply each number in the filter by the number in the image directly beneath it, then add up all the results to get a single number. That number is one entry of the feature map.

[Figures: feature map; convolution process; example of different filters]

Motivation for CNN

Convolution leverages three important ideas that can help improve a machine learning system:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
Moreover, convolution provides a means for working with inputs of variable size.

Sparse interactions

Traditional neural network layers connect each input node to every output node, leading to a lot of parameters to store and process.
This is inefficient, especially for large inputs like images: a fully connected layer with m inputs and n outputs needs m × n weights.
In contrast, convolutional networks have sparse interactions because they use kernels (small filters) that each look at only a small region of the input at a time. If each output is connected to only k inputs, the layer needs only k × n connections, with k typically much smaller than m. This means fewer parameters to learn, faster computation, and less memory usage.

Efficiency Gains through Sparse Connectivity

Sparse connectivity is much cheaper than all-to-all connectivity, reducing both computation time and storage needs. Units in deeper layers still interact indirectly with a large part of the input, so the network can efficiently combine many simple patterns (such as edges or corners) into more complex shapes or objects.

Parameter sharing

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural network, each weight (parameter) is used exactly once when computing a layer's output, so it is unique to one connection. In CNNs, the same set of weights (the kernel) is applied across different parts of the input.
Parameter sharing therefore lets a CNN use a single set of weights at every location, which makes it more memory-efficient and reduces the total number of parameters to learn. For example, a filter that detects a vertical edge in one region of an image can be used to detect similar edges elsewhere, leading to better generalization and more efficient learning.

Equivariant representations

In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. Equivariance means that if the input changes (for example, an image is shifted), the output changes in the same way.
For CNNs, this means that shifting an object in the input image shifts its response in the output feature map by the same amount. This property is useful for image processing because the same feature is detected consistently wherever it appears in the image.
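The sliding-filter computation, parameter sharing, and translation equivariance described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `conv2d_valid`, the toy images, and the hand-designed vertical-edge kernel are all assumptions made for the example. Like most deep learning code, it does not flip the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map.

    The kernel is not flipped, so strictly speaking this is cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply each kernel entry by the pixel beneath it, then sum.
            total = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x6 image: dark (0) on the left, bright (1) from column 3 onward.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
# The same image with the edge shifted one pixel to the right.
shifted = [[0, 0, 0, 0, 1, 1] for _ in range(4)]

# Hand-designed vertical-edge kernel. The same 9 weights are reused at
# every position of the image (parameter sharing).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

fmap = conv2d_valid(image, vertical_edge)
fmap_shifted = conv2d_valid(shifted, vertical_edge)

print(fmap[0])          # [0, 3, 3, 0]
print(fmap_shifted[0])  # [0, 0, 3, 3]
```

Shifting the edge one pixel to the right shifts the response one position to the right, which is the equivariance property. The layer uses 9 weights, whereas a fully connected layer mapping the same 24 input pixels to the same 8 outputs would need 24 × 8 = 192 weights.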
Pooling

A typical layer of a convolutional network consists of three stages:
1. Convolution stage: the layer performs several convolutions in parallel to produce a set of linear activations.
2. Detector stage: each linear activation is run through a nonlinear activation function, such as the ReLU activation function.
3. Pooling stage: a pooling function modifies the output of the layer further.

What is Pooling?

Pooling summarizes the outputs within a local neighborhood. A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.

Different types of Pooling

Max pooling: takes the maximum value within a rectangular neighborhood as the output.
Average pooling: computes the average value in a neighborhood.
L2 norm pooling: uses the L2 norm (square root of the sum of squared values) of the neighborhood.

[Figure: example of a max pool]

Purpose of Pooling

Pooling helps make the network's representation approximately invariant to small translations (small shifts) in the input. If the input shifts by a small amount, most of the pooled outputs do not change.

Invariance

Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the face.

Spatial Pooling and Computational Efficiency

Pooling reduces the number of units by summarizing larger areas. This means fewer outputs to process in the next layer, improving both speed and memory use. For example, pooling over regions spaced k pixels apart, rather than 1 pixel apart, gives the next layer roughly k times fewer inputs to process.
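The three pooling functions and the invariance property can be illustrated with a small sketch. The helper names (`pool2d`, `average`, `l2_norm`) and the toy feature maps are assumptions made for this example; real frameworks provide their own pooling layers.

```python
import math


def pool2d(feature_map, size, stride, summarize):
    """Replace each size x size neighborhood with one summary statistic."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(summarize(window))
        pooled.append(row)
    return pooled


def average(window):
    return sum(window) / len(window)


def l2_norm(window):
    return math.sqrt(sum(v * v for v in window))


# Hypothetical 4x4 detector-stage output.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 2],
        [2, 0, 3, 4]]

max_pooled = pool2d(fmap, size=2, stride=2, summarize=max)
avg_pooled = pool2d(fmap, size=2, stride=2, summarize=average)
l2_pooled = pool2d(fmap, size=2, stride=2, summarize=l2_norm)
print(max_pooled)  # [[4, 2], [2, 5]]
print(avg_pooled)  # [[2.5, 1.0], [0.75, 3.5]]

# Invariance to a small shift: a strong activation moved by one pixel
# inside the same pooling region gives the same max-pooled output.
a = [[9, 0, 0, 0], [0, 0, 0, 0]]
b = [[0, 9, 0, 0], [0, 0, 0, 0]]
print(pool2d(a, 2, 2, max) == pool2d(b, 2, 2, max))  # True
```

With a stride of 2 the 4×4 map becomes 2×2, so the next layer sees four times fewer values. The invariance is only approximate: a shift that moves the activation across a pooling-region boundary does change the pooled output.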
Handling Inputs of Varying Sizes

Pooling can make it easier to deal with inputs (e.g., images) of different sizes. By scaling the size of the pooling regions with the input, we can ensure that a fixed number of summary statistics is passed to the classification layer.

Convolution and Pooling as an Infinitely Strong Prior

Infinitely strong prior: a prior that forbids certain parameter values no matter how much the data supports them. Those parameter values are treated as impossible.

Convolution as an Infinitely Strong Prior

The prior assumption is that a filter (set of weights) should be shared across all positions in the input. This means the weights for detecting a feature at one location are identical to those at another location. This assumption forbids having a different set of weights at different locations.

Pooling as an Infinitely Strong Prior

Similarly, in pooling, the prior assumption is that each unit should be invariant to small translations of the input. It forbids learning a model in which the exact position of a feature is important.

Variants of the Basic Convolution Function

The "convolution" used in neural networks is not exactly the same as the standard mathematical convolution:
Neural networks use multiple kernels in parallel, so that each layer extracts several kinds of features at many spatial locations.
Inputs are often multi-dimensional (e.g., color images with RGB channels or multi-channel outputs from earlier layers).

Multi-Channel Convolution

Inputs and outputs are treated as 3D tensors: two dimensions for the spatial coordinates (rows and columns) and one for the channels (e.g., RGB or feature maps). In practice, implementations often use 4D tensors, with the extra dimension indexing the examples in a batch.

Stride

The stride determines how far the kernel "jumps" over the input at each step as it slides. A stride greater than 1 skips some positions, effectively downsampling the output.

Zero Padding

Zero-padding adds rows and columns of zeros around the input to control the output size and prevent shrinking of spatial dimensions.
Three types of padding:
Valid convolution: no padding.
Same convolution: padding ensures the output size equals the input size.
Full convolution: enough padding that the kernel visits every input pixel the same number of times, including pixels at the edges.

Valid Padding (No Padding)

No extra pixels are added to the input.
The output feature map is smaller than the input after convolution.
Output size = Input size − Kernel size + 1 (for a stride of 1).
Reduces computational load since there are fewer output values.
Used when reducing the spatial dimensions of the input is acceptable.

[Figure: example of valid convolution/padding]

Same Padding

Padding is added such that the output feature map has the same spatial dimensions as the input.
Padding size = (Kernel size − 1)/2 on each side (for a stride of 1 and an odd kernel size).
Preserves spatial dimensions, making it easier to stack layers without shrinking feature maps.
Maintains information near the edges of the input.
Used in deep networks where a consistent feature map size is required across layers.

[Figure: example of same convolution/padding]

Full Padding

Enough padding is added so that the kernel can slide over every position where it overlaps the input, even by a single pixel at the edges.
Padding size = Kernel size − 1 on each side.
Output size = Input size + Kernel size − 1 (for a stride of 1).
Allows the kernel to visit all pixels, including the ones at the edges, multiple times.
Used when preserving maximum information from the edges is critical.

[Figure: example of full convolution]

Advantages of Zero Padding

Preserves information at the edges.
Controls the feature map size.
Allows deeper networks by controlling the size of intermediate feature maps.
Ensures that the kernel can be applied at every spatial position, including edges.
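The three padding formulas above are special cases of one general rule for the output size along a single spatial dimension: floor((input + 2 × padding − kernel) / stride) + 1, where padding is the number of zeros added on each side. The helper below is a small sketch of that rule; the function name `conv_output_size` is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension.

    `padding` is the number of zeros added on EACH side of the input.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1


n, k = 7, 3  # hypothetical 7-wide input and 3-wide kernel, stride 1

valid = conv_output_size(n, k, padding=0)             # n - k + 1
same = conv_output_size(n, k, padding=(k - 1) // 2)   # equals n
full = conv_output_size(n, k, padding=k - 1)          # n + k - 1

print(valid, same, full)  # 5 7 9
```

For a square input and kernel, the same value applies to both height and width. With a stride greater than 1, the floor division accounts for kernel positions that would run past the edge of the padded input.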
Exercise: calculate the output dimensions of the convolution operation in each case:
(a) Input size: 7×7, kernel size: 3×3, stride: 1, padding: 0
(b) Input size: 6×6, kernel size: 3×3, stride: 1, padding: 1
(c) Input size: 8×8, kernel size: 5×5, stride: 2, padding: 2

Locally Connected Layers

Unlike convolutional layers, where a single set of weights (kernel) is shared across all spatial locations, locally connected layers assign unique weights to each connection between input and output. They are also called unshared convolutions: connectivity is local, as in convolution, but no parameter is reused across spatial locations.
These layers are suited to detecting features that are restricted to certain areas of the input. They are useful when a feature is expected to appear only in a specific region rather than everywhere in the input. Ex: detecting the mouth in the lower half of a face image. They allow for more flexible modeling when spatial invariance is not required.

Tiled convolution

Tiled convolution is a compromise between standard convolution and locally connected layers. Instead of learning one kernel for all locations or a separate kernel for each location, tiled convolution learns a small set of kernels and rotates through them across spatial positions. Tiling introduces variety, as in locally connected layers, but uses fewer parameters.

Three Necessary Operations for Training Convolutional Networks

1. Forward convolution: the standard convolution operation applied during forward propagation. It takes the input tensor, applies the kernel stack, and outputs the feature map.
2. Gradient w.r.t. weights (kernel gradients): computes the derivatives of the loss with respect to the kernel weights. These gradients are used to update the kernel weights during training.
3. Gradient w.r.t. inputs (input gradients): computes the derivatives of the loss with respect to the input, enabling backpropagation to earlier layers.
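The three operations can be written out explicitly for the simplest case. The sketch below assumes a 1D input, a single channel, a single kernel, stride 1, and no padding; the function names and the toy loss (the sum of the outputs) are hypothetical choices made for illustration.

```python
def forward(x, w):
    """1. Forward convolution: y[i] = sum_j x[i + j] * w[j]."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def grad_wrt_kernel(x, grad_y, k):
    """2. Kernel gradient: dL/dw[j] = sum_i grad_y[i] * x[i + j]."""
    return [sum(grad_y[i] * x[i + j] for i in range(len(grad_y))) for j in range(k)]


def grad_wrt_input(w, grad_y, n):
    """3. Input gradient: dL/dx[m] = sum of grad_y[i] * w[j] over all i + j == m."""
    grad_x = [0.0] * n
    for i, g in enumerate(grad_y):
        for j, wj in enumerate(w):
            grad_x[i + j] += g * wj
    return grad_x


x = [1.0, 2.0, 0.0, -1.0, 3.0]   # hypothetical input
w = [0.5, -1.0, 2.0]             # hypothetical kernel

y = forward(x, w)
# Assumed toy loss: L = sum(y), so dL/dy[i] = 1 for every output position.
grad_y = [1.0] * len(y)
grad_w = grad_wrt_kernel(x, grad_y, len(w))
grad_x = grad_wrt_input(w, grad_y, len(x))

print(y)       # [-1.5, -1.0, 7.0]
print(grad_w)  # [3.0, 1.0, 2.0]
print(grad_x)  # [0.5, -0.5, 1.5, 1.0, 2.0]
```

Because each kernel weight touches every output position, its gradient sums contributions from all positions. Each input element receives gradient only from the outputs whose receptive field contains it, which is why the entries at the edges of `grad_x` involve fewer kernel weights.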
Structured Outputs

Instead of outputting a single class label or a single value, a CNN can output a structured object, such as a tensor that represents predictions for each pixel of an input image. Example: instead of saying, "This image is a car," the model says, "This pixel is part of a car, and this pixel is part of the road." This is useful for tasks like semantic segmentation, where every pixel in an image gets labeled (e.g., road, car, tree).

How to Label Pixels Accurately?

Make an initial guess for each pixel (e.g., "This pixel might be a car").
Refine this guess by considering nearby pixels ("If neighboring pixels are also cars, then this pixel is likely a car"). Repeating this refinement with the same convolutional weights at every step gives a recurrent convolutional network.
Group nearby pixels with the same label into regions (e.g., one region for the car, one for the road).
Use probabilistic models to make sure predictions are consistent.

Data Types

Convolutional networks handle data with multiple channels, where each channel corresponds to a distinct quantity observed over time or space. In a grayscale image, there is one channel representing pixel intensity. In an RGB image, there are three channels for red, green, and blue pixel values.

Variable-Sized Inputs

CNNs can process inputs with varying spatial dimensions (e.g., images with different widths and heights). Traditional fully connected neural networks cannot easily handle variable input sizes because they rely on fixed-size weight matrices.

Handling Variable-Sized Inputs

The same kernel is applied regardless of input size, and the size of the output scales with the size of the input. For outputs with variable size (e.g., assigning a class label to each pixel), no extra modifications are needed. For fixed-size outputs (e.g., a single class label for the entire input), design adjustments are required, such as pooling regions whose size scales in proportion to the input.
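One way to realize proportionally scaled pooling regions is to split the input into a fixed number of bins whose width grows with the input length. The sketch below does this for a 1D sequence with max pooling. The function name `pool_to_fixed_length` and the bin-boundary rule are assumptions chosen for illustration, not the only possible scheme.

```python
def pool_to_fixed_length(values, num_bins):
    """Max-pool a 1D sequence into exactly `num_bins` outputs.

    Bin boundaries scale with len(values), so longer inputs use larger
    pooling regions. Assumes len(values) >= num_bins.
    """
    n = len(values)
    pooled = []
    for b in range(num_bins):
        start = (b * n) // num_bins                       # floor of b * n / num_bins
        end = ((b + 1) * n + num_bins - 1) // num_bins    # ceiling of (b + 1) * n / num_bins
        pooled.append(max(values[start:end]))
    return pooled


# Two hypothetical feature sequences of different lengths.
short_input = [1, 5, 2, 4]
long_input = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

print(pool_to_fixed_length(short_input, 2))  # [5, 4]
print(pool_to_fixed_length(long_input, 2))   # [5, 9]
```

Both inputs produce exactly two summary values, so a classification layer with a fixed-size weight matrix can follow regardless of the input length.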
Appropriate Use of Convolution

Suitable for inputs whose size varies because they contain varying amounts of the same kind of observation (e.g., audio recordings of different lengths or images of different spatial dimensions).
Not suitable for inputs whose size varies because they can include different kinds of observations (e.g., college applications where some features may or may not be present).

Efficient Convolution Algorithms

Modern convolutional neural networks (CNNs) are very large, often with millions of parameters. Training and using these networks can take a lot of time and computing power. To save time and resources, we need more efficient ways to perform convolution operations.
Break down complex operations (separable convolutions): instead of doing one big operation, split it into smaller, simpler steps. Ex: when a 2D kernel can be written as the outer product of a column vector and a row vector, convolve with one vector and then with the other instead of applying the full 2D kernel at once. A separable k×k kernel then needs 2k parameters instead of k², which reduces both the computational cost and the memory used.

Random or Unsupervised Features

Features are patterns or characteristics the model learns to recognize. The most expensive part of convolutional network training is learning the features. One way to reduce this cost is to skip supervised training of the features and use random or unsupervised features instead, which can still give good results.
Random initialization: set the convolution filters (kernels) to random values. Random filters often work surprisingly well at responding to patterns like edges or shapes.
Handcrafted features: design filters manually to detect specific patterns. Example: set a filter to respond to vertical or horizontal lines.
Unsupervised learning: use techniques like k-means clustering to automatically find patterns in the data without labels.

Why Use Random or Unsupervised Features

Reduces the need for full forward and backpropagation through the network.
Ideal when computational resources are limited.
Works well when labeled data is scarce.
Enables training very large networks with less computation during training.

Modern Advances

Today, datasets are larger, and computing power has increased. Fully supervised training is now the norm, as it often gives better results.
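Returning to the separable convolutions described under Efficient Convolution Algorithms, the sketch below checks that applying a column kernel and then a row kernel gives the same result as applying their 3×3 outer product directly. The helper `conv2d_valid`, the Sobel-style kernel, and the toy image are assumptions made for this illustration. The equivalence holds only for kernels that can be factored this way.

```python
def conv2d_valid(image, kernel):
    """Stride-1, no-padding 2D convolution without kernel flipping."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


column = [1, 2, 1]   # vertical smoothing (3 parameters)
row = [-1, 0, 1]     # horizontal difference (3 parameters)

# Full 3x3 kernel built as the outer product of the two vectors (9 parameters).
full_kernel = [[cv * rv for rv in row] for cv in column]

# Hypothetical 5x5 integer image.
image = [[(3 * r + 2 * c * c) % 7 for c in range(5)] for r in range(5)]

# One pass with the full 2D kernel.
direct = conv2d_valid(image, full_kernel)

# Two passes: a 3x1 column kernel, then a 1x3 row kernel.
after_column = conv2d_valid(image, [[cv] for cv in column])
two_pass = conv2d_valid(after_column, [row])

print(direct == two_pass)  # True
```

Per output value, the direct version performs 9 multiplications while the two-pass version performs 3 + 3 = 6, and the gap widens as the kernel grows (k² versus 2k).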
