Convolutional Neural Networks Handout PDF
Document Details
Trinity College Dublin
2024
François Pitié
Summary
This handout from Trinity College Dublin provides an overview of convolutional neural networks (CNNs). It covers the mathematical foundations, including convolution filters and their applications in image processing. The document is suitable for undergraduate computer science students learning about deep learning.
Full Transcript
06 - Convolutional Neural Networks
François Pitié, Assistant Professor in Media Signal Processing
Department of Electronic & Electrical Engineering, Trinity College Dublin
[4C16/5C16] Deep Learning and its Applications — 2024/2025

Convolutional Neural Networks, or convnets, are a type of neural net especially used for processing image data. They are inspired by the organisation of the visual cortex and mathematically based on a well-understood signal processing tool: image filtering by convolution. Convnets gained popularity with LeNet-5, a pioneering 7-level convolutional network by LeCun et al. (1998) that was successfully applied to the MNIST dataset.

Convolution Filters

Recall that in dense layers, every unit in the layer is connected to every unit in the adjacent layers.

[Figure: a fully connected network with an input layer (x1, ..., x4), two hidden layers (units u1,1, ..., u1,5 and u2,1, ..., u2,4) and an output y1.]

When the input is an image (as in the MNIST dataset), each pixel in the input image corresponds to a unit in the input layer. For an input image of dimension width by height pixels and 3 colour channels, the input layer will contain width × height × 3 input units. If the next layer is of the same size, then we have up to (width × height × 3)² weights to train, which can become very large very quickly.

In a fully connected approach, we don't take advantage of the spatial structure of an image. For instance, we know that pixel values are usually more related to their neighbours than to far-away locations. We need to take advantage of this, and this is what is done in convolutional neural networks.

[Figure: the input layer is made of a grid of the image pixels. In a fully connected layer architecture, pixels in the next layer are connected to all of the pixels in the input layer.]

[Figure: in convolutional layers, the units in the next layer are only connected to their neighbours in the input layer. In this case the neighbourhood is defined as a 5 × 5 window.]

Moreover, the weights are shared across all the pixels. So, in convnets, the weights are associated with the relative positions of the neighbours and shared across all pixel locations. Let us see how they are defined.

Denote the units of a layer as u_{i,j,k,n}, where n refers to the layer, i, j to the coordinates of the pixel and k to the channel under consideration. The logit for that neuron is defined as the result of a convolution filter:

logit_{i,j,k,n} = w_{0,k,n} + ∑_{a=−h1}^{h1} ∑_{b=−h2}^{h2} ∑_{c=1}^{h3} w_{a,b,c,k,n} · u_{a+i, b+j, c, n−1}

where h1 and h2 correspond to half of the dimensions of the neighbourhood window and h3 is the number of channels of the input image for that layer. After activation f, the output of the neuron is simply:

u_{i,j,k,n} = f(logit_{i,j,k,n})
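To make the formula concrete, here is a minimal numpy sketch (our own illustration, not code from the handout) that computes the logits of a single output channel, treating out-of-image neighbours as zero:

import numpy as np

def conv_logits(u_prev, w, w0):
    # u_prev: previous layer, shape (H, W, C)         -- the u_{i,j,c,n-1}
    # w:      kernel, shape (2*h1 + 1, 2*h2 + 1, C)   -- the w_{a,b,c,k,n}
    # w0:     scalar bias                             -- the w_{0,k,n}
    H, W, C = u_prev.shape
    kh, kw, _ = w.shape
    h1, h2 = kh // 2, kw // 2
    padded = np.pad(u_prev, ((h1, h1), (h2, h2), (0, 0)))  # zeros outside the image
    logits = np.full((H, W), w0, dtype=float)
    for i in range(H):
        for j in range(W):
            # weighted sum over the neighbourhood window centred on (i, j)
            logits[i, j] += np.sum(w * padded[i:i + kh, j:j + kw, :])
    return logits

# toy check on a 5 x 5 grayscale image with a 3 x 3 kernel
u = np.random.rand(5, 5, 1)
print(conv_logits(u, np.random.rand(3, 3, 1), w0=0.1).shape)  # (5, 5)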
Example

Consider the case of a grayscale image (1 channel) where the convolution is defined as:

logit_{i,j,n} = u_{i+1,j,n−1} + u_{i−1,j,n−1} + u_{i,j+1,n−1} + u_{i,j−1,n−1} − 4 u_{i,j,n−1}

The weights can be arranged as a weight mask (also called a kernel):

w:   0  1  0
     1 -4  1
     0  1  0

[Figure: step-by-step walkthrough of this 3 × 3 kernel sliding over a grid of pixel values in the previous layer, producing the logits after convolution one pixel at a time.]

Padding

At the picture boundaries, not all neighbours are defined (see figure on the next slide). In Keras, two padding strategies are possible:
padding='same' means that the values outside of the image domain are extrapolated as zero.
padding='valid' means that we don't compute the pixels that need neighbours outside of the image domain. This means that the picture is slightly cropped.
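The two strategies are easy to compare by checking output shapes (a minimal sketch, assuming TensorFlow's Keras with randomly initialised weights):

import numpy as np
from tensorflow.keras.layers import Conv2D

img = np.random.rand(1, 16, 16, 1).astype('float32')  # batch of one 16 x 16 grayscale image
same = Conv2D(1, [3, 3], padding='same')(img)          # zeros assumed outside the image
valid = Conv2D(1, [3, 3], padding='valid')(img)        # only fully defined pixels are kept
print(same.shape)   # (1, 16, 16, 1) -- size preserved
print(valid.shape)  # (1, 14, 14, 1) -- cropped by one pixel on each side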
[Figure: a 3 × 3 convolution at the image boundary. Pixels outside the image domain are marked with '?'; boundary pixels require out-of-domain neighbours.]

Each convolutional layer defines a number of convolution filters, and the output of a layer is thus a new image, where each channel is the result of a convolution filter followed by activation.

Example

On the next slide is a colour picture of size 443 × 592 × 3 (width=443, height=592, number of channels=3). The convolutional layer used has a kernel of size 5 × 5 and produces 6 different filters. The padding strategy is set to 'valid', thus we lose 2 pixels on each side. The output of the convolutional layer is a picture of size 439 × 588 × 6. In Keras, this would be defined as follows:

x = Input(shape=(443, 592, 3))
x = Conv2D(6, [5, 5], activation='relu', padding='valid')(x)

This convolution layer is defined by 3 × 6 × 5 × 5 = 450 weights. This is only a fraction of what would be required in a dense layer.

[Figure: the input layer split into red, green and blue channels, and the six output filtered images.]

Reducing the tensor size

If convolution filters offer a way of reducing the number of weights in the network, the number of units still remains high. For instance, applying Conv2D(16, (5, 5)) to an input image of size 2000 × 2000 × 3 only requires 5 × 5 × 3 × 16 = 1200 weights to train, but still produces 2000 × 2000 × 16 = 64 million units. In this section, we'll see how stride and pooling can be used to downsample the images and thus reduce the number of units.

Stride

In image processing, the stride is the distance that separates each processed pixel. A stride of 1 means that all pixels are processed and kept. A stride of 2 means that only every second pixel in both the x and y directions is kept.

x = Input(shape=(16, 16, 1))
x = Conv2D(1, [3, 3], padding='valid', strides=2)(x)

Example for a stride of 2. The previous layer is 16 × 16 and the convolution mask is 3 × 3. The convolution is only computed for one in four pixels, and the output is of size 7 × 7.

x = Input(shape=(16, 16, 1))
x = Conv2D(1, [3, 3], padding='valid', strides=3)(x)

Example for a stride of 3. Only 1 in 9 pixels is kept.

x = Input(shape=(16, 16, 1))
x = Conv2D(1, [3, 3], padding='valid', strides=4)(x)

Example for a stride of 4. Only 1 in 16 pixels is kept.

Max Pooling

Whereas stride is set on the convolution layer itself, pooling is a separate node that is appended after the conv layer. The pooling layer operates a sub-sampling of the picture. Different sub-sampling strategies are possible: average pooling, max pooling, stochastic pooling.

MaxPooling2D(pool_size=(2, 2))

[Figure: 2 × 2 max pooling on a grid of pixel values; the maximum of each 2 × 2 block is kept.]

Example

The following Keras code:

x = Input(shape=(32, 32, 3))
x = Conv2D(16, [5, 5], activation='relu', padding='same', strides=1)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)

transforms the original image of size 32 × 32 × 3 into a new image of size 32 × 32 × 16, where each of the 16 output channels is obtained through its own 5 × 5 × 3 convolution filter. Max pooling then reduces the image size to 16 × 16 × 16.
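The shapes in this example can be checked directly (a minimal sketch, assuming TensorFlow's Keras; the stride-2 variant at the end is our own addition for comparison):

import numpy as np
from tensorflow.keras.layers import Conv2D, MaxPooling2D

img = np.random.rand(1, 32, 32, 3).astype('float32')
h = Conv2D(16, [5, 5], activation='relu', padding='same', strides=1)(img)
print(h.shape)   # (1, 32, 32, 16) -- same spatial size, 16 channels
h = MaxPooling2D(pool_size=(2, 2))(h)
print(h.shape)   # (1, 16, 16, 16) -- max pooling halves the spatial size
h2 = Conv2D(16, [5, 5], padding='same', strides=2)(img)
print(h2.shape)  # (1, 16, 16, 16) -- a stride of 2 downsamples in a single step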
Increasing the tensor size

Transposed Convolution

Similarly, we can increase the horizontal and vertical dimensions of a tensor using an upsampling operation. This step is sometimes called up-convolution, deconvolution or transposed convolution. It is equivalent to first upsampling the tensor by inserting zeros in between the input samples and then applying a convolution layer. More on this is discussed in the link below.
https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d

In Keras:

import numpy as np
from tensorflow.keras.layers import Conv2DTranspose

x = np.random.rand(4, 10, 8, 128)
nfilters = 32; kernel_size = (3, 3); stride = (2, 2)
y = Conv2DTranspose(nfilters, kernel_size, stride)(x)
print(y.shape)  # (4, 21, 17, 32)
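To make the zero-insertion view concrete, here is a minimal sketch (our own construction, not the Keras internals): stuffing zeros between the samples of a 4 × 4 tensor and applying an ordinary convolution yields the same output shape as a stride-2 transposed convolution with 'same' padding.

import numpy as np
from tensorflow.keras.layers import Conv2D

x = np.random.rand(1, 4, 4, 1).astype('float32')
up = np.zeros((1, 8, 8, 1), dtype='float32')
up[:, ::2, ::2, :] = x                                     # insert zeros in between the input samples
y = Conv2D(1, (3, 3), padding='same', use_bias=False)(up)  # then an ordinary convolution
print(y.shape)  # (1, 8, 8, 1) -- the shape Conv2DTranspose(1, (3, 3), strides=(2, 2), padding='same') gives on x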
Just note that "deconvolution" is an unfortunate term for this step, as deconvolution is already used in signal processing to refer to estimating the input signal/tensor from the output signal (e.g. trying to recover the original image from a blurred image).

Architecture Design

A typical convnet architecture for classification is based on interleaving convolution layers with pooling layers. Conv layers usually have a small kernel size (e.g. 5 × 5 or 3 × 3). As you go deeper, the picture becomes smaller in resolution but also contains more channels. At some point the picture is so small (e.g. 7 × 7) that it doesn't make sense to call it a picture any more. You can then connect it to fully connected layers and terminate with a last softmax layer for classification.

The idea is that we start from a few low-level features (e.g. image edges) and, as we go deeper, we build more and more features that are increasingly complex. The following slides show some of the landmark networks.

LeNet-5 (LeCun, 1998). The network pioneered the use of convolutional layers in neural nets.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.

AlexNet (Krizhevsky et al., 2012). This is the winning entry of the ILSVRC-2012 competition for object recognition, and the network that started the deep learning revolution.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks.

VGG (Simonyan and Zisserman, 2014). This is a popular 16-layer network used by the VGG team in the ILSVRC-2014 competition for object recognition.
Simonyan, K., and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition.

Keras code for the VGG16 network:

# Block 1
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(img_input)
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x)
# Block 2
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x)
# Block 3
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv1')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv2')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(x)
# Block 4
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(x)
# Block 5
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(x)
# Classification block
x = Flatten(name='flatten')(x)
x = Dense(4096, activation='relu', name='fc1')(x)
x = Dense(4096, activation='relu', name='fc2')(x)
x = Dense(classes, activation='softmax', name='predictions')(x)

[Figure: original input image (tensor size 224 × 224 × 3; the image has been resized to match that dimension), followed by a few output images from the 64 filters of block1_conv2 (224 × 224 × 64), the 128 filters of block2_conv2 (112 × 112 × 128), the 256 filters of block3_conv3 (56 × 56 × 256), the 512 filters of block4_conv3 (28 × 28 × 512), and the 512 filters of block5_conv3 (14 × 14 × 512).]

Visualisation

Understanding each of the inner operations of a trained network is still an open problem. Thankfully, convolutional neural nets focus on images, and a few visualisation techniques have been proposed recently. One technique (see link below) is to find an input image that maximises the activation of a specific filter. Recall that the outputs of ReLU and sigmoid are never negative and that a positive activation means that the filter has detected something. Thus, finding the image that maximises the response from a filter will give us a good indication about the nature of that filter.
See https://blog.keras.io/category/demo.html

The optimisation proceeds as follows:
1. Define the loss function as the mean value of the activation for that filter.
2. Use backpropagation to compute the gradient of the loss function w.r.t. the input image.
3. Update the input image using a gradient ascent approach, so as to maximise the loss function. Go to step 2.

A few examples of optimised input images for VGG16 are presented in the next slides.
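As an illustration, here is a minimal gradient-ascent sketch using tf.GradientTape and the pretrained VGG16 from keras.applications. The choice of layer (block3_conv3), filter index, step size and iteration count are our own assumptions, and the proper VGG input preprocessing is skipped for brevity:

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# model that outputs the feature maps of one conv layer
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
extractor = Model(inputs=base.input, outputs=base.get_layer('block3_conv3').output)

filter_index = 0                  # which filter of the layer to visualise
img = tf.Variable(np.random.uniform(0.4, 0.6, (1, 224, 224, 3)).astype('float32'))

for step in range(30):
    with tf.GradientTape() as tape:
        # 1. loss = mean activation of the chosen filter
        loss = tf.reduce_mean(extractor(img)[..., filter_index])
    # 2. gradient of the loss w.r.t. the input image
    grad = tape.gradient(loss, img)
    grad = grad / (tf.sqrt(tf.reduce_mean(tf.square(grad))) + 1e-8)  # normalise the step
    # 3. gradient ascent: push the image towards a higher activation
    img.assign_add(0.1 * grad)

print(float(loss))  # the mean activation should have increased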
Take Away

Convolutional Neural Nets offer a very effective simplification over dense nets when dealing with images. By interleaving pooling and convolutional layers, we can reduce both the number of weights and the number of units. The successes of convnet applications (e.g. image classification) were key to starting the deep learning/AI revolution. The mathematics behind convolutional filters were nothing new and had long been understood; what convnets have brought is a framework to systematically train optimal filters and combine them to produce powerful high-level visual features.

Useful Resources

Deep Learning (MIT Press) by Ian Goodfellow et al., chapter 9: https://www.deeplearningbook.org
Brandon Rohrer's YouTube channel: https://youtu.be/ILsA4nyG7I0
Stanford CS class CS231n: http://cs231n.github.io
Michael Nielsen's webpage: http://neuralnetworksanddeeplearning.com/