DL-CNN-U02W02.pptx
Full Transcript
Deep Learning (CSC Elective)
Instructor: Dr. Mohammad Asif Khan
Slides prepared by Dr. M Asif Khan, [email protected]
Unit 02: CNN, Week 1

Contents
ImageNet competition
Training a convnet from scratch on a small dataset
Directory structuring of the dataset using Python
Data pre-processing
Data augmentation
Pre-trained models
Fine-tuning pre-trained models

ImageNet Competition
In this competition the top-5 error rate for image classification fell from over 26% to barely over 3% in just five years. The top-five error rate is the number of test images for which the system's top five predictions did not include the correct answer. The images are large (256 pixels high) and there are 1,000 classes, some of which are really subtle (try distinguishing 120 dog breeds). We first look at the classical LeNet-5 architecture (1998), then three of the winners of the challenge: AlexNet, GoogLeNet, and ResNet.

ImageNet Competition (LeNet-5)
The LeNet-5 architecture is the most widely known CNN architecture. Created by Yann LeCun (1998), it was widely used for handwritten digit recognition (MNIST). MNIST images are 28 × 28 pixels, but they are zero-padded to 32 × 32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.

ImageNet Competition (AlexNet)
Developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. To reduce overfitting, the authors used two regularization techniques. First, they applied dropout (with a 50% dropout rate) during training to the outputs of layers F8 and F9. Second, they performed data augmentation by randomly shifting the training images.

ImageNet Competition (AlexNet)
Calculation of the output size of the first convolutional layer, using the formula output = floor((W − K + 2P) / S) + 1:
1. Input size: W = 224 (width and height)
2. Kernel size: K = 11 (11 × 11)
3. Padding: P = 2 per side (as used in common implementations)
4. Stride: S = 4
Substituting into the formula: 224 − 11 + 2×2 = 217; floor(217 / 4) = 54; 54 + 1 = 55. So the output of the first convolutional layer is indeed 55 × 55, as specified.

ImageNet Competition (GoogLeNet)
This was made possible by sub-networks called inception modules, which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

ImageNet Competition (GoogLeNet)
The notation "3 × 3 + 2(S)" means that the layer uses a 3 × 3 kernel, stride 2, and SAME padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function. Note that the second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales. Also note that every single layer uses a stride of 1 and SAME padding (even the max pooling layer), so their outputs have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension.

ImageNet Competition (GoogLeNet)
The Local Response Normalization (LRN) layer in architectures like GoogLeNet serves to enhance the generalization ability of the model by normalizing the responses of neurons across feature maps.
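A minimal sketch of an inception-style module in the Keras functional API is shown below; the six filter counts passed to the block correspond to the six numbers annotated on the inception-module diagram discussed next. The specific filter values and input shape are illustrative assumptions, not GoogLeNet's exact configuration.

```python
# Minimal sketch of an inception-style block (filter counts and input shape are illustrative).
from tensorflow import keras
from tensorflow.keras import layers

def inception_block(x, f1, f3_reduce, f3, f5_reduce, f5, f_pool):
    # 1x1 convolution branch
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # 1x1 reduction followed by a 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    # 1x1 reduction followed by a 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    # 3x3 max pooling followed by a 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(f_pool, 1, padding="same", activation="relu")(b4)
    # Every branch keeps the spatial size, so outputs can be concatenated depth-wise
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = keras.Input(shape=(28, 28, 192))            # illustrative input shape
outputs = inception_block(inputs, 64, 96, 128, 16, 32, 32)  # the six filter counts
model = keras.Model(inputs, outputs)
```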
The six numbers shown on the inception module represent the number of filters applied by each convolutional layer to its input.

ImageNet Competition (ResNet)
The winner of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2015 challenge was the Residual Network (ResNet). It used an extremely deep CNN composed of 152 layers. The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. Let's see why this is useful. When training a neural network, the goal is to make it model a target function h(x). If you add the input x to the output of the network (i.e., you add a skip connection), the network is forced to model f(x) = h(x) − x rather than h(x); this is called residual learning.

Training a convnet from scratch on a small dataset
When training an image-classification model, you will often find yourself working with only a small dataset. Here, a "few" samples means anything from a few hundred to a few tens of thousands of images. As a practical example, we'll focus on classifying images as dogs or cats, using a dataset containing 4,000 pictures of cats and dogs (2,000 cats, 2,000 dogs). We'll use 2,000 pictures for training, 1,000 for validation, and 1,000 for testing. We'll start by training a new model from scratch using what little data we have: a small convnet naively trained on the 2,000 training samples, to set a baseline.

Training a convnet from scratch on a small dataset
Then we'll introduce data augmentation, a powerful technique for mitigating overfitting in computer vision. Using data augmentation, we'll improve the network to reach an accuracy of 82%. In the next section, we'll review two more essential techniques for applying deep learning to small datasets: feature extraction with a pretrained network (which will get you to an accuracy of 90% to 96%), and then fine-tuning a pretrained network (which will get you to a final accuracy of 97%).

Downloading the data
The Dogs vs. Cats dataset that you'll use isn't packaged with Keras. It was made available by Kaggle as part of a computer-vision competition in late 2013. You can download the original dataset from www.kaggle.com/c/dogs-vs-cats/data (you'll need to create a Kaggle account if you don't already have one; the process is painless). This dataset contains 25,000 images of dogs and cats (12,500 from each class) and is 543 MB (compressed). After downloading and uncompressing it, you'll create a new dataset containing three subsets: a training set with 1,000 samples of each class, a validation set with 500 samples of each class, and a test set with 500 samples of each class.

Copying images to training, validation, and test directories

Building your network

Data preprocessing
Currently, the data sits on a drive as JPEG files, so the steps for getting it into the network are roughly as follows (a minimal sketch follows this list):
1. Read the picture files.
2. Decode the JPEG content into RGB grids of pixels.
3. Convert these into floating-point tensors.
4. Rescale the pixel values (between 0 and 255) to the [0, 1] interval (as you know, neural networks prefer to deal with small input values).
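The sketch below walks through these four steps with low-level TensorFlow ops; in practice the Keras utilities described next handle all of this for you. The 180 × 180 target size is an illustrative assumption.

```python
# Minimal sketch of steps 1-4 using low-level TensorFlow ops.
import tensorflow as tf

def load_image(path, target_size=(180, 180)):
    raw = tf.io.read_file(path)               # 1. read the picture file
    img = tf.io.decode_jpeg(raw, channels=3)  # 2. decode the JPEG into an RGB grid of pixels
    img = tf.cast(img, tf.float32)            # 3. convert it to a floating-point tensor
    img = tf.image.resize(img, target_size)   #    (resize so all images share one shape)
    return img / 255.0                        # 4. rescale pixel values from [0, 255] to [0, 1]
```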
Keras has a module with image-processing helper tools, located at keras.preprocessing.image. In particular, it contains the class ImageDataGenerator, which lets you quickly set up Python generators that can automatically turn image files on disk into batches of preprocessed tensors.

Data preprocessing
We get a test accuracy of 69.5%. (Due to the randomness of neural network initialization, you may get numbers within one percentage point of that.)

First milestone
The training accuracy increases linearly over time, until it reaches nearly 100%, whereas the validation accuracy peaks at 75%. The validation loss reaches its minimum after only ten epochs and then stalls, whereas the training loss keeps decreasing linearly as training proceeds. Accuracy of 69.5%!

First milestone
These plots are characteristic of overfitting. The training accuracy increases linearly over time, until it reaches nearly 100%, whereas the validation accuracy stalls at 69–72%. The validation loss reaches its minimum after only five epochs and then stalls, whereas the training loss keeps decreasing linearly until it reaches nearly 0. Because you have relatively few training samples (2,000), overfitting will be your number-one concern. You already know about a number of techniques that can help mitigate overfitting, such as dropout and weight decay (L2 regularization). We're now going to work with a new one, specific to computer vision and used almost universally when processing images with deep-learning models: data augmentation.

Using data augmentation
Data augmentation takes the approach of generating more training data from existing training samples, by augmenting (increasing) the samples via a number of random transformations that yield believable-looking images. The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data so that it generalizes better. To further fight overfitting, you'll also add a Dropout layer to your model, right before the densely connected classifier. In Keras, this can be done by adding a number of data augmentation layers at the start of your model.

Using data augmentation (2nd milestone)
You now reach an accuracy of 82%, a 15% relative improvement over the non-regularized model.

Leveraging a pretrained convnet
A common and highly effective approach to deep learning on small image datasets is to use a pretrained network. A pretrained network is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. If this original dataset is large enough and general enough, the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task.
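In Keras, such a pretrained convolutional base can be loaded in one line from keras.applications. The sketch below anticipates the VGG16 example that follows; the 180 × 180 input size is an illustrative assumption.

```python
# Minimal sketch: load a convolutional base pretrained on ImageNet.
# include_top=False drops the original ImageNet classifier on top.
from tensorflow import keras

conv_base = keras.applications.vgg16.VGG16(
    weights="imagenet",
    include_top=False,
    input_shape=(180, 180, 3),  # assumed input size for this example
)
conv_base.summary()             # inspect the stack of convolution/pooling blocks
```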
Using a pretrained convnet
For instance, you might train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images. In this case, let's consider a large convnet trained on the ImageNet dataset (1.4 million labeled images and 1,000 different classes). ImageNet contains many animal classes, including different species of cats and dogs, so you can expect it to perform well on the dogs-versus-cats classification problem. You'll use the VGG16 architecture.

Using a pretrained convnet (Feature extraction)
Convnets used for image classification comprise two parts: they start with a series of convolution and pooling layers, and they end with a densely connected classifier. The first part is called the convolutional base of the model. In the case of convnets, feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of its output.

Using a pretrained convnet (Feature extraction)
The VGG16 model, among others, comes prepackaged with Keras: you can import it from the keras.applications module. Many other image-classification models (all pretrained on the ImageNet dataset) are available as part of keras.applications: Xception, ResNet, MobileNet, EfficientNet, DenseNet, etc.

Using a pretrained convnet (Feature extraction)
FAST FEATURE EXTRACTION WITHOUT DATA AUGMENTATION

Using a pretrained convnet (Feature extraction)
You reach a validation accuracy of about 90%, much better than you achieved in the previous section with the small model trained from scratch. But the plots also indicate that you're overfitting almost from the start, despite using dropout with a fairly large rate. That's because this technique doesn't use data augmentation, which is essential for preventing overfitting with small image datasets.

Feature Extraction with Data Augmentation
Now let's review the second technique for doing feature extraction, which is much slower and more expensive, but which allows us to use data augmentation during training: creating a model that chains the conv_base with a new dense classifier, and training it end to end on the inputs. In order to do this, we will first freeze the convolutional base. Freezing a layer or set of layers means preventing their weights from being updated during training. In Keras, we freeze a layer or model by setting its trainable attribute to False.
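A minimal sketch of this second approach, which the next slides build step by step: freeze the base, then chain a data-augmentation stage, the frozen base, and a new dense classifier. The augmentation parameters, the 180 × 180 input size, and the classifier width are illustrative assumptions, not the exact values from the course notebooks.

```python
# Minimal sketch: data augmentation -> frozen VGG16 base -> new dense classifier.
from tensorflow import keras
from tensorflow.keras import layers

conv_base = keras.applications.vgg16.VGG16(weights="imagenet", include_top=False)
conv_base.trainable = False                      # freeze the pretrained weights

data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),             # random horizontal flips
    layers.RandomRotation(0.1),                  # small random rotations
    layers.RandomZoom(0.2),                      # random zooms
])

inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)                    # augmentation is only active during training
x = keras.applications.vgg16.preprocess_input(x) # VGG16's expected input preprocessing
x = conv_base(x)                                 # run through the frozen convolutional base
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)                       # dropout right before the classifier
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary dogs-vs-cats output
model = keras.Model(inputs, outputs)
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
```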
Feature Extraction with Data Augmentation
Now we can create a new model that chains together:
A data augmentation stage
Our frozen convolutional base
A dense classifier

Feature Extraction with Data Augmentation (3rd milestone)
We get a test accuracy of 97.5%. This is only a modest improvement compared to the previous test accuracy, which is a bit disappointing given the strong results on the validation data. A model's accuracy always depends on the set of samples you evaluate it on! Some sample sets may be more difficult than others, and strong results on one set won't necessarily fully translate to all other sets.

Fine-tuning a pretrained model
Another widely used technique for model reuse, complementary to feature extraction, is fine-tuning (see figure 8.15). Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in this case, the fully connected classifier) and these top layers. This is called fine-tuning because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.

Fine-tuning a pretrained model
The steps for fine-tuning a network are as follows:
1. Add our custom network on top of an already-trained base network.
2. Freeze the base network.
3. Train the part we added.
4. Unfreeze some layers in the base network. (Note that you should not unfreeze "batch normalization" layers; this is not relevant here, since there are no such layers in VGG16. Batch normalization and its impact on fine-tuning is explained in the next chapter.)
5. Jointly train these layers and the part we added.
You already completed the first three steps when doing feature extraction.
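As a preview of steps 4 and 5, here is a minimal Keras sketch, assuming the conv_base and model built in the earlier sketches; the specific learning rate is an illustrative assumption. The details are spelled out next.

```python
# Minimal sketch of steps 4 and 5: unfreeze the top of the VGG16 base and retrain.
# Assumes conv_base and model were built as in the earlier sketches.
from tensorflow import keras

conv_base.trainable = True
for layer in conv_base.layers:
    # Keep everything up to block4_pool frozen; only the block5 conv layers are trainable.
    layer.trainable = layer.name in ("block5_conv1", "block5_conv2", "block5_conv3")

# Recompile with a very low learning rate (an illustrative value) so the pretrained
# representations are only slightly adjusted, then continue training (step 5).
model.compile(
    loss="binary_crossentropy",
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
    metrics=["accuracy"],
)
```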
Let's proceed with step 4: we'll unfreeze our conv_base and then freeze individual layers inside it. We'll fine-tune the last three convolutional layers, which means all layers up to block4_pool should be frozen, and the layers block5_conv1, block5_conv2, and block5_conv3 should be trainable.

Fine-tuning a pretrained model
Why not fine-tune more layers? Why not fine-tune the entire convolutional base? You could, but you need to consider the following:
Earlier layers in the convolutional base encode more generic, reusable features, whereas layers higher up encode more specialized features. It's more useful to fine-tune the more specialized features, because these are the ones that need to be repurposed on your new problem; there would be fast-decreasing returns in fine-tuning lower layers.
The more parameters you're training, the more you're at risk of overfitting. The convolutional base has 15 million parameters, so it would be risky to attempt to train it on your small dataset.
Thus, in this situation, it's a good strategy to fine-tune only the top two or three layers in the convolutional base. Let's set this up, starting from where we left off in the previous example.

Fine-tuning a pretrained model (final accuracy)
Here, we get a test accuracy of 98.5%. In the original Kaggle competition around this dataset, this would have been one of the top results. It's not quite a fair comparison, however, since we used pretrained features that already contained prior knowledge about cats and dogs, which competitors couldn't use at the time. On the positive side, by leveraging modern deep-learning techniques, we managed to reach this result using only a small fraction of the training data that was available for the competition (about 10%). There is a huge difference between being able to train on 20,000 samples compared to 2,000 samples! You now have a solid set of tools for dealing with image-classification problems, in particular with small datasets.

More CNN Example codes
Run and understand CIFAR classification using a CNN.
Run and understand Fashion-MNIST classification using a CNN.
(A minimal starting point for these exercises is sketched after the summary.)

Summary
A step-by-step approach to training classification models with small datasets; most of these techniques can also be used for large datasets.
Data augmentation is used to overcome overfitting with small datasets.
Regularization can be useful for both small and large datasets.
Pre-trained models play a vital role in deep learning, and you can fine-tune pre-trained models for your own problem.
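For the CIFAR and Fashion-MNIST exercises listed under "More CNN Example codes", both datasets ship with keras.datasets. The sketch below is an illustrative baseline to get started, not the course's reference notebooks; Fashion-MNIST works the same way via keras.datasets.fashion_mnist (28 × 28 × 1 inputs, 10 classes).

```python
# Minimal starting point for the CIFAR-10 exercise (illustrative baseline).
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0    # rescale pixels to [0, 1]

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),           # 10 CIFAR-10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_split=0.2)
print(model.evaluate(x_test, y_test))
```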