Introduction to Convolutional Neural Networks

Summary

This document explores Convolutional Neural Networks (CNNs), covering their history, architecture, and applications. It explains how computers interpret images, walks through building and training CNN models in Keras and TensorFlow, and details image data augmentation and other techniques for optimizing model performance.

Full Transcript

Chapter 2
=========

**Introduction to Convolutional Neural Networks**
--------------------------------------------------

Convolutional Neural Networks (CNNs) are everywhere. In the last five years, we have seen a dramatic rise in the performance of visual recognition systems due to the introduction of deep architectures for feature learning and classification. CNNs have achieved good performance in a variety of areas, such as automatic speech understanding, computer vision, language translation, self-driving cars, and game playing, as with AlphaGo. Thus, the applications of CNNs are almost limitless. DeepMind (from Google) recently published WaveNet, which uses a CNN to generate speech that mimics any human voice (https://deepmind.com/blog/wavenet-generative-model-raw-audio/).

In this chapter, we will cover the following topics:

- History of CNNs
- Overview of a CNN
- Image augmentation

**History of CNNs**
-------------------

There have been numerous attempts to recognize pictures by machines for decades. It is a challenge to mimic the visual recognition system of the human brain in a computer; human vision is the most complex sensory cognitive system of the brain and the hardest to mimic. We will not discuss biological neurons (that is, the primary visual cortex) here, but rather focus on artificial neurons. Objects in the physical world are three dimensional, whereas pictures of those objects are two dimensional. In this book, we introduce neural networks without appealing to brain analogies.

In 1963, computer scientist Larry Roberts, also known as the **father of computer vision**, described the possibility of extracting 3D geometrical information from 2D perspective views of blocks in his research dissertation, **BLOCK WORLD**. This was the first breakthrough in the world of computer vision, and many researchers in machine learning and artificial intelligence followed this work and studied computer vision in the context of BLOCK WORLD. Human beings can recognize blocks regardless of any orientation or lighting changes that may happen. In his dissertation, Roberts said that it is important to understand simple edge-like shapes in images, and he extracted these edge-like shapes from blocks in order to make the computer understand that two blocks are the same irrespective of orientation. Vision starts with such simple structures; this was the beginning of computer vision as an engineering discipline.

David Marr, an MIT computer vision scientist, gave us the next important concept: that vision is hierarchical. He wrote a very influential book named *VISION*, in which he proposed that an image consists of several layers of representation. These two principles, simple edge-like structures and hierarchical representation, form the basis of deep learning architectures, although they do not tell us what kind of mathematical model to use.

In the 1970s, the first visual recognition algorithm, known as the **generalized cylinder model**, came from the AI lab at Stanford University. The idea here is that the world is composed of simple shapes and any real-world object is a combination of these simple shapes. At the same time, another model, known as the **pictorial structure model**, was published from SRI Inc. The concept is similar to the generalized cylinder model, but here the parts are connected by springs; thus, it introduced a concept of variability. A visual recognition algorithm first shipped in a consumer product in 2006, when Fujifilm introduced face detection in its digital cameras.
**Convolutional neural networks**
---------------------------------

CNNs, or ConvNets, are quite similar to regular neural networks. They are still made up of neurons with weights that can be learned from data. Each neuron receives some inputs and performs a dot product. They still have a loss function on the last fully connected layer, and they can still use a nonlinearity function. All of the tips and techniques that we learned in the last chapter are still valid for CNNs.

As we saw in the previous chapter, a regular neural network receives input data as a single vector and passes it through a series of hidden layers. Every hidden layer consists of a set of neurons, wherein every neuron is fully connected to all the neurons in the previous layer. Within a single layer, each neuron is completely independent; neurons do not share any connections. The last fully connected layer, also called the **output layer**, contains class scores in the case of an image classification problem. Generally, there are three main layers in a simple ConvNet: the **convolution layer**, the **pooling layer**, and the **fully connected layer**.

*Figure: A regular three-layer neural network*

So, what changes? Since a CNN mostly takes images as input, this allows us to encode a few properties into the network, thus reducing the number of parameters. In the case of real-world image data, CNNs perform better than **Multi-Layer Perceptrons** (**MLPs**). There are two reasons for this:

- In the last chapter, we saw that in order to feed an image to an MLP, we convert the input matrix into a simple numeric vector with no spatial structure; the MLP has no knowledge that these numbers are spatially arranged. CNNs are built for this very reason: to elucidate the patterns in multidimensional data. Unlike MLPs, CNNs understand that image pixels closer to each other are more strongly related than pixels that are further apart.
- CNNs differ from MLPs in the types of hidden layers that can be included in the model. A ConvNet arranges its neurons in three dimensions: **width**, **height**, and **depth**. Each layer transforms its 3D input volume into a 3D output volume of neurons using activation functions. For example, in the following figure, the red input layer holds the image; its width and height are the dimensions of the image, and the depth is three, since there are Red, Green, and Blue channels.

![A ConvNet arranges its neurons in three dimensions: width, height, and depth](media/image2.png)

*ConvNets are deep neural networks that share their parameters across space.*

**How do computers interpret images?**
--------------------------------------

Essentially, every image can be represented as a matrix of pixel values. In other words, a grayscale image can be thought of as a function *f* that maps from *R^2^* to *R*: *f(x, y)* gives the intensity value at the position *(x, y)*. In practice, the value of the function ranges only from *0* to *255*. Similarly, a color image can be represented as a stack of three such functions, written as a vector:

*f(x, y) = [r(x, y), g(x, y), b(x, y)]*

Or we can write this as a mapping:

*f: R × R → R^3^*

So, a color image is also a function, but in this case, the value at each *(x, y)* position is not a single number. Instead, it is a vector of three light intensities, one for each color channel.
The following code shows the details of an image as it appears to a computer.

**Code for visualizing an image**
---------------------------------

Let's take a look at how an image can be visualized with the following code:

    # import all required libraries
    import matplotlib.pyplot as plt
    %matplotlib inline
    import numpy as np
    from skimage.io import imread
    from skimage.transform import resize

    # load the image in grayscale
    image = imread('sample_digit.png', as_grey=True)
    image = resize(image, (28, 28), mode='reflect')
    print('This image is: ', type(image),
          'with dimensions:', image.shape)

We obtain the resulting image, and the following helper annotates every pixel with its value:

    def visualize_input(img, ax):
        ax.imshow(img, cmap='gray')
        width, height = img.shape
        thresh = img.max() / 2.5
        for x in range(width):
            for y in range(height):
                ax.annotate(str(round(img[x][y], 2)), xy=(y, x),
                            horizontalalignment='center',
                            verticalalignment='center',
                            color='white' if img[x][y] < thresh else 'black')

    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(111)
    visualize_input(image, ax)

Next, we study image classification with a practical example on the CIFAR-10 dataset, loaded here with the standard Keras loader. We rescale the images so that every pixel value lies in [0, 1]:

    import keras
    from keras.datasets import cifar10

    # load the CIFAR-10 train and test data
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    # rescale pixel values to [0, 1]
    x_train = x_train.astype('float32')/255
    x_test = x_test.astype('float32')/255

    # one-hot encode the labels
    num_classes = len(np.unique(y_train))
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)

    # break training set into training and validation sets
    (x_train, x_valid) = x_train[5000:], x_train[:5000]
    (y_train, y_valid) = y_train[5000:], y_train[:5000]

    # print shape of training set
    print('x_train shape:', x_train.shape)

    # print number of training, validation, and test images
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')
    print(x_valid.shape[0], 'validation samples')

We then define the model architecture, a stack of convolution and max-pooling layers followed by dense layers, and train it:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential()
    model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu',
                     input_shape=(32, 32, 3)))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=2))
    model.add(Dropout(0.3))
    model.add(Flatten())
    model.add(Dense(500, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(10, activation='softmax'))

    model.summary()

    # compile the model
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])

    from keras.callbacks import ModelCheckpoint

    # train the model, saving only the weights with the best validation loss
    checkpointer = ModelCheckpoint(filepath='model.weights.best.hdf5', verbose=1,
                                   save_best_only=True)
    hist = model.fit(x_train, y_train, batch_size=32, epochs=100,
                     validation_data=(x_valid, y_valid), callbacks=[checkpointer],
                     verbose=2, shuffle=True)
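Once training finishes, we can reload the best checkpoint and score the model on the test set. This is a minimal sketch, assuming the names defined in the preceding code; evaluate() returns the loss and accuracy for this compiled model:

    # reload the weights that achieved the best validation loss during training
    model.load_weights('model.weights.best.hdf5')

    # evaluate on the held-out test set
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test accuracy:', score[1])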
**Image augmentation**
----------------------

While training a CNN model, we do not want the model to change its prediction based on the size, angle, or position of the object in the image. The image is represented as a matrix of pixel values, so size, angle, and position have a huge effect on those values. To make the model more size-invariant, we can add images of different sizes to the training set. Similarly, to make the model more rotation-invariant, we can add images with different angles. This process is known as **image data augmentation**. It also helps to avoid overfitting, which happens when a model is exposed to very few samples. Image data augmentation is one way to reduce overfitting, but it may not be enough on its own, because the augmented images are still correlated with the originals.

Keras provides an image augmentation class called ImageDataGenerator that defines the configuration for image data augmentation. It also provides other features, such as:

- Sample-wise and feature-wise standardization
- Random rotation, shifts, shear, and zoom of the image
- Horizontal and vertical flips
- ZCA whitening
- Dimension reordering
- Saving the changes to disk

An augmented image generator object can be created as follows:

    imagedatagen = ImageDataGenerator()

This API generates batches of tensor image data with real-time data augmentation, instead of processing the entire image dataset in memory. It is designed to create augmented image data while the model is being fitted. Thus, it reduces the memory overhead but adds some time cost to model training.

After the generator is created and configured, you must fit it to your data. This computes any statistics required to perform the transformations to the image data, and is done by calling the fit() function on the data generator and passing it the training dataset:

    imagedatagen.fit(train_data)

The batch size can be configured, the data generator can be prepared, and batches of images can be received by calling the flow() function:

    imagedatagen.flow(x_train, y_train, batch_size=32)

Finally, call the model's fit_generator() function instead of the fit() function, passing it the flow of augmented batches:

    model.fit_generator(imagedatagen.flow(x_train, y_train, batch_size=32),
                        steps_per_epoch=len(x_train) // 32, epochs=200)

Let's look at some examples to understand how the image augmentation API in Keras works. We will use the MNIST handwritten digit recognition task in these examples. Let's begin by taking a look at the first nine images in the training dataset:

    # plot images
    from keras.datasets import mnist
    from matplotlib import pyplot

    # load data
    (X_train, y_train), (X_test, y_test) = mnist.load_data()

    # create a grid of 3x3 images
    for i in range(0, 9):
        pyplot.subplot(330 + 1 + i)
        pyplot.imshow(X_train[i], cmap=pyplot.get_cmap('gray'))

    # display the plot
    pyplot.show()
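To see the generator in action end to end, here is a minimal sketch (an illustrative addition; the rotation_range value is arbitrary) that produces and plots one batch of randomly rotated MNIST digits:

    from keras.datasets import mnist
    from keras.preprocessing.image import ImageDataGenerator
    from matplotlib import pyplot

    (X_train, y_train), (X_test, y_test) = mnist.load_data()

    # the generator expects a channels dimension: (samples, height, width, channels)
    X_train = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32')

    datagen = ImageDataGenerator(rotation_range=20)  # rotate by up to 20 degrees
    datagen.fit(X_train)  # only needed for statistics-based transforms; harmless here

    # draw a single batch of nine augmented images and plot them in a 3x3 grid
    for X_batch, y_batch in datagen.flow(X_train, y_train, batch_size=9):
        for i in range(9):
            pyplot.subplot(330 + 1 + i)
            pyplot.imshow(X_batch[i].reshape(28, 28), cmap=pyplot.get_cmap('gray'))
        pyplot.show()
        break  # flow() loops forever, so stop after the first batch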
The following code snippet creates augmented image generators for the CIFAR-10 dataset. We will add these augmented images to the training set of the last example and see how the classification accuracy increases:

    from keras.preprocessing.image import ImageDataGenerator

    # create and configure the augmented image generator for training
    datagen_train = ImageDataGenerator(
        width_shift_range=0.1,   # randomly shift images horizontally (10% of total width)
        height_shift_range=0.1,  # randomly shift images vertically (10% of total height)
        horizontal_flip=True)    # randomly flip images horizontally

    # create and configure the augmented image generator for validation
    datagen_valid = ImageDataGenerator(
        width_shift_range=0.1,
        height_shift_range=0.1,
        horizontal_flip=True)

    # fit the augmented image generators on the data
    datagen_train.fit(x_train)
    datagen_valid.fit(x_valid)
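To actually train on these augmented batches, here is a minimal sketch, assuming the model defined in the earlier example is still in scope; the batch size and epoch count are illustrative:

    batch_size = 32
    epochs = 100

    model.fit_generator(datagen_train.flow(x_train, y_train, batch_size=batch_size),
                        steps_per_epoch=x_train.shape[0] // batch_size,
                        epochs=epochs,
                        validation_data=datagen_valid.flow(x_valid, y_valid,
                                                           batch_size=batch_size),
                        validation_steps=x_valid.shape[0] // batch_size,
                        verbose=2)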
**Summary**
-----------

We began this chapter by briefly looking into the history of CNNs and introduced the implementation of visualizing images. We studied image classification with the help of a practical example, using all the principles we learned about in the chapter. Finally, we learned how image augmentation helps us avoid overfitting and studied the various other features image augmentation provides. In the next chapter, we will learn how to build a simple image classifier CNN model from scratch.

Chapter 3
=========

**Build Your First CNN and Performance Optimization**
-----------------------------------------------------

A **convolutional neural network** (**CNN**) is a type of **feed-forward neural network** (**FNN**) in which the connectivity pattern between its neurons is inspired by an animal's visual cortex. In the last few years, CNNs have demonstrated superhuman performance in image search services, self-driving cars, automatic video classification, voice recognition, and **natural language processing** (**NLP**). Considering these motivations, in this chapter we will construct a simple CNN model for image classification from scratch, followed by some theoretical aspects, such as the convolutional and pooling operations. Then we will discuss how to tune hyperparameters and optimize the training time of CNNs for improved classification accuracy. Finally, we will build a second CNN model by considering some best practices. In a nutshell, the following topics will be covered in this chapter:

- CNN architectures and drawbacks of DNNs
- The convolution operations and pooling layers
- Creating and training a CNN for image classification
- Model performance optimization
- Creating an improved CNN for optimized performance

**CNN architectures and drawbacks of DNNs**
-------------------------------------------

In [Chapter 2](https://learning.oreilly.com/library/view/practical-convolutional-neural/9781788392303/00f0eb08-6d6c-48b7-8ffe-db69c7f90a73.xhtml), *Introduction to Convolutional Neural Networks*, we discussed that a regular multilayer perceptron works fine for small images (for example, MNIST or CIFAR-10). However, it breaks down for larger images because of the huge number of parameters it requires. For example, a 100 x 100 image has 10,000 pixels, and if the first layer has just 1,000 neurons (which already severely restricts the amount of information transmitted to the next layer), this means 10 million connections, and that is just for the first layer. CNNs solve this problem using partially connected layers. Because consecutive layers are only partially connected, and because a CNN heavily reuses its weights, it has far fewer parameters than a fully connected DNN, which makes it much faster to train, reduces the risk of overfitting, and requires much less training data.

Moreover, when a CNN has learned a kernel that can detect a particular feature, it can detect that feature anywhere in the image. In contrast, when a DNN learns a feature in one location, it can detect it only in that particular location. Since images typically have very repetitive features, CNNs are able to generalize much better than DNNs on image processing tasks such as classification, using fewer training examples.

Importantly, a DNN has no prior knowledge of how pixels are organized; it does not know that nearby pixels are close. A CNN's architecture embeds this prior knowledge. Lower layers typically identify features in small areas of the image, while higher layers combine the lower-level features into larger features. This works well with most natural images, giving CNNs a decisive head start over DNNs:

*Figure 1: Regular DNN versus CNN, where each layer has neurons arranged in 3D*

For example, in *Figure 1*, on the left you can see a regular three-layer neural network. On the right, a ConvNet arranges its neurons in three dimensions (width, height, and depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume into a 3D output volume of neuron activations. The red input layer holds the image, so its width and height are the dimensions of the image, and the depth is three (the red, green, and blue channels).

So far, all the multilayer neural networks we looked at had layers composed of a long line of neurons, and we had to flatten input images to 1D before feeding them to the neural network. What happens once you try to feed them a 2D image directly? In a CNN, each layer is represented in 2D, which makes it easier to match neurons with their corresponding inputs. We will see examples of this in upcoming sections. Another important fact is that all the neurons in a feature map share the same parameters, which dramatically reduces the number of parameters in the model; but more importantly, it means that once a CNN has learned to recognize a pattern in one location, it can recognize it in any other location.

In multilayer networks such as MLPs or DBNs, the outputs of all neurons of the input layer are connected to each neuron in the hidden layer, and the output then acts as the input to the next fully connected layer. In CNNs, the connection scheme that defines the convolutional layer is significantly different. The convolutional layer is the main type of layer in a CNN, where each neuron is connected to a certain region of the input called its **receptive field**. In a typical CNN architecture, a few convolutional layers are connected in a cascade style: each convolutional layer is followed by a **Rectified Linear Unit** (**ReLU**) layer, then a pooling layer, then one or more convolutional layers (+ReLU), then another pooling layer, and finally one or more fully connected layers. Depending on the problem type, the network may need to be quite deep. The output of each convolutional layer is a set of objects called **feature maps**, each generated by a single kernel filter; the feature maps can then be used to define a new input to the next layer. To make the parameter savings concrete, the short sketch below counts the weights of a small convolutional layer against those of a fully connected one.
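This is a minimal sketch with illustrative layer sizes (not from the book's code), comparing the two connection schemes in Keras. The dense layer sees a flattened 100 x 100 RGB image, while the convolutional layer shares one small kernel across the whole image:

    from keras.models import Sequential
    from keras.layers import Dense, Conv2D, Flatten

    # fully connected: (100*100*3 inputs) * 1000 neurons + 1000 biases = 30,001,000 parameters
    dense_model = Sequential()
    dense_model.add(Flatten(input_shape=(100, 100, 3)))
    dense_model.add(Dense(1000))
    dense_model.summary()

    # convolutional: 3*3*3 kernel weights * 32 filters + 32 biases = 896 parameters,
    # shared across every spatial position of the image
    conv_model = Sequential()
    conv_model.add(Conv2D(filters=32, kernel_size=3, input_shape=(100, 100, 3)))
    conv_model.summary()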
Each neuron in a CNN produces an output followed by an activation threshold, which is proportional to the input and not bounded:

![Figure 2: A conceptual architecture of a CNN](media/image6.png)

As you can see in *Figure 2*, the pooling layers are usually placed after the convolutional layers (for example, between two convolutional layers). A pooling layer divides the convolutional region into subregions; then a single representative value is selected, using either a max-pooling or an average-pooling technique, to reduce the computational time of subsequent layers. In this way, a CNN can be thought of as a feature extractor, and the robustness of each feature with respect to its spatial position is increased too. More specifically, as the image progresses through the network it gets smaller and smaller, but it also typically gets deeper and deeper, as more feature maps are added.

We have already discussed the limitation of such FNNs: a very high number of neurons would be necessary, even in a shallow architecture, because of the very large input sizes associated with images, where each pixel is a relevant variable. The convolution operation brings a solution to this problem, as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters.

**Convolutional operations**
----------------------------

A convolution is a mathematical operation that slides one function over another and measures the integral of their pointwise multiplication. It has deep connections with the Fourier transform and the Laplace transform and is heavily used in signal processing. Convolutional layers actually use cross-correlations, which are very similar to convolutions.

*In mathematics, a convolution is an operation on two functions that produces a third function, that is, the modified (convolved) version of one of the original functions. The resulting function gives the integral of the pointwise multiplication of the two functions as a function of the amount by which one of the original functions is translated. Interested readers can refer to this URL for more information: [https://en.wikipedia.org/wiki/Convolution](https://en.wikipedia.org/wiki/Convolution).*
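In symbols (the standard textbook definitions, added here for reference, not taken from this chapter's code), the continuous convolution of two functions $f$ and $g$ is:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

and the discrete 2D form that a convolutional layer computes over an image $I$ with a kernel $K$ (strictly speaking, a cross-correlation) is:

$$S(i, j) = \sum_{m} \sum_{n} I(i + m,\ j + n)\, K(m, n)$$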
Thus, the most important building block of a CNN is the convolutional layer. Neurons in the first convolutional layer are not connected to every single pixel in the input image (as they are in FNNs such as MLPs and DBNs) but only to pixels in their receptive fields. In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer:

*Figure 3: Each convolutional neuron processes data only for its receptive field*

In [Chapter 2](https://learning.oreilly.com/library/view/practical-convolutional-neural/9781788392303/00f0eb08-6d6c-48b7-8ffe-db69c7f90a73.xhtml), *Introduction to Convolutional Neural Networks*, we saw that all multilayer neural networks (for example, MLPs) have layers composed of very many neurons, and we had to flatten input images to 1D before feeding them to the network. In a CNN, by contrast, each layer is represented in 2D, which makes it easier to match neurons with their corresponding inputs.

*CNNs use the concept of receptive fields to exploit spatial locality by enforcing a local connectivity pattern between neurons of adjacent layers.*

This architecture allows the network to concentrate on low-level features in the first hidden layer, then assemble them into higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition. Finally, this scheme not only requires fewer neurons but also reduces the number of trainable parameters significantly. For example, regardless of the image size, building regions of size 5 x 5 with the same shared weights requires only 25 learnable parameters. This also eases the vanishing and exploding gradient problems encountered when training traditional multilayer neural networks with many layers by backpropagation.

**Pooling, stride, and padding operations**
-------------------------------------------

Once you have understood how convolutional layers work, the pooling layers are quite easy to grasp. A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. You may alternatively pool over the depth dimension, as we will see next, in which case the image's spatial dimensions (height and width) remain unchanged but the number of channels is reduced. Let's see a formal definition of pooling layers from the well-known TensorFlow website:

*"The pooling ops sweep a rectangular window over the input tensor, computing a reduction operation for each window (average, max, or max with argmax). Each pooling op uses rectangular windows of size ksize separated by offset strides. For example, if strides are all ones, every window is used; if strides are all twos, every other window is used in each dimension, and so on."*

Therefore, just as in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field; we must define its size, the stride, and the padding type. In summary, the output can be computed as follows:

    output[i] = reduce(value[strides * i : strides * i + ksize])

Here, the indices also take the padding values into consideration.

*A pooling neuron has no weights. All it does is aggregate the inputs using an aggregation function such as max or mean.*

In other words, the goal of pooling is to subsample the input image in order to reduce the computational load, memory usage, and number of parameters. This helps to avoid overfitting in the training stage, and reducing the input image size also makes the neural network tolerate a little bit of image shift. A toy numeric example of this windowed reduction follows.
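Here is a toy sketch (plain NumPy, with arbitrarily chosen values) of the reduce-over-windows rule above: 2 x 2 max pooling with a stride of 2 applied to a 4 x 4 input:

    import numpy as np

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 1, 2],
                  [7, 2, 9, 0],
                  [1, 8, 3, 4]])

    ksize, stride = 2, 2
    # slide a 2x2 window with stride 2 and keep the max of each window
    out = np.array([[x[i*stride:i*stride+ksize, j*stride:j*stride+ksize].max()
                     for j in range(x.shape[1] // stride)]
                    for i in range(x.shape[0] // stride)])
    print(out)
    # [[6 4]
    #  [8 9]]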
This helps to avoid overfitting in the training stage. Reducing the input image size also makes the neural network tolerate a little bit of image shift. The spatial semantics of the convolution ops depend on the padding scheme chosen. Padding is an operation to increase the size of the input data. In the case of one-dimensional data, you just append/prepend the array with a constant; in two-dimensional data, you surround the matrix with these constants. In n-dimensional, you surround your n-dimensional hypercube with the constant. In most of the cases, this constant is zero and it is called **zero padding**: - **VALID padding**: Only drops the rightmost columns (or bottommost rows) - **SAME padding**: Tries to pad evenly left and right, but if the number of columns to be added is odd, it will add the extra column to the right, as is the case in this example Let\'s explain the preceding definition graphically, in the following figure. If we want a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is called **SAME** or **zero** **padding**. *The term ***SAME*** means that the output feature map has the same spatial dimensions as the input feature map.* On the other hand, zero padding is introduced to make the shapes match as needed, equally on every side of the input map. **VALID** means no padding and only drops the rightmost columns (or bottommost rows): ![https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781788392303/files/assets/fdd19f4b-8552-4d31-8035-6101490b48c9.png](media/image8.png) Figure 4: SAME versus VALID padding with CNN In the following example (*Figure 5*), we use a 2 × 2 pooling kernel and a stride of 2 with no padding. Only the **max** input value in each kernel makes it to the next layer since the other inputs are dropped (we will see this later on): https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781788392303/files/assets/10a39cda-7b65-48aa-8ef5-1f412325d5c7.png Figure 5: An example using max pooling, that is, subsampling **Fully connected layer** ------------------------- At the top of the stack, a regular fully connected layer (also known as **FNN** or **dense layer**) is added; it acts similar to an MLP, which might be composed of a few fully connected layers (+ReLUs). The final layer outputs (for example, softmax) the prediction. An example is a softmax layer that outputs estimated class probabilities for a multiclass classification. Fully connected layers connect every neuron in one layer to every neuron in another layer. Although fully connected FNNs can be used to learn features as well as classify data, it is not practical to apply this architecture to images. **Convolution and pooling operations in TensorFlow** ---------------------------------------------------- Now that we have seen how convolutional and pooling operations are performed theoretically, let\'s see how we can perform these operation hands-on using TensorFlow. So let\'s get started. **Applying pooling operations in TensorFlow** --------------------------------------------- Using TensorFlow, a subsampling layer can normally be represented by a max\_pool operation by maintaining the initial parameters of the layer. 
For max_pool, the signature in TensorFlow is as follows:

    tf.nn.max_pool(value, ksize, strides, padding, data_format, name)

Now let's create a function that uses the preceding signature and returns a tensor of type tf.float32, that is, the max-pooled output tensor:

    import tensorflow as tf

    def maxpool2d(x, k=2):
        return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                              padding='SAME')

In the preceding code segment, the parameters can be described as follows:

- value: A 4D tensor of float32 elements with shape (batch length, height, width, channels)
- ksize: A list of integers representing the window size in each dimension
- strides: The step of the moving window in each dimension
- padding: VALID or SAME
- data_format: NHWC, NCHW, and NCHW_VECT_C are supported
- name: A name for the operation (optional)

Depending on the layering structure of a CNN, TensorFlow supports other pooling operations as well:

- tf.nn.avg_pool: Returns a reduced tensor with the average of each window
- tf.nn.max_pool_with_argmax: Returns the max_pool tensor together with a tensor holding the flattened index of each max value
- tf.nn.avg_pool3d: Performs an avg_pool operation with a cube-like window; the input has an added depth dimension
- tf.nn.max_pool3d: Performs the same function as avg_pool3d but applies the max operation

Now let's see a concrete example of how padding works in TensorFlow. Suppose we have an input image x of shape [2, 3] with one channel, and we want to see the effect of both VALID and SAME padding:

- valid_pad: Max pool with a 2 x 2 kernel, stride 2, and VALID padding
- same_pad: Max pool with a 2 x 2 kernel, stride 2, and SAME padding

With a 2 x 2 kernel and stride 2, VALID padding drops the third column that does not fit a full window, so the output shape is [1, 1]; SAME padding pads the image to a width of 4 (with -inf for max pooling), so the output shape is [1, 2]. In code, we use an input image of shape [2, 4] with one channel:

    import tensorflow as tf

    x = tf.constant([[2., 4., 6., 8.],
                     [10., 12., 14., 16.]])

Now let's give it a shape accepted by tf.nn.max_pool:

    x = tf.reshape(x, [1, 2, 4, 1])

Applying VALID padding with the max pool, a 2 x 2 kernel, and stride 2:

    VALID = tf.nn.max_pool(x, [1, 2, 2, 1], [1, 2, 2, 1], padding='VALID')

On the other hand, using the max pool with a 2 x 2 kernel, stride 2, and SAME padding:

    SAME = tf.nn.max_pool(x, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

Since this input's width of 4 divides evenly by the stride, no padding is actually needed and both schemes produce the same shape. Let's validate them:

    print(VALID.get_shape())
    print(SAME.get_shape())
    >>> (1, 1, 2, 1)
    (1, 1, 2, 1)

The general output-size rules behind these shapes are sketched below.
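For reference (the standard TensorFlow shape rules, added here), given an input spatial size $n$, window size $k$, and stride $s$, the output spatial size is:

$$n_{\text{out}}^{\text{VALID}} = \left\lceil \frac{n - k + 1}{s} \right\rceil, \qquad n_{\text{out}}^{\text{SAME}} = \left\lceil \frac{n}{s} \right\rceil$$

For the [2, 3] input above, SAME gives ceil(2/2) x ceil(3/2) = 1 x 2, while VALID gives ceil(1/2) x ceil(2/2) = 1 x 1; for the [2, 4] input in the code, both formulas give 1 x 2, matching the printed shapes.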
**Convolution operations in TensorFlow**
----------------------------------------

TensorFlow provides a variety of methods for convolution. The canonical form is the conv2d operation. Let's have a look at its usage:

    conv2d(
        input,
        filter,
        strides,
        padding,
        use_cudnn_on_gpu=True,
        data_format='NHWC',
        dilations=[1, 1, 1, 1],
        name=None
    )

The parameters we use are as follows:

- input: The original tensor the operation is applied to. It has a definite format of four dimensions, and the default dimension order is shown next.
- filter: A tensor representing a kernel or filter, with the generic layout (filter_height, filter_width, in_channels, out_channels).
- strides: A list of four int tensor datatypes, which indicate the sliding window step for each dimension.
- padding: This can be SAME or VALID. SAME tries to conserve the initial tensor dimensions, while VALID lets the output shrink according to the computed output size and padding. We saw earlier how padding interacts with the pooling layers.
- use_cudnn_on_gpu: Indicates whether to use the CUDA GPU CNN library to accelerate calculations.
- data_format: Specifies the order in which data is organized (NHWC or NCHW).
- dilations: An optional list of ints, defaulting to [1, 1, 1, 1]: a 1D tensor of length 4 giving the dilation factor for each dimension of the input. If set to k > 1, there will be k-1 skipped cells between each filter element in that dimension. The dimension order is determined by the value of data_format. Dilations in the batch and depth dimensions must be 1.
- name: A name for the operation (optional).

The following is an example of a convolutional layer. It applies a convolution, adds a bias term, and finally returns the activation function we have chosen for the whole layer (in this case ReLU, a frequently used one):

    def conv_layer(x, weights, bias, strides=1):
        x = tf.nn.conv2d(x, weights, strides=[1, strides, strides, 1],
                         padding='SAME')
        x = tf.nn.bias_add(x, bias)
        return tf.nn.relu(x)

Here, x is the 4D input tensor (batch size, height, width, and channels). TensorFlow also offers a few other kinds of convolutional layers. For example:

- tf.layers.conv1d() creates a convolutional layer for 1D inputs. This is useful, for example, in NLP, where a sentence may be represented as a 1D array of words and the receptive field covers a few neighboring words.
- tf.layers.conv3d() creates a convolutional layer for 3D inputs.
- tf.nn.atrous_conv2d() creates an atrous convolutional layer (*à trous* is French for "with holes"). This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros. For example, a 1 x 3 filter equal to (1, 2, 3) may be dilated with a dilation rate of 4, resulting in the dilated filter (1, 0, 0, 0, 2, 0, 0, 0, 3). This gives the convolutional layer a larger receptive field at no computational cost and with no extra parameters.
- tf.layers.conv2d_transpose() creates a transposed convolutional layer, sometimes called a **deconvolutional layer**, which upsamples an image. It does so by inserting zeros between the inputs, so you can think of it as a regular convolutional layer using a fractional stride.
- tf.nn.depthwise_conv2d() creates a depth-wise convolutional layer that applies every filter to every individual input channel independently. Thus, if there are *f~n~* filters and *f~n′~* input channels, this will output *f~n~* × *f~n′~* feature maps.
- tf.layers.separable_conv2d() creates a separable convolutional layer that first acts like a depth-wise convolutional layer and then applies a 1 x 1 convolutional layer to the resulting feature maps. This makes it possible to apply filters to arbitrary sets of input channels.

A short end-to-end sketch of conv2d in action follows.
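This is a minimal runnable sketch (TensorFlow 1.x graph style, with an arbitrary box-filter kernel, not from the book's code) applying conv2d to a random single-channel image:

    import numpy as np
    import tensorflow as tf

    x = tf.constant(np.random.rand(1, 8, 8, 1), dtype=tf.float32)  # NHWC input
    kernel = tf.constant(np.ones((3, 3, 1, 1)) / 9.0,
                         dtype=tf.float32)  # 3x3 averaging (box) filter
    y = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='SAME')

    with tf.Session() as sess:
        print(sess.run(y).shape)  # (1, 8, 8, 1): SAME padding keeps the spatial size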
**Training a CNN**
------------------

In the previous sections, we saw how to construct a CNN and apply different operations on its layers. Training a CNN is trickier, as it requires many considerations to control those operations, such as applying an appropriate activation function, initializing the weights and biases properly, and, of course, using an optimizer intelligently. There are also more advanced considerations, such as hyperparameter tuning for optimized performance, which will be discussed in the next section. We first start our discussion with weight and bias initialization.

**Weight and bias initialization**
----------------------------------

One of the most common initialization techniques in training a DNN is random initialization: each weight is sampled from a normal distribution with a small standard deviation. The small deviation biases the network towards simple, near-zero solutions, and it does so without the bad repercussions of actually initializing the weights to 0.

Secondly, Xavier initialization is often used to train CNNs. It is similar to random initialization but often turns out to work much better. Here is the reason:

- Imagine that you initialize the network weights randomly but they turn out to start too small. Then the signal shrinks as it passes through each layer until it is too tiny to be useful.
- On the other hand, if the weights in a network start too large, then the signal grows as it passes through each layer until it is too massive to be useful.

Xavier initialization makes sure the weights start just right, keeping the signal within a reasonable range of values through many layers. In summary, it automatically determines the scale of initialization from the number of input and output neurons.

*Interested readers should refer to this publication for detailed information: Xavier Glorot and Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP.*

Finally, you may ask: can't I get rid of random initialization altogether when training a regular DNN (for example, an MLP or DBN)? Recently, some researchers have shown that random orthogonal matrix initializations can perform better than plain random initialization for training DNNs.

As for **initializing the biases**, it is possible and common to initialize them to zero, since the asymmetry breaking is already provided by the small random numbers in the weights. Setting the biases to a small constant value such as 0.01 ensures that all ReLU units can propagate some gradient, but this neither performs well nor gives consistent improvement; sticking with zero is recommended. A short sketch of these initializers in TensorFlow follows.
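This is a minimal sketch (TensorFlow 1.x; tf.contrib.layers.xavier_initializer is the classic TF 1.x helper, and the shapes are illustrative) creating a convolution kernel with each scheme discussed above:

    import tensorflow as tf

    # plain random initialization: small-deviation truncated normal
    w_random = tf.get_variable('w_random', shape=[5, 5, 3, 32],
                               initializer=tf.truncated_normal_initializer(stddev=0.05))

    # Xavier/Glorot initialization: scale chosen from fan-in and fan-out
    w_xavier = tf.get_variable('w_xavier', shape=[5, 5, 3, 32],
                               initializer=tf.contrib.layers.xavier_initializer())

    # biases: zeros, as recommended above
    b = tf.get_variable('b', shape=[32], initializer=tf.zeros_initializer())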
**Regularization**
------------------

There are several ways to control the training of a CNN to prevent overfitting, for example, L2/L1 regularization, max-norm constraints, and dropout:

- **L2 regularization**: This is perhaps the most common form of regularization. It is implemented by penalizing the squared magnitude of all parameters directly in the objective. With gradient descent parameter updates, L2 regularization ultimately means that every weight is decayed linearly towards zero: *W += -lambda \* W*.
- **L1 regularization**: Another relatively common form of regularization, where for each weight *w* we add the term *λ|w|* to the objective. It is also possible to combine L1 with L2 regularization, *λ1|w| + λ2w²*, which is commonly known as **Elastic-net regularization**.
- **Max-norm constraints**: Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector of every neuron and use projected gradient descent to enforce the constraint.

Finally, dropout is an advanced variant of regularization, which will be discussed later in this chapter.

**Activation functions**
------------------------

The activation ops provide different types of nonlinearities for use in neural networks. These include smooth nonlinearities such as sigmoid, tanh, elu, softplus, and softsign, as well as continuous but not-everywhere-differentiable functions such as relu, relu6, crelu, and relu_x. All activation ops apply component-wise and produce a tensor of the same shape as the input tensor. Now let's see how to use a few commonly used activation functions in TensorFlow syntax.

**Using sigmoid**
-----------------

In TensorFlow, tf.sigmoid(x, name=None) computes the sigmoid of x element-wise, using *y = 1 / (1 + exp(-x))*, and returns a tensor with the same type as x. Here is the parameter description:

- x: A tensor. This must be one of the following types: float32, float64, int32, complex64, int64, or qint32.
- name: A name for the operation (optional).

**Using tanh**
--------------

In TensorFlow, tf.tanh(x, name=None) computes the hyperbolic tangent of x element-wise and returns a tensor with the same type as x. Here is the parameter description:

- x: A tensor or sparse tensor with type float, double, int32, complex64, int64, or qint32.
- name: A name for the operation (optional).

**Using ReLU**
--------------

In TensorFlow, tf.nn.relu(features, name=None) computes the rectified linear activation max(features, 0) and returns a tensor with the same type as features. Here is the parameter description:

- features: A tensor. This must be one of the following types: float32, float64, int32, int64, uint8, int16, int8, uint16, or half.
- name: A name for the operation (optional).

For more on how to use other activation functions, please refer to the TensorFlow website. A quick side-by-side evaluation of these three activations follows.
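This is a quick sketch (TensorFlow 1.x, with arbitrary sample values) evaluating the three activations above on the same inputs:

    import tensorflow as tf

    x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])
    with tf.Session() as sess:
        print(sess.run(tf.sigmoid(x)))  # squashed into (0, 1)
        print(sess.run(tf.tanh(x)))     # squashed into (-1, 1)
        print(sess.run(tf.nn.relu(x)))  # negatives clipped to 0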
Up to this point, we have the minimal theoretical knowledge needed to build our first CNN network for making predictions.

**Building, training, and evaluating our first CNN**
----------------------------------------------------

In the next section, we will look at how to classify and distinguish dogs from cats based on their raw images. We will also look at how to implement our first CNN model to deal with raw color images having three channels. This network design and implementation is not straightforward, as TensorFlow's low-level APIs will be used. However, do not worry; later in this chapter we will see another example of implementing a CNN using TensorFlow's high-level contrib API. Before we formally start, a short description of the dataset is in order.

### **Dataset description**

For this example, we will use the dog versus cat dataset from Kaggle, provided for the infamous Dogs versus Cats classification problem as a playground competition with kernels enabled. The dataset can be downloaded from the Kaggle competition page. The train folder contains 25,000 images of dogs and cats, and each image in this folder has its label as part of its filename. The test folder contains 12,500 images, named according to a numeric ID. For each image in the test set, you should predict the probability that the image is a dog (1 = dog, 0 = cat); that is, this is a binary classification problem. For this example, there are three Python scripts.

### **Step 1 -- Loading the required packages**

Here we import the required packages and libraries. Note that, depending on the platform, your imports might differ:

    import time
    import math
    import random
    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf
    import Preprocessor
    import cv2
    import LayersConstructor
    from sklearn.metrics import confusion_matrix
    from datetime import timedelta
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import precision_recall_fscore_support

### **Step 2 -- Loading the training/test images to generate the train/test sets**

We set the number of color channels to 3 for the images; in the previous section we saw that it should be 1 for grayscale images:

    num_channels = 3

For simplicity, we assume the images are square. Let's set the size to 128:

    img_size = 128

Given the image size (128) and the number of channels (3), the size of an image when flattened to a single dimension is the product of the image dimensions and the number of channels:

    img_size_flat = img_size * img_size * num_channels

Note that, at a later stage, we might need to reshape the images for the max-pooling and convolutional layers. For our case, the shape is a tuple with the height and width of the images used to reshape the arrays:

    img_shape = (img_size, img_size)

We have to define the labels (that is, the classes) explicitly, since we only have raw color images; unlike other, numeric machine learning datasets, the images do not carry labels themselves:

    classes = ['dogs', 'cats']
    num_classes = len(classes)

We need to define the batch size that our CNN model will be trained with later on:

    batch_size = 14

We can also define what portion of the training set will be used as the validation split. Let's assume 16%, for simplicity:

    validation_size = 0.16

One important setting is how long to wait after the validation loss stops improving before terminating training. We use None if we do not want to implement early stopping:

    early_stopping = None

Now, download the dataset. One thing has to be done manually: separate the images of dogs and cats and place them in two separate folders. To be more specific, suppose you put your training set under the path /home/DoG_CaT/data/train/. In the train folder, create two separate folders, dogs and cats, but point the code only at DoG_CaT/data/train/. We also assume that our test set is in the /home/DoG_CaT/data/test/ directory. In addition, you can define the checkpoint directory where the logs and model checkpoint files will be written:

    train_path = '/home/DoG_CaT/data/train/'
    test_path = '/home/DoG_CaT/data/test/'
    checkpoint_dir = "models/"

Then we start reading the training set and preparing it for the CNN model. The processing of the test and train sets lives in a separate script, Preprocessor.py; it is best to prepare the test set at this point as well:

    data = Preprocessor.read_train_sets(train_path, img_size, classes,
                                        validation_size=validation_size)

The preceding line of code reads the raw images of cats and dogs and creates the training set.
The read_train_sets() function goes as follows (Preprocessor.py also needs os, glob, cv2, numpy, and a shuffle utility such as sklearn.utils.shuffle):

    def read_train_sets(train_path, image_size, classes, validation_size=0):

        class DataSets(object):
            pass

        data_sets = DataSets()

        images, labels, ids, cls = load_train(train_path, image_size, classes)
        images, labels, ids, cls = shuffle(images, labels, ids, cls)

        if isinstance(validation_size, float):
            validation_size = int(validation_size * images.shape[0])

        validation_images = images[:validation_size]
        validation_labels = labels[:validation_size]
        validation_ids = ids[:validation_size]
        validation_cls = cls[:validation_size]

        train_images = images[validation_size:]
        train_labels = labels[validation_size:]
        train_ids = ids[validation_size:]
        train_cls = cls[validation_size:]

        data_sets.train = DataSet(train_images, train_labels, train_ids, train_cls)
        data_sets.valid = DataSet(validation_images, validation_labels,
                                  validation_ids, validation_cls)

        return data_sets

In the previous code segment, we used the method load_train() to load the images, each wrapped in an instance of a class called DataSet:

    def load_train(train_path, image_size, classes):
        images = []
        labels = []
        ids = []
        cls = []

        print('Reading training images')
        for fld in classes:
            index = classes.index(fld)
            print('Loading {} files (Index: {})'.format(fld, index))
            path = os.path.join(train_path, fld, '*g')
            files = glob.glob(path)
            for fl in files:
                image = cv2.imread(fl)
                image = cv2.resize(image, (image_size, image_size),
                                   cv2.INTER_LINEAR)
                images.append(image)
                label = np.zeros(len(classes))
                label[index] = 1.0
                labels.append(label)
                flbase = os.path.basename(fl)
                ids.append(flbase)
                cls.append(fld)

        images = np.array(images)
        labels = np.array(labels)
        ids = np.array(ids)
        cls = np.array(cls)

        return images, labels, ids, cls

The DataSet class, which is used to generate the batches of the training set, is as follows:

    class DataSet(object):

        def next_batch(self, batch_size):
            """Return the next `batch_size` examples from this data set."""
            start = self._index_in_epoch
            self._index_in_epoch += batch_size
            if self._index_in_epoch > self._num_examples:
                # finished epoch
                self._epochs_completed += 1
                # start the next epoch from the beginning
                start = 0
                self._index_in_epoch = batch_size
                assert batch_size <= self._num_examples
            end = self._index_in_epoch
            return self._images[start:end], self._labels[start:end], \
                   self._ids[start:end], self._cls[start:end]

After the network is constructed and trained (using the layer helpers from LayersConstructor imported in Step 1), evaluating it on the test set prints the following:

    >>> Accuracy on Test-Set: 81.1% (3244 / 4000)
    Precision: 0.811057239265
    Recall: 0.811
    F1-score: 0.81098298755

So accuracy did not improve that much, but it was a 2% increase overall. Now it is time to evaluate our model on a single image.
For simplicity, we will take two random images, one of a dog and one of a cat, and see the prediction power of our model:

*Figure 9: Example images of the cat and dog to be classified*

First, we load these two images and prepare the test inputs accordingly, as we saw in an earlier step of this example:

    test_cat = cv2.imread('Test_image/cat.jpg')
    test_cat = cv2.resize(test_cat, (img_size, img_size), cv2.INTER_LINEAR) / 255
    preview_cat = plt.imshow(test_cat.reshape(img_size, img_size, num_channels))

    test_dog = cv2.imread('Test_image/dog.jpg')
    test_dog = cv2.resize(test_dog, (img_size, img_size), cv2.INTER_LINEAR) / 255
    preview_dog = plt.imshow(test_dog.reshape(img_size, img_size, num_channels))

Then we have the following function for making a prediction:

    def sample_prediction(test_im):
        feed_dict_test = {
            x: test_im.reshape(1, img_size_flat),
            y_true: np.array([[1, 0]])
        }
        test_pred = session.run(y_pred_cls, feed_dict=feed_dict_test)
        return classes[test_pred[0]]

    print("Predicted class for test_cat: {}".format(sample_prediction(test_cat)))
    print("Predicted class for test_dog: {}".format(sample_prediction(test_dog)))
    >>> Predicted class for test_cat: cats
    Predicted class for test_dog: dogs

Finally, when we're done, we close the TensorFlow session by invoking the close() method:

    session.close()

**Model performance optimization**
----------------------------------

Since CNNs differ from other architectures from the layering perspective, they have different requirements as well as different tuning criteria. How do you know what combination of hyperparameters is best for your task? For linear machine learning models, you can use a grid search with cross-validation to find the right hyperparameters. For CNNs, however, there are many hyperparameters to tune, and since training a neural network on a large dataset takes a lot of time, you will only be able to explore a tiny part of the hyperparameter space in a reasonable amount of time. Here are some insights that can be followed.

**Number of hidden layers**
---------------------------

For many problems, you can begin with a single hidden layer and get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions, provided it has enough neurons. For a long time, these facts convinced researchers that there was no need to investigate deeper neural networks. However, they overlooked the fact that deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train (though this might not always be the case).

In summary, for many problems you can start with just one or two hidden layers; two hidden layers with the same total number of neurons will often work just as well as one, in roughly the same training time. For more complex problems, you can gradually ramp up the number of hidden layers until you start overfitting the training set. Very complex tasks, such as large-scale image classification or speech recognition, typically require networks with dozens of layers and a huge amount of training data.

**Number of neurons per hidden layer**
--------------------------------------

Obviously, the number of neurons in the input and output layers is determined by the type of input and output your task requires.
For example, if your dataset consists of 28 x 28 images, the input layer should have 784 input neurons, and the output layer should have as many neurons as there are classes to predict. As for the hidden layers, a common practice used to be sizing them to form a funnel, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features. However, this practice is less common now, and you may simply use the same size for all hidden layers: with, say, four convolutional layers of 256 neurons each, that's just one hyperparameter to tune instead of one per layer. Just like with the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting.

Another important question is: when would you want to add a max-pooling layer rather than a convolutional layer with the same stride? The difference is that a max-pooling layer has no parameters at all, whereas a convolutional layer has quite a few. Sometimes, adding a local response normalization layer, which makes the most strongly activated neurons inhibit neurons at the same location in neighboring feature maps, encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features. It is typically used in the lower layers to create a larger pool of low-level features that the upper layers can build upon.

**Batch normalization**
-----------------------

**Batch normalization** (**BN**) is a method to reduce internal covariate shift while training regular DNNs, and it applies to CNNs too. Thanks to the normalization, BN prevents small changes to the parameters from amplifying, and thereby allows higher learning rates, making the network train even faster.

The idea is to place an additional step between the layers in which the output of the layer before is normalized; in the case of non-linear operations (for example, ReLU), the BN transformation is applied at the non-linear operation. Typically, the overall process has the following workflow:

- Transform the network into a BN network (see *Figure 1*)
- Train the new network
- Transform the batch statistics into population statistics

This way, BN can fully partake in the process of backpropagation. As shown in *Figure 1*, BN is performed before the other processes of the network in this layer are applied. However, any kind of gradient descent (for example, **stochastic gradient descent** (**SGD**) and its variants) can be applied to train the BN network.

*Interested readers can refer to the original paper for more information: Ioffe, Sergey, and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv preprint arXiv:1502.03167 (2015).*

Now, a valid question would be: where should the BN layer be placed? A quick evaluation of BatchNorm layer performance on ImageNet-2012 shows the following benchmark:

![Benchmark of BatchNorm placement on ImageNet-2012](media/image14.png)

From the preceding table, it can be seen that placing BN after the non-linearity would be the right way. The second question is: what activation function should be used in a BN layer?
Now a valid question would be: where to place the BN layer? To find out, a quick evaluation of BatchNorm layer performance on ImageNet-2012 gives the following benchmark:

[Benchmark table: accuracy with BN placed before versus after the non-linearity on ImageNet-2012]

From the preceding table, it can be seen that placing BN after the non-linearity would be the right way. The second question would be: which activation function should be used with a BN layer? Well, from the same benchmark, we can see the following result:

[Benchmark table: accuracy of different activation functions combined with BN on ImageNet-2012]

From the preceding table, we can assume that using ReLU or its variants would be a better idea. Now, another question is how to use BN with deep learning libraries. Well, in TensorFlow, it is:

training = tf.placeholder(tf.bool)
x = tf.layers.dense(input_x, units=100)
x = tf.layers.batch_normalization(x, training=training)
x = tf.nn.relu(x)

A general warning: feed the training placeholder as True during training and False at test time. However, the preceding addition introduces extra ops to be performed on the graph: the ops that update the layer's moving mean and variance variables are not automatically made dependencies of your training op. To handle this, we can simply run those ops alongside the training op, as follows:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
sess.run([train_op, extra_update_ops], ...)

**Advanced regularization and avoiding overfitting**
----------------------------------------------------

As mentioned in the previous chapter, one of the main disadvantages observed during the training of large neural networks is overfitting, that is, generating very good approximations for the training data but producing noise in the regions between the training points. In the case of overfitting, the model is specifically adjusted to the training dataset, so it cannot be used for generalization: although it performs well on the training set, its performance on the test dataset and subsequent tests is poor, because it lacks the generalization property. There are a couple of ways to reduce or even prevent this issue, such as dropout, early stopping, and limiting the number of parameters:

Figure 10: Dropout versus without dropout

The first of these, dropout, randomly deactivates a fraction of the neurons in a layer (that is, sets their output to zero) at each training step. The main advantage of this method is that it prevents all the neurons in a layer from optimizing their weights synchronously. This adaptation, made in random groups, prevents all the neurons from converging to the same goals, thus de-correlating the adapted weights. A second property found in the dropout application is that the activation of the hidden units becomes sparse, which is also a desirable characteristic.

In the preceding figure, we have a representation of an original fully connected multilayer neural network and the associated network with dropout applied. As a result, approximately half of the input was zeroed (this example was chosen to show that probabilities will not always give the expected four zeroes). One factor that could have surprised you is the scale factor applied to the non-dropped elements. This scaling keeps the expected magnitude of the activations unchanged, so the very same network can be reused unmodified at test time simply by setting dropout_keep_prob to 1.

A major drawback of using dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected. To address this issue, a few other techniques can be applied, such as DropConnect and stochastic pooling:

- DropConnect is similar to dropout, as it introduces dynamic sparsity within the model, but it differs in that the sparsity is on the weights rather than on the output vectors of a layer. In other words, a fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage.
- In stochastic pooling, the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution given by the activities within the pooling region. The approach is hyperparameter free and can be combined with other regularization approaches, such as dropout and data augmentation. A minimal sketch of this procedure follows this list.

**Stochastic pooling versus standard max pooling:** *Stochastic pooling is equivalent to standard max pooling but with many copies of an input image, each having small local deformations.*
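As an illustration only (this code is not used elsewhere in this chapter), here is a NumPy sketch of stochastic pooling over a single feature map; it assumes non-negative activations, as produced by ReLU:

import numpy as np

def stochastic_pool(feature_map, pool_size=2, rng=np.random):
    # feature_map: 2D array of non-negative activations (for example, after ReLU).
    h, w = feature_map.shape
    out = np.zeros((h // pool_size, w // pool_size))
    for i in range(0, h - h % pool_size, pool_size):
        for j in range(0, w - w % pool_size, pool_size):
            region = feature_map[i:i + pool_size, j:j + pool_size].ravel()
            total = region.sum()
            if total == 0:
                continue  # all-zero region: the pooled value stays 0
            probs = region / total  # multinomial probabilities from the activations
            out[i // pool_size, j // pool_size] = rng.choice(region, p=probs)
    return out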
Secondly, one of the simplest methods to prevent overfitting is to stop the training before overfitting has a chance to occur, a technique known as early stopping. It comes with the disadvantage that the learning process is halted, possibly before the network has finished improving.

Thirdly, limiting the number of parameters is sometimes helpful in avoiding overfitting. When it comes to CNN training, the filter size also affects the number of parameters. Limiting this type of parameter restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and that limits the amount of overfitting.

**Applying dropout operations with TensorFlow**
-----------------------------------------------

If we apply the dropout operation to a sample vector, it will work on transmitting the dropout to all the architecture-dependent units. In order to apply the dropout operation, TensorFlow implements the tf.nn.dropout method, which works as follows:

tf.nn.dropout(x, keep_prob, noise_shape, seed, name)

Here, x is the original tensor; keep_prob is the probability of keeping a neuron, which also determines the factor (1/keep_prob) by which the remaining nodes are multiplied; and noise_shape is a four-element list that determines whether a dimension will apply zeroing independently or not. Let's have a look at this code segment:

import tensorflow as tf

X = [1.5, 0.5, 0.75, 1.0, 0.75, 0.6, 0.4, 0.9]
drop_out = tf.nn.dropout(X, 0.5)
sess = tf.Session()
with sess.as_default():
    print(drop_out.eval())
sess.close()

[ 3. 0. 1.5 0. 0. 1.20000005 0. 1.79999995]

In the preceding example, you can see the result of applying dropout to the X variable, with a 0.5 probability of each element being zeroed; in the cases in which it didn't occur, the values were doubled (multiplied by 1/0.5, the inverse of the keep probability).

**Which optimizer to use?**
---------------------------

When using a CNN, since one of the objectives is to minimize the evaluated cost, we must define an optimizer. With the most common optimizer, SGD, the learning rate must scale with 1/T to get convergence, where T is the number of iterations. Adam and RMSProp try to overcome this limitation automatically by adjusting the step size so that the step is on the same scale as the gradients. In the previous example, we used the Adam optimizer, which performs well in most cases. Nevertheless, the RMSPropOptimizer function (which implements the RMSProp algorithm) is often a good choice, since it tends to be the faster way of learning in a mini-batch setting. Researchers also recommend using the momentum optimizer while training a deep CNN or DNN.

Technically, RMSPropOptimizer is an advanced form of gradient descent that divides the learning rate by an exponentially decaying average of squared gradients. The suggested value of the decay parameter is 0.9, while a good default value for the learning rate is 0.001. For example, in TensorFlow, tf.train.RMSPropOptimizer() helps us to use this with ease:

optimizer = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost_op)
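For completeness, here is how the momentum optimizer mentioned above can be created in TensorFlow; the learning rate and momentum values below are common defaults rather than tuned values:

# Momentum optimizer (optionally with Nesterov acceleration); cost_op is
# the cost function defined for the model, as in the RMSProp example above.
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9,
                                       use_nesterov=True).minimize(cost_op)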
**Memory tuning**
-----------------

In this section, we try to provide some insights, starting with an issue and its solution. Convolutional layers require a huge amount of RAM, especially during training, because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass. During inference (that is, when making a prediction for a new instance), the RAM occupied by one layer can be released as soon as the next layer has been computed, so you only need as much RAM as required by two consecutive layers. Nevertheless, during training, everything computed during the forward pass needs to be preserved for the reverse pass, so the amount of RAM needed is (at least) the total amount of RAM required by all layers.

If your GPU runs out of memory while training a CNN, here are five things you can try to solve the problem (other than purchasing a GPU with more RAM):

- Reduce the mini-batch size
- Reduce dimensionality using a larger stride in one or more layers
- Remove one or more layers
- Use 16-bit floats instead of 32-bit floats
- Distribute the CNN across multiple devices (see more at https://www.tensorflow.org/deploy/distributed)

**Appropriate layer placement**
-------------------------------

Another important question is: when would you want to add a max pooling layer rather than a convolutional layer with the same stride? The thing is that a max pooling layer has no parameters at all, whereas a convolutional layer has quite a few.

Adding a local response normalization layer makes the neurons that most strongly activate inhibit neurons at the same location in neighboring feature maps, which encourages different feature maps to specialize, pushes them apart, and forces them to explore a wider range of features. Local response normalization is typically used in the lower layers to have a larger pool of low-level features that the upper layers can build upon.
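In TensorFlow, such a layer is available as tf.nn.local_response_normalization. Here is a minimal sketch, assuming conv1 is the output of a lower convolutional layer and using the function's default hyperparameters:

# conv1 is assumed to be a 4-D activation tensor of shape
# [batch, height, width, channels]; normalization runs across channels.
lrn1 = tf.nn.local_response_normalization(conv1, depth_radius=5,
                                          bias=1.0, alpha=1.0, beta=0.5)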
**Building the second CNN by putting everything together**
----------------------------------------------------------

We now know how to optimize the layering structure of a CNN by adding dropout, BN, and bias initializers such as Xavier. Let's try to apply these techniques to a less complex CNN. Throughout this example, we will see how to solve a real-life classification problem; to be more specific, our CNN model will classify traffic signs from a set of images.

**Dataset description and preprocessing**
-----------------------------------------

For this, we will be using the Belgian traffic sign dataset (BelgiumTS for Classification, cropped images). This dataset can be downloaded from . Here is a quick glimpse of the traffic sign conventions in Belgium:

- Belgian traffic signs are usually in Dutch and French. This is good to know, but for the dataset that you'll be working with, it's not too important!
- There are six categories of traffic signs in Belgium: warning signs, priority signs, prohibitory signs, mandatory signs, signs related to parking and standing still on the road and, lastly, designatory signs.

Once we download the aforementioned dataset, we will see the following directory structure (training on the left, test on the right):

[Figure: directory structure of the BelgiumTSC training and testing folders]

The images are in .ppm format; otherwise, we could have used a TensorFlow built-in image loader (for example, tf.image.decode_png). However, we can use the skimage Python package instead. In Python 3, execute $ sudo pip3 install scikit-image to install this package. Let's get started by defining the directory paths as follows:

Train_IMAGE_DIR = "/BelgiumTSC_Training/"
Test_IMAGE_DIR = "/BelgiumTSC_Testing/"

Then let's write a function using the skimage library that reads the images and returns two lists:

- images: a list of NumPy arrays, each representing an image
- labels: a list of numbers that represent the image labels

def load_data(data_dir):
    # All subdirectories, where each folder represents a unique label.
    directories = [d for d in os.listdir(data_dir)
                   if os.path.isdir(os.path.join(data_dir, d))]

    # Iterate over the label directories and collect the data in
    # two lists, labels and images.
    labels = []
    images = []
    for d in directories:
        label_dir = os.path.join(data_dir, d)
        file_names = [os.path.join(label_dir, f)
                      for f in os.listdir(label_dir) if f.endswith(".ppm")]

        # For each label, load its images and add them to the images list.
        # And add the label number (that is, the directory name) to the labels list.
        for f in file_names:
            images.append(skimage.data.imread(f))
            labels.append(int(d))
    return images, labels

The preceding code block is straightforward and contains inline comments. How about showing some related statistics about the images? First, let's invoke the preceding function:

# Load the training and testing datasets.
train_data_dir = os.path.join(Train_IMAGE_DIR, "Training")
test_data_dir = os.path.join(Test_IMAGE_DIR, "Testing")

images, labels = load_data(train_data_dir)

Then let's see some statistics:

print("Unique classes: {0} \nTotal Images: {1}".format(len(set(labels)), len(images)))

>>>
Unique classes: 62
Total Images: 4575

So we have 62 classes to predict (that is, a multiclass image classification problem), and we have many images, which should be sufficient for a smaller CNN. Now let's look at the class distribution visually:

# Make a histogram with 62 bins of the `labels` data and show the plot:
plt.hist(labels, 62)
plt.xlabel('Class')
plt.ylabel('Number of training examples')
plt.show()

[Figure: histogram of the number of training examples per class]

From the preceding figure, we can see that the classes are very imbalanced. However, to keep things simple, we won't deal with this here. Next, it would be good to visually inspect some files, say by displaying the first image of each label:

def display_images_and_labels(images, labels):
    unique_labels = set(labels)
    plt.figure(figsize=(15, 15))
    i = 1
    for label in unique_labels:
        # Pick the first image for each label.
        image = images[labels.index(label)]
        plt.subplot(8, 8, i)  # A grid of 8 rows x 8 columns
        plt.axis('off')
        plt.title("Label {0} ({1})".format(label, labels.count(label)))
        i += 1
        _ = plt.imshow(image)
    plt.show()

display_images_and_labels(images, labels)

Now you can see from the preceding figure that the images come in different sizes and shapes.
Moreover, we can verify this using Python code, as follows:

for img in images[:5]:
    print("shape: {0}, min: {1}, max: {2}".format(img.shape, img.min(), img.max()))

>>>
shape: (87, 84, 3), min: 12, max: 255
shape: (289, 169, 3), min: 0, max: 255
shape: (205, 76, 3), min: 0, max: 255
shape: (72, 71, 3), min: 14, max: 185
shape: (231, 228, 3), min: 0, max: 255

Therefore, we need to apply some preprocessing, such as resizing and reshaping, to each image. Let's say each image will have a size of 32 x 32:

images32 = [skimage.transform.resize(img, (32, 32), mode='constant')
            for img in images]

for img in images32[:5]:
    print("shape: {0}, min: {1}, max: {2}".format(img.shape, img.min(), img.max()))

>>>
shape: (32, 32, 3), min: 0.06642539828431372, max: 0.9704350490196079
shape: (32, 32, 3), min: 0.0, max: 1.0
shape: (32, 32, 3), min: 0.03172870710784261, max: 1.0
shape: (32, 32, 3), min: 0.059474571078431314, max: 0.7036305147058846
shape: (32, 32, 3), min: 0.01506204044117481, max: 1.0

Now all of our images have the same size (and, as the output shows, skimage has also rescaled the pixel values to the [0, 1] range). The next task is to convert the labels and image features into NumPy arrays:

labels_array = np.array(labels)
images_array = np.array(images32)
print("labels: ", labels_array.shape, "\nimages: ", images_array.shape)

>>>
labels: (4575,)
images: (4575, 32, 32, 3)

Fantastic! The next task is creating our second CNN, but this time we will be using the TensorFlow contrib package, which is a high-level API that supports layering ops.

**Creating the CNN model**
--------------------------

We are going to construct a network that is involved but has a straightforward architecture. We use Xavier as the initializer for the network biases. The input layer is followed by a convolutional layer (convolutional layer 1), which is in turn followed by a BN layer (BN layer 1). Then there is a pooling layer with a stride of two and a kernel size of two. Another BN layer follows the second convolutional layer; next comes the second pooling layer, again with a stride of two and a kernel size of two. The max pooling layer is then followed by a flattening layer that flattens the input from (None, height, width, channels) to (None, height * width * channels), which works out to (None, 4096) with the layer parameters used below. Once the flattening is completed, the input is fed into fully connected layer 1, with a third BN applied as its normalizer function. We then have a dropout layer before feeding this lighter network into fully connected layer 2, which generates logits of size (None, 62). Too much of a mouthful? Don't worry; we will see it step by step, starting with the shape walkthrough below.
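Before writing the code, it helps to trace the tensor shapes through the network. The walkthrough below assumes the default SAME padding of tf.contrib.layers.conv2d and VALID max pooling, matching the code in the next section:

# input:   (None, 32, 32, 3)    the resized RGB images
# conv1:   (None, 32, 32, 128)  6x6 kernel, stride 1, SAME padding
# pool1:   (None, 16, 16, 128)  2x2 max pooling, stride 2
# conv2:   (None, 8, 8, 256)    4x4 kernel, stride 2, SAME padding
# pool2:   (None, 4, 4, 256)    2x2 max pooling, stride 2
# flatten: (None, 4096)         4 * 4 * 256 = 4096
# fc1:     (None, 512)
# logits:  (None, 62)           one per traffic sign class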
Let's start the coding by creating the computational graph and the placeholders for the features and labels:

graph = tf.Graph()

with graph.as_default():
    # Placeholders for inputs and labels.
    images_X = tf.placeholder(tf.float32, [None, 32, 32, 3])  # each image is 32x32x3
    labels_X = tf.placeholder(tf.int32, [None])

    # Initializer: Xavier
    biasInit = tf.contrib.layers.xavier_initializer(uniform=True, seed=None, dtype=tf.float32)

    # Convolutional layer 1: 128 feature maps, kernel size 6x6.
    conv1 = tf.contrib.layers.conv2d(images_X, num_outputs=128, kernel_size=[6, 6],
                                     biases_initializer=biasInit)

    # Batch normalization layer 1: can be applied as a normalizer
    # function for conv2d and fully_connected layers.
    bn1 = tf.contrib.layers.batch_norm(conv1, center=True, scale=True, is_training=True)

    # Max pooling (down-sampling) with a stride of 2 and a kernel size of 2.
    pool1 = tf.contrib.layers.max_pool2d(bn1, 2, 2)

    # Convolutional layer 2: 256 feature maps, kernel size 4x4, stride 2.
    conv2 = tf.contrib.layers.conv2d(pool1, num_outputs=256, kernel_size=[4, 4], stride=2,
                                     biases_initializer=biasInit)

    # Batch normalization layer 2.
    bn2 = tf.contrib.layers.batch_norm(conv2, center=True, scale=True, is_training=True)

    # Max pooling (down-sampling) with a stride of 2 and a kernel size of 2.
    pool2 = tf.contrib.layers.max_pool2d(bn2, 2, 2)

    # Flatten the input from [None, height, width, channels]
    # to [None, height * width * channels].
    images_flat = tf.contrib.layers.flatten(pool2)

    # Fully connected layer 1.
    fc1 = tf.contrib.layers.fully_connected(images_flat, 512, tf.nn.relu)

    # Batch normalization layer 3.
    bn3 = tf.contrib.layers.batch_norm(fc1, center=True, scale=True, is_training=True)

    # Apply dropout; if training is False, dropout is not applied
    # (here it is hard-coded to True).
    fc1 = tf.layers.dropout(bn3, rate=0.25, training=True)

    # Fully connected layer 2 that generates logits of size [None, 62],
    # where 62 is the number of classes to be predicted.
    logits = tf.contrib.layers.fully_connected(fc1, 62, tf.nn.relu)

Up to this point, we have managed to generate the logits of size (None, 62). Then we need to convert the logits to label indexes (int) with the shape (None), that is, a 1D vector of length batch_size:

predicted_labels = tf.argmax(logits, axis=1)

Then we define cross-entropy as the loss function, which is a good choice for classification:

loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels_X))

Now one of the most important parts is updating the BN ops and creating an optimizer (Adam in our case):

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    # Create an optimizer, which acts as the training op.
    train = tf.train.AdamOptimizer(learning_rate=0.10).minimize(loss_op)

Finally, we initialize all the ops:

init_op = tf.global_variables_initializer()
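Optionally, we can also add an in-graph accuracy op at this point. This is a sketch of our own (accuracy_op is not used elsewhere in this chapter), and it must be defined inside the same with graph.as_default(): block as the ops above:

# predicted_labels comes from tf.argmax and is int64, so cast before comparing.
correct = tf.equal(tf.cast(predicted_labels, tf.int32), labels_X)
accuracy_op = tf.reduce_mean(tf.cast(correct, tf.float32))

Later, session.run(accuracy_op, feed_dict={images_X: ..., labels_X: ...}) would return the fraction of correctly classified images directly.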
**Training and evaluating the network**
---------------------------------------

We start by creating a session to run the graph we created. Note that for faster training, we should use a GPU. The log_device_placement option merely makes TensorFlow log which device each op is placed on; if you do not have a GPU, you can simply set it to False:

session = tf.Session(graph=graph, config=tf.ConfigProto(log_device_placement=True))
session.run(init_op)

for i in range(300):
    # Note that we feed the full training set at every iteration here; for
    # larger datasets, mini-batches would be preferable (see the Memory tuning section).
    _, loss_value = session.run([train, loss_op],
                                feed_dict={images_X: images_array, labels_X: labels_array})
    if i % 10 == 0:
        print("Loss: ", loss_value)

>>>
Loss: 4.7910895
Loss: 4.3410876
Loss: 4.0275432
...
Loss: 0.523456

Once the training is completed, let's pick 10 random images and see the predictive power of our model:

random_indexes = random.sample(range(len(images32)), 10)
random_images = [images32[i] for i in random_indexes]
random_labels = [labels[i] for i in random_indexes]

Then let's run the predicted_labels op:

predicted = session.run([predicted_labels], feed_dict={images_X: random_images})[0]
print(random_labels)
print(predicted)

>>>
[38, 21, 19, 39, 22, 22, 45, 18, 22, 53]
[20 21 19 51 22 22 45 53 22 53]

So we can see that some images were correctly classified and some were classified wrongly. However, a visual inspection would be more helpful, so let's display the predictions alongside the ground truth:

fig = plt.figure(figsize=(5, 5))
for i in range(len(random_images)):
    truth = random_labels[i]
    prediction = predicted[i]
    plt.subplot(5, 2, 1 + i)
    plt.axis('off')
    color = 'green' if truth == prediction else 'red'
    plt.text(40, 10, "Truth: {0}\nPrediction: {1}".format(truth, prediction),
             fontsize=12, color=color)
    plt.imshow(random_images[i])

>>>

[Figure: sample predictions versus ground-truth labels]

Finally, we can evaluate our model using the test set. To see the predictive power, we compute the accuracy:

# Load the test dataset.
test_X, test_y = load_data(test_data_dir)

# Transform the images, just as we did with the training set.
test_images32 = [skimage.transform.resize(img, (32, 32), mode='constant')
                 for img in test_X]
display_images_and_labels(test_images32, test_y)

# Run predictions against the test set.
predicted = session.run([predicted_labels], feed_dict={images_X: test_images32})[0]

# Calculate how many matches we have and express it as a percentage.
match_count = sum([int(y == y_) for y, y_ in zip(test_y, predicted)])
accuracy = 100 * match_count / len(test_y)
print("Accuracy: {:.3f}".format(accuracy))

>>>
Accuracy: 87.583

Not that bad in terms of accuracy. In addition to this, we can also compute other performance metrics, such as precision, recall, and the F1 measure, and visualize the result in a confusion matrix showing the predicted versus actual label counts, as in the sketch below. Nevertheless, we can still improve the accuracy by tuning the network and its hyperparameters, but I leave this up to the reader.
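As a sketch of how these metrics could be computed, assuming the scikit-learn package is installed, the test_y and predicted arrays from above are all we need:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 measure.
print(classification_report(test_y, predicted))

# Confusion matrix: rows are actual labels, columns are predicted labels.
print(confusion_matrix(test_y, predicted))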
Finally, we are done, so let's close the TensorFlow session:

session.close()

**Summary**
-----------

In this chapter, we discussed how to use CNNs, which are a type of feed-forward artificial neural network in which the connectivity pattern between neurons is inspired by the organization of an animal's visual cortex. We saw how to cascade a set of layers to construct a CNN and how to perform different operations in each layer. Then we saw how to train a CNN. Later on, we discussed how to optimize the CNN hyperparameters and the optimization itself. Finally, we built another CNN, in which we utilized all of these optimization techniques.

Our CNN models did not achieve outstanding accuracy, since we trained both of the CNNs for only a few iterations and did not apply any grid-searching technique; that is, we did not hunt for the best combination of hyperparameters. Therefore, the takeaway is to apply more robust feature engineering to the raw images, train for more epochs with the best hyperparameters, and observe the performance.

In the next chapter, we will look at some deeper and more popular CNN architectures, such as AlexNet, VGG, GoogLeNet, and ResNet, trained on large datasets such as ImageNet. We will see how to utilize these trained models for transfer learning.