Essential Neural Networks PDF

Essential Neural Networks Ismail JAMIAI ([email protected]) November 20, 2024 Contents 1 Chapter 2: Feedforward Neural Networks 1 1.1 2.0 Introduction................................. 1...

Essential Neural Networks Ismail JAMIAI ([email protected]) November 20, 2024 Contents 1 Chapter 2: Feedforward Neural Networks 1 1.1 2.0 Introduction................................. 1 1.2 2.1 Understanding biological neural networks................ 2 1.3 2.2 Comparing the perceptron and the McCulloch-Pitts neuron...... 3 1.3.1 2.2.1 The MP neuron.......................... 4 1.3.2 2.2.2 Perceptron............................ 4 1.3.3 2.2.3 Pros and cons of the MP neuron and perceptron....... 6 1.4 2.3 MLPs.................................... 7 1.4.1 2.3.1 Layers............................... 10 1.4.2 2.3.2 Activation functions....................... 17 1.4.3 2.3.3 The loss function......................... 22 1.4.4 2.3.4 Backpropagation......................... 26 1.5 2.4 Training neural networks.......................... 27 1.5.1 2.4.1 Parameter initialization..................... 28 1.5.2 2.4.2 The data............................. 30 1.6 2.5 Deep neural networks........................... 33 1.7 2.6 Summary.................................. 34 LATEXCLASS : article LATEXCLASSOPTIONS : [letterpaper] OPTIONS: toc:nil STARTUP: inlineimages 1 Chapter 2: Feedforward Neural Networks 1.1 2.0 Introduction In the previous chapter, we covered linear neural networks, which have proven to be effective for problems such as regression and so are widely used in the industry. However, we also saw that they have their limitations and are unable to work effectively on higher- dimensional problems. In this chapter, we will take an in-depth look at the multilayer perceptron (MLP), a type of feedforward neural network (FNN). We will start by taking a look at how 1 biological neurons process information, then we will move onto mathematical models of biological neurons. The artificial neural networks (ANNs) we will study in this book are made up of mathematical models of biological neurons (we will learn more about this shortly). Once we have built a foundation, we will move on to understanding how MLPs—which are the FNNs—work and their involvement with deep learning. What FNNs allow us to do is approximate a function that maps input to output and this can be used in a variety of tasks, such as predicting the price of a house or a stock or determining whether or not an event will occur. The following topics are covered in this chapter: Understanding biological neural networks Comparing the perceptron and the McCulloch-Pitts neuron MLPs Training neural networks Deep neural networks 1.2 2.1 Understanding biological neural networks The human brain is capable of some remarkable feats—it performs very complex infor- mation processing. The neurons that make up our brains are very densely connected and perform in parallel with others. These biological neurons receive and pass signals to other neurons through the connections (synapses) between them. These synapses have strengths associated with them and increasing or weakening the strength of the connec- tions between neurons is what facilitates our learning and allows us to continuously learn and adapt to the dynamic environments we live in. As we know, the brain consists of neurons—in fact, according to recent studies, it is estimated that the human brain contains roughly 86 billion neurons. That is a lot of neurons and a whole lot more connections. A very large number of these neurons are used simultaneously every day to allow us to carry out a variety of tasks and be functional members of society. Neurons by themselves are said to be quite slow, but it is this large- scale parallel operation that gives our brains its extraordinary capability. The following is a diagram of a biological neuron: 2 As you can see from the preceding diagram, each neuron has three main compo- nents—the body, an axon, and many dendrites. The synapses connect the axon of one neuron to the dendrites of other neurons and determine the weight of the information that is received from other neurons. Only when the sum of the weighted inputs to the neuron exceeds a certain threshold does the neuron fire (activate); otherwise, it is at rest. This communication between neurons is done through electrochemical reactions, involving potassium, sodium, and chlorine (which we will not go into as it is beyond the scope of this book; however, if this interests you, there is a lot of literature you can find on it). The reason we are looking at biological neurons is that the neurons and neural networks we will be learning about and developing in this book are largely biologically inspired. If we are trying to develop artificial intelligence, where better to learn than from actual intelligence? Since the goal of this course is to teach you how to develop ANNs on computers, it is relatively important that we take a look at the differences between the computational power of our brains as opposed to computers. Computers have a significant advantage over our brains as they can perform roughly 10 billion operations per second, whereas the human brain can only perform around 800 operations per second. However, the brain requires roughly 10 watts to operate, which is 10 times less than what a computer requires. Another advantage that computers have is their precision; they can perform operations millions of times more accurately. Lastly, computers perform operations sequentially and cannot deal with data they have not been programmed to deal with, but the brain performs operations in parallel and is well equipped to deal with new data. 1.3 2.2 Comparing the perceptron and the McCulloch-Pitts neuron In this section, we will cover two mathematical models of biological neurons—the McCulloch- Pitts (MP) neuron and Rosenblatt’s perceptron—which create the foundation for neural networks. 3 1.3.1 2.2.1 The MP neuron The MP neuron was created in 1943 by Warren McCulloch and Walter Pitts. It was modeled after the biological neuron and is the first mathematical model of a biological neuron. It was created primarily for classification tasks. The MP neuron takes as input binary values and outputs a binary value based on a threshold value. If the sum of the inputs is greater than the threshold, then the neuron outputs 1 (if it is under the threshold, it outputs 0). In the following diagram, we can see what a basic neuron with three inputs and one output looks like: As you can see, this isn’t entirely dissimilar to the biological neuron we saw earlier. Mathematically, we can write this as follows: Pn 1 if y= i=1 xi ≥ b 0 otherwise Here, xi = 0 or 1. We can think of this as outputting Boolean answers; that is, true or false (or yes or no). While the MP neuron may look simple, it has the ability to model any logic function, such as OR, AND, and NOT; but it is unable to classify the XOR function. Addi- tionally, it does not have the ability to learn, so the threshold (b) needs to be adjusted analytically to fit our data. 1.3.2 2.2.2 Perceptron The perceptron model, created by Frank Rosenblatt in 1958, is an improved version of the MP neuron and can take any real value as input. Each input is then multiplied by a real- valued weight. If the sum of the weighted inputs is greater than the threshold, 4 then the output is 1, and if it is below the threshold, then the output is 0. The following diagram illustrates a basic perceptron model: - This model shares a lot of similarities with the MP neuron, but it is more similar to the biological neuron. Mathematically, we can write this as follows: Pn 1 if y= i=1 xi wi ≥ b 0 otherwise Here, xi ∈ R. Sometimes, we rewrite the perceptron equation in the following form: Pn 1 if y= i=0 xi wi ≥ 0 0 otherwise The following diagram shows how the perceptron equation will look like: 5 Here x0 = 1 and w0 = −b. This prevents us from having to hardcode the thresh- old, which makes the threshold a learnable parameter instead of something we have to manually adjust (as is the case with the MP neuron). 1.3.3 2.2.3 Pros and cons of the MP neuron and perceptron The advantage the perceptron model has over the MP neuron is that it is able to learn through error correction and it linearly separates the problem using a hyperplane, so anything that falls below the hyperplane is 0 and anything above it is 1. This error correction allows the perceptron to adjust the weights and move the position of the hyperplane so that it can properly classify the data. Earlier, we mentioned that the perceptron learns to linearly classify a problem—but what exactly does it learn? Does it learn the nature of the question that is asked? No. It learns the effect of the input on the output. So, the greater the weight associated with a certain input, the greater its impact on the prediction (classification). The update for the weights (learning) happens as follows: wnew = wold + δx Here, δ = expected value - predicted value. We could also add a learning rate 0 < η ≤ 1 if we want to speed up the learning; so, the update will be as follows: wnew = wold + ηδx 6 During these updates, the perceptron calculates the distance of the hyperplane from the points to be classified and adjusts itself to find the best position that it can perfectly linearly classify the two target classes. So, it maximally separates both points on either side, which we can see in the following plot: What is even more fascinating about this is that because of the aforementioned learning rule, the perceptron is guaranteed to converge when given a finite number of updates and so will work on any binary classification task. But alas, the perceptron is not perfect either and it also has limitations. As it is a linear classifier, it is unable to deal with nonlinear problems, which makes up the majority of the problems we usually wish to develop solutions for. 1.4 2.3 MLPs As mentioned, both the MP neuron and perceptron models are unable to deal with nonlinear problems. To combat this issue, modern-day perceptrons use an activation function that introduces nonlinearity to the output. The perceptrons (neurons, but we will mostly refer to them as nodes going forward) we will use are of the following form: ! X y=ϕ wi x i + b i Here, y is the output, ϕ is a nonlinear activation function, xi is the inputs to the unit, wi is the weights, and b is the bias. This improved version of the perceptron looks as follows: 7 In the preceding diagram, the activation function is generally the sigmoid function: 1 ϕ= Pn 1+ e− i=1 wi xi −b What the sigmoid activation function does is squash all the output values into the (0, 1) range. The sigmoid activation function is largely used for historical purposes since the developers of the earlier neurons focused on thresholding. When gradient-based learning was introduced, the sigmoid function turned out to be the best choice. An MLP is the simplest type of FNN. It is basically a lot of nodes combined together and the computation is carried out sequentially. The network looks as follows: 8 As you can see from the preceding diagram, the nodes are arranged in layers and the nodes in each layer are connected to each of the neurons in the next layer. However, there aren’t any connections between nodes in the same layer. We refer to networks such as this as being fully connected. The first layer is referred to as the input layer, the last layer is referred to as the output layer, and all the layers in between are called hidden layers. The number of nodes in the output layer depends on the type of problem we build our MLP for. It is important that you remember that the inputs to and outputs from layers are not the same as the inputs to and outputs from the network. You may also notice that in the preceding architecture, there is only one unit in the output layer. This is generally the case when we have a regression or binary classification task. So, if we want our network to be able to detect multiple classes, then our output layer will have K nodes, where K is the number of classes. However, what makes neural networks so powerfully effective, and the reason we are studying them, is that they are universal function approximators. The universal approx- imation theorem states that “a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact 9 subsets of Rn , under mild assumptions on the activation function.” What this means is that if the hidden layer contains a specific number of neurons, then our neural network can reasonably approximate any known function. By now, you might be thinking that if MLPs have been around since the late 1960s, why has it taken nearly 50 years for them to take off and be used as widely as they are today? This is because the computing power that was available 50 years ago was nowhere near as powerful as what is available today, nor was the same amount of data that is available now available back then. So, because of the lack of results that MLPs were able to achieve back then, they faded into obscurity. Because of this, as well as the universal approximation theorem, researchers at the time hadn’t looked deeper than into a couple of layers. Let’s break the model down and see how it works. 1.4.1 2.3.1 Layers We know now that MLPs (and so FNNs) are made of three different kinds of lay- ers—input, hidden, and output. We also know what a single neuron looks like. Let’s now mathematically explore MLPs and how they work. Nous savons maintenant que les MLP (et donc les FNN) sont constitués de trois types de couches différentes - d’entrée, cachées et de sortie. Nous savons également à quoi ressemble un neurone unique. Explorons maintenant mathématiquement les MLP et leur fonctionnement. Suppose we have an MLP with x ∈ Rd input (where d ∈ N), L layers, N neurons in each layer, an activation function ϕ : R → R, and the network output, y. The MLP looks as follows: Supposons que nous ayons un MLP avec x ∈ Rd entrée (où d ∈ N), L couches, N neurones dans chaque couche, une fonction d’activation ϕ : R → R, et la sortie du réseau, y. Le MLP se présente comme suit : 10 As you can see, this network has four inputs—the first hidden layer has five nodes, the second hidden layer has three nodes, the third hidden layer has five nodes, and there is one node for the output. Mathematically, we can write this as follows: Comme vous pouvez le voir, ce réseau a quatre entrées - la première couche cachée a cinq nœuds, la deuxième couche cachée a trois nœuds, la troisième couche cachée a cinq nœuds, et il y a un nœud pour la sortie. Mathématiquement, nous pouvons l’écrire comme suit :   X hi = ϕ  wi,j xj + bi  j   X hi = ϕ  wi,j hj + bi  j   X hi = ϕ  wi,j hj + bi  j   X yi = ϕ  wi,j hj + bi  j [l] Here, hi is the ith node in the lth layer, ϕ[l] is an activation function for the lth layer, xj [l] [l] is the j th input to the network, bi is the bias for the ith node in the lth layer, and wi,j 11 is the directed weight that connects the j th node in the l − 1st layer to the ith node in [l] the lth layer. Ici, hi est le nœud ith dans la couche lth , ϕ[l] est une fonction d’activation [l] pour la couche lth , xj est l’entrée j th dans le réseau, bi est le biais pour le nœud ith dans [l] la couche lth , et wi,j est le poids dirigé qui relie le nœud j th dans la couche l − 1st au nœud ith dans la couche lth. Before we move forward, let’s take a look at the preceding equations. From them, we can easily observe that each hidden node depends on the weights from the previous layer. If you take a pencil and draw out the network (or use your fingers to trace the connections), you will notice that the deeper we get into the network, the more complex the relationship nodes in the later hidden layers have with those in the earlier layers. Avant d’aller plus loin, jetons un coup d’œil aux équations précédentes. Nous pouvons facilement observer que chaque nœud caché dépend des poids de la couche précédente. Si vous prenez un crayon et dessinez le réseau (ou utilisez vos doigts pour tracer les connexions), vous remarquerez que plus nous nous enfonçons dans le réseau, plus la relation entre les nœuds des dernières couches cachées et ceux des couches précédentes est complexe. Now that you have an idea of how each neuron is computed in an MLP, you might have realized that explicitly writing out the computation on each node in each layer can be a daunting task. So, let’s rewrite the preceding equation in a cleaner and simpler manner. We generally do not express neural networks in terms of the computation that happens on each node. We instead express them in terms of layers and because each layer has multiple nodes, we can write the previous equations in terms of vectors and matrices. The previous equations can now be written as follows: Maintenant que vous avez une idée de la manière dont chaque neurone est calculé dans un MLP, vous avez peut-être réalisé qu’écrire explicitement le calcul sur chaque nœud de chaque couche peut être une tâche décourageante. Réécrivons donc l’équation précédente d’une manière plus propre et plus simple. En général, nous n’exprimons pas les réseaux neuronaux en termes de calcul effectué sur chaque nœud. Nous les exprimons plutôt en termes de couches et, comme chaque couche comporte plusieurs nœuds, nous pouvons écrire les équations précédentes en termes de vecteurs et de matrices. Les équations précédentes peuvent maintenant être écrites comme suit : h = ϕ w x + b h = ϕ W h + b h = ϕ W h + b y = ϕ W h + b This is a whole lot simpler to follow. 12 For the networks we want to build, the input more than likely will not be a vector, as it is in the preceding examples; it will be a matrix, so we can then rewrite it as follows: Pour les réseaux que nous voulons construire, l’entrée ne sera probablement pas un vecteur, comme dans les exemples précédents ; il s’agira d’une matrice, que nous pouvons donc réécrire comme suit : H = ϕ XW⊤ + 1b⊤ H = ϕ H W⊤ + 1b⊤ H = ϕ H W⊤ + 1b⊤ Y = ϕ H W⊤ + 1b⊤ Here, X is the matrix containing all the data we want to train our model on H[l] contains the hidden nodes at each layer for all the data samples, and everything else is the same as it was earlier. Ici, X est la matrice contenant toutes les données sur lesquelles nous voulons former notre modèle H[l] contient les nœuds cachés à chaque couche pour tous les échantillons de données, et tout le reste est identique à ce qu’il était précédemment. If you have been paying attention, you will have noticed that the order of the multi- plication taking place in the matrix is different than what took place earlier. Why do you think that is? (I’ll give you a hint—transpose.) Si vous avez été attentif, vous aurez remarqué que l’ordre de la multiplication dans la matrice est différent de ce qui s’est passé précédemment. Pourquoi pensez-vous que cela soit le cas ? (Je vous donne un indice : transposez.) You should now have a decent, high-level understanding of how neural networks are constructed. Let’s now lift up the hood and take a look at what is going on underneath. We know from the previous equations that neural networks are comprised of a series of matrix multiplications and matrix additions and scalar multiplications. Since we are now dealing with vectors and matrices, their dimensions are important because if they don’t line up properly, we can’t multiply and add them. Vous devriez maintenant avoir une bonne compréhension de haut niveau de la manière dont les réseaux neuronaux sont construits. Soulevons maintenant le capot et regardons ce qui se passe en dessous. Les équations précédentes nous ont appris que les réseaux neuronaux sont constitués d’une série de multiplications et d’additions matricielles et de multiplications scalaires. Étant donné que nous avons maintenant affaire à des vecteurs et à des matrices, leurs dimensions sont importantes car si elles ne sont pas alignées correctement, nous ne pouvons pas les multiplier ni les additionner. 13 Let’s view the preceding MLP in its full matrix form. (To keep things simple, we will go through it layer by layer and we will use the second form since our input is in vector form.) Voyons le MLP précédent sous sa forme matricielle complète. (Pour simplifier les choses, nous allons le parcourir couche par couche et nous utiliserons la deuxième forme puisque notre entrée est sous forme de vecteur). To simplify the view and to properly understand what is happening, we will now denote z = W x + b and h = ϕ z. Pour simplifier la vue et bien comprendre ce qui se passe, nous noterons désormais z = W x + b et h = ϕ (z ). Calculate z as follows: Calculer z comme suit :       z w1,1 w1,2 w1,3 w1,4   b  1    1  w2,2 w2,3 w2,4  x1  z2  w2,1 b2  x2          z  = w w3,2 w3,3 w3,4  x3  + b3     3   3,1       z4  w4,1 w4,2 w4,3 w4,4  x4 b4  | {z } z5 w5,1 w5,2 w5,3 w5,4 4×1 b5 | {z } | {z } R | {z } R5×1 R5×4 R5×1 Calculate h as follows:       h1 z1 ϕ (z1 )       h2  z2  ϕ (z2 )   h  = ϕ z  = ϕ (z )      3   3   3        h4  z4  ϕ (z4 ) h5 z5 ϕ (z5 ) | {z } | {z } | {z } R5×1 R5×1 R5×1 Calculate z as follows:      h1   w1,5 h    z1 w1,1 w1,2 w1,3 w1,4   b1  2        z2  = w2,1 w2,2 w2,3 w2,4 w2,5  h3  + b2    z3 w3,1 w3,2 w3,3 w3,4 w3,5 h4  b3 | {z } | } | {z } h5 {z R3×1 R3×5 R3×1 | {z } R5×1 Calculate h as follows:     h1 z1     h2  = ϕ z2  h3 z3 | {z } | {z } R3×1 R3×1 14 Calculate z as follows:       z w1,1 w1,2 w1,3 b  1       1  z2  w2,1 w2,2 w2,3  h1 b2           z  = w w3,2 w3,3  h2  +  b3    3   3,1     z4  w4,1 w4,2 w4,3  h3 b4    | {z } z5 w w5,2 w5,3 R3×1 b5 | {z } | 5,1 {z } | {z } R5×1 R5×3 R5×1 Calculate h as follows:     h1 z    1  h2  z2    h  = ϕ z     3   3      h4  z4  h5 z5 | {z } | {z } R5×1 R5×1 Calculate z as follows:   h  1  ih2  h i h i h    z 1 = w1 w2 w3 w4 w5 h3  + b1 }   | {z } h4  R1×1 | {z } | {z R1×1 R1×5 h5 | {z } R5×1 Calculate y as follows: h i y = ϕ z1 |{z} R1×1 | {z } R1×1 There we have it. Those are all the operations that take place in our MLP. Nous y voilà. Voilà toutes les opérations qui ont lieu dans notre MLP. Now, if you think back to Linear Algebra, where we did matrix multiplication, we learned that when a matrix or vector is multiplied by another matrix with differing dimensions, then the resulting matrix or vector is of a different shape (except, of course, 15 when we multiply by the identity matrix). We call this mapping because our matrix maps points in one space to points in another space. Keeping this in mind, let’s take a look again at the operations that were carried out in our MLP. From this, we can deduce that our neural network maps our input vector from one Euclidean space to our output vector in another Euclidean space. Si vous repensez à l’algèbre linéaire, où nous avons étudié la multiplication des matrices, nous avons appris que lorsqu’une matrice ou un vecteur est multiplié par une autre matrice de dimensions différentes, la matrice ou le vecteur résultant a une forme différente (sauf, bien sûr, lorsque nous multiplions par la matrice d’identité). Nous parlons de cartographie parce que notre matrice fait correspondre des points d’un espace à des points d’un autre espace. En gardant cela à l’esprit, regardons à nouveau les opérations qui ont été effectuées dans notre MLP. Nous pouvons en déduire que notre réseau neuronal fait correspondre notre vecteur d’entrée d’un espace euclidien à notre vecteur de sortie dans un autre espace euclidien. Using this observation, we can generalize and write the following: En utilisant cette observation, nous pouvons généraliser et écrire ce qui suit : N : Rn1 → RnL Here, N is our MLP, Rn1 is the number of nodes in the dimension of the input layer, RnL is the number of nodes in the output layer, and L is the total number of layers. Ici, N est notre MLP, Rn1 est le nombre de nœuds dans la dimension de la couche d’entrée, RnL est le nombre de nœuds dans la couche de sortie, et L est le nombre total de couches. However, there are a number of matrix multiplications that take place in the preceding network and each has different dimensions, which tells us that a sequence of mappings takes place (from one layer to the next). Cependant, un certain nombre de multiplications de matrices ont lieu dans le réseau précédent et chacune a des dimensions différentes, ce qui nous indique qu’une séquence de mappings a lieu (d’une couche à l’autre). We can then write the mappings individually, as follows: Nous pouvons alors écrire les correspondances individuellement, comme suit : f1 : Rn1 → Rn2 , f2 : Rn2 → Rn3 , · · · , fL−1 : RnL−1 → RnL Here, each f value maps the lth layer to the l + 1st layer. To make sure we have covered all of our bases, W[l] ∈ Rnl ×nl−1 and b[l] ∈ Rnl. Now, we can summarize our MLP in the following equation: Nous pouvons maintenant résumer notre MLP dans l’équation suivante : N (x) = ϕ (W (ϕ (W (ϕ (W (ϕ (W (x) + b )) + b )) + b )) + b ) | {z } h | {z } h | {z } h | {z } y 16 With that done, we can now move on to the next subsection where we will understand activation functions. Ceci étant fait, nous pouvons maintenant passer à la sous-section suivante où nous allons comprendre les fonctions d’activation. 1.4.2 2.3.2 Activation functions We have mentioned activation functions a few times so far and we introduced one of them as well—the sigmoid activation function. However, this isn’t the only activation function that we use in neural networks. In fact, it is an active area of research, and today, there are many different types of activation functions. They can be classified into two types—linear and non-linear. We will focus on the latter because they are differentiable and this property is very important for us when we train neural networks. Nous avons mentionné les fonctions d’activation à plusieurs reprises jusqu’à présent et nous avons présenté l’une d’entre elles, la fonction d’activation sigmoïde. Cependant, ce n’est pas la seule fonction d’activation utilisée dans les réseaux neuronaux. En fait, il s’agit d’un domaine de recherche actif et il existe aujourd’hui de nombreux types de fonctions d’activation. Elles peuvent être classées en deux catégories : les fonctions linéaires et les fonctions non linéaires. Nous nous concentrerons sur ces dernières car elles sont différentiables et cette propriété est très importante pour nous lorsque nous formons des réseaux neuronaux. 1. Sigmoid To start, we will take a look at sigmoid since we’ve already encountered it. The sigmoid function is written as follows: Pour commencer, nous allons nous intéresser à la fonction sigmoïde puisque nous l’avons déjà rencontrée. La fonction sigmoïde s’écrit comme suit: 1 f (x) = 1 + e−x The function looks as follows: 17 The sigmoid activation function takes the sum of the weighted inputs and bias as input and compresses the value into the (0, 1) range. La fonction d’activation sig- moïde prend la somme des entrées pondérées et du biais comme entrée et comprime la valeur dans la plage (0, 1). Its derivative is as follows: d e−x f (x) = = f (x)(1 − f (x)) dx (1 + e−x )2 The derivative will look as follows: 18 This activation function is usually used in the output layer for predicting a probability- based output. We avoid using it in the hidden layers of deep neural networks be- cause it leads to what is known as the vanishing gradient problem. When the value of x is either greater than 2 or less than −2, then the output of the sigmoid function is very close to 1 or 0, respectively. This hinders the network’s ability to learn or slows it down drastically. Cette fonction d’activation est généralement utilisée dans la couche de sortie pour prédire une sortie basée sur la probabilité. Nous évitons de l’utiliser dans les couches cachées des réseaux neuronaux profonds, car elle entraîne ce que l’on appelle le problème du gradient de fuite. Lorsque la valeur de x est supérieure à 2 ou inférieure à −2, la sortie de la fonction sigmoïde est très proche de 1 ou de 0, respectivement. Cela entrave la capacité d’apprentissage du réseau ou le ralentit considérablement. 2. Hyperbolic tangent Another activation function used instead of the sigmoid is the hyperbolic tangent (tanh). It is written as follows: ex − e−x f (x) = ex + e−x The function looks as follows: The tanh function squashes all the output values into the (−1, 1) range. Its deriva- tive is as follows: d f (x) = 1 − f (x)2 dx The derivative looks as follows: 19 From the preceding graph you can tell that the tanh function is zero-centered, which allows us to model values that are very positive, very negative, or neutral. Le graphique précédent montre que la fonction tanh est centrée sur le zéro, ce qui nous permet de modéliser des valeurs très positives, très négatives ou neutres. 3. Rectified linear unit Rectified linear unit (ReLU) is one of the most widely used activation functions because it is more computationally efficient than the activation functions we have already seen; therefore, it allows the network to train a lot faster and so converge more quickly. Rectified linear unit (ReLU) est l’une des fonctions d’activation les plus utilisées car elle est plus efficace sur le plan du calcul que les fonctions d’activation que nous avons déjà vues ; elle permet donc au réseau de s’entraîner beaucoup plus rapidement et donc de converger plus vite. The ReLU function is as follows: 0 if x < 0 f (x) = max(0, x) = x if x ≥ 0 The function looks as follows: 20 As you can see, all the negative values for x are clipped off and turn into 0. It may surprise you to know that even though this looks like a linear function, it has a derivative that is as follows: Comme vous pouvez le constater, toutes les valeurs négatives de x sont supprimées et deviennent 0. Vous serez peut-être surpris d’apprendre que même si cette fonction semble linéaire, elle possède une dérivée qui est la suivante : d 1 if x ≥ 0 f= dx 0 otherwise The derivative looks as follows: This, too, faces some problems in training—particularly, the dying ReLU problem. This occurs when the input values are negative and this hinders learning because we cannot differentiate 0. Cette méthode est également confrontée à certains problèmes 21 lors de la simulation, en particulier le problème de la ReLU mourante. Ce problème survient lorsque les valeurs d’entrée sont négatives, ce qui entrave l’apprentissage car nous ne pouvons pas différencier 0. 4. Exponential linear unit Exponential linear unit (ELU) is another variation of the leaky ReLU activation function, where instead of having a straight line for all cases of x < 0, it is a log curve. L’unité linéaire exponentielle (ELU) est une autre variante de la fonction d’activation ReLU fuyante, où au lieu d’avoir une ligne droite pour tous les cas de x < 0, il s’agit d’une courbe logarithmique. The ELU activation function is as follows: α(ex − 1) if x < 0 f (x) = x if x ≥ 0 The function looks as follows: The derivative of this activation function is as follows: d f (x) + α if x < 0 f= dx 1 otherwise 1.4.3 2.3.3 The loss function The loss function is a very critical part of neural networks and their training. They give us a means of calculating the error of a network after a forward pass has been computed. 22 This error compares the neural network output with the target output that was specified in the training data. La fonction de perte est un élément très important des réseaux neuronaux et de leur simulation. Elle permet de calculer l’erreur d’un réseau après le calcul d’une passe avant. Cette erreur compare la sortie du réseau neuronal à la sortie cible spécifiée dans les données de simulation. There are two errors in particular that are of concern to us—the local error and the global error. The local error is the difference between the output expected of a neuron and its actual output. The global error, however, is the total error (the sum of all the local errors) and it tells us how well our network is performing on the training data. Deux erreurs en particulier nous intéressent : l’erreur locale et l’erreur globale. L’erreur locale est la différence entre la sortie attendue d’un neurone et sa sortie réelle. L’erreur globale, quant à elle, est l’erreur totale (la somme de toutes les erreurs locales) et nous renseigne sur les performances de notre réseau sur les données de simulation. There are a number of methods that we use in practice and each has its own use cases, advantages, and disadvantages. Conventionally, the loss function is referred to as the cost function and is denoted as J(θ) (or, equivalently, J(W, b)). Il existe un certain nombre de méthodes que nous utilisons dans la pratique et chacune a ses propres cas d’utilisation, ses avantages et ses inconvénients. Par convention, la fonction de perte est appelée fonction de coût et est notée J(θ) (ou, de manière équivalente, J(W, b)). 1. Mean absolute error Mean absolute error (MAE) is the same as the L1 loss we saw in Probability and Statistics, and it looks as follows: L’erreur absolue moyenne (MAE) est la même que la perte L1 que nous avons vue dans Probabilité et statistiques, et elle se présente comme suit : PN |ŷi − yi | MAE = i=1 N Here, N is the number of samples in our training dataset. What we are doing here is calculating the absolute distance between the prediction and the true value and averaging over the sum of the errors. Il s’agit ici de calculer la distance absolue entre la prédiction et la valeur réelle et de calculer la moyenne de la somme des erreurs. 2. Mean squared error Mean squared error (MSE) is one of the most commonly used loss functions, especially for regression tasks (it takes in a vector and outputs a scalar). It calcu- lates the square of the difference between the output and the expected output. It looks as follows: L’erreur quadratique moyenne (EQM) est l’une des fonctions de perte les plus couramment utilisées, en particulier pour les tâches de régression (elle prend un vecteur et produit un scalaire). Elle calcule le carré de la différence entre la sortie et la sortie attendue. Elle se présente comme suit : 1 X MSE = ∥ŷi − yi ∥22 N i 23 Here, N is the number of samples in our training dataset. In the preceding equation, we calculate the square of the L2 norm. Intuitively, we should be able to tell that when ŷ = y, the error is 0, and the larger the distance between the points, the larger the error. The reason we use this is that it always outputs a positive value and by squaring the distance between the output and expected output, it allows us to differentiate between small and large errors with greater ease and correct them. Dans l’équation précédente, nous calculons le carré de la norme L2. Intuitivement, nous devrions pouvoir dire que lorsque ŷ = y, l’erreur est 0, et plus la distance entre les points est grande, plus l’erreur est importante. La raison pour laquelle nous utilisons cette méthode est qu’elle produit toujours une valeur positive et qu’en élevant au carré la distance entre la sortie et la sortie attendue, elle nous permet de différencier plus facilement les petites et les grandes erreurs et de les corriger. 3. Root mean squared error Root mean squared error (RMSE) is simply the square root of the preceding MSE function and it looks as follows: s PN i=1 ∥ŷi − yi ∥22 RMSE = N The reason we use this is that it scales back the MSE function to the scale it was originally at before we squared the errors, which gives us a better idea of the error with respect to the target(s). La raison pour laquelle nous utilisons cette fonction est qu’elle ramène la fonction MSE à l’échelle qu’elle avait à l’origine avant que nous n’élevions les erreurs au carré, ce qui nous donne une meilleure idée de l’erreur par rapport à la (aux) cible(s). 4. The Huber loss The Huber loss looks as follows: 1 − ŷ)2 2 (y when |y − ŷ| ≤ ϵ Huber loss = ϵ2 ϵ |y − ŷ| − 2 otherwise Here, ϵ is a constant term that we can configure. The smaller it is, the more insensitive the loss is to large errors and outliers, and the larger it is, the more sensitive the loss is to large errors and outliers. Ici, ϵ est un terme constant que nous pouvons configurer. Plus il est petit, plus la perte est insensible aux erreurs importantes et aux valeurs aberrantes, et plus il est grand, plus la perte est sensible aux erreurs importantes et aux valeurs aberrantes. Now, if you look closely, you should notice that when is very small, the Huber loss is similar to MAE, but when it is very large, it is similar to MSE. En y regardant de plus près, on remarque que lorsque est très petit, la perte de Huber est similaire à l’EAM, mais que lorsqu’il est très grand, elle est similaire à l’EQM. 24 5. Cross entropy Cross entropy loss is used mostly when we have a binary classification problem; that is, where the network outputs either 1 or 0. La perte d’entropie croisée est surtout utilisée dans les problèmes de classification binaire, c’est-à-dire lorsque le réseau produit soit Suppose we are given a training dataset, D = {(xi , yi ), · · · , (xN , yN )} and yi ∈ {0, 1}. We can then write this in the following form: ŷi = f (xi ; θ) Here, θ is the parameters of the network (weights and biases). We can express this in terms of a Bernoulli distribution, as follows: Ici, θ représente les paramètres du réseau (poids et biais). Nous pouvons exprimer cela en termes de distribution de Bernoulli, comme suit : P (xi → yi | θ) = ŷiyi (1 − ŷi )1−yi The probability, given the entire dataset, is then as follows: N N ŷiyi (1 − ŷi )1−yi Y Y P (x1 , · · · , xN , y1 , · · · , yN ) = P (xi → yi | θ) = i=1 i=1 If we take its negative-log likelihood, we get the following: N ŷiyi (1 − ŷi )1−yi Y − log P (x1 , · · · , xN , y1 , · · · , yN ) = − log i=1 So, we have the following: N X L(ŷ, y) = − yi log ŷi + (1 − yi ) log(1 − ŷi ) i=1 Cross entropy is also used when we have more than two classes. This is known as multiclass cross entropy. Suppose we have K output units, then, we would calculate the loss for each class and then sum them together, as follows: K X − yi,k log ŷi,k k=1 Here, ŷi,k is the probability that observation (i) belongs to class k. 25 1.4.4 2.3.4 Backpropagation Now that we know how the forward passes are computed in MLPs, as well as how to best initialize them and calculate the loss of the network, it is time for us to learn about backpropagation—a method that allows us to calculate the gradient of the network using the information from the loss function. This is where our knowledge of multivariable calculus and partial derivatives comes in handy. Maintenant que nous savons comment les passes avant sont calculées dans les MLP, ainsi que la meilleure façon de les initialiser et de calculer la perte du réseau, il est temps de nous familiariser avec la rétropropagation, une méthode qui nous permet de calculer le gradient du réseau à l’aide des informations de la fonction de perte. C’est ici que nos connaissances en calcul à plusieurs variables et en dérivées partielles sont utiles. If you recall, this network is fully connected, which means all the nodes in each layer are connected to—and so have an impact on—the next layer. It is for this reason that in backpropagation we take the derivative of the loss with respect to the weights of the layer closest to the output, then the one before that, and so on, until we reach the first layer. If you don’t yet understand this, don’t worry. We will go through backpropagation in detail and use the network from earlier as an example. We will assume that the activation function is sigmoid and our loss function is cross entropy. We will first calculate the derivative of the loss (J ) with respect to W , which looks as follows: Si vous vous souvenez, ce réseau est entièrement connecté, ce qui signifie que tous les nœuds de chaque couche sont connectés à la couche suivante et ont donc un impact sur elle. C’est pour cette raison que, dans la rétropropagation, nous prenons la dérivée de la perte par rapport aux poids de la couche la plus proche de la sortie, puis de la couche précédente, et ainsi de suite jusqu’à la première couche. Si vous ne comprenez pas encore cela, ne vous inquiétez pas. Nous allons étudier la rétropropagation en détail et utiliser le réseau de tout à l’heure comme exemple. Nous supposerons que la fonction d’activation est sigmoïde et que notre fonction de perte est l’entropie croisée. Nous commencerons par calculer la dérivée de la perte (J ) par rapport à W , qui se présente comme suit : ∂J ∂ (4) =− [y log ŷ + (1 − y) log(1 − ŷ)] ∂W ∂W (4) ∂J ∂ h i ∂ h i = −y log ϕ (W h + b ) − (1 − y) log 1 − ϕ (W h + b ) ∂W ∂W ∂W ∂J 1 h ′ i 1 h ′ = −y ϕ (W h +b )h −(1−y) −ϕ (W h +b ) ∂W ϕ (W h + b ) 1 − ϕ (W h + b ) ∂J = −y(1 − σ(W h + b ))h + (1 − y)σ(W h + b )h ∂W ∂J = −y(1 − ŷ)h + (1 − y)ŷh ∂W 26 ∂J = (ŷ − y)h ∂W With that, we have finished computing the first derivative. As you can see, it takes quite a bit of work, and calculating the derivative for each layer can be a very time- consuming process. So, instead, we can make use of the chain rule from calculus. For simplicity, let’s say z [l] = W [l] h[l−1] + b[l] and h[l] = ϕ[l] (z [l] ) and assume that b[l] = 0. Now, if we want to calculate the gradient of the loss with respect to W , we get the following: ∂J ∂J ∂ ŷ ∂z ∂h ∂z ∂h ∂z = ∂W ∂ ŷ ∂z ∂h ∂z ∂h ∂z ∂W We can rewrite this as follows: ∂J ′ = (ŷ − y)W ϕ (z )W ϕ (z )h ∂W Suppose we do want to find the partial of the loss with respect to b ; this looks as follows: ∂J ∂J ∂ ŷ ∂z = ∂b ∂ ŷ ∂z ∂b Before we move on to the next section, pay close attention to the preceding derivative, ∂J ∂W.If you look back to earlier on in the Layers section, W [j] , h[j] , z [j] , b[j] were all vectors and matrices. This is still true. Because we are again dealing with vectors and matrices, it is important that their dimensions line up. ∂J We know that ∈ R3×5 , but what about the others? I will leave this to you as ∂W |2| an exercise to determine whether or not the other is correct and if it is not, how would you change the order to ensure it is? If you’re feeling very confident in your math abilities and are up for a challenge, I ∂J encourage you to try finding the derivative,. ∂w3,4 1.5 2.4 Training neural networks Now that we have an understanding of backpropagation and how gradients are computed, you might be wondering what purpose it serves and what it has to do with training our MLP. If you will recall from Vector Calculus, when we covered partial derivatives, we learned that we can use partial derivatives to check the impact that changing one param- eter can have on the output of a function. When we use the first and second derivatives to plot our graphs, we can analytically tell what the local and global minima and max- ima are. However, it isn’t as straightforward as that in our case as our model doesn’t know where the optima is or how to get there; so, instead, we use backpropagation with 27 the gradient descent as a guide to help us get to the (hopefully global) minima. Main- tenant que nous avons compris ce qu’est la rétropropagation et comment les gradients sont calculés, vous vous demandez peut-être à quoi elle sert et ce qu’elle a à voir avec la simulation de notre MLP. Si vous vous souvenez du calcul vectoriel, lorsque nous avons abordé les dérivées partielles, nous avons appris que nous pouvons utiliser les dérivées partielles pour vérifier l’impact que la modification d’un paramètre peut avoir sur la sor- tie d’une fonction. Lorsque nous utilisons les dérivées première et seconde pour tracer nos graphiques, nous pouvons dire analytiquement quels sont les minima et maxima lo- caux et globaux. Cependant, ce n’est pas aussi simple que cela dans notre cas, car notre modèle ne sait pas où se trouve l’optimum ni comment l’atteindre. Nous utilisons donc la rétropropagation avec la descente de gradient comme guide pour nous aider à atteindre les minima (que l’on espère globaux). In Optimization, we learned about gradient descent and how we iteratively move from one point on the function to a lower point on the function that is in the direction of the local/global minima by taking a step in the direction of the negative of the gradient. We expressed it in the following form: xk+1 = xk − ck ∇f (x) However, for neural networks, the update rule for the weights, in this case, is written as follows: ∂J θ[l] = θ[l] − α ∂θ[l] Here, θ = (W, b). As you can see, while this does look similar, it isn’t the optimization we have learned. Our goal here is to minimize the total loss of the network and update our weights ac- cordingly. 1.5.1 2.4.1 Parameter initialization In Optimization, we mentioned that before we start optimizing, we need an initial (start- ing) point, which is the purpose of initialization. This is an extremely important part of training neural networks because as mentioned earlier on in this chapter, neural net- works have a lot of parameters—often, well over tens of millions—which means that finding the point in the weight space that minimizes our loss can be very time consum- ing and challenging (because the weight space is non-convex; that is, there are lots of local minima and saddle points). Dans l’optimisation, nous avons mentionné qu’avant de commencer l’optimisation, nous avons besoin d’un point initial (de départ), ce qui constitue l’objectif de l’initialisation. Il s’agit d’une partie extrêmement importante de la simulation des réseaux neuronaux car, comme nous l’avons déjà mentionné dans ce chapitre, les réseaux neuronaux ont beaucoup de paramètres - souvent plus de dizaines de millions - ce qui signifie que la recherche du point de l’espace de poids qui minimise notre perte peut être très longue et difficile (parce que l’espace de poids n’est pas convexe, c’est-à-dire qu’il y a beaucoup de minima locaux et de points de selle). 28 For this reason, finding a good initial point is important because it makes it easier to get to the optima and reduce the training time, as well as reducing the chances of our weights either vanishing or exploding. Let’s now explore the various ways that we can initialize our weights and biases. C’est pourquoi il est important de trouver un bon point de départ, car cela permet d’atteindre plus facilement les optima et de réduire le temps de simulation, ainsi que de réduire les risques de disparition ou d’explosion de nos poids. Explorons maintenant les différentes façons d’initialiser nos poids et nos biais. 1. All zeros As the name suggests, here we set the initial weights and biases of our model to be zeros. I don’t recommend doing this because, as you may have guessed, this means that all the neurons in our model are dead. In fact, this is the very problem we want to avoid when training our network. Comme son nom l’indique, nous fixons ici les poids et les biais initiaux de notre modèle à zéro. Je ne recommande pas de faire cela car, comme vous l’avez peut-être deviné, cela signifie que tous les neurones de notre modèle sont morts. En fait, c’est précisément le problème que nous voulons éviter lors de la simulation de notre réseau. Let’s see what happens anyway. For the sake of simplicity, let’s suppose we have the following linear classifier: n X ŷ = w · x + b = wi x i + b i=1 If the weights are initialized as 0, then our output is always 0, which means we lost all the information that was part of our training data and the network that we put so much effort into building learns nothing. 2. Random initialization One way of initializing our weights to be non-zero is to use random initialization and for this, we could use one of two distributions—the normal distribution or the uniform distribution. To initialize our parameters using the normal distribution, we have to specify the mean and the standard deviation. Usually, we choose a mean of 0 and a standard deviation of 1. To initialize using the uniform distribution, we usually use the [−1, 1] range (where there is an equal probability of any value in the range being picked). While this gives us weights that we can use in training, it is very slow and has previously resulted in vanishing and exploding gradients in deep networks, resulting in mediocre performance. 3. Xavier initialization As we have seen, if our weights are too small, then they vanish, which results in dead neurons and, conversely, if our weights are too big, we get exploding gradients. 29 We want to avoid both scenarios, which means we need the weights to be initialized just right so that our network can learn what it needs to. To tackle this problem, Xavier Glorot and Yoshua Bengio created a normalized initialization method (generally referred to as Xavier initialization). It is as follows: " √ √ # [k] 6 6 Wi,j ∼U − , nk + nk−1 nk + nk−1 Here, nk is the number of neurons in layer k. But why does this work better than randomly initializing our network? The idea is that we want to maintain the variance as we propagate through subsequent layers. Mais pourquoi cette méthode est-elle plus efficace que l’initialisation aléatoire de notre réseau ? L’idée est que nous voulons maintenir la variance au fur et à mesure que nous nous propageons dans les couches suivantes. 1.5.2 2.4.2 The data As you will know by now, what we are trying to build here are networks that can learn to map an input to an output. For our network to be able to do this, it needs to be fed data—and lots of it. Therefore, it is important for us to know what the data should look like. Let’s suppose we have a classification or regression task. Our data will then take the following form: D = {(xi , yi ), · · · , (xN , yN )}N i=1 Here, we assume the following: xi , yi ∼ p(x, y) As you can see, each sample in the dataset has the input (xi ) and a corresponding output/target (yi ). However, depending on the task, our output will look a bit different. In regression, our output can take on any real value, whereas in classification, it must be one of the classes we can predict. Comme vous pouvez le constater, chaque échantillon de l’ensemble de données possède l’entrée (xi ) et une sortie/cible correspondante (yi ). Toutefois, en fonction de la tâche, notre résultat sera légèrement différent. En régression, notre sortie peut prendre n’importe quelle valeur réelle, alors qu’en classification, elle doit correspondre à l’une des classes que nous pouvons prédire. Our data (x), as you may expect, contains all the various information we want to use to predict our target variables (y) and this, of course, depends on the problem. As an example, let’s take the Boston Housing dataset, which is a regression task. It contains the following features: Nos données (x), comme vous pouvez vous y attendre, contiennent toutes les informations diverses que nous voulons utiliser pour prédire nos variables cibles (y) et cela, bien sûr, dépend du problème. Prenons par exemple l’ensemble de données 30 sur le logement à Boston, qui est une tâche de régression. Il contient les caractéristiques suivantes : The per-capita crime rate by town The proportion of residential land zones for lots over 25,000 square feet The proportion of non-retail business acres per town The average number of rooms per dwelling The proportion of owner-occupied units built before 1940 The weighted distances to five Boston employment centers The index of accessibility to radial highways The full-value property tax rate per 10, 000 The pupil-to-teacher ratio by town The percentage of the population that is of a lower status Le taux de criminalité par habitant par ville La proportion de zones résidentielles pour les lots de plus de 25 000 pieds carrés La proportion d’hectares consacrés à des activités autres que le commerce de détail par ville Le nombre moyen de pièces par logement La proportion de logements occupés par leur propriétaire et construits avant 1940 Les distances pondérées par rapport à cinq centres d’emploi de Boston L’indice d’accessibilité aux autoroutes radiales Le taux de l’impôt foncier par 10 000 habitants Le nombre d’élèves par enseignant par ville La proportion d’Afro-Américains par ville Le pourcentage de la population ayant un statut inférieur 31 The target variable is the median value of owner-occupied homes in $1, 000. La variable cible est la valeur médiane des logements occupés par leur propriétaire à 1 000 dollars. All the data is numerical (since the machines don’t really read or know what those la- bels mean, but they do know how to parse numbers). Toutes les données sont numériques (puisque les machines ne lisent pas ou ne savent pas vraiment ce que signifient ces éti- quettes, mais elles savent comment analyser les nombres). Now, let’s look at a classification problem—since we are trying to predict which class our data belongs to, the target will become a vector instead of a scalar (as it is in the preceding dataset), where the dimension of the target vector will be the number of categories. But how do we represent this target vector? Examinons maintenant un problème de classification - puisque nous essayons de prédire à quelle classe appartiennent nos données, la cible deviendra un vecteur au lieu d’un scalaire (comme dans l’ensemble de données précédent), où la dimension du vecteur cible sera le nombre de catégories. Mais comment représenter ce vecteur cible ? Suppose we have a dataset of images with the corresponding target labels: classes = {0 : cat, 1 : dog, 2 : horse, 3 : turtle} As you can see, each label has a digit assigned to it and during training, our network could mistake these for trainable parameters, which we obviously would want to avoid. Instead, we can one-hot encode this, thereby turning the label vector into the following: Comme vous pouvez le voir, chaque étiquette est associée à un chiffre et, au cours de la simulation, notre réseau pourrait confondre ces chiffres avec des paramètres entraînables, ce que nous voulons évidemment éviter. Au lieu de cela, nous pouvons coder cela en une seule fois, transformant ainsi le vecteur d’étiquettes en ce qui suit : T cat : 1 0 0 0 T dog : 0 1 0 0 T horse : 0 0 1 0 T turtle : 0 0 0 1 Great! Now we know what is in a dataset and how datasets are structured. But what now? We split the dataset into training, testing, and validation sets. How we split the data into the three respective sets depends largely on how much data we have. In the case of deep learning, we will, more often than not, be dealing with very large datasets; that is, millions to tens of millions of samples. C’est très bien ! Nous savons maintenant ce que contient un ensemble de données et comment les ensembles de données sont structurés. Mais que faire maintenant ? Nous divisons l’ensemble de données en ensembles de simulation, de test et de validation. La façon dont nous répartissons les données dans les trois ensembles respectifs dépend largement de la quantité de données dont nous disposons. Dans le cas de l’apprentissage profond, nous aurons le plus souvent affaire à de très grands ensembles de données, c’est-à-dire des millions ou des dizaines de millions d’échantillons. As a rule of thumb, we generally select 80 − 90% of the dataset to train our network, and the remaining 10 − 20% is split into two portions—the validation and test sets. The 32 validation set is used during training to determine whether our network has overfit or underfit to the data and the test set is used at the end to check how well our model generalizes to unseen data. En règle générale, nous sélectionnons 80 − 90 de l’ensemble de données pour la simulation de notre réseau, et les 10 − 20 restants sont divisés en deux parties - les ensembles de validation et de test. L’ensemble de validation est utilisé pendant la simulation pour déterminer si notre réseau s’est suradapté ou sous-adapté aux données et l’ensemble de test est utilisé à la fin pour vérifier la capacité de notre modèle à s’adapter à des données inédites. 1.6 2.5 Deep neural networks Now, it’s time to get into the really fun stuff (and what you picked up this book for)—deep neural networks. The depth comes from the number of layers in the neural network and for an FNN to be considered deep, it must have more than 10 hidden layers. A number of today’s state-of-the-art FNNs have well over 40 layers. Let’s now explore some of the properties of deep FNNs and get an understanding of why they are so powerful. If you recall, earlier on we came across the universal approximation theorem, which stated that an MLP with a single hidden layer could approximate any function. But if that is the case, why do we need deep neural networks? Simply put, the capacity of a neural network increases with each hidden layer (and the brain has a deep structure). What this means is that deeper networks have far greater expressiveness than shallower networks. This is something we came across earlier when learning about MLPs. We saw that by adding hidden layers, we were able to create a network that was able to learn to solve a problem that a linear neural network was not able to. Additionally, deeper networks are preferred over wider networks, not because they improve the overall performance, but because networks with more hidden layers (but less width) have much fewer parameters than wider networks with fewer hidden layers. Let’s suppose we have two networks—one that is wide and one that is deep. Both net- works have 20 inputs and 6 output nodes. Let’s calculate the total number of parameters for both layers; that is, the number of connections between all the layers and biases. Our wide neural network has two hidden layers, each with 1, 024 neurons. The total number of parameters is as follows: (20 × 1024) + (2014 × 1024) + (1024 × 8) + (1024 + 1024 + 8) = 1, 079, 304 Our deep neural network has 12 hidden layers, each with 150 neurons. The total number of parameters is as follows: (20 × 200) + (200 × 200) × 11 + (200 × 8) + (200 × 12 + 8) = 484, 008 As you can see, the deeper network has less than half the parameters that the wider network does. 33 1.7 2.6 Summary In this chapter, we first learned about a simple FNN, known as the MLP, and broke it down into its individual components to get a deeper understanding of how they work and are constructed. We then extended these concepts to further our understanding of deep neural networks. You should now have intimate knowledge of how FNNs work and understand how various models are constructed, as well as understand how to build and possibly improve them for yourself. Let’s now move on to the next chapter, where we will learn how to improve our neural networks so that they generalize better on unseen data. 34

Essential Neural Networks PDF

Document Details

Tags

Related

Summary

Full Transcript