Deep Learning: Basics and Convolutional Neural Networks (CNN) PDF

Document Details


2023

Maria Vakalopoulou, Stergios Christodoulidis, Ninon Burgos, Olivier Colliot, Vincent Lepetit

Tags

deep learning machine learning convolutional neural networks medical applications

Summary

This document provides a detailed overview of deep learning and convolutional neural networks (CNNs), with a specific focus on their mathematical formulations and applications, especially in medical settings. It discusses fundamental concepts such as perceptrons, backpropagation, and autoencoders.

Full Transcript


Deep learning: basics and convolutional neural networks (CNN) Maria Vakalopoulou, Stergios Christodoulidis, Ninon Burgos, Olivier Colliot, Vincent Lepetit To cite this version: Maria Vakalopoulou, Stergios Christodoulidis, Ninon Burgos, Olivier Colliot, Vincent Lepetit. Deep learning: basics and convolutional neural networks (CNN). Olivier Colliot. Machine Learning for Brain Disorders, Springer, 2023, 10.1007/978-1-0716-3195-9_3. hal-03957224v2 HAL Id: hal-03957224 https://hal.science/hal-03957224v2 Submitted on 3 Oct 2023 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution 4.0 International License Chapter 3 Deep Learning: Basics and Convolutional Neural Networks (CNNs) Maria Vakalopoulou, Stergios Christodoulidis, Ninon Burgos, Olivier Colliot, and Vincent Lepetit Abstract Deep learning belongs to the broader family of machine learning methods and currently provides state-of- the-art performance in a variety of fields, including medical applications. Deep learning architectures can be categorized into different groups depending on their components. However, most of them share similar modules and mathematical formulations. In this chapter, the basic concepts of deep learning will be presented to provide a better understanding of these powerful and broadly used algorithms. The analysis is structured around the main components of deep learning architectures, focusing on convolutional neural networks and autoencoders. Key words Perceptrons, Backpropagation, Convolutional neural networks, Deep learning, Medical imaging 1 Introduction Recently, deep learning frameworks have become very popular, attracting a lot of attention from the research community. These frameworks provide machine learning schemes without the need for feature engineering, while at the same time they remain quite flexible. Initially developed for supervised tasks, they are nowadays extended to many other settings. Deep learning, in the strict sense, involves the use of multiple layers of artificial neurons. The first artificial neural networks were developed in the late 1950s with the presentation of the perceptron algorithms. However, limita- tions related to the computational costs of these algorithms during that period, as well as the often-miscited claim of Minsky and Papert that perceptrons are not capable of learning non-linear functions such as the XOR, caused a significant decline of interest for further research on these algorithms and contributed to the so-called artificial intelligence winter. In particular, in their book , Minsky and Papert discussed that single-layer perceptrons are Olivier Colliot (ed.), Machine Learning for Brain Disorders, Neuromethods, vol. 197, https://doi.org/10.1007/978-1-0716-3195-9_3, © The Author(s) 2023 77 78 Maria Vakalopoulou et al. only capable of learning linearly separable patterns. It was often incorrectly believed that they also presumed this is the case for multilayer perceptron networks. 
It took more than 10 years for research on neural networks to recover, and in , some of these issues were clarified and further discussed. Even if during this period there was not a lot of research interest for perceptrons, very important algorithms such as the backpropagation algorithm [4–7] and recurrent neural networks were introduced. After this period, and in the early 2000s, publications by Hin- ton, Osindero, and Teh indicated efficient ways to train multi- layer perceptrons layer by layer, treating each layer as an unsupervised restricted Boltzmann machine and then using super- vised backpropagation for the fine-tuning. Such advances in the optimization algorithms and in hardware, in particular graphics processing units (GPUs), increased the computational speed of deep learning systems and made their training easier and faster. Moreover, around 2010, the first large-scale datasets, with Ima- geNet being one of the most popular, were made available, contributing to the success of deep learning algorithms, allowing the experimental demonstration of their superior performance on several tasks in comparison with other commonly used machine learning algorithms. Finally, another very important factor that contributed to the current popularity of deep learning techniques is their support by publicly available and easy-to-use libraries such as Theano , Caffe , TensorFlow , Keras , and PyTorch. Indeed, currently, due to all these publicly available libraries that facilitate collaborative and reproducible research and access to resources from large corporations such as Kaggle, Google Colab, and Amazon Web Services, teaching and research about these algorithms have become much easier. This chapter will focus on the presentation and discussion of the main components of deep learning algorithms, giving the reader a better understanding of these powerful models. The chap- ter is meant to be readable by someone with no background in deep learning. The basic notions of machine learning will not be included here; however, the reader should refer to Chap. 2 (reader without a background in engineering or computer science can also refer to Chap. 1 for a lay audience-oriented presentation of these concepts). The rest of this chapter is organized as follows. We will first present the deep feedforward networks focusing on percep- trons, multilayer perceptrons, and the main functions that they are composed of (Subheading 2). Then, we will focus on the optimiza- tion of deep neural networks, and in particular, we will formally present the topics of gradient descent, backpropagation, as well as the notions of generalization and overfitting (Subheading 3). Sub- heading 4 will focus on convolutional neural networks discussing in detail the basic convolution operations, while Subheading 5 will give an overview of the autoencoder architectures. Deep Learning: Basics and CNN 79 2 Deep Feedforward Networks In this section, we will present the early deep learning approaches together with the main functions that are commonly used in deep feedforward networks. Deep feedforward networks are a set of parametric, non-linear, and hierarchical representation models that are optimized with stochastic gradient descent. 
In this defini- tion, the term parametric holds due to the parameters that we need to learn during the training of these models, the non-linearity due to the non-linear functions that they are composed of, and the hierarchical representation due to the fact that the output of one function is used as the input of the next in a hierarchical way. 2.1 Perceptrons The perceptron was originally developed for supervised binary classification problems, and it was inspired by works from neuros- cientists such as Donald Hebb. It was built around a non-linear neuron, namely, the McCulloch-Pitts model of a neu- ron. More formally, we are looking for a function f(x;w, b) such that f ð:; w, bÞ : x∈p → fþ1, - 1g where w and b are the parameters of f and the vector x = [x1,..., xp]⊤ is the input. The training set is {(x(i), y(i))}. In particular, the perceptron relies on a linear model for performing the classification: þ1 if w ⊤ x þ b ≥ 0 f ðx; w, bÞ = : ð1Þ -1 otherwise Such a model can be interpreted geometrically as a hyperplane that can appropriately divide data points that are linearly separable. Moreover, one can observe that, in the previous definition, a per- ceptron is a combination of a weighted summation between the elements of the input vector x combined with a step function that performs the decision for the classification. Without loss of gener- ality, this step function can be replaced by other activation functions such as the sigmoid, hyperbolic tangent, or softmax functions (see Subheading 2.3); the output simply needs to be thresholded to assign the + 1 or - 1 class. Graphically, a perceptron is presented in Fig. 1 on which each of the elements of the input is described as a neuron and all the elements are combined by weighting with the models’ parameters and then passed to an activation function for the final decision. During the training process and similarly to the other machine learning algorithms, we need to find the optimal parameters w and b for the perceptron model. One of the main innovations of Rosen- blatt was the proposition of the learning algorithm using an itera- tive process. First, the weights are initialized randomly, and then using one sample (x(i), y(i)) of the training set, the prediction of the 80 Maria Vakalopoulou et al. 1 b w1 x1 6 ^y w2 x2 wp xp Fig. 1 A simple perceptron model. The input elements are described as neurons and combined for the final prediction y^. The final prediction is composed of a weighted sum and an activation function perceptron is calculated. If the prediction is correct, no further action is needed, and the next data point is processed. If the prediction is wrong, the weights are updated with the following rule: the weights are increased in case the prediction is smaller than the ground-truth label y(i) and decreased if the predic- tion is higher than the ground-truth label. This process is repeated until no further errors are made for the data points. A pseudocode of the training or convergence algorithm is presented in Algorithm 1 (note that in this version, it is assumed that the data is linearly separable). Algorithm 1 Train perceptron procedure Train({(x(i) , y (i) )}) Initialization: initialize randomly the weights w and bias b while ∃i ∈ {1,... , n}, f (x(i) ; w, b) = y (i) do Pick i randomly error = y (i) − f (x(i) ; w, b) if error = 0 then w ← w + error · x(i) b b + error Originally, the perceptron has been proposed for binary classi- fication tasks. 
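To make Algorithm 1 concrete, the following is a minimal NumPy sketch of the perceptron training rule; the function name, the random ordering of the samples within each epoch, and the stopping criterion (one full error-free pass over the data) are illustrative choices, and the data is assumed to be linearly separable, as in the text.

import numpy as np

def train_perceptron(X, y, max_epochs=1000, seed=0):
    # X: (n, p) array of inputs; y: labels in {+1, -1}; assumes linearly separable data
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X = np.hstack([np.ones((n, 1)), X])       # prepend a constant 1 so the bias b is absorbed into w
    w = rng.normal(scale=0.01, size=p + 1)    # random initialization of the weights
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(n):          # pick training samples in random order
            pred = 1 if X[i] @ w >= 0 else -1
            error = y[i] - pred               # 0 if correct, +2 or -2 if wrong
            if error != 0:
                w = w + error * X[i]          # update rule of Algorithm 1
                mistakes += 1
        if mistakes == 0:                     # every training point is now classified correctly
            return w
    return w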
However, this algorithm can be generalized for the case of multiclass classification, fc(x;w, b), where c ∈{1,..., C} are the different classes. This can be easily achieved by adding more neurons to the output layer of the perceptron. That way, the number of output neurons would be the same as the number of possible outputs we need to predict for the specific problem. Then, the final decision can be made by choosing the maximum of the different output neurons f n = max f c ðx; w, bÞ. we ,will Finally, in the following, c∈f1..., Cg integrate the bias b in the weights w (and thus add 1 as the first element of the input vector x = [1, x1,..., xp]⊤). The model can then be rewritten as f(x;w) such that f ð:; wÞ : x∈pþ1 → fþ1, - 1g. Deep Learning: Basics and CNN 81 2.2 Multilayer The limitation of perceptrons to linear problems can be overcome Perceptrons by using multilayer perceptions, often denoted as MLP. An MLP consists of at least three layers of neurons: the input layer, a hidden layer, and an output layer. Except for the input neurons, each neuron uses a non-linear activation function, making it capable of distinguishing data that is not linearly separable. These layers can also be called fully connected layers since they connect all the neurons of the previous and of the current layer. It is absolutely crucial to keep in mind that non-linear functions are necessary for the network to find non-linear separations in the data (otherwise, all the layers could simply be collapsed together into a single gigantic linear function). 2.2.1 A Simple Multilayer Without loss of generality, an MLP with one hidden layer can be Network defined as: zðxÞ = gðW 1 xÞ , ð2Þ y^ = f ðx; W 1 , W 2 Þ = W 2 zðxÞ where gðxÞ :  →  denotes the non-linear function (which can be applied element-wise to a vector), W1 the matrix of coefficients of the first layer, and W2 the matrix of coefficients of the second layer. Equivalently, one can write: d1 yc = W 2ðc,j Þ gðW 1⊤ ðj Þ xÞ, ð3Þ j =1 where d1 is the number of neurons for the hidden layer which defines the width of the network, W 1ðj Þ denotes the first column of the matrix W1, and W 2ðc,j Þ denotes the c, j element of the matrix W2. Graphically, a two-layer perceptron is presented in Fig. 2 on z1 x1 z2 ^y1 x2 W1 W2 z3 ^y2 xp zd 1 Fig. 2 An example of a simple multilayer perceptron model. The input layer is fed into a hidden layer (z), which is then combined for the last output layer providing the final prediction 82 Maria Vakalopoulou et al. which the input neurons are fed into a hidden layer whose neurons are combined for the final prediction. There were a lot of research works indicating the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In the late 1980s, the first proof was published for sigmoid activation functions (see Subheading 2.3 for the definition) and was generalized to other functions for feedforward multilayer architectures [19–21]. In par- ticular, these works prove that any continuous function can be approximated under mild conditions as closely as wanted by a three-layer network. As N → 1, any continuous function f can be approximated by some neural network f^, because each compo- nent gðW Tðj Þ xÞ behaves like a basis function and functions in a suitable space admit a basis expansion. However, since N may need to be very large, introducing some limitations for these types of networks, deeper networks, with more than one hidden layer, can provide good alternatives. 
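As a companion to Eq. 2, the forward pass of a one-hidden-layer MLP takes only a few lines of NumPy; the dimensions below and the choice of tanh as the non-linearity g are purely illustrative.

import numpy as np

def mlp_forward(x, W1, W2, g=np.tanh):
    z = g(W1 @ x)     # hidden representation of width d1, Eq. (2)
    return W2 @ z     # linear output layer; a sigmoid/softmax can be applied on top for classification

# hypothetical dimensions: p = 4 inputs, d1 = 8 hidden neurons, C = 3 outputs
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((3, 8))
y_hat = mlp_forward(rng.standard_normal(4), W1, W2)    # vector of 3 output scores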
2.2.2 Deep Neural The simple MLP networks can be generalized to deeper networks Network with more than one hidden layer that progressively generate higher-level features from the raw input. Such networks can be written as: z 1 ðxÞ = gðW 1 xÞ... z k ðxÞ = gðW k z k - 1 ðxÞÞ , ð4Þ... y^ = f ðx; W 1 ,..., W K Þ = z K ðz K - 1 ð...ðz 1 ðxÞÞÞÞ where K denotes the number of layers for the neural network, which defines the depth of the network. In Fig. 3, a graphical representation of the deep multilayer perceptron is presented. Once again, the input layer is fed into the different hidden layers of the network in a hierarchical way such that the output of one layer is the input of the next one. The last layer of the network corresponds to the output layer, which makes the final prediction of the model. As for networks with one hidden layer, they are also universal approximators. However, the approximation theory for deep net- works is less understood compared with neural networks with one hidden layer. Overall, deep neural networks excel at representing the composition of functions. So far, we have described neural networks as simple chains of layers, applied in a hierarchical way, with the main considerations being the depth of the network (the number of layers K) and the Deep Learning: Basics and CNN 83 z k ,1 x1 z k ,2 ^y1 x2 ╳ ╳ ╳ ╳ z k ,3 ^y2 xp z k ,d k Fig. 3 An example of a deep neural network. The input layer, the kth layer of the deep neural network, and the output layer are presented in the figure Fig. 4 Comparison of two different networks with almost the same number of parameters, but different depths. Figure inspired by Goodfellow et al. width of each k layer (the number of neurons dk). Overall, there are no rules for the choice of the K and dk parameters that define the architecture of the MLP. However, it has been shown empirically that deeper models perform better. In Fig. 4, an overview of 2 different networks with 3 and 11 hidden layers is presented with respect to the number of parameters and their accuracy. For each architecture, the number of parameters varies by changing the number of neurons dk. One can observe that, empirically, deeper networks achieve better performance using approximately the same or a lower number of parameters. Additional evidence to support these empirical findings is a very active field of research [22, 23]. Neural networks can come in a variety of models and architec- tures. The choice of the proper architecture and type of neural network depends on the type of application and the type of data. 84 Maria Vakalopoulou et al. Most of the time, the best architecture is defined empirically. In the next section, we will discuss the main functions used in neural networks. 2.3 Main Functions A neural network is a composition of different functions also called modules. Most of the times, these functions are applied in a sequen- tial way. However, in more complicated designs (e.g., deep residual networks), different ways of combining them can be designed. In the following subsections, we will discuss the most commonly used functions that are the backbones of most perceptrons and multi- layer perceptron architectures. One should note, however, that a variety of functions can be proposed and used for different deep learning architectures with the constraint to be differentiable – almost – everywhere. This is mainly due to the way that deep neural networks are trained, and this will be discussed later in the chapter. 
2.3.1 Linear Functions One of the most fundamental functions used in deep neural net- works is the simple linear function. Linear functions produce a linear combination of all the nodes of one layer of the network, weighted with the parameters W. The output signal of the linear function is Wx, which is a polynomial of degree one. While it is easy to solve linear equations, they have less power to learn complex functional mappings from data. Moreover, when the number of samples is much larger than the dimension of the input space, the probability that the data is linearly separable comes close to zero (Box 1). This is why they need to be combined with non-linear functions, also called activation functions (the name activation has been initially inspired by biology as the neuron will be active or not depending on the output of the function). Box 1: Function Counting Theorem The so-called Function Counting Theorem (Cover ) counts the number of linearly separable dichotomies of n points in general position in p. The theorem shows that, out of the total 2n dichotomies, only Cðn, pÞ = p n-1 2 j =0 are homogeneously, linearly separable. j When n >> p, the probability of a dichotomy to be line- arly separable converges to zero. This indicates the need for the integration of non-linear functions into our modeling and architecture design. Note that n >> p is a typical regime in machine learning and deep learning applications where the number of samples is very large. Deep Learning: Basics and CNN 85 (a) (b) (c) Tanh Sigmoid ReLU Fig. 5 Overview of different non-linear functions (in green) and their first-order derivative (blue). (a) Hyperbolic tangent function (tanh), (b) sigmoid, and (c) rectified linear unit (ReLU) 2.3.2 Non-linear One of the most important components of deep neural networks is Functions the non-linear functions, also called activation functions. They convert the linear input signal of a node into non-linear outputs to facilitate the learning of high-order polynomials. There are a lot of different non-linear functions in the literature. In this subsec- tion, we will discuss the most classical non-linearities. Hyperbolic Tangent One of the most standard non-linear functions is the hyperbolic Function (tanh) tangent function, aka the tanh function. Tanh is symmetric around the origin with a range of values varying from - 1 to 1. The biggest advantage of the tanh function is that it produces a zero-centered output (Fig. 5a), thereby supporting the backpropagation process that we will cover in the next section. The tanh function is used extensively for the training of multilayer neural networks. Formally, the tanh function, together with its gradient, is defined as: ex - e - x g = tanh ðxÞ = ex þ e - x : ð5Þ ∂g = 1 - tanh 2 ðxÞ ∂x One of the downsides of tanh is the saturation of gradients that occurs for large or small inputs. This can slow down the training of the networks. Sigmoid Similar to tanh, the sigmoid is one of the first non-linear functions that were used to compose deep learning architectures. One of the main advantages is that it has a range of values varying from 0 to 1 (Fig. 5b) and therefore is especially used for models that aim to predict a probability as an output. Formally, the sigmoid function, together with its gradient, is defined as: 1 g = σðxÞ = 1 þ e -x : ð6Þ ∂g = σðxÞð1 - σðxÞÞ ∂x 86 Maria Vakalopoulou et al. Note that this is in fact the logistic function, which is a special case of the more general class of sigmoid function. As it is indicated in Fig. 
5b, the sigmoid gradient vanishes for large or small inputs making the training process difficult. However, in case it is used for the output units which are not latent variables and on which we have access to the ground-truth labels, sigmoid may be a good option. Rectified Linear Unit (ReLU) ReLU is considered among the default choice of non-linearity. Some of the main advantages of ReLU include its efficient calcula- tion and better gradient propagation with fewer vanishing gradient problems compared to the previous two activation functions. Formally, the ReLU function, together with its gradient, is defined as: g = max ð0, xÞ ∂g 0, if x ≤ 0 : ð7Þ = ∂x 1, if x > 0 As it is indicated in Fig. 5c, ReLU is differentiable anywhere else than zero. However, this is not a very important problem as the value of the derivative at zero can be arbitrarily chosen to be 0 or 1. In , the authors empirically demonstrated that the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a four-layer convolutional network was six times faster with ReLU than with tanh neurons. On the other hand, and as discussed in , ReLU-type neural networks which yield a piece- wise linear classifier function produce almost always high confi- dence predictions far away from the training data. However, due to its efficiency and popularity, many variations of ReLU have been proposed in the literature, such as the leaky ReLU or the parametric ReLU. These two variations both address the problem of dying neurons, where some ReLU neurons die for all inputs and remain inactive no matter what input is supplied. In such a case, no gradient flows from these neurons, and the training of the neural network architecture is affected. Leaky ReLU and parametric ReLU change the g(x) = 0 part, by adding a slope and extending the range of ReLU. Swish The choice of the activation function in neural networks is not always easy and can greatly affect performance. In , the authors performed a combination of exhaustive and reinforcement learning-based searches to discover novel activation functions. Their experiments discovered a new activation function that is called Swish and is defined as: Deep Learning: Basics and CNN 87 g = x  σðβxÞ ∂g , ð8Þ = βgðxÞ þ σðβxÞð1 - βgðxÞÞ ∂x where σ is the sigmoid function and β is either a constant or a trainable parameter. Swish tends to work better than ReLU on deeper models, as it has been shown experimentally in in different domains. Softmax Softmax is often used as the last activation function of a neural network. In practice, it normalizes the output of a network to a probability distribution over the predicted output classes. Softmax is defined as: e xi Softmaxðx i Þ = C x : ð9Þ j ej The softmax function takes as input a vector x of C real num- bers and normalizes it into a probability distribution consisting of C probabilities proportional to the exponentials of the input num- bers. However, a limitation of softmax is that it assumes that every input x belongs to at least one of the C classes (which is not the case in practice, i.e., the network could be applied to an input that does not belong to any of the classes). 2.3.3 Loss Functions Besides the activation functions, the loss function (which defines the cost function) is one of the main elements of neural networks. It is the function that represents the error for a given prediction. 
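For instance, with a softmax output (Eq. 9), a common way to represent this error for classification is the negative log-probability assigned to the correct class, which is exactly the cross-entropy presented below. The following NumPy sketch computes it for a single sample; subtracting the maximum score and adding a small epsilon inside the logarithm are standard numerical-stability tricks rather than part of the formulas.

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))          # Eq. (9); subtracting the max avoids overflow
    return e / e.sum()

def cross_entropy_loss(scores, true_class):
    probs = softmax(scores)                      # probabilities over the C classes
    return -np.log(probs[true_class] + 1e-12)    # negative log-probability of the correct class

# example with C = 3 classes: raw network scores, the true class is the third one
loss = cross_entropy_loss(np.array([1.0, -0.5, 2.0]), true_class=2)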
To that purpose, for a given training sample, it compares the prediction f(x(i);W) to the ground truth y(i) (here we denote for simplicity as W all the parameters of the network, combining all the W1,..., WK in the multilayer perceptron shown above). The loss is denoted as ℓ(y, f(x;W)). The average loss across the n training samples is called the cost function and is defined as: n 1 J ðW Þ = ℓ y ðiÞ , f ðx ðiÞ ; W Þ , ð10Þ n i=1 where {(x(i), y(i))}i=1..n composes the training set. The aim of the training will be to find the parameters W such that J(W) is mini- mized. Note that, in deep learning, one often calls the cost function the loss function, although, strictly speaking, the loss is for a given sample, and the cost is averaged across samples. Besides, the objec- tive function is the overall function to minimize, including the cost and possible regularization terms. However, in the remainder of this chapter, in accordance with common usage in deep learning, we will sometimes use the term loss function instead of cost function. 88 Maria Vakalopoulou et al. In neural networks, the loss function can be virtually any func- tion that is differentiable. Below we present the two most common losses, which are, respectively, used for classification or regression problems. However, specific losses exist for other tasks, such as segmentation, which are covered in the corresponding chapters. Cross-Entropy Loss One of the most basic loss functions for classification problems corresponds to the cross-entropy between the expected values and the predicted ones. It leads to the following cost function: n J ðW Þ = - log P y = y ðiÞ jx = x ðiÞ ; W , ð11Þ i=1 where P y= y ðiÞ jx = x ðiÞ ; W is the probability that a given sample is correctly classified. The cross-entropy can also be seen here as the negative log-likelihood of the training set given the predictions of the net- work. In other words, minimizing this loss function corresponds to maximizing the likelihood: n J ðW Þ = ∏ P y = y ðiÞ jx = x ðiÞ ; W : ð12Þ i=1 Mean Squared Error Loss For regression problems, the mean squared error is one of the most basic cost functions, measuring the average of the squares of the errors, which is the average squared difference between the pre- dicted values and the real ones. The mean squared error is defined as: n J ðW Þ = jj y ðiÞ - f ðx ðiÞ ; W Þ jj 2 : ð13Þ i=1 3 Optimization of Deep Neural Networks Optimization is one of the most important components of neural networks, and it focuses on finding the parameters W that minimize the loss function J(W). Overall, optimization is a difficult task. Traditionally, the optimization process is performed by care- fully designing the loss function and integrating its constraints to ensure that the optimization process is convex (and thus, one can be sure to find the global minimum). However, neural networks are non-convex models, making their optimization challenging, and, in general, one does not find the global minimum but only a local one. In the next sections, the main components of their optimization will be presented, giving a general overview of the optimization process, its challenges, and common practices. Deep Learning: Basics and CNN 89 Fig. 6 The gradient descent algorithm. This first-order optimization algorithm is finding a local minimum by taking steps toward the opposite direction of the gradient 3.1 Gradient Descent Gradient descent is an iterative optimization algorithm that is among the most popular and basic algorithms in machine learning. 
It is a first-order1 optimization algorithm, which is finding a local minimum of a differentiable function. The main idea of gradient descent is to take iterative steps toward the opposite direction of the gradient of the function that needs to be optimized (Fig. 6). That way, the parameters W of the model are updated by: ∂J ðW t Þ W tþ1 ← W t - η , ð14Þ ∂W t where t is the iteration and η, called learning rate, is the hyperpara- meter that indicates the magnitude of the step that the algorithm will take. Besides its simplicity, gradient descent is one of the most com- monly used algorithms. More sophisticated algorithms require computing the Hessian (or an approximation) and/or its inverse (or an approximation). Even if these variations could give better optimization guarantees, they are often more computationally expensive, making gradient descent the default method for optimization. In the case of convex functions, the optimization problem can be reduced to the problem of finding a local minimum. Any local minimum is then guaranteed to be a global minimum, and gradient descent can identify it. However, when dealing with non-convex functions, such as neural networks, it is possible to have many local minima making the use of gradient descent challenging. Neural networks are, in general, non-identifiable. A model is said to be identifiable if it is theoretically possible, given a sufficiently large training set, to rule out all but one set of the model’s parameters. Models with latent variables, such as the hidden layers of neural networks, are often not identifiable because we can obtain equiva- lent models by exchanging latent variables with each other. 1 First-order means here that the first-order derivatives of the cost function are used as opposed to second-order algorithms that, for instance, use the Hessian. 90 Maria Vakalopoulou et al. However, all these minima are often almost equivalent to each other in cost function value. In that case, these local minima are not a problematic form of non-convexity. It remains an open ques- tion whether there exist many local minima with a high cost that prevent adequate training of neural networks. However, it is cur- rently believed that most local minima, at least as found by modern optimization procedures, will correspond to a low cost (even though not to identical costs). For W to be a local minimum, we need mainly two conditions to be fulfilled: ∂J ∂W ðW  Þ = 0. 2 ∂ J All the eigenvalues of ∂W 2 ðW  Þ to be positive. For random functions in n dimensions, the probability for the eigenvalues to be all positive is n1. On the other hand, the ratio of the number of saddle points to local minima increases exponentially with n. A saddle point, or critical point, is a point where the deriva- tives are zero without being a minimum of the function. Such points could result in a high error making the optimization with gradient descent challenging. In , this issue is discussed, and an optimi- zation algorithm that leverages second-order curvature information is proposed to deal with this issue for deep and recurrent networks. 3.1.1 Stochastic Gradient Gradient descent efficiency is not enough when it comes to Descent machine learning problems with large numbers of training samples. Indeed, this is the case for neural networks and deep learning which often rely on hundreds or thousands of training samples. 
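Before turning to the stochastic variant, note that the update of Eq. 14 translates almost literally into code. In the sketch below, grad_J is assumed to be a user-supplied function returning the gradient of the cost with respect to the current parameters, and the fixed number of iterations is an illustrative stopping rule.

import numpy as np

def gradient_descent(W0, grad_J, lr=0.1, n_iters=1000):
    W = W0.copy()
    for _ in range(n_iters):
        W = W - lr * grad_J(W)     # step in the direction opposite to the gradient, Eq. (14)
    return W

# toy usage on the convex function J(W) = ||W||^2, whose gradient is 2W; the iterates converge to 0
W_star = gradient_descent(np.array([3.0, -2.0]), grad_J=lambda W: 2 * W)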
Updating the parameters W after calculating the gradient using all the training samples would lead to a tremendous computational com- plexity of the underlying optimization algorithm. To deal with this problem, the stochastic gradient descent (SGD) algorithm is a drastic simplification. Instead of computing the ∂J∂W ðW Þ exactly, each iteration estimates this gradient on the basis of a small set of randomly picked examples, as follows: W tþ1 ← W t - ηt GðW t Þ, ð15Þ where K 1 ∂J ðik Þ W t GðW Þ =t , ð16Þ K ∂W k=1 where J ik is the loss function at training sample ik, ði k Þ ði k Þ fðx , y Þgk = 1...K is the small subset of K training samples (K δ > 0. Consider fitting the data using a feedforward neural network with ReLU activa- tions. Denote by D (resp. W ) the depth (resp. width) of the network. Suppose that the neural network is suffi- ciently overparametrized, i.e.: 1 W ≫ polynomial n, D, : ð17Þ δ Then, with high probability, running SGD with some random initialization and properly chosen step sizes ηt yields J(Wt) < E in t / log 1ε. 3.2 Backpropagation The training of neural networks is performed with backpropaga- tion. Backpropagation computes the gradient of the loss function with respect to the parameters of the network in an efficient and local way. This algorithm was originally introduced in 1970. How- ever, it started becoming very popular after the publication of , which indicated that backpropagation works faster than other methods that had been proposed back then for the training of neural networks. 92 Maria Vakalopoulou et al. Fig. 7 A multilayer perceptron with one hidden layer The backpropagation algorithm works by computing the gra- dient of the loss function (J) with respect to each weight by the chain rule, computing the gradient one layer at a time, and iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule. In Fig. 7, an example of a multilayer perceptron with one hidden layer is presented. In such a network, the backpropagation is calculated as: ∂J ðW Þ ∂J ðW Þ ∂^y = × ∂w 2 ∂^y ∂w 2 : ð18Þ ∂J ðW Þ ∂J ðW Þ ∂^y ∂J ðW Þ ∂^y ∂z 1 = × = × × ∂w 1 ∂^y ∂w 1 ∂^y ∂z 1 ∂w 1 Overall, backpropagation is very simple and local. However, the reason why we can train a highly non-convex machine with many local minima, like neural networks, with a strong local learning algorithm is not really known even today. In practice, backpropagation can be computed in different ways, including manual calculation, numerical differentiation using finite difference approximation, and symbolic differentiation. Nowadays, deep learning frameworks such as [14, 16] use automatic differentiation for the application of backpropagation. 3.3 Generalization Similar to all the machine learning algorithms (discussed in and Overfitting Chapter 2), neural networks can suffer from poor generaliza- tion and overfitting. These problems are caused mainly by the optimization of the parameters of the models performed in the {(xi, yi)}i=1,...,n training set, while we need the model to per- form well on other unseen data that are not available during the training. More formally, in the case of cross-entropy, the loss that we would like to minimize is: J ðW Þ = - log ∏ðx , yÞ∈T T P ðy = yjx = x; W Þ, ð19Þ where TT is the set of any data, not available during training. In practice, a small validation set TV is used to evaluate the loss on unseen data. Of course, this validation set should be distinct from the training set. 
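The gradients that drive this whole training procedure are supplied by backpropagation. To make the chain rule of Eq. 18 concrete, here is a sketch of the forward and backward pass for a one-hidden-layer network with a sigmoid hidden layer, a linear output, and a squared-error loss on a single sample; this choice of architecture and loss is made only for brevity.

import numpy as np

def backprop_one_hidden(x, y, W1, W2):
    # forward and backward pass for y_hat = W2 @ sigmoid(W1 @ x) with loss 0.5 * ||y - y_hat||^2
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    # forward pass
    a1 = W1 @ x
    z1 = sigmoid(a1)                      # hidden activations
    y_hat = W2 @ z1                       # linear output layer
    # backward pass: apply the chain rule of Eq. (18) from the loss toward the input
    dL_dyhat = y_hat - y                  # derivative of the squared-error loss w.r.t. y_hat
    dL_dW2 = np.outer(dL_dyhat, z1)       # gradient for the output-layer weights
    dL_dz1 = W2.T @ dL_dyhat              # propagate the error back to the hidden layer
    dL_da1 = dL_dz1 * z1 * (1.0 - z1)     # through the derivative of the sigmoid
    dL_dW1 = np.outer(dL_da1, x)          # gradient for the hidden-layer weights
    return dL_dW1, dL_dW2

In practice, such gradients are produced by the automatic differentiation engines of deep learning frameworks, and the resulting model is then monitored on the validation set as described above.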
It is extremely important to keep in mind that the performance obtained on the validation set is generally biased upward because the validation set was used to perform early stop- ping or to choose regularization parameters. Therefore, one should have an independent test set that has been isolated at the Deep Learning: Basics and CNN 93 beginning, has not been used in any way during training, and is only used to report the performance (see Chap. 20 for details). In case one cannot have an additional independent test set due to a lack of data, one should be aware that the performance may be biased and that this is a limitation of the specific study. To avoid overfitting and improve the generalization perfor- mance of the model, usually, the validation set is used to monitor the loss during the training of the networks. Tracking the training and validation losses over the number of epochs is essential and provides important insights into the training process and the selected hyperparameters (e.g., choice of learning rate). Recent visualization tools such as TensorBoard3 or Weights & Biases4 make this tracking easy. In the following, we will also mention some of the most commonly applied optimization techniques that help with preventing overfitting. Early Stopping Using the reported training and validation errors, the best model in terms of performance and generalization power is selected. In particular, early stopping, which corresponds to select- ing a model corresponding to an earlier time point than the final epoch, is a common way to prevent overfitting. Early stopping is a form of regularization for models that are trained with an iterative method, such as gradient descent and its variants. Early stopping can be implemented with different criteria. However, generally, it requires the monitoring of the performance of the model on a validation set, and the model is selected when its performance degrades or its loss increases. Overall, early stopping should be used almost universally for the training of neural net- works. The concept of early stopping is illustrated in Fig. 8. Weight Regularization Similar to other machine learning meth- ods (Chap. 2), weight regularization is also a very commonly used technique for avoiding overfitting in neural networks. More specif- ically, during the training of the model, the weights of the network start growing in size in order to specialize the model to the training data. However, large weights tend to cause sharp transitions in the different layers of the network and, that way, large changes in the output for only small changes in the inputs. To handle this problem, during the training process, the weights can be updated in such a way that they are encouraged to be small, by adding a penalty to the loss function, for instance, the ℓ 2 norm of the parameters λkWk2, where λ is a trade-off parameter between the loss and the regularization. Since weight regularization is quite popular in 3 https://www.tensorflow.org/tensorboard. 4 https://wandb.ai/site. 94 Maria Vakalopoulou et al. Loss Underfitting Overfitting Validation Training Time (epochs) Fig. 8 Illustration of the concept of early stopping. The model that should be selected corresponds to the dashed bar which is the point where the validation loss starts increasing. Before this point, the model is underfitting. After, it is overfitting neural networks, different optimizers have integrated them into their optimization process in the form of weight decay. 
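As a sketch of how such an ℓ2 penalty enters a gradient step, the extra term can be folded directly into the update, which is where the name weight decay comes from; the learning rate and the value of the trade-off parameter below are arbitrary illustrative choices.

def sgd_step_with_weight_decay(W, grad, lr=0.01, weight_decay=1e-4):
    # the gradient of J(W) + weight_decay * ||W||^2 is grad + 2 * weight_decay * W,
    # so every step also shrinks the weights slightly toward zero ("decays" them)
    return W - lr * (grad + 2 * weight_decay * W)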
Weight Initialization The way that the weights of neural net- works will be initialized is very important, and it can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficul- ties and fails altogether. Most of the time, the weights are initialized randomly from a Gaussian or uniform distribution. According to , the choice of Gaussian or uniform distribution does not seem to matter very much; however, the scale does have a large effect both on the outcome of the optimization procedure and on the ability of the network to generalize. Nevertheless, more tailored approaches have been developed over the last decade that have become the standard initialization points. One of them is the Xavier Initialization which balances between all the layers to have the same activation variance and the same gradient variance. More formally the weights are initialized as: 6 6 W i,j  Uniform - , , ð20Þ mþn mþn where m is the number of inputs and n the number of outputs of matrix W. Moreover, the biases b are initialized to 0. Drop-out There are other techniques to prevent overfitting, such as drop-out , which involves randomly destroying neurons during the training process, thereby reducing the complexity of Deep Learning: Basics and CNN 95 Fig. 9 Examples of data transformations applied in the MNIST dataset. Each of these generated samples is considered additional training data the model. Drop-out is an ensemble method that does not need to build the models explicitly. In practice, at each optimization itera- tion, random binary masks on the units are considered. The proba- bility of removing a unit ( p) is defined as a hyperparameter during the training of the network. During inference, all the units are activated; however, the obtained parameters W are multiplied with this probability p. Drop-out is quite efficient and commonly used in a variety of neural network architectures. Data Augmentation Since neural networks are data-driven meth- ods, their performance depends on the training data. To increase the amount of data during the training, data augmentation can be performed. It generates slightly modified copies of the existing training data to enrich the training samples. This technique acts as a regularizer and helps reduce overfitting. Some of the most com- monly used transformations applied during data augmentation include random rotations, translations, cropping, color jittering, resizing, Gaussian blurring, and many more. In Fig. 9, examples of different transformations on different digits (first column) of the MNIST dataset are presented. For medical images, the TorchIO library allows to easily perform data augmentation. Batch Normalization To ensure that the training of the networks will be more stable and faster, batch normalization has been pro- posed. In practice, batch normalization re-centers and re-scales the layer’s input, mitigating the problem of internal 96 Maria Vakalopoulou et al. covariate shift which changes the distribution of the inputs of each layer affecting the learning rate of the network. Even if the method is quite popular, its necessity and use for the training have recently been questioned. 3.4 State-of-the-Art Over the years, different optimizers have been proposed and widely Optimizers used, aiming to provide improvements over the classical stochastic gradient descent. 
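Returning briefly to initialization, the Xavier scheme of Eq. 20 can be implemented directly; in this sketch, m and n are taken to be the fan-in and fan-out of the layer's weight matrix, and the helper name is hypothetical.

import numpy as np

def xavier_uniform(n_out, n_in, rng=None):
    # Eq. (20): W_ij ~ Uniform(-sqrt(6/(m+n)), +sqrt(6/(m+n))), with the biases set to 0
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-limit, limit, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

With the weights initialized, we can now look at the optimizers themselves.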
These algorithms are motivated by challenges that need to be addressed with stochastic gradient descent and are focusing on the choice of the proper learning rate, its dynamic change during training, as well as the fact that it is the same for all the parameter updates. Moreover, a proper choice of opti- mizer could speed up the convergence to the optimal solution. In this subsection, we will discuss some of the most commonly used optimizers nowadays. 3.4.1 Stochastic Gradient One of the limitations of the stochastic gradient descent is that Descent with Momentum since the direction of the gradient that we are taking is random, it can heavily oscillate, making the training slower and even getting stuck in a saddle point. To deal with this problem, stochastic gradient descent with momentum [45, 46] keeps a history of the previous gradients, and it updates the weights taking into account the previous updates. More formally: g t ← ρg t - 1 þ ð1 - ρÞGðW t Þ ΔW t ← - ηt g t , ð21Þ W tþ1 ← W t þ ΔW t where gt is the direction of the update of the weights in time-step t and ρ ∈ [0, 1] is a hyperparameter that controls the contribution of the previous gradients and current gradient in the current update. When ρ = 0, it is the same as the classical stochastic gradient descent. A large value of ρ will mean that the update is strongly influenced by the previous updates. The momentum algorithm accumulates an exponentially decaying moving average of the past gradients and continues to move in their direction. Momentum increases the speed of convergence, while it is also helpful to not get stuck in places where the search space is flat (saddle points with zero gradient), since the momentum will pursue the search in the same direction as before the flat region. 3.4.2 AdaGrad To facilitate and speed up, even more, the training process, optimi- zers with adaptive learning rates per parameter have been proposed. The adaptive gradient (AdaGrad) optimizer is one of them. It updates each individual parameter proportionally to their compo- nent (and momentum) in the gradient. More formally: Deep Learning: Basics and CNN 97 g t ← GðW t Þ rt ← rt - 1 þ gt gt η , ð22Þ ΔW t ← - p gt δ þ rt W tþ1 ← W t þ ΔW t where gt is the gradient estimate vector in time-step t, rt is the term controlling the per parameter update, and δ is some small quantity that is used to avoid the division by zero. Note that rt constitutes of the gradient’s element-wise product with itself and of the previous term rt-1 accumulating the gradients of the previous terms. This algorithm performs very well for sparse data since it decreases the learning rate faster for the parameters that are more frequent and slower for the infrequent parameters. However, since the update accumulates gradients of the previous steps, the updates could decrease very fast, blocking the learning process. This limita- tion is mitigated by extensions of the AdaGrad algorithm as we discuss in the next sections. 3.4.3 RMSProp Another algorithm with adaptive learning rates per parameter is the root mean squared propagation (RMSProp) algorithm, proposed by Geoffrey Hinton. Despite its popularity and use, this algorithm has not been published. RMSProp is an extension of the AdaGrad algorithm dealing with the problem of radically diminishing learning rates by being less influenced by the first iterations of the algorithm. 
More formally: g t ← GðW t Þ r t ← ρr t - 1 þ ð1 - ρÞg t gt η , ð23Þ ΔW t ← - p gt δ þ rt W tþ1 ← W t þ ΔW t where ρ is a hyperparameter that controls the contribution of the previous gradients and the current gradient in the current update. Note that RMSProp estimates the squared gradients in the same way as AdaGrad, but instead of letting that estimate continually accumulate over training, we keep a moving average of it, integrat- ing the momentum. Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. 3.4.4 Adam The effectiveness and advantages of the AdaGrad and RMSProp algorithms are combined in the adaptive moment estimation (Adam) optimizer. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. More formally: 98 Maria Vakalopoulou et al. g t ← GðW t Þ s t ← ρ1 s t - 1 þ ð1 - ρ1 Þg t r t ← ρ2 r t - 1 þ ð1 - ρ2 Þg t gt st ^s t ← 1 - ðρ1 Þt , ð24Þ rt r^t ← 1 - ðρ2 Þt λ ΔW t ← - p ^s t δ þ r^t W tþ1 ← W t þ ΔW t where st is the gradient with momentum, rt accumulates the squared gradients with momentum as in RMSProp, and ^s t and r^t are smaller than st and rt, respectively, but they converge toward them. Moreover, δ is some small quantity that is used to avoid the division by zero, while ρ1 and ρ2 are hyperparameters of the algo- rithm. The parameters ρ1 and ρ2 control the decay rates of each moving average, respectively, and their value is close to 1. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods, making it the go-to optimizer for deep learning problems. 3.4.5 Other Optimizers The development of efficient (in terms of speed and stability) optimizers is still an active research direction. RAdam is a variant of Adam, introducing a term to rectify the variance of the adaptive learning rate. In particular, RAdam leverages a dynamic rectifier to adjust the adaptive momentum of Adam based on the variance and effectively provides an automated warm-up custom- tailored to the current dataset to ensure a solid start to training. Moreover, LookAhead was inspired by recent advances in the understanding of loss surfaces of deep neural networks and pro- vides a breakthrough in robust and stable exploration during the entirety of the training. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights gener- ated by another optimizer. These are only some of the optimizers that exist in the literature, and depending on the problem and the application, different optimizers could be selected and applied. 4 Convolutional Neural Networks Convolutional neural networks (CNNs) are a specific category of deep neural networks that employ the convolution operation in order to process the input data. Even though the main concept dates back to the 1990s and is greatly inspired by neuroscience (in particular by the organization of the visual cortex), their wide- spread use is due to a relatively recent success on the ImageNet Large Scale Visual Recognition Challenge of 2012. In contrast Deep Learning: Basics and CNN 99 to the deep fully connected networks that have been already dis- cussed, CNNs excel in processing data with a spatial or grid-like organization (e.g., time series, images, videos, etc.) while at the same time decreasing the number of trainable parameters due to their weight sharing properties. 
The rest of this section is first introducing the convolution operation and the motivation behind using it as a building block/module of neural networks. Then, a number of different variations are presented together with exam- ples of the most important CNN architectures. Lastly, the impor- tance of the receptive field – a central property of such networks – will be discussed. 4.1 The Convolution The convolution operation is defined as the integral of the product Operation of the two functions ( f, g)5 after one is reversed and shifted over the other function. Formally, we write: 1 hðtÞ = f ðt - τÞgðτÞ dτ: ð25Þ -1 Such an operation can also be denoted with an asterisk (), so it is written as: hðtÞ = ðf  gÞðtÞ: ð26Þ In essence, the convolution operation shows how one function affects the other. This intuition arises from the signal processing domain, where it is typically important to know how a signal will be affected by a filter. For example, consider a uni-dimensional con- tinuous signal, like the brain activity of a patient on some electro- encephalography electrode, and a Gaussian filter. The result of the convolution operation between these two functions will output the effect of a Gaussian filter on this signal which will, in fact, be a smoothed version of the input. A different way to think of the convolution operation is that it shows how the two functions are related. In other words, it shows how similar or dissimilar the two functions are at different relative positions. In fact, the convolution operation is very similar to the cross-correlation operation, with the subtle difference being that in the convolution operation, one of the two functions is inverted. In the context of deep learning specifically, the exact differences between the two operations can be of secondary concern; however, the convolution operation has more properties than correlation, such as commutativity. Note also that when the signals are symmet- ric, both operations will yield the same result. In order to deal with discrete and finite signals, we can expand the definition of the convolution operation. Specifically, given two 5 Note that f and g have no relationship to their previous definitions in the chapter. In particular, f is not the deep learning model. 100 Maria Vakalopoulou et al. 0 1 1 1×1 0×0 0×1 0 0 0 1 1×0 1×1 0×0 0 1 4 3 4 1 0 0 0 1×1 1×0 1×1 0 1 0 1 1 2 4 3 3 0 0 0 1 1 0 0 ∗ 0 1 0 = 1 2 3 4 1 0 0 1 1 0 0 0 1 0 1 1 3 3 1 1 0 1 1 0 0 0 0 3 3 1 1 0 1 1 0 0 0 0 0 I K I∗K Fig. 10 A visualization of the discrete convolution operation in 2D discrete signals f[k] and g[k], with k∈, the convolution operation is defined by: h½k = f ½k - ng½n: ð27Þ n Lastly, the convolution operation can be extended for multidi- mensional signals similarly. For example, we can write the convolu- tion operation between two discrete and finite two-dimensional signals (e.g., I[i, j], K[i, j]) as: H ½i, j  = I ½i - m, j - nK ½m, n: ð28Þ m n Very often, the first signal will be the input of interest (e.g., a large size image), while the second signal will be of relatively small size (e.g., a 3 × 3 or 4 × 4 matrix) and will implement a specific operation. The second signal is then called a kernel. In Fig. 10, a visualization of the convolution operation is shown in the case of a 2D discrete signal such as an image and a 3 × 3 kernel. 
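A direct, unoptimized implementation of the 2D discrete convolution of Eq. 28 may help fix ideas; the sketch below works in "valid" mode (no padding, so a 7 × 7 image and a 3 × 3 kernel give a 5 × 5 output, as in Fig. 10). Note that deep learning libraries typically implement cross-correlation, i.e., they skip the kernel flip, which makes no difference once the kernel is learned.

import numpy as np

def conv2d_valid(image, kernel):
    # keep only the positions where the kernel fits entirely inside the image
    kernel = np.flip(kernel)                  # flip both axes; omit this line for cross-correlation
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)   # shift, multiply element-wise, sum
    return out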
In detail, the convolution kernel is shifted over all locations of the input, and an element-wise multiplication and a summation are utilized to calcu- late the convolution output at the corresponding location. Exam- ples of applications of convolutions to an image are provided in Fig. 11. Finally, note that, as in multilayer perceptrons, a convolu- tion will generally be followed by a non-linear activation function, for instance, a ReLU (see Fig. 12 for an example of activation applied to a feature map). In the following sections of this chapter, any reference to the convolution operation will mostly refer to the 2D discrete case. The Deep Learning: Basics and CNN 101 1 0 -1 1 1 1 1 0 -1 0 0 0 1 0 -1 -1 -1 -1 Original image Vertical edge detection Horizontal edge detection Fig. 11 Two examples of convolutions applied to an image. One of the filters acts as a vertical edge detector and the other one as a horizontal edge detector. Of course, in CNNs, the filters are learned, not predefined, so there is no guarantee that, among the learned filters, there will be a vertical/horizontal case detector, although it will often be the case in practice, especially for the first layers of the architecture Fig. 12 Example of application of a non-linear activation function (here a ReLU) to an image extension to the 3D case, which is often encountered in medical imaging, is straightforward. 4.2 Properties of the In the case of a discrete domain, the convolution operation can be Convolution Operation performed using a simple matrix multiplication without the need of shifting one signal over the other one. This can be essentially achieved by utilizing the Toeplitz matrix transformation. The Toe- plitz transformation creates a sparse matrix with repeated elements which, when multiplied with the input signal, produces the convo- lution result. To illustrate how the convolution operation can be implemented as a matrix multiplication, let’s take the example of a 3 × 3 kernel (K) and a 4 × 4 input (I): 102 Maria Vakalopoulou et al. i 00 i 01 i 02 i 03 k00 k01 k02 i 10 i 11 i 12 i 13 K = k10 k11 k12 and I = : i 20 i 21 i 22 i 23 k20 k21 k22 i 30 i 31 i 32 i 33 Then, the convolution operation can be computed as a matrix multiplication between the Toepliz transformed kernel: k00 k01 k02 0 k10 k11 k12 0 k20 k21 k22 0 0 0 0 0 0 k00 k01 k02 0 k10 k11 k12 0 k20 k21 k22 0 0 0 0 K = 0 0 0 0 k00 k01 k02 0 k10 k11 k12 0 k20 k21 k22 0 0 0 0 0 0 k00 k01 k02 0 k10 k11 k12 0 k20 k21 k22 and a reshaped input: I = ½ i 00 i 01 i 02 i 03 i 10 i 11 i 12 i 13 i 20 i 21 i 22 i 23 i 30 i 31 i 32 i 33 ⊤ : The produced output will need to be reshaped as a 2 × 2 matrix in order to retrieve the convolution output. This matrix multiplica- tion implementation is quite illuminating on a few of the most important properties of the convolution operation. These proper- ties are the main motivation behind using such elements in deep neural networks. By transforming the convolution operation to a matrix multi- plication operation, it is evident that it can fit in the formalization of the linear functions, which has already been presented in Subhead- ing 2.3. As such, deep neural networks can be designed in a way to utilize trainable convolution kernels. In practice, multiple convolu- tion kernels are learned at each convolutional block, while several of these trainable convolutional blocks are stacked on top of each other forming deep CNNs. Typically, the output of a convolution operation is called a feature map or just features. 
Another important aspect of the convolution operation is that it requires much fewer parameters than the fully connected MLP-based deep neural networks. As it can also be seen from the K matrix, the exact same parameters are shared across all locations. Eventually, rather than learning a different set of parameters for the different locations of the input, only one set is learned. This is referred to as parameter sharing or weight sharing and can greatly decrease the amount of memory that is required to store the network parameters. An illustration of the process of weight sharing across locations, together with the fact that multiple filters (result- ing in multiple feature maps) are computed for a given layer, is illustrated in Fig. 13. The multiple feature maps for a given layer are stored using another dimension (see Fig. 14), thus resulting in a 3D Deep Learning: Basics and CNN 103 Fig. 13 For a given layer, several (usually many) filters are learned, each of them being able to detect a specific characteristic in the image, resulting in several feature/filter maps. On the other hand, for a given filter, the weights are shared across all the locations of the image Fig. 14 The different feature maps for a given layer are arranged along another dimension. The feature maps will thus be a 3D array when the input is a 2D image (and a 4D array when the input is a 3D image) array when the input is a 2D image (and a 4D array when the input is a 3D image). Convolutional neural networks have proven quite powerful in processing data with spatial structure (e.g., images, videos, etc.). This is effectively based on the fact that there is a local connectivity of the kernel elements while at the same time the same kernel is applied at different locations of the input. Such processing grants a quite useful property called translation equivariance enabling the 104 Maria Vakalopoulou et al. network to output similar responses at different locations of the input. An example of the usefulness of such a property can be identified on an image detection task. Specifically, when training a network to detect tumors in an MR image of the brain, the model should respond similarly regardless of the location where the anom- aly can be manifested. Lastly, another important property of the convolution opera- tion is that it decouples the size of the input with the trainable parameters. For example, in the case of MLPs, the size of the weight matrix is a function of the dimension of the input. Specifically, a densely connected layer that maps 256 features to 10 outputs would have a size of W ∈10 × 256. On the contrary, in convolu- tional layers, the number of trainable parameters only depends on the kernel size and the number of kernels that a layer has. This eventually allows the processing of arbitrarily sized inputs, for example, in the case of fully convolutional networks. 4.3 Functions and An observant reader might have noticed that the convolution Variants operation can change the dimensionality of the produced output. In the example visualized in Fig. 10, the image of size 7 × 7, when convolved with a kernel of size 3 × 3, produces a feature map of size of 5 × 5. Even though dimension changes can be avoided with appropriate padding (see Fig. 15 for an illustration of this process) prior to the convolution operation, in some cases, it is actually desired to reduce the dimensions of the input. Such a decrease can be achieved in a number of ways depending on the task at hand. 
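A back-of-the-envelope comparison makes this decoupling and the savings from weight sharing concrete; the layer sizes below are hypothetical and chosen only for illustration.

# a fully connected layer mapping a flattened 256 x 256 single-channel image to 64 features
dense_params = (256 * 256) * 64 + 64     # weights + biases = 4,194,368 parameters, tied to the input size
# a convolutional layer with 64 kernels of size 3 x 3 applied to the same single-channel image
conv_params = 64 * (3 * 3 * 1) + 64      # weights + biases = 640 parameters, whatever the input size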
In this subsection, some of the most typical functions that are utilized in CNNs will be discussed. Fig. 15 The padding operation, which involves adding zeros around the image, allows to obtain feature maps that are of the same size as the original image Deep Learning: Basics and CNN 105 Input feature map Pooled feature map Max pooling with 2×2 filter and stride 2 Fig. 16 Effect of a pooling operation. Here, a maximum pooling of size 2 × 2 with a stride of 2 Downsampling Operations (i.e., Pooling Layers) In many CNN architectures, there is an extensive use of downsampling operations that aim to compress the size of the feature maps and decrease the computational burden. Otherwise referred to as pool- ing layers, these processing operations are aggregating the values of their input depending on their design. Some of the most common downsampling layers are the maximum pooling, average pooling, or global average pooling. In the first two, either the maximum or the average value is used as a feature for the output across non-overlapping regions of a predefined pooling size. In the case of the global average pooling, the spatial dimensions are all repre- sented with the average value. An example of pooling is provided in Fig. 16. Strided Convolution The strided convolution refers to the spe- cific case in which, instead of applying the convolution operation for every location using a step size (or stride, s) of 1, different step sizes can be considered (Fig. 17). Such an operation will produce a convolution output with much fewer elements. Convolutional blocks with s > 1 can be found on CNN architectures as a way to decrease the feature sizes in intermediate layers. Atrous or Dilated Convolution Dilated, also called atrous, con- volution is the convolution with kernels that have been dilated by inserting zero holes (a ` trous in French) between the non-zero values of a kernel. In this case, an additional parameter (d) of the convolution operation is added, and it is changing the distance between the kernel elements. In essence, it is increasing the reach of the kernel but keeping the number of trainable parameters the same. For example, a dilated convolution with a kernel size of 3 × 3 and a dilation rate of d = 2 would be sparsely arranged on a 5 × 5 grid. 106 Maria Vakalopoulou et al. Fig. 17 Stride operation, here with a stride of 2 Transposed Convolution In certain circumstances, one needs not only to downsample the spatial dimensions of the input but also, usually at a later stage of the network, apply an upsample operation. The most emblematic case is the task of image segmen- tation (see Chap. 13), in which a pixel-level classification is expected, and therefore, the output of the neural network should have the same size as the input. In such cases, several upsampling operations are typically applied. The upsampling can be achieved by a transposed convolution operation that will eventually increase the size of the output. In details, the transposed convolution is per- formed by dilating the input instead of the kernel before applying a convolution operation. In this way, an input of size 5 × 5 will reach a size of 10 × 10 after being dilated with d = 2. With proper padding and using a kernel of size 3 × 3, the output will eventually double in size. 4.4 Receptive Field In the context of deep neural networks and specifically CNNs, the Calculation term receptive field is used to define the proportion of the input that produces a specific feature. 
For example, a CNN that takes an image as input and applies only a single convolution operation with a kernel size of 3 × 3 would have a receptive field of 3 × 3. This means that for each pixel of the first feature map, a 3 × 3 region of the input would be considered. Now, if another layer were to be added, again with a 3 × 3 kernel, the receptive field of the new feature map would grow to 5 × 5: each of its elements depends on a 3 × 3 region of the first feature map, which in turn covers a 5 × 5 region of the input.
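The growth of the receptive field with depth can be tracked with a simple recurrence: each new layer enlarges the receptive field by (kernel size - 1) times the product of the strides of all preceding layers. The helper below is a small sketch of that bookkeeping, not code taken from the chapter.

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, ordered from the input toward the output;
    # returns the receptive field (in input pixels) of one unit of the last feature map
    rf, jump = 1, 1     # jump = distance in input pixels between adjacent units of the current map
    for k, s in layers:
        rf = rf + (k - 1) * jump
        jump = jump * s
    return rf

print(receptive_field([(3, 1), (3, 1)]))                   # two stacked 3x3 convolutions -> 5
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))   # adding a 2x2/stride-2 pooling and another 3x3 -> 10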
