Deep Learning: Classification
Summary
This document is a deep learning lecture/presentation that outlines concepts and examples of classification problems. It covers binary, multiclass and multilabel classification with examples, along with model architecture, input and output shapes, the steps in modelling, and evaluation metrics. Later sections cover supervised and unsupervised learning, backpropagation, and gradient-based optimization.
Full Transcript
Deep Learning: Classification

“What is a classification problem?”

Example classification problems:

“Is this email spam or not spam?”
To: [email protected] — “Hey Daniel, This deep learning course is incredible! I can’t wait to use what I’ve learned!” → Not spam
To: [email protected] — “Hay daniel… C0ongratu1ations! U win $1139239230” → Spam
(Binary classification)

“Is this a photo of sushi, steak or pizza?”
(Multiclass classification)

“What tags should this article have?” Machine learning, representation learning, artificial intelligence.
(Multilabel classification)

Binary vs. multiclass classification: binary classification has two options; multiclass classification has more than two.

What we’re going to cover:
- Architecture of a neural network classification model
- Input shapes and output shapes of a classification model (features and labels)
- Creating custom data to view, fit on and predict on
- Steps in modelling: creating a model, setting a loss function and optimiser, creating a training loop, evaluating a model
- Saving and loading models
- Harnessing the power of non-linearity
- Different classification evaluation methods

How: classification inputs and outputs. An input image of width W = 224, height H = 224 and C = 3 colour channels (R, G, B) is given a numerical encoding, e.g.

    [[0.31, 0.62, 0.44…],
     [0.92, 0.03, 0.27…],
     [0.25, 0.78, 0.07…],
     …]

and the model produces a predicted output such as

    [[0.97, 0.00, 0.03],
     [0.81, 0.14, 0.05],
     [0.03, 0.07, 0.90],
     …]

one probability per class (Sushi, Steak, Pizza), which is compared against the actual output.

Input and output shapes (for an image classification example):
Shape = [batch_size, colour_channels, width, height]
e.g. Shape = [None, 3, 224, 224] or Shape = [32, 3, 224, 224]
These will vary depending on the problem you’re working on.

Architecture of a classification model: input image → neural network → Sushi / Steak / Pizza.

Let’s code! Useful PyTorch modules: torchvision.transforms, torch.utils.data.Dataset, torch.utils.data.DataLoader, torchmetrics, torch.optim, torch.nn, torch.nn.Module, torch.utils.tensorboard, torchvision.models.
See more: https://pytorch.org/tutorials/beginner/ptcheat.html

Improving a model (from a model’s perspective). Common ways to improve a deep model:
- Adding layers (turning a smaller model into a larger model)
- Increasing the number of hidden units
- Changing/adding activation functions
- Changing the optimization function
- Changing the learning rate
- Fitting for longer
(Because you can alter each of these, they’re hyperparameters.)

The missing piece: non-linearity. “What could you draw if you had an unlimited amount of straight (linear) and non-straight (non-linear) lines?” Linear data is possible to model with straight lines; non-linear data is not.

    import torch
    import matplotlib.pyplot as plt

    A = torch.arange(-10., 10.)   # float dtype (torch.sigmoid needs floats)
    plt.plot(A)                   # linear activation (same as the original values)
    plt.plot(torch.sigmoid(A))    # sigmoid activation (non-linear)
    plt.plot(torch.relu(A))       # ReLU activation (non-linear)

The machine learning explorer’s motto: “Visualize, visualize, visualize.” It’s a good idea to visualize the data, the model, the training and the predictions as often as possible.

The machine learning practitioner’s motto: “Experiment, experiment, experiment.”

Steps in modelling with PyTorch:
1. Construct or import a pretrained model relevant to your problem.
2. Prepare the loss function, optimizer and training loop. Loss — how wrong your model’s predictions are compared to the truth labels (you want to minimise this). Optimizer — how your model should update its internal patterns to better its predictions.
3. Fit the model to the training data so it can discover patterns. Epochs — how many times the model will go through all of the training examples.
4. Evaluate the model on the test data (how reliable are our model’s predictions?).
A sketch of these four steps follows below.
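For concreteness, here is a minimal sketch of the four steps in PyTorch. The toy dataset, layer sizes, learning rate and epoch count are all invented for illustration; only the overall structure follows the steps above.

    import torch
    from torch import nn

    # Invented toy data: 100 samples, 2 features, binary labels (0. or 1.)
    X = torch.randn(100, 2)
    y = (X.sum(dim=1) > 0).float().unsqueeze(1)

    # 1. Construct a model relevant to the problem
    model = nn.Sequential(
        nn.Linear(2, 8),
        nn.ReLU(),  # non-linear activation (see "The missing piece" above)
        nn.Linear(8, 1),
    )

    # 2. Prepare the loss function and optimizer
    loss_fn = nn.BCEWithLogitsLoss()  # a standard binary-classification loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # 3. Fit the model: each epoch is one pass through the training examples
    for epoch in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)  # how wrong the predictions are
        loss.backward()
        optimizer.step()             # update the model's internal parameters

    # 4. Evaluate (on the training data here for brevity; in practice,
    # always evaluate on a held-out test set)
    with torch.no_grad():
        preds = (torch.sigmoid(model(X)) > 0.5).float()
        print(f"accuracy: {(preds == y).float().mean():.2f}")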
(Some common) classification evaluation methods
Key: tp = true positive, tn = true negative, fp = false positive, fn = false negative.

Accuracy
Formula: Accuracy = (tp + tn) / (tp + tn + fp + fn)
Code: torchmetrics.Accuracy() or sklearn.metrics.accuracy_score()
When to use: the default metric for classification problems; not the best for imbalanced classes.

Precision
Formula: Precision = tp / (tp + fp)
Code: torchmetrics.Precision() or sklearn.metrics.precision_score()
When to use: higher precision leads to fewer false positives.

Recall
Formula: Recall = tp / (tp + fn)
Code: torchmetrics.Recall() or sklearn.metrics.recall_score()
When to use: higher recall leads to fewer false negatives.

F1-score
Formula: F1-score = 2 · (precision · recall) / (precision + recall)
Code: torchmetrics.F1Score() or sklearn.metrics.f1_score()
When to use: a combination of precision and recall; usually a good overall metric for a classification model.

Confusion matrix
Formula: N/A
Code: torchmetrics.ConfusionMatrix()
When to use: when comparing predictions to truth labels to see where the model gets confused; can be hard to use with large numbers of classes.

Anatomy of a confusion matrix (rows = true label, columns = predicted label):

                 Predicted 0    Predicted 1
    True 0       99 (98.0%)     2 (2.0%)
    True 1       0 (0.0%)       99 (100.0%)

True positive = model predicts 1 when truth is 1. True negative = model predicts 0 when truth is 0. False positive = model predicts 1 when truth is 0. False negative = model predicts 0 when truth is 1.
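As a quick illustration, here is a sketch computing these metrics with the scikit-learn functions named above; the label arrays are made up.

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, confusion_matrix)

    # Made-up truth labels and predictions for a binary problem
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    print(accuracy_score(y_true, y_pred))   # (tp + tn) / (tp + tn + fp + fn) = 0.8
    print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 0.8
    print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 0.8
    print(f1_score(y_true, y_pred))         # 2 * (p * r) / (p + r) = 0.8
    print(confusion_matrix(y_true, y_pred)) # rows: true label, columns: predicted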
Three datasets (possibly the most important concept in machine learning…):
- Course materials → the training set
- Practice exam → the validation set
- Final exam → the test set (see if the model is ready for the wild)

Generalization: the ability of a machine learning model to perform well on data it hasn’t seen before.

[Figure: the machine learning workflow. Inputs (e.g. clothing images stored as pixel values such as [[116, 78, 15], [117, 43, 96], [125, 87, 23], …]) are given a numerical encoding; the model learns a representation (patterns/features/weights) and produces outputs (e.g. [[0.983, 0.004, 0.013], …] → Coat, Ankle boot, Shirt, Sandal); then 3. the representation (weights & biases) is updated and 4. the process repeats with more examples.]

Learning strategies:
- Supervised learning: discrete → classification or categorization; continuous → regression
- Unsupervised learning: discrete → clustering; continuous → dimensionality reduction

Classification (supervised learning). SVM examples. In supervised learning, classification maps Input → “Training” → Program (Model) → Output.

Regression (supervised learning). Compare:
Classification: (2.1, 1.8) ⇒ good
Regression: (2.1, 1.8) ⇒ 0.9, i.e. “with a likelihood of 90%, this email is good.”
Examples: linear Support Vector Regression (SVR) and RBF Support Vector Regression (SVR).

Conclusion: supervised learning
- Classification: labeled data (paired input and label); the label is of a class (e.g., dog, as in [grey, dog]).
- Regression: labeled data (paired input and label); labels are on a continuous scale (e.g., 3.4, as in [4.5, -17.1]).

Unsupervised learning: Data/Input → “Training” → Program (Model). Not needed: labels (classes) or continuous values.
What can we learn just by looking at the data? 🐩 🐆 🐈 🐕 🦍 🐴 🐶

Unsupervised learning methods: hierarchical clustering, k-means clustering, Principal Component Analysis (PCA), Singular Value Decomposition, Independent Component Analysis, ….

Clustering (unsupervised learning). Given unknown data, k-means clustering uncovers “structure” in unlabeled data. How many clusters? The cluster count has to be chosen.

Dimensionality reduction (unsupervised learning). Principal Component Analysis (PCA) transforms high-dimensional data into a low-dimensional space:

    f_PCA : (x_1, …, x_n) ⇒ (x_1, …, x_m), with n > m
    e.g. (4.5, 2.3, 2.1, 2.4, 1.8, -99, -3, …) ⇒ a shorter vector of its most essential features

Conclusion: unsupervised learning
- Clustering: finding meaning in unlabeled data.
- Dimensionality reduction: transforming high-dimensional data into a low-dimensional space, reducing the data to its more essential features.

Backpropagation

Trainable parameters. For layers of sizes 400, 14, 8, and 2:
→ weights: 400 · 14 + 14 · 8 + 8 · 2 = 5,728
→ biases: 14 + 8 + 2 = 24
→ trainable parameters: 5,728 + 24 = 5,752
The number of trainable parameters can grow fast: layers of sizes 400, 100, 40, 2 already give 44,222 parameters (one model from the walkthrough). A sketch verifying this arithmetic appears at the end of this section.

Combining perceptrons. Why is it all about fast matrix multiplication? Each perceptron computes

    a(x · w_0 + b_0) = y_0
    a(x · w_1 + b_1) = y_1

and stacking the neurons of a layer turns this into a single matrix operation:

    a(W x + b) = y

where W = (w_ij) is the weight matrix (row i holding the weights of neuron i), x the input vector, b the bias vector and a the activation function.

Training. The trainable parameters are the weights and biases; the input is given by the previous layer, and the output is given by the label.
1. Initialize w and b randomly.
2. Determine how good the model is using a “cost function”.
3. Work out how to adjust w and b to be better.
Cost functions include the squared error ‖y_true − y_pred‖², RMSE, and others.

Optimization: “Stochastic Gradient Descent”. In which direction do we need to go to minimize our cost? The derivative dy(x)/dx tells us.

Adjusting w and b (input layer → hidden layer 1 → hidden layer 2 → output layer, e.g. 400 → 14 → 8 → 2). The error

    Error = (true − pred)² = ‖(y_t,0, …, y_t,n) − (y_p,0, …, y_p,n)‖²

is propagated backwards through the layers, adjusting w and b at each one.

Adjusting w and b:
- With the slope, we know how to adjust w and b.
- First take fast, large steps, then small ones → the “learning rate”.
- The gradient can be calculated for each input x.
- To not overfit to one x, we take more inputs.
- Taking all x from the training set takes too long.
- In each “optimization step” we only look at a few x’s → the “batch size”.

Conclusion: backpropagation
- Trainable parameters: weights and biases
- Optimizer: e.g. Stochastic Gradient Descent
- Loss function: e.g. MSE, MAE, RMSE
- Learning rate
- Batch size
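To double-check the parameter arithmetic from the backpropagation slides, here is a small PyTorch sketch; the nn.Linear layers stand in for the fully connected layers of the example network.

    import torch
    from torch import nn

    # The 400 -> 14 -> 8 -> 2 network from the slides
    model = nn.Sequential(
        nn.Linear(400, 14),
        nn.Linear(14, 8),
        nn.Linear(8, 2),
    )

    weights = sum(p.numel() for p in model.parameters() if p.dim() == 2)
    biases = sum(p.numel() for p in model.parameters() if p.dim() == 1)
    print(weights, biases, weights + biases)  # 5728 24 5752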
CHAPTER 2: The mathematical building blocks of neural networks

… the approach of incrementally decomposing a complicated geometric transformation into a long chain of elementary ones, which is pretty much the strategy a human would follow to uncrumple a paper ball. Each layer in a deep network applies a transformation that disentangles the data a little, and a deep stack of layers makes tractable an extremely complicated disentanglement process.

2.4 The engine of neural networks: Gradient-based optimization

As you saw in the previous section, each neural layer from our first model example transforms its input data as follows:

    output = relu(dot(input, W) + b)

In this expression, W and b are tensors that are attributes of the layer. They’re called the weights or trainable parameters of the layer (the kernel and bias attributes, respectively). These weights contain the information learned by the model from exposure to training data.

Initially, these weight matrices are filled with small random values (a step called random initialization). Of course, there’s no reason to expect that relu(dot(input, W) + b), when W and b are random, will yield any useful representations. The resulting representations are meaningless—but they’re a starting point. What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called training, is the learning that machine learning is all about.

This happens within what’s called a training loop, which works as follows. Repeat these steps in a loop, until the loss seems sufficiently low:
1. Draw a batch of training samples, x, and corresponding targets, y_true.
2. Run the model on x (a step called the forward pass) to obtain predictions, y_pred.
3. Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true.
4. Update all weights of the model in a way that slightly reduces the loss on this batch.

You’ll eventually end up with a model that has a very low loss on its training data: a low mismatch between predictions, y_pred, and expected targets, y_true. The model has “learned” to map its inputs to correct targets. From afar, it may look like magic, but when you reduce it to elementary steps, it turns out to be simple.

Step 1 sounds easy enough—just I/O code. Steps 2 and 3 are merely the application of a handful of tensor operations, so you could implement these steps purely from what you learned in the previous section. The difficult part is step 4: updating the model’s weights. Given an individual weight coefficient in the model, how can you compute whether the coefficient should be increased or decreased, and by how much?

One naive solution would be to freeze all weights in the model except the one scalar coefficient being considered, and try different values for this coefficient. Let’s say the initial value of the coefficient is 0.3. After the forward pass on a batch of data, the loss of the model on the batch is 0.5. If you change the coefficient’s value to 0.35 and rerun the forward pass, the loss increases to 0.6. But if you lower the coefficient to 0.25, the loss falls to 0.4. In this case, it seems that updating the coefficient by –0.05 would contribute to minimizing the loss. This would have to be repeated for all coefficients in the model.

But such an approach would be horribly inefficient, because you’d need to compute two forward passes (which are expensive) for every individual coefficient (of which there are many, usually thousands and sometimes up to millions). Thankfully, there’s a much better approach: gradient descent.
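The naive approach can be sketched in a few lines of plain Python; the quadratic stand-in for “forward pass plus loss” and all the numbers here are invented for illustration.

    # Probing a single coefficient with two extra forward passes
    def loss_for(coeff):
        return (coeff - 0.05) ** 2  # invented stand-in for a forward pass + loss

    coeff, eps = 0.3, 0.05
    loss_up = loss_for(coeff + eps)
    loss_down = loss_for(coeff - eps)
    if loss_down < loss_up:  # the loss falls when the coefficient is lowered...
        coeff -= eps         # ...so update it by -0.05
    print(coeff)             # 0.25

    # Repeating this for every coefficient means two expensive forward
    # passes per coefficient, which is exactly why this is intractable.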
Gradient descent is the optimization technique that powers modern neural networks. Here’s the gist of it. All of the functions used in our models (such as dot or +) transform their input in a smooth and continuous way: if you look at z = x + y, for instance, a small change in y only results in a small change in z, and if you know the direction of the change in y, you can infer the direction of the change in z. Mathematically, you’d say these functions are differentiable. If you chain together such functions, the bigger function you obtain is still differentiable. In particular, this applies to the function that maps the model’s coefficients to the loss of the model on a batch of data: a small change in the model’s coefficients results in a small, predictable change in the loss value. This enables you to use a mathematical operator called the gradient to describe how the loss varies as you move the model’s coefficients in different directions. If you compute this gradient, you can use it to move the coefficients (all at once in a single update, rather than one at a time) in a direction that decreases the loss.

If you already know what differentiable means and what a gradient is, you can skip to section 2.4.3. Otherwise, the following two sections will help you understand these concepts.

2.4.1 What’s a derivative?

Consider a continuous, smooth function f(x) = y, mapping a number, x, to a new number, y.

[Figure 2.15: A continuous, smooth function y = f(x)]

Because the function is continuous, a small change in x can only result in a small change in y—that’s the intuition behind continuity. Let’s say you increase x by a small factor, epsilon_x: this results in a small epsilon_y change to y.

[Figure 2.16: With a continuous function, a small change in x results in a small change in y]

In addition, because the function is smooth (its curve doesn’t have any abrupt angles), when epsilon_x is small enough, around a certain point p, it’s possible to approximate f as a linear function of slope a, so that epsilon_y becomes a * epsilon_x:

    f(x + epsilon_x) = y + a * epsilon_x

Obviously, this linear approximation is valid only when x is close enough to p.

The slope a is called the derivative of f in p. If a is negative, it means a small increase in x around p will result in a decrease of f(x), and if a is positive, a small increase in x will result in an increase of f(x). Further, the absolute value of a (the magnitude of the derivative) tells you how quickly this increase or decrease will happen.

[Figure 2.17: Derivative of f in p, the local linear approximation of f, with slope a]

For every differentiable function f(x) (differentiable means “can be derived”: for example, smooth, continuous functions can be derived), there exists a derivative function f'(x) that maps values of x to the slope of the local linear approximation of f in those points. For instance, the derivative of cos(x) is -sin(x), the derivative of f(x) = a * x is f'(x) = a, and so on.

Being able to derive functions is a very powerful tool when it comes to optimization, the task of finding values of x that minimize the value of f(x). If you’re trying to update x by a factor epsilon_x in order to minimize f(x), and you know the derivative of f, then your job is done: the derivative completely describes how f(x) evolves as you change x. If you want to reduce the value of f(x), you just need to move x a little in the opposite direction from the derivative.
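In one dimension, that recipe is only a couple of lines; a sketch, with f(x) = x² and a step size chosen purely for illustration:

    # Minimizing f(x) = x**2 by stepping against its derivative f'(x) = 2*x
    def f_prime(x):
        return 2 * x

    x, step = 3.0, 0.1
    for _ in range(50):
        x -= step * f_prime(x)  # move opposite to the derivative
    print(x)  # close to 0.0, the minimum of f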
2.4.2 Derivative of a tensor operation: The gradient

The function we were just looking at turned a scalar value x into another scalar value y: you could plot it as a curve in a 2D plane. Now imagine a function that turns a tuple of scalars (x, y) into a scalar value z: that would be a vector operation. You could plot it as a 2D surface in a 3D space (indexed by coordinates x, y, z). Likewise, you can imagine functions that take matrices as inputs, functions that take rank-3 tensors as inputs, etc.

The concept of derivation can be applied to any such function, as long as the surfaces they describe are continuous and smooth. The derivative of a tensor operation (or tensor function) is called a gradient. Gradients are just the generalization of the concept of derivatives to functions that take tensors as inputs. Remember how, for a scalar function, the derivative represents the local slope of the curve of the function? In the same way, the gradient of a tensor function represents the curvature of the multidimensional surface described by the function. It characterizes how the output of the function varies when its input parameters vary.

Let’s look at an example grounded in machine learning. Consider
- An input vector, x (a sample in a dataset)
- A matrix, W (the weights of a model)
- A target, y_true (what the model should learn to associate to x)
- A loss function, loss (meant to measure the gap between the model’s current predictions and y_true)

You can use W to compute a target candidate y_pred, and then compute the loss, or mismatch, between the target candidate y_pred and the target y_true:

    y_pred = dot(W, x)                 # use the model weights, W, to make a prediction for x
    loss_value = loss(y_pred, y_true)  # estimate how far off the prediction was

Now we’d like to use gradients to figure out how to update W so as to make loss_value smaller. How do we do that? Given fixed inputs x and y_true, the preceding operations can be interpreted as a function mapping values of W (the model’s weights) to loss values:

    loss_value = f(W)

f describes the curve (or high-dimensional surface) formed by loss values when W varies. Let’s say the current value of W is W0. Then the derivative of f at the point W0 is a tensor grad(loss_value, W0), with the same shape as W, where each coefficient grad(loss_value, W0)[i, j] indicates the direction and magnitude of the change in loss_value you observe when modifying W0[i, j]. That tensor grad(loss_value, W0) is the gradient of the function f(W) = loss_value in W0, also called “gradient of loss_value with respect to W around W0.”

Partial derivatives: The tensor operation grad(f(W), W) (which takes as input a matrix W) can be expressed as a combination of scalar functions, grad_ij(f(W), w_ij), each of which would return the derivative of loss_value = f(W) with respect to the coefficient W[i, j] of W, assuming all other coefficients are constant. grad_ij is called the partial derivative of f with respect to W[i, j].
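A numerical sketch of this idea, using NumPy finite differences and invented values: approximating one partial derivative per coefficient produces a gradient with the same shape as W.

    import numpy as np

    x = np.array([1.0, 2.0])                 # invented fixed input
    y_true = np.array([1.0, 0.0])            # invented fixed target
    W0 = np.array([[0.3, 0.1], [0.2, 0.4]])  # current weights

    def f(W):  # loss_value = f(W), with x and y_true held fixed
        y_pred = W @ x
        return np.sum((y_pred - y_true) ** 2)

    eps = 1e-6
    grad = np.zeros_like(W0)
    for i in range(2):      # one partial derivative
        for j in range(2):  # per coefficient of W
            W = W0.copy()
            W[i, j] += eps
            grad[i, j] = (f(W) - f(W0)) / eps

    print(grad.shape)  # (2, 2), the same shape as W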
Concretely, what does grad(loss_value, W0) represent? You saw earlier that the derivative of a function f(x) of a single coefficient can be interpreted as the slope of the curve of f. Likewise, grad(loss_value, W0) can be interpreted as the tensor describing the direction of steepest ascent of loss_value = f(W) around W0, as well as the slope of this ascent. Each partial derivative describes the slope of f in a specific direction.

For this reason, in much the same way that, for a function f(x), you can reduce the value of f(x) by moving x a little in the opposite direction from the derivative, with a function f(W) of a tensor, you can reduce loss_value = f(W) by moving W in the opposite direction from the gradient: for example,

    W1 = W0 - step * grad(f(W0), W0)

(where step is a small scaling factor). That means going against the direction of steepest ascent of f, which intuitively should put you lower on the curve. Note that the scaling factor step is needed because grad(loss_value, W0) only approximates the curvature when you’re close to W0, so you don’t want to get too far from W0.

2.4.3 Stochastic gradient descent

Given a differentiable function, it’s theoretically possible to find its minimum analytically: it’s known that a function’s minimum is a point where the derivative is 0, so all you have to do is find all the points where the derivative goes to 0 and check for which of these points the function has the lowest value.

Applied to a neural network, that means finding analytically the combination of weight values that yields the smallest possible loss function. This can be done by solving the equation grad(f(W), W) = 0 for W. This is a polynomial equation of N variables, where N is the number of coefficients in the model. Although it would be possible to solve such an equation for N = 2 or N = 3, doing so is intractable for real neural networks, where the number of parameters is never less than a few thousand and can often be several tens of millions.

Instead, you can use the four-step algorithm outlined at the beginning of this section: modify the parameters little by little based on the current loss value for a random batch of data. Because you’re dealing with a differentiable function, you can compute its gradient, which gives you an efficient way to implement step 4. If you update the weights in the opposite direction from the gradient, the loss will be a little less every time:
1. Draw a batch of training samples, x, and corresponding targets, y_true.
2. Run the model on x to obtain predictions, y_pred (this is called the forward pass).
3. Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true.
4. Compute the gradient of the loss with regard to the model’s parameters (this is called the backward pass).
5. Move the parameters a little in the opposite direction from the gradient—for example, W -= learning_rate * gradient—thus reducing the loss on the batch a bit. The learning rate (learning_rate here) would be a scalar factor modulating the “speed” of the gradient descent process.

Easy enough! What we just described is called mini-batch stochastic gradient descent (mini-batch SGD). The term stochastic refers to the fact that each batch of data is drawn at random (stochastic is a scientific synonym of random). All five steps fit in the short sketch below.
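The linear model, analytic gradient, data, learning rate and batch size in this pure-NumPy sketch are all invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))               # invented dataset
    y_all = X @ np.array([1.0, -2.0, 0.5])       # targets from known "true" weights

    W = rng.normal(size=3)                       # random initialization
    learning_rate = 0.1
    for _ in range(200):
        idx = rng.integers(0, 1000, size=32)     # 1. draw a random batch
        x, y_true = X[idx], y_all[idx]
        y_pred = x @ W                           # 2. forward pass
        loss = np.mean((y_pred - y_true) ** 2)   # 3. mean squared error on the batch
        gradient = 2 * x.T @ (y_pred - y_true) / 32  # 4. backward pass (analytic here)
        W -= learning_rate * gradient            # 5. step against the gradient

    print(W)  # approaches [1.0, -2.0, 0.5]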
Figure 2.18 illustrates what happens in 1D, when the model has only one parameter and you have only one training sample.

[Figure 2.18: SGD down a 1D loss value curve (one learnable parameter); from a starting point at t=0, successive steps t=1, t=2, t=3 move the parameter downhill, with the learning rate setting the step size]

As you can see, intuitively it’s important to pick a reasonable value for the learning_rate factor. If it’s too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum. If learning_rate is too large, your updates may end up taking you to completely random locations on the curve.

Note that a variant of the mini-batch SGD algorithm would be to draw a single sample and target at each iteration, rather than drawing a batch of data. This would be true SGD (as opposed to mini-batch SGD). Alternatively, going to the opposite extreme, you could run every step on all data available, which is called batch gradient descent. Each update would then be more accurate, but far more expensive. The efficient compromise between these two extremes is to use mini-batches of reasonable size.

Although figure 2.18 illustrates gradient descent in a 1D parameter space, in practice you’ll use gradient descent in highly dimensional spaces: every weight coefficient in a neural network is a free dimension in the space, and there may be tens of thousands or even millions of them. To help you build intuition about loss surfaces, you can also visualize gradient descent along a 2D loss surface, as shown in figure 2.19. But you can’t possibly visualize what the actual process of training a neural network looks like—you can’t represent a 1,000,000-dimensional space in a way that makes sense to humans. As such, it’s good to keep in mind that the intuitions you develop through these low-dimensional representations may not always be accurate in practice. This has historically been a source of issues in the world of deep learning research.

[Figure 2.19: Gradient descent down a 2D loss surface (two learnable parameters), from a starting point to a final point]

Additionally, there exist multiple variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients. There is, for instance, SGD with momentum, as well as Adagrad, RMSprop, and several others. Such variants are known as optimization methods or optimizers. In particular, the concept of momentum, which is used in many of these variants, deserves your attention. Momentum addresses two issues with SGD: convergence speed and local minima. Consider figure 2.20, which shows the curve of a loss as a function of a model parameter.

[Figure 2.20: A local minimum and a global minimum on a loss curve]

As you can see, around a certain parameter value, there is a local minimum: around that point, moving left would result in the loss increasing, but so would moving right. If the parameter under consideration were being optimized via SGD with a small learning rate, the optimization process could get stuck at the local minimum instead of making its way to the global minimum.

You can avoid such issues by using momentum, which draws inspiration from physics. A useful mental image here is to think of the optimization process as a small ball rolling down the loss curve. If it has enough momentum, the ball won’t get stuck in a ravine and will end up at the global minimum.
Momentum is implemented by moving the ball at each step based not only on the current slope value (current acceleration) but also on the current velocity (resulting from past acceleration). In practice, this means updating the parameter w based not only on the current gradient value but also on the previous parameter update, such as in this naive implementation:

    past_velocity = 0.
    momentum = 0.1      # constant momentum factor
    while loss > 0.01:  # optimization loop
        w, loss, gradient = get_current_parameters()
        velocity = past_velocity * momentum - learning_rate * gradient
        w = w + momentum * velocity - learning_rate * gradient
        past_velocity = velocity
        update_parameter(w)

2.4.4 Chaining derivatives: The Backpropagation algorithm

In the preceding algorithm, we casually assumed that because a function is differentiable, we can easily compute its gradient. But is that true? How can we compute the gradient of complex expressions in practice? In the two-layer model we started the chapter with, how can we get the gradient of the loss with regard to the weights? That’s where the Backpropagation algorithm comes in.

THE CHAIN RULE

Backpropagation is a way to use the derivatives of simple operations (such as addition, relu, or tensor product) to easily compute the gradient of arbitrarily complex combinations of these atomic operations. Crucially, a neural network consists of many tensor operations chained together, each of which has a simple, known derivative. For instance, the model defined in listing 2.2 can be expressed as a function parameterized by the variables W1, b1, W2, and b2 (belonging to the first and second Dense layers respectively), involving the atomic operations dot, relu, softmax, and +, as well as our loss function loss, which are all easily differentiable:

    loss_value = loss(y_true, softmax(dot(relu(dot(inputs, W1) + b1), W2) + b2))

Calculus tells us that such a chain of functions can be derived using the following identity, called the chain rule. Consider two functions f and g, as well as the composed function fg such that fg(x) == f(g(x)):

    def fg(x):
        x1 = g(x)
        y = f(x1)
        return y

Then the chain rule states that grad(y, x) == grad(y, x1) * grad(x1, x). This enables you to compute the derivative of fg as long as you know the derivatives of f and g. The chain rule is named as it is because when you add more intermediate functions, it starts looking like a chain:

    def fghj(x):
        x1 = j(x)
        x2 = h(x1)
        x3 = g(x2)
        y = f(x3)
        return y

    grad(y, x) == (grad(y, x3) * grad(x3, x2) *
                   grad(x2, x1) * grad(x1, x))
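The chain rule is easy to check numerically; a sketch with invented scalar functions, using finite differences in place of grad():

    # Checking grad(y, x) == grad(y, x1) * grad(x1, x) for fg(x) = f(g(x))
    def g(x):  return 3 * x + 1
    def f(x1): return x1 ** 2
    def fg(x): return f(g(x))

    eps, x = 1e-6, 2.0
    x1 = g(x)
    grad_y_x1 = (f(x1 + eps) - f(x1)) / eps  # ~ 2 * x1 = 14
    grad_x1_x = (g(x + eps) - g(x)) / eps    # ~ 3
    grad_y_x = (fg(x + eps) - fg(x)) / eps   # ~ 42
    print(grad_y_x, grad_y_x1 * grad_x1_x)   # both close to 42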
Applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called backpropagation. Let’s see how that works, concretely.

AUTOMATIC DIFFERENTIATION WITH COMPUTATION GRAPHS

A useful way to think about backpropagation is in terms of computation graphs. A computation graph is the data structure at the heart of TensorFlow and the deep learning revolution in general. It’s a directed acyclic graph of operations—in our case, tensor operations. For instance, figure 2.21 shows the graph representation of our first model.

[Figure 2.21: The computation graph representation of our two-layer model; x → dot (with W1) → + (with b1) → relu → dot (with W2) → + (with b2) → softmax → loss (with y_true) → loss_val]

Computation graphs have been an extremely successful abstraction in computer science because they enable us to treat computation as data: a computable expression is encoded as a machine-readable data structure that can be used as the input or output of another program. For instance, you could imagine a program that receives a computation graph and returns a new computation graph that implements a large-scale distributed version of the same computation—this would mean that you could distribute any computation without having to write the distribution logic yourself. Or imagine a program that receives a computation graph and can automatically generate the derivative of the expression it represents. It’s much easier to do these things if your computation is expressed as an explicit graph data structure rather than, say, lines of ASCII characters in a .py file.

To explain backpropagation clearly, let’s look at a really basic example of a computation graph (see figure 2.22). We’ll consider a simplified version of figure 2.21, where we only have one linear layer and where all variables are scalar. We’ll take two scalar variables w and b, a scalar input x, and apply some operations to them to combine them into an output y. Finally, we’ll apply an absolute value error-loss function: loss_val = abs(y_true - y). Since we want to update w and b in a way that will minimize loss_val, we are interested in computing grad(loss_val, b) and grad(loss_val, w).

[Figure 2.22: A basic example of a computation graph; x and w feed a * node producing x1, x1 and b feed a + node producing x2, and x2 and y_true feed the loss node, producing loss_val]

Let’s set concrete values for the “input nodes” in the graph, that is to say, the input x, the target y_true, w, and b. We’ll propagate these values to all nodes in the graph, from top to bottom, until we reach loss_val. This is the forward pass (see figure 2.23).

[Figure 2.23: Running a forward pass; with x = 2, w = 3, b = 1, y_true = 4: x1 = 6, x2 = 7, loss_val = 3]

Now let’s “reverse” the graph: for each edge in the graph going from A to B, we will create an opposite edge from B to A, and ask, how much does B vary when A varies? That is to say, what is grad(B, A)? We’ll annotate each inverted edge with this value. This backward graph represents the backward pass (see figure 2.24).

[Figure 2.24: Running a backward pass; grad(x1, w) = 2, grad(x2, x1) = 1, grad(x2, b) = 1, grad(loss_val, x2) = 1]

We have the following:
- grad(loss_val, x2) = 1, because as x2 varies by an amount epsilon, loss_val = abs(4 - x2) varies by the same amount.
- grad(x2, x1) = 1, because as x1 varies by an amount epsilon, x2 = x1 + b = x1 + 1 varies by the same amount.
- grad(x2, b) = 1, because as b varies by an amount epsilon, x2 = x1 + b = 6 + b varies by the same amount.
- grad(x1, w) = 2, because as w varies by an amount epsilon, x1 = x * w = 2 * w varies by 2 * epsilon.
What the chain rule says about this backward graph is that you can obtain the derivative of a node with respect to another node by multiplying the derivatives for each edge along the path linking the two nodes. For instance, grad(loss_val, w) = grad(loss_val, x2) * grad(x2, x1) * grad(x1, w) (see figure 2.25).

[Figure 2.25: The path from loss_val to w in the backward graph]

By applying the chain rule to our graph, we obtain what we were looking for:

    grad(loss_val, w) = 1 * 1 * 2 = 2
    grad(loss_val, b) = 1 * 1 = 1

NOTE: If there are multiple paths linking the two nodes of interest, a and b, in the backward graph, we would obtain grad(b, a) by summing the contributions of all the paths.

And with that, you just saw backpropagation in action! Backpropagation is simply the application of the chain rule to a computation graph. There’s nothing more to it. Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, computing the contribution that each parameter had in the loss value. That’s where the name “backpropagation” comes from: we “back propagate” the loss contributions of different nodes in a computation graph.

Nowadays people implement neural networks in modern frameworks that are capable of automatic differentiation, such as TensorFlow. Automatic differentiation is implemented with the kind of computation graph you’ve just seen. Automatic differentiation makes it possible to retrieve the gradients of arbitrary compositions of differentiable tensor operations without doing any extra work besides writing down the forward pass. When I wrote my first neural networks in C in the 2000s, I had to write my gradients by hand. Now, thanks to modern automatic differentiation tools, you’ll never have to implement backpropagation yourself. Consider yourself lucky!

THE GRADIENT TAPE IN TENSORFLOW

The API through which you can leverage TensorFlow’s powerful automatic differentiation capabilities is the GradientTape. It’s a Python scope that will “record” the tensor operations that run inside it, in the form of a computation graph (sometimes called a “tape”). This graph can then be used to retrieve the gradient of any output with respect to any variable or set of variables (instances of the tf.Variable class). A tf.Variable is a specific kind of tensor meant to hold mutable state—for instance, the weights of a neural network are always tf.Variable instances.

    import tensorflow as tf

    x = tf.Variable(0.)                    # instantiate a scalar Variable with an initial value of 0
    with tf.GradientTape() as tape:        # open a GradientTape scope
        y = 2 * x + 3                      # inside the scope, apply some tensor
                                           # operations to our variable
    grad_of_y_wrt_x = tape.gradient(y, x)  # use the tape to retrieve the gradient of the
                                           # output y with respect to our variable x

The GradientTape works with tensor operations:

    x = tf.Variable(tf.random.uniform((2, 2)))  # instantiate a Variable with shape (2, 2)
                                                # and random initial values
    with tf.GradientTape() as tape:
        y = 2 * x + 3
    grad_of_y_wrt_x = tape.gradient(y, x)       # a tensor of shape (2, 2) (like x), describing
                                                # how y = 2 * x + 3 varies around the current x

It also works with lists of variables:

    W = tf.Variable(tf.random.uniform((2, 2)))
    b = tf.Variable(tf.zeros((2,)))
    x = tf.random.uniform((2, 2))
    with tf.GradientTape() as tape:
        y = tf.matmul(x, W) + b                       # matmul is how you say "dot product"
                                                      # in TensorFlow
    grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])  # a list of two tensors with the same
                                                      # shapes as W and b, respectively

You will learn about the gradient tape in the next chapter.
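As a sketch tying the two previous sections together, the GradientTape can reproduce the backward pass we just computed by hand (x = 2, w = 3, b = 1, y_true = 4):

    import tensorflow as tf

    x = tf.constant(2.0)
    y_true = tf.constant(4.0)
    w = tf.Variable(3.0)
    b = tf.Variable(1.0)

    with tf.GradientTape() as tape:
        x1 = w * x
        x2 = x1 + b
        loss_val = tf.abs(y_true - x2)

    grads = tape.gradient(loss_val, [w, b])
    print([g.numpy() for g in grads])  # [2.0, 1.0], matching the chain-rule result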
2.5 Looking back at our first example

You’re nearing the end of this chapter, and you should now have a general understanding of what’s going on behind the scenes in a neural network. What was a magical black box at the start of the chapter has turned into a clearer picture, as illustrated in figure 2.26: the model, composed of layers that are chained together, maps the input data to predictions. The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the model’s predictions match what was expected. The optimizer uses this loss value to update the model’s weights.

[Figure 2.26: Relationship between the network, layers, loss function, and optimizer; input X flows through weighted layers (data transformations) to predictions Y', the loss function compares Y' with the true targets Y to produce a loss score, and the optimizer uses it to perform a weight update]

Let’s go back to the first example in this chapter and review each piece of it in the light of what you’ve learned since. This was the input data:

    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
    train_images = train_images.reshape((60000, 28 * 28))
    train_images = train_images.astype("float32") / 255
    test_images = test_images.reshape((10000, 28 * 28))
    test_images = test_images.astype("float32") / 255

Now you understand that the input images are stored in NumPy tensors, which are here formatted as float32 tensors of shape (60000, 784) (training data) and (10000, 784) (test data) respectively.

This was our model:

    model = keras.Sequential([
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax")
    ])

Now you understand that this model consists of a chain of two Dense layers, that each layer applies a few simple tensor operations to the input data, and that these operations involve weight tensors. Weight tensors, which are attributes of the layers, are where the knowledge of the model persists.

This was the model-compilation step:

    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

Now you understand that sparse_categorical_crossentropy is the loss function that’s used as a feedback signal for learning the weight tensors, and which the training phase will attempt to minimize. You also know that this reduction of the loss happens via mini-batch stochastic gradient descent. The exact rules governing a specific use of gradient descent are defined by the rmsprop optimizer passed as the first argument.

Finally, this was the training loop:

    model.fit(train_images, train_labels, epochs=5, batch_size=128)

Now you understand what happens when you call fit: the model will start to iterate on the training data in mini-batches of 128 samples, 5 times over (each iteration over all the training data is called an epoch). For each batch, the model will compute the gradient of the loss with regard to the weights (using the Backpropagation algorithm, which derives from the chain rule in calculus) and move the weights in the direction that will reduce the value of the loss for this batch. After these 5 epochs, the model will have performed 2,345 gradient updates (469 per epoch), and the loss of the model will be sufficiently low that the model will be capable of classifying handwritten digits with high accuracy.

At this point, you already know most of what there is to know about neural networks.
Let’s prove it by reimplementing a simplified version of that first example “from scratch” in TensorFlow, step by step.

2.5.1 Reimplementing our first example from scratch in TensorFlow

What better demonstrates full, unambiguous understanding than implementing everything from scratch? Of course, what “from scratch” means here is relative: we won’t reimplement basic tensor operations, and we won’t implement backpropagation. But we’ll go to such a low level that we will barely use any Keras functionality at all.

Don’t worry if you don’t understand every little detail in this example just yet. The next chapter will dive in more detail into the TensorFlow API. For now, just try to follow the gist of what’s going on—the intent of this example is to help crystalize your understanding of the mathematics of deep learning using a concrete implementation. Let’s go!

A SIMPLE DENSE CLASS

You’ve learned earlier that the Dense layer implements the following input transformation, where W and b are model parameters, and activation is an element-wise function (usually relu, but it would be softmax for the last layer):

    output = activation(dot(W, input) + b)

Let’s implement a simple Python class, NaiveDense, that creates two TensorFlow variables, W and b, and exposes a __call__() method that applies the preceding transformation.

    import tensorflow as tf

    class NaiveDense:
        def __init__(self, input_size, output_size, activation):
            self.activation = activation

            # Create a matrix, W, of shape (input_size, output_size),
            # initialized with random values.
            w_shape = (input_size, output_size)
            w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=1e-1)
            self.W = tf.Variable(w_initial_value)

            # Create a vector, b, of shape (output_size,), initialized with zeros.
            b_shape = (output_size,)
            b_initial_value = tf.zeros(b_shape)
            self.b = tf.Variable(b_initial_value)

        def __call__(self, inputs):
            # Apply the forward pass.
            return self.activation(tf.matmul(inputs, self.W) + self.b)

        @property
        def weights(self):
            # Convenience property for retrieving the layer's weights.
            return [self.W, self.b]

A SIMPLE SEQUENTIAL CLASS

Now, let’s create a NaiveSequential class to chain these layers. It wraps a list of layers and exposes a __call__() method that simply calls the underlying layers on the inputs, in order. It also features a weights property to easily keep track of the layers’ parameters.

    class NaiveSequential:
        def __init__(self, layers):
            self.layers = layers

        def __call__(self, inputs):
            x = inputs
            for layer in self.layers:
                x = layer(x)
            return x

        @property
        def weights(self):
            weights = []
            for layer in self.layers:
                weights += layer.weights
            return weights

Using this NaiveDense class and this NaiveSequential class, we can create a mock Keras model:

    model = NaiveSequential([
        NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu),
        NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
    ])
    assert len(model.weights) == 4
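(A quick sanity check, not from the book: even untrained, the model defined above should map a batch of flattened 28 × 28 images to a batch of 10 class probabilities.)

    dummy_batch = tf.random.uniform((32, 28 * 28))  # a fake batch of 32 "images"
    outputs = model(dummy_batch)
    print(outputs.shape)                            # (32, 10)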
A BATCH GENERATOR

Next, we need a way to iterate over the MNIST data in mini-batches. This is easy:

    import math

    class BatchGenerator:
        def __init__(self, images, labels, batch_size=128):
            assert len(images) == len(labels)
            self.index = 0
            self.images = images
            self.labels = labels
            self.batch_size = batch_size
            self.num_batches = math.ceil(len(images) / batch_size)

        def next(self):
            images = self.images[self.index : self.index + self.batch_size]
            labels = self.labels[self.index : self.index + self.batch_size]
            self.index += self.batch_size
            return images, labels

2.5.2 Running one training step

The most difficult part of the process is the “training step”: updating the weights of the model after running it on one batch of data. We need to
1. Compute the predictions of the model for the images in the batch.
2. Compute the loss value for these predictions, given the actual labels.
3. Compute the gradient of the loss with regard to the model’s weights.
4. Move the weights by a small amount in the direction opposite to the gradient.

To compute the gradient, we will use the TensorFlow GradientTape object we introduced in section 2.4.4:

    def one_training_step(model, images_batch, labels_batch):
        # Run the "forward pass" (compute the model's predictions
        # under a GradientTape scope).
        with tf.GradientTape() as tape:
            predictions = model(images_batch)
            per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
                labels_batch, predictions)
            average_loss = tf.reduce_mean(per_sample_losses)

        # Compute the gradient of the loss with regard to the weights.
        # The output gradients is a list where each entry corresponds
        # to a weight from the model.weights list.
        gradients = tape.gradient(average_loss, model.weights)

        # Update the weights using the gradients
        # (we will define this function shortly).
        update_weights(gradients, model.weights)
        return average_loss

As you already know, the purpose of the “weight update” step (represented by the preceding update_weights function) is to move the weights by “a bit” in a direction that will reduce the loss on this batch. The magnitude of the move is determined by the “learning rate,” typically a small quantity. The simplest way to implement this update_weights function is to subtract gradient * learning_rate from each weight:

    learning_rate = 1e-3

    def update_weights(gradients, weights):
        for g, w in zip(gradients, weights):
            w.assign_sub(g * learning_rate)  # assign_sub is the equivalent of -=
                                             # for TensorFlow variables

In practice, you would almost never implement a weight update step like this by hand. Instead, you would use an Optimizer instance from Keras, like this:

    from tensorflow.keras import optimizers

    optimizer = optimizers.SGD(learning_rate=1e-3)

    def update_weights(gradients, weights):
        optimizer.apply_gradients(zip(gradients, weights))

Now that our per-batch training step is ready, we can move on to implementing an entire epoch of training.
2.5.3 The full training loop

An epoch of training simply consists of repeating the training step for each batch in the training data, and the full training loop is simply the repetition of one epoch:

    def fit(model, images, labels, epochs, batch_size=128):
        for epoch_counter in range(epochs):
            print(f"Epoch {epoch_counter}")
            batch_generator = BatchGenerator(images, labels)
            for batch_counter in range(batch_generator.num_batches):
                images_batch, labels_batch = batch_generator.next()
                loss = one_training_step(model, images_batch, labels_batch)
                if batch_counter % 100 == 0:
                    print(f"loss at batch {batch_counter}: {loss:.2f}")

Let’s test drive it:

    from tensorflow.keras.datasets import mnist

    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
    train_images = train_images.reshape((60000, 28 * 28))
    train_images = train_images.astype("float32") / 255
    test_images = test_images.reshape((10000, 28 * 28))
    test_images = test_images.astype("float32") / 255

    fit(model, train_images, train_labels, epochs=10, batch_size=128)

2.5.4 Evaluating the model

We can evaluate the model by taking the argmax of its predictions over the test images, and comparing it to the expected labels:

    import numpy as np

    predictions = model(test_images)
    predictions = predictions.numpy()  # calling .numpy() on a TensorFlow tensor
                                       # converts it to a NumPy tensor
    predicted_labels = np.argmax(predictions, axis=1)
    matches = predicted_labels == test_labels
    print(f"accuracy: {matches.mean():.2f}")

All done! As you can see, it’s quite a bit of work to do “by hand” what you can do in a few lines of Keras code. But because you’ve gone through these steps, you should now have a crystal clear understanding of what goes on inside a neural network when you call fit(). Having this low-level mental model of what your code is doing behind the scenes will make you better able to leverage the high-level features of the Keras API.

Summary

- Tensors form the foundation of modern machine learning systems. They come in various flavors of dtype, rank, and shape.
- You can manipulate numerical tensors via tensor operations (such as addition, tensor product, or element-wise multiplication), which can be interpreted as encoding geometric transformations. In general, everything in deep learning is amenable to a geometric interpretation.
- Deep learning models consist of chains of simple tensor operations, parameterized by weights, which are themselves tensors. The weights of a model are where its “knowledge” is stored.
- Learning means finding a set of values for the model’s weights that minimizes a loss function for a given set of training data samples and their corresponding targets.
- Learning happens by drawing random batches of data samples and their targets, and computing the gradient of the model parameters with respect to the loss on the batch. The model parameters are then moved a bit (the magnitude of the move is defined by the learning rate) in the opposite direction from the gradient. This is called mini-batch stochastic gradient descent.
- The entire learning process is made possible by the fact that all tensor operations in neural networks are differentiable, and thus it’s possible to apply the chain rule of derivation to find the gradient function mapping the current parameters and current batch of data to a gradient value. This is called backpropagation.
- Two key concepts you’ll see frequently in future chapters are loss and optimizers.
These are the two things you need to define before you begin feeding data into a model.
- The loss is the quantity you’ll attempt to minimize during training, so it should represent a measure of success for the task you’re trying to solve.
- The optimizer specifies the exact way in which the gradient of the loss will be used to update parameters: for instance, it could be the RMSProp optimizer, SGD with momentum, and so on.