Fundamentals of Data Science
Chapter 6  Neural Networks and Backpropagation

Neural networks are computational models inspired by the structure and function of the human brain, consisting of interconnected layers of nodes (neurons) that process and transform input data to produce predictions or classifications. These networks are composed of an input layer, one or more hidden layers, and an output layer, with each neuron applying a weighted sum of its inputs followed by a nonlinear activation function. The learning process in neural networks involves optimizing the weights of these connections to minimize a loss function, which quantifies the error between the network's predictions and the actual target values. Backpropagation is the algorithm that enables this optimization by efficiently calculating the gradient of the loss function with respect to the network's weights through the chain rule of calculus. This gradient is then used by optimization algorithms like gradient descent to update the weights iteratively. Backpropagation is fundamental to the training process of neural networks: it propagates the error signal backward through the network, allowing each layer to adjust its weights based on its contribution to the overall error. This tight relationship between neural networks and backpropagation makes it possible for deep learning models to learn complex patterns and achieve high performance across a wide range of tasks.

6.1 Backpropagation and Gradient Descent

What is Backpropagation?

Backpropagation, short for "backward propagation of errors," is an algorithm used to train neural networks. It is based on the chain rule of calculus and calculates the gradient of the loss function with respect to each weight in the network. This gradient tells us how to adjust the weights to minimize the loss function, which measures how well the network's predictions match the actual data. Backpropagation works by propagating the error from the output layer backward through the hidden layers to the input layer, systematically updating weights layer by layer. This process ensures efficient computation of gradients, especially in deep neural networks with many layers.

Why is Backpropagation Used?

Backpropagation is used because it provides a computationally efficient way to optimize the parameters of a neural network. Without backpropagation, calculating gradients for a large network would be prohibitively complex. Backpropagation allows us to leverage the structure of the network to compute gradients layer by layer, reducing computational redundancy. By iteratively minimizing the loss function, backpropagation ensures that the network learns meaningful patterns from the data, improving its performance on the training set and its generalization to unseen data. It is especially crucial in deep learning, where networks with millions of parameters require scalable and efficient optimization methods.

When is Backpropagation Used?

Backpropagation is used during the training phase of neural networks. It is employed whenever we need to optimize a network's parameters to minimize a loss function, typically in supervised learning tasks such as classification and regression. For example, it is used in image recognition to adjust the weights of convolutional neural networks, and in natural language processing to optimize recurrent neural networks. Backpropagation is also applied in more advanced architectures like transformers and autoencoders. It is essential for tasks where a network must learn to make predictions or extract features from raw data, enabling models to achieve state-of-the-art performance across various domains.
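As a reminder of how the gradients produced by backpropagation are consumed, here is a minimal sketch of the gradient descent update rule w ← w − η ∂L/∂w on a toy one-dimensional loss. The quadratic loss, the learning rate η = 0.1, and the variable names are illustrative assumptions, not part of the text above.

```python
# Minimal gradient descent sketch on a toy loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3). (Toy loss assumed for illustration.)
w = 0.0      # initial weight
eta = 0.1    # assumed learning rate

for step in range(100):
    grad = 2.0 * (w - 3.0)  # gradient of the toy loss w.r.t. w
    w -= eta * grad         # update rule: w <- w - eta * dL/dw

print(w)  # converges toward 3.0, the minimizer of L
```

In a real network the gradient in the loop body is exactly what backpropagation supplies for every weight.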
6.1.1 Gradient Computation for Machine Learning Optimization

This framework integrates the softmax loss, regularization, and nonlinear activations to create a robust model for multi-class classification. Regularization reduces overfitting, and gradient computation enables iterative optimization (e.g., via gradient descent). This approach is foundational in deep learning and extends to various architectures like feedforward and convolutional neural networks. To compute gradients for optimizing a machine learning model, we use a nonlinear score function, a softmax loss, and regularization.

Nonlinear Score Function

The score function is defined as:

s = f(x; W_1, W_2) = W_2 \max(0, W_1 x)

Here:
- W_1 and W_2 are weight matrices.
- x is the input vector.
- \max(0, W_1 x) applies the Rectified Linear Unit (ReLU) activation, which outputs 0 if the input is negative and the input itself if positive.

This score function projects the input x into a higher-dimensional space using W_1, applies the nonlinearity (ReLU), and then maps the result back to the output space using W_2. This structure is commonly seen in neural networks and allows the model to capture complex relationships between features.

Softmax Loss

The loss function for a single data point i is given as:

L_i = -\log\left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)

This is the softmax loss, which is used for multi-class classification. It computes the probability of the correct class y_i using the softmax function. The log term ensures that the loss penalizes misclassifications heavily while encouraging higher confidence for correct predictions.

- s_{y_i}: score corresponding to the true class y_i.
- \sum_j e^{s_j}: normalization term summing over all class scores.

The softmax function transforms raw scores into probabilities, and the negative log likelihood ensures the model learns to maximize the probability of the correct class.

Regularization

Regularization is added to prevent overfitting by penalizing large weight values. Here, the regularization term is:

R(W) = \sum_k W_k^2

This is L2 regularization (also called weight decay). It penalizes large weights, encouraging the model to prefer simpler solutions that generalize better to unseen data. Separate regularization terms are applied to W_1 and W_2.

Total Loss

The total loss for the model combines the data loss (softmax loss) and the regularization term:

L = \frac{1}{N}\sum_{i=1}^{N} L_i + \lambda R(W_1) + \lambda R(W_2)

Here:
- L_i: data loss for each sample.
- R(W_1) and R(W_2): regularization terms for the weights.
- \lambda: regularization strength, controlling the trade-off between minimizing the data loss and penalizing large weights.
- N: number of samples in the dataset.

The total loss ensures that the model learns patterns from the data (via L_i) while avoiding overfitting by keeping weights small (via R(W)).

Gradient Computation

To optimize W_1 and W_2, we need the gradients of the total loss L with respect to each:

\frac{\partial L}{\partial W_1}, \quad \frac{\partial L}{\partial W_2}

These gradients are computed using backpropagation:

1. Compute the gradients of the softmax loss with respect to W_2 and W_1.
2. Compute the gradient of the regularization term:

\frac{\partial}{\partial W_k} \sum_j W_j^2 = 2 W_k

3. Combine these gradients to update W_1 and W_2.

A worked numerical sketch of this whole pipeline follows.
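To make the pipeline concrete, here is a minimal NumPy sketch of this model on a single sample: the forward pass, the softmax loss with L2 regularization, and the gradients of both weight matrices obtained via the chain rule. The dimensions, initialization scale, and variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Minimal sketch, assuming a single sample and small illustrative sizes.
rng = np.random.default_rng(0)
D, H, C = 4, 5, 3                     # input, hidden, and class dimensions
x = rng.standard_normal(D)            # one input vector
y = 1                                 # index of the true class
W1 = rng.standard_normal((H, D)) * 0.01
W2 = rng.standard_normal((C, H)) * 0.01
lam = 1e-3                            # regularization strength lambda

# Forward pass: s = W2 . max(0, W1 x)
h = np.maximum(0.0, W1 @ x)           # ReLU hidden activations
s = W2 @ h                            # class scores

# Softmax loss L_i = -log(e^{s_y} / sum_j e^{s_j}), computed stably
s_shifted = s - s.max()
p = np.exp(s_shifted) / np.exp(s_shifted).sum()
data_loss = -np.log(p[y])
reg_loss = lam * (W1 ** 2).sum() + lam * (W2 ** 2).sum()
total_loss = data_loss + reg_loss

# Backward pass (chain rule): dL/ds = p - one_hot(y)
ds = p.copy()
ds[y] -= 1.0
dW2 = np.outer(ds, h) + 2 * lam * W2  # data gradient + regularization gradient
dh = W2.T @ ds                        # propagate through the linear layer
dh[h <= 0.0] = 0.0                    # ReLU gate: no gradient where input <= 0
dW1 = np.outer(dh, x) + 2 * lam * W1
```

Each backward line mirrors one step of the list above: the softmax derivative, the propagation through the linear maps, and the added 2λW term from the regularizer.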
6.1.2 Challenges of Manual Gradient Derivation in Machine Learning

Deriving gradients manually in machine learning leads to several problems.

Problem: Tediousness of Matrix Calculus

Deriving gradients ∇_W L manually for machine learning models can be an extremely tedious process, particularly because it requires meticulous matrix calculus. Models often involve layers of weights and activation functions, making the computation of partial derivatives highly intricate. Each layer adds new terms and dependencies, leading to complex expressions that must be carefully expanded and simplified. This process becomes cumbersome and error-prone when dealing with deep networks, where small errors in manual calculations can propagate and invalidate the entire derivation. This reliance on manual derivation is not only time-consuming but also inefficient, given the repeated calculations that could be automated.

Problem: Changing Loss Functions

If the loss function in a model needs to be changed, such as replacing a softmax loss with an L2 loss, the entire derivation of gradients must be revisited. This means re-evaluating and re-deriving all the partial derivatives associated with the new loss function, which is both time-consuming and impractical. Models often need to adapt to different tasks or optimization goals, requiring flexibility in the choice of loss functions. Manual derivation hinders this adaptability because each change involves starting from scratch, recalculating gradients for all parameters, and validating the correctness of the new derivation.

Problem: Infeasibility for Complex Models

For very complex models, manual gradient derivation becomes nearly impossible. Modern neural networks often have millions of parameters and multiple layers with intricate interdependencies. For such models, the sheer number of terms involved in calculating gradients makes manual derivation infeasible. Additionally, the increasing use of nonlinear activation functions and advanced architectures like transformers adds further complexity. Automating the gradient computation process using algorithms like backpropagation not only saves time but also ensures scalability, enabling researchers to focus on improving model design rather than getting bogged down in labor-intensive mathematical derivations (see the sketch below).
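As an illustration of how automation sidesteps these problems, here is a minimal sketch using PyTorch's automatic differentiation (one possible tool; the text only names backpropagation in general). The tensors, shapes, and the one-line loss swap are our illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: automatic differentiation removes manual re-derivation.
torch.manual_seed(0)
x = torch.randn(8, 4)                      # hypothetical batch of inputs
y = torch.randint(0, 3, (8,))              # hypothetical class labels
W = torch.randn(4, 3, requires_grad=True)  # weights to optimize

scores = x @ W
loss = F.cross_entropy(scores, y)          # softmax loss
loss.backward()                            # gradients derived automatically
print(W.grad.shape)                        # torch.Size([4, 3])

# Swapping the loss (e.g., softmax -> L2) needs no new derivation:
W.grad = None
loss = F.mse_loss(x @ W, F.one_hot(y, 3).float())  # L2-style loss
loss.backward()                            # same call, new gradients
```

The second `backward()` call reuses the same machinery; no partial derivative was re-derived by hand.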
6.1.3 Computational Graphs + Backpropagation to Efficiently Compute Gradients

Figure 6.1: Computational Graphs + Backpropagation

The diagram introduces the concept of using computational graphs combined with backpropagation to efficiently compute gradients for optimization in machine learning models.

Forward Pass and the Computational Graph

In this diagram, the forward pass begins with the input vector x and the weight matrix W. The operation f = W · x computes a score vector s representing the linear transformation of the inputs by the weights. This step is visualized as a simple multiplication node (∗) in the computational graph. The output s, known as the scores, is then fed into a softmax loss function, defined as:

L_i = -\log\left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right),

where s_{y_i} represents the score for the true class y_i. This loss function measures how well the model's predictions align with the true class by converting scores into probabilities and penalizing incorrect predictions. The computed loss captures the discrepancy between predictions and actual labels. This is the data loss, which quantifies prediction error.

Simultaneously, a regularization term R(W) is computed as:

R(W) = \sum_k W_k^2,

where W_k represents individual weight parameters. Regularization penalizes large weight values to prevent overfitting by introducing a smoothness constraint on the model. The total loss L is therefore:

L = \frac{1}{N}\sum_{i=1}^{N} L_i + \lambda R(W),

where λ controls the trade-off between data loss and regularization.

Backward Pass and Gradient Computation

The backward pass utilizes the computational graph to efficiently compute gradients of the total loss L with respect to the weights W using backpropagation. Backpropagation applies the chain rule of calculus, systematically traversing the computational graph in reverse to propagate errors from the output layer back to the inputs. Specifically, the gradient ∂L/∂W is computed by:

1. Calculating the derivative of the softmax loss with respect to the scores, ∂L_i/∂s.
2. Propagating this derivative back to the weights W through the linear transformation f = W · x.
3. Adding the derivative of the regularization term,

\lambda \frac{\partial R(W)}{\partial W} = 2\lambda W,

ensuring the penalty term is incorporated.

Advantages of Computational Graphs

Computational graphs make gradient computation modular and scalable. Each operation (e.g., multiplication, addition, softmax) is represented as a node in the graph with well-defined inputs, outputs, and partial derivatives. When the model architecture or loss function changes, only the relevant nodes need adjustment, avoiding the need to manually re-derive gradients. This modularity is crucial for complex models with multiple layers or novel loss functions.

Example 1 of Backpropagation: f(x, y, z) = (x + y)z

This example demonstrates backpropagation using the simple function f(x, y, z) = (x + y)z. The process is broken into computational steps to compute the desired derivatives ∂f/∂x, ∂f/∂y, and ∂f/∂z.

Step 1: Forward Pass

The computation of the function proceeds as follows:

1. Calculate the intermediate value q:

q = x + y

Given x = -2 and y = 5, the result is q = -2 + 5 = 3.

2. Compute the final output f:

f = qz

Substituting q = 3 and z = -4, we get f = 3 × (-4) = -12.

At this point, the forward pass is complete. We have the output f = -12 and the intermediate value q = 3.

Figure 6.2: Forward pass through the computational graph of f(x, y, z) = (x + y)z

Step 2: Backward Pass - Gradients of f

Now, we compute the partial derivatives of f with respect to x, y, and z using the chain rule. The steps are:

1. Compute the gradients of f with respect to q and z:

\frac{\partial f}{\partial q} = z \quad \text{and} \quad \frac{\partial f}{\partial z} = q

Substituting z = -4 and q = 3, we get ∂f/∂q = -4 and ∂f/∂z = 3.

2. Compute the gradients of q with respect to x and y:

\frac{\partial q}{\partial x} = 1 \quad \text{and} \quad \frac{\partial q}{\partial y} = 1

3. Apply the chain rule to compute ∂f/∂x and ∂f/∂y. Using

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} \quad \text{and} \quad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y},

we compute:

\frac{\partial f}{\partial x} = (-4)(1) = -4, \quad \frac{\partial f}{\partial y} = (-4)(1) = -4

Step 3: Final Results

The computed gradients are:

\frac{\partial f}{\partial x} = -4, \quad \frac{\partial f}{\partial y} = -4, \quad \frac{\partial f}{\partial z} = 3

These derivatives represent the sensitivity of the output f to small changes in the inputs x, y, and z.

Figure 6.3: Images 1–4 (panels (a)–(e))

This example illustrates how backpropagation is a systematic application of the chain rule, breaking down the computation into smaller, manageable steps (a forward pass for values and a backward pass for derivatives). This approach scales well to more complex models and forms the foundation of training neural networks (see the sketch below).
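Example 1 is small enough to transcribe directly into code. Here is a minimal Python sketch of the forward and backward pass for f(x, y, z) = (x + y)z, following the steps above; the variable names are ours.

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y              # intermediate value: q = 3
f = q * z              # output: f = -12

# Backward pass: apply the chain rule node by node
df_dq = z              # df/dq = z = -4
df_dz = q              # df/dz = q = 3
dq_dx = 1.0            # dq/dx = 1
dq_dy = 1.0            # dq/dy = 1
df_dx = df_dq * dq_dx  # df/dx = (-4)(1) = -4
df_dy = df_dq * dq_dy  # df/dy = (-4)(1) = -4

print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```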
Example 2 of Backpropagation: f(w, x) = 1 / (1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)})

This example demonstrates the forward and backward pass in a computational graph representing the sigmoid function

f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}.

Step 1: Forward Propagation

The computational graph is designed to break down the calculation of f(w, x) into its atomic operations. Each node in the graph represents a basic operation (addition, multiplication, exponential, reciprocal), and intermediate variables are calculated step by step.

Figure 6.4: Forward Propagation

1. Input Initialization: assign the input values w_0 = 2, x_0 = -1, w_1 = -3, x_1 = -2, w_2 = -3.

2. Intermediate Calculations:
- Compute w_0 · x_0: 2 · (-1) = -2.
- Compute w_1 · x_1: (-3) · (-2) = 6.
- Add these results: -2 + 6 = 4.
- Add w_2: 4 + (-3) = 1.
- Apply the negative sign: -1.
- Compute the exponential: e^{-1} ≈ 0.37.
- Add 1 to this value: 0.37 + 1 = 1.37.
- Compute the reciprocal: 1/1.37 ≈ 0.73.

At the end of the forward pass, the final value is f(w, x) ≈ 0.73.

Step 2: Backward Propagation

In backward propagation, we calculate the gradients (partial derivatives) of f with respect to each input w_0, w_1, w_2, x_0, x_1. This is done using the chain rule of calculus, propagating gradients from the output node back to the inputs. If the loss function is of the form L = f, where f is the output of the reciprocal node, the derivative of L with respect to f is

\frac{\partial L}{\partial f} = 1.

This is because L is directly equal to f, meaning any change in f leads to an equivalent change in L, resulting in a gradient of 1.00.

1. Final Node (Reciprocal): The local gradient of the reciprocal is

\frac{\partial}{\partial x}\left(\frac{1}{x}\right) = -\frac{1}{x^2}.

Upstream gradient: ∂L/∂f = 1. Downstream gradient:

\frac{\partial L}{\partial z} = -\frac{1}{1.37^2} ≈ -0.53.

Figure 6.5: Backward Propagation: Final Node (Reciprocal)

2. Addition Node (+1): The gradient of addition is ∂(x + c)/∂x = 1. Upstream gradient: -0.53. Downstream gradient: ∂L/∂z = -0.53.

Figure 6.6: Backward Propagation: Addition Node (+1)

3. Exponential Node (e^x): The gradient of the exponential is ∂(e^x)/∂x = e^x. Upstream gradient: -0.53. Downstream gradient: ∂L/∂x = -0.53 · e^{-1} ≈ -0.2.

Figure 6.7: Backward Propagation: Exponential Node (e^x)

4. Negative Sign Node: The gradient of negation is ∂(-x)/∂x = -1. Upstream gradient: -0.2. Downstream gradient: ∂L/∂x = 0.2.

Figure 6.8: Backward Propagation: Negative Sign Node

5. Addition Node (Sum of Weighted Inputs): At the "+" node, the operation is a simple addition:

q = a + b,

where a = 4.00 (the output from one multiplication node) and b = -3.00 (the output from w_2). This addition results in q = 1.00, which is sent to downstream nodes for further computation. In backpropagation, gradients are propagated from downstream nodes to upstream nodes using the chain rule. Let's consider the following:

(a) Local Gradient of the "+" Node: For any addition operation q = a + b, the partial derivatives of q with respect to its inputs are both equal to 1:

\frac{\partial q}{\partial a} = 1, \quad \frac{\partial q}{\partial b} = 1.

This is because changing a or b by 1 unit will change q by exactly 1 unit, independent of the other input.

(b) Upstream Gradient Entering the "+" Node: From the downstream node (here, the "∗ −1" negation node), the upstream gradient is passed as 0.2. This upstream gradient represents how the output of the "+" node influences the loss function L.

(c) Combining Upstream and Local Gradients: The chain rule is used to combine the upstream gradient with the local gradient at the "+" node.
For each input of the "+" node:

\text{Gradient w.r.t. } a = (\text{upstream gradient}) \times \frac{\partial q}{\partial a} = 0.2 \times 1 = 0.2

\text{Gradient w.r.t. } b = (\text{upstream gradient}) \times \frac{\partial q}{\partial b} = 0.2 \times 1 = 0.2

Thus, both inputs a and b receive an upstream gradient of 0.2. The gradients of the "+" node are now passed back to its inputs: the gradient w.r.t. a (the value 4.00) is 0.2, and the gradient w.r.t. b (the value -3.00) is also 0.2. These gradients will then be combined with other gradients further upstream in the network during subsequent backpropagation steps.

Figure 6.9: Backward Propagation: Addition Node (Sum of Weighted Inputs)

6. Multiplication Nodes: At this node, the operation is defined as

q = w_0 \cdot x_0,

where w_0 = 2.00 and x_0 = -1.00. The output of the node, q = -2.00, is sent to downstream nodes. In backpropagation, the gradient of the loss L with respect to the inputs w_0 and x_0 is calculated using the chain rule:

\frac{\partial L}{\partial w_0} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial w_0}, \quad \frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial x_0}

(a) Upstream Gradient (∂L/∂q): From the downstream node (in this case, the "+" node), the upstream gradient is 0.2. This gradient represents how sensitive the loss L is to changes in the output q of this multiplication node.

(b) Local Gradients (∂q/∂w_0 and ∂q/∂x_0): The local gradient of q with respect to w_0 is ∂q/∂w_0 = x_0 = -1.00, and the local gradient of q with respect to x_0 is ∂q/∂x_0 = w_0 = 2.00.

(c) Combining Upstream and Local Gradients: The chain rule is applied to calculate the gradients with respect to w_0 and x_0:

\frac{\partial L}{\partial w_0} = (\text{upstream gradient}) \cdot (\text{local gradient w.r.t. } w_0) = 0.2 \cdot (-1.00) = -0.2

\frac{\partial L}{\partial x_0} = (\text{upstream gradient}) \cdot (\text{local gradient w.r.t. } x_0) = 0.2 \cdot 2.00 = 0.4

The gradient of the loss with respect to w_0 is -0.2: increasing w_0 slightly will decrease the loss by about 0.2 units per unit change. The gradient of the loss with respect to x_0 is 0.4: increasing x_0 slightly will increase the loss by about 0.4 units per unit change. These gradients are then propagated further upstream to adjust w_0 and x_0 during the parameter update step in training. A complete code sketch of this example follows.

Figure 6.10: Backward Propagation: Multiplication Node
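To close the example, here is a minimal Python sketch that reproduces the forward values and the node-by-node backward pass above, including the symmetric multiplication node for w_1 and x_1 that the text leaves implicit. The rounding in the comments mirrors the figures; variable names are ours.

```python
import math

# Forward pass for f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
a = w0 * x0           # -2.0
b = w1 * x1           #  6.0
s = a + b + w2        #  1.0  (sum of weighted inputs)
neg = -s              # -1.0
exp = math.exp(neg)   # ~0.37
denom = 1.0 + exp     # ~1.37
f = 1.0 / denom       # ~0.73

# Backward pass, node by node (chain rule), with L = f so dL/df = 1
ddenom = -1.0 / denom**2      # reciprocal node: ~ -0.53
dexp = ddenom * 1.0           # "+1" node passes the gradient through
dneg = dexp * math.exp(neg)   # exponential node: ~ -0.20
ds = dneg * -1.0              # negation node: ~ 0.20
da = ds * 1.0                 # addition splits the gradient equally
db = ds * 1.0
dw2 = ds * 1.0                # ~ 0.20
dw0 = da * x0                 # multiplication node: ~ -0.20
dx0 = da * w0                 # ~ 0.40
dw1 = db * x1                 # ~ -0.40
dx1 = db * w1                 # ~ -0.60

print(round(f, 2))                  # 0.73
print(dw0, dx0, dw1, dx1, dw2)      # ~ -0.2  0.4  -0.4  -0.6  0.2
```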