Back-propagation Process AI-ML2 PDF

Summary

This document provides a detailed explanation of the back-propagation process in artificial neural networks, including the calculations and algorithms involved. It explains how the weights in a neural network's layers are optimized using algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), and covers the Dropout regularization technique.

Full Transcript


05 Back-propagation Process

The Learning Process: After a prediction is made, the error is computed by comparing the predicted output to the actual target values. The loss function 𝓛 indicates how far the predictions are from the true labels. Knowing that the prediction is wrong is not enough; the model must understand how to adjust each weight to reduce this error. Back-propagation calculates each weight's contribution to the error, enabling the network to adjust its parameters accordingly.

[Figure: a feed-forward pass (input layer → hidden layers → output layer) produces the loss; back-propagation then carries the gradient of the loss backward through the network to update the weights.]

Gradient of the Loss: The gradient of the loss function, ∂𝓛/∂W^l, represents how sensitive the loss function 𝓛 is to changes in the weight matrix W^l. Thus, it tells us how much the loss will change if the weights in layer l are adjusted slightly. The symbol ∂ denotes a partial derivative, which measures the rate of change of a multivariable function (in this case, the loss function) with respect to one variable (the weight matrix) while keeping the other variables constant.

If the gradient is positive, the loss increases as the weight increases, so we should decrease the weight to minimize the loss. If the gradient is negative, the loss decreases as the weight increases, so we should increase the weight to minimize the loss.

Chain Rule and Gradient Computation: The core mechanism of back-propagation relies on the chain rule of calculus to compute gradients of the loss function with respect to the weights. This is crucial because the loss function depends on all the weights in the network: the loss depends on the network's outputs, which in turn depend on the weights. Using the chain rule, we decompose the gradient into smaller, manageable local derivatives. The gradient of the loss function with respect to the weight matrix W^l in layer l is given by:

    ∂𝓛/∂W^l = (∂𝓛/∂a^l) · (∂a^l/∂z^l) · (∂z^l/∂W^l)

where:
- ∂𝓛/∂a^l is the derivative of the loss function with respect to the activation a^l at layer l,
- ∂a^l/∂z^l is the derivative of the activation with respect to the pre-activation value z^l,
- ∂z^l/∂W^l is the derivative of the pre-activation value z^l with respect to the weight matrix W^l,
- z^l is the weighted sum of the inputs at layer l, i.e., z^l = W^l a^(l−1) + b^l.

This chain of derivatives allows us to compute the contribution of each weight to the overall loss, guiding the model to adjust the weights in a direction that reduces the loss, as the sketch below illustrates.
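To make the chain rule concrete, here is a minimal numpy sketch (an illustration, not from the original slides) of one feed-forward and one back-propagation pass through a tiny one-hidden-layer network with sigmoid activations and a squared-error loss; the layer sizes and all variable names (W1, W2, a1, delta2, ...) are assumptions chosen for the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    # Tiny network: 3 inputs -> 4 hidden units -> 1 output.
    W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
    W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

    x = rng.normal(size=(3, 1))   # one training example
    y = np.array([[1.0]])         # its target

    # Feed-forward: z^l = W^l a^(l-1) + b^l, followed by a^l = f(z^l).
    a0 = x
    z1 = W1 @ a0 + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # Back-propagation, applying the chain rule at the output layer:
    # dL/dW2 = (dL/da2) * (da2/dz2) * (dz2/dW2).
    dL_da2 = a2 - y              # derivative of the squared-error loss
    da2_dz2 = a2 * (1 - a2)      # sigmoid'(z2), expressed via a2
    delta2 = dL_da2 * da2_dz2    # dL/dz2
    dL_dW2 = delta2 @ a1.T       # dz2/dW2 brings in the previous activation a1
    dL_db2 = delta2

    # Propagate the gradient backward to the hidden layer (dz2/da1 = W2).
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dL/dz1
    dL_dW1 = delta1 @ a0.T
    dL_db1 = delta1

    print(loss, dL_dW1.shape, dL_dW2.shape)

Each gradient has the same shape as the weight matrix it corresponds to, which is exactly what an optimizer needs in order to update every weight in proportion to its contribution to the error.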
Optimization Algorithms: Optimization algorithms play a crucial role in training neural networks by minimizing the loss function. The most commonly used algorithms are Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam).

Stochastic Gradient Descent (SGD): SGD updates the weights by computing the gradient on small batches of data. The update rule for SGD is:

    W_{t+1} = W_t − η g_t

where g_t is the gradient of the loss with respect to the weights W_t at time step t. While SGD is computationally efficient, using random batches of data adds some noise, which can slow down finding the best solution. However, this noise also helps the model avoid getting stuck in bad solutions.

Learning Rate: The learning rate (η) controls the step size of weight updates and is crucial for how back-propagation updates the weights. A high learning rate can speed up training, but it may cause large updates that overshoot the optimal weights, leading to oscillation, increased loss, or failure to converge. Conversely, a low learning rate ensures precise updates but slows training excessively and can leave the model in a suboptimal solution. Tuning the learning rate is therefore critical for efficient and effective model training.

[Figure: minimizing the loss with a small learning rate (e.g., 0.0001) versus a large one (e.g., 0.01).]

Learning Rate Scheduling: Learning rate schedules adjust the learning rate during training. A common technique is to start with a high learning rate and gradually decrease it as training progresses, which allows large initial updates and finer weight adjustments later. Several techniques exist (including custom methods); a simple decay schedule is sketched below.

Adam (Adaptive Moment Estimation): The Adam optimizer improves upon SGD by using adaptive learning rates for each parameter. It combines the Momentum and RMSProp (Root Mean Square Propagation) techniques to accelerate convergence.

Momentum: The Momentum optimizer helps speed up learning and smooth out the gradient updates. Instead of using only the current gradient to update the weights, Momentum also considers past gradients to build velocity, so the optimizer can move faster in the right direction:

    m_t = β₁ m_{t−1} + (1 − β₁) g_t

where:
- m_t is the velocity (an average of past gradients) at time step t, representing the momentum,
- m_{t−1} is the previous momentum value from time step t−1,
- β₁ is a factor (usually 0.9) controlling how much of the past gradient to use,
- g_t is the gradient of the loss function with respect to the parameters (weights) W_t at time step t.

Momentum helps overcome slow progress when the gradient zig-zags (by smoothing updates) and speeds up movement in the right direction.

RMSprop: The RMSprop optimizer adapts the learning rate for each parameter to handle cases where gradients vary widely. It keeps track of the squared gradients and adjusts the learning rate for each parameter by dividing the gradient by the square root of its recent squared gradients, so that large gradients don't lead to overly large steps and small gradients don't lead to overly small steps:

    v_t = β₂ v_{t−1} + (1 − β₂) g_t²,   W_{t+1} = W_t − η g_t / (√v_t + ε)

where:
- v_t is the moving average of squared gradients at time step t,
- ε is a small constant (e.g., 10⁻⁸) added to avoid division by zero.

RMSprop adapts the learning rate for each parameter based on how volatile its gradient is, making training more stable.

Note that RMSprop adjusts the learning rate by considering the history of squared gradients (but not the gradient direction itself, as Momentum does), while Momentum focuses on smoothing the direction of updates by considering past gradients. Although both "remember" past information, they serve different purposes:
- Momentum: speeds up learning by considering past gradients (adds velocity).
- RMSprop: adjusts learning rates based on how large or small the gradients are, making updates more adaptive and stable.

Both of these optimizers are combined in the Adam optimizer, which scales the update by both the momentum and the magnitude of recent gradients:

    W_{t+1} = W_t − η m_t / (√v_t + ε)

Adam is widely used due to its fast convergence and its ability to handle sparse gradients.
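As a concrete illustration (not from the slides), here is a minimal numpy sketch of the SGD update rule combined with a simple exponential learning-rate decay schedule; the quadratic toy loss, the initial rate of 0.1, and the decay factor 0.99 are all assumptions chosen for the example.

    import numpy as np

    # Toy loss: L(w) = 0.5 * ||w - w_star||^2, whose gradient is (w - w_star).
    w_star = np.array([3.0, -2.0])

    def grad(w):
        return w - w_star

    w = np.zeros(2)
    eta = 0.1                # initial learning rate

    for step in range(200):
        g = grad(w)          # in real SGD this is computed on a random mini-batch
        w = w - eta * g      # SGD update: W <- W - eta * gradient
        eta *= 0.99          # schedule: large steps early, finer steps later

    print(w)                 # close to w_star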
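The Momentum, RMSprop, and combined Adam updates described above can likewise be sketched in a few lines (again an illustration, using the common defaults β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸; note that the standard Adam algorithm also bias-corrects m_t and v_t, a step the slides do not show).

    import numpy as np

    def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update: Momentum + RMSprop with bias correction."""
        m = beta1 * m + (1 - beta1) * g       # Momentum: average of past gradients
        v = beta2 * v + (1 - beta2) * g**2    # RMSprop: average of squared gradients
        m_hat = m / (1 - beta1**t)            # bias correction for early steps
        v_hat = v / (1 - beta2**t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # scaled update
        return w, m, v

    # Usage on the same toy quadratic loss as above.
    w_star = np.array([3.0, -2.0])
    w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
    for t in range(1, 501):                   # t starts at 1 for bias correction
        g = w - w_star                        # gradient of 0.5 * ||w - w_star||^2
        w, m, v = adam_step(w, g, m, v, t, eta=0.1)   # larger eta for this toy loss
    print(w)                                  # approaches w_star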
Propagation of Gradients: Back-propagation calculates the gradient at the output layer and propagates this information backward through the network. At each layer, the gradient indicates how much the weights contributed to the overall error. Using the chain rule, we compute the gradient of the loss with respect to each weight, layer by layer. This backward flow allows the model to "assign blame" for the error to individual weights, identifying which contributed most and by how much. The weights are then updated to reduce the error in the next training iteration, with larger updates for the weights that had a greater impact on the error.

06 Dropout

Regularization: Regularization is a technique used to prevent a model from overfitting (performing well on training data but poorly on new data). It works by adding a penalty to the model's complexity during training, which encourages the model to be simpler and more generalizable. Several regularization methods exist (such as L1, L2, ElasticNet, and Dropout). Regularization helps the model avoid memorizing the training data and instead focus on learning patterns that work well for unseen data.

Dropout: Dropout is a technique used to prevent overfitting in neural networks. During training, it randomly "drops out" (turns off) a percentage of the neurons in a layer. This means that some neurons are ignored during each training step, forcing the model not to rely too much on specific neurons and encouraging it to learn more robust patterns. A dropout layer is specified by its dropout rate: Dropout(dropout_rate).

Dropout probabilities:
1. Retention probability (p): the probability that a neuron is kept (not dropped out) during training.
2. Dropout probability (1 − p): the probability that a neuron is dropped out during training.
Thus p = 1 − dropout_rate.

[Figure: a perceptron whose inputs y_1, ..., y_n are masked by dropout variables r_i, giving ỹ_i = r_i · y_i; the neuron then computes f(Σ_{i=1}^{n} w_i ỹ_i + b).]

For layer l, each activation is masked as ỹ_i^l = r_i^l · y_i^l, where r^l is a vector of independent Bernoulli random variables, each of which has probability p of being 1.

During each training iteration, for each neuron, there is a probability p that it will be retained and a probability 1 − p that it will be dropped. Mathematically, the dropout operation on the neuron activations can be written as:

    h̃^l = r ⊙ h^l,   r ~ Bernoulli(p)

where:
- h^l is the vector of activations from layer l,
- r is a random binary mask with each element sampled independently from a Bernoulli distribution with probability p,
- ⊙ denotes element-wise multiplication.

At test time, the network uses the full set of neurons, but the activations are scaled by the retention probability p to account for the effect of dropout during training. This scaling ensures that the expected output during training matches the output at test time.

Standard Dropout: no scaling during training; scaling by p applied during testing.
Inverted Dropout: scales by 1/p during training; no adjustment needed during testing.

There are several other types of dropout beyond the standard and inverted versions, designed to improve performance in specific situations or to adapt to different types of neural networks.
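The two dropout variants can be sketched as follows (a minimal numpy illustration, not from the slides; the vector size and the retention probability p = 0.8 are assumptions chosen for the example).

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_train(h, p, inverted=True):
        """Apply dropout to activations h with retention probability p."""
        r = rng.binomial(1, p, size=h.shape)   # Bernoulli(p) mask
        h_masked = r * h                       # element-wise multiplication
        if inverted:
            return h_masked / p                # inverted dropout: scale at train time
        return h_masked                        # standard dropout: no train-time scaling

    def dropout_test(h, p, inverted=True):
        """At test time all neurons are kept."""
        if inverted:
            return h                           # inverted dropout: no adjustment needed
        return p * h                           # standard dropout: scale by p

    h = rng.normal(size=10)    # activations of some layer
    p = 0.8                    # retention probability = 1 - dropout_rate
    print(dropout_train(h, p))
    print(dropout_test(h, p))

In both variants the expected value of a training-time activation matches the test-time activation, which is exactly the consistency property described above.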
07 Conclusion

✓ In this chapter, we introduced the fundamentals of deep learning, focusing on Multilayer Perceptrons.
✓ We covered the feed-forward and back-propagation algorithms, optimization techniques such as SGD and Adam, and the Dropout regularization method.
✓ These concepts provide a foundation for understanding the more complex deep learning architectures covered in the following chapters.
