Lecture 2B - Backpropagation using Computational Graphs.pdf
Lecture 2B: Backpropagation using Computational Graphs
KEVIN BRYSON

What are computational graphs?

Backpropagation on a computational graph
1. For each sample, make a prediction (forward pass).
   ◦ States of units (i.e. values) are propagated from the input layer to the output layer when doing "out = model(x)".
2. Measure the output error of the model (i.e. the loss).
3. Go back through each layer to measure the error contribution (i.e. the gradient of the loss) from each connection (reverse pass).
   ◦ Gradients of the error are propagated backwards from the output layer to the input layer.
4. Make a small change to the weights in the negative gradient direction to reduce the output error (gradient descent).

In PyTorch: The Computational Graph
[Figure: the computational graph as built by PyTorch. Image from Deep Learning with PyTorch, Manning Publishers.]

Chain rule of calculus

Gradient descent for our neuron
[Diagram: a single neuron with weights w1, w2, w3, bias b and linear output z.]
Gradient of the loss: \( \nabla L = \left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \frac{\partial L}{\partial w_3}, \frac{\partial L}{\partial b} \right) \)
Update: \( w_i \leftarrow w_i - \eta \, \frac{\partial L}{\partial w_i} \), where \( \eta \) is the learning rate.

Chain rule for our neuron
The chain rule splits the derivative of the loss with respect to each weight into three factors: the derivative of the loss function, the derivative of the activation function, and the derivative of the linear component with respect to that weight:
\( \frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_i} \)

Loss derivative
[Diagram: the same neuron, highlighting the loss-derivative factor \( \partial L / \partial \hat{y} \).]

Activation derivative
[Diagram: the same neuron, highlighting the activation-derivative factor \( \partial \hat{y} / \partial z \).]

Linear part derivative
[Diagram: the same neuron, highlighting the linear-part factor \( \partial z / \partial w_i \).]

Putting it all together
[Diagram: the same neuron with the three factors combined into \( \partial L / \partial w_i \).]

Now imagine that the previous layer was a hidden layer with neuron values h1, h2, h3...
[Diagram: hidden values h1, h2, h3 feed the neuron through weights w1, w2, w3 into z, so the derivative of the linear part with respect to \( w_i \) is now \( h_i \).]

What happens with a deep network?
[Diagram: layers i−1 and i of a deep network, with weights such as \( w_{11}^{(i)} \), \( w_{23}^{(i)} \), \( w_{11}^{(i+1)} \), \( w_{21}^{(i+1)} \), \( w_{31}^{(i+1)} \); the loss derivative for each weight again factors into the derivative of the loss function, the derivative of the activation function, and the derivative of the linear component.]

What happens with a deep network?
The derivative of the loss at a hidden unit is the sum of the derivatives from the neurons it contributes to in the next layer:
\[ \frac{\partial L}{\partial h_j^{(i-1)}} = \sum_k w_{jk}^{(i)} \, \phi'\big(z_k^{(i)}\big) \, \frac{\partial L}{\partial h_k^{(i)}} \]

Why knowing this theory is useful … the vanishing gradient problem
Consider a simple deep neural network:
◦ 5 layers
◦ A single neuron per layer
◦ How does an error propagate?
[Diagram: a chain \( x \to h_1 \to h_2 \to h_3 \to h_4 \to y_5 \) with weights \( w_1, w_2, w_3, w_4, w_5 \).]

The vanishing gradient problem

The vanishing gradient problem
Similarly, for n layers we have:
[Diagram: the chain \( h_0 = x \to h_1 \to h_2 \to h_3 \to h_4 \to y_5 \) with weights \( w_1, \dots, w_5 \), annotated with the gradients \( \partial L / \partial w_1, \dots, \partial L / \partial w_5 \).]

The vanishing gradient problem
▪ Multiplying by numbers smaller than 1 over and over again shrinks the result towards zero, so the gradients reaching the early layers become vanishingly small.
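
To make the four backpropagation steps above concrete, here is a minimal PyTorch sketch of one training iteration on a toy model. The model architecture, data, loss and learning rate are illustrative assumptions, not taken from the lecture; only the four-step structure (forward pass, loss, reverse pass, gradient-descent update) mirrors the slides.

```python
# A minimal sketch of one backpropagation iteration in PyTorch.
# The model, data, loss and learning rate are illustrative, not from the lecture.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 4), nn.Sigmoid(), nn.Linear(4, 1))
loss_fn = nn.MSELoss()
lr = 0.1

x = torch.randn(8, 3)   # a batch of 8 samples with 3 features each
y = torch.randn(8, 1)   # matching targets

out = model(x)          # 1. forward pass: values flow from input to output
loss = loss_fn(out, y)  # 2. measure the output error (the loss)
loss.backward()         # 3. reverse pass: gradients flow from output back to input

with torch.no_grad():   # 4. gradient descent: move each weight a small step
    for p in model.parameters():
        p -= lr * p.grad    # step in the negative gradient direction
        p.grad = None       # clear the gradient for the next iteration
```

In practice the explicit update loop is usually replaced by an optimizer such as torch.optim.SGD, which performs the same step.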
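
The single-neuron chain-rule slides can also be checked numerically. The sketch below assumes a sigmoid activation and a squared-error loss (the slides do not fix a particular choice) and compares the hand-computed product of the three factors with the gradient PyTorch's autograd produces.

```python
# Hand-computed chain-rule gradient for one neuron, checked against autograd.
# Sigmoid activation and squared-error loss are assumptions for illustration.
import torch

x = torch.tensor([0.5, -1.0, 2.0])                  # inputs (or hidden values h1, h2, h3)
w = torch.tensor([0.1, 0.2, -0.3], requires_grad=True)
b = torch.tensor(0.05, requires_grad=True)
t = torch.tensor(1.0)                               # target

z = w @ x + b                                       # linear part
y = torch.sigmoid(z)                                # activation
L = 0.5 * (y - t) ** 2                              # loss
L.backward()

# Chain rule by hand: dL/dw_i = dL/dy * dy/dz * dz/dw_i
dL_dy = y - t                                       # derivative of the loss function
dy_dz = y * (1 - y)                                 # derivative of the sigmoid activation
dz_dw = x                                           # derivative of the linear part wrt each weight
manual_grad = (dL_dy * dy_dz * dz_dw).detach()

print(torch.allclose(manual_grad, w.grad))          # True: both routes agree
```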
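
The hidden-unit rule from the deep-network slide can be verified the same way. The sketch below uses an illustrative 3-to-2 layer with sigmoid activations and a squared-error loss and checks that \( \partial L / \partial h_j^{(i-1)} = \sum_k w_{jk}^{(i)} \, \phi'(z_k^{(i)}) \, \partial L / \partial h_k^{(i)} \) matches the gradient autograd computes for the earlier layer's activations.

```python
# Numerical check of the hidden-unit backprop rule:
# dL/dh_j = sum_k w_jk * phi'(z_k) * dL/dh_k
# Layer sizes, sigmoid activation and squared-error loss are illustrative assumptions.
import torch

torch.manual_seed(0)
h_prev = torch.randn(3, requires_grad=True)        # hidden values h_j at layer i-1
W = torch.randn(3, 2)                              # weights w_jk from layer i-1 to layer i
t = torch.zeros(2)

z = h_prev @ W                                     # pre-activations z_k at layer i
h = torch.sigmoid(z)                               # hidden values h_k at layer i
h.retain_grad()                                    # keep dL/dh_k for the check
L = 0.5 * ((h - t) ** 2).sum()
L.backward()

phi_prime = h * (1 - h)                            # sigmoid'(z_k)
manual = (W * (phi_prime * h.grad)).sum(dim=1)     # sum over k of w_jk * phi'(z_k) * dL/dh_k

print(torch.allclose(manual.detach(), h_prev.grad))  # True
```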
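
Finally, the vanishing gradient problem can be observed directly in the 5-layer, one-neuron-per-layer chain from the last slides. The initialization, sigmoid activation and loss below are illustrative assumptions; the point is that each extra layer contributes a multiplicative factor \( w_k \, \phi'(z_k) \), and since the sigmoid derivative is at most 0.25 these factors are usually well below 1, so the printed gradients shrink towards the input end of the chain.

```python
# Vanishing gradients in a 5-layer chain of single sigmoid neurons.
# Initial weights, activation and loss are illustrative choices.
import torch

torch.manual_seed(0)
n_layers = 5
weights = [torch.randn(1, requires_grad=True) for _ in range(n_layers)]

x = torch.tensor([1.0])
t = torch.tensor([0.0])

h = x
for w in weights:                      # forward pass through the chain x -> h1 -> ... -> y5
    h = torch.sigmoid(w * h)

loss = (0.5 * (h - t) ** 2).sum()
loss.backward()

# Each dL/dw_i picks up one factor w_k * sigmoid'(z_k) per later layer; sigmoid' <= 0.25,
# so the gradients typically shrink geometrically as we move towards the input.
for i, w in enumerate(weights, start=1):
    print(f"dL/dw{i}: {w.grad.item():.2e}")
```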