RNN Gradients and Backpropagation

10 Questions

What is the primary method used to compute gradients of the loss function in Recurrent Neural Networks (RNNs)?

Backpropagation through time

What is the purpose of unrolling the RNN in time during the BPTT algorithm?

To create a new copy of the network at each time step

In which step of the BPTT algorithm are the gradients of the loss computed?

Step 4: computing the gradients of the loss with respect to the model's parameters

What can occur when the gradients of the loss function grow exponentially during backpropagation?

Exploding gradients

What technique can be used to mitigate exploding gradients?

Gradient clipping

What can occur when the gradients of the loss function shrink exponentially during backpropagation?

Vanishing gradients

What technique can be used to mitigate vanishing gradients?

Residual connections

Why do exploding gradients cause unstable training?

The model's parameters update too aggressively

What is the main difference between exploding gradients and vanishing gradients?

Whether the gradient magnitudes grow or shrink exponentially during backpropagation

What is the computational complexity of the BPTT algorithm?

O(T * n * m)

Study Notes

RNN Gradients

  • Backpropagation through time (BPTT) is used to compute gradients of the loss function with respect to the model's parameters in Recurrent Neural Networks (RNNs).
  • The gradients are computed by unrolling the RNN in time, creating a copy of the network at each time step; every copy shares the same parameters.
  • The gradients are then propagated backwards through the unrolled network, using the chain rule to compute the gradients of the loss with respect to the model's parameters.
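As a rough illustration of that chain rule (using symbols not defined in the notes: L_t for the loss at step t, h_t for the hidden state, W for the recurrent weights, and T for the number of time steps), the BPTT gradient can be written as:

```latex
% Total gradient of the loss with respect to the recurrent weights W, summed
% over all time steps; the product of Jacobians carries the error backwards.
\frac{\partial L}{\partial W}
  = \sum_{t=1}^{T} \frac{\partial L_t}{\partial h_t}
    \sum_{k=1}^{t}
    \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
    \frac{\partial h_k}{\partial W}
```

The repeated product of Jacobians in this expression is what makes the gradients explode or vanish, as covered in the sections below.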

BPTT Algorithm

  • The BPTT algorithm works as follows (a code sketch follows this list):
    1. Unroll the RNN in time, creating a new copy of the network at each time step.
    2. Compute the forward pass, computing the output of the network at each time step.
    3. Compute the loss function at each time step.
    4. Compute the gradients of the loss with respect to the model's parameters using backpropagation.
    5. Accumulate the gradients from each time step to compute the total gradient.
    6. Update the model's parameters using the accumulated gradients.
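A minimal NumPy sketch of these steps for a vanilla RNN, assuming a single squared-error loss at the final time step only; every name here (bptt_sketch, Wxh, Whh, Why) is illustrative rather than taken from the notes:

```python
import numpy as np

def bptt_sketch(x_seq, y, Wxh, Whh, Why):
    """One pass of BPTT for a vanilla RNN with a loss at the last step only.

    x_seq: (T, n_in) inputs, y: (n_out,) target; weights are plain matrices
    (no biases, for brevity). Returns the loss and the gradients.
    """
    T = x_seq.shape[0]
    n_h = Whh.shape[0]

    # Steps 1-2: unroll the network in time and run the forward pass,
    # storing the hidden state at every step.
    hs = [np.zeros(n_h)]
    for t in range(T):
        hs.append(np.tanh(Wxh @ x_seq[t] + Whh @ hs[-1]))

    # Step 3: the loss, here computed only at the final step for simplicity.
    y_hat = Why @ hs[-1]
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Steps 4-5: propagate gradients backwards through the unrolled network
    # and accumulate them into a single gradient per weight matrix.
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dWhy += np.outer(y_hat - y, hs[-1])
    dh = Why.T @ (y_hat - y)              # gradient flowing into the last hidden state
    for t in reversed(range(T)):
        dz = dh * (1.0 - hs[t + 1] ** 2)  # back through the tanh at step t
        dWxh += np.outer(dz, x_seq[t])
        dWhh += np.outer(dz, hs[t])
        dh = Whh.T @ dz                   # pass the gradient to the previous step

    return loss, dWxh, dWhh, dWhy

# Step 6 (the parameter update) would then be, e.g., W -= learning_rate * dW.
```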

Exploding Gradients

  • Exploding gradients occur when the gradients of the loss function grow exponentially during backpropagation, causing the gradients to become very large.
  • This can cause the model's parameters to update too aggressively, leading to unstable training.
  • Exploding gradients can be mitigated using techniques such as:
    • Gradient clipping: capping or rescaling the gradients at a maximum norm or value so they cannot grow too large (see the sketch after this list).
    • Gradient normalization: normalizing the gradients to have a fixed norm.
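A minimal sketch of gradient clipping by global norm, applied just before the parameter update; the threshold max_norm is an illustrative choice, not a value from the notes:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# Typical use, right before the update step of BPTT:
# dWxh, dWhh, dWhy = clip_gradients([dWxh, dWhh, dWhy], max_norm=5.0)
```

Deep learning frameworks ship equivalents of this, for example torch.nn.utils.clip_grad_norm_ in PyTorch.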

Vanishing Gradients

  • Vanishing gradients occur when the gradients of the loss function shrink exponentially during backpropagation, causing the gradients to become very small.
  • This can cause the model's parameters to update too slowly, leading to slow training.
  • Vanishing gradients can be mitigated using techniques such as:
    • Gradient normalization: normalizing the gradients to have a fixed norm.
    • Residual connections: adding connections between layers to help the gradients flow more easily.
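A minimal sketch of a residual (skip) connection, here added around a single recurrent update so the identity path gives gradients a shorter route backwards; the function and variable names are illustrative:

```python
import numpy as np

def rnn_step_residual(x_t, h_prev, Wxh, Whh):
    """Recurrent update with a residual connection: h_t = h_{t-1} + f(x_t, h_{t-1}).

    The additive identity path means the Jacobian dh_t/dh_{t-1} contains an
    identity term, which helps keep gradients from shrinking towards zero.
    """
    return h_prev + np.tanh(Wxh @ x_t + Whh @ h_prev)
```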

Computational Complexity

  • The computational complexity of BPTT is O(T * n * m), where T is the number of time steps, n is the number of inputs, and m is the number of model parameters.
  • The computational complexity can be reduced using techniques such as:
    • Truncated BPTT: only unrolling the network for a fixed number of time steps (see the sketch after this list).
    • Approximating the gradients, for example by estimating them from mini-batches as in stochastic gradient descent.
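A minimal sketch of truncated BPTT, written here with PyTorch-style automatic differentiation rather than the NumPy used above, since detaching the hidden state is the usual way to cut the unrolled graph; the model(x_t, h) -> (prediction, new_hidden) interface and all names are assumptions:

```python
import torch

def truncated_bptt(model, optimizer, x_seq, y_seq, k=20):
    """Train on one long sequence, backpropagating through at most k steps at a time."""
    h = None
    loss_fn = torch.nn.MSELoss()
    for start in range(0, x_seq.shape[0], k):
        optimizer.zero_grad()
        loss = 0.0
        for t in range(start, min(start + k, x_seq.shape[0])):
            y_hat, h = model(x_seq[t], h)
            loss = loss + loss_fn(y_hat, y_seq[t])
        loss.backward()   # unrolls only the last <= k steps
        optimizer.step()
        h = h.detach()    # cut the graph so earlier steps are not revisited
```

Because the graph is cut every k steps, memory and compute grow with k rather than with the full sequence length, at the cost of ignoring dependencies longer than k steps.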
