5. Issues and Techniques in Deep Learning 2 - 28012024 - RMSProp vs Adam Update in Neural Networks

Questions and Answers

What is a key benefit of using the ReLU activation function in deep neural networks?

  • It is not biologically plausible
  • It saturates in the positive region
  • It is computationally slower
  • It helps prevent vanishing gradient problems (correct)

Which activation function has been the most popular for deep neural networks since 2012?

  • ReLU (correct)
  • Leaky ReLU
  • Tanh
  • Sigmoid
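
For intuition on the vanishing-gradient point above, here is a small illustrative sketch (not part of the quiz; it assumes NumPy) comparing the sigmoid's derivative, which saturates toward zero for large inputs, with ReLU's derivative, which stays at 1 for any positive input:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # saturates: approaches 0 for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for every positive input

x = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
print("sigmoid grad:", sigmoid_grad(x))  # tiny at the extremes
print("relu grad:   ", relu_grad(x))     # 0 or 1, never saturates for x > 0
```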

What is a common issue with gradient descent when the learning rate is very slow?

  • It converges quickly
  • It is computationally faster
  • It settles at local minima too easily (correct)
  • It bounces off

What makes a fast learning rate problematic in gradient descent?

    It bounces off the optimization path

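The two learning-rate failure modes above can be seen in a toy gradient-descent run on $f(w) = w^2$ (an illustrative sketch with made-up step sizes, not taken from the lesson):

```python
def descend(lr, steps=20, w0=5.0):
    """Plain gradient descent on f(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

print(descend(lr=0.001))  # ~4.8: too slow, barely moves from the start point
print(descend(lr=0.4))    # ~0.0: a moderate rate converges
print(descend(lr=1.1))    # ~1.9e2: too fast, it overshoots and bounces away
```
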
    Why is weight initialization important in deep neural networks?

    It affects how quickly the network learns

    What problem does ReLU help address in deep neural networks?

    The vanishing gradient problem

    What two hyperparameters are used in the Adam (adaptive moment) update?

    Momentum (beta 1) and squared momentum (beta 2)

    In the Adam update equation, what does $w_{t+1}$ represent?

    The updated weights

    What is the primary purpose of implementing beta corrections to $v_{t+1}$ and $r_{t+1}$ in the Adam update equation?

    To correct the initial bias of the moving averages and stabilize the algorithm's convergence

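For reference, a single Adam step sketched in NumPy, using the lesson's naming of the two moving averages ($v$ for momentum, $r$ for squared momentum); the defaults and variable names below are the commonly used ones, not necessarily the lesson's exact notation:

```python
import numpy as np

def adam_step(w, grad, v, r, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EMAs of the gradient (v) and squared gradient (r),
    bias corrections, then the weight update w_{t+1}."""
    v = beta1 * v + (1 - beta1) * grad          # momentum (beta 1): past gradients
    r = beta2 * r + (1 - beta2) * grad ** 2     # squared momentum (beta 2): past gradient^2
    v_hat = v / (1 - beta1 ** t)                # bias corrections stabilize the first
    r_hat = r / (1 - beta2 ** t)                # few steps, when v and r start at zero
    w_next = w - lr * v_hat / (np.sqrt(r_hat) + eps)
    return w_next, v, r

# toy usage on a single weight of f(w) = w^2
w, v, r = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, v, r = adam_step(w, 2.0 * w, v, r, t)
print(w)   # moves toward the minimum at 0
```
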
    What role does the hyperparameter squared momentum (beta 2) play in the Adam update?

    It controls the decay rate of the moving average of past squared gradients

    Which component in the Adam update equation is responsible for incorporating past gradients into the optimization process?

    Momentum (beta 1)

    What distinguishes the Adam update from RMSProp in terms of optimization performance?

    Unlike RMSProp, Adam also incorporates a moving average of past gradients (momentum)

    What is a key issue with weight initialization where all weights are set to zero?

    All updates will be the same because all outputs will be the same

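The symmetry problem behind the zero-initialization question can be made concrete with a tiny sketch (illustrative, assuming NumPy; the all-ones upstream gradient is a stand-in for whatever the next layer would send back):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(4, 3)      # 4 samples, 3 input features
W = np.zeros((3, 2))           # two hidden units, every weight initialized to zero
h = x @ W                      # both units compute the same output (all zeros here)
upstream = np.ones_like(h)     # stand-in gradient flowing back from the next layer
grad_W = x.T @ upstream        # gradient w.r.t. W
print(np.allclose(grad_W[:, 0], grad_W[:, 1]))  # True: both units receive identical updates
```
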
    Which regularization technique involves introducing noise into the network during training to prevent overfitting?

    Dropout

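Dropout as it is typically implemented ("inverted" dropout): randomly zero activations during training and rescale the rest, so nothing changes at test time. A generic sketch, not code from the lesson:

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Inverted dropout: zero each activation with probability p_drop while training."""
    if not training:
        return h                              # identity at test time
    mask = np.random.rand(*h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)          # rescale so the expected activation is unchanged

h = np.ones((2, 4))
print(dropout(h, p_drop=0.5))                 # roughly half the entries zeroed, the rest doubled
```
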
    What is the purpose of Xavier/Glorot weight initialization in deep neural networks?

    To avoid exploding/vanishing gradients

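A sketch of Xavier/Glorot (uniform) initialization: the weight variance is scaled by the layer's fan-in and fan-out so activation and gradient magnitudes stay roughly constant across layers, which is how it avoids exploding or vanishing gradients. The layer sizes below are arbitrary:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform init: Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.std(), np.sqrt(2.0 / (256 + 128)))   # empirical std is close to the target std
```
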
    What does L1 regularization encourage in neural networks?

    Feature selection by driving some weights to exactly zero

    How does batch normalization help with training deep neural networks?

    By normalizing the input to each layer of the network

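A sketch of the batch-normalization forward pass for a fully connected layer (gamma and beta are the usual learnable scale and shift; the names are not necessarily the lesson's notation):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)                      # per-feature mean across the batch
    var = x.var(axis=0)                        # per-feature variance across the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # the layer now sees a normalized input
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 5.0 + 3.0         # a badly scaled batch of activations
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 mean and ~1 std per feature
```
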
    What is the main risk associated with using too many $\beta_i$ terms in a regression model?

    Overfitting the data by capturing noise rather than signal

    What does momentum in gradient descent indicate?

    How much importance is given to past values

    What is the purpose of implementing RMS Prop in SGD with Momentum?

    To decrease the update if the average update of a weight is high

    How does RMS Prop differ from taking just the gradient for updating weights?

    It uses an EMA of the squared gradient instead of just the raw gradient

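A sketch of the RMSProp step described above: keep an EMA of the squared gradient and divide the step by its square root, so weights whose recent updates have been large take smaller steps (the learning rate and decay values below are just common defaults):

```python
import numpy as np

def rmsprop_step(w, grad, r, lr=1e-2, decay=0.9, eps=1e-8):
    """RMSProp: scale the update by the EMA of the squared gradient, not the raw gradient."""
    r = decay * r + (1 - decay) * grad ** 2     # EMA of gradient^2
    w = w - lr * grad / (np.sqrt(r) + eps)      # large recent updates -> smaller effective step
    return w, r

# toy usage on f(w) = w^2
w, r = 5.0, 0.0
for _ in range(200):
    w, r = rmsprop_step(w, 2.0 * w, r)
print(w)   # shrinks toward the minimum at 0
```
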
    In gradient descent with momentum, what does the term 'rho' represent?

    How many past values are taken into account for averaging

    Why is it important to avoid zig-zag movements in gradient descent?

    To improve convergence speed

    What is the role of momentum in updating weights in gradient descent?

    To take an average of the past few gradients

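A sketch of SGD with momentum tying the last few questions together: the velocity is an exponentially weighted average of past gradients, rho controls how much weight that history gets, and averaging opposing gradients is what damps the zig-zag (the toy objective below is made up for illustration):

```python
import numpy as np

def momentum_step(w, grad, v, lr=1e-2, rho=0.9):
    """SGD with momentum: v is a running average of past gradients, weighted by rho."""
    v = rho * v + grad        # larger rho -> more importance given to past gradients
    w = w - lr * v
    return w, v

# toy usage on a badly conditioned quadratic bowl, where plain SGD tends to zig-zag
w, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(50):
    grad = np.array([2.0 * w[0], 40.0 * w[1]])
    w, v = momentum_step(w, grad, v)
print(w)   # both weights shrink toward the minimum at 0
```
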
    What is the purpose of Regularization in Standard Least Squares Regression?

    To penalize large coefficients on top of minimizing the sum of squared errors, so the model does not overfit

    In the context of Regularization, what does penalizing large coefficients help prevent?

    Overfitting

    What is the key difference between Ridge and Lasso regularization functions?

    Lasso can drive coefficients exactly to zero at a finite (high) lambda, giving the simplest model, whereas Ridge only shrinks them toward zero and would need lambda to go to infinity to remove them entirely

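The practical difference can be seen quickly with scikit-learn (not part of the lesson; the data and penalty strengths below are arbitrary): with a comparable penalty, Lasso (L1) sets some coefficients to exactly zero while Ridge (L2) only shrinks them, and raising lambda (alpha in scikit-learn) simplifies the model in both cases.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)   # only 2 useful features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))  # typically all 10
print("lasso nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))  # typically just a few
```
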
    What happens to the model complexity as lambda value increases in Regularization?

    Model complexity decreases

    Which term refers to the vector of all training responses in the context of Regularization?

    $y$

    What does N represent in the equation for Regularization?

    The number (#) of training samples

    What is the formula used to minimize the sum of squared errors with regularization included?

    $\min_{\beta}\big(\sum_{i=1}^{N}(y_i - x_i^\top\beta)^2 + \lambda\sum_{j}\beta_j^2\big)$ for Ridge; Lasso uses the penalty $\lambda\sum_{j}|\beta_j|$ instead

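With the Ridge (squared) penalty, the regularized least-squares objective above has the closed-form solution $\beta = (X^\top X + \lambda I)^{-1} X^\top y$; a NumPy sketch with made-up data ($N$ training samples, responses $y$, and an arbitrary $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5                        # N = number of training samples, p = number of features
X = rng.normal(size=(N, p))
y = rng.normal(size=N)              # the vector of all training responses
lam = 1.0                           # regularization strength lambda

# minimize ||y - X beta||^2 + lam * ||beta||^2  ->  beta = (X^T X + lam I)^{-1} X^T y
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta)
```
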
    How does high lambda value affect the complexity of models in Regularization?

    Decreases model complexity
