5. Issues and Techniques in Deep Learning 2 - 28012024 - RMSProp vs Adam Update in Neural Networks


32 Questions

What is a key benefit of using the ReLU activation function in deep neural networks?

It helps prevent vanishing gradient problems

Which activation function has been the most popular for deep neural networks since 2012?

ReLU
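
For concreteness, here is a minimal NumPy sketch (my own illustration, not part of the original notes) of ReLU and its derivative: because the gradient is exactly 1 for every positive input, it does not shrink as it is propagated back through many layers, which is why ReLU helps with vanishing gradients.

```python
import numpy as np

def relu(x):
    # ReLU passes positive values through unchanged and zeroes out negatives
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is exactly 1 where x > 0, so it does not shrink across layers
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```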

What is a common issue with gradient descent when the learning rate is very small?

It converges very slowly and settles at local minima too easily

What makes a large (fast) learning rate problematic in gradient descent?

The updates overshoot and bounce around the optimization path instead of converging
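
As a toy illustration (a made-up quadratic loss, not taken from the notes), both failure modes are easy to see in a few lines:

```python
def run_gd(lr, steps=20):
    # Minimize f(w) = w^2 (gradient 2w), starting from w = 5
    w = 5.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(run_gd(lr=0.001))  # too small: barely moves, w is still close to 5
print(run_gd(lr=0.99))   # too large: w flips sign every step, bouncing across the minimum
print(run_gd(lr=0.1))    # moderate: converges smoothly toward the minimum at 0
```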

Why is weight initialization important in deep neural networks?

It affects how quickly the network learns

What problem does ReLU help address in deep neural networks?

Vanishing gradient problem

What two hyperparameters are used in the Adam (adaptive moment estimation) update?

Momentum (beta 1) and squared momentum (beta 2)

In the Adam update equation, what does $w_{t+1}$ represent?

The updated weights

What is the primary purpose of the bias corrections applied to $v_{t+1}$ and $r_{t+1}$ in the Adam update equation?

To correct for the zero initialization of the moment estimates and stabilize convergence

What role does the hyperparameter squared momentum (beta 2) play in the Adam update?

It controls the decay rate of the historical squared gradients

Which component in the Adam update equation is responsible for incorporating past gradients into the optimization process?

Momentum (beta 1)

What distinguishes the Adam update from RMSProp in terms of optimization performance?

Adam adds a momentum term (an EMA of past gradients) on top of RMSProp's EMA of squared gradients
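
Putting these pieces together, here is a minimal sketch of a single Adam step (the names v, r, beta1, beta2 follow the quiz's wording; the learning rate, eps, and the toy gradient are my own placeholders). It shows the momentum term (beta 1), the squared-momentum term (beta 2), the bias corrections, and the final weight update $w_{t+1}$.

```python
import numpy as np

def adam_step(w, grad, v, r, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum: EMA of past gradients (the term plain RMSProp lacks)
    v = beta1 * v + (1 - beta1) * grad
    # Squared momentum: EMA of squared gradients (the RMSProp-style term)
    r = beta2 * r + (1 - beta2) * grad**2
    # Bias corrections: both EMAs start at 0, so early estimates are biased
    # toward 0; dividing by (1 - beta^t) stabilizes the first steps
    v_hat = v / (1 - beta1**t)
    r_hat = r / (1 - beta2**t)
    # w_{t+1}: move along the corrected momentum, scaled per parameter
    w_next = w - lr * v_hat / (np.sqrt(r_hat) + eps)
    return w_next, v, r

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
r = np.zeros_like(w)
for t in range(1, 4):
    grad = 2 * w               # placeholder gradient of f(w) = ||w||^2
    w, v, r = adam_step(w, grad, v, r, t)
```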

What is a key issue with weight initialization where all weights are set to zero?

All updates will be the same because all outputs will be the same
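
A tiny illustration of the symmetry problem (toy network, my own construction, not from the notes):

```python
import numpy as np

X = np.array([[1.0, 2.0]])            # one training example
W1 = np.zeros((2, 3))                 # every hidden weight initialized to 0
h = np.maximum(0, X @ W1)
print(h)                              # [[0. 0. 0.]] -- every hidden unit outputs the same value

# The same symmetry problem appears for any identical constant, not just 0:
W1 = np.full((2, 3), 0.5)
h = np.maximum(0, X @ W1)
print(h)                              # [[1.5 1.5 1.5]] -- still identical, so backprop
                                      # gives every column of W1 the same gradient and
                                      # the hidden units never learn different features
```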

Which regularization technique prevents overfitting by introducing noise during training, randomly dropping units on each forward pass?

Dropout
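
A minimal sketch of (inverted) dropout as it is typically implemented; the drop probability p and the activation values below are illustrative placeholders:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    # Randomly zero each unit with probability p during training, scaling the
    # survivors by 1/(1-p) so the expected activation is unchanged at test time
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p) / (1 - p)
    return activations * mask

h = np.array([0.2, 1.5, -0.3, 0.8])
print(dropout(h, p=0.5))   # roughly half the units are zeroed on each call
```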

What is the purpose of Xavier/Glorot weight initialization in deep neural networks?

To avoid exploding/vanishing gradients
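
As a sketch, Xavier/Glorot initialization draws weights with a variance scaled to the layer's fan-in and fan-out, so activations and gradients keep roughly constant variance from layer to layer (the uniform variant shown here is one common form; the layer sizes are placeholders):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Variance 2/(fan_in + fan_out) keeps forward activations and
    # backward gradients from exploding or vanishing across layers
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.std())   # close to sqrt(2 / (256 + 128)) ≈ 0.072
```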

What does L1 regularization encourage in neural networks?

Feature selection by driving some weights to exactly zero
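
A small sketch (toy weight vector, made-up lambda) comparing the L1 and L2 penalty gradients: the L1 term contributes a constant-magnitude pull toward zero, which is why it can drive small weights exactly to zero (feature selection), while L2 only shrinks each weight in proportion to its size.

```python
import numpy as np

w = np.array([0.8, -0.05, 0.0, 2.0])
lam = 0.1

l1_grad = lam * np.sign(w)   # constant-size pull toward zero for every nonzero weight
l2_grad = lam * 2 * w        # pull proportional to the weight's size

print(l1_grad)  # [ 0.1 -0.1  0.   0.1 ]
print(l2_grad)  # [ 0.16 -0.01  0.    0.4 ]
```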

How does batch normalization help with training deep neural networks?

By normalizing the input to each layer of the network
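
A minimal batch-normalization sketch for a fully connected layer (training mode only; gamma, beta, and eps are the usual learnable scale/shift and a numerical-stability constant, shown here with placeholder values):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch to zero mean and unit variance,
    # then apply a learnable scale (gamma) and shift (beta)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 4) * 10 + 5     # badly scaled layer inputs
out = batch_norm(batch)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```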

What is the main risk associated with using too many $\beta_i$ terms in a regression model?

Overfitting the data by capturing noise rather than signal

What does momentum in gradient descent indicate?

How much importance is given to past values

What is the purpose of implementing RMSProp alongside SGD with momentum?

To decrease the step size for a weight whose average squared gradient is high

How does RMSProp differ from using just the gradient to update weights?

It divides the gradient by the square root of an EMA of the squared gradients, rather than using the raw gradient alone
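
A minimal RMSProp step, written as a sketch (rho, lr, and eps are placeholder values): the raw gradient is divided by the square root of an EMA of squared gradients, so weights whose recent gradients have been large take smaller steps.

```python
import numpy as np

def rmsprop_step(w, grad, r, lr=1e-3, rho=0.9, eps=1e-8):
    # Keep an EMA of squared gradients instead of using the raw gradient alone
    r = rho * r + (1 - rho) * grad**2
    # A large average squared gradient means a smaller effective step for that weight
    w = w - lr * grad / (np.sqrt(r) + eps)
    return w, r

w = np.array([1.0, -3.0])
r = np.zeros_like(w)
for _ in range(5):
    grad = 2 * w                  # placeholder gradient of f(w) = ||w||^2
    w, r = rmsprop_step(w, grad, r)
```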

In gradient descent with momentum, what does the term 'rho' represent?

The decay factor that determines how much weight past gradients receive in the running average

Why is it important to avoid zig-zag movements in gradient descent?

To improve convergence speed

What is the role of momentum in updating weights in gradient descent?

To take an exponentially weighted average of past gradients rather than the current gradient alone
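
A minimal SGD-with-momentum sketch (rho and lr are placeholder values): the velocity accumulates a weighted average of past gradients, and rho controls how much influence those past values have, which damps zig-zag movements and speeds convergence.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, rho=0.9):
    # The velocity is an exponentially weighted accumulation of past gradients;
    # rho close to 1 gives past gradients more weight and smooths out zig-zagging
    velocity = rho * velocity + grad
    w = w - lr * velocity
    return w, velocity

w = np.array([2.0, -1.0])
v = np.zeros_like(w)
for _ in range(10):
    grad = 2 * w                  # placeholder gradient of f(w) = ||w||^2
    w, v = momentum_step(w, grad, v)
```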

What is the purpose of Regularization in Standard Least Squares Regression?

To penalize large coefficients so the model does not overfit while minimizing the sum of squared errors

In the context of Regularization, what does penalizing large coefficients help prevent?

Overfitting

What is the key difference between Ridge and Lasso regularization functions?

Ridge only reaches the empty model as lambda goes to infinity, whereas Lasso reaches the simplest (sparse) models at a finite, high lambda
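
Written out in their standard forms (not copied from the quiz), the two objectives differ only in the penalty term:

Ridge (L2): $\min_{\beta}\ \sum_{i=1}^{N}(y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j}\beta_j^{2}$

Lasso (L1): $\min_{\beta}\ \sum_{i=1}^{N}(y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j}|\beta_j|$

The absolute-value penalty is what lets Lasso set coefficients exactly to zero at a finite lambda.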

What happens to the model complexity as lambda value increases in Regularization?

Model complexity decreases

Which term refers to the vector of all training responses in the context of Regularization?

$y$

What does N represent in the equation for Regularization?

The number of training samples

What is the formula used to minimize the sum of squared errors with regularization included?

$\min_{\beta}\Big(\sum_{i=1}^{N}(y_i - x_i^{\top}\beta)^2 + \lambda \sum_{j}|\beta_j|\Big)$

How does high lambda value affect the complexity of models in Regularization?

Decreases model complexity
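
To see the effect of lambda concretely, here is a small ridge-regression sketch (random toy data and the closed-form solution, my own illustration): the fitted coefficients shrink toward zero, i.e. model complexity decreases, as lambda grows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # N = 50 training samples, 3 features
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

for lam in [0.0, 1.0, 100.0]:
    # Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T y
    beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(lam, beta.round(3))                # coefficients shrink as lambda grows
```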

Explore the differences between the RMSProp and Adam updates in neural networks, including their implementation and performance in tackling saddle points, and learn about the hyperparameters involved in the Adam update for improved optimization.
