quiz1_NN

92 Questions

What is the main goal of Maximum Likelihood Estimation (MLE) in logistic regression?

To maximize the likelihood of making the observations given the parameters

In logistic regression, what is the function that 'squeezes in' the weighted input into a probability space?

Logistic Sigmoid Function
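
A minimal sketch of the logistic sigmoid, assuming a scalar weighted input `z`. The function name and the two-branch form (used only to avoid overflow for large negative inputs) are illustrative choices, not taken from the course material.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued weighted input z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so exp() cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0.0))  # 0.5
```

Large positive inputs approach 1 and large negative inputs approach 0, which is what lets the output be read as a probability.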

What is the measure of the uncertainty associated with a random variable in logistic regression?

Entropy (H)

What does the Logistic regression model specify in terms of binary output given input?

Probability of binary output

What is the method of estimating the parameters of a statistical model given observations in logistic regression?

Maximum Likelihood Estimation (MLE)

What is the purpose of minimizing the Kullback-Leibler Divergence in logistic regression?

To bring the model's predicted distribution as close as possible to the observed one; KL divergence measures the difference between two probability distributions
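
The relationship between entropy, cross-entropy and KL divergence can be checked numerically. The sketch below assumes two small discrete distributions, `p` (data) and `q` (model), both made up for illustration, and verifies that KL(p‖q) equals cross-entropy minus entropy. With `p` fixed, minimizing one therefore minimizes the other.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen only for the demonstration.
p = [0.7, 0.3]
q = [0.6, 0.4]
gap = cross_entropy(p, q) - entropy(p)
print(abs(gap - kl_divergence(p, q)) < 1e-12)  # True
```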

In logistic regression, what is the role of the derivative of the logit function?

To adjust the parameters in the gradient-descent based method

Why is logistic regression considered insufficient for classification in the rings dataset?

The dataset is not linearly separable

What is the objective of a multi-layer perceptron (MLP) with regards to minimizing cross-entropy error?

To create separation planes on the input space

What influence can be observed in the simple MLP using the Neural Net Playground?

Influence of activation functions and relation between absolute gradient value and learning rate

What is the passing condition for the written final exam in the Neural Networks course?

Scoring at least 30% in the written final exam

What is the weightage of the project in the evaluation for exam entry?

50%

How many intermediary short presentations are required for the project evaluation?

3

What is the weightage of the written final exam in the evaluation for exam entry?

30%

What is the main focus of Step 3 of the project?

Final experiments and results

Which optimization algorithm stores an exponentially decaying average of squared gradients and also an exponentially decaying average of past gradients?

Adam
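
A rough sketch of a single Adam step for one scalar parameter, showing the two decaying averages and their bias correction. The default hyperparameters are the commonly quoted ones and are assumptions here, as is the toy objective f(θ) = θ².

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # decaying average of past gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2 with gradient 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
theta, m, v = adam_step(theta, 2 * theta, m, v, t=1, lr=0.05)
print(theta)
```

On the very first step the corrected averages reduce to the gradient and its square, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.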

What is the update rule for the Adadelta optimization algorithm?

$\theta_{t+1} = \theta_t - \frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_t} g_t$, where $\mathrm{RMS}[x]_t = \sqrt{E[x^2]_t + \epsilon}$
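
A sketch of one Adadelta step as the rule is usually stated, for a single scalar parameter. The decay rate `rho`, the constant `eps`, and the variable names are assumptions for illustration; note that no global learning rate appears.

```python
import math

def adadelta_step(theta, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """One Adadelta update.

    eg2  : running average of squared gradients, E[g^2]
    edx2 : running average of squared parameter updates, E[dx^2]
    """
    eg2 = rho * eg2 + (1 - rho) * grad * grad
    # The step is scaled by RMS of previous updates over RMS of gradients.
    delta = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * delta * delta
    return theta + delta, eg2, edx2

theta, eg2, edx2 = adadelta_step(theta=1.0, grad=2.0, eg2=0.0, edx2=0.0)
print(theta, eg2, edx2)
```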

In the context of optimization, what does the acronym CNN stand for?

Convolutional Neural Network

What is the main purpose of early stopping in the context of optimization?

To prevent overfitting
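
One simple way to express early stopping is a patience rule on the validation loss. The sketch below is only one possible variant; the function name, the `patience` parameter and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch whose parameters would be kept.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Made-up validation curve: improves, then starts to rise (overfitting).
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.6]))  # 2
```

With `patience=2` the loop stops at epoch 4 and never sees the later value 0.6, which illustrates the trade-off in choosing the patience.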

Which technique is used to make training more robust to poor initialization or highly irregular error functions?

Gradient Noise
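
A hedged sketch of gradient noise: zero-mean Gaussian noise is added to each gradient component, with a variance that is annealed over time. The schedule used here, variance η / (1 + t)^γ, is one commonly cited choice; the values of `eta` and `gamma` are assumptions.

```python
import math
import random

def noise_std(t, eta=0.3, gamma=0.55):
    """Standard deviation of the noise at step t (variance eta / (1 + t)^gamma)."""
    return math.sqrt(eta / (1.0 + t) ** gamma)

def noisy_gradient(grad, t, rng):
    """Add independent Gaussian noise to every gradient component."""
    sigma = noise_std(t)
    return [g + rng.gauss(0.0, sigma) for g in grad]

rng = random.Random(0)  # seeded only to make the demo repeatable
print(noisy_gradient([1.0, -2.0], t=0, rng=rng))
```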

What is the dimension of the output map when an input volume of 32 × 32 × 3 is convolved with a 5 × 5 × 3 filter?

28 × 28 × 1

What does the size of the receptive field represent in a convolutional layer?

Spatial extent of the local connectivity of each neuron

What is the purpose of connecting each neuron to only a local region of the input volume in a convolutional layer?

To reduce the number of parameters and computational complexity

What is the disadvantage of using a large filter in a convolutional layer?

Lose info about spatial arrangement of pixels

In the context of convolutional layers, what does 'inductive bias' refer to?

The assumptions made by the network to simplify learning

In the context of optimization for linear regression, what is the gradient of the loss function f(θ) with respect to θ?

$-2X^T y + 2X^T X\theta$
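
The gradient can be sanity-checked on a small example, assuming the loss is f(θ) = ‖y − Xθ‖², which is the form that produces this gradient. The sketch uses plain Python lists, the data are made up, and the gradient is computed in the equivalent form −2Xᵀ(y − Xθ).

```python
def residuals(X, y, theta):
    """y - X theta, one entry per row of X."""
    return [yi - sum(xij * tj for xij, tj in zip(row, theta))
            for row, yi in zip(X, y)]

def loss(X, y, theta):
    """f(theta) = ||y - X theta||^2."""
    return sum(r * r for r in residuals(X, y, theta))

def gradient(X, y, theta):
    """-2 X^T y + 2 X^T X theta, written as -2 X^T (y - X theta)."""
    r = residuals(X, y, theta)
    return [-2.0 * sum(row[j] * ri for row, ri in zip(X, r))
            for j in range(len(theta))]

# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 2.0, 3.0]
theta = [0.5, -0.5]
print(gradient(X, y, theta))  # [-53.0, -68.0]
```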

What is the main issue with the second-order optimization algorithm in computing the direction of descent?

Difficulty in inverting the Hessian matrix

Which statement best describes the Stochastic Gradient Descent (SGD) algorithm?

It uses a subset of the data to estimate the true gradient and updates the parameters based on this estimate.

What is the main purpose of normalizing the input space in Stochastic Gradient Descent (SGD)?

To improve numerical stability and convergence behavior

Which feature distinguishes Adadelta from Adagrad in terms of updating individual parameter learning rates?

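Both adapt the learning rate per parameter. Adagrad divides by the square root of the sum of all past squared gradients, so its effective learning rates shrink monotonically; Adadelta replaces that sum with an exponentially decaying average and needs no global learning rate.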

What is the dimension of the output map when applying a filter of size 7 × 7 × 3 to an input volume of dimension 32 × 32 × 3?

26 × 26 × 1

If the input volume has dimensions 50 × 50 × 3, a kernel size of 5 × 5 × 3, zero-padding of 1, and a stride of 2, what is the dimension of the output map?

24 × 24 × 1 (rounding (50 + 2 − 5)/2 = 23.5 down before adding 1)

For an input volume of dimension 20 × 20 × 3 and a filter size of 5 × 5 × 3 with a stride of 3, what is the dimension of the output map?

6 × 6 × 1

If the stride is set to 2, the input volume has dimensions of 30 × 30 × 3, and the filter size is 4 × 4 × 3, what padding can be used to ensure the output map has the same width and height as the input volume?

16, since (30 + 2P − 4)/2 + 1 = 30 gives 2P = 32

What size of zero-padding should be applied to an input volume with dimensions of 40 × 40 × 3 and a filter size of 4 × 4 × 3 in order to obtain an output map of dimensions 40 × 40 × 1?

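No symmetric integer padding works at stride 1: 40 + 2P − 4 + 1 = 40 gives P = 1.5. A total of 3 padded pixels per dimension is needed (for example 1 on one side and 2 on the other); P = 1 yields 39 × 39 × 1.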

When applying a convolution layer with a kernel size of 6 × 6 and a stride of 2 to an input volume with dimensions of H x W x D, what is the dimension of the output map?

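$\left(\frac{H - 6}{2} + 1\right) \times \left(\frac{W - 6}{2} + 1\right) \times N$, where $N$ is the number of filters ($N = 1$ for a single filter), assuming no zero-padding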

If parameter sharing is used in a CNN with an output volume of dimensions H x W x D and a filter size of K x K x D, how many parameters are shared within a depth slice?

$K \times K \times D$

In a CNN with an output volume of dimensions H x W x D and a filter size of K x K x D, how many biases are needed when using parameter sharing?

$D$
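
The parameter counts under parameter sharing can be tallied with a few lines. This is a sketch under the usual convention that every filter (one per output depth slice) has its own K × K × D_in weights and a single bias; the function and argument names are hypothetical.

```python
def conv_layer_params(kernel, in_depth, num_filters):
    """Return (weights, biases) for a conv layer with parameter sharing."""
    weights = kernel * kernel * in_depth * num_filters  # K * K * D_in per filter
    biases = num_filters                                # one bias per depth slice
    return weights, biases

print(conv_layer_params(kernel=5, in_depth=3, num_filters=1))  # (75, 1)
print(conv_layer_params(kernel=5, in_depth=3, num_filters=8))  # (600, 8)
```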

For an input volume with dimensions H x W x D, if the kernel size is P x P x D and the stride is S, what constraint on P ensures that the division in the output-height formula yields an integer?

$(H - P) \bmod S = 0$, i.e. $H - P$ must be a multiple of $S$ (assuming no zero-padding)

If an input volume has dimensions H x W x D, a kernel size of K x K x D, zero-padding of P, and dilation factor of L, what is the formula for calculating Wout, the width of the output map?

$W_{out} = \left\lfloor \frac{W + 2P - L(K - 1) - 1}{S} \right\rfloor + 1$, where $S$ is the stride ($S = 1$ if none is given)
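
The output-size questions in this quiz can all be checked with one helper. The sketch assumes the standard formula ⌊(W + 2·padding − dilation·(K − 1) − 1) / stride⌋ + 1, with rounding down when the stride does not fit evenly; the function name and argument names are illustrative.

```python
def conv_output_size(size, kernel, padding=0, stride=1, dilation=1):
    """Width (or height) of the output map along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv_output_size(32, 5))                        # 28
print(conv_output_size(32, 7))                        # 26
print(conv_output_size(50, 5, padding=1, stride=2))   # 24
print(conv_output_size(20, 5, stride=3))              # 6
print(conv_output_size(30, 4, padding=16, stride=2))  # 30
```

The output depth is not given by this formula: it equals the number of filters applied, so a single filter produces a depth of 1.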

What is the loss function used in logistic regression?

Cross-Entropy

What is the derivative of the logit function used for in logistic regression?

To compute the gradient of the loss function
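
As a rough illustration of how that gradient is used, the sketch below trains a one-feature logistic regression by batch gradient descent on the cross-entropy loss. It relies on the fact that, with a sigmoid output, the gradient of the cross-entropy with respect to the weight reduces to (prediction − label) · input. The data set, learning rate and epoch count are made up.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(xs, ys, lr=0.5, epochs=1000):
    """Fit weight w and bias b by gradient descent on the mean cross-entropy."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # dLoss/dz for a sigmoid output
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical, linearly separable toy data.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
print(sigmoid(w * 2.0 + b) > 0.9, sigmoid(w * -2.0 + b) < 0.1)  # True True
```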

What is the main disadvantage of using logistic regression for classification in the rings dataset?

It is not suitable for non-linearly separable datasets

What is the objective of a multi-layer perceptron (MLP) in the context of minimizing cross-entropy error?

To learn an optimal representation of the input data

What influence can be observed in the simple MLP using the Neural Net Playground?

Influence of activation functions on model performance

In the context of optimization for linear regression, what is the update form of the steepest descent algorithm?

θ (k+1) = θ (k) - λ(k) ∇f(θ (k) )
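
A minimal sketch of the steepest descent update, assuming a constant step size in place of a per-iteration λ(k). The quadratic objective is made up for the demonstration.

```python
def steepest_descent(grad, theta0, step=0.1, iters=100):
    """Repeat theta <- theta - step * grad(theta) for a fixed number of iterations."""
    theta = list(theta0)
    for _ in range(iters):
        g = grad(theta)
        theta = [t - step * gi for t, gi in zip(theta, g)]
    return theta

def grad_f(theta):
    """Gradient of the toy objective f = (theta_0 - 3)^2 + (theta_1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

print(steepest_descent(grad_f, [0.0, 0.0]))  # close to [3, -1]
```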

What is the algorithm derived from the second-order Taylor series approximation of J(θ) around θ (k) in the context of second-order optimization?

Newton's algorithm
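
In one dimension Newton's update is θ ← θ − f′(θ) / f″(θ). The sketch below uses a hypothetical quadratic objective, for which a single step lands exactly on the minimum. In d dimensions the division becomes a linear solve with the d × d Hessian, which is where the cost comes from.

```python
def newton_minimize(d1, d2, theta0, iters=10):
    """Newton's algorithm in 1-D, given the first and second derivatives."""
    theta = theta0
    for _ in range(iters):
        theta = theta - d1(theta) / d2(theta)
    return theta

# Toy objective f(theta) = (theta - 2)^2 + 1.
def f_prime(theta):
    return 2.0 * (theta - 2.0)

def f_second(theta):
    return 2.0

print(newton_minimize(f_prime, f_second, theta0=10.0, iters=1))  # 2.0
```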

What is the main limitation of using second-order optimization algorithms such as Newton's algorithm?

Inversion of the Hessian matrix at each iteration

What distinguishes Stochastic Gradient Descent (SGD) from Batch and Mini-batch Gradient Descent in terms of parameter updates?

SGD updates parameters based on a subset, or even a single instance, of the data

What is the intuition behind Nesterov Accelerated Gradient in optimization?

To give momentum a sense of where it is heading, so that it slows down before the slope rises again
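
A sketch of one Nesterov step for a scalar parameter: the gradient is evaluated at the look-ahead position the momentum is about to carry the parameter to, not at the current position. The learning rate, momentum value and toy objective f(θ) = θ² are assumptions.

```python
def nag_step(theta, velocity, grad, lr=0.1, momentum=0.9):
    """One Nesterov accelerated gradient update."""
    lookahead = theta - momentum * velocity          # where momentum alone would go
    velocity = momentum * velocity + lr * grad(lookahead)
    return theta - velocity, velocity

def grad_f(theta):
    """Gradient of the toy objective f(theta) = theta^2."""
    return 2.0 * theta

theta, velocity = 1.0, 0.0
for _ in range(200):
    theta, velocity = nag_step(theta, velocity, grad_f)
print(abs(theta) < 1e-6)  # True
```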

Which optimization algorithm involves an update rule that includes a bias correction for the exponentially decaying average of squared gradients?

Adam

What is the equivalent of Adadelta without the exponential decay of squared parameter updates?

RMSProp

For big, redundant datasets, which optimization algorithm is specifically recommended?

Adam

Which method is used to make training more robust to poor initialization or highly irregular error functions?

Gradient Noise

What is the common practice to deal with the covariate shift in intermediary layer inputs during training in deep networks?

Batch Normalization
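
A sketch of the training-time batch normalization transform for a single feature: each mini-batch is standardized to zero mean and unit variance, then scaled and shifted by learnable parameters. The names `gamma`, `beta` and `eps` follow common usage and are assumptions here; the running statistics used at inference time are omitted.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

print(batch_norm([1.0, 2.0, 3.0, 4.0]))
```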

What is the primary focus of Step 2 in the project guidelines for the Neural Networks course?

Evaluating the first results of the project

In logistic regression, what is the main purpose of the derivative of the logit function?

To adjust the model parameters during backpropagation

What constraint on the kernel size (P) ensures that the division in the output-height formula yields an integer, when a convolution layer with stride (S) is applied to an input volume with dimensions H x W x D?

$(H - P) \bmod S = 0$, i.e. $H - P$ must be a multiple of $S$ (assuming no zero-padding)

What is the passing condition for exam entry in the Neural Networks course?

Earning at least 50% of semester activity, including project evaluations, and achieving 30% in the written final exam

What is the main goal of Maximum Likelihood Estimation (MLE) in logistic regression?

To maximize the log-likelihood function

What is the main difference between linear regression and logistic regression?

Linear regression uses a linear predictor, while logistic regression uses a logistic sigmoid function for classification.

What is the purpose of the sigmoid function in logistic regression?

To 'squeeze in' the weighted input into a probability space.

What is the measure of the uncertainty associated with a random variable in logistic regression?

Entropy

What is the method of estimating the parameters of a statistical model given observations in logistic regression?

Maximum Likelihood Estimation (MLE)

What is the main issue with the second-order optimization algorithm in computing the direction of descent?

It involves computing and storing Hessian matrices.

What is the dimension of the output map when applying a filter of size 5 × 5 × 3 to an input volume of dimension 32 × 32 × 3?

28 × 28 × 1

What does the size of the receptive field represent in a convolutional layer?

Spatial extent of the local connectivity of each neuron

What is the disadvantage of using a large filter in a convolutional layer?

Lose info about spatial arrangement of pixels

What is the main purpose of connecting each neuron to only a local region of the input volume in a convolutional layer?

Reduce the number of parameters and enforce translation invariance

What influence can be observed in the simple MLP using the Neural Net Playground?

Impact of different activation functions on model performance

What is the dimension of the output map when applying a filter of size $5 \times 5 \times 3$ to an input volume of dimension $32 \times 32 \times 3$ with a stride of 1 and zero-padding of 0?

$28 \times 28 \times 1$

If an input volume has dimensions $50 \times 50 \times 3$, a kernel size of $5 \times 5 \times 3$, zero-padding of 0, and a stride of 2, what is the dimension of the output map?

$23 \times 23 \times 1$ (the quotient $(50 - 5)/2 = 22.5$ is rounded down before adding 1)

For an input volume with dimensions $40 \times 40 \times 3$, a filter size of $4 \times 4 \times 3$, and zero-padding of 2, what is the dimension of the output map?

$41 \times 41 \times 1$ (with stride 1: $40 + 2 \cdot 2 - 4 + 1 = 41$)

If the stride is set to 2, the input volume has dimensions of $30 \times 30 \times 3$, and the filter size is $4 \times 4 \times 3$, what padding can be used to ensure the output map has the same width and height as the input volume?

16, since $(30 + 2P - 4)/2 + 1 = 30$ gives $2P = 32$

For an input volume with dimensions $20 \times 20 \times 3$ and a filter size of $5 \times 5 \times 3$ with a stride of 3, what is the dimension of the output map?

$6 \times 6 \times 1$

What constraint on the kernel size ($P$) ensures that the division in the output-height formula yields an integer, given an input volume with dimensions $H \times W \times D$ and a stride of $S$?

$(H - P) \bmod S = 0$, i.e. $H - P$ must be a multiple of $S$ (assuming no zero-padding)

What is the formula for calculating $W_{out}$, the width of the output map, given an input volume with dimensions $H \times W \times D$, a kernel size of $P \times P \times D$, zero-padding of $Z$, and dilation factor of $L$?

$W_{out} = \left\lfloor \frac{W + 2Z - L(P - 1) - 1}{S} \right\rfloor + 1$, where $S$ is the stride ($S = 1$ if none is given)

What does 'inductive bias' refer to in the context of convolutional layers?

The assumption that features useful in one area are likely to be useful in another area

What is the main issue with second-order optimization algorithms in computing the direction of descent?

Computational complexity increases significantly with higher order derivatives

What feature distinguishes Adadelta from Adagrad in terms of updating individual parameter learning rates?

Adadelta replaces Adagrad's accumulated sum of all past squared gradients with an exponentially decaying average of squared gradients, so the effective learning rates do not shrink toward zero

What are the passing conditions for the written final exam in the Neural Networks course?

50% of semester activity required for exam entry (grades for HW + Intermediary presentations + Step 2 of project), 50% of final project, 30% of written exam

How many biases are needed when using parameter sharing in a CNN with an output volume of dimensions H x W x D and a filter size of K x K x D?

D (one bias per depth slice of the output volume, i.e. one per filter)

What is the dimension of the output map when applying a filter of size 6 × 6 × D and a stride of 2 to an input volume with dimensions of H x W x D?

((H - 6) / 2 + 1) × ((W - 6) / 2 + 1) × 1 (a single filter yields one output channel; assumes no zero-padding)

What is the weightage of the project in the evaluation for exam entry?

50%

What is the update rule for the Adadelta optimization algorithm?

w_{t+1} = w_t - (RMS[delta_w]_{t-1} / RMS[g]_t) * g_t

What is the main purpose of early stopping in the context of optimization?

To prevent overfitting

What is the main difference between linear regression and logistic regression?

Linear regression predicts continuous outcomes, while logistic regression predicts binary outcomes.

What is the main goal of Maximum Likelihood Estimation (MLE) in logistic regression?

To estimate the parameters of a statistical model given observations

What is the primary focus of Step 2 in the project guidelines for the Neural Networks course?

Intermediary project evaluation (e.g. first results)

What distinguishes Stochastic Gradient Descent (SGD) from Batch and Mini-batch Gradient Descent in terms of parameter updates?

SGD updates parameters using a single training example at a time, while Batch and Mini-batch GD use multiple examples.

What is the weightage of the written final exam in the evaluation for exam entry?

30%

What is the method of estimating the parameters of a statistical model given observations in logistic regression?

Maximum Likelihood Estimation (MLE)

Study Notes

  • The text outlines the course structure for a machine learning class, focusing on neural networks and their related topics.
  • The course consists of 14 lectures, programming and analysis homework assignments, and a project.
  • The project involves topic selection, presentation of state-of-the-art research, intermediary presentations, intermediate project evaluation, final project poster presentation, and final project paper.
  • The passing conditions are 50% of semester activity (required for exam entry), 50% of the final project, and 30% of the written exam.
  • The text covers the basics of linear regression, including the objective function, mean squared error loss, and the solution using gradient descent.
  • The text discusses the limitations of linear regression for classification tasks and introduces the logistic regression model.
  • Logistic regression models the probability of binary output given input, using a sigmoid activation function.
  • The text covers maximum likelihood estimation and the minimization of cross-entropy using gradient descent for logistic regression.
  • The text also touches on Kullback-Leibler divergence and its relationship to minimizing cross-entropy.
  • The text briefly discusses the application of logistic regression as a neural network, introducing the softmax function for multiclass classification.
  • The text then covers backpropagation, a method used to train multi-layered neural networks.
  • The text outlines the forward and backward passes in a neural network and discusses the calculation of delta values for each layer during backpropagation.
  • The text concludes by mentioning the insufficiency of logistic regression for the given rings dataset and the need for a multi-layered perceptron to address the classification problem.

This quiz covers the concept of convolution layers in neural networks, including their advantages, disadvantages, and the spatial arrangement of pixels. It also includes a case study involving an input volume and filter dimensions.
