Introduction to Machine Learning: Linear Models

Questions and Answers

What is a key characteristic of Stochastic Gradient Descent (SGD)?

  • It requires a larger learning rate compared to Batch Gradient Descent.
  • It guarantees convergence to the global minimum.
  • It processes the entire dataset for each iteration.
  • It uses a single random training example for each update. (correct)

Which of the following describes the advantage of using SGD over traditional Gradient Descent methods?

  • It introduces randomness to the optimization process. (correct)
  • It calculates the gradient using the entire dataset.
  • It minimizes the number of iterations required.
  • It guarantees a lower cost function value in all iterations.

What is the first step in the Stochastic Gradient Descent algorithm?

  • Shuffle the training dataset.
  • Randomly initialize the parameters of the model. (correct)
  • Compute the gradient of the cost function.
  • Determine the number of iterations.

What does 'stochastic' refer to in Stochastic Gradient Descent?

Answer: Selecting training examples randomly.

In the context of SGD, what is meant by 'mini-batch'?

Answer: Using a randomized small group of training examples.

What happens in the Stochastic Gradient Descent loop when a model converges?

Answer: The parameters stop updating.

Which of the following statements is true regarding Batch Gradient Descent compared to SGD?

Answer: Batch Gradient Descent is more efficient for small datasets.

Why is it important to shuffle the training dataset before each iteration in SGD?

Answer: To avoid patterns and introduce randomness in training.
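To make the loop concrete, here is a minimal sketch of SGD for linear regression; the function name, hyperparameter values, and synthetic data are illustrative assumptions, not part of the lesson:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=200, seed=0):
    """Train y ~ X @ w + b with stochastic gradient descent (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)            # step 1: randomly initialize the parameters
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):  # shuffle, then one random example per update
            err = X[i] @ w + b - y[i] # gradient of 0.5 * (pred - y)^2 w.r.t. pred
            w -= lr * err * X[i]      # parameter update from a single example
            b -= lr * err
    return w, b

# Usage with synthetic data generated as y = 3x + 1 plus noise
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = 3 * X[:, 0] + 1 + np.random.default_rng(1).normal(0, 0.05, size=100)
w, b = sgd_linear_regression(X, y)
print(w, b)  # should land near [3.] and 1.0
```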

What does the loss function quantify in a machine learning model?

Answer: The cost or penalty for incorrect predictions.

Which optimization technique is most commonly used to minimize the loss function?

Answer: Gradient Descent.

Which loss function is especially sensitive to outliers in the dataset?

Answer: Mean Squared Error (MSE).

What advantage does Mean Absolute Error (MAE) Loss have over Mean Squared Error (MSE) Loss?

Answer: It is less sensitive to outliers.

What is the primary characteristic of loss functions in regression tasks?

Answer: They evaluate how well predictions match actual data.

Which characteristic makes the Mean Squared Error (MSE) Loss suitable for gradient-based optimization?

Answer: It is differentiable.

What does the term 'Huber Loss' refer to in the context of loss functions?

Answer: A type of loss that is insensitive to outliers up to a certain threshold.

Which loss function calculates the average of the squared differences?

Answer: Mean Squared Error (MSE).
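The three regression losses discussed above fit in a few lines each; this is a minimal sketch, with the delta threshold of 1.0 as an illustrative default:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences (outlier-sensitive)."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute difference (robust to outliers)."""
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond the delta threshold."""
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.0])
print(mse(y_true, y_pred))    # blows up because of the squared outlier error
print(mae(y_true, y_pred))    # grows only linearly with the outlier
print(huber(y_true, y_pred))  # compromise between the two
```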

What is a major consequence of using a high learning rate in SGD?

Answer: Overshooting the minimum.

Which theorem states that a neural network with a single hidden layer can approximate any continuous function?

Answer: The Universal Approximation Theorem.

What role does the hidden layer play in a neural network?

Answer: It processes input through weighted connections and activation functions.

Which method can help mitigate the issues of noisy updates in SGD?

Answer: Using learning rate scheduling.

What does the output of a neural network's single hidden layer depend on, mathematically?

Answer: A composition of linear transformations and activation functions.

In the context of the Universal Approximation Theorem, what is required for a neural network to approximate a continuous function?

Answer: An appropriate activation function.

What can occur if SGD converges too slowly due to a low learning rate?

Answer: The solution may be suboptimal.

Which of the following accurately describes the composition of the neural network function?

Answer: A combination of weighted linear transformations and non-linear activation functions.
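As a concrete illustration of that composition, here is a minimal sketch of a single-hidden-layer network; the shapes, the tanh activation, and the random weights are illustrative assumptions:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One hidden layer: f(x) = W2 @ sigma(W1 @ x + b1) + b2."""
    h = np.tanh(W1 @ x + b1)   # linear transformation + non-linear activation
    return W2 @ h + b2         # final linear transformation (output layer)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 1)), np.zeros(16)   # 16 hidden units
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)
print(forward(np.array([0.5]), W1, b1, W2, b2))
```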

What is the main criterion for selecting the best hyperplane in a Support Vector Machine?

Answer: It maximizes the separation margin between the two classes.

What happens when a data point lies on the boundary of the separating classes in SVM?

Answer: It is considered a support vector and is crucial for defining the hyperplane.

What is a characteristic of SVM in relation to outliers?

Answer: SVM focuses on maximizing the margin by ignoring outliers.

What is meant by the term 'soft margin' in SVM?

Answer: A margin that allows some data points to violate the separation rule.

What is the formula to minimize when a soft margin is applied in SVM?

Answer: 1/margin + Σ penalty.

When data is not linearly separable, what does SVM do?

Answer: It creates new variables using a kernel function.

What does hinge loss represent in the context of SVM?

Answer: A measure of how far any data point violates the margin.

What is the result of a maximum-margin hyperplane in SVM?

Answer: It maximizes the distance from the hyperplane to the nearest points of each class.
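A minimal sketch of the hinge loss described above, assuming labels in {+1, -1} and raw decision scores w·x + b; the example numbers are made up:

```python
import numpy as np

def hinge_loss(y, scores):
    """Hinge loss: zero for points safely outside the margin,
    a linear penalty for points inside or on the wrong side.
    y holds +1/-1 labels; scores are the raw values w.x + b."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

y = np.array([+1, +1, -1, -1])
scores = np.array([2.0, 0.5, -3.0, 0.2])  # last point is on the wrong side
print(hinge_loss(y, scores))  # only the margin violators contribute
```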

What is a primary function of the activation function in a Perceptron?

Answer: To map output values into a specific range.

What information does the weight of an input provide in a Perceptron?

Answer: The strength of that input node.

Which mathematical form represents the calculation of the weighted sum in a Perceptron?

Answer: Σ wᵢ·xᵢ.

What is the purpose of the bias in the Perceptron model?

Answer: To shift the activation function curve.

In which scenario would a single-layer Perceptron be used effectively?

Answer: When outcomes are linearly separable.

What does the output of a Perceptron model indicate when the summed input exceeds a threshold?

Answer: The output value is +1.

Which type of Perceptron model consists of only one layer?

Answer: The single-layer Perceptron.

What is added to the weighted sum in a Perceptron to improve its performance?

Answer: Bias.
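A minimal sketch of the perceptron computation just described: a weighted sum plus bias, followed by a threshold activation. The weights below are hand-picked for illustration:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Weighted sum plus bias, then a step activation: +1 above threshold, else -1."""
    s = np.dot(w, x) + b        # weighted sum: sum_i w_i * x_i, shifted by the bias
    return 1 if s > 0 else -1

# An AND-like decision on two binary inputs (a linearly separable problem)
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x), w, b))  # +1 only for (1, 1)
```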

What is the primary difference between a single-layer perceptron and a multi-layer perceptron?

Answer: A single-layer perceptron can only process linear patterns, while a multi-layer perceptron can process both linear and non-linear patterns.

Which of the following is NOT an advantage of a multi-layer perceptron model?

Answer: Requires minimal training data to achieve high accuracy.

Which of the following accurately describes the "backward stage" of the multi-layer perceptron training process?

Answer: The stage where weights and biases are adjusted based on the difference between the actual output and the desired output.

In which of the following scenarios would a multi-layer perceptron model be a suitable choice?

Answer: All of the above.

What is a potential drawback of using a multi-layer perceptron model?

Answer: The complexity and computational cost of training.

Which of the following is NOT a common type of activation function used in a multi-layer perceptron?

Answer: Linear.

What is a common method for evaluating the performance of a multi-layer perceptron model?

Answer: Both A and B.

What is the significance of the "hidden layers" in a multi-layer perceptron?

Answer: They allow the network to learn complex non-linear relationships.
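A minimal sketch of training a multi-layer perceptron on XOR, the classic non-linearly separable problem that a single-layer perceptron cannot solve. It uses scikit-learn's MLPClassifier; the hyperparameter values are illustrative choices:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR cannot be separated by a single hyperplane, so a hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=5000, random_state=0)
clf.fit(X, y)            # forward stage + backward stage (backpropagation)
print(clf.predict(X))    # should recover [0 1 1 0]
```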

Flashcards

  • Hyperplane: A flat affine subspace that separates data points in SVM.
  • Best Hyperplane: The hyperplane that maximizes the separation margin between classes.
  • Separation Margin: The distance between the hyperplane and the nearest data points from each class.
  • Maximum-Margin Hyperplane: The hyperplane that maximizes the separation margin; also called a hard margin.
  • SVM Robustness: SVM can ignore outliers and still find the best hyperplane.
  • Soft Margin: Allows some points to violate the margin; used when the data is less cleanly separable.
  • Hinge Loss: A penalty used in SVM to measure violations of the margin.
  • Kernel Trick: A method in SVM to handle non-linearly separable data by transforming features.
  • Perceptron: A single-layer neural network for binary classification.
  • Input Values: The data points fed into the perceptron model.
  • Weights: Parameters that determine the strength of each input value in a perceptron.
  • Bias: A term that shifts the activation function curve.
  • Weighted Sum: The total derived from multiplying inputs by their weights.
  • Activation Function: A function that determines the output based on the weighted sum.
  • Single-layer Perceptron: The simplest form of perceptron, with one layer, for linearly separable data.
  • Multi-layer Perceptron: A perceptron model with multiple layers for complex patterns.
  • Forward Stage: The phase where activation functions run from the input layer to the output layer.
  • Backward Stage: The phase where weights and biases are adjusted based on the error.
  • Complex Non-linear Problems: Challenges that cannot be solved using linear models alone.
  • Advantages of Multi-layer Perceptron: Makes quick predictions and handles both large and small datasets effectively.
  • Disadvantages of Multi-layer Perceptron: Computations can be complex and time-consuming.
  • Gradient Descent: An optimization algorithm that minimizes a function by moving in the opposite direction of the gradient.
  • Batch Gradient Descent: A type of gradient descent that uses the entire dataset to compute the gradient at each iteration.
  • Stochastic Gradient Descent (SGD): A variant of gradient descent that uses one random example at each iteration to compute gradients, improving efficiency.
  • Mini-batch Gradient Descent: A gradient descent variant that splits the dataset into small batches and uses these to compute the gradient for each update.
  • Initialization in SGD: The first step in the SGD process, where model parameters are randomly set before training.
  • Learning Rate (alpha): A hyperparameter in the gradient descent algorithm that determines the size of the steps taken towards the minimum.
  • Shuffle Dataset: The process of randomly rearranging the training examples to ensure diversity in updates during SGD.
  • Convergence in SGD: The point at which the algorithm has sufficiently minimized the cost function and stops updating model parameters.
  • Learning Rate: A hyperparameter that controls the step size during optimization.
  • Universal Approximation Theorem (UAT): A theorem stating that a neural network with a single hidden layer can approximate any continuous function.
  • Hidden Layer: A layer in a neural network that processes inputs through weights and activation functions.
  • Output Layer: The final layer in a neural network that produces the predicted output.
  • Weights and Biases: Parameters in neural networks that are adjusted through training to minimize error.
  • Convergence: The process of an algorithm reaching a stable solution over iterations.
  • Loss Function: Quantifies the error as a cost for incorrect predictions.
  • Objective Function: The function that algorithms aim to minimize, typically involving the loss function.
  • Mean Squared Error (MSE) Loss: The average of the squared differences between predicted and actual values; widely used for regression tasks.
  • Mean Absolute Error (MAE) Loss: The average of the absolute differences between predicted and actual values; less sensitive to outliers than MSE.
  • Huber Loss: A robust loss function that combines MSE and MAE; less sensitive to outliers.
  • Log-Cosh Loss: A loss function that is smoother than MSE; still sensitive to outliers, but less so than MSE.
  • Efficacy of Loss Functions: Different loss functions suit different types of prediction problems in regression.

Study Notes

Introduction to Machine Learning: Linear Models

  • Logistic regression, support vector machines (SVMs), and perceptrons are machine learning algorithms.
  • Neural networks are universal function approximators: a network with a single hidden layer can approximate any continuous function.
  • Training a network uses loss functions, backpropagation, and stochastic gradient descent.

Linear Models

  • Linear models are foundational to more complex machine learning algorithms, including deep neural networks.
  • Linear regression predicts a target variable using a linear function of the input features.
  • Logistic regression uses a sigmoid function to transform the linear regression output into probabilities for classification tasks (see the sketch after this list).
  • Linear models have practical applications in industry.
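A compact sketch of the two models just described, using made-up weights and inputs for illustration:

```python
import numpy as np

def linear_regression_predict(X, w, b):
    """Linear regression: the target is a linear function of the input features."""
    return X @ w + b

def logistic_regression_predict(X, w, b):
    """Logistic regression: squash the linear output into a probability."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the linear output

X = np.array([[0.5, 1.0], [2.0, -1.0]])       # two examples, two features
w, b = np.array([1.5, -0.5]), 0.1
print(linear_regression_predict(X, w, b))     # real-valued predictions
print(logistic_regression_predict(X, w, b))   # probabilities in (0, 1)
```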

Types of Linear Models

  • Linear regression and logistic regression are covered in this lesson.
  • Linear regression models the relationship between independent and dependent variables (using a linear function).
  • Logistic regression extends linear regression to predict probabilities (using a sigmoid function).

Support Vector Machines (SVMs)

  • SVMs are powerful machine learning algorithms for classification, regression, and outlier detection.
  • SVMs focus on finding the optimal hyperplane that maximizes the margin between different data classes (see the sketch after this list).
  • Support vectors are the data points closest to the hyperplane.
  • The dimension of the hyperplane depends on the number of features.
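A minimal sketch of fitting a maximum-margin classifier with scikit-learn's SVC on synthetic, linearly separable data; the cluster parameters are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters around (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)),
               rng.normal(2, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)  # C sets the soft-margin penalty strength
clf.fit(X, y)
print(clf.support_vectors_)        # the nearest points that define the hyperplane
```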
