Supervised Machine Learning
38 Questions

Questions and Answers

In supervised learning, what is the role of the 'ideal' function $f^*$?

  • It maps each observation perfectly to its target, but is often unattainable in practice. (correct)
  • It represents the best approximation that a model can achieve on the training data.
  • It is a function in $F^m$ that we can model easily.
  • It is a function that minimizes the training error to zero.

What core assumption underpins the use of supervised learning for generalization?

  • The training data is free of noise and outliers, and it ensures perfect accuracy on unseen data.
  • The algorithm used is the most efficient for the given dataset size.
  • The chosen model is complex enough to capture all nuances in the training data.
  • The observed data is representative of the real-world data distribution. (correct)

Which type of error in supervised learning directly reflects the model's performance on unseen, real-world data that was not available during training or testing?

  • Testing error
  • Training error
  • Validation error
  • Generalization error (correct)

What is the significance of the 'independent and identically distributed' (IID) assumption in supervised machine learning?

  • It allows estimating generalization error from the testing error by assuming data points are independently drawn from the same distribution. (correct)

Which of the following supervised learning tasks involves mapping observations to a probability distribution over a set of categories?

  • Probability density estimation (correct)

What is the potential risk of only re-doing previous quizzes to prepare for a major exam?

  • It might lead to overfitting to the quiz content and poor preparation for novel questions. (correct)

What principle underlies the minimization of 'wrongness' on training data, with the expectation that it leads to minimal 'wrongness' in the real world?

  • Empirical Risk Minimization (correct)

In the context of linear regression, what does the 'residual' represent?

  • The difference between the predicted and actual values for a data point. (correct)

What is the role of the learning rate ($\eta$) in gradient-based estimation?

  • It controls the size of the jump taken during parameter updates. (correct)

What is a potential consequence of setting the learning rate too high in gradient descent?

  • The algorithm may overshoot the minimum and fail to converge. (correct)

Why are iterative methods, such as gradient descent, important in machine learning?

  • They can handle loss functions for which closed-form solutions are not available. (correct)

In the context of machine learning, what is a hyperparameter?

  • A parameter set prior to training that influences the learning process. (correct)

What is the primary concern when using Multiple Linear Regression (MLR) models?

  • They can easily overfit the training data and become too complex. (correct)

Which of the following is NOT a basic statistical assumption of linear regression?

  • Multicollinearity among the independent variables. (correct)

In logistic regression, what is the primary reason for using the logistic function?

  • To transform the output into a probability between 0 and 1. (correct)

When generalizing logistic regression to N classes, what is a common approach?

  • Creating N separate binary classification tasks. (correct)

What is a key advantage of linear models?

  • Their simplicity and ease of interpretation. (correct)

How does K-Nearest Neighbors (KNN) make predictions?

  • By finding the closest training examples. (correct)

In the context of decision trees, what does 'pruning' refer to?

  • The technique of reducing the size of the tree by removing sections. (correct)

What defines the 'margin' in Support Vector Machines (SVMs)?

  • The distance between the decision boundary and the closest data points. (correct)

Which of the following is an advantage of Bayesian Networks?

  • They serve as an alternative to Empirical Risk Minimisation. (correct)

What role does a non-linearity play in neural network models?

  • It enables the network to model complex relationships. (correct)

What is the primary purpose of the 'backward pass' in neural networks?

  • To refine the network's weights based on the error in predictions. (correct)

What key capability typically emerges when a neural network becomes 'deep' enough?

  • It starts learning relevant data representations for making predictions. (correct)

What is the primary goal of Bootstrap Aggregating (Bagging) in ensemble methods?

  • To reduce variance by training models on random data samples. (correct)

In the context of regularisation, what is the general effect of adding a term like $\lambda \cdot P(\beta)$ to the loss function?

  • It adds a penalty for model complexity. (correct)

What is the likely effect of a Lasso regularisation term?

  • Tends to push model parameters to zero. (correct)

What is the purpose of 'early stopping' as a form of regularisation in neural network training?

  • To avoid overfitting to the training data. (correct)

With reference to tree models, what does tree pruning do?

  • Simplifies the tree. (correct)

In the context of Vapnik-Chervonenkis (VC) dimension, what does a higher VC dimension generally indicate about a model?

  • Greater model complexity and capacity to fit data. (correct)

What is the primary focus of supervised learning?

  • Predicting outcomes based on input features. (correct)

Which of the following statements best describes the concept of 'generalization' in machine learning?

  • The ability of a model to accurately predict outcomes on new, unseen data. (correct)

What is the main purpose of 'loss functions' in supervised learning?

  • To measure the error between predicted and actual values. (correct)

Which of the following is a pitfall of linear regression?

  • It may be misused for extrapolation. (correct)

What is the role of cross-validation in machine learning?

  • To assess how the results of a statistical analysis will generalize to an independent data set. (correct)

What is the purpose of the Elastic-Net regularization technique?

  • To combine the Lasso (L1) and Ridge (L2) penalties. (correct)

Considering the bias-variance tradeoff, which regularization strength would most likely favor a lower bias in a model?

  • None whatsoever (no regularization). (correct)

Which of the following tasks best describes the use of memory based models?

  • Storing the training data and comparing new observations to it. (correct)

Flashcards

Supervised learning

Finding a function that maps observations to targets.

Core Assumption

The assumption that the data used is representative of the real world.

IID

Data points are independent and identically distributed.

Training error

Measures how well the model performs on seen data.

Testing error

Measures performance on unseen data that is still available during development.

Generalisation error

Measures performance on unseen, unavailable real-world data.

Empirical Risk Minimisation

Minimizing "wrongness" on available training data helps reduce "wrongness" in the real world.

Regression

Mapping an observation to a numerical value.

Classification

Mapping an observation to a discrete category.

Linear Regression

A procedure to find the parameters of the linear model that best fits a specific dataset.

Residual

The difference between the predicted and actual value.

Residual Sum of Squares (RSS)

Sum of the squared differences between predicted and actual values.

Gradient-based Estimation

Iteratively updating the model parameters with the partial derivative of the loss.

Learning Rate

A parameter that controls how big a jump gradient descent updates take.

Multiple Linear Regression

Generalising simple linear regression to multiple independent variables/features.

Homoscedasticity

The errors' variance is constant over X.

Logistic Regression

Models the probability p(y = 1 | X).

Regularisation

Adds a term to the loss function to control model complexity.

Bagging

Reduces variance by training models on random data samples.

Boosting

Builds models sequentially, with each model focusing on errors made by previous models.

Stacking

Combines predictions of diverse base models.

Study Notes

  • Lecture 5 focuses on supervised machine learning.

Supervised Learning as Generalization

  • Supervised learning involves generalizing from a given dataset.
  • The core assumption is that the provided data represents the real world accurately.

Supervised Learning as Function Approximation

  • Supervised learning approximates a function that maps observations to targets.
  • There exists a space of functions (F) that accurately map observations to targets.
  • The ideal function (f*) exists within F, mapping each observation to its correct target.
  • The goal is to find a function (f̂) within a model space (Fm) that best approximates the ideal function.
  • All that is available is a dataset of observations for which this function holds true.

Errors in Supervised Learning

  • Training error measures the model's performance on training data.
  • Testing error measures the model's performance on a testing dataset.
  • Generalization error measures model performance in the real world, where data is unavailable during training and testing.
  • Training a supervised machine learning (ML) model involves iteratively adjusting parameters to minimize training error, aiming to also reduce generalization error.
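
To make the first two errors concrete, here is a minimal Python sketch (synthetic data stands in for the real world; the model and split sizes are illustrative choices). The testing error serves only as an estimate of the generalization error, which cannot be computed directly.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: Y = 2X + 1 plus noise.
    X = rng.uniform(0, 10, size=200)
    Y = 2 * X + 1 + rng.normal(0, 1, size=200)

    # Split into training (seen) and testing (held-out) data.
    X_train, X_test = X[:150], X[150:]
    Y_train, Y_test = Y[:150], Y[150:]

    # Fit a simple linear model on the training data only.
    beta1, beta0 = np.polyfit(X_train, Y_train, deg=1)

    def mse(X, Y):
        return np.mean((Y - (beta0 + beta1 * X)) ** 2)

    print("training error:", mse(X_train, Y_train))  # performance on seen data
    print("testing error: ", mse(X_test, Y_test))    # estimate of generalization error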

Statistical Assumptions

  • Supervised learning relies on the data being independent and identically distributed (IID).
  • Independent data means observations don't influence each other; knowing one data point doesn't provide information about others.
  • Identically distributed data means all observations are sampled from the same distribution without underlying trends.
  • An assumption is that the data represents the real world and that we can estimate the generalisation error from the testing error.

Types of Supervised Learning

  • Classification maps observations to discrete categories.
  • Regression maps observations to numerical values.
  • Probability density estimation maps observations to probability distributions over categories.
  • Ranking maps observations to a linear order of discrete categories.

Minimizing Loss Functions

  • Supervised learning involves minimizing a loss function.
  • A loss function quantifies the difference between predicted and actual values, indicating how far off predictions are.
  • Empirical Risk Minimization revolves around minimizing "wrongness" on available data to achieve minimal "wrongness" in the real world.

Empirical Risk Minimization

  • Minimizing training error is the basis of the Empirical Risk Minimization principle.
  • Empirical Risk Minimization serves as a foundational concept in machine learning.

Linear Models for Regression

  • Linear regression identifies the parameters of the linear model that best represent a specific dataset.
  • A linear model helps predict numerical values and understand associations between variables.

Least Squares Estimation

  • Predict a value for the i-th data point using Ŷᵢ = β₀ + β₁ ⋅ Xᵢ.
  • The residual is the difference between the actual and predicted value: eᵢ = Yᵢ - Ŷᵢ.
  • The Residual Sum of Squares (RSS) is defined as RSS = e₁² + e₂² + e₃² + ... + eₙ².
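
These definitions translate directly into Python; the data and parameter values below are purely illustrative:

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = np.array([2.9, 5.1, 7.2, 8.8])
    beta0, beta1 = 1.0, 2.0   # assumed parameter values

    Y_hat = beta0 + beta1 * X      # predictions: Ŷᵢ = β₀ + β₁ ⋅ Xᵢ
    residuals = Y - Y_hat          # eᵢ = Yᵢ − Ŷᵢ
    rss = np.sum(residuals ** 2)   # RSS = e₁² + e₂² + ... + eₙ²
    print(rss)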

Gradient-Based Estimation

  • A function of the RSS, J = f(RSS), is minimized as the loss function.
  • Parameters (β) are updated over many iterations using the derivative of the training error with respect to the parameters:

      βᵗ⁺¹ = βᵗ − η ⋅ ∇β J(βᵗ)

  • Eta (η) is a set parameter that controls the size of the jumps the updates make.
  • The point where the procedure stops is not guaranteed to be the best possible answer, just the best that could be found.
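
A minimal gradient-descent sketch for simple linear regression, assuming the loss J is the plain RSS (the learning rate and iteration count are illustrative choices):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

    beta = np.zeros(2)   # [β₀, β₁]
    eta = 0.01           # learning rate η

    for t in range(5000):
        Y_hat = beta[0] + beta[1] * X
        # Partial derivatives of the RSS with respect to β₀ and β₁.
        grad = np.array([
            -2 * np.sum(Y - Y_hat),
            -2 * np.sum((Y - Y_hat) * X),
        ])
        beta = beta - eta * grad   # βᵗ⁺¹ = βᵗ − η ⋅ ∇β J(βᵗ)

    print(beta)  # approaches the least-squares estimates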

Learning Rate (LR)

  • The fundamental equation of gradient descent is

      βᵗ⁺¹ = βᵗ − η ⋅ ∇β J(βᵗ)

  • Eta (η) is known as the learning rate or step size.

Hyperparameters and Learning Rate

  • The learning rate is a hyperparameter that affects the learning process.
  • Three approaches to selecting hyperparameters:
  • Use a reasonable hyperparameter value and keep it constant, assuming it doesn't significantly affect the outcome.
  • Choose a hyperparameter value from existing literature, assuming established best practices apply.
  • Discover the most effective hyperparameter through experimentation on the specific dataset, which assumes optimal performance requires fine-tuning.
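
As a sketch of the third approach, one possible setup is a cross-validated grid search over candidate learning rates, here using scikit-learn's SGDRegressor on synthetic data (the candidate values are illustrative):

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

    # Try several learning rates (eta0) and keep the one that performs
    # best under 5-fold cross-validation.
    search = GridSearchCV(
        SGDRegressor(learning_rate="constant", max_iter=1000),
        param_grid={"eta0": [1e-4, 1e-3, 1e-2]},
        cv=5,
    )
    search.fit(X, y)
    print("selected learning rate:", search.best_params_["eta0"])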

Multiple Linear Regression (MLR)

  • MLR generalizes simple linear regression to several independent variables/features (see the sketch below).
  • MLR models can become too complex.
  • Regularization helps control MLR complexity.
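
A minimal MLR sketch using NumPy's least-squares solver on synthetic data with three features:

    import numpy as np

    rng = np.random.default_rng(0)

    # 100 observations, three independent variables/features each.
    X = rng.normal(size=(100, 3))
    Y = 2.0 + X @ np.array([1.0, -0.5, 3.0]) + rng.normal(0, 0.1, size=100)

    # Prepend a column of ones so the intercept β₀ is estimated too.
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
    print(beta)  # ≈ [2.0, 1.0, -0.5, 3.0]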

Statistical Assumptions of Linear Regression

  • Linear regression requires a linear relationship between variables X and Y.
  • Errors should be independent and approximately normally distributed.
  • There should also be an absence of multicollinearity between features.
  • The variance of errors should be (mostly) constant over X (homoscedasticity).

Linear Models for Classification

  • Binary classification places observations into one of two classes, {0, 1}.

  • The logistic model focuses on the probability P(y = 1 | X).

  • The logistic function 'squashes' its input to a value between 0 and 1.

  • Decision boundary: if p > 0.5, assign class 1, otherwise class 0

  • The logistic function is

      p(y = 1 | X) = e^(β₀ + β₁x) / (1 + e^(β₀ + β₁x))
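
The model translates directly into Python (the parameter values below are illustrative):

    import numpy as np

    def logistic(x, beta0, beta1):
        # p(y = 1 | x): squashes β₀ + β₁x into the range (0, 1).
        z = beta0 + beta1 * x
        return np.exp(z) / (1 + np.exp(z))

    p = logistic(np.array([-2.0, 0.0, 2.0]), beta0=0.5, beta1=1.0)
    print(p)                      # probabilities between 0 and 1
    print((p > 0.5).astype(int))  # decision boundary: class 1 if p > 0.5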

Training of Logistic Regression

  • Training of logistic regression works similarly to linear regression.
  • Training employs closed-form or gradient-based methods.
  • Arbitrary binary coding: if p > 0.5, assign class 1, otherwise class 0.

Generalizing to N Classes

  • An N-class classification task converts into N binary classification tasks.
  • Assign the class whose binary classifier gives the highest probability.
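
A sketch of this one-vs-rest strategy, using scikit-learn's LogisticRegression as the binary learner (the data and labels are synthetic, purely to show the mechanics):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = rng.integers(0, 3, size=300)  # N = 3 classes

    # One binary task per class: "class k" vs "not class k".
    models = {k: LogisticRegression().fit(X, (y == k).astype(int))
              for k in range(3)}

    def predict(x):
        # Assign the class whose binary model gives the highest probability.
        probs = {k: m.predict_proba(x.reshape(1, -1))[0, 1]
                 for k, m in models.items()}
        return max(probs, key=probs.get)

    print(predict(np.array([0.3, -1.2])))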

Non-Linear Models

  • Linear models provide simplicity and interpretability.
  • Other model types exist with fewer pre-set assumptions.
  • Some examples of non-linear models are: memory-based, tree-based, maximum-margin, probabilistic, and neural network models.

Ensemble Methods

  • Bagging (Bootstrap Aggregating) reduces variance by training models on random data samples (sketched below).
  • Boosting builds models sequentially, with each model focusing on the errors made by previous models.
  • Stacking combines predictions of diverse base models.
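
A minimal bagging sketch with simple linear models as the base learners (the number of bootstrap samples is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=100)
    Y = 2 * X + 1 + rng.normal(0, 2, size=100)

    # Fit one linear model per bootstrap sample (drawn with replacement).
    fits = []
    for _ in range(25):
        idx = rng.integers(0, len(X), size=len(X))
        fits.append(np.polyfit(X[idx], Y[idx], deg=1))

    def predict(x):
        # Average the bagged models' predictions to reduce variance.
        return np.mean([b1 * x + b0 for b1, b0 in fits])

    print(predict(5.0))  # ≈ 11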

Regularization & Generalization

  • Training error can be minimized by minimizing the loss function.
  • It is not safe to assume that the training error matches the generalisation error.
  • A training error of 0 can be achieved by adding complexity to the model.

Regularizing Models

  • Regularization adds an additional term to the learning process, shown as

      J(β, X, Y) = Σᵢ₌₁ⁿ (Yᵢ − β̂₀ − β̂₁ ⋅ Xᵢ)² + λ ⋅ P(β)

  • P(β) is a function of the parameters of the model.

  • Lambda (λ) defines how you want to trade off goodness of fit against model simplicity (see the code sketch below).

  • Regularised Empirical Risk Minimization is called Structural Risk Minimisation.
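
The regularized loss translates into Python as follows; a Ridge-style penalty P(β) = Σ βⱼ² is assumed here, and the λ value is illustrative:

    import numpy as np

    def regularized_loss(beta, X, Y, lam):
        # J(β, X, Y) = Σ (Yᵢ − β₀ − β₁ ⋅ Xᵢ)² + λ ⋅ P(β)
        residuals = Y - (beta[0] + beta[1] * X)
        penalty = np.sum(beta[1:] ** 2)  # Ridge-style P(β); intercept not penalized
        return np.sum(residuals ** 2) + lam * penalty

    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = np.array([2.9, 5.1, 7.2, 8.8])
    print(regularized_loss(np.array([1.0, 2.0]), X, Y, lam=0.1))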

Lasso vs Ridge Regularization

  • Lasso regression tends to push underperforming parameters to 0.
  • Ridge regression tends to push underperforming parameters to lower values, excluding 0.
  • A hyperparameter (lambda) weighs sides of the optimisation problem.
  • A high lambda says you are willing to worsen the fit if it means the model is simpler.
  • A low lambda says you will accept a more complex model if it means a better fit to the data.
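
A quick comparison of the two penalties on synthetic data (scikit-learn, where alpha plays the role of lambda): Lasso pushes the coefficients of the three uninformative features to exactly 0, while Ridge only shrinks them.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    # Only the first two features actually matter.
    y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.5, size=200)

    print(Lasso(alpha=1.0).fit(X, y).coef_)  # underperforming coefficients become exactly 0
    print(Ridge(alpha=1.0).fit(X, y).coef_)  # coefficients shrunk toward 0, but not to 0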

Other Forms of Regularization

  • Regularization is used in neural network models through drop-out regularization and early stopping (early stopping is sketched after this list).
  • Regularization is also applied in neural networks through mix-up/cut-mix/general noisy data augmentation.
  • Tree models use tree pruning in a number of forms to apply regularization.
  • There are sampling-based regularization methods, such as cross-validation and bootstrapping.
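
A minimal early-stopping sketch on a simple linear model (the split and patience threshold are illustrative choices): training stops once the validation loss has not improved for a while, and the best parameters seen are kept.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=100)
    Y = 2 * X + 1 + rng.normal(0, 1, size=100)
    X_tr, Y_tr = X[:70], Y[:70]      # training split
    X_val, Y_val = X[70:], Y[70:]    # validation split

    beta, eta = np.zeros(2), 1e-4
    best_val, best_beta, patience = float("inf"), beta.copy(), 0

    for step in range(100_000):
        Y_hat = beta[0] + beta[1] * X_tr
        grad = np.array([
            -2 * np.sum(Y_tr - Y_hat),
            -2 * np.sum((Y_tr - Y_hat) * X_tr),
        ])
        beta = beta - eta * grad

        # Early stopping: track the best parameters on the validation set.
        val_loss = np.sum((Y_val - beta[0] - beta[1] * X_val) ** 2)
        if val_loss < best_val:
            best_val, best_beta, patience = val_loss, beta.copy(), 0
        else:
            patience += 1
            if patience >= 50:   # no improvement for 50 steps: stop
                break

    print(best_beta)  # parameters with the lowest validation loss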

Vapnik-Chervonenkis Dimension

  • The Vapnik-Chervonenkis dimension can be illustrated with simple scenarios.
  • Consider geometrically separating students with and without blonde hair.
  • Using the "rule" of a circle, two students can easily be separated.
  • The VC dimension of drawing a circle is likely to be 3: up to 3 students can still be separated into blonde and non-blonde hair.
  • An alternate example is separating geometrically using a line.
  • A line can still be drawn to separate 3 students.
  • With 4 students, a straight line cannot always separate them all perfectly, which drops the line's likely VC dimension to 3, or even 2.

Vapnik-Chervonenkis Theory

  • VC theory is a mathematical theory of learnability.
  • It is linked to the concept of PAC-learnability.
  • It should be useful for deep neural networks; however, the VC dimension of such models is very hard to determine.
  • The VC dimension of a model is considered a theoretical bound on its model complexity.

Description

Explore supervised machine learning, focusing on generalization and function approximation. Understand how models learn from data to map observations to targets. Learn about the different types of errors that can occur.
