Questions and Answers
In supervised learning, what is the role of the 'ideal' function $f^*$?
- It maps each observation perfectly to its target, but is often unattainable in practice. (correct)
- It represents the best approximation that a model can achieve on the training data.
- It is a function in $F^m$ that we can model easily.
- It is a function that minimizes the training error to zero.
What core assumption underpins the use of supervised learning for generalization?
- The training data is free of noise and outliers, and it ensures perfect accuracy on unseen data.
- The algorithm used is the most efficient for the given dataset size.
- The chosen model is complex enough to capture all nuances in the training data.
- The observed data is representative of the real-world data distribution. (correct)
Which type of error in supervised learning directly reflects the model's performance on unseen, real-world data that was not available during training or testing?
- Testing error
- Training error
- Validation error
- Generalization error (correct)
What is the significance of the 'independent and identically distributed' (IID) assumption in supervised machine learning?
Which of the following supervised learning tasks involves mapping observations to a probability distribution over a set of categories?
What is the potential risk of only re-doing previous quizzes to prepare for a major exam?
What principle underlies the minimization of 'wrongness' on training data, with the expectation that it leads to minimal 'wrongness' in the real world?
In the context of linear regression, what does the 'residual' represent?
What is the role of the learning rate ($\eta$) in gradient-based estimation?
What is a potential consequence of setting the learning rate too high in gradient descent?
Why are iterative methods, such as gradient descent, important in machine learning?
In the context of machine learning, what is a hyperparameter?
What is the primary concern when using Multiple Linear Regression (MLR) models?
Which of the following is NOT a basic statistical assumption of linear regression?
In logistic regression, what is the primary reason for using the logistic function?
When generalizing logistic regression to N classes, what is a common approach?
What is a key advantage of linear models?
How does K-Nearest Neighbors (KNN) make predictions?
In the context of decision trees, what does 'pruning' refer to?
What defines the 'margin' in Support Vector Machines (SVMs)?
Which of the following is an advantage of Bayesian Networks?
What role does a non-linearity play in neural network models?
What is the primary purpose of the 'backward pass' in neural networks?
What key capability typically emerges when a neural network becomes 'deep' enough?
What is the primary goal of Bootstrap Aggregating (Bagging) in ensemble methods?
In the context of regularisation, what is the general effect of adding a term like $\lambda \cdot P(\beta)$ to the loss function?
What is the likely effect of a Lasso regularisation term?
What is the purpose of 'early stopping' as a form of regularisation in neural network training?
With reference to tree models, what does tree pruning do?
In the context of Vapnik-Chervonenkis (VC) dimension, what does a higher VC dimension generally indicate about a model?
What is the primary focus of supervised learning?
Which of the following statements best describes the concept of 'generalization' in machine learning?
What is the main purpose of 'loss functions' in supervised learning?
Which of the following is a pitfall of linear regression?
What is the role of cross-validation in machine learning?
What is the purpose of the Elastic-Net regularization technique?
Considering the bias-variance tradeoff, which regularization strength would most likely favor a lower bias in a model?
Which of the following tasks best describes the use of memory-based models?
Flashcards
Supervised learning
Finding a function that maps observations to targets.
Core Assumption
The assumption that the data used is representative of the real world.
IID
Data points are independent and identically distributed.
Training error
The model's error measured on the training data.
Testing error
The model's error measured on a held-out testing dataset.
Generalisation error
The model's error in the real world, on data unavailable during training and testing.
Empirical Risk Minimisation
Minimising "wrongness" on the available data, in the expectation that this leads to minimal "wrongness" in the real world.
Regression
Mapping observations to numerical values.
Classification
Mapping observations to discrete categories.
Linear Regression
Identifying the parameters of the linear model that best represent a specific dataset.
Residual
The difference between the actual and predicted value: $e_i = Y_i - \hat{Y}_i$.
Residual Sum of Squares (RSS)
The sum of the squared residuals: $RSS = e_1^2 + e_2^2 + \dots + e_n^2$.
Gradient-based Estimation
Iteratively updating the parameters using the derivative of the training error with respect to the parameters.
Learning Rate
The hyperparameter $\eta$ (step size) that controls the size of each gradient-descent update.
Multiple Linear Regression
Simple linear regression generalised to several independent variables/features.
Homoscedasticity
The assumption that the variance of the errors is (mostly) constant over X.
Logistic Regression
A linear model for binary classification whose logistic function squashes the output to a probability between 0 and 1.
Regularisation
Adding a penalty term $\lambda \cdot P(\beta)$ to the loss function to trade goodness of fit against model simplicity.
Bagging
Bootstrap Aggregating: reducing variance by training models on random data samples and aggregating their predictions.
Boosting
Building models sequentially, with each model focusing on the errors made by previous models.
Stacking
Combining the predictions of diverse base models.
Study Notes
- Lecture 5 focuses on supervised machine learning.
Supervised Learning as Generalization
- Supervised learning involves generalizing from a given dataset.
- The core assumption is that the provided data represents the real world accurately.
Supervised Learning as Function Approximation
- Supervised learning approximates a function that maps observations to targets.
- There exists a space of functions $F$ that accurately map observations to targets.
- The ideal function $f^*$ exists within $F$, mapping each observation to its correct target.
- The goal is to find a function $\hat{f}$ within a model space $F^m$ that best approximates the ideal function.
- All that is available is a dataset of observations where this function holds true.
Errors in Supervised Learning
- Training error measures the model's performance on training data.
- Testing error measures the model's performance on a testing dataset.
- Generalization error measures model performance in the real world, where data is unavailable during training and testing.
- Training a supervised machine learning (ML) model involves iteratively adjusting parameters to minimize training error, aiming to also reduce generalization error.
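To make the distinction concrete, here is a minimal sketch (not from the lecture; it uses synthetic data and NumPy's `polyfit` as the training procedure) that fits a line on a training split and evaluates it on a held-out testing split. The generalisation error itself cannot be computed directly and is estimated via the testing error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data standing in for "the real world"
X = rng.uniform(0, 10, size=200)
Y = 2 * X + 1 + rng.normal(0, 1, size=200)

# Hold out the last 25% of the points as a testing set
split = int(0.75 * len(X))
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

# Fit a simple linear model on the training data only
beta1, beta0 = np.polyfit(X_train, Y_train, deg=1)

def mse(X, Y):
    """Mean squared error of the fitted line on (X, Y)."""
    return np.mean((Y - (beta0 + beta1 * X)) ** 2)

print("training error:", mse(X_train, Y_train))  # error on seen data
print("testing error: ", mse(X_test, Y_test))    # estimate of the generalisation error
```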
Statistical Assumptions
- Supervised learning relies on the data being independent and identically distributed (IID).
- Independent data means observations don't influence each other; knowing one data point doesn't provide information about others.
- Identically distributed data means all observations are sampled from the same distribution without underlying trends.
- An assumption is that the data represents the real world and that we can estimate the generalisation error from the testing error.
Types of Supervised Learning
- Classification maps observations to discrete categories.
- Regression maps observations to numerical values.
- Probability density estimation maps observations to probability distributions over categories.
- Ranking maps observations to a linear order of discrete categories.
Minimizing Loss Functions
- Supervised learning involves minimizing a loss function.
- A loss function quantifies the difference between predicted and actual values, indicating how far off predictions are.
- Empirical Risk Minimization revolves around minimizing "wrongness" on available data to achieve minimal "wrongness" in the real world.
Empirical Risk Minimization
- Minimizing training error is the basis of the Empirical Risk Minimization principle.
- Empirical Risk Minimization serves as a foundational concept in machine learning.
Linear Models for Regression
- Linear regression identifies the parameters of the linear model that best represent a specific dataset.
- A linear model helps predict numerical values and understand associations between variables.
Least Squares Estimation
- Predict a value for the $i$-th data point using $\hat{Y}_i = \beta_0 + \beta_1 X_i$.
- The residual is the difference between the actual and predicted value: $e_i = Y_i - \hat{Y}_i$.
- The Residual Sum of Squares (RSS) is defined as $RSS = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$.
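A small numeric sketch of these definitions; the parameter values and data points below are made up for illustration.

```python
import numpy as np

# Hypothetical parameters and a tiny made-up dataset
beta0, beta1 = 1.0, 2.0
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.2, 4.9, 7.1, 8.8])

Y_hat = beta0 + beta1 * X      # predictions: Y_hat_i = beta0 + beta1 * X_i
residuals = Y - Y_hat          # e_i = Y_i - Y_hat_i
rss = np.sum(residuals ** 2)   # RSS = e_1^2 + e_2^2 + ... + e_n^2
print(rss)                     # -> 0.1 (up to floating-point rounding)
```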
Gradient-Based Estimation
- A function of the RSS, $J = f(RSS)$, is minimised as the loss function.
- Parameters ($\beta$) are updated over many iterations using the derivative of the training error with respect to the parameters:
- $\beta^{t+1} = \beta^t - \eta \cdot \nabla_{\beta^t} J(\beta^t)$
- Eta ($\eta$) is a preset parameter that controls the size of the jumps the updates make.
- The point where it stops is not guaranteed to be the best possible answer, just the best that could be found.
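A minimal sketch of this update loop for simple linear regression. Taking $J$ to be the mean squared residual (one possible choice of $f(RSS)$) is an assumption made here for concreteness.

```python
import numpy as np

def gradient_descent(X, Y, eta=0.01, steps=1000):
    """Gradient descent for simple linear regression, with J taken to be
    the mean squared residual (a monotone function of the RSS)."""
    beta0, beta1 = 0.0, 0.0                   # arbitrary starting point
    for _ in range(steps):
        residuals = Y - (beta0 + beta1 * X)   # e_i = Y_i - Y_hat_i
        grad0 = -2 * np.mean(residuals)       # dJ / d(beta0)
        grad1 = -2 * np.mean(residuals * X)   # dJ / d(beta1)
        beta0 -= eta * grad0                  # beta^{t+1} = beta^t - eta * gradient
        beta1 -= eta * grad1
    return beta0, beta1
```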
Learning Rate (LR)
- The fundamental equation of gradient descent is $\beta^{t+1} = \beta^t - \eta \cdot \nabla_{\beta^t} J(\beta^t)$.
- Eta ($\eta$) is known as the learning rate or step size.
Hyperparameters and Learning Rate
- The learning rate is a hyperparameter that affects the learning process.
- Three approaches to selecting hyperparameters:
- Use a reasonable hyperparameter value and keep it constant, assuming it doesn't significantly affect the outcome.
- Choose a hyperparameter value from the existing literature, assuming established best practices apply.
- Discover the most effective hyperparameter through experimentation on the specific dataset, which assumes optimal performance requires fine-tuning (a rough sketch follows this list).
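A toy version of the third approach: sweep a few candidate learning rates, fit on a training split, and keep the value with the lowest validation error. This reuses the `gradient_descent` sketch from the Gradient-Based Estimation section above; the candidate values and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=100)
Y = 3 * X + 2 + rng.normal(0, 0.5, size=100)
X_tr, Y_tr = X[:80], Y[:80]        # training split
X_val, Y_val = X[80:], Y[80:]      # validation split for hyperparameter selection

best_eta, best_err = None, np.inf
for eta in [0.001, 0.01, 0.05]:    # arbitrary candidate learning rates
    b0, b1 = gradient_descent(X_tr, Y_tr, eta=eta, steps=500)
    val_err = np.mean((Y_val - (b0 + b1 * X_val)) ** 2)
    if val_err < best_err:         # keep the best-performing candidate
        best_eta, best_err = eta, val_err
print("selected learning rate:", best_eta)
```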
Multiple Linear Regression (MLR)
- MLR generalises simple linear regression by incorporating several independent variables/features.
- MLRs can become too complex.
- Regularization helps control the complexity of MLRs.
Statistical Assumptions of Linear Regression
- Linear regression requires a linear relationship between variables X and Y.
- Errors should be independent and approximately normally distributed.
- There should also be an absence of multicollinearity between features.
- The variance of the errors should be (mostly) constant over X (homoscedasticity).
Linear Models for Classification
- Binary classification places observations into one of two classes, {0, 1}.
- The logistic model focuses on the probability $P(y = 1 \mid X)$.
- The logistic function 'squashes' the number between 0 and 1.
- Decision boundary: if $p > 0.5$, assign class 1, otherwise class 0.
- The logistic function is $p(y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
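A minimal sketch of the logistic function and the resulting decision rule; the parameter values are hypothetical.

```python
import numpy as np

def logistic(z):
    """'Squashes' any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -3.0, 1.5             # hypothetical fitted parameters

def predict(x):
    p = logistic(beta0 + beta1 * x)  # p(y = 1 | x)
    return 1 if p > 0.5 else 0       # decision boundary at p = 0.5

print(predict(1.0), predict(4.0))    # -> 0 1
```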
Training of Logistic Regression
- Training of logistic regression works similarly to linear regression.
- Training employs closed-form or gradient-based methods.
- Arbitrary binary coding: if $p > 0.5$, assign class 1, otherwise class 0.
Generalizing to N Classes
- An N-class classification task can be converted into N binary classification tasks.
- Assign each observation to the class whose classifier outputs the highest probability.
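A sketch of this one-vs-rest scheme, assuming the probabilities from the N binary classifiers have already been computed (the numbers below are made up).

```python
import numpy as np

# probs[i][k]: probability from the k-th binary classifier that
# observation i belongs to class k (here N = 3 classes, 2 observations)
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.4]])

# Assign each observation to the class with the highest probability
predicted_classes = np.argmax(probs, axis=1)
print(predicted_classes)   # -> [1 0]
```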
Non-Linear Models
- Linear models provide simplicity and interpretability.
- Other model types exist with fewer pre-set assumptions.
- Some examples of Non-Linear Models are: Memory-based, Tree-based, Maximum margin, Probabilistic and Neural network models.
Ensemble Methods
- Bagging (Bootstrap Aggregating) reduces variance by training models on random data samples (see the sketch after this list).
- Boosting builds models sequentially, with each model focusing on the errors made by previous models.
- Stacking combines predictions of diverse base models.
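A minimal bagging sketch for regression; using simple linear models as the base learners is an assumption for illustration (the lecture does not prescribe a base model).

```python
import numpy as np

def bagged_fit(X, Y, n_models=25, seed=0):
    """Fit one line per bootstrap resample; predict by averaging."""
    rng = np.random.default_rng(seed)
    n, fits = len(X), []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample n points with replacement
        fits.append(np.polyfit(X[idx], Y[idx], deg=1))

    def predict(x):
        # Averaging models trained on different resamples reduces variance
        return float(np.mean([b1 * x + b0 for b1, b0 in fits]))

    return predict
```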
Regularization & Generalization
- Training errors can be minimized by minimizing the loss function.
- It is not safe to assume that the training error matches the generalisation error.
- A training error of 0 can be achieved by adding complexity to the model.
Regularizing Models
- Regularization adds an additional term to the learning objective: $J(\beta, X, Y) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 \cdot X_i)^2 + \lambda \cdot P(\beta)$
- $P(\beta)$ is a function of the parameters of the model.
- Lambda ($\lambda$) defines how you want to trade off goodness of fit against model simplicity.
- Regularised Empirical Risk Minimization is called Structural Risk Minimisation.
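A sketch of this regularised loss for the simple linear case, showing the two common choices of $P(\beta)$ discussed below. Leaving the intercept $\beta_0$ out of the penalty is an assumption made here.

```python
import numpy as np

def regularised_loss(beta0, beta1, X, Y, lam, penalty="ridge"):
    """J(beta, X, Y) = RSS + lambda * P(beta)."""
    rss = np.sum((Y - beta0 - beta1 * X) ** 2)
    if penalty == "lasso":
        P = np.abs(beta1)   # L1: tends to push weak parameters exactly to 0
    else:
        P = beta1 ** 2      # L2 (ridge): shrinks parameters towards 0
    return rss + lam * P    # lambda trades goodness of fit against simplicity
```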
Lasso vs Ridge Regularization
- Lasso regression tends to push underperforming parameters to 0.
- Ridge regression tends to push underperforming parameters to lower values, but not exactly to 0.
- A hyperparameter (lambda) weighs the two sides of the optimisation problem.
- A high lambda says you are willing to worsen the fit if it makes the model simpler.
- A low lambda says you will accept a more complex model if it fits the data better.
Other Forms of Regularization
- Regularization is used in neural network models through drop-out regularization and early stopping (a generic early-stopping sketch follows this list).
- Regularization is also used in neural network models through mix-up/cut-mix/general noisy data augmentation.
- Tree models apply regularization through a number of forms of tree pruning.
- There are sampling-based regularization methods, such as cross-validation and bootstrapping.
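As one concrete example, here is a generic early-stopping loop. The `step_fn`/`val_error_fn` callback interface and the patience mechanism are illustrative assumptions, not the lecture's specification.

```python
import numpy as np

def train_with_early_stopping(step_fn, val_error_fn, patience=5, max_steps=1000):
    """Stop training once the validation error has not improved for
    `patience` consecutive steps (a sign that overfitting has begun)."""
    best_err, best_step, waited = np.inf, 0, 0
    for step in range(max_steps):
        step_fn()                 # one parameter update on the training data
        err = val_error_fn()      # error on held-out validation data
        if err < best_err:
            best_err, best_step, waited = err, step, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_step, best_err
```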
Vapnik-Chervonenkis Dimension
- The Vapnik-Chervonenkis (VC) dimension can be illustrated with simple geometric scenarios:
- Geometrically separating students with/without blonde hair.
- Using the "rule" of a circle, two students can easily be separated.
- The VC dimension of drawing a circle is likely to be 3: up to three students can be separated into blonde and non-blonde.
- An alternative example: geometrically separating using a line.
- A line can still separate any 3 students.
- With 4 students, a straight line cannot always separate them all perfectly, so the likely VC dimension is 3, or even 2.
Vapnik-Chervonenkis Theory
- VC theory is a mathematical theory of learnability.
- It is linked to the concept of PAC-learnability.
- It is claimed to be useful for deep neural networks; in practice, however, it is very hard to apply to such models.
- The VC dimension of a model is considered a theoretical bound on its complexity.
Description
Explore supervised machine learning, focusing on generalization and function approximation. Understand how models learn from data to map observations to targets. Learn about the different types of errors that can occur.