Questions and Answers
In the augmented linear regression model, what value does $x_0$ typically hold for all samples?
- A random number
- 1 (correct)
- The mean of all other features
- 0
What is another common term for the intercept term 'b' in a linear regression model?
- Slope
- Residual
- Variance
- Bias parameter (correct)
In the augmented linear regression model $y = w^T x$, solving the machine learning problem involves determining what?
- The predicted value y
- The feature vector x
- The bias parameter b
- The weight vector w (correct)
What does the augmented design matrix include to account for the bias in a linear regression model?
What is the purpose of finding the stationary point of a function?
What does $ŷ$ represent in the augmented linear regression model?
In the equation $ŷ = b + w_1x_1 + w_2x_2 + ... + w_nx_n$, what does 'b' represent?
In the augmented linear regression model, the weight vector w includes $w_0$. What does $w_0$ represent?
What does 'Tp' stand for in the context of model performance?
What does a confusion matrix primarily help to summarize?
Which formula correctly calculates accuracy?
What is the formula for calculating sensitivity (recall)?
What is the formula for calculating precision?
What type of task is typically associated with logistic regression?
What is the formula for calculating specificity?
What might a model with low capacity struggle to do?
What is the hypothesis space in machine learning?
What is a binary classifier?
What does a high bias typically indicate?
In the context of machine learning models, a 'high gap' is most likely referring to which of the following?
What is the primary goal of gradient descent?
What is the effect of high capacity on a model?
What is the negative class typically labeled as in a binary classifier?
What type of classifier is logistic regression when distinguishing between two classes?
In logistic regression, if $P(x \in Class1) = 0.3$, what is $P(x \in Class0)$?
What is the range of the probability output by a logistic regression model?
What type of machine learning algorithm is logistic regression?
What mathematical tool is used to find the optimal value of w that minimizes the Mean Squared Error (MSE) in linear regression?
In the context of logistic regression, what does the sigmoid function do?
For logistic regression with one feature, what is the formula for t?
In single-variable calculus, what is the first step in finding the extrema of a function f(x)?
If $f''(x) \geq 0$ on the real numbers, what does this indicate about the function $f(x)$?
In logistic regression, what is the purpose of finding the 'best distribution'?
What does 'argmin' represent in the equation $\mathbf{w}_{min} = \text{argmin}_{\mathbf{w}} \, MSE_{train}(\mathbf{w})$?
What is the formula for the sigmoid function $\sigma(t)$?
In the context of linear regression, what does $MSE_{train}(\mathbf{w})$ represent?
Which of the following is the formula for $MSE_{train}(\mathbf{w})$?
In the equation $MSE_{train}(\mathbf{w}) = \frac{1}{N} ||\mathbf{Xw} - \mathbf{y}_{train}||^2$, what does $\mathbf{X}$ represent?
What is the significance of finding where the gradient of $||\hat{\mathbf{y}}_{train} - \mathbf{y}_{train}||^2$ with respect to $\mathbf{w}$ equals zero?
What is the primary purpose of a test set in machine learning?
Which type of machine learning algorithm uses labeled data for training?
What kind of data is typically used in unsupervised learning?
In the context of machine learning datasets, what is a 'feature'?
What is a design matrix commonly used for?
In a design matrix, what does each row typically represent?
What does a label or target provide in supervised learning?
In the Iris dataset example, what do the features $X_{i,1}$ and $X_{i,2}$ represent?
Flashcards
Training Set
Data used to train a machine learning model.
Test Set
Data used to evaluate the performance of a trained machine learning model.
Unsupervised Learning
Learning from unlabeled data; the algorithm discovers patterns in the data on its own.
Supervised Learning
Learning from labeled data: each training example is associated with a label or target.
Design Matrix
A matrix describing a dataset, typically with one sample per row and one feature per column.
Sample
A single example in a dataset, represented as a collection of features.
Feature
A quantitatively measured property of an object or event; one entry of the example vector x.
Label/Target
The value a supervised learning model should predict for a given example.
Augmented Linear Regression Model
Linear regression written as ŷ = wᵀx after adding a constant feature x₀ = 1, so the bias is absorbed into the weight vector.
Augmented Feature Vector (𝒙)
x = (x₀, x₁, ..., xₙ) with x₀ = 1 for every sample.
Augmented Weight Vector (𝒘)
w = (w₀, w₁, ..., wₙ) with w₀ = b, the bias parameter.
Bias Parameter (b)
The intercept term of the model; the output is biased toward b in the absence of input.
Linear Regression Equation
ŷ = b + w₁x₁ + w₂x₂ + ... + wₙxₙ (equivalently ŷ = wᵀx in augmented form).
Solving Linear Regression
Determining the weight vector w that minimises the training error.
Stationary Point
A point x₀ where the derivative (or gradient) vanishes: f′(x₀) = 0.
Local Minimum
A stationary point whose function value is no larger than at all nearby points.
Minimizing MSE with Vector Calculus
Setting the gradient of MSEtrain with respect to w to zero and solving for w.
Critical/Stationary Points
The zeroes of f′(x); candidates for maxima, minima, or saddle points.
Global Maximum Condition
If f″(x) ≤ 0 on ℝ, a stationary point x₀ gives the global maximum.
Global Minimum Condition
If f″(x) ≥ 0 on ℝ, a stationary point x₀ gives the global minimum.
argmin MSEtrain(w)
The value of w that minimises MSEtrain(w), written wmin = argminw MSEtrain(w).
Role of Vector Calculus
Gives the exact value of w that minimises MSEtrain (the closed-form solution), instead of searching randomly.
True Positive (Tp)
A positive example correctly predicted as positive.
True Negative (Tn)
A negative example correctly predicted as negative.
False Positive (Fp)
A negative example incorrectly predicted as positive.
False Negative (Fn)
A positive example incorrectly predicted as negative.
Confusion Matrix
A table summarising a classifier's predictions (Tp, Tn, Fp, Fn) against the true classes.
Accuracy
(Tp + Tn) / (Tp + Tn + Fp + Fn).
Sensitivity (Recall)
Tp / (Tp + Fn).
Specificity
Tn / (Tn + Fp).
Model Capacity
A model's ability to fit a wide variety of functions.
Hypothesis Space
The set of functions the learning algorithm is allowed to select as the solution.
Underfitting vs. Overfitting
Underfitting: the training error is too high. Overfitting: the gap between training and test error is too large.
Bias vs. Variance
Bias relates to the error on the training data; variance relates to the gap between training and test performance.
High Bias vs. High Variance
High bias: high training error (underfitting). High variance: large gap between training and test error (overfitting).
Gradient Descent
An iterative algorithm that minimises a function by repeatedly stepping in the direction opposite its gradient.
Binary Classifier
A classifier that distinguishes between two classes, typically labelled 1 (positive) and 0 (negative).
Positive Class vs. Negative Class
The two classes of a binary classifier, conventionally labelled 1 and 0 respectively.
Logistic Regression Goal
Estimate the probability that an input belongs to the positive class, then classify accordingly.
Performance Measure Goal
Quantitatively evaluate how well the algorithm performs its task, in particular on previously unseen data.
Sigmoid Curve
The S-shaped curve of the logistic function, mapping any real value into the interval (0, 1).
Logistic Sigmoid Function
σ(t) = 1 / (1 + e⁻ᵗ).
t = w^T * x
The linear combination of weights and (augmented) features that is passed through the sigmoid in logistic regression.
Find the best distribution
In logistic regression, fit the weights w so that the predicted class probabilities best match the training labels.
Study Notes
- The session aims to teach about Machine Learning problems, linear and logistic regression, design matrix creation, and Gradient Descent.
Learning Algorithms
- A machine learning algorithm is able to learn from data.
- To learn, a computer program needs experience (E) for a class of tasks (T) and performance measure (P).
- If performance at tasks in T, measured by P, improves with experience E, then learning has occurred.
- T, P, and E must be defined for every machine learning algorithm.
Task "T"
- The process of learning is not the task itself.
- Machine learning tasks are described by how the system processes an example.
- An example is quantitatively measured features from an object or event.
- An example is represented as a vector x ∈ Rn, where each entry xi is a feature.
- Pixel values in an image are its features.
Common Machine Learning Tasks include
- Classification which specifies the category an input belongs to.
- Object recognition such as pedestrians, cars, buses is an example.
- Regression which predicts a numerical value from a given input.
- Predicting the claim amount an insured person will make is an example.
- Transcription which transcribes unstructured data into discrete, textual form.
- Optical character and speech recognition are examples.
- Machine translation.
- Synthesis and sampling which generates new examples similar to training data.
- Automatically generating textures for video games is an example.
- Imputation of missing values, where the algorithm is given a new example x ∈ ℝⁿ with some entries xᵢ missing and must predict their values.
- Denoising.
Performance Measure "P"
- A quantitative measure must be designed to evaluate a machine learning algorithm's abilities.
- Performance measure P is specific to the task T being carried out.
- Accuracy measures the proportion of examples for which the model produces the correct output (used for classification and transcription); equivalently, performance can be reported as an error rate.
- Algorithms should perform well on data they have not seen before.
- A training set is used to train the machine learning system.
- A test set measures the performance of the machine learning system.
Experience "E"
- Machine learning algorithms can be broadly categorized as unsupervised or supervised.
- Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.
- The data is unlabelled and used in clustering methods.
- There is no instructor or guide, the algorithm must make sense of the data.
- Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.
- Classifying iris plants into three different species based on their measurements is an example.
- A teacher shows the machine learning system what to do.
Datasets
- Most machine learning algorithms experience a dataset.
- A dataset is a collection of examples, which are collections of features.
- A common way of describing a dataset is with a design matrix.
- The samples go per row.
- The features go per column.
- Datasets can also use the opposite convention, i.e. one feature per row and one sample per column.
Iris Dataset
- Contains 150 samples and 4 features.
- The design matrix is X ∈ ℝ¹⁵⁰ˣ⁴, where Xᵢ,₁ is the sepal length and Xᵢ,₂ is the sepal width of plant (sample) i, etc.
- Datasets can be described with a set containing m elements {x(1), x(2),...,x(m)}
Supervised Learning Datasets
- The example in supervised learning contains a label/target and a collection of features.
- Object recognition from photographs needs specification of the appearing object in each photo.
- Numeric code can be used whereby 0 means person, 1 means car and 2 means cat etc.
- Given feature observations X, a vector of labels y is also given, with yᵢ providing the label for example i.
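A minimal sketch of these ideas in Python (assuming NumPy and scikit-learn are available; not part of the original notes): the Iris data is loaded as a design matrix X with one sample per row and one feature per column, together with the label vector y used in supervised learning.

```python
# Iris data arranged as a design matrix X plus a label vector y.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # X: design matrix, y: labels

print(X.shape)   # (150, 4) -> 150 samples (rows), 4 features (columns)
print(y.shape)   # (150,)   -> one label per sample (species encoded as 0, 1, 2)
print(X[0, 0])   # sepal length of the first sample (X_{1,1} in the notes)
print(X[0, 1])   # sepal width of the first sample  (X_{1,2} in the notes)
```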
Linear Regression
- The task of linear regression is determining the value of the weights w to predict the value of the output scalar y ∈ R given the input vector x ∈ Rn.
- Output y is assumed to be a linear function of input x.
- Let ŷ be the value that the model predicts given x; the true value is written y.
- ŷ = wᵀx, where w ∈ ℝⁿ is a vector of parameters or weights to be determined.
- ŷ = w₁x₁ + w₂x₂ + ··· + wₙxₙ in its expanded form.
- ŷ = 0 when x = 0, which is a strong assumption.
Augmented LR Model
- Linear regression often refers to a more general model with an intercept term b.
- ŷ = wᵀx + b, where b ∈ ℝ.
- With the intercept term, the model is an affine function of x, so the line of predictions no longer has to pass through the origin.
- Instead of keeping the bias parameter b separate, the model can augment x with a new feature that is always set to 1 for every sample.
- The intercept term b is the bias parameter for the linear regression model.
- This terminology arises because, in the absence of any input, the output is biased toward the value b.
- With the augmented linear model, ŷ = wᵀx.
- The feature vector is x = (x₀, ..., xₙ) with x₀ = 1 for all samples.
- The weight vector is w = (w₀, ..., wₙ) with w₀ = b (the bias parameter).
- ŷ = w₀x₀ + w₁x₁ + w₂x₂ + … + wₙxₙ, or equivalently
- ŷ = b + w₁x₁ + w₂x₂ + … + wₙxₙ.
- Solving the machine learning problem determines the weight vector w.
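A small NumPy illustration of the augmentation step (our sketch, with made-up numbers): a column of ones is prepended to the design matrix so the bias parameter is carried inside w, and predictions are a single matrix–vector product.

```python
# Augment the design matrix with a constant feature x0 = 1 so the bias b
# becomes the weight w0; predictions are then y_hat = X_aug @ w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # 5 samples, 3 features
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the x0 = 1 column

w = np.array([0.5, 1.0, -2.0, 3.0])                # w0 = b = 0.5, then w1..w3
y_hat = X_aug @ w                                  # y_hat_i = b + w1*x_i1 + w2*x_i2 + w3*x_i3
print(y_hat)
```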
Recall: Finding the Minimum of a Function
- To find the minimum of a function, there are several options:
- Find the unique stationary point x₀ of the function, i.e. the point satisfying f′(x₀) = 0.
- Alternatively, find all stationary points (local maxima, minima, and saddle points),
- then determine their nature.
Recall: Calculating Norms
- For a vector x = (x₁, x₂, ..., xₙ) ∈ ℝⁿ, the 2-norm of x is given by
- ||x||₂ = √(Σᵢ₌₁ⁿ xᵢ²) = √(x₁² + x₂² + ... + xₙ²), and is a non-negative real number.
- One result used later is that xᵀx = (x₁, x₂, ..., xₙ)(x₁, x₂, ..., xₙ)ᵀ = x₁² + x₂² + ... + xₙ²
- i.e. xᵀx = ||x||₂².
Notation: Euclidean Distance
- For two points in ℝⁿ, say A = (a₁, a₂, ..., aₙ)ᵀ and B = (b₁, b₂, ..., bₙ)ᵀ, the Euclidean distance between them is given by
- d(A, B) = ||AB||₂, where AB is the vector with origin A and head B.
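A quick numerical check of these identities (NumPy assumed; not from the original notes):

```python
# Verify that ||x||_2^2 equals x^T x, and that d(A, B) = ||B - A||_2.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
print(np.linalg.norm(x, 2) ** 2)   # 14.0
print(x @ x)                       # 14.0  (x^T x)

A = np.array([0.0, 0.0, 0.0])
B = np.array([1.0, 2.0, 2.0])
print(np.linalg.norm(B - A, 2))    # 3.0  Euclidean distance between A and B
```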
Training the Linear Regression Model
- The mean squared error (MSE) of the model is computed to measure its performance on a test set.
- Since training is done on the training data, what must be minimised is:
- εᵢ = ŷ(train)ᵢ − y(train)ᵢ for every sample i, where εᵢ is the residual (error) between the predicted output ŷ(train)ᵢ and its true value y(train)ᵢ.
- All the εᵢ must be minimised jointly, but each εᵢ can be either positive or negative, so the squared errors are averaged:
- MSEtrain(w) = (1/N) Σᵢ₌₁ᴺ (ŷ(train)ᵢ − y(train)ᵢ)²
- N = number of samples in the training set.
What is X(train)·w equal to?
- X(train)·w is the vector of model predictions ŷ(train): the predicted labels for all samples are obtained from the design matrix and the weight vector.
- The squared-error loss is therefore expressed in terms of X(train) and y(train), i.e. the training data, and the weight vector w is adjusted to minimise the mean squared error.
Random Search
- One method to find the minimum is to let w take many different (random) values and evaluate MSEtrain(w) for each.
- The smaller MSEtrain(w), the better; the challenge is that random search gives no guarantee of finding the best possible fit (see the sketch below).
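A toy illustration of random search on synthetic data (our sketch; NumPy assumed, and names like mse_train are ours, not from the notes):

```python
# Random search: try many random weight vectors and keep the one with the
# smallest MSE_train. Simple, but with no guarantee of finding the best fit.
import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 3
X_train = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n - 1))])  # augmented design matrix
w_true = np.array([1.0, 2.0, -3.0])
y_train = X_train @ w_true + 0.1 * rng.normal(size=N)

def mse_train(w):
    return np.mean((X_train @ w - y_train) ** 2)

best_w, best_mse = None, np.inf
for _ in range(10_000):
    w = rng.uniform(-5, 5, size=n)
    if (m := mse_train(w)) < best_mse:
        best_w, best_mse = w, m

print(best_w, best_mse)   # close-ish to w_true, but not exact
```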
Linear Regression Models
- Vector calculus can determine the exact value of w that minimises MSEtrain, as opposed to a random search.
- For one variable, let f be a function of the variable x.
- Find the zeroes of f′(x) (also known as critical/stationary points).
- Suppose x₀ is the only value such that f′(x₀) = 0.
- If f″(x) ≤ 0 on ℝ, then f(x₀) is the global maximum, attained at x₀.
- If f″(x) ≥ 0 on ℝ, then f(x₀) is the global minimum, attained at x₀.
- f″(x) may not be always negative or always positive, but local extrema can still be found.
- If f″(x₀) < 0, then f has a local maximum at x₀.
- If f″(x₀) > 0, then f has a local minimum at x₀.
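A one-line worked example of this recipe (our illustration, not from the notes):

```latex
f(x) = (x-3)^2,\qquad f'(x) = 2(x-3) = 0 \;\Rightarrow\; x_0 = 3,\qquad
f''(x) = 2 \ge 0 \text{ on } \mathbb{R} \;\Rightarrow\; f(3) = 0 \text{ is the global minimum.}
```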
Training Linear Regression using Vector Calculus
- MSEtrain(w) is a function of n + 1 variables (the entries of the augmented weight vector w), but the same approach applies.
- MSEtrain(w) = (1/N)·||ŷ(train) − y(train)||² = (1/N)·||X(train)·w − y(train)||²
- The goal is written mathematically as
- wmin = argminw MSEtrain(w), i.e. find the argument w that minimises MSEtrain(w).
- Whether N plays a role in minimising MSEtrain(w) is not obvious from this formulation; it does not, since the positive constant factor 1/N does not change the argmin.
Closed Form Solution
- Vector Calculus determines w's exact value that minimises ||ŷ(train) – y(train) ||².
- Compute the gradient of ||ŷ(train) – y(train) ||² with respect to w.
- ||ŷ(train) − y(train)||² = (ŷ(train) − y(train))ᵀ·(ŷ(train) − y(train))
- Expanding yields ||ŷ(train) − y(train)||² = ŷ(train)ᵀ·ŷ(train) − 2·ŷ(train)ᵀ·y(train) + y(train)ᵀ·y(train)
- As ŷ(train) = X(train)·w and (A·B)ᵀ = Bᵀ·Aᵀ, this gives:
- ||ŷ(train) − y(train)||² = wᵀX(train)ᵀX(train)w − 2wᵀX(train)ᵀy(train) + y(train)ᵀy(train)
- Using the gradient properties ∇w(wᵀAw) = (A + Aᵀ)w and ∇w(wᵀa) = a, the gradient can be calculated:
- ∇w||ŷ(train) − y(train)||² = 2X(train)ᵀX(train)w − 2X(train)ᵀy(train)
- Setting ∇w||ŷ(train) − y(train)||² = 0 and solving for w yields:
- wmin = (X(train)ᵀX(train))⁻¹ X(train)ᵀ y(train)
- The above is the only solution provided that X(train)ᵀX(train) is invertible.
- The above equality is known as the normal equations and gives the analytical solution of the linear regression problem
- For w = wmin, it can be proven that MSEtrain attains its global minimum.
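A minimal sketch of the normal equations on synthetic data (our example; NumPy assumed):

```python
# Solve the normal equations  w_min = (X^T X)^{-1} X^T y.
# In practice np.linalg.solve or np.linalg.lstsq is preferred over an explicit inverse.
import numpy as np

rng = np.random.default_rng(1)
N, n = 200, 4
X_train = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n - 1))])
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y_train = X_train @ w_true + 0.05 * rng.normal(size=N)

# Normal equations (valid when X^T X is invertible):
w_min = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Equivalent, numerically preferred least-squares routine:
w_lstsq, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

print(w_min)     # close to w_true
print(w_lstsq)   # same solution
```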
Evaluating
- After training, the model is evaluated on how it performs on a separate test set with design matrix X(test) ∈ ℝ^(M×n).
- The test set also provides the true regression targets y(test) ∈ ℝ^M.
- MSEtest is computed from the model's predictions on the test set, ŷ(test) ∈ ℝ^M.
- MSEtest = (1/M) Σᵢ₌₁^M (ŷ(test)ᵢ − y(test)ᵢ)², where a lower MSEtest value indicates the model generalises better and performs well on previously unseen inputs.
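A tiny helper illustrating test-set evaluation (our sketch; the array names in the usage comment are hypothetical):

```python
# Evaluate a trained weight vector w on held-out data.
import numpy as np

def mse(X, y, w):
    """Mean squared error of predictions X @ w against targets y."""
    return np.mean((X @ w - y) ** 2)

# Usage (hypothetical arrays with the shapes from the notes):
#   mse_test = mse(X_test, y_test, w_min)   # X_test: (M, n), y_test: (M,)
# A small mse_test relative to mse_train suggests the model generalises well.
```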
Convexity
- Studying the convexity of a function means determining whether it is convex, concave, or neither.
- A real-valued function is called convex (respectively concave) if the line segment between any two points on its graph lies above (respectively below) the graph between those points.
- A function that is not convex is not necessarily concave; most functions are neither.
- Examples of convex functions: f(x) = x² and f(x) = eˣ.
- Examples of concave functions: f(x) = √x and f(x) = ln(x).
- Examples of functions that are neither: f(x) = x³ and f(x) = cos(x).
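These examples can be verified with the second derivative (a quick check of ours, not from the original notes):

```latex
f(x)=x^2:\; f''(x)=2>0 \;(\text{convex});\qquad
f(x)=e^x:\; f''(x)=e^x>0 \;(\text{convex});\\
f(x)=\ln x:\; f''(x)=-\tfrac{1}{x^2}<0 \;(\text{concave});\qquad
f(x)=x^3:\; f''(x)=6x \text{ changes sign } (\text{neither}).
```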
- MSEtrain is a convex function, but not all ML optimisation problems are convex (or concave).
- The loss functions of deep neural networks are typically neither convex nor concave.
- Convexity results therefore do not carry over to all ML problems.
MSEtrain Convex function vs all ML optimisation problems
- Convexity makes the minimisation problem simpler from the point of view of optimisation and learning.
- It allows us to conclude whether a local minimum (maximum) is a global minimum (maximum).
- For a convex function, a local minimum is a global minimum; for a concave function, a local maximum is a global maximum.
- Gradient descent on MSEtrain finds a point where the gradient vanishes; in general one would still need to show that such a point is a local minimum, which is not easy.
- Because MSEtrain is convex, this difficulty disappears: any point where the gradient vanishes is the global minimum (see the sketch below).
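A minimal gradient-descent sketch on MSEtrain (our example; NumPy assumed, and the learning rate and iteration count are arbitrary choices, not from the notes):

```python
# Gradient descent on MSE_train; because MSE_train is convex in w,
# the point where the gradient vanishes is the global minimum.
import numpy as np

rng = np.random.default_rng(2)
N, n = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=N)

w = np.zeros(n)    # initial guess
lr = 0.05          # learning rate (step size) -- an assumption, not from the notes
for _ in range(2000):
    grad = (2.0 / N) * X.T @ (X @ w - y)   # gradient of (1/N)||Xw - y||^2
    w -= lr * grad                         # step opposite to the gradient

print(w)           # approaches the normal-equations solution
```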
Linear Regression
- Central Challenge in Machine Learning
- In machine learning you work with data, and typically with only one training dataset.
- Training finds the weights by minimising MSEtrain on that training dataset.
- Can a small MSEtrain guarantee a small MSEtest?
- MSEtest depends on the statistical properties of the training data,
- and on how well the data used captures the phenomenon of interest.
- Numerical approximation of the solution:
- the minimum found may only be a local minimum,
- and which local minimum is found can depend on the algorithm and its implementation.
Training and Test Errors
- The ability to perform well on previously unobserved inputs is called generalisation.
- As the model is trained (through numerical methods), two errors are tracked: the training error (which training minimises iteratively) and the test error.
- It is key not to confuse the training error with the generalisation (test) error.
- Two factors matter: keeping the training error small, and keeping the gap between training and test error small.
Capacity, Overfitting and Underfitting
- Underfitting occurs when the model is not able to reach a sufficiently low error value on the training set.
- Capacity measures a model's ability to fit a wide variety of functions; underfitting and overfitting can be controlled by altering capacity.
- Models with low capacity may struggle to fit the training set, while models with high capacity may memorise properties of the training set that do not serve them well on the test set.
- Capacity can be controlled by choosing the set of functions the learning algorithm is allowed to select as its solution.
Hypothesis space
- The hypothesis space is the set of functions the learning algorithm is allowed to select as the solution.
- For linear regression, this is the set of all linear (affine) functions of the input.
- Linear regression can be generalised by including polynomials of the input: instead of ŷ = b + w₁x, one can fit ŷ = b + w₁x + w₂x², still using the same linear machinery.
- The resulting model's hypothesis space is larger than before, so it can represent a wider family of functions (see the sketch below).
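A short sketch of enlarging the hypothesis space with a squared feature (our example on synthetic data; NumPy assumed):

```python
# Fit y_hat = b + w1*x (affine) and y_hat = b + w1*x + w2*x^2 (larger hypothesis
# space) with the same least-squares machinery; the model stays linear in w.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 0.5 * x + 2.0 * x**2 + 0.1 * rng.normal(size=100)   # quadratic ground truth

X_lin  = np.column_stack([np.ones_like(x), x])          # hypothesis space: affine functions
X_poly = np.column_stack([np.ones_like(x), x, x**2])    # larger hypothesis space

w_lin,  *_ = np.linalg.lstsq(X_lin,  y, rcond=None)
w_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

print(np.mean((X_lin  @ w_lin  - y) ** 2))   # higher training MSE (underfits the curve)
print(np.mean((X_poly @ w_poly - y) ** 2))   # lower training MSE
```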
Bias/Variance
- Empirically, a trained model can show different combinations of training and testing error.
- High bias: the training error itself is high (underfitting).
- High variance: the training error is small but the testing error is high, i.e. the gap between them is large (overfitting).
- The acceptable regime is a low training error together with a small gap between training and testing error.