Questions and Answers
In the context of least squares, what does the expression arg min represent?
- The minimum value of the squared differences between predicted and actual values.
- The argument (w) that minimizes the sum of squared differences between predicted and actual values. (correct)
- The derivative of the squared differences with respect to w.
- The average of the squared differences between predicted and actual values.
Given a matrix X and a vector y, which of the following expressions represents the cost function being minimized in ordinary least squares?
- $(Xw - y)^T (Xw - y)$ (correct)
- $(w - X^T y)^T (w - X^T y)$
- $(X^T w - y)^T (X^T w - y)$
- $(Xw + y)^T (Xw + y)$
What condition is achieved when the gradient of the loss function, $\nabla_w E(w)$, is set to zero?
- The loss function's rate of change is at its greatest.
- A saddle point is achieved in the loss function.
- The loss function is minimized or maximized, indicating a stationary point. (correct)
- The loss function reaches its maximum value.
In the normal equation $w^* = (X^T X)^{-1} X^T y$, what does the term $(X^T X)^{-1} X^T$ represent?
For what type of matrix X does the Moore-Penrose pseudo-inverse $X^{\dagger}$ simplify to the regular inverse $X^{-1}$?
What does a positive semi-definite Hessian matrix indicate about the loss function $E(w)$?
In linear regression, the relationship between the independent variable x and the dependent variable y is assumed to be linear. What should be considered if this assumption is not valid?
Suppose you have a dataset where the relationship between the features and the target variable is non-linear. Which of the following is a suitable approach to model this relationship effectively?
In the context of linear regression, what does the Moore-Penrose pseudoinverse, denoted as $X^{\dagger}$, primarily help to achieve?
In a probabilistic linear regression model, what is the typical assumption regarding the noise component $\epsilon_i$?
In the equation $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \beta^{-1})$, what does $\epsilon_i$ represent?
How does the inclusion of the bias term, $w_0$, affect a linear regression model?
What does the notation $\beta = \frac{1}{\sigma^2}$ represent in the context of probabilistic linear regression?
Given $y_i \sim N(f_w(x_i), \beta^{-1})$, how should this be interpreted?
In the context of housing price prediction using linear regression, if $x_{new}$ represents the area of a new house, what does $f(x_{new})$ represent?
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ of house areas $x_i$ and corresponding prices $y_i$, what is the purpose of finding a mapping function f(·)?
What is the significance of assuming that samples are drawn independently when calculating the likelihood of the entire dataset?
What is the primary role of the training data, denoted as 'D', in the context of linear regression?
In maximum likelihood estimation, what are we trying to maximize with respect to?
In the linear model $f_w(x_i) = w_0 + w_1x_{i1} + w_2x_{i2} + ...$, how do the weights $w_1, w_2, ...$ influence the model's predictions?
Given the likelihood function $p(y | X, w, \beta) = \prod_{i=1}^{N} p(y_i | f_w(x_i), \beta)$, what does each term $p(y_i | f_w(x_i), \beta)$ represent?
How does the probabilistic formulation of linear regression differ from the standard linear regression approach?
Consider a linear regression model used to predict housing prices. Which of the following scenarios would most likely require the use of the Moore-Penrose pseudoinverse?
What is the purpose of maximizing the likelihood function in the context of probabilistic linear regression?
In Bayesian linear regression, what role does the prior distribution $p(w | \cdot)$ play?
What is the purpose of the 'normalizing constant' in the context of calculating the posterior distribution $p(w | X, y, \beta, \cdot)$?
How does using the posterior distribution, instead of just MLE, address the problem of overfitting, especially when dealing with limited training data?
In the equation for the posterior distribution $p(w | X, y, \beta, \cdot) = \frac{p(y | X, w, \beta) \cdot p(w | \cdot)}{p(y | X, \beta, \cdot)}$, what does $p(y | X, w, \beta)$ represent?
Suppose you are building a Bayesian linear regression model. You have a strong belief that the weights should be close to zero. How would you incorporate this belief into your model?
How does treating the precision $\beta = \frac{1}{\sigma^2}$ (inverse of the error variance) as a known parameter simplify the calculations in Bayesian linear regression?
Which of the following is the most accurate analogy for the relationship between the prior, likelihood, and posterior in Bayesian inference?
In the coin flip analogy, what corresponds to the 'train data' in the regression context?
When does the posterior distribution equal the prior distribution?
What does the notation $y \sim N(f_w(x), \beta^{-1})$ represent in the context of linear regression?
In the context of predicting new data points using MLE and MAP, what role do the model parameters w play?
What does $p(\hat{y}_{new} \mid x_{new}, w_{ML}, \beta_{ML})$ represent in the context of prediction?
In the equation $p(\hat{y}_{new} \mid x_{new}, w_{MAP}, \beta) = N(\hat{y}_{new} \mid w_{MAP}^T \phi(x_{new}), \beta^{-1})$, what does $w_{MAP}$ represent?
What is the significance of using the full posterior distribution $p(w | D)$ in prediction?
What is assumed to be known a priori for simplified calculations in the context of linear regression?
Even when we assume an isotropic prior $p(w)$, what is a characteristic of the posterior covariance?
In the context of linear regression, what is the primary purpose of using a design matrix (denoted as $\Phi$)?
How does the least squares loss function using the design matrix, $E_{LS}(w) = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$, relate to the original feature matrix $X$?
Given the optimal weights $w^* = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$, what does $\Phi^{\dagger}$ represent?
What is the significance of comparing $w^* = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$ to $w^* = (X^T X)^{-1} X^T y = X^{\dagger} y$?
In the context of polynomial regression, what does increasing the degree M of the polynomial generally achieve?
If you observe that a linear regression model (M=1) underfits the data, what adjustment to the polynomial degree M would likely improve the model's fit?
How does the choice of the polynomial degree M relate to the bias-variance tradeoff in machine learning?
In polynomial regression, if a model with a high degree M perfectly fits the training data but performs poorly on new, unseen data, what is this phenomenon called?
Flashcards
Scalar (x)
A scalar is a single numerical value, represented by a lowercase, non-bold symbol.
Vector (x)
A vector is an ordered array of numbers, represented by a lowercase, bold symbol.
Matrix (X)
A matrix is a rectangular array of numbers, represented by an uppercase, bold symbol.
f(x)
Target vector (y)
Target (yi)
Bias term (w0)
Regression Goal
Probabilistic Linear Regression
Linear Regression Model
Noise Distribution
Precision (β)
Distribution of Targets (yi)
Function Representation
Maximum Likelihood
Likelihood of Dataset
Optimal 'w' in Least Squares
Data Matrix (X)
Minimizing Loss E(w)
Gradient of E(w) (∇w E(w))
Finding the Minimizer 'w'
Normal Equation
Moore-Penrose Pseudo-Inverse (X†)
Nonlinear Dependency
Design Matrix (Φ)
Optimal Weights (ŵ)
ŵ = (ΦᵀΦ)⁻¹Φᵀy
ŵ = (XᵀX)⁻¹Xᵀy
Polynomial Degree (M)
M = 0 (Polynomial Degree)
M = 1 (Polynomial Degree)
Choose M
Likelihood
Prior
Posterior Distribution
Posterior Formula
Normalizing Constant
Overfitting
Solution to Overfitting
Posterior with no data
Non-diagonal posterior covariance
Gaussian conjugate prior
Prediction
Predictive distribution
MLE predictive distribution
MAP predictive distribution
Posterior predictive distribution
Study Notes
- The lecture focuses on Machine Learning, specifically Lecture 4 on Linear Regression, presented by Dr. Leo Schwinn at the Technical University of Munich during the Winter term 2024/2025.
Notation
- Scalars are represented in lowercase and not bold
- Vectors are in lowercase and bold
- Matrices are in uppercase and in bold
- A predicted value for inputs 'x' is denoted as f(x)
- Vector of targets is represented as 'y'
- The target of the i'th example is 'yi'
- 'w0' represents the bias term, not to be confused with the bias of an estimator (as in the bias-variance tradeoff)
- Basis function is represented as φ(·)
- E() represents the error function
- Training data is denoted as 'D'
- The Moore-Penrose pseudoinverse of X is represented as X†
- There is no special symbol for vectors or matrices augmented by the bias term w0; it is assumed that the bias term is always included.
Basic Linear Regression
- A dataset $D = \{(x_i, y_i)\}_{i=1}^N$ includes house areas $x_i$ and their corresponding prices $y_i$
Regression Problem
- Given observations represented as X = {x1, x2, ..., xN}, where each xi is a real vector in RD
- Given targets as y = {y1, y2, ..., yN}, where each yi is a real number
- The problem is to find a mapping f(·) from inputs to targets such that yi ≈ f(xi)
- A common way to represent the samples is as a data matrix X ∈ RN×D, where each row represents one sample.
Linear Model
- Target y is generated by a deterministic function f of x plus noise, where yi = f(xi) + εi
- εi follows a normal distribution N(0, β-1)
- A linear function is chosen for f(x), such that fw(xi) = w0 + w1xi1 + w2xi2 + ... + wDxiD = w0 + wTxi
Absorbing the Bias Term
- The linear function is given by fw(x) = w0 + w1x1 + w2x2 + ... + wDxD = w0 + wTx
- The "bias" or "offset" term is w0
- For simplicity, w0 can be "absorbed" by prepending a 1 to the feature vector 'x' and adding w0 to the weight vector 'w
- A new vector $\tilde{x} = (1, x_1, \dots, x_D)^T$ and a new weight vector $\tilde{w} = (w_0, w_1, \dots, w_D)^T$ are created
- The function can then be written as $f_{\tilde{w}}(\tilde{x}) = \tilde{w}^T \tilde{x}$
- It is assumed the bias term is always absorbed, simplifying notation by writing $w$ and $x$ instead of $\tilde{w}$ and $\tilde{x}$
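As a small illustration (a sketch with made-up numbers, not taken from the lecture), absorbing the bias in code amounts to prepending a column of ones to the data matrix:

```python
import numpy as np

# Toy data: N = 4 samples with D = 2 features (illustrative values only).
X = np.array([[50.0, 2.0],
              [80.0, 3.0],
              [120.0, 4.0],
              [65.0, 2.0]])

# Absorb the bias: prepend a 1 to every sample, i.e. x_tilde = (1, x1, ..., xD)^T.
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

# With w = (w0, w1, ..., wD)^T, the prediction is simply f_w(x) = w^T x_tilde.
w = np.array([10.0, 0.5, 1.0])   # w[0] plays the role of the bias term w0
predictions = X_tilde @ w        # shape (N,)
```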
Loss Function
- A loss function measures the "misfit" or error between a model (parametrized by w) and observed data D = {(xi, yi)}Ni=1
- Least squares (LS) is a standard choice where ELS(w) = 1/2 * Σ(fw(xi) – yi)2 = 1/2 * Σ(wTxi - yi)2
Objective
- The objective is to find the optimal weight vector w* that minimizes the error, w* = arg minw ELS(w)
- Expanding ELS(w), this is w* = arg minw 1/2 Σ(xiTw − yi)²
- By stacking the observations xi as rows of the matrix X ∈ RN×D, this is equivalent to arg minw 1/2 (Xw − y)T(Xw − y)
Optimal Solution
- Computing the gradient ∇wE(w), the equation becomes ∇wELS(w) = ∇w 1/2 (Xw - y)T(Xw - y) = ∇w 1/2 (wTXTXw - 2wTXTy + yTy) = XTXw – XTy
- Setting the gradient to zero and solving for w yields the minimizer: XTXw – XTy = 0, known as the normal equation for the least squares problem.
- The optimal solution is w* = (XTX)⁻¹XTy = X†y, where X† = (XTX)⁻¹XT is called the Moore-Penrose pseudo-inverse of X.
- The formula is applicable when XTX is invertible (i.e., X has full column rank).
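A minimal sketch of the closed-form solution in NumPy (function names are illustrative, not from the lecture); solving the normal equation as a linear system, or using `np.linalg.pinv`, is numerically preferable to forming the explicit inverse:

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equation X^T X w = X^T y, assuming X has full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_pseudo_inverse(X, y):
    """Equivalent result via the Moore-Penrose pseudo-inverse: w* = X^dagger y."""
    return np.linalg.pinv(X) @ y
```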
Nonlinear Dependency in Data
- If the dependency between y and x is not linear, polynomials are used as universal function approximators
- For 1-dimensional x, f can be defined as $f_w(x) = w_0 + \sum_{j=1}^{M} w_j x^j$
Polynomials
- This can be generalized to arbitrary basis functions: $f_w(x) = w_0 + \sum_{j=1}^{M} w_j \phi_j(x)$
- Define φ0(x) = 1
- This implies that f(x)= wT φ(x)
- The function f is still linear in w even when not linear in x
Typical Basis Functions
- Typical choices include polynomial, Gaussian, and logistic sigmoid basis functions
Linear Basis Function Model
- For d-dimensional data x: φj: Rd → R
- Prediction for one sample with $f_w(x) = w_0 + \sum_{j=1}^{M}{w_j \phi_j(x)} = w^T \phi(x)$
- The least squares error function is: $E_{LS}(w) = \frac{1}{2} \sum_{i=1}^{N}(w^T\phi (x_i) - y_i)^2 = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$
- The design matrix is: $\Phi = \begin{pmatrix} \phi_0 (x_1) & \phi_1 (x_1) & \dots & \phi_M(x_1) \\ \phi_0 (x_2) & \phi_1 (x_2) & \dots & \phi_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0 (x_N) & \phi_1 (x_N) & \dots & \phi_M(x_N) \end{pmatrix} \in \mathbb{R}^{N \times (M+1)}$
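For 1-dimensional inputs with polynomial basis functions $\phi_j(x) = x^j$, the design matrix can be built as in this short sketch (the helper name is made up):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Return Phi with entries Phi[i, j] = x_i ** j, i.e. columns phi_0 = 1, ..., phi_M."""
    x = np.asarray(x).reshape(-1)                  # 1-dimensional inputs
    return np.vander(x, N=M + 1, increasing=True)  # shape (N, M + 1)

Phi = polynomial_design_matrix([0.0, 0.5, 1.0, 1.5], M=3)  # shape (4, 4)
```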
Optimal Solution
- Recall the final form of the least squares loss that we arrived at for the original feature matrix X: $E_{LS}(w) = \frac{1}{2} (Xw - y)^T (Xw - y)$, and compare it to the expression we found with the design matrix $\Phi \in \mathbb{R}^{N \times (M+1)}$: $E_{LS}(w) = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$
- This means that the optimal weights $w^*$ can be obtained in the same way: $w^* = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$
- Compare to Equation 15: $ w^* = (X^T X)^{-1} X^T y = X^{\dagger}y$
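Putting the pieces together, polynomial regression is just ordinary least squares on the design matrix; the snippet below (toy numbers, illustrative names) fits a degree-3 polynomial:

```python
import numpy as np

# Toy 1-D regression data (made-up values).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 0.8, 0.9, 0.3, -0.7])

Phi = np.vander(x, N=4, increasing=True)   # design matrix, columns x^0 ... x^3
w_star = np.linalg.pinv(Phi) @ y           # w* = Phi^dagger y
y_hat = Phi @ w_star                       # fitted values on the training inputs
```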
Choosing Degree of the Polynomial
- One valid solution is to choose M using the standard train-validation split approach.
- Overfitting occurs when the coefficients w become large, resulting in oscillation of the curve
- This can be resolved by penalizing large weights
Controlling Overfitting with Regularization
- Addresses overfitting using L2 regularization/ridge regression
- Introduces a modified least squares loss function: $E_{ridge}(w) = \frac{1}{2} \sum_{i=1}^{N}(w^T \phi(x_i) - y_i)^2 + \frac{\lambda}{2} \|w\|^2$
- $\|w\|^2 = w^T w = w_0^2 + w_1^2 + w_2^2 + \dots + w_M^2$ is the squared L2 norm of w
- λ is the regularization strength, where larger λ leads to smaller weights w
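The notes do not write out the ridge solution explicitly, but setting the gradient of $E_{ridge}(w)$ to zero gives the closed form $w^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$; a sketch:

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """Minimize E_ridge(w) = 1/2 ||Phi w - y||^2 + lam/2 ||w||^2 in closed form.

    The gradient Phi^T (Phi w - y) + lam * w = 0 gives (Phi^T Phi + lam * I) w = Phi^T y.
    """
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)
```

For λ = 0 this reduces to the unregularized least squares solution above.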
Bias-Variance Tradeoff
- The error of an estimator can be decomposed into bias and variance
- Bias refers to the expected error due to model mismatch
- Variance is the variation due to randomness in training data
- If bias is high, the model is too rigid to fit the underlying data distribution
- This occurs if the model is misspecified and/or the regularization strength λ is too high
- If there is high variance, the model is too flexible and captures noise in the data (overfitting)
- Typically happens when the model has high capacity (memorizes the training data) or λ is too low
- We want models with both low bias and low variance, which are conflicting goals.
- A popular technique is to select a model with large capacity (high degree polynomial) and keep the variance in check by choosing an appropriate regularization strength λ.
- A bias-variance tradeoff exists in unregularized least squares regression (λ = 0).
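For reference (this decomposition is standard but not written out in the notes), the expected squared error of a model $\hat{f}_D$ trained on a random dataset $D$ decomposes as

$$\mathbb{E}\big[(y - \hat{f}_D(x))^2\big] = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}} + \underbrace{\beta^{-1}}_{\text{noise}},$$

where the last term is the irreducible noise variance of the targets.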
Correlation
- The weights wi can represent the strength of the linear relationship between feature 'xi' and 'y'
- The example shows a strong correlation, with a fitted weight of 0.018
- In real-world applications, data is normalized to handle the different scales; a good model then has a weight of approximately 1
Correlation vs. Causation
- Correlation does not imply causation.
- Be aware of Confounding Variables
Probabilistic Linear Regression
- A probabilistic graphical model is used
- Recall from the problem definition at the start of the lecture that $y_i = f_w(x_i) + \epsilon_i$
- Noise has zero-mean Gaussian distribution with a fixed precision $\beta = \frac{1}{\sigma^2}$
- $\epsilon_i \sim N(0,\beta^{-1})$
- This implies that the distribution of the targets is: $y_i \sim N(f_w(x_i), \beta^{-1})$
Maximum Likelihood
- Likelihood of a single sample: p(yi | fw(xi), β) = N(yi | fw(xi), β-1)
- Assumes samples are drawn independently
- Likelihood of the entire dataset is: p(y | X, w, β) = ∏p(yi | fw(xi), β) for i=1 to N
- Maximize the likelihood with respect to w and β
Maximum Likelihood
- As in the coin flip example, working with the log-likelihood allows for various simplifications
- $w_{ML}, \beta_{ML} = arg \underset{w,\beta }{max} p(y \mid X, w, \beta)$
- $ = arg \underset{w,\beta }{max} \ln p(y \mid X, w, \beta) $
- $ = arg \underset{w,\beta }{min} - \ln p(y \mid X, w, \beta) $
- We denote this quantity the maximum likelihood error function, which we need to minimize: $E_{ML}(w, \beta) = - \ln p(y \mid X, w, \beta)$
Maximum Likelihood: Simplify Error function
- $E_{ML}(w, \beta) = - \ln \prod_{i=1}^{N}p(y_i \mid f_w(x_i), \beta^{-1})$
- $= - \ln \prod_{i=1}^{N} \sqrt{\frac{\beta}{2\pi}} \exp\left( - \frac{\beta}{2} (w^T\phi(x_i) - y_i)^2 \right)$
- Simplifying and optimizing log-likelihood with respect to w results in wML = arg min EML(w, β).
Optimizing Log-Likelihood
- With respect to $w$: $w_{ML} = arg \underset{w}{min} E_{ML}(w, \beta)$
- $= arg \underset{w}{min} [\frac{\beta}{2} \sum_{i=1}^{N}(w^T\phi (x_i) - y_i)^2 - \frac{N}{2} ln \beta + \frac{N}{2} ln 2\pi ] $
- $= arg \underset{w}{min} \frac{1}{2} \sum_{i=1}^{N} (w^T \phi(x_i) - y_i)^2$, which is exactly the least squares error function!
- Therefore $ = arg \underset{w}{min} E_{LS}(w) $
Maximizing the likelihood
- This is equivalent to minimizing the least squares error function
- $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$
Optimizing log-likelihood wrt β
- Plug in the estimate for w and minimize wrt β
- $\beta_{ML} = arg \underset{\beta}{min} E_{ML}(w_{ML}, \beta)$
- With the derivative wrt to $\beta$ set to zero, $ \frac{\partial}{\partial \beta} E_{ML}(w_{ML}, \beta) = \frac{1}{2} \sum_{i=1}^{N}(w_{ML}^T\phi(x_i) - y_i)^2 - \frac{N}{2 \beta} \overset{!}{=} 0 $
- Solving for $\beta$: $ \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{i=1}^{N}(w_{ML}^T\phi(x_i) - y_i)^2$
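The two maximum-likelihood estimates can be computed together in a few lines; this is only a sketch with illustrative names, and it assumes the fit does not interpolate the data exactly (otherwise the estimated noise variance would be zero):

```python
import numpy as np

def fit_maximum_likelihood(Phi, y):
    """Return (w_ML, beta_ML): least squares weights and the ML noise precision."""
    w_ml = np.linalg.pinv(Phi) @ y                # w_ML = Phi^dagger y
    residuals = Phi @ w_ml - y
    noise_variance = np.mean(residuals ** 2)      # 1 / beta_ML = mean squared residual
    return w_ml, 1.0 / noise_variance
```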
Posterior Distribution
- Recall that MLE can lead to overfitting, especially when little training data is available
- Consider the posterior distribution instead: $p(w \mid X, y, \beta, \cdot) = \frac{p(y \mid X, w, \beta) \cdot p(w \mid \cdot)}{p(y \mid X, \beta, \cdot)} \propto p(y \mid X, w, \beta) \cdot p(w \mid \cdot)$
- There is a direct connection to the coin flip example: training data, likelihood, prior, and posterior play the same roles
- Precision $\beta = 1/ \sigma^2$ is treated as a known parameter to simplify the calculations
Prior for w
- The prior over w is set to an isotropic multivariate normal distribution with zero mean
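A hedged sketch of the resulting Gaussian posterior. The formulas $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$ and $m_N = \beta S_N \Phi^T y$ are the standard result for a zero-mean isotropic Gaussian prior $p(w) = N(w \mid 0, \alpha^{-1} I)$ with known noise precision $\beta$; the prior precision symbol $\alpha$ and the function name are assumptions, since the notes break off here:

```python
import numpy as np

def posterior_over_weights(Phi, y, alpha, beta):
    """Posterior p(w | X, y) = N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I)."""
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi  # posterior precision
    S_N = np.linalg.inv(S_N_inv)                                 # posterior covariance
    m_N = beta * S_N @ Phi.T @ y                                 # posterior mean (MAP estimate)
    return m_N, S_N
```

With no data the posterior equals the prior, and even for an isotropic prior the posterior covariance $S_N$ is in general not diagonal, matching the flashcard terms above.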