Untitled
48 Questions

Questions and Answers

In the context of least squares, what does the expression arg min represent?

  • The minimum value of the squared differences between predicted and actual values.
  • The argument (w) that minimizes the sum of squared differences between predicted and actual values. (correct)
  • The derivative of the squared differences with respect to w.
  • The average of the squared differences between predicted and actual values.

Given a matrix X and a vector y, which of the following expressions represents the cost function being minimized in ordinary least squares?

  • $(Xw - y)^T (Xw - y)$ (correct)
  • $(w - X^T y)^T (w - X^T y)$
  • $(X^T w - y)^T (X^T w - y)$
  • $(Xw + y)^T (Xw + y)$

What condition is achieved when the gradient of the loss function, $\nabla_w E(w)$, is set to zero?

  • The loss function's rate of change is at its greatest.
  • A saddle point is achieved in the loss function.
  • The loss function is minimized or maximized, indicating a stationary point. (correct)
  • The loss function reaches its maximum value.

In the normal equation $w^* = (X^T X)^{-1} X^T y$, what does the term $(X^T X)^{-1} X^T$ represent?

  • The Moore-Penrose pseudo-inverse of X. (correct)

For what type of matrix X does the Moore-Penrose pseudo-inverse $X^{\dagger}$ simplify to the regular inverse $X^{-1}$?

  • When X is a square, invertible matrix. (correct)

What does a positive semi-definite Hessian matrix indicate about the loss function $E(w)$?

  • That $E(w)$ is convex. (correct)

In linear regression, the relationship between the independent variable x and the dependent variable y is assumed to be linear. What should be considered if this assumption is not valid?

  • Transform the variables or use a non-linear model. (correct)

Suppose you have a dataset where the relationship between the features and the target variable is non-linear. Which of the following is a suitable approach to model this relationship effectively?

  • Apply linear regression after transforming the features to capture non-linear relationships. (correct)

In the context of linear regression, what does the Moore-Penrose pseudoinverse, denoted as $X^{\dagger}$, primarily help to achieve?

  • It provides a 'best fit' solution when the matrix X is not invertible, which is common in overdetermined or underdetermined systems. (correct)

In a probabilistic linear regression model, what is the typical assumption regarding the noise component $\epsilon_i$?

  • It follows a zero-mean Gaussian distribution with fixed precision. (correct)

In the equation $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \vartheta^{-1})$, what does $\epsilon_i$ represent?

  • Random noise or error, assumed to be normally distributed with mean 0 and variance $\vartheta^{-1}$. (correct)

How does the inclusion of the bias term, $w_0$, affect a linear regression model?

  • It allows the regression line to have a non-zero y-intercept, providing more flexibility in fitting the data. (correct)

What does the notation $\vartheta = \frac{1}{\sigma^2}$ represent in the context of probabilistic linear regression?

  • The precision of the noise. (correct)

Given $y_i \sim N(f_w(x_i), \vartheta^{-1})$, how should this be interpreted?

  • The target variable $y_i$ follows a Gaussian distribution with mean $f_w(x_i)$ and a variance of $\vartheta^{-1}$. (correct)

In the context of housing price prediction using linear regression, if $x_{new}$ represents the area of a new house, what does $f(x_{new})$ represent?

  • The predicted price of the new house based on the linear regression model. (correct)

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ of house areas $x_i$ and corresponding prices $y_i$, what is the purpose of finding a mapping function $f(\cdot)$?

  • To predict the price of a house based on its area. (correct)

What is the significance of assuming that samples are drawn independently when calculating the likelihood of the entire dataset?

  • It simplifies the calculation by allowing the likelihood to be expressed as a product of individual sample likelihoods. (correct)

What is the primary role of the training data, denoted as 'D', in the context of linear regression?

  • To provide the input features and corresponding target values that the model learns from. (correct)

In maximum likelihood estimation, what are we trying to maximize with respect to?

  • The parameters of the model (w) and the precision ($\vartheta$). (correct)

In the linear model $f_w(x_i) = w_0 + w_1x_{i1} + w_2x_{i2} + ...$, how do the weights $w_1, w_2, ...$ influence the model's predictions?

  • They quantify the strength and direction of the relationship between each input feature and the target variable. (correct)

Given the likelihood function $p(y | X, w, \vartheta) = \prod_{i=1}^{N} p(y_i | f_w(x_i), \vartheta)$, what does each term $p(y_i | f_w(x_i), \vartheta)$ represent?

  • The probability of observing a single target value $y_i$ given the prediction $f_w(x_i)$ and the precision $\vartheta$. (correct)

How does the probabilistic formulation of linear regression differ from the standard linear regression approach?

  • Probabilistic linear regression explicitly models the noise as a random variable with a specific distribution. (correct)

Consider a linear regression model used to predict housing prices. Which of the following scenarios would most likely require the use of the Moore-Penrose pseudoinverse?

  • When there are more features than houses in the dataset, and some features are highly correlated. (correct)

What is the purpose of maximizing the likelihood function in the context of probabilistic linear regression?

  • To estimate the parameters that make the observed data most probable under the assumed model. (correct)

In Bayesian linear regression, what role does the prior distribution $p(w | \cdot)$ play?

  • It represents our prior knowledge or beliefs about the regression weights w before observing the data. (correct)

What is the purpose of the 'normalizing constant' in the context of calculating the posterior distribution $p(w | X, y, \vartheta, \cdot)$?

  • To ensure that the posterior distribution integrates to one, thus making it a valid probability distribution. (correct)

How does using the posterior distribution, instead of just MLE, address the problem of overfitting, especially when dealing with limited training data?

  • By incorporating prior knowledge or beliefs, which regularizes the model and prevents it from fitting the noise in the training data. (correct)

In the equation for the posterior distribution $p(w | X, y, \vartheta, \cdot) = \frac{p(y | X, w, \vartheta) \cdot p(w | \cdot)}{p(y | X, \vartheta, \cdot)}$, what does $p(y | X, w, \vartheta)$ represent?

  • The likelihood of observing the data y given the inputs X and the weights w. (correct)

Suppose you are building a Bayesian linear regression model. You have a strong belief that the weights should be close to zero. How would you incorporate this belief into your model?

  • Use a prior distribution $p(w | \cdot)$ that is centered at zero and has a small variance. (correct)

How does treating the precision $\vartheta = \frac{1}{\sigma^2}$ (inverse of the error variance) as a known parameter simplify the calculations in Bayesian linear regression?

  • It allows for a closed-form solution for the posterior distribution, avoiding complex numerical integration. (correct)

Which of the following is the most accurate analogy for the relationship between the prior, likelihood, and posterior in Bayesian inference?

  • Prior × Likelihood ∝ Posterior: the prior and likelihood are multiplied, and the posterior is proportional to this product. (correct)

In the coin flip analogy, what corresponds to the 'train data' in the regression context?

  • $D = \{X, y\}$ (correct)

When does the posterior distribution equal the prior distribution?

  • When there are no data points. (correct)

What does the notation $y \sim N(f_w(x), \vartheta^{-1})$ represent in the context of linear regression?

  • The likelihood function of the data given the model parameters. (correct)

In the context of predicting new data points using MLE and MAP, what role do the model parameters w play?

  • They are a means to achieve the prediction $\hat{y}_{new}$ for a new data point $x_{new}$. (correct)

What does $p(\hat{y}_{new} | x_{new}, w_{ML}, \vartheta_{ML})$ represent in the context of prediction?

  • The predictive distribution of $\hat{y}_{new}$ given $x_{new}$, $w_{ML}$, and $\vartheta_{ML}$, using maximum likelihood estimation. (correct)

In the equation $p(\hat{y}_{new} | x_{new}, w_{MAP}, \vartheta) = N(\hat{y}_{new} | w_{MAP}^T \phi(x_{new}), \vartheta^{-1})$, what does $w_{MAP}$ represent?

  • The maximum a posteriori estimate of the model parameters. (correct)

What is the significance of using the full posterior distribution $p(w | D)$ in prediction?

  • It accounts for the uncertainty in the model parameters. (correct)

What is assumed to be known a priori for simplified calculations in the context of linear regression?

  • The noise precision $\vartheta$. (correct)

Even when we assume an isotropic prior $p(w)$, what is a characteristic of the posterior covariance?

  • It is generally not diagonal. (correct)

In the context of linear regression, what is the primary purpose of using a design matrix (denoted as $\Phi$)?

  • To transform the original feature matrix into a higher-dimensional space, allowing for non-linear relationships to be modeled. (correct)

How does the least squares loss function using the design matrix, $ELS(w) = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$, relate to the original feature matrix $X$?

  • It offers an alternative computation method for `w` that may be computationally more efficient depending on size and structure. (correct)

Given the optimal weights $w^* = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$, what does $\Phi^{\dagger}$ represent?

  • The Moore-Penrose pseudoinverse of the design matrix. (correct)

What is the significance of comparing $w^* = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$ to $w^* = (X^T X)^{-1} X^T y = X^{\dagger} y$?

  • It helps in understanding the computational complexity differences when using different feature representations. (correct)

In the context of polynomial regression, what does increasing the degree M of the polynomial generally achieve?

  • It increases the model's complexity, allowing it to fit more intricate relationships in the data. (correct)

If you observe that a linear regression model (M=1) underfits the data, what adjustment to the polynomial degree M would likely improve the model's fit?

  • Increase `M` to a higher value, such as 2 or 3, to allow for more complex curves. (correct)

How does the choice of the polynomial degree M relate to the bias-variance tradeoff in machine learning?

  • Higher `M` reduces bias but may increase variance, while lower `M` increases bias but may reduce variance. (correct)

In polynomial regression, if a model with a high degree M perfectly fits the training data but performs poorly on new, unseen data, what is this phenomenon called?

  • Overfitting (correct)

Flashcards

Scalar (x)

A scalar is a single numerical value, represented by a lowercase, non-bold symbol.

Vector (x)

A vector is an ordered array of numbers, represented by a lowercase, bold symbol.

Matrix (X)

A matrix is a rectangular array of numbers, represented by an uppercase, bold symbol.

f(x)

The predicted output value from a model given an input x.


Target vector (y)

The vector containing the true output values that the model is trying to predict.


Target (yi)

The i-th element from the target vector.


Bias term (w0)

The bias term, a constant added to the linear model.


Regression Goal

Find a mapping f(.) from inputs to targets.


Probabilistic Linear Regression

Linear regression with a probabilistic approach, often using probabilistic graphical models.


Linear Regression Model

The target value equals a function of the input plus noise.


Noise Distribution

Noise in the model follows a zero-mean Gaussian distribution.


Precision (ϑ)

Precision is the inverse of the noise variance.


Distribution of Targets (yi)

Targets are normally distributed around the function output.


Function Representation

Any function can be represented as a weighted sum of basis functions.


Maximum Likelihood

Finding the parameters that maximize the probability of observing the data.


Likelihood of Dataset

The product of the probabilities of individual samples, assuming independence.


Optimal 'w' in Least Squares

The value of 'w' (weights) that minimizes the sum of squared differences between predicted and actual values in a linear regression.


Data Matrix (X)

A matrix where each row represents an observation and each column represents a feature.


Minimizing Loss E(w)

The process of finding the 'w' that results in the smallest possible value of the loss function E(w).


Gradient of E(w) (∇w E(w))

The vector of partial derivatives of the loss function E(w) with respect to the weight vector 'w'.


Finding the Minimizer 'w'

Setting the gradient of the loss function to zero and solving for 'w' to find the minimum loss.


Normal Equation

The equation obtained by setting the gradient of the least squares objective function to zero.


Moore-Penrose Pseudo-Inverse (X†)

A generalization of the inverse matrix, used when the matrix is not square or invertible.


Nonlinear Dependency

Addresses relationships where a straight line does not accurately represent the connection between input and output variables.


Design Matrix (Φ)

A matrix that transforms input data into a higher-dimensional space for linear regression.


Optimal Weights (ŵ)

The optimal weights (ŵ) minimize the least squares loss function.


ŵ = (ΦᵀΦ)⁻¹Φᵀy

Calculates the optimal weights (ŵ) using the design matrix (Φ) and target values (y).


ŵ = (XᵀX)⁻¹Xᵀy

Calculates the optimal weights (ŵ) using the original feature matrix (X) and target values (y).


Polynomial Degree (M)

Choosing the right degree is crucial for model performance.


M = 0 (Polynomial Degree)

A polynomial of degree 0 is a constant (horizontal line).


M = 1 (Polynomial Degree)

A polynomial of degree 1 is a line.


Choose M

Evaluate models at different degrees


Likelihood

The probability of observing the data (y) given the model parameters (w, ϑ) and the input data (X).


Prior

The prior belief about the model parameters (w) before observing the data.


Posterior Distribution

The probability of the model parameters (w) given the data (X, y), model parameters (ϑ), and prior information.


Posterior Formula

p(w | X, y, ϑ, ·) = [p(y | X, w, ϑ) * p(w | ·)] / p(y | X, ϑ, ·)


Normalizing Constant

A constant that ensures the posterior distribution integrates to 1 (it is a proper probability distribution).


Overfitting

A problem where a model learns the training data too well, leading to poor performance on new, unseen data.


Solution to Overfitting

Use the entire posterior distribution instead of just a single point estimate (like in MLE).


Posterior with no data

With no data, the posterior distribution is the same as the prior distribution.


Non-diagonal posterior covariance

Even if the prior for the weights (w) is isotropic (same in all directions), the posterior covariance is generally not diagonal.


Gaussian conjugate prior

The Gaussian distribution is a conjugate prior for itself, simplifying Bayesian calculations.


Prediction

Predicting the output $\hat{y}_{new}$ for a new data point $x_{new}$ is the main goal when creating a model.


Predictive distribution

The predictive distribution allows us to make predictions $\hat{y}_{new}$ for new data $x_{new}$ by plugging in the estimated parameters from MLE.


MLE predictive distribution

The predictive distribution using Maximum Likelihood Estimation (MLE) is a normal distribution with mean $w_{ML}^T \phi(x_{new})$ and variance $\vartheta_{ML}^{-1}$.


MAP predictive distribution

The predictive distribution using Maximum a Posteriori (MAP) estimation is a normal distribution with mean $w_{MAP}^T \phi(x_{new})$ and variance $\vartheta^{-1}$.


Posterior predictive distribution

Instead of using point estimates (MLE/MAP), we can use the full posterior distribution p(w | D) to make predictions, accounting for uncertainty in the model parameters.


Study Notes

  • The lecture focuses on Machine Learning, specifically Lecture 4 on Linear Regression, presented by Dr. Leo Schwinn at the Technical University of Munich during the Winter term 2024/2025.

Notation

  • Scalars are represented in lowercase and not bold
  • Vectors are in lowercase and bold
  • Matrices are in uppercase and in bold
  • A predicted value for inputs 'x' is denoted as f(x)
  • Vector of targets is represented as 'y'
  • The target of the i'th example is 'yi'
  • 'w0' represents the bias term (not to be confused with bias in the bias-variance sense)
  • Basis function is represented as φ(·)
  • E() represents the error function
  • Training data is denoted as 'D'
  • The Moore-Penrose pseudoinverse of X is represented as X†
  • There is no special symbol for vectors or matrices augmented by the bias term w0; it is assumed that the bias term is always included.

Basic Linear Regression

  • A dataset $D = \{(x_i, y_i)\}_{i=1}^N$ includes house areas $x_i$ and their corresponding prices $y_i$

Regression Problem

  • Given observations $X = \{x_1, x_2, ..., x_N\}$, where each $x_i$ is a real vector in $\mathbb{R}^D$
  • Given targets $y = \{y_1, y_2, ..., y_N\}$, where each $y_i$ is a real number
  • The problem involves finding a mapping $f(\cdot)$ from inputs to targets, such that $y_i \approx f(x_i)$
  • A common way to represent the samples is as a data matrix $X \in \mathbb{R}^{N \times D}$, where each row represents one sample.

Linear Model

  • Target y is generated by a deterministic function f of x plus noise: $y_i = f(x_i) + \epsilon_i$
  • $\epsilon_i$ follows a normal distribution $N(0, \beta^{-1})$
  • A linear function is chosen for f(x), such that $f_w(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + ... + w_D x_{iD} = w_0 + w^T x_i$

Absorbing the Bias Term

  • The linear function is given by $f_w(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_D x_D = w_0 + w^T x$
  • The "bias" or "offset" term is $w_0$
  • For simplicity, $w_0$ can be "absorbed" by prepending a 1 to the feature vector x and adding $w_0$ to the weight vector w
  • New vectors $\tilde{x} = (1, x_1, ..., x_D)^T$ and $\tilde{w} = (w_0, w_1, ..., w_D)^T$ are created
  • The function can then be written as $f_w(x) = \tilde{w}^T \tilde{x}$
  • It is assumed the bias term is always absorbed, simplifying notation by writing w and x instead of $\tilde{w}$ and $\tilde{x}$ (see the sketch below)
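As a minimal NumPy sketch of this trick (the toy data and weights below are made up purely for illustration), absorbing the bias amounts to prepending a column of ones to the data matrix:

```python
import numpy as np

# Hypothetical toy data: N = 3 samples with D = 2 features each.
X = np.array([[2.0, 3.0],
              [1.0, 0.5],
              [4.0, 1.5]])

# Absorb the bias: prepend a column of ones so that w[0] plays the role of w0.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (N, D + 1)

# With w = (w0, w1, ..., wD), prediction becomes a single matrix product.
w = np.array([0.5, 1.0, -2.0])
y_hat = X_aug @ w                                  # w0 + w1*x1 + w2*x2 per row
```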

Loss Function

  • A loss function measures the "misfit" or error between a model (parametrized by w) and the observed data $D = \{(x_i, y_i)\}_{i=1}^N$
  • Least squares (LS) is a standard choice: $E_{LS}(w) = \frac{1}{2} \sum_{i=1}^{N} (f_w(x_i) - y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} (w^T x_i - y_i)^2$
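In vectorized form, this loss can be sketched as follows (a minimal illustration, not code from the lecture):

```python
import numpy as np

def least_squares_loss(w, X, y):
    """E_LS(w) = 0.5 * sum_i (w^T x_i - y_i)^2, computed via the residual vector."""
    r = X @ w - y
    return 0.5 * r @ r
```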

Objective

  • The objective is to find the optimal weight vector $w^*$ that minimizes the error: $w^* = \arg\min_w E_{LS}(w)$
  • This can be expressed as $\arg\min_w \frac{1}{2} \sum_{i=1}^{N} (x_i^T w - y_i)^2$
  • By stacking the observations $x_i$ as rows of the matrix $X \in \mathbb{R}^{N \times D}$, this is equivalent to $\arg\min_w \frac{1}{2} (Xw - y)^T (Xw - y)$

Optimal Solution

  • Computing the gradient $\nabla_w E(w)$: $\nabla_w E_{LS}(w) = \nabla_w \frac{1}{2} (Xw - y)^T (Xw - y) = \nabla_w \frac{1}{2} (w^T X^T X w - 2 w^T X^T y + y^T y) = X^T X w - X^T y$
  • Setting the gradient to zero and solving for w yields the minimizer: $X^T X w - X^T y = 0$, known as the normal equation of the least squares problem.
  • The optimal solution is $w^* = (X^T X)^{-1} X^T y = X^{\dagger} y$, where $X^{\dagger} = (X^T X)^{-1} X^T$ is called the Moore-Penrose pseudo-inverse of X.
  • The formula is applicable when $X^T X$ is invertible (i.e., X has full column rank).
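A short NumPy sketch of this solution (illustrative only; an SVD-based solver such as `np.linalg.lstsq` or `np.linalg.pinv` is preferred over explicitly inverting $X^T X$ for numerical stability):

```python
import numpy as np

def least_squares_weights(X, y):
    """Solve w* = argmin_w 0.5 * ||Xw - y||^2.

    np.linalg.lstsq uses an SVD-based solver, so it also handles the case
    where X^T X is not invertible (X without full column rank).
    """
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

# Equivalent closed forms, assuming X has full column rank:
# w_star = np.linalg.inv(X.T @ X) @ X.T @ y   # normal equation
# w_star = np.linalg.pinv(X) @ y              # Moore-Penrose pseudo-inverse
```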

Nonlinear Dependency in Data

  • If the dependency between y and x is not linear, polynomials are used as universal function approximators
  • For 1-dimensional x, f can be defined as $f_w(x) = w_0 + \sum_{j=1}^{M} w_j x^j$

Polynomials

  • This can be generalized as $f_w(x) = w_0 + \sum_{j=1}^{M} w_j \phi_j(x)$
  • Define $\phi_0(x) = 1$
  • This implies that $f_w(x) = w^T \phi(x)$
  • The function f is still linear in w even when not linear in x

Typical Basis Functions

  • Functions can be Polynomials, Gaussian, or Logistic Sigmoid

Linear Basis Function Model

  • For D-dimensional data x, the basis functions are $\phi_j: \mathbb{R}^D \to \mathbb{R}$
  • Prediction for one sample: $f_w(x) = w_0 + \sum_{j=1}^{M}{w_j \phi_j(x)} = w^T \phi(x)$
  • The least squares error function becomes: $E_{LS}(w) = \frac{1}{2} \sum_{i=1}^{N}(w^T\phi (x_i) - y_i)^2 = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$
  • The design matrix is: $\Phi = \begin{pmatrix} \phi_0 (x_1) & \phi_1 (x_1) & ... & \phi_M(x_1)\\ \phi_0 (x_2) & \phi_1 (x_2) & ... & \phi_M(x_2)\\ \vdots & \vdots & \ddots & \vdots \\ \phi_0 (x_N) & \phi_1 (x_N) & ... & \phi_M(x_N) \end{pmatrix} \in \mathbb{R}^{N \times (M+1)}$

Optimal Solution

  • Recall the final form of the least squares loss that we arrived at for the original feature matrix X: $E_{LS}(w) = \frac{1}{2} (Xw - y)^T (Xw - y)$, and compare it to the expression we found with the design matrix $\Phi \in \mathbb{R}^{N \times (M+1)}$: $E_{LS}(w) = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$
  • This means that the optimal weights $w^*$ can be obtained in the same way: $w^* = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$
  • Compare to Equation 15: $ w^* = (X^T X)^{-1} X^T y = X^{\dagger}y$
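A sketch of the same computation with a polynomial design matrix (the helper names below are my own, not from the lecture):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Build Phi with columns [1, x, x^2, ..., x^M] for 1-D inputs x."""
    x = np.asarray(x).reshape(-1)
    return np.vander(x, N=M + 1, increasing=True)   # shape (N, M + 1)

def fit_polynomial(x, y, M):
    """Optimal weights w* = pinv(Phi) @ y for a degree-M polynomial."""
    Phi = polynomial_design_matrix(x, M)
    return np.linalg.pinv(Phi) @ y
```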

Choosing Degree of the Polynomial

  • One valid solution is to choose M using the standard train-validation split approach.
  • Overfitting occurs when the coefficients w become large, resulting in oscillation of the curve
  • This can be resolved by penalizing large weights
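One way to sketch the train-validation selection of M, reusing the hypothetical helpers from the previous snippet:

```python
import numpy as np

def validation_mse(x_tr, y_tr, x_val, y_val, M):
    """Mean squared error on the validation split for a degree-M fit."""
    w = fit_polynomial(x_tr, y_tr, M)
    Phi_val = polynomial_design_matrix(x_val, M)
    return np.mean((Phi_val @ w - y_val) ** 2)

def choose_degree(x_tr, y_tr, x_val, y_val, degrees=range(10)):
    """Return the degree M with the lowest validation error."""
    return min(degrees, key=lambda M: validation_mse(x_tr, y_tr, x_val, y_val, M))
```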

Controlling Overfitting with Regularization

  • Addresses overfitting using L2 regularization/ridge regression
  • Introduces a modified least squares loss function: $E_{ridge}(w) = \frac{1}{2} \sum_{i=1}^{N} (w^T \phi(x_i) - y_i)^2 + \frac{\lambda}{2} \|w\|^2$
  • $\|w\|^2 = w^T w = w_0^2 + w_1^2 + w_2^2 + ... + w_M^2$, the squared L2 norm of w
  • λ is the regularization strength, where larger λ leads to smaller weights w
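Setting the gradient of $E_{ridge}$ to zero gives the closed-form solution $w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$. A minimal sketch (note that this version also penalizes the bias weight $w_0$, which is often excluded in practice):

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """Closed-form ridge regression: w = (Phi^T Phi + lam * I)^(-1) Phi^T y."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)   # solve() is more stable than inv()
```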

Bias-Variance Tradeoff

  • The error of an estimator can be decomposed into bias and variance
  • Bias refers to the expected error due to model mismatch
  • Variance is the variation due to randomness in training data
  • With high bias, the model is too rigid to fit the underlying data distribution
  • This occurs if the model is misspecified and/or the regularization strength λ is too high
  • With high variance, the model is too flexible and captures noise in the data (overfitting)
  • This typically happens when the model has high capacity (memorizes the training data) or λ is too low
  • We want models that have low bias and low variance, which are conflicting goals.
  • A popular technique is to select a model with large capacity (high degree polynomial) and keep the variance in check by choosing an appropriate regularization strength λ.
  • A bias-variance tradeoff exists in unregularized least squares regression (λ = 0).

Correlation

  • The weights wi can represent the strength of the linear relationship between feature 'xi' and 'y'
  • In the lecture's example, the feature shows a strong correlation with the target even though the learned weight is only 0.018
  • In real-world applications, data is normalized to handle the different scales, yielding a good model with a weight of approximately 1.

Correlation vs. Causation

  • Correlation does not imply causation.
  • Be aware of Confounding Variables

Probabilistic Linear Regression

  • A probabilistic graphical model is used
  • Recall from the problem definition at the start of the lecture that $y_i = f_w(x_i) + \epsilon_i$
  • Noise has zero-mean Gaussian distribution with a fixed precision $\beta = \frac{1}{\sigma^2}$
  • $\epsilon_i \sim N(0,\beta^{-1})$
  • This implies that the distribution of the targets is: $y_i \sim N(f_w(x_i), \beta^{-1})$
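To make the generative assumption concrete, here is a small sketch that samples targets from this model (the ground-truth weights and precision are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([0.3, 1.5])   # hypothetical true weights (bias absorbed)
beta = 25.0                     # hypothetical noise precision, beta = 1 / sigma^2

x = rng.uniform(0.0, 1.0, size=50)
Phi = np.column_stack([np.ones_like(x), x])
# y_i ~ N(f_w(x_i), beta^{-1}): add zero-mean Gaussian noise with std 1/sqrt(beta).
y = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x.shape)
```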

Maximum Likelihood

  • Likelihood of a single sample: $p(y_i \mid f_w(x_i), \beta) = N(y_i \mid f_w(x_i), \beta^{-1})$
  • Assumes samples are drawn independently
  • Likelihood of the entire dataset: $p(y \mid X, w, \beta) = \prod_{i=1}^{N} p(y_i \mid f_w(x_i), \beta)$
  • Maximize the likelihood with respect to w and β

Maximum Likelihood

  • As in the coin flip example, this allows for various simplifications
  • $w_{ML}, \beta_{ML} = arg \underset{w,\beta }{max} p(y \mid X, w, \beta)$
  • $ = arg \underset{w,\beta }{max} \ln p(y \mid X, w, \beta) $
  • $ = arg \underset{w,\beta }{min} - \ln p(y \mid X, w, \beta) $
  • Denoting this quantity as maximum likelihood error function, requiring minimization: $E_{ML}(w, \beta) = - \ln p(y \mid X, w, \beta) $

Maximum Likelihood: Simplify Error function

  • $E_{ML}(w, \beta) = - \ln \prod_{i=1}^{N}p(y_i \mid f_w(x_i), \beta^{-1})$
  • $= - \ln \prod_{i=1}^{N} \sqrt{\frac{\beta}{2\pi}} \exp\left( - \frac{\beta}{2} (w^T\phi(x_i) - y_i)^2 \right)$
  • Simplifying and optimizing log-likelihood with respect to w results in wML = arg min EML(w, β).

Optimizing Log-Likelihood

  • With respect to $w$: $w_{ML} = \arg \underset{w}{\min}\, E_{ML}(w, \beta)$
  • $= \arg \underset{w}{\min} \left[\frac{\beta}{2} \sum_{i=1}^{N}(w^T\phi (x_i) - y_i)^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln 2\pi \right]$
  • $= \arg \underset{w}{\min} \frac{1}{2} \sum_{i=1}^{N} (w^T \phi(x_i) - y_i)^2$, which is exactly the least squares error function!
  • Therefore $w_{ML} = \arg \underset{w}{\min}\, E_{LS}(w)$

Maximizing the likelihood

  • This is equivalent to minimizing the least squares error function
  • $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^{\dagger} y$

Optimizing log-likelihood wrt β

  • Plug in the estimate for w and minimize wrt β
  • $\beta_{ML} = arg \underset{\beta}{min} E_{ML}(w_{ML}, \beta)$
  • With the derivative wrt to $\beta$ set to zero, $ \frac{\partial}{\partial \beta} E_{ML}(w_{ML}, \beta) = \frac{1}{2} \sum_{i=1}^{N}(w_{ML}^T\phi(x_i) - y_i)^2 - \frac{N}{2 \beta} \overset{!}{=} 0 $
  • Solving for $\beta$: $ \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{i=1}^{N}(w_{ML}^T\phi(x_i) - y_i)^2$
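A sketch that computes both ML estimates following the formulas above (the helper name is mine):

```python
import numpy as np

def max_likelihood_fit(Phi, y):
    """Return (w_ML, beta_ML) for the probabilistic linear regression model."""
    w_ml = np.linalg.pinv(Phi) @ y            # identical to the least squares solution
    residuals = Phi @ w_ml - y
    beta_ml = 1.0 / np.mean(residuals ** 2)   # 1/beta_ML = mean squared residual
    return w_ml, beta_ml
```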

Posterior Distribution

  • Recall that MLE can lead to overfitting, especially when little training data is available
  • Consider the posterior distribution instead: $p(w \mid X, y, \beta, \cdot) = \frac{p(y \mid X, w, \beta) \cdot p(w \mid \cdot)}{p(y \mid X, \beta, \cdot)} \propto p(y \mid X, w, \beta) \cdot p(w \mid \cdot)$
  • This mirrors the coin flip example, with training data, likelihood, prior, and posterior playing the same roles
  • Precision $\beta = 1/ \sigma^2$ is treated as a known parameter to simplify the calculations

Prior for w

  • The prior over w is set to an isotropic multivariate normal distribution with zero mean
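With this conjugate Gaussian setup, the posterior is again Gaussian, $p(w \mid X, y) = N(w \mid m_N, S_N)$ with $S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}$ and $m_N = \beta S_N \Phi^T y$, where $\alpha$ denotes the prior precision (left unspecified in these notes). A minimal sketch under those assumptions:

```python
import numpy as np

def gaussian_posterior(Phi, y, beta, alpha):
    """Posterior N(w | m_N, S_N) for the prior p(w) = N(0, alpha^{-1} I).

    Both alpha (prior precision) and beta (noise precision) are treated as known.
    """
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ y
    return m_N, S_N
```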
