L08 - Bayesian Linear Regression Lecture Notes PDF

Document Details


University of Bath

Wenbin Li

Tags

machine learning, bayesian linear regression, probabilistic modeling, linear regression

Summary

This document contains lecture notes on Bayesian Linear Regression, part of a Machine Learning module (CM50264). The notes give an overview of linear regression, including the probabilistic model and maximum likelihood estimation, then detail MAP estimation and summarise Bayesian linear regression.

Full Transcript


CM50264 Machine Learning 1, Lecture 8: Bayesian Linear Regression
Wenbin Li

Linear regression example

Again, the linear regression example:

The data: $D = \{(x_1, y_1), \dots, (x_N, y_N)\} \subset X \times Y \subset \mathbb{R}^M \times \mathbb{R}$

The function: $f(x) = w_0 x_0 + w_1 x_1 + \dots + w_M x_M = w^\top x$

where $N$ is the number of data points and $M$ is the input dimension. This lecture will derive regression solutions using probabilistic modelling; more details will be given in CM50268: Bayesian machine learning. Any "Optional" slides/content will not be required in the final exam.

Probabilistic modelling

Standard probabilistic modelling of a linear regression problem starts with a simple model:

$y_i = f(x_i) + \epsilon_i = w^\top x_i + \epsilon_i, \quad i = 1, \dots, N$

where $\epsilon_i$ is the i.i.d. noise variable. There exists an unknown ground-truth function $f^*(x_i) = y_i$; however, the measured data/observations are noisy, and the noise is described by $\epsilon$. The noise is i.i.d. (independent and identically distributed), which means the observations are measured independently and follow the same distribution. In common cases, $\epsilon$ is modelled by a Gaussian distribution $N(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma$ is the standard deviation.

The noise variable $\epsilon_i$ represents the deviation between the noisy observation $y_i$ and the model prediction $f(x_i)$, i.e. the error.

Probabilistic modelling - summary

We define a model with zero-mean noise:

$y_i = w^\top x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)$

It can be rewritten as $y_i - w^\top x_i \sim N(0, \sigma^2)$, which gives the likelihood expression:

$p(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right)$

where the Gaussian distribution can be expressed as:

$N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

Maximum likelihood (ML) estimation

Define the data matrix $X = (x_1^\top, \dots, x_N^\top)^\top$ and the label vector $y = (y_1, \dots, y_N)^\top$. The likelihood over the whole data set is:

$p(y \mid X, w) = \prod_{i=1}^{N} p(y_i \mid x_i, w) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(w^\top x_i - y_i)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left(-\frac{\|Xw - y\|^2}{2\sigma^2}\right)$

ML estimation maximises $p(y \mid X, w)$:

$w^* = \arg\max_{w \in \mathbb{R}^M} \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left(-\frac{\|Xw - y\|^2}{2\sigma^2}\right) = \arg\min_{w \in \mathbb{R}^M} \|Xw - y\|^2 \quad (1)$

which leads to:

$\arg\min_{w \in \mathbb{R}^M} \|Xw - y\|^2 \iff X^\top X w^* = X^\top y$

The ML solution under i.i.d. Gaussian noise is equivalent to the least-squares solution.
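As a minimal illustration of this equivalence, the sketch below (not from the lecture; the synthetic data, variable names and use of NumPy are assumptions for the example) recovers the weights by solving the normal equations $X^\top X w = X^\top y$.

```python
import numpy as np

# Hypothetical toy data: N points with M-dimensional inputs, generated from an
# assumed ground-truth weight vector plus i.i.d. Gaussian noise.
rng = np.random.default_rng(0)
N, M, sigma = 50, 3, 0.1
X = rng.normal(size=(N, M))                       # data matrix, rows are x_i^T
w_true = np.array([1.0, -2.0, 0.5])               # ground-truth weights (unknown in practice)
y = X @ w_true + rng.normal(scale=sigma, size=N)  # noisy labels

# ML estimate under i.i.d. Gaussian noise = least-squares solution:
# solve X^T X w = X^T y rather than inverting X^T X explicitly.
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
print("ML / least-squares estimate:", w_ml)
```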
Bayes' rule

Data: the observations (e.g. $X$ and $y$). Hypothesis: models and unknown random variables (e.g. $w$).

Posterior = Likelihood × Prior × (Evidence)$^{-1}$

Maximum a posteriori (MAP) estimation

Another probabilistic solution is MAP. In ML, we maximise the likelihood: $w^* = \arg\max_w p(y \mid X, w)$. In MAP, we maximise the posterior: $w^* = \arg\max_w p(w \mid X, y)$. Applying Bayes' rule gives the proportionality:

$p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$

We now know how to calculate $p(y \mid X, w)$. But how do we obtain $p(w)$?

Gaussian prior

We can assume $p(w)$ is a standard Gaussian $N(0, I)$ with zero mean and identity covariance matrix:

$p(w) = \frac{1}{\sqrt{(2\pi)^M}} \exp\left(-\frac{\|w\|^2}{2}\right)$

Maximising the posterior $p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$ biases the solution $w^*$ towards $0$:

$p(y \mid X, w)\, p(w) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left(-\frac{\|Xw - y\|^2}{2\sigma^2}\right) \cdot \frac{1}{\sqrt{(2\pi)^M}} \exp\left(-\frac{\|w\|^2}{2}\right)$

$\Rightarrow \arg\max_w p(y \mid X, w)\, p(w) = \arg\min_w \|Xw - y\|^2 + \sigma^2 \|w\|^2$

MAP with Gaussian prior

$w^* = \arg\min_w \|Xw - y\|^2 + \sigma^2 \|w\|^2 \iff (X^\top X + \sigma^2 I)\, w^* = X^\top y$

The MAP solution becomes the regularised least-squares solution, with $\sigma^2$ as the regularisation coefficient.

MAP - summary

Input: data $\{(x_1, y_1), \dots, (x_N, y_N)\} \subset \mathbb{R}^M \times \mathbb{R}$; noise parameter $\sigma^2 \geq 0$.
Training: build the data matrix $X = (x_1^\top, \dots, x_N^\top)^\top$ and the label vector $y = (y_1, \dots, y_N)^\top$; obtain the optimum solution by maximising $p(w \mid y, X)$: $w^* = (X^\top X + \sigma^2 I)^{-1} X^\top y$.
Testing: $f(x_{\mathrm{new}}) = (w^*)^\top x_{\mathrm{new}}$.

We can now make a prediction for a newly received input $x_{\mathrm{new}}$.
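A minimal sketch of this MAP recipe, reusing the same hypothetical toy data as before (again, the code and names are assumptions for illustration, not part of the lecture), computes the regularised least-squares solution and uses it for a point prediction.

```python
import numpy as np

# Same hypothetical toy data as in the ML example above.
rng = np.random.default_rng(0)
N, M, sigma = 50, 3, 0.1
X = rng.normal(size=(N, M))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=N)

# Training: MAP with a standard Gaussian prior N(0, I) on w gives the
# regularised least-squares solution w* = (X^T X + sigma^2 I)^{-1} X^T y.
w_map = np.linalg.solve(X.T @ X + sigma**2 * np.eye(M), X.T @ y)

# Testing: point prediction f(x_new) = (w*)^T x_new for a new input.
x_new = np.array([0.2, -0.1, 1.0])
print("MAP estimate:", w_map)
print("prediction:", w_map @ x_new)
```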
Bayesian linear regression: basic idea

We can also obtain predictions by probabilistic modelling. Why do we need this? The predictive distribution can be defined as:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X) \quad (2)$

It describes the distribution of new predictions given new data points. How can we obtain it?

Optional: Marginalisation

For a given joint distribution $p(a, b)$, its marginal distribution $p(a)$ (or $p(b)$) can be obtained by integrating $b$ (or $a$) out:

$p(a) = \int p(a, b)\, db, \qquad p(b) = \int p(a, b)\, da$

Similarly, for a joint distribution $p(a, b, c)$:

$p(a, c) = \int p(a, b, c)\, db = \int p(a \mid b, c)\, p(b \mid c)\, db$

Optional: Predictive distribution

The predictive distribution can be obtained by marginalising $w$ in the product of the likelihood and the posterior:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X) = \int p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, w)\, p(w \mid y, X)\, dw$

where the likelihood and the posterior are:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, w) = N(w^\top x_{\mathrm{new}},\ \sigma^2)$
$p(w \mid y, X) = N\big((X^\top X + \sigma^2 I)^{-1} X^\top y,\ \sigma^2 (X^\top X + \sigma^2 I)^{-1}\big)$

As an example, the likelihood can be written out as:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(w^\top x_{\mathrm{new}} - y_{\mathrm{new}})^2}{2\sigma^2}\right)$

These are all in Gaussian form, so the predictive distribution $p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X)$ can also be expressed in Gaussian form:

$N\big(x_{\mathrm{new}}^\top (X^\top X + \sigma^2 I)^{-1} X^\top y,\ \ x_{\mathrm{new}}^\top \sigma^2 (X^\top X + \sigma^2 I)^{-1} x_{\mathrm{new}}\big)$

The mean of the predictive distribution is equivalent to the MAP solution:

$x_{\mathrm{new}}^\top (X^\top X + \sigma^2 I)^{-1} X^\top y = x_{\mathrm{new}}^\top w^*, \qquad w^* = (X^\top X + \sigma^2 I)^{-1} X^\top y$

The prediction is therefore a Gaussian distribution characterised by:
predictive mean: $x_{\mathrm{new}}^\top (X^\top X + \sigma^2 I)^{-1} X^\top y$
predictive variance: $x_{\mathrm{new}}^\top \sigma^2 (X^\top X + \sigma^2 I)^{-1} x_{\mathrm{new}}$

Under the i.i.d. Gaussian noise model, the predictive variance is independent of the training labels $y$.

Optional: Bayesian linear regression - summary

Input: data $\{(x_1, y_1), \dots, (x_N, y_N)\} \subset \mathbb{R}^M \times \mathbb{R}$; noise parameter $\sigma^2 \geq 0$. Construct the predictive distribution for a given input $x_{\mathrm{new}}$:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X) = N\big(x_{\mathrm{new}}^\top A X^\top y,\ \sigma^2\, x_{\mathrm{new}}^\top A\, x_{\mathrm{new}}\big), \qquad A = (X^\top X + \sigma^2 I)^{-1}$

There is no clear distinction between training and testing stages, although $A X^\top y$ could be pre-calculated (a short code sketch of this construction follows the reading list below).

Reading list

BIS: Christopher Bishop, Pattern Recognition and Machine Learning, Sections 3.3-3.4 and Section 6.4.2.
BAR: David Barber, Bayesian Reasoning and Machine Learning, Chapter 18.
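As a closing sketch of the optional Bayesian linear regression material (not from the lecture; the toy data is hypothetical and NumPy is an assumed choice), the snippet below pre-computes $A = (X^\top X + \sigma^2 I)^{-1}$ and $A X^\top y$ once and then returns the predictive mean and variance given on the summary slide for any new input.

```python
import numpy as np

# Same hypothetical toy data as in the earlier examples.
rng = np.random.default_rng(0)
N, M, sigma = 50, 3, 0.1
X = rng.normal(size=(N, M))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=N)

# Pre-compute A = (X^T X + sigma^2 I)^{-1} and A X^T y; there is no separate
# training stage, only the construction of the predictive distribution.
A = np.linalg.inv(X.T @ X + sigma**2 * np.eye(M))
A_Xt_y = A @ X.T @ y

def predictive(x_new):
    """Mean and variance of p(y_new | x_new, y, X) as given on the slides."""
    mean = x_new @ A_Xt_y                # x_new^T A X^T y (equals x_new^T w*)
    var = sigma**2 * x_new @ A @ x_new   # sigma^2 x_new^T A x_new
    return mean, var

mean, var = predictive(np.array([0.2, -0.1, 1.0]))
print("predictive mean:", mean, "predictive variance:", var)
```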
