L08 - Bayesian Linear Regression Lecture Notes PDF

Document Details


University of Bath

Wenbin Li

Tags

machine learning, bayesian linear regression, probabilistic modeling, linear regression

Summary

This document contains lecture notes on Bayesian Linear Regression, part of a Machine Learning module (CM50264). The notes give an overview of linear regression, including the probabilistic model and maximum likelihood estimation, then detail MAP estimation and summarise Bayesian linear regression.

Full Transcript


CM50264 Machine Learning 1, Lecture 8: Bayesian Linear Regression
Wenbin Li

Linear regression example

Again, the linear regression example:

The data: $D = \{(x_1, y_1), \dots, (x_N, y_N)\} \subset X \times Y \subset \mathbb{R}^M \times \mathbb{R}$

The function: $f(x) = w_0 x_0 + w_1 x_1 + \dots + w_M x_M = w^\top x$

where $N$ is the number of data points and $M$ is the input dimension. This lecture will derive regression solutions using probabilistic modelling; more details will be given in CM50268: Bayesian machine learning. Any "Optional" slides/content will not be required in the final exam.

Probabilistic modelling

Standard probabilistic modelling of a linear regression problem starts with a simple model:

$y_i = f(x_i) + \epsilon_i = w^\top x_i + \epsilon_i, \quad i = 1, \dots, N$

where $\epsilon_i$ is the i.i.d. noise variable. There exists an unknown ground-truth function $f^*(x_i) = y_i$; however, the measured data/observations are noisy, and the noise is described by $\epsilon$. The noise is i.i.d. (independent and identically distributed), which means the observations are measured independently and follow the same distribution. In common cases, $\epsilon$ is modelled by a Gaussian distribution $N(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma$ is the standard deviation.

The noise variable $\epsilon_i$ represents the deviation between the noisy observation $y_i$ and the model prediction $f(x_i)$, i.e. the error.

Probabilistic modelling - summary

We define a model with zero-mean noise:

$y_i = w^\top x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)$

It can be rewritten as $y_i - w^\top x_i \sim N(0, \sigma^2)$, which gives the likelihood expression:

$p(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right)$

where the Gaussian distribution can be expressed as:

$N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

Maximum likelihood (ML) estimation

Define the data matrix $X = (x_1^\top, \dots, x_N^\top)^\top$ and the label vector $y = (y_1, \dots, y_N)^\top$. The likelihood over the whole data set is:

$p(y \mid X, w) = \prod_{i=1}^{N} p(y_i \mid x_i, w) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(w^\top x_i - y_i)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left(-\frac{\|Xw - y\|^2}{2\sigma^2}\right)$

ML estimation maximises $p(y \mid X, w)$:

$w^* = \arg\max_{w \in \mathbb{R}^M} \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left(-\frac{\|Xw - y\|^2}{2\sigma^2}\right) = \arg\min_{w \in \mathbb{R}^M} \|Xw - y\|^2 \quad (1)$

which leads to:

$\arg\min_{w \in \mathbb{R}^M} \|Xw - y\|^2 \iff X^\top X w^* = X^\top y$

The ML solution under i.i.d. Gaussian noise is equivalent to the least-squares solution.
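As a minimal illustration of this equivalence, the sketch below (not from the lecture; the synthetic data, variable names and use of NumPy are assumptions for the example) recovers the weights by solving the normal equations $X^\top X w = X^\top y$.

```python
import numpy as np

# Hypothetical toy data: N points with M-dimensional inputs, generated from an
# assumed ground-truth weight vector plus i.i.d. Gaussian noise.
rng = np.random.default_rng(0)
N, M, sigma = 50, 3, 0.1
X = rng.normal(size=(N, M))                       # data matrix, rows are x_i^T
w_true = np.array([1.0, -2.0, 0.5])               # ground-truth weights (unknown in practice)
y = X @ w_true + rng.normal(scale=sigma, size=N)  # noisy labels

# ML estimate under i.i.d. Gaussian noise = least-squares solution:
# solve X^T X w = X^T y rather than inverting X^T X explicitly.
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
print("ML / least-squares estimate:", w_ml)
```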
Bayes' rule

Data: the observations (e.g. $X$ and $y$). Hypothesis: models and unknown random variables (e.g. $w$).

Posterior = Likelihood × Prior × (Evidence)$^{-1}$

Maximum a posteriori (MAP) estimation

Another probabilistic solution is MAP. In ML, we maximise the likelihood: $w^* = \arg\max_w p(y \mid X, w)$. In MAP, we maximise the posterior: $w^* = \arg\max_w p(w \mid X, y)$. Applying Bayes' rule gives the proportionality:

$p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$

We now know how to calculate $p(y \mid X, w)$. But how do we obtain $p(w)$?

Gaussian prior

We can assume $p(w)$ is a standard Gaussian $N(0, I)$ with zero mean and identity covariance matrix:

$p(w) = \frac{1}{\sqrt{(2\pi)^M}} \exp\left(-\frac{\|w\|^2}{2}\right)$

Maximising the posterior $p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$ biases the solution $w^*$ towards $0$:

$p(y \mid X, w)\, p(w) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left(-\frac{\|Xw - y\|^2}{2\sigma^2}\right) \cdot \frac{1}{\sqrt{(2\pi)^M}} \exp\left(-\frac{\|w\|^2}{2}\right)$

$\Rightarrow \arg\max_w p(y \mid X, w)\, p(w) = \arg\min_w \|Xw - y\|^2 + \sigma^2 \|w\|^2$

MAP with Gaussian prior

$w^* = \arg\min_w \|Xw - y\|^2 + \sigma^2 \|w\|^2 \iff (X^\top X + \sigma^2 I)\, w^* = X^\top y$

The MAP solution becomes the regularised least-squares solution, with $\sigma^2$ as the regularisation coefficient.

MAP - summary

Input: data $\{(x_1, y_1), \dots, (x_N, y_N)\} \subset \mathbb{R}^M \times \mathbb{R}$; noise parameter $\sigma^2 \geq 0$.
Training: build the data matrix $X = (x_1^\top, \dots, x_N^\top)^\top$ and the label vector $y = (y_1, \dots, y_N)^\top$; obtain the optimum solution by maximising $p(w \mid y, X)$: $w^* = (X^\top X + \sigma^2 I)^{-1} X^\top y$.
Testing: $f(x_{\mathrm{new}}) = (w^*)^\top x_{\mathrm{new}}$.

We can now make a prediction for a newly received input $x_{\mathrm{new}}$.
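A minimal sketch of this MAP recipe, reusing the same hypothetical toy data as before (again, the code and names are assumptions for illustration, not part of the lecture), computes the regularised least-squares solution and uses it for a point prediction.

```python
import numpy as np

# Same hypothetical toy data as in the ML example above.
rng = np.random.default_rng(0)
N, M, sigma = 50, 3, 0.1
X = rng.normal(size=(N, M))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=N)

# Training: MAP with a standard Gaussian prior N(0, I) on w gives the
# regularised least-squares solution w* = (X^T X + sigma^2 I)^{-1} X^T y.
w_map = np.linalg.solve(X.T @ X + sigma**2 * np.eye(M), X.T @ y)

# Testing: point prediction f(x_new) = (w*)^T x_new for a new input.
x_new = np.array([0.2, -0.1, 1.0])
print("MAP estimate:", w_map)
print("prediction:", w_map @ x_new)
```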
Bayesian linear regression: basic idea

We can also obtain predictions by probabilistic modelling. Why do we need this? The predictive distribution can be defined as:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X) \quad (2)$

It describes the distribution of new predictions given new data points. How can we obtain it?

Optional: Marginalisation

For a given joint distribution $p(a, b)$, its marginal distribution $p(a)$ (or $p(b)$) can be obtained by integrating $b$ (or $a$) out:

$p(a) = \int p(a, b)\, db, \qquad p(b) = \int p(a, b)\, da$

Similarly, for a joint distribution $p(a, b, c)$:

$p(a, c) = \int p(a, b, c)\, db = \int p(a \mid b, c)\, p(b \mid c)\, db$

Optional: Predictive distribution

The predictive distribution can be obtained by marginalising $w$ in the product of the likelihood and the posterior:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X) = \int p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, w)\, p(w \mid y, X)\, dw$

where the likelihood and the posterior are:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, w) = N(w^\top x_{\mathrm{new}},\ \sigma^2)$
$p(w \mid y, X) = N\big((X^\top X + \sigma^2 I)^{-1} X^\top y,\ \sigma^2 (X^\top X + \sigma^2 I)^{-1}\big)$

As an example, the likelihood can be written out as:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(w^\top x_{\mathrm{new}} - y_{\mathrm{new}})^2}{2\sigma^2}\right)$

These are all in Gaussian form, so the predictive distribution $p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X)$ can also be expressed in Gaussian form:

$N\big(x_{\mathrm{new}}^\top (X^\top X + \sigma^2 I)^{-1} X^\top y,\ \ x_{\mathrm{new}}^\top \sigma^2 (X^\top X + \sigma^2 I)^{-1} x_{\mathrm{new}}\big)$

The mean of the predictive distribution is equivalent to the MAP solution:

$x_{\mathrm{new}}^\top (X^\top X + \sigma^2 I)^{-1} X^\top y = x_{\mathrm{new}}^\top w^*, \qquad w^* = (X^\top X + \sigma^2 I)^{-1} X^\top y$

The prediction is therefore a Gaussian distribution characterised by:
predictive mean: $x_{\mathrm{new}}^\top (X^\top X + \sigma^2 I)^{-1} X^\top y$
predictive variance: $x_{\mathrm{new}}^\top \sigma^2 (X^\top X + \sigma^2 I)^{-1} x_{\mathrm{new}}$

Under the i.i.d. Gaussian noise model, the predictive variance is independent of the training labels $y$.

Optional: Bayesian linear regression - summary

Input: data $\{(x_1, y_1), \dots, (x_N, y_N)\} \subset \mathbb{R}^M \times \mathbb{R}$; noise parameter $\sigma^2 \geq 0$. Construct the predictive distribution for a given input $x_{\mathrm{new}}$:

$p(y_{\mathrm{new}} \mid x_{\mathrm{new}}, y, X) = N\big(x_{\mathrm{new}}^\top A X^\top y,\ \sigma^2\, x_{\mathrm{new}}^\top A\, x_{\mathrm{new}}\big), \qquad A = (X^\top X + \sigma^2 I)^{-1}$

There is no clear distinction between training and testing stages, although $A X^\top y$ could be pre-calculated (a short code sketch of this construction follows the reading list below).

Reading list

BIS: Christopher Bishop, Pattern Recognition and Machine Learning, Sections 3.3-3.4 and Section 6.4.2.
BAR: David Barber, Bayesian Reasoning and Machine Learning, Chapter 18.
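As a closing sketch of the optional Bayesian linear regression material (not from the lecture; the toy data is hypothetical and NumPy is an assumed choice), the snippet below pre-computes $A = (X^\top X + \sigma^2 I)^{-1}$ and $A X^\top y$ once and then returns the predictive mean and variance given on the summary slide for any new input.

```python
import numpy as np

# Same hypothetical toy data as in the earlier examples.
rng = np.random.default_rng(0)
N, M, sigma = 50, 3, 0.1
X = rng.normal(size=(N, M))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=N)

# Pre-compute A = (X^T X + sigma^2 I)^{-1} and A X^T y; there is no separate
# training stage, only the construction of the predictive distribution.
A = np.linalg.inv(X.T @ X + sigma**2 * np.eye(M))
A_Xt_y = A @ X.T @ y

def predictive(x_new):
    """Mean and variance of p(y_new | x_new, y, X) as given on the slides."""
    mean = x_new @ A_Xt_y                # x_new^T A X^T y (equals x_new^T w*)
    var = sigma**2 * x_new @ A @ x_new   # sigma^2 x_new^T A x_new
    return mean, var

mean, var = predictive(np.array([0.2, -0.1, 1.0]))
print("predictive mean:", mean, "predictive variance:", var)
```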
