Machine Learning Lecture 4: Linear Regression PDF
Document Details
Uploaded by ConciliatoryOnyx6758
Technical University of Munich
2024
Dr. Leo Schwinn
Summary
These lecture slides introduce linear regression: linear models, basis functions, the least squares loss and its closed-form solution, regularization (ridge regression), the bias-variance tradeoff, and the probabilistic view (maximum likelihood, maximum a posteriori, and the full Bayesian treatment), with worked examples from the Technical University of Munich's Machine Learning course.
Full Transcript
Machine Learning, Lecture 4: Linear Regression
Dr. Leo Schwinn, Data Analytics and Machine Learning, Technical University of Munich, Winter term 2024/2025

Notation
$x$: scalar (lowercase, not bold)
$\mathbf{x}$: vector (lowercase, bold)
$\mathbf{X}$: matrix (uppercase, bold)
$f(\mathbf{x})$: predicted value for input $\mathbf{x}$
$\mathbf{y}$: vector of targets
$y_i$: target of the $i$-th example
$w_0$: bias term (not to be confused with bias in general)
$\phi(\cdot)$: basis function
$E(\cdot)$: error function
$\mathcal{D}$: training data
$X^\dagger$: Moore-Penrose pseudoinverse of $X$
There is no special symbol for vectors or matrices augmented by the bias term $w_0$; assume it is always included.

Section 1: Basic Linear Regression

Example: Housing price prediction
Given is a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ of house areas $x_i$ and corresponding prices $y_i$. How do we estimate the price of a new house with area $x_{\text{new}}$?

Regression problem
Given observations $X = \{x_1, x_2, \ldots, x_N\}$, $x_i \in \mathbb{R}^D$, and targets $y = \{y_1, y_2, \ldots, y_N\}$, $y_i \in \mathbb{R}$, find a mapping $f(\cdot)$ from inputs to targets such that $y_i \approx f(x_i)$. (A common way to represent the samples is as a data matrix $X \in \mathbb{R}^{N \times D}$, where each row represents one sample.)

Linear model
The target $y$ is generated by a deterministic function $f$ of $x$ plus noise:
$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \beta^{-1})$  (1)
Let's choose $f(x)$ to be a linear function:
$f_w(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \ldots + w_D x_{iD} = w_0 + w^T x_i$  (2, 3)

Absorbing the bias term
The linear function is given by $f_w(x) = w_0 + w_1 x_1 + \ldots + w_D x_D = w_0 + w^T x$ (4, 5). Here $w_0$ is called the bias or offset term. For simplicity, we can "absorb" it by prepending a 1 to the feature vector $x$ and adding $w_0$ to the weight vector $w$:
$\tilde{x} = (1, x_1, \ldots, x_D)^T, \quad \tilde{w} = (w_0, w_1, \ldots, w_D)^T$
The function $f_w$ can then be written compactly as $f_w(x) = \tilde{w}^T \tilde{x}$. To unclutter the notation, we will from now on assume the bias term is absorbed and write $w$ and $x$ instead of $\tilde{w}$ and $\tilde{x}$.

Loss function
Now, how do we choose the "best" $w$ that fits our data? A loss function measures the "misfit" or error between our model (parametrized by $w$) and the observed data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$. The standard choice is least squares (LS):
$E_{LS}(w) = \frac{1}{2} \sum_{i=1}^N (f_w(x_i) - y_i)^2 = \frac{1}{2} \sum_{i=1}^N (w^T x_i - y_i)^2$  (6, 7)
The factor $\frac{1}{2}$ is for later convenience.

Objective
Find the optimal weight vector $w^*$ that minimizes the error:
$w^* = \arg\min_w E_{LS}(w) = \arg\min_w \frac{1}{2} \sum_{i=1}^N (x_i^T w - y_i)^2$  (8, 9)
By stacking the observations $x_i$ as rows of the matrix $X \in \mathbb{R}^{N \times D}$:
$w^* = \arg\min_w \frac{1}{2} (Xw - y)^T (Xw - y)$  (10)

Optimal solution
To find the minimum of the loss $E(w)$, compute the gradient $\nabla_w E(w)$:
$\nabla_w E_{LS}(w) = \nabla_w \frac{1}{2} (Xw - y)^T (Xw - y) = \frac{1}{2} \nabla_w \left( w^T X^T X w - 2 w^T X^T y + y^T y \right) = X^T X w - X^T y$  (11-13)
See Equations (69) and (81) of the Matrix Cookbook for details. A numerical check of this gradient is sketched below.
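As a sanity check on this derivation, here is a minimal NumPy sketch (not part of the slides) that compares the analytic gradient $X^T X w - X^T y$ against finite differences; the toy data, sizes, and seed are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: N = 50 samples, D = 3 features (bias column already absorbed).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

def loss(w):
    """Least squares error E_LS(w) = 1/2 * ||Xw - y||^2 (Equation 10)."""
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    """Analytic gradient from the slides: X^T X w - X^T y (Equation 13)."""
    return X.T @ X @ w - X.T @ y

# Central finite-difference check of the analytic gradient at a random point.
w = rng.normal(size=3)
eps = 1e-6
num = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(num, grad(w), atol=1e-4))  # True
```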
Optimal solution (continued)
Now set the gradient to zero and solve for $w$ to obtain the minimizer (this is a minimum because the Hessian $\nabla_w \nabla_w E(w)$ is positive (semi)definite; see the Optimization lecture):
$X^T X w - X^T y = 0$  (14)
This leads to the so-called normal equation of the least squares problem:
$w^* = (X^T X)^{-1} X^T y = X^\dagger y$  (15)
$X^\dagger$ is called the Moore-Penrose pseudoinverse of $X$ (because for an invertible square matrix, $X^\dagger = X^{-1}$).

Nonlinear dependency in data
What if the dependency between $y$ and $x$ is not linear? [Figure: noisy samples from the data generating process $y_i = \sin(2\pi x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \beta^{-1})$, plotted over $x \in [0, 1]$.] For this example assume that the data dimensionality is $D = 1$.

Polynomials
Solution: polynomials are universal function approximators, so for 1-dimensional $x$ we can define $f$ as
$f_w(x) = w_0 + \sum_{j=1}^M w_j x^j$  (16)
or, more generally,
$f_w(x) = w_0 + \sum_{j=1}^M w_j \phi_j(x)$  (17)
Defining $\phi_0 \equiv 1$, this becomes $f_w(x) = w^T \phi(x)$  (18).
The function $f$ is still linear in $w$ (despite not being linear in $x$)!

Typical basis functions
Polynomials: $\phi_j(x) = x^j$
Gaussian: $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
Logistic sigmoid: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + e^{-a}}$

Linear basis function model
For $d$-dimensional data $x$: $\phi_j : \mathbb{R}^d \to \mathbb{R}$. The prediction for one sample is
$f_w(x) = w_0 + \sum_{j=1}^M w_j \phi_j(x) = w^T \phi(x)$  (19)
Using the same least squares error function as before:
$E_{LS}(w) = \frac{1}{2} \sum_{i=1}^N \left( w^T \phi(x_i) - y_i \right)^2 = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$  (20)
with the design matrix
$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_M(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_M(x_N) \end{pmatrix} \in \mathbb{R}^{N \times (M+1)}$

Optimal solution
Recall the final form of the least squares loss for the original feature matrix $X$, $E_{LS}(w) = \frac{1}{2}(Xw - y)^T(Xw - y)$, and compare it to the expression with the design matrix $\Phi \in \mathbb{R}^{N \times (M+1)}$:
$E_{LS}(w) = \frac{1}{2} (\Phi w - y)^T (\Phi w - y)$  (21)
This means the optimal weights $w^*$ can be obtained in the same way:
$w^* = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^\dagger y$  (22)
Compare this to Equation 15: $w^* = (X^T X)^{-1} X^T y = X^\dagger y$  (23)

Choosing the degree of the polynomial
How do we choose the degree of the polynomial $M$? [Figure: polynomial fits of degree $M = 0, 1, 3, 9$ to the noisy sinusoidal data; $M = 0$ and $M = 1$ underfit, $M = 3$ fits well, $M = 9$ passes through every training point but oscillates wildly.] One valid solution is to choose $M$ using the standard train-validation split approach: fit each candidate degree on the training set and pick the degree with the lowest validation error, as sketched below.
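The following NumPy sketch (an illustration, not from the slides) builds the polynomial design matrix, fits each degree with the pseudoinverse from Equation 22, and picks $M$ by validation error. The toy data, split sizes, and helper names (`design_matrix`, `fit`) are assumptions made for the example.

```python
import numpy as np

# Toy 1-D data from the slides' generating process y = sin(2*pi*x) + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def design_matrix(x, M):
    """Polynomial design matrix Phi with columns phi_j(x) = x**j, j = 0..M."""
    return np.vander(x, M + 1, increasing=True)

def fit(x, y, M):
    """Least squares fit w* = pinv(Phi) @ y (Equation 22)."""
    return np.linalg.pinv(design_matrix(x, M)) @ y

# Train-validation split to select the degree M.
idx = rng.permutation(len(x))
train, val = idx[:20], idx[20:]
val_errors = {}
for M in range(10):
    w = fit(x[train], y[train], M)
    resid = design_matrix(x[val], M) @ w - y[val]
    val_errors[M] = 0.5 * resid @ resid
best_M = min(val_errors, key=val_errors.get)
print(best_M, val_errors[best_M])
```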
We also make another observation: overfitting occurs when the coefficients $w$ become large, which causes the fitted curve to oscillate. What if we penalize large weights?

Controlling overfitting with regularization
Minimize the least squares loss with $L_2$ regularization (also called ridge regression):
$E_{\text{ridge}}(w) = \frac{1}{2} \sum_{i=1}^N \left( w^T \phi(x_i) - y_i \right)^2 + \frac{\lambda}{2} \|w\|_2^2$  (24)
where $\|w\|_2^2 \equiv w^T w = w_0^2 + w_1^2 + w_2^2 + \cdots + w_M^2$ is the squared $L_2$ norm of $w$, and $\lambda$ is the regularization strength. [Figure: fits with $M = 9$ for $\lambda = 10^{-8}$ (large weights, rapid oscillation) and $\lambda = 1$ (small weights, smooth curve).] A larger regularization strength $\lambda$ leads to smaller weights $w$.

Bias-variance tradeoff
The error of an estimator can be decomposed into two parts (see Bishop, Section 3.2, for a more rigorous mathematical derivation):
Bias: expected error due to model mismatch.
Variance: variation due to randomness in the training data.
[Figure: dartboard analogy; the center of the target is the true model that predicts the correct values, and the different hits (blue dots) are different realizations of the model given different training data.]

Bias-variance tradeoff: high bias
In the case of high bias, the model is too rigid to fit the underlying data distribution. This typically happens if the model is misspecified and/or the regularization strength $\lambda$ is too high. [Figure: fit with $\lambda = 1$; the curve is nearly flat and misses the sinusoidal structure.]

Bias-variance tradeoff: high variance
In the case of high variance, the model is too flexible and therefore captures noise in the data. This is exactly what we call overfitting. It typically happens when the model has high capacity (it "memorizes" the training data) and/or $\lambda$ is too low. [Figure: fit with $\lambda = 0$; the weights are large and the curve oscillates through every training point.]

Bias-variance tradeoff
Of course, we want models that have both low bias and low variance, but these are often conflicting goals. A popular technique is to select a model with large capacity (e.g. a high-degree polynomial) and keep the variance in check by choosing an appropriate regularization strength $\lambda$. [Figure: fit with $M = 9$, $\lambda = 10^{-8}$.]

Bias-variance tradeoff in the case of unregularized least squares regression ($\lambda = 0$). [Figure: upper-left panel from https://eissanematollahi.com/wp-content/uploads/2018/09/Machine-Learning-Basics-1.pdf.] A ridge fit that implements this regularization recipe is sketched below.
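A minimal ridge-regression sketch: the slides state only the loss (Equation 24), so the closed form $(\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$ used here is an assumption that follows from setting the gradient of Equation 24 to zero; the toy data and the helper name `fit_ridge` are likewise illustrative.

```python
import numpy as np

# Same sinusoidal toy data as in the previous sketch.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def design_matrix(x, M):
    """Polynomial design matrix with columns x**j, j = 0..M."""
    return np.vander(x, M + 1, increasing=True)

def fit_ridge(x, y, M, lam):
    """Minimizer of Equation 24: w = (Phi^T Phi + lambda * I)^{-1} Phi^T y."""
    Phi = design_matrix(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ y)

# High-capacity model (M = 9); lambda controls the size of the weights.
for lam in (1e-8, 1e-3, 1.0):
    w = fit_ridge(x, y, M=9, lam=lam)
    print(f"lambda={lam:g}  ||w||_2={np.linalg.norm(w):.2f}")
```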
Correlation
[Figure: scatter plot with least squares fit $f(x) = 0.018x + 13.43$.] The weights $w_i$ can be interpreted as the strength of the (linear) relationship between feature $x_i$ and $y$. A weight of 0.018 still indicates a strong correlation here, considering the different scales of the variables. With actual data, you would normalize the data to handle the different scales of $X$ and $y$, and would then find a weight of about 1.

Correlation vs. Causation
Correlation does not imply causation! Putting more people in the pool does not increase the air temperature. Be aware of confounding variables!

Section 2: Probabilistic Linear Regression
In the following section, we will use probabilistic graphical models. If you do not know them yet, watch our separate Introduction to PGMs video.

Probabilistic formulation of linear regression
Remember from our problem definition at the start of the lecture: $y_i = f_w(x_i) + \epsilon_i$. The noise has a zero-mean Gaussian distribution with a fixed precision $\beta = \frac{1}{\sigma^2}$: $\epsilon_i \sim \mathcal{N}(0, \beta^{-1})$. This implies that the distribution of the targets is $y_i \sim \mathcal{N}(f_w(x_i), \beta^{-1})$. Remember: the function can be represented as $f_w(x_i) = w^T \phi(x_i)$.

Maximum likelihood
Likelihood of a single sample:
$p(y_i \mid f_w(x_i), \beta) = \mathcal{N}(y_i \mid f_w(x_i), \beta^{-1})$  (25)
Assuming the samples are drawn independently, the likelihood of the entire dataset is
$p(y \mid X, w, \beta) = \prod_{i=1}^N p(y_i \mid f_w(x_i), \beta)$  (26)
We can now use the same approach as in the previous lecture: maximize the likelihood w.r.t. $w$ and $\beta$:
$w_{ML}, \beta_{ML} = \arg\max_{w, \beta} p(y \mid X, w, \beta)$  (27)

Maximum likelihood
As in the coin flip example, we can make a few simplifications (the logarithm is a monotonic function):
$w_{ML}, \beta_{ML} = \arg\max_{w, \beta} p(y \mid X, w, \beta) = \arg\max_{w, \beta} \ln p(y \mid X, w, \beta) = \arg\min_{w, \beta} -\ln p(y \mid X, w, \beta)$  (28-30)
Let's denote this quantity as the maximum likelihood error function that we need to minimize:
$E_{ML}(w, \beta) = -\ln p(y \mid X, w, \beta)$  (31)

Maximum likelihood
Simplify the error function:
$E_{ML}(w, \beta) = -\ln \prod_{i=1}^N \mathcal{N}(y_i \mid f_w(x_i), \beta^{-1}) = -\sum_{i=1}^N \ln \sqrt{\tfrac{\beta}{2\pi}} \exp\!\left( -\tfrac{\beta}{2} (w^T \phi(x_i) - y_i)^2 \right) = \frac{\beta}{2} \sum_{i=1}^N (w^T \phi(x_i) - y_i)^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln 2\pi$  (32-35)

Optimizing the log-likelihood w.r.t. w
$w_{ML} = \arg\min_w E_{ML}(w, \beta) = \arg\min_w \frac{\beta}{2} \sum_{i=1}^N (w^T \phi(x_i) - y_i)^2 - \underbrace{\tfrac{N}{2} \ln \beta + \tfrac{N}{2} \ln 2\pi}_{=\,\text{const}} = \arg\min_w \frac{1}{2} \sum_{i=1}^N (w^T \phi(x_i) - y_i)^2 = \arg\min_w E_{LS}(w)$  (36-39)
The last expression is the least squares error function: maximizing the likelihood is equivalent to minimizing the least squares error!
$w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T y = \Phi^\dagger y$  (40)

Optimizing the log-likelihood w.r.t. β
Plug in the estimate for $w$ and minimize w.r.t. $\beta$:
$\beta_{ML} = \arg\min_\beta E_{ML}(w_{ML}, \beta) = \arg\min_\beta \frac{\beta}{2} \sum_{i=1}^N (w_{ML}^T \phi(x_i) - y_i)^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln 2\pi$  (41, 42)
Take the derivative w.r.t. $\beta$ and set it to zero:
$\frac{\partial}{\partial \beta} E_{ML}(w_{ML}, \beta) = \frac{1}{2} \sum_{i=1}^N (w_{ML}^T \phi(x_i) - y_i)^2 - \frac{N}{2\beta} = 0$  (43)
Solving for $\beta$:
$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{i=1}^N (w_{ML}^T \phi(x_i) - y_i)^2$  (44)
A short numerical check of these two estimates follows below.
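A short numerical check of the two maximum likelihood estimates (illustrative, not from the slides): it fits $w_{ML}$ with the pseudoinverse (Equation 40) and recovers the noise precision via Equation 44 on synthetic data with a known $\beta$. The Gaussian basis centers, width, and sample size are assumptions for the example.

```python
import numpy as np

# Hypothetical setup: Gaussian basis functions on data from y = sin(2*pi*x) + noise.
rng = np.random.default_rng(2)
beta_true = 25.0                        # noise precision used to generate the data
x = rng.uniform(0, 1, size=100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=beta_true ** -0.5, size=x.shape)

def gaussian_design(x, centers, s=0.1):
    """Design matrix with phi_0 = 1 and Gaussian basis functions phi_j."""
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

Phi = gaussian_design(x, centers=np.linspace(0, 1, 9))

# Maximum likelihood estimates (Equations 40 and 44).
w_ml = np.linalg.pinv(Phi) @ y
resid = Phi @ w_ml - y
beta_ml = 1.0 / np.mean(resid ** 2)
print(beta_ml)  # should roughly recover beta_true = 25
```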
Posterior distribution
Recall from Lecture 3 that the MLE leads to overfitting, especially when little training data is available. The solution is to consider the posterior distribution instead:
$p(w \mid X, y, \beta, \alpha) = \frac{\overbrace{p(y \mid X, w, \beta)}^{\text{likelihood}} \cdot \overbrace{p(w \mid \alpha)}^{\text{prior}}}{\underbrace{p(y \mid X, \beta, \alpha)}_{\text{normalizing constant}}} \propto p(y \mid X, w, \beta) \cdot p(w \mid \alpha)$  (45, 46)
The precision $\beta = 1/\sigma^2$ is treated as a known parameter to simplify the calculations.
Connection to the coin flip example:
Coin flip: training data $\mathcal{D} = X$, likelihood $p(X \mid \theta)$, prior $p(\theta \mid a, b)$, posterior $p(\theta \mid X)$.
Regression: training data $\mathcal{D} = \{X, y\}$, likelihood $p(y \mid X, w, \beta)$, prior $p(w \mid \alpha)$, posterior $p(w \mid X, y, \beta, \alpha)$.
How do we choose the prior $p(w \mid \alpha)$?

Prior for w
We set the prior over $w$ to an isotropic multivariate normal distribution with zero mean:
$p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{M/2} \exp\!\left( -\frac{\alpha}{2} w^T w \right)$  (47)
where $\alpha$ is the precision of the distribution and $M$ is the number of elements in the vector $w$.
Motivation: higher probability is assigned to small values of $w$, which prevents overfitting (recall slide 20), and since the likelihood is also Gaussian, the calculations are simplified.

Maximum a posteriori (MAP)
We are looking for the $w$ that corresponds to the mode of the posterior:
$w_{MAP} = \arg\max_w p(w \mid X, y, \alpha, \beta) = \arg\max_w \ln p(y \mid X, w, \beta) + \ln p(w \mid \alpha) - \underbrace{\ln p(y \mid X, \beta, \alpha)}_{=\,\text{const}} = \arg\min_w -\ln p(y \mid X, w, \beta) - \ln p(w \mid \alpha)$  (48-50)
Similar to ML, define the MAP error function based on the negative log-posterior:
$E_{MAP}(w) = -\ln p(y \mid X, w, \beta) - \ln p(w \mid \alpha)$  (51)
We ignore the constant terms in the error function, as they are independent of $w$.

MAP error function
Simplify the error function:
$E_{MAP}(w) = -\ln p(y \mid X, w, \beta) - \ln p(w \mid \alpha) = \frac{\beta}{2} \sum_{i=1}^N (w^T \phi(x_i) - y_i)^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln 2\pi - \frac{M}{2} \ln \frac{\alpha}{2\pi} + \frac{\alpha}{2} w^T w = \frac{\beta}{2} \sum_{i=1}^N (w^T \phi(x_i) - y_i)^2 + \frac{\alpha}{2} \|w\|_2^2 + \text{const} \propto \frac{1}{2} \sum_{i=1}^N (w^T \phi(x_i) - y_i)^2 + \frac{\lambda}{2} \|w\|_2^2 + \text{const}, \quad \text{where } \lambda = \frac{\alpha}{\beta}$  (52)
This is the ridge regression error function: $E_{MAP}(w) \propto E_{\text{ridge}}(w) + \text{const}$. MAP estimation with a Gaussian prior is equivalent to ridge regression!

Full Bayesian approach
Instead of representing $p(w \mid \mathcal{D})$ with the point estimate $w_{MAP}$, we can compute the full posterior distribution
$p(w \mid \mathcal{D}) \propto p(y \mid X, w, \beta)\, p(w \mid \alpha)$  (53)
Since both the likelihood and the prior are Gaussian, the posterior is as well (the Gaussian distribution is a conjugate prior of itself):
$p(w \mid \mathcal{D}) = \mathcal{N}(w \mid \mu, \Sigma), \quad \text{where } \mu = \beta \Sigma \Phi^T y \text{ and } \Sigma^{-1} = \alpha I + \beta \Phi^T \Phi$
Observations:
The posterior is Gaussian, so its mode is the mean and $w_{MAP} = \mu$.
In the limit of an infinitely broad prior ($\alpha \to 0$), $w_{MAP} \to w_{ML}$.
For $N = 0$, i.e. no data points, the posterior equals the prior.
Even though we assume an isotropic prior $p(w)$, the posterior covariance is in general not diagonal.
A small numerical sketch of this posterior follows below.
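A small NumPy sketch of this posterior (a toy example, not from the slides): it computes $\mu$ and $\Sigma$ as given above for a degree-9 polynomial model and verifies that the posterior mean coincides with the ridge solution for $\lambda = \alpha / \beta$. The data, $\alpha$, and $\beta$ are assumed values.

```python
import numpy as np

# Hypothetical setup: posterior over the weights of a degree-9 polynomial model.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=25)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

alpha, beta = 2.0, 25.0                      # prior and noise precision (assumed known)
Phi = np.vander(x, 10, increasing=True)      # design matrix with M + 1 = 10 columns

# Posterior p(w | D) = N(w | mu, Sigma) with Sigma^{-1} = alpha*I + beta*Phi^T Phi
# and mu = beta * Sigma * Phi^T y (slide "Full Bayesian approach").
Sigma = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
mu = beta * Sigma @ Phi.T @ y

# The posterior mean equals the ridge regression solution with lambda = alpha / beta.
w_ridge = np.linalg.solve(Phi.T @ Phi + (alpha / beta) * np.eye(Phi.shape[1]), Phi.T @ y)
print(np.allclose(mu, w_ridge))  # True
```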
Predicting for new data: MLE and MAP
After observing data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, we can compute the MLE/MAP. Usually, however, what we are actually interested in is the prediction $\hat{y}_{\text{new}}$ for a new data point $x_{\text{new}}$; the model parameters $w$ are just a means to achieve this. Recall that $y \sim \mathcal{N}(f_w(x), \beta^{-1})$ and that we assume $\beta$ to be known a priori (for simplified calculations). Plugging in the estimated parameters gives a predictive distribution that lets us make a prediction $\hat{y}_{\text{new}}$ for new data $x_{\text{new}}$.
Maximum likelihood ($w_{ML}$ and $\beta_{ML}$):
$p(\hat{y}_{\text{new}} \mid x_{\text{new}}, w_{ML}, \beta_{ML}) = \mathcal{N}\!\left( \hat{y}_{\text{new}} \mid w_{ML}^T \phi(x_{\text{new}}), \beta_{ML}^{-1} \right)$  (54)
Maximum a posteriori ($w_{MAP}$):
$p(\hat{y}_{\text{new}} \mid x_{\text{new}}, w_{MAP}, \beta) = \mathcal{N}\!\left( \hat{y}_{\text{new}} \mid w_{MAP}^T \phi(x_{\text{new}}), \beta^{-1} \right)$  (55)

Posterior predictive distribution
Alternatively, we can use the full posterior distribution $p(w \mid \mathcal{D})$. This allows us to compute the posterior predictive distribution:
$p(\hat{y}_{\text{new}} \mid x_{\text{new}}, \mathcal{D}) = \int p(\hat{y}_{\text{new}}, w \mid x_{\text{new}}, \mathcal{D})\, dw = \int p(\hat{y}_{\text{new}} \mid x_{\text{new}}, w)\, p(w \mid \mathcal{D})\, dw = \mathcal{N}\!\left( \hat{y}_{\text{new}} \mid \mu^T \phi(x_{\text{new}}),\; \beta^{-1} + \phi(x_{\text{new}})^T \Sigma\, \phi(x_{\text{new}}) \right)$
Advantage: we get a more accurate estimate of the uncertainty in the prediction (i.e. the variance of the Gaussian, which now also depends on the input $x_{\text{new}}$).

Example of posterior predictive distribution
[Figure: four panels showing the posterior predictive distribution as more observations arrive. Green: underlying function, blue: observations, dark red: mode, light red: variance.]

Summary
Optimization-based approaches to regression have probabilistic interpretations:
Least squares regression corresponds to maximum likelihood (slide 33).
Ridge regression corresponds to maximum a posteriori (slide 39).
Even nonlinear dependencies in the data can be captured by a model that is linear w.r.t. the weights $w$ (slide 13).
Penalizing large weights helps to reduce overfitting (slide 20).
The full Bayesian approach gives us data-dependent uncertainty estimates (slide 42).

Reading material
Main reading: "Pattern Recognition and Machine Learning" by Bishop [ch. 1.1, 3.1, 3.2, 3.3.1, 3.3.2, 3.6]
Extra reading: "Machine Learning: A Probabilistic Perspective" by Murphy [ch. 7.2-7.3, 7.5.1, 7.6.1, 7.6.2]
Slides are based on an older version by G. Jensen and C. Osendorfer. Some figures are from Bishop's "Pattern Recognition and Machine Learning".