JBM050 Statistical Computing Lecture 8: Linear Regression PDF


Document Details

Uploaded by DexterousObsidian9043

Technische Universiteit Eindhoven

Tags

linear regression, statistical computing, optimization for statistics, data analysis

Summary

This document contains lecture notes on linear regression from a course on statistical computing (JBM050). The lecture covers an introduction to linear regression, basic optimization procedures, the workflow of regression modelling, and simple and multiple linear regression models.

Full Transcript


JBM050 Statistical Computing, Q1 2024-2025
Lecture 8: Linear regression

Part III: Optimization for statistics
- III.0. Introduction
- III.1. Basic optimization procedures
- III.2. Linear regression
  ◼ Simple/univariate linear regression
  ◼ Multiple linear regression

Today's lecture
- Regression analysis
  ◼ Introduction
  ◼ Workflow of regression modelling
    ▪ Step 1. The data generating model
    ▪ Step 2. Formulating the objective function: ML and LS
    ▪ Step 3. Solving the optimization problem
    ▪ Step 4. Implementation
    ▪ Step 5. Model selection (see next lecture)

III.2. Linear regression
- Material: Chapter 3 of "An Introduction to Statistical Learning" (ISL), written by James, Witten, Hastie, & Tibshirani and freely available from https://www.statlearning.com/

III.2.1. Simple linear regression
- The simple linear regression model
  ◼ Is a supervised learning method: we have an outcome variable that we want to predict using one or more other variables
  ◼ Notation:
    ▪ $Y$ is the outcome (or dependent variable)
    ▪ $X$ is the predictor (or independent variable)
    ▪ In regression analysis $Y$ is quantitative
    ▪ We will mainly discuss the case of quantitative $X$
  ◼ So, the central idea is that $Y$ depends on $X$; or, $X$ contains information that is useful to say something about $Y$
(ISL, Section 3.1, pp. 61-71)

Linear relation
- Linear regression builds on the assumption that there is an approximate linear relation between $X$ and $Y$:
  $Y \approx f(X) = \beta_0 + \beta_1 X$
- Can you think of any real examples of a pair of variables $(X, Y)$ where values on $Y$ are exactly/approximately linearly dependent on $X$?

Toy example
- Assume the following relation between birth rate ($Y$) and number of storks per km² ($X$):
  $y_i \approx \beta_0 + \beta_1 x_i = 1.4 + 10 x_i$, that is, $\beta_0 = 1.4$ and $\beta_1 = 10$
- So, let us estimate the birth rate for several places based on the number of storks:

  Place       X: Storks   Y: Birth rate
  Antwerp     1           ?
  Eindhoven   2           ?
  Tilburg     3           ?
  Veluwe      4           ?

Exact linear relation
- $Y = \beta_0 + \beta_1 X$

  Place       X: Storks   Y: Birth rate
  Antwerp     1           11.4
  Eindhoven   2           21.4
  Tilburg     3           31.4
  Veluwe      4           41.4

  [Figure: birth rate as a function of the number of storks per km²; all points lie exactly on the line $Y = \beta_0 + \beta_1 X$.]

Approximate linear relation
- $Y = \beta_0 + \beta_1 X + \varepsilon = \hat{Y} + \varepsilon$

  Place       X: Storks   Y: Birth rate   Ŷ: Estimate
  Antwerp     1           11.4            11.4
  Eindhoven   2           21.4            21.4
  Tilburg     3           31.4            31.4
  Veluwe      4           41.4            41.4
  Reeshof     2           25              21.4
  Weert       1           5               11.4
  Rotterdam   1           15              11.4
  Den Haag    3           25              31.4

  [Figure: birth rate as a function of the number of storks per km²; the observed points scatter around the regression line.]

Simple linear regression
- The simple/univariate linear regression model is given by
  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
  with $\beta_0$ called the intercept (or offset) and $\beta_1$ the regression weight (or slope). $\varepsilon_i$ is the residual or error made in estimating $y_i$ by $\hat{y}_i$ (with $\hat{y}_i = \beta_0 + \beta_1 x_i$).

Interpretation of the univariate linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
- $\beta_0$ tells you the estimate for $x_i = 0$: $\hat{y}_i = \beta_0 + \beta_1 (0) = \beta_0$
- $\beta_1$ tells you how much the estimate increases ($\beta_1 > 0$, positive slope) or decreases ($\beta_1 < 0$, negative slope) when $x$ increases by one unit: if $x_2 = x_1 + 1$, then
  $\hat{y}_2 = \beta_0 + \beta_1 x_2 = \beta_0 + \beta_1 (x_1 + 1) = (\beta_0 + \beta_1 x_1) + \beta_1 = \hat{y}_1 + \beta_1$

Applied to the toy example
- $\hat{y}_i = \beta_0 + \beta_1 x_i = 1.4 + 10 x_i$
  ◼ Value of the intercept? Value of the regression weight?
  ◼ If there are no storks, what is the estimated birth rate?
  ◼ If a region has 10 storks more per km² than Antwerp, how many more babies do you expect for that region than for Antwerp? (See the sketch below.)
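To make the toy example concrete, here is a minimal sketch (mine, not part of the original slides) that evaluates the assumed model $\hat{y}_i = 1.4 + 10 x_i$ for the places listed above; the place names and stork counts are taken from the table, everything else is illustration.

```python
import numpy as np

# Assumed toy model from the slides: birth rate = 1.4 + 10 * (storks per km^2)
beta0, beta1 = 1.4, 10.0

places = ["Antwerp", "Eindhoven", "Tilburg", "Veluwe"]
storks = np.array([1, 2, 3, 4])

# Estimated birth rate for each place
y_hat = beta0 + beta1 * storks

for place, x, y in zip(places, storks, y_hat):
    print(f"{place:10s} storks/km² = {x}  estimated birth rate = {y:.1f}")

# Increasing x by 10 increases the estimate by 10 * beta1 = 100,
# which answers the 'ten more storks than Antwerp' question.
```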
Step 1. Univariate linear regression model
- Stochastic: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon \sim N(0, \sigma^2)$ and corresponding density
  $f(y_i \mid x_i, \beta_0, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right)$
- Deterministic: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon$ assumed independent from $X$ and $E(\varepsilon) = 0$
- In both cases (stochastic and deterministic): $\hat{y}_i = \beta_0 + \beta_1 x_i$. Note the difference between the observed score $y_i$ and the estimated score $\hat{y}_i$ (hat; no error). In other words, we assume a linear relation between $X$ and $Y$.

Visualized: Stochastic model
[Figure; image retrieved from http://www.seaturtle.org/mtn/archives/mtn122/mtn122p1.shtml]

Visualized: Deterministic model
[Figure] Shows:
- the line 'closest' to all points, i.e. the line with minimal average squared distance of the observed scores to the regression line
- the dependence of Y on X
- that noise is assumed only on Y (and not on X)

Step 2+3. Setting up the objective function and estimating the model parameters
- Can be done using either least squares or maximum likelihood
- First maximum likelihood, then least squares

1. Deriving the ML estimators
- Maximum likelihood approach to estimating the simple regression problem
  ◼ Explicit distributional assumptions: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon \sim N(0, \sigma^2)$, equivalent to $Y \sim N(\beta_0 + \beta_1 X, \sigma^2)$
  ◼ Normality of $Y$ around the regression line
  ◼ Constant variability $\sigma^2$ (homoscedastic noise)
- Density: $f(y_i \mid \beta_0, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right)$

- Maximize the likelihood $L(\beta_0, \beta_1)$ or, equivalently, the log-likelihood $l(\beta_0, \beta_1)$:
  $l(\beta_0, \beta_1) = \ln L(\beta_0, \beta_1) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$
- 1. Estimating the intercept
  ◼ First derivative: $l'(\beta_0) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)$
  ◼ Hence the maximum is obtained for
    $\hat{\beta}_0 = \frac{1}{n}\sum_i y_i - \frac{1}{n}\sum_i \beta_1 x_i = \bar{y} - \beta_1 \bar{x}$
- 2. Estimating the regression weight
  ◼ Substitute $\beta_0$ by $\bar{y} - \beta_1 \bar{x}$:
    $l(\beta_1) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y} + \beta_1\bar{x} - \beta_1 x_i)^2$
  ◼ First derivative: $l'(\beta_1) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})\bigl(y_i - \bar{y} - \beta_1(x_i - \bar{x})\bigr)$
  ◼ Hence the maximum is obtained for
    $\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

2. Deriving the LS estimators
- Least squares approach
  ◼ Minimize $Q(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$
  ◼ 1. Intercept: first derivative $Q'(\beta_0) = -2\sum_i (y_i - \beta_0 - \beta_1 x_i)$
  ◼ Hence $\hat{\beta}_0 = \frac{1}{n}\sum_i y_i - \frac{1}{n}\sum_i \beta_1 x_i = \bar{y} - \beta_1 \bar{x}$
  ◼ 2. Regression weight: first derivative
    $Q'(\beta_1) = \left(\sum_i (y_i - \bar{y} + \beta_1\bar{x} - \beta_1 x_i)^2\right)' = -2\sum_{i=1}^{n}(x_i - \bar{x})\bigl(y_i - \bar{y} - \beta_1(x_i - \bar{x})\bigr)$
  ◼ Hence $\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$
- Again, maximizing the likelihood is equivalent to minimizing the residual sum of squares
- Note: with $s_{xy}$ the sample covariance between X and Y, i.e. $s_{xy} = r_{xy} s_x s_y$,
  $\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{(n-1)\,s_{xy}}{(n-1)\,s_x^2} = r_{xy}\frac{s_y}{s_x}$
- So for standardized variables $\hat{\beta}_1 = r_{xy}$. (A numerical check of these formulas is sketched below.)
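As a quick numerical check of the estimators derived above, the following sketch (my own illustration, not from the slides; the data are made up) computes $\hat{\beta}_1$ and $\hat{\beta}_0$ with the closed-form formulas and verifies that $\hat{\beta}_1$ equals $r_{xy}\, s_y / s_x$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustrative data roughly following y = 1.4 + 10 x + noise
x = rng.uniform(0, 5, size=50)
y = 1.4 + 10 * x + rng.normal(0, 2, size=50)

# Closed-form LS / ML estimators from the derivation above
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Equivalent expression via the correlation and the standard deviations
r_xy = np.corrcoef(x, y)[0, 1]
beta1_via_r = r_xy * y.std(ddof=1) / x.std(ddof=1)

print(beta0_hat, beta1_hat)                # close to 1.4 and 10
print(np.isclose(beta1_hat, beta1_via_r))  # True: the two expressions agree
```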
http://guessthecorrelation.com/

Some issues with (simple) linear regression
- Regression analysis is a very popular tool for data analysis
- But it has its limits
  ◼ There is more than linear relations
  ◼ Highly sensitive to influential outliers
  ◼ Correlation ≠ causation
- So use it wisely

Zero correlation ≠ no relation
- What does it mean when the regression weight is equal to zero? ALWAYS make scatter plots!!

Outliers: Influential observations
- Guess the correlation.
- A single influential observation can pull the fitted line strongly => Always (!) make a scatter plot

Correlation ≠ Causation
Source: https://www.ibpsychmatters.com/why-correlation-is-not-causation

III.2.2. Multiple linear regression
- Simple linear regression is too limited
- Often multiple (many, ultra-many) predictors are available
- Multiple linear regression
  ◼ Predict the outcome $Y$ on the basis of multiple ($p$) predictors $X_1, X_2, \ldots, X_p$
  ◼ Example: predicting wage ($Y$) on the basis of age ($X_1$), year ($X_2$), and education level ($X_3$)
  ◼ Example: predicting study/job success on the basis of the five personality traits

Step 1. Multiple linear regression model
- How to learn a prediction rule from the data $X_1, X_2, \ldots, X_p, Y$?
- Statistical approach: take a sample, assume a data generating model containing the parameters of interest, and introduce an objective function
  ◼ The multiple linear regression model, deterministic:
    $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$
    with $\varepsilon$ assumed independent from $X$ and $E(\varepsilon) = 0$
  ◼ The multiple linear regression model, stochastic:
    $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$
    with $\varepsilon \sim N(0, \sigma^2)$

Step 2+3. Estimating the regression model (least squares)
- To find estimates of the regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ we introduce the least squares criterion:
  $\underset{\beta_0, \beta_1, \ldots, \beta_p}{\operatorname{argmin}} \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip})\bigr)^2$
  ◼ Note that $\sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})\bigr)^2 = \sum_{i=1}^{n} \varepsilon_i^2$; this is the residual sum of squares
  ◼ The solution can be found as usual: find values $\hat{\beta}_0, \ldots, \hat{\beta}_p$ for which the partial first derivatives are equal to zero
  ◼ Or not so 'usual' ...?

Multiple linear regression
- Solution to the LS objective: find values $\hat{\beta}_0, \ldots, \hat{\beta}_p$ for which the partial first derivatives are equal to zero:
  $Q'(\beta_0) = -2\sum_i \bigl(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}\bigr) = 0$
  $Q'(\beta_1) = \left(\sum_i \bigl(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}\bigr)^2\right)' = -2\sum_i x_{i1}\bigl(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}\bigr) = 0$
  $\vdots$
  $Q'(\beta_p) = -2\sum_i x_{ip}\bigl(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}\bigr) = 0$
  => Solve the system of equations

Multiple linear regression
- Solving the multiple linear regression problem implies solving a system of equations ... which is rather involved
- What in case of a huge number (e.g., $p > 1000$) of variables?
- Matrices and linear algebra to the rescue!
- Check data mining; the next slides show the main steps

Multiple linear regression
- Rewriting the regression equations with matrix notation:
  $y_1 \approx \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p}$
  $\ldots$
  $y_i \approx \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$
  $\ldots$
  $y_n \approx \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np}$
- This can be expressed as (check this!) $\mathbf{y} \approx \mathbf{X}\boldsymbol{\beta}$ with $\mathbf{y}$, $\mathbf{X}$ and $\boldsymbol{\beta}$ as on the next slide

Multiple linear regression
- For
  $\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & \vdots & x_{ij} & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$
  we have $\mathbf{y} \approx \mathbf{X}\boldsymbol{\beta}$. Note the column of ones in $\mathbf{X}$! (A small sketch constructing such a design matrix follows below.)
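To make the matrix notation concrete, here is a minimal sketch (not from the slides; the numbers are made up) that stacks a column of ones next to the predictor columns and checks that $\mathbf{X}\boldsymbol{\beta}$ reproduces $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$ for all observations at once.

```python
import numpy as np

# Made-up data: n = 5 observations, p = 2 predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.5, 1.0, 0.0, 1.5, 2.0])

# Design matrix X: a leading column of ones for the intercept,
# followed by one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])
print(X.shape)  # (5, 3): n rows, p + 1 columns

# With beta = (beta0, beta1, beta2)', the product X @ beta gives
# beta0 + beta1 * x1 + beta2 * x2 for every observation at once
beta = np.array([1.4, 10.0, -2.0])
print(X @ beta)
print(1.4 + 10.0 * x1 - 2.0 * x2)  # same values, computed element-wise
```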
Multiple linear regression
- The objective function then becomes, in matrix notation,
  $\underset{\boldsymbol{\beta}}{\operatorname{argmin}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ with $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$
- We also check this:
  $\mathbf{y} - \mathbf{X}\boldsymbol{\beta} = \begin{pmatrix} y_1 - (\beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p}) \\ y_2 - (\beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p}) \\ \vdots \\ y_n - (\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np}) \end{pmatrix} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$
  and thus
  $Q(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = \sum_{i=1}^{n} \varepsilon_i^2$
  This is the sum of squared residuals.

- Hence, the problem of solving the system of linear equations is to find $\boldsymbol{\beta}$ such that
  $-2(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{X} = \mathbf{0}$
- Using basic matrix operations:
  $-2(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{X} = \mathbf{0}$
  $\mathbf{y}'\mathbf{X} = \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}$
  $\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
  (note that $\mathbf{X}'\mathbf{X}$ is a square matrix; if $n > p$ and the predictors are not perfectly correlated, it is also of full rank, so that the inverse exists)
- This gives the solution to the multiple linear regression problem:
  $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
  (A numerical sketch of this closed-form solution is given at the end of this transcript.)

Illustration: BEP project on the Dutch FFT
- Background setting of this bachelor-end-project (BEP)
  ◼ Motivation: improve early diagnosis of dementia by detecting abnormal cognitive decline early on
  ◼ Useful screening tool: the Dutch Famous Faces Test
  ◼ Data science problem: develop personalized norms for predicting worrisome cognitive decline (in contrast to the usual clinical practice of using the same norms for everyone)
- Reference: van den Elzen, E. H., Brehmer, Y., Van Deun, K., & Mark, R. E. (2023). Stimulus material selection for the Dutch famous faces test for older adults. Frontiers in Medicine, 10.

[Example test item: "Please fill in the name of this person below. Spelling and special characters are not important."]

- Organization of the data (338 respondents aged 60 or older)
  ◼ Outcome: score for each of the 220 pictures indicating the level of correctness of the name (completely wrong = 0, partially correct = 0.5, fully correct = 1)
  ◼ Predictors:
    ▪ Famous person characteristics: national or international, category (sports, film/theatre, singer/music, politics)
    ▪ Respondent characteristics: age, gender, interests (sports, film/theatre, singer/music, politics, ice skating, cycling, football), several scales (e.g., loneliness, health, cognitive failure)

Exploration: Co-clustering of faces and respondents
[Figure: proportion of correct responses per co-cluster (rows = respondents, columns = faces). Result obtained by BEP student Hanna Broszczak.]

Multiple linear regression analysis
- Linear regression model with the following predictors:
  ◼ Age respondent: continuous
  ◼ Sex respondent: binary with 1 = 'Male', 0 = 'Female'
  ◼ Respondent's interest in film/theatre: continuous
  ◼ Category of the picture: film/theatre, sports, musician/singer, politics (dummy coding; film/theatre = reference)
  ◼ Interaction effect between interest in film/theatre and category of the picture

  $\hat{y}_{pers,ff} = \beta_0 + \beta_{age}\, x_{age,pers} + \beta_{sex}\, x_{sex,pers} + \beta_{int}\, x_{int\_FT,pers} + \ldots + \beta_{sports}\, x_{sports,pict} + \beta_{sports\,pict \times int\_FT}\, x_{sports,pict} \times x_{int\_FT,pers}$

- Note: the interaction effect implies a different regression weight for the interest in film/theatre per category of the picture (sports, film/theatre, politics, singer/musician)
- Result obtained by BEP student Daantje Crooymans

Interpretation?
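As a closing sketch (my own illustration, not the BEP analysis or its data), the closed-form solution $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ can be computed directly with NumPy. The example also includes a dummy-coded category and an interaction term in the spirit of the model above, and compares the result with np.linalg.lstsq, which solves the same least-squares problem in a numerically more stable way.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Made-up predictors: a continuous 'interest' score and a binary 'sports picture' dummy
interest = rng.normal(0, 1, n)
sports_pict = rng.integers(0, 2, n)      # 1 = sports picture, 0 = reference category
interaction = interest * sports_pict     # allows a different slope of interest per category

# Made-up outcome generated from known coefficients plus noise
y = 0.6 + 0.2 * interest - 0.1 * sports_pict + 0.15 * interaction + rng.normal(0, 0.3, n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), interest, sports_pict, interaction])

# Closed-form least-squares solution from the normal equations
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.lstsq avoids forming the explicit inverse and is preferred in practice
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)                            # close to (0.6, 0.2, -0.1, 0.15)
print(np.allclose(beta_hat, beta_lstsq))   # True: both give the same solution
```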
