Biostatistics 521 Lecture 20 Multiple Regression I PDF
Xiang Zhou
Summary
This document is a lecture on multiple linear regression. It covers the key concepts, assumptions, and applications of the method, with examples on skin cancer mortality (regressed on latitude, longitude, and coastal status) and on employee salary.
Full Transcript
MULTIPLE LINEAR REGRESSION
Xiang Zhou, PhD
BIOS 521
11/16/2023

Example: Does Skin Cancer Mortality Depend on Longitude and Latitude?

[Figure: axes labeled Latitude and Longitude]

Source: Elwood JM et al, Relationship of Melanoma and other Skin Cancer Mortality to Latitude and Ultraviolet Radiation in the United States and Canada. Intern J of Epidemiology 1974.

Skin cancer mortality rates for 48 states + Washington DC from 1950–67 (counts per 100K).
Longitude and latitude of the largest city, and an indicator variable for the state touching an ocean (1 = yes, 0 = no).

Multi-Variable Scatterplot

The multi-variable scatterplot shows pairwise relationships between all variables. Do any of Latitude, Longitude, or the Ocean indicator appear to be good candidates for linear regression with Skin Cancer Mortality?

Regression of Skin Cancer Mortality on Latitude (North-South)

[Figure: Skin Cancer Mortality Rate (per 100K) vs. Latitude]

Coeff         Estimate   SE      P-Value
(Intercept)   389.18     23.81   < 2e−16
Latitude      −5.97      0.59    3.31e−13

$R^2 = 0.6798$

The p-value is significant. Since $p$ is small, we reject the null hypothesis that latitude is unrelated to skin cancer mortality rate:

$H_0: \beta_{\text{Latitude}} = 0$ vs. $H_1: \beta_{\text{Latitude}} \neq 0$
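As a worked check, and assuming the reported p-value comes from the usual t-test for a slope in simple linear regression (an assumption; the slide does not show the test statistic), the statistic can be reconstructed from the table. Here $n = 49$ is the number of observations (48 states plus Washington DC):

```latex
t = \frac{\hat{\beta}_{\text{Latitude}}}{SE(\hat{\beta}_{\text{Latitude}})}
  = \frac{-5.97}{0.59} \approx -10.12,
\qquad \text{df} = n - 2 = 49 - 2 = 47
```

A statistic this far from zero on 47 degrees of freedom is consistent with the very small p-value in the table.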
A Check of the Assumptions for Regression of Skin Cancer Mortality on Latitude

[Figures: residuals vs. predicted values; normal QQ plot of standardized residuals (observed quantiles vs. normal quantiles)]

The plots are consistent with a linear relationship and normally distributed residuals.

Interpretation: Regression of Skin Cancer Mortality on Latitude

Coeff         Estimate   SE      P-Value
(Intercept)   389.18     23.81   < 2e−16
Latitude      −5.97      0.59    3.31e−13

$R^2 = 0.6798$

The exposure variable explains 68% of the variation in the outcome variable: the latitude of a state explains 68% of the variation in skin cancer mortality rates.
The linear effect of Latitude on skin cancer mortality is highly statistically significant (p = 3.31e−13). We reject the null hypothesis $H_0: \beta_{\text{Latitude}} = 0$ vs. $H_1: \beta_{\text{Latitude}} \neq 0$. In words, we reject the hypothesis that latitude is unrelated to skin cancer mortality.
The predicted skin cancer mortality rate decreases by 5.97 per 100K for a 1 degree increase in latitude, or by 59.7 per 100K for a 10 degree increase in latitude.
The 95% CI for the effect size of latitude has the form $\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1)$: −5.97 ± 1.96 × 0.59 = (−7.12, −4.81).

Regression of Skin Cancer Mortality on Longitude (East-West)

[Figure: Skin Cancer Mortality Rate (per 100K) vs. Longitude]

Coeff         Estimate   SE      P-Value
(Intercept)   182.76     29.88   1.8e−07
Longitude     −0.32      0.32    0.316

$R^2 = 0.02137$

The p-value is not significant. Since $p = 0.316$, we fail to reject, at $\alpha = 5\%$, the null hypothesis that longitude is unrelated to skin cancer mortality rate:

$H_0: \beta_{\text{Longitude}} = 0$ vs. $H_1: \beta_{\text{Longitude}} \neq 0$
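The interval arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `wald_ci` is hypothetical, and the multiplier 1.96 is the normal-approximation value used on the slide. With the rounded table values the latitude lower bound prints as −7.13 rather than −7.12, presumably because the slide worked from unrounded estimates.

```python
def wald_ci(estimate, se, z=1.96):
    """Return (lower, upper) for the interval estimate ± z × se."""
    half_width = z * se
    return estimate - half_width, estimate + half_width


# Slope estimates and standard errors copied from the two regression tables.
slopes = {"Latitude": (-5.97, 0.59), "Longitude": (-0.32, 0.32)}

for name, (est, se) in slopes.items():
    lower, upper = wald_ci(est, se)
    excludes_zero = lower > 0 or upper < 0
    print(f"{name}: [{lower:.2f}, {upper:.2f}], excludes zero: {excludes_zero}")
```

The latitude interval lies entirely below zero while the longitude interval straddles zero, which matches the two hypothesis-test conclusions.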
Regression of Skin Cancer Mortality on Ocean Indicator

[Figure: Skin Cancer Mortality Rate (per 100K) vs. Ocean (0 = No, 1 = Yes)]

Coeff         Estimate   SE     P-Value
(Intercept)   138.74     5.72   < 2e−16
Ocean         31.48      8.54   0.000592

$R^2 = 0.2241$

The p-value is significant. At $\alpha = 5\%$, we reject the null hypothesis that a state being on the ocean is unrelated to skin cancer mortality:

$H_0: \beta_{\text{Ocean}} = 0$ vs. $H_1: \beta_{\text{Ocean}} \neq 0$

A Check of the Assumptions for Regression of Skin Cancer Mortality on Ocean Indicator

[Figures: residuals vs. predicted values; normal QQ plot of standardized residuals (observed quantiles vs. normal quantiles)]

The residual plot confirms the linear relationship, and the QQ plot confirms that the residuals are approximately normal.

Interpretation: Regression of Skin Cancer Mortality on Coastal Status

Coeff         Estimate   SE     P-Value
(Intercept)   138.74     5.72   < 2e−16     (y-intercept parameter)
Ocean         31.48      8.54   0.000592    (slope parameter)

$R^2 = 0.2241$

The difference in skin cancer mortality between coastal and non-coastal states is statistically significant (p = 0.000592).
Substituting Ocean = 0 into the equation, the predicted skin cancer mortality is 138.74 (per 100K) for a non-coastal state; this is the intercept.
Substituting Ocean = 1, the skin cancer mortality rate is 31.48 higher (per 100K) for coastal states (95% CI [14.74, 48.22]), giving 138.74 + 31.48 = 170.22.
A state being on the ocean accounts for 22% of the variability in skin cancer mortality between states.

The two-sample t-test ($H_0: \mu_0 = \mu_1$ vs. $H_1: \mu_0 \neq \mu_1$) and simple linear regression ($H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$) are equivalent tests for comparing means between the two groups ($\mu_0 = \mu_1 \iff \beta_1 = 0$).

Unadjusted Associations Between Geography and Skin Cancer Mortality

Unadjusted associations are measures or tests of the relationship between an outcome variable and an exposure variable that do not account for additional factors that may contribute to variation in the outcome (i.e. additional risk factors and/or confounders).

Table: Unadjusted associations obtained from simple linear regression with skin cancer mortality as the outcome.
Variable    Effect Size   Standard Error   P-value
Latitude    −5.97         0.59             3.31e−13
Longitude   −0.32         0.32             0.316
Ocean       31.48         8.54             0.000592

Motivating Multiple Linear Regression

A state's latitude and being on the ocean are both associated with skin cancer mortality rate.
How can I simultaneously model the effects of both variables?
How can I compute the increased effect of a state being on the ocean, adjusting for the latitude of the state?
If I adjust for a state's latitude, does being on the ocean still matter?

[Figure: axes labeled Latitude and Longitude]

$\text{Mortality} = \beta_0 + \beta_1 \times \text{Latitude} + \beta_2 \times \text{Ocean}$

Motivation for Multiple Linear Regression

Recall the Skin Cancer Mortality example from Simple Linear Regression:

[Figures: Skin Cancer Mortality Rate (per 100K) vs. Latitude, slope p = 3.31e−13; Skin Cancer Mortality Rate (per 100K) vs. Ocean (0 = No, 1 = Yes), slope p = 5.9e−4]

Multiple Linear Regression Big Picture

Simple linear regression is a very useful statistical tool. However, it limits the analysis to a single covariate (x-variable).
In practice, the outcome (y-variable) often depends on more than a single covariate, meaning simple linear regression is unable to accurately model the complexity of real associations.
Multiple linear regression is an extension of simple linear regression that models the mean value of the outcome on more than one covariate.
In this lecture we will introduce multiple linear regression with a focus on intuition, interpretation, and performing inference.

Motivation for Multiple Linear Regression

Latitude and being Coastal were both significant predictors of skin cancer mortality.
How do I choose which model to use?
Ideally, we could use both variables in the same regression model for skin cancer mortality. This would allow us to model the relationship between skin cancer mortality and both latitude and being coastal simultaneously (useful for both inference and prediction).
Multiple linear regression models a single numerical outcome variable based on multiple numerical or categorical covariates.

Multiple Linear Relationship between Y and X1, …, Xn

$E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$

$E[Y]$: expected response (or mean value, or predicted value) for $Y$ for a given set of covariate values $X_1, X_2, \dots, X_n$.
$\beta_0$: Y intercept parameter, the expected value of $Y$ when $X_1 = 0, X_2 = 0, \dots, X_n = 0$.
$\beta_1, \dots, \beta_n$: slope parameters.
$X_1, \dots, X_n$: exposures, covariates, predictors.

Multiple Linear Regression Model

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_n X_{ni} + \epsilon_i$

$Y_i$: observed response for the $i$th observation.
$\epsilon_i$: residual = random error, the amount by which the $i$th observed value differs from its predicted value.
Add in the assumption that $\epsilon_i \sim N(0, \sigma^2)$.

Intuition: Multiple Linear Regression

Assume that we already have a simple linear regression model containing a continuous outcome $Y$ and a continuous covariate $X_1$:

$E[Y] = \beta_0 + \beta_1 X_1$

What does adding a second covariate $X_2$ look like? The model has the form

$E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

The picture for the regression model depends on whether $X_2$ is continuous or dichotomous/binary/dummy (i.e. $X_2 = 0$ or $1$).

Intuition: Multiple Linear Regression

At first, ignore $X_2$ and consider the simple linear regression model between $Y$ and $X_1$.
All the points are just "black dots". The regression line goes through the center of all dots.

[Figure: Y vs. X1 with the simple linear regression line $\beta_0 + \beta_1 X_1$]

Intuition: Multiple Linear Regression

Consider the binary covariate $X_2 = 0, 1$. Now each dot can be colored blue ($X_2 = 1$) or red ($X_2 = 0$).

[Figure: Y vs. X1 with the simple linear regression line $\beta_0 + \beta_1 X_1$; points colored by $X_2 = 1$ and $X_2 = 0$]

Linear regression $E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ with continuous covariate $X_1$ and binary covariate $X_2 = 0, 1$

The regression equation is two parallel lines; they are parallel because both have the same slope $\beta_1$.
Data points with $X_2 = 0$ (red) have intercept $\beta_0$ and slope $\beta_1$.
Data points with $X_2 = 1$ (blue) have intercept $\beta_0 + \beta_2$ and slope $\beta_1$.

[Figure: the simple linear regression line $\beta_0 + \beta_1 X_1$ beside the multiple linear regression lines $\beta_0 + \beta_1 X_1$ (for $X_2 = 0$) and $\beta_0 + \beta_2 + \beta_1 X_1$ (for $X_2 = 1$)]

Intuition: Multiple Linear Regression: Two Numerical Covariates

[Figures: two-dimensional pairwise relationships among $y$, $x_1$, $x_2$; three-dimensional relationship of $Y$ on $X_1$, $X_2$]

$E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$ is a plane going through the points in 3D space.

[Figure: plane with equation $Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$; the tilts of the plane along $X_1$ and $X_2$ are labeled $\beta_1$ and $\beta_2$]

Parameter Interpretation For MLR

Model: $E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$
Fitted Model: $E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_n X_n$

The hats ( ^ ) in the Fitted Model indicate the point estimates based on data.
The intercept $\beta_0$ is the predicted value for the outcome when each $X_i = 0$.
The slope parameter $\beta_i$ is the change in predicted outcome for a one unit increase in $X_i$, holding all other covariates constant.
Predicted values are obtained by plugging in the desired values of $X_1, X_2, \dots, X_n$ (plug them in and keep them constant or fixed).

Parameter Interpretation For MLR

Suppose there are three exposure variables and the model is

$E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$

Show that $\beta_1$ is the change in predicted $Y$ for a one unit change in $X_1$ while holding $X_2$ and $X_3$ fixed.

Increase $X_1$ from $x_1$ to $x_1 + 1$ with $X_2 = x_2$ and $X_3 = x_3$ unchanged, and subtract the two predictions: $[\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 x_3] - [\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3] = \beta_1$.

In the context of MLR, keeping $X_2$ and $X_3$ constant means that we observe the effect of a one unit change in $X_1$ on the dependent variable $Y$ while not allowing $X_2$ and $X_3$ to change. We examine the relationship between $X_1$ and $Y$ without the influence of changes in $X_2$ and $X_3$, so $\beta_1$ represents the effect on $Y$ of changing $X_1$ by one unit, assuming $X_2$ and $X_3$ are not changing. In a more statistical sense, holding $X_2$ and $X_3$ constant controls for the effects of these variables, which lets us interpret $\beta_1$ as the partial derivative of $E[Y]$ with respect to $X_1$.

Parameter Interpretation For MLR

Suppose we are most interested in the effect of $X_1$ on $Y$.
The unadjusted effect of $X_1$ on $Y$ is $\hat{\beta}_1$, obtained from the simple linear model:

$E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1$

The adjusted effect of $X_1$ on $Y$, controlling for $X_2$ through $X_n$, is $\hat{\alpha}_1$, obtained from the multiple linear model:

$E[Y] = \hat{\alpha}_0 + \hat{\alpha}_1 X_1 + \hat{\alpha}_2 X_2 + \dots + \hat{\alpha}_n X_n$

Note: I am using $\alpha$'s above simply to differentiate between parameters in the two models.
Controlling = including variables in the model that are likely to affect the outcome but are not your specific exposures of interest. This helps us control for the effects of these variables.

Example: Parameter Interpretation For MLR

Suppose you are interested in modelling weight based on height in a population.
$E[\text{weight}] = \beta_0 + \beta_1 \times \text{height}$

$\beta_1$ is the unadjusted effect of height on weight. Unadjusted means this equation does not take into account the effects of other variables that affect weight.
Many factors influence an individual's weight besides height.

$E[\text{weight}] = \alpha_0 + \alpha_1 \times \text{height} + \alpha_2 \times \text{male} + \alpha_3 \times \text{age}$

$\alpha_1$ is the effect of height on weight, adjusted for age and sex.
$\alpha_1$ is the change in predicted weight for a one-unit increase in height, fixing or controlling for age and sex.

Example: Skin Cancer Mortality

We fit the models:
1. $E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Lat}} \times \text{Latitude}$
2. $E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Long}} \times \text{Longitude}$
3. $E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Ocean}} \times \text{Ocean}$

What about fitting…
$E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Lat}} \times \text{Latitude} + \hat{\beta}_{\text{Long}} \times \text{Longitude}$
$E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Lat}} \times \text{Latitude} + \hat{\beta}_{\text{Ocean}} \times \text{Ocean}$

Example: MLR of Skin Cancer Mortality on Latitude and Coastal

[Figure: Skin Cancer Mortality vs. Latitude; in the second version the points are labeled Coastal and Non-Coastal]

What do you see?
The points in the plot are colored according to whether the state is coastal or non-coastal.

Example: MLR of Skin Cancer Mortality on Latitude and Coastal

Fitted Model for Skin Cancer Mortality on Latitude & Ocean:

$\text{Mort} = 360.69 - 5.49 \times \text{Latitude} + 20.43 \times \text{Ocean}$

For Non-Coastal States (Ocean = 0): $\text{Mort} = 360.69 - 5.49 \times \text{Latitude}$
For Coastal States (Ocean = 1): $\text{Mort} = 381.12 - 5.49 \times \text{Latitude}$

[Figure: Skin Cancer Mortality vs. Latitude with parallel fitted lines for Coastal and Non-Coastal states]

Interpretation: MLR of Skin Cancer Mortality on Latitude and Coastal

Parameter     Estimate
(Intercept)   360.69
Latitude      −5.49
Ocean         20.43

Skin cancer mortality rate decreases by 5.49 for each one degree increase in latitude, for both coastal and non-coastal states. This is because the slope is the same for coastal and non-coastal states.
Skin cancer mortality rate is 20.43 higher in coastal than in non-coastal states, holding latitude fixed.
The predicted skin cancer mortality rate for a coastal state with latitude of 38 degrees is: 360.69 − 5.49 × 38 + 20.43 = 172.5
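The fitted model above can be written as a small Python function to make the "two parallel lines" reading concrete. This is an illustrative sketch using the coefficient estimates from the slide; the function name `predicted_mortality` is hypothetical.

```python
def predicted_mortality(latitude, ocean):
    """Predicted skin cancer mortality (per 100K).

    `ocean` is the 0/1 indicator: 1 for a coastal state, 0 otherwise.
    """
    return 360.69 - 5.49 * latitude + 20.43 * ocean


# Worked example from the slide: a coastal state at 38 degrees latitude.
print(f"{predicted_mortality(38, ocean=1):.2f}")

# At any fixed latitude the coastal minus non-coastal gap is the Ocean coefficient.
for latitude in (30, 38, 45):
    gap = predicted_mortality(latitude, 1) - predicted_mortality(latitude, 0)
    print(f"latitude {latitude}: gap {gap:.2f}")
```

Because the gap does not depend on latitude, the two fitted lines never cross: the Ocean coefficient only shifts the intercept.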
Comparison of the Unadjusted versus Adjusted Effects of Latitude on Skin Cancer Mortality

Unadjusted effect of Latitude on Skin Cancer Mortality:

Parameter     Estimate
(Intercept)   389.18
Latitude      −5.97

Adjusted effect of Latitude on Skin Cancer Mortality, or the effect of Latitude on Skin Cancer Mortality controlling for Ocean status:

Parameter     Estimate
(Intercept)   360.69
Latitude      −5.49
Ocean         20.43

[Figures: Skin Cancer Mortality vs. Latitude for each model]

The magnitude of the effect of Latitude on Skin Cancer Mortality decreased when Ocean was added to the model.

Example: MLR of Skin Cancer Mortality on Latitude and Longitude

[Figure: Skin Cancer Mortality vs. Latitude]

Fitted Multiple Linear Regression Model:

Predicted Skin Cancer Mortality = 400.67 − 5.93 × Latitude − 0.14 × Longitude

Interpretation: MLR of Skin Cancer Mortality on Latitude and Longitude

$E[\text{Mortality}] = 400.67 - 5.93 \times \text{Latitude} - 0.14 \times \text{Longitude}$

Parameter     Estimate
(Intercept)   400.67
Latitude      −5.93
Longitude     −0.14

Skin cancer mortality rate decreases by 5.93 for each one degree increase in latitude, holding longitude fixed.
Skin cancer mortality rate decreases by 0.14 for each one degree increase in longitude, holding latitude fixed.
The predicted skin cancer mortality rate for a state with latitude of 38 degrees and longitude of 90 degrees is: 400.67 − 5.93 × 38 − 0.14 × 90 = 162.73

Interpretation: MLR of Skin Cancer Mortality on Latitude and Longitude

Unadjusted model:

Parameter     Estimate
(Intercept)   389.18
Latitude      −5.97

The unadjusted effect of Latitude on Skin Cancer Mortality was −5.97.
The adjusted effect of Latitude on Skin Cancer Mortality, accounting for Longitude, is −5.93.
Not a very big change.
Not surprising, because Longitude did not have much effect on Skin Cancer Mortality.

Adjusted model:

Parameter     Estimate
(Intercept)   400.67
Latitude      −5.93
Longitude     −0.14

MLR for Salary

Suppose a random sample of salaries is taken for employees at a large company. You are interested in determining if there is a salary difference between men and women. You suspect that salary is strongly dependent upon employee age, so you fit a multiple linear regression model to determine the effect of sex on salary while "controlling" for employee age.

You fit the following model:

$E[\text{Salary}] = \beta_0 + \beta_{\text{Age}} \times \text{Age} + \beta_{\text{Female}} \times \text{Female}$

Parameter     Estimate
(Intercept)   1666.67
Age           1333.33
Female        −1650.00

* Numbers are totally fictional!

Based on the parameter estimates, which figure shows the general shape of the regression model?

[Figure: four panels labeled A, B, C, D, each plotting Salary against Age with one line for MEN and one for WOMEN]

Annotated answer: C. The slope must be positive, and the women's line must lie below the men's line.

$\hat{\beta}_{\text{Age}} > 0 \Rightarrow$ Salary increases with age.
$\hat{\beta}_{\text{Female}} < 0 \Rightarrow$ For a fixed age, salary for a female is lower than for a male.

The slope for both men and women is $\hat{\beta}_{\text{Age}} = 1333.33$. This is the adjusted effect of age on salary, controlling for employee sex.

The intercept for men is: $\hat{\beta}_0 = 1666.67$
The intercept for women is: $\hat{\beta}_0 + \hat{\beta}_{\text{Female}} = 1666.67 - 1650 = 16.67$

The Female indicator is 0 when substituting for a man and 1 when substituting for a woman.
Predicted salary for a 45 year old man is: $\hat{\beta}_0 + \hat{\beta}_{\text{Age}} \times 45$ = 1666.67 + 1333.33 × 45 = $61,666.52
Predicted salary for a 45 year old woman is: $\hat{\beta}_0 + \hat{\beta}_{\text{Age}} \times 45 + \hat{\beta}_{\text{Female}}$ = 1666.67 + 1333.33 × 45 − 1650 = $60,016.52

Interpretation: MLR for Salary

Salary increases with employee age: the average salary increases $1,333.33 for each one-year increase in age, controlling for sex.
Women earn an average of $1650 less than men, controlling for employee age.

Multiple Linear Regression Assumptions

Each data point $(Y_i, X_{1i}, X_{2i}, \dots, X_{ki})$ is independent.
$Y_i$ is a linear function of each $X_{ki}$.
The variance of residuals is constant across values of the covariates. That is, the variance of residuals does not depend on any covariate $X$. This is called homoskedasticity.
The residuals are normally distributed.
Just like Simple Linear Regression, residual plots and QQ plots can be used to diagnose violations of the assumptions.

Residuals in MLR

The residual $e_i$ is still just the difference between the $i$th observation and its predicted value based on the model.
The plane gives the predicted values.

[Figures: parallel fitted lines $\alpha + \beta_1 X_1$ and $\alpha + \beta_2 + \beta_1 X_1$ with a residual $e_i$ marked; fitted plane with a residual $e_i$ marked]

Inference for Multiple Linear Regression

$E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_n X_n$

1. We can test if a specific slope parameter is non-zero, given the other parameters in the model: $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$
2. Alternatively, we can test if any of the slope parameters are non-zero: $H_0: \beta_1 = \beta_2 = \dots = \beta_n = 0$ vs. $H_1$: at least one $\beta_i \neq 0$. This is called a "global" F-test.

Failing to reject the null hypothesis of the global F-test is equivalent to saying that the model $Y = \beta_0$ can explain the data as well as $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$.

Inference: MLR of Skin Cancer Mortality on Latitude and Coastal

$\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Ocean}} \times \text{Ocean}$

[Figure: Skin Cancer Mortality vs. Latitude]

Typically perform the global test first: are either Latitude or Ocean significant predictors of Skin Cancer Mortality?

Global Test: Highly significant, so reject the null hypothesis.
That is, we reject $H_0: \beta_{\text{Lat}} = \beta_{\text{Ocean}} = 0$ in favor of $H_1: \beta_{\text{Lat}} \neq 0$ and/or $\beta_{\text{Ocean}} \neq 0$. The model $\text{Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Ocean}} \times \text{Ocean}$ explains the data better than $\text{Mortality} = \beta_0$.

Inference: MLR of Skin Cancer Mortality on Latitude and Coastal

$\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Ocean}} \times \text{Ocean}$

We can also test the individual parameters.

Is the slope for Latitude non-zero when also accounting for ocean status?

$H_0: \beta_{\text{Lat}} = 0$ vs. $H_1: \beta_{\text{Lat}} \neq 0$

The p-value is highly significant, so we reject the null. There is sufficient evidence that the association between Latitude and Skin Cancer Mortality, while controlling for Ocean status, is real.

Is the change in intercept for Ocean non-zero after accounting for latitude?

$H_0: \beta_{\text{Ocean}} = 0$ vs. $H_1: \beta_{\text{Ocean}} \neq 0$

[Figures: Skin Cancer Mortality vs. Latitude]

Parameters for both Latitude and Ocean are significant, so you would want to keep both in the model.

Inference: MLR of Skin Cancer Mortality on Latitude and Longitude

Let's perform hypothesis testing on parameters in the Longitude & Latitude multiple regression model.

[Figure: Skin Cancer Mortality vs. Latitude]

Is the "tilt" of the plane in each direction significant?
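To make the global F-test described earlier concrete, here is a minimal pure-Python sketch that fits a model with two covariates by least squares and compares it with the intercept-only model. Everything in it is illustrative: the eight data points are made up (they are not the skin cancer data), the helper names are hypothetical, and in practice a statistics package would report this statistic and its p-value directly.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(covariates, y):
    """Least-squares estimates [b0, b1, ..., bp] from the normal equations."""
    design = [[1.0] + list(row) for row in covariates]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def global_f_statistic(covariates, y):
    """F statistic comparing the full model with the intercept-only model."""
    beta = fit_ols(covariates, y)
    n, p = len(y), len(covariates[0])
    fitted = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in covariates]
    mean_y = sum(y) / n
    sse_full = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sse_null = sum((yi - mean_y) ** 2 for yi in y)  # intercept-only model predicts the mean
    return ((sse_null - sse_full) / p) / (sse_full / (n - p - 1))


# Made-up toy data: (latitude-like value, 0/1 coastal-like indicator) and an outcome.
toy_covariates = [(30, 1), (32, 0), (35, 1), (38, 0), (40, 1), (42, 0), (45, 1), (47, 0)]
toy_y = [205, 180, 178, 150, 152, 128, 125, 100]

print("estimates:", [round(b, 2) for b in fit_ols(toy_covariates, toy_y)])
print("global F:", round(global_f_statistic(toy_covariates, toy_y), 1))
```

A large F statistic means the covariates reduce the residual sum of squares far more than chance would suggest; the p-value would come from an F distribution with p and n − p − 1 degrees of freedom, which this sketch does not compute.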
Example: MLR of Skin Cancer Mortality on Latitude and Longitude

$\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Long}} \times \text{Longitude}$

$H_0: \beta_{\text{Lat}} = 0$ vs. $H_1: \beta_{\text{Lat}} \neq 0$
$H_0: \beta_{\text{Long}} = 0$ vs. $H_1: \beta_{\text{Long}} \neq 0$

Reject the null hypothesis that the slope for Latitude is zero, but fail to reject the null hypothesis that the slope for Longitude is zero.
In the next lecture we will discuss deciding if parameters should be dropped from an MLR model.

MLR for Salary

$E[\text{Salary}] = \beta_0 + \beta_{\text{Age}} \times \text{Age} + \beta_{\text{Female}} \times \text{Female}$

Parameter     Estimate    P-value
(Intercept)   1666.67     < 3.3e−16
Age           1333.33     1.7e−5
Female        −1650.00    0.0042

Is there sufficient evidence to claim that women have lower salaries, after controlling for age?

Perform the hypothesis test: $H_0: \beta_{\text{Female}} = 0$ vs. $H_1: \beta_{\text{Female}} \neq 0$

Since the p-value for Female is significant and the point estimate $\hat{\beta}_{\text{Female}} < 0$, there is sufficient evidence to claim women have lower salaries than men, even when controlling for employee age.
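The salary model can be sketched the same way, as a prediction function with a 0/1 Female indicator. This is an illustration only: the function and constant names are hypothetical, and the coefficients are the fictional estimates from the table.

```python
# Fictional coefficient estimates from the salary table.
INTERCEPT = 1666.67
BETA_AGE = 1333.33
BETA_FEMALE = -1650.00


def predicted_salary(age, female):
    """Predicted salary; `female` is the 0/1 indicator (1 = woman, 0 = man)."""
    return INTERCEPT + BETA_AGE * age + BETA_FEMALE * female


for age in (30, 45, 60):
    man = predicted_salary(age, female=0)
    woman = predicted_salary(age, female=1)
    print(f"age {age}: man {man:,.2f}, woman {woman:,.2f}, gap {woman - man:,.2f}")
```

With these rounded estimates the function returns about 61,666.52 for a 45 year old man and 60,016.52 for a 45 year old woman. The gap is the Female coefficient at every age, which is what "controlling for age" means in this additive model.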