Biostatistics 521 Lecture 20: Multiple Regression I
Xiang Zhou
Summary
This document is a lecture on multiple linear regression. It discusses the key concepts, assumptions, and applications of the method. The lecture includes examples that regress skin cancer mortality on latitude, longitude, and coastal status, and an example that models employee salary.
Full Transcript
MULTIPLE LINEAR REGRESSION
Xiang Zhou, PhD
BIOS 521
11/16/2023

Example: Does Skin Cancer Mortality Depend on Longitude and Latitude?
[Figure: diagrams labeled Latitude and Longitude]

Does Skin Cancer Mortality Depend on Longitude and Latitude?
Source: Elwood JM et al. Relationship of Melanoma and other Skin Cancer Mortality to Latitude and Ultraviolet Radiation in the United States and Canada. Intern J of Epidemiology 1974.
- Skin cancer mortality rates for 48 states + Washington DC from 1950–67 (counts per 100K)
- Longitude and latitude of the largest city, and an indicator variable for the state touching an ocean (1 = yes, 0 = no)

Multi-Variable Scatterplot
The multi-variable scatterplot shows pairwise relationships between all variables. Do any of Latitude, Longitude or the Ocean indicator appear to be good candidates for linear regression with Skin Cancer Mortality?
[Figure: scatterplot matrix of Skin Cancer Mortality, Latitude, Longitude and Ocean]

Regression of Skin Cancer Mortality on Latitude (North-South)
[Figure: Skin Cancer Mortality Rate (per 100K) versus Latitude]

Coeff        Estimate   SE      P-Value
(Intercept)  389.18     23.81   < 2e-16
Latitude     −5.97      0.59    3.31e-13
\(R^2 = 0.6798\)

The p-value is significant. Since p is small, we reject the null hypothesis that latitude is unrelated to skin cancer mortality rate:
\(H_0: \beta_{\text{Latitude}} = 0\) versus \(H_1: \beta_{\text{Latitude}} \neq 0\)

A Check of the Assumptions for Regression of Skin Cancer Mortality on Latitude
[Figures: residuals versus predicted values; normal QQ plot of standardized residuals (observed quantiles versus normal quantiles)]
Annotation: the plots support the linearity and normality assumptions.

Interpretation: Regression of Skin Cancer Mortality on Latitude

Coeff        Estimate   SE      P-Value
(Intercept)  389.18     23.81   < 2e-16
Latitude     −5.97      0.59    3.31e-13
\(R^2 = 0.6798\)

- The linear effect of Latitude on skin cancer mortality is highly statistically significant (p = 3.31e-13). We reject the null hypothesis \(H_0: \beta_{\text{Latitude}} = 0\) in favor of \(H_1: \beta_{\text{Latitude}} \neq 0\). In words, we reject the hypothesis that latitude is unrelated to skin cancer mortality.
- The predicted skin cancer mortality rate decreases by 5.97 per 100K for a 1 degree increase in latitude, or by 59.7 per 100K for a 10 degree increase in latitude.
- The 95% CI for the effect size of latitude is \(\hat{\beta}_1 \pm 1.96 \times SE = -5.97 \pm 1.96 \times 0.59 = (-7.13, -4.81)\).
- The exposure variable explains 68% of the variation in the outcome variable: the latitude of a state explains 68% of the variation in skin cancer mortality rates.

Regression of Skin Cancer Mortality on Longitude (East-West)
[Figure: Skin Cancer Mortality Rate (per 100K) versus Longitude]

Coeff        Estimate   SE      P-Value
(Intercept)  182.76     29.88   1.8e-07
Longitude    −0.32      0.32    0.316
\(R^2 = 0.02137\)

The p-value is not significant. Since p = 0.316, we fail to reject, at \(\alpha = 5\%\), the null hypothesis that longitude is unrelated to skin cancer mortality rate:
\(H_0: \beta_{\text{Longitude}} = 0\) versus \(H_1: \beta_{\text{Longitude}} \neq 0\)
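
The interval and significance statements for latitude and longitude follow from the reported estimates and standard errors alone. The following is a minimal Python sketch, assuming the normal-approximation multiplier 1.96 used on the slide (an exact interval would use a t quantile with n − 2 degrees of freedom and would be slightly wider). The function names `wald_interval` and `t_statistic` are illustrative and do not come from the lecture.

```python
def wald_interval(estimate, se, z=1.96):
    """Approximate 95% CI: estimate +/- z * SE (z = 1.96 is a normal approximation)."""
    half_width = z * se
    return estimate - half_width, estimate + half_width


def t_statistic(estimate, se):
    """Test statistic for H0: beta = 0, i.e. the estimate divided by its standard error."""
    return estimate / se


# Estimates and standard errors as reported in the two regression tables.
latitude_ci = wald_interval(-5.97, 0.59)
longitude_ci = wald_interval(-0.32, 0.32)

print("Latitude  95% CI:", tuple(round(v, 2) for v in latitude_ci))
print("Longitude 95% CI:", tuple(round(v, 2) for v in longitude_ci))
print("Latitude  t:", round(t_statistic(-5.97, 0.59), 2))
print("Longitude t:", round(t_statistic(-0.32, 0.32), 2))
```

The latitude interval excludes 0 while the longitude interval contains it, which matches the reject and fail-to-reject conclusions stated for the two regressions.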

Regression of Skin Cancer Mortality on Ocean Indicator
[Figure: Skin Cancer Mortality Rate (per 100K) versus Ocean (0 = No, 1 = Yes)]

Coeff        Estimate   SE      P-Value
(Intercept)  138.74     5.72    < 2e-16
Ocean        31.48      8.54    0.000592
\(R^2 = 0.2241\)

The p-value is significant. At \(\alpha = 5\%\), we reject the null hypothesis that a state being on the ocean is unrelated to skin cancer mortality:
\(H_0: \beta_{\text{Ocean}} = 0\) versus \(H_1: \beta_{\text{Ocean}} \neq 0\)

A Check of the Assumptions for Regression of Skin Cancer Mortality on Ocean Indicator
[Figures: residuals versus predicted values; normal QQ plot of standardized residuals (observed quantiles versus normal quantiles)]
Annotation: the residual plot supports the linearity assumption and the QQ plot supports the normality assumption.

Interpretation: Regression of Skin Cancer Mortality on Coastal

Coeff        Estimate   SE      P-Value
(Intercept)  138.74     5.72    < 2e-16     (y-intercept parameter)
Ocean        31.48      8.54    0.000592    (slope parameter)
\(R^2 = 0.2241\)

- The difference in skin cancer mortality is statistically significant between coastal and non-coastal states (p = 0.000592).
- Predicted skin cancer mortality is 138.74 (per 100K) for a non-coastal state. This is the intercept, obtained by substituting Ocean = 0 in the equation.
- The skin cancer mortality rate is 31.48 higher (per 100K) for coastal states (95% CI [14.74, 48.22]). This is obtained by substituting Ocean = 1 in the equation.
- A state being on the ocean accounts for 22% of the variability in skin cancer mortality between states.

The two-sample t-test (\(H_0: \mu_0 = \mu_1\) versus \(H_1: \mu_0 \neq \mu_1\)) and simple linear regression (\(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\)) are equivalent tests for comparing means between the two groups, because \(\mu_0 = \mu_1 \iff \beta_1 = 0\).

Unadjusted Associations Between Geography and Skin Cancer Mortality
Unadjusted associations are measures or tests of the relationship between an outcome variable and an exposure variable that do not account for additional factors that may contribute to variation in the outcome (i.e.
additional risk factors and/or confounders).

Table: Unadjusted associations obtained from simple linear regression with skin cancer mortality as the outcome.

Variable    Effect Size   Standard Error   P-value
Latitude    −5.97         0.59             3.31e-13
Longitude   −0.32         0.32             0.316
Ocean       31.48         8.54             0.000592

Motivating Multiple Linear Regression
A state's latitude and being on the ocean are both associated with skin cancer mortality rate.
- How can I simultaneously model the effects of both variables?
- How can I compute the increased effect of a state being on the ocean, adjusting for the latitude of the state?
- If I adjust for a state's latitude, does being on the ocean still matter?
\(\text{Mortality} = \beta_0 + \beta_1 \times \text{Latitude} + \beta_2 \times \text{Ocean}\)
[Figure: diagrams labeled Latitude and Longitude]

Motivation for Multiple Linear Regression
Recall the Skin Cancer Mortality example from Simple Linear Regression:
[Figures: Skin Cancer Mortality Rate (per 100K) versus Latitude (slope p = 3.31e-13) and versus Ocean (0 = No, 1 = Yes) (slope p = 5.9e-4)]

Multiple Linear Regression Big Picture
Simple linear regression is a very useful statistical tool. However, it limits the analysis to a single covariate (x-variable). In practice, the outcome (y-variable) often depends on more than a single covariate, meaning simple linear regression is unable to accurately model the complexity of real associations.
Multiple linear regression is an extension of simple linear regression that models the mean value of the outcome as a function of more than one covariate.
In this lecture we will introduce multiple linear regression with a focus on intuition, interpretation and performing inference.

Motivation for Multiple Linear Regression
- Latitude and being Coastal were both significant predictors of skin cancer mortality
- How do I choose which model to use?
- Ideally, we could use both variables in the same regression model for skin cancer mortality
- This would allow us to model the relationship between skin cancer mortality and both latitude and being coastal simultaneously (useful for both inference and prediction)
- Multiple linear regression models a single numerical outcome variable based on multiple numerical or categorical covariate variables

Multiple Linear Relationship between Y and X1, …, Xn
\(E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n\)
- \(E[Y]\): expected response (or mean value, or predicted value) for Y for a given set of covariate values \(X_1, X_2, \dots, X_n\)
- \(\beta_0\): Y-intercept parameter, the expected value of Y when \(X_1 = 0, X_2 = 0, \dots, X_n = 0\)
- \(\beta_1, \dots, \beta_n\): slope parameters
- \(X_1, \dots, X_n\): exposures, covariates, predictors

Multiple Linear Regression Model
\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_n X_{ni} + \varepsilon_i\)
- \(Y_i\): observed response for the ith observation
- \(\varepsilon_i\): residual = random error, the amount by which the ith observed value differs from its predicted value
- Add in the assumption that \(\varepsilon_i \sim N(0, \sigma^2)\)

Intuition: Multiple Linear Regression
Assume that we already have a simple linear regression model containing a continuous outcome \(Y\) and a continuous covariate \(X_1\):
\(E[Y] = \beta_0 + \beta_1 X_1\)
What does adding a second covariate \(X_2\) look like? The model has the form
\(E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)
The picture for the regression model depends on whether \(X_2\) is continuous or dichotomous/binary/dummy (i.e. \(X_2 = 0\) or \(1\)).

Intuition: Multiple Linear Regression
At first, ignore X2 and consider the simple linear regression model between Y and X1.
- All the points are just "black dots"
- The regression line goes through the center of all dots
[Figure: scatterplot of Y versus X1 with the simple linear regression line \(\beta_0 + \beta X_1\)]

Intuition: Multiple Linear Regression
Consider the binary covariate \(X_2 = 0, 1\). Now each dot can be colored blue (\(X_2 = 1\)) or red (\(X_2 = 0\)).
[Figure: the same scatterplot with points colored by X2 and the simple linear regression line \(\beta_0 + \beta X_1\)]

Linear regression \(E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2\) with continuous covariate \(X_1\) and binary covariate \(X_2 = 0, 1\)
The regression equation is two parallel lines, each with the same slope \(\beta_1\). They are parallel because they have the same slope.
- Data points with \(X_2 = 0\) (red) have intercept \(\beta_0\) and slope \(\beta_1\)
- Data points with \(X_2 = 1\) (blue) have intercept \(\beta_0 + \beta_2\) and slope \(\beta_1\)
[Figure: Y versus X1 showing the multiple linear regression lines \(\beta_0 + \beta_2 + \beta_1 X_1\) (for \(X_2 = 1\)) and \(\beta_0 + \beta_1 X_1\) (for \(X_2 = 0\)), alongside the simple linear regression line \(\beta_0 + \beta X_1\)]

Intuition: Multiple Linear Regression, Two Numerical Covariates
[Figures: two-dimensional pairwise relationships (y versus x1, y versus x2) and the three-dimensional relationship of Y on X1, X2]
\(E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2\) is a plane going through the points in 3D space.
[Figure: plane with equation \(Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2\); the slopes of the plane in the X1 and X2 directions are labeled \(\beta_1\) and \(\beta_2\)]

Parameter Interpretation For MLR
Model: \(E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n\)
Fitted Model: \(E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_n X_n\)
- The hats ( ^ ) in the Fitted Model indicate the point estimates based on data
- The intercept \(\beta_0\) is the predicted value for the outcome when each \(X_j = 0\)
- The slope parameter \(\beta_j\) is the change in predicted outcome for a one unit increase in \(X_j\), holding all other covariates constant
- Predicted values are obtained by plugging in the desired values of \(X_1, X_2, \dots, X_n\)

Parameter Interpretation For MLR
Suppose there are three exposure variables and the model is
\(E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\)
Show that \(\beta_1\) is the change in predicted Y for a one unit change in \(X_1\) while holding \(X_2\) and \(X_3\) fixed.
Fix \(X_2 = x_2\) and \(X_3 = x_3\), and compare the predicted values at \(X_1 = x_1 + 1\) and \(X_1 = x_1\):
\([\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 x_3] - [\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3] = \beta_1\)
Annotation: in the context of MLR, keeping X2 and X3 constant means that we observe the effect of a one unit change in X1 on the dependent variable Y while not allowing X2 and X3 to change. \(\beta_1\) then represents the effect on Y of changing X1 by one unit, assuming X2 and X3 do not change. In a more statistical sense, holding X2 and X3 constant controls for the effects of these variables, so \(\beta_1\) can be interpreted as the partial derivative of \(E[Y]\) with respect to X1.

Parameter Interpretation For MLR
Suppose we are most interested in the effect of \(X_1\) on \(Y\).
- The unadjusted effect of \(X_1\) on \(Y\) is \(\hat{\beta}_1\), obtained from the simple linear model: \(E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1\)
- The adjusted effect of \(X_1\) on \(Y\), controlling for \(X_2\) through \(X_n\), is \(\hat{\alpha}_1\), obtained from the multiple linear model: \(E[Y] = \hat{\alpha}_0 + \hat{\alpha}_1 X_1 + \hat{\alpha}_2 X_2 + \dots + \hat{\alpha}_n X_n\)
- Note: I am using \(\alpha\)'s above simply to differentiate between parameters in the two models
- Controlling = including variables in the model that are likely to affect the outcome but are not your specific exposures of interest. This helps us control for the effects of these variables.

Example: Parameter Interpretation For MLR
Suppose you are interested in modelling weight based on height in a population.
\(E[\text{weight}] = \beta_0 + \beta_1 \times \text{height}\)
- \(\beta_1\) is the unadjusted effect of height on weight. Unadjusted means we have not taken into consideration the effects of other variables that affect weight in this equation.
Many factors influence an individual's weight besides height.
\(E[\text{weight}] = \alpha_0 + \alpha_1 \times \text{height} + \alpha_2 \times \text{age} + \alpha_3 \times \text{sex}\)
- \(\alpha_1\) is the effect of height on weight, adjusted for age and sex
- \(\alpha_1\) is the change in predicted weight for a one-unit increase in height, fixing or controlling for age and sex

Example: Skin Cancer Mortality
We fit the models:
1. \(E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Lat}} \times \text{Latitude}\)
2. \(E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Long}} \times \text{Longitude}\)
3. \(E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Ocean}} \times \text{Ocean}\)
What about fitting…
\(E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Lat}} \times \text{Latitude} + \hat{\beta}_{\text{Long}} \times \text{Longitude}\)
\(E[\text{Mort}] = \hat{\beta}_0 + \hat{\beta}_{\text{Lat}} \times \text{Latitude} + \hat{\beta}_{\text{Ocean}} \times \text{Ocean}\)

Example: MLR of Skin Cancer Mortality on Latitude and Coastal
[Figure: Skin Cancer Mortality versus Latitude]
[Figure: Skin Cancer Mortality versus Latitude, with coastal and non-coastal states distinguished]
What do you see?
Annotation: the points in the plot are colored by whether the state is coastal or non-coastal.

Example: MLR of Skin Cancer Mortality on Latitude and Coastal
Fitted model for Skin Cancer Mortality on Latitude and Ocean:
\(\text{Mort} = 360.69 - 5.49 \times \text{Latitude} + 20.43 \times \text{Ocean}\)
- For non-coastal states (Ocean = 0): \(\text{Mort} = 360.69 - 5.49 \times \text{Latitude}\)
- For coastal states (Ocean = 1): \(\text{Mort} = 381.12 - 5.49 \times \text{Latitude}\)
[Figure: Skin Cancer Mortality versus Latitude with parallel fitted lines for coastal and non-coastal states]

Interpretation: MLR of Skin Cancer Mortality on Latitude and Coastal

Parameter    Estimate
(Intercept)  360.69
Latitude     −5.49
Ocean        20.43

- Skin cancer mortality rate decreases by 5.49 (per 100K) for each one degree increase in latitude, for both coastal and non-coastal states. This is because the slope is the same for coastal and non-coastal states.
- Skin cancer mortality rate is 20.43 (per 100K) higher in coastal than in non-coastal states, holding latitude fixed.
- The predicted skin cancer mortality rate for a coastal state with latitude of 38 degrees is \(360.69 - 5.49 \times 38 + 20.43 = 172.5\).
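
The fitted model for mortality on Latitude and Ocean can be evaluated directly. The following is a minimal Python sketch using the coefficient estimates reported on the slide; the function name `predict_mortality` is an illustrative choice, not part of the lecture.

```python
# Coefficient estimates from the fitted model for mortality on Latitude and Ocean.
INTERCEPT = 360.69
B_LATITUDE = -5.49
B_OCEAN = 20.43


def predict_mortality(latitude, ocean):
    """Predicted skin cancer mortality (per 100K); ocean is 1 for coastal, 0 otherwise."""
    return INTERCEPT + B_LATITUDE * latitude + B_OCEAN * ocean


coastal_38 = predict_mortality(38, 1)
inland_38 = predict_mortality(38, 0)

print(round(coastal_38, 2))              # coastal state at 38 degrees latitude
print(round(coastal_38 - inland_38, 2))  # vertical gap between the two parallel lines
```

Because Ocean enters the model only through an added constant, the gap between the coastal and non-coastal predictions is the same at every latitude, which is what "parallel lines" means here.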

Comparison of the Unadjusted versus Adjusted Effects of Latitude on Skin Cancer Mortality
[Figures: Skin Cancer Mortality versus Latitude for the unadjusted and the adjusted model]

Unadjusted effect of Latitude on Skin Cancer Mortality:

Parameter    Estimate
(Intercept)  389.18
Latitude     −5.97

Adjusted effect of Latitude on Skin Cancer Mortality, or the effect of Latitude on Skin Cancer Mortality controlling for Ocean status:

Parameter    Estimate
(Intercept)  360.69
Latitude     −5.49
Ocean        20.43

- The magnitude of the effect of Latitude on Skin Cancer Mortality decreased when Ocean was added to the model.

Example: MLR of Skin Cancer Mortality on Latitude and Longitude
[Figure: Skin Cancer Mortality versus Latitude]
Fitted multiple linear regression model:
Predicted Skin Cancer Mortality = 400.67 − 5.93 × Latitude − 0.14 × Longitude

Interpretation: MLR of Skin Cancer Mortality on Latitude and Longitude
\(E[\text{Mortality}] = 400.67 - 5.93 \times \text{Latitude} - 0.14 \times \text{Longitude}\)

Parameter    Estimate
(Intercept)  400.67
Latitude     −5.93
Longitude    −0.14

- Skin cancer mortality rate decreases by 5.93 for each one degree increase in latitude, holding longitude fixed
- Skin cancer mortality rate decreases by 0.14 for each one degree increase in longitude, holding latitude fixed
- The predicted skin cancer mortality rate for a state with latitude of 38 degrees and longitude of 90 degrees is \(400.67 - 5.93 \times 38 - 0.14 \times 90 = 162.73\)

Interpretation: MLR of Skin Cancer Mortality on Latitude and Longitude
The unadjusted effect of Latitude on Skin Cancer Mortality was −5.97 (simple regression estimates: (Intercept) 389.18; Latitude −5.97).
The adjusted effect of Latitude on Skin Cancer Mortality, accounting for Longitude, is −5.93.
- Not a very big change.
- Not surprising, because Longitude did not have much effect on Skin Cancer Mortality (adjusted model estimates: (Intercept) 400.67; Latitude −5.93; Longitude −0.14).

MLR for Salary
Suppose a random sample of salaries is taken for employees at a large company. You are interested in determining if there is a salary difference between men and women. You suspect that salary is strongly dependent upon employee age, so you fit a multiple linear regression model to determine the effect of sex on salary while "controlling" for employee age. You fit the following model:
\(E[\text{Salary}] = \beta_0 + \beta_{\text{Age}} \times \text{Age} + \beta_{\text{Female}} \times \text{Female}\)

Parameter    Estimate
(Intercept)  1666.67
Age          1333.33
Female       −1650.00
* Numbers are totally fictional!

Based on the MLR model and the parameter estimates above, which figure shows the general shape of the regression model?
[Figures A, B, C, D: Salary versus Age, each with one line for men and one line for women]
Annotation: the answer is C. The slope for Age is positive, and the women's line must lie below the men's line.

MLR for Salary
- \(\hat{\beta}_{\text{Age}} > 0\) ⇒ Salary increases with age
- \(\hat{\beta}_{\text{Female}} < 0\) ⇒ For a fixed age, salary for a female is lower than for a male
[Figure: Salary versus Age]

MLR for Salary
The slope for both men and women is \(\hat{\beta}_{\text{Age}} = 1333.33\). This is the adjusted effect of age on salary, controlling for employee sex.

MLR for Salary
The intercept for men is \(\hat{\beta}_0 = 1666.67\).
The intercept for women is \(\hat{\beta}_0 + \hat{\beta}_{\text{Female}} = 1666.67 - 1650 = 16.67\).

MLR for Salary
Annotation: the Female indicator is 0 when substituting for a man and 1 when substituting for a woman.
Predicted salary for a 45 year old man is:
\(\hat{\beta}_0 + \hat{\beta}_{\text{Age}} \times 45 = 1666.67 + 1333.33 \times 45 = 61{,}666.52\) dollars
Predicted salary for a 45 year old woman is:
\(\hat{\beta}_0 + \hat{\beta}_{\text{Age}} \times 45 + \hat{\beta}_{\text{Female}} = 1666.67 + 1333.33 \times 45 - 1650 = 60{,}016.52\) dollars

Interpretation: MLR for Salary
\(E[\text{Salary}] = \beta_0 + \beta_{\text{Age}} \times \text{Age} + \beta_{\text{Female}} \times \text{Female}\)

Parameter    Estimate
(Intercept)  1666.67
Age          1333.33
Female       −1650.00

- Salary increases with employee age: the average salary increases $1,333.33 for each one-year increase in age
- Women earn an average of $1,650 less than men, controlling for employee age

Multiple Linear Regression Assumptions
- Each data point \((Y_i, X_{1i}, X_{2i}, \dots, X_{ni})\) is independent
- \(Y_i\) is a linear function of each \(X_{ji}\)
- The variance of the residuals is constant across values of the covariates. That is, the variance of the residuals does not depend on any covariate X. This is called homoskedasticity.
- The residuals are normally distributed
- Just like Simple Linear Regression, residual plots and QQ plots can be used to diagnose violations of the assumptions

Residuals in MLR
The residual \(\varepsilon_i\) is still just the difference between the ith observation and its predicted value based on the model.
[Figure: parallel lines \(\alpha + \beta_2 + \beta_1 X_1\) and \(\alpha + \beta_1 X_1\) with a residual \(\varepsilon_i\) marked]
[Figure: three-dimensional scatterplot with fitted plane] The plane gives the predicted values.

Inference for Multiple Linear Regression
\(E[Y] = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_n X_n\)
1. We can test if a specific slope parameter is non-zero, given the other parameters in the model:
\(H_0: \beta_2 = 0\) versus \(H_1: \beta_2 \neq 0\)
2. Alternatively, we can test if any of the slope parameters are non-zero:
\(H_0: \beta_1 = \beta_2 = \dots = \beta_n = 0\) versus \(H_1\): at least one \(\beta_j \neq 0\)
This is called a "global" F-test. Failing to reject the global F-test means there is insufficient evidence that the model \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n\) explains the data better than the intercept-only model \(Y = \beta_0\).

Inference: MLR of Skin Cancer Mortality on Latitude and Coastal
\(\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Ocean}} \times \text{Ocean}\)
[Figure: Skin Cancer Mortality versus Latitude]
Typically perform the global test first: Are either Latitude or Ocean significant predictors of Skin Cancer Mortality?
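
The transcript does not show how the global F statistic is computed. One standard form expresses it through R²: F = (R² / k) / ((1 − R²) / (n − k − 1)), with k slope parameters and n observations. The following is a minimal Python sketch under stated assumptions: it is applied to the simple regression on Latitude (R² = 0.6798, k = 1), because the R² values of the two-covariate models are not reported in the transcript, and it assumes n = 49 (48 states plus Washington DC). The function name `global_f_statistic` is illustrative.

```python
def global_f_statistic(r_squared, n_obs, n_slopes):
    """Global F statistic for H0: all slope parameters equal zero."""
    explained = r_squared / n_slopes
    unexplained = (1.0 - r_squared) / (n_obs - n_slopes - 1)
    return explained / unexplained


# Simple regression of mortality on Latitude: R^2 = 0.6798 as reported,
# n = 49 is an assumption (48 states + Washington DC).
f_latitude = global_f_statistic(0.6798, n_obs=49, n_slopes=1)
print(round(f_latitude, 1))
```

With a single covariate, the global F statistic equals the square of the slope's t statistic. Here (−5.97 / 0.59)² is about 102, which roughly agrees with the F value given that the reported estimate and standard error are rounded.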

Inference: MLR of Skin Cancer Mortality on Latitude and Coastal
\(\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Ocean}} \times \text{Ocean}\)
Global test: \(H_0: \beta_{\text{Lat}} = \beta_{\text{Ocean}} = 0\) versus \(H_1: \beta_{\text{Lat}} \neq 0\) and/or \(\beta_{\text{Ocean}} \neq 0\)
The global test is highly significant, so we reject the null hypothesis. That is, \(\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Ocean}} \times \text{Ocean}\) explains the data better than \(\text{Skin Cancer Mortality} = \beta_0\).

Inference: MLR of Skin Cancer Mortality on Latitude and Coastal
We can also test the individual parameters… Is the slope for Latitude non-zero when also accounting for ocean status?
\(H_0: \beta_{\text{Lat}} = 0\) versus \(H_1: \beta_{\text{Lat}} \neq 0\)
[Figure: Skin Cancer Mortality versus Latitude]
- The p-value is highly significant, so we reject the null. There is sufficient evidence that the association between Latitude and Skin Cancer Mortality, while controlling for Ocean status, is real.

Inference: MLR of Skin Cancer Mortality on Latitude and Coastal
Is the change in intercept for Ocean non-zero after accounting for latitude?
\(H_0: \beta_{\text{Ocean}} = 0\) versus \(H_1: \beta_{\text{Ocean}} \neq 0\)
[Figure: Skin Cancer Mortality versus Latitude]

Inference: MLR of Skin Cancer Mortality on Latitude and Coastal
\(\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Ocean}} \times \text{Ocean}\)
Parameters for both Latitude and Ocean are significant, so you would want to keep both in the model.

Inference: MLR of Skin Cancer Mortality on Latitude and Longitude
Let's perform hypothesis testing on parameters in the Longitude & Latitude multiple regression model.
- Is the "tilt" of the plane in each direction significant?
[Figure: Skin Cancer Mortality versus Latitude]

Example: MLR of Skin Cancer Mortality on Latitude and Longitude
\(\text{Skin Cancer Mortality} = \beta_0 + \beta_{\text{Lat}} \times \text{Latitude} + \beta_{\text{Long}} \times \text{Longitude}\)
\(H_0: \beta_{\text{Lat}} = 0\) versus \(H_1: \beta_{\text{Lat}} \neq 0\)
\(H_0: \beta_{\text{Long}} = 0\) versus \(H_1: \beta_{\text{Long}} \neq 0\)
We reject the null hypothesis that the slope for Latitude is zero, but fail to reject the null hypothesis that the slope for Longitude is zero.
- In the next lecture we will discuss deciding if parameters should be dropped from an MLR model

MLR for Salary
\(E[\text{Salary}] = \beta_0 + \beta_{\text{Age}} \times \text{Age} + \beta_{\text{Female}} \times \text{Female}\)

Parameter    Estimate    P-value
(Intercept)  1666.67     < 3.3e-16
Age          1333.33     1.7e-5
Female       −1650.00    0.0042

Is there sufficient evidence to claim that women have lower salaries, after controlling for age?
- Perform the hypothesis test: \(H_0: \beta_{\text{Female}} = 0\) versus \(H_1: \beta_{\text{Female}} \neq 0\)
- Since the p-value for Female (0.0042) is significant and the point estimate \(\hat{\beta}_{\text{Female}} = -1650 < 0\), there is sufficient evidence to claim women have lower salaries than men, even when controlling for employee age
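
To see what "controlling for employee age" does, the sketch below fits the salary model by ordinary least squares on a small hypothetical, noise-free data set generated from the fictional estimates above, in which the women happen to be older than the men. The data and the helper names `solve_linear_system` and `ols` are assumptions made for illustration; a real analysis would use a statistics package rather than hand-written normal equations.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients via the normal equations; adds an intercept column."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical employees as (age, female); the women are older on average.
employees = [(30, 0), (40, 0), (50, 0), (40, 1), (50, 1), (60, 1)]
# Noise-free salaries generated from the fictional estimates in the lecture.
salary = [1666.67 + 1333.33 * age - 1650.0 * female for age, female in employees]

adjusted = ols(employees, salary)                        # intercept, Age, Female
unadjusted = ols([(f,) for _, f in employees], salary)   # intercept, Female

print("Adjusted Female effect:  ", round(adjusted[2], 2))
print("Unadjusted Female effect:", round(unadjusted[1], 2))
```

In this constructed example the unadjusted Female coefficient is positive, because the women are older and salary rises with age, while the adjusted coefficient recovers the value used to generate the data. The adjusted and unadjusted effects need not differ this much in real data; the sketch only shows that they answer different questions.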