Biostatistics 521 Lecture 17 Linear Regression PDF
Xiang Zhou, PhD
Summary
This lecture on linear regression from Biostatistics 521 covers the basics of regression models and introduces the assumptions required for statistical inference on linear regression parameters. It discusses the relationship between variables, linear regression, and parameter interpretations, including the intercept and slope terms. The lecture also explains the concept of prediction in linear regression and the role of the slope parameter $\beta_1$.
Full Transcript
SIMPLE LINEAR REGRESSION
Xiang Zhou, PhD
BIOS 521

Linear Regression Big Picture
We first encountered linear regression when learning how to summarize the relationship between two numerical variables.
The slope parameter $\beta_1$ provided a summary measure for the effect of the exposure on the outcome. A positive slope indicates a positive relationship between the covariate and the outcome, and a negative slope indicates a negative relationship. A slope of $\beta_1 = 0$ therefore indicates no linear relationship between the two variables.
We will now return to linear regression with the goal of performing statistical inference on the relationship between the exposure and outcome. That is, we will formally test the hypothesis $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ to determine if there is evidence for a relationship.
In this lecture, we will review the basics of regression models and introduce the assumptions required to perform statistical inference on linear regression parameters.

Flashback: Summarizing Relationships Between Variables
Relationships between variables can be summarized graphically and numerically.
Recall that data can be either numerical or categorical.
The methods for summarizing how variables are related depend on the data types of the exposure and outcome variables.
[Table: exposure variable type (rows: numerical, categorical) by outcome variable type (columns: numerical, categorical). The numerical exposure / numerical outcome cell lists scatterplots, correlation, and linear regression; the other cells are empty in this transcript.]

Linear Regression
If two numerical variables have a linear relationship, the association can be described using a linear regression.
Linear regression creates a model for the relationship between the exposure and outcome.
The equation of the line is $y = \beta_0 + \beta_1 x$.
[Figure: the line $y = -1 + 2x$, with X on the horizontal axis and Y on the vertical axis.]

Linear Regression
The equation of a line is $y = \beta_0 + \beta_1 x$.
The parameters in the model are the y-intercept ($\beta_0$) and slope ($\beta_1$) terms.
The y-intercept $\beta_0$ is the value where the line crosses the y-axis.
[Figure: the line $y = -1 + 2x$, annotated with a 2-unit increase in y for each 1-unit increase in x.]
The slope is the rate of change in the line.
The y variable increases by $\beta_1$ units for every 1-unit increase in the x variable.
[Figure annotation: the line crosses the y-axis at $y = -1$.]

Linear Regression Prediction
Linear regression analysis is a very popular tool in statistical analysis.
For now, we are interested in computing the parameters and interpreting the model $y = \beta_0 + \beta_1 x$, where $y$ is the outcome variable and $x$ is the exposure variable.
The "fitted values" of the parameters for a given set of data are:
$\hat{\beta}_1 = r \times \frac{s_y}{s_x}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
For a fixed exposure value of $x$, the predicted outcome is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
These parameters represent the best-fitting linear equation of the above form. This equation minimizes the sum of squared residuals between the observed y values and the predicted values from the regression line.

Linear Regression Parameter Interpretation
The intercept term $\beta_0$ is the predicted outcome value when the exposure is equal to zero: $y = \beta_0 + \beta_1 \times 0 = \beta_0$.
The slope parameter $\beta_1$ is often called the "effect size".
The predicted or mean outcome value changes by $\beta_1$ units for every 1-unit increase in the exposure variable.
$\beta_1 > 0$ indicates a positive association between the exposure and outcome, $\beta_1 < 0$ indicates a negative association, and $\beta_1 = 0$ indicates no association.
Larger magnitudes of $\beta_1$ indicate greater changes in the outcome for changes in the exposure.

Linear Regression Slope Parameter
[Figure: three scatterplot panels. Negative Association: $\hat{\beta}_1 < 0$. No Association: $\hat{\beta}_1 \approx 0$. Positive Association: $\hat{\beta}_1 > 0$.]

Two Continuous Variables Example
The "classic" mtcars dataset in R: fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models), taken from the 1974 Motor Trend US magazine.
We will consider the effect of car weight on quarter mile time and fuel economy.
The continuous variables are mpg (miles per gallon), weight (1000s lbs) and qsec (seconds).

Car              weight   mpg    qsec
AMC Javelin      3.44     15.2   17.3
Merc 230         3.15     22.8   22.9
Lotus Europa     1.51     30.4   16.9
Camaro Z28       3.84     13.3   15.4
Toyota Corona    2.47     21.5   20.0
Porsche 914-2    2.14     26.0   16.7
Valiant          3.46     18.1   20.2
Merc 450SE       4.07     16.4   17.4
Fiat X1-9        1.94     27.3   18.9
Merc 450SLC      3.78     15.2   18.0

Relationship Between Car Weight and Quarter Mile Time
[Figure: scatterplot of Quarter Mile Time (seconds) vs. Car Weight (1000s Lbs), $r = -0.174$.]
From the scatterplot, we see there is no clear relationship between car weight and quarter mile time.
As expected, the correlation coefficient between car weight and quarter mile time is small in magnitude, $r = -0.174$.
[Annotation: The correlation coefficient is a statistical measure of the strength of a linear relationship between two variables. Its values can range from -1 to 1. A correlation coefficient of -1 describes a perfect negative, or inverse, correlation, with values in one series rising as those in the other decline, and vice versa.]

Relationship Between Car Weight and Quarter Mile Time
[Figure: scatterplot of Quarter Mile Time (seconds) vs. Car Weight (1000s Lbs).]
The relationship does not look linear.
You decide to fit a linear regression anyway:
$\text{QSEC} = 18.9 - 0.32 \times \text{Weight}$
This model has $R^2 = 0.03$. The linear model only explains 3% of the variation in the quarter mile time!
Linear regression is not helpful for these variables, but how do we make that decision?
[Annotation: The linear model explains ($R^2 \times 100$)% of the variation in the outcome variable.]

Relationship Between Car Weight and Fuel Economy
[Figure: scatterplot of Fuel Efficiency (mpg) vs. Car Weight (1000s Lbs), $r = -0.868$.]
From the scatterplot, we see that the relationship between car weight and fuel efficiency is negative, strong and linear.
Since the relationship is linear, we can quantify it using correlation, $r = -0.868$.

Relationship Between Car Weight and Fuel Economy
[Figure: scatterplot of Fuel Efficiency (mpg) vs. Car Weight (1000s Lbs) with the fitted line.]
The fitted model is:
$\text{MPG} = 37.3 - 5.3 \times \text{Weight}$
A 1000 lb increase in weight reduces predicted fuel economy by 5.3 miles per gallon.
The model has $R^2 = 0.75$, meaning that 75% of the variation in fuel efficiency is explained by the linear relationship with car weight. (That's a lot!)
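
The fitted-value formulas given earlier, $\hat{\beta}_1 = r \times s_y / s_x$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, can be tried out by hand. The following is a minimal Python sketch, not the lecture's own code (the lecture uses R). It assumes only the ten cars listed in the table rather than the full 32-car dataset, so its estimates will differ from the fitted model quoted above; the helper names are hypothetical.

```python
import math

# Ten cars from the table above: (weight in 1000s of lb, mpg).
# Assumption: this is only the displayed subset, not the full 32-car
# dataset, so the estimates differ from the model quoted in the lecture.
cars = [
    (3.44, 15.2), (3.15, 22.8), (1.51, 30.4), (3.84, 13.3), (2.47, 21.5),
    (2.14, 26.0), (3.46, 18.1), (4.07, 16.4), (1.94, 27.3), (3.78, 15.2),
]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (sample_sd(xs) * sample_sd(ys))


def fit_line(xs, ys):
    """Intercept and slope from the formulas b1 = r * sy / sx, b0 = ybar - b1 * xbar."""
    r = correlation(xs, ys)
    slope = r * sample_sd(ys) / sample_sd(xs)
    intercept = mean(ys) - slope * mean(xs)
    return intercept, slope, r


weight = [w for w, _ in cars]
mpg = [m for _, m in cars]
b0, b1, r = fit_line(weight, mpg)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}, r = {r:.3f}, R^2 = {r * r:.3f}")
```

In simple linear regression $R^2$ equals $r^2$, which agrees with the values quoted in the lecture: $(-0.868)^2 \approx 0.75$ and $(-0.174)^2 \approx 0.03$.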
This regression line "looks" good, but again, how do we make that decision?

Goal: Evaluate these Regressions using Inference Framework
How do we test whether there is sufficient evidence to say $\beta_1 \neq 0$?
[Figure: two scatterplots against Car Weight (1000s Lbs): Quarter Mile Time (seconds) and Fuel Efficiency (mpg).]

Statistical Inference for Linear Regression Models
The Population: $\beta_0, \beta_1$ describe the true relationship between the exposure and outcome variable in the entire population.
If $\beta_1 = 0$, the expected outcome value is the same for all samples, regardless of exposure value.
If $\beta_1 \neq 0$, there is a true linear relationship between the exposure and expected outcome.
We are usually most interested in deciding if a slope parameter ($\beta_1$) is non-zero.
Random Sample: draw a random sample and compute point estimates for the regression parameters. Sample estimates: $\hat{\beta}_0, \hat{\beta}_1$.
Use the point estimates for inference on the population parameters.
What is the Null Hypothesis that we are testing?
No association between the exposure and outcome variables.
The slope parameter is zero, $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$.
The Null Model $y = \beta_0$ adequately explains the data.
There is variation in the estimates of $\beta$ among independent random samples.
[Figure: several independent random samples drawn from the population, each yielding its own sample estimates $\hat{\beta}_0, \hat{\beta}_1$.]
Like other point estimates/statistics we have used in this course (e.g. $\bar{X}$, $\hat{p}$), the estimate of the slope parameter $\hat{\beta}_1$ has a sampling distribution and a standard error (standard deviation of the sampling distribution).
The observed $\hat{\beta}_1$ is a draw from the sampling distribution, centered at $\beta_1$.
Inference for $\beta_1$, both hypothesis testing and confidence intervals, is performed using the point estimate $\hat{\beta}_1$ and its associated standard error.
Inference requires that a probability model be incorporated into the regression model.

Where do the Regression Parameter Estimates Come From?
[Figure: scatterplot of Outcome ($y$) vs. Exposure ($x$) with the fitted regression line (least squares regression line) $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. Residual $e_i$ = distance between the regression line and the observed value for the $i$th data point.]
The point estimates for the linear regression line, intercept $\hat{\beta}_0$ and slope $\hat{\beta}_1$, are obtained by minimizing the total sum of squared residuals.
That is, we find values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the total squared distance between the regression line and all data points (the distances are the dashed green lines in the figure).

Where do the Regression Parameter Estimates Come From?
The total sum of squared distances between the observed and expected outcomes can be written as a function:
$f(\beta_0, \beta_1) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_i)^2$
Using ideas from calculus, we can solve for the values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the above function for a given set of data points $(x_i, y_i)$.

Statistical Inference for Linear Regression Models
Notice that we did not need a probability model to estimate the least squares regression line, just geometry and calculus.
The least squares regression line gave us:
The estimate $\hat{\beta}_1$ to quantify the relationship between the exposure and outcome variables.
Prediction of the expected outcome for a given value of the exposure.
The proportion of variation in the outcome explained by a linear relationship with the exposure (the $R^2$ value).

Simple Linear Relationship between X and Y
$E[Y_i] = \beta_0 + \beta_1 X_i$
Here $E[Y_i]$ is the expected response for the $i$th sample, $X_i$ is the exposure for the $i$th sample, $\beta_0$ is the Y-intercept parameter, and $\beta_1$ is the slope parameter.

Simple Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
Here $Y_i$ is the observed response for the $i$th sample, $X_i$ is the exposure for the $i$th sample, and $\epsilon_i$ is the residual (random error): the amount by which the $i$th observed value differs from its predicted value.
Add in the assumption that $\epsilon_i \sim N(0, \sigma^2)$.

Simple Linear Regression Statistical Model
Define $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ to be the simple linear regression model that relates outcomes Y to exposure or "covariate" X.
Simple = only one covariate or exposure.
$\epsilon_i$ is called the residual, and is the quantity by which the $i$th observed outcome value differs from the model-predicted outcome.
Assume that $\epsilon_i \sim N(0, \sigma^2)$, that is, the residuals are independent and normally distributed with mean zero and constant variance.
The above assumption allows us to compute the sampling distribution for the $\hat{\beta}$'s.
Errors (beyond the scope of this class!): the difference between an observed value and the true value of a quantity of interest. Errors are unknown (because the true population parameters are unknown).
Residuals: the difference between an observed value and the estimated value of a quantity of interest. Residuals are calculated from sample data and are estimates of errors.

Statistical Model for Simple Linear Regression
[Figure: regression line and observations, y vs. x.]

Statistical Model for Simple Linear Regression
Outcomes with covariate $x = x_0$ are Normally distributed about the regression line.
[Figure: Normal distribution of outcomes about the regression line at $x_0$.]

[Figure: scatterplot of Quarter Mile Time (seconds) vs. Car Weight (1000s Lbs).]
The Hypothesis Test: $H_0: \beta_1 = 0$, $H_1: \beta_1 \neq 0$
The point estimate of $\beta_1$ is $\hat{\beta}_1 = -0.3191$ with $SE(\hat{\beta}_1) = 0.3283$.
R output tells you this has a p-value of $p = 0.339$ → Fail to reject the Null.
[Annotation: learn the formula for the correlation coefficient $r$ from the given data.]

[Figure: scatterplot of Fuel Efficiency (mpg) vs. Car Weight (1000s Lbs).]
The Hypothesis Test: $H_0: \beta_1 = 0$, $H_1: \beta_1 \neq 0$
The point estimate of $\beta_1$ is $\hat{\beta}_1 = -5.3445$ with $SE(\hat{\beta}_1) = 0.5591$.
R output tells you this has a p-value of $p = 1.29 \times 10^{-10}$ → Reject the Null.

4 Key Assumptions of Linear Regression
1. Independence: each data point $(Y_i, X_i)$ is independent.
2. Linearity: $Y_i$ is a linear function of $X_i$.
3. The error terms are Normally distributed: the $\epsilon_i$ follow a Normal distribution.
4. The error terms have constant variance (homoscedasticity): $Var(\epsilon_i) = \sigma^2$ for all values of $i$.

Checking Regression Assumptions
It is important to check that the assumptions of linear regression are met; otherwise inference might not be valid.
Rarely done in practice! You will almost never see mention of this in the methods or results sections of papers.
Independence: typically this is addressed through data collection methods/study design. We can also look for clustering of points in the scatterplot.

Assessing Linearity
We assume a linear relationship between the exposure (X) and outcome (Y).
We can look at a scatterplot of the two variables:
[Figure: scatterplots for Dataset #1, Dataset #2, Dataset #3.]
Sometimes difficult to judge.
So just fit the regression line and see what happens.

Assessing Linearity
We assume a linear relationship between the exposure (X) and outcome (Y).
[Figure: Dataset #1, Dataset #2, Dataset #3.]
Now compute and plot the residuals.

Assessing Linearity: Residual Plots
Residuals are the difference between the observed and predicted values.
Residual ≈ 0 means the model is predicting the outcome very well.
A large residual implies the model is not doing a good job predicting that outcome value.
The residual plot is a scatterplot of either X vs. residuals, or $\hat{Y}$ vs. residuals.
The residual plot tells you how the model is doing at predicting the data.
If the relationship in the data is truly linear, residuals should appear as a "random cloud" of points distributed about the horizontal line at 0.
A discernible pattern in the residual plot can indicate non-linearity.
[Annotation: A residual plot is either X vs. residuals or $\hat{Y}$ vs. residuals. We can't plot Y vs. residuals because they are related and do not truly represent the association.]

Dataset #1
[Figure: scatterplot of X and Y (Outcome Y vs. Exposure X) and residual plot for regression of Y on X (Residuals $Y - \hat{Y}$ vs. Predicted Values $\hat{Y}$, with a horizontal line at 0).]
No clear pattern in the residuals.
[Annotation: The residuals form a random cloud of points, which is consistent with a linear relationship between the exposure and outcome variables.]

[Annotation for Dataset #2: The residuals are not randomly scattered around the zero line. There is a visible pattern: they are positive in the low and high range and negative in the middle. This indicates that the relationship between the exposure and outcome variable is not linear.]
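
The residual check described above can be sketched in a few lines of Python. This is an illustrative example, not the lecture's datasets: the data are hypothetical, generated from a curved relationship $y = x^2$, and the function names are made up for this sketch. A straight line is fitted by least squares and the sign of each residual is printed in order of x.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(xs, ys):
    """Observed minus predicted outcome for each data point."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data: a curved relationship y = x^2 for x = 0, 1, ..., 10.
xs = list(range(11))
ys_curved = [x ** 2 for x in xs]

res = residuals(xs, ys_curved)
signs = "".join("+" if e > 0 else "-" for e in res)
print(signs)  # prints "++-------++"
```

The residuals are positive at both ends of the x range and negative in the middle. A systematic run of signs like this, rather than a random mix, is the kind of pattern that points to non-linearity.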
Dataset #2
[Figure: scatterplot of X and Y (Outcome Y vs. Exposure X) and residual plot for regression of Y on X (Residuals $Y - \hat{Y}$ vs. Predicted Values $\hat{Y}$, with a horizontal line at 0).]
Subtle V-shaped pattern in the residuals.

Dataset #3
[Figure: scatterplot of X and Y (Outcome Y vs. Exposure X) and residual plot for regression of Y on X (Residuals $Y - \hat{Y}$ vs. Predicted Values $\hat{Y}$, with a horizontal line at 0).]
Clear pattern in the residuals.
[Annotation: The residuals are not randomly scattered and instead form a U shape, which indicates that the relationship between the exposure and outcome variable is not linear.]

Assessing Normality of Residuals
We require residuals to be Normally distributed.
Recall that we can assess Normality using a histogram.
[Figure: scatterplot of Outcome Y vs. Exposure X and histogram of the residuals $Y - \hat{Y}$. Check for skewness.]
A boxplot would let you evaluate if the residuals are symmetric. That is a good start, but not sufficient to say that the distribution is actually Normal.
[Annotation: Residuals need to be normally distributed to make valid statistical inferences and predictions in regression analysis:
1. Hypothesis testing: when residuals are normally distributed, t-tests and F-tests for significance are valid.
2. P-values: Normality is important for calculating reliable and interpretable p-values in significance testing.
3. Model accuracy: strongly non-Normal residuals can be a sign that the model is missing structure, which can reduce the accuracy of predictions.
4. Prediction intervals: if residuals are non-Normal, prediction intervals may be inaccurate.]

Assessing Normality
"The most that can be expected from any model is that it can supply a useful approximation to reality" (George Box)
How close is the Normal approximation for each of these distributions of data?
[Figure: several example distributions of data.]

Assessing Normality Using QQ Plots
Histogram of data compared to a Normal PDF with the sample mean and standard deviation.
A Quantile-Quantile plot (called a QQ plot) plots the observed quantiles of the data against the theoretical Normal quantiles (expected vs. observed).
The closer the points are to a straight line, the better the Normal distribution fits the data.
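
The QQ plot construction can be sketched with Python's standard library. This is a rough illustration under stated assumptions, not the lecture's plotting code: the residual values are hypothetical, `qq_points` is a made-up helper name, and the plotting positions $(i - 0.5)/n$ are one common convention among several. Theoretical quantiles come from a Normal distribution with the sample mean and standard deviation, matching the reference curve described above.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (theoretical, observed) quantile pairs for a Normal QQ plot."""
    n = len(values)
    observed = sorted(values)
    # Reference Normal distribution with the sample mean and standard deviation.
    reference = NormalDist(mean(values), stdev(values))
    # Assumption: plotting positions (i - 0.5) / n; other conventions exist.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


# Hypothetical residuals, for illustration only.
example_residuals = [-1.8, -0.9, -0.4, -0.1, 0.2, 0.5, 1.1, 1.4]

for theo, obs in qq_points(example_residuals):
    print(f"theoretical = {theo:6.2f}   observed = {obs:6.2f}")
```

Plotting the pairs as a scatterplot gives the QQ plot. With this choice of reference distribution, points close to the diagonal line (observed = theoretical) suggest that the Normal approximation is reasonable.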
[Figure: three example distributions, with captions:
Normal approximation is good across the full distribution.
Normal approximation looks good in the middle of the distribution, but the tails are somewhat off.
Normal approximation is not good.]

Assessing Normality of Residuals
We require residuals to be Normally distributed.
Recall that we can assess Normality using a histogram, boxplot or QQ plot:
[Figure: scatterplot of Outcome Y vs. Exposure X, histogram of the residuals $Y - \hat{Y}$ (check for skewness), and QQ plot of Observed Residuals vs. Theoretical Quantiles.]
Deviation from the diagonal line indicates departure from the Normal distribution.

Assessing Constant Variance of Residuals
We require constant variance for the residuals across the whole range of predicted values.
[Figure: scatterplots of Outcome Y vs. Exposure X for Dataset A and Dataset B.]

Assessing Constant Variance of Residuals
Dataset A
[Figure: scatterplot of Outcome Y vs. Exposure X and residual plot (Residuals $Y - \hat{Y}$ vs. Predicted Values $\hat{Y}$, with a horizontal line at 0).]
The variance of the residuals is nearly constant.

Assessing Constant Variance of Residuals
Dataset B
[Figure: scatterplot of Outcome Y vs. Exposure X and residual plot (Residuals $Y - \hat{Y}$ vs. Predicted Values $\hat{Y}$, with a horizontal line at 0).]
Fan-shaped residuals indicate non-constant variance.

Assessing Assumptions for Linear Regression
Assumptions for linear regression models are rarely ever perfectly achieved.
Remember that "all models are wrong, some models are useful".
Rough order of importance for the assumptions:
1. Linearity: if the functional form of the model is wrong, then prediction is poor and all inference will be incorrect.
2. Independence: if data points are not independent, the estimated standard error is incorrect and inference can be off (e.g. incorrect type 1 error rates).
3. Constant Variance: inference can be off (e.g. incorrect type 1 error rates).
4. Normality
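
To tie these assumptions back to the inference they protect, the two slope tests reported earlier for the mtcars regressions can be reproduced from the point estimates and standard errors. The lecture only quotes the p-values from R output, so the following Python sketch rests on assumptions: under the model assumptions the statistic $t = \hat{\beta}_1 / SE(\hat{\beta}_1)$ follows a t distribution with $n - 2$ degrees of freedom (here $n = 32$ cars), and the tail area is approximated with Simpson's rule instead of a statistics library. Function names are hypothetical.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, width=60.0, steps=20000):
    """Two-sided p-value: twice the upper-tail area beyond |t|.

    The tail is integrated with Simpson's rule on [|t|, |t| + width];
    the area beyond that interval is negligible for moderate df.
    """
    a = abs(t_stat)
    h = width / steps  # steps must be even for Simpson's rule
    total = t_pdf(a, df) + t_pdf(a + width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return 2 * total * h / 3


def slope_test(estimate, std_error, n):
    """t statistic and two-sided p-value for H0: beta1 = 0 (df = n - 2)."""
    t_stat = estimate / std_error
    return t_stat, two_sided_p(t_stat, n - 2)


# Estimates and standard errors quoted in the lecture; n = 32 cars.
t_qsec, p_qsec = slope_test(-0.3191, 0.3283, 32)
t_mpg, p_mpg = slope_test(-5.3445, 0.5591, 32)
print(f"qsec ~ weight: t = {t_qsec:.3f}, p = {p_qsec:.3f}")
print(f"mpg  ~ weight: t = {t_mpg:.3f}, p = {p_mpg:.2e}")
```

The computed p-values should land close to the ones quoted from the R output (0.339 and $1.29 \times 10^{-10}$), up to rounding of the inputs. They are only trustworthy to the extent that the assumptions listed above hold.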