Chapter 17: Regression - The Analysis of Biological Data, 3rd Edition
Michael C. Whitlock, Dolph Schluter (2020)
Summary
This document is Chapter 17 from the textbook "The Analysis of Biological Data, Third Edition". It introduces linear regression and correlation, with an explanation of key concepts and objectives. It also discusses how to use linear models and regression in biological data analysis, including important assumptions of regression analysis and the interpretation of results.
Full Transcript
Chapter 17: Regression
The Analysis of Biological Data, Third Edition
Michael C. Whitlock, Dolph Schluter
© 2020 W.H. Freeman and Company

Key Learning Objectives
- Use linear regression to predict Y based on X.
- Test the null hypothesis of zero slope.
- Display uncertainty in predictions from a linear model.
- Use diagnostic plots to evaluate whether data meet the assumptions of regression.

Correlation and Regression (Review)

Associations Between Continuous Variables
- Correlation: how well we can predict Y from X (Chapter 16). Parameter: ρ. Estimate: r.
- Regression: how the expectation of Y changes with a change in X (Chapter 17). Parameter: β. Estimate: b.

Correlation vs. Regression: Visual
[Figure: side-by-side scatterplots contrasting correlation and regression]

Correlation vs. Regression: Conceptual
- Correlation measures the strength and direction of the association between numerical variables; regression quantifies the association between numerical variables.
- Correlation has no units; regression is in the units of the original measurements.
- Correlation is bounded between −1 and 1; the regression slope can take any value.
- Correlation does not yield a formula; regression generates a formula for prediction.
- Correlation measures how reliably X and Y move together; regression does not (directly) measure this.

2024-11-29

iClicker Quiz

iClicker Question
Linear models and regression allow us to:
A. Quantify the relationship between x and Y
B. Test if x causes an increase in Y
C. Test how tight the relationship between x and Y is
D. Predict how much a change in Y affects x
E. All the above

Regression

Regression: Nomenclature & Visualization
The regression line through a scatter of points is described mathematically by Y = a + bX, where:
- Y is the response variable,
- X is the explanatory variable,
- a is the Y-intercept: the predicted Y-value for X = 0,
- b is the slope: the increase in Y per unit increase in X.
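The conceptual contrast above (r is unitless and bounded between −1 and 1, while the slope b carries the units of Y per unit of X) can be illustrated with a short sketch. The textbook works in R; this is a hypothetical Python analogue with made-up data:

```python
import math

# Hypothetical toy data, just for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def corr_and_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)  # correlation: unitless, bounded in [-1, 1]
    b = sxy / sxx                   # regression slope: units of Y per unit of X
    return r, b

r1, b1 = corr_and_slope(x, y)
# Rescaling x by a factor of 10 (a pure change of units) leaves
# r unchanged, while the slope shrinks tenfold.
r2, b2 = corr_and_slope([10 * xi for xi in x], y)
```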
The "hat" symbol, Ŷ, is used to represent the predicted value of Y for a given value of X when you use this line to make predictions.

A More Proper Syntax for the Equation
Yᵢ = α + βxᵢ + εᵢ, where:
- Y is the response variable,
- x (lowercase!) is the explanatory variable,
- α is the Y-intercept: the predicted Y for x = 0,
- β is the slope: the increase in Y for an increase of 1 in x,
- ε is the error term.
The estimated model is Ŷ = a + bx (note that there is no hat on x, because x is not estimated).

Regression: Goals & Assumptions
Goal: predict Y from x.
Assumptions:
- The relationship between x and Y is linear (a continuous line).
- At each value of x, the distribution of possible Y-values is normal.
- The variance of Y-values is the same for all values of x.
- Y-values are independent of one another, after accounting for x.

Introduction to Linear Regression

The Regression Equation
The regression equation describes the value of Y for a given value of x:
Yᵢ = a + bxᵢ + eᵢ, with:
- the Y-intercept, a,
- the slope, b,
- the value of the explanatory variable, xᵢ,
- the estimated error term, eᵢ.
The predicted value is Ŷᵢ = a + bxᵢ.

Residuals
The difference between the observed value (Yᵢ) and the prediction (Ŷᵢ) is called the residual (eᵢ = Yᵢ − Ŷᵢ).

Method of Least Squares
In linear regression, we find the line that "best fits" the data. But what criterion defines the "best fit"? The "best fit" line minimizes the sum of squared residuals. The equations on the following slides solve this!

The Regression Coefficient, b
The slope, b, describes the expected change in the response variable for each one-unit change in the explanatory variable:
b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
(the sample covariance divided by the sample variance of X, s²ₓ).
The sample slope, b, is an estimate of the true parameter, β.
NOTE: b is an unbiased estimate of β.
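The least-squares criterion above can be checked directly: the fitted (a, b) should beat any nearby line on the sum of squared residuals. A minimal Python sketch with made-up data (the book itself works in R):

```python
def fit_least_squares(x, y):
    # b = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)²,  a = ȳ - b·x̄
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

def sse(x, y, a, b):
    # Sum of squared residuals for the line Y = a + bX
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = fit_least_squares(x, y)
best = sse(x, y, a, b)
# Perturbing the intercept or slope can only increase the sum of
# squared residuals: the least-squares line is the minimizer.
```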
The Y-Intercept
The Y-intercept, a, is the value Y is predicted* to take if X = 0:
a = Ȳ − bX̄
a is an unbiased estimate of the true parameter, α.
*NOTE: This is often a silly "prediction," rarely to be taken seriously (see end of lecture), but it is useful mathematically. (Generally, you should design your model so that the intercept has a useful meaning.)

Example Slopes and Intercepts
[Figure: example lines with different slopes and intercepts]

Linear Regression Example: Predicting Age Based on Radioactivity in Teeth

2024-12-02

iClicker Quiz

iClicker Question
After fitting a linear model, we get the estimate and corresponding 80% confidence interval (1.5, 3.2). What can we conclude from this?
A. Our estimate is more precise than one with 80% CI given by (1.2, 3.9)
B. We reject H0: β = 0 at the 80% confidence level
C. All the above

Getting Long in the Tooth?
Radioactivity from nuclear bomb tests in the '50s and '60s may have left a signal in developing teeth. Can we predict date of birth based on dental Δ14C?
Data from Spalding et al. 2005. Forensics: age written in teeth by nuclear tests. Nature 437: 333–334.

Fitting the Slope and the Intercept
Slope: b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
Y-intercept: a = Ȳ − bX̄
Use shortcuts:
- Sum of cross products: Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = (1/n)(n ΣXᵢYᵢ − ΣXᵢ ΣYᵢ)
- Sum of squares: Σ(Xᵢ − X̄)² = (1/n)(n ΣXᵢ² − (ΣXᵢ)²)

Plug and Chug
Keep in mind that X = Δ14C and Y = dob (date of birth).
First find the slope:
- Sum of cross products: (1/16)(16 × 7,495,223 − 3,798 × 31,674) = −23,392.75
- Sum of squares: (1/16)(16 × 1,340,776 − 3,798²) = 439,225.8
- The slope equals: b = −23,392.75 / 439,225.8 = −0.05326
Next find the Y-intercept:
a = Ȳ − bX̄ = 31,674/16 − (−0.05326)(3,798/16) = 1992.27

Building a Linear Model in R with lm()
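The hand calculation above (which R's lm() automates) can be reproduced from the summary sums. A Python sketch using the totals given on the "Plug and Chug" slide:

```python
# Summary sums from the tooth (Δ14C vs. date of birth) example
n = 16
sum_x, sum_y = 3798.0, 31674.0   # ΣX (Δ14C), ΣY (dob)
sum_xy = 7495223.0               # ΣXY
sum_x2 = 1340776.0               # ΣX²

cross = (n * sum_xy - sum_x * sum_y) / n  # Σ(X - X̄)(Y - Ȳ)
ssx = (n * sum_x2 - sum_x ** 2) / n       # Σ(X - X̄)²
b = cross / ssx                           # slope
a = sum_y / n - b * sum_x / n             # intercept: Ȳ - bX̄
# b ≈ -0.05326 and a ≈ 1992.27, matching the slide.
```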
Predicting Y from X
The regression equation: predicted dob = 1992.27 − 0.05326(Δ14C)
We can think of the regression line as offering predictions (i.e., best guesses) for points not in our data set.
SELF TEST: If a cadaver has a tooth with Δ14C content equal to ..., what is the predicted date of birth?

Visualizing Residuals: Tooth Decay Data
The residual, eᵢ, is the difference between an observed value (Yᵢ) and a predicted value (Ŷᵢ).

Do Data Fit Regression Assumptions? (1 of 2)
Examine residuals vs. fitted values: are the residuals, and their variance, independent of the predictions?

Do Data Fit Regression Assumptions? (2 of 2)
Examine the QQ-plot: do points (roughly) fall on a line?

The Slope: Uncertainty and Hypothesis Testing

Sum of Squares
SStotal = SSregression + SSresidual

Sum of Squares Regression
SSregression is the sum of the squared differences between the predicted values of Y and the mean of Y:
SSregression = Σ(Ŷᵢ − Ȳ)² = b × (sum of cross products) = −0.05326 × −23,392.75 = 1245.90

Mean Squares Residual (1 of 2)
The variance of the residuals, MSresidual, measures the spread of points above and below the regression line:
MSresidual = Σ(Yᵢ − Ŷᵢ)² / (n − 2)
We could find all the residuals by hand, or we can get them from R.

Mean Squares Residual (2 of 2)
Or subtract SSregression from SStotal to get SSresidual without finding the residuals:
SSresidual = SSY − SSregression = 1399.87 − 1245.90 = 153.97
MSresidual = SSresidual / (n − 2) = 153.97 / (16 − 2) = 10.998

Standard Error of the Slope
The slope, b, is t-distributed with standard error
SEb = √(MSresidual / Σ(Xᵢ − X̄)²)
So in the tooth example:
SEb = √(10.998 / 439,225.8) = 0.005

Testing Hypotheses: H0: β = 0 and HA: β ≠ 0
Test statistic: t = (b − 0) / SEb = (−0.053259 − 0) / 0.005 = −10.65
With n − 2 = 16 − 2 = 14 degrees of freedom.
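The standard error and t-statistic can be verified numerically. A Python sketch plugging in the slide's values (the small difference from −10.65 comes from the slide rounding SEb to 0.005):

```python
import math

ms_resid = 10.998      # MSresidual from the slides
ssx = 439225.8         # Σ(X - X̄)²
b = -0.053259          # fitted slope

se_b = math.sqrt(ms_resid / ssx)   # standard error of the slope, ≈ 0.005
t = (b - 0) / se_b                 # test statistic for H0: β = 0
df = 16 - 2                        # n - 2 = 14 degrees of freedom
```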
Look up the P-value: P = 4.3 × 10⁻⁸. We reject H0 and conclude that age increases with the amount of Δ14C.

Confidence Intervals
b ± tα(2),df SEb
With 14 degrees of freedom, tα(2),df for a 95% confidence interval is t0.05(2),14 = 2.145:
−0.05326 ± 2.145 × 0.005
We are 95% confident that the true slope is between −0.064 and −0.043. Remember, this does not mean that the true value is in the CI with 95% probability!

Regression in R

Regressions as an ANOVA

R²: The Proportion of Variance Explained
R² is called the coefficient of determination. We saw this in Chapter 16.
- R² is the square of the correlation coefficient, r.
- Unlike r, R² does not describe whether variables increase or decrease with one another.
- Range of R²: 0 ≤ R² ≤ 1.

Examples of r and R²
[Figure: example scatterplots with their r and R² values]

R² in an ANOVA Framework
SStotal = SSregression + SSresidual
F = MSregression / MSresidual and R² = SSregression / SStotal = 1 − SSresidual / SStotal

We Can Get F and R² Just as in an ANOVA
Many of the quantities of interest in a regression setting are the same as in an ANOVA:
- SSregression is analogous to SSgroups
- SSresidual is analogous to SSerror
Here is an ANOVA table from a regression:

Term       df   Sum Sq   Mean Sq   F     P-value
Δ14C       1    1245.9   1246      113   4.3 × 10⁻⁸
Residuals  14   154.0    11        ---   ---

R² = 1245.9 / (1245.9 + 154) = 0.89

Uncertainty in Predictions

Confidence Bands for the Mean
Confidence bands represent uncertainty in μ (the mean of Y at a given X). Read these bands as "95% of such 95% CIs include the true μ"; that does NOT mean that this particular CI includes the true μ with 95% probability. Just because the average CI includes the true value 95% of the time, that doesn't mean this one does. CIs for μ are really only useful for hypothesis testing.
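The ANOVA-style quantities and the slope's confidence interval follow directly from the sums of squares. A Python sketch using the table's values:

```python
ss_reg, df_reg = 1245.9, 1     # SSregression
ss_res, df_res = 154.0, 14     # SSresidual

f = (ss_reg / df_reg) / (ss_res / df_res)   # F ≈ 113
r2 = ss_reg / (ss_reg + ss_res)             # R² ≈ 0.89

# 95% CI for the slope: b ± t * SE_b
b, t_crit, se_b = -0.05326, 2.145, 0.005
lo, hi = b - t_crit * se_b, b + t_crit * se_b
# The interval (≈ -0.064 to -0.043) excludes zero,
# consistent with rejecting H0: β = 0.
```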
Prediction Intervals
Prediction intervals present expected ranges for new observations of Y at a given x. This incorporates uncertainty in μ and expected residual variation. These are intervals for Y instead of μ.

Caution in Predictions: Be Wary of Extrapolation

Example: Species-Area Relationship
The number of fish species increases with pool area (P-value < 0.01):
#species = 1.79 + 0.000355(area)
What if we extrapolate to predict the number of fish species in a 50,000 m² pool?
Data from Kodric-Brown A. & Brown J.H. (1993) Highly structured fish communities in Australian desert springs. Ecology, 74, 1847–1855.

A Fishy Approach
What is the predicted number of fish species in a 50,000 m² pool?
#species = 1.79 + 0.000355(area) = 1.79 + 0.000355 × 50,000 = 19.54
How does this compare to the actual data?

We Have Data from Larger Lakes!
The number of fish species is POORLY predicted by linear extrapolation of the area of a desert pool! The regression from the small pools does not do a good job of predicting our observations of larger lakes.
Lesson: DO NOT extrapolate from your model; predictions are only valid within the range of X that the data came from.

Assumptions and What To Do When They Are Not Met…

Recall Regression Assumptions
- The relationship between x and Y is linear (a continuous line).
- At each value of x, the distribution of possible Y-values is normal.
- The variance of Y-values is the same for all values of x.
- Y-values are independent of one another, after accounting for x.

Diagnostic Plots
- Points bounce randomly about zero in residual plots if residuals are independent.
- Points fall (roughly) on the line in QQ-plots if residuals are normal.
[Figure: paired diagnostic plots labeled "Meets Assumptions" and "Doesn't Meet Assumptions"]
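The extrapolation warning from the species-area example can be made concrete: the fitted line mechanically produces a number for any area, whether or not data support it. A sketch using the slide's coefficients:

```python
a, b = 1.79, 0.000355   # intercept and slope from the species-area fit

def predicted_species(area_m2):
    # The line happily returns a prediction for ANY area;
    # it is only trustworthy within the range of observed areas.
    return a + b * area_m2

pred = predicted_species(50_000)   # ≈ 19.5 species for a 50,000 m² pool
```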
You Should Fit Models to Data Rather Than Data to Models
https://events.ok.ubc.ca/series/fitting-models-to-data-not-data-to-models-workshop-series/

The Impact of Measurement Error on Regression Estimates

Attenuation
Q: How does measurement error affect inferences from regression?
A: It depends: which variable is measured with error, X or Y?
- Measurement error in Y increases the variance of residuals, decreasing the variance in Y explained by X, with an unbiased effect on the slope.
- Measurement error in X increases BOTH the variance of residuals AND biases the slope closer to zero, away from its true value.

Visualizing Attenuation
[Figure: histograms of 10k replicate simulations]

Regression Summary

Regression Wrap-Up
- Linear models predict Y as the intercept plus the slope times the value of x.
- We can test the hypothesis of slope = 0 with the t-distribution or in an ANOVA framework.
- Diagnostic plots allow for visual evaluation of the assumptions that residuals should be normally distributed and independent of predictions.
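The attenuation effect described above can be demonstrated with a small simulation (a Python sketch with made-up parameters; the slides' own figures come from 10,000 replicate simulations): noise added to X drags the fitted slope toward zero, while noise added to Y leaves the slope unbiased but noisier.

```python
import random

random.seed(1)

def ols_slope(x, y):
    # Least-squares slope: Σ(x-x̄)(y-ȳ) / Σ(x-x̄)²
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sum((xi - mx) ** 2 for xi in x)

n, true_slope = 2000, 2.0
x = [random.uniform(0, 10) for _ in range(n)]
y = [true_slope * xi + random.gauss(0, 1) for xi in x]

b_clean = ols_slope(x, y)   # ≈ 2: no measurement error

# Measurement error in X: slope is biased toward zero (attenuation)
x_noisy = [xi + random.gauss(0, 3) for xi in x]
b_x_err = ols_slope(x_noisy, y)

# Measurement error in Y: slope stays unbiased, residual variance grows
y_noisy = [yi + random.gauss(0, 3) for yi in y]
b_y_err = ols_slope(x, y_noisy)
```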