CHS 729 Week 7 Lecture Slides: Linear Regression

Summary

These slides cover linear regression within the context of statistical models, including its assumptions, model building, and evaluation of results. The material walks through when to use linear regression, how to fit and assess a model in R, and how to interpret and report the results.

Full Transcript

CHS 729 Week 7: Statistical Models – Linear Regression

Statistical Models
Statistical models are tools that allow us to relate an outcome Y to one or more other variables X₁, X₂, … The outcome Y is also often referred to as the response variable or dependent variable. The X-variables are often referred to as predictors, explanatory variables, or independent variables. Models will generally take the form Y = f(X₁, X₂, …) + ε, where f() represents a function applied to our predictors and ε represents the error of our model.

All Models Are Wrong, Some Are Useful
Every model simplifies reality. You will never have a model that perfectly fits your data. G.E.P. Box said, “All models are wrong; some are useful.” So we have two big questions we need to answer when we use statistical models: How far wrong is the model? How useful is the model?

A Four-Step Process to Model Building
When we use a statistical model in a study, there are four general steps we need to undertake:
1. Choose the form of the model – this will depend on the structure of the data and patterns within it.
2. Fit the model to the data – this is when we run the model-fitting algorithm (such as a regression function).
3. Assess how well the model fits the data – i.e., how wrong (or not wrong) is our model?
4. Use the model to answer our scientific research question.

Linear Regression
Today, we will go over each of these four steps for a method called linear regression. We will discuss: when to use this approach, how to fit a linear regression to data in R, how to assess model fit in R, and how to extract our results using R to help us answer our scientific research question. For the example in this lecture, we will use the built-in mtcars dataset.

Starting With a Simple Linear Regression
A simple linear regression refers to when we want to regress an outcome Y on just one predictor X. As implied by the name, a linear regression attempts to fit a linear function f() such that Y = f(X). A simple linear function takes the form Y = mX + b, where m is the slope of the function and b is the intercept. We can understand the slope m to represent the amount by which Y changes when X changes by one unit.

Typical Function Notation
While you may be familiar with Y = mX + b, typically we use the following convention for defining a linear regression: Y = β₀ + β₁X. In this equation, β₀ represents the intercept of our function and β₁ represents the slope. This convention is useful for when we fit a multiple regression, which is when we regress Y on multiple predictors X₁, X₂, …, Xₙ: Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ.

Error (aka Residual)
You may also see an error term added to a model equation like so: Y = β₀ + β₁X + ε. This term is not technically part of the model, but indicates the fact that our regression function does not actually “go through” every data point. We will visualize and discuss errors and residuals in the next steps – importantly, a model is a simplification of our data and, therefore, is not a perfect reflection of our data.

Assumptions of Linear Regression
1. Linearity – the relationship between Y and the predictors X follows a linear pattern (we can assess this visually with a scatterplot).
2. Homoscedasticity – the errors/residuals have the same variance (spread) at every level of X. If, say, model errors were larger for larger values of X, this would be an example of heteroscedasticity.
3. Independence of Residuals – the error observed in one observation should not be correlated with the error observed in another observation or with a predictor variable's value.
4. Normality of Residuals – it is assumed that the unseen error in the model follows a normal distribution with a mean value of 0.
5. No Multicollinearity – for a multiple regression with multiple predictors X₁, X₂, …, the independent variables should not be correlated with one another (interpretation becomes challenging in this case).

Using a Scatterplot to Identify a Linear Relationship: Assumption 1
We can use a scatterplot to visually inspect whether there is a linear relationship between X and Y. In R, we can use the ggplot() function paired with the geom_point() function to make the scatterplot. Then, we can use stat_smooth(method = "lm") to create a linear function that best fits the data. You'll notice the line doesn't actually touch the points (i.e., residuals!).
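As a concrete illustration of the scatterplot step, here is a minimal R sketch, assuming we plot horsepower (hp) against miles per gallon (mpg) from mtcars, the pair used later in the lecture:

    # ggplot2 provides ggplot(), geom_point(), and stat_smooth()
    library(ggplot2)

    # Scatterplot of horsepower (x) against miles per gallon (y),
    # with the best-fitting linear function overlaid by stat_smooth()
    ggplot(mtcars, aes(x = hp, y = mpg)) +
      geom_point() +
      stat_smooth(method = "lm")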
Assessing Other Assumptions in R
When we use a given statistical model, we want to assess whether its assumptions hold. To do this, we often have to actually fit the model and then make the assessments. Assumptions are important because our statistical software will often run the method regardless of whether the assumptions are met. It is our job (not the computer's) to assess whether the method we are choosing is appropriate. Before we discuss checking the other assumptions, we are going to chat about how a linear regression is fit and how to do it in R.

Ordinary Least Squares Regression
When we want to fit a linear regression, we are trying to identify a linear function f() such that Y = f(X) fits our data very well. To do so, we run what is called an optimization function. An optimization function simply tries to find the best (aka optimal) solution given some task and metric. In this case, we want to identify a linear function that is as close to each data point in our sample as possible. We need a metric to measure how close our linear function is to our data.

Residuals
When we fit a linear regression, we are essentially creating a line that represents the relationship between X and Y. Each point represents an observed pair of X and Y. However, our line doesn't actually go through these points (just near them). For a given observation, the residual is how far the observed point (x, y) is from the regression line.

Visualizing Residuals
Let us say we have an observation (x, y). We can let ŷ be the value of Y on the regression line for our value x. The residual for this data point can then be represented as e = y − ŷ. Visually, this is the vertical distance from the point (x, y) to the linear function.

Minimizing Residuals
An ordinary least squares regression is fit by minimizing the model residuals. We must take the square of each residual, because if we simply sum the residuals, they add up to 0 (similar to why we square deviations when computing variance before standard deviation). So, if we have n observations, we can calculate SSE = Σᵢ (yᵢ − ŷᵢ)². This is the sum of squared errors/residuals (SSE). An OLS regression finds the linear function that minimizes this value.

Fortunately, You Don't Need to Do This By Hand
Optimization is an inherently time-consuming process. That is why computers are awesome. While an OLS regression can be done by hand when you have a small number of observations, it quickly becomes too time-consuming and complex. We can easily identify the best-fit linear function in R!

Fitting a Linear Regression in R
We can fit a linear regression in R using the lm() function. We need to supply two arguments. First, the regression formula, which takes the form Y ~ X₁ + X₂ + …: we place the outcome on the left side of the ~ and each of our independent variables on the right side, separating the predictors with a + sign. Second, we specify the data.frame that the variables are drawn from. R will then find the best-fitting linear function by minimizing SSE.
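Here is a minimal sketch of that call, assuming a model of mpg regressed on hp; the lecture's actual model may include additional predictors (wt and drat below are illustrative choices, not necessarily the ones used in the slides):

    # Simple linear regression: first argument is the formula, second the data.frame
    fit <- lm(mpg ~ hp, data = mtcars)

    # A multiple regression just adds predictors with +
    fit_multi <- lm(mpg ~ hp + wt + drat, data = mtcars)

    # Estimated intercept and slope(s)
    coef(fit_multi)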
Checking for Linearity & Homoscedasticity
We can use a fit–residual plot. On the x-axis are the fitted values of our outcome variable; on the y-axis is the residual for each observation. Ideally, the points will be scattered uniformly around the horizontal line.

Interpreting the Fit–Residual Plot
The plot on the left displays what an ideal plot looks like when the assumptions of both linearity and homoscedasticity hold. Notice that the points are distributed uniformly around the horizontal line across all fitted values of Y.

Normality of Residuals – QQ Plots
This is the quantile–quantile (QQ) plot. We compare the distribution of our observed residuals (y-axis) against how they would be distributed if they were normally distributed (x-axis). If the points follow the line, we can feel confident we have normality of residuals. If the points diverge from the line, we have evidence that this assumption may be violated: notice the top right seems to diverge.

Independence of Residuals
There are two forms of independence to assess: residuals should be independent of the values of each predictor, and residuals should be independent of one another (aka no autocorrelation). First, we can use a scatterplot of residuals (y-axis) against a given predictor (x-axis) to visually inspect for independence (i.e., no pattern).

Autocorrelation
Autocorrelation is when the error for a given observation is associated with the error of another observation. Typically, in a random sample, each observation is independent, which should lead to independence of residuals. However, when observations are related in some way (e.g., time series data, nested data), we need to account for this. Often, we make a logical assessment of autocorrelation based on study design – for example, if we have longitudinal data or nested data (e.g., students within classrooms), we understand that we have autocorrelation. Methods such as time series analyses and mixed effects models can be used to control for the impact of autocorrelation.

Multicollinearity
An important assumption for multiple regression is that the independent variables should not be correlated. As we discussed last week, one way to assess multicollinearity is to create a correlation matrix of all of our variables.

Variance Inflation Factor
The VIF indicates the level of multicollinearity between predictors in a fitted model. A VIF of 1 indicates no multicollinearity. A VIF greater than 5 indicates multicollinearity may be impacting model performance. Here we see evidence that multicollinearity should be a concern. This means we will need to consider removing and/or modifying our predictors in some way!
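A sketch of these diagnostic checks in R is below. The fit–residual and QQ plots come from base R's plot() method for lm objects; the VIF calculation assumes the car package (the slides do not name a package), and fit_multi is the hypothetical model object from the earlier sketch:

    # Residuals vs. fitted values: checks linearity and homoscedasticity
    plot(fit_multi, which = 1)

    # Normal QQ plot of the residuals: checks normality
    plot(fit_multi, which = 2)

    # Residuals against a single predictor: checks independence from that predictor
    plot(mtcars$hp, resid(fit_multi), xlab = "hp", ylab = "Residual")
    abline(h = 0, lty = 2)

    # Variance inflation factors (only meaningful with 2+ predictors)
    library(car)  # assumed installed: install.packages("car")
    vif(fit_multi)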
Transforming Your Variables
When assumptions are violated, sometimes transforming the scale of your predictors and/or response can fix the problem. Transforming a variable simply means applying some mathematical function to it. If we have a variable X, we could transform it by: squaring it (X²), taking the natural log (ln(X)), or taking the square root (√X).

How Do You Know When to Transform?
When it appears some of your model assumptions may be violated, exploring potential transformations is important. A “greedy” approach is to just try a bunch of transformations; this can be good for building some intuition behind the process, but I can't recommend it. Looking at the distributions of our variables can assist us. Many of the assumptions are built around the normality of our data, so one heuristic is to try to normalize each variable in our data.

Normalize?
[Figure: histogram of the area of each state alongside a histogram of the log of the area of each state.]

Log-Transforms Are Common
It is common to observe sample data skewed to the right. This is when we have a high concentration of values below the mean and a smaller set of extreme values above the mean. For example, let's think of the number of overdose deaths in each US county: most counties will have few if any deaths, while a smaller set of counties have a lot of deaths. A log transform can “reel in” the skew and help the model perform better.

Re-Interpreting the Model After a Transform
When you transform a variable, it changes the interpretation of the model. Often, you will need to de-transform your point estimate to contextualize your finding. We will reflect on this more as we learn how to interpret the results of a linear regression.

Outliers
A linear regression is not robust to outliers. An outlying data point can meaningfully impact model performance, as displayed in the photo. It can be useful to identify, and potentially remove, extreme outliers.

Using Cook's Distance to Detect Outliers
Cook's distance displays how much the fitted values of Y are altered by a given observation in your dataset. A Cook's distance can be calculated for each observation. A (hand-wavy) rule is that observations with a Cook's distance 4 times greater than the sample mean Cook's distance are influential and should be considered for inspection.
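A minimal sketch of that rule of thumb in R, again assuming the hypothetical fitted model fit_multi from earlier:

    # Cook's distance for every observation in the fitted model
    cd <- cooks.distance(fit_multi)

    # Flag observations more than 4 times the mean Cook's distance
    cutoff <- 4 * mean(cd)
    which(cd > cutoff)   # row indices worth inspecting (not automatically removing)

    # Index plot with the cutoff marked
    plot(cd, type = "h", ylab = "Cook's distance")
    abline(h = cutoff, lty = 2)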
So, We Checked Assumptions, Now What?
Once we are comfortable that we have met the assumptions of the regression, we can actually use the model to try to answer our scientific research question. There are two aspects to this: (1) evaluating the relationship identified between our predictors X₁, X₂, … and our response Y, and (2) diagnosing how well our model fits our data – ideally, our model will fit the data relatively well.

Interpreting the Results of Our Model
First, let us examine what information we receive when we run a linear regression in R. By using the summary() function, we are able to see a table, as displayed in the photo. This contains key pieces of information we can use to answer our research question.

Intercept and Slopes
Recall we are trying to fit a linear function of the form Y = β₀ + β₁X₁ + … + βₖXₖ. The first column, Estimate, shows our beta values, which (rounded to two decimals) give us our fitted regression equation.

Intercept and Slopes – Hypothesis Testing
As we can see, each slope defines the observed linear relationship between a variable X and our outcome Y. In this case, every one-unit increase in horsepower is associated with a decrease of 0.03 miles per gallon for the vehicle. We are curious, though: is this a result of random chance, or does it represent a population-level pattern (i.e., do cars with greater horsepower actually have reduced fuel efficiency)?

Intercept and Slopes – Hypothesis Testing
As we have discussed, in inferential statistics we start by assuming a null hypothesis. If there is no relationship between a variable X and our outcome Y, then the beta coefficient corresponding to X should equal zero: H₀: β = 0. So, if we assume that there is no relationship between X and Y, we want to ask how probable our observed slope is!

Standard Error of Regression Slope
In order to run a hypothesis test, we need to capture the standard error of our regression slope. R calculates this automatically, but it essentially measures the dispersion of the residuals for this variable: SE(β₁) = √[ (SSE / (n − 2)) / Σ(xᵢ − x̄)² ], where SSE is the sum of squared residuals and Σ(xᵢ − x̄)² captures the variance of the predictor X.

Standard Error of Regression Slope
This value captures how closely our data fit the regression equation: the smaller the standard error, the closer the data fall to the fitted line. Notice here that we have n − 2 degrees of freedom to calculate dispersion – that's because you need 2 data points to fit a line! The remaining n − 2 can thus inform dispersion. (In fact, 2 data points can't have a standard error because you can always fit a line through them.)

Running a T-test
Recall that to run a linear regression we have assumed that the residuals are normally distributed! Thus, under our null hypothesis (β = 0), we would assume that our residuals are centered around 0 and normally distributed around this mean. This actually meets the conditions for running a one-sample t-test with n − 2 degrees of freedom! We can calculate our t-value for a given coefficient as t = β / SE(β), using the estimated slope and its standard error.

T-Value in Table
R has actually run this t-test for us. Using our formula, let's look at hp: t equals the estimated hp slope divided by its standard error. We can then compare our t-value to a t-distribution with the model's residual degrees of freedom (n − 2 in a simple regression). In this case, we see that we have 28 degrees of freedom. The corresponding two-tailed p-value is displayed in the column to the right; in this case our p-value is approximately 0.011.

A Significant Slope?
In this case, we have found a significant slope. Is this useful? Well, the finding provides evidence that our null hypothesis is wrong. This significance test indicates that there is a non-zero population-level relationship between X and Y. However, the t-test does not tell us whether this relationship is practically meaningful! Further, it does not provide us an estimate of the value of β at the population level.

Our Point Estimate Is Our Best Estimate
Here we found that a one-unit increase in horsepower is associated with a 0.031-unit decrease in miles per gallon. Based on our sample data, this represents our best estimate for the population-level relationship between X and Y. But, since it comes from a random sample, we can't be confident the population-level parameter will be exactly this. Thus, we calculate a confidence interval by generating a t-distribution around our sampled slope.

Confidence Interval for Slope
To calculate the confidence interval for a given slope β₁, we compute the two following values: β₁ − t# × SE(β₁) and β₁ + t# × SE(β₁). Here t# represents the critical value of our t-distribution (n − 2 degrees of freedom) for a 95% confidence level. In this example, t# is approximately 2.05, so we calculate the interval as −0.031 ± 2.05 × SE(β₁). We refer to this as a 95% CI, and it represents the range of plausible values for the population-level slope.

What Do We Typically Want to Extract From This Table?
In a research paper, it is common to present the following pieces of information:
1. The point estimate
2. The standard error
3. The confidence interval
4. The p-value (rounded to 2-3 decimals)
While we can see how we could extract this data by hand, it would be nice to just automate the creation of a table, just like we did with table one!
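Before automating anything, here is a sketch of pulling those quantities out of a fitted model by hand in R; it assumes the hypothetical fit_multi object from earlier, and confint() reproduces the manual β₁ ± t# × SE(β₁) interval:

    # Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
    coefs <- summary(fit_multi)$coefficients
    coefs["hp", ]

    # 95% CI by hand: estimate +/- t# * SE, using the residual degrees of freedom
    est    <- coefs["hp", "Estimate"]
    se     <- coefs["hp", "Std. Error"]
    t_crit <- qt(0.975, df = df.residual(fit_multi))
    c(lower = est - t_crit * se, upper = est + t_crit * se)

    # The same interval from the built-in helper
    confint(fit_multi, "hp", level = 0.95)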
We will end the lecture by going over how to do this, but first, we need to discuss other ways to diagnose the performance of our model.

Coefficient of Determination, R²
The coefficient of determination represents the proportion of variation in the response that is explained by the model: R² = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)², where ŷᵢ represents the predicted values of Y and ȳ represents the mean value of Y. Values closer to 1 indicate that the model explains more of the variation in the outcome. In a simple regression, R² is simply the square of the Pearson correlation between X and Y.

Adjusted Coefficient of Determination, Adjusted R²
As you add more independent variables, the coefficient tends to inflate. To adjust for this we apply the following equation: Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1), where k is the number of predictors in the model. This results in a more conservative estimate of the coefficient of determination, and it is also really useful when comparing the performance of models with different numbers of predictors.

ANOVA for Multiple Regression
A rather broad question we can ask is whether the model provides any information at all about the outcome. A model would be uninformative if the slopes for every variable were equal to 0. This matches the null hypothesis for a one-way ANOVA: H₀: β₁ = β₂ = … = βₖ = 0.

Partitioning Model Variability
The logic of the ANOVA test is that we can partition total variability in our outcome (SSTotal) into variability explained by the model (SSModel) and variability attributed to error (SSE): SSTotal = SSModel + SSE.

Partitioning Model Variability
From here, we calculate the mean square for the model and for the error: MSModel = SSModel / k and MSE = SSE / (n − k − 1), where k is the number of predictors in the model.

Partitioning Model Variability
Finally, we calculate F = MSModel / MSE. This value follows an F-distribution with k and n − k − 1 degrees of freedom. Recall, an F-distribution represents the behavior of a variable defined as the ratio of two sums of squares.

Partitioning Model Variability
A significant result provides evidence that the model explains some of the variation in the outcome. The ANOVA test does not tell us which variables in the model explain variation in the outcome – that is what the t-tests for the slopes were for. Often, this test is not reported in practice, or is noted in a footnote of a table reporting regression results; the existence of significant coefficients often makes this test moot.

Outputting Our Results to a Table

Now We Can Work With It in Excel!
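The slides do not show the code for building the results table, so as an assumption, here is one common way to do it with broom::tidy() and base R's write.csv(); the resulting CSV opens directly in Excel:

    # broom turns the coefficient table into a data.frame,
    # including 95% confidence intervals
    library(broom)   # assumed installed: install.packages("broom")
    results <- as.data.frame(tidy(fit_multi, conf.int = TRUE, conf.level = 0.95))

    # Round the numeric columns for presentation (2-3 decimals, as suggested above)
    results[-1] <- lapply(results[-1], round, 3)

    # Write to CSV: this file can be opened directly in Excel
    write.csv(results, "regression_results.csv", row.names = FALSE)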
