3003PSY Tutorial 5: Regression Diagnostics
Griffith University
Dr Natalie Loxton
Summary
This tutorial provides a step-by-step guide to regression diagnostics, focusing on checking assumptions in multiple regression models. The document covers outlier detection, normality, and homoscedasticity assessment procedures, as well as transformations. It's intended for undergraduate-level students learning data analysis techniques.
In our last tutorial we learnt how to run and interpret a multiple regression. However, most statistical analyses come with assumptions that we need to check. In 1003PSY you learnt that when running a t-test we use Levene’s Test to check for equal variances; in 2000PSY you learnt that when running an ANOVA we check for sphericity using Mauchly’s Test. Similarly, multiple regression has a few assumptions of its own. This week we will run through an example of data screening and assumption checking. It wouldn’t hurt to run through this material twice. In fact, I strongly encourage you to, as it may take a while to get your head around.

Why do we need to check assumptions?

Ordinary Least Squares (OLS) regression is a statistical procedure that is based upon the theory of the normal curve. Basically, we assume that various aspects of the data are normal, and the most important indicator of this is the set of residuals that we obtain after running a multiple regression. Severe departures from normality violate the assumptions upon which we have based our modelling. In turn, the inferences we draw from our results may be faulty. Some problems with the data can even prevent the analysis from running at all.

As per last week, the purpose of the present analysis is to determine whether students' attitudes towards doing statistics courses are predicted by population demographics. This week we will examine whether students' age and level of mathematical education predict their attitudes towards doing statistics courses. (We have dropped gender from last week so we can focus on just two predictors.) The ATS datafile contains data on the cohort’s attitudes towards statistics, as you have seen in previous classes, and we will again use the dataset from earlier tutorials (ATSTute4.sav).
⭐ IMPORTANT NOTE: This tutorial goes through assumption checking step by step, so you will see some repetition in the syntax commands. This is purely for teaching and learning. Once you understand which assumptions you are checking, you will be able to merge several of the following steps into one to be more efficient.

On Learn@GU there are two worksheets you can use to make notes for this tutorial. The answers will also be made available.
- tutorial 5, regression assumption testing check summary.docx
- tutorial 5 worksheet, regression assumptions.pdf

A) Multiple regression with assumption checking: the steps

1) Check for outliers that are data entry errors

The first test is not really a test in any sense of the word. Outliers are data points that stand apart from the rest of the data (either very large or very small values compared to the rest of the dataset). Some apparent outliers actually stem from errors of data entry (or related matters). The data entry person may have mashed the keypad, or may have slipped and typed 7 instead of 4 when the item is measured on a 5-point scale. At this stage the checks are univariate, which means we consider each variable separately (i.e., not the entire set of variables jointly or simultaneously).

Consequently, the first thing we do with data is run range and logic checks. This first step is to get basic descriptive statistics that include the mean, SD, minimum and maximum values, and the number of valid values for each variable. Use the descriptives syntax to do this. Scan the output and look for any values that do not fit. For example, are there values that seem extremely large for any scale (e.g., > 1000 on a 7-point scale)? Does someone report (despite all logic) that their year of birth is in the late 12th century? Do they report that they use their phone 46 hours per day when there are only 24 hours in a day?

descriptives var= course age mathsBC.
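If you would like the output to show only the statistics needed for the range check, a slightly more explicit version of the same command could look like the sketch below. The /statistics subcommand is an addition that is not in the worksheet syntax; the number of valid cases is printed either way.

```spss
* Range check: request the mean, SD, minimum and maximum explicitly.
descriptives var= course age mathsBC
  /statistics= mean stddev min max.
```

When reading the output, compare each minimum and maximum against the range that is actually possible for that variable (for example, the lowest and highest score the scale allows).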
This step can really be done as soon as you get your data set, because data entry errors also affect any other analyses and descriptives that you run. We teach it in this tutorial because it links to the other outlier checks later on; however, you can also think of this step as part of the data management that you conduct before your analysis.

2) Ensure you have an ID variable.

Recall from the previous tute how to add an identifying variable (if you haven’t already):

compute ID = $casenum.
execute.

3) Run a multiple regression.

For this tutorial, we are going to use mathematical education (mathsBC) and age (age) to predict attitudes towards course statistics (course). Remember the syntax from last week is:

regression var= course age mathsBC
 /statistics= defaults zpp
 /dep= course
 /enter.

4) Check for normality and homoscedasticity

After having run the regression and examined the results, we now check for violations of assumptions. Remember, we present these procedures in a linear order to facilitate learning, but be aware that in practice they are often run both before and after the regression analysis.

Recall that fitting a regression model involves expressing the Y variable (our criterion) as a function of the predicted score, Yʹ, and error, e:

Y = Yʹ + e

Error is the residual that each person is assigned: the difference between their actual and predicted scores. As you will see in this tutorial, the assumptions are mostly about the residual scores (what is NOT accounted for by our model).

Normality. The first assumption we are checking in this step is the assumption of normality: are our residuals normally distributed and centred around zero?

Homoscedasticity. The second assumption is the assumption of homoscedasticity: do the residuals have a constant variance?

So how do we ensure that our regression meets these two assumptions?
By adding a few lines to our regression syntax command! When we run a regression, we can calculate a predicted score for each participant based on our regression model. We can therefore also calculate, for each participant, the difference between their actual score and their predicted score (this is their residual score).

Our first added line to the regression syntax standardises everyone’s residual scores (like deriving a Z score) and plots them on a histogram. The histogram needs to be normally distributed and centred around zero to meet the first assumption of normality.

As well as being normally distributed, residuals ought to be distributed similarly across values of the predictors. That is, when the underlying relationships are linear and other assumptions are met, we expect the regression equation to be just as good at predicting values for low levels of the predictors as for high levels. This is the assumption of homoscedasticity. Our second added line to the regression syntax plots the residual scores against the predicted values in a scatterplot. The scatterplot should look like a rectangle to meet the second assumption of homoscedasticity. A rectangular plot suggests that the spread of the residuals is similar across the range of predicted values. Anything different from this suggests problems: a fan shape indicates heteroscedasticity, which often stems from nonnormality in one of the variables.

Below is the modified regression syntax. Run our regression again using the added lines that check the assumptions. Examine the plots and consider the information they present. What do you conclude?

regression var= course age mathsBC
 /statistics= defaults zpp
 /dep= course
 /enter
 /residuals= histogram(zresid) id(ID)
 /scatterplot (*zresid,*zpred).

So what do we do if we cannot meet our assumptions?
There are a few things that we can try to help meet our assumptions. First, we can examine our data for statistical outliers and remove them from the dataset to help with normality. Second, we can check our variables for skewness and kurtosis and transform them. The following steps present how we examine each of these in turn.

5) Test for univariate outliers.

If we suspect non-normality after looking at the residuals, we begin by looking for potential outliers. Here we are looking for statistical outliers, not data entry errors or related matters, as we checked for those earlier. There are several ways to do this and we will look at each in turn.

First, the frequencies procedure provides frequency tables and histograms that can show whether there are values in the data that, although within the valid range for the variable in question, stand apart from the rest. The syntax is below. We add the line /histogram=normal, which produces a histogram for each variable with the normal curve overlaid on our data to help in assessing the normality of the distribution. How do you find the data?

frequencies var= course age mathsBC
 /stat=def
 /histogram=normal.

If we are making a very careful examination of the variables, we can use a specialised procedure in SPSS called examine. Although current versions of SPSS are phasing out this approach, it is still very useful for learning. It too is a syntax-only approach in SPSS; it’s not available via menus. The syntax is below. Note that the variable listed for the /id= subcommand needs to match whatever the actual participant ID variable is called in your own dataset.

examine var= course age mathsBC
 /id= ID
 /plot= histogram boxplot npplot
 /statistics.

The examine command produces a lot of output, so let’s quickly go through the important bits.

Output

First, the descriptives box provides a wide range of descriptive statistics for each of the variables.
In amongst these you will find the mean, the SD and other statistics you recognise. That is all we need for now. Later in the tutorial, when we consider normality in more detail, we will return to the skewness and kurtosis values provided in these tables.

Then, for each variable you listed in your syntax, you are provided with four graphs. Two of these you will already be familiar with: the histogram and the boxplot. We can use these to assess how “normal” the data look and to examine univariate outliers. The two additional plots are the normal probability plot (NPP) and a detrended version of it. We won’t worry about the detrended plots. The NPP rearranges our data so that, if the data were perfectly normal, all the points would sit exactly on the 45° line in the plot. The extent to which the points slip off the line is an indicator of how far the data depart from normality. We will return to these plots below when we talk about normality.

What we are interested in at the moment is the boxplots: the box-and-whiskers plots may contain circles or stars above or below the whiskers. SPSS draws a star (*) for the most extreme cases, those lying more than three box-lengths beyond the edge of the box. In this course we treat starred cases as univariate outliers, in line with the statistical criterion of a Z score greater than 3.29 in absolute value. A Z as large as 3.29 in absolute value is associated with a p value of .001 in the standard normal curve. So a case with a Z greater than 3.29 (ignoring the sign of the Z) would occur with a probability of less than .001 if the data were normally distributed. The numbers in the plot are the values of the ID variable we created above, but the plot shows numbers only if there are not many cases.

✏ Take note of any participants whose data count as a univariate outlier at the .001 level (star). If you find any, you can consider dropping them from the sample, that is, running the analysis again after you have removed them from the data.
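If you want to check the Z > 3.29 criterion directly rather than reading it off the boxplot, one possible approach is sketched below. It is not part of the worksheet syntax. It assumes the ID variable from step 2 exists and that SPSS names the saved standardised variables Zcourse, Zage and ZmathsBC (its usual convention of prefixing Z, provided those names are not already in use). The flag name uni_out is made up for this example.

```spss
* Save standardised versions of each variable as Zcourse, Zage and ZmathsBC.
descriptives var= course age mathsBC
  /save.

* Flag any case with an absolute Z score above 3.29 on at least one variable.
compute uni_out = (abs(Zcourse) > 3.29 or abs(Zage) > 3.29 or abs(ZmathsBC) > 3.29).
execute.

* List only the flagged cases without changing the dataset.
temporary.
select if (uni_out = 1).
list variables= ID Zcourse Zage ZmathsBC.
```

Because of the temporary command, the selection applies only to the list command that follows it, so no cases are removed from the file.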
One way to drop a case is to delete that participant, but as we like to keep all our data ‘just in case’, we would prefer to just de-select it. We can do this by creating a filter variable. A filter variable is like a sieve: SPSS only selects certain participants to run the analysis on. The syntax to create a filter variable is:

compute filter = (ID > 0).
variable labels filter 'filter variable to exclude outliers'.
value labels filter 0 'not selected' 1 'selected'.
formats filter (f1.0).
filter by filter.
execute.

Warning! You will need to run the filter EVERY TIME you re-open the file (check that the little “/” marks appear in the far-left column for the cases you have filtered out).

As you can see if you type this into syntax, rather than one command with a few sub-options, we have a series of one-line commands that, when run together, create a filter variable. Let’s look at this block of syntax line by line.
- compute filter = (ID > 0). creates the new variable. Every ID is greater than zero, so the expression is true for every case and everyone starts out coded 1.
- variable labels attaches a description to the new variable.
- value labels names the two codes: 0 'not selected' and 1 'selected'.
- formats filter (f1.0). displays the variable as a single digit with no decimal places.
- filter by filter. switches the filter on, so that cases coded 0 are left out of analyses without being deleted.
- execute. carries out the pending commands.

Now when we go back to our dataset, we see that we have a new variable, in this case the variable “filter”, where all participants are coded as one. What happens if we change a participant to a zero? As you will see, a line is drawn across that row number, showing that SPSS will not include that participant in any subsequent analyses. The filter variable is an easy way to remove individual participants from the analysis without deleting their data entirely. We can easily add them back in by changing their value on the filter variable to a one again.

Now that you have made a filter variable, filter out any univariate outliers that you found.

NOTE. Only filter out outliers once. If you were to run examine again, you would most likely find that SPSS has identified new statistical outliers. Ignore these, otherwise you will find yourself in a never-ending loop.

6) Test for multivariate outliers.

Sometimes, the removal of a couple of univariate outliers is all that is needed to produce regression plots that are acceptable. However, at other times the search must go on!
Multivariate outliers are cases that are outliers when considered against all the variables simultaneously. They need not (but could) be outliers on any of the variables individually. A good way to remember this is to consider an 18 year old earning $100,000 a year. Neither the age of 18 nor the annual salary of $100,000 is necessarily an outlier in a sample from the general public, but the combination of the two is likely a multivariate outlier. Because we are usually looking for multivariate outliers in the space defined by several variables at once (not just two), we cannot visualise this using something like a scatterplot. Instead we must apply a statistical technique.

Identifying multivariate outliers

One of the great Indian statisticians of the 20th century, Prasanta Chandra Mahalanobis, gave us a technique named after him that generalises the Z score method we use in looking for univariate outliers. Mahalanobis’ Distance is a measure of how far a case is from the centroid of the multidimensional (multivariate) normal distribution created from all the predictors. Instead of getting a Z score, Mahalanobis’ Distance is distributed as chi-square, and we look for cases with a chi-square corresponding to a p value less than .001.

SPSS can give us the value of Mahalanobis’ Distance for the 10 most extreme points in the dataset (this is an arbitrary number). We compare these 10 most extreme points with a critical chi-square value: the .001 cutoff for degrees of freedom equal to the number of predictors. We find this cutoff using a chi-square table. In our example, we need the critical chi-square value for p = .001 and df = 2 (because we have 2 predictors in the current analysis), and this value happens to be 13.816 (you have been provided with a table on Blackboard that shows the .001 cutoff value for other degrees of freedom). So, if any case in our sample has a value of Mahalanobis’ Distance greater than 13.816, it is considered to be a multivariate outlier.
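If you do not have the chi-square table to hand, the cutoff can also be obtained inside SPSS. The sketch below is an optional extra rather than part of the worksheet: it uses the built-in inverse chi-square function idf.chisq, and the variable name chicrit is made up for this example. A dataset needs to be open because compute writes the value into a new column.

```spss
* Critical chi-square for p = .001 with df = 2 (two predictors).
compute chicrit = idf.chisq(.999, 2).
formats chicrit (f8.3).

* Every case holds the same constant, so the minimum shows the cutoff.
descriptives var= chicrit
  /statistics= min.
```

The result should agree with the tabled value of 13.816. For df = 2 specifically, the cutoff also equals -2 × ln(.001), which is approximately 13.816. For a different number of predictors, change the second argument of idf.chisq to match.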
We get Mahalanobis’ Distance by once more amending our basic regression syntax (see below; note again the use of the ID variable we created earlier). In practice we would combine this syntax with that for the residuals plots and get everything at once in the output window, as this would be most efficient. However, to best facilitate your learning, let us do one thing at a time. What can you conclude about the presence of multivariate outliers in the sample?

regression var= course age mathsBC
 /statistics= defaults zpp
 /dep= course
 /enter
 /residuals = outliers(mahal) id(ID).

✏ Again, take note of any participants who exceed the cutoff for a multivariate outlier and, using your filter variable, remove them from your dataset and rerun your regression. How do your results look now?

NOTE: When testing for univariate outliers, the question to ask yourself when examining a regression with outliers removed is: has the nature of the results changed due to removing the outliers? If there is no change, then the outliers should be returned to the data set (i.e., we no longer filter them out). HOWEVER, because multivariate outliers are particularly problematic, we ALWAYS remove them (and say so in our reports). See the decision flowchart: Outlier and Transformation decision flowcharts.pdf

7) Examine skewness and kurtosis

If the residuals are not in order after the above checks, addressing the normality of the predictors and criterion may well help, but not always. Transformations are not panaceas, especially if the issues lie with the nature of the data we have collected or the nature of the underlying relationship under examination. Maybe OLS regression is just not the appropriate technique for us in some cases. The best thing to do now is to rerun the examine syntax, and this is especially true if we have dropped any univariate or multivariate outliers.
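As a sketch of what this rerun might look like once you are comfortable with the individual steps, the block below filters out flagged cases and then runs the combined regression and examine commands. The case numbers 17 and 42 are purely hypothetical placeholders, not results from the ATS data, and the block assumes the filter and ID variables from the earlier steps already exist.

```spss
* Hypothetical example: suppose cases 17 and 42 were flagged as outliers.
if (ID = 17 or ID = 42) filter = 0.
filter by filter.
execute.

* Rerun the regression with the residual plots and Mahalanobis check together.
regression var= course age mathsBC
  /statistics= defaults zpp
  /dep= course
  /enter
  /residuals= histogram(zresid) outliers(mahal) id(ID)
  /scatterplot (*zresid,*zpred).

* Rerun examine on the remaining cases.
examine var= course age mathsBC
  /id= ID
  /plot= histogram boxplot npplot
  /statistics.
```

To return a case to the analysis, set its filter value back to 1 in the same way and run filter by filter again.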
We now turn our attention to the indicators of (non)normality. Remember those results we brushed over when looking for univariate outliers? Now is their time to shine!

An ideal normal distribution is symmetric about the mean and has a particular curve that corresponds to a certain percentage of the cases being contained under each part of the curve. If a dataset is skewed, it does not follow the symmetric shape of the normal curve. Instead it leans somewhat to the left or right, and the long, skewer-like tail points either to the left for negative skew or to the right for positive skew. If you have forgotten this from earlier years, it may help to sketch these curves on scrap paper. Similarly, if a curve is too peaky (tall and skinny) it is leptokurtic, and if it is too short and fat it is platykurtic. We call the underlying quantity kurtosis.

Let’s look at the histograms and normal probability plots of each of our variables to get an idea of which variables may violate our assumption of normality.

In addition to looking at plots, we can also find a skewness and a kurtosis statistic. In SPSS, examine is one procedure that provides skewness and kurtosis values and their associated standard errors to help us assess the (non)normality of our data. You will find these values in the statistics box for each variable, very close to the bottom. Briefly, the skewness and kurtosis values shown ought to be zero when the data are perfectly normally distributed. However, as we know that chance plays a role in sampling, we use the respective standard errors to test whether skewness and kurtosis differ significantly from zero. Because these tests are basically oversensitive, we use a very stringent criterion of .001 rather than .05 for significance. We divide the skewness and kurtosis values by their respective standard errors.
Then, we say that a variable is skewed or kurtotic if the corresponding ratio (value / standard error) is greater than 3.29 in absolute value.

Transformations

If something is or looks skewed, we consider transformation. A transformation is a mathematical procedure that modifies the shape of the data in certain permissible ways that help us to attenuate violations of the normality assumptions where possible. There are several ways to transform variables. One example is presented here for its practical value: the logarithmic transformation. We might consider applying the logarithmic transformation to the predictor age, which appears to have moderate positive skew (hump to the left, tail to the right). The syntax below performs the transformation with a compute statement and then follows up with an examine procedure on the newly transformed variable lnage, to help assess whether the transformation has had the desired effect.

compute lnage= ln(age).
examine var= lnage
 /id= ID
 /plot= histogram boxplot npplot
 /statistics.

What can you conclude about transforming age? Has it helped? Rerun your regression using your transformed variable rather than the original and note any changes to your assumptions and to the actual regression parameters (regression results). How do they compare with those from before the transformation? Should you keep the transformed variable or go back to the original?

Note. The log transformation is good for quite severely skewed data, but sometimes a square root transformation is sufficient for moderately skewed data. It is worth trying both. See the graphs in the lecture for the different types of skew and suggested transformations.

Negative skew: These transformations only work on positively skewed variables. If you have a variable that is negatively skewed (tail points to the left), you will need to reflect the variable (similar to a reverse-scored item) before you compute a transformation.
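A minimal sketch of the reflect-then-transform sequence is given below. The variable hoursworked and its maximum possible score of 40 are illustrative assumptions, not variables in the ATS datafile, and the new variable names are made up for this example. Subtracting each score from the maximum plus one means the smallest reflected score is 1, so the logarithm is always defined.

```spss
* Hypothetical variable hoursworked with a maximum possible score of 40.
compute reflecthours = 41 - hoursworked.

* Try both transformations on the reflected (now positively skewed) variable.
compute lnreflecthours = ln(reflecthours).
compute sqrtreflecthours = sqrt(reflecthours).
execute.

* Check whether either version is closer to normal.
examine var= lnreflecthours sqrtreflecthours
  /id= ID
  /plot= histogram boxplot npplot
  /statistics.
```

Whichever version you keep, remember that high scores on the reflected variable correspond to low scores on the original.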
NOTE: because of your reflection, the sign of the regression coefficients will be the opposite. This is OK; you just alert the reader to the fact.

To reflect your variable, we could use the recode procedure, which is handy when we have a smallish number of discrete scores (e.g., 1-2-3-4-5-6-7). Instead, we can do it mathematically, as this can be applied to all sorts of scores and scales. Take the maximum score a given variable can take and add 1 to it. Write this down. We will call this number MAXPLUSONE in the syntax below. Write the following syntax line in your SPSS file:

compute REFLECTEDVAR = MAXPLUSONE - OLDVAR.

Remember to replace the bits in UPPER CASE with your actual variables and numbers. For example, suppose I have a variable called hoursworked with a maximum score of 40. My MAXPLUSONE number would be 41 and I would write the syntax below to reflect it.

compute reflectHOURSWORKED = 41 - hoursworked.

There are some subtleties to this that I have skimmed over, but they do not matter here. This procedure is strongly recommended wherever you need to deal with negative skewness in the assignment. You then have a variable that can be tackled with the standard approaches to skewness (i.e., transformations) you know from the course.

Important point to keep in mind: the regression coefficients (b-weight, beta, semipartials [but not squared semipartials]) for a reflected variable will all be of the opposite sign to what you would expect (since the variable scoring has been reflected). So, if you get a b-weight of -.3 for something that has been reflected, it really would be .3 if the variable were not reflected. This seems odd, but you get used to it pretty quickly. It is what researchers do all the time; just inform the reader either in the text or in a note to a table, as appropriate.

B) Creating a summary of your data cleaning and assumption check for your write up.

Finally, when we have done the data screening and cleaning, we need to write it up.
You just need to state what you did and whether it made a difference. Try writing up a summary for the example we went through in class. Here is an example from a different dataset.

Example Diagnostics write-up

Worksheet updated by Dr Natalie Loxton April 2021
Image by John Petalcurin on Unsplash