Chapter 15: Statistical Analysis

Summary

This document covers key statistical methods, including the Bonferroni and Holm corrections for multiple comparisons and the assumptions of one-way ANOVA. It provides practical examples in R and discusses how to interpret and write up the results of post hoc tests. It also covers checking and relaxing the homogeneity of variance and normality assumptions, the Kruskal-Wallis test, and an introduction to linear regression. The text is aimed at an undergraduate audience.

Full Transcript

The usual solution to this problem is to introduce an adjustment to the p-value, which aims to control the total error rate across the family of tests (see Sha↵er, 1995). An adjustment of this form, which is usually (but not always) applied because one is doing post hoc analysis, is often referred to as a correction for multiple comparisons, though it is sometimes referred to as “simultaneous inference”. In any case, there are quite a few di↵erent ways of doing this adjustment. I’ll discuss a few of them in this section and in Section 16.8, but you should be aware that there are many other methods out there (see, e.g., Hsu, 1996). 14.5.3 Bonferroni corrections The simplest of these adjustments is called the Bonferroni correction (Dunn, 1961), and it’s very very simple indeed. Suppose that my post hoc analysis consists of m separate tests, and I want to ensure that the total probability of making any Type I errors at all is at most ↵.10 If so, then the Bonferroni correction just says “multiply all your raw p-values by m”. If we let p denote the original p-value, and let p1j be the corrected value, then the Bonferroni correction tells that: p1 “ m ˆ p And therefore, if you’re using the Bonferroni correction, you would reject the null hypothesis if p1 † ↵. The logic behind this correction is very straightforward. We’re doing m di↵erent tests; so if we arrange it so that each test has a Type I error rate of at most ↵{m, then the total Type I error rate across these tests cannot be larger than ↵. That’s pretty simple, so much so that in the original paper, the author writes: The method given here is so simple and so general that I am sure it must have been used before this. I do not find it, however, so can only conclude that perhaps its very simplicity has kept statisticians from realizing that it is a very good method in some situations (Dunn, 1961, pp 52-53) To use the Bonferroni correction in R, you can use the pairwise.t.test() function,11 making sure that you set p.adjust.method = "bonferroni". Alternatively, since the whole reason why we’re doing these pairwise tests in the first place is because we have an ANOVA that we’re trying to understand, it’s probably more convenient to use the posthocPairwiseT() function in the lsr package, since we can use my.anova as the input: > posthocPairwiseT( my.anova, p.adjust.method = "bonferroni") Pairwise comparisons using t tests with pooled SD data: mood.gain and drug placebo anxifree anxifree 0.4506 - joyzepam 9.1e-05 0.0017 P value adjustment method: bonferroni 10 It’s worth noting in passing that not all adjustment methods try to do this. What I’ve described here is an approach for controlling “family wise Type I error rate”. However, there are other post hoc tests seek to control the “false discovery rate”, which is a somewhat di↵erent thing. 11 There’s also a function called p.adjust() in which you can input a vector of raw p-values, and it will output a vector of adjusted p-values. This can be handy sometimes. I should also note that more advanced users may wish to consider using some of the tools provided by the multcomp package. - 444 - If we compare these three p-values to those that we saw in the previous section when we made no adjustment at all, it is clear that the only thing that R has done is multiply them by 3. 14.5.4 Holm corrections Although the Bonferroni correction is the simplest adjustment out there, it’s not usually the best one to use. One method that is often used instead is the Holm correction (Holm, 1979). 
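Footnote 11 mentions the p.adjust() function, which is worth a quick look, since it applies the same corrections to any vector of raw p-values (and it also handles the Holm method discussed next). A minimal sketch, using the approximate raw pairwise p-values from the unadjusted analysis (that is, the Bonferroni-corrected values above divided by three); the variable name raw.p is my own:
> raw.p <- c( 3.0e-05, 5.7e-04, 0.1502 )    # approximate raw p-values for the three comparisons
> p.adjust( raw.p, method = "bonferroni" )  # multiply each by m = 3 (results are capped at 1)
Swapping in method = "holm" gives the Holm-corrected values that we turn to now.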
The idea behind the Holm correction is to pretend that you’re doing the tests sequentially; starting with the smallest (raw) p-value and moving onto the largest one. For the j-th largest of the p-values, the adjustment is either p1j “ j ˆ pj (i.e., the biggest p-value remains unchanged, the second biggest p-value is doubled, the third biggest p-value is tripled, and so on), or p1j “ p1j`1 whichever one is larger. This might sound a little confusing, so let’s go through it a little more slowly. Here’s what the Holm correction does. First, you sort all of your p-values in order, from smallest to largest. For the smallest p-value all you do is multiply it by m, and you’re done. However, for all the other ones it’s a two-stage process. For instance, when you move to the second smallest p value, you first multiply it by m ´ 1. If this produces a number that is bigger than the adjusted p-value that you got last time, then you keep it. But if it’s smaller than the last one, then you copy the last p-value. To illustrate how this works, consider the table below, which shows the calculations of a Holm correction for a collection of five p-values: raw p rank j pˆj Holm p.001 5.005.005.005 4.020.020.019 3.057.057.022 2.044.057.103 1.103.103 Hopefully that makes things clear. Although it’s a little harder to calculate, the Holm correction has some very nice properties: it’s more powerful than Bonferroni (i.e., it has a lower Type II error rate), but – counterintuitive as it might seem – it has the same Type I error rate. As a consequence, in practice there’s never any reason to use the simpler Bonferroni correction, since it is always outperformed by the slightly more elaborate Holm correction. Because of this, the Holm correction is the default one used by pairwise.t.test() and posthocPairwiseT(). To run the Holm correction in R, you could specify p.adjust.method = "Holm" if you wanted to, but since it’s the default you can just to do this: > posthocPairwiseT( my.anova ) Pairwise comparisons using t tests with pooled SD data: mood.gain and drug placebo anxifree anxifree 0.1502 - joyzepam 9.1e-05 0.0011 P value adjustment method: holm - 445 - As you can see, the biggest p-value (corresponding to the comparison between Anxifree and the placebo) is unaltered: at a value of.15, it is exactly the same as the value we got originally when we applied no correction at all. In contrast, the smallest p-value (Joyzepam versus placebo) has been multiplied by three. 14.5.5 Writing up the post hoc test Finally, having run the post hoc analysis to determine which groups are significantly di↵erent to one another, you might write up the result like this: Post hoc tests (using the Holm correction to adjust p) indicated that Joyzepam produced a significantly larger mood change than both Anxifree (p “.001) and the placebo (p “ 9.1 ˆ 10´5 ). We found no evidence that Anxifree performed better than the placebo (p “.15). Or, if you don’t like the idea of reporting exact p-values, then you’d change those numbers to p †.01, p †.001 and p °.05 respectively. Either way, the key thing is that you indicate that you used Holm’s correction to adjust the p-values. And of course, I’m assuming that elsewhere in the write up you’ve included the relevant descriptive statistics (i.e., the group means and standard deviations), since these p-values on their own aren’t terribly informative. 14.6 Assumptions of one-way ANOVA Like any statistical test, analysis of variance relies on some assumptions about the data. 
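Before turning to those assumptions, one quick aside: the Holm table above is easy to reproduce by hand, which makes the algorithm concrete. A minimal sketch using the five raw p-values from the table (the variable names are my own):
> raw.p <- c( .001, .005, .019, .022, .103 )   # the five raw p-values, already sorted
> m     <- length( raw.p )
> adj   <- raw.p * ( m:1 )                     # multiply by m, m-1, ..., 1
> holm  <- cummax( adj )                       # never let an adjusted value fall below the previous one
> holm                                         # .005 .020 .057 .057 .103, as in the table
> p.adjust( raw.p, method = "holm" )           # the built-in function gives the same answer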
There are three key assumptions that you need to be aware of: normality, homogeneity of variance and independence. If you remember back to Section 14.2.4 – which I hope you at least skimmed even if you didn’t read the whole thing – I described the statistical models underpinning ANOVA, which I wrote down like this: H0 : Yik “ µ ` ✏ik H1 : Yik “ µk ` ✏ik In these equations µ refers to a single, grand population mean which is the same for all groups, and µk is the population mean for the k-th group. Up to this point we’ve been mostly interested in whether our data are best described in terms of a single grand mean (the null hypothesis) or in terms of di↵erent group- specific means (the alternative hypothesis). This makes sense, of course: that’s actually the important research question! However, all of our testing procedures have – implicitly – relied on a specific assumption about the residuals, ✏ik , namely that ✏ik „ Normalp0, 2 q None of the maths works properly without this bit. Or, to be precise, you can still do all the calculations, and you’ll end up with an F -statistic, but you have no guarantee that this F -statistic actually measures what you think it’s measuring, and so any conclusions that you might draw on the basis of the F test might be wrong. So, how do we check whether this assumption about the residuals is accurate? Well, as I indicated above, there are three distinct claims buried in this one statement, and we’ll consider them separately. Normality. The residuals are assumed to be normally distributed. As we saw in Section 13.9, we can assess this by looking at QQ plots or running a Shapiro-Wilk test. I’ll talk about this in an ANOVA context in Section 14.9. - 446 - Homogeneity of variance. Notice that we’ve only got the one value for the population standard deviation (i.e., ), rather than allowing each group to have it’s own value (i.e., k ). This is referred to as the homogeneity of variance (sometimes called homoscedasticity) assumption. ANOVA assumes that the population standard deviation is the same for all groups. We’ll talk about this extensively in Section 14.7. Independence. The independence assumption is a little trickier. What it basically means is that, knowing one residual tells you nothing about any other residual. All of the ✏ik values are assumed to have been generated without any “regard for” or “relationship to” any of the other ones. There’s not an obvious or simple way to test for this, but there are some situations that are clear violations of this: for instance, if you have a repeated-measures design, where each participant in your study appears in more than one condition, then independence doesn’t hold; there’s a special relationship between some observations... namely those that correspond to the same person! When that happens, you need to use something like repeated measures ANOVA. I don’t currently talk about repeated measures ANOVA in this book, but it will be included in later versions. 14.6.1 How robust is ANOVA? One question that people often want to know the answer to is the extent to which you can trust the results of an ANOVA if the assumptions are violated. Or, to use the technical language, how robust is ANOVA to violations of the assumptions. Due to deadline constraints I don’t have the time to discuss this topic. This is a topic I’ll cover in some detail in a later version of the book. 
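One way to make the model behind these assumptions concrete is to simulate data that satisfies all three of them: group-specific means, plus residuals that are independent, normal, and share a single standard deviation. The sketch below is purely illustrative; the group means, the common sigma and the seed are made-up values chosen to be roughly in the spirit of the clinical trial example, and the variable names are my own:
> set.seed( 42 )                                    # arbitrary seed, for reproducibility
> mu.k  <- c( placebo = 0.45, anxifree = 0.72, joyzepam = 1.48 )   # hypothetical group means
> sigma <- 0.3                                      # one common sigma: the homogeneity assumption
> g <- rep( names(mu.k), each = 6 )                 # three groups of six observations, like clin.trial
> y <- mu.k[ g ] + rnorm( 18, mean = 0, sd = sigma )  # Y_ik = mu_k + epsilon_ik, epsilon ~ Normal(0, sigma^2)
> summary( aov( y ~ g ) )                           # aov() treats g as a grouping factor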
14.7 Checking the homogeneity of variance assumption There’s more than one way to skin a cat, as the saying goes, and more than one way to test the homo- geneity of variance assumption, too (though for some reason no-one made a saying out of that). The most commonly used test for this that I’ve seen in the literature is the Levene test (Levene, 1960), and the closely related Brown-Forsythe test (Brown & Forsythe, 1974), both of which I’ll describe here. Alternatively, you could use the Bartlett test, which is implemented in R via the bartlett.test() function, but I’ll leave it as an exercise for the reader to go check that one out if you’re interested. Levene’s test is shockingly simple. Suppose we have our outcome variable Yik. All we do is define a new variable, which I’ll call Zik , corresponding to the absolute deviation from the group mean: ˇ ˇ Zik “ ˇYik ´ Ȳk ˇ Okay, what good does this do us? Well, let’s take a moment to think about what Zik actually is, and what we’re trying to test. The value of Zik is a measure of how the i-th observation in the k-th group deviates from its group mean. And our null hypothesis is that all groups have the same variance; that is, the same overall deviations from the group means! So, the null hypothesis in a Levene’s test is that the population means of Z are identical for all groups. Hm. So what we need now is a statistical test of the null hypothesis that all group means are identical. Where have we seen that before? Oh right, that’s what ANOVA is... and so all that the Levene’s test does is run an ANOVA on the new variable Zik. What about the Brown-Forsythe test? Does that do anything particularly di↵erent? Nope. The only change from the Levene’s test is that it constructs the transformed variable Z in a slightly di↵erent way, using deviations from the group medians rather than deviations from the group means. That is, for the Brown-Forsythe test, Zik “ |Yik ´ mediank pY q| - 447 - where mediank pY q is the median for group k. Regardless of whether you’re doing the standard Levene test or the Brown-Forsythe test, the test statistic – which is sometimes denoted F , but sometimes written as W – is calculated in exactly the same way that the F -statistic for the regular ANOVA is calculated, just using a Zik rather than Yik. With that in mind, let’s just move on and look at how to run the test in R. 14.7.1 Running the Levene’s test in R Okay, so how do we run the Levene test? Obviously, since the Levene test is just an ANOVA, it would be easy enough to manually create the transformed variable Zik and then use the aov() function to run an ANOVA on that. However, that’s the tedious way to do it. A better way to do run your Levene’s test is to use the leveneTest() function, which is in the car package. As usual, we first load the package > library( car ) and now that we have, we can run our Levene test. The main argument that you need to specify is y, but you can do this in lots of di↵erent ways. Probably the simplest way to do it is actually input the original aov object. Since I’ve got the my.anova variable stored from my original ANOVA, I can just do this: > leveneTest( my.anova ) Levene’s Test for Homogeneity of Variance (center = median) Df F value Pr(>F) group 2 1.47 0.26 15 If we look at the output, we see that the test is non-significant pF2,15 “ 1.47, p “.26q, so it looks like the homogeneity of variance assumption is fine. 
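To see the "Levene is just an ANOVA" claim in action, here is a minimal sketch that builds the transformed variable Z by hand and feeds it to aov(), assuming the clin.trial data frame from earlier in the chapter. This corresponds to leveneTest() with center = mean:
> Y       <- clin.trial$mood.gain        # the outcome variable
> G       <- clin.trial$drug             # the grouping factor
> gp.mean <- tapply( Y, G, mean )        # the group means
> Ybar    <- gp.mean[ G ]                # group mean attached to each observation
> Z       <- abs( Y - Ybar )             # absolute deviations from the group means
> summary( aov( Z ~ G ) )                # an ANOVA on Z is the Levene test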
Remember, although R reports the test statistic as an F -value, it could equally be called W , in which case you’d just write W2,15 “ 1.47. Also, note the part of the output that says center = median. That’s telling you that, by default, the leveneTest() function actually does the Brown-Forsythe test. If you want to use the mean instead, then you need to explicitly set the center argument, like this: > leveneTest( y = my.anova, center = mean ) Levene’s Test for Homogeneity of Variance (center = mean) Df F value Pr(>F) group 2 1.45 0.27 15 That being said, in most cases it’s probably best to stick to the default value, since the Brown-Forsythe test is a bit more robust than the original Levene test. 14.7.2 Additional comments Two more quick comments before I move onto a di↵erent topic. Firstly, as mentioned above, there are other ways of calling the leveneTest() function. Although the vast majority of situations that call for a Levene test involve checking the assumptions of an ANOVA (in which case you probably have a variable like my.anova lying around), sometimes you might find yourself wanting to specify the variables directly. Two di↵erent ways that you can do this are shown below: > leveneTest(y = mood.gain ~ drug, data = clin.trial) # y is a formula in this case > leveneTest(y = clin.trial$mood.gain, group = clin.trial$drug) # y is the outcome - 448 - Secondly, I did mention that it’s possible to run a Levene test just using the aov() function. I don’t want to waste a lot of space on this, but just in case some readers are interested in seeing how this is done, here’s the code that creates the new variables and runs an ANOVA. If you are interested, feel free to run this to verify that it produces the same answers as the Levene test (i.e., with center = mean): > Y G gp.mean Ybar Z summary( aov(Z ~ G) ) # run the ANOVA That said, I don’t imagine that many people will care about this. Nevertheless, it’s nice to know that you could do it this way if you wanted to. And for those of you who do try it, I think it helps to demystify the test a little bit when you can see – with your own eyes – the way in which Levene’s test relates to ANOVA. 14.8 Removing the homogeneity of variance assumption In our example, the homogeneity of variance assumption turned out to be a pretty safe one: the Levene test came back non-significant, so we probably don’t need to worry. However, in real life we aren’t always that lucky. How do we save our ANOVA when the homogeneity of variance assumption is violated? If you recall from our discussion of t-tests, we’ve seen this problem before. The Student t-test assumes equal variances, so the solution was to use the Welch t-test, which does not. In fact, Welch (1951) also showed how we can solve this problem for ANOVA too (the Welch one-way test). It’s implemented in R using the oneway.test() function. The arguments that we’ll need for our example are: formula. This is the model formula, which (as usual) needs to specify the outcome variable on the left hand side and the grouping variable on the right hand side: i.e., something like outcome ~ group. data. Specifies the data frame containing the variables. var.equal. If this is FALSE (the default) a Welch one-way test is run. If it is TRUE then it just runs a regular ANOVA. The function also has a subset argument that lets you analyse only some of the observations and a na.action argument that tells it how to handle missing data, but these aren’t necessary for our purposes. 
So, to run the Welch one-way ANOVA for our example, we would do this: > oneway.test(mood.gain ~ drug, data = clin.trial) One-way analysis of means (not assuming equal variances) data: mood.gain and drug F = 26.322, num df = 2.000, denom df = 9.493, p-value = 0.000134 To understand what’s happening here, let’s compare these numbers to what we got earlier in Section 14.3 when we ran our original ANOVA. To save you the trouble of flicking back, here are those numbers again, this time calculated by setting var.equal = TRUE for the oneway.test() function: - 449 - > oneway.test(mood.gain ~ drug, data = clin.trial, var.equal = TRUE) One-way analysis of means data: mood.gain and drug F = 18.611, num df = 2, denom df = 15, p-value = 8.646e-05 Okay, so originally our ANOVA gave us the result F p2, 15q “ 18.6, whereas the Welch one-way test gave us F p2, 9.49q “ 26.32. In other words, the Welch test has reduced the within-groups degrees of freedom from 15 to 9.49, and the F -value has increased from 18.6 to 26.32. 14.9 Checking the normality assumption Testing the normality assumption is relatively straightforward. We covered most of what you need to know in Section 13.9. The only thing we really need to know how to do is pull out the residuals (i.e., the ✏ik values) so that we can draw our QQ plot and run our Shapiro-Wilk test. First, let’s extract the residuals. R provides a function called residuals() that will do this for us. If we pass our my.anova to this function, it will return the residuals. So let’s do that: > my.anova.residuals hist( x = my.anova.residuals ) # plot a histogram (similar to Figure 14.3a) > qqnorm( y = my.anova.residuals ) # draw a QQ plot (similar to Figure 14.3b) > shapiro.test( x = my.anova.residuals ) # run Shapiro-Wilk test Shapiro-Wilk normality test data: my.anova.residuals W = 0.9602, p-value = 0.6053 The histogram and QQ plot are both shown in Figure 14.3, and both look pretty normal to me.12 This is supported by the results of our Shapiro-Wilk test (W “.96, p “.61) which finds no indication that normality is violated. 14.10 Removing the normality assumption Now that we’ve seen how to check for normality, we are led naturally to ask what we can do to address violations of normality. In the context of a one-way ANOVA, the easiest solution is probably to switch to a non-parametric test (i.e., one that doesn’t rely on any particular assumption about the kind of 12 Note that neither of these figures has been tidied up at all: if you want to create nicer looking graphs it’s always a good idea to use the tools from Chapter 6 to help you draw cleaner looking images. - 450 - 7 0.50 6 0.25 5 Sample Quantiles Frequency 4 0.00 3 −0.25 2 1 −0.50 0 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 −2 −1 0 1 2 Residuals Theoretical Quantiles (a) (b) Figure 14.3: Histogram (panel a) and QQ plot (panel b) for the residuals of our ANOVA: both of these are in agreement with the Shapiro-Wilk test, and suggest that the residuals are normally distributed....................................................................................................... distribution involved). We’ve seen non-parametric tests before, in Chapter 13: when you only have two groups, the Wilcoxon test provides the non-parametric alternative that you need. When you’ve got three or more groups, you can use the Kruskal-Wallis rank sum test (Kruskal & Wallis, 1952). So that’s the test we’ll talk about next. 14.10.1 The logic behind the Kruskal-Wallis test The Kruskal-Wallis test is surprisingly similar to ANOVA, in some ways. 
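One way to see the similarity is that the only real change is a preprocessing step: replace the raw scores with their ranks before doing anything else. In R that step is a one-liner; a minimal sketch assuming the clin.trial data frame, anticipating the notation defined below (tied scores receive their average rank by default):
> R.ik <- rank( clin.trial$mood.gain )     # rank all N observations, smallest = 1
> tapply( R.ik, clin.trial$drug, mean )    # the mean rank within each group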
In ANOVA, we started with Yik , the value of the outcome variable for the ith person in the kth group. For the Kruskal-Wallis test, what we’ll do is rank order all of these Yik values, and conduct our analysis on the ranked data. So let’s let Rik refer to the ranking given to the ith member of the kth group. Now, let’s calculate R̄k , the average rank given to observations in the kth group: 1 ÿ R̄k “ Rik NK i and let’s also calculate R̄, the grand mean rank: 1 ÿÿ R̄ “ Rik N i k Now that we’ve done this, we can calculate the squared deviations from the grand mean rank R̄. When we do this for the individual scores – i.e., if we calculate pRik ´ R̄q2 – what we have is a “nonparametric” measure of how far the ik-th observation deviates from the grand mean rank. When we calculate the squared deviation of the group means from the grand means – i.e., if we calculate pR̄k ´ R̄q2 – then what we have is a nonparametric measure of how much the group deviates from the grand mean rank. With - 451 - this in mind, let’s follow the same logic that we did with ANOVA, and define our ranked sums of squares measures in much the same way that we did earlier. First, we have our “total ranked sums of squares”: ÿÿ RSStot “ pRik ´ R̄q2 k i and we can define the “between groups ranked sums of squares” like this: ∞ ∞ RSSb “ ∞k i pR̄k ´ R̄q2 “ k Nk pR̄k ´ R̄q 2 So, if the null hypothesis is true and there are no true group di↵erences at all, you’d expect the between group rank sums RSSb to be very small, much smaller than the total rank sums RSStot. Qualitatively this is very much the same as what we found when we went about constructing the ANOVA F -statistic; but for technical reasons the Kruskal-Wallis test statistic, usually denoted K, is constructed in a slightly di↵erent way: RSSb K “ pN ´ 1q ˆ RSStot and, if the null hypothesis is true, then the sampling distribution of K is approximately chi-square with G´1 degrees of freedom (where G is the number of groups). The larger the value of K, the less consistent the data are with null hypothesis, so this is a one-sided test: we reject H0 when K is sufficiently large. 14.10.2 Additional details The description in the previous section illustrates the logic behind the Kruskal-Wallis test. At a conceptual level, this is the right way to think about how the test works. However, from a purely mathematical perspective it’s needlessly complicated. I won’t show you the derivation, but you can use a bit of algebraic jiggery-pokery13 to show that the equation for K can be rewritten as 12 ÿ K“ Nk R̄k2 ´ 3pN ` 1q N pN ´ 1q k It’s this last equation that you sometimes see given for K. This is way easier to calculate than the version I described in the previous section, it’s just that it’s totally meaningless to actual humans. It’s probably best to think of K the way I described it earlier... as an analogue of ANOVA based on ranks. But keep in mind that the test statistic that gets calculated ends up with a rather di↵erent look to it than the one we used for our original ANOVA. But wait, there’s more! Dear lord, why is there always more? The story I’ve told so far is only actually true when there are no ties in the raw data. That is, if there are no two observations that have exactly the same value. If there are ties, then we have to introduce a correction factor to these calculations. 
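Since all the pieces are simple, the K statistic defined above can be computed directly, which is a nice sanity check on the formulas. A minimal sketch, again assuming the clin.trial data frame and using my own variable names; note that this ignores ties, so the value will differ slightly from what kruskal.test() reports, which is exactly why the correction factor described next is needed:
> R.ik    <- rank( clin.trial$mood.gain )            # ranks of all N observations
> N       <- length( R.ik )
> R.bar.k <- tapply( R.ik, clin.trial$drug, mean )   # group mean ranks
> N.k     <- table( clin.trial$drug )                # group sizes
> R.bar   <- mean( R.ik )                            # grand mean rank
> RSS.tot <- sum( ( R.ik - R.bar )^2 )               # total ranked sums of squares
> RSS.b   <- sum( N.k * ( R.bar.k - R.bar )^2 )      # between groups ranked sums of squares
> K       <- ( N - 1 ) * RSS.b / RSS.tot             # the (uncorrected) Kruskal-Wallis statistic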
At this point I’m assuming that even the most diligent reader has stopped caring (or at least formed the opinion that the tie-correction factor is something that doesn’t require their immediate attention). So I’ll very quickly tell you how it’s calculated, and omit the tedious details about why it’s done this way. Suppose we construct a frequency table for the raw data, and let fj be the number of observations that have the j-th unique value. This might sound a bit abstract, so here’s the R code showing a concrete example: > f print(f) # we have some ties... 0.1 0.2 0.3 0.4 0.5 0.6 0.8 0.9 1.1 1.2 1.3 1.4 1.7 1.8 1 1 2 1 1 2 1 1 1 1 2 2 1 1 13 A technical term. - 452 - Looking at this table, notice that the third entry in the frequency table has a value of 2. Since this corresponds to a mood.gain of 0.3, this table is telling us that two people’s mood increased by 0.3. More to the point, note that we can say that f has a value of 2. Or, in the mathematical notation I introduced above, this is telling us that f3 “ 2. Yay. So, now that we know this, the tie correction factor (TCF) is: ∞ 3 j fj ´ fj TCF “ 1 ´ N3 ´ N The tie-corrected value of the Kruskal-Wallis statistic obtained by dividing the value of K by this quantity: it is this tie-corrected version that R calculates. And at long last, we’re actually finished with the theory of the Kruskal-Wallis test. I’m sure you’re all terribly relieved that I’ve cured you of the existential anxiety that naturally arises when you realise that you don’t know how to calculate the tie-correction factor for the Kruskal-Wallis test. Right? 14.10.3 How to run the Kruskal-Wallis test in R Despite the horror that we’ve gone through in trying to understand what the Kruskal-Wallis test actu- ally does, it turns out that running the test is pretty painless, since R has a function called kruskal.test(). The function is pretty flexible, and allows you to input your data in a few di↵erent ways. Most of the time you’ll have data like the clin.trial data set, in which you have your outcome variable mood.gain, and a grouping variable drug. If so, you can call the kruskal.test() function by specifying a formula, and a data frame: > kruskal.test(mood.gain ~ drug, data = clin.trial) Kruskal-Wallis rank sum test data: mood.gain by drug Kruskal-Wallis chi-squared = 12.076, df = 2, p-value = 0.002386 A second way of using the kruskal.test() function, which you probably won’t have much reason to use, is to directly specify the outcome variable and the grouping variable as separate input arguments, x and g: > kruskal.test(x = clin.trial$mood.gain, g = clin.trial$drug) This isn’t very interesting, since it’s just plain easier to specify a formula. However, sometimes it can be useful to specify x as a list. What I mean is this. Suppose you actually had data as three separate variables, placebo, anxifree and joyzepam. If that’s the format that your data are in, then it’s convenient to know that you can bundle all three together as a list: > mood.gain kruskal.test( x = mood.gain ) And again, this would give you exactly the same results as the command we tried originally. 14.11 On the relationship between ANOVA and the Student t test There’s one last thing I want to point out before finishing. It’s something that a lot of people find kind of surprising, but it’s worth knowing about: an ANOVA with two groups is identical to the Student t-test. - 453 - No, really. It’s not just that they are similar, but they are actually equivalent in every meaningful way. 
I won’t try to prove that this is always true, but I will show you a single concrete demonstration. Suppose that, instead of running an ANOVA on our mood.gain ~ drug model, let’s instead do it using therapy as the predictor. If we run this ANOVA, here’s what we get: > summary( aov( mood.gain ~ therapy, data = clin.trial )) Df Sum Sq Mean Sq F value Pr(>F) therapy 1 0.47 0.467 1.71 0.21 Residuals 16 4.38 0.274 Overall, it looks like there’s no significant e↵ect here at all but, as we’ll see in Chapter 16 this is actually a misleading answer! In any case, it’s irrelevant to our current goals: our interest here is in the F -statistic, which is F p1, 16q “ 1.71, and the p-value, which is.21. Since we only have two groups, I didn’t actually need to resort to an ANOVA, I could have just decided to run a Student t-test. So let’s see what happens when I do that: > t.test( mood.gain ~ therapy, data = clin.trial, var.equal = TRUE ) Two Sample t-test data: mood.gain by therapy t = -1.3068, df = 16, p-value = 0.2098 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.84495 0.20051 sample estimates: mean in group none mean in group CBT 0.72222 1.04444 Curiously, the p-values are identical: once again we obtain a value of p “.21. But what about the test statistic? Having run a t-test instead of an ANOVA, we get a somewhat di↵erent answer, namely tp16q “ ´1.3068. However, there is a fairly straightforward relationship here. If we square the t-statistic > 1.3068 ^ 2 1.7077 we get the F -statistic from before. 14.12 Summary There’s a fair bit covered in this chapter, but there’s still a lot missing. Most obviously, I haven’t yet discussed any analog of the paired samples t-test for more than two groups. There is a way of doing this, known as repeated measures ANOVA, which will appear in a later version of this book. I also haven’t discussed how to run an ANOVA when you are interested in more than one grouping variable, but that will be discussed in a lot of detail in Chapter 16. In terms of what we have discussed, the key topics were: The basic logic behind how ANOVA works (Section 14.2) and how to run one in R(Section 14.3). How to compute an e↵ect size for an ANOVA (Section 14.4) - 454 - Post hoc analysis and corrections for multiple testing (Section 14.5). The assumptions made by ANOVA (Section 14.6). How to check the homogeneity of variance assumption (Section 14.7) and what to do if it is violated (Section 14.8). How to check the normality assumption (Section 14.9) and what to do if it is violated (Section 14.10). As with all of the chapters in this book, there are quite a few di↵erent sources that I’ve relied upon, but the one stand-out text that I’ve been most heavily influenced by is Sahai and Ageel (2000). It’s not a good book for beginners, but it’s an excellent book for more advanced readers who are interested in understanding the mathematics behind ANOVA. - 455 - - 456 - 15. Linear regression The goal in this chapter is to introduce linear regression, the standard tool that statisticians rely on when analysing the relationship between interval scale predictors and interval scale outcomes. Stripped to its bare essentials, linear regression models are basically a slightly fancier version of the Pearson correlation (Section 5.7) though as we’ll see, regression models are much more powerful tools. 15.1 What is a linear regression model? 
Since the basic ideas in regression are closely tied to correlation, we’ll return to the parenthood.Rdata file that we were using to illustrate how correlations work. Recall that, in this data set, we were trying to find out why Dan is so very grumpy all the time, and our working hypothesis was that I’m not getting enough sleep. We drew some scatterplots to help us examine the relationship between the amount of sleep I get, and my grumpiness the following day. The actual scatterplot that we draw is the one shown in Figure 15.1, and as we saw previously this corresponds to a correlation of r “ ´.90, but what we find ourselves secretly imagining is something that looks closer to Figure 15.2a. That is, we mentally draw a straight line through the middle of the data. In statistics, this line that we’re drawing is called a regression line. Notice that – since we’re not idiots – the regression line goes through the middle of the data. We don’t find ourselves imagining anything like the rather silly plot shown in Figure 15.2b. This is not highly surprising: the line that I’ve drawn in Figure 15.2b doesn’t “fit” the data very well, so it doesn’t make a lot of sense to propose it as a way of summarising the data, right? This is a very simple observation to make, but it turns out to be very powerful when we start trying to wrap just a little bit of maths around it. To do so, let’s start with a refresher of some high school maths. The formula for a straight line is usually written like this: y “ mx ` c Or, at least, that’s what it was when I went to high school all those years ago. The two variables are x and y, and we have two coefficients, m and c. The coefficient m represents the slope of the line, and the coefficient c represents the y-intercept of the line. Digging further back into our decaying memories of high school (sorry, for some of us high school was a long time ago), we remember that the intercept is interpreted as “the value of y that you get when x “ 0”. Similarly, a slope of m means that if you increase the x-value by 1 unit, then the y-value goes up by m units; a negative slope means that the y-value would go down rather than up. Ah yes, it’s all coming back to me now. Now that we’ve remembered that, it should come as no surprise to discover that we use the exact same formula to describe a regression line. If Y is the outcome variable (the DV) and X is the predictor - 457 - 90 80 My grumpiness (0−100) 70 60 50 40 5 6 7 8 9 My sleep (hours) Figure 15.1: Scatterplot showing grumpiness as a function of hours slept........................................................................................................ variable (the IV), then the formula that describes our regression is written like this: Ŷi “ b1 Xi ` b0 Hm. Looks like the same formula, but there’s some extra frilly bits in this version. Let’s make sure we understand them. Firstly, notice that I’ve written Xi and Yi rather than just plain old X and Y. This is because we want to remember that we’re dealing with actual data. In this equation, Xi is the value of predictor variable for the ith observation (i.e., the number of hours of sleep that I got on day i of my little study), and Yi is the corresponding value of the outcome variable (i.e., my grumpiness on that day). And although I haven’t said so explicitly in the equation, what we’re assuming is that this formula works for all observations in the data set (i.e., for all i). Secondly, notice that I wrote Ŷi and not Yi. 
This is because we want to make the distinction between the actual data Yi , and the estimate Ŷi (i.e., the prediction that our regression line is making). Thirdly, I changed the letters used to describe the coefficients from m and c to b1 and b0. That’s just the way that statisticians like to refer to the coefficients in a regression model. I’ve no idea why they chose b, but that’s what they did. In any case b0 always refers to the intercept term, and b1 refers to the slope. Excellent, excellent. Next, I can’t help but notice that – regardless of whether we’re talking about the good regression line or the bad one – the data don’t fall perfectly on the line. Or, to say it another way, the data Yi are not identical to the predictions of the regression model Ŷi. Since statisticians love to attach letters, names and numbers to everything, let’s refer to the di↵erence between the model prediction and that actual data point as a residual, and we’ll refer to it as ✏i.1 Written using mathematics, the residuals are defined as: ✏i “ Yi ´ Ŷi which in turn means that we can write down the complete linear regression model as: Y i “ b 1 X i ` b0 ` ✏ i 1 The ✏ symbol is the Greek letter epsilon. It’s traditional to use ✏i or ei to denote a residual. - 458 - The Best Fitting Regression Line Not The Best Fitting Regression Line! 90 90 80 80 My grumpiness (0−100) My grumpiness (0−100) 70 70 60 60 50 50 40 40 5 6 7 8 9 5 6 7 8 9 My sleep (hours) My sleep (hours) (a) (b) Figure 15.2: Panel a shows the sleep-grumpiness scatterplot from Figure 15.1 with the best fitting regres- sion line drawn over the top. Not surprisingly, the line goes through the middle of the data. In contrast, panel b shows the same data, but with a very poor choice of regression line drawn over the top........................................................................................................ 15.2 Estimating a linear regression model Okay, now let’s redraw our pictures, but this time I’ll add some lines to show the size of the residual for all observations. When the regression line is good, our residuals (the lengths of the solid black lines) all look pretty small, as shown in Figure 15.3a, but when the regression line is a bad one, the residuals are a lot larger, as you can see from looking at Figure 15.3b. Hm. Maybe what we “want” in a regression model is small residuals. Yes, that does seem to make sense. In fact, I think I’ll go so far as to say that the “best fitting” regression line is the one that has the smallest residuals. Or, better yet, since statisticians seem to like to take squares of everything why not say that... The estimated regression coefficients, b̂0 and ∞ b̂1 are those that∞ minimise the sum of the squared residuals, which we could either write as i pYi ´ Ŷi q2 or as i ✏i 2. Yes, yes that sounds even better. And since I’ve indented it like that, it probably means that this is the right answer. And since this is the right answer, it’s probably worth making a note of the fact that our regression coefficients are estimates (we’re trying to guess the parameters that describe a population!), which is why I’ve added the little hats, so that we get b̂0 and b̂1 rather than b0 and b1. Finally, I should also note that – since there’s actually more than one way to estimate a regression model – the more technical name for this estimation process is ordinary least squares (OLS) regression. At this point, we now have a concrete definition for what counts as our “best” choice of regression coefficients, b̂0 and b̂1. 
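That definition is easy to turn into a few lines of R, which makes the idea of "smallest sum of squared residuals" concrete. A minimal sketch, assuming the parenthood data frame used in this chapter; the helper function ss.residuals and the two candidate intercept/slope pairs are my own made-up illustrations, one roughly sensible and one deliberately poor:
> ss.residuals <- function( b0, b1, x, y ) {
+     y.hat <- b1 * x + b0                # the predictions this candidate line makes
+     sum( ( y - y.hat )^2 )              # the quantity that OLS minimises
+ }
> ss.residuals( b0 = 125, b1 = -9, x = parenthood$dan.sleep, y = parenthood$dan.grump )  # near the data
> ss.residuals( b0 = 80,  b1 =  3, x = parenthood$dan.sleep, y = parenthood$dan.grump )  # a poor line
Whichever candidate pair gives the smaller number is the better line in the least squares sense; the estimates are simply the pair that makes it as small as possible.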
The natural question to ask next is, if our optimal regression coefficients are those that minimise the sum squared residuals, how do we find these wonderful numbers? The actual answer - 459 - Regression Line Close to the Data Regression Line Distant from the Data 90 90 80 80 My grumpiness (0−100) My grumpiness (0−100) 70 70 60 60 50 50 40 40 5 6 7 8 9 5 6 7 8 9 My sleep (hours) My sleep (hours) (a) (b) Figure 15.3: A depiction of the residuals associated with the best fitting regression line (panel a), and the residuals associated with a poor regression line (panel b). The residuals are much smaller for the good regression line. Again, this is no surprise given that the good line is the one that goes right through the middle of the data........................................................................................................ to this question is complicated, and it doesn’t help you understand the logic of regression.2 As a result, this time I’m going to let you o↵ the hook. Instead of showing you how to do it the long and tedious way first, and then “revealing” the wonderful shortcut that R provides you with, let’s cut straight to the chase... and use the lm() function (short for “linear model”) to do all the heavy lifting. 15.2.1 Using the function The lm() function is a fairly complicated one: if you type ?lm, the help files will reveal that there are a lot of arguments that you can specify, and most of them won’t make a lot of sense to you. At this stage however, there’s really only two of them that you care about, and as it turns out you’ve seen them before: formula. A formula that specifies the regression model. For the simple linear regression models that we’ve talked about so far, in which you have a single predictor variable as well as an intercept term, this formula is of the form outcome ~ predictor. However, more complicated formulas are allowed, and we’ll discuss them later. 2 Or at least, I’m assuming that it doesn’t help most people. But on the o↵ chance that someone reading this is a proper kung fu master of linear algebra (and to be fair, I always have a few of these people in my intro stats class), it will help you to know that the solution to the estimation problem turns out to be b̂ “ pX1 Xq´1 X1 y, where b̂ is a vector containing the estimated regression coefficients, X is the “design matrix” that contains the predictor variables (plus an additional column containing all ones; strictly X is a matrix of the regressors, but I haven’t discussed the distinction yet), and y is a vector containing the outcome variable. For everyone else, this isn’t exactly helpful, and can be downright scary. However, since quite a few things in linear regression can be written in linear algebra terms, you’ll see a bunch of footnotes like this one in this chapter. If you can follow the maths in them, great. If not, ignore it. - 460 - data. The data frame containing the variables. As we saw with aov() in Chapter 14, the output of the lm() function is a fairly complicated object, with quite a lot of technical information buried under the hood. Because this technical information is used by other functions, it’s generally a good idea to create a variable that stores the results of your regression. With this in mind, to run my linear regression, the command I want to use is this: > regression.1 print( regression.1 ) Call: lm(formula = dan.grump ~ dan.sleep, data = parenthood) Coefficients: (Intercept) dan.sleep 125.956 -8.937 This looks promising. 
There’s two separate pieces of information here. Firstly, R is politely reminding us what the command was that we used to specify the model in the first place, which can be helpful. More importantly from our perspective, however, is the second part, in which R gives us the intercept b̂0 “ 125.96 and the slope b̂1 “ ´8.94. In other words, the best-fitting regression line that I plotted in Figure 15.2 has this formula: Ŷi “ ´8.94 Xi ` 125.96 15.2.2 Interpreting the estimated model The most important thing to be able to understand is how to interpret these coefficients. Let’s start with b̂1 , the slope. If we remember the definition of the slope, a regression coefficient of b̂1 “ ´8.94 means that if I increase Xi by 1, then I’m decreasing Yi by 8.94. That is, each additional hour of sleep that I gain will improve my mood, reducing my grumpiness by 8.94 grumpiness points. What about the intercept? Well, since b̂0 corresponds to “the expected value of Yi when Xi equals 0”, it’s pretty straightforward. It implies that if I get zero hours of sleep (Xi “ 0) then my grumpiness will go o↵ the scale, to an insane value of (Yi “ 125.96). Best to be avoided, I think. 15.3 Multiple linear regression The simple linear regression model that we’ve discussed up to this point assumes that there’s a single predictor variable that you’re interested in, in this case dan.sleep. In fact, up to this point, every statistical tool that we’ve talked about has assumed that your analysis uses one predictor variable and one outcome variable. However, in many (perhaps most) research projects you actually have multiple predictors that you want to examine. If so, it would be nice to be able to extend the linear regression - 461 - framework to be able to include multiple predictors. Perhaps some kind of multiple regression model would be in order? Multiple regression is conceptually very simple. All we do is add more terms to our regression equation. Let’s suppose that we’ve got two variables that we’re interested in; perhaps we want to use both dan.sleep and baby.sleep to predict the dan.grump variable. As before, we let Yi refer to my grumpiness on the i-th day. But now we have two X variables: the first corresponding to the amount of sleep I got and the second corresponding to the amount of sleep my son got. So we’ll let Xi1 refer to the hours I slept on the i-th day, and Xi2 refers to the hours that the baby slept on that day. If so, then we can write our regression model like this: Yi “ b2 Xi2 ` b1 Xi1 ` b0 ` ✏i As before, ✏i is the residual associated with the i-th observation, ✏i “ Yi ´ Ŷi. In this model, we now have three coefficients that need to be estimated: b0 is the intercept, b1 is the coefficient associated with my sleep, and b2 is the coefficient associated with my son’s sleep. However, although the number of coefficients that need to be estimated has changed, the basic idea of how the estimation works is unchanged: our estimated coefficients b̂0 , b̂1 and b̂2 are those that minimise the sum squared residuals. 15.3.1 Doing it in R Multiple regression in R is no di↵erent to simple regression: all we have to do is specify a more complicated formula when using the lm() function. 
For example, if we want to use both dan.sleep and baby.sleep as predictors in our attempt to explain why I’m so grumpy, then the formula we need is this: dan.grump ~ dan.sleep + baby.sleep Notice that, just like last time, I haven’t explicitly included any reference to the intercept term in this formula; only the two predictor variables and the outcome. By default, the lm() function assumes that the model should include an intercept (though you can get rid of it if you want). In any case, I can create a new regression model – which I’ll call regression.2 – using the following command: > regression.2 print( regression.2 ) Call: lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood) Coefficients: (Intercept) dan.sleep baby.sleep 125.96557 -8.95025 0.01052 The coefficient associated with dan.sleep is quite large, suggesting that every hour of sleep I lose makes me a lot grumpier. However, the coefficient for baby.sleep is very small, suggesting that it doesn’t really matter how much sleep my son gets; not really. What matters as far as my grumpiness goes is how much sleep I get. To get a sense of what this multiple regression model looks like, Figure 15.4 shows a 3D plot that plots all three variables, along with the regression model itself. - 462 - Figure 15.4: A 3D visualisation of a multiple regression model. There are two predictors in the model, dan.sleep and baby.sleep; the outcome variable is dan.grump. Together, these three variables form a 3D space: each observation (blue dots) is a point in this space. In much the same way that a simple linear regression model forms a line in 2D space, this multiple regression model forms a plane in 3D space. When we estimate the regression coefficients, what we’re trying to do is find a plane that is as close to all the blue dots as possible. (This plot was drawn using the scatter3d() function in the car package, and it looked much nicer before it got butchered by the image conversion process that I used to get it into the book pdf) - 463 - 15.3.2 Formula for the general case The equation that I gave above shows you what a multiple regression model looks like when you include two predictors. Not surprisingly, then, if you want more than two predictors all you have to do is add more X terms and more b coefficients. In other words, if you have K predictor variables in the model then the regression equation looks like this: ˜ ¸ ÿ K Yi “ bk Xik ` b0 ` ✏i k“1 15.4 Quantifying the fit of the regression model So we now know how to estimate the coefficients of a linear regression model. The problem is, we don’t yet know if this regression model is any good. For example, the regression.1 model claims that every hour of sleep will improve my mood by quite a lot, but it might just be rubbish. Remember, the regression model only produces a prediction Ŷi about what my mood is like: my actual mood is Yi. If these two are very close, then the regression model has done a good job. If they are very di↵erent, then it has done a bad job. 15.4.1 The R2 value Once again, let’s wrap a little bit of mathematics around this. Firstly, we’ve got the sum of the squared residuals: ÿ SSres “ pYi ´ Ŷi q2 i which we would hope to be pretty small. Specifically, what we’d like is for it to be very small in comparison to the total variability in the outcome variable, ÿ SStot “ pYi ´ Ȳ q2 i While we’re here, let’s calculate these values in R. 
Firstly, in order to make my R commands look a bit more similar to the mathematical equations, I’ll create variables X and Y: > X Y Y.pred SS.resid print( SS.resid ) 1838.722 - 464 - Wonderful. A big number that doesn’t mean very much. Still, let’s forge boldly onwards anyway, and calculate the total sum of squares as well. That’s also pretty simple: > SS.tot print( SS.tot ) 9998.59 Hm. Well, it’s a much bigger number than the last one, so this does suggest that our regression model was making good predictions. But it’s not very interpretable. Perhaps we can fix this. What we’d like to do is to convert these two fairly meaningless numbers into one number. A nice, interpretable number, which for no particular reason we’ll call R2. What we would like is for the value of R2 to be equal to 1 if the regression model makes no errors in predicting the data. In other words, if it turns out that the residual errors are zero – that is, if SSres “ 0 – then we expect R2 “ 1. Similarly, if the model is completely useless, we would like R2 to be equal to 0. What do I mean by “useless”? Tempting as it is demand that the regression model move out of the house, cut its hair and get a real job, I’m probably going to have to pick a more practical definition: in this case, all I mean is that the residual sum of squares is no smaller than the total sum of squares, SSres “ SStot. Wait, why don’t we do exactly that? The formula that provides us with out R2 value is pretty simple to write down, SSres R2 “ 1 ´ SStot and equally simple to calculate in R: > R.squared print( R.squared ) 0.8161018 The R2 value, sometimes called the coefficient of determination3 has a simple interpretation: it is the proportion of the variance in the outcome variable that can be accounted for by the predictor. So in this case, the fact that we have obtained R2 “.816 means that the predictor (my.sleep) explains 81.6% of the variance in the outcome (my.grump). Naturally, you don’t actually need to type in all these commands yourself if you want to obtain the R2 value for your regression model. As we’ll see later on in Section 15.5.3, all you need to do is use the summary() function. However, let’s put that to one side for the moment. There’s another property of R2 that I want to point out. 15.4.2 The relationship between regression and correlation At this point we can revisit my earlier claim that regression, in this very simple form that I’ve discussed so far, is basically the same thing as a correlation. Previously, we used the symbol r to denote a Pearson correlation. Might there be some relationship between the value of the correlation coefficient r and the R2 value from linear regression? Of course there is: the squared correlation r2 is identical to the R2 value for a linear regression with only a single predictor. To illustrate this, here’s the squared correlation: > r print( r^2 ) # print the squared correlation 0.8161027 Yep, same number. In other words, running a Pearson correlation is more or less equivalent to running a linear regression model that uses only one predictor variable. 3 And by “sometimes” I mean “almost never”. In practice everyone just calls it “R-squared”. - 465 - 15.4.3 The adjusted R2 value One final thing to point out before moving on. It’s quite common for people to report a slightly di↵erent measure of model performance, known as “adjusted R2 ”. 
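Since the adjusted R-squared formula discussed next is built from the same two quantities, it's worth collecting the whole calculation above into one runnable chunk (assuming the parenthood data frame, and using the rounded coefficients reported earlier, so the final value is only approximately the 0.816 we got before):
> X <- parenthood$dan.sleep                # the predictor
> Y <- parenthood$dan.grump                # the outcome
> Y.pred <- -8.94 * X + 125.96             # predictions from the (rounded) fitted line
> SS.resid <- sum( ( Y - Y.pred )^2 )      # residual sum of squares
> SS.tot   <- sum( ( Y - mean(Y) )^2 )     # total sum of squares
> R.squared <- 1 - ( SS.resid / SS.tot )   # proportion of variance accounted for
> print( R.squared )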
The motivation behind calculating the adjusted R2 value is the observation that adding more predictors into the model will always call the R2 value to increase (or at least not decrease). The adjusted R2 value introduces a slight change to the calculation, as follows. For a regression model with K predictors, fit to a data set containing N observations, the adjusted R2 is: ˆ ˙ SSres N ´1 adj. R2 “ 1 ´ ˆ SStot N ´K ´1 This adjustment is an attempt to take the degrees of freedom into account. The big advantage of the adjusted R2 value is that when you add more predictors to the model, the adjusted R2 value will only increase if the new variables improve the model performance more than you’d expect by chance. The big disadvantage is that the adjusted R2 value can’t be interpreted in the elegant way that R2 can. R2 has a simple interpretation as the proportion of variance in the outcome variable that is explained by the regression model; to my knowledge, no equivalent interpretation exists for adjusted R2. An obvious question then, is whether you should report R2 or adjusted R2. This is probably a matter of personal preference. If you care more about interpretability, then R2 is better. If you care more about correcting for bias, then adjusted R2 is probably better. Speaking just for myself, I prefer R2 : my feeling is that it’s more important to be able to interpret your measure of model performance. Besides, as we’ll see in Section 15.5, if you’re worried that the improvement in R2 that you get by adding a predictor is just due to chance and not because it’s a better model, well, we’ve got hypothesis tests for that. 15.5 Hypothesis tests for regression models So far we’ve talked about what a regression model is, how the coefficients of a regression model are estimated, and how we quantify the performance of the model (the last of these, incidentally, is basically our measure of e↵ect size). The next thing we need to talk about is hypothesis tests. There are two di↵erent (but related) kinds of hypothesis tests that we need to talk about: those in which we test whether the regression model as a whole is performing significantly better than a null model; and those in which we test whether a particular regression coefficient is significantly di↵erent from zero. At this point, you’re probably groaning internally, thinking that I’m going to introduce a whole new collection of tests. You’re probably sick of hypothesis tests by now, and don’t want to learn any new ones. Me too. I’m so sick of hypothesis tests that I’m going to shamelessly reuse the F -test from Chapter 14 and the t-test from Chapter 13. In fact, all I’m going to do in this section is show you how those tests are imported wholesale into the regression framework. 15.5.1 Testing the model as a whole Okay, suppose you’ve estimated your regression model. The first hypothesis test you might want to try is one in which the null hypothesis that there is no relationship between the predictors and the outcome, and the alternative hypothesis is that the data are distributed in exactly the way that the regression model predicts. Formally, our “null model” corresponds to the fairly trivial “regression” model in which we include 0 predictors, and only include the intercept term b0 H0 : Y i “ b0 ` ✏ i - 466 - If our regression model has K predictors, the “alternative model” is described using the usual formula for a multiple regression model: ˜ ¸ ÿ K H1 : Y i “ bk Xik ` b0 ` ✏i k“1 How can we test these two hypotheses against each other? 
The trick is to understand that just like we did with ANOVA, it’s possible to divide up the total variance SStot into the sum of the residual variance SSres and the regression model variance SSmod. I’ll skip over the technicalities, since we covered most of them in the ANOVA chapter, and just note that: SSmod “ SStot ´ SSres And, just like we did with the ANOVA, we can convert the sums of squares in to mean squares by dividing by the degrees of freedom. SSmod MSmod “ dfmod SSres MSres “ dfres So, how many degrees of freedom do we have? As you might expect, the df associated with the model is closely tied to the number of predictors that we’ve included. In fact, it turns out that dfmod “ K. For the residuals, the total degrees of freedom is dfres “ N ´ K ´ 1. Now that we’ve got our mean square values, you’re probably going to be entirely unsurprised (possibly even bored) to discover that we can calculate an F -statistic like this: MSmod F “ MSres and the degrees of freedom associated with this are K and N ´ K ´ 1. This F statistic has exactly the same interpretation as the one we introduced in Chapter 14. Large F values indicate that the null hypothesis is performing poorly in comparison to the alternative hypothesis. And since we already did some tedious “do it the long way” calculations back then, I won’t waste your time repeating them. In a moment I’ll show you how to do the test in R the easy way, but first, let’s have a look at the tests for the individual regression coefficients. 15.5.2 Tests for individual coefficients The F -test that we’ve just introduced is useful for checking that the model as a whole is performing better than chance. This is important: if your regression model doesn’t produce a significant result for the F -test then you probably don’t have a very good regression model (or, quite possibly, you don’t have very good data). However, while failing this test is a pretty strong indicator that the model has problems, passing the test (i.e., rejecting the null) doesn’t imply that the model is good! Why is that, you might be wondering? The answer to that can be found by looking at the coefficients for the regression.2 model: > print( regression.2 ) Call: lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood) Coefficients: (Intercept) dan.sleep baby.sleep 125.96557 -8.95025 0.01052 - 467 - I can’t help but notice that the estimated regression coefficient for the baby.sleep variable is tiny (0.01), relative to the value that we get for dan.sleep (-8.95). Given that these two variables are absolutely on the same scale (they’re both measured in “hours slept”), I find this suspicious. In fact, I’m beginning to suspect that it’s really only the amount of sleep that I get that matters in order to predict my grumpiness. Once again, we can reuse a hypothesis test that we discussed earlier, this time the t-test. The test that we’re interested has a null hypothesis that the true regression coefficient is zero (b “ 0), which is to be tested against the alternative hypothesis that it isn’t (b ‰ 0). That is: H0 : b“0 H1 : b‰0 How can we test this? Well, if the central limit theorem is kind to us, we might be able to guess that the sampling distribution of b̂, the estimated regression coefficient, is a normal distribution with mean centred on b. What that would mean is that if the null hypothesis were true, then the sampling distribution of b̂ has mean zero and unknown standard deviation. 
In a moment I'll show you how to do the test in R the easy way, but first, let's have a look at the tests for the individual regression coefficients.

15.5.2 Tests for individual coefficients

The F-test that we've just introduced is useful for checking that the model as a whole is performing better than chance. This is important: if your regression model doesn't produce a significant result for the F-test then you probably don't have a very good regression model (or, quite possibly, you don't have very good data). However, while failing this test is a pretty strong indicator that the model has problems, passing the test (i.e., rejecting the null) doesn't imply that the model is good! Why is that, you might be wondering? The answer to that can be found by looking at the coefficients for the regression.2 model:

> print( regression.2 )

Call:
lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood)

Coefficients:
(Intercept)    dan.sleep   baby.sleep
  125.96557     -8.95025      0.01052

I can't help but notice that the estimated regression coefficient for the baby.sleep variable is tiny (0.01), relative to the value that we get for dan.sleep (-8.95). Given that these two variables are absolutely on the same scale (they're both measured in "hours slept"), I find this suspicious. In fact, I'm beginning to suspect that it's really only the amount of sleep that I get that matters in order to predict my grumpiness.

Once again, we can reuse a hypothesis test that we discussed earlier, this time the t-test. The test that we're interested in has a null hypothesis that the true regression coefficient is zero (b = 0), which is to be tested against the alternative hypothesis that it isn't (b ≠ 0). That is:

$$ H_0: b = 0 \qquad\qquad H_1: b \neq 0 $$

How can we test this? Well, if the central limit theorem is kind to us, we might be able to guess that the sampling distribution of b̂, the estimated regression coefficient, is a normal distribution with mean centred on b. What that would mean is that if the null hypothesis were true, then the sampling distribution of b̂ has mean zero and unknown standard deviation. Assuming that we can come up with a good estimate for the standard error of the regression coefficient, se(b̂), then we're in luck. That's exactly the situation for which we introduced the one-sample t-test way back in Chapter 13. So let's define a t-statistic like this:

$$ t = \frac{\hat{b}}{\mbox{se}(\hat{b})} $$

I'll skip over the reasons why, but our degrees of freedom in this case are df = N - K - 1. Irritatingly, the estimate of the standard error of the regression coefficient, se(b̂), is not as easy to calculate as the standard error of the mean that we used for the simpler t-tests in Chapter 13. In fact, the formula is somewhat ugly, and not terribly helpful to look at.4 For our purposes it's sufficient to point out that the standard error of the estimated regression coefficient depends on both the predictor and outcome variables, and is somewhat sensitive to violations of the homogeneity of variance assumption (discussed shortly).

In any case, this t-statistic can be interpreted in the same way as the t-statistics that we discussed in Chapter 13. Assuming that you have a two-sided alternative (i.e., you don't really care if b > 0 or b < 0), then it's the extreme values of t (i.e., a lot less than zero or a lot greater than zero) that suggest that you should reject the null hypothesis.

4 For advanced readers only. The vector of residuals is ε = y − Xb̂. For K predictors plus the intercept, the estimated residual variance is σ̂² = ε′ε / (N − K − 1). The estimated covariance matrix of the coefficients is σ̂²(X′X)⁻¹; the square roots of its main diagonal are se(b̂), our estimated standard errors.
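Before we get to the convenient way of running these tests, here's a minimal sketch showing that each t-statistic really is just the estimate divided by its standard error. It assumes regression.2 and the parenthood data are loaded; the object names are mine, and vcov() is the stock R function that returns the estimated covariance matrix described in the footnote above:

> b.hat <- coef( regression.2 )                   # estimated regression coefficients
> se.b  <- sqrt( diag( vcov( regression.2 ) ) )   # their estimated standard errors
> t.stat <- b.hat / se.b                          # the t-statistics
> N <- nrow( parenthood ) ; K <- 2
> 2 * pt( abs( t.stat ), df = N - K - 1, lower.tail = FALSE )  # two-sided p-values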
15.5.3 Running the hypothesis tests in R

To compute all of the quantities that we have talked about so far, all you need to do is ask for a summary() of your regression model. Since I've been using regression.2 as my example, let's do that:

> summary( regression.2 )

The output that this command produces is pretty dense, but we've already discussed everything of interest in it, so what I'll do is go through it line by line. The first line reminds us of what the actual regression model is:

Call:
lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood)

You can see why this is handy, since it was a little while back when we actually created the regression.2 model, and so it's nice to be reminded of what it was we were doing. The next part provides a quick summary of the residuals (i.e., the ε_i values),

Residuals:
     Min       1Q   Median       3Q      Max
-11.0345  -2.2198  -0.4016   2.6775  11.7496

which can be convenient as a quick and dirty check that the model is okay. Remember, we did assume that these residuals were normally distributed, with mean 0. In particular it's worth quickly checking to see if the median is close to zero, and to see if the first quartile is about the same size as the third quartile. If they look badly off, there's a good chance that the assumptions of regression are violated. These ones look pretty nice to me, so let's move on to the interesting stuff. The next part of the R output looks at the coefficients of the regression model:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 125.96557    3.04095  41.423

Each row of this table gives the estimate of one coefficient, its standard error, the corresponding t-statistic (the estimate divided by the standard error, exactly as described in the previous section) and the p-value for the two-sided test. Further down, summary() also reports the R-squared and adjusted R-squared values, along with the F-test of the model as a whole from Section 15.5.1.

It can also be useful to look at the raw correlations among the variables in the parenthood data. The correlate() function in the lsr package prints the full correlation matrix, and setting test = TRUE adds significance tests:

> library(lsr)
> correlate(parenthood, test=TRUE)

CORRELATIONS
============
- correlation type: pearson
- correlations shown only when both variables are numeric

            dan.sleep  baby.sleep  dan.grump     day
dan.sleep           .    0.628***  -0.903***  -0.098
baby.sleep   0.628***           .  -0.566***  -0.010
dan.grump   -0.903***   -0.566***          .   0.076
day            -0.098      -0.010      0.076       .

---
Signif. codes: . = p < .1, * = p < .05, ** = p < .01, *** = p < .001

15.9.1 Three kinds of residuals

When it comes to checking whether a regression model is behaving itself, most of the diagnostics are built on the residuals, and there are three different kinds that R will happily report: ordinary residuals, standardised residuals and Studentised residuals. The ordinary residuals are just the ε_i values themselves, the difference between the observed outcome and the fitted value that the model produces, and you can extract them like this:

> residuals( object = regression.2 )

One drawback to using ordinary residuals is that they're always on a different scale, depending on what the outcome variable is and how good the regression model is. That is, unless you've decided to run a regression model without an intercept term, the ordinary residuals will have mean 0, but the variance is different for every regression. In a lot of contexts, especially where you're only interested in the pattern of the residuals and not their actual values, it's convenient to estimate the standardised residuals, which are normalised in such a way as to have standard deviation 1. The way we calculate these is to divide the ordinary residual by an estimate of the (population) standard deviation of these residuals. For technical reasons, mumble mumble, the formula for this is:

$$ \epsilon'_i = \frac{\epsilon_i}{\hat{\sigma} \sqrt{1 - h_i}} $$

where σ̂ in this context is the estimated population standard deviation of the ordinary residuals, and h_i is the "hat value" of the ith observation. I haven't explained hat values to you yet (but have no fear,8 it's coming shortly), so this won't make a lot of sense. For now, it's enough to interpret the standardised residuals as if we'd converted the ordinary residuals to z-scores. In fact, that is more or less the truth; it's just that we're being a bit fancier. To get the standardised residuals, the command you want is this:

> rstandard( model = regression.2 )

Note that this function uses a different name for the input argument, but it's still just a linear regression object that the function wants to take as its input here.

8 Or have no hope, as the case may be.

The third kind of residuals are Studentised residuals (also called "jackknifed residuals"), and they're even fancier than standardised residuals. Again, the idea is to take the ordinary residual and divide it by some quantity in order to estimate some standardised notion of the residual, but the formula for doing the calculations this time is subtly different:

$$ \epsilon^{*}_i = \frac{\epsilon_i}{\hat{\sigma}_{(-i)} \sqrt{1 - h_i}} $$

Notice that our estimate of the standard deviation here is written σ̂(-i). What this corresponds to is the estimate of the residual standard deviation that you would have obtained if you just deleted the ith observation from the data set. This sounds like the sort of thing that would be a nightmare to calculate, since it seems to be saying that you have to run N new regression models (even a modern computer might grumble a bit at that, especially if you've got a large data set). Fortunately, some terribly clever person has shown that this standard deviation estimate is actually given by the following equation:

$$ \hat{\sigma}_{(-i)} = \hat{\sigma} \sqrt{\frac{N - K - 1 - {\epsilon'_i}^2}{N - K - 2}} $$

Isn't that a pip? Anyway, the command that you would use if you wanted to pull out the Studentised residuals for our regression model is:

> rstudent( model = regression.2 )

Before moving on, I should point out that you don't often need to manually extract these residuals yourself, even though they are at the heart of almost all regression diagnostics. That is, the residuals(), rstandard() and rstudent() functions are all useful to know about, but most of the time the various functions that run the diagnostics will take care of these calculations for you. Even so, it's always nice to know how to actually get hold of these things yourself in case you ever need to do something non-standard.
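For instance, if you ever do want to take the non-standard route, here's a minimal sketch that reproduces rstandard() and rstudent() directly from the formulas above. It assumes regression.2 is loaded, it peeks ahead to the hatvalues() function that the next section discusses, and the object names are mine:

> eps <- residuals( regression.2 )                # ordinary residuals
> h   <- hatvalues( regression.2 )                # hat values (see the next section)
> K <- 2 ; N <- length( eps )
> sigma.hat <- sqrt( sum( eps^2 ) / (N - K - 1) ) # estimated residual std. deviation
> eps.std  <- eps / ( sigma.hat * sqrt( 1 - h ) ) # standardised residuals
> sigma.i  <- sigma.hat * sqrt( (N - K - 1 - eps.std^2) / (N - K - 2) )
> eps.stud <- eps / ( sigma.i * sqrt( 1 - h ) )   # Studentised residuals
> all.equal( eps.std,  rstandard( regression.2 ) )   # should be TRUE
> all.equal( eps.stud, rstudent( regression.2 ) )    # should be TRUE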
15.9.2 Three kinds of anomalous data

One danger that you can run into with linear regression models is that your analysis might be disproportionately sensitive to a smallish number of "unusual" or "anomalous" observations. I discussed this idea previously in Section 6.5.2 in the context of discussing the outliers that get automatically identified by the boxplot() function, but this time we need to be much more precise. In the context of linear regression, there are three conceptually distinct ways in which an observation might be called "anomalous". All three are interesting, but they have rather different implications for your analysis.

The first kind of unusual observation is an outlier. The definition of an outlier (in this context) is an observation that is very different from what the regression model predicts. An example is shown in Figure 15.5. In practice, we operationalise this concept by saying that an outlier is an observation that has a very large Studentised residual, ε*_i. Outliers are interesting: a big outlier might correspond to junk data, e.g., the variables might have been entered incorrectly, or some other defect may be detectable. Note that you shouldn't throw an observation away just because it's an outlier. But the fact that it's an outlier is often a cue to look more closely at that case, and try to find out why it's so different.

Figure 15.5: An illustration of outliers. The dotted lines plot the regression line that would have been estimated without the anomalous observation included, and the corresponding residual (i.e., the Studentised residual). The solid line shows the regression line with the anomalous observation included. The outlier has an unusual value on the outcome (y axis location) but not the predictor (x axis location), and lies a long way from the regression line.
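If you just want a quick sense of which observations the model finds most surprising, something like the following will do; it assumes regression.2 is loaded, and the cutoff of 2 is only an informal convention I'm using for illustration, not a rule from this chapter:

> eps.stud <- rstudent( regression.2 )               # Studentised residuals
> which( abs( eps.stud ) > 2 )                       # candidate outliers
> sort( abs( eps.stud ), decreasing = TRUE )[1:5]    # the five most extreme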
The second way in which an observation can be unusual is if it has high leverage: this happens when the observation is very different from all the other observations. This doesn't necessarily have to correspond to a large residual: if the observation happens to be unusual on all variables in precisely the same way, it can actually lie very close to the regression line. An example of this is shown in Figure 15.6. The leverage of an observation is operationalised in terms of its hat value, usually written h_i. The formula for the hat value is rather complicated9 but its interpretation is not: h_i is a measure of the extent to which the i-th observation is "in control" of where the regression line ends up going. You can extract the hat values using the following command:

> hatvalues( model = regression.2 )

In general, if an observation lies far away from the other ones in terms of the predictor variables, it will have a large hat value (as a rough guide, high leverage is when the hat value is more than 2-3 times the average; and note that the sum of the hat values is constrained to be equal to K + 1). High leverage points are also worth looking at in more detail, but they're much less likely to be a cause for concern unless they are also outliers.

9 Again, for the linear algebra fanatics: the "hat matrix" is defined to be that matrix H that converts the vector of observed values y into a vector of fitted values ŷ, such that ŷ = Hy. The name comes from the fact that this is the matrix that "puts a hat on y". The hat value of the i-th observation is the i-th diagonal element of this matrix (so technically I should be writing it as h_ii rather than h_i). Oh, and in case you care, here's how it's calculated: H = X(X′X)⁻¹X′. Pretty, isn't it?

Figure 15.6: An illustration of high leverage points. The anomalous observation in this case is unusual both in terms of the predictor (x axis) and the outcome (y axis), but this unusualness is highly consistent with the pattern of correlations that exists among the other observations; as a consequence, the observation falls very close to the regression line and does not distort it.

This brings us to our third measure of unusualness, the influence of an observation. A high influence observation is an outlier that has high leverage. That is, it is an observation that is very different to all the other ones in some respect, and also lies a long way from the regression line. This is illustrated in Figure 15.7. Notice the contrast to the previous two figures: outliers don't move the regression line much, and neither do high leverage points. But something that is an outlier and has high leverage... that has a big effect on the regression line. That's why we call these points high influence, and it's why they're the biggest worry. We operationalise influence in terms of a measure known as Cook's distance:

$$ D_i = \frac{{\epsilon'_i}^2}{K + 1} \times \frac{h_i}{1 - h_i} $$

Notice that this is a multiplication of something that measures the outlier-ness of the observation (the bit on the left, based on the standardised residual), and something that measures the leverage of the observation (the bit on the right). In other words, in order to have a large Cook's distance, an observation must be a fairly substantial outlier and have high leverage. In a stunning turn of events, you can obtain these values using the following command:

> cooks.distance( model = regression.2 )

As a rough guide, Cook's distance greater than 1 is often considered large (that's what I typically use as a quick and dirty rule), though a quick scan of the internet and a few papers suggests that 4/N has also been proposed as a possible rule of thumb.

Figure 15.7: An illustration of high influence points. In this case, the anomalous observation is highly unusual on the predictor variable (x axis), and falls a long way from the regression line. As a consequence, the regression line is highly distorted, even though (in this case) the anomalous observation is entirely typical in terms of the outcome variable (y axis).
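Here's a minimal sketch that computes Cook's distance directly from the standardised residuals and the hat values, and then applies both of the rough cutoffs just mentioned. As before, it assumes regression.2 is loaded, K = 2 is specific to this example, and the object names are mine:

> h <- hatvalues( regression.2 )                  # hat values
> K <- 2 ; N <- length( h )
> D <- ( rstandard( regression.2 )^2 / (K + 1) ) * ( h / (1 - h) )
> all.equal( D, cooks.distance( regression.2 ) )  # should be TRUE
> which( D > 1 )                                  # the "greater than 1" rule
> which( D > 4 / N )                              # the "4/N" rule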
As hinted above, you don't usually need to make use of these functions, since you can have R automatically draw the critical plots.10 For the regression.2 model, these are the plots showing Cook's distance (Figure 15.8) and the more detailed breakdown showing the scatter plot of the Studentised residual against leverage (Figure 15.9). To draw these, we can use the plot() function. When the main argument x to this function is a linear model object, it will draw one of six different plots, each of which is quite useful for doing regression diagnostics. You specify which one you want using the which argument (a number between 1 and 6). If you don't do this then R will draw a default selection of them. The two plots of interest to us in this context are generated using the following commands:

> plot(x = regression.2, which = 4)   # Figure 15.8
> plot(x = regression.2, which = 5)   # Figure 15.9

10 Though special mention should be made of the influenceIndexPlot() and influencePlot() functions in the car package. These produce somewhat more detailed pictures than the default plots that I've shown here. There's also an outlierTest() function that tests to see if any of the Studentised residuals are significantly larger than would be expected by chance.

Figure 15.8: Cook's distance for every observation. This is one of the standard regression plots produced by the plot() function when the input is a linear regression object. It is obtained by setting which = 4.

Figure 15.9: Residuals versus leverage. This is one of the standard regression plots produced by the plot() function when the input is a linear regression object. It is obtained by setting which = 5.

An obvious question to ask next is, if you do have large values of Cook's distance, what should you do? As always, there's no hard and fast rule. Probably the first thing to do is to try running the regression with that point excluded and see what happens to the model performance and to the regression coefficients. If they really are substantially different, it's time to start digging into your data set and the notes that you no doubt were scribbling as you ran your study; try to figure out why the point is so different. If you start to become convinced that this one data point is badly distorting your results, you might consider excluding it, but that's less than ideal unless you have a solid explanation for why this particular case is qualitatively different from the others and therefore deserves to be handled separately.11

11 An alternative is to run a "robust regression"; I'll discuss robust regression in a later version of this book.

To give an example, let's delete the observation from day 64, the observation with the largest Cook's distance for the regression.2 model. We can do this using the subset argument:

> lm( formula = dan.grump ~ dan.sleep + baby.sleep,  # same formula
+     data = parenthood,                             # same data frame...
+     subset = -64                                   # ...but observation 64 is deleted
+ )

Call:
lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood,
    subset = -64)

Coefficients:
(Intercept)    dan.sleep   baby.sleep
   126.3553      -8.8283      -0.1319

As you can see, those regression coefficients have barely changed in comparison to the values we got earlier. In other words, we really don't have any problem as far as anomalous data are concerned.
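An equivalent way of running that check, which makes the comparison a little easier to eyeball, is to let update() refit the model and then line the two sets of coefficients up side by side. This is only a sketch assuming regression.2 and parenthood are loaded; regression.2.drop64 is a name I've invented:

> regression.2.drop64 <- update( regression.2, subset = -64 )  # refit without observation 64
> cbind( full   = coef( regression.2 ),                        # original coefficients
+        drop64 = coef( regression.2.drop64 ) )                # coefficients with obs. 64 removed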
15.9.3 Checking the normality of the residuals

Like many of the statistical tools we've discussed in this book, regression models rely on a normality assumption. In this case, we assume that the residuals are normally distributed. The tools for testing this aren't fundamentally different to those that we discussed earlier in Section 13.9. Firstly, I firmly believe that it never hurts to draw an old-fashioned histogram. The command I use might be something like this:

> hist( x = residuals( regression.2 ),   # data are the residuals
+       xlab = "Value of residual",      # x-axis label
+       main = "",                       # no title
+       breaks = 20                      # lots of breaks
+ )

The resulting plot is shown in Figure 15.10, and as you can see the plot looks pretty damn close to normal, almost unnaturally so.

Figure 15.10: A histogram of the (ordinary) residuals in the regression.2 model. These residuals look very close to being normally distributed, much more so than is typically seen with real data. This shouldn't surprise you... they aren't real data, and they aren't real residuals!

I could also run a Shapiro-Wilk test to check, using the shapiro.test() function; the W value of .99, at this sample size, is non-significant (p = .84), again suggesting that the normality assumption isn't in any danger here. As a third measure, we might also want to draw a QQ-plot using the qqnorm() function. The QQ plot is an excellent one to draw, and so you might not be surprised to discover that it's one of the regression plots that we can produce using the plot() function:

> plot( x = regression.2, which = 2 )   # Figure 15.11

The output is shown in Figure 15.11, showing the standardised residuals plotted as a function of their theoretical quantiles according to the regression model. The fact that the output appends the model specification to the picture is nice.

Figure 15.11: Plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals. This is one of the standard regression plots produced by the plot() function when the input is a linear regression object. It is obtained by setting which = 2.
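The Shapiro-Wilk and qqnorm() commands themselves aren't shown above, so here is a minimal sketch of both. It assumes regression.2 is loaded; using the standardised residuals for the QQ-plot is my choice here, and the raw residuals work just as well:

> shapiro.test( residuals( regression.2 ) )   # Shapiro-Wilk test of the residuals
> qqnorm( rstandard( regression.2 ) )         # QQ-plot, without going through plot()
> qqline( rstandard( regression.2 ) )         # add the reference line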
