Chapter 12: Reporting Hypothesis Test Results

Summary

This document is an excerpt from a statistics textbook discussing how to report the results of a hypothesis test, including how to present the p-value and whether to report exact p-values or tiered significance levels, with a practical example using an R function. It goes on to cover effect size, statistical power, and some caveats about null hypothesis significance testing, and ends with the beginning of the following chapter on categorical data analysis. Concepts covered include null hypothesis testing, Type I and Type II errors, the binomial test, and the chi-square goodness-of-fit test.

Full Transcript


one based on Neyman's approach to hypothesis testing and the other based on Fisher's. Unfortunately, there is a third explanation that people sometimes give, especially when they're first learning statistics, and it is absolutely and completely wrong. This mistaken approach is to refer to the p value as "the probability that the null hypothesis is true". It's an intuitively appealing way to think, but it's wrong in two key respects: (1) null hypothesis testing is a frequentist tool, and the frequentist approach to probability does not allow you to assign probabilities to the null hypothesis... according to this view of probability, the null hypothesis is either true or it is not; it cannot have a "5% chance" of being true. (2) Even within the Bayesian approach, which does let you assign probabilities to hypotheses, the p value would not correspond to the probability that the null is true; this interpretation is entirely inconsistent with the mathematics of how the p value is calculated. Put bluntly, despite the intuitive appeal of thinking this way, there is no justification for interpreting a p value this way. Never do it.

11.6 Reporting the results of a hypothesis test

When writing up the results of a hypothesis test, there are usually several pieces of information that you need to report, but it varies a fair bit from test to test. Throughout the rest of the book I'll spend a little time talking about how to report the results of different tests (see Section 12.1.9 for a particularly detailed example), so that you can get a feel for how it's usually done. However, regardless of what test you're doing, the one thing that you always have to do is say something about the p value, and whether or not the outcome was significant. The fact that you have to do this is unsurprising; it's the whole point of doing the test. What might be surprising is the fact that there is some contention over exactly how you're supposed to do it. Leaving aside those people who completely disagree with the entire framework underpinning null hypothesis testing, there's a certain amount of tension regarding whether to report the exact p value that you obtained, or to state only that p < α for a significance level that you chose in advance (e.g., p < .05).

11.6.1 The issue

To see why this is an issue, the key thing to recognise is that p values are terribly convenient. In practice, the fact that we can compute a p value means that we don't actually have to specify any α level at all in order to run the test. Instead, what you can do is calculate your p value and interpret it directly: if you get p = .062, then it means that you'd have to be willing to tolerate a Type I error rate of 6.2% to justify rejecting the null. If you personally find 6.2% intolerable, then you retain the null. Therefore, the argument goes, why don't we just report the actual p value and let the reader make up their own minds about what an acceptable Type I error rate is? This approach has the big advantage of "softening" the decision making process – in fact, if you accept the Neyman definition of the p value, that's the whole point of the p value. We no longer have a fixed significance level of α = .05 as a bright line separating "accept" from "reject" decisions, and this removes the rather pathological problem of being forced to treat p = .051 in a fundamentally different way to p = .049. This flexibility is both the advantage and the disadvantage of the p value.
The reason why a lot of people don't like the idea of reporting an exact p value is that it gives the researcher a bit too much freedom. In particular, it lets you change your mind about what error tolerance you're willing to put up with after you look at the data. For instance, consider my ESP experiment. Suppose I ran my test, and ended up with a p value of .09. Should I accept or reject? Now, to be honest, I haven't yet bothered to think about what level of Type I error I'm "really" willing to accept. I don't have an opinion on that topic. But I do have an opinion about whether or not ESP exists, and I definitely have an opinion about whether my research should be published in a reputable scientific journal. And amazingly, now that I've looked at the data I'm starting to think that a 9% error rate isn't so bad, especially when compared to how annoying it would be to have to admit to the world that my experiment has failed. So, to avoid looking like I just made it up after the fact, I now say that my α is .1: a 10% Type I error rate isn't too bad, and at that level my test is significant! I win.

In other words, the worry here is that I might have the best of intentions, and be the most honest of people, but the temptation to just "shade" things a little bit here and there is really, really strong. As anyone who has ever run an experiment can attest, it's a long and difficult process, and you often get very attached to your hypotheses. It's hard to let go and admit the experiment didn't find what you wanted it to find. And that's the danger here. If we use the "raw" p-value, people will start interpreting the data in terms of what they want to believe, not what the data are actually saying... and if we allow that, well, why are we bothering to do science at all? Why not let everyone believe whatever they like about anything, regardless of what the facts are? Okay, that's a bit extreme, but that's where the worry comes from. According to this view, you really must specify your α value in advance, and then only report whether the test was significant or not. It's the only way to keep ourselves honest.

11.6.2 Two proposed solutions

In practice, it's pretty rare for a researcher to specify a single α level ahead of time. Instead, the convention is that scientists rely on three standard significance levels: .05, .01 and .001. When reporting your results, you indicate which (if any) of these significance levels allow you to reject the null hypothesis. This is summarised in Table 11.1. This allows us to soften the decision rule a little bit, since p < .01 implies that the data meet a stronger evidentiary standard than p < .05 would.

Table 11.1: A commonly adopted convention for reporting p values: in many places it is conventional to report one of four different things (e.g., p < .05) as shown below. I've included the "significance stars" notation (i.e., a * indicates p < .05) because you sometimes see this notation produced by statistical software. It's also worth noting that some people will write n.s. (not significant) rather than p > .05.

  Usual notation   Signif. stars   English translation                               The null is...
  p > .05                          The test wasn't significant                       Retained
  p < .05          *               The test was significant at α = .05,              Rejected
                                   but not at α = .01 or α = .001
  p < .01          **              The test was significant at α = .05               Rejected
                                   and α = .01, but not at α = .001
  p < .001         ***             The test was significant at all levels            Rejected
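Since the tiered convention in Table 11.1 is really just a set of cut-offs, it is easy to automate. The following is a minimal sketch (not from the book) of a helper that maps p values onto that notation; the name pvalueStars is something I've made up purely for illustration.

# a rough helper, using the cut-offs from Table 11.1
pvalueStars <- function( p ) {
  cut( p,
       breaks = c(0, .001, .01, .05, 1),
       labels = c("***", "**", "*", "n.s."),
       include.lowest = TRUE )
}
pvalueStars( c(.0003, .004, .03, .062) )   # returns ***, **, *, n.s. in that order

In practice you rarely need to roll your own: many of R's model summaries already print significance codes of exactly this kind.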
And since these levels are fixed in advance by convention, it does prevent people from choosing their α level after looking at the data. Nevertheless, quite a lot of people still prefer to report exact p values. To many people, the advantage of allowing the reader to make up their own mind about how to interpret p = .06 outweighs any disadvantages. In practice, however, even among those researchers who prefer exact p values it is quite common to just write p < .001 instead of reporting an exact value for small p. This is in part because a lot of software doesn't actually print out the p value when it's that small (e.g., SPSS just writes p = .000 whenever p < .001), and in part because a very small p value can be kind of misleading. The human mind sees a number like .0000000001 and it's hard to suppress the gut feeling that the evidence in favour of the alternative hypothesis is a near certainty. In practice, however, this is usually wrong. Life is a big, messy, complicated thing, and every statistical test ever invented relies on simplifications, approximations and assumptions. As a consequence, it's probably not reasonable to walk away from any statistical analysis with a feeling of confidence stronger than p < .001 implies. In other words, p < .001 is really code for "as far as this test is concerned, the evidence is overwhelming."

In light of all this, you might be wondering exactly what you should do. There's a fair bit of contradictory advice on the topic, with some people arguing that you should report the exact p value, and other people arguing that you should use the tiered approach illustrated in Table 11.1. As a result, the best advice I can give is to suggest that you look at papers/reports written in your field and see what the convention seems to be. If there doesn't seem to be any consistent pattern, then use whichever method you prefer.

11.7 Running the hypothesis test in practice

At this point some of you might be wondering if this is a "real" hypothesis test, or just a toy example that I made up. It's real. In the previous discussion I built the test from first principles, thinking that it was the simplest possible problem that you might ever encounter in real life. However, this test already exists: it's called the binomial test, and it's implemented by an R function called binom.test(). To test the null hypothesis that the response probability is one-half (p = .5),⁹ using data in which x = 62 of n = 100 people made the correct response, here's how to do it in R:

> binom.test( x=62, n=100, p=.5 )

        Exact binomial test

data:  62 and 100
number of successes = 62, number of trials = 100, p-value = 0.02098
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5174607 0.7152325
sample estimates:
probability of success
                  0.62

Right now, this output looks pretty unfamiliar to you, but you can see that it's telling you more or less the right things. Specifically, the p-value of 0.02 is less than the usual choice of α = .05, so you can reject the null. We'll talk a lot more about how to read this sort of output as we go along; and after a while you'll hopefully find it quite easy to read and understand. For now, however, I just wanted to make the point that R contains a whole lot of functions corresponding to different kinds of hypothesis test.
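One practical tip, not from the book: rather than retyping numbers from the printout when you write up your results, you can store the result of binom.test() and pull out the pieces you need. The object it returns has components such as p.value, estimate and conf.int, and the base R function format.pval() is handy for the p < .001 convention discussed above. The name esp.test below is just something I've made up.

# a minimal sketch: extract and format the numbers you'd report
esp.test <- binom.test( x = 62, n = 100, p = .5 )
esp.test$p.value      # the exact p-value (about 0.021)
esp.test$estimate     # the observed proportion of correct responses, 0.62
esp.test$conf.int     # the 95% confidence interval

# format.pval() prints very small p-values as "<0.001" rather than a misleading string of zeros
format.pval( esp.test$p.value, digits = 2, eps = .001 )
format.pval( 1e-10, digits = 2, eps = .001 )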
And while I'll usually spend quite a lot of time explaining the logic behind how the tests are built, every time I discuss a hypothesis test the discussion will end with me showing you a fairly simple R command that you can use to run the test in practice.

Footnote 9: Note that the p here has nothing to do with a p value. The p argument in the binom.test() function corresponds to the probability of making a correct response, according to the null hypothesis. In other words, it's the θ value.

[Figure 11.4: Sampling distribution for X if θ = .55, with the lower and upper critical regions (2.5% of the distribution each) marked on the number of correct responses (X). A reasonable proportion of the distribution lies in the rejection region.]

11.8 Effect size, sample size and power

In previous sections I've emphasised the fact that the major design principle behind statistical hypothesis testing is that we try to control our Type I error rate. When we fix α = .05 we are attempting to ensure that only 5% of true null hypotheses are incorrectly rejected. However, this doesn't mean that we don't care about Type II errors. In fact, from the researcher's perspective, the error of failing to reject the null when it is actually false is an extremely annoying one. With that in mind, a secondary goal of hypothesis testing is to try to minimise β, the Type II error rate, although we don't usually talk in terms of minimising Type II errors. Instead, we talk about maximising the power of the test. Since power is defined as 1 − β, this is the same thing.

11.8.1 The power function

Let's take a moment to think about what a Type II error actually is. A Type II error occurs when the alternative hypothesis is true, but we are nevertheless unable to reject the null hypothesis. Ideally, we'd be able to calculate a single number that tells us the Type II error rate, in the same way that we can set α = .05 for the Type I error rate. Unfortunately, this is a lot trickier to do. To see this, notice that in my ESP study the alternative hypothesis actually corresponds to lots of possible values of θ. In fact, the alternative hypothesis corresponds to every value of θ except 0.5. Let's suppose that the true probability of someone choosing the correct response is 55% (i.e., θ = .55). If so, then the true sampling distribution for X is not the same one that the null hypothesis predicts: the most likely value for X is now 55 out of 100. Not only that, the whole sampling distribution has now shifted, as shown in Figure 11.4. The critical regions, of course, do not change: by definition, the critical regions are based on what the null hypothesis predicts. What we're seeing in this figure is the fact that when the null hypothesis is wrong, a much larger proportion of the sampling distribution falls in the critical region. And of course that's what should happen: the probability of rejecting the null hypothesis is larger when the null hypothesis is actually false!

[Figure 11.5: Sampling distribution for X if θ = .70, with the same lower and upper critical regions (2.5% of the distribution each) marked. Almost all of the distribution lies in the rejection region.]
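To put rough numbers on Figures 11.4 and 11.5, here is a small sketch (not from the book) that computes how much of each alternative sampling distribution falls in the rejection region. It uses a symmetric region of roughly 2.5% in each tail under the null, which is close to, though not exactly, what binom.test() does, so treat the numbers as illustrative.

# rough power calculation for the ESP test with N = 100 and a null of theta = .5
N  <- 100
lo <- qbinom( .025, size = N, prob = .5 ) - 1   # reject if X is this small or smaller
hi <- qbinom( .975, size = N, prob = .5 ) + 1   # reject if X is this large or larger
rejectionProb <- function( theta ) {
  pbinom( lo, size = N, prob = theta ) +
    pbinom( hi - 1, size = N, prob = theta, lower.tail = FALSE )
}
rejectionProb( .55 )   # a modest proportion of the theta = .55 distribution lands in the rejection region
rejectionProb( .70 )   # nearly all of the theta = .70 distribution does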
However, θ = .55 is not the only possibility consistent with the alternative hypothesis. Let's instead suppose that the true value of θ is actually 0.7. What happens to the sampling distribution when this occurs? The answer, shown in Figure 11.5, is that almost the entirety of the sampling distribution has now moved into the critical region. Therefore, if θ = 0.7 the probability of us correctly rejecting the null hypothesis (i.e., the power of the test) is much larger than if θ = 0.55. In short, while θ = .55 and θ = .70 are both part of the alternative hypothesis, the Type II error rate is different. What all this means is that the power of a test (i.e., 1 − β) depends on the true value of θ. To illustrate this, I've calculated the expected probability of rejecting the null hypothesis for all values of θ, and plotted it in Figure 11.6. This plot describes what is usually called the power function of the test. It's a nice summary of how good the test is, because it actually tells you the power (1 − β) for all possible values of θ. As you can see, when the true value of θ is very close to 0.5, the power of the test drops very sharply, but when it is further away, the power is large.

[Figure 11.6: Power function for the test (N = 100): the probability that we will reject the null hypothesis, plotted as a function of the true value of θ. Obviously, the test is more powerful (greater chance of correct rejection) if the true value of θ is very different from the value that the null hypothesis specifies (i.e., θ = .5). Notice that when θ actually is equal to .5 (plotted as a black dot), the null hypothesis is in fact true: rejecting the null hypothesis in this instance would be a Type I error.]

11.8.2 Effect size

Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned with mice when there are tigers abroad.
– George Box (1976, p. 792)

The plot shown in Figure 11.6 captures a fairly basic point about hypothesis testing. If the true state of the world is very different from what the null hypothesis predicts, then your power will be very high; but if the true state of the world is similar to the null (but not identical) then the power of the test is going to be very low. Therefore, it's useful to have some way of quantifying how "similar" the true state of the world is to the null hypothesis. A statistic that does this is called a measure of effect size (e.g., Cohen, 1988; Ellis, 2010). Effect size is defined slightly differently in different contexts¹⁰ (and so this section just talks in general terms), but the qualitative idea that it tries to capture is always the same: how big is the difference between the true population parameters and the parameter values that are assumed by the null hypothesis? In our ESP example, if we let θ₀ = 0.5 denote the value assumed by the null hypothesis, and let θ denote the true value, then a simple measure of effect size could be something like the difference between the true value and the null (i.e., θ − θ₀), or possibly just the magnitude of this difference, |θ − θ₀|.

Why calculate effect size?
Let's assume that you've run your experiment, collected the data, and gotten a significant effect when you ran your hypothesis test. Isn't it enough just to say that you've gotten a significant effect? Surely that's the point of hypothesis testing? Well, sort of. Yes, the point of doing a hypothesis test is to try to demonstrate that the null hypothesis is wrong, but that's hardly the only thing we're interested in. If the null hypothesis claimed that θ = .5, and we show that it's wrong, we've only really told half of the story. Rejecting the null hypothesis implies that we believe that θ ≠ .5, but there's a big difference between θ = .51 and θ = .8. If we find that θ = .8, then not only have we found that the null hypothesis is wrong, it appears to be very wrong. On the other hand, suppose we've successfully rejected the null hypothesis, but it looks like the true value of θ is only .51 (this would only be possible with a large study). Sure, the null hypothesis is wrong, but it's not at all clear that we actually care, because the effect size is so small. In the context of my ESP study we might still care, since any demonstration of real psychic powers would actually be pretty cool,¹¹ but in other contexts a 1% difference isn't very interesting, even if it is a real difference. For instance, suppose we're looking at differences in high school exam scores between males and females, and it turns out that the female scores are 1% higher on average than the males. If I've got data from thousands of students, then this difference will almost certainly be statistically significant, but regardless of how small the p value is it's just not very interesting. You'd hardly want to go around proclaiming a crisis in boys' education on the basis of such a tiny difference, would you? It's for this reason that it is becoming more standard (slowly, but surely) to report some kind of standard measure of effect size along with the results of the hypothesis test. The hypothesis test itself tells you whether you should believe that the effect you have observed is real (i.e., not just due to chance); the effect size tells you whether or not you should care.

Footnote 10: There's an R package called compute.es that can be used for calculating a very broad range of effect size measures; but for the purposes of the current book we won't need it: all of the effect size measures that I'll talk about here have functions in the lsr package.

Table 11.2: A crude guide to understanding the relationship between statistical significance and effect sizes. Basically, if you don't have a significant result, then the effect size is pretty meaningless, because you don't have any evidence that it's even real. On the other hand, if you do have a significant effect but your effect size is small, then there's a pretty good chance that your result (although real) isn't all that interesting. However, this guide is very crude: it depends a lot on what exactly you're studying. Small effects can be of massive practical importance in some situations. So don't take this table too seriously. It's a rough guide at best.

                            big effect size                      small effect size
  significant result        difference is real, and              difference is real, but
                            of practical importance              might not be interesting
  non-significant result    no effect observed                   no effect observed
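To make this concrete with the ESP data from earlier, here is a tiny sketch (not from the book) of the crude effect size measure just described. Note that in practice we never know the true θ, so the best we can do is plug in an estimate.

# a crude effect size for the ESP experiment: estimated theta minus the null value
theta.null <- .5
theta.hat  <- 62 / 100            # our best guess at theta: 62 correct responses out of 100
theta.hat - theta.null            # signed difference, 0.12
abs( theta.hat - theta.null )     # magnitude of the difference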
11.8.3 Increasing the power of your study

Not surprisingly, scientists are fairly obsessed with maximising the power of their experiments. We want our experiments to work, and so we want to maximise the chance of rejecting the null hypothesis if it is false (and of course we usually want to believe that it is false!). As we've seen, one factor that influences power is the effect size. So the first thing you can do to increase your power is to increase the effect size. In practice, what this means is that you want to design your study in such a way that the effect size gets magnified. For instance, in my ESP study I might believe that psychic powers work best in a quiet, darkened room, with fewer distractions to cloud the mind. Therefore I would try to conduct my experiments in just such an environment: if I can strengthen people's ESP abilities somehow, then the true value of θ will go up¹² and therefore my effect size will be larger. In short, clever experimental design is one way to boost power, because it can alter the effect size.

Unfortunately, it's often the case that even with the best of experimental designs you may have only a small effect. Perhaps, for example, ESP really does exist, but even under the best of conditions it's very, very weak. Under those circumstances, your best bet for increasing power is to increase the sample size. In general, the more observations that you have available, the more likely it is that you can discriminate between two hypotheses. If I ran my ESP experiment with 10 participants, and 7 of them correctly guessed the colour of the hidden card, you wouldn't be terribly impressed. But if I ran it with 10,000 participants and 7,000 of them got the answer right, you would be much more likely to think I had discovered something. In other words, power increases with the sample size. This is illustrated in Figure 11.7, which shows the power of the test for a true parameter of θ = 0.7, for all sample sizes N from 1 to 100, where I'm assuming that the null hypothesis predicts that θ₀ = 0.5.

Footnote 11: Although in practice a very small effect size is worrying, because even very minor methodological flaws might be responsible for the effect; and in practice no experiment is perfect, so there are always methodological issues to worry about.

Footnote 12: Notice that the true population parameter θ doesn't necessarily correspond to an immutable fact of nature. In this context θ is just the true probability that people would correctly guess the colour of the card in the other room. As such the population parameter can be influenced by all sorts of things. Of course, this is all on the assumption that ESP actually exists!

[Figure 11.7: The power of our test (probability of rejecting the null), plotted as a function of the sample size N. In this case, the true value of θ is 0.7, but the null hypothesis is that θ = 0.5. Overall, larger N means greater power. (The small zig-zags in this function occur because of some odd interactions between θ, α and the fact that the binomial distribution is discrete; it doesn't matter for any serious purpose.)]

Because power is important, whenever you're contemplating running an experiment it would be pretty useful to know how much power you're likely to have. It's never possible to know for sure, since you can't possibly know what your effect size is. However, it's often (well, sometimes) possible to guess how big it should be. If so, you can guess what sample size you need!
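As a rough illustration of the calculation behind Figure 11.7, here is a sketch (not from the book) that reuses the crude rejection-region construction from the earlier power sketch, this time treating the sample size as the input. The numbers won't match binom.test() exactly, but the qualitative pattern, power climbing with N, is the point.

# approximate power of the two-sided binomial test as a function of sample size,
# assuming the true theta is .7 and the null hypothesis says theta0 = .5
powerAtN <- function( N, theta = .7, theta0 = .5, alpha = .05 ) {
  lo <- qbinom( alpha/2, size = N, prob = theta0 ) - 1
  hi <- qbinom( 1 - alpha/2, size = N, prob = theta0 ) + 1
  pbinom( lo, size = N, prob = theta ) +
    pbinom( hi - 1, size = N, prob = theta, lower.tail = FALSE )
}
sapply( c(10, 25, 50, 100), powerAtN )   # power increases steadily as N grows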
This idea is called power analysis, and if it's feasible to do it, then it's very helpful, since it can tell you something about whether you have enough time or money to be able to run the experiment successfully. It's increasingly common to see people arguing that power analysis should be a required part of experimental design, so it's worth knowing about.

I don't discuss power analysis in this book, however. This is partly for a boring reason and partly for a substantive one. The boring reason is that I haven't had time to write about power analysis yet. The substantive one is that I'm still a little suspicious of power analysis. Speaking as a researcher, I have very rarely found myself in a position to be able to do one – it's either the case that (a) my experiment is a bit non-standard and I don't know how to define effect size properly, or (b) I literally have so little idea about what the effect size will be that I wouldn't know how to interpret the answers. Not only that, after extensive conversations with someone who does stats consulting for a living (my wife, as it happens), I can't help but notice that in practice the only time anyone ever asks her for a power analysis is when she's helping someone write a grant application. In other words, the only time any scientist ever seems to want a power analysis in real life is when they're being forced to do it by bureaucratic process. It's not part of anyone's day to day work. In short, I've always been of the view that while power is an important concept, power analysis is not as useful as people make it sound, except in the rare cases where (a) someone has figured out how to calculate power for your actual experimental design and (b) you have a pretty good idea what the effect size is likely to be. Maybe other people have had better experiences than me, but I've personally never been in a situation where both (a) and (b) were true. Maybe I'll be convinced otherwise in the future, and probably a future version of this book would include a more detailed discussion of power analysis, but for now this is about as much as I'm comfortable saying about the topic.

11.9 Some issues to consider

What I've described to you in this chapter is the orthodox framework for null hypothesis significance testing (NHST). Understanding how NHST works is an absolute necessity, since it has been the dominant approach to inferential statistics ever since it came to prominence in the early 20th century. It's what the vast majority of working scientists rely on for their data analysis, so even if you hate it you need to know it. However, the approach is not without problems. There are a number of quirks in the framework, historical oddities in how it came to be, theoretical disputes over whether or not the framework is right, and a lot of practical traps for the unwary. I'm not going to go into a lot of detail on this topic, but I think it's worth briefly discussing a few of these issues.

11.9.1 Neyman versus Fisher

The first thing you should be aware of is that orthodox NHST is actually a mash-up of two rather different approaches to hypothesis testing, one proposed by Sir Ronald Fisher and the other proposed by Jerzy Neyman (see Lehmann, 2011, for a historical summary). The history is messy because Fisher and Neyman were real people whose opinions changed over time, and at no point did either of them offer "the definitive statement" of how we should interpret their work many decades later.
That said, here's a quick summary of what I take these two approaches to be. First, let's talk about Fisher's approach. As far as I can tell, Fisher assumed that you only had the one hypothesis (the null), and what you want to do is find out if the null hypothesis is inconsistent with the data. From his perspective, what you should do is check to see if the data are "sufficiently unlikely" according to the null. In fact, if you remember back to our earlier discussion, that's how Fisher defines the p-value. According to Fisher, if the null hypothesis provided a very poor account of the data, you could safely reject it. But, since you don't have any other hypotheses to compare it to, there's no way of "accepting the alternative", because you don't necessarily have an explicitly stated alternative. That's more or less all that there was to it.

In contrast, Neyman thought that the point of hypothesis testing was as a guide to action, and his approach was somewhat more formal than Fisher's. His view was that there are multiple things that you could do (accept the null or accept the alternative) and the point of the test was to tell you which one the data support. From this perspective, it is critical to specify your alternative hypothesis properly. If you don't know what the alternative hypothesis is, then you don't know how powerful the test is, or even which action makes sense. His framework genuinely requires a competition between different hypotheses. For Neyman, the p value didn't directly measure the probability of the data (or data more extreme) under the null; it was more of an abstract description about which "possible tests" were telling you to accept the null, and which "possible tests" were telling you to accept the alternative.

As you can see, what we have today is an odd mishmash of the two. We talk about having both a null hypothesis and an alternative (Neyman), but usually¹³ define the p value in terms of extreme data (Fisher), and yet we still have α values (Neyman). Some of the statistical tests have explicitly specified alternatives (Neyman) but others are quite vague about it (Fisher). And, according to some people at least, we're not allowed to talk about accepting the alternative (Fisher). It's a mess: but I hope this at least explains why it's a mess.

11.9.2 Bayesians versus frequentists

Earlier on in this chapter I was quite emphatic about the fact that you cannot interpret the p value as the probability that the null hypothesis is true. NHST is fundamentally a frequentist tool (see Chapter 9) and as such it does not allow you to assign probabilities to hypotheses: the null hypothesis is either true or it is not. The Bayesian approach to statistics interprets probability as a degree of belief, so it's totally okay to say that there is a 10% chance that the null hypothesis is true: that's just a reflection of the degree of confidence that you have in this hypothesis. You aren't allowed to do this within the frequentist approach. Remember, if you're a frequentist, a probability can only be defined in terms of what happens after a large number of independent replications (i.e., a long run frequency). If this is your interpretation of probability, talking about the "probability" that the null hypothesis is true is complete gibberish: a null hypothesis is either true or it is false. There's no way you can talk about a long run frequency for this statement. To talk about "the probability of the null hypothesis" is as meaningless as "the colour of freedom".
It doesn't have one! Most importantly, this isn't a purely ideological matter. If you decide that you are a Bayesian and that you're okay with making probability statements about hypotheses, you have to follow the Bayesian rules for calculating those probabilities. I'll talk more about this in Chapter 17, but for now what I want to point out to you is that the p value is a terrible approximation to the probability that H₀ is true. If what you want to know is the probability of the null, then the p value is not what you're looking for!

11.9.3 Traps

As you can see, the theory behind hypothesis testing is a mess, and even now there are arguments in statistics about how it "should" work. However, disagreements among statisticians are not our real concern here. Our real concern is practical data analysis. And while the "orthodox" approach to null hypothesis significance testing has many drawbacks, even an unrepentant Bayesian like myself would agree that orthodox tests can be useful if used responsibly. Most of the time they give sensible answers, and you can use them to learn interesting things. Setting aside the various ideologies and historical confusions that we've discussed, the fact remains that the biggest danger in all of statistics is thoughtlessness. I don't mean stupidity, here: I literally mean thoughtlessness. I mean the rush to interpret a result without spending time thinking through what each test actually says about the data, and without checking whether that's consistent with how you've interpreted it. That's where the biggest trap lies.

To give an example of this, consider the following (see Gelman & Stern, 2006). Suppose I'm running my ESP study, and I've decided to analyse the data separately for the male participants and the female participants. Of the male participants, 33 out of 50 guessed the colour of the card correctly. This is a significant effect (p = .03). Of the female participants, 29 out of 50 guessed correctly. This is not a significant effect (p = .32). Upon observing this, it is extremely tempting for people to start wondering why there is a difference between males and females in terms of their psychic abilities. However, this is wrong. If you think about it, we haven't actually run a test that explicitly compares males to females. All we have done is compare males to chance (binomial test was significant) and compared females to chance (binomial test was non-significant). If we want to argue that there is a real difference between the males and the females, we should probably run a test of the null hypothesis that there is no difference! We can do that using a different hypothesis test,¹⁴ but when we do that it turns out that we have no evidence that males and females are significantly different (p = .54). Now do you think that there's anything fundamentally different between the two groups? Of course not. What's happened here is that the data from both groups (male and female) are pretty borderline: by pure chance, one of them happened to end up on the magic side of the p = .05 line, and the other one didn't. That doesn't actually imply that males and females are different.

Footnote 13: Although this book describes both Neyman's and Fisher's definition of the p value, most don't. Most introductory textbooks will only give you the Fisher version.

Footnote 14: In this case, the Pearson chi-square test of independence (Chapter 12; chisq.test() in R) is what we use; see also the prop.test() function.
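Just to make footnote 14 concrete, here is a one-line sketch (not part of the book's walkthrough) of the comparison that actually addresses the question: it tests whether the male and female success rates differ from each other, rather than testing each group against chance.

# compare 33/50 correct (males) against 29/50 correct (females)
prop.test( x = c(33, 29), n = c(50, 50) )   # the p-value should come out around .54, as quoted above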
This mistake is so common that you should always be wary of it: the difference between significant and not-significant is not evidence of a real difference – if you want to say that there's a difference between two groups, then you have to test for that difference! The example above is just that: an example. I've singled it out because it's such a common one, but the bigger picture is that data analysis can be tricky to get right. Think about what it is you want to test, why you want to test it, and whether or not the answers that your test gives could possibly make any sense in the real world.

11.10 Summary

Null hypothesis testing is one of the most ubiquitous elements of statistical theory. The vast majority of scientific papers report the results of some hypothesis test or another. As a consequence it is almost impossible to get by in science without having at least a cursory understanding of what a p-value means, making this one of the most important chapters in the book. As usual, I'll end the chapter with a quick recap of the key ideas that we've talked about:

- Research hypotheses and statistical hypotheses; null and alternative hypotheses (Section 11.1)
- Type I and Type II errors (Section 11.2)
- Test statistics and sampling distributions (Section 11.3)
- Hypothesis testing as a decision making process (Section 11.4)
- p-values as "soft" decisions (Section 11.5)
- Writing up the results of a hypothesis test (Section 11.6)
- Effect size and power (Section 11.8)
- A few issues to consider regarding hypothesis testing (Section 11.9)

Later in the book, in Chapter 17, I'll revisit the theory of null hypothesis tests from a Bayesian perspective, and introduce a number of new tools that you can use if you aren't particularly fond of the orthodox approach. For now, though, we're done with the abstract statistical theory, and we can start discussing specific data analysis tools.

Part V. Statistical tools

12. Categorical data analysis

Now that we've got the basic theory behind hypothesis testing, it's time to start looking at specific tests that are commonly used in psychology. So where should we start? Not every textbook agrees on where to start, but I'm going to start with "χ² tests" (this chapter) and "t-tests" (Chapter 13). Both of these tools are very frequently used in scientific practice, and while they're not as powerful as "analysis of variance" (Chapter 14) and "regression" (Chapter 15) they're much easier to understand. The term "categorical data" is just another name for "nominal scale data". It's nothing that we haven't already discussed, it's just that in the context of data analysis people tend to use the term "categorical data" rather than "nominal scale data". I don't know why. In any case, categorical data analysis refers to a collection of tools that you can use when your data are nominal scale. However, there are a lot of different tools that can be used for categorical data analysis, and this chapter only covers a few of the more common ones.

12.1 The χ² goodness-of-fit test

The χ² goodness-of-fit test is one of the oldest hypothesis tests around: it was invented by Karl Pearson around the turn of the century (Pearson, 1900), with some corrections made later by Sir Ronald Fisher (Fisher, 1922a). To introduce the statistical problem that it addresses, let's start with some psychology...
12.1.1 The cards data

Over the years, there have been a lot of studies showing that humans have a lot of difficulties in simulating randomness. Try as we might to "act" random, we think in terms of patterns and structure, and so when asked to "do something at random", what people actually do is anything but random. As a consequence, the study of human randomness (or non-randomness, as the case may be) opens up a lot of deep psychological questions about how we think about the world. With this in mind, let's consider a very simple study. Suppose I asked people to imagine a shuffled deck of cards, and mentally pick one card from this imaginary deck "at random". After they've chosen one card, I ask them to mentally select a second one. For both choices, what we're going to look at is the suit (hearts, clubs, spades or diamonds) that people chose. After asking, say, N = 200 people to do this, I'd like to look at the data and figure out whether or not the cards that people pretended to select were really random. The data are contained in the randomness.Rdata file, which contains a single data frame called cards. Let's take a look:

> library( lsr )
> load( "randomness.Rdata" )
> who( TRUE )
   -- Name --    -- Class --   -- Size --
   cards         data.frame    200 x 3
    $id          factor        200
    $choice_1    factor        200
    $choice_2    factor        200

As you can see, the cards data frame contains three variables: an id variable that assigns a unique identifier to each participant, and the two variables choice_1 and choice_2 that indicate the card suits that people chose. Here are the first few entries in the data frame:

> head( cards )
     id choice_1 choice_2
1 subj1   spades    clubs
2 subj2 diamonds    clubs
3 subj3   hearts    clubs
4 subj4   spades    clubs
5 subj5   hearts   spades
6 subj6    clubs   hearts

For the moment, let's just focus on the first choice that people made. We'll use the table() function to count the number of times that we observed people choosing each suit. I'll save the table to a variable called observed, for reasons that will become clear very soon:

> observed <- table( cards$choice_1 )
> observed

   clubs diamonds   hearts   spades 
      35       51       64       50 

That little frequency table is quite helpful. Looking at it, there's a bit of a hint that people might be more likely to select hearts than clubs, but it's not completely obvious just from looking at it whether that's really true, or if this is just due to chance. So we'll probably have to do some kind of statistical analysis to find out, which is what I'm going to talk about in the next section. Excellent. From this point on, we'll treat this table as the data that we're looking to analyse. However, since I'm going to have to talk about this data in mathematical terms (sorry!) it might be a good idea to be clear about what the notation is. In R, if I wanted to pull out the number of people that selected diamonds, I could do it by name by typing observed["diamonds"], but, since "diamonds" is the second element of the observed vector, it's equally effective to refer to it as observed[2]. The mathematical notation for this is pretty similar, except that we shorten the human-readable word "observed" to the letter O, and we use subscripts rather than brackets: so the second observation in our table is written as observed[2] in R, and is written as O₂ in maths. The relationship between the English descriptions, the R commands, and the mathematical symbols is illustrated below:
  label         index, i   math. symbol   R command      the value
  clubs, ♣      1          O₁             observed[1]    35
  diamonds, ♦   2          O₂             observed[2]    51
  hearts, ♥     3          O₃             observed[3]    64
  spades, ♠     4          O₄             observed[4]    50

Hopefully that's pretty clear. It's also worth noting that mathematicians prefer to talk about things in general rather than specific things, so you'll also see the notation Oᵢ, which refers to the number of observations that fall within the i-th category (where i could be 1, 2, 3 or 4). Finally, if we want to refer to the set of all observed frequencies, statisticians group all of the observed values into a vector, which I'll refer to as O:

O = (O₁, O₂, O₃, O₄)

Again, there's nothing new or interesting here: it's just notation. If I say that O = (35, 51, 64, 50), all I'm doing is describing the table of observed frequencies (i.e., observed), but I'm referring to it using mathematical notation, rather than by referring to an R variable.

12.1.2 The null hypothesis and the alternative hypothesis

As the last section indicated, our research hypothesis is that "people don't choose cards randomly". What we're going to want to do now is translate this into some statistical hypotheses, and construct a statistical test of those hypotheses. The test that I'm going to describe to you is Pearson's χ² goodness-of-fit test, and as is so often the case, we have to begin by carefully constructing our null hypothesis. In this case, it's pretty easy. First, let's state the null hypothesis in words:

H₀: All four suits are chosen with equal probability

Now, because this is statistics, we have to be able to say the same thing in a mathematical way. To do this, let's use the notation Pⱼ to refer to the true probability that the j-th suit is chosen. If the null hypothesis is true, then each of the four suits has a 25% chance of being selected: in other words, our null hypothesis claims that P₁ = .25, P₂ = .25, P₃ = .25 and finally that P₄ = .25. However, in the same way that we can group our observed frequencies into a vector O that summarises the entire data set, we can use P to refer to the probabilities that correspond to our null hypothesis. So if I let the vector P = (P₁, P₂, P₃, P₄) refer to the collection of probabilities that describe our null hypothesis, then we have

H₀: P = (.25, .25, .25, .25)

In this particular instance, our null hypothesis corresponds to a vector of probabilities P in which all of the probabilities are equal to one another. But this doesn't have to be the case. For instance, if the experimental task was for people to imagine they were drawing from a deck that had twice as many clubs as any other suit, then the null hypothesis would correspond to something like P = (.4, .2, .2, .2). As long as the probabilities are all positive numbers, and they all sum to 1, then it's a perfectly legitimate choice for the null hypothesis. However, the most common use of the goodness-of-fit test is to test a null hypothesis that all of the categories are equally likely, so we'll stick to that for our example. What about our alternative hypothesis, H₁? All we're really interested in is demonstrating that the probabilities involved aren't all identical (that is, people's choices weren't completely random).
As a consequence, the "human friendly" versions of our hypotheses look like this:

H₀: All four suits are chosen with equal probability
H₁: At least one of the suit-choice probabilities isn't .25

and the "mathematician friendly" version is

H₀: P = (.25, .25, .25, .25)
H₁: P ≠ (.25, .25, .25, .25)

Conveniently, the mathematical version of the hypotheses looks quite similar to an R command defining a vector. So maybe what I should do is store the P vector in R as well, since we're almost certainly going to need it later. And because I'm Mister Imaginative, I'll call this R vector probabilities:

> probabilities <- c( clubs = .25, diamonds = .25, hearts = .25, spades = .25 )
> probabilities

   clubs diamonds   hearts   spades 
    0.25     0.25     0.25     0.25 

12.1.3 The "goodness of fit" test statistic

At this point, we have our observed frequencies O and a collection of probabilities P corresponding to the null hypothesis that we want to test. We've stored these in R as the corresponding variables observed and probabilities. What we now want to do is construct a test of the null hypothesis. As always, if we want to test H₀ against H₁, we're going to need a test statistic. The basic trick that a goodness-of-fit test uses is to construct a test statistic that measures how "close" the data are to the null hypothesis. If the data don't resemble what you'd "expect" to see if the null hypothesis were true, then it probably isn't true. Okay, if the null hypothesis were true, what would we expect to see? Or, to use the correct terminology, what are the expected frequencies? There are N = 200 observations, and (if the null is true) the probability of any one of them choosing a heart is P₃ = .25, so I guess we're expecting 200 × .25 = 50 hearts, right? Or, more specifically, if we let Eᵢ refer to "the number of category i responses that we're expecting if the null is true", then

Eᵢ = N × Pᵢ

This is pretty easy to calculate in R:

> N <- 200                          # sample size
> expected <- N * probabilities     # expected frequencies
> expected

   clubs diamonds   hearts   spades 
      50       50       50       50 

None of which is very surprising: if there are 200 observations that can fall into four categories, and we think that all four categories are equally likely, then on average we'd expect to see 50 observations in each category, right? Now, how do we translate this into a test statistic? Clearly, what we want to do is compare the expected number of observations in each category (Eᵢ) with the observed number of observations in that category (Oᵢ). And on the basis of this comparison, we ought to be able to come up with a good test statistic. To start with, let's calculate the difference between what the null hypothesis expected us to find and what we actually did find. That is, we calculate the "observed minus expected" difference score, Oᵢ − Eᵢ. This is illustrated in the following table:

                                 ♣     ♦     ♥     ♠
  expected frequency   Eᵢ        50    50    50    50
  observed frequency   Oᵢ        35    51    64    50
  difference score     Oᵢ − Eᵢ  -15     1    14     0

The same calculations can be done in R, using our expected and observed variables:

> observed - expected

   clubs diamonds   hearts   spades 
     -15        1       14        0 

Regardless of whether we do the calculations by hand or whether we do them in R, it's clear that people chose more hearts and fewer clubs than the null hypothesis predicted. However, a moment's thought suggests that these raw differences aren't quite what we're looking for. Intuitively, it feels like it's just as bad when the null hypothesis predicts too few observations (which is what happened with hearts) as it is when it predicts too many (which is what happened with clubs).
So it's a bit weird that we have a negative number for clubs and a positive number for hearts. One easy way to fix this is to square everything, so that we now calculate the squared differences, (Oᵢ − Eᵢ)². As before, we could do this by hand, but it's easier to do it in R...

> (observed - expected)^2

   clubs diamonds   hearts   spades 
     225        1      196        0 

Now we're making progress. What we've got now is a collection of numbers that are big whenever the null hypothesis makes a bad prediction (clubs and hearts), but are small whenever it makes a good one (diamonds and spades). Next, for some technical reasons that I'll explain in a moment, let's also divide all these numbers by the expected frequency Eᵢ, so we're actually calculating (Oᵢ − Eᵢ)²/Eᵢ. Since Eᵢ = 50 for all categories in our example, it's not a very interesting calculation, but let's do it anyway. The R command becomes:

> (observed - expected)^2 / expected

   clubs diamonds   hearts   spades 
    4.50     0.02     3.92     0.00 

In effect, what we've got here are four different "error" scores, each one telling us how big a "mistake" the null hypothesis made when we tried to use it to predict our observed frequencies. So, in order to convert this into a useful test statistic, one thing we could do is just add these numbers up. The result is called the goodness-of-fit statistic, conventionally referred to either as X² or GOF. We can calculate it using this command in R:

> sum( (observed - expected)^2 / expected )
[1] 8.44

The formula for this statistic looks remarkably similar to the R command. If we let k refer to the total number of categories (i.e., k = 4 for our cards data), then the X² statistic is given by:

X² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)² / Eᵢ

Intuitively, it's clear that if X² is small, then the observed data Oᵢ are very close to what the null hypothesis predicted Eᵢ, so we're going to need a large X² statistic in order to reject the null. As we've seen from our calculations, in our cards data set we've got a value of X² = 8.44. So now the question becomes: is this a big enough value to reject the null?

12.1.4 The sampling distribution of the GOF statistic

To determine whether or not a particular value of X² is large enough to justify rejecting the null hypothesis, we're going to need to figure out what the sampling distribution for X² would be if the null hypothesis were true. So that's what I'm going to do in this section. I'll show you in a fair amount of detail how this sampling distribution is constructed, and then – in the next section – use it to build up a hypothesis test. If you want to cut to the chase and are willing to take it on faith that the sampling distribution is a chi-squared (χ²) distribution with k − 1 degrees of freedom, you can skip the rest of this section. However, if you want to understand why the goodness-of-fit test works the way it does, read on...

Okay, let's suppose that the null hypothesis is actually true. If so, then the true probability that an observation falls in the i-th category is Pᵢ – after all, that's pretty much the definition of our null hypothesis. Let's think about what this actually means. If you think about it, this is kind of like saying that "nature" makes the decision about whether or not the observation ends up in category i by flipping a weighted coin (i.e., one where the probability of getting a head is Pᵢ). And therefore, we can think of our observed frequency Oᵢ by imagining that nature flipped N of these coins (one for each observation in the data set)... and exactly Oᵢ of them came up heads.
Obviously, this is a pretty weird way to think about the experiment. But what it does (I hope) is remind you that we've actually seen this scenario before. It's exactly the same set up that gave rise to the binomial distribution in Section 9.4. In other words, if the null hypothesis is true, then it follows that our observed frequencies were generated by sampling from a binomial distribution:

Oᵢ ∼ Binomial(Pᵢ, N)

Now, if you remember from our discussion of the central limit theorem (Section 10.3.3), the binomial distribution starts to look pretty much identical to the normal distribution, especially when N is large and when Pᵢ isn't too close to 0 or 1. In other words, as long as N × Pᵢ is large enough – or, to put it another way, when the expected frequency Eᵢ is large enough – the theoretical distribution of Oᵢ is approximately normal. Better yet, if Oᵢ is normally distributed, then so is (Oᵢ − Eᵢ)/√Eᵢ... since Eᵢ is a fixed value, subtracting off Eᵢ and dividing by √Eᵢ changes the mean and standard deviation of the normal distribution, but that's all it does. Okay, so now let's have a look at what our goodness-of-fit statistic actually is. What we're doing is taking a bunch of things that are normally-distributed, squaring them, and adding them up. Wait. We've seen that before too! As we discussed in Section 9.6, when you take a bunch of things that have a standard normal distribution (i.e., mean 0 and standard deviation 1), square them, then add them up, the resulting quantity has a chi-square distribution. So now we know that the null hypothesis predicts that the sampling distribution of the goodness-of-fit statistic is a chi-square distribution. Cool.

There's one last detail to talk about, namely the degrees of freedom. If you remember back to Section 9.6, I said that if the number of things you're adding up is k, then the degrees of freedom for the resulting chi-square distribution is k. Yet, what I said at the start of this section is that the actual degrees of freedom for the chi-square goodness-of-fit test is k − 1. What's up with that? The answer here is that what we're supposed to be looking at is the number of genuinely independent things that are getting added together. And, as I'll go on to talk about in the next section, even though there are k things that we're adding, only k − 1 of them are truly independent; and so the degrees of freedom is actually only k − 1. That's the topic of the next section.¹

12.1.5 Degrees of freedom

When I introduced the chi-square distribution in Section 9.6, I was a bit vague about what "degrees of freedom" actually means. Obviously, it matters: looking at Figure 12.1 you can see that if we change the degrees of freedom, then the chi-square distribution changes shape quite substantially. But what exactly is it? Again, when I introduced the distribution and explained its relationship to the normal distribution, I did offer an answer... it's the number of "normally distributed variables" that I'm squaring and adding together. But, for most people, that's kind of abstract, and not entirely helpful. What we really need to do is try to understand degrees of freedom in terms of our data. So here goes.
[Figure 12.1: Chi-square distributions with different values for the "degrees of freedom" (curves shown for df = 3, 4 and 5).]

The basic idea behind degrees of freedom is quite simple: you calculate it by counting up the number of distinct "quantities" that are used to describe your data, and then subtracting off all of the "constraints" that those data must satisfy.² This is a bit vague, so let's use our cards data as a concrete example. We describe our data using four numbers, O₁, O₂, O₃ and O₄, corresponding to the observed frequencies of the four different categories (hearts, clubs, diamonds, spades). These four numbers are the random outcomes of our experiment. But my experiment actually has a fixed constraint built into it: the sample size N.³ That is, if we know how many people chose hearts, how many chose diamonds and how many chose clubs, then we'd be able to figure out exactly how many chose spades. In other words, although our data are described using four numbers, they only actually correspond to 4 − 1 = 3 degrees of freedom. A slightly different way of thinking about it is to notice that there are four probabilities that we're interested in (again, corresponding to the four different categories), but these probabilities must sum to one, which imposes a constraint. Therefore, the degrees of freedom is 4 − 1 = 3. Regardless of whether you want to think about it in terms of the observed frequencies or in terms of the probabilities, the answer is the same. In general, when running the chi-square goodness-of-fit test for an experiment involving k groups, the degrees of freedom will be k − 1.

Footnote 1: I should point out that this issue does complicate the story somewhat: I'm not going to cover it in this book, but there's a sneaky trick that you can do to rewrite the equation for the goodness-of-fit statistic as a sum over k − 1 independent things. When we do so we get the "proper" sampling distribution, which is chi-square with k − 1 degrees of freedom. In fact, in order to get the maths to work out properly, you actually have to rewrite things that way. But it's beyond the scope of an introductory book to show the maths in that much detail: all I wanted to do is give you a sense of why the goodness-of-fit statistic is associated with the chi-squared distribution.

Footnote 2: I feel obliged to point out that this is an over-simplification. It works nicely for quite a few situations; but every now and then we'll come across degrees of freedom values that aren't whole numbers. Don't let this worry you too much – when you come across this, just remind yourself that "degrees of freedom" is actually a bit of a messy concept, and that the nice simple story that I'm telling you here isn't the whole story. For an introductory class, it's usually best to stick to the simple story: but I figure it's best to warn you to expect this simple story to fall apart. If I didn't give you this warning, you might start getting confused when you see df = 3.4 or something, and (incorrectly) think that you had misunderstood something that I've taught you, rather than (correctly) realise that there's something that I haven't told you.

Footnote 3: In practice, the sample size isn't always fixed... e.g., we might run the experiment over a fixed period of time, and the number of people participating depends on how many people show up. That doesn't matter for the current purposes.
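Before putting these pieces together into an actual test, here is a short simulation sketch (not from the book) that checks the claim of the last two sections: if the null hypothesis is true, the goodness-of-fit statistic should behave like a chi-square variable with k − 1 = 3 degrees of freedom.

# simulate many data sets of N = 200 card choices under the null, and compute X^2 for each
set.seed(1)
sims <- replicate( 10000, {
  fake <- rmultinom( 1, size = 200, prob = rep(.25, 4) )   # simulated observed frequencies
  sum( (fake - 50)^2 / 50 )                                # the goodness-of-fit statistic
})
mean( sims >= qchisq( .95, df = 3 ) )   # proportion exceeding the 95th percentile; should be close to .05

Plotting a histogram of sims against the chi-square density with 3 degrees of freedom makes the same point graphically.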
12.1.6 Testing the null hypothesis

The final step in the process of constructing our hypothesis test is to figure out what the rejection region is. That is, what values of X² would lead us to reject the null hypothesis? As we saw earlier, large values of X² imply that the null hypothesis has done a poor job of predicting the data from our experiment, whereas small values of X² imply that it's actually done pretty well. Therefore, a pretty sensible strategy would be to say that there is some critical value such that if X² is bigger than the critical value we reject the null, but if X² is smaller than this value we retain the null. In other words, to use the language we introduced in Chapter 11, the chi-square goodness of fit test is always a one-sided test. Right, so all we have to do is figure out what this critical value is. And it's pretty straightforward. If we want our test to have a significance level of α = .05 (that is, we are willing to tolerate a Type I error rate of 5%), then we have to choose our critical value so that there is only a 5% chance that X² could get to be that big if the null hypothesis is true. That is to say, we want the 95th percentile of the sampling distribution. This is illustrated in Figure 12.2.

[Figure 12.2: Illustration of how the hypothesis testing works for the chi-square goodness of fit test. The critical value is 7.81; the observed GOF value is 8.44.]

Ah, but – I hear you ask – how do I calculate the 95th percentile of a chi-square distribution with k - 1 degrees of freedom? If only R had some function, called... oh, I don't know, qchisq()... that would let you calculate this percentile (see Chapter 9 if you've forgotten). Like this...

> qchisq( p = .95, df = 3 )
7.814728

So if our X² statistic is bigger than 7.81 or so, then we can reject the null hypothesis. Since we actually calculated that before (i.e., X² = 8.44) we can reject the null. If we want an exact p-value, we can calculate it using the pchisq() function:

> pchisq( q = 8.44, df = 3, lower.tail = FALSE )
0.03774185

This is hopefully pretty straightforward, as long as you recall that the "p" form of the probability distribution functions in R always calculates the probability of getting a value less than the value you entered (in this case 8.44). We want the opposite: the probability of getting a value of 8.44 or more. That's why I told R to use the upper tail, not the lower tail. That said, it's usually easier to calculate the p-value this way:

> 1 - pchisq( q = 8.44, df = 3 )
0.03774185

So, in this case we would reject the null hypothesis, since p < .05. And that's it, basically. You now know "Pearson's χ² test for the goodness of fit". Lucky you.
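To tie the whole calculation together in one place, here is a small sketch (not the book's code) that recomputes the goodness of fit statistic from the observed frequencies used in this chapter and then reproduces the decision and the p-value:

# Observed frequencies for the cards data, and expected frequencies under the
# null hypothesis that all four suits are equally likely (N = 200, P_i = .25).
observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)
expected <- 200 * rep(.25, 4)

X2 <- sum( (observed - expected)^2 / expected )
X2                                             # 8.44
X2 > qchisq( p = .95, df = 3 )                 # TRUE, so reject the null at alpha = .05
pchisq( q = X2, df = 3, lower.tail = FALSE )   # 0.0377...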
12.1.7 Doing the test in R

Gosh darn it. Although we did manage to do everything in R as we were going through that little example, it does rather feel as if we're typing too many things into the magic computing box. And I hate typing. Not surprisingly, R provides a function that will do all of these calculations for you. In fact, there are several different ways of doing it. The one that most people use is the chisq.test() function, which comes with every installation of R. I'll show you how to use the chisq.test() function later on (in Section 12.6), but to start out with I'm going to show you the goodnessOfFitTest() function in the lsr package, because it produces output that I think is easier for beginners to understand. It's pretty straightforward: our raw data are stored in the variable cards$choice_1, right? If you want to test the null hypothesis that all four suits are equally likely, then (assuming you have the lsr package loaded) all you have to do is type this:

> goodnessOfFitTest( cards$choice_1 )

R then runs the test and prints several lines of text. I'll go through the output line by line, so that you can make sure that you understand what you're looking at. The first two lines are just telling you things you already know:

Chi-square test against specified probabilities
Data variable: cards$choice_1

The first line tells us what kind of hypothesis test we ran, and the second line tells us the name of the variable that we ran it on. After that comes a statement of what the null and alternative hypotheses are:

Hypotheses:
   null:        true probabilities are as specified
   alternative: true probabilities differ from those specified

For a beginner, it's kind of handy to have this as part of the output: it's a nice reminder of what your null and alternative hypotheses are. Don't get used to seeing this, though. The vast majority of hypothesis tests in R aren't so kind to novices. Most R functions are written on the assumption that you already understand the statistical tool that you're using, so they don't bother to include an explicit statement of the null and alternative hypothesis. The only reason that goodnessOfFitTest() actually does give you this is that I wrote it with novices in mind. The next part of the output shows you the comparison between the observed frequencies and the expected frequencies:

Descriptives:
         observed freq. expected freq. specified prob.
clubs                35             50            0.25
diamonds             51             50            0.25
hearts               64             50            0.25
spades               50             50            0.25

The first column shows what the observed frequencies were, the second column shows the expected frequencies according to the null hypothesis, and the third column shows you what the probabilities actually were according to the null. For novice users, I think this is helpful: you can look at this part of the output and check that it makes sense; if it doesn't, you might have typed something incorrectly. The last part of the output is the "important" stuff: it's the result of the hypothesis test itself. There are three key numbers that need to be reported: the value of the X² statistic, the degrees of freedom, and the p-value:

Test results:
   X-squared statistic:  8.44
   degrees of freedom:   3
   p-value:  0.038

Notice that these are the same numbers that we came up with when doing the calculations the long way.

12.1.8 Specifying a different null hypothesis

At this point you might be wondering what to do if you want to run a goodness of fit test, but your null hypothesis is not that all categories are equally likely. For instance, let's suppose that someone had made the theoretical prediction that people should choose red cards 60% of the time and black cards 40% of the time (I've no idea why you'd predict that), but had no other preferences. If that were the case, the null hypothesis would be to expect 30% of the choices to be hearts, 30% to be diamonds, 20% to be spades and 20% to be clubs. This seems like a silly theory to me, and it's pretty easy to test it using our data.
All we need to do is specify the probabilities associated with the null hypothesis. We create a named vector like this:

> nullProbs <- c(clubs = .2, diamonds = .3, hearts = .3, spades = .2)
> nullProbs
   clubs diamonds   hearts   spades
     0.2      0.3      0.3      0.2

Now that we have an explicitly specified null hypothesis, we include it in our command. This time round I'll use the argument names properly. The data variable corresponds to the argument x, and the probabilities according to the null hypothesis correspond to the argument p. So our command is:

> goodnessOfFitTest( x = cards$choice_1, p = nullProbs )

and our output is:

Chi-square test against specified probabilities
Data variable: cards$choice_1

Hypotheses:
   null:        true probabilities are as specified
   alternative: true probabilities differ from those specified

Descriptives:
         observed freq. expected freq. specified prob.
clubs                35             40             0.2
diamonds             51             60             0.3
hearts               64             60             0.3
spades               50             40             0.2

Test results:
   X-squared statistic:  4.742
   degrees of freedom:   3
   p-value:  0.192

As you can see, the null hypothesis and the expected frequencies are different to what they were last time. As a consequence our X² test statistic is different, and our p-value is different too. Annoyingly, the p-value is .192, so we can't reject the null hypothesis. Sadly, despite the fact that the null hypothesis corresponds to a very silly theory, these data don't provide enough evidence against it.
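As an aside (the book introduces chisq.test() properly in Section 12.6), the same test can be run with base R's chisq.test() by passing the null probabilities to its p argument. One caveat, which is an assumption of this sketch rather than something stated in the text: the probabilities must be listed in the same order as the levels of the tabulated factor.

# Base R equivalent of the test above (a sketch; uses cards$choice_1 and the
# nullProbs vector defined above). The order of nullProbs must match the order
# of the factor levels in the table.
chisq.test( x = table(cards$choice_1), p = nullProbs )
# Should report X-squared = 4.742, df = 3, p-value = 0.192, matching the output above.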
12.1.9 How to report the results of the test

So now you know how the test works, and you know how to do the test using a wonderful magic computing box. The next thing you need to know is how to write up the results. After all, there's no point in designing and running an experiment and then analysing the data if you don't tell anyone about it! So let's now talk about what you need to do when reporting your analysis. Let's stick with our card-suits example. If I wanted to write this result up for a paper or something, the conventional way to report this would be to write something like this:

    Of the 200 participants in the experiment, 64 selected hearts for their first choice, 51 selected diamonds, 50 selected spades, and 35 selected clubs. A chi-square goodness of fit test was conducted to test whether the choice probabilities were identical for all four suits. The results were significant (χ²(3) = 8.44, p < .05), suggesting that people did not select suits purely at random.

This is pretty straightforward, and hopefully it seems pretty unremarkable. That said, there are a few things that you should note about this description:

• The statistical test is preceded by the descriptive statistics. That is, I told the reader something about what the data look like before going on to do the test. In general, this is good practice: always remember that your reader doesn't know your data anywhere near as well as you do. So unless you describe it to them properly, the statistical tests won't make any sense to them, and they'll get frustrated and cry.

• The description tells you what the null hypothesis being tested is. To be honest, writers don't always do this, but it's often a good idea in those situations where some ambiguity exists, or when you can't rely on your readership being intimately familiar with the statistical tools that you're using. Quite often the reader might not know (or remember) all the details of the test that you're using, so it's a kind of politeness to "remind" them! As far as the goodness of fit test goes, you can usually rely on a scientific audience knowing how it works (since it's covered in most intro stats classes). However, it's still a good idea to be explicit about stating the null hypothesis (briefly!) because the null hypothesis can be different depending on what you're using the test for. For instance, in the cards example my null hypothesis was that all four suit probabilities were identical (i.e., P_1 = P_2 = P_3 = P_4 = 0.25), but there's nothing special about that hypothesis. I could just as easily have tested the null hypothesis that P_1 = 0.7 and P_2 = P_3 = P_4 = 0.1 using a goodness of fit test. So it's helpful to the reader if you explain to them what your null hypothesis was. Also, notice that I described the null hypothesis in words, not in maths. That's perfectly acceptable. You can describe it in maths if you like, but since most readers find words easier to read than symbols, most writers tend to describe the null using words if they can.

• A "stat block" is included. When reporting the results of the test itself, I didn't just say that the result was significant, I included a "stat block" (i.e., the dense mathematical-looking part in the parentheses), which reports all the "raw" statistical data. For the chi-square goodness of fit test, the information that gets reported is the test statistic (that the goodness of fit statistic was 8.44), the information about the distribution used in the test (χ² with 3 degrees of freedom, which is usually shortened to χ²(3)), and then the information about whether the result was significant (in this case p < .05). The particular information that needs to go into the stat block is different for every test, and so each time I introduce a new test I'll show you what the stat block should look like.[4] However, the general principle is that you should always provide enough information so that the reader could check the test results themselves if they really wanted to. (A small sketch of this formatting convention appears just after this list.)

• The results are interpreted. In addition to indicating that the result was significant, I provided an interpretation of the result (i.e., that people didn't choose randomly). This is also a kindness to the reader, because it tells them something about what they should believe about what's going on in your data. If you don't include something like this, it's really hard for your reader to understand what's going on.[5]

As with everything else, your overriding concern should be that you explain things to your reader. Always remember that the point of reporting your results is to communicate to another human being. I cannot tell you just how many times I've seen the results section of a report or a thesis or even a scientific article that is just gibberish, because the writer has focused solely on making sure they've included all the numbers, and forgotten to actually communicate with the human reader.

[4] Well, sort of. The conventions for how statistics should be reported tend to differ somewhat from discipline to discipline; I've tended to stick with how things are done in psychology, since that's what I do. But the general principle of providing enough information to the reader to allow them to check your results is pretty universal, I think.

[5] To some people, this advice might sound odd, or at least in conflict with the "usual" advice on how to write a technical report. Very typically, students are told that the "results" section of a report is for describing the data and reporting statistical analysis, and the "discussion" section is for providing interpretation. That's true as far as it goes, but I think people often interpret it way too literally. The way I usually approach it is to provide a quick and simple interpretation of the data in the results section, so that my reader understands what the data are telling us. Then, in the discussion, I try to tell a bigger story about how my results fit with the rest of the scientific literature. In short, don't let the "interpretation goes in the discussion" advice turn your results section into incomprehensible garbage. Being understood by your reader is much more important.
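If you like the idea of generating stat blocks programmatically rather than typing them by hand, here is a small, purely illustrative helper (not part of lsr or base R; the function name and format string are my own inventions) that pastes the three reported numbers into the conventional form:

# A hypothetical convenience function for formatting a chi-square stat block.
# Everything here (name, exact format) is illustrative, not an established convention.
chisqStatBlock <- function(stat, df, p) {
  sprintf("chi-square(%d) = %.2f, p = %.3f", df, stat, p)
}

chisqStatBlock(8.44, 3, 0.038)   # "chi-square(3) = 8.44, p = 0.038"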
12.1.10 A comment on statistical notation

    Satan delights equally in statistics and in quoting scripture
    – H.G. Wells

If you've been reading very closely, and are as much of a mathematical pedant as I am, there is one thing about the way I wrote up the chi-square test in the last section that might be bugging you a little bit. There's something that feels a bit wrong with writing "χ²(3) = 8.44", you might be thinking. After all, it's the goodness of fit statistic that is equal to 8.44, so shouldn't I have written X² = 8.44 or maybe GOF = 8.44? This seems to be conflating the sampling distribution (i.e., χ² with df = 3) with the test statistic (i.e., X²). Odds are you figured it was a typo, since χ and X look pretty similar. Oddly, it's not. Writing χ²(3) = 8.44 is essentially a highly condensed way of writing "the sampling distribution of the test statistic is χ²(3), and the value of the test statistic is 8.44".

In one sense, this is kind of stupid. There are lots of different test statistics out there that turn out to have a chi-square sampling distribution: the X² statistic that we've used for our goodness of fit test is only one of many (albeit one of the most commonly encountered ones). In a sensible, perfectly organised world, we'd always have a separate name for the test statistic and the sampling distribution: that way, the stat block itself would tell you exactly what it was that the researcher had calculated. Sometimes this happens. For instance, the test statistic used in the Pearson goodness of fit test is written X²; but there's a closely related test known as the G-test[6] (Sokal & Rohlf, 1994), in which the test statistic is written as G. As it happens, the Pearson goodness of fit test and the G-test both test the same null hypothesis, and the sampling distribution is exactly the same (i.e., chi-square with k - 1 degrees of freedom). If I'd done a G-test for the cards data rather than a goodness of fit test, then I'd have ended up with a test statistic of G = 8.65, which is slightly different from the X² = 8.44 value that I got earlier, and which produces a slightly smaller p-value of p = .034. Suppose that the convention were to report the test statistic, then the sampling distribution, and then the p-value. If that were true, then these two situations would produce different stat blocks: my original result would be written X² = 8.44, χ²(3), p = .038, whereas the new version using the G-test would be written as G = 8.65, χ²(3), p = .034. However, using the condensed reporting standard, the original result is written χ²(3) = 8.44, p = .038, and the new one is written χ²(3) = 8.65, p = .034, and so it's actually unclear which test I actually ran.

[6] Complicating matters, the G-test is a special case of a whole class of tests that are known as likelihood ratio tests. I don't cover LRTs in this book, but they are quite handy things to know about.
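For the curious, the G statistic mentioned above can be computed in a couple of lines of R. This is a sketch rather than the book's code, and it assumes the standard likelihood-ratio form G = 2 Σ O_i log(O_i / E_i); with the cards data it lands on the values quoted in the text.

# G statistic for the cards data (standard likelihood-ratio form, assumed here):
observed <- c(clubs = 35, diamonds = 51, hearts = 64, spades = 50)
expected <- 200 * rep(.25, 4)

G <- 2 * sum( observed * log(observed / expected) )
G                                            # approximately 8.65
pchisq( q = G, df = 3, lower.tail = FALSE )  # approximately .034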
So why don't we live in a world in which the contents of the stat block uniquely specify which test was run? The deep reason is that life is messy. We (as users of statistical tools) want it to be nice and neat and organised... we want it to be designed, as if it were a product. But that's not how life works: statistics is an intellectual discipline just as much as any other one, and as such it's a massively distributed, partly-collaborative and partly-competitive project that no-one really understands completely. The things that you and I use as data analysis tools weren't created by an Act of the Gods of Statistics; they were invented by lots of different people, published as papers in academic journals, implemented, corrected and modified by lots of other people, and then explained to students in textbooks by someone else. As a consequence, there are a lot of test statistics that don't even have names, and so they're just given the same name as the corresponding sampling distribution. As we'll see later, any test statistic that follows a χ² distribution is commonly called a "chi-square statistic", anything that follows a t-distribution is called a "t-statistic", and so on. But, as the X² versus G example illustrates, two different things with the same sampling distribution are still, well, different. As a consequence, it's sometimes a good idea to be clear about what the actual test was that you ran, especially if you're doing something unusual. If you just say "chi-square test", it's not actually clear what test you're talking about. Although, since the two most common chi-square tests are the goodness of fit test and the independence test (Section 12.2), most readers with stats training can probably guess. Nevertheless, it's something to be aware of.

12.2 The χ² test of independence (or association)

    GUARDBOT 1: Halt!
    GUARDBOT 2: Be you robot or human?
    LEELA: Robot... we be.
    FRY: Uh, yup! Just two robots out roboting it up! Eh?
    GUARDBOT 1: Administer the test.
    GUARDBOT 2: Which of the following would you most prefer? A: A puppy, B: A pretty flower from your sweetie, or C: A large properly-formatted data file?
    GUARDBOT 1: Choose!
    – Futurama, "Fear of a Bot Planet"

The other day I was watching an animated documentary examining the quaint customs of the natives of the planet Chapek 9. Apparently, in order to gain access to their capital city, a visitor must prove that they're a robot, not a human. In order to determine whether or not the visitor is human, they ask whether the visitor prefers puppies, flowers or large, properly formatted data files. "Pretty clever," I thought to myself, "but what if humans and robots have the same preferences? That probably wouldn't be a very good test then, would it?" As it happens, I got my hands on the testing data that the civil authorities of Chapek 9 used to check this. It turns out that what they did was very simple... they found a bunch of robots and a bunch of humans and asked them what they preferred. I saved their data in a file called chapek9.Rdata, which I can now load and have a quick look at:

> load( "chapek9.Rdata" )
> who(TRUE)
   -- Name --   -- Class --   -- Size --
   chapek9      data.frame    180 x 2
    $species    factor        180
    $choice     factor        180

Okay, so we have a single data frame called chapek9, which contains two factors, species and choice.
As always, it's nice to have a quick look at the data,

> head(chapek9)
  species choice
1   robot flower
2   human   data
3   human   data
4   human   data
5   robot   data
6   human flower

and then take a summary(),

> summary(chapek9)
  species       choice
 robot:87   puppy : 28
 human:93   flower: 43
            data  :109

In total there are 180 entries in the data frame, one for each person (counting both robots and humans as "people") who was asked to make a choice. Specifically, there are 93 humans and 87 robots, and overwhelmingly the preferred choice is the data file. However, these summaries don't address the question we're interested in. To do that, we need a more detailed description of the data. What we want to do is look at the choices broken down by species. That is, we need to cross-tabulate the data (see Section 7.1). There are quite a few ways to do this, as we've seen, but since our data are stored in a data frame, it's convenient to use the xtabs() function.

> chapekFrequencies <- xtabs( ~ choice + species, data = chapek9 )
> chapekFrequencies
        species
choice   robot human
  puppy     13    15
  flower    30    13
  data      44    65

That's more or less what we're after. So, if we add the row and column totals (which is convenient for the purposes of explaining the statistical tests), we would have a table like this,

              Robot   Human   Total
  Puppy          13      15      28
  Flower         30      13      43
  Data file      44      65     109
  Total          87      93     180

which actually would be a nice way to report the descriptive statistics for this data set.
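As a side note (a sketch, not something the text does at this point), base R's addmargins() will produce that row-and-column-total version of the table directly from the cross-tabulation:

# Add margin totals to the cross-tabulation (the labels read "Sum" rather than
# "Total", but the numbers match the table above).
addmargins( chapekFrequencies )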
In any case, it's quite clear that the vast majority of the humans chose the data file, whereas the robots tended to be a lot more even in their preferences. Leaving aside the question of why the humans might be more likely to choose the data file for the moment (which does seem quite odd, admittedly), our first order of business is to determine if the discrepancy between human choices and robot choices in the data set is statistically significant.

12.2.1 Constructing our hypothesis test

How do we analyse this data? Specifically, since my research hypothesis is that "humans and robots answer the question in different ways", how can I construct a test of the null hypothesis that "humans and robots answer the question the same way"? As before, we begin by establishing some notation to describe the data:

              Robot   Human   Total
  Puppy        O_11    O_12    R_1
  Flower       O_21    O_22    R_2
  Data file    O_31    O_32    R_3
  Total        C_1     C_2     N

In this notation we say that O_ij is a count (observed frequency) of the number of respondents of species j (robot or human) who gave answer i (puppy, flower or data) when asked to make a choice. The total number of observations is written N, as usual. Finally, I've used R_i to denote the row totals (e.g., R_1 is the total number of people who chose the puppy), and C_j to denote the column totals (e.g., C_1 is the total number of robots).[7]

So now let's think about what the null hypothesis says. If robots and humans are responding in the same way to the question, it means that the probability that "a robot says puppy" is the same as the probability that "a human says puppy", and so on for the other two possibilities. So, if we use P_ij to denote "the probability that a member of species j gives response i", then our null hypothesis is that:

H0: All of the following are true:
    P_11 = P_12 (same probability of saying "puppy"),
    P_21 = P_22 (same probability of saying "flower"), and
    P_31 = P_32 (same probability of saying "data").

And actually, since the null hypothesis is claiming that the true choice probabilities don't depend on the species of the person making the choice, we can let P_i refer to this probability: e.g., P_1 is the true probability of choosing the puppy.

[7] A technical note. The way I've described the test pretends that the column totals are fixed (i.e., the researcher intended to survey 87 robots and 93 humans) and the row totals are random (i.e., it just turned out that 28 people chose the puppy). To use the terminology from my mathematical statistics textbook (Hogg, McKean, & Craig, 2005), I should technically refer to this situation as a chi-square test of homogeneity, and reserve the term chi-square test of independence for the situation where both the row and column totals are random outcomes of the experiment. In the initial drafts of this book that's exactly what I did. However, it turns out that these two tests are identical, and so I've collapsed them together.
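To give a sense of where this construction is heading (a sketch following the usual logic of this kind of test, not the book's own code): if the null hypothesis is true, the natural estimate of P_i is the row total divided by N, so the expected count in each cell is R_i × C_j / N. In R this can be computed directly from the cross-tabulation built earlier:

# Expected frequencies under the null, computed from the chapekFrequencies table.
R <- rowSums(chapekFrequencies)    # row totals:    28, 43, 109
C <- colSums(chapekFrequencies)    # column totals: 87, 93
N <- sum(chapekFrequencies)        # total sample size: 180

expected <- outer(R, C) / N        # E_ij = R_i * C_j / N
round(expected, 1)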
