Summary

This chapter introduces the chi-square (X²) statistic, explains how to calculate expected frequencies and the test statistic, and demonstrates how to perform the test in R. The concepts of degrees of freedom and the continuity correction are also explained, along with effect size measures, the assumptions of the test, and the related Fisher exact and McNemar tests.

Full Transcript


Next, in much the same way that we did with the goodness of fit test, what we need to do is calculate the expected frequencies. That is, for each of the observed counts O_ij, we need to figure out what the null hypothesis would tell us to expect. Let's denote this expected frequency by E_ij. This time, it's a little bit trickier. If there are a total of C_j people that belong to species j, and the true probability of anyone (regardless of species) choosing option i is P_i, then the expected frequency is just:

   E_ij = C_j × P_i

Now, this is all very well and good, but we have a problem. Unlike the situation we had with the goodness of fit test, the null hypothesis doesn't actually specify a particular value for P_i. It's something we have to estimate (Chapter 10) from the data! Fortunately, this is pretty easy to do. If 28 out of 180 people selected the puppy, then a natural estimate for the probability of choosing the puppy is 28/180, which is approximately .16. If we phrase this in mathematical terms, what we're saying is that our estimate for the probability of choosing option i is just the row total divided by the total sample size:

   P̂_i = R_i / N

Therefore, our expected frequency can be written as the product (i.e. multiplication) of the row total and the column total, divided by the total number of observations:

   E_ij = (R_i × C_j) / N

[Footnote 8: Technically, E_ij here is an estimate, so I should probably write it Ê_ij. But since no-one else does, I won't either.]

Now that we've figured out how to calculate the expected frequencies, it's straightforward to define a test statistic, following the exact same strategy that we used in the goodness of fit test. In fact, it's pretty much the same statistic. For a contingency table with r rows and c columns, the equation that defines our X² statistic is

   X² = Σ_{i=1..r} Σ_{j=1..c} (E_ij − O_ij)² / E_ij

The only difference is that I have to include two summation signs (i.e., Σ) to indicate that we're summing over both rows and columns. As before, large values of X² indicate that the null hypothesis provides a poor description of the data, whereas small values of X² suggest that it does a good job of accounting for the data. Therefore, just like last time, we want to reject the null hypothesis if X² is too large.
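To make the formulas concrete, here's a minimal sketch of my own (not part of the original text) that computes the expected frequencies and the X² statistic directly in R, assuming the chapek9 cross-tabulation shown earlier has been entered as a matrix called obs:

obs <- matrix( c(13,15, 30,13, 44,65), nrow = 3, byrow = TRUE,
               dimnames = list( choice = c("puppy","flower","data"),
                                species = c("robot","human") ) )
N <- sum( obs )                                       # total sample size (180)
expected <- outer( rowSums(obs), colSums(obs) ) / N   # E_ij = R_i x C_j / N
X2 <- sum( (obs - expected)^2 / expected )            # the X^2 statistic
X2                                                    # roughly 10.72, matching the test output shown later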
Not surprisingly, this statistic is χ² distributed. All we need to do is figure out how many degrees of freedom are involved, which actually isn't too hard. As I mentioned before, you can (usually) think of the degrees of freedom as being equal to the number of data points that you're analysing, minus the number of constraints. A contingency table with r rows and c columns contains a total of r × c observed frequencies, so that's the total number of observations. What about the constraints? Here, it's slightly trickier. The answer is always the same,

   df = (r − 1)(c − 1)

but the explanation for why the degrees of freedom takes this value is different depending on the experimental design. For the sake of argument, let's suppose that we had honestly intended to survey exactly 87 robots and 93 humans (column totals fixed by the experimenter), but left the row totals free to vary (row totals are random variables). Let's think about the constraints that apply here. Well, since we deliberately fixed the column totals by Act of Experimenter, we have c constraints right there. But there's actually more to it than that. Remember how our null hypothesis had some free parameters (i.e., we had to estimate the P_i values)? Those matter too. I won't explain why in this book, but every free parameter in the null hypothesis is rather like an additional constraint. So, how many of those are there? Well, since these probabilities have to sum to 1, there are only r − 1 of these. So our total degrees of freedom is:

   df = (number of observations) − (number of constraints)
      = rc − (c + (r − 1))
      = rc − c − r + 1
      = (r − 1)(c − 1)

Alternatively, suppose that the only thing that the experimenter fixed was the total sample size N. That is, we quizzed the first 180 people that we saw, and it just turned out that 87 were robots and 93 were humans. This time around our reasoning would be slightly different, but would still lead us to the same answer. Our null hypothesis still has r − 1 free parameters corresponding to the choice probabilities, but it now also has c − 1 free parameters corresponding to the species probabilities, because we'd also have to estimate the probability that a randomly sampled person turns out to be a robot.

[Footnote 9: A problem many of us worry about in real life.]

Finally, since we did actually fix the total number of observations N, that's one more constraint. So now we have rc observations, and (c − 1) + (r − 1) + 1 constraints. What does that give?

   df = (number of observations) − (number of constraints)
      = rc − ((c − 1) + (r − 1) + 1)
      = rc − c − r + 1
      = (r − 1)(c − 1)

Amazing.

12.2.2 Doing the test in R

Okay, now that we know how the test works, let's have a look at how it's done in R. As tempting as it is to lead you through the tedious calculations so that you're forced to learn it the long way, I figure there's no point. I already showed you how to do it the long way for the goodness of fit test in the last section, and since the test of independence isn't conceptually any different, you won't learn anything new by doing it the long way. So instead, I'll go straight to showing you the easy way. As always, R lets you do it multiple ways. There's the chisq.test() function, which I'll talk about in Section 12.6, but first I want to use the associationTest() function in the lsr package, which I think is easier on beginners. It works in the exact same way as the xtabs() function. Recall that, in order to produce the contingency table, we used this command:

> xtabs( formula = ~choice+species, data = chapek9 )
         species
choice    robot human
  puppy      13    15
  flower     30    13
  data       44    65

The associationTest() function has exactly the same structure: it needs a formula that specifies which variables you're cross-tabulating, and the name of a data frame that contains those variables. So the command is just this:

> associationTest( formula = ~choice+species, data = chapek9 )

Just like we did with the goodness of fit test, I'll go through it line by line. The first two lines are, once again, just reminding you what kind of test you ran and what variables were used:
   Chi-square test of categorical association

   Variables:   choice, species

Next, it tells you what the null and alternative hypotheses are (and again, I want to remind you not to get used to seeing these hypotheses written out so explicitly):

   Hypotheses:
      null:        variables are independent of one another
      alternative: some contingency exists between variables

Next, it shows you the observed contingency table that is being tested:

   Observed contingency table:
           species
   choice    robot human
     puppy      13    15
     flower     30    13
     data       44    65

and it also shows you what the expected frequencies would be if the null hypothesis were true:

   Expected contingency table under the null hypothesis:
           species
   choice    robot human
     puppy     13.5  14.5
     flower    20.8  22.2
     data      52.7  56.3

The next part describes the results of the hypothesis test itself:

   Test results:
      X-squared statistic:  10.722
      degrees of freedom:   2
      p-value:              0.005

And finally, it reports a measure of effect size:

   Other information:
      estimated effect size (Cramer's v):  0.244

You can ignore this bit for now. I'll talk about it in just a moment. This output gives us enough information to write up the result:

   Pearson's χ² revealed a significant association between species and choice (χ²(2) = 10.7, p < .01): robots appeared to be more likely to say that they prefer flowers, but the humans were more likely to say they prefer data.

Notice that, once again, I provided a little bit of interpretation to help the human reader understand what's going on with the data. Later on in my discussion section, I'd provide a bit more context. To illustrate the difference, here's what I'd probably say later on:

   The fact that humans appeared to have a stronger preference for raw data files than robots is somewhat counterintuitive. However, in context it makes some sense: the civil authority on Chapek 9 has an unfortunate tendency to kill and dissect humans when they are identified.
   As such it seems most likely that the human participants did not respond honestly to the question, so as to avoid potentially undesirable consequences. This should be considered to be a substantial methodological weakness.

This could be classified as a rather extreme example of a reactivity effect, I suppose. Obviously, in this case the problem is severe enough that the study is more or less worthless as a tool for understanding the differences in preferences among humans and robots. However, I hope this illustrates the difference between getting a statistically significant result (our null hypothesis is rejected in favour of the alternative), and finding something of scientific value (the data tell us nothing of interest about our research hypothesis due to a big methodological flaw).

12.2.3 Postscript

I later found out the data were made up, and I'd been watching cartoons instead of doing work.

12.3 The continuity correction

Okay, time for a little bit of a digression. I've been lying to you a little bit so far. There's a tiny change that you need to make to your calculations whenever you only have 1 degree of freedom. It's called the "continuity correction", or sometimes the Yates correction. Remember what I pointed out earlier: the χ² test is based on an approximation, specifically on the assumption that the binomial distribution starts to look like a normal distribution for large N. One problem with this is that it often doesn't quite work, especially when you've only got 1 degree of freedom (e.g., when you're doing a test of independence on a 2 × 2 contingency table). The main reason for this is that the true sampling distribution for the X² statistic is actually discrete (because you're dealing with categorical data!) but the χ² distribution is continuous. This can introduce systematic problems. Specifically, when N is small and when df = 1, the goodness of fit statistic tends to be "too big", meaning that you actually have a bigger α value than you think (or, equivalently, the p values are a bit too small). Yates (1934) suggested a simple fix, in which you redefine the goodness of fit statistic as:

   X² = Σ_i (|E_i − O_i| − 0.5)² / E_i        (12.1)

Basically, he just subtracts off 0.5 everywhere. As far as I can tell from reading Yates' paper, the correction is basically a hack. It's not derived from any principled theory: rather, it's based on an examination of the behaviour of the test, and observing that the corrected version seems to work better. I feel obliged to explain this because you will sometimes see R (or any other software for that matter) introduce this correction, so it's kind of useful to know what it's about. You'll know when it happens, because the R output will explicitly say that it has used a "continuity correction" or "Yates' correction".

12.4 Effect size

As we discussed earlier (Section 11.8), it's becoming commonplace to ask researchers to report some measure of effect size. So, let's suppose that you've run your chi-square test, which turns out to be significant. So you now know that there is some association between your variables (independence test) or some deviation from the specified probabilities (goodness of fit test). Now you want to report a measure of effect size. That is, given that there is an association/deviation, how strong is it?

There are several different measures that you can choose to report, and several different tools that you can use to calculate them. I won't discuss all of them, but will instead focus on the most commonly reported measures of effect size.

[Footnote 10: Though I do feel that it's worth mentioning the assocstats() function in the vcd package. If you install and load the vcd package, then a command like assocstats( chapekFrequencies ) will run the χ² test as well as the likelihood ratio test (not discussed here); and then report three different measures of effect size: φ², Cramér's V, and the contingency coefficient (not discussed here).]

By default, the two measures that people tend to report most frequently are the φ statistic and the somewhat superior version, known as Cramér's V. Mathematically, they're very simple. To calculate the φ statistic, you just divide your X² value by the sample size, and take the square root:

   φ = √( X² / N )

The idea here is that the φ statistic is supposed to range between 0 (no association at all) and 1 (perfect association), but it doesn't always do this when your contingency table is bigger than 2 × 2, which is a total pain. For bigger tables it's actually possible to obtain φ > 1, which is pretty unsatisfactory. So, to correct for this, people usually prefer to report the V statistic proposed by Cramér (1946). It's a pretty simple adjustment to φ. If you've got a contingency table with r rows and c columns, then define k = min(r, c) to be the smaller of the two values. If so, then Cramér's V statistic is

   V = √( X² / (N(k − 1)) )

And you're done. This seems to be a fairly popular measure, presumably because it's easy to calculate, and it gives answers that aren't completely silly: you know that V really does range from 0 (no association at all) to 1 (perfect association).

Calculating V or φ is obviously pretty straightforward. So much so that the core packages in R don't seem to have functions to do it, though other packages do. To save you the time and effort of finding one, I've included one in the lsr package, called cramersV(). It takes a contingency table as input, and prints out the measure of effect size:

> cramersV( chapekFrequencies )
 0.244058
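If you're curious, the calculation underneath that number is only a couple of lines. The following is my own sketch (not from the original text), assuming chapekFrequencies holds the cross-tabulation used throughout this chapter:

X2 <- as.numeric( chisq.test( chapekFrequencies )$statistic )  # Pearson X^2 (about 10.72)
N  <- sum( chapekFrequencies )         # total sample size (180)
k  <- min( dim( chapekFrequencies ) )  # smaller of the number of rows and columns
sqrt( X2 / (N * (k - 1)) )             # roughly 0.244, matching cramersV()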
However, if you're using the associationTest() function to do your analysis, then you won't actually need to use this at all, because it reports the Cramér's V statistic as part of the output.

12.5 Assumptions of the test(s)

All statistical tests make assumptions, and it's usually a good idea to check that those assumptions are met. For the chi-square tests discussed so far in this chapter, the assumptions are:

Expected frequencies are sufficiently large. Remember how in the previous section we saw that the χ² sampling distribution emerges because the binomial distribution is pretty similar to a normal distribution? Well, like we discussed in Chapter 9, this is only true when the number of observations is sufficiently large. What that means in practice is that all of the expected frequencies need to be reasonably big. How big is reasonably big? Opinions differ, but the default assumption seems to be that you generally would like to see all your expected frequencies larger than about 5, though for larger tables you would probably be okay if at least 80% of the expected frequencies are above 5 and none of them are below 1. However, from what I've been able to discover (e.g., Cochran, 1954), these seem to have been proposed as rough guidelines, not hard and fast rules; and they seem to be somewhat conservative (Larntz, 1978). (A quick way to check this assumption in R is sketched a little further below.)

Data are independent of one another. One somewhat hidden assumption of the chi-square test is that you have to genuinely believe that the observations are independent. Here's what I mean. Suppose I'm interested in the proportion of babies born at a particular hospital that are boys. I walk around the maternity wards and observe 20 girls and only 10 boys. Seems like a pretty convincing difference, right? But later on, it turns out that I'd actually walked into the same ward 10 times, and in fact I'd only seen 2 girls and 1 boy. Not as convincing, is it? My original 30 observations were massively non-independent... and were only in fact equivalent to 3 independent observations. Obviously this is an extreme (and extremely silly) example, but it illustrates the basic issue. Non-independence "stuffs things up". Sometimes it causes you to falsely reject the null, as the silly hospital example illustrates, but it can go the other way too. To give a slightly less stupid example, let's consider what would happen if I'd done the cards experiment slightly differently: instead of asking 200 people to try to imagine sampling one card at random, suppose I asked 50 people to select 4 cards. One possibility would be that everyone selects one heart, one club, one diamond and one spade (in keeping with the "representativeness heuristic"; Tversky & Kahneman 1974). This is highly non-random behaviour from people, but in this case I would get an observed frequency of 50 for all four suits. For this example, the fact that the observations are non-independent (because the four cards that you pick will be related to each other) actually leads to the opposite effect: falsely retaining the null.
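As a practical aside of my own (not part of the original text), the expected-frequency assumption is easy to inspect, because chisq.test() stores the expected counts it computed. For the chapek9 table:

expected <- chisq.test( chapekFrequencies )$expected  # expected counts under independence
expected
all( expected >= 5 )   # TRUE here, so the usual rule of thumb is satisfied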
If you happen to find yourself in a situation where independence is violated, it may be possible to use the McNemar test (which we'll discuss) or the Cochran test (which we won't). Similarly, if your expected cell counts are too small, check out the Fisher exact test. It is to these topics that we now turn.

12.6 The most typical way to do chi-square tests in R

When discussing how to do a chi-square goodness of fit test (Section 12.1.7) and the chi-square test of independence (Section 12.2.2), I introduced you to two separate functions in the lsr package. We ran our goodness of fit tests using the goodnessOfFitTest() function, and our tests of independence (or association) using the associationTest() function. And both of those functions produced quite detailed output, showing you the relevant descriptive statistics, printing out explicit reminders of what the hypotheses are, and so on. When you're first starting out, it can be very handy to be given this sort of guidance. However, once you start becoming a bit more proficient in statistics and in R it can start to get very tiresome. A real statistician hardly needs to be told what the null and alternative hypotheses for a chi-square test are, and if an advanced R user wants the descriptive statistics to be printed out, they know how to produce them!

For this reason, the basic chisq.test() function in R is a lot more terse in its output, and because the mathematics that underpin the goodness of fit test and the test of independence is basically the same in each case, it can run either test depending on what kind of input it is given. First, here's the goodness of fit test. Suppose you have the frequency table observed that we used earlier:

> observed
   clubs diamonds   hearts   spades
      35       51       64       50

If you want to run the goodness of fit test against the hypothesis that all four suits are equally likely to appear, then all you need to do is input this frequency table to the chisq.test() function:

> chisq.test( x = observed )

        Chi-squared test for given probabilities

data:  observed
X-squared = 8.44, df = 3, p-value = 0.03774

Notice that the output is very compressed in comparison to the goodnessOfFitTest() function. It doesn't bother to give you any descriptive statistics, it doesn't tell you what null hypothesis is being tested, and so on. And as long as you already understand the test, that's not a problem. Once you start getting familiar with R and with statistics, you'll probably find that you prefer this simple output rather than the rather lengthy output that goodnessOfFitTest() produces. Anyway, if you want to change the null hypothesis, it's exactly the same as before: just specify the probabilities using the p argument. For instance:

> chisq.test( x = observed, p = c(.2, .3, .3, .2) )

        Chi-squared test for given probabilities

data:  observed
X-squared = 4.7417, df = 3, p-value = 0.1917

Again, these are the same numbers that the goodnessOfFitTest() function reports at the end of the output. It just hasn't included any of the other details.

What about a test of independence? As it turns out, the chisq.test() function is pretty clever. [Footnote 11: Not really.] If you input a cross-tabulation rather than a simple frequency table, it realises that you're asking for a test of independence and not a goodness of fit test. Recall that we already have this cross-tabulation stored as the chapekFrequencies variable:

> chapekFrequencies
         species
choice    robot human
  puppy      13    15
  flower     30    13
  data       44    65
To get the test of independence, all we have to do is feed this frequency table into the chisq.test() function like so:

> chisq.test( chapekFrequencies )

        Pearson's Chi-squared test

data:  chapekFrequencies
X-squared = 10.7216, df = 2, p-value = 0.004697

Again, the numbers are the same as last time; it's just that the output is very terse and doesn't really explain what's going on in the rather tedious way that associationTest() does. As before, my intuition is that when you're just getting started it's easier to use something like associationTest() because it shows you more detail about what's going on, but later on you'll probably find that chisq.test() is more convenient.

12.7 The Fisher exact test

What should you do if your cell counts are too small, but you'd still like to test the null hypothesis that the two variables are independent? One answer would be "collect more data", but that's far too glib: there are a lot of situations in which it would be either infeasible or unethical to do that. If so, statisticians have a kind of moral obligation to provide scientists with better tests. In this instance, Fisher (1922) kindly provided the right answer to the question. To illustrate the basic idea, let's suppose that we're analysing data from a field experiment looking at the emotional status of people who have been accused of witchcraft, some of whom are currently being burned at the stake.

[Footnote 12: This example is based on a joke article published in the Journal of Irreproducible Results.]

Unfortunately for the scientist (but rather fortunately for the general populace), it's actually quite hard to find people in the process of being set on fire, so the cell counts are awfully small in some cases. The salem.Rdata file illustrates the point:

> load("salem.Rdata")
> who(TRUE)
   -- Name --    -- Class --   -- Size --
   trial         data.frame    16 x 2
    $happy       logical       16
    $on.fire     logical       16

> salem.tabs <- table( trial )
> print( salem.tabs )
       on.fire
happy   FALSE TRUE
  FALSE     3    3
  TRUE     10    0

Looking at this data, you'd be hard pressed not to suspect that people not on fire are more likely to be happy than people on fire. However, the chi-square test makes this very hard to test because of the small sample size. If I try to do so, R gives me a warning message:

> chisq.test( salem.tabs )

        Pearson's Chi-squared test with Yates' continuity correction

data:  salem.tabs
X-squared = 3.3094, df = 1, p-value = 0.06888

Warning message:
In chisq.test(salem.tabs) : Chi-squared approximation may be incorrect

Speaking as someone who doesn't want to be set on fire, I'd really like to be able to get a better answer than this. This is where Fisher's exact test (Fisher, 1922a) comes in very handy.

The Fisher exact test works somewhat differently to the chi-square test (or in fact any of the other hypothesis tests that I talk about in this book) insofar as it doesn't have a test statistic; it calculates the p-value "directly". I'll explain the basics of how the test works for a 2 × 2 contingency table, though the test works fine for larger tables. As before, let's have some notation:

                       Happy   Sad   Total
   Set on fire          O11    O12    R1
   Not set on fire      O21    O22    R2
   Total                C1     C2     N

In order to construct the test, Fisher treats both the row and column totals (R1, R2, C1 and C2) as known, fixed quantities, and then calculates the probability that we would have obtained the observed frequencies that we did (O11, O12, O21 and O22) given those totals.
In the notation that we developed in Chapter 9 this is written:

   P(O11, O12, O21, O22 | R1, R2, C1, C2)

and, as you might imagine, it's a slightly tricky exercise to figure out what this probability is. But it turns out that this probability is described by a distribution known as the hypergeometric distribution.

[Footnote 13: The R functions for this distribution are dhyper(), phyper(), qhyper() and rhyper(), though you don't need them for this book, and I haven't given you enough information to use these to perform the Fisher exact test the long way.]

Now that we know this, what we have to do to calculate our p-value is calculate the probability of observing this particular table, or a table that is "more extreme".

[Footnote 14: Not surprisingly, the Fisher exact test is motivated by Fisher's interpretation of a p-value, not Neyman's!]

Back in the 1920s, computing this sum was daunting even in the simplest of situations, but these days it's pretty easy as long as the tables aren't too big and the sample size isn't too large. The conceptually tricky issue is to figure out what it means to say that one contingency table is more "extreme" than another. The easiest solution is to say that the table with the lowest probability is the most extreme. This then gives us the p-value.

The implementation of the test in R is via the fisher.test() function. Here's how it is used:

> fisher.test( salem.tabs )

        Fisher's Exact Test for Count Data

data:  salem.tabs
p-value = 0.03571
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.000000 1.202913
sample estimates:
odds ratio
         0

This is a bit more output than we got from some of our earlier tests. The main thing we're interested in here is the p-value, which in this case is small enough (p = .036) to justify rejecting the null hypothesis that people on fire are just as happy as people not on fire.
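Although the original text deliberately doesn't show the long way, here is a rough sketch of my own of how that p-value can be reproduced from the hypergeometric distribution, treating the margins of the salem table as fixed (3 people on fire, 13 not on fire, and 10 happy people "drawn"):

probs <- dhyper( 0:3, m = 3, n = 13, k = 10 )  # P(x of the 3 on-fire people fall among the 10 happy ones)
p.obs <- dhyper( 0, m = 3, n = 13, k = 10 )    # probability of the observed table (x = 0)
sum( probs[ probs <= p.obs ] )                 # roughly 0.036, matching fisher.test()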
12.8 The McNemar test

Suppose you've been hired to work for the Australian Generic Political Party (AGPP), and part of your job is to find out how effective the AGPP political advertisements are. So, what you do is put together a sample of N = 100 people, and ask them to watch the AGPP ads. Before they see anything, you ask them if they intend to vote for the AGPP; and then after showing the ads, you ask them again, to see if anyone has changed their minds. Obviously, if you're any good at your job, you'd also do a whole lot of other things too, but let's consider just this one simple experiment. One way to describe your data is via the following contingency table:

             Before   After   Total
   Yes          30       10      40
   No           70       90     160
   Total       100      100     200

At first pass, you might think that this situation lends itself to the Pearson χ² test of independence (as per Section 12.2). However, a little bit of thought reveals that we've got a problem: we have 100 participants, but 200 observations. This is because each person has provided us with an answer in both the before column and the after column. What this means is that the 200 observations aren't independent of each other: if voter A says "yes" the first time and voter B says "no", then you'd expect that voter A is more likely to say "yes" the second time than voter B! The consequence of this is that the usual χ² test won't give trustworthy answers due to the violation of the independence assumption. Now, if this were a really uncommon situation, I wouldn't be bothering to waste your time talking about it. But it's not uncommon at all: this is a standard repeated measures design, and none of the tests we've considered so far can handle it. Eek.

The solution to the problem was published by McNemar (1947). The trick is to start by tabulating your data in a slightly different way:

                 Before: Yes   Before: No   Total
   After: Yes          5            5          10
   After: No          25           65          90
   Total              30           70         100

This is exactly the same data, but it's been rewritten so that each of our 100 participants appears in only one cell. Because we've written our data this way, the independence assumption is now satisfied, and this is a contingency table that we can use to construct an X² goodness of fit statistic. However, as we'll see, we need to do it in a slightly nonstandard way. To see what's going on, it helps to label the entries in our table a little differently:

                 Before: Yes   Before: No   Total
   After: Yes          a            b        a + b
   After: No           c            d        c + d
   Total             a + c        b + d        n

Next, let's think about what our null hypothesis is: it's that the "before" test and the "after" test have the same proportion of people saying "Yes, I will vote for AGPP". Because of the way that we have rewritten the data, it means that we're now testing the hypothesis that the row totals and column totals come from the same distribution. Thus, the null hypothesis in McNemar's test is that we have "marginal homogeneity". That is, the row totals and column totals have the same distribution: P_a + P_b = P_a + P_c, and similarly P_c + P_d = P_b + P_d. Notice that this means that the null hypothesis actually simplifies to P_b = P_c. In other words, as far as the McNemar test is concerned, it's only the off-diagonal entries in this table (i.e., b and c) that matter! After noticing this, the McNemar test of marginal homogeneity is no different to a usual χ² test. After applying the Yates correction, our test statistic becomes:

   X² = (|b − c| − 0.5)² / (b + c)

or, to revert to the notation that we used earlier in this chapter:

   X² = (|O12 − O21| − 0.5)² / (O12 + O21)

and this statistic has an (approximately) χ² distribution with df = 1. However, remember that, just like the other χ² tests, it's only an approximation, so you need to have reasonably large expected cell counts for it to work.

12.8.1 Doing the McNemar test in R

Now that you know what the McNemar test is all about, let's actually run one. The agpp.Rdata file contains the raw data that I discussed previously, so let's have a look at it:

> load("agpp.Rdata")
> who(TRUE)
   -- Name --            -- Class --   -- Size --
   agpp                  data.frame    100 x 3
    $id                  factor        100
    $response_before     factor        100
    $response_after      factor        100

The agpp data frame contains three variables: an id variable that labels each participant in the data set (we'll see why that's useful in a moment), a response_before variable that records the person's answer when they were asked the question the first time, and a response_after variable that shows the answer that they gave when asked the same question a second time. As usual, here are the first 6 entries:

> head(agpp)
      id response_before response_after
1 subj.1              no            yes
2 subj.2             yes             no
3 subj.3             yes             no
4 subj.4             yes             no
5 subj.5              no             no
6 subj.6              no             no

and here's a summary:

> summary(agpp)
        id      response_before response_after
 subj.1   : 1   no :70          no :90
 subj.10  : 1   yes:30          yes:10
 subj.100 : 1
 subj.11  : 1
 subj.12  : 1
 subj.13  : 1
 (Other)  :94

Notice that each participant appears only once in this data frame.
When we tabulate this data frame using xtabs(), we get the appropriate table:

> right.table <- xtabs( ~ response_before + response_after, data = agpp )
> print( right.table )
               response_after
response_before no yes
            no  65   5
            yes 25   5

and from there, we can run the McNemar test by using the mcnemar.test() function:

> mcnemar.test( right.table )

        McNemar's Chi-squared test with continuity correction

data:  right.table
McNemar's chi-squared = 12.0333, df = 1, p-value = 0.0005226

And we're done. We've just run a McNemar test to determine if people were just as likely to vote AGPP after the ads as they were beforehand. The test was significant (χ²(1) = 12.04, p < .001), suggesting that they were not. And in fact, it looks like the ads had a negative effect: people were less likely to vote AGPP after seeing the ads. Which makes a lot of sense when you consider the quality of a typical political advertisement.

12.9 What's the difference between McNemar and independence?

Let's go all the way back to the beginning of the chapter, and look at the cards data set again. If you recall, the actual experimental design that I described involved people making two choices. Because we have information about the first choice and the second choice that everyone made, we can construct the following contingency table that cross-tabulates the first choice against the second choice.

> cardChoices <- xtabs( ~ choice_1 + choice_2, data = cards )
> cardChoices
          choice_2
choice_1   clubs diamonds hearts spades
  clubs       10        9     10      6
  diamonds    20        4     13     14
  hearts      20       18      3     23
  spades      18       13     15      4

Suppose I wanted to know whether the choice you make the second time is dependent on the choice you made the first time. This is where a test of independence is useful, and what we're trying to do is see if there's some relationship between the rows and columns of this table. Here's the result:

> chisq.test( cardChoices )

        Pearson's Chi-squared test

data:  cardChoices
X-squared = 29.2373, df = 9, p-value = 0.0005909

Alternatively, suppose I wanted to know if, on average, the frequencies of suit choices were different the second time than the first time. In that situation, what I'm really trying to see is whether the row totals in cardChoices (i.e., the frequencies for choice_1) are different from the column totals (i.e., the frequencies for choice_2). That's when you use the McNemar test:

> mcnemar.test( cardChoices )

        McNemar's Chi-squared test

data:  cardChoices
McNemar's chi-squared = 16.0334, df = 6, p-value = 0.01358

Notice that the results are different! These aren't the same test.

12.10 Summary

The key ideas discussed in this chapter are:

- The chi-square goodness of fit test (Section 12.1) is used when you have a table of observed frequencies of different categories, and the null hypothesis gives you a set of "known" probabilities to compare them to. You can either use the goodnessOfFitTest() function in the lsr package to run this test, or the chisq.test() function.
- The chi-square test of independence (Section 12.2) is used when you have a contingency table (cross-tabulation) of two categorical variables. The null hypothesis is that there is no relationship/association between the variables. You can either use the associationTest() function in the lsr package, or you can use chisq.test().
- Effect size for a contingency table can be measured in several ways (Section 12.4). In particular we noted the Cramér's V statistic, which can be calculated using cramersV(). This is also part of the output produced by associationTest().
- Both versions of the Pearson test rely on two assumptions: that the expected frequencies are sufficiently large, and that the observations are independent (Section 12.5). The Fisher exact test (Section 12.7) can be used when the expected frequencies are small, fisher.test(x = contingency.table). The McNemar test (Section 12.8) can be used for some kinds of violations of independence, mcnemar.test(x = contingency.table).

If you're interested in learning more about categorical data analysis, a good first choice would be Agresti (1996), which, as the title suggests, provides an Introduction to Categorical Data Analysis. If the introductory book isn't enough for you (or can't solve the problem you're working on) you could consider Agresti (2002), Categorical Data Analysis. The latter is a more advanced text, so it's probably not wise to jump straight from this book to that one.

13. Comparing two means

In the previous chapter we covered the situation when your outcome variable is nominal scale and your predictor variable is also nominal scale.

[Footnote 1: We won't cover multiple predictors until Chapter 15.]

Lots of real world situations have that character, and so you'll find that chi-square tests in particular are quite widely used. However, you're much more likely to find yourself in a situation where your outcome variable is interval scale or higher, and what you're interested in is whether the average value of the outcome variable is higher in one group or another. For instance, a psychologist might want to know if anxiety levels are higher among parents than non-parents, or if working memory capacity is reduced by listening to music (relative to not listening to music). In a medical context, we might want to know if a new drug increases or decreases blood pressure. An agricultural scientist might want to know whether adding phosphorus to Australian native plants will kill them.

[Footnote 2: Informal experimentation in my garden suggests that yes, it does. Australian natives are adapted to low phosphorus levels relative to everywhere else on Earth, apparently, so if you've bought a house with a bunch of exotics and you want to plant natives, don't follow my example: keep them separate. Nutrients to European plants are poison to Australian ones. There's probably a joke in that, but I can't figure out what it is.]

In all these situations, our outcome variable is a fairly continuous, interval or ratio scale variable; and our predictor is a binary "grouping" variable. In other words, we want to compare the means of the two groups.

The standard answer to the problem of comparing means is to use a t-test, of which there are several varieties depending on exactly what question you want to solve. As a consequence, the majority of this chapter focuses on different types of t-test: one sample t-tests are discussed in Section 13.2, independent samples t-tests are discussed in Sections 13.3 and 13.4, and paired samples t-tests are discussed in Section 13.5. After that, we'll talk a bit about Cohen's d, which is the standard measure of effect size for a t-test (Section 13.8). The later sections of the chapter focus on the assumptions of the t-tests, and possible remedies if they are violated. However, before discussing any of these useful things, we'll start with a discussion of the z-test.

13.1 The one-sample z-test

In this section I'll describe one of the most useless tests in all of statistics: the z-test. Seriously – this test is almost never used in real life. Its only real purpose is that, when teaching statistics, it's a very convenient stepping stone along the way towards the t-test, which is probably the most (over)used tool in all statistics.
13.1.1 The inference problem that the test addresses

To introduce the idea behind the z-test, let's use a simple example. A friend of mine, Dr Zeppo, grades his introductory statistics class on a curve. Let's suppose that the average grade in his class is 67.5, and the standard deviation is 9.5. Of his many hundreds of students, it turns out that 20 of them also take psychology classes. Out of curiosity, I find myself wondering: do the psychology students tend to get the same grades as everyone else (i.e., mean 67.5) or do they tend to score higher or lower? He emails me the zeppo.Rdata file, which I use to pull up the grades of those students,

> load( "zeppo.Rdata" )
> print( grades )
 50 60 60 64 66 66 67 69 70 74 76 76 77 79 79 79 81 82 82 89

and calculate the mean:

> mean( grades )
 72.3

Hm. It might be that the psychology students are scoring a bit higher than normal: that sample mean of X̄ = 72.3 is a fair bit higher than the hypothesised population mean of µ = 67.5, but on the other hand, a sample size of N = 20 isn't all that big. Maybe it's pure chance.

To answer the question, it helps to be able to write down what it is that I think I know. Firstly, I know that the sample mean is X̄ = 72.3. If I'm willing to assume that the psychology students have the same standard deviation as the rest of the class, then I can say that the population standard deviation is σ = 9.5. I'll also assume that since Dr Zeppo is grading to a curve, the psychology student grades are normally distributed. Next, it helps to be clear about what I want to learn from the data. In this case, my research hypothesis relates to the population mean µ for the psychology student grades, which is unknown. Specifically, I want to know if µ = 67.5 or not. Given that this is what I know, can we devise a hypothesis test to solve our problem? The data, along with the hypothesised distribution from which they are thought to arise, are shown in Figure 13.1. Not entirely obvious what the right answer is, is it? For this, we are going to need some statistics.

13.1.2 Constructing the hypothesis test

The first step in constructing a hypothesis test is to be clear about what the null and alternative hypotheses are. This isn't too hard to do. Our null hypothesis, H0, is that the true population mean µ for psychology student grades is 67.5%; and our alternative hypothesis is that the population mean isn't 67.5%. If we write this in mathematical notation, these hypotheses become,

   H0: µ = 67.5
   H1: µ ≠ 67.5

though to be honest this notation doesn't add much to our understanding of the problem; it's just a compact way of writing down what we're trying to learn from the data. The null hypothesis H0 and the alternative hypothesis H1 for our test are both illustrated in Figure 13.2. In addition to providing us with these hypotheses, the scenario outlined above provides us with a fair amount of background knowledge that might be useful. Specifically, there are two special pieces of information that we can add:

1. The psychology grades are normally distributed.
[Figure 13.1: The theoretical distribution (solid line) from which the psychology student grades (grey bars) are supposed to have been generated.]

2. The true standard deviation of these scores is known to be 9.5.

For the moment, we'll act as if these are absolutely trustworthy facts. In real life, this kind of absolutely trustworthy background knowledge doesn't exist, and so if we want to rely on these facts we'll just have to make the assumption that these things are true. However, since these assumptions may or may not be warranted, we might need to check them. For now though, we'll keep things simple.

The next step is to figure out what would be a good choice for a diagnostic test statistic; something that would help us discriminate between H0 and H1. Given that the hypotheses all refer to the population mean µ, you'd feel pretty confident that the sample mean X̄ would be a pretty useful place to start. What we could do is look at the difference between the sample mean X̄ and the value that the null hypothesis predicts for the population mean. In our example, that would mean we calculate X̄ − 67.5. More generally, if we let µ0 refer to the value that the null hypothesis claims is our population mean, then we'd want to calculate

   X̄ − µ0

If this quantity equals or is very close to 0, things are looking good for the null hypothesis. If this quantity is a long way away from 0, then it's looking less likely that the null hypothesis is worth retaining. But how far away from zero should it be for us to reject H0? To figure that out, we need to be a bit more sneaky, and we'll need to rely on those two pieces of background knowledge that I wrote down previously: namely that the raw data are normally distributed, and that we know the value of the population standard deviation σ. If the null hypothesis is actually true, and the true mean is µ0, then these facts together mean that we know the complete population distribution of the data: a normal distribution with mean µ0 and standard deviation σ. Adopting the notation from Section 9.5, a statistician might write this as:

   X ~ Normal(µ0, σ²)

[Figure 13.2: Graphical illustration of the null and alternative hypotheses assumed by the one sample z-test (the two sided version, that is). The null and alternative hypotheses both assume that the population distribution is normal, and additionally assume that the population standard deviation is known (fixed at some value σ0). The null hypothesis (left) is that the population mean µ is equal to some specified value µ0. The alternative hypothesis is that the population mean differs from this value, µ ≠ µ0.]

Okay, if that's true, then what can we say about the distribution of X̄? Well, as we discussed earlier (see Section 10.3.3), the sampling distribution of the mean X̄ is also normal, and has mean µ. But the standard deviation of this sampling distribution se(X̄), which is called the standard error of the mean, is

   se(X̄) = σ / √N

In other words, if the null hypothesis is true then the sampling distribution of the mean can be written as follows:

   X̄ ~ Normal(µ0, se(X̄))

Now comes the trick. What we can do is convert the sample mean X̄ into a standard score (Section 5.6).
This is conventionally written as z, but for now I'm going to refer to it as z_X̄. (The reason for using this expanded notation is to help you remember that we're calculating a standardised version of a sample mean, not a standardised version of a single observation, which is what a z-score usually refers to.) When we do so, the z-score for our sample mean is

   z_X̄ = (X̄ − µ0) / se(X̄)

or, equivalently

   z_X̄ = (X̄ − µ0) / (σ / √N)

This z-score is our test statistic. The nice thing about using this as our test statistic is that, like all z-scores, it has a standard normal distribution:

   z_X̄ ~ Normal(0, 1)

(again, see Section 5.6 if you've forgotten why this is true). In other words, regardless of what scale the original data are on, the z-statistic itself always has the same interpretation: it's equal to the number of standard errors that separate the observed sample mean X̄ from the population mean µ0 predicted by the null hypothesis. Better yet, regardless of what the population parameters for the raw scores actually are, the 5% critical regions for the z-test are always the same, as illustrated in Figure 13.3.

[Figure 13.3: Rejection regions for the two-sided z-test (panel a) and the one-sided z-test (panel b).]

And what this meant, way back in the days where people did all their statistics by hand, is that someone could publish a table like this:

                        critical z value
   desired α level   two-sided test   one-sided test
        .1              1.644854         1.281552
        .05             1.959964         1.644854
        .01             2.575829         2.326348
        .001            3.290527         3.090232

which in turn meant that researchers could calculate their z-statistic by hand, and then look up the critical value in a text book. That was an incredibly handy thing to be able to do back then, but it's kind of unnecessary these days, since it's trivially easy to do it with software like R.

13.1.3 A worked example using R

Now, as I mentioned earlier, the z-test is almost never used in practice. It's so rarely used in real life that the basic installation of R doesn't have a built in function for it. However, the test is so incredibly simple that it's really easy to do one manually. Let's go back to the data from Dr Zeppo's class. Having loaded the grades data, the first thing I need to do is calculate the sample mean:

> sample.mean <- mean( grades )
> print( sample.mean )
 72.3

Then, I create variables corresponding to the known population standard deviation (σ = 9.5), the value of the population mean that the null hypothesis specifies (µ0 = 67.5), and the standard error of the mean:

> mu.null  <- 67.5
> sd.true  <- 9.5
> sem.true <- sd.true / sqrt( length( grades ) )
> print( sem.true )
 2.124265

And finally, we calculate our z-score:

> z.score <- ( sample.mean - mu.null ) / sem.true
> print( z.score )
 2.259606

At this point, we would traditionally look up the value 2.26 in our table of critical values. Our original hypothesis was two-sided (we didn't really have any theory about whether psych students would be better or worse at statistics than other students) so our hypothesis test is two-sided (or two-tailed) also. Looking at the little table that I showed earlier, we can see that 2.26 is bigger than the critical value of 1.96 that would be required to be significant at α = .05, but smaller than the value of 2.58 that would be required to be significant at a level of α = .01. Therefore, we can conclude that we have a significant effect, which we might write up by saying something like this:

   With a mean grade of 72.3 in the sample of psychology students, and assuming a true population standard deviation of 9.5, we can conclude that the psychology students have significantly different statistics scores to the class average (z = 2.26, N = 20, p < .05).
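As an aside of my own (not part of the original text), the critical values in the little table above are easy to reproduce with qnorm():

alpha <- c(.1, .05, .01, .001)
qnorm( 1 - alpha/2 )   # two-sided critical values: 1.64, 1.96, 2.58, 3.29
qnorm( 1 - alpha )     # one-sided critical values: 1.28, 1.64, 2.33, 3.09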
However, what if we want an exact p-value? Well, back in the day, the tables of critical values were huge, and so you could look up your actual z-value, and find the smallest value of α for which your data would be significant (which, as discussed earlier, is the very definition of a p-value). However, looking things up in books is tedious, and typing things into computers is awesome. So let's do it using R instead. Now, notice that the α level of a z-test (or any other test, for that matter) defines the total area "under the curve" for the critical region, right? That is, if we set α = .05 for a two-sided test, then the critical region is set up such that the area under the curve for the critical region is .05. And, for the z-test, the critical value of 1.96 is chosen that way because the area in the lower tail (i.e., below −1.96) is exactly .025 and the area in the upper tail (i.e., above 1.96) is exactly .025. So, since our observed z-statistic is 2.26, why not calculate the area under the curve below −2.26 or above 2.26? In R we can calculate this using the pnorm() function. For the upper tail:

> upper.area <- pnorm( q = z.score, lower.tail = FALSE )
> print( upper.area )
 0.01192287

The lower.tail = FALSE is me telling R to calculate the area under the curve from 2.26 and upwards. If I'd told it that lower.tail = TRUE, then R would calculate the area from 2.26 and below, and it would give me an answer of 0.9880771. Alternatively, to calculate the area from −2.26 and below, we get

> lower.area <- pnorm( q = -z.score, lower.tail = TRUE )
> print( lower.area )
 0.01192287

Thus we get our p-value:

> p.value <- lower.area + upper.area
> print( p.value )
 0.02384574

13.1.4 Assumptions of the z-test

As I've said before, all statistical tests make assumptions. Some tests make reasonable assumptions, while other tests do not. The test I've just described – the one sample z-test – makes three basic assumptions. These are:

Normality. As usually described, the z-test assumes that the true population distribution is normal. This is often pretty reasonable, and not only that, it's an assumption that we can check if we feel worried about it (see Section 13.9).

[Footnote 3: Actually this is too strong. Strictly speaking the z-test only requires that the sampling distribution of the mean be normally distributed; if the population is normal then it necessarily follows that the sampling distribution of the mean is also normal. However, as we saw when talking about the central limit theorem, it's quite possible (even commonplace) for the sampling distribution to be normal even if the population distribution itself is non-normal. However, in light of the sheer ridiculousness of the assumption that the true standard deviation is known, there really isn't much point in going into details on this front!]

Independence. The second assumption of the test is that the observations in your data set are not correlated with each other, or related to each other in some funny way. This isn't as easy to check statistically: it relies a bit on good experimental design. An obvious (and stupid) example of something that violates this assumption is a data set where you "copy" the same observation over and over again in your data file: so you end up with a massive "sample size", consisting of only one genuine observation. More realistically, you have to ask yourself if it's really plausible to imagine that each observation is a completely random sample from the population that you're interested in. In practice, this assumption is never met; but we try our best to design studies that minimise the problems of correlated data.

Known standard deviation. The third assumption of the z-test is that the true standard deviation of the population is known to the researcher. This is just stupid. In no real world data analysis problem do you know the standard deviation σ of some population, but are completely ignorant about the mean µ. In other words, this assumption is always wrong.
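Before moving on, here's a small sketch of my own (not from the original text) that bundles the calculations from the worked example into a reusable helper function:

one.sample.z <- function(x, mu0, sigma) {
  z <- ( mean(x) - mu0 ) / ( sigma / sqrt( length(x) ) )  # z-statistic
  p <- 2 * pnorm( abs(z), lower.tail = FALSE )            # two-sided p-value
  c( z = z, p = p )
}
one.sample.z( grades, mu0 = 67.5, sigma = 9.5 )  # z = 2.26, p = .024, as above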
In view of the stupidity of assuming that σ is known, let's see if we can live without it. This takes us out of the dreary domain of the z-test, and into the magical kingdom of the t-test, with unicorns and fairies and leprechauns, and um...

13.2 The one-sample t-test

[Figure 13.4: Graphical illustration of the null and alternative hypotheses assumed by the (two sided) one sample t-test. Note the similarity to the z-test (Figure 13.2). The null hypothesis is that the population mean µ is equal to some specified value µ0, and the alternative hypothesis is that it is not. Like the z-test, we assume that the data are normally distributed; but we do not assume that the population standard deviation σ is known in advance.]

After some thought, I decided that it might not be safe to assume that the psychology student grades necessarily have the same standard deviation as the other students in Dr Zeppo's class. After all, if I'm entertaining the hypothesis that they don't have the same mean, then why should I believe that they absolutely have the same standard deviation? In view of this, I should really stop assuming that I know the true value of σ. This violates the assumptions of my z-test, so in one sense I'm back to square one. However, it's not like I'm completely bereft of options. After all, I've still got my raw data, and those raw data give me an estimate of the population standard deviation:

> sd( grades )
 9.520615

In other words, while I can't say that I know that σ = 9.5, I can say that σ̂ = 9.52.

Okay, cool. The obvious thing that you might think to do is run a z-test, but using the estimated standard deviation of 9.52 instead of relying on my assumption that the true standard deviation is 9.5. So, we could just type this new number into R and out would come the answer. And you probably wouldn't be surprised to hear that this would still give us a significant result. This approach is close, but it's not quite correct. Because we are now relying on an estimate of the population standard deviation, we need to make some adjustment for the fact that we have some uncertainty about what the true population standard deviation actually is. Maybe our data are just a fluke... maybe the true population standard deviation is 11, for instance. But if that were actually true, and we ran the z-test assuming σ = 11, then the result would end up being non-significant. That's a problem, and it's one we're going to have to address.
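To make the problem concrete, here's a quick sketch of my own (not part of the original text) showing what the naive approach gives for the Dr Zeppo data, and what happens if the heavier-tailed t distribution introduced in the next section is used instead:

sd.est  <- sd( grades )                                              # 9.52, the estimated standard deviation
z.naive <- ( mean(grades) - 67.5 ) / ( sd.est / sqrt(length(grades)) )  # about 2.25
2 * pnorm( abs(z.naive), lower.tail = FALSE )                        # about .024: a little too optimistic
2 * pt( abs(z.naive), df = 19, lower.tail = FALSE )                  # about .036: what the t-test reports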
13.2.1 Introducing the t-test

This ambiguity is annoying, and it was resolved in 1908 by a guy called William Sealy Gosset (Student, 1908), who was working as a chemist for the Guinness brewery at the time (see J. F. Box, 1987). Because Guinness took a dim view of its employees publishing statistical analysis (apparently they felt it was a trade secret), he published the work under the pseudonym "A Student", and to this day, the full name of the t-test is actually Student's t-test.

[Figure 13.5: The t distribution with 2 degrees of freedom (left) and 10 degrees of freedom (right), with a standard normal distribution (i.e., mean 0 and std dev 1) plotted as dotted lines for comparison purposes. Notice that the t distribution has heavier tails (higher kurtosis) than the normal distribution; this effect is quite exaggerated when the degrees of freedom are very small, but negligible for larger values. In other words, for large df the t distribution is essentially identical to a normal distribution.]

The key thing that Gosset figured out is how we should accommodate the fact that we aren't completely sure what the true standard deviation is. The answer is that it subtly changes the sampling distribution. In the t-test, our test statistic (now called a t-statistic) is calculated in exactly the same way I mentioned above. If our null hypothesis is that the true mean is µ, but our sample has mean X̄ and our estimate of the population standard deviation is σ̂, then our t statistic is:

   t = (X̄ − µ) / (σ̂ / √N)

The only thing that has changed in the equation is that instead of using the known true value σ, we use the estimate σ̂. And if this estimate has been constructed from N observations, then the sampling distribution turns into a t-distribution with N − 1 degrees of freedom (df). The t distribution is very similar to the normal distribution, but has "heavier" tails, as discussed earlier in Section 9.6 and illustrated in Figure 13.5. Notice, though, that as df gets larger, the t-distribution starts to look identical to the standard normal distribution. This is as it should be: if you have a sample size of N = 70,000,000 then your "estimate" of the standard deviation would be pretty much perfect, right? So, you should expect that for large N, the t-test would behave exactly the same way as a z-test. And that's exactly what happens!

13.2.2 Doing the test in R

As you might expect, the mechanics of the t-test are almost identical to the mechanics of the z-test. So there's not much point in going through the tedious exercise of showing you how to do the calculations using low level commands: it's pretty much identical to the calculations that we did earlier, except that we use the estimated standard deviation (i.e., something like se.est <- sd(grades) / sqrt(N)) and the t distribution rather than the normal distribution. So I'll jump straight to the oneSampleTTest() function in the lsr package:

> oneSampleTTest( x = grades, mu = 67.5 )

Easy enough. Now let's go through the output. Just like we saw in the last chapter, I've written the functions so that the output is pretty verbose. It tries to describe in a lot of detail what it's actually done:

   One sample t-test

   Data variable: grades

   Descriptive statistics:
               grades
      mean     72.300
      std dev.  9.521
13.2.2 Doing the test in R

As you might expect, the mechanics of the t-test are almost identical to the mechanics of the z-test. So there's not much point in going through the tedious exercise of showing you how to do the calculations using low level commands: it's pretty much identical to the calculations that we did earlier, except that we use the estimated standard deviation (i.e., something like se.est) in place of the true one. Instead, I'll go straight to the function I've written for this purpose:

> oneSampleTTest( x=grades, mu=67.5 )

Easy enough. Now let's go through the output. Just like we saw in the last chapter, I've written the functions so that the output is pretty verbose. It tries to describe in a lot of detail what it's actually done:

   One sample t-test

   Data variable: grades

   Descriptive statistics:
                grades
      mean      72.300
      std dev.   9.521

   Hypotheses:
      null:        population mean equals 67.5
      alternative: population mean not equal to 67.5

   Test results:
      t-statistic: 2.255
      degrees of freedom: 19
      p-value: 0.036

   Other information:
      two-sided 95% confidence interval: [67.844, 76.756]
      estimated effect size (Cohen's d): 0.504

Reading this output from top to bottom, you can see it's trying to lead you through the data analysis process. The first two lines tell you what kind of test was run and what data were used. It then gives you some basic information about the sample: specifically, the sample mean and standard deviation of the data. It then moves towards the inferential statistics part. It starts by telling you what the null and alternative hypotheses were, and then it reports the results of the test: the t-statistic, the degrees of freedom, and the p-value. Finally, it reports two other things you might care about: the confidence interval for the mean, and a measure of effect size (we'll talk more about effect sizes later).

So that seems straightforward enough. Now what do we do with this output? Well, since we're pretending that we actually care about my toy example, we're overjoyed to discover that the result is statistically significant (i.e. the p-value is below .05). We could report the result by saying something like this:

   With a mean grade of 72.3, the psychology students scored slightly higher than the average grade of 67.5 (t(19) = 2.25, p < .05); the 95% confidence interval is [67.8, 76.8].

where t(19) is shorthand notation for a t-statistic that has 19 degrees of freedom. That said, it's often the case that people don't report the confidence interval, or do so using a much more compressed form than I've done here. For instance, it's not uncommon to see the confidence interval included as part of the stat block, like this:

   t(19) = 2.25, p < .05, CI95 = [67.8, 76.8]

With that much jargon crammed into half a line, you know it must be really smart.[5]

13.2.3 Assumptions of the one sample t-test

Okay, so what assumptions does the one-sample t-test make? Well, since the t-test is basically a z-test with the assumption of known standard deviation removed, you shouldn't be surprised to see that it makes the same assumptions as the z-test, minus the one about the known standard deviation. That is:

- Normality. We're still assuming that the population distribution is normal[6], and as noted earlier, there are standard tools that you can use to check whether this assumption is met (Section 13.9; a rough check is also sketched below), and other tests you can do in its place if this assumption is violated (Section 13.10).
- Independence. Once again, we have to assume that the observations in our sample are generated independently of one another. See the earlier discussion about the z-test for specifics (Section 13.1.4).

Overall, these two assumptions aren't terribly unreasonable, and as a consequence the one-sample t-test is pretty widely used in practice as a way of comparing a sample mean against a hypothesised population mean.
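The chapter defers the details of checking these assumptions to Section 13.9, but as a rough first look at the normality assumption you can use a couple of standard base R tools. This is just my sketch for the running example, not the procedure recommended later in the chapter.

# Quick-and-dirty normality checks for the grades data.
qqnorm( grades )         # points should lie near a straight line if the data are normal
qqline( grades )         # adds a reference line to the normal QQ plot
shapiro.test( grades )   # Shapiro-Wilk test: a small p-value suggests non-normality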
13.3 The independent samples t-test (Student test)

Although the one sample t-test has its uses, it's not the most typical example of a t-test.[7] A much more common situation arises when you've got two different groups of observations. In psychology, this tends to correspond to two different groups of participants, where each group corresponds to a different condition in your study. For each person in the study, you measure some outcome variable of interest, and the research question that you're asking is whether or not the two groups have the same population mean. This is the situation that the independent samples t-test is designed for.

[5] More seriously, I tend to think the reverse is true: I get very suspicious of technical reports that fill their results sections with nothing except the numbers. It might just be that I'm an arrogant jerk, but I often feel like an author that makes no attempt to explain and interpret their analysis to the reader either doesn't understand it themselves, or is being a bit lazy. Your readers are smart, but not infinitely patient. Don't annoy them if you can help it.
[6] A technical comment... in the same way that we can weaken the assumptions of the z-test so that we're only talking about the sampling distribution, we can weaken the t-test assumptions so that we don't have to assume normality of the population. However, for the t-test, it's trickier to do this. As before, we can replace the assumption of population normality with an assumption that the sampling distribution of X̄ is normal. However, remember that we're also relying on a sample estimate of the standard deviation, and so we also require the sampling distribution of σ̂ to be chi-square. That makes things nastier, and this version is rarely used in practice: fortunately, if the population is normal, then both of these two assumptions are met.
[7] Although it is the simplest, which is why I started with it.

13.3.1 The data

Suppose we have 33 students taking Dr Harpo's statistics lectures, and Dr Harpo doesn't grade to a curve. Actually, Dr Harpo's grading is a bit of a mystery, so we don't really know anything about what the average grade is for the class as a whole. There are two tutors for the class, Anastasia and Bernadette. There are N1 = 15 students in Anastasia's tutorials, and N2 = 18 in Bernadette's tutorials. The research question I'm interested in is whether Anastasia or Bernadette is a better tutor, or if it doesn't make much of a difference. Dr Harpo emails me the course grades, in the harpo.Rdata file. As usual, I'll load the file and have a look at what variables it contains:

> load( "harpo.Rdata" )
> who(TRUE)
   -- Name --   -- Class --   -- Size --
   harpo        data.frame    33 x 2
    $grade      numeric       33
    $tutor      factor        33

As we can see, there's a single data frame with two variables, grade and tutor. The grade variable is a numeric vector, containing the grades for all N = 33 students taking Dr Harpo's class; the tutor variable is a factor that indicates who each student's tutor was. The first six observations in this data set are shown below:

> head( harpo )
  grade      tutor
1    65  Anastasia
2    72 Bernadette
3    66 Bernadette
4    74  Anastasia
5    73  Anastasia
6    71 Bernadette

We can calculate means and standard deviations, using the mean() and sd() functions. Rather than show the R output, here's a nice little summary table:

                           mean   std dev    N
   Anastasia's students    74.53     9.00   15
   Bernadette's students   69.06     5.77   18

To give you a more detailed sense of what's going on here, I've plotted histograms showing the distribution of grades for both tutors (Figure 13.6), as well as a simpler plot showing the means and corresponding confidence intervals for both groups of students (Figure 13.7).
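If you want to reproduce that summary table yourself, base R will do it in a few lines. This is a minimal sketch assuming the harpo data frame loaded above; the printed layout will differ from the table.

# Group sizes, means and standard deviations for the harpo data.
table( harpo$tutor )                        # N per tutor: 15 and 18
tapply( harpo$grade, harpo$tutor, mean )    # group means, roughly 74.5 and 69.1
tapply( harpo$grade, harpo$tutor, sd )      # group standard deviations, roughly 9.0 and 5.8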
13.3.2 Introducing the test

The independent samples t-test comes in two different forms, Student's and Welch's. The original Student t-test – which is the one I'll describe in this section – is the simpler of the two, but relies on much more restrictive assumptions than the Welch t-test. Assuming for the moment that you want to run a two-sided test, the goal is to determine whether two "independent samples" of data are drawn from populations with the same mean (the null hypothesis) or different means (the alternative hypothesis). When we say "independent" samples, what we really mean here is that there's no special relationship between observations in the two samples. This probably doesn't make a lot of sense right now, but it will be clearer when we come to talk about the paired samples t-test later on. For now, let's just point out that if we have an experimental design where participants are randomly allocated to one of two groups, and we want to compare the two groups' mean performance on some outcome measure, then an independent samples t-test (rather than a paired samples t-test) is what we're after.

[Figure 13.6: Histograms showing the overall distribution of grades for students in Anastasia's class (panel a) and in Bernadette's class (panel b). Inspection of these histograms suggests that the students in Anastasia's class may be getting slightly better grades on average, though they also seem a little more variable.]

Okay, so let's let µ1 denote the true population mean for group 1 (e.g., Anastasia's students), and µ2 will be the true population mean for group 2 (e.g., Bernadette's students),[8] and as usual we'll let X̄1 and X̄2 denote the observed sample means for both of these groups. Our null hypothesis states that the two population means are identical (µ1 = µ2) and the alternative to this is that they are not (µ1 ≠ µ2). Written in mathematical-ese, this is:

   H0: µ1 = µ2
   H1: µ1 ≠ µ2

[8] A funny question almost always pops up at this point: what the heck is the population being referred to in this case? Is it the set of students actually taking Dr Harpo's class (all 33 of them)? The set of people who might take the class (an unknown number of them)? Or something else? Does it matter which of these we pick? It's traditional in an introductory behavioural stats class to mumble a lot at this point, but since I get asked this question every year by my students, I'll give a brief answer. Technically yes, it does matter: if you change your definition of what the "real world" population actually is, then the sampling distribution of your observed mean X̄ changes too. The t-test relies on an assumption that the observations are sampled at random from an infinitely large population; and to the extent that real life isn't like that, then the t-test can be wrong. In practice, however, this isn't usually a big deal: even though the assumption is almost always wrong, it doesn't lead to a lot of pathological behaviour from the test, so we tend to just ignore it.

To construct a hypothesis test that handles this scenario, we start by noting that if the null hypothesis is true, then the difference between the population means is exactly zero, µ1 − µ2 = 0. As a consequence, a diagnostic test statistic will be based on the difference between the two sample means, because if the null hypothesis is true we'd expect X̄1 − X̄2 to be pretty close to zero. However, just like we saw with our one-sample tests (i.e., the one-sample z-test and the one-sample t-test), we have to be precise about exactly how close to zero this difference should be.
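For the Harpo data, the observed difference is easy to compute, which makes the question concrete: is a gap of roughly five and a half marks "close to zero" or not? A one-line sketch (object name mine):

# Observed difference between the two sample means, X1-bar minus X2-bar.
group.means <- tapply( harpo$grade, harpo$tutor, mean )
group.means[ "Anastasia" ] - group.means[ "Bernadette" ]   # about 5.48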
[Figure 13.7: Plots showing the mean grade for the students in Anastasia's and Bernadette's tutorials. Error bars depict 95% confidence intervals around the mean. On the basis of visual inspection, it does look like there's a real difference between the groups, though it's hard to say for sure.]

And the solution to the problem is more or less the same one: we calculate a standard error estimate (SE), just like last time, and then divide the difference between means by this estimate. So our t-statistic will be of the form

   t = (X̄1 − X̄2) / SE

We just need to figure out what this standard error estimate actually is. This is a bit trickier than was the case for either of the two tests we've looked at so far, so we need to go through it a lot more carefully to understand how it works.

13.3.3 A "pooled estimate" of the standard deviation

In the original "Student t-test", we make the assumption that the two groups have the same population standard deviation: that is, regardless of whether the population means are the same, we assume that the population standard deviations are identical, σ1 = σ2. Since we're assuming that the two standard deviations are the same, we drop the subscripts and refer to both of them as σ. How should we estimate this? How should we construct a single estimate of a standard deviation when we have two samples? The answer is, basically, we average them. Well, sort of. Actually, what we do is take a weighted average of the variance estimates, which we use as our pooled estimate of the variance. The weight assigned to each sample is equal to the number of observations in that sample, minus 1. Mathematically, we can write this as

   w1 = N1 − 1
   w2 = N2 − 1

[Figure 13.8: Graphical illustration of the null and alternative hypotheses assumed by the Student t-test. The null hypothesis assumes that both groups have the same mean µ, whereas the alternative assumes that they have different means µ1 and µ2. Notice that it is assumed that the population distributions are normal, and that, although the alternative hypothesis allows the groups to have different means, it assumes they have the same standard deviation.]

Now that we've assigned weights to each sample, we calculate the pooled estimate of the variance by taking the weighted average of the two variance estimates, σ̂1² and σ̂2²:

   σ̂p² = (w1 σ̂1² + w2 σ̂2²) / (w1 + w2)

Finally, we convert the pooled variance estimate to a pooled standard deviation estimate, by taking the square root. This gives us the following formula for σ̂p:

   σ̂p = √[ (w1 σ̂1² + w2 σ̂2²) / (w1 + w2) ]

And if you mentally substitute w1 = N1 − 1 and w2 = N2 − 1 into this equation you get a very ugly looking formula; a very ugly formula that actually seems to be the "standard" way of describing the pooled standard deviation estimate. It's not my favourite way of thinking about pooled standard deviations, however.[9]

[9] Yes, I have a "favourite" way of thinking about pooled standard deviation estimates. So what?
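Translating the weighted-average formula into R for the Harpo data gives a pooled estimate of roughly 7.4 marks. This is just my sketch of the calculation above, not a function from the book.

# Pooled standard deviation for the harpo data, via the weighted average of variances.
v <- tapply( harpo$grade, harpo$tutor, var )          # the two sample variances
w <- tapply( harpo$grade, harpo$tutor, length ) - 1   # the weights, N1-1 and N2-1
sd.pooled <- sqrt( sum( w * v ) / sum( w ) )          # about 7.4
sd.pooled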
13.3.4 The same pooled estimate, described differently

I prefer to think about it like this. Our data set actually corresponds to a set of N observations which are sorted into two groups. So let's use the notation Xik to refer to the grade received by the i-th student in the k-th tutorial group: that is, X11 is the grade received by the first student in Anastasia's class, X21 is her second student, and so on. And we have two separate group means X̄1 and X̄2, which we could "generically" refer to using the notation X̄k, i.e., the mean grade for the k-th tutorial group. So far, so good. Now, since every single student falls into one of the two tutorials, we can describe their deviation from the group mean as the difference

   Xik − X̄k

So why not just use these deviations (i.e., the extent to which each student's grade differs from the mean grade in their tutorial)? Remember, a variance is just the average of a bunch of squared deviations, so let's do that. Mathematically, we could write it like this:

   ∑ik (Xik − X̄k)² / N

where the notation "∑ik" is a lazy way of saying "calculate a sum by looking at all students in all tutorials", since each "ik" corresponds to one student.[10] But, as we saw in Chapter 10, calculating the variance by dividing by N produces a biased estimate of the population variance. And previously, we needed to divide by N − 1 to fix this. However, as I mentioned at the time, the reason why this bias exists is because the variance estimate relies on the sample mean; and to the extent that the sample mean isn't equal to the population mean, it can systematically bias our estimate of the variance. But this time we're relying on two sample means! Does this mean that we've got more bias? Yes, yes it does. And does this mean we now need to divide by N − 2 instead of N − 1, in order to calculate our pooled variance estimate? Why, yes...

   σ̂p² = ∑ik (Xik − X̄k)² / (N − 2)

Oh, and if you take the square root of this then you get σ̂p, the pooled standard deviation estimate. In other words, the pooled standard deviation calculation is nothing special: it's not terribly different to the regular standard deviation calculation.

[10] A more correct notation will be introduced in Chapter 14.

13.3.5 Completing the test

Regardless of which way you want to think about it, we now have our pooled estimate of the standard deviation. From now on, I'll drop the silly p subscript, and just refer to this estimate as σ̂. Great. Let's now go back to thinking about the bloody hypothesis test, shall we? Our whole reason for calculating this pooled estimate was that we knew it would be helpful when calculating our standard error estimate. But, standard error of what? In the one-sample t-test, it was the standard error of the sample mean, se(X̄), and since se(X̄) = σ/√N, that's what the denominator of our t-statistic looked like. This time around, however, we have two sample means. And what we're interested in, specifically, is the difference between the two, X̄1 − X̄2. As a consequence, the standard error that we need to divide by is in fact the standard error of the difference between means. As long as the two variables really do have the same standard deviation, then our estimate for the standard error is

   se(X̄1 − X̄2) = σ̂ √(1/N1 + 1/N2)

and our t-statistic is therefore

   t = (X̄1 − X̄2) / se(X̄1 − X̄2)

Just as we saw with our one-sample test, the sampling distribution of this t-statistic is a t-distribution (shocking, isn't it?) as long as the null hypothesis is true, and all of the assumptions of the test are met. The degrees of freedom, however, are slightly different. As usual, we can think of the degrees of freedom as being equal to the number of data points minus the number of constraints. In this case, we have N observations (N1 in sample 1, and N2 in sample 2), and 2 constraints (the two sample means). So the total degrees of freedom for this test are N − 2.
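Putting these pieces together for the Harpo data reproduces the numbers that the function in the next section will report. A minimal hand calculation (object names mine); note that dividing by sum(n) − 2 here is the same thing as dividing by the combined weights w1 + w2 = N − 2 from the previous section.

# Student independent samples t-test "by hand" for the harpo data.
xbar <- tapply( harpo$grade, harpo$tutor, mean )       # group means
v    <- tapply( harpo$grade, harpo$tutor, var )        # group variances
n    <- tapply( harpo$grade, harpo$tutor, length )     # group sizes: 15 and 18

sd.pooled <- sqrt( sum( (n-1) * v ) / (sum(n) - 2) )   # pooled sd, about 7.4
se.diff   <- sd.pooled * sqrt( 1/n[1] + 1/n[2] )       # se of the difference in means
t.stat    <- unname( (xbar[1] - xbar[2]) / se.diff )   # about 2.12
df        <- sum(n) - 2                                # 31 degrees of freedom
p.val     <- 2 * pt( -abs(t.stat), df = df )           # about .043
c( t = t.stat, df = df, p = p.val )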
13.3.6 Doing the test in R

Not surprisingly, you can run an independent samples t-test using the t.test() function (Section 13.7), but once again I'm going to start with a somewhat simpler function in the lsr package. That function is unimaginatively called independentSamplesTTest(). First, recall that our data look like this:

> head( harpo )
  grade      tutor
1    65  Anastasia
2    72 Bernadette
3    66 Bernadette
4    74  Anastasia
5    73  Anastasia
6    71 Bernadette

The outcome variable for our test is the student grade, and the groups are defined in terms of the tutor for each class. So you probably won't be too surprised to see that we're going to describe the test that we want in terms of an R formula that reads grade ~ tutor. The specific command that we need is:

> independentSamplesTTest(
      formula = grade ~ tutor,  # formula specifying outcome and group variables
      data = harpo,             # data frame that contains the variables
      var.equal = TRUE          # assume that the two groups have the same variance
  )

The first two arguments should be familiar to you. The first one is the formula that tells R what variables to use and the second one tells R the name of the data frame that stores those variables. The third argument is not so obvious. By saying var.equal = TRUE, what we're really doing is telling R to use the Student independent samples t-test. More on this later. For now, let's ignore that bit and look at the output:

   Student's independent samples t-test

   Outcome variable: grade
   Grouping variable: tutor

   Descriptive statistics:
                Anastasia  Bernadette
      mean         74.533      69.056
      std dev.      8.999       5.775

   Hypotheses:
      null:        population means equal for both groups
      alternative: different population means in each group

   Test results:
      t-statistic: 2.115
      degrees of freedom: 31
      p-value: 0.043

   Other information:
      two-sided 95% confidence interval: [0.197, 10.759]
      estimated effect size (Cohen's d): 0.74

The output has a very familiar form. First, it tells you what test was run, and it tells you the names of the variables that you used. The second part of the output reports the sample means and standard deviations for both groups (i.e., both tutorial groups). The third section of the output states the null hypothesis and the alternative hypothesis in a fairly explicit form. It then reports the test results: just like last time, the test results consist of a t-statistic, the degrees of freedom, and the p-value. The final section reports two things: it gives you a confidence interval, and an effect size. I'll talk about effect sizes later.
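For comparison, the base R t.test() function mentioned at the start of this section can run the same analysis; the output is formatted differently, but the t-statistic, degrees of freedom and p-value are identical. A minimal sketch:

# Base R equivalent of the Student test above (var.equal = TRUE selects the Student version).
t.test( grade ~ tutor, data = harpo, var.equal = TRUE )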
The confidence interval, however, I should talk about now. It's pretty important to be clear on what this confidence interval actually refers to: it is a confidence interval for the difference between the group means. In our example, Anastasia's students had an average grade of 74.5, and Bernadette's students had an average grade of 69.1, so the difference between the two sample means is 5.4. But of course the difference between population means might be bigger or smaller than this. The confidence interval reported by the independentSamplesTTest() function is a 95% confidence interval for this difference: it runs from 0.2 to 10.8. In any case, the difference between the two groups is significant (just barely), so we might write up the result using text like this:

   The mean grade in Anastasia's class was 74.5% (std dev = 9.0), whereas the mean in Bernadette's class was 69.1% (std dev = 5.8). A Student's independent samples t-test showed that this 5.4% difference was significant (t(31) = 2.1, p < .05, CI95 = [0.2, 10.8], d = .74), suggesting that a genuine difference in learning outcomes has occurred.

Notice that I've included the confidence interval and the effect size in the stat block. People don't always do this. At a bare minimum, you'd expect to see the t-statistic, the degrees of freedom and the p-value. So you should include something like this at a minimum: t(31) = 2.1, p < .05. If statisticians had their way, everyone would also report the confidence interval and probably the effect size measure too, because they are useful things to know. But real life doesn't always work the way statisticians want it to: you should make a judgment based on whether you think it will help your readers, and (if you're writing a scientific paper) the editorial standard for the journal in question. Some journals expect you to report effect sizes, others don't. Within some scientific communities it is standard practice to report confidence intervals, in others it is not. You'll need to figure out what your audience expects. But, just for the sake of clarity, if you're taking my class: my default position is that it's usually worth including the effect size, but don't worry about the confidence interval unless the assignment asks you to or implies that you should.

13.3.7 Positive and negative t values

Before moving on to talk about the assumptions of the t-test, there's one additional point I want to make about the use of t-tests in practice. It relates to the sign of the t-statistic (that is, whether it is a positive number or a negative one). One very common worry that students have when they start running their first t-test is that they often end up with negative values for the t-statistic, and don't know how to interpret them. In fact, it's not at all uncommon for two people working independently to end up with R outputs that are almost identical, except that one person has a negative t value and the other one has a positive t value. Assuming that you're running a two-sided test, then the p-values will be identical. On closer inspection, the students will notice that the confidence intervals also have the opposite signs. This is perfectly okay: whenever this happens, what you'll find is that the two versions of the R output arise from slightly different ways of running the t-test. What's happening here is very simple. The t-statistic that R is calculating here is always of the form

   t = ( (mean 1) − (mean 2) ) / SE

If "mean 1" is larger than "mean 2" the t-statistic will be positive, whereas if "mean 2" is larger then the t-statistic will be negative. Similarly, the confidence interval that R reports is the confidence interval for the difference "(mean 1) minus (mean 2)", which will be the reverse of what you'd get if you were calculating the confidence interval for the difference "(mean 2) minus (mean 1)".
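One way to see this in action with base R's t.test(): the order of the grouping factor's levels determines which group gets treated as "mean 1", so reversing the level order flips the sign of t (and of the confidence interval) while leaving the two-sided p-value unchanged. This is my illustration, not an example from the chapter; relevel() is a base R function.

# Flipping the sign of the t-statistic by reordering the factor levels.
harpo2       <- harpo
harpo2$tutor <- relevel( harpo2$tutor, ref = "Bernadette" )          # Bernadette becomes "group 1"
t.test( grade ~ tutor, data = harpo,  var.equal = TRUE )$statistic   # about +2.12
t.test( grade ~ tutor, data = harpo2, var.equal = TRUE )$statistic   # about -2.12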
Okay, that's pretty straightforward when you think about it, but now consider our t-test comparing Anastasia's class to Bernadette's class. Which one should we call "mean 1" and which one should we call "mean 2"? It's arbitrary. However, you really do need to designate one of them as "mean 1" and the other one as "mean 2". Not surprisingly, the way that R handles this is also pretty arbitrary. In earlier versions of the book I used to try to explain it, but after a while I gave up, because it's not really all that important, and to be honest I can never remember myself. Whenever I get a significant t-test result, and I want to figure out which mean is the larger one, I don't try to figure it out by looking at the t-statistic. Why would I bother doing that? It's foolish. It's easier to just look at the actual group means, since the R output actually shows them!

Here's the important thing. Because it really doesn't matter what R printed out, I usually try to report the t-statistic in such a way that the numbers match up with the text. Here's what I mean... suppose that what I want to write in my report is "Anastasia's class had higher grades than Bernadette's class". The phrasing here implies that Anastasia's group comes first, so it makes sense to report the t-statistic as if Anastasia's class corresponded to group 1. If so, I would write

   Anastasia's class had higher grades than Bernadette's class (t(31) = 2.1, p = .04).

(I wouldn't actually underline the word "higher" in real life, I'm just doing it to emphasise the point that "higher" corresponds to positive t values.) On the other hand, suppose the phrasing I wanted to use has Bernadette's class listed first. If so, it makes more sense to treat her class as group 1, and the write up looks like this:

   Bernadette's class had lower grades than Anastasia's class (t(31) = −2.1, p = .04).

Because I'm talking about one group having "lower" scores this time around, it is more sensible to use the negative form of the t-statistic. It just makes it read more cleanly. One last thing: please note that you can't do this for other types of test statistics. It works for t-tests, but it wouldn't be meaningful for chi-square tests, F-tests, or indeed for most of the tests I talk about in this book. So don't overgeneralise this advice! I'm really just talking about t-tests here and nothing else!

13.3.8 Assumptions of the test

As always, our hypothesis test relies on some assumptions. So what are they? For the Student t-test there are three assumptions: the two we have already met for the one-sample t-test (normality and independence), plus the homogeneity of variance assumption built into the pooled estimate, namely that the two groups have the same population standard deviation.
