Navarro - Foundations Readings 2 PDF

Dan Navarro


Summary

This document provides an introduction to R programming and script writing. It explains how to create basic R scripts and how to run them using different tools, including the RStudio script editor, and covers script commenting and the differences between script and command-line execution. It then turns to sampling theory: sampling distributions and the central limit theorem (illustrated with a simulated IQ experiment), estimation of population means and standard deviations, confidence intervals, and the distinction between research hypotheses and statistical hypotheses.

Full Transcript


Figure 8.1: A screenshot showing the hello.R script if you open it using the default text editor (TextEdit) on a Mac. Using a simple text editor like TextEdit on a Mac or Notepad on Windows isn't actually the best way to write your scripts, but it is the simplest. More to the point, it highlights the fact that a script really is just an ordinary text file.

The line at the top is the filename, and not part of the script itself. Below that, you can see the two R commands that make up the script itself. Next to each command I've included the line numbers. You don't actually type these into your script, but a lot of text editors (including the one built into Rstudio that I'll show you in a moment) will show line numbers, since it's a very useful convention that allows you to say things like "line 1 of the script creates a new variable, and line 2 prints it out".

So how do we run the script? Assuming that the hello.R file has been saved to your working directory, you can run the script using the following command:

> source( "hello.R" )

If the script file is saved in a different directory, then you need to specify the path to the file, in exactly the same way that you would have to when loading a data file using load(). In any case, when you type this command, R opens up the script file: it then reads each command in the file in the same order that they appear in the file, and executes those commands in that order. The simple script that I've shown above contains two commands. The first one creates a variable x and the second one prints it on screen. So, when we run the script, this is what we see on screen:

> source( "hello.R" )
"hello world"

If we inspect the workspace using a command like who() or objects(), we discover that R has created the new variable x within the workspace, and not surprisingly x is a character string containing the text "hello world". And just like that, you've written your first R program. It really is that simple.

8.1.3 Using Rstudio to write scripts

In the example above I assumed that you were writing your scripts using a simple text editor. However, it's usually more convenient to use a text editor that is specifically designed to help you write scripts. There's a lot of these out there, and experienced programmers will all have their own personal favourites. For our purposes, however, we can just use the one built into Rstudio.

Figure 8.2: A screenshot showing the hello.R script open in Rstudio. Assuming that you're looking at this document in colour, you'll notice that the "hello world" text is shown in green. This isn't something that you do yourself: that's Rstudio being helpful. Because the text editor in Rstudio "knows" something about how R commands work, it will highlight different parts of your script in different colours. This is useful, but it's not actually part of the script itself.

To create a new script file in Rstudio, go to the "File" menu, select the "New" option, and then click on "R script". This will open a new window within the "source" panel. Then you can type the commands you want (or code, as it is generally called when you're typing the commands into a script file) and save it when you're done.
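Since the screenshot in Figure 8.1 isn't reproduced in this transcript, the following is a minimal sketch of what a two-command script of this kind looks like; the exact wording of the book's hello.R may differ slightly, but the structure is the same:

hello.R
1  x <- "hello world"   # line 1: create a new variable called x
2  print( x )           # line 2: print the contents of x on screen

Saving these two lines as hello.R in the working directory and typing source( "hello.R" ) at the console should produce the "hello world" output shown above.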
The nice thing about using Rstudio to do this is that it automatically changes the colour of the text to indicate which parts of the code are comments and which parts are actual R commands (these colours are called syntax highlighting, but they're not actually part of the file – it's just Rstudio trying to be helpful). To see an example of this, let's open up our hello.R script in Rstudio. To do this, go to the "File" menu again, and select "Open...". Once you've opened the file, you should be looking at something like Figure 8.2. As you can see (if you're looking at this book in colour) the character string "hello world" is highlighted in green.

Using Rstudio for your text editor is convenient for other reasons too. Notice in the top right hand corner of Figure 8.2 there's a little button that reads "Source"? If you click on that, Rstudio will construct the relevant source() command for you, and send it straight to the R console. So you don't even have to type in the source() command, which actually I think is a great thing, because it really bugs me having to type all those extra keystrokes every time I want to run my script. Anyway, Rstudio provides several other convenient little tools to help make scripting easier, but I won't discuss them here.[1]

8.1.4 Commenting your script

When writing up your data analysis as a script, one thing that is generally a good idea is to include a lot of comments in the code. That way, if someone else tries to read it (or if you come back to it several days, weeks, months or years later) they can figure out what's going on. As a beginner, I think it's especially useful to comment thoroughly, partly because it gets you into the habit of commenting the code, and partly because the simple act of typing in an explanation of what the code does will help you keep it clear in your own mind what you're trying to achieve. To illustrate this idea, consider the following script:

itngscript.R
 1  # A script to analyse nightgarden.Rdata
 2  # author: Dan Navarro
 3  # date: 22/11/2011
 4
 5  # Load the data, and tell the user that this is what we're
 6  # doing. Note that this assumes the nightgarden data file
 7  # is in the working directory.
 8  cat( "loading data from nightgarden.Rdata...\n" )
 9  load( "nightgarden.Rdata" )
10
11  # Create a cross tabulation and print it out:
12  cat( "tabulating data...\n" )
13  itng.table <- table( speaker, utterance )
14  print( itng.table )

When we run this script, this is what we see:

> source( "itngscript.R" )
loading data from nightgarden.Rdata...
tabulating data...
             utterance
speaker       ee onk oo pip
  makka-pakka  0   2  0   2
  tombliboo    1   0  1   0
  upsy-daisy   0   2  0   2

Even here, notice that the script announces its behaviour. The first two lines of the output tell us a lot about what the script is actually doing behind the scenes (the code to do this corresponds to the two cat() commands on lines 8 and 12 of the script). It's usually a pretty good idea to do this, since it helps ensure that the output makes sense when the script is executed.

[1] Okay, I lied. Sue me. One of the coolest features of Rstudio is the support for R Markdown, which lets you embed R code inside a Markdown document, and you can automatically publish your R Markdown to the web on Rstudio's servers. If you're the kind of nerd interested in this sort of thing, it's really nice. And, yes, since I'm also that kind of nerd, of course I'm aware that iPython notebooks do the same thing and that R just nicked their idea. So what? It's still cool. And anyway, this book isn't called Learning Statistics with Python now, is it? Hm. Maybe I should write a Python version...
8.1.5 Differences between scripts and the command line

For the most part, commands that you insert into a script behave in exactly the same way as they would if you typed the same thing in at the command line. The one major exception to this is that if you want a variable to be printed on screen, you need to explicitly tell R to print it. You can't just type the name of the variable. For example, our original hello.R script produced visible output. The following script does not:

silenthello.R
1  x <- "hello world"

[...]

> IQ.1 <- round( rnorm( n=5, mean=100, sd=15 ) )
> IQ.1
 90  82  94  99 110

The mean IQ in this sample turns out to be exactly 95. Not surprisingly, this is much less accurate than the previous experiment. Now imagine that I decided to replicate the experiment. That is, I repeat the procedure as closely as possible: I randomly sample 5 new people and measure their IQ. Again, R allows me to simulate the results of this procedure:

> IQ.2 <- round( rnorm( n=5, mean=100, sd=15 ) )
> IQ.2
 78  88 111 111 117

This time around, the mean IQ in my sample is 101. If I repeat the experiment 10 times I obtain the results shown in Table 10.1, and as you can see the sample mean varies from one replication to the next.

Table 10.1: Ten replications of the IQ experiment, each with a sample size of N = 5.

                Person 1  Person 2  Person 3  Person 4  Person 5  Sample Mean
Replication 1       90        82        94        99       110        95.0
Replication 2       78        88       111       111       117       101.0
Replication 3      111       122        91        98        86       101.6
Replication 4       98        96       119        99       107       103.8
Replication 5      105       113       103       103        98       104.4
Replication 6       81        89        93        85       114        92.4
Replication 7      100        93       108        98       133       106.4
Replication 8      107       100       105       117        85       102.8
Replication 9       86       119       108        73       116       100.4
Replication 10      95       126       112       120        76       105.8

Now suppose that I decided to keep going in this fashion, replicating this "five IQ scores" experiment over and over again. Every time I replicate the experiment I write down the sample mean. Over time, I'd be amassing a new data set, in which every experiment generates a single data point. The first 10 observations from my data set are the sample means listed in Table 10.1, so my data set starts out like this:

95.0  101.0  101.6  103.8  104.4 ...

What if I continued like this for 10,000 replications, and then drew a histogram? Using the magical powers of R that's exactly what I did, and you can see the results in Figure 10.5. As this picture illustrates, the average of 5 IQ scores is usually between 90 and 110. But more importantly, what it highlights is that if we replicate an experiment over and over again, what we end up with is a distribution of sample means! This distribution has a special name in statistics: it's called the sampling distribution of the mean. Sampling distributions are another important theoretical idea in statistics, and they're crucial for understanding the behaviour of small samples. For instance, when I ran the very first "five IQ scores" experiment, the sample mean turned out to be 95.

[3] Technically, the law of large numbers pertains to any sample statistic that can be described as an average of independent quantities. That's certainly true for the sample mean. However, it's also possible to write many other sample statistics as averages of one form or another. The variance of a sample, for instance, can be rewritten as a kind of average and so is subject to the law of large numbers. The minimum value of a sample, however, cannot be written as an average of anything and is therefore not governed by the law of large numbers.
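As a rough sketch of the replication procedure just described (not the book's actual code), the whole "repeat the five IQ scores experiment and record the mean" exercise can be simulated in a few lines of R. The function name one.experiment is just for illustration, and IQ scores are assumed to be normally distributed with mean 100 and standard deviation 15, as elsewhere in the text:

# simulate one "five IQ scores" experiment and return its sample mean
one.experiment <- function() {
  iq <- round( rnorm( n=5, mean=100, sd=15 ) )   # five simulated IQ scores
  mean( iq )                                     # the sample mean for this replication
}

# 10,000 replications, then a histogram of the sample means
sample.means <- replicate( 10000, one.experiment() )
hist( sample.means )

Each call to one.experiment() corresponds to one row of Table 10.1, and the histogram of sample.means is (up to simulation noise) the sampling distribution of the mean plotted in Figure 10.5.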
What the sampling distribution in Figure 10.5 tells us, though, is that the "five IQ scores" experiment is not very accurate. If I repeat the experiment, the sampling distribution tells me that I can expect to see a sample mean anywhere between 80 and 120.

Figure 10.5: The sampling distribution of the mean for the "five IQ scores experiment". If you sample 5 people at random and calculate their average IQ, you'll almost certainly get a number between 80 and 120, even though there are quite a lot of individuals who have IQs above 120 or below 80. For comparison, the black line plots the population distribution of IQ scores.

10.3.2 Sampling distributions exist for any sample statistic!

One thing to keep in mind when thinking about sampling distributions is that any sample statistic you might care to calculate has a sampling distribution. For example, suppose that each time I replicated the "five IQ scores" experiment I wrote down the largest IQ score in the experiment. This would give me a data set that started out like this:

110  117  122  119  113 ...

Doing this over and over again would give me a very different sampling distribution, namely the sampling distribution of the maximum. The sampling distribution of the maximum of 5 IQ scores is shown in Figure 10.6. Not surprisingly, if you pick 5 people at random and then find the person with the highest IQ score, they're going to have an above average IQ. Most of the time you'll end up with someone whose IQ is measured in the 100 to 140 range.

Figure 10.6: The sampling distribution of the maximum for the "five IQ scores experiment". If you sample 5 people at random and select the one with the highest IQ score, you'll probably see someone with an IQ between 100 and 140.

Figure 10.7: An illustration of how the sampling distribution of the mean depends on sample size (panels a, b and c correspond to sample sizes of 1, 2 and 10). In each panel, I generated 10,000 samples of IQ data, and calculated the mean IQ observed within each of these data sets. The histograms in these plots show the distribution of these means (i.e., the sampling distribution of the mean). Each individual IQ score was drawn from a normal distribution with mean 100 and standard deviation 15, which is shown as the solid black line. In panel a, each data set contained only a single observation, so the mean of each sample is just one person's IQ score. As a consequence, the sampling distribution of the mean is of course identical to the population distribution of IQ scores. However, when we raise the sample size to 2, the mean of any one sample tends to be closer to the population mean than any one person's IQ score, and so the histogram (i.e., the sampling distribution) is a bit narrower than the population distribution. By the time we raise the sample size to 10 (panel c), we can see that the distribution of sample means tends to be fairly tightly clustered around the true population mean.
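The sampling distribution of the maximum in Figure 10.6 can be simulated in exactly the same way as the sampling distribution of the mean. The following is just an illustrative sketch under the same assumptions (normally distributed IQ scores with mean 100 and standard deviation 15), not the book's code:

# sampling distribution of the maximum of 5 IQ scores
sample.maxima <- replicate( 10000, max( round( rnorm( n=5, mean=100, sd=15 ) ) ) )
hist( sample.maxima )    # mostly between about 100 and 140
mean( sample.maxima )    # the maximum is, on average, well above 100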
10.3.3 The central limit theorem

At this point I hope you have a pretty good sense of what sampling distributions are, and in particular what the sampling distribution of the mean is. In this section I want to talk about how the sampling distribution of the mean changes as a function of sample size. Intuitively, you already know part of the answer: if you only have a few observations, the sample mean is likely to be quite inaccurate: if you replicate a small experiment and recalculate the mean you'll get a very different answer. In other words, the sampling distribution is quite wide. If you replicate a large experiment and recalculate the sample mean you'll probably get the same answer you got last time, so the sampling distribution will be very narrow. You can see this visually in Figure 10.7: the bigger the sample size, the narrower the sampling distribution gets. We can quantify this effect by calculating the standard deviation of the sampling distribution, which is referred to as the standard error. The standard error of a statistic is often denoted SE, and since we're usually interested in the standard error of the sample mean, we often use the acronym SEM. As you can see just by looking at the picture, as the sample size N increases, the SEM decreases.

Okay, so that's one part of the story. However, there's something I've been glossing over so far. All my examples up to this point have been based on the "IQ scores" experiments, and because IQ scores are roughly normally distributed, I've assumed that the population distribution is normal. What if it isn't normal? What happens to the sampling distribution of the mean? The remarkable thing is this: no matter what shape your population distribution is, as N increases the sampling distribution of the mean starts to look more like a normal distribution. To give you a sense of this, I ran some simulations using R. To do this, I started with the "ramped" distribution shown in the histogram in Figure 10.8. As you can see by comparing the triangular shaped histogram to the bell curve plotted by the black line, the population distribution doesn't look very much like a normal distribution at all. Next, I used R to simulate the results of a large number of experiments. In each experiment I took N = 2 samples from this distribution, and then calculated the sample mean. Figure 10.8b plots the histogram of these sample means (i.e., the sampling distribution of the mean for N = 2). This time, the histogram produces a ∩-shaped distribution: it's still not normal, but it's a lot closer to the black line than the population distribution in Figure 10.8a. When I increase the sample size to N = 4, the sampling distribution of the mean is very close to normal (Figure 10.8c), and by the time we reach a sample size of N = 8 it's almost perfectly normal. In other words, as long as your sample size isn't tiny, the sampling distribution of the mean will be approximately normal no matter what your population distribution looks like!
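The book doesn't reproduce the code for the "ramped" simulation here, but the idea can be sketched as follows. One simple way to get a ramped population on the interval from 0 to 1 is to take the square root of uniform random numbers; this may not be exactly the distribution used for Figure 10.8, but it makes the same point:

# a "ramped" population: the density increases linearly from 0 up to 1
ramped <- function( n ) sqrt( runif( n ) )

# sampling distribution of the mean for increasing sample sizes
par( mfrow = c(1, 4) )    # four histograms side by side
for( N in c(1, 2, 4, 8) ) {
  sample.means <- replicate( 10000, mean( ramped( N ) ) )
  hist( sample.means, main = paste( "Sample Size =", N ), xlab = "Sample Mean" )
}

Even by a sample size of 8 the histograms should look close to a bell curve, which is the point of the central limit theorem.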
On the basis of these figures, it seems like we have evidence for all of the following claims about the sampling distribution of the mean:

- The mean of the sampling distribution is the same as the mean of the population.
- The standard deviation of the sampling distribution (i.e., the standard error) gets smaller as the sample size increases.
- The shape of the sampling distribution becomes normal as the sample size increases.

As it happens, not only are all of these statements true, there is a very famous theorem in statistics that proves all three of them, known as the central limit theorem. Among other things, the central limit theorem tells us that if the population distribution has mean µ and standard deviation σ, then the sampling distribution of the mean also has mean µ, and the standard error of the mean is

SEM = σ / √N

Because we divide the population standard deviation σ by the square root of the sample size N, the SEM gets smaller as the sample size increases. It also tells us that the shape of the sampling distribution becomes normal.[4]

This result is useful for all sorts of things. It tells us why large experiments are more reliable than small ones, and because it gives us an explicit formula for the standard error it tells us how much more reliable a large experiment is. It tells us why the normal distribution is, well, normal. In real experiments, many of the things that we want to measure are actually averages of lots of different quantities (e.g., arguably, "general" intelligence as measured by IQ is an average of a large number of "specific" skills and abilities), and when that happens, the averaged quantity should follow a normal distribution. Because of this mathematical law, the normal distribution pops up over and over again in real data.

Figure 10.8: A demonstration of the central limit theorem. In panel a, we have a non-normal population distribution; and panels b-d show the sampling distribution of the mean for samples of sizes 2, 4 and 8, for data drawn from the distribution in panel a. As you can see, even though the original population distribution is non-normal, the sampling distribution of the mean becomes pretty close to normal by the time you have a sample of even 4 observations.

[4] As usual, I'm being a bit sloppy here. The central limit theorem is a bit more general than this section implies. Like most introductory stats texts, I've discussed one situation where the central limit theorem holds: when you're taking an average across lots of independent events drawn from the same distribution. However, the central limit theorem is much broader than this. There's a whole class of things called "U-statistics", for instance, all of which satisfy the central limit theorem and therefore become normally distributed for large sample sizes. The mean is one such statistic, but it's not the only one.

10.4 Estimating population parameters

In all the IQ examples in the previous sections, we actually knew the population parameters ahead of time. As every undergraduate gets taught in their very first lecture on the measurement of intelligence, IQ scores are defined to have mean 100 and standard deviation 15.
However, this is a bit of a lie. How do we know that IQ scores have a true population mean of 100? Well, we know this because the people who designed the tests have administered them to very large samples, and have then "rigged" the scoring rules so that their sample has mean 100. That's not a bad thing of course: it's an important part of designing a psychological measurement. However, it's important to keep in mind that this theoretical mean of 100 only attaches to the population that the test designers used to design the tests. Good test designers will actually go to some lengths to provide "test norms" that can apply to lots of different populations (e.g., different age groups, nationalities etc). This is very handy, but of course almost every research project of interest involves looking at a different population of people to those used in the test norms. For instance, suppose you wanted to measure the effect of low level lead poisoning on cognitive functioning in Port Pirie, a South Australian industrial town with a lead smelter. Perhaps you decide that you want to compare IQ scores among people in Port Pirie to a comparable sample in Whyalla, a South Australian industrial town with a steel refinery.[5] Regardless of which town you're thinking about, it doesn't make a lot of sense simply to assume that the true population mean IQ is 100. No-one has, to my knowledge, produced sensible norming data that can automatically be applied to South Australian industrial towns. We're going to have to estimate the population parameters from a sample of data. So how do we do this?

10.4.1 Estimating the population mean

Suppose we go to Port Pirie and 100 of the locals are kind enough to sit through an IQ test. The average IQ score among these people turns out to be X̄ = 98.5. So what is the true mean IQ for the entire population of Port Pirie? Obviously, we don't know the answer to that question. It could be 97.2, but it could also be 103.5. Our sampling isn't exhaustive so we cannot give a definitive answer. Nevertheless, if I was forced at gunpoint to give a "best guess" I'd have to say 98.5. That's the essence of statistical estimation: giving a best guess. In this example, estimating the unknown population parameter is straightforward. I calculate the sample mean, and I use that as my estimate of the population mean. It's pretty simple, and in the next section I'll explain the statistical justification for this intuitive answer. However, for the moment what I want to do is make sure you recognise that the sample statistic and the estimate of the population parameter are conceptually different things. A sample statistic is a description of your data, whereas the estimate is a guess about the population. With that in mind, statisticians often use different notation to
refer to them. For instance, if the true population mean is denoted µ, then we would use µ̂ to refer to our estimate of the population mean. In contrast, the sample mean is denoted X̄ or sometimes m. However, in simple random samples, the estimate of the population mean is identical to the sample mean: if I observe a sample mean of X̄ = 98.5, then my estimate of the population mean is also µ̂ = 98.5. To help keep the notation clear, here's a handy table:

Symbol   What is it?                          Do we know what it is?
X̄        Sample mean                          Yes, calculated from the raw data
µ        True population mean                 Almost never known for sure
µ̂        Estimate of the population mean      Yes, identical to the sample mean

[5] Please note that if you were actually interested in this question, you would need to be a lot more careful than I'm being here. You can't just compare IQ scores in Whyalla to Port Pirie and assume that any differences are due to lead poisoning. Even if it were true that the only differences between the two towns corresponded to the different refineries (and it isn't, not by a long shot), you need to account for the fact that people already believe that lead pollution causes cognitive deficits: if you recall back to Chapter 2, this means that there are different demand effects for the Port Pirie sample than for the Whyalla sample. In other words, you might end up with an illusory group difference in your data, caused by the fact that people think that there is a real difference. I find it pretty implausible to think that the locals wouldn't be well aware of what you were trying to do if a bunch of researchers turned up in Port Pirie with lab coats and IQ tests, and even less plausible to think that people wouldn't be pretty resentful of you for doing it. Those people won't be as co-operative in the tests. Other people in Port Pirie might be more motivated to do well because they don't want their home town to look bad. The motivational effects that would apply in Whyalla are likely to be weaker, because people don't have any concept of "iron ore poisoning" in the same way that they have a concept for "lead poisoning". Psychology is hard.

10.4.2 Estimating the population standard deviation

So far, estimation seems pretty simple, and you might be wondering why I forced you to read through all that stuff about sampling theory. In the case of the mean, our estimate of the population parameter (i.e. µ̂) turned out to be identical to the corresponding sample statistic (i.e. X̄). However, that's not always true. To see this, let's have a think about how to construct an estimate of the population standard deviation, which we'll denote σ̂. What shall we use as our estimate in this case? Your first thought might be that we could do the same thing we did when estimating the mean, and just use the sample statistic as our estimate. That's almost the right thing to do, but not quite. Here's why. Suppose I have a sample that contains a single observation. For this example, it helps to consider a sample where you have no intuitions at all about what the true population values might be, so let's use something completely fictitious. Suppose the observation in question measures the cromulence of my shoes. It turns out that my shoes have a cromulence of 20. So here's my sample:

20

This is a perfectly legitimate sample, even if it does have a sample size of N = 1. It has a sample mean of 20, and because every observation in this sample is equal to the sample mean (obviously!) it has a sample standard deviation of 0. As a description of the sample this seems quite right: the sample contains a single observation and therefore there is no variation observed within the sample. A sample standard deviation of s = 0 is the right answer here. But as an estimate of the population standard deviation, it feels completely insane, right? Admittedly, you and I don't know anything at all about what "cromulence" is, but we know something about data: the only reason that we don't see any variability in the sample is that the sample is too small to display any variation! So, if you have a sample size of N = 1, it feels like the right answer is just to say "no idea at all".
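As a tiny illustration (not from the book): with a single observation, the divide-by-N sample standard deviation described here really is zero, whereas R's built-in sd() function, which divides by N − 1 (a point the text returns to below), returns NA because it needs at least two observations:

x <- 20                            # a sample containing one observation
sqrt( mean( (x - mean(x))^2 ) )    # the sample standard deviation s: 0
sd( x )                            # NA: sd() divides by N - 1, so one observation isn't enough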
Notice that you don't have the same intuition when it comes to the sample mean and the population mean. If forced to make a best guess about the population mean, it doesn't feel completely insane to guess that the population mean is 20. Sure, you probably wouldn't feel very confident in that guess, because you have only the one observation to work with, but it's still the best guess you can make. Let's extend this example a little. Suppose I now make a second observation. My data set now has N = 2 observations of the cromulence of shoes, and the complete sample now looks like this:

20, 22

This time around, our sample is just large enough for us to be able to observe some variability: two observations is the bare minimum number needed for any variability to be observed! For our new data set, the sample mean is X̄ = 21, and the sample standard deviation is s = 1. What intuitions do we have about the population? Again, as far as the population mean goes, the best guess we can possibly make is the sample mean: if forced to guess, we'd probably guess that the population mean cromulence is 21. What about the standard deviation? This is a little more complicated. The sample standard deviation is only based on two observations, and if you're at all like me you probably have the intuition that, with only two observations, we haven't given the population "enough of a chance" to reveal its true variability to us. It's not just that we suspect that the estimate is wrong: after all, with only two observations we expect it to be wrong to some degree. The worry is that the error is systematic. Specifically, we suspect that the sample standard deviation is likely to be smaller than the population standard deviation.

This intuition feels right, but it would be nice to demonstrate this somehow. There are in fact mathematical proofs that confirm this intuition, but unless you have the right mathematical background they don't help very much. Instead, what I'll do is use R to simulate the results of some experiments. With that in mind, let's return to our IQ studies. Suppose the true population mean IQ is 100 and the standard deviation is 15. I can use the rnorm() function to generate the results of an experiment in which I measure N = 2 IQ scores, and calculate the sample standard deviation. If I do this over and over again, and plot a histogram of these sample standard deviations, what I have is the sampling distribution of the standard deviation. I've plotted this distribution in Figure 10.9. Even though the true population standard deviation is 15, the average of the sample standard deviations is only 8.5. Notice that this is a very different result to what we found in Figure 10.7b when we plotted the sampling distribution of the mean.

Figure 10.9: The sampling distribution of the sample standard deviation for a "two IQ scores" experiment. The true population standard deviation is 15 (dashed line), but as you can see from the histogram, the vast majority of experiments will produce a much smaller sample standard deviation than this. On average, this experiment would produce a sample standard deviation of only 8.5, well below the true value! In other words, the sample standard deviation is a biased estimate of the population standard deviation.
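A rough sketch of the simulation just described (not the book's actual code): because R's sd() divides by N − 1, the divide-by-N sample standard deviation is written out explicitly here as a helper called sd.n, which is just an illustrative name:

# the sample standard deviation in the divide-by-N sense used in the text
sd.n <- function( x ) sqrt( mean( (x - mean(x))^2 ) )

# sampling distribution of this statistic for a "two IQ scores" experiment
sample.sds <- replicate( 10000, sd.n( rnorm( n=2, mean=100, sd=15 ) ) )
mean( sample.sds )   # roughly 8.5, well below the true value of 15
hist( sample.sds )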
If you look at that sampling distribution of the mean (Figure 10.7b), what you see is that the population mean is 100, and the average of the sample means is also 100. Now let's extend the simulation. Instead of restricting ourselves to the situation where we have a sample size of N = 2, let's repeat the exercise for sample sizes from 1 to 10. If we plot the average sample mean and average sample standard deviation as a function of sample size, we get the results shown in Figure 10.10. On the left hand side (panel a), I've plotted the average sample mean and on the right hand side (panel b), I've plotted the average standard deviation. The two plots are quite different: on average, the average sample mean is equal to the population mean. It is an unbiased estimator, which is essentially the reason why your best estimate for the population mean is the sample mean.[6] The plot on the right is quite different: on average, the sample standard deviation s is smaller than the population standard deviation σ. It is a biased estimator. In other words, if we want to make a "best guess" σ̂ about the value of the population standard deviation σ, we should make sure our guess is a little bit larger than the sample standard deviation s.

Figure 10.10: An illustration of the fact that the sample mean is an unbiased estimator of the population mean (panel a), but the sample standard deviation is a biased estimator of the population standard deviation (panel b). To generate the figure, I generated 10,000 simulated data sets with 1 observation each, 10,000 more with 2 observations, and so on up to a sample size of 10. Each data set consisted of fake IQ data: that is, the data were normally distributed with a true population mean of 100 and standard deviation 15. On average, the sample means turn out to be 100, regardless of sample size (panel a). However, the sample standard deviations turn out to be systematically too small (panel b), especially for small sample sizes.

The fix to this systematic bias turns out to be very simple. Here's how it works. Before tackling the standard deviation, let's look at the variance. If you recall from Section 5.2, the sample variance is defined to be the average of the squared deviations from the sample mean. That is:

s² = (1/N) Σᵢ (Xᵢ − X̄)²

where the sum runs over i = 1, ..., N. The sample variance s² is a biased estimator of the population variance σ². But as it turns out, we only need to make a tiny tweak to transform this into an unbiased estimator. All we have to do is divide by N − 1 rather than by N. If we do that, we obtain the following formula:

σ̂² = (1/(N − 1)) Σᵢ (Xᵢ − X̄)²

This is an unbiased estimator of the population variance. Moreover, this finally answers the question we raised in Section 5.2. Why did R give us slightly different answers when we used the var() function? Because the var() function calculates σ̂², not s², that's why. A similar story applies for the standard deviation.

[6] I should note that I'm hiding something here. Unbiasedness is a desirable characteristic for an estimator, but there are other things that matter besides bias. However, it's beyond the scope of this book to discuss this in any detail. I just want to draw your attention to the fact that there's some hidden complexity here.
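A quick sketch (not from the book) that shows the distinction numerically, using the two cromulence observations from earlier; var.n and sd.n are just illustrative names for the divide-by-N versions:

x <- c( 20, 22 )                                  # the two cromulence observations
var.n <- function( x ) mean( (x - mean(x))^2 )    # divide-by-N variance: s^2
sd.n  <- function( x ) sqrt( var.n( x ) )         # divide-by-N standard deviation: s
var.n( x )   # 1: the sample variance s^2
var( x )     # 2: R's var() divides by N - 1, i.e. the unbiased estimator
sd.n( x )    # 1: the sample standard deviation s
sd( x )      # about 1.41: R's sd() also divides by N - 1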
If we divide by N − 1 rather than N, our estimate of the population standard deviation becomes:

σ̂ = √[ (1/(N − 1)) Σᵢ (Xᵢ − X̄)² ]

and when we use R's built-in standard deviation function sd(), what it's doing is calculating σ̂, not s.[7]

One final point: in practice, a lot of people tend to refer to σ̂ (i.e., the formula where we divide by N − 1) as the sample standard deviation. Technically, this is incorrect: the sample standard deviation should be equal to s (i.e., the formula where we divide by N). These aren't the same thing, either conceptually or numerically. One is a property of the sample, the other is an estimated characteristic of the population. However, in almost every real life application, what we actually care about is the estimate of the population parameter, and so people always report σ̂ rather than s. This is the right number to report, of course; it's just that people tend to get a little bit imprecise about terminology when they write it up, because "sample standard deviation" is shorter than "estimated population standard deviation". It's no big deal, and in practice I do the same thing everyone else does. Nevertheless, I think it's important to keep the two concepts separate: it's never a good idea to confuse "known properties of your sample" with "guesses about the population from which it came". The moment you start thinking that s and σ̂ are the same thing, you start doing exactly that. To finish this section off, here's another couple of tables to help keep things clear:

Symbol   What is it?                                      Do we know what it is?
s        Sample standard deviation                        Yes, calculated from the raw data
σ        Population standard deviation                    Almost never known for sure
σ̂        Estimate of the population standard deviation    Yes, but not the same as the sample standard deviation

Symbol   What is it?                              Do we know what it is?
s²       Sample variance                          Yes, calculated from the raw data
σ²       Population variance                      Almost never known for sure
σ̂²       Estimate of the population variance      Yes, but not the same as the sample variance

[7] Okay, I'm hiding something else here. In a bizarre and counterintuitive twist, since σ̂² is an unbiased estimator of σ², you'd assume that taking the square root would be fine, and σ̂ would be an unbiased estimator of σ. Right? Weirdly, it's not. There's actually a subtle, tiny bias in σ̂. This is just bizarre: σ̂² is an unbiased estimate of the population variance σ², but when you take the square root, it turns out that σ̂ is a biased estimator of the population standard deviation. Weird, weird, weird, right? So, why is σ̂ biased? The technical answer is "because non-linear transformations (e.g., the square root) don't commute with expectation", but that just sounds like gibberish to everyone who hasn't taken a course in mathematical statistics. Fortunately, it doesn't matter for practical purposes. The bias is small, and in real life everyone uses σ̂ and it works just fine. Sometimes mathematics is just annoying.

10.5 Estimating a confidence interval

Statistics means never having to say you're certain
– Unknown origin[8]

Up to this point in this chapter, I've outlined the basics of sampling theory which statisticians rely on to make guesses about population parameters on the basis of a sample of data. As this discussion illustrates, one of the reasons we need all this sampling theory is that every data set leaves us with some degree of uncertainty, so our estimates are never going to be perfectly accurate.
The thing that has been missing from this discussion is an attempt to quantify the amount of uncertainty that attaches to our estimate. It's not enough to be able to guess that, say, the mean IQ of undergraduate psychology students is 115 (yes, I just made that number up). We also want to be able to say something that expresses the degree of certainty that we have in our guess. For example, it would be nice to be able to say that there is a 95% chance that the true mean lies between 109 and 121. The name for this is a confidence interval for the mean. Armed with an understanding of sampling distributions, constructing a confidence interval for the mean is actually pretty easy. Here's how it works. Suppose the true population mean is µ and the standard deviation is σ. I've just finished running my study that has N participants, and the mean IQ among those participants is X̄. We know from our discussion of the central limit theorem (Section 10.3.3) that the sampling distribution of the mean is approximately normal. We also know from our discussion of the normal distribution in Section 9.5 that there is a 95% chance that a normally-distributed quantity will fall within two standard deviations of the true mean. To be more precise, we can use the qnorm() function to compute the 2.5th and 97.5th percentiles of the normal distribution:

> qnorm( p = c(.025, .975) )
[1] -1.959964  1.959964

Okay, so I lied earlier on. The more correct answer is that there is a 95% chance that a normally-distributed quantity will fall within 1.96 standard deviations of the true mean. Next, recall that the standard deviation of the sampling distribution is referred to as the standard error, and the standard error of the mean is written as SEM. When we put all these pieces together, we learn that there is a 95% probability that the sample mean X̄ that we have actually observed lies within 1.96 standard errors of the population mean. Mathematically, we write this as:

µ − (1.96 × SEM) ≤ X̄ ≤ µ + (1.96 × SEM)

where the SEM is equal to σ/√N, and we can be 95% confident that this is true. However, that's not answering the question that we're actually interested in. The equation above tells us what we should expect about the sample mean, given that we know what the population parameters are. What we want is to have this work the other way around: we want to know what we should believe about the population parameters, given that we have observed a particular sample. However, it's not too difficult to do this. Using a little high school algebra, a sneaky way to rewrite our equation is like this:

X̄ − (1.96 × SEM) ≤ µ ≤ X̄ + (1.96 × SEM)

What this is telling us is that the range of values has a 95% probability of containing the population mean µ. We refer to this range as a 95% confidence interval, denoted CI95. In short, as long as N is sufficiently large – large enough for us to believe that the sampling distribution of the mean is normal – then we can write this as our formula for the 95% confidence interval:

CI95 = X̄ ± (1.96 × σ/√N)

Of course, there's nothing special about the number 1.96: it just happens to be the multiplier you need to use if you want a 95% confidence interval.

[8] This quote appears on a great many t-shirts and websites, and even gets a mention in a few academic papers (e.g., http://www.amstat.org/publications/jse/v10n3/friedman.html), but I've never found the original source.
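To make the formula concrete, here is a small worked sketch in R using made-up numbers (a sample mean of 98.5 from N = 100 people, and pretending the population standard deviation of 15 is known):

xbar  <- 98.5               # observed sample mean
sigma <- 15                 # population standard deviation (pretend we know it)
N     <- 100                # sample size
sem   <- sigma / sqrt( N )  # standard error of the mean: 1.5
xbar + c(-1, 1) * qnorm( .975 ) * sem   # 95% CI: roughly 95.6 to 101.4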
If I'd wanted a 70% confidence interval, I could have used the qnorm() function to calculate the 15th and 85th quantiles:

> qnorm( p = c(.15, .85) )
[1] -1.036433  1.036433

and so the formula for CI70 would be the same as the formula for CI95 except that we'd use 1.04 as our magic number rather than 1.96.

10.5.1 A slight mistake in the formula

As usual, I lied. The formula that I've given above for the 95% confidence interval is approximately correct, but I glossed over an important detail in the discussion. Notice my formula requires you to use the standard error of the mean, SEM, which in turn requires you to use the true population standard deviation σ. Yet, in Section 10.4 I stressed the fact that we don't actually know the true population parameters. Because we don't know the true value of σ, we have to use an estimate of the population standard deviation σ̂ instead. This is pretty straightforward to do, but it has the consequence that we need to use the quantiles of the t-distribution rather than the normal distribution to calculate our magic number; and the answer depends on the sample size. When N is very large, we get pretty much the same value using qt() that we would if we used qnorm()...

> N <- 10000   # suppose our sample size is 10,000
> qt( p = .975, df = N-1 )   # calculate the 97.5th quantile of the t-distribution
[1] 1.960201

But when N is small, we get a much bigger number when we use the t distribution:

> N <- 10   # suppose our sample size is 10
> qt( p = .975, df = N-1 )   # calculate the 97.5th quantile of the t-distribution
[1] 2.262157

There's nothing too mysterious about what's happening here. Bigger values mean that the confidence interval is wider, indicating that we're more uncertain about what the true value of µ actually is. When we use the t distribution instead of the normal distribution, we get bigger numbers, indicating that we have more uncertainty. And why do we have that extra uncertainty? Well, because our estimate of the population standard deviation σ̂ might be wrong! If it's wrong, it implies that we're a bit less sure about what our sampling distribution of the mean actually looks like... and this uncertainty ends up getting reflected in a wider confidence interval.

10.5.2 Interpreting a confidence interval

The hardest thing about confidence intervals is understanding what they mean. Whenever people first encounter confidence intervals, the first instinct is almost always to say that "there is a 95% probability that the true mean lies inside the confidence interval". It's simple, and it seems to capture the common sense idea of what it means to say that I am "95% confident". Unfortunately, it's not quite right. The intuitive definition relies very heavily on your own personal beliefs about the value of the population mean. I say that I am 95% confident because those are my beliefs. In everyday life that's perfectly okay, but if you remember back to Section 9.2, you'll notice that talking about personal belief and confidence is a Bayesian idea. Personally (speaking as a Bayesian) I have no problem with the idea that the phrase "95% probability" is allowed to refer to a personal belief. However, confidence intervals are not Bayesian tools. Like everything else in this chapter, confidence intervals are frequentist tools, and if you are going to use frequentist methods then it's not appropriate to attach a Bayesian interpretation to them. If you use frequentist methods, you must adopt frequentist interpretations! Okay, so if that's not the right answer, what is?
Remember what we said about frequentist probability: the only way we are allowed to make "probability statements" is to talk about a sequence of events, and to count up the frequencies of different kinds of events. From that perspective, the interpretation of a 95% confidence interval must have something to do with replication. Specifically: if we replicated the experiment over and over again and computed a 95% confidence interval for each replication, then 95% of those intervals would contain the true mean. More generally, 95% of all confidence intervals constructed using this procedure should contain the true population mean. This idea is illustrated in Figure 10.11, which shows 50 confidence intervals constructed for a "measure 10 IQ scores" experiment (top panel) and another 50 confidence intervals for a "measure 25 IQ scores" experiment (bottom panel). A bit fortuitously, across the 100 replications that I simulated, it turned out that exactly 95 of them contained the true mean.

The critical difference here is that the Bayesian claim makes a probability statement about the population mean (i.e., it refers to our uncertainty about the population mean), which is not allowed under the frequentist interpretation of probability because you can't "replicate" a population! In the frequentist claim, the population mean is fixed and no probabilistic claims can be made about it. Confidence intervals, however, are repeatable, so we can replicate experiments. Therefore a frequentist is allowed to talk about the probability that the confidence interval (a random variable) contains the true mean; but is not allowed to talk about the probability that the true population mean (not a repeatable event) falls within the confidence interval. I know that this seems a little pedantic, but it does matter. It matters because the difference in interpretation leads to a difference in the mathematics. There is a Bayesian alternative to confidence intervals, known as credible intervals. In most situations credible intervals are quite similar to confidence intervals, but in other cases they are drastically different. As promised, though, I'll talk more about the Bayesian perspective in Chapter 17.

10.5.3 Calculating confidence intervals in R

As far as I can tell, the core packages in R don't include a simple function for calculating confidence intervals for the mean. They do include a lot of complicated, extremely powerful functions that can be used to calculate confidence intervals associated with lots of different things, such as the confint() function that we'll use in Chapter 15. But I figure that when you're first learning statistics, it might be useful to start with something simpler. As a consequence, the lsr package includes a function called ciMean() which you can use to calculate your confidence intervals. There are two arguments that you might want to specify:[9]

- x. This should be a numeric vector containing the data.
- conf. This should be a number, specifying the confidence level. By default, conf = .95, since 95% confidence intervals are the de facto standard in psychology.

[9] As of the current writing, these are the only arguments to the function. However, I am planning to add a bit more functionality to ciMean(). However, regardless of what those future changes might look like, the x and conf arguments will remain the same, and the commands used in this book will still work.
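For intuition, here is a rough sketch of the calculation that a function like ciMean() has to perform; this is not the lsr package's actual code, just the textbook formula written out with the t-based multiplier discussed in Section 10.5.1, and ci.by.hand is an illustrative name:

ci.by.hand <- function( x, conf = .95 ) {
  N     <- length( x )
  alpha <- 1 - conf
  crit  <- qt( 1 - alpha/2, df = N - 1 )   # t-based multiplier (about 1.96 when N is large)
  sem   <- sd( x ) / sqrt( N )             # estimated standard error of the mean
  mean( x ) + c( -1, 1 ) * crit * sem      # lower and upper ends of the interval
}

ci.by.hand( rnorm( 100, mean = 100, sd = 15 ) )   # example with fake IQ data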
Figure 10.11: 95% confidence intervals. The top graph (panel a) shows 50 simulated replications of an experiment in which we measure the IQs of 10 people. The dot marks the location of the sample mean, and the line shows the 95% confidence interval. In total 47 of the 50 confidence intervals do contain the true mean (i.e., 100), but the three intervals marked with asterisks do not. The lower graph (panel b) shows a similar simulation, but this time we simulate replications of an experiment that measures the IQs of 25 people.

Figure 10.12: Means and 95% confidence intervals for AFL attendance, plotted separately for each year from 1987 to 2010. This graph was drawn using the bargraph.CI() function.

So, for example, if I load the afl24.Rdata file, I can calculate the confidence interval associated with the mean attendance:

> ciMean( x = afl$attendance )
    2.5%    97.5%
31597.32 32593.12

Hopefully that's fairly clear.

10.5.4 Plotting confidence intervals in R

There are several different ways you can draw graphs that show confidence intervals as error bars. I'll show three versions here, but this certainly doesn't exhaust the possibilities. In doing so, what I'm assuming is that what you want to draw is a plot showing the means and confidence intervals for one variable, broken down by different levels of a second variable. For instance, in our afl data that we discussed earlier, we might be interested in plotting the average attendance by year. I'll do this using three different functions, bargraph.CI(), lineplot.CI() (both of which are in the sciplot package), and plotmeans() (which is in the gplots package). Assuming that you've installed these packages on your system (see Section 4.2 if you've forgotten how to do this), you'll need to load them. You'll also need to load the lsr package, because we'll make use of the ciMean() function to actually calculate the confidence intervals:

> library( sciplot )   # bargraph.CI() and lineplot.CI() functions
> library( gplots )    # plotmeans() function
> library( lsr )       # ciMean() function

Figure 10.13: Means and 95% confidence intervals for AFL attendance, plotted separately for each year from 1987 to 2010. This graph was drawn using the lineplot.CI() function.

Figure 10.14: Means and 95% confidence intervals for AFL attendance, plotted separately for each year from 1987 to 2010. This graph was drawn using the plotmeans() function.

Here's how to plot the means and confidence intervals using bargraph.CI():
> bargraph.CI( x.factor = year,              # grouping variable
+              response = attendance,        # outcome variable
+              data = afl,                   # data frame with the variables
+              ci.fun = ciMean,              # name of the function to calculate CIs
+              xlab = "Year",                # x-axis label
+              ylab = "Average Attendance"   # y-axis label
+ )

This produces the output shown in Figure 10.12. We can use the same arguments when calling the lineplot.CI() function:

> lineplot.CI( x.factor = year,              # grouping variable
+              response = attendance,        # outcome variable
+              data = afl,                   # data frame with the variables
+              ci.fun = ciMean,              # name of the function to calculate CIs
+              xlab = "Year",                # x-axis label
+              ylab = "Average Attendance"   # y-axis label
+ )

And the output for this command is shown in Figure 10.13. Finally, here's how you would do it using plotmeans():

> plotmeans( formula = attendance ~ year,   # outcome ~ group
+            data = afl,                    # data frame with the variables
+            n.label = FALSE                # don't show the sample sizes
+ )

This is shown in Figure 10.14.

10.6 Summary

In this chapter I've covered two main topics. The first half of the chapter talks about sampling theory, and the second half talks about how we can use sampling theory to construct estimates of the population parameters. The section breakdown looks like this:

- Basic ideas about samples, sampling and populations (Section 10.1)
- Statistical theory of sampling: the law of large numbers (Section 10.2), sampling distributions and the central limit theorem (Section 10.3)
- Estimating means and standard deviations (Section 10.4)
- Estimating a confidence interval (Section 10.5)

As always, there's a lot of topics related to sampling and estimation that aren't covered in this chapter, but for an introductory psychology class this is fairly comprehensive I think. For most applied researchers you won't need much more theory than this. One big question that I haven't touched on in this chapter is what you do when you don't have a simple random sample. There is a lot of statistical theory you can draw on to handle this situation, but it's well beyond the scope of this book.

11. Hypothesis testing

The process of induction is the process of assuming the simplest law that can be made to harmonize with our experience. This process, however, has no logical foundation but only a psychological one. It is clear that there are no grounds for believing that the simplest course of events will really happen. It is an hypothesis that the sun will rise tomorrow: and this means that we do not know whether it will rise.
– Ludwig Wittgenstein[1]

In the last chapter, I discussed the ideas behind estimation, which is one of the two "big ideas" in inferential statistics. It's now time to turn our attention to the other big idea, which is hypothesis testing. In its most abstract form, hypothesis testing is really a very simple idea: the researcher has some theory about the world, and wants to determine whether or not the data actually support that theory. However, the details are messy, and most people find the theory of hypothesis testing to be the most frustrating part of statistics. The structure of the chapter is as follows. Firstly, I'll describe how hypothesis testing works, in a fair amount of detail, using a simple running example to show you how a hypothesis test is "built". I'll try to avoid being too dogmatic while doing so, and focus instead on the underlying logic of the testing procedure.[2] Afterwards, I'll spend a bit of time talking about the various dogmas, rules and heresies that surround the theory of hypothesis testing.
11.1 A menagerie of hypotheses

Eventually we all succumb to madness. For me, that day will arrive once I'm finally promoted to full professor. Safely ensconced in my ivory tower, happily protected by tenure, I will finally be able to take leave of my senses (so to speak), and indulge in that most thoroughly unproductive line of psychological research: the search for extrasensory perception (ESP).[3]

Let's suppose that this glorious day has come. My first study is a simple one, in which I seek to test whether clairvoyance exists. Each participant sits down at a table, and is shown a card by an experimenter. The card is black on one side and white on the other. The experimenter takes the card away, and places it on a table in an adjacent room. The card is placed black side up or white side up completely at random, with the randomisation occurring only after the experimenter has left the room with the participant. A second experimenter comes in and asks the participant which side of the card is now facing upwards. It's purely a one-shot experiment. Each person sees only one card, and gives only one answer; and at no stage is the participant actually in contact with someone who knows the right answer. My data set, therefore, is very simple. I have asked the question of N people, and some number X of these people have given the correct response. To make things concrete, let's suppose that I have tested N = 100 people, and X = 62 of these got the answer right... a surprisingly large number, sure, but is it large enough for me to feel safe in claiming I've found evidence for ESP? This is the situation where hypothesis testing comes in useful. However, before we talk about how to test hypotheses, we need to be clear about what we mean by hypotheses.

11.1.1 Research hypotheses versus statistical hypotheses

The first distinction that you need to keep clear in your mind is between research hypotheses and statistical hypotheses. In my ESP study, my overall scientific goal is to demonstrate that clairvoyance exists. In this situation, I have a clear research goal: I am hoping to discover evidence for ESP. In other situations I might actually be a lot more neutral than that, so I might say that my research goal is to determine whether or not clairvoyance exists.

[1] The quote comes from Wittgenstein's (1922) text, Tractatus Logico-Philosophicus.

[2] A technical note. The description below differs subtly from the standard description given in a lot of introductory texts. The orthodox theory of null hypothesis testing emerged from the work of Sir Ronald Fisher and Jerzy Neyman in the early 20th century; but Fisher and Neyman actually had very different views about how it should work. The standard treatment of hypothesis testing that most texts use is a hybrid of the two approaches. The treatment here is a little more Neyman-style than the orthodox view, especially as regards the meaning of the p value.

[3] My apologies to anyone who actually believes in this stuff, but on my reading of the literature on ESP, it's just not reasonable to think this is real. To be fair, though, some of the studies are rigorously designed; so it's actually an interesting area for thinking about psychological research design. And of course it's a free country, so you can spend your own time and effort proving me wrong if you like, but I wouldn't think that's a terribly practical use of your intellect.
Regardless of how I want to portray myself, the basic point that I'm trying to convey here is that a research hypothesis involves making a substantive, testable scientific claim... if you are a psychologist, then your research hypotheses are fundamentally about psychological constructs. Any of the following would count as research hypotheses:

• Listening to music reduces your ability to pay attention to other things. This is a claim about the causal relationship between two psychologically meaningful concepts (listening to music and paying attention to things), so it's a perfectly reasonable research hypothesis.

• Intelligence is related to personality. Like the last one, this is a relational claim about two psychological constructs (intelligence and personality), but the claim is weaker: correlational, not causal.

• Intelligence is speed of information processing. This hypothesis has a quite different character: it's not actually a relational claim at all. It's an ontological claim about the fundamental character of intelligence (and I'm pretty sure it's wrong). It's worth expanding on this one, actually: it's usually easier to think about how to construct experiments to test research hypotheses of the form "does X affect Y?" than it is to address claims like "what is X?" And in practice, what usually happens is that you find ways of testing relational claims that follow from your ontological ones. For instance, if I believe that intelligence is speed of information processing in the brain, my experiments will often involve looking for relationships between measures of intelligence and measures of speed. As a consequence, most everyday research questions do tend to be relational in nature, but they're almost always motivated by deeper ontological questions about the state of nature.

Notice that in practice, my research hypotheses could overlap a lot. My ultimate goal in the ESP experiment might be to test an ontological claim like "ESP exists", but I might operationally restrict myself to a narrower hypothesis like "Some people can 'see' objects in a clairvoyant fashion". That said, there are some things that really don't count as proper research hypotheses in any meaningful sense:

• Love is a battlefield. This is too vague to be testable. While it's okay for a research hypothesis to have a degree of vagueness to it, it has to be possible to operationalise your theoretical ideas. Maybe I'm just not creative enough to see it, but I can't see how this can be converted into any concrete research design. If that's true, then this isn't a scientific research hypothesis, it's a pop song. That doesn't mean it's not interesting – a lot of deep questions that humans have fall into this category. Maybe one day science will be able to construct testable theories of love, or to test to see if God exists, and so on; but right now we can't, and I wouldn't bet on ever seeing a satisfying scientific approach to either.

• The first rule of tautology club is the first rule of tautology club. This is not a substantive claim of any kind. It's true by definition. No conceivable state of nature could possibly be inconsistent with this claim. As such, we say that this is an unfalsifiable hypothesis, and as such it is outside the domain of science. Whatever else you do in science, your claims must have the possibility of being wrong.

• More people in my experiment will say "yes" than "no".
This one fails as a research hypothesis because it's a claim about the data set, not about the psychology (unless of course your actual research question is whether people have some kind of "yes" bias!). As we'll see shortly, this hypothesis is starting to sound more like a statistical hypothesis than a research hypothesis.

As you can see, research hypotheses can be somewhat messy at times; and ultimately they are scientific claims. Statistical hypotheses are neither of these two things. Statistical hypotheses must be mathematically precise, and they must correspond to specific claims about the characteristics of the data generating mechanism (i.e., the "population"). Even so, the intent is that statistical hypotheses bear a clear relationship to the substantive research hypotheses that you care about! For instance, in my ESP study my research hypothesis is that some people are able to see through walls or whatever. What I want to do is to "map" this onto a statement about how the data were generated. So let's think about what that statement would be. The quantity that I'm interested in within the experiment is P("correct"), the true-but-unknown probability with which the participants in my experiment answer the question correctly. Let's use the Greek letter θ (theta) to refer to this probability. Here are four different statistical hypotheses:

• If ESP doesn't exist and if my experiment is well designed, then my participants are just guessing. So I should expect them to get it right half of the time, and so my statistical hypothesis is that the true probability of choosing correctly is θ = 0.5.

• Alternatively, suppose ESP does exist and participants can see the card. If that's true, people will perform better than chance. The statistical hypothesis would be that θ > 0.5.

• A third possibility is that ESP does exist, but the colours are all reversed and people don't realise it (okay, that's wacky, but you never know...). If that's how it works then you'd expect people's performance to be below chance. This would correspond to a statistical hypothesis that θ < 0.5.

• Finally, suppose ESP exists, but I have no idea whether people are seeing the right colour or the wrong one. In that case, the only claim I could make about the data would be that the probability of making the correct answer is not equal to 0.5. This corresponds to the statistical hypothesis that θ ≠ 0.5.

All of these are legitimate examples of a statistical hypothesis because they are statements about a population parameter and are meaningfully related to my experiment. What this discussion makes clear, I hope, is that when attempting to construct a statistical hypothesis test the researcher actually has two quite distinct hypotheses to consider. First, he or she has a research hypothesis (a claim about psychology), and this corresponds to a statistical hypothesis (a claim about the data generating population). In my ESP example, these might be

Dan's research hypothesis: "ESP exists"
Dan's statistical hypothesis: θ ≠ 0.5

And the key thing to recognise is this: a statistical hypothesis test is a test of the statistical hypothesis, not the research hypothesis. If your study is badly designed, then the link between your research hypothesis and your statistical hypothesis is broken. To give a silly example, suppose that my ESP study was conducted in a situation where the participant can actually see the card reflected in a window; if that happens, I would be able to find very strong evidence that θ ≠ 0.5, but this would tell us nothing about whether "ESP exists".
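In case it helps to make the phrase "a claim about the data generating mechanism" more concrete, here's a small R sketch of my own (it isn't part of the original discussion) that simulates one run of the ESP experiment under two of the statistical hypotheses above. The value of 0.6 used for the "ESP works" scenario is purely an illustrative assumption:

> N <- 100                                         # number of participants in the study
> x.null <- rbinom( n = 1, size = N, prob = 0.5 )  # correct responses if everyone is guessing (theta = 0.5)
> x.esp <- rbinom( n = 1, size = N, prob = 0.6 )   # correct responses if theta really were 0.6 (illustrative value only)
> x.null                                           # will typically land somewhere near 50
> x.esp                                            # will typically land somewhere near 60

Each statistical hypothesis is nothing more than a recipe like this for generating data; the job of the hypothesis test is to work backwards from the one value of X we actually observed.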
11.1.2 Null hypotheses and alternative hypotheses

So far, so good. I have a research hypothesis that corresponds to what I want to believe about the world, and I can map it onto a statistical hypothesis that corresponds to what I want to believe about how the data were generated. It's at this point that things get somewhat counterintuitive for a lot of people. Because what I'm about to do is invent a new statistical hypothesis (the "null" hypothesis, H0) that corresponds to the exact opposite of what I want to believe, and then focus exclusively on that, almost to the neglect of the thing I'm actually interested in (which is now called the "alternative" hypothesis, H1). In our ESP example, the null hypothesis is that θ = 0.5, since that's what we'd expect if ESP didn't exist. My hope, of course, is that ESP is totally real, and so the alternative to this null hypothesis is θ ≠ 0.5. In essence, what we're doing here is dividing up the possible values of θ into two groups: those values that I really hope aren't true (the null), and those values that I'd be happy with if they turn out to be right (the alternative). Having done so, the important thing to recognise is that the goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Most people find this pretty weird.

The best way to think about it, in my experience, is to imagine that a hypothesis test is a criminal trial⁴... the trial of the null hypothesis. The null hypothesis is the defendant, the researcher is the prosecutor, and the statistical test itself is the judge. Just like a criminal trial, there is a presumption of innocence: the null hypothesis is deemed to be true unless you, the researcher, can prove beyond a reasonable doubt that it is false. You are free to design your experiment however you like (within reason, obviously!), and your goal when doing so is to maximise the chance that the data will yield a conviction... for the crime of being false. The catch is that the statistical test sets the rules of the trial, and those rules are designed to protect the null hypothesis – specifically to ensure that if the null hypothesis is actually true, the chances of a false conviction are guaranteed to be low. This is pretty important: after all, the null hypothesis doesn't get a lawyer. And given that the researcher is trying desperately to prove it to be false, someone has to protect it.

11.2 Two types of errors

Before going into details about how a statistical test is constructed, it's useful to understand the philosophy behind it. I hinted at it when pointing out the similarity between a null hypothesis test and a criminal trial, but I should now be explicit. Ideally, we would like to construct our test so that we never make any errors. Unfortunately, since the world is messy, this is never possible. Sometimes you're just really unlucky: for instance, suppose you flip a coin 10 times in a row and it comes up heads all 10 times. That feels like very strong evidence that the coin is biased (and it is!), but of course there's a 1 in 1024 chance that this would happen even if the coin was totally fair.

4 This analogy only works if you're from an adversarial legal system like UK/US/Australia. As I understand these things, the French inquisitorial system is quite different.
In other words, in real life we always have to accept that there's a chance that we did the wrong thing. As a consequence, the goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them.

At this point, we need to be a bit more precise about what we mean by "errors". Firstly, let's state the obvious: it is either the case that the null hypothesis is true, or it is false; and our test will either reject the null hypothesis or retain it.⁵ So, as the table below illustrates, after we run the test and make our choice, one of four things might have happened:

                 retain H0            reject H0
  H0 is true     correct decision     error (type I)
  H0 is false    error (type II)      correct decision

As a consequence there are actually two different types of error here. If we reject a null hypothesis that is actually true, then we have made a type I error. On the other hand, if we retain the null hypothesis when it is in fact false, then we have made a type II error.

Remember how I said that statistical testing was kind of like a criminal trial? Well, I meant it. A criminal trial requires that you establish "beyond a reasonable doubt" that the defendant did it. All of the evidentiary rules are (in theory, at least) designed to ensure that there's (almost) no chance of wrongfully convicting an innocent defendant. The trial is designed to protect the rights of a defendant: as the English jurist William Blackstone famously said, it is "better that ten guilty persons escape than that one innocent suffer." In other words, a criminal trial doesn't treat the two types of error in the same way... punishing the innocent is deemed to be much worse than letting the guilty go free. A statistical test is pretty much the same: the single most important design principle of the test is to control the probability of a type I error, to keep it below some fixed probability. This probability, which is denoted α, is called the significance level of the test (or sometimes, the size of the test). And I'll say it again, because it is so central to the whole set-up... a hypothesis test is said to have significance level α if the type I error rate is no larger than α.

So, what about the type II error rate? Well, we'd like to keep that under control too, and we denote this probability by β. However, it's much more common to refer to the power of the test, which is the probability with which we reject a null hypothesis when it really is false, which is 1 - β. To help keep this straight, here's the same table again, but with the relevant numbers added:

                 retain H0                                    reject H0
  H0 is true     1 - α (probability of correct retention)     α (type I error rate)
  H0 is false    β (type II error rate)                       1 - β (power of the test)

A "powerful" hypothesis test is one that has a small value of β, while still keeping α fixed at some (small) desired level. By convention, scientists make use of three different α levels: .05, .01 and .001. Notice the asymmetry here... the tests are designed to ensure that the α level is kept small, but there's no corresponding guarantee regarding β. We'd certainly like the type II error rate to be small, and we try to design tests that keep it small, but this is very much secondary to the overwhelming need to control the type I error rate.

5 An aside regarding the language you use to talk about hypothesis testing. Firstly, one thing you really want to avoid is the word "prove": a statistical test really doesn't prove that a hypothesis is true or false. Proof implies certainty, and as the saying goes, statistics means never having to say you're certain. On that point almost everyone would agree. However, beyond that there's a fair amount of confusion. Some people argue that you're only allowed to make statements like "rejected the null", "failed to reject the null", or possibly "retained the null". According to this line of thinking, you can't say things like "accept the alternative" or "accept the null". Personally I think this is too strong: in my opinion, this conflates null hypothesis testing with Karl Popper's falsificationist view of the scientific process. While there are similarities between falsificationism and null hypothesis testing, they aren't equivalent. However, while I personally think it's fine to talk about accepting a hypothesis (on the proviso that "acceptance" doesn't actually mean that it's necessarily true, especially in the case of the null hypothesis), many people will disagree. And more to the point, you should be aware that this particular weirdness exists, so that you're not caught unawares by it when writing up your own results.
As Blackstone might have said if he were a statistician, it is "better to retain 10 false null hypotheses than to reject a single true one". To be honest, I don't know that I agree with this philosophy – there are situations where I think it makes sense, and situations where I think it doesn't – but that's neither here nor there. It's how the tests are built.

11.3 Test statistics and sampling distributions

At this point we need to start talking specifics about how a hypothesis test is constructed. To that end, let's return to the ESP example. Let's ignore the actual data that we obtained, for the moment, and think about the structure of the experiment. Regardless of what the actual numbers are, the form of the data is that X out of N people correctly identified the colour of the hidden card. Moreover, let's suppose for the moment that the null hypothesis really is true: ESP doesn't exist, and the true probability that anyone picks the correct colour is exactly θ = 0.5. What would we expect the data to look like? Well, obviously, we'd expect the proportion of people who make the correct response to be pretty close to 50%. Or, to phrase this in more mathematical terms, we'd say that X/N is approximately 0.5. Of course, we wouldn't expect this fraction to be exactly 0.5: if, for example, we tested N = 100 people, and X = 53 of them got the question right, we'd probably be forced to concede that the data are quite consistent with the null hypothesis. On the other hand, if X = 99 of our participants got the question right, then we'd feel pretty confident that the null hypothesis is wrong. Similarly, if only X = 3 people got the answer right, we'd be similarly confident that the null was wrong. Let's be a little more technical about this: we have a quantity X that we can calculate by looking at our data; after looking at the value of X, we make a decision about whether to believe that the null hypothesis is correct, or to reject the null hypothesis in favour of the alternative. The name for this thing that we calculate to guide our choices is a test statistic.
Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause us to reject the null hypothesis, and which values would cause us to keep it. In order to do so, we need to determine what the sampling distribution of the test statistic would be if the null hypothesis were actually true (we talked about sampling distributions earlier in Section 10.3.1). Why do we need this? Because this distribution tells us exactly what values of X our null hypothesis would lead us to expect. And therefore, we can use this distribution as a tool for assessing how closely the null hypothesis agrees with our data.

How do we actually determine the sampling distribution of the test statistic? For a lot of hypothesis tests this step is actually quite complicated, and later on in the book you'll see me being slightly evasive about it for some of the tests (some of them I don't even understand myself). However, sometimes it's very easy. And, fortunately for us, our ESP example provides us with one of the easiest cases. Our population parameter θ is just the overall probability that people respond correctly when asked the question, and our test statistic X is the count of the number of people who did so, out of a sample size of N. We've seen a distribution like this before, in Section 9.4: that's exactly what the binomial distribution describes! So, to use the notation and terminology that I introduced in that section, we would say that the null hypothesis predicts that X is binomially distributed, which is written

X ~ Binomial(θ, N)

Since the null hypothesis states that θ = 0.5 and our experiment has N = 100 people, we have the sampling distribution we need. This sampling distribution is plotted in Figure 11.1. No surprises really: the null hypothesis says that X = 50 is the most likely outcome, and it says that we're almost certain to see somewhere between 40 and 60 correct responses.

[Figure 11.1 is a bar plot titled "Sampling Distribution for X if the Null is True", showing Probability on the y-axis against the Number of Correct Responses (X) on the x-axis.]

Figure 11.1: The sampling distribution for our test statistic X when the null hypothesis is true. For our ESP scenario, this is a binomial distribution. Not surprisingly, since the null hypothesis says that the probability of a correct response is θ = .5, the sampling distribution says that the most likely value is 50 (out of 100) correct responses. Most of the probability mass lies between 40 and 60.
.......................................................................................................
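If you're curious where a picture like Figure 11.1 comes from, here is a brief sketch (my own, not code from the text) that computes the same binomial sampling distribution using dbinom() and plots it:

> x <- 0:100                                          # every possible number of correct responses
> null.probs <- dbinom( x, size = 100, prob = 0.5 )   # probability of each value when theta = 0.5
> plot( x, null.probs, type = "h",                    # type = "h" draws thin vertical bars
+       xlab = "Number of Correct Responses (X)",
+       ylab = "Probability" )

Summing null.probs over any range of X values gives the probability of observing a result in that range if the null hypothesis is true, which is exactly the kind of calculation we'll need in the next section.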
11.4 Making decisions

Okay, we're very close to being finished. We've constructed a test statistic (X), and we chose this test statistic in such a way that we're pretty confident that if X is close to N/2 then we should retain the null, and if not we should reject it. The question that remains is this: exactly which values of the test statistic should we associate with the null hypothesis, and exactly which values go with the alternative hypothesis? In my ESP study, for example, I've observed a value of X = 62. What decision should I make? Should I choose to believe the null hypothesis, or the alternative hypothesis?

11.4.1 Critical regions and critical values

To answer this question, we need to introduce the concept of a critical region for the test statistic X. The critical region of the test corresponds to those values of X that would lead us to reject the null hypothesis (which is why the critical region is also sometimes called the rejection region). How do we find this critical region? Well, let's consider what we know:

• X should be very big or very small in order to reject the null hypothesis.
• If the null hypothesis is true, the sampling distribution of X is Binomial(0.5, N).
• If α = .05, the critical region must cover 5% of this sampling distribution.

[Figure 11.2 is a bar plot titled "Critical Regions for a Two-Sided Test", showing the same sampling distribution with a lower critical region and an upper critical region (2.5% of the distribution each) marked in the tails; the x-axis is the Number of Correct Responses (X).]

Figure 11.2: The critical region associated with the hypothesis test for the ESP study, for a hypothesis test with a significance level of α = .05. The plot itself shows the sampling distribution of X under the null hypothesis (i.e., same as Figure 11.1): the grey bars correspond to those values of X for which we would retain the null hypothesis. The black bars show the critical region: those values of X for which we would reject the null. Because the alternative hypothesis is two sided (i.e., allows both θ < .5 and θ > .5), the critical region covers both tails of the distribution. To ensure an α level of .05, we need to ensure that each of the two regions encompasses 2.5% of the sampling distribution.
.......................................................................................................

It's important to make sure you understand this last point: the critical region corresponds to those values of X for which we would reject the null hypothesis, and the sampling distribution in question describes the probability that we would obtain a particular value of X if the null hypothesis were actually true. Now, let's suppose that we chose a critical region that covers 20% of the sampling distribution, and suppose that the null hypothesis is actually true. What would be the probability of incorrectly rejecting the null? The answer is of course 20%. And therefore, we would have built a test that had an α level of 0.2. If we want α = .05, the critical region is only allowed to cover 5% of the sampling distribution of our test statistic.

As it turns out, those three things uniquely solve the problem: our critical region consists of the most extreme values, known as the tails of the distribution. This is illustrated in Figure 11.2. As it turns out, if we want α = .05, then our critical regions correspond to X ≤ 40 and X ≥ 60.⁶ That is, if the number of correct responses is between 41 and 59, then we should retain the null hypothesis. If the number is between 0 and 40 or between 60 and 100, then we should reject the null hypothesis. The numbers 40 and 60 are often referred to as the critical values, since they define the edges of the critical region.

At this point, our hypothesis test is essentially complete: (1) we choose an α level (e.g., α = .05), (2) come up with some test statistic (e.g., X) that does a good job (in some meaningful sense) of comparing H0 to H1, (3) figure out the sampling distribution of the test statistic on the assumption that the null hypothesis is true (in this case, binomial) and then (4) calculate the critical region that produces an appropriate α level (0-40 and 60-100).

6 Strictly speaking, the test I just constructed has α = .057, which is a bit too generous. However, if I'd chosen 39 and 61 to be the boundaries for the critical region, then the critical region would only cover 3.5% of the distribution. I figured that it makes more sense to use 40 and 60 as my critical values, and be willing to tolerate a 5.7% type I error rate, since that's as close as I can get to a value of α = .05.
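As a rough check on the numbers in the footnote above, here's how you could compute these tail probabilities yourself with pbinom(); again, this is a sketch of mine rather than code from the text:

> lower <- pbinom( 40, size = 100, prob = 0.5 )        # P(X <= 40) when the null is true
> upper <- 1 - pbinom( 59, size = 100, prob = 0.5 )    # P(X >= 60) when the null is true
> lower + upper                                        # total type I error rate: roughly .057
> pbinom( 39, 100, 0.5 ) + ( 1 - pbinom( 60, 100, 0.5 ) )  # the 39/61 version: roughly .035

Pulling the critical values in towards 50 pushes the type I error rate above .05, while pushing them out drops it well below; the 40/60 rule is simply the closest we can get to α = .05 with a discrete test statistic.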
All that we have to do now is calculate the value of the test statistic for the real data (e.g., X = 62) and then compare it to the critical values to make our decision. Since 62 is greater than the critical value of 60, we would reject the null hypothesis. Or, to phrase it slightly differently, we say that the test has produced a significant result.

11.4.2 A note on statistical "significance"

Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners.
– Attributed to G. O. Ashley⁷

A very brief digression is in order at this point, regarding the word "significant". The concept of statistical significance is actually a very simple one, but has a very unfortunate name. If the data allow us to reject the null hypothesis, we say that "the result is statistically significant", which is often shortened to "the result is significant". This terminology is rather old, and dates back to a time when "significant" just meant something like "indicated", rather than its modern meaning, which is much closer to "important". As a result, a lot of modern readers get very confused when they start learning statistics, because they think that a "significant result" must be an important one. It doesn't mean that at all. All that "statistically significant" means is that the data allowed us to reject a null hypothesis. Whether or not the result is actually important in the real world is a very different question, and depends on all sorts of other things.

11.4.3 The difference between one sided and two sided tests

There's one more thing I want to point out about the hypothesis test that I've just constructed. If we take a moment to think about the statistical hypotheses I've been using,

H0: θ = .5
H1: θ ≠ .5

we notice that the alternative hypothesis covers both the possibility that θ < .5 and the possibility that θ > .5. This makes sense if I really think that ESP could produce either better-than-chance or worse-than-chance performance (and there are some people who think that). In statistical language, this is an example of a two-sided test. It's called this because the alternative hypothesis covers the area on both "sides" of the null hypothesis, and as a consequence the critical region of the test covers both tails of the sampling distribution (2.5% on either side if α = .05), as illustrated earlier in Figure 11.2.

However, that's not the only possibility. It might be the case, for example, that I'm only willing to believe in ESP if it produces better than chance performance. If so, then my alternative hypothesis would cover only the possibility that θ > .5, and as a consequence the null hypothesis now becomes θ ≤ .5:

H0: θ ≤ .5
H1: θ > .5

When this happens, we have what's called a one-sided test, and the critical region only covers one tail of the sampling distribution.

Use Quizgecko on...
Browser
Browser