Summary

This document is about frequentist and Bayesian statistics and probability. It details concepts such as probability definitions, frequentist and Bayesian philosophies, and statistical distributions. It also touches on how statisticians use associations between physical objects and statistical mechanics to draw new samples.

Full Transcript


Part I: Frequentist statistics

Session 1: Philosophy of statistics

This is the first of the sessions in the course. In it, I’m going to try and give you an overview of the core concepts that define statistics. The only way this can possibly work is if you bear with me and promise to try and forget everything you think you have been told about statistics. It’s easy to get the wrong end of the stick with these concepts, particularly if you first encountered them some time ago, and often the easiest way to learn is to start again. Please also try and bear in mind that statistics, as a field, is larger than bioinformatics itself. In order to take a statistics class seriously, you’ve got to learn some fundamental vocabulary and concepts. Don’t approach this class with the mindset of trying to figure out how it will help you analyse the dataset you have in front of you right now: treat it as a grounding in the fundamental concepts that will allow you to learn anything you want about any dataset. Just bear with me...

1.1 Probability: frequentist and Bayesian definitions

What is a probability? You might be surprised to learn that statisticians don’t really know. In the frequentist philosophy of statistics, a probability is the fraction of times a certain outcome would be expected in the long run. So, for example, p(heads) = 0.5 when you flip a coin because we would expect to get 50 heads if we flipped a coin 100 times. The frequentist definition depends on the ability to do something multiple times, and is in no way speculative. Thus a frequentist can’t give a probability of life on Mars: we haven’t been to any other planets, and so they have no idea what the overall ‘rate’ of life being found on planets is. Under the Bayesian philosophy, a probability is a belief about the relative likelihood of something happening, which can be modified in the face of data.
If that seems a bit hand-wave-y, that’s because it is: it’s a belief, and so it has no necessary connection to any formal, quantitative measurement. It’s possible, in the Bayesian framework, to make relative comparisons between the probabilities of things (“this is twice as likely as that”), but ultimately there’s no connection to a hard, externally-verifiable reality. While this might sound a bit terrifying, it shouldn’t be surprising: the frequentist definition is rooted in empiricism, and so that base is already covered. This may all seem so strange that there appears to be no advantage to a Bayesian perspective; there is, and it all relates to the concept of relative probability, but we’ve not got time to go into it yet. Just be aware that there are, fundamentally, only two kinds of probability: one rooted in the relative frequency of events, and the other in belief.

[Footnote 1: Note that no one statistician is ‘a frequentist’ or ‘a Bayesian’, regardless of what you might read online. To describe yourself as ‘a frequentist’ is a bit like saying you’re ‘a portrait artist’. That’s great and everything, but if portraits are literally all you can paint then you’re probably not much of an artist. It is, however, a useful shorthand when writing handouts, so I’m going to use it occasionally. Just don’t introduce yourself to a real statistician as ‘a Bayesian’ to try and sound clever, because I can assure you that you won’t.]

1.2 Statistical distributions

A statistical distribution is easier to intuitively understand from a frequentist perspective. A statistical distribution describes the relative frequencies with which samples of given values will be drawn. Imagine we were to go out and record (sample) the heights (values) of 100 undergraduates and plot them in a histogram. This histogram would have a particular shape, and we could describe that shape using a mathematical equation.
If we divided the vertical axis by the total number of samples collected (leaving the shape of the histogram unchanged), we would have a plot of the distribution of the relative frequencies of each height. A frequentist would say that, the more heights you sampled, the closer that distribution would get to the true statistical distribution of heights. Statisticians often say that samples are drawn from a distribution. This means that they find it convenient to statistically model a given variable (e.g., height) as if all its samples follow the distribution they specify. How statisticians go about drawing new samples from a distribution is a complicated business. Frequentists often use associations between physical objects and statistical mechanics—in other words, they carry out some physical process that obeys the same mathematical principles as their statistics and simultaneously introduces ‘randomness’ to draw from their distribution of choice. A good example of this is tossing a coin; another is the throwing of beads into cups. We are lucky that modern computers have ‘randomness’ in the form of electrical fluctuations that we can use to draw numbers from a distribution. Some distributions are continuous, such as the height example above, such that it only makes sense to think of ranges within them. For example, we might ask informally what fraction of the population we might expect to have a height of 1.2 meters, but really we’d know that’s a shorthand for asking what fraction of the population we’d expect to have a height greater than or equal to 1.15 meters and less than 1.25 meters. The normal distribution is continuous. Some distributions are discrete and don’t display this property; there are many examples, but the simplest is maybe the Poisson distribution, which is only capable of spewing out whole numbers. The problem with statistical distributions is they have a tendency to look terrifyingly complex.
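The continuous-versus-discrete contrast above is easy to see for yourself in R; this is just a sketch using base R’s rnorm and rpois (the set.seed call only makes the draws repeatable, and the parameter values are my own illustrative choices):

```r
set.seed(1)

# Continuous: normal draws can take any value, which is why asking about
# an 'exact' height only really makes sense as a range
heights <- rnorm(n = 5, mean = 1.7, sd = 0.1)
print(heights)  # decimal values

# Discrete: Poisson draws can only ever be whole numbers
strikes <- rpois(n = 5, lambda = 2)
print(strikes)  # whole numbers only
stopifnot(all(strikes %% 1 == 0))
```

However many draws you take, the pattern holds: rnorm essentially never repeats a value exactly, while rpois piles up on 0, 1, 2, and so on.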
I think it’s important to remember that each wasn’t ‘made up’ to look a particular way, but rather was mathematically derived to describe samples that were taken under certain conditions. The most important example of this is the normal distribution, which was derived by the ‘Prince of Mathematicians’, Gauss. Gauss showed that random, independent samples of a variable (x) with a known mean (µ, ‘mu’—the Greek letter for ‘m’) and standard deviation (σ, ‘sigma’—the Greek letter for ‘s’) should follow the following distribution:

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}   (1.1)

[Footnote 2: ...there are arguably a few more, but let’s not worry about that right now...]
[Footnote 3: Bayesian philosophy is named for Reverend Bayes—it shouldn’t be surprising that a priest invented a statistical method based around belief, if you think about it.]
[Footnote 4: Several mathematicians have ‘proven’ theorems by throwing beads into cups. If there ever were a reason to be grateful for calculus, this is it.]
[Footnote 5: The commonest ways of drawing random numbers (including the ones used by R) are done using a different approach; ask me if you are interested.]
[Footnote 6: ...and the normal distribution is sometimes called the Gaussian distribution.]

This theorem is so important that it has become the basis of all modern statistics, and so is called the Central Limit Theorem. The equation is complicated, and you don’t need to learn it, but it’s important that you remember that if you sample enough data it should eventually look vaguely like it’s from a normal distribution. If it doesn’t, you’ve almost certainly done something wrong... It’s quite common to get confused and wonder what’s so special about the normal distribution that means the Central Limit Theorem always ‘points to it’: remember, it’s actually the ‘other way round’, and the Central Limit Theorem was used to derive the normal distribution.
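Since the handout has no figure for this, here is a quick R sketch of the Central Limit Theorem at work (illustrative only; the exponential distribution and the sample sizes are my choices, not the handout’s): take the means of many samples from a strongly skewed distribution and watch the histogram of those means turn bell-shaped.

```r
set.seed(42)

# Raw draws from an exponential distribution are strongly skewed...
raw <- rexp(n = 10000, rate = 1)
hist(raw, main = "Raw exponential draws: skewed")

# ...but the means of many such samples pile up symmetrically around
# the true mean (1/rate = 1), just as the Central Limit Theorem predicts
sample.means <- replicate(10000, mean(rexp(n = 50, rate = 1)))
hist(sample.means, main = "Means of samples: approximately normal")
```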
The Central Limit Theorem is how we ‘found’ the normal distribution, not the other way round! In equation 1.1 above, I referred to a normal distribution’s parameters: its mean (µ) and standard deviation (σ). The mean is the centre of the distribution, and most of the samples from that distribution will look more-or-less like that value (it’s the distribution’s ‘central tendency’: in the ‘limit’ where you sample from it many times, it will centre on that value). The standard deviation is the spread of the distribution, and observations from a normal distribution with a greater value will exhibit more variation. You’ve probably encountered these terms before, but it’s critical that you understand there’s nothing magical about them. They just happen to be the parameters of the normal distribution; if we were modelling data from another distribution we might use totally different parameters. Indeed, some distributions don’t even have defined means or standard deviations that you can estimate from them!

1.3 Estimating parameters: maximum likelihood

So far we’ve talked about drawing samples from a distribution with known parameters. You may be wondering how, given a set of samples, we can determine the distribution (and parameters) most likely to have generated those samples. Each distribution has something called a likelihood function, which gives the probability that a certain sample was drawn from it. Luckily, you already know the basics of it, because it’s essentially just the frequency density you’ve already been introduced to. Plug an observation (x) into equation 1.1 and, voilà, you’ve got the likelihood that your observation was drawn from that distribution and set of parameters. Thus, to find out what distribution/parameters a set of samples were drawn from, we find the distribution/parameters that maximise the product of the likelihoods of all the datapoints (the product because it’s the probability that we got all the values together, not each one separately).
We call this the maximum likelihood solution, or ML estimate, of a distribution. Notice that it’s an estimate, taken from the sample, of the parameter of the true distribution from which those samples were drawn. The true value belonging to the underlying, true distribution is called the population parameter. Estimating all these parameters has a cost. Intuitively, you would probably agree that the more things we estimate, the more underlying data we would need. In statistics, the concept of degrees of freedom helps us keep track of this. Imagine a friend sampled ten heights from a normal distribution, told you nine of them, and then offered you a million dollars if you exactly guessed the tenth. You’d say they were being unfair, because there’s no way you could get it exactly right—you could get close, but not exactly there. Now imagine that they gave you the mean of all ten numbers as well: you would laugh, because using simple algebra you could exactly predict the missing value using the formula for a mean. The total degrees of freedom of a random sample is equal to the number of data points in it; once you estimate a parameter from that sample, you ‘soak up’ (use) a degree of freedom, and so you can predict one of those data points from all the others and that estimate. In the previous example, calculating the mean of the sample meant there were 9 remaining degrees of freedom (10 − 1; 10 data points minus one estimated parameter). Thus you always need at least one data point for each parameter you estimate, and Very Bad Things will happen if you don’t have that. My general piece of advice is to make sure you have at least ten data points per estimated parameter, but that’s based less on statistical rigour and more on experience. The problem with a single ML solution/estimate is that it doesn’t give you any idea of the uncertainty associated with that estimate. One way to get at this is with confidence intervals, often abbreviated to CIs.
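Before we get to confidence intervals, the missing-tenth-height trick above is worth checking for yourself. A sketch in R (the heights are simulated, so the actual numbers are arbitrary):

```r
set.seed(1)
heights <- rnorm(n = 10, mean = 170, sd = 10)  # ten sampled heights (cm)
known   <- heights[1:9]                        # your friend reveals nine...
m       <- mean(heights)                       # ...and, unwisely, the mean

# Rearranging the formula for a mean (m = sum(x) / n) pins the tenth
# value down exactly: estimating the mean 'soaked up' a degree of freedom
tenth <- 10 * m - sum(known)
stopifnot(isTRUE(all.equal(tenth, heights[10])))
```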
Confidence intervals allow us to bracket the uncertainty around a parameter estimate that we’ve found through an ML search. To find them, you have to make a decision: how worried are you about making an error? If you want to be really, really certain that the confidence intervals you construct will contain the true parameter value, then you want very wide CIs. The problem is, you won’t then have a lot of power to know with any certainty what the true value is. Through a lot of maths, it turns out that if you make your CIs ±2 log-likelihood units you have a 5% chance of making an error (missing the true value) and an 80% chance of getting the true estimate. Confusingly, such CIs are called 95% confidence intervals. When, next time, we talk about something called the αcrit, you’ll be in a better place to understand where all these numbers come from and why they have these names. Make sure you take a look at figure 1.1, which gives you an example of what all of this looks like. In general, the more data you have, the greater your power to detect something that’s going on in your data. The more complex your model—the more parameters that you need to build your statistical distribution—the more degrees of freedom you will soak up, and so the more data you will need. A statistical model that misses a real effect (e.g., mood is affected by time of day, but you can’t find evidence for it in your data) has made a type II error. Conversely, if you find evidence for an effect that isn’t really there (e.g., you suggest there is a link between the MMR vaccine and autism, but really there isn’t) you have committed the opposite kind of mistake—a type I error. The way to remember the difference is the story of ‘The Boy Who Cried Wolf’: first he cries that there’s a wolf when there isn’t one (a type I error), then he cries wolf when there is one and no one believes him (they commit a type II error).

[Footnote 9: Perhaps not simple, but for a million dollars?]
[Footnote 10: ...the formula for a mean being x̄ = (Σx)/n.]
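The ‘±2 log-likelihood units’ recipe can be sketched numerically. Assuming some simulated normal data with the standard deviation held fixed (both assumptions mine, purely for illustration), we can compute the log-likelihood over a grid of candidate means and keep every candidate within 2 units of the maximum:

```r
set.seed(1)
x <- rnorm(n = 30, mean = 120, sd = 5)  # simulated 'blood pressures'

# Log-likelihood of the data under each candidate mean (sd fixed at 5);
# taking logs turns the product of likelihoods into a sum
candidates <- seq(110, 130, by = 0.01)
loglik <- sapply(candidates,
                 function(m) sum(dnorm(x, mean = m, sd = 5, log = TRUE)))

mle <- candidates[which.max(loglik)]                # the ML estimate
ci  <- range(candidates[loglik > max(loglik) - 2])  # ~95% confidence interval
cat("MLE:", mle, "CI:", ci[1], "to", ci[2], "\n")
```

Plotting loglik against candidates reproduces the likelihood curve of figure 1.1, with the CI marked off 2 log units below the peak.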
[Footnote 11: Log-likelihood means the natural log of the likelihood; because likelihoods get very small very quickly, it becomes convenient to present them in that way.]
[Footnote 12: There’s an important caveat here: increasing your sampling (i.e., collecting more data) can increase the probability of getting the true estimate and reduce your probability of missing the true value. These 80% and 5% rules of thumb come from the normal distribution under what one statistician thought of as reasonable conditions, and they’re used as a kind of baseline. If you’re interested, do the exercise on ‘power analysis’ later in the course.]

Fig. 1.1 A graphical overview of Maximum Likelihood estimation. As described in the text, it’s possible to find the best estimate of a given parameter by calculating the likelihood of the observed data if it were drawn from a distribution parameterised by different estimates. Doing so creates a likelihood curve, whose maximum represents the best estimate of that particular parameter (the Maximum Likelihood Estimate, or MLE). It is also possible to construct confidence intervals by going a certain distance away from that maximum; in this diagram, I represent the standard distance of 2 log units. This is equivalent to finding estimates two orders of magnitude less likely than the ML estimate. Note that, because likelihoods are almost always incredibly small, we work with them on the log scale. Note that it is not coincidence that this likelihood curve looks approximately normal; if you should ever look at an empirical ML curve and it does not, be careful, as it likely indicates something has gone Badly Wrong.

1.4 The zoo of statistical distributions

When I was a graduate student, I used to almost ‘collect’ statistical distributions and tests, in the sense that I felt that the more I knew, the ‘better’ I was at statistics.
Sadly, I was wrong [Footnote 13: ...as I found out in exams.]: it’s better to understand a few distributions, and where they come from, than to collect them like imaginary pocket monsters. With that in mind, I’m going to give you a brief tour of some useful statistical distributions, where they come from, and what they can be used for. I’m going to assume that you’re comfortable with the concept of a normal distribution, either informally (from the text above) or formally (from the extension section below, where we derive it from first principles). I’m also not going to cover the t distribution because it’s the basis of the t-test, which is what we will be covering extensively in the next session. Note also that I’m not giving you nice pictures of what these distributions look like: doing so is one of your exercises for today...

1.4.1 Chi-squared (χ²)

A χ² distribution is what you expect to get if you take the variances of normal distributions and sum them up. Let’s state that semi-formally [Footnote 14: That phrase ‘semi-formally’ is doing quite a lot of work here. This is less a definition and more the start of a derivation, to be completely honest, but if you’ve noticed that then you don’t really need to be reading this part of the course anyway...]:

Q = \sum_k \sum (x_k - \bar{x}_k)^2   (1.2)

Where Q is a χ²-distributed variable, x_k is an observation of the kth random normally-distributed variable we’re working with, and x̄_k is the mean of the kth normal distribution. Note that there are two summation signs here because we’re doing two sets of summing: once within each random variable, summing the squared deviation from the mean (the variance) for each draw within that sample, and then again to sum across all the variables’ squared differences. To calculate the probability density function for this variable, you would carry out that operation across the probability density function of a set of normal distributions.
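Equation 1.2 can also be checked by simulation. This is my own sketch, not the handout’s: build χ² draws ‘by hand’ by summing k squared deviations of standard normal draws, and compare them with R’s built-in rchisq.

```r
set.seed(1)
k <- 5  # number of standard normal variables squared and summed per draw

# Chi-squared draws 'by hand': each is a sum of k squared N(0, 1) values
by.hand <- replicate(10000, sum(rnorm(k)^2))

# R's built-in chi-squared generator with k degrees of freedom
builtin <- rchisq(n = 10000, df = k)

# Both should have mean ~k and variance ~2k, the chi-squared's signature
cat("means:    ", mean(by.hand), mean(builtin), "\n")
cat("variances:", var(by.hand), var(builtin), "\n")
```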
It’s a useful distribution to know about because it comes up so much; it’s essentially a measure of goodness of fit. Now you know how it’s derived from the variances, perhaps that makes some sense, or at least perhaps it will after next lecture, where we will be calculating variances over and over again until you’re sick of them.

1.4.2 F

The F-distribution is so-named for Ronald Fisher, who worked extensively on model-testing; it’s incredibly important in model testing and we will spend a lot of time working with it next session. Sadly, Fisher was a horrible person, and notably racist even in a time when racism was rampant and widely accepted, so I don’t want to dwell on the distribution’s origins any more. Luckily, the distribution itself is quite simple, albeit looking a bit weird because it’s two fractions nested in a fraction, so I also simplify it out below into just one fraction, in a non-standard presentation that you might find useful to refer back to later in the course:

F = \frac{Q_1 / d_1}{Q_2 / d_2} = \frac{Q_1 d_2}{Q_2 d_1}   (1.3)

Where Q_1 and Q_2 are two χ²-distributed variables (see above) and d_1 and d_2 are the degrees of freedom of those variables. As we’ll see next time, this is a very useful distribution because it basically tells you whether there’s anything unusual about the differences in variation of two distributions.

1.4.3 The Poisson distribution

The Poisson distribution is the first ‘oh wow’ distribution for many, because it solves a problem you have been repeatedly told, as an undergraduate, wasn’t a problem but in reality actually is a Very Big Deal: nearly-but-not-quite-continuous variables. Not everything in life can take non-integer values: lightning can strike once, twice, thirty times, but not half a time. The Poisson distribution can deal with this, and predicts only whole numbers:

p(x = k) = \frac{\lambda^k e^{-\lambda}}{k!}   (1.4)
What this equation means is that the probability that you observe k things (lightning strikes, horse kicks, fish swimming past you) in your random variable x is a function of the natural number e and λ (‘lambda’—the Greek letter for ‘l’). λ is the only parameter in the Poisson distribution, and is equal to both the mean and the variance of the distribution. We need all the fancy p(x = k) on the left-hand side because we must calculate discrete probabilities for each value of k, and there can be no k of 1.5 (only 1 or 2); thus the Poisson distribution is not technically a continuous distribution. This all makes the maths of deriving the distribution a bit involved, so we won’t dwell on it.

[Footnote 15: ...and I’m not bothering to formally/properly write that out, for simplicity.]
[Footnote 16: Ahem. Well, it is when you do a simplification along the lines of the trick I used above... This is the last trick in this section, I promise!]

1.4.4 The Binomial distribution

Finally, the workhorse of the bioinformatic world: the Binomial distribution. It’s the probability of seeing an event happen a certain number of times given how many times you looked and the underlying probability of it happening. For example, it can give you the probability of getting ten heads when you toss twenty coins. It can give you the probability of detecting 5 sequence variants in a population of 100 individuals given a background prevalence of 1%. It’s your best friend, really it is, and it’s so commonly used it’s got its own special notation:

p(x = k) = \binom{n}{k} p^k (1-p)^{n-k}   (1.5)

...where...

\binom{n}{k} = \frac{n!}{k!(n-k)!}   (1.6)

...I know, right? This is definitely the worst one to look at the maths of. Bear with me. It’s giving us the probability of detecting k ‘successes’, where a success is ‘the event we’re interested in happened’ and is quite often, in bioinformatics, something that doesn’t feel like a success, such as ‘a member of a population has a disease’.
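Both of the examples above (ten heads in twenty tosses; 5 variants at a 1% prevalence) can be computed with R’s built-in dbinom before we unpack the formula by hand:

```r
# Probability of exactly ten heads in twenty fair coin tosses
dbinom(x = 10, size = 20, prob = 0.5)   # about 0.176

# Probability of detecting exactly 5 variant carriers among 100
# individuals when the background prevalence is 1%
dbinom(x = 5, size = 100, prob = 0.01)  # about 0.0029
```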
We are doing the thing (testing a patient, tossing a coin) n times, and our probability of success is p. Some people introduce the term q in the above equations to represent the probability of a failure, but I think not writing it out makes the logic of equation 1.5 easier to track through: the product of the probability you succeed, the probability you fail (which is just when you don’t succeed), and a correction for how many ways there are of arranging the two. Let’s do it with a worked example: tossing four coins and getting two heads. It’s the probability of all the successes you would need (p^k; two heads are 0.5 × 0.5 = 0.5² = 0.25) multiplied by the failures you would need ((1−p)^(n−k); two tails are (1−0.5) × (1−0.5) = (1−0.5)² = 0.5² = 0.25). We then correct for all the different ways of throwing two heads in four tosses (where H is ‘heads’ and T is ‘tails’: HHTT, HTHT, HTTH, TTHH, THTH, THHT—six ways total): n!/(k!(n−k)!) = 4!/(2!(4−2)!) = 4!/(2!2!) = (4×3×2×1)/((2×1)×(2×1)) = 24/4 = 6. In total: 0.25 × 0.25 × 6 = 0.375. Look, it’s a lot of maths. But trust me, it’s a lot easier than drawing out those blasted branching diagrams you will have been made to do previously... Oh yes and, of course, will be made to do in the exercises. Sorry about that.

1.5 Extension: deriving the normal distribution

There are plenty of derivations of the normal distribution online, and if you are interested (and have the mathematical background) I would encourage you to look a couple up. It is possible, however, with the equivalent of an A-level in maths, to understand the important principles of the derivation of the normal distribution, and I think it is a useful exercise to go through because it somewhat demystifies what the normal distribution represents. It didn’t fall from the sky and it wasn’t made up for fun; it follows from relatively simple assumptions. With that in mind, what follows is an attempt to capture the spirit of a full derivation.
To say a variable is normally distributed is to say that the frequency of observations drawn from that distribution gets lower the further you move away from the mean of those observations. Let’s write that in maths:

\frac{df}{dx} = -v(x - \mu) f(x)   (1.7)

Where f(x) is the normal distribution we’re trying to derive, µ is the mean, and v is a measure of how fast the frequency falls off away from that mean. The left-hand side of the equation is a derivative: we’re talking about the rate of change of the distribution as we move along the values of the observation (x), so we’re starting from this change (our starting observation that motivated the distribution in the first place) to calculate the actual distribution itself (f(x)). There’s x − µ because we are defining the shape of the distribution by how far away we are (x) from the mean (µ), and we multiply that by −v because as we ‘go away from’ the mean (become more positive than the mean) we want the frequency to go down (and go up as we approach it—become less negative). Finally, we multiply everything by f(x) because we want everything to be proportional to the frequency at that point along the distribution: if we have a very low probability, we want to change very slowly, because we never want to hit zero and have impossible values, and we want to change fast closer to the centre of the distribution. Equally, if you’ve been paying attention earlier, you’ll remember that probability distributions need to have an area equal to 1, and so having this in here is convenient in that sense too (but more on that later). OK, I think the above was the most challenging part, so if you’re still with me, well done. Now let’s rearrange a bit to get all our xs on one side:

\frac{df}{f} = -v(x - \mu)\,dx   (1.8)

That wasn’t so bad! We just divided by f and multiplied by dx!
Now let’s get rid of those nasty derivatives by integrating both sides:

\int \frac{df}{f} = -v \int (x - \mu)\,dx   (1.9)

The integral on the left-hand side is something you can look up in a table (or derive from first principles if you’re that good at maths but that bad at remembering things), and the right-hand side is not so bad if you’ve done calculus, but I do appreciate it’s not trivial. Once we’ve done that we have:

\ln f = -\frac{v(x - \mu)^2}{2} + c   (1.10)

...where c is an integration constant, as per usual. Let’s exponentiate to finish this off and finally write this in terms of f(x):

f(x) = c e^{-\frac{v}{2}(x - \mu)^2}   (1.11)

Great! Now this nearly looks like equation 1.1, but it isn’t quite there, is it? In particular, we have that nasty integration constant c lying around, and we also have v. Let’s tackle c first, because it’s mathematically the hardest part to deal with (and is where we’ll stop with the maths) but is conceptually the simplest. Remember that all probability distributions have to have an area under them equal to 1? This is because something has to happen, and so the total realm of possible probabilities has to sum to one. You would probably agree with me that equation 1.11 cannot have a negative value (there is an e in there) unless c is negative; thus c is essentially a normalising constant that determines what the area under that curve will be. Thus the tricky part of the maths of the derivation of the normal distribution is finding a general solution for c such that the integral of equation 1.11 (i.e., the area under the curve) always sums to 1 for any given value of µ or v. That’s all there is to it; it happens that the solution is to set c to be 1/(√(2π)σ) (I’ll get to what σ is...), but there’s no great magic to it, just a lot of fiddling.

[Footnote 18: I’m using non-standard terminology for this constant, but you will see why at the end.]
[Footnote 19: Alright, I know calculus doesn’t quite work like that, but it really does look that way...]
So now you know where that normalising constant comes from and what it achieves. So now to v. Remember that v is a way of measuring how quickly we swoop up/down when we come closer to or further away from the mean (µ). I suppose, therefore, it is a measure of how variable the distribution is: larger values of v will mean that the distribution moves more ‘quickly’ around the mean. Hmmm, so I guess it’s a bit silly to call it a measure of how variable the distribution is, then, because bigger values of v will tend to be found in distributions that are less variable (faster sweeps up/down). Why don’t we fix that—it’s just a variable, after all—and replace it with its inverse, 1/v? That would give us:

f(x) = c e^{-\frac{(x - \mu)^2}{2v}}   (1.12)

Great! Oh, hang on, ‘variable’ isn’t a good word to use because we’re studying random variables and that’ll be really confusing. Let’s call it variance instead. Oh, hang on, we can’t use v, because we use Greek letters for parameters (like the mean—µ) in statistical distributions, because otherwise the maths looks much too approachable and we can’t scare graduate students. Also, v looks like the Greek letter ν, which is actually said with an ‘n’ sound. Oh, stuff it, let’s go for something silly—let’s call it σ. Oh wait, hang on, we can never have that variable be negative [Footnote 20: This comes from the maths that I ignored about deriving c, but the argument I’m giving here is valid.], because otherwise we’ll come ‘down’ towards the mean and then go back ‘up’, and have a distribution that is weirdly shaped. I guess an easy way to ensure that never happens is just to square it, right, because squared values can never be negative [Footnote 21: Shut up at the back, I know they sometimes can, but we’re dealing with real random numbers here, OK? Give me a break...], right? OK, let’s call it σ² then:

f(x) = c e^{-\frac{(x - \mu)^2}{2\sigma^2}}   (1.13)

Oh look! We’re done! Hooray!
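The end-point of the derivation is easy to verify numerically: writing out equation 1.13 with the normalising constant c = 1/(√(2π)σ) should reproduce R’s built-in dnorm, and should integrate to 1 like any probability density. A sketch:

```r
# Hand-built normal density, straight from equation 1.13
my.dnorm <- function(x, mu, sigma) {
  (1 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Agrees with R's built-in density at every point we check
x <- seq(-4, 4, by = 0.5)
stopifnot(isTRUE(all.equal(my.dnorm(x, mu = 0, sigma = 1),
                           dnorm(x, mean = 0, sd = 1))))

# And the area under the curve is 1, as a probability density must be
area <- integrate(my.dnorm, -Inf, Inf, mu = 0, sigma = 2)$value
stopifnot(abs(area - 1) < 1e-4)
```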
1.6 Exercises

Today’s exercises are going to seem a little strange, because we don’t know enough practical statistics to start working with real datasets yet. Focus on getting comfortable with the concepts, and use these questions as launching-off points for asking me questions about what you do and do not understand. There are, deliberately, no ‘right or wrong’ answers to these questions, but there are wrong ways to go about structuring a response. Don’t write me pages and pages of text; write at most a handful of short sentences showing your logic, and I can use those to see if you understand the topics. These are the most difficult questions you will face in the course, so don’t worry if they seem tough.

1. Fundamentals of probability definitions. Please do not write pages and pages for these questions: you should be able to answer them in a few sentences.

(a) Your friend Sarah warns you that there’s a 10% chance (in a frequentist sense) that Mount St Helens will erupt tomorrow. Your other friend, Michael, scoffs, and says that her probability makes no sense because the volcano will either erupt tomorrow or it won’t, and because there is only one tomorrow there can be no frequentist probability of that. What do you think?

(b) Your friend Michael approaches you with a problem. He has the average blood pressures of two professors, and wants to statistically test whether the professors’ pressures are drawn from a Blaargian or a Floogian distribution. The probabilities that each measurement is from a Blaargian distribution are 0.1 and 0.9, and the probabilities that the pressures are from a Floogian distribution are 0.4 and 0.6. Which distribution do you think the data are most likely drawn from? Show your working (hint: if the working isn’t trivial, you’ve gone wrong).

(c) Michael is beginning to disturb you a little, and wants to keep talking about these blood pressures.
He tells you that the Blaargian distribution is defined by one parameter (α), while the Floogian distribution is defined by two parameters (α and β). Does this affect your answer to question 1b above? Why?

2. Going to the statistical zoo. The aim of this exercise is not to produce hundreds of plots that you laboriously copy-paste into Word, but rather to become familiar with both some of R’s basic plotting functions and also the shapes of statistical distributions. Use the code below to generate draws from statistical distributions (which you can look up in R’s help with something like ?rnorm), and then sketch the impact of changing different parameters, either on a piece of paper or, if it is easier for you, using line drawings in Word or something like that. Do not simply copy-paste the output from R; the objective, again, is to develop familiarity with general principles. Make sure you look carefully at the horizontal (the ‘x’) axis in these plots, because otherwise you will miss differences between plots.

(a) The normal distribution has two parameters: µ and σ. Use the code below to investigate the impact of changing µ on the normal distribution, and sketch out your findings.
# Draw 10,000 times from a normal distribution with
# a mean (mu) of 0 and sd (sigma) of 1
# and then make a histogram of those samples
hist(rnorm(n=10000, mean=0, sd=1))

# Plot out again, changing the mean
hist(rnorm(n=10000, mean=5, sd=1))

# Change the plot so that we can have two plots on the screen
# at the same time, then plot the histograms side-by-side
par(mfrow=c(1,2))
hist(rnorm(n=10000, mean=0, sd=1))
hist(rnorm(n=10000, mean=5, sd=1))

# Reset the plotting window (clearing the two plots)
# so that you can plot one thing at a time again
par(mfrow=c(1,1))

(b) Use the code above to sketch the impact of changing σ (note that R uses the standard deviation, σ, to parameterise the normal and not the variance, σ², but, as one is just the square of the other, it needn’t bother you).

(c) The Poisson distribution has just one parameter, λ. It can be plotted out like this: hist(rpois(n=10000, lambda=1))... Sketch out how it changes across λ. Pay attention to the asymmetry of its distribution.

(d) For convenience of use, R parameterises the χ² distribution using the degrees of freedom left over, not the number of distributions/observations summed up, but from my handout you can see how the two are linked. Use the R function rchisq, and its two arguments n and df, to explore how the distribution changes across values of df such as 1, 5, 10, 50, and 100. Note, again, its asymmetry. Are you beginning to see a pattern in how R defines its functions that draw random numbers from a distribution?

(e) Drawing random numbers from the Binomial distribution often confuses students because it requires three arguments: the number of times to run the simulation (draw numbers) (n), the number of times to attempt a success within each draw/run (size), and the probability of success (prob). Make a sketch of how different values of size affect how ‘discrete’ and ‘gappy’ the distribution looks, but first follow the outline of how to explore the function given below.
# Let's toss a coin in the air ten times
rbinom(n=10, size=1, prob=0.5)

# Let's toss ten coins in the air and count the number of heads
rbinom(n=1, size=10, prob=0.5)

# Let's toss ten coins in the air and count the number of heads ten times
rbinom(n=10, size=10, prob=.5)

#... make sense? Great. If not... Play around a bit, then ask me.

(f) The F distribution has too many parameters to be easy to sketch, so take five on that one. This question is left deliberately blank.

3. Maximum likelihood by hand. There is nothing magical about how maximum likelihood (ML22) estimates are derived. They are found by a computer trying, in careful order, different parameter values in ways that are only slightly mathematically cleverer than what you are about to do now. The purpose of this exercise is to demystify that entire process for you.

(a) Michael is acting oddly again, and has now measured the average systolic blood pressures of 5 professors (he assures you they were taken ‘at rest’ and, for reasons you can’t quite put your finger on, that concerns you): 120, 130, 110, 105, and 121. Using the function dnorm, estimate the likelihood that these numbers were drawn from a normal distribution with a mean of 120 and standard deviation of 5. The code below may help, but remember that dnorm will return the density (likelihood) of a given set of input data and so, to calculate the overall likelihood of all the data, you will have to do something to all of those likelihoods. Hint: likelihood is just a fancy word for probability, and so how do we combine probabilities to get their joint probability?

dnorm(x=c(-5, 5), mean=0, sd=2)
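As a generic illustration of the pattern question 3 aims to demystify, the sketch below shows how individual dnorm densities combine into a joint likelihood. The data here are made up for illustration (they are deliberately not Michael's measurements), and summing log-densities is shown as a numerically safer alternative to multiplying raw densities:

```r
# Made-up example data (NOT the blood pressures from the exercise)
obs <- c(-1, 0, 2)

# Density (likelihood) of each observation under a normal(mean=0, sd=2)
dens <- dnorm(x=obs, mean=0, sd=2)

# Independent observations: the joint likelihood is the product of the densities
joint <- prod(dens)

# Numerically safer: sum the log-densities, then exponentiate if needed
log.joint <- sum(dnorm(x=obs, mean=0, sd=2, log=TRUE))

all.equal(joint, exp(log.joint))  # the two routes agree (up to rounding)
```

Applying this same pattern to the five measurements above is exactly what part (a) asks of you.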
His eyes light up, and so you decide to fix the estimate of the standard deviation of the distribution at the standard deviation of your data (sd(c(120,130,110,105,121))) and use that to find (roughly) the 95% CI. Find (to the nearest whole number) the upper and lower limits of the mean. Hint: look above at how I define the 95% CIs in log-likelihood units, and remember that the R function for the natural logarithm is log.

(c) Michael is becoming agitated, and says that your estimate is wrong because you assumed the estimate of the SD was the same as the sample’s SD. Reassure him by now fixing your estimate of the mean to be what you found in part (b), and estimate the 95% CIs on the standard deviation.

(d) Michael’s eyes light up, as he realises you have essentially implemented the first pass through something called Brent’s method (you don’t care, you just want him out of your office). He asks you one final thing before he leaves: what is the probability, given all this, that he would find a person with a systolic blood pressure below 100? You remember that you can find this for, say, a value of -1 from a ‘standard normal’ distribution (mean of 0 and standard deviation of 1) with pnorm(-1, mean=0, sd=1), but can you do it for Michael’s value23?

(e) Michael leaves, but you still feel uneasy. You check your own systolic blood pressure and find it’s 150. Given you know, from (d), how to find the area to the left of a distribution, can you figure out how to estimate the probability that you would have a resting blood pressure of at least 150? As a bonus exercise, use this information to reflect upon whether you should be spending as much time with Michael as you do, and whether you should tell someone about him.

4. The Binomial distribution. You cannot become familiar enough with the Binomial distribution; it really does make your life much easier.
While it is always possible to find precise, analytical solutions to Binomial problems, you may find that learning a thing or two about how to simulate draws from it comes in handy from time to time.

22 Later in the course, when we cover machine learning, you are going to see just how awfully confusing it is that ML as an acronym is used for both maximum likelihood and machine learning. Not my fault, don’t shoot the messenger.
23 Are you seeing that pattern in the naming of R probability functions yet?...

(a) Sarah hands you a bag with five marbles in it, two of which are red and the other three black. She says you can pull a marble out and put it back in the bag (i.e., in statistical parlance you are ‘drawing with replacement’) three times. She asks you to calculate, without using the Binomial distribution but by drawing a tree diagram, the probability you will draw three red balls.

(b) She asks you to now estimate the probability you will draw two red balls, again only using a tree diagram. You are so glad that she helped you with that problem with Michael that, honestly, you just go along with it.

(c) Sarah, who makes it clear to you that she finds tree diagrams boring as well, asks you to use the Binomial probability equation your biostatistics lecturer gave you to estimate the answers to (a) and (b) for her. She says her lecturer never gave it to her and she feels hard-done-by, and so would like you to do her a favour.

(d) After calculating the probabilities by hand, Sarah is not so sure she was that hard-done-by. Sarah shares a trick her lecturer taught her with you, which is to use the Binomial distribution density function (dbinom) to estimate the probabilities from parts (a) and (b). She asks if you can do it and, not wanting to look foolish, you glance back at your workings from question 2e and give it a go.
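The dbinom/rbinom relationship that this trick relies on can be sketched generically (the fair-coin numbers below are my own choice, so as not to give away the marble answers): the exact density from dbinom is what a large rbinom simulation approximates.

```r
set.seed(42)  # make the simulated draws reproducible

# Exact probability of exactly 2 heads in 4 tosses of a fair coin
exact <- dbinom(x=2, size=4, prob=0.5)  # choose(4,2) * 0.5^4 = 0.375

# The same probability, approximated by simulating 100,000 draws
draws <- rbinom(n=100000, size=4, prob=0.5)
approx <- mean(draws == 2)  # close to 0.375
```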
(e) Sarah’s so impressed with your ability to think on your feet, she asks if you can work together to estimate the probability of drawing (with replacement) between 30–50 red balls (inclusive) from a bag with 200 balls in total, of which 30 are red and the remaining 170 are black. She knows lots of ways to do it, but feels there’s got to be an easy ‘one-liner’ in R that could do it for her. Hint: Take a look in the help entry for dbinom (?dbinom).

(f) Sarah notices a question posted by Michael (whom you have both now decided to avoid) written in the margin of her class workbook, wondering whether it matters if half of the black balls in the bag are replaced with blue balls. Would this affect any of your answers? Why (not)?

(g) Sarah notices a final entry in her workbook from Michael, which appears to be somewhat hastily written and followed by frantic scratchings. It seems to follow a question about whether it would be possible to repeat these exercises ‘without replacement’: in cases where no ball taken from the bag was returned afterwards. Draw out the probability diagram for part (a) and, on the basis of that, consider whether it was wise of Michael to attempt part (e) using the Binomial distribution.

Session 2

Test statistics

Overview

This is the second of our sessions, and in it we’re going to discuss test statistics, using ‘Student’s’ t-test as an example and then moving on to ANOVAs in general. Test statistics are a fundamental component of statistics, and it’s rarely appreciated that t-tests are only the tip of the iceberg. The t-test is the perfect way to test the difference between the means of two samples of continuous data. The story of the t-test is a funny one, and worth bearing in mind in our modern world of closed-access publications and non-disclosure agreements. The test was developed by William Gosset, who worked for the Guinness Brewing Company and had signed a contract stating anything he developed was company property.
Unwilling to let what is now one of the major foundations of statistics be hidden away by Guinness, he published under the pseudonym ‘Student’.

2.1 The fundamentals of test statistics and the t-test

2.1.1 An introduction to test statistics

Science is all about finding something that we wouldn’t otherwise expect, and the concept of a test statistic was developed to help us measure whether something is surprising to us. Intuitively, you might agree that something was surprising if it reliably differed from our expectations. We can formalise that mathematically as:

t = (o − e) / v    (2.1)

Where o is our observation (e.g., the temperature was 30°C), e is our expectation (e.g., the temperature yesterday was 30°C), and v is the overall variation in our observation (e.g., the temperature fluctuated by ±10°C that day). Taking those things together, we can generate a test statistic—t—that is a composite of the difference in something and the variation associated with it. This formalisation of a test statistic underlies essentially all modern frequentist hypothesis-testing. It allows an investigator to set a particular level of t that they think is sufficiently interesting, and then estimate whether their observation is more or less surprising than that. It is critical that you understand that, to someone using any kind of test statistic, the magnitude of the difference (o − e) is as important a determinant as the background variation (v). Thus something that has a very small effect size (o − e; impact on a system) could have a very large test statistic value if the variation (v) was low. Consequently you must never use a test statistic as an estimate of the magnitude of an effect or pattern—there are no exceptions to this, ever1. So how does one go about deciding whether a particular test statistic value is sufficiently important to warrant further investigation? We compare our test statistic with a distribution of expected values.
That distribution is often scaled according to the number of degrees of freedom left over after estimating that value, since we can be more confident of a test statistic that was calculated from 100 datapoints than we can of one that was calculated from 10. We can rank that observed test statistic within that distribution: that gives us some degree of certainty as to how unusual a particular value was. It’s much easier to understand this with a concrete example, so let’s move on to our first real test statistic: the t-test.

2.1.2 The t-test

A t-test reveals whether there is evidence that two samples of continuous data are drawn from distributions with different means. The simplest definition of t is:

t = (x̄ − ȳ) / SE_diff    (2.2)

Where x̄ is the mean of one sample (x), and ȳ is the mean of another (y). x̄ − ȳ is our observation. Our expectation is that the difference between the two means will be 0, and so there is nothing in equation 2.2 for our expectation because our expectation is nothing! SE_diff is the standard error of the difference in those two means—our measure of how variable the two samples are, and so how much faith we can put in the difference we have observed. For completeness, let’s define it mathematically, but I don’t want you to get stuck on the details:

SE_diff = √(s²_x / n_x + s²_y / n_y)    (2.3)

Where s²_x and s²_y are the observed variances of x and y, respectively, and n_x and n_y are the numbers of observations in x and y, respectively. What matters is that you grasp the intuition of these equations: variation within samples might be mistaken for variation across samples, and so more variable samples will lead to a greater SE_diff and so a smaller t. Equally, we divide our variances by the numbers of observations because, the more times we measure something, the more opportunities we give it to vary. So we know how to calculate a t-value; how do we tell whether it’s sufficiently large that we should be surprised by our result?
First of all, we have to pick what ‘sufficiently large’ means. Most scientists think that a less than 5% chance of seeing something, by chance, is sufficiently surprising. You may think otherwise, but you must pick before you see the result, otherwise you’re cheating. Because the t-test is a frequentist test, our result is surprising at the 5% level if it falls within the most extreme 5% of the t-distribution. Remember: in frequentist land, frequency is probability, so there’s only a 1% chance of falling within the most extreme 1% of the distribution. Another way of phrasing this is that there is a less than 1% chance that we would see a t-value this extreme if there were no real difference between the two distributions. We call this threshold α_crit; if your value fell within the most extreme 5% of the distribution, we would say the result is statistically significant at α_0.05. If we were doing this by hand, we would calculate the density of the t-distribution for a given number of observations and rank our observed t-value within that distribution (see figure 2.2 for an example of this). Of course, nowadays we have computers to do this for us, but it’s important you know how this works because (1) you really ought to know what these numbers mean and (2) if you ever have to do a ‘bootstrap’ analysis you will have to generate the expected distribution yourself. Notice that if we were working to α_0.05 we would need a t-value that was ranked at or beyond the 2.5th or 97.5th quantile: we would be surprised if the difference were really small or really big, and so we have to split our 5% of ranking between these two expectations. Because we care about both possibilities, this is called a two-tailed test: we are testing both the lower and upper tail of the distribution.

1 For an empirical example of this problem see, for example, Pearse et al. 2013; Ecology, 94(12), 2861–2872.
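The two-tailed ranking just described can be sketched with R's qt (quantiles) and pt (cumulative probabilities). The 18 degrees of freedom below are an assumed example of mine (e.g., two samples of 10 observations), not a value taken from the text:

```r
df <- 18  # assumed: two samples of 10 observations, (10 - 1) + (10 - 1)

# Two-tailed critical values at alpha = 0.05: the 2.5th and 97.5th quantiles
crit <- qt(p=c(0.025, 0.975), df=df)  # roughly -2.10 and +2.10

# Rank an observed t within the distribution: a two-tailed p-value
t.obs <- 2.5
p.two.tailed <- 2 * pt(q=-abs(t.obs), df=df)  # below 0.05, so significant here
```

Note that the p-value doubles the one-tail probability precisely because we split our 5% between the two tails.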
If we only cared about one kind of value, we might be able to check only one tail (perhaps the 5th quantile) and we would be performing a one-tailed test. If you find yourself performing a one-tailed test “to make something significant”, you have almost certainly failed as a scientist and so you should stop what you are doing. Last time I told you about 95% confidence intervals; these intervals map directly onto α_0.05. Thus they come with an important caveat: these CIs map out regions where we would be surprised to see certain differences by chance. But that’s subtly different from giving us a probability that we will detect a true difference if one is present. At α_0.05, there is an 80% chance that you will detect a difference between two samples if there is one2. There is also, of course, a 5% chance you will make a mistake: 5% of the time you will see a value more extreme than 95% of the distribution, simply by chance!

2.1.3 Theory paired with practice

Let’s start off by simulating some data: two samples drawn from two similar, but slightly different, distributions. We’ll then plot their differences using a boxplot.
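A sketch of what that simulation might look like (the means, standard deviations, and sample sizes are illustrative choices of mine, not values prescribed by the text), including the t-statistic computed by hand from equations 2.2 and 2.3 alongside R's built-in t.test:

```r
set.seed(123)  # reproducible draws

# Two samples from similar, but slightly different, distributions
x <- rnorm(n=50, mean=10, sd=2)
y <- rnorm(n=50, mean=11, sd=2)

# Plot the two samples side-by-side
boxplot(x, y, names=c("x", "y"))

# t by hand, following equations 2.2 and 2.3
se.diff <- sqrt(var(x)/length(x) + var(y)/length(y))
t.by.hand <- (mean(x) - mean(y)) / se.diff

# R's (Welch) t-test uses the same standard error, so its statistic matches
t.test(x, y)$statistic
```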
