

STATS 331
Introduction to Bayesian Statistics
Brendon J. Brewer

This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/deed.en_GB

Contents

1  Prologue
   1.1  Bayesian and Classical Statistics
   1.2  This Version of the Notes
   1.3  Assessment
2  Introduction
   2.1  Certainty, Uncertainty and Probability
3  First Examples
   3.1  The Bayes' Box
        3.1.1  Likelihood
        3.1.2  Finding the Likelihood Values
        3.1.3  The Mechanical Part
        3.1.4  Interpretation
   3.2  Bayes' Rule
   3.3  Phone Example
        3.3.1  Solution
   3.4  Important Equations
4  Parameter Estimation I: Bayes' Box
   4.1  Parameter Estimation: Bus Example
        4.1.1  Sampling Distribution and Likelihood
        4.1.2  What is the "Data"?
   4.2  Prediction in the Bus Problem
   4.3  Bayes' Rule, Parameter Estimation Version
5  Parameter Estimation: Analytical Methods
   5.1  "∼" Notation
   5.2  The Effect of Different Priors
        5.2.1  Prior 2: Emphasising the Extremes
        5.2.2  Prior 3: Already Being Well Informed
        5.2.3  The Beta Distribution
        5.2.4  A Lot of Data
6  Summarising the Posterior Distribution
   6.1  Point Estimates
        6.1.1  A Very Brief Introduction to Decision Theory
        6.1.2  Absolute Loss
        6.1.3  All-or-nothing Loss
        6.1.4  Invariance of Decisions
        6.1.5  Computing Point Estimates from a Bayes' Box
        6.1.6  Computing Point Estimates from Samples
   6.2  Credible Intervals
        6.2.1  Computing Credible Intervals from a Bayes' Box
        6.2.2  Computing Credible Intervals from Samples
   6.3  Confidence Intervals
7  Hypothesis Testing and Model Selection
   7.1  An Example Hypothesis Test
   7.2  The "Testing" Prior
   7.3  Some Terminology
   7.4  Hypothesis Testing and the Marginal Likelihood
8  Markov Chain Monte Carlo
   8.1  Monte Carlo
        8.1.1  Summaries
   8.2  Multiple Parameters
   8.3  The Metropolis Algorithm
        8.3.1  Metropolis, Stated
   8.4  A Two State Problem
   8.5  The Steady-State Distribution of a Markov Chain
   8.6  Tactile MCMC
9  Using JAGS
   9.1  Basic JAGS Example
   9.2  Checklist for Using JAGS
10 Regression
   10.1  A Simple Linear Regression Problem
   10.2  Interpretation as a Bayesian Question
   10.3  Analytical Solution With Known Variance
   10.4  Solution With JAGS
   10.5  Results for "Road" Data
   10.6  Predicting New Data
   10.7  Simple Linear Regression With Outliers
   10.8  Multiple Linear Regression and Logistic Regression
11 Replacements for t-tests and ANOVA
   11.1  A T-Test Example
        11.1.1  Likelihood
        11.1.2  Prior 1: Very Vague
        11.1.3  Prior 2: They might be equal!
        11.1.4  Prior 3: Alright, they're not equal, but they might be close
   11.2  One Way Anova
        11.2.1  Hierarchical Model
        11.2.2  MCMC Efficiency
        11.2.3  An Alternative Parameterisation
12 Acknowledgements
A  R Background
   A.1  Vectors
   A.2  Lists
   A.3  Functions
   A.4  For Loops
   A.5  Useful Probability Distributions
B  Probability
   B.1  The Product Rule
        B.1.1  Bayes' Rule
   B.2  The Sum Rule
   B.3  Random Variables
        B.3.1  Discrete Random Variables
        B.3.2  Continuous Random Variables
        B.3.3  Shorthand Notation
   B.4  Useful Probability Distributions
C  A Rosetta Stone

Chapter 1: Prologue

This course was originally developed by Dr Wayne Stewart (formerly of The University of Auckland) and was first offered in 2009 (Figure 1.1). I joined the Department of Statistics in July 2012 and took over the course from him. It was good fortune for me that Wayne left the university as I arrived. If I had been able to choose which undergraduate course I would most like to teach, it would have been this one! Wayne is a passionate Bayesian[1] and advocate for the inclusion of Bayesian statistics in the undergraduate statistics curriculum.
I also consider myself a Bayesian and agree that this approach to statistics should form a greater part of statistics education than it does today. While this edition of the course differs from Wayne's in some ways[2], I hope I am able to do the topic justice in an accessible way.

In this course we will use the following software:

- R (http://www.r-project.org/)
- JAGS (http://mcmc-jags.sourceforge.net/)
- The rjags package in R
- RStudio (http://www.rstudio.com/)

You will probably have used R, at least a little bit, in previous statistics courses. RStudio is just a nice program for editing R code, and if you don't like it, you're welcome to use any other text editor. JAGS is in a different category and you probably won't have seen it before. JAGS is used to implement Bayesian methods in a straightforward way, and rjags allows us to use JAGS from within R. Don't worry, it's not too difficult to learn and use JAGS! We will have a lot of practice using it in the labs.

These programs are all free and open source software. That is, they are free to use, share and modify. They should work on virtually any operating system, including the three most popular: Microsoft Windows, Mac OS X and GNU/Linux.

[Figure 1.1: An ad for the original version of this course (then called STATS 390), showing Wayne Stewart with two ventriloquist dolls (Tom Bayes and Freaky Frequentist), who would have debates about which approach to statistics is best.]

In previous editions of the course, another program called WinBUGS was used instead of JAGS. Unfortunately, WinBUGS has not been updated for several years, and only works on Microsoft Windows. Therefore I switched over to JAGS in 2013.

[1] Bayesian statistics has a way of creating extreme enthusiasm among its users. I don't just use Bayesian methods, I am a Bayesian.
[2] The differences are mostly cosmetic. 90% of the content is the same.
The differences between JAGS and WinBUGS are fairly minor, but JAGS has the advantage of being open source and cross-platform. All of this software is already installed on the lab computers, but if you would like to install it on your own computer, instructions are provided on the Course Information Sheet.

1.1 Bayesian and Classical Statistics

Throughout this course we will see many examples of Bayesian analysis, and we will sometimes compare our results with what you would get from classical or frequentist statistics, which is the other way of doing things. You will have seen some classical statistics methods in STATS 10X and 20X (or BioSci 209), and possibly other courses as well. You may have seen and used Bayes' rule before in courses such as STATS 125 or 210. Bayes' rule can sometimes be used in classical statistics, but in Bayesian statistics it is used all the time.

Many people have differing views on the status of these two different ways of doing statistics. In the past, Bayesian statistics was controversial, and you had to be very brave to admit to using it. Many people were anti-Bayesian! These days, instead of Bayesians and anti-Bayesians, it would be more realistic to say there are Bayesians and non-Bayesians, and many of the non-Bayesians would be happy to use Bayesian statistics in some circumstances. The non-Bayesians would say that Bayesian statistics is one way of doing things, and that it is a matter of choice which one you prefer to use. Most Bayesian statisticians think Bayesian statistics is the right way to do things, and that non-Bayesian methods are best thought of as either approximations (sometimes very good ones!) or alternative methods that are only to be used when the Bayesian solution would be too hard to calculate.
Sometimes I may give strongly worded opinions on this issue, but there is one important point that you should keep in mind throughout this course: you do not have to agree with me in order to do well in STATS 331!

1.2 This Version of the Notes

Wayne Stewart taught STATS 331 with his own course notes. When I took over the course, I found that our styles were very different, even though we teach the same ideas. Unfortunately, it was challenging for the students to reconcile my explanations with Wayne's. Therefore I thought it would be better to have my own version of the notes.

These lecture notes are a work in progress, and do not contain everything we cover in the course. There are many things that are important and examinable, but will only be discussed in lectures, labs and assignments!

The plots in these notes were not produced using R, but using a different plotting package where I am more familiar with the advanced plotting features. This means that when I give an R command for a plot, it will not produce a plot that looks exactly like the plot that follows. However, it will give approximately the same plot, conveying the same information. I apologise if you find this inconsistency distracting.

At this stage, the course notes contain the basic material of the course. Some more advanced topics will be introduced and discussed in lectures, labs and assignments. I appreciate any feedback you may have about these notes.

1.3 Assessment

The assessment for this course is broken down as follows:

- 20% Assignments. There will be four assignments, worth 5% each. The assignments are not small, so please do not leave them until the last minute.
- 20% Midterm test (50 minutes, calculators permitted). This will be held in class, in place of a lecture, some time just after the mid-semester break.
- 60% Final exam (two hours, calculators permitted).
Chapter 2: Introduction

Every day, throughout our lives, we are required to believe certain things and not to believe other things. This applies not only to the "big questions" of life, but also to trivial matters, and everything in between. For example, this morning I boarded the bus to university, sure that it would actually take me here and not to Wellington. How did I know the bus would not take me to Wellington? Well, for starters, I have taken the same bus many times before and it has always taken me to the university. Another clue was that the bus said "Midtown" on it; a bus to Wellington probably would have said "Wellington", and would not have stopped at a minor bus stop in suburban Auckland. None of this evidence proves that the bus would take me to university, but it does make it very plausible. Given all these pieces of information, I feel quite certain that the bus will take me to the city. I feel so certain about this that the possibility of an unplanned trip to Wellington never even entered my mind until I decided to write this paragraph.

Somehow, our brains are very often able to accurately predict the correct answer to many questions (e.g. the destination of a bus), even though we don't have all the information that we would need to be 100% certain. We do this using our experience of the world and our intuition, usually without much conscious attention or problem solving. However, there are areas of study where we can't just use our intuition to make judgments like this. For example, most of science involves such situations. Does a new treatment work better than an old one? Is the expansion of the universe really accelerating? People tend to be interested in trying to answer questions that haven't been answered yet, so our attention is always on the questions where we're not sure of the answer.
This is where statistics comes in as a tool to help us in this grey area, when we can't be 100% certain about things but we still want to do the best we can with our incomplete information.

2.1 Certainty, Uncertainty and Probability

In the above example, I said things like "I couldn't be 100% certain". The idea of using a number to describe how certain you are is quite natural. For example, contestants on the TV show "Who Wants to be a Millionaire?" often say things like "I'm 75% sure the answer is A"[1]. There are some interesting things to notice about this statement. Firstly, it is a subjective statement. If someone else were in the seat trying to answer the question, she might say the probability that A is correct is 100%, because she knows the answer! A third person faced with the same question might say the probability is 25%, because he has no idea and only knows that one of the four answers must be correct.

In Bayesian statistics, the interpretation of what probability means is that it is a description of how certain you are that some statement, or proposition, is true. If the probability is 1, you are sure that the statement is true. So sure, in fact, that nothing could ever change your mind (we will demonstrate this in class). If the probability is 0, you are sure that the proposition is false. If the probability is 0.5, then you are as uncertain as you would be about a fair coin flip. If the probability is 0.95, then you're quite sure the statement is true, but it wouldn't be too surprising to you if you found out the statement was false. See Figure 2.1 for a graphical depiction of probabilities as degrees of certainty or plausibility.

[Figure 2.1: Probability can be used to describe degrees of certainty, or how plausible some statement is. The scale runs from 0 to 1, with labels such as "I am very sure that it's false, but I could be wrong", "I'm very uncertain", "It's a toss-up", and "Somewhat sure it is true".]
0 and 1 are the two extremes of the scale and correspond to complete certainty. However, probabilities are not static quantities. When you get more information, your probabilities can change. In Bayesian statistics, probabilities are in the mind, not in the world.

It might sound like there is nothing more to Bayesian statistics than just thinking about a question and then blurting out a probability that feels appropriate. Fortunately for us, there's more to it than that! To see why, think about how you change your mind when new evidence (such as a data set) becomes available. For example, you may be on "Who Wants to be a Millionaire?" and not know the answer to a question, so you might think the probability that it is A is 25%. But if you call your friend using "phone a friend", and your friend says, "It's definitely A", then you would be much more confident that it is A! Your probability probably wouldn't go all the way to 100% though, because there is always the small possibility that your friend is mistaken.

When we get new information, we should update our probabilities to take the new information into account. Bayesian methods tell us exactly how to do this. In this course, we will learn how to do data analysis from a Bayesian point of view. So while the discussion in this chapter might sound a bit like philosophy, we will see that using this kind of thinking can give us new and powerful ways of solving practical data analysis problems. The methods we will use will all have a common structure, so if you are faced with a completely new data analysis problem one day, you will be able to design your own analysis methods by using the Bayesian framework.

[1] This reminds me of an amusing exchange from the TV show Monk. Captain Stottlemeyer: [about someone electrocuting her husband] Monk, are you sure? I mean, are you really sure? And don't give me any of that "95 percent" crap. Monk: Captain, I am 100% sure... that she probably killed him.
Best of all, the methods make sense and perform extremely well in practice!

Chapter 3: First Examples

We will now look at a simple example to demonstrate the basics of how Bayesian statistics works. We start with some probabilities at the beginning of the problem (these are called prior probabilities), and see how exactly these get updated when we get more information (these updated probabilities are called posterior probabilities). To help make things clearer, we will use a table that we will call a Bayes' Box to help us calculate the posterior probabilities easily.

Suppose there are two balls in a bag. We know in advance that at least one of them is black, but we're not sure whether they're both black, or whether one is black and one is white. These are the only two possibilities we will consider. To keep things concise, we can label our two competing hypotheses. We could call them whatever we want, but I will call them BB and BW. So, at the beginning of the problem, we know that one and only one of the following statements/hypotheses is true:

BB: Both balls are black.
BW: One ball is black and the other is white.

Suppose an experiment is performed to help us determine which of these two hypotheses is true. The experimenter reaches into the bag, pulls out one of the balls, and observes its colour. The result of this experiment is (drumroll please!):

D: The ball that was removed from the bag was black.

We will now do a Bayesian analysis of this result.

3.1 The Bayes' Box

A Bayesian analysis starts by choosing some values for the prior probabilities. We have our two competing hypotheses BB and BW, and we need to choose some probability values to describe how sure we are that each of these is true. Since we are talking about two hypotheses, there will be two prior probabilities, one for BB and one for BW. For simplicity, we will assume that we don't have much of an idea which is true, and so we will use the following prior probabilities:

P(BB) = 0.5    (3.1)
P(BW) = 0.5.   (3.2)

Pay attention to the notation. The upper case P stands for probability, and if we just write P(whatever), that means we are talking about the prior probability of whatever. We will see the notation for the posterior probability shortly. Note also that the two hypotheses are mutually exclusive (they can't both be true) and exhaustive (one of them is true; it can't be some undefined third option). We will almost always consider mutually exclusive and exhaustive hypotheses in this course[1]. The choice of 0.5 for the two prior probabilities describes the fact that, before we did the experiment, we were very uncertain about which of the two hypotheses was true.

I will now present a Bayes' Box, which lists all the hypotheses (in this case two) that might be true, and the prior probabilities. There are some extra columns which we haven't discussed yet, and which will be needed in order to figure out the posterior probabilities in the final column.

Hypotheses   prior   likelihood   prior × likelihood   posterior
BB           0.5
BW           0.5
Totals:      1

The first column of a Bayes' Box is just the list of hypotheses we are considering. In this case there are just two. If you need to construct a Bayes' Box for a new problem, just think about what the possible answers to the problem are, and list them in the first column. The second column lists the prior probabilities for each of the hypotheses. Above, before we did the experiment, we decided to say that there was a 50% probability that BB is true and a 50% probability that BW is true, hence the 0.5 values in this column. The prior column should always sum to 1. Remember, the prior probabilities only describe our initial uncertainty, before taking the data into account. Hopefully the data will help by changing these probabilities to something a bit more decisive.
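The bookkeeping of this partially filled Bayes' Box can be done directly in R. Here is a minimal sketch (the object and column names are my own illustrative choices, not notation from the notes):

```r
# Sketch: the two-ball Bayes' Box with only the hypotheses and priors
# filled in (object and column names are illustrative choices)
bayes_box <- data.frame(
  hypothesis = c("BB", "BW"),
  prior      = c(0.5, 0.5)
)
print(bayes_box)

# The prior column should always sum to 1:
print(sum(bayes_box$prior))
```

The likelihood, prior × likelihood, and posterior columns can then be added one at a time as the calculation proceeds.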
[1] If this does not appear to be true in a particular problem, it is usually possible to redefine the various hypotheses into a set of hypotheses that are mutually exclusive and exhaustive.

3.1.1 Likelihood

The third column is called likelihood, and this is a really important column where the action happens. The likelihood is a quantity that will be used for calculating the posterior probabilities. In colloquial language, likelihood is synonymous with probability; it means the same thing. However, in statistics, likelihood is a very specific kind of probability. To fill in the third column of the Bayes' Box, we need to calculate two likelihoods, so you can tell from this that the likelihood is something different for each hypothesis. But what is it exactly?

The likelihood for a hypothesis is the probability that you would have observed the data, if that hypothesis were true. The values can be found by going through each hypothesis in turn, imagining it is true, and asking, "What is the probability of getting the data that I observed?". Here is the Bayes' Box with the likelihood column filled in. I will explain how these numbers were calculated in a bit more detail in the next subsection.

Hypotheses   prior   likelihood   h = prior × likelihood   posterior
BB           0.5     1
BW           0.5     0.5
Totals:      1

If you have taken STATS 210 and used the maximum likelihood method, where you find the value of a parameter that maximises the likelihood function, that is the same likelihood we use in this course! So you have a head start in understanding this concept.

3.1.2 Finding the Likelihood Values

We will first calculate the value of the likelihood for the BB hypothesis. Remember, the data we are analysing here is that we chose one of the balls in the bag "at random", and it was black. The likelihood for the BB hypothesis is therefore the probability that we would get a black ball if BB is true. Imagine that BB is true. That means both balls are black.
What is the probability that the experiment would result in a black ball? That's easy: it's 100%! So we put the number 1 in the Bayes' Box as the likelihood for the BB hypothesis.

Now imagine instead that BW is true. That would mean one ball is black and the other is white. If this were the case and we did the experiment, what would be the probability of getting the black ball in the experiment? Since one of the two balls is black, the chance of choosing this one is 50%. Therefore, the likelihood for the BW hypothesis is 0.5, and that's why I put 0.5 in the Bayes' Box for the likelihood for BW.

In general, the likelihood is the probability of the data that you actually got, assuming a particular hypothesis is true. In this example it was fairly easy to get the likelihoods directly by asking "if this hypothesis is true, what is the probability of getting the black ball when we do the experiment?". Sometimes this is not so easy, and it can be helpful to think about ALL possible experimental outcomes/data you might have seen, even though ultimately you just need to select the one that actually occurred. Table 3.1 shows an example of this process. The fact that only the blue probabilities in Table 3.1 enter the Bayes' Box calculation is related to the likelihood principle, which we will discuss in lectures. Note also that in Table 3.1, the probabilities for the different possible data sets add to 1 within each hypothesis, but the sum of the blue "selected" likelihood values is not 1 (it is, in fact, meaningless).

Hypotheses   Possible Data   Probability
BB           Black Ball      1
             White Ball      0
BW           Black Ball      0.5
             White Ball      0.5

Table 3.1: This table demonstrates a method for calculating the likelihood values, by considering not just the data that actually occurred, but all data that might have occurred. Ultimately, it is only the probability of the data which actually occurred that matters, so this is highlighted in blue.
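The logic of Table 3.1 can be sketched in R: under each hypothesis, the probabilities of all possible outcomes sum to 1, and the likelihoods are formed by picking out the probability of the outcome that actually occurred (the black ball). This is an illustrative sketch, not code from the notes:

```r
# Probabilities of every possible outcome under each hypothesis (Table 3.1)
p_data_given_BB <- c(black = 1.0, white = 0.0)
p_data_given_BW <- c(black = 0.5, white = 0.5)

# Within each hypothesis, the outcome probabilities sum to 1
print(sum(p_data_given_BB))  # 1
print(sum(p_data_given_BW))  # 1

# The likelihoods are the probabilities of the data actually observed
# (the "blue" entries in Table 3.1): a black ball
lik <- c(BB = p_data_given_BB[["black"]], BW = p_data_given_BW[["black"]])
print(lik)
# Note: the selected likelihood values (1 and 0.5) need not sum to 1
```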
When we come to parameter estimation in later chapters, we will usually set up our problems in this way, by considering what data sets are possible, and assigning probabilities to them.

3.1.3 The Mechanical Part

The fourth column of the Bayes' Box is the product of the prior probabilities and the likelihoods, calculated by simple multiplication. The result will be called "prior times likelihood", but occasionally we will use the letter h for these quantities. This is the unnormalised posterior. It does not sum to 1 as the posterior probabilities should, but it is at least proportional to the actual posterior probabilities. To find the posterior probabilities, we take the prior × likelihood column and divide it by its sum, producing numbers that do sum to 1. This gives us the final posterior probabilities, which were the goal all along. The completed Bayes' Box is shown below:

Hypotheses   prior   likelihood   h = prior × likelihood   posterior
BB           0.5     1            0.5                       0.667
BW           0.5     0.5          0.25                      0.333
Totals:      1                    0.75                      1

We can see that the posterior probabilities are not the same as the prior probabilities, because we have more information now! The experimental result made BB a little bit more plausible than it was before. Its probability has increased from 1/2 to 2/3.

3.1.4 Interpretation

The posterior probabilities of the hypotheses are proportional to the prior probabilities times the likelihoods. A high prior probability will help a hypothesis have a high posterior probability. A high likelihood value also helps. To understand what this means about reasoning, consider the meanings of the prior and the likelihood. There are two things that can contribute to a hypothesis being plausible:

- The prior probability is high. That is, the hypothesis was already plausible before we got the data.
- The hypothesis predicted the data well. That is, the data was what we would have expected to occur if the hypothesis had been true.
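The mechanical part just described (multiply prior by likelihood, sum, then divide by the sum) can be carried out in R. Here is a minimal sketch of the two-ball calculation, with variable names echoing the h notation used above:

```r
# Sketch: the completed Bayes' Box for the two-ball problem
prior <- c(BB = 0.5, BW = 0.5)
lik   <- c(BB = 1.0, BW = 0.5)

h    <- prior * lik   # prior times likelihood (unnormalised posterior)
post <- h / sum(h)    # normalise so the posterior sums to 1

print(h)     # 0.50 and 0.25, summing to 0.75
print(post)  # 2/3 and 1/3, matching the completed Bayes' Box
```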
I hope you agree that this is all very sensible. In class we will also study variations on this problem, considering different assumptions about the prior probabilities and how they affect the results, and also considering what happens when we get more and/or different data.

3.2 Bayes' Rule

Bayes' rule is an equation from probability theory, shown in Figure 3.1. The various terms in Bayes' rule are all probabilities, but notice that there are conditional probabilities in there. For example, the left hand side of the equation is P(A|B), and that means the probability of A given B. That is, it's the probability of A after taking into account the information B. In other words, P(A|B) is a posterior probability, and Bayes' rule tells us how to calculate it from other probabilities.

[Figure 3.1: A blue neon sign displaying Bayes' rule. You can use it to calculate the probability of A given B, if you know the values of some other probabilities on the right hand side. Image credit: Matt Buck. Obtained from Wikimedia Commons.]

Bayes' rule is true for any statements A and B. If you took the equation in Figure 3.1 and replaced A with "Kākāpō will survive beyond 2050" and B with "I had coffee this morning", the resulting equation would still be true[2].

It is helpful to relabel A and B in Bayes' rule to give a clearer interpretation of how the equation is to be used. In this version of Bayes' rule (which is one you should commit to memory), A has been replaced by H, and B has been replaced by D. The reason for these letters is that you should interpret H as hypothesis and D as data. Then you can interpret Bayes' rule as telling you the probability of a hypothesis given some data; in other words, a posterior probability.

[2] It would still be true, but it would not be very interesting, because whether or not I had coffee doesn't tell you much about the survival prospects of endangered New Zealand parrots.
P(H|D) = P(H) P(D|H) / P(D)    (3.3)

In Bayesian statistics, most of the terms in Bayes' rule have special names. Some of them even have more than one name, with different scientific communities preferring different terminology. Here is a list of the various terms and the names we will use for them:

- P(H|D) is the posterior probability. It describes how certain or confident we are that hypothesis H is true, given that we have observed data D. Calculating posterior probabilities is the main goal of Bayesian statistics!
- P(H) is the prior probability, which describes how sure we were that H was true, before we observed the data D.
- P(D|H) is the likelihood. If you were to assume that H is true, this is the probability that you would have observed data D.
- P(D) is the marginal likelihood. This is the probability that you would have observed data D, whether H is true or not.

Since you may encounter Bayesian methods outside of STATS 331, I have included an appendix called "Rosetta Stone" that lists some common alternative terminology.

In the above example, we did some calculations to work out the numbers in the Bayes' Box, particularly the posterior probabilities, which are the ultimate goal of the calculation. What we were actually doing in these calculations was applying Bayes' rule. We actually applied Bayes' rule twice: once to compute P(BB|D), and a second time to calculate P(BW|D). When you use a Bayes' Box to calculate posterior probabilities, you are really just applying Bayes' rule a lot of times: once for each hypothesis listed in the first column.

3.3 Phone Example

This example is based on Question 1 from the 2012 final exam. I got the idea for this question from an example in David MacKay's wonderful book "Information Theory, Inference and Learning Algorithms" (available online as a free PDF download; you're welcome to check it out, but it is a large book and only about 20% of the content is relevant to this course!).
You move into a new house which has a phone installed. You can't remember the phone number, but you suspect it might be 555-3226 (some of you may recognise this as the phone number for Homer Simpson's "Mr Plow" business). To test this hypothesis, you carry out an experiment by picking up the phone and dialing 555-3226. If you are correct about the phone number, you will definitely hear a busy signal because you are calling yourself. If you are incorrect, the probability of hearing a busy signal is 1/100. However, all of that is only true if you assume the phone is working, and it might be broken! If the phone is broken, it will always give a busy signal.

When you do the experiment, the outcome (the data) is that you do actually get the busy signal. The question asked us to consider the following four hypotheses, and to calculate their posterior probabilities:

Hypothesis   Description                                     Prior Probability
H1           Phone is working and 555-3226 is correct        0.4
H2           Phone is working and 555-3226 is incorrect      0.4
H3           Phone is broken and 555-3226 is correct         0.1
H4           Phone is broken and 555-3226 is incorrect       0.1

Table 3.2: The four hypotheses about the state of the phone and the phone number. The prior probabilities are also given.

Note that the four hypotheses are mutually exclusive and exhaustive. If you were to come up with hypotheses yourself, "phone is working" and "555-3226 is correct" might spring to mind. They wouldn't be mutually exclusive, so you couldn't do a Bayes' Box with just those two, but it is possible to put these together (using "and") to make the four mutually exclusive options in the table.

3.3.1 Solution

We will go through the solution using a Bayes' Box. The four hypotheses listed in Table 3.2 and their prior probabilities are given, so we can fill out the first two columns of a Bayes' Box right away. The next thing we need is the likelihoods.
Hypotheses   prior   likelihood   prior × likelihood   posterior
H1           0.4
H2           0.4
H3           0.1
H4           0.1
Totals:      1

The outcome of the experiment (the data) was the busy signal, so we need to work out P(busy signal|H) for each H in the problem (there are four of them).

Let's start (naturally!) with H1. If we assume H1 is true, then the phone is working and 555-3226 is the correct phone number. In that case, we would definitely get a busy signal because we are calling ourselves. Therefore P(busy signal|H1) = 1 is our first likelihood value.

Next, let's imagine that H2 is true, so the phone is working, but 555-3226 is not the right phone number. In this case, it is given in the question that the probability of getting a busy signal is 1/100 or 0.01 (in reality, this would be based on some other data, or perhaps be a totally subjective judgement). Therefore P(busy signal|H2) = 0.01, and that's our second likelihood value.

The likelihoods for H3 and H4 are quite straightforward because they both imply the phone is broken, and that means a busy signal is certain. Therefore P(busy signal|H3) = P(busy signal|H4) = 1. We have our four likelihoods, and can proceed to work out everything in the Bayes' Box, including the main goal – the posterior probabilities! Here it is:

Hypotheses   prior   likelihood   prior × likelihood   posterior
H1           0.4     1            0.4                  0.662
H2           0.4     0.01         0.004                0.00662
H3           0.1     1            0.1                  0.166
H4           0.1     1            0.1                  0.166
Totals:      1                    0.604                1

To conclude this phone problem, I should admit that I actually calculated the numbers in the Bayes' Box using R. My code is shown below. A lot of the code we write in labs will look like this. Obviously in the 2012 exam the students had to use their calculators instead.
prior = c(0.4, 0.4, 0.1, 0.1)   # Vector of prior probs
lik = c(1, 0.01, 1, 1)          # Vector of likelihoods
h = prior*lik
Z = sum(h)                      # Sum of prior times likelihood
post = prior*lik/Z              # Normalise to get posterior

# Look at all the results
print(prior)
print(lik)
print(h)
print(Z)
print(post)

Now let's try to see if this makes sense. There are many things we could think about, but let's just consider the question of whether the phone is working or not. The first two hypotheses correspond to the phone being in a working state. If you want to calculate the probability of A or B, then you can just add the probabilities if they are mutually exclusive. The prior probability that the phone is working is therefore:

P(phone working) = P(H1 ∨ H2)      (3.4)
                 = P(H1) + P(H2)   (3.5)
                 = 0.4 + 0.4       (3.6)
                 = 0.8.            (3.7)

Here, I have introduced the notation ∨, meaning "logical or": For any two propositions A, B, the proposition (A ∨ B) is true if either one of A or B is true (or both). The posterior probability is worked out in a similar way, but using the posterior probabilities instead of the prior ones:

P(phone working|busy signal) = P(H1 ∨ H2|busy signal)                  (3.8)
                             = P(H1|busy signal) + P(H2|busy signal)   (3.9)
                             = 0.662 + 0.00662                         (3.10)
                             = 0.6689.                                 (3.11)

Our probability that the phone is working has gone down a little bit as a result of this evidence! That makes sense to me. A busy signal is what you would expect to happen if the phone was broken. This data doesn't prove the phone is broken, but it does point in that direction a little bit, and hence the probability that the phone is working has been reduced from 0.8 to 0.6689.

3.4 Important Equations

Posterior probabilities are calculated using Bayes' rule. For a single hypothesis H given data D, Bayes' rule is:

P(H|D) = P(H) P(D|H) / P(D)   (3.12)

This gives the posterior probability P(H|D) in terms of the prior probability P(H), the likelihood P(D|H) and the marginal likelihood P(D) in the denominator.
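Bayes' rule and the phone example's numbers can be cross-checked numerically. Here is a small sketch in Python (the course itself uses R, as in the listing above; this is only an independent check, and the variable names are mine):

```python
# Prior probabilities and likelihoods for H1..H4 in the phone example
prior = [0.4, 0.4, 0.1, 0.1]
lik = [1.0, 0.01, 1.0, 1.0]

# Bayes' rule for each hypothesis: posterior_i = prior_i * lik_i / Z,
# where Z = P(D) is the sum of the prior times likelihood column
h = [p * l for p, l in zip(prior, lik)]
Z = sum(h)                          # marginal likelihood, 0.604
post = [hi / Z for hi in h]

# Marginalising with the sum rule:
# P(phone working | busy signal) = P(H1|D) + P(H2|D)
prior_working = prior[0] + prior[1]   # 0.8 before the data
post_working = post[0] + post[1]      # about 0.6689 after the data
```

The final two lines reproduce the 0.8 to 0.6689 drop discussed above.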
To obtain P(H), think about your prior beliefs (which may indicate a large amount of uncertainty, or may already be well informed based on previous data sets). To obtain P(D|H), think about what the experiment is doing: If H is true, what data would you expect to see and with what probabilities?

The denominator is the probability of obtaining the data D but without assuming that H is either true or false. This is obtained using the sum rule. There are two ways that the data D could occur, either via the route of H being true (this has probability P(H)P(D|H)), or via the route of H being false (this has probability P(H̄)P(D|H̄)). These two ways are mutually exclusive, so we can add their probabilities:

P(D) = P(H)P(D|H) + P(H̄)P(D|H̄).   (3.13)

Bayes' rule can be applied to a whole set of hypotheses (that are mutually exclusive and exhaustive) simultaneously. This is a more common way of using it, and it is the way we use it when we use a Bayes' Box. If we applied Equation 3.12 to N hypotheses H1, H2, ..., HN, given data D, we would get the following for the posterior probability of each hypothesis Hi (for i = 1, 2, ..., N):

P(Hi|D) = P(Hi) P(D|Hi) / P(D)   (3.14)

The denominator P(D) is a single number. It does not depend on the index i. It can again be obtained using the sum rule. There are N mutually exclusive ways that the data D could have occurred: via H1 being true, or via H2 being true, etc. Adding the probabilities of these gives:

P(D) = Σ_{i=1}^{N} P(Hi) P(D|Hi),   (3.15)

which just happens to be the sum of the prior times likelihood values. If you don't find equations particularly easy to read, just remember that following the steps for making a Bayes' Box is equivalent to applying Bayes' rule in this form! The P(Hi) values are the prior probability column, the P(D|Hi) values are the likelihood column, and the denominator is the sum of the prior times likelihood column. For example, the posterior
probability for H1 (the top right entry in a Bayes' Box) is given by the prior probability for H1 times the likelihood for H1, divided by the sum of prior times likelihood values. That is, P(H1|D) = P(H1)P(D|H1)/P(D). The correspondence between the probabilities that go in a Bayes' Box (in general) and the terms in the Equations are given in Table 3.3.

Hypotheses   prior    likelihood   prior × likelihood   posterior
H1           P(H1)    P(D|H1)      P(H1) × P(D|H1)      P(H1|D)
H2           P(H2)    P(D|H2)      P(H2) × P(D|H2)      P(H2|D)
...          ...      ...          ...                  ...
Totals:      1                     P(D)                 1

Table 3.3: A general Bayes' Box. Using Bayes' rule or making a Bayes' Box are actually the same thing, and this table can be used to identify the terms.

Chapter 4 Parameter Estimation I: Bayes' Box

One of the most important times to use Bayes' rule is when you want to do parameter estimation. Parameter estimation is a fairly common situation in statistics. In fact, it is possible to interpret almost any problem in statistics as a parameter estimation problem and approach it in this way!

Firstly, what is a parameter? One way to think of a parameter is that it is just a fancy term for a quantity or a number that is unknown1. For example, how many people are currently in New Zealand? Well, a Google search suggests 4.405 million. But that does not mean there are exactly 4,405,000 people. It could be a bit more or a bit less. Maybe it is 4,405,323, or maybe it is 4,403,886. We don't really know. We could call the true number of people in New Zealand right now θ, or we could use some other letter or symbol if we want. When talking about parameter estimation in general we often call the unknown parameter(s) θ, but in specific applications we will call the parameter(s) something else more appropriate for that application. The key is to realise that we can use the Bayes' Box, like in previous chapters. But now, our list of possible hypotheses is a list of possible values for the unknown parameter.
For example, a Bayes' Box for the precise number of people in New Zealand might look something like the one in Table 4.1.

Possible Hypotheses   prior      likelihood   prior × likelihood   posterior
...                   ...        ...          ...                  ...
θ = 4404999           0.000001   ...          ...                  ...
θ = 4405000           0.000001   ...          ...                  ...
θ = 4405001           0.000001   ...          ...                  ...
θ = 4405002           0.000001   ...          ...                  ...
θ = 4405003           0.000001   ...          ...                  ...
θ = 4405004           0.000001   ...          ...                  ...
θ = 4405005           0.000001   ...          ...                  ...
θ = 4405006           0.000001   ...          ...                  ...
...                   ...        ...          ...                  ...
Totals:               1                       ...                  1

Table 4.1: An example of how a Bayes' Box may be used in a parameter estimation situation.

There are a few things to note about this Bayes' Box. Firstly, it is big, which is why I just put a bunch of "..."s in there instead of making up numbers. There are lots of possible hypotheses, each one corresponding to a possible value for θ. The prior probabilities I have put in the second column were for illustrative purposes. They needn't necessarily all be equal (although that is often a convenient assumption). All the stuff we've seen in smaller examples of Bayes' rule and/or use of a Bayes' Box still applies here. The likelihoods will still be calculated by seeing how the probability of the data depends on the value of the unknown parameter. You still go through all the same steps, multiplying prior times likelihood and then normalising that to get the posterior probabilities for all of the possibilities listed. Note that a set of possible values together with the probabilities is what is commonly termed a probability distribution.

1 Another use for the term parameter is any quantity that something else depends on. For example, a normal distribution has a mean µ and a standard deviation σ that defines which normal distribution we are talking about. µ and σ are then said to be parameters of the normal distribution.

In basic Bayesian problems, like in the introductory chapters, we start with some prior probabilities and update them to get
posterior probabilities. In parameter estimation, we start with a prior distribution for the unknown parameter(s) and update that to get a posterior distribution for the unknown parameter(s). A quantity which has a probability associated with each possible value is traditionally called a "random variable". Random variables have probability distributions associated with them. In Bayesian stats, an unknown parameter looks mathematically like a "random variable", but I try to avoid the word random itself because it usually has connotations about something that fluctuates or varies. In Bayesian statistics, the prior distribution and posterior distribution only describe our uncertainty. The actual parameter is a single fixed number.

4.1 Parameter Estimation: Bus Example

This is a beginning example of parameter estimation from a Bayesian point of view. It shows the various features that are always present in a Bayesian parameter estimation problem. There will be a prior distribution, the likelihood, and the posterior distribution. We will spend a lot of time on this problem but keep in mind that this is just a single example, and certain things about this example (such as the choice of the prior and the likelihood) are specific to this example only, while other things about it are very general and will apply in all parameter estimation problems. You will see and gain experience with different problems in lectures, labs, and assignments.

After moving to Auckland, I decided that I would take the bus to work each day. However, I wasn't very confident with the bus system in my new city, so for the first week I just took the first bus that came along and was heading in the right direction, towards the city. In the first week, I caught 5 morning buses. Of these 5 buses, two of them took me to the right place, while three of them took me far from work, leaving me with an extra 20 minute walk.
Given this information, I would like to try to infer the proportion of the buses that are "good", that would take me right to campus. Let us call this fraction θ and we will infer θ using the Bayesian framework. We will start with a prior distribution that describes initial uncertainty about θ and update this to get the posterior distribution, using the data that 2/5 buses I took were "good".

First we must think about the meaning of the parameter θ in our particular problem so we can choose a sensible prior distribution. Since θ is, in this example, a proportion, we know it cannot be less than 0 or greater than 1. In principle, θ could be any real value between 0 and 1. To keep things simple to begin with, we shall make an approximation and assume that the set of possible values for θ is:

{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}.

This discrete approximation means that we can use a Bayes' Box. The first things to fill out in the Bayes' Box are the possible values and the prior probabilities (the prior distribution). For starters, let us assume that before we got the data (two successes out of 5 trials), we were very uncertain about the value of θ, and this can be modelled by using a uniform prior distribution. There are 11 possible values for θ that are being considered with our discrete approximation, so the probability of each is 1/11 = 0.0909. The partially complete Bayes' Box is given in Table 4.2. Note the new notation that I have put in the column titles. We will use this notation in all of our parameter estimation examples (although the parameter(s) and data may have different symbols when θ and x respectively are not appropriate).
possible values   prior    likelihood   prior × likelihood   posterior
θ                 p(θ)     p(x|θ)       p(θ)p(x|θ)           p(θ|x)
0                 0.0909
0.1               0.0909
0.2               0.0909
0.3               0.0909
0.4               0.0909
0.5               0.0909
0.6               0.0909
0.7               0.0909
0.8               0.0909
0.9               0.0909
1                 0.0909
Totals            1                                          1

Table 4.2: Starting to make a Bayes' Box for the bus problem. This one just has the possible parameter values and the prior distribution.

To get the likelihoods, we need to think about the properties of our experiment. In particular, we should imagine that we knew the value of θ and were trying to predict what experimental outcome (data) would occur. Ultimately, we want to find the probability of our actual data set (2 out of the 5 buses were "good"), for all of our possible θ values. Recall that, if there are N repetitions of a "random experiment" and the "success" probability is θ at each repetition, then the number of "successes" x has a binomial distribution:

p(x|θ) = (N choose x) θ^x (1 − θ)^(N−x),   (4.1)

where (N choose x) = N!/(x!(N−x)!). This is the probability mass function for x (if we imagine θ to be known), hence the notation p(x|θ), read as "the probability distribution for x given θ". Since there are five trials (N = 5) in the bus problem, the number of successes x must be one of 0, 1, 2, 3, 4, or 5. If θ is a high number close to 1, then we would expect the resulting value of the data (number of successes) x to be something high like 4 or 5. Low values for x would still be possible but they would have a small probability. If θ is a small number, we would expect the data to be 0, 1, or 2, with less probability for more than 2 successes. This is just saying in words what is written precisely in Equation 4.1. The probability distribution for the data x is plotted in Figure 4.1 for three illustrative values of the parameter θ.
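Equation 4.1 can be evaluated directly. Here is a sketch in pure Python (in R one would just call dbinom(x, N, theta); the function name binomial_pmf is my own), checking the three distributions shown in Figure 4.1 and the likelihood of the observed data x = 2:

```python
from math import comb

def binomial_pmf(x, N, theta):
    """p(x|theta) = (N choose x) theta^x (1 - theta)^(N - x), as in Equation 4.1."""
    return comb(N, x) * theta**x * (1.0 - theta)**(N - x)

N = 5
for theta in [0.3, 0.5, 0.8]:
    dist = [binomial_pmf(x, N, theta) for x in range(N + 1)]
    # A valid probability mass function sums to 1 over x = 0..5
    assert abs(sum(dist) - 1.0) < 1e-12

# The likelihood of the observed data x = 2, at theta = 0.4:
lik_04 = binomial_pmf(2, N, 0.4)   # 10 * 0.4^2 * 0.6^3 = 0.3456
```

The value 0.3456 is exactly the θ = 0.4 entry of the likelihood column computed below.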
Figure 4.1: The binomial probability distribution for the data x, for three different values of the parameter θ (θ = 0.3, 0.5 and 0.8). If θ is low then we would expect to see lower values for the data. If θ is high then high values are more probable (but all values from 0 to 5 inclusive are still possible). The actual observed value of the data was x = 2. If we focus only on the values of the curves at x = 2, then the heights of the curves give the likelihood values for these three illustrative values of θ.

To obtain the actual likelihood values that go into the Bayes' Box, we can simply substitute in the known values N = 5 and x = 2:

P(x = 2|θ) = (5 choose 2) θ^2 (1 − θ)^(5−2)   (4.2)
           = 10 × θ^2 (1 − θ)^3.              (4.3)

The resulting equation depends on θ only! We can go through the list of θ values and get a numerical answer for the likelihood P(x = 2|θ), which is what we need for the Bayes' Box.

possible values   prior    likelihood   prior × likelihood   posterior
θ                 p(θ)     p(x|θ)       p(θ)p(x|θ)           p(θ|x)
0                 0.0909   0            0                    0
0.1               0.0909   0.0729       0.0066               0.0437
0.2               0.0909   0.2048       0.0186               0.1229
0.3               0.0909   0.3087       0.0281               0.1852
0.4               0.0909   0.3456       0.0314               0.2074
0.5               0.0909   0.3125       0.0284               0.1875
0.6               0.0909   0.2304       0.0209               0.1383
0.7               0.0909   0.1323       0.0120               0.0794
0.8               0.0909   0.0512       0.0047               0.0307
0.9               0.0909   0.0081       0.0007               0.0049
1                 0.0909   0            0                    0
Totals            1                     0.1515               1

Table 4.3: The completed Bayes' Box for the bus problem (using a binomial distribution to obtain the likelihood).

The final steps are, as usual, to multiply the prior by the likelihood and then normalise that to get the posterior distribution. The completed Bayes' Box is given in Table 4.3. There are a few interesting values in the likelihood column that should help you to understand the concept of likelihood a bit better. Look at the likelihood for θ = 0: it is zero. What does this mean?
It means that if we imagine θ = 0 is the true solution, the probability of obtaining the data that we got (x = 2 successes) would be zero. That makes sense! If θ = 0, it means none of the buses are the "good" buses, so how could I have caught a good bus twice? The probability of that is zero. The likelihood for θ = 1 is also zero for similar reasons. If all of the buses are good, then having 2/5 successes is impossible. You would get 5/5 with 100% certainty. So P(x = 2|θ = 1) = 0.

The likelihood is highest for θ = 0.4, which just so happens to equal 2/5. This θ = 0.4 predicted the data best. It does not necessarily mean that θ = 0.4 is the most probable value. That depends on the prior as well (but with a uniform prior, it does end up being that way. As you can see in the posterior distribution column, θ = 0.4 has the highest probability in this case).

4.1.1 Sampling Distribution and Likelihood

As we study more examples of parameter estimation, you might notice that we always find the likelihood by specifying a probability distribution for the data given the parameters p(x|θ), and then substituting in the actual observed data (Equations 4.1 and 4.3). Technically, only the version with the actual data set substituted in is called the likelihood. The probability distribution p(x|θ), which gives the probability of other data sets that did not occur (as well as the one that did), is sometimes called the sampling distribution. At times, I will distinguish between the sampling distribution and the likelihood, and at other times I might just use the word likelihood for both concepts.

4.1.2 What is the "Data"?

Even though this example is meant to be introductory, there is a subtlety that has been swept under the rug. Notice that our data consisted of the fact that we got 2/5 successes in the experiment.
When we worked out the likelihood, we were considering the probability of getting x = 2, but we didn't have a probability for N = 5. In principle, we could treat x and N as two separate data sets. We could first update from the prior to the posterior given N = 5, and then update again to take into account x as well as N. However, the first update would be a bit weird. Why would knowing the number of trials tell you anything about the success probability? Effectively, what we have done in our analysis is assume that N = 5 is prior information that lurks in the background the whole time. Therefore our uniform prior for θ already "knows" that N = 5, so we didn't have to consider P(N = 5|θ) in the likelihood. This subtlety usually doesn't matter much.

4.2 Prediction in the Bus Problem

We have now seen how to use information (data) to update from a prior distribution to a posterior distribution when the set of possible parameter values is discrete. The posterior distribution is the complete answer to the problem. It tells us exactly how strongly we should believe in the various possible solutions (possible values for the unknown parameter). However, there are other things we might want to do with this information. Predicting the future is one! It's fun, but risky. Here we will look at how prediction is done using the Bayesian framework, continuing with the bus example. To be concrete, we are interested in the following question: what is the probability that I will catch the right bus tomorrow? This is like trying to predict the result of a future experiment.

In the Bayesian framework, our predictions are always in the form of probabilities or (later) probability distributions. They are usually calculated in three stages. First, you pretend you actually know the true value of the parameters, and calculate the probability based on that assumption.
Then, you do this for all possible values of the parameter θ (alternatively, you can calculate the probability as a function of θ). Finally, you combine all of these probabilities in a particular way to get one final probability which tells you how confident you are of your prediction.

Suppose we knew the true value of θ was 0.3. Then, we would know the probability of catching the right bus tomorrow is 0.3. If we knew the true value of θ was 0.4, we would say the probability of catching the right bus tomorrow is 0.4. The problem is, we don't know what the true value is. We only have the posterior distribution. Luckily, the sum rule of probability (combined with the product rule) can help us out.

We are interested in whether I will get the good bus tomorrow. There are 11 different ways that can happen. Either θ = 0 and I get the good bus, or θ = 0.1 and I get the good bus, or θ = 0.2 and I get the good bus, and so on. These 11 ways are all mutually exclusive. That is, only one of them can be true (since θ is actually just a single number). Mathematically, we can obtain the posterior probability of catching the good bus tomorrow using the sum rule:

P(good bus tomorrow|x) = Σ_θ p(θ|x) P(good bus tomorrow|θ, x)   (4.4)
                       = Σ_θ p(θ|x) θ                           (4.5)

This says that the total probability for a good bus tomorrow (given the data, i.e. using the posterior distribution and not the prior distribution) is given by going through each possible θ value, working out the probability assuming the θ value you are considering is true, multiplying by the probability (given the data) this θ value is actually true, and summing. In this particular problem, because P(good bus tomorrow|θ, x) = θ, it just so happens that the probability for tomorrow is the expectation value of θ using the posterior distribution. To three decimal places, the result for the probability tomorrow is 0.429. Interestingly, this is not equal to 2/5 = 0.4.
In practice, these kinds of calculations are usually done in a computer. The R code for computing the Bayes' Box and the probability for tomorrow is given below. This is very much like many of the problems we will work on in labs.

# Make a vector of possibilities (first column of the Bayes' Box)
theta = seq(0, 1, by=0.1)

# Corresponding vector of prior probabilities
# (second column of the Bayes' Box)
prior = rep(1/11, 11)

# Likelihood. Notice use of dbinom() rather than formula
# because R conveniently knows a lot of
# standard probability distributions already
lik = dbinom(2, 5, theta)

# Prior times likelihood, then normalise to get posterior
h = prior*lik
post = h/sum(h)

# Probability for good bus tomorrow (prediction!)
# This happens to be the same as the posterior expectation of theta
# *in this particular problem* because the probability of a
# good bus tomorrow GIVEN theta is just theta.
prob_tomorrow = sum(theta*post)

4.3 Bayes' Rule, Parameter Estimation Version

Mathematically, what we did to calculate the posterior distribution was to take the prior distribution as a whole (the whole second column) and multiply it by the likelihood (the whole third column) to get the unnormalised posterior, then normalise to get the final posterior distribution. This can be written as follows, which we will call the "parameter estimation" version of Bayes' rule. There are three ways to write it:

p(θ|x) = p(θ) p(x|θ) / p(x)       (4.6)
p(θ|x) ∝ p(θ) p(x|θ)              (4.7)
posterior ∝ prior × likelihood.   (4.8)

Writing the equations in these ways is most useful when you can write the prior p(θ) and the likelihood p(x|θ) as formulas (telling you how the values depend on θ as you go through the rows). Then you can get the equation for the posterior distribution (whether it is a discrete distribution, or a continuous one, in which case p(θ) and p(θ|x) are probability densities). We will do this in the next chapter.
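The proportional sign in Equation 4.7 is doing real work: any constant factor in the likelihood disappears when we normalise. Here is a quick Python sketch of that fact, using the bus likelihood with and without its constant factor of (5 choose 2) = 10 (the variable names are my own, and this is an illustration rather than course code, which is in R):

```python
# Grid of theta values and the bus-example likelihood
thetas = [i / 10 for i in range(11)]
lik_shape = [t**2 * (1 - t)**3 for t in thetas]   # theta^2 (1 - theta)^3 only
lik_full = [10 * v for v in lik_shape]            # with the (5 choose 2) = 10 factor

def normalise(prior, lik):
    """posterior proportional to prior times likelihood (Equation 4.7),
    then divide by the total so the posterior sums to 1."""
    h = [p * l for p, l in zip(prior, lik)]
    Z = sum(h)
    return [hi / Z for hi in h]

prior = [1 / 11] * 11
post_a = normalise(prior, lik_shape)
post_b = normalise(prior, lik_full)

# The constant factor cancels in the normalisation, so the two
# posterior distributions are identical
same = all(abs(a - b) < 1e-12 for a, b in zip(post_a, post_b))
```

This is exactly the "save ink" trick used in the analytical calculations of the next chapter.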
The notation in Equation 4.8 is very simplified and concise, but is a popular kind of notation in Bayesian work. For an explanation of the relationship between this notation and other common choices (such as P(X = x) for a discrete distribution or f(x) for a density), see Appendix A.

Chapter 5 Parameter Estimation: Analytical Methods

Analytical methods are those which can be carried out with a pen and paper, or the "old school" way before we all started using computers. There are some problems in Bayesian statistics that can be solved in this way, and we will see a few of them in this course. For an analytical solution to be possible, the maths usually has to work out nicely, and that doesn't always happen, so the techniques shown here don't always work. When they do – great! When they don't, that's what MCMC (and JAGS) is for!

Let's look at the binomial likelihood problem again, with the familiar bus example. Out of N = 5 attempts at a "repeatable" experiment, there were x = 2 successes. From this, we want to infer the value of θ, the success probability that applied on each trial, or the overall fraction of buses that are good. Because of its meaning, we know with 100% certainty that θ must be between 0 and 1 (inclusive). Recall that, if we knew the value of θ and wanted to predict the data x (regarding N as being known in advance), then we would use the binomial distribution:

p(x|θ) = (N choose x) θ^x (1 − θ)^(N−x).   (5.1)

Let's use a uniform prior for θ, but instead of making the discrete approximation and using a Bayes' Box, let's keep the continuous set of possibilities, that θ can be any real number between 0 and 1. Because the set of possibilities is continuous, the prior and the posterior for θ will both be probability densities. If we tried to do a Bayes' Box now, it would have infinitely many rows!
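The "infinitely many rows" idea can be made concrete: build Bayes' Boxes with more and more rows and watch the answer stabilise. Here is a Python sketch (the grid sizes are arbitrary choices of mine); the posterior mean settles at about 0.429, matching the prediction computed in Section 4.2:

```python
def grid_posterior_mean(n_points):
    """Posterior mean of theta on a uniform grid, for x = 2 out of N = 5."""
    thetas = [i / (n_points - 1) for i in range(n_points)]
    prior = [1.0 / n_points] * n_points            # uniform prior
    lik = [t**2 * (1.0 - t)**3 for t in thetas]    # binomial coefficient omitted:
                                                   # it cancels when we normalise
    h = [p * l for p, l in zip(prior, lik)]
    Z = sum(h)
    post = [hi / Z for hi in h]
    return sum(t * p for t, p in zip(thetas, post))

# Coarse grid (11 rows, as in Chapter 4) versus much finer grids
means = [grid_posterior_mean(n) for n in (11, 101, 10001)]
```

As the grid gets finer, the finite Bayes' Box approaches the continuous calculation done analytically below.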
The equation for our prior, a uniform probability density between 0 and 1, is:

p(θ) = 1 for 0 ≤ θ ≤ 1, and p(θ) = 0 otherwise.   (5.2)

If we keep in mind that θ is between 0 and 1, and therefore remember at all times that we are restricting our attention to θ ∈ [0, 1], we can write the uniform prior much more simply as:

p(θ) = 1.   (5.3)

If you find the Bayes' Box way of thinking easier to follow than the mathematics here, you can imagine we are making a Bayes' Box like in Table 4.2, but with an "infinite" number of rows, and the equation for the prior tells us how the prior probability varies as a function of θ as we go down through the rows (since the prior is uniform, the probabilities don't vary at all). To find the posterior probability density for θ, we use the "parameter estimation" form of Bayes' rule:

posterior ∝ prior × likelihood   (5.4)
p(θ|x) ∝ p(θ) p(x|θ).            (5.5)

We already wrote down the equations for the prior and the likelihood, so we just need to multiply them.

p(θ|x) ∝ p(θ) p(x|θ)                           (5.6)
       ∝ 1 × (N choose x) θ^x (1 − θ)^(N−x)    (5.7)

Since we are using the abbreviated form of the prior, we must remember this equation only applies for θ ∈ [0, 1]. To simplify the maths, there are some useful tricks you can use a lot of the time when working things out analytically. Notice that the "parameter estimation" form of Bayes' rule has a proportional sign in it, not an equals sign. That's because the prior times the likelihood can't actually be the posterior distribution because it is not normalised. The sum or integral is not 1. However, the equation still gives the correct shape of the posterior probability density function (the way it varies as a function of θ). This is helpful because you can save ink. If there are some constant factors in your expression for the posterior that don't involve the parameter (in this case, θ), you can ignore them. The proportional sign will take care of them.
In this case, it means we can forget about the pesky "N choose x" term, and just write:

p(θ|x) ∝ θ^x (1 − θ)^(N−x)   (5.8)
       ∝ θ^2 (1 − θ)^3.      (5.9)

The final step I included was to substitute in the actual values of N and x instead of leaving the symbols there. That's it! We have the correct shape of the posterior distribution. We can use this to plot the posterior, as you can see in Figure 5.1.

Figure 5.1: The prior p(θ) and the posterior p(θ|x) for θ in the bus problem, given that we had 2/5 successes. The prior is just a uniform density and this is plotted as a flat line, describing the fact that θ can be anywhere between 0 and 1 and we don't have much of an idea. After getting the data, the distribution changes to the posterior which is peaked at 0.4, although there is still a pretty wide range of uncertainty.

5.1 "∼" Notation

While it is very helpful to know the full equations for different kinds of probability distributions (both discrete and continuous), it is useful to be able to communicate about probability distributions in an easier manner. There is a good notation for this which we will sometimes use in STATS 331. If we want to communicate about our above analysis, and someone wanted to know what prior distribution we used, we could do several things. We could say "the prior for θ was uniform between 0 and 1", or we could give the formula for the prior distribution (Equation 5.2). However, a convenient shorthand in common use is to simply write:

θ ∼ Uniform(0, 1)   (5.10)

or, even more concisely:

θ ∼ U(0, 1).   (5.11)

This notation conserves ink, and is good for quick communication. It is also very similar to the notation used in JAGS, which will be introduced in later chapters.
We can also write the binomial likelihood (which we used for the bus problem) in this notation, instead of writing out the full equation (Equation 5.1). We can write:

x|θ ∼ Binomial(N, θ)   (5.12)

This says that if we knew the value of θ, x would have a binomial distribution with N trials and success probability θ. We can also make this one more concise:

x ∼ Bin(N, θ)   (5.13)

The differences here are that "Binomial" has been shortened to "Bin" and the "given θ" part has been left out. However, we see that there is a θ present on the right hand side, so the "given θ" must be understood implicitly.

5.2 The Effect of Different Priors

We decided to do this problem with a uniform prior, because it is the obvious first choice to describe "prior ignorance". However, in principle, the prior could be different. This will change the posterior distribution, and hence the conclusions. This isn't a problem of Bayesian analysis, but a feature. Data on its own doesn't tell us exactly what we should believe. We must combine the data with all our other prior knowledge (i.e. put the data in context) to arrive at reasoned conclusions. In this section we will look at the effect of different priors on the results, again focusing on the bus problem for continuity. Specifically, we will look at three different priors: the uniform one that we already used, and two other priors discussed below.

5.2.1 Prior 2: Emphasising the Extremes

One possible criticism of the uniform prior is that there is not much probability given to extreme solutions. For example, according to the Uniform(0, 1) prior, the prior probability that θ is between 0 and 0.1 is only ∫₀^0.1 1 dθ = 0.1. But, depending on the situation, we might think values near zero should be more plausible1. One possible choice of prior distribution that assigns more probability to the extreme values (close to 0 or 1) is:

p(θ) ∝ θ^(−1/2) (1 − θ)^(−1/2).   (5.14)
(5.14) 5.2.2 Prior 3: Already Being Well Informed Here’s another scenario that we might want to describe in our prior. Suppose that, before getting this data, you weren’t ignorant at all, but already had a lot of information about the value of the parameter. Say that we already had a lot of information which suggested the value of θ was probably close to 0.5. This could be modelled by the following choice of prior: p(θ) ∝ θ100 (1 − θ)100. (5.15) The three priors are plotted in Figure 5.2 as dotted lines. The three corresponding posterior distributions are plotted as solid lines. The posteriors were computed by multiplying the three priors by the likelihood and normalising. The blue curves correspond to the uniform prior we used before, the red curves use the “emphasising the extremes” prior, and the green curves use the “informative” prior which assumes that θ is known to be close to 0.5. There are a few interesting things to notice about this plot. Firstly, the posterior dis- tributions are basically the same for the red and blue priors (the uniform prior and the “emphasising the extremes” prior). The main difference in the posterior is, as you would expect, that the extremes are emphasised a little more. If something is more plausible before you get the data, it’s more plausible afterwards as well. The big difference is with the informative prior. Here, we were already pretty confident that θ was close to 0.5, and the data (since it’s not very much data) hasn’t given us any 1 Here’s another parameter that is between 0 and 1: the proportion of households in New Zealand that keep a Macaw as a pet (call that φ). I hope this number is low (it is very difficult to take responsible care of such a smart bird). I also think it probably is low. I would definitely object to a prior that implied P (φ < 0.1) = 0.1. I would want a prior that implied something like P (φ < 0.1) = 0.999999. CHAPTER 5. 
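The claim that Prior 2 (Equation 5.14) puts more probability near the extremes can be checked numerically. This sketch is mine, not part of the notes; it uses a midpoint grid so the integrand stays finite at the integrable singularities θ = 0 and θ = 1:

```python
from math import sqrt

# P(theta < 0.1) under Prior 2, p(theta) ∝ theta^(-1/2) * (1 - theta)^(-1/2).
# Midpoints (i + 0.5)/n never touch 0 or 1, so every weight is finite.
n = 1_000_000
low = total = 0.0
for i in range(n):
    t = (i + 0.5) / n
    w = 1.0 / sqrt(t * (1.0 - t))  # unnormalised prior density
    total += w
    if t < 0.1:
        low += w
p_low = low / total
print(round(p_low, 3))  # roughly 0.205, versus 0.1 under the uniform prior
```

So Prior 2 roughly doubles the probability assigned to the interval [0, 0.1] relative to the uniform prior, which is exactly the “emphasising the extremes” behaviour described above.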
reason to doubt that, so we still think θ is close to 0.5. Since we already knew that θ was close to 0.5, the data are acting only to increase the precision of our estimate (i.e. make the posterior distribution narrower). But since we had so much prior information, the data aren’t providing much “extra” information, and the posterior looks basically the same as the prior.

[Figure 5.2: Three different priors (dotted lines) and the corresponding three posteriors (solid lines) given the bus example data. See the text for discussion of these results.]

5.2.3 The Beta Distribution

The three priors we have used are all examples of beta distributions. The beta distributions are a family of probability distributions (like the normal, Poisson, binomial, and so on) which can be applied to continuous random variables known to be between 0 and 1. The general form of a beta distribution (here written for a variable x) is:

p(x|α, β) ∝ x^{α−1} (1 − x)^{β−1}.   (5.16)

The quantities α and β are two parameters that control the shape of the beta distribution. Since we know the variable x is between 0 and 1 with probability 1, the normalisation constant could be found by doing an integral. Then you could write the probability distribution with an equals sign instead of a proportional sign:

p(x|α, β) = x^{α−1} (1 − x)^{β−1} / ∫₀¹ x^{α−1} (1 − x)^{β−1} dx   (5.17)
          = x^{α−1} (1 − x)^{β−1} / B(α, β),   (5.18)

where B(α, β) (called the “beta function”) is defined (usefully... ) as the result of doing that very integral (it can be related to factorials too, if you’re interested). Thankfully, we can get away with the “proportional” version most of the time. In “∼” notation the beta distribution is written as:

x|α, β ∼ Beta(α, β).   (5.19)

Again, the “given α and β” can be dropped.
It is implicit because they appear on the right hand side. By identifying the terms of Equation 5.16 with the form of our three priors (Equations 5.2, 5.14 and 5.15), we see that our three priors can be written in “∼” notation like this:

Prior 1: θ ∼ Beta(1, 1)
Prior 2: θ ∼ Beta(1/2, 1/2)
Prior 3: θ ∼ Beta(101, 101)

When you work out the posterior distributions analytically and then compare them to the formula for the beta distribution, you can see that the three posteriors are also beta distributions! Specifically, you get:

Posterior 1: θ ∼ Beta(3, 4)
Posterior 2: θ ∼ Beta(2.5, 3.5)
Posterior 3: θ ∼ Beta(103, 104)

This is “magic” that is made possible by the mathematical form of the beta prior and the binomial likelihood². It is not always possible to do this. We can also derive the general solution for the posterior for θ when the prior is a Beta(α, β) distribution, the likelihood is a binomial distribution, and x successes were observed out of N trials. The posterior is:

p(θ|x) ∝ p(θ)p(x|θ)   (5.20)
       ∝ θ^{α−1} (1 − θ)^{β−1} × θ^x (1 − θ)^{N−x}   (5.21)
       = θ^{α+x−1} (1 − θ)^{β+N−x−1},   (5.22)

which can be recognised as a Beta(α + x, β + N − x) distribution.

Remember that in this particular problem, the probability of a success tomorrow is simply the expectation value (mean) of the posterior distribution for θ. We can look up (or derive) the formula for the mean of a beta distribution and find that if x ∼ Beta(α, β) then E(x) = α/(α + β). Applying this to the three posterior distributions gives:

P(good bus tomorrow|x) = 3/7 ≈ 0.429 (using prior 1)
P(good bus tomorrow|x) = 2.5/6 ≈ 0.417 (using prior 2)
P(good bus tomorrow|x) = 103/207 ≈ 0.498 (using prior 3)

The result for Prior 1 is Laplace’s infamous “rule of succession” which I will discuss a little bit in lectures.

² The technical term for this magic is that the beta distribution is a conjugate prior for the binomial likelihood.
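The Beta(α + x, β + N − x) update and the three posterior means can be verified with exact arithmetic. This sketch is mine, not from the notes; the last lines anticipate Section 5.2.4 by repeating the update with much more data:

```python
from fractions import Fraction

# Conjugate update: a Beta(alpha, beta) prior with binomial data (x successes
# in N trials) gives a Beta(alpha + x, beta + N - x) posterior, and the
# probability of a success tomorrow is its mean, a / (a + b).
def posterior_mean(alpha, beta, x, N):
    a = Fraction(alpha) + x
    b = Fraction(beta) + N - x
    return a / (a + b)

x, N = 2, 5  # the bus problem data
print(posterior_mean(1, 1, x, N))                            # 3/7     (prior 1)
print(posterior_mean(Fraction(1, 2), Fraction(1, 2), x, N))  # 5/12 = 2.5/6 (prior 2)
print(posterior_mean(101, 101, x, N))                        # 103/207 (prior 3)

# With lots of data the prior matters much less (see Section 5.2.4):
print([float(posterior_mean(a0, b0, 500, 1000))
       for a0, b0 in [(1, 1), (Fraction(1, 2), Fraction(1, 2)), (101, 101)]])
```

Using `Fraction` keeps the arithmetic exact, so the three means match the hand-computed values 3/7, 2.5/6 and 103/207 exactly rather than to floating-point precision.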
5.2.4 A Lot of Data

As shown above, the choice of prior distribution has an impact on the conclusions. Sometimes it has a big impact (the results using prior 3 were pretty different to the results from priors 1 and 2), and sometimes not much impact (e.g. the results from priors 1 and 2 were pretty similar). There is a common phenomenon that happens when there is a lot of data: the prior tends not to matter so much. Imagine we did a much bigger version of the bus experiment with N = 1000 trials, which resulted in x = 500 successes. Then the posterior distributions corresponding to the three different priors are all very similar (Figure 5.3).

[Figure 5.3: When you have a lot of data, the results are less sensitive to the choice of prior distribution. Note that we have zoomed in and are only looking around θ = 0.5: these posterior distributions are quite narrow because there is now a lot more information about θ. The red and blue posteriors (based on priors 1 and 2) are so similar that they overlap and look like one purple curve.]

This is reassuring. Note, however, that this only occurs because the three analyses used the same likelihood. If three people have different prior distributions for something and they can’t agree on what the experiment even means, there is no guarantee they will end up agreeing, even if there’s a large amount of data! Remember though, that when the results are sensitive to the choice of prior, that is not a problem with the Bayesian approach, but rather an important warning message: the data aren’t very informative! Then, the options are: i) think really hard about your prior distribution and be careful when deciding what it should be, and ii) get more or better data!

Chapter 6

Summarising the Posterior Distribution

The posterior distribution is the full answer to any Bayesian problem.
It gives a complete description of our state of knowledge and our uncertainty about the value(s) of unknown parameters. From the posterior distribution, we can calculate any probability we want. For example, if we had a posterior distribution p(θ|x) and we wanted to know the probability that θ is greater than or equal to 100, we could do:

P(θ ≥ 100|x) = ∫_{100}^{∞} p(θ|x) dθ   (6.1)

or

P(θ ≥ 100|x) = ∑_{θ=100}^{∞} p(θ|x)   (6.2)

depending on whether the set of possible θ values is continuous or discrete. We could also work out the probability of anything else. However, the posterior distribution is sometimes too much information for us to think about easily. Maybe a giant list of θ values and probabilities isn’t easy to digest. Sometimes, we need to summarise the posterior distribution to help us communicate our results with others. A giant Bayes’ Box (or a million MCMC samples of the parameter, we’ll see that later) might technically contain everything we want, but it’s not easy to talk about.

For example, say you were trying to estimate a parameter, and a colleague asked you to state your uncertainty about the parameter. Well, your posterior distribution might be complicated. It might have bumps and wiggles in it, or some other kind of structure. If there were two or more unknown parameters, there might be dependence in the posterior distribution. In some cases there might even be multiple separate peaks! Figure 6.1 shows an example of what a complicated posterior distribution might look like. If this was your result, your colleague might not care about all the little wiggles in this plot. They just want to know the “big picture” of your results. The idea of summarising the posterior distribution is very closely related to the idea of summarising a data set, which you probably encountered when you studied descriptive statistics.
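With MCMC samples (mentioned above), the tail probabilities of Equations 6.1 and 6.2 become especially simple: the probability is just the fraction of samples in the tail. A hedged sketch with stand-in samples — the normal distribution here is purely illustrative, not the notes’ example:

```python
import random

random.seed(0)
# Pretend these are posterior samples of theta from an MCMC run; here they
# are drawn from a stand-in normal distribution centred at 95.
samples = [random.gauss(95.0, 10.0) for _ in range(100_000)]

# Monte Carlo version of P(theta >= 100 | x): the fraction of samples >= 100.
p_tail = sum(s >= 100.0 for s in samples) / len(samples)
print(round(p_tail, 2))  # close to the exact normal tail value of about 0.31
```

The same one-liner answers “the probability of anything else” too: replace the condition `s >= 100.0` with whatever event is of interest.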
[Figure 6.1: A complicated posterior distribution. When communicating with others, it is often useful to summarise the posterior distribution with a few numbers. In this case, something like “the parameter = 5 ± 1” might be a useful summary.]

In descriptive statistics, you often make summaries of a complex data set (e.g. the mean and the standard deviation) so that you can communicate about the data set in a concise way. In Bayesian statistics, you often do a similar thing, but instead of giving a concise description of the data, you give a concise description of the posterior distribution.

6.1 Point Estimates

A “point estimate” refers to a single number guess for the value of a parameter. If you have several parameters, a point estimate would be a single guess for the value of each parameter (like a single point in a multidimensional space). If you look at the posterior distribution plotted in Figure 6.1, you can see that the true value of the parameter is probably somewhere around 5, but with some uncertainty. If you were to provide a single number as a guess of the parameter, you would probably say something close to 5. In classical statistics, a single number guess is called an “estimate”, and a rule for generating such guesses is called an “estimator”. Estimates are usually written by putting a hat over the name of the parameter. So, by looking at the plot of the posterior, you could give an estimate like this:

θ̂ = 5.   (6.3)

But there are better things you could do than just looking at the plot, and you’ve probably learnt some of them in previous statistics courses. Here are three methods you could use to choose a point estimate using the posterior distribution: the posterior mean (the
expectation value of the parameter), the posterior median (the value that divides the probability in half), and the posterior mode (the value where the posterior distribution has its peak). In our illustrative example, the values of these three point estimates are:

θ̂ = 4.988 (the posterior mean)     (6.4)
θ̂ = 4.924 (the posterior median)   (6.5)
θ̂ = 4.996 (the posterior mode)     (6.6)

In this example, there’s not much of a difference between these three methods. But in other situations, they can be quite different (this usually happens if the posterior distribution is skewed, or has multiple modes; you may notice a strong analogy between this topic and descriptive statistics). Is there a way to say which one is the best? It turns out there is, but that depends on what you mean by “best”. Before we move on to the formal ways of deciding what constitutes a good estimate, I would like to mention a very common method that is easy to use. If the posterior distribution looks even vaguely like a normal distribution, it is common to summarise it like this:

θ = posterior mean ± posterior standard deviation.   (6.7)

I use this kind of summary frequently in my own research.

6.1.1 A Very Brief Introduction to Decision Theory

Decision theory is a very important topic. In this course we will use a tiny amount of it, just enough to solve the problem of “which point estimate is best?”. If you think about it, this is a bit of a weird question. Obviously, the best point estimate is the true value. Of course it is, how could it be otherwise? Our only problem is that we can’t actually implement this suggestion. We don’t know the true value. We only have the posterior distribution (which is based on all the evidence we have), and we have to do the best we can with our incomplete information. To think about which decision is best, the first thing we should think about is which decisions are possible.
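As an aside, the three point estimates defined above are easy to compute from posterior samples. This sketch (mine, not the notes’ example) deliberately uses a skewed stand-in distribution so that the three estimates disagree, as discussed above:

```python
import random
import statistics
from collections import Counter

random.seed(1)
# Stand-in posterior samples from a skewed distribution, Exponential(1).
samples = [random.expovariate(1.0) for _ in range(100_000)]

post_mean = statistics.fmean(samples)     # expectation value
post_median = statistics.median(samples)  # divides the probability in half
# Crude mode estimate: bin the samples to width ~0.1 and take the fullest bin.
post_mode = Counter(round(s, 1) for s in samples).most_common(1)[0][0]

# For Exponential(1): mean = 1, median = ln 2 ≈ 0.69, mode = 0.
print(post_mean, post_median, post_mode)
```

The mean and median come straight from the samples; the mode needs some form of density estimate (here a crude histogram), which is why mode estimates from samples are the least reliable of the three.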
For estimating a single parameter, any real number is a possible guess. The key idea in decision theory is the concept of utility, and the related concept of loss (loss is just negative utility). Utility is a numerical measure of how good it would be if a certain outcome came true. Conversely, loss is a measure of how bad it would be if a certain outcome came true. Utilities are often subjective (not unlike prior probabilities), but in some applications utility can be more concrete. For example, in betting or investment decisions the utility can be measured in dollars. The problem with utility is that we have uncertainty about what is going to happen, or about what is true, so we can’t just choose the decision that gives us the greatest utility. Instead we will use our posterior probabilities and choose the decision that gives us the maximum possible expected value of the utility.

Imagine we were estimating a parameter θ and we wanted to give a point estimate θ̂. One idea for what the utility or loss might be is the quadratic loss function, which is given by

L(θ, θ̂) = (θ̂ − θ)².   (6.8)

The expression inside the parentheses is the difference between our point estimate and the true value. This formula says that if our point estimate is off by 2, that is four times worse than if we were off by 1. If we were off by 10, that is 100 times worse than if we were off by 1, due to the squaring in the quadratic loss function formula. It turns out (we will prove this below) that if the loss function is quadratic, the best estimate you can give is the posterior mean.

Here is the proof. The expected value of the loss is

E[L(θ, θ̂)] = ∫ p(θ|x)(θ̂ − θ)² dθ.   (6.9)

Since we are summing (integrating) over all possible true θ values, the expected loss is only a function of our estimate θ̂. To minimise a function of one variable, you differentiate it and then set the derivative to zero.
The derivative is

d/dθ̂ E[L(θ, θ̂)] = ∫ p(θ|x) d/dθ̂ (θ̂ − θ)² dθ   (6.10)
                 = ∫ p(θ|x) 2(θ̂ − θ) dθ.        (6.11)

Setting this equal to zero and then solving for θ̂ gives the final result:

θ̂ = ∫ θ p(θ|x) dθ,   (6.12)

which is the posterior mean. Some people call the posterior mean the “Bayes Estimate” for this reason. I don’t like that term because I don’t think point estimates are really Bayesian. The actual output of a Bayesian analysis is the posterior distribution. Note that I didn’t verify that θ̂ actually minimises the expected loss, because setting the derivative to zero would also find a maximum. To make sure it really does minimise the expected loss, you can calculate the second derivative and verify that it is positive. But that’s not really needed. It wo
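The result just proved can also be checked numerically. This sketch (mine, using a small made-up discrete posterior) scans candidate estimates and confirms that the expected quadratic loss is minimised at the posterior mean:

```python
# A hypothetical discrete posterior over four theta values (probabilities sum to 1).
thetas = [1.0, 2.0, 3.0, 4.0]
probs = [0.1, 0.2, 0.4, 0.3]

# Expected quadratic loss (Equation 6.9, written as a sum for a discrete posterior).
def expected_loss(theta_hat):
    return sum(p * (theta_hat - t) ** 2 for t, p in zip(thetas, probs))

# Scan a fine grid of candidate estimates and pick the minimiser.
candidates = [i / 100 for i in range(100, 401)]
best = min(candidates, key=expected_loss)

posterior_mean = sum(p * t for t, p in zip(thetas, probs))
print(best)  # 2.9, agreeing with the posterior mean to within the grid spacing
```

Replacing the loss function (e.g. with absolute loss |θ̂ − θ|) and re-running the scan would move the minimiser to a different summary, which is exactly the sense in which “best” depends on the choice of loss.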
