Bayesian Statistics Instructional Materials PDF

Document Details


Polytechnic University of the Philippines

2022

Peter John B. Aranas

Tags

Bayesian statistics, statistical inference, probability, statistics

Summary

These instructional materials focus on Bayesian statistics, a branch of statistics that uses probability to model uncertainty. They cover probability, Bayes' theorem, the specification of priors, and regression models, with particular attention to single-parameter, multiparameter, and hierarchical models. The materials were prepared for the STAT 20153 course.

Full Transcript

P U P Instructional Materials in STAT 20153 Bayesian Statistics
Prepared by Peter John B. Aranas
College of Science, Polytechnic University of the Philippines, 2022

Contents

1 Introduction to Statistical Science
  1.1 The Scientific Method: A Process for Learning
  1.2 The Role of Statistics in the Scientific Method
  1.3 Main Approaches to Statistics

2 Probability and Bayesian Inference
  2.1 Probability
    2.1.1 Joint Probability and Independent Events
    2.1.2 Conditional Probability
  2.2 Bayes' Theorem: The Key to Bayesian Statistics
    2.2.1 Odds and Bayes Factor
  2.3 General notation for statistical inference

3 Specifying Priors: Single-parameter models
  3.1 Estimating a probability from binomial data
    3.1.1 Using a Uniform Prior
    3.1.2 Using a Beta Prior
    3.1.3 Choosing Your Prior for a Binomial Data
  3.2 Bayes' Theorem for Poisson Parameter with a Continuous Prior
    3.2.1 Some Prior Distributions for Poisson
  3.3 Bayes' Theorem for Normal Mean with a Discrete Prior
  3.4 Bayes' Theorem for Normal Mean with a Continuous Prior
    3.4.1 Choosing Your Normal Prior

4 Summarizing Posterior Information and Asymptotics
  4.1 Introduction
  4.2 Measures of Location
    4.2.1 Posterior Mode
    4.2.2 Posterior Median
    4.2.3 Posterior Mean
  4.3 Measures of Spread
    4.3.1 Posterior Variance and Posterior Standard Deviation
    4.3.2 Percentiles of the Posterior Distribution
  4.4 Bayesian Credible Interval
  4.5 Normal Approximation to the Posterior Distribution

5 Multiparameter and Hierarchical Models
  5.1 Multiparameter Models: Introduction
    5.1.1 Averaging over 'nuisance parameters'
    5.1.2 Normal data with a noninformative prior distribution
    5.1.3 Normal data with a conjugate prior distribution
    5.1.4 Multinomial model for categorical data
  5.2 Hierarchical Models
    5.2.1 Constructing a parameterized prior distribution
    5.2.2 Exchangeability and setting up hierarchical models
    5.2.3 Fully Bayesian analysis of conjugate hierarchical models

6 Empirical Bayes
  6.1 Introduction
  6.2 The missing species problem
  6.3 James-Stein estimator
  6.4 False discovery rates
  6.5 f-modeling and g-modeling

7 Introduction to Regression Models
  7.1 Conditional Modeling
  7.2 Bayesian analysis of the classical regression model
  7.3 Regression for causal inference
Chapter 1 Introduction to Statistical Science

Chapter Overview
This chapter will introduce the students to the scientific method, albeit just as a review, and to the role of statistics in the scientific method. Two main approaches to statistics, frequentist and Bayesian, will be briefly discussed. Finally, Monte Carlo studies are also presented in this chapter.

Learning Objectives
At the end of this chapter, the student is expected to
1. discuss the scientific method;
2. discuss statistics as a science;
3. discuss the differences between frequentist statistics and Bayesian statistics; and
4. perform Monte Carlo simulations using R.

Showing a Causal Relationship from Data
Suppose we have observed two variables X and Y. An association between the variables X and Y may be exhibited in a scenario in which high values of X occur with high values of Y (positive association) or high values of X occur with low values of Y (negative association). This association may be indicated by a dashed curve connecting X and Y.

There are several explanations for the association between X and Y. One possible explanation is that X is the cause of Y; that is, the relationship is a causal one. This may be indicated by drawing an arrow from X to Y. Another possible explanation for the association between X and Y is that there could be an unidentified third variable, say Z, that has a causal effect on both X and Y. In this case, X and Y are not related in a direct causal relationship. The association between them is due to the effect of Z. Z is called a lurking variable, since it is hiding in the background and it affects the data. However, it is also possible that both a causal effect and a lurking variable may be contributing to the association. In this case, we say that the causal effect and the effect of the lurking variable are confounded. This means that both effects are included in the association.

Our first goal is to determine which of the possible reasons for the association holds. If we conclude that it is due to a causal effect, then our next goal is to determine the size of the effect. If we conclude that the association is due to a causal effect confounded with the effect of a lurking variable, then our next goal becomes determining the sizes of both effects.

1.1 The Scientific Method: A Process for Learning
The scientific method is a way of thinking that sparked the Renaissance. The idea is that scientific theories should be tested against real world data.

Premises of the scientific method:
A scientific hypothesis can never be shown to be absolutely true. However, it must potentially be disprovable.
It is a useful model until it is established that it is not true.
Always go for the simplest hypothesis, unless it can be shown to be false.

Procedures of the scientific method:
1. Ask a question or pose a problem in terms of the current scientific hypothesis.
2. Gather all the relevant information that is currently available. This includes the current knowledge about parameters of the model.
3. Design an investigation or experiment that addresses the question from step 1. The predicted outcome of the experiment should be one thing if the current hypothesis is true, and something else if the hypothesis is false.
4. Gather data from the experiment.
5. Draw conclusions given the experimental results. Revise the knowledge about the parameters to take the current results into account.
The scientific method searches for a cause-and-effect relationship between an experimental variable and an outcome variable; in other words, how changing the experimental variable results in a change to the outcome variable. Scientific modeling develops mathematical models of these relationships. The scientific method uses controlled experiments, where outside factors that may affect the measurements are controlled. This isolates the relationship between the two variables from the outside factors, so the relationship can be determined.

Exercise 1. Discuss statistics as a science.

1.2 The Role of Statistics in the Scientific Method
Statistical methods extend the scientific method to cases where the outside factors are not identified and hence cannot be controlled. The lack of direct control means the outside factors will be affecting the data, and this could lead to wrong conclusions being drawn from the experiment. The principle of randomization is used to statistically control these unidentified outside factors by averaging out their effects. This contributes to variability in the data. This random variability in the data is the very reason why we use the statistical methods of inference. However, there is always the presence of uncertainty in our inferences. We can use the probability model (based on the randomization method) to measure the uncertainty.

1.3 Main Approaches to Statistics
Frequentist (classical) - procedures are developed by looking at how they perform over all possible random samples. The probabilities do not relate to the particular random sample that was obtained. This approach is based on the following ideas:
Parameters, the numerical characteristics of the population, are fixed but unknown constants.
Probabilities are always interpreted as long-run frequency.
Statistical procedures are judged by how well they perform in the long run over an infinite number of hypothetical repetitions of the experiment.

Bayesian - applies the laws of probability directly to the problem, based on the following ideas:
Since we are uncertain about the true value of the parameters, we will consider them to be random variables.
The rules of probability are used directly to make inferences about the parameters. Probability statements about parameters must be interpreted as "degree of belief."
The prior distribution must be subjective. Each person can have his/her own prior, which contains the relative weights that person gives to every possible parameter value. It measures how "plausible" the person considers each parameter value to be before observing the data.
We revise our beliefs about parameters after getting the data by using Bayes' Theorem. This gives our posterior distribution, which gives the relative weights to each parameter value after analyzing the data. The posterior distribution comes from two sources: the prior distribution and the observed data.

Monte Carlo Studies
A Monte Carlo study is where we perform the experiment a large number of times and calculate the statistic for each experiment. In frequentist statistics, a method called sample space averaging is used to evaluate statistical procedures by comparing the performance of the procedures before we take any data. Statistical procedures are evaluated by looking at how they perform in the long run over all possible samples of the data, for fixed parameter values over some range.
For instance, we fix the parameter at some value. The estimator depends on the random sample, so it is considered a random variable having a distribution (the sampling distribution). Then we look at how the estimator is distributed around the parameter value. Bayesian procedures, in contrast, consider the parameter to be a random variable, and its posterior distribution is conditional on the sample data that actually occurred, not on all those samples that were possible but did not occur. A Monte Carlo study uses the empirical distribution of the statistic over all the samples we took in our study instead of its sampling distribution over all possible repetitions.

Monte Carlo Exercises
Monte Carlo study comparing methods for random sampling. We will use a Monte Carlo computer simulation to evaluate the methods of random sampling. Now, if we want to evaluate a method, we need to know how it does in the long run. In a real-life situation, we cannot judge a method by the sample estimate it gives, because if we knew the population parameter, we would not be taking a sample and estimating it with a sample statistic. One way to evaluate a statistical procedure is to evaluate the sampling distribution, which summarizes how the estimate based on that procedure varies in the long run (over all possible random samples) for a case when we know the population parameters. Then we can see how closely the sampling distribution is centered around the true parameter. The closer it is, the better the statistical procedure, and the more confidence we will have in it for realistic cases when we do not know the parameter. If we use computer simulations to run a large number of hypothetical repetitions of the procedure with known parameters, this is known as a Monte Carlo study, named after the famous casino. Instead of having the theoretical sampling distribution, we have the empirical distribution of the sample statistic over those simulated repetitions. We judge the statistical procedure by seeing how closely the empirical distribution of the estimator is centered around the known parameter.

The population. Suppose there is a population made up of 100 individuals, and we want to estimate the mean income of the population from a random sample of size 20. The individuals come from three ethnic groups with population proportions of 40%, 40%, and 20%, respectively. There are twenty neighborhoods, and five individuals live in each one. Now, the income distribution may be different for the three ethnic groups. Also, individuals in the same neighborhood tend to be more similar than individuals in different neighborhoods.
[R:] Details about the population can be seen by typing help(sscsample.data)

In the Monte Carlo study we will approximate the sampling distribution of the sample means for three types of random sampling: simple random sampling, stratified random sampling, and cluster random sampling. We do this by drawing a large number (in this case 200) of random samples from the population using each method of sampling, calculating the sample mean as our estimate. The empirical distribution of these 200 sample means approximates the sampling distribution of the estimate.
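The sscsample function and the sscsample.data set used in the exercises below appear to come with the R package that accompanies Bolstad and Curran's textbook. Even without that package, the logic of a Monte Carlo study can be seen from a small self-contained sketch like the following; the population here is a hypothetical stand-in (synthetic incomes for three groups), not sscsample.data.
[R:]
## A minimal Monte Carlo study of the sampling distribution of the
## sample mean under simple random sampling, with a known population.
set.seed(20153)
income <- c(rgamma(40, shape = 4, rate = 1/5),   # group 1 (hypothetical incomes)
            rgamma(40, shape = 4, rate = 1/4),   # group 2
            rgamma(20, shape = 4, rate = 1/3))   # group 3
true.mean <- mean(income)                        # known population mean

n.samples <- 200                                 # number of Monte Carlo repetitions
sample.means <- replicate(n.samples, mean(sample(income, size = 20)))

## Judge the procedure by how the empirical distribution of the
## estimator is centered and spread around the known parameter.
mean(sample.means) - true.mean                   # Monte Carlo estimate of the bias
sd(sample.means)                                 # Monte Carlo standard error
hist(sample.means, main = "Empirical sampling distribution of the mean")
abline(v = true.mean, lwd = 2)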
(a) Display the incomes for the three ethnic groups (strata) using boxplots on the same scale. Compute the mean income for the three ethnic groups. Do you see any difference between the income distributions?
[R:] This may be done in R by typing:
boxplot(income ~ ethnicity, data = sscsample.data)

(b) Draw 200 random samples of size 20 from the population using simple random sampling, using the sscsample function.
[R:]
mySamples = list(simple = NULL, strat = NULL, cluster = NULL)
mySamples$simple = sscsample(20, 200)
The means and the number of observations sampled from each ethnic group can be seen by typing
[R:]
mySamples$simple
Answer the following questions from the output:
i. Does simple random sampling always have the strata represented in the correct proportions?
ii. On the average, does simple random sampling give the strata in their correct proportions?
iii. Does the mean of the sampling distribution of the sample mean for simple random sampling appear to be close enough to the population mean that we can consider the difference to be due to chance alone? (We only took 200 samples, not all possible samples.)

(c) Draw 200 stratified random samples using the function and store the output in mySamples$strat.
[R:]
mySamples$strat = sscsample(20, 200, "stratified")
mySamples$strat
Answer the following questions from the output:
i. Does stratified random sampling always have the strata represented in the correct proportions?
ii. On the average, does stratified random sampling give the strata in their correct proportions?
iii. Does the mean of the sampling distribution of the sample mean for stratified random sampling appear to be close enough to the population mean that we can consider the difference to be due to chance alone? (We only took 200 samples, not all possible samples.)

(d) Draw 200 cluster random samples using the function and store the output in mySamples$cluster.
[R:]
mySamples$cluster = sscsample(20, 200, "cluster")
mySamples$cluster
Answer the following questions from the output:
i. Does cluster random sampling always have the strata represented in the correct proportions?
ii. On the average, does cluster random sampling give the strata in their correct proportions?
iii. Does the mean of the sampling distribution of the sample mean for cluster random sampling appear to be close enough to the population mean that we can consider the difference to be due to chance alone? (We only took 200 samples, not all possible samples.)

(e) Compare the spreads of the sampling distributions (standard deviation and interquartile range). Which method of random sampling seems to be more effective in giving sample means more concentrated about the true mean?
[R:]
sapply(mySamples, function(x) sd(x$means))
sapply(mySamples, function(x) IQR(x$means))

(f) Give reasons for this.

Chapter 2 Probability and Bayesian Inference

Chapter Overview
This chapter revisits the concept of probability that arises from random experiments, with emphasis on the law of total probability and conditional probability. The students will see that Bayes' theorem is just a restatement of conditional probability, the multiplication rule, and the theorem of total probability.

Learning Objectives
At the end of this chapter, the student is expected to
1. define a random experiment and give examples;
2. apply the axioms of probability and other properties in computing probabilities; and
3. estimate probabilities of unobservable events using the Bayes' theorem.
2.1 Probability

Definition 2.1: Random experiment
A random experiment is an experiment that has an outcome that is not completely predictable. We can repeat the experiment under the same conditions and not get the same result.

Tossing a coin is an example of a random experiment. The result of one single trial of the random experiment is called an outcome. The set of all possible outcomes of one single trial of the random experiment is called the sample space. We denote it Ω. The sample space contains everything we are considering in this analysis of the experiment, so we can also call it the universe. An event is any set of possible outcomes of a random experiment.

Definition 2.2: Axioms of probability
The mapping/assignment P which assigns a real number from zero to one to every possible event associated with the random experiment is called a probability function. The higher the probability of an event, the more likely it is to occur. A probability that equals 1 means that the event is certain to occur, and a probability of 0 means that the event cannot possibly occur. To be consistent, the assignment of probabilities to events must satisfy the following axioms.
1. P(A) ≥ 0 for any event A.
2. P(Ω) = 1.
3. If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B).

The following rules of probability can be proved from the axioms.
1. P(∅) = 0.
2. P(Ā) = 1 − P(A).
3. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A and B.
4. P(A − B) = P(A) − P(A ∩ B).

2.1.1 Joint Probability and Independent Events

Definition 2.3
The joint probability of events A and B is the probability that both events occur simultaneously, on the same repetition of the random experiment. In other words, the joint probability of events A and B is P(A ∩ B).

If event A and event B are independent, then P(A ∩ B) = P(A) × P(B): the joint probability is the product of the individual probabilities. If that does not hold, the events are called dependent events. Note that whether two events A and B are independent or dependent depends on the probabilities assigned.

Note. Independence of two events is not a property of the events themselves; rather, it is a property that comes from the probabilities of the events and their intersection. This is in contrast to mutually exclusive events, which have the property that they contain no elements in common. Two mutually exclusive events, each with non-zero probability, cannot be independent. Their intersection is the empty set, so it must have probability zero, which cannot equal the product of the probabilities of the two events!

Definition 2.4: Marginal probability
The probability of one of the events A, in the joint event setting, is called its marginal probability. It is found by summing P(A ∩ B) and P(A ∩ B̄) using the axioms of probability:
A = A ∩ Ω = A ∩ (B ∪ B̄)
A = (A ∩ B) ∪ (A ∩ B̄)
P(A) = P(A ∩ B) + P(A ∩ B̄).     (2.1)

2.1.2 Conditional Probability
If we know that one event has occurred, does that affect the probability that another event has occurred? Suppose we knew that the event A has occurred. Everything outside of A is no longer possible. We only have to consider outcomes inside event A. This gives us the reduced universe U_r = A. Consider any event B. The only part of event B that is now relevant is that part which is also in A. This is B ∩ A.
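Before formalizing conditional probability, the rules above can be checked empirically with a short simulation. The random experiment below (rolling two fair dice) and the events A and B are hypothetical choices made only for illustration, not taken from the text.
[R:]
## Check the product rule for independent events and the marginal
## decomposition (2.1) by simulating many repetitions of the experiment.
set.seed(1)
n <- 1e5
die1 <- sample(1:6, n, replace = TRUE)
die2 <- sample(1:6, n, replace = TRUE)

A <- die1 <= 2            # event A: first die shows 1 or 2
B <- die2 >= 5            # event B: second die shows 5 or 6

## Independence: P(A ∩ B) should be close to P(A) * P(B) = (1/3)(1/3)
mean(A & B)
mean(A) * mean(B)

## Marginal probability, Equation (2.1): P(A) = P(A ∩ B) + P(A ∩ B-bar)
mean(A & B) + mean(A & !B)
mean(A)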
Definition 2.5: Conditional probability
The conditional probability of event B given event A is defined as
P(B|A) = P(A ∩ B) / P(A).     (2.2)

When A and B are independent events we have P(B|A) = P(B), since P(B ∩ A) = P(B) · P(A) for independent events, and the factor P(A) will cancel out.

If we reverse the roles of the two events A and B in Equation 2.2, we get the conditional probability of A given B as
P(A|B) = P(A ∩ B) / P(B).

We will not consider the two conditional events the same way, since B is unobservable. The occurrence or nonoccurrence of event B is not observed. However, A is an observable event that can occur either with event B or with its complement B̄. The chances of A occurring may depend on which one of B or B̄ has occurred. In other words, the probability of event A is conditional on the occurrence or nonoccurrence of event B.

When we clear the fractions in the conditional probability formula, we get the multiplication rule for probability:
P(A ∩ B) = P(B) · P(A|B).     (2.3)

Equation 2.3 restates the conditional probability relationship of an observable event given an unobservable event in a way that is useful for finding the joint probability P(A ∩ B). Similarly,
P(A ∩ B̄) = P(B̄) · P(A|B̄).

From Equation 2.1, the probability of event A is found by summing the probabilities of its disjoint parts. That is, P(A) = P(A ∩ B) + P(A ∩ B̄). We substitute this into Equation 2.2 to get
P(B|A) = P(A ∩ B) / [P(A ∩ B) + P(A ∩ B̄)].

Now we use the multiplication rule to find each of the joint probabilities in Equation 2.1. This gives Bayes' theorem for a single event:

Theorem 2.1: Bayes' Theorem for a single event
P(B|A) = P(A|B) · P(B) / [P(A|B) · P(B) + P(A|B̄) · P(B̄)].     (2.4)

We see that Bayes' theorem is just a restatement of the conditional probability P(B|A) where:
1. the probability of A is found as the sum of the probabilities of its disjoint parts, (A ∩ B) and (A ∩ B̄), and
2. each of the joint probabilities is found using the multiplication rule.

Remark. B and B̄ are disjoint and their union recreates the whole universe Ω. We say that events B and B̄ partition the universe.

Definition 2.6: Partition of Ω
Let B1, B2, ..., Bn be events such that B1 ∪ B2 ∪ ... ∪ Bn = Ω and Bi ∩ Bj = ∅ for all i ≠ j. Then we say that the set of events B1, B2, ..., Bn partitions Ω.

Consider an observable event A. This will be partitioned into parts by the partition above. That is,
A = (A ∩ B1) ∪ (A ∩ B2) ∪ · · · ∪ (A ∩ Bn).
Note that the parts are mutually disjoint since the Bi's are themselves mutually disjoint. Hence we have

Theorem 2.2: Law of Total Probability
P(A) = Σ_{j=1}^{n} P(A ∩ Bj).

The Law of Total Probability simply says that the probability of an event A is the sum of the probabilities of its disjoint parts. Using the multiplication rule on each disjoint probability gives
P(A) = Σ_{j=1}^{n} P(A|Bj) · P(Bj).

The conditional probability P(Bi|A) for i = 1, 2, ..., n is found by dividing each joint probability by the probability of the event A. That is,
P(Bi|A) = P(A ∩ Bi) / P(A).

Using the multiplication rule to find the joint probability in the numerator, along with the law of total probability in the denominator, gives

Theorem 2.3: Bayes' Theorem
P(Bi|A) = P(A|Bi) · P(Bi) / [Σ_{j=1}^{n} P(A|Bj) · P(Bj)].     (2.5)

This result was published posthumously in 1763, after the death of its discoverer, Reverend Thomas Bayes.
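Theorem 2.3 translates directly into a short R function: multiply each prior probability by the corresponding likelihood and rescale so the results sum to one. The sketch below is generic; the prior weights and likelihoods in the illustration are hypothetical numbers, not from a worked example in the text.
[R:]
## Bayes' theorem over a partition B1,...,Bn (Theorem 2.3).
bayes.events <- function(prior, likelihood) {
  joint <- prior * likelihood        # P(A ∩ Bi) by the multiplication rule
  marginal <- sum(joint)             # P(A) by the law of total probability
  posterior <- joint / marginal      # P(Bi | A)
  data.frame(prior, likelihood, joint, posterior)
}

## Hypothetical illustration: three unobservable events with prior
## weights 0.5, 0.3, 0.2, and likelihoods P(A | Bi) of 0.1, 0.4, 0.8.
bayes.events(prior = c(0.5, 0.3, 0.2), likelihood = c(0.1, 0.4, 0.8))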
Note how the events A and Bi for i = 1, 2, ..., n are not treated symmetrically. The events Bi for i = 1, 2, ..., n are considered unobservable. We never know which of them occurred. The event A, however, is an observable event. The marginal probabilities P(Bi) for i = 1, 2, ..., n are assumed known before we start and are called prior probabilities.

2.2 Bayes' Theorem: The Key to Bayesian Statistics
We will now discuss how to use Bayes' Theorem to make inferences from data using probability models for quantities we observe and for quantities about which we wish to learn. The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis.

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:
1. Setting up a full probability model - a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.
2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution - the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.
3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.

Let us see how we can use Bayes' theorem to revise our beliefs on the basis of evidence. Let B1, B2, ..., Bn be a set of unobservable events which partition the universe. We start with P(Bi) for i = 1, 2, ..., n, the prior probability for the events Bi. This distribution gives the weight we attach to each of the Bi from our prior belief. Then we find that A has occurred. The likelihood of the unobservable events B1, B2, ..., Bn is the conditional probability that A has occurred given Bi for i = 1, 2, ..., n. Thus the likelihood of event Bi is given by P(A|Bi). The likelihood is the weight given to each of the Bi events given the occurrence of A. P(Bi|A) for i = 1, 2, ..., n is the posterior probability of event Bi, given that event A has occurred. This distribution contains the weight we attach to each of the events Bi for i = 1, 2, ..., n after we know event A has occurred. It combines our prior beliefs with the evidence given by the occurrence of event A.

To get better insight into Bayes' theorem, we think of the universe as having two dimensions, one observable and one unobservable. We let the observable dimension be horizontal, and the unobservable dimension be vertical. The unobservable events no longer partition the universe haphazardly. Instead, they partition the universe as rectangles that cut completely across the universe in a horizontal direction. The whole universe consists of these horizontal rectangles in a vertical stack. Since we do not ever observe which of these events occurred, we never know what vertical position we are in within the Bayesian universe. Observable events are vertical rectangles that cut the universe from top to bottom. We observe that vertical rectangle A has occurred, so we observe the horizontal position in the universe.
Each event Bi ∩ A is a rectangle at the intersection of Bi and A. The probability of the event Bi ∩ A is found by multiplying the prior probability of Bi times the conditional probability of A given Bi. This is the multiplication rule. The event A is the union of the disjoint parts A ∩ Bi for i = 1, 2, ..., n. Its probability is found by summing the probabilities of each disjoint part down the vertical column represented by A. This is the marginal probability of A. The posterior probability of any particular Bi given A is the proportion of A that is also in Bi; in other words, the probability of Bi ∩ A divided by the sum of the probabilities of Bj ∩ A summed over all j = 1, 2, ..., n.

We often write Bayes' theorem in its proportional form as
posterior ∝ prior × likelihood.     (2.6)
This gives the relative weights for each of the events Bi for i = 1, 2, ..., n after we know A has occurred. Dividing by the sum of the relative weights rescales the relative weights so they sum to 1. This makes it a probability distribution.

We can summarize the use of Bayes' theorem for events by the following three steps:
1. Multiply prior times likelihood for each of the Bi. This finds the probability of Bi ∩ A by the multiplication rule.
2. Sum them for i = 1, 2, ..., n. This finds the probability of A by the law of total probability.
3. Divide each of the prior times likelihood values by their sum. This finds the conditional probability of that particular Bi given A.

Prediction
Making inferences about an unknown observable is called predictive inference. The logic is as follows. Before the data y are considered, the distribution of the unknown but observable y is
p(y) = ∫ p(y, θ) dθ = ∫ p(θ) p(y|θ) dθ.     (2.7)
Equation 2.7 is often called the marginal distribution of y, but a more informative name is the prior predictive distribution: prior because it is not conditional on a previous observation of the process, and predictive because it is the distribution for a quantity that is observable.

After the data y have been observed, we can predict an unknown observable, ỹ, from the same process. For example, y = (y1, ..., yn) may be the vector of recorded weights of an object weighed n times on a scale, θ = (µ, σ²) may be the unknown true weight of the object and the measurement variance of the scale, and ỹ may be the yet to be recorded weight of the object in a planned new weighing. The distribution of ỹ is called the posterior predictive distribution, posterior because it is conditional on the observed y and predictive because it is a prediction for an observable ỹ:
p(ỹ|y) = ∫ p(ỹ, θ|y) dθ
       = ∫ p(ỹ|θ, y) p(θ|y) dθ
       = ∫ p(ỹ|θ) p(θ|y) dθ.     (2.8)
The second and third lines display the posterior predictive distribution as an average of conditional predictions over the posterior distribution of θ. The last step follows from the assumed conditional independence of y and ỹ given θ.
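In practice, the posterior predictive distribution in Equation (2.8) is often approximated by simulation: draw θ from its posterior, then draw ỹ from p(ỹ|θ). The sketch below uses a hypothetical beta posterior for a success probability and a single future binary observation purely as an illustration; it is not tied to a specific example in the text.
[R:]
## Two-stage simulation of a posterior predictive distribution.
set.seed(2)
n.sims <- 10000

## Hypothetical posterior for theta (a probability of success):
theta <- rbeta(n.sims, shape1 = 3, shape2 = 5)

## Posterior predictive draws of a future observation given theta:
y.tilde <- rbinom(n.sims, size = 1, prob = theta)

mean(y.tilde)            # estimate of P(y-tilde = 1 | y), averaged over the posterior
table(y.tilde) / n.sims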
2.2.1 Odds and Bayes Factor
One way of dealing with uncertain events that we are modeling as random is to form the odds of the event.

Definition 2.7: The odds of an event
The odds for an event C equals the probability of the event occurring divided by the probability of the event not occurring:
odds(C) = P(C) / P(C̄) = P(C) / (1 − P(C)).     (2.9)

If we are using prior probabilities, we get the prior odds – in other words, the ratio before we have analyzed the data. If we are using posterior probabilities, we get the posterior odds.

Solving Equation 2.9 for the probability of event C, we get
P(C) = odds(C) / (1 + odds(C)).

Definition 2.8: Bayes Factor (B)
The Bayes factor B is the factor by which the prior odds are changed to the posterior odds:
prior odds(C) × B = posterior odds(C).
This factor contains the evidence in the data D that occurred relevant to the question about C occurring.

We can solve the above relationship for the Bayes factor to get
B = posterior odds / prior odds.
We can substitute in the ratio of probabilities for both the posterior and prior odds to find
B = P(D|C) / P(D|C̄).
Thus the Bayes factor is the ratio of the probability of getting the data which occurred given the event, to the probability of getting the data which occurred given the complement of the event. If the Bayes factor is greater than 1, then the data have made us believe that the event is more probable than we thought before. If the Bayes factor is less than 1, then the data have made us believe that the event is less probable than we originally thought.

Example 1. Inference about a genetic status
Human males have one X-chromosome and one Y-chromosome, whereas females have two X-chromosomes, each chromosome being inherited from one parent. Hemophilia is a disease that exhibits X-chromosome-linked recessive inheritance, meaning that a male who inherits the gene that causes the disease on the X-chromosome is affected, whereas a female carrying the gene on only one of her two X-chromosomes is not affected. The disease is generally fatal for women who inherit two such genes, and this is rare, since the frequency of occurrence of the gene is low in human populations.

Prior distribution. Consider a woman who has an affected brother, which implies that her mother must be a carrier of the hemophilia gene, with one 'good' and one 'bad' hemophilia gene. We are also told that her father is not affected; thus the woman herself has a fifty-fifty chance of having the gene. The unknown quantity of interest, the state of the woman, has just two values: the woman is either a carrier of the gene (θ = 1) or not (θ = 0). Based on the information provided thus far, the prior distribution for the unknown θ can be expressed simply as
P(θ = 1) = P(θ = 0) = 1/2.

Data model and likelihood. The data used to update the prior information consist of the affection status of the woman's sons. Suppose she has two sons, neither of whom is affected. Let yi = 1 or 0 denote an affected or unaffected son, respectively. The outcomes of the two sons are exchangeable and, conditional on the unknown θ, are independent; we assume the sons are not identical twins. The two items of independent data generate the following likelihood function:
P(y1 = 0, y2 = 0 | θ = 1) = (0.5)(0.5) = 0.25
P(y1 = 0, y2 = 0 | θ = 0) = (1)(1) = 1.
These expressions follow from the fact that if the woman is a carrier, then each of her sons will have a 50% chance of inheriting the gene and so being affected, whereas if she is not a carrier then there is a probability close to 1 that a son of hers will be unaffected. (In fact, there is a nonzero probability of being affected even if the mother is not a carrier, but this risk, the mutation rate, is small and can be ignored for this example.)

Posterior distribution. Bayes' rule can now be used to combine the information in the data with the prior probability; in particular, interest is likely to focus on the posterior probability that the woman is a carrier.
Using y to denote the joint data (y1, y2), this is simply
P(θ = 1 | y) = p(y|θ = 1) P(θ = 1) / [p(y|θ = 1) P(θ = 1) + p(y|θ = 0) P(θ = 0)]
             = (0.25)(0.5) / [(0.25)(0.5) + (1.0)(0.5)]
             = 0.125 / 0.625 = 0.20.
Intuitively it is clear that if a woman has unaffected children, it is less probable that she is a carrier, and Bayes' rule provides a formal mechanism for determining the extent of the correction. The results can also be described in terms of prior and posterior odds. The prior odds of the woman being a carrier are 0.5/0.5 = 1. The likelihood ratio based on the information about her two unaffected sons is 0.25/1 = 0.25, so the posterior odds are 1 · 0.25 = 0.25. Converting back to a probability, we obtain 0.25/(1 + 0.25) = 0.2, as before.

Adding more data. A key concept of Bayesian analysis is the ease with which sequential analyses can be performed. For example, suppose that the woman has a third son, who is also unaffected. The entire calculation does not need to be redone; rather, we use the previous posterior distribution as the new prior distribution to obtain
P(θ = 1 | y1, y2, y3) = (0.5)(0.20) / [(0.5)(0.20) + (1)(0.8)] = 0.111.
Alternatively, if we suppose that the third son is affected, it is easy to check that the posterior probability of the woman being a carrier becomes 1 (again ignoring the possibility of a mutation).

Exercise 2. Suppose there is a medical diagnostic test for a disease. The sensitivity of the test is 0.95. This means that if a person has the disease, the probability that the test gives a positive response is 0.95. The specificity of the test is 0.90. This means that if a person does not have the disease, the probability that the test gives a negative response is 0.90, or that the false positive rate of the test is 0.10. In the population, 1% of the people have the disease. What is the probability that a person tested has the disease, given that the result of the test is positive? Let D be the event "the person has the disease" and let T be the event "the test gives a positive result."

Exercise 3. Suppose that if θ = 1, then y has a normal distribution with mean 1 and standard deviation σ, and if θ = 2, then y has a normal distribution with mean 2 and standard deviation σ. Also, suppose P(θ = 1) = 0.5 and P(θ = 2) = 0.5.
1. For σ = 2, write the formula for the marginal probability density for y and sketch it.
2. What is P(θ = 1 | y = 1), again supposing σ = 2?
3. Describe how the posterior density of θ changes in shape as σ is increased and as it is decreased.
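Returning to Example 1, the carrier probability and its sequential update take only a few lines of R. The sketch below follows the prior-times-likelihood-then-normalize pattern of Section 2.2; the same pattern can serve as a template for Exercise 2.
[R:]
## Example 1: posterior probability that the woman is a carrier,
## then a sequential update when a third unaffected son is observed.
prior     <- c(carrier = 0.5, not.carrier = 0.5)
lik.2sons <- c(carrier = 0.5^2, not.carrier = 1)    # two unaffected sons

posterior <- (prior * lik.2sons) / sum(prior * lik.2sons)
posterior                          # carrier probability is 0.20

## Yesterday's posterior becomes today's prior.
lik.3rd.son <- c(carrier = 0.5, not.carrier = 1)    # third son unaffected
posterior2 <- (posterior * lik.3rd.son) / sum(posterior * lik.3rd.son)
posterior2                         # carrier probability drops to 0.111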
2.3 General notation for statistical inference
Statistical inference is concerned with drawing conclusions, from numerical data, about quantities that are not observed. For example, a clinical trial of a new cancer drug might be designed to compare the five-year survival probability in a population given the new drug to that in a population under standard treatment. These survival probabilities refer to a large population of patients, and it is neither feasible nor ethically acceptable to experiment on an entire population. Therefore inferences about the true probabilities and, in particular, their differences must be based on a sample of patients. In this example, statistical inference will be used to assess the causal inference – the comparison between the observed outcome in each patient and that patient's unobserved outcome if exposed to the other treatment.

We now distinguish two kinds of estimands – unobserved quantities for which statistical inferences are made: first, potentially observable quantities, such as future observations of a process, or the outcome under the treatment not received in the clinical trial example; and second, quantities that are not directly observable, that is, parameters that govern the hypothetical process leading to the observed data (for example, regression coefficients).

Parameters, data, and predictions
For our general notation, we let θ denote unobservable vector quantities or population parameters of interest, y denote the observed data, and ỹ denote unknown, but potentially observable, quantities (such as the outcomes of the patients under the other treatment, or the outcome under each of the treatments for a new patient similar to those already in the trial). Greek letters are generally used to denote parameters, lower case Roman letters for observed or observable scalars and vectors, and upper case Roman letters for observed or observable matrices.

Observational units and variables
In many statistical studies, data are gathered on each of a set of n objects or units, and we can write the data as a vector, y = (y1, ..., yn). Sometimes several variables are measured on a single unit; thus each yi is actually a vector and the entire data set y is a matrix. The y variables are called the 'outcomes' and are considered 'random' in the sense that, when making inferences, we wish to allow for the possibility that the observed values of the variables could have turned out otherwise, due to the sampling process and the natural variation of the population.

Exchangeability
Exchangeability refers to the invariance property of the joint probability density p(y1, ..., yn) with respect to permutations of the indexes. This idea is fundamental to statistics and is usually the starting point of a statistical analysis. Commonly, data from an exchangeable distribution are modeled as independently and identically distributed (iid) given some unknown parameter θ with distribution p(θ).

Chapter 3 Specifying Priors: Single-parameter models

Chapter Overview
This chapter focuses on single-parameter models, specifically the binomial, Poisson, and normal models. Several methods of specifying priors will be discussed for the three models.

Learning Objectives
At the end of this chapter, the student is expected to
1. estimate parameters for the binomial, Poisson, and normal models; and
2. specify appropriate priors for binomial, Poisson, and normal data.

3.1 Estimating a probability from binomial data
Recall that the simple binomial model aims to estimate an unknown population proportion from the results of a sequence of 'Bernoulli trials'. It provides a natural model for data that arise from a sequence of n exchangeable trials or draws from a large population where each trial gives rise to one of two possible outcomes, conventionally labeled 'success' and 'failure'. Because of the exchangeability, the data can be summarized by the total number of successes in the n trials, which we denote here by y.
Converting from a formulation in terms of exchangeable trials to one using independent and identically distributed random variables is achieved naturally by letting the parameter θ represent the proportion of successes in the population or, equivalently, the probability of success in each trial. The binomial sampling model is
p(y|θ) = Bin(y|n, θ) = (n choose y) θ^y (1 − θ)^(n−y),     (3.1)
where on the left side we suppress the dependence on n because it is regarded as part of the experimental design that is considered fixed; all the probabilities discussed for this problem are assumed to be conditional on n.

Example 2. Estimating the probability of a female birth
As a specific application of the binomial model, we consider the estimation of the sex ratio within a population of human births. The proportion of births that are female has long been a topic of interest both scientifically and to the lay public. Two hundred years ago it was established that the proportion of female births in European populations was less than 0.5, while in this century interest has focused on factors that may influence the sex ratio. The currently accepted value of the proportion of female births in large European-race populations is 0.485. For this example we define the parameter θ to be the proportion of female births, but an alternative way of reporting this parameter is as a ratio of male to female birth rates, ϕ = (1 − θ)/θ.

Let y be the number of girls in n recorded births. By applying the binomial model (3.1), we are assuming that the n births are conditionally independent given θ, with the probability of a female birth equal to θ for all cases. This modeling assumption is motivated by the exchangeability that may be judged when we have no explanatory information (for example, distinguishing multiple births or births within the same family) that might affect the sex of the baby.

3.1.1 Using a Uniform Prior
To perform Bayesian inference in the binomial model, we must specify a prior distribution for θ. At this point, we assume that the prior distribution for θ is uniform on the interval [0, 1].
Elementary application of Bayes' rule as displayed in (2.6), applied to (3.1), then gives the posterior density for θ as
p(θ|y) ∝ θ^y (1 − θ)^(n−y).     (3.2)
With fixed n and y, the factor (n choose y) does not depend on the unknown parameter θ, and so it can be treated as a constant when calculating the posterior distribution of θ. We can recognize (3.2) as the unnormalized form of the beta distribution,
θ|y ∼ Beta(y + 1, n − y + 1).     (3.3)

Example 3. Parameter estimation: Bus Example
After moving to Auckland, Professor George decided that he would take the bus to work each day. However, he wasn't very confident with the bus system in the new city, so for the first week he just took the first bus that came along and was heading in the right direction, towards the city. In the first week, he caught 5 morning buses. Of these 5 buses, two of them took him to the right place, while three of them took him far from work, leaving him with an extra 20 minute walk. Given this information, he would like to try to infer the proportion of the buses that are "good", that would take him right to campus. We will call this fraction θ and we will infer θ using the Bayesian framework. We will start with a prior distribution that describes initial uncertainty about θ and update this to get the posterior distribution, using the data that 2 of the 5 buses Prof. George took were "good".

First we must think about the meaning of the parameter θ in our particular problem so we can choose a sensible prior distribution. Since θ is, in this example, a proportion, we know it cannot be less than 0 or greater than 1. In principle, θ could be any real value between 0 and 1. To keep things simple, we shall make an approximation and assume that the set of possible values for θ is
{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}.
Let us assume that before we got the data (two successes out of 5 trials), we were very uncertain about the value of θ, and that this can be modeled by using a uniform prior distribution. There are 11 possible values for θ being considered with our discrete approximation, so the probability of each is 1/11 = 0.0909. We will construct a Bayes' box to see the components in updating our beliefs about the parameter θ. The partially complete Bayes' Box is given in Table 3.1.

Table 3.1: Starting to make a Bayes' Box for the bus problem
possible values θ   prior p(θ)   likelihood p(x|θ)   prior × likelihood p(θ)p(x|θ)   posterior p(θ|x)
0                   0.0909
0.1                 0.0909
0.2                 0.0909
...                 ...
0.9                 0.0909
1                   0.0909

To get the likelihoods, we need to think about the properties of our experiment. In particular, we should imagine that we knew the value of θ and we were trying to predict what experimental outcome (data) would occur. Ultimately, we want to find the probability of our actual data set (2 out of the 5 buses were "good"), for all of our possible θ values.

Recall that, if there are n repetitions of a "random experiment" and the "success" probability is θ at each repetition, then the number of "successes" x has a binomial distribution. That is,
p(x|θ) = (n choose x) θ^x (1 − θ)^(n−x).
This is the probability mass function for x (if we imagine θ to be known), hence the notation p(x|θ), read as "the probability distribution for x given θ". Since there are five trials (n = 5) in the bus problem, the number of successes x must be one of 0, 1, 2, 3, 4, or 5. If θ is a high number close to 1, then we would expect the resulting value of the data x to be something high like 4 or 5. Low values for x would still be possible, but they would have a small probability. If θ is a small number, we would expect the data to be 0, 1, or 2, with less probability for more than 2 successes. This is just saying in words what is written precisely in the above equation.

To obtain the actual likelihood values that go into the Bayes' Box, we can simply substitute in the known values n = 5 and x = 2:
P(x = 2|θ) = (5 choose 2) θ² (1 − θ)^(5−2)     (3.4)
           = 10 θ² (1 − θ)³.     (3.5)
The resulting equation depends on θ only. We can go through the list of θ values and get a numerical answer for the likelihood P(x = 2|θ), which is what we need for the Bayes' Box. The final steps are, as usual, to multiply the prior by the likelihood and then normalize that to get the posterior distribution. The completed Bayes' Box is given in Table 3.2.

Table 3.2: The completed Bayes' Box for the bus problem
possible values θ   prior p(θ)   likelihood p(x|θ)   prior × likelihood p(θ)p(x|θ)   posterior p(θ|x)
0                   0.0909       0                   0                               0
0.1                 0.0909       0.0729              0.0066                          0.0437
0.2                 0.0909       0.2048              0.0186                          0.1229
0.3                 0.0909       0.3087              0.0281                          0.1852
0.4                 0.0909       0.3456              0.0314                          0.2074
0.5                 0.0909       0.3125              0.0284                          0.1875
0.6                 0.0909       0.2304              0.0209                          0.1383
0.7                 0.0909       0.1323              0.0120                          0.0794
0.8                 0.0909       0.0512              0.0047                          0.0307
0.9                 0.0909       0.0081              0.0007                          0.0049
1                   0.0909       0                   0                               0
Totals              1                                0.151                           1
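The entries of Table 3.2 can be reproduced in R with a few lines, using the same discrete grid of θ values, the uniform prior, and the binomial likelihood with n = 5 and x = 2.
[R:]
## Bayes' box for the bus problem (Table 3.2).
theta <- seq(0, 1, by = 0.1)
prior <- rep(1 / length(theta), length(theta))

likelihood <- dbinom(2, size = 5, prob = theta)    # 10 * theta^2 * (1 - theta)^3
prior.times.lik <- prior * likelihood
posterior <- prior.times.lik / sum(prior.times.lik)

round(data.frame(theta, prior, likelihood, prior.times.lik, posterior), 4)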
3.1.2 Using a Beta Prior
Suppose a beta(a, b) prior density is used for θ:
g(θ; a, b) = [Γ(a + b) / (Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1)   for 0 ≤ θ ≤ 1.
Ignoring the constants in the prior and likelihood that do not depend on the parameter gives us the posterior distribution
p(θ|y) ∝ θ^(a+y−1) (1 − θ)^(b+n−y−1)   for 0 ≤ θ ≤ 1,
which is the shape of the posterior as a function of θ. We recognize that this is the beta distribution with parameters a′ = a + y and b′ = b + n − y. That is, we add the number of successes to a and add the number of failures to b:
p(θ|y) = [Γ(n + a + b) / (Γ(y + a)Γ(n − y + b))] θ^(y+a−1) (1 − θ)^(n−y+b−1)   for 0 ≤ θ ≤ 1.

When we examine the shape of the binomial likelihood function as a function of θ, we see that it is of the same form as the beta(a, b) distribution, a product of θ to a power times (1 − θ) to another power. When we multiply the beta prior times the binomial likelihood, we add the exponents of θ and (1 − θ), respectively. So we start with a beta prior, and we get a beta posterior by the simple rule "add successes to a, add failures to b". This makes using beta(a, b) priors when we have binomial observations particularly easy. Using Bayes' theorem moves us to another member of the same family. We then say that the beta distribution is the conjugate family for the binomial observation distribution.

Jeffreys' prior for binomial. The beta(1/2, 1/2) prior is known as the Jeffreys' prior for the binomial. This gives a prior that is invariant under any continuous transformation of the parameter. This means that Jeffreys' prior is objective in the sense that it does not depend on the particular parameterization used.

3.1.3 Choosing Your Prior for a Binomial Data
When you have vague prior knowledge, one of the beta(a, b) prior distributions would be a suitable prior. For example, if your prior knowledge about θ is that θ is very small, then beta(.5, 1), beta(.5, 2), beta(.5, 3), beta(1, 2), or beta(1, 3) would all be satisfactory priors. All of these conjugate priors offer easy computation of the posterior, together with putting most of the prior probability at small values of θ.

The beta distribution can have a number of shapes. The prior chosen should correspond to your belief. Bolstad and Curran suggested choosing a beta(a, b) prior that matches your prior belief about the mean and standard deviation. Let θ0 be your prior mean for the proportion, and let σ0 be your prior standard deviation for the proportion. Recall that the mean of the beta(a, b) distribution is a/(a + b), while the standard deviation is √(ab / [(a + b)²(a + b + 1)]). Set the mean and the standard deviation to your prior beliefs; that is, we set
θ0 = a/(a + b)   and   σ0 = √(ab / [(a + b)²(a + b + 1)]).
Note that b/(a + b) = 1 − θ0, thus we get
σ0 = √(θ0(1 − θ0) / (a + b + 1)).
Solving these two equations for a and b gives your beta(a, b) prior.

Precautions Before Using Your Conjugate Prior
1. Graph your beta(a, b) prior. If the shape looks reasonably close to what you believe, use it. Otherwise, you can adjust θ0 and σ0 until you find a prior whose graph approximately corresponds to your belief. As long as the prior has reasonable probability over the whole range of values that you think the parameter could possibly be in, it will be a satisfactory prior.
2. Calculate the equivalent sample size of the prior. We note that the sample proportion θ̂ = y/n from a binomial(n, θ) distribution has variance equal to θ(1 − θ)/n. We equate this variance (at θ0, the prior mean) to the prior variance:
θ0(1 − θ0)/n_eq = ab / [(a + b)²(a + b + 1)].
Since θ0 = a/(a + b) and 1 − θ0 = b/(a + b), the equivalent sample size is n_eq = a + b + 1. It says that the amount of information about the parameter from your prior is equivalent to the amount from a random sample of that size. You should always check if this is unrealistically high. Ask yourself, "Is my prior knowledge about θ really equal to the knowledge about θ that I would obtain if I checked a random sample of size n_eq?" If it is not, you should increase your prior standard deviation and recalculate your prior. Otherwise, you would be putting in too much prior information about the parameter relative to the amount of information that will come from the data.

Example 4. Three students are constructing their prior belief about θ, the proportion of Hamilton residents who support building a casino in Hamilton.

Anna thinks that her prior mean is 0.2, and her prior standard deviation is 0.08. The beta(a, b) prior that satisfies her prior belief is found from
0.2 × 0.8 / (a + b + 1) = 0.08².
Therefore her equivalent sample size is a + b + 1 = 25. For Anna's prior, a = 4.8 and b = 19.2.

Bart is a newcomer to Hamilton, so he is not aware of the local feeling for or against the proposed casino. He decides to use a uniform prior. For him, a = b = 1. His equivalent sample size is a + b + 1 = 3.

Chris cannot fit a beta(a, b) prior to match his belief. He believes his prior probability has a trapezoidal shape. He gives the heights of his prior as follows:
Value   Weight
0       0
0.05    1
0.1     2
0.3     2
0.4     1
0.5     0
and he linearly interpolates between them to get his continuous prior. When we interpolate between these points, we see that Chris's prior is given by
g(θ) = 20θ        for 0 ≤ θ ≤ 0.1,
g(θ) = 2          for 0.1 ≤ θ ≤ 0.3,
g(θ) = 5 − 10θ    for 0.3 ≤ θ ≤ 0.5.
Note that Chris's prior is not actually a density since it does not have area equal to one. However, this is not a problem, since the relative weights given by the shape of the distribution are all that is needed; the constant will cancel out.

Effect of the Prior
When we have enough data, the effect of the prior we choose will be small compared to the data. In that case we will find that we can get very similar posteriors despite starting from quite different priors. All that is necessary is that they give reasonable weight over the range that is indicated by the likelihood. The exact shape of the prior does not matter very much. The data are said to "swamp the prior."
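Anna's calculation in Example 4 can be automated. The helper function below is not from the text; it simply solves the two moment equations of Section 3.1.3 for a and b and reports the equivalent sample size, then carries out Precaution 1 by graphing the resulting prior.
[R:]
## Find the beta(a, b) prior matching a stated prior mean and standard deviation.
beta.from.moments <- function(prior.mean, prior.sd) {
  neq <- prior.mean * (1 - prior.mean) / prior.sd^2   # equivalent sample size a + b + 1
  a <- prior.mean * (neq - 1)
  b <- (1 - prior.mean) * (neq - 1)
  c(a = a, b = b, equiv.sample.size = neq)
}

beta.from.moments(0.2, 0.08)          # Anna: a = 4.8, b = 19.2, neq = 25

## Precaution 1: graph the prior and check it matches your belief.
curve(dbeta(x, 4.8, 19.2), from = 0, to = 1,
      xlab = expression(theta), ylab = "prior density")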
3.2 Bayes' Theorem for Poisson Parameter with a Continuous Prior
We have a random sample y1, y2, ..., yn from a Poisson(µ) distribution. The parameter µ can have any positive value, so we should use a continuous prior defined on all positive values. The likelihood of a single draw from a Poisson(µ) distribution is given by
f(y|µ) = µ^y e^(−µ) / y!   for y = 0, 1, ... and µ > 0.
The part that determines the shape of the likelihood is
f(y|µ) ∝ µ^y e^(−µ).
When y1, y2, ..., yn is a random sample from a Poisson(µ) distribution, the likelihood of the random sample is the product of the individual likelihoods. This simplifies to
f(y1, ..., yn|µ) = ∏_{i=1}^{n} f(yi|µ) ∝ µ^(Σyi) e^(−nµ).
We recognize this as the likelihood where Σyi is a single draw from a Poisson(nµ) distribution. It has the shape of a gamma(r′, v′) density, where r′ = Σyi + 1 and v′ = n.

3.2.1 Some Prior Distributions for Poisson
Positive uniform prior density. If we have no idea what the value of µ is prior to looking at the data, we would consider that we should give all positive values of µ equal weight. That is,
g(µ) = 1   for µ > 0.
Clearly this prior density is improper since its integral over all possible values is infinite. Nevertheless, the posterior will be proper in this case and we can use it for making inference about µ. The proportional posterior will be
g(µ|y1, ..., yn) ∝ g(µ) f(y1, ..., yn|µ) ∝ 1 × µ^(Σyi) e^(−nµ).
We see that the posterior has the same shape as the likelihood function, so we know that it is a gamma(r′, v′) density, where r′ = Σy + 1 and v′ = n. Clearly the posterior is proper despite starting from an improper prior.

Jeffreys' prior for Poisson. The parameter indexes all possible observation distributions. Any one-to-one continuous function of the parameter would give an equally valid index. Jeffreys' method gives us priors which are objective in the sense that they are invariant under any continuous transformation of the parameter. The Jeffreys' prior for the Poisson is
g(µ) ∝ 1/√µ   for µ > 0.
This also will be an improper prior, since its integral over the whole range of possible values is infinite. However, it is not non-informative, since it gives more weight to small values. The proportional posterior will be the prior times the likelihood. Using the Jeffreys' prior, the proportional posterior will be
g(µ|y1, ..., yn) ∝ g(µ) f(y1, ..., yn|µ) ∝ (1/√µ) × µ^(Σyi) e^(−nµ) ∝ µ^(Σy − 1/2) e^(−nµ),
which we recognize as the shape of a gamma(r′, v′) density, where r′ = Σy + 1/2 and v′ = n.

Conjugate prior. The conjugate prior for observations from the Poisson distribution with parameter θ will have the same form as the likelihood, whose shape is given by
g(θ) ∝ e^(−kθ) e^(l log θ) ∝ θ^l e^(−kθ).
The distribution having this shape is known as the gamma(r, v) distribution and has density given by
g(θ; r, v) = v^r θ^(r−1) e^(−vθ) / Γ(r),
where r − 1 = l and v = k, and v^r / Γ(r) is the scale factor needed to make this a density.

When we have a single Poisson(θ) observation, and use a gamma(r, v) prior for θ, the shape of the posterior is given by
g(θ|y) ∝ g(θ) × f(y|θ) ∝ [v^r θ^(r−1) e^(−vθ) / Γ(r)] × [θ^y e^(−θ) / y!] ∝ θ^(r−1+y) e^(−(v+1)θ).
We recognize this to be a gamma(r′, v′) density, where the constants are updated by the simple formulas r′ = r + y and v′ = v + 1. We add the observation y to r, and we add 1 to v. Hence when we have a random sample y1, y2, ..., yn from a Poisson(θ) distribution, and use a gamma(r, v) prior, we repeat the updating after each observation, using the posterior from the ith observation as the prior for the (i + 1)st observation. We end up with a gamma(r′, v′) posterior where r′ = r + Σy and v′ = v + n. The simple updating rules are "add the sum of the observations to r" and "add the number of observations to v."
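The conjugate updating rules can be verified directly in R. The sketch below anticipates Example 5 below, using Chase's gamma(6.25, 2.5) prior and the eight weekly accident counts reported there.
[R:]
## Conjugate gamma-Poisson updating: gamma(r, v) prior plus Poisson counts
## y1,...,yn gives a gamma(r + sum(y), v + n) posterior.
r <- 6.25; v <- 2.5                     # Chase's prior from Example 5
y <- c(3, 2, 0, 8, 2, 4, 6, 1)          # eight weeks of accident counts

r.post <- r + sum(y)                    # 32.25
v.post <- v + length(y)                 # 10.5
r.post / v.post                         # posterior mean

## Plot posterior (solid) against prior (dashed).
curve(dgamma(x, shape = r.post, rate = v.post), from = 0, to = 8,
      xlab = expression(theta), ylab = "density")
curve(dgamma(x, shape = r, rate = v), add = TRUE, lty = 2)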
Bolstad and Curran suggested that we summarize our prior belief by our prior mean m and our prior standard deviation s; the prior variance is then s², the square of the prior standard deviation. We then find the gamma conjugate prior that matches those first two prior moments. That means that r and v will be the simultaneous solutions of the two equations
\[
m = \frac{r}{v} \quad \text{and} \quad s^2 = \frac{r}{v^2}.
\]
Hence v = m/s². Substituting this into the first equation and solving for r, we find r = m²/s². This gives our gamma(r, v) prior.

Precautions before using your conjugate prior

1. Graph your prior. If the shape looks reasonably close to your prior belief, use it. Otherwise you can adjust your prior mean m and prior standard deviation s until you find a prior with a shape matching your prior belief.

2. Calculate the equivalent sample size of your prior. This is the size of a random sample of Poisson(θ) variables that carries the same amount of prior information about θ as you are putting in with your prior. We note that if y₁, y₂, ..., yₙ is a random sample from Poisson(θ), then ȳ will have mean θ and variance θ/n. The equivalent sample size will be the solution of
\[
\frac{\theta}{n_{eq}} = \frac{r}{v^2}.
\]
Setting θ equal to the prior mean θ = r/v, the equivalent sample size of the gamma(r, v) prior for θ is n_eq = v. We check to make sure this is not too large.

Example 5. The weekly number of traffic accidents on a highway has the Poisson(θ) distribution. Four students are going to count the number of traffic accidents for each of the next eight weeks. They are going to analyze this in a Bayesian manner, so they each need a prior distribution. Aretha says she has no prior information, so she will assume all possible values are equally likely. Thus she will use the positive uniform prior g(θ) = 1 for θ > 0, which is improper. Byron also says he has no prior information, but he wants his prior to be invariant if the parameter is multiplied by a constant. Thus he uses the Jeffreys' prior for the Poisson, g(θ) = θ^(−1/2), which is also improper. Chase decides that he believes the prior mean should be 2.5 and the prior standard deviation should be 1. He decides to use the gamma(r, v) prior that matches his prior mean and standard deviation, and finds that v = 2.5 and r = 6.25. His equivalent sample size is n_eq = 2.5, which he decides is acceptable, since he will be putting in information worth 2.5 observations and there will be 8 observations from the data. Diana decides that her prior distribution has a trapezoidal shape found by interpolating the prior weights given below:

Value    Weight
0        0
2        2
4        2
8        0
10       0

The numbers of accidents on the highway over the next 8 weeks are: 3, 2, 0, 8, 2, 4, 6, 1. Aretha will have a gamma(27, 8) posterior, Byron will have a gamma(26.5, 8) posterior, and Chase will have a gamma(32.25, 10.5) posterior. Diana finds her posterior numerically using the equation
\[
g(\theta|y_1, y_2, \dots, y_n) = \frac{g(\theta) \times f(y_1,\dots,y_n|\theta)}{\int_0^{\infty} g(\theta) \times f(y_1,\dots,y_n|\theta)\, d\theta}.
\]
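Diana's numerical posterior can be computed on a grid. The following is a minimal sketch under our own choice of grid (variable names and grid limits are ours, not from the text): it interpolates her trapezoidal weights, multiplies by the Poisson likelihood, and normalizes by numerical integration.

import numpy as np

# Sketch: Diana's posterior from Example 5, computed numerically on a grid.
theta = np.linspace(0.001, 10.0, 2000)
prior = np.interp(theta, [0, 2, 4, 8, 10], [0, 2, 2, 0, 0])     # trapezoidal weights
data = np.array([3, 2, 0, 8, 2, 4, 6, 1])
log_lik = data.sum() * np.log(theta) - len(data) * theta        # Poisson likelihood shape
post = prior * np.exp(log_lik - log_lik.max())                  # subtract max to avoid underflow
post /= np.trapz(post, theta)                                   # normalize numerically
print(theta[np.argmax(post)])                                   # posterior mode, a bit above 3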
3.3 Bayes' Theorem for Normal Mean with a Discrete Prior

For a Single Normal Observation

Suppose that we are going to take a single observation from the conditional density f(y|µ), which is known to be normal with known variance σ², and suppose further that there are only m possible values µ₁, µ₂, ..., µₘ for the mean. If we do not have any prior information, we could give all values equal probability. The likelihood gives relative weights to all the possible parameter values according to how likely the observed value was given each parameter value. It looks like the conditional observation distribution given the parameter µ, but instead of the parameter being fixed and the observation varying, we fix the observation at the one that actually occurred and vary the parameter over all possible values. We only need to know it up to a multiplicative constant, since the relative weights are all that is needed to apply Bayes' theorem. The posterior is proportional to prior times likelihood, so it equals
\[
g(\mu|y) = \frac{\text{prior} \times \text{likelihood}}{\sum (\text{prior} \times \text{likelihood})}.
\]
The conditional observation distribution of y|µ is normal with mean µ and known variance σ². Its density is
\[
f(y|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2}.
\]
The likelihood of each parameter value is the value of the observation distribution at the observed value. The part that does not depend on the parameter µ is the same for all parameter values, so it can be absorbed into the proportionality constant. The important part is the one that depends on the parameter µ, which gives the shape of the likelihood; it is given by
\[
f(y|\mu) \propto e^{-\frac{1}{2\sigma^2}(y-\mu)^2}, \qquad (3.6)
\]
where y is held constant at the observed value and µ is allowed to vary over all possible values.

Example 6. Suppose y|µ is normal with mean µ and known variance σ² = 1. We know there are only five possible values for µ: 2.0, 2.5, 3.0, 3.5, and 4.0. We let them be equally likely for our prior. We take a single observation of y and obtain the value y = 3.2. Let
\[
z = \frac{y-\mu}{\sigma}.
\]
The values of the likelihood may be found using a table of ordinates of the standard normal distribution evaluated at z. If we evaluate the likelihood using the normal density formula, the likelihood is proportional to
\[
e^{-\frac{1}{2\sigma^2}(y-\mu)^2},
\]
where y is held at 3.2 and µ varies over all possible values.

Table 3.3: Finding the posterior from the normal density formula

µ     Prior   Likelihood                        Prior × Likelihood   Posterior
2.0   0.2     e^{-(1/2)(3.2-2.0)^2} = 0.4868    0.0974               0.1239
2.5   0.2     e^{-(1/2)(3.2-2.5)^2} = 0.7827    0.1565               0.1990
3.0   0.2     e^{-(1/2)(3.2-3.0)^2} = 0.9802    0.1960               0.2493
3.5   0.2     e^{-(1/2)(3.2-3.5)^2} = 0.9560    0.1912               0.2432
4.0   0.2     e^{-(1/2)(3.2-4.0)^2} = 0.7261    0.1452               0.1846
                                                0.7863               1.0000
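Table 3.3 is easy to reproduce numerically. The short sketch below (our own illustration, not code from the text) carries out the prior × likelihood calculation and the normalization.

import numpy as np

# Sketch: reproduce Table 3.3 -- discrete prior on mu, single observation y = 3.2, sigma = 1.
mu = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
prior = np.full(5, 0.2)
y, sigma = 3.2, 1.0
likelihood = np.exp(-0.5 * ((y - mu) / sigma) ** 2)        # shape only; constants cancel
posterior = prior * likelihood / np.sum(prior * likelihood)
print(np.round(posterior, 4))                              # [0.1239 0.199  0.2493 0.2432 0.1846]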
For a Random Sample of Normal Observations

Suppose we have a random sample y₁, y₂, ..., yₙ of observations where each observation yⱼ|µ is normal with mean µ and variance σ², which is known. The observations in a random sample are all independent of each other, so the joint likelihood of the sample is the product of the individual observation likelihoods. This gives
\[
f(y_1,\dots,y_n|\mu) = f(y_1|\mu) \times f(y_2|\mu) \times \cdots \times f(y_n|\mu).
\]
Thus, given a random sample, Bayes' theorem with a discrete prior is given by
\[
g(\mu|y_1,\dots,y_n) \propto g(\mu) \times f(y_1|\mu) \times \cdots \times f(y_n|\mu).
\]
One way of finding the posterior probabilities is by analyzing the observations sequentially, one at a time. We do this by letting the posterior from the previous observation become the prior for the next observation. The likelihood of a single observation yⱼ is the column of values of the observation distribution at each possible parameter value, evaluated at that observed value. The posterior is proportional to prior times likelihood.

Example 7. Suppose we take a random sample of three observations from a normal distribution having mean µ and known variance σ² = 1. The observations are 3.2, 2.2, and 3.6. The possible values of µ are 2.0, 2.5, 3.0, 3.5, and 4.0. Again, we will use the prior that gives them all equal weight. We want to use Bayes' theorem to find our posterior belief about µ given the whole random sample. The results of analyzing the observations one at a time are shown in Table 3.4.

Table 3.4: Analyzing the observations one at a time

µ     Prior1    Likelihood1                       Prior × Likelihood   Posterior1
2.0   0.2       e^{-(1/2)(3.2-2.0)^2} = 0.4868    0.0974               0.1239
2.5   0.2       e^{-(1/2)(3.2-2.5)^2} = 0.7827    0.1565               0.1990
3.0   0.2       e^{-(1/2)(3.2-3.0)^2} = 0.9802    0.1960               0.2493
3.5   0.2       e^{-(1/2)(3.2-3.5)^2} = 0.9560    0.1912               0.2432
4.0   0.2       e^{-(1/2)(3.2-4.0)^2} = 0.7261    0.1452               0.1846
                                                  0.7863               1.0000

µ     Prior2    Likelihood2                       Prior × Likelihood   Posterior2
2.0   0.1239    e^{-(1/2)(2.2-2.0)^2} = 0.9802    0.1214               0.1916
2.5   0.1990    e^{-(1/2)(2.2-2.5)^2} = 0.9560    0.1902               0.3002
3.0   0.2493    e^{-(1/2)(2.2-3.0)^2} = 0.7261    0.1810               0.2857
3.5   0.2432    e^{-(1/2)(2.2-3.5)^2} = 0.4296    0.1045               0.1649
4.0   0.1846    e^{-(1/2)(2.2-4.0)^2} = 0.1979    0.0363               0.0576
                                                  0.6336               1.0000

µ     Prior3    Likelihood3                       Prior × Likelihood   Posterior3
2.0   0.1916    e^{-(1/2)(3.6-2.0)^2} = 0.2780    0.0533               0.0792
2.5   0.3002    e^{-(1/2)(3.6-2.5)^2} = 0.5461    0.1639               0.2435
3.0   0.2857    e^{-(1/2)(3.6-3.0)^2} = 0.8353    0.2386               0.3545
3.5   0.1649    e^{-(1/2)(3.6-3.5)^2} = 0.9950    0.1641               0.2438
4.0   0.0576    e^{-(1/2)(3.6-4.0)^2} = 0.9231    0.0532               0.0790
                                                  0.6731               1.0000

Another way of finding the posterior probabilities is by analyzing the sample all together in a single step. Recall that the posterior is proportional to prior × likelihood, and the joint likelihood of the sample is the product of the individual observation likelihoods. Each observation is normal, so it has a normal likelihood. This gives the joint likelihood
\[
f(y_1,\dots,y_n|\mu) \propto e^{-\frac{1}{2\sigma^2}(y_1-\mu)^2} \times e^{-\frac{1}{2\sigma^2}(y_2-\mu)^2} \times \cdots \times e^{-\frac{1}{2\sigma^2}(y_n-\mu)^2}
\propto e^{-\frac{1}{2\sigma^2}\left[(y_1-\mu)^2 + (y_2-\mu)^2 + \cdots + (y_n-\mu)^2\right]}.
\]
We look at the terms in brackets:
\[
(y_1-\mu)^2 + \cdots + (y_n-\mu)^2 = y_1^2 - 2y_1\mu + \mu^2 + \cdots + y_n^2 - 2y_n\mu + \mu^2
= (y_1^2 + \cdots + y_n^2) - 2\mu(y_1 + \cdots + y_n) + n\mu^2.
\]
When we substitute this back in, factor out n, and complete the square, we get
\[
f(y_1,\dots,y_n|\mu) \propto e^{-\frac{n}{2\sigma^2}\left[\mu^2 - 2\mu\bar{y} + \bar{y}^2 - \bar{y}^2 + \frac{y_1^2+\cdots+y_n^2}{n}\right]}
\propto e^{-\frac{n}{2\sigma^2}\left[\mu^2 - 2\mu\bar{y} + \bar{y}^2\right]} \times e^{-\frac{n}{2\sigma^2}\left[\frac{y_1^2+\cdots+y_n^2}{n} - \bar{y}^2\right]}.
\]
We can see from the above that the likelihood of the normal random sample y₁, y₂, ..., yₙ is proportional to the likelihood of the sample mean ȳ. Absorbing the part that does not involve µ into the proportionality constant, we get
\[
f(y_1,\dots,y_n|\mu) \propto e^{-\frac{1}{2\sigma^2/n}(\bar{y}-\mu)^2}.
\]
We recognize that this likelihood has the shape of a normal distribution with mean µ and variance σ²/n. We know from statistical inference that ȳ is normally distributed with mean µ and variance σ²/n. So the joint likelihood of the random sample is proportional to the likelihood of the sample mean, which is
\[
f(\bar{y}|\mu) \propto e^{-\frac{1}{2\sigma^2/n}(\bar{y}-\mu)^2}. \qquad (3.7)
\]
We can think of this as drawing a single value, ȳ, the sample mean, from the normal distribution with mean µ and variance σ²/n.
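The claim that analyzing the observations one at a time and analyzing them all at once through ȳ give the same posterior can be checked directly. The sketch below (our own illustration using the data of Example 7) does both calculations and prints the same posterior, matching the final column of Table 3.4 up to rounding.

import numpy as np

# Sketch: sequential updating vs. the single-step analysis using ybar.
mu = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
sigma2, data = 1.0, [3.2, 2.2, 3.6]

post = np.full(5, 0.2)                                    # equal-weight prior
for y in data:                                            # one observation at a time
    post = post * np.exp(-0.5 * (y - mu) ** 2 / sigma2)
    post /= post.sum()

ybar, n = np.mean(data), len(data)
lik_ybar = np.exp(-0.5 * (ybar - mu) ** 2 / (sigma2 / n))
post_onestep = lik_ybar / lik_ybar.sum()                  # equal prior cancels

print(np.round(post, 4))                                  # [0.0791 0.2436 0.3545 0.2436 0.0791]
print(np.round(post_onestep, 4))                          # same values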
Example 8. In the preceding example, Example 7, the sample mean was ȳ = 3.0. We use the likelihood of ȳ, which is proportional to the likelihood of the whole sample. The results are shown in Table 3.5. Up to rounding, this is the same posterior we found in Table 3.4 by analyzing the observations one at a time.

Table 3.5: Analyzing the observations all together using the likelihood of the sample mean (here 2σ²/n = 2/3)

µ     Prior   Likelihood                          Prior × Likelihood   Posterior
2.0   0.2     e^{-(3.0-2.0)^2/(2/3)} = 0.2231     0.0446               0.0791
2.5   0.2     e^{-(3.0-2.5)^2/(2/3)} = 0.6873     0.1375               0.2437
3.0   0.2     e^{-(3.0-3.0)^2/(2/3)} = 1.0000     0.2000               0.3545
3.5   0.2     e^{-(3.0-3.5)^2/(2/3)} = 0.6873     0.1375               0.2437
4.0   0.2     e^{-(3.0-4.0)^2/(2/3)} = 0.2231     0.0446               0.0791
                                                  0.5642               1.0000

3.4 Bayes' Theorem for Normal Mean with a Continuous Prior

As before, we have a random sample y₁, y₂, ..., yₙ from a normal distribution with mean µ and known variance σ². It is more realistic to believe that all values of µ are possible, at least all those in an interval. We know that Bayes' theorem can be summarized as
\[
g(\mu|y_1, y_2, \dots, y_n) \propto g(\mu) \times f(y_1, y_2, \dots, y_n|\mu).
\]
Here we allow g(µ) to be a continuous prior density. In this case, we can evaluate the posterior by dividing the prior × likelihood by the integral of the prior × likelihood over the whole range of possible values. Thus
\[
g(\mu|y_1, y_2, \dots, y_n) = \frac{g(\mu) \times f(y_1, y_2, \dots, y_n|\mu)}{\int g(\mu) \times f(y_1, y_2, \dots, y_n|\mu)\, d\mu}. \qquad (3.8)
\]
In the previous section, we saw that the likelihood of the random sample is proportional to the likelihood of the sample mean ȳ. So
\[
g(\mu|y_1, y_2, \dots, y_n) = \frac{g(\mu) \times e^{-\frac{1}{2\sigma^2/n}(\bar{y}-\mu)^2}}{\int g(\mu) \times e^{-\frac{1}{2\sigma^2/n}(\bar{y}-\mu)^2}\, d\mu}.
\]
This works for any continuous prior density g(µ). However, it requires an integration, which may have to be done numerically.

Flat Prior Density for µ (Jeffreys' Prior for Normal Mean)

The flat prior gives each possible value of µ equal weight. It does not favor any value over any other value: g(µ) = 1. This is not really a proper distribution, since −∞ < µ < ∞, so it cannot integrate to 1. This improper prior works out all right, since the posterior will integrate to 1. (We saw this in the previous sections.) The Jeffreys' prior for the mean of a normal distribution turns out to be the flat prior.

For a single normally distributed observation y with mean µ and known variance σ², the likelihood is given by
\[
f(y|\mu) \propto e^{-\frac{1}{2\sigma^2}(y-\mu)^2}
\]
if we ignore the constant of proportionality. Since the prior always equals 1, the posterior is proportional to this. We rewrite it as
\[
g(\mu|y) \propto e^{-\frac{1}{2\sigma^2}(\mu-y)^2}.
\]
For a normal random sample y₁, y₂, ..., yₙ, we use the likelihood of the sample mean, since we saw previously that it is proportional to the likelihood of the random sample itself. Again, we know that ȳ is normally distributed with mean µ and variance σ²/n. Hence the likelihood has shape given by
\[
f(\bar{y}|\mu) \propto e^{-\frac{1}{2\sigma^2/n}(\mu-\bar{y})^2}.
\]
Since the prior always equals 1, the posterior is proportional to this. We can rewrite it as
\[
g(\mu|\bar{y}) \propto e^{-\frac{1}{2\sigma^2/n}(\mu-\bar{y})^2}.
\]

Normal Prior Density for µ

Single observation. Suppose we have a prior distribution that is normal with mean m and variance s². That is,
\[
g(\mu) \propto e^{-\frac{1}{2s^2}(\mu-m)^2},
\]
where we are ignoring the part that does not involve µ. The shape of the likelihood is
\[
f(y|\mu) \propto e^{-\frac{1}{2\sigma^2}(y-\mu)^2}.
\]
The prior times likelihood is
\[
g(\mu) \times f(y|\mu) \propto e^{-\frac{1}{2}\left[\frac{(\mu-m)^2}{s^2} + \frac{(y-\mu)^2}{\sigma^2}\right]}
\propto e^{-\frac{1}{2\sigma^2 s^2/(\sigma^2+s^2)}\left(\mu - \frac{\sigma^2 m + s^2 y}{\sigma^2+s^2}\right)^2}
\]
(Show this!). We recognize from this shape that the posterior is a normal distribution having mean and variance given by
\[
m' = \frac{\sigma^2 m + s^2 y}{\sigma^2 + s^2} \quad \text{and} \quad (s')^2 = \frac{\sigma^2 s^2}{\sigma^2 + s^2}, \qquad (3.9)
\]
respectively.
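The "Show this!" step can also be checked numerically. The sketch below is a minimal check under illustrative values of m, s, y, and σ chosen by us (not taken from the text): it compares the prior × likelihood on a grid with the normal density implied by (3.9).

import numpy as np

# Sketch: numerical check of the completing-the-square result leading to (3.9).
m, s, y, sigma = 10.0, 2.0, 12.0, 1.5          # illustrative values, not from the text
mu = np.linspace(5.0, 15.0, 2001)

post_shape = np.exp(-0.5 * ((mu - m) ** 2 / s**2 + (y - mu) ** 2 / sigma**2))
post_shape /= np.trapz(post_shape, mu)                      # normalize the grid posterior

m_post = (sigma**2 * m + s**2 * y) / (sigma**2 + s**2)      # posterior mean from (3.9)
v_post = sigma**2 * s**2 / (sigma**2 + s**2)                # posterior variance from (3.9)
closed_form = np.exp(-0.5 * (mu - m_post) ** 2 / v_post)
closed_form /= np.trapz(closed_form, mu)

print(np.max(np.abs(post_shape - closed_form)) < 1e-8)      # True: the two curves agree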
We started with a normal(m, s²) prior and ended up with a normal(m′, (s′)²) posterior. This shows that the normal(m, s²) distribution is the conjugate family for the normal observation distribution with known variance. Equation (3.9) gives us the updating rule for the normal family. We can simplify it by first introducing the precision of a distribution, which is the reciprocal of the variance. The posterior precision is thus given by
\[
\frac{1}{(s')^2} = \left(\frac{\sigma^2 s^2}{\sigma^2 + s^2}\right)^{-1} = \frac{1}{\sigma^2} + \frac{1}{s^2}.
\]
That is, the posterior precision is the sum of the prior precision and the precision of the observation.
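As a quick sketch (our own illustration; the function name and numbers are not from the text), the updating rule can be coded in precision form. Writing the posterior mean as a precision-weighted average of m and y is algebraically equivalent to the formula for m′ in (3.9).

# Sketch: conjugate normal updating for a single observation with known variance sigma2,
# written in precision (1/variance) form.
def update_normal(m, s2, y, sigma2):
    post_precision = 1.0 / s2 + 1.0 / sigma2          # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (m / s2 + y / sigma2)      # precision-weighted average of m and y
    return post_mean, post_var

print(update_normal(m=10.0, s2=4.0, y=12.0, sigma2=2.25))   # (11.28, 1.44), matching (3.9)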
