Baldi et al. Chapter 9 (3rd ed) PDF - Introducing Probability
2013
Baldi et al.
Summary
This chapter from Baldi et al.'s textbook introduces probability and its role in statistics, using examples such as estimating the proportion of adults sick with the flu and tossing a coin. It highlights how chance behavior, though unpredictable in the short run, shows a regular pattern in the long run, and it emphasizes that the idea of probability is empirical, that is, based on observation.
Full Transcript
Baldi-4100190 psls November 8, 2013 15:5
CHAPTER 9 Introducing Probability
IN THIS CHAPTER WE COVER...
- The idea of probability
- Probability models
- Probability rules
- Discrete probability models
- Continuous probability models
- Random variables
- Personal probability*
- Risk and odds*
Why is probability, the mathematics of chance behavior, needed to understand statistics, the science of data? Let’s look at a typical sample survey.
EXAMPLE 9.1 Who gets the flu?
Example 7.7 (page 163) described a Gallup-Healthways survey seeking to find out what proportion of American adults were sick with the flu in October 2012. In that month, Gallup took a national random sample of 28,295 adults and found that 594 of the people in the sample said they were sick with the flu the previous day. The proportion who were sick with the flu was
sample proportion = 594/28,295 = 0.021 (that is, 2.1%)
Because all adults had the same chance to be among the chosen 28,295, it seems reasonable to use this 2.1% as an estimate of the unknown proportion in the population. It’s a fact that 2.1% of the sample were sick with the flu; we know because Gallup asked them. We don’t know what percent of all adults in the United States were sick with the flu in October 2012, but we estimate that about 2.1% were. This is a basic move in statistics: Use a result from a sample to estimate something about a population.
What if Gallup took a second random sample of 28,295 adults and asked them the same question? The new sample would have different people in it. It is almost certain that there would not be exactly 594 “sick with the flu” responses. That is, Gallup’s estimate of the proportion of adults who were sick with the flu in October 2012 will vary from sample to sample. Could it happen that one random sample finds that 2.1% of adults were sick with the flu and a second random sample finds that 5.3% were?
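The closing question of Example 9.1 can be explored with a short simulation. This is a minimal sketch, assuming purely for illustration that the true population proportion equals the observed 2.1%; the names `simulate_sample_proportion`, `ASSUMED_P`, and `REPEATS` are hypothetical and are not part of the Gallup survey.

```python
import random

SAMPLE_SIZE = 28_295   # sample size reported in Example 9.1
ASSUMED_P = 0.021      # assumption: the population proportion equals the sample's 2.1%
REPEATS = 200          # hypothetical number of repeated random samples

def simulate_sample_proportion(n, p, rng):
    """Return the proportion of 'sick' responses in one simulated sample of size n."""
    sick = sum(1 for _ in range(n) if rng.random() < p)
    return sick / n

rng = random.Random(2012)  # fixed seed so the run is reproducible
proportions = [
    simulate_sample_proportion(SAMPLE_SIZE, ASSUMED_P, rng) for _ in range(REPEATS)
]

print(f"smallest simulated proportion: {min(proportions):.4f}")
print(f"largest simulated proportion:  {max(proportions):.4f}")
```

Under this assumption the simulated proportions differ from sample to sample but stay close to 0.021, so a second sample showing 5.3% would be extremely unlikely.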
Random samples eliminate bias from the act of choosing a sample, but they can still be wrong because of the variability that results when we choose at random. If the variation when we take repeat samples from the same population is too great, we can’t trust the results of any one sample. This is where we need facts about probability to make progress in statistics. Because Gallup uses chance to choose its samples, the laws of probability govern the behavior of the samples. We will address this in greater detail in Chapter 13 on sampling distributions. Our purpose in this chapter is to understand the language of probability, but without going into the mathematics of probability theory.
The idea of probability
To understand why we can trust random samples and randomized comparative experiments, we must look closely at chance behavior. The big fact that emerges is this: Chance behavior is unpredictable in the short run but has a regular and predictable pattern in the long run.
Toss a coin or choose an SRS. The result can’t be predicted in advance, because the result will vary when you toss the coin or choose the sample repeatedly. But there is still a regular pattern in the results, a pattern that emerges clearly only after many repetitions. This remarkable fact is the basis for the idea of probability.
EXAMPLE 9.2 Coin tossing
When you toss a coin, there are only two possible outcomes, heads or tails. Figure 9.1 shows the results of tossing a coin 5000 times on two separate occasions. For each number of tosses from 1 to 5000, we have plotted the proportion of those tosses that gave a head. Trial A (solid line) begins tail, head, tail, tail. You can see that the proportion of heads for Trial A starts at 0 on the first toss, rises to 0.5 when the second toss gives a head, then falls to 0.33 and 0.25 as we get two more tails. Trial B (dotted line), on the other hand, starts with five straight heads, so the proportion of heads is 1 until the sixth toss.
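The running proportion described in Example 9.2 is easy to reproduce. The sketch below first recomputes the opening of Trial A (tail, head, tail, tail) and then simulates a fresh trial of 5000 tosses, assuming a fair coin; the function name `proportions_of_heads` and the seed are illustrative choices, and the simulated trial is not the data behind Figure 9.1.

```python
import random

def proportions_of_heads(tosses):
    """Return the running proportion of heads after each toss in a sequence of 'H'/'T'."""
    heads = 0
    result = []
    for count, toss in enumerate(tosses, start=1):
        heads += (toss == "H")
        result.append(heads / count)
    return result

# First four tosses of Trial A as described in Example 9.2: tail, head, tail, tail.
trial_a_start = proportions_of_heads("THTT")
print([round(p, 2) for p in trial_a_start])  # [0.0, 0.5, 0.33, 0.25]

# A simulated trial of 5000 tosses of a fair coin (seed chosen arbitrarily).
rng = random.Random(9)
simulated = proportions_of_heads(rng.choice("HT") for _ in range(5000))
for n in (1, 5, 10, 50, 100, 500, 1000, 5000):
    print(f"after {n:>4} tosses: proportion of heads = {simulated[n - 1]:.3f}")
```

Changing the seed gives a different early path, much as Trials A and B differ, while the proportion after several thousand tosses should still land near 0.5.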
The proportion of tosses that produce heads is quite variable at first. Trial A starts low and Trial B starts high. As we make more and more tosses, however, the proportion of heads for both trials gets close to 0.5 and stays there. If we made yet a third trial at tossing the coin a great many times, the proportion of heads would again settle down to 0.5 in the long run. This is the intuitive idea of probability. Probability 0.5 means “occurs half the time in a very large number of trials.” The probability 0.5 appears as a horizontal line on the graph.
We might suspect that a coin has probability 0.5 of coming up heads just because the coin has two sides. But we can’t be sure. The coin might be unbalanced. In fact, spinning a penny or nickel on a flat surface, rather than tossing the coin, doesn’t give heads probability 0.5. The idea of probability is empirical. That is, it is based on observation rather than theorizing. Probability describes what happens in very many trials, and we must actually observe many trials to pin down a probability. In the case of tossing a coin, some diligent people have in fact made thousands of tosses.
FIGURE 9.1 The proportion of tosses of a coin that give a head changes as we make more tosses. Eventually, however, the proportion approaches 0.5, the probability of a head. This figure shows the results of two trials of 5000 tosses each. (Axes: number of tosses, 1 to 5000; proportion of heads, 0.0 to 1.0.)
EXAMPLE 9.3 Theory and practice
How does practice agree with theory? When it comes to coin tossing, we have a few historical examples of very large trial numbers. The French naturalist Count Buffon (1707–1788) tossed a coin 4040 times. Around 1900, the English statistician Karl Pearson heroically tossed a coin 24,000 times.
While imprisoned by the Germans during World War II, the South African mathematician John Kerrich tossed a coin 10,000 times. Here are their results:
                      Buffon   Pearson   Kerrich
Total tosses           4,040    24,000    10,000
Number of heads        2,048    12,012     5,067
Proportion of heads   0.5069    0.5005    0.5067
Genetic theory predicts an equal proportion over the long term of male and female newborns because each spermatozoid carries either an X or a Y chromosome. However, many factors affect the dispersion and success rate of gametes, and birth certificates indicate a slight departure from the equal-proportions model. For example, the U.S. National Center for Health Statistics reports the number of births (in thousands) by gender in the United States for the following years:¹
                        1990     2000     2002     2008
Males                   2129     2077     2058     2173
Females                 2029     1982     1964     2074
Proportion of males   0.5120   0.5117   0.5117   0.5117
RANDOMNESS AND PROBABILITY
We call a phenomenon random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions.
The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.
That some things are random is an observed fact about the world. The outcome of a coin toss, the time between emissions of particles by a radioactive source, and the sexes of the next litter of lab rats are all random. So is the outcome of a random sample or a randomized experiment. Probability theory is the branch of mathematics that describes random behavior. Of course, we can never observe a probability exactly. We could always continue tossing the coin, for example. Mathematical probability is an idealization based on imagining what would happen in an indefinitely long series of trials.
The best way to understand randomness is to observe random behavior, as in Figure 9.1.
You can do this with physical devices like coins, but computer simulations (imitations) of random behavior allow faster exploration. The Probability applet is a computer simulation that animates Figure 9.1. It allows you to choose the probability of a head and simulate any number of tosses of a coin with that probability. Experience shows that the proportion of heads gradually settles down close to the chosen probability. Equally important, it also shows that the proportion in a small or moderate number of tosses can be far from the probability. Probability describes only what happens in the long run. This is called a frequentist approach to defining probabilities, because we rely on the relative frequency (proportion) of one particular outcome among very many observations of the random phenomenon. The optional section on personal probabilities later in this chapter discusses another approach to obtaining probabilities.
Computer simulations like the Probability applet start with given probabilities and imitate random behavior, but we can estimate a real-world probability only by actually observing many trials. Nonetheless, computer simulations are very useful because we need long runs of trials. In situations such as coin tossing, the proportion of an outcome often requires several hundred trials to settle down to the probability of that outcome. Shorter runs give only rough estimates of a probability.
What looks random? Toss a coin six times and record heads (H) or tails (T) on each toss. Which of these outcomes is more probable: HTHTTH or TTTHHH? Almost everyone says that HTHTTH is more probable, because TTTHHH does not “look random.” In fact, both are equally probable. That heads has probability 0.5 says that about half of a very long sequence of tosses will be heads. It doesn’t say that heads and tails must come close to alternating in the short run. The coin doesn’t know what past outcomes were, and it can’t try to create a balanced sequence.
APPLY YOUR KNOWLEDGE
9.1 Hemophilia. Hemophilia refers to a group of rare hereditary disorders of blood coagulation. Because the disorder is caused by defective genes on the X chromosome, hemophilia affects primarily men. According to the Centers for Disease Control and Prevention, the prevalence of hemophilia (the number in the population with hemophilia at any given time) among American males is 13 in 100,000. Explain carefully what this means. In particular, explain why it does not mean that if you obtain the medical records of 100,000 males, exactly 13 will be diagnosed with hemophilia.
9.2 Random digits. The table of random digits (Table A) was produced by a random mechanism that gives each digit probability 0.1 of being a 0.
(a) What proportion of the first 50 digits in the table are 0s? This proportion is an estimate, based on 50 repetitions, of the true probability, which in this case is known to be 0.1.
(b) The Probability applet can imitate random digits. Set the probability of heads in the applet to 0.1. Check “Show true probability” to show this value on the graph. A head stands for a 0 in the random digits table, and a tail stands for any other digit. Simulate 200 digits. If you kept going forever, presumably you would get 10% heads. What was the result of your 200 tosses?
9.3 Probability says... Probability is a measure of how likely an event is to occur. Match one of the probabilities that follow with each statement of likelihood given. (The probability is usually a more exact measure of likelihood than is the verbal statement.)
0   0.01   0.3   0.6   0.99   1
(a) This event is impossible. It can never occur.
(b) This event is certain. It will occur on every trial.
(c) This event is very unlikely, but it will occur once in a while in a long sequence of trials.
(d) This event will occur more often than not.
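The claim in the six-toss discussion earlier in this section, that HTHTTH and TTTHHH are equally probable, can be checked by listing every possible sequence of six tosses. This is a minimal sketch, assuming a fair coin with independent tosses so that all sequences are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered sequence of six tosses, each toss H or T.
sample_space = ["".join(seq) for seq in product("HT", repeat=6)]

# Assumption: fair coin and independent tosses, so each sequence is equally likely.
probability_of_each = Fraction(1, len(sample_space))

for sequence in ("HTHTTH", "TTTHHH"):
    assert sequence in sample_space
    print(sequence, probability_of_each)

# The number of heads is a different question: many sequences share the same count.
three_heads = sum(1 for seq in sample_space if seq.count("H") == 3)
print("P(exactly three heads) =", Fraction(three_heads, len(sample_space)))
```

Each specific sequence has probability 1/64. What makes HTHTTH "look" more typical is only that many sequences resemble it, not that it is itself more probable than TTTHHH.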
Probability models
Gamblers have known for centuries that the fall of coins, cards, and dice displays clear patterns in the long run. After all, the first formal studies of probabilities were aimed at understanding various games of chance. The idea of probability rests on the observed fact that the average result of many thousands of chance outcomes can be known with near certainty. How can we give a mathematical description of long-run regularity?
To see how to proceed, think first about a very simple random phenomenon, the birth of one child. When the child is conceived, we cannot know the outcome in advance. What do we know? We are willing to say that the outcome will be either male or female. We believe that each of these outcomes has probability 1/2. This description of a child’s birth has two parts:
- A list of possible outcomes
- A probability for each outcome
Such a description is the basis for all probability models. Here is the basic vocabulary we use.
What is the chance of rain? We have all checked the weather forecast to see whether rain will spoil our weekend plans. A report says that the chance of rain in your city tomorrow is 20%. What does that mean and how is this probability calculated? The National Weather Service keeps a historical database of daily weather conditions such as temperature, pressure, and humidity. The chance of rain on a given day is calculated as the percent of days in the database with similar weather conditions that had rain. So, a 20% chance of rain tomorrow means that it has rained in only 20% of days with similar weather conditions.
PROBABILITY MODELS
The sample space S of a random phenomenon is the set of all possible outcomes.
An event is an outcome or a set of outcomes of a random phenomenon. That is, an event is a subset of the sample space.
A probability model is a mathematical description of a random phenomenon consisting of two parts: a sample space S and a way of assigning probabilities to events.
A sample space S can be very simple or very complex. When one child is born, there are only two outcomes, male and female. The sample space is S = {M, F}. When the National Health Survey records the body weights in pounds of a random sample of adults, the sample space contains all possible adult weights over a realistic interval.
EXAMPLE 9.4 Blood types
Your blood type determines, for instance, the kind of blood transfusion or organ transplant you can safely get. There are 8 different blood types based on the presence or absence of certain molecules on the surface of red blood cells. A person’s blood type is given as a combination of a group (O, A, B, or AB) and a Rhesus factor (+ or −). They make up the sample space S:
S = {O+, O−, A+, A−, B+, B−, AB+, AB−}
How can we assign probabilities to this sample space? First of all, these 8 blood types are represented differently in different ethnic groups. Within a given ethnic group, we can use the blood types’ frequencies in that group to assign their respective probabilities. The American Red Cross reports that, among Asian Americans, there are 39% blood type O+, 1% O−, 27% A+, 0.5% A−, 25% B+, 0.4% B−, 7% AB+, 0.1% AB−.² Because 39% of all Asian Americans have blood type O+, the probability that a randomly chosen Asian American has blood type O+ is 39%, or 0.39. We can thus construct the complete probability model for blood types among Asian Americans:
Blood type    O+     O−     A+     A−      B+     B−      AB+    AB−
Probability   0.39   0.01   0.27   0.005   0.25   0.004   0.07   0.001
What if we were interested only in the person’s Rhesus factor? For any randomly selected Asian American, the Rhesus factor can only be positive or negative.
Therefore, the sample space for this new question is:
S = {Rh+, Rh−}
Based on the known population percents, the probability model for Rhesus factor is:
Rh factor     Rh+    Rh−
Probability   0.98   0.02
Here 0.98 is the sum of the probabilities of the four Rh+ blood types (0.39 + 0.27 + 0.25 + 0.07), and 0.02 is the sum for the four Rh− blood types (0.01 + 0.005 + 0.004 + 0.001).
In Example 9.4, we used the known frequencies of blood types among Asian Americans to construct the probability models. In some cases, we can use known properties of the random phenomenon (for example, physical properties, genetic laws) to compute the probabilities of the outcomes in the sample space. Here is an example.
EXAMPLE 9.5 A boy or a girl?
Young couples often discuss how many children they would like to have. One couple wants to have two children. There are four possible outcomes when we examine the gender of the two children in order (first child, second child). Figure 9.2 displays these four outcomes.
                      First child
                      Girl   Boy
Second child   Girl   GG     BG
               Boy    GB     BB
FIGURE 9.2 The four possible outcomes in gender sequence for couples having two children, for Example 9.5.
If we assume that male and female newborns are equally likely, all four of these outcomes have the same probability. Parents often care more about how many boys or girls they could have than about the particular order. The sample space for the number of girls a couple with two children could have is
S = {0, 1, 2}
What are the probabilities for this sample space? For each newborn, the probability that it will be a girl and the probability that it will be a boy are approximately equal. We also know that the gender of the first child does not influence the gender of the second child. Therefore, all four outcomes in Figure 9.2 will be equally likely. That is, each of the four outcomes will, in the long run, come up in one-fourth of all couples with two children. So each outcome in Figure 9.2 has probability 1/4.
However, the three possible outcomes in our sample space are not equally likely, because there are two ways to have exactly one girl and only one way to have no girl at all. So “no girl” has probability 1/4 but “one girl” has probability 2/4 (2 outcomes from Figure 9.2). Here is the complete probability model:
Number of girls   0     1     2
Probability       1/4   2/4   1/4
We built the probability model in Example 9.5 by assuming equal probability of both genders at birth. This model is reasonably accurate. However, we have seen in Example 9.3 that national data point to a slight imbalance between the genders at birth. So, in reality, the four outcomes in Figure 9.2 are not exactly equally likely.
APPLY YOUR KNOWLEDGE
9.4 Sample space. Choose a student at random from a large statistics class. Describe a sample space S for each of the following. (In some cases, you may have some freedom in specifying S.)
(a) Ask whether the student is male or female.
(b) Ask how tall the student is.
(c) Ask what the student’s blood type is.
(d) Ask how many times a day the student brushes his or her teeth.
(e) Ask how long since the student’s last flu or cold.
9.5 More on boys and girls. A couple wants to have three children. Assume that the probabilities of a newborn being male or being female are the same and that the gender of one child does not influence the gender of another child.
(a) There are 8 possible arrangements of girls and boys. What is the sample space for having three children (gender of the first, second, and third child)? All 8 arrangements are (approximately) equally likely.
(b) The future parents are wondering how many boys they might get if they have three children. Give a probability model (sample space and probabilities of outcomes) for the number of boys. Follow the method of Example 9.5.
Probability rules
Examples 9.4 and 9.5 describe pretty simple random phenomena.
However, we don’t always have a probability model available to answer our questions. We can make progress by listing some facts that must be true for any assignment of probabilities. These facts follow from the idea of probability as “the long-run proportion of repetitions on which an event occurs.”
1. Any probability is a number between 0 and 1, inclusive. Any proportion is a number between 0 and 1, so any probability is also a number between 0 and 1. An event with probability 0 never occurs, and an event with probability 1 occurs on every trial. An event with probability 0.5 occurs in half the trials in the long run.
2. All possible outcomes together must have probability 1. Because some outcome must occur on every trial, the sum of the probabilities for all possible outcomes must be exactly 1.
3. If two events have no outcomes in common, the probability that one or the other occurs is the sum of their individual probabilities. If one event occurs in 40% of all trials, a different event occurs in 25% of all trials, and the two can never occur together, then one or the other occurs on 65% of all trials because 40% + 25% = 65%.
4. The probability that an event does not occur is 1 minus the probability that the event does occur. If an event occurs in (say) 70% of all trials, it fails to occur in the other 30%. The probability that an event occurs and the probability that it does not occur always add to 100%, or 1.
We can use mathematical notation to state Facts 1 to 4 more concisely. Capital letters near the beginning of the alphabet denote events. If A is any event, we write its probability as P(A). Here are our probability facts in formal language. As you apply these rules, remember that they are just another form of intuitively true facts about long-run proportions.
PROBABILITY RULES
Rule 1. The probability P(A) of any event A satisfies 0 ≤ P(A) ≤ 1.
Rule 2. If S is the sample space in a probability model, then P(S) = 1.
Rule 3. Two events A and B are disjoint (mutually exclusive) if they have no outcomes in common and so can never occur together. If A and B are disjoint,
P(A or B) = P(A) + P(B)
This is the addition rule for disjoint events.
Rule 4. For any event A,
P(A does not occur) = 1 − P(A)
The addition rule extends to more than two events that are disjoint in the sense that no two have any outcomes in common. If events A, B, and C are disjoint, the probability that one of these events occurs is P(A) + P(B) + P(C).
EXAMPLE 9.6 Using the probability rules
We already used the addition rule, without calling it by that name, to find the probabilities in Example 9.5. The event “one girl” contains the two disjoint outcomes displayed in Figure 9.2, so the addition rule (Rule 3) says that its probability is
P(one girl) = P(GB) + P(BG) = 1/4 + 1/4 = 2/4 = 0.5
Check that the probabilities in Example 9.5, found using the addition rule, are all between 0 and 1 and add to exactly 1. That is, this probability model obeys Rules 1 and 2.
What is the probability that a couple with two children would not have two girls? By Rule 4,
P(couple does not have two girls) = 1 − P(has two girls) = 1 − 0.25 = 0.75
APPLY YOUR KNOWLEDGE
9.6 What’s your weight? According to the U.S. Census Bureau, 29% of American males 18 years of age or older are obese, 41% are overweight, 29% have a healthy weight, and 1% are underweight.³
(a) Does this assignment of probabilities to adult American males satisfy Rules 1 and 2?
(b) What percent of adult American males have a weight higher than what is considered healthy?
(c) The Census Bureau reports that the percent of adult American females with a weight higher than what is considered healthy is 56%. Does the assignment of probabilities for males and females who are over a healthy weight satisfy Rules 1 and 2?
9.7 Rabies in Florida. Rabies is a viral disease of mammals transmitted through the bite of a rabid animal. The virus infects the central nervous system, causing encephalopathy and ultimately death. The Florida Department of Health reports the distribution of documented cases of rabies for all of 2011:⁴
Species       Raccoon   Bat    Fox    Other
Probability   0.57      0.15   0.11   ?
(a) What probability should replace “?” in the distribution?
(b) What is the probability that a reported case of rabies is not a raccoon?
(c) What is the probability that a reported case of rabies is either a bat or a fox?
Discrete probability models
Examples 9.4, 9.5, and 9.6 illustrate one way to assign probabilities to events: Assign a probability to every individual outcome, then add these probabilities to find the probability of any event. This idea works well when there are only a finite (fixed and limited) number of outcomes.
DISCRETE PROBABILITY MODEL
A probability model with a sample space made up of a list of individual outcomes⁵ is called discrete.
To assign probabilities in a discrete model, list the probabilities of all the individual outcomes. These probabilities must be numbers between 0 and 1 and must have sum 1. The probability of any event is the sum of the probabilities of the outcomes making up the event.
EXAMPLE 9.7 Hearing impairment in dalmatians
Pure dog breeds are often highly inbred, leading to high numbers of congenital defects. A study examined hearing impairment in 5333 dalmatians.⁶ Call the number of ears impaired (deaf) in a randomly chosen dalmatian X for short. The researchers found the following probability model for X:
X             0      1      2
Probability   0.70   0.22   0.08
Check that the probabilities of the outcomes sum to exactly 1. This is therefore a legitimate discrete probability model.
The probability that a randomly chosen dalmatian has some hearing impairment is the probability that X is equal to or greater than 1:
P(X ≥ 1) = P(X = 1) + P(X = 2) = 0.22 + 0.08 = 0.30
Almost a third of dalmatians are deaf in one or both ears. This is a very high proportion that may be explained in part by the fact that breeders cannot detect partial deafness behaviorally. The study suggested giving the dogs a hearing test before considering them for breeding.
Note that the probability that X is greater than or equal to 1 is not the same as the probability that X is strictly greater than 1. The latter probability here is
P(X > 1) = P(X = 2) = 0.08
The outcome X = 1 is included in “greater than or equal to” and is not included in “strictly greater than.”
APPLY YOUR KNOWLEDGE
9.8 Soda consumption. A survey by Gallup asked a random sample of American adults about their soda consumption.⁷ Let’s call X the number of glasses of soda consumed on a typical day. Gallup found the following probability model for X:
X             0      1      2      3      4+
Probability   0.52   0.28   0.09   0.04   0.07
Consider the events
A = {number of glasses of soda is 1 or greater}
B = {number of glasses of soda is 2 or less}
(a) What outcomes make up the event A? What is P(A)?
(b) What outcomes make up the event B? What is P(B)?
(c) What outcomes make up the event “A or B”? What is P(A or B)? Why is this probability not equal to P(A) + P(B)?
9.9 Physically active high schoolers. The 2011 National Youth Risk Behavior Survey provides insight on the physical activity of high school students in the United States. Over 15,000 high schoolers were asked, “During the past 7 days, on how many days were you physically active for a total of at least 60 minutes per day?” Physical activity was defined as any activity that increased heart rate. Call the response X for short.
The survey results give the following probability model for X:⁸
Days          0      1      2      3      4      5      6      7
Probability   0.15   0.08   0.10   0.11   0.10   0.12   0.07   0.27
(a) Verify that this is a legitimate discrete probability model.
(b) Describe the event X < 7 in words. What is P(X < 7)?
(c) Express the event “physically active at least one day” in terms of X. What is the probability of this event?
Continuous probability models
When we use the table of random digits to select a digit between 0 and 9, the discrete probability model assigns probability 1/10 to each of the 10 possible outcomes. Suppose that we want to choose a number at random between 0 and 1, allowing any number between 0 and 1 as the outcome. Software random number generators will do this. For example, here is the result of asking software to produce five random numbers between 0 and 1:
0.2893511   0.3213787   0.5816462   0.9787920   0.4475373
The sample space is now an entire interval of numbers:
S = {all numbers between 0 and 1}
Call the outcome of the random number generator Y for short. How can we assign probabilities to such events as {0.3 ≤ Y ≤ 0.7}? As in the case of selecting a random digit, we would like all possible outcomes to be equally likely. But we cannot assign probabilities to each individual value of Y and then add them, because there are infinitely many possible values.
We use a new way of assigning probabilities directly to events: as areas under a density curve. Density curves are models for continuous distributions. Any density curve has area exactly 1 underneath it, corresponding to total probability 1.
DENSITY CURVE
A density curve is a curve that is always on or above the horizontal axis, and has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
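For the random number generator described above, treating all outcomes between 0 and 1 as equally likely corresponds to a flat density curve of height 1 over that interval (an assumption of this sketch; the passage itself does not yet name the curve). The probability of {0.3 ≤ Y ≤ 0.7} is then the area of a rectangle, which the code compares with the proportion seen in simulated draws. The function name `uniform_area` and the seed are illustrative.

```python
import random

def uniform_area(a, b):
    """Area under a flat density of height 1 on [0, 1] between a and b."""
    lo, hi = max(a, 0.0), min(b, 1.0)
    return max(hi - lo, 0.0) * 1.0  # width times height

exact = uniform_area(0.3, 0.7)

# Compare with the long-run proportion from a software random number generator.
rng = random.Random(1)
draws = [rng.random() for _ in range(100_000)]
observed = sum(1 for y in draws if 0.3 <= y <= 0.7) / len(draws)

print(f"area under the density curve:  {exact:.3f}")
print(f"proportion of simulated draws: {observed:.3f}")
```

The area is 0.4, and the simulated proportion should settle close to it, which ties the area definition back to the long-run-proportion idea of probability.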
In Chapter 1 we described continuous distributions with histograms. Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve.
EXAMPLE 9.8 From histogram to density curve
Figure 9.3 is a histogram of the heights of women 40 to 49 years of age from a large government sample survey.⁹ Overall, the distribution of heights is quite regular. The histogram is symmetric, and both tails fall off smoothly from a single center peak. There are no large gaps or obvious outliers. The smooth curve drawn over the histogram is a good description of the overall pattern of the data.
FIGURE 9.3 Histogram of the heights in inches of women aged 40 to 49 in the United States, for Example 9.8. The smooth curve shows the overall shape of the distribution. (Axes: height in inches, from under 56 to 72 or more; percent, 0 to 18.)
Our eyes respond to the areas of the bars in a histogram. The bar areas represent the percents of the observations. Figure 9.4(a) is a copy of Figure 9.3 with the leftmost bars shaded. The area of the shaded bars represents the women 62 inches tall or less. They make up 31.9% of all women in the sample; this is the cumulative percent, the sum of all bars for 62 inches and below.
Now look at the curve drawn through the bars. In Figure 9.4(b), the area under the curve to the left of 62 inches is shaded. The smooth curve we use to model the histogram distribution is chosen with the specific constraint that the total area under the curve is exactly 1. The total area represents 100%, that is, all the observations.
We can then interpret areas under the curve as proportions of the observations. The curve is now a density curve. The shaded area under the density curve in Figure 9.4(b) represents the proportion of women in their 40s who are 62 inches or shorter. This area is 31.6%, less than half a percentage point away from the actual 31.9%. Areas under the density curve give quite good approximations to the actual distribution of the sampled women.
FIGURE 9.4 (a) The proportion of women 62 inches tall or less in the sample is 0.319. (b) The proportion of women 62 inches tall or less calculated from the density curve is 0.316. The density curve is a good approximation to the distribution of the data. (Both panels: height in inches, from under 56 to 72 or more, against percent, 0 to 18.)
Density curves, like distributions, come in many shapes. Figure 9.5 shows a strongly skewed distribution: the survival times of guinea pigs from Example 1.7 (page 18). The histogram and density curve were both created from the data by software. Both show the overall shape and the “bumps” in the long right tail. The density curve shows a higher single peak as a main feature of the distribution. The histogram divides the observations near the peak between two bars, thus reducing the height of the peak. A density curve is often a good description of the overall pattern of a distribution. Outliers, which are deviations from the overall pattern, are not described by the curve. Of course, no set of real data is exactly described by a density curve. The curve is a model, an idealized description that is easy to use and accurate enough for practical use.
Conceptually, a density curve is similar to a regression line: We use a least-squares regression line to model an observed linear trend and to make predictions about similar individuals in the population.

FIGURE 9.5 A right-skewed distribution, pictured by both a histogram and a density curve, representing the survival times in days of guinea pigs infected with a pathogen. (Horizontal axis: survival time in days, 0 to 600.)

Measures of center and spread apply to density curves as well as to actual sets of observations. Areas under a density curve represent proportions of the total number of observations. The median is the point with half the observations on either side. So the median of a density curve is the equal-areas point.

The mean of a set of observations is their arithmetic average. If we think of observations as a series of different weights strung out along a thin rod, the mean is the point at which the rod would balance. This fact is also true of density curves. The mean is the point at which the curve would balance if made of solid material of varying weight.

Because density curves are idealized patterns, a symmetric density curve is exactly symmetric. The mean and median of a symmetric density curve are therefore equal, as shown in Figure 9.6(a). The mean of a skewed distribution is pulled toward the long tail more than is the median, as shown in Figure 9.6(b) (finding the mean and standard deviation of a density curve is beyond the scope of this textbook).

FIGURE 9.6 The mean and median of (a) a symmetric density curve and (b) a skewed density curve. (Annotation in panel (b): the long right tail pulls the mean to the right.)

Density curves represent the whole population. Their mean and standard deviation are expressed with Greek letters to distinguish them from the mean x̄ and standard deviation s computed from actual sample observations.
The usual notation for the mean of a density curve is µ (the Greek letter mu). We write the standard deviation of a density curve as σ (the Greek letter sigma).

How do we use density curves to assign probabilities over continuous intervals? There is a direct relationship between the representation (or proportion) of a given type of individual in a population and the probability that one individual randomly selected from the population will be of that given type. Just as with discrete probabilities, we define probabilities over continuous intervals by the relative frequency of relevant individuals in the population: in this case, all individuals from the population that belong to the desired interval.

EXAMPLE 9.9 From population distribution to probability distribution

In Example 9.8 we used a density curve to estimate that the proportion of women in their 40s who are 62 inches or shorter is 31.6%. This is the area under the density curve for heights of 62 inches or less, as shown in Figure 9.4(b).

Let’s now ask: What is the probability that a randomly chosen woman in her 40s has a height of 62 inches or less? Because the selection is random, this probability depends on the relative frequency of women in the population who are 62 inches tall or less, and that relative frequency is 31.6%. Therefore, the probability that a randomly selected woman in her 40s would measure 62 inches or less is 0.316. This is the area under the density curve for heights 62 inches or less.

CONTINUOUS PROBABILITY MODEL
A continuous probability model assigns probabilities as areas under a density curve. The area under the curve and above any range of values on the horizontal axis is the probability of an outcome in that range.
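The ideas above (total area 1, probability as area, median as the equal-areas point, mean as the balance point) can be checked numerically. The sketch below is an illustration only: the density f(x) = 2(1 − x) on the interval from 0 to 1 is a hypothetical right-skewed curve chosen for convenience, not one of the curves in the figures, and the helper names are made up for this example. Areas are approximated with a simple midpoint rule.

```python
# Hypothetical right-skewed density on [0, 1]: f(x) = 2(1 - x).
# It is NOT a curve from the text; it is chosen because its areas are easy to check.
def density(x):
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def area(f, a, b, steps=100_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def balance_point(f, a, b, steps=100_000):
    """Mean of a density curve: the area-weighted average position."""
    width = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * width
        total += x * f(x) * width
    return total


def equal_areas_point(f, a, b, tol=1e-6):
    """Median of a density curve: the point with area 0.5 to its left (bisection)."""
    lo, hi = a, b
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if area(f, a, mid, steps=2_000) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


total_area = area(density, 0.0, 1.0)        # should be very close to 1
prob_below = area(density, 0.0, 0.25)       # P(X <= 0.25) as an area
mean = balance_point(density, 0.0, 1.0)
median = equal_areas_point(density, 0.0, 1.0)

print(f"total area   = {total_area:.4f}")
print(f"P(X <= 0.25) = {prob_below:.4f}")
print(f"median       = {median:.4f}")
print(f"mean         = {mean:.4f}")
```

For this right-skewed curve the mean (about 0.333) lies to the right of the median (about 0.293), the same pattern as in Figure 9.6(b).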
EXAMPLE 9.10 Random numbers

The random number generator will spread its output uniformly across the entire interval from 0 to 1 as we allow it to generate a long sequence of numbers. The results of many trials are represented by the uniform density curve shown in Figure 9.7. This density curve has height 1 over the interval from 0 to 1. The area under the curve is 1, and the probability of any event is the area under the curve and above the event in question.

As Figure 9.7(a) illustrates, the probability that the random number generator produces a number between 0.3 and 0.7 is

P(0.3 ≤ Y ≤ 0.7) = 0.4

because the area under the density curve and above the interval from 0.3 to 0.7 is 0.4. The height of the curve is 1 and the area of a rectangle is the product of height and length, so the probability of any interval of outcomes is just the length of the interval. Similarly,

P(Y ≤ 0.5) = 0.5
P(Y > 0.8) = 0.2
P(Y ≤ 0.5 or Y > 0.8) = 0.7

The last event consists of two nonoverlapping intervals, so the total area above the event is found by adding two areas, as illustrated by Figure 9.7(b). This assignment of probabilities obeys all our rules for probability.

FIGURE 9.7 Probability as area under a density curve, for Example 9.10. This uniform density curve spreads probability evenly between 0 and 1. (a) P(0.3 ≤ Y ≤ 0.7), area = 0.4. (b) P(Y ≤ 0.5 or Y > 0.8), areas = 0.5 and 0.2. The height of the curve is 1.

The probability model for a continuous random variable assigns probabilities to intervals of outcomes rather than to individual outcomes. In fact, all continuous probability models assign probability 0 to any individual outcome. Only intervals of values have positive (non-zero) probabilities. To see that this is true, consider a specific outcome such as P(Y = 0.8) in Example 9.10. The probability of any interval is the same as its length.
The point 0.8 has no length, so its probability is 0.

We can use any density curve to assign probabilities. In Chapter 11 we will examine Normal curves, a particularly useful family of density curves in probability and statistics.

APPLY YOUR KNOWLEDGE

9.10 Sketch density curves. Sketch density curves that describe distributions with the following shapes:
(a) symmetric, but with two peaks (that is, two strong clusters of observations)
(b) single peak and skewed to the left

9.11 Random numbers. Let X be a random number between 0 and 1 produced by the random number generator described in Example 9.10 and Figure 9.7. Find the following probabilities:
(a) P(X ≤ 0.4)
(b) P(X < 0.4)
(c) P(0.3 ≤ X ≤ 0.5)
(d) P(X < 0.3 or X > 0.5)

9.12 TV viewing among high school students. Some argue that TV viewing promotes obesity because of physical inactivity and high exposure to commercials for foods with high fat and sugar content. In the two examples provided here, decide whether the probability model is discrete or continuous. Explain your reasoning.
(a) According to the National Center for Health Statistics, there is a 38% chance that a high school student watches at least 3 hours of television on an average school day.10 This nationwide government survey recorded the amount of time high school students watch TV on an average school day.
(b) A large sample survey of school-age children by the Kaiser Family Foundation found a 42% chance that the TV is on most of the time at home.11 The survey asked whether, at home, the TV was on most of the time, some of the time, a little bit of the time, or never.

Random variables

Examples 9.7 and 9.10 use a notation that is often convenient. It is especially useful when using mathematical functions to describe probability distributions, as we will see in Chapters 11 and 12.
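As a concrete illustration of this notation, the probabilities in Example 9.10 can be written out and checked by simulation. This is a minimal sketch under stated assumptions: Python's `random.random()` stands in for the random number generator in the example, the helper `interval_prob` and the seed are hypothetical choices for illustration, and the simulated proportions will only be close to, not equal to, the exact values.

```python
import random


def interval_prob(a, b):
    """Exact P(a <= Y <= b) for Y uniform on [0, 1]: the length of the
    part of [a, b] that lies inside [0, 1] (height 1 times length)."""
    lo, hi = max(a, 0.0), min(b, 1.0)
    return max(hi - lo, 0.0)


# Exact probabilities from Example 9.10, computed as interval lengths.
p_middle = interval_prob(0.3, 0.7)            # P(0.3 <= Y <= 0.7)
p_low = interval_prob(0.0, 0.5)               # P(Y <= 0.5)
p_high = interval_prob(0.8, 1.0)              # P(Y > 0.8)
p_either = p_low + p_high                     # nonoverlapping intervals add
p_point = interval_prob(0.8, 0.8)             # a single point has length 0

# Long-run check: proportion of simulated values that land in each event.
random.seed(2013)                             # arbitrary seed, for repeatability
draws = [random.random() for _ in range(200_000)]
sim_middle = sum(0.3 <= y <= 0.7 for y in draws) / len(draws)
sim_either = sum(y <= 0.5 or y > 0.8 for y in draws) / len(draws)

print(f"P(0.3 <= Y <= 0.7): exact {p_middle:.3f}, simulated {sim_middle:.3f}")
print(f"P(Y <= 0.5 or Y > 0.8): exact {p_either:.3f}, simulated {sim_either:.3f}")
print(f"P(Y = 0.8): exact {p_point:.3f}")
```

Because a continuous model gives every single point probability 0, using ≤ or < in the comparisons makes no difference to the exact answers.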
In Example 9.7 we let X stand for the result of choosing a dalmatian at random and assessing its hearing impairment. We know that X may take a different value if we make another random choice. Because its value changes from one random choice to another, we call the hearing impairment X a random variable.

RANDOM VARIABLE
A random variable is a variable whose value is a numerical outcome of a random phenomenon. The probability distribution of a random variable X tells us what values X can take and how to assign probabilities to those values.

There are two main types of random variables, corresponding to two types of probability models: discrete and continuous.

EXAMPLE 9.11 Discrete and continuous random variables

The hearing impairment X in Example 9.7 is a random variable whose possible values are the whole numbers {0, 1, 2}. The distribution of X assigns a probability to each of these outcomes. Random variables that have a countable (typically finite) list of possible outcomes are called discrete.

Compare this with the value Y obtained with the random number generator in Example 9.10. The values of Y fill the entire interval of numbers between 0 and 1. The probability distribution of Y is given by its density curve, shown in Figure 9.7. Random variables that can take on any value in an interval, with probabilities given as areas under a density curve, are called continuous.

Be sure to consider closely the true nature of your random variable. Example 9.8 described the process of assigning a density curve to model women’s heights in inches. While most people report their heights in whole numbers of inches, they do not grow one inch at a time. All possible values of height within a realistic interval can actually exist, for example, 65.125 inches. So the random variable height is truly a continuous random variable.
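The discrete half of Example 9.11 can be sketched in code. The probabilities attached to the values 0, 1, and 2 below are hypothetical placeholders (the actual distribution belongs to Example 9.7 and is not reproduced here); only the structure matters. A discrete random variable is a finite table of values and probabilities, so the probability of an event is a sum over individual outcomes, whereas a continuous random variable needs areas over intervals.

```python
import random

# HYPOTHETICAL probabilities for a discrete random variable X with values 0, 1, 2.
# These numbers are placeholders for illustration, not the values from Example 9.7.
distribution = {0: 0.7, 1: 0.2, 2: 0.1}

# A legitimate discrete model: each probability is in [0, 1] and they sum to 1.
assert all(0.0 <= p <= 1.0 for p in distribution.values())
assert abs(sum(distribution.values()) - 1.0) < 1e-12


def prob(event):
    """P(X in event): add the probabilities of the individual outcomes."""
    return sum(p for value, p in distribution.items() if value in event)


p_at_least_one = prob({1, 2})                 # P(X >= 1)

# One random choice of X; another call may give a different value.
random.seed(9)                                # arbitrary seed, for repeatability
x = random.choices(list(distribution), weights=list(distribution.values()))[0]

print(f"P(X >= 1) = {p_at_least_one:.2f}")
print(f"one random value of X: {x}")
```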
APPLY YOUR KNOWLEDGE

9.13 Discrete or continuous random variable? Indicate in the following examples whether the random variable X is discrete or continuous. Explain your reasoning.
(a) X is the number of days last week that a randomly chosen child exercised for at least one hour.
(b) X is the amount of time in hours that a randomly chosen child spends watching television today.
(c) X is the number of television sets in a randomly chosen household.

9.14 Discrete or continuous random variable? Indicate in the following examples whether the random variable X is discrete or continuous. Explain your reasoning.
(a) X is the number of petals on a randomly chosen daisy.
(b) X is the stem length in centimeters of a randomly chosen daisy.
(c) X is the number of daisies found in a randomly chosen grassy area 1 square meter in size.
(d) X is the average number of petals per daisy computed from all the daisies found in a randomly chosen grassy area 1 square meter in size.

Personal probability*

*This section is optional.

We began our discussion of probability with one idea: The probability of an outcome of a random phenomenon is the proportion of times that outcome would occur in a very long series of repetitions. This idea ties probability to actual outcomes. It allows us, for example, to estimate probabilities by simulating random phenomena. Yet we often meet another, quite different, idea of probability.

EXAMPLE 9.12 Intelligent life and the universe

Joe reads an article discussing the Search for Extraterrestrial Intelligence (SETI) project. We ask Joe, “What’s the chance that we will find evidence of extraterrestrial intelligence in this century?” Joe responds, “Oh, about 1%.”

Does Joe assign probability 0.01 to humans finding extraterrestrial intelligence this century? The outcome of our search is certainly unpredictable, but we can’t reasonably ask what would happen in many repetitions.
This century will happen only once and will differ from all other centuries in many ways, especially in terms of technology. If probability measures “what would happen if we did this many times,” Joe’s 0.01 is not a probability. The frequentist definition of probability is based on data from many repetitions of the same random phenomenon. Joe is giving us something else, his personal judgment.

Although Joe’s 0.01 isn’t a probability in the frequentist sense, it gives useful information about Joe’s opinion. Closer to home, a government asking, “How likely is it that building a new nuclear power plant will pay off within five years?” can’t employ an idea of probability based on many repetitions of the same thing. The opinions of science and business advisers are nonetheless useful information, and these opinions can be expressed in the language of probability. These are personal probabilities.

PERSONAL PROBABILITY
A personal probability of an outcome is a number between 0 and 1 that expresses an individual’s judgment of how likely the outcome is.

Rachel’s opinion about finding extraterrestrial intelligence may differ from Joe’s, and the opinions of several advisers about the new power plant may differ. Personal probabilities are indeed personal: They vary from person to person. Moreover, a personal probability can’t be called right or wrong. If we say, “In the long run, this coin will come up heads 60% of the time,” we can find out if we are right by actually tossing the coin several thousand times. If Joe says, “I think there is a 1% chance of finding extraterrestrial intelligence this century,” that’s just Joe’s opinion.

Why think of personal probabilities as probabilities? Because any set of personal probabilities that makes sense obeys the same basic Rules 1 to 4 that describe any legitimate assignment of probabilities to events.
If Joe thinks there’s a 1% chance that we find extraterrestrial intelligence this century, he must also think that there’s a 99% chance that we won’t. There is just one set of rules of probability, even though we now have two interpretations of what probability means.

APPLY YOUR KNOWLEDGE

9.15 Will you have an accident? The probability that a randomly chosen adult will be involved in a car accident in the next year is about 0.2. This is based on the proportion of millions of drivers who have accidents. “Accident” includes things like crumpling a fender in your own driveway, not just highway accidents.
(a) What do you think is your own probability of being in an accident in the next year? This is a personal probability.
(b) Give some reasons why your personal probability might be a more accurate prediction of your “true chance” of having an accident than the probability for a random driver.
(c) Almost everyone says that their personal probability is lower than the random driver probability. Why do you think this is true?

Risk and odds*

Random events can be described in a variety of ways. The most common descriptors of random events in the life sciences are probability, risk, and odds. So far in this chapter we have introduced the idea of probability and fundamental probability rules. We now briefly describe what risk and odds are and how they relate to the notion of probability. Chapter 20 will further discuss the applications of risk and odds in clinical research when comparing two groups.

Risk means different things in different fields. In statistics, risk corresponds to the probability of an undesirable event such as death, disease, or side effects. This term is particularly common in the health sciences, which aim to assess risk and find ways to reduce it.
The risk of a given adverse event, like its probability, is defined by the frequency of that adverse event in a population or sample of interest.

Odds are a somewhat less intuitive concept, with a foundation in gambling (although mathematical odds are not to be confused with betting odds, which typically reflect payoffs offered by bookies to winning bets). The odds of an event are a ratio of two probabilities: the numerator is the probability of the event, and the denominator is the complementary probability of the event not occurring. Therefore, odds can take any positive value, including values greater than 1. The odds of an event can be expressed as the numerical value of the ratio or as a ratio of two integers with no common factor.

RISK AND ODDS
The risk of an undesirable outcome of a random phenomenon is the probability of that undesirable outcome. The odds of any outcome of a random phenomenon is the ratio of the probability of that outcome over the probability of that outcome not occurring. That is, if an outcome A has probability p of occurring, then

risk(A) = p
odds(A) = p/(1 − p)

*This section is optional.

EXAMPLE 9.13 Blood clots in immobilized patients

Patients immobilized for a substantial amount of time can develop deep vein thrombosis (DVT), a blood clot in a leg or pelvis vein. DVT can have serious adverse health effects and can be difficult to diagnose. On its website, drug manufacturer Pfizer reports the outcome of a study looking at the effectiveness of the drug Fragmin (dalteparin) in preventing DVT in immobilized patients. Of the 1518 randomly chosen immobilized patients given Fragmin, 42 experienced a complication from DVT (the remaining 1476 patients did not).12 The proportion of patients experiencing DVT complications is 42/1518 = 0.0277, or 2.77%.
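The definitions of risk and odds translate directly into arithmetic. The sketch below applies them to the counts just given (42 of 1518 patients); the function names `risk` and `odds` and the use of exact fractions are illustrative choices, not part of the study or the text.

```python
from fractions import Fraction


def risk(events, total):
    """Risk: the probability of the undesirable outcome, events / total."""
    return Fraction(events, total)


def odds(p):
    """Odds: probability of the outcome over probability of its complement."""
    if not 0 <= p < 1:
        raise ValueError("odds are defined here only for 0 <= p < 1")
    return p / (1 - p)


# Counts from Example 9.13: 42 of 1518 patients had a DVT complication.
p = risk(42, 1518)
o = odds(p)                                   # same ratio as 42/1476

print(f"risk = {float(p):.4f}")
print(f"odds = {float(o):.4f} = {o.numerator}:{o.denominator}")
print(f"about 1:{round(float(1 / o))}")
```

Working with exact fractions gives the odds in lowest terms, 7:246, which is the same ratio as 42 patients with a complication to 1476 without.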
We can use this information to compute the risk and odds of experiencing DVT complications for immobilized patients treated with Fragmin:

risk = 0.0277, or 2.77%
odds = 0.0277/(1 − 0.0277) = 42/1476 = 0.0285

The odds of experiencing DVT complications among immobilized patients given Fragmin are 42:1476, or about 1:35. That is, for every such patient experiencing a DVT complication, about 35 do not experience a DVT complication.

The numerical values for the risk and odds in this example are very close, 0.0277 and 0.0285. In general, when the undesirable event is not very frequent, risk and odds give similar numerical values, because 1 − p is then close to 1. In other situations, risk and odds can be very different.

EXAMPLE 9.14 Sickle-cell anemia

Sickle-cell anemia is a serious, inherited blood disease affecting the shape of red blood cells. Individuals with both genes causing the defect suffer pain from blocked arteries and can have their life shortened from organ damage. Individuals carrying only one copy of the defective gene (“sickle-cell trait”) are generally healthy but may pass on the gene to their offspring. An estimated two million Americans carry the sickle-cell trait.

If a couple learns from blood tests that they both carry the sickle-cell trait, genetic laws of inheritance tell us that there is a 25% chance that they could conceive a child suffering from sickle-cell anemia. That is, the risk of conceiving a child who will suffer from sickle-cell anemia is 0.25, or 25%. The odds of this are

odds = 0.25/(1 − 0.25) = 0.333, or 1:3

In this second example, the risk and the odds of conceiving a child suffering from sickle-cell anemia have quite different numerical values, 0.25 and 0.33. Always make sure that you understand how risk and odds relate to probability when you read reports about these concepts.

APPLY YOUR KNOWLEDGE

9.16 Blood clots in immobilized patients, continued.
The Fragmin study from Example 9.13 compared patients treated with Fragmin with patients given a placebo in a randomized, double-blind design. Of the 1473 immobilized patients given a placebo, 73 experienced a complication from DVT.
(a) Compute the proportion of patients given a placebo who experienced a complication from DVT. What are the risk and the odds of experiencing a complication from DVT when an immobilized patient is given a placebo? How do these values compare?
(b) Compare your results with those of Example 9.13. What do you conclude? We will see in Chapter 20 how to formally compare risks or odds for two groups.

9.17 HPV infections in women. Human papillomavirus (HPV) infection is the most common sexually transmitted infection. Certain types of HPV can cause genital warts in both men and women and cervical cancer in women. The U.S. National Health and Nutrition Examination Survey (NHANES) contacted a representative sample of 1921 women between the ages of 14 and 59 years and asked them to provide a self-collected vaginal swab specimen. Of these, 515 tested positive for HPV, indicating a current HPV infection.13
(a) Give the probability, risk, and odds that a randomly selected American woman between the ages of 14 and 59 years has a current HPV infection.
(b) The survey broke down the data by age group:

Age group (years)       14–19   20–24   25–29   30–39   40–49   50–59
Percent HPV positive     24.5    44.8    27.4    27.5    25.2    19.6

Give the risk and the odds of being HPV positive for women in each age group. Which age group is the most at risk (has the highest odds) of testing HPV positive?

CHAPTER 9 SUMMARY

A random phenomenon has outcomes that we cannot predict but nonetheless has a regular distribution of outcomes in very many repetitions.

The probability of an event is the proportion of times the event occurs in many repeated trials of a random phenomenon.
A probability model for a random phenomenon consists of a sample space S and an assignment of probabilities P.

The sample space S is the set of all possible outcomes of the random phenomenon. Sets of outcomes are called events. P assigns a number P(A) to an event A as its probability.

Any assignment of probability must obey the rules that state the basic properties of probability:
1. 0 ≤ P(A) ≤ 1 for any event A.
2. P(S) = 1.
3. Addition rule: Events A and B are