Sampling Distributions
J. L. Devore et al., Modern Mathematical Statistics with Applications, Springer Texts in Statistics, Springer Nature Switzerland AG 2021, https://doi.org/10.1007/978-3-030-55156-8_6
Summary
This document discusses statistics and sampling distributions, bridging the gap between probability and inferential statistics. It explores how sample estimates vary in repeated sampling and introduces the central limit theorem. The document also covers sampling distributions based on samples from normal populations and illustrates concepts with examples, including sampling from a Weibull distribution and calculations with varying sample sizes.
Full Transcript
6 Statistics and Sampling Distributions

Introduction

This chapter helps make the transition between probability and inferential statistics. Given a sample of n observations from a population, we will be calculating estimates of the population mean, median, standard deviation, and various other population characteristics (parameters). Prior to obtaining data, there is uncertainty as to which of all possible samples will occur. Because of this, estimates such as x̄, x̃, and s will vary from one sample to another. The behavior of such estimates in repeated sampling is described by what are called sampling distributions. Any particular sampling distribution will give an indication of how close the estimate is likely to be to the value of the parameter being estimated.

The first two sections use probability results to study sampling distributions. A particularly important result is the Central Limit Theorem, which shows how the behavior of the sample mean can be described by a normal distribution when the sample size is large. The last two sections introduce several distributions related to samples from a normal population distribution. Many inferential procedures are based on properties of these sampling distributions.

6.1 Statistics and Their Distributions

The observations in a single sample were denoted in Chapter 1 by x1, x2, …, xn. Consider selecting two different samples of size n from the same population distribution. The xi's in the second sample will virtually always differ at least a bit from those in the first sample. For example, a first sample of n = 3 cars of a particular model might result in fuel efficiencies x1 = 30.7, x2 = 29.4, x3 = 31.1, whereas a second sample may give x1 = 28.8, x2 = 30.0, and x3 = 31.1. Before we obtain data, there is uncertainty about the value of each xi. Because of this uncertainty, before the data becomes available we view each observation as a random variable and denote the sample by X1, X2, …, Xn (uppercase letters for random variables).

This variation in observed values in turn implies that the value of any function of the sample observations, such as the sample mean, sample standard deviation, or sample iqr, also varies from sample to sample. That is, prior to obtaining x1, …, xn, there is uncertainty as to the value of x̄, the value of s, and so on.

Example 6.1 Suppose that material strength for a randomly selected specimen of a particular type has a Weibull distribution with parameter values α = 2 (shape) and β = 5 (scale). The corresponding density curve is shown in Figure 6.1. [Figure 6.1: The Weibull density curve for Example 6.1] Formulas from Section 4.5 give

µ = E(X) = 4.4311    µ̃ = 4.1628    σ² = V(X) = 5.365    σ = 2.316

The mean exceeds the median because of the distribution's positive skew. We used statistical software to generate six different samples, each with n = 10, from this distribution (material strengths for six different groups of ten specimens each). The results appear in Table 6.1, followed by the values of the sample mean, sample median, and sample standard deviation for each sample. Notice first that the ten observations in any particular sample are all different from those in any other sample.
Second, the six values of the sample mean are all different from each other, as are the six values of the sample median and the six values of the sample standard deviation. The same would be true of the sample 10% trimmed means, sample iqrs, and so on.

Table 6.1 Samples from the Weibull distribution of Example 6.1

Obs      Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6
1         6.1171     5.07611    3.46710    1.55601    3.12372    8.93795
2         4.1600     6.79279    2.71938    4.56941    6.09685    3.92487
3         3.1950     4.43259    5.88129    4.79870    3.41181    8.76202
4         0.6694     8.55752    5.14915    2.49759    1.65409    7.05569
5         1.8552     6.82487    4.99635    2.33267    2.29512    2.30932
6         5.2316     7.39958    5.86887    4.01295    2.12583    5.94195
7         2.7609     2.14755    6.05918    9.08845    3.20938    6.74166
8        10.2185     8.50628    1.80119    3.25728    3.23209    1.75468
9         5.2438     5.49510    4.21994    3.70132    6.84426    4.91827
10        4.5590     4.04525    2.12934    5.50134    4.20694    7.26081
Mean      4.401      5.928      4.229      4.132      3.620      5.761
Median    4.360      6.144      4.608      3.857      3.221      6.342
SD        2.642      2.062      1.611      2.124      1.678      2.496

Furthermore, the value of the sample mean from any particular sample can be regarded as a point estimate ("point" because it is a single number, corresponding to a single point on the number line) of the population mean µ, whose value is known to be 4.4311. None of the estimates from these six samples is identical to what is being estimated. The estimates from the second and sixth samples are much too large, whereas the fifth sample gives a substantial underestimate. Similarly, the sample standard deviation gives a point estimate of the population standard deviation, σ = 2.316. All six of the resulting estimates are in error by at least a small amount.

In summary, the values of the individual sample observations vary from sample to sample, so in general the value of any quantity computed from sample data, and the value of a sample characteristic used as an estimate of the corresponding population characteristic, will virtually never coincide with what is being estimated.
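The six samples in Table 6.1 are easy to recreate in software. Below is a minimal R sketch (ours, not the authors' code); the seed is arbitrary, so the simulated values will differ from those in the table while showing the same sample-to-sample variation.

```r
# Six samples of size n = 10 from the Weibull(shape = 2, scale = 5)
# population of Example 6.1, one sample per column.
set.seed(1)  # arbitrary seed; Table 6.1's exact values won't reproduce
samples <- replicate(6, rweibull(10, shape = 2, scale = 5))
apply(samples, 2, mean)    # six sample means
apply(samples, 2, median)  # six sample medians
apply(samples, 2, sd)      # six sample standard deviations
```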
DEFINITION A statistic is any quantity whose value can be calculated from sample data. Prior to obtaining data, there is uncertainty as to what value of any particular statistic will result. Therefore, a statistic is a random variable and will be denoted by an uppercase letter; a lowercase letter is used to represent the calculated or observed value of the statistic.

Thus the sample mean, regarded as a statistic (before a sample has been selected or an experiment has been carried out), is denoted by X̄; the calculated value of this statistic from a particular sample is x̄. Similarly, S represents the sample standard deviation thought of as a statistic, and its computed value is s. Any statistic, being a random variable, has a probability distribution. The probability distribution of any particular statistic depends not only on the population distribution (normal, uniform, etc.) and the sample size n but also on the method of sampling. Our next definition describes a sampling method often encountered, at least approximately, in practice.

DEFINITION The rvs X1, X2, …, Xn are said to form a (simple) random sample of size n if
1. The Xi's are independent rvs.
2. Every Xi has the same probability distribution.
Such a collection of random variables is also referred to as being independent and identically distributed (iid).

If sampling is either with replacement or from an infinite (conceptual) population, Conditions 1 and 2 are satisfied exactly. These conditions will be approximately satisfied if sampling is without replacement, provided the sample size n is much smaller than the population size N. In practice, if n/N ≤ .05 (at most 5% of the population is sampled), we can proceed as if the Xi's form a random sample. The virtue of this sampling method is that the probability distribution of any statistic can be more easily obtained than for any other sampling method.

The probability distribution of a statistic is sometimes referred to as its sampling distribution to emphasize that it describes how the statistic varies in value across all samples that might be selected. There are two general methods for obtaining information about a statistic's sampling distribution. One method involves calculations based on probability rules, and the other involves carrying out a simulation experiment.

Deriving the Sampling Distribution of a Statistic

Probability rules can be used to obtain the distribution of a statistic provided that it is a "fairly simple" function of the Xi's and either there are relatively few different X values in the population or else the population distribution has a "nice" form. Our next two examples illustrate such situations.

Example 6.2 An online florist offers three different sizes for Mother's Day bouquets: a small arrangement costing $80 (including shipping), a medium-sized one for $100, and a large one with a price tag of $120. If 20% of all purchasers choose the small arrangement, 30% choose medium, and 50% choose large (because they really love Mom!), then the probability distribution of the cost X of a single randomly selected flower arrangement is given by

x      80   100   120
p(x)   .2    .3    .5        (6.1)

Suppose only two bouquets are sold today. Let X1 = the cost of the first bouquet and X2 = the cost of the second. Suppose that X1 and X2 are independent, each with the probability distribution shown in (6.1), so that X1 and X2 constitute a random sample from the distribution (6.1). Table 6.2 lists the possible (x1, x2) pairs, the probability of each pair computed using (6.1) and the assumption of independence, and the resulting x̄ and s² values. (Note that when n = 2, s² = (x1 − x̄)² + (x2 − x̄)².)

Table 6.2 Outcomes, probabilities, and values of x̄ and s² for Example 6.2

x1    x2    p(x1, x2)        x̄      s²
80    80    (.2)(.2) = .04    80       0
80    100   (.2)(.3) = .06    90     200
80    120   (.2)(.5) = .10   100     800
100   80    (.3)(.2) = .06    90     200
100   100   (.3)(.3) = .09   100       0
100   120   (.3)(.5) = .15   110     200
120   80    (.5)(.2) = .10   100     800
120   100   (.5)(.3) = .15   110     200
120   120   (.5)(.5) = .25   120       0

Now to obtain the probability distribution of X̄, the sample average cost per bouquet, we must consider each possible value x̄ and compute its probability. For example, x̄ = 100 occurs three times in the table with probabilities .10, .09, and .10, so

P(X̄ = 100) = .10 + .09 + .10 = .29

Similarly, s² = 800 appears twice in the table with probability .10 each time, so

P(S² = 800) = P(X1 = 80, X2 = 120) + P(X1 = 120, X2 = 80) = .10 + .10 = .20

The complete sampling distributions of X̄ and S² appear in (6.2) and (6.3):

x̄       80    90   100   110   120
p(x̄)   .04   .12   .29   .30   .25        (6.2)

s²        0   200   800
p(s²)   .38   .42   .20        (6.3)
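Because the population in Example 6.2 takes only three values, the exact sampling distributions (6.2) and (6.3) can be obtained by brute-force enumeration. The following R sketch is our illustration, not the book's code.

```r
# Enumerate all nine (x1, x2) pairs from distribution (6.1) and
# tabulate the exact sampling distributions of the mean and of s^2.
x <- c(80, 100, 120); p <- c(.2, .3, .5)
grid <- expand.grid(x1 = x, x2 = x)
prob <- as.vector(outer(p, p))                 # independence: p(x1)p(x2)
xbar <- (grid$x1 + grid$x2) / 2
s2   <- (grid$x1 - xbar)^2 + (grid$x2 - xbar)^2  # n = 2 shortcut for s^2
tapply(prob, xbar, sum)   # pmf of the sample mean; matches (6.2)
tapply(prob, s2, sum)     # pmf of S^2; matches (6.3)
```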
Figure 6.2 depicts a probability histogram for both the original distribution (6.1) of X and the X̄ distribution (6.2). [Figure 6.2: Probability histograms for (a) the underlying population distribution and (b) the sampling distribution of X̄ in Example 6.2] The figure suggests first that the mean (i.e., expected value) of X̄ is equal to the mean $106 of the original distribution, since both histograms appear to be centered at the same place. Indeed, from (6.2),

E(X̄) = Σ x̄·p_X̄(x̄) = 80(.04) + ⋯ + 120(.25) = 106 = µ

Second, it appears that the X̄ distribution has smaller spread (variability) than the original distribution, since the values of x̄ are more concentrated toward the mean. Again from (6.2),

V(X̄) = Σ (x̄ − µ_X̄)²·p_X̄(x̄) = Σ (x̄ − 106)²·p_X̄(x̄) = (80 − 106)²(.04) + ⋯ + (120 − 106)²(.25) = 122

Notice that V(X̄) = 122 = 244/2 = σ²/2, exactly half the population variance; that is a consequence of the sample size n = 2, and we'll see why in the next section. Finally, the mean value of S² is

E(S²) = Σ s²·p_S²(s²) = 0(.38) + 200(.42) + 800(.20) = 244 = σ²

That is, the X̄ sampling distribution is centered at the population mean µ, and the S² sampling distribution (histogram not shown) is centered at the population variance σ².

If four flower arrangements had been purchased on the day of interest, the sample average cost X̄ would be based on a random sample of four Xi's, each having the distribution (6.1). More calculation eventually yields the distribution of X̄ for n = 4 as

x̄       80     85     90     95    100    105    110    115    120
p(x̄)  .0016  .0096  .0376  .0936  .1761  .2340  .2350  .1500  .0625

From this, E(X̄) = 106 = µ and V(X̄) = 61 = σ²/4. Figure 6.3 is a probability histogram of this distribution. [Figure 6.3: Probability histogram for X̄ based on n = 4 in Example 6.2]

Example 6.2 should suggest first of all that the computation of p_X̄(x̄) and p_S²(s²) can be tedious. If the original distribution (6.1) had allowed for more than the three possible values 80, 100, and 120, then even for n = 2 the computations would have been more involved. The example should also suggest, however, that there are some general relationships between E(X̄), V(X̄), E(S²), and the mean µ and variance σ² of the original distribution. These are stated in the next section.

Now consider an example in which the random sample is drawn from a continuous distribution.

Example 6.3 The time that it takes to serve a customer at the cash register in a minimarket is a random variable having an exponential distribution with parameter λ. Suppose X1 and X2 are service times for two different customers, assumed independent of each other. Consider the total service time To = X1 + X2 for the two customers, also a statistic. The cdf of To is, for t ≥ 0,

$$F_{T_o}(t) = P(X_1 + X_2 \le t) = \iint_{\{(x_1,x_2):\,x_1+x_2\le t\}} f(x_1,x_2)\,dx_1\,dx_2 = \int_0^t \int_0^{t-x_1} \lambda e^{-\lambda x_1}\cdot\lambda e^{-\lambda x_2}\,dx_2\,dx_1 = \int_0^t \left(\lambda e^{-\lambda x_1} - \lambda e^{-\lambda t}\right)dx_1 = 1 - e^{-\lambda t} - \lambda t e^{-\lambda t}$$

The region of integration is pictured in Figure 6.4. [Figure 6.4: Region of integration (the triangle x1 ≥ 0, x2 ≥ 0, x1 + x2 ≤ t) to obtain the cdf of To in Example 6.3]

The pdf of To is obtained by differentiating F_To(t):

f_To(t) = λ²t·e^(−λt),  t ≥ 0        (6.4)

This is a gamma pdf (α = 2 and β = 1/λ). This distribution for To can also be derived by convolution or by the moment generating function argument from Section 5.3.

Since F_X̄(x̄) = P(X̄ ≤ x̄) = P(To ≤ 2x̄) = F_To(2x̄), differentiating with respect to x̄ and using (6.4) plus the chain rule gives us the pdf of X̄ = To/2:

f_X̄(x̄) = 4λ²x̄·e^(−2λx̄),  x̄ ≥ 0        (6.5)

The mean and variance of the underlying exponential distribution are µ = 1/λ and σ² = 1/λ². Using Expressions (6.4) and (6.5), it can be verified that E(X̄) = 1/λ, V(X̄) = 1/(2λ²), E(To) = 2/λ, and V(To) = 2/λ².
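As a quick check on (6.4), one can simulate many values of To and compare their histogram to the corresponding gamma density. This R sketch is ours, and the choice λ = 0.5 is arbitrary.

```r
# Simulated totals of two independent exponential service times,
# overlaid with the gamma(shape = 2, scale = 1/lambda) density of (6.4).
lambda <- 0.5                       # arbitrary illustrative rate
to <- rexp(10000, rate = lambda) + rexp(10000, rate = lambda)
hist(to, freq = FALSE, breaks = 40)
curve(dgamma(x, shape = 2, scale = 1/lambda), add = TRUE)
```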
These results again suggest some general relationships between the means and variances of X̄, To, and the underlying distribution.

Simulation Experiments

The second method of obtaining information about a statistic's sampling distribution is to perform a simulation experiment. This method is often used when a derivation via probability rules or properties of distributions is too difficult or complicated to be carried out. Simulations are virtually always done with the aid of computer software. The following characteristics of a simulation experiment must be specified:

1. The statistic of interest (X̄, S, a particular trimmed mean, etc.)
2. The population distribution (normal with µ = 100 and σ = 15, uniform with lower limit A = 5 and upper limit B = 10, etc.)
3. The sample size n (e.g., n = 10 or n = 50)
4. The number of replications k (e.g., k = 10,000)

Then use a computer to obtain k different random samples, each of size n, from the designated population distribution. For each such sample, calculate the value of the statistic and construct a histogram of the k calculated values. This histogram gives the approximate sampling distribution of the statistic. The larger the value of k, the better the approximation will tend to be (the actual sampling distribution emerges as k → ∞). In practice, k = 10,000 may be enough for a "fairly simple" statistic and population distribution, but modern computers allow for a much larger number of replications.

Example 6.4 Consider a simulation experiment in which the population distribution is quite skewed. Figure 6.5 shows the density curve for lifetimes of a certain type of electronic control. [Figure 6.5: Density curve for the simulation experiment of Example 6.4, a lognormal distribution with E(X) = 21.76 and V(X) = 82.14] This is actually a lognormal distribution with E[ln(X)] = 3 and V[ln(X)] = 0.16; that is, ln(X) is normal with mean 3 and standard deviation 0.4.

Imagine the statistic of interest is the sample mean X̄. For any given sample size n, we repeat the following procedure k times:

1. Generate values x1, …, xn from a lognormal distribution with the specified parameter values; equivalently, generate y1, …, yn from a N(3, 0.4) distribution and apply the transformation x = e^y to each value.
2. Calculate and store the sample mean x̄ of the n x-values.
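A minimal R version of this procedure (our sketch; the book does not give its code) for the n = 30 case might look as follows.

```r
# k replications of the sample mean of n lognormal observations,
# with meanlog = 3 and sdlog = 0.4 as specified in Example 6.4.
k <- 1000; n <- 30
xbar <- replicate(k, mean(rlnorm(n, meanlog = 3, sdlog = 0.4)))
hist(xbar, freq = FALSE)   # approximate sampling distribution of the mean
qqnorm(xbar)               # normal probability plot, as in Figure 6.6e
```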
We performed this simulation experiment at four different sample sizes: n = 5, 10, 20, and 30. The experiment utilized k = 1000 replications (a very modest value) for each sample size. The resulting histograms, along with a normal probability plot from R for the 1000 x̄ values based on n = 30, are shown in Figure 6.6. [Figure 6.6: Results of the simulation experiment of Example 6.4: (a) X̄ histogram for n = 5; (b) X̄ histogram for n = 10; (c) X̄ histogram for n = 20; (d) X̄ histogram for n = 30; (e) normal probability plot for n = 30 (from R)]

The first thing to notice about the histograms is that each one is centered approximately at the mean of the population being sampled, µ_X = e^(3 + 0.16/2) ≈ 21.76. Had the histograms been based on an unending sequence of x̄ values, their centers would have been exactly at the population mean.

Second, note the spread of the histograms relative to each other. The smaller the value of n, the greater the extent to which the sampling distribution spreads out about the mean value. This is why the histograms for n = 20 and n = 30 are based on narrower class intervals than those for the two smaller sample sizes. For the larger sample sizes, most of the x̄ values are quite close to µ_X. This is the effect of averaging. When n is small, a single unusual x value can result in an x̄ value far from the center. With a larger sample size, any unusual x values, when averaged in with the other sample values, still tend to yield an x̄ value close to µ_X. Combining these insights yields an intuitively appealing result: X̄ based on a large n tends to be closer to µ than does X̄ based on a small n.

Third and finally, consider the shapes of the histograms. Recall from Figure 6.5 that the population from which the samples were drawn is quite skewed. But as the sample size n increases, the distribution of X̄ appears to become progressively less skewed. In particular, when n = 30 the distribution of the 1000 x̄ values appears to be approximately normal, a fact validated by the normal probability plot in Figure 6.6e. We will discover in the next section that this is part of a much broader phenomenon known as the Central Limit Theorem: as the sample size n increases, the sampling distribution of X̄ becomes increasingly normal, irrespective of the population distribution from which values were sampled.

Example 6.5 The 2017 study described in Example 4.23 determined that the variable X = proximal grip distance for female surgeons follows a normal distribution with mean 6.58 cm and standard deviation 0.50 cm. Consider the statistic Q1 = the sample 25th percentile (equivalently, the lower quartile). To investigate the sampling distribution of Q1, we repeated the following procedure k = 1000 times:

1. Generate a sample x1, …, xn from the N(6.58, 0.50) distribution.
2. Calculate and store the lower quartile, q1, of the n resulting x values.

The results of two such simulation experiments, one for n = 5 and another for n = 40, are shown in Figure 6.7. [Figure 6.7: Sample histograms of Q1 based on 1000 samples, each consisting of n observations: (a) n = 5, (b) n = 40] Similar to X̄'s behavior in the previous example, we see that the sampling distribution of Q1 has greater variability for small n than for large n. Both sampling distributions appear to be centered roughly at the population lower quartile, which is perhaps not surprising: the 25th percentile of the population distribution is

η.25 = µ + Φ⁻¹(.25)·σ = 6.58 + (−0.675)(0.50) ≈ 6.24 cm

In fact, even with an infinite set of replications (i.e., the "true" sampling distribution), the mean of Q1 is not exactly η.25, but that difference decreases as n increases.
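A sketch of this experiment in R (ours, not the authors' code); note that quantile() implements several definitions of the lower quartile via its type argument, so results can differ slightly from Figure 6.7.

```r
# k replications of the sample lower quartile for n draws from
# the N(6.58, 0.50) grip-distance population of Example 6.5.
k <- 1000; n <- 40
q1 <- replicate(k, quantile(rnorm(n, mean = 6.58, sd = 0.50), probs = 0.25))
hist(q1, freq = FALSE)
mean(q1)   # close to the population lower quartile, about 6.24 cm
```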
Exercises: Section 6.1 (1–10)

1. A particular brand of dishwasher soap is sold in three sizes: 25, 40, and 65 oz. 20% of all purchasers select a 25-oz box, 50% select a 40-oz box, and the remaining 30% choose a 65-oz box. Let X1 and X2 denote the package sizes selected by two independently selected purchasers.
a. Determine the sampling distribution of X̄, calculate E(X̄), and compare to µ.
b. Determine the sampling distribution of the sample variance S², calculate E(S²), and compare to σ².

2. There are two traffic lights on the way to work. Let X1 be the number of lights that are red, requiring a stop, and suppose that the distribution of X1 is as follows:

x1      0    1    2
p(x1)  .2   .5   .3        (µ = 1.1, σ² = .49)

Let X2 be the number of lights that are red on the way home; X2 is independent of X1. Assume that X2 has the same distribution as X1, so that X1, X2 is a random sample of size n = 2.
a. Let To = X1 + X2, and determine the probability distribution of To.
b. Calculate µ_To. How does it relate to µ, the population mean?
c. Calculate σ²_To. How does it relate to σ², the population variance?

3. It is known that 80% of all Brand A MP3 players work in a satisfactory manner throughout the warranty period (are "successes"). Suppose that n = 10 players are randomly selected. Let X = the number of successes in the sample. The statistic X/n is the sample proportion (fraction) of successes. Obtain the sampling distribution of this statistic. [Hint: One possible value of X/n is .3, corresponding to X = 3. What is the probability of this value (what kind of random variable is X)?]

4. A box contains ten sealed envelopes numbered 1, …, 10. The first five contain no money, the next three each contain $5, and there is a $10 bill in each of the last two. A sample of size 3 is selected with replacement (so we have a random sample), and you get the largest amount in any of the envelopes selected. If X1, X2, and X3 denote the amounts in the selected envelopes, the statistic of interest is M = the maximum of X1, X2, and X3.
a. Obtain the probability distribution of this statistic.
b. Describe how you would carry out a simulation experiment to compare the distributions of M for various sample sizes. How would you guess the distribution would change as n increases?

5. Let X be the number of packages being mailed by a randomly selected customer at a shipping facility. Suppose the distribution of X is as follows:

x      1    2    3    4
p(x)  .4   .3   .2   .1

a. Consider a random sample of size n = 2 (two customers), and let X̄ be the sample mean number of packages shipped. Obtain the sampling distribution of X̄.
b. Refer to part (a) and calculate P(X̄ ≤ 2.5).
c. Again consider a random sample of size n = 2, but now focus on the statistic R = the sample range (difference between the largest and smallest values in the sample). Obtain the sampling distribution of R. [Hint: Calculate the value of R for each outcome and use the probabilities from part (a).]
d. If a random sample of size n = 4 is selected, what is P(X̄ ≤ 1.5)? [Hint: You should not have to list all possible outcomes, only those for which x̄ ≤ 1.5.]

6. A company maintains three offices in a region, each staffed by two employees. Information concerning yearly salaries (1000s of dollars) is as follows:

Office      1     1     2     2     3     3
Employee    1     2     3     4     5     6
Salary    29.7  33.6  30.2  33.6  25.8  29.7

a. Suppose two of these employees are randomly selected from among the six (without replacement). Determine the sampling distribution of the sample mean salary X̄.
b. Suppose one of the three offices is randomly selected. Let X1 and X2 denote the salaries of the two employees. Determine the sampling distribution of X̄.
c. How does E(X̄) from parts (a) and (b) compare to the population mean salary µ?
7. The number of dirt specks on a randomly selected square yard of polyethylene film of a certain type has a Poisson distribution with a mean value of 2 specks per square yard. Consider a random sample of n = 5 film specimens, each having area 1 square yard, and let X̄ be the resulting sample mean number of dirt specks. Obtain the first 21 probabilities in the X̄ sampling distribution. [Hint: What does a moment generating function argument say about the distribution of X1 + ⋯ + X5?]

8. Suppose the amount of liquid dispensed by a machine is uniformly distributed with lower limit A = 8 oz and upper limit B = 10 oz. Describe how you would carry out simulation experiments to compare the sampling distribution of the sample iqr for sample sizes n = 5, 10, 20, and 30.

9. Carry out a simulation experiment using a statistical computer package or other software to study the sampling distribution of X̄ when the population distribution is Weibull with α = 2 and β = 5, as in Example 6.1. Consider the four sample sizes n = 5, 10, 20, and 30, and in each case use at least 1000 replications. For which of these sample sizes does the X̄ sampling distribution appear to be approximately normal?

10. Carry out a simulation experiment using a statistical computer package or other software to study the sampling distribution of X̄ when the population distribution is lognormal with E[ln(X)] = 3 and V[ln(X)] = 1. Consider the four sample sizes n = 10, 20, 30, and 50, and in each case use at least 1000 replications. For which of these sample sizes does the X̄ sampling distribution appear to be approximately normal?

6.2 The Distribution of Sample Totals, Means, and Proportions

Throughout this section, we will be primarily interested in the properties of two particular rvs derived from random samples: the sample total To and the sample mean X̄,

To = X1 + ⋯ + Xn = Σᵢ₌₁ⁿ Xᵢ    and    X̄ = (X1 + ⋯ + Xn)/n = To/n

The importance of the sample mean X̄ springs from its use in drawing conclusions about the population mean µ. Some of the most frequently used inferential procedures are based on properties of the sampling distribution of X̄. A preview of these properties appeared in the calculations and simulation experiments of the previous section, where we noted relationships between E(X̄) and µ and also among V(X̄), σ², and n.

PROPOSITION Let X1, X2, …, Xn be a random sample from a distribution with mean value µ and standard deviation σ. Then
1. E(To) = nµ and E(X̄) = µ
2. V(To) = nσ² with σ_To = √n·σ, and V(X̄) = σ²/n with σ_X̄ = σ/√n
3. If the Xi's are normally distributed, then To and X̄ are also normally distributed.

Proof From the main theorem of Section 5.3, the expected value of a sum is the sum of the individual expected values; moreover, when the variables in the sum are independent, the variance of the sum is the sum of the individual variances:

E(To) = E(X1 + ⋯ + Xn) = E(X1) + ⋯ + E(Xn) = µ + ⋯ + µ = nµ
V(To) = V(X1 + ⋯ + Xn) = V(X1) + ⋯ + V(Xn) = σ² + ⋯ + σ² = nσ²
σ_To = √(nσ²) = √n·σ

The corresponding results for X̄ can be derived by writing X̄ = (1/n)To and using basic rescaling properties, such as E(cY) = cE(Y) and V(cY) = c²V(Y). Property 3 is a consequence of the more general result from Section 5.3 that any linear combination of independent normal rvs is normal.
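A quick simulation check of Properties 1 and 2 (our illustration; the population choice is arbitrary): for an exponential population with µ = σ = 2 and sample size n = 5, the simulated mean and variance of X̄ should be near µ = 2 and σ²/n = 4/5 = 0.8.

```r
# Empirical check that E(X-bar) = mu and V(X-bar) = sigma^2 / n
# for a non-normal (exponential) population.
n <- 5; mu <- 2                       # exponential: sigma = mu = 2
xbar <- replicate(10000, mean(rexp(n, rate = 1/mu)))
mean(xbar)   # approximately 2
var(xbar)    # approximately 0.8
```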
According to Property 1, the distribution of X̄ is centered precisely at the mean of the population from which the sample has been selected. If the sample mean is used to compute an estimate (educated guess) of the population mean µ, there will be no systematic tendency for the estimate to be too large or too small. Property 2 shows that the X̄ distribution becomes more concentrated about µ as the sample size n increases, because its standard deviation decreases. In marked contrast, the distribution of To becomes more spread out as n increases. Averaging moves probability in toward the middle, whereas totaling spreads probability out over a wider and wider range of values.

The expression σ/√n for the standard deviation of X̄ is called the standard error of the mean, and it indicates the typical amount by which a value of X̄ will deviate from the true mean µ (in contrast, σ itself represents the typical difference between an individual Xi and µ). When σ is unknown, as is usually the case when µ is unknown and we are trying to estimate it, we may substitute the sample standard deviation s of our sample into the standard error formula and say that an observed value of X̄ will typically differ by about s/√n from µ. This is the estimated standard error formula presented in Sections 3.8 and 4.8.

Finally, Property 3 says that X̄ and To are both normally distributed when the population distribution is normal. In particular, probabilities such as P(a ≤ X̄ ≤ b) and P(c ≤ To ≤ d) can be obtained simply by standardizing, with the appropriate means and standard deviations provided by Properties 1 and 2. Figure 6.8 illustrates the X̄ part of the proposition. [Figure 6.8: A normal population distribution and the X̄ sampling distributions for n = 4 and n = 10]

Example 6.6 The amount of time that a patient spends in a certain outpatient surgery center is a random variable with a mean value of 4.5 h and a standard deviation of 1.4 h. Let X1, …, X25 be the times for a random sample of 25 patients. Then the expected total time for the 25 patients is E(To) = nµ = 25(4.5) = 112.5 h, whereas the expected sample mean amount of time is E(X̄) = µ = 4.5 h. The standard deviations of To and X̄ are

σ_To = √n·σ = √25(1.4) = 7 h    σ_X̄ = σ/√n = 1.4/√25 = .28 h

Suppose further that such patient times follow a normal distribution; i.e., Xi ~ N(4.5, 1.4). Then the total time spent by 25 randomly selected patients in this center is also normal: To ~ N(112.5, 7). The probability their total time exceeds five days (120 h) is

P(To > 120) = 1 − P(To ≤ 120) = 1 − Φ((120 − 112.5)/7) = 1 − Φ(1.07) = 1 − .8577 = .1423

This same probability can be reframed in terms of X̄: for 25 patients, a total time of 120 h equates to an average time of 120/25 = 4.8 h, and since X̄ ~ N(4.5, .28),

P(X̄ > 4.8) = 1 − Φ((4.8 − 4.5)/.28) = 1 − Φ(1.07) = .1423
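In R, these two probabilities are one-line computations (our illustration):

```r
# Example 6.6: the To and X-bar formulations give the same answer.
1 - pnorm(120, mean = 112.5, sd = 7)    # P(To > 120), about .142
1 - pnorm(4.8, mean = 4.5, sd = 0.28)   # P(X-bar > 4.8), about .142
```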
Example 6.7 Resistors used in electronics manufacturing are labeled with a "nominal" resistance as well as a percentage tolerance. For example, a 330-Ω resistor with a 5% tolerance is anticipated to have an actual resistance between 313.5 and 346.5 Ω. Consider five such resistors, randomly selected from the population of all resistors with those specifications, and model the resistance of each by a uniform distribution on [313.5, 346.5]. If these are connected in series, the resistance R of the system is given by R = X1 + ⋯ + X5, where the Xi's are the iid uniform resistances.

A random variable uniformly distributed on [A, B] has mean (A + B)/2 and standard deviation (B − A)/√12. For our uniform model, the mean resistance is E(Xi) = (313.5 + 346.5)/2 = 330 Ω, the nominal resistance, with a standard deviation of (346.5 − 313.5)/√12 = 9.526 Ω. The system's resistance has mean and standard deviation

E(R) = nµ = 5(330) = 1650 Ω    σ_R = √n·σ = √5(9.526) = 21.3 Ω

But what is the probability distribution of R? Is R also uniformly distributed? Determining the exact pdf of R is difficult (it requires four convolutions). And the mgf of R, while easy to obtain, is not recognizable as coming from any particular family of known distributions. Instead, we resort to a simulation of R, the results of which appear in Figure 6.9. For 10,000 iterations, five independent uniform variates on [313.5, 346.5] were created and summed; see Section 4.8 for information on simulating values from a uniform distribution. The histogram in Figure 6.9 clearly indicates that R is not uniform; in fact, if anything, R appears (from the simulation, anyway) to be approximately normally distributed! [Figure 6.9: Simulated distribution of the random variable R in Example 6.7]
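The simulation just described takes only a few lines in R (our sketch of it; the seed is unspecified in the text):

```r
# 10,000 values of R = X1 + ... + X5, each Xi uniform on [313.5, 346.5].
r <- replicate(10000, sum(runif(5, min = 313.5, max = 346.5)))
mean(r); sd(r)          # compare to 1650 and 21.3
hist(r, freq = FALSE)   # bell-shaped, as in Figure 6.9
```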
The Central Limit Theorem

When iid Xi's are normally distributed, so are To and X̄ for every sample size n. The simulation results from Example 6.7 suggest that even when the population distribution is not normal, summing (or averaging) produces a distribution more bell-shaped than the one being sampled. Upon reflection, this is quite intuitive: in order for R to be near 5(346.5) = 1732.5, its theoretical maximum, all five randomly selected resistors would have to exert resistances at the high end of their common range (i.e., every Xi would have to be near 346.5). Thus, R-values near 1732.5 are unlikely, and the same applies to R's theoretical minimum of 5(313.5) = 1567.5. On the other hand, there are many ways for R to be near the mean value of 1650: all five resistances in the middle, two low and one middle and two high, and so on. Thus, R is more likely to be "centrally" located than out at the extremes. (This is analogous to the well-known fact that rolling a pair of dice is far more likely to result in a sum of 7 than 2 or 12, because there are more ways to obtain 7.) This general pattern of behavior for sample totals and sample means is formalized by the most important theorem of probability, the Central Limit Theorem (CLT).

CENTRAL LIMIT THEOREM (CLT) Let X1, X2, …, Xn be a random sample from a distribution with mean µ and standard deviation σ. Then, in the limit as n → ∞, the standardized versions of X̄ and To have the standard normal distribution. That is,

$$\lim_{n\to\infty} P\!\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z\right) = P(Z \le z) = \Phi(z)$$

and

$$\lim_{n\to\infty} P\!\left(\frac{T_o - n\mu}{\sqrt{n}\,\sigma} \le z\right) = P(Z \le z) = \Phi(z)$$

where Z is a standard normal rv. It is customary to say that X̄ and To are asymptotically normal, and that their standardized versions converge in distribution to Z. Thus when n is sufficiently large, X̄ has approximately a normal distribution with µ_X̄ = µ and σ_X̄ = σ/√n. Equivalently, for large n the sum To has approximately a normal distribution with µ_To = nµ and σ_To = √n·σ.

Figure 6.10 illustrates the Central Limit Theorem. [Figure 6.10: The Central Limit Theorem for X̄ illustrated: the X̄ distribution for large n is approximately normal and tighter about µ than for small to moderate n] A partial proof of the CLT appears in the appendix to this chapter. It is shown that, if the moment generating function exists, then the mgf of the standardized X̄ (and of To) approaches the standard normal mgf. With the aid of an advanced probability theorem, this implies the CLT statement about convergence of probabilities.

A practical difficulty in applying the CLT is in knowing when n is "sufficiently large." The problem is that the accuracy of the approximation for a particular n depends on the shape of the original underlying distribution being sampled. If the underlying distribution is symmetric and there is not much probability far out in the tails, then the approximation will be good even for a small n, whereas if it is highly skewed or has "heavy" tails, then a large n will be required. For example, if the distribution is uniform on an interval, then it is symmetric with no probability in the tails, and the normal approximation is very good for n as small as 10 (in Example 6.7, even for n = 5, the distribution of the sample total appeared rather bell-shaped). However, at the other extreme, a distribution can have such fat tails that its mean fails to exist and the Central Limit Theorem does not apply, so no n is big enough. A popular, although frequently somewhat conservative, convention is that the Central Limit Theorem may be safely applied when n > 30. Of course, there are exceptions, but this rule applies to most distributions of real data.

Example 6.8 When a batch of a certain chemical product is prepared, the amount of a particular impurity in the batch is a random variable with mean value 4.0 g and standard deviation 1.5 g. If 50 batches are independently prepared, what is the (approximate) probability that the sample average amount of impurity X̄ is between 3.5 and 3.8 g? According to the convention mentioned above, n = 50 is large enough for the CLT to be applicable. The sample mean X̄ then has approximately a normal distribution with mean value µ_X̄ = 4.0 and σ_X̄ = 1.5/√50 = .2121, so

P(3.5 ≤ X̄ ≤ 3.8) ≈ P((3.5 − 4.0)/.2121 ≤ Z ≤ (3.8 − 4.0)/.2121) = Φ(−.94) − Φ(−2.36) = .1645

Example 6.9 Suppose the number of times a randomly selected customer of a large bank uses the bank's ATM during a particular period is a random variable with a mean value of 3.2 and a standard deviation of 2.4. Among 100 randomly selected customers, how likely is it that the sample mean number of times the bank's ATM is used exceeds 4? Let Xi denote the number of times the ith customer in the sample uses the bank's ATM. Notice that Xi is a discrete rv, but the CLT is not limited to continuous random variables. Also, although the fact that the standard deviation of this nonnegative variable is quite large relative to the mean value suggests that its distribution is positively skewed, the large sample size implies that X̄ does have approximately a normal distribution. Using µ_X̄ = 3.2 and σ_X̄ = σ/√n = 2.4/√100 = .24,

P(X̄ > 4) ≈ P(Z > (4 − 3.2)/.24) = 1 − Φ(3.33) = .0004
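Both approximations are immediate in R (our illustration):

```r
# Example 6.8: P(3.5 <= X-bar <= 3.8) under the CLT approximation.
pnorm(3.8, mean = 4.0, sd = 1.5/sqrt(50)) -
  pnorm(3.5, mean = 4.0, sd = 1.5/sqrt(50))   # about .164

# Example 6.9: P(X-bar > 4).
1 - pnorm(4, mean = 3.2, sd = 2.4/sqrt(100))  # about .0004
```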
Example 6.10 Consider the distribution shown in Figure 6.11 for the amount purchased (rounded to the nearest dollar) by a randomly selected customer at a particular gas station. (A similar distribution for purchases in Britain (in £) appeared in the article "Data Mining for Fun and Profit," Stat. Sci. 2000: 111–131; there were big spikes at the values 10, 15, 20, 25, and 30.) The distribution is obviously quite nonnormal. [Figure 6.11: Probability distribution of X = amount of gasoline purchased ($) in Example 6.10]

We asked R to select 1000 different samples, each consisting of n = 15 observations, and calculate the value of the sample mean X̄ for each one. Figure 6.12 is a histogram of the resulting 1000 values; this is the approximate sampling distribution of X̄ under the specified circumstances. [Figure 6.12: Approximate sampling distribution of the sample mean amount purchased when n = 15 and the population distribution is as shown in Figure 6.11] This distribution is clearly approximately normal even though the sample size is not all that large. As further evidence for normality, Figure 6.13 shows a normal probability plot of the 1000 x̄ values; the linear pattern is very prominent. [Figure 6.13: Normal probability plot of the 1000 x̄ values based on samples of size n = 15] It is typically not nonnormality in the central part of the population distribution that causes the CLT to fail, but instead very substantial skewness or extremely heavy tails.

The CLT can also be generalized so it applies to nonidentically distributed independent random variables and certain linear combinations. Roughly speaking, if n is large and no individual term is likely to contribute too much to the overall value, then asymptotic normality prevails (see Exercise 68). It can also be generalized to sums of variables which are not independent, provided the extent of dependence between most pairs of variables is not too strong.

Other Applications of the Central Limit Theorem

The CLT can be used to justify the normal approximation to the binomial distribution discussed in Chapter 4. Recall that a binomial variable X is the number of successes in a binomial experiment consisting of n independent success/failure trials with p = P(success) for any particular trial. Define new rvs X1, X2, …, Xn by

Xi = 1 if the ith trial results in a success, and Xi = 0 if the ith trial results in a failure  (i = 1, …, n)

Because the trials are independent and P(success) is constant from trial to trial, the Xi's are iid (a random sample from a Bernoulli distribution). When the Xi's are summed, a 1 is added for every success that occurs and a 0 for every failure, so X = X1 + ⋯ + Xn, their total. The sample mean of the Xi's is X̄ = X/n, the sample proportion of successes, which in previous discussions we have denoted P̂. The CLT then implies that both X and P̂ are approximately normal when n is sufficiently large. We summarize properties of the P̂ distribution in the following corollary; Statements 1 and 2 were derived in Section 3.5.

COROLLARY Consider an event A in the sample space of some experiment with p = P(A). Let X = the number of times A occurs when the experiment is repeated n independent times, and define P̂ = P̂(A) = X/n. Then
1. E(P̂) = p
2. V(P̂) = p(1 − p)/n and σ_P̂ = √(p(1 − p)/n)
3. As n increases, the distribution of P̂ approaches a normal distribution.
In practice, Property 3 is taken to say that P̂ is approximately normal, provided that np ≥ 10 and n(1 − p) ≥ 10. The necessary sample size for this approximation depends on the value of p: when p is close to .5, the distribution of each Bernoulli Xi is reasonably symmetric (see Figure 6.14), whereas the distribution is quite skewed when p is near 0 or 1. [Figure 6.14: Two Bernoulli distributions: (a) p = .4 (reasonably symmetric); (b) p = .1 (very skewed)] Using the approximation only if both np ≥ 10 and n(1 − p) ≥ 10 ensures that n is large enough to overcome any skewness in the underlying Bernoulli distribution.

Example 6.11 A computer simulation in the style of Section 2.6 is used to determine the probability that a complex system of components operates properly throughout the warranty period. Unknown to the investigator, the true probability is P(A) = .18. If 10,000 simulations of the underlying process are run, what is the chance the estimated probability P̂(A) will be within .01 of the true probability P(A)?

Apply the preceding corollary, with n = 10,000 and p = P(A) = .18. The expected value of P̂(A) is p = .18, and the standard deviation is σ_P̂ = √((.18)(.82)/10,000) = .00384. Since np = 1800 ≥ 10 and n(1 − p) = 8200 ≥ 10, a normal distribution can safely be used to approximate the distribution of P̂(A). This sample proportion is within .01 of the true probability if and only if .17 < P̂(A) < .19, so the desired likelihood is approximately

P(.17 < P̂ < .19) ≈ P((.17 − .18)/.00384 < Z < (.19 − .18)/.00384) = Φ(2.60) − Φ(−2.60) = .9906
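In R (our illustration), the approximation can be computed directly, and because X is binomial here, the exact answer is also available for comparison:

```r
# Example 6.11: normal approximation for P-hat vs. the exact
# binomial probability P(1701 <= X <= 1899), i.e., .17 < P-hat < .19.
n <- 10000; p <- .18
se <- sqrt(p * (1 - p) / n)
pnorm(.19, mean = p, sd = se) - pnorm(.17, mean = p, sd = se)  # about .9906
pbinom(1899, n, p) - pbinom(1700, n, p)                        # exact answer
```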
The normal distribution serves as a reasonable approximation to the binomial pmf when n is large because the binomial distribution is additive; i.e., a binomial rv can be expressed as the sum of other, iid rvs. Other additive distributions include the Poisson, negative binomial, gamma, and (of course) normal distributions; some of these were discussed at the end of Section 5.3. In particular, the CLT justifies normal approximations to the following distributions:

1. Poisson, when µ is large
2. Negative binomial, when r is large
3. Gamma, when α is large

As a final application of the CLT, first recall from Section 4.5 that X has a lognormal distribution if ln(X) has a normal distribution.

PROPOSITION Let X1, X2, …, Xn be a random sample from a distribution for which only positive values are possible [P(Xi > 0) = 1]. Then if n is sufficiently large, the product Y = X1·X2·⋯·Xn has approximately a lognormal distribution; that is, ln(Y) has a normal distribution.

To verify this, note that ln(Y) = ln(X1) + ln(X2) + ⋯ + ln(Xn). Since ln(Y) is a sum of independent and identically distributed rvs [the ln(Xi)'s], it is approximately normal when n is large, so Y itself has approximately a lognormal distribution. As an example of the applicability of this result, it has been argued that the damage process in plastic flow and crack propagation is a multiplicative process, so that variables such as percentage elongation and rupture strength have approximately lognormal distributions.

The Law of Large Numbers

In the simulation sections of Chapters 2–4, we described how a sample proportion P̂ could estimate a true probability p, and a sample mean X̄ served to approximate a theoretical expected value µ. Moreover, in both cases the precision of the estimation improves as the number of simulation runs, n, increases. We would like to be able to say that our estimates "converge" to the correct answers in some sense. Such a convergence statement is justified by another important theoretical result, called the Law of Large Numbers.

To begin, recall the first proposition in this section: if X1, X2, …, Xn is a random sample from a distribution with mean µ and standard deviation σ, then E(X̄) = µ and V(X̄) = σ²/n. As n increases, the expected value of X̄ remains at µ but the variance approaches zero: E[(X̄ − µ)²] = V(X̄) = σ²/n → 0. We say that X̄ converges in mean square to µ because the mean of the squared difference between X̄ and µ goes to zero. This is one form of the Law of Large Numbers.

Another form of convergence states that as the sample size n increases, X̄ is increasingly unlikely to differ by any set amount from µ. More precisely, let ε be a positive number close to 0, such as .01 or .001, and consider P(|X̄ − µ| ≥ ε), the probability that X̄ differs from µ by at least ε (at least .01, at least .001, etc.). We will prove shortly that, no matter how small the value of ε, this probability will approach zero as n → ∞. Because of this, statisticians say that X̄ converges to µ in probability. The two forms of the Law of Large Numbers are summarized in the following theorem.

LAW OF LARGE NUMBERS If X1, X2, …, Xn is a random sample from a distribution with mean µ, then X̄ converges to µ
1. in mean square: E[(X̄ − µ)²] → 0 as n → ∞
2. in probability: P(|X̄ − µ| ≥ ε) → 0 as n → ∞ for any ε > 0.

Proof The proof of Statement 1 appears a few paragraphs above. For Statement 2, recall Chebyshev's inequality (Exercises 45 and 163 in Chapter 3), which states that for any rv Y, P(|Y − µ_Y| ≥ kσ_Y) ≤ 1/k² for any k ≥ 1 (i.e., the probability that Y is at least k standard deviations away from its mean is at most 1/k²). Let Y = X̄, so µ_Y = E(X̄) = µ and σ_Y = σ_X̄ = σ/√n. Now, for any ε > 0, determine the value of k such that ε = kσ_Y = kσ/√n; solving for k yields k = ε√n/σ, which for sufficiently large n will exceed 1. Apply Chebyshev's inequality:

P(|Y − µ_Y| ≥ kσ_Y) ≤ 1/k²  ⟹  P(|X̄ − µ| ≥ ε) ≤ 1/(ε√n/σ)² = σ²/(ε²n) → 0 as n → ∞

That is, P(|X̄ − µ| ≥ ε) → 0 as n → ∞ for any ε > 0.

Convergence of X̄ to µ in probability actually holds even if the variance σ² does not exist (a heavy-tailed distribution) as long as µ is finite. But then Chebyshev's inequality cannot be used, and the proof is much more complicated.

In statistical language, the Law of Large Numbers states that X̄ is a consistent estimator of µ. Other statistics are also consistent estimators of the corresponding parameters. For example, it can be shown that the sample proportion P̂ is a consistent estimator of the population proportion p (Exercise 24), and the sample variance S² = Σ(Xi − X̄)²/(n − 1) is a consistent estimator of the population variance σ².
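The law is easy to visualize (our R sketch; the population choice is arbitrary): plot the running mean of iid exponential observations and watch it settle at µ = 1.

```r
# Running means of 100,000 iid exponential(1) observations converge to mu = 1.
x <- rexp(100000, rate = 1)
running_mean <- cumsum(x) / seq_along(x)
plot(running_mean, type = "l", log = "x")  # log x-axis shows early volatility
abline(h = 1)
```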
Exercises: Section 6.2 (11–27)

11. The inside diameter of a randomly selected piston ring is a random variable with mean value 12 cm and standard deviation .04 cm.
a. If X̄ is the sample mean diameter for a random sample of n = 16 rings, where is the sampling distribution of X̄ centered, and what is the standard deviation of the X̄ distribution?
b. Answer the questions posed in part (a) for a sample size of n = 64 rings.
c. For which of the two random samples, the one of part (a) or the one of part (b), is X̄ more likely to be within .01 cm of 12 cm? Explain your reasoning.

12. Refer to the previous exercise. Suppose the distribution of diameter is normal.
a. Calculate P(11.99 ≤ X̄ ≤ 12.01) when n = 16.
b. How likely is it that the sample mean diameter exceeds 12.01 when n = 25?

13. The National Health Statistics Reports dated Oct. 22, 2008 stated that for a sample size of 277 18-year-old American males, the sample mean waist circumference was 86.3 cm. A somewhat complicated method was used to estimate various population percentiles, resulting in the following values:

Percentile   5th    10th   25th   50th   75th   90th    95th
Value        69.6   70.9   75.2   81.3   95.4   107.1   116.4

a. Is it plausible that the waist size distribution is at least approximately normal? Explain your reasoning. If your answer is no, conjecture the shape of the population distribution.
b. Suppose that the population mean waist size is 85 cm and that the population standard deviation is 15 cm. How likely is it that a random sample of 277 individuals will result in a sample mean waist size of at least 86.3 cm?
c. Referring back to (b), suppose now that the population mean waist size is 82 cm (closer to the median than the mean). Now what is the (approximate) probability that the sample mean will be at least 86.3? In light of this calculation, do you think that 82 is a reasonable value for µ?

14. There are 40 students in an elementary statistics class. On the basis of years of experience, the instructor knows that the time needed to grade a randomly chosen first examination paper is a random variable with an expected value of 6 min and a standard deviation of 6 min.
a. If grading times are independent and the instructor begins grading at 6:50 p.m. and grades continuously, what is the (approximate) probability that he is through grading before the 11:00 p.m. TV news begins?
b. If the sports report begins at 11:10, what is the probability that he misses part of the report if he waits until grading is done before turning on the TV?

15. The tip percentage at a restaurant has a mean value of 18% and a standard deviation of 6%.
a. What is the approximate probability that the sample mean tip percentage for a random sample of 40 bills is between 16 and 19%?
b. If the sample size had been 15 rather than 40, could the probability requested in part (a) be calculated from the given information?

16. The time taken by a randomly selected applicant for a mortgage to fill out a certain form has a normal distribution with mean value 10 min and standard deviation 2 min. If five individuals fill out a form on one day and six on another, what is the probability that the sample average amount of time taken on each day is at most 11 min?

17. The lifetime of a type of battery is normally distributed with mean value 10 h and standard deviation 1 h. There are four batteries in a package. What lifetime value is such that the total lifetime of all batteries in a package exceeds that value for only 5% of all packages?
18. Let X represent the amount of gasoline (gallons) purchased by a randomly selected customer at a gas station. Suppose that the mean value and standard deviation of X are 11.5 and 4.0, respectively.
a. In a sample of 50 randomly selected customers, what is the approximate probability that the sample mean amount purchased is at least 12 gallons?
b. In a sample of 50 randomly selected customers, what is the approximate probability that the total amount of gasoline purchased is at most 600 gallons?
c. What is the approximate value of the 95th percentile for the total amount purchased by 50 randomly selected customers?

19. Suppose that the fracture angle under pure compression of a randomly selected specimen of fiber reinforced polymer-matrix composite material is normally distributed with mean value 53 and standard deviation 1 (suggested in the article "Stochastic Failure Modelling of Unidirectional Composite Ply Failure," Reliability Engr. Syst. Safety 2012: 1–9; this type of material is used extensively in the aerospace industry).
a. If a random sample of 4 specimens is selected, what is the probability that the sample mean fracture angle is at most 54? Between 53 and 54?
b. How many such specimens would be required to ensure that the first probability in (a) is at least .999?

20. The first assignment in a statistical computing class involves running a short program. If past experience indicates that 40% of all students will make no programming errors, compute the (approximate) probability that in a class of 50 students
a. At least 25 will make no errors. [Hint: Normal approximation to the binomial.]
b. Between 15 and 25 (inclusive) will make no errors.

21. The number of parking tickets issued in a certain city on any given weekday has a Poisson distribution with parameter µ = 50. What is the approximate probability that
a. Between 35 and 70 tickets are given out on a particular day? [Hint: When µ is large, a Poisson rv has approximately a normal distribution.]
b. The total number of tickets given out during a 5-day week is between 225 and 275?
c. Use software to obtain the exact probabilities in (a) and (b), and compare to the approximations.

22. Suppose the distribution of the time X (in hours) spent by students at a certain university on a particular project is gamma with parameters α = 50 and β = 2. Because α is large, it can be shown that X has approximately a normal distribution. Use this fact to compute the probability that a randomly selected student spends at most 125 h on the project.

23. The Central Limit Theorem says that X̄ is approximately normal if the sample size is large. More specifically, the theorem states that the standardized X̄ has a limiting standard normal distribution. That is, the rv (X̄ − µ)/(σ/√n) has a distribution approaching the standard normal. Can you reconcile this with the Law of Large Numbers?

24. Assume a sequence of independent trials, each with probability p of success. Use the Law of Large Numbers to show that the proportion of successes approaches p as the number of trials becomes large.

25. Let Yn be the largest order statistic in a sample of size n from the uniform distribution on [0, θ]. Show that Yn converges in probability to θ, that is, that P(|Yn − θ| ≥ ε) → 0 as n approaches ∞. [Hint: The pdf of the largest order statistic appears in Section 5.7, so the relevant probability can be obtained by integration (Chebyshev's inequality is not relevant here).]

26. A friend commutes by bus to and from work 6 days/week. Suppose that waiting time is uniformly distributed between 0 and 10 min, and that waiting times going and returning on various days are independent of each other. What is the approximate probability that total waiting time for an entire week is at most 75 min?

27. It can be shown that if Yn converges in probability to a constant τ, then h(Yn) converges to h(τ) for any function h(·) that is continuous at τ. Use this to obtain a consistent estimator for the rate parameter λ of an exponential distribution. [Hint: How does µ for an exponential distribution relate to the exponential parameter λ?]
6.3 The χ², t, and F Distributions

The previous section explored the sampling distribution of the sample mean X̄, with particular attention to the special case when our sample X1, …, Xn is drawn from a normally distributed population. In this section, we introduce three distributions closely related to the normal: the chi-squared (χ²), t, and F distributions. These distributions will then be used in the next section to describe the sampling variability of several statistics on which important inferential procedures are based.

The Chi-Squared Distribution

DEFINITION For a positive integer ν, let Z1, …, Z_ν be iid standard normal random variables. Then the chi-squared distribution with ν degrees of freedom (df) is defined to be the distribution of the rv

X = Z1² + ⋯ + Z_ν²

This will sometimes be denoted by X ~ χ²_ν.

Our first goal is to determine the pdf of this distribution. We start with the ν = 1 case, where we may write X = Z1². As in previous chapters, let Φ(z) and φ(z) denote the cdf and pdf, respectively, of the standard normal distribution. Then the cdf of X, for x > 0, is given by

F(x) = P(X ≤ x) = P(Z1² ≤ x) = P(−√x ≤ Z1 ≤ √x) = Φ(√x) − Φ(−√x) = Φ(√x) − [1 − Φ(√x)] = 2Φ(√x) − 1

Above, we've used the symmetry property Φ(−z) = 1 − Φ(z) of the standard normal distribution. Differentiate to obtain the pdf for x > 0:

f(x) = F′(x) = 2φ(√x)·(1/(2√x)) = (1/√(2π))·e^(−(√x)²/2)·(1/√x) = (1/√(2π))·x^(−1/2)·e^(−x/2)        (6.6)

We have established the χ²₁ pdf. But this expression looks familiar: comparing (6.6) to the gamma pdf in Expression (4.6), and recalling that Γ(1/2) = √π, we find that the χ²₁ distribution is exactly the same as the gamma distribution with parameters α = 1/2 and β = 2!
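A quick simulation check of this identification (our illustration): squared standard normal draws should match the gamma(α = 1/2, β = 2) density.

```r
# Squared standard normals vs. the gamma(shape = 1/2, scale = 2) density.
z2 <- rnorm(100000)^2
hist(z2, freq = FALSE, breaks = 100, xlim = c(0, 8))
curve(dgamma(x, shape = 1/2, scale = 2), add = TRUE)
```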
Figure 6.15 shows graphs of the chi-squared pdf for 1, 2, 3, and 5 degrees of freedom. Notice that the pdf is unbounded near x = 0 for 1 df and is exponentially decreasing for 2 df; indeed, the chi-squared distribution for 2 df is the exponential distribution with mean 2, f(x) = (1/2)e^{−x/2} for x > 0. If ν > 2 the pdf is unimodal with a peak at x = ν − 2, as shown in Exercise 31. The distribution is skewed, but it becomes more symmetric as the number of degrees of freedom increases, and for large df values the distribution is approximately normal (see Exercise 29).

[Figure 6.15 The chi-squared pdf for 1, 2, 3, and 5 df]

Without software, it is difficult to integrate a chi-squared pdf, so Table A.5 in the appendix gives critical values for chi-squared distributions. For example, the second row of the table is for 2 df, and under the heading .01 the value 9.210 indicates that P(χ²₂ > 9.210) = .01. We will use the notation χ²_{.01,2} = 9.210. In general, χ²_{α,ν} = c means that P(χ²_ν > c) = α. Instructions for chi-squared computations using R appear at the end of this section.

Example 6.12 The article "Reliability analysis of LED-based electronic devices" (Proc. Engr. 2012: 260–269) uses chi-squared distributions to model the lifecycle, in thousands of hours, of certain LED lamps. In one particular setting, the authors suggest a parameter value of ν = 8 df. Let X represent this χ² rv. The mean and standard deviation of X are E(X) = ν = 8 thousand hours and SD(X) = √(2ν) = √16 = 4 thousand hours. We can use the gamma cdf, as illustrated in Chapter 4, to determine probabilities concerning X, because the χ²₈ distribution is the same as the gamma distribution with α = 8/2 = 4 and β = 2. For instance, the probability that an LED lamp of this type has a lifecycle between 6 and 10 thousand hours is

$$P(6 \le X \le 10) = G(10/2;\,4) - G(6/2;\,4) = G(5;\,4) - G(3;\,4) = .735 - .353 = .382$$

Next, what values define the "middle 95%" of lifecycle values for these LED lamps? We desire the .025 and .975 quantiles of the χ²₈ distribution; from Appendix Table A.5, they are

$$\chi^2_{.975,8} = 2.180 \qquad\text{and}\qquad \chi^2_{.025,8} = 17.534$$

That is, the middle 95% of lifecycle values ranges from 2.180 to 17.534 thousand hours.

Given the definition of the chi-squared distribution, the following properties should come as no surprise. Proofs of both statements rely on moment generating functions (Exercises 32 and 33).

PROPOSITION
1. If X₃ = X₁ + X₂, where X₁ and X₂ are independent, X₁ ~ χ²_{ν₁}, and X₂ ~ χ²_{ν₂}, then X₃ ~ χ²_{ν₁+ν₂}.
2. If X₃ = X₁ + X₂, where X₁ and X₂ are independent, X₁ ~ χ²_{ν₁}, and X₃ ~ χ²_{ν₃} with ν₃ > ν₁, then X₂ ~ χ²_{ν₃−ν₁}.

Statement 1 says that the chi-squared distribution is an additive distribution; we saw in Chapter 5 that the normal and Poisson distributions are also additive. Statement 2 establishes a "subtractive" property of chi-squared, which will be critical in the next section for establishing the sampling distribution of the sample variance S².
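The table-based calculations in Example 6.12 are easy to verify with software. A short R check follows; the probabilities and quantiles should reproduce the values read from Table A.5, up to rounding.

# Example 6.12: lifecycle X ~ chi-squared with 8 df
pchisq(10, df = 8) - pchisq(6, df = 8)    # P(6 <= X <= 10), about .382

# Quantiles bounding the "middle 95%" of lifecycle values
qchisq(c(.025, .975), df = 8)             # about 2.180 and 17.534

# Critical-value notation: chi-squared_{.01,2} = 9.210
qchisq(.01, df = 2, lower.tail = FALSE)   # 9.210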
The t Distribution

DEFINITION Let Z be a standard normal rv and let Y be a χ²_ν rv independent of Z. Then the t distribution with ν degrees of freedom (df) is defined to be the distribution of the ratio

$$T = \frac{Z}{\sqrt{Y/\nu}}$$

We will sometimes abbreviate this distribution by T ~ t_ν.

With some careful calculus, we can obtain the t pdf.

PROPOSITION The pdf of a random variable T having a t distribution with ν degrees of freedom is

$$f(t) = \frac{1}{\sqrt{\pi\nu}}\,\frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)}\,\frac{1}{(1+t^2/\nu)^{(\nu+1)/2}} \qquad -\infty < t < \infty$$

Proof A t_ν variable is defined in terms of a standard normal Z and a χ²_ν variable Y. They are independent, so their joint pdf f(y, z) is the product of their individual pdfs. We first find the cdf of T and then differentiate to obtain the pdf:

$$F(t) = P(T \le t) = P\left(\frac{Z}{\sqrt{Y/\nu}} \le t\right) = P\left(Z \le t\sqrt{\frac{Y}{\nu}}\right) = \int_0^\infty \int_{-\infty}^{t\sqrt{y/\nu}} f(y,z)\,dz\,dy$$

Differentiating with respect to t using the Fundamental Theorem of Calculus,

$$f(t) = \frac{d}{dt}F(t) = \int_0^\infty \frac{\partial}{\partial t}\left[\int_{-\infty}^{t\sqrt{y/\nu}} f(y,z)\,dz\right]dy = \int_0^\infty \sqrt{\frac{y}{\nu}}\;f\!\left(y,\,t\sqrt{\frac{y}{\nu}}\right)dy$$

Now substitute the joint pdf, that is, the product of the marginal pdfs of Y and Z, and integrate:

$$f(t) = \int_0^\infty \frac{y^{\nu/2-1}e^{-y/2}}{2^{\nu/2}\,\Gamma(\nu/2)}\cdot\frac{1}{\sqrt{2\pi}}\,e^{-[t\sqrt{y/\nu}]^2/2}\,\sqrt{\frac{y}{\nu}}\;dy = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)\sqrt{2\pi\nu}}\int_0^\infty y^{(\nu+1)/2-1}\,e^{-[1/2+t^2/(2\nu)]y}\,dy$$

The integral can be evaluated using Expression (4.5) from the section on the gamma distribution:

$$f(t) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)\sqrt{2\pi\nu}}\cdot\frac{\Gamma((\nu+1)/2)}{[1/2+t^2/(2\nu)]^{\nu/2+1/2}} = \frac{\Gamma((\nu+1)/2)}{\sqrt{\pi\nu}\,\Gamma(\nu/2)}\cdot\frac{1}{(1+t^2/\nu)^{(\nu+1)/2}}, \qquad -\infty < t < \infty$$

The pdf has a maximum at 0 and decreases symmetrically as |t| increases. As ν becomes large, the t pdf approaches the standard normal pdf, as shown in Exercise 36. It makes sense that the t distribution would be close to the standard normal for large ν, because T = Z/√(χ²_ν/ν) and χ²_ν/ν converges to 1 by the Law of Large Numbers, as shown in Exercise 30.

Figure 6.16 shows t density curves for ν = 1, 5, and 20 along with the standard normal (z) curve. Notice how fat the tails are for 1 df, as compared to the standard normal. However, as the number of df increases, the t pdf becomes more like the standard normal; for 20 df there is not much difference.

[Figure 6.16 Comparison of t curves to the z curve]

Integration of the t pdf is difficult without software, so values of upper-tail areas are given in Table A.7. For example, the value in the column labeled 2 and the row labeled 3.0 is .048, meaning that P(T > 3.0) = .048 when T ~ t₂. We write this as t_{.048,2} = 3.0. In general, we write t_{α,ν} = c if P(T > c) = α when T ~ t_ν. A tabulation of these t critical values (i.e., t_{α,ν}) for frequently used tail areas α appears in Table A.6.
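Software reproduces these tabled tail areas and critical values directly. A brief R sketch, echoing the Table A.7 entry just cited:

# Upper-tail area: P(T > 3.0) when T ~ t with 2 df
pt(3.0, df = 2, lower.tail = FALSE)    # about .048

# Critical value t_{.048,2}: cutoff capturing upper-tail area .048
qt(.048, df = 2, lower.tail = FALSE)   # about 3.0

# For large df the t curve is close to the z curve
pt(1.96, df = 20, lower.tail = FALSE)  # compare with the normal tail area
pnorm(1.96, lower.tail = FALSE)        # .025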
Using Γ(1/2) = √π, we obtain the pdf for the t distribution with 1 df as f(t) = 1/[π(1 + t²)], which is also known as the Cauchy distribution. This distribution has such heavy tails that its mean does not exist (Exercise 37).

The mean and variance of a t variable can be obtained directly from the pdf, but it is instructive to derive them through the definition in terms of independent standard normal and chi-squared variables, T = Z/√(Y/ν). Recall from Section 5.2 that E(UV) = E(U)E(V) if U and V are independent and the expectations of U and V both exist. Thus,

$$E(T) = E(Z)\,E\left(1\big/\sqrt{Y/\nu}\right) = E(Z)\cdot\nu^{1/2}\,E(Y^{-1/2})$$

Of course E(Z) = 0, so E(T) = 0 if E(Y^{−1/2}) exists. Let's compute E(Y^k) for any k if Y is chi-squared, using Expression (4.5):

$$E(Y^k) = \int_0^\infty y^k\,\frac{y^{(\nu/2)-1}e^{-y/2}}{2^{\nu/2}\,\Gamma(\nu/2)}\,dy = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\int_0^\infty y^{(k+\nu/2)-1}e^{-y/2}\,dy = \frac{2^{k+\nu/2}\,\Gamma(k+\nu/2)}{2^{\nu/2}\,\Gamma(\nu/2)} = \frac{2^k\,\Gamma(k+\nu/2)}{\Gamma(\nu/2)} \quad\text{for } k+\nu/2 > 0 \qquad (6.7)$$

If k + ν/2 ≤ 0, the integral does not converge and E(Y^k) does not exist. When k = −1/2, we require ν > 1 for the integral to converge. Thus the mean of a t variable fails to exist if ν = 1, and the mean is indeed 0 otherwise.

For the variance of T we need E(T²) = E(Z²)·E[1/(Y/ν)] = 1·ν·E(Y⁻¹). Using k = −1 in Expression (6.7), we obtain, with the help of the property Γ(a + 1) = aΓ(a),

$$E(Y^{-1}) = \frac{2^{-1}\,\Gamma(-1+\nu/2)}{\Gamma(\nu/2)} = \frac{2^{-1}}{\nu/2-1} = \frac{1}{\nu-2} \;\Rightarrow\; V(T) = \nu\cdot\frac{1}{\nu-2} = \frac{\nu}{\nu-2}$$

provided that −1 + ν/2 > 0, that is, ν > 2. For 1 or 2 df the variance of T does not exist. For ν > 2 the variance always exceeds 1, and for large df it is close to 1. This is appropriate because any t curve spreads out more than the z curve, but for large df the t curve approaches the z curve.

The F Distribution

DEFINITION Let Y₁ and Y₂ be independent chi-squared random variables with ν₁ and ν₂ degrees of freedom, respectively. The F distribution with ν₁ numerator degrees of freedom and ν₂ denominator degrees of freedom is defined to be the distribution of the ratio

$$F = \frac{Y_1/\nu_1}{Y_2/\nu_2} \qquad (6.8)$$

This distribution will sometimes be denoted F_{ν₁,ν₂}. The pdf of a random variable having an F distribution is

$$f(x;\nu_1,\nu_2) = \frac{\Gamma[(\nu_1+\nu_2)/2]}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)}\left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}\frac{x^{\nu_1/2-1}}{(1+\nu_1 x/\nu_2)^{(\nu_1+\nu_2)/2}} \qquad x > 0$$

Its derivation (Exercise 40) is similar to the derivation of the t pdf. Figure 6.17 shows F density curves for ν₂ = 10 and several choices of ν₁.

[Figure 6.17 F density curves for ν₁ = 1, 2, 3, 5 and ν₂ = 10]

The mean of the F distribution can be obtained with the help of Equation (6.8): E(F) = ν₂/(ν₂ − 2) if ν₂ > 2, and the mean does not exist if ν₂ ≤ 2 (Exercise 41). What happens to F if the degrees of freedom are large? If ν₂ is large, then the denominator of Expression (6.8) will be close to 1 (see Exercise 30), so F will be approximately the numerator chi-squared variable divided by its degrees of freedom. If both ν₁ and ν₂ are large, then both the numerator and the denominator will be close to 1, and the F ratio therefore will be close to 1.

Except for a few special choices of degrees of freedom, integration of the F pdf is difficult without software, so F critical values (values that capture specified F distribution tail areas) are given in Table A.8. For example, the value in the column labeled 1 and the row labeled 2 and .100 is 8.53, meaning that P(F₁,₂ > 8.53) = .100. We can express this as F_{.1,1,2} = 8.53, where F_{α,ν₁,ν₂} = c means that P(F_{ν₁,ν₂} > c) = α. That same table can also be used to determine some lower-tail areas: since 1/F = (Y₂/ν₂)/(Y₁/ν₁), the reciprocal of an F variable also has an F distribution, but with the degrees of freedom reversed, and this can be used to obtain lower-tail F critical values via F_{1−α,ν₁,ν₂} = 1/F_{α,ν₂,ν₁}.
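A short R sketch of these F computations, including the reciprocal relation for lower-tail critical values; the df choices simply echo the F_{.1,1,2} example above.

# Upper-tail F critical value F_{.1,1,2}
qf(.10, df1 = 1, df2 = 2, lower.tail = FALSE)      # about 8.53

# Lower-tail critical value via the reciprocal property:
# F_{.90,1,2} = 1 / F_{.10,2,1}
qf(.90, df1 = 1, df2 = 2, lower.tail = FALSE)      # direct computation
1 / qf(.10, df1 = 2, df2 = 1, lower.tail = FALSE)  # same value

# Mean of an F_{5,10} variable: nu2/(nu2 - 2) = 10/8 = 1.25
mean(rf(100000, df1 = 5, df2 = 10))                # approximately 1.25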