OPRE - Lec7 PDF
Document Details
University of Texas at Dallas
Shouqiang Wang
Summary
This document provides lecture notes on sampling distribution. It covers topics like descriptive statistics, inferential statistics, and the central limit theorem. The lecture is part of OPRE 6301, a statistics and data analysis course at the University of Texas at Dallas.
Full Transcript
Announcements

Register!
The grades for Mid-term Exam #1 will be posted after the class!
HW #5 is due Oct. 22.
HW #6 has been posted and is due on Oct. 29.

Sampling Distribution
Lecture 7
Shouqiang Wang
University of Texas at Dallas
OPRE 6301: Statistics and Data Analysis

[Course map figure: Statistics and Data Analysis, organized into Descriptive Statistics, Language to Describe Uncertainty, Inferential Statistics, and Predictive Statistics. Topics shown: Data Collection and Sampling; Graphical Exploratory Techniques; Numerical Exploratory Techniques; Probability & Random Variable; Discrete Distributions; Continuous Distributions; Sampling Distributions; Estimation; Hypothesis Testing; Inference on a Population; Inference on Two Populations; Inference on More Populations (ANOVA); Simple Linear Regression; Multiple Linear Regression.]

Outline
Sampling Distribution of the Mean
Sampling Distribution of a Proportion

Example: Throwing a die

X = the result of throwing a fair six-faced die

x      1    2    3    4    5    6
P(x)   1/6  1/6  1/6  1/6  1/6  1/6

If we throw the die infinitely many times,
☞ the population mean is $\mu_X = \sum_x x P(x) = 1 \cdot (1/6) + \cdots + 6 \cdot (1/6) = 3.5$
☞ the population variance is $\sigma_X^2 = \sum_x (x - \mu_X)^2 P(x) = (1 - 3.5)^2 \cdot (1/6) + \cdots + (6 - 3.5)^2 \cdot (1/6) = 2.92$
☞ the population standard deviation is $\sigma_X = \sqrt{\sigma_X^2} = \sqrt{2.92} = 1.71$

Sampling Distribution of Throwing Two Dice

If we throw 2 dice, each of the 36 possible samples $(X_1, X_2)$ occurs with probability 1/36.
☞ There are 11 different possible values for $\bar{X} = \frac{X_1 + X_2}{2}$: 1.0, 1.5, 2.0, ..., 5.5, 6.0

Sampling Distribution of the Mean $\bar{X}$ of Two Dice

The mean $\bar{X}$ is a random variable with
☞ Mean: $\mu_{\bar{X}} = \sum_{\bar{x}} \bar{x} P(\bar{x}) = 1.0 \cdot (1/36) + 1.5 \cdot (2/36) + \cdots + 6.0 \cdot (1/36) = 3.5 = \mu_X$
☞ Variance: $\sigma_{\bar{X}}^2 = \sum_{\bar{x}} (\bar{x} - \mu_{\bar{X}})^2 P(\bar{x}) = (1 - 3.5)^2 \cdot (1/36) + (1.5 - 3.5)^2 \cdot (2/36) + \cdots + (6 - 3.5)^2 \cdot (1/36) = 1.46 = \sigma_X^2 / 2$
☞ Standard deviation: $\sigma_{\bar{X}} = \sqrt{\sigma_{\bar{X}}^2} = \sqrt{1.46} = 1.21 = \sigma_X / \sqrt{2}$

Compare....
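As a quick check of the two-dice numbers above, the 36 equally likely samples can be enumerated directly. This is only a sketch using the Python standard library; the variable names are illustrative and not part of the lecture.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

faces = range(1, 7)

# Population: one fair die, each face with probability 1/6.
pop_mean = sum(Fraction(x, 6) for x in faces)
pop_var = sum((x - pop_mean) ** 2 * Fraction(1, 6) for x in faces)

# Sampling distribution of the mean of two dice:
# 36 equally likely samples (a, b), each giving the sample mean (a + b) / 2.
means = [Fraction(a + b, 2) for a, b in product(faces, repeat=2)]
xbar_mean = sum(means) / len(means)
xbar_var = sum((m - xbar_mean) ** 2 for m in means) / len(means)

print("population mean, variance:", float(pop_mean), round(float(pop_var), 2))
print("sample-mean mean, variance:", float(xbar_mean), round(float(xbar_var), 2))
print("standard error for n = 2:", round(sqrt(xbar_var), 2))
print("distinct values of the sample mean:", len(set(means)))
```

Exact fractions are used so that the relation between the two variances (the variance of the sample mean is half the population variance) holds without rounding error; the printed values should match the rounded figures on the slides.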
[Figure: histograms of the distribution of $X$ and the distribution of $\bar{X}$]

$\mu_{\bar{X}} = \mu_X$, $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{2}$, $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{2}}$

In general, if we throw n dice, we have a sample of n observations $X_1, X_2, \ldots, X_n$, each of which is a random variable.

The sample mean $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$ is a random variable with
☞ mean $\mu_{\bar{X}} = \mu_X$
☞ variance $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$
☞ standard deviation $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$, which is called the standard error of the mean

Properties of the distribution of the sample mean:
☞ It is always “centered” at the population mean
☞ It gets “tighter” as n, the sample size, increases
☞ It looks more like a bell-shaped curve as n increases

Theorem (Central Limit Theorem)

For any population with mean $\mu_X$ and variance $\sigma_X^2$, and for a large enough sample size n, the sample mean is approximately normally distributed:

$\bar{X} \sim \text{Normal}\left(\mu_{\bar{X}} = \mu_X,\ \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}\right)$

☞ The expected value of the sample mean is the population mean (no bias).
☞ The standard error (a.k.a. the standard deviation of the sample mean) equals the standard deviation of the population divided by the square root of n (greater precision as the sample size increases).
☞ As n increases, the sampling distribution approaches a normal distribution.

Note: If the population distribution is itself normal, the result holds regardless of sample size; otherwise we need a “large enough sample size,” by which we mean a sample size of roughly 30 or more.

The foreman of a bottling plant has observed that the amount of soda in each “32-ounce” bottle is actually a normally distributed random variable, with a mean of 32.2 ounces and a standard deviation of 0.3 ounce.

1. If a customer buys one bottle, what is the probability that the bottle will contain more than 32 ounces?

Solution
The amount in one bottle is $X \sim \text{Normal}(\mu = 32.2,\ \sigma = 0.3)$. Thus,

$P(X > 32) = P\left(Z = \frac{X - \mu}{\sigma} > \frac{32 - 32.2}{0.3}\right) = P(Z > -0.67) = 1 - P(Z \le -0.67) = 1 - 0.2514 = 0.7486.$

2. If a customer buys four bottles, what is the probability that the mean amount of the four bottles will be greater than 32 ounces?
Solution
The mean amount of four bottles is $\bar{X} \sim \text{Normal}\left(\mu_{\bar{X}} = 32.2,\ \sigma_{\bar{X}} = 0.3/\sqrt{4}\right)$. Thus,

$P(\bar{X} > 32) = P\left(Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} > \frac{32 - 32.2}{0.15}\right) = P(Z > -1.33) = 1 - P(Z \le -1.33) = 1 - 0.0918 = 0.9082.$

Sampling Distribution of Difference Between Two Means

If we repeatedly and independently draw samples from two populations with mean $\mu_i$ and standard deviation $\sigma_i$ (i = 1, 2), then as the sample sizes become large enough, the difference between the two sample means $\bar{X}_1 - \bar{X}_2$ approximately follows a normal distribution with
☞ mean $\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$
☞ standard deviation (a.k.a. standard error of the difference between two means) $\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

The starting salaries of graduates from the UTD master's program have a mean of $70,000 and a standard deviation of $15,000, while the starting salaries of graduates from the SMU master's program have a mean of $68,000 and a standard deviation of $16,000. If a random sample of 60 UTD graduates and a random sample of 50 SMU graduates are selected, what is the probability that the sample mean starting salary of the UTD graduates is higher than that of the SMU graduates?

Solution
Let $\bar{X}_1$ denote the UTD sample mean and $\bar{X}_2$ denote the SMU sample mean. Then $\bar{X}_1 - \bar{X}_2$ is approximately normally distributed with

mean $\mu_1 - \mu_2 = 70{,}000 - 68{,}000 = 2{,}000$,

standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{15{,}000^2}{60} + \frac{16{,}000^2}{50}} = 2978.25$.

Therefore,

$P(\bar{X}_1 - \bar{X}_2 > 0) = P\left(Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} > \frac{0 - 2000}{2978.25}\right) = P(Z > -0.6715) = 1 - P(Z \le -0.6715) = 1 - 0.2509 = 0.7491$

(Rounding to Z = -0.67 and using the table gives 1 - 0.2514 = 0.7486.)

Outline
Sampling Distribution of the Mean
Sampling Distribution of a Proportion

Recall: Binomial probabilities

If p represents the probability of success, (1 − p) represents the probability of failure, n represents the number of independent trials, and k represents the number of successes, then

$P(k \text{ successes in } n \text{ trials}) = \binom{n}{k} p^k (1 - p)^{n - k}$

We can use the binomial distribution to calculate the probability of k successes in n trials, as long as
1. the trials are independent
2. the number of trials, n, is fixed
3. each trial outcome can be classified as a success or a failure
4. the probability of success, p, is the same for each trial

A recent study found that “Facebook users get more than they give”. For example:
☞ 40% of Facebook users in our sample made a friend request, but 63% received at least one request
☞ Users in our sample pressed the “like” button next to friends’ content an average of 14 times, but had their content “liked” an average of 20 times
☞ Users sent 9 personal messages, but received 12
☞ 12% of users tagged a friend in a photo, but 35% were themselves tagged in a photo

How can this be possible? Any guess?
“Power users” who contribute much more content than the typical user.

http://www.pewinternet.org/Reports/2012/Facebook-users/Summary.aspx

This study also found that approximately 25% of Facebook users are considered “power users.” The same study found that the average Facebook user has 245 friends. What is the probability that the average Facebook user with 245 friends has 70 or more friends who would be considered “power users”?

The number of power users among one’s Facebook friends is $X \sim \text{Bin}(n = 245,\ p = 0.25)$.

Then, we are asked for the probability $P(X \ge 70)$.

$P(X \ge 70) = P(X = 70 \text{ or } X = 71 \text{ or } X = 72 \text{ or } \cdots \text{ or } X = 245) = P(X = 70) + P(X = 71) + P(X = 72) + \cdots + P(X = 245)$

This seems like an awful lot of work...

Histograms of binomial distribution

Fix p = 0.10 and let n = 10, 30, 100, and 300.

[Figure: four histograms of Bin(n, 0.10), one panel each for n = 10, n = 30, n = 100, and n = 300]

Try it yourself at https://shiny.rit.albany.edu/stat/binomial/.

Normal approximation to the binomial

Recall: Bin(n, p) has $\mu = np$ and $\sigma = \sqrt{np(1 - p)}$.

When the sample size n is large enough,

$\text{Bin}(n, p) \approx N\left(\mu = np,\ \sigma = \sqrt{np(1 - p)}\right)$

In the case of the Facebook power users, n = 245 and p = 0.25.
$\mu = 245 \times 0.25 = 61.25$
$\sigma = \sqrt{245 \times 0.25 \times 0.75} = 6.78$
$\text{Bin}(n = 245,\ p = 0.25) \approx N(\mu = 61.25,\ \sigma = 6.78)$

[Figure: probabilities of Bin(245, 0.25) for k from 20 to 100, with the approximating normal curve overlaid]

Correction for the normal approximation

$P(X \ge 70) = P(X = 70) + P(X = 71) + P(X = 72) + \cdots + P(X = 245) \approx P(X > 70 - 0.5) = P(X > 69.5)$, where the last two probabilities are computed under the normal approximation.

We sometimes apply a 0.5 correction in order to account for the probability of exactly 70 “successes,” but when n is large, that correction can be omitted!

Question
What is the probability that the average Facebook user with 245 friends has 70 or more friends who would be considered power users?
(a) 0.0984
(b) 0.1112
(c) 0.8888
(d) 0.9016

$Z = \frac{X - \mu}{\sigma} = \frac{69.5 - 61.25}{6.78} = 1.22$

$P(Z > 1.22) = 1 - 0.8888 = 0.1112$

Second decimal place of Z
Z      0.00    0.01    0.02    0.03    0.04
1.0    0.8413  0.8438  0.8461  0.8485  0.8508
1.1    0.8643  0.8665  0.8686  0.8708  0.8729
1.2    0.8849  0.8869  0.8888  0.8907  0.8925

[Figure: normal curve with 61.25 and 69.5 marked on the horizontal axis]

How large is large enough?

The sample size is considered large enough if the expected numbers of successes and failures are both at least 5:

$np \ge 5$ and $n(1 - p) \ge 5$

Question
Below are four pairs of binomial distribution parameters. Which distribution can be approximated by the normal distribution?
(a) n = 100, p = 0.8
(b) n = 25, p = 0.6
(c) n = 50, p = 0.95
(d) n = 300, p = 0.015

Note: Some textbooks use a threshold of 10.

Approximate Sampling Distribution of a Sample Proportion

Knowing $X \sim \text{Bin}(n, p) \approx N\left(\mu = np,\ \sigma = \sqrt{np(1 - p)}\right)$, what is the distribution of the sample proportion $\hat{P} = \frac{X}{n}$?

Sampling Distribution of a Sample Proportion

Provided that both $np \ge 5$ and $n(1 - p) \ge 5$,

$\hat{P} \sim \text{Normal}\left(\mu_{\hat{P}} = p,\ \sigma_{\hat{P}} = \sqrt{\frac{p(1 - p)}{n}}\right)$

☞ The sample proportion $\hat{P}$ is approximately normally distributed.
☞ The mean of the sample proportion is the population proportion p (no bias).
☞ The standard error of the proportion (a.k.a. the standard deviation of the sample proportion) equals $\sqrt{p(1 - p)/n}$ (i.e., more precise as n increases).
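The power-user question above can also be checked numerically. The sketch below sums the exact binomial probabilities for P(X ≥ 70) and compares them with the normal approximation, with and without the 0.5 correction. It uses only the Python standard library (`math.comb` and `statistics.NormalDist`, Python 3.8 or later); the variable names are illustrative assumptions, not from the lecture.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 245, 0.25, 70

# Exact binomial tail: P(X >= 70) = P(X = 70) + P(X = 71) + ... + P(X = 245).
exact = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Normal approximation with mu = np and sigma = sqrt(np(1 - p)).
mu = n * p
sigma = sqrt(n * p * (1 - p))
approx = NormalDist(mu, sigma)

with_correction = 1 - approx.cdf(k - 0.5)  # P(X > 69.5) under the normal curve
without_correction = 1 - approx.cdf(k)     # P(X > 70) under the normal curve

print(f"mu = {mu}, sigma = {sigma:.2f}")
print(f"exact binomial tail:            {exact:.4f}")
print(f"normal, with 0.5 correction:    {with_correction:.4f}")
print(f"normal, without 0.5 correction: {without_correction:.4f}")
```

The corrected value should land close to the 0.1112 read from the table; small differences are expected because the table lookup rounds Z to 1.22, while the code uses the unrounded value.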
A cell phone assembly line has a daily output of 500 phones with a defective rate of 5%. If the quality department conducts daily audits of all the outputs, what is the probability that the defective rate will be no higher than 4% on a typical day?

Solution
The daily number of defective products follows Bin(500, 0.05). Therefore, the sample proportion of defectives is

$\hat{P} \sim \text{Normal}\left(\mu = 0.05,\ \sigma = \sqrt{\frac{0.05 \times 0.95}{500}} = 0.00975\right).$

As such,

$P(\hat{P} \le 0.04) = P\left(Z = \frac{\hat{P} - \mu}{\sigma} \le \frac{0.04 - 0.05}{0.00975}\right) = P(Z \le -1.026) = \text{NORM.S.DIST}(-1.026, \text{TRUE}) = 0.152.$
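The same calculation can be sketched in Python, with `statistics.NormalDist().cdf` playing the role of NORM.S.DIST(z, TRUE). The exact binomial sum is added here only as a cross-check and is not part of the lecture's solution; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, threshold = 500, 0.05, 0.04

# Normal approximation for the sample proportion.
se = sqrt(p * (1 - p) / n)         # standard error of the sample proportion
z = (threshold - p) / se
normal_prob = NormalDist().cdf(z)  # standard normal CDF evaluated at z

# Exact cross-check: a defective rate of at most 4% means at most 20 defectives out of 500.
max_defects = round(threshold * n)
exact_prob = sum(
    comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(max_defects + 1)
)

print(f"standard error = {se:.5f}, z = {z:.3f}")
print(f"normal approximation: {normal_prob:.3f}")
print(f"exact binomial:       {exact_prob:.3f}")
```

The exact binomial probability comes out somewhat higher than the normal figure, because the solution above applies no 0.5 correction to the count of defectives.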