Lec07_Statistical Inferences using t-Tools & Log-Transformation PDF

Summary

This document is lecture notes on advanced statistical inferences, including t-tests and log transformation. The notes are for a course on advanced statistics for data science.

Full Transcript

Naveen Jindal School of Management, The University of Texas at Dallas
BUAN/OPRE 6359 Advanced Statistics for Data Science
Making Inferences using t-Tools & Log-Transformation
Rasoul Ramezani

Lecture Outline

– One-sample t-test
– Paired t-test
– Two-sample t-test
  – Pooled t-test
  – Welch t-test
– Log-transformation

Introduction

A random sample of size $n$ is drawn from a population with mean $\mu$ and standard deviation $\sigma$. If we were to repeat the process and create every possible sample, we would obtain the sampling distribution of the sample mean, such that:

$\mu_{\bar{Y}} = \mu$
$SE(\bar{Y}) = \sigma / \sqrt{n}$

Under the CLT, $\bar{Y}$ is approximately normally distributed.

If $\sigma$ is unknown, then $SE(\bar{Y}) = s / \sqrt{n}$ with $\nu = n - 1$ degrees of freedom, where:
– $s$ = sample standard deviation
– $\nu$ = number of independent values used to estimate $SE(\bar{Y})$

t-ratio Instead of Z-ratio

When the population standard deviation, $\sigma$, is known, we use the Z-ratio to make inferences about the population mean:

$Z = \dfrac{\bar{Y} - \mu}{\sigma / \sqrt{n}}$

If $\sigma$ is unknown, we replace it in $Z$ with the sample standard deviation, $s$, and call the ratio the t-ratio:

$t = \dfrac{\bar{Y} - \mu}{s / \sqrt{n}}$

– This ratio has a t distribution with $\nu = n - 1$ degrees of freedom.

Much like the standard normal distribution, the t distribution is bell-shaped and symmetric about its mean of zero. As $\nu$ increases, the t distribution approaches the Z distribution.

One-Sample t-Test (Hypothesis Test for $\mu$)

1. Identify/calculate $n$, $\bar{Y}$, $s$, and $\alpha$:
   – $n$ = sample size
   – $\bar{Y}$ = sample mean
   – $s$ = sample standard deviation
   – $\alpha$ = the level of significance (unless otherwise specified, $\alpha = 0.05$)
2. Define $H_0$ and $H_1$ and determine the type of the test:
   $H_0: \mu = \mu_0$ vs.
   $H_1: \mu \ne \mu_0$ ----> Two-tail
   or $H_1: \mu > \mu_0$ ----> Right-tail
   or $H_1: \mu < \mu_0$ ----> Left-tail
3. Calculate the standard error of $\bar{Y}$: $se = s / \sqrt{n}$
4. Identify $\mu_0$ from $H_0$ and calculate the test statistic: t-ratio $= (\bar{Y} - \mu_0) / se$
5. Calculate the degrees of freedom: $\nu = n - 1$
6. Calculate the p-value:
   – Two-tail: p-value = 2*pt(abs(t-ratio), ν, lower.tail=F)
   – Right-tail: p-value = pt(t-ratio, ν, lower.tail=F)
   – Left-tail: p-value = pt(t-ratio, ν)
7. Reject $H_0$ and conclude that $H_1$ is true if p-value ≤ $\alpha$. Otherwise, do not reject $H_0$; the data do not provide sufficient evidence for $H_1$.

Confidence Interval (CI) for $\mu$

We have the below information:
– Sample size: $n$
– Sample mean: $\bar{Y}$
– Sample standard deviation: $s$

Steps for obtaining a $100(1 - \alpha)\%$ CI for $\mu$:
1. Identify $\alpha = 1 -$ Confidence Level
2. Find the degrees of freedom: $\nu = n - 1$
3. Using R, calculate the critical value: $t_c$ = qt(1 − α/2, ν)
4. Calculate the standard error (se): $se = s / \sqrt{n}$
5. Calculate the margin of error (me): $me = t_c \cdot se$
6. Construct the confidence interval for $\mu$: $\bar{Y} \pm me$

One-Sample t-Test for $\mu$: Example

ABI Insurance company used the mean life expectancy of all policyholders in the last year (77 years) to determine this year's life insurance premium.
To ensure they used the correct measure, they randomly sampled 20 of their recent customers:

Age: 86 75 83 84 81 77 78 79 79 81 76 85 70 76 79 81 73 74 72 83

Conduct the below hypothesis test at a 5% significance level:
$H_0: \mu = 77$
$H_1: \mu \ne 77$

Example; Answer

Hypothesis Testing
1. $n = 20$, $\bar{Y} = 78.6$, $s = 4.4768$, $\alpha = .05$
2. $H_0: \mu = 77$ vs. $H_1: \mu \ne 77$ ---> (Two-tail Test)
3. $se = s / \sqrt{n} = 4.4768 / \sqrt{20} = 1.0011$
4. $\mu_0 = 77$ --> t-ratio $= (\bar{Y} - \mu_0) / se = (78.6 - 77) / 1.0011 = 1.5983$
5. $\nu = n - 1 = 19$
6. p-value = 2*pt(abs(1.5983), 19, lower.tail=F) = 0.1265
7. Since p-value > 0.05, $H_0$ is not rejected.
Command Approach: t.test(age, mu = μ0)

95% Confidence Interval
1. $\alpha = 1 - 0.95 = .05$
2. $\nu = n - 1 = 20 - 1 = 19$
3. $t_c$ = qt(1 − .05/2, 19) = 2.093
4. $se = s / \sqrt{n} = 4.4768 / \sqrt{20} = 1.0011$
5. $me = t_c \cdot se = 2.093 \times 1.0011 = 2.0952$
6. $\bar{Y} \pm me = 78.6 \pm 2.0952 = (76.5, 80.7)$
Command Approach: t.test(age)

See R codes for other options.

Paired t-test

A paired t-test can be used when we have two samples in which observations in one sample can be paired with observations in the other sample.

Examples:
– Thirty students' test scores before and after a particular module.
– Fifteen workers' productivity before and after a training program.
– Ten women's weight before and after a new diet drug.

Assumptions:
1. Two samples are drawn randomly.
2. Two samples are dependent.
3. Observations within each sample are independent.
4. The differences between paired observations ($Y_1 - Y_2$) are nearly normally distributed.
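Returning to the ABI Insurance one-sample example above, the manual steps for the hypothesis test and the 95% confidence interval can be sketched in R, the language the slides already use for pt, qt, and t.test. This is a minimal sketch rather than the course's own script: the variable names (age, mu0, t_ratio, and so on) are illustrative assumptions, and the data are the 20 ages listed in the example.

```r
# Ages of the 20 sampled customers, as listed in the example
age <- c(86, 75, 83, 84, 81, 77, 78, 79, 79, 81,
         76, 85, 70, 76, 79, 81, 73, 74, 72, 83)

mu0   <- 77     # hypothesized mean from H0
alpha <- 0.05   # significance level

# Manual hypothesis test (two-tail)
n       <- length(age)
ybar    <- mean(age)              # sample mean
s       <- sd(age)                # sample standard deviation
se      <- s / sqrt(n)            # standard error of the mean
t_ratio <- (ybar - mu0) / se      # test statistic
nu      <- n - 1                  # degrees of freedom
p_value <- 2 * pt(abs(t_ratio), nu, lower.tail = FALSE)

# Manual 95% confidence interval for mu
tc <- qt(1 - alpha / 2, nu)       # critical value
me <- tc * se                     # margin of error
ci <- c(ybar - me, ybar + me)

# Command approach: reports the same t, df, p-value, and CI
res <- t.test(age, mu = mu0)

print(c(t = t_ratio, df = nu, p = p_value))
print(ci)
print(res)
```

Comparing the manual numbers with the t.test output is a quick way to check that each step was carried out correctly.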
Paired t-Test; Example

Was the training module effective in increasing students' test scores?

Let: $Y_{pre}$ = pre-module test score, $Y_{post}$ = post-module test score

Define the mean difference: $\mu_d = \mu_{post} - \mu_{pre}$

We want to conduct the below hypothesis test:
$H_0: \mu_d = 0$
$H_1: \mu_d > 0$

If we reject $H_0$, then we conclude that the module, on average, increased the test score. Otherwise, there is not enough evidence that it did.

Steps for Paired t-Test

$H_0: \mu_d = 0$ vs. $H_1: \mu_d > 0$

1. Identify the number of observations ($n$) and $\alpha$.
2. For each observation, calculate $d = Y_{post} - Y_{pre}$
3. Calculate the mean of $d$: $\bar{d}$ = mean(d)
4. Calculate the SD of $d$: $s_d$ = sd(d)
5. Calculate the standard error of $\bar{d}$: $se = s_d / \sqrt{n}$
6. Calculate the test statistic: t-ratio $= \bar{d} / se$
7. Calculate the degrees of freedom: $\nu = n - 1$
8. Calculate the p-value (according to the type of test): p-value = pt(t-ratio, ν, lower.tail=F)
9. Reject $H_0$ if p-value ≤ $\alpha$.

Paired t-Test; Test Score

Did the training module increase students' test scores?
$H_0: \mu_d = 0$ vs. $H_1: \mu_d > 0$

Referring to the score.csv dataset and R codes, we have:
– $n = 25$ and $\alpha = .05$
– $\bar{d} = 5.88$
– $s_d = 9.1575$
– $se = 9.1575 / \sqrt{25} = 1.8315$
– t-ratio $= 5.88 / 1.8315 = 3.21$
– $\nu = 25 - 1 = 24$
– p-value = pt(3.21, 24, lower.tail=F) = 0.002

$H_0$ is rejected since p-value < 0.05. We conclude that the module improved the test score significantly.
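The paired test-score calculation can be reproduced from the reported summary statistics alone, since score.csv is not included here. The sketch below assumes $n = 25$, $\bar{d} = 5.88$, and $s_d = 9.1575$ as reported, so the t-ratio is $5.88 / 1.8315$; the variable names are illustrative, not taken from the course R codes. It also computes the two-sided 95% confidence interval for $\mu_d$.

```r
# Summary statistics reported for the score.csv example
n     <- 25
d_bar <- 5.88      # mean of d = post - pre
s_d   <- 9.1575    # standard deviation of d
alpha <- 0.05

# Right-tail paired t-test (H1: mu_d > 0)
se      <- s_d / sqrt(n)          # standard error of d_bar
t_ratio <- d_bar / se             # test statistic
nu      <- n - 1                  # degrees of freedom
p_value <- pt(t_ratio, nu, lower.tail = FALSE)

# Two-sided 95% confidence interval for mu_d
tc <- qt(1 - alpha / 2, nu)       # critical value
me <- tc * se                     # margin of error
ci <- d_bar + c(-1, 1) * me

print(c(se = se, t = t_ratio, df = nu, p = p_value))
print(ci)
```

With the raw data available, the same test statistic would come from t.test(df$post, df$pre, paired=T, alt="greater"), where df is assumed to be the data frame read from score.csv.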
Command Approach: t.test(df$post, df$pre, paired=T, alt="greater")

Paired t-Test, Confidence Interval

Above, we obtained $n$, $\bar{d}$, $s_d$, and $\nu$.

Steps for a $100(1 - \alpha)\%$ CI for $\mu_d$:
1. Identify the significance level: $\alpha = 1 -$ Confidence Level
2. Calculate $t_c$ = qt(1 − α/2, ν)
3. Calculate $se = s_d / \sqrt{n}$
4. Calculate $me = t_c \cdot se$
5. Construct the CI: $\bar{d} \pm me$

The 95% CI for $\mu_d$:
– $\alpha = .05$
– $t_c = 2.0639$
– $se = 1.8315$
– $me = 3.78$
– CI: $5.88 \pm 3.78 = (2.10, 9.66)$

Command Approach (two-sided interval): t.test(df$post, df$pre, paired=T)

With 95% confidence, the module increases the mean test score by between 2.10 and 9.66 points.

Two-Sample t-Test

We want to test if there is any difference between the means of two independent populations.

Let
$\mu_1$ = the average of population 1
$\mu_2$ = the average of population 2

We want to test:
$H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \ne \mu_2$
or, equivalently,
$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 \ne 0$

If we reject $H_0$, we conclude that the two population means are different. Otherwise, there is not enough evidence that they differ.

t-Test for Two-sample Inference

We draw conclusions about the difference between two independent populations' means from the difference in the sample averages.

What if $\sigma_1$ and $\sigma_2$ are unknown? Two tests:
– Pooled t-test (if $\sigma_1 = \sigma_2$, statistically)
– Welch t-test (if $\sigma_1 \ne \sigma_2$, statistically)

Pooled t-Test

Hypothesis Testing: $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \ne \mu_2$

1. Identify/calculate $n_1$, $n_2$, $\bar{Y}_1$, $\bar{Y}_2$, $s_1$, $s_2$.
2. State the hypotheses $H_0$ and $H_1$.
3. Pooled standard deviation ($S_p$): $S_p = \sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$
4. Standard error of ($\bar{Y}_1 - \bar{Y}_2$): $se = S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
5. Test statistic: t-ratio $= (\bar{Y}_1 - \bar{Y}_2) / se$
6. Degrees of freedom: $\nu = n_1 + n_2 - 2$
7. p-value = 2*pt(abs(t-ratio), ν, lower.tail=F)
8. If p-value ≤ $\alpha$, reject $H_0$ and conclude $\mu_1 \ne \mu_2$. Otherwise, do not reject $H_0$; there is not enough evidence that the means differ.

$100(1 - \alpha)\%$ CI for ($\mu_1 - \mu_2$):
– $\alpha = 1 -$ Confidence Level
– Critical value (in R): $t_c$ = qt(1 − α/2, ν)
– Margin of error: $me = t_c \cdot se$
– Confidence interval: $(\bar{Y}_1 - \bar{Y}_2) \pm me$

Pooled t-Test; Example

Is there any difference in the number of books purchased by females and males? To answer this question, a researcher took a random sample of customers in a retail bookstore and asked each about the number of books they purchased during the last 12 months. The following data were recorded.

Female (1): 5 8 11 3 7 5 9 13 15 6
Male (2): 9 7 9 3 6 5 3

Let
$\mu_1$ = average number of books purchased by females
$\mu_2$ = average number of books purchased by males

Conduct a pooled t-test at a 5% level of significance to test $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \ne \mu_2$.
Construct a 95% confidence interval for $\mu_1 - \mu_2$.

Answer (Hypothesis Testing)

1. Referring to R codes, we have:
   $n_1 = 10$, $n_2 = 7$
   $\bar{Y}_1 = 8.2$, $\bar{Y}_2 = 6$
   $s_1 = 3.8239$, $s_2 = 2.5166$
2. $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \ne \mu_2$
3. $S_p = \sqrt{\dfrac{(10 - 1)(3.8239)^2 + (7 - 1)(2.5166)^2}{10 + 7 - 2}} = 3.3625$
4. $se = 3.3625 \sqrt{\dfrac{1}{10} + \dfrac{1}{7}} = 1.6571$
5. t-ratio $= (8.2 - 6) / 1.6571 = 1.3276$
6. $\nu = 10 + 7 - 2 = 15$
7. p-value = 2*pt(abs(1.3276), 15, lower.tail=F) = 0.2041
8. Since p-value > 0.05, $H_0$ cannot be rejected. Thus, on average, there is no significant difference between the number of books purchased by females and males.
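The pooled t-test steps above can be sketched in R as follows. This is an illustrative sketch, not the course's own script: the vector names female and male are assumptions, and the second female value is taken as 8, which is the value that reproduces the reported $\bar{Y}_1 = 8.2$ and $s_1 = 3.8239$. The last lines also compute the 95% confidence interval for $\mu_1 - \mu_2$.

```r
# Books purchased in the last 12 months
# (second female value taken as 8 so that mean = 8.2 and sd = 3.8239)
female <- c(5, 8, 11, 3, 7, 5, 9, 13, 15, 6)
male   <- c(9, 7, 9, 3, 6, 5, 3)
alpha  <- 0.05

n1 <- length(female); n2 <- length(male)
y1 <- mean(female);   y2 <- mean(male)
s1 <- sd(female);     s2 <- sd(male)

# Pooled standard deviation and standard error of (y1 - y2)
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
se <- sp * sqrt(1 / n1 + 1 / n2)

# Two-tail test
t_ratio <- (y1 - y2) / se
nu      <- n1 + n2 - 2
p_value <- 2 * pt(abs(t_ratio), nu, lower.tail = FALSE)

# 95% confidence interval for mu1 - mu2
tc <- qt(1 - alpha / 2, nu)
me <- tc * se
ci <- (y1 - y2) + c(-1, 1) * me

# Command approach: var.equal = TRUE requests the pooled test
res <- t.test(female, male, var.equal = TRUE)

print(c(sp = sp, se = se, t = t_ratio, df = nu, p = p_value))
print(ci)
```

Because the interval contains zero, it agrees with the test's failure to reject $H_0$ at the 5% level.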
Answer (Confidence Interval)

We have:
$n_1 = 10$, $n_2 = 7$
$\bar{Y}_1 = 8.2$, $\bar{Y}_2 = 6$
$s_1 = 3.8239$, $s_2 = 2.5166$

Steps:
1. $\alpha = 1 - .95 = 0.05$
2. $\nu = 15$ and $se = 1.6571$ (from the previous slide)
3. $t_c$ = qt(1 − .05/2, 15) = 2.1315
4. $me = 2.1315 \times 1.6571 = 3.5320$
5. 95% CI for ($\mu_1 - \mu_2$) $= (8.2 - 6) \pm 3.5320 = (-1.33, 5.73)$

Welch t-Test

If $\sigma_1 \ne \sigma_2$ (statistically), we conduct a Welch t-test to compare two population means.

The only differences between the Welch and Pooled tests are:
1. The standard error of the samples' mean difference:
   $se_w = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
2. The degrees of freedom:
   $\nu_w = \dfrac{se_w^4}{\dfrac{(s_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(s_2^2 / n_2)^2}{n_2 - 1}}$

Note: in manual calculations, use $\nu_w = \min(n_1 - 1, n_2 - 1)$.

Rule of Thumb: If $(s_1 / s_2)^2 \ge 2$, use the Welch t-test.
– Note: The assumption is that $s_1 > s_2$.

For manual computations, repeat the steps discussed under the pooled t-test, replacing $se$ with $se_w$ and $\nu$ with $\nu_w$.

Command Approach: t.test(Y1, Y2, var.equal = F)

Log Transformation

Skewed datasets can be transformed to become symmetric. The most common choice is a (natural) log transformation: $Z = \ln(Y)$

When to use a log transformation?
– In one-sample tests, if data are skewed to the right (i.e., positively skewed).
– In two-sample tests, if the spread (i.e., standard deviation) is higher in the group with the larger center (i.e., median).

Ideal result after log transformation:
– Two symmetric samples with similar spreads but possibly different centers.
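The Welch standard error and degrees of freedom defined above can be checked against R's built-in test. The sketch below reuses the book-purchase data from the pooled example purely for illustration; the vector names are assumptions, and the second female value is taken as 8 to match the reported summary statistics. It computes $se_w$ and $\nu_w$ by hand and compares them with t.test(..., var.equal = FALSE).

```r
# Same illustrative data as the pooled example
female <- c(5, 8, 11, 3, 7, 5, 9, 13, 15, 6)
male   <- c(9, 7, 9, 3, 6, 5, 3)

n1 <- length(female); n2 <- length(male)
v1 <- var(female) / n1        # s1^2 / n1
v2 <- var(male) / n2          # s2^2 / n2

# Rule-of-thumb ratio (larger sd in the numerator)
sd_ratio_sq <- (max(sd(female), sd(male)) / min(sd(female), sd(male)))^2

# Welch standard error and degrees of freedom
se_w <- sqrt(v1 + v2)
nu_w <- se_w^4 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Conservative choice suggested for manual calculations
nu_manual <- min(n1 - 1, n2 - 1)

# Two-tail Welch test
t_w <- (mean(female) - mean(male)) / se_w
p_w <- 2 * pt(abs(t_w), nu_w, lower.tail = FALSE)

# Command approach
res <- t.test(female, male, var.equal = FALSE)

print(c(ratio_sq = sd_ratio_sq, se_w = se_w, nu_w = nu_w,
        nu_manual = nu_manual, t = t_w, p = p_w))
```

The computed $\nu_w$ is generally not an integer and lies between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$, which is why the smaller value is a conservative stand-in for manual work.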
Logged Data: Observational Studies

Let $Y$ = starting salary, with $m$ = male and $f$ = female.

Log-transform salary:
$Z_m = \ln(Y_m)$
$Z_f = \ln(Y_f)$

The t-tools provide inferences about the difference in mean log salary: $\overline{\ln(Y_m)} - \overline{\ln(Y_f)}$

Interpretation problem: $\overline{\ln(Y)} \ne \ln(\bar{Y})$. So, taking the antilog of $\overline{\ln(Y_m)} - \overline{\ln(Y_f)}$ won't give an estimate of $\bar{Y}_m / \bar{Y}_f$.

Two points:
1. If the log-transformed data are symmetric: Mean[ln(Y)] = Median[ln(Y)]
2. The log preserves ordering: Median[ln(Y)] = ln[Median(Y)]

Combining (1) and (2), we have: $\overline{\ln(Y)}$ estimates $\ln[\mathrm{Med}(Y)]$

Hence: $\bar{Z}_m - \bar{Z}_f$ estimates $\ln\left[\dfrac{\mathrm{Med}(Y_m)}{\mathrm{Med}(Y_f)}\right]$

Taking antilog: $\exp(\bar{Z}_m - \bar{Z}_f)$ estimates $\dfrac{\mathrm{Med}(Y_m)}{\mathrm{Med}(Y_f)}$

Interpretation: The median of male-population starting salary is $\exp(\bar{Z}_m - \bar{Z}_f)$ times as large as the median of female-population starting salary.

Example – Salary Discrimination

Referring to R codes, we have:
$\bar{Z}_m - \bar{Z}_f = \overline{\ln(Salary_m)} - \overline{\ln(Salary_f)} = 0.1469$

Hence, the estimated ratio of medians is:
$\dfrac{\mathrm{Med}(Y_m)}{\mathrm{Med}(Y_f)} = e^{(\bar{Z}_m - \bar{Z}_f)} = e^{0.1469} = 1.1583$

The median salary for males is estimated to be 15.83% more than the median salary for females.

The 95% CI for ($\bar{Z}_m - \bar{Z}_f$) = (0.0996, 0.1942)

Taking the antilog, the 95% CI for the median salary ratio is:
$(e^{0.0996}, e^{0.1942}) = (1.10, 1.21)$

Interpretation: With 95% confidence, males' median salary is larger than females' median salary by 10% to 21%.
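The back-transformation in the salary example can be sketched in R as below. The log-scale estimate (0.1469) and interval (0.0996, 0.1942) are the values reported in the example; the short vector y is a hypothetical, made-up set of right-skewed values used only to illustrate why the antilog of a mean of logs relates to the median rather than the mean.

```r
# Reported log-scale results from the salary example
diff_log <- 0.1469                 # Zbar_m - Zbar_f
ci_log   <- c(0.0996, 0.1942)      # 95% CI on the log scale

# Back-transform to the ratio of medians
ratio    <- exp(diff_log)          # estimated Med(Y_m) / Med(Y_f)
ci_ratio <- exp(ci_log)            # 95% CI for the ratio of medians
pct      <- (ratio - 1) * 100      # percentage difference in medians

print(round(c(ratio = ratio, lower = ci_ratio[1], upper = ci_ratio[2]), 4))

# Hypothetical right-skewed values (NOT from the course data)
y <- c(30, 35, 42, 60, 120)

# The log preserves ordering, so the median commutes with the log
same_median <- isTRUE(all.equal(median(log(y)), log(median(y))))

# The mean does not commute with the log: mean(log(y)) < log(mean(y))
mean_gap <- log(mean(y)) - mean(log(y))

print(same_median)
print(mean_gap)
```

The positive gap between log(mean(y)) and mean(log(y)) is the interpretation problem noted above: exponentiating a difference of mean logs estimates a ratio of medians, not a ratio of means.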
