Basic Statistical Data Analysis Using Microsoft Excel PDF
Document Details
Central Luzon State University
2023
Zeren Lucky L. Cabanayan
Tags
Summary
This document provides an introduction to basic statistical data analysis and how to conduct it using Microsoft Excel software. It is designed for undergraduate students at Central Luzon State University in the Philippines.
Full Transcript
Basic Statistical Data Analysis Using Microsoft Excel Zeren Lucky L. Cabanayan STAT2100 – Statistical Analysis with Software Application 1st Semester, 2023-2025 Basic Statistical Data Analysis Using Microsoft Excel | 1 CENTRAL LUZON STATE UN...
Basic Statistical Data Analysis Using Microsoft Excel Zeren Lucky L. Cabanayan STAT2100 – Statistical Analysis with Software Application 1st Semester, 2023-2025 Basic Statistical Data Analysis Using Microsoft Excel | 1 CENTRAL LUZON STATE UNIVERSITY DEPARTMENT of STATISTICS Learning Outcomes After completing this chapter, the students must be able to Assess the normality of data using Microsoft Excel Perform hypothesis testing using Microsoft Excel Perform correlation and simple linear regression using Microsoft Excel ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 2 DEPARTMENT of STATISTICS Normal Distribution also called as the Gaussian distribution It is a distribution related to continuous data from a statistical analysis standpoint. It is sometimes called the “bell curve,” since it is a bell-shaped and asymptotic about the x-axis Mean = Median = Mode Symmetric about the mean (50% of the data will fall below the mean and half of the data will fall above the mean) ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 3 DEPARTMENT of STATISTICS Normal Distribution Since a number of the most common statistical tests rely on the normality of a sample or population, it is often useful to test whether the underlying distribution is normal, or at least symmetric. This can be done via the following approaches: o Review the distribution graphically (via histograms, boxplots, QQ plots) o Analyze the skewness and kurtosis o Employ statistical tests (esp. Chi-square, Kolmogorov-Smirnov, Shapiro- Wilk, Jarque-Barre, D’Agostino-Pearson) Note: If data is not symmetric, sometimes it is useful to make a transformation whereby the transformed data is symmetric and so can be analyzed more easily. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 4 DEPARTMENT of STATISTICS Normal Distribution A normal or Gaussian distribution : ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 5 DEPARTMENT of STATISTICS Normal Distribution The following graphs can be used to test whether the data is normally distributed or not : Histogram Normal Q-Q plot Boxplot Note: the graphs are not as accurate as the formal test for normality. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 6 DEPARTMENT of STATISTICS 1. Histogram A histogram can be used to test whether data is normally distributed. This test simply consists of looking at the histogram and discerning whether it approximates the bell curve of a normal distribution. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 7 DEPARTMENT of STATISTICS 2. Q-Q plot A Q-Q plot is a short term for “quantile-quantile” plot often used to assess whether or not a set of data potentially came from some theoretical distribution. In most cases, this type of plot is used to determine whether or not a set of data follows a normal distribution. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 8 DEPARTMENT of STATISTICS 2. Q-Q plot an alternative graphical method of assessing normality to the histogram and is easier to use when there are small sample sizes. The scatter should lie as close to the line as possible with no obvious pattern coming away from the line for the data to be considered normally distributed. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 9 DEPARTMENT of STATISTICS 2. Q-Q plot Example: Using Q-Q plot determine whether the data set with 50 elements is normally distributed. -14 -11 18 37 36 -21 -14 50 43 40 -48 -1 -48 17 -23 18 -50 34 39 -41 0 38 31 -20 27 46 -18 29 -6 -24 -39 -14 -6 -47 45 -27 -37 6 -38 -29 18 49 6 22 45 -24 24 20 4 -43 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 10 DEPARTMENT of STATISTICS 2. Q-Q plot Step 1. Enter and sort the data Step 2. Find the rank of each data value =RANK(number, ref, [order]) (…) 0 – Descending (…) 1 - Ascending Note: When a formula contains an absolute reference, no matter which cell the formula occupies the cell reference does not change: if you copy or move the formula, it refers to the same cell as it did in its original location. In an absolute reference, each part of the reference (the letter that refers to the row and the number that refers to the column) is preceded by a “$” – for example, $A$1 is an absolute reference to cell A1. Wherever the formula is copied or moved, it always refers to cell A1. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 11 DEPARTMENT of STATISTICS 2. Q-Q plot Step 3. Find the cumulative = (𝑖 − 0.5)/𝑛 probability of getting less than or theoretical z-score of the z-score = (𝑖 − 0.5)/𝐶𝑂𝑈𝑁𝑇(𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒) of i. 𝑷𝒓𝒐𝒃. 𝒐𝒇 𝒊 = (𝑖 − 0.5)/𝑛 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 12 DEPARTMENT of STATISTICS 2. Q-Q plot = 𝑁𝑂𝑅𝑀. 𝑆. 𝐼𝑁𝑉(𝑃𝑟𝑜𝑏. 𝑜𝑓 𝑖) Step 4. Find the theoretical z-score based on Prob. of i z ~ 𝑁𝑜𝑟𝑚𝑎𝑙 (0,1) ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 13 DEPARTMENT of STATISTICS 2. Q-Q plot Step 5. Get the standardized = ( 𝑋 − 𝐴𝑉𝐸𝑅𝐴𝐺𝐸 𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒 / value of X 𝑆𝑇𝐷𝐸𝑉. 𝑆(𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒)) 𝑋−𝜇 𝑧= 𝜎 Where: 𝜇 = 𝑀𝑒𝑎𝑛 𝜎 = 𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 14 DEPARTMENT of STATISTICS 2. Q-Q plot Step 6. Plot the z-score and standardized value of X Insert ➝ scatter (z-score of i, standardized of X) then Add linear trend ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 15 DEPARTMENT of STATISTICS 2. Q-Q plot Step 6. Plot the z-score and standardized value of X ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 16 DEPARTMENT of STATISTICS 2. Q-Q plot Step 6. Plot the z-score and standardized value of X ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 17 DEPARTMENT of STATISTICS 2. Q-Q plot Step 6. Plot the z-score and standardized value of X Note: Click the line then change the format 2.50 For normally distributed data, 2.00 observations should lie approximately on a straight line. If the data is non-normal, 1.50 the points form a curve that deviates 1.00 markedly from a straight line. Possible 0.50 outliers are points at the ends of the line, AXIS TITLE 0.00 distanced from the bulk of the -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 -0.50 observations. -1.00 -1.50 In this case, the data was merely -2.00 approximately follows a normal -2.50 distribution AXIS TITLE ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 18 DEPARTMENT of STATISTICS 3. Box Plot Box plots can’t actually be used to test for normality, they can be useful for testing for symmetry, which often is a sufficient substitute for normality. The box plot shape will show if a statistical data set is normally distributed or skewed. If your data comes from a NORMAL DISTRIBUTION, the box will be symmetrical with the mean and median in the center. If the data meets the assumption of normality, there should also be few outliers. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 19 DEPARTMENT of STATISTICS 3. Box Plot Interquartile Range (IQR) whisker whisker Minimum/ Maximum/ Lower Fence Median Upper Fence 𝑸𝟏 𝑸𝟑 (𝑸𝟏 − 1.5 ∗ IQR) (25th Percentile) (75th Percentile) (𝑸𝟑 + 1.5 ∗ IQR) ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 20 DEPARTMENT of STATISTICS Skewness Skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. If skewness is 0, the data are perfectly symmetrical. If skewness is between -0.5 and 0.5, the distribution is approximately symmetric. If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed. If skewness is less than -1 or greater than 1, the distribution is highly skewed. In Microsoft Excel: =SKEW(number1, [number2],...) Moderately Approximately Moderately Skewed symmetric Skewed Highly Highly Skewed Skewed −𝟏 − 𝟏ൗ𝟐 𝟎 𝟏ൗ 𝟐 𝟏 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 21 DEPARTMENT of STATISTICS Skewness Perfectly Symmetric or Normal Distribution Positively skewed distribution Negatively skewed distribution or Skewed to the right or Skewed to the left ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 22 DEPARTMENT of STATISTICS Kurtosis Kurtosis tells you the height and sharpness of the central peak, relative to that of a standard bell curve. A perfect normal distribution has a kurtosis of 0 (mesokurtic distribution) Kurtosis value above 0 = Leptokurtic distribution (sharper peak and longer/flatter tails) Kurtosis value below 0 = Platykurtic distribution (rounder peak and shorter/thinner tails) In Microsoft Excel: KURT(number1, [number2],...) ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 23 DEPARTMENT of STATISTICS Kurtosis Platykurtic Distribution Normal Distribution or Leptokurtic Distribution Kurtosis < 0 Mesokurtic Distribution Kurtosis > 0 Kurtosis = 0 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 24 DEPARTMENT of STATISTICS Shapiro-Wilk W Test Tests for assessing if data is normally distributed using Shapiro-Wilk W Test developed by Shapiro and Wilk (1965) It is the ratio of two estimates of the variance of a normal distribution based on a random sample of n observations. This test for normality has been found to be the most powerful test in most situations. Hypothesis: Ho: The data is normally distributed. Ha: The data is not normally distributed. Decision Rule: Reject Ho if P-value < 𝛼 (level of significance) ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 25 DEPARTMENT of STATISTICS Shapiro-Wilk W Test Step 1. Arrange the data in ascending order. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 26 DEPARTMENT of STATISTICS Shapiro-Wilk W Test Step 2. Calculate 𝑆𝑆 as follows: 𝑆𝑆 = σ𝑛𝑖=1 𝑥𝑖 − 𝑥ҧ 2 Note: If 𝑛 is even, let 𝑚 = 𝑛/2, while if 𝑛 is odd let 𝑚 = (𝑛 − 1)/2 In excel use DEVSQ function to calculate 𝑆𝑆 = 𝐷𝐸𝑉𝑆𝑄(𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒) ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 27 DEPARTMENT of STATISTICS Shapiro-Wilk W Test Step 3. Calculate 𝑏 as follows a. taking the 𝑎𝑖 weights from the Shapiro-Wilk W Table 1. b. Corresponding to each of the 6 coefficients a1 ,… , a6 , calculate the values 𝑥12 − 𝑥1 , 𝑥11 − 𝑥2 ,…, 𝑥7 − 𝑥6 where 𝑥𝑖 is the ith data element in ascending order. Shapiro-Wilk W Table 1: Coefficients c. Get the product of the = 𝑆𝑈𝑀(𝐼5: 𝐼10) coefficients and difference values. d. Get the sum of the product of the coefficients and difference values. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 28 DEPARTMENT of STATISTICS Shapiro-Wilk W Test Step 4. Calculate the test statistic 𝑊 = 𝑏 2 /𝑆𝑆 = (𝑏^2)/𝑆𝑆 or = (𝑏 ∗ 𝑏)/𝑆𝑆 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 29 DEPARTMENT of STATISTICS Shapiro-Wilk W Test Step 5. Find the value in the Shapiro-Wilk W Table 2 that is closest to 𝑊, interpolating if necessary. This is the p-values for the test. The p-value lies between 0.50 and 0.90. The W value for 0.5 is 0.943 and the W value for 0.9 is 0.973. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 30 DEPARTMENT of STATISTICS Shapiro-Wilk W Test We need to interpolate since 𝑊 = 0.971 𝑦 lies between 0.50 and 0.90. The 𝑊 value for 0.5 𝑥1 is 0.943 𝑦1 and the 𝑊 value for 0.9 𝑥2 is 0.973 𝑦2. Linear Interpolation: 𝑦2 − 𝑦1 𝑦 − 𝑦1 = 𝑥 − 𝑥1 𝑥2 − 𝑥1 0.973 − 0.943 0.971 − 0.943 = 𝑥 − 0.5 0.9 − 0.5 𝑥 = 0.873 ⇒ 𝑃 − 𝑣𝑎𝑙𝑢𝑒 Therefore, the p-value is equivalent to 0.873 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 31 DEPARTMENT of STATISTICS Formal hypothesis testing for Shapiro-Wilk W Test a. Ho: The data is normally distributed. Ha: The data is not normally distributed. b. Test to use: Shapiro-Wilk W Test c. Decision Rule: P-value Method Reject Ho if P-value < 𝛼 = 0.05 d. P-value =0.873 e. Decision: Since 0.873 ≮ 0.05, failed to reject Ho f. Conclusion: We therefore conclude that the data is normally distributed. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 32 DEPARTMENT of STATISTICS What if the data is not normally distributed? If the data is not normally distributed, there are two options: Transform the dependent variable (repeating the normality checks on the transformed data): Common transformations include taking the log or square root of the dependent variable. Use a non-parametric test: Non-parametric tests are often called distribution free tests and can be used instead of their parametric equivalent. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 33 DEPARTMENT of STATISTICS Key to non-parametric tests: Parametric Test What to check for normality Non-parametric test Independent t-test Dependent variable by group Mann-Whitney test Paired t-test Paired differences Wilcoxon signed rank test Residuals/ dependent variable One-way ANOVA Kruskal-Wallis test by group Repeated measures ANOVA Residuals at each time point Friedman test Pearson’s correlation coefficient Both variables Spearman’s correlation Simple linear regression Residuals N/A Note: The residuals are the differences between the observed and expected values. Excel will not perform non-parametric tests even with the data analysis toolpak add in. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 34 DEPARTMENT of STATISTICS Hypothesis Testing One Population One Sample One Sample t-test Parameter Hypothesis Dependent Paired Sample t-test Samples Testing Independent Sample z-test Two Population Parameter (𝑖𝑓 𝜎1 & 𝜎2 𝑎𝑟𝑒 𝑘𝑛𝑜𝑤𝑛) Independent Independent Sample t-test Samples (𝑖𝑓 𝜎1 & 𝜎2 𝑎𝑟𝑒 𝑢𝑛𝑘𝑛𝑜𝑤𝑛 𝑎𝑛𝑑 𝜎1 = 𝜎2 ) Independent Sample t-test (𝑖𝑓 𝜎1 & 𝜎2 𝑎𝑟𝑒 𝑢𝑛𝑘𝑛𝑜𝑤𝑛 𝑎𝑛𝑑 𝜎1 ≠ 𝜎2 ) ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 35 DEPARTMENT of STATISTICS Components of Hypothesis Testing Test of Hypothesis Set of rules that lead to the acceptance or rejection of statistical method. Hypothesis - standard procedure for testing a claim about a property of a population based on sample data It is a claim or statement about a property of a population. Null Hypothesis (𝑯𝒐 ) Alternative Hypothesis (𝑯𝒂 𝒐𝒓𝑯𝟏 ) Tail of distribution hypothesis that differs from the null must contain the condition of equality hypothesis Two-tailed Equal (=) 𝜇 = 𝜇𝑜 Not equal to (≠) 𝜇 ≠ 𝜇𝑜 Right-tailed Less than or equal to (≤) 𝜇 ≤ 𝜇𝑜 Greater than (>) 𝜇 > 𝜇𝑜 Left-tailed Greater than or equal to (≥) 𝜇 ≥ 𝜇𝑜 Less than ( 𝜇𝑜 Acceptance Region Rejection Region It means that if the test statistic fall in this region, we failed to reject Ho It means that if the test statistic fall in this region, we reject Ho Critical Value ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 42 DEPARTMENT of STATISTICS 3. Two-tailed test: The CR is in the two extreme regions under the curve. Ha: 𝜇 ≠ 𝜇𝑜 Rejection Region Acceptance Region Rejection Region It means that if the test statistic fall in this region, It means that if the test It means that if the test we reject Ho statistic fall in this region, statistic fall in this region, we failed to reject Ho we reject Ho Critical Value Critical Value ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 43 DEPARTMENT of STATISTICS Guidelines in Hypothesis Testing: a. State the null (Ho) and alternative (Ha) hypothesis. Identify the claim. b. Determine the test statistic, critical value and tail of the distribution where the rejection region is located. c. Formulate the decision rule. d. Compute the value of the test statistic. e. Make a decision. ( whether to reject or to accept Ho) f. Draw the conclusion by answering the original claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 44 DEPARTMENT of STATISTICS Steps in testing a hypothesis: P-value Method Identify the State the test to use, Make a null and the tailed Formulate decision alternative of the the Determine Interpret (whether hypothesis. distribution decision the P-value the result to reject or Identify the and the rule not the Ho) claim level of significance Decision Rule If the p-value is smaller than the significance level, Ho is rejected. If it is larger than the significance level, Ho is not rejected (or failed to reject Ho ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 45 DEPARTMENT of STATISTICS P-value The probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true. INTERPRETING THE WEIGHT OF EVIDENCE AGAINST Ho If the p-value is less than (a) 0.10, we have some evidence that Ho is not true. (b) 0.05, we have strong evidence that Ho is not true. (c) 0.01, we have very strong evidence that Ho is not true. (d) 0.001, we have extremely evidence that Ho is not true. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 46 DEPARTMENT of STATISTICS It is important to summarize the results of a statistical study correctly. The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 47 DEPARTMENT of STATISTICS Symbol Definition Example greater than The mean weight is greater than 50kg > 𝜇 > 50kg more than X is more than 50kg < less than The mean weight is less than 50kg 𝜇 < 50kg The mean weight is greater than or greater than or equal to ≥ equal to 50kg 𝜇 ≥ 50kg at least The mean weight is at least 50kg The mean weight is less than or equal less than or equal to ≤ to 50kg 𝜇 ≤ 50kg at most The mean weight is at most 50kg = equal The mean weight is equal to 50kg 𝜇 = 50kg ≠ not equal The mean weight is not equal to 50kg 𝜇 ≠ 50kg ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 48 DEPARTMENT of STATISTICS 1. t-Test for One Sample Mean Used to compare the mean value of a sample with a constant value (𝜇0 ). Example 1: Test the hypothesis that the average sugar content of particular cookies is 10g, if the contents of a random sample of 10 cookies are: 10.3 9.7 10.1 10.3 10.1 9.8 9.9 10.5 10.3 9.9 Use 0.01 level of significance and assume that the distribution of contents is normal. Assume the data are normally distributed. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 49 DEPARTMENT of STATISTICS Step in performing t-Test for One Sample Mean Step 1: Create a dummy second variable and input at least 2 zeroes in it. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 50 DEPARTMENT of STATISTICS Step 2: Click Data → Data Analysis → t-Test Two-Sample Assuming Unequal Variances → OK ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 51 DEPARTMENT of STATISTICS Step 3: For variable 1 range, select the actual data. For variable 2 range, select the dummy data. For the hypothesized mean difference, input the test value from the hypothesis. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 52 DEPARTMENT of STATISTICS Step 4: For the output, delete the dummy column, change the title of the table delete the word mean in the hypothesized mean difference row. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 53 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇 = 10g ⟹ The average sugar content of particular cookies is 10g (claim) Ha : 𝜇 ≠ 10g ⟹ The average sugar content of particular cookies is not 10g) Step 2: Identify the test to use, the tailed of the distribution and the level of significance t-Test for One Sample Mean Since the alternative hypothesis is 𝜇 ≠ 10g, therefore the tail of the distribution is two-tailed test. Level of significance: 𝛼 = 0.01 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.01 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 54 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is two-tailed, therefore the P-value is 0.30. Step 5: Make a decision (whether to reject or not the Ho) Since P-value is not less than 𝛼 = 0.01 , failed to reject Ho Step 6: Interpret the results At 1% level of significance, there is no enough evidence to reject the claim that the average sugar content of particular cookies is 10g. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 55 DEPARTMENT of STATISTICS The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 56 DEPARTMENT of STATISTICS Example 2: Communication company conducted a study about the number of text messages sent by a teenager in a day. They claim that on the average, teenager sent at least 65 text messages per day. To update the estimates, a sample of 11 teenagers ask how many text messages they sent a day. Their response were: 51 175 47 49 44 54 145 203 21 42 100 At 0.05 level of significance, can you conclude that the mean number of text messages sent by teenagers per day is 65. Assume the data are normally distributed. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 57 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇 ≥ 65 ⟹ the average text messages per day is at least 65 (claim) Ha : 𝜇 < 65 ⟹ the average text messages per day is less than 65 Step 2: Identify the test to use, the tailed of the distribution and the level of significance t-Test for One Sample Mean Since the alternative hypothesis is 𝜇 < 65 text messages per day, therefore the tail of the distribution is one tailed to the right test. Level of significance: 𝛼 = 0.05 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.05 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 58 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is one-tailed, therefore the P-value is 0.16. Step 5: Make a decision (whether to reject or not the Ho) Since P-value is not less than 𝛼 = 0.05 , failed to reject Ho Step 6: Interpret the results At 5% level of significance, there is no enough evidence to reject the claim that the average text messages per day is at least 65. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 59 DEPARTMENT of STATISTICS The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 60 DEPARTMENT of STATISTICS Hypothesis Testing for Two Populations It is used when researchers wish to compare two sample means, using experimental and control groups. Main interest is on the difference between the two populations. For example, the average lifetimes of two different brands of bus tires might be compared to see whether there is any difference in tread wear. Two different brands of fertilizer might be tested to see whether one is better than the other for growing plants. Or two brands of cough syrup might be tested to see whether one brand is more effective than the other. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 61 DEPARTMENT of STATISTICS 2. z-Test for Two Sample Means Main interest is on the difference between 𝜇1 and 𝜇2 Assumptions: 1. Both samples are random samples. 2. The samples must be independent of each other. That is, there can be no relationship between the subjects in each sample. 3. The standard deviations of both populations must be known; and if the sample sizes are less than 30, the populations must be normally or approximately normally distributed. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 62 DEPARTMENT of STATISTICS Males Females Example 1: 6 11 11 6 8 11 A random sample of the number of 6 14 8 7 5 13 sports offered by colleges for males and females is shown. At 𝛼 = 0.10, 6 9 5 6 5 5 is there enough evidence to 6 9 18 10 7 6 support the claim that there is 15 6 11 16 10 7 significant difference in the number 9 9 5 7 5 5 of sports offered by colleges for 8 9 6 9 18 13 males and females? 9 5 11 7 8 5 Assume 𝜎1 = 𝜎2 = 3.3. 7 7 5 11 4 6 10 7 10 14 12 5 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 63 DEPARTMENT of STATISTICS Step in performing z-Test for Two Sample Means 1. Click the Data tab on the ribbon, then look for Data Analysis 2. Choose z-Test: Two Sample for Means, then click OK ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 64 DEPARTMENT of STATISTICS 3. Input the range that contains the variables 1 and 2. 4. Input the known variance for variables 1 and 2. 5. Choose output options, then click ok ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 65 DEPARTMENT of STATISTICS 6. the output will be ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 66 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇1 = 𝜇2 ⟹ There is no significant difference in the number of sports offered by colleges for males and female. Ha : 𝜇1 ≠ 𝜇2 ⟹ There is significant difference in the number of sports offered by colleges for males and female. (claim) Step 2: Identify the test to use, the tailed of the distribution and the level of significance z-Test for Two Sample Means Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the distribution is two-tailed test. Level of significance: 𝛼 = 0.10 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.10 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 67 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is two-tailed, therefore the P-value is 0.619. Step 5: Make a decision (whether to reject or not the Ho) Since P-value is not less than 𝛼 = 0.10 , failed to reject Ho Step 6: Interpret the results At 10% level of significance, there is no enough evidence to support the claim that there is significant difference in the number of sports offered by colleges for males and females. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 68 DEPARTMENT of STATISTICS The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 69 DEPARTMENT of STATISTICS Example 2: Resident Commuter The dean of students wants to see 22 19 23 23 32 18 26 18 36 35 whether there is a significant 25 19 24 21 18 23 23 22 23 24 difference in ages of resident 18 21 18 28 20 26 21 22 25 20 students and commuting students. 19 23 29 26 18 19 25 19 25 32 She selects a random sample of 50 students from each group. The ages 26 27 21 19 24 29 19 19 21 20 are shown here. At 𝛼 = 0.05, 22 26 21 22 19 20 23 19 18 19 determine if there is enough 19 26 26 18 23 20 22 22 27 18 evidence to reject the claim of no 25 18 27 22 22 20 35 28 19 17 difference in the ages of the two 20 19 18 26 18 18 21 22 35 17 groups. Assume 𝜎1 = 3.68 and 𝜎2 = 4.7. 30 19 19 18 20 30 18 19 24 16 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 70 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇1 = 𝜇2 ⟹ There is no significant difference in ages of resident students and commuting students. (claim) Ha : 𝜇1 ≠ 𝜇2 ⟹ There is no significant difference in ages of resident students and commuting students. Step 2: Identify the test to use, the tailed of the distribution and the level of significance z-Test for Two Sample Means Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the distribution is two-tailed test. Level of significance: 𝛼 = 0.05 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.05 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 71 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is two- tailed, therefore the P-value is 0.130. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 72 DEPARTMENT of STATISTICS Step 5: Make a decision (whether to reject or not the Ho) Since P-value is not less than 𝛼 = 0.05 , failed to reject Ho Step 6: Interpret the results At 5% level of significance, there is no enough evidence to reject the claim that the there is no significant difference in ages of resident students and commuting students ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 73 DEPARTMENT of STATISTICS 3. t-Test for Two Independent Sample Means Samples are independent if the sample selected from one population is not related to the sample selected from the other population. Example 1: A teacher wants to know which of the two sections has a higher score. The teacher will use these scores to make recommendations to the principal. Random samples of students are asked about their scores. Test the claim that the population mean scores are different for the two sections? Assume the data are normally distributed and equal variances. Use =0.05. Section A 81 77 75 74 86 90 62 73 91 98 Section B 89 64 35 68 69 55 37 57 42 49 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 74 DEPARTMENT of STATISTICS Step in performing t-Test for Two Independent Sample Means 1. Click the Data tab on the ribbon, then look for Data Analysis 2. Choose t-Test: Two Sample…, then click OK ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 75 DEPARTMENT of STATISTICS 3. Input the range that contains the variables 1 and 2. 4. Choose output options, then click ok ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 76 DEPARTMENT of STATISTICS 5. the output will be ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 77 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇1 = 𝜇2 ⟹ The population mean scores are not different for the two sections Ha : 𝜇1 ≠ 𝜇2 ⟹ The population mean scores are different for the two sections (claim) Step 2: Identify the test to use, the tailed of the distribution and the level of significance t-Test for Two Independent Sample Means (assume equal variances) Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the distribution is two-tailed test. Level of significance: 𝛼 = 0.05 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.05 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 78 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is two-tailed, therefore the P- value is 0.001. Step 5: Make a decision (whether to reject or not the Ho) Since P-value is less than 𝛼 = 0.05 , reject Ho Step 6: Interpret the results At 5% level of significance, there is enough evidence to support the claim that the population mean scores are different for the two sections. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 79 DEPARTMENT of STATISTICS The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 80 DEPARTMENT of STATISTICS Example 2: The production manager at Bellevue Steel, a manufacturer of wheelchairs, wants to compare the number of defective wheelchairs produced on the day shift with the number on the afternoon shift. A sample of the production from 6 day shifts and 8 afternoon shifts revealed the following number of defects. At the 0.05 significance level, test the claim that there is a difference in the mean number of defects per shift. Assume the data are normally distributed and unequal variances. Day 5 8 7 6 9 7 Afternoon 8 10 7 11 9 12 14 9 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 81 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇1 = 𝜇2 ⟹ there is no difference in the mean number of defects per shift Ha : 𝜇1 ≠ 𝜇2 ⟹ there is a difference in the mean number of defects per shift (claim) Step 2: Identify the test to use, the tailed of the distribution and the level of significance t-Test for Two Independent Sample Means (assume unequal variances) Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the distribution is two-tailed test. Level of significance: 𝛼 = 0.05 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.05 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 82 DEPARTMENT of STATISTICS Note: Assume unequal variances, so use t-Test: Two-Sample Assuming Unequal Variances ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 83 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is two-tailed, therefore the P-value is 0.010. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 84 DEPARTMENT of STATISTICS Step 5: Make a decision (whether to reject or not the Ho) Since P-value is less than 𝛼 = 0.05 , reject Ho Step 6: Interpret the results At 5% level of significance, there is enough evidence to support the claim that there is a difference in the mean number of defects per shift. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 85 DEPARTMENT of STATISTICS The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 86 DEPARTMENT of STATISTICS 4. t-Test for Two Dependent Sample Means Dependent means one sample is related to the other sample. such samples are referred to as paired sample or matched sample get one value from each of two subjects sharing the same characteristics Example: The following are the scores of eight students before and after a review. At 5% level of significance, test the claim that there is significant difference in the scores before and after the review. Assume the data are normally distributed. Student 1 2 3 4 5 6 7 8 Before 77 74 82 73 87 68 66 80 After 72 68 76 68 84 68 61 76 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 87 DEPARTMENT of STATISTICS Step in performing t-Test for Two Dependent Sample Means 1. Click the Data tab on the ribbon, then look for Data Analysis 2. Choose t-Test: Paired Two Sample for Means, then click OK ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 88 DEPARTMENT of STATISTICS 3. Input the range that contains the variables 1 and 2. 4. Choose output options, then click ok ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 89 DEPARTMENT of STATISTICS 5. the output will be ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 90 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇1 = 𝜇2 ⟹ There is no significant difference in the scores before and after the review Ha : 𝜇1 ≠ 𝜇2 ⟹ There is a significant difference in the scores before and after the review (claim) Step 2: Identify the test to use, the tailed of the distribution and the level of significance t-Test for Two Dependent Sample Means Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the distribution is two-tailed test. Level of significance: 𝛼 = 0.05 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.05 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 91 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is two-tailed, therefore the P-value is 0.0005. Step 5: Make a decision (whether to reject or not the Ho) Since P-value is less than 𝛼 = 0.05, reject Ho Step 6: Interpret the results At 5% level of significance, there is enough evidence to support the claim that there is significant difference in the scores before and after the review. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 92 DEPARTMENT of STATISTICS The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 93 DEPARTMENT of STATISTICS Example 2: Advertisements by Fitness Center claim that completing its course will result in losing weight. A random sample of eight recent participants showed the following weights before and after completing the course. At 0.01 level of significance, can we conclude the students lost weight? Assume the data are normally distributed. Student 1 2 3 4 5 6 7 8 Before 155 228 141 162 211 164 184 172 After 154 201 147 157 196 150 170 165 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 94 DEPARTMENT of STATISTICS Solution: P-value Method Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim Ho : 𝜇1 = 𝜇2 ⟹ The students doesn’t lost weight Ha : 𝜇1 > 𝜇2 ⟹The students lost weight (claim) Step 2: Identify the test to use, the tailed of the distribution and the level of significance t-Test for Two Dependent Sample Means Since the alternative hypothesis is 𝜇1 > 𝜇2 , therefore the tail of the distribution is one tailed to the right test. Level of significance: 𝛼 = 0.01 Step 3: Formulate the decision rule Reject Ho if P-value < 𝛼 = 0.01 ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 95 DEPARTMENT of STATISTICS Step 4: Determine the P-value Since the tail of the distribution is two-tailed, therefore the P- value is 0.016. Step 5: Make a decision (whether to reject or not the Ho) Since P-value is not less than 𝛼 = 0.01 , failed to reject Ho Step 6: Interpret the results At 1% level of significance, there is no enough evidence to support the claim that the students lost weight. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 96 DEPARTMENT of STATISTICS The following table will help you when you summarize the results. Decision Claim is Ho Claim is Ha Reject Ho There is enough There is enough evidence to reject the evidence to support the claim. claim. Failed to reject Ho There is no enough There is no enough evidence to reject the evidence to support the claim. claim. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 97 DEPARTMENT of STATISTICS 5. F-test for Two-Sample Variance It is used to test whether the variances of two populations are equal. Step in performing F-test for Two-Sample Variance 1. Click the Data tab on the ribbon, then look for Data Analysis 2. Choose F-Test: Two Sample for Variance, then click OK ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 98 DEPARTMENT of STATISTICS 3. Input the range that contains the variables 1 and 2. 4. Choose output options, then click ok Note: The variance of variable 1 must be larger than the variance of variable 2. In this case, section B have larger variance that section A, so the data of section B must be put in the “Variable 1 Range” and section A in “Variable 2 Range”. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 99 DEPARTMENT of STATISTICS 5. the output will be ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 100 DEPARTMENT of STATISTICS Solution: F-test Step 1: Ho: 𝜎12 = 𝜎22 (Variances are equal) Ha: 𝜎12 ≠ 𝜎22 (Variances are not equal) Step 2: F-test, 𝛼=0.05 Step 3: Reject Ho F > F Critical one-tail Reject Ho if P-value < 𝛼 = 0.05 (P-value Method) Step 4: F = 2.46 and F Critical = 3.17 P-value = 0.098 Step 5: Since F is not greater than F Critical, failed to reject Ho Since P-value is not less that 𝛼 = 0.05, failed to reject Ho Step 6: At 5% level of significance, we can conclude that the variances are equal. ZLLCABANAYAN Basic Statistical Data Analysis Using Microsoft Excel | 101 DEPARTMENT of STATISTICS Correlation Analysis Correlation is one of the most common and most useful statistics. Linear correlation analysis is a statistical technique to determine the strength or degree of linear relationship existing between two variables. A measure of the degree of linear relationship is called population coefficient ρ Assumptions: The sample of paired (X,Y) data is a random sample. The data pairs fall approximately on a straight line and are measured at the interval or ratio level. The pairs of (X,Y) data have a bivariate normal distribution. Basic Statistical Data Analysis Using Microsoft Excel | 102 DEPARTMENT of STATISTICS Scatter plot Scatter Diagram / Scatter plot - is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data Positive linear Negative linear Curvilinear No Relationship Relationship relationship relationship Basic Statistical Data Analysis Using Microsoft Excel | 103 DEPARTMENT of STATISTICS Example Suppose data below were collected from a sample of 10 Pizza Parlor restaurants located near college campuses. The size of student population (in thousands) and the quarterly sales (in thousands of dollars) for the 10 restaurants are summarized in the table below. Restaurant 1 2 3 4 5 6 7 8 9 10 (𝒊) Student Population (1000s) 2 6 8 8 12 16 19 20 22 26 (𝒙𝒊 ) Quarterly Sales ($1000s) 58 105 88 118 117 137 157 169 148 202 (𝒚𝒊 ) Basic Statistical Data Analysis Using Microsoft Excel | 104 DEPARTMENT of STATISTICS Example: Scatter plot in Excel Step 1: Highlight the cells that contain the data you want to use in the chart Basic Statistical Data Analysis Using Microsoft Excel | 105 DEPARTMENT of STATISTICS Step 2: Click the insert tab on the ribbon Step 3: Select the X, Y (scatter) you want to create Basic Statistical Data Analysis Using Microsoft Excel | 106 DEPARTMENT of STATISTICS Step 4: You'll see many options when you select this button, select Scatter Basic Statistical Data Analysis Using Microsoft Excel | 107 DEPARTMENT of STATISTICS Step 5: The chart will appear. Customize bar chart through ”chart design”, “format”, “Quick Layout”. Basic Statistical Data Analysis Using Microsoft Excel | 108 DEPARTMENT of STATISTICS The final output will be 250 200 QUARTERLY SALES ($1000S) 150 100 50 0 0 5 10 15 20 25 30 STUDENT POPULATION (1000S) Basic Statistical Data Analysis Using Microsoft Excel | 109 DEPARTMENT of STATISTICS Types of Correlation 1. Positive linear correlation – if the independent variable increases, the dependent variable also increases; or if the independent variable decreases, the dependent variable also decreases. 2. Negative linear correlation – if the independent variable increases, the dependent variable decreases; or if the independent variable decreases, the independent variable increases. 3. Zero linear correlation – no linear relationship exists between the dependent and independent variables. If the relationship is not linear, then it can be exponential or logarithmic. Note that this does not imply that the independent variable is not related with the independent variable. Basic Statistical Data Analysis Using Microsoft Excel | 110 DEPARTMENT of STATISTICS a Linear Correlation Coefficient r Measures the strength of the linear relationship between the paired X and Y values in a sample. It is sometimes referred to as Pearson Product Moment Correlation in honor of Karl Pearson (1857-1936), who originally develop it. 𝑛 σ 𝑋𝑖 𝑌𝑖 − (σ 𝑋𝑖 )(σ 𝑌𝑖 ) 𝑟= 2 2 𝑛 σ 𝑋𝑖2 − (σ 𝑋𝑖 ) 𝑛 σ 𝑌𝑖2 − (σ 𝑌𝑖 ) Basic Statistical Data Analysis Using Microsoft Excel | 111 DEPARTMENT of STATISTICS Interpretations of the Linear 𝐶𝑃𝑋𝑌 Correlation Coefficient r 𝑟= 𝑆𝑆𝑋 𝑆𝑆𝑌 Value of r Interpretation where σ 𝑋𝑖 σ 𝑌𝑖 0.01 to 0.20 Very weak linear relationship ▪ 𝐶𝑃𝑋𝑌 = σ 𝑋𝑖 𝑌𝑖 − 𝑛 2 0.21 to 0.40 Weak linear relationship 2 σ 𝑋𝑖 ▪ 𝑆𝑆𝑥 = σ 𝑋𝑖 - 𝑛 0.41 to 0.70 Moderate linear relationship 2 σ 𝑌𝑖 2 ▪ 𝑆𝑆𝑦 = σ 𝑌𝑖 - 0.71 to 0.90 Strong linear relationship 𝑛 0.91 to 0.99 Very strong linear relationship Basic Statistical Data Analysis Using Microsoft Excel | 112 DEPARTMENT of STATISTICS Number of Final grade Example 1: Student absences (X) (Y) The data below shows the 1 6 82 number of absences and the final grades of eight randomly 2 2 86 selected students from a 3 15 45 statistics class. Determine 4 9 74 whether there is significant 5 11 59 linear relationship between 6 5 90 number of absences and final grade. 7 8 78 8 4 86 Basic Statistical Data Analysis Using Microsoft Excel | 113 DEPARTMENT of STATISTICS Step in performing Correlation: 1. Click the Data tab on the ribbon, then look for Data Analysis 2. Choose Correlation, then click OK Basic Statistical Data Analysis Using Microsoft Excel | 114 DEPARTMENT of STATISTICS 3. Input the range that contains the data. 4. Choose output options, then click ok Basic Statistical Data Analysis Using Microsoft Excel | 115 DEPARTMENT of STATISTICS 5. the output will be 𝑟 = −0.94 It means that the number of absences and final grade have negative very strong linear relationship Basic Statistical Data Analysis Using Microsoft Excel | 116 DEPARTMENT of STATISTICS Properties of the Linear Correlation Coefficient r 1. The value of r is always -1 to 1. That is, -1 ≤ r ≤ 1. 2. A value near 0 indicates there is little linear relationship between variables 3. The value of r does not change if all values of either variable are converted to a different scale. 4. The value of r is not affected by the choice of x or y. Interchange all x and y values and the value of r will not change. 5. r measures the strength of a linear relationship. It is not designated to measure the strength of a relationship that is not linear. 6. The value of r is sensitive to outliers. Basic Statistical Data Analysis Using Microsoft Excel | 117 DEPARTMENT of STATISTICS Formal Hypothesis Test The null and alternative hypothesis are: Ho: 𝜌 = 0 (There is no significant linear correlation between variables X and Y.) Ha: 𝜌 ≠ 0 (There is significant linear correlation between variables X and Y.) The decision rule: Reject Ho if P-value < 𝛼 (level of significance) Basic Statistical Data Analysis Using Microsoft Excel | 118 DEPARTMENT of STATISTICS How to perform a Pearson correlation test in Excel Step 1: Calculate the Pearson correlation coefficient in Excel r = PEARSON(array1, array2) Step 2: Calculate the t-statistic from the coefficient value t = (r*SQRT(n-2))/(SQRT(1-r^2)) where: r is the Pearson correlation coefficient n is the number of observation Note: if your coefficient value is negative, then use the following formula t = (ABS(r)*SQRT(n-2))/(SQRT(1-ABS(r)^2)) Basic Statistical Data Analysis Using Microsoft Excel | 119 DEPARTMENT of STATISTICS Step 3: Calculate the p-value from the t statistic P-value = TDIST(t, df, tails) where: t is the t-statistic from the coefficient value df is the degrees of freedom which is equivalent to n-2 tails is the tailed of the distribution (‘1‘ for a one-tailed analysis or a ‘2‘ for a two-tailed analysis Basic Statistical Data Analysis Using Microsoft Excel | 120 DEPARTMENT of STATISTICS Example 1 in Excel: =PEARSON(B2:B8, C2:C8) =(ABS(F2)*SQRT(F3-2)) /(SQRT(1-ABS(F2)^2)) =TDIST(F4,F5,F6) Basic Statistical Data Analysis Using Microsoft Excel | 121 DEPARTMENT of STATISTICS Example 1: Formal Hypothesis Test of Correlation Analysis 1.) Ho: 𝜌 = 0 (There is no significant linear correlation between number of absences and final grade) Ha: 𝜌 ≠ 0 (There is significant linear correlation between number of absences and final grade) 2.) Correlation analysis 3.) Reject Ho if P-value < 𝛼 = 0.05 4.) P-value = 0.0014 5.) Since P-value is less than 𝛼 = 0.05 , reject Ho 6.) At 5% level of significance, we can conclude that there is significant linear correlation between number of absences and final grade. Basic Statistical Data Analysis Using Microsoft Excel | 122 DEPARTMENT of STATISTICS Y-Height (cm) X-Span (cm) Example 2: 171 173 The height and arm span of 10 195 193 adult males were measures (in 180 188 cm). 182 185 190 186 175 178 177 182 178 182 192 198 202 202 Basic Statistical Data Analysis Using Microsoft Excel | 123 DEPARTMENT of STATISTICS =PEARSON(A2:A11,B2:B11) =(ABS(E2)*SQRT(E3-2))/(SQRT(1-E2^2)) =TDIST(E4,E5,E6) Basic Statistical Data Analysis Using Microsoft Excel | 124 DEPARTMENT of STATISTICS Example 2: Formal Hypothesis Test of Correlation Analysis 1.) Ho: 𝜌 = 0 (There is no significant linear correlation between number of absences and final grade) Ha: 𝜌 ≠ 0 (There is significant linear correlation between number of absences and final grade) 2.) Correlation analysis 3.) Reject Ho if P-value < 𝛼 = 0.05 4.) P-value = 0.000089 5.) Since P-value is less than 𝛼 = 0.05 , reject Ho 6.) At 5% level of significance, we can conclude that there is significant linear correlation between height and arm span. Basic Statistical Data Analysis Using Microsoft Excel | 125 DEPARTMENT of STATISTICS Other Correlations Pearson Product Moment Correlation - appropriate when both variables are measured at an interval and ratio level. Spearman Rank Order Correlation (rho) – appropriate when the two variables are in ordinal level. Point-Biserial Correlation – appropriate when one measure is a continuous interval level and the other is dichotomous (i.e., two- category). Basic Statistical Data Analysis Using Microsoft Excel | 126 DEPARTMENT of STATISTICS Simple Linear Regression Analysis Used to predict the value of a dependent variable based on the value of at least one independent variable Used to explain the impact of changes in an independent variable on the dependent variable. Components of Regression Analysis: 1. Dependent/ Response variable (Y) – the variable we wish to explain 2. Independent/ Predictor / Explanatory variable (X’s) – the variable used to explain the dependent variable Basic Statistical Data Analysis Using Microsoft Excel | 127 DEPARTMENT of STATISTICS Simple Linear Regression Model: Population Slope Independent Population y Coefficient Variable intercept Dependent Variable Random Error term, 𝒚 = 𝜷𝒐 + 𝜷 𝟏 𝑿 + 𝜺 or residual Linear Regression Assumptions Error values (𝜀) are statistically independent Error values are normally distributed for any given values of x The probability distribution of the errors is normal The probability distribution of the errors has constant variance The underlying relationship between x variable and the y variable is linear Basic Statistical Data Analysis Using Microsoft Excel | 128 DEPARTMENT of STATISTICS Regression Estimated Regression Model by Least Squares Method Estimate of the Estimate of the Formulas for 𝒃𝟏 and 𝒃𝟎 : Estimated (or regression regression Independent σ 𝑥−𝑥ҧ 𝑦−𝑦ത 𝐶𝑃𝑥𝑦 predicted) intercept slope variable 𝑏1 = σ 𝑥−𝑥ҧ 2 or 𝑏1 = y value 𝑆𝑆𝑥 𝑏0 = 𝑦ത − 𝑏1 𝑥ҧ 𝒚ෝ𝒊 = 𝒃𝒐 + 𝒃𝟏 𝑿 Interpretation of the slope and the intercept: ▪ 𝑏0 is the estimated average value of y when the value of x is zero ▪ 𝑏1 is the estimated change (increase or decrease) in the average value of y as a result of a one-unit increase in x Basic Statistical Data Analysis Using Microsoft Excel | 129 DEPARTMENT of STATISTICS Evaluating a regression equation’s ability to predict 𝐶𝑃𝑋𝑌 Coefficient of Determination 𝑅2 = 𝑏1 𝑆𝑆𝑦 The proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X. The coefficient of determination is also called R-squared and is denoted as 𝑅2. It is the squared of the correlation coefficient. If it were possible to make perfect predictions, the coefficient of determination would be 100%. A coefficient of determination of 100% is associated with a correlation coefficient of +1.0 or −1.0. Basic Statistical Data Analysis Using Microsoft Excel | 130 DEPARTMENT of STATISTICS Step in performing Regression: 1. Click the Data tab on the ribbon, then look for Data Analysis 2. Choose Regression, then click OK Basic Statistical Data Analysis Using Microsoft Excel | 131 DEPARTMENT of STATISTICS 3. Input the range that contains the variables X and Y. 4. Choose output options, then click ok Basic Statistical Data Analysis Using Microsoft Excel | 132 DEPARTMENT of STATISTICS 5. the output will be Basic Statistical Data Analysis Using Microsoft Excel | 133 DEPARTMENT of STATISTICS Example: The data below shows the number of absences and the final grades of seven randomly selected students from a statistics class. Number of Final grade 1. Create regression equation Student absences (X) (Y) to predict the final grade of A 6 82 student. B 2 86 2. Interpret 𝑏0 and 𝑏1 C 15 43 3. Compute the value of 𝑅2 4. What is the estimated D 9 74 value of the final grade if E 12 58 the number of absences is F 5 90 1, 4, 7, 10, 20 G 8 78 Basic Statistical Data Analysis Using Microsoft Excel | 134 DEPARTMENT of STATISTICS a Solution: Regression Equation: 𝑦ෝ𝑖 = 𝑏𝑜 + 𝑏1 𝑥𝑖 𝑦ො𝑖 = 101.31 − 3.51𝑥𝑖 101.31 is the estimated average final grade when the number of absences is zero. -3.51 is the estimated decreased in the average final grade as a result of one-unit increase in the number of absences. Basic Statistical Data Analysis Using Microsoft Excel | 135 DEPARTMENT of STATISTICS Solution: Coefficient of Determination: 𝑅2 = 0.8927 or 89.27% 89.27% of the total variation in the final grade was explained, or accounted for, by the variation of the number of absences. Basic Statistical Data Analysis Using Microsoft Excel | 136 DEPARTMENT of STATISTICS a Solution: What is the estimated value of the final grade if the number of absences is 1, 4, 7, 10, and 20 Final regression equation: 𝐹𝑖𝑛𝑎𝑙 𝑔𝑟𝑎𝑑𝑒 = 101.31 − 3.51 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑎𝑏𝑠𝑒𝑛𝑐𝑒𝑠 Case 1: no. of absences = 1 Final grade = 101.31 − 3.51 1 = 97.80 Case 2: no. of absences = 4 Final grade = 101.31 − 3.51 4 = 87.27 Case 3: no. of absences = 7 Final grade = 101.31 − 3.51 7 = 76.74 Case 4: no. of absences = 10 Final grade = 101.31 − 3.51 10 = 66.21 Case 3: no. of absences = 20 Final grade = 101.31 − 3.51 20 = 31.11 Basic Statistical Data Analysis Using Microsoft Excel | 137 DEPARTMENT of STATISTICS Y-Height (cm) X-Span (cm) Example 2: 171 173 The height and arm span of 10 adult 195 193 males were measures (in cm). 180 188 1. Create regression equation to predict 182 185 the height of males. 190 186 2. Interpret 𝑏0 and 𝑏1 175 178 3. Determine and interpret 𝑅2 177 182 4. What is the estimated value of the height if the span is 150, 200, 250. 178 182 192 198 202 202 Basic Statistical Data Analysis Using Microsoft Excel | 138 DEPARTMENT of STATISTICS a Solution: Regression Equation: 𝑦ෝ𝑖 = 𝑏𝑜 + 𝑏1 𝑥𝑖 𝑦ො𝑖 = −10.997 + 1.05𝑥𝑖 −10.997 is the estimated average height when the size of arm span is zero. 1.05 is the estimated increased in the average height as a result of one-unit increase in the size of arm span. Basic Statistical Data Analysis Using Microsoft Excel | 139 DEPARTMENT of STATISTICS Solution: Coefficient of Determination: 𝑅2 = 0.8677 or 86.77% 86.77% of the total variation in the height was explained, or accounted for, by the variation of the size of arm span. Basic Statistical Data Analysis Using Microsoft Excel | 140 DEPARTMENT of STATISTICS Solution: What is the estimated value of the height if the span is 150, 200, 250. Final regression equation: ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05(𝑎𝑟𝑚 𝑠𝑝𝑎𝑛) Case 1: arm span = 150 ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05 150 = 146.503 Case 2: arm span = 200 ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05 200 = 199.003 Case 3:arm span = 250 ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05 250 = 251.503 Basic Statistical Data Analysis Using Microsoft Excel | 141