ANOVA PDF
Summary
This document explains Analysis of Variance (ANOVA), a statistical technique used to analyze the difference between the means of more than two samples. It details the concept of variance and its application in statistical inference and hypothesis testing.
F-TEST and Analysis of Variance (ANOVA)

Introduction

Analysis of variance (ANOVA) is a statistical technique used for analyzing the difference between the means of more than two samples. It is a parametric test of hypothesis. It is a stepwise estimation procedure (based on the "variation" among and between groups) used to test the equality of two or more population means.

ANOVA was developed by the statistician and eugenicist Ronald Fisher. Although many statisticians, including Fisher, worked on the development of the ANOVA model, it became widely known after being included in Fisher's 1925 book "Statistical Methods for Research Workers".

ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. ANOVA provides an analytical framework for testing the differences among group means and thus generalizes the t-test beyond two means. ANOVA uses F-tests to statistically test the equality of means.

Concept of Variance

Variance is an important tool in the sciences, including statistical science. In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. It measures the degree to which the data in a series are scattered around their average value. Variance is widely used in statistics, with uses ranging from descriptive statistics to statistical inference and hypothesis testing.

Relationship Among Variables

Under this analysis, we examine the differences in the mean values of the dependent variable associated with the effect of the controlled independent variables, after taking into account the influence of the uncontrolled independent variables. We take the null hypothesis that there is no significant difference between the means of the different populations.
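The partitioning of variation mentioned in the introduction can be made concrete with a small numerical check. The following is a minimal Python sketch, not part of the original notes: the function name `partition_sums_of_squares` and the three groups of numbers are hypothetical, invented only to show that the total sum of squared deviations splits into a between-group part and a within-group part.

```python
from statistics import mean


def partition_sums_of_squares(groups):
    """Split the total variation into between-group and within-group parts."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Total: squared deviation of every observation from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Between: squared deviation of each group mean from the grand mean,
    # weighted by the number of observations in the group.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: squared deviation of each observation from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return ss_total, ss_between, ss_within


# Hypothetical data, invented for illustration only.
groups = [[2, 4, 6], [5, 7, 9], [8, 10, 12]]
ss_total, ss_between, ss_within = partition_sums_of_squares(groups)
print(ss_total, ss_between, ss_within)
```

For these made-up numbers the total (78) equals the between-group part (54) plus the within-group part (24). ANOVA compares these two parts, each divided by its degrees of freedom, to judge whether the group means differ.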
In its simplest form, analysis of variance must have a dependent variable that is metric (measured using an interval or ratio scale). There must also be one or more independent variables, all of which must be categorical (non-metric). Categorical independent variables are also called factors. A particular combination of factor levels, or categories, is called a treatment.

The type of analysis used to examine the variation depends on the number of independent variables considered in the study. One-way analysis of variance involves only one categorical variable, or a single factor. If two or more factors are involved, the analysis is termed n-way analysis of variance (e.g., two-way, three-way).

F Tests

F-tests are named after Sir Ronald Fisher. The F-statistic is simply a ratio of two variances. Variance is the square of the standard deviation. For most readers, standard deviations are easier to understand than variances because they are in the same units as the data rather than squared units. F-statistics are based on the ratio of mean squares. The term "mean squares" may sound confusing, but it is simply an estimate of population variance that accounts for the degrees of freedom (DF) used to calculate that estimate.

For carrying out the test of significance, we calculate the ratio F, which is defined as:

$$F = \frac{S_1^2}{S_2^2}, \quad \text{where } S_1^2 = \frac{\sum (X_1 - \bar{X}_1)^2}{n_1 - 1} \text{ and } S_2^2 = \frac{\sum (X_2 - \bar{X}_2)^2}{n_2 - 1}$$

It should be noted that $S_1^2$ is always the larger estimate of variance, i.e., $S_1^2 > S_2^2$:

$$F = \frac{\text{Larger estimate of variance}}{\text{Smaller estimate of variance}}$$

$\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$, where
$\nu_1$ = degrees of freedom for the sample having the larger variance,
$\nu_2$ = degrees of freedom for the sample having the smaller variance.

The calculated value of F is compared with the table value for $\nu_1$ and $\nu_2$ at the 5% or 1% level of significance. If the calculated value of F is greater than the table value, the F ratio is considered significant and the null hypothesis is rejected.
On the other hand, if the calculated value of F is less than the table value, the null hypothesis is accepted and it is inferred that both samples have come from populations having the same variance.

Illustration 1: Two random samples were drawn from two normal populations and their values are:

A: 65 66 73 80 82 84 88 90 92
B: 64 66 74 78 82 85 87 92 93 95 97

Test whether the two populations have the same variance at the 5% level of significance. (Given: F = 3.36 at the 5% level for $\nu_1 = 10$ and $\nu_2 = 8$.)

Solution: Let us take the null hypothesis that the two populations have the same variance. Applying the F-test, with deviations from the sample means written as $x_1 = X_1 - \bar{X}_1$ and $x_2 = X_2 - \bar{X}_2$:

        Sample A                  Sample B
   X1     x1    x1^2         X2     x2    x2^2
   65    -15     225         64    -19     361
   66    -14     196         66    -17     289
   73     -7      49         74     -9      81
   80      0       0         78     -5      25
   82      2       4         82     -1       1
   84      4      16         85      2       4
   88      8      64         87      4      16
   90     10     100         92      9      81
   92     12     144         93     10     100
                             95     12     144
                             97     14     196
  720      0     798        913      0    1298    (column totals)

$$\bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{720}{9} = 80; \qquad \bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{913}{11} = 83$$

$$S_1^2 = \frac{\sum x_1^2}{n_1 - 1} = \frac{798}{9 - 1} = 99.75$$

$$S_2^2 = \frac{\sum x_2^2}{n_2 - 1} = \frac{1298}{11 - 1} = 129.8$$

Sample B has the larger variance, so its estimate goes in the numerator, with $\nu_1 = 11 - 1 = 10$ and $\nu_2 = 9 - 1 = 8$:

$$F = \frac{S_2^2}{S_1^2} = \frac{129.8}{99.75} = 1.30$$

At the 5 percent level of significance, for $\nu_1 = 10$ and $\nu_2 = 8$, the table value of $F_{0.05}$ is 3.36. The calculated value of F is less than the table value, so the null hypothesis is accepted. Hence the two populations have the same variance.

TESTING EQUALITY OF POPULATION (TREATMENT) MEANS: ONE-WAY CLASSIFICATION

In one-way classification, the following steps are carried out to compute the F-ratio by the most popular method, the short-cut method:

1. Square every observation in each sample (column).
2. Get the sum of the sample observations in each column: $\sum X_1, \sum X_2, \ldots, \sum X_k$.
3. Get the sum of the squared values in each column: $\sum X_1^2, \sum X_2^2, \ldots, \sum X_k^2$.
4. Find the value of T by adding up all the sums of sample observations, i.e., $T = \sum X_1 + \sum X_2 + \cdots + \sum X_k$.
5. Compute the correction factor by the formula: $\text{C.F.} = \frac{T^2}{N}$.
6. Find the total sum of squares (SST) from the squared values and the C.F.: $SST = \sum X_1^2 + \sum X_2^2 + \cdots + \sum X_k^2 - \text{C.F.}$
7. Find the sum of squares between the samples (SSC) by the following formula: $SSC = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \cdots + \frac{(\sum X_k)^2}{n_k} - \text{C.F.}$
8. Finally, find the sum of squares within the samples (SSE): $SSE = SST - SSC$.

ANALYSIS OF VARIANCE (ANOVA) TABLE

Source of variation             Sum of squares (SS)   Degrees of freedom (v)   Mean square (MS)   Variance ratio F
Between samples (treatments)    SSC                   v1 = C - 1               MSC = SSC/v1       F = MSC/MSE
Within samples (error)          SSE                   v2 = N - C               MSE = SSE/v2
Total                           SST                   N - 1

SSC = Sum of squares between samples (columns)
SST = Total sum of squares of variations
SSE = Sum of squares within the samples
MSC = Mean sum of squares between samples
MSE = Mean sum of squares within samples
C = Number of samples (columns); N = total number of observations

Illustration 2: To test the significance of variation in the retail prices of a commodity in three principal cities, Kanpur, Lucknow, and Delhi, four shops were chosen at random in each city and the prices observed in rupees were as follows:

Kanpur:   15    7   11   13
Lucknow:  14   10   10    6
Delhi:     4   10    8    8

Do the data indicate that the prices in the three cities are significantly different?

Solution: Let us take the null hypothesis that there is no significant difference in the prices of the commodity in the three cities. Calculations for analysis of variance are as under:

  Sample 1 (Kanpur)     Sample 2 (Lucknow)     Sample 3 (Delhi)
   x1      x1^2           x2      x2^2           x3      x3^2
   15      225            14      196             4       16
    7       49            10      100            10      100
   11      121            10      100             8       64
   13      169             6       36             8       64
   46      564            40      432            30      244    (column totals)

There are r = 3 treatments (samples) with $n_1 = 4$, $n_2 = 4$, $n_3 = 4$, and n = 12.
T = sum of all the observations in the three samples = $\sum x_1 + \sum x_2 + \sum x_3 = 46 + 40 + 30 = 116$

CF = correction factor = $\frac{T^2}{n} = \frac{(116)^2}{12} = 1121.33$

SST = total sum of squares = $(\sum x_1^2 + \sum x_2^2 + \sum x_3^2) - CF = (564 + 432 + 244) - 1121.33 = 118.67$

SSC = sum of squares between the samples

$$SSC = \left[\frac{(\sum x_1)^2}{n_1} + \frac{(\sum x_2)^2}{n_2} + \frac{(\sum x_3)^2}{n_3}\right] - CF = \left[\frac{(46)^2}{4} + \frac{(40)^2}{4} + \frac{(30)^2}{4}\right] - 1121.33 = \frac{4616}{4} - 1121.33 = 32.67$$

SSE = SST - SSC = 118.67 - 32.67 = 86

Degrees of freedom: df1 = r - 1 = 3 - 1 = 2 and df2 = n - r = 12 - 3 = 9.

Thus MSC = SSC/df1 = 32.67/2 = 16.335 and MSE = SSE/df2 = 86/9 = 9.56. (The between-samples quantities SSC and MSC are also written SSTR and MSTR.)

ANOVA TABLE

Source of variation   Sum of squares   Degrees of freedom   Mean square                Test statistic
Between samples       SSC = 32.67      r - 1 = 2            MSC = 32.67/2 = 16.335     F = MSC/MSE = 16.335/9.56 = 1.71
Within samples        SSE = 86         n - r = 9            MSE = 86/9 = 9.56
Total                 SST = 118.67     n - 1 = 11

The table value of F for df1 = 2, df2 = 9, and $\alpha$ = 5% level of significance is 4.26. Since the calculated value of F is less than its critical (or table) value, the null hypothesis is accepted. Hence we conclude that the prices of the commodity in the three cities do not differ significantly.

TESTING EQUALITY OF POPULATION (TREATMENT) MEANS: TWO-WAY CLASSIFICATION

ANOVA TABLE FOR TWO-WAY CLASSIFICATION

Source of variation   Sum of squares   Degrees of freedom   Mean square                    Variance ratio
Between columns       SSC              c - 1                MSC = SSC/(c - 1)              F_treatment = MSC/MSE
Between rows          SSR              r - 1                MSR = SSR/(r - 1)              F_blocks = MSR/MSE
Residual error        SSE              (c - 1)(r - 1)       MSE = SSE/[(c - 1)(r - 1)]
Total                 SST              n - 1

Total variation consists of three parts: (i) variation between columns, SSC; (ii) variation between rows, SSR; and (iii) actual variation due to random error, SSE. That is, SST = SSC + (SSR + SSE). The degrees of freedom associated with SST are cr - 1, where c and r are the number of columns and rows, respectively.
Degrees of freedom between columns = c - 1
Degrees of freedom between rows = r - 1
Degrees of freedom for residual error = (c - 1)(r - 1)

The test statistic F for analysis of variance is given by:

$F_{treatment}$ = MSC/MSE if MSC > MSE, or MSE/MSC if MSE > MSC
$F_{blocks}$ = MSR/MSE if MSR > MSE, or MSE/MSR if MSE > MSR

Illustration 3: The following table gives the number of refrigerators sold by 4 salesmen in three months, March, April and May:

Month      Salesman
            A     B     C     D
March      50    40    48    39
April      46    48    50    45
May        39    44    40    39

Is there a significant difference in the sales made by the four salesmen? Is there a significant difference in the sales made during different months?

Solution: Let us take the following null hypotheses:

$H_0$: There is no significant difference in the sales made by the four salesmen.
$H_0$: There is no significant difference in the sales made during different months.

The given data are coded by subtracting 40 from each observation. Calculations for a two-criteria (month and salesman) analysis of variance are shown below:

Two-way ANOVA Table

Month         A (x1)   x1^2   B (x2)   x2^2   C (x3)   x3^2   D (x4)   x4^2   Row sum
March           10      100      0       0       8      64      -1       1       17
April            6       36      8      64      10     100       5      25       29
May             -1        1      4      16       0       0      -1       1        2
Column sum      15      137     12      80      18     164       3      27       48

T = sum of all the coded observations = 48

CF = correction factor = $\frac{T^2}{n} = \frac{(48)^2}{12} = 192$

SSC = sum of squares between salesmen (columns)

$$SSC = \left[\frac{(15)^2}{3} + \frac{(12)^2}{3} + \frac{(18)^2}{3} + \frac{(3)^2}{3}\right] - 192 = (75 + 48 + 108 + 3) - 192 = 42$$

SSR = sum of squares between months (rows)

$$SSR = \left[\frac{(17)^2}{4} + \frac{(29)^2}{4} + \frac{(2)^2}{4}\right] - 192 = (72.25 + 210.25 + 1) - 192 = 91.5$$

SST = total sum of squares = $(\sum x_1^2 + \sum x_2^2 + \sum x_3^2 + \sum x_4^2) - CF = (137 + 80 + 164 + 27) - 192 = 216$

SSE = SST - (SSC + SSR) = 216 - (42 + 91.5) = 82.5

The total degrees of freedom are df = n - 1 = 12 - 1 = 11.
So dfc = c - 1 = 4 - 1 = 3, dfr = r - 1 = 3 - 1 = 2, and the residual df = (c - 1)(r - 1) = 3 x 2 = 6.

Thus,
MSC = SSC/(c - 1) = 42/3 = 14
MSR = SSR/(r - 1) = 91.5/2 = 45.75
MSE = SSE/[(c - 1)(r - 1)] = 82.5/6 = 13.75

The ANOVA table is shown below:

Source of variation   Sum of squares   Degrees of freedom    Mean square                            Variance ratio
Between salesmen      SSC = 42.0       c - 1 = 3             MSC = SSC/(c - 1) = 14.00              F_treatment = MSC/MSE = 14/13.75 = 1.018
Between months        SSR = 91.5       r - 1 = 2             MSR = SSR/(r - 1) = 45.75              F_block = MSR/MSE = 45.75/13.75 = 3.327
Residual error        SSE = 82.5       (c - 1)(r - 1) = 6    MSE = SSE/[(c - 1)(r - 1)] = 13.75
Total                 SST = 216        n - 1 = 11

(a) The table value of F is 4.76 for df1 = 3, df2 = 6, and $\alpha$ = 5%. Since the calculated value of $F_{treatment}$ = 1.018 is less than its table value, the null hypothesis is accepted. Hence we conclude that the sales made by the salesmen do not differ significantly.

(b) The table value of F is 5.14 for df1 = 2, df2 = 6, and $\alpha$ = 5%. Since the calculated value of $F_{block}$ = 3.327 is less than its table value, the null hypothesis is accepted. Hence we conclude that the sales made during different months do not differ significantly.
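As a cross-check on the arithmetic in the three illustrations, the short-cut formulas can be sketched in a few lines of Python using only the standard library. This is a minimal sketch under stated assumptions, not the notes' own procedure: the function names `variance_ratio`, `one_way_anova`, and `two_way_anova` are hypothetical. The variance ratio places the larger sample variance in the numerator, as the F Tests section prescribes. The two-way function always forms MSC/MSE and MSR/MSE and does not swap numerator and denominator when MSE is the larger value. Critical values still have to be looked up in an F table.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """F for two samples: larger sample variance over the smaller one."""
    var_a, var_b = variance(sample_a), variance(sample_b)  # divisor n - 1
    return max(var_a, var_b) / min(var_a, var_b)


def one_way_anova(samples):
    """Short-cut method, one-way classification. Returns (SSC, SSE, F)."""
    n = sum(len(s) for s in samples)
    total = sum(sum(s) for s in samples)
    cf = total ** 2 / n                                # correction factor T^2 / N
    sst = sum(x * x for s in samples for x in s) - cf  # total sum of squares
    ssc = sum(sum(s) ** 2 / len(s) for s in samples) - cf  # between samples
    sse = sst - ssc                                    # within samples
    msc = ssc / (len(samples) - 1)
    mse = sse / (n - len(samples))
    return ssc, sse, msc / mse


def two_way_anova(rows):
    """Short-cut method, two-way classification with one observation per cell.

    Returns (SSC, SSR, SSE, F_columns, F_rows).
    """
    r, c = len(rows), len(rows[0])
    total = sum(sum(row) for row in rows)
    cf = total ** 2 / (r * c)
    sst = sum(x * x for row in rows for x in row) - cf
    ssr = sum(sum(row) ** 2 for row in rows) / c - cf        # between rows
    ssc = sum(sum(col) ** 2 for col in zip(*rows)) / r - cf  # between columns
    sse = sst - ssc - ssr                                    # residual error
    mse = sse / ((c - 1) * (r - 1))
    # Assumption of this sketch: no swap when MSE exceeds MSC or MSR.
    return ssc, ssr, sse, (ssc / (c - 1)) / mse, (ssr / (r - 1)) / mse


# Data copied from Illustrations 1, 2 and 3.
sample_a = [65, 66, 73, 80, 82, 84, 88, 90, 92]
sample_b = [64, 66, 74, 78, 82, 85, 87, 92, 93, 95, 97]
prices = [[15, 7, 11, 13], [14, 10, 10, 6], [4, 10, 8, 8]]  # one list per city
sales = [[50, 40, 48, 39], [46, 48, 50, 45], [39, 44, 40, 39]]  # rows are months

print(round(variance_ratio(sample_a, sample_b), 3))
print([round(v, 3) for v in one_way_anova(prices)])
print([round(v, 3) for v in two_way_anova(sales)])
```

The sales figures are passed uncoded. Subtracting 40 from every observation, as the solution to Illustration 3 does, simplifies hand calculation but leaves every sum of squares unchanged, so the script arrives at SSC = 42, SSR = 91.5, and SSE = 82.5 either way.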