Chi-Squared Test: Pearson's Chi-Squared Test

Document Details


Uploaded by LargeCapacityAntigorite4770

Tags

chi-squared test, statistics, categorical data, hypothesis testing

Summary

The document explains Pearson's Chi-Squared Test, introduced in 1900, which assesses whether observed differences in categorical data reflect a meaningful pattern or are due to random chance. It walks through expected versus observed counts, using cross-tabulations of income level and smoking status as a worked example.

Full Transcript


Chi-Squared Test

Pearson's χ² Test

The χ² ("chi"-squared, pronounced like the last syllable of sky) test allows us to assess whether observed differences in the behavior of categorical data are explained by random chance or represent a meaningful pattern. The method was developed by Karl Pearson and introduced in 1900, and it is a foundational method in modern statistics.

Expectations v. Observations

Given two categorical variables X and Y, it is often of interest to ask whether these two variables are associated with one another. However, categorical data can be challenging to work with because we cannot map the values of categorical data onto some distribution. The χ² test was developed to give us a way to map observed patterns between categorical variables onto a theoretical distribution, thus allowing us to conduct a significance test.

The goal of the χ² Test of Independence is to determine whether X and Y are related to one another (i.e., whether they are dependent on one another). For example, we might want to know if income level (measured categorically) is related to current smoking status. Since both variables are categorical, we do not have a good way of mapping the behavior of either variable onto a theoretical distribution. But we can define how we expect the values of X and Y within our sample to be distributed if they are not related to one another (i.e., if they are independent of one another). Let's use an example to show how we can define the expected behavior of two categorical variables if we assume they are independent.

Income Level and Smoking Status

Let's say we have recruited 400 people into a study examining the relationship of income and smoking status. We have measured annual income categorically (below $20k, $20k to $50k, above $50k) and smoking status categorically (not a current smoker, current smoker). Looking at the sample, we see that 100 people reported an income below $20k, 200 reported an income between $20k and $50k, and 100 reported an income above $50k. Further, we see that 100 people reported currently smoking and the other 300 reported not currently smoking. We can start by generating a cross-tabulation that we will use as a framework to build our analysis on, like so:

                     Not current smoker   Current smoker   Total
  Income < $20k                                              100
  Income $20k-$50k                                            200
  Income > $50k                                               100
  Total                      300                 100          400

The Total row and column display the total number of people that fall into each category, and the value in the bottom-right corner is the total number of people in our study. The inner cells represent the number of people who reported each combination of income and smoking status. For example, the empty cell in the top-left corner represents the number of people who reported an annual income below $20k and who reported not currently smoking.

Before we examine the observed distribution of our data, we can map how we expect the variables to be distributed under the assumption that income and cigarette smoking are not related. If we assume that these two variables are independent (hint: sounds like a null hypothesis), then we would expect each variable to be distributed in the same proportions across every level of the other. We can calculate the expected value in each cell using the equation E_row,column = (n_row × n_column) / n, where n_row is the total number of participants in that cell's row, n_column is the total number in that cell's column, and n is the total number of participants. Let's calculate the top-left cell: here, n_row = 100, n_column = 300, and n = 400, so the expected number of people reporting an income below $20k who do not currently smoke is (300 × 100) / 400 = 75. We will place this expected value in parentheses in that cell.
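To make this formula concrete, here is a minimal sketch in Python with NumPy (an assumption; the original shows no code for this step) that applies the expected-count formula to every cell at once, starting only from the marginal totals above:

import numpy as np

# Marginal totals from the example: income rows (<$20k, $20k-$50k, >$50k)
# and smoking-status columns (not a current smoker, current smoker)
row_totals = np.array([100, 200, 100])
col_totals = np.array([300, 100])
n = row_totals.sum()                         # 400 participants in total

# Expected count in each cell under independence: (n_row * n_column) / n
expected = np.outer(row_totals, col_totals) / n
print(expected)
# [[ 75.  25.]
#  [150.  50.]
#  [ 75.  25.]]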
We can repeat this process for each cell (the sketch above does exactly this in one step), resulting in the following table of expected values:

                     Not current smoker   Current smoker   Total
  Income < $20k            (75)                (25)          100
  Income $20k-$50k        (150)                (50)          200
  Income > $50k            (75)                (25)          100
  Total                      300                 100          400

Here, we are assuming that the distribution of income is the same among people who do not currently smoke and those who do. Likewise, under our assumption, we expect the distribution of current smoking to be the same across each income level (in this case, 75% not currently smoking and 25% currently smoking). This represents the expected distribution of our two variables under the assumption that they are independent.

Once we have established the expected values, we can fill in the actual observed values from our sample. The following table now includes the observed values (not in parentheses) alongside the expected values:

                     Not current smoker   Current smoker   Total
  Income < $20k           50 (75)             50 (25)        100
  Income $20k-$50k       160 (150)            40 (50)        200
  Income > $50k           90 (75)             10 (25)        100
  Total                      300                 100          400

We can see right away that our observations differ from our expected values: a greater proportion of people making less than $20k currently smoke than expected, and a smaller proportion of people making more than $50k currently smoke than expected. We need a way to quantify this difference.

The difference between the expected value E_row,column and the observed value O_row,column within each cell can be understood to represent our signal. This difference represents how far our observed data are from what we expected under our assumption that the two variables are independent. We could try to sum up all of these differences, like so:

Σ (O_row,column − E_row,column)

This would be the sum of the differences between the observed and expected value in each cell. Interestingly, though, this sum will always equal 0, so it is not a very useful metric. Let's do the calculation by hand to show that it equals 0 for our example table above:

(50 − 75) + (50 − 25) + (160 − 150) + (40 − 50) + (90 − 75) + (10 − 25)
= −25 + 25 + 10 + (−10) + 15 + (−15)
= 0 + 0 + 0
= 0

This is similar to the issue we face when we try to calculate the standard deviation s of a normally distributed variable, and it is why we have to calculate the variance s² first. Importantly, if you square any value (positive or negative), the result will be non-negative, so a sum of squares will only equal 0 if your data perfectly match your expected values. We can define this new sum like so:

Σ (O_row,column − E_row,column)²

Now, when we take this sum, we get a meaningful value:

(50 − 75)² + (50 − 25)² + (160 − 150)² + (40 − 50)² + (90 − 75)² + (10 − 25)²
= (−25)² + 25² + 10² + (−10)² + 15² + (−15)²
= 625 + 625 + 100 + 100 + 225 + 225
= 1900

This is much better and still represents the strength of our signal (i.e., the difference between observed and expected values)! However, notice that the size of this value also depends on the size of our study sample. We could have sampled twice as many people with an identical proportional distribution of observed results, and the strength of the signal would appear much larger, even though the proportional distribution across groups is the same. As such, it is important that we take a step to standardize our calculation of the strength of our signal.
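We can verify both hand calculations with the same hedged Python/NumPy sketch (again, the language and library are assumptions, not part of the original): the raw differences cancel to 0, while the squared differences sum to 1900.

import numpy as np

# Observed and expected counts from the tables above
observed = np.array([[ 50,  50],
                     [160,  40],
                     [ 90,  10]])
expected = np.array([[ 75.,  25.],
                     [150.,  50.],
                     [ 75.,  25.]])

diff = observed - expected
print(diff.sum())          # 0.0    -- positive and negative deviations cancel out
print((diff ** 2).sum())   # 1900.0 -- squaring removes the cancellation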
We standardize by dividing the squared difference between each observed and expected value by the expected value. Now, instead of representing the crude difference between observed and expected values, each term represents the difference between observed and expected values relative to the expected value. We can define this mathematically like so:

χ² = Σ (O_row,column − E_row,column)² / E_row,column

In this instance, we are able to calculate:

χ² = (50 − 75)²/75 + (50 − 25)²/25 + (160 − 150)²/150 + (40 − 50)²/50 + (90 − 75)²/75 + (10 − 25)²/25
   = 625/75 + 625/25 + 100/150 + 100/50 + 225/75 + 225/25
   = 48

We have now calculated a standardized value representing the strength of our signal under the assumption that no signal exists. It turns out that this value corresponds to a χ² distribution with (rows − 1) × (columns − 1) = (3 − 1) × (2 − 1) = 2 × 1 = 2 degrees of freedom. This is very cool, because it allows us to use this metric to assess the probability of our observed data under our assumption that our two variables are not related to one another. But this should immediately raise the question: what is a χ² distribution?

The χ² Distribution

Earlier, we learned about how the normal distribution arises from the way we understand certain natural phenomena to occur at random. A normally distributed variable is one where the mean is the most likely value to observe, where values closer to the mean are more likely than values further from the mean, and where values less than the mean are just as likely to occur as those greater than the mean. The standard normal distribution (Z-distribution) is depicted like so:
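As a minimal sketch of this step (assuming Python with NumPy and SciPy, neither of which the original names), the code below creates an x-axis from -5 to 5 in increments of 0.1 and evaluates the standard normal density over it, then returns to the worked example by converting the χ² statistic of 48 with 2 degrees of freedom into a p-value; SciPy's built-in test is included only as a cross-check of the hand calculation.

import numpy as np
from scipy import stats

# Create the x-axis, ranging from -5 to 5, with increments of 0.1, and
# evaluate the standard normal (Z) density over it; these are the values
# that would presumably be plotted as the familiar bell curve
x = np.arange(-5, 5.1, 0.1)
z_density = stats.norm.pdf(x)

# Back to the worked example: the chi-squared statistic, computed by hand
observed = np.array([[ 50,  50],
                     [160,  40],
                     [ 90,  10]])
expected = np.array([[ 75.,  25.],
                     [150.,  50.],
                     [ 75.,  25.]])
chi2_stat = ((observed - expected) ** 2 / expected).sum()   # 48.0
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)      # (3 - 1) * (2 - 1) = 2

# Probability of a statistic at least this large if income and smoking are independent
p_value = stats.chi2.sf(chi2_stat, df)
print(chi2_stat, df, p_value)    # 48.0 2 ~3.8e-11

# Cross-check with SciPy's built-in test (no Yates continuity correction)
chi2_check, p_check, dof_check, expected_check = stats.chi2_contingency(observed, correction=False)
print(chi2_check, dof_check)     # 48.0 2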
