Chi-Squared (𝜒²) Test

Questions and Answers

Explain how the Chi-squared test helps in analyzing categorical data, especially when direct mapping to distributions is challenging.

The Chi-squared test allows for mapping observed patterns in categorical data to a theoretical distribution, enabling a significance test to assess relationships between variables without direct mapping.

In the context of the Chi-squared test, what does the term 'degrees of freedom' signify, and how is it calculated for a test of independence?

Degrees of freedom indicate the number of values in the final calculation that are free to vary. For a test of independence, it is calculated as $(rows - 1) * (columns - 1)$.

Describe the conditions under which the Chi-squared test of independence is considered valid. What assumptions need to be checked?

Both variables must be categorical, levels of each variable must be mutually exclusive, observations must be independent, and the expected value in each cell should be at least 1, with at least 80% of cells having expected values of 5 or greater.

Explain the difference between 'observed values' and 'expected values' in a Chi-squared test. How are expected values calculated, and what do they represent?

Observed values are the actual data counts from a sample, while expected values are calculated under the assumption of independence between variables, using the formula $(n_x * n_y) / n$, where $n_x$ and $n_y$ are the totals for a specific level of each variable and $n$ is the total number of participants.

How does the Chi-squared test statistic quantify the difference between observed and expected values, and why is it necessary to standardize this difference?

The test statistic is calculated by summing the squared differences between observed and expected values, divided by the expected values: $\sum ((O - E)^2 / E)$. This standardization is necessary because the raw difference depends on the sample size.

Flashcards

What is the Chi-Squared Test?

A statistical test to assess if observed differences in categorical data are due to random chance or a meaningful pattern.

What does the Chi-Squared Test of Independence do?

It helps determine if two categorical variables are related or independent.

How is the Chi-Squared Statistic calculated?

It is calculated by summing the squared differences between observed and expected values, divided by the expected values.

What are Degrees of Freedom?

The number of values in the final calculation of a statistic that are free to vary.


What is the Goodness of Fit Test?

Assesses if the distribution of a single categorical variable matches a predefined distribution.


Study Notes

Pearson's Chi-Squared (𝜒²) Test

  • The 𝜒2 test assesses if observed differences in categorical data are due to random chance or a meaningful pattern.
  • Karl Pearson developed the method, introducing it in 1900.

Expectations vs. Observations

  • With two categorical variables (X and Y), the goal with the 𝜒2 Test of Independence is to determine if X and Y are related.
  • The 𝜒2 test maps observed patterns in data to a theoretical distribution for significance testing.
  • When both variables are categorical, there is no good way of mapping the behavior of either variable on to a theoretical distribution.
  • An example of its use is assessing the relationship between income level and smoking status.
  • If X and Y are independent, the distribution of their values within the sample can be derived from the marginal totals.

Income Level and Smoking Status Example

  • In a hypothetical study, 400 people are recruited to examine the relationship between income and smoking status.

  • Income is measured categorically (<$20k, $20-50k, >$50k).

  • Smoking status is measured categorically (current smoker, not a current smoker).

  • The sample shows the following:

  • 100 people reported <$20k income.

  • 200 people reported $20-50k income.

  • 100 people reported >$50k income.

  • 100 people reported currently smoking.

  • 300 people reported not currently smoking.

  • Data can be organized into a cross-tabulation for analysis.

  • The total row and column display totals for each category.

  • The bottom right displays the total sample size.

  • The cell in the top left corner would represent those with income below $20k who do not currently smoke.

  • A null hypothesis is assumed under which the two variables are independent, so the counts are expected to be distributed in proportion to the row and column totals.

  • The expected value in each cell is calculated using: (𝑛𝑟𝑜𝑤 ∗ 𝑛𝑐𝑜𝑙𝑢𝑚𝑛) / 𝑛, where:

    • 𝑛𝑟𝑜𝑤 is the total participants in a given row.
    • 𝑛𝑐𝑜𝑙𝑢𝑚𝑛 is the total participants in a given column.
    • 𝑛 is the total number of participants.
  • For the top left cell in the example, the expected number of people reporting income less than $20k who don't currently smoke is (300∗100) / 400 = 75.

  • This process is repeated for each cell.

  • In the example, independence assumes the income distribution is the same for smokers and non-smokers.

  • Under independence, the rate of current smoking is assumed to be the same across each income level (75% not currently smoking, 25% currently smoking).

  • Once expected values are established, actual observed values from the sample data are included in the table.

  • Typically the observations differ from the expected values.

  • A greater proportion of people making less than $20k currently smoke than expected, and a smaller proportion of those making more than $50k currently smoke than expected.

  • Quantification of this difference is needed.

  • The difference between expected values (𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛) and observed values (𝑂𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛) represents our signal.

  • This shows how different the observed data is from what we expected under the assumption that the two variables are independent.

  • The differences in all cells can be summed - ∑(𝑂𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛 −𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛 ).

  • This value will always equal 0, and is not a useful metric.
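
The calculations above can be sketched in Python (the notes' analysis is in R). The margins are the ones from the example; the observed counts are hypothetical, chosen only to be consistent with those margins:

```python
# Expected counts under independence, E = (n_row * n_column) / n, using the
# margins from the income/smoking example; observed counts are hypothetical.
row_totals = {"<$20k": 100, "$20-50k": 200, ">$50k": 100}   # income
col_totals = {"non-smoker": 300, "smoker": 100}             # smoking status
n = 400                                                     # total sample size

expected = {(r, c): nr * nc / n
            for r, nr in row_totals.items()
            for c, nc in col_totals.items()}
print(expected[("<$20k", "non-smoker")])  # (100 * 300) / 400 = 75.0

# Hypothetical observed counts (same margins): summing the raw differences
# O - E always gives exactly zero, so it is not a useful signal.
observed = {("<$20k", "non-smoker"): 60, ("<$20k", "smoker"): 40,
            ("$20-50k", "non-smoker"): 150, ("$20-50k", "smoker"): 50,
            (">$50k", "non-smoker"): 90, (">$50k", "smoker"): 10}
diff_sum = sum(observed[cell] - expected[cell] for cell in expected)
print(diff_sum)  # 0.0
```

The cancellation is exact because the over-counts in some cells are forced to be matched by under-counts elsewhere once the margins are fixed.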

Addressing Limitations of Summing Differences

  • This is similar to the issue in calculating standard deviation, where the variance s² (a sum of squared deviations) must be calculated first.
  • Squaring any value makes the result positive.
  • The sum of several squares is only zero if the data perfectly matches the expected values.
  • The new sum is defined as ∑(𝑂𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛 − 𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛)².
  • Squaring the differences yields a meaningful value representing the strength of the signal.

Standardizing the Signal Strength

  • The size of the signal strength depends on the study sample size.
  • Standardization is done by dividing the squared difference between each observed and expected value by the expected value.
  • This represents the difference between observed and expected values relative to the expected value.
  • This can be mathematically defined as 𝜒² = ∑ ((𝑂𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛 − 𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛)²) / 𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛.
  • The standardized value corresponds to a 𝜒² distribution with (𝑟𝑜𝑤𝑠 − 1) ∗ (𝑐𝑜𝑙𝑢𝑚𝑛𝑠 − 1) degrees of freedom.
  • If rows = 3 and columns = 2, then df = (3 − 1) ∗ (2 − 1) = 2. This allows assessing how likely the observed data would be if the two variables were unrelated.
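
Continuing the sketch with the same hypothetical observed counts for the 3×2 income/smoking table, the standardized statistic and its degrees of freedom:

```python
# Standardized signal: chi-squared = sum((O - E)^2 / E) over all cells.
# Observed counts are hypothetical, consistent with the example's margins.
observed = [60, 40, 150, 50, 90, 10]   # hypothetical cell counts
expected = [75, 25, 150, 50, 75, 25]   # from (n_row * n_column) / n

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = (3 - 1) * (2 - 1)                 # (rows - 1) * (columns - 1)
print(chi2, df)  # 24.0 2
```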

The 𝜒² Distribution

  • For a normally distributed variable, the mean is the most likely value, values closer to the mean are more likely, and values less than the mean are as likely as those greater than the mean.
  • Squaring the values of the Z-distribution (standard normal distribution) produces a 𝜒2 distribution.
  • The 𝜒² distribution with 1 degree of freedom is less intuitive than the normal distribution.
  • It describes a normally distributed variable’s behavior rather than describing a natural phenomenon directly.
  • A random variable can be explained by a 𝜒²-distribution if the variable is constructed from the square of a normally distributed variable.
  • The value of 0 is the most likely in a 𝜒2 distribution, and greater values become much more rare.
  • Because 𝜒² is the square of the Z-distribution, negative values are impossible.
  • For a variable following the Z-distribution, ~68% of observations fall between -1 and 1.
  • Squaring a value between -1 and 1 results in a value between 0 and 1.
  • Since the 𝜒2 distribution is just the Z-distribution squared, ~68% of observations on a variable explained by the 𝜒² distribution will fall between 0 and 1.
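
A small simulation (a sketch in Python; numbers are approximate, not a derivation) makes the squared-Z construction concrete:

```python
import random

# Sketch: squaring standard-normal draws yields a chi-squared variable with
# 1 degree of freedom; about 68% of the squared values fall in [0, 1],
# mirroring the ~68% of Z values that fall in [-1, 1].
random.seed(0)
draws = [random.gauss(0, 1) ** 2 for _ in range(100_000)]
share = sum(d <= 1 for d in draws) / len(draws)
print(round(share, 2))  # approximately 0.68
```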

The 𝜒² Distribution with k Degrees of Freedom

  • If a variable (X) is assumed to follow the standard normal Z-distribution, then 𝑋2 follows the 𝜒2 distribution.
  • A broader definition of such distributions exists:
  • if k variables (X1, X2, ..., Xk) are independent and follow the standard normal Z-distribution, a new variable (Y) can be calculated as the sum of their squares: 𝑌 = ∑ 𝑋𝑖^2.
  • Y is distributed according to 𝜒², having k degrees of freedom.
  • As the number of degrees of freedom increases, the shape of the curve shifts drastically.
  • For one standard normal variable, values close to 0 are most likely. However, the likelihood that all k variables are simultaneously close to 0 decreases as k grows.
  • As more squared values are summed together, the resultant value tends to be larger.
  • The 𝜒² distribution with k degrees of freedom is therefore the probability distribution of a variable Y defined as the sum of k independent squares.
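
The same simulation idea extends to k degrees of freedom (a sketch; k = 3 is an arbitrary illustrative choice):

```python
import random

# Sketch: summing k squared standard-normal draws gives a chi-squared
# variable with k degrees of freedom; its mean is k.
random.seed(1)
k = 3
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(k))
           for _ in range(50_000)]
mean = sum(samples) / len(samples)
print(round(mean, 1))  # close to k = 3; no sample can be negative
```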

The 𝜒² Test of Independence

  • To test if two categorical variables are related, 𝜒² is calculated as ∑((𝑂𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛 − 𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛)²) / 𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛.
  • The formula is a sum of squares, ((𝑂𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛 −𝐸𝑟𝑜𝑤,𝑐𝑜𝑙𝑢𝑚𝑛 )^2 ).
  • The 𝜒2 distribution describes the sum of squares of k normally distributed variables.
  • If X and Y are independent, the most likely value is 0 (observed count equals expected count).
  • Assuming independence, the standardized differences approximately follow a normal distribution.
  • The calculated statistic therefore corresponds to a specific 𝜒² distribution, permitting a significance test and the calculation of a p-value.
  • Number of degrees of freedom must still be determined.

Degrees of Freedom

  • The equation sums a square for each row × column combination, meaning rows ∗ columns squares are summed together.
  • The 𝜒² distribution, however, is defined in terms of k independent normally distributed variables.
  • The series of differences being summed is not totally independent.
  • Observing more people in one cell influences how many people can be distributed across the other cells.
  • Suppose the observed number of people reporting an income of less than $20k who report not currently smoking is inserted first.
  • Because the row and column totals are fixed, there is a maximum number of people left to add, so the remaining values are constrained - they are not independent.
  • With 6 cells total in the table, only 2 of the values can vary freely.
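
This constraint can be sketched directly (hypothetical free choices; margins from the example): once 2 cells are chosen, every other cell is forced.

```python
# Sketch: with the margins fixed, choosing values for just 2 cells of the
# 3x2 table determines all the rest -- hence (3-1)*(2-1) = 2 degrees of freedom.
row_totals = [100, 200, 100]     # income levels
col_totals = [300, 100]          # non-smoker, smoker

free = [60, 150]                 # hypothetical free choices: column 1, rows 1-2
col1 = free + [col_totals[0] - sum(free)]         # last cell in column 1 is forced
col2 = [r - c for r, c in zip(row_totals, col1)]  # column 2 is fully determined
print(col1, col2)  # [60, 150, 90] [40, 50, 10]
```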

Conclusions

  • Compare the test statistic to the 𝜒²-distribution with 2 degrees of freedom to find the probability of observing such a value.
  • The p-value is calculated as P(𝜒² ≥ test statistic) under that distribution.
  • A p-value closer to 0 indicates that the observed data is more unlikely under the null hypothesis.
  • If p < α (α = 0.05), the result is significant.
  • A significant result suggests rejecting the null hypothesis that X and Y are independent in favor of the alternative.

Steps to Summarize Variables

  • Calculate the expected value of each X and Y combination, Ex,y.
  • Calculate the observed value for each X and Y combination in our data, Ox,y.
  • Calculate the test statistic 𝜒² = ∑((𝑂𝑥,𝑦 − 𝐸𝑥,𝑦)²) / 𝐸𝑥,𝑦.
  • Calculate the number of degrees of freedom k = (levels(X) – 1) ∗ (levels(Y) – 1).
  • Compare the test statistic to the 𝜒² distribution with k degrees of freedom to calculate the p-value.
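
The steps above can be sketched end to end for the hypothetical 3×2 table. One convenient fact: for df = 2 the chi-squared survival function has the closed form P(X ≥ x) = exp(−x/2), so no stats library is needed in this particular case.

```python
import math

# Sketch of the full procedure with hypothetical observed counts.
observed = [60, 40, 150, 50, 90, 10]   # hypothetical observed counts
expected = [75, 25, 150, 50, 75, 25]   # expected counts under independence

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = (3 - 1) * (2 - 1)
p_value = math.exp(-chi2 / 2)          # closed form valid only because df == 2

print(chi2, df, p_value < 0.05)  # 24.0 2 True
```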

Assumptions of the 𝜒² Test of Independence

  • The 𝜒² Test of Independence assesses the relation of two variables (X and Y) given the following:
  • X and Y are both categorical.
  • The levels of each variable are mutually exclusive. In other words, each participant must belong to one and only one level of both X and Y.
  • Each observation is independent. Data comes from a random sample of independent observations.
  • The value of Ex,y is 5 or greater in at least 80% of table cells, and Ex,y must be at least 1 for every cell.
  • If the final assumption is not met, results are biased.

Checking and Running the Test in R

  • Check assumptions by: confirming X and Y are categorical and levels are mutually exclusive, and generating the table of expected and observed values.
  • To run the test, generate a table with the observed values and pass it to the test function.
  • Pearson's Chi-squared test output includes the test statistic, degrees of freedom, and the p-value.
  • Often, the 𝜒2 test will be run “behind the scenes”. For example, the tableone package automatically runs the test when generating a table with categorical variables.
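
The notes run the test with R's chisq.test; as a language-agnostic illustration, the same workflow (expected counts from margins, then the statistic and df) can be sketched in pure Python. The function and variable names here are my own, not a real API:

```python
# Minimal sketch (the notes use R) of the test-of-independence workflow.
def chi_squared_independence(table):
    rows, cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[r][c] for r in range(rows)) for c in range(cols)]
    n = sum(row_totals)
    stat = 0.0
    for r in range(rows):
        for c in range(cols):
            expected = row_totals[r] * col_totals[c] / n
            stat += (table[r][c] - expected) ** 2 / expected
    df = (rows - 1) * (cols - 1)
    return stat, df

# Hypothetical observed counts for the income/smoking example
stat, df = chi_squared_independence([[60, 40], [150, 50], [90, 10]])
print(round(stat, 1), df)  # 24.0 2
```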

Other Variations of the 𝜒2 Test

  • The other common variations of the 𝜒² test that use the same general approach include the Goodness of Fit Test and the Homogeneity Test.

Goodness of Fit Test

  • Assesses whether the distribution of a single categorical variable X matches a pre-defined distribution.
  • Statistical hypotheses:
  • Null hypothesis (H₀): The distribution of X fits the predetermined distribution.
  • Alternative hypothesis (H₁): The distribution of X does not fit the predetermined distribution.
  • The expected value Ex is equal to the number of people in our sample n times the population proportion for level x.
  • In the handedness example, with a sample of 200 and the knowledge that 85% of people are right-handed, the expected count is Eright-handed = 200 ∗ 0.85 = 170.
  • Calculate the 𝜒² value by comparing the expected values Ex with the observed values Ox: 𝜒² = ∑((𝑂𝑥 − 𝐸𝑥)²) / 𝐸𝑥.
  • This test statistic has k = levels(X) – 1 degrees of freedom and is compared to the 𝜒² distribution with k degrees of freedom.
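
A sketch of the goodness-of-fit calculation for the handedness example, using a hypothetical observed split of 160 right-handed / 40 left-handed (for df = 1 the chi-squared survival function reduces to erfc(√(x/2)), so the p-value needs only the math module):

```python
import math

# Goodness-of-fit sketch: expected counts are n times the known
# population proportions; observed counts are hypothetical.
n = 200
observed = {"right": 160, "left": 40}              # hypothetical counts
expected = {"right": n * 0.85, "left": n * 0.15}   # E_x = n * proportion

chi2 = sum((observed[x] - expected[x]) ** 2 / expected[x] for x in observed)
df = len(observed) - 1                             # k = levels(X) - 1
p_value = math.erfc(math.sqrt(chi2 / 2))           # chi-squared sf for df == 1

print(round(chi2, 3), df)  # 3.922 1
```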
