Statistical Inference – Two Categorical Variables
Monash University
Summary
This document introduces statistical inference for two categorical variables. It explains two-way contingency tables and marginal and conditional distributions, and it discusses Simpson's paradox.
Full Transcript
Two-way contingency table

Recall that categorical variables are variables whose values fall into groups or categories. Some variables, such as gender and race, are categorical by nature. Other categorical variables are created by grouping the values of a quantitative variable into classes.

To analyze categorical data, we use the counts or percents of individuals that fall into the various categories and summarize them in a table: a two-way contingency table.

Example: Living arrangement is the row variable, and age is the column variable. The number of observations falling into each combination of categories is entered into each cell of the table.

[Two-way table of living arrangement by age]

Is living arrangement related to age?

Marginal distributions

For marginal distributions, look at the distribution of each variable separately.

Marginal distribution of living arrangement

Living arrangement       Count / Total   Percent
Parent's home            1357/2984       45.5%
Another person's home     162/2984        5.4%
Your own place           1254/2984       42.0%
Group quarters            192/2984        6.4%
Other                      19/2984        0.6%

Marginal distribution of age

Age   Count / Total   Percent
19    540/2984        18.1%
20    766/2984        25.7%
21    801/2984        26.8%
22    877/2984        29.4%

The percentages in each distribution should sum to 100, apart from rounding error. If the row and column totals are missing, calculate them first.

Conditional distributions

Marginal distributions tell us nothing about the relationship between two variables. A conditional distribution of a variable is the distribution of values of that variable among only those individuals who have a given value of the other variable.

[Table: conditional distributions of living arrangement for each age group]

Exercise: Find the conditional distribution of age for each living arrangement.

Conditional distributions (bar chart)

Conditional distributions can be displayed as a stacked 100% bar chart.

[Figure: stacked 100% bar chart]

If the conditional distributions are nearly the same for each category, no relationship is evident.
If there are significant differences in the conditional distributions, a relationship may be present.

Simpson's Paradox - Effect of a Third Variable

A trend appears in several different groups of data but disappears or reverses when these groups are combined. This reversal is called Simpson's paradox.

For quantitative data: be careful of lurking variables.

The lurking variables in Simpson's paradox are categorical. The lurking variables create subgroups, and failing to take these subgroups into consideration can lead to misleading conclusions about the association between the two variables.

Example: Do medical helicopters save lives?

For categorical data: accident victims transported to hospital

                  Helicopter   Road
Victim died               64    260
Victim survived          136    840
Total                    200   1100

Consider the column percentages, i.e. the conditional distributions of outcome given mode of transport. Overall, the helicopter has the lower survival rate: 136/200 = 68.0% versus 840/1100 = 76.4% by road. How can this happen?

Same data broken down by the seriousness of the accident

Serious accidents
            Helicopter   Road
Died                48     60
Survived            52     40
Total              100    100

Less serious accidents
            Helicopter   Road
Died                16    200
Survived            84    800
Total              100   1000

Both groups have a higher survival rate when evacuated by helicopter: 52% versus 40% for serious accidents, and 84% versus 80% for less serious accidents. Helicopters are sent mostly to serious accidents (100 of the 200 helicopter victims, against 100 of the 1100 road victims), which pulls their overall survival rate down. Taking the subgroups into account can reverse the overall interpretation of the data (Simpson's paradox).
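The reversal in the helicopter example can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original slides: the counts are copied from the two subgroup tables, while the names `survival_rates` and `combine` are hypothetical helpers introduced only for illustration.

```python
# Counts from the helicopter example, stored as (died, survived) per mode of transport.
serious = {"Helicopter": (48, 52), "Road": (60, 40)}
less_serious = {"Helicopter": (16, 84), "Road": (200, 800)}


def survival_rates(table):
    """Conditional distribution of outcome given transport: proportion who survived."""
    return {mode: survived / (died + survived)
            for mode, (died, survived) in table.items()}


def combine(*tables):
    """Add subgroup tables cell by cell to obtain the aggregated two-way table."""
    combined = {}
    for table in tables:
        for mode, (died, survived) in table.items():
            d, s = combined.get(mode, (0, 0))
            combined[mode] = (d + died, s + survived)
    return combined


overall = combine(serious, less_serious)

# Survival rate by mode of transport, within each subgroup and for all accidents.
for label, table in [("Serious", serious),
                     ("Less serious", less_serious),
                     ("All accidents", overall)]:
    rates = survival_rates(table)
    print(label, {mode: f"{rate:.1%}" for mode, rate in rates.items()})

# The lurking variable: how much of each mode's caseload was serious accidents.
for mode in overall:
    share = sum(serious[mode]) / sum(overall[mode])
    print(mode, f"{share:.1%} of victims were in serious accidents")
```

Within each subgroup the helicopter rate is higher, yet the combined rate is lower, because half of the helicopter cases but only about 9% of the road cases come from serious accidents.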