Statistics - LS4003 PDF
Document Details
Joseph Trigiante
Summary
This document is a presentation on statistical methods, specifically on contingency tables. It covers topics such as chi-squared test, odds ratio, and relative risk.
Full Transcript
STATISTICS LS4003
Joseph Trigiante

PREVIOUSLY…
In the last lecture we examined how to deal with a continuous variable on the Y axis and a categorical variable on the X axis.
[Figure: continuous Y vs categorical X; tests: t-test and ANOVA]
We learned that we need to use the t-test in its various forms for 2 X categories and the ANOVA for 3 or more.

TODAY…
Today we deal with the last case: both variables categorical.
[Figure: categorical Y vs categorical X, example counts 23, 12, 46, 21; tests: chi squared and Fisher]

CATEGORICAL VS CATEGORICAL

WHAT’S SPECIAL?
This case also occurs very often in biomedical research. It happens when the predictor (X axis) variable, instead of causing a change in a Y parameter (such as blood pressure), causes subjects to fall into 2 or more categories of outcome.
[Figure: happy vs unhappy people under Placebo and Pill E; counts shown: 12, 8, 15, 19]

CONTINGENCY TABLES
This example asks how gender influences the likelihood of having an HIV test. Results of this kind of analysis are normally displayed as a table. It is called a contingency table.

ODDS, ODDS RATIO AND RELATIVE RISK

THE METRICS OF THE TEST
Before we go about calculating significances we need to understand what we are after. To say that a condition influences a categorical variable means it is changing the proportions of the numbers in the categories.
Suppose we have this question: does adding milk to your coffee help with your grades? We run the study and get these results:

              Coffee   Coffee + milk
Exam passed     54          36
Exam failed      9           6

At first glance it seems not to work: 36 passed with the milk vs 54 without. But this does not take into account that overall more people drank coffee without milk. Let’s turn these numbers into percentages to account for that:

              Coffee   Coffee + milk
Exam passed    86%         86%
Exam failed    14%         14%

No effect, because the proportions of pass/fail are the same.

ODDS RATIO

              Coffee
Exam passed     54
Exam failed      9

The ratio of observations in two Y categories a and b is called the odds of a vs b.
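The step of turning counts into percentages per condition can be sketched in Python. This is only an illustration and not part of the lecture: the helper name `column_percentages` and the dictionary layout are assumptions made here.

```python
def column_percentages(table):
    """Return, for each condition, the share (in %) of subjects in each outcome.

    `table` maps a condition name to a dict of outcome -> count.
    (Hypothetical helper, named here for illustration only.)
    """
    result = {}
    for condition, outcomes in table.items():
        total = sum(outcomes.values())
        result[condition] = {
            outcome: 100.0 * count / total for outcome, count in outcomes.items()
        }
    return result


# Coffee example: counts of passed / failed exams per condition.
coffee = {
    "Coffee": {"passed": 54, "failed": 9},
    "Coffee + milk": {"passed": 36, "failed": 6},
}

percentages = column_percentages(coffee)
for condition, shares in percentages.items():
    print(condition, {k: round(v, 1) for k, v in shares.items()})
```

Both conditions give the same pass and fail shares, which is what "no effect" looks like in a contingency table.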
For the coffee-only column (54 passed, 9 failed), the odds of passing the exam are 54/9 = 6.
Odds are strictly connected to percentages (86%/14% = 6):

              Coffee
Exam passed    86%
Exam failed    14%

The odds are the same as those used in football: in that case, the ratio of wins to losses.

To say that a certain condition influences a categorical variable means it is changing the odds.

              Coffee   Coffee + milk
Exam passed     54          36
Exam failed      9           6

Odds of passing the exam without milk: 54/9 = 6
Odds of passing the exam with milk: 36/6 = 6
Milk in coffee does not change the odds.

In statistics there is a convenient parameter to measure whether odds change: the odds ratio (OR). The odds ratio is simply the odds under one condition divided by the odds under another condition. In the milk and coffee case the odds were 6 both with and without milk, so OR = 1. An OR of 1 is the equivalent of effect size = 0 in the continuous case: no effect.

This table is from a clinical statistics paper:
Cigarette smoking and lung cancer--relative risk estimates for the major histological types from a pooled analysis of case-control studies. Int J Cancer. 2012 Sep 1;131(5):1210-9. doi: 10.1002/ijc.27339. Epub 2011 Dec 14. PMID: 22052329; PMCID: PMC3296911.
Look what they’ve done.
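Before looking at the paper’s numbers, here is the coffee and milk odds ratio as a short Python sketch. The function names `odds` and `odds_ratio` are chosen here for illustration and do not come from the lecture.

```python
def odds(events, non_events):
    """Odds of an outcome: count with the outcome divided by count without it."""
    return events / non_events


def odds_ratio(odds_a, odds_b):
    """Odds under condition A divided by odds under condition B."""
    return odds_a / odds_b


odds_without_milk = odds(54, 9)  # passed / failed, coffee only
odds_with_milk = odds(36, 6)     # passed / failed, coffee + milk

or_milk = odds_ratio(odds_with_milk, odds_without_milk)
print(odds_without_milk, odds_with_milk, or_milk)  # 6.0 6.0 1.0
```

An odds ratio of 1 means the condition leaves the odds unchanged.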
Using the counts from that paper:
The odds of contracting lung cancer if you’re not a smoker are 220/2883 = 0.0763
The odds of contracting lung cancer if you are a smoker are 6784/3829 = 1.77
The odds ratio (OR) is therefore 1.77/0.0763 = 23.2

RELATIVE RISK
The relative risk (RR) is a very similar parameter. Where the odds are the ratio of one event over another, the risk is one event over all events (the percentage we saw before).
In the coffee example the odds of passing the exam are 54/9 = 6, while the “risk” (so to speak) of passing the exam is 54/(54+9) = 86%.

              Coffee
Exam passed    86%
Exam failed    14%

Once you know the risks for both conditions, the relative risk is just their ratio, like the OR.

              Coffee   Coffee + milk
Exam passed    86%         86%
Exam failed    14%         14%

Risk of passing without milk: 86%
Risk of passing with milk: 86%
Relative risk of passing by adding milk to coffee: 86%/86% = 1

As you can see, there is only a minor difference between RR and OR. The “zero” (H0, no effect) for both is actually 1: OR = 1 and RR = 1.

Let’s calculate the relative risk for smokers to get lung cancer.
The absolute risk for non-smokers to get lung cancer is 220/(2883 + 220) = 7.1%
The absolute risk for smokers to get lung cancer is 6784/(3829 + 6784) = 63.9%
The relative risk (RR) is therefore 63.9%/7.1% = 9.0

OR AND RR
OR is used most often and we will focus on it. But OR is a metric (like r and the effect size): we need to test it for significance and obtain a p-value.

CHI SQUARED TEST
Let’s see how to test it for significance. Our friend Karl Pearson (the one behind the r correlation coefficient) also developed a significance test for contingency tables: Pearson’s chi squared (χ²) test.
This test is based on the difference between the expected values in a table and the observed ones. The expected values are those we would get if the condition had no effect on the outcome (perfect H0), a “control”. By taking the differences between these two sets we calculate a value (the χ²). The bigger this value, the more significantly different the observed values are from the control.
The expected value set has the same total number of observations but in a different ratio. For example, we want to know if having lunch before an exam helps or not:

        After meal   Before meal
Pass        35           30
Fail        15           20

As you can see, it looks like it helps. The odds are 35/15 = 2.33 vs 30/20 = 1.5, so the OR is 2.33/1.5 = 1.56.
Let’s see what the expected value table looks like. We’ll see later how to make it. Note that now the odds ratio is exactly 1, as expected.

Observed:
        After meal   Before meal
Pass        35           30
Fail        15           20

Expected:
        After meal   Before meal
Pass       32.5         32.5
Fail       17.5         17.5

(It’s OK to have fractions of people in an expected value table, but not in the observed one.)
Let’s tabulate the differences (observed minus expected):

        After meal   Before meal
Pass        2.5         -2.5
Fail       -2.5          2.5

These differences are combined mathematically to give a χ² value. As with the t-test, this value is then looked up in a table that gives the p-value. Fortunately, we don’t need to do this, because we have R and Excel!

CHI SQUARED TEST - EXCEL
Let’s see how to carry out a chi squared test in Excel. We load our observed value table on a spreadsheet:

        After meal   Before meal
Pass        35           30
Fail        15           20

Then we calculate the sum of all rows and columns, and in the corner the total of all observations.
Now we create the expected value table. Make a copy of the table on the left without numbers.
Now the numbers.
Every expected value cell is the product of its row total and its column total, divided by the grand total.
Row 1, Column 1: total row 1 x total column 1 / grand total = 65 x 50 / 100 = 32.5
The same goes for the other 3 cells, for example:
Row 2, Column 1: total row 2 x total column 1 / grand total = 35 x 50 / 100 = 17.5
We finally have the 2 tables we need.
Now we type the CHISQ.TEST function in any cell. Observed data is the first argument, expected data the second:
=CHISQ.TEST(F14:G15,K14:L15)
And here is our p-value. The result is not significant, so H0 cannot be rejected! Appearances can be deceiving.
While the Excel version requires you to produce the expected values table, the R version does not: it simply takes the observed contingency table and outputs all the stats. Attend this week’s workshop to try it for yourselves.

CONFIDENCE INTERVALS
The OR, like effect sizes (and r coefficients), also has a confidence interval (CI). This is the range that contains the true OR with 95% confidence. We won’t be calculating them, but just remember, as for the effect size:
OR CI doesn’t include 1 = result significant

OR AND LOG(OR)
Sometimes you will find the result as log(OR) instead of OR. This just means your “no effect” value becomes zero, as log(1) = 0.
OR CI doesn’t include 1 = log(OR) CI doesn’t include 0 = result significant

FISHER’S EXACT TEST
The chi squared test is actually a simplification. Calculating the real p-value from a contingency table requires a lot of math.

         Res1   Res2
cond1     a      c
cond2     b      d

The factorial: n! = 1 x 2 x 3 x 4 x … x n
The probability of observing a given table is
p = (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! N!), where N = a + b + c + d is the grand total.
This equation is Fisher’s exact test. The numbers become impossible over 20: try doing 20! and you will see. That’s why chi squared was developed.
However, Fisher’s test is always exact, whereas Pearson’s chi squared breaks down for small numbers. This is why we use Fisher when the numbers are small. How small? Here’s a rule of thumb. You can use chi squared if:
1. The grand total exceeds 50
2. Every observation exceeds 5
Otherwise use Fisher.

Example:

        After meal   Before meal
Pass        35           30
Fail        15           20

The grand total is 100. Every cell > 5. Chi squared OK.

Example:

           Week1   Week2
Rainy        3       5
Sunshine     4       2

The grand total is 14. No cell exceeds 5. We must use Fisher.

Example:

          Law 1   Law 2
For        125     185
Against     64       4

The grand total is 378. One cell < 5. We must use Fisher.

You will seldom have to use Fisher’s exact test because in biomedicine numbers are normally large enough. If you do need it, Excel unfortunately doesn’t have it. We must use R or other statistical software (Jamovi, SPSS).

LARGER TABLES

LARGER CONTINGENCY TABLES
All we said so far applies to 2x2 contingency tables, that is, two conditions. What if we have more? We may have several conditions affecting one outcome, such as drug dosages vs disease recovery:

          Recovered   Not
Control       5        15
50 mg         6        12
100 mg        8         8
400 mg       13         9

The good news is that we can still use chi squared and Fisher on a table of any size: the table gives a χ², which gives a p-value.
The bad news is that, like ANOVA, it will only tell us if ONE condition sticks out but not WHICH one.
p-value 50 and no observation
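As a recap of the chi squared and Fisher sections, the whole procedure for a 2x2 table can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the method used in the workshop: all function names are hypothetical, the chi squared p-value is computed without a continuity correction, and the Fisher p-value uses one common two-sided convention (summing every table that is no more likely than the observed one).

```python
import math


def chi_squared_ok(table):
    """Lecture rule of thumb: grand total > 50 and every cell > 5."""
    cells = [count for row in table for count in row]
    return sum(cells) > 50 and all(count > 5 for count in cells)


def expected_table(table):
    """Expected counts under H0: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_2x2(table):
    """Pearson chi squared statistic and p-value for a 2x2 table (1 degree of freedom).

    Assumption of this sketch: no continuity correction is applied.
    """
    expected = expected_table(table)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # Survival function of the chi squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table.

    Assumption of this sketch: "two-sided" means summing the probabilities of
    all tables with the same margins that are no more likely than the observed one.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Number of ways to get x in the top-left cell with fixed margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x)

    low, high = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(x) for x in range(low, high + 1) if weight(x) <= observed)
    return extreme / math.comb(row1 + row2, col1)


lunch = [[35, 30], [15, 20]]  # rows: pass, fail; columns: after meal, before meal
weather = [[3, 5], [4, 2]]    # rows: rainy, sunshine; columns: week 1, week 2

print(chi_squared_ok(lunch), expected_table(lunch))
print(chi_squared_2x2(lunch))
print(chi_squared_ok(weather), fisher_exact_2x2(weather))
```

Working with integer counts of arrangements inside `fisher_exact_2x2` avoids comparing floating-point probabilities. Statistical software may apply corrections or other two-sided conventions, so its p-values can differ slightly from this sketch.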