CHS 729 Week 4: Fundamentals of T-tests and Chi-Squared Tests
Summary
This document is a presentation on statistical methods focusing on t-tests and chi-squared tests. It covers the main t-test variants, the logic behind them, and the assumptions they require, then works through the chi-squared test of independence and briefly previews one-way ANOVA. The slides are from CHS 729, a course in applied statistics.
Full Transcript
CHS 729 Week 4: T-tests, Chi-Squared Tests

Bivariate Associations
Today, we are going to start talking about how we can compare two variables. Often, we have a variable that places observations into groups (e.g., trial/control, men/women) and we want to know if group membership is associated with some other variable. For example, maybe we want to know if PhD students are more anxious than undergrads, or if men are more likely to binge drink than women. We need to learn methods that let us ask such questions!

Two Foundational Tests
The t-test and chi-squared test are two of the most important inferential tests. Today, we are going to talk about them in reference to comparing groups and making bivariate comparisons. But these tests also play significant roles when we employ other techniques, such as determining the significance of regression coefficients. The goal of today is to understand these tests, the logic underlying them, and how we can anticipate using them.

The Student's T-Test
The t-test is commonly used when we have a normally distributed variable X and we want to know if group membership is associated with different values of X. There are three common variations of the t-test: the one-sample t-test, the independent two-sample t-test, and the paired t-test. We will focus on the independent two-sample t-test and then discuss the others after.

The Independent Samples T-Test
Let's say we have a normally distributed random variable X and two groups, G1 and G2. We want to know if the population-level means of X for G1 and G2 are the same or different. The t-test allows us to define the following null and alternative hypotheses:
H0: μG1 = μG2
HA: μG1 ≠ μG2
The null is that the population means are equal and the alternative is that they are not.

The Logic of the T-Test
Equivalently, we can represent our hypotheses as follows:
H0: μG1 - μG2 = 0
HA: μG1 - μG2 ≠ 0
To run our study, we are going to collect a sample of individuals, identify which group (G1 or G2) each person is in, and measure X for each person. We will then be able to calculate our sample mean values for each group, x̄G1 and x̄G2. From here we can calculate the difference, x̄G1 - x̄G2.

If the null hypothesis is true, then intuitively we can understand that the most likely value for x̄G1 - x̄G2 is 0, that values closer to 0 are more likely than values further away, and that values greater than 0 are just as likely as values less than 0. In other words, the possible values of x̄G1 - x̄G2, assuming the null is true, appear to be normally distributed.

But...
A normal distribution is defined by two population-level parameters, the mean μ and the standard deviation σ. However, even if we know that X is normally distributed, we often do not know σ. This is fairly common in drug use epidemiology, where our populations are understudied (or difficult to fully capture), such as people who inject drugs or undergraduates who vape. So, a new distribution was developed...

The t-Distribution
The t-distribution is a variation of the standard normal distribution (Z-distribution). Like the Z-distribution, the t-distribution has a mean value of 0 and is symmetrical around the mean. However, it is a little bit "wider" and a bit "shorter" than the Z-distribution. This is because we have to derive the standard deviation from the sample, and our sample is typically made up of a small number of people.
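As a quick illustration of that last point (this sketch is an editorial addition, not code from the slides), the base-R snippet below overlays the Z density with t densities at 1 and 30 degrees of freedom; t(1) is visibly shorter and wider, while t(30) is nearly indistinguishable from Z:

  x <- seq(-4, 4, length.out = 400)
  plot(x, dnorm(x), type = "l", lwd = 2, ylab = "Density",
       main = "Z vs. t distributions")
  lines(x, dt(x, df = 1), lty = 2)    # t(1): shorter peak, heavier tails
  lines(x, dt(x, df = 30), lty = 3)   # t(30): nearly identical to Z
  legend("topright", legend = c("Z", "t(1)", "t(30)"), lty = 1:3, lwd = c(2, 1, 1))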
The t-distribution and degrees of freedom
We define the t-distribution in terms of "degrees of freedom." The more degrees of freedom we have to define our t-distribution, the more similar it becomes to the Z-distribution. The degrees of freedom represent the amount of data we have to calculate the variability (i.e., standard deviation) of our data. But what are degrees of freedom?

Degrees of Freedom
Degrees of freedom (df) refer to the number of parameters that are able to "vary freely" given some assumed outcome. For example, let's say we have 100 participants and we know their mean age is 60 years old. There are infinite possibilities for how age could be distributed throughout this group, BUT if we know 99 of their ages, then the final person's age is fixed. In other words, to calculate the mean value, one observation cannot "vary freely." In this example, we have n = 100 observations and must spend 1 df to calculate the mean.

Degrees of Freedom & Normal Distributions
Here's the thing: the normal distribution is defined by a mean value and a standard deviation. Say we have n observations; we have to "spend" one degree of freedom to identify the mean value. Then, we have n - 1 degrees of freedom remaining to calculate the standard deviation. The t-distribution is defined by n - 1 degrees of freedom because we are trying to calculate the standard deviation from our sample (as opposed to from a known population-level metric). The more observations we have, the more degrees of freedom we have to inform our t-distribution.

Degrees of Freedom & the t-Distribution
The t-distribution is intended to capture uncertainty in the measurement of the standard deviation from a small sample. The fewer df we have, the less certain we are that our measured standard deviation s represents the population-level standard deviation σ. To capture this, the t-distribution is "shorter" and "wider" than the Z-distribution: since we are less certain about the standard deviation, values further from 0 become more probable.

[Figures: density plots comparing t(1) with Z, and showing that t(30) approaches Z.]

Mapping Our Test to t(n - 1)
The t-test is almost identical to the z-test, except now we map our signal onto the t(n - 1) distribution. To do so, we first calculate our signal, x̄G1 - x̄G2. However, we now need to standardize our signal by dividing it by the noise. For the t-test, we divide our signal by the standard error of the mean.

Standard Error of the Mean
The standard error of the mean is the "conservative" estimate of the standard deviation of the sample mean (since we don't know the population-level standard deviation). SE can be calculated as:
SE = s / √n
where s is the standard deviation of X in the sample. This equation in full form can be written as:
SE = √( Σ(xi - x̄)² / (n - 1) ) / √n

Calculating the t-statistic
To calculate our test statistic, t, we use the following equation:
t = (x̄G1 - x̄G2) / SE
where, for two independent groups, the standard error of the difference in means is SE = sp · √(1/nG1 + 1/nG2) and sp is the pooled standard deviation. By dividing by the standard error of the mean, we have taken our signal and mapped it onto the t-distribution with nG1 + nG2 - 2 degrees of freedom. This is because we have nG1 - 1 degrees of freedom to calculate the standard deviation for G1 and nG2 - 1 for G2.

Mapping Our Value onto the t-Distribution
Now we can map our test statistic onto the appropriate t-distribution. Let's say that G1 has 100 people and so does G2. Suppose G1 has an average of x̄G1 = 21 and G2 has some other average x̄G2, with a pooled standard deviation of 3. From these values we can calculate the t-statistic and map it onto a t-distribution with 100 + 100 - 2 = 198 degrees of freedom!
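To make the arithmetic concrete, here is a small sketch of the same calculation in R. The slide gives n = 100 per group, a G1 mean of 21, and a pooled standard deviation of 3; the G2 mean of 20 used below is a made-up placeholder (the transcript does not preserve the slide's value), so the resulting t is illustrative only:

  n1 <- 100; n2 <- 100
  xbar1 <- 21
  xbar2 <- 20                              # hypothetical value, not from the slides
  sp <- 3                                  # pooled standard deviation
  se <- sp * sqrt(1/n1 + 1/n2)             # standard error of the difference
  t_stat <- (xbar1 - xbar2) / se           # signal divided by noise
  df <- n1 + n2 - 2                        # 198 degrees of freedom
  p_two_tailed <- 2 * pt(-abs(t_stat), df) # two-tailed p-value from t(198)
  c(t = t_stat, df = df, p = p_two_tailed)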
Two Tail vs. One Tail

If p < 0.05
If the calculated p < .05, then we consider this significant evidence against our null hypothesis. This indicates that the signal (or a more extreme signal) would be observed less than 5% of the time if the null were true, which provides evidence that our null hypothesis is not true.

Assumptions for the t-test
Our variable of interest X must be measured on an ordinal or continuous scale. Data must be drawn from a random sample, and the two groups being compared must be independent. X must be normally distributed; as our sample size gets larger, this assumption becomes weaker, because the t-test is more robust to violations of normality in larger samples. The variance of X in both groups must be the same; in other words, the standard deviation of X in both groups must be roughly equal.

Testing the Assumption of Normality with the Shapiro-Wilk Test
Testing the Assumption of Homogeneity of Variance with Levene's Test
Running the t-test in R
(A sketch of these R calls appears below, after the two t-test variations.)

Variation: One-Sample t-test
We can run a t-test comparing the mean of X for one group to some pre-defined level, y. In such a case, our null hypothesis is that μ = y. We then calculate our sample mean x̄ and our sample standard deviation s and calculate our t-score:
t = (x̄ - y) / (s / √n)
This is compared to a t-distribution with n - 1 degrees of freedom.

Variation: Paired-Samples t-test
We can run a t-test comparing the mean of X for one group at time 1 versus at time 2. In such a case, our null hypothesis is that μd = 0, where d is the difference in measurement from time 1 to time 2. We then calculate the sample mean d̄ and the sample standard deviation sd of the differences and calculate our t-score:
t = d̄ / (sd / √n)
This is compared to a t-distribution with n - 1 degrees of freedom.
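The assumption-checking and t-test slides above show R output that did not survive extraction, so here is a hedged sketch of what those calls typically look like. The data frame dat and the columns anxiety, group, anxiety_t1, and anxiety_t2 are placeholder names, not objects from the course; leveneTest() comes from the add-on car package:

  # Shapiro-Wilk test of normality, run within each group
  by(dat$anxiety, dat$group, shapiro.test)

  # Levene's test of homogeneity of variance (requires the car package)
  car::leveneTest(anxiety ~ group, data = dat)

  # Independent two-sample t-test (var.equal = TRUE gives the pooled Student's t-test)
  t.test(anxiety ~ group, data = dat, var.equal = TRUE)

  # One-sample t-test against a pre-defined value y, e.g. y = 10
  t.test(dat$anxiety, mu = 10)

  # Paired t-test for the same people measured at two time points
  t.test(dat$anxiety_t1, dat$anxiety_t2, paired = TRUE)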
Chi-Squared Test

The chi-squared test can be used to assess whether two categorical variables X and Y are independent. Given X and Y, our null and alternate hypotheses are:
H0: X and Y are independent
HA: X and Y are not independent
We run this test by comparing the observed pattern in the joint distribution of X and Y to what we would expect to observe if the null were true.

Comparing Observed vs. Expected Frequencies
We can look at the frequencies of two categorical variables at the same time. This is called a contingency table (or crosstabs). Let us look at this example of housing status by year of college:

            On-Campus   Off-Campus   Total
  1st year      97           3         100
  2nd year      82          18         100
  3rd year      51          49         100
  4th year      12          88         100
  Total        252         158         400

Next to each observed count we are going to calculate, in parentheses, the expected value for that cell if housing and year were independent. Essentially, we are distributing housing status evenly across years. To do this for each cell, we multiply the number of people in that row by the number of people in that column and divide by the total number of people:
expected = (N_row × N_column) / N

For example, for 1st-year On-Campus, we start with our equation and see that N_row = 100, N_column = 252, and N = 400. Therefore, our expected value is (100 × 252) / 400 = 63. We may notice this calculation is the same for every On-Campus cell, because every row has 100 people in it. We do the same calculation for the second column and find an expected value of 37 for each Off-Campus cell.

Now we can calculate our chi-squared score. The goal of the chi-squared test is to identify whether the observed counts are similar to or different from the expected counts. The greater the difference between the observed and expected counts, the higher our chi-squared score (and the lower our corresponding p-value). In this example, we can see that 1st years are much more likely to live on-campus than 4th years, so we might expect to see a significant result.

Calculating Chi-Squared
To calculate the chi-squared score, you go cell by cell and calculate (observed - expected)² / expected. Once you have calculated this for every cell, you take the sum. With the expected counts in parentheses and each cell's contribution in brackets, you get the following:

            On-Campus         Off-Campus        Total
  1st year   97 (63) [18.3]     3 (37) [31.2]    100
  2nd year   82 (63) [5.7]     18 (37) [9.8]     100
  3rd year   51 (63) [2.3]     49 (37) [3.9]     100
  4th year   12 (63) [41.2]    88 (37) [70.3]    100
  Total         252               158            400

For this example, chi-squared = 182.7.

Wait, what's a chi-squared distribution?
We learned that the normal distribution arises from how we understand certain natural phenomena to occur. The chi-squared distribution with one degree of freedom is the square of the Z-distribution. Square? This means that for any value x drawn from the Z-distribution, its square x² falls on the chi-squared distribution.

[Figure: the Z-distribution compared with the chi-squared distribution with df = 1. Area-under-curve equivalence: 68% of Z lies between -1 and 1, and 68% of chi-squared(1) lies between 0 and 1.]

Chi-squared with k degrees of freedom
In more general terms, let us say we have k random variables that are independent and each follow the Z-distribution. Let's calculate a new variable Y by taking the sum of the squares of these variables:
Y = Z1² + Z2² + … + Zk²
Y is understood to be distributed according to the chi-squared distribution with k degrees of freedom.
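A quick simulation (an editorial sketch, not from the slides) shows this in base R: summing k squared standard normal draws reproduces the chi-squared distribution with k degrees of freedom:

  set.seed(729)                             # arbitrary seed for reproducibility
  k <- 3
  z <- matrix(rnorm(10000 * k), ncol = k)   # 10,000 draws of k independent Z's
  y <- rowSums(z^2)                         # Y = Z1^2 + Z2^2 + ... + Zk^2
  hist(y, breaks = 60, freq = FALSE,
       main = "Sum of k squared Z's vs. chi-squared(k)")
  curve(dchisq(x, df = k), add = TRUE, lwd = 2)   # overlay the chi-squared(k) density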
Our test statistic is a sum of squares
In our test, we calculated
chi-squared = Σ (observed - expected)² / expected
summed over every cell. Interestingly, that is a sum of squares! So, this follows a chi-squared distribution. We summed 8 squares (4 years × 2 housing options). But they are not independent! If we have more people in one housing option, then we inherently have fewer people in another. So how do we determine the right number of degrees of freedom?

Going back to our test
For this example, chi-squared = 182.7. This value was calculated from the difference between what we expected assuming the null and what we actually observed. To compare our test statistic to a distribution, we need the number of degrees of freedom. For a contingency table, degrees of freedom (df) can be calculated as:
df = (number of rows - 1) × (number of columns - 1)
But why?

Degrees of freedom
Let us say I start with a blank board: I just have the total in each row and each column. The degrees of freedom are the number of pieces of information I need to fill out the table. I start by filling in 1st-year On-Campus (97); from the row total of 100, the 1st-year Off-Campus cell is then fixed. Next, I fill in 2nd-year On-Campus (82) and can again deduce the remaining square in that row. Finally, I fill in 3rd-year On-Campus (51), and voila, the row and column totals now fix every remaining cell, so I have enough information to fill out the whole board. Only three cells had to be supplied, which matches df = (4 - 1) × (2 - 1) = 3.

Calculating the p-value
With chi-squared = 182.7 and 3 degrees of freedom, we get p < 0.00001. We get this value by taking the area under the curve of the chi-squared distribution with 3 degrees of freedom beyond our test statistic.

Assumptions of the chi-squared test
X and Y are both categorical. The levels of X and Y are mutually exclusive; in other words, each participant must belong to one and only one level of each. Each observation is independent; in other words, our data are drawn from a random sample. The expected value should be 5 or greater for at least 80% of cells and must be at least 1 for every cell.

Running chi-squared in R
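Here is a sketch of that call on the housing example (the slide itself is a screenshot). Note that R computes the expected counts directly from the table's own margins, so its statistic may differ slightly from the hand calculation above:

  housing <- matrix(c(97,  3,
                      82, 18,
                      51, 49,
                      12, 88),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(Year = c("1st", "2nd", "3rd", "4th"),
                                    Housing = c("On-Campus", "Off-Campus")))
  res <- chisq.test(housing)   # chi-squared test of independence on the 4x2 table
  res                          # X-squared statistic, df = 3, and p-value
  res$expected                 # expected counts under independence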
We Will Talk About One-Way ANOVA, but a Quick Overview
In the syllabus, I noted that we would discuss ANOVA today. Interestingly, just like how the chi-squared distribution arises by squaring the Z-distribution, the F-distribution arises by taking the ratio of two chi-squared distributed variables (each divided by its degrees of freedom). I want to be able to better explain how this arises to you. One-way ANOVA allows us to compare the group means of three or more groups (extending the t-test) and determine if they are all the same or if they differ in some way.

For One-Way ANOVA
Let's say we measured a normal random variable X and we have k groups. We want to know if the mean value across the groups is the same or different. Our null hypothesis is:
H0: μ1 = μ2 = … = μk
The alternate hypothesis is that they do not all equal each other. This could be all of them being different, or even just one.

Logic of One-Way ANOVA

Running one-way ANOVA in R (see the sketch at the end of this section)

Assumptions
Each observation must be independent. X must be a normally distributed variable within each group. The distribution of X for each group must have the same variance.
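A minimal sketch of the one-way ANOVA call in R (the slide itself shows a screenshot); dat, anxiety, and group are the same placeholder names used in the earlier sketches, with group now having three or more levels:

  fit <- aov(anxiety ~ group, data = dat)   # one-way ANOVA of anxiety across groups
  summary(fit)                              # F statistic and p-value for the group effect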