Summary

This document introduces the t-test, a statistical method for comparing two groups, focusing on the independent samples t-test and the Student's t-distribution. The text explains statistical hypotheses, the logic of t-tests, degrees of freedom, and the assumptions needed for running an independent samples t-test.

Full Transcript

An Introduction to the T-Test: Comparing Two Groups with the Independent Samples T-Test

A common objective in research is to compare groups of people. For example, we might wish to know: whether PhD students are more anxious than undergraduate students; whether men are more likely to binge drink than women; or whether students in different school districts perform differently on standardized tests. The t-test is one test we can use, when the data calls for it!

Suppose we have a random sample that we have split into two groups, G1 and G2, and we have measured a numeric variable X that we assume is normally distributed. For example, G1 might represent men, G2 might represent women, and X might be systolic blood pressure. We are curious whether group membership (G1 vs. G2) is associated with different values of X. One way to frame this as a scientific question is to ask whether: "The mean value of X is different between groups G1 and G2." Let's define μ_G1 as the population-level mean of X for G1, and μ_G2 as that for G2. What we are essentially asking is: does μ_G1 = μ_G2?

Statistical Hypotheses of the Independent Samples T-Test

The null and alternative hypotheses of the t-test are typically presented as:

H0: μ_G1 = μ_G2
HA: μ_G1 ≠ μ_G2

Importantly, μ_G1 = μ_G2 is the same as μ_G1 − μ_G2 = 0. Therefore, we can also state the null and alternative hypotheses as follows:

H0: μ_G1 − μ_G2 = 0
HA: μ_G1 − μ_G2 ≠ 0

So, the independent samples t-test will attempt to assess the probability of our observed data assuming that H0 is true.

Logic of the T-Test

Before we learn how to run a t-test, we need to think through what our null hypothesis means and what it would mean to reject it. When we run a statistical test, we assume that our null hypothesis is true.
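The transcript has not yet shown how the test statistic itself is computed. As a hedged sketch, the standard pooled-variance formula for the independent samples t-statistic can be worked out by hand; the blood-pressure values below are made up purely for illustration:

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg) for two small groups;
# these values are invented for illustration, not from the transcript.
g1 = [118, 125, 130, 121, 119, 127, 124, 122]  # e.g., men
g2 = [112, 117, 121, 115, 119, 113, 118, 116]  # e.g., women

n1, n2 = len(g1), len(g2)
mean1, mean2 = statistics.mean(g1), statistics.mean(g2)
s1, s2 = statistics.stdev(g1), statistics.stdev(g2)

# Pooled standard deviation (assumes equal population variances)
sp = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5

# t-statistic: difference in sample means scaled by its standard error
t = (mean1 - mean2) / (sp * (1 / n1 + 1 / n2) ** 0.5)

# Degrees of freedom: one spent per sample mean
df = n1 + n2 - 2

print(round(t, 3), df)
```

The larger |t| is, the less plausible the observed difference in means is under H0.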
If we assume that the null is true and that μ_G1 − μ_G2 = 0 at the population level, then it follows that if we were to sample a set of people from G1 and G2 and look at the difference in their group means of X, 0 would be the most likely value. While we use μ_G1 and μ_G2 to represent the population-level means, we use x̄_G1 and x̄_G2 to represent the mean values of our sample groups. However, we know that sampling essentially never yields a perfect representation of the population, so it is actually quite likely that the difference in sample group means (x̄_G1 − x̄_G2) won't equal 0 exactly. This shouldn't worry us too much: if the null is true, we would expect x̄_G1 − x̄_G2 to be a value fairly close to 0 (whether positive or negative). In fact, if the null hypothesis is true, we would expect values of x̄_G1 − x̄_G2 farther from 0 to be less likely than values closer to 0, and negative and positive values to be equally likely. Hmm... that actually sounds a lot like the normal distribution, doesn't it?

1. 0 is the most likely value.
2. Values closer to 0 are more likely than values farther from 0.
3. Positive and negative values are equally likely to occur (i.e., symmetry).

But...

Introducing the Student's t-Distribution

A normal distribution is defined by a mean μ and a standard deviation σ. Unfortunately, even if we know we have measured a normally distributed variable X, we may not know the population-level standard deviation of X. This is fairly common in epidemiologic research, because we often don't have population-level data about the groups we are researching (such as people who inject drugs, undergraduate students who vape, or people who access syringe exchanges).
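The three properties claimed above for x̄_G1 − x̄_G2 under the null can be checked with a quick simulation. The population N(120, 15) and the per-group sample size of 30 below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(1)

# Under H0, both groups are drawn from the same population: here N(120, 15),
# chosen arbitrarily for illustration.
def mean_diff(n=30):
    g1 = [random.gauss(120, 15) for _ in range(n)]
    g2 = [random.gauss(120, 15) for _ in range(n)]
    return statistics.mean(g1) - statistics.mean(g2)

diffs = [mean_diff() for _ in range(5000)]

# The simulated differences center on 0...
print(round(statistics.mean(diffs), 2))

# ...and values near 0 occur far more often than values far from 0
near_zero = sum(abs(d) < 5 for d in diffs) / len(diffs)
far = sum(abs(d) > 10 for d in diffs) / len(diffs)
print(near_zero > far)
```

Plotting a histogram of `diffs` would show the symmetric, bell-shaped pile-up around 0 that the three properties describe.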
The ๐‘ก -distribution is a variation of the standard normal distribution (๐‘ -distribution), to be used when the standard deviation of the population is unknown. As we discussed before, any normal distribution (๐‘(๐œ‡, ๐œŽ) ) can be transformed to the ๐‘ -distribution (๐‘(0, 1) ). Similarly, we may understand the ๐‘ก -distribution as a standardized distribution. The ๐‘ก -distribution is intended to be a more conservative version of the ๐‘ - distribution, where we assume a wider variability in observations. Essentially, the less information we know, the less certain we are that observations will be near the mean. We define the ๐‘ก -distribution as a function of the number of degrees of freedom we have available to measure the variability in our data: Degrees of freedom refer to the number of parameters that are able to โ€œvary freelyโ€, given some assumed outcome. For example, let us say you have have 100 participants and you know that their mean average is 60 years old. Well, there are countless possibilities for how 100 participants age could average out to 60 (e.g., everyone could be 60 years old, half could be 59 and half could be 61, etc etc). In other words, we can see that peopleโ€™s ages can โ€œvary freelyโ€ while still maintaining an average age of 60 years old. But, let us imagine we know the exact age of 99 of the 100 individuals. The average age of the population is 60 years of age. Can the age of the final person โ€œvary freelyโ€? No! There is actually only one value that could take the first 99 values and get the average to 60 years of age. For example, if the average age of the first 99 people is 60 years of age, then the age of the final person must be 60 years of age. If they were older, then the average would move above 60, and vice versa if they were younger. As such to calculate a mean, we must โ€œspendโ€ one degree of freedom. A normal distribution is defined by a mean value ๐œ‡ and a standard deviation ๐œŽ. 
If we have ๐‘› observations and measure sample-mean ๐‘ฅยฏ and standard deviation ๐‘  , as discussed above, we must spend one degree of freedom to calculate ๐‘ฅยฏ. This means that we have ๐‘› โˆ’ 1 degrees of freedom to calculate ๐‘ . The fewer observations we have (i.e., the smaller that ๐‘› is), the less information we have to estimate the variation of our observed variable ๐‘‹. As such, the ๐‘ก -distribution is intended to capture uncertainty in the measurement of the standard deviation from a small sample. The fewer degrees of freedom (i.e., the smaller our sample), the less certain that our measured standard deviation ๐‘  represents our population-level standard deviation ๐œŽ. To capture this, the ๐‘ก - distribution is โ€œshorterโ€ and โ€œwiderโ€ than the normal distribution. Essentially, under the ๐‘ก -distribution, values farther from 0 are more likely than under the ๐‘ -distribution. As we collect more data (i.e., as ๐‘› gets larger), the ๐‘ก -distributionโ€™s shapes approaches that of the ๐‘ -distribution. To picture what this means, letโ€™s plot some distributions. First we are going to plot the ๐‘ -distribution along with ๐‘ก(1), the ๐‘ก -distribution with one degree of freedom: ## Let's create our x-axis, ranging from -5 to 5, with increments of 0.1 x
