Selection Bias and Missing Data
2024
Tom McAdams
Summary
This document covers selection bias and its implications for research studies, including initial recruitment, reasons for non-participation, and attrition. It also covers the different types of missing data and how to deal with them.
Full Transcript
Selection Bias and Missing Data
Dr Tom McAdams
MSc DevPP Nature & Nurture 1 7PADDTMF
November 26th, 2024

Imagine we want to recruit people to take part in a new study. First, we define the target population we will be drawing our sample from.
TEDS: all twins born in England and Wales between 1994 and 1996.
ALSPAC: all children born in Avon, England during 1991 and 1992.
MoBa: all children born in Norway between 2000 and 2010.
UKB: anyone aged 40-69 during the recruitment period (2006-2010).

It would be rare to recruit the entire target population. The goal is usually to recruit a random, representative sample. When our sample represents our target population (and hopefully therefore wider populations as well), we can be more confident that our results will generalise.
Challenge: samples are often NOT truly random, so a sample may not perfectly represent the population from which it was drawn.

Understanding Selection Bias
In an ideal situation a sample is randomly drawn from a population.
[Figure: a sample drawn from a population.]
If the sample is randomly drawn there should be no systematic differences between sample and population, so findings within the sample will generalise to the population. The bigger the sample, the more representative it should be.

Understanding Selection Bias
If a sample is not randomly drawn from a population, then selection bias can occur, and findings within the sample may not generalise to the population.
[Figure: a sample drawn from a population.]

What is Selection Bias?
Bias caused by systematic differences between study participants and non-participants.
Most research on humans is based on voluntary participation. Differences between those who do and those who do not volunteer to participate are common and can bias results.
Differences between participants and non-participants can occur…
1. During recruitment: there may be differences between those who are recruited and those who are not.

1. Initial recruitment and participation
Differences can occur at recruitment between eligible members of the population who do vs do not take part in the study. Generally, the more of the target population that is recruited into the sample, the more representative it is likely to be.
Proportions of target population recruited:
TEDS: 80%
ALSPAC: 75%
MoBa: 41%
UKB: 5.5%

Reasons for non-participation
People might choose not to participate.
People may forget/not get around to participating.
Some people may be unable to participate.
Recruitment drives may miss some people, so they may be unaware that they were eligible.
Some people originally targeted for recruitment may prove ineligible.

What is Selection Bias?
Bias caused by systematic differences between study participants and non-participants.
Most research on humans is based on voluntary participation, so differences between those who do and do not volunteer can bias results.
Differences between participants and non-participants can occur…
1. During recruitment: there may be differences between those who are recruited and those who are not.
2. As a result of attrition (in longitudinal studies): there may be differences between those who drop out and those who remain in a study.

2. Attrition (discontinued participation)
Differences between participants who drop out vs remain in the study at follow-up (wave 2 onwards).
[Figure: sample at wave 1, sample at wave 2, sample at wave 3.]
When researchers want to study the effects of selection bias they typically focus on attrition rather than participation. This is because it is easier to study differences between continued participants and drop-outs (on whom we have some data) than between participants and those who never took part in the study (on whom we usually have no data at all).

Differences between participants and non-participants
Common differences between participants and non-participants, and between those who drop out of studies vs those who remain in. Participants tend to be:
Healthier, both physically and mentally.
Wealthier than non-participants.
More likely to be white. In countries where most research on humans is conducted (Europe and North America), minority ethnic groups tend to participate in research studies less.
More likely to be female than male.
More likely to engage in positive health behaviours (they exercise and eat well).
Less likely to engage in negative health behaviours (drinking, smoking, drugs, etc.).
Better educated.
More likely to have access to IT equipment and the internet.
BUT note that selection effects can be sample-specific, so they do not ALWAYS follow these patterns.

Selection Bias
Note that differences between participants and non-participants (or between a sample and the population) do not always lead to selection bias. (Selection bias = bias caused by these differences.) But in some circumstances these differences can lead to biases in our results…

Selection Bias: Prevalence Estimates
When participation is related to health, wealth, and health behaviours, then estimates of disease prevalence will tend to be biased downwards.
Biases can, however, be sample specific.
Example 1: Uni-WiSE: Define a survey as a mental health survey and prevalence estimates will increase, as people with problems select into the study and those without opt not to participate.
Example 2: CoTEDS/ALSPAC: Recruit the second generation of a study (children of participants), and the initial recruitment phase will identify low SES, high risk individuals having children at a young age.

When participation is related to health, then estimates of disease prevalence will tend to be biased downwards. This will lead to underestimates of e.g. the prevalence of depression or obesity. However, biases can be sample specific.
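
The direction of the bias in a prevalence estimate can be sketched with a small simulation. This is an illustration only: the true prevalence of 20% and the participation probabilities are made-up values, and `sample_prevalence` is a hypothetical helper, not something taken from any of the cohorts named above. The first call mimics a general cohort in which people with the condition are less likely to take part; the second mimics a survey advertised as being about mental health, in which they are more likely to take part.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical target population: 20% have the condition of interest.
true_prevalence = 0.20
has_condition = rng.random(n) < true_prevalence


def sample_prevalence(p_join_if_ill, p_join_if_well):
    """Prevalence among participants when joining depends on the condition."""
    p_join = np.where(has_condition, p_join_if_ill, p_join_if_well)
    joined = rng.random(n) < p_join
    return has_condition[joined].mean()


# General cohort: people with the condition are LESS likely to take part.
general_cohort = sample_prevalence(p_join_if_ill=0.30, p_join_if_well=0.50)

# Survey framed as a mental health survey: they are MORE likely to take part.
mental_health_survey = sample_prevalence(p_join_if_ill=0.50, p_join_if_well=0.30)

print(f"true prevalence:       {has_condition.mean():.3f}")
print(f"general cohort:        {general_cohort:.3f}")
print(f"mental health survey:  {mental_health_survey:.3f}")
```

Under these assumed probabilities the general cohort underestimates the prevalence and the mental health survey overestimates it, which is the sample-specific pattern the slide describes.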
Sometimes, when a study is advertised as a study about mental health, people without mental health problems think that it is not about them and so do not take part, while people with mental health problems (who may not have taken part in another study) do take part. This can lead to overestimates of disease prevalence.
Some birth cohorts (CoTEDS, ALSPAC) are recruiting the offspring of participants. At present the participants of these second-generation studies are low SES and high risk (showing the reverse of the usual selection patterns in birth cohorts). This is because the participants having children are younger than average at present. Over time this bias will disappear from the samples.

Selection Bias: Prevalence Estimates
Arguably most studies are not primarily focused on the estimation of prevalence rates. Rather, they are focused on identifying predictors of outcomes of interest, e.g.
[Figure: a predictor linked to an outcome.]
Which risk factors predict major depressive disorder?
How strong are those predictions?
What might mitigate or mediate these associations?

Is there any reason to think that differences in prevalence between participants and non-participants should impact estimates of association?

Selection Bias: Estimates of Association
Some have argued that differences between participants and non-participants do not matter when we are focused on estimating associations between variables.
Some would argue that differences between participants and non-participants do not matter a great deal when we are focused on estimating associations between variables. If the association is linear this should hold: even if people at e.g. the extreme of the distribution of x and/or y are missing, the argument goes, the estimated association will be the same.

Selection Bias: Estimates of Association
If the association is non-linear, then missingness may bias estimated associations.
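
A rough way to check the linear versus non-linear argument above is a small simulation. The sketch below is illustrative only: the coefficients, the cut-off at x = 1 and all variable names are assumptions. It also only drops people at the upper extreme of x; selection that depends on y is a different case and is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.normal(size=n)
noise = rng.normal(size=n)

# Two hypothetical outcomes: one linear in x, one curved (quadratic) in x.
y_linear = 0.5 * x + noise
y_curved = x + 0.5 * x**2 + noise

# Assumed selection: everyone with x above 1 is missing from the sample.
in_sample = x < 1.0


def slope(x_values, y_values):
    """Slope of a straight-line (least squares) fit of y on x."""
    return np.polyfit(x_values, y_values, 1)[0]


linear_full = slope(x, y_linear)
linear_selected = slope(x[in_sample], y_linear[in_sample])
curved_full = slope(x, y_curved)
curved_selected = slope(x[in_sample], y_curved[in_sample])

print(f"linear outcome: full {linear_full:.2f}, selected {linear_selected:.2f}")
print(f"curved outcome: full {curved_full:.2f}, selected {curved_selected:.2f}")
```

Under these assumptions the straight-line slope is about the same with and without the top of the x distribution when the relationship is linear, but it differs clearly when the relationship is curved.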

Selection Bias: Estimates of Association
Study participation may also act as a collider variable, and this can induce serious bias.
A collider variable is a variable that is caused by both the exposure and the outcome of interest. This matters in mental health research because we know that mental health problems often predict participation. Likewise, predictors of mental health often do too.

What is a collider?
[Figure: X and Y linked by byx, shown with a confounder and with a collider.]
Control for a confounder: desirable. Reduces bias in byx. Tells us what byx would be if the confounding variable was held constant.
Control for/condition on a collider: undesirable. Induces bias in byx. Tells us what byx is when conditioning on the collider variable.
A collider variable is a variable that is caused by both the exposure and the outcome. If we statistically control for a collider as though it is a confounder, we will bias estimates of the association between x and y. If we run our analyses only in a subgroup determined by our collider variable, then we will bias our estimate of the association between x and y.

[Figure: three diagrams of X, Y and a collider, labelled byx = 0, byx ~= 0 and byx ~= 0; one also includes a variable Z.]
Conditioning on a collider can induce associations where none exist, inflate associations, or even reverse the sign of an association.

[Figure: three diagrams of Education, Mental health and Participation, labelled byx ~= 0, byx ~= 0 and byx = 0; two also include a variable U.]
Sometimes, participation in a study can itself be a collider variable. And because we can only run analyses on those who participate, we condition on a collider by default. This is not guaranteed to cause major bias, but it can. Collider bias can be quite unpredictable and difficult to deal with.
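
Participation acting as a collider can be sketched in a few lines. Everything here is an assumption made for illustration: education and a mental health score are simulated as completely unrelated in the population (true byx = 0), both raise the chance of taking part, and the threshold of 0.5 is arbitrary. The point is only to show that restricting the analysis to participants can create an association that does not exist in the population.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical population in which education and mental health are unrelated.
education = rng.normal(size=n)
mental_health = rng.normal(size=n)  # higher = better mental health

# Participation is caused by BOTH variables, plus other unmeasured influences,
# which makes participation a collider.
propensity = education + mental_health + rng.normal(size=n)
participates = propensity > 0.5


def slope(x_values, y_values):
    """Slope of a straight-line (least squares) fit of y on x."""
    return np.polyfit(x_values, y_values, 1)[0]


b_population = slope(education, mental_health)
b_participants = slope(education[participates], mental_health[participates])

print(f"byx in the whole population: {b_population:.3f}")
print(f"byx among participants only: {b_participants:.3f}")
```

In this set-up the slope is close to zero in the full population but clearly negative among participants. With different assumed causes of participation the induced bias could be smaller, larger or in the other direction, which is part of why collider bias is hard to predict.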

[Figure: X, Y and a collider.] https://watzilei.com/shiny/collider/
Example: exploring the association between sodium intake and systolic blood pressure, while controlling for urinary protein excretion (a collider).

Selection bias
Focus on attrition bias in the literature.
Biases are present but their magnitude varies.
Bias often results in underestimation of socioeconomic inequalities in health-related outcomes (Howe et al., 2013).
Bias can be sample specific.
Bias can be association specific: "General statements about bias are not possible for studies that investigate multiple exposures and outcomes" (Biele et al., 2019).
Bias increases over time in longitudinal studies (Howe et al., 2013).
Typically, bias does not drastically change the qualitative conclusions that are drawn from a sample about an association: e.g. the association will still be in the same direction, and although the estimated magnitude may change slightly, the change is not usually drastic. We cannot, however, assume that this is the case for all associations in all studies.

Missing Data
Missing data = data that we know exists but that we do not have.
Sources of missing data:
Non-participation
Attrition
Incomplete responses
There are ways of correcting for missing data and associated selection biases, as long as we have some data that predicts missingness…
Selection bias attributable to attrition is frequently treated as a missing data problem, and in this manner corrected for (to an extent). Non-participation is less often treated as a missing data problem because we typically lack any information on non-participants. However, in rare instances where we do have such information (e.g. when data is linked to national health records or similar), even non-participation can be dealt with in the same way.

Types of missing data
Missing completely at random: truly random, unsystematic missingness. Not a concern as it will not cause bias.
Missing at random: the missingness is systematic and is related to the observed data. Missingness that can be predicted from the data (and therefore corrected for).
Missing not at random: the missingness is systematic but is not explained by the observed data. Missingness cannot be predicted from the available data.

Missing data can be categorized into 3 types:
Missing completely at random (MCAR): the missingness is truly random and unrelated to any other variable. It is unsystematic. For example, if a print-run of a postal questionnaire accidentally missed the last page from 20% of the questionnaires, 20% of the sample would be missing data, but this missingness would be random and unrelated to the data collected.
Missing at random (MAR): the missingness is related to non-missing variables within the dataset. It is systematic but we have information that we can use to understand the missingness and correct for it. For example, if people who answer "yes" to the item "Have you ever had a mental health diagnosis?" are less likely to take part in subsequent follow-up waves in a longitudinal study, then we know that history of mental health diagnosis predicts missingness at later waves.
Missing not at random (MNAR): the missingness is systematic but is driven by unobserved data rather than by the observed data. That is, the missingness is related to factors not measured by the researchers, so the probability of being missing cannot be predicted from the data that we have available to us. For example, if recreational drug users decline to take part in a study, we will have no data on drug use and will not have any information to help us understand why the data is missing. Another example would be if depressed patients did not answer questions about depression severity: the resulting dataset would be censored (would not include severe depression cases) and the researchers would lack the information to correct for this.
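
The three mechanisms can be compared in a small simulation. This is a sketch under stated assumptions, not a recommended analysis: the response probabilities, the symptom score and the helper `ipw_mean` are all hypothetical. A diagnosis history is observed for everyone at wave 1, and a score is only observed for those who respond at wave 2. The correction used is a very simple inverse probability weight: each responder is weighted by one over the response rate observed in their diagnosis group.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Observed for everyone at wave 1: history of a mental health diagnosis.
diagnosis = rng.random(n) < 0.25
# Wave 2 symptom score, higher on average for those with a diagnosis.
score = rng.normal(loc=np.where(diagnosis, 1.0, 0.0), scale=1.0)

# Assumed probability of responding at wave 2 under each mechanism.
p_mcar = np.full(n, 0.6)                    # unrelated to anything
p_mar = np.where(diagnosis, 0.3, 0.7)       # depends only on observed diagnosis
p_mnar = 1.0 / (1.0 + np.exp(score - 0.5))  # depends on the unobserved score

responded = {
    name: rng.random(n) < p
    for name, p in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]
}


def ipw_mean(r):
    """Weighted mean score among responders, weights = 1 / group response rate."""
    p_hat = np.where(diagnosis, r[diagnosis].mean(), r[~diagnosis].mean())
    return np.average(score[r], weights=1.0 / p_hat[r])


true_mean = score.mean()
complete_case = {name: score[r].mean() for name, r in responded.items()}
ipw = {name: ipw_mean(r) for name, r in responded.items()}

print(f"true mean score: {true_mean:.3f}")
for name in responded:
    print(f"{name}: complete case {complete_case[name]:.3f}, weighted {ipw[name]:.3f}")
```

With these assumptions the complete-case mean is close to the truth under MCAR and too low under MAR, and the weights recover the truth under MAR because diagnosis history fully explains who responds. Under MNAR the weighted estimate stays biased, since the observed data do not explain the missingness.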

Dealing with missing data
If data is "missing at random" we can (attempt to) correct for missingness using:
Inverse probability weights
Data imputation
Complete case analysis (dropping anyone missing data of interest) does nothing to correct for potential biases.
Any correction will only be as good as our ability to predict missingness with the observed data we have. It is often difficult to know how successful any attempt at dealing with missing data has been.
If data is MNAR we cannot correct for any biases through statistical means.

Selection Bias in a Wider Context
Typically, we know that study participants are more likely to be healthy, wealthy and white.
Globally, most study participants also come from WEIRD countries (Western, Educated, Industrialised, Rich, Democratic countries).
Also, scientists tend to be WEIRD as well.
Bias operates at the level of the:
Individual
Institute
Country
A major shortcoming of most research on humans is that the vast majority of studies comprise participants from western, educated, industrialised, rich, democratic (WEIRD) countries, predominantly in Europe and the USA. These samples tend to be culturally homogeneous, made up of socioeconomically privileged participants of European heritage. To what extent research findings derived from WEIRD samples can be generalised to other populations within WEIRD countries, or to populations across the globe, is an open question that can only be answered by collecting more data on more diverse samples. As such, the transferable, clinical utility of existing results remains limited and serves to perpetuate a scientific knowledge base that is weighted towards serving privileged groups.

Further Reading
Howe LD, Tilling K, Galobardes B, Lawlor DA. Loss to follow-up in cohort studies: bias in estimates of socioeconomic inequalities. Epidemiology. 2013 Jan;24(1):1-9. doi: 10.1097/EDE.0b013e31827623b1.
Munafò MR, Tilling K, Taylor AE, Evans DM, Davey Smith G. Collider scope: when selection bias can substantially influence observed associations. Int J Epidemiol. 2018 Feb;47(1):226-235. doi: 10.1093/ije/dyx206.
Griffith GJ, Morris TT, Tudball MJ, et al. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nat Commun. 2020;11:5749. doi: 10.1038/s41467-020-19478-2.
Biele G, Gustavson K, Czajkowski NO, Nilsen RM, Reichborn-Kjennerud T, Magnus PM, Stoltenberg C, Aase H. Bias from self selection and loss to follow-up in prospective cohort studies. Eur J Epidemiol. 2019 Oct;34(10):927-938. doi: 10.1007/s10654-019-00550-1.