Statistics in Health Sciences — 5. Dealing with Missing Data (lecture slides, Jose Barrera, ISGlobal & UAB)


Summary

This document contains lecture slides on dealing with missing data in applied statistics. It classifies the types of missing data (MCAR, MAR, MNAR), reviews methods for handling them, and works through a regression analysis with multiple imputation in R.

Full Transcript


B.Sc. Degree in Applied Statistics — Statistics in Health Sciences
5. Dealing with missing data
Jose Barrera (a, b), [email protected], https://sites.google.com/view/josebarrera
a ISGlobal Barcelona Institute for Global Health - Campus MAR
b Department of Mathematics (UAB)
This work is licensed under a Creative Commons "Attribution-NonCommercial-ShareAlike 4.0 International" license.
Jose Barrera (ISGlobal & UAB), Statistics in Health Sciences, 2023/2024

Outline
1 Introduction
2 Types of missing data
3 Dealing with missing data

Missing data: introduction
• Missing data is a common problem in statistical analyses involving real data sets.
• This is especially the case in epidemiological studies, which typically rely on uncontrolled, observational data.
• In what follows, we classify the different types of missing data and introduce methods for dealing with them.

Types of missing data patterns
Missing data can be classified according to three different patterns (Rubin [1]):
• MCAR (Missing completely at random)
• MAR (Missing at random)
• MNAR (Missing not at random)

Types of missing data: MCAR
MCAR (Missing completely at random)
• MCAR is the case in which the probability of an observation Xi being missing depends neither on the value of Xi nor on any of the remaining variables in the data set.
• If the probability of Xi being missing depends on the probability of another variable, Yi, being missing for the same individual, the MCAR assumption is not affected.
Examples
• A participant in the study could not attend the interview for the health questionnaire because they missed the train.
• The scale used to weigh individuals has a constant probability of malfunction.
Consequences (compared with the complete-data case)
• Sample size reduction → statistical power reduction.
• Parameter estimates are unbiased.

Types of missing data: MAR
MAR (Missing at random)
• The data are not MCAR, but the probability of Xi being missing does not depend on the value of Xi after stratifying by other variables in the data set potentially related to the probability of Xi being missing.
• In other words, the propensity of missingness depends on observed data, not on missing data.
Examples
• Depressed people could be less likely to report their physical activity (PA) and, at the same time, be characterized by low PA. If, among depressed people, the probability of missing PA information is unrelated to its value, the missing data are MAR and depression can be used as a predictor of PA levels.
• A survey respondent chooses not to answer a question on income out of concern for the privacy of personal information. The missing income value can then be predicted from the answers to the other personal-information questions.
Consequences (compared with the complete-data case)
• Sample size reduction → statistical power reduction.
• Parameter estimates could be biased (there are methods to deal with this problem).

Types of missing data: MNAR
MNAR (Missing not at random)
• The missingness is not random: it is related to unobserved characteristics unknown to the researcher (i.e. to variables not present in the data set).
• Particular case: the probability of Xi being missing depends on the value of Xi itself, and X is unrelated to the other observed variables.
Examples
• When assessing mental health status X via a health questionnaire in a random sample of the population of interest, if individuals with poor mental health (e.g. those with low scores in X) are less likely to report their health status than individuals with good mental health (e.g. those with high scores in X), then the data are MNAR.
• Drug users could be more likely not to answer questions about drug consumption.
Consequences (compared with the complete-data case)
• Sample size reduction → statistical power reduction.
• Parameter estimates could be biased (no general solution).

Dealing with missing data: different approaches
There are a number of methods for dealing with missing data:
• Complete cases analysis
• Mean/median/mode substitution
• Regression imputation
• Inverse probability weighting imputation
• Multiple imputation

Complete cases analysis
• All rows of the data set that are not complete cases with respect to the subset of variables of interest are excluded from the analysis. (This is the default behavior in R.)
• Estimates are unbiased only under the MCAR assumption. This also holds if the variable with missing data is the response variable of interest.
• The sample size reduction implies a decrease in statistical power → the probability of a false negative increases.
• It is usually appropriate if the percentage of complete cases is very high and we assume MCAR.
• Alternatives to this method include substitution methods.

Mean/median/mode substitution
• Each missing datum is replaced by the sample mean, median or mode of the variable, computed using the available data.
• This method adds no information: the sample mean, median or mode remains unaltered, while the variance is underestimated because the sample size is enlarged but no extra variability is added.
• In addition, this method can produce unrealistic imputed values (e.g. a 2-year-old baby weighing 56 kg).

Mean/median/mode substitution: exercise
Our aim is to estimate the mean and the variance of blood glucose levels (X) in a given population. To do so, we select a random sample S = {x1, x2, x3, ..., xn−1, xn} of size n. We then notice that all data in the sample have been correctly collected except xn, which is missing. Let Sc = {x1, ..., xn−1} denote the subsample of complete cases. We denote:
• X̂c and V̂c the mean and variance estimates of glucose levels, respectively, using a complete cases analysis (i.e. using Sc);
• X̂m and V̂m the mean and variance estimates of glucose levels, respectively, using a mean substitution analysis.
Prove that both methods provide the same mean estimate (i.e. X̂m = X̂c), while the variance is underestimated when using mean substitution (i.e. V̂m < V̂c).

Regression imputation
• If Xi is missing, its value is reconstructed as the predicted mean from a linear regression model in which the outcome is the variable X and the predictors are the remaining variables in the data set, fitted using all available data.
• This method is more sophisticated and more realistic than mean substitution, because the imputed values within each variable are not all equal: they depend on the values of the remaining variables.
• However, the variance-reduction problem persists: although no additional information has been used to reconstruct the datum, the sample size has been artificially increased.
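The two substitution methods above can be contrasted in a few lines of R. This is a minimal sketch with invented toy data (the data frame and variable names are hypothetical, not part of the lecture material):

```r
## Toy data (invented): weight in kg, height in cm, one missing weight
d <- data.frame(wgt = c(20.9, 30.6, 19.6, NA, 28.1),
                hgt = c(111.1, 128.7, 109.3, 112.9, 123.0))

## Mean substitution: every missing datum gets the same value
d$wgt_mean <- ifelse(is.na(d$wgt), mean(d$wgt, na.rm = TRUE), d$wgt)

## Regression imputation: predict the missing value from another variable
## (lm() drops incomplete rows by default, i.e. fits on complete cases)
fit <- lm(wgt ~ hgt, data = d)
d$wgt_reg <- ifelse(is.na(d$wgt), predict(fit, newdata = d), d$wgt)

## In both cases the imputed value adds no real variability,
## so the variance of wgt is underestimated
```

Note that the regression-imputed value differs across individuals (it depends on hgt), whereas mean substitution would impute the same value everywhere; neither method, however, adds genuine information.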
Inverse probability weighting (IPW)
• Inverse probability weighting (IPW) is a statistical technique for estimating parameters in a population different from the one in which the data were collected, which is not a rare situation in observational studies (due, for instance, to cost, time, or ethical issues).
• Essentially, each observation is weighted by the inverse of its probability of being sampled. Hence, the lower the probability of being sampled, the higher the weight of the observation in the analysis.
• IPW can also be applied to deal with missing data: it can be used to inflate the weights of subjects who are under-represented due to a large degree of missing data. For further details, see Seaman and White [2].

Multiple imputation (MI)
MI (see Rubin [3]) is a principled approach for imputing realistic values to the missing data and propagating the uncertainty due to the missingness into the results of the analysis of interest.
Missing data assumptions
• MCAR is desirable but typically unrealistic.
• MI techniques assume MAR.
• Hence, we assume that missing values can be replaced by predictions derived from the observed data.
• This is a fundamental assumption; otherwise we would not be able to predict plausible values of the missing data points from the observed data.
– "Wade, relax... Jose said 'multiple imputation', not 'multiple amputation'." @overdispersion

Multiple imputation: MICE algorithm
Multiple imputation by chained equations (MICE) algorithm
1 Let {Y1, Y2, ..., Ym−1, Ym} be the subset of variables of the data set with at least one missing value, and let {X1, X2, ..., Xl−1, Xl} be the subset of variables that are complete.
2 For every missing datum in {Y1, Y2, ..., Ym}, perform a simple imputation (e.g. mean/median/mode substitution).
3 For i = 1, 2, ..., m, perform the following cycle:
  (a) Set back to missing all values of Yi imputed in step 2.
  (b) Using only complete cases, fit a regression model with Yi as the dependent variable and all or some of the other variables as predictors. The regression model for Yi works under the same assumptions as a standard regression analysis (e.g. linear, logistic, or Poisson regression). The predictors are chosen by the analyst and can differ across the Yi.
  (c) Replace the missing values of Yi with predictions (imputations) from the fitted regression model.
  At the end of the cycle, all missing values have been replaced with predictions from regressions that reflect the relationships observed in the data.
4 Repeat step 3 until the distributions of the parameters governing the imputations (e.g. the coefficients of the regression models) are stable. The imputation models are then retained.
5 For each i, use the final model to impute M values for each originally missing value of Yi. The M imputations differ because the models are used probabilistically (rather than predicting just the mean of Yi).

Multiple imputation in regression analysis
A typical application of MI is in linear regression analysis, which proceeds as follows:
1 For each imputed data set j = 1, 2, ..., M, the parameter of interest, $\theta$ (e.g. a $\beta$ coefficient of interest in the regression model), is estimated to get $\hat{\theta}_j$, as well as its variance, $\widehat{\mathrm{Var}}(\hat{\theta}_j)$.
2 The final estimate of $\theta$ is the mean of the M estimates, $\hat{\theta}_{\mathrm{MI}} = \frac{1}{M}\sum_{j=1}^{M} \hat{\theta}_j$.
3 The final estimate of the variance of $\theta$ is $\widehat{\mathrm{Var}}(\hat{\theta}_{\mathrm{MI}}) = \hat{V}_W + \frac{M+1}{M}\hat{V}_B$, where:
• $\hat{V}_W = \frac{1}{M}\sum_{j=1}^{M} \widehat{\mathrm{Var}}(\hat{\theta}_j)$ is the mean of the M variance estimates of $\theta$, or within-imputation variance;
• $\hat{V}_B = \frac{1}{M-1}\sum_{j=1}^{M} (\hat{\theta}_j - \hat{\theta}_{\mathrm{MI}})^2$ is the variance of the estimates of $\theta$, or between-imputation variance.
Comments
• $\hat{V}_B$ captures the uncertainty of the imputations and inflates the error of the estimate accordingly.
• In R, multiple imputation analysis can be performed with the mice package.
• For further details: Klebanoff and Cole [4], Sterne et al. [5]. Detailed explanation: Azur et al. [6].

Linear regression analysis with MI in R: illustrative example
Data
We will work with a random sample (10%) of 748 Dutch boys from the cross-sectional data used to construct the Dutch growth (height, weight, head circumference and puberty) references 1997. (For further details, see ?mice::boys.) The variables are:
• age: decimal age (0-21 years)
• hgt: height (cm)
• wgt: weight (kg)
• bmi: body mass index (kg/m2)
• hc: head circumference (cm)
• gen: genital Tanner stage (G1-G5) (https://en.wikipedia.org/wiki/Tanner_scale)
• phb: pubic hair (Tanner P1-P6)
• tv: testicular volume (ml)
• reg: region (north, east, west, south, city)
We will use data for boys 5 years old or older.

Aim
Suppose we are interested in fitting a linear regression model to estimate the mean testicular volume (tv) as a function of age (age), adjusting by region (reg), among boys 5 years old or older.
[Scatter plot of testicular volume (ml) against age (decimal years), colored by region (north, east, west, south, city); it includes only observations that are complete in age, tv and reg.]
The plot above uses only complete observations, but we would like to use all observations.
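As a small numerical check of the pooling rules above (the numbers are invented for illustration, not taken from any real analysis), suppose $M = 2$ imputed data sets give estimates $\hat{\theta}_1 = 1.8$ and $\hat{\theta}_2 = 2.0$, with estimated variances $0.04$ and $0.06$. Then:

```latex
\begin{align*}
\hat{\theta}_{\mathrm{MI}} &= \tfrac{1}{2}(1.8 + 2.0) = 1.9 \\
\hat{V}_W &= \tfrac{1}{2}(0.04 + 0.06) = 0.05 \\
\hat{V}_B &= \tfrac{1}{2-1}\left[(1.8 - 1.9)^2 + (2.0 - 1.9)^2\right] = 0.02 \\
\widehat{\mathrm{Var}}(\hat{\theta}_{\mathrm{MI}}) &= \hat{V}_W + \tfrac{M+1}{M}\,\hat{V}_B
  = 0.05 + \tfrac{3}{2} \cdot 0.02 = 0.08
\end{align*}
```

The between-imputation component contributes 0.03 of the total 0.08: this is exactly the imputation uncertainty that a single-imputation analysis would ignore.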
Data in R
> library(mice)
> ### ?boys  ("boys" data set is included in the "mice" package)
> ### We will work only with boys 5-year old or older:
> boys5 <- boys[boys$age >= 5, ]
> head(boys5)
##        age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 2874 5.196 111.1 20.9 16.93 54.0 <NA> <NA> NA  east
## 2883 5.286 128.7 30.6 18.47 55.0 <NA> <NA> NA  east
## 2885 5.292 109.3 19.6 16.40 52.4 <NA> <NA> NA south
## 2891 5.333 112.9 19.1 14.98 53.0 <NA> <NA> NA south
## 2892 5.338 123.0 28.1 18.57 52.5 <NA> <NA> NA  city
## 2932 5.533 116.3 20.2 14.93 50.0 <NA> <NA> NA  west
> summary(boys5)
## age:  Min. 5.196,  1st Qu. 11.545, Median 14.540, Mean 14.129, 3rd Qu. 16.807, Max. 21.177
## hgt:  Min. 109.3,  1st Qu. 150.8,  Median 170.3,  Mean 165.4,  3rd Qu. 180.7,  Max. 198.0,  NA's 3
## wgt:  Min. 19.10,  1st Qu. 39.12,  Median 54.80,  Mean 54.35,  3rd Qu. 66.90,  Max. 117.40, NA's 3
## bmi:  Min. 13.69,  1st Qu. 16.82,  Median 18.77,  Mean 19.18,  3rd Qu. 20.89,  Max. 31.74,  NA's 3
## hc:   Min. 48.20,  1st Qu. 53.70,  Median 55.50,  Mean 55.39,  3rd Qu. 57.00,  Max. 65.00,  NA's 36
## gen:  G1: 56, G2: 50, G3: 22, G4: 42, G5: 75, NA's 212
## phb:  P1: 63, P2: 40, P3: 19, P4: 32, P5: 50, P6: 41, NA's 212
## tv:   Min. 1.00,  1st Qu. 4.00,  Median 12.00,  Mean 11.89,  3rd Qu. 20.00,  Max. 25.00,  NA's 231
## reg:  north: 60, east: 105, west: 143, south: 107, city: 42

Exploration of missingness
> ### percentage of complete cases:
> 100 * mean(complete.cases(boys5))
## [1] 48.7965
> ### percentage of cells with missing value:
> 100 * mean(is.na(boys5))
## [1] 17.01921

Missingness patterns ("1" = observed; "0" = missing; in the accompanying plot, blue = observed, maroon = missing)
> mice::md.pattern(boys5,
+                  rotate.names = TRUE)
##     age reg hgt wgt bmi hc gen phb  tv
## 223   1   1   1   1   1  1   1   1   1   0
## 19    1   1   1   1   1  1   1   1   0   1
## 1     1   1   1   1   1  1   1   0   1   1
## 1     1   1   1   1   1  1   0   1   0   2
## 176   1   1   1   1   1  1   0   0   0   3
## 34    1   1   1   1   1  0   0   0   0   4
## 1     1   1   0   0   0  1   1   1   1   3
## 1     1   1   0   0   0  0   1   1   1   4
## 1     1   1   0   0   0  0   0   0   0   7
##       0   0   3   3   3 36 212 212 231 700

Setting the imputation process: initial setting
First, we run the imputation model with no iterations (i.e. no imputations are made), to get the default setting that R uses, which we then modify to our convenience:
> ### ?mice
> ini <- mice(boys5, maxit = 0)
> ini ### note that age doesn't need to be imputed
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
##   age   hgt   wgt   bmi    hc    gen    phb    tv  reg
##    "" "pmm" "pmm" "pmm" "pmm" "polr" "polr" "pmm"   ""
## PredictorMatrix:
##     age hgt wgt bmi hc gen phb tv reg
## age   0   1   1   1  1   1   1  1   1
## hgt   1   0   1   1  1   1   1  1   1
## wgt   1   1   0   1  1   1   1  1   1
## bmi   1   1   1   0  1   1   1  1   1
## hc    1   1   1   1  0   1   1  1   1
## gen   1   1   1   1  1   0   1  1   1
## (output truncated)
See the help of the mice function for details on the available imputation methods.
Setting the imputation process: customizing the imputation method
• The mice function in the mice package includes a number of possible imputation methods, depending on the metric of the variable to be imputed. See "Details" in ?mice.
• We can customize the imputation method used for each imputed variable with the argument method.
• A special case is when variables are related in a deterministic way. For instance, in our data set we need to make sure that the data obey the relationship BMI = weight (kg) / (height (cm) / 100)^2. This can be done using passive imputation:
> ### Get the imputation methods and modify:
> meth <- ini$method
> ### passive imputation for bmi:
> meth["bmi"] <- "~I(wgt / (hgt / 100)^2)"
Passive imputation also affects the predictorMatrix argument (see next slide).

Setting the imputation process: customizing the predictors
• The predictorMatrix is a p × p matrix, where p is the number of variables in the data set, containing 0/1 values that specify the set of predictors used for each target (i.e. to-be-imputed) variable. Each row corresponds to a variable to be imputed; a value of 1 means that the column variable is used as a predictor for the row target variable. By default, all values are 1, except for the diagonal.
• For instance, we can modify the predictorMatrix so that BMI is not a predictor for height or weight:
> ### Get the predictorMatrix and modify:
> pred <- ini$predictorMatrix
> ### Exclude bmi as a predictor for the imputation of hgt and wgt:
> pred[c("hgt", "wgt"), "bmi"] <- 0
> pred
##     age hgt wgt bmi hc gen phb tv reg
## age   0   1   1   1  1   1   1  1   1
## hgt   1   0   1   0  1   1   1  1   1
## wgt   1   1   0   0  1   1   1  1   1
## bmi   1   1   1   0  1   1   1  1   1
## hc    1   1   1   1  0   1   1  1   1
## gen   1   1   1   1  1   0   1  1   1
## phb   1   1   1   1  1   1   0  1   1
## tv    1   1   1   1  1   1   1  0   1
## reg   1   1   1   1  1   1   1  1   0

Setting the imputation process: customizing the predictors (cont.)
• Our data set could include variables that play no role in the imputation process. An example is an identifier variable, say id. In that case, id should be neither imputed nor used as a predictor. Hence, we would need to set:
> meth["id"] <- ""     # do not impute "id"
> pred[, "id"] <- 0    # do not use "id" as a predictor for any variable
• In addition, when MI is performed in order to then fit a regression model, as in this example, we should make sure that each variable included in the main analysis model (i.e. in the regression model of interest) is a predictor for each of the others. Hence, in our example, each of the variables tv, age and reg should be a predictor of the others in the imputation process.
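That last requirement can be enforced explicitly. A minimal sketch, assuming the pred matrix defined above (these particular index assignments are illustrative, not from the original slides):

```r
## Make tv, age and reg mutual predictors in the imputation models
vars <- c("tv", "age", "reg")
pred[vars, vars] <- 1
diag(pred[vars, vars]) <- 0  # no variable predicts itself
```

With the default predictorMatrix this changes nothing (all off-diagonal entries are already 1), but it guards against a customization elsewhere accidentally dropping one of the analysis-model variables from an imputation model.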
Performing the imputations
Now we can perform MI using the modified methods and predictor matrix, with 10 imputations:
> ### number of imputed data sets:
> M <- 10
> ### imputations:
> imps <- mice(data = boys5,
+              m = M,
+              method = meth,
+              predictorMatrix = pred,
+              maxit = 20,
+              printFlag = FALSE, # set printFlag = TRUE until the analysis is definitive
+              seed = 666)

Testicular volume (mean)
> ### mean in the original data set:
> mean(boys5$tv, na.rm = TRUE)
## [1] 11.89381
> ### mean in the first imputed data set:
> imp1 <- mice::complete(data = imps, action = 1)
> mean(imp1$tv)
## [1] 12.29322
> ### pooled mean using all imputed data sets:
> library(miceadds) # withPool_MI
> # list of means from each imputed data set:
> means <- with(imps, mean(tv))
> # pool means:
> withPool_MI(means)
## [1] 12.36608

Testicular volume (variance)
> ### variance in the original data set:
> var(boys5$tv, na.rm = TRUE)
## [1] 63.88201
> ### variance in the first imputed data set:
> var(imp1$tv)
## [1] 64.41384
> ### pooled variance using all imputed data sets:
> # list of variances from each imputed data set:
> vars <- with(imps, var(tv))
> # pool variances:
> withPool_MI(vars)
## [1] 64.64265

Fitting the regression model in the original data set
> modorig <- lm(tv ~ age + reg, data = boys5)
> summary(modorig)
## Call:
## lm(formula = tv ~ age + reg, data = boys5)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.1672  -2.9345  -0.4986   2.4073  10.2126
##
## Coefficients:
##              Estimate Std. Error t value   Pr(>|t|)
## (Intercept) -21.87424    1.49505 -14.631    < 2e-16 ***
## age           2.15914    0.08892  24.282    < 2e-16 ***
## regeast       4.68260    0.94114   4.975 0.00000131 ***
## regwest       3.61709    0.91099   3.970 0.00009715 ***
## regsouth      3.98715    0.91632   4.351 0.00002072 ***
## regcity       2.49995    1.19229   2.097     0.0372 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.179 on 220 degrees of freedom
##   (231 observations deleted due to missingness)
## Multiple R-squared: 0.7327, Adjusted R-squared: 0.7266
## F-statistic: 120.6 on 5 and 220 DF, p-value: < 2.2e-16

Fitting the regression model in the 1st imputed data set
> modimp1 <- lm(tv ~ age + reg, data = imp1)
> summary(modimp1)
## Call:
## lm(formula = tv ~ age + reg, data = imp1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.1793  -3.2220  -0.6139   3.1626  10.4163
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.25001    1.08139 -15.952  < 2e-16 ***
## age           1.84122    0.05875  31.338  < 2e-16 ***
## regeast       5.07733    0.74192   6.843 2.53e-11 ***
## regwest       3.30778    0.69899   4.732 2.98e-06 ***
## regsouth      4.05642    0.73325   5.532 5.37e-08 ***
## regcity       4.10041    0.91599   4.476 9.62e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.519 on 451 degrees of freedom
## Multiple R-squared: 0.6864, Adjusted R-squared: 0.6829
## F-statistic: 197.4 on 5 and 451 DF, p-value: < 2.2e-16

Fitting the regression model in all imputed data sets and pooling results
> modimp <- with(imps, lm(tv ~ age + reg))
> pooledest <- summary(pool(modimp), conf.int = TRUE)
> pooledest
##          term   estimate  std.error  statistic        df      p.value       2.5 %     97.5 %
## 1 (Intercept) -17.596312 1.22893640 -14.318326  89.91460 6.424416e-25 -20.0378398 -15.154783
## 2         age   1.859907 0.07124406  26.106134  56.35211 3.684463e-33   1.7172076   2.002606
## 3     regeast   5.085049 0.80103130   6.348128 147.20107 2.549040e-09   3.5020424   6.668056
## 4     regwest   4.070209 0.88841177   4.581445  43.39399 3.876443e-05   2.2790266   5.861392
## 5    regsouth   4.099201 0.97790888   4.191802  34.83876 1.798018e-04   2.1136115   6.084790
## 6     regcity   3.063818 1.31680944   2.326698  26.52301 2.787490e-02   0.3596728   5.767963

Comparison of results for the β coefficient of interest
[Plot comparing the estimates of β̂_age from the original data set, the 1st imputed data set, and the pooled analysis of all imputed data sets, by data used in the main analysis.]
(The R code to reproduce this plot is included in the file SHS_05_Missing_data.R.)

Further details
You can find additional explained examples and R code in the document Exercises_SHS-CDA_v0_4.pdf, section 3.

References
[1] D.B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976. https://doi.org/10.2307/2335739
[2] S.R. Seaman and I.R. White. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 22(3):278–295, 2013. https://doi.org/10.1177/0962280210395740
[3] D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Hoboken, New Jersey, 1987.
[4] M.A. Klebanoff and S.R. Cole. Use of multiple imputation in the epidemiologic literature. American Journal of Epidemiology, 168(4):355–357, 2008. https://doi.org/10.1093/aje/kwn071
[5] J.A. Sterne, I.R. White, J.B. Carlin, M. Spratt, P. Royston, M.G. Kenward, A.M. Wood, and J.R. Carpenter. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ, 338:b2393, 2009. https://doi.org/10.1136/bmj.b2393
[6] M.J. Azur, E.A. Stuart, C. Frangakis, and P.J. Leaf. Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1):40–49, 2011. https://doi.org/10.1002/mpr.329
