Statistics in Health Sciences A1-G01 PDF

UAB

2023

Andrea Gonzalez, Alejandro Donaire, Gerard Lahuerta, Ona Sanchez


Summary

This document is a student report on statistics in health sciences. It details the analysis of missing data, multiple imputation, and the integration with cluster analysis. The report uses R code examples and visualizations of the results.

Statistics in Health Sciences A1-G01

Andrea Gonzalez (1603921), Alejandro Donaire (1600697), Gerard Lahuerta (1601350), Ona Sanchez (1601181)

8th November 2023

Contents

1 Introduction
2 Analysis
2.1 Description of missingness
2.2 Multiple imputation
2.3 Integration of multiple imputation in cluster analysis
3 Results
4 Report
5 References

1 Introduction

Nothing to do in this section.

2 Analysis

2.1 Description of missingness

First, the packages used throughout the report are attached and the data is loaded:

> library(mice)     # mice(), complete(), md.pattern()
> library(miclust)  # getdata(), miclust()
> library(ggplot2)  # ggsave()
> library(xtable)   # xtable()
> library(tibble)   # add_column()
> load("copd_redux.RData")

Next, the variables needed to describe the missingness are set:

> # Checks if the number to print is exactly 0 and returns a "0" if so
> check_zero <- function(number_to_print) {
+   if (as.numeric(number_to_print) == 0) {
+     return("0")
+   } else {
+     return(number_to_print)
+   }
+ }
> # An auxiliary copy of the data is created
> data_aux <- copd
> # The variable id is omitted
> data_aux$id <- NULL
> # The dimensions of the dataset are stored
> rows <- dim(data_aux)[1]
> columns <- dim(data_aux)[2]
> # The missing values are identified
> isna <- is.na(data_aux)
> # The percentage of missingness is calculated column-wise
> # Min-max are subsequently extracted
> missvar <- 100*colMeans(isna)
> missvar_min <- check_zero(sprintf("%.1f", min(missvar)))
> missvar_max <- check_zero(sprintf("%.1f", max(missvar)))
> # Percentage of variables with less than 5 percent of missing data
> missvar_less_5percent <- check_zero(sprintf("%.1f", 100*mean(missvar < 5)))
> # Percentage of complete cases
> complete_cases <- check_zero(sprintf("%.1f", 100*mean(complete.cases(data_aux))))
> # Overall missing information
> misscells <- check_zero(sprintf("%.1f", 100*mean(isna)))

The resulting text is the following:

The subset used here included 342 participants and 44 variables.
The range of missing data in the different variables went from 0% to 31.6%, with 68.2% of the variables having less than 5% of the data missing. Only 33.3% of the participants had complete data on all 44 variables. Overall, 5.6% of the information was missing; that is, 5.6% of the total of 15,048 cells (44 × 342 = 15,048) had missing data.

Note: to check that a number is exactly 0, we have manually programmed the function check_zero() which, given a string or a number, returns the string "0" if the original number is 0.

2.2 Multiple imputation

Prior to performing multiple imputation, the pattern of missing data can be visualized using the code shown next:

> # Create the plot
> p <- ggmice::plot_pattern(copd, square = TRUE, rotate = TRUE)
> # Save the plot with a larger size
> ggsave(filename = "missings_patter.jpeg", plot = p, width = 30, height = 15)

Due to the number of variables, the plot produced by the md.pattern() function from the mice package is hardly readable. A more readable plot can be generated by combining the ggmice and ggplot2 libraries. It is not shown here because of its size. From the plot, it can be seen that, for example, the variable with the most missing values is missing in less than 50% of the observations, and most observations have only around 5 missing columns.

Now, the imputation model is run with no iterations. This way, we can see which imputation methods are going to be used, as well as which variables are going to be used to predict each other, and modify them if necessary.
> mi_initialization <- mice(copd, maxit=0)

In particular, the imputation methods for each variable can be visualized with the following command:

> mi_initialization$method
##                id ""
##  index_barthel2v1 "pmm"
##       fevp_postv1 ""
##          rv_tlcv1 "pmm"
##          sin11_v1 "logreg"
##           vas2_v1 "pmm"
##        cil8_s_mv1 "pmm"
##           sin7_v1 "logreg"
##       dtdve_ecov1 "pmm"
##          cccs1_v0 "logreg"
##  lvmass_ase_ecov1 "pmm"
##        vtit_catv1 "logreg"
##            pro_v1 "pmm"
##           basA_v1 "pmm"
##           sin5_v1 "logreg"
##             bc_v1 "pmm"
##       graudisn_v1 "polyreg"
## cpcr_mg_l_mediav1 "pmm"
##    quo_calcpostv1 ""
##        ctnf_s_mv1 "pmm"
##         ic_prepv1 "pmm"
##           sin8_v1 "logreg"
##      kcals_set_v0 "pmm"
##         urg_ambv0 ""
##           fc_pmv1 "pmm"
##       fvcp_postv1 ""
##          neutA_v1 "pmm"
##         rv_prepv1 "pmm"
##    vtdvdbid_ecov1 "pmm"
##          iam_chv1 "logreg"
##             ch_v1 "pmm"
##         impactsv1 "pmm"
##            po2_v1 "pmm"
##            tad_v1 "pmm"
##           vas1_v1 "pmm"
##           ahpl_v1 "pmm"
##        incp_fevv1 "pmm"
##          pp_ecov1 "pmm"
##          sin15_v1 "logreg"
##         vc_prepv1 "pmm"
##          monoA_v1 "pmm"
##           ahpf_v1 "pmm"
##            col_v1 "pmm"
##           sin6_v1 "logreg"
##        activityv1 "pmm"

Also, the predictor matrix can be displayed using:

> mi_initialization$predictorMatrix

Before actually performing the multiple imputation, the column corresponding to the ID is removed, as the data will not be reordered throughout the code.

> copd$id <- NULL

Finally, the imputation can be run. Note that the number of imputations used is 50 and the maximum number of iterations (maxit) is 20, as this number was observed empirically to be enough for convergence.

Regarding the predictorMatrix argument, as we are not experts in the subject matter, we cannot make any meaningful links between the variables other than those that are explicitly obvious, such as the FEV1 and FVC variables and the FEV1/FVC variable used in the original paper [1]. Since not even these variables are in our dataset, we assume that all variables can be used to predict all other variables without any problem and, thus, that the predictor matrix should be left as is.
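Had such a link been present in the data, the predictor matrix could have been edited before running the imputation. The following is only a minimal sketch in base R under stated assumptions: the variable names are hypothetical (they are not in the dataset used here), and the matrix is built by hand with the usual starting layout of all ones and a zero diagonal instead of being taken from mice.

```r
# Hypothetical variable names, chosen only for illustration
vars <- c("fev1", "fvc", "fev1_fvc", "age")

# Usual starting layout: every variable predicts every other one,
# so the matrix is all ones except for a zero diagonal
pred <- matrix(1, nrow = length(vars), ncol = length(vars),
               dimnames = list(vars, vars))
diag(pred) <- 0

# Rows are the variables being imputed, columns are their predictors.
# Here the ratio is not used to impute its own components.
pred[c("fev1", "fvc"), "fev1_fvc"] <- 0

print(pred)

# The edited matrix would then be passed to mice(), for example:
# imps <- mice(data, predictorMatrix = pred, m = 50, maxit = 20, seed = 2002)
```

Only the two entries that encode the assumed relationship are changed; every other variable keeps predicting every other one, which matches the choice made in this report.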
On the other hand, the argument method is used to specify the method used to predict each variable. The function mice chooses the method automatically using the data type of the variable. As we do not have any more information than mice does, and having checked the methods proposed, it has been agreed that there should be no changes to the method argument. Additionally, we have set a seed in the seed argument for reproducibility purposes.

> number.of.imputations <- 50
> methods <- mi_initialization$method
> pred_matrix <- mi_initialization$predictorMatrix
> # The multiple imputation is performed
> imps <- mice(data = copd,
+              #predictorMatrix = pred_matrix,
+              #method = methods,
+              m = number.of.imputations,
+              maxit = 20,
+              printFlag = FALSE,
+              seed = 2002
+ )

After running the previous chunk, the variable imps contains all 50 imputed datasets as well as the original dataset.

2.3 Integration of multiple imputation in cluster analysis

The following chunk preprocesses the data for clustering analysis. It involves extracting the imputed datasets, converting non-numeric columns to numeric ones, and passing the processed data to the function getdata().

> # The datasets will be saved here
> mylist <- vector("list", 1 + number.of.imputations)
> # The datasets are extracted, preprocessed and saved
> for (i in 1:(1+number.of.imputations)) {
+   temp_df <- complete(imps, action=i-1)
+   indx <- sapply(temp_df, is.factor)
+   temp_df[indx] <- lapply(temp_df[indx], function(x) as.numeric(as.character(x)))
+   mylist[[i]] <- temp_df
+ };
> # The datasets are passed to getdata and the final preprocessed data is obtained
> imps.processed <- getdata(mylist)

Finally, the clustering analysis can be performed. Next, the choices for the arguments of the function miclust() are justified:

• data: The preprocessed imputed datasets.
• method: The method “kmeans” is chosen as there are no other methods available.
• search: In the original paper, the “backward” search method is used.
• ks: The search range 2 to 5 is used in the original paper. As we have a subset of the variables, this range should be appropriate.
• distance: The “euclidean” distance is the most natural and commonly used distance. It can also be argued that there are no additional reasons to use the “manhattan” one.
• centpos: The data distributions do not justify the use of the “medians” method.
• initcl: As the way the variables are related is unknown, it is safer to use a random initialization.
• seed: For reproducibility purposes, a fixed random seed has been set.

> analysis <- miclust(data=imps.processed,
+                     method="kmeans",
+                     search="backward",
+                     ks=2:5,
+                     distance="euclidean",
+                     centpos="means",
+                     initcl="rand",
+                     seed=2002,
+                     verbose=FALSE
+ )

Once the analysis is done, the variable “analysis” will contain all the relevant information. For example, much of this information can be visualized using the function summary() as follows:

> # 2.3.2
> summary(analysis)

In subsequent sections this information will be visualized and commented on.

3 Results

In this section, the results obtained using the miclust() function will be displayed visually as Figures and Tables. These visual elements will then be used to explain the results in Section 4. The code to generate the plots has been adapted from the original code in the miclust package.

> # k chosen frequency histogram
> k_freqs <- 100*analysis$selectedkdistribution/sum(analysis$selectedkdistribution)
> max_freq <- max(k_freqs);
> par(mar=c(5,5,4,1)) # Adjust plot margins
> barplot(k_freqs, main = "", names.arg = paste("k", analysis$ks, sep = "="), xlab = "Number of clusters",
+         ylab = "Selection frequency (%)", ylim = c(0, max_freq), cex.axis = 1.9, cex.names = 1.9, cex.lab=1.9)

[Bar plot. x-axis: Number of clusters (k=2, k=3, k=4, k=5); y-axis: Selection frequency (%), 0 to 100.]

Figure 1: Frequency plot of the selected number of clusters over the 50 imputed datasets.
The selection has been made using the backward search algorithm along with the CritCF criterion. The clustering algorithm is k-means. Data from the PAC COPD Study (subset of the variables), Spain, 2004–2008.

> # CritCF Boxplot
> par(mar=c(5,5,4,1)) # Adjust plot margins
> boxplot(analysis$critcf, col = "gray", main = "", xlab = "Number of clusters",
+         ylab = "CritCF", cex.axis = 1.9, cex.names = 1.9, cex.lab=1.9)

[Box plots. x-axis: Number of clusters (k=2, k=3, k=4, k=5); y-axis: CritCF, approximately 0.815 to 0.835.]

Figure 2: Box plots of the between-imputation distribution of CritCF by number of clusters (k). Each box plot is based on 50 values, corresponding to the imputed datasets. The bigger the value of CritCF, the better the clustering is. Data from the PAC COPD Study (subset of the variables), Spain, 2004–2008.

> # Frequency of selected variables histogram
> par(mar=c(5, 6, 4, 3)) # Adjust plot margins
> k <- analysis$kfin
> nusedimp <- length(analysis$usedimp)
> selectedvars <- list()
> numberofselectedvariables <- vector()
> # Get the number of selected variables for each imputation
> for (i in 1:nusedimp) {
+   selectedvars[[i]] <- analysis$clustering[[i]][[paste0("k=", k)]]$selectedvariables
+   numberofselectedvariables[i] <- length(selectedvars[[i]])
+ }
> # Calculate summary statistics
> medianfancy <- sprintf("%.1f", median(numberofselectedvariables))
> meanfancy <- sprintf("%.1f", mean(numberofselectedvariables))
> sdfancy <- sprintf("%.1f", sd(numberofselectedvariables))
> q1fancy <- sprintf("%.1f", quantile(numberofselectedvariables, probs = 1/4))
> q3fancy <- sprintf("%.1f", quantile(numberofselectedvariables, probs = 3/4))
> summ <- paste0("Mean = ", meanfancy, ", Median = ", medianfancy, ", Sd = ", sdfancy,
+                ", Q1 = ", q1fancy, ", Q3 = ", q3fancy)
> # Generate the bar plot
> barplot(table(numberofselectedvariables) * 100/sum(table(numberofselectedvariables)), main = "",
+         ylab = "Frequency of selection (%)",
+         xlab = paste("Number of selected variables for", k, "clusters"),
+         cex.axis = 1.9, cex.names = 1.9, cex.lab = 1.9)
> # Add summary statistics to the plot
> mtext(summ, side = 3, line = 1, cex = 1.7)

[Bar plot. x-axis: Number of selected variables for 2 clusters (27 to 37); y-axis: Frequency of selection (%), 0 to 15. Annotation: Mean = 31.9, Median = 31.5, Sd = 3.2, Q1 = 29.0, Q3 = 35.0.]

Figure 3: Frequency plot of the number of selected variables for each of the imputed datasets. Only those datasets with the number of clusters (k) equal to 2 have been taken into account. The variables have been selected using the backward search algorithm along with the CritCF criterion; that is, the k-means algorithm is run excluding one of the variables in the selected set at a time, and the variable whose exclusion yields the highest value of CritCF is removed from the set of selected variables. PAC COPD Study (subset of the variables), Spain, 2004–2008.

> # Number of selected variables plot
> # Prepare data and summary
> selectedvars <- as.factor(unlist(selectedvars))
> selectedvarssummary <- as.data.frame(sort(summary(selectedvars),
+                                           decreasing = TRUE)) * 100 / nusedimp
> nv <- dim(selectedvarssummary)[1]
> # Create an empty plot
> xmin <- 0
> xmax <- 100
> par(mar=c(5, 16, 4, 9)) # Adjust plot margins
> plot(c(xmin, xmax), c(1, nv), type = "n", yaxt = "n", main = "",
+      xlab = paste("Variable selection frequency for", k, "clusters (%)"), ylab = "",
+      xlim = c(0, 100), ylim = c(1, nv * 1.1), cex.axis = 2.3, cex.lab = 2.3)
> axis(2, at = 1:nv, rownames(selectedvarssummary)[nv:1], cex.axis = 1.9, las = 2)
> # Add lines corresponding to the frequencies of appearance for each variable
> for (i in 1:nv) {
+   segments(0, i, selectedvarssummary[nv + 1 - i, 1], i, lwd = 2)
+ }
> # Add horizontal lines corresponding to the summary statistics
> abline(h = nv + 1 - quantile(numberofselectedvariables, probs = 1/4), lty = 3, lwd = 2)
> abline(h = nv + 1 - median(numberofselectedvariables), lty = 2, lwd = 2)
> abline(h = nv + 1 - mean(numberofselectedvariables), lty = 4, lwd = 2)
> abline(h = nv + 1 - quantile(numberofselectedvariables, probs = 3/4), lty = 3, lwd = 2)
> # Add the legend
> legend("top", title = "Number of selected variables", legend = c("Q1 and Q3", "Median", "Mean"),
+        horiz = TRUE, bty = "n", lty = c(3, 2, 4), cex = 2.3)

[Line plot. x-axis: Variable selection frequency for 2 clusters (%), 0 to 100. Legend: Number of selected variables (Q1 and Q3, Median, Mean). y-axis, variables from most to least frequently selected: activityv1, ahpf_v1, cccs1_v0, fevp_postv1, fvcp_postv1, graudisn_v1, iam_chv1, ic_prepv1, impactsv1, index_barthel2v1, po2_v1, quo_calcpostv1, rv_prepv1, rv_tlcv1, sin11_v1, sin15_v1, sin5_v1, sin6_v1, sin7_v1, sin8_v1, urg_ambv0, vas1_v1, vas2_v1, vc_prepv1, vtit_catv1, ahpl_v1, fc_pmv1, cpcr_mg_l_mediav1, neutA_v1, cil8_s_mv1, lvmass_ase_ecov1, vtdvdbid_ecov1, dtdve_ecov1, col_v1, tad_v1, incp_fevv1, bc_v1, pp_ecov1, pro_v1, ch_v1, ctnf_s_mv1.]

Figure 4: Variables that remained in the final set of variables in at least 1 dataset after the variable selection algorithm was applied for k = 2, by percentage of appearance. The last variable included in the final analysis (32nd) appeared in 44% of the datasets. The median number of selected variables is 31.5 and the mean is 31.9. PAC COPD Study (subset of the variables), Spain, 2004–2008.
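The idea behind Figure 4 can be sketched on toy data in base R. This is only an illustration under stated assumptions: the variable names and the five "imputed datasets" are made up, and rounding the median up is an assumption that mirrors the 31.5 to 32 rounding reported here, not a statement about how miclust rounds internally.

```r
# Toy input: the variables kept by the selection step in each of five
# hypothetical imputed datasets (names are made up for illustration)
selected <- list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "b", "c", "d"),
  c("a", "c"),
  c("a", "b", "c")
)

# Percentage of imputed datasets in which each variable was kept
all_vars <- sort(unique(unlist(selected)))
freq <- vapply(all_vars, function(v) {
  100 * mean(vapply(selected, function(s) v %in% s, logical(1)))
}, numeric(1))
freq <- sort(freq, decreasing = TRUE)

# Median number of variables kept per dataset (rounded up: an assumption)
n_keep <- ceiling(median(lengths(selected)))

# Final set: the n_keep most frequently selected variables
final_vars <- names(freq)[seq_len(n_keep)]

print(freq)
print(final_vars)
```

With this rule, a variable can enter the final set even if it was selected in fewer than half of the datasets, which is exactly the situation of the 32nd variable in Figure 4.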
> # The probability of assignment values are extracted
> summary_analysis_2 <- summary(analysis, k=2)
> summary_analysis_3 <- summary(analysis, k=3)
> # The information is extracted and formatted as a single data frame
> table_k2 <- summary_analysis_2$allocationprobabilities
> table_k2 <- rbind(table_k2, rep(" ", times = dim(table_k2)[2]))
> table_k3 <- summary_analysis_3$allocationprobabilities
> table_full <- rbind(table_k2, table_k3)
> k_indicator <- c("2", "2", "","3", "3", "3")
> table_full <- cbind(k = k_indicator, table_full)
> # exporting coefficients table to LaTeX format:
> mytable <- xtable(table_full, digits = c(0, 0, 2, 2, 2, 2, 2),
+                   align = c("l", "c", "c", "c", "c", "c", "c"),
+                   caption = "Distribution of the Frequencies of Assignment (Proportions) for Each Cluster (for k = 2 and k = 3). The results for k=3 are only for illustrative purposes. PAC COPD Study (subset of the variables), Spain, 2004--2008.",
+                   label = "tab:assignement_distribution")
> # cosmetics for table head:
> colnames(mytable) <- c("\\multicolumn{1}{c}{\\bf{k}}",
+                        "\\multicolumn{1}{c}{\\bf{Minimum}}",
+                        "\\multicolumn{1}{c}{\\bf{First Quartile}}",
+                        "\\multicolumn{1}{c}{\\bf{Median}}",
+                        "\\multicolumn{1}{c}{\\bf{Third Quartile}}",
+                        "\\multicolumn{1}{c}{\\bf{Maximum}}")
> rownames(mytable) <- c("Cluster 1", "Cluster 2", "","Cluster 1 ", "Cluster 2 ", "Cluster 3")
> # Format and print the table:
> print(mytable,
+       size = "footnotesize",
+       include.rownames = TRUE,
+       include.colnames = TRUE,
+       floating = TRUE,
+       sanitize.text.function = force,
+       booktabs = TRUE)

            k   Minimum   First Quartile   Median   Third Quartile   Maximum
Cluster 1   2   0.54      1                1        1                1
Cluster 2   2   0.52      1                1        1                1

Cluster 1   3   0.46      1                1        1                1
Cluster 2   3   0.6       1                1        1                1
Cluster 3   3   0.48      1                1        1                1

Table 1: Distribution of the Frequencies of Assignment (Proportions) for Each Cluster (for k = 2 and k = 3). The results for k=3 are only for illustrative purposes. PAC COPD Study (subset of the variables), Spain, 2004–2008.
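One plausible reading of the quantities in Table 1 can be sketched on toy data in base R. This is an illustration of the idea only, under the assumption that the frequency of assignment is the share of imputed datasets in which an individual falls in their final (majority-vote) cluster; the data are made up and the code does not reproduce the internals of miclust.

```r
# Toy input: cluster labels (already relabeled to be comparable) for
# 4 individuals (rows) across 5 hypothetical imputed datasets (columns)
assignments <- rbind(
  c(1, 1, 1, 1, 1),
  c(1, 1, 2, 1, 1),
  c(2, 2, 2, 2, 2),
  c(2, 1, 2, 2, 1)
)

# Final cluster of each individual: the most frequent label (majority vote)
final_cluster <- apply(assignments, 1, function(x) {
  as.integer(names(which.max(table(x))))
})

# Frequency of assignment: share of imputations agreeing with the final cluster
alloc_prob <- rowMeans(assignments == final_cluster)

# Distribution of these frequencies within each final cluster
print(tapply(alloc_prob, final_cluster, summary))
```

A value of 1 means that the individual lands in the same cluster in every imputed dataset, so quartiles equal to 1, as in Table 1, indicate that at least three quarters of the individuals are never reassigned.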
> # The medians of the means of the variables per cluster are extracted
> summary_analysis <- summary(analysis, k=2)
> variables_description <- summary_analysis$summarybycluster[c("mean (cl.1)", "mean (cl.2)")]
> analysis_variables <- data_aux[c(row.names(variables_description))]
> # The percentage of NAs for each variable is calculated manually
> count_na <- function(x, output){100*(sum(is.na(x))/rows)}
> percentage_nans <- data.frame(apply(analysis_variables, 2, count_na))
> variables_description <- cbind(percentage_nans, variables_description)
> var_desc_col_names <- names(variables_description)
> names(variables_description) <- c("%miss.", var_desc_col_names[2], var_desc_col_names[3])
> # The means for each variable are calculated for the raw dataset
> data_aux <- copd
> raw_selected <- data_aux[rownames(variables_description)]
> raw_assigned <- cbind(raw_selected, analysis$clustering$clusters$`k=2`["assigned"])
> for (col_name in names(raw_assigned)) {
+   if (!is.numeric(raw_assigned[[col_name]])) {
+     raw_assigned[[col_name]] <- as.numeric(as.character((raw_assigned[[col_name]])))
+   }
+ }
> raw_k1 <- subset(raw_assigned, assigned == 1)
> raw_k2 <- subset(raw_assigned, assigned == 2)
> raw_means_k1 <- colMeans(raw_k1, na.rm = TRUE)
> raw_means_k2 <- colMeans(raw_k2, na.rm = TRUE)
> # The last element (the mean of the "assigned" column) is dropped
> raw_means_k1 <- head(raw_means_k1, -1)
> raw_means_k2 <- head(raw_means_k2, -1)
> # The raw means are inserted right after the missingness column
> variables_description <- add_column(variables_description, raw_cl1 = raw_means_k1,
+                                     .after = colnames(variables_description)[1])
> variables_description <- add_column(variables_description, raw_cl2 = raw_means_k2,
+                                     .after = colnames(variables_description)[2])
> # exporting coefficients table to LaTeX format:
> mytable2 <- xtable(variables_description, digits = c(0, 1, 1, 1, 1, 1),
+                    align = c("r", "c", "c", "c", "c", "c"),
+                    caption = "Description of the Variables (Mean Values) by Cluster. The values before and after multiple imputation are compared. PAC-COPD Study, Spain, 2004--2008. The columns under ``(Raw)'' show the results obtained when the final cluster assignment was decided by majority vote and the missing values of the variables were excluded from the calculation of the means. The columns under ``(Imp)'' show the results obtained when using the 50 data sets with imputed missing values and variable cluster assignment. The values corresponding to the imputed data are the medians over the 50 imputed datasets.",
+                    label = "tab:variables_description")
> # cosmetics for table head:
> colnames(mytable2) <- c("\\multicolumn{1}{c}{\\bf{\\makecell{\\% of Subjects With \\\\ Missing Values}}}",
+                         "\\multicolumn{1}{c}{\\bf{\\makecell{Cluster 1 \\\\ (Raw)}}}",
+                         "\\multicolumn{1}{c}{\\bf{\\makecell{Cluster 2 \\\\ (Raw)}}}",
+                         "\\multicolumn{1}{c}{\\bf{\\makecell{Cluster 1 \\\\ (Imp)}}}",
+                         "\\multicolumn{1}{c}{\\bf{\\makecell{Cluster 2 \\\\ (Imp)}}}")
> rownames(mytable2) <- gsub("_", "-", rownames(mytable2))
> # format and print the table:
> print(mytable2,
+       size = "footnotesize",
+       include.rownames = TRUE,
+       include.colnames = TRUE,
+       floating = TRUE,                 # whether \begin{table} should be created (TRUE) or not (FALSE)
+       sanitize.text.function = force,  # important to treat content of columns as latex function
+       booktabs = TRUE)                 # requires \usepackage{booktabs} in the preamble of the document

                    % of Subjects With   Cluster 1   Cluster 2   Cluster 1   Cluster 2
                    Missing Values       (Raw)       (Raw)       (Imp)       (Imp)
activityv1           1.2                 34.5        61.5        33.1        63.1
ahpf-v1             16.4                 2.9         2.4         3.0         2.4
cccs1-v0             1.5                 0.1         0.2         0.1         0.2
fevp-postv1          0.0                 64.3        41.9        65.3        43.1
fvcp-postv1          0.0                 79.7        66.2        80.1        66.0
graudisn-v1          1.2                 2.0         3.3         2.0         3.5
iam-chv1             0.9                 0.1         0.1         0.2         0.2
ic-prepv1            5.6                 72.7        53.5        71.4        56.1
impactsv1            1.2                 14.4        36.9        14.9        38.2
index-barthel2v1     2.6                 99.5        97.6        99.7        97.8
po2-v1               3.2                 77.1        70.3        78.0        70.3
quo-calcpostv1       0.0                 60.8        48.3        62.5        49.8
rv-prepv1            7.9                 133.1       178.1       133.9       183.9
rv-tlcv1             7.9                 49.5        62.4        49.5        62.2
sin11-v1             1.2                 0.5         0.6         0.5         0.7
sin15-v1             1.2                 0.1         0.1         0.1         0.1
sin5-v1              1.2                 0.2         0.4         0.2         0.4
sin6-v1              1.2                 0.5         0.8         0.6         0.8
sin7-v1              1.2                 0.2         0.3         0.2         0.4
sin8-v1              1.2                 0.5         0.6         0.4         0.6
urg-ambv0            0.0                 0.4         0.4         0.4         0.4
vas1-v1              1.2                 1.7         4.2         1.5         4.2
vas2-v1              1.2                 3.7         6.6         3.7         6.8
vc-prepv1            5.6                 76.8        63.1        76.0        64.5
vtit-catv1          25.4                 0.6         0.7         0.5         0.6
ahpl-v1             19.3                 2.9         2.4         2.8         2.5
fc-pmv1             11.4                 2.1         3.3         2.5         3.6
cpcr-mg-l-mediav1    5.0                 0.7         1.1         0.6         0.6
neutA-v1             2.0                 4.4         4.8         4.4         5.3
cil8-s-mv1           4.1                 4.8         5.4         4.8         4.8
lvmass-ase-ecov1    15.8                 207041.2    193781.9    206245.7    197112.1
vtdvdbid-ecov1      31.6                 32.2        31.3        32.1        30.9

Table 2: Description of the Variables (Mean Values) by Cluster. The values before and after multiple imputation are compared. PAC-COPD Study, Spain, 2004–2008. The columns under “(Raw)” show the results obtained when the final cluster assignment was decided by majority vote and the missing values of the variables were excluded from the calculation of the means. The columns under “(Imp)” show the results obtained when using the 50 data sets with imputed missing values and variable cluster assignment. The values corresponding to the imputed data are the medians over the 50 imputed datasets.

> # Cohen's Kappa distribution to assess cluster relabeling
> analysis_summary = summary(analysis)
> kappas = analysis_summary$kappas
> kappas_statistics = analysis_summary$kappadistribution
> par(mar=c(5, 7, 4, 4)) # Adjust plot margins
> hist(kappas, main="", xlab="Cohen's Kappa Values", cex.axis = 1.9, cex.names = 1.9, cex.lab=1.9)
> abline(v = kappas_statistics["mean"], lwd=2, lty=1) # Mean
> abline(v = kappas_statistics["25%"], lwd=2, lty=2)  # First quartile
> abline(v = kappas_statistics["75%"], lwd=2, lty=2)  # Third quartile

[Histogram. x-axis: Cohen's Kappa Values (0.93 to 0.96); y-axis: Frequency (0 to 10).]

Figure 5: Distribution of the Cohen's Kappa values obtained from the 49 comparisons when performing the cluster relabeling operation. The continuous line corresponds to the mean and the dashed lines to the first and third quartiles. PAC COPD Study (subset of the variables), Spain, 2004–2008.
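For reference, the Cohen's Kappa between two cluster assignments can be computed by hand as kappa = (p_obs - p_exp) / (1 - p_exp), where p_obs is the observed agreement and p_exp the agreement expected by chance. The following base R sketch uses made-up labels; the helper cohen_kappa() is hypothetical and is not part of miclust.

```r
# Hypothetical helper: Cohen's Kappa between two label vectors
cohen_kappa <- function(a, b) {
  lv <- sort(unique(c(a, b)))
  # Square contingency table over the common set of labels
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                      # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Toy cluster labels of 8 individuals in a reference dataset and in another one
ref   <- c(1, 1, 1, 1, 2, 2, 2, 2)
other <- c(1, 1, 1, 2, 2, 2, 2, 2)

print(cohen_kappa(ref, other))
```

In this toy case one of the eight individuals changes cluster, which already lowers the kappa to 0.75; values around 0.95, as in Figure 5, therefore correspond to very few reassignments.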
It must be noted that the function plot() could have been used to produce all 4 plots presented here at the same time. However, there are various upsides to plotting them separately: it allows for better and more in-depth customization of the plots, they can be cited separately, and individual captions can be created for them.

Note: To compare the variable means between clusters, F-statistics could have been calculated. However, simply by inspection, some differences in the means stand out: variables impactsv1 and vas1-v1 display the most notable differences, whereas variables iam-chv1, sin15-v1 and urg-ambv0 have identical means. Nonetheless, as we do not have the variances, we cannot be sure that these differences are significant.

4 Report

After the miclust() function performed the backward search algorithm for each of the 50 imputed datasets, 50 pairs (k, number-of-variables) were obtained. The distribution of the values of k is shown in Figure 1. All values turned out to be 2. Additionally, the results using CritCF are displayed in Figure 2. On the other hand, the distribution of the number of variables selected can be seen in Figure 3.

In light of these results, the chosen k is clearly 2. A selection frequency of 100% does not mean that there is no uncertainty, but it can be said that the uncertainty introduced by the missing data is low.

The miclust function's default strategy for selecting the final number of variables is based on the median number of variables chosen across imputations, which is 32 (rounded from 31.5). These 32 variables are the most frequently selected ones and are listed in Table 2. This is a reasonable result, as it aligns with the rule of thumb of 10 subjects per variable (342 participants for 32 variables, about 10.7 subjects per variable). Further examination of the frequency plot (Figure 4) reveals that 25 of the 32 variables are present in all imputed datasets, indicating low uncertainty for them.
However, 2 variables appear in less than 50% of the datasets, suggesting they might have been excluded with an alternative method (e.g., selecting variables present in at least 50% of datasets).

After relabeling the clusters, 49 Cohen's Kappa values were obtained. Their distribution can be seen in Figure 5. These results reflect a high level of agreement when relabeling the clusters, as most values are around 0.95. Additionally, the quartiles are close to the mean, showing little variance.

Finally, in order to analyse the probability of assignment to each cluster, it is illustrative to look at Table 1. This table provides a way of assessing the uncertainty of cluster assignment to individuals. From the results, it can be said with reasonable certainty that a single individual is likely to be assigned to the same cluster regardless of the particular imputation it happens to be in. Therefore, the decision seems robust; the influence of missing data is low. In particular, even at the first quartile, the probability is already 1. In contrast to the results shown in the original paper, results for k=3 are also reasonably good.

5 References

[1] X. Basagana, J. Barrera-Gomez, M. Benet, JM. Anto, and J. Garcia-Aymerich. A Framework for Multiple Imputation in Cluster Analysis. American Journal of Epidemiology, 177(7):718–725, 2013. URL https://doi.org/10.1093/aje/kws289
