Multiple Imputation in Cluster Analysis

Summary

This document presents a framework for applying multiple imputation in cluster analysis. It discusses methods for deciding the optimal number of clusters and selecting variables for the final analysis, along with a description of the uncertainty associated with missing data. Implementation details in the R "miclust" package and associated exercises are also included.

Full Transcript

A framework for the application of multiple imputation in cluster analysis
Jose Barrera (ISGlobal & UAB), Statistics in Health Sciences, 2023/2024

Results for each imputed data set (slide 11 / 18)

• We now can apply the procedure described in slide 10 to each of the r imputed data sets.
• As a result, we have a combination of optimal k (k_opt) and subset of selected variables (S) that gives the highest CritCF in each of the r imputed data sets: (k_opt,1, S_1), ..., (k_opt,r, S_r).
• Now, we have to pool the results...

Deciding the optimal number of clusters (slide 12 / 18)

• We can decide the optimal number of clusters (k_fin) as the most frequently selected one according to CritCF.
• We can describe uncertainty in k_fin associated with missing data. For example, with a barplot of the selection frequency of each value of k and the boxplot of CritCF for each value of k. These plots are provided by the plot method in the miclust package.

Deciding the subset of clustering variables (slide 13 / 18)

We now retain only the results for the r imputed data sets with k = k_fin, and then we can:

• Decide the variables to be included in the final analysis (S_fin) according to their selection frequency. For example, keep the variables that are selected in at least 50% of the data sets; or keep the s most frequent variables, where s is determined by the median size of S_1, ..., S_r.
• Describe uncertainty in S_fin associated with missing data. For example, show the distribution of the sizes of S_1, ..., S_r, and the frequency of selection of each variable. These plots are also provided by the plot method in the miclust package.
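
The two pooling rules described above (most frequent number of clusters, then variable selection by frequency) can be illustrated with a minimal base R sketch. This is not the miclust implementation: the object names (k_opt, selected, s_fin_half, s_fin_median), the variable names, and all values are hypothetical and chosen only for illustration. Rounding the median size to an integer is also an assumption, since the slides do not say how a non-integer median should be handled.

```r
# Hypothetical results from r = 6 imputed data sets (illustrative values only).
# k_opt[i]: number of clusters with the highest CritCF in imputed data set i.
k_opt <- c(3, 3, 2, 3, 4, 3)

# Pooled number of clusters: the most frequently selected value of k.
k_freq <- table(k_opt)
k_fin <- as.integer(names(k_freq)[which.max(k_freq)])

# Hypothetical sets of variables selected with k = k_fin in each data set.
selected <- list(
  c("age", "bmi", "fev1"),
  c("age", "bmi"),
  c("age", "fev1", "smoking"),
  c("age", "bmi", "fev1"),
  c("bmi", "fev1"),
  c("age", "bmi", "fev1", "smoking")
)

# Selection frequency of each variable (proportion of data sets), highest first.
all_vars <- sort(unique(unlist(selected)))
var_freq <- vapply(
  all_vars,
  function(v) mean(vapply(selected, function(s) v %in% s, logical(1))),
  numeric(1)
)
var_freq <- sort(var_freq, decreasing = TRUE)

# Rule A: keep the variables selected in at least 50% of the data sets.
s_fin_half <- names(var_freq)[var_freq >= 0.5]

# Rule B: keep the s most frequent variables, where s is the median size of
# the selected sets (rounded to an integer, which is an assumption here).
s <- round(median(lengths(selected)))
s_fin_median <- names(var_freq)[seq_len(s)]
```

With tied selection frequencies, rule B depends on how ties are ordered, so in practice the tie-breaking choice should be stated explicitly.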

Final analysis (slide 14 / 18)

1. Refit the cluster analysis with k = k_fin in the r data sets containing only the variables in S_fin.
2. Relabel the clusters so that they all have the same meaning in the r data sets.
3. Allocation of subjects to clusters:
    1. Assign each subject to a cluster. For example, assign subjects to the cluster they were assigned to in most data sets.
    2. Describe the uncertainty in subject allocation to clusters. For example, describe the distribution of the frequency of assignment to each cluster.
4. Description of clusters. It can be done by restricting each subject to be assigned to only one cluster, or by taking into account that a subject may be assigned to different clusters in different data sets.

The results in steps 3 and 4 are provided by the summary method in the miclust package.

Proposed framework by Basagaña et al. [1] (slide 15 / 18)

[Table 1 is not reproduced in this transcript.]

Implementation in R

The miclust package (slide 16 / 18)

• The proposed framework is implemented in the miclust package:

    > library(miclust)
    ##
    ## This is miclust 1.2.8. For details, use:
    ## > help(package = 'miclust')
    ##
    ## To cite the methods in the package use:
    ## > citation('miclust')

• The main functions in the package are:
    • getdata, to format the provided multiply imputed data sets.
    • miclust, to perform the proposed analysis detailed in Table 1 (slide 15).
    • getvariablesfrequency, to create a ranked selection frequency for the variables.
    • summary method, to print a summary of the results.
    • plot method, to visualize the results.

Exercises (slide 17 / 18)

Methods

These slides summarize the framework proposed by Basagaña et al. [1] to integrate multiple imputation with cluster analysis. Read the paper and the supplementary document carefully (both attached in the provided ZIP) to get familiar with the methods.

miclust package

Get familiar with the miclust package usage. Specifically:

1. Read the package manual (https://cran.r-project.org/web/packages/miclust/miclust.pdf).
2. Run and explore the examples of the getdata function:
    > example(topic = "getdata", package = "miclust", type = "html")
3. Run and explore the examples of the miclust function:
    > example(topic = "miclust", package = "miclust", type = "html")
4. Read the examples at https://cran.r-project.org/web/packages/miclust/readme/README.html

References (slide 18 / 18)

[1] X. Basagaña, J. Barrera-Gómez, M. Benet, JM. Antó, and J. Garcia-Aymerich. A Framework for Multiple Imputation in Cluster Analysis. American Journal of Epidemiology, 177(7):718–725, 2013. URL https://doi.org/10.1093/aje/kws289.
[2] GW. Milligan. Cluster analysis. In S. Kotz, CB. Read, N. Balakrishnan, and B. Vidakovic, editors, Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc, New York, New York, USA, 1998.
[3] AK. Jain, AP. Topchy, MHC. Law, and JM. Buhmann. Landscape of clustering algorithms. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 1, pages 260–263, Washington, DC, 2004. IEEE Computer Society Press.
[4] S. Zhong and J. Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001–1037, 2003. URL https://www.jmlr.org/papers/volume4/zhong03a/zhong03a.pdf.
[5] D. Steinley. K-means clustering: a half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1):1–34, 2006. URL https://doi.org/10.1348/000711005X48266.
[6] M. Breaban and H. Luchian. A unifying criterion for unsupervised clustering and feature selection. Pattern Recognition, 44(4):854–865, 2011. URL https://doi.org/10.1016/j.patcog.2010.10.006.
[7] AK. Jain, RPW. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000. URL https://doi.org/10.1109/34.824819.
[8] Y. Wang, DJ. Miller, and R. Clarke. Approaches to working in high-dimensional data spaces: gene expression microarrays. British Journal of Cancer, 98(6):1023–1028, 2008. URL https://doi.org/10.1038/sj.bjc.6604207.
