B.Sc. Degree in Applied Statistics
Statistics in Health Sciences
6. Example of methods: Integration of multiple imputation in cluster analysis

Jose Barrera (a,b)
[email protected]
https://sites.google.com/view/josebarrera
a ISGlobal, Barcelona Institute for Global Health - Campus MAR
b Department of Mathematics (UAB)

This work is licensed under a Creative Commons "Attribution-NonCommercial-ShareAlike 4.0 International" license.

Contents
1 Introduction
2 Overview of multiple imputation
3 Overview of cluster analysis
   k-means clustering method
   Optimal number of clusters
   Feature selection
   Joint selection of optimal k and clustering variables
4 A framework for the application of multiple imputation in cluster analysis
5 Implementation in R
6 Exercises

Introduction
Aims
• The aim of this lesson is to learn about a proposed framework to integrate multiple imputation in cluster analysis.
• Specifically, we will learn and practice:
   • How to apply multiple imputation to a data set with missing data and integrate all the imputed data sets in a cluster analysis using the k-means algorithm.
   • How to describe the impact of the missing data on the uncertainty when deciding the optimal number of clusters.
   • How to describe the impact of the missing data on the uncertainty when selecting the most important clustering variables.
• To do that, we will work on the study by Basagaña et al. [1] and the R package miclust.

Overview of multiple imputation
Multiple imputation
• In the presence of missing data, complete case analysis can produce biased inferences. In addition, it leads to a loss of efficiency due to the loss of information.
• Multiple imputation is a proper alternative to complete case analysis when dealing with missing data.
• Multiple imputation provides a number of imputed data sets. The statistical model of interest is applied to each of the imputed data sets and the results are then pooled.
• Multiple imputation is especially well suited for situations where the main interest is in a population parameter, such as a regression coefficient β.

Overview of cluster analysis
Cluster analysis
• Cluster analysis is the process whereby data elements are classified into homogeneous unknown groups. [2]
• Numerous clustering methods exist, falling into different families such as hierarchical, partitional, or model-based algorithms. [2–4]
• We will consider the k-means algorithm, which is probably the most widely used partitional technique. [5]

k-means clustering method
k-means searching approach
• Finding the optimal clustering by performing an exhaustive search of all possible partitions is not computationally feasible.
• k-means reduces the search, but it is not guaranteed to reach a globally optimal solution.

The k-means algorithm (a minimal R sketch follows the steps)
1 Set the number of clusters, k, as an input;
2 Set k arbitrary initial clusters (chosen randomly, using the output of a hierarchical clustering technique, or using other criteria);
3 Calculate the centroid of each cluster;
4 Move each individual to the closest cluster (according to its centroid);
5 Iterate from step 3 until no individuals can be moved between clusters.
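The steps above correspond to what base R's kmeans() implements (Hartigan-Wong by default; algorithm = "Lloyd" is the most literal match for steps 1–5). A minimal sketch, using the standardized iris measurements as illustrative data; the nstart argument addresses the dependence on the initial centroids discussed next:

set.seed(123)                  # reproducible random initial centroids
X <- scale(iris[, 1:4])        # illustrative data: standardized numeric variables
fit <- kmeans(X, centers = 3,  # step 1: fix k = 3
              nstart = 25)     # rerun from 25 random starts and keep the best
                               # solution (lowest total within-cluster inertia)
fit$cluster                    # final allocation of each individual (step 5)
fit$tot.withinss               # within-cluster inertia (W)
fit$betweenss                  # between-cluster inertia (B)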
The final clustering will depend on the initial centroids, so it is suggested to run the algorithm several times using different starting values and to choose the best solution according to some criterion. [5]

Optimal number of clusters
Deciding the number of clusters
• To find the optimal value of k, several methods have been suggested, usually based on repeating the clustering algorithm with different values of k and comparing the results using some criterion. [5]
• To compare the fit of two classifications with different k, some penalization on the value of k is needed to prevent choosing as many clusters as observations. [6]

Feature selection
Selecting the clustering variables
• The number of variables included in the cluster analysis is a relevant issue.
• Applying any clustering algorithm to high-dimensional data can be problematic: adding more variables to an analysis may degrade the final classification if the number of individuals (n) is small relative to the number of variables (p). [7,8]
• Hence, it is convenient to choose a small number of salient variables to create the clusters, a task usually referred to as feature selection. [6–8]
• As a general guideline, it is good practice to have n/p ⩾ 10. [7]
• The main difficulty in choosing the best subset of variables is that most criteria depend on distances, which are altered by changing the dimensionality. [6]
• This makes it difficult to compare two clustering classifications based on different numbers of variables.

Joint selection of optimal k and clustering variables
CritCF
• The optimal number of clusters and the final set of variables can be selected according to CritCF [6], which can rank partitions based on different numbers of clusters and different numbers of variables:

CritCF = \left( \frac{2m}{2m + 1} \cdot \frac{1}{1 + W/B} \right)^{\log_2(k + 2) / \log_2(m + 2)},

where m is the number of variables, k is the number of clusters, and W and B are the within- and between-cluster inertias.
• Higher values of CritCF are preferred.

Joint selection of optimal k and clustering variables
Searching strategy based on CritCF
One possible (non-exhaustive) search strategy to find both the optimal number of clusters and the final set of variables is a backward sequential selection algorithm, which works as follows [6] (a minimal R sketch follows the steps):
1 For each k ∈ {2, . . . , kmax}, where kmax is fixed a priori:
   i Define S as the set of selected variables and R as the set of removed variables;
   ii Start with R empty and all variables in S;
   iii For each variable in S, run the k-means algorithm excluding that variable;
   iv Move from S to R the variable whose exclusion provides the highest value of CritCF;
   v Iterate from step iii until no variable in S brings an improvement in CritCF when excluded.
2 The final model is the combination of k and S that gives the highest CritCF.
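To make CritCF and the backward search concrete, here is a minimal sketch in base R (not the miclust implementation; the helpers critcf and backward_search are hypothetical names, and X is assumed to be a complete, standardized numeric matrix with named columns):

## CritCF of a k-means fit, from the formula above
## (m = number of variables used in the fit).
critcf <- function(fit, m) {
  k <- nrow(fit$centers)   # number of clusters
  W <- fit$tot.withinss    # within-cluster inertia
  B <- fit$betweenss       # between-cluster inertia
  base <- (2 * m / (2 * m + 1)) * (1 / (1 + W / B))
  base^(log2(k + 2) / log2(m + 2))
}

## Backward sequential selection for a fixed k (steps i-v above).
backward_search <- function(X, k, nstart = 25) {
  S <- colnames(X)  # ii: start with all variables in S (R is its complement)
  best <- critcf(kmeans(X[, S], k, nstart = nstart), length(S))
  while (length(S) > 1) {
    ## iii: CritCF obtained when excluding each variable in S in turn
    cand <- sapply(S, function(v) {
      Sv <- setdiff(S, v)
      critcf(kmeans(X[, Sv, drop = FALSE], k, nstart = nstart), length(Sv))
    })
    if (max(cand) <= best) break  # v: stop when no exclusion improves CritCF
    best <- max(cand)             # iv: remove the variable whose exclusion
    S <- setdiff(S, names(which.max(cand)))  #     gives the highest CritCF
  }
  list(k = k, S = S, CritCF = best)
}

## 1-2: run the search for each k and keep the (k, S) with the highest CritCF.
X <- scale(iris[, 1:4])  # illustrative data
res <- lapply(2:5, function(k) backward_search(X, k))
res[[which.max(sapply(res, function(r) r$CritCF))]]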
A framework for the application of multiple imputation in cluster analysis
Results for each imputed data set
• We can now apply the procedure described above to each of the r imputed data sets.
• As a result, we have the combination of optimal k (kopt) and subset of selected variables (S) that gives the highest CritCF in each of the r imputed data sets: (kopt,1, S1), . . . , (kopt,r, Sr).
• Now we have to pool the results. . .

A framework for the application of multiple imputation in cluster analysis
Deciding the optimal number of clusters
• We can decide the optimal number of clusters (kfin) as the most frequently selected one according to CritCF.
• We can describe the uncertainty in kfin associated with missing data, for example with a barplot of the selection frequency of each value of k and a boxplot of CritCF for each value of k. These plots are provided by the plot method in the miclust package.

A framework for the application of multiple imputation in cluster analysis
Deciding the subset of clustering variables
We now retain only the results for the r imputed data sets with k = kfin, and then we can:
• Decide the variables to be included in the final analysis (Sfin) according to their selection frequency. For example, keep the variables that are selected in at least 50% of the data sets; or keep the s most frequent variables, where s is determined by the median size of S1, . . . , Sr.
• Describe the uncertainty in Sfin associated with missing data. For example, show the distribution of the sizes of S1, . . . , Sr, and the selection frequency of each variable. These plots are also provided by the plot method in the miclust package.

A framework for the application of multiple imputation in cluster analysis
Final analysis
1 Refit the cluster analysis with k = kfin in the r data sets containing only the variables in Sfin.
2 Relabel the clusters so that they all have the same meaning in the r data sets.
3 Allocation of subjects to clusters:∗
   1 Assign each subject to a cluster. For example, assign subjects to the cluster they were assigned to in most data sets.
   2 Describe the uncertainty in subject allocation to clusters. For example, describe the distribution of the frequency of assignment to each cluster.
4 Description of clusters.∗ It can be done by restricting each subject to be assigned to only one cluster, or by taking into account that a subject may be assigned to different clusters in different data sets.
∗ These results are provided by the summary method in the miclust package.

A framework for the application of multiple imputation in cluster analysis
Proposed framework by Basagaña et al. [1]
[Summary table of the framework steps, reproduced from the paper; not recoverable from the transcript.]

Implementation in R
The miclust package
• The proposed framework is implemented in the miclust package:

> library(miclust)
##
## This is miclust 1.2.8. For details, use:
## > help(package = 'miclust')
##
## To cite the methods in the package use:
## > citation('miclust')

• The main functions in the package are (a minimal usage sketch follows the list):
   • getdata, to format the provided multiply imputed data sets;
   • miclust, to perform the proposed analysis summarized in the framework above;
   • getvariablesfrequency, to compute the ranked selection frequency of the variables;
   • the summary method, to print a summary of the results;
   • the plot method, to visualize the results.
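A minimal usage sketch of this workflow (an illustration under assumptions, not a reproduction of the package examples: implist is a hypothetical list of data frames holding the original data with missing values and its multiply imputed versions, which is the kind of input getdata() formats; check ?getdata and ?miclust for the exact expected layout and the available arguments):

library(miclust)

dat <- getdata(implist)  # format the multiply imputed data sets

## k-means with backward variable selection, run in each imputed data set
## and pooled across imputations (defaults used; see ?miclust for the
## arguments controlling the explored values of k, the search, etc.)
res <- miclust(dat)

summary(res)  # pooled optimal k, selected variables, subject allocation
plot(res)     # uncertainty plots: selection frequency of each k, CritCF,
              # variable selection frequencies, ...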
Exercises
Methods
These slides summarize the framework proposed by Basagaña et al. [1] to integrate multiple imputation with cluster analysis. Read the paper and the supplementary document carefully (both attached in the provided ZIP) to get familiar with the methods.

miclust package
Get familiar with the usage of the miclust package. Specifically:
1 Read the package manual (https://cran.r-project.org/web/packages/miclust/miclust.pdf).
2 Run and explore the examples of the getdata function:
> example(topic = "getdata", package = "miclust", type = "html")
3 Run and explore the examples of the miclust function:
> example(topic = "miclust", package = "miclust", type = "html")
4 Read the examples at https://cran.r-project.org/web/packages/miclust/readme/README.html

References
[1] X. Basagaña, J. Barrera-Gómez, M. Benet, JM. Antó, and J. Garcia-Aymerich. A framework for multiple imputation in cluster analysis. American Journal of Epidemiology, 177(7):718–725, 2013. https://doi.org/10.1093/aje/kws289
[2] GW. Milligan. Cluster analysis. In S. Kotz, CB. Read, N. Balakrishnan, and B. Vidakovic, editors, Encyclopedia of Statistical Sciences. John Wiley & Sons, New York, NY, USA, 1998.
[3] AK. Jain, AP. Topchy, MHC. Law, and JM. Buhmann. Landscape of clustering algorithms. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 1, pages 260–263, Washington, DC, 2004. IEEE Computer Society Press.
[4] S. Zhong and J. Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001–1037, 2003. https://www.jmlr.org/papers/volume4/zhong03a/zhong03a.pdf
[5] D. Steinley. K-means clustering: a half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1):1–34, 2006. https://doi.org/10.1348/000711005X48266
[6] M. Breaban and H. Luchian. A unifying criterion for unsupervised clustering and feature selection. Pattern Recognition, 44(4):854–865, 2011. https://doi.org/10.1016/j.patcog.2010.10.006
[7] AK. Jain, RPW. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000. https://doi.org/10.1109/34.824819
[8] Y. Wang, DJ. Miller, and R. Clarke. Approaches to working in high-dimensional data spaces: gene expression microarrays. British Journal of Cancer, 98(6):1023–1028, 2008. https://doi.org/10.1038/sj.bjc.6604207