Statistics in Health Sciences PDF

B.Sc. Degree in Applied Statistics
Statistics in Health Sciences
6. Example of methods: Integration of multiple imputation in cluster analysis

Jose Barrera (a, b)
https://sites.google.com/view/josebarrera
(a) ISGlobal Barcelona Institute for Global Health - Campus MAR
(b) Department of Mathematics (UAB)

This work is licensed under a Creative Commons “Attribution-NonCommercial-ShareAlike 4.0 International” license.

1 Introduction
2 Overview of multiple imputation
3 Overview of cluster analysis
    k-means clustering method
    Optimal number of clusters
    Feature selection
    Joint selection of optimal k and clustering variables
4 A framework for the application of multiple imputation in cluster analysis
5 Implementation in R
6 Exercises

Jose Barrera (ISGlobal & UAB), Statistics in Health Sciences, 2023/2024

Introduction
Aims
• The aim of this lesson is to learn about a proposed framework for integrating multiple imputation into cluster analysis.
• Specifically, we will learn and practice:
    • How to apply multiple imputation to a data set with missing data and integrate all the imputed data sets in a cluster analysis using the k-means algorithm.
    • How to describe the impact of the missing data on the uncertainty in deciding the optimal number of clusters.
    • How to describe the impact of the missing data on the uncertainty in selecting the most important clustering variables.
• To do that, we will work on the study by Basagaña et al. [1] and the R package miclust.

Overview of multiple imputation
Multiple imputation
• In the presence of missing data, complete case analysis can produce biased inferences. In addition, it leads to a loss of efficiency because information is discarded.
• Multiple imputation is a suitable alternative to complete case analysis when dealing with missing data.
• Multiple imputation provides a number of imputed data sets.
  The statistical model of interest is applied to each of the imputed data sets, and the results are then pooled.
• Multiple imputation is especially well suited to situations where the main interest is in a population parameter, such as a regression coefficient.

Overview of cluster analysis
Cluster analysis
• Cluster analysis is the process whereby data elements are classified into homogeneous groups that are not known in advance. [2]
• Numerous clustering methods exist, which fall into different families, such as hierarchical, partitional, or model-based algorithms. [2–4]
• We will consider the k-means algorithm, which is probably the most widely used partitional technique. [5]

k-means clustering method
k-means searching approach
• Finding the optimal clustering by performing an exhaustive search of all possible partitions is not computationally feasible.
• k-means reduces the search, but it is not guaranteed to reach a global optimum.

The k-means algorithm
1 Set the number of clusters, k, as an input;
2 Set k arbitrary initial clusters, chosen randomly, taken from the output of a hierarchical clustering technique, or chosen using other criteria;
3 Calculate the centroid of each cluster;
4 Move each individual to the closest cluster (according to its centroid);
5 Iterate from step 3 until no individuals can be moved between clusters.

The final clustering depends on the initial centroids, so it is suggested to run the algorithm several times using different starting values and choose the best solution according to some criterion. [5]

Optimal number of clusters
Deciding the number of clusters
• To find the optimal value of k, several methods have been suggested, usually based on repeating the clustering algorithm with different values of k and comparing the results using some criterion. [5]
• To compare the fit of two classifications with different k, some penalization for the value of k is needed to avoid choosing as many clusters as observations. [6]

Feature selection
Selecting the clustering variables
• The number of variables included in the cluster analysis is a relevant issue.
• Applying any clustering algorithm to high-dimensional data can be problematic: adding more variables to an analysis may degrade the final classification if the number of individuals (n) is small relative to the number of variables (p). [7,8]
• Hence, it is convenient to choose a small number of salient variables to create the clusters, a task usually referred to as feature selection. [6–8]
• As a general rule, it is good practice to have n/p > 10. [7]
• The main difficulty in choosing the best subset of variables is that most criteria depend on distances, which are altered by changing the dimensionality. [6]
• This makes it difficult to compare two clustering classifications based on different numbers of variables.

Joint selection of optimal k and clustering variables
CritCF
• The optimal number of clusters and the final set of variables can be selected according to CritCF [6], which can rank partitions based on different numbers of clusters and different numbers of variables:

$$
\mathrm{CritCF} = \left( \frac{2m}{2m + 1} \cdot \frac{1}{1 + W/B} \right)^{\frac{\log(2k + 2)}{\log(2m + 2)}},
$$

  where m is the number of variables, k is the number of clusters, and W and B are the within- and between-cluster inertias.
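As a rough illustration, the k-means steps and the CritCF formula can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the miclust implementation: initial centroids are k randomly chosen observations, the best of several starts is the one with the smallest within-cluster inertia, and W and B are taken to be sums of squared Euclidean distances. The function names (`kmeans`, `inertias`, `critcf`, `best_kmeans`) and the toy data are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(data, k, rng, max_iter=100):
    """One k-means run; initial centroids are k random observations (assumption)."""
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Move each individual to the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:  # no individual moved: stop iterating
            break
        labels = new_labels
        # Recompute the centroid of each cluster.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centers[j] = centroid(members)
    return labels, centers


def inertias(data, labels, centers):
    """Within (W) and between (B) inertias as sums of squared distances (assumption)."""
    grand = centroid(data)
    w = sum(sq_dist(p, centers[lab]) for p, lab in zip(data, labels))
    b = sum(labels.count(j) * sq_dist(c, grand) for j, c in enumerate(centers))
    return w, b


def critcf(w, b, m, k):
    """CritCF for m variables and k clusters, following the formula above."""
    if b == 0:  # degenerate partition: no separation between clusters
        return 0.0
    base = (2 * m / (2 * m + 1)) * (1 / (1 + w / b))
    return base ** (math.log(2 * k + 2) / math.log(2 * m + 2))


def best_kmeans(data, k, n_starts=10, seed=0):
    """Repeat k-means from several starts and keep the smallest W (assumed criterion)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        labels, centers = kmeans(data, k, rng)
        w, b = inertias(data, labels, centers)
        if best is None or w < best[0]:
            best = (w, b, labels)
    return best


# Toy data: two well-separated groups measured on m = 2 variables.
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]]
w, b, labels = best_kmeans(data, k=2)
score = critcf(w, b, m=2, k=2)
print(labels, round(score, 4))
```

With m = k the exponent equals 1, so in this toy case the score reduces to (2m / (2m + 1)) · B / (W + B), which makes the role of the W/B ratio easy to see.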
• Higher values of CritCF are preferred.

Joint selection of optimal k and clustering variables
Searching strategy based on CritCF
One possible (non-exhaustive) search strategy to find both the optimal number of clusters and the final variables is a backward sequential selection algorithm, which would work as follows [6]:
1 For each k ∈ {2, ..., k_max}, where k_max is fixed a priori:
    i Define S as the set of selected variables and R as the set of removed variables;
    ii Start with R empty and all variables in S;
    iii For each variable in S, run the k-means algorithm excluding that variable;
    iv Move from S to R the variable whose exclusion provides the highest value of CritCF;
    v Iterate from step iii until no variable in S brings an improvement in CritCF when excluded.
2 The final model is the combination of k and S that gives the highest CritCF.
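The backward sequential search above can be sketched as follows. This is an illustrative sketch only: `score(k, selected)` is a hypothetical callable that is assumed to run k-means with k clusters on the selected variables and return CritCF, and the `toy_score` function below is an invented stand-in used solely to make the example runnable. The sketch also assumes that a variable is removed only when its exclusion strictly improves the score and that at least one variable is always kept.

```python
def backward_selection(variables, k_values, score):
    """Backward sequential search over the number of clusters and variable subsets.

    `score(k, selected)` is a hypothetical callable standing in for
    "run k-means on these variables and return CritCF".
    Returns (best score, k, selected variables S, removed variables R).
    """
    best = None
    for k in k_values:
        selected = list(variables)  # S starts with all variables
        removed = []                # R starts empty
        current = score(k, selected)
        while len(selected) > 1:    # assumption: keep at least one variable
            # Step iii: evaluate the partition obtained by excluding each variable in turn.
            trials = [
                (score(k, [v for v in selected if v != out]), out)
                for out in selected
            ]
            top_score, out = max(trials, key=lambda t: t[0])
            if top_score <= current:  # step v: no exclusion improves the score
                break
            selected.remove(out)      # step iv: move the variable from S to R
            removed.append(out)
            current = top_score
        if best is None or current > best[0]:
            best = (current, k, selected, removed)
    return best


def toy_score(k, selected):
    """Invented stand-in for CritCF: rewards x1 and x2, penalizes noise and k far from 3."""
    informative = sum(v in ("x1", "x2") for v in selected)
    noise = len(selected) - informative
    return informative / 2 - 0.1 * noise - 0.05 * abs(k - 3)


best_score, best_k, kept, dropped = backward_selection(
    ["x1", "x2", "x3", "x4"], range(2, 5), toy_score
)
print(best_k, kept, dropped)
```

Passing the scoring step in as a function keeps the search logic separate from the clustering itself, so the same loop could be reused with any criterion that is comparable across different numbers of variables.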
