Cluster Analysis Lecture Notes PDF
Summary
These lecture notes introduce cluster analysis. They cover distance measures, hierarchical and non-hierarchical (K-means) clustering methods, a worked illustration, and practical application issues, emphasising the importance of understanding the nature and characteristics of the resulting clusters.
Full Transcript
Cluster Analysis
Contents: Introduction, Distance Measures, Hierarchical Clustering, Non-hierarchical Clustering, An Illustration, Application Issues

Introduction
- Cluster analysis is a technique for grouping individuals or objects into unknown groups. It differs from other methods of classification, such as discriminant analysis, in that the number and characteristics of the groups are derived from the data and are not usually known prior to the analysis.
- Before clustering is done, a scatter diagram can be used to display the main characteristics of the underlying clusters (e.g., Figure 16.2). The concept of closeness is implicitly used to define clusters.
- A profile diagram is a helpful technique for a moderate number of variables (e.g., Figures 16.3 & 16.4).
- Clustering is sensitive to outliers; hence, some preliminary checking for outliers/errors is advisable.

Distance Measures
- Clustering is based on distances between observations, between observations and clusters, and between clusters.
- The Euclidean distance between two points $(X_{11}, X_{21})$ and $(X_{12}, X_{22})$ is $\sqrt{(X_{11} - X_{12})^2 + (X_{21} - X_{22})^2}$.
- For P variables, the Euclidean distance between points i and j is $\sqrt{\sum_{p=1}^{P} (X_{pi} - X_{pj})^2}$.
- Different distance measures can yield different clusters. When variables have different units of measurement, they should be standardised to minimise the "dominant size" effect.

Hierarchical Clustering
- In agglomerative hierarchical methods, we begin with N clusters (i.e., each observation is its own cluster).
- By combining the closest two observations/clusters, the number of clusters is reduced by one at each successive step. In the final step, all observations form one cluster.
- In divisive hierarchical methods, we begin with one cluster containing all observations. In successive steps, the observations that are most dissimilar to the remaining ones are split off.
- Most of the commonly used programs are of the agglomerative type.

Hierarchical Clustering (cont.)
- The centroid procedure is a widely used example of agglomerative methods. In the centroid method, the distance between two clusters is defined as the distance between the group centroids. The groups with the shortest distance are combined first (see Figures 16.5 & 16.6); a small code sketch follows this section.
- A dendrogram, or tree graph, summarises the clustering at successive steps (see Figure 16.7).
- Hierarchical procedures are appealing but may mislead if an undesirable early combination persists. The investigator may wish to perform the analysis several times after deleting certain suspect observations.
- An important issue is how to select the number of clusters; the distance between clusters may serve as a guide.
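Below is a minimal sketch of the centroid method just described, assuming SciPy and NumPy are available; the small data matrix, the choice of two clusters, and the variable names are illustrative assumptions, not part of the original notes.

```python
# Centroid-method agglomerative clustering on standardised data (a sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are observations, columns are variables (P = 2 here); values are made up.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0],
              [6.0, 8.5],
              [1.2, 0.9]])

# Standardise each variable to reduce the "dominant size" effect.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Centroid method: the distance between two clusters is the Euclidean
# distance between their centroids; the closest pair is merged at each step.
merges = linkage(Z, method="centroid", metric="euclidean")

# The third column is the distance at which each pair of clusters was joined;
# a large jump suggests an "unnatural" joining worth examining.
print(merges)

# Cut the tree to obtain, say, two clusters (scipy.cluster.hierarchy.dendrogram
# would draw the corresponding tree graph).
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)
```

Inspecting the merge distances at several cut heights is one way to judge how many clusters the data support.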
K-Means Clustering
- K-means clustering is a popular non-hierarchical technique. It requires the number of clusters, K, to be specified in advance. The algorithm (a small sketch of these steps appears at the end of these notes):
  (a) divides the data into K initial clusters,
  (b) computes the centroid of each cluster,
  (c) finds the distance of each observation to each centroid,
  (d) assigns each observation to the cluster with the closest centroid, and
  (e) repeats until no observations can be reassigned.
- Another approach is to consider all observations as one cluster and then search for the variable with the highest variance (say, X1). The cluster is then split into two at the mid-point of X1. If the data are standardised, the variable with the smallest range is selected instead. The procedure (i.e., splitting clusters with the largest variance or smallest range) continues until K clusters are formed. After this, observations are assigned to the cluster whose centroid they are nearest to (see Figure 16.8).
- If no natural choice of K is available, it is useful to try various values and compare the results (e.g., graphically).
- Other methods for evaluating clustering results include: (a) F tests on the equality of cluster means, variable by variable (used only to compare results, not for formal hypothesis testing), and (b) inputting the data to a K-group discriminant procedure.

An Illustration
- See Tables 16.1 and 16.2; the dendrogram is shown in Figure 16.9. The centroid method, Euclidean distance and standardised data are used in this illustration.
- The joining distance is also shown in the dendrogram. It indicates how disparate the two clusters just joined are. A large increase in distance signals an "unnatural" joining, so that particular step should be examined.
- Cluster profiles help users understand the nature and characteristics of each cluster (see Figure 16.10). Most clustering applications rest on such knowledge (e.g., market segmentation, direct mailing, investment diversification, etc.).

Application Issues
- Judgement is required to interpret the results, since the outputs of different runs may differ. It may be desirable to make several runs and look for consensus.
- To reduce the number of variables and facilitate interpretation, principal components analysis or factor analysis can be performed before clustering.
- Whether clustering has been successful depends on whether the results make intuitive sense.
- Some users split the data and run clustering on both halves to see if they obtain similar results. Such agreement gives further credence to the conclusions.
- It is also desirable to try different clustering methods and different numbers of clusters, and to compare the results.
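The following is a minimal sketch of the five K-means steps (a)-(e) listed above, written in plain NumPy; the toy data, K = 2, the random initialisation, and the function name k_means are illustrative assumptions rather than part of the original notes, and empty clusters are not handled.

```python
# K-means in the spirit of steps (a)-(e): assign each observation to the
# nearest centroid and recompute centroids until no observation moves.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (a) Form K initial clusters: here, K randomly chosen observations
    #     serve as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # (c) Euclidean distance of each observation to each centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # (d) Assign each observation to the cluster with the closest centroid.
        new_labels = dists.argmin(axis=1)
        # (e) Stop when no observation can be reassigned.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # (b) Recompute the centroid of each cluster (empty clusters not handled).
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Standardised toy data: rows are observations, columns are variables.
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0], [5.5, 8.2], [0.9, 2.1]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
labels, centroids = k_means(Z, k=2)
print(labels)
print(centroids)
```

In practice one would rerun this with several values of K (and several random starts) and compare the resulting groupings, as the notes suggest.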