Cluster Analysis: Basic Concepts and Algorithms (Lecture 7)

Summary

This document is a lecture on cluster analysis, covering basic concepts and algorithms. It distinguishes partitional from hierarchical clustering and then focuses on the K-means algorithm: how it works, why the choice of initial centroids matters, its limitations, and practical remedies. It provides a good overview for understanding cluster analysis.

Full Transcript

Cluster Analysis: Basic Concepts and Algorithms (Lecture 7)

What is Cluster Analysis?
Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications of Cluster Analysis
Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Example: discovered clusters of stocks with similar price movements, and their industry groups:
– Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
– Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
– Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
– Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Summarization
– Reduce the size of large data sets

Notion of a Cluster can be Ambiguous
How many clusters? The same set of points can reasonably be seen as two, four, or six clusters.
[Figure: one point set interpreted as two clusters, four clusters, and six clusters]

Types of Clusterings
A clustering is a set of clusters formed from unlabeled data. An important distinction is between hierarchical and partitional sets of clusters:
– Partitional Clustering
◆ A division of data objects into non-overlapping subsets (clusters)
– Hierarchical clustering
◆ A set of nested clusters organized as a hierarchical tree

Partitional Clustering
[Figure: original points and a partitional clustering of them]

Hierarchical Clustering
[Figure: nested clusters of points p1–p4, shown as a traditional and a non-traditional dendrogram]

Characteristics of the Input Data Are Important
Type of proximity or density measure
– Central to clustering
– Depends on data and application
Data characteristics that affect proximity and/or density
– Dimensionality
◆ Sparseness
– Attribute type
– Special relationships in the data
◆ For example, autocorrelation
– Distribution of the data
Noise and outliers
– Often interfere with the operation of the clustering algorithm
Clusters of differing sizes, densities, and shapes

Clustering Algorithms
– K-means and its variants
– Hierarchical clustering
– Density-based clustering

K-means Clustering
Partitional clustering approach. The number of clusters, K, must be specified. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The basic algorithm is very simple, as the sketch below shows.
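The loop is short enough to state directly in code. Below is a minimal NumPy sketch of the basic algorithm under the Euclidean distance measure; the function name kmeans, the max_iters safety cap, and the empty-cluster guard are our own additions rather than part of the lecture.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=None):
    """Basic K-means on an (n, d) data array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Choose initial centroids: k distinct points selected at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

A call such as labels, centers = kmeans(X, k=3) reproduces the behavior described in the Details slide that follows: most of the movement of the centroids happens in the first few iterations.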
Example of K-means Clustering
[Figure: six-panel plot showing cluster assignments and centroids at iterations 1–6]

K-means Clustering – Details
Simple iterative algorithm:
– Choose initial centroids
– Repeat: assign each point to its nearest centroid; re-compute the cluster centroids
– Until the centroids stop changing
Initial centroids are often chosen randomly.
– Clusters produced can vary from one run to another
The centroid is (typically) the mean of the points in the cluster, but other definitions are possible.
K-means will converge for common proximity measures when the centroid is defined appropriately for the measure.
Most of the convergence happens in the first few iterations.
– Often the stopping condition is relaxed to 'until relatively few points change clusters'
Time complexity is O(n * K * I * d)
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

K-means Objective Function
A common objective function (used with the Euclidean distance measure) is the Sum of Squared Error (SSE).
– For each point, the error is the distance to the nearest cluster center
– To get the SSE, we square these errors and sum them:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

– x is a data point in cluster Ci, mi is the centroid (mean) of cluster Ci, and K is the number of clusters
– The SSE decreases (improves) in each iteration of K-means

Importance of Choosing Initial Centroids
[Figure: two runs on the same points from different initial centroids, shown over several iterations; depending on the choice of initial centroids, clusters B and C may get merged or remain separate]

Problems with Selecting Initial Points
If there are K 'real' clusters, the chance of selecting one centroid from each cluster is small.
– The chance is relatively small when K is large
– If the clusters are the same size, n, the probability is K!/K^K; for example, if K = 10, the probability is 10!/10^10 = 0.00036
– Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
– Consider an example of five pairs of clusters

10 Clusters Example
[Figure: five pairs of clusters over iterations 1–4, starting with two initial centroids in one cluster of each pair of clusters]
[Figure: five pairs of clusters over iterations 1–4, starting with some pairs of clusters having three initial centroids while others have only one]

Solutions to Initial Centroids Problem
Multiple runs
– Helps, but probability is not on your side
Use some strategy to select the k initial centroids and then choose among these initial centroids
– Select the most widely separated
◆ K-means++ is a robust way of doing this selection (a sketch follows below)
– Use hierarchical clustering to determine initial centroids
Bisecting K-means
– Not as susceptible to initialization issues
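K-means++, mentioned above as a robust way of selecting widely separated initial centroids, can be sketched as follows. This is a minimal illustration assuming NumPy and Euclidean distance; the function name kmeans_pp_init is our own.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """K-means++ seeding: spread the initial centroids apart by sampling
    each new centroid with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :],
                           axis=2) ** 2,
            axis=1,
        )
        # Sample the next centroid in proportion to those distances,
        # so far-away points are far more likely to be picked.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)
```

The returned array can then be used in place of randomly chosen initial centroids in a K-means routine such as the one sketched earlier.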
Limitations of K-means
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes
K-means has problems when the data contains outliers.
– One possible solution is to remove outliers before clustering

Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Differing Density
[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means with 2 clusters]

Overcoming K-means Limitations
One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. These small clusters then need to be put together in a post-processing step, as the sketch below illustrates.
[Figures: for each of the three cases above, the natural clusters recovered from many small K-means clusters]
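The lecture does not prescribe how the small clusters should be put together, so the post-processing step below is only one possible illustration: it over-clusters with K-means and then merges the small-cluster centroids by single-link agglomeration, which chains nearby pieces together and can recover non-globular shapes. The scikit-learn usage, the function name overcluster_then_merge, and the parameter defaults are our own assumptions.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering

def overcluster_then_merge(X, n_small=30, n_final=2):
    """Find many small clusters, then merge them in a post-processing step."""
    # Step 1: over-cluster so that each small cluster represents
    # only a part of a natural cluster.
    km = KMeans(n_clusters=n_small, n_init=10).fit(X)
    # Step 2: merge nearby small-cluster centroids; single linkage
    # chains adjacent pieces back into one natural cluster.
    merger = AgglomerativeClustering(n_clusters=n_final, linkage="single")
    centroid_labels = merger.fit_predict(km.cluster_centers_)
    # Map each point's small-cluster label to its merged cluster label.
    return centroid_labels[km.labels_]
```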
