Unsupervised Learning: Clustering with K-means

What is a Clustering?

In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups: inter-cluster distances are maximized, while intra-cluster distances are minimized.

Applications of Cluster Analysis

Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations. An example of discovered stock clusters and the industry groups they correspond to:

  Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Summarization: reduce the size of large data sets (e.g., clustering precipitation data in Australia).

Notion of a Cluster Can Be Ambiguous

How many clusters does a data set contain? The same points can reasonably be grouped into two clusters, four clusters, or six clusters.

Types of Clusterings

- Partitional clustering: a division of data objects into subsets (clusters) such that each data object is in exactly one subset. Example: K-means and its variants.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree. Can be agglomerative or divisive.
- Density-based clustering: groups data points that are closely packed together according to a specified density criterion, while marking points in sparse regions as outliers. Example: DBSCAN.

Partitional Clustering

[Figure: a set of original points and a partitional clustering of them]

Hierarchical Clustering

[Figure: a traditional and a non-traditional hierarchical clustering of points p1, p2, p3, p4, each shown with its dendrogram]

K-means Clustering

A partitional clustering approach. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The objective is to minimize the sum of distances of the points to their respective centroids.

Problem: given a set X of n points in a d-dimensional space and an integer K, group the points into K clusters C = {C_1, C_2, ..., C_K} such that

  Cost(C) = \sum_{i=1}^{K} \sum_{x \in C_i} dist(x, c_i)

is minimized, where c_i is the centroid of the points in cluster C_i. The most common definition uses the Euclidean distance, minimizing the Sum of Squared Errors (SSE); K-means is sometimes defined directly in these terms. The cost function is then

  Cost(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2

which is minimized when c_i is the mean of the points in cluster C_i.

K-means Algorithm

[Figure: pseudocode of the K-means algorithm]

The standard iterative procedure is also known as Lloyd's algorithm, and "K-means" is sometimes used as a synonym for this algorithm.

k-Means Example

[Figures: a 2-D worked example with three cluster centers k1, k2, k3]

(I) Pick 3 initial cluster centers (randomly).
(II) Assign each point to the closest cluster center.
(III) Move each cluster center to the mean of its cluster.
(IV) Reassign the points that are now closest to a different cluster center. Q: which points are reassigned?
(V) A: three points (highlighted in the original animation).
(VI) Re-compute the cluster means.
(VII) Move the cluster centers to the cluster means.
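The algorithm itself appears only as a figure in the original slides; below is a minimal NumPy sketch of Lloyd's algorithm matching the steps walked through above (the function and variable names are my own, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Step 0: pick k initial centers randomly from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster happens to become empty).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged: no center moved
            break
        centers = new_centers
    return labels, centers
```

Given an (n, d) array of points, `labels, centers = kmeans(points, k=3)` returns the cluster index of each point and the final centroids.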
The K-means Algorithm: an Example

Split 14 people into 3 groups using a single attribute, age. The initial centroids are 1, 20, and 40. The table shows the result after the first assignment step: each person's age and their distance to each centroid.

Centroids: C1 = 1, C2 = 20, C3 = 40

  Person  Age   to C1  to C2  to C3
  P1        1      0     19     39
  P2        3      2     17     37
  P3        5      4     15     35
  P4        8      7     12     32
  P5        9      8     11     31
  P6       11     10      9     29
  P7       12     11      8     28
  P8       13     12      7     27
  P9       37     36     17      3
  P10      43     42     23      3
  P11      45     44     25      5
  P12      49     48     29      9
  P13      51     50     31     11
  P14      65     64     45     25

Re-computing the centroids gives 5, 12, and 48. Re-compute the distance between each instance and the 3 clusters: P5 is now closer to C2, so the centroids of C1 and C2 must be re-computed; C3 need not be updated since its membership did not change.

Centroids: C1 = 5, C2 = 12, C3 = 48

  Person  Age   to C1  to C2  to C3
  P1        1      4     11     47
  P2        3      2      9     45
  P3        5      0      7     43
  P4        8      3      4     40
  P5        9      4      3     39
  P6       11      6      1     37
  P7       12      7      0     36
  P8       13      8      1     35
  P9       37     32     25     11
  P10      43     38     31      5
  P11      45     40     33      3
  P12      49     44     37      1
  P13      51     46     39      3
  P14      65     60     53     17

The centroids are now 4, 11, and 48. Calculating the distance from each instance to each cluster again, P4 is now closer to C2, so the centroids of C1 and C2 must be updated; C3 is again unchanged.

Centroids: C1 = 4, C2 = 11, C3 = 48

  Person  Age   to C1  to C2  to C3
  P1        1      3     10     47
  P2        3      1      8     45
  P3        5      1      6     43
  P4        8      4      3     40
  P5        9      5      2     39
  P6       11      7      0     37
  P7       12      8      1     36
  P8       13      9      2     35
  P9       37     33     26     11
  P10      43     39     32      5
  P11      45     41     34      3
  P12      49     45     38      1
  P13      51     47     40      3
  P14      65     61     54     17

The centroids are now 3, 10, and 48. Computing the distances once more, no assignment changes, so no centroid needs updating: the algorithm has converged.

Centroids: C1 = 3, C2 = 10, C3 = 48

  Person  Age   to C1  to C2  to C3
  P1        1      2      9     47
  P2        3      0      7     45
  P3        5      2      5     43
  P4        8      5      2     40
  P5        9      6      1     39
  P6       11      8      1     37
  P7       12      9      2     36
  P8       13     10      3     35
  P9       37     34     27     11
  P10      43     40     33      5
  P11      45     42     35      3
  P12      49     46     39      1
  P13      51     48     41      3
  P14      65     62     55     17

Two Different K-means Clusterings

[Figure: the same set of original points clustered two ways: an optimal clustering and a sub-optimal clustering]

Importance of Choosing Initial Centroids

[Figures: two runs of K-means on the same data from different initial centroids, shown iteration by iteration; one run reaches a good clustering after six iterations, the other settles into a poor clustering after five]

Dealing with Initialization

- Do multiple runs and select the clustering with the smallest error.
- Select the initial set of points by methods other than random sampling, e.g., pick points that are far apart from one another as the initial centers (the idea behind the K-means++ algorithm).

Both remedies are sketched below.
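A minimal sketch of both remedies using scikit-learn, whose KMeans estimator exposes them directly: n_init controls the number of random restarts (the run with the smallest SSE is kept) and init="k-means++" spreads the initial centers out. The data here is an illustrative stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # stand-in data

# 10 restarts; scikit-learn keeps the run with the smallest SSE,
# which it exposes as inertia_.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)          # SSE of the best run
print(km.cluster_centers_)  # its final centroids
```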
Limitations of K-means

K-means has problems when clusters are of different sizes, different densities, or non-globular shapes. It also has problems when the data contains outliers.

Limitations of K-means: Differing Sizes

[Figure: original points vs. the K-means result with 3 clusters]

Limitations of K-means: Differing Density

[Figure: original points vs. the K-means result with 3 clusters]

Limitations of K-means: Non-globular Shapes

[Figure: original points vs. the K-means result with 2 clusters]

Overcoming K-means Limitations

[Figure: original points vs. K-means clusters]

One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together afterwards.

Cluster Validity

For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall. For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters. But "clusters are in the eye of the beholder"! Then why do we want to evaluate them?

- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters

Measuring Clustering Quality

Two kinds of methods: extrinsic vs. intrinsic.

- Extrinsic: supervised, i.e., the ground truth is available. Compare a clustering against the ground truth using a clustering quality measure.
- Intrinsic: unsupervised, i.e., the ground truth is unavailable. Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are.

Cluster Evaluation: Ground Truth

We use some labeled data (originally intended for classification), under the assumption that each class is a cluster. After clustering, a confusion matrix is constructed, and from the matrix we compute various measures: entropy, purity, precision, recall, and F-score. Let the classes in the data D be C = (c_1, c_2, ..., c_k). The clustering method produces k clusters, which divide D into k disjoint subsets D_1, D_2, ..., D_k.

Evaluation Measures: Purity

[Figure: the definition of purity and a worked example]

Purity measures, for each cluster, the fraction of its points that belongs to the cluster's majority class.

A Remark about Ground Truth Evaluation

This method is commonly used to compare different clustering algorithms. However, a real-life data set for clustering has no class labels, so although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand. The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm. This evaluation method is said to be based on external data or information.

Evaluation Based on Internal Information

- Intra-cluster cohesion (compactness): cohesion measures how near the data points in a cluster are to the cluster centroid. The sum of squared error (SSE) is a commonly used measure.
- Inter-cluster separation (isolation): separation means that different cluster centroids should be far away from one another.

In most applications, expert judgments are still the key.

Internal Measures: SSE

[Figure: SSE plotted against the number of clusters K, alongside the data set it was computed on]

Clusters in more complicated figures aren't well separated. SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters: plot SSE against K and look for the value of K past which the curve flattens out, as sketched below.
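A small sketch of that estimate with scikit-learn (the "elbow" heuristic; the blob data is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)  # toy data

# SSE (inertia_) always decreases as K grows; the number of clusters
# is estimated as the K where the decrease levels off.
for k in range(2, 11):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
```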
Internal Measures: Cohesion and Separation

- Cluster cohesion measures how closely related the objects in a cluster are. Example: SSE.
- Cluster separation measures how distinct or well-separated a cluster is from other clusters. Example: between-cluster squared error.

Cohesion is measured by the within-cluster sum of squares:

  SSE = WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2

Separation is measured by the between-cluster sum of squares:

  BSS = \sum_i |C_i| (m - m_i)^2

where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the centroid of the whole data set.

Example: for the four 1-D points 1, 2, 4, 5 with overall mean m = 3, BSS + WSS is constant.

K = 1 cluster (centroid 3):
  WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
  BSS = 4 \times (3-3)^2 = 0
  Total = 10 + 0 = 10

K = 2 clusters ({1, 2} with centroid 1.5 and {4, 5} with centroid 4.5):
  WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
  BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9
  Total = 1 + 9 = 10

Example: K-means Clustering with the Titanic Dataset

Cluster the records into two clusters, i.e., the passengers who survived and the ones who did not.
https://www.kaggle.com/datasets/yasserh/titanic-dataset
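One way this exercise might look with pandas and scikit-learn; this is a sketch, not the slides' own solution. The column names (Survived, Age, Fare, Pclass, Sex) match the Kaggle file, but the CSV filename and the feature and preprocessing choices are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Titanic-Dataset.csv")  # filename as downloaded (may differ)

# A simple numeric feature set: encode Sex, drop rows with missing Age.
df = df.dropna(subset=["Age"])
X = pd.DataFrame({
    "Age": df["Age"],
    "Fare": df["Fare"],
    "Pclass": df["Pclass"],
    "Sex": (df["Sex"] == "female").astype(int),
})
X = StandardScaler().fit_transform(X)  # K-means is sensitive to feature scales

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Extrinsic check: compare the two clusters against the Survived ground truth.
print(pd.crosstab(labels, df["Survived"].to_numpy(),
                  rownames=["cluster"], colnames=["survived"]))
```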