Clustering Techniques PDF
Document Details
Uploaded by InestimableOmaha
Anjali Jivani
Tags
Summary
This document provides a comprehensive overview of clustering techniques, explaining concepts like clustering, applications, considerations, and different approaches, including detailed examples and algorithms for better understanding.
Full Transcript
CLUSTERING Courtesy: Jiawei Han, Micheline Kamber, and Jian Pei https://livebook.manning.com/book/machine-learning-for-mortals- mere-and-otherwise/chapter-18/27 https://www.geeksforgeeks.org/ml-optics-clustering-explanation/ https://towardsdatascience.com/clustering-using-optics- cac1d10ed7a7...
CLUSTERING Courtesy: Jiawei Han, Micheline Kamber, and Jian Pei https://livebook.manning.com/book/machine-learning-for-mortals- mere-and-otherwise/chapter-18/27 https://www.geeksforgeeks.org/ml-optics-clustering-explanation/ https://towardsdatascience.com/clustering-using-optics- cac1d10ed7a7 1 Anjali Jivani WHAT IS CLUSTERING? ⮚ Cluster: A collection of data objects ▪ similar (or related) to one another within the same group ▪ dissimilar (or unrelated) to the objects in other groups ⮚ Cluster analysis (or clustering, data segmentation, …) ▪ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters ⮚ Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) ⮚ Typical applications ▪ As a stand-alone tool to get insight into data distribution 2 ▪ As a preprocessing step for other algorithms CLUSTERING APPLICATIONS ⮚ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species ⮚ Information retrieval: document clustering ⮚ Land use: Identification of areas of similar land use in an earth observation database ⮚ Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs ⮚ City-planning: Identifying groups of houses according to their house type, value, and geographical location ⮚ Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults ⮚ Climate: understanding earth climate, find patterns of atmospheric and ocean ⮚ Economic Science: market research 3 CLUSTERING AS PRE-PROCESSING ⮚ SUMMARIZATION: ▪ Preprocessing for regression, PCA, classification, and association analysis ⮚ COMPRESSION: ▪ Image processing: vector quantization ⮚ CLASSIFICATION ▪ Localizing search to one or a small number of clusters ⮚ OUTLIER DETECTION ▪ Outliers are often viewed as those “far away” from any cluster 4 ⮚GOOD QUALITY A GOOD CLUSTERING CLUSTERING METHOD WILL PRODUCE HIGH QUALITY CLUSTERS ▪ high intra-class similarity: cohesive within clusters ▪ low inter-class similarity: distinctive between clusters ⮚ THE QUALITY OF A CLUSTERING METHOD DEPENDS ON ▪ the similarity/dissimilarity measure used by the method ▪ its implementation, and ▪ Its ability to discover some or all of the hidden patterns 5 MEASURING QUALITY OF CLUSTERS ⮚ Dissimilarity/Similarity metric ▪ Similarity is expressed in terms of a distance function, typically metric: d(i, j) ▪ The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables ▪ Weights should be associated with different variables based on applications and data semantics ⮚ Quality of clustering: ▪ There is usually a separate “quality” function that measures the “goodness” of a cluster. ▪ It is hard to define “similar enough” or “good enough” 6 ❖ The answer is typically highly subjective CONSIDERATIONS FOR CLUSTERING ⮚ Partitioning criteria ▪ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable) ⮚ Separation of clusters ▪ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one cluster) ⮚ Similarity measure ▪ Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity) ⮚ Clustering space ▪ Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) 7 REQUIREMENTS AND CHALLENGES ⮚ Scalability ▪ Clustering all the data instead of only on samples ⮚ Ability to deal with different types of attributes ▪ Numerical, binary, categorical, ordinal, linked, and mixture of these ⮚ Constraint-based clustering ▪ User may give inputs on constraints ▪ Use domain knowledge to determine input parameters ⮚ Interpretability and usability ⮚ Others ▪ Discovery of clusters with arbitrary shape ▪ Ability to deal with noisy data ▪ Incremental clustering and insensitivity to input order ▪ High dimensionality 8 CLUSTERING APPROACHES ⮚ Partitioning approach: ▪ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors ▪ Typical methods: k-means, k-medoids, CLARANS ⮚ Hierarchical approach: ▪ Create a hierarchical decomposition of the set of data (or objects) using some criterion ▪ Typical methods: Diana, Agnes, BIRCH, CAMELEON ⮚ Density-based approach: ▪ Based on connectivity and density functions ▪ Typical methods: DBSCAN, OPTICS, DenClue ⮚ Grid-based approach: ▪ based on a multiple-level granularity structure 9 ▪ Typical methods: STING, WaveCluster, CLIQUE CLUSTERING APPROACHES ⮚ Model-based: ▪ A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other ▪ Typical methods: EM, SOM, COBWEB ⮚ Frequent pattern-based: ▪ Based on the analysis of frequent patterns ▪ Typical methods: p-Cluster ⮚ User-guided or constraint-based: ▪ Clustering by considering user-specified or application-specific constraints ▪ Typical methods: COD (obstacles), constrained clustering ⮚ Link-based clustering: ▪ Objects are often linked together in various ways ▪ Massive links can be used to cluster objects: 10 SimRank, LinkClus PARTITIONING METHODS ⮚ Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci) ⮚ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion ▪ Global optimal: exhaustively enumerate all partitions ▪ Heuristic methods: k-means and k-medoids algorithms ▪ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the 11 cluster ▪ k-medoids or PAM (Partition around medoids) THE K-MEANS CLUSTERING ALGORITHM ⮚ The k-means algorithm is implemented as follows: 1. Choose the number of clusters to be created – k (input). 2. Randomly select any k data points as cluster centers from the dataset. 3. Calculate distance between each data point and each cluster center. 4. Assign the points to the cluster center it is closest to. 5. Re-compute the cluster centers of the newly formed clusters i.e. calculate the means. 6. Keep repeating the procedure from Step-3 to Step-5 until any of the following stopping criteria is met: 1. Center of newly formed clusters do not change 12 2. Data points remain present in the same cluster 3. Maximum number of iterations are reached Start K-MEANS CLUSTERING Input k FLOWCHART Randomly select k objects Calculate distance between each object and each centre Assign each point to the centre it is closest to NO Stoppin Recalculate the g cluster centres Stop criteria (mean) met? YES 13 K-MEANS CLUSTERING EXAMPLE K=2 Assign Update points to the the cluster closest centroids centroid The initial data Loop if set Reassign objects needed else ⮚ Do this a number of stop times by selecting different initial cluster centers ⮚ Each time calculate the Update variance (sum of the distances of each cluster object from its centroids centroid) ⮚ Select the clusters for 14 which the variance is minimum Step-2 Step-3 Step-4 Step-5 Step-6 15 K-MEANS CLUSTERING EXAMPLE ⮚ Cluster the following eight points (with (x, y) representing two attributes) into three clusters i.e. k=3: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9) ⮚ Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2). ⮚ The distance function between two points a = (x1, y1) and b = (x2, y2) to be considered is Manhattan Distance – L1 norm: 16 d(a, b) = |x2 – x1| + |y2 – y1| A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION-1 A8(4, 9) Distance from Distance from Distance from Point belongs Given Points center (2, 10) of center (5, 8) of center (1, 2) of to Cluster Cluster-01 Cluster-02 Cluster-03 A1(2, 10) 0 5 9 C1 A2(2, 5) 5 6 4 C3 A3(8, 4) 12 7 9 C2 A4(5, 8) 5 0 10 C2 A5(7, 5) 10 5 9 C2 A6(6, 4) 10 5 7 C2 A7(1, 2) 9 10 0 C3 A8(4, 9) 3 2 10 C2 17 A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION-1 A8(4, 9) Cluster-01: Now, First cluster contains points- We re-compute the new cluster centres. A1(2, 10) The new cluster center is computed by taking mean of all the points contained in that cluster. Cluster-02: Second cluster contains For Cluster-01: points- We have only one point A1(2, 10) in Cluster- A3(8, 4) 01. A4(5, 8) So, cluster center remains the same. A5(7, 5) A6(6, 4) For Cluster-02: Center of Cluster-02 A8(4, 9) = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) Cluster-03: = (6, 6) Third cluster contains points- A2(2, 5) For Cluster-03: A7(1, 2) Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) 18 = (1.5, 3.5) A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION-2: C1(2,10), C2(6,6),A8(4, 9) C3(1.5,3.5) Distance from Distance from Distance from Point belongs Given Points center (2, 10) of center (6, 6) of center (1.5, 3.5) to Cluster Cluster-01 Cluster-02 of Cluster-03 A1(2, 10) 0 8 7 C1 A2(2, 5) 5 5 2 C3 A3(8, 4) 12 4 7 C2 A4(5, 8) 5 3 8 C2 A5(7, 5) 10 2 7 C2 A6(6, 4) 10 2 5 C2 A7(1, 2) 9 9 2 C3 A8(4, 9) 3 5 8 C1 19 A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION-2 A8(4, 9) Cluster-01: Now, First cluster contains points- We re-compute the new cluster centres. A1(2, 10) The new cluster center is computed by taking mean of all the points contained in A8(4, 9) that cluster. Cluster-02: For Cluster-01: Second cluster contains Center of Cluster-01 points- = ((2 + 4)/2, (10 + 9)/2) A3(8, 4) = (3, 9.5) A4(5, 8) A5(7, 5) For Cluster-02: Center of Cluster-02 A6(6, 4) = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25) Cluster-03: Third cluster contains points- For Cluster-03: A2(2, 5) Center of Cluster-03 A7(1, 2) = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5) 20 This is completion of Iteration-02. A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION-3: A8(4,C3(1.5,3.5) C1(3,9.5), C2(6.5,5.25), 9) Distance from Distance from Distance from center Point belongs Given Points center (3, 9.5) of center (1.5, 3.5) (6.5, 5.25) of to Cluster Cluster-01 of Cluster-03 Cluster-02 A1(2, 10) 1.5 9.25 7 C1 A2(2, 5) 5.5 4.75 2 C3 A3(8, 4) 10.5 2.75 7 C2 A4(5, 8) 3.5 4.25 8 C1 A5(7, 5) 8.5 0.75 7 C2 A6(6, 4) 8.5 1.75 5 C2 A7(1, 2) 9.5 8.75 2 C3 A8(4, 9) 1.5 6.25 8 C1 21 A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION-3 A8(4, 9) Cluster-01: Now, First cluster contains points- We re-compute the new cluster centres. A1(2, 10) The new cluster centre is computed by taking mean of all the points contained in A4(5, 8) that cluster. A8(4, 9) For Cluster-01: Cluster-02: Center of Cluster-01 Second cluster contains = ((2 + 5 + 4)/3, (10 + 8 + 9)/3) points- = (3.67, 9) A3(8, 4) A5(7, 5) For Cluster-02: Center of Cluster-02 A6(6, 4) = ((8 + 7 + 6)/3, (4 + 5 + 4)/3) = (7, 4.3) Cluster-03: Third cluster contains points- For Cluster-03: A2(2, 5) Center of Cluster-03 A7(1, 2) = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5) 22 This is completion of Iteration-03. A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION- 4: A8(4, 9) C1(3.67,9), C2(7,4.3), C3(1.5,3.5) Distance from Distance from Distance from Point belongs Given Points center (3.67, 9) of center (7, 4.3) center (1.5, 3.5) to Cluster Cluster-01 of Cluster-02 of Cluster-03 A1(2, 10) 2.67 10.7 7 C1 A2(2, 5) 5.67 5.7 2 C3 A3(8, 4) 9.33 1.3 7 C2 A4(5, 8) 2.33 5.7 8 C1 A5(7, 5) 7.33 0.7 7 C2 A6(6, 4) 5.33 1.3 5 C2 A7(1, 2) 9.3 8.33 2 C3 A8(4, 9) 0.33 7.7 8 C1 23 A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), ITERATION- 4 A8(4, 9) Cluster-01: First cluster contains points- A1(2, 10) A4(5, 8) ⮚ The clusters remain the A8(4, 9) same after the 4th iteration Cluster-02: ⮚ It means all points are now Second cluster contains in the cluster they belong points- to A3(8, 4) A5(7, 5) ⮚ And the k-means algorithm A6(6, 4) stops Cluster-03: Third cluster contains 24 points- A2(2, 5) 12 10 8 6 4 2 0 0 1 2 3 4 5 6 7 8 9 12 10 8 6 4 2 0 0 1 2 3 4 5 6 7 8 9 25 Sum of Squared 1. i = 1 to 4 (green, yellow, orange, Errors black) 2. Find distance of each green point from its centre and square it and add 26 it. 3. Do this for all clusters THE SSE Minimizing the Sum of Squared Error (SSE) 27 THE SSE After the 1st iteration: Cluster-01: First cluster contains points- A1(2, 10) C1 (2, 10) Cluster-02: Second cluster contains points- A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A8(4, 9) C2 (6, 6) Cluster-03: Third cluster contains points- A2(2, 5) A7(1, 2) 28 C3 (1.5, 3.5) THE SSE After the 2nd iteration: Cluster-01: First cluster contains points- A1(2, 10) A8(4, 9) C1 (3, 9.5) Cluster-02: Second cluster contains points- A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) C2 (6.5, 5.25) Cluster-03: Third cluster contains points- A2(2, 5) A7(1, 2) 29 C3 (1.5, 3.5) THE SSE After the 3rd iteration: Cluster-01: First cluster contains points- A1(2, 10) A4(5, 8) A8(4, 9) C1 (3.67, 9) Cluster-02: Second cluster contains points- A3(8, 4) A5(7, 5) A6(6, 4) C2 (7, 4.3) Cluster-03: Third cluster contains points- A2(2, 5) A7(1, 2) 30 C3 (1.5, 3.5) ADVANTAGES & LIMITATIONS OF K-MEANS ⮚ Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t