T2. AC Clustering 2023/2024 PDF
Universidade de Coimbra
Jorge Henriques
Summary
This document is a presentation on clustering in machine learning. It summarizes the subjects covered in the classes, including similarity measures and hierarchical and partitional clustering techniques. The slides were prepared by Jorge Henriques for the 2023/24 academic year, based on slides by Prof. Antonio Dourado.
Full Transcript
AC – Aprendizagem Computacional / Machine Learning
T2 – Clustering
Jorge Henriques | [email protected]
Departamento de Engenharia Informática, Faculdade de Ciências e Tecnologia

▪ Note
▪ These slides summarize the subjects covered in the classes.
▪ They do not expose and detail all the necessary information.
▪ They are not assumed to be the only element of study.

▪ NOTE
▪ These slides were prepared based on the ones presented by Prof. Antonio Dourado (2023/24).
▪ Thank you to Prof. Dourado for permitting their use.

Contents
▪ 1| Introduction
▪ 2| Similarity measures
▪ 3| Clustering techniques
  ▪ Hierarchical
  ▪ Partitional
▪ 4| Evaluation
▪ 5| Conclusions
▪ 6| Bibliography

1| Introduction
▪ Clustering objectives
▪ Unsupervised learning technique – aims to find patterns embedded in the data.
  Division of a data set into clusters (groups) of similar objects.
  There are no predefined classes.
  The objects of a cluster are similar among themselves and dissimilar from the objects of the other clusters.
▪ The overall dataset can be represented by its clusters, in this way summarizing/representing the data.
▪ Used in the exploratory analysis of the data and to extract information from the data.

1| Introduction
▪ Unsupervised method
▪ There is no desired output for training the model.
▪ Therefore, the objective is to interpret/explain the data (the inputs).
  Example: depending on the characteristics of a population (patients), three groups of patients with similar symptoms can be identified.
  This does not mean that this grouping allows us to directly identify a clinical condition or a disease.
  It is not a classification problem.

1| Introduction
▪ Clustering applications
▪ Clustering is used to achieve
  Identification of the underlying structure: to gain insights into the data, detect anomalies, and identify relevant features.
  Natural stratification: to identify the degree of similarity between objects.
  Compression: a method for organizing data and summarizing it through cluster prototypes.
https://www.researchgate.net/figure/a-Original-data-before-k-means-clustering-algorithm-b-Data-after-k-means-clustering_fig3_328486915

1| Introduction
▪ Ex. K-means method
▪ Creates NK partitions of the data, NK being typically defined a priori.
  Main idea: similar data points should be grouped.
  How to measure the similarity between two points? With a distance:
  o High similarity within the cluster | small distances
  o Low similarity outside the clusters | high distances

1| Introduction
▪ Ex. K-means method
▪ The points of a cluster are similar among themselves and dissimilar from the objects of the other clusters.

1| Introduction
▪ Ex. K-means method
▪ The points of a cluster are similar among themselves and dissimilar from the objects of the other clusters.
▪ K = 2 | for a point p: a – distance to the other cluster (C2), to maximize; b – distance within its own cluster (C1), to minimize.

1| Introduction
▪ Ex. K-means method
▪ The points of a cluster are similar among themselves and dissimilar from the objects of the other clusters.
▪ K = 4
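To make the K = 2 grouping idea above concrete, here is a minimal sketch (not part of the original slides) using scikit-learn's KMeans on a small synthetic 2-D dataset; the library, the make_blobs helper and all parameter values are assumptions for illustration only.

```python
# Minimal illustration of the grouping idea above: K = 2 clusters on toy 2-D data.
# scikit-learn and the chosen parameters are assumptions, not part of the slides.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points forming two loose groups (a stand-in for real data)
X, _ = make_blobs(n_samples=40, centers=2, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster labels :", kmeans.labels_)             # one label (0 or 1) per point
print("Cluster centres:\n", kmeans.cluster_centers_)  # the two centroids
```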
1| Introduction
▪ Steps of the clustering process
▪ 1| Data: assume a set of features/examples.
▪ 2| Similarity measure: define a similarity measure appropriate for the problem (usually a distance).
▪ 3| Algorithm: perform clustering with an appropriate algorithm. Data abstraction: the data can be represented by their clusters.
▪ 4| Evaluation: critical analysis of the results. Assess the quality of the clustering process. Go back to step 2 if the results do not satisfy.

Contents
▪ 1| Introduction
▪ 2| Similarity measures
▪ 3| Clustering techniques
  ▪ Hierarchical
  ▪ Partitional
▪ 4| Evaluation
▪ 5| Conclusions
▪ 6| Bibliography

1| Introduction
▪ Steps of the clustering process
▪ 1| Data: assume a set of features/examples.
▪ 2| Similarity measure: define a similarity measure appropriate for the problem (usually a distance).
▪ 3| Algorithm / Techniques: cluster data with an appropriate algorithm. Data abstraction: the data can be represented by their clusters.
▪ 4| Evaluation: critical analysis of the results. Assess the quality of the clustering process.

2| Similarity
▪ Notation
▪ X – instance, example, point (for m = 2, a point is P = (x, y) = (x_1, x_2))
  $X = \{ x_1, x_2, ..., x_m \}$
  $x_i$ is a scalar component.
  m is the dimensionality of an instance in the data space, i.e. an instance is composed of m measurements (attributes, features, characteristics).
▪ Example
  Patient | X = { age, gender, weight }
  The Patient is an instance, composed of 3 attributes: age, gender, weight.

2| Similarity
▪ Notation
▪ A dataset is composed of n examples/points:
  $X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nm} \end{bmatrix}$
  $x_{ij}$ – scalar | a simple measurement (attribute) | e.g. $x_{ij}$ = 34 kilos
  $X_i$ – vector | a set of attributes (one example) | X = { age, weight, gender }
  X – matrix | a set of examples (the dataset)

2| Similarity
▪ Data types
▪ An instance/point may represent
  A physical object (an apple, a table, a car, ...),
  A physiological state defined by several biosignals (time series),
  An abstract concept such as writing style, painting, etc.
▪ The characteristics or attributes can be:
  Quantitative: continuous (weight, height, temperature, ...), discrete (number of patients) or intervals (e.g. duration of an event).
  Qualitative: nominal or unordered (e.g. color), ordered (e.g. position in a professional career).
▪ In this course we will work mainly with quantitative attributes.

1| Introduction
▪ Steps of the clustering process
▪ 1| Data: assume a set of features/examples.
▪ 2| Similarity measure: define a similarity measure appropriate for the problem (usually a distance).
▪ 3| Algorithm / Techniques: cluster data with an appropriate algorithm. Data abstraction: the data can be represented by their clusters.
▪ 4| Evaluation: critical analysis of the results. Assess the quality of the clustering process.

2| Similarity
▪ Similarity measures (quantitative)
▪ The concept of distance is fundamental to assess the similarity between two points, d(X_i, X_j).
▪ There is no best measure in general.
▪ It depends on the dataset, the application, the objective.
▪ For an interesting comparison of measures see
https://towardsdatascience.com/log-book-guide-to-distance-measuring-approaches-for-k-means-clustering-f137807e8e21

2| Similarity
▪ Similarity measures 1/2
  Euclidean: $d(X_i, X_j) = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$
  Manhattan (city-block): $d(X_i, X_j) = \sum_{k=1}^{m} | x_{ik} - x_{jk} |$
  Chebychev: $d(X_i, X_j) = \max_{k=1,...,m} | x_{ik} - x_{jk} |$
  Minkowski: $d(X_i, X_j) = \left( \sum_{k=1}^{m} | x_{ik} - x_{jk} |^p \right)^{1/p}$ (p = 2 gives the Euclidean distance)
https://towardsdatascience.com/log-book-guide-to-distance-measuring-approaches-for-k-means-clustering-f137807e8e21
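As a quick illustration of the four distances above, the sketch below (not from the slides; NumPy and the example points are assumptions) computes them for two m-dimensional points.

```python
# Hedged sketch: the four distances above, for two m-dimensional points.
import numpy as np

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    return np.sum(np.abs(xi - xj))

def chebychev(xi, xj):
    return np.max(np.abs(xi - xj))

def minkowski(xi, xj, p):
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

xi = np.array([1.0, 2.0, 3.0])   # illustrative points, not from the slides
xj = np.array([4.0, 0.0, 3.0])

print(euclidean(xi, xj))          # 3.605...
print(manhattan(xi, xj))          # 5.0
print(chebychev(xi, xj))          # 3.0
print(minkowski(xi, xj, p=2))     # same as the Euclidean distance
```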
2| Similarity
▪ Mahalanobis
▪ Considers the statistical variations of the data distribution, defined by the covariance matrix C.
▪ For two points X_i and X_j of the same distribution, the distance is
  $d(X_i, X_j) = \sqrt{(X_i - X_j)^T C^{-1} (X_i - X_j)}$
▪ If the matrix C is an identity matrix, this distance is equal to the Euclidean distance:
  Euclidean: $d(X_i, X_j) = \sqrt{(X_i - X_j)^T (X_i - X_j)}$, with $C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
  Mahalanobis: $d(X_i, X_j) = \sqrt{(X_i - X_j)^T C^{-1} (X_i - X_j)}$, with $C = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}$

2| Similarity
▪ Euclidean / Mahalanobis
  Euclidean – spherical: $d(X_i, X_j) = \sqrt{(X_i - X_j)^T (X_i - X_j)}$
  Mahalanobis – ellipsoidal: $d(X_i, X_j) = \sqrt{(X_i - X_j)^T C^{-1} (X_i - X_j)}$

2| Similarity
▪ Mahalanobis = Euclidean
  mean = [2 0], $C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, eigenvectors $VP = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

2| Similarity
▪ Mahalanobis
  mean = [2 0], $C = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, eigenvectors $VP = \begin{bmatrix} 0.7071 & -0.7071 \\ 0.7071 & 0.7071 \end{bmatrix}$

2| Similarity
▪ Mahalanobis
  mean = [2 0], $C = \begin{bmatrix} 1 & S \\ S & 1 \end{bmatrix}$

2| Similarity
▪ Mahalanobis
  mean = [2 0], $C = \begin{bmatrix} 1 & -S \\ -S & 1 \end{bmatrix}$

2| Similarity
▪ Mahalanobis
  mean = [2 0], $C = \begin{bmatrix} 1 & 0 \\ 0 & S \end{bmatrix}$

2| Similarity
▪ Mahalanobis
  mean = [2 0], $C = \begin{bmatrix} S & 0 \\ 0 & 0.1 \end{bmatrix}$

2| Similarity
▪ Mahalanobis
  The Mahalanobis distance takes variability into account. Instead of treating all values equally when calculating the distance to the center point, it weights them according to the amplitude of variation in the direction of the test point.
  mean_1 = [13 5], $C_1 = \begin{bmatrix} 1.0 & 0.9 \\ 0.9 & 1.0 \end{bmatrix}$
  mean_2 = [18 2], $C_2 = \begin{bmatrix} 1.0 & -0.8 \\ -0.8 & 1.0 \end{bmatrix}$

2| Similarity
▪ Mahalanobis
▪ Is P(14,5) closer to data1 or data2?
  de – Euclidean distance | dm – Mahalanobis distance
  Euclidean: d1e = 1.00 ✓ | d2e = 5.00
  Mahalanobis: d1m = 1.66 ✓ | d2m = 4.01
  Closer to data1 using the Euclidean distance. Closer to data1 using the Mahalanobis distance.

2| Similarity
▪ Mahalanobis
▪ Is P(16,5) closer to data1 or data2?
  de – Euclidean distance | dm – Mahalanobis distance
  Euclidean: d1e = 3.00 ✓ | d2e = 3.60
  Mahalanobis: d1m = 4.97 | d2m = 3.07 ✓
  Closer to data1 using the Euclidean distance. Closer to data2 using the Mahalanobis distance !!!
  (A code sketch of this comparison is given at the end of this section.)

Contents
▪ 1| Introduction
▪ 2| Similarity measures
▪ 3| Clustering techniques
  ▪ Hierarchical
  ▪ Partitional
▪ 4| Evaluation
▪ 5| Conclusions
▪ 6| Bibliography

1| Introduction
▪ Steps of the clustering process
▪ 1| Data: assume a set of features/examples.
▪ 2| Similarity measure: define a similarity measure appropriate for the problem (usually a distance).
▪ 3| Algorithm / Techniques: cluster data with an appropriate algorithm. Data abstraction: the data can be represented by their clusters.
▪ 4| Evaluation: critical analysis of the results. Assess the quality of the clustering process.
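To experiment with the Euclidean vs. Mahalanobis comparison above, here is a minimal sketch (not from the slides) using SciPy's distance helpers with the means and covariance matrices printed on the slides; the slide figures were presumably computed from the underlying sample data, so the exact numbers may differ slightly.

```python
# Hedged sketch of the Euclidean vs. Mahalanobis comparison above.
# Means and covariances are the values printed on the slides; the numerical
# results may differ slightly from the slide numbers.
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

mean1, C1 = np.array([13.0, 5.0]), np.array([[1.0, 0.9], [0.9, 1.0]])
mean2, C2 = np.array([18.0, 2.0]), np.array([[1.0, -0.8], [-0.8, 1.0]])

for name, p in (("P(14,5)", np.array([14.0, 5.0])),
                ("P(16,5)", np.array([16.0, 5.0]))):
    d1e, d2e = euclidean(p, mean1), euclidean(p, mean2)
    d1m = mahalanobis(p, mean1, np.linalg.inv(C1))   # note: expects the *inverse* covariance
    d2m = mahalanobis(p, mean2, np.linalg.inv(C2))
    print(f"{name}  Euclidean: d1e={d1e:.2f}, d2e={d2e:.2f}  "
          f"Mahalanobis: d1m={d1m:.2f}, d2m={d2m:.2f}")
```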
3| Clustering techniques
▪ Main goals of an (ideal) clustering technique
▪ Discover clusters with arbitrary shape
▪ Deal with the various types of possible variables
▪ Be insensitive to the order in which objects are presented
▪ Work with objects with any number of attributes (dimension)
▪ Be scalable to handle any number of objects
▪ Provide interpretable and usable results
▪ Be robust in the presence of noise
▪ Require minimal knowledge to determine the parameters specific to the method

3| Clustering techniques
▪ Two main classes of clustering (other classifications are possible ...)
▪ Partitional: groups are independent of each other
▪ Hierarchical: groups are dependent, at different levels of granularity

3| Clustering techniques
▪ Partitional
  Given a set of N unlabeled data points, a partitional method creates NK partitions of the data.
  The number of partitions NK is usually defined a priori.
  The organization in groups is based on the evaluation of the similarity between objects (e.g., Euclidean distance).
  The dissimilarity (distance) between elements of the same group is minimized and the dissimilarity between elements of different groups is maximized.

3| Clustering techniques
▪ Hierarchical
  Organizes data objects successively, merging the most similar groups of objects based on the pairwise distances until a termination condition is achieved.
  A hierarchical representation is obtained (a dendrogram, i.e., a tree of clusters)
  o A structure that is more informative than the unstructured clusters obtained by a partitional method.
  The method may not require the a-priori definition of the number of clusters.

3| Clustering techniques
▪ Clustering techniques
  Hierarchical: groups at different levels of granularity
  Partitional: groups are independent of each other

3.1| Hierarchical techniques
▪ Hierarchical
▪ Two approaches: agglomerative (bottom-up) and divisive (top-down)
▪ Agglomerative (bottom-up)
  Starts with each data point as an individual cluster.
  Merges the closest pairs of clusters iteratively until all points belong to a single cluster or a certain number of clusters is achieved.
▪ Divisive (top-down)
  Starts with all data points in a single cluster.
  Splits them iteratively into smaller clusters until each data point is in its own cluster or a specified number of clusters is achieved.
▪ The agglomerative method is more common than the divisive: it is computationally lighter and usually produces better results.
https://www.janbasktraining.com/tutorials/hierarchical-clustering/

3.1| Hierarchical techniques
▪ AGNES | AGglomerative NESting – bottom-up clustering method
▪ 1| At the start, each object composes a cluster.
▪ 2| Using an appropriate metric (e.g. Euclidean), compute the distances between all pairs of clusters.
▪ 3| Find the pair of clusters that are most similar and merge these two into one single cluster.
▪ 4| Recalculate the distances between the new cluster and all the other clusters already existing.
▪ 5| Repeat steps 3 and 4 until one single cluster is obtained, containing all of the objects.

3.1| Hierarchical techniques
▪ AGNES | AGglomerative NESting – bottom-up
Adapted from https://www.displayr.com/what-is-hierarchical-clustering
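The following is a minimal sketch (not part of the slides) of the AGNES procedure using SciPy's hierarchical-clustering routines on a handful of 2-D points; the point coordinates and the choice of single linkage are assumptions for illustration (the different linkage criteria are discussed in the next slides).

```python
# Hedged sketch of AGNES (bottom-up agglomerative clustering) with SciPy.
# The five points and the 'single' linkage choice are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Five 2-D points, loosely playing the role of A, B, C, D, E in the slides
points = np.array([[0.0, 0.0],   # A
                   [0.5, 0.2],   # B
                   [0.6, 0.1],   # C
                   [3.0, 3.0],   # D
                   [3.2, 2.8]])  # E

Z = linkage(points, method='single')              # merge history: one row per agglomeration step
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
print("Cluster labels:", labels)

dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])   # tree of the successive merges
plt.show()
```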
3.1| Hierarchical techniques
▪ AGNES | AGglomerative NESting
▪ How to compute the distance between two clusters A and B?
▪ Single-link
  Considers the minimum of the distances between all pairs of points (each element of the pair from each cluster).
  Produces clusters geometrically "elongated".
  $\min_{a \in A, \, b \in B} \; dist(a, b)$
▪ Complete-link
  Considers the maximum of the distances between all pairs of points (each element of the pair from each cluster).
  Produces small and compact clusters with well defined frontiers.
  $\max_{a \in A, \, b \in B} \; dist(a, b)$
See also https://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf (11/09/2023)

3.1| Hierarchical techniques
▪ AGNES | AGglomerative NESting
▪ How to compute the distance between two clusters A and B?
▪ Average-link
  Compromise between the single-link and the complete-link.
  Computes the mean of all the distances between all pairs of points (each element of the pair from each cluster).
  $\frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} dist(a, b)$
▪ Centroid-link
  Considers the distance between the centroids of the clusters.
  $\| C_A - C_B \|^2$
See also https://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf (11/09/2023)
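The sketch below (not from the slides) computes the four cluster-to-cluster distances just defined for two small point sets, using SciPy's cdist for the pairwise distances; the point coordinates are illustrative assumptions.

```python
# Hedged sketch: single-, complete-, average- and centroid-link distances
# between two small clusters A and B. The coordinates are made up.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.5, 0.2], [0.6, 0.1]])
B = np.array([[3.0, 3.0], [3.2, 2.8]])

D = cdist(A, B)                      # |A| x |B| matrix of pairwise Euclidean distances

single   = D.min()                   # min over all pairs (a, b)
complete = D.max()                   # max over all pairs (a, b)
average  = D.mean()                  # (1 / (|A||B|)) * sum of all pairwise distances
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # distance between centroids

print(single, complete, average, centroid)
```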
3.1| Hierarchical techniques
▪ AGNES | AGglomerative NESting – dendrogram
▪ Naming the points (A, B, C, D, E, ...), we can build a dendrogram:
  A tree replicating the agglomerative operations of the previous algorithm, with the similarity level on the vertical axis.

3.1| Hierarchical techniques
▪ AGNES | AGglomerative NESting – dendrogram
▪ The dendrogram has a hierarchical structure (a tree) that displays the groups that are formed by clustering at each step and their similarity levels.
  The level of the hierarchy is related to the level of similarity among clusters at that level.
  We can see how many clusters we have.

3.1| Hierarchical techniques
▪ DIANA | DIvisive ANAlysis
▪ 1| All of the points are agglomerated into one cluster.
▪ 2| Divide that cluster into two new clusters.
  Polythetic methods | all of the features of the data are used for the division.
  Monothetic methods | only one feature of the data is used for the division.
https://www.slideshare.net/salahecom/10-clusbasic

3.1| Hierarchical techniques
▪ DIANA | DIvisive ANAlysis
▪ 3| Repeat the process: find the most heterogeneous cluster to be divided into two (higher number of samples, higher variance, higher mean squared distance, ...). This heterogeneous cluster is divided*.
▪ 4| While the number of clusters is not equal to the total number of patterns in the sample, or a limit for the number of clusters has not been reached, step 3 is repeated.
  *To divide a cluster, use for example a cutting algorithm based on optimization ("cut-based optimization"), with results similar to those of k-means.
See more in https://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/clustering4.pdf

3| Clustering techniques
▪ Two main classes of clustering:
  Hierarchical: groups at different levels of granularity
  Partitional: groups are independent of each other

3.2| Partitional techniques
▪ Partitional
▪ Two approaches
▪ Distance based
  k-means
  k-medoids
▪ Density based
  Subtractive clustering
  DBSCAN

3.2| Partitional techniques
▪ Distance based method
▪ How to measure similarity/proximity between two data points (vectors)?
  $X_1 = [x_{11}, x_{12}, ..., x_{1N}]^T$, $X_2 = [x_{21}, x_{22}, ..., x_{2N}]^T$
  In two dimensions: $d(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2}$

3.2| Partitional techniques
▪ K-means – the most popular clustering method
▪ Organizes data into NK clusters, where NK is a value defined a priori.
  Each cluster has at least one element.
  Each element belongs to one and only one cluster.
  An element belongs to the cluster whose center is closest to it, therefore, to which it is most similar.
  Center – "midpoint" of the cluster (example with NK = 3).
https://medium.com/@pranav3nov/understanding-k-means-clustering-f5e2e84d2129

3.2| Partitional techniques
▪ The data is partitioned into clusters and only the representation of each cluster (center = centroid) is calculated (C1 and C2 for NK = 2).
▪ a – maximization (distance to the other cluster's center)
▪ b – minimization (distance to the own cluster's center)
  $C = \begin{bmatrix} C_1 \\ C_2 \end{bmatrix}$, with each $C_i \in \mathbb{R}^{N \times 1}$

3.2| Partitional techniques
▪ Compute a center (2D) from two points:
  $X_1 = \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix}$, $X_2 = \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix}$
  $C = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$, with $c_1 = \frac{x_{11} + x_{21}}{2}$ and $c_2 = \frac{x_{12} + x_{22}}{2}$

3.2| Partitional techniques
▪ Compute a center from 6 points in N = 2 dimensions, $X_i = \{ x_{i1}, x_{i2} \}$:
  $c_1 = \frac{x_{11} + x_{21} + x_{31} + x_{41} + x_{51} + x_{61}}{6}$
  $c_2 = \frac{x_{12} + x_{22} + x_{32} + x_{42} + x_{52} + x_{62}}{6}$
▪ In N dimensions: average of the coordinates, one by one, of the points that belong to the cluster.

3.2| Partitional techniques
▪ K-means
▪ How to determine the centers (centroids), in such a way that:
  Each cluster has at least one element.
  Each element belongs to one and only one cluster.
  An element belongs to the cluster whose center is closest to it.
  High similarity within the cluster – small distances.
  Low similarity outside the clusters – large distances.
▪ How to formalize the problem mathematically?

3.2| Partitional techniques
▪ Criterion to minimize
▪ Basically, the aim is to minimize the sum of the distances between all points and the center to which they belong:
  $E(X, C) = \sum_{k=1}^{N_K} \sum_{X \in C_k} (x - c_k)^2$
  (outer sum over all centers; inner sum over the points belonging to that center)
▪ Simple to solve? No ...
▪ Very difficult to find an analytical solution.
▪ Solution: use iterative methods.

3.2| Partitional techniques
▪ K-means
▪ 1| Initially choose the number of centers (NK = ?) and their location (Ck = ?, k = 1, ..., NK), according to a common-sense criterion (or randomly).
▪ 2| Assign each point X, from 1 to N, to the center Ci that is closest to it (Euclidean distance).
▪ 3| Recalculate the coordinates of each center: averages of the coordinates of the points of the respective center.

3.2| Partitional techniques
▪ 4| Adjust the coordinates of each center, blending the newly computed centers with those of the previous step:
  $c_i(k) = (1 - \alpha) \, c_i(k-1) + \alpha \, \hat{c}_i(k)$, with $0 \le \alpha \le 1$,
  where $c_i(k-1)$ is the center of the previous iteration and $\hat{c}_i(k)$ is the center computed in the current iteration.
  For example, with $\alpha = 0.5$: $c_i(k) = 0.5 \, c_i(k-1) + 0.5 \, \hat{c}_i(k)$
▪ 5a| If the centers have not moved, or in other words, if all points are already well classified, stop!
▪ 5b| Otherwise, go back to step 2.
  (A minimal code sketch of this loop is given after the example below.)

3.2| Partitional techniques
▪ K-means example
▪ Data related to the eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
▪ For each eruption, the duration (minutes) and the time elapsed since the previous eruption – inactivity (minutes) – are recorded.
  X1 | duration
  X2 | inactivity
  (Matlab: Clusters.m)
https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat
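Below is a minimal sketch (not from the slides) of the iterative loop of steps 1 to 5 above, written in plain NumPy; the random toy data, the choice NK = 2, and the simple full update of the centers (i.e., alpha = 1 in step 4) are assumptions made for brevity.

```python
# Hedged sketch of the k-means loop described above (steps 1-5), in plain NumPy.
# Random toy data, NK = 2, and a full center update (alpha = 1) are assumed.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 50], [0.3, 5], (10, 2)),    # toy 2-D data, two loose groups
               rng.normal([4, 80], [0.3, 5], (10, 2))])

NK = 2
centers = X[rng.choice(len(X), NK, replace=False)]        # step 1: random initial centers

for _ in range(100):
    # step 2: assign each point to the closest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute each center as the mean of the coordinates of its points
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(NK)])
    # step 5a: stop when the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers                                  # step 5b: otherwise iterate again

print("Final centers:\n", centers)
print("Labels:", labels)
```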
3.2| Partitional techniques
▪ To simplify the explanation, only n = 20 data points are considered (duration vs. inactivity).

3.2| Partitional techniques
▪ NK = 2 (seems to make sense)
▪ Initial location of the centers? In this case, randomly ...
  Step 1 | NK = 2, two groups, with locations
  $C_1 = \begin{bmatrix} c_{11} = 4.5 \\ c_{12} = 75 \end{bmatrix}$, $C_2 = \begin{bmatrix} c_{21} = 2.5 \\ c_{22} = 75 \end{bmatrix}$

3.2| Partitional techniques
▪ Assign each point X, from 1 to n = 20, to the center that is closest to it (Euclidean distance).
  Step 2 | $C_2 = \begin{bmatrix} c_{21} = 2.5 \\ c_{22} = 75 \end{bmatrix}$
▪ Red (group 1)
▪ Blue (group 2)
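As a closing illustration (not from the slides), the sketch below performs the step-2 assignment above with the initial centers C1 = (4.5, 75) and C2 = (2.5, 75) from the example; the few (duration, inactivity) points are made up, since the 20-point subset is not reproduced in the transcript.

```python
# Hedged sketch of step 2: assign each (duration, inactivity) point to the
# closest of the two initial centers from the slides. The points are made up.
import numpy as np

C1 = np.array([4.5, 75.0])
C2 = np.array([2.5, 75.0])

points = np.array([[1.8, 56.0], [4.3, 79.0], [2.1, 54.0], [4.5, 85.0]])  # illustrative only

for p in points:
    d1 = np.linalg.norm(p - C1)   # Euclidean distance to C1
    d2 = np.linalg.norm(p - C2)   # Euclidean distance to C2
    group = "group 1 (red)" if d1 < d2 else "group 2 (blue)"
    print(p, "->", group)
```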