Clustering Methods: Unsupervised Learning
Sarvajanik University
Prof. Vasundhara Uchhula
Summary
This guide details clustering methods in machine learning, focusing on unsupervised learning. It covers algorithms like K-means, hierarchical clustering, and DBSCAN. The guide also includes examples and discusses cluster quality metrics along with outlier detection techniques.
Unit-5 Clustering Methods
By: Prof. Vasundhara Uchhula

Introduction:
Unsupervised learning is a machine learning concept in which unlabelled and unclassified information is analysed to discover hidden knowledge. The algorithms work on the data without any prior training, but they are constructed in such a way that they can identify patterns, groupings, sorting order, and other interesting knowledge from the data set.

UNSUPERVISED VS SUPERVISED LEARNING
Till now we have discussed supervised learning, where the aim was to predict the outcome variable Y on the basis of the feature set X1, X2, ..., Xn, and we covered methods such as regression and classification for that purpose. We now introduce unsupervised learning, where we observe only the features X1, X2, ..., Xn. We are not going to predict any outcome variable; rather, our intention is to find associations between the features, or groupings among the observations, in order to understand the nature of the data. This analysis may reveal an interesting correlation between the features or a common behaviour within a subgroup of the data, which provides a better understanding of the data.

APPLICATIONS OF UNSUPERVISED LEARNING
○ Segmentation of target consumer populations by an advertisement consulting agency on the basis of a few dimensions such as demography, financial data, purchasing habits, etc., so that advertisers can reach their target consumers efficiently
○ Anomaly or fraud detection in the banking sector by identifying the pattern of loan defaulters
○ Image processing and image segmentation, such as face recognition and expression identification
○ Grouping of important characteristics in genes to identify important influencers in new areas of genetics
○ Utilization by data scientists to reduce the dimensionality of sample data to simplify modelling
○ Document clustering and identification of potential labelling options

I. CLUSTERING
Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a data set on the basis of the characteristics of the objects within that data set, in such a manner that the objects within a group are similar to each other but different from the objects in other groups. The effectiveness of clustering depends on how similar or related the objects within a group are, and how different or unrelated the objects in different groups are from each other.

Clustering as a machine learning task

Different types of clustering techniques
The major clustering techniques are Partitioning methods, Hierarchical methods, and Density-based methods.

Partitioning methods: K-means - a centroid-based technique
K-means clustering is an unsupervised machine learning algorithm, which means it is used when we do not have any specified labels in the dataset.

Why is it used?
○ The purpose of this unsupervised machine learning algorithm is to form clusters, or groups, in a given data set, with the number of groups indicated by the variable K.
○ It works iteratively, assigning each data point to one of the K clusters based on the given features.
○ The data points are generally grouped based on feature similarity.
○ The end results of the K-means clustering algorithm are the centroids of the K clusters that were identified, and the labels for the training data.

How does it work?
1. Decide on the number of clusters K (K is predefined).
2. Choose K points arbitrarily as the initial cluster centers.
3. Assign each point to its nearest cluster center, according to the Euclidean distance function.
4. Compute the mean, or centroid, of all the objects within each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds, i.e. the centroids stop changing.

Let us take an example:
D = { (5,3), (10,15), (15,12), (24,10), (30,45), (85,70), (71,80), (60,78), (55,52), (80,91) }
Let us name the centroids of clusters C1 and C2 as c1 and c2 and initialize them with the values of the first two data points, i.e. (5, 3) and (10, 15).
After the first assignment, cluster C1 contains only one point, (5,3), so the mean of the coordinates remains the same and the new centroid value for c1 is still (5,3). Cluster C2 contains the remaining 9 data points. Writing the coordinates of a data point as x and y, the new x coordinate of centroid c2 is the mean of the x coordinates of all 9 points belonging to cluster C2:
c2(x) = (10 + 15 + 24 + 30 + 85 + 71 + 60 + 55 + 80) / 9 = 47.77
Similarly, the new y coordinate of c2 is the mean of the y coordinates of all 9 points in C2:
c2(y) = (15 + 12 + 10 + 45 + 70 + 80 + 78 + 52 + 91) / 9 = 50.33
The updated centroid value for c2 is therefore (47.77, 50.33).
For the next iteration, the new centroid values for c1 and c2 are used and the whole process is repeated. The iterations continue until the centroid values stop updating. The next iterations are as follows:

Second iteration:
c1(x) = (5 + 10 + 15 + 24) / 4 = 13.5
c1(y) = (3 + 15 + 12 + 10) / 4 = 10.0
Updated c1 = (13.5, 10.0).
c2(x) = (30 + 85 + 71 + 60 + 55 + 80) / 6 = 63.5
c2(y) = (45 + 70 + 80 + 78 + 52 + 91) / 6 = 69.33
Updated c2 = (63.5, 69.33).

Third iteration:
c1(x) = (5 + 10 + 15 + 24 + 30) / 5 = 16.8
c1(y) = (3 + 15 + 12 + 10 + 45) / 5 = 17.0
Updated c1 = (16.8, 17.0).
c2(x) = (85 + 71 + 60 + 55 + 80) / 5 = 70.2
c2(y) = (70 + 80 + 78 + 52 + 91) / 5 = 74.2
Updated c2 = (70.2, 74.2).

Fourth iteration:
The cluster assignments do not change, so the centroids are recomputed as c1 = (16.8, 17.0) and c2 = (70.2, 74.2), exactly as in the third iteration.

Since the values at the end of the fourth iteration are the same as at the end of the third, the algorithm has converged and the data cannot be re-clustered any further; c1 and c2 are the final centroids of C1 and C2. To classify a new data point, the distance between that point and each cluster centroid is calculated, and the point is assigned to the cluster whose centroid is closest.

Choosing an appropriate number of clusters
For a small data set, a rule of thumb that is sometimes followed is
○ K = √(n/2), i.e. K is set to the square root of n/2 for a data set of n examples.
Another method is the Elbow method:
○ Vary the number of clusters K from 1 to 10.
○ For each value of K, calculate the WCSS (Within-Cluster Sum of Squares). WCSS is the sum of the squared distances between each point and the centroid of its cluster.
○ When we plot WCSS against K, the plot looks like an elbow. As the number of clusters increases, the WCSS value decreases; it is largest when K = 1.
○ The graph changes rapidly up to a point, creating the elbow shape, and from that point onwards it runs almost parallel to the X-axis. The K value corresponding to this point is the optimal number of clusters.
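The walkthrough above can be reproduced in a few lines of code. Below is a minimal NumPy sketch (the function name kmeans and its helpers are our own, not a library routine) that initialises the centroids with the first K data points exactly as in the example, alternates the assignment and update steps, and also computes the WCSS values used by the elbow method.

```python
import numpy as np

# Example data set D from the walkthrough above.
D = np.array([(5, 3), (10, 15), (15, 12), (24, 10), (30, 45),
              (85, 70), (71, 80), (60, 78), (55, 52), (80, 91)], dtype=float)

def kmeans(X, k, max_iter=100):
    centroids = X[:k].copy()                     # initialise with the first k points, as in the example
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # step 3: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # step 5: stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(D, k=2)
print(centroids)   # approx. [[16.8, 17.0], [70.2, 74.2]], matching the hand-worked example

# Elbow method: WCSS (within-cluster sum of squares) for K = 1..5
for K in range(1, 6):
    c, lab = kmeans(D, K)
    wcss = sum(((D[lab == j] - c[j]) ** 2).sum() for j in range(K))
    print(K, round(wcss, 1))
```

In practice, library implementations such as scikit-learn's KMeans add smarter initialisation (k-means++) and multiple random restarts on top of this basic loop.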
Strengths and weaknesses of the k-means algorithm

K-Medoids algorithm (also called Partitioning Around Medoids, PAM)
The k-means algorithm is sensitive to outliers in the data set and may produce skewed clusters because the means of the data points are used as centroids. k-medoids provides a solution to this problem. Instead of taking the mean of the data points in a cluster, k-medoids takes k representative data points from the existing points in the data set as the centers of the clusters, and then assigns the data points to the k clusters according to their distance from these centers. Note that the medoids are actual data points or objects from the data set, not imaginary points such as the within-cluster means used as centroids in the k-means technique.
A medoid can be defined as the point in a cluster whose total dissimilarity to all the other points in the cluster is minimum. The dissimilarity of a medoid Ci and an object Pi is calculated as
E = |Pi - Ci|
and the cost in the K-Medoids algorithm is the sum of these dissimilarities over all data points, each measured to its closest medoid:
cost = Σ (over all clusters Ci) Σ (over all points Pi in Ci) |Pi - Ci|

Algorithm:
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid, using any common distance metric.
3. While the cost decreases: for each medoid m and for each data point o that is not a medoid:
   1. Swap m and o, associate each data point with the closest medoid, and recompute the cost.
   2. If the total cost is more than in the previous step, undo the swap.

Let us consider the following example, with the data points plotted as a graph.
Step 1: Select k = 2 and randomly choose two medoids, say C1 = (4, 5) and C2 = (8, 5).
Step 2: Calculate the cost, using the Manhattan distance |x1 - x2| + |y1 - y2|. The dissimilarity of each non-medoid point with the two medoids is calculated and tabulated, and each point is assigned to the cluster of the medoid with the smaller dissimilarity. The points 1, 2 and 5 go to cluster C1 and the points 0, 3, 6, 7 and 8 go to cluster C2.
The cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Step 3: Randomly select one non-medoid point and recalculate the cost. Let the randomly selected point be (8, 4). The dissimilarity of each non-medoid point with the medoids C1 (4, 5) and C2 (8, 4) is calculated and tabulated, and each point is again assigned to the cluster with the smaller dissimilarity. As before, the points 1, 2 and 5 go to cluster C1 and the points 0, 3, 6, 7 and 8 go to cluster C2.
The new cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
Swap cost = new cost - previous cost = 22 - 20 = 2, and 2 > 0.
As the swap cost is not less than zero, we undo the swap. Hence C1 = (4, 5) and C2 = (8, 5) remain the final medoids, and the clustering is formed around them.

Advantages:
1. It is simple to understand and easy to implement.
2. The K-Medoids algorithm is fast and converges in a fixed number of steps.
3. PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages:
1. The main disadvantage of K-Medoid algorithms is that they are not suitable for clustering non-spherical (arbitrarily shaped) groups of objects.
2. They may give different results for different runs on the same dataset because the first k medoids are chosen randomly.
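The swap-based search described above can be sketched directly in NumPy. The following naive PAM implementation is illustrative only: it is not a library algorithm, and the point table from the worked example is not reproduced in this transcript, so any data array passed in is the reader's own.

```python
import numpy as np

def manhattan(a, b):
    # Manhattan distance |x1 - x2| + |y1 - y2|, summed over all coordinates
    return np.abs(a - b).sum(axis=-1)

def total_cost(X, medoids):
    # Cost = sum of the dissimilarity of every point to its closest medoid
    d = np.array([manhattan(X, m) for m in medoids])   # shape (k, n)
    return d.min(axis=0).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = list(rng.choice(len(X), size=k, replace=False))
    best = total_cost(X, X[medoid_idx])
    improved = True
    while improved:                                    # keep swapping while the cost decreases
        improved = False
        for mi in range(k):
            for o in range(len(X)):
                if o in medoid_idx:
                    continue
                trial = medoid_idx.copy()
                trial[mi] = o                          # tentatively swap medoid mi with non-medoid o
                cost = total_cost(X, X[trial])
                if cost < best:                        # keep the swap only if it lowers the cost;
                    best, medoid_idx, improved = cost, trial, True
                # otherwise the swap is implicitly undone (medoid_idx stays unchanged)
    return X[medoid_idx], best

# Usage would look like: medoids, cost = pam(points, k=2), where 'points' is an (n, 2) NumPy
# array such as the point table used in the worked example above (not reproduced here).
```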
Hierarchical Clustering
Hierarchical clustering creates clusters in a hierarchical, tree-like structure (also called a dendrogram). That is, subsets of similar data are arranged in a tree-like structure in which the root node corresponds to the entire data set, and branches are created from the root node to form several clusters. Hierarchical clustering is of two types:
○ Divisive
○ Agglomerative
Divisive hierarchical clustering is also termed a top-down clustering approach. In this technique, the entire data set is first assigned to a single cluster, which is then split repeatedly until there is one cluster for each observation.
Agglomerative hierarchical clustering is popularly known as a bottom-up approach, wherein each observation is initially treated as its own cluster. Pairs of clusters are combined until all clusters are merged into one big cluster that contains all the data.

How the agglomerative hierarchical clustering algorithm works
For a set of N observations to be clustered:
1. Start by assigning each observation to its own single-point cluster, so that if we have N observations, we have N clusters, each containing just one observation.
2. Find the closest (most similar) pair of clusters and merge them into one cluster; we now have N-1 clusters. Similarity and dissimilarity can be measured in various ways.
3. Again find the two closest clusters and merge them into one cluster; we now have N-2 clusters. This is done using an agglomerative clustering linkage technique.
4. Repeat steps 2 and 3 until all observations are clustered into one single cluster of size N.

Agglomerative clustering linkage algorithm (cluster distance measure)
This technique is used for deciding which two clusters to combine. Note that it is the distance between clusters that is measured, not the distance between individual observations.

DBSCAN Clustering Algorithm
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density. It groups densely packed data points into a single cluster, and it can identify clusters in large spatial datasets by looking at the local density of the data points. The most exciting feature of DBSCAN clustering is that it is robust to outliers. It also does not require the number of clusters to be specified beforehand, unlike K-means, where we have to specify the number of centroids. The density-based clustering approach therefore provides a solution for identifying clusters of arbitrary shapes: the principle is to identify the dense areas and sparse areas within the data set and then run the clustering algorithm on that basis.
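As a quick illustration of both ideas, the sketch below (assuming scikit-learn is available) clusters two interleaving half-moons, a non-spherical shape that partitioning methods handle poorly, first with bottom-up agglomerative clustering and then with DBSCAN. The eps and min_samples values are illustrative choices rather than ones prescribed in the lecture.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: dense, arbitrarily shaped clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Bottom-up agglomerative clustering; single linkage merges the two closest clusters at each step.
agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)

# DBSCAN needs no cluster count: eps is the neighbourhood radius, min_samples the density threshold.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(agg.labels_))   # {0, 1}
print(set(db.labels_))    # cluster ids; a label of -1 would mark points treated as noise
```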
Cluster Quality Metrics
Clustering algorithms try to reveal natural groupings amongst the data. However, it is quite tricky to evaluate the performance of a clustering algorithm. Clustering is by nature very subjective, and whether a cluster is good or bad is open to interpretation; it has been noted that 'clustering is in the eye of the beholder'. Two challenges lie in the process of clustering:
○ It is generally not known how many clusters can be formed from a particular data set.
○ Even if the number of clusters is given, the same number of clusters can be formed with different groupings of the data instances.
We can say that a clustering algorithm is successful if the clusters it identifies achieve the right results in the problem domain. Beyond this, there are two popular approaches to cluster quality evaluation.

a) Internal evaluation
Internal evaluation measures cluster quality based on the homogeneity of data belonging to the same cluster and the heterogeneity of data belonging to different clusters, where homogeneity and heterogeneity are decided by some similarity measure. The silhouette coefficient uses the distance between data elements (most commonly Euclidean or Manhattan distance) as the similarity measure. The value of the silhouette width ranges between -1 and +1, with a high value indicating high intra-cluster homogeneity and high inter-cluster heterogeneity.
For a data set clustered into k clusters, the silhouette width of the i-th data instance is calculated as:
○ silhouette width(i) = (b(i) - a(i)) / max{a(i), b(i)}
○ where a(i) is the average distance between the i-th data instance and all other data instances belonging to the same cluster, and b(i) is the lowest average distance between the i-th data instance and the data instances of each of the other clusters.
Let us understand this with the example in the figure, which shows four clusters, numbered 1, 2, 3 and 4. Consider an arbitrary data element i in cluster 1, represented by an asterisk. a(i) is the average of the distances a_i1, a_i2, ..., a_in1 of the other data elements in cluster 1 from the i-th data element, assuming there are n1 data elements in cluster 1. Mathematically,
○ a(i) = (a_i1 + a_i2 + ... + a_in1) / n1
In the same way, we calculate the distances of the data element i in cluster 1 from the data elements of another cluster, say cluster 4, and take the average of all those distances:
b14(average) = (b14(1) + b14(2) + ... + b14(n4)) / n4
where n4 is the total number of elements in cluster 4. In the same way we can calculate b12(average) and b13(average). b(i) is the minimum of all these values. Hence,
○ b(i) = minimum [b12(average), b13(average), b14(average)]

b) External evaluation
In this approach, the class labels are known for the data set subjected to clustering, although the known class labels are not used in the clustering itself. The clustering algorithm is assessed on how close its results are to those known class labels. For example, purity is one of the most popular measures of cluster quality; it evaluates the extent to which each cluster contains a single class. For a data set having n data instances and c known class labels, clustered into k clusters, purity is measured as:
Purity = (1/n) Σ (over the k clusters) [number of data instances of the majority class in that cluster]

Outlier Detection
Handling outliers:
Outliers are data values that are abnormally high or low and may impact prediction accuracy. If an outlier is natural, i.e. the value of the data element is surprisingly high or low for a valid reason, then we should not amend it. Approaches for amending outliers include:
○ Remove outliers: if the number of outliers is small.
○ Imputation (the assignment of a value by inference): impute the value with the mean, median, mode, or the value of the most similar data item.
○ Capping: for values that lie outside the 1.5 × IQR limits, we can cap them by replacing observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile.
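To tie the last two topics together, here is a small sketch (assuming scikit-learn and NumPy are available) that computes the silhouette width for a K = 2 K-means clustering of the earlier example data set D, and then applies the 1.5 × IQR capping rule to a made-up array containing one artificial outlier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette width of a K-means clustering of the example data set D (K = 2).
D = np.array([(5, 3), (10, 15), (15, 12), (24, 10), (30, 45),
              (85, 70), (71, 80), (60, 78), (55, 52), (80, 91)], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(D)
print(silhouette_score(D, labels))   # mean of (b(i) - a(i)) / max{a(i), b(i)}; closer to +1 is better

# Capping outliers: values outside the 1.5 x IQR fences are replaced by the 5th/95th percentiles.
x = np.array([12., 14., 15., 13., 16., 14., 15., 13., 12., 95.])   # 95 is an artificial outlier
q1, q3 = np.percentile(x, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
p5, p95 = np.percentile(x, [5, 95])
capped = np.where(x < lower, p5, np.where(x > upper, p95, x))
print(capped)                        # the outlier is pulled back to the 95th-percentile value
```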