Clustering & K-Means Clustering, DBSCAN
Dr. Sanchali
Summary
This presentation discusses clustering techniques, focusing on K-means and DBSCAN. It explores the concepts, calculations, and applications of these unsupervised learning algorithms. The presentation includes examples and visualizations, making it a helpful resource for understanding these machine learning methods.
Full Transcript
SML TEXTBOOKS/LEARNING RESOURCES:
a) Masashi Sugiyama, Introduction to Statistical Machine Learning (1st ed.), Morgan Kaufmann, 2017. ISBN 978-0128021217.
b) T. M. Mitchell, Machine Learning (1st ed.), McGraw Hill, 2017. ISBN 978-1259096952.

REFERENCE BOOKS/LEARNING RESOURCES:
a) Richard Golden, Statistical Machine Learning: A Unified Framework (1st ed.), 2020.

UNIT IV: Unsupervised Learning
Topic: Clustering, K-Means Clustering Algorithm, DBSCAN
Dr. Sanchali

Clustering
Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data. Cluster analysis is a technique used in data mining and machine learning to group similar objects into clusters. In clustering, we do not have a target to predict. We look at the data, try to club similar observations together, and form different groups. Hence it is an unsupervised learning problem.

Example
A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and, based on this information, decide which offer should be given to which customer. Now, the bank can potentially have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? That is a manual process and would take a huge amount of time. So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers based on their income.

How an unsupervised algorithm helps
For simplicity, let's say the bank only wants to use income and debt to make the segmentation. They collected the customer data and used a scatter plot to visualize it. On the X-axis we have the income of the customer, and the Y-axis represents the amount of debt. Here, we can clearly see that these customers can be segmented into 4 different clusters. This is how clustering helps to create segments (clusters) from the data. The bank can further use these clusters to make strategies and offer discounts to its customers.

Different distance measures
Euclidean Distance: the distance between two points that we have already studied in geometry. It can be calculated as the square root of the sum of the squared differences between the coordinates of the two points: d(x, y) = sqrt(Σᵢ (xᵢ − yᵢ)²).

Manhattan Distance: used when we are interested in the total distance traveled by the object instead of the displacement. This metric is calculated by summing the absolute differences between the coordinates of the points in n dimensions.

Minkowski Distance: the Euclidean and Manhattan distances are special cases of the Minkowski distance, d(x, y) = (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/p). When p = 2 it is the same as the formula for the Euclidean distance, and when p = 1 we obtain the formula for the Manhattan distance.
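These distance measures are straightforward to compute. Below is a minimal sketch in Python with NumPy (the language and library are assumptions, not specified in the slides) showing that the Minkowski distance with p = 2 reduces to the Euclidean distance and with p = 1 to the Manhattan distance.

```python
# Minimal sketch of the three distance measures described above.
import numpy as np

def euclidean(x, y):
    # Straight-line distance: sqrt of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    # Total distance traveled along the axes: sum of absolute differences
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def minkowski(x, y, p):
    # General form; p = 2 gives Euclidean, p = 1 gives Manhattan
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

a, b = [1.0, 2.0], [4.0, 6.0]
print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7.0
print(minkowski(a, b, p=2))   # 5.0, same as Euclidean
print(minkowski(a, b, p=1))   # 7.0, same as Manhattan
```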
Different evaluation metrics for clustering
Dunn Index: the ratio of the minimum of the inter-cluster distances to the maximum of the intracluster distances. Inertia only tries to minimize the intracluster distance, i.e., to make more compact clusters, whereas the Dunn index also takes the separation between clusters into account. The HIGHER the value of the Dunn index, the better the clusters.

Inertia: it tells us how spread out the points within a cluster are. Inertia calculates the sum of distances of all the points within a cluster from the centroid of that cluster. Normally, we use the Euclidean distance as the distance metric as long as most of the features are numeric, and the Manhattan distance in case most of the features are categorical. We calculate this for all the clusters; the final inertia value is the sum of all these distances. This distance within the clusters is known as the intracluster distance. The LOWER the inertia value, the BETTER our clusters are.

Applications of clustering in real-world scenarios
o Customer segmentation
o Document clustering
o Image segmentation
o Recommendation engines
etc.

K-Means Clustering Algorithm
An unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. It is a method for grouping n observations into K clusters. It uses vector quantization and aims to assign each observation to the cluster with the nearest mean (centroid), which serves as a prototype for the cluster.

Introduction
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

Introduction: K-Means Algorithm
It groups the unlabeled dataset into different clusters. Here, K defines the number of pre-defined clusters that need to be created in the process: if K = 2 there will be two clusters, for K = 3 there will be three clusters, and so on. It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group, and points in the same group have similar properties.

How does the K-Means algorithm work?

Example of working
We have these 8 points, and we want to apply k-means to create clusters for them.

Working
1. Choose the number of clusters k: The first step in k-means is to pick the number of clusters, k. Let's say we want to have 2 clusters, so k is equal to 2 here.
2. Select k random points from the data as centroids: Next, we randomly select a centroid for each cluster. Here, the red and green circles represent the centroids for these clusters.
3. Calculate the distance of each point from the centroids (using any distance measure) and assign each point to the closest cluster centroid: The points closer to the red point are assigned to the red cluster, whereas the points closer to the green point are assigned to the green cluster.
4. Recompute the centroids of the newly formed clusters: Once we have assigned all the points to either cluster, we need to compute the centroids of the newly formed clusters. Here, the red and green crosses are the new centroids.
5. Repeat steps 3 and 4: The step of computing the centroids and assigning all the points to clusters based on their distance from the centroids is a single iteration. But when should we stop this process?

Stopping criteria for K-Means clustering:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached
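The steps above can be turned into a rough from-scratch sketch. The code below assumes Python with NumPy and uses arbitrary illustrative points, since the 8-point example from the slides is not reproduced here; it is a sketch of the idea, not the slides' own implementation.

```python
# Rough from-scratch sketch of the K-Means steps listed above: pick k, choose
# random centroids, assign points to the nearest centroid, recompute centroids,
# and repeat until the centroids stop changing or max_iter is reached.
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: select k random points from the data as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):                      # Stopping criterion 3: max iterations
        # Step 3: assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroid of each newly formed cluster
        # (a production version would also guard against empty clusters)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Stopping criterion 1: centroids of newly formed clusters do not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Arbitrary 2-D points purely for demonstration (not the slides' 8-point data).
pts = [[1, 1], [1.5, 2], [1, 0.6], [1.2, 1.8], [3.5, 5], [4.5, 5], [3.5, 4.5], [4, 5.5]]
labels, centroids = k_means(pts, k=2)
print(labels)
print(centroids)
```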
How to choose the value of K (the number of clusters)?
The performance of the K-means clustering algorithm depends on how efficient the clusters it forms are, i.e., on choosing the optimal number of clusters. There are several different ways to find the optimal number of clusters. One of the most appropriate methods to find the number of clusters, or the value of K, is the Elbow Method.

Elbow Method
o The Elbow method is one of the most popular ways to find the optimal number of clusters.
o This method uses the concept of the WCSS value.
o WCSS stands for Within-Cluster Sum of Squares, which defines the total variation within a cluster.
The formula to calculate the value of WCSS (for 3 clusters) is:
WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²
Here, Σ_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squares of the distances between each data point in Cluster 1 and its centroid, and the same holds for the other two terms.

Elbow Method
To measure the distance between data points and a centroid, we can use any method such as the Euclidean distance. To find the optimal number of clusters, the elbow method follows these steps:
o It runs K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o The sharp point of bend, where the plot looks like an arm (the "elbow"), is taken as the best value of K.

Solve the problem (lecture assignment)
1-D data example. Initial centroid values: C1 = 1, C2 = 8, C3 = 15. Marks: 3.

Solve the problem (lecture assignment)
2-D data example.

Challenges in K-Means
One of the common challenges we face while working with K-means is that the sizes of the clusters are different: the leftmost and the rightmost clusters are smaller compared to the central cluster. If we apply k-means clustering on these points, the resulting clusters do not respect these original groups.

Challenges in K-Means
Another challenge with k-means arises when the densities of the original points are different. Here, the points in the red cluster are spread out, whereas the points in the remaining clusters are closely packed together. If we apply k-means on these points, the clusters it finds again differ from the original groups.

Implementation
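As a stand-in for the implementation, here is a minimal sketch of K-Means together with the Elbow Method, assuming Python with scikit-learn and matplotlib on synthetic data (the library choices and the data are assumptions, not taken from the slides).

```python
# K-Means with the Elbow Method: run K-Means for K = 1..10, record the WCSS
# (scikit-learn exposes it as inertia_), plot it, then fit the final model.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy data

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)          # WCSS for this value of K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()

# Fit the final model with the K read off at the "elbow" of the curve.
final_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(final_km.labels_[:10])
print(final_km.cluster_centers_)
```

Reading K off the elbow is a manual step; K = 4 is used in the last lines only because the synthetic data was generated with 4 blobs.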
Helpful links
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
https://www.simplilearn.com/tutorials/machine-learning-tutorial/k-means-clustering-algorithm
https://www.youtube.com/watch?v=4b5d3muPQmA

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Density-based clustering
Density-based clustering methods are unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. They can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers. They group together points that are closely packed, while marking as outliers the points that lie alone in low-density regions.

Density-based clustering: DBSCAN
There are two key parameters in DBSCAN needed to define "density":
minPts: the minimum number of points (a threshold) clustered together for a region to be considered dense.
eps (ε): a distance measure that is used to locate the points in the neighborhood of any point.

Logic and steps
The DBSCAN algorithm takes two input parameters: the radius around each point (eps) and the minimum number of data points that should be around that point within that radius (MinPts). Considering the example, take the point (1.5, 2.5): if we choose eps = 0.3, then the circle around the point with radius 0.3 contains only one other point, (1.2, 2.5).

Core concept
There are three types of data points:
Core point: a point is a core point if it has at least MinPts points within distance eps of it.
Border point: a point which has fewer than MinPts points within eps, but is in the neighborhood of a core point.
Noise or outlier: a point which is neither a core point nor a border point.

Example
We first have to choose the values for eps and MinPts. Let's choose eps = 0.6 and MinPts = 4. Consider the first data point in the dataset, (1, 2), and calculate its distance from every other data point in the dataset.

Distance calculation
From the table, the point (1, 2) has only two other points in its neighborhood, (1, 2.5) and (1.2, 2.5), for the assumed value of eps. As this is less than MinPts, we cannot declare it a core point. Let's repeat the above process for every point in the dataset and find the neighborhood of each; the repeated calculations can be summarized in a table.

DBSCAN
The left-most column of the table contains all the points we have in our dataset. To the right of them are the data points in their neighborhood, i.e., the points whose distance from them is less than or equal to the eps value. There are three points in the dataset, (2.8, 4.5), (1.2, 2.5), and (1, 2.5), that have 4 neighborhood points around them, hence they are core points. If a core point is not yet assigned to any cluster, a new cluster is formed: (2.8, 4.5) is assigned to a new cluster, Cluster 1, and so is the point (1.2, 2.5), which starts Cluster 2. Also observe that the core points (1.2, 2.5) and (1, 2.5) share at least one common neighborhood point, (1, 2), so they are assigned to the same cluster.

Categorization of the three types of points

Final clusters
There are three types of points in the dataset as detected by the DBSCAN algorithm: core, border, and outliers. Every core point is assigned to a new cluster, unless some of the core points share neighborhood points, in which case they are included in the same cluster. Every border point is assigned to a cluster based on the core point in its neighborhood; e.g., the first point (1, 2) is a border point and has a core point (1.2, 2.5) in its neighborhood, which is included in Cluster 2, hence the point (1, 2) is included in Cluster 2 too.

Implementation

Important links
https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html
https://towardsdatascience.com/dbscan-make-density-based-clusters-by-hand-2689dc335120
Different types of clustering: https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/
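For completeness, here is a minimal DBSCAN sketch, again assuming Python with scikit-learn and synthetic two-moons data rather than the slides' dataset; eps and min_samples play the roles of the eps and MinPts parameters described above.

```python
# Minimal DBSCAN sketch: fit on non-spherical toy data, then count the clusters
# and the noise points (scikit-learn labels noise/outlier points as -1).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=7)  # toy data

# eps: neighborhood radius; min_samples: the MinPts threshold from the slides
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")

# Indices of the core points found by the algorithm
print("first core points:", db.core_sample_indices_[:10])
```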