Unsupervised Learning Lecture Notes PDF

Universiti Malaysia Pahang
Dr Nor Azuana Ramli
Summary
These lecture notes cover unsupervised learning techniques, focusing on clustering algorithms like K-Means, DBSCAN, and Hierarchical clustering. The material delves into the theory and applications of these methods, along with their advantages and drawbacks.
Full Transcript
CHAPTER 5: UNSUPERVISED LEARNING
Dr Nor Azuana Ramli

CONTENT
5.1 Types of Unsupervised Learning
5.2 Challenges in Unsupervised Learning
5.3 Clustering

COURSE OUTCOMES
- Able to understand the concept of clustering.
- Able to understand the K-Means, DBSCAN and Hierarchical clustering methods.
- Able to understand clustering applications involving real-world datasets.

TYPES OF UNSUPERVISED LEARNING
Unsupervised transformations: algorithms that create a new representation of the data which might be easier for humans or other machine learning algorithms to understand compared to the original representation of the data. Example: dimensionality reduction.
Clustering algorithms: algorithms that partition data into distinct groups of similar items.

CHALLENGES IN UNSUPERVISED LEARNING
The main challenge is evaluating whether the algorithm learned something useful. Unsupervised learning algorithms are usually applied to data that does not contain any label information, so we do not know what the right output should be. Therefore, it is very hard to say whether a model "did well."

WHAT IS CLUSTERING?
Clustering is a technique used in data mining and machine learning to group similar objects into clusters. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Hierarchical clustering and K-Means clustering are two popular techniques in the field of unsupervised learning used for clustering data points into distinct groups. While K-Means clustering divides data into a predefined number of clusters, hierarchical clustering creates a tree-like structure to represent the relationships between the clusters.

CLUSTERING
Clustering can be helpful as a data analysis activity in order to learn more about the problem domain, so-called pattern discovery or knowledge discovery. For example:
i. The phylogenetic tree could be considered the result of a manual clustering analysis.
ii. Separating normal data from outliers or anomalies may be considered a clustering problem.
iii. Separating clusters based on their natural behavior is a clustering problem, referred to as market segmentation.
iv. Clustering can also be useful as a type of feature engineering, where existing and new examples can be mapped and labeled as belonging to one of the identified clusters in the data.

Properties of Clustering:
1. All the data points in a cluster should be similar to each other.
2. The data points from different clusters should be as different as possible.

STAGES OF CLUSTERING
A good clustering algorithm is able to identify clusters independent of cluster shape. There are three basic stages in a clustering algorithm (a minimal code sketch of these stages is given below, after the advantages of clustering):
Raw Data -> Clustering Algorithm -> Clusters of Data

APPLICATIONS OF CLUSTERING IN REAL-WORLD SCENARIOS
1. Customer Segmentation
2. Document Clustering
3. Image Segmentation
4. Recommendation Engines

ADVANTAGES OF CLUSTERING
Scalability: we require highly scalable clustering algorithms to work with large databases.
Ability to deal with different kinds of attributes: algorithms should be able to work with different types of data, such as categorical, numerical, and binary data.
Discovery of clusters with arbitrary shape: the algorithm should be able to detect clusters of arbitrary shape and should not be bounded to distance measures.
Interpretability: the results should be comprehensive, usable, and interpretable.
High dimensionality: the algorithm should be able to handle high-dimensional data instead of only handling low-dimensional data.
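To make the three stages above concrete, here is a minimal sketch of the raw data -> clustering algorithm -> clusters pipeline. It is only an illustration: scikit-learn, the toy blob dataset and the choice of K-Means are assumptions, not part of the original notes.

```python
# Minimal sketch of the three clustering stages: raw data -> clustering algorithm -> clustered data.
# scikit-learn, make_blobs and K-Means are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stage 1: raw (unlabeled) data
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Stage 2: apply a clustering algorithm
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Stage 3: clustered data - every point now carries a cluster label
print(model.labels_[:10])
```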
CLUSTERING ALGORITHMS
- K-Means
- Hierarchical
- Fuzzy C-Means
- Mean Shift
- DBSCAN
- GMM with Expectation Maximization

K-MEANS CLUSTERING
K-Means is one of the most widely used and perhaps the simplest unsupervised algorithms for solving clustering problems. Using this algorithm, we classify a given data set into a certain number of predetermined clusters, or "k" clusters. Each cluster is assigned a designated cluster center, and the centers are placed as far away from each other as possible. Each point is then associated with its nearest centroid until no point is left unassigned. Once this is done, the centers are re-calculated and the above steps are repeated. The algorithm converges at the point where the centroids cannot move any further.

The algorithm aims to minimize an objective function called the squared error function F(V):

F(V) = Σ (j = 1 to C) Σ (i = 1 to c_j) ||x_i - v_j||^2

where ||x_i - v_j|| is the distance between point x_i and center v_j, c_j is the count of data points in cluster j, and C is the number of cluster centroids.

Examples of K-Means: worked examples are shown in the slides (figures omitted here).

STEPS IN K-MEANS CLUSTERING
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid. The important steps in applying K-Means are:
1. Choose the number of clusters k.
2. Select k random points from the data as centroids.
3. Assign all the points to the closest cluster centroid.
4. Recompute the centroids of the newly formed clusters.
5. Repeat steps 3 and 4.

HOW TO CHOOSE THE NUMBER OF CLUSTERS, k?
The best way to do this is the Elbow method. To determine the optimal number of clusters, we select the value of k at the "elbow", i.e. the point after which the distortion/inertia starts decreasing in a linear fashion. For the data shown in the slides, we conclude that the optimal number of clusters is 3. (A code sketch of this procedure follows the stopping criteria below.)

Distortion: calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.
Inertia: the sum of squared distances of samples to their closest cluster center.

Other metrics can also be used, such as the silhouette score (the mean silhouette coefficient over all samples) or the Calinski-Harabasz score, which computes the ratio of dispersion between and within clusters. Silhouette plots can also be used to select the most suitable value of k in K-Means clustering. The aspects to look out for in silhouette plots are clusters with scores below the average silhouette score, wide fluctuations in the size of the clusters, and the thickness of the silhouette plot.

STOPPING CRITERIA FOR K-MEANS CLUSTERING
There are essentially three stopping criteria that can be adopted to stop the K-Means algorithm:
1. Centroids of newly formed clusters do not change. Even after multiple iterations, if we are getting the same centroids for all the clusters, we can say that the algorithm is not learning any new pattern, and it is a sign to stop the training.
2. Points remain in the same cluster even after training the algorithm for multiple iterations.
3. The maximum number of iterations is reached. For example, if we have set the number of iterations to 100, the process will repeat for 100 iterations before stopping.
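Here is a minimal sketch of choosing k with the elbow method and the silhouette score, and of capping the iterations (stopping criterion 3). scikit-learn, matplotlib, the toy blob data and all parameter values are illustrative assumptions, not part of the notes.

```python
# Minimal sketch: choosing k for K-Means with the elbow method and silhouette score.
# scikit-learn/matplotlib and all parameter values are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data

ks = list(range(2, 9))
inertias, silhouettes = [], []
for k in ks:
    # max_iter caps the number of iterations (stopping criterion 3);
    # KMeans also stops early once the centroids stop moving (criterion 1).
    km = KMeans(n_clusters=k, n_init=10, max_iter=100, random_state=42).fit(X)
    inertias.append(km.inertia_)                        # sum of squared distances to the closest centroid
    silhouettes.append(silhouette_score(X, km.labels_))  # mean silhouette coefficient

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Elbow method: pick k at the bend of the curve")
plt.show()

best_k = ks[silhouettes.index(max(silhouettes))]
print("k with the highest silhouette score:", best_k)
```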
BEFORE APPLYING K-MEANS
Before applying the K-Means algorithm, feature scaling needs to be done. This is because feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a scenario of clustering people based on their weights (in kg), with range 55 to 110, and heights (in feet), with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading, as the range of weight is much larger than that of height. Therefore, it is necessary to bring the features to the same scale so that they have equal weight in the clustering result.

K-MEANS STEP BY STEP EXAMPLE
Problem: use the K-Means algorithm to create two clusters (the data points are given in the slides).
Solution: assume A(2, 2) and C(1, 1) are the centers of the two clusters.
Iteration 1: we calculate the distance of each point from each of the two cluster centers. The distance is calculated using the Euclidean distance formula. We then re-compute the new cluster centers. A new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01: center of Cluster-01 = ((2 + 3 + 3)/3, (2 + 2 + 1)/3) = (2.67, 1.67)
For Cluster-02: center of Cluster-02 = ((1 + 1.5)/2, (1 + 0.5)/2) = (1.25, 0.75)

EXERCISE
Cluster the following eight points, with (x, y) representing locations, into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as:
ρ(a, b) = |x2 - x1| + |y2 - y1|
Use the K-Means algorithm to find the three cluster centers.

K-MEANS CLUSTERING: ADVANTAGES AND DRAWBACKS
Advantages:
- Can be applied to any form of data, as long as the data has numerical (continuous) entities.
- Much faster than many other clustering algorithms.
- Easy to understand and interpret.
Drawbacks:
- Fails for non-linear (non-spherical) cluster shapes.
- Requires us to decide on the number of clusters before we start the algorithm, so the user needs additional mathematical methods and heuristic knowledge to verify the correct number of centers.
- Cannot work with categorical data.
- Cannot handle outliers.

APPLICATION AREAS
- Document clustering: segmenting text-matrix data such as document-term matrices (DTM) and TF-IDF matrices.
- Banking and insurance fraud detection, where the majority of the columns represent financial figures (continuous data).
- Image segmentation.
- Customer segmentation.

HIERARCHICAL CLUSTERING
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with all the data points assigned to clusters of their own. Then the two nearest clusters are merged into the same cluster. The algorithm terminates when there is only a single cluster left.

The results of hierarchical clustering can be shown using a dendrogram. In the example in the slides, we start at the bottom with 25 data points, each assigned to a separate cluster. The two closest clusters are then merged until we have just one cluster at the top. The height in the dendrogram at which two clusters are merged represents the distance between those two clusters in the data space.

The number of clusters that best depicts different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster. In the example in the slides, the best choice of the number of clusters is 4, as the red horizontal line in the dendrogram covers the maximum vertical distance AB.
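A minimal sketch of agglomerative hierarchical clustering and a dendrogram follows. SciPy, scikit-learn, the 25-point toy dataset and the Ward linkage are illustrative assumptions, not part of the notes.

```python
# Minimal sketch: agglomerative hierarchical clustering and a dendrogram with SciPy.
# The toy 25-point dataset and Ward linkage are illustrative assumptions.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=25, centers=3, random_state=0)

Z = linkage(X, method="ward")  # build the merge hierarchy bottom-up
dendrogram(Z)                  # merge height = distance between the merged clusters
plt.title("Dendrogram: cut with a horizontal line to choose the number of clusters")
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
print(labels)
```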
TYPES OF HIERARCHICAL CLUSTERING
Agglomerative hierarchical clustering: this technique starts by assigning each point to an individual cluster. Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a single cluster is left. It is called additive hierarchical clustering since we add (merge) clusters at each step.
Divisive hierarchical clustering: this works in the opposite way. Instead of starting with n clusters, we start with a single cluster and assign all the points to that cluster, so it does not matter whether we have 10 or 1,000 data points; they all belong to the same cluster at the beginning. At each iteration, we split off the farthest point in the cluster and repeat this process until each cluster contains only a single point. We are splitting (dividing) the clusters at each step, hence the name divisive hierarchical clustering.

STEPS TO PERFORM HIERARCHICAL CLUSTERING
1. First, we assign all the points to individual clusters.
2. Next, we look at the smallest distance in the proximity matrix and merge the points with the smallest distance. We then update the proximity matrix.
3. We repeat step 2 until only a single cluster is left.

HOW SHOULD WE CHOOSE THE NUMBER OF CLUSTERS IN HIERARCHICAL CLUSTERING?
To get the number of clusters for hierarchical clustering, we make use of a concept called a dendrogram. A dendrogram is a tree-like diagram that records the sequences of merges or splits. The greater the length of the vertical lines in the dendrogram, the greater the distance between those clusters. The number of clusters is the number of vertical lines intersected by the horizontal line drawn at the chosen threshold.

DIFFERENCE BETWEEN K-MEANS & HIERARCHICAL CLUSTERING
1. Hierarchical clustering cannot handle big data well, but K-Means clustering can. This is because the time complexity of K-Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n^2).
2. In K-Means clustering, since we start with a random choice of clusters, the results produced by running the algorithm multiple times might differ, while results are reproducible in hierarchical clustering.
3. K-Means is found to work well when the shape of the clusters is hyper-spherical (like a circle in 2D or a sphere in 3D).
4. K-Means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into, whereas in hierarchical clustering you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.

CONCLUSION
Hierarchical clustering is a very useful way of performing segmentation. The advantage of not having to pre-define the number of clusters gives it quite an edge over K-Means. However, it does not work well when we have a huge amount of data.

WHY DO WE NEED DBSCAN CLUSTERING?
K-Means and hierarchical clustering both fail to create clusters of arbitrary shapes, and they are not able to form clusters based on varying densities. That is why we need DBSCAN clustering. (A code sketch illustrating this follows the DBSCAN definition below.)

WHAT EXACTLY IS DBSCAN CLUSTERING?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together data points that lie in dense regions and marks points that lie alone in low-density regions as noise. It is controlled by two parameters: epsilon (eps), the radius of the neighborhood around a point, and minPoints (MinPts), the minimum number of points required within that radius to form a dense region.
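To illustrate why DBSCAN is needed, here is a minimal sketch comparing K-Means and DBSCAN on non-convex "two moons" data. scikit-learn and the eps/min_samples values are illustrative assumptions, not tuned recommendations.

```python
# Minimal sketch: K-Means vs DBSCAN on non-convex ("two moons") data.
# scikit-learn and the eps/min_samples values are illustrative assumptions.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # label -1 marks noise

# K-Means assumes roughly spherical clusters and cuts each moon in half,
# while DBSCAN recovers the two arbitrarily shaped dense clusters.
print("K-Means labels found:", sorted(set(km_labels)))
print("DBSCAN labels found:", sorted(set(db_labels)))
```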
PARAMETER SELECTION IN DBSCAN CLUSTERING
DBSCAN is very sensitive to the values of epsilon and minPoints, so it is very important to understand how to select them; a slight variation in these values can significantly change the results produced by the DBSCAN algorithm.

The value of minPoints should be at least one greater than the number of dimensions of the dataset, i.e. minPoints >= dimensions + 1. It does not make sense to take minPoints as 1, because that would result in each point being a separate cluster, so it must be at least 3. Generally, it is set to twice the number of dimensions, but domain knowledge also decides its value.

The value of epsilon can be decided from the K-distance graph: the point of maximum curvature (the elbow) in this graph indicates the value of epsilon. If the value chosen is too small, a higher number of clusters will be created and more data points will be treated as noise; if it is chosen too large, various small clusters will merge into one big cluster and we will lose detail. (A code sketch of this procedure is given below, after the advantages and disadvantages of DBSCAN.)

TYPES OF DATA POINTS IN DBSCAN
In DBSCAN, there are three types of data points:
1. Core point: a point that has at least MinPts points (including itself) within distance eps.
2. Border point: a point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
3. Noise or outlier: a point that is neither a core point nor a border point.

The figure in the slides shows a cluster created by DBSCAN with minPoints = 3. A circle of radius epsilon is drawn around every data point; these two parameters create the spatial clusters. All data points with at least 3 points in their circle, including themselves, are core points (shown in red). All data points with fewer than 3 but more than 1 point in their circle, including themselves, are border points (shown in yellow). Finally, data points with no point other than themselves inside their circle are noise (shown in purple).

DBSCAN CLUSTERING ALGORITHM
1. Find all the neighboring points within eps and identify the core points, i.e. the points with at least MinPts neighbors (including themselves).
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point. Points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise.

REACHABILITY AND CONNECTIVITY
Reachability states whether a data point can be accessed from another data point directly or indirectly, whereas connectivity states whether two data points belong to the same cluster. In terms of reachability and connectivity, two points in DBSCAN can be referred to as:
- Directly density-reachable
- Density-reachable
- Density-connected

DIFFERENCE BETWEEN K-MEANS & DBSCAN CLUSTERING
(The comparison table is shown in the slides.)

ADVANTAGES & DISADVANTAGES OF DBSCAN
Advantages of DBSCAN:
1. It is great at separating clusters of high density from clusters of low density within a given dataset.
2. It is great at handling outliers within the dataset.
Disadvantages of DBSCAN:
1. While DBSCAN is great at separating high-density clusters from low-density clusters, it struggles with clusters of similar density.
2. It struggles with high-dimensional data. DBSCAN handles clusters of many different shapes well, but it can only go so far: given data with too many dimensions, it suffers.
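Here is a minimal sketch of choosing eps from a K-distance graph and then running DBSCAN. scikit-learn, the toy data and the min_samples/eps values are illustrative assumptions, not part of the notes.

```python
# Minimal sketch: picking eps for DBSCAN from a K-distance graph.
# scikit-learn, the toy data and the min_samples/eps values are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
min_samples = 4  # rule of thumb for 2-D data: at least dimensions + 1, often 2 * dimensions

# Sorted distance from each point to its (min_samples - 1)-th nearest neighbor
# (column 0 of the result is the point itself at distance 0).
dists, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.ylabel("k-distance")
plt.title("K-distance graph: choose eps at the point of maximum curvature (elbow)")
plt.show()

eps = 0.5  # placeholder value; read the actual value off the elbow of the plot
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
print("Clusters found (label -1 = noise):", sorted(set(labels)))
```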
EVALUATION METRICS FOR CLUSTERING
Inertia: the sum of the distances of all the points within a cluster from the centroid of that cluster. The lower the inertia value, the better our clusters are (the clusters are more compact).
Dunn index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. The higher the value of the Dunn index, the better the clusters: they are compact and well separated. (A sketch of these metrics follows the conclusion below.)

CONCLUSION
Basically, the DBSCAN algorithm overcomes all the above-mentioned drawbacks of the K-Means algorithm. DBSCAN identifies dense regions by grouping together data points that are close to each other based on a distance measurement.
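To illustrate the evaluation metrics above, here is a minimal sketch computing inertia with scikit-learn and a simple Dunn index. The dunn_index helper is a hypothetical function written for these notes (single-linkage inter-cluster distance divided by the largest cluster diameter), not a library routine.

```python
# Minimal sketch of the evaluation metrics above: inertia and a simple Dunn index.
# scikit-learn is assumed; dunn_index is a hypothetical helper, not a library function.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

def dunn_index(X, labels):
    """Minimum inter-cluster distance divided by the maximum intra-cluster distance."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_intra = max(pairwise_distances(c).max() for c in clusters)  # largest cluster diameter
    min_inter = min(pairwise_distances(a, b).min()                  # closest pair of points
                    for i, a in enumerate(clusters)                 # taken from two
                    for b in clusters[i + 1:])                      # different clusters
    return min_inter / max_intra

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Inertia (lower is better):", km.inertia_)
print("Dunn index (higher is better):", dunn_index(X, km.labels_))
```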