K-Means Clustering Algorithm
Summary
This document presents a lecture or presentation on the K-Means clustering algorithm. It details the algorithm's steps, types of clustering, and the concept of distance, along with advantages and disadvantages.
Full Transcript
Let us now consider how the k-Means clustering algorithm works. k-Means is a non-hierarchical clustering algorithm: we specify the number of clusters k and group the data into non-overlapping sets.

All clustering algorithms rely on the concept of distance, both the distance between records and the distance between clusters. While the distance dij between two records i and j can be defined in many ways, it generally has the following properties:

1. Non-negativity: dij ≥ 0
2. Self-proximity: dii = 0 (the distance from a record to itself is zero)
3. Symmetry: dij = dji
4. Triangle inequality: dij ≤ dik + dkj (the distance between any pair of records cannot exceed the sum of their distances to a third record)

Secondly, where are the distances measured from? For two records this is simple: from the records' own coordinates. For a cluster, we normally use the centroid as the point from which distances are calculated.

The most common measure of distance is the Euclidean distance, which we illustrate in the next few slides. Here we are calculating the Euclidean distances between pairs of points. For example, the distance from p1 to p2 is sqrt((0.40 - 0.22)^2 + (0.53 - 0.38)^2) = 0.234. Note that the distance from p1 to p2 is the same as the distance from p2 to p1, and the distance of a point from itself is always 0. Since dii = 0 and dij = dji, we only need to specify the distances as an upper or lower triangular matrix, as shown in the table on the slide.

An important point to note is that the measure computed above is highly influenced by the scale of each variable, so variables with larger scales have a much greater influence over the total distance. It is therefore customary to normalize continuous measurements before computing the Euclidean distance, which converts all measurements to the same scale. Normalizing a measurement means subtracting the average and dividing by the standard deviation (normalized values are also called z-scores).

The choice of distance measure plays a major role in cluster analysis. The main guideline is domain dependent: What exactly is being measured? How are the different measurements related? On what scale should each measurement be treated (numerical, ordinal, or nominal)? Are there outliers? Finally, depending on the goal of the analysis, should the clusters be distinguished mostly by a small set of measurements, or should they be separated by many measurements that each carry moderate weight?

Although Euclidean distance is the most widely used distance, it has three main features to keep in mind:

1. It is scale dependent.
2. It completely ignores any relationship between the measurements, so if the measurements are correlated, other distance measures may be more appropriate.
3. It is sensitive to outliers.
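To make the distance calculations above concrete, here is a minimal sketch in R (the language used for the implementation later in this lecture). The coordinates of p1 and p2 come from the example above; the small income/age data frame is a hypothetical placeholder used only to illustrate normalization and the triangular distance matrix.

    # Euclidean distance between the two example points p1 and p2
    p1 <- c(0.40, 0.53)
    p2 <- c(0.22, 0.38)
    sqrt(sum((p1 - p2)^2))        # 0.2343..., matching the 0.234 above

    # For a data set: normalize (z-score) each variable, then compute all
    # pairwise Euclidean distances. dist() stores only the lower triangle,
    # since dii = 0 and dij = dji. The data frame is made up for illustration.
    x <- data.frame(income = c(25, 47, 110, 62),
                    age    = c(31, 45, 52, 29))
    x_norm <- scale(x)            # subtract each column's mean, divide by its sd
    dist(x_norm, method = "euclidean")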
A non-hierarchical approach to forming good clusters is to pre-specify a desired number of clusters, k, and assign each case to one of the k clusters so as to minimize a measure of dispersion within the clusters. In other words, the goal is to divide the sample into a predetermined number k of non-overlapping clusters so that the clusters are as homogeneous as possible with respect to the measurements used. A common measure of within-cluster dispersion is the sum of distances (or the sum of squared Euclidean distances) of records from their cluster centroid.

The k-means algorithm starts with an initial partition of the records into k clusters. Subsequent steps modify the partition to reduce the sum of the distances of each record from its cluster centroid. The modification consists of allocating each record to the nearest of the k centroids of the previous partition, which leads to a new partition with a smaller sum of distances. The means of the new clusters are then computed, and this improvement step is repeated until the improvement becomes very small.

There are a few key points to note:

1. The choice of the number of clusters can either be driven by external considerations (e.g., previous knowledge, practical constraints), or we can try a few values of k and compare the results.
2. After choosing k, the n records are partitioned into initial clusters. If there is external reasoning for the initial assignment, we can apply it; otherwise, the assignment is done randomly. In that case, the algorithm can be rerun with different randomly generated starting partitions to reduce the chance of the heuristic producing a poor solution.

The slides illustrate a starting solution for the k-Means algorithm in which the assignment is done at random. In subsequent steps, each record is assigned to the centroid to which it is closest, and the centroids are then recalculated. In each iteration, points are assigned or reassigned to their nearest centroid, which may shift both the centroids and the assignment of points. This is illustrated in the next few slides, where the centroids are recomputed and the data points reassigned to the nearest centroid. Eventually, the algorithm converges when no further improvement is possible. The whole procedure can then be repeated with a new starting assignment of points to clusters.

While the k-Means clustering problem can be solved exactly by modeling it as an integer programming optimization problem, the heuristic presented here is easy to implement and computationally efficient, especially for small k. The major difficulty is deciding on the choice of k, though there are tools to help with that.

The k-means algorithm is stochastic in nature: if we ran the algorithm on the same data again, we might get a different result. This is because the initial assignment of points to clusters is usually random, and if that assignment changes, the resulting solution can change as well. As such, we are not guaranteed to obtain (or to know whether we have obtained) the best possible solution. The algorithm is also sensitive to outliers and to the scale of the data. Having domain knowledge is important for the application of any machine learning method, and that applies to the k-means algorithm too.

This concludes our initial discussion of the k-means algorithm. We next examine the implementation of the algorithm in R.
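As a preview of that R implementation, the following is a minimal sketch built around base R's kmeans() function; the data set and the choice of k = 3 are illustrative assumptions rather than values taken from the lecture.

    # Sketch of the k-means heuristic using base R's kmeans() function
    set.seed(42)                        # the heuristic is stochastic; fix the seed for reproducibility
    x <- scale(iris[, 1:4])             # normalize so that no variable dominates the distance
    km <- kmeans(x, centers = 3, nstart = 25)  # rerun from 25 random starting partitions
    km$centers                          # final cluster centroids
    km$tot.withinss                     # total within-cluster sum of squared distances
    table(km$cluster)                   # number of records assigned to each cluster

Because the starting partition is random, the nstart argument reruns the heuristic from several random starts and keeps the partition with the smallest total within-cluster sum of squares, which mitigates the stochasticity discussed above.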