
Summary

This document gives an overview of k-means clustering: the optimization problem it solves, the standard algorithm and its computational complexity, and the method's practical properties and limitations.

Full Transcript


k-means Clustering
The k-means problem:
▶ Divide the data into $k$ subsets ($k$ is a parameter)
▶ Subsets are represented by their arithmetic mean
▶ Optimize the least squared error
Important properties:
▶ Squared errors put more weight on larger deviations
▶ The arithmetic mean is the maximum likelihood estimator of centrality
▶ The data is split into Voronoi cells
▶ Non-convex problem
➜ k-means is a good choice if we have $k$ signals and normally distributed measurement error
Suggested reading: the history of least squares estimation (Legendre, Gauss): https://en.wikipedia.org/wiki/Least_squares#History

The Sum of Squares Objective
The sum-of-squares objective: $SSQ := \sum_{i=1}^{k} \sum_{x \in C_i} \sum_{d} (x_d - \mu_{i,d})^2$
For every cluster $C_i$ and dimension $d$, the arithmetic mean $\mu_{i,d}$ is the value $m$ that minimizes $\sum_{x \in C_i} (x_d - m)^2$.
Assigning every point to its least-squares closest cluster usually¹ reduces $SSQ$, too.
Note: the sum of squares over the dimensions is exactly the squared Euclidean distance: $\sum_d (x_d - \mu_{i,d})^2 = \lVert x - \mu_i \rVert_2^2$.
We can therefore say that every point is assigned to the "closest" cluster, but we cannot use arbitrary other distances in k-means (because the arithmetic mean only minimizes the sum of squared deviations).
¹ This is not always optimal: the change in mean can increase the SSQ of the new cluster. But this difference is commonly ignored in algorithms and textbooks; see [HaWo79].

Steiner Translation Theorem / König-Huygens Theorem
An important theorem, attributed to König, Huygens, and Steiner (independently): for any $t$ and the arithmetic mean $\mu$ of $n$ values,
$\sum_{x} (x - t)^2 = \sum_{x} (x - \mu)^2 + n\,(\mu - t)^2$
Proof: expand $(x - t) = (x - \mu) + (\mu - t)$; the cross term vanishes because $\sum_x (x - \mu) = 0$.
Note: for $t = 0$ we obtain $\sum_x x^2 = \sum_x (x - \mu)^2 + n\mu^2$, i.e., $\sum_x (x - \mu)^2 = \sum_x x^2 - n\mu^2$.
Note: avoid this subtraction in computations – it is numerically unstable – but in proofs it is great!

Pairwise Sum of Squared Deviations
First observation (for the arithmetic mean $\mu$): $\sum_{x} (x - \mu)^2 = \sum_x x^2 - n\mu^2$.
Do not confuse this with the binomial expansion – it only holds for the arithmetic mean $\mu$!
Now consider the pairwise squared deviations in a set: $\sum_{x} \sum_{y} (x - y)^2 = 2n \sum_x (x - \mu)^2$.
➜ Minimizing squared deviations from the mean is equivalent to minimizing pairwise squared deviations.

The Standard Algorithm (Lloyd's Algorithm)
The standard algorithm for k-means [Forg65; Lloy82; Stei56]:
1. Choose $k$ initial centers (e.g., at random).
2. Assign every point to its (squared-Euclidean) closest center.
3. Recompute every center as the arithmetic mean of the points assigned to it.
4. Repeat steps 2–3 until no assignment changes.
(A short code sketch of this standard algorithm follows at the end of this passage.)
This is not the most efficient algorithm (despite everybody teaching this variant).
ELKI [SKEZ15] contains >12 variants (some benchmarks in [KrScZi16]). There is little good reason to still use the standard algorithm in practice!
The name k-means was first used by MacQueen for a slightly different algorithm [MacQ67]; k-means was invented several times and has an interesting history [Bock07].

Non-determinism & Non-optimality
Almost all k-means algorithms
▶ do not guarantee to find the global optimum (that would be NP-hard – too expensive)
▶ find different local optima,² depending on the starting point (because the problem is non-convex, there may be local optima that are not global)
In practical use:
▶ the data is never exact, or complete
▶ the "optimum"² result is not necessarily the most useful
➜ usually, we gain little by finding the true optimum
➜ it is usually good enough to try multiple random initializations and keep the "best"³
² More precisely: static points. The standard algorithm may fail to even find a local minimum [HaWo79].
³ Least squares, i.e., lowest $SSQ$ – this does not mean it will actually give the most insight.

Complexity of k-means Clustering
In the standard algorithm:
1. initialization is usually cheap, from $O(k)$ (random) to $O(n \cdot k \cdot d)$ (e.g., farthest points, k-means++)
2. reassignment is $O(n \cdot k \cdot d)$
3. mean computation is $O(n \cdot d)$
4. the number of iterations $i$ is superpolynomial in the worst case [ArVa06] (but fortunately, it is usually small)
5. total: $O(n \cdot k \cdot d \cdot i)$
The worst case is superpolynomial, but in practice the method usually runs much better than that. We can force a limit on the number of iterations, usually with little loss in quality. In practice, this is often the fastest clustering algorithm we use.
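To make the standard algorithm above concrete, here is a minimal NumPy sketch of Lloyd's algorithm. The function name `lloyd_kmeans`, the random initialization, and the empty-cluster handling are our own illustrative choices, not prescribed by the slides:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=None):
    """Minimal sketch of the standard (Lloyd) k-means algorithm."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 1. Initialize: pick k distinct data points as initial means (cheap, random).
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2. Reassign each point to its squared-Euclidean closest mean: O(n*k*d).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break  # fix point reached: no point changed its cluster
        assignment = new_assignment
        # 3. Recompute each mean as the arithmetic mean of its points: O(n*d).
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:  # keep the old mean if a cluster runs empty
                means[j] = members.mean(axis=0)
    ssq = ((X - means[assignment]) ** 2).sum()
    return means, assignment, ssq
```

Calling this several times with different seeds and keeping the result with the lowest returned SSQ corresponds to the "multiple random initializations" strategy mentioned above.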
Improved algorithms primarily reduce the number of distance computations needed for reassignment, but usually without theoretical guarantees (so the worst case remains $O(n \cdot k \cdot d)$ per iteration).

k-means for Text Clustering
k-means cannot be used with arbitrary distances, but only with Bregman divergences [BMDG05]. Cosine similarity, however, is closely connected to squared Euclidean distance.
Spherical k-means [DhMo01]:
▶ all input data is normalized to unit length, $\lVert x \rVert = 1$
▶ at each iteration, the new centers are normalized to unit length as well
▶ the normalized mean maximizes the average cosine similarity within the cluster [DhMo01]
▶ sparse nearest-centroid computations take time proportional to the number of non-zero values
▶ the result is similar to an SVD factorization of the document-term matrix [DhMo01]
▶ we can accelerate this using stored bounds to avoid similarity computations [ScLaFe21]

k-means Can Be Arbitrarily Poor
The solution found by the standard k-means algorithm can be arbitrarily poor. Consider a one-dimensional data set of four points $a < b < c < d$, where the gap within $\{a, b\}$ is much smaller than the gap within $\{c, d\}$, and let $k = 3$. The optimum solution clusters $\{a, b\}$ together, but with bad starting conditions we keep $\{c, d\}$ together. The cost of the optimum depends only on the small gap, while the cost of the bad solution depends on the large gap, so the ratio of the two can be made arbitrarily bad.
➜ In the worst case, a k-means solution can be arbitrarily worse than the best solution.
k-means++ can pick this bad solution, but it is unlikely to do so multiple times.

Nearest Center Assignment is Not Optimal [HaWo79; TeVa10]
The standard algorithm converges to a local fix point, at which no point is reassigned. Is this a local optimum? Not guaranteed! If we reassign a point, the means of the clusters change, and other points may then be farther away.
Consider moving a point $x$ from cluster $A$ to cluster $B$ (and let $SSQ_A$ denote the sum of squares of cluster $A$):
$SSQ_{A \setminus \{x\}} = SSQ_A - \tfrac{|A|}{|A|-1} \lVert x - \mu_A \rVert^2$
$SSQ_{B \cup \{x\}} = SSQ_B + \tfrac{|B|}{|B|+1} \lVert x - \mu_B \rVert^2$
It can therefore be better (i.e., decrease the total SSQ) to not assign points to the "nearest" cluster: the move pays off whenever $\tfrac{|A|}{|A|-1} \lVert x - \mu_A \rVert^2 > \tfrac{|B|}{|B|+1} \lVert x - \mu_B \rVert^2$ (c.f. Ward linkage).

Example: Nearest Center Assignment is Not Optimal
Initially, let the point be assigned to the first cluster, which yields a sum of squares of 9+4+4+9+0 = 26. Assigning the point to the other cluster instead yields a sum of squares of 4+5+5+4+4 = 22.
➜ The standard k-means algorithm can get stuck in sub-optimal fix points. The algorithm of Hartigan and Wong avoids these situations [HaWo79; TeVa10], but the quality difference is usually negligible (as only single points are assigned differently).

Benefits and Drawbacks of k-means
Benefits:
▶ very fast algorithm ($O(n \cdot k \cdot d \cdot i)$, if we limit the number of iterations)
▶ convenient centroid vector for every cluster (we can analyze this vector to get a "topic" when used with bag-of-words data)
▶ can be run multiple times to get different results
Limitations (two of these are addressed in the sketch below):
▶ difficult to choose the number of clusters, $k$
▶ cannot be used with arbitrary distances
▶ sensitive to scaling – requires careful preprocessing
▶ does not produce the same result every time
▶ clever initialization helps to find better results
▶ sensitive to outliers (squared errors emphasize outliers)
▶ cluster sizes can be quite unbalanced (e.g., one-element outlier clusters)
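Two of the limitations just listed – sensitivity to feature scaling and dependence on the random initialization – are commonly mitigated by standardizing the features and by keeping the best of several random restarts. The following is a minimal sketch using scikit-learn; the toy data and all parameter values are our own choices for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: two groups whose second feature has a much larger scale than the first.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(100, 2)),
    rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(100, 2)),
])

# Standardize so no single feature dominates the squared error,
# then try several random initializations and keep the best (lowest SSQ).
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=2, n_init=20, random_state=0).fit(X_scaled)
print("lowest SSQ over 20 runs:", km.inertia_)
print("cluster sizes:", np.bincount(km.labels_))
```

Here `n_init=20` makes scikit-learn run 20 random initializations internally and return the run with the lowest `inertia_`, i.e., the lowest SSQ.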
Choosing the "Optimum" k for k-means
A key challenge of k-means is choosing $k$:
▶ trivial to prove: the optimal $SSQ$ can only decrease as $k$ grows, and for $k = n$ we reach $SSQ = 0$
➜ avoid comparing $SSQ$ for different $k$ or different data (including normalization)
▶ is $k = n$ with $SSQ = 0$ the "perfect" solution? No: it is useless
▶ $SSQ$ as a function of $k$ may exhibit an "elbow" or "knee": initially it improves fast, then much more slowly
▶ use alternative criteria such as Silhouette [Rous87], AIC [Akai77], BIC [Schw78; ZhXuFr08] (see the sketch below)
➜ computing the silhouette is $O(n^2)$ – more expensive than k-means itself
➜ AIC and BIC try to reduce overfitting by penalizing model complexity (= high $k$); more details will come in the evaluation section
▶ nevertheless, these measures are heuristics – others can be better in practice!
▶ methods such as X-means [PeMo00] and G-means [HaEl03] split clusters as long as a quality criterion such as AIC [Akai77] or BIC [Schw78; ZhXuFr08] improves

Limitations of k-means
k-means has problems with clusters of differing sizes or densities, with non-spherical shapes, and with outliers.
[Figure: example of problematic cluster shapes]

From k-means to Gaussian EM Clustering
k-means cannot handle clusters with different "radius" well.
[Figure: toy "mouse" data set and the best k-means results]
➜ Could we estimate both mean and radius?
➜ Model the data with multivariate Gaussian distributions.
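As a concrete illustration of the heuristics for choosing $k$ discussed above (not of the Gaussian EM approach, which comes later), the following sketch uses scikit-learn on synthetic blob data of our own choosing and prints the SSQ and the silhouette coefficient for a range of $k$ values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known number of blobs, purely for illustration.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

# SSQ (inertia_) always decreases with k, so we look for an "elbow" and
# additionally compare the (O(n^2)) silhouette coefficient.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  SSQ={km.inertia_:9.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```

The SSQ column can only decrease, which illustrates why it must not be compared directly across different $k$, while the silhouette typically peaks near a reasonable choice of $k$.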
