10-Evaluation-Clustering.pdf


Evaluation of Clusterings

We can distinguish between four different kinds of evaluation:
▶ intrinsic evaluation (usually based on distance / deviation): statistics such as SSQ, Silhouette, Davies-Bouldin, … used as heuristics for choosing model parameters such as $k$
▶ extrinsic evaluation based on labels: indexes such as Adjusted Rand Index, Normalized Mutual Information, Purity, … used as similarity to a reference solution
▶ indirect evaluation based on the performance of some other (usually supervised) algorithm in a later stage, e.g., how much does classification improve if we use the clustering as a feature
▶ expert evaluation: manual evaluation by a domain expert

Evaluation Problems

Several evaluation problems exist in clustering, and there is no universal solution:
▶ distance-based evaluation is sensitive to data preprocessing (scaling etc.)
▶ noise handling: how do we evaluate clusterings that do not label all points?
▶ overlap problem: how do we evaluate overlapping clusters?
▶ hierarchy problem: how do we evaluate hierarchical clusters?
▶ matching problem: which cluster in solution A is which cluster in solution B?
Because this is an unsupervised problem, there is no "objective" evaluation. If we knew what the user needs, we would not need to use an unsupervised method.

Intrinsic Evaluation of Clusterings

Recall the basic idea of distance-based clustering algorithms:
▶ items in the same cluster should be more similar
▶ items in other clusters should be less similar
➜ compare the distances within the same cluster to the distances to other clusters
Some simple, non-standard approaches (for $N$ points in $d$ dimensions) measure only the homogeneity within each cluster, e.g., via the within-cluster sum of squares. But how can we combine this with the distance to other clusters?

R²: Coefficient of Determination

The "total sum of squares" ($TSS$) of a data set with overall mean $\mu$ is:
$$TSS := \sum_{i=1}^{N} \lVert x_i - \mu \rVert^2$$
This can be decomposed into:
$$\underbrace{TSS}_{\text{Total Sum of Squares}} = \underbrace{\sum_{C} \sum_{x_i \in C} \lVert x_i - \mu_C \rVert^2}_{\text{Within-Cluster Sum of Squares (WCSS)}} + \underbrace{\sum_{C} |C| \cdot \lVert \mu_C - \mu \rVert^2}_{\text{Between-Cluster Sum of Squares (BCSS)}}$$
➜ Because the total sum of squares is constant, minimizing $WCSS$ ⇔ maximizing $BCSS$.
Explained variance: $R^2 := BCSS / TSS = 1 - WCSS / TSS$

Sum-of-Squares Decomposition, k Hypothetical Clusters

Recall the theorem of König, Huygens, and Steiner, used in Ward and $k$-means. Assume all clusters have the same size, the same variance, and sufficient distance from each other. Already in this very simple hypothetical scenario, the expected $WCSS$ decreases sharply with increasing $k$, and the expected $R^2$ rises accordingly.

The (Poor) Elbow Method [Schu23]

The "elbow" method is fairly commonly discussed, but it is easy to misuse! As we just saw, we must expect an exponential-like drop-off of the remaining $WCSS$ with increasing $k$, even without any corresponding cluster structure.
[Figure: $R^2$ over $k$ – the expectation from the simple model, from random partitions, and for $k$-means on uniform data.]
There is no "elbow" here, these are meaningless partitions – a real elbow needs to be much more pronounced than this baseline!
"Stop using the Elbow method" – it is not reliable [Schu23]

Variance-Ratio-Criterion [CaHa74]

The Calinski-Harabasz Variance-Ratio-Criterion (VRC) is defined as:
$$VRC := \frac{BCSS / (k - 1)}{WCSS / (N - k)}$$
Increasing $k$ increases $BCSS$ and decreases $WCSS$; here, both terms get a penalty for increasing $k$!
Connected to statistics: the one-way analysis of variance (ANOVA) test ("F-test"). Null hypothesis: all samples are drawn from distributions with the same mean. A large F-value indicates that the difference in means is significant.
Beware: you must not use the same data to cluster and to do this test! A proper test scenario is, e.g., to test whether a categorical attribute exhibits a significant difference in the means of a numerical attribute that is only used in the test (not in the clustering).
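To make the decomposition concrete, here is a minimal NumPy sketch of $R^2$ and the VRC as defined above (my own illustration, not code from the lecture; the function names are hypothetical):

```python
import numpy as np

def sums_of_squares(X, labels):
    """Decompose TSS into within- and between-cluster sums of squares."""
    mu = X.mean(axis=0)                                 # overall mean
    tss = ((X - mu) ** 2).sum()                         # total sum of squares
    wcss = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))              # within-cluster part
    return tss, wcss, tss - wcss                        # TSS = WCSS + BCSS

def r2(X, labels):
    tss, wcss, bcss = sums_of_squares(X, labels)
    return bcss / tss                                   # explained variance

def vrc(X, labels):
    """Calinski-Harabasz Variance-Ratio-Criterion."""
    n, k = len(X), len(np.unique(labels))
    tss, wcss, bcss = sums_of_squares(X, labels)
    return (bcss / (k - 1)) / (wcss / (n - k))
```

For real use, scikit-learn ships the same criterion as sklearn.metrics.calinski_harabasz_score(X, labels).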
Silhouette [Rous87]

Define the mean distance of $x_i$ to its own cluster, and to the closest other cluster:
$$a(i) := \frac{1}{|C_i| - 1} \sum_{x_j \in C_i,\, j \neq i} d(x_i, x_j) \qquad\qquad b(i) := \min_{C \neq C_i} \frac{1}{|C|} \sum_{x_j \in C} d(x_i, x_j)$$
The silhouette width of a point (can be used for plotting) then is:
$$s(i) := \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
The average Silhouette of a clustering is the mean of $s(i)$ over all points.
Silhouette of a point: $s(i) \approx 1$ if $a(i) \ll b(i)$ (well-assigned point), $s(i) \approx 0$ if $a(i) \approx b(i)$ (point is between the two clusters), and $s(i) \approx -1$ if $a(i) \gg b(i)$ (poorly assigned point). As a rule of thumb, an average Silhouette up to about 0.5 is considered "weak", above 0.5 is "reasonable", above 0.7 is "strong".

Silhouette [Rous87] /2

Challenges with using the Silhouette criterion (a code sketch follows this list):
▶ The complexity of Silhouette is $O(N^2)$. ⇒ Silhouette does not scale to large data sets.
▶ Simplified Silhouette / Medoid Silhouette: use the distances to the cluster centers instead of the average distances, which reduces the cost to $O(N \cdot k)$.
▶ When a cluster has a single point, $a(i)$ is not defined; Rousseeuw [Rous87] suggests to then use $s(i) = 0$ (compare the definition of $s(i)$).
▶ Which distance should we use, e.g., with $k$-means – Euclidean, or squared Euclidean?
▶ In high-dimensional data, $s(i) \to 0$, because $a(i) \approx b(i)$ due to the curse of dimensionality.
▶ Silhouette assumes clusters of similar diameter – it does not work well with wide clusters close to compact clusters.
▶ Silhouette can be directly optimized with a $k$-medoids-like procedure, but at a cost of $O(N^2)$ per iteration [VaPoBr03].
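A naive $O(N^2)$ implementation follows directly from the definitions above. This is a sketch of my own (the function name is hypothetical), assuming Euclidean distances and at least two clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_widths(X, labels):
    """Per-point silhouette widths s(i); O(N^2) time and memory."""
    D = cdist(X, X)                          # full pairwise Euclidean distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        a, b = None, np.inf
        for c in np.unique(labels):
            mask = labels == c
            if c == labels[i]:
                if mask.sum() > 1:           # mean distance to own cluster, excluding i
                    a = D[i, mask].sum() / (mask.sum() - 1)
            else:                            # track the closest other cluster
                b = min(b, D[i, mask].mean())
        # Rousseeuw's convention: s(i) = 0 for singleton clusters
        s[i] = 0.0 if a is None else (b - a) / max(a, b)
    return s                                 # average with s.mean()
```

scikit-learn provides the same semantics as sklearn.metrics.silhouette_samples (per point) and sklearn.metrics.silhouette_score (the average).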
Davies-Bouldin Index [DaBo79]

Let the scatter of a cluster $C_i$ with center $\mu_i$ be (recall $L_p$-norms):
$$S_i := \Big( \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - \mu_i \rVert^p \Big)^{1/p}$$
Let the separation of clusters be:
$$M_{ij} := \lVert \mu_i - \mu_j \rVert_p$$
The similarity of two clusters then is defined as:
$$R_{ij} := \frac{S_i + S_j}{M_{ij}}$$
Clustering quality is the average maximum similarity:
$$DB := \frac{1}{k} \sum_{i} \max_{j \neq i} R_{ij}$$
A small Davies-Bouldin index is better, i.e., scatter ≪ separation distance.
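The Davies-Bouldin index is equally easy to compute. A small sketch with $p = 2$ (Euclidean), again my own illustration with hypothetical names:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index with p = 2; smaller is better."""
    clusters = np.unique(labels)
    mus = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # scatter S_i: mean distance of a cluster's points to its centroid
    S = np.array([np.linalg.norm(X[labels == c] - mu, axis=1).mean()
                  for c, mu in zip(clusters, mus)])
    db = 0.0
    for i in range(len(clusters)):
        M = np.linalg.norm(mus - mus[i], axis=1)   # separations M_ij
        M[i] = np.inf                              # exclude j = i
        db += ((S + S[i]) / M).max()               # max_j R_ij
    return db / len(clusters)
```

scikit-learn provides this as sklearn.metrics.davies_bouldin_score(X, labels).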
Other Internal Clustering Indexes

There have been many more indexes proposed over time:
▶ Dunn index: minimum between-cluster distance / maximum cluster diameter [Dunn73; Dunn74]
▶ Gamma and Tau: compare within-cluster distances to between-cluster distances [BaHu75]
▶ C-Index: sum of within-cluster distances, normalized by the sum of the same number of smallest distances [HuLe76]
▶ Xie-Beni index for fuzzy clustering [XiBe91]
▶ Gap statistic: compare to results on generated random data [TiWaHa01]
▶ I-Index: maximum distance between cluster centers vs. sum of distances to the cluster center [MaBa02]
▶ S_Dbw: separation and density based [HaBaVa02; HaVa01]
▶ PBM Index: distances to the cluster center vs. distances to the total center [PaBaMa04]
▶ DBCV: density-based cluster validation [MJCZ14]
▶ …
There is no clearly "superior" measure, but Silhouette often performs among the best [AGMP13; BSHL07].

Extrinsic Cluster Evaluation

Extrinsic evaluation measures assume we know the "true" clustering. In the following, every point has a cluster and a true class; the "raw data" (e.g., vectors) is not used by these measures. In the literature, this approach is popular for comparing the potential of algorithms: often, classification data is used, and it is assumed that good clusters = the classes. On real data this will often not be possible: no labels are available. And there may be more than one meaningful clustering of the same data! Sometimes we can at least label some of the data, or treat some properties as potential labels, and then choose the clustering that has the best agreement on the labeled part of the data.

Supervised Cluster Evaluation /2

The matching problem:
▶ clusters are usually only enumerated
▶ true classes are usually labeled with meaningful names
▶ which cluster is which class?
➜ clustering is not classification, we cannot evaluate it the same way…
▶ what if there are more clusters than classes?
▶ what if a cluster contains two classes?
▶ what if a class contains two clusters?
To overcome this:
▶ choose the best matching with, e.g., the Hungarian algorithm (uncommon)
▶ compare every cluster to every class

Purity, Precision and Recall

A simple measure popular in text clustering (but not much in "other" clustering):
$$purity(C) := \max_j \frac{|C \cap K_j|}{|C|} \qquad\qquad purity := \sum_{C} \frac{|C|}{N}\, purity(C)$$
⇒ A cluster which only contains elements of class $K_j$ has purity 1
➜ similar to "precision" in classification and information retrieval.
But: if every document is its own cluster, every cluster has purity 1 – is this really optimal? We could also define a "recall" equivalent, $\max_j |C \cap K_j| / |K_j|$. But this is even more misleading: if we put all objects into one cluster, we get recall 1!

Pair-counting Evaluation

Many cluster evaluation measures are based on pair-counting. If (and only if) objects $x_i$ and $x_j$ are in the same cluster, then $(x_i, x_j)$ is a pair. This gives us a binary, classification-like problem:

                     class: pair            class: no pair
cluster: pair        true positive (a)      false positive (b)
cluster: no pair     false negative (c)     true negative (d)

These counts can be computed using binomial coefficients over the cluster-class cross-tabulation. Objects are a "pair" iff they are related ("should be together") ➜ we are predicting which objects are related (pairs), and which are not.

Pair-counting Evaluation – Example

Recall the binomial coefficient: $\binom{n}{2} = \frac{n(n-1)}{2}$ ⇝ the number of pairs among $n$ objects.
Clusters (rows) and classes (columns), $N = 15$:

Cross-tab:          Pairs $\binom{n}{2}$:     Aggregated:
 2  4  1 |  7        1   6   0 |  21          a = 1+6+0+0+3+6 = 16
 1  3  4 |  8        0   3   6 |  28          b = (21+28) − 16 = 33
─────────┼───       ───────────┼────          c = (3+21+10) − 16 = 18
 3  7  5 | 15        3  21  10 | 105          d = 105 − 16 − 33 − 18 = 38

Note: we can think of this as predicting whether two objects are in the "same" or in "different" clusters. But our predictions must be consistent: if we predict $x$ and $y$ to be in the same cluster, and $y$ and $z$ to be in the same cluster, we cannot predict $x$ and $z$ to be in different clusters.

Pair-counting Cluster Evaluation Measures

From the counts $a$ (pair in both), $b$ (pair in the clustering only), $c$ (pair in the classes only), and $d$ (pair in neither), we obtain classification-like quality measures, e.g., the Rand index (pair-counting accuracy) $\frac{a+d}{a+b+c+d}$, precision $\frac{a}{a+b}$, recall $\frac{a}{a+c}$, and the Jaccard index $\frac{a}{a+b+c}$.
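The pair counts follow directly from the cross-tabulation via binomial coefficients. A short sketch of my own (hypothetical names) that reproduces the aggregated counts of the example above:

```python
import numpy as np
from math import comb

def pair_counts(contingency):
    """Pair counts (a, b, c, d) from a clusters x classes contingency table."""
    pairs = lambda v: sum(comb(int(x), 2) for x in np.ravel(v))
    a = pairs(contingency)                       # same cluster, same class
    b = pairs(contingency.sum(axis=1)) - a       # same cluster, different class
    c = pairs(contingency.sum(axis=0)) - a       # different cluster, same class
    d = comb(int(contingency.sum()), 2) - a - b - c
    return a, b, c, d

ct = np.array([[2, 4, 1],                        # the cross-tab from the example
               [1, 3, 4]])
a, b, c, d = pair_counts(ct)                     # -> (16, 33, 18, 38)
rand_index = (a + d) / (a + b + c + d)           # (16 + 38) / 105 ≈ 0.514
```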
Adjusted Rand Index [HuAr85]

The Rand index (accuracy) is not 0 for random results ⇝ hard to interpret. We can perform an adjustment for chance: with cross-tab cells $n_{ij}$, row sums $a_i$, and column sums $b_j$, a uniform random permutation of the labels yields the expectation $\big[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\big] / \binom{N}{2}$, and hence the standard version of the ARI [HuAr85]:
$$ARI := \frac{\sum_{ij} \binom{n_{ij}}{2} - \Big[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\Big] / \binom{N}{2}}{\frac{1}{2}\Big[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big] - \Big[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\Big] / \binom{N}{2}}$$

Entropy and Mutual Information

Entropy, Joint Entropy, and Mutual Information (with $p_i$, $q_j$, and $p_{ij}$ the relative frequencies of cluster $i$, class $j$, and their intersection):
Entropy: $H(X) := -\sum_i p_i \log p_i$
Joint Entropy: $H(X, Y) := -\sum_{ij} p_{ij} \log p_{ij}$
Conditional Entropy: $H(X \mid Y) := H(X, Y) - H(Y)$
Mutual Information: $I(X; Y) := H(X) + H(Y) - H(X, Y)$
Observe: $0 \leq I(X; Y) \leq \min\{H(X), H(Y)\}$

Normalized Mutual Information [StGh02; ViEpBa10; Yao03]

Normalized Mutual Information (NMI):
$$NMI(X, Y) := \frac{I(X; Y)}{\sqrt{H(X) \cdot H(Y)}}$$
But: NMI has a tendency to prefer solutions that split clusters too much.

Variation of Information [Meil03; Meil05; Meil12; ViEpBa10]

The Variation of Information is an information-theoretic alternative to NMI:
$$VI(X, Y) := H(X \mid Y) + H(Y \mid X) = H(X) + H(Y) - 2\, I(X; Y)$$
This is a metric, it satisfies the triangle inequality. Similarly, we can obtain a metric for the Normalized Variation of Information:
$$NVI(X, Y) := 1 - \frac{I(X; Y)}{H(X, Y)}$$
Not all NMI variants yield a metric, and unfortunately, the adjusted-for-chance versions are not metrics.
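Both the ARI and NMI can be computed from the same contingency table. A compact sketch (my own, hypothetical names), using the natural logarithm for the entropies:

```python
import numpy as np
from math import comb

def ari(contingency):
    """Adjusted Rand Index, directly from the contingency table."""
    sum_ij = sum(comb(int(x), 2) for x in contingency.ravel())
    sum_a = sum(comb(int(x), 2) for x in contingency.sum(axis=1))
    sum_b = sum(comb(int(x), 2) for x in contingency.sum(axis=0))
    expected = sum_a * sum_b / comb(int(contingency.sum()), 2)  # chance term
    return (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)

def nmi(contingency):
    """NMI with the sqrt(H(X) * H(Y)) normalization."""
    p = contingency / contingency.sum()
    entropy = lambda q: -sum(v * np.log(v) for v in np.ravel(q) if v > 0)
    hx, hy = entropy(p.sum(axis=1)), entropy(p.sum(axis=0))
    mi = hx + hy - entropy(p)            # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return mi / np.sqrt(hx * hy)
```

With label vectors instead of a table, sklearn.metrics.adjusted_rand_score and sklearn.metrics.normalized_mutual_info_score compute the same measures (the NMI normalization is selectable via its average_method parameter).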
Conclusions

Clustering text data is hard because:
▶ preprocessing (TF-IDF etc.) has a major influence on the result (but we do not know which preprocessing is "correct" or "best")
▶ text is high-dimensional, and our intuitions of "distance" and "density" do not work well
▶ text is sparse, and many clustering methods assume dense, Gaussian data
▶ text is noisy, and many documents may not be part of a cluster at all
▶ some cases can be handled with biclustering or frequent itemset mining
▶ the (proper) evaluation of clustering is very difficult:
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." — Jain and Dubes [JaDu88]
➜ methods designed for text – i.e., topic modeling – are usually preferable