7-Hierarchical-Clustering.pdf


Hierarchical Agglomerative Clustering

One of the earliest clustering methods [Hart75; KaRo90; Sibs73; Snea57]:

1. Initially, every object is a cluster
2. Find the two most similar clusters, and merge them
3. Repeat (2.) until only one cluster remains
4. Plot the tree (“dendrogram”), and choose interesting subtrees

Many variations that differ by:
▶ distance / similarity measure of objects
▶ distance measure of clusters after merging (“linkage”)
▶ optimizations

Distances of Clusters /1

Single-linkage: minimum distance / maximum similarity
Complete-linkage: maximum distance / minimum similarity
Average-linkage (UPGMA): average distance / average similarity
Centroid-linkage: distance of cluster centers (squared Euclidean only)

Distances of Clusters /2

McQuitty (WPGMA): average of the previous sub-clusters.
Defined recursively, e.g., via the Lance-Williams equation: the average distance to the previous two clusters.

Median-linkage (squared Euclidean only): distance from the midpoint.
Defined recursively, e.g., via the Lance-Williams equation: the median is the halfway point of the previous merge.

Ward-linkage (squared Euclidean only): minimum increase of squared error

Mini-Max-linkage: best maximum distance / best minimum similarity

More: medoid linkages [HHLE16; MiKaEn16; Schu21], Hausdorff linkage (maximin) [BBDF08], …

AGNES – Agglomerative Nesting [KaRo90]

AGNES, using the Lance-Williams equations [LaWi67]:

1. compute the pairwise distance matrix D of the objects
2. find the position (i, j) of the minimum distance d(i, j) (resp. the maximum similarity)
3. combine rows and columns i and j into one, using the Lance-Williams update equations with only the stored, known distances d(i, j), d(i, k), d(j, k)¹
4. repeat from (2.) until only one entry remains
5. return the dendrogram tree

¹ Avoid computing the new distances d(i ∪ j, k) directly from the objects, as this is expensive; use the recursive computation from the stored values instead.

AGNES – Agglomerative Nesting [KaRo90] /2

The Lance-Williams update equations have the general form:

d(i ∪ j, k) = α_i · d(i, k) + α_j · d(j, k) + β · d(i, j) + γ · |d(i, k) − d(j, k)|

Several (but not all) linkages can be expressed in this form (for distances), where n_i, n_j, n_k denote the sizes of clusters i, j, k:

Linkage                        | α_i                     | α_j                     | β                   | γ    | input
Single-linkage                 | 1/2                     | 1/2                     | 0                   | −1/2 | any
Complete-linkage               | 1/2                     | 1/2                     | 0                   | +1/2 | any
Average-group-linkage (UPGMA)  | n_i/(n_i+n_j)           | n_j/(n_i+n_j)           | 0                   | 0    | any
Centroid-linkage (UPGMC)       | n_i/(n_i+n_j)           | n_j/(n_i+n_j)           | −n_i·n_j/(n_i+n_j)² | 0    | squared Euclidean
McQuitty (WPGMA)               | 1/2                     | 1/2                     | 0                   | 0    | any
Median-linkage (WPGMC)         | 1/2                     | 1/2                     | −1/4                | 0    | squared Euclidean
Ward                           | (n_i+n_k)/(n_i+n_j+n_k) | (n_j+n_k)/(n_i+n_j+n_k) | −n_k/(n_i+n_j+n_k)  | 0    | squared Euclidean

MiniMax, medoid, and Hausdorff linkages cannot be computed with Lance-Williams updates; instead, the best cluster representative must be found. (A from-scratch sketch of the Lance-Williams loop follows below.)
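To make the loop and the update concrete, here is a minimal from-scratch sketch of an AGNES-style algorithm for single, complete, and average linkage. It is a naive O(n³) illustration, not an optimized or reference implementation, and all function and variable names are our own:

```python
import numpy as np

# Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma) as functions
# of the merged cluster sizes n_i, n_j (n_k is not needed for these three).
LINKAGES = {
    "single":   lambda ni, nj: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda ni, nj: (0.5, 0.5, 0.0, +0.5),
    "average":  lambda ni, nj: (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
}

def agnes(D, linkage="average"):
    """Naive AGNES-style clustering on a precomputed distance matrix D.

    Returns the dendrogram as a merge sequence: a list of tuples
    (cluster_i, cluster_j, distance), where clusters are identified by
    the index of a representative row. O(n^3) time, O(n^2) memory.
    """
    D = D.astype(float).copy()
    n = len(D)
    size = {i: 1 for i in range(n)}   # current cluster sizes
    active = set(range(n))            # rows still representing a cluster
    coeff = LINKAGES[linkage]
    merges = []
    while len(active) > 1:
        # (2.) find the position (i, j) of the minimum distance
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda ij: D[ij])
        d_ij = D[i, j]
        merges.append((i, j, d_ij))
        # (3.) Lance-Williams update: fold cluster j into row/column i,
        # using only the stored distances d(i,j), d(i,k), d(j,k).
        ai, aj, beta, gamma = coeff(size[i], size[j])
        for k in active - {i, j}:
            D[i, k] = D[k, i] = (ai * D[i, k] + aj * D[j, k]
                                 + beta * d_ij
                                 + gamma * abs(D[i, k] - D[j, k]))
        size[i] += size[j]
        active.remove(j)              # (4.) repeat until one entry remains
    return merges
```

For example, calling `agnes(squareform(pdist(X)), "average")` with SciPy's `pdist`/`squareform` on a data matrix X yields the merge sequence; note how the update never touches the original objects, only the stored distances.

In practice, one would use a library implementation instead. As an illustrative sketch (the toy data is an assumption for demonstration), SciPy's scipy.cluster.hierarchy provides these linkages and the dendrogram plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# Toy data: three Gaussian blobs in 2D (illustrative only).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ((0, 0), (3, 0), (0, 3))])

# The same data under different linkage strategies; "ward" and "centroid"
# require Euclidean distances, the others accept any metric.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)   # (n-1) x 4 merge matrix
    plt.figure()
    dendrogram(Z)
    plt.title(f"{method}-linkage dendrogram")
plt.show()
```

Each row of the returned matrix Z encodes one merge (the two cluster ids, the merge distance, and the size of the new cluster), i.e., exactly the dendrogram of step (5.) above.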

Extracting Clusters from a Dendrogram

At this point, we have the dendrogram – but not yet “clusters”. Various strategies have been discussed:
▶ visually inspect the dendrogram, and choose interesting branches
▶ stop when k clusters remain (it may be necessary to ignore noise [SSWG17])
▶ stop at a certain distance (e.g., a maximum cluster diameter with complete-link)
▶ significant change in distance [Moje77]
▶ significance via bootstrap resampling [Shim02; Shim04]
▶ change in cluster sizes or density [CMZS13]
▶ constraints satisfied (semi-supervised) [PMCZ14]: certain objects are labeled as “must” or “should not” be in the same cluster

(Some of these cuts are sketched in code at the end of this section.)

Benefits and Limitations of HAC

Benefits:
▶ very general: any distance / similarity (for text: cosine!)
▶ easy to understand and interpret
▶ hierarchical result
▶ dendrogram visualization often useful (for small data)
▶ the number of clusters does not need to be fixed beforehand
▶ many variants

Limitations:
▶ scalability is the main problem (in particular, the O(n²) memory)
▶ in many cases, users want a flat partitioning
▶ unbalanced cluster sizes (i.e., numbers of points)
▶ outliers / noise for some linkage strategies
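Picking up the extraction strategies from above: a small sketch using SciPy's fcluster for the “stop when k clusters remain” and “stop at a certain distance” cuts, plus a crude largest-gap heuristic in the spirit of the “significant change in distance” idea [Moje77] (not Mojena's exact criterion; the data and threshold values are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ((0, 0), (3, 0), (0, 3))])
Z = linkage(X, method="complete")

# Cut 1: stop when k clusters remain.
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Cut 2: stop at a certain distance, i.e., a maximum cluster
# diameter when using complete-linkage.
labels_d = fcluster(Z, t=1.5, criterion="distance")

# Cut 3: significant change in distance, here as a simple gap
# heuristic: cut where consecutive merge distances (column 2 of Z)
# jump the most.
gaps = np.diff(Z[:, 2])
k_gap = len(Z) - int(np.argmax(gaps))   # clusters left before the largest jump
labels_gap = fcluster(Z, t=k_gap, criterion="maxclust")

print(len(set(labels_k)), len(set(labels_d)), len(set(labels_gap)))
```

All three cuts return a flat partitioning, which addresses one of the limitations above; the hierarchy itself remains available in Z for inspection.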
