7 – Hierarchical Clustering
Hierarchical Agglomerative Clustering

One of the earliest clustering methods [Hart75; KaRo90; Sibs73; Snea57]:

1. Initially, every object is a cluster.
2. Find the two most similar clusters, and merge them.
3. Repeat (2) until only one cluster remains.
4. Plot the tree (“dendrogram”), and choose interesting subtrees.

Many variations that differ by:

▶ the distance / similarity measure of objects
▶ the distance measure of clusters after merging (“linkage”)
▶ optimizations

Distances of Clusters /1

Single-linkage: minimum distance, i.e., maximum similarity
$d(A,B) = \min_{a \in A, b \in B} d(a,b)$

Complete-linkage: maximum distance, i.e., minimum similarity
$d(A,B) = \max_{a \in A, b \in B} d(a,b)$

Average-linkage (UPGMA): average distance, i.e., average similarity
$d(A,B) = \frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} d(a,b)$

Centroid-linkage (squared Euclidean only): distance of the cluster centers
$d(A,B) = \lVert \mu_A - \mu_B \rVert^2$

(A small numeric example of these definitions is given at the end.)

Distances of Clusters /2

McQuitty (WPGMA): average of the previous sub-clusters. Defined recursively, e.g., via the Lance-Williams equation: the average of the distances to the two clusters of the previous merge.

Median-linkage (WPGMC; squared Euclidean only): distance from the midpoint. Defined recursively, e.g., via the Lance-Williams equation; the “median” is the halfway point of the previous merge.

Ward-linkage (squared Euclidean only): minimum increase of the squared error.

Mini-Max-linkage: best maximum distance, i.e., best minimum similarity.

More: medoid linkages [HHLE16; MiKaEn16; Schu21], Hausdorff linkage (maximin) [BBDF08], …

AGNES – Agglomerative Nesting [KaRo90]

AGNES, using the Lance-Williams equations [LaWi67]:

1. Compute the pairwise distance matrix $D$ of the objects.
2. Find the position $(i,j)$ of the minimum distance (resp. maximum similarity) in $D$.
3. Combine rows and columns $i$ and $j$ into one, using the Lance-Williams update equations with only the stored, known distances $d(i,k)$, $d(j,k)$, and $d(i,j)$.¹
4. Repeat from (2) until only one entry remains.
5. Return the dendrogram tree.

(A minimal implementation sketch is given at the end.)

¹ Avoid computing the distance of the merged cluster to the other clusters directly from the objects, as this is expensive (e.g., $O(|A| \cdot |B|)$ pairwise distances for the average-linkage distance of clusters $A$ and $B$); use the recursive computation from the stored values instead.

AGNES – Agglomerative Nesting [KaRo90] /2

The Lance-Williams update equations have the general form:

$d(A \cup B, C) = \alpha_1 \, d(A,C) + \alpha_2 \, d(B,C) + \beta \, d(A,B) + \gamma \, \lvert d(A,C) - d(B,C) \rvert$

Several (but not all) linkages can be expressed in this form (for distances):

▶ Single-linkage: $\alpha_1 = \alpha_2 = \frac{1}{2}$, $\beta = 0$, $\gamma = -\frac{1}{2}$ (input: any)
▶ Complete-linkage: $\alpha_1 = \alpha_2 = \frac{1}{2}$, $\beta = 0$, $\gamma = +\frac{1}{2}$ (input: any)
▶ Average-group-linkage (UPGMA): $\alpha_1 = \frac{|A|}{|A|+|B|}$, $\alpha_2 = \frac{|B|}{|A|+|B|}$, $\beta = 0$, $\gamma = 0$ (input: any)
▶ Centroid-linkage (UPGMC): $\alpha_1 = \frac{|A|}{|A|+|B|}$, $\alpha_2 = \frac{|B|}{|A|+|B|}$, $\beta = -\frac{|A| \cdot |B|}{(|A|+|B|)^2}$, $\gamma = 0$ (input: squared Euclidean)
▶ McQuitty (WPGMA): $\alpha_1 = \alpha_2 = \frac{1}{2}$, $\beta = 0$, $\gamma = 0$ (input: any)
▶ Median-linkage (WPGMC): $\alpha_1 = \alpha_2 = \frac{1}{2}$, $\beta = -\frac{1}{4}$, $\gamma = 0$ (input: squared Euclidean)
▶ Ward: $\alpha_1 = \frac{|A|+|C|}{|A|+|B|+|C|}$, $\alpha_2 = \frac{|B|+|C|}{|A|+|B|+|C|}$, $\beta = -\frac{|C|}{|A|+|B|+|C|}$, $\gamma = 0$ (input: squared Euclidean)

MiniMax, medoid, and Hausdorff linkages cannot be computed with Lance-Williams updates; instead, we need to find the best cluster representative within the merged cluster.

Extracting Clusters from a Dendrogram

At this point, we have the dendrogram – but not yet “clusters”. Various strategies have been discussed (see the SciPy example at the end for two of them):

▶ visually inspect the dendrogram, choose interesting branches
▶ stop when $k$ clusters remain (it may be necessary to ignore noise [SSWG17])
▶ stop at a certain distance (e.g., a maximum cluster diameter with complete-link)
▶ a significant change in distance [Moje77]
▶ significance via bootstrap resampling [Shim02; Shim04]
▶ a change in cluster sizes or density [CMZS13]
▶ constraints satisfied (semi-supervised) [PMCZ14]: certain objects are labeled as “must” or “should not” be in the same cluster

Benefits and Limitations of HAC

Benefits:

▶ very general: works with any distance / similarity (for text: cosine!)
▶ easy to understand and interpret
▶ hierarchical result
▶ the dendrogram visualization is often useful (for small data)
▶ the number of clusters does not need to be fixed beforehand
▶ many variants

Limitations:

▶ scalability is the main problem (in particular, the $O(n^2)$ memory for the distance matrix)
▶ in many cases, users want a flat partitioning
▶ unbalanced cluster sizes (i.e., numbers of points)
▶ outliers / noise are problematic for some linkage strategies
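The linkage definitions from “Distances of Clusters /1” can be checked with a few lines of NumPy. This is a minimal sketch, assuming Euclidean distances between points in the plane; the helper names (pairwise, single_link, etc.) are ours, not from any library:

import numpy as np

def pairwise(A, B):
    # all pairwise Euclidean distances between the rows of A and the rows of B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):    # minimum distance over all pairs
    return pairwise(A, B).min()

def complete_link(A, B):  # maximum distance over all pairs
    return pairwise(A, B).max()

def average_link(A, B):   # UPGMA: average distance over all pairs
    return pairwise(A, B).mean()

def centroid_link(A, B):  # squared Euclidean distance of the cluster means
    return float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
# 2.0 4.0 3.0 9.0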
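The AGNES procedure can likewise be sketched in a few lines. The following is a deliberately naive illustration for average linkage (UPGMA) with Lance-Williams updates, assuming a symmetric NumPy distance matrix as input; the function name agnes_upgma is ours, and real implementations avoid the full O(n²) scan per merge (e.g., via nearest-neighbor chains or priority queues):

import numpy as np

def agnes_upgma(D):
    # Naive AGNES with Lance-Williams updates for average linkage (UPGMA).
    # D: symmetric (n x n) distance matrix; returns the merge sequence
    # as (cluster_i, cluster_j, distance) triples.
    D = D.astype(float).copy()
    n = D.shape[0]
    size = {i: 1 for i in range(n)}       # current cluster sizes
    active = set(range(n))                # clusters still to be merged
    np.fill_diagonal(D, np.inf)           # never merge a cluster with itself
    merges = []
    while len(active) > 1:
        # step (2): find the closest pair among the active clusters
        idx = sorted(active)
        sub = D[np.ix_(idx, idx)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = idx[a], idx[b]
        merges.append((i, j, D[i, j]))
        # step (3): Lance-Williams update for UPGMA, from stored values only:
        # d(i∪j, k) = |i|/(|i|+|j|) · d(i,k) + |j|/(|i|+|j|) · d(j,k)
        a1 = size[i] / (size[i] + size[j])
        a2 = size[j] / (size[i] + size[j])
        for k in active - {i, j}:
            D[i, k] = D[k, i] = a1 * D[i, k] + a2 * D[j, k]
        size[i] += size[j]                # cluster j is folded into row i
        active.remove(j)                  # step (4): repeat until one remains
    return merges

X = np.array([[0.0], [1.0], [5.0], [6.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(agnes_upgma(D))
# [(0, 1, 1.0), (2, 3, 1.0), (0, 2, 5.0)] -- the final merge distance 5.0 is
# the average of the four inter-cluster pairs: (5 + 6 + 4 + 5) / 4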
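In practice, one would not implement this by hand: SciPy’s scipy.cluster.hierarchy module covers both the clustering and several of the extraction strategies listed above. A short sketch (the toy data and the parameter values t=3 and t=1.5 are arbitrary assumptions):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # 50 random points in 2-d (toy data)

Z = linkage(X, method="average")      # HAC with average linkage (UPGMA)

# strategy: stop when k = 3 clusters remain
labels_k = fcluster(Z, t=3, criterion="maxclust")

# strategy: stop at a certain distance (cut the dendrogram at height 1.5)
labels_d = fcluster(Z, t=1.5, criterion="distance")

dendrogram(Z)                         # plot the tree for visual inspection
plt.show()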