K-means Clustering Lecture Notes PDF
Northwestern University
Summary
These lecture notes cover K-means clustering: the algorithm, its optimization objective, initialization strategies, choosing the number of clusters, and evaluation methods. They also discuss the limitations of K-means and survey alternative clustering algorithms such as DBSCAN, agglomerative hierarchical clustering, and Gaussian mixture models.
Full Transcript
Last Time

Lecture:
- Python environment
- Tips on HW1
- Performance metrics for multi-class classification
- Mitigating imbalanced classification

In-class assignment:
- Explore different ML classifiers for multi-class classification
- Explore different strategies for mitigating imbalanced multi-class classification

Last Time: Classification Metrics in Multi-class
- Error rate = (C_FP + C_FN) / N
- Accuracy rate = (C_TP + C_TN) / N
- Recall = C_TP / N_P
- Precision = C_TP / (C_TP + C_FP)
- F1-score = 2 · Precision · Recall / (Precision + Recall)
- Can they still work for multi-class classification?

Last Time: Dealing with Imbalanced Classification
- Adjusting class weights during training (cost-sensitive learning)
- Oversampling the minority class
- Undersampling the majority class
- SMOTE (Synthetic Minority Oversampling Technique)
- Stratified splitting is required for imbalanced datasets.
- In Python, you can use the `imbalanced-learn` library to handle imbalanced datasets. It provides various techniques for resampling the dataset, such as under-sampling, over-sampling, and combined methods. Install `imbalanced-learn` if you haven't already. (A short resampling sketch appears just before the Clustering section below.)

Last Time: After-class Assignment
- Explore different ML classifiers for multi-class classification
- Explore different strategies to deal with imbalanced classification

Agenda for Section 2
- Clustering (today)
- Dimension reduction (next Monday)
- Big data visualization (next Wednesday)
- HW2: given a high-dimensional dataset, conduct dimension reduction followed by clustering/multi-class classification and big data visualization

Today's Agenda

Lecture:
- Introduction to unsupervised learning
- Clustering (K-means)

After-class assignment:
- Gain a good understanding of K-means clustering
- Explore other clustering algorithms

Machine Learning 101: We Have Focused on Supervised Learning
- Labeled dataset, meaning that the predictors are paired with a corresponding outcome/target.
- The goal of supervised learning is to learn a mapping from inputs to outputs based on the labeled examples provided during training.

Supervised Learning: Regression vs. Classification
- Regression: the response is a continuous variable. For example, predicting the life expectancy of a country based on its GDP per capita.
- Binary classification: the response is a categorical variable with two classes. For example, classifying a country as having low or high life expectancy based on its GDP per capita.
- Multiclass classification: the response is a categorical variable with more than two classes. For example, classifying emails into spam, promotions, social, and primary.

Introduction to Unsupervised Learning
- No labeled data: no correct answers for training.
- Pattern discovery: the model identifies patterns, clusters, or relationships that are inherent in the data.
- Goal: uncover patterns or structures.
- Useful as a pre-processing or post-processing step for supervised learning.

Unsupervised Learning: Real-world Applications
- Customer segmentation: grouping customers based on purchasing behavior.
- Anomaly detection: identifying unusual data points, such as fraudulent transactions.
- Recommendation systems: discovering user preferences from large datasets of user behavior without explicit labels.

We will cover:
- Clustering: grouping similar data points together based on their features. Examples: K-means, DBSCAN, hierarchical clustering.
- Dimensionality reduction: reducing the number of features or variables while preserving the data's structure. Examples: Principal Component Analysis (PCA), t-SNE, autoencoders.

Unsupervised Learning: Challenges
- Lack of ground truth: difficult to know how well you are doing.
- Algorithm-specific limitations.
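As a pointer back to the recap above, here is a minimal sketch of the resampling workflow with `imbalanced-learn`. The toy dataset, class weights, and `random_state` values are illustrative assumptions, not from the lecture.

```python
# Minimal sketch: handling an imbalanced multi-class dataset with imbalanced-learn.
# The toy data and all parameter values are illustrative.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Toy imbalanced three-class dataset (roughly 90% / 7% / 3%).
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                           weights=[0.90, 0.07, 0.03], random_state=42)

# Stratified splitting keeps the class ratios in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority classes in the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))
```

Resampling is applied only to the training split so that the test split still reflects the original class imbalance.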
Clustering (Unsupervised Learning)

Related Publications
- Yakang Lu, Lizhen Shi, Marc W. Van Goethem, Volkan Sevim, Michael Mascagni, Li Deng, Zhong Wang. Hybrid Clustering of Long and Short-read for Improved Metagenome Assembly. Frontiers in Microbiology (minor revision).
- Lizhen Shi, Bo Chen. Comparison and Benchmark of Graph Clustering Algorithms. arXiv, 05/2020.
- Kexue Li, Lili Wang, Lizhen Shi, Li Deng, Zhong Wang. Deconvolute individual genomes from metagenome sequences through short read clustering. PeerJ 8 (2020): e8966.
- Lizhen Shi, Volkan Sevim, Michael Mascagni, Zhong Wang. Leveraging long-read sequencing for cost-effective metagenome clustering.
- Lizhen Shi, Xiandong Meng, Elizabeth Tseng, Michael Mascagni, Zhong Wang. SpaRC: scalable sequence clustering using Apache Spark. Bioinformatics 35.5 (2018): 760-768.
- Metagenomic clustering with LSHvec: https://dl.acm.org/doi/abs/10.1145/3459930.3469521

What is Clustering?
- Basic idea: group together similar instances (e.g., market segmentation).
- Supervised learning (labeled data) → classification: the training set pairs the features (x1, x2) with labels.
- Unsupervised learning (unlabeled data) → clustering: the training set contains only the features (x1, x2), with no labels.

Clustering Algorithms
- Partition (flat) algorithms: K-means, DBSCAN, spectral clustering, mixture of Gaussians.
- Hierarchical algorithms: bottom-up (agglomerative), top-down (divisive).

K-means Intuition

K-means is an iterative clustering algorithm:
- Initialize: pick K random points as cluster centers.
- Alternate:
  1. Assign each data point to the closest cluster center.
  2. Move each cluster center to the average of its assigned points.
- Stop when no point's assignment changes.

Worked example (K = 2, training set x(1), x(2), ..., x(30)):
- Iteration 1: assign each point to its closest centroid, then recompute the centroids.
- Iteration 2: re-assign each point to its closest centroid (some points move to a different cluster), then recompute the centroids.
- Iteration 3: assign each point to its closest centroid, then recompute the centroids.

K-means Algorithm

Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K.
Repeat {
  # Assign points to cluster centroids — O(Km)
  for i = 1 to m:
    c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
  # Move cluster centroids — O(m)
  for k = 1 to K:
    μ_k := average (mean) of the points assigned to cluster k
}
Complexity: the assignment step is O(Km) and the centroid update is O(m), so each iteration is linear in the number of training examples m.

K-means Optimization Objective
- c(i) = index of the cluster (1, 2, ..., K) to which example x(i) is currently assigned
- μ_k = cluster centroid k
- μ_c(i) = cluster centroid of the cluster to which example x(i) has been assigned

Cost function (SSE, sum of squared errors):

  J(c(1), ..., c(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^{m} ‖ x(i) − μ_c(i) ‖²

K-means seeks the assignments c(1), ..., c(m) and the centroids μ_1, ..., μ_K that minimize J.
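To make the loop and the cost function above concrete, here is a minimal NumPy sketch; the function names (`kmeans`, `sse`), the toy data, and the convergence test are illustrative rather than taken from the lecture.

```python
# Minimal NumPy sketch of the K-means loop and the SSE cost described above.
# Function names, toy data, and the convergence test are illustrative.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as K randomly chosen training examples.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # no centroid moved -> assignments are stable
            break
        mu = new_mu
    return c, mu

def sse(X, c, mu):
    """Cost J: mean squared distance of each point to its assigned centroid."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))

if __name__ == "__main__":
    # Three loose blobs as a toy example.
    X = np.random.default_rng(1).normal(size=(300, 2)) \
        + np.repeat([[0, 0], [5, 5], [0, 5]], 100, axis=0)
    c, mu = kmeans(X, K=3)
    print("SSE:", sse(X, c, mu))
```

In practice you would typically call `sklearn.cluster.KMeans`, which runs this same loop with k-means++ initialization and several restarts, as discussed next.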
Both Steps Minimize the SSE
- Assigning the points: with the centroids fixed, assigning each point to its closest centroid minimizes the SSE.
- Moving the centroids: with the assignments fixed, moving each centroid to the average of its assigned points minimizes the SSE.

K-means Convergence
- Since each step never increases the cost J, K-means is guaranteed to converge in a finite number of iterations.

Drawbacks of K-means Clustering
While K-means is a widely used and efficient clustering algorithm, it has several drawbacks:
- Dependence on the initial centroids
- Choosing the number of clusters (K)
- Not suitable for complex cluster shapes
- Sensitive to outliers
Despite these drawbacks, K-means remains a popular choice for clustering tasks because of its simplicity, scalability, and efficiency on large datasets. However, it is important to be aware of its limitations and to consider alternative clustering algorithms when dealing with complex datasets or when the assumptions of K-means are violated.

K-means Clustering: Dependence on Initial Centroids

Step 0 of the algorithm randomly initializes the K cluster centroids μ_1, μ_2, ..., μ_K before the repeat loop (assign points, move centroids).

Random initialization
- Choose K < m.
- Randomly pick K training examples and set μ_1, μ_2, ..., μ_K equal to these K examples.
- Different initializations can converge to different local minima of J(c(1), ..., c(m), μ_1, ..., μ_K).

Multiple random initializations:
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means; get c(1), ..., c(m), μ_1, μ_2, ..., μ_K.
  Compute J(c(1), ..., c(m), μ_1, μ_2, ..., μ_K).
}
Pick the set of clusters that gave the lowest cost J.

K-means++ initialization
- Randomly pick the first centroid from the training examples.
- Select each subsequent centroid based on its distance to the nearest centroid that has already been chosen.
- The resulting centroids are well spread out and representative of the data.
- Used by default in Sklearn.
- Reference: "k-means++: The Advantages of Careful Seeding".

K-means Clustering: Choosing the Number of Clusters
- K has to be specified before training (e.g., K = 3 vs. K = 5). What is the right value of K? Often unknown.
- T-shirt sizing example: clustering customers by height and weight into K = 3 sizes (S, M, L) or K = 5 sizes (XS, S, M, L, XL) can both be reasonable.

Choosing the value of K: Elbow method
- Plot the cost function against the number of clusters and look for the "elbow" where the decrease in cost levels off.

Choosing the value of K: Silhouette score method
- A higher silhouette score indicates better separation of the clusters.
- (A scikit-learn sketch of the elbow and silhouette checks follows the K-means assumptions discussion below.)

K-means Clustering: Feature Scaling
- Feature scaling is crucial in K-means, since it is a distance-based algorithm, especially when dealing with:
  - High-dimensional data
  - Features with different magnitudes

K-means Clustering: Complex Shapes
- K-means works fine when the clusters have a roughly circular distribution: the model effectively places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster.

K-means Assumptions
- When these assumptions are violated (non-spherical shapes, unevenly sized blobs), K-means can produce unexpected clusters.
- Clustering unevenly sized blobs: increasing the number of random initializations helps (compare n_init=1 with n_init=10).
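Pulling together the practical points above (feature scaling, k-means++ initialization, multiple restarts, and the elbow/silhouette checks for choosing K), here is a minimal scikit-learn sketch; the toy blobs and the range of K values are illustrative assumptions.

```python
# Minimal sketch: scaling + k-means++ + multiple restarts, then the elbow and
# silhouette checks for choosing K. The toy data and K range are illustrative.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)   # distance-based algorithm -> scale features

for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    # inertia_ is the total SSE; watch where its decrease levels off (the elbow).
    print(f"k={k}  SSE={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```

Look for the K where the SSE curve bends and where the silhouette score peaks; when the two disagree, let the downstream use of the clusters break the tie.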
K-means Limitations
- When K-means falls short, try alternative clustering algorithms.

Alternative Clustering Algorithms
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
- Agglomerative hierarchical clustering
- Gaussian mixture models
- Feel free to try other clustering algorithms available in Sklearn: https://scikit-learn.org/stable/modules/clustering.html

DBSCAN
- Density-Based Spatial Clustering of Applications with Noise; can also be used for anomaly detection.
- Key features:
  - Does not require the number of clusters to be specified beforehand.
  - Capable of discovering clusters of arbitrary shapes and sizes.
  - Robust to noise and outliers (points in low-density regions).
- Limitations:
  - Struggles with varying density across clusters and can be sensitive to the choice of hyperparameters (e.g., eps and min_samples).

Hierarchical Clustering
- Agglomerative (bottom-up) clustering is more popular than divisive (top-down).
- For time complexity and cluster linkage options, please refer to the Wikipedia page in the references below.

Agglomerative Clustering
- Basic idea: first merge very similar instances, then incrementally build larger clusters out of smaller clusters.
- Algorithm:
  - Maintain a set of clusters; initially, each instance is its own cluster.
  - Repeat: pick the two closest clusters and merge them into a new cluster.
  - Stop when there is only one cluster left.
- Produces not one clustering but a family of clusterings, represented by a dendrogram.
- How should we define "closest" for clusters? This is the choice of linkage; please refer to the references below for details.

Gaussian Mixture Models
- Identify the data as a mixture of multiple Gaussian distributions.
- Fitted with Expectation-Maximization (EM): initialization, E-step, M-step, convergence check.
- GMMs can be computationally expensive, but they can scale to large datasets with efficient implementations and parallelization techniques.

Clustering Performance Evaluation: Metrics for Clustering
- No ground truth: silhouette score.
- With ground truth: purity, completeness, Adjusted Rand Index (ARI), Mutual Information (MI).
- Evaluation is challenging if the ground truth is unknown. Often you want clusters for some later (downstream) purpose; evaluate K-means based on how well it performs on that later purpose (recall the T-shirt sizing example with K = 3 vs. K = 5).

Today's Assignment
- Implement K-means clustering.
- Find the suitable clustering algorithm for ... (a comparison sketch follows the references below).

References
- https://www.deeplearning.ai/
- https://www.kaggle.com/code/vipulgandhi/gaussian-mixture-models-clustering-explained
- https://en.wikipedia.org/wiki/Hierarchical_clustering
- https://www.mit.edu/~9.54/fall14/
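As a starting point for the after-class assignment above, here is a minimal sketch that runs K-means and the three alternative algorithms on the same toy data and compares them with the metrics just discussed. The two-moons data, the hyperparameter values (`eps`, `min_samples`, `linkage`, `n_components`), and the use of ARI (which requires ground-truth labels) are illustrative assumptions.

```python
# Minimal sketch: comparing K-means, DBSCAN, agglomerative clustering, and a
# Gaussian mixture model on the same toy data. Hyperparameters are illustrative,
# and ARI assumes ground-truth labels are available, which is rare in practice.
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_moons(n_samples=400, noise=0.06, random_state=0)
X = StandardScaler().fit_transform(X)

models = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "agglomerative": AgglomerativeClustering(n_clusters=2, linkage="single"),
    "GMM": GaussianMixture(n_components=2, random_state=0),
}

for name, model in models.items():
    labels = model.fit_predict(X)   # GaussianMixture also supports fit_predict
    print(f"{name:14s} ARI={adjusted_rand_score(y_true, labels):.2f} "
          f"silhouette={silhouette_score(X, labels):.2f}")
```

On non-convex shapes like the two moons, the density-based and single-linkage methods tend to score much higher on ARI even when K-means has a competitive silhouette score, which illustrates why the choice of metric and algorithm should follow the structure of the data.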