Data Science with Machine Learning COMP4030 Lecture 4 PDF

Summary

This document presents lecture notes for COMP4030, focusing on unsupervised machine learning techniques. The lecture covers the K-Means and DBSCAN clustering algorithms alongside dimensionality reduction via Principal Component Analysis (PCA). These concepts are foundational for data analysis and pattern recognition.

Full Transcript

Data Science with Machine Learning, COMP4030, Lecture 4: Unsupervised Learning

Agenda

Supervised vs Unsupervised Machine Learning; Clustering (K-Means, DBSCAN); Dimensionality Reduction (PCA).

How Does Machine Learning Work?

Machine learning is a way for computers to learn patterns from data and make predictions or decisions without being explicitly programmed.

Under the hood sits a mathematical model. For example, take the number of corners of a shape as the only feature:

Corners   Label
4         square
3         triangle
0         circle
3         triangle
0         circle
4         square
4         square

The model the computer learns amounts to a simple rule:

$$\text{if number of corners} = \begin{cases} 0 \rightarrow \text{circle} \\ 3 \rightarrow \text{triangle} \\ 4 \rightarrow \text{square} \end{cases}$$

Supervised Learning / Unsupervised Learning

In supervised learning, every training example pairs features with a known label, as in the table above. In unsupervised learning, the same features arrive without labels (4 → ?, 3 → ?, 0 → ?), and the algorithm must discover the structure in the data on its own.

Clustering

Clustering groups similar items together; the slide illustrates one possible grouping with books arranged alphabetically by authors' last names.

K-Means

K-means clustering is an unsupervised learning algorithm used to cluster unlabeled datasets into distinct groups based on feature similarity. K-means identifies clusters by iteratively refining centroid positions, the points representing the "centre" of each cluster. The algorithm assumes that data points closer to a centroid belong to the same group, mimicking how humans naturally categorize objects spatially.

It aims to minimize the within-cluster sum of squares (WCSS). Given a set of observations $(x_1, x_2, \ldots, x_n)$ and $k$ clusters $C = \{C_1, C_2, \ldots, C_k\}$:

$$\underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the mean (centroid) of cluster $C_i$.

K-Means Example and Elbow Method

The example slides run K-Means on a 2-D dataset for several values of k. The elbow method plots the WCSS against k; the point where the curve stops dropping sharply (the "elbow") suggests a suitable number of clusters.

K-Means Calculation

1. Initialize centroids.
2. Repeat:
   a) Assign each point to the cluster of its nearest centroid.
   b) Re-calculate each centroid as the mean of its assigned points.
3. Convergence: stop when the clusters no longer move.

A from-scratch sketch of these steps, and of choosing k, follows after the strengths and limitations below.

K-Means Strengths and Limitations

Strengths:
- Simplicity: easy to implement and interpret.
- Efficiency: linear time complexity O(n·k·d·t), suitable for large datasets.
- Versatility: effective for spherical clusters and as preprocessing in supervised learning.

Limitations:
- Manual k selection: requires heuristic methods (e.g., Elbow, Silhouette).
- Sensitivity to initialization: poor centroid seeds lead to suboptimal clusters (mitigated by k-means++).
- Assumption of spherical clusters: struggles with elongated or irregular shapes.
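The calculation steps above are short enough to write out directly. Below is a minimal from-scratch sketch in Python/NumPy, assuming random initialization from the data points; the function name, toy data, and defaults are illustrative rather than taken from the lecture.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: initialize centroids, then alternate
    (a) assign points to nearest centroid, (b) re-calculate centroids."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # a) Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # b) Re-calculate each centroid as the mean of its assigned points
        #    (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 2. Convergence: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy 2-D data: two visually obvious groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```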
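For the "manual k selection" limitation, here is a hedged sketch of the elbow heuristic using scikit-learn's KMeans, whose inertia_ attribute is exactly the WCSS defined earlier; the blob dataset and the range of k tried are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: three well-separated blobs in 2-D.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means for a range of k and record the WCSS (inertia_).
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: WCSS={km.inertia_:.1f}")

# Plotted against k, the WCSS drops sharply up to k=3 and then
# flattens; that "elbow" suggests k=3 for this data.
```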
DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as dense regions separated by sparse areas. Unlike centroid-based methods such as k-means, DBSCAN excels at detecting irregularly shaped clusters and automatically filtering noise.

DBSCAN Key Definitions

- ε-neighborhood: all points within radius ε of a given point.
- Core point: a point with ≥ min_samples neighbors in its ε-neighborhood.
- Border point: a non-core point within the ε-neighborhood of a core point.
- Noise: points that are neither core nor border.

DBSCAN Calculation

The worked slides classify the same points as core, border, or noise under different settings (ε = 1 with min_samples = 3, ε = 1 with min_samples = 2, ε = 1.5 with min_samples = 2). Increasing ε or decreasing min_samples lets more points qualify as core points, which enlarges the clusters and shrinks the noise set.

DBSCAN vs K-Means

A sequence of comparison slides runs K-Means and DBSCAN side by side on the same datasets.

DBSCAN Strengths and Limitations

Strengths:
- Shape flexibility: captures complex cluster geometries through local density.
- Noise robustness: automatically filters (some) outliers in typical datasets.
- Parameter guidance: k-distance graphs provide empirical ε selection (sketched in code below).

Limitations:
- Density uniformity: fails when clusters have significantly different densities.
- High dimensionality: distance metrics become less meaningful (curse of dimensionality).
- Border point ambiguity: edge cases may assign points to multiple clusters.

DBSCAN vs K-Means

Criterion               DBSCAN                              K-Means
Cluster shape           Arbitrary (handles non-convex)      Spherical
Noise handling          Explicit outlier detection          Forces all points into clusters
Parameter sensitivity   Critical ε/min_samples selection    Requires predefined k
Density variation       Struggles with varying densities    Uniform density assumption
Complexity              O(n log n) with spatial indexing    O(nk) per iteration
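The contrast in the table can be reproduced with scikit-learn on a non-convex dataset. A minimal sketch, assuming the two-moons toy data; the eps and min_samples values are illustrative, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters with mild noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means with k=2 partitions the plane with a straight boundary,
# so it typically cuts across both moons.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows local density; with a suitable eps it can recover
# both moons, labelling any stray points -1 (noise).
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means cluster ids:", np.unique(km_labels))
print("DBSCAN cluster ids :", np.unique(db_labels))  # -1 marks noise, if any
```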
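For the "parameter guidance" strength, a sketch of the k-distance graph idea: sort every point's distance to its min_samples-th nearest neighbour and look for the knee of the curve. The data reuses the illustrative setup above; the percentile printed at the end is an arbitrary rough cut, not a standard rule.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_samples = 5

# Distance from each point to its min_samples-th nearest neighbour.
# scikit-learn's DBSCAN counts the point itself towards min_samples,
# and kneighbors on the training data also returns the point itself,
# so n_neighbors=min_samples keeps the two consistent.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

# Plotted in ascending order, this curve stays flat inside dense regions
# and rises sharply for outliers; an eps near the knee is a reasonable
# starting point for DBSCAN. As a crude stand-in for eyeballing the knee,
# print a high percentile of the k-distances.
print("rough eps suggestion:", k_dist[int(0.90 * len(k_dist))])
```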
Clustering

scikit-learn implements a wider range of clustering algorithms than the two covered here:
- K-Means
- Mini-Batch K-Means
- DBSCAN
- OPTICS
- Hierarchical clustering
- Gaussian Mixture Models
- Manifold Learning
Source: https://scikit-learn.org/stable/modules/clustering.html#

Dimensionality Reduction

The Curse of Dimensionality: high-dimensional data (e.g., images with thousands of pixels, genomic sequences) often contains redundancies and noise. Dimensionality reduction simplifies such data.

Principal Component Analysis (PCA)

PCA identifies latent variables: directions of maximum variance that encode the most informative features. In other words, PCA looks for projections in the data that maximise variance.
Source: https://builtin.com/data-science/step-step-explanation-principal-component-analysis

PCA + K-Means

The two techniques combine naturally: project the data onto its leading principal components with PCA, then run K-Means in the reduced space (a code sketch appears at the end of these notes).

More Dimensionality Reduction (Unsupervised)

- PCA
- Singular Value Decomposition (SVD)
- Non-Negative Matrix Factorization (NMF)
- Random projections
- UMAP (Uniform Manifold Approximation and Projection)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Independent Component Analysis (ICA)
- Feature Agglomeration

More Unsupervised Learning

- Clustering
- Dimensionality Reduction
- Association Rule Mining
- Anomaly Detection

Reading

- Chapter 9: Unsupervised Learning Techniques
- Chapter 20: Clustering
- Chapter 5: Visualisation
- Chapter 10: Unsupervised Learning and Dimensionality Reduction
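Finally, a minimal sketch of the PCA + K-Means combination mentioned above, assuming scikit-learn's digits dataset; the choice of two components and k = 10 is illustrative, not from the lecture.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 64-dimensional handwritten-digit images: a natural candidate for
# dimensionality reduction before clustering.
X, _ = load_digits(return_X_y=True)

# Standardize the features, then keep the two directions of maximum variance.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Cluster in the reduced 2-D space (k=10 since there are ten digit classes).
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_2d)
print(X_2d.shape, labels[:10])
```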