Data Science with Machine Learning COMP4030 Lecture 4 PDF
Document Details
Uploaded by BuoyantAntigorite8597
Summary
This document presents lecture notes for COMP4030, focusing on unsupervised machine learning techniques. The lecture covers core concepts such as K-Means and DBSCAN clustering algorithms alongside Dimensionality Reduction methods like PCA. These concepts are foundational for data analysis and pattern recognition.
Full Transcript
Data Science with Machine Learning, COMP4030, Lecture 4: Unsupervised Learning

Agenda
- Supervised vs Unsupervised Machine Learning
- Clustering: K-Means, DBSCAN
- Dimensionality Reduction: PCA

How Does Machine Learning Work?
Machine learning is a way for computers to learn patterns from data and make predictions or decisions without being explicitly programmed.

Under the hood sits a mathematical model, e.g. one based on the number of corners:

| Corners | Label |
| --- | --- |
| 4 | square |
| 3 | triangle |
| 0 | circle |
| 3 | triangle |
| 0 | circle |
| 4 | square |
| 4 | square |

$$\text{label} = \begin{cases} \text{circle} & \text{if corners} = 0 \\ \text{triangle} & \text{if corners} = 3 \\ \text{square} & \text{if corners} = 4 \end{cases}$$

Supervised Learning / Unsupervised Learning
In supervised learning, the model sees both the features and the label for each observation, as in the table above. In unsupervised learning, it sees the same features but the labels are unknown (Label = ?), so the algorithm must discover structure in the data on its own.

Clustering
[Figure: "Alphabetically by authors' last names", contrasting a simple ordering with clustering.]

K-Means
K-means clustering is an unsupervised learning algorithm used to cluster unlabeled datasets into distinct groups based on feature similarity. K-means identifies clusters by iteratively refining centroid positions: points representing the "centre" of each cluster. The algorithm assumes that data points closer to a centroid belong to the same group, mimicking how humans naturally categorize objects spatially.

It aims to minimize the within-cluster sum of squares (WCSS). Given a set of observations $(x_1, x_2, \ldots, x_n)$ and $k$ clusters $C = \{C_1, C_2, \ldots, C_k\}$:

$$\underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the mean (centroid) of the points in $C_i$.

K-Means Example
[Figures: K-Means clustering example and the Elbow Method for choosing k.]

K-Means Calculation
1. Initialize centroids.
2. Repeat:
   a) Assign each point to the cluster of its nearest centroid.
   b) Re-calculate each centroid as the mean of its assigned points.
3. Convergence: stop when the clusters no longer move.

K-Means Strengths and Limitations
Strengths:
- Simplicity: easy to implement and interpret.
- Efficiency: linear time complexity O(n·k·d·t), suitable for large datasets.
- Versatility: effective for spherical clusters and as preprocessing in supervised learning.
Limitations:
- Manual k selection: requires heuristic methods (e.g., Elbow, Silhouette).
- Sensitivity to initialization: poor centroid seeds lead to suboptimal clusters (mitigated by k-means++).
- Assumption of spherical clusters: struggles with elongated or irregular shapes.
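The lecture points to scikit-learn's clustering module, so here is a minimal K-Means sketch that also prints the WCSS (scikit-learn's `inertia_`) over a range of k values, the quantity the Elbow Method inspects. The synthetic blob data, the k range, and the random seeds are illustrative assumptions, not values from the slides.

```python
# Minimal K-Means sketch (scikit-learn). Data and parameter values are
# illustrative assumptions, not taken from the lecture slides.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 4 roughly spherical blobs,
# the cluster shape K-Means handles best.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Elbow Method: fit K-Means for several k and record the WCSS (inertia_);
# the "elbow" in this curve suggests a reasonable k.
for k in range(1, 8):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(X)
    print(f"k={k}  WCSS={model.inertia_:.1f}")

# Final model: k-means++ seeding mitigates the initialization sensitivity
# noted in the limitations above.
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)   # converged centroids
print(kmeans.labels_[:10])       # cluster assignment per point
```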
DBSCAN
Density-Based Spatial Clustering of Applications with Noise. DBSCAN identifies clusters as dense regions separated by sparse areas. Unlike centroid-based methods such as k-means, DBSCAN excels at detecting irregularly shaped clusters and automatically filtering noise.

DBSCAN Key Definitions
- ε-neighborhood: all points within radius ε of a given point.
- Core point: a point with ≥ min_samples neighbors in its ε-neighborhood.
- Border point: a non-core point within the ε-neighborhood of a core point.
- Noise: points that are neither core nor border.

DBSCAN Calculation
[Figures: worked examples labelling core points, border points, and noise for ε = 1 with min_samples = 3, ε = 1 with min_samples = 2, and ε = 1.5 with min_samples = 2.]

DBSCAN vs K-Means
[Figures: side-by-side K-Means and DBSCAN cluster assignments on the same datasets.]

DBSCAN Strengths and Limitations
Strengths:
- Shape flexibility: captures complex cluster geometries through local density.
- Noise robustness: automatically filters (some) outliers in typical datasets.
- Parameter guidance: k-distance graphs provide empirical ε selection.
Limitations:
- Density uniformity: fails when clusters have significantly different densities.
- High dimensionality: distance metrics become less meaningful (curse of dimensionality).
- Border point ambiguity: edge cases may assign points to multiple clusters.

DBSCAN vs K-Means

| Criterion | DBSCAN | K-Means |
| --- | --- | --- |
| Cluster shape | Arbitrary (handles non-convex) | Spherical |
| Noise handling | Explicit outlier detection | Forces all points into clusters |
| Parameter sensitivity | Critical ε/min_samples selection | Requires predefined k |
| Density variation | Struggles with varying densities | Uniform density assumption |
| Complexity | O(n log n) with spatial indexing | O(nk) per iteration |
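As a counterpart to the K-Means sketch, here is a minimal DBSCAN run on non-convex data, the case where K-Means fails in the comparison above. The half-moon dataset and the eps value tuned to it are illustrative assumptions; the slides' worked examples instead use ε between 1 and 1.5 on their own points.

```python
# Minimal DBSCAN sketch (scikit-learn) on non-convex clusters. The half-moon
# data and eps=0.2 are illustrative assumptions tuned to this dataset.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: arbitrarily shaped clusters that K-Means
# cannot separate with spherical decision regions.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=3).fit(X)

# labels_ holds a cluster index per point; -1 marks noise points.
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {int(np.sum(labels == -1))}")
```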
Clustering
More clustering methods:
- K-Means
- Mini-Batch K-Means
- DBSCAN
- OPTICS
- Hierarchical clustering
- Gaussian Mixture Models
- Manifold Learning
Source: https://scikit-learn.org/stable/modules/clustering.html

Dimensionality Reduction
The Curse of Dimensionality: high-dimensional data (e.g., images with thousands of pixels, genomic sequences) often contains redundancies and noise. Dimensionality reduction simplifies such data.

Principal Component Analysis (PCA)
PCA identifies latent variables: directions of maximum variance that encode the most informative features. PCA looks for projections in the data that maximise variance.
Source: https://builtin.com/data-science/step-step-explanation-principal-component-analysis

PCA + K-Means
[Figures: PCA + K-Means worked example.]

More Dimensionality Reduction (Unsupervised)
- PCA
- Singular Value Decomposition (SVD)
- Non-Negative Matrix Factorization (NMF)
- Random projections
- UMAP (Uniform Manifold Approximation and Projection)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Independent Component Analysis (ICA)
- Feature Agglomeration

More Unsupervised Learning
- Clustering
- Dimensionality Reduction
- Association Rule Mining
- Anomaly Detection

Reading
- Chapter 9: Unsupervised Learning Techniques
- Chapter 20: Clustering
- Chapter 5: Visualisation
- Chapter 10: Unsupervised Learning and dimensionality reduction
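To close, a minimal end-to-end sketch of the PCA + K-Means combination from the slides above: project high-dimensional data onto its directions of maximum variance, then cluster in the reduced space. The digits dataset, the choice of 2 components, and k = 10 are illustrative assumptions, not from the lecture.

```python
# Minimal PCA + K-Means sketch (scikit-learn). Dataset and parameter choices
# (digits, 2 components, k=10) are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 64-dimensional digit images stand in for "high-dimensional data".
X, _ = load_digits(return_X_y=True)

# PCA keeps the directions of maximum variance; 2 components make the
# projected points easy to plot.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # fraction of variance per component

# Cluster in the reduced space; k=10 assumed (one cluster per digit class).
labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X_reduced)
print(labels[:20])
```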