Data Science with Machine Learning COMP4030 Lecture 4 PDF

Summary

This document presents lecture notes for COMP4030, focusing on unsupervised machine learning. The lecture covers the K-Means and DBSCAN clustering algorithms alongside dimensionality reduction with PCA. These concepts are foundational for data analysis and pattern recognition.

Full Transcript

Data Science with Machine Learning COMP4030 Lecture 4: Unsupervised Learning

Agenda: Supervised vs Unsupervised Machine Learning; Clustering (K-Means, DBSCAN); Dimensionality Reduction (PCA)

How Does Machine Learning Work?

Machine learning is a way for computers to learn patterns from data and make predictions or decisions without being explicitly programmed.

Under the hood sits a mathematical model. The lecture's running example is a model that labels shapes by their number of corners, learned from labelled examples:

Corners | Label
4       | square
3       | triangle
0       | circle
3       | triangle
0       | circle
4       | square
4       | square

The learned model is the rule

\text{label}(c) = \begin{cases} \text{circle} & \text{if } c = 0 \\ \text{triangle} & \text{if } c = 3 \\ \text{square} & \text{if } c = 4 \end{cases}

where c is the number of corners.

Supervised Learning vs Unsupervised Learning

In supervised learning, every observation comes with its label, as in the table above. In unsupervised learning the same features arrive without labels (corners = 4, label = ?), and the algorithm must discover the structure on its own.

Clustering

Clustering groups unlabelled observations by similarity. (Slide example: grouping books alphabetically by authors' last names.)

K-Means

K-means clustering is an unsupervised learning algorithm used to cluster unlabeled datasets into distinct groups based on feature similarity. K-means identifies clusters by iteratively refining centroid positions, the points representing the "centre" of each cluster. The algorithm assumes that data points closer to a centroid belong to the same group, mimicking how humans naturally categorize objects spatially.

K-means aims to minimize the within-cluster sum of squares (WCSS). Given a set of observations (x_1, x_2, ..., x_n) and k clusters C = {C_1, C_2, ..., C_k}, it seeks

\arg\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean (centroid) of cluster C_i.

K-Means Elbow Method

(The slides work through K-means examples and an elbow plot: the WCSS is computed for a range of k values, and the k at which the curve's decrease levels off, the "elbow", is chosen.)

K-Means Calculation

1. Initialize centroids.
2. Repeat until convergence (the centroids no longer move):
   a) Assign each point to the cluster of its nearest centroid.
   b) Re-calculate each centroid as the mean of the points assigned to it.

(The slides step through several iterations of this loop on a small example.)
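Below is a minimal sketch of this loop in Python using scikit-learn's KMeans. The synthetic blobs dataset and all parameter values are illustrative assumptions, not taken from the slides:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 3 centres (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the k chosen up front; init="k-means++" spreads the
# initial centroids to mitigate sensitivity to poor random seeds
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # runs steps a) and b) to convergence

print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.inertia_)           # WCSS of the final clustering
```

fit_predict repeats the assign/re-centre loop until the centroids stop moving (or max_iter is reached) and returns each point's cluster index; inertia_ is exactly the WCSS objective above.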
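The elbow method can be sketched the same way: fit K-means for a range of candidate k values and plot the WCSS against k. This continues with the same illustrative data, and the 1 to 10 range is an arbitrary choice:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# WCSS for k = 1..10; the 'elbow' where the curve flattens suggests a k
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in range(1, 11)]

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.show()
```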
K-Means Strengths and Limitations

Strengths:
- Simplicity: easy to implement and interpret.
- Efficiency: linear time complexity O(n·k·d·t), suitable for large datasets.
- Versatility: effective for spherical clusters and as preprocessing in supervised learning.

Limitations:
- Manual k selection: requires heuristic methods (e.g., Elbow, Silhouette).
- Sensitivity to initialization: poor centroid seeds lead to suboptimal clusters (mitigated by k-means++).
- Assumption of spherical clusters: struggles with elongated or irregular shapes.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. DBSCAN identifies clusters as dense regions separated by sparse areas. Unlike centroid-based methods like k-means, DBSCAN excels at detecting irregularly shaped clusters and automatically filtering noise.

DBSCAN Key Definitions

- ε-neighborhood: all points within radius ε of a given point.
- Core point: a point with at least min_samples neighbors in its ε-neighborhood.
- Border point: a non-core point within the ε-neighborhood of a core point.
- Noise: points that are neither core nor border.

DBSCAN Calculation

(The slides illustrate each definition graphically, then classify the points of a worked example as core, border, or noise under ε = 1 with min_samples = 3, ε = 1 with min_samples = 2, and ε = 1.5 with min_samples = 2, showing how the resulting clusters change with the parameter choices. A further sequence of slides contrasts K-means and DBSCAN assignments side by side on the same datasets.)

DBSCAN Strengths and Limitations

Strengths:
- Shape flexibility: captures complex cluster geometries through local density.
- Noise robustness: automatically filters (some) outliers in typical datasets.
- Parameter guidance: k-distance graphs provide empirical ε selection (see the second sketch below).

Limitations:
- Density uniformity: fails when clusters have significantly different densities.
- High dimensionality: distance metrics become less meaningful (curse of dimensionality).
- Border point ambiguity: edge cases may assign points to multiple clusters.
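A minimal DBSCAN sketch with scikit-learn, using the classic two-moons shape that K-means separates poorly; the dataset and both parameter values are illustrative assumptions:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters K-means cannot recover
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the ε-neighborhood radius; min_samples is the core-point threshold
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))  # cluster ids; the label -1 marks noise points
```

Note there is no n_clusters argument: the number of clusters falls out of the density structure, and points that belong nowhere are labelled -1.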
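The k-distance graph mentioned under parameter guidance can be approximated as follows: sort every point's distance to its k-th nearest neighbor and look for the knee of the curve as a candidate ε. This is a heuristic sketch, with k conventionally set to min_samples:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

k = 5  # match min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)      # each row includes the point itself
k_dist = np.sort(distances[:, -1])   # distance to the k-th neighbor, sorted

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to k-th nearest neighbor")
plt.show()  # the 'knee' of the curve is a candidate eps
```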
DBSCAN vs K-Means

Criterion              | DBSCAN                            | K-Means
Cluster shape          | Arbitrary (handles non-convex)    | Spherical
Noise handling         | Explicit outlier detection        | Forces all points into clusters
Parameter sensitivity  | Critical ε/min_samples selection  | Requires predefined k
Density variation      | Struggles with varying densities  | Uniform density assumption
Complexity             | O(n log n) with spatial indexing  | O(nk) per iteration

Clustering Algorithms

K-Means, Mini-Batch K-Means, DBSCAN, OPTICS, Hierarchical clustering, Gaussian Mixture Models, Manifold Learning
Source: https://scikit-learn.org/stable/modules/clustering.html#

Dimensionality Reduction

The Curse of Dimensionality: high-dimensional data (e.g., images with thousands of pixels, genomic sequences) often contains redundancies and noise. Dimensionality reduction simplifies such data.

Principal Component Analysis (PCA)

PCA identifies latent variables: directions of maximum variance that encode the most informative features. In other words, PCA looks for projections of the data that maximise variance.
Source: https://builtin.com/data-science/step-step-explanation-principal-component-analysis

PCA + K-Means

The two techniques combine naturally: project the data onto its leading principal components with PCA, then cluster the projection with K-means, as sketched below.
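A minimal sketch of that pipeline with scikit-learn. The digits dataset, the choice of 2 components, and k = 10 are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 64-dimensional digit images: a natural candidate for reduction
X, _ = load_digits(return_X_y=True)

# Project onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance per component

# Cluster in the reduced space (10 digit classes suggests k = 10)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_reduced)
```

In practice n_components is often chosen so that the cumulative explained variance ratio is high enough (e.g., 0.95) rather than fixed at 2; 2 is used here only so the result could be plotted.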
More Dimensionality Reduction (Unsupervised)

- PCA
- Singular Value Decomposition (SVD)
- Non-Negative Matrix Factorization (NMF)
- Random projections
- UMAP (Uniform Manifold Approximation and Projection)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Independent Component Analysis (ICA)
- Feature Agglomeration

More Unsupervised Learning

- Clustering
- Dimensionality Reduction
- Association Rule Mining
- Anomaly Detection

Reading

- Chapter 9: Unsupervised Learning Techniques
- Chapter 20: Clustering
- Chapter 5: Visualisation
- Chapter 10: Unsupervised Learning and Dimensionality Reduction