Document Details


Uploaded by InfallibleLawrencium3753

Northwestern University

Tags

PCA, dimensionality reduction, machine learning, data analysis

Summary

These lecture notes cover Principal Component Analysis (PCA), a dimensionality reduction technique in machine learning. The notes discuss the theory behind PCA, including its cost function and relationship to singular value decomposition. Practical implementation using NumPy is also demonstrated.

Full Transcript


Last Time
Lecture: unsupervised learning, clustering (K-means).
After-class assignment: gain a good understanding of K-means clustering; explore other clustering algorithms.

Last Time: Supervised Learning
Supervised learning: learning the mapping from inputs to labels.

Last Time: Unsupervised Clustering
[Figure: example data partitioned into Cluster 1, Cluster 2, and Cluster 3.]

Last Time: K-means Algorithm
Step 0: Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K.
Repeat {
Step 1: Assign points to cluster centroids.
Step 2: Move cluster centroids.
}
Cost objective (distortion): J = (1/m) Σ_i ||x^(i) - μ_{c^(i)}||^2, the average squared distance from each point to its assigned centroid.

Last Time: Drawbacks of K-means Clustering
While K-means is a widely used and efficient clustering algorithm, it has several drawbacks:
- Dependence on initial centroids
- Choosing the number of clusters (K)
- Not suitable for complex cluster shapes
- Sensitive to outliers
Despite these drawbacks, K-means remains a popular choice for clustering tasks due to its simplicity, scalability, and efficiency on large datasets. However, it is important to be aware of its limitations and to consider alternative clustering algorithms when dealing with complex datasets or when the assumptions of K-means are violated.

K-means
- MiniBatch version is fast (big data).
- Dependence on initial centroids: random initialization, K-means++.
- Choosing the number of clusters (K): elbow method, silhouette score.
- Feature scaling is required.
- Tends to find even-sized clusters.
- Bad with non-spherical cluster shapes.

DBSCAN
- Density based; you have to try epsilon (eps) and min_samples values.
- Too small an epsilon (too many clusters) is not trustworthy.
- Can find uneven cluster sizes.
- Full distance metric options (metric='euclidean').
- Can handle tons of data and weird shapes.

Hierarchical Clustering (Ward)
- Gives a full hierarchy tree, useful for some problems.
- Have to try k values, like K-means.
- Finds uneven cluster sizes (one is big, some are tiny).
- Many distance metric and linkage options.
- Can be slow to compute.

Today's Agenda
Lecture: dimensionality reduction, principal component analysis (PCA).

Dimensionality Reduction
Reduce features by:
- Feature selection (from the sequence course)
- PCA: feature transformation

Why dimensionality reduction?
- Gain insight about the data (exploratory analysis)
- Save computation/memory
- As a preprocessing step to reduce overfitting
- Data visualization in 2/3 dimensions

Principal Component Analysis (PCA)
Linear dimensionality reduction to 1 dimension: turn X into Xw. It is enough to consider only unit vectors, ||w|| = 1.

If we project all instances onto the red line (the candidate component), we get a 1-D representation. How to choose w?
1. To minimize the reconstruction (projection) error.
2. To maximize the variance.
Surprising fact: these are equivalent, and PCA does both!

Maximizing Variance = Minimizing Error
Assume all features are centered.
[Figure: the projection onto component x loses less variance than the projection onto component y.]

PCA vs. Regression
[Figure: PCA minimizes the perpendicular projection distances onto the component, while linear regression minimizes the vertical residuals.]

Background (optional): PCA Cost Function
Maximize the variance of the projected data, given that all features are centered:
J(w) = (1/m) w^T X^T X w = w^T C w, s.t. ||w|| = 1
where w is the direction (line) that captures the most variance. Goal: maximize J(w).
Let p = Xw, the projection of X onto w (since we impose w^T w = 1). The variance of the projected data is (1/m) p^T p, and C = (1/m) X^T X is the sample covariance matrix.

Maximizing w^T C w
J(w) = (1/m) w^T X^T X w = w^T C w, s.t. ||w|| = 1
We can use a Lagrange multiplier to solve this constrained problem:
J(w, λ) = w^T C w - λ (w^T w - 1)
Setting ∂J/∂w = 0 gives Cw = λw. This means w is an eigenvector of C, and λ is the eigenvalue associated with the eigenvector w. To maximize w^T C w = λ, choose the eigenvector with the largest eigenvalue λ!

How to implement this using NumPy? Two main functions compute eigenvalues and eigenvectors (a sketch follows this list):
- numpy.linalg.eig()
- numpy.linalg.eigh() (preferred for symmetric matrices due to its numerical stability and efficiency)
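The lecture stops at naming these NumPy routines; below is a minimal sketch of how the eigendecomposition route could look in practice. The function name, the random data, and the choice of two components are illustrative, not from the lecture.

```python
import numpy as np

def pca_eigh(X, n_components=2):
    """PCA via eigendecomposition of the sample covariance matrix."""
    m = X.shape[0]
    Xc = X - X.mean(axis=0)               # center the data (PCA assumes centered features)
    C = (Xc.T @ Xc) / m                   # sample covariance matrix, shape (d, d)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: for symmetric matrices; eigenvalues come back ascending
    order = np.argsort(eigvals)[::-1]     # re-sort so PC1 has the largest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :n_components]         # top principal components as columns
    return Xc @ W, eigvals                # projected data and all per-PC variances (descending)

# Example usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z, variances = pca_eigh(X, n_components=2)
print(Z.shape)                            # (200, 2)
print(variances / variances.sum())        # fraction of total variance captured by each PC
```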
Covariance Matrix
The covariances between dimensions can be collected in a matrix, e.g. for 3 dimensions:
C = [ var(x)    cov(x,y)  cov(x,z)
      cov(y,x)  var(y)    cov(y,z)
      cov(z,x)  cov(z,y)  var(z)  ]
The diagonal holds the variances of x, y and z. Since cov(x, y) = cov(y, x), the matrix is symmetric about the diagonal. N-dimensional data results in an N x N covariance matrix.

Spectral Theorem
C is a symmetric n x n matrix, so one can prove that it has n eigenvectors that are all orthogonal to each other. If w_1^T w_2 = 0 for eigenvectors w_1 and w_2, then w_1^T C w_2 = 0, i.e. the projections onto two eigenvectors have zero correlation. This implies that in the eigenvector basis the covariance matrix becomes diagonal. The total variance is the sum of the eigenvalues (the trace of C).

N-dimensional Data
There are N PCs for an N-dimensional dataset. Each PC captures a certain portion of the total variance in the dataset. Note the descending order: PC1 always captures the largest variance, and there is less remaining variance for each subsequent PC.

Relationship to SVD
Consider the singular value decomposition X = U S V^T. Then
C = (1/n) X^T X = (1/n) V S U^T U S V^T = (1/n) V S^2 V^T = V Λ V^T
This is the eigendecomposition, with Λ = S^2 / n. Note: this only holds if X is centered!

PCA for Data Exploration
https://stats.stackexchange.com/questions/7860/

Feature Scaling in PCA
If features are on different scales, it can make sense to standardize all of them (making C the correlation matrix).
https://stats.stackexchange.com/questions/53/

Picking the Number of Components
How to choose the number of PCs? There are rules of thumb: look for an "elbow" in the variance plot; capture 90% of the total variance; etc. A better criterion is cross-validation. (A sketch of variance-based selection appears at the end of these notes.)

PCA Variants
Variants have been developed to handle large-scale and high-dimensional datasets:
- PCA: full SVD, randomized SVD, truncated SVD (arpack)
- IncrementalPCA
- SparsePCA
- TruncatedSVD

Moving Beyond Linearity
Transformations calculated with PCA/SVD are linear, but data can have non-linear structure, which can cause linear dimensionality reduction to fail. Solution: kernels can be used to perform non-linear PCA (see the sketch at the end of these notes).

Next Time: Manifold Learning
Lecture: t-SNE, UMAP.
Reference: Tuebingen machine learning
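As a wrap-up of the Relationship to SVD and Picking the Number of Components sections above, here is a hedged NumPy sketch. The random data and the 90% threshold are illustrative; it checks that the singular values of centered X reproduce the eigenvalues of C via Λ = S^2 / n, then uses cumulative explained variance to pick the number of PCs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                    # the SVD/eigendecomposition relationship requires centered X
n = Xc.shape[0]

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_from_svd = S**2 / n                # Λ = S^2 / n: eigenvalues of C = (1/n) X^T X, in descending order

C = Xc.T @ Xc / n
eigvals, _ = np.linalg.eigh(C)
print(np.allclose(np.sort(eigvals_from_svd), np.sort(eigvals)))  # True: both routes give the same spectrum

# Rule of thumb: keep enough PCs to capture 90% of the total variance
explained = eigvals_from_svd / eigvals_from_svd.sum()
cumulative = np.cumsum(explained)
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"{k} components capture {cumulative[k - 1]:.1%} of the variance")
```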
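For the Moving Beyond Linearity section above, a brief sketch of non-linear PCA with a kernel, assuming scikit-learn is available. The half-moons dataset and the RBF kernel parameters are illustrative choices, not from the lecture.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

# Two interleaved half-moons: a shape that a linear projection cannot "unfold"
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Linear PCA only rotates the data, so the two moons stay entangled
Z_linear = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel performs PCA in an implicit non-linear feature space
Z_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=15).fit_transform(X)

print(Z_linear.shape, Z_kernel.shape)      # (200, 2) (200, 2)
```

In the kernel PC space the two moons can become much easier to separate, which a purely linear projection cannot achieve.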
