Summary

These lecture notes cover Principal Component Analysis (PCA), a dimensionality reduction technique in machine learning. The notes discuss the theory behind PCA, including its cost function and relationship to singular value decomposition. Practical implementation using NumPy is also demonstrated.

Full Transcript

Last Time
- Lecture: Unsupervised learning; K-means clustering
- After-class assignment: gain a good understanding of K-means clustering; explore other clustering algorithms

Last Time: Supervised learning: learning the mapping.

Last Time: Unsupervised clustering (figure: data grouped into Cluster 1, Cluster 2, Cluster 3).

Last Time: K-means algorithm
- Step 0: Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K
- Repeat {
    Step 1: Assign points to cluster centroids
    Step 2: Move cluster centroids
  }
- Cost objective: minimize the average squared distance between each point and its assigned centroid

Last Time: Drawbacks of K-means clustering
While K-means is a widely used and efficient clustering algorithm, it has several drawbacks:
- Dependence on the initial centroids
- Choosing the number of clusters (K)
- Not suitable for complex cluster shapes
- Sensitive to outliers
Despite these drawbacks, K-means remains a popular choice for clustering tasks due to its simplicity, scalability, and efficiency on large datasets. However, it is important to be aware of its limitations and to consider alternative clustering algorithms when dealing with complex datasets or when the assumptions of K-means are violated.

K-means
- MiniBatch version is fast (big data)
- Dependence on initial centroids: random initialization, K-means++
- Choosing the number of clusters (K): elbow method, silhouette score
- Feature scaling is required
- Tends to find evenly sized clusters
- Bad with non-spherical cluster shapes

DBSCAN
- Density based; you have to try epsilon (eps) and min_samples values
- Too small an epsilon (too many clusters) is not trustworthy
- Can find unevenly sized clusters
- Full range of distance metric options (metric='euclidean')
- Can handle very large datasets and oddly shaped clusters

Hierarchical Clustering (Ward)
- Gives a full hierarchy tree, useful for some problems
- You have to try k values, as with K-means
- Finds unevenly sized clusters (one big, some tiny)
- Many distance metric and linkage options
- Can be slow to calculate

Today's agenda
- Lecture: Dimensionality reduction; Principal component analysis (PCA)

Dimensionality Reduction: reduce the number of features
- Feature selection (from the sequence course)
- PCA: feature transformation

Why dimensionality reduction?
- Gain insight about the data (exploratory analysis)
- Save computation and memory
- As a preprocessing step to reduce overfitting
- Data visualization in 2 or 3 dimensions

Principal Component Analysis (PCA)
- Linear dimensionality reduction to 1 dimension: turns X into Xw
- It is enough to consider only unit vectors, ||w|| = 1

Principal Component Analysis (PCA)
- If we project all instances onto the red line (figure), we get a 1-dimensional representation.
- How to choose w?
  1. To minimize the reconstruction (projection) error
  2. To maximize the variance
- Surprising fact: these are equivalent, and PCA does both!

Maximizing variance ⇔ minimizing error
- Assume all features are centered.
- (figure) Projecting onto component x loses less variance than projecting onto component y.

PCA vs. Regression
- (figure) PCA minimizes the orthogonal projection distance to the line, while linear regression minimizes the residual in the target variable.

Background (optional): PCA Cost Function
Maximize the variance of the projected data, given that all features are centered:
  J(w) = (1/m) w^T X^T X w = w^T C w,  subject to ||w|| = 1
- w is the direction (line) that captures the most variance; the goal is to maximize J(w).
- Let p = Xw, the projection of X onto w (since we impose w^T w = 1).
- The variance of the projected data is (1/m) p^T p.
- C = (1/m) X^T X is the sample covariance matrix.
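To make the cost function concrete, here is a minimal NumPy sketch (not part of the original slides; the toy data and the choice of w are made up for illustration) checking numerically that the variance of the projection p = Xw equals w^T C w for a unit vector w when X is centered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m samples, 2 correlated features (made up for illustration)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)          # center the features (PCA assumes this)
m = X.shape[0]

C = X.T @ X / m                 # sample covariance matrix, C = (1/m) X^T X

w = np.array([2.0, 1.0])
w = w / np.linalg.norm(w)       # unit vector, ||w|| = 1

p = X @ w                       # projection of X onto w
var_projection = p @ p / m      # variance of the projected data, (1/m) p^T p
cost = w @ C @ w                # J(w) = w^T C w

print(var_projection, cost)     # the two values agree (up to floating point)
```

Sweeping w over all unit vectors and picking the one with the largest value of this quantity is exactly what the maximization below solves in closed form.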
Maximizing w^T C w
  J(w) = (1/m) w^T X^T X w = w^T C w,  subject to ||w|| = 1
We can use a Lagrange multiplier to solve this constrained problem:
  J(w) = w^T C w - λ (w^T w - 1)
Setting ∂J/∂w = 0, we get:
  Cw = λw
This means w is an eigenvector of C and λ is the eigenvalue associated with w. To maximize w^T C w = λ, choose the eigenvector with the largest eigenvalue λ!

How to implement using NumPy?
Two main functions compute eigenvalues and eigenvectors (a worked sketch is given at the end of these notes):
- numpy.linalg.eig()
- numpy.linalg.eigh() (preferred for symmetric matrices due to its numerical stability and efficiency)

Covariance Matrix
Covariance between dimensions can be represented as a matrix, e.g. for 3 dimensions:
  C = [ cov(x,x)  cov(x,y)  cov(x,z)
        cov(y,x)  cov(y,y)  cov(y,z)
        cov(z,x)  cov(z,y)  cov(z,z) ]
- The diagonal holds the variances of x, y and z.
- cov(x, y) = cov(y, x), hence the matrix is symmetric about the diagonal.
- N-dimensional data results in an N×N covariance matrix.

Spectral theorem
- C is a symmetric n×n matrix; one can prove that it has n eigenvectors that are all orthogonal to each other.
- If w_1^T w_2 = 0 for eigenvectors w_1 and w_2, then w_1^T C w_2 = 0, i.e. the projections onto two eigenvectors have zero correlation.
- This implies that in the eigenvector basis the covariance matrix becomes diagonal, with the eigenvalues on the diagonal.
- Total variance: the sum of the eigenvalues.

N-dimensional data
- There are N PCs for an N-dimensional dataset; each PC captures a certain portion of the total variance in the dataset.
- Note the descending order: PC1 always captures the largest variance, and there is less remaining variance for each subsequent PC.

Relationship to SVD
Consider the singular value decomposition X = U S V^T. Then
  C = (1/n) X^T X = (1/n) V S U^T U S V^T = V (S^2 / n) V^T = V Λ V^T
This is the eigendecomposition of C, with Λ = S^2 / n. Note: this only holds if X is centered!

PCA for data exploration
https://stats.stackexchange.com/questions/7860/

Feature Scaling in PCA
If features are on different scales, it can make sense to standardize all of them (making C the correlation matrix).
https://stats.stackexchange.com/questions/53/

Picking the number of components
How to choose the number of PCs? There are rules of thumb: look for an "elbow"; capture 90% of the total variance; etc. A better criterion is cross-validation.

PCA variants have been developed to handle large-scale and high-dimensional datasets:
- PCA: full SVD, randomized SVD, truncated SVD (arpack)
- IncrementalPCA
- SparsePCA
- TruncatedSVD

Moving beyond linearity
- Transformations computed with PCA/SVD are linear, but data can have non-linear structure, which can cause linear dimensionality reduction to fail.
- Solution: kernels can be used to perform non-linear PCA (kernel PCA).

Next Time: Manifold Learning
- Lecture: t-SNE, UMAP

Reference: Tuebingen machine learning
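As a companion to the NumPy functions mentioned above, here is a minimal sketch of the PCA recipe from these notes: center X, form C = X^T X / n, eigendecompose it with numpy.linalg.eigh, sort the components by descending eigenvalue, and cross-check the eigenvalues against S^2 / n from the SVD. The toy data is made up for illustration; this is a sketch of the idea, not the lecture's own code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples, 3 features (made up for illustration)
n = 1000
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)                     # PCA assumes centered features

# 1) Eigendecomposition of the sample covariance matrix
C = X.T @ X / n                            # 3x3 symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)       # eigh: preferred for symmetric matrices

# eigh returns eigenvalues in ascending order; PCA wants descending
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]                   # lambda_1 >= lambda_2 >= lambda_3
W = eigvecs[:, order]                      # columns are the principal components

explained_ratio = eigvals / eigvals.sum()  # fraction of total variance per PC
X_reduced = X @ W[:, :2]                   # project onto the first two PCs

# 2) Cross-check with the SVD: X = U S V^T, so the eigenvalues of C are S^2 / n
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(eigvals, S**2 / n))      # True
print(explained_ratio)
```

Dividing by n or by n - 1 only rescales the eigenvalues; the principal directions and the explained-variance ratios are unchanged.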
