Principal Component Analysis PDF
Summary
This document provides a detailed explanation of Principal Component Analysis (PCA), focusing on its mathematical underpinnings and practical applications. It covers key concepts such as unsupervised learning, vector spaces, and manifold representations.
Full Transcript
Principal Component Analysis: Idea

High-dimensional data can often be represented using a much lower-dimensional space. This happens when the data lives near a linear manifold in the high-dimensional space. If we find this manifold, we can project the data onto it and use the projection to represent the data without losing much information.

Manifold

A manifold is a topological space that locally (near each point) resembles Euclidean space. Consider the upper half of the unit circle, $x^2 + y^2 = 1$ with $y \ge 0$ (the yellow arc). Every point on this arc can be uniquely identified by its x-coordinate, $\chi_{\text{top}}(x, y) = x$. This projection onto the first coordinate defines a smooth and invertible mapping from the upper arc to the open interval $(-1, 1)$. Functions that provide a one-to-one correspondence between open regions of a surface and subsets of Euclidean space are called charts.

Charts

Each chart can be seen as a mapping $\varphi : \mathbb{R}^1 \to S \subset \mathbb{R}^2$. $\varphi$ must be smooth and invertible (a diffeomorphism); the key property is that both the function and its inverse are continuously differentiable. The domain of $\varphi$ is the parametric space and is Euclidean; the image of $\varphi$ is the embedding and is a surface. Manifolds, in the end, are unions of charts.

Unsupervised Learning

Unsupervised learning focuses on effectively representing datasets that lack labeled outputs, meaning we work solely with input data. The primary goal is to uncover meaningful structures or representations within the data. This is closely related to the idea of a spanning set of vectors in basic linear algebra, where the aim is to represent a space efficiently using a minimal and meaningful set of components.

Vector Space

When visualizing data points in a multi-dimensional vector space, we can represent them either as dots (left panel) or as arrows (middle panel). To understand a basis, it is helpful to use both conventions simultaneously (right panel): some points as arrows (a basis or spanning set) and others as dots (the points to be represented). The basis vectors are used to efficiently reconstruct all other points in the space.

Basis representation

Suppose our dataset consists of N input points, $\{x_1, x_2, \ldots, x_N\}$, each living in D-dimensional space. For simplicity, we assume the dataset has been mean-centered (the mean is subtracted along each dimension), so the data is centered at the origin. To perfectly represent all N points, the basis $\{c_1, c_2, \ldots, c_D\}$ must also reside in the same D-dimensional space. For any D-dimensional data point $x_n$, there must exist a set of weights $w_{1,n}, \ldots, w_{D,n}$ such that the basis, in a specific linear combination, reconstructs the data point:

$\sum_{d=1}^{D} c_d \, w_{d,n} = x_n$

This requires the basis vectors to be linearly independent, meaning they do not overlap and point in distinct directions, ensuring they span the entire D-dimensional space.

Standard basis

As a simple example, consider the spanning set given by the D standard basis vectors. Each standard basis vector consists of zeros everywhere except for a 1 in the k-th position: $c_k = [0, 0, \ldots, 1, \ldots, 0]^T$, where the 1 is in the k-th slot. Key properties: the standard basis vectors are orthonormal, i.e. $c_j^T c_k = 0$ for $j \neq k$ and $c_k^T c_k = 1$, or, stacking them as the columns of a matrix $C$, $C^T C = I_{D \times D}$. Representing a data point using the standard basis is straightforward: the weights are simply the values of the data point itself, $w_{d,n} = x_{d,n}$, so the new representation of the point $x_n$ is exactly $w_n$. For almost any other spanning set, however, these weights must be solved for numerically.

Example

Once tuned, the weight vector $w_n$ provides the representation (or encoding, or embedding) of $x_n$ in terms of the spanning set $c_1, \ldots, c_D$. How do we find the proper weights?
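As a concrete illustration of both cases, here is a minimal NumPy sketch; the randomly drawn spanning set and the names `C`, `x_n`, and `w_n` are assumptions chosen to mirror the notation above, not anything prescribed by the text. With the standard basis the weights are the coordinates themselves, while a generic spanning set requires a numerical solve.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 5
x_n = rng.normal(size=D)          # a (mean-centered) data point in R^D

# Standard basis: the columns of the identity matrix.
# The weights are just the coordinates of the point itself.
C_standard = np.eye(D)
w_standard = x_n.copy()           # w_{d,n} = x_{d,n}
assert np.allclose(C_standard @ w_standard, x_n)

# A generic (linearly independent, non-orthonormal) spanning set:
# the weights must be solved for numerically, e.g. via least squares.
C = rng.normal(size=(D, D))
w_n, *_ = np.linalg.lstsq(C, x_n, rcond=None)
assert np.allclose(C @ w_n, x_n)  # the linear combination C w_n reconstructs x_n
```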
To find the weights in general, we minimize a Least Squares cost that measures how well the linear combination reconstructs the point,

$g(w_n) = \left\| C w_n - x_n \right\|_2^2$

where C is the D×D matrix whose columns are the basis vectors, $w_n$ is the learned weight vector, and $x_n$ the original input. By setting the gradient of this cost to zero and solving for $w_n$, we obtain a linear, symmetric system of equations:

$C^T C \, w_n = C^T x_n$

Orthonormal basis?

When the spanning set is orthonormal, the algebraic formula for the weight vector (or encoding) $w_p$ of a point $x_p$ becomes straightforward:

$w_p = C^T x_p$

This shows that, when the spanning set is orthonormal, the entire set of encodings for a dataset can be expressed directly in terms of the spanning set and the data itself:

$C C^T x_p = x_p$

The operation $C C^T$ acts as a projection matrix, ensuring that each data point $x_p$ is perfectly represented by the orthonormal basis, with no need for further adjustment or solving systems of equations.

Into a Smaller Dimension

We previously discussed two key requirements for a spanning set or basis to perfectly represent points in a generic D-dimensional space:
1. The vectors must be linearly independent, meaning they point in different directions within the space.
2. The set must contain at least D vectors to span the entire space.
But what happens if we relax the second condition and consider a case with fewer spanning vectors, specifically $K \le D$?

Lower dimension

This does not change much. While we may no longer be able to perfectly represent a given point or set of points, we can still approximate it very well using K spanning vectors; C is now D×K. Once the weight vector $w_p$ is computed, the projection of $x_p$ (its representation in the subspace spanned by C) is given by $C w_p$. This projection amounts to "dropping" $x_p$ perpendicularly onto the subspace formed by the K basis vectors. The weight vector $w_p$ gives the encoded representation of $x_p$ over the spanning set C, and the decoded version of $x_p$ is simply the projection of the original data point onto the subspace defined by the spanning vectors, $C w_p$.

PCA

We will learn both an appropriate basis and the corresponding weights. This approach, where the basis is learned alongside the weights, is known as Principal Component Analysis (PCA). The only change is that, since we aim to learn the basis C as well, it is added to the list of variables over which we minimize the original Least Squares cost function:

$g(w_1, \ldots, w_N, C) = \sum_{n=1}^{N} \left\| C w_n - x_n \right\|_2^2$

Learn an orthonormal basis

If we constrain our search to matrices C with orthonormal columns, $C^T C = I_{K \times K}$, the PCA Least Squares cost function simplifies as follows:

$g(C) = \sum_{n=1}^{N} \left\| C C^T x_n - x_n \right\|_2^2$

This is significant because, under the orthonormality constraint, the cost function no longer depends on the weight vectors $w_n$ and is only a function of C.

Autoencoder

This simplified PCA Least Squares cost function is known as the autoencoder. The name reflects the fact that, by minimizing the cost, we learn both the encoding (via the weights $w_p = C^T x_p$) and the decoding (via the projection $C w_p$) for each data point. In this form, we aim to encode and decode each point in terms of itself.

The solution?

The classic orthogonal PCA basis is the minimizer of the autoencoder cost function. The elements of this basis point in the orthogonal directions of variance of the dataset, that is, the orthogonal directions in which the dataset is most spread out. It is a closed-form solution!
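Before naming that closed-form basis, the orthonormal encode/decode relations above are easy to check numerically. The sketch below builds an arbitrary orthonormal set of K < D vectors via a QR factorization (an assumption made only for illustration; this is not the PCA basis itself) and verifies that $w_p = C^T x_p$ encodes a point and $C w_p$ drops it perpendicularly onto the subspace.

```python
import numpy as np

rng = np.random.default_rng(1)

D, K = 5, 2
x_p = rng.normal(size=D)                       # a mean-centered point in R^D

# An orthonormal spanning set of K < D vectors, obtained here (as an
# illustrative assumption) from the QR factorization of a random matrix.
C, _ = np.linalg.qr(rng.normal(size=(D, K)))   # C is D x K with C^T C = I_K
assert np.allclose(C.T @ C, np.eye(K))

# Encoding: w_p = C^T x_p (no linear system to solve).
w_p = C.T @ x_p

# Decoding: the projection of x_p onto the subspace spanned by the columns of C.
x_decoded = C @ w_p                            # = C C^T x_p

# The residual x_p - C C^T x_p is perpendicular to the subspace,
# i.e. x_p is "dropped" onto the subspace at a right angle.
assert np.allclose(C.T @ (x_p - x_decoded), 0.0)
```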
The elements of the PCA basis are so special that they are given the formal name principal components.

Analytical Solution

Given the data matrix X (N×D), the principal component basis can be computed (as a minimizer of the autoencoder cost function) as the eigenvectors of the covariance matrix of the data. The eigenvector/eigenvalue decomposition of this covariance matrix is

$\frac{1}{N} X^T X = V D V^T$

where V contains the eigenvectors and D is the diagonal matrix of eigenvalues. The orthonormal basis we recover is precisely given by the eigenvectors, C = V (the principal components of the data). Additionally, the variance along each principal component direction is exactly the corresponding eigenvalue in D.

References

All the images were taken from:
https://kenndanielso.github.io/mlrefined/blog_posts/11_Linear_unsupervised_learning/11_1_Spanning_sets_orthonormality_projections.html
https://kenndanielso.github.io/mlrefined/blog_posts/11_Linear_unsupervised_learning/11_2_Principal_Component_Analysis.html
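The Analytical Solution above is straightforward to verify numerically. The following minimal NumPy sketch is only an illustration: the synthetic data, the noise level, and the choice K = 2 are assumptions, not part of the original text. It forms the covariance matrix of the mean-centered data, takes its eigenvectors as the principal components, and uses the top K of them as the orthonormal basis C.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (an illustrative assumption): N points in D dimensions
# lying close to a 2-dimensional linear subspace.
N, D, K = 200, 5, 2
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D)) + 0.05 * rng.normal(size=(N, D))

# Mean-center the data, as assumed throughout.
X = X - X.mean(axis=0)

# Covariance matrix of the mean-centered data (D x D).
cov = (X.T @ X) / N

# Eigendecomposition: the columns of V are the principal components,
# and the eigenvalues are the variances along those directions.
eigvals, V = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # reorder by decreasing variance
eigvals, V = eigvals[order], V[:, order]

# Keep the top-K principal components as the orthonormal basis C.
C = V[:, :K]
W = X @ C                                  # encodings, one row per point
X_decoded = W @ C.T                        # projections back into R^D

print("variance captured:", eigvals[:K].sum() / eigvals.sum())
print("reconstruction error:", np.linalg.norm(X - X_decoded))
```

Because the synthetic data lies close to a 2-dimensional subspace, the top two eigenvalues capture almost all of the variance and the reconstruction error is small.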