National Higher School of Mathematics (NHSM), Data Mining II
National Higher School of Mathematics (NHSM), 2024
Ayoub Asri
Summary
This document is a lecture from the National Higher School of Mathematics (NHSM) on Data Mining II, covering dimensionality reduction. It details the motivation and algorithms including Locally Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE), Spectral Embedding, Uniform Manifold Approximation and Projection (UMAP), and Probabilistic PCA (PPCA).
Full Transcript
Course 3 - Dimensionality Reduction - PPCA, t-SNE and UMAP
"National Higher School of Mathematics (NHSM)", "Data Mining II"
4th year Statistics, Econometrics for the Actuarial Sciences
Ayoub Asri, 25 October 2024

Outline
1 Motivation
2 Linear Dimensionality Reduction
3 The Locally Linear Embedding (LLE) Algorithm
4 The t-distributed Stochastic Neighbor Embedding (t-SNE)
5 Spectral Embedding
6 The Uniform Manifold Approximation & Projection (UMAP) Algorithm
7 Other Dimensionality Reduction Subjects
8 References

Section 1: Motivation

Problems in High Dimensions I
Real-world datasets are often unstructured and high-dimensional (e.g., images, sounds, time series):
- Not well-suited for visualization.
- Suffer from the curse of dimensionality: the number of samples needed to faithfully represent an n-dimensional space becomes impractical. Example: 100 points on the interval [0, 1] correspond to 10^4 points on the square, 10^6 points on the cube, and 10^{2n} points on the hypercube in dimension n!
- Noisy dimensions reduce the discriminative power of the Euclidean distance.

Problems in High Dimensions II
The minimal and maximal distances between a reference point q and the points of a dataset D = {x_1, x_2, ..., x_n} become indistinguishable as the dimension d grows:

$$\lim_{d \to +\infty} \mathbb{E}\left[\frac{\max_{x_j} \|q - x_j\| - \min_{x_j} \|q - x_j\|}{\min_{x_j} \|q - x_j\|}\right] = 0$$

Dimensionality reduction has two main benefits:
- Visualizing the structure of the data cloud.
- Simplifying data for subsequent processing (e.g., classification).

Practical Example
An image is represented by an 8 × 8 pixel matrix, i.e. a vector of dimension d = 64, which is difficult to visualize:

 0.  0. 11. 12.  0.  0.  0.  0.
 0.  2. 16. 16. 16. 13.  0.  0.
 0.  3. 16. 12. 10. 14.  0.  0.
 0.  1. 16.  1. 12. 15.  0.  0.
 0.  0. 13. 16.  9. 15.  2.  0.
 0.  0.  0.  3.  0.  9. 11.  0.
 0.  0.  0.  0.  9. 15.  4.  0.
 0.  0.  0. 12. 13.  3.  0.  0.

Manifold Learning
- High-dimensional datasets are difficult to visualize.
- Hypothesis: the observed dimensionality of the data is artificially large; the data is generated by a set of intrinsic variables (the "degrees of freedom").
- It is assumed that there are fewer degrees of freedom than variables in the observations. In other words, the observed variables are partially redundant or superfluous.
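A quick numerical check of the distance-concentration claim from "Problems in High Dimensions II": for points drawn uniformly at random, the relative gap between the farthest and nearest point from a query shrinks as the dimension grows. The snippet below is a minimal sketch of this experiment with NumPy; the sample sizes and dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_gap(d, n=1000):
    """(max - min) / min distance from a random query to n uniform points in dimension d."""
    X = rng.random((n, d))
    q = rng.random(d)
    dists = np.linalg.norm(X - q, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"d = {d:4d}  relative gap = {relative_gap(d):.3f}")
# The printed ratio decreases towards 0 as d grows, so the "nearest" and
# "farthest" points become almost indistinguishable in high dimension.
```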
Manifold Learning: Example
- Circle: a two-dimensional cloud of observations (x, y).
- The points satisfy the constraint x² + y² = 1.
- By reparametrizing the observations as x = cos θ, y = sin θ, we observe that the data is generated by a one-dimensional manifold (the segment [−π, π]).

Section 2: Linear Dimensionality Reduction

Principal Component Analysis (PCA): Recall I
Consider a data set {x^(i)}_{i=1}^N. Each input vector x^(i) ∈ R^D is approximated as

$$x^{(i)} \approx \tilde{x}^{(i)} = \bar{x} + U z^{(i)},$$

where $\bar{x} = \frac{1}{N}\sum_i x^{(i)}$ is the data mean, $U \in \mathbb{R}^{D \times K}$ is the orthogonal basis of the principal subspace, and $z^{(i)} \in \mathbb{R}^K$ is the code vector given by $z^{(i)} = U^\top (x^{(i)} - \bar{x})$.

Principal Component Analysis (PCA): Recall II
The matrix U is chosen to minimize the reconstruction error:

$$U^* = \arg\min_U \sum_i \left\| x^{(i)} - \bar{x} - U U^\top (x^{(i)} - \bar{x}) \right\|^2$$

We Are Looking for Directions
For example, in a two-dimensional problem, we are looking for the direction u_1 along which the data is well represented:
- the direction of highest variance,
- the direction of minimum reconstruction error.
They are the same!

Limitations of Linear Dimensionality Reduction
There is no linear transformation that can separate the two circles. What algorithms exist for non-linear dimensionality reduction?
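The PCA recall above translates directly into a few lines of linear algebra. The sketch below is a minimal NumPy illustration (not the course's reference implementation): it centres the data, takes the top-K eigenvectors of the empirical covariance as U, and reconstructs each point as x̄ + U Uᵀ(x − x̄). The synthetic data and the choice K = 2 are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 10, 2
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, D))  # data with low intrinsic dimension

x_bar = X.mean(axis=0)                       # data mean
Xc = X - x_bar                               # centred data
cov = Xc.T @ Xc / N                          # empirical covariance
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :K]                  # top-K principal directions (D x K)

Z = Xc @ U                                   # code vectors z = U^T (x - x_bar)
X_hat = x_bar + Z @ U.T                      # reconstructions x_bar + U z
print("mean reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```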
Section 3: The Locally Linear Embedding (LLE) Algorithm

Formal Framework
Consider a matrix of observations X with n samples of dimension p:
- We seek X′ of dimension q < p that accurately represents the structure of X;
- The neighbors x′_j of x′_i in the reduced space must correspond to the neighbors x_j of x_i in the original space.

Locally Linear Embedding (Roweis & Saul, 2000)
Principle: locally, the cloud of observations X is generated by an affine subspace.

1. For each point x_i ∈ X ⊂ R^p, compute the set V_i of its k nearest neighbors.

2. Solve

$$\arg\min_W \sum_i \Big\| x_i - \sum_{j \in V_i} w_{ij} x_j \Big\|^2 \quad \text{subject to} \quad \sum_j w_{ij} = 1,$$

i.e., find the weights W such that each x_i can be reconstructed by a linear combination of its neighbors.

3. Solve

$$\arg\min_Y \sum_i \Big\| y_i - \sum_{j \in V_i} w_{ij} y_j \Big\|^2 \quad \text{subject to} \quad y_i \in \mathbb{R}^q,$$

i.e., find the reduced matrix Y such that each y_i is reconstructed by the same linear combination of its neighbors.

Illustration
LLE finds the coefficients W such that the blue point can be reconstructed by the linear combination of its 6 nearest neighbors (in yellow). LLE then determines the coordinates of the corresponding points in the reduced space such that the reduced point can be reconstructed by the same linear combination of its neighbors.

Example: Dataset
[figure]

Example
The projection onto the first two components does not capture the structure of the dataset: the chosen components are arbitrary.

Example
Principal Component Analysis extracts the directions that explain the most variance but obscures the third dimension.

Example
The LLE neighborhood approach better respects the local structure and "unrolls" the surface of the dataset.

Neighborhoods ≠ Distances
LLE preserves neighborhoods, not distances. The reconstructions are local ⇒ the distances between distant points have no significance!
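As an illustration of the three LLE steps, the following sketch applies scikit-learn's LocallyLinearEmbedding to the classic swiss-roll surface (a stand-in for the example dataset above, which is not specified here) and compares it with a plain PCA projection; the parameter values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

# A 3-D surface with a 2-D intrinsic structure.
X, color = make_swiss_roll(n_samples=1500, random_state=0)

# Linear projection: keeps the directions of largest variance,
# but cannot "unroll" the surface.
X_pca = PCA(n_components=2).fit_transform(X)

# LLE: preserves each point's local neighborhood (k nearest neighbors)
# and finds 2-D coordinates reconstructed by the same local weights.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

print("PCA embedding shape:", X_pca.shape)
print("LLE embedding shape:", X_lle.shape)
print("LLE reconstruction error:", lle.reconstruction_error_)
```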
Section 4: The t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE: General Principle
t-distributed Stochastic Neighbor Embedding, Maaten & Hinton (2008): a non-linear dimensionality reduction algorithm. The general principle is similar to LLE: seek a dataset in a reduced-dimensional space that exhibits the same local structures in the neighborhood of each point.
- The neighborhood of a point x_i is represented by the conditional probability p_{j|i} = p(x_j | x_i) that x_j is considered a "neighbor" of x_i.
- We seek points y_i in the reduced space for which p(y_j | y_i) ≃ p(x_j | x_i): the neighborhood distributions are similar in the original space and in the reduced space.

Similarity in the Observation (Original) Space
Consider a matrix X of n observations x_i of dimension D. To each pair (x_i, x_j) we associate the joint probability p_{ij} = p_{ji} defined by

$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}, \qquad \text{where } p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},$$

with the convention that p_{ii} = 0, so that the p_{ij} sum to 1. Each σ_i is chosen based on the desired perplexity (approximately the average number of neighbors).

The Objective of t-SNE
The objective of t-SNE is to obtain a reduced matrix Y of dimension n × d, d < D, such that the similarities q_{ij} approach the p_{ij}. For Y, the similarity is defined using a Student's t-distribution with 1 degree of freedom (a Cauchy distribution):

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|_2^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|_2^2\right)^{-1}},$$

with the convention that q_{ii} = 0 as well.

Choice of Distributions
- Gaussian similarity (original space): rapid decay, so neighbors are easily distinguished from non-neighbors; short tail, so distant values are all approximately 0 → all non-neighbors are treated as 0.
- Student's t similarity (reduced space): rapid decay, so neighbors are easily distinguished from non-neighbors; long tail, so points spread better in the space, with no "clumping."

Optimization
A measure of dissimilarity between distributions is the Kullback-Leibler divergence:

$$L_{t\text{-}SNE} = D_{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

To bring the q_{ij} close to the p_{ij}, it suffices to minimize L_{t-SNE}. Minimization is done by gradient descent: apply successive modifications to the parameters (here the y_i) of the optimized criterion (here L_{t-SNE}), in the opposite direction of the gradient of the criterion with respect to the parameters, until a minimum is reached.
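To make the objective concrete, the following sketch (a minimal NumPy illustration, not an efficient t-SNE implementation) computes the Student-t similarities q_ij for a given embedding Y, the KL divergence against a precomputed affinity matrix P, and the gradient formula given on the next slide; the random P and Y are shape-checking placeholders only.

```python
import numpy as np

def tsne_q_and_grad(Y, P):
    """Student-t similarities q_ij, KL(P||Q), and its gradient wrt the embedding Y."""
    # pairwise differences and squared distances in the embedding
    diff = Y[:, None, :] - Y[None, :, :]            # (n, n, d)
    dist2 = np.sum(diff ** 2, axis=-1)              # (n, n)
    inv = 1.0 / (1.0 + dist2)
    np.fill_diagonal(inv, 0.0)                      # convention q_ii = 0
    Q = inv / inv.sum()
    # KL divergence (only over pairs with p_ij > 0)
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))
    # gradient: dL/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^{-1}
    grad = 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)
    return Q, kl, grad

# tiny usage example with a random symmetric affinity matrix
rng = np.random.default_rng(0)
n = 50
P = rng.random((n, n)); P = P + P.T; np.fill_diagonal(P, 0.0); P /= P.sum()
Y = rng.normal(scale=1e-2, size=(n, 2))
Q, kl, grad = tsne_q_and_grad(Y, P)
print(kl, grad.shape)
```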
t-SNE Algorithm I
t-SNE modifies the points y_i to minimize L_{t-SNE} using gradient descent:
1. For each pair of points (x_i, x_j), calculate the similarity

$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}, \qquad \text{where } p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}.$$

2. Randomly initialize the points y_1, y_2, ..., y_n in the d-dimensional space.
3. While L_{t-SNE} ≥ ϵ, or the algorithm has not converged, or the maximum number of iterations has not been reached:

t-SNE Algorithm II
- Calculate the gradient of the KL divergence with respect to each y_i:

$$\frac{\partial L_{t\text{-}SNE}}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|_2^2\right)^{-1}$$

- Update each point y_i by:

$$y_i := y_i - \eta \frac{\partial L_{t\text{-}SNE}}{\partial y_i}$$

Example
The "S" dataset. The projection by t-SNE "unrolls" the surface into an S shape.

Initialization
The classical t-SNE algorithm initializes the points y_i randomly:
- high stochasticity,
- low preservation of the global structure,
- strong dependence of the results on initialization.
Solution: apply PCA to compute the initial y_i:
- preserves the global structure,
- better reproducibility,
- more robust in practice.

Perplexity
- Low perplexity: local structure; only immediate neighbors are considered.
- High perplexity: global structure; all points are considered as neighbors.

Optimization
The gradient descent in t-SNE involves two phases:
- Phase 1, early exaggeration: all conditional probabilities in the original space are multiplied by a factor γ to artificially increase the separation of natural groups in the data.
- Phase 2, final optimization: the actual values of the conditional probabilities are used to refine the local placement of observations within the groups.
There are heuristics for choosing the exaggeration factor and the learning rate, e.g. Belkina et al. (2019): η = max(n / (4γ), 50), among other empirical methods.
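A minimal usage sketch with scikit-learn's TSNE, wiring up the choices discussed above (PCA initialization, perplexity, early exaggeration, learning rate); the S-curve dataset stands in for the "S" example, and the parameter values are illustrative, not prescriptions.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import TSNE

X, color = make_s_curve(n_samples=1500, random_state=0)

tsne = TSNE(
    n_components=2,          # dimension of the reduced space
    perplexity=30.0,         # ~ average number of neighbors considered
    early_exaggeration=12.0, # factor gamma of phase 1
    learning_rate="auto",    # roughly max(n / (4 * gamma), 50), as in Belkina et al. (2019)
    init="pca",              # PCA initialization for reproducibility and global structure
    random_state=0,
)
Y = tsne.fit_transform(X)
print(Y.shape, "final KL divergence:", tsne.kl_divergence_)
```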
Approximation
t-SNE is a costly method due to its high complexity:
- gradient calculation: O(dn²),
- gradient for a single y_i: O(dn).
Barnes-Hut approximation, Maaten (2014):
- gradient calculation: O(dn log n),
- only for visualization (2D or 3D reduced space).

$$\frac{\partial L_{t\text{-}SNE}}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|_2^2\right)^{-1}$$

Limitations
Sensitivity to parameters and to the stochasticity of the method:
- t-SNE is a stochastic method: two runs do not necessarily yield the same projection.
- Initialization has a significant impact on the final result.
- The optimization parameters (exaggeration, learning rate, and especially perplexity) can greatly alter the final projection.
Interpretability is questionable, Wattenberg et al. (2016):
- The size of groups in t-SNE has no significance.
- The distance between groups does not (always) have significance.
- With low perplexity, noise in the data can artificially connect groups or, conversely, overly subdivide them.

Limitations
Practical disadvantages:
- Non-inductive method: it is impossible to project a new observation; the optimization must be redone!
- Non-invertible: it is impossible to retrieve the antecedent in the observation space of an arbitrary point of the reduced space.

Section 5: Spectral Embedding

Principle of Spectral Embedding
Dataset: consider a set E = {x_i, 1 ≤ i ≤ N} of N data points in dimension D, described by a similarity matrix S such that the element s_ij ≥ 0 represents the similarity between x_i and x_j.
Objective: distribute the data F = {y_i, 1 ≤ i ≤ N} in a lower-dimensional space of dimension d < D such that the distances ∥y_i − y_j∥ are small when the similarities s_ij are strong, and vice versa.
Approach (Luxburg, 2007):
1. Construct a similarity graph from S.
2. Using the graph, create a "suitable" vector representation of the data.

Graphs: Some Definitions
- Graph: an undirected graph G = (V, E), where V = {v_1, ..., v_N} is the set of N vertices and E is the set of edges.
- Weighted edges: for each edge connecting vertices v_i and v_j, let w_ij ≥ 0 denote the weight of the edge. No edge exists between v_i and v_j if w_ij = 0.
- Weight matrix: the weight matrix W has elements w_ij.
- Degree of a vertex: the degree d_i of vertex v_i is defined as d_i = Σ_{j=1}^N w_ij.
- Diagonal degree matrix: the diagonal degree matrix D has the degrees {d_1, ..., d_N} on its diagonal.
- Path (or chain): a path µ(v_i, v_j) between vertices v_i and v_j is a sequence of edges of E that connects the two vertices.
- Indicator vector of a subset: for a subset C ⊂ V, the indicator vector c is an N-component binary vector with c_i = 1 if vertex v_i ∈ C and c_i = 0 otherwise.
- Connected component: a subset C ⊂ V forms a connected component if for all v_i, v_j ∈ C there exists a path µ(v_i, v_j) connecting them, and furthermore, for every v_i ∈ C, any path µ(v_i, v_p) remains within C (i.e., v_p ∈ C).

Laplacian Matrix of a Graph
For an undirected graph with weighted edges, the Laplacian matrix L (also known as the unnormalized Laplacian) is defined as L = D − W.
Properties:
- Quadratic form: x^⊤ L x = (1/2) Σ_{i,j=1}^N w_ij (x_i − x_j)².
- The matrix L is symmetric and positive semi-definite ⇒ L has N real eigenvalues 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_N (eigenvalues λ and eigenvectors u with Lu = λu).
- The smallest eigenvalue of L is 0 (λ_1 = 0), and the corresponding eigenvector is the constant vector (all entries equal to 1).
Justification: since d_i = Σ_{j=1}^N w_ij, every row of L satisfies d_i + Σ_{j=1}^N (−w_ij) = 0; hence the constant vector is an eigenvector with eigenvalue 0.
Relation to connected components:
- The multiplicity of the eigenvalue 0 is equal to the number of connected components of the graph G.
- The indicator vectors of the connected components are eigenvectors corresponding to the eigenvalue 0.
Normalized Laplacian matrices:
- Random-walk normalized Laplacian: L_rw = D^{-1} L = I_N − D^{-1} W
- Symmetric normalized Laplacian: L_sym = D^{-1/2} L D^{-1/2} = I_N − D^{-1/2} W D^{-1/2}

Spectral Embedding: An Algorithm
Input: similarity matrix S, N × N. Output: reduced observations F = {y_i, 1 ≤ i ≤ N}.
1. Construct the similarity graph G from S (with w_ij = s_ij).
2. Compute the Laplacian matrix L.
3. Compute the k eigenvectors u_1, ..., u_k associated with the k smallest eigenvalues of the generalized problem Lu = λDu.
4. Let U_k be the N × k matrix whose columns are the vectors u_1, ..., u_k.
5. The y_i ∈ R^k, 1 ≤ i ≤ N, are the N rows of the matrix U_k.

Q. Can you specify why we choose the smallest eigenvalues in the spectral embedding? (Answer in the tutorials.)
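A minimal NumPy/SciPy sketch of the algorithm above, assuming a dense similarity matrix S is available; it solves the generalized eigenproblem Lu = λDu and uses the rows of the k eigenvectors with the smallest eigenvalues as the embedding. The toy matrix S is an arbitrary example with two weakly connected groups.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embedding(S, k):
    """Rows of the k generalized eigenvectors of L u = lambda D u with smallest eigenvalues."""
    W = np.asarray(S, dtype=float)        # similarity graph weights (w_ij = s_ij)
    D = np.diag(W.sum(axis=1))            # diagonal degree matrix
    L = D - W                             # unnormalized Laplacian
    # generalized symmetric eigenproblem, eigenvalues returned in ascending order
    eigvals, eigvecs = eigh(L, D)
    U_k = eigvecs[:, :k]                  # k eigenvectors with smallest eigenvalues
    return U_k                            # y_i is the i-th row of U_k

# tiny example: two obvious groups connected weakly
S = np.array([[1.0, 0.9, 0.8, 0.0, 0.1],
              [0.9, 1.0, 0.9, 0.0, 0.0],
              [0.8, 0.9, 1.0, 0.1, 0.0],
              [0.0, 0.0, 0.1, 1.0, 0.9],
              [0.1, 0.0, 0.0, 0.9, 1.0]])
Y = spectral_embedding(S, k=2)
print(Y)
```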
From Distance to Similarity
How can we define a similarity measure suitable for the data E = {x_i, 1 ≤ i ≤ N}?
- Understanding the data is very useful for choosing a good similarity measure.
- It is possible to define similarity measures based on kernel functions (e.g., a multivariate normal kernel if the data are vectorial).
- What matters most is the behavior of the similarity measure for very similar data points (very dissimilar data points will not end up in the same group).

Often, we have distances between the data points. Using the normal distribution to define similarities:

$$s_{ij} = \exp\left(-\frac{d(x_i, x_j)^2}{2\sigma^2}\right)$$

1. There is a challenge in choosing σ when different groups have very different "densities," especially since the normal distribution decays rapidly.
2. A heuristic: take σ on the order of the longest edge in the minimum spanning tree of the data E.

Choosing a Type of Similarity Graph
Transitioning from a similarity matrix S to a graph G:
1. ϵ-neighborhood: vertices v_i, v_j ∈ V are connected if s_ij ≥ ϵ. Choosing ϵ is difficult when different groups have very different "densities" (very different distances to the nearest neighbors within a group).
2. k-nearest neighbors (kNN): vertices v_i, v_j ∈ V are connected if v_i is among the k nearest neighbors of v_j or v_j is among the k nearest neighbors of v_i. This option handles groups with different "densities," but it tends to create links between groups. Robustness with respect to the choice of k has been observed → this option is recommended, at least as a first approach.
3. Mutual k-nearest neighbors: vertices v_i, v_j ∈ V are connected if v_i is among the k nearest neighbors of v_j and v_j is among the k nearest neighbors of v_i. This option handles groups with different "densities" and prevents links between groups with different densities.
4. Complete graph: all vertices are connected. Partitioning then depends entirely on the similarity measure.
Note: the more connected the graph, the higher the cost of the subsequent steps.
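A small sketch of the distance-to-similarity and graph-construction steps (Gaussian similarity restricted to a symmetrized kNN connectivity); σ and k are illustrative choices, not recommendations, and the resulting weight matrix W could feed the spectral_embedding sketch shown earlier.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_similarity_graph(X, k=10, sigma=None):
    """Gaussian similarities restricted to a symmetrized k-nearest-neighbor graph."""
    D = cdist(X, X)                                  # pairwise distances
    if sigma is None:
        sigma = np.median(D)                         # crude default bandwidth
    S = np.exp(-D**2 / (2.0 * sigma**2))             # s_ij = exp(-d^2 / (2 sigma^2))
    np.fill_diagonal(S, 0.0)
    # keep an edge if j is among the k nearest neighbors of i, then symmetrize ("or" rule)
    knn_mask = np.zeros_like(S, dtype=bool)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]          # skip self at position 0
    rows = np.repeat(np.arange(len(X)), k)
    knn_mask[rows, idx.ravel()] = True
    knn_mask = knn_mask | knn_mask.T
    return np.where(knn_mask, S, 0.0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)), rng.normal(3, 0.3, size=(50, 2))])
W = knn_similarity_graph(X, k=10)
print(W.shape, "edges:", int((W > 0).sum() / 2))
```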
Section 6: The Uniform Manifold Approximation & Projection (UMAP) Algorithm

General Principle of UMAP I
Uniform Manifold Approximation & Projection (UMAP), McInnes et al. (2018): a non-linear dimensionality reduction algorithm similar to t-SNE.
General idea: find a representation of the data in a lower dimension that has the same topology as the cloud of observations in the original space.
1. Define an appropriate similarity matrix S.
2. From S, construct a similarity graph.
3. Build a vector representation of the data in a lower-dimensional space that exhibits the same similarity graph.

General Principle of UMAP II
Major differences from t-SNE:
- Use a variable notion of perplexity to define a local similarity that depends on the density around each point.
- Consider a broader family of similarity functions, inspired by the Student's t-distribution, for the target space.

Adaptive Similarity
UMAP defines an adaptive similarity that varies according to the local density of the data:

$$p_{j|i} = \exp\left(-\frac{\|x_i - x_j\|_2 - \rho_i}{\sigma_i}\right)$$

where ρ_i is the distance between x_i and its nearest neighbor.

An adaptive Gaussian kernel with a local normalization of the perplexity: σ_i is fixed such that

$$\sum_{j=1}^{k} \exp\left(-\frac{\max\!\left(0, \|x_i - x_j\|_2 - \rho_i\right)}{\sigma_i}\right) = \log_2(k)$$

where k is a parameter that sets the number of nearest neighbors to consider.
For two points x_i and x_j, the joint similarity p_ij should be symmetric (p_ij = p_ji); it is obtained in UMAP by

$$p_{ij} = p_{i|j} + p_{j|i} - p_{i|j}\, p_{j|i}$$

This symmetrization is necessary because, in general, p_{i|j} ≠ p_{j|i}. For reference, t-SNE uses a different symmetrization: p_ij = (p_{i|j} + p_{j|i}) / (2n).

Similarity in the Target (Reduced) Space
The similarity in the target space is inspired by the t-distribution:

$$q_{ij} = \frac{1}{1 + a\|y_i - y_j\|^{2b}}$$

Its thicker tail avoids the crowding problem in the reduced space. Parameters a and b are chosen automatically so that this similarity is as close as possible to the piecewise-defined function

$$\begin{cases} 1 & \text{if } \|y_i - y_j\| \le \text{min\_dist} \\ e^{-(\|y_i - y_j\| - \text{min\_dist})} & \text{if } \|y_i - y_j\| > \text{min\_dist} \end{cases}$$

with min_dist being the minimum allowed distance in the target space between two points.
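A small sketch of how a and b can be obtained in practice: fit the curve 1/(1 + a d^{2b}) to the piecewise target above by non-linear least squares. This mirrors what the umap-learn package does internally, but the code below is a standalone illustration with SciPy; min_dist = 0.1 and the spread value are arbitrary example choices.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist, spread=1.0):
    """Fit a, b so that 1 / (1 + a d^(2b)) matches the piecewise target similarity."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    def q(d, a, b):
        return 1.0 / (1.0 + a * d ** (2.0 * b))
    (a, b), _ = curve_fit(q, d, target, p0=(1.0, 1.0))
    return a, b

a, b = fit_ab(min_dist=0.1)
print(f"a = {a:.3f}, b = {b:.3f}")  # smaller min_dist gives a tighter similarity curve
```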
Optimization
Like t-SNE, the goal is to make q_ij → p_ij. The objective function to minimize is the fuzzy cross-entropy between the q_ij and the p_ij:

$$CE(X, Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ p_{ij} \log\!\left(\frac{p_{ij}}{q_{ij}}\right) + (1 - p_{ij}) \log\!\left(\frac{1 - p_{ij}}{1 - q_{ij}}\right) \right]$$

Expanding the logarithms gives:

$$CE(X, Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ p_{ij} \log p_{ij} + (1 - p_{ij}) \log(1 - p_{ij}) \right] - \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ p_{ij} \log q_{ij} + (1 - p_{ij}) \log(1 - q_{ij}) \right]$$

The first term depends only on the original data and cannot be modified. Thus only the second term is minimized, by modifying the y_i (the projections in the target space) and hence the q_ij.

Update the points in the target space:

$$y_i := y_i - \eta \frac{\partial CE}{\partial y_i}$$

where η is the learning rate of the gradient descent algorithm. UMAP optimizes the objective function by stochastic gradient descent:
- To accelerate the calculation of CE on a large dataset, only a limited sample is considered to compute an approximate value.
- The sample contains a mix of neighbors (p_ij ≃ 1) and non-neighbors (p_ij ≃ 0) so as to optimize both terms of the cross-entropy simultaneously.

Initialization of Points in the Target Space
As with t-SNE, the initialization of the points in the target space can greatly influence the results of UMAP. In the case of t-SNE with early exaggeration, it has been observed that the first phase of the algorithm approximately corresponds to performing a spectral embedding, Linderman & Steinerberger (2019). The default initialization of UMAP therefore consists of performing a spectral embedding:
1. Less randomness during the initialization phase.
2. Faster convergence of stochastic gradient descent.

Algorithm I
Input: point cloud E = {x_i}_{1≤i≤N}, number of nearest neighbors k, dimension of the target space d, parameters (a, b) for the similarity q_ij.
Output: reduced point cloud F = {y_i}_{1≤i≤N}.
1. Determine the k nearest neighbors of each point x_i.
2. Calculate p_ij for all pairs (i, j).
3. Initialize the points F = {y_i} using spectral embedding in R^d.
4. While the cross-entropy CE ≥ ϵ or the algorithm has not converged:
   - Randomly sample m observations x_i.
   - Calculate q_ij for the y_i corresponding to the sampled points.
   - Compute CE(p, q) on this sample.
   - Update each point y_i by y_i := y_i − η ∂CE/∂y_i.
   - Repeat until the convergence criterion is met.

Parameterization
Main parameters of the UMAP algorithm:
- n_components: the dimension of the target (Euclidean) space.
- n_neighbors: the number of neighbors to consider for defining the adaptive similarity.
- min_dist: the minimum allowed distance between two points in the target space (implicitly defines the values of a and b).

(Semi-)Supervised UMAP I
Consider a dataset containing group labels: X = {(x_1, l_1), ..., (x_N, l_N)}, where the x_i are the observations and the l_i are the labels (a nominal variable). How can knowledge of the labels be included in the UMAP algorithm? Define a mixed distance on the pairs (x_i, l_i). The distance on group labels is defined as:

$$d_{class}(x_i, x_j) = \begin{cases} 0 & \text{if } l_i = l_j \\ 0.5 & \text{if } l_i \text{ or } l_j \text{ is unknown} \\ 1 & \text{if } l_i \neq l_j \end{cases}$$

(Semi-)Supervised UMAP II
The total distance involved in calculating similarities between points in the original space is a weighted sum with a parameter λ (target_weight):

$$d(x_i, x_j) = (1 - \lambda)\, d_{obs}(x_i, x_j) + \lambda\, d_{class}(x_i, x_j)$$

- λ = 0 → only the distance between the observations x_i is considered.
- λ = 0.5 → the total distance is an average of the distance between observations and the distance between labels.
- λ = 1 → only the distance between labels is considered.

Limitations
Parametric projection of new data: like t-SNE, UMAP is a non-inductive method; there is no explicit expression for projecting a point from the original space to the reduced space. It is possible to approximate this projection using a regression model, such as a neural network (multi-layer perceptron).
Questionable interpretability: the same precautions as with t-SNE apply regarding the lack of significance of group densities and inter-group distances.
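A minimal usage sketch with the umap-learn package (assuming it is installed, e.g. via `pip install umap-learn`), illustrating the three main parameters and the (semi-)supervised mode; the digits dataset and the specific parameter values are only examples.

```python
from sklearn.datasets import load_digits
import umap  # provided by the umap-learn package

X, y = load_digits(return_X_y=True)   # 8x8 images flattened to 64-dimensional vectors

# Unsupervised UMAP
reducer = umap.UMAP(
    n_components=2,    # dimension of the target space
    n_neighbors=15,    # neighbors used for the adaptive similarity
    min_dist=0.1,      # minimum allowed distance in the target space
    random_state=0,
)
Y = reducer.fit_transform(X)

# (Semi-)supervised UMAP: pass the labels; target_weight plays the role of lambda
sup_reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                        target_weight=0.5, random_state=0)
Y_sup = sup_reducer.fit_transform(X, y)

print(Y.shape, Y_sup.shape)
```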
Limitations
Parameter tuning of the algorithm: UMAP is less sensitive to large values of its number of neighbors (n_neighbors) than t-SNE is to large values of the perplexity. It is important to compare several visualizations with different parameters to observe both the local structures and the global structure.

Section 7: Other Dimensionality Reduction Subjects

Probabilistic PCA (PPCA)
Consider the following latent variable model, similar to the Gaussian mixture model but with Gaussian latents:

$$z \sim \mathcal{N}_K(0, I_K), \qquad x^{(i)} \mid z^{(i)} \sim \mathcal{N}_D\!\left(Wz^{(i)} + \mu,\ \sigma^2 I_D\right)$$

This is similar to the naive Bayes graphical model, because p(x|z) factorizes with respect to the dimensions of x. What sort of data does this model produce? By matrix-vector multiplication, Wz is a linear combination of the columns of W with coefficients z = (z_1, ..., z_K):
- Wz is a random linear combination of the columns of W.
- To get the random variable x, we sample a standard normal z and then add a small amount of isotropic noise to Wz + µ.
- The column span of W corresponds to the principal subspace in PCA.

Probabilistic PCA: The Likelihood Function
To perform maximum likelihood in this model, we need to maximize:

$$\max_{W, \mu, \sigma^2} \log p(x \mid W, \mu, \sigma^2) = \max_{W, \mu, \sigma^2} \log \int p(x \mid z, W, \mu, \sigma^2)\, p(z)\, dz$$

This is easier than for the Gaussian mixture model:
- x = Wz + µ + ϵ (x is an affine transformation of Gaussian variables),
- so p(x | W, µ, σ²) is Gaussian,
- and we only need to compute E[x] and Cov[x].

Probabilistic PCA: Maximum Likelihood

$$E[x] = E[Wz + \mu + \epsilon] = \mu$$
$$\mathrm{Cov}[x] = E\!\left[(Wz + \epsilon)(Wz + \epsilon)^\top\right] = E\!\left[Wzz^\top W^\top\right] + \mathrm{Cov}[\epsilon] = WW^\top + \sigma^2 I_D$$

Recall that R is orthogonal if RR^⊤ = I. This model is not identifiable, because WW^⊤ = (WR)(WR)^⊤.

Thus, the log-likelihood of the data under this model is given by

$$-\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|C| - \frac{1}{2}\sum_{i=1}^{N}(x^{(i)} - \mu)^\top C^{-1}(x^{(i)} - \mu)$$

where C = WW^⊤ + σ²I_D. Here the MLE (µ̂, Ŵ, σ̂²) is available in closed form! Bishop & Tipping (2001)
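To see what sort of data the model produces, here is a minimal NumPy sketch that samples from the PPCA generative model (the dimensions, W, µ and σ are arbitrary illustrative choices): points scatter around the K-dimensional column span of W, blurred by isotropic noise.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 5, 2
W = rng.normal(size=(D, K))        # columns span the principal subspace
mu = rng.normal(size=D)            # data mean
sigma = 0.1                        # isotropic noise standard deviation

Z = rng.normal(size=(N, K))                          # z ~ N_K(0, I_K)
X = Z @ W.T + mu + sigma * rng.normal(size=(N, D))   # x | z ~ N_D(Wz + mu, sigma^2 I_D)

# Empirically, Cov[x] should approach W W^T + sigma^2 I_D
emp_cov = np.cov(X, rowvar=False)
print(np.round(emp_cov - (W @ W.T + sigma**2 * np.eye(D)), 2))
```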
The Maximum Likelihood Estimates
The maximum likelihood estimator is:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}, \qquad \hat{W} = \hat{U}\left(\hat{L} - \hat{\sigma}^2 I_K\right)^{\frac{1}{2}} R, \qquad \hat{\sigma}^2 = \frac{1}{D - K}\sum_{i=K+1}^{D} \lambda_i$$

The columns of Û ∈ R^{D×K} are the K unit eigenvectors of the empirical covariance matrix Σ̂ with the largest eigenvalues; λ_1 ≥ λ_2 ≥ ... ≥ λ_D are the eigenvalues of Σ̂; L̂ = diag(λ_1, ..., λ_K) is the diagonal matrix of the corresponding top eigenvalues; and R is an arbitrary orthogonal matrix.

Probabilistic PCA: Maximum Likelihood
That seems complex. To get an intuition about how this model behaves when it is fit to data, let's consider the MLE density. Recall that the marginal distribution of x in the fitted model is a Gaussian with mean µ̂ = x̄ and covariance

$$\hat{C} = \hat{W}\hat{W}^\top + \hat{\sigma}^2 I = \hat{U}\left(\hat{L} - \hat{\sigma}^2 I\right)\hat{U}^\top + \hat{\sigma}^2 I$$

The covariance gives a nice intuition about the model. Center the data and check the variance along one of the unit eigenvectors u_i forming the columns of Û:

$$\mathrm{Var}\!\left[u_i^\top (x - \bar{x})\right] = u_i^\top \mathrm{Cov}[x]\, u_i = u_i^\top \left[\hat{U}(\hat{L} - \hat{\sigma}^2 I)\hat{U}^\top + \hat{\sigma}^2 I\right] u_i = \lambda_i - \hat{\sigma}^2 + \hat{\sigma}^2 = \lambda_i$$

Now check the variance along any unit vector u_i orthogonal to the subspace spanned by Û (i > K):

$$\mathrm{Var}\!\left[u_i^\top (x - \bar{x})\right] = u_i^\top \left[\hat{U}(\hat{L} - \hat{\sigma}^2 I)\hat{U}^\top + \hat{\sigma}^2 I\right] u_i = \hat{\sigma}^2$$

The model captures the variance along the principal axes and approximates it in all remaining directions with a single variance.

How Does It Relate to PCA?
The posterior mean and covariance are given by:

$$E[z \mid x] = (W^\top W + \sigma^2 I)^{-1} W^\top (x - \mu), \qquad \mathrm{Cov}[z \mid x] = \sigma^2 (W^\top W + \sigma^2 I)^{-1}$$

In the limit σ² → 0, we get E[z|x] → (W^⊤W)^{-1}W^⊤(x − µ), the orthogonal projection onto the column span of W. Plugging in the MLEs, this limit recovers standard PCA.

Why Probabilistic PCA (PPCA)?
- Fitting a full-covariance Gaussian model of the data requires D(D + 1)/2 + D parameters. With PPCA we model only the K most significant correlations, which requires only O(KD) parameters as long as K is small.
- Bayesian PCA gives us a Bayesian method for determining the low-dimensional principal subspace.
- The existence of a likelihood function allows direct comparison with other probabilistic models.
- Instead of solving in closed form, we can also use EM, which can be scaled to very large, high-dimensional datasets.
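The closed-form estimates above are easy to compute: take the eigendecomposition of the empirical covariance, average the trailing eigenvalues for σ̂², and rescale the leading eigenvectors for Ŵ (taking R = I). A minimal NumPy sketch on synthetic data similar to the previous snippet:

```python
import numpy as np

def ppca_mle(X, K):
    """Closed-form PPCA maximum likelihood estimates, with R = I."""
    N, D = X.shape
    mu_hat = X.mean(axis=0)
    Sigma_hat = np.cov(X, rowvar=False, bias=True)       # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)          # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # descending order
    sigma2_hat = eigvals[K:].mean()                       # average of discarded eigenvalues
    L_hat = np.diag(eigvals[:K])
    U_hat = eigvecs[:, :K]
    W_hat = U_hat @ np.sqrt(L_hat - sigma2_hat * np.eye(K))
    return mu_hat, W_hat, sigma2_hat

# example on synthetic low-rank data plus isotropic noise
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(500, 5))
mu_hat, W_hat, sigma2_hat = ppca_mle(X, K=2)
print("sigma^2 estimate:", sigma2_hat)
```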
Section 8: References

Belkina, A. C., Ciccolella, C. O., Anno, R., Halpert, R., Spidlen, J., & Snyder-Cappione, J. E. (2019). Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature Communications, 10(1). https://doi.org/10.1038/s41467-019-13055-y

Bishop, C. M., & Tipping, M. E. (2001). Probabilistic principal component analysis.

Linderman, G. C., & Steinerberger, S. (2019). Clustering with t-SNE, provably. SIAM Journal on Mathematics of Data Science, 1(2), 313–332. https://doi.org/10.1137/18M1216134

Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. https://doi.org/10.1007/s11222-007-9033-z

Maaten, L. van der. (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15, 3221–3245. https://api.semanticscholar.org/CorpusID:14618636

Maaten, L. van der, & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605. http://www.jmlr.org/papers/v9/vandermaaten08a.html

McInnes, L., Healy, J., Saul, N., & Großberger, L. (2018). UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 861. https://doi.org/10.21105/joss.00861

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326. https://doi.org/10.1126/science.290.5500.2323

Wattenberg, M., Viégas, F. B., & Johnson, I. (2016). How to use t-SNE effectively. https://api.semanticscholar.org/CorpusID:155770513