Podcast
Questions and Answers
What is the primary goal of t-SNE in terms of dimensionality reduction?
What is the primary goal of t-SNE in terms of dimensionality reduction?
In the context of t-SNE, how is the similarity $q_{ij}$ defined?
In the context of t-SNE, how is the similarity $q_{ij}$ defined?
What parameter is chosen based on the desired perplexity in the probabilistic model associated with pairs of data points?
What parameter is chosen based on the desired perplexity in the probabilistic model associated with pairs of data points?
Which of the following statements about joint probability $p_{ij}$ is correct?
Which of the following statements about joint probability $p_{ij}$ is correct?
Signup and view all the answers
What is the role of the distance measure $ ext{∥xi − xj∥}^2$ in defining the joint probabilities $p_{ij}$?
What is the role of the distance measure $ ext{∥xi − xj∥}^2$ in defining the joint probabilities $p_{ij}$?
Signup and view all the answers
What is one main characteristic of the Locally Linear Embedding (LLE) algorithm?
What is one main characteristic of the Locally Linear Embedding (LLE) algorithm?
Signup and view all the answers
What does t-SNE seek to achieve in reduced-dimensional space?
What does t-SNE seek to achieve in reduced-dimensional space?
Signup and view all the answers
How does LLE differ from traditional distance-preserving techniques?
How does LLE differ from traditional distance-preserving techniques?
Signup and view all the answers
In a probabilistic context, how is the neighborhood of a point represented in t-SNE?
In a probabilistic context, how is the neighborhood of a point represented in t-SNE?
Signup and view all the answers
What is a limitation of using LLE for dimensionality reduction?
What is a limitation of using LLE for dimensionality reduction?
Signup and view all the answers
What type of algorithm is t-SNE classified as?
What type of algorithm is t-SNE classified as?
Signup and view all the answers
Which of the following best describes the goal of dimensionality reduction techniques like LLE and t-SNE?
Which of the following best describes the goal of dimensionality reduction techniques like LLE and t-SNE?
Signup and view all the answers
Which statement is true regarding the similarities in neighborhood distributions in t-SNE?
Which statement is true regarding the similarities in neighborhood distributions in t-SNE?
Signup and view all the answers
What is the primary goal of the t-SNE algorithm?
What is the primary goal of the t-SNE algorithm?
Signup and view all the answers
In the context of t-SNE, what is the role of the parameter $ au_i$?
In the context of t-SNE, what is the role of the parameter $ au_i$?
Signup and view all the answers
Which of the following best describes the process of Locally Linear Embedding (LLE)?
Which of the following best describes the process of Locally Linear Embedding (LLE)?
Signup and view all the answers
What is a key aspect of the gradient descent update in the t-SNE algorithm?
What is a key aspect of the gradient descent update in the t-SNE algorithm?
Signup and view all the answers
What does the minimization condition $Lt−SN E ≥ ϵ$ represent in the t-SNE algorithm?
What does the minimization condition $Lt−SN E ≥ ϵ$ represent in the t-SNE algorithm?
Signup and view all the answers
How does t-SNE initialize the points in the lower-dimensional space?
How does t-SNE initialize the points in the lower-dimensional space?
Signup and view all the answers
What does the expression $rac{ ext{d}L_{t-SNE}}{ ext{d}y_i}$ represent in the t-SNE algorithm?
What does the expression $rac{ ext{d}L_{t-SNE}}{ ext{d}y_i}$ represent in the t-SNE algorithm?
Signup and view all the answers
What is the main characteristic of the neighborhood structure preserved by Locally Linear Embedding (LLE)?
What is the main characteristic of the neighborhood structure preserved by Locally Linear Embedding (LLE)?
Signup and view all the answers
Study Notes
Data Mining II - Dimensionality Reduction
- Course 3, 4th year Statistics, Econometrics for the Actuarial Sciences
- Instructor: Ayoub Asri
- Date: October 25, 2024
Problems in High Dimensions I
- Real-world datasets are often unstructured and high-dimensional (e.g., images, sounds, time series).
- Not well-suited for visualization.
- Suffer from the curse of dimensionality.
- The number of samples needed to faithfully represent an n-dimensional space becomes impractical.
- Example: 100 points on the interval [0, 1], 104 points on the square, 106 points on the cube, 102n points on the hypercube in dimension n!
- Noisy dimensions reduce the discriminative power of the Euclidean distance.
Problems in High Dimensions II
- The minimal and maximal distance between a reference point q and the points in a dataset D = {x1, x2, ..., xn} becomes the same when d↗.
- Dimensionality reduction has two main benefits:
- Visualizing the structure of the data cloud.
- Simplifying data for subsequent processing (e.g., classification).
Practical Example
- An image is represented by an 8 x 8 pixel matrix.
- Vector of dimension d = 64.
- Difficult to visualize.
Manifold Learning
- High-dimensional datasets are difficult to visualize.
- Hypothesis: the observed dimensionality of the data is artificially large.
- The data is generated by a set of intrinsic variables (the degrees of freedom).
- It is assumed that there are fewer degrees of freedom than variables in the observations.
- In other words, the observed variables are partially redundant or superfluous.
Manifold Learning: Example
- Circle: a two-dimensional cloud of observations (x, y).
- The points satisfy the constraint: x² + y² = 1.
- By reparametrizing the observations: x = cos θ, y = sin θ.
- We observe that the data is generated by a one-dimensional manifold (the segment [-π, π]).
Linear Dimensionality Reduction
- Principal Component Analysis (PCA): Recall I
- Consider a dataset {x(i)}i=1..N.
- Each input vector x(i) ∈ RD is approximated as: x(i) ≈ x + U z(i)
- x is the data mean, U ∈ RDXK is the orthogonal basis for the principal subspace, and z(i) ∈ RK is the code vector given by: z(i) = UT (x(i) - x).
- Principal Component Analysis (PCA): Recall II
- The matrix U is chosen to minimize the reconstruction error: U* = arg minU ∑i ||x(i) – x – UUT (x(i) – x)||2.
- We are looking for directions
- For example, in a 2-dimensional problem, we are looking for the direction u₁ along which the data is well represented.
- direction of higher variance.
- direction of minimum reconstruction error.
- They are the same!
- Limitations of linear dimensionality reduction
- There is no linear transformation that can separate the two circles.
- What algorithms exist for non-linear dimensionality reduction?
The Locally Linear Embedding (LLE) Algorithm
- Formal Framework
- Consider a matrix of observations X with n samples of dimension p.
- We seek X of dimension q < p that accurately represents the structure of X.
- The neighbors x'i of xi in the reduced space must be the same as the neighbors x'i of xi in the original space.
- Locally Linear Embedding
- Rowies & Saul (2000)
- Principle: Locally, the cloud of observations X is generated by an affine subspace.
- For each point xi ∈ X ⊂ RP: Calculate the k nearest neighbors V₁ = {x1, x2, ..., xk} of xi
- Solve
- arg minW ∑i ||xi - Σj ∈ Vi wj xj||2 Subject to Σj ∈ Vi wj = 1
- Find the parameters W such that xi can be reconstructed by the linear combination of its neighbours.
- Solve
- arg minY ∑i ||yi − Σj∈Vi wj yj||2 Subject to yi ∈ Rq
- Find the reduced matrix Y such that yi is reconstructed by the same linear combination.
t-distributed Stochastic Neighbor Embedding (t-SNE)
-
t-SNE: General Principle
- t-distributed Stochastic Neighbor Embedding, Maaten & Hinton (2008)
- A non-linear dimensionality reduction algorithm
- The general principle is similar to LLE: seek a dataset in a reduced-dimensional space that exhibits the same local structures in the neighborhood of each point.
- The neighborhood of a point xi is represented by the conditional probability p(xj|xi) that xj is considered a neighbor of xi.
- We seek points yj in the reduced space for which p(yj|yi) = p(xj|xi)
- The neighborhood distributions are similar in the original space and in the reduced space.
-
Similarity in the Observation Space (or Original Space)
- Consider a matrix X of n observations xi of dimension D.
- To each pair xi, xj, we associate the joint probability Pij = Pj|i defined by:
- Where Pij = exp(-||xi - xj||2/2σ2) / Σk≠i exp(-||xk - xi||2/2σ²)
- σ is chosen based on the desired perplexity (approximately the average number of neighbours).
-
The Objective of t-SNE
- The objective of t-SNE is to obtain a reduced matrix Y of dimension n x d, d < D, such that the similarities qij approach the pij.
- For Y, the similarity is defined using a Student's t-distribution with 1 degree of freedom (or Cauchy distribution) -- Where qij = (1 + ||yi − yj||²/2)−1 / Σk≠i(1 + ||yk − yj||²/2)−1
-
Choice of Distributions
- Gaussian Similarity: Pij = [exp(-||xi - xj||²/2σ²) / 2πσ√(2π)]
- t-Distribution (Student) Similarity: qij = (1+ ||yi − yj||2)−1 / Σk≠i(1 + ||yk − yj||2)−1.
-
Optimization
- A measure of dissimilarity between distributions is the Kullback-Leibler divergence: Lt-SNE = Σi,j DKL (P||Q) = Σi,j Pij log (Pij/qij)
-
Optimization
- The gradient descent in t-SNE involves two phases: Phase 1: Early exaggeration; Phase 2: Final optimization
-
t-SNE Algorithm I
- For each pair of points (xi, xj), Calculate the similarity Pij where Pij = exp(-||xi-xj||2/2σ2) / Σk≠i exp(-||xk - xi||2/2σ²)
-
t-SNE Algorithm II
- Calculate the gradient of the KL divergence with respect to each yi: ○ ∂Lt-SNE/∂yi = 4 Σj (Pij – qij)(yi – yj) (1 + ||yk - yj||2)−1
- Update each point yi by: yi:= yi - η ∂Lt-SNE/∂yi
Uniform Manifold Approximation and Projection (UMAP)
-
General Principle of UMAP I
- Uniform Manifold Approximation & Projection (UMAP), McInnes et al. (2018)
- Non-linear dimensionality reduction algorithm similar to t-SNE
- General idea: find a representation of the data in a lower dimension that has the same topology as the cloud of observations in the original space.
- Define an appropriate similarity matrix S
- From S, construct a similarity graph
- Build a vector representation of the data in a lower-dimensional space that exhibits the same similarity graph.
-
General Principle of UMAP II
- Major differences from t-SNE:
- Use a variable perplexity notion to define a local similarity that depends on the density around each point.
- Consider a broader family of similarity functions inspired by the Student’s t-distribution for the target space
-
Adaptive Similarity
- UMAP defines an adaptive similarity that varies according to the local density of the data ○ Pi,j = exp(-||xi − xj||2/σi2) Where p₁ is the distance between xi and its nearest neighbor
-
Adaptive Similarity I ○ k ∑j exp(max(0, ||xi − xj||2/(σi2,k))) = log2(k)
-
Adaptive Similarity II
- For reference, t-SNE uses a different symetry condition: Pij = (pi|j + pj|i)/2η
-
Similarity in the Target (reduced) Space I
- The similarity in the target space is inspired by the t-distribution
-
Similarity in the Target (reduced) Space II
- Where qij = {1/(1 + a(||yi - yj||2)^b) if || yi - yj || <= min_dist ; e^(-(||yi - yj||2/c)) if || yi - yj || > min_dist} and min_dist being the minimum allowed distance in the target space between two points
-
Similarity in the Target (reduced) Space III
- Graphs of similarity qij as a function of distance ||yi - yj|| for different values of a and k.
-
Optimization
- Like t-SNE, the goal is to make qij → pij
- The objective function is to minimize the fuzzy cross-entropy between qij and pij CE(X, Y) = Σi Σ j[pij log(pij/qij) + (1 - pij) log((1 - pij)/(1 - qij))]
-
Optimization
-
Expansion of the logarithms gives: CE(X,Y) = Σi Σj[pij log pij + (1 - pij) log(1 - pij)] − Σi Σj[pij log qij + (1 - pij) log(1-qij)]
-
Update the points in the target space:yi:= yi − η ∂CE/∂yi, where η corresponds to the learning rate of the gradient descent algorithm.
-
Initialization of Points in the Target Space
- As with t-SNE, the initialization of points in the target space can greatly influence the results of UMAP.
- In the case of t-SNE with early exaggeration, it has been observed that the first phase of the algorithm approximately corresponds to performing a spectral embedding.
-
Algorithm I - Determine the k nearest neighbors for each point xi. - Calculate pij for all pairs (i, j). - Initialize the points F = {yi} using spectral embedding in Rd. - While the cross-entropy CE ≥ ε or the algorithm has not converged Randomly sample m observations xi.
-
Algorithm II
- Calculate qij for the sampled points' corresponding yi.
- Compute CE(p, q) on this sample.
- Update each point yi by: yi := yi − η ∂CE/∂yi
- Repeat until the convergence criterion is met.
-
Parameterization
- n_components: the dimension of the target (Euclidean) space
- n_neighbors: the number of neighbors to consider for defining the adaptive similarity (implicitly defines the values of a and b).
- min_dist: the minimum allowed distance between two points in the target space.
Probabilistic PCA (PPCA)
-
Consider the following latent variable model.
- Similar to the Gaussian mixture model but with Gaussian latents: z ~ Nκ(0, IK), x(i) ~ Nκ(Wz(i) + μ, σ²ID).
- This is similar to a naive Bayes graphical model, because p(x|z) factorizes with respect to the dimensions of x.
- What sort of data does this model produce?
- Matrix-vector multiplication: Wz is a linear combination of the columns W with coefficients z = (z1, ..., zK).
-
Probabilistic PCA: The Likelihood function
- To perform maximum likelihood in this model, we need to maximize the following: maxW,μ,σ² log p(x|W, μ, σ²).
- This is easier than the Gaussian mixture model.
- x = Wz + μ + ε (x is an affine transformations of Gaussian vars).
- p(x|W, μ, σ²) is Gaussian.
- Only need to compute E[x] and Cov[x].
-
Probabilistic PCA: Maximum Likelihood
- E[x] = E[Wz + μ + ε] = μ
- Cov[x] = E[(Wz + ε)(Wz + ε)T] = WWT + σ²ID
- Recall: R orthogonal if RRT = I.
- This model is not identifiable because WWT = (WR)(WR)T.
-
Probabilistic PCA: Maximum Likelihood
- Thus, the log-likelihood of the data under this model is given by:
- where C = WWT + σ²ID
-
The maximum likelihood estimates
- The maximum likelihood estimator is: μ = (1/N) Σi x(i), ŵ = Û(ÎL − σ²ID)R, σ² = (1/(D − K)) Σi=K+1 λi
-
The maximum likelihood estimates
- The columns of Û ∈ RDK are the K unit eigenvectors of the empirical covariance matrix Σ that have the largest eigenvalues.
- 1>2>... > λκ are the eigenvalues of Σ.
- ÎL = diag(λ1, ..., λκ) is the diagonal matrix whose elements are the corresponding eigenvalues, and R is an orthogonal matrix.
-
Probabilistic PCA: Maximum Likelihood
- Center the data and check the variance along any unit vector orthogonal to the subspace spanned by Û(і > κ): Var[uT(x − x)] = σ²
-
How does it relate to PCA?
- The posterior mean is given by E[z|x] = (WW + σ²I)−1WT(x − μ)
- Posterior variance: Cov[z|x] = σ−²(WTW + σ²I)
- In the limit σ²→0, we get E[z|x] ≈ (WTW)−1WT(x − μ)
- Plugging in the MLEs, this limit recovers the standard PCA.
-
Why Probabilistic PCA (PPCA)?
- Fitting a full-covariance Gaussian model of data requires D(D + 1)/2 + D parameters, with PPCA we model only K most significant correlations, only requiring O(KD) parameters when K is small.
- Bayesian PCA gives us a Bayesian method for determining the low-dimensional principal subspace.
- Existence of likelihood functions allows direct comparison with other probabilistic models.
Studying That Suits You
Use AI to generate personalized quizzes and flashcards to suit your learning preferences.
Related Documents
Description
Test your knowledge on dimensionality reduction techniques like t-SNE and Locally Linear Embedding (LLE). This quiz covers core concepts such as similarity definitions, joint probabilities, and key characteristics of these methods. Perfect for students and professionals in data science and machine learning.