L7.pdf - Northeastern Lecture Notes on Dimensionality Reduction PDF

Document Details

Uploaded by InfallibleLawrencium3753

Northwestern University

Tags

dimensionality reduction, machine learning, data visualization, PCA

Summary

This document is a lecture note discussing dimensionality reduction techniques, including PCA, t-SNE, UMAP, and neural networks. It covers the concepts and applications of these methods in various fields such as bioinformatics, image processing, and natural language processing.

Full Transcript

Things to note
The midterm exam is scheduled next Monday during the lecture time. The exam is open note; ChatGPT-like tools are not allowed. ANUs will be emailed separately. HW2 will be released tomorrow.

Last Time
Lecture: Dimensionality Reduction, Principal Component Analysis
After-class Assignment: Apply PCA using sklearn

Last time: PCA Cost Objective
Maximize the variance of the projected data, given that all features are centered:
J(w) = (1/m) w^T X^T X w = w^T C w, s.t. ||w|| = 1
w is the direction (line) that captures the most variance. Goal: maximize J(w).
Let p = Xw, the projection of X onto w (since we impose w^T w = 1). The variance of the projected data is (1/m) p^T p.
C = (1/m) X^T X is the sample covariance matrix.

Last time: Maximizing the Projection Variance (w^T C w)
J(w) = (1/m) w^T X^T X w = w^T C w, s.t. ||w|| = 1
We can use a Lagrange multiplier to solve this problem:
L(w, λ) = w^T C w - λ (w^T w - 1)
Setting ∂L/∂w = 0, we get Cw = λw.
So w is an eigenvector of C, and λ is the eigenvalue associated with w. To maximize w^T C w = λ, choose the eigenvector with the largest eigenvalue λ!

PCA is sensitive to feature magnitude
PCA is based on the covariance matrix C, which can be biased by features with larger scales. Feature scaling is essential to ensure that features with different scales contribute equally to the analysis.
Feature scaling: standardization is commonly used for PCA. Can we use min-max scaling for PCA?
Min-max scaling (not recommended): transforms the data to a fixed range, usually [0, 1], using the formula x' = (x - x_min) / (x_max - x_min).

Last time: PCA

Last time: after-class assignment
pca.fit(x)
pca.fit_transform(x)

Dimensionality Reduction
It plays an important role in data science: pre-processing for machine learning, and data visualization.

Today's agenda
Lecture: Visualizing high-dimensional data in 2D/3D (neighbor embeddings): t-SNE, UMAP

Application Areas
Both t-SNE and UMAP are powerful tools for exploring and visualizing high-dimensional data in unsupervised learning scenarios. They are widely used in fields such as:
Bioinformatics (e.g., single-cell RNA sequencing data)
Image processing (e.g., feature extraction and visualization)
Natural language processing (e.g., word embedding visualization)

Bioinformatics: single-cell transcriptomics (single-cell RNA sequencing): samples are cells, features are genes.
Image processing: MNIST dataset.
NLP and digital humanities: samples are books, features are words.
HW2: embeddings generated from large genomic models. It will be discussed in more detail during next Wednesday's lecture.

Evolution of Neighbor Embeddings
The lecture will focus on the big idea. For the details, please read the complete papers listed below:
SNE (Stochastic Neighbor Embedding)
t-SNE (Visualizing Data using t-SNE)
UMAP (UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction)

Recognizing the MNIST Handwritten Data Set
Image preprocessing: converting pixel data to vectors. MNIST images are 2D arrays representing grayscale images.
Shape of the data: n = 70,000; 28x28 images = 784 pixels; one channel (grayscale).

MNIST Dataset: PCA

MNIST Dataset: Kernel PCA
Kernel PCA failed on the whole dataset; the plot only uses 10% of the whole set.
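The PCA step on MNIST above can be reproduced with a few lines of scikit-learn. The sketch below is illustrative only, not the official assignment solution: the choice to standardize the pixel data and the 1,000-sample subset used in the final check are assumptions, not settings taken from the slides. The last lines verify numerically that a fitted principal direction w satisfies Cw = λw, matching the eigenvector derivation above.

# Minimal sketch: standardization + PCA on the MNIST pixels (illustrative, not the assignment solution).
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load MNIST: n = 70,000 samples, 28x28 images flattened to 784 pixel features.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Standardize so that features with larger scales do not dominate the covariance matrix C.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the top two principal components for a 2D visualization.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(X_2d.shape)                      # (70000, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component

# Sanity check on a small subset: a fitted principal direction w satisfies C w = lambda w,
# i.e. it is an eigenvector of the sample covariance matrix.
Xs = X_scaled[:1000]
C = np.cov(Xs, rowvar=False)
w = PCA(n_components=1, svd_solver="full").fit(Xs).components_[0]
lam = w @ C @ w
print(np.allclose(C @ w, lam * w))     # True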
Historical Significance of Multidimensional Scaling (MDS)
MDS holds a foundational place in the history of techniques for visualizing high-dimensional data, paving the way for more advanced methods that have been developed since its introduction. MDS was introduced in the 1960s and was designed to help researchers visualize the relationships between objects based on their pairwise distances.
Paper: http://cda.psych.uiuc.edu/psychometrika_highly_cited_articles/kruskal_1964b.pdf

MNIST Dataset: MDS
Multidimensional scaling: arrange points in 2D to approximate the high-dimensional pairwise distances (1950s-1960s; Kruskal, Torgerson, etc.). Calculating distances in high dimensions is computationally intensive. Even with only 10% of the whole set, it took a long time to run.

Why does MDS fail?
Preserving high-dimensional distances is usually a bad idea because it is not possible to preserve them (curse of dimensionality). In high-dimensional data, all objects appear sparse and dissimilar in many ways.

The Evolution of Neighbor Embeddings
Idea: preserve nearest neighbors instead of preserving distances.

SNE vs. t-SNE vs. UMAP
The three algorithms operate roughly in the same way:
Compute high-dimensional probabilities p.
Compute low-dimensional probabilities q.
Calculate the difference between the probabilities with a given cost function C(p, q).
Use gradient descent to minimize the cost function.

Stochastic Neighbor Embedding (SNE): the fundamental building block
Step 1: Calculate the high-dimensional probabilities P (the true distribution).
Step 2: Calculate the low-dimensional probabilities Q (the approximation).
Step 3: Choose a cost function to quantify the difference between the two probability distributions: Kullback-Leibler divergence.
Step 4: Minimize the loss via gradient descent.

Key issues with using KL divergence in SNE
Crowding problem: points that are far apart in high-dimensional space might get crowded into a smaller area in the low-dimensional space, leading to poor visualization and misleading results.
Asymmetry of KL divergence: SNE may preserve local structures well but fail to capture the global geometry of the dataset, leading to a representation that doesn't fully reflect the data's overall structure.

Solution in t-SNE (t-Distributed SNE)
Replace the Gaussian distribution in the low-dimensional space with a Student's t-distribution (with one degree of freedom, equivalent to a Cauchy distribution).
Heavy-tailed distribution: the heavier tails prevent distant points from collapsing together.
Symmetric KL divergence: a better balance between the preservation of local and global structures.

Low-dimensional similarity kernel
The main innovation of t-SNE compared to SNE was the Cauchy kernel (t-distribution), addressing the "crowding problem" of SNE.
(Figure: Gaussian kernel vs. Cauchy kernel vs. heavier-tailed kernel.)
Even heavier-tailed kernels can bring out even finer cluster structure (Kobak et al., 2020).

t-SNE on the MNIST dataset
5 minutes on the whole dataset, with the default settings in sklearn.

Fast approximate implementations
t-SNE had been the state of the art in dimension reduction for visualization for many years, until UMAP came along.
Fast t-SNE implementations. Source: https://github.com/CannyLab/tsne-cuda
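As a rough illustration of the sklearn run mentioned above, here is a minimal t-SNE sketch on a 10,000-image MNIST subset. The subset size, the PCA-to-50-dimensions preprocessing step, and the PCA initialization are assumptions made for speed, not settings prescribed by the slides.

# Minimal sketch: t-SNE on a 10,000-image MNIST subset with scikit-learn (illustrative only).
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_sub, y_sub = X[:10000], y[:10000].astype(int)   # subset to keep the run time manageable

# Optional: reduce to 50 dimensions with PCA first, a common speed-up before t-SNE.
X_50 = PCA(n_components=50).fit_transform(X_sub)

# perplexity=30 is the sklearn default; init="pca" helps preserve global structure.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_50)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_sub, s=2, cmap="tab10")
plt.title("t-SNE on a 10k MNIST subset")
plt.show()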
Tricks (optimizations) in t-SNE to perform better
Hyperparameters in t-SNE really matter.
Perplexity: the number of neighbors in t-SNE. Perplexity can be seen as the "effective" number of neighbors that enter the loss function. The default perplexity is 30: it balances attention between local and global aspects of your data. The higher the perplexity, the more likely it is to consider points that are far away as neighbors.
Typical perplexity values are in the range 5-50. Much smaller values are rarely useful: local variations dominate. Much larger values are rarely useful either: clusters merge, and it can even be computationally prohibitive. Analyze multiple plots with different perplexities.

The role of initialization
t-SNE preserves local structure (neighbors) but often struggles to preserve global structure. The loss function has many local minima, and initialization can play a large role. Initialization is critical for preserving global data structure in both t-SNE and UMAP (Kobak et al., 2019).

Attraction-repulsion spectrum
Early exaggeration multiplies the attractive forces by 12 for the first 250 iterations. It helps t-SNE better capture the global structure of the data. Many other methods, e.g. UMAP, produce embeddings that approximately fall on this spectrum.

Inconsistent results
t-SNE starts with a random initialization of points in the low-dimensional space. As a result, different runs of t-SNE can yield different embeddings, even when applied to the same dataset. Even when using the same parameters, t-SNE can place points differently within clusters or flip the orientation of clusters, making it harder to obtain identical or reproducible embeddings across runs.

Distances in t-SNE are relative
High-dimensional distances are computed using Euclidean distance and converted to similarities using a Gaussian kernel. Low-dimensional distances are again computed using Euclidean distance, but similarities are modeled using a Student's t-distribution with heavy tails to prevent distant points from being crowded together. t-SNE minimizes the KL divergence between the high-dimensional and low-dimensional similarity distributions, preserving local structure (neighboring points) more than global structure!

KL divergence in t-SNE
The higher the probability p_ij, the more likely points i and j are neighbors in the high-dimensional space. High price for putting close neighbors far away; low price for putting far-away neighbors close.

Advantages & Disadvantages of t-SNE
Advantages: if done correctly, it can give very intuitive visualizations, as it preserves the local structure of the data in the lower dimensions.
Disadvantages: not very good at preserving global structure; sensitive to hyperparameters; inconsistent results; computationally slower.

UMAP overcomes some limitations of t-SNE
Unlike t-SNE, which works with probabilities, UMAP works directly with similarities.
Unlike t-SNE, which works with scaled Euclidean distances, UMAP allows different choices of metric functions for the high-dimensional similarities.
Unlike t-SNE, which uses KL divergence, UMAP uses cross entropy as its loss function.
Unlike t-SNE, which uses gradient descent, UMAP uses stochastic gradient descent to minimize the cost function.

UMAP Cost Function: Cross Entropy
High price for putting far-away neighbors close; high price for putting close neighbors far away.

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
The algorithmic decisions in UMAP are justified by strong mathematical theory, which ultimately makes it a high-quality, general-purpose dimension reduction technique for machine learning. Please read the paper for details.

MNIST Dataset: UMAP

Tricks (optimizations) done in UMAP to perform better
Hyperparameters in UMAP matter.
n_neighbors: the number of nearest neighbors in the high dimension. It balances attention between local and global aspects of your data. The higher the value, the more likely it is to represent the global structure while losing the fine detail of the local structure.
min_dist: the minimum distance between points in the low-dimensional embedding. It controls how tightly UMAP clumps points together, with low values leading to more tightly packed embeddings. Larger values will make UMAP pack points together more loosely, focusing instead on the preservation of the broad topological structure.

n_neighbors (high dimension) and min_dist (low dimension) together
n_neighbors (high dimension): at a high value, UMAP connects more neighboring points, leading to a projection that more accurately reflects the global structure of the data; at a low value, any notion of global structure is almost completely lost.
min_dist (low dimension): at a high value, UMAP tends to "spread out" the projected points, leading to decreased clustering of the data and less emphasis on global structure; at a low value, the projected points are packed more tightly together, with the emphasis falling on local rather than global structure.
They work together to balance the local and global structure of your data; try multiple runs to get a better idea of the full structure of your data.

metric: the similarity distance metric. It controls how distance is computed in the ambient space of the input data. See https://umap-learn.readthedocs.io/en/latest/parameters.html#metric for a full list.
n_components: the embedding dimensionality. UMAP is not just for data visualization (n_components in {2, 3}); it is a flexible non-linear dimension reduction algorithm (n_components > 3). See https://stackoverflow.com/questions/63139766/how-to-choose-the-right-number-of-dimension-in-umap
(A minimal usage sketch covering these parameters follows below.)
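A minimal umap-learn sketch of the hyperparameters just discussed. The particular (n_neighbors, min_dist) combinations, the 10,000-image MNIST subset, and the Euclidean metric are illustrative choices, not recommendations from the lecture.

# Minimal sketch: UMAP on an MNIST subset, varying n_neighbors and min_dist (illustrative only).
import umap                                   # pip install umap-learn
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_sub, y_sub = X[:10000], y[:10000].astype(int)

# n_neighbors controls how many high-dimensional neighbors are considered (local vs. global);
# min_dist controls how tightly points may be packed in the low-dimensional embedding.
for n_neighbors, min_dist in [(5, 0.1), (15, 0.1), (50, 0.5)]:
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                        n_components=2, metric="euclidean", random_state=42)
    X_2d = reducer.fit_transform(X_sub)
    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_sub, s=2, cmap="tab10")
    plt.title(f"UMAP: n_neighbors={n_neighbors}, min_dist={min_dist}")
plt.show()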
Does UMAP consistently perform better than t-SNE?
Oftentimes, but not always. https://pair-code.github.io/understanding-umap/

UMAP Advantages over t-SNE
Preserves the global structure of the data and maintains meaningful distances between data points in the low-dimensional embedding.
Much faster than t-SNE for large datasets.
More consistent embeddings across different runs compared to t-SNE.
Better documentation than t-SNE.

Dimensionality reduction using Neural Nets

Reference: Tuebingen machine learning
https://arize.com/blog-course/reduction-of-dimensionality-top-techniques/
https://pair-code.github.io/understanding-umap/
https://umap-learn.readthedocs.io/en/latest/

Dimensionality Reduction Techniques
Two ways to categorize: supervised (LDA, Linear Discriminant Analysis) vs. unsupervised (PCA); linear vs. non-linear (PCA vs. Kernel PCA).
Four categories of unsupervised methods:
Feature selection: best subset, SelectKBest, Lasso, etc.
Matrix factorization: PCA, SVD
Neighbor embeddings: SNE, t-SNE, and UMAP
Neural nets: autoencoders (a minimal sketch follows below)
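For the last category above, the typical neural-net approach is an autoencoder: an encoder compresses the input to a low-dimensional bottleneck, a decoder reconstructs it, and the bottleneck activations serve as the reduced representation. The sketch below is a hypothetical Keras example; the framework, layer sizes, training settings, and placeholder data are assumptions, not part of the lecture.

# Minimal sketch: an autoencoder for dimensionality reduction (layer sizes and data are placeholders).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, bottleneck_dim = 784, 2            # e.g., flattened 28x28 images -> 2D embedding

# Encoder compresses 784 -> 128 -> 2; decoder reconstructs 2 -> 128 -> 784.
inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(128, activation="relu")(inputs)
code = layers.Dense(bottleneck_dim)(h)        # the low-dimensional representation
h = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)           # reuse the encoder to embed new data
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct the inputs themselves; X stands in for pixel data scaled to [0, 1].
X = np.random.rand(1000, input_dim).astype("float32")
autoencoder.fit(X, X, epochs=10, batch_size=128, verbose=0)
X_2d = encoder.predict(X)                     # 2D embedding, analogous to PCA/t-SNE/UMAP output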
