Lecture15_Dimensionality Reductionpptx_240818_151426.pdf

Document Details


Uploaded by MesmerizedNashville

University of Tripoli

Tags

dimensionality reduction, machine learning, data analysis, statistics

Full Transcript


Dimensionality Reduction

What is Dimensionality Reduction?
Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution. This problem is often referred to as the curse of dimensionality.
Dimensionality reduction is a process used in data analysis and machine learning to reduce the number of random variables or features under consideration by obtaining a set of principal variables.

House Price Prediction Example

Key Concepts: Feature Selection vs. Feature Extraction
○ Feature Selection: selecting a subset of the most important features in the dataset without altering the original features.
○ Feature Extraction: transforming the data from a high-dimensional space to a lower-dimensional space. This often results in new features that are combinations or projections of the original features.

Why do we need to reduce the dimension of data?
1- Improving Computational Efficiency
Speed: Lower-dimensional data requires less computational power and memory, resulting in faster processing and analysis.
Feasibility: Some algorithms scale poorly with the number of dimensions. Reducing dimensions makes it feasible to apply these algorithms.
2- Storage and Memory Efficiency
Storage: Reduced-dimensional data occupies less storage space, which is beneficial for large-scale data storage and management.
Memory: Lower memory usage is critical for in-memory data processing, especially when dealing with large datasets.
3- Easier Data Visualization and Interpretation
Visualization: It is challenging to visualize data beyond three dimensions. Dimensionality reduction techniques allow high-dimensional data to be projected into 2D or 3D spaces for better visualization.
Interpretation: Simplified data is easier to interpret and understand, which aids in communicating results and insights.
4- Enhancing Model Performance
Overfitting: High-dimensional data can lead to overfitting, where the model captures noise rather than the underlying pattern. Dimensionality reduction can reduce overfitting by simplifying the model.
Generalization: Models built on reduced-dimensionality data often generalize better to unseen data.
5- Noise Reduction
Data Quality: High-dimensional data often contains irrelevant features or noise that can obscure the true signal. Dimensionality reduction helps filter out noise and focus on the most informative aspects of the data.
6- Improved Data Quality
By identifying and retaining the most important features, dimensionality reduction improves the overall quality of the data used for modeling.

How do we reduce the dimension of our data?

Dimensionality Reduction Techniques
1- Principal Component Analysis (PCA): Identifies the directions (principal components) that maximize variance in the data and projects the data onto these principal components to reduce dimensions. Commonly used for exploratory data analysis and preprocessing before applying machine learning algorithms.
2- Linear Discriminant Analysis (LDA): Finds the linear combinations of features that best separate the different classes in the data.
3- t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique that reduces dimensions while preserving local structure in the data.
4- Autoencoders: Neural network-based techniques that learn efficient codings of the input data.
5- Feature Selection Methods: Techniques such as filter methods, wrapper methods, and embedded methods that select a subset of relevant features from the original set.
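As a quick illustration of how some of these techniques are invoked in practice, here is a minimal scikit-learn sketch. The synthetic make_classification dataset and the choice of five output dimensions are assumptions made for this example, not part of the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 200 samples, 20 features, only 5 of them informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Feature extraction: PCA builds 5 new features (linear combinations of the originals)
X_pca = PCA(n_components=5).fit_transform(X)

# Non-linear embedding: t-SNE preserves local structure, typically used for 2D visualization
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# Feature selection: keep 5 of the original features unchanged, scored against the labels
X_sel = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

print(X_pca.shape, X_tsne.shape, X_sel.shape)  # (200, 5) (200, 2) (200, 5)
```

Note how feature selection returns a subset of the original columns, while PCA and t-SNE return entirely new features.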
Principal Component Analysis

Principal Component Analysis (PCA): problem formulation
PCA is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. It is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. PCA is the most popular dimensionality reduction algorithm.
Given two features, x1 and x2, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature. Some candidate lines map x1 and x2 with high projection error; the red projection line on the slide, the one closest to the data points, is the best choice.
The goal of PCA is to reduce the average of all the distances of every data point to the projection line. This is the projection error.
Reduce from n dimensions to k dimensions: find k vectors onto which to project the data so as to minimize the projection error.
Reduce from 2D to 1D: find a direction (a vector u(1) ∈ ℝn) onto which to project the data so as to minimize the projection error.
Reduce data from 3D to 2D: project the data onto a plane spanned by two such vectors.

PCA is not linear regression
In linear regression, we minimize the squared error from every point to our predictor line; these are vertical distances. In PCA, we minimize the shortest (orthogonal) distances from the data points to the projection line. In PCA we take a number of features x1, x2, …, xn and find a lower-dimensional surface closest to them; we are not trying to predict any result and we are not applying any theta weights to the features.

Principal Component Analysis algorithm
PCA Steps:
1. Standardize the Data: ensure all features are on the same scale.
2. Compute the Covariance Matrix: understand feature relationships.
3. Compute Eigenvectors: determine the directions of maximum variance.
4. Choose Principal Components: choose the value of k.
5. Transform the Data: project the data onto the principal components.
6. Analyze Results: visualize and use the transformed data for modeling.

1- Standardize the Data
Given a training set x(1), …, x(m), standardize the data to have a mean of 0 and a standard deviation of 1 (subtract each feature's mean and divide by its standard deviation).

2- Compute the covariance matrix:
Sigma = (1/m) Σ_{i=1..m} x(i) x(i)T, an n×n matrix.

3- Compute Eigenvectors:
Perform Singular Value Decomposition (SVD) on the covariance matrix to obtain the eigenvectors and eigenvalues:
[U, S, Vt] = svd(Sigma)
Here, U contains the eigenvectors, S contains the singular values (related to the eigenvalues), and Vt is the transpose of the right singular vectors.

4- Choose Principal Components and Transform the Data:
Take the first k columns of the U matrix and compute z. We assign the first k columns of U to a variable called Ureduce; this is an n×k matrix. We compute z (the projected data points) with:
z(i) = UreduceT ⋅ x(i)
UreduceT has dimensions k×n while x(i) has dimensions n×1, so UreduceT ⋅ x(i) has dimensions k×1.

Reconstruction from compressed representation
Can we go back to our original number of features, and how? The U matrix has the special property that it is a unitary matrix, and one of the special properties of a unitary matrix is U−1 = UT. To go from k dimensions back to n dimensions (z ∈ ℝk → x ∈ ℝn) we use:
xapprox = Ureduce ⋅ z
Note that we can only get approximations of our original data.

Choosing the number of principal components
Choose k so that the average squared projection error is small relative to the total variation in the data:
( (1/m) Σ_{i=1..m} ‖x(i) − xapprox(i)‖² ) / ( (1/m) Σ_{i=1..m} ‖x(i)‖² ) ≤ 0.01
The threshold is not fixed; it could, for example, be 0.05 or 0.1 instead of 0.01. With 0.01 we say that 99% of the variance is retained.
Algorithm for choosing k (naive version):
1. Try PCA with k = 1.
2. Compute Ureduce, z, and xapprox.
3. Check whether the ratio above is ≤ 0.01 (99% of the variance is retained). If not, go back to step 1 and increase k.
This procedure would actually be inefficient, because PCA has to be recomputed for every candidate k.
A more efficient algorithm uses the singular values returned by a single call to [U, S, V] = svd(Sigma): pick the smallest value of k for which
Σ_{i=1..k} Sii / Σ_{i=1..n} Sii ≥ 0.99
(99% of the variance is retained).
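The whole pipeline above (standardize, covariance matrix, SVD, projection, reconstruction, and choosing k by retained variance) fits in a short NumPy sketch. This is a minimal illustration, not the lecture's own code: the function name pca_fit, the 99% threshold argument, and the randomly generated data are assumptions made for the example.

```python
import numpy as np

def pca_fit(X, variance_to_retain=0.99):
    """PCA via SVD of the covariance matrix, following the steps above.
    X is an (m, n) data matrix with no constant features."""
    # 1. Standardize: zero mean and unit standard deviation per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    m = X_std.shape[0]
    # 2. Covariance matrix Sigma = (1/m) * sum of x(i) x(i)^T  (n x n)
    Sigma = (X_std.T @ X_std) / m
    # 3. Eigenvectors via SVD: columns of U are the principal directions
    U, S, Vt = np.linalg.svd(Sigma)
    # 4. Choose k: smallest k whose leading singular values retain enough variance
    cum_ratio = np.cumsum(S) / np.sum(S)
    k = int(np.searchsorted(cum_ratio, variance_to_retain)) + 1
    U_reduce = U[:, :k]                     # n x k
    # 5. Transform: z(i) = Ureduce^T x(i), applied to every row at once
    Z = X_std @ U_reduce                    # m x k
    # Reconstruction: xapprox = Ureduce . z, back to n dimensions (approximately)
    X_approx = Z @ U_reduce.T               # m x n
    return Z, X_approx, U_reduce, k, cum_ratio[k - 1]

# Example usage on synthetic, nearly rank-3 data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))   # 10 correlated features
Z, X_approx, U_reduce, k, retained = pca_fit(X)
print("k =", k, "variance retained =", round(float(retained), 4))
```

Because this data is generated from only three underlying directions, the sketch keeps at most three of the ten possible components while still retaining at least 99% of the variance.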
Do we always need to apply PCA?
PCA is sometimes used where it shouldn't be. Before implementing PCA, first try running whatever you want to do with the original/raw data. Only if that does not give you what you want should you implement PCA.

Practical Steps in Dimensionality Reduction
1. Identify and Understand the Dataset: examine the dataset to understand the number of features and their relationships.
2. Choose an Appropriate Technique: select a dimensionality reduction technique based on the nature of the data and the problem at hand.
3. Apply the Technique: implement the chosen dimensionality reduction technique using libraries and tools available in machine learning frameworks.
4. Evaluate the Results: assess the impact of dimensionality reduction on model performance, computational efficiency, and visualization.

Summary
Motivations of dimensionality reduction: it is a vital technique in data preprocessing that simplifies datasets by reducing the number of features while preserving essential information.
Principal Component Analysis: problem formulation.
Principal Component Analysis: algorithm.
Reconstruction from compressed representation.
Choosing the number of principal components.
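Practical step 3 above notes that dimensionality reduction is normally applied through library implementations rather than coded by hand. The sketch below shows that workflow with scikit-learn; the synthetic data, the StandardScaler step, and the n_components=0.99 setting (a fraction between 0 and 1 that tells scikit-learn's PCA to keep just enough components to retain 99% of the variance) are choices made for this example, not part of the lecture.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical dataset: 500 samples, 20 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 20))

# Standardize, then let PCA pick k so that 99% of the variance is retained
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.99)
Z = pca.fit_transform(X_std)            # compressed representation (m x k)
X_approx = pca.inverse_transform(Z)     # reconstruction back to the original 20 features

print("chosen k:", pca.n_components_)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

Evaluating the results (step 4) then amounts to comparing model performance, runtime, and reconstruction quality with and without the reduction.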
