Dimensionality Reduction Techniques in Machine Learning

Study Notes

Dimensionality is the number of features, variables, or columns in a dataset.
Dimensionality reduction techniques aim to reduce the number of features in a dataset while retaining the important information.
It is used in machine learning to simplify models, speed up training, and reduce the risk of overfitting.
Feature selection selects a subset of the original features.
- Filter methods are used to evaluate the importance of features based on statistical criteria.
- Wrapper methods use a machine learning model to evaluate the performance of different feature subsets.
- Intrinsic/embedded methods integrate feature selection into the learning process itself (for example, L1-regularized models such as Lasso).
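As an illustration of the filter approach, here is a minimal sketch that scores each feature by its absolute Pearson correlation with the target and keeps the top k. The function names (`rank_features_by_correlation`, `select_top_k`), the choice of correlation as the statistical criterion, and the toy data are assumptions made for this example only.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Filter method: score each feature by |Pearson correlation| with the target."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each feature
    yc = y - y.mean()                # center the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features have zero variance; dividing by inf gives them a score of 0.
    scores = np.abs(Xc.T @ yc) / np.where(denom == 0, np.inf, denom)
    order = np.argsort(-scores)      # best feature first
    return order, scores

def select_top_k(X, y, k):
    """Keep the k highest-scoring columns, preserving their original order."""
    order, _ = rank_features_by_correlation(X, y)
    keep = np.sort(order[:k])
    return np.asarray(X)[:, keep], keep

# Toy data (made up): one informative column, one noise column, one constant column.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
X_demo = np.column_stack([
    signal + 0.1 * rng.normal(size=n),  # informative
    rng.normal(size=n),                 # pure noise
    np.ones(n),                         # constant, carries no information
])
y_demo = 2.0 * signal

X_selected, kept = select_top_k(X_demo, y_demo, k=1)
print("kept column indices:", kept.tolist())
```

A wrapper method would instead retrain a model on each candidate subset, which is usually more expensive than a single statistical pass like this one.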
Feature extraction transforms the original features into a smaller set of new features.
- Principal Component Analysis (PCA) finds a set of orthogonal linear combinations of the original features that capture the maximum variance in the data.
- Factor Analysis is used to identify underlying factors that explain the relationships between variables.
- Singular Value Decomposition (SVD) is a matrix factorization technique that can be used for feature extraction.
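PCA and SVD are closely linked: the principal components of a centered data matrix are its right singular vectors. The sketch below shows this with NumPy; the function name `pca_via_svd` and the synthetic data are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """Project X onto its top principal components using the SVD of the centered data."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    # Variance captured along each direction, as a fraction of the total variance.
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    explained_ratio = explained_variance / explained_variance.sum()
    return projected, components, explained_ratio[:n_components]

# Synthetic data (made up): three strongly correlated columns driven by one latent variable.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 1))
X_demo = np.hstack([latent, 2 * latent, -latent]) + 0.05 * rng.normal(size=(300, 3))

Z, components, ratio = pca_via_svd(X_demo, n_components=2)
print("explained variance ratio:", ratio)
```

The columns of `Z` are uncorrelated with each other, and the first one carries most of the variance, which is why a few components can stand in for many correlated features.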

Feature extraction is a part of dimensionality reduction.
Raw data is divided and reduced to more manageable groups.
The resulting lower-dimensional features should be uncorrelated and have large variance.
Feature extraction can be applied to images, text, geospatial data, date and time, web data, and sensor data.

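Factor analysis is a set of techniques that analyze correlations between variables to reduce them to fewer factors that explain the original data.
Exploratory Factor Analysis (EFA) identifies the underlying factors without a prior hypothesis about their structure; PCA is sometimes used as the extraction step, although PCA and factor analysis are distinct techniques.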
Confirmatory Factor Analysis (CFA) tests a hypothesized model of the relationships between variables.
Assumptions:
- Variables must be related (sufficient correlation).
- Sample size should be adequate (at least 50 observations, preferably 100 or more).
Issues:
- Overloading: too many items loading on a single factor.
- Cross-loading: variables loading highly on multiple factors.
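Both issues can be spotted by inspecting the factor loading matrix (rows are variables, columns are factors). The sketch below is illustrative only: the loading values are made up, the 0.4 cutoff for a "high" loading is an assumed convention rather than a fixed rule, and `loading_diagnostics` is a hypothetical helper name.

```python
import numpy as np

def loading_diagnostics(loadings, threshold=0.4):
    """Flag cross-loading variables and count high-loading items per factor."""
    L = np.abs(np.asarray(loadings, dtype=float))
    high = L >= threshold                      # True where a variable loads highly on a factor
    # Cross-loading: a variable with a high loading on more than one factor.
    cross_loading_vars = np.where(high.sum(axis=1) > 1)[0].tolist()
    # Items per factor: a large count relative to the others may indicate overloading.
    items_per_factor = high.sum(axis=0).tolist()
    return cross_loading_vars, items_per_factor

# Hypothetical loadings for 5 variables on 2 factors.
loadings_demo = [
    [0.80, 0.10],
    [0.75, 0.05],
    [0.60, 0.55],  # loads highly on both factors
    [0.10, 0.70],
    [0.20, 0.65],
]

cross, per_factor = loading_diagnostics(loadings_demo)
print("cross-loading variables:", cross)
print("high-loading items per factor:", per_factor)
```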

Factor analysis can be used to understand the underlying motives of consumers who buy a product category or a brand.
It is also used to determine which variables potential customers consider when buying a product.

Logistic regression is a statistical method used to predict the probability of a binary outcome coded as 0 or 1 (e.g., pass/fail, buy/not buy).
The dependent variable is binary.
Independent variables can be either continuous or categorical.
Used for:
- Predicting customer response to marketing campaigns.
- Assessing credit risk.
- Identifying risk factors for disease.
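To make the idea of predicting a probability concrete, here is a minimal sketch of logistic regression fitted by plain gradient descent on the log-loss. The pass/fail data, the learning rate, the iteration count, and the function names are all assumptions for illustration; in practice a library implementation would normally be used.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit weights (intercept first) by gradient descent on the mean log-loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                      # current predicted probabilities
        grad = Xb.T @ (p - y) / len(y)           # gradient of the mean log-loss
        w -= lr * grad
    return w

def predict_proba(X, w):
    """Return P(outcome = 1) for each row of X."""
    X = np.asarray(X, dtype=float)
    Xb = np.column_stack([np.ones(len(X)), X])
    return sigmoid(Xb @ w)

# Made-up toy data: hours studied (continuous) vs. pass (1) / fail (0).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

w = fit_logistic(hours, passed)
probs = predict_proba(hours, w)
print("predicted pass probabilities:", np.round(probs, 2))
```

The output is a probability for each observation; a class label is obtained by applying a cutoff such as 0.5, which is itself a modelling choice.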

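Linear regression is not suitable for binary outcomes because its predictions are unbounded and can fall outside the [0, 1] range; logistic regression keeps its predicted probabilities between 0 and 1.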
Linear regression is sensitive to outliers, which can significantly affect the results.
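Both points can be checked numerically. The sketch below fits an ordinary least-squares line to made-up binary data with `np.polyfit`, shows that its predictions leave the [0, 1] range, and then adds a single extreme observation to show how much the fitted slope changes. The data and the outlier value are assumptions chosen for illustration.

```python
import numpy as np

# Made-up binary data: x is a continuous predictor, y is a 0/1 outcome.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line fitted to the binary outcome.
slope, intercept = np.polyfit(x, y, deg=1)

# Predictions away from the observed range are not valid probabilities.
x_new = np.array([-2.0, 10.0])
linear_pred = slope * x_new + intercept
print("linear predictions:", np.round(linear_pred, 2))

# One extreme observation (x = 50, y = 1) noticeably flattens the fitted line.
x_out = np.append(x, 50.0)
y_out = np.append(y, 1.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, deg=1)
print("slope without outlier:", round(slope, 3))
print("slope with outlier:   ", round(slope_out, 3))
```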