Data Pre-Processing Techniques and Feature Selection

Study Notes

Data cleaning involves handling missing values and removing duplicates.
Data integration combines data from different sources into a cohesive dataset.
Data transformation standardizes and normalizes data for analysis.
Data reduction shrinks a dataset, for example through aggregation, sampling, or dimensionality reduction, to improve performance while keeping most of the information.
Data discretization converts continuous data into discrete values, often through binning.
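
The cleaning, transformation, and discretization steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed pipeline: the helper names (drop_duplicates, impute_mean, min_max_scale, equal_width_bins) and the toy rows are assumptions made for this example, and mean imputation and equal-width binning are only one choice among several.

```python
from statistics import mean

def drop_duplicates(rows):
    """Remove exact duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical toy data: (id, age) rows with one duplicate and one missing age.
rows = [("a", 20), ("b", None), ("c", 40), ("d", 60), ("a", 20)]
unique_rows = drop_duplicates(rows)                  # cleaning: duplicates
ages = impute_mean([age for _, age in unique_rows])  # cleaning: missing values
scaled = min_max_scale(ages)                         # transformation
binned = equal_width_bins(ages, n_bins=2)            # discretization
```

Min-max scaling is sensitive to outliers; standardizing to zero mean and unit variance is a common alternative when extreme values are present.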

The filter approach evaluates features based on their intrinsic properties, without considering the learning algorithm.
Common statistical measures include correlation, Chi-squared tests, and information gain.
Filters select features based on their relevance to the target variable, using ranking methods to score features independently.
Fast and computationally efficient, filter methods can handle large datasets well, but may ignore interactions between features.
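
As a rough sketch of the filter idea, the code below scores each feature independently by its absolute Pearson correlation with the target and ranks the results. No learning algorithm is involved. The function names and the toy feature columns are assumptions made for this example; Chi-squared or information gain could be used as the score instead.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def rank_features(features, target):
    """Score every feature on its own and sort from most to least relevant."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data.
target = [1, 2, 3, 4, 5]
features = {
    "strong": [2, 4, 6, 8, 10],   # perfectly linear in the target
    "weak": [3, 1, 4, 1, 5],      # loosely related
    "constant": [7, 7, 7, 7, 7],  # carries no information
}
ranking = rank_features(features, target)
top_k = [name for name, _ in ranking[:2]]  # keep the two best-scoring features
```

Because each feature is scored in isolation, two features that are only useful together would both rank poorly here, which is the interaction blind spot noted above.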

The wrapper approach evaluates feature subsets by training and validating a model using those features.
It uses a specific learning algorithm and scores each candidate subset by the model’s performance, such as its accuracy.
It is computationally intensive since it requires multiple iterations of model training and validation.
It accounts for interactions between features, which can yield better performance than filter methods.
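
One common wrapper strategy is greedy forward selection: start with no features and repeatedly add the one that most improves validated accuracy. The sketch below assumes a nearest-centroid classifier and leave-one-out validation purely to stay self-contained; the notes do not prescribe a particular model or validation scheme, and all function names and data are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )

def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of the classifier using only the given columns."""
    hits = 0
    for i in range(len(X)):
        train_X = [[r[c] for c in cols] for j, r in enumerate(X) if j != i]
        train_y = [l for j, l in enumerate(y) if j != i]
        test_x = [X[i][c] for c in cols]
        hits += nearest_centroid_predict(train_X, train_y, test_x) == y[i]
    return hits / len(X)

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves validated accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # One model evaluation per candidate: this loop is the costly part.
        score, col = max((loo_accuracy(X, y, selected + [c]), c) for c in remaining)
        if score <= best_score:
            break  # no candidate improves the model, so stop
        selected.append(col)
        remaining.remove(col)
        best_score = score
    return selected, best_score

# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
X = [[1.0, 5.0], [1.2, 1.0], [0.8, 2.5], [3.0, 4.0], [3.2, 2.0], [2.8, 3.6]]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y, max_features=2)
```

Even on this tiny dataset every candidate subset triggers a full round of training and validation, which illustrates why wrapper methods scale poorly as the number of features grows.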

Filter methods do not involve a learning algorithm during feature selection, while wrapper methods directly assess the algorithm’s performance.
Operational complexity varies, with filter methods being simpler and faster compared to the resource-intensive wrapper approach.
Filter methods may miss contextual feature interactions, whereas wrapper methods can capture complex relationships.

Correlation Coefficients measure linear relationships between features, highlighting redundancy.
Mutual Information assesses the amount of information one variable provides about another.
Variance Inflation Factor (VIF) quantifies how much the variance of an estimated regression coefficient is inflated by multicollinearity among the predictors.
Principal Component Analysis (PCA) transforms features into uncorrelated components; components with near-zero variance indicate redundancy among the original features.
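
For reference, VIF is usually defined per feature as follows, where R_j^2 is the coefficient of determination obtained by regressing feature j on all the other features. The symbols j and R_j^2 are notation introduced here, not taken from the notes.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A VIF of 1 means feature j is uncorrelated with the other features. If R_j^2 = 0.9, then VIF_j = 10, meaning the coefficient's variance is ten times what it would be without collinearity. Cutoffs such as 5 or 10 are rules of thumb rather than strict limits.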

Correlation captures only linear relationships and may overlook non-linear associations.
Mutual Information captures both linear and non-linear relationships, providing a more comprehensive measure.
VIF focuses on how multicollinearity affects linear regression models; it measures inflated coefficient variance rather than redundancy as such.
PCA combines features into components but loses interpretability, contrasting with other methods that retain feature names.
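
The contrast between correlation and mutual information can be seen on a small non-linear example. In the sketch below, y is the square of x over a symmetric range, so the Pearson correlation is zero even though y is fully determined by x. The mutual information estimate treats the values as discrete categories, which is an assumption of this example; continuous data would first need binning or a dedicated estimator.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

# Hypothetical toy data: a purely non-linear (quadratic) dependence.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]

r = pearson(x, y)              # zero: no linear trend
mi = mutual_information(x, y)  # clearly positive: y depends on x
```

A filter that ranks by correlation alone would discard x here, while one based on mutual information would keep it.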

Filter Approach: Quick evaluation based on statistical metrics, not reliant on any model; limited interaction consideration.
Wrapper Approach: Model-dependent evaluation that considers feature interactions; higher computational cost, often with more accurate results.
Embedded Approach: Integrates feature selection within model training; balances efficiency and model performance, capturing interactions while being less computationally intensive than wrapper methods.
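
L1-regularized (lasso) regression is a common example of the embedded approach: the penalty drives the coefficients of weak features to exactly zero while the model is being fit. The sketch below uses plain coordinate descent and assumes centered features with no intercept; the function names, the penalty value alpha=0.5, and the data are all hypothetical choices for illustration.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha, returning exactly 0.0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w

# Hypothetical centered toy data: y is about 3 * feature 0 plus 0.1 * feature 1.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.9, -3.1, 0.0, 2.9, 6.1]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
kept = [j for j, wj in enumerate(weights) if wj != 0.0]  # features the model retained
```

Selection happens as a by-product of a single training run, so the cost sits between a filter's one-pass scoring and a wrapper's repeated retraining. The retained coefficient is also shrunk below its true value of 3, which is the usual trade-off of the L1 penalty.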