Dimensionality Reduction - Feature Selection PDF
Mohammed Brahimi & Sami Belkacem
Summary
This document provides an overview of dimensionality reduction, focusing on feature selection. It examines different approaches, including unsupervised and supervised methods, along with their advantages and disadvantages.
Full Transcript
Dimensionality Reduction Part 2: Feature selection
Mohammed Brahimi & Sami Belkacem

Outline
What is Feature Subset Selection?
Why do Feature Subset Selection?
Taxonomy of Feature Subset Selection methods
○ Unsupervised methods
○ Supervised methods: Filter methods and Wrapper methods
Comparison of Feature Subset Selection methods

What is Feature Subset Selection (FSS)?
Definition: select a subset of features (attributes) from a given feature set.
Goal: optimize an objective function for the chosen features.
Feature extraction (e.g. PCA):
○ Transforms the existing features into a lower-dimensional space.
Feature selection:
○ Selects a subset of the existing features without transformation.

What is Feature Subset Selection (FSS)?
Feature set: F = {A1, A2, …, An}
Objective: find a subset of features S ⊆ F that optimizes the objective function J(S).
Optimization: S* = arg max over S ⊆ F of J(S)

Why do Feature Subset Selection?
Why not stick to feature extraction, as it essentially accomplishes the task of dimensionality reduction?

Why do Feature Subset Selection?
Feature values may be expensive to collect and obtain
○ Test many features (sensors) and choose only a few for the final implementation.
The need to extract meaningful rules from data mining models
○ When you transform the original features into new ones, the measurement units of the features are lost.
Features may not be numeric
○ A typical situation in data mining.
Fewer features means fewer parameters for pattern recognition
○ Improves the generalization capabilities of the model and avoids overfitting.
○ Reduces model complexity and running time.

Why is feature selection challenging?
Number of features = 3: {A1, A2, A3}
How many feature subsets?
○ {A1}, {A2}, {A3}
○ {A1, A2}, {A1, A3}, {A2, A3}
○ {A1, A2, A3}

Why is feature selection challenging?
Number of features = n: {A1, A2, A3, …, An}
How many feature subsets?
○ {A1}, {A2}, {A3}, …
○ {A1, A2}, {A1, A3}, {A2, A3}, …
○ {A1, A2, A3}, …
○ …
The number of possible subsets grows exponentially (2^n): for n = 100, that is 1 267 650 600 228 229 401 496 703 205 376 subsets, far too many to evaluate exhaustively.

Taxonomy of Feature Selection methods

Unsupervised Feature Selection
Unsupervised feature selection typically uses statistical measures to evaluate features:
Variance:
○ Features with low variance are less likely to be informative and useful for prediction.
Correlation:
○ Features that are highly correlated with each other are likely to contain redundant information.
Mutual information:
○ Measures the amount of information that two features share.
○ Features with high mutual information are more likely to be predictive of each other.

Variance Filter
Remove features with a high percentage of identical values across all objects.
Low variance indicates a near-constant feature, which carries little useful information.

Correlation Filter
Remove highly correlated input features.
Correlated features indicate redundant information (e.g. Feature 2 = 2 × Feature 1).
Note: Correlation can only detect linear relationships between features.

Mutual Information Filter
Measures the amount of information that two categorical features share using entropies.
Features with high mutual information are more likely to be predictive of each other.
Mutual information can detect any kind of relationship between two features.
I(X; Y) = H(X) + H(Y) − H(X, Y)
❖ H(X), H(Y): individual entropies
❖ H(X, Y): joint entropy
❖ I(X; Y): mutual information
The concept of entropy will be introduced in the upcoming slides.
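To make the three unsupervised filters above concrete, here is a minimal Python sketch. The toy DataFrame, the 0.01 variance threshold, and the 0.9 correlation cutoff are illustrative assumptions, not values from the slides.

```python
# Minimal sketch of the three unsupervised filters: variance, correlation, mutual information.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import mutual_info_score

# Hypothetical data: f2 is near-constant, f4 is a linear copy of f1.
X = pd.DataFrame({
    "f1": np.random.rand(100),
    "f2": np.zeros(100),
    "f3": np.random.rand(100),
})
X["f4"] = 2 * X["f1"]

# 1) Variance filter: drop (near-)constant features.
vt = VarianceThreshold(threshold=0.01)
vt.fit(X)
kept_by_variance = X.columns[vt.get_support()].tolist()

# 2) Correlation filter: drop one feature of every highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

# 3) Mutual information between two categorical features
#    (obtained here by discretizing two columns into bins).
f1_bins = pd.cut(X["f1"], bins=4, labels=False)
f4_bins = pd.cut(X["f4"], bins=4, labels=False)
mi = mutual_info_score(f1_bins, f4_bins)

print(kept_by_variance, to_drop, round(mi, 3))
```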
Taxonomy of Feature Selection methods

Filter methods for FSS
Goal: evaluate features independently using the target variable, and select the most important ones based on statistical measures.
Common statistical measures: correlation with the target variable, information gain, and the chi-square test.
How to use filter methods for FSS:
1. Choose a filter metric (correlation, information gain, or chi-square test).
2. Evaluate each feature using the chosen filter metric and rank them accordingly.
3. Select the features with the highest filter metric scores.

Filter methods: Correlation
In filter methods, correlation measures how well a feature relates to the target variable.
Features highly correlated with the target variable:
○ These features are often good candidates for prediction.
Features weakly correlated with the target variable:
○ These features may not contribute meaningfully to predictions.
Note: correlation does not capture non-linear relationships with the target variable or interactions between features.

Correlation Heatmap
Helps in visualizing the correlation between attributes as well as the correlation with the target variable.
Warm colors for positive correlation, cool colors for negative, neutral for no correlation.
Example: in this heatmap, suppose the price of the house is the target variable.

Filter methods: Information Gain
Goal: measure how much information is gained about the target categorical variable when a dataset is split on a given categorical feature.
Principle: calculate the difference between the entropy of the target variable before and after the split.
Note: Information Gain is commonly used in classification algorithms such as decision trees.
What is entropy?

What is entropy?
It measures the level of uncertainty, randomness, or disorder in a system or dataset.
In the context of data, entropy quantifies the impurity of a set.
S = {x1, x2, …, xn}: the set of elements
H(S) = −Σi p(xi) × log₂ p(xi)
H(S): the entropy of the whole set
p(xi): the probability of having the element xi

What is entropy? (balanced set)
p(red) = 6/12 = 0.5, p(green) = 6/12 = 0.5
H(S) = −(0.5×log₂(0.5) + 0.5×log₂(0.5)) = −(−0.5 − 0.5)
H(S) = 1

What is entropy? (imbalanced set)
p(red) = 3/12 = 0.25, p(green) = 9/12 = 0.75
H(S) = −(0.25×log₂(0.25) + 0.75×log₂(0.75))
H(S) = 0.811278

What is entropy? (pure set)
p(red) = 0, p(green) = 1
H(S) = −(0 + 1×log₂(1))
H(S) = 0

Filter methods: Information Gain
Goal: measure how much information is gained about the target categorical variable when a dataset is split on a given categorical feature.
Principle: calculate the difference between the entropy of the target variable before and after the split.

Example: Information Gain for Age
IG(Age) = H(Root node) − p(Age = Recent)×H(Recent) − p(Age = Old)×H(Old)
        = 1 − (3/4)×0.918 − (1/4)×0
IG(Age) = 0.3115

Example: Information Gain for Road Tested
Both child subsets are pure.
IG(Road Tested) = 1 − 0.5×0 − 0.5×0 = 1 − 0
IG(Road Tested) = 1 (perfect predictor)

Filter methods: Chi-Square test
Goal: measure the association or dependency between a given categorical feature and the target categorical variable.
Principle: compare observed (O) and expected (E) frequencies to assess feature-target association.
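Tying together the entropy and information gain slides above, the following sketch recomputes the red/green entropies and the information gain of a perfectly predictive feature. The slides' original dataset (Age, Road Tested) is not reproduced in this transcript, so the toy feature and labels below are assumed for illustration.

```python
# Minimal sketch of entropy and information gain for categorical data.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p(x_i) * log2(p(x_i)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - sum over feature values v of p(v) * H(labels | v)."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        split_entropy += (len(subset) / n) * entropy(subset)
    return entropy(labels) - split_entropy

# Balanced set (6 red, 6 green) has entropy 1 bit, as on the slides.
print(entropy(["red"] * 6 + ["green"] * 6))   # 1.0
print(entropy(["red"] * 3 + ["green"] * 9))   # ~0.811
# Hypothetical 4-row dataset: a perfectly predictive feature gives IG = 1.
print(information_gain(["yes", "yes", "no", "no"],
                       ["buy", "buy", "skip", "skip"]))  # 1.0
```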
Chi-Square
X² = Σ (O − E)² / E
Expected value for row i and column j: E(i, j) = (total of row i × total of column j) / grand total

Example: Chi-Square test
Dataset: information about people; one of the attributes is gender, and we want to predict pet preference.
Question: Is there a relationship between gender and preference for pets (cats or dogs)?
Null hypothesis H0: the two variables are not related to each other.
(The tables of observed frequencies and expected values shown on the slide are not reproduced in this transcript.)

Example: Chi-Square test
X² = 1.099 + 0.918 + 1.136 + 0.949 = 4.102
Degrees of freedom: df = (number of rows − 1)×(number of columns − 1) = 1
Define the significance level α (probability threshold below which we reject H0, often α = 0.05).
Looking up the critical values of chi-square (right tail), we find that the critical value is 3.84.
X² > 3.84, so we reject H0.

Filter methods for FSS
Advantages:
○ Computational efficiency: no data mining model is trained ⇒ computationally efficient and scalable to high dimensions and large datasets.
○ Algorithm independence: independent of any data mining algorithm ⇒ unbiased toward any particular algorithm.
Disadvantages:
○ Suboptimal selection: may not always identify the best feature subset because it does not consider the performance of the data mining algorithm.
○ Complex relationships: may not handle complex relationships and interactions between features.

How to use filter methods for FSS
Filter methods as an initial step
○ Filter methods serve as an excellent initial step in the feature selection process.
○ Particularly valuable when dealing with large datasets.
Hybridization is key
○ It is advisable to combine filter methods with other feature selection techniques.
○ Wrapper methods complement filter methods effectively.

Taxonomy of Feature Selection methods

Wrapper methods for FSS
Goal: evaluate features by using a data mining model as a black box.
Principle: the model is trained on different subsets of features, and the subset that produces the best performance is selected.
How to use wrapper methods for FSS:
Feature Subset Generation: the search algorithm generates various feature subsets.
Model Evaluation: each subset is fed to the model, which is trained and evaluated on it.
Selection Criterion: a selection criterion (e.g. accuracy) guides the choice of the best-performing feature subset.
Iteration: the process is repeated with different subsets until an optimal set is found.

Search strategy and Objective function
Search Strategy:
○ Exhaustive evaluation of all subsets is impossible.
○ A search strategy navigates the vast feature subset space efficiently.
Objective Function:
○ Evaluates candidate feature subsets.
○ Quantitative measure of the "goodness" or quality of a subset.
○ Feedback from the objective function guides the search strategy.

Search strategy and Objective function
Forward selection: add features step by step, choosing the one that improves the model the most at each iteration.
Backward elimination: remove features step by step, choosing at each iteration the one that contributes the least to the model.
Genetic Algorithm (GA): optimization technique using natural selection to evolve feature subsets for better model performance.

Forward Selection
1. Start with 0 features.
2. Add the best feature.
3. Stop adding features when:
   a. A given number of features is reached.
   b. A performance threshold is reached.
   c. The elbow method indicates diminishing returns.

Backward elimination
1. Start with all features.
2. Remove the worst feature.
3. Stop removing features when:
   a. A given number of features is reached.
   b. A performance threshold is reached.
   c. The elbow method indicates diminishing returns.
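The chi-square test from the gender/pets example above can be run with SciPy's chi2_contingency. The observed contingency table below is hypothetical, since the slide's actual frequency table is not included in the transcript; correction=False disables the Yates continuity correction so the computation follows the plain X² = Σ (O − E)² / E formula used on the slides.

```python
# Sketch of a chi-square independence test between two categorical variables.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts. Rows: gender (male, female); columns: pet (cat, dog).
observed = np.array([[30, 70],
                     [45, 55]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
# Reject H0 ("the two variables are not related") when p_value < alpha,
# or equivalently when chi2 exceeds the critical value (3.84 for dof=1, alpha=0.05).
if p_value < 0.05:
    print("Reject H0: the two variables appear to be related.")
else:
    print("Fail to reject H0.")
```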
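As a minimal sketch of the forward selection and backward elimination loops just described, scikit-learn's SequentialFeatureSelector wraps any estimator in exactly this kind of greedy search. The breast-cancer dataset, the logistic regression model, and the choice of 5 features are illustrative assumptions, not part of the slides.

```python
# Wrapper-style forward selection with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# direction="forward": start with 0 features and greedily add the one that
# improves cross-validated accuracy the most; direction="backward" starts from
# all features and removes the least useful one at each step.
selector = SequentialFeatureSelector(
    model, n_features_to_select=5, direction="forward", scoring="accuracy", cv=5
)
selector.fit(X, y)

print("Selected feature indices:", selector.get_support(indices=True))
```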
Genetic Algorithm (GA) for Feature Selection
What is a genetic algorithm?

Genetic Algorithm (GA) for Feature Selection
Crossover
○ Combine two solutions.
○ Produce good children.
Mutation
○ Random change.
○ Exploration.
Selection
○ Select the best.
○ Survival of the fittest.
N.B. In GA-based feature selection, each candidate subset can be encoded as a bit string: selected features are marked with 1's and discarded features with 0's.

Wrapper methods for FSS
Advantages:
○ Optimal subset selection: can detect the ideal feature subset for a given data mining algorithm.
○ Managing complex relationships: effective in handling relationships between features by evaluating a whole subset, not a single feature.
Disadvantages:
○ Computational intensity: tends to be computationally demanding, particularly with large datasets.
○ Algorithm bias: may exhibit bias towards the data mining algorithm used for feature evaluation.

Comparison of Feature Selection Methods

Comparison of Feature Selection Methods
How to choose a Feature Selection Method?
Size of dataset
○ Filter methods work well for large datasets.
○ Wrapper methods work well for small datasets.
Computational budget
○ Filter methods are fast.
○ Wrapper methods are slow.
Need for interpretability
○ Filter methods allow inspection of feature importance.
○ Wrapper methods act as a black box.
Domain knowledge
○ Unsupervised methods do not use labels.
○ Supervised methods use domain labels.
Algorithm fit
○ Wrapper methods are tailored to a specific data mining algorithm.
○ Filter methods are independent of any algorithm.
Feature interactions
○ Wrapper methods handle feature interactions.
○ Filter methods assess features independently.

Summary
The goal of FSS is to select an optimal subset of features.
Main approaches:
○ Unsupervised: fast evaluation of features using statistical metrics.
○ Supervised Filter: evaluate features independently using statistical metrics along with the target.
○ Supervised Wrapper: use model performance to search feature subsets.
Considerations:
○ Dataset size and dimensionality.
○ Computational budget and time constraints.
○ Need for interpretability vs performance.
○ Domain knowledge of features.
○ The chosen data mining algorithm.
○ Relationships between features.
No one-size-fits-all method: choose a method based on goals, data, and constraints.
Hybrid approaches combine the strengths of different techniques.
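To complement the genetic algorithm slides above, here is a rough, self-contained sketch of GA-based feature selection using the bit-string encoding mentioned in the slides (1 = selected, 0 = discarded). The population size, number of generations, mutation rate, dataset, and model are all illustrative assumptions.

```python
# Rough sketch of a genetic algorithm for feature selection.
# Fitness of a 0/1 mask = cross-validated accuracy on the selected features.
import random
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]
rng = random.Random(0)

def fitness(mask):
    if not any(mask):                       # an empty subset gets the worst score
        return 0.0
    cols = [i for i, bit in enumerate(mask) if bit]
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, cols], y, cv=3).mean()

def crossover(a, b):                        # one-point crossover of two parents
    point = rng.randint(1, n_features - 1)
    return a[:point] + b[point:]

def mutate(mask, rate=0.05):                # flip each bit with a small probability
    return [bit ^ (rng.random() < rate) for bit in mask]

population = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(10)]
for generation in range(5):
    population.sort(key=fitness, reverse=True)   # selection: keep the fittest
    parents = population[:4]
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(6)]
    population = parents + children

best = max(population, key=fitness)
print("Best subset size:", sum(best), "fitness:", round(fitness(best), 3))
```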