Feature Selection For DEA
Summary
This document provides an overview of feature selection in machine learning: what it is, how it differs from feature extraction, and when it matters. It covers supervised, unsupervised, and semi-supervised settings, wrapper, filter, and embedded selection strategies, and the main families of traditional methods (information theoretical, statistical, and sparse learning based), which is useful background for data analysis.
Full Transcript
Feature Selection

What is feature selection?
– A procedure in machine learning to find a subset of features that produces a 'better' model for a given dataset
– Avoid overfitting and achieve better generalization ability
– Reduce the storage requirement and training time
– Improve interpretability

Relevant vs. Redundant Features
– Feature selection keeps relevant features for learning and removes redundant and irrelevant features
– For example, in a binary classification task, f1 is relevant, f2 is redundant given f1, and f3 is irrelevant

Feature Extraction vs. Feature Selection
Commonalities
– Speed up the learning process
– Reduce the storage requirements
– Improve the learning performance
– Build more generalized models
Differences
– Feature extraction obtains new features, while feature selection selects a subset of the original ones
– Feature selection maintains the physical meaning of features and gives models better readability and interpretability

Interpretability of Learning Algorithms
As the complexity of a machine learning model increases, we get better performance but lose out on interpretability.

When is feature selection important?
– Noisy data
– Lots of low-frequency features
– Use of multi-type features
– Too many features compared to the number of samples
– Complex models
– Samples in the real scenario are inhomogeneous with the training and test samples

Feature Selection Algorithms
From the label perspective (whether label information is involved during the selection phase):
– Supervised
– Unsupervised
– Semi-supervised
From the selection strategy perspective (how the features are selected):
– Wrapper methods
– Filter methods
– Embedded methods

Supervised Feature Selection
– Often used for classification or regression problems
– Finds discriminative features that separate samples from different classes (classification) or approximate the target variables (regression)

Unsupervised Feature Selection
– Often used for clustering problems
– Label information is expensive to obtain, requiring both time and effort
– Unsupervised methods seek alternative criteria to define feature relevance

Semi-Supervised Feature Selection
– We often have a small amount of labeled data and a large amount of unlabeled data
– Semi-supervised methods exploit both labeled and unlabeled data to find relevant features

Feature Selection Techniques - Subset Selection
Two types: forward search and backward search
Forward search
– Start with no features
– Greedily include the most relevant feature
– Stop when the desired number of features has been selected
Backward search
– Start with all the features
– Greedily remove the least relevant feature
– Stop when the desired number of features has been selected
The inclusion/removal criterion uses cross-validation (see the sketch below).

Wrapper Methods
– Step 1: search for a subset of features
– Step 2: evaluate the selected features
– Repeat Step 1 and Step 2 until stopped
– Can be applied to ANY model!
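A minimal sketch of this forward-search wrapper loop, assuming scikit-learn, the breast-cancer toy dataset, a standardized logistic-regression pipeline, and 5-fold cross-validation as placeholder choices; none of these are prescribed by the slides.

```python
# Greedy forward selection (wrapper): repeatedly add the feature whose inclusion
# gives the best cross-validated score. Dataset, model, and CV setup are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def forward_select(X, y, estimator, n_features):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        # Step 1: propose each remaining feature; Step 2: evaluate it with cross-validation.
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(forward_select(X, y, model, n_features=3))
```

Backward search is the mirror image: start from the full feature set and greedily drop the feature whose removal hurts the cross-validated score the least.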
Wrapper Methods (continued)
– Rely on the predictive performance of a predefined learning algorithm to assess features
– Shrink/grow the feature set by greedy search
– Repeat until some stopping criterion is satisfied
– Achieve high accuracy for a particular learning method
– Run cross-validation or a train/validation split per candidate feature
– Computationally expensive (the worst-case search space is 2^d); typical search strategies are
  – sequential search
  – best-first search
  – branch-and-bound search

Filter Methods
– Independent of any learning algorithm
– Rely on certain characteristics of the data to assess feature importance (e.g., feature correlation, mutual information, …)
– More efficient than wrapper methods
– The selected features may not be optimal for a particular learning algorithm

Feature Selection Techniques - Single Feature Evaluation
Measure the quality of individual features by all kinds of metrics:
– Frequency based
– Dependence of feature and label (co-occurrence), e.g., mutual information, chi-square statistic
– Information theory: KL divergence, information gain
– Gini index

Embedded Methods
– A trade-off between wrapper and filter methods, embedding feature selection into the model learning (e.g., ID3)
– Inherit the merits of wrapper and filter methods
  – include the interactions with the learning algorithm
  – more efficient than wrapper methods
– Like wrapper methods, they are biased toward the underlying learning algorithm

Selection Criteria

Traditional Feature Selection
– Information theoretical based methods
– Statistical based methods
– Sparse learning based methods

Information Theoretical based Methods
– Exploit different heuristic filter criteria to measure the importance of features
– Our target is to find these "optimal" features

Preliminary - Information Theoretical Measures
– Entropy of a discrete variable X: H(X) = -Σ_x P(x) log2 P(x)
– Conditional entropy of X given Y: H(X|Y) = -Σ_y P(y) Σ_x P(x|y) log2 P(x|y)
– Information gain (mutual information) between X and Y: I(X;Y) = H(X) - H(X|Y)
– Conditional information gain: I(X;Y|Z) = H(X|Z) - H(X|Y,Z)

Sample Example
Weather   Picnic
Rainy     No
Rainy     No
Sunny     No
Cloudy    No
Rainy     No
Sunny     Yes
Sunny     Yes
Sunny     Yes
Cloudy    Yes
Cloudy    Yes
The information gain from using the weather (X) to predict whether people go for a picnic (Y) is roughly 0.4 bits (H(Y) = 1 bit, H(Y|X) = 0.6 bits): knowing the weather notably decreases the uncertainty about picnic attendance.

Sample data 2
Weather   Picnic
Rainy     No
Rainy     No
Rainy     No
Rainy     No
Rainy     No
Sunny     Yes
Sunny     Yes
Sunny     Yes
Sunny     Yes
Sunny     Yes
Here the weather determines the picnic decision exactly, so the information gain is the full 1 bit of entropy in Y.
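A minimal sketch of these information measures (the helper names are mine, not from the slides), applied to the second weather/picnic table, where it reproduces the 1-bit information gain.

```python
# Entropy / information-gain helpers for discrete variables, using base-2 logs (bits).
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum_x p(x) log2 p(x)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(Y|X) = sum_x p(x) H(Y | X=x)."""
    n = len(x)
    total = 0.0
    for v in set(x):
        ys = [yi for xi, yi in zip(x, y) if xi == v]
        total += (len(ys) / n) * entropy(ys)
    return total

def information_gain(x, y):
    """I(X;Y) = H(Y) - H(Y|X)."""
    return entropy(y) - conditional_entropy(x, y)

# Sample data 2: the weather fully determines the picnic decision.
weather = ["Rainy"] * 5 + ["Sunny"] * 5
picnic = ["No"] * 5 + ["Yes"] * 5
print(information_gain(weather, picnic))  # 1.0 bit
```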
Information Theoretic based Methods - A General Framework
– Searching for the best feature subset is NP-hard, so most methods employ forward/backward sequential search heuristics
– For forward search, given the already selected features S, the next selected feature f should
  – maximize its correlation with the class labels Y: I(f; Y)
  – minimize its redundancy w.r.t. the selected features in S: Σ_{s∈S} I(f; s)
  – maximize its complementary information w.r.t. the selected features in S: Σ_{s∈S} I(f; s|Y)
– Given the selected features S, the feature score of the next selected feature f can be determined by J(f) = g(I(f;Y), {I(f;s)}_{s∈S}, {I(f;s|Y)}_{s∈S})
– If g(·) is a linear function, it can be represented as J(f) = I(f;Y) - β Σ_{s∈S} I(f;s) + λ Σ_{s∈S} I(f;s|Y)
– But g(·) can also be a nonlinear function

Information Gain [Lewis, 1992]
– Measures feature importance only by a feature's correlation with the class labels
– The information gain of a new unselected feature f is J(f) = I(f; Y)
– Features are selected independently of each other
– It is a special case of the linear function with β = 0 and λ = 0

Mutual Information Feature Selection [Battiti, 1994]
– Information gain only considers feature relevance; features should also not be redundant to each other
– The score of a new unselected feature f is J(f) = I(f; Y) - β Σ_{s∈S} I(f; s)
– It is also a special case of the linear function, obtained by setting λ = 0

Minimum Redundancy Maximum Relevance [Peng et al., 2005]
– Intuitively, with more selected features, the effect of feature redundancy should gradually decrease; meanwhile, pairwise feature independence becomes stronger
– The score of a new unselected feature f is J(f) = I(f; Y) - (1/|S|) Σ_{s∈S} I(f; s)
– MRMR is also a special case of the linear function, obtained by setting λ = 0 and adjusting β = 1/|S| adaptively

Conditional Infomax Feature Extraction [Lin and Tang, 2006]
– A correlated feature is useful if the correlation within classes is stronger than the overall correlation
– Correlation does not imply redundancy! [Guyon et al., 2006]
– The score of a new unselected feature f is J(f) = I(f; Y) - Σ_{s∈S} I(f; s) + Σ_{s∈S} I(f; s|Y)
– It is also a special case of the linear function, with β = λ = 1

The Function g(·) Can Also Be Nonlinear
Conditional Mutual Information Maximization [Fleuret, 2004]
– Identifies informative features by maximizing the conditional mutual information between each candidate feature and the target variable, conditioned on the previously selected features: J(f) = min_{s∈S} I(f; Y|s)
Information Fragments [Vidal-Naquet and Ullman, 2003]
– Identifies relevant features by maximizing the mutual information between each feature and the target variable while minimizing the redundancy among the selected features; informative features should provide unique and complementary information about the target variable
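A rough sketch of the linear scoring framework and a greedy forward search under it, assuming discrete (or pre-discretized) features; the helper names, the use of scikit-learn's mutual_info_score, and the iris/KBinsDiscretizer demo are my own illustrative choices, not part of the slides.

```python
# Linear criterion J(f) = I(f;Y) - beta * sum_s I(f;s) + lam * sum_s I(f;s|Y)
# over already-selected features s, plus a greedy forward search that uses it.
import numpy as np
from sklearn.metrics import mutual_info_score

def conditional_mi(x, y, z):
    """I(x; y | z) = sum_v P(z=v) * I(x; y | z=v) for discrete arrays."""
    return sum((z == v).mean() * mutual_info_score(x[z == v], y[z == v])
               for v in np.unique(z))

def linear_score(X, y, f, selected, beta=0.0, lam=0.0):
    """Score of candidate feature column f given already-selected columns."""
    score = mutual_info_score(X[:, f], y)                    # relevance I(f;Y)
    for s in selected:
        score -= beta * mutual_info_score(X[:, f], X[:, s])  # redundancy I(f;s)
        score += lam * conditional_mi(X[:, f], X[:, s], y)   # complementarity I(f;s|Y)
    return score

def greedy_forward(X, y, k, beta=0.0, lam=0.0):
    """Greedily pick k features. Special cases: Information Gain (beta=lam=0),
    MIFS (lam=0), CIFE (beta=lam=1). MRMR would instead use beta = 1/len(selected),
    updated each round (not done here, where beta stays fixed)."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: linear_score(X, y, f, selected, beta, lam))
        selected.append(best)
        remaining.remove(best)
    return selected

# Small demo on discretized iris data (the binning choice is arbitrary).
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
X, y = load_iris(return_X_y=True)
Xd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile").fit_transform(X)
print(greedy_forward(Xd, y, k=2, beta=1.0, lam=1.0))  # CIFE-style setting
```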
Information Theoretical based Methods - Summary
Other information theoretical based methods
– Fast Correlation Based Filter [Yu and Liu, 2004]
– Interaction Gain Feature Selection [El Akadi et al., 2008]
– Conditional MIFS [Cheng et al., 2011], …
Pros
– Can handle both feature relevance and redundancy
– Selected features can be generalized to subsequent learning tasks
Cons
– Most algorithms can only work in a supervised scenario
– Can only handle discretized data

Traditional Feature Selection
– Information theoretical based methods
– Statistical based methods
– Sparse learning based methods

Statistical based Methods
– This family of algorithms relies on different statistical measures to assess feature importance
– Most of them are filter feature selection methods
– Most algorithms evaluate features individually, so feature redundancy is inevitably ignored
– Most algorithms can only handle discrete data; numerical features have to be discretized first

T-Score [Davis and Sampson, 1986]
– Used for binary classification problems
– Assesses whether a feature makes the means of the samples from the two classes statistically different
– The t-score of a feature is t = |μ1 - μ2| / sqrt(σ1²/n1 + σ2²/n2), where μi, σi², and ni are the mean, variance, and number of samples of class i on that feature
– The higher the t-score, the more important the feature is

Chi-Square Score [Liu and Setiono, 1995]
– Uses an independence test to assess whether the feature is independent of the class label
– Given a feature with r values and C classes, its score is χ² = Σ_{i=1..r} Σ_{j=1..C} (n_ij - μ_ij)² / μ_ij, where n_ij is the number of samples with the i-th feature value in class j and μ_ij is its expected count under independence
– A higher chi-square score indicates that the feature is more important

Statistical based Methods - Summary
Other statistical based methods
– Low variance
– CFS [Hall and Smith, 1999]
– Kruskal-Wallis [McKnight, 2010], …
Pros
– Computationally efficient
– The selected features can be generalized to subsequent learning tasks
Cons
– Cannot handle feature redundancy
– Require data discretization techniques
– Many statistical measures are not that effective in high-dimensional spaces

Traditional Feature Selection
– Information theoretical based methods
– Statistical based methods
– Sparse learning based methods

What is Feature Sparsity?
– The model parameters in many data mining tasks can be represented as a vector w or a matrix W
– Sparsity indicates that many elements in w or W are small or exactly zero

Sparse Learning Methods - A General Framework
– Start from the binary classification or univariate regression problem
– Let w denote the model parameters (a.k.a. feature coefficients); w can be obtained by solving a regularized problem of the form min_w loss(w; X, y) + α · penalty(w), where the penalty encourages sparsity in w

Lasso [Tibshirani, 1996]
– Lasso: Least Absolute Shrinkage and Selection Operator
– Based on ℓ1-norm regularization of the weights
– In the case of the least-squares loss with an offset (bias) term: min_{w,b} ||y - Xw - b·1||₂² + α||w||₁
– It is also equivalent to the constrained model min_{w,b} ||y - Xw - b·1||₂² subject to ||w||₁ ≤ t

Extension to Multi-Class or Multi-Variate Problems
– Require the feature selection results to be consistent across multiple targets in multi-class classification or multi-variate regression
– The ℓ2,1-norm ||W||₂,₁ (the sum of the ℓ2 norms of the rows of W) achieves joint feature sparsity across multiple targets
– In the case of the least-squares loss with an offset term: min_{W,b} ||Y - XW - 1bᵀ||_F² + α||W||₂,₁
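A minimal sketch of Lasso-based selection, assuming scikit-learn and a synthetic regression dataset; the regularization strength alpha is an arbitrary illustrative value.

```python
# Lasso as a sparse-learning selector: features with nonzero coefficients are kept.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # put features on a comparable scale before the l1 penalty

lasso = Lasso(alpha=1.0).fit(X, y)      # larger alpha -> sparser w -> fewer selected features
selected = np.flatnonzero(lasso.coef_)  # indices of features with nonzero weight
print("selected features:", selected)

# For the multi-target l2,1-norm case above, scikit-learn's MultiTaskLasso applies
# the corresponding row-wise grouped penalty on W.
```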
Sparse Learning based Methods - Summary
Other sparse learning based methods
– Multi-label informed feature selection [Jian et al., 2016]
– Embedded unsupervised feature selection [Wang et al., 2015]
– Adaptive structure learning feature selection [Du et al., 2015]
Pros
– Obtain good performance for the underlying learning method
– Good model interpretability
Cons
– The selected features may not be suitable for other tasks
– Require solving non-smooth optimization problems, which is computationally expensive

Feature Selection Issues
The recent popularity of big data presents challenges to conventional feature selection:
– Streaming data and features
– Heterogeneous data
– Structures between features
– Volume of collected data

Summary
– Feature selection is effective for tackling the curse of dimensionality and is essential to many data mining and machine learning problems
– The objectives of feature selection include
  – building simpler and more comprehensible models
  – improving learning performance
  – preparing clean and understandable data
– Feature selection is equally important in the age of deep learning and big data
– We provide a structured overview of feature selection from a data perspective: feature selection for conventional data (three main categories)
– More advanced feature selection methods to learn:
  – feature selection with structured features
  – feature selection with heterogeneous data
  – feature selection with streaming data

More of Feature Engineering …
Feature extraction: numerical data
– Dimensionality reduction techniques such as SVD and PCA
– Clustering, then using cluster IDs and/or distances to cluster centers as new features
Feature extraction: textual data
– e.g., bag-of-words: extract tokens from text and use their occurrences (or TF-IDF weights) as features
Feature extraction: time series and geo-location data
Feature extraction: image data
Feature extraction: relational data
Anomaly detection (advanced)
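A small sketch of the numerical and textual extraction ideas listed above, assuming scikit-learn; the dataset and the numbers of components and clusters are arbitrary illustrative choices.

```python
# Feature extraction sketches: PCA components, cluster-distance features, TF-IDF bag-of-words.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Numerical data: principal components plus distances to cluster centers as new features.
X, _ = load_iris(return_X_y=True)
pca_features = PCA(n_components=2).fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_dists = kmeans.transform(X)               # distance to each cluster center
X_new = np.hstack([pca_features, cluster_dists])
print(X_new.shape)                                # (150, 5)

# Textual data: bag-of-words with TF-IDF weights.
docs = ["feature selection removes redundant features",
        "feature extraction builds new features"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix
print(tfidf_matrix.shape)
```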