Feature Selection & Dimensionality Reduction

Document Details


University of Nottingham

2024

Dr Xin Chen

Tags

machine learning feature selection dimensionality reduction data analysis

Summary

This document presents lecture notes on feature selection and dimensionality reduction in machine learning. Dr Xin Chen, an associate professor in the University of Nottingham's School of Computer Science, details the main families of methods: wrapper methods, embedded methods, and filter-based methods, including statistical tests such as ANOVA and chi-square. The document also provides formulas and examples to aid understanding.

Full Transcript

COMP3009 / COMP4139 Machine Learning (MLE) - Feature Selection & Dimensionality Reduction
Dr Xin Chen, Associate Professor, School of Computer Science, University of Nottingham

Aim of Feature Selection and Dimensionality Reduction
o Reduce the impact caused by the curse of dimensionality
o Remove redundant features to improve performance
o Increase computational efficiency
o Reduce the cost of new data acquisition

FS vs DR
o FS retains a subset of the original features.
o DR generates a new, compact set of features that does NOT retain the original meaning of the features.

Things to consider when using FS and DR
o The target dimension
o Interpretability (Yes: FS; No: DR or FS)
o Feature correlations/dependency
o Feature reliability and repeatability
o Methods (different methods are likely to result in different selected features)

Popular FS Methods
Wrapper methods
o Search for the optimal feature subset that maximises decision-making performance.
o Methods: recursive feature elimination; sequential feature selection.
Embedded methods
o Integrate the FS process into the model learning process.
o Methods: ridge (ElasticNet); lasso; random forest (feature ranking).
Filter-based methods
o Selection is based on feature relationships and statistics rather than model performance.
o Methods: univariate (ANOVA); chi-square; correlation/variance.

Forward Feature Selection (Wrapper Method)
X: the final selected feature set
B: the stored best evaluation metric value
Y: the selected feature set at each iteration
F \ X: the features that belong to F but not to X
M: the evaluation metric, e.g. entropy, classification rate, regression error
At each iteration, the feature from F \ X that most improves M is added to X, and B is updated; the search stops when no remaining feature improves B. The recursive feature elimination method is similar but works in reverse: it starts with the full feature set and eliminates one feature at a time. (A Python sketch is given after these notes.)

LASSO (Embedded Method)
LASSO (least absolute shrinkage and selection operator)
o Adds an L1 regularisation term to reduce the number of effective features.
o The loss function is not differentiable; sub-gradient methods or least-angle regression can be used to optimise it.
min_w ( ||Xw - y||^2 + λ||w||_1 ), where ||Xw - y||^2 is the least-squares term and λ||w||_1 is the L1 regularisation term.
For multiple features, e.g. y = w0 + w1*x1 + w2*x2 + w3*x3, a higher λ will drive some of the weights to 0, hence reducing the dimensionality.
[Figure: illustrative fit of y (weight) against x (height), showing training data and validation data.]

Chi-Square vs T-test vs ANOVA (Filter Method)
Univariate feature selection (assuming features are independent of each other):
o A chi-square test checks the independence of a predictor and the outcome; it is suitable for categorical features with a categorical outcome.
o A t-test compares the statistical difference between two groups (a binary class) and is used for continuous features.
o ANOVA uses variance to test the relationship between categorical predictors and a continuous outcome (e.g. gender and age group to predict exam mark).
o A correlation test works when both the predictor and the outcome are continuous.
Assume a null hypothesis and use the p value to reject it (e.g. reject when p falls below a chosen significance level such as 0.05).
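A minimal Python sketch of the forward feature selection loop described above, assuming scikit-learn and using mean cross-validated classification rate as the metric M; the dataset, classifier, and variable names (X_sel, best_score) are illustrative, not from the lecture.

```python
# Forward feature selection (wrapper method): a minimal sketch, assuming scikit-learn.
# X_sel mirrors X (the selected set), best_score mirrors B, and M is the
# cross-validated classification rate; dataset and classifier are illustrative only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, labels = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

X_sel = []                                   # X: final selected feature indices
best_score = -np.inf                         # B: best evaluation metric value so far
remaining = list(range(features.shape[1]))   # F \ X: candidates not yet selected

improved = True
while improved and remaining:
    improved = False
    # Y at this iteration: X_sel plus one candidate feature f
    scores = {f: cross_val_score(model, features[:, X_sel + [f]], labels, cv=5).mean()
              for f in remaining}
    best_f = max(scores, key=scores.get)
    if scores[best_f] > best_score:          # keep f only if it improves the metric M
        best_score = scores[best_f]
        X_sel.append(best_f)
        remaining.remove(best_f)
        improved = True

print("Selected feature indices:", X_sel, "best CV accuracy:", round(best_score, 3))
```

Recursive feature elimination would run the same loop in reverse, starting from the full set and dropping the least useful feature each time.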
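A minimal sketch of LASSO used as an embedded feature selector, assuming scikit-learn; the diabetes dataset and the alpha values (scikit-learn's name for the λ in the loss above) are illustrative.

```python
# LASSO (embedded method): a minimal sketch, assuming scikit-learn.
# Lasso minimises ||Xw - y||^2 / (2n) + alpha * ||w||_1; a larger alpha (the λ in the
# slide) drives more weights to exactly zero, removing those features from the model.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale features so the L1 penalty treats them comparably

for alpha in (0.1, 1.0, 10.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    kept = np.flatnonzero(lasso.coef_)   # features whose weights were not shrunk to zero
    print(f"alpha={alpha}: {len(kept)} features kept, indices {kept.tolist()}")
```

The surviving (non-zero) coefficients identify the selected features, which is what makes LASSO an embedded rather than a wrapper method: selection happens inside model fitting.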
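A minimal sketch of filter-based univariate selection, assuming scikit-learn, where f_classif implements the ANOVA F-test and chi2 the chi-square test mentioned above; the iris dataset and k=2 are illustrative.

```python
# Filter-based univariate feature selection: a minimal sketch, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)   # continuous features, categorical outcome

# ANOVA F-test: relationship between each continuous predictor and the class label
anova = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("ANOVA p-values:", anova.pvalues_.round(4))

# Chi-square expects non-negative (ideally categorical/count) features; the iris
# features are non-negative so this runs, but the test is best suited to counts.
chi = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square p-values:", chi.pvalues_.round(4))

# Keep the 2 features with the strongest statistical relationship to the outcome
X_reduced = anova.transform(X)
print("Reduced shape:", X_reduced.shape)
```

Features whose p value falls below the chosen significance level reject the null hypothesis of no relationship with the outcome and are retained.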
