COMP3009/COMP4139 Machine Learning (MLE) Data Pre-processing & Feature Analysis
Summary
This document provides an overview of machine learning data pre-processing and feature analysis techniques. Topics covered include data visualisation, normalisation, imputation, and addressing the curse of dimensionality. The document is from the University of Nottingham and focuses on the COMP3009/COMP4139 Machine Learning syllabus.
Full Transcript
COMP3009/COMP4139 Machine Learning (MLE) - Data Pre-processing & Feature Analysis
Dr Xin Chen, Associate Professor, School of Computer Science, University of Nottingham

General Pipeline of Machine Learning
Data representation + Modelling + Evaluation + Optimisation (Pedro Domingos, 2012)

Data Understanding, Representation and Pre-processing
Understand the data
o Truly understand the underlying problem as much as possible.
o Visualise the data (outliers, value range, etc.).
Feature representation
o Reliability, repeatability.
o Categorical, binary, continuous.
o Feature value normalisation.
Data pre-processing
o Missing data, errors, etc.
o Data imputation.

Data Visualisation
o Boxplot: good for visualising a continuous variable (check for outliers).
o Histogram: good for visualising categorical variables (check the distribution).
o Scatter plot: good for exploring the relationship between two variables.

Categorical to One-hot Encoding
Categorical features are converted into binary indicator vectors, with one column per category.

Data Normalisation
Methods:
o Z-normalisation (or zero-mean normalisation): X_norm = (X − μ) / σ, where X is the feature vector of original values, and μ and σ are the mean and standard deviation of X.
o Min-max normalisation: X_norm = (X − min(X)) / (max(X) − min(X)), where min(X) and max(X) are the minimum and maximum values of vector X respectively. Using the 5%/95% percentiles as the min and max is more robust to outliers.
o Vector normalisation: X_norm = X / |X|, where |X| is the length (Euclidean norm) of vector X.

Benefits of data normalisation:
✓ They are all linear scaling methods, so they do not change the shape of the original data distribution.
✓ They improve the numerical stability of the machine learning model (e.g. when using gradient descent to optimise a neural network, if different features span different value ranges, a fixed learning rate will likely overshoot or undershoot the optimum for some features).
✓ They reduce the negative impact on distance-based algorithms (e.g. KNN, SVM).

Data Imputation (dealing with missing data)
The simplest option is to discard instances with missing features (safer when there are plenty of samples). Some algorithms can ignore missing data, but the majority will fail and raise errors.
Data imputation methods:
o Use the mean or median value.
o Use the most frequent value.
o Use k-nearest neighbours based on feature similarity.
o Use multivariate imputation by chained equations (MICE), i.e. fill in the missing data multiple times with different values and pool the final results.
o Estimate the missing value with a machine learning model trained on the other features.

Curse of Dimensionality
What is the curse of dimensionality in machine learning? Data samples are too sparse in the feature space, i.e. the number of instances is not large enough to populate the feature space densely. How many data samples would we need to fill the feature space?
What can we do to avoid the curse of dimensionality?
o Increase the number of data samples: the required number of samples grows exponentially with a linear increase in feature dimensions.
o Reduce the number of features (more feasible): use feature selection and dimensionality reduction methods.

Code sketches illustrating these techniques follow below.
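To illustrate the plotting guidance above, here is a minimal matplotlib sketch on toy data; the feature names and values are hypothetical, chosen only to show the three plot types:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
age = rng.normal(40, 12, 200)                       # hypothetical continuous feature
height = 150 + 0.3 * age + rng.normal(0, 5, 200)    # second feature, correlated with age

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(age)                 # boxplot: spot outliers in a continuous variable
axes[0].set_title("Boxplot")
axes[1].hist(age, bins=20)           # histogram: check the distribution
axes[1].set_title("Histogram")
axes[2].scatter(age, height, s=10)   # scatter: relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```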
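A quick one-hot encoding sketch using pandas; the column name "colour" and its values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Each category becomes its own binary indicator column
one_hot = pd.get_dummies(df, columns=["colour"])
print(one_hot)
```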
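A minimal NumPy sketch of the three normalisation methods, assuming a single 1-D feature vector; the toy values are chosen to include one outlier so the percentile-based min-max variant has something to clip:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # toy feature with one outlier

# Z-normalisation: subtract the mean, divide by the standard deviation
z_norm = (x - x.mean()) / x.std()

# Min-max normalisation, using the 5th/95th percentiles as min and max
# to reduce the influence of outliers (as suggested above)
lo, hi = np.percentile(x, [5, 95])
min_max = (np.clip(x, lo, hi) - lo) / (hi - lo)

# Vector normalisation: divide by the Euclidean length of the vector
vec_norm = x / np.linalg.norm(x)

print(z_norm, min_max, vec_norm, sep="\n")
```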
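The imputation methods listed above map naturally onto scikit-learn classes; a sketch under that assumption (note that IterativeImputer, scikit-learn's chained-equations-style imputer, requires the experimental enable import):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)          # k-nearest-neighbour imputation
mice_filled = IterativeImputer(random_state=0).fit_transform(X)  # chained-equations style
```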
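The sparsity claim can be demonstrated empirically: with a fixed sample size, nearest and farthest neighbours become almost equidistant as the dimension grows, so the samples no longer cover the space meaningfully. A small sketch (the sample size and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # fixed number of samples

for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                 # uniform points in the unit hypercube
    q = rng.random(d)                      # one query point
    dists = np.linalg.norm(X - q, axis=1)
    # As d grows this ratio approaches 1: all points look equally far away,
    # i.e. the fixed-size sample is too sparse for the feature space
    print(f"d={d:5d}  nearest/farthest = {dists.min() / dists.max():.3f}")
```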