Data Preprocessing & Feature Engineering Lecture Note PDF
Dr Nor Azuana Ramli
Summary
This lecture note provides an overview of data preprocessing and feature engineering techniques. It discusses the importance of data preprocessing in machine learning, various methods for handling missing values, different types of missing values, techniques for feature engineering and selection, and concepts of dimensionality reduction. The note also touches on imbalanced datasets.
Full Transcript
CHAPTER 2: Data Preprocessing & Feature Engineering
Dr Nor Azuana Ramli

CONTENTS
2.1: Preprocessing & Exploration
2.2: Feature Selection & Dimensionality Reduction
2.3: Imbalanced Dataset

COURSE OUTCOMES
1. To understand the steps in data preprocessing and exploration.
2. To apply feature selection and dimensionality reduction to a dataset.
3. To understand the problem with imbalanced datasets and how to solve it.

DATA PRE-PROCESSING
▪ Data pre-processing includes the steps we need to follow to transform or encode data so that it may be easily parsed by the machine.
▪ For a model to be accurate and precise in its predictions, the algorithm must be able to easily interpret the data's features.

Why is Data Pre-processing Important?
The majority of real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy values because of their heterogeneous origins. Applying data mining algorithms to such noisy data would not give quality results, as they would fail to identify patterns effectively. Data pre-processing is therefore important to improve overall data quality:
▪ Duplicate or missing values may give an incorrect view of the overall statistics of the data.
▪ Outliers and inconsistent data points tend to disturb the model's overall learning, leading to false predictions.
▪ Quality decisions must be based on quality data. Data pre-processing is needed to obtain this quality data; without it, it is simply a Garbage In, Garbage Out scenario.

Steps in Data Pre-processing
(The original slide shows the pre-processing steps as a figure, ending with feature engineering.)

Data Pre-processing: Best Practices
▪ The first step in data pre-processing is to understand your data. Just looking at your dataset can give you an intuition of what you need to focus on.
▪ Use statistical methods or pre-built libraries that help you visualize the dataset and give a clear picture of how your data looks in terms of class distribution.
▪ Summarize your data in terms of the number of duplicates, missing values, and outliers present in the data.
▪ Drop the fields you think have no use for the modelling or that are closely related to other attributes. Dimensionality reduction is one of the very important aspects of data pre-processing.
▪ Do some feature engineering and figure out which attributes contribute most towards model training.

DATA CLEANING
Dealing with Missing Values
Missing values are the most common problem in real-life datasets. It is important to treat missing values because they can bias the results of machine learning models and/or reduce model accuracy. In the dataset, a blank entry shows a missing value; in Pandas, missing values are usually represented by NaN. There can be multiple reasons why certain values are missing from the data, for example:
▪ Past data might get corrupted due to improper maintenance.
▪ Observations were not recorded for certain fields for some reason.
▪ There might be a failure in recording the values due to human error.
▪ The user did not provide the values intentionally.
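Before treating missing values it helps to quantify them. The following is a minimal sketch of the inspection step described above, assuming the data has been loaded into a pandas DataFrame named train_df from a hypothetical train.csv file (the file name and its contents are not part of the original note).

import pandas as pd

# Load the data; "train.csv" is an assumed file name used only for illustration.
train_df = pd.read_csv("train.csv")

print(train_df.shape)               # number of rows and columns
print(train_df.duplicated().sum())  # how many fully duplicated rows
print(train_df.isnull().sum())      # missing values per column
print(train_df.describe())          # basic statistics of the numeric columns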
Types of Missing Value

Missing Completely At Random (MCAR)
In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no relationship between the missing data and any other values, observed or unobserved (data which is not recorded), within the given dataset. That is, the missing values are completely independent of the other data; there is no pattern. In the case of MCAR, the data could be missing due to human error, some system/equipment failure, loss of a sample, or some unsatisfactory technicality while recording the values.

Missing At Random (MAR)
MAR means that the reason for the missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and other values/data. In this case, the data is not missing for all the observations; it is missing only within sub-samples of the data, and there is some pattern in the missing values.

Missing Not At Random (MNAR)
Missing values depend on the unobserved data. If there is some structure/pattern in the missing data that the other observed data cannot explain, then it is MNAR. If the missing data does not fall under MCAR or MAR, it can be categorized as MNAR. It can happen due to the reluctance of people to provide the required information; for example, a specific group of people may not answer some questions in a survey.

Handling Missing Values
There are two broad approaches: deleting the missing values or imputing the missing values.

Imputing the Missing Values

Replacing with an arbitrary value
# Replace the missing values with '0' using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()

Replacing with the mean
# Replace the missing values of numerical columns with the mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())

Replacing with the mode
# Replace the missing values of categorical columns with the mode
# (.mode() returns a Series, so take its first element with [0])
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()

Replacing with the median
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())

Replacing with the previous value – forward fill
# Forward-fill (fillna(method='ffill') in older pandas; the method= argument is deprecated)
test.ffill()

Replacing with the next value – backward fill
# Backward-fill
test.bfill()

Interpolation
test.interpolate()

DATA TRANSFORMATION
Data transformation converts data into a format or structure that is more suitable for analysis or modelling. It often involves changing the data's representation to make patterns more apparent or to ensure the data meets model requirements. Common techniques:
▪ Normalization: scaling features to a specific range, usually [0, 1].
▪ Standardization: rescaling features to have a mean of 0 and a standard deviation of 1.
▪ Log transformation: applying a logarithmic function to compress the range of values and reduce skewness.
▪ Square root transformation: used to stabilize variance and reduce skewness.
▪ Binning: converting continuous variables into discrete bins or intervals.
▪ Encoding categorical variables: converting categories into numeric form (see label and one-hot encoding below).
When to use: data transformation is used to prepare data for better model performance, to meet model assumptions, or to handle non-linear relationships and skewed distributions.

Label Encoding
▪ Label encoding refers to converting the labels into a numeric form so as to make them machine-readable. Machine learning algorithms can then better decide how those labels should be handled. It is an important pre-processing step for structured datasets in supervised learning.
▪ Example: suppose we have a column Height in some dataset.
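The original slide shows the Height example as a table that is not reproduced in this transcript. Below is a minimal, hedged sketch of label encoding with scikit-learn, assuming Height takes the values Tall/Medium/Short; the values and the use of LabelEncoder are illustrative rather than taken from the note.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Height column; the category values are assumed for illustration.
df = pd.DataFrame({'Height': ['Tall', 'Medium', 'Short', 'Tall', 'Short']})

le = LabelEncoder()
df['Height_encoded'] = le.fit_transform(df['Height'])  # e.g. Medium -> 0, Short -> 1, Tall -> 2

print(df)
print(le.classes_)  # the category corresponding to each numeric label

Note that LabelEncoder assigns integers alphabetically, so the resulting order may not match the natural ordinal order (Short < Medium < Tall); for a controlled ordering, a manual mapping or scikit-learn's OrdinalEncoder with explicit categories can be used instead.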
One Hot Encoding / Dummy Encoding
In this technique, a separate column is prepared for each category. For the Male and Female labels, wherever there is Male the value will be 1 in the Male column and 0 in the Female column, and vice-versa. Let's understand this with an example: consider data where fruits, their corresponding categorical values, and their prices are given.

Label Encoding or Dummy Encoding?
The choice between dummy encoding and label encoding depends on the nature of your categorical data and the machine learning model you are using. Some key points to consider:
1. Label Encoding: use it when your categorical data has a meaningful ordinal relationship, since the numerical labels can reflect this hierarchy. Suitable for algorithms that can handle ordered numerical features, such as decision trees or gradient boosting.
2. Dummy Encoding: use it when your categorical data is nominal (i.e., has no meaningful order) and each category is equally important. Suitable for algorithms like linear regression, logistic regression, neural networks, or other models that cannot handle direct numerical labels.
In summary: choose label encoding if the categorical data is ordinal or if you are using a model that can naturally handle categorical features; choose dummy encoding if the data is nominal and you are concerned about the model making incorrect assumptions about ordinal relationships.
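The fruit-price table from the slide is not reproduced in the transcript. The following is a minimal sketch of dummy/one-hot encoding with pandas on an assumed small fruit-price table; the fruit names and prices are illustrative.

import pandas as pd

# Hypothetical data standing in for the slide's fruit/price example.
df = pd.DataFrame({'Fruit': ['Apple', 'Mango', 'Apple', 'Orange'],
                   'Price': [5.0, 10.0, 5.5, 3.0]})

# One indicator column per category; drop_first=True would give k-1 dummy columns instead.
encoded = pd.get_dummies(df, columns=['Fruit'])
print(encoded)

Scikit-learn's OneHotEncoder achieves the same result when the encoding needs to live inside a modelling pipeline.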
Feature Engineering
▪ Feature engineering (feature generation) is the process of transforming raw data in order to make it more useful or more stable for predictive modelling purposes. Some common approaches to feature engineering:
a. Feature extraction: creating new features from existing ones.
b. Functional transformations (e.g. a log transform to shift skewed distributions).
c. Calculating counts, sums, averages, min/max/range, and ratios (descriptive statistics).
d. Interaction-effect variables.
e. Binning continuous variables.
f. Combining high-cardinality nominal variables.
g. Date/time manipulations to define relative values or intervals.
h. Feature selection and dimensionality reduction.
▪ Automatic feature engineering tools can create many new features using these techniques.

FEATURE SELECTION & DIMENSIONALITY REDUCTION

Why do we need feature selection & dimensionality reduction?
We typically represent data as a grid of numbers (a matrix). Each column represents a variable, which we call a feature in machine learning. In supervised learning, one of the variables is actually not a feature but the label that we are trying to predict, and each row is an example that we can use for training or testing. The number of features corresponds to the dimensionality of the data, and our machine learning approach depends on the number of dimensions versus the number of examples. For instance, text and image data are very high-dimensional, while stock market data has relatively few dimensions.
Reducing the number of features (or variables) that the computer must process brings several benefits:
▪ Computations or tasks complete faster (less computation time).
▪ Less storage capacity is needed, which means the computer can do more work.
▪ It removes multicollinearity, resulting in an improvement of the machine learning model in use.
▪ It makes data easier for humans to visualize, particularly when the data is reduced to two or three dimensions, which can easily be displayed graphically.

FEATURE SELECTION VS DIMENSIONALITY REDUCTION
Feature selection is simply selecting and excluding given features without changing them. Dimensionality reduction transforms features into a lower dimension.

Feature Selection
Feature selection is the process of picking a subset of significant features for use in better model construction. In practice, not every feature in a dataset carries information useful for discriminating samples; some features are either redundant or irrelevant and can hence be discarded with little loss. Not all of the features are useful, and they may only add randomness to our results, so it is often important to do good feature selection.
Whenever you have a large number of highly correlated possible input variables, it can cause problems for various modelling or machine learning algorithms (memory usage, matrix sparsity, etc.). Common selection approaches include:
▪ Simple selection based on the correlation of attributes (e.g. a heatmap).
▪ Model-performance-based selection methods (e.g. forward selection and backward elimination).

Better features mean better results!
"The algorithms we used are very standard for Kagglers. [...] We spent most of our efforts in feature engineering." (Xavier Conort, in "Q&A with Xavier Conort", on winning the Flight Quest challenge on Kaggle)

Feature Selection (based on statistical methods) – summary of key methods:
▪ Numerical input & binary categorical output: Point-Biserial Correlation.
▪ Numerical input & multi-class categorical output: ANOVA or Logistic Regression.
▪ Categorical input & categorical output: Chi-Square Test, Cramér's V.
▪ Mixed inputs: Logistic Regression for modelling, or use the specific test for each input-output type combination.

DIMENSIONALITY REDUCTION
An interesting problem that feature reduction can help with is called the curse of dimensionality. This refers to a group of phenomena in which a problem has so many dimensions that the data becomes sparse. Dimensionality reduction is used to decrease the number of dimensions, making the data less sparse and more statistically significant for machine learning applications. An example of such a method is PCA.

Principal Component Analysis
Principal Component Analysis (PCA) is one of the most popular linear dimensionality reduction algorithms. It is a projection-based method that transforms the data by projecting it onto a set of orthogonal (perpendicular) axes. "PCA works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance or spread of the data in the lower-dimensional space should be maximal." Eigenvalue Decomposition and Singular Value Decomposition (SVD) from linear algebra are the two main procedures used in PCA to reduce dimensionality.
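As a quick illustration of PCA in practice, here is a minimal sketch with scikit-learn on synthetic data; the dataset, the standardization step, and the choice of two components are assumptions for illustration rather than part of the note.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic data: 100 examples, 5 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# PCA is sensitive to scale, so standardize first (see the feature scaling section later).
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                # project onto the top 2 orthogonal axes
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance captured by each component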
Eigenvalue Decomposition
Matrix decomposition is a process in which a matrix is reduced to its constituent parts to simplify a range of more complex operations. Eigenvalue decomposition is the most widely used matrix decomposition method; it involves decomposing a square (n×n) matrix into a set of eigenvectors and eigenvalues. Eigenvectors are unit vectors, which means that their length or magnitude is equal to 1.0. They are often referred to as right vectors, which simply means a column vector (as opposed to a row vector or a left vector). Eigenvalues are coefficients applied to eigenvectors that give the vectors their length or magnitude; for example, a negative eigenvalue may reverse the direction of the eigenvector as part of scaling it.
Mathematically, a vector v is an eigenvector of an n×n square matrix A if it satisfies the equation
    A·v = λ·v
where λ is the corresponding eigenvalue. The whole problem of PCA boils down to finding the eigenvalues and eigenvectors of the covariance matrix of our data after standardization.
P/S: Please open back your Linear Algebra notes! :D

DATA PARTITIONING
(The original slide presents data partitioning as a figure only.)

Imbalanced Dataset
✓ Imbalanced classification: a classification predictive modelling problem where the distribution of examples across the classes is not equal.
✓ The imbalance in the class distribution of an imbalanced classification problem may have many causes. There are perhaps two main groups of causes we may want to consider: data sampling and properties of the domain.
✓ It is possible that the imbalance in the examples across the classes was caused by the way the examples were collected or sampled from the problem domain. This might involve biases introduced during data collection and errors made during data collection: biased sampling and measurement errors.
✓ Many classification problems may have a severe imbalance in the class distribution; nevertheless, looking at common problem domains that are inherently imbalanced makes the ideas and challenges of class imbalance concrete: fraud detection, claim prediction, default prediction, churn prediction, spam detection, anomaly detection, outlier detection, intrusion detection, and conversion prediction.
✓ This list of examples sheds light on the nature of imbalanced classification predictive modelling.

Techniques to deal with an Imbalanced Dataset:
▪ Random under-sampling (removing some observations of the majority class).
▪ Random over-sampling (adding more copies of the minority class).
▪ Balancing the data with the imbalanced-learn Python module (import imblearn).
▪ Tomek links.
▪ Synthetic Minority Oversampling Technique (SMOTE): an oversampling technique where synthetic samples are generated for the minority class.

Advantages and disadvantages of under-sampling
Advantages: it can help improve run time and storage problems by reducing the number of training samples when the training data set is huge.
Disadvantages: it can discard potentially useful information which could be important for building rule classifiers, and the sample chosen by random under-sampling may be biased and not an accurate representation of the population, resulting in inaccurate results on the actual test data set.

Advantages and disadvantages of over-sampling
Advantages: unlike under-sampling, this method leads to no information loss, and it generally outperforms under-sampling.
Disadvantages: it increases the likelihood of overfitting since it replicates the minority class events.

SMOTE for Imbalanced Classification with Python
✓ SMOTE focuses on the feature space to generate new instances by interpolating between positive instances that lie together.
✓ Working procedure: first, the total number of oversampling observations, N, is set (generally it is selected such that the binary class distribution becomes 1:1, but this can be tuned down based on need). The iteration then starts by selecting a positive-class instance at random. Next, the k nearest neighbours (by default 5) of that instance are obtained. Finally, N of these k neighbours are chosen to interpolate new synthetic instances: using any distance metric, the difference between the feature vector and a neighbour is calculated, this difference is multiplied by a random value in (0, 1], and the result is added to the original feature vector.

Python Code for SMOTE Algorithm
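The code shown on the original slide is not reproduced in the transcript. Below is a minimal, hedged sketch of the resampling techniques listed above using the imbalanced-learn (imblearn) module mentioned in the note; the synthetic dataset and its 90:10 class ratio are assumptions for illustration. In practice, resampling should be applied to the training split only.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Synthetic imbalanced dataset (illustrative): roughly 90% majority, 10% minority class.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print("Original:      ", Counter(y))

# Random under-sampling: remove observations from the majority class.
X_u, y_u = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("Under-sampled: ", Counter(y_u))

# Random over-sampling: add copies of the minority class.
X_o, y_o = RandomOverSampler(random_state=42).fit_resample(X, y)
print("Over-sampled:  ", Counter(y_o))

# SMOTE: interpolate between a minority instance and its k nearest neighbours.
X_s, y_s = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("SMOTE:         ", Counter(y_s))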
Feature Scaling
Feature scaling is one of the most critical steps during the pre-processing of data before creating a machine learning model (it is usually done after data partitioning). Scaling can make the difference between a weak machine learning model and a better one. Feature scaling transforms feature values to a similar scale, ensuring all features contribute equally to the model. It is essential for datasets with features of varying ranges, units, or magnitudes. Common techniques include standardization, normalization, and min-max scaling. This process improves model performance and convergence and prevents bias from features with larger values. "Scale data for better performance of your machine learning model."

Why do we need scaling?
✓ A rule of thumb we may follow here: if an algorithm computes distances or assumes normality, scale your features.
✓ Some examples of algorithms where feature scaling matters:
→ k-nearest neighbours (k-NN) with a Euclidean distance measure is sensitive to magnitudes, so all features should be scaled to weigh in equally.
→ K-Means uses the Euclidean distance measure, so feature scaling matters here too.
→ Scaling is critical when performing Principal Component Analysis (PCA).
→ We can speed up gradient descent by scaling, because θ descends quickly on small ranges, slowly on large ranges, and oscillates inefficiently down to the optimum when the variables are very uneven.
✓ Algorithms that do not require normalization/scaling are the ones that rely on rules. They are not affected by monotonic transformations of the variables, and scaling is a monotonic transformation. Examples are the tree-based algorithms (CART, Random Forests, Gradient Boosted Decision Trees), which use rules (series of inequalities) and do not require normalization.
✓ Algorithms like Linear Discriminant Analysis (LDA) and Naive Bayes are by design equipped to handle differing feature scales and weight the features accordingly, so performing feature scaling for these algorithms may not have much effect.

Normalization vs Standardization
Normalization is used when we want to bound our values between two numbers, typically [0, 1] or [-1, 1], while standardization transforms the data to have zero mean and a variance of 1. Both make the data unitless. (The original slide refers to a diagram showing how the data looks in the X-Y plane after each kind of scaling.)

Normalization
Normalization, a vital aspect of feature scaling, is a data preprocessing technique employed to standardize the values of features in a dataset, bringing them to a common scale. This enhances data analysis and modelling accuracy by mitigating the influence of varying scales on machine learning models. Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1; it is also known as min-max scaling. Here is the formula for normalization:
    X' = (X − X_min) / (X_max − X_min)
This scaler responds well if the standard deviation is small and when the distribution is not Gaussian, but it is sensitive to outliers.
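A minimal sketch of min-max normalization with scikit-learn's MinMaxScaler; the small example array is an assumption for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with very different ranges (illustrative values).
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 800.0]])

scaler = MinMaxScaler()              # default feature_range=(0, 1)
X_norm = scaler.fit_transform(X)     # applies (X - X_min) / (X_max - X_min) per column
print(X_norm)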
Standard Scaler
The Standard Scaler assumes the data is normally distributed within each feature and scales it so that the distribution is centred around 0 with a standard deviation of 1. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. If the data is not normally distributed, this is not the best scaler to use.

THE BIG QUESTION: Normalize or Standardize?
At the end of the day, the choice between normalization and standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized, and standardized data and comparing the performance for the best results. It is good practice to fit the scaler on the training data and then use it to transform the testing data; this avoids any data leakage during the model testing process. Also, scaling of target values is generally not required.

Feature Scaling – Other Scalers
Max Abs Scaler, Robust Scaler, Quantile Transformer, Power Transformer, Unit Vector Scaler.
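To close, here is a minimal sketch of the fit-on-train / transform-on-test practice described above, using scikit-learn's StandardScaler; the synthetic data and the 80/20 split are assumptions for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative): 200 examples, 3 features on an arbitrary scale.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from the training set only
X_test_scaled = scaler.transform(X_test)        # the same statistics applied to the test set

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature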