EE7209 Machine Learning Lecture 06 PDF
Document Details
University of Ruhuna
EE7209
Mr. M.W.G Charuka Kavinda Moremada
Summary
This lecture covers data preprocessing and feature selection techniques used in machine learning. It discusses the rationale for data preprocessing, key steps, and the important Python libraries like Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, and SciPy used in the process.
Full Transcript
Lecture 06: Data Preprocessing & Feature Selection
EE7209: Machine Learning
Mr. M.W.G Charuka Kavinda Moremada, Lecturer, Department of Electrical and Information Engineering, Faculty of Engineering, University of Ruhuna.

Lecture Overview
1. Data Pre-Processing: An Overview
   i. What is data pre-processing
   ii. Requirement of data preprocessing in ML
   iii. Important data preprocessing steps
2. Data Preprocessing: Important Libraries
3. Train-Test Split
4. Handling Null/Missing Values
5. Treating Outliers and Duplicate Records
6. Feature Scaling
7. Handling Categorical Variables

Data Preprocessing: An Overview

What is Data Pre-Processing
Pre-processing refers to the transformations applied to our data before feeding it to a machine learning algorithm. Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it is collected in a raw format that is not feasible for analysis. In most cases these two descriptions overlap.

Requirement of Data Preprocessing in ML
- The majority of real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy data because of their heterogeneous origins. Using such data in ML model development results in poor performance, since the models fail to identify the underlying patterns in the data effectively.
- Some machine learning models need information in a specific format (e.g., the Random Forest algorithm does not support null values).
- The dataset should be formatted in such a way that more than one machine learning or deep learning algorithm can be executed on the same data.
- Some other examples: duplicate or missing values may give an incorrect view of the overall statistics of the data, and outliers or inconsistent data points often disturb the model's overall learning, leading to false predictions.

Important Data Preprocessing Steps
1. Importing libraries
2. Importing the dataset
3. Train-test split
4. Handling null/missing values
5. Treating outliers and duplicate records
6. Feature scaling
7. Handling categorical variables

Data Preprocessing: Important Libraries

Generally used libraries for data preprocessing and EDA: Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn, SciPy.

Overview on Libraries for Data Preprocessing and EDA
Pandas: Pandas is a software library written for the Python programming language for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.

    # Importing data
    import pandas as pd
    df = pd.read_csv('PATH_TO_DATASET')
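As a quick illustration of the Pandas workflow above, a minimal sketch of a first look at a freshly loaded dataset; the file path is the same placeholder as in the slide, and the printed summaries assume an ordinary tabular CSV:

    import pandas as pd

    # Load the raw data (placeholder path, as in the slide above).
    df = pd.read_csv('PATH_TO_DATASET')

    # Shape, column dtypes and non-null counts.
    print(df.shape)
    df.info()

    # Summary statistics for the numerical columns.
    print(df.describe())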
NumPy: NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

SciPy: SciPy is a free and open-source Python library used for scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE (ordinary differential equation) solvers and other tasks common in science and engineering.

Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.

Seaborn: Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Importing the necessary libraries:

    import pandas as pd
    import scipy
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    # from sklearn.preprocessing import MinMaxScaler
    import seaborn as sns
    import matplotlib.pyplot as plt

Data Preprocessing: Train-Test Split

Train-Test Split

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Main goal: the train-test (or train-test-validation) split helps assess how well a machine learning model will generalize to new, unseen data while avoiding overfitting.
- Training set: the dataset on which the model is trained. This data is seen and learned by the model; with the training set, the model learns the patterns/features in the data.
- Validation set: a sample of data held out from the model's training data that is used to estimate model performance while tuning the model's hyperparameters.
- Test set: a sample of data held back from model training that is used to give an unbiased evaluation of the final model fit.
Other important aspects: evaluation of model generalization, strength assessment (especially with k-fold cross-validation), and bias-variance trade-off management.

The split ratio can differ based on your dataset. In most cases 20% to 40% of the whole dataset is allocated to the testing and validation sets, while the rest is kept for training. A train/validation/test split is sketched below.
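A sketch of a train/validation/test split made with two successive calls to train_test_split; the toy data from make_classification and the 60/20/20 ratio are only illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)

    # First hold out 20% of the data as the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42, stratify=y)

    # Then carve a validation set out of the remaining 80%
    # (0.25 * 0.80 = 20% of the original dataset).
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

    print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200

Stratifying on y keeps the class proportions roughly equal across the three subsets, which matters once the class distribution is skewed.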
Data Preprocessing: Handling Null/Missing Values

Missing/Null Values
Since many ML algorithms do not support missing values, these have to be treated before feeding the data into the models.

Methods for identifying missing values — Pandas functions for identifying missing values in a DataFrame:
- .isnull(): takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like arrays).
- .notnull(): takes a scalar or array-like object and indicates whether values are valid (not missing).
- .info(): prints information about a DataFrame, including the index dtype and columns, non-null counts and memory usage.
- .isna(): returns a boolean same-sized object indicating whether the values are NA. NA values, such as None or numpy.NaN, get mapped to True; everything else gets mapped to False. Values such as empty strings "" or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True).

Handling Missing/Null Values
Approach 01: Delete rows/columns with missing values.
- Pros: straightforward and simple to use; removing all missing values yields a robust model; beneficial when the missing values have no importance.
- Cons: can lead to information loss; works poorly if the percentage of missing values is excessive in comparison to the complete dataset; not appropriate when the data is not missing completely at random.
- In Pandas, the .dropna() function can be utilized to drop the null values.

Approach 02: Impute the missing values.
- Imputing with an arbitrary value such as 0 (for numerical features).
- Imputing with the mean (if there are outliers, the mean will not be appropriate; in such cases, outliers need to be treated first).
- Imputing with the median, the middlemost value (better to use the median for imputation when there are outliers).
- Imputing with the previous value, i.e. forward fill (mostly used in time series data).
- Imputing with the next value, i.e. backward fill.
- The Pandas .fillna() function can be used in all of the aforementioned cases.
- Pros: simplicity and ease of implementation; prevents the data loss that results from deleting rows or columns; works well with small datasets; the imputation uses the existing information from the non-missing data, so no additional data is required.
- Cons: mean and median imputation work only with numerical continuous variables, not with categorical features; can cause data leakage (between the train and test/validation sets).

Approach 02 cont.: Imputing categorical features.
- Impute with the most frequent value, i.e. the mode (the categories can be strings or numbers).
- Treat missing as a separate category (e.g., "missing") if the number of missing values is very large.
- Pros: prevents the data loss that results from deleting rows or columns; works well with small datasets and is easy to implement; negates the loss of data by adding a unique category.
- Cons: works only with categorical variables; adds new features to the model while encoding, which may result in poor performance.

Approach 02 cont.: Interpolation.
- General interpolation examples from mathematics (not limited to the functions available in Pandas) use the values before and after the missing value.
- Can be used with time series data (e.g., stock market prices, heart rate).
- Methods available in Pandas: linear, polynomial, quadratic.
- The .interpolate() function in Pandas can be utilized for this.
A short sketch of identifying and imputing missing values follows.
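A minimal sketch of identifying and imputing missing values with the Pandas functions just listed; the toy DataFrame and its column names are made up for illustration:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'age':    [25, np.nan, 40, 35, np.nan],
        'salary': [50000, 62000, np.nan, 58000, 61000],
        'city':   ['Galle', None, 'Matara', 'Galle', 'Colombo'],
    })

    # Count missing values per column.
    print(df.isnull().sum())

    # Numerical feature: impute with the median (more robust to outliers).
    df['age'] = df['age'].fillna(df['age'].median())

    # Categorical feature: impute with the mode (most frequent value).
    df['city'] = df['city'].fillna(df['city'].mode()[0])

    # Ordered / time-series-like feature: linear interpolation between neighbours.
    df['salary'] = df['salary'].interpolate(method='linear')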
Note: imputation with the mean, mode, median or some constant value falls under the univariate approaches; it can be done using scikit-learn's SimpleImputer class.

Approach 02 cont.: Prediction of missing values.
- Iterative Imputer (multivariate approach): a multivariate imputer that estimates each feature from all the others. It imputes missing values by modeling each feature with missing values as a function of the other features in a round-robin fashion: at each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X. A regressor is fit on (X, y) for the known values of y and is then used to predict the missing values of y. This is done for each feature in an iterative fashion and repeated for a maximum number of imputation rounds.
- KNN Imputer: imputation for completing missing values using k-nearest neighbors. Each sample's missing values are imputed using the mean value from the 'n' nearest neighbors found in the training set. By default, a Euclidean distance metric that supports missing values is used to find the nearest neighbors. Each missing feature is imputed using values from the 'n' nearest neighbors that have a value for that feature.
- Note: the above descriptions follow the scikit-learn implementations; other approaches/implementations exist as well. These imputers are available as sklearn.impute.IterativeImputer and sklearn.impute.KNNImputer.

Prediction of missing values (imputation cont.):
- Pros: provides better results in general; takes into account the covariance between the column with missing values and the other columns.
- Cons: can be computationally expensive compared to the other techniques, especially when working with large datasets; requires more effort than the previous approaches; the imputed values are only a proxy for the true values.
A minimal sketch of both imputers follows.
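A minimal sketch of the two scikit-learn imputers described above on a small numeric array; note that, at the time of writing, IterativeImputer requires the explicit experimental-feature import, and the parameter values here are just illustrative:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer, KNNImputer

    X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

    # KNN imputation: each missing entry becomes the mean of that feature
    # over the n_neighbors nearest rows (NaN-aware Euclidean distance).
    X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

    # Iterative (multivariate) imputation: each feature with missing values
    # is regressed on the other features in a round-robin fashion.
    X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

    print(X_knn)
    print(X_iter)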
What are the other approaches that exist for handling missing/null values?

Data Preprocessing: Treating Outliers and Duplicate Records

Treating Duplicate Records
In real-world datasets, duplicate values are quite common. Such records add no additional value or information, and keeping them may slow down processing, so it is a good idea to drop them before feeding the data into ML models. In Pandas, the .duplicated() and .drop_duplicates() functions can be utilized to process duplicates.

Treating Outliers
An outlier is a data point that significantly deviates from the rest of the data. It can be either much higher or much lower than the other data points, and its presence can have a significant impact on the results of machine learning algorithms. Detecting and treating outliers is crucial in any machine learning project. However, it is not always required to delete or remove outliers; it depends on the problem statement and what we are trying to achieve with the model (for example, in problems related to anomaly detection, fraud detection, etc., outliers play a major role; it is precisely the outliers that need to be tracked in such scenarios). The type of algorithm used also decides to what extent outliers affect the model: weight-based algorithms like linear regression, logistic regression, AdaBoost and other deep learning techniques are affected by outliers a lot, whereas tree-based algorithms like decision trees and random forests are not affected as much.

Detecting Outliers
- Boxplots: creating a boxplot is a smart way to detect whether the dataset has outliers.
- Z-scores: according to the 68–95–99.7 rule, for normally distributed data 99.7% of the data lies within 3 standard deviations of the mean. So, if a point lies outside 3 standard deviations from the mean, it is considered an outlier. For this we can calculate the z-scores of the data points and keep the threshold at 3: if the z-score of any point is greater than 3 or less than -3, it is an outlier. This rule is only valid for normal distributions.

Treating Outliers
1. Trimming: removing or deleting the outliers. This works well for large datasets, but if there are many outliers or the dataset is not that large, it reduces the number of data points, which in turn might affect the results and the model's accuracy. This technique is appropriate when the outliers are due to measurement or data entry errors.
2. Capping: a technique generally used for small datasets where outliers cannot be removed. We decide on an upper threshold and a lower threshold value and assign those values to all data points above or below the thresholds. This method is also called winsorization.
3. Imputation: a less commonly used technique, again for relatively small datasets. The outliers are treated as missing values and handled the way we treat missing values, i.e. imputing the mean or median depending on the type of data. This method is not very common, as it might change the distribution of the data.
4. Discretization: a technique in which the values of a numerical feature are converted into bins. For example, for an "Age" column, bins like 45–55, 55–65, 65–75, >75 can be created; in this case, all ages above 75 fall into the >75 bin and are treated equally.
A short sketch of z-score detection and capping follows.
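A minimal sketch of z-score-based detection and percentile capping (winsorization) on a single numeric column; the synthetic data, the 3-sigma threshold and the 1st/99th-percentile caps are illustrative choices:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    s = pd.Series(np.concatenate([rng.normal(50, 5, 200), [120.0, -30.0]]))

    # Z-score detection: flag points more than 3 standard deviations from the mean.
    z = (s - s.mean()) / s.std()
    print(s[z.abs() > 3])

    # Capping (winsorization): clip values to the 1st and 99th percentiles.
    lower, upper = s.quantile(0.01), s.quantile(0.99)
    s_capped = s.clip(lower=lower, upper=upper)

    # Duplicate records can be dropped in the same preprocessing pass.
    df = pd.DataFrame({'x': [1, 1, 2], 'y': [3, 3, 4]}).drop_duplicates()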
Data Preprocessing: Feature Scaling

Feature Scaling
Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle features with highly varying magnitudes, values or units. If feature scaling is not done, a machine learning algorithm tends to treat features with larger values as more important and features with smaller values as less important, regardless of the unit of the values.
Reasons for feature scaling: scaling guarantees that all features are on a comparable scale and have comparable ranges; it improves algorithm performance; it prevents numerical instability; and it ensures that each feature is given the same consideration during the learning process.

1. Min-Max Scaling
$X_{scaled} = \dfrac{X_i - X_{min}}{X_{max} - X_{min}}$
Note: because this method uses the maximum and the minimum values, it is also sensitive to outliers.

2. Normalization (mean normalization)
$X_{scaled} = \dfrac{X_i - X_{mean}}{X_{max} - X_{min}}$
This method is more or less the same as the previous one, but instead of subtracting the minimum value we subtract the mean of the whole column from each entry and then divide by the difference between the maximum and minimum values.

3. Standardization
$X_{scaled} = \dfrac{X_i - X_{mean}}{\sigma}$, where $\sigma$ is the standard deviation.
This rescales the data to a distribution with a mean equal to zero and a standard deviation equal to 1.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

Avoid data snooping: fit the scaler on the training set only and reuse the fitted scaler to transform the test set, as in the snippet above.

What are the other approaches that exist for feature scaling?

Data Preprocessing: Handling Categorical Variables

Handling Categorical Variables
Categorical data is a type of data used to group information with similar characteristics (e.g., gender), while numerical data expresses information in the form of numbers.
Requirement: most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. Many algorithms' performance even varies based on how the categorical variables are encoded.
Types of categorical variables:
- Nominal: no particular order (eye color: blue, brown, green, hazel; favorite color: red, orange, yellow, green, blue, purple).
- Ordinal: there is some order between the values (economic status: low, medium, high; educational level: high school diploma, bachelor's degree, master's degree, Ph.D.).

Label Encoding: a very simple approach that converts each value in a column to a number. Consider, for example, a dataset of bridges with a column named bridge-types containing categorical values. Good for ordinal data.

One-Hot Encoding: in this strategy, each category value is converted into a new column and assigned a 1 or 0 (true/false) value. Label encoding is straightforward, but it has the disadvantage that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy/order in them; this can be solved with one-hot encoding. Good for nominal data (a small encoding sketch follows below).

What are the other approaches that exist for handling categorical variables?
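A minimal sketch of label (ordinal) encoding and one-hot encoding using Pandas; the columns and the assumed category order are illustrative:

    import pandas as pd

    df = pd.DataFrame({
        'education': ['high school', 'bachelor', 'master', 'bachelor'],  # ordinal
        'eye_color': ['blue', 'brown', 'green', 'brown'],                # nominal
    })

    # Label/ordinal encoding: map each category to an integer that respects the order.
    order = {'high school': 0, 'bachelor': 1, 'master': 2}
    df['education_encoded'] = df['education'].map(order)

    # One-hot encoding: one new 0/1 column per category of the nominal feature.
    df = pd.get_dummies(df, columns=['eye_color'], prefix='eye')

    print(df)

scikit-learn's OrdinalEncoder and OneHotEncoder provide the same transformations in a pipeline-friendly form.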
Feature Selection

When developing machine learning models, in many cases you will not have to feed all the features into the model. Using unnecessary data can confuse the model and lead to undesirable results such as overfitting or underfitting. Furthermore, adding redundant variables reduces the model's generalization capability and may also reduce the overall accuracy of a classifier. Besides that, adding more features increases the overall complexity and training time of the model and reduces the impact of potentially important features. Therefore, selecting the features with the highest impact is vital in ML model development.

What is the difference between feature selection and dimensionality reduction?

The goal of feature selection techniques in machine learning is to find the best set of features that allows one to build optimized models of the studied phenomena. The available feature selection approaches can be divided into:
- Supervised techniques.
- Unsupervised techniques (most of the time these can be recognized as dimensionality reduction approaches as well).
Here we consider two feature selection approaches: the correlation matrix and the Chi-square test.

Feature Selection: Correlation Matrix
Correlation is a measure of the linear relationship between two or more variables; through correlation, we can predict one variable from the other. The logic behind using correlation for feature selection is that good variables correlate highly with the target, while adding a predictor variable with a weak relationship to the target has a negative effect on performance. Furthermore, variables should be correlated with the target but uncorrelated among themselves. The correlation matrix is computed using the Pearson correlation coefficient between variables and is most commonly used for continuous/numerical features.
Correlation coefficients have a value between -1 and 1. The closer a correlation is to -1 or 1, the stronger the relationship. A negative correlation means that if one set of data increases, the other decreases; a positive correlation means that if one set of data increases, the other also increases.

What is the difference between correlation and regression?

If the value is close to 0, it indicates that there is no linear connection between the two variables. If some features show a correlation close to zero with the target variable, we may drop them. If two features are highly correlated with each other, one of them can be eliminated. This may sometimes lead to information loss, so also consider other methods and domain knowledge to ensure a comprehensive and accurate selection of features for your specific machine learning problem.

Feature Selection: Chi-square Test
The Chi-square test is used for categorical features in a dataset. We calculate the Chi-square statistic between each feature and the target and select the desired number of features with the best Chi-square scores. Conditions to be met:
- The variables have to be categorical.
- They must be sampled independently.
- Values should have an expected frequency greater than 5.
This is a non-parametric test, meaning it makes no assumptions about the distribution of the data. A greater Chi-square score indicates a stronger link between the feature and the target, showing that the feature is more important for the classification task. A short feature selection sketch follows.
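A minimal sketch of both ideas on the iris dataset: correlation of each feature with the target (and between features), and Chi-square-based selection with scikit-learn's SelectKBest; the dataset choice and k=2 are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True, as_frame=True)

    # Correlation of each (numerical) feature with the target,
    # and between the features themselves.
    print(X.corrwith(y).sort_values(ascending=False))
    print(X.corr())

    # Chi-square selection: keep the 2 features with the best scores
    # (chi2 requires non-negative feature values).
    selector = SelectKBest(score_func=chi2, k=2)
    X_selected = selector.fit_transform(X, y)
    print(selector.scores_)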
Handling Imbalanced Data

Handling Imbalanced Datasets
Data imbalance usually reflects an unequal distribution of classes within a dataset. Consider a binary classification problem where you have two classes, 1 and 0, and suppose more than 90% of your training examples belong to only one of these classes. If you try to train a classification model on top of this data, your model is going to be biased towards the majority class, because machine learning models learn from examples and most of the examples in your dataset belong to a single class. Imbalanced datasets can lead to biased models favoring the majority class and delivering subpar performance for the minority class, especially in critical contexts like disease diagnosis.

Approach 01: Resample the Training Set
- Under-sampling: under-sampling balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient. By keeping all samples of the rare class and randomly selecting an equal number of samples from the abundant class, a balanced new dataset can be retrieved for further modelling.
- Over-sampling: on the contrary, over-sampling is used when the quantity of data is insufficient for under-sampling. It tries to balance the dataset by increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated using, e.g., random replication or SMOTE-like techniques.

Approach 01 cont.: SMOTE (Synthetic Minority Oversampling Technique)
SMOTE performs over-sampling of the minority class, and unlike random over-sampling, which can cause overfitting, it avoids simply replicating examples. In SMOTE, a subset of the minority class is taken, and new synthetic data points are generated based on it. These synthetic data points are then added to the original training dataset as additional examples of the minority class.
Steps:
1. Draw a random sample from the minority class.
2. For the observations in this sample, identify the k nearest neighbors.
3. Take one of those neighbors and identify the vector between the current data point and the selected neighbor.
4. Multiply the vector by a random number between 0 and 1.
5. Add this to the current data point to obtain the synthetic data point.
Limitations: vanilla SMOTE does not take nearby majority-class samples into consideration while creating synthetic examples of the minority class, so it might increase class overlap and add noise to the training dataset. SMOTE is also not very effective on high-dimensional datasets.
What are the other variants of SMOTE? A minimal SMOTE sketch follows.
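A minimal sketch of SMOTE using the third-party imbalanced-learn package; the toy data and its 95/5 class ratio are illustrative, and the over-sampling is applied to the training split only so that no synthetic points leak into evaluation:

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y)

    # Synthesize new minority-class points from their k nearest minority neighbours.
    X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_train, y_train)

    print(Counter(y_train), Counter(y_res))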
Approach 02: Use the Right Evaluation Metrics
Accuracy might be a good evaluation metric for a model trained on a well-balanced dataset, but it is not ideal for an imbalanced dataset. For an imbalanced class dataset the F1 score is a more appropriate metric: it is the harmonic mean of precision and recall. The F1 score keeps the balance between precision and recall and improves only if the classifier identifies more of a certain class correctly (a short sketch appears at the end of this section).
Reminder — precision and recall: precision measures how accurate the classifier's predictions for a specific class are, and recall measures the classifier's ability to identify all instances of a class.

Approach 03: Choose the Correct ML Algorithms and Alter the Cost Function
Generally, decision-tree-based algorithms perform well on imbalanced datasets. Similarly, bagging- and boosting-based techniques are good choices for imbalanced classification problems (e.g., random forest, XGBoost). When altering the cost function, you add an extra cost every time the model misclassifies the minority class; this forces the model to pay more attention to the minority class and to make fewer mistakes on it.

Approach 04: Solve as an Anomaly Detection Problem
Anomaly detection is a term for problems concerned with the prediction of rare events, such as system failures or fraud detection. Anomaly detection problems consider the minority class (rare events) as outliers and apply several approaches to detect them. We can do a similar thing for our classification problem by treating the minority class as an outlier, which might give new ideas for solving the problem efficiently.

Identify the other techniques that can be used for handling imbalanced datasets.
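As a closing sketch tying Approaches 02 and 03 together: an F1/classification-report evaluation of a logistic regression trained with class_weight='balanced', one simple way of raising the misclassification cost of the minority class; the toy data and model choice are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, f1_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)

    # class_weight='balanced' raises the penalty for minority-class mistakes.
    clf = LogisticRegression(class_weight='balanced', max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # F1 (harmonic mean of precision and recall) is more informative than
    # plain accuracy on imbalanced data.
    print(f1_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))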
Thank You