STA 302 Machine Learning - Osun State University PDF

Summary

This document is a self-guided learning resource for Statistical Machine Learning II (STA 302), intended for Redeemer's University students. Authored by Timothy A. Ogunleye of the Department of Statistics at Osun State University, Nigeria, it covers types of machine learning, resampling methods, and more.

Full Transcript


Osun State University, Osogbo, Nigeria
Department of Statistics
Lecturer-in-Charge: Timothy A. OGUNLEYE, PhD
Email Contacts: [email protected]; [email protected]
Personal Websites: https://timothy-ogunleye.com.ng; https://timothy-ogunleye.vercel.app

A SELF-GUIDED LEARNING RESOURCE FOR STATISTICAL MACHINE LEARNING II - STA 302 (REDEEMER'S UNIVERSITY STUDENTS)

INTRODUCTION TO MACHINE LEARNING
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn from data and make predictions or decisions without being explicitly programmed. It focuses on developing algorithms that improve their performance over time as they are exposed to more data.

TYPES OF MACHINE LEARNING
1. Supervised Learning – The algorithm learns from labeled data, making predictions based on input-output pairs (e.g., classification and regression).
2. Unsupervised Learning – The model identifies patterns and relationships in data without predefined labels (e.g., clustering and dimensionality reduction).
3. Reinforcement Learning – The algorithm learns by interacting with an environment and receiving rewards or penalties based on the actions taken.

APPLICATIONS OF MACHINE LEARNING
- Finance: Fraud detection, stock price prediction.
- Healthcare: Disease diagnosis, medical image analysis.
- Retail: Customer recommendation systems, demand forecasting.
- Autonomous Systems: Self-driving cars, robotics.

Machine Learning is transforming industries by enabling automation, predictive analytics, and intelligent decision-making, making it a key driver of technological advancement.

RESAMPLING METHODS
Resampling is a statistical technique used to repeatedly draw samples from a dataset to estimate population parameters, assess model performance, or improve the robustness of predictions. It is particularly useful when working with small datasets, non-parametric statistics, and machine learning models. Resampling methods can be broadly categorized as follows.

1. BOOTSTRAPPING
Bootstrapping is a resampling technique in which new datasets are created by randomly drawing samples with replacement from an original dataset. This method helps estimate standard errors, confidence intervals, and the distribution of a statistic.

1.1 KEY FEATURES OF BOOTSTRAPPING
(i) Each bootstrap sample has the same size as the original dataset.
(ii) Since sampling is with replacement, some observations may appear multiple times while others may be left out.
(iii) The method is particularly useful for small datasets where traditional parametric approaches may not be reliable.

1.2 APPLICATIONS OF BOOTSTRAPPING
(i) Estimating confidence intervals for means, medians, or regression coefficients.
(ii) Assessing the stability of machine learning models.
(iii) Estimating bias and variance in predictions.

1.3 LIMITATIONS OF BOOTSTRAPPING
(i) Computationally expensive when performed with large datasets.
(ii) Can be biased if the original sample is not representative of the population.
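To make the procedure concrete, here is a minimal Python/NumPy sketch of a percentile bootstrap for the mean of a small sample. The data values and the choice of 1,000 resamples are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(42)

# A small illustrative sample (hypothetical values)
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4, 5.8, 4.7])

n_boot = 1000                      # number of bootstrap resamples
boot_means = np.empty(n_boot)

for b in range(n_boot):
    # Draw a sample of the same size as the data, WITH replacement
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# Percentile 95% confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap estimate of the mean: {boot_means.mean():.3f}")
print(f"Approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```

The same loop works for any statistic (median, regression coefficient, etc.); only the line that computes the statistic changes.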
2. CROSS-VALIDATION
Cross-validation is a technique used to evaluate machine learning models by splitting the dataset into multiple training and testing subsets. The goal is to assess how well the model generalizes to new data.

2.1 TYPES OF CROSS-VALIDATION

2.1.1 K-FOLD CROSS-VALIDATION
K-Fold Cross-Validation is a resampling technique used in machine learning and statistical modeling to assess a model's performance by dividing the dataset into k equal-sized subsets (folds). It ensures that every data point is used for both training and testing, reducing model bias and improving generalizability.

Why Do We Need K-Fold Cross-Validation?
In machine learning, the traditional approach to evaluating model performance is to split the dataset into a single training set and a single testing set. However, this approach can lead to biased results, especially if the dataset is small. Cross-validation helps:
(i) To avoid Overfitting – By testing the model on different subsets, we ensure that it generalizes well to unseen data.
(ii) To maximize Data Utilization – Every observation is used for training and validation at least once.
(iii) To provide More Reliable Model Performance Estimates – Since the model is trained multiple times, we get a better estimate of its performance.

How K-Fold Cross-Validation Works
K-Fold Cross-Validation follows these steps:
(i) Shuffle the Dataset (optional but recommended) to ensure randomness.
(ii) Split the Dataset into K Folds – The data is divided into k equal subsets (folds).
(iii) Train the Model on K-1 Folds – One fold is held out for validation, while the remaining k-1 folds are used for training.
(iv) Test the Model on the Remaining Fold – The trained model is evaluated on the held-out (test) fold.
(v) Repeat the Process K Times – Each fold is used as the test set exactly once, with the remaining k-1 folds used for training.
(vi) Compute the Average Performance Metric – The final model performance is the average of all k test results.

Example with k = 5 Folds
Assume we have 100 data points and we use 5-fold cross-validation (k = 5).

Iteration    Training Set (80%)     Testing Set (20%)
1            Folds 2, 3, 4, 5       Fold 1
2            Folds 1, 3, 4, 5       Fold 2
3            Folds 1, 2, 4, 5       Fold 3
4            Folds 1, 2, 3, 5       Fold 4
5            Folds 1, 2, 3, 4       Fold 5

At the end of the 5 iterations, we take the average score of the model over the five test sets to get a final performance estimate.

Choosing the Right K Value
The value of k affects the bias-variance tradeoff of the performance estimate:
- Low k (e.g., k = 2 or 5): Each model is trained on a smaller fraction of the data, which tends to give a more biased (pessimistic) estimate, but the procedure is cheaper to run and the estimate is typically less variable.
- High k (e.g., k = 10 or more, up to leave-one-out): Training sets are nearly as large as the full dataset, so the bias is lower, but the estimate can have higher variance and the computational cost increases.
- Typical Choice: k = 5 or k = 10 is commonly used, as it provides a balance between bias, variance, and computational efficiency.

Advantages of K-Fold Cross-Validation
(i) More Reliable Performance Estimates – Uses all data points for both training and validation.
(ii) Reduces Model Overfitting – Trains and tests on different subsets, improving generalizability.
(iii) Works Well for Small Datasets – Maximizes data usage efficiently.

Disadvantages of K-Fold Cross-Validation
(a) Computationally Expensive – Requires training the model k times, increasing computational cost.
(b) Not Suitable for Time Series Data – Since time series data has an inherent order, random splitting can break temporal patterns. Instead, Time Series Cross-Validation (e.g., rolling-window CV) is used.
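The following is a minimal sketch of 5-fold cross-validation using scikit-learn; the library, the logistic regression model, and the synthetic dataset are illustrative assumptions rather than anything prescribed in the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic dataset standing in for real data (illustrative only)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: shuffle, split into 5 folds,
# train on 4 folds and test on the remaining one, 5 times
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```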
K-Fold Cross-Validation is a powerful technique for evaluating machine learning models. It provides a better estimate of model performance, prevents overfitting, and maximizes data utilization. Choosing the right k depends on the dataset size and the computational resources available.

2.1.2 LEAVE-ONE-OUT CROSS-VALIDATION (LOOCV)
Leave-One-Out Cross-Validation (LOOCV) is an extreme case of k-Fold Cross-Validation in which the number of folds k is equal to the total number of observations (n). In other words, each individual data point is used as the test set exactly once, while the remaining n-1 observations are used for training. LOOCV provides a nearly unbiased estimate of model performance, but it comes at a high computational cost because it requires training the model n times (once per observation).

How LOOCV Works
LOOCV follows these steps:
(i) Divide the Dataset – Each observation is treated as a separate test set. If the dataset has n samples, we perform n iterations.
(ii) Train the Model on n-1 Observations – In each iteration, all data points except one are used to train the model.
(iii) Test the Model on the Remaining Observation – The left-out data point is used as the test set.
(iv) Repeat the Process for Each Observation – The model is therefore trained n times.
(v) Compute the Average Error Metric – The final model performance is the average of all n test errors (e.g., Mean Squared Error (MSE), accuracy, etc.).

Example of LOOCV
Suppose we have a dataset with 5 observations:

Observation    Training Set (n-1)    Testing Set (1 observation)
1              2, 3, 4, 5            1
2              1, 3, 4, 5            2
3              1, 2, 4, 5            3
4              1, 2, 3, 5            4
5              1, 2, 3, 4            5

After running the model 5 times, we take the average of the 5 test scores to get the final model performance estimate.

Advantages of LOOCV
(i) Nearly Unbiased Performance Estimation – Since each observation is tested once on a model trained with almost all of the data, the estimate has very low bias.
(ii) Maximum Data Utilization – Every data point contributes to both training and testing, making it useful for small datasets.
(iii) No Randomness – Unlike k-fold cross-validation (where fold assignments may vary), LOOCV always produces the same results.

Disadvantages of LOOCV
(i) Computationally Expensive – The model is trained n times, making it impractical for large datasets.
(ii) High Variance in Error Estimates – Since only one observation is used for testing in each iteration, small changes in the data can lead to high variability in performance scores.
(iii) Not Suitable for Large Datasets – If n = 100,000, the model needs to be trained 100,000 times, which is usually infeasible.

When to Use LOOCV?
(i) When the dataset is very small (e.g., n < 100) and every data point is crucial.
(ii) When a nearly unbiased estimate of model performance is required, especially for research purposes.
(iii) When computational power is not a concern, as LOOCV is expensive for large datasets.

LOOCV is a highly precise but computationally expensive method for model evaluation. It provides nearly unbiased performance estimates but suffers from high variance and slow execution. For large datasets, k-fold cross-validation (e.g., k = 5 or k = 10) is often preferred, as it provides a good trade-off between computational efficiency and accuracy.
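A minimal LOOCV sketch, again assuming scikit-learn and a small synthetic regression problem (both illustrative choices): the estimator is fitted n = 20 times and the n squared errors are averaged.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic regression dataset (illustrative only)
X, y = make_regression(n_samples=20, n_features=3, noise=5.0, random_state=0)

model = LinearRegression()

# LOOCV: n folds, each containing exactly one test observation
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo,
                         scoring="neg_mean_squared_error")

# cross_val_score returns negated MSE, so flip the sign before averaging
mse_per_point = -scores
print("Number of fits:", len(mse_per_point))      # equals n = 20
print("LOOCV estimate of MSE:", mse_per_point.mean().round(3))
```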
2.1.3 STRATIFIED K-FOLD CROSS-VALIDATION
Stratified k-Fold Cross-Validation is a variation of k-Fold Cross-Validation that ensures the distribution of the target variable's classes remains balanced across all folds. Instead of randomly splitting the dataset into k subsets, stratified k-fold makes sure that each fold maintains the same proportion of classes as the original dataset.

This method is particularly useful when dealing with imbalanced datasets, where some classes appear much more frequently than others. Without stratification, a regular k-fold split may leave some folds with very few (or even no) observations of the minority class, leading to biased model evaluation.

Why Use Stratified k-Fold Cross-Validation?
(i) Prevents Data Imbalance Issues – Ensures all folds contain the same class proportions as the entire dataset.
(ii) Improves Model Evaluation – Since each fold has a balanced class distribution, model performance metrics (e.g., accuracy, F1-score) are more reliable.
(iii) More Representative Testing – Ensures that every class is present in both the training and test sets of each fold.

How Stratified k-Fold Works
Stratified k-Fold Cross-Validation follows these steps:
(i) Determine the Number of Folds (k) – Choose a suitable k (e.g., 5 or 10).
(ii) Preserve Class Proportions – The dataset is split into k subsets while maintaining the same proportion of target classes in each fold as in the original dataset.
(iii) Train the Model on k-1 Folds – One fold is left out for validation, and the model is trained on the remaining k-1 folds.
(iv) Test the Model on the Remaining Fold – The trained model is evaluated on the holdout fold.
(v) Repeat for All k Folds – Each fold is used as a test set once, and the model is trained k times.
(vi) Compute the Average Performance Score – The final model performance is the average of the k test scores.

When to Use Stratified k-Fold?
(i) When dealing with imbalanced datasets, where one class is much more frequent than the others.
(ii) When classification performance matters, especially for precision, recall, and F1-score.
(iii) When ensuring equal representation of classes across all folds is necessary for fair model evaluation.

Advantages of Stratified k-Fold Cross-Validation
(i) Better Performance Estimation – Maintains the original class distribution, leading to a more reliable evaluation.
(ii) Reduces Variance in Model Training – Ensures every fold represents the dataset correctly, preventing extreme variations.
(iii) More Reliable for Imbalanced Datasets – Prevents training/testing bias when class frequencies differ significantly.

Disadvantages of Stratified k-Fold Cross-Validation
(i) Computationally Expensive – Requires training the model k times, just like regular k-Fold Cross-Validation.
(ii) Not Always Necessary for Balanced Datasets – If the dataset already has an equal class distribution, regular k-Fold works fine.

Stratified k-Fold Cross-Validation is a powerful technique that ensures equal class representation across all folds, making it ideal for imbalanced datasets. By preventing biased training and testing, it provides more accurate and fair model performance estimates.
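Below is a minimal sketch of stratified 5-fold cross-validation, assuming scikit-learn and a deliberately imbalanced synthetic dataset (about 90% vs 10%); the final loop simply checks that each test fold preserves the class proportions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic dataset: roughly 90% class 0, 10% class 1 (illustrative)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

model = LogisticRegression(max_iter=1000)

# Stratified 5-fold CV: each fold keeps roughly the 90/10 class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=skf, scoring="f1")

print("F1-score per fold:", scores.round(3))
print("Mean F1-score:", scores.mean().round(3))

# Check that the minority-class proportion is preserved in each test fold
for i, (_, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {i}: proportion of minority class =",
          round(float(y[test_idx].mean()), 3))
```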
2.1.4 JACKKNIFE RESAMPLING
Jackknife resampling is a statistical resampling method used to estimate the bias and variance of a statistical estimator. It is particularly useful for small datasets and helps assess the stability and reliability of statistical estimates. The Jackknife method systematically leaves out one observation at a time from the dataset, computes the statistic of interest on the remaining data, and repeats this process for every data point. By aggregating the results, it provides an estimate of sampling variability and reduces bias in the estimator.

Key Purposes of Jackknife Resampling
(i) Estimating the bias of an estimator.
(ii) Estimating the variance and standard error of a statistic.
(iii) Improving estimates by reducing overfitting and increasing robustness.
(iv) Serving as an alternative to bootstrap resampling, especially when data is limited.

How Jackknife Resampling Works
The Jackknife method follows these steps:
(i) Remove One Observation – Exclude one observation from the dataset.
(ii) Compute the Statistic – Calculate the statistic (e.g., mean, median, regression coefficient) using the remaining n-1 observations.
(iii) Repeat for Every Observation – Perform steps (i) and (ii) for all n observations, creating n different leave-one-out estimates.
(iv) Compute the Jackknife Estimate – The final estimate is obtained by averaging all n leave-one-out estimates.
(v) Calculate Bias and Variance – The leave-one-out estimates are then used to calculate the bias and variance of the statistic.

Advantages of Jackknife Resampling
(i) Simple to Implement – Easy to apply to small datasets.
(ii) Bias Correction – Helps correct biased estimates.
(iii) Works Well for Small Data – Unlike the bootstrap, it doesn't require thousands of resamples.
(iv) Provides Variance and Standard Error Estimates – Useful in hypothesis testing.

Disadvantages of Jackknife Resampling
(i) Computationally Expensive – Requires training or calculating the statistic n times.
(ii) Less Flexible than the Bootstrap – The bootstrap allows more resampling and better estimates for large datasets.
(iii) Not Always Effective for Highly Skewed Data – May not work well if the dataset has extreme outliers.

When to Use Jackknife Resampling?
(i) When working with small datasets where bootstrap resampling is impractical.
(ii) When bias correction is needed for an estimator.
(iii) When estimating the standard error of a statistic.
(iv) When computational cost is a concern and the bootstrap is too expensive.

Jackknife resampling is a powerful technique for estimating the bias, variance, and standard error of statistical estimators. It systematically removes one observation at a time, computes the statistic, and aggregates the results. While it is useful for small datasets, it can be computationally expensive and is often replaced by bootstrap resampling for larger datasets.
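The steps above translate directly into a few lines of NumPy. In this sketch the statistic is the (biased) maximum-likelihood standard deviation of a made-up sample, an illustrative choice only. Writing theta for the full-sample estimate and theta_(i) for the estimate with observation i removed, the usual jackknife formulas are: bias = (n-1) * (mean of theta_(i) - theta), and SE^2 = ((n-1)/n) * sum of (theta_(i) - mean of theta_(i))^2.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=15)   # hypothetical small sample
n = data.size

def estimator(x):
    # Statistic of interest; the (biased) ML standard deviation as an example
    return np.std(x)

theta_full = estimator(data)

# Leave-one-out estimates: theta_(i) computed without observation i
theta_loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
theta_dot = theta_loo.mean()                      # jackknife average

bias = (n - 1) * (theta_dot - theta_full)         # jackknife bias estimate
theta_corrected = theta_full - bias               # bias-corrected estimate
se = np.sqrt((n - 1) / n * np.sum((theta_loo - theta_dot) ** 2))

print(f"Plain estimate:           {theta_full:.4f}")
print(f"Jackknife bias estimate:  {bias:.4f}")
print(f"Bias-corrected estimate:  {theta_corrected:.4f}")
print(f"Jackknife standard error: {se:.4f}")
```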
2.1.5 PERMUTATION TESTS
A permutation test (also known as a randomization test) is a non-parametric statistical test used to determine whether an observed difference between groups is statistically significant. It does this by shuffling (permuting) the data many times and recalculating the test statistic under the assumption that the null hypothesis is true. Unlike traditional hypothesis tests (e.g., t-tests), permutation tests do not assume normality and are useful when sample sizes are small or the data distribution is unknown.

Why Use a Permutation Test?
(i) No Assumptions about the Data Distribution – Works with non-normal data.
(ii) Valid for Small Sample Sizes – Unlike parametric tests, which require large samples.
(iii) Flexible – Can be applied to various statistical problems (means, medians, correlations, etc.).
(iv) More Accurate p-values – p-values are calculated from the actual data rather than from large-sample approximations.

How a Permutation Test Works
(i) Define the Hypotheses
- Null Hypothesis: There is no real difference between the two groups; any observed difference is due to randomness.
- Alternative Hypothesis: The difference is real and not due to chance.
(ii) Compute the Observed Test Statistic
- Choose a test statistic such as the mean difference, median difference, correlation, or t-statistic.
- Compute this statistic from the original dataset.
(iii) Permute the Data
- Randomly shuffle (permute) the group labels many times (e.g., 10,000 times).
- For each permutation, recalculate the test statistic on the shuffled data.
(iv) Compare the Observed Statistic to the Permuted Distribution
- The p-value is the proportion of permuted statistics that are at least as extreme as the observed statistic.

Types of Permutation Tests
(i) Permutation t-Test – Compares the means of two independent groups.
(ii) Permutation ANOVA – Tests differences among three or more groups.
(iii) Permutation Correlation Test – Tests whether two variables are correlated.
(iv) Permutation Regression Test – Assesses the significance of regression coefficients.

Advantages of a Permutation Test
(i) Distribution-Free – Works with skewed or non-normal data.
(ii) Applicable to Any Test Statistic – Can test means, medians, correlations, etc.
(iii) More Accurate for Small Samples – No reliance on large-sample approximations.
(iv) Handles Outliers Better – Unlike parametric tests, which are sensitive to outliers.

Disadvantages of a Permutation Test
(i) Computationally Expensive – Requires thousands of permutations.
(ii) Not Practical for Very Large Datasets – Too slow for massive datasets.
(iii) Randomization Issues – The results can vary slightly if not enough permutations are performed.

When to Use a Permutation Test?
(i) When the data is not normally distributed.
(ii) When the sample size is small (< 30).
(iii) When traditional tests are unreliable.
(iv) When an exact, data-based p-value is wanted rather than an approximation.

The permutation test is a powerful, flexible, and assumption-free method for testing statistical hypotheses. It relies on random shuffling to determine whether an observed effect is truly significant or simply due to chance. Although computationally intensive, it is especially useful for small datasets or non-normal distributions.
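Here is a minimal sketch of a two-sample permutation test on the mean difference, using only NumPy; the two groups and the choice of 10,000 permutations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical groups (e.g., scores under treatment A and treatment B)
group_a = np.array([23.1, 25.4, 22.8, 26.0, 24.5, 23.9])
group_b = np.array([21.0, 22.3, 20.8, 23.1, 21.7, 22.0, 20.5])

observed = group_a.mean() - group_b.mean()        # observed test statistic
pooled = np.concatenate([group_a, group_b])
n_a = group_a.size

n_perm = 10_000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)            # shuffle the group labels
    perm_stats[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Two-sided p-value: proportion of permuted statistics at least as extreme
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"Observed mean difference: {observed:.3f}")
print(f"Permutation p-value:      {p_value:.4f}")
```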
2.1.6 TIME SERIES RESAMPLING
Time series resampling refers to changing the frequency of observations in time series data. It is a crucial technique in time series analysis that allows us to aggregate (downsample) or interpolate (upsample) data to match the required time granularity.

Downsampling: Reducing the frequency of the data (e.g., converting daily data to monthly data).
Upsampling: Increasing the frequency of the data (e.g., converting monthly data to daily data by filling in the missing values).

Resampling is often used for data preprocessing, trend analysis, forecasting, and handling missing values in time series data.

Types of Time Series Resampling

1. Downsampling (Aggregation)
Downsampling involves reducing the number of observations by grouping data into larger time intervals, for example, converting hourly stock prices into daily averages.

Why Downsample?
✔ Reduce data size for faster computation.
✔ Identify long-term trends.
✔ Match datasets that were recorded at different frequencies.

Common Aggregation Methods for Downsampling:
1. Mean: Compute the average of each group.
2. Sum: Total value in the interval (e.g., total monthly sales).
3. Min/Max: Record the highest/lowest value in the interval.
4. First/Last: Take the first or last value in each group.

2. Upsampling (Interpolation)
Upsampling involves increasing the number of observations by inserting new data points at higher frequencies, for example, converting weekly temperature data to daily temperature data.

Why Upsample?
✔ Fill in missing data for better continuity.
✔ Improve resolution for detailed trend analysis.
✔ Synchronize with other time series datasets of higher frequency.

Common Interpolation Methods for Upsampling:
1. Forward Fill (ffill): Uses the last available value to fill missing values.
2. Backward Fill (bfill): Uses the next available value to fill missing values.
3. Linear Interpolation: Estimates missing values by connecting two known points with a straight line.
4. Spline Interpolation: Uses a smooth curve to estimate missing values.

Applications of Time Series Resampling
✅ Financial Data Processing – Adjusting stock market data from tick-level to daily or monthly summaries.
✅ Weather Forecasting – Converting yearly temperature averages into monthly or weekly trends.
✅ Retail Sales Analysis – Aggregating hourly sales into daily, weekly, or monthly reports.
✅ Energy Consumption Analysis – Transforming minute-level power consumption data into hourly reports.
✅ Machine Learning – Preparing time series data for modeling by ensuring uniform time intervals.

Time series resampling is a fundamental technique in time series analysis. Downsampling helps to summarize data and detect trends, while upsampling fills gaps and improves data granularity. The choice of method depends on the characteristics of the data and the objectives of the analysis.
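The sketch below shows both directions of resampling with pandas (an assumed library choice, not named in the notes): a hypothetical daily sales series is downsampled to monthly totals and means, and the monthly means are then upsampled back to a daily grid using forward fill and linear interpolation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical daily sales series for the first quarter of a year
days = pd.date_range("2024-01-01", "2024-03-31", freq="D")
daily_sales = pd.Series(rng.poisson(lam=100, size=len(days)), index=days)

# Downsampling: aggregate daily values into monthly totals and means
monthly_total = daily_sales.resample("MS").sum()    # "MS" = month start
monthly_mean = daily_sales.resample("MS").mean()

# Upsampling: spread the monthly means onto a daily grid,
# filling the new rows by forward fill and by linear interpolation
daily_ffill = monthly_mean.resample("D").ffill()
daily_linear = monthly_mean.resample("D").interpolate(method="linear")

print(monthly_total)
print(daily_ffill.head())
print(daily_linear.head())
```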
2.1.7 MONTE CARLO RESAMPLING
Monte Carlo Resampling is a statistical simulation technique used to estimate the properties of a dataset by repeatedly drawing random samples. It is widely used in risk analysis, forecasting, machine learning, finance, and scientific modeling to assess uncertainty and make probabilistic predictions. Unlike resampling methods such as the bootstrap and jackknife, which rely on specific data-based resampling rules, Monte Carlo resampling uses randomness to generate new samples from a given probability distribution.

Why Use Monte Carlo Resampling?
1. Handles Complex Problems – Works well in situations where analytical solutions are difficult.
2. Estimates Uncertainty – Helps measure confidence intervals and variances.
3. Robust for Small Samples – Even if data is limited, Monte Carlo methods can estimate probabilities effectively.
4. Flexible – Can be applied to a variety of statistical problems (mean estimation, regression, classification, etc.).

How Monte Carlo Resampling Works
Step 1: Define the Probability Distribution – The data can come from an existing dataset or from a theoretical probability distribution (e.g., normal, uniform, exponential).
Step 2: Generate Random Samples – Draw random samples from the given distribution many times (e.g., 10,000 times). If working with real data, random samples can be drawn with or without replacement.
Step 3: Compute the Statistic of Interest – For each simulated dataset, compute the desired statistic (e.g., mean, variance, correlation, model accuracy).
Step 4: Aggregate the Results – Analyze the distribution of the computed statistics across all simulated datasets, and calculate confidence intervals, probabilities, or error margins.

Applications of Monte Carlo Resampling
✅ Finance & Risk Analysis – Portfolio risk assessment, stock price prediction.
✅ Machine Learning – Cross-validation in model evaluation, uncertainty estimation.
✅ Physics & Engineering – Simulation of particle movements, reliability testing.
✅ Healthcare & Epidemiology – Disease spread modeling, drug effectiveness analysis.
✅ Economics & Business – Market trend predictions, demand forecasting.

Advantages of Monte Carlo Resampling
✔ Works in Non-Standard Cases – No need for normality assumptions.
✔ Handles Missing Data Well – Can still provide good estimates.
✔ Applicable Across Fields – From machine learning to finance and physics.
✔ Improves Decision Making – Used in risk analysis and forecasting.

Disadvantages of Monte Carlo Resampling
❌ Computationally Expensive – Requires thousands of simulations.
❌ Sensitive to Initial Assumptions – The choice of distribution affects the results.
❌ Requires Many Simulations for Accuracy – More simulations give better estimates, but also demand more computation.

Monte Carlo Resampling is a powerful and flexible technique that uses randomized sampling to estimate statistics and quantify uncertainty. It is widely used in domains where traditional analytical methods are insufficient.
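As a small illustration, the following NumPy sketch assumes the data-generating process is an exponential distribution with mean 5 (a purely illustrative choice) and uses 10,000 simulated datasets to approximate the sampling distribution of the sample median.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed data-generating process: exponential waiting times with mean 5
true_mean = 5.0
n_obs = 30          # size of each simulated dataset
n_sim = 10_000      # number of Monte Carlo simulations

medians = np.empty(n_sim)
for s in range(n_sim):
    sample = rng.exponential(scale=true_mean, size=n_obs)
    medians[s] = np.median(sample)   # statistic of interest

# Aggregate the simulated statistics
print(f"Monte Carlo mean of the sample median: {medians.mean():.3f}")
print(f"Monte Carlo standard error:            {medians.std(ddof=1):.3f}")
lo, hi = np.percentile(medians, [2.5, 97.5])
print(f"Central 95% interval: ({lo:.3f}, {hi:.3f})")
```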
2.1.8 OVERSAMPLING & UNDERSAMPLING (FOR IMBALANCED DATASETS)
In many machine learning and data analysis tasks, particularly classification problems, you may encounter imbalanced datasets. An imbalanced dataset occurs when the classes of the target variable are not equally represented. This imbalance can negatively affect the model's performance, especially because the model tends to favor the majority class due to its higher frequency.

Examples of imbalanced datasets include:
- Fraud detection: The number of fraudulent transactions is much lower than the number of non-fraudulent ones.
- Medical diagnosis: The occurrence of a particular disease is rare compared to healthy individuals.
- Spam detection: The number of spam emails is much smaller than the number of non-spam emails.

Challenges of Imbalanced Datasets
Imbalanced datasets lead to several issues:
1. Model Bias: The model may be biased toward predicting the majority class, leading to poor performance on the minority class.
2. Low Sensitivity (Recall) for the Minority Class: In classification problems, the ability to correctly identify the minority class is usually compromised.
3. Poor Generalization: A model trained on imbalanced data may fail to generalize well to real-world, balanced data.

Solutions for Imbalanced Datasets: Oversampling & Undersampling
Two common methods for addressing class imbalance are oversampling and undersampling. Both techniques modify the distribution of the dataset to provide a more balanced representation of the classes.

1. Oversampling
What is Oversampling?
Oversampling involves increasing the number of samples in the minority class to make the class distribution more balanced. This can be done by duplicating existing data points or by generating synthetic data.

Techniques for Oversampling:
1. Random Oversampling:
Method: Randomly duplicate samples from the minority class to increase its size.
Pros: Simple to implement.
Cons: Can lead to overfitting, because it introduces duplicate data that provides no new information.

2. Synthetic Minority Over-sampling Technique (SMOTE):
Method: SMOTE generates synthetic samples by creating new data points that are combinations of existing minority-class instances. These synthetic points are created by choosing two similar points from the minority class and placing a new point between them in feature space.
Pros: Reduces the risk of overfitting compared to random oversampling, because it generates new, diverse data.
Cons: May introduce noise if not used carefully.

3. Adaptive Synthetic Sampling (ADASYN):
Method: Similar to SMOTE, but it focuses on creating more synthetic samples in regions where the minority class is harder to learn (i.e., where minority-class samples lie close to the majority class).
Pros: Can adapt to the complexity of the problem and create more informative samples.
Cons: More computationally expensive than SMOTE.

When to Use Oversampling:
- When the minority class is very underrepresented and you want to ensure the model learns to classify it effectively.
- When the risk of overfitting is low, or when you use techniques like SMOTE, which can mitigate the overfitting issue.
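For illustration, the sketch below applies SMOTE using the third-party imbalanced-learn package together with a synthetic scikit-learn dataset; both library choices are assumptions of this example, not requirements of the notes.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE          # requires imbalanced-learn
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly 95% class 0, 5% class 1 (illustrative)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates between nearby minority-class points to create
# synthetic samples until the two classes are balanced
smote = SMOTE(random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))
```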
2. Undersampling
Undersampling involves reducing the number of samples from the majority class to balance the class distribution. This is typically done by removing samples from the majority class so that the dataset becomes more balanced.

Techniques for Undersampling:
1. Random Undersampling:
Method: Randomly remove samples from the majority class until the class distribution is more balanced.
Pros: Simple and straightforward approach.
Cons: Loss of potentially useful data, which could lead to underfitting and reduced model performance.

2. Cluster-Based Undersampling:
Method: Instead of randomly selecting samples to remove, the majority class is first grouped into clusters, and then some clusters (or samples within them) are discarded. This reduces the loss of information by ensuring that the remaining majority-class samples are representative of the original data.
Pros: More efficient than random undersampling because it aims to maintain the diversity of the majority class.
Cons: More computationally expensive than random undersampling.

3. Tomek Links:
Method: Tomek links identify pairs of instances from different classes that are closest to each other. If one of these instances belongs to the majority class and lies very close to the minority class, it is removed. This helps clean the data by removing borderline examples.
Pros: Removes noisy, borderline examples, which may lead to better model performance.
Cons: May lead to the loss of valuable data if not applied carefully.

When to Use Undersampling:
- When the majority class is overwhelmingly large, and reducing its size still preserves the underlying patterns of the data.
- When computational efficiency is crucial (since smaller datasets are quicker to process).
- When dealing with heavily imbalanced data and a more conservative approach is needed.
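To close the section, here is a matching sketch of undersampling, again assuming the imbalanced-learn package: random undersampling balances the classes by dropping majority rows, and Tomek-link cleaning removes borderline majority points.

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from sklearn.datasets import make_classification

# Same kind of synthetic imbalanced dataset as in the oversampling sketch
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("Original:            ", Counter(y))

# Random undersampling: drop majority-class rows until the classes match
rus = RandomUnderSampler(random_state=0)
X_rus, y_rus = rus.fit_resample(X, y)
print("Random undersampling:", Counter(y_rus))

# Tomek links: remove majority-class points that sit right on the
# class boundary (their nearest neighbour belongs to the minority class)
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X, y)
print("Tomek links cleaning:", Counter(y_tl))
```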