Data Preprocessing in Machine Learning PDF
Summary
This document provides a comprehensive overview of data preprocessing techniques, particularly focusing on handling missing values utilizing Pandas and Scikit-Learn. It explores various imputation methods, including deleting rows/columns, arbitrary/mean/median imputation, and imputation strategies for categorical variables. The document also covers outlier detection and treatment, discussing techniques such as trimming, capping, imputation, and discretization.
Full Transcript
# Data Preprocessing in Machine Learning

## **Handling Missing/Null Values in Pandas**

### **Identifying Missing Values**

* Missing values are typically represented as *NaN*.
* Values such as empty strings "" or *numpy.inf* are not considered missing unless specified by setting *pandas.options.mode.use_inf_as_na = True* (an option that recent pandas releases deprecate).

### **Approach 01: Deleting Rows/Columns**

#### **Pros:**

* Simple and straightforward.
* Results in a robust model when trained without missing values.
* Useful when the rows or columns with missing values are not important.

#### **Cons:**

* Can lead to significant information loss.
* Ineffective if a large percentage of data is missing.
* Not suitable when data is not missing completely at random.

**Pandas Function:** *Use .dropna() to remove null values.*

### **Approach 02: Imputing Missing Values**

#### **Methods of Imputation:**

1. **Arbitrary Value:** Replace missing values with a fixed number (e.g. 0).
2. **Mean:** Use the average value (sensitive to outliers, so not suitable when they are present).
3. **Median:** Use the middle value (more robust to outliers).
4. **Forward Fill:** Use the previous value (common in time series).
5. **Backward Fill:** Use the next value.

**Pandas Function:** *Use .fillna() for imputation.*

#### **Pros and Cons of Imputation**

* **Pros:**
    * Easy to implement.
    * Prevents data loss from deletion.
    * Utilizes existing data for imputation.
* **Cons:**
    * Mean/median imputation is limited to numerical variables.
    * Risk of data leakage between training and testing sets if the imputation statistics are computed on the full dataset.

## **Handling Missing Values for Categorical Features**

#### **Imputation Methods:**

* **Mode:** Replace with the most frequent value.
* **Separate Category:** Treat missing values as a unique category if they are numerous.

#### **Pros:**

* Prevents data loss.
* Simple to implement.
* Adds a unique category to retain information.

#### **Cons:**

* Only applicable to categorical variables.
* May introduce new features that could degrade model performance.
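The deletion and imputation approaches above can be sketched in pandas roughly as follows. This is a minimal illustration, not code from the original notes: the toy DataFrame, its column names (`age`, `income`, `city`), and the placeholder label `"Missing"` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [50.0, 60.0, np.nan, 1000.0],  # 1000.0 plays the role of an outlier
    "city": ["Galle", None, "Galle", "Kandy"],
})

# Approach 01: drop every row that contains at least one missing value
dropped = df.dropna()

# Approach 02: impute numerical columns
age_mean = df["age"].fillna(df["age"].mean())                # mean imputation
income_median = df["income"].fillna(df["income"].median())   # median is less affected by the outlier
age_ffill = df["age"].ffill()                                # forward fill: reuse the previous value
age_bfill = df["age"].bfill()                                # backward fill: reuse the next value

# Categorical column: most frequent value, or a separate category
city_mode = df["city"].fillna(df["city"].mode()[0])
city_flag = df["city"].fillna("Missing")
```

In a modeling workflow, the mean, median, or mode would normally be computed on the training split only and then reused on the test split, which addresses the data-leakage risk listed among the cons.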
## **Advanced Imputation Techniques**

* **Interpolation:** Use values before and after the missing value, especially in time series data.
* **Methods Available in Pandas**
    * Polynomial
    * Linear
    * Quadratic

**Pandas Function:** *Use .interpolate() for interpolation.*

## **Prediction of Missing Values**

* **Iterative Imputer:** A multivariate approach that estimates each feature based on other features.

Together, these methods cover the common ways of handling missing values in a dataset while retaining as much information as possible for analysis or modeling.

## **Handling Missing/Null Values**

### **Imputation Strategies**

* **Round-Robin Imputation**
    * Each feature with missing values is modeled as a function of other features.
    * **Steps:**
        1. Designate a feature column as output *y*.
        2. Treat other feature columns as inputs *X*.
        3. Fit a regressor on (*X, y*) for known *y*.
        4. Predict missing values of *y* using the regressor.
        5. Repeat for each feature and for a maximum number of rounds.
* **KNN Imputer**
    * Uses k-Nearest Neighbors to impute missing values.
    * Each sample's missing values are filled using the mean of that feature over its *n* nearest neighbors.
    * Default metric: a Euclidean distance that supports missing values.

**Note:**

* Implemented in the Scikit-Learn library.
* Other approaches exist as well.

### **Univariate Approaches**

* Imputation using mean, mode, median, or constant values.
* Can be done using Scikit-Learn's *SimpleImputer* class.

#### **Pros and Cons of Prediction-Based Imputation**

* **Pros:**
    * Generally provides better results.
    * Considers the covariance between the column with missing values and the other feature columns.
* **Cons:**
    * Computationally expensive for large datasets.
    * Requires more effort than simpler methods.
    * Imputed values are only a proxy for the true values.
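The univariate, KNN, and round-robin strategies above can be compared side by side with Scikit-Learn's imputers. The sketch below is illustrative only: the small feature matrix is made up, and `n_neighbors=2`, `max_iter=10`, and `random_state=0` are arbitrary choices, not values from the notes.

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and has to be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Hypothetical feature matrix with one missing entry (values are made up)
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Univariate: fill the gap with the mean of its own column
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# KNN: average the feature over the 2 nearest rows, using a
# Euclidean distance that skips missing coordinates
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Round-robin: regress each incomplete feature on the other features,
# repeating for at most max_iter rounds
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```

Here the column mean ignores the relationship between the two features, while the KNN and iterative imputers use the first column to estimate the missing entry in the second.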
**Functions in Scikit-Learn**

* *sklearn.impute.IterativeImputer*
* *sklearn.impute.KNNImputer*

## **Data Preprocessing**

### **Treating Duplicate Records**

#### **Duplicate Records:**

* Common in real-world datasets.
* Add no value and can slow down processing.
* Recommended to drop duplicates before using ML models.

**Pandas Functions:**

* *Use .duplicated() to identify duplicates*
* *Use .drop_duplicates() to remove them*

### **Treating Outliers**

#### **Outliers:**

* Data points that significantly deviate from the rest.
* Can be much higher or lower than other points.
* Their presence can impact machine learning results.

#### **Detection and Treatment:**

* Important in machine learning projects.
* Not always necessary to remove outliers; depends on the problem.
* In some cases (e.g. anomaly detection), outliers are crucial.

### **Conclusion**

Proper handling of missing values, duplicates, and outliers is essential for effective machine learning model performance.

## **Treating Outliers in Machine Learning**

### **Impact of Outliers**

* Outliers can significantly affect the performance of machine learning models.
* The type of algorithm used determines how much outliers influence the model:
    * **Weight-based algorithms** (e.g. linear regression, logistic regression, AdaBoost, deep learning) are highly affected by outliers.
    * **Tree-based algorithms** (e.g. decision trees, random forests) are less affected by outliers.

### **Detecting Outliers**

1. **Boxplots:**
    * A visual tool to identify outliers in a dataset.
    * Displays the distribution of data and highlights outliers.
2. **Using Z-Scores:**
    * Based on the 68-95-99.7 rule for normal distributions:
        * 99.7% of data lies within 3 standard deviations from the mean.
    * A point's z-score is its distance from the mean measured in standard deviations. A point is considered an outlier if its z-score is:
        * Greater than 3 or less than -3.
    * This method is valid only for normally distributed data.

### **Treating Outliers**

1. **Trimming:**
    * Removing outliers from the dataset.
    * Effective for large datasets.
    * Caution: Excessive trimming can reduce data points and affect model accuracy.
2. **Capping (Winsorization):**
    * Setting upper and lower thresholds for outliers.
    * Values above the upper threshold are replaced with the upper threshold, and values below the lower threshold are replaced with the lower threshold.
    * Useful for small datasets where outliers cannot be removed.
3. **Imputation:**
    * Treating outliers as missing values.
    * Impute using mean or median.
    * Less common due to potential changes in data distribution.
4. **Discretization:**
    * Converting numerical features into bins.
    * Example: For an "Age" column, bins could be:
        * 45-55, 55-65, 65-75, >75.
        * Values above 75 are treated equally in the >75 bin.

## **Data Preprocessing: Feature Scaling**

* **Feature Scaling** is essential for standardizing independent features in a dataset.
* It helps manage varying magnitudes, values, or units.
* Without feature scaling, machine learning models may perform poorly due to the influence of features with larger ranges.

# **Machine Learning Study Notes**

## **Importing Data**

```python
import pandas as pd

# Replace PATH_TO_DATASET with the path to the CSV file
df = pd.read_csv('PATH_TO_DATASET')
```

## **Libraries for Data Preprocessing and EDA**

* **Matplotlib:** A plotting library for Python and NumPy.
* **Seaborn:** A data visualization library based on Matplotlib, providing a high-level interface for attractive statistical graphics.

## **Importing Necessary Libraries**

```python
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
```

## **Data Preprocessing: Train-Test Split**

* **Train-Test Split:** A method to assess how well a machine learning model generalizes to new data.

#### **Key Components**

1. **Training Set:**
    * Data used to fit the model.
    * The model learns patterns/features from this data.
2. **Validation Set:**
    * A sample held out from the training set, used to estimate model performance while tuning hyperparameters.
3. **Test Set:**
    * A subset of the dataset that is held out from training and used for an unbiased evaluation of the final model.

#### **Important Aspects**

* **Model Generalization:** Evaluating how well the model performs on unseen data.
* **Strength Assessment:** Estimating how robust the model is, especially with k-fold cross-validation.
* **Bias-Variance Trade-off Management:** Balancing model complexity and performance.

#### **Example Code for Train-Test Split**

```python
from sklearn.model_selection import train_test_split

# X holds the feature columns and y the target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```

#### **Split Ratio**

* Typically, 20% to 40% of the dataset is allocated for testing and validation, with the remainder for training.

## **Data Preprocessing: Handling Null/Missing Values**

* Many ML algorithms do not support missing values, so they must be addressed before model training.

#### **Methods for Identifying Missing Values**

* **.isnull():** Checks for missing values (NaN, None).
* **.notnull():** Checks for valid (non-missing) values.
* **.info():** Provides information about the DataFrame, including non-null counts and memory usage.
* **.isna():** Returns a boolean object indicating if values are NA.

#### **Example Usage**

```python
df.isnull()   # Returns a boolean DataFrame marking missing values
df.notnull()  # Returns a boolean DataFrame marking valid values
df.info()     # Displays DataFrame information
df.isna()     # Same result as isnull(); the two names are aliases
```

These notes summarize the key concepts and code snippets related to importing data, libraries for data preprocessing, and handling missing values in machine learning.

# **EE5253: Machine Learning - Lecture 06: Data Preprocessing & Feature Selection**

## **Data Pre-Processing: An Overview**

* What is Data Pre-Processing?
    * Data pre-processing involves transforming raw data into a clean dataset suitable for analysis.
    * It is essential because raw data is often not suitable as direct input to machine learning (ML) algorithms.
## **Requirement of Data Preprocessing in ML**

* Real-world datasets often contain:
    * Missing values
    * Inconsistent data
    * Noisy data
* Issues with unprocessed data:
    * Poor model performance due to inability to identify patterns.
    * Some ML models require data in specific formats (e.g. many Random Forest implementations do not accept null values).
* Examples of problems:
    * Duplicate or missing values can skew overall statistics.
    * Outliers can disrupt model learning, leading to inaccurate predictions.

## **Important Data Preprocessing Steps**

1. Importing Libraries
2. Importing Dataset
3. Train-Test Split
4. Handling Null/Missing Values
5. Treating Outliers and Duplicate Records
6. Feature Scaling
7. Handling Categorical Variables

## **Data Preprocessing: Important Libraries**

* **Common Libraries for Data Preprocessing & EDA:**
    * **Pandas:** For data manipulation and analysis.
    * **NumPy:** For handling large, multi-dimensional arrays and matrices.
    * **Matplotlib:** For creating static, animated, and interactive visualizations.
    * **Seaborn:** For statistical data visualization.
    * **Scikit-Learn:** For machine learning algorithms and tools.
    * **SciPy:** For scientific and technical computing.

## **Overview on Libraries for Data Preprocessing and EDA**

* **Pandas:**
    * Offers data structures and operations for manipulating numerical tables and time series.
* **NumPy:**
    * Adds support for large arrays and matrices, along with high-level mathematical functions.
* **SciPy:**
    * Contains modules for optimization, linear algebra, integration, and other scientific tasks.

## **Summary**

Data preprocessing is a crucial step in machine learning that ensures the data is clean and suitable for analysis. It involves various steps and utilizes several libraries to handle different aspects of data manipulation and analysis. Proper preprocessing can significantly enhance the performance of machine learning models.
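As a closing illustration, the seven preprocessing steps listed in these notes can be strung together in one short sketch. Everything specific here is an assumption made for the example: the toy DataFrame and its column names, the 5th/95th percentile capping thresholds, the choice of median and mode imputation, and dropping duplicates before the split rather than after it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Steps 1-2: libraries are imported above; a made-up DataFrame stands in for the dataset
df = pd.DataFrame({
    "age": [22.0, 25.0, np.nan, 31.0, 35.0, 38.0, 41.0, 41.0, 47.0, 95.0],
    "city": ["A", "B", "A", None, "B", "A", "B", "B", "A", "A"],
    "target": [0, 1, 0, 1, 0, 1, 0, 0, 1, 1],
})

# Step 5 (duplicates): drop exact duplicate rows; done first here for simplicity
df = df.drop_duplicates().reset_index(drop=True)

# Step 3: split before computing any statistics, to avoid data leakage
X = df[["age", "city"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Step 4: impute using statistics learned from the training split only
age_median = X_train["age"].median()
city_mode = X_train["city"].mode()[0]
for part in (X_train, X_test):
    part["age"] = part["age"].fillna(age_median)
    part["city"] = part["city"].fillna(city_mode)

# Step 5 (outliers): cap "age" at the training 5th/95th percentiles (capping)
low, high = X_train["age"].quantile([0.05, 0.95])
for part in (X_train, X_test):
    part["age"] = part["age"].clip(lower=low, upper=high)

# Step 6: fit the scaler on the training split, then apply it to both splits
scaler = StandardScaler()
X_train["age"] = scaler.fit_transform(X_train[["age"]]).ravel()
X_test["age"] = scaler.transform(X_test[["age"]]).ravel()

# Step 7: one-hot encode the categorical column; align test columns to train
X_train = pd.get_dummies(X_train, columns=["city"])
X_test = pd.get_dummies(X_test, columns=["city"]).reindex(
    columns=X_train.columns, fill_value=0
)
```

The main design choice in this sketch is that every statistic (median, mode, percentiles, scaling parameters) is learned from the training split and only applied to the test split, so no information from the test data influences preprocessing.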