Data Preprocessing Techniques

Questions and Answers

What is the primary goal of data preprocessing in machine learning?

  • To reduce the size of the dataset for faster processing.
  • To select the most relevant features for model training.
  • To visualize data distributions using advanced plotting techniques.
  • To improve model performance by cleaning, transforming, and organizing raw data. (correct)

Which of the following is NOT a typical step in data preprocessing?

  • Encoding categorical variables.
  • Model deployment. (correct)
  • Feature scaling and normalization.
  • Handling missing data.

Why is handling missing data an important step in data preprocessing?

  • To reduce the computational complexity of the dataset.
  • To prevent machine learning models from generating biased or incorrect predictions. (correct)
  • To ensure all data points are visually appealing when plotted.
  • To minimize the storage space required for the dataset.

Which library is NOT commonly used for data preprocessing tasks in Python?

  • TensorFlow (correct)

In the context of preprocessing, what distinguishes Exploratory Data Analysis (EDA) from Feature Engineering?

  • EDA uncovers patterns, while feature engineering creates meaningful new features. (correct)

Which of the following methods is suitable for identifying missing values in a dataset?

  • isnull().sum() (correct)

When is it most appropriate to use mean imputation for missing data?

  • When the data is normally distributed and missing at random. (correct)

What is the main drawback of using deletion methods to handle missing data?

  • It can result in a significant loss of information, especially with large amounts of missing data. (correct)

Which imputation technique is most suitable for filling missing values in time series data when you want to consider the temporal order?

  • Forward-fill or backward-fill (correct)

What is the primary purpose of handling outliers in a dataset?

  • To prevent extreme values from distorting model training and results. (correct)

Which outlier detection method is based on identifying data points that fall outside the interquartile range (IQR)?

  • Box plots (correct)

For what type of data is the Z-score method most effective for outlier detection?

  • Normally distributed data (correct)

Which approach is most appropriate when valid extreme values should not be removed from a dataset?

  • Capping (Winsorization) (correct)

What is the primary goal of encoding categorical data?

  • To convert categorical values into a numerical format that machine learning models can understand. (correct)

When should label encoding be used?

  • For ordinal data with a meaningful order. (correct)

One-Hot Encoding is best suited for what type of categorical data?

  • Nominal data. (correct)

Which encoding technique is most memory-efficient when dealing with a high number of categories?

  • Binary Encoding (correct)

What is the purpose of feature scaling and normalization?

  • To ensure all features contribute equally to the model by bringing them to a similar scale. (correct)

Min-Max Scaling transforms values to what range?

  • 0 to 1 (correct)

When is standardization (Z-score normalization) most appropriate?

  • When the data has a normal distribution. (correct)

Which scaling technique is particularly useful when dealing with outliers?

  • Robust Scaling (correct)

What is the goal of data transformation?

  • To apply mathematical functions to improve feature relationships. (correct)

For what type of data is a log transformation most effective?

  • Highly skewed data (correct)

When should you use a square root or cube root transformation instead of a log transformation?

  • When a log transformation is too strong (correct)

Reciprocal transformation is best suited for data that is

  • Extremely skewed (correct)

What is the purpose of handling skewness in data?

  • To make the data symmetrical and approximate a normal distribution (correct)

What does it indicate if the skewness of a dataset is greater than 1?

  • Data is right-skewed and transformation is recommended. (correct)

Why is feature extraction from dates important in data preprocessing?

  • To extract useful patterns that may be hidden in date and time components. (correct)

What kind of information can typically be extracted from datetime features?

  • Year, month, day of week (correct)

When is it beneficial to convert weekends into a binary feature (0/1)?

  • When weekends impact business activity. (correct)

What is the purpose of calculating time differences between events?

  • To track durations, such as processing times or machine downtime. (correct)

What is the primary use of rolling averages and moving averages in handling time-series data?

  • Smoothing short-term fluctuations to make trends in the data easier to see. (correct)

What are lag features used for in time series analysis?

  • To predict future values based on past values. (correct)

What is the purpose of differencing in time series data?

  • To remove trends and stabilize the data (correct)

What is the purpose of removing punctuation, special characters, and stopwords when handling text data?

  • To reduce the complexity of text. (correct)

What is tokenization in the context of text preprocessing?

  • Splitting text into words (correct)

What is the difference between stemming and lemmatization?

  • Stemming crudely cuts words to a root form, while lemmatization produces a more meaningful base form. (correct)

Which of the following is NOT a method for converting text into numerical representation?

  • Data Deletion (correct)

What does TF-IDF (Term Frequency-Inverse Document Frequency) do?

  • Gives importance to unique words in a document: common words get lower scores and rare words get higher scores. (correct)

What is the key characteristic of the Bag-of-Words (BoW) model?

  • It ignores word meaning and order. (correct)

What is the overall goal of handling imbalanced data?

  • Balancing the dataset so the model learns to recognize both classes fairly. (correct)

What do oversampling techniques like SMOTE and ADASYN do?

  • Create more samples of the smaller class to balance the dataset. (correct)

What is the difference between oversampling and undersampling?

  • Oversampling increases the minority class, undersampling reduces the majority class. (correct)

What is the main advantage of using hybrid methods (combination of over and under sampling) for handling imbalanced data?

  • They can balance a dataset without overfitting or losing key data. (correct)

Flashcards

What is Data Preprocessing?

Cleaning, transforming, and organizing raw data to improve model performance.

Why is Preprocessing Important?

Data in the real world is often incomplete, incorrect, inconsistent, or imbalanced. Preprocessing ensures the model understands the data better and performs accurately.

Goal of Handling Missing Data

Detect and handle missing values, outliers, and inconsistencies to ensure data quality.

What are Deletion Methods for missing data?

Rows/columns with too many missing values are removed.

What are Imputation Techniques for missing data?

Missing values are replaced with estimated values (Mean, Median, Mode).

Goal of Handling Outliers

Identify and handle extreme values distorting model accuracy.

What are outlier detection methods?

Box plots (IQR), Z-scores, Tukey's fences, and Isolation Forest are used to spot extreme values.

Detecting outliers using Box Plots.

Box Plots use IQR to find outliers.

How does transforming values handle outliers?

Applying a transformation such as the logarithm compresses extreme values and reduces their influence.

Goal of Encoding Categorical Data

Converting categorical values into a numerical format for machine learning models.

What is Nominal Data?

No ranking or order exists in these categories.

What is Ordinal Data?

Categories have a meaningful order, but differences are not measurable.

What is Label Encoding?

Assign numbers to categories based on order.

What is One-Hot Encoding?

Create separate columns for each category, using 0 or 1 to indicate presence.

Goal of Feature Scaling & Normalization

Transform numerical features to a consistent scale for better model performance.

What is Min-Max Scaling?

Scales between 0 and 1.

What is Standardization?

Mean is set to 0, and Standard Deviation is set to 1.

Goal of Data Transformation

Apply mathematical transforms to improve feature relationships.

What is Logarithmic Transformation?

Converts large values into smaller ones, keeping relative differences.

What does skewness tell us?

Skewness tells us if data leans right or left.

Goal of Handling Date & Time Data

Extract useful information from datetime features.

Goal of Handling Text Data

Prepare textual data for machine learning models.

Why removing punctuation, special characters & stopwords in text data?

These elements don't add meaning in most cases.

What is Tokenization?

Splitting text into individual words (tokens) for analysis.

What is Stemming?

Cutting words down to their root form, e.g. "running" becomes "run".

What is Lemmatization?

Converting words to their meaningful base form (lemma), respecting grammar.

What is TF-IDF?

Gives importance to unique words in a document.

Goal of Handling Imbalanced Data

Balance datasets where one class dominates the other.

OverSampling method

Oversampling increases minority class data.

UnderSampling method

Undersampling reduces the majority class data.

Study Notes

  • Data preprocessing is a critical step in data science and machine learning
  • It involves cleaning, transforming, and organizing raw data
  • This improves the model's performance

Introduction to Data Preprocessing

  • Goal is to understand the importance and role of preprocessing in machine learning
  • Crucial to learn what data preprocessing is, why it's important, and the steps involved
  • Knowledge on the differences between Data Preprocessing, EDA, and Feature Engineering is needed
  • Familiarity with tools and libraries like Pandas, NumPy, Sklearn, and OpenCV (for image data) is required
  • Before training a model to predict house prices, missing values are cleaned, prices are normalized, and categorical data is encoded

Handling Missing Data

  • The goal is to learn techniques to detect and handle missing values effectively
  • Learn to identify missing data by using .isnull().sum(), .info(), and the Missingno library
  • Visualize missing data with heatmaps
  • Two main handling strategies are available: deletion and imputation
  • Deletion Methods involve removing rows or columns with many missing values
  • Imputation Techniques include Mean, Median, Mode imputation
  • Other imputation methods are Forward-fill / Backward-fill, K-Nearest Neighbors (KNN) Imputation, and Regression-based Imputation
  • If income data is missing for 5% of users, replacing it with the median income of similar users prevents data loss
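
The detection and imputation steps above can be sketched with pandas; the dataset here is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical user data with two missing incomes
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [50_000, np.nan, 72_000, np.nan, 61_000],
})

# Identify missing values per column
print(df.isnull().sum())

# Median imputation: robust to extreme values and avoids data loss
df["income"] = df["income"].fillna(df["income"].median())
```

Deletion would instead use `df.dropna()`, at the cost of losing the two affected rows.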

Handling Outliers

  • Focuses on detecting and managing extreme values that can distort model accuracy
  • Several outlier detection methods exist
  • Methods include Box Plots (Interquartile Range (IQR) Method)
  • Z-Score Method (Values beyond ±3 standard deviations) can be used
  • Tukey's Fences, Mahalanobis Distance, Isolation Forest & DBSCAN are also used
  • Handling outliers involves removing outliers, transforming values (log, square root), and capping (Winsorization)
  • If 99% of house prices are under $500K, but some homes are priced at $50M, they might be outliers affecting predictions
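
A minimal sketch of the IQR method and capping, using invented house prices with one extreme value:

```python
import numpy as np

prices = np.array([200_000, 230_000, 240_000, 250_000,
                   260_000, 280_000, 50_000_000])

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = prices[(prices < lower) | (prices > upper)]

# Capping (Winsorization) clips values instead of deleting rows,
# which keeps valid but extreme observations in the dataset
capped = np.clip(prices, lower, upper)
```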

Encoding Categorical Data

  • Conversion transforms categorical values into numerical format for machine learning models
  • Crucial to understand types of categorical data: Nominal (No Order) vs. Ordinal (Ordered)
  • Encoding techniques include Label Encoding (for Ordinal Data)
  • Other encoding methods are One-Hot Encoding (for Nominal Data), Binary Encoding (Memory-efficient alternative to One-Hot Encoding)
  • Techniques such as Frequency Encoding, Target Encoding (Mean Encoding, Leave-One-Out Encoding), and Hash Encoding (for High Cardinality Categories) exist
  • Converting "Low, Medium, High" salary levels into 0, 1, 2 represents Ordinal Encoding.
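
A sketch of both encodings with pandas; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "salary_level": ["Low", "High", "Medium", "Low"],  # ordinal
    "color": ["Red", "Blue", "Red", "Green"],          # nominal
})

# Label encoding for ordinal data: the mapping preserves the order
order = {"Low": 0, "Medium": 1, "High": 2}
df["salary_level_enc"] = df["salary_level"].map(order)

# One-hot encoding for nominal data: one 0/1 column per category
df = pd.get_dummies(df, columns=["color"])
```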

Feature Scaling & Normalization

  • The objective is to transform numerical features to a consistent scale for better model performance
  • Scaling's importance lies in ensuring all features contribute equally
  • Different scaling techniques are available
  • Min-Max Scaling (Normalization) scales values between 0 and 1
  • Another technique is Standardization (Z-score Normalization) where the Mean equals 0, and Std equals 1
  • Robust Scaling, which uses median and IQR, is useful for outliers
  • Power Transformations such as Box-Cox & Yeo-Johnson exist
  • If age is in the range 1-100, and salary is in the range 10,000-500,000, salary would dominate the model unless scaled properly
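
Both rescalings reduce to one-line formulas; scikit-learn's MinMaxScaler and StandardScaler implement the same arithmetic. A NumPy sketch with invented salaries:

```python
import numpy as np

salary = np.array([10_000.0, 250_000.0, 500_000.0])

# Min-Max scaling: maps the feature onto [0, 1]
minmax = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization (Z-score): mean 0, standard deviation 1
zscore = (salary - salary.mean()) / salary.std()
```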

Data Transformation

  • The goal is to apply mathematical transformations to improve feature relationships
  • Mathematical transformations include Logarithmic Transformation for handling skewed data
  • Square Root & Cube Root Transformations and Reciprocal Transformation also exist
  • Focuses on handling skewness in data
  • Detecting skewed distributions uses the skew() function in Pandas
  • Normalizing distributions uses transformations
  • If income follows a right-skewed distribution, applying a log transformation can make it more normally distributed
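
A quick way to see the effect is to compare skewness before and after the transform; the income values below are synthetic, generated to be right-skewed:

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean cubed deviation in standard-deviation units."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# Synthetic right-skewed incomes (exponentially growing values)
income = 30_000 * np.exp(np.linspace(0, 4, 50))

# log1p computes log(1 + x), which is also safe for values near zero
log_income = np.log1p(income)
```

Here `skewness(income)` comes out clearly positive, while `skewness(log_income)` is close to 0, i.e. roughly symmetric.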

Handling Date & Time Data

  • The goal is to extract useful information from datetime features
  • Feature extraction from dates includes Year, Month, Day, Hour, Day of Week
  • Extraction also includes whether the date falls on a weekend (Binary Feature) and Time Difference Calculation
  • Focuses on handling time-series data
  • Smoothing techniques include Rolling Averages & Moving Averages, plus Lag Features
  • Techniques such as Differencing (for stationarity) are considered
  • Converting "2023-07-01" into separate columns such as Year (2023), Month (7), Day (1), Weekday (Saturday) is an example
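
The "2023-07-01" example above, sketched with pandas:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2023-07-01", "2023-07-03"]})
df["date"] = pd.to_datetime(df["date"])

# Pull calendar components out as separate model-ready features
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.day_name()

# Binary weekend flag (dayofweek: Monday=0 ... Sunday=6)
df["is_weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)
```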

Handling Text Data (Basic Preprocessing for NLP)

  • Focuses on preparing textual data for machine learning models
  • Removing Punctuation, Special Characters, Stopwords is key
  • Tokenization (Splitting text into words) and Stemming & Lemmatization (Reducing words to root forms) are also key
  • Converting Text into Numerical Features involves TF-IDF (Term Frequency-Inverse Document Frequency)
  • The Bag-of-Words Model (BoW) and Word Embeddings (Word2Vec, GloVe, BERT) are also used
  • Converting the sentence "I love data science!" into a numerical representation for sentiment analysis is an example
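
The "I love data science!" example can be sketched without any NLP library; the stopword list below is a tiny stand-in for real ones such as NLTK's:

```python
import re
from collections import Counter

STOPWORDS = {"i", "a", "the", "is", "of", "and"}  # tiny illustrative list

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

tokens = preprocess("I love data science!")

# Bag-of-Words: raw word counts, ignoring order and meaning
bow = Counter(preprocess("data science is the science of data"))
```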

Handling Imbalanced Data

  • Focuses on balancing datasets where one class dominates the other
  • It is important to understand Class Imbalance
  • Resampling Methods include Oversampling (SMOTE, ADASYN)
  • Resampling also includes Undersampling, plus Hybrid Methods that combine over- and under-sampling
  • A fraud detection dataset with only 1% fraudulent transactions may require SMOTE to balance the classes
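
SMOTE and ADASYN are provided by the imbalanced-learn library; the underlying idea can be illustrated with plain random oversampling (the transactions below are fabricated):

```python
import random

random.seed(0)

# Toy imbalanced dataset: 1 fraud case among 10 transactions
data = [(f"txn{i}", 0) for i in range(9)] + [("txn9", 1)]

minority = [row for row in data if row[1] == 1]
majority = [row for row in data if row[1] == 0]

# Duplicate minority rows at random until the classes are balanced
oversampled = majority + [random.choice(minority) for _ in range(len(majority))]

labels = [label for _, label in oversampled]
```

SMOTE goes further: instead of duplicating rows, it interpolates new synthetic minority points between existing ones, which reduces overfitting to exact duplicates.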

Introduction to Data Preprocessing

  • Data preprocessing is preparing ingredients before cooking
  • Raw data is messy and needs cleaning before models can be trained effectively
  • Definition: Data preprocessing cleans, transforms, and organizes raw data for machine learning models
  • Real-life Analogy: Cleaning, peeling, and extracting juice from an orange is similar to removing noise, missing values, and duplicates in data
  • Data in the real world is often incomplete (missing values), incorrect (errors or outliers), inconsistent (wrong format, different scales), too large or too small (imbalanced)
  • Bad data leads to poor predictions, preprocessing improves model understanding and accuracy

Steps in Data Preprocessing

  • Handling Missing Data involves filling missing values with average or mode, or removing rows/columns with missing values
  • Real-life example: ignoring guests whose phone numbers are missing, or asking other guests for the missing numbers
  • Handling Outliers involves removing extreme values to prevent misleading results, or adjusting them
  • Real-life example: Correcting the height of a student to remove any mistaken values
  • Encoding Categorical Data involves converting text into numerical values that machine learning models understand
  • Real-life example: Assigning numbers to genres instead of text
  • Feature Scaling adjusts all values to a similar range
  • Real-life example: Converting all distance measurements to meters for a 100m race
  • There are differences between data preprocessing and EDA
  • Preprocessing cleans and prepares data
  • EDA understands patterns and insights
  • Feature engineering creates new meaningful features
  • There are key tools and libraries for data preprocessing
  • Pandas is used for handling missing values and transforming data
  • NumPy is for numerical operations, arrays, and matrix operations
  • Scikit-learn is used for scaling, encoding, and imputation
  • OpenCV is used for processing images
  • Example:
    • House sizes are scaled and prices are normalized before training

Handling Missing Data – Making Your Data Complete!

  • Missing data is like missing ingredients in a recipe
  • It can confuse the algorithm and cause poor performance
  • How to identify missing values:
    • Use .isnull().sum() to count missing values in each column
    • Use .info() to see which columns contain null values
    • Use the Missingno library to visualize missing data with heatmaps
  • The next step is to choose a handling strategy
  • Deletion Methods
  • Drop rows or columns if there are too many missing values
  • Imputation
  • Mean, Median, or Mode imputation works best for numerical data without extreme values
  • Forward-fill and backward-fill copy values from neighboring rows
  • The KNN imputer uses the nearest neighbors to estimate values
  • With regression-based imputation, a model predicts the missing values
  • Choosing the right method for the data is key

Outlier Detection Methods – Finding the Odd Ones Out

  • Outliers are values that are unusually low or high

  • Methods to find them include:

    • The IQR method uses a simple box plot to show the distribution
    • Z-scores express values in standard-deviation units
    • Tukey's Fences works similarly to the IQR method
    • Mahalanobis Distance measures how far a point lies from the center of the dataset
    • Isolation Forest isolates anomalies by building random trees
  • Deleting outliers removes them entirely

    • Transforming values (e.g. with a logarithm) reduces the effect of outliers
  • Capping (Winsorization) sets upper and lower limits

  • Nominal data: categories have no ranking or order

  • Shirt colors are nominal, as one color isn't greater than another

  • Ordinal data: categories have a meaningful order, but the differences are not measurable

  • Hotel star ratings: a 5-star rating does not mean 5x better than the others

  • Label Encoding: assigns numbers to categories based on order

  • Beginner -> Yellow -> Black (belt levels)

  • One-Hot Encoding: creates separate columns for each category

  • Apple and Banana are not greater than each other

  • Binary Encoding: used when a column has many categories

  • Frequency Encoding: counts how many times a category appears

  • Mean Encoding, Leave-One-Out Encoding, and Hash Encoding are target-based and hashing methods

  • The best method depends on the data

  • Scaling brings the data to the same level

    • It matters for algorithms like regression, KNN, SVM, and neural networks
    • Without scaling:
      • Larger features drown out the smaller ones
      • Models take longer to converge
    • Min-Max scales values to the range 0 to 1
    • Standardization reshapes values to mean 0 and standard deviation 1
    • Robust Scaling is for data with outliers
      • It uses the median and IQR instead of the mean
    • Power transforms (Box-Cox, Yeo-Johnson) reshape distributions toward normality
  • Logarithmic transforms reduce large values but keep relative differences

  • This is best for right-skewed data

  • Square root and cube root transforms reduce values less aggressively

  • They work better when a log transformation is too strong

  • Reciprocal transforms (1/x) flip values and suit extremely skewed data

  • Skewness tells us if data is symmetrical or leaning to one side

  • Right-skewed data has a positive skew

  • Left-skewed data has a negative skew, meaning the opposite

  • Transformations reshape the dataset:

  • Log transforms help skewed datasets

  • Square and cube root transforms are gentler alternatives

  • Feature extraction is used to turn dates and times into useful numbers

  • Timestamps can be broken into months, hours, and days of the week

  • This can be used to predict trends, like high sales in certain months or on certain days of the week

  • Time difference calculations find the duration between events

  • Rolling Averages & Moving Averages help reveal trends

    • They can be used to smooth out short-term fluctuations
  • Lag features shift data for predictions: past values are used to predict future ones

  • Differencing helps stabilize data by removing trends

  • Text data needs to be cleaned before training

  • Remove punctuation, symbols, and stopwords

    • These do not carry meaningful information
  • Tokenization is splitting text into words

    • This is usually done at the word level
  • Stemming reduces words to their root; lemmatization converts words to more meaningful base forms

  • You must convert text into numerical features

    • TF-IDF highlights unique words in a document: common words get lower scores, rare words get higher scores
    • The Bag-of-Words model counts how many times each word appears in a document
    • Such a model cannot differentiate sentence meaning
    • Word embeddings use deep learning to capture meaning
      • Word2Vec
  • Imbalanced data has one class that heavily outweighs the others

  • Most real-world datasets are imbalanced

  • Oversampling increases the minority class

  • Undersampling reduces the majority class

  • Hybrid methods combine both

    • SMOTE creates synthetic data; ADASYN performs adaptive synthetic sampling
  • Cleaning data reduces missing values

  • Feature engineering transforms values into better representations

  • Data needs to be checked at each step

  • Data should be cleaned before a model is run.
