ML Data Preprocessing

Questions and Answers

Which of the following is a primary goal of data preprocessing in machine learning?

  • To ensure data is stored in a specific database format.
  • To make the data suitable for machine learning algorithms. (correct)
  • To create more complex models.
  • To reduce the size of the dataset for faster processing.

Data collection is typically the final step in the machine learning framework.

False (B)

Which of the following methods is used to handle missing data by estimating values based on other values?

  • Removing data points.
  • Ignoring the missing values.
  • Imputation using k-nearest neighbors. (correct)
  • Analyzing MCAR mechanisms.

What does 'data leakage' refer to in the context of data splitting?

The unintentional use of information from outside the training dataset to build a model. (A)

Scaling all features to a specific range is called standardization.

False (B)

What is the consequence of imbalanced data in machine learning?

The resulting model may be biased toward the majority class or value. (D)

The technique of creating synthetic samples from the minority class to address imbalanced data is known as ______.

SMOTE

What is one-hot encoding used for?

Encoding categorical data with no meaningful order. (B)

Explain the difference between the terms 'population' and 'sample' in the context of data collection.

A population is the entire group of items or events of interest, whereas a sample is a subset of the population chosen to represent it.

Match the data transformation techniques with their descriptions:

  • Normalization = Scales data to a specific range, e.g., [0, 1].
  • Standardization = Scales data to have a mean of 0 and a standard deviation of 1.
  • Tokenization = Splits text into smaller units like words or subwords.
  • Vectorization = Converts tokens into numerical representations.

Which data cleaning step involves making all data conform to a consistent format?

Fixing data formats (A)

Removing redundant duplicates is not part of data cleaning, as duplicates can give more weight to certain data points.

False (B)

What is the primary purpose of 'imputing' missing data?

To estimate and fill in missing values with plausible data. (D)

What is a potential drawback of using sample mean/median/mode for imputing missing data?

It can reduce variance and introduce bias if the data is not missing at random, since it does not account for relationships between variables.

Match outlier detection methods with their descriptions:

  • Interquartile Range (IQR) = Defines outliers as values outside a certain range based on the IQR.
  • Z-score = Measures how many standard deviations a data point is from the mean.
  • Mahalanobis distance = Measures the distance between a point and a distribution, accounting for covariance.

Winsorization is a technique used for outlier detection.

False (B)

Why should one be cautious when dealing with outliers?

A sample outlier is not necessarily a population outlier. (D)

If a classification dataset has significantly more data points for one class than another, this is referred to as ______ data.

imbalanced

Which of the following is a technique to address imbalanced data by adjusting the weights assigned to different classes?

Class weighting (B)

Undersampling involves duplicating the minority class to balance the dataset.

False (B)

In the context of imbalanced data, what is stratified resampling?

Ensuring each fold in cross-validation contains the same percentage of samples for each target class. (C)

How does oversampling address the issue of imbalanced data, and what is a potential drawback?

Oversampling balances the dataset by increasing the number of minority class samples. A drawback is that it can lead to overfitting by replicating existing samples.

What is the primary goal of data splitting?

To evaluate model performance on unseen data. (B)

In k-fold cross-validation, each data point is used for training exactly once.

False (B)

What is the purpose of using group k-fold cross-validation?

To ensure that data from the same group is not present in both training and test sets. (D)

The error from erroneous assumptions in the learning algorithm is called ______.

bias

Which of the following describes variance in the context of machine learning?

An error from sensitivity to small fluctuations in the training set. (A)

Tokenization is the process of converting numerical data into text.

False (B)

What is the purpose of vectorization?

To convert tokens into numerical representations. (A)

Why is automation and pipelining important in data preprocessing?

It ensures consistency, improves efficiency, and enhances the reproducibility of the preprocessing steps, making the entire process scalable.

Flashcards

  • Data Preprocessing: Transforming raw data into a clean, usable format suitable for machine learning models.
  • Collecting Data: Accessing and evaluating relevant data sources.
  • Cleaning Data: Verifying data quality and fixing inconsistencies or errors.
  • Fixing Data Formats: Ensuring that data is uniform and consistent in format.
  • Removing Redundant Duplicates: Removing duplicate data entries.
  • Missing Data Mechanism: Understanding why data is missing in order to handle it appropriately.
  • Outliers: Data points that deviate significantly from the majority.
  • Sample: A subset representing the whole population, used for analysis.
  • Dealing with Imbalanced Data: Techniques that adjust data to correct class imbalances.
  • Class Weighting: Assigning weights to samples to balance class representation.
  • Undersampling: Reducing the number of samples in the majority class.
  • Oversampling: Increasing the number of samples in the minority class.
  • SMOTE: A technique that creates synthetic samples to balance a dataset.
  • Bias: Error due to oversimplified model assumptions.
  • Variance: Error from high sensitivity to small data fluctuations.
  • Data Leakage: Using information from outside the training set to build the model.
  • Data Splitting: Separating data into training, validation, and test sets.
  • Training Set: The data split used for fitting model parameters.
  • Validation Set: The data split used for model selection and tuning.
  • Test Set: The data split used for evaluating model performance.
  • K-fold Cross-Validation: Splitting data into k equal parts, each used in turn for validation while the rest are used for training.
  • Group K-fold Cross-Validation: Ensuring data from the same group (e.g., the same patient) is not used in both training and testing.
  • Stratified K-fold Cross-Validation: Ensuring similar proportions of target classes in each fold.
  • Data Transformation: Making data appropriate for machine learning.
  • Normalization: Scaling values to fit within a specified range.
  • Standardization: Scaling to zero mean and unit variance.
  • Encoding Categorical Data: Converting categorical data to numerical format.
  • Tokenization: Splitting text into tokens.
  • Vectorization: Converting tokens into numerical vectors.
  • Automation and Pipelining: Ensuring consistent steps and reducing manual intervention.

Study Notes

Data Preprocessing in ML Framework

  • Data preprocessing and cleaning is the second step in the ML framework
  • The other steps are data collection, selecting the algorithm, training the model, evaluating performance, tuning and optimizing the model, and deploying the model.
  • Data is used for training, validation, and testing
  • Training involves fitting model parameters
  • Validation involves model selection and parameter tuning
  • Testing involves performance evaluation

Data Preprocessing Outline

  • Data preprocessing includes collecting, cleaning, handling imbalance and splitting data
  • It also includes transformation, automation, and pipelining
  • Collecting data involves accessing and evaluating data sources
  • Cleaning ensures data quality
  • Handling imbalance avoids class bias
  • Splitting avoids data leakage
  • Transformation makes data suitable for ML
  • Automating and pipelining ensures reproducibility and improves efficiency

Collecting Data

  • Public sources are supported by national/international organizations and have available metadata
  • Research datasets include paper supplements and code/data repositories, but with varying scope and quality
  • In-house data is relatively small with readily available and related expertise, but is proprietary
  • Access modes include tables (csv, ods, xlsx) using Python Pandas, APIs (csv, json, xml) using Python Requests and BioPython, and web scraping
  • Web scraping of web page content (html, json) uses Python BeautifulSoup and Scrapy
  • Limitations include licensing and ethical concerns such as privacy, confidentiality, informed consent, obscured provenance, and ethically compromised origins
  • Permissive licenses allow commercial use, such as MIT, BSD-2/3-Clause, Apache 2.0
  • Copyleft (viral) licenses require derivative works to adopt the same license [GPL, AGPL, CC BY-SA]
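
As a quick illustration of the access modes above, here is a minimal Python sketch using Pandas and Requests; the file name and URL are placeholders, not real sources.

```python
import pandas as pd
import requests

# Tables (csv): load a local file into a DataFrame with Pandas
df = pd.read_csv("measurements.csv")  # placeholder file name

# APIs (json): fetch records with Requests
resp = requests.get("https://example.org/api/records", timeout=10)
resp.raise_for_status()               # fail loudly on HTTP errors
records = resp.json()                 # parsed JSON payload
```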

Data Cleaning

  • Data cleaning ensures data quality, involves fixing data formats, removing redundant duplicates, handling missing data, and handling outliers
  • Data is made to conform to a consistent format.
  • Incoherent data formats indicate corrupted files or data inconsistencies
  • Exact duplicates are identical entries repeated in the dataset.
  • Near duplicates are slight variations which can happen due to format discrepancies, alternative representations or typographical errors
  • It is important to note that not all duplicates are redundant
  • Patient data from two sources should be merged, not deleted.
  • Patient data at different time points may require proper structuring
  • Duplicates may result from oversampling small datasets
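
A minimal Pandas sketch of the first two steps (fixing formats, removing exact duplicates); the toy table is hypothetical.

```python
import pandas as pd

# Hypothetical table with inconsistent formatting and a duplicate row
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "sex": ["F ", "f", "M", "m "],
    "weight_kg": [70.0, 70.0, 82.5, 64.2],
})

# Fix data formats: normalize case and strip stray whitespace
df["sex"] = df["sex"].str.strip().str.upper()

# Remove exact duplicates (after normalization, the first two rows match)
df = df.drop_duplicates()
```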

Handling Missing Data

  • It is important to understand the mechanism for why the data is missing:
  • Missing Completely At Random (MCAR) conveys no information; e.g., a data entry mistake by a surveyor
  • Missing At Random (MAR) depends on other recorded features
  • Missing Not At Random (MNAR) conveys information; e.g., people are more likely to forget things that occurred far in the past, or do not disclose embarrassing information
  • Some methods can handle missing values (empty/None) directly and are safe with MCAR
  • One can remove data points with any (or relevant) missing values, which works well for small amounts of missing data under MCAR
  • Imputing the data (see the sketch after this list):
    • Sample mean/median/mode: may reduce variance and may introduce bias if data is not missing at random
    • Estimate from k-nearest neighbors (found based on other values): computationally expensive and prone to overfitting
    • Forward/backward/interpolation fill for time series: may introduce bias if data is not missing at random
    • Machine learning prediction: may be complex, computationally expensive, and prone to overfitting
  • Compare models trained on data with imputation to models trained on data without imputation
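
A minimal sketch of two of the imputation options above using scikit-learn; the toy array is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Sample-mean imputation: cheap, but may reduce variance / introduce bias
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-NN imputation: estimates from similar rows, more expensive
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```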

Handling Outliers

  • An outlier is a data point that significantly deviates from the majority of observations in a dataset
  • Outliers can be due to measurement errors, data processing errors, fraud or natural variability
  • A sample outlier is not necessarily a population outlier

Outlier Detection

  • Outlier detection is a statistical process
  • Interquartile Range (IQR): IQR = Q3 − Q1
  • Tukey's fences: a mild outlier lies outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]
  • An extreme outlier lies outside [Q1 − 3·IQR, Q3 + 3·IQR]
  • Standard score (Z-score): Z = (x − μ) / σ
  • where x is the observation, μ the sample mean, and σ the sample standard deviation
  • Z-score interpretation:
    • mild outlier: |Z| > 2
    • extreme outlier: |Z| > 3
    • caution: assumes normal distribution.
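
Both detection rules translate directly into NumPy; a minimal sketch on made-up data:

```python
import numpy as np

x = np.array([10.0, 11.0, 12.0, 11.5, 10.8, 45.0])  # 45.0 looks suspicious

# Tukey's fences from the IQR
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mild = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
extreme = (x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)

# Z-score (remember: assumes an approximately normal distribution)
z = (x - x.mean()) / x.std()
mild_z = np.abs(z) > 2
```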

Dealing with Outliers

  • Outlier detection
    • 1d statistical methods (IQR, Z-score) are used
    • 2d & beyond: ML methods, Mahalanobis distance
  • Techniques for dealing with outliers (see the sketch below):
    • Remove outliers if they are due to errors or are irrelevant
    • Reduce the impact of outliers by capping them:
      • p% winsorization: e.g., for p = 5, the bottom and top 5% of values are capped at the 5th and 95th percentile values
    • Transform the data, e.g., a log transformation (especially for right-skewed distributions)
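
A minimal sketch of capping and transforming, using SciPy's winsorize and a log transform; the data is hypothetical.

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # right-skewed toy data

# 5% winsorization: cap the bottom and top 5% of values
x_capped = np.asarray(winsorize(x, limits=(0.05, 0.05)))

# Log transform compresses the long right tail (log1p handles zeros)
x_log = np.log1p(x)
```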

Imbalanced Data

  • One class or value range has significantly more data points than another in the training set
  • The model may be biased toward the majority class or value; e.g., if the training set includes 900 images of sour cherries and 100 images of sweet cherries, a model that always predicts "sour cherry" is still correct 90% of the time on the training set

Dealing with Imbalanced data

  • Class weighting: training algorithms often accept weights assigned to samples
  • Resampling - undersampling the majority class: less data often means quicker training, but also possible loss of important information
  • Resampling - oversampling the minority class involves:
    • duplicating existing samples
    • generating synthetic samples: SMOTE
      • interpolate between existing minority class samples
    • prone to overfitting
  • Data augmentation duplicates samples with label-preserving transformations, e.g., image rotation or text paraphrasing
  • Always test whether a particular method improves model performance compared to the vanilla setting (see the class-weighting sketch below)
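
As a sketch of class weighting, scikit-learn estimators that support it accept a class_weight argument; the 900/100 split mirrors the cherry example above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 900 + [1] * 100)  # 90/10 imbalance, as in the cherry example

# Many estimators accept class weights directly
clf = LogisticRegression(class_weight="balanced")

# Or compute the weights explicitly for algorithms that take sample weights
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
# -> [0.556, 5.0]: errors on the minority class cost ~9x more
```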

SMOTE

  • Synthetic Minority Over-sampling Technique
  • Given are:
    • Number of samples in the minority class, T
    • Amount of oversampling, N, typically 1T, 2T, or 3T
    • Number of nearest neighbors, k
  • For each sample x in the minority class:
    • Compute and store its k nearest neighbors
  • While N > 0:
    • Take the next sample x and randomly pick one of its neighbors, y
    • Generate a random number g ∈ (0, 1)
    • Add the scaled difference between y and x to x to generate a new synthetic sample z: z = x + g(y − x)
    • Decrease N
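
A minimal NumPy sketch of this pseudocode (brute-force neighbor search; in practice a maintained implementation such as imbalanced-learn's SMOTE is preferable):

```python
import numpy as np

def smote(X_min, N, k, seed=0):
    """Generate N synthetic samples from minority-class matrix X_min (T x d)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(N):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # k nearest neighbors of x within the minority class (brute force)
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]     # skip x itself
        y = X_min[rng.choice(neighbors)]
        g = rng.random()                           # g in [0, 1)
        synthetic.append(x + g * (y - x))          # z = x + g(y - x)
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_syn = smote(X_min, N=8, k=2)                     # N = 2T here
```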

Feature-based Imbalance

  • Certain features are overrepresented in a class
  • This results in the model disproportionately relying on features that have no causal relationship with the predicted outcome
  • First check the distribution of non-causal feature values within each class:
    • e.g., use histograms, box plots, and scatter plots
  • Try reducing the overrepresented feature:
    • here: randomly trim the negative sequences to lengths corresponding to the positive set
  • Try synthetic data generation:
    • here: extend the positive sequences with randomly generated amino acids (what distribution?)
  • Try stratified resampling (when applicable), which equalizes data points in each value range of the non-causal feature (not applicable in this case)

Bias-Variance Tradeoff

  • Bias is an error from erroneous assumptions in the learning algorithm
  • Can cause an algorithm to miss relevant relations between features and target outputs (underfitting).
  • Variance is an error from sensitivity to small fluctuations in the training set
  • High variance may result from an algorithm modeling the random noise in the training data (overfitting).

Splitting Data

  • Data leakage occurs when information from outside the training dataset is used to build the model
    • results in overly optimistic performance during training
    • but poor generalization to new data
  • Basic Strategies
    • Training set: for fitting model parameters
    • Validation set (optional): for model selection and parameter tuning
    • Test set: for performance evaluation
  • Splitting Techniques (see the sketch below):
    • Holdout: 80/20 or 70/15/15
      • simple, but often not representative with small sets
    • K-fold cross-validation: the data is split into k equal parts
      • the model is trained k times, each time validated on a different part and trained on the rest
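
A minimal scikit-learn sketch of both techniques; X and y are made-up placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X, y = np.arange(20).reshape(10, 2), np.arange(10) % 2

# Holdout 80/20: set the test set aside before any fitting or scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# K-fold: every point serves for validation exactly once, for training k-1 times
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    ...  # fit on X[train_idx], validate on X[val_idx]
```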

Cross Validation

  • Group k-fold cross-validation ensures that the same patient (group) is not represented in both the training and test sets
  • Stratified k-fold cross-validation ensures that each fold contains approximately the same percentage of samples of each target class as the complete set
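
Both variants are available in scikit-learn; a minimal sketch with hypothetical patient IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
patient = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # hypothetical patient IDs

# Group k-fold: no patient appears in both training and test folds
for tr, te in GroupKFold(n_splits=4).split(X, y, groups=patient):
    assert not set(patient[tr]) & set(patient[te])

# Stratified k-fold: class proportions preserved in every fold
for tr, te in StratifiedKFold(n_splits=2).split(X, y):
    ...  # each fold keeps the 50/50 class balance of the full set
```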

Data Transformation

  • The goal is to make data suitable for machine learning
  • Selected techniques:
    • Normalization to scale to specific range
    • Standardization to scale to have a mean of 0 and a std dev of 1
    • Encoding categorical data
    • Tokenization
    • Vectorization

Normalization and Standardization

  • Min-max: x' = (x − x_min) / (x_max − x_min)
  • Robust: x' = (x − Q2) / (Q3 − Q1)
  • Exponential (logistic): x' = e^x / (1 + e^x) = 1 / (1 + e^−x)
  • Unit vector (L2 normalization): x' = x / ||x||
  • Z-score standardization: x' = (x − μ) / σ
  • To avoid data leakage, calculate scaling parameters (min, max, quartiles, mean, std dev) using the training set only, as in the sketch below
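
A minimal sketch of the leakage-safe pattern: fit the scaler on the training set, then reuse its parameters on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
X_test = np.array([[4.0], [12.0]])

# Fit scaling parameters (mean, std) on the TRAINING set only
scaler = StandardScaler().fit(X_train)

X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuse training mean/std: no leakage
```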

Encoding Categorical Data

  • Ordinal encoding maps ['low', 'medium', 'high'] → [0, 1, 2] when there is a meaningful order
  • One-hot encoding maps ['red', 'green', 'blue'] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]] when there is no meaningful order
  • Binary encoding maps ['red', 'green', 'blue'] → [00, 01, 10] when no order and many unique categories
  • Hashing maps categorical values into a fixed number of buckets or hash codes, for high-cardinality features
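
A minimal sketch of the first two encodings with scikit-learn; the category lists mirror the examples above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

sizes = np.array([["low"], ["medium"], ["high"]])
colors = np.array([["red"], ["green"], ["blue"]])

# Ordinal encoding: pass the meaningful order explicitly
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(ordinal.fit_transform(sizes).ravel())          # [0. 1. 2.]

# One-hot encoding: no order implied, one binary column per category
onehot = OneHotEncoder().fit_transform(colors).toarray()
print(onehot)  # columns are ordered alphabetically: blue, green, red
```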

Tokenization & Vectorization

  • Tokenization is the process of splitting text/sequences into smaller units (tokens)
    • words, subwords, or characters
    • "I love machine learning!"
      • Word-level tokenization: ["I", "love", "machine", "learning"]
      • Character-level tokenization: ["I", " ", "l", "o", "v", "e", " ", "m", "a", "c", "h", "i", "n", "e", ...]
  • Vectorization converts tokens into numerical representations (vectors), e.g., bag-of-words (BoW) or embeddings
    • "I love machine learning!"
      • BoW vocabulary: ["and", "artificial", "I", "intelligence", "learning", "love", "machine"]
      • BoW vector: [0, 0, 1, 0, 1, 1, 1]
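
As a sketch, scikit-learn's CountVectorizer performs word-level tokenization and BoW vectorization in one step (its default tokenizer drops one-character tokens such as "I", so the token pattern is relaxed here):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love machine learning!"]

# Word-level tokenization + bag-of-words counting in one step
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(bow.toarray())                       # one count vector per document
```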

Automation & Pipelining

  • Consistency: ensures steps are applied uniformly to all datasets, preventing human error; it also speeds up repetitive tasks and reduces manual intervention
  • Reproducibility: makes the preprocessing process repeatable and transparent, allowing others to use the same pipeline
  • Scalability: handles large datasets and multiple stages of data processing without manual adjustment for each new dataset
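
A minimal scikit-learn Pipeline sketch: chaining imputation, scaling, and a model into one object makes the steps consistent, reusable, and leakage-safe under cross-validation (every step is fit on training folds only).

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One object captures the whole preprocessing + modeling sequence
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # zero mean, unit variance
    ("model", LogisticRegression()),               # final estimator
])

# pipe.fit(X_train, y_train) fits every step on the training data only;
# pipe.predict(X_test) then reuses the fitted parameters end to end.
```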

Final Thoughts

  • Know your data and try to understand it
  • Avoid data leakage by all means
  • Try to automate and pipeline your preprocessing using Python scripting, libraries, and automation tools
  • Test your processing choices and always compare against the vanilla setting
