Questions and Answers
Which of the following is a primary goal of data preprocessing in machine learning?
- To ensure data is stored in a specific database format.
- To make the data suitable for machine learning algorithms. (correct)
- To create more complex models.
- To reduce the size of the dataset for faster processing.
Data collection is typically the final step in the machine learning framework.
False (data collection is the first step).
Which of the following methods is used to handle missing data by estimating values based on other values?
- Removing data points.
- Ignoring the missing values.
- Imputation using k-nearest neighbors. (correct)
- Analyzing MCAR mechanisms.
What does 'data leakage' refer to in the context of data splitting?
Information from outside the training dataset being used to build the model, which yields overly optimistic performance during training but poor generalization to new data.
Scaling all features to a specific range is called standardization.
False (scaling to a specific range is normalization; standardization scales to a mean of 0 and a standard deviation of 1).
What is the consequence of imbalanced data in machine learning?
The model may become biased toward the majority class or value range.
The technique of creating synthetic samples from the minority class to address imbalanced data is known as ______.
SMOTE (Synthetic Minority Over-sampling Technique)
What is one-hot encoding used for?
Representing categorical values that have no meaningful order as binary indicator vectors, e.g., ['red', 'green', 'blue'] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]].
Explain the difference between the terms 'population' and 'sample' in the context of data collection.
The population is the complete set of items or individuals of interest; a sample is the subset of the population that is actually collected and analyzed.
Match the data transformation techniques with their descriptions:
Which data cleaning step involves making all data conform to a consistent format?
Fixing data formats.
Removing redundant duplicates is not a part of data cleaning, as duplicates can give more weight to certain data points.
False (removing redundant duplicates is part of data cleaning, although not all duplicates are redundant).
What is the primary purpose of 'imputing' missing data?
To estimate and fill in missing values so that data points do not have to be discarded.
What is a potential drawback of using sample mean/median/mode for imputing missing data?
It may reduce variance and may introduce bias if the data are not missing completely at random.
Match outlier detection methods with their descriptions:
Winsorization is a technique used for outlier detection.
False (winsorization caps outliers to reduce their impact; it is a technique for dealing with outliers, not detecting them).
Why should one be cautious when dealing with outliers?
Because a sample outlier is not necessarily a population outlier; outliers may reflect natural variability rather than errors.
If a classification dataset has significantly more data points for one class than another, this is referred to as ______ data.
imbalanced
Which of the following is a technique to address imbalanced data by adjusting the weights assigned to different classes?
Class weighting.
Undersampling involves duplicating the minority class to balance the dataset.
False (undersampling removes samples from the majority class; oversampling duplicates or synthesizes minority-class samples).
In the context of imbalanced data, what is stratified resampling?
Resampling that equalizes the number of data points in each value range (stratum) of a feature.
How does oversampling address the issue of imbalanced data, and what is a potential drawback?
It duplicates existing minority-class samples or generates synthetic ones (e.g., with SMOTE) to balance the classes; it is prone to overfitting.
What is the primary goal of data splitting?
To separate data into training, validation, and test sets so that performance is evaluated on data not used for fitting, avoiding data leakage.
In k-fold cross-validation, each data point is used for training exactly once.
False (each data point is used for testing exactly once and for training k − 1 times).
What is the purpose of using group k-fold cross-validation?
To ensure that samples from the same group (e.g., the same patient) do not appear in both the training and test sets.
The error from erroneous assumptions in the learning algorithm is called ______.
bias
Which of the following describes variance in the context of machine learning?
Error from sensitivity to small fluctuations in the training set.
Tokenization is the process of converting numerical data into text.
False (tokenization splits text or sequences into smaller units called tokens).
What is the purpose of vectorization?
To convert tokens into numerical representations (vectors), e.g., bag-of-words or embeddings.
Why is automation and pipelining important in data preprocessing?
It ensures consistency, improves reproducibility and scalability, speeds up repetitive tasks, and reduces human error.
Flashcards
Data Preprocessing
Transforming raw data into a clean, usable, and suitable format for machine learning models.
Collecting Data
Accessing & evaluating relevant data sources.
Cleaning Data
Verifying data quality and fixing inconsistencies or errors.
Fixing Data Formats
Making all data conform to a consistent format.
Removing Redundant Duplicates
Identifying and removing exact and near duplicates; note that not all duplicates are redundant.
Mechanism: Missing Data
Why values are missing: completely at random (MCAR), at random (MAR, depends on other recorded features), or not at random (MNAR, the missingness conveys information).
Outliers
Data points that significantly deviate from the majority of observations in a dataset.
Sample
A subset of the population from which data are actually collected.
Dealing with Imbalanced Data
Techniques such as class weighting, resampling (undersampling/oversampling), and data augmentation.
Class weighting
Assigning weights to samples so the training algorithm compensates for class imbalance.
Undersampling
Removing samples from the majority class to balance the dataset.
Oversampling
Duplicating existing minority-class samples or generating synthetic ones to balance the dataset.
SMOTE
Synthetic Minority Over-sampling Technique: generates synthetic minority-class samples by interpolating between existing ones.
Bias
Error from erroneous assumptions in the learning algorithm; can cause underfitting.
Variance
Error from sensitivity to small fluctuations in the training set; can cause overfitting.
Data Leakage
Using information from outside the training dataset to build the model.
Data Splitting
Dividing data into training, validation, and test sets.
Training Set
Data used for fitting model parameters.
Validation Set
Data used for model selection and parameter tuning.
Test Set
Data used for performance evaluation.
K-fold cross-validation
The data are split into k equal parts; the model is trained k times, each time using a different subset.
Group k-fold cross-validation
Cross-validation that ensures samples from the same group (e.g., the same patient) are not represented in both training and test sets.
Stratified k-fold cross-validation
Cross-validation in which each fold contains approximately the same percentage of samples of each target class as the complete set.
Data transformation
Making data suitable for machine learning, e.g., through scaling, encoding, tokenization, and vectorization.
Normalization
Scaling data to a specific range.
Standardization
Scaling data to have a mean of 0 and a standard deviation of 1.
Encoding Categorical Data
Converting categories to numbers, e.g., ordinal, one-hot, binary, or hashing encoding.
Tokenization
Splitting text or sequences into smaller units (tokens): words, subwords, or characters.
Vectorization
Converting tokens into numerical representations (vectors), e.g., bag-of-words or embeddings.
Automation and Pipelining
Automating preprocessing steps to ensure consistency, reproducibility, and scalability.
Study Notes
Data Preprocessing in ML Framework
- Data preprocessing and cleaning form the second step in the ML framework
- The other steps are data collection, selecting the algorithm, training the model, evaluating performance, tuning and optimizing the model, and deploying the model.
- Data is used for training, validation, and testing
- Training involves fitting model parameters
- Validation involves model selection and parameter tuning
- Testing involves performance evaluation
Data Preprocessing Outline
- Data preprocessing includes collecting, cleaning, handling imbalance and splitting data
- It also includes transformation, automation, and pipelining
- Collecting data involves accessing and evaluating data sources
- Cleaning ensures data quality
- Handling imbalance avoids class bias
- Splitting avoids data leakage
- Transformation makes data suitable for ML
- Automating and pipelining ensures reproducibility and improves efficiency
Collecting Data
- Public sources are supported by national/international organizations and have available metadata
- Research datasets include paper supplements, code, and data repositories, but their scope and quality vary
- In-house data is relatively small with readily available and related expertise, but is proprietary
- Access modes include tables (csv, ods, xlsx) using Python Pandas, APIs (csv, json, xml) using Python Requests and BioPython, and web scraping (see the sketch at the end of this list)
- Web scraping of web page content (html, json) uses Python BeautifulSoup and Scrapy
- Limitations include licensing and ethical concerns: privacy, confidentiality, informed consent, obscured provenance, or an ethically compromised origin
- Permissive licenses allow commercial use such as MIT, BSD-2/3-Clause, Apache 2.0
- Copyleft (viral) licenses require derivative works to adopt the same license (e.g., GPL, AGPL, CC BY-SA)
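As a quick illustration of the table and API access modes above, a minimal Python sketch follows; the file name and endpoint URL are hypothetical placeholders.

```python
import pandas as pd
import requests

# Tables: read a local CSV file into a DataFrame
df = pd.read_csv("measurements.csv")  # hypothetical file

# APIs: fetch JSON records from a REST endpoint (hypothetical URL)
response = requests.get("https://api.example.org/records", timeout=30)
response.raise_for_status()             # fail loudly on HTTP errors
api_df = pd.DataFrame(response.json())  # tabulate the JSON payload
```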
Data Cleaning
- Data cleaning ensures data quality, involves fixing data formats, removing redundant duplicates, handling missing data, and handling outliers
- Data is made to conform to a consistent format.
- Incoherent data formats indicate corrupted files or data inconsistencies
- Exact duplicates are identical entries repeated in the dataset.
- Near duplicates are slight variations which can happen due to format discrepancies, alternative representations or typographical errors
- It is important to note that not all duplicates are redundant
- Patient data from two sources should be merged, not deleted.
- Patient data at different time points may require proper structuring
- Duplicates may result from oversampling small datasets
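A minimal pandas sketch of the format-fixing and duplicate-removal steps above; the column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob"],  # near duplicates: casing/whitespace
    "city": ["Oslo", "Oslo", "Bergen"],
})

# Fix data formats first, so near duplicates become exact duplicates
df["name"] = df["name"].str.strip().str.lower()

# Then drop exact duplicates (keeps the first occurrence)
df = df.drop_duplicates()
```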
Handling Missing Data
- It is important to understand the mechanism behind why the data is missing:
- Completely At Random (MCAR): the missingness conveys no information
- e.g., it could have been caused by a data entry mistake by a surveyor
- At Random (MAR): the missingness depends on other recorded features
- e.g., people are more likely to forget things that occurred far in the past
- Not At Random (MNAR): the missingness itself conveys information
- e.g., people do not disclose embarrassing information
- Some methods can handle missing values (empty/None) natively and are safe under MCAR
- One can remove data points with any (or relevant) missing values, which works well for small amounts of missing data under MCAR
- Imputing the data:
- Sample mean/median/mode may reduce variance, may introduce bias if data is not missing randomly
- An estimate from k-nearest neighbors (found based on the other values) is computationally expensive and prone to overfitting (see the sketch at the end of this section)
- Forward/backward/interpolation fill for time series may introduce bias if data is not missing randomly
- Machine learning prediction may be complex, computationally expensive, prone to overfitting
- Compare models trained on data with imputation to models trained on data without imputation
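A minimal scikit-learn sketch of two of the imputation strategies above, assuming NaN marks the missing values in a toy feature matrix.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])

# Sample-mean imputation: cheap, but may reduce variance and add bias
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-nearest-neighbors imputation: estimates from similar rows;
# computationally more expensive and prone to overfitting
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```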
Handling Outliers
- An outlier is a data point that significantly deviates from the majority of observations in a dataset
- Outliers can be due to measurement errors, data processing errors, fraud or natural variability
- A sample outlier is not necessarily a population outlier
Outlier Detection
- Outlier detection is a statistical process
- Interquartile Range (IQR) is where IQR = Q3 - Q1
- Tukey's fence is where the mild outlier is outside [Q1 – 1.5 IQR, Q3 + 1.5 IQR]
- The extreme outlier is outside [Q1 – 3 IQR, Q3 + 3 IQR]
- Standard score (Z-score) is where Z = (x − μ) / σ
- where x is the observation, μ the sample mean, and σ the sample standard deviation
- Z-score interpretation:
- mild outlier: |Z| > 2
- extreme outlier: |Z| > 3
- caution: assumes normal distribution.
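The two 1d detection methods above in a minimal NumPy sketch, using toy data with one obvious outlier.

```python
import numpy as np

x = np.array([10.0, 11.0, 12.0, 11.5, 10.5, 35.0])  # toy data

# IQR and Tukey's fences
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mild = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
extreme = (x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)

# Z-score (caution: assumes a roughly normal distribution)
z = (x - x.mean()) / x.std()
mild_z = np.abs(z) > 2
```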
Dealing with Outliers
- Outlier detection
- 1d statistical methods (IQR, Z-score) are used
- 2d & beyond: ML methods, Mahalanobis distance
- Techniques for dealing with outliers:
- Remove outliers if due to errors or irrelevant
- Reduce the impact of outliers by:
- Capping outliers:
- p% winsorization: e.g., in 90% winsorization the bottom and top 5% of values are capped at the 5th and 95th percentile values
- Transform the data, e.g., with a log transformation (especially for right-skewed distributions); see the sketch below
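A minimal sketch of the capping and transforming techniques above, using scipy's winsorize on made-up data.

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])

# Cap the bottom and top 5% of values at the 5th/95th percentile values
x_capped = winsorize(x, limits=(0.05, 0.05))

# Log transform, useful for right-skewed distributions;
# log1p(x) = log(1 + x) also handles zeros gracefully
x_log = np.log1p(x)
```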
Imbalanced Data
- One class or value range has significantly more data points than another in the training set
- The model may be biased toward the majority class or value; e.g., if the training set includes 900 images of sour cherries and 100 images of sweet cherries, a model that always predicts "sour cherry" is correct 90% of the time on the training set
Dealing with Imbalanced data
- Class weighting: training algorithms often accept weights assigned to samples
- Resampling - undersampling the majority class: less data often means quicker training, but risks losing important information
- Resampling - oversampling the minority class involves:
- duplicating existing samples
- generating synthetic samples: SMOTE
- interpolate between existing minority class samples
- prone to overfitting
- Data augmentation duplicates samples with non-distorting transformations, e.g., image rotation or text paraphrasing
- Always test whether a particular method improves model performance in comparison to the vanilla setting
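A minimal scikit-learn sketch of class weighting for the 900-vs-100 cherry example above; LogisticRegression stands in for any estimator that accepts class_weight.

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' reweights classes inversely to their frequencies
clf = LogisticRegression(class_weight="balanced")

# The same weights by hand for 900 sour vs. 100 sweet cherries:
# weight_c = n_samples / (n_classes * n_c)
weights = {"sour": 1000 / (2 * 900), "sweet": 1000 / (2 * 100)}
clf_manual = LogisticRegression(class_weight=weights)
```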
SMOTE
- Synthetic Minority Over-sampling Technique
- Given are:
- Number of samples in the minority class, T
- Amount of over-sampling, N (typically 1T, 2T, or 3T)
- Number of nearest neighbors, k
- For each sample x in the minority class:
- Compute and store x alongside its k nearest neighbors
- While N > 0:
- Take the next sample x and randomly pick one of its neighbors, y
- Generate a random number g ∈ (0, 1)
- Add the scaled difference between y and x to x, generating a new synthetic sample z: z = x + g(y − x)
- Decrease N
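A from-scratch NumPy sketch of the SMOTE steps above; X_min is assumed to hold only the minority-class samples, and the function name and parameters are illustrative.

```python
import numpy as np

def smote(X_min, N, k=5, seed=0):
    """Generate N synthetic minority samples via SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    T = len(X_min)
    # Pairwise distances, then each sample's k nearest neighbors (skip self)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    synthetic = []
    for i in range(N):
        x = X_min[i % T]                   # take the next sample x
        y = X_min[rng.choice(nn[i % T])]   # randomly pick one of its neighbors
        g = rng.random()                   # g in (0, 1)
        synthetic.append(x + g * (y - x))  # interpolate between x and y
    return np.array(synthetic)
```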
Feature-based Imbalance
- Certain features are overrepresented in a class
- This results in the model disproportionately relying on features that have no causal relationship with the predicted outcome
- First check the distribution of non-causal feature values across the classes:
- e.g., use histograms, box plots, and scatter plots
- Try reducing the overrepresented feature:
- here: randomly trim the negative sequences to a length corresponding to the positive set
- Apply synthetic data generation:
- here: extend the positive sequences with randomly generated amino acids (with what distribution?)
- Try stratified resampling (when applicable), which equalizes the number of data points in each value range of the non-causal feature (not applicable in this case)
Bias-Variance Tradeoff
- Bias is an error from erroneous assumptions in the learning algorithm
- Can cause an algorithm to miss relevant relations between features and target outputs (underfitting).
- Variance is an error from sensitivity to small fluctuations in the training set
- High variance may result from an algorithm modeling the random noise in the training data (overfitting).
Splitting Data
- Data leakage occurs when information from outside the training dataset is used to build the model
- results in overly optimistic performance during training
- but poor generalization to new data
- Basic Strategies
- Training set: for fitting model parameters
- Validation set (optional): for model selection and parameter tuning
- Test set: for performance evaluation
- Splitting Techniques
- Holdout: 80/20 or 70/15/15
- simple, but often not representative with small datasets
- K-fold cross-validation
- data split into k equal parts
- model trained k times, each time using a different subset for testing
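A minimal scikit-learn sketch of both splitting techniques, with a toy feature matrix and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(40).reshape(20, 2)  # toy feature matrix
y = np.arange(20) % 2             # toy labels

# Holdout: 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# K-fold: each data point is used for testing exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    pass  # train on X[train_idx], evaluate on X[test_idx]
```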
Cross Validation
- Group k-fold cross-validation ensures that the same patient is not represented in both the testing and training sets
- Stratified k-fold cross-validation ensures that each fold contains approximately the same percentage of samples of each target class as the complete set
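Both variants in a minimal scikit-learn sketch; the patient IDs and labels are made up.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.arange(24).reshape(12, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
patients = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

# Group k-fold: the same patient never appears in both train and test
for tr, te in GroupKFold(n_splits=3).split(X, y, groups=patients):
    assert not set(patients[tr]) & set(patients[te])

# Stratified k-fold: class proportions preserved in each fold
for tr, te in StratifiedKFold(n_splits=4).split(X, y):
    pass  # each fold keeps roughly the 2:1 class ratio
```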
Data Transformation
- The goal is to make data suitable for machine learning
- Selected techniques:
- Normalization to scale to specific range
- Standardization to scale data to have a mean of 0 and a standard deviation of 1
- Encoding categorical data
- Tokenization
- Vectorization
Normalization and Standardization
- Min-max: x' = (x − x_min) / (x_max − x_min)
- Robust: x' = (x − Q2) / (Q3 − Q1)
- Exponential (logistic): x' = e^x / (1 + e^x) = 1 / (1 + e^(−x))
- Unit vector (L2 normalization): x' = x / ||x||
- Z-score standardization: x' = (x − μ) / σ
- To avoid data leakage calculate scaling parameters (min, max, quartiles, mean, std dev) using the training set only
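A minimal sketch of the leakage-safe recipe above: fit the scaler on the training set only, then reuse its parameters on the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [5.0]])

# Z-score standardization: mean and std come from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # same parameters reused

# Min-max normalization follows the same fit-on-train pattern
X_test_norm = MinMaxScaler().fit(X_train).transform(X_test)
```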
Encoding Categorical Data
- Ordinal encoding maps ['low', 'medium', 'high'] → [0, 1, 2] when there is a meaningful order
- One-hot encoding maps ['red', 'green', 'blue'] → [[1, 0, 0], [0, 1, 0], [0, 0, 1]] when there is no meaningful order
- Binary encoding maps ['red', 'green', 'blue'] → [00, 01, 10] when no order and many unique categories
- Hashing maps categorical values into a fixed number of buckets or hash codes, for high-cardinality features
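A minimal pandas sketch of the ordinal and one-hot cases above; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"size": ["low", "medium", "high"],
                   "color": ["red", "green", "blue"]})

# Ordinal encoding: categories have a meaningful order
df["size_enc"] = df["size"].map({"low": 0, "medium": 1, "high": 2})

# One-hot encoding: one binary indicator column per category
df = pd.get_dummies(df, columns=["color"])
```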
Tokenization & Vectorization
- Tokenization is the process of splitting text/sequences into smaller units (tokens)
- words, subwords, or characters
- Example: "I love machine learning!"
- Word-level tokenization: ["I", "love", "machine", "learning"]
- Character-level tokenization: ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'm', 'a', 'c', 'h', 'i', 'n', 'e', ...]
- Vectorization converts tokens into numerical representations (vectors), e.g., bag-of-words (BoW) or embeddings
- For "I love machine learning!":
- BoW vocabulary: ["and", "artificial", "I", "intelligence", "learning", "love", "machine"]
- BoW vector: [0, 0, 1, 0, 1, 1, 1]
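A minimal scikit-learn sketch of word-level tokenization plus BoW vectorization; the token pattern is loosened so the one-letter token "I" is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love machine learning!",
        "artificial intelligence and machine learning"]

# Tokenizes at word level and builds the BoW vocabulary in one step
vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
bow = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                # one BoW count vector per document
```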
Automation & Pipelining
- Consistency: steps are applied uniformly to all datasets, preventing human error; repetitive tasks are sped up and manual intervention reduced
- Reproducibility: the preprocessing process becomes repeatable and transparent, allowing others to use the same pipeline
- Scalability: large datasets and multiple stages of data processing can be handled without manual adjustment for each new dataset
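A minimal scikit-learn Pipeline sketch chaining imputation, standardization, and a model, so every preprocessing step is applied consistently and reproducibly.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # handle missing data
    ("scale", StandardScaler()),                   # standardize features
    ("model", LogisticRegression()),
])

# pipe.fit(X_train, y_train) fits the imputer and scaler on the
# training set only (no leakage); pipe.predict(X_test) reuses them.
```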
Final Thoughts
- Know your data and try to understand it
- Avoid data leakage by all means
- Try to automate and pipeline your preprocessing using Python scripting, libraries, and automation tools
- Test your processing choices and always compare against the vanilla setting