Design of Experiments for Data Science

Flashcards

Train/Test Paradigm

Evaluating models on data not used for fitting.

Coefficient of Determination (R²)

Proportion of variance in dependent variable predictable from independent variable.

Root Mean Squared Error (RMSE)

The average deviation between predictions and the target.

Extrapolation

Making predictions for inputs outside the observed input range.

Data Leakage

Duplicated observations ending up in both training and test sets

Overfitting

Model fits training data too well, including statistical noise

K-Fold Cross-Validation

Out-of-sample testing that estimates the generalisation error.

Leave-One-Out Validation

Extreme cross-validation where K equals the number of observations.

Monte Carlo Cross Validation

Repeated random train/test splits, each generated with a different random seed (repeated hold-out).

Simple Bootstrap

Sampling N observations with replacement for the training set.

Class Imbalance

Unequal distribution of classes in a dataset.

Undersampling

Randomly sampling from the majority class to reduce its importance.

Oversampling

Duplicating data from the minority class to increase its importance.

Cost-Sensitive Learning

Giving more weight to false positives or false negatives.

Data Augmentation

Adding random noise to create more training data

Online Experiments

Experiments run on a deployed model that cannot be taken offline for evaluation.

A/B Testing

Randomized experiment comparing system variants

Hypothesis Testing

Determining whether performance differences are real or due to chance.

Null Hypothesis

There is no difference between the variations.

Alternative Hypothesis

There is a difference between the variations.

Factorial Experiments

Structured designs that measure the impact of multiple factors and their interactions on performance.

A/B Testing

Variations of the model are deployed and compared for a fixed period.

Multi-Armed Bandit

An adaptive design that continuously allocates participants to variations based on their performance history.

Reproducibility

Essential to the scientific process, as it allows other researchers to verify and build on results.

Reproducibility

Re-running an analysis with the original data and computer code to regenerate the results.

Intrinsic Interpretability

Interpretability that is a built-in property of the model itself.

Post Hoc Interpretability

Interpreting a prediction after the fact.

Local Interpretability

Interpreting one prediction.

Global Interpretability

Interpreting the overall model.

Local Feature Importance

Explaining the contribution of each feature to a single prediction.

Global Feature Importance

Identifying the overall impact features have on the model's output.

Intrinsically Interpretable Models

Models, such as decision trees and nearest-neighbour models, that are easy to interpret directly.

Global Surrogate Models

An interpretable surrogate model trained to approximate the predictions of a black-box model.

Local Surrogate Models

Explaining a single prediction by generating a new dataset around it and training an interpretable model on that dataset.

Counterfactual explanations

Explanations that describe what has not happened or is not the case.

Counterfactual

Example: if one of your answers had been changed to a different value, your grade would have been pushed to a different outcome.

Study Notes

  • Lecture 7 is on Design and Analysis of Experiments for Data Science and Machine Learning.

Supervised Learning

  • A space of functions F maps observations to targets.
  • An ideal function f* that is an element of F maps each observation to a target.
  • A space of functions Fm is a subset of F that can be modeled.
  • The goal is to find the function f in that space that best approximates the ideal function.
  • Supervised learning is function approximation.
  • A function maps an observation to an outcome.
  • An example of this is a set of binary numbers turning into numbers, i.e. [0, 1, 0, 1] -> 2.
  • The function isn't known and therefore needs to be approximated.
  • A dataset of observations on which this function holds is required.
  • Core assumption: Data should be representative of the real world.
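
A minimal sketch of this idea in Python, assuming scikit-learn is available; the "ideal" function, the bit-vector dataset, and the choice of a decision tree are illustrative, not part of the lecture:

```python
# Supervised learning as function approximation (illustrative sketch).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# The "ideal" function f* is unknown in practice; here we invent one for illustration:
# it maps a vector of binary digits to the count of ones.
def f_star(x):
    return x.sum(axis=1)

# A dataset of observations on which this function holds.
X = rng.integers(0, 2, size=(1000, 4))
y = f_star(X)

# Pick a function f from the modellable space Fm and fit it to approximate f*.
model = DecisionTreeRegressor().fit(X, y)
print(model.predict([[0, 1, 0, 1]]))  # should be close to 2 (two ones in the input)
```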

Data Science and Design Thinking

  • Exploring the problem involves study design and EDA.
  • Refining the problem involves design choices.
  • Developing the model involves model building and feature engineering.
  • Interpreting and communicating involves writing, plotting, and talking.

The Train/Test Paradigm

  • Models are evaluated on data that wasn't used for fitting.
  • The goal is to estimate generalisation/out-of-sample error.
  • The train/test paradigm means holding out to simulate "new" data.
  • As an example, consider a dataset of 1,000 observations (such a split is sketched after this list).
  • Common splits are 80/20 train/test splits, i.e. 800 training observations and 200 test observations.
  • Testing is used for final evaluation.
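
A hedged sketch of the 80/20 hold-out split described above, assuming scikit-learn; the data and the logistic regression model are placeholders:

```python
# 80/20 hold-out split on 1,000 observations (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 5)           # 1,000 observations, 5 features (illustrative)
y = (X[:, 0] > 0.5).astype(int)       # illustrative binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 800 training / 200 test observations

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# The test set is reserved for the final evaluation only.
print("Test accuracy:", model.score(X_test, y_test))
```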

Performance Measures in Classification

  • Accuracy, which is easy to interpret, is calculated as the number of correct predictions divided by the total number of predictions.
  • F1-score is less sensitive to class imbalances, calculated as the true positives divided by the true positives plus half of the sum of false positives and false negatives.
  • A confusion matrix gives details about the errors.
  • Some problems have a higher cost for a type 1 or type 2 error.
  • Spam classification: erasing a legitimate email has a higher cost.
  • Cancer screening: classifying a malignant tumor as benign has a higher cost.
  • Churn prediction: missing a customer who might churn has a higher cost.
  • Via cost-sensitive classification, algorithms can put more weight on the mistakes that must be avoided.
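
The metrics above can be computed with scikit-learn; the toy labels below are illustrative:

```python
# Accuracy, F1-score, and confusion matrix on a small imbalanced example.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))   # correct / total
print("F1-score:", f1_score(y_true, y_pred))         # TP / (TP + 0.5 * (FP + FN))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```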

Performance Measures in Regression

  • Coefficient of determination (R²) is the proportion of the variation in the dependent variable (Y) that is predictable from the independent variable (X).
  • A value closer to 1 signifies that a large amount of variation in Y is explained by X.
  • A value closer to 0 signifies that most of the variation in Y can't be explained by X.
  • Residual sum of squares: RSS = Σ (yᵢ − ŷᵢ)².
  • Total sum of squares: TSS = Σ (yᵢ − ȳ)².
  • The coefficient of determination is R² = 1 − RSS / TSS.
  • Root Mean Squared Error(RMSE) is the square root of the Mean Squared Error (MSE).
  • RMSE measures the average deviation between predictions and the target.
  • The formula for MSE is the sum of the squared differences between predicted and actual values divided by N: MSE = (1/N) Σ (yᵢ − ŷᵢ)².
  • The formula for RMSE is the square root of MSE: RMSE = √MSE.
  • Drawbacks of RMSE: it assigns the same penalty to predictions over or under the target, it is sensitive to outliers, its unit cannot be compared across different datasets, and it hides the error distribution.
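
A small NumPy sketch of these formulas; the values are illustrative:

```python
# R² and RMSE computed directly from the definitions above.
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - rss / tss                            # coefficient of determination

mse = np.mean((y_true - y_pred) ** 2)         # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error

print(f"R² = {r2:.3f}, RMSE = {rmse:.3f}")
```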

Pitfalls of Evaluation

  • Interpolation means making sure the inputs are within the range of observed data.
  • As an example, if you train f(x) on x between -10 and 10, you cannot expect sensible predictions for x = 9999.
  • Data leakage occurs if duplicated observations end up in both training and test sets.
  • Overfitting means you shouldn't look for a training RMSE of 0 or an accuracy of 1; statistical noise is normal.
  • Data quality is based on how the data was collected.
  • Biased sampling can lead to irrelevant models.
  • How the data was labelled matters, in particular whether labels/target variables come from human annotators.
  • Some concepts are inherently harder to label.
  • Inter-rater reliability is important to look for when annotators disagree.
  • Cohen's Kappa is a metric for categorical labels, measures agreement beyond chance for 2 annotators.
  • Fleiss' Kappa is a generalisation of Cohen's Kappa to more than 2 annotators.
  • Krippendorff's Alpha is a metric suitable for various label types, e.g. ordinal data such as sentiment ratings.
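
A minimal sketch of inter-rater agreement for two annotators, assuming scikit-learn's cohen_kappa_score; the label sequences are invented:

```python
# Cohen's Kappa: agreement beyond chance for two annotators (1 = perfect, 0 = chance level).
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["spam", "ham", "ham", "spam", "ham", "ham"]
annotator_2 = ["spam", "ham", "spam", "spam", "ham", "ham"]

print("Cohen's Kappa:", cohen_kappa_score(annotator_1, annotator_2))
```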

Resampling Methods

  • Resampling methods: cross-validation
  • K-fold cross-validation involves out-of-sample testing to estimate the generalisation error using train/test with rotation (see the sketch after this list).
  • If there is a dataset of 1,000 observations with an 80/20 train/test split, partition 800 into 4: [200, 200, 200, 200].
  • Use partition 1 for validation, partition 2,3,4 for training for performance 1.
  • Use partition 2 for validation, partition 1,3,4 for training for performance 2.
  • Use partition 3 for validation, partition 1,2,4 for training for performance 3.
  • Use partition 4 for validation, partition 1,2,3 for training for performance 4.
  • The overall performance is the average of the above calculated performances.
  • Cross-validation provides more stable estimates (mean average performance with variance).
  • K, in relation to cross-validation, is usually between 3 and 10.
  • Leave One Out validation: extreme cross-validation where K = the number of observations.
  • For example, for 1,000 observations, running 1,000 experiments gathers 1,000 performance measures.
  • Cross-validation gives more information about the average behaviour of a model versus the performance for a specific dataset.
  • Multiple performance measures can compare distributions instead of single points.
  • Tuning hyperparameters requires nested k-fold cross-validation.
  • Nested CV is cross-validation within the training folds of a cross-validation where hyperparameters can be optimised within the internal cross-validation.
  • An example is a 4x4 nested cross-validation.
  • Monte Carlo Cross Validation is an alternative which generates a train/test split from a random seed, performs the evaluation, increments the seed, and loops.
  • Monte Carlo Cross Validation is also known as repeated hold-out.
  • Simple bootstrap samples N observations with replacement to create a training set.
  • The remaining observations are used for testing.
  • This is repeated 50 to 200 times.
  • It produces a confidence interval.
  • Refined versions include the .632 and .632+ bootstrap estimators.
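
A hedged sketch of 4-fold cross-validation and the simple bootstrap, assuming scikit-learn; the dataset, model, and number of repetitions are illustrative:

```python
# K-fold cross-validation and a simple bootstrap estimate (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

# K-fold cross-validation: rotate which partition is held out, then average.
scores = cross_val_score(model, X, y, cv=4)
print("Per-fold accuracy:", scores, "mean:", scores.mean())

# Simple bootstrap: sample N observations with replacement for training,
# test on the observations that were left out; repeat many times.
boot_scores = []
for seed in range(50):
    idx = resample(np.arange(len(X)), replace=True, random_state=seed)
    oob = np.setdiff1d(np.arange(len(X)), idx)          # left-out observations
    model.fit(X[idx], y[idx])
    boot_scores.append(accuracy_score(y[oob], model.predict(X[oob])))
print("Bootstrap mean accuracy:", np.mean(boot_scores))
```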

Class Imbalance

  • Class Imbalance is a common problem.
  • Example: spam is less than 1% of emails.
  • Undersampling the majority class involves random sampling to give it less importance.
  • Oversampling the minority class involves duplicating data from the minority class to give it more importance.
  • Cost-sensitive learning involves giving more weight to false positives or false negatives, depending on the minority class.
  • Data augmentation involves generating additional data by applying small random noise to the features.
  • These techniques are applied to the training set only; you should never manipulate the test set.
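
A hedged sketch of oversampling and cost-sensitive learning applied to the training set only, assuming scikit-learn; the imbalanced dataset is synthetic:

```python
# Two remedies for class imbalance, applied to the training set only (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Oversampling: duplicate minority-class observations until the classes are balanced.
minority = np.where(y_train == 1)[0]
extra = resample(minority, replace=True,
                 n_samples=np.sum(y_train == 0) - len(minority), random_state=0)
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])
oversampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Cost-sensitive learning: weight mistakes on the minority class more heavily.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# The test set is never resampled or reweighted.
print("Oversampled model test accuracy:", oversampled.score(X_test, y_test))
print("Weighted model test accuracy:", weighted.score(X_test, y_test))
```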

Practical Challenges in Experimentation

  • Data leakage occurs when the test and training sets are not kept isolated.
  • Computational cost: evaluation can be expensive, e.g. a 10x10 nested cross-validation may be prohibitive.
  • Reproducibility: machine learning relies on randomness, which is why setting random seeds matters.
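
A minimal sketch of fixing random seeds for reproducibility; the seed value and the split are illustrative:

```python
# Fixing random seeds so an experiment can be re-run exactly.
import random
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy's global RNG

X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# Passing the seed explicitly makes the split itself reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
```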

Hypothesis Testing and Statistical Significance

  • Hypothesis testing helps determine if the performance differences between models are real or due to chance.
  • One cannot prove or disprove an effect, but can see where the evidence is pointing.
  • Null hypothesis: “there is no effect”: classifier 1 and classifier 2 perform the same overall.
  • P-value: probability of observing results which are at least as extreme as the results observed under the null hypothesis i.e. if there is no actual effect.
  • If the p-value is under your predefined threshold (usually 0.05), there is enough evidence to reject the null hypothesis.
  • Parametric tests have assumptions about the underlying distribution of the observations.
  • Non-parametric tests do not have as many assumptions about the underlying distribution of the observations.
  • One-tailed tests test for an effect in a specific direction, e.g. "is classifier 1 better than classifier 2?"
  • Two-tailed tests are bidirectional, e.g. “is there a difference between classifier 1 and classifier 2?"
  • Paired samples use paired observations, e.g. cross-validation with the same folds.
  • Independent samples compare measurements taken on different data, i.e. not paired, but ideally with the same statistical properties.
  • There is a small chance that an observed effect is due to chance even if p < 0.05.
  • p = 0.05 does not mean there is a 95% chance that the difference is real; it means that results at least this extreme would occur 5% of the time if the null hypothesis were true.
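
A hedged sketch of a paired test on matched cross-validation folds, assuming SciPy; the per-fold accuracies are invented:

```python
# Paired, parametric comparison of two classifiers evaluated on the same CV folds.
from scipy import stats

clf1 = [0.81, 0.79, 0.84, 0.80, 0.83]   # classifier 1, per-fold accuracy
clf2 = [0.78, 0.77, 0.80, 0.79, 0.80]   # classifier 2, same folds

# Two-tailed: "is there a difference between classifier 1 and classifier 2?"
print(stats.ttest_rel(clf1, clf2))

# One-tailed: "is classifier 1 better than classifier 2?"
print(stats.ttest_rel(clf1, clf2, alternative="greater"))
```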

Multiple Comparison Problem

  • The multiple comparison problem builds on the p-value: the probability of observing results at least as extreme as those seen when there is no actual effect.
  • There is a small chance that an effect is due to chance even if p < 0.05.
  • Each additional comparison compounds the risk of making a Type 1 error.
  • This is the Multiple Comparison Problem, which can be corrected.
  • Two of the best-known corrections are the Bonferroni correction and the Benjamini–Hochberg procedure.
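
A minimal sketch of both corrections, assuming statsmodels is installed; the p-values are illustrative:

```python
# Correcting a family of p-values for multiple comparisons.
from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.20, 0.002]

# Bonferroni: very conservative, controls the family-wise error rate.
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: less conservative, controls the false discovery rate.
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni:", reject_bonf, p_bonf)
print("Benjamini-Hochberg:", reject_bh, p_bh)
```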

Interpreting Results

  • Statistical significance testing is not always done in machine learning experiments, but doing it makes the evaluation more rigorous.
  • Being able to reject the null hypothesis does not mean that your alternative hypothesis is correct.
  • Not being able to reject the null hypothesis does not mean it is necessarily true.
  • Replication studies help confirm a hypothesis.
  • Machine learning experiments violate many assumptions of statistical tests, so it is better to use the test with the most relaxed assumptions.
  • Wilcoxon Signed Rank test compares algorithms on k-fold cross-validation.
  • McNemar test is used if you cannot afford to cross-validate.
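
A hedged sketch of both tests, assuming SciPy and statsmodels; the fold scores and the contingency table are invented:

```python
# Non-parametric comparisons of two classifiers.
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Wilcoxon signed-rank test: compare two algorithms on the same k-fold CV folds.
algo_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.80, 0.81]
algo_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.81, 0.77, 0.82, 0.79, 0.78]
print(wilcoxon(algo_a, algo_b))

# McNemar test: compare two classifiers from a single test set,
# using counts of which items each classifier got right or wrong.
table = [[50, 8],    # both correct / only A correct
         [3, 39]]    # only B correct / both wrong
print(mcnemar(table, exact=True))
```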

Factorial Experiments

  • Structured ways of measuring the impact of multiple factors and their interaction on performance.
  • Factors: hyperparameters, preprocessing choices, model architecture, and pipeline design choices.
  • Types of factorial designs: full factorial design, fractional factorial design and Plackett-Burman design.
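
A minimal sketch of a full factorial design over a few factors, assuming scikit-learn's ParameterGrid; the factors and levels are illustrative:

```python
# Full factorial design: every combination of factor levels (2 x 2 x 2 = 8 runs).
from sklearn.model_selection import ParameterGrid

factors = {
    "scaling": ["none", "standard"],   # preprocessing choice
    "n_estimators": [50, 200],         # hyperparameter
    "max_depth": [3, 10],              # hyperparameter
}

for run_id, config in enumerate(ParameterGrid(factors)):
    print(run_id, config)   # here you would train/evaluate one pipeline per configuration
```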

Online Experiments and A/B Testing

  • In online experiments the model is deployed and is not taken offline for evaluation.
  • A/B testing compares two (or more) variants of a system in a controlled experiment.
  • Their foundation is in hypothesis testing to quantify the likelihood of observed differences being due to real effects rather than random chance.
  • Tests used: the G-test for yes/no metrics and the Z-test for numerical targets.
  • A hypothesis defines the specific difference you want to test for.
  • Sample Size Determination: Estimate the sample size needed.
  • Users are randomly assigned to either the control group (version A) or treatment group(s) (versions B, C, etc.).
  • Tests are used to determine if observed differences reach statistical significance using chi-squared or t-tests.
  • Null vs. Alternative Hypothesis: the common significance level (alpha) is 0.05 (a 5% risk of falsely rejecting the null hypothesis).
  • Statistical Power (1-beta) is the probability of correctly rejecting the null hypothesis when it's false.
  • Effect Size is the minimal difference you wish to detect.
  • Multi-Armed Bandit (MAB) is a more advanced design which compares two or more variations and allocates participants to variations based on their performance history.
  • Algorithms for a MAB include the Epsilon-Greedy algorithm, Upper Confidence Bound, Thompson Sampling, and Contextual Bandits (a minimal epsilon-greedy sketch follows this list).
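
A minimal epsilon-greedy sketch in plain Python; the conversion rates are simulated, not real experiment data:

```python
# Epsilon-greedy multi-armed bandit allocating users to two variants (illustrative sketch).
import random

true_rates = {"A": 0.10, "B": 0.12}            # unknown in practice; simulated here
counts = {"A": 0, "B": 0}
successes = {"A": 0, "B": 0}
epsilon = 0.1                                  # exploration probability

for user in range(10_000):
    if random.random() < epsilon:              # explore: pick a variant at random
        arm = random.choice(["A", "B"])
    else:                                      # exploit: pick the best variant so far
        arm = max(counts, key=lambda a: successes[a] / counts[a] if counts[a] else 0.0)
    counts[arm] += 1
    successes[arm] += random.random() < true_rates[arm]   # simulated conversion

print({a: successes[a] / counts[a] for a in counts}, counts)
```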

Reproducibility and Interpretability

  • Reproducibility is essential for scientific progress, as it allows other researchers to verify and build upon existing work.
  • Factors that threaten it include variations in data collection and preprocessing, differences in coding practices, and the use of random seeds.
  • Reproducibility focuses on re-doing the experiment with the same data.
  • Replicability focuses on re-doing the experiment with new data to get the same finding.
  • Interpretability matters to have trust and confidence in decisions, for debugging and model improvement, for regulatory & ethical compliance, and to discover new insights from the model itself.
  • Intrinsic interpretability is a property of a model.
  • Post hoc interpretability is interpreting a prediction after the fact.
  • Interpretability methods are either model-specific (applicable to one class of model) or model-agnostic (applicable to any model).
  • They can be local, applying to a single prediction, or global, applying to the overall model.

Feature Importance and Surrogate Models

  • Global Feature Importance identifies the overall impact features have on the model's output.
  • Local Feature Importance explains the contributions of features to a single prediction.
  • Both are achieved via Permutation Importance and SHAP (SHapley Additive exPlanations); a permutation-importance sketch follows this list.
  • Decision trees and nearest-neighbour models are easy to interpret.
  • Global surrogate models use an interpretable model (e.g. a decision tree) to approximate the predictions of a black-box model such as an SVM.
  • Local surrogate models explain individual predictions of black box machine learning models.
  • In local surrogate models, a new dataset is generated around a single data point, e.g. by switching features off.
  • An interpretable model is then trained on that new dataset.
  • Counterfactual explanations express what has not happened or is not the case.
  • They are used to explain individual predictions and to identify potential causal links.
  • Advantages: they are clear and easy to implement; a disadvantage is that a single prediction can have many competing counterfactual explanations.
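
A hedged sketch of global permutation importance and a global surrogate tree, assuming scikit-learn; the data and both models are illustrative:

```python
# Global permutation importance and a global surrogate for a black-box model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# "Black-box" model.
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Global feature importance: shuffle each feature and measure the drop in performance.
result = permutation_importance(black_box, X, y, n_repeats=10, random_state=0)
print("Permutation importances:", result.importances_mean)

# Global surrogate: an interpretable tree trained to mimic the black-box predictions.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))
print("Surrogate fidelity (agreement with black box):",
      surrogate.score(X, black_box.predict(X)))
```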
