Flashcards
Train/Test Paradigm: Evaluating models on data not used for fitting.
Coefficient of Determination (R²): Proportion of variance in the dependent variable predictable from the independent variable.
Root Mean Squared Error (RMSE): The average deviation between predictions and the target.
Extrapolation
Data Leakage
Overfitting
K-Fold Cross-Validation
Leave-One-Out Validation
Monte Carlo Cross Validation
Simple Bootstrap
Class Imbalance
Undersampling
Oversampling
Cost-Sensitive Learning
Data Augmentation
Online Experiments
A/B Testing
Hypothesis Testing
Null Hypothesis
Alternative Hypothesis
Factorial Experiments
Multi-Armed Bandit
Reproducibility
Intrinsic Interpretability
Post Hoc Interpretability
Local Interpretability
Global Interpretability
Local Feature Importance
Global Feature Importance
Intrinsically Interpretable Models
Global Surrogate Models
Local Surrogate Models
Counterfactual Explanations
Counterfactual
Study Notes
- Lecture 7 is on Design and Analysis of Experiments for Data Science and Machine Learning.
Supervised Learning
- A space of functions F maps observations to targets.
- An ideal function f* that is an element of F maps each observation to a target.
- A space of functions Fm is a subset of F that can be modeled.
- The goal is to find the function f in that space that best approximates the ideal function.
- Supervised learning is function approximation.
- A function maps an observation to an outcome.
- An example is a function that maps a binary vector to the number of ones it contains, i.e. [0, 1, 0, 1] -> 2.
- The function isn't known and therefore needs to be approximated.
- This requires a dataset of observations on which the function holds (see the sketch after this list).
- Core assumption: Data should be representative of the real world.
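A minimal sketch of supervised learning as function approximation, using the count-of-ones example above; the choice of scikit-learn's DecisionTreeClassifier is an illustrative assumption, not something prescribed by the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A dataset of observations on which the (unknown) function holds.
X = rng.integers(0, 2, size=(200, 4))   # binary vectors, e.g. [0, 1, 0, 1]
y = X.sum(axis=1)                        # stands in for the ideal function f* (count of ones)

# Pick f from a restricted function space Fm (here: decision trees) to approximate f*.
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[0, 1, 0, 1]]))     # expected output: [2]
```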
Data Science and Design Thinking
- Exploring the problem involves study design and EDA.
- Refining the problem involves design choices.
- Developing the model involves model building and feature engineering.
- Interpreting and communicating involves writing, plotting, and talking.
The Train/Test Paradigm
- Models are evaluated on data that wasn't used for fitting.
- The goal is to estimate the generalisation/out-of-sample error.
- The train/test paradigm means holding out part of the data to simulate "new" data.
- As a running example, consider a dataset of 1,000 observations.
- A common split is 80/20 train/test, i.e. 800 training observations and 200 test observations (see the sketch after this list).
- The test set is used for the final evaluation.
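A minimal sketch of the 80/20 hold-out split described above, on synthetic stand-in data; train_test_split from scikit-learn is one common way to do it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a dataset of 1,000 observations.
X = np.random.rand(1000, 5)
y = np.random.rand(1000)

# Hold out 20% of the data to simulate "new" data for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```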
Performance Measures in Classification
- Accuracy, which is easy to interpret, is calculated as the number of correct predictions divided by the total number of predictions.
- F1-score is less sensitive to class imbalance; it is calculated as TP / (TP + (FP + FN)/2). Accuracy, F1, and the confusion matrix are computed in the sketch after this list.
- A confusion matrix gives details about the errors.
- Some problems have a higher cost for a Type 1 (false positive) or Type 2 (false negative) error.
- Spam classification: erasing a legitimate email has a higher cost.
- Cancer screening: classifying a malignant tumor as benign has a higher cost.
- Churn prediction: missing a customer who might churn has a higher cost.
- Via cost-sensitive classification, algorithms can put more weight on the mistakes that should be avoided.
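A minimal sketch computing the classification measures above on hypothetical labels and predictions, using scikit-learn's metrics module.

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Hypothetical test-set labels and predictions from a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))          # correct / total
print("F1-score:", f1_score(y_true, y_pred))                # TP / (TP + (FP + FN) / 2)
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```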
Performance Measures in Regression
- Coefficient of determination (R²) is the proportion of the variation in the dependent variable (Y) that is predictable from the independent variable (X).
- A value closer to 1 signifies that a large amount of variation in Y is explained by X.
- A value closer to 0 signifies that most of the variation in Y can't be explained by X.
- Residual sum of squares: SS_res = Σ(yᵢ - ŷᵢ)², the sum of squared differences between actual and predicted values.
- Total sum of squares: SS_tot = Σ(yᵢ - ȳ)², the sum of squared differences between the actual values and their mean.
- Coefficient of determination: R² = 1 - SS_res / SS_tot (computed in the sketch after this list).
- Root Mean Squared Error(RMSE) is the square root of the Mean Squared Error (MSE).
- RMSE measures the average deviation between predictions and the target.
- The formula for MSE sums the squared differences between the predicted and actual values, divided by N.
- The formula for RMSE is the square root of MSE.
- Drawbacks of RMSE: it assigns the same penalty to predictions over or under the target, it is sensitive to outliers, its unit cannot be compared across different datasets, and it hides the error distribution.
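A minimal sketch computing R² and RMSE from the definitions above, using hypothetical actual and predicted values.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # actual values
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.7])   # model predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # square root of the MSE

print(f"R² = {r2:.3f}, RMSE = {rmse:.3f}")
```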
Pitfalls of Evaluation
- Extrapolation: predictions are only trustworthy when inputs lie within the range of the observed data (interpolation); going beyond that range is extrapolation.
- For example, if f(x) is trained on x between -10 and 10, you cannot expect reliable predictions at x = 9999.
- Data leakage occurs if duplicated observations end up in both the training and test sets.
- Overfitting: don't aim for a training RMSE of 0 or an accuracy of 1; statistical noise is normal.
- Data quality is based on how the data was collected.
- Biased sampling can lead to irrelevant models.
- Label quality depends on how the data was labelled, e.g. whether labels/target variables come from human annotators.
- Some concepts are inherently harder to label.
- Inter-rater reliability is important to look for when annotators disagree.
- Cohen's Kappa is a metric for categorical labels that measures agreement beyond chance for 2 annotators (see the sketch after this list).
- Fleiss' Kappa is a generalisation of Cohen's Kappa to more than 2 annotators.
- Krippendorff's Alpha is suitable for several label types, e.g. ordinal data such as sentiment labels.
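A minimal sketch of inter-rater reliability using Cohen's Kappa for two annotators with categorical labels; the annotations are hypothetical, and scikit-learn's cohen_kappa_score covers the 2-annotator case.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical sentiment labels from two human annotators for the same 10 items.
annotator_1 = ["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg", "pos"]
annotator_2 = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos", "pos", "pos"]

# 1.0 = perfect agreement, 0.0 = agreement no better than chance.
print("Cohen's Kappa:", cohen_kappa_score(annotator_1, annotator_2))
```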
Resampling Methods
- Resampling methods: cross-validation
- K-fold cross-validation involves out-of-sample testing to estimate the generalisation error using train/test with rotation.
- If there is a dataset of 1,000 observations with an 80/20 train/test split, partition 800 into 4: [200, 200, 200, 200].
- Use partition 1 for validation, partition 2,3,4 for training for performance 1.
- Use partition 2 for validation, partition 1,3,4 for training for performance 2.
- Use partition 3 for validation, partition 1,2,4 for training for performance 3.
- Use partition 4 for validation, partition 1,2,3 for training for performance 4.
- The overall performance is the average of the above calculated performances.
- Cross-validation provides more stable estimates (mean performance with variance); see the sketch after this list.
- K, in relation to cross-validation, is usually between 3 and 10.
- Leave One Out validation: extreme cross-validation where K = the number of observations.
- For example, for 1,000 observations, running 1,000 experiments gathers 1,000 performance measures.
- Cross-validation gives more information about the average behaviour of a model versus the performance for a specific dataset.
- Multiple performance measures can compare distributions instead of single points.
- Tuning hyperparameters requires nested k-fold cross-validation.
- Nested CV runs a cross-validation inside the training folds of an outer cross-validation; hyperparameters are optimised in the inner loop.
- An example is a 4x4 nested cross-validation.
- Monte Carlo Cross Validation is an alternative which generates a train/test split from a random seed, performs the evaluation, increments the seed, and loops.
- Monte Carlo Cross Validation is also known as repeated hold-out.
- Simple bootstrap samples N observations with replacement to create a training set.
- The remaining observations are used for testing.
- This is repeated 50 to 200 times.
- It produces a confidence interval.
- Refined versions include the .632 and .632+ bootstrap estimators.
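A minimal sketch of the 4-fold rotation described above, on synthetic stand-in data; the estimator and scoring choices are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 800 training observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=800)

# 4 folds of 200: each fold is used once for validation, the other three for training.
cv = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("Per-fold R²:", np.round(scores, 3))
print("Overall performance:", scores.mean().round(3), "+/-", scores.std().round(3))
```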
Class Imbalance
- Class Imbalance is a common problem.
- Example: spam is less than 1% of emails.
- Undersampling the majority class: randomly keep only a subset of the majority class to give it less importance.
- Oversampling the minority class: duplicate data from the minority class to give it more importance.
- Cost-sensitive learning: give more weight to false positives or false negatives, depending on which class is the minority.
- Data augmentation involves generating additional data by applying small random noise on the features.
- These techniques are applied to the training set only; you should never manipulate the test set (the sketch after this list applies two of them).
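A minimal sketch of two of the remedies above, cost-sensitive learning via class weights and naive oversampling by duplication, applied to the training set only; the data, estimator, and weighting scheme are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Imbalanced toy training data: 10 positives out of 1,000 (e.g. spam at ~1%).
X_train = rng.normal(size=(1000, 4))
y_train = np.zeros(1000, dtype=int)
y_train[:10] = 1

# Cost-sensitive learning: weight mistakes on the minority class more heavily.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Naive oversampling: duplicate minority-class rows until classes are roughly even.
minority = np.flatnonzero(y_train == 1)
extra = rng.choice(minority, size=(y_train == 0).sum() - minority.size, replace=True)
X_res = np.vstack([X_train, X_train[extra]])
y_res = np.concatenate([y_train, y_train[extra]])
print("Class counts after oversampling:", np.bincount(y_res))
```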
Practical Challenges in Experimentation
- Data leakage: the training and test sets are not properly isolated.
- Computational cost: evaluation can become expensive, e.g. a 10x10 nested cross-validation is problematic.
- Reproducibility: machine learning relies on randomness, which is why the random seed is relevant.
Hypothesis Testing and Statistical Significance
- Hypothesis testing helps determine whether performance differences between models are real or due to chance.
- One cannot prove or disprove an effect, but can see where the evidence is pointing.
- Null hypothesis: “there is no effect”: classifier 1 and classifier 2 perform the same overall.
- P-value: probability of observing results which are at least as extreme as the results observed under the null hypothesis i.e. if there is no actual effect.
- If the p-value is below a predefined threshold (usually 0.05), there is enough evidence to reject the null hypothesis.
- Parametric tests have assumptions about the underlying distribution of the observations.
- Non-parametric tests do not have as many assumptions about the underlying distribution of the observations.
- One-tailed tests test in a specific direction, e.g. "is classifier 1 better than classifier 2?"
- Two-tailed tests are bidirectional, e.g. “is there a difference between classifier 1 and classifier 2?"
- Paired samples use paired observations, e.g. cross-validation with the same folds (as in the sketch after this list).
- Independent samples compare measures obtained on different data, i.e. unpaired observations, ideally with the same statistical properties.
- There is still a small chance that an effect is due to chance even if p < 0.05.
- If p = 0.05, there is a 5% chance of observing a difference at least this large purely by chance, i.e. under the null hypothesis.
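A minimal sketch of a paired, two-tailed test on hypothetical per-fold scores of two classifiers evaluated on the same folds; scipy's ttest_rel is one parametric option (a non-parametric alternative appears later in these notes).

```python
from scipy.stats import ttest_rel

# Hypothetical accuracies of two classifiers on the same 10 cross-validation folds.
clf1 = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.81, 0.80, 0.82]
clf2 = [0.79, 0.78, 0.81, 0.79, 0.80, 0.77, 0.80, 0.79, 0.78, 0.80]

# Null hypothesis: the two classifiers perform the same overall.
t_stat, p_value = ttest_rel(clf1, clf2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis at alpha = 0.05.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```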
Multiple Comparison Problem
- Recall that the p-value is the probability of observing results at least as extreme as those observed if there is no actual effect.
- There is a small chance that an effect is due to chance even if p < 0.05.
- Each additional comparison compounds the risk of making a Type 1 error.
- This is the Multiple Comparison Problem, for which corrections exist.
- The best-known corrections are the Bonferroni correction and the Benjamini–Hochberg procedure (see the sketch after this list).
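A minimal sketch of both corrections applied to hypothetical p-values, implemented directly with NumPy.

```python
import numpy as np

alpha = 0.05
pvals = np.array([0.002, 0.01, 0.03, 0.04, 0.20])  # hypothetical p-values from 5 comparisons
m = len(pvals)

# Bonferroni: compare each p-value to alpha / m (equivalent to multiplying p-values by m).
print("Bonferroni rejections:", pvals < alpha / m)

# Benjamini-Hochberg: find the largest k with p_(k) <= (k / m) * alpha, reject the k smallest.
order = np.argsort(pvals)
sorted_p = pvals[order]
passed = np.nonzero(sorted_p <= (np.arange(1, m + 1) / m) * alpha)[0]
bh_reject = np.zeros(m, dtype=bool)
if passed.size > 0:
    bh_reject[order[: passed[-1] + 1]] = True
print("Benjamini-Hochberg rejections:", bh_reject)
```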
Interpreting Results
- Statistical significance testing is not always done in machine learning experiments, but doing so is more rigorous.
- Rejecting the null hypothesis does not prove that your alternative hypothesis is correct.
- Not being able to reject the null hypothesis does not mean it is necessarily true.
- Replication studies confirm a hypothesis.
- Machine learning experiments violate many assumptions of statistical tests, so it is better to use tests with the most relaxed assumptions.
- The Wilcoxon Signed Rank test compares algorithms on k-fold cross-validation (see the sketch after this list).
- McNemar test is used if you cannot afford to cross-validate.
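A minimal sketch of the Wilcoxon Signed Rank test on hypothetical per-fold accuracies of two algorithms compared over the same k-fold cross-validation; scipy's wilcoxon implements it.

```python
from scipy.stats import wilcoxon

# Hypothetical accuracies of two algorithms on the same 10 cross-validation folds.
algo_a = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.81, 0.80, 0.82]
algo_b = [0.79, 0.78, 0.81, 0.79, 0.80, 0.77, 0.80, 0.79, 0.78, 0.80]

# Non-parametric paired test: no normality assumption on the score differences.
stat, p_value = wilcoxon(algo_a, algo_b)
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```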
Factorial Experiments
- Structured ways of measuring the impact of multiple factors and their interaction on performance.
- Factors: hyperparameters, preprocessing choices, model architecture, and pipeline design choices.
- Types of factorial designs: full factorial design, fractional factorial design, and Plackett-Burman design (a full factorial sketch follows below).
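A minimal sketch of a full factorial design over hypothetical pipeline factors; each generated configuration would then be evaluated, e.g. with k-fold cross-validation.

```python
from itertools import product

# Hypothetical factors and their levels (hyperparameters, preprocessing, architecture).
factors = {
    "scaling": ["none", "standard"],
    "n_estimators": [100, 500],
    "max_depth": [3, 10],
}

# Full factorial design: every combination of every factor level (2 x 2 x 2 = 8 runs here).
for run_id, combo in enumerate(product(*factors.values()), start=1):
    config = dict(zip(factors.keys(), combo))
    print(f"Run {run_id}: {config}")   # evaluate each configuration, e.g. with k-fold CV
```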
Online Experiments and A/B Testing
- Online experiments evaluate models in deployment, so the model is not taken offline for evaluation.
- A/B testing compares two (or more) variants of a system in a controlled experiment.
- Their foundation is in hypothesis testing to quantify the likelihood of observed differences being due to real effects rather than random chance.
- Common tests: the G-test for yes/no metrics and the Z-test for numerical targets.
- A hypothesis defines the specific difference you want to test for.
- Sample Size Determination: Estimate the sample size needed.
- Users are randomly assigned to either the control group (version A) or treatment group(s) (versions B, C, etc.).
- Statistical tests such as chi-squared or t-tests are used to determine if observed differences reach statistical significance (see the sketch after this list).
- Null vs. alternative hypothesis: the common significance level (alpha) for rejecting the null hypothesis is 0.05 (a 5% risk).
- Statistical Power (1-beta) is the probability of correctly rejecting the null hypothesis when it's false.
- Effect Size is the minimal difference you wish to detect.
- Multi-Armed Bandit is an advanced design that compares two or more variations and allocates participants to variations based on their performance history.
- MAB algorithms include the Epsilon-Greedy algorithm, Upper Confidence Bound, Thompson Sampling, and Contextual Bandits.
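A minimal sketch of an A/B test on a yes/no conversion metric using a chi-squared test, with hypothetical counts for the control and treatment groups.

```python
from scipy.stats import chi2_contingency

# Hypothetical results: conversions vs. non-conversions for control (A) and treatment (B).
#         converted  not converted
table = [[120, 4880],    # version A (control), 5,000 users
         [150, 4850]]    # version B (treatment), 5,000 users

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi² = {chi2:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in conversion rate is statistically significant.")
else:
    print("No statistically significant difference detected.")
```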
Reproducibility and Interpretability
- Reproducibility is essential for scientific progress, as it allows other researchers to verify and build upon existing work.
- Factors include variations in data collection and preprocessing, differences in coding practices, and the use of random seeds.
- Reproducibility focuses on re-doing the experiment with the same data.
- Replicability focuses on re-doing the experiment with new data to get the same finding.
- Interpretability matters to have trust and confidence in decisions, for debugging and model improvement, for regulatory & ethical compliance, and to discover new insights from the model itself.
- Intrinsic interpretability is a property of a model.
- Post hoc interpretability is interpreting a prediction after the fact.
- Interpretability methods are either model-specific (applicable to one class of model) or model-agnostic (applicable to any model).
- Explanations can be local (about a single prediction) or global (about the overall model).
Feature Importance and Surrogate Models
- Global Feature Importance identifies the overall impact features have on the model's output.
- Local Feature Importance explains the contributions of features to a single prediction.
- These are achieved via Permutation Importance and SHAP (SHapley Additive exPlanations); see the permutation importance sketch after this list.
- Decision trees and nearest-neighbour models are easy to interpret.
- Global surrogate models use an interpretable model (e.g. a decision tree) to approximate the predictions of a black-box model such as an SVM.
- Local surrogate models explain individual predictions of black box machine learning models.
- Local surrogate models generate a new dataset around a single data point, e.g. by switching features off.
- An interpretable model is then trained on that new dataset to explain the prediction.
- Counterfactual explanations express what has not happened or is not the case.
- They are used to explain individual predictions and to identify potential causal links.
- Advantages: counterfactual explanations are clear and easy to implement; a disadvantage is that many different counterfactuals can explain a single prediction.
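A minimal sketch of global feature importance via permutation importance on a toy dataset; the estimator choice is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Toy data: only the first two features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=500)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature and measure the drop in score: a global, post hoc, model-agnostic view.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: {importance:.3f}")
```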