Questions and Answers
In supervised learning, what is the primary goal?
- To create a representative sample of the real world.
- To identify the core assumptions in a dataset.
- To discover new patterns in unlabeled data.
- To approximate a function that maps observations to outcomes. (correct)
During the 'Refining the problem' stage of data science and design thinking, illustrated in the lecture, what is the main task?
- Design choices. (correct)
- Model building.
- Study design.
- Feature engineering.
In the context of evaluating models, what does the train/test paradigm aim to estimate?
- Generalisation error. (correct)
- Training time.
- The dataset's size.
- The model's complexity.
What is a key advantage of using the F1-score over accuracy in classification problems?
In a classification problem, if a model incorrectly classifies a malignant tumor as benign, which type of error is this considered?
In regression analysis, what does a higher value of the coefficient of determination (R²) indicate?
What is a noted drawback of using Root Mean Squared Error (RMSE) as a performance metric?
What is 'data leakage' in the context of model evaluation?
When evaluating models, what issue does the use of resampling methods, such as cross-validation, primarily address?
Why is nested k-fold cross-validation used?
What is the key characteristic of 'Monte Carlo Cross Validation'?
In the context of resampling methods, what is the purpose of sampling with replacement?
When dealing with class imbalance, which of the following techniques should ONLY be applied to the training set?
What is a potential argument for using a fixed random seed in machine learning experiments?
In hypothesis testing, failing to reject the null hypothesis means:
What is a key assumption of parametric statistical tests?
What does a 'one-tailed test' in statistical testing evaluate?
What statistical issue arises when conducting multiple comparisons, such as comparing a set of classifiers against each other?
What is the initial action to take when facing the multiple comparisons problem?
What should machine learning experiments do to address violations of statistical test assumptions?
What is the primary focus of factorial experiments?
What is a key characteristic of A/B testing in online experiments?
What does the null hypothesis state in A/B testing?
In A/B testing, what does 'statistical power' refer to?
How does Multi-Armed Bandit (MAB) testing differ from traditional A/B testing?
In the Epsilon-Greedy Algorithm, what is the 'balancing act'?
What does UCB stand for in the context of MAB variations?
What features do contextual bandits use to inform arm selection?
What does reproducibility let researchers do?
In the context of scientific research, what does 'reproducibility' primarily refer to?
In the context of interpretability, what does 'intrinsic' refer to?
What does local feature importance focus on?
Which models are easy to interpret?
What is LIME?
In the context of machine learning interpretability, what do counterfactual explanations aim to provide?
What is a potential disadvantage of using counterfactual explanations?
Flashcards
Train/Test Paradigm
Evaluating models on data not used for fitting to estimate generalization
Coefficient of determination (R²)
Proportion of variance in the dependent variable predictable from the independent variable.
Root Mean Squared Error (RMSE)
Square root of the Mean Squared Error; measures average deviation between predictions and target.
K-fold cross-validation
The dataset is split into k partitions; each partition is used once as a validation set while the remaining partitions form the training set.
Nested cross-validation
Cross-validation with an inner loop for tuning hyperparameters and an outer loop for estimating generalisation error.
Simple bootstrap
N observations are sampled with replacement for training; the remaining observations are used for testing.
Class imbalance
A situation where the minority class is underrepresented in the data yet often the class of greatest importance.
Undersampling
Randomly discarding majority-class observations to reduce their dominance.
Oversampling
Duplicating minority-class observations to give them more importance.
Factorial experiments
Structured designs for measuring the impact of multiple factors, and their interactions, on performance.
Online experiments
Experiments run on deployed models that cannot be taken offline, e.g. A/B tests.
A/B testing hypothesis
A specific, testable statement about the expected difference between the control and treatment variants.
Null Hypothesis (H0)
The assumption that there is no difference in the metric between the variants being compared.
Alternative Hypothesis (H1)
The statement that a real difference exists between the variants.
Significance Level (alpha)
The predefined threshold (commonly 0.05) below which a p-value leads to rejecting the null hypothesis.
Statistical Power (1-beta)
The probability of detecting a real effect when one exists.
Effect Size
The magnitude of the observed difference, e.g. a percentage increase in conversion or a reduction in time.
MAB
Multi-Armed Bandit: an adaptive design that continuously reallocates participants towards better-performing variations.
Reproducibility
Obtaining the same results by re-running the same analysis on the same data.
Replicability
Obtaining consistent results when the hypothesis is tested with a different experiment, e.g. on new data.
Counterfactual explanations
Explanations of the form "if feature X had been Y, the prediction would have been different", pointing to potential causal links.
Study Notes
- Lecture 7 focuses on the design and analysis of experiments for data science and machine learning
Supervised Learning as Function Approximation
- Supervised learning involves approximating an ideal function (f*) that maps observations to targets
- The goal is to find a function (f) within a modelable space (Fm) that best approximates f*
- A core assumption in supervised learning is data representativeness of the real world
Data Science and Design Thinking
- The data science process involves:
- Exploring the problem, which includes study design and EDA
- Refining the problem via design choices
- Developing the model through model building and feature engineering
- Interpreting and communicating the results through writing, plotting, and talking
The Train/Test Paradigm
- Models are evaluated using data not used for fitting to estimate generalisation or out-of-sample error
- Datasets are split into training and testing sets
- For example, a dataset of 1,000 observations might use an 80/20 split, resulting in 800 training and 200 testing observations
- Testing is used for final evaluation
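A minimal sketch of such a split, assuming scikit-learn and synthetic data (not part of the lecture material):

```python
# Minimal sketch of an 80/20 train/test split with scikit-learn (illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # 1,000 observations, 5 features (synthetic)
y = np.random.randint(0, 2, 1000)    # binary targets (synthetic)

# 80% for fitting, 20% held out for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))     # 800 200
```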
Performance Measures in Classification
- Accuracy is a measure that is easy to interpret but provides an overall rate without detail
- F1-score is a measure that is less sensitive to class imbalances
- A confusion matrix presents the error specifics; false positives and false negatives
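A brief illustration of these three measures, assuming scikit-learn and hypothetical labels:

```python
# Illustrative computation of accuracy, F1-score and a confusion matrix (scikit-learn).
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]    # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]    # hypothetical model predictions

print(accuracy_score(y_true, y_pred))      # overall rate, no error detail
print(f1_score(y_true, y_pred))            # less sensitive to class imbalance
print(confusion_matrix(y_true, y_pred))    # rows: true class, columns: predicted class
# [[TN FP]
#  [FN TP]]
```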
Errors in Classification
- Type 1 error is a false positive
- Type 2 error is a false negative
- Some problems have a higher cost for a type 1 or type 2 error
- Cost-sensitive classification algorithms give more weight to the mistakes that should be avoided
Performance Measures in Regression
- Coefficient of determination (R²) indicates the variation proportion in the dependent variable (Y) predictable from the independent variable (X)
- R² closer to 1 indicates that a large amount of Y variation is explained by X
- R² closer to 0 indicates that most of Y variation is not explained by X
- Root Mean Squared Error (RMSE) measures the average deviation between predictions and target
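A small sketch computing both metrics with scikit-learn on hypothetical values:

```python
# Illustrative computation of R² and RMSE for a regression model (scikit-learn).
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical targets
y_pred = np.array([2.8, 5.3, 2.9, 6.5])   # hypothetical predictions

r2 = r2_score(y_true, y_pred)                       # closer to 1: more variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # average deviation, in target units
print(r2, rmse)
```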
Drawbacks of RMSE
- Applies the same penalty whether the prediction is over or under the target
- Sensitive to outliers
- Scale-dependent (expressed in the target's units), so values are not comparable across datasets
- Hides the error distribution
Pitfalls of Evaluation
- Inputs need to be within the range of the observed data
- Interpolation vs extrapolation problems
- Data leakage can occur when information (e.g., duplicated observations) is shared between training and test sets
- Overfitting results from statistical noise being interpreted as genuine patterns
- Irrelevant models can result from biased sampling
- Data labels can be impacted by human annotators, leading to disagreements in interpretation
Resampling Methods: Cross-Validation
- K-fold cross-validation helps estimate generalisation error
- The dataset is split into k partitions
- Each partition is used once as a validation set while the remaining partitions form the training set
- The performance is averaged across all k trials for a more stable estimate
- "Leave One Out" validation is an extreme form of cross-validation, where k equals the number of observations
- Nested k-fold cross-validation is used to tune hyperparameters
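A sketch of both plain and nested k-fold cross-validation, assuming scikit-learn; the Iris dataset and the SVM hyperparameter grid are illustrative stand-ins:

```python
# Sketch of k-fold CV and nested CV for hyperparameter tuning (scikit-learn; details assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Plain k-fold: average performance over k validation folds
outer = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(SVC(), X, y, cv=outer).mean())

# Nested CV: the inner loop tunes hyperparameters, the outer loop estimates generalisation
inner = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)
print(cross_val_score(search, X, y, cv=outer).mean())
```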
Resampling Methods: Monte Carlo Cross Validation
- Monte Carlo Cross Validation is an alternative cross-validation approach
- Generate a train/test split based on a random seed
- Perform the evaluation, then increment the seed and repeat
- Repeated hold-out is the same as Monte Carlo Cross Validation
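A sketch of Monte Carlo cross-validation as repeated hold-out with an incremented seed; the dataset and model (scikit-learn) are illustrative:

```python
# Monte Carlo cross-validation as repeated hold-out with an incremented seed (sketch).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
scores = []
for seed in range(20):                      # repeat the split, incrementing the seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))
print(np.mean(scores), np.std(scores))
```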
Resampling Methods: Simple Bootstrap
- N observations are sampled with replacement to create a training set
- Remaining observations are used for testing
- Process is repeated 50 to 200 times
- Can produce a confidence interval
- Refined versions include the .632 and .632+ bootstrap estimators
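A sketch of the simple bootstrap with out-of-bag testing, assuming NumPy and scikit-learn; the dataset and repetition count are illustrative:

```python
# Simple bootstrap estimate of test performance with a rough confidence interval (sketch).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n, scores = len(X), []

for _ in range(200):                                   # 50-200 repetitions
    idx = rng.integers(0, n, size=n)                   # sample N observations with replacement
    oob = np.setdiff1d(np.arange(n), idx)              # remaining (out-of-bag) observations for testing
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(accuracy_score(y[oob], model.predict(X[oob])))

print(np.percentile(scores, [2.5, 97.5]))              # rough 95% confidence interval
```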
Solving Class Imbalance Problem
- Class imbalance is a common ML problem in which the minority class is underrepresented yet often of significant importance
- Undersampling the majority class randomly discards observations to reduce its dominance
- Oversampling the minority class duplicates observations to give it more importance
- Cost-sensitive learning can give more weight to false positives/negatives
- Data augmentation generates additional data by applying small random noise on the features
- You should not manipulate the test set
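A sketch of random oversampling applied only to the training split, using NumPy on synthetic data:

```python
# Sketch of random oversampling of the minority class, applied to the training set only.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)
y = np.array([0] * 950 + [1] * 50)                 # synthetic imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - len(minority), replace=True)

X_bal = np.vstack([X_tr, X_tr[extra]])             # duplicate minority examples in training data
y_bal = np.concatenate([y_tr, y_tr[extra]])
# The test set (X_te, y_te) is left untouched.
```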
Practical Challenges in Experimentation
- Data leakage must be avoided (test and training sets)
- Expensive evaluation of nested cross-validation can be problematic
- Machine learning relies a lot on randomness (e.g., random seed)
Historical Note: Student's T Test
- William S. Gosset developed the T test
- Gosset was hired as the Head Experimental Brewer by Guinness and was not allowed to publish under his own name (hence the pseudonym "Student")
- Gosset published several papers
- Gosset was under the guidance of Karl Pearson
Hypothesis Testing and Statistical Significance
- Hypothesis testing determines the likelihood of performance differences being real versus due to chance
- A p-value indicates the probability of observing results at least as extreme as those observed, assuming the null hypothesis is true
- If the p-value falls below the predefined threshold (usually 0.05), there is enough evidence to reject the null hypothesis
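A sketch of this logic, assuming SciPy and hypothetical per-fold scores for two models (a paired t-test is used here purely for illustration):

```python
# Sketch of a paired t-test on per-fold scores of two models (SciPy; illustrative numbers).
from scipy import stats

scores_a = [0.81, 0.79, 0.83, 0.80, 0.82]   # hypothetical fold scores, model A
scores_b = [0.78, 0.77, 0.80, 0.79, 0.78]   # hypothetical fold scores, model B

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
if p_value < 0.05:                           # predefined significance threshold
    print("Reject the null hypothesis: the difference is unlikely to be chance.")
else:
    print("Fail to reject the null hypothesis.")
```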
Parametric and Non-Parametric Tests
- Parametric tests make assumptions about data distribution
- Non-parametric tests make fewer assumptions
Statistical Tests
- One-tailed tests are for a specific direction
- Two-tailed tests are bidirectional
- Paired observations arise when models are cross-validated on the same folds
- Independent samples arise when results are compared on different data
The Multiple Comparison Problem
- Multiple comparisons increase the risk of making a Type 1 error
- Correct with the Bonferroni correction (divide the significance level by the number of comparisons)
- Avoid too many comparisons
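A minimal sketch of the Bonferroni correction on hypothetical p-values:

```python
# Sketch of the Bonferroni correction: divide the significance level by the number of tests.
p_values = [0.04, 0.01, 0.03, 0.20]   # hypothetical p-values from multiple comparisons
alpha = 0.05
corrected_alpha = alpha / len(p_values)

for p in p_values:
    print(p, "significant" if p < corrected_alpha else "not significant")
```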
Interpreting Results
- It is more rigorous to use statistical significance testing
- Replication studies are needed to confirm a hypothesis
- There may be alternative explanations
Hypothesis Testing for Machine Learning
- Hypothesis testing on machine learning results can violate test assumptions (e.g., independence of observations)
- It is better to use relaxed tests such as the Wilcoxon signed-rank test for k-fold cross-validation
- The McNemar test can be used if you cannot afford to cross-validate
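A sketch of both tests, assuming SciPy and statsmodels; all scores and counts are hypothetical:

```python
# Sketch: Wilcoxon signed-rank test on paired fold scores, and McNemar's test on one split.
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Wilcoxon signed-rank test on k-fold scores of two classifiers (hypothetical values)
scores_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.82, 0.81]
scores_b = [0.78, 0.77, 0.80, 0.79, 0.78, 0.76, 0.81, 0.78, 0.79, 0.77]
print(wilcoxon(scores_a, scores_b).pvalue)

# McNemar's test on a single test set when cross-validation is too expensive;
# contingency table of (A correct/incorrect) x (B correct/incorrect), hypothetical counts
table = [[85, 10],
         [4, 21]]
print(mcnemar(table, exact=True).pvalue)
```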
Factorial Experiments
- Structured ways to measure impact of multiple factors and their interaction on performance
- Full Factorial, Fractional Factorial, Plackett-Burman designs
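A minimal sketch of a full factorial design over assumed hyperparameter factors, using itertools:

```python
# Sketch of a full factorial design: every combination of factor levels is evaluated.
from itertools import product

factors = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 500],
    "max_depth": [3, 6],
}

for combo in product(*factors.values()):        # 2 x 2 x 2 = 8 runs
    config = dict(zip(factors.keys(), combo))
    print(config)                               # train/evaluate a model per configuration here
```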
Online Experiments and A/B Testing
- Online models are deployed and cannot be taken offline
- A/B testing is a randomised experiment comparing two variants, e.g. the current model (A) against a new one (B), with users randomly allocated to each
- Statistical tests quantify whether observed differences reflect real effects or random chance
- G-test is used for yes/no metrics
- Z-test is used for numerical targets
A/B Testing Steps
- Define a specific hypothesis
- Sample size determination
- Traffic Allocation (control vs treatment groups)
- Statistical Analysis using Chi-squared or t-tests
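A sketch of the analysis step for a yes/no conversion metric; a two-proportion z-test from statsmodels is used here as an illustrative stand-in for the chi-squared/G-tests mentioned above, with hypothetical counts:

```python
# Sketch: two-proportion z-test for a yes/no conversion metric (statsmodels).
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 145]     # hypothetical conversions in control and treatment
visitors = [2400, 2380]      # hypothetical visitors allocated to each group

z_stat, p_value = proportions_ztest(conversions, visitors)
print(p_value < 0.05)        # reject H0 (no difference) if below the significance level
```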
A/B Testing Concepts
- The null hypothesis states that there is no difference in the metric between variants
- Aim for high statistical power (the probability of detecting a real effect) when testing the hypothesis
- Effect size is the magnitude of the difference, e.g. a percentage increase in conversion or a reduction in time
Advanced Designs: Multi-Armed Bandit (MAB)
- Participants are continuously allocated based on performance history
- Allocation increases for high-performing variations
- Underperforming variations see their allocation decrease and are stopped once a threshold is reached
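A sketch of the epsilon-greedy algorithm's explore/exploit balance on three arms with assumed conversion rates:

```python
# Sketch of the epsilon-greedy algorithm: explore with probability epsilon, else exploit.
import random

true_rates = [0.05, 0.08, 0.11]        # hypothetical conversion rates of three arms
counts = [0] * 3
values = [0.0] * 3                     # running mean reward per arm
epsilon = 0.1

for _ in range(10_000):
    if random.random() < epsilon:                      # explore: pick a random arm
        arm = random.randrange(3)
    else:                                              # exploit: pick the best arm so far
        arm = max(range(3), key=lambda a: values[a])
    reward = 1 if random.random() < true_rates[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print(counts)   # allocations concentrate on the best-performing arm
```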
Thompson Sampling
- Uses a Bayesian approach to model the probability that each arm is the best
- An arm is selected by sampling from each arm's posterior and choosing the highest sampled value
- The posterior distribution is updated with each observed outcome
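A sketch of Thompson sampling with Beta posteriors for binary rewards; the arm conversion rates are assumed:

```python
# Sketch of Thompson sampling with Beta posteriors for binary rewards.
import random

true_rates = [0.05, 0.08, 0.11]        # hypothetical conversion rates of three arms
alpha = [1] * 3                        # Beta posterior: successes + 1
beta = [1] * 3                         # Beta posterior: failures + 1

for _ in range(10_000):
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(3)]
    arm = max(range(3), key=lambda a: samples[a])      # pick the arm with the highest sample
    reward = 1 if random.random() < true_rates[arm] else 0
    alpha[arm] += reward                               # update the posterior with the outcome
    beta[arm] += 1 - reward

print([a / (a + b) for a, b in zip(alpha, beta)])      # posterior mean per arm
```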
Reproducibility and Replicability
- Reproducibility is obtaining the same results by re-running the same analysis on the same data
- Replicability is obtaining consistent results when the hypothesis is tested with a different experiment, e.g. on new data
Types of Interpretability
- Interpretability matters for trusting decisions, debugging models, and discovering new insights
- Intrinsic interpretability is a property of the model itself
- Post-hoc interpretability is derived after predictions are made
- Model-specific vs model-agnostic
- Model-specific methods apply to a particular class of models
- Model-agnostic methods can be applied to any model
- Local vs global interpretability
- Local methods explain a single prediction
- Global methods explain the overall behaviour of the model
Feature Importance
- Global feature importance identifies each feature's overall impact on the model's output
- Local feature importance explains feature contributions to an individual prediction
- SHAP stands for SHapley Additive exPlanations
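As an illustrative sketch of global, model-agnostic feature importance, permutation importance from scikit-learn is shown below; it is a stand-in for the idea, not SHAP itself:

```python
# Sketch: permutation importance as a global, model-agnostic feature-importance measure.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)   # overall impact of each feature on the model's output
```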
Intrinsically Interpretable Models
- Decision trees and nearest-neighbour models are easy to interpret
Global Surrogate Models
- A global surrogate is an interpretable model trained to approximate the predictions of a black-box model
- Example: train an SVM (the black box)
- Relabel the data with the SVM's predictions
- Train a decision tree on those labels
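A sketch of these three steps, assuming scikit-learn and the Iris dataset as stand-ins:

```python
# Sketch of a global surrogate: fit a decision tree to mimic an SVM's predictions.
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

black_box = SVC().fit(X, y)              # 1. train the black-box model (SVM)
surrogate_labels = black_box.predict(X)  # 2. relabel the data with its predictions
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, surrogate_labels)  # 3. interpretable tree

print(export_text(surrogate))            # human-readable approximation of the SVM
```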
Local Surrogate Models - LIME
- Local surrogate models (LIME) explain individual predictions of a machine learning model
- Generate a new dataset by perturbing a single observation (features switched on/off or otherwise varied)
- Train an interpretable model on that dataset, weighted by proximity to the original observation
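A sketch using the third-party lime package on an assumed classifier; treat the specific usage details as illustrative:

```python
# Sketch of LIME for a single prediction (requires the `lime` package; usage details assumed).
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data, feature_names=data.feature_names, class_names=data.target_names
)
# Perturb one observation, fit a local interpretable model, and report feature weights
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```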
Counterfactual Explanations
- Counterfactuals express what would have happened if a feature value had been different
- For example: "if feature X had been Y, the prediction would have been positive"
- They help identify potential causal links
- A disadvantage is that multiple valid counterfactual explanations may exist