Experiment Design for Data Science

Questions and Answers

In supervised learning, what is the primary goal?

  • To create a representative sample of the real world.
  • To identify the core assumptions in a dataset.
  • To discover new patterns in unlabeled data.
  • To approximate a function that maps observations to outcomes. (correct)

During the 'Refining the problem' stage of data science and design thinking, illustrated in the lecture, what is the main task?

  • Design choices. (correct)
  • Model building.
  • Study design.
  • Feature engineering.

In the context of evaluating models, what does the train/test paradigm aim to estimate?

  • Generalisation error. (correct)
  • Training time.
  • The dataset's size.
  • The model's complexity.

What is a key advantage of using the F1-score over accuracy in classification problems?

  • It is less sensitive to class imbalances. (correct)

In a classification problem, if a model incorrectly classifies a malignant tumor as benign, which type of error is this considered?

  • Type 2 error. (correct)

In regression analysis, what does a higher value of the coefficient of determination (R²) indicate?

  • A larger amount of variation in Y is explained by X. (correct)

What is a noted drawback of using Root Mean Squared Error (RMSE) as a performance metric?

  • It treats over- and under-predictions equally. (correct)

What is 'data leakage' in the context of model evaluation?

  • Duplicated observations ending up in both training and test sets. (correct)

When evaluating models, what issue does the use of resampling methods, such as cross-validation, primarily address?

  • Generalisation error. (correct)

Why is nested k-fold cross-validation used?

  • To tune hyperparameters. (correct)

What is the key characteristic of 'Monte Carlo Cross Validation'?

  • It generates train/test splits based on a random seed. (correct)

In the context of resampling methods, what is the purpose of sampling with replacement?

  • To create a training set. (correct)

When dealing with class imbalance, which of the following techniques should ONLY be applied to the training set?

  • Applying data augmentation techniques. (correct)

What is a potential argument for using a fixed random seed in machine learning experiments?

  • To ensure reproducibility. (correct)

In hypothesis testing, failing to reject the null hypothesis means:

  • There isn't enough evidence to reject the null hypothesis. (correct)

What is a key assumption of parametric statistical tests?

  • They make assumptions about the underlying distribution of the observations. (correct)

What does a 'one-tailed test' in statistical testing evaluate?

  • A difference in a specific direction, such as whether classifier 1 is better than classifier 2. (correct)

What statistical issue arises when conducting multiple comparisons, such as comparing a set of classifiers against each other?

  • Increased likelihood of making a Type 1 error. (correct)

What is the initial action to take when facing the multiple comparisons problem?

  • Avoid making too many comparisons. (correct)

What should machine learning experiments do to address violations of statistical test assumptions?

  • Use more relaxed tests. (correct)

What is the primary focus of factorial experiments?

  • Impact of multiple factors and their interaction on performance. (correct)

What is a key characteristic of A/B testing in online experiments?

  • It's a randomized controlled experiment comparing two variants of a system. (correct)

What does the null hypothesis state in A/B testing?

  • There's no difference in the performance metric. (correct)

In A/B testing, what does 'statistical power' refer to?

  • The probability of correctly rejecting the null hypothesis when it is false. (correct)

How does Multi-Armed Bandit (MAB) testing differ from traditional A/B testing?

  • MAB takes a more adaptive approach. (correct)

In the Epsilon-Greedy Algorithm, what is the 'balancing act'?

  • Balancing exploitation and exploration. (correct)

What does UCB mean in terms of MAB variations?

  • Upper Confidence Bound. (correct)

What features do contextual bandits use to help inform arm selection?

  • Features of the user or the environment. (correct)

What does reproducibility allow other researchers to do?

  • To verify and build upon our work. (correct)

In the context of scientific research, what does 'reproducibility' primarily refer to?

  • Obtaining the same results using the original data and code. (correct)

What characterises intrinsic interpretability?

  • Interpretability is a built-in property of the model itself. (correct)

What does Local feature importance focus on?

  • A single prediction. (correct)

Which models are easy to interpret?

  • Decision trees and nearest-neighbour models. (correct)

What is LIME?

  • It generates a new dataset around a single observation and trains an interpretable model on it. (correct)

In the context of machine learning interpretability, what do counterfactual explanations aim to provide?

  • An explanation of why the model made a specific prediction. (correct)

What is a potential disadvantage of using counterfactual explanations?

  • There are often multiple possible counterfactual explanations. (correct)

Flashcards

Train/Test Paradigm

Evaluating models on data not used for fitting to estimate generalization

Coefficient of determination (R²)

Proportion of variance in the dependent variable predictable from the independent variable.

Root Mean Squared Error (RMSE)

Square root of the Mean Squared Error; measures average deviation between predictions and target.

K-fold cross-validation

k-fold splits the data into k groups. Each group serves as test set in one round.

Nested cross-validation

Cross-validation run inside the training folds of an outer cross-validation, used to tune hyperparameters.

Simple bootstrap

Sampling N observations with replacement to create a training set

Class imbalance

When one class is much rarer than the others, e.g., spam making up less than 1% of e-mails.

Undersampling

Randomly sampling from the majority class to give it less importance.

Oversampling

Duplicating data from the minority class to give it more importance.

Factorial experiments

A structured way of measuring impact of multiple factors and their interaction on performance

Online experiments

Model is deployed, and you cannot afford to take it offline for evaluation

A/B testing hypothesis

Define the specific difference you believe your variations will cause.

Null Hypothesis (H0)

There's no difference in the performance metric.

Alternative Hypothesis (H1)

There is a difference in the performance metric.

Significance Level (alpha)

Acceptable level of risk for a false positive.

Statistical Power (1-beta)

Probability of correctly rejecting the null hypothesis when it is false.

Effect Size

The minimal difference you wish to detect.

MAB

Multi-Armed Bandit; an adaptive approach that shifts traffic towards better-performing variants during the experiment.

Reproducibility

Obtaining the same results using the original data and code, so other researchers can verify and build upon our work.

Replicability

Independent researchers obtaining consistent results with their own experiment or data.

Counterfactual explanations

Explain individual predictions by describing how the input would need to change for the prediction to change.

Study Notes

  • Lecture 7 focuses on the design and analysis of experiments for data science and machine learning

Supervised Learning as Function Approximation

  • Supervised learning involves approximating an ideal function (f*) that maps observations to targets
  • The goal is to find a function (f) within a modelable space (Fm) that best approximates f*
  • A core assumption in supervised learning is data representativeness of the real world

Data Science and Design Thinking

  • The data science process involves:
  • Exploring the problem, which includes study design and EDA
  • Refining the problem via design choices
  • Developing the model through model building and feature engineering
  • Interpreting and communicating the results through writing, plotting, and talking

The Train/Test Paradigm

  • Models are evaluated using data not used for fitting to estimate generalisation or out-of-sample error
  • Datasets are split into training and testing sets
  • For example, a dataset of 1,000 observations might use an 80/20 split, resulting in 800 training and 200 testing observations
  • The test set is used only for the final evaluation, as in the sketch below
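
A minimal sketch of the 80/20 split described above, using scikit-learn and a synthetic 1,000-row dataset (the dataset is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset of 1,000 observations.
X, y = make_classification(n_samples=1000, random_state=42)

# 80/20 split: 800 training and 200 testing observations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```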

Performance Measures in Classification

  • Accuracy is a measure that is easy to interpret but provides an overall rate without detail
  • F1-score is a measure that is less sensitive to class imbalances
  • A confusion matrix presents the error specifics; false positives and false negatives
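
As a small illustration of these measures, the snippet below computes accuracy, F1 and the confusion matrix on made-up labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy predictions

print(accuracy_score(y_true, y_pred))    # overall rate, no detail about error types
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted class
```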

Errors in Classification

  • Type 1 error is a false positive
  • Type 2 error is a false negative
  • Some problems have a higher cost for a type 1 or type 2 error
  • Cost-sensitive classification algorithms put extra weight on the mistakes you most want to avoid, as in the sketch below
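
One common way to weight mistakes is the `class_weight` argument offered by many scikit-learn classifiers; the 1:10 weighting and the synthetic data below are illustrative choices, not taken from the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data; class 1 plays the role of the costly class (e.g. malignant).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Errors on class 1 are penalised 10x more during fitting, discouraging type 2 errors.
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```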

Performance Measures in Regression

  • Coefficient of determination (R²) indicates the variation proportion in the dependent variable (Y) predictable from the independent variable (X)
  • R² closer to 1 indicates that a large amount of Y variation is explained by X
  • R² closer to 0 indicates that most of Y variation is not explained by X
  • Root Mean Squared Error (RMSE) measures the average deviation between predictions and target
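
A short sketch of both regression metrics on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # toy targets
y_pred = np.array([2.8, 5.3, 6.9, 9.4])   # toy predictions

r2 = r2_score(y_true, y_pred)                       # closer to 1 = more variation explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # average deviation, in the units of Y
```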

Drawbacks of RMSE

  • Applies the same penalty to over- and under-predictions
  • Sensitive to outliers
  • Not unit-free, so values are not comparable across datasets
  • Hides the distribution of errors

Pitfalls of Evaluation

  • Inputs need to be within the range of the observed data
  • Interpolation vs extrapolation problems
  • Data leakage can cause duplicated observations to appear in both training and test sets
  • Overfitting results from statistical noise being interpreted as a real pattern
  • Irrelevant models can result from biased sampling
  • Data labels produced by human annotators may suffer from disagreement between interpretations

Resampling Methods: Cross-Validation

  • K-fold cross-validation helps estimate generalisation error
  • The dataset is split into k partitions
  • Each partition is used once as a validation set while the remaining partitions form the training set
  • The performance is averaged across all k trials for a more stable estimate
  • "Leave One Out" validation is an extreme form of cross-validation, where k equals the number of observations
  • Nested k-fold cross-validation is used to tune hyperparameters
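
A sketch of plain and nested k-fold cross-validation with scikit-learn (the SVM and its hyperparameter grid are illustrative choices, not prescribed by the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Plain k-fold: average performance over k = 5 held-out folds.
scores = cross_val_score(SVC(), X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())

# Nested k-fold: the inner GridSearchCV tunes hyperparameters on the training folds,
# while the outer loop estimates the generalisation error of the whole procedure.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
nested_scores = cross_val_score(inner, X, y, cv=5)
print(nested_scores.mean())
```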

Resampling Methods: Monte Carlo Cross Validation

  • Monte Carlo Cross Validation is an alternative cross-validation approach
  • Generate train/test data split based on random seed
  • Perform evaluation and increment seed
  • Repeated hold-out is the same as Monte Carlo Cross Validation
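
A minimal repeated hold-out / Monte Carlo cross-validation sketch in which the seed is simply incremented between runs (20 repetitions is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

scores = []
for seed in range(20):  # repeat the hold-out, incrementing the seed each time
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(sum(scores) / len(scores))  # averaged estimate of generalisation performance
```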

Resampling Methods: Simple Bootstrap

  • N observations are sampled with replacement to create a training set
  • Remaining observations are used for testing
  • Process is repeated 50 to 200 times
  • Can produce a confidence interval
  • Refined versions include the .632 and .632+ bootstrap estimators
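
A simple bootstrap sketch: sample N rows with replacement for training, test on the remaining (out-of-bag) rows, and repeat to form a confidence interval (200 repetitions sits at the upper end of the range above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

scores = []
for _ in range(200):
    idx = rng.choice(len(X), size=len(X), replace=True)  # N observations, with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)           # remaining rows form the test set
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

low, high = np.percentile(scores, [2.5, 97.5])           # simple 95% confidence interval
```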

Solving Class Imbalance Problem

  • Class imbalance is a common ML problem because the minority class has significant importance
  • Undersampling the majority class randomly samples to give less importance
  • Oversampling the minority class duplicates data to give it more importance
  • Cost-sensitive learning can give more weight on false positives/negatives
  • Data augmentation generates additional data by applying small random noise on the features
  • You should not manipulate the test set
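
A sketch of random oversampling applied to the training set only (the dataset and the 95/5 imbalance are synthetic; libraries such as imbalanced-learn provide ready-made versions of this):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Duplicate minority-class rows until the classes are balanced -- training data only;
# the test set is left untouched.
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
n_extra = (y_tr == 0).sum() - len(minority)
extra = rng.choice(minority, size=n_extra, replace=True)

X_balanced = np.vstack([X_tr, X_tr[extra]])
y_balanced = np.concatenate([y_tr, y_tr[extra]])
```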

Practical Challenges in Experimentation

  • Data leakage must be avoided (test and training sets)
  • Expensive evaluation of nested cross-validation can be problematic
  • Machine learning relies a lot on randomness (e.g., random seed)

Historical Note: Student's T Test

  • William S. Gosset developed the t-test
  • Gosset was hired as the Head Experimental Brewer by Guinness and was not allowed to publish under his own name
  • Gosset published several papers under the pseudonym "Student"
  • Gosset worked under the guidance of Karl Pearson

Hypothesis Testing and Statistical Significance

  • Hypothesis testing determines the likelihood of performance differences being real versus due to chance
  • A p-value is the probability of observing results at least as extreme as those observed, assuming the null hypothesis is true
  • If the p-value falls below your predefined threshold (usually 0.05), there is enough evidence to reject the null hypothesis
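
As a hedged example, the snippet below runs a paired t-test comparing two classifiers' accuracies on the same ten folds (the scores are invented):

```python
from scipy import stats

# Accuracy of two classifiers on the same 10 cross-validation folds (toy numbers).
clf_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
clf_b = [0.78, 0.77, 0.80, 0.79, 0.76, 0.80, 0.79, 0.77, 0.78, 0.79]

t_stat, p_value = stats.ttest_rel(clf_a, clf_b)  # paired, two-tailed by default
if p_value < 0.05:                               # predefined significance threshold
    print("Reject the null hypothesis of no difference")
```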

Parametric and Non-Parametric Tests

  • Parametric tests make assumptions about data distribution
  • Non-parametric tests make fewer assumptions

Statistical Tests

  • One-tailed tests are for a specific direction
  • Two-tailed tests are bidirectional
  • Paired observations come from evaluating both models on the same cross-validation folds
  • Independent samples compare measurements taken on different data

The Multiple Comparison Problem

  • Making multiple comparisons increases the risk of a Type 1 error
  • Correct for it with the Bonferroni correction
  • Avoid too many comparisons
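
The Bonferroni correction simply divides the significance threshold by the number of comparisons; a minimal sketch with made-up p-values:

```python
# With m comparisons, test each at alpha / m to keep the overall Type 1 error rate near alpha.
alpha, m = 0.05, 6                                 # e.g. all pairs among 4 classifiers
p_values = [0.004, 0.03, 0.20, 0.01, 0.60, 0.048]  # toy p-values
significant = [p < alpha / m for p in p_values]    # only 0.004 survives the correction
```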

Interpreting Results

  • It is more rigorous to use statistical significance testing
  • Replication studies are needed to confirm a hypothesis
  • There may be alternative explanations

Hypothesis Testing for Machine Learning

  • Machine learning experiments can violate the assumptions of standard hypothesis tests
  • It is better to use relaxed tests, such as the Wilcoxon signed-rank test for k-fold cross-validation
  • You can use McNemar's test if you cannot afford to cross-validate
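
A sketch of the Wilcoxon signed-rank test on per-fold scores (toy numbers); statsmodels also provides a McNemar test for the single-split case:

```python
from scipy import stats

# Per-fold scores for two models evaluated on the same k-fold splits (toy numbers).
model_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
model_b = [0.78, 0.80, 0.80, 0.79, 0.76, 0.80, 0.79, 0.77, 0.78, 0.79]

stat, p = stats.wilcoxon(model_a, model_b)  # non-parametric paired test
print(p < 0.05)
```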

Factorial Experiments

  • Structured ways to measure impact of multiple factors and their interaction on performance
  • Full Factorial, Fractional Factorial, Plackett-Burman designs
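
A full factorial design is just the Cartesian product of all factor levels; the factor names and levels below are illustrative:

```python
from itertools import product

factors = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 500],
    "max_depth": [3, 6],
}

# Every combination of levels: 2 x 2 x 2 = 8 experimental runs.
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(runs))
```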

Online Experiments and A/B Testing

  • Online models are deployed and cannot be taken offline
  • A/B testing is a randomised controlled experiment that compares two variants of a system
  • Statistical tests quantify whether observed differences reflect real effects or random chance
  • G-test is used for yes/no metrics
  • Z-test is used for numerical targets

A/B Testing Steps

  • Define a specific hypothesis
  • Sample size determination
  • Traffic Allocation (control vs treatment groups)
  • Statistical Analysis using Chi-squared or t-tests
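
For a yes/no metric such as conversion, the analysis step can be a chi-squared test on the contingency table of outcomes; the counts below are invented:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Conversions vs non-conversions for control (A) and treatment (B) -- toy counts.
table = np.array([[120, 880],    # A: 120 conversions out of 1,000 users
                  [150, 850]])   # B: 150 conversions out of 1,000 users

chi2, p, dof, expected = chi2_contingency(table)
if p < 0.05:
    print("The difference in conversion rate is statistically significant")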

A/B Testing Concepts

  • The null hypothesis states there is no difference in the performance metric
  • Aim for high statistical power, the probability of correctly rejecting a false null hypothesis
  • Effect size is the minimal difference you wish to detect, e.g., a percentage increase in conversion or a reduction in time
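
Significance level, power and effect size together determine how many observations each group needs. A rough sketch using the standard normal-approximation formula for two proportions (the 10% to 12% lift is an invented example):

```python
from scipy.stats import norm

p1, p2 = 0.10, 0.12          # baseline and hoped-for conversion rates (effect size)
alpha, power = 0.05, 0.80    # significance level and statistical power

z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical value
z_beta = norm.ppf(power)

# Approximate sample size required per group (control and treatment).
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n))
```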

Advanced Designs: Multi-Armed Bandit (MAB)

  • Continuously allocate participants based on each variant's performance history
  • Allocation increases for high-performing variations
  • Underperforming variations see their allocation decrease and are stopped when a threshold is reached
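
A minimal epsilon-greedy bandit sketch (the three arms and their "true" conversion rates are invented): with probability epsilon a random arm is explored, otherwise the best-looking arm is exploited.

```python
import random

true_rates = [0.10, 0.12, 0.11]   # hypothetical conversion rate of each variant
counts = [0, 0, 0]                # pulls per arm
values = [0.0, 0.0, 0.0]          # running mean reward per arm
epsilon = 0.1

random.seed(0)
for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                     # explore a random arm
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit the best arm so far
    reward = 1 if random.random() < true_rates[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print(counts)  # most traffic should end up on the best-performing arm
```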

Thompson Sampling

  • Uses a Bayesian approach to estimate each arm's reward probability
  • The arm with the highest sampled value is selected
  • The selected arm's distribution is updated with the observed outcome
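
A Thompson sampling sketch under the same invented three-arm setup: each arm keeps a Beta posterior over its success probability, one value is drawn from each posterior, and the highest draw wins.

```python
import random

true_rates = [0.10, 0.12, 0.11]   # hypothetical conversion rates
successes = [0, 0, 0]
failures = [0, 0, 0]

random.seed(0)
for _ in range(10_000):
    # Sample from each arm's Beta(successes + 1, failures + 1) posterior.
    draws = [random.betavariate(successes[a] + 1, failures[a] + 1) for a in range(3)]
    arm = max(range(3), key=lambda a: draws[a])   # pick the arm with the highest draw
    if random.random() < true_rates[arm]:
        successes[arm] += 1                       # update the posterior with the outcome
    else:
        failures[arm] += 1
```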

Reproducibility and Replicability

  • Reproducibility: obtaining the same results using the original data and code
  • Replicability: independent researchers obtaining consistent results with their own experiment or data

Types of Interpretability

  • Interpretability matters for trusting decisions, debugging, and discovering new insights
  • Intrinsic: interpretability is a built-in property of the model
  • Post hoc: the model is interpreted after predictions are made
  • Model-specific vs model-agnostic
  • Model-specific methods apply only to a particular class of model
  • Model-agnostic methods work with any model
  • Local vs global
  • Local methods interpret a single prediction
  • Global methods interpret the overall model

Feature Importance

  • Global feature importance identifies each feature's overall impact on the model's output
  • Local feature importance explains feature contributions to a single prediction
  • SHAP stands for SHapley Additive exPlanations
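
Permutation importance is one model-agnostic way to obtain a global ranking: shuffle one feature at a time and measure how much the score drops. A sketch with a synthetic dataset and a random forest (both illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Average score drop when each feature is shuffled -- a global importance measure.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```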

Intrinsically Interpretable Models

  • Decision trees and nearest-neighbour models are easy to interpret

Global Surrogate Models

  • An interpretable model is trained to approximate the predictions of a black-box model
  • For example: train an SVM (the black box)
  • Label the data with the SVM's predictions
  • Train a decision tree on those predicted labels
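
A sketch of the surrogate recipe above: the decision tree is fitted to the SVM's predictions rather than the true labels, so it mimics (and explains) the black box.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

black_box = SVC().fit(X, y)                 # the black-box model
surrogate_labels = black_box.predict(X)     # relabel the data with its predictions

surrogate = DecisionTreeClassifier(max_depth=3).fit(X, surrogate_labels)
print(surrogate.score(X, surrogate_labels)) # fidelity: how closely the tree matches the SVM
```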

Local Surrogate Models - LIME

  • Local surrogate models explain individual predictions of a machine learning model
  • Generate a new dataset by perturbing a single observation (e.g., turning features on and off)
  • Train an interpretable model on that dataset, weighting samples by their proximity to the original observation
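
A hand-rolled sketch of the LIME idea (the LIME library wraps this up more carefully): perturb one observation, weight the perturbations by proximity, and fit a weighted linear model whose coefficients serve as the local explanation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]                                                  # the single prediction to explain
rng = np.random.default_rng(0)
perturbed = x0 + rng.normal(scale=0.5, size=(1000, X.shape[1]))   # new local dataset
preds = black_box.predict_proba(perturbed)[:, 1]                  # black-box outputs
weights = np.exp(-np.linalg.norm(perturbed - x0, axis=1))         # closer points weigh more

local_model = Ridge().fit(perturbed, preds, sample_weight=weights)
print(local_model.coef_)   # local feature contributions around x0
```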

Counterfactual Explanations

  • Express what would have happened if the input had been different
  • For example: "if feature X had been Y, the prediction would have been positive"
  • Identify potential causal links
  • A disadvantage is that there are often multiple possible counterfactual explanations
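
A very simple counterfactual search sketch: nudge a single feature of one observation until the model's prediction flips (real counterfactual methods search over many features and minimise the size of the change).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

x0 = X[0].copy()
original = model.predict([x0])[0]

counterfactual = None
for delta in np.linspace(0.05, 3.0, 60):        # grow the perturbation gradually
    for sign in (1, -1):
        candidate = x0.copy()
        candidate[0] += sign * delta             # change only feature 0
        if model.predict([candidate])[0] != original:
            counterfactual = candidate           # smallest change found that flips the prediction
            break
    if counterfactual is not None:
        break

print(counterfactual)
```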
