ML - data drift

64 Questions

What does data drift refer to in the context of machine learning models?

A change in the statistical properties and characteristics of the input data

How does data drift affect a machine learning model's performance?

It can lead to a decline in the model's performance

Why is it important to monitor and address data drift in production ML models?

To keep the model's predictions accurate over time

What can happen if a machine learning model faces data drift and is not adapted accordingly?

The model's performance may decrease

What is the main concern addressed in the text regarding machine learning models?

Data drift

In the retail chain example, what caused a significant shift in sales channels?

Marketing campaign for the mobile app

What is the difference between data drift and concept drift?

Data drift involves changes in data distribution, while concept drift involves changes in relationships between input and target variables.

How can prediction drift be best described?

Distribution shift in the model outputs.

In what scenario could prediction drift be an indication of model issues?

If the model starts predicting a particular outcome (for example, fraud) much more frequently than before.

What is NOT a term related to data drift mentioned in the text?

Prediction skew

What can cause data drift but not concept drift?

A shift in the sales channel distribution while the average basket size per channel remains consistent.

Which factor does concept drift primarily involve?

Shifts in relationships between input and target variables.

What can prediction drift signal beyond changes in environment?

Issues with training data quality.

What is the primary difference between data drift and prediction drift?

Data drift involves shifts in input feature distributions, whereas prediction drift refers to shifts in model outputs.

What kind of shift can signal issues with model quality according to the text?

Shift towards more frequent fraud predictions by a fraud detection model.

What is one of the methods mentioned in the text for early monitoring of model performance?

Tracking data distribution drift

What issue can occur due to a significant time gap between making a prediction and receiving feedback?

Feedback delay

In which scenario might it be challenging to definitively label a user transaction as fraudulent or legitimate?

Payment fraud detection

Why are ground truth labels important in evaluating model quality?

Without them, performance metrics such as accuracy or precision cannot be computed directly

What technique is useful for model troubleshooting and debugging?

Data drift analysis

In which situation might data drift analysis not be used as an alerting signal?

Model debugging and troubleshooting

What is a common way to compare two distributions, mentioned in the text?

Looking at key summary statistics
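
The comparison above can be sketched in a few lines of Python; the feature values and the 5% alert threshold here are hypothetical illustrations, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(loc=100.0, scale=15.0, size=5_000)  # e.g., last month's feature values
current = rng.normal(loc=110.0, scale=15.0, size=5_000)    # this week's values, mean shifted

def summary(sample):
    """Key summary statistics often tracked for drift monitoring."""
    return {
        "mean": float(np.mean(sample)),
        "std": float(np.std(sample)),
        "median": float(np.median(sample)),
        "p95": float(np.percentile(sample, 95)),
    }

ref_stats, cur_stats = summary(reference), summary(current)
# Flag the feature if its mean moved by more than, say, 5% (threshold is an assumption)
mean_shift = abs(cur_stats["mean"] - ref_stats["mean"]) / abs(ref_stats["mean"])
print(f"relative mean shift: {mean_shift:.2%}")
```

Comparing one statistic per feature is cheap, but as the next card notes, running many such comparisons at once produces noisy observations.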

When comparing summary statistics, what issue can arise if monitoring many features at once?

"Noisy" observations due to multiple comparisons

"How 'different' is different enough?" refers to which aspect of the text?

"Detecting a change in distributions"

What is a common industry approach to retrain machine learning models when facing data drift?

Retrain the model using old and new data

When observing unnecessary data drift alerts, what adjustment might you make to the sensitivity of drift detection methods?

Decrease the sensitivity

What could happen if a machine learning model's predictions are adversely affected by drift?

The model's operation might need to be temporarily halted

What is one way to adjust machine learning models to be more resilient to data shifts without taking a reactive approach?

Review historical variability of features and filter out ones with significant drifts

Which action might be taken if retraining a machine learning model is not feasible due to a lack of new labels for model updates?

Consider process interventions

What could be a consequence of continuing to use a machine learning model without verifying that the data is valid and complete?

Potential false positives in predictions

What is a recommended rule of thumb when observing data drift in machine learning models related to alerting?

Alert only to drift in top model features
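
This rule of thumb can be sketched as follows; the feature names, importances, and top-k cutoff are all hypothetical:

```python
# Sketch: restrict drift alerts to the most important model features.
feature_importance = {"amount": 0.45, "channel": 0.30, "hour": 0.15, "device_os": 0.10}
drifted_features = {"amount", "hour", "device_os"}  # output of some upstream drift detector

TOP_K = 2  # alert only on the top-2 features (cutoff is an assumption)
top_features = sorted(feature_importance, key=feature_importance.get, reverse=True)[:TOP_K]
alerts = [f for f in top_features if f in drifted_features]
print(alerts)  # → ['amount']; drift in "hour"/"device_os" is noted but does not page anyone
```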

When it comes to updating machine learning models due to a true data drift, what specific actions might be necessary?

Retraining alone may not suffice; you might need to rebuild the model with a new approach

What could be a consequence of neglecting to adjust the sensitivity of drift detection methods when unnecessary alerts are observed?

Continued unnecessary alerts causing disruptions.

How can machine learning models be designed to be more resilient to data shifts without reacting to changes?

Apply feature selection based on historical variability.

What might happen if a machine learning model continues operating without considering data quality verification?

Elevated risk of generating false positives.

What action might be taken if retraining a machine learning model isn't viable due to missing labels for updates?

Halt the operation of the model temporarily.

What is the difference between data drift and training-serving skew?

Data drift refers to gradual changes in input data distributions, while training-serving skew refers to immediate post-deployment discrepancies.

What can trigger a training-serving skew?

Mismatch between the data the model was trained on and the data it encounters in production.

How do you distinguish data quality issues from data drift?

Data quality issues involve corrupted and incomplete data, while data drift involves changes in otherwise correct and valid data distributions.

In which situation can you encounter a training-serving skew?

If there's a mismatch between the model's input training data and production data.

What is the common similarity between data drift and prediction drift?

Both are useful techniques for production model monitoring without ground truth.

When might you face a training-serving skew immediately after model deployment?

If there's a mismatch between the model's training data features and production feature availability.

What does data drift refer to?

Gradual changes in input data distributions.

What is the similarity between data quality issues and data drift?

Both can lead to model quality drops

What is the main implication of a training-serving skew on model performance?

The model might not perform well if attributes it was trained on are missing in production.

What is the primary goal of drift detection?

Decide if the model still performs as expected

How do outliers differ from data drift?

Drift is a change in the overall input distribution, while outliers are individual unusual data points

What can signal a change in the model environment without ground truth?

Both data drift and prediction drift.

Why is tracking data distribution drift considered important?

To maintain production ML model quality

What actions can help differentiate between data quality issues and data drift?

First verify completeness of the data, then check for distribution shifts.
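
The two-step check above can be sketched as a minimal triage function; the thresholds and return labels are hypothetical conventions, not from the source:

```python
import numpy as np

def check_data(current, reference, max_null_share=0.05):
    """First verify completeness, then check for a distribution shift (illustrative thresholds)."""
    current = np.asarray(current, dtype=float)
    # Step 1: data quality — missing values point to a pipeline issue, not drift
    null_share = np.isnan(current).mean()
    if null_share > max_null_share:
        return "data quality issue"
    # Step 2: drift — compare distributions of the valid values only
    valid = current[~np.isnan(current)]
    ref = np.asarray(reference, dtype=float)
    shift = abs(valid.mean() - ref.mean()) / (ref.std() + 1e-9)
    return "data drift" if shift > 0.5 else "ok"

reference = [10.0] * 50 + [12.0] * 50
print(check_data([11.0] * 90 + [float("nan")] * 10, reference))  # → data quality issue
print(check_data([14.0] * 100, reference))                        # → data drift
print(check_data([11.0] * 100, reference))                        # → ok
```

Running the completeness check first matters: a batch full of nulls would also shift the distribution, and mislabeling it as drift hides the real pipeline bug.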

What is a key reason for ongoing model maintenance in machine learning systems?

To keep models updated due to changing real-world data

What is one way to detect a training-serving skew?

When there's a mismatch between the features available during training and those available during production.

How does detecting outliers differ from detecting data drift?

Outlier detection focuses on individual unusual inputs, while drift detection looks at shifts in the overall distribution

Why should detection methods for data drift and outliers be designed differently?

Because data drift and outliers can exist independently of each other

How does outlier detection differ from drift detection?

Drift detectors should be robust to a few outliers, while outlier detectors should be sensitive enough to catch individual anomalies.

What is a key purpose of outlier detection?

Identify individual objects in the data that look different from others

What is one drawback of using statistical tests for data drift detection?

Statistical tests may be overly sensitive with large datasets.
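The oversensitivity above is easy to demonstrate: the same practically negligible shift passes a two-sample Kolmogorov-Smirnov test at small n but is flagged as highly significant at large n. The shift size and sample sizes below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# A practically negligible shift: mean moves from 0.0 to 0.05 (std = 1)
small_ref, small_cur = rng.normal(0, 1, 100), rng.normal(0.05, 1, 100)
large_ref, large_cur = rng.normal(0, 1, 100_000), rng.normal(0.05, 1, 100_000)

p_small = ks_2samp(small_ref, small_cur).pvalue
p_large = ks_2samp(large_ref, large_cur).pvalue  # tiny p-value despite a trivial shift
print(f"n=100: p={p_small:.3f}; n=100000: p={p_large:.2e}")
```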

When is it recommended to use distance metrics for detecting data drift?

When dealing with a large dataset where statistical tests may be too sensitive.

What is the purpose of using rule-based checks for data drift?

As alerting heuristics to detect meaningful changes.
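A rule-based check can be as simple as alerting when a category's share moves by more than a fixed number of percentage points; the channel names, counts, and 10-point threshold below are hypothetical:

```python
def channel_share_rule(reference_counts, current_counts, channel="mobile", max_delta=0.10):
    """Alert if the channel's share of traffic moves by more than max_delta (an assumed threshold)."""
    ref_share = reference_counts[channel] / sum(reference_counts.values())
    cur_share = current_counts[channel] / sum(current_counts.values())
    return abs(cur_share - ref_share) > max_delta

reference = {"web": 700, "mobile": 300}
current = {"web": 450, "mobile": 550}
print(channel_share_rule(reference, current))  # → True: mobile share went from 30% to 55%
```

Unlike a statistical test, a rule like this encodes what the team considers a meaningful change, which is why such heuristics suit regulated or slow-moving domains.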

Why might statistical significance not always imply practical significance in data drift detection?

With enough data, even a tiny, practically irrelevant shift can produce a significant p-value.

Which distance metric is commonly used to understand the extent of drift in data?

Jensen-Shannon Divergence
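A minimal sketch of this metric using SciPy, assuming the common approach of binning both samples into shared histograms first; the bin count and the alert threshold mentioned in the comment are assumptions:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, 10_000)
current = rng.normal(0.5, 1.0, 10_000)

# Bin both samples over a shared grid, then compare the binned distributions
bins = np.histogram_bin_edges(np.concatenate([reference, current]), bins=30)
ref_hist, _ = np.histogram(reference, bins=bins, density=True)
cur_hist, _ = np.histogram(current, bins=bins, density=True)

# jensenshannon returns the JS *distance* (square root of the divergence),
# bounded in [0, 1] with base=2 — a fixed cutoff (e.g., 0.1) is a convention, not a rule
js_distance = jensenshannon(ref_hist, cur_hist, base=2)
print(f"JS distance: {js_distance:.3f}")
```

Because the value is bounded and does not shrink toward "significant" as the dataset grows, it is easier to attach a stable alert threshold to than a p-value.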

In what scenario are rule-based checks particularly useful for detecting data drift?

In industries like healthcare or education.

Why might using statistical hypothesis testing for data drift be challenging?

Selecting the right test based on data distribution assumptions can be complex.

What factor influences whether statistical tests or distance metrics are more suitable for data drift detection?

The size of the dataset being analyzed.

Test your knowledge on detecting data quality issues, such as negative sales or shifts in feature scale, using statistical tracking methods. Understand how to identify data drift even when values remain within expected ranges but exhibit different distribution patterns.
