Questions and Answers
What does EPIG specifically measure in relation to predictions?
- The amount of information that acquiring a label for x provides about predictions at the evaluation point x_eval (correct)
- The total number of labeled examples in the dataset
- The quality of the evaluation points without labels
- The overall accuracy of the model on unseen data
In terms of active sampling, what is the primary goal when labels are available?
- To evaluate the effectiveness of the existing labels
- To increase the overall number of training examples without criteria
- To randomly sample data points for training
- To prioritize examples for training based on their predictive value (correct)
What is a key insight regarding labeled examples for training?
- Labeled examples should always be maximized in quantity
- All labeled examples contribute equally to model performance
- Some labeled examples are more informative than others for training (correct)
- The distribution of labeled examples does not affect training outcomes
What is the primary objective of the symmetric decomposition of mutual information?
- To enable conditioning on the evaluation point x_eval when computing information gain (correct)
What does RhoLoss aim to achieve in the context of active sampling?
- To prioritize training examples that most reduce the holdout loss (correct)
Flashcards
EPIG
A metric that measures how much information acquiring a label for a specific data point (x) provides about predictions at an evaluation point (x_eval).
Active Sampling
A technique used when labels are already available: training is strategically prioritized on specific examples to maximize learning.
Symmetric Decomposition of Mutual Information
A method used to calculate information gain by breaking it down into the difference between the entropy of the labels given the input and the entropy of the labels given both the input and the evaluation set.
RhoLoss
A criterion that prioritizes training examples by how much they reduce the holdout loss.
Prioritize Examples for Training
The strategy of selecting training examples based on their predictive value rather than training on all data uniformly.
Study Notes
Active Learning for Improved Model Training
- EPIG (Expected Prediction Information Gain) prioritizes data labeling that maximizes predictive information about evaluation points.
- EPIG differs from BALD (Bayesian Active Learning by Disagreement) in that it focuses on where learning is most impactful for the intended prediction task, rather than on model disagreement alone.
- Not all training data is equally valuable, and EPIG quantifies the value based on impact on the evaluation data.
- Active sampling is used when labeling data is costly, focusing on the most informative data for model improvement.
- Prior labeling strategies considered only the training data; to improve a model's predictive ability, the evaluation data should also inform which training data is useful.
- The evaluation set itself carries crucial information for data selection.
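To make the contrast with BALD concrete, here is a minimal sketch of the BALD score, the disagreement-based baseline EPIG is compared against. The probability vectors are toy stand-ins for draws from a Bayesian posterior, not outputs of any real model:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def bald_score(probs_per_sample):
    """BALD = H[mean predictive] - mean H[per-sample predictive].

    probs_per_sample: one class-probability vector per posterior sample
    theta_k. High when posterior samples disagree about the label.
    """
    k = len(probs_per_sample)
    num_classes = len(probs_per_sample[0])
    mean_pred = [sum(p[c] for p in probs_per_sample) / k
                 for c in range(num_classes)]
    return entropy(mean_pred) - sum(entropy(p) for p in probs_per_sample) / k

# Posterior samples that disagree give a high score; agreement gives ~0.
disagreeing = [[0.9, 0.1], [0.1, 0.9]]
agreeing = [[0.6, 0.4], [0.6, 0.4]]
print(bald_score(disagreeing))  # positive: the samples disagree
print(bald_score(agreeing))     # ~0: no disagreement to resolve
```

BALD scores a point only by model disagreement at that point; EPIG instead asks how much the label would change predictions at the evaluation points.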
EPIG Definition and Advantage
- EPIG measures the information gained about the prediction at an evaluation point (x_eval) by acquiring a label for a specific data point (x).
- EPIG uses a symmetric decomposition of mutual information, which enables conditioning on the evaluation point (x_eval).
- This decomposition makes it possible to evaluate how much information acquiring a training point provides about the evaluation data.
- With sufficient data, training on all of it is reasonable, since the marginal improvement from selective acquisition becomes small.
RhoLoss
- RhoLoss prioritizes training examples that reduce the holdout loss.
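A minimal sketch of that selection rule, assuming toy scalar losses rather than outputs of real models (the "irreducible loss" is typically estimated by a model trained on holdout data):

```python
def rho_loss(train_loss, irreducible_loss):
    """Reducible loss of a labelled example: current model loss minus the
    loss an irreducible-loss model (trained on holdout data) assigns it,
    approximating what cannot be learned away (e.g. label noise)."""
    return train_loss - irreducible_loss

def select_batch(examples, batch_size):
    """Prioritize the examples with the largest reducible loss."""
    ranked = sorted(examples,
                    key=lambda e: rho_loss(e["loss"], e["irr"]),
                    reverse=True)
    return ranked[:batch_size]

# A hard-but-learnable example (high loss, low irreducible loss) outranks
# both a noisy example and an already-learned one.
pool = [
    {"id": "learnable", "loss": 2.0, "irr": 0.2},
    {"id": "noisy", "loss": 2.1, "irr": 1.9},
    {"id": "learned", "loss": 0.1, "irr": 0.05},
]
picked = select_batch(pool, 1)
print(picked[0]["id"])  # "learnable"
```

Ranking by raw loss alone would pick the noisy example; subtracting the irreducible loss is what steers training toward examples that actually reduce holdout loss.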
Description
This quiz explores the concept of Expected Prediction Information Gain (EPIG) and its advantages over traditional labeling strategies in active learning. Learn how EPIG prioritizes data labeling based on its impact on evaluation points, enhancing model performance. Understand the role of active sampling in contexts where data labeling is costly.