Lec 4.pptx
Evaluation, Model Selection, Diagnosis
CCSW 438 Advanced Topics in Software Engineering
From Applied Machine Learning, University of Colorado Boulder, INFO-4604, Prof. Michael Paul
https://cmci.colorado.edu/classes/INFO-4604/

Outline:
Evaluation
Evaluation metrics
Model selection
Error analysis
Learning curves

Evaluation

Training data is usually separate from test data. Training accuracy is often much higher than test accuracy: training accuracy is what your classifier is optimizing for (plus regularization), but it is not a good indicator of how the classifier will perform on new data.

Distinction between:
In-sample data: the data that is available when building your model; "training" data in machine learning terminology.
Out-of-sample data: data that was not seen during training, also called held-out data or a holdout set. Useful for seeing what your classifier will do on data it hasn't seen before. Usually assumed to be from the same distribution as the training data.

Ideally, you should be "blind" to the test data until you are ready to evaluate your final model. Often you need to evaluate a model repeatedly (e.g., you're trying to pick the best regularization strength and want to see how different values affect performance). If you keep using the same test data, you risk overfitting to the test set. Instead, use a different set that is still held out from the training data but separate from the test set.

Held-Out Data

Typically you set aside a random sample of your labeled data to use for testing. A lot of ML datasets you download will already be split into training vs test sets, so that people use the same splits in different experiments. How much data should you set aside for testing? There is a tradeoff: a smaller test set gives a less reliable performance estimate, while a smaller training set means less data for training and probably a worse classifier (so the estimate might understate what the model could do with all of the data).

A common approach to getting held-out estimates is k-fold cross-validation. General idea: split your data into k partitions ("folds"); use all but one for training and the last fold for testing; repeat k times, so each fold gets used for testing once. This gives you k different held-out estimates. (The original slides include an illustration of 10-fold cross-validation.)

How to choose k? Generally, larger is better, but it is limited by efficiency. The most common values are 5 or 10. Smaller k means less training data is used, so your estimate may be an underestimate. When k is the number of instances, this is called leave-one-out cross-validation.

Benefits of obtaining multiple held-out estimates: the final estimate is more robust and less sensitive to the particular test split you choose, and multiple estimates also give you the variance of the estimates, which can be used to construct confidence intervals.
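As a concrete illustration of held-out evaluation and k-fold cross-validation, here is a minimal sketch using scikit-learn (the library used in the course references). The dataset (digits), the model (scaled logistic regression), and k = 5 are illustrative assumptions, not choices made in the slides.

```python
# Minimal sketch: a held-out test split plus 5-fold cross-validation in scikit-learn.
# The dataset (digits) and model (scaled logistic regression) are illustrative choices.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Hold out a test set that stays untouched until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training portion only: each fold is used
# once for testing, giving 5 held-out accuracy estimates.
scores = cross_val_score(model, X_train, y_train, cv=5)
print("fold accuracies:", scores)
print("mean / std:", scores.mean(), scores.std())

# Only after all model choices are fixed do we touch the held-out test set.
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

cross_val_score returns one held-out accuracy per fold, so the mean and standard deviation correspond directly to the "multiple estimates" benefit described above.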
Other Considerations

When splitting into train vs test partitions, keep in mind the unit of analysis. Some examples: if you are making predictions about people (e.g., guessing someone's age based on their search queries), you probably shouldn't have data from the same person in both train and test; split on people rather than on individual instances (queries). If time is a factor in your data, you probably want test sets to be from later time periods.

If there are errors in your annotations, then there will be errors in your estimates of performance. Example: your classifier predicts "positive" sentiment but the instance was labeled "neutral". If the label actually should have been (or at least could have been) positive, then your classifier will be falsely penalized. This is another reason why it's important to understand the quality of the annotations in order to correctly interpret the performance estimates.

If your test performance seems "suspiciously" good, trust your suspicions: make sure you aren't accidentally including any training information in the test set. General takeaway: make sure the test conditions are as similar as possible to the actual prediction environment, and don't trick yourself into thinking your classifier is better than it really is.

Evaluation Metrics

How do we measure performance? What metrics should be used?

So far, we have mostly talked about accuracy in this class: the number of correctly classified instances divided by the total number of instances. Error is the complement of accuracy: Accuracy = 1 - Error, and Error = 1 - Accuracy.

Accuracy and error give an overall summary of model performance, though they are sometimes hard to interpret. Example: fraud detection in bank transactions, where 99.9% of instances are legitimate. A classifier that never predicts fraud would have an accuracy of 99.9%. We need a better way to understand performance.

Some metrics measure performance with respect to a particular class. With respect to a class c, we define a prediction as:
True positive: the label is c and the classifier predicted c.
False positive: the label is not c but the classifier predicted c.
True negative: the label is not c and the classifier did not predict c.
False negative: the label is c but the classifier did not predict c.

There are two different types of errors: false positives ("type I" errors) and false negatives ("type II" errors). Usually there is a tradeoff between these two; you can optimize for one at the expense of the other. Which one to favor depends on the task.

Recall is the percentage of positive instances that were predicted to be positive. Fraud example: low recall means there are fraudulent transactions that you aren't detecting.

Precision is the percentage of instances predicted to be positive that were actually positive. Fraud example: low precision means you are classifying legitimate transactions as fraudulent.

Precision vs Recall

For example: if a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3, which tells us how valid the results are, while its recall is 20/60 = 1/3, which tells us how complete the results are. (Example from https://en.wikipedia.org/wiki/Precision_and_recall)
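To make the precision, recall, and F1 definitions concrete, here is a small sketch that reproduces the search-engine numbers above, first directly from the counts and then with scikit-learn's metric functions. The 0/1 label vectors are constructed purely for illustration.

```python
# Sketch: precision, recall, and F1 for the search-engine example
# (30 pages returned, 20 of them relevant, 40 relevant pages missed).
from sklearn.metrics import f1_score, precision_score, recall_score

tp, fp, fn = 20, 10, 40        # true positives, false positives, false negatives

precision = tp / (tp + fp)     # 20/30 = 0.667: how valid the results are
recall = tp / (tp + fn)        # 20/60 = 0.333: how complete the results are
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The same numbers via scikit-learn, using 0/1 label vectors built for illustration:
# 20 relevant pages returned, 10 irrelevant pages returned, 40 relevant pages missed.
y_true = [1] * 20 + [0] * 10 + [1] * 40
y_pred = [1] * 20 + [1] * 10 + [0] * 40
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```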
Evaluation Metrics

Similar to optimizing for false positives vs false negatives, there is usually a tradeoff between precision and recall. You can increase one at the expense of the other. One might be more important than the other, or they might be equally important; it depends on the task. Fraud example: if a human is reviewing the transactions flagged as fraudulent, you would probably optimize for recall.

You can modify the prediction rule to adjust the tradeoff. Increasing the prediction threshold (i.e., the score or probability required for an instance to be predicted as belonging to a class) increases precision: fewer instances will be predicted positive, but the ones that are classified positive are more likely to be classified correctly (a more confident classifier). Decreasing the threshold increases recall: more instances will get classified as positive (the bar has been lowered).

The F1 score is an average of precision and recall, used as a summary of performance, still with respect to a particular class c. It is defined using the harmonic mean, F1 = 2 * precision * recall / (precision + recall), which is affected more by the lower number: both numbers have to be high for F1 to be high. F1 is therefore useful when both are important.

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, Accuracy = (number of correct predictions) / (total number of predictions). For binary classification, accuracy can also be calculated in terms of positives and negatives as Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.

Which metrics to use? Accuracy is easier to understand and communicate than precision/recall/F1, but harder to interpret correctly. Precision/recall/F1 are generally more informative. If you have a small number of classes, just show precision/recall/F1 for all classes; if you have a large number of classes, you should probably do macro/micro averaging. F1 is better if precision and recall are both important, but sometimes you might highlight one of them.

Error Analysis

What kinds of mistakes does a model make? Which classes does a classifier tend to mix up?

A confusion matrix (or error matrix) is a table that counts the number of test instances with each true label vs each predicted label. In binary classification, the confusion matrix is just a 2x2 table of true/false positives/negatives, but it can be generalized to multiclass settings. (The original slides show example confusion matrices for a binary classifier, from https://www.knime.com/blog/from-modeling-to-scoring-confusion-matrix-and-class-statistics, and for a multiclass classifier.)

Just as false negatives vs false positives are two different types of errors where one may be preferable, different types of multiclass errors may have different importance. Mistaking a deer for an antelope is not such a bad mistake; mistaking a deer for a cereal box would be an odd one. Are the mistakes acceptable?
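As a sketch of the threshold tradeoff and the confusion matrix described above, the following uses scikit-learn on a binary classification dataset with a logistic regression model. The dataset (breast_cancer), the model, and the threshold values are assumptions made for the example only.

```python
# Sketch: shifting the decision threshold to trade precision against recall,
# then inspecting the confusion matrix at the default threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Probability of the positive class for each test instance.
proba = clf.predict_proba(X_test)[:, 1]

for threshold in (0.3, 0.5, 0.7, 0.9):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, y_pred):.3f}  "
          f"recall={recall_score(y_test, y_pred):.3f}")

# Confusion matrix at the default 0.5 threshold:
# rows are true labels, columns are predicted labels.
print(confusion_matrix(y_test, (proba >= 0.5).astype(int)))
```

Raising the threshold should push precision up and recall down, matching the tradeoff described above, and the confusion matrix shows where the remaining errors fall.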
You need to look at the confusions themselves, not just a summary.

Another good practice: look at a sample of misclassified instances. (The original slides include an example image from https://www.freecodecamp.org/news/chihuahua-or-muffin-my-search-for-the-best-computer-vision-api-cbda4d6b425d/)

Looking at misclassified instances might help you understand why the classifier is making those mistakes and what kinds of instances it makes mistakes on. Maybe they were ambiguous to begin with, so it isn't surprising the classifier had trouble. If you have multiple models, you can compare how they do on individual instances. If your model outputs probabilities, they are useful to examine: if the correct class was the 2nd most probable, that's a better mistake than if it was the 10th most probable.

Error analysis can help inform:
Feature engineering: if you observe that certain classes are easily confused, think about creating new features that could distinguish those classes.
Feature selection: if you observe that certain features might be hurting performance (maybe the classifier is picking up on an association between a feature and a class that isn't meaningful), you could remove them.

Learning Curves

A learning curve measures the performance of a model at different amounts of training data. It is primarily used to understand, among other things, how much training data is needed. Usually the validation accuracy increases noticeably after an initial increase in training data, then levels off after a while. It might still be increasing, but with diminishing returns. The learning curve can be used to predict the benefit you'll get from adding more training data: if it is still increasing rapidly, get more data; if it has completely flattened out, more data won't help much. (A minimal scikit-learn sketch appears after the references below.)

References
Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow by Raschka and Mirjalili.
Applied Machine Learning, University of Colorado Boulder, INFO-4604, Prof. Michael Paul. https://cmci.colorado.edu/classes/INFO-4604/ (Regularization)
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron.
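As the sketch referenced in the learning-curves discussion above, here is a minimal example using scikit-learn's learning_curve utility. The dataset, model, and training-set sizes are illustrative assumptions, not choices made in the slides.

```python
# Sketch: computing a learning curve with scikit-learn's learning_curve utility.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train on increasing fractions of the data; each point is cross-validated,
# so we get both training and validation accuracy at each training-set size.
train_sizes, train_scores, valid_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print(f"{n:5d} training examples: train acc {tr:.3f}, validation acc {va:.3f}")

# If validation accuracy is still climbing at the largest size, more data is
# likely to help; if it has flattened out, more data probably won't.
```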