Chapter 6: Evaluating Performance

Document Details


2024

Dr. Gissella Bejarano

Tags

data mining, predictive analytics, model evaluation, machine learning

Summary

This document presents Chapter 6, "Evaluating Performance", from a course on data mining and predictive analytics (Data450). It covers the types of errors associated with predictive models and discusses the main evaluation metrics for both classification and regression tasks.

Full Transcript


Chapter 6: Evaluating Performance
Data450 - Data Mining and Predictive Analytics, Fall 2024
Dr. Gissella Bejarano

Types of Errors

[Figure: the induction/deduction diagram from Introduction to Data Mining — a learning algorithm induces a model from a labeled training set (records with attributes and a class label), and the model is then applied, by deduction, to an unlabeled test set.]

- Training errors: errors committed on the training set.
- Test errors: errors committed on the test set (usually used to evaluate and compare different models).
- Generalization errors: the expected error of a model over a random selection of records drawn from the same distribution; this is what matters especially for real-world implementations.

Train and Test Sets

- The subsets are selected at random.
- In this course we will work with training and test sets only.
- The test error is usually used as an estimate of the generalization error.

(Image source: Data Science Stack Exchange, https://images.app.goo.gl/2CbpNuJxXJWRZKHC7)

Sign Language Recognition (example)

In this paper we compared pose estimation models (MediaPipe, OpenPose, and WholePose) and sign recognition models (Spoter vs. SmileLab) [Lazo-Quispe et al., 2022], LatinXinAI Workshop at NeurIPS.

[Figure: results on the LSP-Spanish Dictionary, contrasting the estimated generalization error with the unknown real generalization error.]

Model Underfitting and Overfitting

[Figure: training and test error plotted against the number of epochs.]

- As the model becomes more and more complex, the test error can start increasing even though the training error keeps decreasing.
- Underfitting: the model is too simple; both training and test errors are large.
- Overfitting: the model is too complex; the training error is small but the test error is large.

Model Selection (Comparison)

- Performed during model building.
- The purpose is to ensure that the model is not overly complex (to avoid overfitting).
- This requires an estimate of the generalization error, obtained either from a validation (or test) set or by incorporating model complexity.

Model Selection: Incorporating Model Complexity

- Rationale: Occam's Razor.
- Given two models with similar (estimated) generalization errors, one should prefer the simpler model over the more complex one.
- A complex model has a greater chance of being fitted accidentally.
- Therefore, model complexity should be included when evaluating a model:

Gen.Error(Model) = Train.Error(Model, Train.Data) + α × Complexity(Model)

[Figure: as the regression model becomes more complex, the blue curve shows the function learned from the training data; a companion plot shows the training error (blue) versus the test error (red).]

Cross-validation

- Divide the sample into k parts.
- Use k - 1 of the parts for training and 1 for testing.
- Repeat the procedure k times, once for every possible testing subset.
- Define how to summarize the k test errors; the average is a common approach.
- k = 10 is a common choice.

[Figure: k-fold diagram — in each round a different fold is held out for testing while the remaining folds are used for training.]
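To make the k-fold procedure above concrete, here is a minimal Python sketch (not part of the original slides) using scikit-learn's KFold. The synthetic dataset and the decision-tree classifier are placeholder choices used only for illustration.

```python
# Minimal k-fold cross-validation sketch; dataset and model are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

k = 10  # k = 10 is the common choice mentioned in the slides
kf = KFold(n_splits=k, shuffle=True, random_state=0)

test_errors = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(max_depth=3)
    model.fit(X[train_idx], y[train_idx])    # train on the k-1 remaining folds
    y_pred = model.predict(X[test_idx])      # evaluate on the held-out fold
    test_errors.append(1 - accuracy_score(y[test_idx], y_pred))

# Summarize the k test errors; the average is the common approach.
print(f"mean test error over {k} folds: {np.mean(test_errors):.3f}")
```

Swapping KFold for StratifiedKFold gives the stratified variation discussed on the next slide, preserving the class proportions in every fold.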
Variations on Cross-validation

- Repeated cross-validation: perform cross-validation a number of times; this gives an estimate of the variance of the generalization error.
- Stratified cross-validation: guarantees the same percentage of class labels in the training and test folds; important when classes are imbalanced and the sample is small.
- Use a nested cross-validation approach for model selection and evaluation.

Model Evaluation - Regression

The prediction error for record i is defined as the difference between its actual outcome value y_i and its predicted outcome value ŷ_i, that is, e_i = y_i - ŷ_i. A few popular numerical measures of predictive accuracy are:

- MSE (mean squared error): use it when you want to penalize larger errors more heavily, as squaring the errors magnifies their impact.
- RMSE (root mean squared error): use it when you want the error to be on the same scale as the dependent variable, which makes it easier to interpret.
- MAE (mean absolute error): less sensitive to outliers because it does not square the differences.

Model Evaluation - Classification: Accuracy

Accuracy is the proportion of correctly classified instances out of the total number of instances:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where, for binary classification:

- TP = true positives (correct positive classifications)
- TN = true negatives (correct negative classifications)
- FP = false positives (incorrect positive classifications)
- FN = false negatives (incorrect negative classifications)

When classes are balanced, accuracy gives a good measurement of the model's performance. For multi-class problems, add up all the correct predictions (T1 + T2 + … + Tk), where k is the number of classes, and divide by the total number of instances.

Model Evaluation - Confusion Matrix

(Modified from Introduction to Data Mining by Tan, Steinbach, Kumar; see also https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)

Model Evaluation - Classification: Precision

Precision is the proportion of true positives out of all positive predictions made by the model:

Precision = TP / (TP + FP)

Precision matters when the cost of false positives is high (e.g., a treatment is expensive, so we want to minimize the number of patients we wrongly tell they suffer from the illness).

Model Evaluation - Classification: Recall

Recall is the proportion of actual positives that were correctly classified by the model:

Recall = TP / (TP + FN)

Recall is crucial when you want to minimize false negatives, as in medical diagnosis: telling a patient they do not have a certain illness when they do (missing a positive case) could have severe consequences.

Model Evaluation - Classification: AUC-ROC

The area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (recall) against the false positive rate (1 - specificity). The AUC score ranges from 0 to 1, where 1 represents perfect classification and 0.5 represents random guessing. AUC-ROC is useful for evaluating the overall discriminatory ability of a model, particularly in binary classification tasks. To build the ROC curve, the decision threshold is varied from 0 to 1 rather than fixed at 0.5.

Model Evaluation - Classification: Cross-entropy (log loss)

Cross-entropy measures the difference between two probability distributions. In the context of classification, it measures how well the predicted probabilities match the true labels. For binary classification:

Log loss = -(1/N) Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

where y_i is the actual label (0 or 1) and p_i is the predicted probability for the positive class.

Final notes

- Most of these metrics are specific to binary classification.
- For n-class classification (n > 2), accuracy, counting all correct predictions, is more popular.
- You can also convert the task to a binary approach by comparing one class vs. all the others.

The short code sketches below illustrate the regression and classification metrics discussed in this chapter.
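As a small worked example of the regression measures, the sketch below computes MSE, RMSE, and MAE directly from their definitions; the y_true and y_pred arrays are invented values used only for illustration.

```python
# Regression error measures computed from their definitions; values are made up.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual outcome values y_i
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # predicted outcome values yhat_i

errors = y_true - y_pred                  # prediction error for each record

mse = np.mean(errors ** 2)                # squaring penalizes larger errors more
rmse = np.sqrt(mse)                       # same scale as the dependent variable
mae = np.mean(np.abs(errors))             # less sensitive to outliers

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```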
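The threshold-based classification metrics (accuracy, precision, recall) follow directly from the four confusion-matrix counts. The counts in the sketch below are hypothetical and chosen only to show how the formulas are applied.

```python
# Accuracy, precision, and recall from hypothetical confusion-matrix counts.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # TP out of all positive predictions
recall = tp / (tp + fn)                      # TP out of all actual positives

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```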
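For the probability-based metrics, the sketch below (with made-up labels and predicted probabilities) uses scikit-learn's roc_curve, which sweeps the decision threshold to produce the (FPR, TPR) pairs of the ROC curve, together with roc_auc_score and log_loss.

```python
# AUC-ROC and cross-entropy (log loss) on made-up labels and probabilities.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, log_loss

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # actual labels y_i
p_pos = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])   # predicted P(class = 1)

# roc_curve varies the decision threshold and returns the resulting FPR/TPR pairs,
# which is how the ROC plot described in the slides is built.
fpr, tpr, thresholds = roc_curve(y_true, p_pos)
print("FPR:", np.round(fpr, 2))
print("TPR:", np.round(tpr, 2))

print("AUC-ROC :", roc_auc_score(y_true, p_pos))   # 1 = perfect, 0.5 = random guessing

# Binary cross-entropy: -(1/N) * sum(y_i*log(p_i) + (1-y_i)*log(1-p_i))
print("log loss:", log_loss(y_true, p_pos))
```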
