Evaluating Performance I
Summary
This document summarizes the stages of supervised learning, from preprocessing data to making predictions and evaluating performance. It covers techniques for model selection, data cleaning, and performance evaluation, including data visualization, scaling, and feature extraction, and it presents several metrics and graphs used to assess model performance.
Full Transcript
Evaluating Performance I
Readings: 4.1, 4.2, 4.3

Linear and logistic regression (recap)
- Linear regression: $\hat{y}_i = \mathbf{w}^T \mathbf{x}_i = w_0 + w_1 x_i$
- Logistic regression: $P(y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{w}^T \mathbf{x}_i)$ and $P(y_i = 0 \mid \mathbf{x}_i) = 1 - \sigma(\mathbf{w}^T \mathbf{x}_i)$
- [Figure: sigmoid curve of $P(y_i = 1 \mid \mathbf{x}_i)$ plotted against $x_i$ from -4 to 4, with example outputs 0.91 and 0.09 on either side of the 0.5 level]

Supervised learning in practice
- Preprocessing: explore and prepare data
  - Data visualization and exploration: identify patterns that can be leveraged for learning
  - Data cleaning: missing data, noisy data, erroneous data
  - Scaling (standardization): prepare data for use in scale-dependent algorithms
  - Feature extraction: dimensionality reduction eliminates redundant information
- Model training
  - Select models (hypotheses): select model options that may fit the data well; we'll call them "hypotheses"
  - Fit the model to training data: pick the "best" hypothesis function of the options by choosing model parameters
- Performance evaluation
  - Make a prediction on validation data and iteratively fine-tune the model
  - Metrics for classification: precision, recall, F1, ROC curves (binary), confusion matrices (multiclass)
  - Metrics for regression: MSE, explained variance, R²

Performance evaluation overview
- Metrics (regression/classification metrics, ROC curves): quantify model performance (today)
- Data resampling techniques (train/validation/test splits and cross validation): fairly evaluate generalization performance (next class)

Modeling considerations
- Accuracy
- Computational efficiency
- Interpretability

Accuracy: supervised learning performance evaluation
- Regression. Common metrics: mean squared error (MSE), mean absolute error (MAE), R² (coefficient of determination)
- Binary classification: receiver operating characteristic (ROC) curves. Common metrics: classification accuracy, true positive rate, false positive rate, precision, F1 score, area under the ROC curve (AUC)
- Multiclass classification: confusion matrices. Common metrics: classification accuracy, micro-averaged F1 score, macro-averaged F1 score

Regression: mean squared error
- The mean squared error (MSE) is an absolute measure of performance
- One of the most widely used loss/cost functions (when in doubt, use this!)

Regression: mean absolute error
- The mean absolute error (MAE); a short NumPy sketch computing both MSE and MAE appears a few slides below

Binary classification (kNN example)
- You input your training data into your kNN model
- Two of the three nearest neighbors are Class 1, so we predict the class to be Class 1
- What do we do if our training labels match that class? What if they don't?

Types of classification error
- False positive (Type I error)
- False negative (Type II error)
- Image from: Ellis, The Essential Guide to Effect Sizes
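The regression metrics above (MSE, MAE) and the two classification error types are simple to compute directly. Below is a minimal NumPy sketch; all of the arrays are made up purely for illustration and do not come from the slides.

```python
import numpy as np

# --- Regression: MSE and MAE on made-up values (illustration only) ---
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
print(f"MSE = {mse:.3f}, MAE = {mae:.3f}")

# --- Binary classification: counting the two error types (made-up labels) ---
y = np.array([1, 1, 1, 0, 0, 0, 0])      # true class
y_hat = np.array([1, 0, 1, 1, 0, 0, 0])  # predicted class

fp = np.sum((y == 0) & (y_hat == 1))  # false positives (Type I errors)
fn = np.sum((y == 1) & (y_hat == 0))  # false negatives (Type II errors, missed targets)
print(f"False positives (Type I) = {fp}, false negatives (Type II) = {fn}")
```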
Binary classification: confusion matrix
- True Class 1 (target), predicted Class 1 (target): true positive
- True Class 1 (target), predicted Class 0 (non-target): false negative (Type II error, a missed target)
- True Class 0 (non-target), predicted Class 1 (target): false positive (Type I error)
- True Class 0 (non-target), predicted Class 0 (non-target): true negative

Binary classification: true positive rate
- Also called the probability of detection $p_D$, sensitivity, or recall
- True positive rate = true positives / (true positives + false negatives)
- How many targets (Class 1) were correctly classified as targets?

Binary classification: false positive rate
- Also called the probability of false alarm $p_{FA}$
- False positive rate = false positives / (false positives + true negatives)
- How many non-targets (Class 0) were incorrectly classified as targets?

Binary classification: precision
- Precision = true positives / (true positives + false positives)
- How many of the predicted targets are actually targets?

ROC curves
- An ROC curve plots the true positive rate against the false positive rate
- Example data used in the next slides:

  True Class Label (y) | Classifier Confidence
  1 | 1.40
  1 | 0.95
  0 | 0.80
  1 | 0.60
  0 | -0.10

- Exercise: fill in the class estimate $\hat{y}$ for each sample by thresholding the classifier confidence

ROC curves: how do they compare?
- [Figure: three ROC curves, each plotting true positive rate vs. false positive rate]
- One of the curves shown corresponds to the most discriminative model, even though that model usually predicts incorrectly (a curve that sits below the diagonal still separates the classes; its decisions could simply be inverted)

ROC curves: where do we operate?
- What does it mean to operate at a point (1, 2, or 3) on this curve?
- [Figure: ROC curve of true positive rate vs. false positive rate with three labeled operating points]
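One way to read the exercise above: each threshold on the classifier confidence yields one (false positive rate, true positive rate) point, and sweeping the threshold traces out the ROC curve. Here is a minimal sketch using the five-sample table from these slides; the rule "predict Class 1 when confidence >= threshold" is an assumption, since the slides leave the estimates as an exercise.

```python
import numpy as np

# True labels and classifier confidences from the slide's table
y = np.array([1, 1, 0, 1, 0])
conf = np.array([1.40, 0.95, 0.80, 0.60, -0.10])

P = np.sum(y == 1)  # total targets (3)
N = np.sum(y == 0)  # total non-targets (2)

# Sweep thresholds from above the largest confidence down to below the smallest
thresholds = [np.inf] + sorted(conf, reverse=True) + [-np.inf]
for t in thresholds:
    y_hat = (conf >= t).astype(int)  # predict Class 1 if confidence >= threshold
    tp = np.sum((y == 1) & (y_hat == 1))
    fp = np.sum((y == 0) & (y_hat == 1))
    tpr = tp / P                     # true positive rate (recall)
    fpr = fp / N                     # false positive rate
    print(f"threshold {t:>6}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Plotting the resulting (FPR, TPR) pairs gives the ROC curve for this toy classifier.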
Precision-recall (PR) curves
- Precision = true positives / (true positives + false positives); recall = true positives / (true positives + false negatives)
- Using the same five samples as the ROC example above: total positives = 3, total negatives = 2

  Threshold | # True Positives | # Predicted Positives | Recall | Precision
  ∞ | 0 | 0 | 0 | undefined
  1.0 | 1 | 1 | 0.333 | 1
  0.9 | 2 | 2 | 0.667 | 1
  0.7 | 2 | 3 | 0.667 | 0.667
  0.0 | 3 | 4 | 1 | 0.75
  −∞ | 3 | 5 | 1 | 0.6

Be wary of overall accuracy as the sole metric

Case study 1
- True labels y (samples 1-15): 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
- Predictions ŷ: 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0
- Overall classification accuracy = 13/15 = 0.87
- ROC curves measure the trade-off between:
  - (A) False positive rate = false positives / (false positives + true negatives) = 1/8 = 0.13
  - (B) True positive rate (recall) = true positives / (true positives + false negatives) = 6/7 = 0.86
- PR curves measure the trade-off between:
  - (B) Recall = 6/7 = 0.86
  - (C) Precision = true positives / (true positives + false positives) = 6/7 = 0.86

Case study 2
- True labels y: 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- Predictions ŷ: 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- Overall classification accuracy = 13/15 = 0.87
- ROC curve quantities: false positive rate = 0/11 = 0; true positive rate (recall) = 2/4 = 0.5
- PR curve quantities: recall = 2/4 = 0.5; precision = 2/2 = 1

Case study 3
- True labels y: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0
- Predictions ŷ: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
- Overall classification accuracy = 13/15 = 0.87
- ROC curve quantities: false positive rate = 2/2 = 1; true positive rate (recall) = 13/13 = 1
- PR curve quantities: recall = 13/13 = 1; precision = 13/15 = 0.87

F1 score
- Harmonic mean of precision and recall:
  $F_1 = \dfrac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}} = 2\,\dfrac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
- More generally, $\beta$ controls the relative weight of precision and recall:
  $F_\beta = (1 + \beta^2)\,\dfrac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$

Multiclass F1
- Micro-average: calculate precision and recall globally by counting the total true positives, false negatives, and false positives (an average over the whole dataset)
- Macro-average: use the average precision and recall for each class label (an average of class averages)

Computational efficiency
- A measure of how an algorithm's run time (or space requirements) grows as the input size grows
- Complexity of making predictions with kNN (comparing an unseen sample to the training samples), assuming n = 10,000 samples and p = 2 features. The Euclidean distance between $\mathbf{x}_1$ and $\mathbf{x}_2$ is
  $d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{(x_{2,1} - x_{1,1})^2 + (x_{2,2} - x_{1,2})^2}$
- That is p distinct sets of operations dependent on the data, repeated n times (once for each sample in the training dataset): $O(np)$

Computational efficiency (continued)
- Training time efficiency? Test time efficiency?
- How does each change with the size of our data?
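As a rough illustration of that O(np) test-time cost: a brute-force kNN prediction computes a p-dimensional distance to each of the n training points for every query. A minimal NumPy sketch follows; the data is random and k = 3 is chosen arbitrarily, just to make the scaling visible in code.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 10_000, 2, 3                  # n training samples, p features, k neighbors
X_train = rng.normal(size=(n, p))       # training features
y_train = rng.integers(0, 2, size=n)    # binary training labels
x_query = rng.normal(size=p)            # one unseen sample

# Test time: one squared difference per feature, per training point -> O(n * p) work
dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean distances

# Majority vote among the k nearest training labels
nearest = np.argsort(dists)[:k]
y_pred = np.bincount(y_train[nearest]).argmax()
print(f"predicted class: {y_pred}")

# "Training" a plain kNN model is just storing X_train and y_train, so training
# time is trivial while prediction cost grows with both n and p.
```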
Interpretability
- Recidivism prediction example (Rudin 2019): an interpretable model achieves performance as good as a black-box model with 130+ factors; the black box might include socio-economic information, is expensive (software license), and sits within software used in the US justice system
- Transparency (can I tell how the model works?)
  - Simulatability: can I contemplate the whole model at once?
  - Decomposability: is there an intuitive explanation for each part of the model? (e.g., all patients with diastolic blood pressure over 150)
  - Loan approval example (Rudin 2019), written out as a short code sketch after the reference list below: (credit history too short AND at least one bad past trade) OR (at least 4 bad past trades) OR (at least one recent delinquency AND high percentage of delinquent trades)
- Explainability (post-hoc explanations)
  - Visualization, local explanations, explanations by example (e.g., this tumor is classified as malignant because, to the model, it looks a lot like these other tumors)

References
- Rudin, Cynthia. "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead." Nature Machine Intelligence 1, no. 5 (2019): 206–15.
- Lipton, Zachary C. "The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability Is Both Important and Slippery." Queue 16, no. 3 (2018): 31–57.
- Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Model-Agnostic Interpretability of Machine Learning." arXiv preprint arXiv:1606.05386, 2016.
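The loan-approval rule list quoted from Rudin (2019) can be written out directly as code, which is one way to see what "decomposable" means: each clause can be read and checked on its own. This is a sketch only; the feature names and the cutoff for "high percentage of delinquent trades" are invented here, not taken from the paper or the slides.

```python
def loan_default_rule(credit_history_too_short: bool,
                      n_bad_past_trades: int,
                      n_recent_delinquencies: int,
                      pct_delinquent_trades: float) -> bool:
    """Interpretable rule list paraphrased from Rudin (2019); names and thresholds are illustrative."""
    high_pct_delinquent = pct_delinquent_trades > 0.5  # invented cutoff for "high percentage"
    return (
        (credit_history_too_short and n_bad_past_trades >= 1)
        or (n_bad_past_trades >= 4)
        or (n_recent_delinquencies >= 1 and high_pct_delinquent)
    )

# Each clause is individually inspectable, unlike a 130+-factor black-box score.
print(loan_default_rule(False, 4, 0, 0.1))  # True: at least 4 bad past trades
```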