Summary

This presentation covers methods for evaluating machine learning models, emphasizing validation techniques such as holdout validation, leave-one-out cross-validation, and K-fold cross-validation. Evaluation metrics, including the confusion matrix, accuracy, sensitivity, specificity, precision, F1 score, and ROC curves, are also discussed.

Full Transcript

Model Evaluation - Medical Artificial Intelligence (의료 인공지능) - So Yeon Kim

Cross-validation

Overfitting and cross-validation
- Internal validation: validate your model on your current dataset (cross-validation).
- External validation: validate your model on a completely new dataset.
- Cross-validation is used to choose the best parameter setting and to show that your model does not overfit the training data.
- Common schemes: holdout validation, Leave-One-Out Cross-Validation (LOOCV), and K-fold cross-validation.

Tuning hyperparameters
- Suppose you want to determine a good value for some hyperparameter.
- Partition the data into a training set and a test set (holdout); in general, 80% of the data is used for training and 20% for testing.
- For each hyperparameter value i, run the learning algorithm on the training set and evaluate on the test set; the value with the best test performance is chosen.

Holdout validation
- Advantages: fast and computationally efficient; simple to implement; scalable to large datasets.
- Disadvantages: high variance (results can vary significantly if different subsets are chosen); wastes data (only a portion of the data is used for training while the rest is held out for testing, which may lead to less accurate models, especially with smaller datasets).

Leave-One-Out Cross-Validation (LOOCV)
- Use only one sample as the test data and the remaining (N-1) samples for model training.
- Validation is performed N times, once per data sample, giving N accuracy values whose average becomes the final performance score.
- Advantages: maximizes data usage (especially useful for small datasets); low variance, since every sample is tested, giving a more stable and reliable estimate.
- Disadvantages: computationally expensive; may give an optimistic view of performance, since the training set is almost the entire dataset and the model can overfit to similar data patterns.

K-fold cross-validation
- Divide the dataset into K subsets and perform holdout validation K times: in each fold, one of the K subsets is the test set and the remaining (K-1) subsets are used for training.
- This yields K accuracy values, and their average becomes the final performance score.
- Although the K subsets are randomly divided, testing the model on all subsets makes this method more accurate than holdout validation.
- Advantages: all data points are used for both training and testing, so no data is wasted; reduces overfitting risk; more stable and reliable performance estimates (less variance).
- As K increases, the variance of the performance estimate decreases, but computation takes longer; N-fold cross-validation (N = number of samples) is equivalent to LOOCV.
- There is still some sensitivity to how the data is split into folds, especially when K is small, so it is important to choose an appropriate K based on the dataset size; typically K = 5 or 10 is used.
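
A minimal sketch of the three validation schemes above using scikit-learn (this example is not from the slides; the dataset and the logistic-regression classifier are arbitrary illustrative choices):

```python
# Illustrative sketch: holdout, 5-fold cross-validation, and LOOCV with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Holdout validation: 80% of the data for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))

# K-fold cross-validation (K = 5): the mean of the K fold accuracies is the
# final performance score.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("5-fold accuracies:", scores, "mean:", scores.mean())

# LOOCV is the special case K = N (one held-out sample per fold); it is the
# most expensive option because the model is refit N times.
loo_scores = cross_val_score(LogisticRegression(max_iter=5000), X, y,
                             cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean())
```
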
Evaluation metrics

Confusion matrix
- True Positive (TP): correctly classified as the class of interest.
- False Negative (FN): incorrectly classified as not the class of interest.
- False Positive (FP): incorrectly classified as the class of interest.
- True Negative (TN): correctly classified as not the class of interest.

                          True class: Positive (1)    True class: Negative (0)
Predicted Positive (1)    True Positive (TP)          False Positive (FP), Type I error
Predicted Negative (0)    False Negative (FN)         True Negative (TN), Type II error

- False Positive (FP): predicting a value as 1 when it is actually 0 (Type I error).
- False Negative (FN): predicting a value as 0 when it is actually 1 (Type II error).
- Case 1: deciding whether a drug is effective for a certain disease. Judging an ineffective drug to be effective is a Type I error; judging an effective drug to be ineffective is a Type II error. If a drug is falsely judged as effective, a patient may take it and fail to be treated, which is much more dangerous, so reducing Type I errors is much more important here.
- Case 2: diagnosing cancer patients. Misdiagnosing a healthy person as having cancer is a Type I error; misdiagnosing a cancer patient as healthy is a Type II error. Missing a cancer patient is more critical, so reducing Type II errors is much more important here.

How to evaluate the model?
- Accuracy = (TP + TN) / (TP + FP + FN + TN)
- Accuracy alone can be misleading: in a classification problem where class 0 makes up 99% of the data, a classifier that predicts every instance as class 0 still reaches 99% accuracy (TP = 0, TN = 99).
- Sensitivity (TP rate, Recall) = TP / (TP + FN): the proportion of actual 1s that were correctly predicted as 1.
- Specificity (TN rate) = TN / (FP + TN): the proportion of actual 0s that were correctly predicted as 0.
- Precision = TP / (TP + FP): the proportion of predicted 1s that are actually 1.
- F1 score = 2 × (Precision × Recall) / (Precision + Recall): the harmonic mean of Precision and Recall.

Sensitivity & specificity tradeoff
- High sensitivity means a low false negative rate; high specificity means a low false positive rate.
- Case 1: airport security. If the equipment fails to detect a terrorist carrying a weapon (a high FN rate), it is extremely dangerous, so increasing sensitivity is much more important than specificity.
- Case 2: COVID-19 self-test kit. Sensitivity is the probability of correctly identifying a truly positive patient as positive; specificity is the probability of correctly identifying a truly negative patient as negative. Ideally both should be high.
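
A minimal sketch (not from the slides) of computing the confusion matrix and the metrics defined above with scikit-learn; the label and prediction vectors are hypothetical:

```python
# Illustrative sketch: confusion matrix, accuracy, sensitivity, specificity,
# precision, and F1 score for hypothetical labels and predictions.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # actual classes (hypothetical)
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # predicted classes (hypothetical)

# With labels ordered [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)               # recall / true positive rate
specificity = tn / (fp + tn)               # true negative rate
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} precision={precision:.2f} f1={f1:.2f}")

# The same quantities via the built-in helpers (there is no direct helper for
# specificity; it equals the recall of the negative class).
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```
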
ROC (Receiver Operating Characteristic) curve
- The ROC curve plots 1 - specificity (the false positive rate, FPR) on the x-axis against sensitivity (the true positive rate, TPR) on the y-axis.
- TPR = TP / (TP + FN), FPR = FP / (FP + TN)
- AUROC: Area Under the ROC Curve.

How to evaluate the model? Precision-Recall Curve (PRC)
- When the dataset is imbalanced (in particular, when one class is much rarer than the other), the Precision-Recall Curve is useful.
- Precision = TP / (TP + FP), Recall = TPR = TP / (TP + FN)
- AUPRC: Area Under the PR Curve.

Evaluation: regression
- MAE, RMSE
- Pearson correlation, Spearman rank correlation
- Correlation coefficients and significance tests
(A short code sketch covering these classification and regression metrics follows at the end of the transcript.)

Summary: Evaluation metrics
- Classification: accuracy, precision, recall, F1 score, AUC
- Regression: mean absolute error, root mean squared error; Pearson and Spearman rank correlation, one-sided/two-sided tests, correlation significance test
- Clustering: adjusted mutual information, adjusted Rand score, silhouette index
- https://scikit-learn.org/stable/modules/model_evaluation.html

Thank You! Q&A
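
As a closing illustration (not part of the original slides), a minimal sketch of the ROC/PR curve metrics and the regression metrics listed above, using scikit-learn and scipy; the score and target arrays are hypothetical:

```python
# Illustrative sketch: AUROC, AUPRC, MAE, RMSE, and correlation coefficients.
import numpy as np
from sklearn.metrics import (roc_curve, roc_auc_score,
                             precision_recall_curve, average_precision_score,
                             mean_absolute_error, mean_squared_error)
from scipy.stats import pearsonr, spearmanr

# Classification: scores are predicted probabilities for the positive class.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)          # points on the ROC curve
print("AUROC:", roc_auc_score(y_true, y_score))

prec, rec, _ = precision_recall_curve(y_true, y_score)
print("AUPRC (average precision):", average_precision_score(y_true, y_score))

# Regression: MAE, RMSE, and correlations with their significance tests.
y_reg_true = np.array([3.0, 2.5, 4.0, 5.1, 3.3])
y_reg_pred = np.array([2.8, 2.9, 3.7, 5.0, 3.6])
mae  = mean_absolute_error(y_reg_true, y_reg_pred)
rmse = np.sqrt(mean_squared_error(y_reg_true, y_reg_pred))
r, p_r   = pearsonr(y_reg_true, y_reg_pred)       # linear correlation + p-value
rho, p_s = spearmanr(y_reg_true, y_reg_pred)      # rank correlation + p-value
print("MAE:", mae, "RMSE:", rmse)
print("Pearson r:", r, "p =", p_r, "| Spearman rho:", rho, "p =", p_s)
```
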
