Machine Learning
Chapter 3 - Evaluation Methods for Machine Learning Models
Dr. Fuad Mire Hassan

Learning Outcomes
After completing this lecture, you will be able to:
▪ Understand the broader concept of ML evaluation
▪ Categorize the different metrics for evaluating regression models
▪ Describe methods for assessing the performance of classification algorithms
▪ Compute different machine learning evaluation measures, e.g. MSE, RMSE, MAE, precision, recall, F-score, etc.

Evaluation of Machine Learning Algorithms
▪ The quality of an ML model is measured by its performance in both the training and testing stages.
▪ To assess this performance, a range of evaluation measures is used, including those for regression and classification models.
▪ Other means of evaluating machine learning models include using baselines and comparing the results of different trained models.

Evaluation of Regression Algorithms
▪ As mentioned, the goal of regression algorithms is to predict numerical/continuous values.
▪ The most common regression evaluation measures are:
✔ Mean Squared Error (MSE)
✔ Root Mean Squared Error (RMSE)
✔ Mean Absolute Error (MAE)
▪ The MSE is calculated by taking the average of the squared differences between the predicted and the actual values of the target variable.
▪ It is an essential tool for evaluating the accuracy of a model or an estimator.
▪ A more appropriate measure that is often used is the Root Mean Squared Error (RMSE), which is the square root of the MSE.
▪ The RMSE is most useful for measuring how far the predicted values of a model are from the actual values.
▪ The Mean Absolute Error (MAE) is another regression performance measure, which finds the average magnitude of the errors without considering their signs. Because the MAE does not square the errors, it is always less than or equal to the RMSE.
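As a rough illustration of the three regression measures described above, the following minimal Python sketch computes MSE, RMSE and MAE from scratch. The function names and the sample values are made up for illustration and do not come from the lecture.

```python
import math


def mse(actual, predicted):
    """Average of the squared differences between predicted and actual values."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, expressed in the units of the target variable."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Average magnitude of the errors, ignoring their signs."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical target values and model predictions (illustrative only).
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 4.0]

print("MSE :", mse(actual, predicted))             # 3.5
print("RMSE:", round(rmse(actual, predicted), 3))  # 1.871
print("MAE :", mae(actual, predicted))             # 1.5
```

In this example the MAE (1.5) is smaller than the RMSE (about 1.871), because the single large error of 3 is squared before averaging in the RMSE.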
RMSE vs MAE
▪ Both measures express the average model prediction error in the same units as the target variable.
▪ Both are loss functions: the lower the score, the better.
▪ The RMSE penalizes large errors more heavily than the MAE (since it squares them), which helps us detect the presence of large errors.

Baseline models
▪ In addition to the previously discussed metrics, the quality of regression models can be understood relative to a baseline.
▪ A baseline is a simple model that makes basic predictions, for example, one that predicts the mean or median of the training data for every instance.
▪ Baseline results are used as a reference, i.e., if a model outperforms the baseline, it has learned something about the problem being modelled.

Evaluation of Classification Algorithms
▪ The most common classification evaluation measures are:
✔ Precision
✔ Recall
✔ F-score
✔ Accuracy
▪ The four possible outcomes for a classification algorithm are:
✔ True Positives (TP): correct positive model predictions
✔ True Negatives (TN): correct negative model predictions
✔ False Positives (FP): negative instances predicted as positive
✔ False Negatives (FN): positive instances predicted as negative

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

FP is sometimes called a Type 1 error, while FN is known as a Type 2 error.

[Figure: Type 1 and Type 2 errors. A false positive falsely identifies an instance belonging to another class as belonging to this class; a false negative falsely identifies an instance belonging to this class as belonging to another class. Source: Effect Size FAQs by Paul Ellis]

▪ Recall (aka sensitivity) is the number of positive instances that a classifier correctly identifies divided by the actual number of positive instances in the data, i.e.,
$\text{Recall} = \frac{\#TP}{\#TP + \#FN}$
Recall values fall between 0 and 1.
How can we improve the recall performance of a classification model?
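As a rough illustration of the four outcomes and the recall formula, the sketch below counts TP, FP, FN and TN from two label lists and then computes recall. The helper names and the labels are hypothetical and are not taken from the lecture.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # correct positive prediction
        elif p == positive:
            fp += 1  # negative instance predicted as positive (Type 1 error)
        elif a == positive:
            fn += 1  # positive instance predicted as negative (Type 2 error)
        else:
            tn += 1  # correct negative prediction
    return tp, fp, fn, tn


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels: 1 = positive, 0 = negative (illustrative only).
actual = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)  # 3 1 1 3
print(recall(tp, fn))  # 0.75
```

Here one of the four actual positives is missed (one FN), so recall is 3/4; every additional missed positive lowers it further.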
To maximize the recall of a given classification algorithm, we need to minimise the Type 2 error (FN).

▪ Precision is the number of positive instances that a classifier correctly identifies divided by the total number of instances it predicts as positive.
$\text{Precision} = \frac{\#TP}{\#TP + \#FP}$
Precision values fall between 0 and 1.
To maximize the precision of a given classification model, we need to minimise the Type 1 error (FP).

▪ Assume that we are using an automated COVID-19 test model, and we have the following actual and predicted medical conditions for five patients. What is the precision of our predictive model? (Take 5 minutes to do this.)

Actual condition    Predicted condition
negative            negative
positive            positive
positive            positive
negative            negative
negative            positive

Two positive predictions are correct (TP = 2) and one is wrong (FP = 1), so
$\text{Precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 1} = \frac{2}{3}$

▪ Some machine learning applications require high precision, e.g.
✔ Child-safe videos on YouTube: a video predicted as safe must be safe
✔ Shoplifter detection system: classify only thieves as thieves
▪ Other ML applications focus on high recall, e.g.
✔ Product ads to customers: don't miss any potential buyers
✔ Patient diagnosis system: a disease should not be missed
▪ You can usually have high precision or high recall, but rarely both, i.e., increasing one tends to reduce the other, so you need to trade off between them using the F-score.

▪ In many classification problems, e.g. spam detection, we need to trade off between precision and recall using the F-score.
▪ The F-score is the harmonic mean of the precision and recall measures, which combines their effect, i.e.,
$F\text{-score} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$
How can we maximise the F-score?
To maximise the F-score, we need both high precision and high recall.

Hands-on exercise
▪ Assume that we have a binary classification dataset consisting of 50 observations: 30 of class 1 and 20 of class 2. Suppose that we tested a pre-trained classification model on the dataset and obtained the following results for each class:

           TP    FP    FN
Class 1    25    2     5
Class 2    18    5     2

Calculate the precision, recall and F1 scores for each class.

▪ The scores can be computed using the following formulas:
$R = \frac{TP}{TP + FN}$, $P = \frac{TP}{TP + FP}$, $F1 = \frac{2 \cdot R \cdot P}{R + P}$

           TP    FP    FN    P       R       F1
Class 1    25    2     5     0.93    0.83    0.88
Class 2    18    5     2     0.78    0.90    0.84

For example, the class 1 recall is $R = \frac{25}{25 + 5} = 0.83$.
Please confirm all other scores yourself.

Baseline models
▪ Using the precision, recall and F-measure scores alone is not always enough to understand the model's performance.
▪ For that, we can compare our model's performance against the results of simple classification algorithms used as baseline models.
▪ Example classification baselines include:
✔ Majority-class classifier: predicts every instance as belonging to the majority class.
✔ Constant classifier: predicts every instance as a pre-defined class.

References & Reading Resources
▪ A complete guide to Machine Learning Evaluation. https://medium.com/analytics-vidhya/complete-guide-to-machine-learning-evaluation-metrics-615c2864d916
▪ How to evaluate your Machine Learning model. https://medium.com/analytics-vidhya/how-to-evaluate-your-machine-learning-model-76a7671e9f2e
▪ Chapter 6 (p. 211-222), Python Machine Learning, Sebastian Raschka and Vahid Mirjalili
▪ Any other useful resource relevant to the lecture topic

Thanks
