Evaluation Metrics in Machine Learning
Summary
This document provides a detailed overview of evaluation metrics in machine learning, covering both regression and classification tasks. It discusses various metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and more. The document also introduces concepts like the confusion matrix and Receiver Operating Characteristic (ROC) curves, along with relevant calculations and visualizations.
Full Transcript
start of the 3rd lecture

Chapter 3 Assessing model accuracy, machine learning diagnostics

54 Evaluation metrics - concept

At this point we have a broad understanding of the cost function and its crucial role in machine learning. We know that the cost function should meet certain properties (e.g. differentiability with respect to the parameters). In practice, this means that only a limited number of functions can be used to train the model and monitor its quality. However, for the broader task of evaluating a model's performance (during training and testing), evaluation metrics have been developed that do not have to comply with such restrictive mathematical properties. Evaluation metrics are calculated after the estimator has already been fitted with a (possibly different) cost function, so the evaluation metric does not affect the estimator per se. We distinguish evaluation metrics for the following problems:
● regression
● classification
● probabilities

55 Evaluation metrics - regression

In the case of regression we deal with a continuous target. Intuitively, we are looking for metrics that describe the distance between the prediction and the actual value (it's straightforward). The most popular regression metrics are (following the Scikit-learn definitions; see also the sketch below):
● Mean Squared Error (MSE): MSE = (1/n) Σ_i (y_i − ŷ_i)^2
● Root Mean Squared Error (RMSE): RMSE = sqrt(MSE)
● Mean Absolute Error (MAE): MAE = (1/n) Σ_i |y_i − ŷ_i|
● Mean Absolute Percentage Error (MAPE): MAPE = (1/n) Σ_i |y_i − ŷ_i| / |y_i| (in Scikit-learn the denominator is max(ε, |y_i|), where epsilon is a small, strictly positive number that avoids division by zero)
● Mean Squared Logarithmic Error (MSLE): MSLE = (1/n) Σ_i (ln(1 + y_i) − ln(1 + ŷ_i))^2
● R² score: R² = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − ȳ)^2
● Median Absolute Error (MedAE): MedAE = median(|y_1 − ŷ_1|, …, |y_n − ŷ_n|)
plus Mean Absolute Scaled Error, Mean Directional Accuracy and many, many more…

To visualize the error distribution we can use a histogram/KDE plot, which gives a complete picture of the performance of a regression estimator. [source: Scikit-learn]

56 Evaluation metrics - regression

When choosing an evaluation metric, be very careful and make sure you deeply understand the business consequences of your decision. Compare, for example, MAPE with its symmetric variant:
MAPE = (1/n) Σ_i |y_i − ŷ_i| / |y_i|
symmetric MAPE (sMAPE) = (1/n) Σ_i |y_i − ŷ_i| / ((|y_i| + |ŷ_i|) / 2)
[source: Towards Data Science]
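Below is a minimal sketch, using scikit-learn, of how the regression metrics listed above can be computed; the y_true and y_pred arrays are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    median_absolute_error,
    r2_score,
)

# Made-up target values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.3])

mse = mean_squared_error(y_true, y_pred)               # average squared error
rmse = np.sqrt(mse)                                    # same units as the target
mae = mean_absolute_error(y_true, y_pred)              # average absolute error
mape = mean_absolute_percentage_error(y_true, y_pred)  # relative error
medae = median_absolute_error(y_true, y_pred)          # robust to outliers
r2 = r2_score(y_true, y_pred)                          # proportion of explained variance

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} "
      f"MAPE={mape:.2%} MedAE={medae:.3f} R2={r2:.3f}")
```

The residuals (y_true − y_pred) from the same arrays can be fed into a histogram or KDE plot to inspect the error distribution mentioned above.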
57 Evaluation metrics - classification

In the case of a classification problem it is much more difficult to make a correct assessment of the model; it requires a bit more knowledge and abstract thinking. First of all, let's introduce the confusion matrix: a table that, for a binary problem, counts true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). Example: [source: Wikipedia]

58 Evaluation metrics - classification

Based on the confusion matrix we can derive the following classification metrics (* not applicable to imbalanced problems):
● Accuracy* (how many observations, both positive and negative, were correctly classified): (TP + TN) / (TP + TN + FP + FN)
● True Positive Rate, Recall or Sensitivity (how many observations out of all positive observations are classified as positive): TPR = TP / (TP + FN)
● True Negative Rate or Specificity (how many observations out of all negative observations are classified as negative): TNR = TN / (TN + FP)
● Positive Predictive Value or Precision (how many observations predicted as positive are in fact positive): PPV = TP / (TP + FP)
● Negative Predictive Value (how many predictions out of all negative predictions were correct): NPV = TN / (TN + FN)
● False Positive Rate (Type I error rate): FPR = FP / (FP + TN)
● False Negative Rate (Type II error rate): FNR = FN / (FN + TP)
● F-beta score (combines precision and recall in one metric; the more you care about recall over precision, the higher the beta you should choose; well suited to imbalanced datasets): F_beta = (1 + beta^2) · precision · recall / (beta^2 · precision + recall)
[source: Neptune.ai blog]

59 Evaluation metrics - classification

Based on the confusion matrix we can also derive:
● Matthews Correlation Coefficient (the correlation between predicted classes and the ground truth; well suited to imbalanced datasets): MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
and many, many more…

For binary classification metrics we strongly recommend the following Neptune.ai blog post: link. It accurately defines each evaluation metric with an intuitive interpretation (super useful in a regular business environment) and provides very pertinent advice on when to apply a given metric.

Of course, we can generalize binary classification metrics to multiclass classification. First of all, we can plot the confusion matrix, which is self-explanatory. Additionally, for each class (one-vs-all approach) we can calculate precision, recall, F-beta score etc. separately and finally average them by some aggregation rule (micro, macro or weighted averaging); here you can find more details: link. A sketch of the binary metrics computed in code follows below. [source: Neptune.ai blog]
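A minimal sketch, assuming made-up binary labels and predictions, of how the confusion-matrix-based metrics above can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    fbeta_score,
    matthews_corrcoef,
)

# Made-up ground truth and predictions, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# scikit-learn's confusion matrix layout: rows = actual, columns = predicted.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)      # (TP + TN) / total
precision = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall = recall_score(y_true, y_pred)          # TP / (TP + FN), a.k.a. TPR
specificity = tn / (tn + fp)                   # TN / (TN + FP), a.k.a. TNR
f2 = fbeta_score(y_true, y_pred, beta=2)       # recall weighted more than precision
mcc = matthews_corrcoef(y_true, y_pred)        # correlation with the ground truth

print(accuracy, precision, recall, specificity, f2, mcc)
```

For multiclass problems, precision_score, recall_score and fbeta_score accept average="micro", "macro" or "weighted", matching the aggregation rules mentioned above.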
60 Accuracy, precision, recall, F1-score - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts of the accuracy, precision, recall and F1-score metrics. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Precision & Recall by MLU-EXPLAIN

61 Evaluation metrics - probabilities (for classification task)

When we use classification algorithms we nearly always want to deal with probabilities. In most cases we can set up models to return probabilities (not the predicted class)! We then need to decide where to place the probability cut-off point (above which we assign an observation to a specific class). It's not an easy task. In most cases we start with a 0.5 (50%) cut-off point, but very often that is the wrong value! Thanks to evaluation metrics and plots dedicated to probabilities (for classification) we can make this decision in a responsible and aware way. We can distinguish the following metrics:
● Receiver Operating Characteristic Curve (ROC)
● Precision/recall curve
● Lift curve
● Gini curve
● Area Under the Curve ROC (AUC ROC)
● Area Under the Curve Precision/recall (AUC PR)
● Log-loss (cross-entropy)

62 Evaluation metrics - probabilities

Receiver Operating Characteristic Curve (ROC)
The ROC curve addresses the trade-off between the true positive rate (TPR) and the false positive rate (FPR). For every probability cut-off point we calculate TPR and FPR and plot them on one chart. At the beginning, when the cut-off point is 1, we classify every observation as "0"; obviously in this situation FPR is equal to 0. As the cut-off point decreases we predict more "1"s, so the TPR starts to increase. However, our estimator will probably not be perfect, so some of the predicted "1"s are incorrect, hence the FPR increases (and the TNR decreases). Generally, the higher the TPR and the lower the FPR at each threshold the better, so classifiers whose curves lie closer to the top-left corner are better. As you may notice, the ROC curve is not well suited to imbalanced classification tasks (for more details please read this article).

Area Under the Curve ROC (AUC ROC)
Additionally, we can calculate AUC ROC, which gives one summary number for the quality of the model. A random classifier scores 0.5 and a perfect one scores 1. We should not use it on an imbalanced dataset. It is recommended if you care about true negatives as much as true positives and you care about ranking predictions. This metric can also be interpreted as the probability that a uniformly drawn random positive has a higher score than a uniformly drawn random negative. Notice: the AUC metric treats all classification errors equally, ignoring the potential consequences of one error type versus another. For example, in cancer detection we'll probably want to minimize false negatives. [source: Wikipedia]

63 ROC and ROC AUC - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of ROC and ROC AUC. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. ROC & AUC by MLU-EXPLAIN

64 Evaluation metrics - probabilities

Precision recall curve (PR curve)
The PR curve combines precision and recall in a single visualization. For every probability cut-off we calculate PPV and TPR and plot them on the graph. The higher the curve, the better the model. However, we face the classic precision/recall dilemma here (the higher the precision, the lower the recall).

Area Under the Curve Precision recall (AUC PR)
Similarly to the ROC AUC score, we can calculate the Area Under the Precision-Recall Curve to get one representative number for the whole model. We can treat PR AUC as the average precision calculated over recall thresholds from 0.0 to 1.0. PR AUC is recommended for highly imbalanced problems and when we communicate the precision/recall decision to our stakeholders (additionally suggesting where the best possible cut-off point is). A sketch of both curves computed from predicted probabilities follows below. [source: Stackoverflow]
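A minimal sketch of how the ROC and precision-recall curves (and their AUC summaries) can be computed from predicted probabilities with scikit-learn; y_true and y_score are made-up illustrative values:

```python
import numpy as np
from sklearn.metrics import (
    roc_curve,
    roc_auc_score,
    precision_recall_curve,
    average_precision_score,
)

# Made-up labels and predicted probabilities of class "1", for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3])

# ROC: FPR and TPR at every probability threshold.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
roc_auc = roc_auc_score(y_true, y_score)

# PR curve: precision and recall at every probability threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # one common PR AUC estimate

print(f"AUC ROC = {roc_auc:.3f}, PR AUC (average precision) = {pr_auc:.3f}")
```

The (fpr, tpr) and (recall, precision) pairs are exactly the points plotted on the ROC and PR charts discussed above; scanning the threshold arrays is one way to pick a cut-off point consciously rather than defaulting to 0.5.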
65 Evaluation metrics - probabilities (for regression task)

As in classification, in regression we can also use the notion of probability, but in a slightly different sense: we can build regression models that, in addition to the expected value, estimate confidence intervals for the forecast. We will not discuss this issue in our classes due to its advanced level, but it is worth knowing about the existence of the Continuous Ranked Probability Score (CRPS), which generalizes the MAE to the case of probabilistic forecasts. Link for more details: https://www.lokad.com/continuous-ranked-probability-score.

66 Bias/variance trade-off - concept

The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points. The variance of a model is the variability of the model prediction for given data points. MSE decomposition: Error = Bias^2 + Variance + Noise. Bias/variance trade-off: the simpler the model, the higher the bias; the more complex the model, the higher the variance. For the MSE decomposition check page 19. [source: Stanford CS 229, Scott Fortmann-Roe Essay]

67 Bias/variance trade-off - overfitting and underfitting

[illustrations of underfitting vs. overfitting; source: Stanford CS 229]

68 Bias/variance trade-off - overfitting and underfitting (cont'd)

[figure annotations with suggested remedies: use boosting (against high bias); use bagging or reduce the complexity of the model (against high variance)]

These illustrations present learning curves. A learning curve is a plot of model learning performance over experience or time. We distinguish the following learning curves:
● Train Learning Curve: learning curve calculated from the training dataset that gives an idea of how well the model is learning.
● Validation Learning Curve: learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing.
● Optimization Learning Curves: learning curves calculated on the metric by which the parameters of the model are being optimized, e.g. log-loss.
● Performance Learning Curves: learning curves calculated on the metric by which the model will be evaluated and selected, e.g. AUC ROC.
[source: Stanford CS 229, Machine Learning Mastery]

69 Bias/variance trade-off - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the bias/variance trade-off. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Bias/variance trade-off by MLU-EXPLAIN

70 Training, validation and testing sets - concept

Generally, learning the parameters of a prediction function and testing it on the same data is a methodological mistake (we can easily overfit our model). In statistics and machine learning there is a good practice of dividing a data set into three parts, each with a dedicated purpose (for more details: see slide 61). In some classification problems we encounter imbalanced datasets (e.g. few "1"s and a lot of "0"s). It is very important to use a stratified approach, which ensures that the relative class frequencies are approximately preserved in each train-validation pair (see the sketch below). Sometimes a strategy of creating several models independently on the train and validation data is used, and then the single best model is selected on the testing sample. [source: Stanford CS 229]
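A minimal sketch of a stratified train/validation/test split with scikit-learn, using a made-up imbalanced dataset; the two-step split and the 60/20/20 proportions are illustrative choices, not a prescription:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 1000 observations, roughly 10% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)

# First split off the test set, then split the remainder into train/validation.
# stratify=... keeps the class proportions roughly equal in every part.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)  # 0.25 of the remaining 80% = 20% of the full dataset

print(y_train.mean(), y_val.mean(), y_test.mean())  # similar positive rates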
71 Training, validation and testing sets - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the train-test-validation dataset split. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Train-test-validation by MLU-EXPLAIN

72 Cross-validation - concept

We can generalize the idea of the training, validation and testing split into a much more complex and powerful solution: cross-validation (CV). CV is a technique for evaluating a machine learning model and testing its performance. More precisely, CV is a resampling method that uses different portions of the data to validate and train a model on different iterations. This approach is much more robust than a single train-validation split, because we shouldn't treat the value from a single validation set as an ideal approximation of the ground truth. There are two crucial reasons for using validation/cross-validation:
● assessment of the quality of our model in a quasi-objective way (lower probability of overfitting)
● "safe" (again, lower probability of overfitting) execution of the hyperparameter tuning procedure (a hyperparameter is a parameter whose value is used to control the learning process; it is not estimable from the data, so the researcher has to specify it by hand, based on intuition or via a hyperparameter search procedure in which CV is crucial)
[source: Wikipedia, Scikit-learn]

73 Cross-validation - different types

We can distinguish dozens of types of cross-validation, for instance (we will discuss some of them):
● Hold-out
● K-fold
● Leave-one-out (LOO)
● Leave-p-out
● Stratified K-fold
● Repeated K-fold
● Nested K-fold
● Time series CV
The reasons for having multiple types are many: the specifics of the data (e.g. cross-sectional vs. time series data), the specifics of the business problem, the size of the dataset, the imbalance of the dataset, computing resources, the probability of data leakage, etc. In such a view, it is impossible to say which approach is best and should always be followed. However, in everyday use k-fold seems to be the most popular (for cross-sectional problems); a sketch follows below.
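A minimal sketch of (stratified) k-fold cross-validation with scikit-learn; the logistic regression model and the synthetic dataset are illustrative placeholders, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, slightly imbalanced classification data, for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

model = LogisticRegression(max_iter=1000)

# Stratified 5-fold CV: every fold preserves the class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(scores)                       # one AUC ROC score per fold
print(scores.mean(), scores.std())  # summary across the folds
```

Reporting the mean and standard deviation across folds, rather than a single validation score, is exactly the robustness argument made above; the same cv object can also be passed to hyperparameter search utilities.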