Evaluation Metrics in Machine Learning


Summary

This document provides a detailed overview of evaluation metrics in machine learning, covering both regression and classification tasks. It discusses various metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and more. The document also introduces concepts like the confusion matrix and Receiver Operating Characteristic (ROC) curves, along with relevant calculations and visualizations.

Full Transcript

start of the 3rd lecture

Chapter 3: Assessing model accuracy, machine learning diagnostics

54 Evaluation metrics - concept

At this point, we have a broad understanding of the cost function and its crucial role in machine learning. We know that a cost function should meet certain properties (e.g. differentiability with respect to the parameters). In practice, this means that only a limited number of functions can be used to train and monitor the quality of our model. However, for the broader process of evaluating the performance of a model (during training and testing), evaluation metrics have been developed that do not have to comply with such restrictive mathematical properties. Evaluation metrics are calculated after the estimator has already been fitted with a (possibly different) cost function, so the evaluation metric does not affect the estimator per se. We distinguish evaluation metrics for the following problems:
● regression
● classification
● probabilities

55 Evaluation metrics - regression

In regression we deal with a continuous target. Intuitively, we are looking for metrics that describe the distance between the prediction and the actual value (it's straightforward). The most popular regression metrics are:
● Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
● Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
● Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
● Mean Absolute Percentage Error (MAPE): $\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{\max(\epsilon, |y_i|)}$, where epsilon is a small, strictly positive number (it prevents division by zero)
● Mean Squared Logarithmic Error (MSLE): $\mathrm{MSLE} = \frac{1}{n}\sum_{i=1}^{n}\big(\log(1 + y_i) - \log(1 + \hat{y}_i)\big)^2$
● R² score: $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
● Median Absolute Error (MedAE): $\mathrm{MedAE} = \mathrm{median}\big(|y_1 - \hat{y}_1|, \ldots, |y_n - \hat{y}_n|\big)$
● Mean Absolute Scaled Error, Mean Directional Accuracy and many, many more…
To visualize the error distribution we can use a histogram/KDE plot, which gives us a complete picture of the performance of a regression estimator. A short code sketch computing these metrics follows slide 56 below. [source: Scikit-learn]

56 Evaluation metrics - regression

When choosing an evaluation metric, be very careful and make sure you deeply understand the business outcome of your decisions:
● MAPE: $\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\frac{|A_t - F_t|}{|A_t|}$
● symmetric MAPE (sMAPE): $\mathrm{sMAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\frac{|F_t - A_t|}{(|A_t| + |F_t|)/2}$
where $A_t$ is the actual value and $F_t$ the forecast. [source: Towards Data Science]
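As promised above, here is a minimal sketch, assuming scikit-learn (the library cited on the slides), of how these regression metrics can be computed. The toy data is invented purely for illustration, and sMAPE is written by hand because scikit-learn has no dedicated function for it.

```python
# Illustrative only: computing the regression metrics above with scikit-learn.
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, mean_absolute_percentage_error,
    mean_squared_log_error, r2_score, median_absolute_error,
)

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # toy actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # toy predictions

mse   = mean_squared_error(y_true, y_pred)
rmse  = np.sqrt(mse)
mae   = mean_absolute_error(y_true, y_pred)
mape  = mean_absolute_percentage_error(y_true, y_pred)  # uses a small epsilon internally
msle  = mean_squared_log_error(y_true, y_pred)          # targets must be non-negative
r2    = r2_score(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)

# Symmetric MAPE, written by hand (not part of scikit-learn).
smape = np.mean(np.abs(y_pred - y_true) / ((np.abs(y_true) + np.abs(y_pred)) / 2))

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} MAPE={mape:.3f} "
      f"MSLE={msle:.3f} R2={r2:.3f} MedAE={medae:.3f} sMAPE={smape:.3f}")
```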
57 Evaluation metrics - classification

In the case of a classification problem it is much more difficult to make a correct assessment of the model; it requires a bit more knowledge and abstract thinking. First of all, let's introduce the confusion matrix: a table that cross-tabulates actual classes against predicted classes, whose cells are the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). [example: Wikipedia]

58 Evaluation metrics - classification

Based on the confusion matrix we can derive the following classification metrics:
● Accuracy* (how many observations, both positive and negative, were correctly classified): $\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$ (*not applicable to imbalanced problems)
● True Positive Rate, Recall or Sensitivity (how many observations out of all positive observations are classified as positive): $\mathrm{TPR} = \frac{TP}{TP + FN}$
● True Negative Rate or Specificity (how many observations out of all negative observations are classified as negative): $\mathrm{TNR} = \frac{TN}{TN + FP}$
● Positive Predictive Value or Precision (how many observations predicted as positive are in fact positive): $\mathrm{PPV} = \frac{TP}{TP + FP}$
● Negative Predictive Value (how many predictions out of all negative predictions were correct): $\mathrm{NPV} = \frac{TN}{TN + FN}$
● False Positive Rate or Type I error rate: $\mathrm{FPR} = \frac{FP}{FP + TN}$
● False Negative Rate or Type II error rate: $\mathrm{FNR} = \frac{FN}{FN + TP}$
● F-beta score (combination of precision and recall in one metric; the more you care about recall over precision, the higher beta you should choose; well suited to imbalanced datasets): $F_\beta = (1 + \beta^2)\,\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\cdot\mathrm{precision} + \mathrm{recall}}$
[source: Neptune.ai blog]

59 Evaluation metrics - classification

Based on the confusion matrix we can also derive:
● Matthews Correlation Coefficient (correlation between the predicted classes and the ground truth; well suited to imbalanced datasets): $\mathrm{MCC} = \frac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
and many, many more…

For binary classification metrics we strongly recommend the following Neptune.ai blog post: link. They accurately define each evaluation metric with an intuitive interpretation (super useful in a regular business environment) and provide very pertinent advice on when to apply a given metric. Of course, we can generalize binary classification metrics to multiclass classification metrics. First of all, we can plot the confusion matrix, which is self-explanatory. Additionally, for each class (one-vs-all approach) we can calculate precision, recall, F-beta score etc. separately and finally average them by some aggregation rule (micro, macro or weighted averaging; more details: link). A code sketch for the binary case follows slide 60 below. [source: Neptune.ai blog]

60 Accuracy, precision, recall, F1-score - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts of accuracy, precision, recall and F1-score. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Precision & Recall by MLU-EXPLAIN
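A minimal sketch, assuming scikit-learn and made-up binary labels, of how the metrics above can be derived from the confusion matrix. Specificity, NPV, FPR and FNR are computed by hand, since scikit-learn has no dedicated functions for them.

```python
# Illustrative sketch: classification metrics derived from a confusion matrix.
import numpy as np
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    fbeta_score, matthews_corrcoef,
)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # toy ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = accuracy_score(y_true, y_pred)       # (TP + TN) / total
recall      = recall_score(y_true, y_pred)         # TPR = TP / (TP + FN)
specificity = tn / (tn + fp)                       # TNR, computed by hand
precision   = precision_score(y_true, y_pred)      # PPV = TP / (TP + FP)
npv         = tn / (tn + fn)
fpr         = fp / (fp + tn)
fnr         = fn / (fn + tp)
f2          = fbeta_score(y_true, y_pred, beta=2)  # beta > 1 favours recall
mcc         = matthews_corrcoef(y_true, y_pred)

print(f"ACC={accuracy:.3f} TPR={recall:.3f} TNR={specificity:.3f} PPV={precision:.3f} "
      f"NPV={npv:.3f} FPR={fpr:.3f} FNR={fnr:.3f} F2={f2:.3f} MCC={mcc:.3f}")

# Multiclass case: precision_score(..., average="micro" | "macro" | "weighted"), etc.
```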
61 Evaluation metrics - probabilities (for classification task)

When we use classification algorithms we nearly always want to deal with probabilities. In most cases we can set up models to return probabilities (not predicted classes)! We then need to decide where to place the probability cut-off point (above which we assign an observation to a specific class). It's not an easy task. In most cases we start with a 0.5 (50%) cut-off point, but this can often be the wrong value! Thanks to evaluation metrics and plots dedicated to probabilities (for classification) we can make this decision in a responsible and aware way. We can distinguish the following metrics here:
● Receiver Operating Characteristic curve (ROC)
● Precision/recall curve
● Lift curve
● Gini curve
● Area Under the ROC Curve (AUC ROC)
● Area Under the Precision/recall Curve (AUC PR)
● Log-loss (cross-entropy)

62 Evaluation metrics - probabilities

Receiver Operating Characteristic curve (ROC). The ROC curve addresses the trade-off between the true positive rate (TPR) and the false positive rate (FPR). For every probability cut-off point we calculate TPR and FPR and plot them on one chart. At the beginning, when the cut-off point is 1, we classify every observation as "0"; obviously in this situation both TPR and FPR are equal to 0. As the cut-off point decreases we predict more "1"s, so the TPR starts to increase. However, our estimator will probably not be perfect, so some of the predicted "1"s are incorrect, hence the increase of FPR (and decrease of TNR). Generally, the higher the TPR and the lower the FPR at each threshold the better, so classifiers whose curves lie closer to the top-left corner are better. As you may notice, the ROC curve is not well suited for imbalanced classification tasks (for more details please read this article).

Area Under the ROC Curve (AUC ROC). Additionally, we can calculate AUC ROC, which gives one summary number to assess the quality of the model. A random classifier scores 0.5 and a perfect one scores 1 (values below 0.5 indicate a model worse than random guessing). We should not rely on it for heavily imbalanced datasets. It is recommended when you care about true negatives as much as true positives and when you care about ranking predictions. The metric can be interpreted as: the probability that a uniformly drawn random positive has a higher score than a uniformly drawn random negative. Notice: the AUC metric treats all classification errors equally, ignoring the potentially different consequences of the two error types. For example, in cancer detection we will probably want to minimize false negatives. [source: Wikipedia]

63 ROC and ROC AUC - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts of ROC and ROC AUC. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. ROC & AUC by MLU-EXPLAIN

64 Evaluation metrics - probabilities

Precision/recall curve (PR curve). The PR curve combines precision and recall in a single visualization: for every probability cut-off we calculate PPV and TPR and plot them on the graph. The higher the curve, the better the model. However, we face the classic precision/recall dilemma here (the higher the precision, the lower the recall).

Area Under the Precision/recall Curve (AUC PR). Similarly to the ROC AUC score, we can calculate the area under the precision-recall curve to get one representative number for the whole model. We can treat PR AUC as the average of precision calculated for each recall threshold from 0.0 to 1.0. PR AUC is recommended for highly imbalanced problems and when we communicate the precision/recall decision to our stakeholders (and additionally suggest where the best possible cut-off point is). A code sketch for these probability-based metrics follows below. [source: Stackoverflow]
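A minimal sketch, again assuming scikit-learn, of the probability-based metrics discussed above: the ROC and PR curves, their areas, and log-loss. The probability scores are invented for illustration.

```python
# Illustrative sketch: probability-based evaluation with scikit-learn.
import numpy as np
from sklearn.metrics import (
    roc_curve, roc_auc_score, precision_recall_curve, auc,
    average_precision_score, log_loss,
)

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9])  # predicted P(class = 1)

# ROC: TPR and FPR at every threshold, plus the area under the curve.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
roc_auc = roc_auc_score(y_true, y_score)

# Precision/recall curve and its area.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)                 # area under the PR curve
ap = average_precision_score(y_true, y_score)   # closely related summary statistic

# Log-loss (cross-entropy) of the predicted probabilities.
ll = log_loss(y_true, y_score)

print(f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}  AP={ap:.3f}  log-loss={ll:.3f}")
```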
65 Evaluation metrics - probabilities (for regression task)

As in classification, in regression we can also use the notion of probability, but in a slightly different sense: we can build regression models that, in addition to the expected value, estimate confidence intervals for the forecast. We will not discuss this issue in our classes due to its advanced level, but it is worth knowing about the existence of the Continuous Ranked Probability Score (CRPS), which generalizes the MAE to the case of probabilistic forecasts. Link for more details: https://www.lokad.com/continuous-ranked-probability-score.

66 Bias/variance trade-off - concept

The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points. The variance of a model is the variability of the model prediction for given data points. MSE decomposition: $\mathrm{Error} = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Noise}$. Bias/variance trade-off: the simpler the model, the higher the bias; the more complex the model, the higher the variance. For the MSE decomposition check page 19. [source: Stanford CS 229, Scott Fortmann-Roe essay]

67 Bias/variance trade-off - overfitting and underfitting

[Figure: illustrations of underfitting and overfitting; source: Stanford CS 229]

68 Bias/variance trade-off - overfitting and underfitting (cont'd)

[Figure: learning curves for underfitting and overfitting, annotated with suggested remedies: use boosting; use bagging; reduce the complexity of the model]

These illustrations present learning curves. A learning curve is a plot of model learning performance over experience or time. We distinguish the following learning curves:
● Train learning curve: calculated from the training dataset; gives an idea of how well the model is learning.
● Validation learning curve: calculated from a hold-out validation dataset; gives an idea of how well the model is generalizing.
● Optimization learning curves: calculated on the metric by which the parameters of the model are being optimized, e.g. log-loss.
● Performance learning curves: calculated on the metric by which the model will be evaluated and selected, e.g. AUC ROC.
[source: Stanford CS 229, Machine Learning Mastery]

69 Bias/variance trade-off - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the bias/variance trade-off. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Bias/variance trade-off by MLU-EXPLAIN

70 Training, validation and testing sets - concept

Generally, learning the parameters of a prediction function and testing it on the same data is a methodological mistake (we can easily overfit our model). In statistics and machine learning, it is good practice to divide a data set into three parts, each with a dedicated purpose (for more details: see slide 61). In some classification problems we can encounter imbalanced datasets (e.g. few "1"s and lots of "0"s). It is very important to use a stratified approach, which ensures that relative class frequencies are approximately preserved in each train-validation pair (see the sketch after this slide). Sometimes a strategy is used of building several models independently on the train and validation data, and then selecting the single best model on the testing sample. [source: Stanford CS 229]
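A minimal sketch of a stratified train/validation/test split with scikit-learn's train_test_split. The 60/20/20 proportions and the toy imbalanced data are assumptions made purely for illustration.

```python
# Illustrative sketch: stratified train / validation / test split (60/20/20 assumed).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                        # toy features
y = (rng.random(1000) < 0.1).astype(int)         # imbalanced target: roughly 10% "1"s

# First split off the test set, preserving class frequencies via stratify=y.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Then split the remainder into train and validation, again stratified.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), "share of positives:", round(labels.mean(), 3))
```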
71 Training, validation and testing sets - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the train-test-validation dataset split. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Train-test-validation by MLU-EXPLAIN

72 Cross-validation - concept

We can generalize the idea of the training/validation/testing split to a much more complex and powerful solution: cross-validation (CV). CV is a technique for evaluating a machine learning model and testing its performance. More precisely, CV is a resampling method that uses different portions of the data to validate and train a model on different iterations. This approach is much more robust than a single train-validation split, because we should not treat the value from a single validation set as an ideal approximation of the ground truth. There are two crucial reasons for using validation/cross-validation:
● assessment of the quality of our model in a quasi-objective way (lower probability of overfitting)
● "safe" (again, lower probability of overfitting) execution of the hyperparameter tuning procedure (a hyperparameter is a parameter whose value is used to control the learning process; it is not estimated from the data, so the researcher has to specify it by hand, based on intuition or via a hyperparameter search procedure, where CV is crucial)
[source: Wikipedia, Scikit-learn]

73 Cross-validation - different types

We can distinguish dozens of types of cross-validation, for instance (we will discuss some of them):
● Hold-out
● K-folds
● Leave-one-out (LOO)
● Leave-p-out
● Stratified K-folds
● Repeated K-folds
● Nested K-folds
● Time series CV
The reasons for having multiple types are many: the specifics of the data (e.g. cross-sectional vs. time series data), the specifics of the business problem, the size of the dataset, the imbalance of the dataset, computing resources, the probability of data leakage etc. In such a view, it is impossible to say which approach is best and should always be followed. However, in everyday use, k-fold seems the most popular (for cross-sectional problems); a minimal sketch follows below.
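A minimal sketch, assuming scikit-learn, of plain and stratified k-fold cross-validation via cross_val_score; the logistic-regression model and the generated toy data are illustrative assumptions only.

```python
# Illustrative sketch: k-fold and stratified k-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit, cross_val_score

# Toy imbalanced classification problem (about 90% / 10% class split).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain 5-fold CV, scored with ROC AUC on each held-out fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores_kf = cross_val_score(model, X, y, cv=kf, scoring="roc_auc")

# Stratified 5-fold CV preserves class frequencies in every fold (important when imbalanced).
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_skf = cross_val_score(model, X, y, cv=skf, scoring="roc_auc")

print("KFold AUC:           %.3f +/- %.3f" % (scores_kf.mean(), scores_kf.std()))
print("StratifiedKFold AUC: %.3f +/- %.3f" % (scores_skf.mean(), scores_skf.std()))

# For temporal data, TimeSeriesSplit keeps each validation fold after its training fold.
tscv = TimeSeriesSplit(n_splits=5)
```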
