Supervised Learning
MInDS @ Mines

Supervised learning is the subset of machine learning that covers predicting a target label or class given labeled data. The two main subsets of supervised learning are classification and regression. Classification is supervised learning when the target class is discrete, and regression is when it is continuous. There are many methods that can be applied to supervised learning. In this lecture, we will cover some metrics used to evaluate models as well as the k-Nearest Neighbors, decision trees, and random forests methods.

Data is commonly available as features and labels (or targets). Features are the observations that we detect about our data, and labels are the useful values that we would like to be able to predict. For supervised learning, we start with data that includes both features and labels. We then learn a model that can predict the labels from the features. If the labels of our data are continuous, we refer to this as regression. If the labels of our data are discrete, we refer to this as classification.

Figure 1: The two types of supervised learning.

Metrics

Before we start considering methods that apply supervised learning, let's look at how we evaluate classification and regression problems. For each, we have a set of metrics that can be useful in gauging a model's performance.

Classification

The goal of a classifier is to accurately predict the correct label for each data point. Here, we will go over some commonly used metrics to evaluate a classifier. For all the classification metrics, let's start by defining the notation we'll use. Based on the results of a model that is trained to classify data as one of $n$ classes, $c_i$ is the count of data from class $i$, and $c_{i,j}$ is the count of data from class $i$ that was labeled as class $j$. Based on that definition, we can create an $n \times n$ matrix that displays how the model performed. This matrix is called the confusion matrix, and an example of it is

$$
C =
\begin{pmatrix}
c_{11} & c_{12} & c_{13} & \dots & c_{1n} \\
c_{21} & c_{22} & c_{23} & \dots & c_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{n1} & c_{n2} & c_{n3} & \dots & c_{nn}
\end{pmatrix}. \tag{1}
$$

The simplest metric to use is accuracy, which is the percentage of predictions that were correct. Recall that the trace of a square matrix is the sum of its diagonal, $\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii}$, so

$$
\text{Accuracy} = \frac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}} = \frac{\operatorname{tr}(C)}{\sum_i \sum_j c_{ij}}. \tag{2}
$$

In other words, accuracy is the number of samples where the class and the prediction were the same divided by the total number of samples.

Accuracy is useful at a high level, but it actually doesn't tell us much about our model's performance for each class. For that, we'll move on to more involved metrics that will be the basis for our summary metrics. (If this is your first time learning of these metrics, this section is worth reading a couple of times before moving on.) The true positive count for a class $i$, $TP_i$, is the count of instances of that class that were labeled correctly. The true negative count for a class $i$, $TN_i$, is the count of instances that neither belong to class $i$ nor were labeled as class $i$. The false positive count for a class $i$, $FP_i$, is the count of instances of other classes that were incorrectly labeled as class $i$. The false negative count for a class $i$, $FN_i$, is the count of instances of class $i$ that were incorrectly labeled as another class.

From those values, we can calculate precision, recall, and the F-1 score. Precision, for a class $i$, tells us what percentage of the data predicted as class $i$ is in fact class $i$. Recall, for a class $i$, tells us what percentage of that class was actually predicted as class $i$. The F-1 score for a class $i$ is the harmonic mean of precision and recall.

Table 1: Commonly used metrics in determining classification performance for a particular class $i$.

$$
TP_i = c_{ii}, \qquad
TN_i = \sum_{j \neq i,\, k \neq i} c_{jk}, \qquad
FP_i = \sum_{j \neq i} c_{ji}, \qquad
FN_i = \sum_{j \neq i} c_{ij},
$$

$$
Pr_i = \frac{TP_i}{TP_i + FP_i}, \qquad
Re_i = \frac{TP_i}{TP_i + FN_i}, \qquad
F1_i = \frac{2 \, Pr_i \, Re_i}{Pr_i + Re_i}.
$$
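As a quick illustration of these definitions, the following sketch computes accuracy and the per-class precision, recall, and F-1 scores directly from a confusion matrix. It is a minimal NumPy sketch, not part of the original handout: the example counts in C are made up, and it assumes rows index the true class and columns index the predicted class, as in Equation (1).

```python
import numpy as np

# Example confusion matrix (made-up counts): rows = true class, columns = predicted class.
C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 44]])

accuracy = np.trace(C) / C.sum()      # Equation (2): correct predictions over all samples

TP = np.diag(C)                       # TP_i = c_ii
FP = C.sum(axis=0) - TP               # FP_i: predicted as class i but actually another class
FN = C.sum(axis=1) - TP               # FN_i: actually class i but predicted as another class
TN = C.sum() - TP - FP - FN           # everything outside row i and column i

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy)
print(precision, recall, f1)
```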
These values are useful if we are only interested in a particular class. If we are interested in an overall summary score for the model, we can combine all the available classes' scores. The three most common ways to combine the scores result in the micro, macro, and weighted scores for precision, recall, and F-1. Micro aggregation uses the values at the data level and then calculates the score, macro aggregation calculates each class's score and then averages them, and weighted aggregation calculates each class's score and then averages them weighted by the percentage of data in each label. As an example, the various aggregation methods for precision are

$$
Pr_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad
Pr_{macro} = \frac{\sum_i Pr_i}{n}, \qquad
Pr_{weighted} = \sum_i \frac{c_i}{\sum_j c_j} Pr_i. \tag{3}
$$

Regression

The goal of a regression model is to accurately predict the correct continuous value of the target for each data point. Here, we will go over some commonly used metrics to evaluate such a model. For all the regression metrics, we will represent the real values of the target as $y$, where $y_i$ is the target value for the $i$-th item, and $f(x)$ and $f(x_i)$ are the respective predicted values. A frequently used metric to assess model performance in regression is the mean squared error (MSE). MSE is the average of the squared difference between the real value and the predicted value (the error),

$$
MSE = \frac{\sum_i (y_i - f(x_i))^2}{n} = \frac{\| y - f(x) \|_2^2}{n}. \tag{4}
$$

A common variation of MSE is the root mean squared error (RMSE), which is the square root of the MSE. Another commonly used metric, which you may have seen when using Excel to plot a trend line on a chart, is the $R^2$ value. The $R^2$ value is one minus the sum of the squared errors divided by the sum of the squared distances to the mean. This gives us a value that is weighted by the variability within the sample data,

$$
R^2 = 1 - \frac{\sum_i (y_i - f(x_i))^2}{\sum_i (y_i - \hat{y})^2}, \tag{5}
$$

where $\hat{y}$ is the average of the real target values.
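As with the classification metrics, these regression metrics are straightforward to compute directly. The sketch below evaluates Equations (4) and (5), plus the RMSE, with NumPy; the target and prediction values are made-up numbers for illustration only.

```python
import numpy as np

y      = np.array([3.0, 5.0, 2.5, 7.0])   # real target values y_i (made up)
y_pred = np.array([2.8, 5.4, 2.9, 6.6])   # predicted values f(x_i) (made up)

mse  = np.mean((y - y_pred) ** 2)          # Equation (4)
rmse = np.sqrt(mse)                        # square root of the MSE
r2   = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)   # Equation (5)

print(mse, rmse, r2)
```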
Methods

We've discussed some metrics to help us evaluate a model; now let's take a look at some methods that can actually solve the supervised learning problem. We'll look at some simple yet effective methods, namely k-Nearest Neighbors, decision trees, and random forests.

k-Nearest Neighbors

k-Nearest Neighbors (kNN) is a simple method for applying supervised learning. When training a kNN model, the model practically memorizes the locations of all the points and their values. When it is time to use the model for predictions, the model takes the input data and finds the k nearest points to that input. In a classification problem, the majority class of the nearest points is the prediction, and in regression, the average of the nearest points is the prediction. This approach may seem simplistic, but with a large amount of data it can be quite effective. (The Stanford Vision Lab has created a useful demo at http://vision.stanford.edu/teaching/cs231n-demos/knn/ to illustrate how kNN works.)

When using kNN, there are a few things to consider:

- kNN delays the computational effort until the inference/prediction stage.
- k is a hyperparameter that can affect the model's performance.
- The way we calculate "nearest points" can significantly affect how our model performs. We can use an $\ell_p$-norm as our distance metric; $\ell_1$ and $\ell_2$ are usually effective.

Decision Trees

You've probably come across decision trees several times before, and it might come as a surprise to find out that they can be the basis for a supervised learning model. You've at least seen flow charts before. Flow charts and decision trees are used to reach a conclusion or make a decision. In machine learning, this final decision is the prediction we are in search of.

Figure 2: An example flow chart, courtesy of https://xkcd.com/518.

A decision tree starts at the root node with all the data and, based on a criterion, it splits the data into two nodes. Each of these nodes then repeats the procedure until we reach a stopping criterion. Stopping criteria include a maximum tree depth or a minimum number of samples in a node.

It is simple to follow a decision tree, but training one can be more complicated. A decision tree trains on a dataset by investigating the features and finding the points on those features where, if we were to split the data, we would get the most consistency in each resulting set. The metric we use as a proxy for consistency is impurity, and there are many ways to calculate it, two of which are Gini and cross-entropy. For a classification problem with $k$ classes, a decision tree with $n$ nodes has $c_m$ points in node $m$, and $c_{mi}$ points in node $m$ belonging to class $i$. We define the proportion of points of node $m$ that belong to class $i$ as $p_{mi} = \frac{c_{mi}}{c_m}$. Following that, the two popular impurity metrics to choose between are

$$
Gini_m = \sum_i p_{mi} (1 - p_{mi}), \tag{6}
$$

$$
\text{Cross-entropy}_m = - \sum_i p_{mi} \log p_{mi}. \tag{7}
$$

For a regression problem, we can use the MSE of each set to represent its impurity. When using decision trees for regression, each node predicts an exact value that is the average of its points. This means that, for regression problems, there are usually methods superior to decision trees.

When using decision trees, it is important to understand that:

- The splitting criterion, or measure of impurity, is a hyperparameter.
- The stopping criterion for training is a critical hyperparameter to prevent overfitting of the data.

Random Forests

In this section we will briefly introduce random forests. Random forests are an ensemble method applied to decision trees that can improve the model's performance. (Ensemble methods are methods that combine several models to produce better predictions.) Similar to a forest being a grouping of trees, a random forest model is an aggregation of several decision trees. Each decision tree is trained on a random subset of the data to create the random forest model. To use the model, the random forest aggregates the predictions from each of the trees. For classification, the model selects the most common class, and for regression, it takes the average of the resulting predictions. When using random forests, the same considerations from decision trees apply. In addition to those, the number of decision trees used to create the random forest model is a hyperparameter to consider.
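To see how the two impurity measures from Equations (6) and (7) behave, here is a small sketch, assuming NumPy, that computes both for a perfectly pure node and for an evenly mixed node; the class counts are invented for illustration.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, Equation (6), from per-class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return np.sum(p * (1 - p))

def cross_entropy(counts):
    """Cross-entropy impurity of a node, Equation (7); zero-count classes are skipped."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(gini([10, 0]), cross_entropy([10, 0]))   # pure node: impurity 0 for both
print(gini([5, 5]),  cross_entropy([5, 5]))    # evenly mixed node: maximum impurity
```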
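Finally, here is a minimal sketch of how the three methods in this handout might be trained and compared using scikit-learn, which implements all of them. The dataset and hyperparameter values below are arbitrary choices for illustration, not recommendations from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# A small built-in dataset, split into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "kNN (k=5)":     KNeighborsClassifier(n_neighbors=5),                    # k and the distance metric are hyperparameters
    "Decision tree": DecisionTreeClassifier(criterion="gini", max_depth=3),  # impurity measure and stopping criterion
    "Random forest": RandomForestClassifier(n_estimators=100, max_depth=3),  # number of trees is an extra hyperparameter
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```

Note how the hyperparameters discussed above (k and the distance metric for kNN, the impurity criterion and stopping criteria for decision trees, and the number of trees for random forests) appear directly as constructor arguments.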