Lecture 8: Support Vector Machine (ISyE 521, Fall 2022)

University of Wisconsin - Madison
Department of Industrial and Systems Engineering
ISyE 521: Machine Learning in Action, Fall 2022
Lecture 8: Support Vector Machine
Instructor: Justin J. Boutilier
November 10, 2022

In this lecture, we will introduce support vector machines. We will learn:

1. The differences between a soft and hard margin SVM
2. The differences between linear and nonlinear SVMs
3. How to train an SVM

These notes were partly inspired by Dr. Andrew Ng's fantastic course notes, which you can find here. *Corrections provided by Jin-ri Lee

Support Vector Machine

A support vector machine (SVM) is a supervised learning algorithm that was originally designed for classification problems. SVMs are one of the oldest machine learning algorithms, dating back to a 1963 paper by Vladimir Vapnik and Alexey Yakovlevich Chervonenkis. In the 1990s, Dr. Vapnik moved to Bell Labs (starting to see a trend?) where he continued to develop the theoretical foundation for SVMs. Although SVMs were originally designed for classification problems, they can be extended to regression problems.

Preliminaries

Before we begin, we need to introduce some new concepts:

Space: a set with added structure. For example, the 2-dimensional plane is called 2-dimensional Euclidean space. It includes all points in two dimensions and comes with structural properties like distance metrics and additivity.

Subspace: a particular region or subset of a space. For example, the positive quadrant (all points where both coordinates are greater than zero) is a subset of 2-dimensional Euclidean space.

Hyperplane: a subspace whose dimension is one less than the space in which the hyperplane lives. For example, in 2-dimensional Euclidean space, a hyperplane is a line.

Motivation

Recall the logistic regression equation:

\hat{y}_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_F x_{iF})}} = \frac{1}{1 + e^{-\beta^T x}}.

We can rewrite this equation as ŷ_i = f(β^T x), where f(·) is the logistic function. Remember that f(β^T x) predicts a probability and we typically use 0.5 as a threshold to convert the predicted probability to a binary output. In other words, we predict a 1 if f(β^T x) ≥ 0.5. We can prove mathematically that f(β^T x) ≥ 0.5 is equivalent to β^T x ≥ 0. We can visualize this equivalence by plotting the logistic function; see Figure 1.

Figure 1: The default logistic function.

Figure 1 shows us that as β^T x increases, so does the predicted probability f(β^T x) (and vice versa). We also know that a larger predicted probability implies that we are more confident in our prediction. For example, if we predict f(β^T x) = ŷ_i = 0.99, then we are very confident that y is 1. In this case, β^T x is much larger than 0 and we write this as β^T x ≫ 0. On the other hand, if we predict f(β^T x) = ŷ_i = 0.01, then we are very confident that y is 0 because β^T x is much smaller than 0. Note that a prediction of ŷ_i = 0.51 suggests that we believe y = 1, but we are not overly confident in that decision (in this case β^T x is close to 0). Ideally, we would like all of our predicted probabilities to be either close to 1 (i.e., β^T x ≫ 0) or close to 0 (i.e., β^T x ≪ 0). If that happens, we can be very confident in our predictions!
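To make this threshold equivalence concrete, here is a minimal sketch (not part of the original notes) that evaluates the logistic function at a few arbitrary values of β^T x and confirms that f(β^T x) ≥ 0.5 exactly when β^T x ≥ 0, with confidence growing as β^T x moves away from 0.

```python
import numpy as np

def logistic(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)), where z plays the role of beta^T x."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary values of beta^T x, from strongly negative to strongly positive.
for z in [-5.0, -0.5, 0.0, 0.5, 5.0]:
    p = logistic(z)
    # The two conditions always agree: f(beta^T x) >= 0.5 iff beta^T x >= 0.
    print(f"beta^T x = {z:5.1f} -> f(beta^T x) = {p:.3f}, "
          f"predict 1? {p >= 0.5} (same as beta^T x >= 0: {z >= 0})")
```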
It turns out that logistic regression can be represented by a hyperplane. Let's visualize this with the small 2-dimensional example shown in Figure 2. The line separating the x's and o's is called a separating hyperplane and is given by the equation β^T x = 0. This means that if an observation x lies on the line, then β^T x = 0 and f(β^T x) = 0.5. In other words, we are unsure if we should predict a 1 or a 0. The farther a point is from the separating hyperplane, the more confident we are in our prediction, because β^T x will be large (or small) and therefore f(β^T x) will be close to 1 (or 0). In the figure, we are more confident in our prediction of A than we are for C.

Figure 2: A separating hyperplane. Source.

In this example, it is possible to draw many different hyperplanes that perfectly separate the data. So the question is: what is the optimal separating hyperplane? The goal of SVMs is to answer this question and find the best/optimal separating hyperplane. Intuitively, the best hyperplane is as far as possible from observations of both classes because this will maximize our confidence. The distance from the separating hyperplane to the closest observation is called the margin. We want to maximize the margin. Consider Figure 3. The line given by H1 is not a separating hyperplane. The line H2 is a separating hyperplane, but the margin is very small (because it is close to the black dots at the top and the white dot at the bottom). So if we use H2 to make predictions, our confidence will be low for some observations. Line H3 is the optimal separating hyperplane and maximizes the distance between the closest white and black dots (i.e., the margin).

Figure 3: Multiple separating hyperplanes, including the optimal one. Source.

New notation

We need to introduce some new notation before we proceed. First, we will use y ∈ {−1, 1} instead of y ∈ {0, 1} to denote the class labels. Second, we will rewrite β^T x as w^T x + b. This allows us to treat the intercept term b (formerly β_0) separately. Note that the w vector represents β_1, ..., β_F. Although this is not the standard notation in this course, it is the standard notation for SVMs and this will (hopefully) allow you to more easily read and understand the SVM literature. We can use this notation to rewrite our separating hyperplane as w^T x + b = 0. Furthermore, we can now introduce one of the most famous machine learning concepts.

The perceptron

The perceptron was invented by Dr. Frank Rosenblatt in 1957 and serves as the foundation for both SVMs and neural networks. It is modeled after the biological neurons that comprise our brains. We can write the perceptron as

f(x) = \begin{cases} 1, & \text{if } w^T x + b > 0 \\ -1, & \text{if } w^T x + b < 0. \end{cases}

It is important to note that the perceptron directly predicts 1 or −1 (i.e., there is no probability here). Also note that it is unclear what value to predict when w^T x + b = 0. We'll dive further into the perceptron algorithm in the next lecture; a short code sketch of this decision rule appears below.

Linear SVM

The first type of SVM that we will consider is the linear SVM, which separates the data using a linear separating hyperplane (as in Figure 3 or 4). There are two types of linear SVM:

1. Hard margin: used for perfectly separable data (both Figure 3 and 4 are perfectly separable). This approach was invented first.
2. Soft margin: extends the hard margin to problems where the data is not perfectly separable.
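Before turning to how a hard margin SVM is trained, here is the short sketch of the perceptron-style decision rule promised above (not part of the original notes); the weight vector, intercept, and data points are arbitrary values chosen only for illustration.

```python
import numpy as np

def perceptron_predict(X, w, b):
    """Predict +1 where w^T x + b > 0 and -1 where w^T x + b < 0.

    np.sign returns 0 when w^T x + b is exactly 0, mirroring the fact that
    the perceptron rule leaves that boundary case undefined.
    """
    return np.sign(X @ w + b)

# Arbitrary illustrative values (not from the lecture).
w = np.array([2.0, -1.0])    # plays the role of (beta_1, ..., beta_F)
b = 0.5                      # plays the role of the intercept beta_0
X = np.array([[1.0, 1.0],    # w^T x + b =  1.5 -> predict +1
              [0.0, 2.0],    # w^T x + b = -1.5 -> predict -1
              [-1.0, 0.0]])  # w^T x + b = -1.5 -> predict -1

print(perceptron_predict(X, w, b))  # [ 1. -1. -1.]
```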
How to fit a hard margin SVM?

As mentioned above, the goal of an SVM is to find the optimal separating hyperplane - the hyperplane that maximizes the margin. Instead of directly finding the optimal hyperplane, we will find two parallel hyperplanes that are as far apart as possible (i.e., those that maximize the margin), where each hyperplane has at least one observation that lies directly on the plane itself. Consider Figure 4, which shows the optimal separating hyperplane (the red line) and the two corresponding hyperplanes that we will find.

Figure 4: Margin maximizing hyperplanes.

The hyperplanes that we want to find can be represented with a slightly modified perceptron

f(x) = \begin{cases} 1, & \text{if } w^T x + b \ge 1 \\ -1, & \text{if } w^T x + b \le -1. \end{cases}

Notice that it is unclear what happens when −1 < w^T x + b < 1. However, because this is a hard margin SVM, we know that the data is perfectly separable and, as a result, there will be no data within −1 < w^T x + b < 1.

The primal problem

As noted above, we want to maximize the margin, which is equivalent to maximizing the distance between the two new hyperplanes. The distance between these two parallel hyperplanes is given by 2/||w||_2, where ||w||_2 is the 2-norm that measures Euclidean distance. Note that maximizing 2/||w||_2 is equivalent to minimizing (1/2)||w||_2^2 = (1/2) w^T w. We can write this problem as follows:

\begin{aligned}
\underset{w,\,b}{\text{minimize}} \quad & \frac{1}{2} w^T w \\
\text{subject to} \quad & y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n
\end{aligned} \tag{1}

This is a quadratic programming model, which is a convex problem. This means that we can efficiently find the optimal solution and we do not need to use a heuristic algorithm. The constraint can be interpreted as follows. If y_i = 1, then we need w^T x_i + b to be at least 1, and if y_i = −1, then we need w^T x_i + b to be at most −1. In other words, this constraint models the perceptron! There is one important observation that we need to make. The constraint in (1) is listed for every observation, which may result in a huge number of constraints. However, many of the observations are not needed because they do not influence the solution - only those points that lie on the hyperplanes impact our optimal solution. These points are called support vectors. In Figure 4, there are three support vectors (two blue and one green).

The dual problem

In practice, we do not solve (1). Instead, we solve the dual problem, which is an equivalent reformulation. Duality is a key concept in convex optimization that we will not cover in this course. The dual problem is

\begin{aligned}
\underset{\alpha}{\text{maximize}} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j x_i^T x_j \\
\text{subject to} \quad & \sum_{i=1}^n \alpha_i y_i = 0, \\
& \alpha_i \ge 0, \quad i = 1, 2, \ldots, n
\end{aligned} \tag{2}

We use the dual problem for several reasons. First, it is typically more efficient and easier to solve. Second, it allows us to more efficiently make predictions (see the next section). Lastly, it allows us to use the kernel trick and extend SVMs to nonlinear separating hyperplanes (see below).
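To see these pieces in a familiar tool, here is a hedged sketch (not part of the original notes) that fits a linear SVM with scikit-learn on a tiny, made-up, linearly separable dataset and inspects the fitted w, b, and support vectors. scikit-learn's SVC solves a soft margin problem, so a very large penalty C is used to approximate the hard margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (made-up values), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# SVC solves a soft margin SVM; a very large C approximates a hard margin.
model = SVC(kernel="linear", C=1e6)
model.fit(X, y)

print("w =", model.coef_[0])               # weight vector of the hyperplane
print("b =", model.intercept_[0])          # intercept term
print("support vectors:\n", model.support_vectors_)
print("alpha_i * y_i for each support vector:", model.dual_coef_[0])
```

Only the support vectors appear in dual_coef_, which previews the sparsity property used for predictions in the next section.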
Making predictions

To make a prediction for observation x_k, we need to use the perceptron and compute w^T x_k + b. If we have a large number of features, computing w^T x_k can be computationally expensive. To remedy this, we rewrite the perceptron using the dual:

w^T x_k + b = \sum_{i=1}^n \alpha_i y_i x_i^T x_k + b.

After we solve the dual problem, we obtain values for each α_i and we can easily compute b (not covered here). The benefit of the dual problem lies in the fact that many α_i's will be 0. Specifically, α_i will only be non-zero if i is a support vector. This means that only the terms corresponding to the support vectors in the summation \sum_{i=1}^n \alpha_i y_i x_i^T x_k need to be calculated. In other words, our predictions depend only on the support vectors, and there are typically only a handful of support vectors, even for very large problems.

Nonlinear SVM

There are many classification problems that are not linearly separable. Consider the example shown in Figure 5. No linear hyperplane will be able to accurately separate this data. However, it is quite obvious that a circular hyperplane would indeed provide perfect separation. These types of problems are solved using nonlinear SVMs and the kernel trick.

Figure 5: Data that is separable in a nonlinear way. Source.

The kernel trick

The kernel trick is an essential component of nonlinear SVMs (and a broader mathematical concept). We will not go into detail on the mathematics of kernel functions, but rather provide a high-level intuitive explanation. A kernel K takes two observations as input and returns a similarity score between the observations. We typically write this as K(x_i, x_j). Kernels are useful because they allow us to project the data to higher dimensions and compute the similarity score in the higher dimension. Intuitively, you can think of the kernel trick as a way to map data that is not linearly separable to a higher dimension where the data is linearly separable. For example, consider Figure 6. The original data is shown on the right in a two dimensional space. On the left, we have projected (or mapped) the data to three dimensional space using a kernel. In this example, the third dimension (Z) measures the similarity between observations, defined by the kernel. We can see that in 3-dimensional space the data can be linearly separated using a hyperplane. The hyperplane can then be mapped back to 2-dimensional space and voila, we have our circular hyperplane.

Figure 6: Nonlinear separating hyperplane and the kernel trick. Source.

The challenge with the kernel trick (and with SVMs in general) is to find the right kernel (i.e., the kernel that leads to good separation in a higher dimensional space). Some popular SVM kernels include the following (a short code sketch follows the list):

Linear: the simplest case. Defined as K(x_i, x_j) = x_i^T x_j + c, where c is a hyperparameter.

Polynomial: well-suited for normalized data. Defined as K(x_i, x_j) = (ζ x_i^T x_j + c)^d, where ζ, c, d are hyperparameters.

RBF: the radial basis function kernel (this is used in Figure 6). Defined as K(x_i, x_j) = e^{−γ ||x_i − x_j||_2^2}, where γ is a hyperparameter.

Sigmoid: we typically use the tanh (hyperbolic tangent) function. Defined as K(x_i, x_j) = tanh(ζ x_i^T x_j + c), where ζ and c are hyperparameters.
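As a quick sanity check on these definitions, here is a small sketch (not part of the original notes) that computes the RBF kernel directly and then compares a linear SVM with an RBF SVM on ring-shaped data in the spirit of Figure 5; the hyperparameter values are arbitrary. Note that scikit-learn parameterizes its built-in linear, polynomial, and sigmoid kernels slightly differently than the definitions above.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def rbf_kernel(x_i, x_j, gamma=1.0):
    """RBF kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||_2^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

# Nearby points get a similarity score near 1; distant points near 0.
print(rbf_kernel(np.array([1.0, 0.0]), np.array([1.1, 0.0])))   # ~0.99
print(rbf_kernel(np.array([1.0, 0.0]), np.array([-3.0, 4.0])))  # ~0.00

# Ring-shaped data (inner circle is one class, outer ring the other),
# which no linear hyperplane can separate in the original 2-d space.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print("linear kernel training accuracy:", linear_svm.score(X, y))  # roughly chance
print("RBF kernel training accuracy:", rbf_svm.score(X, y))        # near 1.0
```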
Soft margin SVMs and regularization

We can extend both linear and nonlinear SVMs to problems that are not perfectly separable. To do this, we rewrite the primal problem as

\begin{aligned}
\underset{w,\,b}{\text{minimize}} \quad & \frac{1}{2} w^T w + C \|\xi\|_1 \\
\text{subject to} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, n, \\
& \xi_i \ge 0, \quad i = 1, 2, \ldots, n.
\end{aligned} \tag{3}

In this problem, we allow observations x_k to lie on the wrong side of the separating hyperplane. The distance from the hyperplane to a point on the wrong side is determined by ξ_k. In the hard margin case, ξ_i = 0, ∀i. In the soft margin case, if observation x_k lies on the wrong side and its distance is ξ_k, then we pay a penalty equal to Cξ_k. Similar to the hard margin case, we can reformulate this problem and solve the dual.

Connection to regularization

In soft margin SVMs, the ξ term looks very similar to a regularization term. We want to keep ξ small, just like we want the β's to be small in regression. Once we realize this, we can start to change the way that the ξ variable is modeled in the objective function. In other words, we can use lasso, ridge, or elastic net regularization terms for ξ. Note that in (3) we have used L1-regularization (lasso).

Key hyperparameters

Support vector machines have two key hyperparameters (a tuning sketch follows the list):

Kernel: considered to be the most important hyperparameter. Kernel choice can significantly impact model performance (as seen in Figure 6).

C (penalty): the penalty for data that lies on the wrong side of the hyperplane. As noted above, this is very similar to a regularization parameter.
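To connect these two hyperparameters to practice, here is a hedged sketch (not part of the original notes) of a small grid search over the kernel and the penalty C with scikit-learn; the synthetic dataset, grid values, and cross-validation settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic classification data (arbitrary settings, for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Search over the two key SVM hyperparameters: the kernel and the penalty C.
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1.0, 10.0, 100.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
print("held-out test accuracy:", search.score(X_test, y_test))
```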
