AI 305 Support Vector Machines (SVM)

Summary

This document presents lecture notes on Support Vector Machines (SVM). It covers the fundamentals of SVMs, including the maximal margin classifier, support vector classifiers, and support vector machines. The explanation includes examples and diagrams to illustrate the concepts.

Full Transcript


Introduction to Machine Learning (AI 305)
Support Vector Machines (SVM)

Contents
- Maximal margin classifier
- Support vector classifier
- Support vector machine
- SVM for multiclass problems
- SVM vs. logistic regression

Introduction
- Support vector machines (SVMs) are an approach to classification that was developed in the computer science community in the 1990s and has grown in popularity since then.
- SVMs have been shown to perform well in a variety of settings, and are often considered one of the best "out of the box" classifiers.

Introduction – Cont.
- The basic idea is a simple and intuitive classifier called the maximal margin classifier.
- The support vector classifier is an extension of the maximal margin classifier that can be applied to a broader range of datasets.
- The SVM is a further extension of the support vector classifier that accommodates non-linear class boundaries.

Introduction – Cont.
- Here we approach the two-class classification problem in a direct way: we try to find a plane that separates the classes in feature space.
- If we cannot, we get creative in two ways:
  - we soften what we mean by "separates", and
  - we enrich and enlarge the feature space so that separation is possible.

What is a Hyperplane?
- A hyperplane in p dimensions is a flat affine subspace of dimension p − 1.
- In general, the equation for a hyperplane has the form β0 + β1X1 + β2X2 + ... + βpXp = 0.
- In p = 2 dimensions a hyperplane is a line: β0 + β1X1 + β2X2 = 0.
- In p = 3 dimensions a hyperplane is a flat two-dimensional subspace, that is, a plane.
- If β0 = 0 the hyperplane goes through the origin; otherwise it does not.
- The vector β = (β1, β2, ..., βp) is called the normal vector; it points in a direction orthogonal to the surface of the hyperplane.

Hyperplane in 2 Dimensions
[Figure: a hyperplane (a line) in two dimensions.]

Hyperplanes – Example
- The hyperplane is 1 + 2X1 + 3X2 = 0.
- The blue region is the set of points for which 1 + 2X1 + 3X2 > 0.
- The purple region is the set of points for which 1 + 2X1 + 3X2 < 0.

Classification Using a Separating Hyperplane
- Suppose that we have an n × p data matrix X that consists of n training observations in p-dimensional space.
- These observations fall into two classes; that is, y1, ..., yn ∈ {−1, +1}, where −1 represents one class and +1 the other.
- We also have a test observation, a p-vector of observed features.
- Our goal is to develop a classifier, based on the training data, that will correctly classify the test observation using its feature measurements.
- We have seen a number of approaches for this task, such as logistic regression, classification trees, bagging, and boosting.
- We will now see a new approach based on the concept of a separating hyperplane.

Separating Hyperplanes
- If f(X) = β0 + β1X1 + ... + βpXp, then f(X) > 0 for points on one side of the hyperplane, and f(X) < 0 for points on the other.
- If we code the colored points as Yi = +1 for blue, say, and Yi = −1 for purple, then if Yi · f(Xi) > 0 for all i, f(X) = 0 defines a separating hyperplane.

Maximal Margin Classifier
- Among all separating hyperplanes, find the one that makes the biggest gap, or margin, between the two classes.
- The maximal margin hyperplane is the solution to the constrained optimization problem:

    maximize M over β0, β1, ..., βp
    subject to Σ_{j=1}^p βj² = 1
    and yi(β0 + β1 xi1 + ... + βp xip) ≥ M for all i = 1, ..., n.

- The constraints ensure that each observation is on the correct side of the hyperplane and at least a distance M from the hyperplane (M is the width of the margin).
- This can be rephrased as a convex quadratic program and solved efficiently:

    minimize (1/2)‖β‖² over β0, β1, ..., βp
    subject to yi(β0 + β1 xi1 + ... + βp xip) ≥ 1 for all i = 1, ..., n.
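When the classes really are separable, the maximal margin classifier can be approximated in software by a linear SVM whose cost parameter is very large, since a huge penalty on violations effectively forbids them. The following is a minimal sketch of that idea, assuming scikit-learn is available; the toy data from make_blobs and the choice C=1e6 are illustrative assumptions, not part of the lecture notes.

```python
# Sketch: approximating the maximal margin (hard-margin) classifier with a
# linear SVM whose cost parameter is very large, so margin violations are
# effectively forbidden. Assumes scikit-learn; the toy data are illustrative.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated classes in p = 2 dimensions.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)
y = np.where(y == 0, -1, +1)            # code the two classes as -1 / +1

hard_margin = SVC(kernel="linear", C=1e6).fit(X, y)

beta = hard_margin.coef_[0]             # normal vector (beta_1, ..., beta_p)
beta0 = hard_margin.intercept_[0]       # intercept beta_0
M = 1.0 / np.linalg.norm(beta)          # margin width M = 1 / ||beta||
print("hyperplane: f(X) = %.3f + %.3f*X1 + %.3f*X2" % (beta0, beta[0], beta[1]))
print("margin width M =", M)
print("support vectors:\n", hard_margin.support_vectors_)
```

The recovered coefficients correspond to the quadratic-program form above, in which the margin width is 1/‖β‖.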
Non-separable Data
- The data on the right are not separable by a linear boundary.
  - The optimization problem has no solution with M > 0.
  - This is often the case, unless N < p.
- We can extend the concept of a separating hyperplane in order to develop a hyperplane that almost separates the classes, using a so-called soft margin.
- The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.

Noisy Data
- Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal margin classifier.
- The support vector classifier maximizes a soft margin.

Drawback of the Maximal Margin Classifier
- A classifier based on a separating hyperplane will necessarily perfectly classify all of the training observations; this can lead to sensitivity to individual observations.
- The addition of a single observation in the right-hand figure (previous slide) leads to a dramatic change in the maximal margin hyperplane.
- The resulting maximal margin hyperplane is not satisfactory, because it has only a tiny margin.
- This is problematic because the distance of an observation from the hyperplane can be seen as a measure of our confidence that the observation was correctly classified.
- Moreover, the fact that the maximal margin hyperplane is extremely sensitive to a change in a single observation suggests that it may have overfit the training data.

Support Vector Classifier
- The problems above lead us to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of
  - greater robustness to individual observations, and
  - better classification of most of the training observations.
- That is, we accept misclassifying a few training observations in order to do a better job of classifying the remaining observations.
- The support vector classifier, sometimes called a soft margin classifier, is the solution.

Support Vector Classifier – Cont.
- The optimization problem has a very interesting property: only observations that either lie on the margin or violate the margin affect the hyperplane, and hence the classifier obtained.
- In other words, an observation that lies strictly on the correct side of the margin does not affect the support vector classifier. Changing the position of that observation would not change the classifier at all, provided that it remains on the correct side of the margin.
- Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors; they hold (support) the margin planes in place.
- These observations do affect the support vector classifier, as illustrated in the sketch below.
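To make this property concrete, the sketch below fits a soft-margin linear SVM, identifies the support vectors, and then refits after deleting one observation that is not a support vector; the fitted hyperplane is essentially unchanged. This is a minimal illustration assuming scikit-learn; the simulated data and the value C=1.0 are arbitrary choices, not from the notes.

```python
# Sketch: only the support vectors determine the support vector classifier.
# Assumes scikit-learn; the data and C are arbitrary illustrative choices.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.5, random_state=1)

svc = SVC(kernel="linear", C=1.0).fit(X, y)
print("coefficients:", svc.coef_[0], "intercept:", svc.intercept_[0])
print("indices of support vectors:", svc.support_)

# Drop one observation that is NOT a support vector and refit.
not_sv = np.setdiff1d(np.arange(len(y)), svc.support_)[0]
X_drop = np.delete(X, not_sv, axis=0)
y_drop = np.delete(y, not_sv)

svc_refit = SVC(kernel="linear", C=1.0).fit(X_drop, y_drop)
print("after dropping a non-support vector:")
print("coefficients:", svc_refit.coef_[0], "intercept:", svc_refit.intercept_[0])
# The two sets of coefficients should agree (up to numerical tolerance): an
# observation strictly on the correct side of the margin does not affect the fit.
```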
Support Vector Classifier – Cont.
A support vector classifier was fit to a small data set.
- Left: The hyperplane is shown as a solid line and the margins are shown as dashed lines.
  - Purple observations: observations 3, 4, 5, and 6 are on the correct side of the margin, observation 2 is on the margin, and observation 1 is on the wrong side of the margin.
  - Blue observations: observations 7 and 10 are on the correct side of the margin, observation 9 is on the margin, and observation 8 is on the wrong side of the margin. No observations are on the wrong side of the hyperplane.
- Right: Same as left, with two additional points, 11 and 12. These two observations are on the wrong side of the hyperplane and on the wrong side of the margin.

Details of the Support Vector Classifier
- The support vector classifier classifies a test observation depending on which side of a hyperplane it lies.
- The hyperplane is chosen to correctly separate most of the training observations into the two classes, but it may misclassify a few observations.
- It is the solution to the optimization problem

    maximize M over β0, β1, ..., βp, ε1, ..., εn
    subject to Σ_{j=1}^p βj² = 1,
    yi(β0 + β1 xi1 + ... + βp xip) ≥ M(1 − εi),
    εi ≥ 0, and Σ_{i=1}^n εi ≤ C.

Details of the Support Vector Classifier – Cont.
- C is a nonnegative tuning parameter.
- M (= 1/‖β‖) is the width of the margin; we seek to make this quantity as large as possible.
- ε1, ..., εn are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane.

Slack Variables
- The slack variable εi tells us where the i-th observation is located, relative to the hyperplane and relative to the margin.
- If εi = 0, the i-th observation is on the correct side of the margin.
- If εi > 0, the i-th observation is on the wrong side of the margin; it has violated the margin.
- If εi > 1, it is on the wrong side of the hyperplane.

The Regularization Parameter C
- C bounds the sum of the slack variables, and so it determines the number and severity of the violations to the margin (and to the hyperplane) that we will tolerate.
- We can think of C as a budget for the amount that the margin can be violated by the n observations.
- If C = 0, there is no budget for violations to the margin, and it must be the case that ε1 = ... = εn = 0; this is simply the maximal margin hyperplane optimization.
- For C > 0, no more than C observations can be on the wrong side of the hyperplane, because if an observation is on the wrong side of the hyperplane then εi > 1, and the constraint Σ_{i=1}^n εi ≤ C limits how many such observations there can be.
- In practice, C is treated as a tuning parameter that is generally chosen via cross-validation (see the sketch below).

The Regularization Parameter C – Cont.
- As C increases, we become more tolerant of violations to the margin, and so the margin will widen. Conversely, as C decreases, we become less tolerant of violations to the margin, and so the margin narrows.
- Large C (top left): many observations are involved in determining the hyperplane; many observations are support vectors. This classifier has low variance but potentially high bias.
- Small C: fewer support vectors; the classifier will have low bias but high variance. Bottom left: only eight support vectors.
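A common way to choose the tuning parameter is a grid search with cross-validation. The sketch below does this with scikit-learn's GridSearchCV; note that scikit-learn's C is a cost placed on margin violations, so it plays roughly the inverse role of the budget C described in these slides (a small scikit-learn C tolerates more violations and gives a wider margin). The dataset and grid values are illustrative assumptions.

```python
# Sketch: choosing the SVM tuning parameter by cross-validation.
# Note: scikit-learn's C is a *cost* on margin violations, roughly the inverse
# of the *budget* C used in these slides (small sklearn C -> wider margin).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
param_grid = {"svc__C": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)

print("best C:", search.best_params_["svc__C"])
print("cross-validated accuracy:", search.best_score_)
```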
Robustness of Support Vector Classifiers
- The decision rule of the support vector classifier is based only on a potentially small subset of the training observations (the support vectors), so it is quite robust to the behavior of observations that are far away from the hyperplane.
- This property is distinct from some other classification methods, such as linear discriminant analysis (LDA).

Linear Boundaries Can Fail
- Sometimes a linear boundary simply won't work, no matter what value of C is used. The example on the right is such a case.
- What to do? In the support vector classifier, we could address this problem by enlarging the feature space using quadratic, cubic, and even higher-order polynomial functions of the predictors.

Feature Expansion
- Enlarge the space of features by including transformations, e.g. X1², X1³, X1X2, X1X2², .... Hence we go from a p-dimensional space to an M > p dimensional space.
- Fit a support vector classifier in the enlarged space.
- This results in non-linear decision boundaries in the original space.
- The optimization problem has the same form as before, but in the enlarged feature space.

Feature Expansion – Example
- Suppose we use (X1, X2, X1², X2², X1X2) instead of just (X1, X2).
- Then the decision boundary would be of the form β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 = 0.
- This leads to non-linear decision boundaries in the original space (quadratic conic sections).

Cubic Polynomials
- Here we use a basis expansion of cubic polynomials, going from 2 variables to 9.
- The support vector classifier fit in the enlarged space yields a non-linear decision boundary in the original, lower-dimensional space.

Support Vector Machines

Nonlinearities and Kernels
- Polynomials (especially high-dimensional ones) get wild rather fast (e.g. with d = 1000 variables, a cubic expansion has on the order of d³ = 10⁹ basis functions).
- There is a more elegant and controlled way to introduce nonlinearities in support vector classifiers: through the use of kernels.
- Before we discuss these, we must understand the role of inner products in support vector classifiers.

Inner Products and Support Vectors
- The inner product of two vectors (two observations) with p features is ⟨xi, xi′⟩ = Σ_{j=1}^p xij xi′j.
- The linear support vector classifier can be represented as f(X) = β0 + β1X1 + ... + βpXp.
- By representing β as a linear combination of the input observations, β = Σ_{i=1}^n αi xi, the classifier can be written as f(x) = β0 + Σ_{i=1}^n αi ⟨x, xi⟩, where there are n parameters αi, i = 1, ..., n, one for each training observation.

Inner Products and Support Vectors – Cont.
- To estimate the parameters α1, ..., αn and β0, all we need are the n(n−1)/2 inner products ⟨xi, xi′⟩ between all pairs of training observations; n(n−1)/2 is the number of pairs among a set of n items.
- It turns out that most of the α̂i can be zero: f(x) = β0 + Σ_{i∈S} α̂i ⟨x, xi⟩, where S is the support set of indices i such that α̂i > 0.
- If we can compute inner products between observations, we can fit a support vector classifier.

Kernels
- Now suppose that every time the inner product ⟨x, xi⟩ appears in the representation above, or in a calculation of the solution for the support vector classifier, we replace it with a generalization of the inner product of the form K(x, xi), where K is some function that we will refer to as a kernel.
- A kernel is a function that quantifies the similarity of two observations.
- K(xi, xi′) = Σ_{j=1}^p xij xi′j is known as a linear kernel, because the support vector classifier is linear in the features.

Kernels and Support Vector Machines
- When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known as a support vector machine.
- Some special kernel functions can do this, e.g. the polynomial kernel of degree d: K(xi, xi′) = (1 + Σ_{j=1}^p xij xi′j)^d.
- It computes the inner products needed for the degree-d polynomial expansion, which has (p + d choose d) basis functions.
- The solution has the form f(x) = β0 + Σ_{i∈S} α̂i K(x, xi).

Radial Kernel
- Another form of kernel function is the radial kernel: K(xi, xi′) = exp(−γ Σ_{j=1}^p (xij − xi′j)²).
- The implicit feature space is very high dimensional.
- It controls variance by squashing down most dimensions severely.
- As γ increases and the fit becomes more non-linear, the ROC curves improve.
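Here is a short sketch of fitting SVMs with polynomial and radial kernels, assuming scikit-learn; the moon-shaped toy data and the particular degree and γ values are illustrative choices, not from the notes.

```python
# Sketch: support vector machines with non-linear kernels.
# Assumes scikit-learn; data, degree, and gamma values are illustrative.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Polynomial kernel of degree d: K(x, x') = (coef0 + gamma * <x, x'>)^d
poly_svm = SVC(kernel="poly", degree=3, coef0=1, C=1.0).fit(X_train, y_train)
print("poly (d=3) test accuracy = %.3f" % poly_svm.score(X_test, y_test))

# Radial (RBF) kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
for gamma in [0.1, 1.0, 10.0]:
    rbf_svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
    print("gamma = %4.1f  test accuracy = %.3f"
          % (gamma, rbf_svm.score(X_test, y_test)))
```

Larger γ makes the radial kernel more local, which connects to the discussion of how the radial basis works in the next slides.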
How the Radial Basis Works
- If a given test observation x is far from a training observation xi in terms of Euclidean distance, then Σ_{j=1}^p (xj − xij)² is large, and so K(x, xi) = exp(−γ Σ_{j=1}^p (xj − xij)²) is tiny.
- This means that xi will play virtually no role in f(x).

How the Radial Basis Works – Cont.
- The predicted class label for the test observation x is based on the sign of f(x).
- Training observations that are far from x will play essentially no role in the predicted class label for x.
- This means that the radial kernel has very local behavior, in the sense that only nearby training observations have an effect on the class label of a test observation.

Advantages of Kernels
- One advantage is computational: using kernels, one need only compute K(xi, xi′) for all n(n−1)/2 distinct pairs i and i′.
- This can be done without explicitly working in the enlarged feature space.

Example: Heart Data
- The ROC (receiver operating characteristic) curve is obtained by changing the threshold from 0 to t in f̂(X) > t, and recording the false positive and true positive rates as t varies.
- Here we see ROC curves on the training data.

Example Continued: Heart Test Data
[Figure: ROC curves on the Heart test data.]

SVMs: More than 2 Classes?
- The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?
  - OVA (one versus all): fit K different 2-class SVM classifiers f̂k(x), k = 1, ..., K, each class versus the rest. Classify x* to the class for which f̂k(x*) is largest.
  - OVO (one versus one): fit all pairwise classifiers f̂kl(x). Classify x* to the class that wins the most pairwise competitions.
- Which to choose? If K is not too large, use OVO.

Support Vector versus Logistic Regression?
- With f(X) = β0 + β1X1 + ... + βpXp, we can rephrase the support vector classifier optimization as

    minimize over β0, β1, ..., βp: Σ_{i=1}^n max[0, 1 − yi f(xi)] + λ Σ_{j=1}^p βj².

- This has the form loss plus penalty. The loss, max[0, 1 − yi f(xi)], is known as the hinge loss.
- It is very similar to the "loss" in logistic regression (the negative log-likelihood).

Support Vector versus Logistic Regression? – Cont.
- The SVM and logistic regression loss functions are compared as a function of yi(β0 + β1 xi1 + ... + βp xip).
- When yi(β0 + β1 xi1 + ... + βp xip) is greater than 1, the SVM loss is zero, since this corresponds to an observation that is on the correct side of the margin.
- Overall, the two loss functions have quite similar behavior.

Which to Use: SVM or Logistic Regression?
- When classes are (nearly) separable, SVM does better than LR. So does LDA.
- When they are not, LR (with a ridge penalty) and SVM are very similar.
- If you wish to estimate probabilities, LR is the choice.
- For non-linear boundaries, kernel SVMs are popular. Kernels can be used with LR and LDA as well, but the computations are more expensive.

End
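To see the loss comparison from the final slides concretely, the following minimal sketch plots the hinge loss against the logistic (negative log-likelihood) loss as a function of yi·f(xi); it assumes numpy and matplotlib are available and is not part of the original slides. The hinge loss is exactly zero once yi·f(xi) ≥ 1 (the observation is on the correct side of the margin), while the logistic loss decays smoothly but never reaches zero.

```python
# Sketch: comparing the SVM hinge loss with the logistic loss as functions of
# the margin value y_i * f(x_i). Assumes numpy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 3, 400)               # margin values y_i * f(x_i)
hinge = np.maximum(0.0, 1.0 - m)          # SVM hinge loss: max(0, 1 - m)
logistic = np.log(1.0 + np.exp(-m))       # logistic (negative log-likelihood) loss

plt.plot(m, hinge, label="hinge loss (SVM)")
plt.plot(m, logistic, label="logistic loss")
plt.axvline(1.0, linestyle="--", linewidth=0.8)   # hinge loss is zero beyond m = 1
plt.xlabel("y_i * f(x_i)")
plt.ylabel("loss")
plt.legend()
plt.show()
```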
