Lecture 8 (Linear) Support Vector Machines PDF
Document Details
Uploaded by HumbleUnakite3397
Politecnico di Torino
G.C. Calafiore
Summary
This document is a lecture on linear support vector machines from Politecnico di Torino, covering hyperplane geometry, the perceptron algorithm, maximum-margin classifiers, the linear SVM, and regularization, sparsity, and robustness.
Full Transcript
LECTURE 8: (Linear) Support Vector Machines
G.C. Calafiore (Politecnico di Torino)

Outline
1. Origins: the perceptron algorithm
2. Maximum margin classifiers
   - The separable case
   - The general case
3. Linear Support Vector Machine
4. Regularization, sparsity, robustness

Hyperplane geometry
Given a linear (affine) function $f(x) = w_0 + w^\top x$, we define the hyperplane $H$ as the locus of points at which $f(x) = 0$, that is,
$$H = \{x \in \mathbb{R}^n : w_0 + w^\top x = 0\}.$$
Geometrically, $H$ is an $(n-1)$-dimensional affine set orthogonal to the span of the vector $w$, at (signed) distance $-w_0/\|w\|_2$ from the origin. $H$ separates the whole space $\mathbb{R}^n$ into two halfspaces,
$$H_+ = \{x : w_0 + w^\top x \ge 0\}, \qquad H_- = \{x : w_0 + w^\top x \le 0\},$$
which are the regions respectively above and below $H$. The (signed) distance to be travelled in direction $w$ to go from a point $p$ to $H$ is
$$\mathrm{dist}(p; H) = -\frac{w_0 + w^\top p}{\|w\|_2}.$$

[Figure: hyperplane geometry.]

Perceptrons
One can classify features using $H$ as a discrimination surface: a point $x$ is assigned to class 1 if it belongs to $H_+$, and to class 2 if it belongs to $H_-$. The hyperplane $H$ is thus a decision boundary. Classifiers that compute a linear combination of the input features and return the sign were called perceptrons in the engineering literature of the late 1950s (Rosenblatt, 1958). Perceptrons laid the foundations for the neural network models of the 1980s and 1990s. The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary $H$.

Rosenblatt's perceptron learning
If a response $y_i = 1$ is misclassified, then $w_0 + w^\top x^{(i)} < 0$, and the opposite holds for a misclassified response with $y_i = -1$. The goal is thus to minimize (over nonzero $w$)
$$L(w_0, w) = \sum_{i=1}^{N} \bigl[-y_i\,(w_0 + w^\top x^{(i)})\bigr]_+ .$$
Notice that the cost contribution is zero for the points that are correctly classified. For a misclassified point, the cost contribution is proportional to the distance of $x^{(i)}$ from the decision boundary.

The loss function to be minimized is piecewise linear:
$$L(w_0, w) = \sum_{i=1}^{N} \max\bigl(-y_i\,(w_0 + w^\top x^{(i)}),\, 0\bigr).$$
It can be minimized iteratively via a subgradient-type algorithm. Given a current hyperplane $(w_0, w)$, we evaluate the set $\mathcal{M}$ of indices of the misclassified points, whence
$$L(w_0, w) = \sum_{i \in \mathcal{M}} \bigl(-y_i w_0 - y_i w^\top x^{(i)}\bigr).$$
Then we iteratively update $(w_0, w)$ according to a gradient descent rule (a sketch of this update in code is given below):
$$(w_0, w)^+ = (w_0, w) - \alpha\, \nabla_{(w_0, w)} L(w_0, w).$$
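The update rule above can be made concrete with a short Matlab sketch (not from the slides). It assumes, as in the CVX example later in the lecture, an $n \times N$ matrix X whose columns are the training points and an $N$-vector y of $\pm 1$ labels; the function name perceptron_train, the step size alpha and the iteration cap n_iter are illustrative choices.

```matlab
% Minimal perceptron training sketch: subgradient descent on the perceptron loss.
% X: n-by-N matrix of training points (one column per point); y: N-vector of +/-1 labels.
function [w0, w] = perceptron_train(X, y, alpha, n_iter)
    [n, ~] = size(X);
    w = zeros(n, 1);  w0 = 0;
    for t = 1:n_iter
        M = find(y .* (X'*w + w0) <= 0);   % indices of misclassified points
        if isempty(M), break; end           % stop early: everything is correctly classified
        grad_w  = -X(:, M) * y(M);          % subgradient of L with respect to w
        grad_w0 = -sum(y(M));               % subgradient of L with respect to w0
        w  = w  - alpha * grad_w;
        w0 = w0 - alpha * grad_w0;
    end
end
```

A call such as [w0, w] = perceptron_train(X, y, 0.1, 1000) simply stops as soon as no training point is misclassified.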
Maximum margin classifiers: the separable case
One problem with the plain perceptron is that there may be multiple optimal solutions. A more reliable approach is to optimally separate the two classes by maximizing the distance to the closest point from either class (Vapnik, 1996). Not only does this provide a unique solution to the separating-hyperplane problem, but, by maximizing the margin between the two classes on the training data, it also leads to better classification performance on test data.

[Figure: maximum-margin separation of two classes.]

Assume strict separability. For a point in the negative class (i.e., an $x_i$ for which $y_i = -1$), the distance from the boundary $H$ is $-(w_0 + w^\top x_i)/\|w\|_2$. For a point in the positive class (i.e., an $x_i$ for which $y_i = 1$), the distance from the boundary $H$ is $(w_0 + w^\top x_i)/\|w\|_2$. For a generic point $x_i$, the distance from the decision boundary is thus
$$\frac{y_i\,(w_0 + w^\top x_i)}{\|w\|_2}.$$
We impose that this distance must be larger than some value $M$, and maximize $M$:
$$\max_{w_0, w, M}\; M \quad \text{s.t.}\quad y_i\,(w_0 + w^\top x_i) \ge \|w\|_2\, M, \quad i = 1, \ldots, N.$$
Further, we observe that the constraints are positively homogeneous, hence we can normalize them by fixing $M = 1/\|w\|_2$, and the problem can be rewritten equivalently as
$$\max_{w_0, w}\; 1/\|w\|_2 \quad \text{s.t.}\quad y_i\,(w_0 + w^\top x_i) \ge 1, \quad i = 1, \ldots, N.$$
Finally, we observe that maximizing $1/\|w\|_2$ is equivalent to minimizing $\|w\|_2$, which in turn is equivalent to minimizing $\|w\|_2^2$. Thus
$$\min_{w_0, w}\; \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.}\quad y_i\,(w_0 + w^\top x_i) \ge 1, \quad i = 1, \ldots, N.$$
This is a convex QP! The value of $2/\|w\|_2$ at the optimum defines the separation margin between the two classes.

Maximum margin classifiers: the general case
But what happens if the two classes are NOT linearly separable? The above QP formulation becomes INFEASIBLE (i.e., it has no solution).

[Figure: two classes that are not linearly separable.]

Way out...

Linear Support Vector Machine
We may simply allow some points to be on the wrong side of the margin. To this end, we relax the constraints to
$$y_i\,(w_0 + w^\top x_i) \ge 1 - \xi_i, \quad i = 1, \ldots, N,$$
where $\xi_i \ge 0$ represents the amount of violation (misclassification) of the $i$-th point. Then we penalize the misclassifications in the objective, leading to the modified problem (here $c \ge 0$ is a tradeoff parameter):
$$\begin{aligned} \min_{w_0, w, \xi}\;& \tfrac{1}{2}\|w\|_2^2 + c \sum_{i=1}^{N} \xi_i \\ \text{s.t.}\;& y_i\,(w_0 + w^\top x_i) \ge 1 - \xi_i, \quad i = 1, \ldots, N, \\ & \xi_i \ge 0, \quad i = 1, \ldots, N. \end{aligned}$$
This is again a convex QP, which is now always feasible!

Linear Support Vector Machine: penalty format
The general linear SVM problem can be rewritten as an unconstrained minimization problem by substituting the slack variables in the objective with the expression
$$\xi_i = \max\bigl(1 - y_i\,(w_0 + w^\top x_i),\, 0\bigr) = \bigl[1 - y_i\,(w_0 + w^\top x_i)\bigr]_+ = h(y_i,\, w_0 + w^\top x_i),$$
where $h(y, f) = [1 - yf]_+$ is the hinge loss function. The linear SVM problem becomes
$$\min_{w_0, w}\; c \sum_{i=1}^{N} \bigl[1 - y_i\,(w_0 + w^\top x_i)\bigr]_+ + \tfrac{1}{2}\|w\|_2^2,$$
which has the typical format of the objective in learning problems: (loss function) + (regularization term).

Hinge loss function
A cost based on the hinge loss serves as an approximation to the number of errors made on the training set:
$$L(w_0, w) = \sum_{i=1}^{N} \max\bigl(1 - y_i\,(w^\top x_i + w_0),\, 0\bigr).$$

[Figure: the hinge loss $\max(1 - s, 0)$ as a function of $s$.]

CVX syntax
CVX allows one to bypass the need to put the problem in standard format. Here is a Matlab snippet that solves an SVM problem, given that an $n \times N$ matrix X, an $N$-vector y and a non-negative scalar c exist in the workspace:

    cvx_begin
        variable w(n,1);
        variable b(1);
        minimize( c*sum(max(0,1-y.*(X'*w+b))) + .5*w'*w )
    cvx_end
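As a cross-check on the statement that the soft-margin SVM is a convex QP, the same problem can also be passed to a generic QP solver in standard form. The following Matlab sketch (not from the slides) uses quadprog from the Optimization Toolbox, under the same workspace assumptions as the CVX snippet above ($n \times N$ matrix X, $N$-vector y of $\pm 1$ labels, tradeoff parameter c); the stacking z = [w; w0; xi] is a choice made here for illustration.

```matlab
% Soft-margin linear SVM written as a standard QP and solved with quadprog.
% Decision variable z = [w; w0; xi]; X is n-by-N, y is an N-vector of +/-1 labels.
[n, N] = size(X);
H  = blkdiag(eye(n), zeros(N + 1));       % quadratic cost: (1/2)*||w||_2^2 only
f  = [zeros(n + 1, 1); c * ones(N, 1)];   % linear cost: c * sum(xi)
A  = [-(y .* X'), -y, -eye(N)];           % y .* X' scales row i of X' by y_i;
                                          % rows encode y_i*(w'*x_i + w0) >= 1 - xi_i
b  = -ones(N, 1);
lb = [-inf(n + 1, 1); zeros(N, 1)];       % xi >= 0, while (w, w0) are free
z  = quadprog(H, f, A, b, [], [], lb, []);
w  = z(1:n);  w0 = z(n + 1);  xi = z(n + 2:end);
```

Writing the problem out in this explicit form makes the "this is a convex QP" claim concrete; CVX spares us exactly this bookkeeping.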
Comparison of perceptron, logistic and hinge loss
We have encountered so far, in the discussion of classifiers, three loss functions:
- the logistic loss $L_{\text{logit}} = \sum_{i=1}^{m} \log\bigl(1 + \exp(-z_i)\bigr)$;
- the perceptron loss $L_{\text{percpt}} = \sum_{i=1}^{m} [-z_i]_+$;
- the hinge loss $L_{\text{hinge}} = \sum_{i=1}^{m} [1 - z_i]_+$,

where $z_i = y_i\,(w^\top x^{(i)} + w_0)$ is positive for correctly classified training points and negative otherwise. In either case, the absolute value of $z_i$ is proportional to the distance of the point $x^{(i)}$ from the discrimination hyperplane $H = \{x : w^\top x + w_0 = 0\}$.

[Figure: the perceptron, logistic and hinge losses plotted as functions of $z$.]

We see in particular that
$$\log\bigl(1 + \exp(-z)\bigr) \ge \max(0, -z),$$
hence the logistic loss is an upper bound on the perceptron loss. The two losses are similar when $|z|$ is large, a regime in which both minimize the sum of misclassifications, or "equation errors", from the decision boundary.

General form
We consider the problem
$$\min_{w, w_0}\; L(X^\top w + w_0 \mathbf{1},\, y) + \lambda\, p(w),$$
where $L$ is a convex loss function that encodes the error between the observed value and the predicted value; $(w, w_0)$ are the model parameters; $p$ is a penalty on the regression parameters; and $\lambda > 0$ is a penalty parameter. When $L(z, y) = \mathbf{1}^\top [\mathbf{1} - y \circ z]_+$ (with $\circ$ the elementwise product) and $p(w) = \|w\|_2^2$, we recover the linear SVM.

Loss functions and penalties
Changing the loss function allows us to cover:
- the perceptron;
- SVMs;
- logistic regression.

Typical penalties:
- the $\ell_1$-norm, to enforce sparsity;
- the $\ell_2$-norm (often squared), to control statistical noise and improve prediction error;
- sum-of-block norms, to enforce whole blocks of $w$ to be zero.

Robustness interpretation of SVM
Return to separable data in the perceptron/SVM setup. The set of constraints
$$y_i\,(w^\top x_i + w_0) \ge 0, \quad i = 1, \ldots, N,$$
has many possible solutions $(w, w_0)$. We will select a solution based on the idea of robustness (to changes in the data points).

Maximally robust separating hyperplane
Spherical uncertainty model: assume that the data points are actually unknown, but bounded:
$$x_i \in S_i = \{\hat{x}_i + u_i : \|u_i\|_2 \le \rho\},$$
where the $\hat{x}_i$'s are known, $\rho > 0$ is a given measure of uncertainty, and $u_i$ is unknown. Robust counterpart: we now ask that the separating hyperplane separate the spheres (and not just the points):
$$\forall x_i \in S_i:\quad y_i\,(w^\top x_i + w_0) \ge 0, \quad i = 1, \ldots, N.$$
For separable data we can thus try to separate spheres around the given points, growing the spheres' radius until sphere separation becomes impossible.

Robust classification
We obtain the equivalent condition
$$y_i\,(w^\top \hat{x}_i + w_0) \ge \rho\,\|w\|_2, \quad i = 1, \ldots, N.$$
We now seek $(w, w_0)$ that maximize $\rho$ subject to the above. By homogeneity we can always set $\rho\,\|w\|_2 = 1$, so that the problem reduces to
$$\min_{w, w_0}\; \|w\|_2 \quad \text{s.t.}\quad y_i\,(w^\top \hat{x}_i + w_0) \ge 1, \quad i = 1, \ldots, N.$$
This is exactly the same problem as the SVM in the separable case, a.k.a. the "maximum-margin classifier".

Separating boxes instead of spheres
We can instead use a box uncertainty model:
$$x_i \in B_i = \{\hat{x}_i + u_i : \|u_i\|_\infty \le \rho\}.$$
This leads to
$$\min_{w, w_0}\; \|w\|_1 \quad \text{s.t.}\quad y_i\,(w^\top \hat{x}_i + w_0) \ge 1, \quad i = 1, \ldots, N.$$
Classifiers found this way tend to be sparse; in 2D, the boundary line tends to be vertical or horizontal. A CVX sketch of this $\ell_1$ variant is given below.
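A CVX model for this $\ell_1$ formulation can be written along the same lines as the earlier SVM snippet. This is a sketch (not from the slides), assuming separable data with an $n \times N$ matrix X and an $N$-vector y of $\pm 1$ labels in the workspace:

```matlab
% l1-norm separating hyperplane (box-robust formulation) in CVX.
cvx_begin
    variable w(n,1)
    variable b(1)
    minimize( norm(w, 1) )          % l1 objective promotes a sparse w
    subject to
        y .* (X'*w + b) >= 1;       % hard separation constraints
cvx_end
```

Swapping norm(w, 1) for norm(w) (the Euclidean norm) recovers the maximum-margin problem of the spherical-uncertainty case.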