Support Vector Machines (SVM)
Summary
This document is an overview of support vector machines (SVM), a machine learning technique used for classification and regression. It covers concepts like maximal margin classifiers, support vector classifiers, and the idea behind kernel tricks in the context of non-linear data separation.
Full Transcript
Support vector machine: introduction
1. Support Vector Machine (SVM) is a supervised* Machine Learning method: the data are labeled (we do have a response variable!).
2. SVMs can be used for both classification and regression problems: binary classification, multiclass classification, numeric prediction.
Let's see how SVM works for binary classification problems.
* Note that SVM has recently been used also for clustering purposes (unsupervised learning).

SVM: introduction
The theory behind SVM was introduced by Vapnik (1965) and further developed in 1995: SVMs are among the most widely adopted models for pattern classification. Instead of estimating the probability density of the classes, SVM directly solves the problem of interest: determining the classification boundaries. SVM was originally introduced as a binary classifier.
We will present SVM through the following steps:
- Maximal margin classifier
- Support vector classifier – linear SVM with linearly separable data (under the hypothesis that at least one hyperplane separating the classes exists)
- Linear SVM with non-linearly separable data (errors are inevitable, since no hyperplane is capable of separating the classes)
- Non-linear SVM (i.e. complex classification boundaries), without any hypothesis on separability
- Multiclass extensions

Classification using a hyperplane
Before seeing how the Support Vector Machine works, we need to introduce two concepts: Maximal Margin Classifiers and Support Vector Classifiers.

Maximal margin classifier: idea
Consider the case in which we observe some specimens of a certain metal alloy. For each specimen we measure the nickel content (X = predictor). Each specimen is classified as compliant (OK) or not compliant (KO) according to the metal strength (Y = response variable). We can define a threshold to use as a classifier.
[Figure: one-dimensional nickel-content axis with KO observations on the left, OK observations on the right, and a threshold in between.]
If a new observation has less nickel content than the threshold, it will be classified as KO; if it has more nickel content than the threshold, it will be classified as OK.

Maximal margin classifier: idea
What if we move the threshold and a new observation falls just beyond it? The method classifies the observation as OK, even though it is much closer to the KO class than to the OK class! A better idea is to set the threshold at the midpoint between the two closest observations belonging to the two classes.
[Figure: nickel-content axis with the threshold at the midpoint between the closest KO and OK observations; the distances on either side are the margins.]
The shortest distance between the observations and the threshold is called the margin. When the threshold is at the midpoint, the margin is at its maximum. When we define a threshold in this way for a classification problem, we are using a Maximal Margin Classifier.
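To make the one-dimensional idea concrete, here is a minimal sketch (the nickel values and labels below are invented purely for illustration) that places the threshold at the midpoint between the two closest observations of opposite classes, i.e. a maximal margin classifier in one dimension.

```python
import numpy as np

# Hypothetical nickel contents (X) and compliance labels (Y): illustration only.
nickel = np.array([1.0, 1.4, 1.9, 2.2, 3.4, 3.8, 4.1, 4.6])
label  = np.array(["KO", "KO", "KO", "KO", "OK", "OK", "OK", "OK"])

# Closest observations of the two classes.
max_ko = nickel[label == "KO"].max()   # largest KO value
min_ok = nickel[label == "OK"].min()   # smallest OK value

# Maximal margin threshold: midpoint between the two closest observations.
threshold = (max_ko + min_ok) / 2
margin = (min_ok - max_ko) / 2         # distance from threshold to the nearest point

def classify(x):
    """Classify a new specimen by its nickel content."""
    return "OK" if x > threshold else "KO"

print(f"threshold = {threshold:.2f}, margin = {margin:.2f}")
print(classify(2.5), classify(4.0))
```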
Maximal margin classifier: problems
Unfortunately the Maximal Margin Classifier is:
- sensitive to outliers;
- not applicable when the observations of the two classes overlap.

Support vector classifier
To overcome the problems of the Maximal Margin Classifier, we can define a more flexible threshold which allows some units to be misclassified.
[Figure: one-dimensional axis with soft margins on both sides of the threshold; one misclassified observation lies on the wrong side; the observations at the edge of the margins are the support vectors.]
When we allow for misclassification, the distance between the observations and the threshold is called the SOFT MARGIN. When we use soft margins we are defining a Support Vector Classifier. Observations at the edge of the margins are called Support Vectors.

Hyperplanes as decision boundaries
What happens when we consider multiple predictors? The decision boundary is a hyperplane: a linear decision surface that splits the space into 2 parts.
- A hyperplane in R² is a line.
- A hyperplane in R³ is a plane.
- A hyperplane in Rⁿ is an (n−1)-dimensional subspace.

Support vector classifier: n-dimensional
We can generalize the terminology introduced in the one-dimensional example to the n-dimensional case.
[Figure: two predictors x₁ and x₂, maximum-margin hyperplane with soft margins, support vectors on the margins and one misclassified observation.]
Data on the margin are called SUPPORT VECTORS. The support vectors alone identify the maximal margin hyperplane: they completely define the solution of the problem, which can be expressed as a function of just them, independently of the dimensionality of the space and of the number of data in the training set. They therefore provide a very compact way to store the classification model even when the number of observations is very high.

Hyperplanes
Consider the case of R³. A hyperplane is defined by:
- a point P₀;
- a vector w perpendicular to the plane at that point.
Define the vectors x⃗₀ = OP₀ and x⃗ = OP, where P is an arbitrary point on the hyperplane. The condition for P to lie on the plane is that x⃗ − x⃗₀ is perpendicular to w:

w·(x⃗ − x⃗₀) = 0
w·x⃗ − w·x⃗₀ = 0

Hence, defining b = −w·x⃗₀:

w·x⃗ + b = 0

This equation of a general hyperplane also holds in Rⁿ for n > 3.

Support vector classifier: linear SVM for linearly separable data
Consider a training set of linearly separable data containing n observations (x₁, y₁), …, (xₙ, yₙ), where the xᵢ are multidimensional data and yᵢ ∈ {+1, −1} are the labels of the 2 classes. D(x) is a linear function:

D(x) = w·x + b

It defines a hyperplane in the feature space. We can observe that:
- w/‖w‖ is the unit-length normal vector of the hyperplane;
- |b|/‖w‖ is the distance of the hyperplane from the origin;
- R₁ and R₂ are the regions of the two classes;
- D(x) = 0 is the locus of the vectors lying on the plane.
We can write x = xₚ + r·w/‖w‖ with D(xₚ) = 0, where xₚ is the projection of x on the plane. The distance of a vector x from the hyperplane is then |r| = |D(x)|/‖w‖.

SVM: idea
SVM performs classification by drawing a hyperplane that separates the observations belonging to different classes (a line in the 2D case, a plane in the 3D case).
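As an illustrative sketch (not part of the slides, which only mention LIBSVM), the decision function D(x) = w·x + b and the point-to-hyperplane distance |D(x)|/‖w‖ can be inspected directly from a fitted linear SVM; scikit-learn's SVC, which wraps LIBSVM, exposes w and b as coef_ and intercept_. The toy dataset is assumed only for demonstration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class, two-feature data (illustration only).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset, so that D(x) = w·x + b
D = X @ w + b             # decision function values, same as clf.decision_function(X)
dist = np.abs(D) / np.linalg.norm(w)   # geometric distance of each point from the hyperplane

print("unit normal w/||w||:", w / np.linalg.norm(w))
print("distance of the hyperplane from the origin:", abs(b) / np.linalg.norm(w))
print("closest training point lies", dist.min(), "away from the hyperplane")
```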
SVM: idea
But there can be many hyperplanes that all separate the observations of one class from the observations of the other. Which one should we choose? The one that best separates the classes, i.e. the one that maximizes the distance between the points of the 2 classes. This distance to be maximized is called the MARGIN, and the observations falling on the margin are called SUPPORT VECTORS.

Support vector machine: idea
To classify the observations we can define infinitely many lines (i.e. hyperplanes of a 2-dimensional space); which line classifies the observations best? The line that maximizes the distance between the observations belonging to the different classes (y = +1 and y = −1). This quantity is called the margin.

Linear SVM
The minimum distance between the separating hyperplane and a data point of the training set is called the margin τ.
[Figure: separating hyperplane D(x) = 0 with the two parallel hyperplanes D(x) = +1 and D(x) = −1; the regions D(x) > +1 and D(x) < −1 contain the two classes.]
The distance between the hyperplane D(x) = +1 and the separating hyperplane is equal to 1/‖w‖. Hence the margin is:

τ = 2/‖w‖

Linear SVM
How do we define the optimal hyperplane? According to the SVM strategy, the optimal hyperplane separating the classes is the one that satisfies the class-separation constraints and maximizes the margin τ (equivalently, minimizes its inverse):

Minimize: ‖w‖²/2
Constraints: yᵢ(w·xᵢ + b) − 1 ≥ 0 for i = 1, …, n

Hyperplane identification: convex hull
In the case of linearly separable classes, finding the maximal margin hyperplane is quite easy: the optimal hyperplane is the one farthest from the external boundaries of the two groups of points.
Note: the external boundary of a class is called its convex hull. The optimal hyperplane is the perpendicular bisector of the shortest segment connecting the two convex hulls. Advanced algorithms apply quadratic optimization to find the maximum margin in this way.
An alternative approach is to search the space of all possible hyperplanes, optimizing the margin under the separation constraints; this problem can be solved using a Lagrangian formulation and then its dual formulation.

Linear SVM with non-linearly separable data (1)
When the data are not linearly separable, it is necessary to relax the separation constraints and to allow some (as few as possible) observations to be misclassified. To this end we introduce n non-negative slack terms ξᵢ, i = 1, …, n, and the separation constraints are modified as:

yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ for i = 1, …, n

[Figure: two observations A and B lying inside the margin or on the wrong side of the hyperplane, with slack values ξ_A and ξ_B; observations violating the margin have ξᵢ > 0.]
For each observation xᵢ, the slack term ξᵢ encodes its deviation from the margin. For separable data all slack terms are 0.
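A small sketch of the hard-margin formulation above (illustrative only; the dataset is an assumed toy example): on linearly separable data, a linear SVM with a very large C approximates the hard-margin solution, the constraints yᵢ(w·xᵢ + b) ≥ 1 hold, and the margin equals 2/‖w‖.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Well-separated toy classes, so at least one separating hyperplane exists.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=1)
y_pm = np.where(y == 1, 1, -1)           # relabel the classes as +1 / -1

# Very large C ~ hard margin: slack terms are pushed to zero.
clf = SVC(kernel="linear", C=1e6).fit(X, y_pm)
w, b = clf.coef_[0], clf.intercept_[0]

margin = 2 / np.linalg.norm(w)           # τ = 2 / ||w||
constraint = y_pm * (X @ w + b)          # y_i (w·x_i + b), should be >= 1 for all i

print("margin τ =", margin)
print("min of y_i (w·x_i + b):", constraint.min())     # ≈ 1, attained by the support vectors
print("number of support vectors:", len(clf.support_vectors_))
```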
Linear SVM with non-linearly separable data (2)
The optimal hyperplane has to maximize the margin but, at the same time, it has to minimize the number of misclassified elements. Thus the objective function, and therefore the optimization problem, becomes:

Minimize: ‖w‖²/2 + C Σᵢ ξᵢ
Constraints: yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ for i = 1, …, n

C indicates the relative importance of the classification error with respect to the size of the margin: it is the only hyperparameter to be defined when tuning a linear SVM.

Non-linear SVM: idea
SVMs are easy to understand, easy to implement, easy to use, easy to interpret, and effective when the sample size is small. But this simplicity can also be a problem: in many applications the points cannot be separated by a hyperplane!

What to do? A common workaround is to augment the data with additional features: add another dimension that makes the data separable. This is the KERNEL TRICK. From the existing dimensions (x and y) we compute another one (e.g. x² + y²). In this higher-dimensional space we can find a separating hyperplane, and we can then go back to the original space (x and y) to obtain the separation between the classes. A minimal sketch of this explicit mapping is given after the list of kernel functions below.

Non-linear SVM: idea
SVM can be used to separate classes that are not separable with a hyperplane: to do this, the coordinates of the observations are mapped into a feature space using non-linear functions. Consider again the example of the nickel content of metal specimens, and assume that specimens are KO when the nickel content is either too small or too large.
[Figure: input space (X = nickel content) versus the higher-dimensional space obtained by adding the dimension X² (kernel trick).]
In this case the Maximal Margin Classifier and the Support Vector Classifier cannot handle the data, because no single threshold on X separates the classes. We pass to a higher-dimensional space using a KERNEL function: in this higher dimension the SVC works and a separating hyperplane is found. New data are squared and their class is predicted according to the threshold.

Non-linear SVM: method
SVM for non-linear classification boundaries works as follows:
- The non-separable data space is mapped to a new feature space using a non-linear function (kernel trick).
- A classifier is identified using the method previously described for the linear SVM.
- The classifier function solving the problem is brought back to the input space to obtain the final solution (the classification boundary).
The basic idea is that vectors that are not linearly separable in the input space have a higher probability of being separable in a higher-dimensional space.

Kernel functions
Different kernel functions have been developed and applied so far. Some examples are:
- Linear kernel: K(x, x′) = x·x′. It is always considered a good starting point.
- Polynomial kernel: K(x, x′) = (x·x′ + 1)^q. It adds a simple non-linear transformation of the data.
- Sigmoidal kernel: K(x, x′) = tanh(k·x·x′ − δ). It produces an SVM similar to a neural network with a sigmoidal activation.
- Gaussian RBF kernel (RBF = Radial Basis Function): K(x, x′) = exp(−‖x − x′‖² / (2σ²)).
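Here is the promised minimal sketch of the kernel-trick idea (illustrative; the concentric-rings dataset is an assumed stand-in for the nickel example): points that are not linearly separable in (x, y) become separable by a hyperplane once the extra feature x² + y² is added, which corresponds to a circular boundary back in the original space.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# Linear SVM in the original space: poor accuracy.
linear_2d = SVC(kernel="linear").fit(X, y)
print("linear SVM on (x, y):        ", linear_2d.score(X, y))

# Kernel-trick-style augmentation: add the feature x^2 + y^2.
X_aug = np.column_stack([X, (X ** 2).sum(axis=1)])
linear_3d = SVC(kernel="linear").fit(X_aug, y)
print("linear SVM on (x, y, x²+y²):", linear_3d.score(X_aug, y))

# Equivalently, a non-linear kernel performs the mapping implicitly.
rbf_2d = SVC(kernel="rbf").fit(X, y)
print("RBF SVM on (x, y):           ", rbf_2d.score(X, y))
```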
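The kernel formulas above can also be written down directly. The sketch below (parameter values q, k, δ, σ chosen arbitrarily for illustration) implements them as Gram-matrix functions and passes the Gaussian RBF one to scikit-learn's SVC, which accepts a callable kernel.

```python
import numpy as np
from sklearn.svm import SVC

def linear_kernel(A, B):
    return A @ B.T                                    # K(x, x') = x·x'

def polynomial_kernel(A, B, q=3):
    return (A @ B.T + 1) ** q                         # K(x, x') = (x·x' + 1)^q

def sigmoid_kernel(A, B, k=0.5, delta=1.0):
    return np.tanh(k * (A @ B.T) - delta)             # K(x, x') = tanh(k x·x' − δ)

def rbf_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 σ^2)), computed pairwise.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

# Example: use the custom RBF kernel inside an SVM on a circular class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)
clf = SVC(kernel=rbf_kernel).fit(X, y)
print("training accuracy with the custom RBF kernel:", clf.score(X, y))
```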
Which kernel function to use?
Unfortunately there is no reliable rule to select the best kernel. Often a trial-and-error approach is necessary, i.e. training and evaluating different SVMs on a validation set. It all depends on:
- the learning task;
- the training data;
- the relationships between the features.
[Figure: iterative workflow — 1. select a kernel function; 2. train the model; 3. apply the model to the test set; 4. evaluate the performance; if the performance is not acceptable, go back to step 1, otherwise stop.]
Choosing a kernel according to prior knowledge of the invariances of the problem is a good idea. In the absence of expert knowledge, the Radial Basis Function kernel makes a good default.

Multiclass SVM (1)
Multiclass classification with SVM is still an open problem, but some solutions already exist.
One-Against-One approach: it solves the multiclass problem through binary classifiers (this is the approach adopted by the LIBSVM library). Given s classes, s(s − 1)/2 binary classifiers are trained: one for each possible pair of classes, independently of the order. During classification, x is classified by each binary SVM classifier, which gives a vote to the more probable of its two classes. At the end, x is assigned to the class that received the highest number of votes (majority vote rule).

Multiclass SVM (2)
One-Against-All approach: given s classes w₁, w₂, …, wₛ, for each class wₖ an SVM determines the separation surface between the data of wₖ (labeled +1) and all the data belonging to the remaining classes wₕ, h ≠ k (labeled −1), obtaining a function Dₖ(x) that indicates how far x is from the decision boundary in the direction of wₖ. The higher Dₖ(x) is, the more confident we are that x belongs to wₖ. After training, x is assigned to the class k* for which the distance from the decision surface is highest:

k* = arg maxₖ Dₖ(x)

Examples of SVM applications
- Bioinformatics: classification of genetic data to identify cancer or other genetic disorders.
- Facial expression classification.
- Text classification: classify emails as spam or not spam; classify articles by topic.
- Rare event detection: faults of combustion engines, security violations, earthquakes.
- Speech recognition: language identification.

Practical indications
Linear or non-linear?
- If the dimensionality d is very high (e.g. 5000 features), a linear SVM is usually adopted: in such a big space the data are typically sparse and simple hyperplanes are able to separate the classes.
- If the dimensionality d is low (e.g. 20 features), the first choice is a non-linear SVM with RBF kernel.
- If the dimensionality d is medium (e.g. 200 features), you generally try out both.
Multiclass:
- Typically you go with the solution available within the library (One-Against-One for LIBSVM).
- If the number of classes is very high, the computational cost can become unacceptable: One-Against-All then forcibly becomes the choice.
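As a closing sketch (illustrative; the iris dataset and RBF hyperparameters are assumed, not taken from the slides): scikit-learn's SVC is built on LIBSVM and handles multiclass problems with One-Against-One internally, and both decompositions can also be built explicitly with the multiclass wrappers, which makes the s(s − 1)/2 versus s classifier counts visible.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # s = 3 classes
s = len(set(y))

# One-Against-One: s(s-1)/2 binary SVMs, combined by majority vote.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
print("one-against-one binary classifiers:", len(ovo.estimators_))   # 3*(3-1)/2 = 3

# One-Against-All: s binary SVMs, the class with the largest D_k(x) wins.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print("one-against-all binary classifiers:", len(ovr.estimators_))   # 3

# Plain SVC handles multiclass directly (One-Against-One internally, as in LIBSVM).
print("SVC accuracy on the training set:", SVC(kernel="rbf").fit(X, y).score(X, y))
```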