Questions and Answers
What is a drawback of the maximal margin classifier?
The support vector classifier aims to perfectly separate the two classes.
False
What is the primary role of the hyperplane in the support vector classifier?
What are observations that lie directly on the margin or on the wrong side of the margin for their class called?
If a slack variable $\epsilon_i$ is greater than 1, it indicates that the observation is on the wrong side of the margin.
In a support vector classifier, changing the position of an observation that lies strictly on the correct side of the margin will ___ the classifier.
What happens to the margin of a support vector classifier as the regularization parameter C increases?
A small C value leads to a classifier with high bias and low variance.
What does the acronym SVM stand for?
The maximal margin classifier is the most complex form of SVM.
What is the purpose of a hyperplane in SVM?
The vector β in the hyperplane equation β0 + β1 X1 + β2 X2 + ... + βp Xp = 0 is known as the ______.
What method is used in SVM when there are more than 2 classes?
Support Vector Machine (SVM) is more effective than Logistic Regression (LR) when classes are not separable.
What is the loss function used in support vector classifier optimization?
When $y_i(\beta_0 + \beta_1x_{i1} +...+ \beta_px_{ip})$ is greater than 1, the SVM loss is ______.
Match the following concepts with their descriptions:
What characterizes a support vector machine compared to a support vector classifier?
The radial kernel has a global behavior, meaning all training observations affect the predicted class label for a test observation.
What is the role of the parameter gamma (γ) in the radial basis kernel?
Support vector machines utilize kernels to compute the __________ needed for different dimensions.
Match the kernel types with their characteristics:
Which of the following best describes the polynomial kernel?
As the distance between a test observation and a training observation increases, the contribution of that training observation to the prediction increases.
What happens to the predicted class label when the training observations are far from the test observation?
Study Notes
Introduction to Machine Learning - AI 305: Support Vector Machines (SVM)
- Support Vector Machines (SVMs) are a classification approach developed in the 1990s, gaining popularity since.
- SVMs perform well in various settings and are considered strong "out-of-the-box" classifiers.
- The core concept is the maximal margin classifier.
- The support vector classifier extends the maximal margin classifier for broader datasets.
- Support Vector Machines (SVM) extend the support vector classifier further to accommodate non-linear class boundaries.
Contents
- Maximal Margin Classifier
- Support Vector Classifier
- Support Vector Machine
- SVM for Multiclass Problems
- SVM vs. Logistic Regression
Introduction - Continued
- Support Vector Machines (SVMs) are an approach for classification, originally developed in the computer science community during the 1990s.
- The popularity has grown since then.
- These approaches perform well across a range of contexts, frequently being regarded as one of the best "off-the-shelf" or pre-built classifiers.
- The approach handles two-class classification problems directly.
- Trying to find a plane that cleanly segregates the classes in feature space is the first step.
- If a separating plane can't be readily identified, two strategies are employed: refining the meaning of "separates", or expanding the feature space to enable separation.
What is a Hyperplane?
- A hyperplane in p-dimensions is an affine subspace of dimension p−1.
- The generic equation for a hyperplane is: β0 + β1X1 + β2X2 + ... + βpXp = 0
- In two dimensions, a hyperplane is a line.
- In three dimensions, it's a plane.
- If β0 = 0, the hyperplane passes through the origin; otherwise, it does not.
- The vector β = (β1, β2, ..., βp) is called the normal vector; it points orthogonal to the hyperplane's surface.
Hyperplanes - Example
- Let the hyperplane be represented as: 1 + 2X1 + 3X2 = 0.
- The blue region represents the points where 1 +2X1 + 3X2 > 0.
- The purple region represents the points where 1 + 2X1 + 3X2 < 0.
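As a minimal sketch of this example (NumPy-based; the three test points are chosen arbitrarily for illustration), the region a point falls in can be read off from the sign of f(x) = 1 + 2X1 + 3X2:

```python
import numpy as np

# Hyperplane from the example above: f(x) = 1 + 2*X1 + 3*X2 = 0
beta0 = 1.0
beta = np.array([2.0, 3.0])

# Three arbitrary test points (hypothetical, for illustration only)
points = np.array([
    [ 1.0,  1.0],   # f = 1 + 2 + 3 =  6 > 0  -> blue region
    [-1.0, -1.0],   # f = 1 - 2 - 3 = -4 < 0  -> purple region
    [ 1.0, -1.0],   # f = 1 + 2 - 3 =  0      -> exactly on the hyperplane
])

f = beta0 + points @ beta   # evaluate f(x) for each point
print(f)                    # [ 6. -4.  0.]
print(np.sign(f))           # [ 1. -1.  0.] : which side of the hyperplane
```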
Classification using a Separating Hyperplane
- Given an n × p data matrix X of n training observations in p-dimensional space, where these observations fall into two classes (y1, ..., yn ∈ {−1, +1}).
- The objective is to develop a classifier to categorize the test observation based on its feature measurements.
- A variety of techniques are used (logistic regression, classification trees, bagging, boosting).
- This approach introduces a novel method based on a separating hyperplane concept.
Separating Hyperplanes
- If f(X) = β0 + β1X1 + ... + βpXp, then f(x) > 0 for points on one side of the hyperplane and f(x) < 0 for points on the other side.
- If yi = +1 for blue points and yi = −1 for purple points, then yi f(xi) > 0 for all i.
- f(x) = 0 defines a separating hyperplane.
Maximal Margin Classifier
- Among all separating hyperplanes, it seeks the one maximizing the gap (margin) between the two classes.
- The maximal margin hyperplane is the solution of the optimization problem that minimizes ‖β‖² subject to a set of constraints.
- The constraints enforce that each observation must fall on the correct side of the hyperplane and maintain a distance at least M from it, with M being the margin width.
- This formulation can be resolved effectively as a convex quadratic program.
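For reference, one standard and equivalent way to write this optimization problem (assuming the training data are linearly separable) is:

$$
\begin{aligned}
\min_{\beta_0,\,\beta}\quad & \tfrac{1}{2}\,\lVert \beta \rVert^{2} \\
\text{subject to}\quad & y_i\bigl(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\bigr) \ge 1,
\qquad i = 1, \dots, n,
\end{aligned}
$$

with resulting margin width M = 1/‖β‖, which is why minimizing ‖β‖² maximizes the margin.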
Non-separable Data
- Data that cannot be separated by a linear boundary using the specified criterion.
- In that case there is no solution with a margin larger than zero; this is the typical situation unless the number of observations n is smaller than the dimensionality p.
- The generalization of the maximal margin classifier, accommodating non-separable cases is called a support vector classifier, employing a "soft margin".
Noisy Data
- Data that is separable but includes noise, potentially leading to a less desirable solution for maximal-margin classifiers.
- For this case the support vector classifier maximizes a soft margin.
Drawbacks of Maximal Margin Classifiers
- Classifiers based on separating hyperplanes invariably perfectly classify all training observations, leading to increased sensitivity towards individual observations.
- The addition of a single new observation can dramatically alter the maximal margin hyperplane.
- A hyperplane with a very narrow margin is undesirable: when observations lie close to the hyperplane, there is little confidence that they have been classified correctly.
Support Vector Classifiers
- Given the limitations of the maximal margin classifier, support vector classifiers (also called soft margin classifiers) are introduced; they tolerate misclassification of a few observations in order to classify the remaining data points better.
- They use less restrictive conditions on hyperplane selection, aiming to improve overall classification accuracy.
Support Vector Classifier - Continued
- The optimization problem is structured in such a way that only observations on or violating the margin affect the hyperplane.
- Points that lie directly on the margin, or on the "wrong" side are considered "support vectors" and control the margin boundaries.
- These “support vectors” significantly influence the SVM classifier.
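As an illustrative sketch (scikit-learn's SVC with a linear kernel is used here as a stand-in for the support vector classifier, and the toy data are made up), the fitted model exposes which observations act as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy two-class data in two dimensions (hypothetical, for illustration only)
X = np.vstack([
    rng.normal(loc=[ 2.0,  2.0], scale=1.0, size=(20, 2)),
    rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(20, 2)),
])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only observations on or violating the margin become support vectors;
# moving any point that lies strictly on the correct side of the margin
# leaves the fitted hyperplane unchanged.
print(clf.support_)               # indices of the support vectors
print(clf.support_vectors_)       # their coordinates
print(clf.coef_, clf.intercept_)  # the fitted beta and beta_0
```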
Support Vector Classifier- Continued
- Example illustrating how a support vector classifier is fit to a small dataset; dashed lines indicate the margins around the fitted hyperplane.
- The plots show how data points on or violating the margin affect the position of the hyperplane; the points near the margin in the sample dataset are the support vectors.
Details of the Support Vector Classifier
- SVM classifiers are based on the side of a hyperplane on which a test observation falls.
- The hyperplane is carefully selected to correctly categorize the majority of training observations while tolerating a few possible misclassifications.
- The solution rests on an optimization problem.
- The problem uses a tuning parameter C, the margin width M (equal to the inverse of the norm of the weight vector, M = 1/‖β‖), and slack variables that allow some observations to be on the wrong side of the margin.
Details of the Support Vector Classifier - Continued
- C is a non-negative model tuning parameter.
- M as related to maximizing margin width.
- Slack variables allow individual observations to be on the wrong side of the margin or hyperplane.
Slack Variable
- The slack variable εi reflects the position of the ith observation relative to the margin and the hyperplane.
- εi = 0 indicates the ith observation is on the correct side of the margin.
- εi > 0 indicates the ith observation is on the wrong side of the margin (in violation); εi > 1 implies the ith observation is on the wrong side of the hyperplane.
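In the budget form used in these notes, the soft-margin optimization problem can be written (one standard formulation, stated here for reference) as:

$$
\begin{aligned}
\max_{\beta_0,\,\beta,\,\epsilon_1,\dots,\epsilon_n}\quad & M \\
\text{subject to}\quad & \lVert \beta \rVert = 1, \\
& y_i\bigl(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\bigr) \ge M(1 - \epsilon_i), \\
& \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.
\end{aligned}
$$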
Regularization Parameter C
- C limits the total amount of violations made to the margin or hyperplane.
- It acts as a constraint against a high number of misclassifications on training data.
- C=0 indicates a strict adherence to the margin (no violations allowed).
- A higher C leads to a wider margin and a greater tolerance for margin violations, which affects how confidently observations are categorized; since an observation on the wrong side of the hyperplane has εi > 1, no more than C observations can be on the wrong side of the hyperplane.
The Regularization Parameter C - Continued
- Analyzing the effect of C on the support vector classifier's performance shows how varying C impacts the margin width and the number of support vectors.
- When C is large, the margin is wide and many observations violate it, so many observations influence the hyperplane; this yields a classifier with lower variance but potentially higher bias. Conversely, when C is small, the hyperplane is determined by only a few observations, giving a classifier with lower bias but higher variance.
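A small experiment can make this concrete. Note that in scikit-learn the parameter named C is a penalty on margin violations, so it behaves roughly inversely to the violation budget C used in these notes; the toy data below are hypothetical:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy overlapping two-class data (made up for illustration)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# scikit-learn's C penalizes margin violations, so it acts roughly inversely
# to the violation budget C described above: a small penalty corresponds to a
# large budget (wide margin, many support vectors), and vice versa.
for penalty in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    print(f"penalty C={penalty:>6}: {len(clf.support_)} support vectors")
```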
Robustness of Support Vector Classifiers
- The support vector classifier's decision rule depends only on a potentially small subset of the training observations; these observations are known as support vectors.
- This reliance on support vectors makes the decision rule robust: observations far from the hyperplane, including distant outliers, have little or no influence on it.
- Note the contrast to other classification approaches (for example, linear discriminant analysis).
Linear Boundary Failures
- A linear boundary may fail to separate the classes adequately in some cases, regardless of the value of C.
- Data patterns requiring non-linear decision boundaries could also be solved by employing non-linear transformations in the original feature space.
Feature Expansion
- Feature space is enlarged by introducing polynomial or other transformations.
- The support vector machine in this enlarged space may find a separating hyperplane that produces a non-linear decision boundary in the original input space (e.g., using quadratic, cubic, or higher-order polynomial expansions).
- The optimization problem will be altered to reflect the higher dimensionality space.
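A minimal sketch of this idea, assuming scikit-learn: the feature space is enlarged with PolynomialFeatures and a linear support vector classifier is then fit in the enlarged space (the circular toy data are made up; a kernel SVM would achieve the same effect implicitly):

```python
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Two classes that no straight line can separate in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# Expand (X1, X2) with polynomial terms (X1^2, X1*X2, X2^2, ...), then fit a
# *linear* support vector classifier in the enlarged space; the resulting
# decision boundary is non-linear in the original two features.
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    LinearSVC(C=1.0),
)
model.fit(X, y)
print(model.score(X, y))   # training accuracy; near 1.0 after the expansion
```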
Feature Expansion - Example
- This example demonstrates how enlarging feature space with specific transformations can produce a non-linear decision boundary.
- Illustrating practical application.
Cubic Polynomials
- Illustrates a cubic polynomial basis expansion, growing the feature space from 2 to 9 variables.
- Applying this transformation to a specific dataset (plotted sample) yields a support vector classifier solution to the non-linear separation problem.
SVMs: More Than Two Classes
- Classic Support Vector Machine implementations work for only two classes; this section discusses multi-class expansions.
- The "one-versus-all" (OVA) approach fits individual classifiers (one vs all other classes) resulting K classifiers.
- The class assignment is determined based on the maximum value amongst all these classifiers for a given observation.
- The "one-versus-one" (OVO) approach fits all pairwise combinations yielding K(K−1)/2 classifiers; the class with the most winning pairwise competitions is chosen for the input example.
SVM vs. Logistic Regression
- The optimization problem in SVMs can be rephrased using a "hinge" loss function that closely resembles the "loss" function used in logistic regression (negative log-likelihood).
- The loss functions of both approaches have notable similarities in their respective shapes.
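For comparison, writing f(x) = β0 + β1X1 + ... + βpXp, the two per-observation losses for yi ∈ {−1, +1} are:

$$
\text{hinge (SVM)}:\; L\bigl(y_i, f(x_i)\bigr) = \max\bigl[\,0,\; 1 - y_i f(x_i)\,\bigr],
\qquad
\text{logistic}:\; L\bigl(y_i, f(x_i)\bigr) = \log\bigl(1 + e^{-y_i f(x_i)}\bigr).
$$

The hinge loss is exactly zero whenever yi f(xi) > 1 (the observation is safely on the correct side of the margin), whereas the logistic loss is small there but never exactly zero.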
Which to Use: SVM or Logistic Regression?
- SVMs outperform logistic regression when the classes are clearly separable and a linear boundary can readily be identified.
- In cases where the classes are not well separated, logistic regression with a regularisation penalty and the support vector classifier generally yield similar outcomes.
- When estimating probabilities, logistic regression is the more appropriate choice.
- In cases where non-linear boundaries or high dimensionality are required, kernel SVMs may be prioritized due to their adaptability; however, they typically require more computations.
End
Description
Test your knowledge on support vector classifiers and their components. This quiz covers topics like maximal margin classifiers, hyperplanes, slack variables, and observations in relation to the margin. Challenge yourself with these essential concepts in machine learning.