Questions and Answers
What is a drawback of the maximal margin classifier?
- It perfectly classifies all training observations.
- It may have overfit the training data. (correct)
- It identifies support vectors effectively.
- It is insensitive to individual observations.
The support vector classifier aims to perfectly separate the two classes.
False (B)
What is the primary role of the hyperplane in the support vector classifier?
- To minimize the width of the margin.
- To classify observations without any misclassification.
- To separate the training observations into two classes. (correct)
- To increase the number of observations on the correct side of the margin.
What are observations that lie directly on the margin or on the wrong side of the margin for their class called?
If a slack variable $\epsilon_i$ is greater than 1, it indicates that the observation is on the wrong side of the margin.
In a support vector classifier, changing the position of an observation that lies strictly on the correct side of the margin will ___ the classifier.
What happens to the margin of a support vector classifier as the regularization parameter C increases?
A small C value leads to a classifier with high bias and low variance.
What does the acronym SVM stand for?
The maximal margin classifier is the most complex form of SVM.
What is the purpose of a hyperplane in SVM?
The vector β in the hyperplane equation β0 + β1 X1 + β2 X2 +...+ βp Xp = 0 is known as the ______.
What method is used in SVM when there are more than 2 classes?
Support Vector Machine (SVM) is more effective than Logistic Regression (LR) when classes are not separable.
What is the loss function used in support vector classifier optimization?
When $y_i(\beta_0 + \beta_1x_{i1} +...+ \beta_px_{ip})$ is greater than 1, the SVM loss is ______.
Match the following concepts with their descriptions:
What characterizes a support vector machine compared to a support vector classifier?
The radial kernel has a global behavior, meaning all training observations affect the predicted class label for a test observation.
What is the role of the parameter gamma (𝛾) in radial basis kernel?
Support vector machines utilize kernels to compute the __________ needed for different dimensions.
Match the kernel types with their characteristics:
Which of the following best describes the polynomial kernel?
As the distance between a test observation and a training observation increases, the contribution of that training observation to the prediction increases.
What happens to the predicted class label when the training observations are far from the test observation?
Flashcards
What is Support Vector Machine (SVM)?
A method for classification developed in the 1990s and known for its strong performance.
Maximal Margin Classifier
A simple classifier that aims to find a hyperplane that best separates data points into two classes.
Support Vector Classifier
An extension of the maximal margin classifier that can handle more complex datasets by allowing some misclassified points.
Hyperplane
Hyperplane Equation
Normal Vector
Data Matrix X
Class Labels (y)
What is a Hyperplane?
What is a Support Vector Classifier?
What is the Margin?
How is the Margin Width Calculated?
What are Slack Variables?
Drawback of Maximal Margin Classifier
What is the Regularization Parameter C?
What Happens When C is 0?
How do we use the Parameter C?
Support Vectors
Non-Support Vectors
Margin
Optimization Problem of Support Vector Classifier
Generalization Performance
Radial Kernel
How Radial Basis works?
Implicit Feature Space
Computational Advantage of Kernels
Kernel Function
Support Vector Machine with Non-linear Kernel
Polynomial Kernel
Gamma (γ) in Radial Kernel
Multi-class SVM
SVM Optimization
Hinge Loss
SVM vs. Logistic Regression
Kernel SVM for Non-Linear Data
What is the role of the regularization parameter C?
Why are support vector machines robust?
When can a linear boundary fail?
What is Feature Expansion?
How does feature expansion lead to non-linear boundaries?
Why is feature expansion important in the optimization problem?
How does feature expansion affect the shape of the decision boundary?
What is the benefit of using polynomial features?
Study Notes
Introduction to Machine Learning - AI 305: Support Vector Machines (SVM)
- Support Vector Machines (SVMs) are a classification approach developed in the 1990s, gaining popularity since.
- SVMs perform well in various settings and are considered strong "out-of-the-box" classifiers.
- The core concept is the maximal margin classifier.
- The support vector classifier extends the maximal margin classifier for broader datasets.
- Support Vector Machines (SVM) extend the support vector classifier further to accommodate non-linear class boundaries.
Contents
- Maximal Margin Classifier
- Support Vector Classifier
- Support Vector Machine
- SVM for Multiclass Problems
- SVM vs. Logistic Regression
Introduction - Continued
- Support Vector Machines (SVMs) are an approach for classification, originally developed in the computer science community during the 1990s.
- The popularity has grown since then.
- These approaches perform well across a range of contexts, frequently being regarded as one of the best "off-the-shelf" or pre-built classifiers.
- The approach handles two-class classification problems directly.
- Trying to find a plane that cleanly segregates the classes in feature space is the first step.
- If a separating plane can't be readily identified, two strategies are employed:
  - Refining the meaning of "separates."
  - Expanding and enriching the feature space to enable separation.
What is a Hyperplane?
- A hyperplane in p-dimensions is an affine subspace of dimension p−1.
- The generic equation for a hyperplane is: β0 + β1X1 + β2X2 + ... + βpXp = 0
- In two dimensions, a hyperplane is a line.
- In three dimensions, it's a plane.
- If β0 = 0, the hyperplane passes through the origin; otherwise, it does not.
- The vector β = (β1, β2, ..., βp) is called the normal vector; it points orthogonal to the hyperplane's surface.
Hyperplanes - Example
- Let the hyperplane be represented as: 1 + 2X1 + 3X2 = 0.
- The blue region represents the points where 1 +2X1 + 3X2 > 0.
- The purple region represents the points where 1 + 2X1 + 3X2 < 0.
Classification using a Separating Hyperplane
- Given an n×p data matrix X of n training observations in p-dimensional space, where these observations fall into two classes (y1, ..., yn ∈ {−1, +1}).
- The objective is to develop a classifier to categorize the test observation based on its feature measurements.
- A variety of techniques are used (logistic regression, classification trees, bagging, boosting).
- This approach introduces a novel method based on a separating hyperplane concept.
Separating Hyperplanes
- If f(x) = β0 + β1X1 + ... + βpXp, then f(x) > 0 for points on one side of the hyperplane and f(x) < 0 for points on the other side.
- If yi = +1 for blue points and yi = −1 for purple points, then yi·f(xi) > 0 for all i.
- f(x) = 0 defines a separating hyperplane.
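As a minimal sketch of this decision rule (reusing the example hyperplane 1 + 2X1 + 3X2 = 0 from above; the test points are made up for illustration), a point is classified by the sign of f(x):

```python
import numpy as np

# Example hyperplane from the notes: f(x) = 1 + 2*x1 + 3*x2
beta0 = 1.0
beta = np.array([2.0, 3.0])

def f(x):
    """Evaluate the linear function that defines the hyperplane."""
    return beta0 + beta @ x

# Points on opposite sides of the hyperplane get opposite signs.
x_blue = np.array([1.0, 1.0])      # f = 6  > 0 -> class +1
x_purple = np.array([-1.0, -1.0])  # f = -4 < 0 -> class -1
for x in (x_blue, x_purple):
    print(x, "-> class", int(np.sign(f(x))))
```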
Maximal Margin Classifier
- Among all separating hyperplanes, it seeks the one maximizing the gap (margin) between the two classes.
- The maximal margin hyperplane is the solution of an optimization problem that minimizes ‖β‖² subject to a set of constraints.
- The constraints require each observation to fall on the correct side of the hyperplane at a distance of at least M from it, where M is the margin width.
- This formulation can be resolved effectively as a convex quadratic program.
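For experimentation, one hedged way to approximate the maximal margin classifier on separable toy data is a linear SVC with a very large penalty (note that scikit-learn's C parameter penalizes violations, i.e. it is roughly the inverse of the violation budget C used later in these notes; the data below is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (toy data).
X = np.array([[1, 1], [2, 1], [1, 2],    # class -1
              [4, 4], [5, 4], [4, 5]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# kernel='linear' fits beta0 + beta^T x = 0; a huge penalty approximates a hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("beta:", clf.coef_[0], "beta0:", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
```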
Non-separable Data
- Data that cannot be separated by a linear boundary using the specified criterion.
- In this case there is no solution with a margin greater than zero; this is often the situation unless the number of observations n is smaller than the dimensionality p.
- The generalization of the maximal margin classifier, accommodating non-separable cases is called a support vector classifier, employing a "soft margin".
Noisy Data
- Data that is separable but includes noise, potentially leading to a less desirable solution for maximal-margin classifiers.
- For this case the support vector classifier maximizes a soft margin.
Drawbacks of Maximal Margin Classifiers
- Classifiers based on separating hyperplanes necessarily classify all training observations perfectly, which makes them highly sensitive to individual observations.
- The addition of a single new observation can dramatically alter the maximal margin hyperplane.
- A hyperplane with a very narrow margin is also undesirable: the small distance between observations and the hyperplane gives little confidence that those observations were classified correctly.
Support Vector Classifiers
- Given the limitations of the maximal margin classifier, support vector classifiers (also called soft margin classifiers) are introduced; they tolerate misclassification of a few observations in order to classify the remaining observations more reliably.
- They use less restrictive conditions on hyperplane selection, aiming to improve overall classification accuracy.
Support Vector Classifier - Continued
- The optimization problem is structured in such a way that only observations on or violating the margin affect the hyperplane.
- Points that lie directly on the margin, or on the "wrong" side are considered "support vectors" and control the margin boundaries.
- These “support vectors” significantly influence the SVM classifier.
Support Vector Classifier - Continued
- Example illustrating how a support vector classifier is fit to a small dataset; the dashed lines indicate the margins around the fitted hyperplane.
- Illustrates how points on or violating the margin determine the hyperplane's position in the plots; the points in the sample dataset that lie on or close to the margin are the support vectors.
Details of the Support Vector Classifier
- SVM classifiers are based on the side of a hyperplane on which a test observation falls.
- The hyperplane is carefully selected to correctly categorize the majority of training observations while tolerating a few possible misclassifications.
- The solution rests on an optimization problem.
- The problem involves a tuning parameter C, the margin width M (inversely related to the norm of the coefficient vector), and slack variables that allow some observations to be on the wrong side of the margin.
Details of the Support Vector Classifier - Continued
- C is a non-negative model tuning parameter.
- M is the width of the margin, which the optimization seeks to maximize.
- Slack variables allow individual observations to be on the wrong side of the margin or hyperplane.
Slack Variable
- The slack variable εi reflects the position of the ith observation relative to the margin and hyperplane.
- εi = 0 indicates the ith observation is on the correct side of the margin.
- εi > 0 indicates the ith observation is on the incorrect side of the margin (in violation); εi > 1 implies the ith observation is on the incorrect side of the hyperplane.
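As a sketch of how these slack values can be inspected empirically (the toy overlapping data and use of scikit-learn's linear SVC are assumptions, not part of the lecture), εi can be recovered as max(0, 1 − yi·f(xi)) from the fitted decision function:

```python
import numpy as np
from sklearn.svm import SVC

# Toy overlapping classes (invented for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.5, 1.0, (20, 2))])
y = np.r_[np.full(20, -1), np.full(20, 1)]

# Note: scikit-learn's C penalizes slack rather than budgeting it.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

f = clf.decision_function(X)     # f(x_i) = beta0 + beta^T x_i
eps = np.maximum(0, 1 - y * f)   # slack variables epsilon_i

print("correct side of the margin (eps == 0):", int(np.sum(eps == 0)))
print("violate the margin (0 < eps <= 1):    ", int(np.sum((eps > 0) & (eps <= 1))))
print("wrong side of the hyperplane (eps > 1):", int(np.sum(eps > 1)))
```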
Regularization Parameter C
- C limits the total amount of violations made to the margin or hyperplane.
- It acts as a constraint against a high number of misclassifications on training data.
- C=0 indicates a strict adherence to the margin (no violations allowed).
- A larger C yields a wider margin and tolerates more margin violations, which affects how confidently observations are categorized. Because the slack variables must sum to at most C, no more than C observations can be on the wrong side of the hyperplane.
The Regularization Parameter C - Continued
- Analyzing the effect of C on the support vector classifier's performance shows how varying C impacts the margin width and the number of support vectors.
- When C is large, almost all the training observations influence the hyperplane, producing a classifier with high bias but low variance; conversely, when C is small the hyperplane is determined by only a few observations, resulting in a low-bias but high-variance classifier.
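This trade-off can be seen by counting support vectors as the penalty varies; a hedged sketch on invented data (remember that scikit-learn's C is a slack penalty, so a small scikit-learn C plays the role of a large violation budget):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.r_[np.full(50, -1), np.full(50, 1)]

# Small sklearn C ~ large violation budget -> wide margin, many support vectors.
# Large sklearn C ~ small violation budget -> narrow margin, few support vectors.
for penalty in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    print(f"sklearn C={penalty:>6}: {clf.n_support_.sum()} support vectors")
```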
Robustness of Support Vector Classifiers
- The support vector classifier's decision rule relies on a potentially small subset of the training observations, known as the support vectors.
- This reliance on support vectors makes the decision rule robust to observations that lie far from the hyperplane.
- Note the contrast to other classification approaches (for example, linear discriminant analysis).
Linear Boundary Failures
- A linear boundary may fail to separate the observations in some cases, regardless of the value of C.
- Such data patterns call for non-linear decision boundaries, which can be obtained by applying non-linear transformations to the original feature space.
Feature Expansion
- Feature space is enlarged by introducing polynomial or other transformations.
- The support vector machine in this enlarged dimensional space may find a separating hyperplane that produces a non-linear decision boundary in the original input space (i.e. using quadratic, cubic, higher order-polynomial expansions).
- The optimization problem will be altered to reflect the higher dimensionality space.
Feature Expansion - Example
- This example demonstrates how enlarging feature space with specific transformations can produce a non-linear decision boundary.
- Illustrating practical application.
Cubic Polynomials
- Illustrates a cubic polynomial basis expansion from 2 to 9 variables.
- Applying this transformation to the plotted sample dataset yields a support vector classifier solution to the non-linear separation problem.
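A minimal sketch of explicit feature expansion (the toy data and pipeline below are illustrative assumptions): a degree-3 polynomial expansion turns the 2 original features into 9, and a linear support vector classifier fit in that enlarged space produces a non-linear boundary in the original feature space.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Toy data with a circular class boundary that no straight line can separate.
rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, (200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

# Degree-3 expansion: X1, X2 -> X1, X2, X1^2, X1*X2, X2^2, X1^3, ... (9 features).
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=10000),
)
model.fit(X, y)
print("training accuracy with cubic expansion:", model.score(X, y))
```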
SVMs: More Than Two Classes
- Classic Support Vector Machine implementations work for only two classes; this section discusses multi-class expansions.
- The "one-versus-all" (OVA) approach fits individual classifiers (one vs all other classes) resulting K classifiers.
- The class assignment is determined based on the maximum value amongst all these classifiers for a given observation.
- The "one-versus-one" (OVO) approach fits all pairwise combinations yielding K(K−1)/2 classifiers; the class with the most winning pairwise competitions is chosen for the input example.
SVM vs. Logistic Regression
- The optimization problem in SVMs can be rephrased using a "hinge" loss function that closely resembles the "loss" function used in logistic regression (negative log-likelihood).
- The loss functions of both approaches have notable similarities in their respective shapes.
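To make the comparison concrete, here is a small sketch evaluating both losses as a function of the quantity y·f(x) (the grid of values is arbitrary):

```python
import numpy as np

# m = y * f(x): large and positive when an observation is far on the correct side.
m = np.linspace(-2, 3, 11)

hinge = np.maximum(0, 1 - m)      # SVM hinge loss: zero once y*f(x) > 1
logistic = np.log1p(np.exp(-m))   # logistic regression negative log-likelihood

for mi, h, l in zip(m, hinge, logistic):
    print(f"y*f(x) = {mi:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```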
Which to Use: SVM or Logistic Regression?
- SVMs outperform logistic regression when the classes are clearly separable and a linear boundary can readily be identified.
- In cases where the classes are not well separated, logistic regression with a regularization penalty and the support vector classifier generally yield similar outcomes.
- When estimating probabilities, logistic regression is the more appropriate choice.
- In cases where non-linear boundaries or high dimensionality are required, kernel SVMs may be prioritized due to their adaptability; however, they typically require more computations.
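As a hedged illustration of the kernel option mentioned above (the circular toy data and the gamma grid are assumptions for demonstration), an RBF-kernel SVM fits a non-linear boundary directly, with gamma controlling how local each training observation's influence is:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, (200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # circular class boundary

# Larger gamma -> each training point influences only nearby test points (more local,
# more flexible); smaller gamma -> smoother, more global decision boundary.
for gamma in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma = {gamma:>4}: training accuracy = {clf.score(X, y):.3f}")
```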
End
Description
Test your knowledge on support vector classifiers and their components. This quiz covers topics like maximal margin classifiers, hyperplanes, slack variables, and observations in relation to the margin. Challenge yourself with these essential concepts in machine learning.