Supervised Learning - Naive Bayes & Support Vector Machines
Summary
This document details supervised learning methods, focusing on Naive Bayes and Support Vector Machines (SVMs). It explains conditional probabilities, Bayes' rule, and examples using marbles. The document also touches on kernel methods.
MInDS @ Mines

In this lecture we continue looking at supervised learning methods. We'll cover Naive Bayes and Support Vector Machines (SVMs), which are both popular methods for a variety of applications. We also cover the kernel method, which is effective when applied to data with particular patterns. SVMs coupled with the kernel method are powerful.

Naive Bayes

Naive Bayes is a modeling method that learns from the conditional probabilities that exist between feature values and the resulting target value. Before looking at how that works, we should first cover some background on probabilities.

A probability is a value between zero and one that denotes the chance of a certain event happening. For example, a fair coin lands heads or tails with equal chance and there is no other option, so the probability of heads is 0.5 and the probability of tails is 0.5. The probabilities of mutually exclusive events that cover all possible outcomes, where only one of them can occur at a given point, sum to 1.

Now let's look at conditional dependence. Conditional dependence means that the probability of an event changes based on another event. Independence means that the probability of an event is not affected by the state of another event. For a fair coin, each toss is independent of the previous ones.

Let's go over an example using a bag of marbles with 8 blue marbles and 2 orange marbles, from which we blindly select one marble. The probability of selecting a blue marble is P(Select = blue) = 8/(8+2) = 0.8, and the probability of selecting an orange marble is P(Select = orange) = 2/(8+2) = 0.2. If after selecting a marble we return it to the bag, then the next selection is independent of this one and uses the same probabilities. If we select a marble and remove it from the bag, then the probabilities have changed.

[Figure 1: Marbles problem when returning the selected marbles.]
[Figure 2: Marbles problem with removal of the selected marbles.]

We represent a conditional probability using the | symbol. For example, the probability of selecting a blue marble given that we previously selected a blue marble is P(Select = blue | Previous = blue). If we are removing marbles, then P(Select = blue | Previous = blue) = 7/(7+2) ≈ 0.78. If we are returning marbles, then P(Select = blue | Previous = blue) = 8/(8+2) = 0.8 = P(Select = blue). When two events A and B are independent, their conditional probabilities are equal to their original probabilities, P(A|B) = P(A).

If we have an understanding of an event A's occurrence, the probability that it occurs, P(A), is the prior. If we determine the existence of another event, B, that may impact A, the conditional probability of A given B, P(A|B), is the posterior.

To simplify calculations with probabilities, we can use Bayes' Rule,

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.    (1)
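To make equation (1) concrete, here is a small numerical sketch of the marble example in Python, for the case where the first marble is not returned to the bag. The variable names are illustrative and not part of the lecture; this is only a sanity check of the numbers above.

```python
# Worked example of Bayes' rule (equation 1) with 8 blue and 2 orange marbles.

p_prev_blue = 8 / 10          # P(Previous = blue), the prior
p_prev_orange = 2 / 10        # P(Previous = orange)

# Likelihoods when the first marble is NOT returned to the bag:
p_blue_given_prev_blue = 7 / 9    # P(Select = blue | Previous = blue)
p_blue_given_prev_orange = 8 / 9  # P(Select = blue | Previous = orange)

# Total probability of selecting blue on the second draw.
p_blue = (p_blue_given_prev_blue * p_prev_blue
          + p_blue_given_prev_orange * p_prev_orange)

# Bayes' rule: the posterior P(Previous = blue | Select = blue).
posterior = p_blue_given_prev_blue * p_prev_blue / p_blue

print(f"P(Select = blue) = {p_blue:.3f}")                       # 0.800
print(f"P(Previous = blue | Select = blue) = {posterior:.3f}")  # 0.778
```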
Bayes' Rule allows us to mathematically represent conditional dependence in data. We can use it to understand how certain features affect a target. The Naive Bayes method uses this concept to create a model for our data: it calculates the probability of an input being a particular class given the probabilities of all the inputs being each of the classes.

We will examine a classification problem with k classes where each data item has n features. If we apply Bayes' Rule to this problem we get,

    P(\mathrm{Class} = i \mid x_1, \ldots, x_n) = \frac{P(\mathrm{Class} = i)\, P(x_1, \ldots, x_n \mid \mathrm{Class} = i)}{P(x_1, \ldots, x_n)}.    (2)

If we assume that all the features are independent, then the probability of a particular feature value given the class and all the other features is equal to its probability given the class alone,

    P(x_a \mid \mathrm{Class} = i, x_1, \ldots, x_{a-1}, x_{a+1}, \ldots, x_n) = P(x_a \mid \mathrm{Class} = i).

This assumption is what makes the Naive Bayes method naive. We can then update the probability of a class given an input as,

    P(\mathrm{Class} = i \mid x_1, \ldots, x_n) = \frac{P(\mathrm{Class} = i) \prod_{a=1}^{n} P(x_a \mid \mathrm{Class} = i)}{P(x_1, \ldots, x_n)}.    (3)

From our training data, we can calculate the probability of a class as the proportion of the data that has that class. For the probability of a feature given the class, we can use a variety of methods. The probability of a given feature vector is a constant. From there, we can determine the appropriate class for a particular input vector as,

    \mathrm{Class} = \arg\max_i \; P(\mathrm{Class} = i) \prod_{a=1}^{n} P(x_a \mid \mathrm{Class} = i).    (4)

Three common variations of Naive Bayes (NB) are Gaussian NB, Multinomial NB, and Bernoulli NB. Each uses the respective distribution to estimate the probability of a feature given the class:

Gaussian:
    P(x_a \mid \mathrm{Class} = i) = \frac{1}{\sqrt{2\pi\sigma_{a,i}^2}} \exp\!\left(-\frac{(x_a - \mu_{a,i})^2}{2\sigma_{a,i}^2}\right),
where \sigma_{a,i} and \mu_{a,i} are the standard deviation and mean of feature a for class i.

Multinomial:
    P(x_a \mid \mathrm{Class} = i) = \frac{\sum_{x \in I} x_a + \alpha}{\sum_{a=1}^{n} \sum_{x \in I} x_a + \alpha n},
where I is the set of feature vectors of class i, and \alpha is a smoothing parameter.

Bernoulli:
    P(x_a \mid \mathrm{Class} = i) = \frac{\sum_{x \in I} x_a}{m_i}\, x_a + \left(1 - \frac{\sum_{x \in I} x_a}{m_i}\right)(1 - x_a),
where all n features are boolean values, I is the set of feature vectors of class i, and m_i is the number of vectors in class i.
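To make equations (3) and (4) concrete, here is a minimal Gaussian Naive Bayes sketch in NumPy. The constant denominator P(x_1, ..., x_n) is dropped and the computation is done in log space for numerical stability. The class name, its structure, and the synthetic data are illustrative assumptions rather than part of the lecture; in practice a library implementation such as scikit-learn's GaussianNB would usually be used.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian NB sketch: per-class, per-feature normal
    distributions and the arg max rule of equation (4)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # P(Class = i): proportion of training data with that class.
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        # mu_{a,i} and sigma_{a,i}: mean and std of feature a for class i.
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.stds_ = np.array([X[y == c].std(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # Log space: log P(Class = i) + sum_a log P(x_a | Class = i);
        # the constant denominator P(x_1, ..., x_n) is omitted.
        log_posteriors = []
        for prior, mu, sigma in zip(self.priors_, self.means_, self.stds_):
            log_lik = (-0.5 * np.log(2 * np.pi * sigma ** 2)
                       - (X - mu) ** 2 / (2 * sigma ** 2))
            log_posteriors.append(np.log(prior) + log_lik.sum(axis=1))
        return self.classes_[np.argmax(np.column_stack(log_posteriors), axis=1)]

# Tiny usage example on synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = GaussianNaiveBayes().fit(X, y)
print(model.predict(np.array([[0.0, 0.0], [3.0, 3.0]])))  # expected: [0 1]
```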
Support Vector Machines

Support Vector Machines (SVMs) are another popular method used in machine learning. For a 2-dimensional, 2-class classification problem, SVMs find the line that maximizes the separation between the points of each class. The distance between the two nearest points classified one way or the other is the margin. Many lines can separate the two sets of points, but the goal is to find the line with the largest margin. The points closest to the separating line are the support vectors.

In the generalized case of n-dimensional problems, SVMs find the hyperplane that maximizes the separation between the points of each class. A hyperplane in an n-dimensional space is a subspace of n − 1 dimensions: in 2D a hyperplane is a line, in 3D it is a plane, and in 4D it is a volume.

The output from an SVM is the label of a particular point, given by w^T x + b. We also get three lines: one where the label is unknown (0), equidistant from the two classes, and two lines where we know the labels are one class or the other (1 and −1). The three lines are w^T x + b = 0, w^T x + b = 1, and w^T x + b = −1. If we select one support vector from each class, x_+ and x_-, the margin can be computed from these two points.

[Figure 3: Geometric display of SVM's separating line. m is the margin.]

We can state the two support vectors as,

    w^T x_+ + b = 1, \qquad w^T x_- + b = -1.    (5)

Now let's calculate the margin's distance. Recall that the distance \|d\|_2 between a point x_i and a hyperplane w^T x + b = 0 is

    \|d\|_2 = \frac{|w^T x_i + b|}{\|w\|_2}.    (6)

The margin is therefore the distance from x_+ to the hyperplane plus the distance from x_- to the hyperplane,

    \mathrm{margin} = \frac{|w^T x_+ + b|}{\|w\|_2} + \frac{|w^T x_- + b|}{\|w\|_2} = \frac{2}{\|w\|_2}.    (7)

We want to maximize the margin, but since we prefer to write objective functions as minimization problems, we can minimize its reciprocal. It also turns out that if we square the ℓ2-norm, we can solve the problem using straightforward quadratic programming methods. The basic SVM objective function is,

    \min_w \; \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 \;\; \forall i,    (8)

where y_i ∈ {+1, −1} determines the class point i belongs to. Because SVMs focus on the extreme points that lie on the class boundary, they don't necessarily need a large amount of data.

With this formulation, the margin must separate all points and allows no room for error. Such a hard margin can lead to overfitting the data. If we want the margin to allow some error, it becomes a soft margin and we use a penalty parameter, C, to determine the softness of the margin. The idea is that we allow the model a certain amount of error based on which points it classifies incorrectly and how wrong those points are. We determine the error for a point using its distance to its class line, and we use C as a weight multiplying the sum of the distances of all misclassified points. A higher value of C results in a "harder" margin and a lower value results in a "softer" margin. We update the SVM objective function to,

    \min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i,    (9)

where \xi_i is the error for point i, and C is the penalty tuning parameter.

[Figure 4: Example of soft margin SVM allowing misclassification, which results in better generalization.]

Multi-Class

To apply SVMs to more than 2 classes, one approach is to train k SVMs, where k is the number of classes. We train the i-th SVM to classify whether a point is of class i or of any of the other classes. We can then use all the hyperplanes of the trained models to classify any given point. This approach is called one versus all (or one versus rest) and can be used with any binary classifier, not only SVMs. In each model, we likely have an imbalanced dataset since we are training one class against all others.

Another approach is to train k(k−1)/2 classifiers, where we train each classifier on data from two classes and ignore the rest. This approach is called one vs one and can also be used with any other binary classifier. One vs all trains fewer classifiers on imbalanced data, while one vs one trains more classifiers on smaller subsets of data that are likely more balanced.

Regression

To use SVMs for regression we maintain the same objective function but change the constraints to match a continuous value. y_i is no longer the label but the exact value of the target, and we no longer use 1 in the inequality constraint but instead a limit on the acceptable error, ϵ. The objective function for support vector machines used for regression is,

    \min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad |y_i - (\langle w, x_i \rangle + b)| \le \epsilon + \xi_i, \;\; \epsilon, \xi_i \ge 0 \;\; \forall i,    (10)

where ϵ is the acceptable error boundary.

[Figure 5: Visualization of SVM when used for regression.]
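To tie these pieces together, here is a hedged scikit-learn sketch of the soft margin from equation (9) (the C parameter), an explicit one-vs-rest multi-class wrapper, and ϵ-regression from equation (10). The synthetic data and parameter values are illustrative assumptions, not examples from the lecture.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# --- Soft-margin classification: C controls how "hard" the margin is ---
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
for C in (0.01, 100.0):           # small C = softer margin, large C = harder
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors")

# The decision function is w^T x + b; its sign gives the predicted class.
print(clf.decision_function(X[:3]))

# --- Multi-class via one-vs-rest: trains k binary SVMs ---
y3 = np.repeat([0, 1, 2], 30)
X3 = np.vstack([rng.normal(c * 3, 1, (30, 2)) for c in range(3)])
ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X3, y3)
print(len(ovr.estimators_))       # 3 classifiers, one per class

# --- Regression: epsilon is the acceptable error boundary ---
x = rng.uniform(-3, 3, (100, 1))
t = 2.0 * x.ravel() + rng.normal(0, 0.1, 100)
reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(x, t)
print(reg.predict([[1.0]]))       # roughly 2.0
```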
The Kernel Method

SVMs are powerful as they are, but one way to make them more effective for particular datasets is the kernel method. Everything we've discussed so far assumes that the data is linearly separable (classification) or linearly representable (regression). If it is not, we can apply the kernel method to transform it into a new space that is linearly separable or representable. For all the above objective functions, we replace x with ϕ(x), where ϕ(x) defines a transformation of any given point into the new space.

A kernel function K(x_i, x_j) is one that transforms x_i and x_j into a new space by taking the inner product of their transformations,

    K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = \phi(x_i)^T \phi(x_j).    (11)

Popular kernel functions include:

Polynomial, with degree d:
    K(x_i, x_j) = (x_i^T x_j + 1)^d,

RBF, with width σ:
    K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),

Sigmoid, with parameters κ and θ:
    K(x_i, x_j) = \tanh(\kappa\, x_i^T x_j + \theta).

A simple example is data that appears as 2 concentric circles, where the inner circle near the origin is one class and the outer circle is the other. These two classes are not linearly separable in the original space. We can apply a transformation to the data so that the result becomes linearly separable.

[Figure 6: Examples of kernel methods: polynomial kernel (n=2) and RBF kernel (n=2).]
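The concentric circles example can be reproduced with a short, hedged sketch: a linear SVM fails on the raw data, while an RBF-kernel SVM (one of the kernels listed above) separates it almost perfectly. The dataset parameters below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: the inner circle is one class, the outer the other.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A linear SVM cannot separate the circles in the original space...
linear_acc = SVC(kernel="linear", C=1.0).fit(X, y).score(X, y)

# ...but the RBF kernel implicitly maps the data to a space where a
# separating hyperplane exists; phi(x) is never computed explicitly,
# only the inner products of equation (11).
rbf_acc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y).score(X, y)

print(f"linear kernel accuracy: {linear_acc:.2f}")  # around 0.5
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")     # close to 1.0
```

For this particular dataset an explicit ϕ could simply append the squared radius x_1^2 + x_2^2 as a third feature, which makes the two circles linearly separable; the kernel trick achieves the same effect without ever computing ϕ(x).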