Linear Classification - Machine Learning PDF
Document Details
Brno University of Technology
2024
Tomáš Vičar
Summary
This presentation covers linear classification techniques in machine learning. It details different approaches, including cost functions (such as mean squared error) and algorithms such as the perceptron and support vector machines. The presentation is suitable for undergraduate students.
Full Transcript
Linear classification
Machine learning
Ing. Tomáš Vičar, Ph.D., Department of Biomedical Engineering, FEEC, Brno University of Technology

Introduction
Linear classifiers are intended for linearly separable cases. They are simple and usually easy to compute (train). They are attractive candidates also for cases that are not strictly linearly separable, at least as initial, trial classifiers. They offer simple generalization, which can be acceptable in some cases, and there is little danger of overfitting.

Introduction
Building a linear classifier means looking for a linear hyperplane (more generally, a separating/decision hyperplane/hypersurface or discriminant function) that separates two or more classes by a function that is linear in the components of the feature vector x. In the 2D case the linear function is a line. What is the best "position" of this line? We need some criterion (i.e. a cost function) in order to optimize the position.

Linear hyperplane
Consider now a two-class problem. The separating hyperplane in the n-dimensional feature space can be described mathematically by the equation

g(x) = w'^T x' + b = 0,

where w' = [w'_1, w'_2, ..., w'_n] is a weight vector and b is a threshold. For a 2D feature vector we can write the equation of a line:

g(x_1, x_2) = w'_1 x'_1 + w'_2 x'_2 + b = 0.

Compare it with the equation of a line in 2D: a.x + b.y + c = 0, or y = -(a/b).x - c/b.

Linear hyperplane
Consider two points x'_1 and x'_2 on the line; then the following holds:

0 = w'^T x'_1 + b = w'^T x'_2 + b, therefore w'^T (x'_1 - x'_2) = 0.

Since the vector (x'_1 - x'_2) obviously lies on the decision line, w' is orthogonal to this line and is therefore its normal vector. It can be shown (see a math book) that the distance of the line from the origin is |b| / ||w'|| and the distance of a point x from the line is |g(x)| / ||w'||. It also holds that g(x) takes positive values on one side of the plane and negative values on the other. Remark: g(x) represents the function of a line (or generally some hyperplane), e.g. g(x): w_1 x_1 + w_2 x_2 + b = 0.
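The geometric relations above can be checked numerically. A minimal Python/numpy sketch follows; the weight vector, bias and test point are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Illustrative 2D hyperplane (a line): g(x) = w^T x + b = 0
w = np.array([2.0, 1.0])   # normal vector of the line (assumed values)
b = -4.0                   # threshold / bias (assumed value)

def g(x):
    """Value of the discriminant function for a point x."""
    return w @ x + b

# Distance of the line from the origin: |b| / ||w||
dist_origin = abs(b) / np.linalg.norm(w)

# Signed distance of a point to the line: g(x) / ||w||
# (positive on one side of the line, negative on the other)
x = np.array([3.0, 2.0])
signed_dist = g(x) / np.linalg.norm(w)

print(f"g(x) = {g(x):.2f}, distance from origin = {dist_origin:.2f}, "
      f"signed distance of x = {signed_dist:.2f}")
```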
Cost function
A cost function J(x, y, w) quantifies how unhappy you would be if you used w' (including b) to classify vector x when the correct output is y. Therefore, we want to minimize J(.). There are different versions of the cost function. We will closely examine these:
- Mean squared error
- Perceptron
- Fisher linear discriminant (LDA – Dimensionality reduction lecture)
- Support vector
- Logistic regression (Probabilistic models lecture)

Mean Square Error Cost Function

Mean square error (MSE) method
Exact linear separation of classes is not very common in practice - the samples from different classes may slightly overlap. Even in these cases it is still worth using a linear classifier, but we have to live with some error. We will attempt to design a linear classifier so that its desired output is again ±1, depending on the class membership of the input vector. Due to possible classification errors, the predicted output will not always be equal to the desired one. Given an input vector x, the output of the classifier is w^T x (a threshold can be added to obtain a ±1 output). The weight vector w is then computed so as to minimize the mean square error (MSE) between the desired output y_i and the true output:

J(w) = sum_i (y_i - w^T x_i)^2.

MSE method
It might seem rather odd to compare "some distance" with labels, but it works. Let's elaborate on this (here ±k stands for the classifier output w^T x, with k > 0):
- Class with label +1, i.e. (+1 - w^T x)^2: correct classification gives (1 - (+k))^2, which is small (↓); incorrect classification gives (1 - (-k))^2, which is large (↑).
- Class with label -1, i.e. (-1 - w^T x)^2: correct classification gives (-1 - (-k))^2, which is small (↓); incorrect classification gives (-1 - (+k))^2, which is large (↑).

MSE method
The criterion can also be written in an alternative way, with matrix notation and the L2 norm ||.||_2:

J(w) = ||y - Xw||_2^2,

where y is the vector of labels and X is a matrix of (augmented) input vectors - each row is an augmented input vector. Again, for gradient-based minimization we need to compute the gradient:

∇_w J(w) = -2 X^T (y - Xw).

MSE method
Here we can set the gradient to zero:

X^T (y - Xw) = 0.

We are looking for w:

X^T X w = X^T y,

and finally:

w = (X^T X)^{-1} X^T y.

This is a direct solution, which needs the inverse of the matrix X^T X; this matrix is square and often nonsingular.

MSE method - example
Suppose we have four samples. Class +1: (1,2)^T, (2,0)^T. Class -1: (3,1)^T, (2,3)^T. Let's create the (augmented) matrix X and prepare the vector y.

MSE method - example
Now compute (X^T X)^{-1} X^T, and the weight vector is computed as w = (X^T X)^{-1} X^T y (the numerical values are worked out in the sketch below).

MSE method
The simple inverse-matrix technique is sometimes not possible. Therefore an iterative gradient descent scheme is applied, similarly to the perceptron case:

w_{k+1} = w_k + ρ_k (y_k - w_k^T x_k) x_k.

Note that in this case the index k for updating the parameter w coincides with the index of the current input sample x_k. So this is a time-adaptive scheme, which adapts to the solution as successive samples become available to the system. This algorithm is known as least mean squares (LMS) or the Widrow-Hoff algorithm. This method can of course be seen as training a neuron without a (generally nonlinear) activation function, which is known as adaline (adaptive linear element).

Comparison LMS - perceptron method

MSE + L2 regularization
MSE can become unstable (during training the solution oscillates around the optimum). Regularization is a general method to avoid these problems by imposing additional constraints on the weight vector w. A common approach is to make sure that the weights are, on average, small in magnitude; this is also referred to as shrinkage. The regularized version is:

J(w) = ||y - Xw||_2^2 + λ ||w||_2^2,

where ||w||_2^2 is the squared norm of w, or equivalently the dot product w^T w, and λ is a scalar value determining the level of regularization. This L2-norm regularization is often called ridge regularization.

MSE + L2 regularization
This form of regularization still has a closed-form solution:

w = (X^T X + λI)^{-1} X^T y,

where I is the identity matrix. If we compare this equation with the equation for w without regularization, we see that here we add λ to the diagonal of X^T X, which is a known trick to improve the stability of matrix inversion. This form is also known as ridge classification or regression.

MSE + L1 regularization
Another way of regularization uses the absolute values of the w coefficients:

J(w) = ||y - Xw||_2^2 + λ ||w||_1,

where we sum the absolute values of the weights. The result is that some weights are shrunk and others are set to zero. In other words, some features are made less important or totally unimportant for our classification task. This regularization is called lasso ("least absolute shrinkage and selection operator"). It is rather sensitive to the regularization parameter, which is usually set by cross-validation. More information about ridge and lasso regularization: https://www.youtube.com/watch?v=sO4ZirJh9ds
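The direct and ridge solutions above can be reproduced with a few lines of numpy. A hedged sketch on the four samples from the example; the ordering of the augmented column of ones and the value of λ are assumptions made for illustration:

```python
import numpy as np

# Four samples from the example: class +1: (1,2), (2,0); class -1: (3,1), (2,3)
# Augmented input vectors (a leading column of ones for the bias term is assumed).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 2.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Plain MSE (least squares): w = (X^T X)^{-1} X^T y
w_mse = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge (L2-regularized) solution: w = (X^T X + lambda*I)^{-1} X^T y
lam = 0.1                                  # illustrative regularization strength
I = np.eye(X.shape[1])
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print("MSE weights:  ", w_mse)
print("Ridge weights:", w_ridge)

# Classification of a new (augmented) sample: sign of w^T x
x_new = np.array([1.0, 1.5, 1.0])
print("predicted class:", np.sign(w_mse @ x_new))
```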
MSE with one neuron
With this notation we can simply write the separating hyperplane as g(x) = w^T x = 0. So w_0 is included in the vector w and we must also include a 1 in the vector x. These vectors are sometimes called augmented vectors. Stochastic gradient descent leads to this update:

w ← w + ρ (y_i - w^T x_i) x_i.

MSE with one neuron
This shows us that the weights will be a linear combination of samples – the dual representation.

Perceptron Cost Function

Perceptron
The simple mathematical description of the linear hyperplane is equivalent to the perceptron model. Now we will describe training based on the perceptron algorithm. Before that we slightly modify the notation: with this notation we can simply write the separating hyperplane as g(x) = w^T x = 0. So w_0 is included in the vector w and we must also include a 1 in the vector x. These vectors are sometimes called augmented vectors.

Perceptron algorithm
The cost function in the classical perceptron algorithm considers only misclassified samples. We introduce labels for the two-class problem, with classes ω1 and ω2. The labels are l_x = ±1. Let's consider the weights of a correctly trained classifier w such that:

w^T x > 0 ∀ x ∈ ω1 (we set the labels l_x = +1 for this class),
w^T x < 0 ∀ x ∈ ω2 (we set the labels l_x = -1 for this class).

Shortly: if the sample belongs to class ω1 then the output of the classifier is positive, and vice versa.

Perceptron method
The criterion can then be formulated as:

J_P(w) = Σ_{x ∈ Y_miss} (-l_x w^T x),

where Y_miss is the set of samples misclassified by w. If no samples are misclassified, Y_miss is empty and we define J_P to be zero. Note that J_P:
- is always positive,
- geometrically, is proportional to the sum of distances of the misclassified samples from the decision boundary,
- is piecewise linear.

Perceptron method
Now, if we want to apply the gradient descent method for minimizing this function, we must compute the gradient of J_P with respect to w:

∇_w J_P(w) = Σ_{x ∈ Y_miss} (-l_x x).

Let's look closer at the 2D case and take the derivatives with respect to each component of w. Remember that g(x_1, x_2) = w_1 x_1 + w_2 x_2 + b = 0.

Perceptron method
Gradient descent takes the form:

w(t+1) = w(t) - ρ_t ∇_w J_P(w(t)) = w(t) + ρ_t Σ_{x ∈ Y_miss} l_x x.

The learning rate ρ_t may be constant or may depend on the iteration t. We will omit the proof of convergence of this method, but it converges only for linearly separable cases. Again, the weights are a linear combination of samples – the dual representation.

Perceptron method
Practical limitations of this approach:
- Convergence may be slow.
- If the data are not separable, the algorithm will not converge.
- We will only know that the samples are separable once the algorithm converges.
- The solution is in general not unique and depends on the initialization of w and on the learning rate ρ_t.

Modifications of the perceptron algorithm
There are various modifications of this basic perceptron algorithm, e.g. the Pocket Algorithm (see the sketch below):
1. Run the training and compute w_t.
2. Test the classifier - evaluate the classification error.
3. If this error is smaller than the previous one, then store (in the pocket) the current w_t.
4. In this way, we keep the best solution found so far, which works also for a non-separable set.
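A small sketch of the perceptron update with the pocket modification described above. The toy data, the learning rate ρ and the iteration limit are assumptions; the sample-by-sample (online) variant of the update w ← w + ρ·l_x·x for misclassified samples is used here:

```python
import numpy as np

# Toy 2D data with labels +-1 (assumed for illustration); augmented with a leading 1.
X = np.array([[1, 0.5, 2.0], [1, 1.0, 2.5], [1, 2.5, 0.5], [1, 3.0, 1.0]])
y = np.array([1, 1, -1, -1])

def pocket_perceptron(X, y, rho=1.0, n_iter=100):
    w = np.zeros(X.shape[1])           # initial weights (augmented: bias first)
    best_w, best_err = w.copy(), np.inf
    for _ in range(n_iter):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:     # misclassified sample
                w = w + rho * yi * xi  # perceptron update
        err = np.sum(np.sign(X @ w) != y)
        if err < best_err:             # pocket step: keep the best weights so far
            best_err, best_w = err, w.copy()
        if err == 0:
            break
    return best_w, best_err

w, err = pocket_perceptron(X, y)
print("weights:", w, "training errors:", err)
```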
Fisher Linear Discriminant Classifier

Fisher's Linear Discriminant
Linear discriminant analysis (LDA), discussed in the Feature reduction part, can also be used for classification. Just set some threshold (e.g. compute the ROC curve and set the threshold according to the desired classifier properties). Fisher's Linear Discriminant is, in essence, a technique for dimensionality reduction, not a discriminant. To find the optimal direction onto which to project the input data, Fisher needs supervised data. Given a dataset with D dimensions, we can project it down to D-1, D-2, ... up to 1 dimension.

Support vector classification (Support vector machine)

Support vector linear classification
Suppose a linearly separable two-class case. The samples in these classes are separated (in feature space) by "some space", sometimes called a buffer zone. But there can be more buffer zones; the figure shows two options. Which one would you select? Implementation: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

Linear SVM
Again, the goal is to find a hyperplane g(x) = w^T x + b = 0 that classifies correctly all training vectors. Now our task is to find the direction of the separating hyperplane that gives us the maximum possible margin (or buffer zone). From the geometry above we know that the distance of a point x' to the line (or hyperplane) is |g(x')| / ||w||. Note: in SVM it is better to work with w and b separately - we will now consider the non-augmented vectors, which makes the analysis easier.

Linear SVM
We can now scale w so that the value of g(x') at the nearest points in both classes is equal to +1 or -1, respectively (i.e. g(x') = ±1). This is not a restriction; we just simplify the next analysis. This is then equivalent to:
1. The margin size (buffer zone width) is 2 / ||w||.
2. Requiring that all samples lie on or outside the margin: w^T x_i + b ≥ +1 for samples of the first class and w^T x_i + b ≤ -1 for samples of the second class.
These two conditions can be written as one (with the help of the labels y_i): y_i (w^T x_i + b) ≥ 1. In SVM the labels are typically denoted y.

Linear SVM
Suppose that we have already trained the classifier, so we have w and b. Then this inequality holds for all samples. But in which cases does it become an equality? The equality is valid for the samples lying on the margin of the trained classifier. We call these samples support vectors. The number of support vectors depends on the "spatial distribution" of the samples in the training set.

Linear SVM
Because the margin width is 2 / ||w||, if we minimize ||w||, we also maximize the margin (i.e. the buffer zone). The principle of this method can be summarized as (using the square of ||w||):

minimize J(w) = (1/2) ||w||^2  subject to  y_i (w^T x_i + b) ≥ 1 for all i.

This is a non-linear (quadratic) optimization task subject to a set of linear inequality constraints. It is an example of a standard "quadratic programming" (QP) task.

Dual form of linear SVM
This objective function with inequality constraints forms a so-called constrained optimization problem, which is often solved via Lagrange multipliers and the Lagrangian:

L(w, b, λ) = (1/2) ||w||^2 - Σ_i λ_i [ y_i (w^T x_i + b) - 1 ].

This Lagrangian has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables λ_i (λ_i ≥ 0). Differentiation of L(.) with respect to the primal variables gives these conditions:

∂L/∂w = 0 leads to w = Σ_i λ_i y_i x_i,
∂L/∂b = 0 leads to Σ_i λ_i y_i = 0.

Dual form of linear SVM
By substituting these conditions into the Lagrangian, we eliminate the primal variables and get the so-called dual optimization problem, which is usually solved in practice:

L(λ) = Σ_i λ_i - (1/2) Σ_i Σ_j λ_i λ_j y_i y_j x_i^T x_j.

So, we maximize this L(λ) with respect to λ, subject to λ_i ≥ 0 and Σ_i λ_i y_i = 0.

Dual form of linear SVM
The optimization can be solved by some appropriate "tool". Often a quadratic programming package is used. Because these packages usually search for minima (and not for maxima as in our case), we need to change the sign: instead of maximizing the expression above, we minimize

(1/2) Σ_i Σ_j λ_i λ_j y_i y_j (x_i^T x_j) - Σ_i λ_i,

where x_i^T x_j is the dot product between samples.

Dual form of linear SVM
Let's write the last expression in matrix form: minimize (1/2) λ^T H λ - 1^T λ, where H_ij = y_i y_j x_i^T x_j, subject to the linear constraint y^T λ = 0 and the range for the Lagrange multipliers λ_i ≥ 0. To remind: x are the features of our training set, y are the labels of our training set, p is the number of data points, and n is the number of features. Note that H is a p × p matrix.

Evaluation of the new sample
The discriminant function is given as

g(x) = Σ_{i ∈ SV} λ_i y_i x_i^T x + b.

Thus we take a new sample x and compute the value of the above formula. If it is positive, the sample belongs to the "+1 class"; if it is negative, it belongs to the "-1 class". For the computation we use all the support vectors (SV), i.e. the samples from the training set that define the margin.
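A hedged scikit-learn sketch of a (nearly) hard-margin linear SVM: it reads off the support vectors and the dual coefficients λ_i·y_i and recovers w and b from them, as derived above. The toy data and the large C value (used to approximate the hard margin) are assumptions for illustration; a library solver is used instead of writing out the QP from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data (assumed for illustration), labels +-1
X = np.array([[1.0, 2.0], [2.0, 2.5], [0.5, 1.5],
              [3.0, 0.5], [3.5, 1.0], [4.0, 0.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Support vectors: training samples with lambda_i > 0 (lying on the margin)
print("support vectors:\n", clf.support_vectors_)

# Primal parameters recovered from the dual solution:
# w = sum_i lambda_i * y_i * x_i   (sklearn stores lambda_i * y_i in dual_coef_)
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print("w =", w.ravel(), " b =", b)

# Discriminant for a new sample: g(x) = w^T x + b, class = sign(g(x))
x_new = np.array([[2.0, 1.0]])
print("g(x_new) =", (x_new @ w.T + b).ravel(), " class:", clf.predict(x_new))
```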
Linear SVM
Note the dimensions of the matrix on the previous slide - it depends on the number of samples. If we work with hundreds or a few thousands of samples we are more or less OK, but for a higher number of samples it becomes trickier. The result of the solution is the vector of Lagrange coefficients λ. Those x_i for which λ_i > 0 are called support vectors. We can then find the primal parameters w and b. For w, we can use the equation obtained from the stationarity condition:

w = Σ_i λ_i y_i x_i.

How to solve for b?

Linear SVM
We know that the support vectors lie at the margin, i.e. they are some of the training samples, and we also know the equation for the margins:

y_i (w^T x_i + b) = 1.

Therefore, we solve for b using this equation and any support vector (any x_i with λ_i > 0).

Linear SVM, non-separable case, soft margin
When the classes are not separable, the above setup is no longer valid. In the non-separable case we can have three kinds of vectors:
- Vectors that fall outside the buffer zone and satisfy y_i (w^T x_i + b) > 1.
- Vectors falling inside the zone that are correctly classified: 0 ≤ y_i (w^T x_i + b) < 1.
- Vectors that are misclassified: y_i (w^T x_i + b) < 0.

Soft margin, C-SVM
All three cases can be treated under a single type of constraint by introducing a new set of variables:

y_i (w^T x_i + b) ≥ 1 - ξ_i.

The first category of data corresponds to ξ_i = 0, the second category to 0 < ξ_i ≤ 1, and the third to ξ_i > 1. The variables ξ_i are known as slack variables. The optimization task is now more complex - to make the margin as large as possible and at the same time to keep the number of points with ξ_i > 1 as small as possible. So we can formulate the new cost function for minimization:

J(w, b, ξ) = (1/2) ||w||^2 + C Σ_i ξ_i,

subject to y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.

Soft margin, C-SVM
If we now compare the original SVM with this C-SVM and the solution based on the Lagrangian, we get almost the same solution (without proof): we maximize the same L(λ) with respect to λ, subject to Σ_i λ_i y_i = 0 and 0 ≤ λ_i ≤ C.

Soft margin, C-SVM
The value C is a constant set by the user, which defines the relative importance of the two terms in J_C-SVM. How to set C? Look again: if C is very high, a small error (i.e. a slack, or violation of the margin) gives a high value of the second term, which leads almost to the previous (hard-margin) solution. If C is very small, we can extend the margin and we get a lot of samples lying inside the margin. Based on C we can compromise; C is set based on intuition and trial and error.

Soft margin, C-SVM
Example of C values (figure).

ν-SVM
In the previous method we have seen that there is a close connection between the parameter C and the width of the margin obtained as a result of the optimization. However, the margin will depend on the number of samples and the amount of overlap. Since the margin is important (it is the essence of SVM, after all), we would like to introduce a parameter that is directly connected to the margin width. This margin can be simply defined by the pair of hyperplanes

w^T x + b = ±ϱ,

where ϱ ≥ 0 is a variable to be optimized. Under this new setting, the problem can be rewritten with a new cost function.

Soft margin, ν-SVM
This cost function is minimized:

J(w, b, ξ, ϱ) = (1/2) ||w||^2 - ν ϱ + (1/N) Σ_i ξ_i,

subject to y_i (w^T x_i + b) ≥ ϱ - ξ_i, ξ_i ≥ 0 and ϱ ≥ 0. Here ν represents an upper bound on the fraction of margin errors (and hence also on the fraction of training errors) and a lower bound on the fraction of support vectors; i.e. if ν = 0.05, at most 5% of the training samples will be misclassified (margin errors) and at least 5% of the training samples will be support vectors.
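A sketch of the influence of C and of the ν parameter, using scikit-learn's SVC and NuSVC as stand-in solvers; the overlapping toy data, the C grid and ν = 0.05 are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC, NuSVC

rng = np.random.default_rng(1)

# Two overlapping Gaussian classes (assumed toy data), labels +-1
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([2, 2], 1.0, (50, 2))])
y = np.hstack([np.full(50, -1), np.full(50, 1)])

# Small C -> wide margin, many samples inside it (many support vectors);
# large C -> margin violations are penalized heavily.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")

# nu-SVM: nu upper-bounds the fraction of margin errors and
# lower-bounds the fraction of support vectors.
nu_clf = NuSVC(kernel="linear", nu=0.05).fit(X, y)
print("nu=0.05: support vectors =", nu_clf.n_support_.sum())
```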
Non-linear (kernel) SVM

Beyond the linearity
Linear classifiers are simple but not always sufficient for the classification task. Here we introduce non-linear classifiers based on the kernel approach. Yes, the kernel trick, as in nonlinear PCA or regression, will be used again here. The main idea is simple: transform the data non-linearly into a feature space in which linear classification can be applied.

Beyond the linearity
We will now call the transformed space the feature space and the original space the input space. The approach is:
- transform the data from input space to feature space (typically with a higher number of dimensions),
- train and evaluate a linear classifier,
- in order to classify new samples, first transform the new sample into the feature space and then perform the classification.
But the big thing is that the feature space does not have to be explicitly constructed, because we can perform all the necessary operations in the input space.

Kernel trick
In many machine learning methods, a dot product over the data is applied. The result of the dot product of two vectors is a scalar value. So, consider the dot product of two vectors (in 2 dimensions):

x^T z = x_1 z_1 + x_2 z_2.

Now, consider the dot product after transforming the features into a higher dimension by some mapping φ (from the previous example): φ(x)^T φ(z).

Kernel trick
The disadvantage of this approach is that we first need to transform the whole feature set into the high dimension. And because we typically have many features (not only two), this creates a very high-dimensional space (p >> m), so it is computationally demanding and the memory requirements also increase. Fortunately, we can do it in a smarter way: the same dot product in feature space can be computed directly from the input vectors. So, we are able to compute the dot product without explicitly going to the higher dimension!

Kernel trick
To conclude, the kernel trick can be written as:

k(x, z) = φ(x)^T φ(z).

This is sometimes called a generalized dot product. Examples of various kernels are on the next slides. https://www.researchgate.net/figure/Figure-B16-Non-linear-classifier-using-Kernel-trick_fig13_324250451

Beyond the linearity
Recall that the linear SVM maximizes the term

L(λ) = Σ_i λ_i - (1/2) Σ_i Σ_j λ_i λ_j y_i y_j x_i^T x_j.

The important part is the inner product x_i^T x_j (also written as <x_i, x_j>). As we have seen, we can apply the "kernel trick" to this inner product: we apply the kernel k(.,.) to each matrix element (i.e. each dot product).

Evaluation of the new sample
The discriminant function is given as

g(x) = Σ_{i ∈ SV} λ_i y_i k(x_i, x) + b.

Thus we take a new sample x and compute the value of the above formula. If it is positive, the sample belongs to the "+1 class"; if it is negative, it belongs to the "-1 class". For the computation we use all the support vectors (SV), i.e. the samples from the training set that define the margin.
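A minimal numpy sketch of the kernel trick for a degree-2 polynomial kernel. The explicit map φ(x) = (x1², √2·x1·x2, x2²) is one common choice, used here as an assumption since the slide's exact example is not reproduced in the transcript:

```python
import numpy as np

def phi(x):
    """Explicit map of a 2D vector into the 3D feature space of the
    degree-2 polynomial kernel (no bias term)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Kernel trick: the same value computed directly in input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print("dot product in feature space:", phi(x) @ phi(z))   # explicit transform
print("kernel in input space:       ", poly_kernel(x, z)) # no transform needed
```

Both lines print the same value, which is the point of the trick: in the SVM dual and in g(x) above, every dot product x_i^T x_j is simply replaced by k(x_i, x_j), so the linear machinery carries over to non-linear decision boundaries without ever constructing the feature space.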