Maximum Likelihood Methods. Linear Models

Szymon Jaroszewicz

Tags

maximum likelihood, linear models, logistic regression, statistical modeling

Summary

These lecture slides present maximum likelihood methods for linear models: linear and logistic regression (and GLMs more generally), L1 and L2 regularization, regression with general loss functions, and support vector machines, illustrated with examples such as the diabetes regression data and a gene-expression dataset.

Full Transcript


Maximum likelihood methods. Linear models
Szymon Jaroszewicz

Overview
- Maximum Likelihood
- Regularization: L2 and L1
- Regression with general loss functions
- Support Vector Machines

Linear regression
We model a continuous variable $Y$ using a linear combination of predictor variables $X_1, \dots, X_m$:
$\hat{Y} = w_0 + \sum_i w_i X_i = X^T w$
The $w_i$ are called weights or coefficients, $w_0$ is the intercept. Often we add a constant variable $X_0$ always equal to 1; this allows us to get rid of the explicit intercept $w_0$.

Linear regression – example
Diabetes regression data. [Figure: diabetes score versus body mass index with the fitted regression line.] Model: $y = 949.435x + 152.133$.

Linear regression
Assumption: the data is generated according to
$y_i = x_i^T w + \varepsilon_i$
where the $\varepsilon_i$ are independent with $E\varepsilon_i = 0$ and $\operatorname{Var}\varepsilon_i = \sigma^2$. The value of $\sigma$ is typically unknown and can be estimated; the errors need not be normally distributed.
Method of least squares:
$\hat{w} = \arg\min_w \sum_i (y_i - x_i^T w)^2$
If $X$ is the matrix of predictors and $y$ the vector of target variables,
$\hat{w} = (X^T X)^{-1} X^T y$
(see the code sketch below).

Logistic regression
If the target variable $Y$ is binary, $Y \in \{0, 1\}$, we want to model $P(Y=1|X_1, \dots, X_m)$ using a linear model. We cannot do it directly because $X^T w$ can take any value between $-\infty$ and $+\infty$.
Solution: logistic regression
$\operatorname{logit}(P(Y=1|X)) = \log \frac{P(Y=1|X)}{1 - P(Y=1|X)} = \log \frac{P(Y=1|X)}{P(Y=0|X)} = X^T w$
logit = log-odds. The logit transform converts the value of the probability from $(0, 1)$ to $(-\infty, +\infty)$ so that we can model it using a linear combination of the $X$'s.

Logistic regression
Another way to look at logistic regression:
$P(Y=1|X) = \operatorname{sigmoid}(X^T w)$
where the sigmoid function
$\operatorname{sigmoid}(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$
transforms the value of $X^T w$ from $(-\infty, +\infty)$ into $(0, 1)$.

The sigmoid function
[Figure: plot of the sigmoid function.]

Generalized Linear Models (GLMs)
GLMs generalize linear and logistic regression:
$\mu(Y) = X^T w$
where $\mu$ is a link function. Examples:
- Linear regression: $Y$ follows the Gaussian distribution; link function: identity
- Logistic regression: $Y$ follows the binomial or Bernoulli (two-point) distribution; link function: logit
- Regression for count data (e.g. counting the number of cases): $Y$ follows the Poisson distribution; link function: log

Logistic regression, GLMs
Invented in the 1950s; now returning to popularity.

The method of Maximum Likelihood
The coefficients of GLMs are typically found using the Maximum Likelihood (ML) method.
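As a quick illustration of the least-squares slide above, here is a minimal NumPy sketch (the synthetic data and variable names are illustrative assumptions, loosely mimicking the body-mass-index example) that computes $\hat{w} = (X^T X)^{-1} X^T y$ and cross-checks it against NumPy's built-in solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data loosely mimicking the body-mass-index example:
# one predictor plus a constant column X0 = 1 playing the role of the intercept.
n = 100
bmi = rng.uniform(-0.1, 0.15, size=n)
y = 152.0 + 950.0 * bmi + rng.normal(scale=50.0, size=n)
X = np.column_stack([np.ones(n), bmi])

# Closed-form least squares: solve (X^T X) w = X^T y instead of inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's SVD-based least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", np.round(w_hat, 3))
print("np.linalg.lstsq :", np.round(w_lstsq, 3))
```

Solving the normal equations with np.linalg.solve avoids forming an explicit inverse; when $X^T X$ is ill-conditioned, an SVD-based routine such as np.linalg.lstsq is the numerically safer choice.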
The method of Maximum Likelihood
Pick the model coefficients $w$ which are most probable given the training data $D$:
$P(w|D) = \frac{P(D|w) P(w)}{P(D)} \propto P(D|w) P(w)$
$P(D|w)$ – the likelihood of $w$ given $D$; $P(w)$ – the a priori distribution of the weights (we'll ignore it for a while). In practice we minimize $-\log P(D|w)$.

Maximum likelihood – an example: linear regression
The density of $Y$ given $x$ and $w$, with known $\sigma^2$:
$f(Y|w, x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(Y - x^T w)^2}{2\sigma^2}\right)$
and the likelihood:
$P(D|w) = \prod_{(x,y)\in D} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - x^T w)^2}{2\sigma^2}\right)$

Maximum likelihood – an example: linear regression
The log-likelihood (after taking the logarithm):
$\log P(D|w) = \sum_{(x,y)\in D} \log\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - x^T w)^2}{2\sigma^2}\right)\right] = C - \frac{1}{2\sigma^2} \sum_{(x,y)\in D} (y - x^T w)^2$
The maximum is attained when
$\sum_{(x,y)\in D} (y - x^T w)^2 = \min$
(the least-squares principle).

The sigmoid function
A useful property: $1 - \operatorname{sigmoid}(z) = \operatorname{sigmoid}(-z)$.
Proof:
$1 - \operatorname{sigmoid}(z) = 1 - \frac{1}{1 + e^{-z}} = \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{e^{z} + 1} = \operatorname{sigmoid}(-z)$

Maximum likelihood – logistic regression
We have a training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, $y_i \in \{0, 1\}$. What is the probability of observing $(y_1, \dots, y_n)$ given $(x_1, \dots, x_n)$ and $w$?
$P(\mathbf{y}|w) = \prod_{(x,y)\in D: y=1} P(y=1|w, x) \cdot \prod_{(x,y)\in D: y=0} P(y=0|w, x) = \prod_{(x,y)\in D: y=1} \frac{1}{1 + e^{-w^T x}} \cdot \prod_{(x,y)\in D: y=0} \left(1 - \frac{1}{1 + e^{-w^T x}}\right)$

Maximum likelihood – logistic regression
And the log-likelihood:
$\log P(\mathbf{y}|w) = \sum_{y=1} \log \frac{1}{1 + e^{-w^T x}} + \sum_{y=0} \log\left(1 - \frac{1}{1 + e^{-w^T x}}\right) = \sum_{y=1} \log \frac{1}{1 + e^{-w^T x}} + \sum_{y=0} \log \frac{1}{1 + e^{w^T x}}$
$= -\sum_{y=1} \log(1 + e^{-w^T x}) - \sum_{y=0} \log(1 + e^{w^T x}) = -\sum_{(x,y)\in D} \left[ y \log(1 + e^{-w^T x}) + (1 - y) \log(1 + e^{w^T x}) \right]$

Maximum likelihood – logistic regression and GLMs
There are no explicit formulas for the coefficients; optimization methods need to be used:
- methods such as BFGS work very well (see the code sketch below)
- variants of Newton's method can be used; helpfully, each Newton step often reduces to weighted least squares
- for large datasets: stochastic optimization
Implementing it correctly is actually quite hard (e.g. problems with zero probabilities).

Regularization
Regularization: assuming an a priori distribution $P(w)$ on the model coefficients. Advantages:
1. Reduces overfitting: prevents model coefficients from taking very large values; decreases the variance of the coefficients (at the cost of increasing model bias).
2. Improves numerical properties, i.e. makes optimization easier.

L2 regularization
Historically first; proposed by Tikhonov in 1943.

L2 regularization
$P(w)$ – an $m$-dimensional Gaussian with zero mean and covariance $\sigma^2 I$, with density
$P(w) = \frac{1}{\sqrt{(2\pi)^m \sigma^{2m}}} \exp\left(-\frac{w^T w}{2\sigma^2}\right)$
To find $w$ we maximize
$\log P(w|D) \propto \log[P(D|w) P(w)] = \log P(D|w) - \lambda \sum_i w_i^2 + C = \log P(D|w) - \lambda \|w\|_2^2 + C$
The log-likelihood is penalized by the $\|\cdot\|_2$ norm of $w$; $\lambda$ is a parameter determining the strength of regularization. Such a $w$ is called the MAP (maximum a posteriori) estimate.
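To make the maximum-likelihood slides concrete, here is a minimal sketch (the synthetic data and names are my own assumptions) that minimizes the negative log-likelihood derived above with BFGS via scipy.optimize; the optional lam term adds the L2/MAP penalty just discussed, leaving the intercept unregularized.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, m = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # X0 = 1 column plays the role of the intercept
w_true = np.array([-0.5, 1.0, -2.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_true)))     # y_i in {0, 1}

def penalized_nll(w, lam=0.0):
    z = X @ w
    # -log P(D|w) = sum_i [ y_i*log(1+e^{-z_i}) + (1-y_i)*log(1+e^{z_i}) ],
    # written stably with logaddexp via the {-1,+1} encoding 2y-1.
    nll = np.sum(np.logaddexp(0.0, -(2 * y - 1) * z))
    return nll + lam * np.sum(w[1:] ** 2)                   # L2 (MAP) penalty, intercept not regularized

w_mle = minimize(penalized_nll, np.zeros(m + 1), method="BFGS").x
w_map = minimize(penalized_nll, np.zeros(m + 1), args=(1.0,), method="BFGS").x
print("ML estimate :", np.round(w_mle, 2))
print("MAP, lam=1.0:", np.round(w_map, 2))
```

With lam = 0 this is plain maximum likelihood; with lam > 0 it is the MAP estimate with a Gaussian prior on the non-intercept weights.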
A practical remark
Typically the intercept $w_0$ is not regularized (i.e. it is not included in the computation of $\|w\|$).

L2 regularization – preventing overfitting
Assume $m = 2$, $X_1, X_2 \in [-1, 1]$, and $Y$ typically takes small values between $-2$ and $2$. Imagine a (very) small training dataset $D$ in which $X_1$, $X_2$ and $Y$ all happen to take values close to 1. A model $Y = 10000 X_1 - 9999 X_2$ might be learned. It works well on the training data, but will be a disaster when $X_1 \approx 1$, $X_2 \approx -1$. L2 regularization will prevent such huge coefficients. The model might still not predict $Y$ that well, but the catastrophe will be avoided.

L2 regularization – preventing numerical problems
Assume $m = 2$ and $Y \in \{0, 1\}$, and logistic regression is used. Suppose the training dataset $D$ is linearly separable, i.e. there is a $w^*$ which correctly separates the points of class 0 from the points of class 1. The likelihood is
$P(D|w) = \prod_{(x,y)\in D, y=1} \operatorname{sigmoid}(x^T w) \prod_{(x,y)\in D, y=0} (1 - \operatorname{sigmoid}(x^T w))$
What can we tell about $P(D|w^*)$, $P(D|2w^*)$, $P(D|3w^*)$, ...?

L2 regularization – how to select λ?
- Use a default value, e.g. 1
- Trial and error, e.g. until matching coefficients with opposite signs are no longer present
- Use cross-validation (convenience class: LogisticRegressionCV); this results in nested cross-validation (see the code sketch below)

L2 regularization – preventing numerical problems
Conclusion: the solution is not well defined for separable data. Solutions:
- stop the optimization early, typically when the norm of the gradient is smaller than some predefined value
- use L2 regularization with a small value of the regularization parameter $\lambda$ (Weka uses $\lambda = 10^{-8}$ by default); this does not affect normal computations, but guarantees the existence of a well-defined solution

Why L2 regularization might not be enough
Recently, new types of data emerged with a huge number of attributes and a small number of records:
- microarrays: a few thousand attributes, a few tens/hundreds of records
- text data: tens/hundreds of thousands of attributes, a few thousand records
L2 regularization is no longer enough: L2-regularized models are typically rotationally invariant, so they require the number of training samples to grow linearly with the number of attributes.

L1 regularization
We change $\sum_i w_i^2$ to $\sum_i |w_i|$: pick the coefficients $w$ which maximize
$\log P(D|w) - \lambda \sum_i |w_i| = \log P(D|w) - \lambda \|w\|_1$
This is equivalent to assuming a Laplace a priori distribution on the coefficients $w$. The p.d.f. of the Laplace distribution is $\frac{1}{2} \exp(-|x - \mu|)$.
[Figure: plot of the Laplace density.]
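A short scikit-learn sketch of choosing the regularization strength by cross-validation, as mentioned in the "how to select λ" slide (the data here is synthetic and the settings are illustrative); note that scikit-learn parametrizes the penalty by C = 1/λ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
n, m = 300, 10
X = rng.normal(size=(n, m))
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] + 0.5 * X[:, 1])))
y = rng.binomial(1, p)

# Inner cross-validation over a grid of C values (C = 1/lambda in scikit-learn's convention).
clf = LogisticRegressionCV(Cs=np.logspace(-3, 3, 13), cv=5, penalty="l2", scoring="neg_log_loss")
clf.fit(X, y)
print("chosen C (= 1/lambda):", clf.C_)
print("coefficients:", np.round(clf.coef_, 2))
```

Evaluating the selected model on separate held-out data then requires an outer loop, which is the nested cross-validation the slide refers to.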
L1 regularization – properties
Very useful properties:
- Automatic variable selection: the Laplace distribution has a sharp peak at zero, which 'latches' the coefficients to zero.
- Works (under certain assumptions) even for a large number of attributes and a small number of training records.

L1 regularization – variable selection example
Randomly generated data with $X_1, X_2 \sim U(0, 1)$, $X_1 \perp X_2$. The class $Y$ is binary and
$\operatorname{logit}(P(Y=1|X_1, X_2)) = 1.5 X_1 + 0.5 X_2$
1000 data records were generated (see the code sketch below).
[Figures: estimated coefficients for $\lambda = 0, 250, 500, 750, 2000$.]

L1 regularization – variable selection – genetic data
The colon dataset: activity of 2000 genes, 60 training samples, 2 classes (healthy/cancerous tissue). Regularization path: the value of $w$ for all values of $\lambda$.
[Figure: L1 regularization path, coefficients $w_i$ versus $\alpha$.]
[Figure: L2 regularization path, coefficients $w_i$ versus $\alpha$.]

Learning with a small number of samples and a large number of attributes

Experiment: dependence of the classification error on the number of attributes
1. Artificial data from an $m$-dimensional normal distribution with zero mean and an $m \times m$ covariance matrix with 1 on the diagonal and 0.2 everywhere off the diagonal.
2. Class $Y$ generated according to $\operatorname{logit}(P(Y=1)) = 10 \cdot X_1$.
3. 70 training records, 30 test records.
4. The regularization parameter chosen based on a 20-record validation set.
5. The experiment was repeated 100 times and the results averaged.
[Figure: classification error (%) versus the number of attributes $m$ for L1 and L2 regularization.]

Sample complexity of L1-regularized logistic regression
Theorem (Ng 2004). If:
- there is a logistic model $w^*$ with $r$ nonzero coefficients and classification error $\operatorname{err}(w^*)$,
- $\hat{w}$ has been determined using L1-regularized logistic regression,
- the parameter $\lambda$ has been chosen based on a validation set,
then, in order for $\operatorname{err}(\hat{w}) \le \operatorname{err}(w^*) + \epsilon$ with probability $\ge 1 - \delta$, it is sufficient to use
$n = O\left((\log m) \cdot \operatorname{poly}\left(r, \log \tfrac{1}{\delta}, \log \tfrac{1}{\epsilon}\right)\right)$
training samples.
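The variable-selection example above can be reproduced roughly with the following sketch (synthetic data generated as on the slide; the grid of C values is my own choice), tracing how the L1-penalized coefficients shrink as the penalty grows (small C means large λ).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 1000
X = rng.uniform(0.0, 1.0, size=(n, 2))                     # X1, X2 ~ U(0,1), independent
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] + 0.5 * X[:, 1])))
y = rng.binomial(1, p)

# Trace an L1 regularization path: strong penalty (small C) to weak penalty (large C).
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    print(f"C={C:>6}: w = {np.round(clf.coef_.ravel(), 3)}")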
Sample complexity of regularized logistic regression
Conclusions:
- If there is a small subset of variables which predicts the class well, L1-regularized logistic regression can tolerate an exponential (in $n$) number of attributes.
- L2-regularized logistic regression is rotationally invariant, so it requires the number of training samples to be proportional to the number of attributes.

Regularization of GLMs
We talked about logistic regression, but all results remain valid for other GLMs. For linear regression:
- Ridge regression: L2-regularized linear regression
- Lasso: L1-regularized linear regression

Extensions
Elastic net (Zou, Hastie) – a combination of L1 and L2 regularization:
$\gamma \|w\|_1 + (1 - \gamma) \|w\|_2^2$
Advantage: the L1 part ensures variable selection, the L2 part groups the weights of correlated attributes (see the code sketch below). Extensions to time series data: points next to each other should have similar values; use the penalty term $\sum_t |x_t - x_{t-1}|$ to ensure this. A similar trick is used for image filtering.

GLMs in R
- glm – standard GLM routine
- glm2 – GLM package with improved numerical stability
- glmpath, glmnet – compute the regularization path for L1 and elastic net penalties
- penalized – package with L1 and L2 regularization

Risk and loss functions

A general view of regression – loss and risk
Most regression methods find the model parameters $w$ by minimizing an expression of the form
$\hat{L}(w) + R(w) = \frac{1}{n} \sum_{(x,y)\in D} \ell(y, f_w(x)) + R(w)$
where $\hat{L}$ is an empirical risk function (the average of the loss functions $\ell(y, f_w(x))$), $f_w$ is the regression model, e.g. $f_w(x) = w^T x$, and $R$ is the regularization term (e.g. an L1 or L2 penalty). By using different loss functions and/or regularizers we get different regression models.

Example: linear regression
Classical linear regression: loss function: squared loss $\ell = (y - wx)^2$; no regularizer: $R(w) = 0$.
[Figure: the squared loss as a function of $y - wx$.]

Example: linear regression
Properties:
- predicts the conditional mean $E(y|x) = wx$
- classical, well understood method
- sensitive to outliers: the squared loss penalizes cases with a big prediction error very strongly, so a single outlier can distort the model

Regression with outliers
[Figure: least-squares fits in the presence of outliers.]

Why does squared loss predict the conditional mean?
Assume the true population mean is linear: $E(y|x) = w^* x$. Condition on a fixed $x$ and let $\hat{y}$ be the value predicted for $x$. The true (population) risk is
$E[(y - \hat{y})^2 | x] = \int (y - \hat{y})^2 f(y|x)\, dy$
where $f(y|x)$ is the conditional density of $Y$.
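A brief scikit-learn sketch of ridge, lasso and elastic net on the same data (synthetic, with only a few relevant attributes); note that scikit-learn's alpha and l1_ratio express the same L1/L2 mix as the γ on the slide, though with a slightly different scaling.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(4)
n, m = 100, 20
X = rng.normal(size=(n, m))
w_true = np.zeros(m)
w_true[:3] = [3.0, -2.0, 1.5]                      # only a few relevant attributes
y = X @ w_true + rng.normal(scale=0.5, size=n)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{type(model).__name__:>10}: {nonzero} nonzero coefficients")
```

The lasso and elastic net typically zero out most of the irrelevant coefficients, while ridge only shrinks them.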
Why does squared loss predict the conditional mean? (continued)
Differentiating with respect to $\hat{y}$ (under the integral sign):
$\frac{d}{d\hat{y}} E[(y - \hat{y})^2 | x] = -2 \int (y - \hat{y}) f(y|x)\, dy = 0$
$\int y f(y|x)\, dy - \hat{y} \int f(y|x)\, dy = 0 \quad\Rightarrow\quad \hat{y} = E(y|x)$
So we make the smallest error if for each $x$ we predict the conditional mean. This is possible since we assumed $E(y|x) = w^* x$.

Example: L1 regression (LAD)
Loss function: L1 loss $\ell = |y - wx|$; no regularizer: $R(w) = 0$ (for now). Also called Least Absolute Deviation (LAD) regression.
[Figure: the L1 loss as a function of $y - wx$.]

Example: L1 regression
Properties:
- predicts the conditional median $\operatorname{median}(y|x) = wx$
- less sensitive to outliers
- harder to find the coefficients
Changing the loss function allows us to get linear models with different properties.

Example: Huber loss
Loss function: the Huber loss with parameter $a$:
$\ell_a = \begin{cases} \frac{1}{2}(y - wx)^2 & \text{if } |y - wx| \le a \\ a|y - wx| - \frac{a^2}{2} & \text{otherwise} \end{cases}$
No regularizer: $R(w) = 0$ (for now).
[Figure: the Huber loss as a function of $y - wx$.]

Example: Huber loss
Properties:
- for small prediction errors: least squares
- for large prediction errors: L1 loss
- it is a robust version of least-squares regression
Drawbacks: harder to find the coefficients (a convex function needs to be optimized); need to decide on $a$.

Regression with outliers
[Figure: least-squares and Huber fits in the presence of outliers.]

Loss functions – regression
[Figure: comparison of the squared, L1 and Huber losses as functions of $y - wx$.]

Quantile regression
L1 regression predicted the conditional median (the 0.5 quantile). What if we want the $q$-th quantile? Use the loss function
$\ell_q = \begin{cases} q(y - wx) & \text{if } y - wx \ge 0 \\ (q - 1)(y - wx) & \text{otherwise} \end{cases}$
LAD regression is a special case (see the code sketch below).
[Figure: the quantile loss for $q = 0.1, 0.3, 0.5$.]
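The "same model, different loss" idea can be written down directly as empirical risk minimization with a pluggable loss; the sketch below (synthetic data with a few gross outliers, and a generic optimizer instead of specialized solvers) fits the squared, L1, Huber and quantile losses discussed above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 80
x = rng.uniform(-3, 3, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)
y[:3] += 40.0                                   # a few gross outliers
X = np.column_stack([np.ones(n), x])            # intercept column + predictor

def fit(loss):
    """Minimize the empirical risk (1/n) * sum loss(y - Xw) over w."""
    risk = lambda w: np.mean(loss(y - X @ w))
    return minimize(risk, x0=np.zeros(2), method="Nelder-Mead").x

squared  = lambda r: r ** 2
absolute = lambda r: np.abs(r)
huber    = lambda r, a=1.0: np.where(np.abs(r) <= a, 0.5 * r ** 2, a * np.abs(r) - 0.5 * a ** 2)
pinball  = lambda r, q=0.9: np.where(r >= 0, q * r, (q - 1) * r)   # q-th quantile loss

for name, loss in [("squared", squared), ("L1", absolute), ("Huber", huber), ("quantile q=0.9", pinball)]:
    w0, w1 = fit(loss)
    print(f"{name:>15}: intercept={w0:6.2f}, slope={w1:5.2f}")
```

On data like this the squared-loss fit is typically dragged toward the outliers, while the L1 and Huber fits stay close to the bulk of the data and the q = 0.9 fit tracks the upper part of the conditional distribution.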
Quantile regression – proof
Condition on a fixed $x$ and assume the true population quantile is a linear function of $x$. The conditional population risk (expected loss) is
$E[\ell_q(y - \hat{y}) | x] = q \int_{\hat{y}}^{\infty} (y - \hat{y}) f(y|x)\, dy + (q - 1) \int_{-\infty}^{\hat{y}} (y - \hat{y}) f(y|x)\, dy$
Differentiating with respect to $\hat{y}$:
$\frac{d}{d\hat{y}} E[\ell_q(y - \hat{y}) | x] = -q(\hat{y} - \hat{y}) f(\hat{y}|x) - q \int_{\hat{y}}^{\infty} f(y|x)\, dy + (q - 1)(\hat{y} - \hat{y}) f(\hat{y}|x) - (q - 1) \int_{-\infty}^{\hat{y}} f(y|x)\, dy$
$= -(q - 1) P(y \le \hat{y}) - q (1 - P(y \le \hat{y})) = -q + P(y \le \hat{y}) = 0$
so the optimal $\hat{y}$ satisfies $P(y \le \hat{y}) = q$, i.e. it is the $q$-th conditional quantile.

Loss functions – regression
Those loss functions are convex $\Rightarrow$ the corresponding risk
$\frac{1}{n} \sum_{(x,y)\in D} \ell(y, wx)$
is convex $\Rightarrow$ a unique solution (typically) exists and is (relatively) easy to find. We try to avoid nonconvex losses/risks as much as possible.

Loss function examples – classification
Assume $Y \in \{-1, +1\}$ and let $\bar{y} = \frac{y + 1}{2} \in \{0, 1\}$. Classify as $+1$ if $f_w(x) > 0$. The 0-1 loss is $\ell = \mathbf{1}[y f_w(x) \le 0]$. Optimizing the 0-1 loss optimizes classification accuracy directly, but the 0-1 loss is not convex and its optimization is NP-hard. We try to approximate it.

Surrogate loss functions
A loss function $\ell$ is called a surrogate loss if
1. $c\, \ell(y, wx) \ge \ell_{0\text{-}1}(y, wx)$ for all $y, w, x$, for some constant $c > 0$, and
2. $\ell$ is convex.
A surrogate loss leads to a risk which is easy to optimize. When we minimize a surrogate loss we minimize an upper bound on the 0-1 loss, so we indirectly push down the 0-1 loss (see the code sketch below). More precise theoretical results also exist.

Loss function examples – logistic regression
In logistic regression we optimize the likelihood function
$\sum_{(x,y)\in D} \left[ \bar{y} \log \operatorname{sigmoid}(wx) + (1 - \bar{y}) \log(1 - \operatorname{sigmoid}(wx)) \right]$
Notice that $1 - \operatorname{sigmoid}(z) = \operatorname{sigmoid}(-z)$. The sum becomes
$\sum_{(x,y)\in D} \log \operatorname{sigmoid}(y\, wx) = -\sum_{(x,y)\in D} \log(1 + e^{-y\, wx})$
Logistic regression uses the log loss $\ell(y, wx) = \log(1 + e^{-y\, wx})$.

Loss function examples – hinge loss
The hinge loss: $\ell(y, f_w(x)) = \max(0, 1 - y\, f_w(x))$. L2-regularized regression with the hinge loss is equivalent to Support Vector Machines. The hinge loss is quite similar to the log loss. Both the log loss and the hinge loss are upper bounds on the 0-1 loss, so minimizing the hinge or log loss minimizes an upper bound on the misclassification rate.
[Figure: the zero-one, hinge and log losses as functions of $y\, wx$.]

Support Vector Machines (SVMs)

Support Vector Machines (SVMs)
We will now present the more popular, geometric, interpretation of SVMs. Properties of linear functions: take a linear function of $m$ variables, $f_w(x) = wx$ (dot product):
- $wx = 0$ is the equation of an $(m-1)$-dimensional hyperplane in $\mathbb{R}^m$
- if $wx_0 > 0$ then $x_0$ lies on one side of the plane $wx = 0$ (the side pointed to by the vector $w$)
- if $wx_0 < 0$ then $x_0$ lies on the opposite side of $wx = 0$
- $\frac{wx_0}{\|w\|}$ is the (signed) distance of a point $x_0$ from the plane $wx = 0$
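A quick numerical check of the surrogate-loss property for the hinge and log losses; the constant c = 1/log 2 used to scale the log loss is my own choice of the constant from the definition, not something fixed on the slides.

```python
import numpy as np

t = np.linspace(-5.0, 5.0, 2001)          # t = y * wx, the classification margin
zero_one = (t <= 0).astype(float)
hinge = np.maximum(0.0, 1.0 - t)
log_loss = np.log1p(np.exp(-t))

# The hinge loss bounds the 0-1 loss directly (c = 1); the log loss needs c = 1/log(2)
# so that it equals 1 at t = 0.
print("hinge >= 0-1 everywhere:     ", np.all(hinge >= zero_one))
print("log/log(2) >= 0-1 everywhere:", np.all(log_loss / np.log(2) >= zero_one))
```

Both checks print True: the hinge loss already dominates the 0-1 loss with c = 1, and the log loss does after rescaling so that it equals 1 at the decision boundary.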
Support Vector Machines (SVMs)
Suppose $D = \{(x_i, y_i) : i = 1, \dots, N\}$ is a training set with $\operatorname{Dom}(Y) = \{-1, +1\}$. Suppose further that $D$ is linearly separable: there is a hyperplane separating the points of class $-1$ from the points of class $+1$. We want to find $w$ such that the plane $wx = 0$ separates the two classes. Formally, this can be written as
$w x_i y_i > 0 \quad \text{for all } i = 1, \dots, N$
A plane satisfying those conditions is called a separating hyperplane.

Support Vector Machines (SVMs)
Usually we want the plane to separate the classes with some margin greater than zero. Formally, this can be written as
$w x_i y_i \ge 1 \quad \text{for all } i = 1, \dots, N$
You may be wondering where the width of the margin has gone...

Support Vector Machines (SVMs)
Recall that the distance of a point $x_0$ from the plane $wx = 0$ is $wx_0 / \|w\|$ (up to sign), so the condition implies
$\gamma_i = \frac{w x_i y_i}{\|w\|} \ge \frac{1}{\|w\|}$
where $\gamma_i$ is the distance of the point $x_i$ from the separating hyperplane. Conclusion: the norm of $w$ decides how far the points are from the separating hyperplane.

Support Vector Machines (SVMs)
Usually many separating hyperplanes are possible. Support Vector Machines select the hyperplane with the largest margin. Since the width of the margin is determined by $\|w\|$, the maximum-margin separating hyperplane can be found by solving the following optimization problem:
Minimize
$\frac{1}{2} \|w\|^2$
subject to
$w x_i y_i \ge 1 \quad \text{for all } i = 1, \dots, N$

Support Vector Machines (SVMs) – support vectors
For some training points we will have $w x_i y_i = 1$, i.e. their constraints are active. Those points:
- lie closest to the decision boundary
- determine the width of the margin
- are the only points affecting the weights $w$
They are called the support vectors.
[Figure: a maximum-margin separating hyperplane, its margin, and the support vectors.]

Support Vector Machines (SVMs) – the nonseparable case
What if $D$ is not linearly separable? We have to allow for misclassified points. We introduce slack variables $\xi_i \ge 0$; each point is now allowed to be $\xi_i$ units 'on the wrong side':
$w x_i y_i \ge 1 - \xi_i \quad \text{for all } i = 1, \dots, N$
We also need to penalize the $\xi_i$'s.

Support Vector Machines (SVMs) – the nonseparable case
The SVM optimization problem: minimize
$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$
subject to
$w x_i y_i \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for all } i = 1, \dots, N$
$C$ is a coefficient deciding how strongly misclassified points should be penalized (an analogue of the L2 regularization coefficient). This procedure is called soft margin maximization (see the code sketch below).
[Figure: a soft-margin hyperplane with slack variables $\xi_i$ for points on the wrong side of the margin.]

Support Vector Machines (SVMs) – the nonseparable case
The following approaches to SVMs are fully equivalent:
- risk minimization with the hinge loss
- the geometric view with soft margin maximization
Conclusion: logistic regression is quite similar to SVMs; they use different but similar loss functions.
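A minimal scikit-learn sketch of soft-margin SVM training on well-separated synthetic data (the data and parameters are illustrative): SVC with a linear kernel solves the soft-margin problem above and exposes the support vectors and the weight vector whose norm determines the margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n = 200
X = np.vstack([rng.normal(loc=[-2.0, -2.0], size=(n // 2, 2)),
               rng.normal(loc=[ 2.0,  2.0], size=(n // 2, 2))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

# Soft-margin linear SVM: C controls how strongly margin violations (the slack xi_i) are penalized.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
w = svm.coef_.ravel()
print("w =", np.round(w, 3), "  intercept =", np.round(svm.intercept_, 3))
print("number of support vectors:", len(svm.support_vectors_))
print("margin width = 2/||w|| =", round(2.0 / np.linalg.norm(w), 3))
```

Larger C penalizes the slack variables more heavily, which typically yields a narrower margin and fewer support vectors.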
Support Vector Machines (SVMs) – equivalence with hinge-loss risk minimization
The proof is easy. The SVM optimization task is to minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $w x_i y_i \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i = 1, \dots, N$.
1. Rewrite the first constraint as $\xi_i \ge 1 - w x_i y_i$.
2. Fix $w$.
3. To minimize $\sum_{i=1}^{N} \xi_i$, choose each $\xi_i$ as small as possible subject to the constraints,
4. i.e. $\xi_i = \max\{1 - w x_i y_i, 0\}$.
5. Finally, minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\{1 - w x_i y_i, 0\}$ over $w$, which is exactly L2-regularized risk minimization with the hinge loss.

Probabilistic predictions
Linear SVMs are a non-probabilistic model: they return a score $wx$ which is not a probability. To get $P(Y=1|x)$, one typically builds a logistic regression with the single predictor $z = wx$. Logistic regression returns $P(Y=1|x) = \operatorname{sigmoid}(wx)$. For scoring (e.g. ROC) it is better to just use $wx$, i.e. decision_function instead of predict_proba (see the sketch after the literature list).

Literature
1. Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani, Least Angle Regression, www.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
2. Hui Zou and Trevor Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society B (2005) 67, Part 2, pp. 301–320
3. John Shawe-Taylor and Nello Cristianini, Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000
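To illustrate the "Probabilistic predictions" slide, a small sketch (synthetic data; the names are illustrative): the linear SVM's decision_function scores are used as the single predictor of a logistic regression, which maps them to probabilities.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 400
X = np.vstack([rng.normal(loc=-1.0, size=(n // 2, 5)),
               rng.normal(loc= 1.0, size=(n // 2, 5))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

svm = LinearSVC(max_iter=10000).fit(X, y)
z = svm.decision_function(X).reshape(-1, 1)     # SVM score wx: fine for ranking/ROC, not a probability

# Map the score to P(Y=1|x) with a one-predictor logistic regression (Platt-style scaling).
calibrator = LogisticRegression().fit(z, y)
proba = calibrator.predict_proba(z)[:, 1]
print("score range:      ", np.round([z.min(), z.max()], 2))
print("probability range:", np.round([proba.min(), proba.max()], 2))
```

In practice the calibration model should be fitted on data not used to train the SVM (scikit-learn's CalibratedClassifierCV wraps this pattern), and for ranking-based evaluation such as ROC the raw score is already sufficient, as the slide notes.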
