Linear Models PDF
The University of Manchester
Mauricio A. Álvarez
Summary
This document is a set of lecture notes on linear models, specifically focusing on vector/matrix notation, linear algebra, and its application within machine learning. The notes were produced at the University of Manchester.
Full Transcript
Linear models
Mauricio A. Álvarez
Foundations of Machine Learning
The University of Manchester

Contents
Review of vector/matrix notation and linear algebra
A regression model
Linear regression
Gradient descent
Stochastic gradient descent
Regularisation
Logistic regression
Cross-entropy error function
Optimisation and regularisation

Scalars and vectors
❑ A scalar is just a numeric value, like 0.9 or −18.7.
❑ Scalars are usually denoted by lowercase letters like $x$ or $a$.
❑ A vector is an ordered list of scalar values. Sometimes we refer to these scalar values as the attributes or entries of the vector.
❑ Vectors are usually denoted by bold lowercase letters like $\mathbf{x}$ or $\mathbf{y}$.

Vectors
❑ A vector can sometimes be written as a row vector, e.g. $\mathbf{x} = [x_1, x_2, x_3, x_4, x_5]$, or as a column vector, $\mathbf{x} = [x_1, x_2, x_3, x_4, x_5]^\top$.
❑ In this module, ALL vectors are column vectors by default. So, when you see a vector, e.g. $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$, always think of it as having a column-wise shape.

Matrices
❑ A matrix is a rectangular array of scalars arranged in rows and columns.
❑ Matrices are usually denoted by bold uppercase letters, e.g. $\mathbf{X}$ or $\mathbf{Y}$.
❑ The following matrix has three rows and two columns:
$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}$$
❑ The entries in the matrix above are of the form $x_{ij}$, where the first subindex $i$ indicates the row of the element and the second subindex $j$ indicates the column.

Matrix transpose
❑ Let $\mathbf{X}$ be a matrix with elements $x_{ij}$. The transpose of $\mathbf{X}$ is a new matrix $\mathbf{X}^\top$ with elements $x_{ji}$:
$$\mathbf{X} = \begin{bmatrix} 4.1 & -5.6 \\ -2.6 & 7.9 \\ 3.5 & 1.8 \end{bmatrix}, \qquad \mathbf{X}^\top = \begin{bmatrix} 4.1 & -2.6 & 3.5 \\ -5.6 & 7.9 & 1.8 \end{bmatrix}$$

Matrix multiplication
❑ Let $\mathbf{A}$ be a matrix with entries $a_{ik}$ and dimensions $p \times q$.
❑ Let $\mathbf{B}$ be a matrix with entries $b_{kj}$ and dimensions $t \times s$.
❑ Matrix multiplication of the form $\mathbf{AB}$ is only possible if $q = t$.
❑ If this is the case, the matrix $\mathbf{C} = \mathbf{AB}$ has dimensions $p \times s$ with entries
$$c_{ij} = \sum_k a_{ik} b_{kj}.$$

Transpose of a product
❑ Let $\mathbf{w}$ be a vector of dimensions $d \times 1$.
❑ Let $\mathbf{X}$ be a matrix with dimensions $n \times d$.
❑ The transpose of the product $\mathbf{Xw}$ is $(\mathbf{Xw})^\top = \mathbf{w}^\top \mathbf{X}^\top$.
❑ We can apply this result to a product of several matrices:
$$(\mathbf{ABCD})^\top = ((\mathbf{AB})(\mathbf{CD}))^\top = (\mathbf{CD})^\top(\mathbf{AB})^\top = \mathbf{D}^\top\mathbf{C}^\top\mathbf{B}^\top\mathbf{A}^\top.$$

Two common types of products
❑ Inner product. The inner product between two vectors results in a scalar. Let $\mathbf{x}$ and $\mathbf{y}$ be vectors of dimension $m \times 1$. The inner product is given as
$$\mathbf{x}^\top\mathbf{y} = \sum_{i=1}^m x_i y_i.$$
❑ Outer product. The outer product between two vectors results in a matrix. Let $\mathbf{x}$ be a vector of dimension $m \times 1$ and $\mathbf{y}$ a vector of dimension $p \times 1$. The outer product is given as
$$\mathbf{x}\mathbf{y}^\top = \begin{bmatrix} x_1 y_1 & \cdots & x_1 y_p \\ x_2 y_1 & \cdots & x_2 y_p \\ \vdots & \ddots & \vdots \\ x_m y_1 & \cdots & x_m y_p \end{bmatrix}.$$

From a scalar operation to a vector operation
❑ It is usually desirable to transform a scalar operation into a vector operation.
❑ Coding scalar operations requires loops, which can be expensive.
❑ In contrast, vector operations are handled efficiently by low-level routines already included in modules like numpy.

Example (to be reviewed in the tutorial)
Write the following scalar operation in vector/matrix form:
$$\sum_{i=1}^n \Big( y_i - \sum_{j=1}^d x_{ij} w_j \Big)^2.$$

Answer (I) (to be reviewed in the tutorial)
❑ The sum above can be written as
$$\sum_{i=1}^n \Big( y_i - \sum_{j=1}^d x_{ij} w_j \Big)^2 = \Big( y_1 - \sum_{j=1}^d x_{1j} w_j \Big)\Big( y_1 - \sum_{j=1}^d x_{1j} w_j \Big) + \cdots + \Big( y_n - \sum_{j=1}^d x_{nj} w_j \Big)\Big( y_n - \sum_{j=1}^d x_{nj} w_j \Big).$$
❑ Let us define a vector $\mathbf{v}$ of dimensions $n \times 1$ with entries given as $y_i - \sum_{j=1}^d x_{ij} w_j$.

Answer (II) (to be reviewed in the tutorial)
❑ The product of vectors $\mathbf{v}^\top\mathbf{v}$ gives the same result as the required sum:
$$\mathbf{v}^\top\mathbf{v} = \Big[\, y_1 - \textstyle\sum_{j=1}^d x_{1j} w_j \;\; \cdots \;\; y_n - \textstyle\sum_{j=1}^d x_{nj} w_j \,\Big] \begin{bmatrix} y_1 - \sum_{j=1}^d x_{1j} w_j \\ \vdots \\ y_n - \sum_{j=1}^d x_{nj} w_j \end{bmatrix} = \sum_{i=1}^n \Big( y_i - \sum_{j=1}^d x_{ij} w_j \Big)^2.$$
❑ How do we express the elements in $\mathbf{v}$ with vectors and matrices?

Answer (III) (to be reviewed in the tutorial)
❑ For a fixed $i$, the entries $x_{i1}, \ldots, x_{id}$ can be grouped into a row vector $\mathbf{x}_i^\top$.
❑ The internal sums in the entries of $\mathbf{v}$ can then be written as
$$\sum_{j=1}^d x_{ij} w_j = \mathbf{x}_i^\top\mathbf{w} = [x_{i1}\; x_{i2}\; \cdots\; x_{id}] \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}.$$
❑ We can now write $\mathbf{v}$ as
$$\mathbf{v} = \begin{bmatrix} y_1 - \mathbf{x}_1^\top\mathbf{w} \\ \vdots \\ y_n - \mathbf{x}_n^\top\mathbf{w} \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{bmatrix}\mathbf{w}.$$
❑ We can group the scalars $y_1, \ldots, y_n$ into a vector $\mathbf{y}$.
❑ We can group the row vectors $\mathbf{x}_1^\top, \ldots, \mathbf{x}_n^\top$ into a matrix $\mathbf{X}$.

Answer (IV) (to be reviewed in the tutorial)
❑ It means that $\mathbf{v} = \mathbf{y} - \mathbf{Xw}$.
❑ Finally,
$$\sum_{i=1}^n \Big( y_i - \sum_{j=1}^d x_{ij} w_j \Big)^2 = \mathbf{v}^\top\mathbf{v} = (\mathbf{y} - \mathbf{Xw})^\top(\mathbf{y} - \mathbf{Xw}).$$

Differentiating a function in a vector/matrix form (I)
❑ We will see cases in which a function $f(\mathbf{w})$ depends on some parameters grouped in a vector $\mathbf{w}$.
❑ We would like to find the vector of parameters $\mathbf{w}$ that maximises $f(\mathbf{w})$.
❑ For example, suppose $f(\mathbf{w})$ is defined as $f(\mathbf{w}) = \sum_{i=1}^d w_i x_i$.
❑ We can group the scalars $x_1, \ldots, x_d$ into $\mathbf{x}$. Likewise for $\mathbf{w}$.
❑ According to what we saw before, we can write $f(\mathbf{w})$ as $f(\mathbf{w}) = \mathbf{x}^\top\mathbf{w}$.

Differentiating a function in a vector/matrix form (II)
❑ For a fixed $\mathbf{x}$, we are interested in computing the gradient of $f(\mathbf{w})$ with respect to $\mathbf{w}$:
$$\frac{df(\mathbf{w})}{d\mathbf{w}} = \begin{bmatrix} \partial f(\mathbf{w})/\partial w_1 \\ \vdots \\ \partial f(\mathbf{w})/\partial w_d \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} = \mathbf{x}.$$
❑ Some useful identities when differentiating with respect to a vector:
$$\begin{array}{ll} f(\mathbf{w}) & df(\mathbf{w})/d\mathbf{w} \\ \mathbf{w}^\top\mathbf{x} & \mathbf{x} \\ \mathbf{x}^\top\mathbf{w} & \mathbf{x} \\ \mathbf{w}^\top\mathbf{w} & 2\mathbf{w} \\ \mathbf{w}^\top\mathbf{C}\mathbf{w} & 2\mathbf{C}\mathbf{w} \;\; (\text{for symmetric } \mathbf{C}) \end{array}$$

Identity matrix and the inverse of a matrix
❑ The identity matrix of size $N$ is a square matrix with ones on the main diagonal and zeros elsewhere, e.g.
$$\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
❑ The inverse matrix of a matrix $\mathbf{A}$ of dimensions $d \times d$, denoted as $\mathbf{A}^{-1}$, satisfies $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_d$.

Olympic 100m Data
Image from Wikimedia Commons http://bit.ly/191adDC.
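Before moving on to the regression example, the vectorisation worked through in the tutorial answers above can be checked numerically. The sketch below uses made-up numpy arrays (the sizes $n = 4$, $d = 3$ are arbitrary) and compares the double-loop scalar computation against $\mathbf{v}^\top\mathbf{v} = (\mathbf{y} - \mathbf{Xw})^\top(\mathbf{y} - \mathbf{Xw})$:

```python
import numpy as np

# Arbitrary small problem: n = 4 instances, d = 3 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
w = rng.normal(size=3)
y = rng.normal(size=4)

# Scalar version: explicit loops over i and j.
total = 0.0
for i in range(4):
    inner = 0.0
    for j in range(3):
        inner += X[i, j] * w[j]
    total += (y[i] - inner) ** 2

# Vectorised version: v = y - Xw, then the inner product v'v.
v = y - X @ w
vectorised = v @ v
```

Both computations give the same number; the vectorised form delegates the loops to numpy's compiled routines.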
Dataset
[Figure: "Male 100 mts" — winning time in seconds (roughly 10.0 to 12.0) against the year of the competition, 1900 to 2000.]

Model
❑ We will use a linear model $f(x, \mathbf{w})$ to predict $y$, where $y$ is the time in seconds and $x$ the year of the competition.
❑ The linear model is given as $f(x, \mathbf{w}) = w_0 + w_1 x$, where $w_0$ is the intercept and $w_1$ is the slope.
❑ We use $\mathbf{w}$ to refer both to $w_0$ and $w_1$.

Data and model
[Figure: the same "Male 100 mts" dataset with the fitted straight line overlaid.]

Linear model
❑ A simple model for regression consists in using a linear combination of the attributes to predict the output:
$$f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D,$$
where $w_0, w_1, \ldots, w_D$ are the parameters of the regression model.
❑ The term $w_0$ is the bias term or intercept, e.g. $f(\mathbf{0}, \mathbf{w}) = w_0$.
❑ The expression above can be written in a vectorial form, $f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top\mathbf{x}$, where we have defined $\mathbf{w} = [w_0, w_1, \ldots, w_D]^\top$ and $\mathbf{x} = [1, x_1, \ldots, x_D]^\top$.
❑ Notice that $x_0 = 1$.

Parenthesis: Gaussian pdf
❑ The Gaussian pdf has the form
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\}.$$
❑ A Gaussian pdf requires two parameters, $\mu$ and $\sigma^2$, the mean and the variance of the RV $Y$.
❑ We denote the Gaussian pdf as $p(y|\mu, \sigma^2) = \mathcal{N}(y|\mu, \sigma^2)$, or $y \sim \mathcal{N}(\mu, \sigma^2)$.

Parenthesis: Gaussian pdf
[Figure: two Gaussian pdfs with mean $\mu = 2$ and variances $\sigma^2 = 0.5$ (solid) and $\sigma^2 = 2$ (dashed).]

Gaussian regression model (I)
❑ We use a Gaussian regression model to relate the inputs and outputs:
$$y = f(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
❑ It assumes that each output $y_i$ that we observe can be explained as the prediction of an underlying model, $f(\mathbf{x}_i, \mathbf{w})$, plus a noise term $\epsilon_i$.
❑ For a fixed $\mathbf{x}$ and a fixed $\mathbf{w}$, $f(\mathbf{x}, \mathbf{w})$ is a constant, so $y = \text{constant} + \epsilon$, where $\epsilon$ is a continuous RV.
❑ What is the pdf for $y$? (We are adding a constant to a Gaussian RV.)
– $\mathbb{E}\{y\} = \mathbb{E}\{\text{constant} + \epsilon\} = \text{constant}$
– $\operatorname{var}\{y\} = \operatorname{var}\{\text{constant}\} + \operatorname{var}\{\epsilon\} = \sigma^2$

Gaussian regression model (II)
❑ This means that $y \sim \mathcal{N}(\text{constant}, \sigma^2)$, where we said the constant was $f(\mathbf{x}, \mathbf{w})$, that is, $y \sim \mathcal{N}(f(\mathbf{x}, \mathbf{w}), \sigma^2)$.
❑ Because we assumed that $\mathbf{x}$ and $\mathbf{w}$ are given, we can also write $p(y|\mathbf{x}, \mathbf{w}, \sigma^2) = \mathcal{N}(y|f(\mathbf{x}, \mathbf{w}), \sigma^2)$.
❑ If we knew the value of $\mathbf{w}$, then once we have a new $\mathbf{x}_*$ we can predict the output as $f(\mathbf{x}_*, \mathbf{w}) = \mathbf{w}^\top\mathbf{x}_*$.
❑ $\sigma^2$ tells us the noise variance.

Gaussian regression model (III)
[Figure: the regression function $f(x, \mathbf{w})$ with the pdf $p(y|x_i, \mathbf{w}, \sigma^2)$, of width $2\sigma$, centred at $f(x_i, \mathbf{w})$.]

How do we estimate w? (I)
❑ We start with a training dataset $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$.
❑ We assume that the random variables $Y_1, \ldots, Y_N$ are independent:
$$p(y_1, \ldots, y_N|\mathbf{x}_1, \ldots, \mathbf{x}_N) = p(y_1|\mathbf{x}_1)\cdots p(y_N|\mathbf{x}_N) = \prod_{n=1}^N p(y_n|\mathbf{x}_n).$$
❑ We also assume that the RVs $Y_1, \ldots, Y_N$ follow an identical distribution, Gaussian in this case:
$$p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2) = \mathcal{N}(y_n|f(\mathbf{x}_n, \mathbf{w}), \sigma^2) = \mathcal{N}(y_n|\mathbf{w}^\top\mathbf{x}_n, \sigma^2).$$
❑ Both assumptions go by the name of the iid assumption: independent and identically distributed.

How do we estimate w? (II)
❑ Putting both assumptions together, we get
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^N p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2) = \prod_{n=1}^N \mathcal{N}(y_n|\mathbf{w}^\top\mathbf{x}_n, \sigma^2),$$
where $\mathbf{y} = [y_1, \ldots, y_N]^\top \in \mathbb{R}^{N\times 1}$ and $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^\top \in \mathbb{R}^{N\times(D+1)}$.
❑ The expression above can then be written as
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_n - \mathbf{w}^\top\mathbf{x}_n)^2}{2\sigma^2} \right\} = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2 \right\}.$$

How do we estimate w? (III)
❑ When we look at a Gaussian pdf, like
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\},$$
we assume that both $\mu$ and $\sigma^2$ are given. In this case, the pdf satisfies all the properties we reviewed before.
❑ The same is true for
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^N p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2) = \prod_{n=1}^N \mathcal{N}(y_n|\mathbf{w}^\top\mathbf{x}_n, \sigma^2).$$
❑ Given $\mathbf{w}^\top\mathbf{x}_n$ and $\sigma^2$, each $p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2)$ is a pdf.
❑ A different approach would be to say: I have some data for $\{y_n\}_{n=1}^N$ and $\{\mathbf{x}_n\}_{n=1}^N$, but
– "I don't know what $\mathbf{w}$ is (therefore I don't know what $\mathbf{w}^\top\mathbf{x}_n$ is)"
– "I don't know what $\sigma^2$ is".

How do we estimate w? (IV)
❑ With $y_n$ and $\mathbf{x}_n$ given but with unknown values for $\mathbf{w}$ and $\sigma^2$, each $p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2)$ is not a pdf anymore.
❑ In that case, the function
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^N \mathcal{N}(y_n|\mathbf{w}^\top\mathbf{x}_n, \sigma^2)$$
receives the name of a likelihood function.
❑ We can think of a likelihood function as a function of the parameters $\mathbf{w}$ and $\sigma^2$, $h(\mathbf{w}, \sigma^2) = p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)$.
❑ Subsequently, we can use multivariate calculus to find the values of $\mathbf{w}$ and $\sigma^2$ that maximise $h(\mathbf{w}, \sigma^2)$.
❑ In statistics, this is known as the maximum-likelihood (ML) criterion for estimating parameters.

How do we estimate w? (V)
❑ Given $\mathbf{y}$ and $\mathbf{X}$, we use the ML criterion to find the parameters $\mathbf{w}$ and $\sigma^2$ that maximise
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2 \right\}.$$
❑ In practice, we prefer to maximise the log of the likelihood $p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)$:
$$LL(\mathbf{w}, \sigma^2) = \log p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2.$$

How do we estimate w? (VI)
❑ We can also minimise the negative log of the likelihood $p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)$:
$$NLL(\mathbf{w}, \sigma^2) = -LL(\mathbf{w}, \sigma^2) = \frac{N}{2}\log(2\pi) + \frac{N}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2.$$
❑ Consistency of the ML criterion: if the data really was generated according to the probability we specified, the correct parameters will be recovered in the limit as $N \to \infty$.

Connection with the sum of squared errors
❑ If we multiply $LL(\mathbf{w}, \sigma^2)$ by minus one, we get
$$E(\mathbf{w}, \sigma^2) = -\log p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) \propto \sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2.$$
❑ The ML criterion for this model has a close connection with the sum-of-squared errors used in non-probabilistic formulations of linear regression.
❑ Maximising the log-likelihood function is equivalent to minimising the sum-of-squared errors.
❑ Notice that the log is a monotonic function, meaning that if we find $\mathbf{w}, \sigma^2$ that maximise $h(\mathbf{w}, \sigma^2)$, they will also maximise $\log(h(\mathbf{w}, \sigma^2))$.

Normal equation (I) (to be reviewed in the tutorial)
❑ Let us find an estimate for $\mathbf{w}$. From what we saw before,
$$LL(\mathbf{w}, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2.$$
❑ Using what we reviewed in the section on vector/matrix notation, it can be shown that this expression can be written in a vectorial form as
$$LL(\mathbf{w}, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{Xw})^\top(\mathbf{y} - \mathbf{Xw}).$$
❑ Let us focus on the term $(\mathbf{y} - \mathbf{Xw})^\top(\mathbf{y} - \mathbf{Xw})$:
$$(\mathbf{y} - \mathbf{Xw})^\top(\mathbf{y} - \mathbf{Xw}) = \mathbf{y}^\top\mathbf{y} - \mathbf{w}^\top\mathbf{X}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{Xw} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{Xw}.$$

Normal equation (II) (to be reviewed in the tutorial)
❑ We can find the $\mathbf{w}$ that maximises $LL(\mathbf{w}, \sigma^2)$ by taking the gradient $dLL(\mathbf{w}, \sigma^2)/d\mathbf{w}$, equating it to zero and solving for $\mathbf{w}$.
❑ Taking the gradient of each term in $LL(\mathbf{w}, \sigma^2)$ wrt $\mathbf{w}$, we get
$$\frac{d}{d\mathbf{w}}\left[-\frac{N}{2}\log(2\pi)\right] = \mathbf{0}, \qquad \frac{d}{d\mathbf{w}}\left[-\frac{N}{2}\log\sigma^2\right] = \mathbf{0}, \qquad \frac{d}{d\mathbf{w}}\left[-\frac{1}{2\sigma^2}\mathbf{y}^\top\mathbf{y}\right] = \mathbf{0},$$
$$\frac{d}{d\mathbf{w}}\left[\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}\right] = \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y}, \qquad \frac{d}{d\mathbf{w}}\left[\frac{1}{2\sigma^2}\mathbf{y}^\top\mathbf{Xw}\right] = \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y}, \qquad \frac{d}{d\mathbf{w}}\left[-\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{X}^\top\mathbf{Xw}\right] = -\frac{1}{2\sigma^2}2\mathbf{X}^\top\mathbf{Xw}.$$

Normal equation (III) (to be reviewed in the tutorial)
❑ Putting these terms together, we get
$$\frac{d}{d\mathbf{w}}LL(\mathbf{w}, \sigma^2) = \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y} + \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y} - \frac{1}{2\sigma^2}2\mathbf{X}^\top\mathbf{Xw} = \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{y} - \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{Xw}.$$
❑ Now, equating to zero and solving for $\mathbf{w}$, we get
$$\frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{y} - \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{Xw} = \mathbf{0} \;\Rightarrow\; \mathbf{X}^\top\mathbf{Xw} = \mathbf{X}^\top\mathbf{y} \;\Rightarrow\; \mathbf{w}_* = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}.$$
❑ The expression for $\mathbf{w}_*$ is known as the normal equation.
❑ The solution for $\mathbf{w}_*$ exists if we can compute $(\mathbf{X}^\top\mathbf{X})^{-1}$.
❑ The inverse can be computed as long as $\mathbf{X}^\top\mathbf{X}$ is non-singular (e.g. its determinant is different from zero, or it has full rank).
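A minimal numpy sketch of the normal equation, on synthetic data invented for illustration. Rather than forming the inverse explicitly, it solves the linear system $\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$ with `np.linalg.solve`, which is the numerically preferred route:

```python
import numpy as np

# Synthetic regression data (invented for illustration).
rng = np.random.default_rng(1)
N, D = 50, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # first column is x0 = 1
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# Normal equation: solve X'X w = X'y for w*.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance (from the formula on the next slide).
resid = y - X @ w_star
sigma2_star = resid @ resid / N
```

With low noise and enough data, `w_star` lands close to the generating parameters, and `sigma2_star` close to the noise variance used in the simulation.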
Solving for σ∗²
❑ Following a similar procedure, it can be shown that the ML solution for $\sigma_*^2$ is given as
$$\sigma_*^2 = \frac{1}{N}(\mathbf{y} - \mathbf{X}\mathbf{w}_*)^\top(\mathbf{y} - \mathbf{X}\mathbf{w}_*).$$

Basis functions
❑ The model that is linear in $\mathbf{x}$ only allows linear relationships between $\mathbf{x}$ and $y$.
❑ We can extend the model to describe non-linear relationships between the inputs and the output by using basis functions, non-linear mappings from inputs to outputs.
❑ However, we keep the linear relationship of $y$ wrt $\mathbf{w}$ for tractability.
❑ The predictive model follows as
$$f(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^M w_i \phi_i(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}),$$
where the $\phi_i(\mathbf{x})$ are basis functions, we have $M + 1$ parameters in the vector $\mathbf{w}$, and $\boldsymbol{\phi}(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \ldots, \phi_M(\mathbf{x})]^\top$.

Examples of basis functions
[Figure: three panels showing polynomial, exponential and sigmoidal basis functions on the interval $[-1, 1]$.]
❑ Polynomial: $\phi_i(x) = x^i$.
❑ Exponential: $\phi_i(x) = \exp\left\{-\frac{(x - \mu_i)^2}{2s^2}\right\}$.
❑ Sigmoidal: $\phi_i(x) = \sigma\left(\frac{x - \mu_i}{s}\right)$, with $\sigma(a) = 1/(1 + \exp(-a))$.

Transforming the input using the basis functions
❑ As an example, let us use polynomial basis functions to predict $y$, the time in seconds in the 100 mt Olympics competition.
❑ For each $x$ (year of the competition), we now compute the vector of polynomial basis functions
$$\boldsymbol{\phi}(x) = [1, x, x^2, x^3, \ldots, x^M]^\top.$$
❑ We have converted the unidimensional input feature $x$ into a higher-dimensional feature representation $\boldsymbol{\phi}(x) \in \mathbb{R}^{M+1}$.

Normal equations with a design matrix
❑ Given $\mathbf{X}$, we first compute a new design matrix $\boldsymbol{\Phi}$:
$$\boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\phi}(\mathbf{x}_1)^\top \\ \boldsymbol{\phi}(\mathbf{x}_2)^\top \\ \vdots \\ \boldsymbol{\phi}(\mathbf{x}_N)^\top \end{bmatrix} = \begin{bmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_M(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_M(\mathbf{x}_2) \\ \vdots & & & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_M(\mathbf{x}_N) \end{bmatrix}.$$
❑ We can now use $(\mathbf{y}, \boldsymbol{\Phi})$ and write the Gaussian linear regression problem as
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^N \mathcal{N}(y_n|\mathbf{w}^\top\boldsymbol{\phi}_n, \sigma^2),$$
where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$.
❑ Using the ML criterion, we arrive at the following normal equation:
$$\mathbf{w}_* = \left(\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^\top\mathbf{y}.$$
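A sketch of the design-matrix route with polynomial basis functions, again on invented 1-d data (the inputs stand in for rescaled competition years; the true coefficients are arbitrary). `np.vander` with `increasing=True` builds the columns $1, x, \ldots, x^M$:

```python
import numpy as np

# Invented 1-d inputs standing in for (rescaled) competition years.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 40)
y = 11.0 - 1.3 * x + 0.4 * x**3 + rng.normal(scale=0.05, size=x.size)

M = 3  # polynomial order
Phi = np.vander(x, M + 1, increasing=True)  # rows are phi(x_n)' = [1, x, x^2, x^3]

# Normal equation with the design matrix: solve Phi'Phi w = Phi'y.
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```

The fit is still linear in $\mathbf{w}$ even though it is cubic in $x$, which is exactly why the normal equation carries over unchanged with $\boldsymbol{\Phi}$ in place of $\mathbf{X}$.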
Olympic 100-mt data with M = 5
[Figure: the "Male 100 mts" dataset with a fitted polynomial of order $M = 5$ overlaid.]

Alternative to find w
❑ To solve the normal equation, we need to invert $\mathbf{X}^\top\mathbf{X}$.
❑ This inversion has a computational complexity between $O((D+1)^{2.4})$ and $O((D+1)^3)$, depending on the implementation.
❑ The normal equation is linear in the number of instances in the training data, $O(N)$.
❑ It can handle a large training set as long as it fits in memory.
❑ Alternatively, we can use iterative optimisation in cases with a large number of features, or with too many instances to fit in memory.

General problem
❑ We are given a function $h(\mathbf{w})$, where $\mathbf{w} \in \mathbb{R}^p$.
❑ Aim: to find a value of $\mathbf{w}$ that minimises $h(\mathbf{w})$.
❑ We use an iterative procedure
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \eta\mathbf{d}_k,$$
where $\mathbf{d}_k$ is known as the search direction and is chosen such that $h(\mathbf{w}_{k+1}) < h(\mathbf{w}_k)$.
❑ The parameter $\eta$ is known as the step size or learning rate.

Gradient descent
❑ Perhaps the simplest algorithm for unconstrained optimisation.
❑ It assumes that $\mathbf{d}_k = -\mathbf{g}_k$, where $\mathbf{g}_k = \mathbf{g}(\mathbf{w}_k)$ is the gradient at $\mathbf{w}_k$.
❑ Also known as steepest descent.
❑ It can be written as $\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\mathbf{g}_k$.

Step size
❑ The main issue in gradient descent is how to set the step size.
❑ If it is too small, convergence will be very slow. If it is too large, the method can fail to converge at all.
[Figure: gradient descent iterates on $h(w_1, w_2) = 0.5(w_1^2 - w_2)^2 + 0.5(w_1 - 1)^2$, whose minimum is at $(1, 1)$. In (a) $\eta = 0.1$; in (b) $\eta = 0.6$.]

Alternatives to choose the step size η
❑ Line search methods (there are different alternatives).
❑ Line search methods may use search directions other than the steepest descent direction.
❑ Conjugate gradient (the method of choice for quadratic objectives $g(\mathbf{w}) = \mathbf{w}^\top\mathbf{Aw}$).
❑ Use a Newton search direction.

Gradient descent for linear regression (I)
❑ For simplicity, let us assume that the objective function $h(\mathbf{w})$ corresponds to the mean squared error
$$E(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2.$$
❑ We could also minimise the negative log-likelihood $NLL(\mathbf{w})$ instead.
❑ We write the update equation as
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\left.\frac{d}{d\mathbf{w}}E(\mathbf{w})\right|_{\mathbf{w} = \mathbf{w}_k}.$$

Gradient descent for linear regression (II)
❑ Computing the gradient of $E(\mathbf{w})$, we get
$$\frac{d}{d\mathbf{w}}E(\mathbf{w}) = \frac{2}{N}\sum_{n=1}^N (\mathbf{w}^\top\mathbf{x}_n - y_n)\mathbf{x}_n = \frac{2}{N}\mathbf{X}^\top(\mathbf{Xw} - \mathbf{y}).$$
❑ The update equation follows as
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\frac{2}{N}\mathbf{X}^\top(\mathbf{X}\mathbf{w}_k - \mathbf{y}).$$
❑ The computation of the gradient involves using the whole dataset $(\mathbf{X}, \mathbf{y})$ at every step.
❑ For this reason, this algorithm is known as batch gradient descent.

Gradient descent and feature scaling
❑ Always normalise the features when using gradient descent.
❑ Gradient descent converges faster if all features have a similar scale.
❑ If the attributes are on very different scales, it may take a long time to converge.

Online learning and large datasets
❑ Traditionally in machine learning, the gradient $\mathbf{g}_k$ is computed using the whole dataset $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^N$.
❑ There are settings, though, where only a subset of the data can be used.
❑ Online learning: the instances $(\mathbf{x}_n, y_n)$ appear one at a time.
❑ Large datasets: computing the exact value of $\mathbf{g}_k$ would be expensive, if not impossible.

Stochastic gradient descent (I)
❑ In stochastic gradient descent (SGD), the gradient $\mathbf{g}_k$ is computed using a subset of the available instances.
❑ The word stochastic refers to the fact that the value of $\mathbf{g}_k$ depends on the subset of instances chosen for the computation.

Stochastic gradient descent (II)
❑ In the stochastic setting, a better estimate can be found if the gradient is computed using
$$\mathbf{g}_k = \frac{1}{|S|}\sum_{i \in S}\mathbf{g}_{k,i},$$
where $S \subseteq \mathcal{D}$, $|S|$ is the cardinality of $S$, and $\mathbf{g}_{k,i}$ is the gradient at iteration $k$ computed using the instance $(\mathbf{x}_i, y_i)$.
❑ This setting is called mini-batch gradient descent.

Step size in SGD
❑ Choosing the value of $\eta$ is particularly important in SGD, since there is no easy way to compute it.
❑ Usually the value of $\eta$ will depend on the iteration $k$, giving $\eta_k$.
❑ It should follow the Robbins–Monro conditions:
$$\sum_{k=1}^\infty \eta_k = \infty, \qquad \sum_{k=1}^\infty \eta_k^2 < \infty.$$
❑ Various formulas for $\eta_k$ can be used, e.g.
$$\eta_k = \frac{1}{k}, \qquad \eta_k = \frac{1}{(\tau_0 + k)^\kappa},$$
where $\tau_0$ slows down early iterations and $\kappa \in (0.5, 1]$.

What is regularisation?
❑ It refers to a technique used for preventing overfitting in a predictive model.
❑ It consists in adding a term (a regulariser) to the objective function that encourages simpler solutions.
❑ With regularisation, the objective function for linear regression would be
$$h(\mathbf{w}) = E(\mathbf{w}) + \lambda R(\mathbf{w}),$$
where $R(\mathbf{w})$ is the regularisation term and $\lambda$ the regularisation parameter.
❑ In the expression for $h(\mathbf{w})$, we can use $NLL(\mathbf{w})$ instead of $E(\mathbf{w})$.
❑ If $\lambda = 0$, we get $h(\mathbf{w}) = E(\mathbf{w})$.

Different types of regularisation
❑ The objective function for linear regression would be $h(\mathbf{w}) = E(\mathbf{w}) + \lambda R(\mathbf{w})$, where $R(\mathbf{w})$ follows as
$$R(\mathbf{w}) = \alpha\|\mathbf{w}\|_1 + \frac{1 - \alpha}{2}\|\mathbf{w}\|_2^2,$$
with $\|\mathbf{w}\|_1 = \sum_{m=1}^p |w_m|$ and $\|\mathbf{w}\|_2^2 = \sum_{m=1}^p w_m^2$.
❑ If $\alpha = 1$, we get $\ell_1$ regularisation.
❑ If $\alpha = 0$, we get $\ell_2$ regularisation.
❑ If $0 < \alpha < 1$, we get elastic net regularisation.
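The pieces above — the mini-batch gradient, the decaying step size $\eta_k = 1/(\tau_0 + k)^\kappa$, and an $\ell_2$ regulariser — can be combined into one short sketch. The data, the schedule constants $\tau_0$, $\kappa$, the batch size and $\lambda$ are all invented for illustration:

```python
import numpy as np

# Invented data for a 2-parameter linear model.
rng = np.random.default_rng(3)
N = 1000
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 1))])
w_true = np.array([1.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=N)

lam = 1e-3               # regularisation parameter (assumed)
tau0, kappa = 10.0, 0.6  # schedule constants (assumed), kappa in (0.5, 1]
batch = 32               # mini-batch size |S| (assumed)

w = np.zeros(2)
for k in range(1, 2001):
    S = rng.choice(N, size=batch, replace=False)          # random mini-batch S ⊆ D
    grad_mse = 2 / batch * X[S].T @ (X[S] @ w - y[S])     # mini-batch MSE gradient
    grad_reg = lam * w                                    # gradient of (λ/2) w'w
    eta_k = 1.0 / (tau0 + k) ** kappa                     # decaying step size
    w = w - eta_k * (grad_mse + grad_reg)
```

Note this toy version regularises the bias $w_0$ along with the rest of $\mathbf{w}$ for brevity; in practice the intercept is often left unpenalised.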
Ridge regression or ℓ2 regularisation
❑ In ridge regression, $\alpha = 0$:
$$h(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^N (y_n - \mathbf{w}^\top\mathbf{x}_n)^2 + \frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}.$$
❑ It can be shown that an optimal solution for $\mathbf{w}_*$ is given as
$$\mathbf{w}_* = \left(\mathbf{X}^\top\mathbf{X} + \frac{\lambda N}{2}\mathbf{I}\right)^{-1}\mathbf{X}^\top\mathbf{y}.$$
❑ Notice that we can also use an iterative procedure for optimising $h(\mathbf{w})$, either through batch gradient descent, SGD or mini-batch SGD.

Probabilistic classifier
❑ A logistic regression model is an example of a probabilistic classifier.
❑ Let $\mathbf{x} \in \mathbb{R}^D$ represent a feature vector and $y$ the target value.
❑ For a binary classification problem, we can use $y \in \{0, 1\}$ or $y \in \{-1, +1\}$.
❑ We model the relationship between $y$ and $\mathbf{x}$ using a Bernoulli distribution.

Bernoulli distribution (I)
❑ A Bernoulli random variable $Y$ is a random variable that can only take two possible values.
❑ For example, the random variable $Y$ associated with the experiment of tossing a coin.
❑ The outcome "heads" is assigned 1 ($Y = 1$), and the outcome "tails" is assigned 0 ($Y = 0$).

Bernoulli distribution (II)
❑ A Bernoulli distribution is a probability distribution for $Y$, expressed as
$$p(Y = y) = \operatorname{Ber}(y|\mu) = \begin{cases} \mu & y = 1, \\ 1 - \mu & y = 0, \end{cases}$$
where $\mu = P(Y = 1)$.
❑ The expression above can be summarised in one line using
$$p(Y = y) = \operatorname{Ber}(y|\mu) = \mu^y(1 - \mu)^{1-y}.$$

How are y and x related in logistic regression?
❑ The target feature $y$ follows a Bernoulli distribution $p(y|\mathbf{x}) = \operatorname{Ber}(y|\mu(\mathbf{x}))$.
❑ Notice how the probability $\mu = P(y = 1)$ explicitly depends on $\mathbf{x}$.
❑ In logistic regression, the probability $\mu(\mathbf{x})$ is given as
$$\mu(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^\top\mathbf{x})} = \sigma(\mathbf{w}^\top\mathbf{x}),$$
where $\sigma(z)$ is known as the logistic sigmoid function.
❑ We then have $p(y|\mathbf{w}, \mathbf{x}) = \operatorname{Ber}(y|\sigma(\mathbf{w}^\top\mathbf{x}))$.
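The two ingredients just introduced — the logistic sigmoid and the one-line Bernoulli pmf — are short enough to check numerically. A minimal sketch (the value $\mu = 0.7$ is arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_pmf(y, mu):
    # One-line Bernoulli pmf: mu^y * (1 - mu)^(1 - y).
    return mu**y * (1 - mu) ** (1 - y)

# Limiting behaviour of the sigmoid: 0.5 at z = 0, saturating at 0 and 1.
print(sigmoid(0.0))   # 0.5
print(sigmoid(10.0))  # close to 1
print(sigmoid(-10.0)) # close to 0

# The one-line pmf reproduces the case-by-case definition.
mu = 0.7
print(bernoulli_pmf(1, mu))  # mu
print(bernoulli_pmf(0, mu))  # 1 - mu
```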
The logistic sigmoid function σ(z)
[Figure: the logistic sigmoid $\sigma(z)$ for $z \in [-10, 10]$, rising from 0 towards 1.]
❑ Recall $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$.
❑ If $z \to \infty$, $\sigma(z) \to 1$. If $z \to -\infty$, $\sigma(z) \to 0$. If $z = 0$, $\sigma(z) = 0.5$.

The logistic sigmoid function σ(w⊤x)
[Figure: $\sigma(wx)$ for $x \in [-10, 10]$.]
❑ We have $z = \mathbf{w}^\top\mathbf{x}$. For simplicity, assume $\mathbf{x} = x$, giving $\sigma(wx)$.
❑ Recall $\sigma(wx) = \dfrac{1}{1 + \exp(-wx)}$.
❑ So $\left.\dfrac{d\sigma(wx)}{dx}\right|_{x=0} = \dfrac{w}{4}$.

The logistic sigmoid function in 2d
[Figure: surface plots of $\sigma(w_1 x_1 + w_2 x_2)$, with $\mathbf{w} = [w_1\; w_2]^\top$, for several weight vectors, e.g. $\mathbf{w} = (1, 4)$, $(5, 4)$, $(-2, 3)$, $(0, 2)$, $(2, 2)$, $(5, 1)$, $(1, 0)$, $(3, 0)$, $(-2, -1)$, $(2, -2)$.]

Decision boundary
❑ After the training phase, we will have an estimator for $\mathbf{w}$.
❑ For a test input vector $\mathbf{x}_*$, we compute $p(y = 1|\mathbf{w}, \mathbf{x}_*) = \sigma(\mathbf{w}^\top\mathbf{x}_*)$.
❑ This gives us a value between 0 and 1.
❑ We define a threshold of 0.5 to decide which class to assign $\mathbf{x}_*$ to.
❑ With this threshold, we induce a linear decision boundary in the input space.

Cross-entropy error function
❑ Let $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^N$.
❑ We write $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_N]^\top$ and $\mathbf{y} = [y_1 \cdots y_N]^\top$.
❑ Assuming IID observations,
$$p(\mathbf{y}|\mathbf{w}, \mathbf{X}) = \prod_{n=1}^N p(y_n|\mathbf{w}, \mathbf{x}_n) = \prod_{n=1}^N \operatorname{Ber}(y_n|\sigma(\mathbf{w}^\top\mathbf{x}_n)).$$
❑ The cross-entropy function, or negative log-likelihood, is given as
$$NLL(\mathbf{w}) = -\log p(\mathbf{y}|\mathbf{w}, \mathbf{X}) = -\sum_{n=1}^N \left\{ y_n\log[\sigma(\mathbf{w}^\top\mathbf{x}_n)] + (1 - y_n)\log[1 - \sigma(\mathbf{w}^\top\mathbf{x}_n)] \right\},$$
which can be minimised with respect to $\mathbf{w}$.

Gradient of NLL(w)
❑ It can be shown that the gradient $\mathbf{g}(\mathbf{w})$ of $NLL(\mathbf{w})$ is given as
$$\mathbf{g}(\mathbf{w}) = \frac{d}{d\mathbf{w}}NLL(\mathbf{w}) = \sum_{n=1}^N [\sigma(\mathbf{w}^\top\mathbf{x}_n) - y_n]\mathbf{x}_n = \mathbf{X}^\top(\boldsymbol{\sigma} - \mathbf{y}),$$
where $\boldsymbol{\sigma} = [\sigma(\mathbf{w}^\top\mathbf{x}_1) \cdots \sigma(\mathbf{w}^\top\mathbf{x}_N)]^\top$.

Optimisation and regularisation
❑ The SGD methods described above can be applied to find the parameter vector $\mathbf{w}$ that minimises the negative log-likelihood $NLL(\mathbf{w})$ in logistic regression.
❑ Likewise, all the regularisation techniques we saw before can also be added to the cross-entropy error function.
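The logistic-regression pieces can be assembled into one end-to-end sketch: the cross-entropy $NLL(\mathbf{w})$, its gradient $\mathbf{X}^\top(\boldsymbol{\sigma} - \mathbf{y})$, and plain batch gradient descent. The data, the step size and the iteration count are all invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    # Cross-entropy / negative log-likelihood.
    mu = sigmoid(X @ w)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def grad_nll(w, X, y):
    # Gradient g(w) = X'(sigma - y).
    return X.T @ (sigmoid(X @ w) - y)

# Invented binary-classification data sampled from the model itself.
rng = np.random.default_rng(4)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 1))])
w_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=N) < sigmoid(X @ w_true)).astype(float)

# Batch gradient descent on the cross-entropy (step size chosen by hand).
w = np.zeros(2)
for _ in range(2000):
    w = w - 0.01 * grad_nll(w, X, y)

# Class assignment for test inputs via the 0.5 threshold.
y_hat = (sigmoid(X @ w) >= 0.5).astype(float)
```

Each descent step strictly lowers the cross-entropy here because the objective is convex and the step size is small; the final `w` recovers the sign and rough scale of the generating parameters.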