Linear models
Mauricio A. Álvarez
Foundations of Machine Learning, The University of Manchester

Contents
Review of vector/matrix notation and linear algebra
A regression model
Linear regression
Gradient descent
Stochastic Gradient Descent
Regularisation
Logistic regression
Cross-entropy error function
Optimisation and regularisation

Scalar and vectors
❑ A scalar is just a numeric value like 0.9 or −18.7.
❑ Scalars are usually denoted as lower case letters like $x$ or $a$.
❑ A vector is an ordered list of scalar values. Sometimes we refer to these scalar values of the vector as attributes or entries of the vector.
❑ Vectors are usually denoted by bold lowercase letters like $\mathbf{x}$ or $\mathbf{y}$.

Vectors
❑ A vector can appear sometimes written as a row vector, e.g. $\mathbf{x} = [x_1, x_2, x_3, x_4, x_5]$, or as a column vector
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}.$$
❑ In this module, ALL vectors will be column vectors by default. So, when you see a vector, e.g. $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$, always think this vector has a column-wise shape.

Matrices
❑ A matrix is a rectangular array of scalars arranged in rows and columns.
❑ Matrices are usually denoted by bold uppercase letters, e.g. $\mathbf{X}$ or $\mathbf{Y}$.
❑ The following matrix has three rows and two columns
$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}.$$
❑ The entries in the matrix above are of the form $x_{ij}$, where the first subindex $i$ indicates the row of the element and the second subindex $j$ indicates the column.

Matrix transpose
❑ Let $\mathbf{X}$ be a matrix with elements $x_{ij}$.
❑ The transpose of a matrix $\mathbf{X}$ is a new matrix $\mathbf{X}^\top$ with elements $x_{ji}$, e.g.
$$\mathbf{X} = \begin{bmatrix} 4.1 & -5.6 \\ -2.6 & 7.9 \\ 3.5 & 1.8 \end{bmatrix}, \qquad \mathbf{X}^\top = \begin{bmatrix} 4.1 & -2.6 & 3.5 \\ -5.6 & 7.9 & 1.8 \end{bmatrix}.$$

Matrix multiplication
❑ Let $\mathbf{A}$ be a matrix with entries $a_{ik}$ of dimensions $p \times q$.
❑ Let $\mathbf{B}$ be a matrix with entries $b_{kj}$ of dimensions $t \times s$.
❑ Matrix multiplication of the form $\mathbf{A}\mathbf{B}$ is only possible if $q = t$.
❑ If this is the case, the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ has dimensions $p \times s$ with entries
$$c_{ij} = \sum_{k} a_{ik} b_{kj}.$$

Transpose of a product
❑ Let $\mathbf{w}$ be a vector of dimensions $d \times 1$. Let $\mathbf{X}$ be a matrix with dimensions $n \times d$.
❑ The transpose of the product $\mathbf{X}\mathbf{w}$ is $(\mathbf{X}\mathbf{w})^\top = \mathbf{w}^\top \mathbf{X}^\top$.
❑ We can apply this result to a product of several matrices,
$$(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D})^\top = ((\mathbf{A}\mathbf{B})(\mathbf{C}\mathbf{D}))^\top = (\mathbf{C}\mathbf{D})^\top (\mathbf{A}\mathbf{B})^\top = \mathbf{D}^\top \mathbf{C}^\top \mathbf{B}^\top \mathbf{A}^\top.$$

Two common types of products
❑ Inner product. The inner product between two vectors results in a scalar.
❑ Let $\mathbf{x}$ and $\mathbf{y}$ be vectors of dimension $m \times 1$. The inner product is given as
$$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{m} x_i y_i.$$
❑ Outer product. The outer product between two vectors results in a matrix.
❑ Let $\mathbf{x}$ be a vector of dimension $m \times 1$ and $\mathbf{y}$ a vector of dimension $p \times 1$. The outer product is given as
$$\mathbf{x}\mathbf{y}^\top = \begin{bmatrix} x_1 y_1 & \cdots & x_1 y_p \\ x_2 y_1 & \cdots & x_2 y_p \\ \vdots & \ddots & \vdots \\ x_m y_1 & \cdots & x_m y_p \end{bmatrix}.$$

From a scalar operation to a vector operation
❑ It is usually desirable to transform a scalar operation into a vector operation.
❑ When coding scalar operations, we require using loops, which can be expensive.
❑ In contrast, vector operations are handled efficiently by low-level routines already included in modules like numpy.
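As a minimal illustration of this point (a sketch, not part of the slides; the array sizes and random values are arbitrary), the snippet below computes the inner product with an explicit loop and with numpy's vectorised routines, and also forms the outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
x = rng.normal(size=m)   # treated as an m x 1 column vector
y = rng.normal(size=m)

# Scalar (loop-based) inner product: sum_i x_i * y_i
inner_loop = 0.0
for i in range(m):
    inner_loop += x[i] * y[i]

# Vectorised inner product, handled by numpy's low-level routines
inner_vec = x @ y

# Outer product x y^T: an m x m matrix with entries x_i * y_j
outer = np.outer(x, y)

print(inner_loop, inner_vec)   # identical up to floating-point rounding
print(outer.shape)             # (5, 5)
```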
Example (to be reviewed in the tutorial)
Write the following scalar operation in a vector/matrix form
$$\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} x_{ij} w_j \Big)^2.$$

Answer (I) (to be reviewed in the tutorial)
❑ The sum above can be written as
$$\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} x_{ij} w_j \Big)^2 = \Big( y_1 - \sum_{j=1}^{d} x_{1j} w_j \Big)\Big( y_1 - \sum_{j=1}^{d} x_{1j} w_j \Big) + \cdots + \Big( y_n - \sum_{j=1}^{d} x_{nj} w_j \Big)\Big( y_n - \sum_{j=1}^{d} x_{nj} w_j \Big).$$
❑ Let us define a vector $\mathbf{v}$ of dimensions $n \times 1$ with entries given as $y_i - \sum_{j=1}^{d} x_{ij} w_j$.

Answer (II) (to be reviewed in the tutorial)
❑ The product of vectors $\mathbf{v}^\top \mathbf{v}$ gives the same result as the required sum,
$$\mathbf{v}^\top \mathbf{v} = \Big[ \big(y_1 - \textstyle\sum_{j=1}^{d} x_{1j} w_j\big) \;\cdots\; \big(y_n - \textstyle\sum_{j=1}^{d} x_{nj} w_j\big) \Big] \begin{bmatrix} y_1 - \sum_{j=1}^{d} x_{1j} w_j \\ \vdots \\ y_n - \sum_{j=1}^{d} x_{nj} w_j \end{bmatrix} = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} x_{ij} w_j \Big)^2.$$
❑ How do we express the elements in $\mathbf{v}$ with vectors and matrices?

Answer (III) (to be reviewed in the tutorial)
❑ For a fixed $i$, the entries $x_{i1}, \ldots, x_{id}$ can be grouped into a row vector $\mathbf{x}_i^\top$.
❑ The internal sums in the entries of $\mathbf{v}$ can then be written as
$$\sum_{j=1}^{d} x_{ij} w_j = \mathbf{x}_i^\top \mathbf{w} = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}.$$
❑ We can now write $\mathbf{v}$ as
$$\mathbf{v} = \begin{bmatrix} y_1 - \mathbf{x}_1^\top \mathbf{w} \\ \vdots \\ y_n - \mathbf{x}_n^\top \mathbf{w} \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} \mathbf{x}_1^\top \mathbf{w} \\ \vdots \\ \mathbf{x}_n^\top \mathbf{w} \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{bmatrix} \mathbf{w}.$$
❑ We can group the scalars $y_1, \ldots, y_n$ into a vector $\mathbf{y}$.
❑ We can group the row vectors $\mathbf{x}_1^\top, \ldots, \mathbf{x}_n^\top$ into a matrix $\mathbf{X}$.

Answer (IV) (to be reviewed in the tutorial)
❑ It means that $\mathbf{v} = \mathbf{y} - \mathbf{X}\mathbf{w}$.
❑ Finally,
$$\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{d} x_{ij} w_j \Big)^2 = \mathbf{v}^\top \mathbf{v} = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}).$$

Differentiating a function in a vector/matrix form (I)
❑ We will see cases in which a function $f(\mathbf{w})$ depends on some parameters grouped in a vector $\mathbf{w}$.
❑ We would like to find the vector of parameters $\mathbf{w}$ that maximises $f(\mathbf{w})$.
❑ For example, suppose $f(\mathbf{w})$ is defined as
$$f(\mathbf{w}) = \sum_{i=1}^{d} w_i x_i.$$
❑ We can group the scalars $x_1, \ldots, x_d$ into $\mathbf{x}$. Likewise for $\mathbf{w}$.
❑ According to what we saw before, we can write $f(\mathbf{w})$ as $f(\mathbf{w}) = \mathbf{x}^\top \mathbf{w}$.

Differentiating a function in a vector/matrix form (II)
❑ For a fixed $\mathbf{x}$, we are interested in computing the gradient of $f(\mathbf{w})$ with respect to $\mathbf{w}$,
$$\frac{d f(\mathbf{w})}{d \mathbf{w}} = \begin{bmatrix} \frac{\partial f(\mathbf{w})}{\partial w_1} \\ \vdots \\ \frac{\partial f(\mathbf{w})}{\partial w_d} \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} = \mathbf{x}.$$
❑ Some useful identities when differentiating with respect to a vector:
  $f(\mathbf{w}) = \mathbf{w}^\top \mathbf{x}$: $df(\mathbf{w})/d\mathbf{w} = \mathbf{x}$
  $f(\mathbf{w}) = \mathbf{x}^\top \mathbf{w}$: $df(\mathbf{w})/d\mathbf{w} = \mathbf{x}$
  $f(\mathbf{w}) = \mathbf{w}^\top \mathbf{w}$: $df(\mathbf{w})/d\mathbf{w} = 2\mathbf{w}$
  $f(\mathbf{w}) = \mathbf{w}^\top \mathbf{C}\mathbf{w}$: $df(\mathbf{w})/d\mathbf{w} = 2\mathbf{C}\mathbf{w}$ (for symmetric $\mathbf{C}$)

Identity matrix and the inverse of a matrix
❑ The identity matrix of size $N$ is a square matrix with ones on the main diagonal and zeros elsewhere, e.g.,
$$\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
❑ The inverse matrix of a matrix $\mathbf{A}$ of dimensions $d \times d$, denoted as $\mathbf{A}^{-1}$, satisfies $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_d$.

A regression model

Olympic 100 m data
Image from Wikimedia Commons http://bit.ly/191adDC.

Dataset
[Figure: male 100 m winning times (seconds) plotted against the year of the competition.]

Model
❑ We will use a linear model $f(x, \mathbf{w})$ to predict $y$, where $y$ is the time in seconds and $x$ the year of the competition.
❑ The linear model is given as $f(x, \mathbf{w}) = w_0 + w_1 x$, where $w_0$ is the intercept and $w_1$ is the slope.
❑ We use $\mathbf{w}$ to refer both to $w_0$ and $w_1$.

Data and model
[Figure: the male 100 m dataset shown together with the linear model.]

Linear regression

Linear model
❑ A simple model for regression consists in using a linear combination of the attributes to predict the output,
$$f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D,$$
where $w_0, w_1, \cdots, w_D$ are the parameters of the regression model.
❑ The term $w_0$ is the bias term or intercept, e.g. $f(\mathbf{0}, \mathbf{w}) = w_0$.
❑ The expression above can be written in a vectorial form,
$$f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x},$$
where we have defined $\mathbf{w} = [w_0, w_1, \cdots, w_D]^\top$ and $\mathbf{x} = [1, x_1, \cdots, x_D]^\top$.
❑ Notice that $x_0 = 1$.
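A small numerical check of the tutorial result above (an illustrative sketch, not part of the slides; the data is randomly generated just for the comparison): the double sum computed with explicit loops matches the vectorised form $(\mathbf{y}-\mathbf{X}\mathbf{w})^\top(\mathbf{y}-\mathbf{X}\mathbf{w})$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.normal(size=(n, d))   # rows are x_i^T
y = rng.normal(size=n)
w = rng.normal(size=d)

# Scalar form: sum_i ( y_i - sum_j x_ij w_j )^2, computed with loops
total = 0.0
for i in range(n):
    inner = 0.0
    for j in range(d):
        inner += X[i, j] * w[j]
    total += (y[i] - inner) ** 2

# Vector/matrix form: (y - Xw)^T (y - Xw)
v = y - X @ w
vector_form = v @ v

print(np.allclose(total, vector_form))  # True
```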
Parenthesis: Gaussian pdf
❑ The Gaussian pdf has the form
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\}.$$
❑ A Gaussian pdf requires two parameters, $\mu$ and $\sigma^2$, the mean and the variance of the RV $Y$.
❑ We denote the Gaussian pdf as $p(y|\mu, \sigma^2) = \mathcal{N}(y|\mu, \sigma^2)$ or $y \sim \mathcal{N}(\mu, \sigma^2)$.

Parenthesis: Gaussian pdf
[Figure: two Gaussian pdfs with mean $\mu = 2$ and variances $\sigma^2 = 0.5$ (solid) and $\sigma^2 = 2$ (dashed).]

Gaussian regression model (I)
❑ We use a Gaussian regression model to relate the inputs and outputs,
$$y = f(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
❑ It assumes that each output $y_i$ that we observe can be explained as the prediction of an underlying model, $f(\mathbf{x}_i, \mathbf{w})$, plus a noise term $\epsilon_i$.
❑ For a fixed $\mathbf{x}$ and a fixed $\mathbf{w}$, $f(\mathbf{x}, \mathbf{w})$ is a constant, then $y = \text{constant} + \epsilon$, where $\epsilon$ is a continuous RV.
❑ What is the pdf for $y$? (we are adding a constant to a Gaussian RV)
  – $\mathbb{E}\{y\} = \mathbb{E}\{\text{constant} + \epsilon\} = \text{constant}$
  – $\operatorname{var}\{y\} = \operatorname{var}\{\text{constant}\} + \operatorname{var}\{\epsilon\} = \sigma^2$.

Gaussian regression model (II)
❑ This means that $y \sim \mathcal{N}(\text{constant}, \sigma^2)$, where we said the constant was $f(\mathbf{x}, \mathbf{w})$, this is, $y \sim \mathcal{N}(f(\mathbf{x}, \mathbf{w}), \sigma^2)$.
❑ Because we assumed that $\mathbf{x}$ and $\mathbf{w}$ are given, we can also write $p(y|\mathbf{x}, \mathbf{w}, \sigma^2) = \mathcal{N}(y|f(\mathbf{x}, \mathbf{w}), \sigma^2)$.
❑ If we knew the value for $\mathbf{w}$, once we have a new $\mathbf{x}_*$, we can predict the output as $f(\mathbf{x}_*, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}_*$.
❑ $\sigma^2$ tells us the noise variance.

Gaussian regression model (III)
[Figure: the regression function $f(x, \mathbf{w})$ together with the Gaussian pdf $p(y|x_0, \mathbf{w}, \sigma^2)$, of width $2\sigma$, centred at $f(x_0, \mathbf{w})$.]

How do we estimate w? (I)
❑ We start with a training dataset $(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$.
❑ We assume that the random variables $Y_1, \cdots, Y_N$ are independent,
$$p(y_1, \cdots, y_N | \mathbf{x}_1, \cdots, \mathbf{x}_N) = p(y_1|\mathbf{x}_1) \cdots p(y_N|\mathbf{x}_N) = \prod_{n=1}^{N} p(y_n|\mathbf{x}_n).$$
❑ We also assume that the RVs $Y_1, \cdots, Y_N$ follow an identical distribution, Gaussian in this case,
$$p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2) = \mathcal{N}(y_n|f(\mathbf{x}_n, \mathbf{w}), \sigma^2) = \mathcal{N}(y_n|\mathbf{w}^\top \mathbf{x}_n, \sigma^2).$$
❑ Both assumptions go by the name of the iid assumption: independent and identically distributed.

How do we estimate w? (II)
❑ Putting both assumptions together, we get
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n|\mathbf{w}^\top \mathbf{x}_n, \sigma^2),$$
where $\mathbf{y} = [y_1, \cdots, y_N]^\top \in \mathbb{R}^{N \times 1}$ and $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_N]^\top \in \mathbb{R}^{N \times (D+1)}$.
❑ The expression above can then be written as
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n|\mathbf{w}^\top \mathbf{x}_n, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_n - \mathbf{w}^\top \mathbf{x}_n)^2}{2\sigma^2} \right\} = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 \right\}.$$

How do we estimate w? (III)
❑ When we look at a Gaussian pdf, like
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\},$$
we assume that both $\mu$ and $\sigma^2$ are given. In this case, the pdf follows all the properties we reviewed before.
❑ The same is true for
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n|\mathbf{w}^\top \mathbf{x}_n, \sigma^2).$$
❑ Given $\mathbf{w}^\top \mathbf{x}_n$ and $\sigma^2$, each $p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2)$ is a pdf.
❑ A different approach would be to say: I have some data for $\{y_n\}_{n=1}^N$ and $\{\mathbf{x}_n\}_{n=1}^N$ but
  – "I don't know what $\mathbf{w}$ is (therefore I don't know what $\mathbf{w}^\top \mathbf{x}_n$ is)"
  – "I don't know what $\sigma^2$ is".

How do we estimate w? (IV)
❑ With $y_n$ and $\mathbf{x}_n$ given but with unknown values for $\mathbf{w}$ and $\sigma^2$, each $p(y_n|\mathbf{x}_n, \mathbf{w}, \sigma^2)$ is not a pdf anymore.
❑ In that case, the function
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n|\mathbf{w}^\top \mathbf{x}_n, \sigma^2)$$
receives the name of a likelihood function.
❑ We can think of the likelihood as a function of the parameters $\mathbf{w}$ and $\sigma^2$, $h(\mathbf{w}, \sigma^2) = p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)$.
❑ And subsequently, we can use multivariate calculus to find the values of $\mathbf{w}, \sigma^2$ that maximise $h(\mathbf{w}, \sigma^2)$ (a small numerical illustration follows below).
❑ In statistics, this is known as the maximum-likelihood (ML) criterion to estimate parameters.
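As a rough illustration of the likelihood viewed as a function of the parameters (a sketch on synthetic data, not anything from the slides), the snippet below evaluates the log of $p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2)$ for a few candidate values of $\mathbf{w}$ and shows that it is largest near the value used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.hstack([np.ones((N, 1)), rng.uniform(-3, 3, size=(N, 1))])  # x_0 = 1 column plus one input
w_true = np.array([1.0, 2.0])
sigma2 = 0.25
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)

def log_likelihood(w, sigma2, X, y):
    """LL(w, sigma^2) = -N/2 log(2 pi) - N/2 log(sigma^2) - (1 / (2 sigma^2)) sum_n (y_n - w^T x_n)^2."""
    resid = y - X @ w
    N = len(y)
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * N * np.log(sigma2) - resid @ resid / (2 * sigma2)

for w1 in [0.0, 1.0, 2.0, 3.0]:
    w = np.array([1.0, w1])
    print(w1, log_likelihood(w, sigma2, X, y))
# The log-likelihood is highest for w1 close to the generating value 2.0.
```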
How do we estimate w? (V)
❑ Given $\mathbf{y}, \mathbf{X}$, we use the ML criterion to find the parameters $\mathbf{w}$ and $\sigma^2$ that maximise
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 \right\}.$$
❑ In practice, we prefer to maximise the log of the likelihood $p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)$,
$$\mathrm{LL}(\mathbf{w}, \sigma^2) = \log p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2.$$

How do we estimate w? (VI)
❑ We can also minimise the negative log of the likelihood $p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)$,
$$\mathrm{NLL}(\mathbf{w}, \sigma^2) = -\mathrm{LL}(\mathbf{w}, \sigma^2) = -\log p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \frac{N}{2} \log(2\pi) + \frac{N}{2} \log \sigma^2 + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2.$$
❑ Consistency of the ML criterion: if data was really generated according to the probability we specified, the correct parameters will be recovered in the limit as $N \to \infty$.

Connection with the sum of squared errors
❑ If we multiply $\mathrm{LL}(\mathbf{w}, \sigma^2)$ by minus one, we get
$$E(\mathbf{w}, \sigma^2) = -\log p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) \propto \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2.$$
❑ The ML criterion for this model has a close connection with the sum-of-squared errors used in non-probabilistic formulations of linear regression.
❑ Maximising the log-likelihood function is equivalent to minimising the sum-of-squares errors.
❑ Notice that the log is a monotonic function, meaning that if we find $\mathbf{w}, \sigma^2$ that maximise $h(\mathbf{w}, \sigma^2)$, those will also maximise $\log(h(\mathbf{w}, \sigma^2))$.

Normal equation (I) (to be reviewed in the tutorial)
❑ Let us find an estimate for $\mathbf{w}$.
❑ From what we saw before,
$$\mathrm{LL}(\mathbf{w}, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2.$$
❑ Using what we reviewed in the section on vector/matrix notation, it can be shown that this expression can be written in a vectorial form as
$$\mathrm{LL}(\mathbf{w}, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}).$$
❑ Let us focus on the term $(\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})$,
$$(\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{y}^\top \mathbf{y} - \mathbf{w}^\top \mathbf{X}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X}\mathbf{w} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X}\mathbf{w}.$$

Normal equation (II) (to be reviewed in the tutorial)
❑ We can find the $\mathbf{w}$ that maximises $\mathrm{LL}(\mathbf{w}, \sigma^2)$ by taking the gradient $\frac{d\,\mathrm{LL}(\mathbf{w}, \sigma^2)}{d\mathbf{w}}$, equating to zero and solving for $\mathbf{w}$.
❑ Taking the gradient of each term in $\mathrm{LL}(\mathbf{w}, \sigma^2)$ wrt $\mathbf{w}$, we get
$$\frac{d}{d\mathbf{w}}\left[-\frac{N}{2}\log(2\pi)\right] = \mathbf{0}, \qquad \frac{d}{d\mathbf{w}}\left[-\frac{N}{2}\log\sigma^2\right] = \mathbf{0}, \qquad \frac{d}{d\mathbf{w}}\left[-\frac{1}{2\sigma^2}\mathbf{y}^\top\mathbf{y}\right] = \mathbf{0},$$
$$\frac{d}{d\mathbf{w}}\left[\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}\right] = \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y}, \qquad \frac{d}{d\mathbf{w}}\left[\frac{1}{2\sigma^2}\mathbf{y}^\top\mathbf{X}\mathbf{w}\right] = \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y}, \qquad \frac{d}{d\mathbf{w}}\left[-\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}\right] = -\frac{1}{2\sigma^2}2\mathbf{X}^\top\mathbf{X}\mathbf{w}.$$

Normal equation (III) (to be reviewed in the tutorial)
❑ Putting these terms together, we get
$$\frac{d\,\mathrm{LL}(\mathbf{w}, \sigma^2)}{d\mathbf{w}} = \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y} + \frac{1}{2\sigma^2}\mathbf{X}^\top\mathbf{y} - \frac{1}{2\sigma^2}2\mathbf{X}^\top\mathbf{X}\mathbf{w} = \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{y} - \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{X}\mathbf{w}.$$
❑ Now, equating to zero and solving for $\mathbf{w}$, we get
$$\frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{y} - \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{0} \;\Rightarrow\; \mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y} \;\Rightarrow\; \mathbf{w}_* = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}.$$
❑ The expression for $\mathbf{w}_*$ is known as the normal equation.
❑ The solution for $\mathbf{w}_*$ exists if we can compute $\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}$.
❑ The inverse can be computed as long as $\mathbf{X}^\top\mathbf{X}$ is non-singular (e.g. determinant different from zero, or has full rank).

Solving for $\sigma_*^2$
❑ Following a similar procedure, it can be shown that the ML solution for $\sigma_*^2$ is given as
$$\sigma_*^2 = \frac{1}{N}(\mathbf{y} - \mathbf{X}\mathbf{w}_*)^\top(\mathbf{y} - \mathbf{X}\mathbf{w}_*).$$
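A compact sketch of the normal equation on synthetic data (illustrative only; the data and dimensions are made up): it recovers parameters close to those used to generate the data, and also computes the ML estimate of the noise variance.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 500, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # first column is x_0 = 1
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=0.3, size=N)

# Normal equation: w* = (X^T X)^{-1} X^T y.
# Solving the linear system X^T X w = X^T y is preferred over forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance: sigma*^2 = (1/N) (y - X w*)^T (y - X w*)
resid = y - X @ w_star
sigma2_star = resid @ resid / N

print(w_star)       # close to [0.5, -1.0, 2.0]
print(sigma2_star)  # close to 0.3**2 = 0.09
```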
Basis functions
❑ The model that is linear in $\mathbf{x}$ only allows linear relationships between $\mathbf{x}$ and $y$.
❑ We can extend the model to describe non-linear relationships between the inputs and the output by using basis functions, non-linear mappings from inputs to outputs.
❑ However, we keep the linear relationship of $y$ wrt $\mathbf{w}$ for tractability.
❑ The predictive model $f(\mathbf{x}, \mathbf{w})$ follows as
$$f(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{M} w_i \phi_i(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}),$$
where the $\phi_i(\mathbf{x})$ are basis functions, we have $M + 1$ parameters in the vector $\mathbf{w}$, and $\boldsymbol{\phi}(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \cdots, \phi_M(\mathbf{x})]^\top$.

Examples of basis functions
[Figure: three panels showing polynomial, exponential and sigmoidal basis functions on the interval $[-1, 1]$.]
❑ Polynomial: $\phi_i(x) = x^i$.
❑ Exponential: $\phi_i(x) = \exp\left\{ -\frac{(x - \mu_i)^2}{2s^2} \right\}$.
❑ Sigmoidal: $\phi_i(x) = \sigma\!\left( \frac{x - \mu_i}{s} \right)$, with $\sigma(a) = 1/(1 + \exp(-a))$.

Transforming the input using the basis functions
❑ As an example, let us use polynomial basis functions to predict $y$, the time in seconds in the 100 m Olympics competition.
❑ For each $x$ (year of the competition), we now compute the vector of polynomial basis functions
$$\boldsymbol{\phi}(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \\ x^M \end{bmatrix}.$$
❑ We have converted the unidimensional input feature $x$ into a higher dimensional feature representation $\boldsymbol{\phi}(x) \in \mathbb{R}^{M+1}$.

Normal equations with a design matrix
❑ Given $\mathbf{X}$, we first compute a new design matrix $\boldsymbol{\Phi}$,
$$\boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\phi}(\mathbf{x}_1)^\top \\ \boldsymbol{\phi}(\mathbf{x}_2)^\top \\ \vdots \\ \boldsymbol{\phi}(\mathbf{x}_N)^\top \end{bmatrix} = \begin{bmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_M(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_M(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_M(\mathbf{x}_N) \end{bmatrix}.$$
❑ We can now use $(\mathbf{y}, \boldsymbol{\Phi})$ and write the Gaussian linear regression problem as
$$p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n|\mathbf{w}^\top \boldsymbol{\phi}_n, \sigma^2),$$
where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$.
❑ Using the ML criterion, we arrive at the following normal equation,
$$\mathbf{w}_* = \left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\top \mathbf{y}.$$

Olympic 100 m data with M = 5
[Figure: male 100 m winning times (seconds) vs. year, together with the fitted polynomial model of order $M = 5$.]

Alternative to find w
❑ For solving the normal equation, we need to invert $\mathbf{X}^\top \mathbf{X}$.
❑ This inversion has a computational complexity between $O((D+1)^{2.4})$ and $O((D+1)^3)$ (depending on the implementation).
❑ The normal equation is linear regarding the number of instances in the training data, $O(N)$.
❑ It can handle a large training set as long as it fits in memory.
❑ Alternatively, we can use iterative optimisation in cases with a large number of features and too many instances to fit in memory.

Gradient descent

General problem
❑ We are given a function $h(\mathbf{w})$, where $\mathbf{w} \in \mathbb{R}^p$.
❑ Aim: to find a value for $\mathbf{w}$ that minimises $h(\mathbf{w})$.
❑ Use an iterative procedure
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \eta \mathbf{d}_k,$$
where $\mathbf{d}_k$ is known as the search direction and it is such that $h(\mathbf{w}_{k+1}) < h(\mathbf{w}_k)$.
❑ The parameter $\eta$ is known as the step size or learning rate.

Gradient descent
❑ Perhaps the simplest algorithm for unconstrained optimisation.
❑ It assumes that $\mathbf{d}_k = -\mathbf{g}_k$, where $\mathbf{g}_k = \mathbf{g}(\mathbf{w}_k)$ is the gradient.
❑ Also known as steepest descent.
❑ It can be written as
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \mathbf{g}_k.$$

Step size
❑ The main issue in gradient descent is how to set the step size.
❑ If it is too small, convergence will be very slow. If it is too large, the method can fail to converge at all.
[Figure: contours of $h(w_1, w_2) = 0.5(w_1^2 - w_2)^2 + 0.5(w_1 - 1)^2$, whose minimum is at $(1, 1)$, with gradient-descent iterates for (a) $\eta = 0.1$ and (b) $\eta = 0.6$.]

Alternatives to choose the step size η
❑ Line search methods (there are different alternatives).
❑ Line search methods may use search directions other than the steepest descent direction.
❑ Conjugate gradient (the method of choice for quadratic objectives of the form $\mathbf{w}^\top \mathbf{A}\mathbf{w}$).
❑ Use a Newton search direction.
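Below is a minimal sketch of gradient descent on the two-dimensional function used in the step-size figure, $h(w_1, w_2) = 0.5(w_1^2 - w_2)^2 + 0.5(w_1 - 1)^2$; the starting point, step size and number of iterations are choices made here for illustration.

```python
import numpy as np

def h(w):
    """Objective from the step-size figure: 0.5*(w1^2 - w2)^2 + 0.5*(w1 - 1)^2, minimum at (1, 1)."""
    w1, w2 = w
    return 0.5 * (w1**2 - w2) ** 2 + 0.5 * (w1 - 1) ** 2

def grad_h(w):
    """Gradient of h with respect to (w1, w2)."""
    w1, w2 = w
    return np.array([2 * w1 * (w1**2 - w2) + (w1 - 1), -(w1**2 - w2)])

eta = 0.1                    # step size (learning rate)
w = np.array([0.0, 0.0])     # arbitrary starting point
for k in range(1000):
    w = w - eta * grad_h(w)  # w_{k+1} = w_k - eta * g_k

print(w, h(w))  # the iterates approach the minimum at (1, 1)
```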
Gradient descent for linear regression (I)
❑ For simplicity, let us assume that the objective function $h(\mathbf{w})$ corresponds to the mean squared error
$$E(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2.$$
❑ We could also minimise the negative log-likelihood, $\mathrm{NLL}(\mathbf{w})$, instead.
❑ We write the update equation as
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \left. \frac{d E(\mathbf{w})}{d \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}_k}.$$

Gradient descent for linear regression (II)
❑ Computing the gradient of $E(\mathbf{w})$, we get
$$\frac{d E(\mathbf{w})}{d \mathbf{w}} = \frac{2}{N} \sum_{n=1}^{N} \left(\mathbf{w}^\top \mathbf{x}_n - y_n\right)\mathbf{x}_n = \frac{2}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}).$$
❑ The update equation follows as
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \frac{2}{N} \mathbf{X}^\top (\mathbf{X}\mathbf{w}_k - \mathbf{y}).$$
❑ The computation of the gradient involves using the whole dataset $(\mathbf{X}, \mathbf{y})$ at every step.
❑ For this reason, this algorithm is known as batch gradient descent.

Gradient descent and feature scaling
❑ Always normalise the features if using gradient descent.
❑ Gradient descent converges faster if all features have a similar scale.
❑ If the attributes are on very different scales, it may take a long time to converge.

Stochastic Gradient Descent

Online learning and large datasets
❑ Traditionally in machine learning, the gradient $\mathbf{g}_k$ is computed using the whole dataset $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^N$.
❑ There are settings, though, where only a subset of the data can be used.
❑ Online learning: the instances $(\mathbf{x}_n, y_n)$ appear one at a time.
❑ Large datasets: computing the exact value for $\mathbf{g}_k$ would be expensive, if not impossible.

Stochastic gradient descent (I)
❑ In stochastic gradient descent (SGD), the gradient $\mathbf{g}_k$ is computed using a subset of the instances available.
❑ The word stochastic refers to the fact that the value for $\mathbf{g}_k$ will depend on the subset of instances chosen for the computation.

Stochastic gradient descent (II)
❑ In the stochastic setting, a better estimate can be found if the gradient is computed using
$$\mathbf{g}_k = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \mathbf{g}_{k,i},$$
where $\mathcal{S} \subseteq \mathcal{D}$, $|\mathcal{S}|$ is the cardinality of $\mathcal{S}$, and $\mathbf{g}_{k,i}$ is the gradient at iteration $k$ computed using the instance $(\mathbf{x}_i, y_i)$.
❑ This setting is called mini-batch gradient descent (see the sketch below).

Step size in SGD
❑ Choosing the value of $\eta$ is particularly important in SGD since there is no easy way to compute it.
❑ Usually the value of $\eta$ will depend on the iteration $k$, $\eta_k$.
❑ It should follow the Robbins-Monro conditions
$$\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty.$$
❑ Various formulas for $\eta_k$ can be used, e.g.
$$\eta_k = \frac{1}{k}, \qquad \eta_k = \frac{1}{(\tau_0 + k)^\kappa},$$
where $\tau_0$ slows down early iterations and $\kappa \in (0.5, 1]$.
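A short sketch of mini-batch SGD for the mean-squared-error objective above (illustrative only: the synthetic data, batch size and step-size schedule are choices made here, not values from the slides).

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 1000, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.5, size=N)

w = np.zeros(D + 1)
batch_size = 32
tau0, kappa = 10.0, 0.75   # step-size schedule eta_k = 1 / (tau0 + k)^kappa

for k in range(1, 2001):
    idx = rng.choice(N, size=batch_size, replace=False)   # mini-batch S
    Xb, yb = X[idx], y[idx]
    g = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)         # mini-batch gradient of the MSE
    eta_k = 1.0 / (tau0 + k) ** kappa
    w = w - eta_k * g

print(w)  # roughly recovers w_true = [1.0, -2.0, 0.5, 3.0]
```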
Regularisation

What is regularisation?
❑ It refers to a technique used for preventing overfitting in a predictive model.
❑ It consists in adding a term (a regulariser) to the objective function that encourages simpler solutions.
❑ With regularisation, the objective function for linear regression would be
$$h(\mathbf{w}) = E(\mathbf{w}) + \lambda R(\mathbf{w}),$$
where $R(\mathbf{w})$ is the regularisation term and $\lambda$ the regularisation parameter.
❑ In the expression for $h(\mathbf{w})$, we can use $\mathrm{NLL}(\mathbf{w})$ instead of $E(\mathbf{w})$.
❑ If $\lambda = 0$, we get $h(\mathbf{w}) = E(\mathbf{w})$.

Different types of regularisation
❑ The objective function for linear regression would be $h(\mathbf{w}) = E(\mathbf{w}) + \lambda R(\mathbf{w})$, where $R(\mathbf{w})$ follows as
$$R(\mathbf{w}) = \alpha \|\mathbf{w}\|_1 + (1 - \alpha) \frac{1}{2} \|\mathbf{w}\|_2^2,$$
where $\|\mathbf{w}\|_1 = \sum_{m=1}^{p} |w_m|$ and $\|\mathbf{w}\|_2^2 = \sum_{m=1}^{p} w_m^2$.
❑ If $\alpha = 1$, we get $\ell_1$ regularisation.
❑ If $\alpha = 0$, we get $\ell_2$ regularisation.
❑ If $0 < \alpha < 1$, we get the elastic net regularisation.

Ridge regression or ℓ2 regularisation
❑ In ridge regression, $\alpha = 0$,
$$h(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}.$$
❑ It can be shown that an optimal solution for $\mathbf{w}_*$ is given as
$$\mathbf{w}_* = \left( \mathbf{X}^\top \mathbf{X} + \frac{\lambda N}{2} \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{y}.$$
❑ Notice that we can also use an iterative procedure for optimising $h(\mathbf{w})$, either through batch gradient descent, SGD or mini-batch SGD.

Logistic regression

Probabilistic classifier
❑ A logistic regression model is an example of a probabilistic classifier.
❑ Let $\mathbf{x} \in \mathbb{R}^D$ represent a feature vector and $y$ the target value.
❑ For a binary classification problem we can use $y \in \{0, 1\}$ or $y \in \{-1, +1\}$.
❑ We model the relationship between $y$ and $\mathbf{x}$ using a Bernoulli distribution.

Bernoulli distribution (I)
❑ A Bernoulli random variable $Y$ is a random variable that can only take two possible values.
❑ For example, the random variable $Y$ associated to the experiment of tossing a coin.
❑ The output "heads" is assigned 1 ($Y = 1$), and the output "tails" is assigned 0 ($Y = 0$).

Bernoulli distribution (II)
❑ A Bernoulli distribution is a probability distribution for $Y$, expressed as
$$p(Y = y) = \mathrm{Ber}(y|\mu) = \begin{cases} \mu & y = 1, \\ 1 - \mu & y = 0, \end{cases}$$
where $\mu = P(Y = 1)$.
❑ The expression above can be summarised in one line using
$$p(Y = y) = \mathrm{Ber}(y|\mu) = \mu^y (1 - \mu)^{1 - y}.$$

How are y and x related in logistic regression?
❑ The target feature $y$ follows a Bernoulli distribution, $p(y|\mathbf{x}) = \mathrm{Ber}(y|\mu(\mathbf{x}))$.
❑ Notice how the probability $\mu = P(y = 1)$ explicitly depends on $\mathbf{x}$.
❑ In logistic regression, the probability $\mu(\mathbf{x})$ is given as
$$\mu(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x})} = \sigma(\mathbf{w}^\top \mathbf{x}),$$
where $\sigma(z)$ is known as the logistic sigmoid function.
❑ We then have $p(y|\mathbf{w}, \mathbf{x}) = \mathrm{Ber}(y|\sigma(\mathbf{w}^\top \mathbf{x}))$.

The logistic sigmoid function σ(z)
[Figure: the logistic sigmoid $\sigma(z)$ plotted for $z$ between $-10$ and $10$.]
❑ Recall $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$.
❑ If $z \to \infty$, $\sigma(z) \to 1$. If $z \to -\infty$, $\sigma(z) \to 0$. If $z = 0$, $\sigma(z) = 0.5$.

The logistic sigmoid function σ(w⊤x)
[Figure: $\sigma(wx)$ plotted as a function of a scalar input $x$.]
❑ We have $z = \mathbf{w}^\top \mathbf{x}$. For simplicity, assume $\mathbf{x} = x$ is a scalar, so we plot $\sigma(wx)$.
❑ Recall $\sigma(wx) = \dfrac{1}{1 + \exp(-wx)}$.
❑ So $\left. \dfrac{d\sigma(wx)}{dx} \right|_{x=0} = \dfrac{w}{4}$.

The logistic sigmoid function in 2d
[Figure: surface plots of $\sigma(w_1 x_1 + w_2 x_2)$, with $\mathbf{w} = [w_1\; w_2]^\top$, for several settings such as $\mathbf{w} = (1, 4)$, $(5, 4)$, $(-2, 3)$, $(0, 2)$, $(2, 2)$, $(5, 1)$, $(1, 0)$, $(3, 0)$, $(-2, -1)$ and $(2, -2)$.]

Decision boundary
❑ After the training phase, we will have an estimator for $\mathbf{w}$.
❑ For a test input vector $\mathbf{x}_*$, we compute $p(y = 1|\mathbf{w}, \mathbf{x}_*) = \sigma(\mathbf{w}^\top \mathbf{x}_*)$.
❑ This will give us a value between 0 and 1.
❑ We define a threshold of 0.5 to decide to which class we assign $\mathbf{x}_*$ (see the sketch below).
❑ With this threshold we induce a linear decision boundary in the input space.
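A minimal sketch of the prediction step just described (the weights and test inputs below are made up for illustration): compute $\sigma(\mathbf{w}^\top \mathbf{x}_*)$ and threshold at 0.5. With two input features plus a bias, the boundary $\mathbf{w}^\top \mathbf{x} = 0$ is a straight line in the input space.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained weights for inputs [1, x1, x2] (bias absorbed as x_0 = 1)
w = np.array([-1.0, 2.0, -3.0])

# A few test points x* = [1, x1, x2]
X_test = np.array([[1.0, 2.0, 0.5],
                   [1.0, 0.0, 1.0],
                   [1.0, 1.0, 0.3]])

p = sigmoid(X_test @ w)          # p(y = 1 | w, x*)
y_hat = (p >= 0.5).astype(int)   # threshold at 0.5

print(p)      # probabilities between 0 and 1
print(y_hat)  # class assignments; the boundary is the line -1 + 2*x1 - 3*x2 = 0
```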
Cross-entropy error function
❑ Let $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^N$.
❑ We write $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_N]^\top$ and $\mathbf{y} = [y_1 \cdots y_N]^\top$.
❑ Assuming IID observations,
$$p(\mathbf{y}|\mathbf{w}, \mathbf{X}) = \prod_{n=1}^{N} p(y_n|\mathbf{w}, \mathbf{x}_n) = \prod_{n=1}^{N} \mathrm{Ber}(y_n|\sigma(\mathbf{w}^\top \mathbf{x}_n)).$$
❑ The cross-entropy error function, or negative log-likelihood, is given as
$$\mathrm{NLL}(\mathbf{w}) = -\log p(\mathbf{y}|\mathbf{w}, \mathbf{X}) = -\sum_{n=1}^{N} \left\{ y_n \log[\sigma(\mathbf{w}^\top \mathbf{x}_n)] + (1 - y_n) \log[1 - \sigma(\mathbf{w}^\top \mathbf{x}_n)] \right\},$$
which can be minimised with respect to $\mathbf{w}$.

Gradient of NLL(w)
❑ It can be shown that the gradient $\mathbf{g}(\mathbf{w})$ of $\mathrm{NLL}(\mathbf{w})$ is given as
$$\mathbf{g}(\mathbf{w}) = \frac{d\,\mathrm{NLL}(\mathbf{w})}{d\mathbf{w}} = \sum_{n=1}^{N} [\sigma(\mathbf{w}^\top \mathbf{x}_n) - y_n]\,\mathbf{x}_n = \mathbf{X}^\top (\boldsymbol{\sigma} - \mathbf{y}),$$
where $\boldsymbol{\sigma} = [\sigma(\mathbf{w}^\top \mathbf{x}_1) \cdots \sigma(\mathbf{w}^\top \mathbf{x}_N)]^\top$.

Optimisation and regularisation
❑ The SGD methods described above can be applied to find the parameter vector $\mathbf{w}$ that minimises the negative log-likelihood $\mathrm{NLL}(\mathbf{w})$ in logistic regression (a sketch follows below).
❑ Likewise, all the regularisation techniques we saw before can also be added to the cross-entropy error function.
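To close, a brief sketch of batch gradient descent on the cross-entropy error, using the gradient $\mathbf{X}^\top(\boldsymbol{\sigma}-\mathbf{y})$ derived above together with an $\ell_2$ penalty; the synthetic data, step size, $\lambda$ and iteration count are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 500
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])   # rows are [1, x1, x2]
w_true = np.array([0.5, 2.0, -1.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ w_true)))
y = (rng.uniform(size=N) < p_true).astype(float)            # Bernoulli targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lam = 0.1      # l2 regularisation strength (lambda)
eta = 0.005    # step size
w = np.zeros(3)
for k in range(10000):
    s = sigmoid(X @ w)
    g = X.T @ (s - y) + lam * w   # gradient of NLL(w) + (lambda/2) w^T w
    w = w - eta * g

print(w)  # close to w_true up to sampling noise and a slight shrinkage from the penalty
```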