Lecture 3: Basics on Optimisation Theory (PDF)

Summary

This lecture introduces optimisation theory, focusing on key concepts such as gradients, Hessians, and Taylor expansions. The material covers the different representations of quadratic functions and methods for checking positive/negative definiteness, concepts that underpin the minimisation of loss functions in data science and machine learning.

Full Transcript


Lecture 3

1 Basics on optimisation theory

Outline: In this section we will examine some theory for the optimisation of functions. Unless stated otherwise, we will assume that all functions f are continuous and differentiable. The problem we are trying to tackle in optimisation theory can be stated simply as "Find $\vec{w}$ such that $f(\vec{w})$ has a minimum or maximum." Finding minima is a basic task in data science and machine learning and will be a relevant topic throughout this course. In this respect the function f is a so-called loss function that measures the "distance" between the predictions of a model and the true values the model is supposed to predict. In this case $\vec{w}$ are the parameters of the model and the minimum corresponds to the best model we can find within a class of models.

1.1 Some basic definitions

Gradient: Consider a differentiable $f : \mathbb{R}^d \to \mathbb{R}$. We specify the components of $\vec{w} \in \mathbb{R}^d$ as $\vec{w} = (w_1, \dots, w_d)^T$ (T denotes the transpose) and denote the unit vectors in the direction of the coordinate axes as $\vec{e}_i$, $i = 1, \dots, d$. The gradient is defined as

$$\nabla f = \frac{\partial f}{\partial w_1}\vec{e}_1 + \dots + \frac{\partial f}{\partial w_d}\vec{e}_d = \left(\frac{\partial f}{\partial w_1}, \dots, \frac{\partial f}{\partial w_d}\right)^T.$$

For differentiable f the gradient is defined for all $\vec{w} \in \mathbb{R}^d$ and thus constitutes a function $\nabla f : \mathbb{R}^d \to \mathbb{R}^d$.

Directional derivative: For any unit vector $\vec{u} = (u_1, \dots, u_d)^T$ the directional derivative of f at $\vec{w}$ in direction $\vec{u}$ is defined as

$$\nabla_{\vec{u}} f(\vec{w}) = \lim_{h \to 0} \frac{f(\vec{w} + h\vec{u}) - f(\vec{w})}{h}.$$

For differentiable f we get

$$\nabla_{\vec{u}} f(\vec{w}) = \nabla f(\vec{w})^T \cdot \vec{u},$$

where $\cdot$ denotes the usual scalar product. The directional derivative gives the rate of change of f when going from a given point to a nearby point in direction $\vec{u}$. f increases most in the direction $\vec{u}$ for which $\nabla_{\vec{u}} f(\vec{w})$ has the largest value, which is the direction of the gradient itself, i.e. the direction $\nabla f(\vec{w})/|\nabla f(\vec{w})|$.

Hessian matrix: The Hessian matrix $H(\vec{w})$ is defined as the square matrix of second partial derivatives

$$H(\vec{w}) = \begin{pmatrix} \frac{\partial^2 f}{\partial w_1^2} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_d \partial w_1} & \frac{\partial^2 f}{\partial w_d \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_d \partial w_d} \end{pmatrix}. \quad (1)$$

The Hessian is a symmetric matrix for a twice continuously differentiable f (which we assume). The Hessian matrix gives us information about the curvature of a function.

Taylor expansion: The Taylor expansion of a continuous (sufficiently differentiable) function f around a point $\vec{w}_k$ provides an approximation for f at any other point $\vec{w}$ and is given as

$$f(\vec{w}) = f(\vec{w}_k) + (\nabla f(\vec{w}_k))^T \cdot (\vec{w} - \vec{w}_k) + \frac{1}{2}(\vec{w} - \vec{w}_k)^T H(\vec{w}_k)(\vec{w} - \vec{w}_k) + \dots \quad (2)$$

The last terms on the right hand side contain higher order derivatives of f and higher order terms of $\vec{w} - \vec{w}_k$. Defining $\Delta\vec{w}_k = \vec{w} - \vec{w}_k$ and $\vec{w}_{k+1} = \vec{w}_k + \Delta\vec{w}_k$, we can write (2) more compactly as

$$f(\vec{w}_{k+1}) = f(\vec{w}_k) + (\nabla f(\vec{w}_k))^T \cdot \Delta\vec{w}_k + \frac{1}{2}(\Delta\vec{w}_k)^T H(\vec{w}_k)\,\Delta\vec{w}_k + \dots \quad (3)$$

The Taylor expansion allows us to approximate any continuous (sufficiently differentiable) function as a polynomial in terms of its derivatives at the point $\vec{w}_k$. We can make a linear approximation by taking the first two terms of the expansion. We can make a quadratic approximation by taking the first three terms of the expansion.

Positive and negative definiteness: A symmetric matrix B is called positive definite if $\vec{w}^T B \vec{w} > 0$ for any vector $\vec{w} \ne \vec{0}$. A symmetric matrix B is called negative definite if $\vec{w}^T B \vec{w} < 0$ for any vector $\vec{w} \ne \vec{0}$.
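Before moving on to definiteness checks, here is a quick numerical sanity check of the gradient and directional-derivative definitions above. This sketch is not part of the original notes; the example function and helper names are arbitrary choices for illustration.

```python
import numpy as np

def f(w):
    # Example function f(w1, w2) = w1**2 + 3*w1*w2 + 2*w2**2 (chosen for illustration)
    return w[0]**2 + 3*w[0]*w[1] + 2*w[1]**2

def grad_f(w):
    # Analytic gradient of the example function
    return np.array([2*w[0] + 3*w[1], 3*w[0] + 4*w[1]])

w = np.array([1.0, -2.0])
u = np.array([3.0, 4.0]) / 5.0   # a unit vector
h = 1e-6

# Directional derivative via the limit definition (forward finite difference)
dd_numeric = (f(w + h*u) - f(w)) / h
# Directional derivative via the formula grad(f)^T . u
dd_formula = grad_f(w) @ u

print(dd_numeric, dd_formula)    # the two values agree to several decimal places
```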
Checking positive definiteness: The above definition is not practical when it comes to checking a matrix for positive definiteness, because it would require that we examine every possible vector $\vec{w}$. There are many ways to establish positive definiteness; the most relevant for us are listed here.

1. A symmetric matrix B is positive definite if all eigenvalues of B are positive.

2. A symmetric matrix B is positive definite if and only if the determinant of each of its (leading) principal minor matrices is positive.

Example: Let

$$B = \begin{pmatrix} 2 & 3 & -2 \\ 3 & 5 & -1 \\ -2 & -1 & 5 \end{pmatrix}.$$

We check the determinants of the principal minor matrices, found by taking the determinant of the first 1 × 1 matrix along the diagonal, the determinant of the first 2 × 2 matrix along the diagonal, and finally the determinant of the entire matrix. If any one of these determinants is not positive, the matrix is not positive definite. We get

$$|2| = 2 > 0, \qquad \begin{vmatrix} 2 & 3 \\ 3 & 5 \end{vmatrix} = 1 > 0, \qquad \begin{vmatrix} 2 & 3 & -2 \\ 3 & 5 & -1 \\ -2 & -1 & 5 \end{vmatrix} = -5 < 0.$$

This matrix is not positive definite.

Relation to the Hessian matrix: The matrix we will be most interested in checking for definiteness is the Hessian matrix $H(\vec{w})$. If $H(\vec{w})$ is positive definite, the curvature of the function f is everywhere positive. This will be an important condition for checking whether we have a minimum of f. If $H(\vec{w})$ is negative definite, the curvature of f is everywhere negative. This will be a condition for verifying that we have a maximum of f.

Checking negative definiteness: The two most relevant ways to examine negative definiteness are listed here.

1. A symmetric matrix B is negative definite if all eigenvalues of B are negative.

2. A symmetric matrix B is negative definite if −B is positive definite.

Remark: A symmetric matrix is not negative definite if the determinant of each of its principal minor matrices is negative. The easiest way to check for negative definiteness using principal minor matrices is to reverse all signs and check whether the resulting matrices have positive determinants.

Semi-definiteness: A symmetric matrix can also be positive semi-definite or negative semi-definite. If one or more of the determinants of the principal minors or eigenvalues are equal to zero, and the others are all positive, the matrix is called positive semi-definite. If one or more eigenvalues are equal to zero, and the others are all negative, the matrix is called negative semi-definite.

Indefinite matrices: If a matrix is neither positive definite nor negative definite, nor semi-definite, then it is called indefinite.

Example: We use the matrix from the previous example, which was not positive definite. Since none of its principal minors has a zero determinant, we can also exclude the case of semi-definiteness. To check negative definiteness, we reverse the signs and check whether the resulting matrix

$$-B = \begin{pmatrix} -2 & -3 & 2 \\ -3 & -5 & 1 \\ 2 & 1 & -5 \end{pmatrix}$$

is positive definite. The first principal minor gives |−2| = −2 < 0. Because the determinant of the first principal minor is negative, we conclude that B is indefinite.
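The following short Python sketch (not part of the original notes; NumPy is assumed to be available) checks the example matrix B both via its eigenvalues and via the leading principal minors.

```python
import numpy as np

def leading_principal_minors(A):
    # Determinants of the upper-left 1x1, 2x2, ..., nxn submatrices
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

def classify(A, tol=1e-12):
    # Classify a symmetric matrix via its eigenvalues
    eig = np.linalg.eigvalsh(A)
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig >= -tol):
        return "positive semi-definite"
    if np.all(eig <= tol):
        return "negative semi-definite"
    return "indefinite"

B = np.array([[2.0, 3.0, -2.0],
              [3.0, 5.0, -1.0],
              [-2.0, -1.0, 5.0]])

print(leading_principal_minors(B))  # approximately [2.0, 1.0, -5.0]
print(classify(B))                  # indefinite
```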
1.2 Quadratic functions

Motivation: In applications, loss functions relevant for data science in general, and for this course in particular, are often quadratic functions. It is therefore helpful to investigate some of the properties of these functions.

Representations: We can represent a quadratic function as a scalar equation, as a vector equation, and as a Taylor expansion. They are all equivalent representations. For example, consider the function on $\mathbb{R}^2$

$$f(\vec{w}) = 6 - 2w_1 + w_2 + 2w_1^2 + 3w_1 w_2 + w_2^2. \quad (4)$$

This is a scalar representation of a quadratic function. We can also write it in vector form as

$$f(\vec{w}) = a + \vec{b}^T \cdot \vec{w} + \frac{1}{2}\vec{w}^T C \vec{w}. \quad (5)$$

Our example gives

$$f(\vec{w}) = 6 + (-2, 1)\,\vec{w} + \frac{1}{2}\vec{w}^T \begin{pmatrix} 4 & 3 \\ 3 & 2 \end{pmatrix} \vec{w}. \quad (6)$$

We also observe that

$$C = H(\vec{w}) = \begin{pmatrix} 4 & 3 \\ 3 & 2 \end{pmatrix}.$$

This is true in general. A third form is a Taylor representation, obtained by choosing any point $\vec{w}_k$,

$$f(\vec{w}) = f(\vec{w}_k) + (\nabla f(\vec{w}_k))^T \cdot \Delta\vec{w}_k + \frac{1}{2}(\Delta\vec{w}_k)^T H(\vec{w}_k)\,\Delta\vec{w}_k. \quad (7)$$

Our example gives

$$\nabla f(\vec{w}) = \begin{pmatrix} -2 + 4w_1 + 3w_2 \\ 1 + 3w_1 + 2w_2 \end{pmatrix}.$$

We may choose any point of expansion, for instance

$$\vec{w}_k = \begin{pmatrix} -2 \\ 2 \end{pmatrix}, \qquad \nabla f(\vec{w}_k) = \begin{pmatrix} -4 \\ -1 \end{pmatrix}.$$

Our example then gives

$$f(\vec{w}) = 12 + (-4, -1)\cdot \Delta\vec{w}_k + \frac{1}{2}\Delta\vec{w}_k^T \begin{pmatrix} 4 & 3 \\ 3 & 2 \end{pmatrix} \Delta\vec{w}_k, \quad (8)$$

where

$$\Delta\vec{w}_k = \begin{pmatrix} w_1 + 2 \\ w_2 - 2 \end{pmatrix}.$$

Characteristics of quadratic functions: The following characteristics are useful in many respects.

1. The gradient vector of a quadratic function is linear in its argument. This makes it easy to solve for points where the gradient is equal to zero.

2. The Hessian of a quadratic function is a matrix of constants, so we may write H instead of $H(\vec{w})$. Thus the curvature of a quadratic function is everywhere the same.

3. Excluding the cases where we have a semi-definite Hessian, quadratic functions have only one stationary point, i.e. only one point where the gradient is zero.

4. Given the gradient at some point $\vec{w}_k$, the gradient at some other point $\vec{w}_{k+1}$ is given by

$$\nabla f(\vec{w}_{k+1}) = \nabla f(\vec{w}_k) + H(\vec{w}_{k+1} - \vec{w}_k). \quad (9)$$
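To make the three equivalent representations concrete, here is a small NumPy sketch (an illustration added to these notes, not part of the original) that evaluates the example (4) in scalar and in vector form, and verifies the gradient relation (9).

```python
import numpy as np

a = 6.0
b = np.array([-2.0, 1.0])
H = np.array([[4.0, 3.0],
              [3.0, 2.0]])     # constant Hessian of the quadratic

def f_scalar(w):
    w1, w2 = w
    return 6 - 2*w1 + w2 + 2*w1**2 + 3*w1*w2 + w2**2

def f_vector(w):
    return a + b @ w + 0.5 * w @ H @ w

def grad(w):
    return b + H @ w

w  = np.array([0.7, -1.3])
wk = np.array([-2.0, 2.0])

print(f_scalar(w), f_vector(w))                       # identical values
print(np.allclose(grad(w), grad(wk) + H @ (w - wk)))  # True, relation (9)

# The single stationary point solves grad(w) = b + H w = 0
w_star = np.linalg.solve(H, -b)
print(w_star)
```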
1.3 Necessary and sufficient conditions for an optimum

Outline: In this section, conditions which must hold at an optimum of a function are outlined.

Euclidean ball: We denote by $B_r(\vec{x})$ the so-called Euclidean ball of radius r around a point $\vec{x} \in \mathbb{R}^d$, defined as the set of all points $\{\vec{y} \in \mathbb{R}^d : \|\vec{y} - \vec{x}\| \le r\}$ within a Euclidean distance less than or equal to r from $\vec{x}$. $B_r(\vec{x})$ defines a local neighbourhood around a point $\vec{x}$. One can use any value of r > 0 according to one's needs.

Local minima and local maxima: A point $\vec{x}$ is called a local maximum of f if there is some r > 0 such that $f(\vec{y}) \le f(\vec{x})$ for all $\vec{y} \in B_r(\vec{x})$. A point $\vec{x}$ is called a local minimum of f if there is some r > 0 such that $f(\vec{y}) \ge f(\vec{x})$ for all $\vec{y} \in B_r(\vec{x})$. A point $\vec{x}$ is called a strict local maximum of f if there is some r > 0 such that $f(\vec{y}) < f(\vec{x})$ for all $\vec{y} \in B_r(\vec{x})$, $\vec{y} \ne \vec{x}$. A point $\vec{x}$ is called a strict local minimum of f if there is some r > 0 such that $f(\vec{y}) > f(\vec{x})$ for all $\vec{y} \in B_r(\vec{x})$, $\vec{y} \ne \vec{x}$.

Global minima and global maxima: A point $\vec{x}$ is called a global maximum of f if $f(\vec{y}) \le f(\vec{x})$ for all $\vec{y} \in \mathbb{R}^d$. A point $\vec{x}$ is called a global minimum of f if $f(\vec{y}) \ge f(\vec{x})$ for all $\vec{y} \in \mathbb{R}^d$. There may be more than one global minimum and maximum. If there is exactly one global minimum or maximum $\vec{x} \in \mathbb{R}^d$, it is called a strict global minimum or maximum. The term minimum or maximum without the qualifier global or local will always refer to a local property of f. A continuous function $f : S \to \mathbb{R}$ restricted to a closed and bounded subset $S \subset \mathbb{R}^d$ always has a global maximum and a global minimum on S (possibly on the boundary of S).

Saddle point: A point $\vec{x}$ is called a saddle point if for any r > 0 there are lower points $\vec{y} \in B_r(\vec{x})$ such that $f(\vec{y}) < f(\vec{x})$ and there are upper points $\vec{z} \in B_r(\vec{x})$ such that $f(\vec{z}) > f(\vec{x})$.

Necessary conditions for an optimum: The necessary conditions for an optimum at $\vec{w}^\star$ are

$$f \text{ differentiable at } \vec{w}^\star, \qquad \nabla f(\vec{w}^\star) = \vec{0}. \quad (10)$$

These conditions are necessary but not sufficient, inasmuch as $\nabla f(\vec{w}) = \vec{0}$ can apply at a maximum, a minimum or a saddle point. However, if at a point $\nabla f(\vec{w}) \ne \vec{0}$, then that point cannot be an optimum.

Contour lines: Consider a quadratic function f on $\mathbb{R}^2$ with a single minimum at $\vec{w}^\star = (0, 0)$. The contour lines are ellipses, defined by the two directions along which the main axes lie. Since H is symmetric and not defective (it has a single minimum), it has orthogonal eigenvectors, which we denote by $\vec{v}_1$ and $\vec{v}_2$, corresponding to the real eigenvalues $\lambda_1$ and $\lambda_2$. We may choose them to have length 1. Furthermore, recall that the gradient at a point on a contour line is orthogonal to that line. Now, we may use the fact that $\nabla f(\vec{w}^\star) = \vec{0}$ and choose $\vec{w}_k = \vec{w}^\star$ and $\vec{w}_{k+1} = \vec{v}_1$ in (9) to get

$$\nabla f(\vec{v}_1) = \lambda_1 \vec{v}_1.$$

If we choose $\vec{w}_{k+1} = \vec{v}_2$ we get

$$\nabla f(\vec{v}_2) = \lambda_2 \vec{v}_2.$$

If we choose any other direction $\vec{w}_{k+1} = a\vec{v}_1 + b\vec{v}_2$ with $a^2 + b^2 = 1$ we get

$$\nabla f(\vec{w}_{k+1}) = a\lambda_1\vec{v}_1 + b\lambda_2\vec{v}_2.$$

Let us assume that $\lambda_1 \le \lambda_2$. It is then straightforward to see that

$$\|\nabla f(\vec{v}_1)\| \le \|\nabla f(\vec{w}_{k+1})\| \le \|\nabla f(\vec{v}_2)\|.$$

We also see that the gradient at points along the lines defined by the eigenvectors does not change direction. Since the gradient defines the direction of maximum increase, we get the important result that the eigenvectors point in the direction of the main axes of the ellipses in the contour plot.

Sufficient conditions for a minimum: The sufficient conditions include the necessary conditions but add further conditions such that, when satisfied, we know we have an optimum. For a minimum,

$$f \text{ differentiable at } \vec{w}^\star, \qquad \nabla f(\vec{w}^\star) = \vec{0}, \qquad H(\vec{w}^\star) \text{ positive definite.} \quad (11)$$

Sufficient conditions for a maximum: For a maximum,

$$f \text{ differentiable at } \vec{w}^\star, \qquad \nabla f(\vec{w}^\star) = \vec{0}, \qquad H(\vec{w}^\star) \text{ negative definite.} \quad (12)$$

Example: Let

$$f(\vec{w}) = 4w_1 + 2w_2 + w_1^2 - 4w_1 w_2 + w_2^2. \quad (13)$$

Since this is a quadratic function, we know it has only one stationary point $\vec{w}^\star$ if H is not semi-definite. We note that the Hessian

$$H = \begin{pmatrix} 2 & -4 \\ -4 & 2 \end{pmatrix}$$

is indefinite. This means we have a saddle point at $\vec{w}^\star$.

Example: Let

$$f(\vec{w}) = w_1^2 - 2w_1 w_2 + 4w_2^2.$$

Since this is a quadratic function, the partial derivatives are linear equations. We can solve these equations directly for a point that satisfies the necessary conditions. We get the condition

$$\nabla f(\vec{w}^\star) = \begin{pmatrix} 2w_1^\star - 2w_2^\star \\ -2w_1^\star + 8w_2^\star \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

When we solve these two equations, we get the solution $w_1^\star = 0$, $w_2^\star = 0$. This point represents a maximum, a minimum or a saddle point. At this point, the Hessian is

$$H = \begin{pmatrix} 2 & -2 \\ -2 & 8 \end{pmatrix}.$$

Since this Hessian is positive definite, we have a minimum.

Exercises

1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function. Convince yourself that for any $i = 1, \dots, d$

$$\frac{\partial f}{\partial w_i}(\vec{w}) = \nabla_{\vec{e}_i} f(\vec{w}).$$

2. Let $f : \mathbb{R}^3 \to \mathbb{R}$ be a differentiable function. Convince yourself that the gradient of f at any $\vec{w}$ points in the direction of maximum increase of f.

3. Approximate

$$f(\vec{w}) = (w_1 - 1)(w_2 - 2)^2$$

as a second order polynomial.

4. Rewrite

$$f(\vec{w}) = 1 - 2w_1^2 - w_2^2$$

in vector form.

5. Convince yourself that the contour lines of

$$f(\vec{w}) = 1 - 2w_1^2 - w_2^2$$

are ellipses.

Lecture 4

2 Gradient descent methods

Motivation: In this section we will continue to discuss optimising functions $f : \mathbb{R}^d \to \mathbb{R}$. In particular, we have in mind loss functions that measure the goodness-of-fit of models fitted to data. The loss function f is also called a cost function, and the goal is to find the parameters with minimum cost, i.e. the best fit to the data. The focus here is on functions f for which we cannot solve the problem analytically, or for which we want to avoid solving the problem analytically for numerical reasons.
The most prominent solution method to tackle this problem is the gradient descent method.

2.1 Some basic definitions

Outline: To proceed, we need some basic concepts and notions. In particular, we need the classification of functions with respect to convexity and concavity, and the notion of a distance measure.

Distance measures: We are all familiar with the usual way distances are expressed mathematically, using the Euclidean distance

$$\|\vec{y} - \vec{x}\| = \left(\sum_{i=1}^d (y_i - x_i)^2\right)^{1/2}.$$

There are situations where one would like to use different measures of distance that are still aligned with the properties we intuitively associate with a distance. Such properties are:

1. If the distance between two points is zero, the points are identical.

2. A distance is a non-negative number.

3. The distance between $a\vec{x}$ and $a\vec{y}$ is |a| times the distance between $\vec{x}$ and $\vec{y}$. Here a is any real number.

4. The distance between two points $\vec{x}$ and $\vec{y}$ is not larger than the sum of the distances between $\vec{x}$ and $\vec{z}$ and between $\vec{z}$ and $\vec{y}$, for any third point $\vec{z}$ (triangle inequality).

We may define a distance measure by first defining a norm $n : \mathbb{R}^d \to \mathbb{R}_+$ and then using $n(\vec{y} - \vec{x})$ as the distance between the points $\vec{x}$ and $\vec{y}$. To ensure the properties listed above, a norm has the properties:

1. If $n(\vec{x}) = 0$, then $\vec{x} = \vec{0}$.

2. $n(a\vec{x}) = |a|\,n(\vec{x})$.

3. $n(\vec{x} + \vec{y}) \le n(\vec{x}) + n(\vec{y})$.

p-norms: Most relevant for us are the so-called p-norms, defined as

$$n_p(\vec{x}) = \|\vec{x}\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$$

for integer $p \ge 1$. For p = 2 we get the usual Euclidean distance. One can show that in the limit $p \to \infty$ we arrive at

$$\|\vec{x}\|_\infty = \max_i |x_i|.$$

For p = 2 we often use the notation $\|\cdot\| = \|\cdot\|_2$.

Derivative of p-norms: For finite p, and at points where $x_k \ne 0$, we have that

$$\frac{\partial \|\vec{x}\|_p}{\partial x_k} = \frac{x_k |x_k|^{p-2}}{\|\vec{x}\|_p^{p-1}}.$$

Remark: The definition of the Euclidean ball is based on the 2-norm. Using a different norm changes the geometry. For instance, using p = ∞ results in a hypercube of edge length 2r.

Convex functions: Consider two points $\vec{p}, \vec{q} \in \mathbb{R}^d$. The set of points $l = \{\vec{x} = \alpha\vec{p} + (1-\alpha)\vec{q} : \alpha \in \mathbb{R}\}$ defines a line through the points $\vec{p}$ and $\vec{q}$. A line segment going from $\vec{p}$ to $\vec{q}$ is obtained as $l = \{\vec{x} = \alpha\vec{p} + (1-\alpha)\vec{q} : 0 \le \alpha \le 1\}$. A function f is called convex if for all $\vec{p}, \vec{q} \in \mathbb{R}^d$ the value of f at points on the line segment between them is less than or equal to the corresponding weighted average of $f(\vec{p})$ and $f(\vec{q})$, i.e. for all $0 \le \alpha \le 1$ we have that

$$f(\alpha\vec{p} + (1-\alpha)\vec{q}) \le \alpha f(\vec{p}) + (1-\alpha) f(\vec{q}).$$

f is called strictly convex if the stronger condition

$$f(\alpha\vec{p} + (1-\alpha)\vec{q}) < \alpha f(\vec{p}) + (1-\alpha) f(\vec{q})$$

holds for all $0 < \alpha < 1$ and $\vec{p} \ne \vec{q}$.

Some important properties of convex functions: For two convex functions f, g, the sum $h(\vec{x}) = f(\vec{x}) + g(\vec{x})$ is also convex. We also have that $h(\vec{x}) = \max\{f(\vec{x}), g(\vec{x})\}$ is convex. Most importantly, any local minimum of a convex function is also a global minimum. A strictly convex function has at most a single minimum, which is then the single global minimum.

Concave functions: A function f is called concave if −f is convex.

Illustration: The figure below shows an example of a convex function. Lines connecting any two points of the graph always lie above (or on) the graph.

Convexity/concavity and the Hessian: A function f is convex if the Hessian matrix of f is positive definite or positive semi-definite for all $\vec{x}$. A function f is concave if the Hessian matrix of −f is positive definite or positive semi-definite for all $\vec{x}$.

Figure 1: Illustration of a convex function.
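As a quick illustration of the convexity definition (an added sketch, not part of the original notes; the test function and sample sizes are arbitrary choices), the following snippet checks the inequality $f(\alpha\vec{p} + (1-\alpha)\vec{q}) \le \alpha f(\vec{p}) + (1-\alpha) f(\vec{q})$ numerically for a convex quadratic.

```python
import numpy as np

def f(x):
    # A convex quadratic: its Hessian [[2, 0], [0, 8]] is positive definite
    return x[0]**2 + 4*x[1]**2

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    p, q = rng.normal(size=2), rng.normal(size=2)
    alpha = rng.uniform()
    lhs = f(alpha*p + (1 - alpha)*q)
    rhs = alpha*f(p) + (1 - alpha)*f(q)
    ok &= lhs <= rhs + 1e-12    # allow a tiny numerical slack
print(ok)   # True: the convexity inequality holds for all sampled points
```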
2.2 Gradient descent

Direction of steepest descent: Recall the definition of the directional derivative. For any unit vector $\vec{u} = (u_1, \dots, u_d)^T$ the directional derivative is defined as

$$\nabla_{\vec{u}} f(\vec{w}) = \lim_{h \to 0} \frac{f(\vec{w} + h\vec{u}) - f(\vec{w})}{h}.$$

For differentiable f we get

$$\nabla_{\vec{u}} f(\vec{w}) = \nabla f(\vec{w}) \cdot \vec{u}.$$

f increases most in the direction $\vec{u}$ for which $\nabla_{\vec{u}} f(\vec{w})$ has the largest value, which is the direction of the gradient itself. To get closer to the minimum of f, we may therefore want to move from any starting point $\vec{w}$ in the direction $-\nabla f(\vec{w})$. This is called the direction of steepest descent.

Gradient descent methods: Gradient descent comprises a collection of methods that are applied to differentiable functions f and try to identify

$$\min_{\vec{w} \in \mathbb{R}^d} f(\vec{w}) \qquad \text{or} \qquad \vec{w}^\star = \arg\min_{\vec{w} \in \mathbb{R}^d} f(\vec{w}).$$

Gradient descent is most effective for convex f and is used in cases where $\vec{w}^\star$ cannot be found in closed form, or where a closed-form solution is avoided for numerical reasons. In general, gradient descent methods are iterative in the sense that they try to find $\vec{w}^\star$ by approximations in repeated steps that get closer and closer to the true $\vec{w}^\star$. The iterative procedure starts with some initial point $\vec{w}_0$ and defines the iteration step as

$$\vec{w}_{k+1} = \vec{w}_k - \gamma_k \nabla f(\vec{w}_k)$$

until some criterion $\|\nabla f(\vec{w}_k)\| \le \tau$ is fulfilled.

Remark: Note that for a function $f : \mathbb{R}^d \to \mathbb{R}$, the gradient of f is a vector in $\mathbb{R}^d$, while the graph of f lives in $\mathbb{R}^{d+1}$. The change in f when going in the direction of the gradient can be visualised on the graph as an arrow in $\mathbb{R}^{d+1}$, which is not the same arrow as the gradient itself. The gradient is the projection of the change visualised on the graph onto $\mathbb{R}^d$.

Stopping condition: The criterion τ in the stopping condition above is also called the tolerance of the algorithm. It reflects the fact that at $\vec{w}^\star$ we must have $\nabla f(\vec{w}^\star) = (0, \dots, 0)^T$ (if $\vec{w}^\star$ exists), which implies that close to $\vec{w}^\star$ we may expect the gradient to be small in the sense that its length is small.

Learning rate: The so-called learning rate $\gamma_k$ is the most important parameter in any gradient descent application. It is often kept constant, i.e. $\gamma_k = \gamma$. It controls how fast the procedure works. If γ is very small, the number of required iteration steps is potentially very large. If γ is too large, then close to $\vec{w}^\star$ it might happen that the iteration goes too far, i.e. it overshoots the location $\vec{w}^\star$. It is therefore clearly important to choose a proper learning rate.

Lipschitz bound: A function $g : \mathbb{R}^d \to \mathbb{R}^s$ is L-Lipschitz for some L if

$$\|g(\vec{x}) - g(\vec{y})\| \le L\,\|\vec{x} - \vec{y}\|$$

for all $\vec{x}, \vec{y} \in \mathbb{R}^d$. In particular, if $\nabla f$ is L-Lipschitz and $\gamma \le \frac{1}{L}$, the gradient descent method converges to a stationary point. For convex f, after $k = O(1/\epsilon)$ iteration steps it follows that

$$f(\vec{w}_k) - f(\vec{w}^\star) \le \epsilon.$$

Remark: There are many procedures that try to optimally adjust the learning rate $\gamma_k$ for each individual iteration step. The details of these advanced methods are beyond the scope of the present introduction to the topic.

Optimal learning rate for quadratic functions: Let f be a quadratic function with a single minimum. Given the gradient at some point $\vec{w}_k$, which is not the minimum, and the Hessian H, the optimal learning rate $\gamma_k^\star$ is given by

$$\gamma_k^\star = \frac{\|\nabla f(\vec{w}_k)\|^2}{\nabla f(\vec{w}_k)^T H \nabla f(\vec{w}_k)}.$$
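The following Python sketch (added for illustration; not part of the original notes) implements the basic iteration $\vec{w}_{k+1} = \vec{w}_k - \gamma\nabla f(\vec{w}_k)$ with the tolerance-based stopping rule, applied to a simple convex quadratic. The function, starting point and learning rate are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad, w0, gamma=0.1, tol=1e-8, max_iter=10_000):
    """Plain gradient descent with a fixed learning rate gamma and
    stopping criterion ||grad f(w_k)|| <= tol."""
    w = np.asarray(w0, dtype=float)
    for k in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) <= tol:
            break
        w = w - gamma * g
    return w, k

# Example: f(w) = w1^2 - 2*w1*w2 + 4*w2^2, the convex quadratic from Lecture 3
H = np.array([[2.0, -2.0],
              [-2.0, 8.0]])
grad_f = lambda w: H @ w

w_star, iters = gradient_descent(grad_f, w0=[3.0, -2.0], gamma=0.2)
print(w_star, iters)   # converges to (0, 0), the unique minimum
```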
2.3 Fitting a model to data

Outline: For the analysis of data, gradient descent is a basic tool for fitting a model to data. Suppose we have a data set $P = \{\vec{p}_1, \dots, \vec{p}_n\}$ of size n and a class of models $M = \{M_{\vec{w}}\}$, where each possible model $M_{\vec{w}}$ is defined by a d-dimensional parameter vector $\vec{w} = (w_1, \dots, w_d)$. To decide which model is best, i.e. to find the optimal parameters $\vec{w}^\star = (w_1^\star, \dots, w_d^\star)$, one defines a loss function $L(P, M_{\vec{w}})$ that measures the difference between the model predictions and the data. The gradient descent method is then applied to $f(\vec{w}) : \mathbb{R}^d \to \mathbb{R}$ with $f(\vec{w}) = L(P, M_{\vec{w}})$. A prominent example of a loss function is the sum of squared errors (SSE). In case we have data $\vec{p} = (x, y)$ with two variables, we may want to predict y from x. This then leads to

$$f(\vec{w}) = L(P, M_{\vec{w}}) = \sum_{\vec{p} \in P} (y - M_{\vec{w}}(x))^2. \quad (1)$$

In this section the gradient descent method is illustrated for second order polynomial regression with two-dimensional data. The presentation can be straightforwardly generalised to all other cases of regression analysis.

Set-up: The relevant loss function is given by (1) with data $\vec{p} = (x, y)$ and

$$M_{\vec{w}}(x) = w_0 + w_1 x + w_2 x^2,$$

where we use as explanatory data the triple $(x^0, x^1, x^2)$ and the parameters $\vec{w} = (w_0, w_1, w_2)$.

Gradient for the case n = 1: For convenience, we use the notation $\vec{p}_1 = (x_1, y_1)$ for the single data point and get the cost function $f_1(\vec{w}) = (w_0 + w_1 x_1 + w_2 x_1^2 - y_1)^2$, which is convex. It is straightforward to show that

$$\nabla f_1(\vec{w}) = 2\left(M_{\vec{w}}(x_1) - y_1,\; (M_{\vec{w}}(x_1) - y_1)\,x_1,\; (M_{\vec{w}}(x_1) - y_1)\,x_1^2\right).$$

The resulting update $\vec{w}_{k+1} = \vec{w}_k - \gamma\nabla f(\vec{w}_k)$ is called the LMS (least mean squares) update rule or the Widrow-Hoff learning rule. The update involves the so-called residual $M_{\vec{w}}(x_1) - y_1$. Large residuals give rise to large steps and small residuals give small steps.

Multiple data points: The generalisation to multiple data points, n > 1, can be done using the full cost function and is called batch gradient descent. Another widely used method is the so-called stochastic gradient descent, which is an approximation and heavily reduces the computational cost. Both of these methods rely on the fact that the cost function $f(\vec{w})$ is decomposable, i.e.

$$f(\vec{w}) = \sum_{i=1}^n f_i(\vec{w}),$$

where

$$f_i(\vec{w}) = (M_{\vec{w}}(x_i) - y_i)^2 = (w_0 + w_1 x_i + w_2 x_i^2 - y_i)^2$$

is the cost function associated with the i-th data point, using the notation $\vec{p}_i = (x_i, y_i)$. f is convex since it is the sum of convex functions; in fact the sum of these usually becomes strongly convex. The decomposability property holds for many loss functions that are relevant for fitting a model to data.

Batch gradient descent: Batch gradient descent uses all of the information available and calculates the exact gradient, using linearity of differentiation, as

$$\frac{\partial f(\vec{w})}{\partial w_j} = \sum_{i=1}^n 2(M_{\vec{w}}(x_i) - y_i)\,x_i^j, \qquad j = 0, 1, 2,$$

and

$$\nabla f(\vec{w}) = \sum_{i=1}^n \left(2(M_{\vec{w}}(x_i) - y_i),\; 2(M_{\vec{w}}(x_i) - y_i)\,x_i,\; 2(M_{\vec{w}}(x_i) - y_i)\,x_i^2\right).$$

Computing the full gradient takes O(n) time in each of the iteration steps performed, which may be computationally expensive.

Incremental gradient descent: Incremental gradient descent is computationally cheaper. It avoids computing the full gradient in each step and only computes it for one $f_i$, i.e. for a single data point. To be more precise, the iteration step is

$$\vec{w}_{k+1} = \vec{w}_k - \gamma\nabla f_i(\vec{w}_k).$$

In each iteration k a different data point i may be used. The rule i = k + 1 (cycling through the data) will exploit all data points if the iteration runs long enough. A more common version of incremental gradient descent is the so-called stochastic gradient descent: it chooses i for each k at random.
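Here is a compact NumPy sketch (added for illustration, not part of the original notes; the synthetic data and learning rate are arbitrary choices) of batch gradient descent for the second order polynomial model $M_{\vec{w}}(x) = w_0 + w_1 x + w_2 x^2$ with the SSE loss (1), together with a stochastic (single-sample) variant of the update.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 0.5 - 1.0*x + 2.0*x**2 + 0.05*rng.normal(size=50)   # synthetic data

X = np.column_stack([np.ones_like(x), x, x**2])          # rows are (1, x_i, x_i^2)

def model(w, X):
    return X @ w

def batch_grad(w):
    # gradient of f(w) = sum_i (M_w(x_i) - y_i)^2
    return 2 * X.T @ (model(w, X) - y)

def sgd_grad(w, i):
    # gradient of the single-sample cost f_i(w)
    return 2 * (X[i] @ w - y[i]) * X[i]

gamma = 1e-3

w = np.zeros(3)
for k in range(5000):                     # batch gradient descent
    w = w - gamma * batch_grad(w)
print(w)                                   # close to (0.5, -1.0, 2.0)

w = np.zeros(3)
for k in range(50_000):                    # stochastic gradient descent
    i = rng.integers(len(y))
    w = w - gamma * sgd_grad(w, i)
print(w)                                   # approximately the same, up to SGD noise
```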
Algorithms based on incremental gradient descent methods are much faster than the batch version, since each iteration now takes O(1) time. However, they are approximations and do not automatically enjoy the nice convergence results for convex functions. In practice, these methods work quite well if the data size is large.

Exercises

1. Let $\vec{x} = (x_1, \dots, x_d)^T$. Convince yourself that for finite p, $k = 1, \dots, d$, and at points where $x_k \ne 0$, we have that

$$\frac{\partial \|\vec{x}\|_p}{\partial x_k} = \frac{x_k |x_k|^{p-2}}{\|\vec{x}\|_p^{p-1}}.$$

2. Show that the points $\vec{x} \in \mathbb{R}^2$ with $\|\vec{x}\|_1 = c$, with constant c > 0, form a square of side length $\sqrt{2}\,c$ with corners on the coordinate axes.

3. Show that the points $\vec{x} \in \mathbb{R}^2$ with $\|\vec{x}\|_\infty = c$, with constant c > 0, form a square of side length 2c with edges parallel to the coordinate axes.

4. Let f be a quadratic function

$$f(\vec{w}) = a + \vec{b}^T \cdot \vec{w} + \frac{1}{2}\vec{w}^T H \vec{w}$$

with Hessian H and with a single minimum. Show that the optimal learning rate $\gamma_k^\star$ at some $\vec{w}_k$, not being the location of the minimum, is given by

$$\gamma_k^\star = \frac{\|\nabla f(\vec{w}_k)\|^2}{\nabla f(\vec{w}_k)^T H \nabla f(\vec{w}_k)}.$$

You may want to use the expression $\nabla f(\vec{w}) = \vec{b} + H\vec{w}$.

Lecture 5

3 Some properties of loss functions

Outline: In principle, any function that reasonably measures the goodness of fit between model and data can be used as a loss function. In this section we discuss implications for the location of minima that we might encounter when going from one loss function to another.

3.1 Using a different norm

Outline: The definition of a loss function involves the choice of some measure of distance between the data and the model prediction. In this section we briefly discuss, by an example, the situation where different distance measures yield different locations of the minimum, i.e. the distance measure used has an influence on the parameters to choose for the model.

Set-up: Suppose we have two-dimensional data $(x_i, y_i)$, $i = 1, \dots, n$, for which we want to estimate the best model of the linear type $\hat{y} = w_0 + w_1 x$. We thus get $\hat{y}_i = w_0 + w_1 x_i$ for the model predictions. For the loss function f we may want to choose an arbitrary p-norm and define

$$f_p(w_0, w_1) = \|\vec{y} - \hat{\vec{y}}\|_p^p = \sum_{i=1}^n |y_i - \hat{y}_i|^p,$$

where $\hat{\vec{y}} = (\hat{y}_1, \dots, \hat{y}_n)^T$ and $\vec{y} = (y_1, \dots, y_n)^T$. To make sure that the loss function is everywhere differentiable, and to keep the calculations as simple as possible, we assume that p is an even integer, which implies

$$f_p(w_0, w_1) = \sum_{i=1}^n (y_i - \hat{y}_i)^p.$$

For the gradient of this loss function we get

$$\nabla f_p(w_0, w_1) = \left(-\sum_{i=1}^n p(y_i - w_0 - w_1 x_i)^{p-1},\; -\sum_{i=1}^n p\,x_i(y_i - w_0 - w_1 x_i)^{p-1}\right)^T.$$

Example: For simplicity we assume n = 3 and choose the data $(x_1, y_1) = (0, 0)$, $(x_2, y_2) = (1, 1)$ and $(x_3, y_3) = (-1, 1)$. The optimal values of $w_0$ and $w_1$ are solutions of

$$p(-w_0)^{p-1} + p(1 - w_0 - w_1)^{p-1} + p(1 - w_0 + w_1)^{p-1} = 0 \quad (1)$$

$$p(1 - w_0 - w_1)^{p-1} - p(1 - w_0 + w_1)^{p-1} = 0. \quad (2)$$

The case p = 2: For p = 2 we get from (2) that $w_1 = 0$. Inserting this into (1) gives the value $w_0 = 2/3$. We thus have the best model parameters for p = 2 as $\vec{w}^\star = (w_0^\star, w_1^\star)^T = (2/3, 0)^T$.

The case p = 4: For p = 4, equations (1) and (2) give the conditions

$$-w_0^3 + (1 - w_0 - w_1)^3 + (1 - w_0 + w_1)^3 = 0 \quad (3)$$

$$(1 - w_0 - w_1)^3 - (1 - w_0 + w_1)^3 = 0, \quad (4)$$

which can be simplified. Expanding (3) we get

$$0 = -w_0^3 + (1 - w_0)^3 - 3(1 - w_0)^2 w_1 + 3(1 - w_0)w_1^2 - w_1^3 + (1 - w_0)^3 + 3(1 - w_0)^2 w_1 + 3(1 - w_0)w_1^2 + w_1^3 = -w_0^3 + 2(1 - w_0)^3 + 6(1 - w_0)w_1^2.$$

We thus have the first condition

$$-w_0^3 + 2(1 - w_0)\left((1 - w_0)^2 + 3w_1^2\right) = 0. \quad (5)$$
Expanding (4) we get

$$0 = (1 - w_0)^3 - 3(1 - w_0)^2 w_1 + 3(1 - w_0)w_1^2 - w_1^3 - (1 - w_0)^3 - 3(1 - w_0)^2 w_1 - 3(1 - w_0)w_1^2 - w_1^3 = -2w_1^3 - 6w_1(1 - w_0)^2.$$

We thus have the second condition

$$w_1\left(w_1^2 + 3(1 - w_0)^2\right) = 0. \quad (6)$$

We first notice that our previous optimum does not solve (5) and (6), i.e. this loss function does not produce the same best parameters as the loss function before.

Remark 1: We saw that the location of the optimum may change when using a different distance measure, i.e. a different loss function. Moreover, different loss functions for the same data may have a different number of local minima.

Remark 2: The reason why different loss functions may have different optimal values lies in the fact that the relative contribution of data points to the value of the loss function depends on our choice of measuring distances. Consider again the case of two-dimensional data (x, y) and loss functions based on the p-norm as defined above. Suppose we have a perfect model, i.e. the loss function is zero for a specific choice of parameters and a given data set. Assume that we change one data point $(x_i, y_i)$ to $(x_i, y_i + d)$. This will change the value of the loss function for our perfect model from zero to $|d|^p$. The amount of change depends on p, and this will in turn have an effect on the optimal value of the new loss function, which depends on p.

Remark 3: The value p = 2 is by far the most common choice for p-norm based loss functions, for three reasons. Firstly, p = 2 is mathematically convenient when it comes to calculating gradients and solving the resulting system of equations to find the minima. Secondly, the unwanted influence of outliers is smaller than for other choices of even integer p (which are also mathematically convenient, but not as much as p = 2). Thirdly, a probabilistic framework for the class of regression problems is very well suited to p = 2.
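To see the p-dependence concretely, the following sketch (added here for illustration; it is not part of the original notes and simply applies gradient descent to $f_p$ for the three example data points) computes the best-fit line parameters for p = 2 and p = 4.

```python
import numpy as np

x = np.array([0.0, 1.0, -1.0])
y = np.array([0.0, 1.0, 1.0])

def grad_fp(w, p):
    # gradient of f_p(w0, w1) = sum_i (y_i - w0 - w1*x_i)^p, for even p
    r = y - w[0] - w[1] * x
    return np.array([-(p * r**(p - 1)).sum(), -(p * x * r**(p - 1)).sum()])

def fit(p, gamma=0.01, iters=20_000):
    w = np.zeros(2)
    for _ in range(iters):
        w = w - gamma * grad_fp(w, p)
    return w

print(fit(2))   # approx (0.667, 0.0), matching the analytic result (2/3, 0)
print(fit(4))   # approx (0.56, 0.0): w1 stays 0, but w0 moves with the norm
```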
3.2 Including a penalty term

Motivation: When fitting a model to a given data set, we might face the situation that one of the parameters, say $w_l$, is much larger than the others. Such a result is not desirable for various reasons. In particular, the data feature controlled by $w_l$, say $x_l$, then has a large influence compared to the others, and if $x_l$ is corrupted by noise this changes the estimation procedure to a large extent when a new data set is used. In other words, the model is overfitted in the sense that applying the model to new data will usually not work well, since the fitted parameters depend too much on the data set we used for estimation. Another problem concerns the situation that some data feature $x_l$ is not really relevant, in the sense that $x_l$ contributes to the prediction only by chance. To give an example, suppose we want to predict the height of people by using information about weight, shoe size and income. Obviously, income should not be relevant. However, our data include income and the model will take income into account. To summarise this reasoning, we usually prefer to have a model where the parameters do not vary too much, and where the model only takes important data features into account, i.e. we sometimes want to reduce the complexity of the model even though this will cause an increase in the loss function. Introducing a penalty term in the loss function often helps to achieve this task.

Penalty term: The penalty method in general replaces a given loss function $L(P, M_{\vec{w}})$ for data P by a new loss function

$$L_{\text{new}}(P, M_{\vec{w}}) = L(P, M_{\vec{w}}) + \lambda \cdot \text{Penalty}(\vec{w}).$$

The penalty parameter λ controls the influence of the penalty term. The penalty term depends only on $\vec{w}$ and is independent of the data. Its purpose is to reduce the magnitude of the model parameters to comparable size, or to get rid of some model parameters and thus reduce the complexity of the model. There are many possibilities to choose from for the penalty term. We will illustrate its effect by a widely used example where the penalty term is given by the 1-norm

$$\text{Penalty}(\vec{w}) = L_1(\vec{w}) = \sum_{i=1}^d |w_i|.$$

Contour lines of the $L_1$ penalty term: Recall that the contour lines for the 1-norm are hypercubes, which reduce in the two-dimensional case to a square with side length equal to the square root of two times the value at the contour line, and with the corners along the coordinate axes.

Effect of the $L_1$ penalty term: In general, the penalty term forces the new loss function to take on its minimum at a different location than before. Let us assume that the old minimum was at $\vec{w}^\star_{\text{old}}$ and the new minimum is at $\vec{w}^\star_{\text{new}}$. We always have that $\vec{w}^\star_{\text{new}}$ lies on a contour line of L and on a contour line of $L_1$. Suppose the corresponding values are

$$L(P, M_{\vec{w}^\star_{\text{new}}}) = c \qquad \text{and} \qquad \lambda L_1(\vec{w}^\star_{\text{new}}) = c_1.$$

For the sake of simplicity assume that d = 2, that $\vec{w}^\star_{\text{old}}$ has positive components, and that the contour lines of L are of elliptic shape. Then $\vec{w}^\star_{\text{new}}$ is located where the contour line of L with value c touches the square with side length $\sqrt{2}\,c_1/\lambda$. The effect is that $\vec{w}^\star_{\text{new}}$ has components that are more likely to be of comparable size if they meet around the centre of the corresponding side of the square. If they meet around a corner of the square, one component of $\vec{w}^\star_{\text{new}}$ is very small, and neglecting it has only a small influence on model performance for $\vec{w}^\star_{\text{new}}$, i.e. we reduce complexity. In either case the result may be favoured.

Choice of λ: The choice of λ is crucial for the penalty method. Increasing λ forces the model to take decreasing values of its parameters as optimal, and decreasing λ reduces the effect of the penalty term. The right choice needs to be judged by the resulting model performance, i.e. its prediction ability.

4 Maximum likelihood method

Motivation: So far, we discussed deterministic models used to predict some property y from other properties $\vec{x}$. We now take a probabilistic point of view and consider these properties as being observations of random variables, where the prediction of y is now taken as the most probable value of the random variable describing y, given that the other random variables have the observed values $\vec{x}$. In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probabilistic model, given some observed data. A loss function is a measurement of model misfit as a function of the model parameters. MLE is a specific type of probability model estimation, where the loss function is the (log) likelihood. The objective of MLE is to choose values for the parameters of the probabilistic model that would maximise the probability of observing the y values in the sample with the given x values. This probability is summarised in what is called the likelihood function.

4.1 Some basic probability

Motivation: This section and the following sections provide a primer on probabilistic tools that are necessary to understand maximum likelihood theory.

Random variables: The specific outcome of a well-defined experiment or measurement we observe is always random.
Taking a sample, i.e. measuring the values of variables, is thus in general accompanied by a certain amount of randomness, i.e. we are in general always dealing with so-called random variables describing observations. To put this idea into a mathematical framework, we may say that a random variable assigns a value to each possible outcome of an experiment or observation of a system, i.e. it is a function acting on the space of all possible outcomes, the so-called sample space. We denote random variables by capital letters X, Y, ... and their values by small letters x, y, .... Random variables may be discrete or continuous. A discrete random variable has a finite or countably infinite number of possible values. These values (also called states) are not necessarily integers; they can also be names (called attributes). The values of a continuous random variable are real numbers in certain intervals of the real line. In the same spirit one defines vector valued random variables.

Data sets: When fitting a model to data we should always keep in mind that the data we use are to some extent random, i.e. we have chosen one data set out of all possible data sets. We may thus say that our specific data set $\vec{x}_i$, $i = 1, \dots, n$, consists of observations of a random variable $\vec{X}$ describing the data at hand. We may then ask how characteristic our data are, i.e. do we have a data set that has a large probability of being observed, or a data set that is not characteristic, with a small probability of being observed.

Probability density: Let X be a continuous random variable. We denote by $P(a \le X \le b)$ the probability that $a \le x \le b$, i.e. that the observations of X are within the specified interval. A function p(x) is called the probability density of X if

$$P(a \le X \le b) = \int_a^b p(x)\,dx$$

for all a and b. Note that p is a non-negative function and it gives all probabilities as the area under p between a and b. Figure 1 illustrates this relation. Note also that p(x) does not give the probability that the random variable takes on the value x; in the continuous case, probabilities are given by integrals and not by the values p(x).

Figure 1: Illustration of the relation between probability and probability density (probability as the area under the density between a and b).

Joint probability density: The case of two or more continuous random variables is pretty much the same as for one random variable. The key element is that joint probabilities are calculated using joint probability densities. Let $X_1, \dots, X_k$ be continuous random variables on the sample space S. The joint probability density of $X_1, \dots, X_k$ is a function $p(x_1, \dots, x_k)$ such that the probabilities can be expressed as

$$P(a_1 \le X_1 \le b_1, \dots, a_k \le X_k \le b_k) = \int_{a_1}^{b_1} \cdots \int_{a_k}^{b_k} p(x_1, \dots, x_k)\,dx_1 \cdots dx_k.$$

Marginal probability density: Consider the case of k continuous random variables $X_1, \dots, X_k$ with joint probability density $p(x_1, \dots, x_k)$. We get the probability density of $X_i$, $i = 1, \dots, k$, called the marginal probability density of $X_i$, as

$$p(x_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x_1, \dots, x_k)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_k.$$

In the same manner we can calculate the joint marginal probability density for a subset of the random variables $X_1, \dots, X_k$. For instance,

$$p(x_1, x_2) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x_1, x_2, \dots, x_k)\,dx_3 \cdots dx_k.$$
4.2 Conditional probability densities

Motivation: In applications, one is sometimes interested in the probability that a random variable takes on its values under the assumption that another random variable has some specified value. Problems of this type lead to the notions of conditional probability and conditional probability density.

Conditional probability density: Consider two continuous random variables $X_1$ and $X_2$ with joint probability density $p(x_1, x_2)$. The conditional probability density of $X_1$ given that $X_2 = x_2$ is defined as

$$p_1(x_1 | x_2) = \frac{p(x_1, x_2)}{p_2(x_2)} \qquad \text{for } p_2(x_2) \ne 0.$$

In the same spirit we get the conditional probability density of $X_2$ given that $X_1 = x_1$ as

$$p_2(x_2 | x_1) = \frac{p(x_1, x_2)}{p_1(x_1)} \qquad \text{for } p_1(x_1) \ne 0.$$

We thus have

$$p(x_1, x_2) = p_1(x_1 | x_2)\,p_2(x_2) = p_2(x_2 | x_1)\,p_1(x_1).$$

Generalisation of conditional probability densities: The case of n continuous random variables gives similar results. In particular we get

$$p(x_n | x_1, \dots, x_{n-1}) = \frac{p(x_1, x_2, \dots, x_n)}{p(x_1, \dots, x_{n-1})}, \qquad p(x_1, \dots, x_{n-1}) \ne 0,$$

and

$$p(x_1, x_2, \dots, x_n) = p(x_n | x_1, \dots, x_{n-1})\,p(x_1, \dots, x_{n-1}).$$

4.3 Independence

Motivation: Independence is an important notion in probability theory that allows us to express joint probability densities as products of marginal probability densities.

Independence of continuous random variables: The continuous random variables $X_1, \dots, X_k$ are independent if and only if

$$p(x_1, \dots, x_k) = p_1(x_1) \cdot \ldots \cdot p_k(x_k).$$

For independent random variables we have

$$p_1(x_1 | x_2) = p_1(x_1), \qquad p_2(x_2 | x_1) = p_2(x_2).$$

4.4 Expectation, variance and covariance

Motivation: Expectations of random variables and of functions of random variables provide a straightforward way to infer characteristic properties of the probabilistic structure of random variables.

Expectation, expected value: Consider a continuous random variable X with probability density p(x) and some function f(X). We may conceive of f(X) as a new random variable with values f(x). The expectation or expected value of f(X), $E_{X \sim p}[f(X)]$, is the average value of f(X) for x within the range of X, i.e.

$$E_{X \sim p}[f(X)] = \int_{\text{all } x} f(x)\,p(x)\,dx.$$

Notation: If the probability distribution or probability density is clear from the context, we may use the notation E[f(X)].

Variance and standard deviation of f(X): The variance Var(f(X)) measures how much the values of f(X) scatter around the expectation of f(X),

$$\text{Var}(f(X)) = E\left[(f(X) - E[f(X)])^2\right].$$

The variance of f(X) plays the role of the average squared distance between f(X) and its expectation. The square root of the variance is called the standard deviation of f(X) and serves as a measure of the mean distance between f(X) and its expectation. The variance of a random variable X is often denoted as $\sigma^2$, with σ giving its standard deviation.

Covariance between f(X) and g(Y): Consider two random variables X and Y and some functions f, g. Important information about the distributional properties of f(X) and g(Y) is provided by the covariance of f(X) and g(Y), defined as

$$\text{Cov}(f(X), g(Y)) = E\left[(f(X) - E[f(X)])\,(g(Y) - E[g(Y)])\right],$$

where the expectation is performed using the joint probability density of X and Y. Large absolute values of the covariance imply that the values of f(X) and g(Y) are widely spread and are both, with large probability, far from their respective mean values at the same time. If the sign of the covariance is positive, then f(X) and g(Y) have a high probability of taking on high values simultaneously.
If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times when the other one takes on a relatively low value, and vice versa. Other measures, such as the correlation, normalise the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

Covariance matrix: Consider n random variables $X_1, \dots, X_n$ with real values, which may be summarised as a random vector $\vec{X} = (X_1, \dots, X_n)$ with values $\vec{x} \in \mathbb{R}^n$. The covariance matrix $\text{Cov}(\vec{X})$ is defined as the n × n matrix with entries

$$\text{Cov}(\vec{X})_{i,j} = \text{Cov}(X_i, X_j).$$

The diagonal elements give the variances of the components of $\vec{X}$,

$$\text{Var}(X_i) = \text{Cov}(X_i, X_i).$$

Moments of a random variable: Moments refer to the special case $f(X) = X^n$, $n \in \mathbb{N}$. The mean of X, the first order moment, is often denoted as µ = E[X]. Higher order moments are denoted as $\mu_k = E[X^k]$. Moments around the mean are defined as $\mu'_k = E[(X - \mu)^k]$. The mean and the variance $\sigma^2 = \mu'_2$ are useful measures to describe the location and variation of a probability density. Higher order moments describe additional properties. For instance, $\mu'_3$ can be used to quantify the deviation from symmetry around the mean, called skewness.

Exercises

1. For n = 3, consider the data $(x_1, y_1) = (0, 0)$, $(x_2, y_2) = (1, 1)$, $(x_3, y_3) = (-1, 1)$, and the loss functions

$$f_p(w_0, w_1) = \|\vec{y} - \hat{\vec{y}}\|_p^p,$$

where $\hat{\vec{y}} = (\hat{y}_1, \dots, \hat{y}_n)^T$, $\vec{y} = (y_1, \dots, y_n)^T$ with $\hat{y}_i = w_0 + w_1 x_i$. Find the best model parameters $\vec{w}^\star = (w_0^\star, w_1^\star)^T$ for $p = 1, \dots, 9$ and plot these best parameters as a function of p.

2. Redo Exercise 1 using $(x_1, y_1) = (0, 1)$, $(x_2, y_2) = (1, 3)$, $(x_3, y_3) = (1, 3.5)$.

3. For n = 3, consider the data $(x_1, y_1) = (0, 0)$, $(x_2, y_2) = (1, 1)$, $(x_3, y_3) = (2, 2.2)$, and the loss functions

$$f_\lambda(w_1, w_2) = \|\vec{y} - \hat{\vec{y}}\|_2^2 + \lambda\,\|\vec{w}\|_1,$$

where $\hat{\vec{y}} = (\hat{y}_1, \dots, \hat{y}_n)^T$, $\vec{y} = (y_1, \dots, y_n)^T$, $\vec{w}^T = (w_1, w_2)$ with $\hat{y}_i = w_1 x_i + w_2 x_i^2$. Find the best model parameters $\vec{w}^\star = (w_1^\star, w_2^\star)^T$ for λ = 0, 0.01, 0.05, 0.1, 0.5 and plot these best parameters as a function of λ.

4. Consider the data set velocity.txt, which contains a time series of length $n = 10^7$ of measurements $v_1, \dots, v_n$ of the main component of the turbulent velocity vector V in a high Reynolds number helium jet experiment. Plot a suitable histogram using all of the data and estimate the mean value µ and the variance $\sigma^2$ of V.

Remark: A histogram divides the range of the data into bins of equal size and plots, for each bin, the number or relative number of data points within the bin. Such a plot can be used as an approximation of the shape of the underlying probability density.

Remark: The mean value is estimated as

$$\mu = \frac{1}{n}\sum_{i=1}^n v_i.$$

Remark: The variance is estimated as

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (v_i - \mu)^2.$$
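A minimal sketch for Exercise 4 (added here for illustration; it assumes the file velocity.txt contains one measurement per line and that NumPy and Matplotlib are available) could look as follows.

```python
import numpy as np
import matplotlib.pyplot as plt

v = np.loadtxt("velocity.txt")   # one velocity measurement per line (assumed format)

mu = v.mean()                    # estimate of the mean, (1/n) * sum(v_i)
sigma2 = v.var(ddof=1)           # variance estimate with the 1/(n-1) factor
print(mu, sigma2)

# Histogram of relative frequencies as an approximation of the probability density
plt.hist(v, bins=200, density=True)
plt.xlabel("v")
plt.ylabel("relative frequency (density)")
plt.show()
```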
Lecture 6

4.5 Some important probability densities

Outline: There is a number of specific probability distributions and probability densities that are of particular interest for machine learning and for data science in general. The following sections briefly introduce two of them.

4.5.1 Gaussian probability density

Definition: The Gaussian probability density, or normal probability density, is the most important probability density over the real line. For a random variable X with values x ∈ ℝ, the normal probability density with mean value µ and variance $\sigma^2$ is defined as

$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty.$$

Gradient Descent - 3D visualization

Figure: In blue, the global minimum; in red, the iteration points.

Gradient Descent: Effect of the learning rate/1

Figure: Learning rate = 0.1.

Gradient Descent: Effect of the learning rate/2

Figure: Learning rate = 0.4.

Gradient Descent: Effect of the learning rate/3

Figure: Learning rate = 0.5.

Batch GD, SGD and Mini-Batch GD - Intuition

For a dataset consisting of multiple samples, the loss function can be computed by summing over samples:

$$E(w) = \frac{1}{N}\sum_{i=1}^N E_i(w)$$

▶ The classical gradient descent update rule, i.e. updating the weights based on the gradient of the entire cost E(w), is called the batch version. However, for large N computing ∇E(w) is very time consuming.
▶ To speed up the update rule we approximate ∇E(w) with $\nabla E_i(w)$. This is the idea behind the so-called Stochastic Gradient Descent (SGD), or online version.
▶ A trade-off between batch GD and SGD is called mini-batch GD.

Batch GD, SGD and Mini-Batch GD - Algorithms

Batch GD:
▶ Start with an initial guess $w^0$.
▶ For j ≥ 0, update $w^{j+1} = w^j - \alpha\nabla E(w^j)$.

SGD (online):
▶ Start with an initial guess $w^0$.
▶ For each epoch j ≥ 0:
  ▶ draw a random sample i from the dataset;
  ▶ for each 1 ≤ i ≤ N update $w^{j+1} = w^j - \alpha\nabla E_i(w^j)$.

Mini-Batch GD:
▶ Fix an integer 1 ≤ mb ≤ N (mini-batch size).
▶ Start with an initial guess $w^0$.
▶ For each epoch j ≥ 0:
  ▶ draw a random batch from the dataset;
  ▶ for each 0 ≤ i < N/mb update

$$w^{j+1} = w^j - \alpha\nabla\sum_{k=i\cdot mb+1}^{(i+1)\cdot mb} E_k(w^j).$$
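The following sketch (an illustration added to these notes, not from the original slides; it uses a least-squares cost on synthetic data and arbitrary hyperparameters) contrasts the batch, stochastic, and mini-batch update rules.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1*rng.normal(size=N)

def grad_E(w, idx):
    # Gradient of the mean squared error restricted to the samples in idx
    Xi, yi = X[idx], y[idx]
    return 2.0 / len(idx) * Xi.T @ (Xi @ w - yi)

alpha, epochs, mb = 0.05, 100, 20
w_batch = np.zeros(d)
w_sgd = np.zeros(d)
w_mini = np.zeros(d)

for epoch in range(epochs):
    # Batch GD: one update per epoch using all N samples
    w_batch -= alpha * grad_E(w_batch, np.arange(N))

    # SGD: one update per sample, samples visited in random order
    for i in rng.permutation(N):
        w_sgd -= alpha * grad_E(w_sgd, np.array([i]))

    # Mini-batch GD: one update per batch of size mb
    for batch in np.split(rng.permutation(N), N // mb):
        w_mini -= alpha * grad_E(w_mini, batch)

print(w_batch, w_sgd, w_mini)   # all close to the true weights (1, -2, 0.5)
```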
How to choose among different variants?

▶ Batch: usually more stable and provides a more accurate estimate of the gradient, but slow.
▶ SGD: fast; the stochastic approximation of the gradient implies possible instability (zig-zag effect).
▶ Mini-Batch GD: a trade-off between Batch GD and SGD (parallelism available).

Figure: Batch GD vs SGD vs Mini-Batch GD.

Normal Equation for LR and Gradient Descent

We have $E(w) = \frac{1}{N}\|Xw - y\|^2$, hence

$$\nabla E(w) = \frac{1}{N}\nabla(\|Xw - y\|^2) = \frac{2}{N}X^T(Xw - y)$$

▶ Normal equation (the last equivalence holds if $X^TX$ is invertible):

$$\nabla E(w) = 0 \iff \frac{2}{N}X^T(Xw - y) = 0 \iff X^TXw = X^Ty \iff w = (X^TX)^{-1}X^Ty$$

▶ Gradient descent main iteration for LR:

$$w^{j+1} = w^j - \frac{2\alpha}{N}X^T(Xw^j - y)$$

On the invertibility of $X^TX$

Invertibility of $X^TX$ ⟺ columns of X linearly independent. What if $X^TX$ is not invertible? If two columns are linearly dependent, then those features are correlated (redundant). Solution: discard one of those features.

Normal Equation vs Gradient Descent

Normal equation:
▶ No hyperparameters (explicit solution).
▶ No iterations.
▶ Computational cost dominated by the inversion of a dense matrix (slow when N is large).

Gradient Descent:
▶ Need to choose the learning rate α.
▶ Needs many iterations (the number usually increases with dataset size).
▶ Cost per iteration lower than matrix inversion (constant when using mini-batch GD), hence faster when N is large.

Dataset scaling

In general, features must have similar magnitudes for good convergence of gradient descent-based methods.

Figure: Example loss landscapes with and without feature scaling (F1 and F2 are the weights of the model).

Dataset scaling

Common approaches:

▶ Min-max scaling. Compute the feature maximum vector $M := [\max_i x_j^{(i)}]$ and the feature minimum vector $m := [\min_i x_j^{(i)}]$. Then normalize the features as follows:

$$x^{(i)}_{\text{norm}} = \frac{x^{(i)} - m}{M - m}$$

▶ Standardization. Compute the feature mean µ ($\mu_j := E_i[x_j^{(i)}]$) and the feature standard deviation σ ($\sigma_j := \sqrt{\text{Var}_i[x_j^{(i)}]}$). Then normalize the features as follows:

$$x^{(i)}_{\text{norm}} = \frac{x^{(i)} - \mu}{\sigma}$$

Polynomial Regression

Polynomial Regression → polynomial hypothesis. Example: quadratic hypothesis (single feature):

$$h_w(x_1) = w_0 + w_1 x_1 + w_2 x_1^2$$

(in the case of two features, there is also the term $x_1 x_2$). To fit this model, use Linear Regression with the features $x_1 = x_1$ and $x_2 = x_1^2$.
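A small sketch tying these slides together (added for illustration, not from the original slides; the data and hyperparameters are arbitrary): fit the quadratic hypothesis by building the feature matrix X with columns $(1, x, x^2)$, then compare the normal-equation solution with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 0.5*x - 1.5*x**2 + 0.1*rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x, x**2])   # polynomial features (1, x, x^2)
N = len(y)

# Normal equation: w = (X^T X)^{-1} X^T y  (use solve instead of an explicit inverse)
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on E(w) = (1/N) ||Xw - y||^2
w_gd = np.zeros(3)
alpha = 0.02
for _ in range(20_000):
    w_gd -= (2*alpha/N) * X.T @ (X @ w_gd - y)

print(w_ne)   # close to (1.0, 0.5, -1.5)
print(w_gd)   # essentially the same solution
```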
Logistic Regression
Prof. Alessandro Lucantonio, Aarhus University

Binary classification

Recall:
▶ Classification → discrete target.
▶ Binary classification: {0, 1} target (e.g. spam/not spam).

Idea: consider a hypothesis (threshold) such that $0 \le h_w \le 1$ and
▶ if $h_w(x) \ge 0.5$, predict 1;
▶ if $h_w(x) < 0.5$, predict 0.

Logistic Regression

In particular, take $h_w(x) = \sigma(w^Tx)$, where

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

is the sigmoid function. $h_w$ gives us the probability that the output is 1.

Figure: Sigmoid function.

Linear decision boundary

Hypothesis: $h_w(x_1, x_2) = \sigma(w_0 + w_1 x_1 + w_2 x_2)$.

$h_w(x_1, x_2) \ge 0.5$ when $w_0 + w_1 x_1 + w_2 x_2 \ge 0 \implies y = 1$
$h_w(x_1, x_2) < 0.5$ when $w_0 + w_1 x_1 + w_2 x_2 < 0 \implies y = 0$

The decision boundary is not unique in the linearly separable case unless additional constraints, like regularization, are applied.

Figure: An example of a linear decision boundary for the linearly separable case.

Non-linear decision boundary

Hypothesis: $h_w(x_1, x_2) = \sigma(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)$

Figure: An example of a non-linear decision boundary.

Cost function

First attempt: MSE

$$E(w) = \frac{1}{N}\sum_{i=1}^N (\sigma(w^T\tilde{x}^{(i)}) - y^{(i)})^2$$

Problem: σ is non-convex, hence the MSE, as a function of the weights, is non-convex (possibly many local minima).

Main idea: consider a loss term that behaves as
▶ $-\log h_w(\tilde{x}^{(i)})$ if y = 1,
▶ $-\log(1 - h_w(\tilde{x}^{(i)}))$ if y = 0.

Notice that:
▶ if y = 1 and $h_w(\tilde{x}^{(i)}) \to 1$, then $-\log h_w(\tilde{x}^{(i)}) \to 0$;
▶ if y = 1 and $h_w(\tilde{x}^{(i)}) \to 0$, then $-\log h_w(\tilde{x}^{(i)}) \to \infty$;
▶ if y = 0 and $h_w(\tilde{x}^{(i)}) \to 1$, then $-\log(1 - h_w(\tilde{x}^{(i)})) \to \infty$;
▶ if y = 0 and $h_w(\tilde{x}^{(i)}) \to 0$, then $-\log(1 - h_w(\tilde{x}^{(i)})) \to 0$.

Cross-entropy

Combine the log loss terms:

Binary cross-entropy

$$E(w) := -\frac{1}{N}\sum_{i=1}^N \left[ y^{(i)}\log(h_w(\tilde{x}^{(i)})) + (1 - y^{(i)})\log(1 - h_w(\tilde{x}^{(i)})) \right]$$

This cost function is convex (a sum of convex terms) with respect to the weights. Vectorized version:

$$E(w) = -\frac{1}{N}\left[ y^T\log(h_w(X)) + (1 - y^T)\log(1 - h_w(X)) \right]$$

with $h_w(X) = \sigma(Xw)$.

Conditional Log-Likelihood and Binary Cross-Entropy

In a binary classification problem, the model provides a joint probability of the form

$$P(y^{(i)} | x^{(i)}; \theta) = (\hat{y}^{(i)})^{y^{(i)}}\,(1 - \hat{y}^{(i)})^{(1 - y^{(i)})}$$

The log-likelihood is:

$$\sum_{i=1}^N \log P(y^{(i)} | x^{(i)}; \theta) = \sum_{i=1}^N y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)}),$$

which is (up to sign and the 1/N factor) the binary cross-entropy loss.

Convexity

Any local minimum of a convex function is also a global minimum. Instead, a non-convex function potentially has many local minima and saddle points (vanishing gradient but neither a minimum nor a maximum).

Figure: An example of a non-convex function.

Is a local minimum always bad?

Answer: NO; remember the main challenge/goal of ML: generalization.
▶ A global minimum of the cost function corresponds to the best fit of the training set: this may lead to overfitting.
▶ Having a convex cost function is convenient, but finding a local minimum is probably even better than a global minimum.
▶ Saddle points may be bottlenecks for gradient-based optimization, but there are techniques to avoid them (momentum, learning rate schedules, ...) in modern optimization algorithms.

Gradient descent - Derivative of the sigmoid

For gradient descent-based optimization, we need to compute the gradient of the cross-entropy with respect to the weights. Recall: $\sigma(t) := \frac{1}{1+e^{-t}}$.

$$\frac{d}{dt}\sigma(t) = \frac{e^{-t}}{(1 + e^{-t})^2} = \left(\frac{1}{1 + e^{-t}}\right)\left(\frac{e^{-t}}{1 + e^{-t}}\right) = \sigma(t)\left(1 - \frac{1}{1 + e^{-t}}\right) = \sigma(t)(1 - \sigma(t)).$$

Gradient of the cross-entropy /1

(assuming summation over repeated indices)

$$N\frac{\partial}{\partial w_j}E(w) = -\left[\frac{y^{(i)}\,\frac{\partial}{\partial w_j} h_w(\tilde{x}^{(i)})}{h_w(\tilde{x}^{(i)})} - \frac{(1 - y^{(i)})\,\frac{\partial}{\partial w_j}(1 - h_w(\tilde{x}^{(i)}))}{1 - h_w(\tilde{x}^{(i)})}\cdot(-1)\right]$$

with

$$\frac{\partial}{\partial w_j}h_w(\tilde{x}^{(i)}) = \sigma'(w^T\tilde{x}^{(i)})\,\tilde{x}_j^{(i)} = \sigma(w^T\tilde{x}^{(i)})(1 - \sigma(w^T\tilde{x}^{(i)}))\,\tilde{x}_j^{(i)} = h_w(\tilde{x}^{(i)})(1 - h_w(\tilde{x}^{(i)}))\,\tilde{x}_j^{(i)}.$$

Gradient of the cross-entropy /2

$$N\frac{\partial}{\partial w_j}E(w) = -\left[y^{(i)}(1 - h_w(\tilde{x}^{(i)}))\,\tilde{x}_j^{(i)} - (1 - y^{(i)})\,h_w(\tilde{x}^{(i)})\,\tilde{x}_j^{(i)}\right] = \left[h_w(\tilde{x}^{(i)}) - y^{(i)}\right]\tilde{x}_j^{(i)}.$$

Final result:

$$\frac{\partial}{\partial w_j}E(w) = \frac{1}{N}\sum_{i=1}^N \left[h_w(\tilde{x}^{(i)}) - y^{(i)}\right]\tilde{x}_j^{(i)}.$$

Vectorized version:

$$\nabla E(w) = \frac{1}{N}X^T(\sigma(Xw) - y).$$

Multi-class classification

Targets: y ∈ {0, ..., k}. Idea: solve k + 1 binary classification problems. Given a data sample $x^{(i)}$:
▶ For each 0 ≤ j ≤ k, compute the probability $h_w^{(j)}(\tilde{x}^{(i)})$ that $\tilde{x}^{(i)}$ belongs to the class j.
▶ The prediction will be the class that corresponds to the maximum probability, i.e. $\arg\max_j h_w^{(j)}(\tilde{x}^{(i)})$.
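A compact NumPy sketch of these formulas (added for illustration; not part of the original slides, with synthetic data and an arbitrary step size): the sigmoid, the binary cross-entropy, and its vectorized gradient $\nabla E(w) = \frac{1}{N}X^T(\sigma(Xw) - y)$ used in a plain gradient descent loop.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    p = sigmoid(X @ w)
    eps = 1e-12                      # avoid log(0)
    return -np.mean(y*np.log(p + eps) + (1 - y)*np.log(1 - p + eps))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Synthetic, linearly separable data with a bias column
rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=(N, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
X = np.column_stack([np.ones(N), x])          # features (1, x1, x2)

w = np.zeros(3)
alpha = 0.5
for _ in range(5000):
    w -= alpha * grad(w, X, y)

pred = (sigmoid(X @ w) >= 0.5).astype(float)  # threshold at 0.5
print(cross_entropy(w, X, y), (pred == y).mean())   # low loss, accuracy close to 1
```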
Regularization and Model Selection
Prof. Alessandro Lucantonio, Aarhus University

Underfitting and overfitting - intuition

Imagine that you have to study for a written exam consisting of exercises. Practicing on only a few exercises leads to poor performance both on the homework and at the exam. This is called underfitting: the performance at the exam will be bad because of poor training. Conversely, memorizing all the solutions to the homework leads to the maximum score on the homework (trivially) but probably a bad score on the exam exercises. This is called overfitting. The bad performance on the exam occurs because you did not capture the true essence of the homework; rather, you memorized its peculiarities.

Underfitting and overfitting - analytical example

Figure: Underfitting (polynomial degree 1), good fitting (polynomial degree 4), overfitting (polynomial degree 15).

Overfitting can be countered through:
▶ regularization;
▶ validation.

Regularization

Overfitting is correlated with "complex" models: to avoid them, we penalize models with large weights. One way to do this is to consider the following cost function

$$E_r(w) = E(w) + R_\lambda(w)$$

where $R_\lambda$ is the regularization term. λ > 0 is a hyperparameter that must be chosen in the model selection phase (see later). The most common regularization techniques (which do not include the bias term) are:

▶ Tikhonov (or $L_2$ or Ridge): $R_\lambda(w) = \lambda\|w\|_2^2 = \lambda\sum_i w_i^2$. Tends to make all weights small.
▶ LASSO (or $L_1$): $R_\lambda(w) = \lambda\|w\|_1 = \lambda\sum_i |w_i|$. Tends to make some weights 0 (feature selection).

Linear Regression with Tikhonov regularization

▶ Cost function (without regularization):

$$E(w) = \frac{1}{N}\|Xw - y\|^2$$

▶ Regularized cost function:

$$E_r(w) = E(w) + \lambda\|w\|^2$$

▶ Gradient of the regularized cost function:

$$\nabla E_r(w) = \nabla E(w) + 2\lambda w = 2\left(\frac{1}{N}X^T(Xw - y) + \lambda w\right)$$

▶ Normal equation:

$$w = (X^TX + \lambda I)^{-1}X^Ty$$

Note that in this case $X^TX + \lambda I$ is always invertible (why?).

Bias-Variance trade-off

Bias: expected discrepancy between targets and predictions given by the model. Variance: a measure of the deviation from the expected value given by the model that any particular sampling of the data (from the same underlying data-generating distribution) is likely to cause.

Some intuitions on the effect of regularization:
▶ High λ ⇝ simple model ⇝ high bias (underfitting).
▶ Low λ ⇝ complex model ⇝ high variance (overfitting).
▶ Intermediate λ: optimal solution (good bias-variance trade-off).

Generalization

Central challenge in Machine Learning: generalization! Our algorithm must perform well on new, previously unseen inputs. Recall that we have to find a balance between bias and variance. Even if we limit the complexity of our model using regularization, the training set does not provide a good estimate of the test (generalization) error. In other words, generalization is compromised if we choose hyperparameters (including the regularization factor) only according to the training error.

Generalization - Training vs test error

Figure: Training (blue) and test (red) errors as the model complexity varies. We want to be in the middle, between underfitting and overfitting; the test error curve is needed so that we neither underfit nor overfit.

Model selection and model assessment

In general, we should distinguish between two phases:

1. Model selection: estimate the performance of different models trained with different hyperparameters (e.g. degree of the polynomial model, regularization factor, learning rate, number of layers of a neural network).

2. Model assessment: after choosing a final model, we evaluate its performance on new, previously unseen (test) data.

▶ For step 1: do not use the test set to tune hyperparameters; use the validation set.
Bias-Variance trade-off

Bias: expected discrepancy between the targets and the predictions given by the model.
Variance: measure of the deviation from the expected value of the model that any particular sampling of the data (from the same underlying data-generating distribution) is likely to cause.

Some intuitions on the effect of regularization:
▶ High λ ⇝ simple model ⇝ high bias (underfitting).
▶ Low λ ⇝ complex model ⇝ high variance (overfitting).
▶ Intermediate λ: optimal solution (good bias-variance trade-off).

Generalization

Central challenge in Machine Learning: generalization! Our algorithm must perform well on new, previously unseen inputs. Recall that we have to find a balance between bias and variance. Even if we limit the complexity of our model using regularization, the training set does not provide a good estimate of the test (generalization) error. In other words, generalization is compromised if we choose hyperparameters (including the regularization factor) only according to the training error.

Generalization - Training vs test error

Figure: Training (blue) and test (red) errors as the model complexity varies. We want to be in the middle of the complexity range: the test error curve is needed to tell whether we are underfitting or overfitting.

Model selection and model assessment

In general, we should distinguish between two phases:
1. Model selection: estimate the performance of different models trained with different hyperparameters (e.g. degree of the polynomial model, regularization factor, learning rate, number of layers of a NN).
2. Model assessment: after choosing a final model, evaluate its performance on new, previously unseen (test) data.

▶ For step 1: do not use the test set to tune the hyperparameters; use the validation set. The validation set is used to tune the hyperparameters and is distinct from the test set. The test set must never be touched until model assessment.
▶ For step 2: once we have selected a model using the validation set, we can assess its generalization error on the test set.

Double Hold-out - model selection/assessment workflow

1. Split the entire dataset into three sets: training, validation and test (typically 60%-20%-20% or 70%-20%-10% of the total dataset).
2. Fit the model on the training set using different hyperparameters until a model is found that maximizes the validation performance. Optional: re-train the best model on training + validation.
3. Assess the generalization capability of the best (re-trained) model by evaluating its performance on the test set.

Cross-Validation

Depending on how the data are split, the results may change: cross-validation provides a fairer separation between training and validation data.

1. Hold out the test data.
2. Split the remaining data (development set) into k disjoint folds (k-fold CV).
3. Use k − 1 folds as the training set and the remaining fold as the validation set. Compute the performance measure (MSE, accuracy, ...) over the validation set. Repeat k times, changing the validation fold each time (see figure).
4. Compute the validation performance as the mean ± standard deviation of the performance measure across the k runs.

Figure: 5-fold CV.

Pro: not sensitive to a particular partition of the data. Con: computationally expensive (but parallelizable).

Workflow for selection using CV and assessment

1. Split the dataset into two sets: development (model selection, i.e. training + validation) and test (model assessment).
2. Use cross-validation on the development set to select the best model, i.e. the one that maximizes the mean validation performance. Optional: re-train the best model on the entire development set.
3. Evaluate the generalization error on the test set.

Grid search

How to search for the best set of hyperparameters? Example: two hyperparameters (learning rate α and regularization factor λ).
1. Consider a set of three values for each hyperparameter, e.g. αvals = {0.001, 0.01, 0.1}, λvals = {0.0001, 0.001, 0.01}. Evaluate the model with (α, λ) = (i, j) for every i ∈ αvals, j ∈ λvals.
2. Choose the pair corresponding to the best validation performance, based for example on a k-fold CV or a double hold-out.
3. Refinement: suppose the best pair is (0.1, 0.001). Then we can "zoom in" and do another grid search with e.g. αvals = {0.075, 0.1, 0.125}, λvals = {0.00075, 0.001, 0.00125}.

Early stopping

During training we also keep track of the validation loss, so that we can stop before overfitting:
▶ store a copy of the model weights at every epoch in which the validation loss improves;
▶ if the validation loss does not improve for a fixed number of epochs (patience), terminate and return the best weights.

Figure: Learning curves for a classifier. Notice: the training loss should include the regularization term, but the validation loss should not.
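A minimal sketch of the early-stopping loop described above; train_step and val_loss_fn are assumed callables (one epoch of training and the validation loss, respectively) and are not names from any specific library.

```python
import copy

def train_with_early_stopping(model, train_step, val_loss_fn, max_epochs=1000, patience=20):
    """Keep the model with the lowest validation loss seen so far and stop if the
    validation loss has not improved for `patience` consecutive epochs."""
    best_model = copy.deepcopy(model)
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        model = train_step(model)              # one epoch of training (regularized training loss)
        val_loss = val_loss_fn(model)          # validation loss, without the regularization term
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model = copy.deepcopy(model)  # store a copy of the best weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # stop before overfitting sets in
    return best_model
```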
Ensemble learning and eXtreme Gradient Boosting (XGBoost)
Prof. Alessandro Lucantonio, Aarhus University

Ensemble learning

Ensemble learning combines several learners (models) to improve the overall predictive performance and accuracy.
▶ The variance of the overall model decreases significantly thanks to bagging.
▶ The bias also decreases thanks to boosting.

Bagging (bootstrap aggregating)

Bagging reduces the overall variance by combining multiple models. It trains multiple models, each on a different subset of the original dataset. The subsets are created through random sampling with replacement (bootstrapping). When making predictions, bagging combines the outputs of all these models.

Figure: How bagging works for an ensemble of classifiers.

Decision Trees

In a decision tree, nodes are split into sub-nodes based on a threshold value of a feature. The decision tree starts at the root node with all the data. At each node, it looks for the feature that best splits the data into two (or more) groups. This continues until the tree reaches a maximum depth or all samples at a node belong to the same class (leaf node).

Figure: Example of a decision tree (from Wikipedia).

Classification And Regression Trees (CART)

CART is a variation of the decision-tree algorithm that can handle both classification and regression tasks. In CART, a real-valued score is associated with each leaf, which gives us richer interpretations that go beyond classification.

Figure: Example of a CART (from the XGBoost tutorial).

Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the predictions of multiple trees together.

Figure: Ensemble of two CARTs (from the XGBoost tutorial).

CART predictions

▶ Ensemble prediction (a minimal code sketch follows after this slide):

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F,

where K is the number of trees and f_k is a function in the set F of all possible CARTs. Specifically,

f_k(x) = w_{q(x)},  w ∈ R^T,  q: R^d → {1, 2, ..., T},

where w is the vector of the leaf scores, T is the number of leaves and q(x) is a function assigning each sample to the corresponding leaf.

▶ Loss function for training:

L(θ) = Σ_{i=1}^{n} ℓ(y_i, ŷ_i) + Σ_{k=1}^{K} ω(f_k),

where ω is a regularization term that penalizes the complexity of the trees.
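To make the ensemble-prediction formula concrete, here is a tiny sketch in which each "tree" is a depth-1 CART (a decision stump), represented exactly as on the slide by a leaf-assignment function q and a leaf-score vector w. All names, thresholds and scores are made up for illustration and are not part of the XGBoost library.

```python
import numpy as np

def make_stump(feature, threshold, w):
    """A depth-1 CART: q(x) = 0 if x[feature] < threshold else 1, and f(x) = w[q(x)]."""
    def f(x):
        q = 0 if x[feature] < threshold else 1   # leaf index q(x)
        return w[q]                              # leaf score w_{q(x)}
    return f

def ensemble_predict(trees, X):
    """Additive CART prediction: y_hat_i = sum over k of f_k(x_i)."""
    return np.array([sum(f(x) for f in trees) for x in X])

# Two hand-made stumps with illustrative leaf scores
trees = [make_stump(feature=0, threshold=0.5, w=np.array([-1.0, 2.0])),
         make_stump(feature=1, threshold=1.0, w=np.array([0.5, -0.3]))]

X = np.array([[0.2, 1.5],
              [0.8, 0.4]])
print(ensemble_predict(trees, X))   # [-1.3  2.5]: per-sample sum of the two trees' leaf scores
```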
