Machine Learning Calculus PDF Lecture Notes - UCL
Document Details

University College London
Dariush Hosseini
Summary
These lecture notes from University College London cover the mathematical background for machine learning, focusing on calculus. The document outlines key concepts such as derivatives, Taylor series, and optimisation methods, including Lagrange multipliers. It is intended for undergraduate students, with a focus on calculus and its applications.
Full Transcript
Machine Learning: Mathematical Background
Calculus
Dariush Hosseini ([email protected])
Department of Computer Science, University College London

Lecture Overview
1 Lecture Overview
2 Derivatives & Taylor Series
3 Vector & Matrix Derivatives
4 Matrix Calculus
5 Extrema
6 Convexity
7 Quadratic Functions
8 Constrained Optimisation & Lagrange Multipliers
9 Integral Calculus
10 Summary

Maths & Machine Learning¹
Much of machine learning is concerned with:
- Solving systems of linear equations → Linear Algebra
- Minimising cost functions (scalar functions of several variables that typically measure how poorly our model fits the data). To this end we are often interested in studying the continuous change of such functions → (Differential) Calculus
- Characterising uncertainty in our learning environments stochastically → Probability
- Drawing conclusions based on the analysis of data → Statistics

¹ Much of this lecture is drawn from 'Mathematics for Machine Learning' by Garrett Thomas
Learning Outcomes for Today's Lecture
By the end of this lecture you should be familiar with some fundamental objects in, and results of, Calculus. For the most part we will concentrate on the statement of results which will be of use in the main body of this module; we will not be so concerned with the proofs of these results.

Derivatives & Taylor Series

Derivatives
For a function $f: \mathbb{R} \to \mathbb{R}$, the derivative is defined as:
$$\frac{df}{dx} = \lim_{\delta \to 0} \frac{f(x+\delta) - f(x)}{\delta} = f'(x)$$
The second derivative is defined to be the derivative of the derivative:
$$\frac{d^2 f}{dx^2} = \lim_{\delta \to 0} \frac{f'(x+\delta) - f'(x)}{\delta} = f''(x)$$

Taylor Series
For small changes, $\delta$, about a point $x = \tilde{x}$, any smooth function $f$ can be written as:
$$f(\tilde{x} + \delta) = f(\tilde{x}) + \sum_{i=1}^{\infty} \frac{\delta^i}{i!} \left.\frac{d^i f}{dx^i}\right|_{x=\tilde{x}} = f(\tilde{x}) + \delta \left.\frac{df}{dx}\right|_{x=\tilde{x}} + \frac{\delta^2}{2} \left.\frac{d^2 f}{dx^2}\right|_{x=\tilde{x}} + \dots$$

Rules for Combining Functions
Sum Rule: for all functions $f, g$ and all $\alpha, \beta \in \mathbb{R}$:
$$(\alpha f(x) + \beta g(x))' = \alpha f'(x) + \beta g'(x)$$
Product Rule: for all functions $f, g$:
$$(f(x)g(x))' = f'(x)g(x) + f(x)g'(x)$$
Chain Rule: if $f(x) = h(g(x))$ then:
$$f'(x) = h'(g(x)) \cdot g'(x)$$

Common Derivatives
  f(x)      f'(x)
  x^n       n x^{n-1}
  e^{kx}    k e^{kx}
  ln x      1/x
Where $n, k$ are constants.
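As a quick numerical sketch (not part of the lecture), the limit definition of the derivative and the truncated Taylor series can both be checked with NumPy; the choice of $f(x) = e^x$ and the expansion point are illustrative:

```python
import numpy as np

def numerical_derivative(f, x, delta=1e-6):
    # Symmetric difference quotient approximating f'(x) as delta -> 0
    return (f(x + delta) - f(x - delta)) / (2 * delta)

f = np.exp            # f(x) = e^x, so f'(x) = e^x exactly
x0 = 1.0
approx = numerical_derivative(f, x0)
exact = np.exp(x0)

def taylor2(f, df, d2f, x0, delta):
    # Second-order Taylor approximation f(x0 + delta)
    return f(x0) + delta * df(x0) + 0.5 * delta**2 * d2f(x0)

# For e^x all derivatives equal e^x; expand about x0 with a small step
t2 = taylor2(np.exp, np.exp, np.exp, x0, 0.1)
true_val = np.exp(1.1)
```

For a step of 0.1 the second-order expansion is already accurate to roughly the size of the neglected cubic term, $\delta^3/3!$.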
Partial Derivatives
For a function that depends on $n$ variables $\{x_i\}_{i=1}^n$, $f: (x_1, x_2, \dots, x_n) \mapsto f(x_1, x_2, \dots, x_n)$, the partial derivative with respect to $x_i$ is defined as:
$$\frac{\partial f}{\partial x_i} = \lim_{\delta \to 0} \frac{f(x_1, \dots, x_i + \delta, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{\delta}$$
So the partial derivative with respect to $x_i$ keeps the state of the other variables fixed.

Vector & Matrix Derivatives

Gradients
The gradient of $f: \mathbb{R}^n \to \mathbb{R}$, denoted by $\nabla_{\mathbf{x}} f$, is given by the vector of partial derivatives:
$$\nabla_{\mathbf{x}} f = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right]^T$$

Directional Derivative
The directional derivative in direction $\hat{\mathbf{u}}$ (where $\|\hat{\mathbf{u}}\|_2 = 1$) is the slope of $f(\mathbf{x})$ in the direction of $\hat{\mathbf{u}}$:
$$\nabla_{\mathbf{x}} f \cdot \hat{\mathbf{u}}$$
By the definition of angle, this can be rewritten as:
$$\nabla_{\mathbf{x}} f \cdot \hat{\mathbf{u}} = \|\nabla_{\mathbf{x}} f\|_2 \, \|\hat{\mathbf{u}}\|_2 \cos\theta = \|\nabla_{\mathbf{x}} f\|_2 \cos\theta$$
where $\theta$ is the angle between the gradient vector and $\hat{\mathbf{u}}$. Thus the directional derivative is maximal when $\nabla_{\mathbf{x}} f$ and $\hat{\mathbf{u}}$ are aligned; in other words, $\nabla_{\mathbf{x}} f$ points in the direction of steepest ascent on a surface.

Total Derivative
For a function that depends on $n$ variables $\{x_i\}_{i=1}^n$, where $x_i = x_i(t)$ for all $t$, the total derivative with respect to $t$ is:
$$\frac{df}{dt} = \sum_{i=1}^n \frac{\partial f}{\partial x_i}\frac{dx_i}{dt} = \left[\frac{dx_1}{dt}, \dots, \frac{dx_n}{dt}\right] \nabla_{\mathbf{x}} f$$
This follows from the chain rule.

Jacobian
The Jacobian matrix of a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, defined such that $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \dots, f_m(\mathbf{x})]^T$, denoted by $\nabla_{\mathbf{x}} \mathbf{f}$ or $\frac{\partial(f_1, \dots, f_m)}{\partial(x_1, \dots, x_n)}$, is the matrix of all its first-order partial derivatives:
$$\nabla_{\mathbf{x}} \mathbf{f} = \frac{\partial(f_1, \dots, f_m)}{\partial(x_1, \dots, x_n)} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
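The gradient and the steepest-ascent property of the directional derivative can be illustrated numerically; this sketch (not from the lecture) uses a finite-difference gradient on an arbitrary test function:

```python
import numpy as np

def grad(f, x, delta=1e-6):
    # Finite-difference approximation of the gradient of f: R^n -> R,
    # perturbing one coordinate at a time (the partial-derivative definition)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = delta
        g[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return g

# Illustrative test function with analytic gradient [2 x1 + 3 x2, 3 x1]
f = lambda x: x[0]**2 + 3 * x[0] * x[1]
x0 = np.array([1.0, 2.0])
g = grad(f, x0)                       # approximately [8, 3]

# Directional derivative along unit u is g . u, maximised when u aligns with g,
# in which case it equals the gradient's norm
u = g / np.linalg.norm(g)
dir_deriv = g @ u
```

Any other unit direction gives a strictly smaller slope, since $\cos\theta < 1$ when $\theta \neq 0$.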
Hessian
The Hessian matrix of $f: \mathbb{R}^n \to \mathbb{R}$, denoted by $\nabla^2_{\mathbf{x}} f$ or $\mathbf{H}(\mathbf{x})$, is a matrix of second-order partial derivatives:
$$\nabla^2_{\mathbf{x}} f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Scalar-by-Matrix Derivative
The derivative of a scalar $y$ with respect to the elements of a matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$ is defined to be:
$$\frac{\partial y}{\partial \mathbf{A}} = \begin{bmatrix} \frac{\partial y}{\partial A_{11}} & \cdots & \frac{\partial y}{\partial A_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial A_{n1}} & \cdots & \frac{\partial y}{\partial A_{nm}} \end{bmatrix}$$

Matrix Calculus

Let $\mathbf{a}, \mathbf{x} \in \mathbb{R}^n$ and $\mathbf{A} \in \mathbb{R}^{n \times n}$:
$$\nabla_{\mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a} \qquad \nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$$
In particular, if $\mathbf{A}$ is symmetric: $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}$.

Let $\mathbf{a}, \mathbf{c} \in \mathbb{R}^n$, $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{A} \in \mathbb{R}^{n \times n}$, and $\mathbf{B} \in \mathbb{R}^{n \times m}$:
$$\frac{\partial\, \mathbf{a}^T\mathbf{B}\mathbf{b}}{\partial \mathbf{B}} = \mathbf{a}\mathbf{b}^T \qquad \frac{\partial\, \mathbf{a}^T\mathbf{A}^{-1}\mathbf{c}}{\partial \mathbf{A}} = -(\mathbf{A}^T)^{-1}\mathbf{a}\mathbf{c}^T(\mathbf{A}^T)^{-1} \qquad \frac{\partial \log|\mathbf{A}|}{\partial \mathbf{A}} = (\mathbf{A}^T)^{-1}$$

Let $\mathbf{B}, \mathbf{X} \in \mathbb{R}^{n \times n}$:
$$\frac{\partial\, \mathrm{tr}(\mathbf{B}\mathbf{X}\mathbf{X}^T)}{\partial \mathbf{X}} = \mathbf{B}\mathbf{X} + \mathbf{B}^T\mathbf{X} \qquad \frac{\partial\, \mathrm{tr}(\mathbf{B}\mathbf{X}^T\mathbf{X})}{\partial \mathbf{X}} = \mathbf{X}\mathbf{B}^T + \mathbf{X}\mathbf{B}$$

Multivariate Taylor Series
For small changes $\boldsymbol{\delta}$ about a point $\mathbf{x} = \tilde{\mathbf{x}}$, for a scalar function of a vector argument which is at least twice differentiable:
$$f(\tilde{\mathbf{x}} + \boldsymbol{\delta}) \approx f(\tilde{\mathbf{x}}) + \boldsymbol{\delta} \cdot \nabla_{\mathbf{x}} f(\tilde{\mathbf{x}}) + \frac{1}{2}\boldsymbol{\delta}^T \mathbf{H}(\tilde{\mathbf{x}})\boldsymbol{\delta}$$

Extrema

For a function $f: \mathbb{R}^n \to \mathbb{R}$ and a feasible set $\mathcal{X} \subseteq \mathbb{R}^n$ over which we are interested in optimising:
- A point $\mathbf{x}$ is a local minimum (or local maximum) of $f$ if $f(\mathbf{x}) \leq f(\mathbf{y})$ (or $f(\mathbf{x}) \geq f(\mathbf{y})$) for all $\mathbf{y}$ in some neighbourhood $\mathcal{N} \subseteq \mathcal{X}$ about $\mathbf{x}$.
- If $f(\mathbf{x}) \leq f(\mathbf{y})$ (or $f(\mathbf{x}) \geq f(\mathbf{y})$) for all $\mathbf{y} \in \mathcal{X}$, then $\mathbf{x}$ is a global minimum (or global maximum) of $f$ in $\mathcal{X}$.
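Identities like $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$ are easy to sanity-check against a finite-difference gradient; a small sketch (random data, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # deliberately non-symmetric
x = rng.standard_normal(n)

def num_grad(f, x, delta=1e-6):
    # Central-difference gradient of a scalar function of a vector
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = delta
        g[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return g

quad = lambda x: x @ A @ x        # the quadratic form x^T A x
g_numeric = num_grad(quad, x)
g_identity = (A + A.T) @ x        # the closed-form gradient from the identity
```

The two gradients agree to finite-difference precision, and for symmetric A the identity collapses to 2Ax.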
Stationary Points & Saddle Points
Points where the gradient vanishes, i.e. where $\nabla_{\mathbf{x}} f = \mathbf{0}$, are stationary points, also known as critical points. It can be proved that a necessary condition for a point to be a maximum or minimum is that the point is stationary. However this is not a sufficient condition, and points for which $\nabla_{\mathbf{x}} f = \mathbf{0}$ but where there is no local maximum or minimum are called saddle points. For example:
$$f(x_1, x_2) = x_1^2 - x_2^2 \implies \nabla_{\mathbf{x}} f = [2x_1, -2x_2]^T$$
$$\mathbf{x} = [x_1, x_2]^T = \mathbf{0} \implies \nabla_{\mathbf{x}} f = \mathbf{0}$$
But at this point we have a minimum in the $x_1$ direction and a maximum in the $x_2$ direction.

[Figure: surface plot of $f(x_1, x_2) = x_1^2 - x_2^2$ over $[-2, 2]^2$, showing a saddle at the origin]

Further Conditions for Local Extrema
Recall that by Taylor's theorem, for sufficiently small $\boldsymbol{\delta}$, and twice differentiable $f$ about $\mathbf{x}^*$:
$$f(\mathbf{x}^* + \boldsymbol{\delta}) \approx f(\mathbf{x}^*) + \boldsymbol{\delta} \cdot \nabla_{\mathbf{x}} f(\mathbf{x}^*) + \frac{1}{2}\boldsymbol{\delta}^T \mathbf{H}(\mathbf{x}^*)\boldsymbol{\delta}$$
If we are at a point for which $\nabla_{\mathbf{x}} f(\mathbf{x}^*) = \mathbf{0}$, then:
- If $\mathbf{H}(\mathbf{x}^*) \succ 0$ then $\mathbf{x}^*$ is a local minimum
- If $\mathbf{H}(\mathbf{x}^*) \prec 0$ then $\mathbf{x}^*$ is a local maximum
- If $\mathbf{H}(\mathbf{x}^*)$ is indefinite then $\mathbf{x}^*$ is a saddle point
- Otherwise we need to investigate things further...

Conditions for Global Extrema
These are harder to state; however, a class of functions for which it is more straightforward to discern global extrema is that of twice differentiable convex functions. These are functions where $\nabla^2_{\mathbf{x}} f \succeq 0$ globally. Many of the learning tasks which we will perform during this module will involve optimisation over convex functions.

Convexity

Convex Sets²
Definition: A set $\Omega$ is convex if, for any $\mathbf{x}, \mathbf{y} \in \Omega$ and $\theta \in [0, 1]$, $\theta\mathbf{x} + (1 - \theta)\mathbf{y} \in \Omega$.
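The Hessian test for classifying a stationary point can be automated via the eigenvalues of H; a minimal sketch (the classification thresholds and example matrices are illustrative):

```python
import numpy as np

def classify(H, tol=1e-10):
    # Classify a stationary point from the Hessian's eigenvalues:
    # all positive -> positive definite -> local minimum, etc.
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"      # some zero eigenvalues: higher-order test needed

# f(x1, x2) = x1^2 - x2^2 has Hessian diag(2, -2) everywhere: indefinite
H_saddle = np.diag([2.0, -2.0])
# f(x1, x2) = x1^2 + x2^2 has Hessian diag(2, 2): positive definite
H_min = np.diag([2.0, 2.0])
```

The "inconclusive" branch corresponds to the lecture's "otherwise we need to investigate things further" case, where the Hessian is semi-definite with zero eigenvalues.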
In other words, if we take any two elements in $\Omega$ and draw a line segment between these two elements, then every point on that line segment also belongs to $\Omega$.

² Boyd & Vandenberghe, 'Convex Optimisation'

Convex Functions
Definition: A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is a convex set and if, for all $\mathbf{x}, \mathbf{y}$ in its domain, and all $\lambda \in [0, 1]$, we have:
$$f(\lambda\mathbf{x} + (1 - \lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1 - \lambda)f(\mathbf{y})$$

[Figure: a convex function, where the chord value $\lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y})$ lies above $f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y})$, alongside a non-convex function, where it does not]

Concave Functions
Definition: A function $f: \mathbb{R}^n \to \mathbb{R}$ is concave if $-f$ is convex.

1st & 2nd Order Characterisations of Convex, Differentiable Functions
Theorem A.1: Suppose $f$ is twice differentiable over an open domain. Then the following are equivalent:
- $f$ is convex
- $f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla_{\mathbf{x}} f(\mathbf{x}) \cdot (\mathbf{y} - \mathbf{x})$ for all $\mathbf{x}, \mathbf{y} \in \mathrm{dom}(f)$
- $\nabla^2_{\mathbf{x}} f(\mathbf{x}) \succeq 0$ for all $\mathbf{x} \in \mathrm{dom}(f)$

Global Optimality
Theorem A.2: Consider an unconstrained optimisation problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to: } \mathbf{x} \in \mathbb{R}^n$$
If $f: \mathbb{R}^n \to \mathbb{R}$ is convex, then any point that is locally optimal is globally optimal. Furthermore, if $f$ is also differentiable, then any point $\mathbf{x}$ that satisfies $\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{0}$ is a globally optimal solution.

Strict Convexity
Definition: A function $f: \mathbb{R}^n \to \mathbb{R}$ is strictly convex if its domain is a convex set and if, for all $\mathbf{x} \neq \mathbf{y}$ in its domain, and all $\lambda \in (0, 1)$, we have:
$$f(\lambda\mathbf{x} + (1 - \lambda)\mathbf{y}) < \lambda f(\mathbf{x}) + (1 - \lambda)f(\mathbf{y})$$
First Order Characterisation: a function $f$ is strictly convex on $\Omega \subseteq \mathbb{R}^n$ if and only if:
$$f(\mathbf{y}) > f(\mathbf{x}) + \nabla_{\mathbf{x}} f(\mathbf{x}) \cdot (\mathbf{y} - \mathbf{x}) \quad \forall \mathbf{x}, \mathbf{y} \in \Omega,\ \mathbf{x} \neq \mathbf{y}$$
Second Order Sufficient Condition: a function $f$ is strictly convex on $\Omega \subseteq \mathbb{R}^n$ if:
$$\nabla^2_{\mathbf{x}} f(\mathbf{x}) \succ 0 \quad \forall \mathbf{x} \in \Omega$$

Strict Convexity and Uniqueness of Optimal Solutions
Theorem A.3: Consider an optimisation problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to: } \mathbf{x} \in \Omega$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is strictly convex on $\Omega$ and $\Omega$ is a convex set.
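The defining chord inequality and the first-order characterisation in Theorem A.1 can both be checked numerically for a known convex function; this sketch (random points, illustrative function) is not from the lecture:

```python
import numpy as np

# f(x) = ||x||^2 is convex, with gradient 2x
f = lambda x: np.sum(x**2)
grad_f = lambda x: 2 * x

rng = np.random.default_rng(1)
x, y = rng.standard_normal(3), rng.standard_normal(3)

# Chord inequality: f(lam x + (1 - lam) y) <= lam f(x) + (1 - lam) f(y)
lam = 0.3
chord_ok = f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12

# First-order characterisation: f(y) >= f(x) + grad_f(x) . (y - x)
first_order_ok = f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
```

Since this particular f is strictly convex, both inequalities in fact hold strictly for distinct random points.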
Then the optimal solution must be unique.

Sums of Convex Functions
If $f(\cdot)$ is a convex function and $g(\cdot)$ is a convex function, then $\alpha f(\cdot) + \beta g(\cdot)$ is also a convex function if $\alpha, \beta \geq 0$. If $f(\cdot)$ is a convex function and $h(\cdot)$ is a strictly convex function, then $\alpha f(\cdot) + \beta h(\cdot)$ is a strictly convex function if $\alpha \geq 0, \beta > 0$. Proofs follow from an application of the definitions of convexity and strict convexity.

Quadratic Functions

Consider the following quadratic function, $f$:
$$f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{b} \cdot \mathbf{x} + c$$
where $\mathbf{A} \in \mathbb{R}^{n \times n}$, $\mathbf{b} \in \mathbb{R}^n$, $c \in \mathbb{R}$, and the variable $\mathbf{x} \in \mathbb{R}^n$. From the Linear Algebra lecture, note that, w.l.o.g., we can write:
$$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{B}\mathbf{x} + \mathbf{b} \cdot \mathbf{x} + c$$
where $\mathbf{B} = \mathbf{A} + \mathbf{A}^T$, which is hence symmetric.

Convexity of Quadratic Functions
$$\mathbf{B} \succeq 0 \iff \text{convexity of } f \qquad \mathbf{B} \succ 0 \iff \text{strict convexity of } f$$

Proof (for the first result; an analogous proof holds for the second). For any $0 \leq \lambda \leq 1$, and for any variables $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$:
$$\lambda f(\mathbf{x}) = \frac{\lambda}{2}\mathbf{x}^T\mathbf{B}\mathbf{x} + \lambda\mathbf{b}\cdot\mathbf{x} + \lambda c$$
$$(1 - \lambda) f(\mathbf{y}) = \frac{1 - \lambda}{2}\mathbf{y}^T\mathbf{B}\mathbf{y} + (1 - \lambda)\mathbf{b}\cdot\mathbf{y} + (1 - \lambda)c$$
$$f(\lambda\mathbf{x} + (1 - \lambda)\mathbf{y}) = \frac{1}{2}(\lambda\mathbf{x} + (1 - \lambda)\mathbf{y})^T\mathbf{B}(\lambda\mathbf{x} + (1 - \lambda)\mathbf{y}) + \mathbf{b}\cdot(\lambda\mathbf{x} + (1 - \lambda)\mathbf{y}) + c$$
$$= \frac{\lambda^2}{2}\mathbf{x}^T\mathbf{B}\mathbf{x} + \frac{(1 - \lambda)^2}{2}\mathbf{y}^T\mathbf{B}\mathbf{y} + \lambda(1 - \lambda)\mathbf{x}^T\mathbf{B}\mathbf{y} + \lambda\mathbf{b}\cdot\mathbf{x} + (1 - \lambda)\mathbf{b}\cdot\mathbf{y} + c$$
Thus:
$$f(\lambda\mathbf{x} + (1 - \lambda)\mathbf{y}) - \lambda f(\mathbf{x}) - (1 - \lambda)f(\mathbf{y}) = \frac{\lambda^2 - \lambda}{2}\mathbf{x}^T\mathbf{B}\mathbf{x} + \frac{(1 - \lambda)^2 - (1 - \lambda)}{2}\mathbf{y}^T\mathbf{B}\mathbf{y} + \lambda(1 - \lambda)\mathbf{x}^T\mathbf{B}\mathbf{y}$$
$$= \frac{\lambda(\lambda - 1)}{2}\mathbf{x}^T\mathbf{B}\mathbf{x} + \frac{\lambda(\lambda - 1)}{2}\mathbf{y}^T\mathbf{B}\mathbf{y} - \lambda(\lambda - 1)\mathbf{x}^T\mathbf{B}\mathbf{y} = \frac{\lambda(\lambda - 1)}{2}(\mathbf{x} - \mathbf{y})^T\mathbf{B}(\mathbf{x} - \mathbf{y})$$
So, since $\lambda(\lambda - 1) \leq 0$:
$$\text{convexity} \iff \text{LHS} \leq 0 \iff (\mathbf{x} - \mathbf{y})^T\mathbf{B}(\mathbf{x} - \mathbf{y}) \geq 0 \iff \mathbf{B} \succeq 0$$

Optimisation of Quadratic Functions
Consider the following unconstrained optimisation problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \qquad (1)$$
where $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{B}\mathbf{x} + \mathbf{b}\cdot\mathbf{x} + c$, and $\mathbf{B}$ is real and symmetric. Consider the following possible forms of $f$:

Strictly Convex f
If $\mathbf{B} \succ 0$ then $f$ is strictly convex. Thus, by Theorem A.3, there is a unique solution to problem (1):
$$\mathbf{x}^* = -\mathbf{B}^{-1}\mathbf{b}$$
Note that this solution must exist because:
$$\mathbf{B} \succ 0 \implies \det\mathbf{B} \neq 0 \implies \exists\, \mathbf{B}^{-1}$$

Strictly Convex f: Example
$$f(\mathbf{x}) = x_1^2 + x_2^2, \quad \text{i.e. } \mathbf{B} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix},\ \mathbf{b} = \mathbf{0},\ c = 0$$
[Figure: bowl-shaped surface plot of $f(\mathbf{x}) = x_1^2 + x_2^2$ with a unique minimum at the origin]

Convex & Bounded f
If $\mathbf{B} \succeq 0$, $\mathbf{B} \not\succ 0$, and $\mathbf{b} \in \mathrm{range}(\mathbf{B})$, then $f$ is convex but not strictly convex, and is bounded below.
Proof: solutions to the problem involve the following condition:
$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{B}\mathbf{x} + \mathbf{b} = \mathbf{0} \implies \mathbf{B}\mathbf{x} = -\mathbf{b}$$
From the Linear Algebra lecture, this system of equations has a solution iff it is consistent, i.e. $\mathrm{rank}(\mathbf{B}) = \mathrm{rank}(\mathbf{B}|-\mathbf{b})$. Because $\mathbf{b} \in \mathrm{range}(\mathbf{B})$, $-\mathbf{b}$ is some linear combination of the columns of $\mathbf{B}$, so:
$$\mathrm{columnspace}(\mathbf{B}) = \mathrm{columnspace}(\mathbf{B}|-\mathbf{b}) \implies \mathrm{range}(\mathbf{B}) = \mathrm{range}(\mathbf{B}|-\mathbf{b}) \implies \mathrm{rank}(\mathbf{B}) = \mathrm{rank}(\mathbf{B}|-\mathbf{b})$$
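For the strictly convex case, the closed-form minimiser $\mathbf{x}^* = -\mathbf{B}^{-1}\mathbf{b}$ and the positive-definiteness test are a few lines of NumPy; the data here is illustrative, not the lecture's example:

```python
import numpy as np

# f(x) = 0.5 x^T B x + b . x + c with symmetric B
B = np.array([[2.0, 0.0],
              [0.0, 2.0]])
b = np.array([1.0, -2.0])

# B > 0 (positive definite) iff all eigenvalues are positive
eig = np.linalg.eigvalsh(B)
strictly_convex = bool(np.all(eig > 0))

# Unique minimiser: solve B x = -b rather than forming B^{-1} explicitly
x_star = -np.linalg.solve(B, b)

# The gradient B x + b vanishes at the minimiser
g = B @ x_star + b
```

Using `solve` instead of an explicit inverse is the standard numerically preferable route to $-\mathbf{B}^{-1}\mathbf{b}$.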
Convex & Bounded f (Cont.)
There are infinitely many solutions to problem (1), since:
$$\mathbf{B} \succeq 0,\ \mathbf{B} \not\succ 0 \implies \text{at least one eigenvalue} = 0 \implies \det\mathbf{B} = 0 \implies \mathbf{B} \text{ is not full rank}$$
Thus the system of equations $\mathbf{B}\mathbf{x} = -\mathbf{b}$ is underdetermined.

Convex & Bounded f: Example
$$f(\mathbf{x}) = x_1^2 + x_2^2 - 2x_1x_2, \quad \text{i.e. } \mathbf{B} = \begin{bmatrix} 2 & -2 \\ -2 & 2 \end{bmatrix},\ \mathbf{b} = \mathbf{0},\ c = 0$$
[Figure: surface plot showing a flat valley of minimisers along the line $x_1 = x_2$]

Convex & Unbounded f
If $\mathbf{B} \succeq 0$, $\mathbf{B} \not\succ 0$, and $\mathbf{b} \notin \mathrm{range}(\mathbf{B})$, then $f$ is convex but not strictly convex, and is unbounded below.
Proof: as before, solutions to the problem exist iff $\mathrm{rank}(\mathbf{B}) = \mathrm{rank}(\mathbf{B}|-\mathbf{b})$. But because $\mathbf{b} \notin \mathrm{range}(\mathbf{B})$, $-\mathbf{b}$ cannot be written as some linear combination of the columns of $\mathbf{B}$, so:
$$\mathrm{columnspace}(\mathbf{B}) \neq \mathrm{columnspace}(\mathbf{B}|-\mathbf{b}) \implies \mathrm{range}(\mathbf{B}) \neq \mathrm{range}(\mathbf{B}|-\mathbf{b}) \implies \mathrm{rank}(\mathbf{B}) \neq \mathrm{rank}(\mathbf{B}|-\mathbf{b})$$
Thus the system of equations is inconsistent and no solutions to problem (1) exist.

Convex & Unbounded f: Example
$$f(\mathbf{x}) = x_2^2 - x_1, \quad \text{i.e. } \mathbf{B} = \begin{bmatrix} 0 & 0 \\ 0 & 2 \end{bmatrix},\ \mathbf{b} = \begin{bmatrix} -1 \\ 0 \end{bmatrix},\ c = 0$$
[Figure: surface plot of a parabolic trough descending without bound in the $x_1$ direction]

Non-Convex f
If $\mathbf{B} \not\succeq 0$ then $f$ is non-convex. There is no solution to problem (1) since $f$ is unbounded below.
Proof: consider an eigenvector of $\mathbf{B}$, $\tilde{\mathbf{x}}$, with an eigenvalue $\tilde{\lambda} < 0$:
$$\mathbf{B}\tilde{\mathbf{x}} = \tilde{\lambda}\tilde{\mathbf{x}} \implies \tilde{\mathbf{x}}^T\mathbf{B}\tilde{\mathbf{x}} = \tilde{\lambda}\tilde{\mathbf{x}}^T\tilde{\mathbf{x}} < 0$$
so $f$ decreases without bound along the direction $\tilde{\mathbf{x}}$.

[Transcript gap: the material introducing constrained optimisation, Lagrange multipliers, and the single inequality constraint $g(\mathbf{x}) \leq 0$ with Lagrangian $L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda g(\mathbf{x})$ is missing here. The transcript resumes with the stationarity condition for the active-constraint case:]
$$\nabla_{\mathbf{x}} L = \mathbf{0} \quad \text{with: } \lambda > 0 \qquad (3)$$

Constrained Optimisation & Lagrange Multipliers / Inequality Constraints

Inequality Constraints: Problem Reformulation
Using equations (2) & (3), we can solve our problem by seeking stationary solutions $(\mathbf{x}^*, \lambda^*)$ which satisfy the following:
$$\nabla_{\mathbf{x}} L = \mathbf{0} \quad \text{subject to: } g(\mathbf{x}) \leq 0,\ \lambda \geq 0,\ \lambda g(\mathbf{x}) = 0$$
These conditions are known as the Karush-Kuhn-Tucker (KKT) conditions.
Inequality Constraints: Complementary Slackness
$\lambda g(\mathbf{x}) = 0$ is satisfied for both the active and inactive cases, and is known as the complementary slackness condition. It is equivalent to:
$$\lambda > 0 \implies g(\mathbf{x}) = 0 \qquad g(\mathbf{x}) < 0 \implies \lambda = 0$$
And, rarely, when the critical point associated with $f(\mathbf{x})$ coincides with the constraint surface: $\lambda = 0$ and $g(\mathbf{x}) = 0$.

Inequality Constraints: Problem Reformulation (Cont.)
Again, note that these conditions characterise a stationary point associated with the function $f$... but we have said nothing about whether such a point is a maximum, a minimum or a saddle point. It can be proved that if both $f$ and $g$ are convex functions then the stationary point is a minimum; otherwise the stationary point may be a maximum, a minimum or a saddle point (but note that, regardless, the KKT conditions hold).

Constrained Optimisation & Lagrange Multipliers / Multiple Constraints

Multiple Constraints: Problem
$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to: } \{g^{(i)}(\mathbf{x}) \leq 0\}_{i=1}^m,\ \{h^{(j)}(\mathbf{x}) = 0\}_{j=1}^p$$

Multiple Constraints: Lagrangian
We express the Lagrangian as:
$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^m \mu^{(i)} g^{(i)}(\mathbf{x}) + \sum_{j=1}^p \lambda^{(j)} h^{(j)}(\mathbf{x})$$
where $\boldsymbol{\lambda} = [\lambda^{(1)}, \dots, \lambda^{(p)}]^T$ with $\lambda^{(j)} \in \mathbb{R}$, and $\boldsymbol{\mu} = [\mu^{(1)}, \dots, \mu^{(m)}]^T$ with $\mu^{(i)} \in \mathbb{R}_{\geq 0}$, are Lagrange multipliers.

Multiple Constraints: Problem Reformulation
And we can solve our problem by seeking stationary solutions $(\mathbf{x}^*, \{\mu^{(i)*}\}, \{\lambda^{(j)*}\})$ which satisfy the following:
$$\nabla_{\mathbf{x}} L = \mathbf{0} \quad \text{subject to: } \{g^{(i)}(\mathbf{x}) \leq 0\}_{i=1}^m,\ \{h^{(j)}(\mathbf{x}) = 0\}_{j=1}^p,\ \{\mu^{(i)} \geq 0\}_{i=1}^m,\ \{\mu^{(i)} g^{(i)}(\mathbf{x}) = 0\}_{i=1}^m$$

Constrained Optimisation & Lagrange Multipliers / Duality

Duality
It's not immediately obvious what the value of this reformulation is.
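The KKT conditions can be verified mechanically on a small hand-solved problem; this sketch (the problem and its candidate solution are illustrative, not from the lecture) checks all four conditions:

```python
import numpy as np

# Problem: min x1^2 + x2^2  subject to  g(x) = 1 - x1 - x2 <= 0
# The unconstrained minimum (origin) is infeasible, so the constraint is
# active at the optimum; solving by hand gives x* = (0.5, 0.5), mu* = 1.
x = np.array([0.5, 0.5])
mu = 1.0

grad_f = 2 * x                       # gradient of the objective
grad_g = np.array([-1.0, -1.0])      # gradient of g(x) = 1 - x1 - x2
g = 1 - x.sum()

stationarity = np.allclose(grad_f + mu * grad_g, 0)   # grad_x L = 0
primal_feasible = g <= 1e-12                          # g(x) <= 0
dual_feasible = mu >= 0                               # mu >= 0
comp_slack = abs(mu * g) < 1e-12                      # mu g(x) = 0
```

Since both the objective and the constraint here are convex, the KKT point is in fact the global minimum.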
We seem to have replaced one constrained optimisation problem with another... But actually we have transformed a rather opaque optimisation problem into a more familiar problem: a constrained set of simultaneous equations, $\nabla_{\mathbf{x}} L = \mathbf{0}$ and $\{\mu^{(i)} g^{(i)}(\mathbf{x}) = 0\}_{i=1}^m$. But there are other advantages of the Lagrangian approach, for which we will consider the concept of duality.

Duality: Primal Problem
The original problem is sometimes known as the primal problem, and its variables, $\mathbf{x}$, are known as the primal variables. It is equivalent to the following formulation:
$$\min_{\mathbf{x}} \left[ \max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \right]$$
Here the bracketed term is known as the primal objective function.

Duality: Barrier Function
We can rewrite the primal objective as follows:
$$\max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} \left[ \sum_{i=1}^m \mu^{(i)} g^{(i)}(\mathbf{x}) + \sum_{j=1}^p \lambda^{(j)} h^{(j)}(\mathbf{x}) \right]$$
Here the second term gives rise to a barrier function which enforces the constraints as follows:
$$\max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} \left[ \sum_{i=1}^m \mu^{(i)} g^{(i)}(\mathbf{x}) + \sum_{j=1}^p \lambda^{(j)} h^{(j)}(\mathbf{x}) \right] = \begin{cases} 0 & \text{if } \mathbf{x} \text{ is feasible} \\ \infty & \text{if } \mathbf{x} \text{ is infeasible} \end{cases}$$

Duality: Minimax Inequality
In order to make use of this barrier-function formulation, we will need the minimax inequality:
$$\max_{\mathbf{y}} \min_{\mathbf{x}} \phi(\mathbf{x}, \mathbf{y}) \leq \min_{\mathbf{x}} \max_{\mathbf{y}} \phi(\mathbf{x}, \mathbf{y})$$
Proof:
$$\min_{\mathbf{x}} \phi(\mathbf{x}, \mathbf{y}) \leq \phi(\mathbf{x}, \mathbf{y}) \quad \forall \mathbf{x}, \mathbf{y}$$
This is true for all $\mathbf{y}$; therefore, in particular, the following is true:
$$\max_{\mathbf{y}} \min_{\mathbf{x}} \phi(\mathbf{x}, \mathbf{y}) \leq \max_{\mathbf{y}} \phi(\mathbf{x}, \mathbf{y}) \quad \forall \mathbf{x}$$
This is true for all $\mathbf{x}$; therefore, in particular, the following is true:
$$\max_{\mathbf{y}} \min_{\mathbf{x}} \phi(\mathbf{x}, \mathbf{y}) \leq \min_{\mathbf{x}} \max_{\mathbf{y}} \phi(\mathbf{x}, \mathbf{y})$$

Duality: Weak Duality
We can now introduce the concept of weak duality:
$$\min_{\mathbf{x}} \left[ \max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \right] \geq \max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} \left[ \min_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \right]$$
Here the bracketed term on the right-hand side is known as the dual objective function, $D(\boldsymbol{\lambda}, \boldsymbol{\mu})$. If we can solve the right-hand side of the inequality then we have a lower bound on the solution of our optimisation problem.
Duality: Weak Duality (Cont.)
And sometimes the RHS of the inequality is an easier problem to solve: $\min_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})$ is an unconstrained optimisation problem for a given value of $(\boldsymbol{\lambda}, \boldsymbol{\mu})$... And if solving this problem is not hard, then the overall problem is not hard to solve, because $\max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0}\left[\min_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})\right]$ is a maximisation over a pointwise minimum of a set of affine functions of $(\boldsymbol{\lambda}, \boldsymbol{\mu})$: thus it is a concave maximisation problem, or equivalently a convex minimisation problem, and we know that such problems can be efficiently solved. Note that this is true regardless of whether $f$, $g^{(i)}$, $h^{(j)}$ are nonconvex.

Duality: Strong Duality
For certain classes of problems which satisfy constraint qualifications we can go further, and strong duality holds:
$$\min_{\mathbf{x}} \left[ \max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \right] = \max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} \left[ \min_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \right]$$
There are several different constraint qualifications.
One is Slater's Condition, which holds for convex optimisation problems. Recall, these are problems for which $f$ and the $g^{(i)}$ are convex functions and the $h^{(j)}$ are affine. For problems of this type we may seek to solve the dual optimisation problem:
$$\max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} \left[ \min_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \right]$$

Duality: Strong Duality (Cont.)
Once again, note that the dual optimisation problem is sometimes easier to solve than the primal problem. But, for our purposes, another interesting reason for adopting the dual optimisation approach to solving constrained optimisation problems is based on dimensionality: if the dimensionality of the dual variables, $(m + p)$, is less than the dimensionality of the primal variables, $n$, then dual optimisation often offers a more efficient route to solutions. This is of particular importance if we are dealing with infinite-dimensional primal variables.

Linear Programming
The following canonical optimisation problem is known as a linear programme:
$$\min_{\mathbf{x} \in \mathbb{R}^n} \mathbf{c}^T\mathbf{x} \quad \text{subject to: } \mathbf{A}\mathbf{x} \leq \mathbf{b}$$
where $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, and $\mathbf{c} \in \mathbb{R}^n$.
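Canonical LPs of this form can be solved numerically; a minimal sketch using SciPy's `linprog` on small made-up data (not the lecture's example):

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative LP: min c^T x  subject to  A x <= b, x free
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])

# linprog minimises c^T x subject to A_ub x <= b_ub; bounds=None-pairs
# makes the variables unbounded (the default is x >= 0)
res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2, method="highs")
x_star = res.x     # the optimum lies at a vertex of the feasible polytope
```

For this data the maximiser of $x_1 + 2x_2$ sits at the vertex where $x_2 = 3$ and $x_1 + x_2 = 4$ intersect, i.e. $(1, 3)$.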
Linear Programming (Cont.)
The Lagrangian is given by:
$$L(\mathbf{x}, \boldsymbol{\mu}) = \mathbf{c}^T\mathbf{x} + \boldsymbol{\mu}^T(\mathbf{A}\mathbf{x} - \mathbf{b}) = (\mathbf{c} + \mathbf{A}^T\boldsymbol{\mu})^T\mathbf{x} - \boldsymbol{\mu}^T\mathbf{b}$$
Taking derivatives and seeking stationarity:
$$\nabla_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\mu}) = \mathbf{c} + \mathbf{A}^T\boldsymbol{\mu} = \mathbf{0}$$
From which we generate the dual Lagrangian: $D(\boldsymbol{\mu}) = -\boldsymbol{\mu}^T\mathbf{b}$. Thus, the dual optimisation problem is:
$$\max_{\boldsymbol{\mu} \in \mathbb{R}^m} -\boldsymbol{\mu}^T\mathbf{b} \quad \text{subject to: } \mathbf{c} + \mathbf{A}^T\boldsymbol{\mu} = \mathbf{0},\ \boldsymbol{\mu} \geq \mathbf{0}$$

Linear Programming: Example⁴
$$\mathbf{c} = -\begin{bmatrix} 5 \\ 3 \end{bmatrix}; \quad \mathbf{A} = \begin{bmatrix} 2 & 2 \\ 2 & -4 \\ -2 & 1 \\ 0 & -1 \\ 0 & 1 \end{bmatrix}; \quad \mathbf{b} = \begin{bmatrix} 33 \\ 8 \\ 5 \\ -1 \\ 8 \end{bmatrix}$$
[Figure: the feasible polygon in the $(x_1, x_2)$ plane, with the optimum at a vertex]

⁴ Example based on Deisenroth et al., 'Mathematics For Machine Learning'

Quadratic Programming
The following canonical optimisation problem is known as a quadratic programme:
$$\min_{\mathbf{x} \in \mathbb{R}^n} \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x} \quad \text{subject to: } \mathbf{A}\mathbf{x} \leq \mathbf{b}$$
where $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, and $\mathbf{c} \in \mathbb{R}^n$. $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is symmetric and positive definite; therefore the objective function is strictly convex.
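A small QP of this canonical form can be solved with a general-purpose constrained optimiser; this sketch uses `scipy.optimize.minimize` (SLSQP) on illustrative data (not the lecture's example), with the reference solution worked out by hand from the KKT conditions:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative QP: min 0.5 x^T Q x + c^T x  subject to  x1 + x2 <= 1
Q = 2 * np.eye(2)
c = np.array([-2.0, -4.0])

obj = lambda x: 0.5 * x @ Q @ x + c @ x

# SciPy's "ineq" convention is fun(x) >= 0, so encode x1 + x2 <= 1
# as 1 - x1 - x2 >= 0
cons = {"type": "ineq", "fun": lambda x: 1 - x[0] - x[1]}

res = minimize(obj, x0=np.zeros(2), constraints=[cons])
# By hand: the unconstrained minimiser -Q^{-1} c = (1, 2) is infeasible,
# so the constraint is active; KKT gives x* = (0, 1) with multiplier mu* = 2
```

Since Q is positive definite the objective is strictly convex, so this KKT point is the unique global solution.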
Quadratic Programming (Cont.)
The Lagrangian is given by:
$$L(\mathbf{x}, \boldsymbol{\mu}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x} + \boldsymbol{\mu}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$$
Taking derivatives and seeking stationarity:
$$\nabla_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\mu}) = \mathbf{Q}\mathbf{x} + (\mathbf{c} + \mathbf{A}^T\boldsymbol{\mu}) = \mathbf{0} \implies \mathbf{x} = -\mathbf{Q}^{-1}(\mathbf{c} + \mathbf{A}^T\boldsymbol{\mu})$$
From which we generate the dual Lagrangian:
$$D(\boldsymbol{\mu}) = -\frac{1}{2}(\mathbf{c} + \mathbf{A}^T\boldsymbol{\mu})^T\mathbf{Q}^{-1}(\mathbf{c} + \mathbf{A}^T\boldsymbol{\mu}) - \boldsymbol{\mu}^T\mathbf{b}$$
Thus, the dual optimisation problem is:
$$\max_{\boldsymbol{\mu} \in \mathbb{R}^m} -\frac{1}{2}(\mathbf{c} + \mathbf{A}^T\boldsymbol{\mu})^T\mathbf{Q}^{-1}(\mathbf{c} + \mathbf{A}^T\boldsymbol{\mu}) - \boldsymbol{\mu}^T\mathbf{b} \quad \text{subject to: } \boldsymbol{\mu} \geq \mathbf{0}$$

Quadratic Programming: Example⁵
$$\mathbf{Q} = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}; \quad \mathbf{c} = \begin{bmatrix} 5 \\ 3 \end{bmatrix}; \quad \mathbf{A} = \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix}; \quad \mathbf{b} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$
[Figure: contours of the quadratic objective with the box feasible region $[-1, 1]^2$]

⁵ Example based on Deisenroth et al., 'Mathematics For Machine Learning'

Integral Calculus

Calculus can be split into two fields: differential calculus, which, as we have seen, is concerned with the instantaneous rate of change of functions; and integral calculus, which, as we will see, is concerned with the accumulation of quantities. We will see that integral calculus, and the operation of integration, allows us to find the area (or volume) lying beneath functions. This will be crucial when we seek to make a probabilistic analysis of machine learning problems.

Indefinite Integral
The indefinite integral or antiderivative of a continuous function $f(x)$ is a function $F(x)$, written as $\int f(x)\,dx$, the derivative of which is $f$. Thus:
$$F'(x) = f(x)$$

Common Indefinite Integrals
  f(x)              F(x) = ∫ f(x) dx
  x^n (n ≠ −1)      x^{n+1}/(n+1) + C
  1/x               ln x + C
  e^x               e^x + C
Where $C$ is an arbitrary constant of integration.
Definite Integral
The definite integral of a continuous function $f(x)$ between two numbers $a$ and $b$, written as $\int_a^b f(x)\,dx$, is defined to be the difference between $F(b)$ and $F(a)$. Thus:
$$\int_a^b f(x)\,dx = \Big[ F(x) \Big]_a^b = F(b) - F(a)$$

The Fundamental Theorem of Calculus
If we divide the interval $[a, b]$ into $N$ equal subintervals, each of length $\delta = \frac{b - a}{N}$, then we may seek to evaluate the Riemann sum:
$$\sum_{i=1}^N f(x_i)\,\delta$$
This is the sum of a series of $N$ rectangles, each of which has a different height given by the values in the set $\{x_i\}_{i=1}^N$, but with the same width, $\delta$. Now, the Fundamental Theorem of Calculus tells us that as $\delta$ tends to zero the Riemann sum tends to the definite integral:
$$\lim_{\delta \to 0} \sum_{i=1}^N f(x_i)\,\delta = \int_a^b f(x)\,dx$$
So, this theorem connects the objects of integration which we have discussed with the notion of '(signed) area under the curve'.

[Figure: a curve $f(x)$ over $[a, b]$ partitioned into rectangles of width $\delta$; the area of the green region (where $f > 0$) adds to the total of the definite integral, while the area of the red region (where $f < 0$) subtracts from it]

Multiple Integrals
This notion generalises to higher dimensions, so that in $n$ dimensions, where we wish to integrate a function $f(\mathbf{x})$ across a more general region of integration, we may denote the multiple integral by:
$$\int \cdots \int_V f(x_1, x_2, \dots, x_n)\,dx_1\,dx_2 \dots dx_n = \int_V f(\mathbf{x})\,d\mathbf{x}$$
And this is equivalent to the signed hypervolume under the hypersurface $f(\mathbf{x})$, bounded by the region $V \subseteq \mathbb{R}^n$.
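The convergence of the Riemann sum to the definite integral is easy to observe numerically; a minimal sketch (the integrand is illustrative):

```python
import numpy as np

def riemann_sum(f, a, b, N):
    # Left Riemann sum with N equal subintervals of width delta = (b - a) / N
    delta = (b - a) / N
    xs = a + delta * np.arange(N)   # left endpoint of each subinterval
    return np.sum(f(xs)) * delta

# Integral of x^2 on [0, 1] is 1/3; the sum approaches it as delta -> 0
approx = riemann_sum(lambda x: x**2, 0.0, 1.0, 100000)
```

For a smooth integrand the left-sum error shrinks linearly in $\delta$, so with $N = 10^5$ the approximation is accurate to roughly $5 \times 10^{-6}$.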
Multiple Integrals: Example
We wish to integrate the function $g(\mathbf{x}) = x_1^2 + x_2^2$ over the region, $A$, bounded by $x_2 = \sqrt{x_1}$ and $x_2 = x_1^2$:
$$\iint_A (x_1^2 + x_2^2)\,dx_1\,dx_2$$

[Figure: the region $A$ between the curves $x_2 = x_1^2$ and $x_2 = \sqrt{x_1}$ for $0 \leq x_1 \leq 1$]

$$\iint_A (x_1^2 + x_2^2)\,dx_1\,dx_2 = \int_0^1 \int_{x_1^2}^{\sqrt{x_1}} (x_1^2 + x_2^2)\,dx_2\,dx_1$$
$$= \int_0^1 \left[ x_1^2 x_2 + \frac{x_2^3}{3} \right]_{x_1^2}^{\sqrt{x_1}} dx_1 = \int_0^1 \left( x_1^{5/2} + \frac{1}{3}x_1^{3/2} - x_1^4 - \frac{1}{3}x_1^6 \right) dx_1$$
$$= \left[ \frac{2}{7}x_1^{7/2} + \frac{2}{15}x_1^{5/2} - \frac{1}{5}x_1^5 - \frac{1}{21}x_1^7 \right]_0^1 = \frac{2}{7} + \frac{2}{15} - \frac{1}{5} - \frac{1}{21} = 0.171$$

Summary

Calculus is an essential tool that helps us to minimise certain (cost) functions. We have introduced the basic machinery for calculus in one and many dimensions. Building on this, we have introduced some techniques for unconstrained and constrained optimisation that will be of direct use in machine learning.