AI 305 Logistic Regression
Stanford
Andrew Ng
Summary
These slides cover logistic regression, a machine learning technique for binary classification. They discuss mathematical concepts, including the logistic function and its derivatives, along with optimization methods like gradient ascent and Newton's method.
Full Transcript
Introduction to Machine Learning — AI 305: Logistic Regression
SLIDES ADOPTED FROM ANDREW NG'S LECTURE NOTES, CS229, STANFORD

Last Week
o Linear Regression: linear hypothesis, cost function, gradient descent, Least Mean Squares (LMS), normal equations.

This Week
o Binary classification
o Logistic regression
o Cost function
o Newton's method
o Multiclass classification

Classification and Logistic Regression

Binary Classification
o In classification, the target variable y that we are trying to predict is discrete (a class), such as apartment, studio, or house.
o In binary classification, the target can take only two values, 0 or 1:
  ◦ Email: spam / not spam
  ◦ Tumor: malignant / benign
o y ∈ {0, 1}
  ◦ 0: the negative (−) class
  ◦ 1: the positive (+) class

Linear regression for binary classification?
o We could use the linear regression hypothesis h_θ(x) = θᵀx for binary classification, thresholding at 0.5.
  ◦ In general, it performs very poorly:
  ◦ y ∈ {0, 1}, but h_θ(x) can be > 1 or < 0;
  ◦ outliers shift the fitted line, producing a new hypothesis with bad results.
o We can instead use a hypothesis that solves this problem: the logistic function.

Logistic Regression
o The function proposed to represent the hypothesis of a binary classification problem is called the logistic function or sigmoid function:
  h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
o Writing z = θᵀx:
  g(z) = 1 / (1 + e^(−z))
o g(z) maps any real number to the (0, 1) interval.

Logistic Regression-2
o A useful property of the logistic function: its derivative can be written in terms of the original function,
  g′(z) = g(z) (1 − g(z))
o Given the logistic regression model, how do we fit θ for it?

Logistic Regression-3
o h_θ(x) gives the probability that the output is 1.
  ◦ h_θ(x) = 0.7 means a probability of 70% that the output is 1.
  ◦ The probability that the prediction is 0 is just the complement of the probability that it is 1 (e.g., if the probability that it is 1 is 70%, then the probability that it is 0 is 30%).
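The sigmoid and the derivative identity g′(z) = g(z)(1 − g(z)) above can be checked numerically; a minimal sketch (NumPy assumed available, the grid of test points is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Check the identity g'(z) = g(z) * (1 - g(z)) against a
# central finite-difference approximation of the derivative.
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
analytic = sigmoid(z) * (1.0 - sigmoid(z))
print(np.max(np.abs(numeric - analytic)))  # discrepancy is only finite-difference error
```

Note also that g(0) = 0.5, the natural decision threshold between the two classes.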
  ◦ h_θ(x) = P(y = 1 | x; θ) = 1 − P(y = 0 | x; θ)
  ◦ P(y = 1 | x; θ) + P(y = 0 | x; θ) = 1

Logistic Regression-4
o P(y = 1 | x; θ) = h_θ(x)
o P(y = 0 | x; θ) = 1 − h_θ(x)
o These can be written compactly as:
  p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)
o This is a function of y, given x, for fixed θ. To view it explicitly as a function of θ, we will instead call it the likelihood function.

Logistic Regression-5
o Assuming that the n training examples were generated independently, we can write down the likelihood of the parameters θ as
  L(θ) = ∏_{i=1..n} p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) = ∏_{i=1..n} (h_θ(x⁽ⁱ⁾))^(y⁽ⁱ⁾) (1 − h_θ(x⁽ⁱ⁾))^(1−y⁽ⁱ⁾)

Objective function
o Now, given this probabilistic model relating the y⁽ⁱ⁾'s and the x⁽ⁱ⁾'s:
  ◦ What is a reasonable way of choosing our best guess of the parameters θ?
o The principle of maximum likelihood says that we should choose θ so as to make the observed data as probable as possible, i.e., choose θ to maximize L(θ).

Objective function-2
o Instead of maximizing L(θ), we can also maximize any strictly increasing function of L(θ).
  ◦ In particular, the derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ):
  ℓ(θ) = log L(θ) = Σ_{i=1..n} [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]

Gradient ascent
o To maximize the likelihood, we can use gradient ascent, as in the derivation for linear regression.
o Written in vector notation, the updates are therefore given by:
  θ := θ + α ∇_θ ℓ(θ)
  ◦ Note the positive rather than negative sign in the update formula, since we are maximizing a function.

Gradient ascent-2
o Considering just one training example (x, y) and taking derivatives yields the stochastic gradient ascent rule.

Gradient ascent-3
o The stochastic gradient ascent rule:
  θ_j := θ_j + α (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾)) x_j⁽ⁱ⁾
o Compare it to the LMS update rule:
  ◦ it looks identical; but
  ◦ it is not the same algorithm, because h_θ(x⁽ⁱ⁾) is now a nonlinear function of θᵀx⁽ⁱ⁾.
o Both are instances of a GLM: generalized linear model.
o A vectorized implementation of the (batch) update rule is:
  θ := θ + α Xᵀ(y − g(Xθ))

Newton's Method
o A different algorithm for maximizing ℓ(θ).
o Newton's method was originally developed for finding roots of nonlinear equations f(θ) = 0.
o Suppose there is a function f: ℝ → ℝ.
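The batch update θ := θ + α Xᵀ(y − g(Xθ)) above can be sketched as a small training loop; the toy data, learning rate, and iteration count below are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gradient_ascent(X, y, alpha=0.1, iters=5000):
    """Maximize the log likelihood l(theta) by batch gradient ascent.
    X: (n, d) design matrix (include a column of ones for the intercept).
    y: (n,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Vectorized ascent step: theta := theta + alpha * X^T (y - g(X theta))
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Tiny separable toy problem: the label is 1 exactly when the feature is positive.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])   # intercept column + feature
y = (x > 0).astype(float)

theta = fit_logistic_gradient_ascent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(preds)  # matches y on this separable toy data
```

Note the plus sign in the update, matching gradient *ascent*; flipping it to a minus on the negative log likelihood gives the equivalent descent formulation.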
o We wish to find a value of θ ∈ ℝ so that f(θ) = 0.
o Newton's method performs the following update:
  θ := θ − f(θ) / f′(θ)

Newton's Method-2
o The natural interpretation: we approximate f by the linear function tangent to f at the current guess θ,
o solve for where that linear function equals zero,
o and let the next guess for θ be where that linear function is zero.

Newton's Method-3
o We are trying to find θ so that f(θ) = 0; here the solution is θ = 1.3.
o Initialize with θ = 4.5.
o Newton's method then fits a straight line tangent to f at θ = 4.5
o and solves for where that line evaluates to 0.
o This gives us the next guess for θ, which is about θ = 2.8.

Newton's Method-4
o Running one more iteration updates θ to about 1.8.
o After a few more iterations, we rapidly approach θ = 1.3.

Newton's Method-5
o Newton's method gives a way of getting to f(θ) = 0.
o What if we want to use it to maximize some function ℓ?
o The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero.
o So, by letting f(θ) = ℓ′(θ),
o we can use Newton's algorithm to maximize ℓ(θ).
o The update rule becomes:
  θ := θ − ℓ′(θ) / ℓ″(θ)

Newton's Method-6
o Idea: iterative quadratic approximation.
o The quadratic approximation is based on a Taylor expansion of ℓ around the current θ;
o we then maximize the quadratic approximation.
o To maximize the quadratic function, we set its gradient to zero,
o and update θ by θ := θ − ℓ′(θ) / ℓ″(θ).

Newton's Method-7: Optimal learning rate
o Newton's method can be viewed as a way of finding the optimal value of the step size (learning rate α) in the gradient-based update, instead of leaving it as a user-chosen variable:
  ◦ by choosing α = −1 / ℓ″(θ) (which is positive near a maximum, where ℓ″(θ) < 0),
o the gradient ascent step θ := θ + α ℓ′(θ) becomes the Newton step θ := θ − ℓ′(θ) / ℓ″(θ).

Newton-Raphson method
o The generalization of Newton's method to the multidimensional setting is called the Newton-Raphson method.
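The 1-D root-finding procedure above can be sketched in a few lines. The slides' function f is not shown; the f(θ) = θ³ − 1.3³ below is a hypothetical stand-in that also has its root at θ = 1.3, started from θ = 4.5 as on the slide (the intermediate guesses will differ from the slide's 2.8 and 1.8, since f differs):

```python
def newton_root(f, fprime, theta0, tol=1e-10, max_iter=50):
    """Find theta with f(theta) = 0 via theta := theta - f(theta) / f'(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = f(theta) / fprime(theta)
        theta -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return theta

# Hypothetical f with root at theta = 1.3 (the slide's f is not given).
f = lambda t: t**3 - 1.3**3
fprime = lambda t: 3.0 * t**2

theta = newton_root(f, fprime, theta0=4.5)
print(theta)  # approximately 1.3
```

Each iteration is exactly the tangent-line construction described above: linearize f at the current guess, then jump to that line's zero.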
o θ := θ − H⁻¹ ∇_θ ℓ(θ)
o ∇_θ ℓ(θ): the vector of partial derivatives of ℓ(θ) with respect to the θ_i's.
o H is a d-by-d matrix (actually, (d+1)-by-(d+1), assuming that we include the intercept term) called the Hessian, whose entries are given by:
  H_ij = ∂²ℓ(θ) / (∂θ_i ∂θ_j)

Newton's method advantages
o Newton's method typically enjoys faster convergence than (batch) gradient descent,
o and requires many fewer iterations to get very close to the optimum.

Newton's method disadvantages
o But: one iteration of Newton's method can be more expensive than one iteration of gradient descent, since it requires forming and inverting a d-by-d Hessian, which costs O(nd²) to form (plus O(d³) to invert);
o as long as d is not too large, though, it is usually much faster overall.

Fisher scoring method
o When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ),
o the resulting method is also called Fisher scoring.

End
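The Newton-Raphson update above can be sketched for the logistic regression log likelihood. For this ℓ(θ), the gradient is ∇_θ ℓ = Xᵀ(y − g(Xθ)) and the Hessian is H = −XᵀSX with S = diag(p(1 − p)), p = g(Xθ); the toy data below is an illustrative choice (kept non-separable so the MLE is finite and the Hessian stays well conditioned):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, iters=10):
    """Newton-Raphson for logistic regression: theta := theta - H^{-1} grad l(theta).
    grad l = X^T (y - p),  H = -X^T S X,  S = diag(p * (1 - p)),  p = g(X theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        H = -X.T @ (p * (1.0 - p))[:, None] * np.eye(1) if False else -(X.T * (p * (1.0 - p))) @ X
        # Solve H step = grad rather than forming H^{-1} explicitly.
        theta -= np.linalg.solve(H, grad)
    return theta

# Non-separable toy data (two "noisy" labels near the boundary).
x = np.array([-2.0, -1.0, -0.3, 0.2, 0.4, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

theta = fit_logistic_newton(X, y)
print(theta)  # gradient at theta is essentially zero after a few iterations
```

Using `np.linalg.solve` avoids the explicit d-by-d inverse; the O(d³) linear solve per iteration is the cost the "disadvantages" slide refers to, with quadratic convergence typically reaching the maximum in a handful of iterations.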