Logistic Regression PDF

Document Details

Uploaded by StraightforwardOctagon

American University of Beirut

Joseph Bakarji

Tags

logistic regression, supervised learning, machine learning, probability

Summary

This document presents a lecture on logistic regression, a supervised machine learning technique for classification. It motivates the move from least squares to a probabilistic predictor, derives the logistic (sigmoid) hypothesis and the binary cross-entropy loss from maximum likelihood, and describes gradient-based optimization, using tumor classification as a running example.

Full Transcript


Logistic Regression
Prepared by: Joseph Bakarji

Given new input, what's the output?
The training data is a table of input–output pairs (x^(1), y^(1)), (x^(2), y^(2)), (x^(3), y^(3)), (x^(4), y^(4)), … Given a query input x_q, what is the predicted output y_q?

Assuming y ∈ ℝ: given the data, find a function h that predicts y given x, i.e. y = h(x).

What if y is a label?
Example: y is Cancer (1) or No Cancer (0) and x is the size of the tumor. We still want a function h with y = h(x) and y ∈ [0, 1]. One option is a step function, or threshold, on x. A better option is a smooth function that returns the probability of occurrence.

The logistic function:
y = 1 / (1 + e^(−x))
With a single feature:
h_θ(x) = 1 / (1 + e^(−(θ_0 + θ_1 x)))
In general:
h_θ(x) = 1 / (1 + e^(−θ^⊤ x)), where θ^⊤ x = θ_0 + θ_1 x_1 + θ_2 x_2 + …, with θ = [θ_0, θ_1, …] and x = [x_0, x_1, …].

The recipe:
1. Define a predictor: the logistic function ✅
2. Define a loss: distance between function and data ❓
3. Optimize the loss
4. Test the model

How do we pick the best parameters θ?
For linear regression, the predictor is h_θ(x) = θ^⊤ x = ∑_{i=0}^{d} θ_i x_i and the cost function measures the distance between h_θ(x^(i)) and y^(i), giving ordinary least squares:
J(θ) = (1/2) ∑_{i=1}^{n} (h_θ(x^(i)) − y^(i))²

Logistic Regression
h_θ(x) = 1 / (1 + e^(−θ^⊤ x)) = g(θ^⊤ x)
Linear predictor → ordinary least squares (OLS) loss. Logistic predictor → negative log-likelihood, i.e. the binary cross-entropy loss, built from the log-likelihood
ℒ(θ) = ∑_{i=1}^{n} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
Compute the gradient ∇ℒ(θ), run gradient descent → Done!
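The transcript does not include code for these steps, so the following is a minimal NumPy sketch of the recipe above (logistic predictor, binary cross-entropy loss, gradient descent). The function names, learning-rate and iteration defaults, and the synthetic tumor-size data are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    """Negative log-likelihood (binary cross-entropy), averaged over examples."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on the mean binary cross-entropy loss.

    X is (n, d); a column of ones is prepended so theta[0] plays the role of θ0.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(Xb @ theta)                # hθ(x) for every example
        grad = Xb.T @ (y_hat - y) / len(y)         # mean of (hθ(x) − y) xj over examples
        theta -= lr * grad                         # descent step
    return theta

# Illustrative 1-D example: label is 1 when "tumor size" exceeds ~5 (synthetic data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = (x[:, 0] + rng.normal(0, 1, 200) > 5).astype(float)

theta = fit_logistic(x, y)
probs = sigmoid(np.hstack([np.ones((len(x), 1)), x]) @ theta)
print("theta:", theta, " loss:", binary_cross_entropy(y, probs))
```

The update inside fit_logistic is the same rule derived later in the slides, θ_j := θ_j − α ∑ (h_θ(x^(i)) − y^(i)) x_j^(i), here averaged over the batch.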
Why not use an ordinary least squares loss?

Probabilistic Interpretation of Linear Regression
Assume the noise is normally distributed around the model:
y^(i) = θ^⊤ x^(i) + ε^(i), with ε^(i) ~ 𝒩(0, σ²), so that
p(ε^(i)) = (1 / (√(2π) σ)) exp(−(ε^(i))² / (2σ²))
Equivalently, the likelihood of the output given the input is
p(y^(i) | x^(i); θ) = (1 / (√(2π) σ)) exp(−(y^(i) − θ^⊤ x^(i))² / (2σ²))

Likelihood of output given input
Since the examples are independent and identically distributed (IID),
L(θ) = ∏_{i=1}^{n} p(y^(i) | x^(i); θ) = ∏_{i=1}^{n} (1 / (√(2π) σ)) exp(−(y^(i) − θ^⊤ x^(i))² / (2σ²))

Log-likelihood
ℒ(θ) = log L(θ)
     = ∑_{i=1}^{n} log [ (1 / (√(2π) σ)) exp(−(y^(i) − θ^⊤ x^(i))² / (2σ²)) ]
     = n log (1 / (√(2π) σ)) − (1 / (2σ²)) ∑_{i=1}^{n} (y^(i) − θ^⊤ x^(i))²

Maximize the log-likelihood
Maximizing ℒ(θ) is equivalent to minimizing (1/2) ∑_{i=1}^{n} (y^(i) − θ^⊤ x^(i))², i.e. ordinary least squares. But what if the noise is not Gaussian?

Why not Least Squares?
For classification, h_θ(x) = 1 / (1 + e^(−θ^⊤ x)) = g(θ^⊤ x) returns a probability, and the label given the input follows a Bernoulli distribution rather than a Gaussian:
P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 − h_θ(x)
or, compactly, for a true label y ∈ {0, 1}:
p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)

Likelihood for Bernoulli-distributed noise
Over all (x, y) pairs,
L(θ) = ∏_{i=1}^{n} p(y^(i) | x^(i); θ) = ∏_{i=1}^{n} (h_θ(x^(i)))^(y^(i)) (1 − h_θ(x^(i)))^(1−y^(i))

Define the log-likelihood
ℒ(θ) = ∑_{i=1}^{n} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Maximize the log-likelihood
Using the fact that the sigmoid satisfies g′(z) = g(z)(1 − g(z)), the partial derivatives work out to
∂ℒ(θ)/∂θ_j = ∑_{i=1}^{n} (y^(i) − h_θ(x^(i))) x_j^(i)

Update rule (gradient ascent on ℒ, equivalently gradient descent on −ℒ); while not converged, for all parameters j:
θ_j := θ_j + α ∂ℒ(θ)/∂θ_j = θ_j − α ∑_{i=1}^{n} (h_θ(x^(i)) − y^(i)) x_j^(i)

Base Code Snippet
Scikit-Learn Code Snippet Example
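The code on the "Base Code Snippet" and "Scikit-Learn Code Snippet" slides did not survive extraction. As a stand-in, here is a short sketch of how the same Bernoulli / binary cross-entropy model is typically fit with scikit-learn's LogisticRegression; the synthetic data and the chosen query point are assumptions, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D "tumor size" data (illustrative assumption, not from the slides)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)

# scikit-learn fits the same Bernoulli / binary cross-entropy model,
# maximizing the (regularized) log-likelihood internally.
clf = LogisticRegression()
clf.fit(X, y)

print("theta0 (intercept):", clf.intercept_)
print("theta1 (coefficient):", clf.coef_)
print("P(y = 1 | x = 6):", clf.predict_proba([[6.0]])[0, 1])
```

Note that scikit-learn applies L2 regularization by default (controlled by its C parameter), so the fitted coefficients will generally differ slightly from the unregularized maximum-likelihood estimate derived above.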
