Logistic Regression PDF
Document Details
IIT Gandhinagar
2024
Nipun Batra
Summary
This presentation covers logistic regression and its application to classification tasks. The slides motivate logistic regression by contrasting it with linear regression, introduce the sigmoid function and log-odds, derive the cross-entropy cost via maximum likelihood, and discuss gradient descent and Newton's method (IRLS) for learning the parameters, feature transformations for non-linear boundaries, softmax for multi-class classification, and regularization.
Full Transcript
Logistic Regression
Nipun Batra
February 27, 2024
IIT Gandhinagar

Problem Setup

Classification Technique
[Plot: oranges and tomatoes plotted against Radius.]
Aim: Probability(Tomatoes | Radius)?
More generally, P(y = 1 | X = x)?

Idea: Use Linear Regression
[Plot: oranges (y = 1) and tomatoes (y = 0) against Radius, with a fitted regression line.]
P(X = Orange | Radius) = θ_0 + θ_1 × Radius
Generally, P(y = 1 | x) = Xθ
Prediction: if θ_0 + θ_1 × Radius > 0.5 → Orange, else → Tomato.
Problem: the range of Xθ is (−∞, ∞), but P(y = 1 | ·) ∈ [0, 1].
[Plot: with points at larger radii, the fitted line falls outside [0, 1].]
Linear regression for classification gives a poor prediction!

Ideal Boundary
[Plot: a sharp step-function decision boundary separating oranges and tomatoes.]
Have a decision function similar to the above (but not so sharp and discontinuous).
Aim: use linear regression still!

Logistic Regression
[Plot: sigmoid curve for P(y = 1) against Radius, with the decision boundary.]
Question: Can we still use linear regression?
Answer: Yes! Transform ŷ → [0, 1].

Logistic / Sigmoid Function
ŷ ∈ (−∞, ∞); ϕ = sigmoid / logistic function (σ); ϕ(ŷ) ∈ [0, 1]
σ(z) = 1 / (1 + e^{−z})
[Plot: σ(z) for z from −10 to 10.]
As z → ∞, σ(z) → 1.
As z → −∞, σ(z) → 0.
At z = 0, σ(z) = 0.5.
Question: Could you use some other transformation ϕ of ŷ such that ϕ(ŷ) ∈ [0, 1]? Yes! But logistic regression works.

P(y = 1 | X) = σ(Xθ) = 1 / (1 + e^{−Xθ})
Q. Write Xθ in a more convenient form (in terms of P(y = 1 | X) and P(y = 0 | X)).
P(y = 0 | X) = 1 − P(y = 1 | X) = 1 − 1 / (1 + e^{−Xθ}) = e^{−Xθ} / (1 + e^{−Xθ})
∴ P(y = 1 | X) / (1 − P(y = 1 | X)) = e^{Xθ}  ⟹  Xθ = log [ P(y = 1 | X) / (1 − P(y = 1 | X)) ]

Odds (used in betting)
Odds = P(win) / P(loss). Here, Odds = P(y = 1) / P(y = 0).
Log-odds = log [ P(y = 1) / P(y = 0) ] = Xθ
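To make the sigmoid and the log-odds identity concrete, here is a minimal NumPy sketch (not part of the slides; the parameter and feature values below are made-up assumptions for illustration):

import numpy as np

def sigmoid(z):
    # Logistic / sigmoid function: maps (-inf, inf) into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Limiting behaviour from the slides: sigma(-10) ~ 0, sigma(0) = 0.5, sigma(10) ~ 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))

# Log-odds: if p = sigmoid(x . theta), then log(p / (1 - p)) recovers x . theta
theta = np.array([0.5, -2.0])   # hypothetical [theta_0, theta_1]
x = np.array([1.0, 0.7])        # [bias term, radius]
p = sigmoid(x @ theta)
print(np.log(p / (1 - p)), x @ theta)   # both print -0.9 (up to floating point)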
Logistic Regression
Q. What is the decision boundary for logistic regression?
Decision boundary: P(y = 1 | X) = P(y = 0 | X),
i.e. 1 / (1 + e^{−Xθ}) = e^{−Xθ} / (1 + e^{−Xθ}), i.e. e^{Xθ} = 1, i.e. Xθ = 0.

Learning Parameters
Could we use the cost function J(θ) = Σ_i (y_i − ŷ_i)², with ŷ_i = σ(x_i θ)?
Answer: No (non-convex). (See the Jupyter notebook.)

Cost Function Convexity
[Plots: RMSE contour plot and RMSE surface plot over (θ_0, θ_1), showing a non-convex cost surface.]

Learning Parameters
Likelihood = P(D | θ)
P(y | X, θ) = Π_{i=1}^{n} P(y_i | x_i, θ), where each y_i is 0 or 1
            = Π_{i=1}^{n} { 1 / (1 + e^{−x_iᵀθ}) }^{y_i} { 1 − 1 / (1 + e^{−x_iᵀθ}) }^{1 − y_i}
[Above: similar to P(D | θ) for linear regression; the difference is a Bernoulli likelihood instead of a Gaussian.]
−log P(y | X, θ) = negative log-likelihood = the cost function we will be minimising = J(θ)

Aside on Bernoulli Likelihood
Assume you have a coin, flip it ten times, and get (H, H, T, T, T, H, H, T, T, T). What is p(H)?
We might think it to be 4/10 = 0.4. But why?
Answer 1: probability defined as a measure of long-run frequencies.
Answer 2: what is the likelihood of seeing the above sequence when p(Head) = θ? Idea: find the MLE estimate for θ.
p(H) = θ and p(T) = 1 − θ.
What is the PMF for the first observation, P(D_1 = x | θ), where x = 0 for Tails and x = 1 for Heads?
P(D_1 = x | θ) = θ^x (1 − θ)^{1 − x}
Verify: if x = 0 (Tails), P(D_1 = x | θ) = 1 − θ; if x = 1 (Heads), P(D_1 = x | θ) = θ.
What is P(D_1, D_2, ..., D_n | θ)?
P(D_1, D_2, ..., D_n | θ) = P(D_1 | θ) P(D_2 | θ) ... P(D_n | θ) = θ^{n_h} (1 − θ)^{n_t}
Log-likelihood: LL(θ) = n_h log(θ) + n_t log(1 − θ)
∂LL(θ)/∂θ = 0  ⟹  n_h/θ − n_t/(1 − θ) = 0  ⟹  θ_MLE = n_h / (n_h + n_t)
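As a quick numerical check of this result (a sketch, not from the slides), we can evaluate the Bernoulli log-likelihood of the coin sequence on a grid of θ values and confirm that the maximiser is near 0.4:

import numpy as np

# Coin flips from the slides: H, H, T, T, T, H, H, T, T, T (1 = Heads, 0 = Tails)
flips = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0])

def bernoulli_log_likelihood(theta, data):
    # LL(theta) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)
lls = np.array([bernoulli_log_likelihood(t, flips) for t in grid])
print(grid[np.argmax(lls)])   # ~0.4 = n_h / (n_h + n_t)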
Cross-Entropy Cost Function

Learning Parameters
J(θ) = −log Π_{i=1}^{n} { 1 / (1 + e^{−x_iᵀθ}) }^{y_i} { 1 − 1 / (1 + e^{−x_iᵀθ}) }^{1 − y_i}
J(θ) = −Σ_{i=1}^{n} [ y_i log(σ_θ(x_i)) + (1 − y_i) log(1 − σ_θ(x_i)) ]
This cost function is called cross-entropy. Why?

Interpretation of the Cross-Entropy Cost Function
What is the interpretation of the cost function? Let us try to write the cost function for a single example:
J(θ) = −y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i)
First, assume y_i is 0: then if ŷ_i is 0, the loss is 0; but if ŷ_i is 1, the loss tends towards infinity!
[Plot: cost as a function of ŷ when y = 0.]

Notebook: logits-usage

Now, assume y_i is 1: then if ŷ_i is 0, the loss is huge; but if ŷ_i is 1, the loss is zero!
[Plot: cost as a function of ŷ when y = 1.]
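A small sketch of this per-example cost (illustrative, not from the slides; the eps clipping is an added numerical guard), showing the asymmetric penalties described above:

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Per-example cost: -y log(y_hat) - (1 - y) log(1 - y_hat); eps guards against log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 0: near-zero loss when y_hat is near 0, loss blows up as y_hat -> 1
print(cross_entropy(0, 0.01), cross_entropy(0, 0.99))
# y = 1: the mirror image
print(cross_entropy(1, 0.99), cross_entropy(1, 0.01))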
Cost Function Convexity
[Plots: cross-entropy contour plot and cross-entropy surface plot over (θ_0, θ_1), showing a convex cost surface.]

Learning Parameters
∂J(θ)/∂θ_j = −Σ_{i=1}^{n} ∂/∂θ_j [ y_i log(σ_θ(x_i)) + (1 − y_i) log(1 − σ_θ(x_i)) ]
           = −Σ_{i=1}^{n} [ y_i ∂/∂θ_j log(σ_θ(x_i)) + (1 − y_i) ∂/∂θ_j log(1 − σ_θ(x_i)) ]
           = −Σ_{i=1}^{n} [ (y_i / σ_θ(x_i)) ∂σ_θ(x_i)/∂θ_j + ((1 − y_i) / (1 − σ_θ(x_i))) ∂(1 − σ_θ(x_i))/∂θ_j ]    (1)

Aside:
∂σ(z)/∂z = ∂/∂z [ 1 / (1 + e^{−z}) ] = −(1 + e^{−z})^{−2} ∂(1 + e^{−z})/∂z = e^{−z} / (1 + e^{−z})^2
         = [ 1 / (1 + e^{−z}) ] [ e^{−z} / (1 + e^{−z}) ] = σ(z)(1 − σ(z))

Resuming from (1), and using ∂σ_θ(x_i)/∂θ_j = σ_θ(x_i)(1 − σ_θ(x_i)) x_{ij} (where x_{ij} is the j-th feature of x_i):
∂J(θ)/∂θ_j = −Σ_{i=1}^{n} [ (y_i / σ_θ(x_i)) σ_θ(x_i)(1 − σ_θ(x_i)) x_{ij} − ((1 − y_i) / (1 − σ_θ(x_i))) σ_θ(x_i)(1 − σ_θ(x_i)) x_{ij} ]
           = −Σ_{i=1}^{n} [ y_i (1 − σ_θ(x_i)) x_{ij} − (1 − y_i) σ_θ(x_i) x_{ij} ]
           = −Σ_{i=1}^{n} (y_i − y_i σ_θ(x_i) − σ_θ(x_i) + y_i σ_θ(x_i)) x_{ij}
           = Σ_{i=1}^{n} (σ_θ(x_i) − y_i) x_{ij}

Learning Parameters
∂J(θ)/∂θ_j = Σ_{i=1}^{N} (σ_θ(x_i) − y_i) x_{ij}
Now, just use gradient descent!

[Handwritten notes: stacking the partial derivatives over j, the gradient in matrix form is ∂J(θ)/∂θ = Xᵀ(ŷ − y), where ŷ = σ(Xθ) and the j-th column of X holds feature j for all n points.]
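Here is a minimal batch gradient-descent sketch built on the gradient Xᵀ(σ(Xθ) − y) derived above. The synthetic data, learning rate, and iteration count are illustrative assumptions, not from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=5000):
    # Batch gradient descent on the cross-entropy cost;
    # the gradient is X^T (sigmoid(X theta) - y), averaged over the n points
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

# Illustrative 1-D "radius" data with a bias column prepended
rng = np.random.default_rng(0)
radius = rng.uniform(0, 1, size=200)
y = (radius > 0.5).astype(float)                 # hypothetical labels
X = np.column_stack([np.ones_like(radius), radius])
theta = fit_logistic_gd(X, y)
print(theta)                                     # boundary is where X theta = 0, i.e. radius = -theta[0]/theta[1]
print(sigmoid(X @ theta)[:5])                    # predicted probabilities P(y = 1 | x)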
Logistic Regression with Feature Transformation
[Plot: oranges and tomatoes in (x_1, x_2); the classes are not linearly separable.]
What happens if you apply logistic regression to the above data?
A linear boundary will not be accurate here. What is the technical name of the problem? Bias!

Feature transformation:
ϕ(x) = [ϕ_0(x), ϕ_1(x), ..., ϕ_{K−1}(x)]ᵀ = [1, x, x², x³, ..., x^{K−1}]ᵀ ∈ R^K

[Plot: with the transformed features, the learned boundary separates the two classes.]
Using x_1², x_2² as additional features, we are able to learn a more accurate classifier.
How would you expect the probability contours to look?
[Plot: P(Tomatoes) probability contours over (x_1, x_2), ranging from 0 to 1.]

Multi-Class Prediction
[Plot: Iris dataset, sepal width (cm) vs. sepal length (cm), for I. setosa, I. versicolor, and I. virginica.]
How would you learn a classifier? Or, how would you expect the classifier to learn decision boundaries?
1. Use one-vs.-all on binary logistic regression
2. Use one-vs.-one on binary logistic regression
3. Extend binary logistic regression to multi-class logistic regression

For the third option:
1. Learn P(setosa (class 1)) = F(Xθ_1)
2. P(versicolor (class 2)) = F(Xθ_2)
3. P(virginica (class 3)) = F(Xθ_3)
4. Goal: learn θ_i ∀ i ∈ {1, 2, 3}
5. Question: What could F be?
Properties: Σ_{i=1}^{3} F(Xθ_i) = 1; F(z) ∈ [0, 1]; F has squashing properties, mapping R to [0, 1].

Softmax
For z ∈ R^d: F(z_i) = e^{z_i} / Σ_{j=1}^{d} e^{z_j}, so Σ_i F(z_i) = 1.
F(z_i) refers to the probability of class i.

Softmax for Multi-Class Logistic Regression
With k ∈ {1, ..., K} classes and parameters θ = [θ_1, θ_2, ..., θ_K]:
P(y = k | X, θ) = e^{Xθ_k} / Σ_{k'=1}^{K} e^{Xθ_{k'}}

For K = 2 classes:
P(y = 0 | X, θ) = e^{Xθ_0} / (e^{Xθ_0} + e^{Xθ_1})
P(y = 1 | X, θ) = e^{Xθ_1} / (e^{Xθ_0} + e^{Xθ_1}) = 1 / (1 + e^{X(θ_0 − θ_1)}) = 1 / (1 + e^{−Xθ'}), with θ' = θ_1 − θ_0.
This is the sigmoid!

Multi-Class Logistic Regression Cost
Assume our prediction and ground truth over the three classes for the i-th point are:
ŷ_i = (ŷ_i^1, ŷ_i^2, ŷ_i^3) = (0.1, 0.8, 0.1) and y_i = (y_i^1, y_i^2, y_i^3) = (0, 1, 0), meaning the true class is class 2.
Let us calculate −Σ_{k=1}^{3} y_i^k log(ŷ_i^k) = −(0 × log(0.1) + 1 × log(0.8) + 0 × log(0.1)) = −log(0.8).
Small: it tends to zero as the predicted probability of the true class approaches 1.

Now assume ŷ_i = (0.3, 0.4, 0.3), with the same ground truth y_i = (0, 1, 0) (true class 2).
−Σ_{k=1}^{3} y_i^k log(ŷ_i^k) = −(0 × log(0.3) + 1 × log(0.4) + 0 × log(0.3)) = −log(0.4).
A higher number: the less confident the prediction for the true class, the bigger the penalty!
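A short sketch (illustrative, not from the slides) of the softmax function and the multi-class cross-entropy used in the worked examples above:

import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multiclass_cross_entropy(y_onehot, y_hat):
    # -sum_k y^k log(y_hat^k); only the true class contributes
    return -np.sum(y_onehot * np.log(y_hat))

y_true = np.array([0.0, 1.0, 0.0])   # true class is class 2
print(multiclass_cross_entropy(y_true, np.array([0.1, 0.8, 0.1])))   # ~0.22 (small)
print(multiclass_cross_entropy(y_true, np.array([0.3, 0.4, 0.3])))   # ~0.92 (bigger penalty)

print(softmax(np.array([2.0, 1.0, 0.1])))   # entries lie in [0, 1] and sum to 1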
Multi-Class Logistic Regression Cost
For 2 classes we had:
J(θ) = −Σ_{i=1}^{n} [ y_i log(σ_θ(x_i)) + (1 − y_i) log(1 − σ_θ(x_i)) ]
More generally,
J(θ) = −Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]
Extended to K classes:
J(θ) = −Σ_{i=1}^{n} Σ_{k=1}^{K} y_i^k log(ŷ_i^k)

Now:
∂J(θ)/∂θ_k = Σ_{i=1}^{n} [ P(y_i = k | x_i, θ) − I(y_i = k) ] x_i

Hessian Matrix
The Hessian matrix of f(·) with respect to θ, written ∇²_θ f(θ) or simply H, is the d × d matrix of second partial derivatives, with (i, j) entry ∂²f(θ) / (∂θ_i ∂θ_j).

Newton's Algorithm
The most basic second-order optimization algorithm is Newton's algorithm, which consists of updates of the form
θ_{k+1} = θ_k − H_k^{−1} g_k,
where g_k is the gradient at step k. This algorithm is derived by making a second-order Taylor series approximation of f(θ) around θ_k:
f_quad(θ) = f(θ_k) + g_kᵀ(θ − θ_k) + (1/2)(θ − θ_k)ᵀ H_k (θ − θ_k),
then differentiating and equating to zero to solve for θ_{k+1}.

Learning Parameters
Now, with π_i = σ_θ(x_i):
g(θ) = Σ_{i=1}^{n} (σ_θ(x_i) − y_i) x_i = Xᵀ(σ_θ(X) − y)
Let H be the Hessian of J(θ):
H = ∂g(θ)/∂θ = Σ_{i=1}^{n} σ_θ(x_i)(1 − σ_θ(x_i)) x_i x_iᵀ = Xᵀ diag(σ_θ(x_i)(1 − σ_θ(x_i))) X

Iteratively Reweighted Least Squares (IRLS)
For binary logistic regression, recall that the gradient and Hessian of the negative log-likelihood are given by:
g_k = Xᵀ(π_k − y)
H_k = Xᵀ S_k X
S_k = diag(π_{1k}(1 − π_{1k}), ..., π_{nk}(1 − π_{nk}))
π_{ik} = sigm(x_i θ_k)
The Newton update at iteration k + 1 for this model is as follows (a short numerical sketch follows at the end of the transcript):
θ_{k+1} = θ_k − H_k^{−1} g_k
        = θ_k + (Xᵀ S_k X)^{−1} Xᵀ(y − π_k)
        = (Xᵀ S_k X)^{−1} [ (Xᵀ S_k X) θ_k + Xᵀ(y − π_k) ]
        = (Xᵀ S_k X)^{−1} Xᵀ [ S_k X θ_k + y − π_k ]

Regularized Logistic Regression
Unregularised:
J_1(θ) = −Σ_{i=1}^{n} [ y_i log(σ_θ(x_i)) + (1 − y_i) log(1 − σ_θ(x_i)) ]
L2 regularization: J(θ) = J_1(θ) + λ θᵀθ
L1 regularization: J(θ) = J_1(θ) + λ ‖θ‖_1
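To illustrate the IRLS / Newton update for binary logistic regression, here is a minimal sketch. The synthetic data, the fixed iteration count, and the small ridge term added to keep the Hessian invertible are assumptions for illustration, not part of the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, n_iter=20, ridge=1e-6):
    # Newton / IRLS update: theta_{k+1} = theta_k + (X^T S_k X)^{-1} X^T (y - pi_k)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X @ theta)
        S = np.diag(pi * (1 - pi))
        H = X.T @ S @ X + ridge * np.eye(X.shape[1])   # ridge keeps H invertible
        theta = theta + np.linalg.solve(H, X.T @ (y - pi))
    return theta

# Illustrative data: two overlapping 1-D Gaussian blobs plus a bias column
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
X = np.column_stack([np.ones_like(x), x])
print(fit_logistic_irls(X, y))   # typically converges in a handful of Newton steps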