Logistic Regression: Background and Classifiers PDF
Document Details
Uploaded by HonestViolin
Tags
Summary
This document provides an overview of logistic regression as an analytic tool, especially its applications to text classification. It details the differences between generative and discriminative classifiers, showcasing use cases such as image recognition, and touches on the components of a probabilistic classifier and the phases of logistic regression from training weights to testing.
Full Transcript
Background: Generative and Discriminative Classifiers Logistic Regression Logistic Regression Important analytic tool in natural and social sciences Baseline supervised machine learning tool for classification Is also the foundation of neural networks Generative and Disc...
Background: Generative and Discriminative Classifiers Logistic Regression Logistic Regression Important analytic tool in natural and social sciences Baseline supervised machine learning tool for classification Is also the foundation of neural networks Generative and Discriminative Classifiers Naive Bayes is a generative classifier by contrast: Logistic regression is a discriminative classifier Generative and Discriminative Classifiers Suppose we're distinguishing cat from dog images imagenet imagenet Generative Classifier: Build a model of what's in a cat image Knows about whiskers, ears, eyes Assigns a probability to any image: how cat-y is this image? Also build a model for dog images Now given a new image: Run both models and see which one fits better Discriminative Classifier Just try to distinguish dogs from cats Oh look, dogs have collars! Let's ignore everything else ISTIC R EGRESSION Finding the correct class c from a document d in Generative vs Discriminative Classifiers ormally, recall that the naive Bayes assigns a class c to a document d no computing P(c|d) but by computing a likelihood and a prior Naive Bayes ISTIC R EGRESSION likelihood prior z }| { z}|{ ĉ =the rmally, recall that argmax P(d|c)assigns naive Bayes P(c) a class c to a document (5.1d c2C computing P(c|d) but by computing a likelihood and a prior Logistic ive model like Regression naive Bayes makes use of this likelihood term, which documentprior how to generate the features oflikelihood aposterior if we knew it was of class c. z }| { z}|{ trast a discriminative model in this text categorization scenario attempts ĉ = argmax P(d|c) P(c|d) P(c) ( compute P(c|d). Perhapsc2Cit will learn to assign high weight to documen at directly improve its ability to discriminate between possible classes 7 Components of a probabilistic machine learning classifier Given m input/output pairs (x(i),y(i)): 1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2,... , xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x). 2. A classification function that computes 𝑦, ! the estimated class, via p(y|x), like the sigmoid or softmax functions. 3. An objective function for learning, like cross-entropy loss. 4. An algorithm for optimizing the objective function: stochastic gradient descent. The two phases of logistic regression Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss. Test: Given a test example x we compute p(y|x) using learned weights w and b, and return whichever label (y = 1 or y = 0) is higher probability Background: Generative and Discriminative Classifiers Logistic Regression Classification in Logistic Regression Logistic Regression Classification Reminder Positive/negative sentiment Spam/not spam Authorship attribution (Hamilton or Madison?) Alexander Hamilton Text Classification: definition Input: ◦ a document x ◦ a fixed set of classes C = {c1, c2,…, cJ} Output: a predicted class 𝑦! Î C Binary Classification in Logistic Regression Given a series of input/output pairs: ◦ (x(i), y(i)) For each observation x (i) ◦ We represent x(i) by a feature vector [x1, x2,…, xn] (i) ◦ We compute an output: a predicted class 𝑦! Î {0,1} Features in logistic regression For feature xi, weight wi tells is how important is xi xi ="review contains ‘awesome’": wi = +10 xj ="review contains ‘abysmal’": wj = -10 xk =“review contains ‘mediocre’": wk = -2 Logistic Regression for one observation x Input observation: vector x = [x1, x2,…, xn] Weights: one per feature: W = [w1, w2,…, wn] ◦ Sometimes we call the weights θ = [θ1, θ2,…, θn] Output: a predicted class 𝑦! Î {0,1} (multinomial logistic regression: 𝑦! Î {0, 1, 2, 3, 4}) her real number that’s n added ! to the weighted inputs. ecision X Howon to a do z = test instance— classification wi xi + b after we’ve learned the ssifier first multiplies each x i by its weight wi , sums up th For each featurei=1 xi, weight wi tells us importance of xi ds the◦ bias term (Plus we'll b.a bias have Theb) resulting single number z exp kthe we’ll represent evidence for such the sums using the dot product notatio class. We'll sum up all the weighted features and the bias ot product of two vectors a and ! b, written as a · b is the s orresponding elementsX n of each vector. Thus the followin to Eq. 5.2: z = wi xi + b i=1 z = w·x+b ook we’ll represent such sums using the dot If this sum is high, we say y=1; if low, then y=0product no But we want a probabilistic classifier We need to formalize “sum is high”. We’d like a principled classifier that gives us a probability, just like Naive Bayes did We want a model that can tell us: p(y=1|x; θ) p(y=0|x; θ) ear around 0 but outlier values get squashed toward 0 or 1. to Eq. The5.2: problem: z isn't a probability, it's just a number! e a probability, we’ll pass z through the sigmoid func ction (named becausez =it w ·x+ looks b an s) is also called t like es logistic regression its name. The sigmoid has the fol ically in Fig. Solution: use5.1: a function of z that goes from 0 to 1 g in Eq. 5.3 forces z to be a legal probabil fact, since weights are1 real-valued, 1 the outp y = s (z) = = om to. 1+e z 1 + exp ( z) rest of the book, we’ll use the notation exp(x) to mean e r of advantages; it takes a real-valued number and maps rangeswe’ll bility, to frompass. z through the sigmoid function, s (z). T The very useful sigmoid or logistic function med because it looks like an s) is also called the logistic fu regression its name. The sigmoid has the following equati Fig. 5.1: 1 y = s (z) = z (5 1+e 20 Idea of logistic regression We’ll compute w∙x+b And then we’ll pass it through the sigmoid function: σ(w∙x+b) And we'll just treat it as a probability wo cases, wo cases, p(y =probabilities Making 1) and p(y = = 0), 0), sum withsum to to 1. 1. We sigmoids Wecan cando dothis thisasasfoll fol P(y = 1) = s (w · xx + +b) b) 11 = 1 + exp exp(( (w (w··xx+ +b)) b)) P(y = P(y = 0) = 11 s 0) = (w··xx+ s(w +b)b) 11 = 1 = 1 1 + exp ( (w · x + b)) 1 + exp ( (w · x + b)) exp exp ( ( (w (w ··xx++ b)) b)) = = 1 + exp ( (w · x + b)) 1 + exp ( (w · x + b)) P(y = 0) = 1 s (w · x + b) TheP(ysigmoid = s (w · xhas = 1) function + b) the property By the way: 1 = 1 1 = s (x) = s ( x) 1 + exp ( (w · x 1 + exp ( (w · x1+ b)) exp ( (w · x + b)) = so we P(ycould = 0) also = 1have · x + b) P(y = 0) =as s ( (w ·1x+ expressed s (w b)).( (w · x + b +exp Now we have an algorithm 1 that given an instance x computes the P(y = 1|x). How = 1 do The we sigmoid make a function has For decision? the property a test instance x, we sa 1 + exp ( (w · x + b)) Because probability P(y = exp 1|x)( is(w more than.5, and no otherwise. We call.5 · x + b)) 1 s (x) = s ( x) boundary: = (5.5) 1 + exp ( (w · x + b)) so we could⇢ also have expressed P(y = 0) as s ( (w · x tion has the property Now 1 if P(y = 1|x) > 0.5 ŷ =we have an algorithm that given an instance 0 otherwise P(y = 1|x). How do we make a decision? For a test i 1 s (x) = s ( x) P(y = 1|x) is more than.5, (5.6) probability and no otherwi 5.1.1 Example:decision boundary sentiment classification boundary: How do we make a decision? For a test instance x, we sa (y = 1|x) is more Turning than.5, andinto a probability no otherwise. a classifierWe call.5 t ⇢ 1 if P(y = 1|x) > 0.5 ŷ = 0 otherwise ample: sentiment classification 0.5 here is called the decision boundary n example. Suppose we are doing binary sentiment class text, and we would like to know whether to assign the sen ranges from to. The probabilistic classifier P(y = 1) = s (w · x + b) 1 = 1 + e (w·x+b) P(y=1) P(y = 0) = 1 s (w · x + 1 = 1 1 + e (w· e (w·x+b) = 1+e (w·x+b) Now we have an wx algorithm +b that given an instan algorithm that given an instance x computes the probabil we make a decision? For a test instance x, we say yes if t Turning a probability into a classifier is more than.5, and no otherwise. We call.5 the decisi ⇢ 1 if P(y = 1|x) > 0.5 if w∙x+b > 0 ŷ = 0 otherwise if w∙x+b ≤ 0 sentiment classification e. Suppose we are doing binary sentiment classification Classification in Logistic Regression Logistic Regression Logistic Regression: a text example on sentiment classification Logistic Regression Sentiment example: does y=1 or y=0? It's hokey. There are virtually no surprises , and the writing is second-rate. So why was it so enjoyable ? For one thing , the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in , and it'll do the same to you. 29 1 if P(y = 1|x) > 0.5 ŷ x= =2 2 0 otherwise x3=1 5.1.1 It'sExample: hokey. Theresentiment are virtually classification no surprises , and the writing is second-rate. So why was it so enjoyable ? For one thing , the cast is Let’s have great an example.. Another nice Suppose wemusic touch is the are doing. I wasbinary sentiment overcome with theclassification urge to get offon movie the review couch text, andand startwe would. like dancing to know It sucked mewhether to assign in , and it'll do the the samesentiment to you. class + or to a review document doc. We’ll represent each input observation by the 6 x4=3 x1input features x1...x6 of the =3 shown x5=0in the following x6=4.19 table; Fig. 5.2 shows the features in a sample mini test document. Figure 5.2 A sample mini test document showing the extracted features in the vector x. Var Definition Value in Fig. 5.2 x1 these Given count(positive lexicon) 6 features and 2 doc) the input review x, P(+|x)3 and P( |x) can be com- putedx2usingcount(negative ⇢Eq. 5.5: lexicon) 2 doc) 2 1 if “no” 2 doc x3 = P(Y = 1|x) = s (w · x + b) p(+|x) 1 0 otherwise x4 count(1st ⇢ = s and 2nd pronouns ([2.5, doc) 5.0, 21.2, 3 · [3, 2, 1, 3, 0, 4.19] + 0.1) 0.5, 2.0, 0.7] 1 if “!”=2 doc s (.833) x5 0 0 otherwise = 0.70 (5.6) x6 log(word count of doc) ln(66) = 4.19 30 x2the following tures x1...x6 of the input shown in count(negative ⇢ table; Fig. lexicon) 5.2 shows 2 doc) the features Classifying sentiment a sample mini test document. x for input x 1 if “no” 2 doc 3 Var Definition 0 otherwise Value in Fig. 5.2 x1 x4 2 doc) count(positive lexicon) count(1st ⇢ and3 2nd pronouns 2 doc) x2 count(negative lexicon) 2 doc) 1 if “!” 22doc ⇢ 1 if “no” 2 doc 5 x x3 0 otherwise 1 0 otherwise x6 log(word count of doc) x4 count(1st ⇢ and 2nd pronouns 2 doc) 3 1 if “!” 2 doc x5 0 0 otherwise x6 Let’s assume log(word count of doc) for the moment that ln(66) = 4.19we’ve alrea for each of these features, and that the 6 weights Suppose ware= [2.5, 5.0, 1.2, 0.5, 2.0, 0.7], while b = 0.1. ( et’s assume for the moment that we’ve already learned a real-valued weight for how the b = 0.1 weights are learned.) The weight w 1 , for e ch of these features, and that the 6 weights corresponding to the 6 features are 31 Figure 5.2 1 mini test5document showing A sample 6 the extracted features in the vector x. Classifying Figure 5.2 sentiment for input x A sample mini test document showing the extracted features in the vector x. Given these 6 features and the input review x, P(+|x) and P( |x) can be com- puted usingthese Given Eq. 5.5: 6 features and the input review x, P(+|x) and P( |x) can be com- puted using Eq. 5.5: p(+|x) = P(Y = 1|x) = s (w · x + b) p(+|x) = P(Y = 1|x) = s ([2.5, (w · x + 5.0, b) 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1) = s (.833) ([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1) = s (.833) 0.70 (5.6) p( |x) = P(Y = 0|x) = 1 0.70s (w · x + b) (5.6) p( |x) = P(Y = 0|x) = 0.30 1 s (w · x + b) = 0.30 Logistic regression is commonly applied to all sorts of NLP tasks, and any property of the input Logistic can be aisfeature. regression commonlyConsider appliedthetotask all of period sorts disambiguation: of NLP tasks, and any deciding property if of athe period input is canthe beend of a sentence a feature. Considerorthepart ofofa period task word, by classifying each disambiguation: period deciding 32 erty of the input can be a feature. Consider the task of period disambiguation: We can build features for logistic regression for deciding if a period is the end of a sentence or part of a word, by classifying each any classification task: period disambiguation period into one of two classes EOS (end-of-sentence) and not-EOS. We might use features like x1 below expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in our abbrevia- End ofwith tions dictionary (“Prof.”) and the class is EOS (perhaps sentence a negative weight). A feature canThis ends ainquite also express a period. complex combination of properties. For example a The house period following at 465 a upper cased word Main St.to is is a likely new. be an EOS, but if the word itself is St. and the previous word is capitalized, then the period is likely part of a shortening of the word street. Not end ⇢ 1 if “Case(wi ) = Lower” x1 = 0 otherwise ⇢ 1 if “wi 2 AcronymDict” x2 = 0 otherwise ⇢ 1 if “wi = St. & Case(wi 1 ) = Cap” x3 = 0 otherwise 33 , which is just whatinwe Classification want for (binary) a probability. logistic Because regression: it is summary but has a sharp slope toward the ends, it tends to squash outlier Given: And it’s differentiable, which as we’ll see in Section 5.8 will ◦ a set of classes: (+ sentiment,- sentiment) ◦ a vector x of features [x1, x2, …, xn] e. If we◦apply the sigmoid to x1= count( "awesome") the sum of the weighted features, ween 0 and 1. To make it a probability, ◦ x2 = log(number of words in review) we just need to make s, p(y◦=A 1) and wp(y vector = 0), sum of weights to 1.w2, [w1, We can …, do wn]this as follows: ◦ wi for each feature fi P(y = 1) = s (w · x + b) 1 = 1 + e (w·x+b) Logistic Regression: a text example on sentiment classification Logistic Regression Learning: Cross-Entropy Loss Logistic Regression Wait, where did the W’s come from? Supervised classification: We know the correct label y (either 0 or 1) for each x. But what the system produces is an estimate, 𝑦! We want to set w and b to minimize the distance between our estimate 𝑦! (i) and the true y(i). We need a distance estimator: a loss function or a cost function We need an optimization algorithm to update w and b to minimize the loss. 37 Learning components A loss function: ◦ cross-entropy loss An optimization algorithm: ◦ stochastic gradient descent The distance between 𝑦! and y We want to know how far is the classifier output: 𝑦! = σ(w·x+b) from the true output: y [= either 0 or 1] We'll call this difference: L(𝑦! ,y) = how much 𝑦! differs from the true y Intuition of negative log likelihood loss = cross-entropy loss A case of conditional maximum likelihood estimation We choose the parameters w,b that maximize the log probability of the true y labels in the training data given the observations x is the negative log likelihood loss, generally called the cross-entrop derive this loss function, applied to a single observation x. We’d Deriving cross-entropy loss for a single observation ghts that maximize the probability of the correct label p(y|x). Sinc x two discrete outcomes (1 or 0), this is a Bernoulli distribution, and Goal: maximize probability of the correct label p(y|x) he probability p(y|x) that our classifier produces for one observa Since there wing (keeping are only in mind that2 if discrete outcomes y=1, Eq. (0 or 1) to 5.9 simplifies weŷ;can if y=0, express s to 1 ŷ): the probability p(y|x) from our classifier (the thing we want to maximize) as y 1 y p(y|x) = ŷ (1 ŷ) noting: take the log of both sides. This will turn if y=1, this simplifies to 𝑦! out to be handy mathema n’t hurt us; whatever values maximize a probability will also maxim if y=0, this simplifies to 1- 𝑦! e probability: he probability p(y|x) that our classifier produces for one observa y 1 y wingDeriving in mind that ifp(y|x) (keepingcross-entropy loss y=1, for ŷsingle = a5.9 Eq. ŷ) observation (1simplifies x to ŷ; if y=0, s to 1 ŷ): Now weGoal: takemaximize the log ofprobability both sides.of This the correct will turnlabeloutp(y|x) to be handy m nd doesn’t hurt us; whatever values y maximize 1 y a probability will also Maximize: p(y|x) = ŷ (1 ŷ) og of the probability: Now take the log of both sides (mathematically handy) take the log of both sides. This will turn ⇥ y out to be handy ⇤ mathema 1 y n’t hurt us; whatever log Maximize: p(y|x) values = log aŷprobability maximize (1 ŷ) will also maxim e probability: = y log ŷ + (1 y) log(1 ŷ) ⇥ y 1 y ⇤ Eq. 5.10 log p(y|x) Whatever describes values log ŷ log a logmaximize likelihood = that (1 ŷ) p(y|x) should will be alsomaximized. maximize In or p(y|x) nto loss function (something thatŷ +we(1need = y log y) to minimize), log(1 ŷ) we’ll just Eq. 5.10. The result is the cross-entropy loss LCE : and doesn’t hurt us; whatever values maximize a probability will also maximiz Now weNow wethe take takelog theof logboth of both sides. sides. This Thiswill willturn turn out out to tobe behandy handy mathe ma logDeriving of the and probability: cross-entropy doesn’t hurt us; loss whatever for values a single maximize a observation probability will x also max and doesn’t hurt us; whatever values⇥ maximize a⇤probability will also m log of the probability: y 1 y log of the probability: log p(y|x) = log ŷ (1 ŷ) Goal: maximize probability of the correct label p(y|x) ⇥ y 1 y ⇤ = y log log p(y|x) = ŷlog +⇥(1ŷ (1y) log(1 ŷ) ŷ) ⇤ ( Maximize: y 1 y log p(y|x) = =logy logŷ ŷ(1 + (1 ŷ)y) log(1 ŷ) Eq. 5.10 describes a log likelihood that should be maximized. In order to turn = y log ŷ + (1 y) log(1 ŷ) into loss Eq.function (something 5.10 describes a logthat we needthat likelihood to should minimize), we’ll just flip be maximized. the sig In order t Eq.Now 5.10.flip intoThesign loss toisturn result this into that the(something function a loss: cross-entropy loss something to minimize LCE :to minimize), we need we’ll just flip th Eq. 5.10 describes a log likelihood that should Eq. 5.10. The result is the cross-entropy loss LCE :be maximized. In orde intoCross-entropy loss function loss (because (something is that formula we need for to cross-entropy(y, minimize), LCE (ŷ, y) = log p(y|x) = [y log ŷ + (1 y) log(1 ŷ)] we’ll 𝑦 ! )) just fli( Eq. 5.10. The result Minimize: LCEis(ŷ,the cross-entropy y) = log p(y|x) =loss [y Llog CE :ŷ + (1 y) log(1 ŷ)] Finally, we can plug in the definition of ŷ = s (w · x + b): Or, plugging Finally, we in candefinition of 𝑦: ! plug in the definition of ŷ = s (w · x + b): LCE (ŷ, y) = log p(y|x) = [y log ŷ + (1 y) log(1 ŷ) LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b))] ( LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b)) Let's see if this works for our sentiment example We want loss to be: smaller if the model estimate is close to correct bigger if model is confused Let's first suppose the true label of this is y=1 (positive) It's hokey. There are virtually no surprises , and the writing is second-rate. So why was it so enjoyable ? For one thing , the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in , and it'll do the same to you. x4=3 x1=3 x5=0 x6=4.19 Let'sFigure see 5.2 if thisminiworks A sample test documentfor our showing sentiment the extracted example features in the vector x. Given these 6 features and the input review x, P(+|x) and P( |x) can be com- True value is y=1. How well is our model doing? puted using Eq. 5.5: p(+|x) = P(Y = 1|x) = s (w · x + b) = s ([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1) R 5 L OGISTIC R EGRESSION= s (.833) = 0.70 (5.6) side of the p(equation drops |x) = P(Y out, = = 0|x) leading 1 sto (wthe · x +following b) loss (we’ll use log to mean natural log when the base is not specified): Pretty well! What's the loss? = 0.30 LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b))] Logistic regression is commonly applied to all sorts of NLP tasks, and any property = can be a[log of the input s (w Consider feature. · x + b)] the task of period disambiguation: deciding if a period = is the end of a sentence log(.70)or part of a word, by classifying each period into one=of two classes EOS (end-of-sentence).36 and not-EOS. We might use features like x1 below expressing that the current word is lower case and the class is EOS CE p(+|x) = P(Y = 1|x) = s (w · x + b) [log s (w · x + b)] Let's see= if this works for our sentiment example = = s log(.70) ([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1 = s (.833) =.36 Suppose true value instead was y=0. = 0.70 By contrast, let’s pretend instead that the example in Fig. 5.2 was actually negative, i.e., y = 0 p( |x) = (perhaps theP(Y = 0|x) reviewer s (w went=on1to say ·x+ “But b) line, the movie is bottom terrible! I beg you not to see it!”). In this case our model is confused and we’d want = 0.30 the loss to be higher. Now if we plug y = 0 and 1 s (w · x + b) =.31 from Eq. 5.7 intoWhat's theleftloss? Eq. 5.12, the side of the equation drops out: Logistic regression is commonly applied to all sorts of NLP tasks, LCEof (ŷ, the y) =input can be sa(w [y log feature. Consider · x + b)+(1 s (w ·of the task y) log (1 x +period b))] disambig if a period = is the end of a sentence [log (1or spart (w · xof a word, by classif + b))] into one= of two classes EOS (end-of-sentence) log (.30) and not-EOS. We m like x=1 below expressing that 1.2 the current word is lower case and (perhaps with a positive weight), or that the current word is in o Sure enough, the loss for the first classifier (.37) is less than the loss for the second 8 C HAPTER 5 LCE (ŷ,Ly) = [y log s (w · x + b) + (1 OGISTIC R EGRESSION y) log (1 s (w · x + b))] Let's see if= this works for [log s (w · x + b)] our sentiment example side of the=equation drops out, log(.70) leading to the following loss (we’ll use log to mean The loss when model was right (if true y=1) natural log=when the base is not specified):.36 LCE (pretend By contrast, let’s ŷ, y) = instead [y that s (wexample logthe · x + b) +in(1Fig.y)5.2 logwas s (w · x +negative, (1 actually b))] i.e., y = 0 (perhaps the = reviewer[log wents (won· x to + b)] say “But bottom line, the movie is terrible! I beg you not to = see it!”). In this case log(.70)our model is confused and we’d want the loss to be higher. Now = if we plug y = 0 and.361 s (w · x + b) =.31 from Eq. 5.7 Is lower than the loss when model into Eq. 5.12, the left side of the equation drops out: was wrong (if true y=0): By contrast, let’s pretend instead that the example in Fig. 5.2 was actually negative, LCE (yŷ,=y)0=(perhaps the i.e., [y log s (w · xwent reviewer + b)+(1 y) log on to say “But s (w ·line, (1 bottom x + b))] the movie is terrible! I beg = you not to see it!”). In this case[log our (1 model is confused s (w · x + b))]and we’d want the loss to be higher. Now if we plug y = 0 and 1 s (w · x + b) =.31 from Eq. 5.7 into Eq. 5.12, logout: = the left side of the equation drops (.30) = 1.2 LCE (ŷ, y) = [y log s (w · x + b)+(1 y) log (1 s (w · x + b))] Sure enough, the loss for = the first classifier (.37) is less than (1 the loss s (w · x for the second + b))] Sure enough, classifier (1.17). loss = was bigger when model [log log (.30) was wrong! Cross-Entropy Loss Logistic Regression Stochastic Gradient Descent Logistic Regression th gradient descent is to find the optimal weights: minim Our goal: minimize the loss ve defined for the model. In Eq. 5.13 below, we’ll explici the loss Let'sfunction L is that make explicit parameterized by the the loss function weights, whic is parameterized by weights e learning 𝛳=(w,b)as q (in the case of logistic regressio in general s to find theAndset of represent we’ll weights which minimizes 𝑦! as f (x; θ ) to makethetheloss functi mples: dependence on θ more obvious We want the weights that minimize the loss, averaged over all examples: m X 1 q̂ = argmin LCE ( f (x ; q ), y ) (i) (i) q m i=1 Intuition of gradient descent How do I get to the bottom of this river canyon? Look around me 360∘ Find the direction of steepest slope down x Go that way Our goal: minimize the loss For logistic regression, loss function is convex A convex function has just one minimum Gradient descent starting from any point is guaranteed to find the minimum (Loss for neural networks is non-convex) Let's first visualize for a single scalar w Q: Given current w, should we make it bigger or smaller? A: Move w in the reverse direction from the slope of the function Loss Should we move right or left from here? w1 wmin w 0 (goal) Let's first visualize for a single scalar w Q: Given current w, should we make it bigger or smaller? A: Move w in the reverse direction from the slope of the function Loss slope of loss at w1 is negative So we'll move positive w1 wmin w 0 (goal) Let's first visualize for a single scalar w Q: Given current w, should we make it bigger or smaller? A: Move w in the reverse direction from the slope of the function Loss one step of gradient slope of loss at w1 descent is negative So we'll move positive w1 wmin w 0 (goal) Gradients The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in a function. Gradient Descent: Find the gradient of the loss function at the current point and move in the opposite direction. How much do we move in that direction ? The value GISTIC R EGRESSION of the gradient (slope in our example) ! !" 𝐿(𝑓 𝑥; 𝑤 , 𝑦) weighted by a learning rate η Higher learning rate means move w faster t+1 t d w =w h L( f (x; w), y) dw ’s extend the intuition from a function of one scalar variable Now let's consider N dimensions We want to know where in the N-dimensional space (of the N parameters that make up θ ) we should move. The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of the N dimensions. of a 2-dimensional gradient vector taken at the red point. Imagine 2 dimensions, w and b Cost(w,b) Visualizing the gradient vector at the red point It has two dimensions shown in the x-y plane b w Real gradients Are much longer; lots and lots of weights For each dimension wi the gradient component i tells us the slope with respect to that variable. ◦ “How much would a small change in wi influence the total loss function L?” ◦ We express the slope as a partial derivative ∂ of the loss ∂wi The gradient is then defined as a vector of these partials. ach dimension wi , we express the slope as a partial derivative ∂ wi of th n. The gradient is then defined as a vector of these partials. We’ll repre The gradient q ) to make the dependence on q more 2 ∂obvious: 3 ∂ w L( f (x; q ), y) We’ll represent 𝑦! as f (x; θ ) to make 6 ∂the 1 dependence on θ more 7 obvious: 6 2 ∂ w L( f (x; q ), y) 37 6 —q L( f (x; q ), y)) = 6 ∂ w L( f (x; ∂ 2 q ), y) 77 1.. 4 6 ∂. 75 6 ∂ w∂ 2 L( f (x; q ), y)7 —q L( f (x; q ), y)) = 6 6 ∂ wn L( f..(x; q ), y) 7 7 4. 5 ∂ nal equation for updating q based on ∂ wnthe (x; q ), y)is thus L( fgradient The final equation for updating θ based on the gradient is thus final equation for updating q based on the gradient is thus qt+1 = qt h—L( f (x; q ), y) Gradient for Logistic Regression 4.1 What The are Gradient for Logistic these partial Regression derivatives for logistic regression? te q , we need a definition for the gradient —L( f (x; q ), y). Re order to update q , we need a definition for the gradient —L( f (x; q ), y). Recal ession, the cross-entropy loss function is: logistic regression, the cross-entropy loss function is: The loss function y) =LCE (ŷ,[yy)log=s (w ·x+ [y log b)· x++(1 s (w y) log b) + (1 (1(1 ss(w y) log (w ·· xx+ b))] +b))] It turns that theoutderivative that the derivative of thisoffunction this function forfor oneobservation one observation vecto vec. 5.18The (theelegant derivative interested reader of canthis seefunction Section(see textbook 5.8 for 5.8 for derivation) the derivation of this equa erested reader can see Section 5.8 for the derivation of this eq ∂ LCE (ŷ, y) ∂ LCE (ŷ, y) ∂ w = [s (w · x + b) y]x j =j [s (w · x + b) y]x j ∂ w j the gradient with respect to a single weight w j represe Note in Eq. 5.18 that y intuitive 5.18 that value: the difference the gradient with between respectthe totrue y and our a single estimated weight ŷ = w j repr function S TOCHASTIC G RADIENT D ESCENT(L(), f (), x, y) returns q # where: L is the loss function # f is a function parameterized by q # x is the set of training inputs x(1) , x(2) ,..., x(m) # y is the set of training outputs (labels) y(1) , y(2) ,..., y(m) q 0 repeat til done # see caption For each training tuple (x(i) , y(i) ) (in random order) 1. Optional (for reporting): # How are we doing on this tuple? Compute ŷ (i) = f (x(i) ; q ) # What is our estimated output ŷ? Compute the loss L(ŷ (i) , y(i) ) # How far off is ŷ(i) ) from the true output y(i) ? 2. g —q L( f (x(i) ; q ), y(i) ) # How should we move q to maximize loss? 3. q q hg # Go the other way instead return q Figure 5.5 The stochastic gradient descent algorithm. Step 1 (computing the loss) is used Hyperparameters The learning rate η is a hyperparameter ◦ too high: the learner will take big steps and overshoot ◦ too low: the learner will take too long Hyperparameters: Briefly, a special kind of parameter for an ML model Instead of being learned by algorithm from supervision (like regular parameters), they are chosen by algorithm designer. Stochastic Gradient Descent Logistic Regression Stochastic Gradient Descent: An example and more details Logistic Regression Working through an example One step of gradient descent A mini-sentiment example, where the true y=1 (positive) Two features: x1 = 3 (count of positive lexicon words) x2 = 2 (count of negative lexicon words) Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero: w1 = w2 = b = 0 η = 0.1 5.4.1 The Gradient for Logistic Regression Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning Example of gradient descent al equation for updating Inrate order updateqq ,based h isto0.1: we needon the gradient a definition is thus for the gradient for logistic regression, the cross-entropy loss function w —L( f (x; q ), y). Re is:1 = w2 = b = 0; Update stepLfor update [yθlog CE (ŷ, y) = is:s (w · x + b) + (1 w1 = w2 = b = 0 x1(1= 3;s (w y) log x2· x=+2 b))] h = 0.1 Itqturns out = q that the h—L( derivative f (x; q of this ),function y) for one observation ve t requires that we compute the gradient, multiplied by the t+1 update step The single Eq. 5.18 (the learning rate interested reader can see Section 5.8 for the derivation of this eq ∂ LCEt(ŷ, y) where q t+1 = q h—q L( ∂wj ; q ·),xy+ [s(i)(w = f (x (i) b) ) y]x j InNote our mini example in Eq. 5.18there that are thethree parameters, gradient so the gradient with respect vectorweight to a single has 3 dimen- w j rep Gradient vector has 3 dimensions: sions, for w1 , w2 , and b. We can compute the first gradient as follows: very intuitive value: the difference between the true y and our estimated ŷ 2 ∂ LCEx(ŷ,y) 3 2 + b) for that observation, (s (w · x + b) y)x1 multiplied 3 2 by the (s (0) 1)x1 corresponding 3 2 0.5x1 3input 2 value 1.5 3 x j. ∂ w1 6 ∂ LCE (ŷ,y) 7 —w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 5.4.2 ∂ LCE (ŷ,y) ∂b The Stochastic Gradient Descent Algorithm s (w · x + b) y s (0) 1 0.5 0.5 Stochastic gradient descent is an online algorithm that minimizes 1 the loss Now that we have a gradient, we compute the new parameter vector q by moving 5.4.1 The Gradient for Logistic Regression Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning Example of gradient descent al equation for updating Inrate order updateqq ,based h isto0.1: we needon the gradient a definition is thus for the gradient for logistic regression, the cross-entropy loss function w —L( f (x; q ), y). Re is:1 = w2 = b = 0; Update stepLfor update [yθlog CE (ŷ, y) = is:s (w · x + b) + (1 w1 = w2 = b = 0 x1(1= 3;s (w y) log x2· x=+2 b))] h = 0.1 Itqturns out = q that the h—L( derivative f (x; q of this ),function y) for one observation ve t requires that we compute the gradient, multiplied by the t+1 update step The single Eq. 5.18 (the learning rate interested reader can see Section 5.8 for the derivation of this eq ∂ LCEt(ŷ, y) where q t+1 = q h—q L( ∂wj ; q ·),xy+ [s(i)(w = f (x (i) b) ) y]x j InNote our mini example in Eq. 5.18there that are thethree parameters, gradient so the gradient with respect vectorweight to a single has 3 dimen- w j rep Gradient vector has 3 dimensions: sions, for w1 , w2 , and b. We can compute the first gradient as follows: very intuitive value: the difference between the true y and our estimated ŷ 2 ∂ LCEx(ŷ,y) 3 2 + b) for that observation, (s (w · x + b) y)x1 multiplied 3 2 by the (s (0) 1)x1 corresponding 3 2 0.5x1 3input 2 value 1.5 3 x j. ∂ w1 6 ∂ LCE (ŷ,y) 7 —w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 5.4.2 ∂ LCE (ŷ,y) ∂b The Stochastic Gradient Descent Algorithm s (w · x + b) y s (0) 1 0.5 0.5 Stochastic gradient descent is an online algorithm that minimizes 1 the loss Now that we have a gradient, we compute the new parameter vector q by moving 5.4.1 The Gradient for Logistic Regression Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning Example of gradient descent al equation for updating Inrate order updateqq ,based h isto0.1: we needon the gradient a definition is thus for the gradient for logistic regression, the cross-entropy loss function w —L( f (x; q ), y). Re is:1 = w2 = b = 0; Update stepLfor update [yθlog CE (ŷ, y) = is:s (w · x + b) + (1 w1 = w2 = b = 0 x1(1= 3;s (w y) log x2· x=+2 b))] h = 0.1 Itqturns out = q that the h—L( derivative f (x; q of this ),function y) for one observation ve t requires that we compute the gradient, multiplied by the t+1 update step The single Eq. 5.18 (the learning rate interested reader can see Section 5.8 for the derivation of this eq ∂ LCEt(ŷ, y) where q t+1 = q h—q L( ∂wj ; q ·),xy+ [s(i)(w = f (x (i) b) ) y]x j InNote our mini example in Eq. 5.18there that are thethree parameters, gradient so the gradient with respect vectorweight to a single has 3 dimen- w j rep Gradient vector has 3 dimensions: sions, for w1 , w2 , and b. We can compute the first gradient as follows: very intuitive value: the difference between the true y and our estimated ŷ 2 ∂ LCEx(ŷ,y) 3 2 + b) for that observation, (s (w · x + b) y)x1 multiplied 3 2 by the (s (0) 1)x1 corresponding 3 2 0.5x1 3input 2 value 1.5 3 x j. ∂ w1 6 ∂ LCE (ŷ,y) 7 —w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 5.4.2 ∂ LCE (ŷ,y) ∂b The Stochastic Gradient Descent Algorithm s (w · x + b) y s (0) 1 0.5 0.5 Stochastic gradient descent is an online algorithm that minimizes 1 the loss Now that we have a gradient, we compute the new parameter vector q by moving 5.4.1 The Gradient for Logistic Regression Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning Example of gradient descent al equation for updating Inrate order updateqq ,based h isto0.1: we needon the gradient a definition is thus for the gradient for logistic regression, the cross-entropy loss function w —L( f (x; q ), y). Re is:1 = w2 = b = 0; Update stepLfor update [yθlog CE (ŷ, y) = is:s (w · x + b) + (1 w1 = w2 = b = 0 x1(1= 3;s (w y) log x2· x=+2 b))] h = 0.1 Itqturns out = q that the h—L( derivative f (x; q of this ),function y) for one observation ve t requires that we compute the gradient, multiplied by the t+1 update step The single Eq. 5.18 (the learning rate interested reader can see Section 5.8 for the derivation of this eq ∂ LCEt(ŷ, y) where q t+1 = q h—q L( ∂wj ; q ·),xy+ [s(i)(w = f (x (i) b) ) y]x j InNote our mini example in Eq. 5.18there that are thethree parameters, gradient so the gradient with respect vectorweight to a single has 3 dimen- w j rep Gradient vector has 3 dimensions: sions, for w1 , w2 , and b. We can compute the first gradient as follows: very intuitive value: the difference between the true y and our estimated ŷ 2 ∂ LCEx(ŷ,y) 3 2 + b) for that observation, (s (w · x + b) y)x1 multiplied 3 2 by the (s (0) 1)x1 corresponding 3 2 0.5x1 3input 2 value 1.5 3 x j. ∂ w1 6 ∂ LCE (ŷ,y) 7 —w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 5.4.2 ∂ LCE (ŷ,y) ∂b The Stochastic Gradient Descent Algorithm s (w · x + b) y s (0) 1 0.5 0.5 Stochastic gradient descent is an online algorithm that minimizes 1 the loss Now that we have a gradient, we compute the new parameter vector q by moving 5.4.1 The Gradient for Logistic Regression Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning Example of gradient descent al equation for updating Inrate order updateqq ,based h isto0.1: we needon the gradient a definition is thus for the gradient for logistic regression, the cross-entropy loss function w —L( f (x; q ), y). Re is:1 = w2 = b = 0; Update stepLfor update [yθlog CE (ŷ, y) = is:s (w · x + b) + (1 w1 = w2 = b = 0 x1(1= 3;s (w y) log x2· x=+2 b))] h = 0.1 Itqturns out = q that the h—L( derivative f (x; q of this ),function y) for one observation ve t requires that we compute the gradient, multiplied by the t+1 update step The single Eq. 5.18 (the learning rate interested reader can see Section 5.8 for the derivation of this eq ∂ LCEt(ŷ, y) where q t+1 = q h—q L( ∂wj ; q ·),xy+ [s(i)(w = f (x (i) b) ) y]x j InNote our mini example in Eq. 5.18there that are thethree parameters, gradient so the gradient with respect vectorweight to a single has 3 dimen- w j rep Gradient vector has 3 dimensions: sions, for w1 , w2 , and b. We can compute the first gradient as follows: very intuitive value: the difference between the true y and our estimated ŷ 2 ∂ LCEx(ŷ,y) 3 2 + b) for that observation, (s (w · x + b) y)x1 multiplied 3 2 by the (s (0) 1)x1 corresponding 3 2 0.5x1 3input 2 value 1.5 3 x j. ∂ w1 6 ∂ LCE (ŷ,y) 7 —w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 5.4.2 ∂ LCE (ŷ,y) ∂b The Stochastic Gradient Descent Algorithm s (w · x + b) y s (0) 1 0.5 0.5 Stochastic gradient descent is an online algorithm that minimizes 1 the loss Now that we have a gradient, we compute the new parameter vector q by moving 6 ∂ w L( f (x; q ), y) 7 —q L( f (x; qthere 6 In our mini example y))arefor= ),sions, 7 2there are three parameters, so the gradient vector has 3 dimen- (5.15) r mini Example example for w1 , w22, and b. We of can w1gradient three ,6 4 parameters, compute the.. w2 , and b. We can. descent so first the gradient gradient vector has 3 dimen- 7 first gradient as follows: compute the 5 as follows: ∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3 CE ∂ b) y)x1 ∂w 1 (s (w · x + 2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) CE L( f (x; q (sy) ), n y)x 5 = 43(s (0) (0) 1)x1 2 1)x 5 =34 0.5x 0.5x1 2 5 = 43 1.0 5 1.5 w,b 4 5 2 2 2 (s (w · x + b) y)x1 s (w · x +(s ∂ w2 ∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1 b)(0) 0.5 1.5 0.5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 ∂b on for updating q based on the gradient is thus s (wthat Now we Now · x + b) have y thata we havesa (0) gradient, we1compute gradient, we compute the new 0.5 the new parameter parameter 0.5q 1vector vector by moving q 0 in the opposite direction from the gradient: θ1 by moving θ0 in the opposite direction from the gradient: 2 3 2 3 2 3 that we have a gradient, we compute the wnew 1 parameter 1.5.15 q 1 by moving vector q the opposite = q h—L( f q t+1directiont from the gradient: 1 (x; = 4qw),2 5y) h 4 1.0 5 =η = 4 0.1;.1 5 (5.16) b 0.5.05 2 3 2 3 2 3 So afterwone 1.5 descent,.15 1 step of gradient the weights have shifted to be: w1 =.15, 1w2 =4.1, and5b =.05. q = w2 h 4 1.0 5 = 4.1 5 Note that this observation x happened to be a positive example. We would expect b 0.5.05 that after seeing more negative examples with high counts of negative words, that the weight w2 would shift to have a negative value. ter one step of gradient descent, the weights have shifted to be: w =.15, 6 ∂ w L( f (x; q ), y) 7 —q L( f (x; qthere 6 In our mini example y))arefor= ),sions, 7 2there are three parameters, so the gradient vector has 3 dimen- (5.15) r mini Example example for w1 , w22, and b. We of can w1gradient three ,6 4 parameters, compute the.. w2 , and b. We can. descent so first the gradient gradient vector has 3 dimen- 7 first gradient as follows: compute the 5 as follows: ∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3 CE ∂ b) y)x1 ∂w 1 (s (w · x + 2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) CE L( f (x; q (sy) ), n y)x 5 = 43(s (0) (0) 1)x1 2 1)x 5 =34 0.5x 0.5x1 2 5 = 43 1.0 5 1.5 w,b 4 5 2 2 2 (s (w · x + b) y)x1 s (w · x +(s ∂ w2 ∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1 b)(0) 0.5 1.5 0.5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 ∂b on for updating q based on the gradient is thus s (wthat Now we Now · x + b) have y thata we havesa (0) gradient, we1compute gradient, we compute the new 0.5 the new parameter parameter 0.5q 1vector vector by moving q 0 in the opposite direction from the gradient: θ1 by moving θ0 in the opposite direction from the gradient: 2 3 2 3 2 3 that we have a gradient, we compute the wnew 1 parameter 1.5.15 q 1 by moving vector q the opposite = q h—L( f q t+1directiont from the gradient: 1 (x; = 4qw),2 5y) h 4 1.0 5 =η = 4 0.1;.1 5 (5.16) b 0.5.05 2 3 2 3 2 3 So afterwone 1.5 descent,.15 1 step of gradient the weights have shifted to be: w1 =.15, 1w2 =4.1, and5b =.05. q = w2 h 4 1.0 5 = 4.1 5 Note that this observation x happened to be a positive example. We would expect b 0.5.05 that after seeing more negative examples with high counts of negative words, that the weight w2 would shift to have a negative value. ter one step of gradient descent, the weights have shifted to be: w =.15, 6 ∂ w L( f (x; q ), y) 7 —q L( f (x; qthere 6 In our mini example y))arefor= ),sions, 7 2there are three parameters, so the gradient vector has 3 dimen- (5.15) r mini Example example for w1 , w22, and b. We of can w1gradient three ,6 4 parameters, compute the.. w2 , and b. We can. descent so first the gradient gradient vector has 3 dimen- 7 first gradient as follows: compute the 5 as follows: ∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3 CE ∂ b) y)x1 ∂w 1 (s (w · x + 2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) CE L( f (x; q (sy) ), n y)x 5 = 43(s (0) (0) 1)x1 2 1)x 5 =34 0.5x 0.5x1 2 5 = 43 1.0 5 1.5 w,b 4 5 2 2 2 (s (w · x + b) y)x1 s (w · x +(s ∂ w2 ∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1 b)(0) 0.5 1.5 0.5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 ∂b on for updating q based on the gradient is thus s (wthat Now we Now · x + b) have y thata we havesa (0) gradient, we1compute gradient, we compute the new 0.5 the new parameter parameter 0.5q 1vector vector by moving q 0 in the opposite direction from the gradient: θ1 by moving θ0 in the opposite direction from the gradient: 2 3 2 3 2 3 that we have a gradient, we compute the wnew 1 parameter 1.5.15 q 1 by moving vector q the opposite = q h—L( f q t+1directiont from the gradient: 1 (x; = 4qw),2 5y) h 4 1.0 5 =η = 4 0.1;.1 5 (5.16) b 0.5.05 2 3 2 3 2 3 So afterwone 1.5 descent,.15 1 step of gradient the weights have shifted to be: w1 =.15, 1w2 =4.1, and5b =.05. q = w2 h 4 1.0 5 = 4.1 5 Note that this observation x happened to be a positive example. We would expect b 0.5.05 that after seeing more negative examples with high counts of negative words, that the weight w2 would shift to have a negative value. ter one step of gradient descent, the weights have shifted to be: w =.15, 6 ∂ w L( f (x; q ), y) 7 —q L( f (x; qthere 6 In our mini example y))arefor= ),sions, 7 2there are three parameters, so the gradient vector has 3 dimen- (5.15) r mini Example example for w1 , w22, and b. We of can w1gradient three ,6 4 parameters, compute the.. w2 , and b. We can. descent so first the gradient gradient vector has 3 dimen- 7 first gradient as follows: compute the 5 as follows: ∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3 CE ∂ b) y)x1 ∂w 1 (s (w · x + 2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb) CE L( f (x; q (sy) ), n y)x 5 = 43(s (0) (0) 1)x1 2 1)x 5 =34 0.5x 0.5x1 2 5 = 43 1.0 5 1.5 w,b 4 5 2 2 2 (s (w · x + b) y)x1 s (w · x +(s ∂ w2 ∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1 b)(0) 0.5 1.5 0.5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5 ∂b on for updating q based on the gradient is thus s (wthat Now we Now · x + b) have y thata we havesa (0) gradient, we1compute gradient, we compute the new 0.5 the new parameter parameter 0.5q 1vector vector by moving q 0 in the opposite direction from the gradient: θ1 by moving θ0 in the opposite direction from the gradient: 2 3 2 3 2 3 that we have a gradient, we compute the wnew 1 parameter 1.5.15 q 1 by moving vector q the opposite = q h—L( f q t+1directiont from the gradient: 1