INF-lecture2-binary-classification-after-class PDF
Document Details
Author
Xiaoxiao Miao
Summary
This document provides lecture notes on linear regression and binary classification, covering topics such as model parameters, hyperparameters, and practical tips like feature scaling. The notes include examples, diagrams, and mathematical formulas. The lecture notes were compiled by Xiaoxiao Miao.
Full Transcript
Lab Arrangement (updated TA)
- P5: Tuesday 11am-1pm, E2-07-13 - AAI Y2 & ID students (106)
- P1: Tuesday 2pm-4pm, E2-07-13 - SE Y2/Y3 (107)
- P4: Tuesday 4pm-6pm, E2-07-13 - BAC (61) + DSC (67)
- P3: Thursday 9am-11am, E2-06-18 - IS (118)
- P2: Thursday 11am-1pm, E2-07-13 - IS (31) + AAI (86)
- Combined lab, P1-P5: Wednesday 9am-11am, online (W12, W13)
Teaching assistants and instructors: Rishabh Ranjan, Nabil Zafran Bin Zainuddin, Zha Wei, Ashsyahid Bin Hussin, Donny Soh, Sebastian Nuguid Fernandez, Tony (Yu Haotong), Ashwin Sekhar C S, Junhua Liu, Chng Song Heng Aloysius, Xiaoxiao Miao, Ridwan Arefeen, Lucas Vinh Tran. In-person sessions cover W1-W6 and W8-W11; note that there are no lab sessions in W4, W6, and W11.

Please do the following by Sunday, 19 January:
- Grouping
- Post your questions at the group project discussion forum
- Lab 1 assignment
Instructors have provided some example projects; you are welcome to ask them questions if you are interested. The lab of this week is a practice lab, so no submission is required.

Linear regression and practical tips; Binary classification
Lecture 2, Xiaoxiao Miao, January 2025

Two Common Supervised Learning Tasks
- Regression: predict a number ("how much?" or "how many?" questions); infinitely many possible outputs. Examples: price prediction, sales prediction, demand forecasting. Models: linear regression, neural network, decision tree, random forest, AdaBoost, SVM, KNN.
- Classification: predict categories ("which category?" questions); small number of possible outputs. Models: logistic regression, neural network, decision tree, random forest, AdaBoost, SVM, Naive Bayes.
(The week labels W1-W5 on the slide indicate when each model is covered in the course.)

Linear regression - model
Linear regression: independent variable(s) and a dependent variable are modeled in a linear way.
- Simple linear regression: linear regression with one independent variable. Example: predict an exam score from hours of study, f_{w,b}(x) = wx + b. Learnable parameters: w (weight) and b (bias).
- Multiple linear regression: linear regression with two or more independent variables. Example: predict an exam score from hours of study (x_1), exam length (x_2), and class attendance (x_3): f_{w,b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b, e.g. f_{w,b}(x) = 4x_1 - 2x_2 + 4x_3 + 40, where b = 40 can be taken as a base score in this case study.
(Pipeline: Data -> Model -> Loss function -> Optimization algorithm)

Linear regression - model (multiple linear regression with K independent variables)
    \hat{y} = f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b = w_1 x_1 + w_2 x_2 + ... + w_K x_K + b = \sum_{k=1}^{K} w_k x_k + b    (dot product)
where \vec{x} = [x_1, x_2, ..., x_K] is the input feature vector, \vec{w} = [w_1, w_2, ..., w_K] are the weights, and b (a number) is the bias. The learnable parameters of the model are \vec{w} and b.
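As a quick illustration of the dot-product form above, here is a minimal NumPy sketch using the weights from the exam-score example (w = [4, -2, 4], b = 40); the feature values are made up for illustration:

```python
import numpy as np

# Weights and bias from the exam-score example: f(x) = 4*x1 - 2*x2 + 4*x3 + 40
w = np.array([4.0, -2.0, 4.0])
b = 40.0

# One input vector [hours of study, exam length, class attendance] (illustrative values)
x = np.array([5.0, 2.0, 8.0])

# Multiple linear regression prediction: y_hat = w . x + b = sum_k w_k * x_k + b
y_hat = np.dot(w, x) + b
print(y_hat)  # 4*5 - 2*2 + 4*8 + 40 = 88.0
```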
Linear regression - loss function
Given N training examples (x^n, y^n), n from 1 to N, assume the linear model \hat{y}^n = f_{w,b}(x^n) = w x^n + b.
Objective: adjust w and b to make the linear model predict \hat{y} as closely as possible to the true label y for all training examples.
    Mean Square Error (MSE): L(w, b) = \frac{1}{2N} \sum_{n=1}^{N} (\hat{y}^n - y^n)^2
Objective: adjust w and b to minimize the loss L(w, b). Note that f_{w,b}(x) is a function of x, while L(w, b) is a function of w and b (its surface can be viewed from the front or from above).
Questions: In this case, can we find w, b to make L(w, b) = 0? Is it possible to find different sets of (w, b) that give the same L(w, b)?

Gradient descent algorithm
Initialization: randomly choose w, b.
1. Compute the loss over the training dataset, L(w, b).
2. Compute its derivatives with respect to w and b: \partial L / \partial w and \partial L / \partial b. The gradient of L (the derivative of L with respect to multiple variables) is \nabla L = [\partial L / \partial w, \partial L / \partial b].
3. Simultaneously update the current values of w and b in the direction of the negative gradient, multiplied by a learning rate \eta:
    temp_w = w - \eta \partial L / \partial w
    temp_b = b - \eta \partial L / \partial b
    w = temp_w
    b = temp_b
4. Iterate steps 1 to 3 until convergence.

Simple linear regression - gradient descent algorithm
Linear regression model: f_{w,b}(x) = wx + b
MSE loss function: L(w, b) = \frac{1}{2N} \sum_{n=1}^{N} (f_{w,b}(x^n) - y^n)^2
Repeat until convergence {
    w = w - \eta \frac{\partial}{\partial w} L(w, b),  where \frac{\partial}{\partial w} L(w, b) = \frac{2}{2N} \sum_{n=1}^{N} (w x^n + b - y^n) \cdot x^n
    b = b - \eta \frac{\partial}{\partial b} L(w, b),  where \frac{\partial}{\partial b} L(w, b) = \frac{2}{2N} \sum_{n=1}^{N} (w x^n + b - y^n) \cdot 1
}

(Optional) Simple linear regression - derivatives for linear regression
One sample: L(w, b) = (wx + b - y)^2. Let h = wx + b - y, then
    \partial L / \partial w = (\partial L / \partial h)(\partial h / \partial w) = 2hx = 2(wx + b - y)x
    \partial L / \partial b = (\partial L / \partial h)(\partial h / \partial b) = 2h \cdot 1 = 2(wx + b - y)
N samples: L(w, b) = \frac{1}{2N} \sum_{n=1}^{N} (w x^n + b - y^n)^2, so
    \partial L / \partial w = \frac{2}{2N} \sum_{n=1}^{N} (w x^n + b - y^n) \cdot x^n
    \partial L / \partial b = \frac{2}{2N} \sum_{n=1}^{N} (w x^n + b - y^n) \cdot 1

General loss function
- A nonconvex loss surface can lead to a local minimum: gradient descent stops when a local minimum of the loss surface is reached, so GD does not guarantee reaching a global minimum.
- However, empirical evidence suggests that GD works well.

Make sure gradient descent is working correctly
- Objective: \min_{\vec{w},b} L(\vec{w}, b). When should we stop learning/training?
- The loss L(\vec{w}, b) should decrease gradually if GD works properly (the learning curve plots the loss against the number of iterations/epochs and flattens at convergence).
- Stop training if the loss L(\vec{w}, b) does not decrease for several iterations.
- The number of iterations needed varies across tasks (it is a hyperparameter).

Parameters and Hyperparameters
Two types of parameters exist in machine learning models:
- Model parameters: can be initialized and updated through the data learning process, e.g. w and b.
- Hyperparameters: cannot be directly estimated from data learning and must be set before training an ML model, e.g. the number of iterations and the learning rate.
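To make the gradient-descent updates for simple linear regression concrete, and to show where the learning-rate and iteration-count hyperparameters enter, here is a minimal NumPy sketch; the synthetic data and hyperparameter values are illustrative, not from the lecture:

```python
import numpy as np

# Small synthetic dataset roughly following y = 2x + 1 (illustrative, not from the lecture)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
N = len(x)

w, b = 0.0, 0.0   # initialization
eta = 0.01        # learning rate (hyperparameter)

for it in range(5000):                         # number of iterations (hyperparameter)
    y_hat = w * x + b                          # model: f_{w,b}(x) = wx + b
    loss = np.sum((y_hat - y) ** 2) / (2 * N)  # MSE loss L(w, b)
    dL_dw = np.sum((y_hat - y) * x) / N        # dL/dw = (2/2N) * sum (w x^n + b - y^n) * x^n
    dL_db = np.sum(y_hat - y) / N              # dL/db = (2/2N) * sum (w x^n + b - y^n)
    w, b = w - eta * dL_dw, b - eta * dL_db    # simultaneous update

print(w, b, loss)  # w and b should end up close to 2 and 1
```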
Check your understanding
Gradient descent is an algorithm for finding values of the parameters w and b that minimize the loss function L. Repeat until convergence:
    w = w - \eta \partial L / \partial w
    b = b - \eta \partial L / \partial b
When \partial L / \partial w is a negative number (less than zero), what happens to w after one update step?
A. w stays the same
B. It is not possible to tell if w will increase or decrease
C. w decreases
D. w increases

Check your understanding
For linear regression, the model is f_{w,b}(x) = wx + b. Which are the inputs, or features, that are fed into the model and with which the model is expected to make a prediction? Which are the parameters of the model that need to be learned during the training stage?
For linear regression, if you find parameters \vec{w}, b so that the loss function L(\vec{w}, b) is very close to 0, what can you conclude?
A. The selected parameters \vec{w} and b cause the algorithm to fit the training set very well
B. The selected parameters \vec{w} and b cause the algorithm to fit the training set very badly
C. No way, there must be a bug
When you increase the learning rate, the training time will always be shortened. Y/N?
To make gradient descent converge about twice as fast, a technique that almost always works is to double the learning rate. Y/N?

Practical Tips for linear regression
- Convert inputs and output to numerical values (Lab 1 assignment).
- Hyperparameters:
  - Number of iterations (basic idea): set a large number and stop early if the loss changes only a little.
  - Learning rate \eta (basic idea): try different values from small to big, e.g. \eta = 0.001, \eta = 0.01, \eta = 0.1, ... If \eta is too small the loss decreases very slowly; if \eta is too large the loss can oscillate or diverge; a "just right" \eta makes the loss decrease steadily over the iterations.

Practical Tips for linear regression - Feature scaling (normalization)
Example: price = w_1 x_1 + w_2 x_2 + b, where x_1 is the size in feet^2 (range 300-2,000, large) and x_2 is the number of bedrooms (range 0-5, small).
Assume w_1 = 1, w_2 = 1, b = 50:
    price = 1 * 2000 + 1 * 5 + 50 = 2055
    price = 1 * 2000 + 1 * 1 + 50 = 2051
Changing the number of bedrooms barely changes the prediction: on raw features, the effect of the small-range feature is very small.

Feature scaling - mean normalization
Raw features: 300 <= x_1 <= 2000 and 0 <= x_2 <= 5, with means \mu_1 = 600 and \mu_2 = 2.3. Divide the mean-centred features by (max - min):
    x_{1,scale} = (x_1 - \mu_1) / (2000 - 300)        x_{2,scale} = (x_2 - \mu_2) / (5 - 0)
giving -0.18 <= x_{1,scale} <= 0.82 and -0.46 <= x_{2,scale} <= 0.54, so both normalized features lie roughly within [-1, 1].

Feature scaling - z-score normalization
With means \mu_1 = 600, \mu_2 = 2.3 and standard deviations \sigma_1 = 450, \sigma_2 = 1.4:
    x_{1,scale} = (x_1 - \mu_1) / \sigma_1            x_{2,scale} = (x_2 - \mu_2) / \sigma_2
giving -0.67 <= x_{1,scale} <= 3.1 and -1.6 <= x_{2,scale} <= 1.9. After z-score normalization, all features will have a mean of 0 and a standard deviation of 1.

Feature scaling (normalization) aims to bring each feature into a similar range:
- 0 <= x_1 <= 3: OK, no normalization needed
- -1 <= x_2 <= 0.5: OK, no normalization needed
- -100 <= x_3 <= 100: too large, rescale
- -0.001 <= x_4 <= 0.001: too small, rescale
Normalization is quite important in real applications; just carry it out as a preprocessing procedure.
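A minimal NumPy sketch of the two scaling schemes above, applied to a few (size, bedrooms) rows; the sample values are made up for illustration:

```python
import numpy as np

# Raw features: size in feet^2 (roughly 300-2000) and number of bedrooms (0-5); rows are illustrative
X = np.array([[2000.0, 5.0],
              [ 300.0, 0.0],
              [ 600.0, 2.0],
              [1200.0, 3.0]])

# Mean normalization: (x - mean) / (max - min), puts each feature roughly within [-1, 1]
X_mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization: (x - mean) / std, giving each feature mean 0 and standard deviation 1
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_mean_norm)
print(X_zscore)
```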
Check your understanding
What is the primary goal of feature scaling?
A. To reduce the dimensionality of the dataset
B. To make features independent of each other
C. To standardize the range of independent variables
D. To increase the size of the dataset

Binary Classification

Binary Classification - What is binary classification?
Question -> answer (output) y:
- Is this email spam? No / Yes
- Is the patient healthy? No / Yes
- Does the student pass the exam? No / Yes
y can only be one of two values (False/True, or 0/1), hence "binary classification" (class = category). The negative class does not mean "bad" and the positive class does not mean "good".

What is next
- Linear regression: model \hat{y} = f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b, 1 output node, MSE loss L = \frac{1}{2N} \sum_{n=1}^{N} (\hat{y}^n - y^n)^2.
- Binary classification (sigmoid): model \hat{y} = f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) with the sigmoid function, 1 output node, binary cross-entropy loss L = -\sum_{n=1}^{N} (y^n \log \hat{y}^n + (1 - y^n) \log(1 - \hat{y}^n)).
- Binary classification (softmax): model \hat{y} = f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) with the softmax function, 2 output nodes, cross-entropy loss L = -\sum_{n=1}^{N} y^n \log(\hat{y}^n).

Learning Objectives - Binary Classification
- A single output: apply the sigmoid on top, train with BCE (binary logistic regression: one input vector, one output through the logistic/sigmoid function).
- Two outputs: apply the softmax on top, train with CE (softmax regression: one input vector, two outputs through the softmax).
Note that, although called "regression", logistic regression actually solves a classification problem.

A Case Study: binary classification with a single output and the sigmoid on top
Classification: given an input vector x in R^D, the function correctly predicts which class y it belongs to ("which category?" questions).
Example: input \vec{x} = [x_1, x_2] (study time, class attendance), label y in {0, 1} (fail = 0, pass = 1). The model output \hat{y} = f_{\vec{w},b}(\vec{x}) is the pass probability and 1 - \hat{y} is the fail probability.

Interpretation of logistic regression output
Logistic regression uses the sigmoid/logistic function g(z) = 1 / (1 + e^{-z}). The score z = \vec{w} \cdot \vec{x} + b is unbounded and can be any number in (-\infty, \infty), while the label (pass or fail) is either 0 or 1, so the sigmoid maps the unbounded value to a probability in [0, 1]:
    f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}
- If z = \vec{w} \cdot \vec{x} + b is large, the prediction is close to 1.
- If z = \vec{w} \cdot \vec{x} + b is small, the prediction is close to 0.
- If z = \vec{w} \cdot \vec{x} + b is 0, the prediction is 0.5.
The output is the "probability" that the class is 1.
Example: x_1 is studying hours, x_2 is exam length, y is 0 (fail) or 1 (pass). f_{\vec{w},b}(\vec{x}) = 0.8 means an 80% chance that y is 1. Since P(y = 0) + P(y = 1) = 1, what is the chance that y is 0?

(Optional) Decision boundary for logistic regression
With z = \vec{w} \cdot \vec{x} + b = w_1 x_1 + w_2 x_2 + b and threshold 0.5 (\hat{y} = 1 if g(z) >= 0.5, otherwise \hat{y} = 0):
- If z = 0, the point is on the decision boundary and g(z) = 0.5.
- If z > 0, the point is on one side of the boundary and g(z) > 0.5.
- If z < 0, the point is on the other side of the boundary and g(z) < 0.5.
Example: z = \vec{w} \cdot \vec{x} + b = x_1 + x_2 - 3, so the decision boundary is the line x_1 + x_2 - 3 = 0.
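A minimal sketch of the sigmoid output and the 0.5 threshold, using the example boundary z = x_1 + x_2 - 3 from the slide; the student's feature values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Weights and bias giving the slide's decision boundary z = x1 + x2 - 3
w = np.array([1.0, 1.0])
b = -3.0

x = np.array([2.0, 2.5])          # illustrative input: [studying hours, exam length]
z = np.dot(w, x) + b              # z = 1.5 > 0: the point lies on the "pass" side of the boundary
y_hat = sigmoid(z)                # probability that y = 1 (pass), about 0.82
pred = 1 if y_hat >= 0.5 else 0   # apply the 0.5 threshold
print(z, y_hat, pred)
```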
Loss function for logistic regression
Can we use the mean square error loss in logistic regression?
    MSE: L(\vec{w}, b) = \frac{1}{2N} \sum_{n=1}^{N} (f_{\vec{w},b}(\vec{x}^n) - y^n)^2
- For linear regression, f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b and the MSE loss is convex.
- For logistic regression, f_{\vec{w},b}(\vec{x}) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}} and the MSE loss is non-convex, making it difficult to find the optimum.

Binary classification - BCE
Per-example loss:
    L(f_{\vec{w},b}(\vec{x}^n), y^n) = -\log \hat{y}^n          if y^n = 1
    L(f_{\vec{w},b}(\vec{x}^n), y^n) = -\log(1 - \hat{y}^n)     if y^n = 0
Combined: L = -y^n \log \hat{y}^n - (1 - y^n) \log(1 - \hat{y}^n)
- If the label is y^n = 1, this simplifies to L = -\log \hat{y}^n - 0.
- If the label is y^n = 0, this simplifies to L = 0 - \log(1 - \hat{y}^n).
This loss is convex. Over the entire training set, the binary cross-entropy (BCE) loss is
    L = -\sum_{n=1}^{N} (y^n \log \hat{y}^n + (1 - y^n) \log(1 - \hat{y}^n))

Binary Classification - SGD
1. Compute the loss over the training dataset; just change the loss function to BCE:
    L = -\sum_{n=1}^{N} (y^n \log \hat{y}^n + (1 - y^n) \log(1 - \hat{y}^n))
2. Compute its derivatives with respect to w and b: \partial L / \partial w and \partial L / \partial b.
3. Simultaneously update the current values of w and b in the direction of the negative gradient, multiplied by a learning rate \eta (same as before):
    temp_w = w - \eta \partial L / \partial w
    temp_b = b - \eta \partial L / \partial b
    w = temp_w
    b = temp_b
4. Iterate steps 1 to 3 until convergence. (A minimal end-to-end sketch of this procedure is given after the references below.)

What is next
The same overview as above: linear regression with MSE and binary classification with sigmoid + BCE have now been covered; binary classification with two output nodes, the softmax function, and cross-entropy is next time.

Reference
- Stanford University, Machine Learning Specialization Course
- National Taiwan University, Prof Hung-yi Lee, Machine Learning Course
- INF2008 [2023/24 T2] Course, Prof Donny Soh
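As referenced in the SGD steps above, here is a minimal end-to-end sketch of logistic regression trained with BCE and gradient descent. The synthetic pass/fail data, learning rate, and iteration count are illustrative, and the gradient expressions come from differentiating BCE through the sigmoid (a step not derived on the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic pass/fail data: columns are [studying hours, exam length] (illustrative values)
X = np.array([[0.5, 1.0], [1.0, 1.5], [2.0, 2.0], [2.5, 3.0], [3.0, 2.5]])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)   # initialization
b = 0.0
eta = 0.1         # learning rate (hyperparameter)

for it in range(1000):
    y_hat = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)  # clip to keep log() finite
    # Binary cross-entropy over the training set
    bce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Gradients of BCE with respect to w and b (obtained by differentiating through the sigmoid)
    dL_dw = X.T @ (y_hat - y)
    dL_db = np.sum(y_hat - y)
    w, b = w - eta * dL_dw, b - eta * dL_db  # simultaneous update

print(w, b, bce)  # the boundary w . x + b = 0 should separate fails (y=0) from passes (y=1)
```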