CPEN 355 Logistic Regression Lecture Notes PDF

Document Details

University of British Columbia

Mirza Sarwar

Tags

logistic regression machine learning artificial intelligence probability

Summary

These lecture notes from CPEN 355 at the University of British Columbia cover logistic regression, including likelihood and maximum likelihood estimation, the probability view of the logit, the cross-entropy error, and optimizing logistic regression with gradient descent. The transcript also includes the follow-on lectures on classifier evaluation metrics, the naïve Bayes classifier, support vector machines, and decision trees with ensemble methods.

Full Transcript


CPEN 355 – Lecture 4: Logistic Regression
Mirza Sarwar, Ph.D., Department of Electrical and Computer Engineering, University of British Columbia

Likelihood vs. probability
https://sebastianraschka.com/faq/docs/probability-vs-likelihood.html

Likelihood
Joint distribution for i.i.d. samples; the likelihood function.
https://www.medicine.mcgill.ca/epidemiology/hanley/bios601/Likelihood/Likelihood.pdf
https://web.stanford.edu/class/archive/cs/cs109/cs109.1202/lectureNotes/LN21_parameters_mle.pdf

Maximum Likelihood Estimation (MLE)
(Same references as above.)

Probability view of logit
Event A and the probability that it happens: P(A) = p, P(Aᶜ) = 1 − p. The odds of event A are defined as odds(p) = p / (1 − p). (Note: the odds of an event is the ratio of the probability of the event to the probability of its complement.) The range of the odds is (0, +∞). The logit is defined as the log-odds, with range (−∞, +∞).
https://www.stat.cmu.edu/~ryantibs/advmethods/notes/logreg.pdf

logit(·) and logit⁻¹(·)
Logistic regression models the probability of an event occurring for classification problems. The cut-off rule is a transformation that maps a linear function's range (−∞, +∞) to the range of a probability, (0, 1).
z = Xw, logit(p) = ln(p / (1 − p)) = z
−z = ln((1 − p) / p)  ⇒  e^(−z) = 1/p − 1  ⇒  p = ?
The inverse, logit⁻¹(z) = p = 1 / (1 + e^(−z)), is the sigmoid. It is used as an activation function: it tells you how much to pass through.

Derivation of logistic regression
Recall the linear regression function: f_Θ(X) = X^T Θ ∈ (−∞, +∞). Our goal is to find P(y = 1 | X), and logit⁻¹ maps (−∞, +∞) to (0, 1), so we have
P(ŷ = 1 | X) = logit⁻¹(X^T Θ) = 1 / (1 + e^(−X^T Θ)).
The equation above is the basic formulation of logistic regression.

Logistic regression for classification
Logistic regression: P(ŷ = 1 | X) = logit⁻¹(X^T Θ) = 1 / (1 + e^(−X^T Θ)) ∈ (0, 1).
Classification rule: P(ŷ = 1 | X) ≥ 0.5 ⇒ ŷ = 1, otherwise ŷ = 0.
As in linear regression, Θ is the learnable parameter.

Advantages of logistic regression
Logistic regression (LR) outputs exactly the probability that y equals the true label, conditioned on the input X. The prediction changes sharply around the decision boundary (z = 0), which helps distinguish boundary samples. LR is more robust to outliers.

Logistic regression is a (generalized) linear classifier
Sometimes we denote logit⁻¹(·) as sigmoid(·) or σ(·). Since σ(X^T Θ) ≥ 0.5 iff X^T Θ ≥ 0:
1. X^T Θ ≥ 0 ⇒ ŷ = 1
2. X^T Θ < 0 ⇒ ŷ = 0
The decision boundary is the hyperplane X^T Θ = 0.

Optimizing logistic regression
How do we search for the optimal Θ using the training data so as to maximize classification performance? Recall the logistic regression model:
P(ŷ = 1 | X) = σ(X^T Θ) = 1 / (1 + e^(−X^T Θ)) ∈ (0, 1).
We are given training data {(x₁^(1), …, x_d^(1), y^(1)), …, (x₁^(n), …, x_d^(n), y^(n))}, where y^(i) ∈ {0, 1} is the corresponding training label.

Supervision criteria
For each sample X^(i) = (x₁^(i), …, x_d^(i)) we have
P(ŷ^(i) = 1 | X^(i)) = σ(X^(i)ᵀ Θ) = 1 / (1 + e^(−X^(i)ᵀ Θ)) ∈ (0, 1),
which is a function of Θ. Supervision criteria: if y^(i) = 1, we want P(ŷ^(i) = 1 | X^(i)) to be as close to 1 as possible; if y^(i) = 0, we want P(ŷ^(i) = 0 | X^(i)) = 1 − P(ŷ^(i) = 1 | X^(i)) to be as close to 1 as possible.
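To make the model above concrete before the notes formalize the training objective, here is a minimal sketch of the logistic regression forward pass and the 0.5 cut-off rule in Python (NumPy assumed; the array values and the parameter vector are illustrative, not from the lecture).

```python
import numpy as np

def sigmoid(z):
    # logit^{-1}(z) = 1 / (1 + e^{-z}), maps (-inf, +inf) to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    # P(y_hat = 1 | X) = sigmoid(X^T Theta) for each row of X
    return sigmoid(X @ theta)

def predict_label(X, theta, threshold=0.5):
    # Classification rule: y_hat = 1 if P(y_hat = 1 | X) >= 0.5, else 0
    return (predict_proba(X, theta) >= threshold).astype(int)

# Illustrative example (values are made up):
X = np.array([[1.0, 2.0], [1.0, -3.0]])   # two samples, two features
theta = np.array([0.5, 1.0])              # hypothetical parameters
print(predict_proba(X, theta))            # probabilities in (0, 1)
print(predict_label(X, theta))            # 0/1 predictions
```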
Supervision criteria (continued)
We can write the probability that the label is correctly predicted as
p_i := P(ŷ^(i) = 1 | X^(i))^(y^(i)) · (1 − P(ŷ^(i) = 1 | X^(i)))^(1 − y^(i)).
Namely, when y^(i) = 0 ⇒ p_i = P(ŷ^(i) = 0 | X^(i)), and when y^(i) = 1 ⇒ p_i = P(ŷ^(i) = 1 | X^(i)). So no matter which category the sample belongs to, p_i represents the probability that the prediction ŷ^(i) matches the true label y^(i).

Supervision criteria: the likelihood
Next, we consider the joint probability of all the training samples and maximize it to learn Θ. Since the samples are independent of each other, the joint probability can be written as
p(Θ) = P(ŷ^(1) = y^(1), …, ŷ^(n) = y^(n) | X^(1), …, X^(n))
     = P(ŷ^(1) = y^(1) | X^(1)) ⋯ P(ŷ^(n) = y^(n) | X^(n))
     = ∏_{i=1}^{n} P(ŷ^(i) = 1 | X^(i))^(y^(i)) (1 − P(ŷ^(i) = 1 | X^(i)))^(1 − y^(i))
     = ∏_{i=1}^{n} p_i.
p(Θ) is also known as the likelihood function with respect to Θ. The method that estimates the parameters of a probabilistic model by maximizing a likelihood function is called maximum likelihood estimation (MLE).

Solving MLE for logistic regression
Maximizing p(Θ) is equivalent to minimizing −ln p(Θ):
−ln p(Θ) = −ln ∏_{i=1}^{n} p_i = −∑_{i=1}^{n} ln p_i
         = −∑_{i=1}^{n} ln [ P(ŷ^(i) = 1 | X^(i))^(y^(i)) (1 − P(ŷ^(i) = 1 | X^(i)))^(1 − y^(i)) ]
         = −∑_{i=1}^{n} [ y^(i) ln P(ŷ^(i) = 1 | X^(i)) + (1 − y^(i)) ln(1 − P(ŷ^(i) = 1 | X^(i))) ].
Denote E_i(Θ) = −[ y^(i) ln P(ŷ^(i) = 1 | X^(i)) + (1 − y^(i)) ln(1 − P(ŷ^(i) = 1 | X^(i))) ], which has the form of the binary cross-entropy:
binary cross-entropy error := −y ln P(y) − (1 − y) ln(1 − P(y)),  y ∈ {0, 1}.
https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
https://www.stat.cmu.edu/~ryantibs/advmethods/notes/logreg.pdf

Cross-entropy error function
Binary cross-entropy error := −y ln P(y) − (1 − y) ln(1 − P(y)), y ∈ {0, 1}. (This is the formulation by definition; it is a concept from information theory. P(y) is the probability of taking the value y.)
In our case, P(y = 1) = P(ŷ = 1 | X) = 1 / (1 + e^(−X^T Θ)).
When y = 1, error = −ln(1 / (1 + e^(−X^T Θ))).
When y = 0, error = −ln(1 − 1 / (1 + e^(−X^T Θ))).
We can plot the error curve of the cross-entropy.

Minimizing the cross-entropy error function
By analogy with minimizing the RSS, we would like to find Θ* such that ∂E(Θ)/∂Θ |_{Θ=Θ*} = 0. However, finding an analytic solution of ∂E(Θ)/∂Θ = 0 is very difficult. We have another workaround: gradient descent. Instead of finding the solution directly by mathematical tricks and reduction, the gradient descent algorithm iteratively searches for the optimal solution (iteratively = step by step).

Finding the optimum for logistic regression
Logistic sigmoid function: σ(x) = logit⁻¹(x) = 1 / (1 + e^(−x)).
We write the loss function (cross-entropy) −y ln P(y) − (1 − y) ln(1 − P(y)). Back to logistic regression with n samples, plugging in P(y^(i) = 1) = σ(X^(i)ᵀ Θ) = 1 / (1 + e^(−X^(i)ᵀ Θ)), the loss function becomes
E(Θ) = ∑_{i=1}^{n} [ −y^(i) ln(1 / (1 + e^(−X^(i)ᵀ Θ))) − (1 − y^(i)) ln(1 − 1 / (1 + e^(−X^(i)ᵀ Θ))) ].
The analytic expression of the gradient is
∇E(Θ) = ∂E(Θ)/∂Θ = ∑_{i=1}^{n} ( 1 / (1 + e^(−X^(i)ᵀ Θ)) − y^(i) ) X^(i).
Plug this gradient into the gradient descent algorithm.

CPEN 355 Lecture 5: Naïve Bayes Classifier
Mirza Sarwar, Ph.D., [email protected], Department of Electrical and Computer Engineering, University of British Columbia

Is your classification model good?
We always need to evaluate model performance: compare to previous models, if they exist. If we know the ground truth for the predictions, there are several metrics that can be used.

TP, FP, TN, FN
Machine learning classifiers are one of the top uses of AI technology.
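Stepping back to Lecture 4 for a moment before the evaluation material continues: here is a minimal sketch (my own, not from the slides) that ties together the cross-entropy loss E(Θ), its gradient ∑ᵢ (σ(X^(i)ᵀΘ) − y^(i)) X^(i), and a plain gradient-descent loop. The learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    # E(Theta) = -sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ],  p_i = sigmoid(X_i^T Theta)
    p = sigmoid(X @ theta)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(theta, X, y):
    # grad E(Theta) = sum_i (sigmoid(X_i^T Theta) - y_i) X_i
    return X.T @ (sigmoid(X @ theta) - y)

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    # Plain gradient descent: Theta <- Theta - lr * grad E(Theta)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= lr * gradient(theta, X, y)
    return theta

# Tiny illustrative run on made-up data (first column acts as a bias term):
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic_regression(X, y)
print(cross_entropy_loss(theta, X, y))   # small loss on this separable toy data
```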
Assume that we have a binary classifier that can classify the data into positive samples (class 1) and negative samples (class 0).

TP, FP, TN, FN
TP is the number of true positives: samples correctly predicted as positive. FP is the number of false positives: samples incorrectly predicted as positive. TN is the number of true negatives: samples correctly predicted as negative. FN is the number of false negatives: samples incorrectly predicted as negative.

Some examples
A person who is actually pregnant (positive) and classified as pregnant (positive): this is called a TRUE POSITIVE (TP).
A person who is actually not pregnant (negative) and classified as not pregnant (negative): this is called a TRUE NEGATIVE (TN).
A person who is actually not pregnant (negative) and classified as pregnant (positive): this is called a FALSE POSITIVE (FP).
A person who is actually pregnant (positive) and classified as not pregnant (negative): this is called a FALSE NEGATIVE (FN).

Accuracy, precision and recall
Many performance metrics are defined based on these counts.
Accuracy is the fraction of correct predictions: (TP + TN) / (TP + FP + TN + FN).
Precision is the accuracy of the positive predictions: TP / (TP + FP).
Recall (a.k.a. sensitivity) is the accuracy on the positive class: TP / (TP + FN).

Confusion matrix
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa. (The slides show an example confusion matrix for a 10-class classification task.)
https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd

F1 score
The F1 score is calculated from precision and recall: it is their harmonic mean. The highest possible value of an F-score is 1.0.
F1 = 2 · (precision × recall) / (precision + recall) = TP / (TP + 0.5 (FP + FN)).

Receiver operating characteristic (ROC) curve
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

Drawing an ROC curve
Step 1: obtain the classification model's predictions (e.g., will repay vs. won't repay a loan).
Step 2: compute the FPR and TPR for different threshold cut-offs.
Step 3: plot the TPR against the FPR for every cut-off (FPR as the x-axis and TPR as the y-axis).
https://towardsdatascience.com/understanding-the-roc-curve-in-three-visual-steps-795b1399481c

How to learn from data?

Data split
Performing machine learning involves creating a model, which is trained on a training data set and can then process a test data set, which is independent of the training data, to make predictions. E.g., split the labeled data randomly into 80% training and 20% test. A larger test set gives a more accurate assessment of performance. We sometimes also use a validation data set, which is a set of examples used to tune the hyper-parameters (if any) of the model.
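As a quick recap of the counting-based metrics above, here is a small sketch (not from the slides) that computes accuracy, precision, recall, and F1 from 0/1 label arrays; the example labels are made up.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    # Count TP, FP, TN, FN for a binary (0/1) classification task
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # a.k.a. sensitivity / TPR
    f1        = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative labels (made up):
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(binary_metrics(y_true, y_pred))
```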
Cross-validation
Cross-validation is a popular method used to evaluate models on a small-scale data set. The general procedure is as follows: shuffle the dataset randomly and split it into k groups. For each unique group, take that group as a hold-out (test) data set and take the remaining groups as the training data set. Summarize the skill of the model using all of the model evaluation scores.

k-fold cross-validation
A potentially faster approach: randomly divide the dataset into k folds. For b = 1, …, k: use the b-th fold as the validation set, use everything else as the training set, and compute the validation error on the b-th fold. Estimate the test error using
CV(k) = ∑_b (n_b / n) · MSE_b,
where n_b is the number of observations in the b-th fold and n is the total number of samples.

Naïve Bayes Classifier

Basic probability concepts
Bayes' theorem:
P(hypo | data) = P(data | hypo) · P(hypo) / P(data),
where P(data | hypo) is the likelihood, P(hypo) is the prior probability, P(data) is the marginal probability of the observation, and P(hypo | data) is the posterior. In short: posterior ∝ likelihood × prior.

Bayesian classification
A classifier from the probabilistic view: let A1 through Ak be attributes with discrete values and let the class be C. Given a test example d with observed attribute values a1 through ak, classification is basically to compute the posterior probability. The prediction is the class cj such that Pr(C = cj | A1 = a1, …, Ak = ak) is maximal.
https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/09_naive_bayes.pdf

Applying Bayes' rule
Pr(C = cj | A1 = a1, …, Ak = ak)
  = Pr(A1 = a1, …, Ak = ak | C = cj) Pr(C = cj) / Pr(A1 = a1, …, Ak = ak)
  = Pr(A1 = a1, …, Ak = ak | C = cj) Pr(C = cj) / ∑_r Pr(A1 = a1, …, Ak = ak | C = cr) Pr(C = cr).
Pr(C = cj) is the class prior probability, which is easy to estimate from the training data.

Conditional independence assumption
Assumption: all attributes are conditionally independent given the class C = cj:
Pr(A1 = a1, …, Ak = ak | C = cj) = ∏_{i=1}^{k} Pr(Ai = ai | C = cj).

Final naïve Bayesian classifier
Pr(C = cj | A1 = a1, …, Ak = ak)
  = Pr(C = cj) ∏_{i=1}^{k} Pr(Ai = ai | C = cj) / ∑_r Pr(C = cr) ∏_{i=1}^{k} Pr(Ai = ai | C = cr).
https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/09_naive_bayes.pdf

Classifying a test instance
If we only need a decision on the most probable class for the test instance, we only need the numerator, as the denominator is the same for every class. Thus, given a test example, we compute the following to decide the most probable class:
c = argmax_{cj} P(cj) ∏_{i=1}^{k} P(Ai = ai | C = cj).
We are done! How do we estimate P(Ai = ai | C = cj) and P(cj)? Easy: count.
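As a sketch of this "just count" estimation step (my own illustration with made-up attribute names and values): the priors come from class frequencies, and each conditional P(Ai = ai | C = cj) from counts within a class. No smoothing is applied here, although Laplace smoothing is commonly added in practice.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    # samples: list of dicts {attribute: value}; labels: list of class labels
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: cnt / n for c, cnt in class_counts.items()}   # P(C = c_j)
    cond = defaultdict(Counter)                                 # (class, attribute) -> value counts
    for x, c in zip(samples, labels):
        for attr, val in x.items():
            cond[(c, attr)][val] += 1

    def p_attr(attr, val, c):
        # P(A_i = a_i | C = c_j) estimated by counting (no smoothing)
        return cond[(c, attr)][val] / class_counts[c]

    def predict(x):
        # argmax_c P(c) * prod_i P(A_i = a_i | C = c)
        best, best_score = None, -1.0
        for c in priors:
            score = priors[c]
            for attr, val in x.items():
                score *= p_attr(attr, val, c)
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

# Tiny made-up example with two attributes:
samples = [{"A1": "x", "A2": "u"}, {"A1": "x", "A2": "v"},
           {"A1": "y", "A2": "u"}, {"A1": "y", "A2": "v"}]
labels = ["t", "t", "f", "f"]
predict = train_naive_bayes(samples, labels)
print(predict({"A1": "x", "A2": "u"}))   # expected: "t"
```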
Example
Compute all the probabilities required for classification. (The slides give a small training table with two attributes A1, A2 and class C ∈ {t, f}.)

Answer
For class C = t, we have
Pr(C = t) ∏_{j=1}^{2} Pr(Aj = aj | C = t) = (1/2) × (2/5) × (2/5) = 2/25.
For class C = f, we have
Pr(C = f) ∏_{j=1}^{2} Pr(Aj = aj | C = f) = (1/2) × (1/5) × (2/5) = 1/25.
C = t is more probable, so t is the final class.

On the naïve Bayesian classifier
Advantages: easy to implement; very efficient; good results obtained in many applications.
Disadvantages: the class conditional independence assumption, and therefore a loss of accuracy when the assumption is seriously violated (highly correlated data sets).

Discussions
Most assumptions made by naïve Bayesian learning are violated to some degree in practice. Despite such violations, researchers have shown that naïve Bayesian learning produces very accurate models. The main problem is the mixture-model assumption: when this assumption is seriously violated, the classification performance can be poor. Naïve Bayesian learning is extremely efficient.

CPEN 355 Lecture 6: Support Vector Machine
Mirza Sarwar, Ph.D., [email protected], Department of Electrical and Computer Engineering, University of British Columbia

History of classifiers
https://erogol.com/brief-history-machine-learning/

Introduction
Support vector machines were invented by V. Vapnik and his co-workers and became known to the West in 1992. SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative. SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high-dimensional data.

Logistic regression for classification (recap)
P(ŷ = 1 | X) = logit⁻¹(X^T Θ) = 1 / (1 + e^(−X^T Θ)) ∈ (0, 1).
Classification rule: P(ŷ = 1 | X) ≥ 0.5 ⇒ ŷ = 1, otherwise ŷ = 0.

Decision boundary of LR
z = X^T Θ. Since σ(X^T Θ) ≥ 0.5 iff X^T Θ ≥ 0:
1. X^T Θ ≥ 0 ⇒ ŷ = 1
2. X^T Θ < 0 ⇒ ŷ = 0
https://erogol.com/brief-history-machine-learning/

Terminologies
Decision boundary, support vectors, margin. Support vectors are the data points that are closest to the hyperplane and influence the position and orientation of the hyperplane.

Basic concepts
Let the set of training examples D be {(x1, y1), (x2, y2), …, (xr, yr)}, where xi = (xi1, xi2, …, xin) is an input vector in a real-valued space X ⊆ R^n and yi is its class label (output value), yi ∈ {1, −1}; 1 is the positive class and −1 the negative class. SVM finds a linear function of the form (w is the weight vector)
f(x) = ⟨w · x⟩ + b,
with yi = 1 if ⟨w · xi⟩ + b ≥ 0, and yi = −1 if ⟨w · xi⟩ + b < 0.

The hyperplane
The hyperplane that separates positive and negative training data is ⟨w · x⟩ + b = 0. It is also called the decision boundary (surface). There are many possible separating hyperplanes; which one should we choose? The best decision boundary has the largest distance to the closest points: a larger distance means more noise tolerance, and therefore the classifier becomes more robust. A robust separating hyperplane has a "wide" margin (i.e., it is far from the examples on both sides). So the goal is to find the widest-margin separating hyperplane.
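A minimal sketch of the linear decision function f(x) = ⟨w · x⟩ + b, the ±1 labelling rule, and the geometric distance of a point to the hyperplane, |⟨w · x⟩ + b| / ‖w‖; the distance formula and the example numbers are my additions for illustration, not from the slides.

```python
import numpy as np

def decision_function(X, w, b):
    # f(x) = <w . x> + b for each row x of X
    return X @ w + b

def predict(X, w, b):
    # y = +1 if f(x) >= 0, else -1
    return np.where(decision_function(X, w, b) >= 0, 1, -1)

def distance_to_hyperplane(X, w, b):
    # Geometric distance of each point to the hyperplane <w . x> + b = 0
    return np.abs(decision_function(X, w, b)) / np.linalg.norm(w)

# Illustrative hyperplane and points (made up):
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 0.5]])
print(predict(X, w, b))                  # [ 1 -1  1]
print(distance_to_hyperplane(X, w, b))   # smallest distances belong to the points closest to the boundary
```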
Linear SVM: separable case
(The slides illustrate the separable case, the margin, and the resulting optimization problem graphically.)
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html

Soft margin
For data that are not perfectly separable, we subtract a slack term so that each constraint can still be satisfied.
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html

SVM for non-linear datasets
Sometimes we cannot find a straight line to separate the samples; in other words, the data are far from being linearly separable in the current dimensions. If we increase the dimension, the inseparable data may become separable. For example, the samples may be inseparable in 2D, but in 3D there may be a gap between the two categories (e.g., a level difference), and in that case we can easily draw a separating hyperplane. There are many ways to increase the feature dimension.

Kernel function in SVM
SVM can use mathematical functions defined as kernels. The function of a kernel is to take data as input and transform it into the required form. For example, if we have inseparable two-dimensional data, we can apply a transformation to the original data.
If we could find a higher-dimensional space in which these points were linearly separable, then we could do the following:
map the original features to the higher, transformed space (feature mapping);
perform linear SVM in this higher space;
obtain a set of weights corresponding to the decision-boundary hyperplane;
map this hyperplane back into the original 2D space to obtain a non-linear decision boundary.
https://ml-course.github.io/master/03%20-%20Kernelization.pdf
Then we can transform the two-dimensional data to three-dimensional data by x → φ(x), and the data can become linearly separable in the higher-dimensional space:
f(x) = w^T φ(x).
The optimal hyperplane is then found in that transformed space.

CPEN 355 Lecture 7: Decision Tree
Mirza Sarwar, Ph.D., [email protected], Department of Electrical and Computer Engineering, University of British Columbia

Introduction to decision trees
Decision tree learning is one of the most widely used techniques for classification. It yields a set of interpretable decision rules. The classification model is a tree, called a decision tree.
https://harvard-iacs.github.io/2017-CS109A/lectures/lecture15/presentation/lecture15_RandomForest.pdf

Entropy measure in classification
The entropy formula:
entropy(D) = − ∑_{j=1}^{C} P(cj) ln P(cj),
where P(cj) is the probability of class cj in data set D. We use entropy as a measure of the impurity or disorder of data set D. Low impurity corresponds to low entropy; high impurity corresponds to high entropy. Low entropy implies greater predictability!
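A small sketch of the entropy formula above applied to class-count vectors (my own illustration; the node counts echo the purity example that follows). It also includes the Gini impurity, 1 − ∑_j P(cj)², which is the standard definition of the alternative impurity measure named in the next slides but not spelled out there.

```python
import numpy as np

def entropy(counts):
    # entropy(D) = -sum_j P(c_j) * ln P(c_j), with 0 * ln 0 treated as 0
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def gini(counts):
    # Standard Gini impurity: 1 - sum_j P(c_j)^2 (assumed definition, not from the slides)
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node (8 positive, 0 negative) vs. a mixed node (4 positive, 4 negative):
print(entropy([8, 0]), gini([8, 0]))   # 0.0, 0.0 -> pure, highly predictable
print(entropy([4, 4]), gini([4, 4]))   # ln(2) ~= 0.693, 0.5 -> maximally impure for two classes
```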
Fruit classification
(The slides work through a small fruit-classification example.)

Entropy and purity
Entropy measures purity. Compare a node with 8 positive and 0 negative samples to one with 4 positive and 4 negative samples: when the class distribution is less uniform, the entropy is lower and the node is purer.

Gini impurity
(An alternative impurity measure to entropy.)
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote17.html

Algorithm for decision tree learning
Conditions for stopping partitioning: all examples for a given node belong to the same class; there are no remaining attributes for further partitioning (the majority class becomes the leaf); there are no examples left.

Regression
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote17.html

Ensemble methods
Ensemble methods pool together multiple models to arrive at more reliable predictions: bagging, random forests, boosting.

Bagging
https://people.csail.mit.edu/dsontag/courses/ml16/slides/lecture11.pdf

Bagging - issues
Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then all bagged trees will select the strong predictor at the top of the tree, and therefore all trees will look similar. How do we avoid this? What if we consider only a subset of the predictors at each split? We will still get correlated trees unless we randomly select the subset; this gives the Random Forest.

Random Forest
https://people.csail.mit.edu/dsontag/courses/ml16/slides/lecture11.pdf

Variable importance
While bagging improves the predictive ability of trees, it kills off the interpretability of the model. A good tool for interpreting a bagged tree model (and other tree-based ensemble methods like forests) is the variable importance measure. Variable importance can be measured by the amount by which the Residual Sum of Squares (RSS) or the Gini index is reduced due to splits over a given predictor, averaged over all B trees. (The slides illustrate this on the Titanic data.)

Bagging and Boosting

Boosting - algorithm
Boosting is a strategy to boost a weak learner into a stronger learner. It starts by learning with the base learner M1. Then, based on the performance of M1, we reweight or resample the training samples, e.g., assigning larger weights to difficult samples. Based on the new distribution of samples, we train another learner M2. Recursively, we train M1, …, MT. Finally, we aggregate the learners, e.g., M = ∑_{i=1}^{T} αi Mi.

Boosting - learning from weighted data
Not all data points are equal. Consider a weighted dataset where D(i) is the weight of the i-th training example (xi, yi). Interpretation: the i-th training example counts as D(i) examples; if we were to resample the data, we would get more samples of the "heavier" data points. Now, in all calculations, whenever it is used, the i-th training example counts as D(i) "examples".

An example: AdaBoost
(The slides walk through the AdaBoost algorithm as a concrete instance of boosting.)
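Since the AdaBoost slides themselves are figures, here is a hedged illustration of the boosting recipe described above (train a weak learner, up-weight the samples it gets wrong, repeat, then aggregate M = ∑ αi Mi), using depth-1 decision trees from scikit-learn as the weak learners. The weight-update and αi formulas are the standard AdaBoost ones, stated from general knowledge rather than taken from the slides; the dataset is made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    # y must be in {-1, +1}. Returns the weak learners M_i and their weights alpha_i.
    n = len(y)
    D = np.full(n, 1.0 / n)          # initial sample weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        err = np.sum(D[pred != y])   # weighted error of this weak learner
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # standard AdaBoost learner weight
        D *= np.exp(-alpha * y * pred)          # up-weight misclassified samples
        D /= D.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Aggregated model: M(x) = sign( sum_i alpha_i * M_i(x) )
    scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
    return np.sign(scores)

# Tiny made-up dataset:
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
learners, alphas = adaboost_fit(X, y, T=5)
print(adaboost_predict(X, learners, alphas))   # expected: [-1. -1.  1.  1.]
```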
