Lecture 08 - Supervised Learning PDF

Document Details


Uploaded by LargeCapacityDetroit

University of Moratuwa

Dr. Nisansa de Silva

Tags

supervised learning, machine learning, data science, computer science

Summary

This lecture provides an introduction to supervised learning in the context of data science. It covers fundamental concepts and examples of supervised learning algorithms, along with their applications and practical issues.

Full Transcript


CS3121 - Introduction to Data Science
Supervised Learning
Dr. Nisansa de Silva, Department of Computer Science & Engineering
http://nisansads.staff.uom.lk/

Basic Concepts

What is Learning?
▪ “Learning is any process by which a system improves performance from experience.” – Herbert Simon
▪ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” – Tom Mitchell

Learning
▪ Learning is essential for unknown environments,
  – i.e., when the designer lacks omniscience.
▪ Learning is useful as a system construction method,
  – i.e., expose the agent to reality rather than trying to write it down.
▪ Learning modifies the agent's decision mechanisms to improve performance.

Machine Learning
▪ Machine learning: how to acquire a model on the basis of data / experience
  – Learning parameters (e.g. probabilities)
  – Learning structure (e.g. BN graphs)
  – Learning hidden concepts (e.g. clustering)
▪ Like human learning from past experiences.
▪ A computer does not have “experiences”.
▪ A computer system learns from data, which represent some “past experiences” of an application domain.
▪ Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, and high-risk or low-risk.
▪ The method is commonly called inductive learning.

Machine Learning
▪ Given
  – a data set D,
  – a task T, and
  – a performance measure M,
  a computer system is said to learn from D to perform the task T if, after learning, the system’s performance on T improves as measured by M.
▪ In other words, the learned model helps the system perform T better than with no learning.
▪ An example
  – Data: loan application data.
  – Task: predict whether a loan should be approved or not.
  – Performance measure: accuracy.
  – No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%.
  – We can do better than 60% with learning.
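As a minimal sketch of the "no learning" baseline just described, the snippet below hard-codes a 9-to-6 Yes/No label split to mirror the loan example's counts (the labels themselves are an assumption for illustration) and confirms that always predicting the majority class gives 9/15 = 60% accuracy.

```python
from collections import Counter

# Toy stand-in for the loan data: 9 "Yes" and 6 "No" labels (15 applications).
train_labels = ["Yes"] * 9 + ["No"] * 6

# "No learning": always predict the majority class of the training data.
majority_class, _ = Counter(train_labels).most_common(1)[0]

# Evaluating that constant predictor on the same label distribution
# reproduces the 9/15 = 60% accuracy quoted in the lecture.
predictions = [majority_class] * len(train_labels)
accuracy = sum(p == y for p, y in zip(predictions, train_labels)) / len(train_labels)
print(majority_class, accuracy)   # Yes 0.6
```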
Fundamental Assumption of Machine Learning
▪ Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).
▪ In practice, this assumption is often violated to some degree.
▪ Strong violations will clearly result in poor classification accuracy.
▪ To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

An Example Application
▪ An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.
▪ A decision is needed: whether to put a new patient in an intensive-care unit.
▪ Due to the high cost of ICU,
  – patients who may survive less than a month are given higher priority;
  – on the other hand, patients who may survive only a few hours are given lower priority.
▪ Problem: to predict high-risk patients and discriminate them from low-risk patients.

Another Application
▪ A credit card company receives thousands of applications for new cards. Each application contains information about an applicant,
  – age
  – marital status
  – annual salary
  – outstanding debts
  – credit rating
  – etc.
▪ Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.

Machine Learning Areas
▪ Supervised Learning: data and corresponding labels are given.
  – Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes. It is as if a “teacher” gives the classes (supervision).
  – Test data are classified into these classes too.
  – The process of inferring a function from labeled training data drawn from a set of training examples.
▪ Unsupervised Learning: only data is given, no labels are provided.
  – Class labels of the data are unknown.
  – Given a set of data, the task is to establish the existence of classes or clusters in the data.
▪ Semi-supervised Learning: some (if not all) labels are present.
▪ Reinforcement Learning: an agent interacting with the world makes observations, takes actions, and is rewarded or punished; it should learn to choose actions in such a way as to obtain a lot of reward.

Supervised Learning: Important Concepts
▪ Data: labeled instances, e.g. emails marked spam/not spam
  – Training set
  – Held-out set / validation set
  – Test set
▪ Features: attribute-value pairs which characterize each x.
▪ Experimentation cycle
  – Learn parameters (e.g. model probabilities) on the training set.
  – (Tune hyper-parameters on the held-out set.)
  – Compute accuracy on the test set.
  – Very important: never “peek” at the test set!
▪ Evaluation
  – Accuracy: fraction of instances predicted correctly.
▪ Overfitting and generalization
  – We want a classifier which does well on test data.
  – Overfitting: fitting the training data very closely, but not generalizing well.

Example: Spam Filter (Slide from Mackassy)

Example: Digit Recognition (Slide from Mackassy)

Supervised Learning: Classification & Regression
▪ Regression: learning a continuous function. Examples: Linear Regression, Regression Trees, Bayesian Linear Regression, Non-Linear Regression, Polynomial Regression.
▪ Classification (Binary/Boolean and Multiclass): learning a discrete function. Examples: Linear SVM, Logistic regression, Decision trees, Random forests, Bayesian Belief Networks, Neural Networks, Naive Bayes, Support Vector Machines, Gradient-boosted trees.

Supervised Learning: Classification
▪ In classification, we predict labels y (classes) for inputs x.
▪ Binary classification
  – Given: a set of m examples (x_i, y_i), i = 1, 2, …, m, sampled from some distribution D, where x_i ∈ R^n and y_i ∈ {−1, +1}.
  – Find: a function f, f: R^n → {−1, +1}, which classifies examples x_j sampled from D well.
▪ Comments
  – The function f is usually a statistical model, whose parameters are learnt from the set of examples.
  – The set of examples is called the ‘training set’.
  – y is called the ‘target variable’ or ‘target’.
  – Examples with y_i = +1 are called ‘positive examples’.
  – Examples with y_i = −1 are called ‘negative examples’.

Supervised Learning: Classification
▪ Examples:
  – OCR (input: images, classes: characters)
  – Medical diagnosis (input: symptoms, classes: diseases)
  – Automatic essay grader (input: document, classes: grades)
  – Fraud detection (input: account activity, classes: fraud / no fraud)
  – Customer service email routing
  – Recommended articles in a newspaper, recommended books
  – DNA and protein sequence identification
  – Categorization and identification of astronomical images
  – Financial investments
  – … many more

Supervised Learning: Regression
▪ Estimation of relationships between a dependent variable and one or more independent variables.
▪ Linear model assumptions
  – The dependent and independent variables have a linear relationship (the model is linear in the slope and the intercept).
  – The independent variable is not random.
  – The residual (error) has an expected value of zero.
  – The residual (error) has constant variance across all observations.
  – The residuals (errors) are not correlated across observations.
  – The residual (error) values follow the normal distribution.
▪ Simple Linear Regression: Y = a + bX_1 + ε
▪ Multiple Linear Regression: Y = a + bX_1 + cX_2 + dX_3 + ε
  where Y is the dependent variable; X_1, X_2, X_3 are the independent (explanatory) variables; a is the intercept; b, c, d are slopes; and ε is the residual (error).
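The following is a small sketch of how the multiple-regression coefficients a, b, c, d might be estimated by ordinary least squares; the synthetic data, noise level, and true coefficients are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for Y = a + b*X1 + c*X2 + d*X3 + eps (coefficients are made up).
n = 200
X = rng.normal(size=(n, 3))                               # columns X1, X2, X3
true_coefs = np.array([1.5, -2.0, 0.7])                   # b, c, d
Y = 4.0 + X @ true_coefs + rng.normal(scale=0.5, size=n)  # a = 4.0, eps ~ N(0, 0.25)

# Ordinary least squares: add a column of ones so the intercept a is estimated too.
X_design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
a_hat, b_hat, c_hat, d_hat = coef
print(a_hat, b_hat, c_hat, d_hat)   # estimates close to 4.0, 1.5, -2.0, 0.7
```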
Inductive Learning Method
▪ Simplest form: learn a function from examples.
  – f is the target function.
  – An example is a pair (x, f(x)).
▪ Pure induction task:
  – Given a collection of examples of f, return a function h that approximates f.
  – Find a hypothesis h, such that h ≈ f, given a training set of examples.
  – Construct/adjust h to agree with f on the training set.
  – h is consistent if it agrees with f on all examples.
▪ This is a highly simplified model of real learning:
  – It ignores prior knowledge.
  – It assumes examples are given.

Inductive Learning Method: Regression Example
▪ E.g., curve fitting. Ockham’s razor: prefer the simplest hypothesis consistent with the data. (A small curve-fitting sketch appears at the end of this section.)

Generalization
▪ Hypotheses must generalize to correctly classify instances not in the training data.
▪ Simply memorizing training examples is a consistent hypothesis that does not generalize.
▪ Occam’s razor: finding a simple hypothesis helps ensure generalization.

Supervised Learning Steps
▪ Determine the type of the training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set.
▪ Gather the training set: a set of input objects is gathered and the corresponding outputs are also gathered, either from human experts or from measurements.
▪ Determine the structure of the learned function and the corresponding learning algorithm.
▪ Complete the design: run the learning algorithm on the gathered training set.
▪ Evaluate the accuracy of the learned function: after parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
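To make the curve-fitting and generalization point concrete, here is a hedged sketch (the underlying linear target, noise level, data split, and polynomial degrees are all assumptions) comparing a simple hypothesis and a very flexible one on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of an underlying linear target function (assumed here).
x = np.linspace(-1, 1, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

# Hold out every other point to measure generalization.
x_train, y_train = x[::2], y[::2]
x_test,  y_test  = x[1::2], y[1::2]

for degree in (1, 9):
    h = np.polyfit(x_train, y_train, degree)          # hypothesis h fitted to the training set
    train_err = np.mean((np.polyval(h, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(h, x_test)  - y_test) ** 2)
    # The degree-9 fit agrees with the training data more closely but typically
    # generalizes worse: Ockham's razor in action.
    print(degree, round(train_err, 4), round(test_err, 4))
```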
Inductive Learning Method: Classification
▪ Data: a set of data records (also called examples, instances or cases) described by
  – k attributes: A1, A2, …, Ak, and
  – a class: each example is labelled with a pre-defined class.
▪ Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

Classification: A Two-Step Process
▪ Model construction: describing a set of predetermined classes
  – Learning (training): learn a model using the training data.
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.
  – Accuracy = (Number of Correct Classifications) / (Total Number of Test Cases)
▪ Model usage: classifying future or unknown objects
  – Testing: test the model using unseen test data to assess the model accuracy.
  – Estimate the accuracy of the model:
    ▪ The known label of each test sample is compared with the classified result from the model.
    ▪ The test set must be independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

An Example: The Learning Task
▪ Learn a classification model from the data.
▪ Use the model to classify future loan applications into
  – Yes (approved) and
  – No (not approved).
▪ What is the class for the following case/instance?

Illustrating Classification Task
▪ Training set (used by the learning algorithm, via induction, to learn the model):

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

▪ Test set (the learned model is applied to it, via deduction, to predict the unknown class labels):

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Evaluation of Classifiers

Evaluating Classification Methods
▪ Predictive accuracy: Accuracy = (True Positive + True Negative) / Total
▪ Efficiency
  – time to construct the model
  – time to use the model
▪ Robustness: handling noise and missing values.
▪ Scalability: efficiency in disk-resident databases.
▪ Interpretability: understandability of, and insight provided by, the model.
▪ Compactness of the model: size of the tree, or the number of rules.

Evaluation Methods
▪ Holdout set: the available data set D is divided into two disjoint subsets,
  – the training set Dtrain (for learning a model), and
  – the test set Dtest (for testing the model).
▪ Important: the training set should not be used in testing, and the test set should not be used in learning.
  – An unseen test set provides an unbiased estimate of accuracy.
▪ The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.)
▪ This method is mainly used when the data set D is large.

Evaluation Methods (cont…)
▪ n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets.
▪ Use each subset as the test set and combine the remaining n−1 subsets as the training set to learn a classifier.
▪ The procedure is run n times, which gives n accuracies.
▪ The final estimated accuracy of learning is the average of the n accuracies.
▪ 10-fold and 5-fold cross-validation are commonly used.
▪ This method is used when the available data is not large.

Evaluation Methods (cont…)
▪ Leave-one-out cross-validation: this method is used when the data set is very small.
▪ It is a special case of cross-validation.
▪ Each fold of the cross-validation has only a single test example, and all the rest of the data is used in training.
▪ If the original data has m examples, this is m-fold cross-validation.
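A possible hand-rolled implementation of n-fold cross-validation is sketched below; the nearest-centroid learner and the synthetic two-class data are stand-ins chosen only to make the loop concrete, not part of the lecture.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Toy learner used only to make the cross-validation loop concrete."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def n_fold_cv_accuracy(X, y, n_folds=10, seed=0):
    # Shuffle, split into n roughly equal disjoint folds, use each fold once as
    # the test set and the remaining n-1 folds as the training set, then average.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = nearest_centroid_fit(X[train], y[train])
        pred = nearest_centroid_predict(model, X[test])
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))

# Synthetic two-class data (an assumption for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(n_fold_cv_accuracy(X, y, n_folds=10))
```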
Evaluation Methods (cont…)
▪ Validation set: the available data is divided into three subsets,
  – a training set,
  – a validation set, and
  – a test set.
▪ A validation set is frequently used for estimating parameters in learning algorithms.
▪ In such cases, the values that give the best accuracy on the validation set are used as the final parameter values.
▪ Cross-validation can be used for parameter estimation as well.

Training, Validation, and Test

Classification Measures
▪ Accuracy is only one measure (error = 1 − accuracy).
▪ Accuracy is not suitable in some applications.
▪ In text mining, we may only be interested in the documents of a particular topic, which are only a small portion of a big document collection.
▪ In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class.
  – High accuracy does not mean any intrusion is detected.
  – E.g., with 1% intrusion, we can achieve 99% accuracy by doing nothing.
▪ The class of interest is commonly called the positive class, and the rest the negative classes.

Binary Classification: Evaluation
▪ These measures are used in information retrieval and text classification.
▪ We use a confusion matrix to introduce them:

                    Predicted Negative     Predicted Positive
  Actual Negative   True Negative (TN)     False Positive (FP)
  Actual Positive   False Negative (FN)    True Positive (TP)

▪ TP (True Positive): number of positive (correct) classifications of positive examples.
▪ FN (False Negative): number of negative (incorrect) classifications of positive examples.
▪ FP (False Positive): number of positive (incorrect) classifications of negative examples.
▪ TN (True Negative): number of negative (correct) classifications of negative examples.

Binary Classification: Evaluation
▪ Precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive:
  P = Precision = True Positive / (True Positive + False Positive) = TP / (TP + FP)
▪ Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set:
  R = Recall = True Positive / (True Positive + False Negative) = TP / (TP + FN)

An Example
▪ The confusion matrix on the slide gives
  – precision p = 100% and
  – recall r = 1%,
  because we classified only one positive example correctly and no negative examples wrongly.
▪ Note: precision and recall only measure classification on the positive class.

F1-value (also called F1-score)
▪ It is hard to compare two classifiers using two measures. The F1-score combines precision and recall into one measure.
▪ The F1-score is the harmonic mean of precision and recall; the harmonic mean of two numbers tends to be closer to the smaller of the two.
▪ For the F1-value to be large, both p and r must be large.

Binary Classification: Evaluation (summary)
▪ Accuracy = (True Positive + True Negative) / Total
▪ Precision = True Positive / (True Positive + False Positive)
▪ Recall = True Positive / (True Positive + False Negative)
▪ F1 = 2 × (Precision × Recall) / (Precision + Recall)
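The sketch below computes the confusion-matrix counts and the four measures above for a skewed toy data set (the 1%-positive split is an assumption echoing the intrusion example); it reproduces the precision = 100%, recall = 1% situation from the example slide.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    # Confusion-matrix counts, following the lecture's definitions.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))

    accuracy  = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Skewed toy data (an assumption): 1% positives, and a classifier that predicts
# a single positive correctly; precision is 100% but recall is only 1%.
y_true = [1] * 100 + [0] * 9900
y_pred = [1] + [0] * 9999
print(binary_classification_metrics(y_true, y_pred))
# accuracy ~ 0.9901, precision = 1.0, recall = 0.01, F1 ~ 0.0198
```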
Some Supervised Learning Algorithms

k-Nearest Neighbor Classification (kNN)
▪ Unlike most learning methods, kNN does not build a model from the training data.
▪ To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d.
▪ Count the number n of training instances in P that belong to class cj.
▪ Estimate Pr(cj | d) as n/k.
▪ No training is needed.
▪ Classification time is linear in the training set size for each test case.

kNN Algorithm
▪ k is usually chosen empirically via a validation set or cross-validation by trying a range of k values.
▪ The distance function is crucial, but depends on the application.

Example: k = 6 (6NN)
▪ [Figure: documents from three classes (Government, Science, Arts) and a new point; estimate Pr(science | new point) from its 6 nearest neighbors.]

Discussions
▪ kNN can deal with complex and arbitrary decision boundaries.
▪ Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and, in many cases, as accurate as more elaborate methods.
▪ kNN is slow at classification time.
▪ kNN does not produce an understandable model.
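A minimal kNN sketch, assuming Euclidean distance and a small made-up 2-D data set; it returns the majority class in the k-neighborhood together with the n/k estimate of Pr(cj | d).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test instance by majority vote among its k nearest neighbors.
    Euclidean distance is assumed here; as the slides note, the distance
    function is application-dependent."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    neighbor_idx = np.argsort(dists)[:k]                 # the k-neighborhood P
    votes = Counter(y_train[i] for i in neighbor_idx)    # class counts inside P
    label, n = votes.most_common(1)[0]
    return label, n / k                                  # Pr(c_j | d) estimated as n/k

# Toy 2-D data (an illustration, not the lecture's document example).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
y_train = np.array(["arts", "arts", "arts", "science", "science", "science"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.8]), k=3))   # ('science', 1.0)
```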
Decision Trees: Introduction
▪ Decision tree learning is one of the most widely used techniques for classification.
  – Its classification accuracy is competitive with other methods, and
  – it is very efficient.
▪ The classification model is a tree, called a decision tree.
▪ C4.5 by Ross Quinlan is perhaps the best known system. It can be downloaded from the Web.

Learning Decision Trees
▪ Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:
  1. Alternate: is there an alternative restaurant nearby?
  2. Bar: is there a comfortable bar area to wait in?
  3. Fri/Sat: is today Friday or Saturday?
  4. Hungry: are we hungry?
  5. Patrons: number of people in the restaurant (None, Some, Full)
  6. Price: price range ($, $$, $$$)
  7. Raining: is it raining outside?
  8. Reservation: have we made a reservation?
  9. Type: kind of restaurant (French, Italian, Thai, Burger)
  10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Feature (Attribute)-Based Representations
▪ Examples are described by feature (attribute) values (Boolean, discrete, continuous).
▪ E.g., situations where I will/won't wait for a table.
▪ Classification of examples is positive (T) or negative (F).

Decision Trees
▪ One possible representation for hypotheses.
▪ E.g., the slide shows the “true” tree for deciding whether to wait.

Expressiveness
▪ Decision trees can express any function of the input attributes.
▪ E.g., for Boolean functions, truth table row → path to leaf.
▪ Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.
▪ Prefer to find more compact decision trees.

Decision Tree Learning
▪ Aim: find a small tree consistent with the training examples.
▪ Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.

Decision Tree Construction Algorithm
▪ Principle
  – Basic algorithm (adopted by ID3, C4.5 and CART): a greedy algorithm.
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner.
▪ Iterations
  – At the start, all the training tuples are at the root.
  – Tuples are partitioned recursively based on selected attributes.
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
▪ Stopping conditions
  – All samples for a given node belong to the same class.
  – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
  – There are no samples left.

Tree Induction
▪ Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion.
  – Nodes with a homogeneous class distribution are preferred.
▪ Issues
  – Determine how to split the records:
    ▪ How to specify the attribute test condition?
    ▪ How to determine the best split?
  – Determine when to stop splitting.

Choosing an Attribute
▪ Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
▪ Patrons? is a better choice.

Example
▪ [A sequence of slides works through growing the tree for the restaurant data step by step.]

Measures of Node Impurity
▪ Information Gain
▪ Gini Index
▪ Misclassification error
▪ Choose attributes to split on so as to achieve minimum impurity.

Attribute Selection Measure: Information Gain (ID3/C4.5)
▪ Select the attribute with the highest information gain.
▪ Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
▪ Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
▪ Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)
▪ Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)

Information Gain: Restaurant Example
▪ For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
▪ Consider the attributes Patrons and Type (and the others too):
  IG(Patrons) = 1 - [\frac{2}{12} I(0,1) + \frac{4}{12} I(1,0) + \frac{6}{12} I(\frac{2}{6}, \frac{4}{6})] \approx 0.541 bits
  IG(Type) = 1 - [\frac{2}{12} I(\frac{1}{2}, \frac{1}{2}) + \frac{2}{12} I(\frac{1}{2}, \frac{1}{2}) + \frac{4}{12} I(\frac{2}{4}, \frac{2}{4}) + \frac{4}{12} I(\frac{2}{4}, \frac{2}{4})] = 0 bits
▪ Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
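Below is a sketch of the entropy and information-gain computations. The 12 labels are arranged so that the Patrons and Type splits quoted above are reproduced; the per-attribute orderings are an illustrative reduction of the restaurant data, not the full 10-attribute table.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j), splitting on attribute A."""
    total = len(labels)
    split_info = 0.0
    for v in set(values):
        subset = [l for a, l in zip(values, labels) if a == v]
        split_info += len(subset) / total * entropy(subset)
    return entropy(labels) - split_info

# The same 12 WillWait labels (6 T, 6 F), ordered once to line up with the Patrons
# grouping and once with the Type grouping, matching the I() terms on the slide.
patrons    = ["None"] * 2 + ["Some"] * 4 + ["Full"] * 6
willwait_p = ["F"] * 2 + ["T"] * 4 + ["T"] * 2 + ["F"] * 4
rest_type  = ["French"] * 2 + ["Italian"] * 2 + ["Thai"] * 4 + ["Burger"] * 4
willwait_t = ["T", "F"] * 6   # every restaurant type is an even 50/50 split

print(round(information_gain(patrons, willwait_p), 3))    # ~0.541 bits
print(round(information_gain(rest_type, willwait_t), 3))  # 0.0 bits
```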
Decision Tree Based Classification
▪ Advantages:
  – Easy to construct/implement.
  – Extremely fast at classifying unknown records.
  – Models are easy to interpret for small-sized trees.
  – Accuracy is comparable to other classification techniques for many simple data sets.
  – Tree models make no assumptions about the distribution of the underlying data: they are nonparametric.
  – They have a built-in feature selection method that makes them immune to the presence of useless variables.
▪ Disadvantages:
  – Computationally expensive to train.
  – Some decision trees can be overly complex and do not generalise well.
  – Less expressivity: there may be concepts that are hard to learn with limited decision trees.

References
▪ CS583, Chapter 3: Supervised Learning, by Bing Liu
▪ “Supervised Learning” by Amar Tripathi
▪ “Learning from Observations” by Nazli Ikizler-Cinbis
▪ “What is Regression Analysis?” Corporate Finance Institute
▪ “Supervised Machine Learning” by javatpoint
▪ “k-Nearest Neighbor (kNN) Classifier” by Wolfram Demonstrations Project

Two More Supervised Learning Algorithms (Not Covered in Class)

Naïve Bayesian Classification
▪ Probabilistic view: supervised learning can naturally be studied from a probabilistic point of view.
▪ Let A1 through Ak be attributes with discrete values. The class is C.
▪ Given a test example d with observed attribute values a1 through ak.
▪ Classification is basically to compute the posterior probability Pr(C = c_j | A_1 = a_1, …, A_{|A|} = a_{|A|}). The prediction is the class c_j for which this probability is maximal.

Apply Bayes’ Rule
  Pr(C = c_j | A_1 = a_1, …, A_{|A|} = a_{|A|})
    = \frac{Pr(A_1 = a_1, …, A_{|A|} = a_{|A|} | C = c_j) \, Pr(C = c_j)}{Pr(A_1 = a_1, …, A_{|A|} = a_{|A|})}
    = \frac{Pr(A_1 = a_1, …, A_{|A|} = a_{|A|} | C = c_j) \, Pr(C = c_j)}{\sum_{r=1}^{|C|} Pr(A_1 = a_1, …, A_{|A|} = a_{|A|} | C = c_r) \, Pr(C = c_r)}
▪ Pr(C = c_j) is the class prior probability: easy to estimate from the training data.

Computing Probabilities
▪ The denominator Pr(A_1 = a_1, …, A_{|A|} = a_{|A|}) is irrelevant for decision making since it is the same for every class.
▪ We only need Pr(A_1 = a_1, …, A_{|A|} = a_{|A|} | C = c_j), which can be written as
  Pr(A_1 = a_1 | A_2 = a_2, …, A_{|A|} = a_{|A|}, C = c_j) × Pr(A_2 = a_2, …, A_{|A|} = a_{|A|} | C = c_j).
▪ Recursively, the second factor above can be written in the same way, and so on.
▪ Now an assumption is needed.

Conditional Independence Assumption
▪ All attributes are conditionally independent given the class C = c_j.
▪ Formally, we assume Pr(A_1 = a_1 | A_2 = a_2, …, A_{|A|} = a_{|A|}, C = c_j) = Pr(A_1 = a_1 | C = c_j), and so on for A_2 through A_{|A|}. I.e.,
  Pr(A_1 = a_1, …, A_{|A|} = a_{|A|} | C = c_j) = \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

Final Naïve Bayesian Classifier
▪ We are done!
  Pr(C = c_j | A_1 = a_1, …, A_{|A|} = a_{|A|}) = \frac{Pr(C = c_j) \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)}{\sum_{r=1}^{|C|} Pr(C = c_r) \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_r)}
▪ How do we estimate Pr(A_i = a_i | C = c_j)? Easy: from the counts in the training data.

Classify a Test Instance
▪ If we only need a decision on the most probable class for the test instance, we only need the numerator, as the denominator is the same for every class.
▪ Thus, given a test example, we compute the following to decide the most probable class for the test instance:
  c = \arg\max_{c_j} Pr(C = c_j) \prod_{i=1}^{|A|} Pr(A_i = a_i | C = c_j)

An Example
▪ Compute all the probabilities required for classification (from the training table on the slide).

An Example (cont…)
▪ For C = t, we have
  Pr(C = t) \prod_{j=1}^{2} Pr(A_j = a_j | C = t) = \frac{1}{2} \times \frac{2}{5} \times \frac{2}{5} = \frac{2}{25}
▪ For class C = f, we have
  Pr(C = f) \prod_{j=1}^{2} Pr(A_j = a_j | C = f) = \frac{1}{2} \times \frac{1}{5} \times \frac{2}{5} = \frac{1}{25}
▪ C = t is more probable. t is the final class.

Additional Issues
▪ Numeric attributes: naïve Bayesian learning assumes that all attributes are categorical. Numeric attributes need to be discretized.
▪ Zero counts: a particular attribute value may never occur together with a class in the training set. We need smoothing:
  Pr(A_i = a_i | C = c_j) = \frac{n_{ij} + \lambda}{n_j + \lambda n_i}
  where n_{ij} is the number of training examples with A_i = a_i and class c_j, n_j is the number of training examples of class c_j, n_i is the number of values of attribute A_i, and λ is a small smoothing constant.
▪ Missing values: ignored.

On the Naïve Bayesian Classifier
▪ Advantages:
  – Easy to implement.
  – Very efficient.
  – Good results obtained in many applications.
▪ Disadvantages:
  – Assumption of class conditional independence, hence a loss of accuracy when the assumption is seriously violated (highly correlated data sets).
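A compact naïve Bayes sketch with Laplace-style smoothing for zero counts; the two-attribute weather-style data set, the attribute values, and the smoothing constant are all assumptions for illustration.

```python
from collections import Counter

def train_naive_bayes(examples, labels, smoothing=1.0):
    """Estimate Pr(C=c_j) and smoothed Pr(A_i=a_i | C=c_j) from categorical data,
    then score a test example with Pr(C=c_j) * prod_i Pr(A_i=a_i | C=c_j)."""
    class_counts = Counter(labels)
    n_attrs = len(examples[0])
    # value_counts[(i, value, c)] = number of training examples with A_i=value and class c
    value_counts = Counter((i, x[i], c) for x, c in zip(examples, labels) for i in range(n_attrs))
    attr_values = [set(x[i] for x in examples) for i in range(n_attrs)]
    total = len(labels)

    def posterior_scores(x):
        scores = {}
        for c, n_c in class_counts.items():
            score = n_c / total                        # class prior Pr(C=c)
            for i in range(n_attrs):
                num = value_counts[(i, x[i], c)] + smoothing
                den = n_c + smoothing * len(attr_values[i])
                score *= num / den                     # smoothed Pr(A_i=a_i | C=c)
            scores[c] = score                          # unnormalized posterior
        return scores

    return posterior_scores

# Tiny categorical data set (made up for illustration).
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "yes", "yes", "no"]

scores = train_naive_bayes(X, y)(("sunny", "hot"))
print(max(scores, key=scores.get), scores)   # most probable class and its score
```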
Support Vector Machines: Introduction
▪ Support vector machines were invented by V. Vapnik and his co-workers in the 1970s in Russia and became known to the West in 1992.
▪ SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative.
▪ Kernel functions are used for nonlinear separation.
▪ SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high-dimensional data.
▪ It is perhaps the best classifier for text classification.

Basic Concepts
▪ Let the set of training examples D be {(x_1, y_1), (x_2, y_2), …, (x_r, y_r)}, where x_i = (x_1, x_2, …, x_n) is an input vector in a real-valued space X ⊆ R^n and y_i is its class label (output value), y_i ∈ {1, −1}. Here 1 denotes the positive class and −1 the negative class.
▪ SVM finds a linear function of the form (w: weight vector)
  f(x) = ⟨w · x⟩ + b
  y_i = +1 if ⟨w · x_i⟩ + b ≥ 0, and y_i = −1 if ⟨w · x_i⟩ + b < 0.

The Hyperplane
▪ The hyperplane that separates positive and negative training data is ⟨w · x⟩ + b = 0.
▪ It is also called the decision boundary (surface).
▪ There are many possible hyperplanes. Which one should we choose?

Maximal Margin Hyperplane
▪ SVM looks for the separating hyperplane with the largest margin.
▪ Machine learning theory says this hyperplane minimizes the error bound.

Linear SVM: Separable Case
▪ Assume the data are linearly separable.
▪ Consider a positive data point (x⁺, 1) and a negative one (x⁻, −1) that are closest to the hyperplane ⟨w · x⟩ + b = 0.
▪ We define two parallel hyperplanes, H⁺ and H⁻, that pass through x⁺ and x⁻ respectively. H⁺ and H⁻ are also parallel to ⟨w · x⟩ + b = 0.

Compute the Margin
▪ Now let us compute the distance between the two margin hyperplanes H⁺ and H⁻. Their distance is the margin (d⁺ + d⁻ in the figure).
▪ Recall from vector-space algebra that the (perpendicular) distance from a point x_i to the hyperplane ⟨w · x⟩ + b = 0 is
  |⟨w · x_i⟩ + b| / ||w||,   (36)
  where ||w|| is the norm of w,
  ||w|| = \sqrt{⟨w · w⟩} = \sqrt{w_1^2 + w_2^2 + … + w_n^2}.   (37)

Compute the Margin (cont…)
▪ Let us compute d⁺.
▪ Instead of computing the distance from x⁺ to the separating hyperplane ⟨w · x⟩ + b = 0, we pick any point x_s on ⟨w · x⟩ + b = 0 and compute the distance from x_s to ⟨w · x⁺⟩ + b = 1 by applying Eq. (36) and noting that ⟨w · x_s⟩ + b = 0:
  d⁺ = |⟨w · x_s⟩ + b − 1| / ||w|| = 1 / ||w||   (38)
  margin = d⁺ + d⁻ = 2 / ||w||   (39)

An Optimization Problem!
▪ Definition (Linear SVM: separable case): given a set of linearly separable training examples D = {(x_1, y_1), (x_2, y_2), …, (x_r, y_r)}, learning is to solve the following constrained minimization problem:
  Minimize: ⟨w · w⟩ / 2
  Subject to: y_i(⟨w · x_i⟩ + b) ≥ 1, i = 1, 2, …, r   (40)
▪ The constraint y_i(⟨w · x_i⟩ + b) ≥ 1 summarizes
  ⟨w · x_i⟩ + b ≥ 1 for y_i = 1, and
  ⟨w · x_i⟩ + b ≤ −1 for y_i = −1.

Solve the Constrained Minimization
▪ Standard Lagrangian method:
  L_P = \frac{1}{2}⟨w · w⟩ − \sum_{i=1}^{r} α_i [y_i(⟨w · x_i⟩ + b) − 1]   (41)
  where α_i ≥ 0 are the Lagrange multipliers.
▪ Optimization theory says that an optimal solution to (41) must satisfy certain conditions, called Kuhn-Tucker conditions, which are necessary (but not sufficient).
▪ Kuhn-Tucker conditions play a central role in constrained optimization.

Kuhn-Tucker Conditions
▪ Eq. (50) is the original set of constraints.
▪ The complementarity condition (52) shows that only those data points on the margin hyperplanes (i.e., H⁺ and H⁻) can have α_i > 0, since for them y_i(⟨w · x_i⟩ + b) − 1 = 0.
▪ These points are called the support vectors. For all the other points, α_i = 0.

Solve the Problem
▪ In general, Kuhn-Tucker conditions are necessary for an optimal solution, but not sufficient.
▪ However, for our minimization problem with a convex objective function and linear constraints, the Kuhn-Tucker conditions are both necessary and sufficient for an optimal solution.
▪ Solving the optimization problem is still a difficult task due to the inequality constraints.
▪ However, the Lagrangian treatment of the convex optimization problem leads to an alternative dual formulation of the problem, which is easier to solve than the original problem (called the primal).
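As a hedged illustration of the separable case, the snippet below trains a linear SVM on made-up, well-separated data (using scikit-learn's SVC, assumed to be available; a very large C approximates the hard-margin formulation) and reports the support vectors and the margin 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data (an assumption for illustration).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin (separable-case) formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("number of support vectors:", len(clf.support_vectors_))
print("margin 2/||w||:", 2.0 / np.linalg.norm(w))
print("prediction for (1, 1):", np.sign(w @ np.array([1.0, 1.0]) + b))
```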
Dual Formulation
▪ From primal to dual: set to zero the partial derivatives of the Lagrangian (41) with respect to the primal variables (i.e., w and b), and substitute the resulting relations back into the Lagrangian.
  – I.e., substitute (48) and (49) into the original Lagrangian (41) to eliminate the primal variables:
  L_D = \sum_{i=1}^{r} α_i − \frac{1}{2} \sum_{i,j=1}^{r} y_i y_j α_i α_j ⟨x_i · x_j⟩   (55)

Dual Optimization Problem
▪ This dual formulation is called the Wolfe dual.
▪ For the convex objective function and linear constraints of the primal, it has the property that the maximum of L_D occurs at the same values of w, b and α_i as the minimum of L_P (the primal).
▪ Solving (56) requires numerical techniques and clever strategies, which are beyond our scope.

The Final Decision Boundary
▪ After solving (56), we obtain the values for α_i, which are used to compute the weight vector w and the bias b using Equations (48) and (52) respectively.
▪ The decision boundary is
  ⟨w · x⟩ + b = \sum_{i \in sv} y_i α_i ⟨x_i · x⟩ + b = 0   (57)
▪ Testing: use (57). Given a test instance z, classify it by
  sign(⟨w · z⟩ + b) = sign\left(\sum_{i \in sv} y_i α_i ⟨x_i · z⟩ + b\right)   (58)
▪ If (58) returns 1, the test instance z is classified as positive; otherwise, it is classified as negative.

Linear SVM: Non-Separable Case
▪ The linearly separable case is the ideal situation.
▪ Real-life data may have noise or errors.
  – Class labels may be incorrect, or there may be randomness in the application domain.
▪ Recall that in the separable case, the problem was
  Minimize: ⟨w · w⟩ / 2
  Subject to: y_i(⟨w · x_i⟩ + b) ≥ 1, i = 1, 2, …, r.
▪ With noisy data, the constraints may not be satisfiable. Then there is no solution!

Relax the Constraints
▪ To allow errors in the data, we relax the margin constraints by introducing slack variables ξ_i (≥ 0) as follows:
  ⟨w · x_i⟩ + b ≥ 1 − ξ_i for y_i = 1
  ⟨w · x_i⟩ + b ≤ −1 + ξ_i for y_i = −1
▪ The new constraints:
  Subject to: y_i(⟨w · x_i⟩ + b) ≥ 1 − ξ_i, i = 1, …, r,
              ξ_i ≥ 0, i = 1, 2, …, r.

Geometric Interpretation
▪ Two error data points x_a and x_b (circled on the slide) lie in the wrong regions.

Penalize Errors in the Objective Function
▪ We need to penalize the errors in the objective function.
▪ A natural way of doing this is to assign an extra cost for errors, changing the objective function to
  Minimize: ⟨w · w⟩ / 2 + C \sum_{i=1}^{r} ξ_i^k   (60)
▪ k = 1 is commonly used, which has the advantage that neither ξ_i nor its Lagrange multipliers appear in the dual formulation.

New Optimization Problem
▪ This formulation is called the soft-margin SVM:
  Minimize: ⟨w · w⟩ / 2 + C \sum_{i=1}^{r} ξ_i   (61)
  Subject to: y_i(⟨w · x_i⟩ + b) ≥ 1 − ξ_i, i = 1, 2, …, r
              ξ_i ≥ 0, i = 1, 2, …, r
▪ The primal Lagrangian is
  L_P = \frac{1}{2}⟨w · w⟩ + C \sum_{i=1}^{r} ξ_i − \sum_{i=1}^{r} α_i [y_i(⟨w · x_i⟩ + b) − 1 + ξ_i] − \sum_{i=1}^{r} μ_i ξ_i   (62)
  where α_i, μ_i ≥ 0 are the Lagrange multipliers.

Kuhn-Tucker Conditions

From Primal to Dual
▪ As in the linearly separable case, we transform the primal to a dual by setting to zero the partial derivatives of the Lagrangian (62) with respect to the primal variables (i.e., w, b and ξ_i), and substituting the resulting relations back into the Lagrangian.
▪ I.e., we substitute Equations (63), (64) and (65) into the primal Lagrangian (62).
▪ From Equation (65), C − α_i − μ_i = 0, we can deduce that α_i ≤ C because μ_i ≥ 0.

Dual
▪ The dual of (61) is given in (72).
▪ Interestingly, ξ_i and its Lagrange multipliers μ_i are not in the dual. The objective function is identical to that of the separable case.
▪ The only difference is the constraint α_i ≤ C.
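The soft-margin primal (61) is equivalent to minimizing ½⟨w · w⟩ + C Σ_i max(0, 1 − y_i(⟨w · x_i⟩ + b)), since at the optimum ξ_i = max(0, 1 − y_i(⟨w · x_i⟩ + b)). The sketch below exploits that hinge-loss view with a simple stochastic subgradient scheme rather than solving the dual (72); the toy data, learning rate, and epoch count are assumptions, and this is only an approximate illustration, not the lecture's solver.

```python
import numpy as np

def soft_margin_svm_sgd(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)) by stochastic
    subgradient descent; the hinge terms play the role of the slack variables xi_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (w @ X[i] + b) < 1:        # point violates the (soft) margin
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                                # only the regularizer contributes
                w -= lr * w
    return w, b

# Noisy, not perfectly separable toy data (an assumption for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w, b = soft_margin_svm_sgd(X, y, C=1.0)
print(w, b, np.mean(np.sign(X @ w + b) == y))    # learned parameters and training accuracy
```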
Find Primal Variable Values
▪ The dual problem (72) can be solved numerically.
▪ The resulting α_i values are then used to compute w and b. w is computed using Equation (63) and b is computed using the Kuhn-Tucker complementarity conditions (70) and (71).
▪ Since we have no values for ξ_i, we need to get around this.
  – From Equations (65), (70) and (71), we observe that if 0 < α_i < C then both ξ_i = 0 and y_i(⟨w · x_i⟩ + b) − 1 + ξ_i = 0. Thus, we can use any training data point for which 0 < α_i < C and Equation (69) (with ξ_i = 0) to compute b:
  b = \frac{1}{y_i} − \sum_{j=1}^{r} y_j α_j ⟨x_j · x_i⟩   (73)

(65), (70) and (71) in Fact Tell Us More
▪ (74) shows a very important property of SVM:
  – The solution is sparse in α_i. Many training data points are outside the margin area and their α_i values in the solution are 0.
  – Only those data points that are on the margin (i.e., y_i(⟨w · x_i⟩ + b) = 1, which are the support vectors in the separable case), inside the margin (i.e., α_i = C and y_i(⟨w · x_i⟩ + b) < 1), or errors have non-zero α_i.
  – Without this sparsity property, SVM would not be practical for large data sets.

The Final Decision Boundary
▪ The final decision boundary is (noting that many α_i values are 0)
  ⟨w · x⟩ + b = \sum_{i=1}^{r} y_i α_i ⟨x_i · x⟩ + b = 0   (75)
▪ The decision rule for classification (testing) is the same as in the separable case, i.e., sign(⟨w · x⟩ + b).
▪ Finally, we also need to determine the parameter C in the objective function. It is normally chosen through the use of a validation set or cross-validation.

How to Deal with Nonlinear Separation?
▪ The SVM formulations require linear separation.
▪ Real-life data sets may need nonlinear separation.
▪ To deal with nonlinear separation, the same formulation and techniques as for the linear case are still used.
▪ We only transform the input data into another space (usually of a much higher dimension) so that a linear decision boundary can separate positive and negative examples in the transformed space.
▪ The transformed space is called the feature space. The original data space is called the input space.

Space Transformation
▪ The basic idea is to map the data in the input space X to a feature space F via a nonlinear mapping φ:
  φ : X → F,  x ↦ φ(x)   (76)
▪ After the mapping, the original training data set {(x_1, y_1), (x_2, y_2), …, (x_r, y_r)} becomes
  {(φ(x_1), y_1), (φ(x_2), y_2), …, (φ(x_r), y_r)}   (77)

Geometric Interpretation
▪ In this example, the transformed space is also 2-D. But usually, the number of dimensions in the feature space is much higher than that in the input space.

Optimization Problem in (61) Becomes
▪ The same soft-margin formulation, with each x_i replaced by φ(x_i).

An Example Space Transformation
▪ Suppose our input space is 2-dimensional, and we choose the following transformation (mapping) from 2-D to 3-D:
  (x_1, x_2) ↦ (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)
▪ The training example ((2, 3), −1) in the input space is transformed to the following in the feature space:
  ((4, 9, 6\sqrt{2}), −1) ≈ ((4, 9, 8.5), −1)

Problem with Explicit Transformation
▪ The potential problem with this explicit data transformation followed by applying the linear SVM is that it may suffer from the curse of dimensionality.
▪ The number of dimensions in the feature space can be huge for some useful transformations, even with reasonable numbers of attributes in the input space.
▪ This makes it computationally infeasible to handle.
▪ Fortunately, the explicit transformation is not needed.
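A short sketch of the explicit 2-D to 3-D map above, applied to the example point (2, 3) and to a few points inside and outside the unit circle; the specific points are assumptions, chosen because the transformed coordinates make a linear separation possible where the input space does not.

```python
import numpy as np

def phi(x):
    """Explicit 2-D -> 3-D map from the slides: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

# The training example ((2, 3), -1) mapped to the feature space.
print(phi(np.array([2.0, 3.0])))   # [4.0, 9.0, ~8.485]

# Points inside vs. outside the unit circle are not linearly separable in the
# input space, but after the mapping the plane z1 + z2 = 1 separates them,
# because phi(x)[0] + phi(x)[1] = x1^2 + x2^2.
inside  = np.array([[0.1, 0.2], [-0.3, 0.4]])
outside = np.array([[1.5, 0.0], [-1.0, 1.2]])
for x in np.vstack([inside, outside]):
    z = phi(x)
    print(x, "->", z, "side of z1+z2=1:", np.sign(z[0] + z[1] - 1.0))
```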
Kernel Functions
▪ Notice that in the dual formulation both
  – the construction of the optimal hyperplane (79) in F and
  – the evaluation of the corresponding decision function (80)
  only require dot products ⟨φ(x) · φ(z)⟩ and never the mapped vector φ(x) in its explicit form. This is a crucial point.
▪ Thus, if we have a way to compute the dot product ⟨φ(x) · φ(z)⟩ using the input vectors x and z directly, there is no need to know the feature vector φ(x) or even φ itself.
▪ In SVM, this is done through the use of kernel functions, denoted by K:
  K(x, z) = ⟨φ(x) · φ(z)⟩   (82)

An Example Kernel Function
▪ Polynomial kernel:
  K(x, z) = ⟨x · z⟩^d   (83)
▪ Let us compute the kernel with degree d = 2 in a 2-dimensional space: x = (x_1, x_2) and z = (z_1, z_2).
  ⟨x · z⟩^2 = (x_1 z_1 + x_2 z_2)^2
            = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
            = ⟨(x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) · (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2)⟩
            = ⟨φ(x) · φ(z)⟩   (84)
▪ This shows that the kernel ⟨x · z⟩^2 is a dot product in a transformed feature space.

Kernel Trick
▪ The derivation in (84) is only for illustration purposes.
▪ We do not need to find the mapping function.
▪ We can simply apply the kernel function directly by replacing all the dot products ⟨φ(x) · φ(z)⟩ in (79) and (80) with the kernel function K(x, z) (e.g., the polynomial kernel ⟨x · z⟩^d in (83)).
▪ This strategy is called the kernel trick.

Is It a Kernel Function?
▪ The question is: how do we know whether a function is a kernel without performing a derivation such as that in (84)? I.e.,
  – How do we know that a kernel function is indeed a dot product in some feature space?
▪ This question is answered by a theorem called Mercer’s theorem, which we will not discuss here.

Commonly Used Kernels
▪ It is clear that the idea of a kernel generalizes the dot product in the input space. The dot product itself is also a kernel, with the feature map being the identity.

Some Other Issues in SVM
▪ SVM works only in a real-valued space. For a categorical attribute, we need to convert its categorical values to numeric values.
▪ SVM does only two-class classification. For multi-class problems, some strategies can be applied, e.g., one-against-rest and error-correcting output coding.
▪ The hyperplane produced by SVM is hard for human users to understand. The matter is made worse by kernels. Thus, SVM is commonly used in applications that do not require human understanding.
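A numerical check of the derivation in (84): for random vectors (assumed here only for illustration), the degree-2 polynomial kernel computed directly in the input space equals the dot product of the explicitly mapped vectors, which is exactly what the kernel trick relies on.

```python
import numpy as np

def phi(x):
    # Feature map corresponding to the degree-2 polynomial kernel derived in (84).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def poly_kernel(x, z, d=2):
    # K(x, z) = <x . z>^d, computed directly in the input space.
    return (x @ z) ** d

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# The two quantities agree (up to floating-point error), so the dot product in
# the higher-dimensional feature space never has to be computed explicitly.
print(poly_kernel(x, z), phi(x) @ phi(z))
```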
