COS10022_Lecture 08_Naive Bayes & Model Validation.pdf
Full Transcript
Naïve Bayes Classifier & Model Evaluation (Part 1)
COS10022 Data Science Principles
Teaching materials co-developed by Pei-Wei Tsai ([email protected]) and WanTze Vong ([email protected]).

Learning Outcomes
This lecture supports the achievement of the following learning outcomes:
3. Describe the processes within the Data Analytics Lifecycle.
4. Analyse business and organisational problems and formulate them into data science tasks.
5. Evaluate suitable techniques and tools for specific data science tasks.
6. Develop an analytics plan for a given business case study.

Using Proper Techniques to Solve the Problems
The problem to solve -> the category of techniques:
- I want to group items by similarity; I want to find structure (commonalities) in the data -> Clustering
- I want to discover relationships between actions or items -> Association Rules
- I want to determine the relationship between the outcome and the input variables -> Regression
- I want to assign (known) labels to objects -> Classification
- I want to find the structure in a temporal process; I want to forecast the behavior of a temporal process -> Time Series Analysis
- I want to analyze my text data -> Text Analysis

Key Questions
- How does the Naïve Bayes algorithm work for predictive modeling?
- How do we evaluate the performance of various analytics models? What are some popular evaluation metrics?
- How do we perform cross-validation?
- What are some practical considerations when evaluating a model?

Phase 4 – Model Building
In the fourth phase of the Data Analytics Lifecycle, the data science team develops datasets for testing, training, and production purposes. The team builds and executes models based on the work done in the model planning phase. The team also considers whether the existing tools are sufficient to run the models, or whether a more robust environment for executing the models is needed (e.g. faster hardware, parallel processing).
Key activities: develop the analytical model, fit it on the training data, and evaluate its performance on the test data. The data science team can move to the next phase if the model is sufficiently robust to solve the problem, or if the team has failed.

Naïve Bayes Model
Naïve Bayes is a probabilistic classification model developed from Bayes' Theorem (Bayes' Law). It is 'naïve' because it assumes that the influence of a particular attribute value on the class assignment of an object is independent of the values of the other attributes. This assumption simplifies the computation of the model:

P(a1, a2, ..., am | ci) = P(a1 | ci) × P(a2 | ci) × ... × P(am | ci) = ∏_{j=1..m} P(aj | ci)

Example: suppose you have a bag of fruits and want to classify each fruit as either "apple" or "grape" based on its color and size. Naïve Bayes assumes that the color and size of a fruit are independent of each other, so it treats the probability of a fruit being an apple based on its color (e.g., red) separately from the probability based on its size (e.g., small):

P(A | C="apple") = P(color="red" | C="apple") × P(size="small" | C="apple")
P(A | C="grape") = P(color="red" | C="grape") × P(size="small" | C="grape")
IF P(A | C="apple") > P(A | C="grape") THEN class label = "apple"

Even though in reality the color and size of a fruit might be related (e.g., small apples are often red), Naïve Bayes simplifies the calculation by treating them as independent. This simplification makes the probabilities efficient to calculate and the classification straightforward (a short code sketch follows this slide).

Advantages: (a) easy to implement; (b) executes efficiently without prior knowledge about the data.
Disadvantages: (a) its strong assumption about the independence of attributes often gives bad results (i.e., poor prediction accuracy); (b) discretizing numerical values may result in a loss of useful information.
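The fruit comparison above can be written as a few lines of Python. This is a minimal sketch: the per-class probabilities are made-up placeholders (the slides give no numbers for this example); only the "multiply the per-attribute probabilities, then compare" logic follows the text.

```python
# Hypothetical P(attribute value | class) values, for illustration only.
likelihoods = {
    "apple": {"color=red": 0.70, "size=small": 0.40},
    "grape": {"color=red": 0.30, "size=small": 0.90},
}

observed = ["color=red", "size=small"]  # A = the observed attributes

scores = {}
for label, attr_probs in likelihoods.items():
    # Naive independence assumption: multiply the per-attribute probabilities.
    p = 1.0
    for attribute in observed:
        p *= attr_probs[attribute]
    scores[label] = p

prediction = max(scores, key=scores.get)  # pick the class with the larger P(A | C)
print(scores, "->", prediction)
```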
Naïve Bayes Model
Input variables: categorical variables. Numerical or continuous variables need to be discretized first, e.g.:
  income < $10,000 -> low income
  income ≥ $1,000,000 -> high income
Output: a class label, plus its corresponding probability score. Note that the class label is a categorical (non-numeric) variable.

Bayes' Theorem
Bayes' Theorem gives the relationship between the probabilities of two events and their conditional probabilities. The theorem is derived from conditional probability. The conditional probability of event C occurring, given that event A has already occurred, is denoted as P(C|A), for which the following formula applies:

P(C | A) = P(A ∩ C) / P(A)

Bayes' Theorem is algebraically derived from this conditional probability formula:

(1) P(C | A) = P(A ∩ C) / P(A)    ... the conditional probability of C given that A has occurred
(2) P(A | C) = P(A ∩ C) / P(C)    ... the conditional probability of A given that C has occurred
(3) P(A ∩ C) = P(A | C) · P(C)    ... rearranged from Equation 2

Substituting Equation 3 into Equation 1 gives Bayes' Theorem:

P(C | A) = P(A | C) · P(C) / P(A)

where C is the class label, C ∈ {c1, c2, ..., cn}, and A is the set of observed attributes, A = {a1, a2, ..., am}.

Bayes' Theorem: Example 1
John flies frequently and likes to upgrade his seat to first class. He has determined that if he checks in for his flight at least two hours early, the probability that he will get an upgrade is 0.75; otherwise, the probability that he will get an upgrade is 0.35. With his busy schedule, he checks in at least two hours before his flight only 40% of the time. Suppose John did not receive an upgrade on his most recent attempt; what is the probability that he did not arrive two hours early?

Let C = {John arrived at least two hours early} and A = {John received an upgrade}, such that ¬C = {John did not arrive two hours early} and ¬A = {John did not receive an upgrade}. We are given:
  P(A | C) = 0.75
  P(A | ¬C) = 0.35
  P(C) = 0.40
and we want P(¬C | ¬A).

By directly applying Bayes' Theorem, we can formulate the question mathematically as:

P(¬C | ¬A) = P(¬A | ¬C) · P(¬C) / P(¬A)

The rest of the problem is simply figuring out the probability scores of the terms on the right-hand side. Start with the simplest terms based on the available information:
- Since John checks in at least two hours early 40% of the time, P(C) = 0.4, so the probability of not checking in at least two hours early is P(¬C) = 1 – P(C) = 0.6.
- The probability that John received an upgrade given that he checked in early is P(A | C) = 0.75.
- The probability that John received an upgrade given that he did not check in early is P(A | ¬C) = 0.35, which allows us to compute the probability that he did not receive an upgrade under the same circumstance as P(¬A | ¬C) = 1 – P(A | ¬C) = 0.65.

We were not told the overall probability of John receiving an upgrade, P(A). Fortunately, using the terms figured out above, it can be calculated as follows:

P(A) = P(A ∩ C) + P(A ∩ ¬C)
     = P(C) · P(A | C) + P(¬C) · P(A | ¬C)
     = 0.4 × 0.75 + 0.6 × 0.35
     = 0.51

Since P(A) = 0.51, P(¬A) = 1 – P(A) = 0.49.

Finally, using Bayes' Theorem:

P(¬C | ¬A) = P(¬A | ¬C) · P(¬C) / P(¬A) = (0.65 × 0.6) / 0.49 ≈ 0.796

Answer: the probability that John did not arrive two hours early, given that he did not receive an upgrade, is approximately 0.796.
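The arithmetic in Example 1 is easy to check in code. A minimal sketch in plain Python, using only the probabilities stated in the example:

```python
# Worked check of Example 1 (John's upgrade).
p_c = 0.40              # P(C): checks in at least two hours early
p_not_c = 1 - p_c       # P(~C)
p_a_given_c = 0.75      # P(A|C): upgrade given early check-in
p_a_given_not_c = 0.35  # P(A|~C): upgrade given late check-in

# Law of total probability for P(A), then its complement.
p_a = p_c * p_a_given_c + p_not_c * p_a_given_not_c   # 0.51
p_not_a = 1 - p_a                                     # 0.49

# Bayes' Theorem for P(~C|~A).
p_not_a_given_not_c = 1 - p_a_given_not_c             # 0.65
p_not_c_given_not_a = p_not_a_given_not_c * p_not_c / p_not_a
print(round(p_not_c_given_not_a, 3))                  # ~0.796
```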
Bayes' Theorem: Example 2
Assume that a patient named Mary took a lab test for a certain disease and the result came back positive. The test returns a positive result in 95% of the cases in which the disease is actually present, and it returns a positive result in 6% of the cases in which the disease is not present. Furthermore, 1% of the entire population has this disease. What is the probability that Mary actually has the disease, given that the test is positive?

Let C = {having the disease} and A = {testing positive}, such that ¬C = {not having the disease} and ¬A = {testing negative}. We are given:
  P(A | C) = 0.95
  P(A | ¬C) = 0.06
  P(C) = 0.01
and we want P(C | A).

Slightly different from Example 1, this problem requires the probability P(C | A). Directly applying Bayes' Theorem, we translate the question as:

P(C | A) = P(A | C) · P(C) / P(A)

What we know:
- 1% of the population has the disease, hence P(C) = 0.01; conversely, the probability of not having the disease is P(¬C) = 1 – P(C) = 0.99.
- The probability that the test is positive given the presence of the disease is P(A | C) = 0.95.
- The probability that the test is positive given the absence of the disease is P(A | ¬C) = 0.06.

To compute P(A):

P(A) = P(A ∩ C) + P(A ∩ ¬C)
     = P(C) · P(A | C) + P(¬C) · P(A | ¬C)
     = 0.01 × 0.95 + 0.99 × 0.06
     = 0.0689

Finally, we can compute the probability P(C | A) as follows:

P(C | A) = P(A | C) · P(C) / P(A) = (0.95 × 0.01) / 0.0689 ≈ 0.1379

Answer: the probability that Mary actually has the disease, given that the test is positive, is only 13.79%. This result indicates that the lab test may not be a good one. The likelihood of having the disease was 1% when the patient walked in the door and only 13.79% when the patient walked out, which would suggest further tests.
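Example 2 follows the same pattern as Example 1, so the calculation generalises into a small helper. A minimal sketch; the function name `bayes_posterior` is ours, not from the slides:

```python
def bayes_posterior(prior_c, p_a_given_c, p_a_given_not_c):
    """Return P(C|A) from P(C), P(A|C) and P(A|~C) via Bayes' Theorem."""
    # Total probability of the observation A, then Bayes' Theorem.
    p_a = prior_c * p_a_given_c + (1 - prior_c) * p_a_given_not_c
    return p_a_given_c * prior_c / p_a

# Example 2 (Mary's lab test): prior 1%, true-positive rate 95%, false-positive rate 6%.
print(round(bayes_posterior(0.01, 0.95, 0.06), 4))  # ~0.1379
```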
Generalisation of the Bayes' Theorem
To derive the Naïve Bayes model, the Bayes' Theorem above must first be generalised. Assume that we have a dataset whose columns are the input variables (attributes) plus one output / class variable. The Bayes' Theorem assigns an appropriate class label ci to each object (record) in the dataset that has multiple attributes A = {a1, a2, ..., am}, such that the assigned class label corresponds to the largest value of P(ci | A). For example, if P(C="Yes" | A) > P(C="No" | A), the record receives the class label "Yes".

Mathematically, the generalised Bayes' Theorem is expressed as follows:

P(ci | A) = P(a1, a2, ..., am | ci) · P(ci) / P(a1, a2, ..., am),   i = 1, 2, ..., n

Naïve Bayes Classifier
The Naïve Bayes model is finally derived by applying two simplifications to the generalised Bayes' Theorem.

Simplification 1: apply the conditional independence assumption, whereby each attribute is conditionally independent of every other attribute given a class label ci. This naïve assumption simplifies the computation of P(a1, a2, ..., am | ci) as follows:

P(a1, a2, ..., am | ci) = P(a1 | ci) × P(a2 | ci) × ... × P(am | ci) = ∏_{j=1..m} P(aj | ci)

Simplification 2: ignore the denominator P(a1, a2, ..., am) in the generalised formula, as its value is unchanged regardless of the class label. Removing this denominator has no impact on the relative probability scores of the classes, while at the same time further simplifying the computation of the Naïve Bayes model.

Following these two simplifications, the probability of a class label ci given a set of attributes a1, a2, ..., am is proportional to the product of the P(aj | ci) terms multiplied by P(ci):

P(ci | A) ∝ P(ci) · ∏_{j=1..m} P(aj | ci),   i = 1, 2, ..., n

(The symbol ∝ means 'directly proportional to'. The left-hand side is the probability of a class label ci given the attributes a1, a2, ..., am; the right-hand side is the product of the P(aj | ci) terms multiplied by P(ci).)

Where does Naïve Bayes fit in Data Science?
Machine Learning -> Supervised Learning -> Classification -> Naïve Bayes (as opposed to Unsupervised Learning, or to Regression within Supervised Learning).
Where to use the Naïve Bayes classifier? (Source: https://www.youtube.com/watch?v=l3dZ6ZNFjo0)

Shopping Example – Problem Statement
Predict whether a person will purchase a product given a specific combination of Day, Discount, and Free Delivery, using the Naïve Bayes classifier.

Shopping Example – Dataset
- The dataset contains 30 observations (30 records).
- There are three predictors (Day, Discount, and Free Delivery) and one target / class label (Purchase).
- In the big data era we are not looking at three predictors any more; there could be thirty or more columns and millions of records.

Shopping Example – Frequency Table
Based on the dataset containing the three input variables Day, Discount, and Free Delivery, we populate a frequency table for each attribute (see the sketch below for how such tables can be built in code).
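A minimal sketch of how frequency and likelihood tables can be populated with pandas. The toy records below are invented for illustration (the full 30-record table from the slides is not reproduced in this transcript); only the cross-tabulation technique is the point here.

```python
import pandas as pd

# Toy stand-in for the shopping dataset; values are made up, not the slides' 30 records.
df = pd.DataFrame({
    "Day":      ["Weekday", "Weekend", "Holiday", "Weekday", "Holiday", "Weekend"],
    "Discount": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Purchase": ["Buy", "No Buy", "Buy", "Buy", "No Buy", "Buy"],
})

# Frequency table: counts of each Day value per class.
freq_day = pd.crosstab(df["Day"], df["Purchase"])
print(freq_day)

# Likelihood table: P(Day value | class), i.e. each column divided by its class total.
likelihood_day = freq_day / freq_day.sum()
print(likelihood_day)
```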
Shopping Example – Frequency Table
For the Bayes' Theorem, let the event "Buy" be "A" and the independent variables "Discount", "Free Delivery", and "Day" be "B".

Shopping Example – Likelihood Table
Let's calculate the likelihood table for one of the variables, Day, which takes the values "Weekday", "Weekend", and "Holiday". From the frequency table:
  P(B) = P(Weekday) = 11/30 ≈ 0.37
  P(No Buy) = 6/30 = 0.2                    P(Buy) = 24/30 = 0.8
  P(Weekday | No Buy) = 2/6 ≈ 0.33          P(Weekday | Buy) = 9/24 ≈ 0.38
Based on this likelihood table, we can calculate the conditional probabilities:
  P(No Buy | Weekday) = P(Weekday | No Buy) × P(No Buy) / P(Weekday) = (0.33 × 0.2) / 0.37 ≈ 0.18
  P(Buy | Weekday)    = P(Weekday | Buy) × P(Buy) / P(Weekday)       = (0.38 × 0.8) / 0.37 ≈ 0.82
As P(Buy | Weekday) is greater than P(No Buy | Weekday), we can conclude that a customer will most likely buy the product on a weekday.

Shopping Example – Likelihood Table (2)
Now that we know how to calculate a likelihood table, we can do the same for the remaining variables (Likelihood Table 2 for Discount, Likelihood Table 3 for Free Delivery). Let's use the three likelihood tables to calculate whether a customer will purchase a product given a specific combination of Day, Discount, and Free Delivery.

Shopping Example – Naïve Bayes Classifier
Take the following combination of factors: Day = Holiday, Discount = Yes, Free Delivery = Yes.

Let A = No Buy:
P(A | B) = P(No Buy | Discount=Yes, Free Delivery=Yes, Day=Holiday)
         = [P(Discount=Yes | No Buy) × P(Free Delivery=Yes | No Buy) × P(Day=Holiday | No Buy) × P(No Buy)]
           ÷ [P(Discount=Yes) × P(Free Delivery=Yes) × P(Day=Holiday)]
         = (1/6 × 2/6 × 3/6 × 6/30) ÷ (20/30 × 23/30 × 11/30)
         ≈ 0.0296

Let A = Buy:
P(A | B) = P(Buy | Discount=Yes, Free Delivery=Yes, Day=Holiday)
         = [P(Discount=Yes | Buy) × P(Free Delivery=Yes | Buy) × P(Day=Holiday | Buy) × P(Buy)]
           ÷ [P(Discount=Yes) × P(Free Delivery=Yes) × P(Day=Holiday)]
         = (19/24 × 21/24 × 8/24 × 24/30) ÷ (20/30 × 23/30 × 11/30)
         ≈ 0.9857

Based on the calculation, the probability of purchase is 0.986 and the probability of no purchase is 0.03. These are the conditional probabilities of purchase on this day. Let's now normalise them to get the likelihood of the two events:
  Likelihood of Purchase    = 0.986 / (0.986 + 0.03) ≈ 97.05%
  Likelihood of No Purchase = 0.03  / (0.986 + 0.03) ≈ 2.95%
As 97.05% is greater than 2.95%, we can conclude that an average customer will buy on a holiday with a discount and free delivery. (A code sketch reproducing this calculation follows the lists below.)

Advantages of the Naïve Bayes Classifier
- Very simple and easy to implement.
- Not sensitive to irrelevant features.
- Needs less training data.
- As it is fast, it can be used for real-time predictions.
- Handles both continuous and discrete data.
- Highly scalable with the number of predictors and data points.

Disadvantages of the Naïve Bayes Classifier
- Its strong assumption about the independence of attributes often gives bad results (i.e. poor prediction accuracy).
- Discretising numerical values may result in a loss of useful information (lower resolution).
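The shopping-example arithmetic above is easy to reproduce in code. A minimal sketch using only the counts quoted on the slides (6 "No Buy" and 24 "Buy" records out of 30, the per-class counts for Discount=Yes, Free Delivery=Yes, and Day=Holiday, and the overall counts 20, 23, and 11):

```python
from math import prod

total = 30
counts = {
    # class: (class count, [counts of Discount=Yes, Free Delivery=Yes, Day=Holiday within the class])
    "No Buy": (6,  [1, 2, 3]),
    "Buy":    (24, [19, 21, 8]),
}
evidence = [20, 23, 11]  # overall counts of Discount=Yes, Free Delivery=Yes, Day=Holiday

p_evidence = prod(c / total for c in evidence)  # P(B) under the independence assumption

posteriors = {}
for label, (class_count, attr_counts) in counts.items():
    prior = class_count / total
    likelihood = prod(c / class_count for c in attr_counts)  # product of P(attribute | class)
    posteriors[label] = prior * likelihood / p_evidence      # ~0.0296 for No Buy, ~0.9857 for Buy

# Normalise so the two scores sum to 1, as on the slides.
norm = sum(posteriors.values())
for label, score in posteriors.items():
    print(label, round(score, 4), f"normalised: {score / norm:.2%}")
```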
Naïve Bayes Classifier & Model Evaluation (Part 2)
COS10022 Data Science Principles

Model Evaluation
Evaluation is an important activity in Phase 4 because it allows us to determine whether a chosen analytic model has performed well in meeting our analytical needs. Model evaluation is often called 'model validation'; sometimes the term 'model diagnostics' is also used. Evaluating the performance of an analytics model requires some basic knowledge of common evaluation metrics and methods. For instance, a data scientist should know the differences between evaluating supervised and unsupervised models.

Supervised Models: Metrics and Methods
Popular metrics for evaluating the performance of supervised models:
1. Accuracy
2. True Positive Rate (TPR)
3. False Positive Rate (FPR)
4. False Negative Rate (FNR)
5. Precision
6. Area Under the Curve (AUC)
These metrics can be calculated from a confusion matrix.

A confusion matrix is a specific table layout that allows the visualisation of the performance of a supervised model / predictive model / classifier. For a 2-class (positive, negative) classifier, the basic layout of a confusion matrix is:

                        Predicted Positive        Predicted Negative
  Actual Positive       True Positives (TP)       False Negatives (FN)
  Actual Negative       False Positives (FP)      True Negatives (TN)

True Positives (TP): the number of positive instances that a classifier correctly classifies as positive.
False Positives (FP): the number of instances that a classifier identifies as positive but that in reality are negative.
True Negatives (TN): the number of negative instances that a classifier correctly identifies as negative.
False Negatives (FN): the number of instances classified as negative but that in reality are positive.
TP and TN are correct predictions. A good classifier should have large TP and TN counts and small (ideally zero) FP and FN counts.

Example 1. A confusion matrix of a Naïve Bayes classifier for 100 customers, predicting whether they would subscribe to a term deposit (refer to the Naïve Bayes example in the previous slides):

                            Predicted: Subscribed       Predicted: Not Subscribed      Total
  Actual: Subscribed        3 (correct prediction)      8 (error)                      11
  Actual: Not Subscribed    2 (error)                   87 (correct prediction)        89
  Total                     5                           95                             100

Metric 1: Accuracy ("overall success rate")
Accuracy defines the rate at which a model has classified the records correctly:

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100%

A good model should have a high accuracy (80% and above). However, accuracy alone does not guarantee that a model is well established; other metrics can be introduced to better evaluate the performance of a supervised model.

Metric 2: True Positive Rate (TPR), also known as "Recall"
TPR shows what percentage of positive instances a classifier correctly identified:

TPR = TP / (TP + FN)

Metric 3: False Positive Rate (FPR)
FPR shows what percentage of negatives a classifier marks as positive. FPR is also known as the "false alarm rate" or "Type I error rate":

FPR = FP / (FP + TN)

Metric 4: False Negative Rate (FNR)
FNR shows what percentage of positive instances a classifier marks as negative. FNR is also known as the "miss rate" or "Type II error rate":

FNR = FN / (FN + TP)

Note that the sum of TPR and FNR is 1. A well-performing model should have a high TPR (ideally 1) and a low FPR and FNR (ideally 0). In reality, however, it is rare to obtain these ideal scores.
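A minimal sketch that computes these rates directly from the Example 1 confusion matrix (TP=3, FN=8, FP=2, TN=87); plain Python, nothing assumed beyond the formulas above:

```python
# Counts from the Example 1 confusion matrix.
tp, fn, fp, tn = 3, 8, 2, 87

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.90
tpr = tp / (tp + fn)                        # recall, ~0.273
fpr = fp / (fp + tn)                        # false alarm / Type I error rate, ~0.022
fnr = fn / (fn + tp)                        # miss rate / Type II error rate, ~0.727

print(f"accuracy={accuracy:.0%}  TPR={tpr:.3f}  FPR={fpr:.3f}  FNR={fnr:.3f}")
```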
Metric 5: Precision
Precision is the percentage of instances marked positive that really are positive:

Precision = TP / (TP + FP)

A good model should have a high precision.

Example 2. Based on the confusion matrix in Example 1, we can compute the above metrics as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100% = (3 + 87) / (3 + 87 + 2 + 8) × 100% = 90%
TPR (Recall) = TP / (TP + FN) = 3 / (3 + 8) ≈ 0.273   (not good)
FPR = FP / (FP + TN) = 2 / (2 + 87) ≈ 0.022
FNR = FN / (FN + TP) = 8 / (8 + 3) ≈ 0.727
Precision = TP / (TP + FP) = 3 / (3 + 2) = 0.6

Metric 6: Area Under the Curve (AUC)
The AUC score measures the area under the ROC (Receiver Operating Characteristic) curve. An ROC curve measures the performance of a classifier based on the TPR and FPR. Higher AUC scores mean the classifier performs better, as TPR gets nearer to 1 and FPR gets nearer to 0. A rough guide for interpreting AUC scores:
  0.90 – 1.00: excellent
  0.80 – 0.90: good
  0.70 – 0.80: fair
  0.60 – 0.70: poor
  0.50 – 0.60: fail
(Source: http://gim.unmc.edu/dxtests/roc3.htm)

There are various ways in which the previously discussed metrics can be used to evaluate a model. One of the most well-established methods for evaluating supervised models is cross-validation. Cross-validation involves dividing a dataset into training and test sets. A model is then trained (built) on the training set, and its performance is evaluated on the test set, where the various evaluation metrics are calculated. The goal of cross-validation is to get a true measurement of a supervised model's performance on data that the model has never seen before.

There are different types of cross-validation.

Holdout cross-validation ("percentage split")
This is the simplest kind of cross-validation, which separates a dataset into training and test sets according to pre-specified percentages (e.g. 80/20, 70/30, 60/40, or 50/50 train/test splits). Problem: a model's performance may vary depending on how the dataset is split.

K-fold cross-validation
The dataset is divided into k subsets (folds) of approximately equal size, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form the training set. Finally, the average error across all k trials is computed. In contrast to the holdout method, in k-fold cross-validation it does not really matter how the data gets divided, which leads to more consistent evaluation results; however, this evaluation method is more costly to run. A code sketch follows this section.

Example: 10-fold cross-validation. (Diagram: in each of the 10 trials a different fold serves as the test set while the other nine folds form the training set; the number of records in each fold is determined by the size of the dataset.)

Leave-one-out cross-validation
This method takes one record as the test set and uses all remaining records as the training set. The validation is performed once for each record in the dataset. Given N records, this method runs in a way similar to k-fold cross-validation, except that k = N. As such, it is usually more costly, especially when there is a large number of records in the dataset.
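A minimal sketch of k-fold cross-validation, assuming scikit-learn is available (the slides do not prescribe a tool). It scores a Gaussian Naïve Bayes classifier with 10-fold cross-validation on a small built-in dataset used purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# 10 folds: each record is used for testing exactly once across the 10 trials.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")

print(scores)                              # one accuracy score per fold
print("mean accuracy:", scores.mean())     # average performance across the 10 trials

# Leave-one-out is the special case k = N, e.g. KFold(n_splits=len(X))
# or sklearn.model_selection.LeaveOneOut().
```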
Example: leave-one-out cross-validation. (Diagram: each of the N trials uses exactly one record as the test set and all remaining records as the training set; each test segment contains exactly one record.)

Validation Set
A more sophisticated cross-validation method introduces a validation set in addition to the training and test sets. For example, instead of splitting a dataset into an 80% training set and a 20% test set, the dataset may be split into a 60% training set, a 20% validation set, and a 20% test set (a split sketch is given at the end of this section).

The motivation behind using a validation set: certain types of classifiers have parameters, and depending on the values you set for these parameters, they can greatly improve or degrade the performance of the classifier. With a validation set, a classifier is first built on the training set (build the model). Since there is no guarantee that the built model uses the best parameter values, the validation set is subsequently used to explore and determine the parameter values that give the classifier its best performance (improve the model). Once the classifier has been properly fine-tuned with the help of the validation set, its true performance is finally evaluated on the test set (evaluate the model).

Practical Considerations
How do we know if an evaluation result is "good enough"? This depends on the business situation. During the Discovery phase (Phase 1), the data science team should have learned from the business what kinds of errors can be tolerated. Some business situations are more tolerant of Type I errors, whereas others may be more tolerant of Type II errors. In some cases, a model with a TPR of 0.95 and an FPR of 0.3 is more acceptable than a model with a TPR of 0.9 and an FPR of 0.1, even if the second model has a higher overall accuracy score.

Consider the performance of a Naïve Bayes classifier used for automatic e-mail spam filtering (positive means "spam"). A busy executive only wants important emails in her Inbox. She does not mind some less important emails ending up in the Spam folder (even if they are not spam), as long as there is no spam in the Inbox. Which is more important to avoid in this case, the Type I error or the Type II error?
Answer: the Type II error. Therefore, a good Naïve Bayes classifier here should demonstrate a low FNR score.

  Actual Spam (Positive), predicted Spam: spam emails correctly classified into the Spam folder (TP).
  Actual Spam (Positive), predicted Not Spam: spam emails incorrectly classified as important emails in the Inbox (FN).
  Actual Not Spam (Negative), predicted Spam: important emails incorrectly classified into the Spam folder (FP).
  Actual Not Spam (Negative), predicted Not Spam: important emails correctly classified into the Inbox (TN).
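A minimal sketch of the 60/20/20 train/validation/test split described above, assuming scikit-learn's train_test_split (the slides do not name a tool):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# First carve off the 20% test set, then split the remaining 80% into
# 60% training and 20% validation (0.25 of 80% = 20% of the whole).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20% of the records
```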
Continuing from the previous example, another, less busy executive may not want any of his important emails to end up in the Spam folder and is willing to have some spam in his Inbox. Which is more important to avoid in this case, the Type I error or the Type II error?
Answer: the Type I error. Therefore, a good Naïve Bayes classifier here should demonstrate a low FPR score. (The confusion matrix is the same as in the previous example.)

Detecting the problem of overfitting
One of the main motivations for performing cross-validation is to detect the overfitting problem. Overfitting refers to the situation where a model performs very well on the training set but fails terribly on the test set. In practice, a model that suffers from overfitting is undesirable, as it does not help us much in dealing with previously unknown scenarios.

Unsupervised Model Evaluation
Evaluating the performance of unsupervised models (e.g. K-Means clustering) differs from the way supervised models are evaluated. In addition to asking domain experts to subjectively inspect the results of an unsupervised model and provide feedback, there are objective evaluation metrics that can be used. These metrics, however, usually differ from one unsupervised model to another. For example:
- K-Means clustering: Silhouette, Within Sum of Squares (WSS)
- Association rules: support, confidence, lift, leverage

Texts and Resources
Unless stated otherwise, the materials presented in this lecture are taken from: Dietrich, D. (ed.), 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. EMC Education Services.