
MLAI_EPDT_Batch-06_3-4.pdf


Full Transcript


Artificial Intelligence and Machine Learning for Business (AIMLB)
Mukul Gupta (Information Systems Area)

Machine Learning challenges
- Prediction error
- Bias
- Variance
- Overfitting and underfitting
- Selection and tuning of a model

Prediction error
- Total error = reducible error + irreducible error.
- Reducible error consists of bias error and variance error.
- Irreducible error is the Bayes' error rate (the optimum error rate): the lower limit on the error that you can get with any classifier. A classifier that achieves this error rate is an optimal classifier.

Bias and Variance
- Bias error: how far the predicted value is from the true value; the systematic error of the model. It is about the model and the data itself.
- Variance error: the error caused by sensitivity to small variations in the training data set; the dispersion of predicted values over target values across different training sets. It is about the model's sensitivity.
[Figure slide: bias and variance illustrated.]

Overfitting and Underfitting
- Overfitting occurs when the model captures the noise and the outliers in the data along with the underlying pattern. These models usually have high variance and low bias.
- Underfitting occurs when the model is unable to capture the underlying pattern of the data. These models usually have low variance and high bias.
(A code sketch after this section illustrates both failure modes.)
[Figure slides: examples of overfitting and underfitting; the bias-variance curve.]

Model Selection
- Is it suitable for my type of data?
- Does it produce an accurate model?
- How does it do bias/variance-wise?
- Is it sophisticated enough to capture patterns in my data without overfitting?
- Does it allow determination of feature importance?
- Is it easy to use?

Model Selection and Tuning
ML is all about training (and validating) and then testing the model.
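To make the bias-variance and overfitting discussion concrete, here is a minimal sketch (not from the slides) that fits polynomials of increasing degree to noisy one-dimensional data. It assumes NumPy and scikit-learn are available; the data-generating function and degrees are our own choices. A degree-1 model typically underfits (high bias), while a degree-15 model drives training error down but test error up (high variance).

```python
# Illustrative sketch: under- vs. overfitting with polynomial regression.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 60)  # pattern + noise

x_train, y_train, x_test, y_test = x[:40], y[:40], x[40:], y[40:]

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_train, model.predict(x_train)):.3f}  "
          f"test MSE={mean_squared_error(y_test, model.predict(x_test)):.3f}")
```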
What makes a loan risky? Assessing Credit Risk
- Situation: a person applies for a loan.
- Task: should the bank approve the loan?
- Note: usually, the people with the best credit don't need loans, and the people with the worst credit are not likely to repay. The bank's best customers are in the middle.

Credit history explained
- Credit history: did I pay previous loans on time? Example: excellent, good, or fair.
- Income: what is my income? Example: $80K per year.
- Loan terms: how soon do I need to repay the loan? Example: 3 years, 5 years, ...
- Personal information: age, reason for the loan, marital status, ... Example: a home loan for a married couple.

Credit Risk - Results
- Banks develop credit risk models using a variety of machine learning methods.
- The proliferation of mortgages and credit cards is the result of being able to successfully predict whether a person is likely to default on a loan.
- Widely used in many countries.
[Figure slides: an intelligent loan application; credit-risk results.]

Classification: Classifier review

Classification – The Problem
Given a set of training cases/objects and their attribute values, try to determine the target attribute value of new examples.

Training set (a learning algorithm induces a model from it):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (the learned model is applied to it by deduction):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification: Definition
- Given a collection of records (the training set), each record is characterized by a tuple $(\boldsymbol{x}, y)$, where $\boldsymbol{x}$ is the attribute set and $y$ is the class label:
  - $\boldsymbol{x}$: attributes, predictors, independent variables, input
  - $y$: class, response, dependent variable, output
- Task: learn a model that maps each attribute set $\boldsymbol{x}$ into one of the predefined class labels $y$.

Other examples of classification tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Email spam filtering
- Sentiment classification, and many more

Classification Techniques
- Base classifiers: decision-tree-based methods, rule-based methods, nearest neighbor, naïve Bayes, support vector machines, neural networks and deep neural nets.
- Ensemble classifiers: bagging (random forests), boosting (XGBoost).

Why decision tree?
- Decision trees are powerful and popular tools for classification and prediction.
- Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases.
[Figure slides: an example decision tree on retail data and the corresponding ruleset model.]

Decision Tree: Terminology – Key requirements
- Attribute-value description: an object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).
- Predefined classes (target values): the target function has discrete output values (binary or multiclass).
- Sufficient data: enough training cases should be provided to learn the model.

Example of a Decision Tree
Training data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model (splitting attributes shown at each internal node):
- Home Owner = Yes → NO
- Home Owner = No:
  - Marital Status = Married → NO
  - Marital Status = Single or Divorced:
    - Annual Income < 80K → NO
    - Annual Income > 80K → YES

Apply Model to Test Data
Start from the root of the tree and route the test record down the branch matching each of its attribute values.
Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
Home Owner = No → Marital Status = Married → assign Defaulted Borrower = "No".
(A minimal encoding of this tree as code follows.)
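The tree above is just nested conditionals. The sketch below is our own encoding of the slide's model (the function and parameter names are hypothetical); tracing the test record through it reproduces the "No" prediction.

```python
# The example decision tree from the slides as nested conditionals.
def predict_default(home_owner: str, marital_status: str, income_k: float) -> str:
    """Classify a loan applicant; income_k is annual income in $K."""
    if home_owner == "Yes":
        return "No"                       # home owners -> NO
    if marital_status == "Married":
        return "No"                       # married non-owners -> NO
    # Single or Divorced non-owners: split on income at 80K
    return "No" if income_k < 80 else "Yes"

# The slide's test record: No / Married / 80K
print(predict_default("No", "Married", 80))  # -> No
```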
Decision Tree Classification Task
[Figure slide: the same training/test tables as above, now with a tree-induction algorithm learning a decision tree model from the training set and the tree being applied to the test set by deduction.]

Induction of Decision Trees
- Data set (learning set): each example = attributes + class.
- Induced description = decision tree.
- TDIDT: Top-Down Induction of Decision Trees (recursive partitioning).

To construct a decision tree T from a learning set S:
- If all examples in S belong to some class C, make a leaf labeled C.
- Otherwise, select the "most informative" attribute A, partition S according to A's values v1, v2, ..., vn, and recursively construct subtrees T1, T2, ..., Tn for the resulting subsets of S.

Conditions for stopping partitioning
- All samples at a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
- There are no samples left.

Brief Review of Entropy
Entropy measures the degree of randomness in data. For a set of samples $X$ with $k$ classes,
$$\mathrm{entropy}(X) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
where $p_i$ is the proportion of elements of class $i$. Lower entropy implies greater predictability!

Attribute Selection Measure: Information Gain (ID3)
The information gain of an attribute $a$ is the expected reduction in entropy due to splitting on the values of $a$:
$$\mathrm{gain}(X, a) = \mathrm{entropy}(X) - \sum_{v \in \mathrm{Values}(a)} \frac{|X_v|}{|X|}\,\mathrm{entropy}(X_v)$$
where $X_v$ is the subset of $X$ for which $a = v$. Best attribute = highest information gain.

Worked example (seven animals: 3 mammals and 4 birds, with attributes color ∈ {brown, white} and fly ∈ {yes, no}; in practice, we compute entropy(X) only once):
$$\mathrm{entropy}(X) = -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} \approx 0.985$$
$$\mathrm{entropy}(X_{\text{color=brown}}) = -\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3} \approx 0.918, \qquad \mathrm{entropy}(X_{\text{color=white}}) = 1$$
$$\mathrm{gain}(X, \text{color}) = 0.985 - \tfrac{3}{7}\cdot 0.918 - \tfrac{4}{7}\cdot 1 \approx 0.020$$
$$\mathrm{entropy}(X_{\text{fly=yes}}) = 0, \qquad \mathrm{entropy}(X_{\text{fly=no}}) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811$$
$$\mathrm{gain}(X, \text{fly}) = 0.985 - \tfrac{3}{7}\cdot 0 - \tfrac{4}{7}\cdot 0.811 \approx 0.521$$
Here fly has the higher gain and would be selected. (A short script reproducing these numbers follows.)
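The sketch below (our own code) reproduces the slide's worked example. The exact animal records are our assumption, chosen to match the class counts used on the slides.

```python
# Entropy and information gain on the 7-animal example (3 mammals, 4 birds).
from collections import Counter
from math import log2

def entropy(labels):
    """entropy(X) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Expected reduction in entropy from splitting on `attribute`."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [  # hypothetical records consistent with the slide's counts
    {"color": "brown", "fly": "no"}, {"color": "white", "fly": "no"},
    {"color": "white", "fly": "no"},                                     # 3 mammals
    {"color": "brown", "fly": "yes"}, {"color": "brown", "fly": "yes"},
    {"color": "white", "fly": "yes"}, {"color": "white", "fly": "no"},   # 4 birds
]
labels = ["mammal"] * 3 + ["bird"] * 4

print(f"entropy(X)     = {entropy(labels):.3f}")                            # ~0.985
print(f"gain(X, color) = {information_gain(rows, labels, 'color'):.3f}")    # ~0.020
print(f"gain(X, fly)   = {information_gain(rows, labels, 'fly'):.3f}")      # ~0.521
```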
Limitation of information gain
The information gain measure is biased towards attributes with a large number of values.

Gain Ratio for Attribute Selection (C4.5)
C4.5 (a successor of ID3) uses the gain ratio to overcome this problem, normalizing the information gain by the split information:
$$\mathrm{SplitInfo}(X, a) = -\sum_{v \in \mathrm{Values}(a)} \frac{|X_v|}{|X|} \log_2\frac{|X_v|}{|X|}$$
$$\mathrm{GainRatio}(X, a) = \frac{\mathrm{gain}(X, a)}{\mathrm{SplitInfo}(X, a)}$$

Gini Impurity (CART)
Gini impurity measures how often a randomly chosen example would be incorrectly labeled if it were labeled randomly according to the label distribution. For a set of samples $X$ with $k$ classes,
$$\mathrm{gini}(X) = 1 - \sum_{i=1}^{k} p_i^2$$
where $p_i$ is the proportion of elements of class $i$. It can be used as an alternative to entropy for selecting attributes: best attribute = highest impurity decrease (and, as before, we compute gini(X) only once in practice).

Worked example (the same seven animals):
$$\mathrm{gini}(X) = 1 - \left(\tfrac{3}{7}\right)^2 - \left(\tfrac{4}{7}\right)^2 \approx 0.489$$
$$\mathrm{gini}(X_{\text{color=brown}}) = 1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2 \approx 0.444, \qquad \mathrm{gini}(X_{\text{color=white}}) = 0.5$$
$$\Delta\mathrm{gini}(X, \text{color}) = 0.489 - \tfrac{3}{7}\cdot 0.444 - \tfrac{4}{7}\cdot 0.5 \approx 0.013$$
$$\mathrm{gini}(X_{\text{fly=yes}}) = 0, \qquad \mathrm{gini}(X_{\text{fly=no}}) = 1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2 = 0.375$$
$$\Delta\mathrm{gini}(X, \text{fly}) = 0.489 - \tfrac{3}{7}\cdot 0 - \tfrac{4}{7}\cdot 0.375 \approx 0.274$$
Again fly gives the larger decrease. (A companion script follows.)
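A companion sketch to the information-gain script, repeating the split comparison with Gini impurity as in CART. The records are the same hypothetical animals as before, redefined here so the block stands alone.

```python
# Gini impurity and impurity decrease on the 7-animal example.
from collections import Counter

def gini(labels):
    """gini(X) = 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(rows, labels, attribute):
    """Impurity decrease achieved by splitting on `attribute`."""
    n = len(labels)
    decrease = gini(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        decrease -= (len(subset) / n) * gini(subset)
    return decrease

rows = [{"color": c, "fly": f} for c, f in [
    ("brown", "no"), ("white", "no"), ("white", "no"),                       # 3 mammals
    ("brown", "yes"), ("brown", "yes"), ("white", "yes"), ("white", "no")]]  # 4 birds
labels = ["mammal"] * 3 + ["bird"] * 4

print(f"gini(X)               = {gini(labels):.3f}")                         # ~0.489
print(f"gini decrease (color) = {gini_decrease(rows, labels, 'color'):.3f}") # ~0.013
print(f"gini decrease (fly)   = {gini_decrease(rows, labels, 'fly'):.3f}")   # ~0.274
```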
Pruning
- Pruning is a technique that reduces the size of a decision tree by removing branches that provide little predictive power.
- It is a regularization method that reduces the complexity of the final model, thereby reducing overfitting. Decision trees are prone to overfitting!
- Pruning methods:
  - Pre-pruning: stop the tree-building algorithm before it fully classifies the data.
  - Post-pruning: build the complete tree, then replace some non-leaf nodes with leaf nodes if this improves validation error.

Computing Information Gain for Continuous-Valued Attributes
- Let $A$ be a continuous-valued attribute. We must determine the best split point for $A$:
  - Sort the values of $A$ in increasing order.
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: $(a_i + a_{i+1})/2$ is the midpoint between the values $a_i$ and $a_{i+1}$.
  - The point with the minimum expected information requirement for $A$ is selected as the split point for $A$.
- Split: $D_1$ is the set of tuples in $D$ satisfying $A \leq \text{split\_point}$, and $D_2$ is the set of tuples satisfying $A > \text{split\_point}$.
A sketch of this procedure appears below.
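The split-point search can be written directly from this recipe. The sketch below uses made-up income data (our own, not from the slides) and redefines the entropy helper locally so the block is self-contained.

```python
# Choosing a split point for a continuous attribute: evaluate the midpoint
# between each pair of adjacent sorted values and keep the one with the
# lowest expected information requirement (weighted entropy).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_info, best_point = float("inf"), None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # no midpoint between equal values
        midpoint = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= midpoint]
        right = [lab for v, lab in pairs if v > midpoint]
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if info < best_info:
            best_info, best_point = info, midpoint
    return best_point

incomes = [60, 70, 75, 85, 90, 95, 100]   # hypothetical annual incomes ($K)
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No"]
print(best_split_point(incomes, labels))  # -> 80.0
```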
Handling missing values at training time
- Data sets might have samples with missing values for some attributes. Simply ignoring them would mean throwing away a lot of information.
- There are better ways of handling missing values:
  - Set them to the most common value.
  - Set them to the most probable value given the label. From the slide's example: $P(\text{yes} \mid \text{bird}) = 2/3 \approx 0.66$, $P(\text{no} \mid \text{bird}) = 1/3 \approx 0.33$, $P(\text{white} \mid \text{mammal}) = 1$, $P(\text{brown} \mid \text{mammal}) = 0$.
  - Add a new instance for each possible value.

Handling missing values at inference time
- When we encounter a node that checks an attribute with a missing value, we explore all branches and take the final prediction based on a (weighted) vote of the corresponding leaf nodes.
- Example from the slides: Loan? Not a student, 49 years old, unknown income, fair credit record → Yes.

Decision Boundaries
Decision trees produce non-linear decision boundaries.
[Figure slides: a decision tree's decision boundary; the training and inference pipelines.]

Model Evaluation and Selection
- Evaluation metrics: how can we measure accuracy? Are there other metrics to consider?
- Use a test set of class-labeled tuples, instead of the training set, when assessing accuracy.
- Methods for estimating a classifier's accuracy:
  - Holdout method, random subsampling
  - Cross-validation
  - Bootstrap (sampling with replacement), also called the 0.632 bootstrap: in each of $n$ draws, a particular instance has probability $1 - 1/n$ of not being picked, so its probability of ending up in the test data is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. This means the training data will contain approximately 63.2% of the distinct instances. (A numeric check appears below.)
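A quick numeric check (our own, not on the slides) of the 0.632 bootstrap figure:

```python
# The chance an instance is never drawn in n bootstrap draws with
# replacement is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows.
import math

for n in (10, 100, 10_000):
    p_out = (1 - 1 / n) ** n
    print(f"n={n:6d}  P(never picked)={p_out:.4f}  P(in training)={1 - p_out:.4f}")
print(f"limit: 1/e = {math.exp(-1):.4f}")  # 0.3679
```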
Classifier Evaluation Metrics: Confusion Matrix
Given $m$ classes, an entry $CM_{i,j}$ in a confusion matrix indicates the number of tuples in class $i$ that were labeled by the classifier as class $j$; extra rows/columns may provide totals.

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

Example confusion matrix:
Actual \ Predicted    buy_computer = yes   buy_computer = no   Total
buy_computer = yes    6954                 46                  7000
buy_computer = no     412                  2588                3000
Total                 7366                 2634                10000

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity, and Specificity
- Accuracy (recognition rate): the percentage of test-set tuples that are correctly classified: accuracy = (TP + TN) / All.
- Error rate = 1 − accuracy = (FP + FN) / All.
- Class imbalance problem: one class may be rare (e.g., fraud, or HIV-positive), with a significant majority of the negative class and a minority of the positive class. With P = TP + FN and N = FP + TN:
  - Sensitivity (true positive recognition rate) = TP / P.
  - Specificity (true negative recognition rate) = TN / N.

Classifier Evaluation Metrics: Precision, Recall, and F-measures
- Precision (exactness): what percentage of the tuples that the classifier labeled as positive are actually positive? Precision = TP / (TP + FP).
- Recall (completeness): what percentage of the positive tuples did the classifier label as positive? Recall = TP / (TP + FN).
- A perfect score is 1.0, and there is an inverse relationship between precision and recall.
- The $F$ measure ($F_1$ or $F$-score) is the harmonic mean of precision and recall: $F_1 = \dfrac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.

Classifier Evaluation Metrics: Example
Actual \ Predicted   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes         90             210           300     30.00 (sensitivity)
cancer = no          140            9560          9700    98.56 (specificity)
Total                230            9770          10000   96.50 (accuracy)

Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%. (The sketch below reproduces these numbers.)
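A small sketch (our own code) computing all of the above metrics from the cancer confusion matrix:

```python
# Metrics from the cancer example: TP=90, FN=210, FP=140, TN=9560.
tp, fn, fp, tn = 90, 210, 140, 9560
total = tp + fn + fp + tn

accuracy = (tp + tn) / total                  # 0.9650
error_rate = (fp + fn) / total                # 0.0350
sensitivity = tp / (tp + fn)                  # 0.3000 (= recall)
specificity = tn / (fp + tn)                  # 0.9856
precision = tp / (tp + fp)                    # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # 0.3396

print(f"accuracy={accuracy:.4f}  error rate={error_rate:.4f}")
print(f"sensitivity/recall={sensitivity:.4f}  specificity={specificity:.4f}")
print(f"precision={precision:.4f}  F1={f1:.4f}")
```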
Thank You