Decision Tree Classifier

Document Details


Author: Dr. Sanchali

Tags

decision trees, machine learning, statistical learning, classification

Summary

These lecture notes detail the concept of decision trees in machine learning. They cover the steps of decision tree induction, the formulas behind split criteria such as entropy and the Gini index, and their implementation.

Full Transcript

Statistical Machine Learning

TEXTBOOKS/LEARNING RESOURCES:
a) Masashi Sugiyama, Introduction to Statistical Machine Learning (1st ed.), Morgan Kaufmann, 2017. ISBN 978-0128021217.
b) T. M. Mitchell, Machine Learning (1st ed.), McGraw Hill, 2017. ISBN 978-1259096952.

REFERENCE BOOKS/LEARNING RESOURCES:
a) Richard Golden, Statistical Machine Learning: A Unified Framework (1st ed.), 2020.

Decision Tree Classifier

‒ Supervised classification technique.
‒ It builds a tree-like structure where:
  Node: a test on an attribute
  Branch: an outcome of the test
  Leaf node: a class label
‒ Decision trees are created using "tree induction algorithms".

Fig. Decision tree for selecting a research problem (a flowchart of yes/no tests such as "Supervisor agrees?", "Dataset available?", and "Can we collect a dataset?", each branch ending in "Select problem" or "Search other problem").

Tree Induction Algorithm

Step 1: Discretize the continuous-valued attributes. For example, age can be discretized as:
  A - under 18
  B - 18-40
  C - 41-65
  D - above 65
Step 2: Split the node by selecting the best attribute. (How do we select the best attribute?)
Step 3: Stop
  a) if all the instances belong to the same class, or
  b) if there is no remaining attribute to divide further.

Finding Best Split Attributes

‒ Assign some "goodness value" to each attribute.
‒ Rank the attributes by their goodness values.
‒ Select the attribute with the highest rank first.

Two ways of finding the goodness value:
‒ Split algorithm based on information theory
‒ Split algorithm based on the Gini index

Split Algorithm Based on Information Theory

This method is based on the calculation of entropy.
‒ Entropy is minimum when all values of the target attribute are the same.
‒ Entropy is maximum when all values of the target attribute are equally likely.

Entropy(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability of occurrence of class i, for i = 1, 2, ..., m (the number of classes).
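To make the entropy formula concrete, here is a minimal Python sketch that computes it from a list of class labels. The entropy helper name and the use of collections.Counter are illustrative choices, not something from the slides:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
        n = len(labels)
        counts = Counter(labels)  # class label -> frequency
        # Classes with zero frequency never appear in counts, which matches
        # the usual convention 0 * log2(0) = 0.
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # 3 "Yes" and 7 "No" labels, as in the training sample below.
    print(round(entropy(["Yes"] * 3 + ["No"] * 7), 4))  # 0.8813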
Gain(S, A) = I - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)

where I is the information (entropy) before the split and S_v is the subset of S for which attribute A takes value v.

Split Algorithm Based on Information Theory: Worked Example

Table: Training sample

Tid | Refund | Marital Status | Cheat
 1  | Yes    | Single         | No
 2  | No     | Married        | No
 3  | No     | Single         | No
 4  | Yes    | Married        | No
 5  | No     | Divorced       | Yes
 6  | No     | Married        | No
 7  | Yes    | Divorced       | No
 8  | No     | Single         | Yes
 9  | No     | Married        | No
 10 | No     | Single         | Yes

Step 1: Calculate the information before the split.
Sample space = 10; classes: Yes = 3, No = 7.

I = -P_y \log_2(P_y) - P_n \log_2(P_n)

where P_y is the probability of "Yes" and P_n is the probability of "No".

I = -(3/10) \log_2(3/10) - (7/10) \log_2(7/10) = 0.88 (information before the split)

Step 2: Calculate the information for "Refund".
Refund = Yes: 3 records (Cheat Yes = 0, No = 3); Refund = No: 7 records (Cheat Yes = 3, No = 4).

I(yes) = -(0/3) \log_2(0/3) - (3/3) \log_2(3/3) = 0
I(no) = -(3/7) \log_2(3/7) - (4/7) \log_2(4/7) = 0.985

Information(Refund) = (3/10)(0) + (7/10)(0.985) = 0.69
Information Gain = 0.88 - 0.69 = 0.19

Step 3: Calculate the information for "Marital Status".
Single: 4 records (Yes = 2, No = 2); Married: 4 records (Yes = 0, No = 4); Divorced: 2 records (Yes = 1, No = 1).

I(single) = -(2/4) \log_2(2/4) - (2/4) \log_2(2/4) = 1
I(married) = -(0/4) \log_2(0/4) - (4/4) \log_2(4/4) = 0
I(divorced) = -(1/2) \log_2(1/2) - (1/2) \log_2(1/2) = 1

Information(Marital Status) = (4/10)(1) + (4/10)(0) + (2/10)(1) = 0.60
Information Gain = 0.88 - 0.60 = 0.28

Step 4: Rank the attributes.

Attribute      | Information (Before Split) | Information (After Split) | Information Gain
Refund         | 0.88                       | 0.69                      | 0.19
Marital Status | 0.88                       | 0.60                      | 0.28

Since Marital Status has the highest information gain, it will be selected for the split. A sketch that verifies these numbers follows.
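The per-attribute calculation above can be checked with a short helper that reuses the entropy function defined earlier. The info_gain name and the dictionary-based record layout are illustrative assumptions:

    def info_gain(records, attr, target="Cheat"):
        """Information gain from splitting `records` (a list of dicts) on `attr`."""
        before = entropy([r[target] for r in records])
        n = len(records)
        after = 0.0
        for value in set(r[attr] for r in records):
            subset = [r[target] for r in records if r[attr] == value]
            after += (len(subset) / n) * entropy(subset)  # weighted child entropy
        return before - after

    # The training sample from the table above.
    data = [
        {"Refund": "Yes", "Marital Status": "Single",   "Cheat": "No"},
        {"Refund": "No",  "Marital Status": "Married",  "Cheat": "No"},
        {"Refund": "No",  "Marital Status": "Single",   "Cheat": "No"},
        {"Refund": "Yes", "Marital Status": "Married",  "Cheat": "No"},
        {"Refund": "No",  "Marital Status": "Divorced", "Cheat": "Yes"},
        {"Refund": "No",  "Marital Status": "Married",  "Cheat": "No"},
        {"Refund": "Yes", "Marital Status": "Divorced", "Cheat": "No"},
        {"Refund": "No",  "Marital Status": "Single",   "Cheat": "Yes"},
        {"Refund": "No",  "Marital Status": "Married",  "Cheat": "No"},
        {"Refund": "No",  "Marital Status": "Single",   "Cheat": "Yes"},
    ]
    print(round(info_gain(data, "Refund"), 4))          # 0.1916 (0.19 at slide precision)
    print(round(info_gain(data, "Marital Status"), 4))  # 0.2813 (0.28 at slide precision)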
Split Algorithm Based on Gini Index

The Gini index is a measure of inequality. It varies from 0 to 1:
‒ 0: no uncertainty (all samples belong to the same class)
‒ 1: maximum uncertainty

Gini Index G(S) = 1 - \sum_{i=1}^{n} p_i^2

where p_i is the relative frequency of class i in the data sample.

Gain(S, A) = G(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, G(S_v)

where G(S) is the Gini index before the split and S_v is the subset of S for which attribute A takes value v.

Step 1: Calculate the Gini index before the split (same training sample).
Sample space = 10; classes: Yes = 3, No = 7.

G = 1 - (P_y)^2 - (P_n)^2 = 1 - (3/10)^2 - (7/10)^2 = 0.42 (Gini index before the split)

Step 2: Calculate the Gini index for "Refund".
Refund = Yes: 3 records (Yes = 0, No = 3); Refund = No: 7 records (Yes = 3, No = 4).

G(yes) = 1 - (0/3)^2 - (3/3)^2 = 0
G(no) = 1 - (3/7)^2 - (4/7)^2 = 0.49

Gini(Refund) = (3/10)(0) + (7/10)(0.49) = 0.34

Step 3: Calculate the Gini index for "Marital Status".
Single: 4 records (Yes = 2, No = 2); Married: 4 records (Yes = 0, No = 4); Divorced: 2 records (Yes = 1, No = 1).

G(single) = 1 - (2/4)^2 - (2/4)^2 = 0.5
G(married) = 1 - (0/4)^2 - (4/4)^2 = 0
G(divorced) = 1 - (1/2)^2 - (1/2)^2 = 0.5

Gini(Marital Status) = (4/10)(0.5) + (4/10)(0) + (2/10)(0.5) = 0.30

Step 4: Rank the attributes.

Attribute      | Gini Index (Before Split) | Gini Index (After Split) | Gain
Refund         | 0.42                      | 0.34                     | 0.08
Marital Status | 0.42                      | 0.30                     | 0.12

Since Marital Status has the highest gain, it will again be selected for the split; on this sample the Gini criterion agrees with the information-theoretic one. The sketch below verifies the numbers.
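A matching sketch for the Gini calculations, under the same assumptions as the earlier helpers (it reuses Counter and the data records defined above; the gini and gini_gain names are illustrative):

    def gini(labels):
        """Gini index of a list of class labels: 1 - sum(p_i^2)."""
        n = len(labels)
        counts = Counter(labels)  # class label -> frequency
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    def gini_gain(records, attr, target="Cheat"):
        """Reduction in Gini index from splitting `records` on `attr`."""
        before = gini([r[target] for r in records])
        n = len(records)
        after = 0.0
        for value in set(r[attr] for r in records):
            subset = [r[target] for r in records if r[attr] == value]
            after += (len(subset) / n) * gini(subset)  # weighted child Gini index
        return before - after

    print(round(gini_gain(data, "Refund"), 4))          # 0.0771 (0.08 at slide precision)
    print(round(gini_gain(data, "Marital Status"), 4))  # 0.12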
Advantages
‒ Easy to understand.
‒ Easy to generate rules.

Disadvantages
‒ Problems in handling numeric data.
‒ Large data generates a complex tree.
‒ Training can be expensive.

We can save the plotted figure to an image file with fig.savefig('imagename.png').

Decision Tree Hyperparameters

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_leaf_nodes=None, ccp_alpha=0.0)

‒ criterion : {"gini", "entropy", "log_loss"}, default="gini"
‒ splitter : {"best", "random"}, default="best"
‒ max_depth : int, default=None. The maximum depth of the tree.
‒ min_samples_split : int or float, default=2. The minimum number of samples required to split an internal node.
‒ min_samples_leaf : int or float, default=1. The minimum number of samples required to be at a leaf node.
‒ max_leaf_nodes : int, default=None. Grow a tree with max_leaf_nodes in best-first fashion.

https://www.datacamp.com/tutorial/decision-tree-classification-python

When visualizing the tree, tree.plot_tree() accepts an ax argument. Passing a matplotlib axes object (ax=ax) tells the function which axes to draw the tree on, which lets you control the figure size or place the tree in a subplot. See the sketch below.
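A minimal end-to-end sketch tying together the hyperparameters, the ax argument, and fig.savefig(). The Iris dataset and the specific hyperparameter values are illustrative choices, not from the slides:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    # Illustrative numeric data; the slides' Refund/Marital Status sample
    # would first need its categorical attributes encoded as numbers.
    X, y = load_iris(return_X_y=True)

    # A few of the hyperparameters listed above, with illustrative values.
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=2)
    clf.fit(X, y)

    # Draw the tree on an explicit matplotlib axes via the ax argument,
    # then save the figure as described earlier.
    fig, ax = plt.subplots(figsize=(10, 6))
    plot_tree(clf, ax=ax, filled=True)
    fig.savefig('imagename.png')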

Issues

‒ Overfitting
‒ Instability
‒ Bias towards dominant features
‒ Lack of smoothness
‒ Limited expressiveness
‒ Difficulty handling irrelevant features
‒ Imbalanced datasets
‒ High variance
‒ Greedy nature
‒ Difficulty with continuous variables

References

1. Decision tree explained: https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/
2. Decision tree regressor tutorial: https://www.kaggle.com/code/alirezahasannejad/decision-tree-regressor-tutorial
3. Details of decision trees: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
4. Python libraries for decision trees: https://scikit-learn.org/stable/modules/tree.html