Summary

This document is a transcript of lecture slides introducing decision trees, a machine learning method for classification. It covers the function-approximation view of learning, the structure and expressiveness of decision trees, top-down tree induction (ID3 and C4.5), entropy and information gain as attribute-selection criteria, the gain ratio, and example applications such as the restaurant-waiting domain.

Full Transcript

Decision Trees: Function Approximation

Problem setting:
- Set of possible instances X
- Set of possible labels Y
- Unknown target function f : X → Y
- Set of function hypotheses H = {h | h : X → Y}
Input: training examples {(x^(1), y^(1)), ..., (x^(n), y^(n))} of the unknown target function f
Output: a hypothesis h ∈ H that best approximates f
Based on slide by Tom Mitchell

Sample Dataset
Columns denote features X_i; rows denote labeled instances (x^(i), y^(i)). The class label denotes whether a tennis game was played.

Decision Tree
A possible decision tree for the data:
- Each internal node tests one attribute X_i
- Each branch from a node selects one value for X_i
- Each leaf node predicts Y (or p(Y | x ∈ leaf))
What prediction would we make for a new instance? Sort it down the tree to a leaf and output that leaf's label.
If features are continuous, internal nodes can test the value of a feature against a threshold.
Based on slides by Tom Mitchell

Decision Tree Learning
Problem setting:
- Set of possible instances X; each instance x in X is a feature vector
- Unknown target function f : X → Y, where Y is discrete-valued
- Set of function hypotheses H = {h | h : X → Y}; each hypothesis h is a decision tree
- A tree sorts x to a leaf, which assigns y
Slide by Tom Mitchell

Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {(x^(i), y^(i))}, i = 1..n, assuming each x^(i) ~ D(X) with y^(i) = f_target(x^(i)).
Train the model: model ← classifier.train(X, Y)
Apply the model to new data: given a new unlabeled instance x ~ D(X), y_prediction ← model.predict(x)

Example Application: A Tree to Predict Caesarean Section Risk
Based on example by Tom Mitchell

Decision Tree Induced Partition
[Figure: an example tree splitting on Color (green/blue/red), then on Size or Shape (big/small, square/round), with +/- class labels at the leaves.]

Decision Tree: Decision Boundary
Decision trees divide the feature space into axis-parallel (hyper-)rectangles. Each rectangular region is labeled with one label, or with a probability distribution over labels.

Expressiveness
Decision trees can represent any boolean function of the input attributes: each truth-table row maps to a path to a leaf. In the worst case, the tree will require exponentially many nodes.
Decision trees have a variable-sized hypothesis space: as the number of nodes (or the depth) increases, the hypothesis space grows.
- Depth 1 (a "decision stump") can represent any boolean function of one feature.
- Depth 2 can represent any boolean function of two features, and some involving three features, e.g. (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3).
- etc.
Based on slide by Pedro Domingos
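To make the tree representation described above concrete, here is a minimal Python sketch (not from the slides; the attribute names and tree shape are illustrative assumptions for a PlayTennis-style dataset). A tree is a nested dictionary in which each internal node tests one attribute, each branch selects one value, and each leaf stores the predicted label.

# Hypothetical PlayTennis-style tree; structure and attribute names are assumed for illustration.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": {"label": "No"}, "Normal": {"label": "Yes"}}},
        "Overcast": {"label": "Yes"},
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": {"label": "No"}, "Weak": {"label": "Yes"}}},
    },
}

def predict(node, x):
    """Sort instance x (a dict mapping attribute name -> value) to a leaf and return its label."""
    while "label" not in node:                         # internal node: test one attribute X_i
        node = node["branches"][x[node["attribute"]]]  # branch: follow the value x takes for X_i
    return node["label"]                               # leaf node: predict Y

print(predict(tree, {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))  # -> "Yes"

For a continuous feature, the branch lookup would instead compare the feature value against a stored threshold, as noted above.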
Another Example: Restaurant Domain (Russell & Norvig)
Model a patron's decision of whether to wait for a table at a restaurant. There are roughly 7,000 possible cases.

A Decision Tree from Introspection
Is this the best decision tree?

Preference Bias: Ockham's Razor
Principle stated by William of Ockham (1285-1347): "non sunt multiplicanda entia praeter necessitatem" (entities are not to be multiplied beyond necessity). Also known as Occam's Razor, the Law of Economy, or the Law of Parsimony.
Idea: the simplest consistent explanation is the best. Therefore, the smallest decision tree that correctly classifies all of the training examples is best.
Finding the provably smallest decision tree is NP-hard, so instead of constructing the absolute smallest tree consistent with the training examples, we construct one that is pretty small.

Basic Algorithm for Top-Down Induction of Decision Trees [ID3, C4.5 by Quinlan]
node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to the leaf nodes.
5. If the training examples are perfectly classified, stop; else, recurse over the new leaf nodes.
How do we choose which attribute is best?

Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples on. Some possibilities are:
- Random: select any attribute at random.
- Least-Values: choose the attribute with the smallest number of possible values.
- Most-Values: choose the attribute with the largest number of possible values.
- Max-Gain: choose the attribute that has the largest expected information gain, i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children.
The ID3 algorithm uses the Max-Gain method of selecting the best attribute.

Choosing an Attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative". Which split is more informative: Patrons? or Type?
Based on slide from M. desJardins & T. Finin

ID3-Induced Decision Tree
Based on slide from M. desJardins & T. Finin

Compare the Two Decision Trees
Based on slide from M. desJardins & T. Finin

Information Gain
Which test is more informative: a split over whether Balance exceeds 50K (less than or equal to 50K vs. over 50K), or a split over whether the applicant is employed (unemployed vs. employed)?
Based on slide by Pedro Domingos

Information Gain
Impurity/entropy (informal): measures the level of impurity in a group of examples.
Based on slide by Pedro Domingos

Impurity
[Figure: three groups of examples illustrating a very impure group, a less impure group, and minimum impurity.]
Based on slide by Pedro Domingos
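Before quantifying impurity, here is a minimal Python sketch of the top-down induction loop and attribute-selection step described above. This is a schematic rendering under assumed data structures (examples as (feature-dict, label) pairs), not Quinlan's actual implementation; choose_attribute stands in for whichever selection heuristic (Random, Least-Values, Most-Values, or Max-Gain) is plugged in.

from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    """Top-down induction: examples is a list of (x, y) pairs with x a dict of attribute -> value."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                     # training examples perfectly classified: stop
        return {"label": labels[0]}
    if not attributes:                            # no attributes left: predict the majority label
        return {"label": Counter(labels).most_common(1)[0][0]}
    a = choose_attribute(examples, attributes)    # 1. pick the "best" decision attribute
    node = {"attribute": a, "branches": {}}       # 2. assign A as the decision attribute for node
    for v in {x[a] for x, _ in examples}:         # 3. create a new descendant for each value of A
        subset = [(x, y) for x, y in examples if x[a] == v]   # 4. sort examples to that branch
        rest = [b for b in attributes if b != a]
        node["branches"][v] = build_tree(subset, rest, choose_attribute)  # 5. recurse
    return node

A Max-Gain choose_attribute would score each candidate attribute by the information gain defined in the following slides.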
Entropy: A Common Way to Measure Impurity
The entropy H(X) of a random variable X with n possible values is
H(X) = -Σ_{i=1..n} P(X = i) log2 P(X = i)
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code). Why? Information theory: the most efficient code assigns -log2 P(X = i) bits to encode the message X = i, so the expected number of bits to code one random draw of X is Σ_i P(X = i) · (-log2 P(X = i)) = H(X).
Slide by Tom Mitchell

Example: Huffman Code
In 1952, MIT student David Huffman devised, in the course of doing a homework assignment, an elegant coding scheme that is optimal in the case where all symbols' probabilities are integral powers of 1/2. A Huffman code can be built in the following manner:
- Rank all symbols in order of probability of occurrence.
- Successively combine the two symbols of lowest probability to form a new composite symbol; eventually this builds a binary tree in which each node holds the total probability of all nodes beneath it.
- Trace a path to each leaf, noting the direction taken at each node.
Based on slide from M. desJardins & T. Finin

Huffman Code Example

M   code   length   prob    prob × length
A   000    3        0.125   0.375
B   001    3        0.125   0.375
C   01     2        0.250   0.500
D   1      1        0.500   0.500
                    average message length: 1.750

[Figure: the Huffman tree built by successively merging the two lowest-probability symbols.]
If we use this code to send many messages (A, B, C, or D) drawn from this probability distribution, then, over time, the average bits per message should approach 1.75.
Based on slide from M. desJardins & T. Finin

2-Class Cases
Entropy: H(X) = -Σ_{i=1..n} P(X = i) log2 P(X = i)
What is the entropy of a group in which all examples belong to the same class?
- entropy = -1 · log2(1) = 0: minimum impurity, and not a good training set for learning.
What is the entropy of a group with 50% of the examples in either class?
- entropy = -0.5 · log2(0.5) - 0.5 · log2(0.5) = 1: maximum impurity, and a good training set for learning.
Based on slide by Pedro Domingos

Sample Entropy
For a sample S with a fraction p+ of positive examples and p- of negative examples, the sample entropy is H(S) = -p+ log2 p+ - p- log2 p-.
Slide by Tom Mitchell

Information Gain
We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. Information gain tells us how important a given attribute of the feature vectors is. We will use it to decide the ordering of attributes in the nodes of a decision tree.
Based on slide by Pedro Domingos
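As a quick numeric check of the values above, here is a small Python sketch (illustrative, not from the slides): the entropy of a pure group is 0, a 50/50 group has entropy 1, and the entropy of the Huffman-example distribution equals its average code length of 1.75 bits.

from math import log2

def entropy(probs):
    """H(X) = -sum_i P(X=i) * log2 P(X=i), treating 0 * log2(0) as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1.0]))                       # 0.0  -> minimum impurity (all examples in one class)
print(entropy([0.5, 0.5]))                  # 1.0  -> maximum impurity for two classes
print(entropy([0.125, 0.125, 0.25, 0.5]))   # 1.75 -> matches the Huffman average message length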
From Entropy to Information Gain
Entropy H(X) of a random variable X:
H(X) = -Σ_i P(X = i) log2 P(X = i)
Specific conditional entropy H(X | Y = v) of X given Y = v:
H(X | Y = v) = -Σ_i P(X = i | Y = v) log2 P(X = i | Y = v)
Conditional entropy H(X | Y) of X given Y:
H(X | Y) = Σ_v P(Y = v) H(X | Y = v)
Mutual information (aka information gain) of X and Y:
I(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
Slide by Tom Mitchell

Information Gain
Information gain is the mutual information between input attribute A and target variable Y: the expected reduction in the entropy of target variable Y for data sample S, due to sorting on variable A,
Gain(S, A) = H_S(Y) - H_S(Y | A)
Slide by Tom Mitchell

Calculating Information Gain
Information gain = entropy(parent) - [weighted average entropy(children)]
Example: an entire population of 30 instances is split into one child node with 17 instances and one with 13 instances.
- Parent entropy = -(14/30) log2(14/30) - (16/30) log2(16/30) = 0.996
- First child (17 instances) entropy = -(13/17) log2(13/17) - (4/17) log2(4/17) = 0.787
- Second child (13 instances) entropy = -(1/13) log2(1/13) - (12/13) log2(12/13) = 0.391
- (Weighted) average entropy of children = (17/30)(0.787) + (13/30)(0.391) = 0.615
- Information gain = 0.996 - 0.615 = 0.38
Based on slide by Pedro Domingos

Entropy-Based Automatic Decision Tree Construction
Training set X: x1 = (f11, f12, ..., f1m), x2 = (f21, f22, ..., f2m), ..., xn = (fn1, fn2, ..., fnm).
At node 1: what feature should be used, and what values?
Quinlan suggested information gain in his ID3 system, and later the gain ratio, both based on entropy.
Based on slide by Pedro Domingos

Using Information Gain to Construct a Decision Tree
Choose the attribute A with the highest information gain for the full training set X at the root of the tree. Construct a child node for each value of A; each child has an associated subset of vectors in which A takes that particular value, e.g. X' = {x ∈ X | value(A) = v1}. Repeat recursively on each child (until when?).
Disadvantage of information gain: it prefers attributes with a large number of values, which split the data into many small, pure subsets. Quinlan's gain ratio uses normalization to improve on this.
Based on slide by Pedro Domingos

Restaurant Example
For the restaurant data: Random would select Patrons or Wait-time; Least-Values selects Patrons; Most-Values selects Type; Max-Gain selects ???
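Before working through the restaurant numbers, here is a minimal Python sketch (assuming the class counts from the 30-instance example above: a 14/16 parent split into a 13/4 child and a 1/12 child) that reproduces the worked information-gain calculation.

from math import log2

def entropy_of_counts(counts):
    """Entropy of a node from its class counts, e.g. [14, 16]."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """entropy(parent) minus the weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy_of_counts(child) for child in children_counts)
    return entropy_of_counts(parent_counts) - weighted

parent = [14, 16]                 # 30 instances in the parent node
children = [[13, 4], [1, 12]]     # the 17-instance and 13-instance children
print(round(entropy_of_counts(parent), 3))            # 0.997 (the slide shows 0.996)
print(round(information_gain(parent, children), 2))   # 0.38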
[Figure: the 12 restaurant examples grouped by the Type variable (French, Italian, Thai, Burger) and by the Patrons variable (Empty, Some, Full), showing the Yes/No labels in each group.]

Computing Information Gain
For the restaurant data T, we want I(T), I(Patrons, T), and I(Type, T), and from them Gain(Patrons, T) and Gain(Type, T):
- I(T) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 0.5 + 0.5 = 1
- I(Patrons, T) = 2/12 · (0) + 4/12 · (0) + 6/12 · (-(4/6 log2 4/6 + 2/6 log2 2/6)) = 1/2 · (2/3 · 0.6 + 1/3 · 1.6) ≈ 0.47
- I(Type, T) = 2/12 · (1) + 2/12 · (1) + 4/12 · (1) + 4/12 · (1) = 1
- Gain(Patrons, T) = 1 - 0.47 = 0.53
- Gain(Type, T) = 1 - 1 = 0
Based on slide from M. desJardins & T. Finin

Splitting Examples by Testing Attributes
[Figure: the training examples being partitioned as attributes are tested.]

Decision Tree Applet
http://webdocs.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?
ID3 performs a heuristic search through the space of decision trees. It stops at the smallest acceptable tree. Why? Occam's razor: prefer the simplest hypothesis that fits the data.
Slide by Tom Mitchell

The ID3 Algorithm
The ID3 algorithm builds a decision tree, given a set of non-categorical attributes C1, C2, ..., Cn, the class attribute C, and a training set T of records:

function ID3(R: input attributes, C: class attribute, S: training set) returns decision tree;
  If S is empty, return a single node with value Failure;
  If every example in S has the same value for C, return a single node with that value;
  If R is empty, return a single node with the most frequent of the values of C found in examples of S;
    # this case causes errors -- improperly classified records
  Let D be the attribute with the largest Gain(D, S) among the attributes in R;
  Let {dj | j = 1, 2, ..., m} be the values of attribute D;
  Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting of records with value dj for attribute D;
  Return a tree with root labeled D and arcs labeled d1, ..., dm going to the subtrees
    ID3(R - {D}, C, S1), ..., ID3(R - {D}, C, Sm);

Based on slide from M. desJardins & T. Finin

How Well Does It Work?
Many case studies have shown that decision trees are at least as accurate as human experts.
- A study on diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly.
- British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms that replaced an earlier rule-based expert system.
- Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.
Based on slide from M. desJardins & T. Finin

Extensions of ID3
- Using gain ratios
- Real-valued data
- Noisy data and overfitting
- Generation of rules
- Setting parameters
- Cross-validation for experimental validation of performance
C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on.
Based on slide from M. desJardins & T. Finin
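Before turning to gain ratios, here is a small Python sketch reproducing the restaurant-domain gains computed above. The branch class counts are taken from the slide (Patrons splits the 12 examples 2/4/6, with the Full branch containing 4 of one class and 2 of the other; each Type branch is split evenly); the exact gain for Patrons comes out near 0.54, which the slide rounds to 0.53 via rounded logarithms.

from math import log2

def H(counts):
    """Entropy of a node from its class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(parent_counts, branch_counts):
    """Information gain = entropy(parent) - weighted average entropy of the branches."""
    n = sum(parent_counts)
    return H(parent_counts) - sum(sum(b) / n * H(b) for b in branch_counts)

parent  = [6, 6]                              # 12 examples, evenly split between wait / don't wait
patrons = [[0, 2], [4, 0], [2, 4]]            # branches: Empty, Some, Full
types   = [[1, 1], [1, 1], [2, 2], [2, 2]]    # branches: French, Italian, Thai, Burger
print(round(gain(parent, patrons), 2))        # 0.54 (slide: 0.53 with rounded logs)
print(round(gain(parent, types), 2))          # 0.0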
Using Gain Ratios
The information gain criterion favors attributes that have a large number of values: if we have an attribute A that has a distinct value for each record, then Info(X, A) is 0 and thus Gain(X, A) is maximal. To compensate for this, Quinlan suggests using the following ratio instead of Gain:
GainRatio(X, A) = Gain(X, A) / SplitInfo(X, A)
SplitInfo(X, A) is the information due to the split of X on the basis of the value of the categorical attribute A:
SplitInfo(X, A) = I(|X1|/|X|, |X2|/|X|, ..., |Xm|/|X|)
where {X1, X2, ..., Xm} is the partition of X induced by the values of A.
Based on slide from M. desJardins & T. Finin

Computing Gain Ratio
For the restaurant data: I(X) = 1, I(Patrons, X) = 0.47, I(Type, X) = 1, so Gain(Patrons, X) = 0.53 and Gain(Type, X) = 0.
- SplitInfo(Patrons, X) = -(1/6 log2 1/6 + 1/3 log2 1/3 + 1/2 log2 1/2) = 1/6 · 2.6 + 1/3 · 1.6 + 1/2 · 1 = 1.47
- SplitInfo(Type, X) = -(1/6 log2 1/6 + 1/6 log2 1/6 + 1/3 log2 1/3 + 1/3 log2 1/3) = 1/6 · 2.6 + 1/6 · 2.6 + 1/3 · 1.6 + 1/3 · 1.6 = 1.93
- GainRatio(Patrons, X) = Gain(Patrons, X) / SplitInfo(Patrons, X) = 0.53 / 1.47 = 0.36
- GainRatio(Type, X) = Gain(Type, X) / SplitInfo(Type, X) = 0 / 1.93 = 0
Based on slide from M. desJardins & T. Finin

Gain Ratio
Gain Ratio = Gain_split / SplitInfo, where SplitInfo = -Σ_{i=1..k} (n_i / n) log2(n_i / n)
Here the parent node p is split into k partitions (children), and n_i is the number of records in child node i.
- The gain ratio adjusts information gain by the entropy of the partitioning (SplitInfo): a higher-entropy partitioning (a large number of small partitions) is penalized.
- It is used in the C4.5 algorithm and is designed to overcome the disadvantage of information gain.

Example (splitting on CarType):

Split                              C1           C2           Gini    SplitInfo
{Family} / {Sports} / {Luxury}     1 / 8 / 1    3 / 0 / 7    0.163   1.52
{Sports, Luxury} / {Family}        9 / 1        7 / 3        0.468   0.72
{Sports} / {Family, Luxury}        8 / 2        0 / 10       0.167   0.97

From: Introduction to Data Mining, 2nd Edition
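Continuing the same restaurant example, here is a short sketch (again illustrative) of the SplitInfo and gain-ratio computation over the branch sizes: 2/4/6 for Patrons and 2/2/4/4 for Type. The small differences from the slide's 1.47 and 1.93 come from its rounded logarithms.

from math import log2

def split_info(branch_sizes):
    """SplitInfo = -sum_i (n_i / n) * log2(n_i / n) over the partition sizes."""
    n = sum(branch_sizes)
    return -sum(s / n * log2(s / n) for s in branch_sizes)

gain_patrons, gain_type = 0.53, 0.0            # gains taken from the slides above
si_patrons = split_info([2, 4, 6])             # ~1.46 (slide: 1.47)
si_type    = split_info([2, 2, 4, 4])          # ~1.92 (slide: 1.93)
print(round(gain_patrons / si_patrons, 2))     # 0.36  GainRatio(Patrons, X)
print(round(gain_type / si_type, 2))           # 0.0   GainRatio(Type, X)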
