10. Linear and Non-Linear Separation (Decision Trees, Neural Networks, and Support Vector Machines).pdf

University of Southern Denmark (SDU), courses DM868 / DM870 / DS804, lecturer: Arthur Zimek

Full Transcript

[Slide 539] Entropy Function

The entropy in bits of a discrete random variable X is given by

H(X) = -\sum_{x} \Pr(X = x) \log_2 \Pr(X = x),

where the summation is over all values x in the range of X. Equivalently,

H(X) = E[-\log_2 \Pr(X)].

[Slide 540] Binary Entropy Function

The binary entropy function of a random variable that assumes only two possible outcomes, occurring with probability p and 1 - p respectively, is

H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p).

We define H(0) = H(1) = 0. (Figure: the curve H(p) for p in [0, 1], with its maximum of 1 bit at p = 0.5.)

[Slide 541] n-ary Entropy (Uniform)

Generalizing, the entropy of a random variable X with n outcomes of equal probability is given by:

H(X) = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = -\log_2 \frac{1}{n} = \log_2 n.

[Slide 542] Intuition

► Imagine a die with 8 sides that show up with equal probability.
► The entropy is 3 bits.
► If the faces of the die were numbered with 0 to 7 in binary code, the outcome of a die roll would give a sequence of 3 bits, uniform over the set {0, 1}^3.
► This shows the equivalence to generating 3 bits independently and uniformly at random.

The entropy of a random variable X does not depend on the values that X can take, but only on the probability distribution of X over those values.

[Slide 543] Interpretations

► One interpretation of entropy is that it measures the amount of randomness (disorder, uncertainty) in a system:
► for example, consider the second law of thermodynamics: the total entropy of an isolated system cannot decrease over time.
► Another interpretation relates entropy to compression and coding theory (see Slide 306):
► entropy relates to the number of bits per symbol required to encode a message,
► a very predictable message can be encoded with fewer bits (it conveys less information),
► an unpredictable message conveys much information (high entropy).
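The following is a small illustration, not part of the slides: a Python sketch (standard library only; the probability vectors in the calls are just examples) of the entropy function above. It reproduces the 3 bits of the fair 8-sided die from slide 542.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution, given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)   # convention: 0 * log2(0) = 0

print(entropy([0.5, 0.5]))   # 1.0 bit: fair coin
print(entropy([1/8] * 8))    # 3.0 bits: fair 8-sided die (slide 542)
print(entropy([0.9, 0.1]))   # ~0.47 bits: a predictable outcome carries little information
```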
[Slide 544] Entropy of Class Distributions

H(D) = -\sum_{i=1}^{k} \Pr(c_i \mid D) \log_2 \Pr(c_i \mid D)

► Considering labeled data with k = 2, a very pure set (almost all labels belong to class A, only some belong to class B) has very low entropy (i.e., low disorder, low uncertainty):
► if we draw some data object at random, we are very likely to get one that belongs to class A.
► The more equal the proportions of the two classes are, the more uncertainty we have about which class we will likely draw (i.e., higher disorder, higher entropy).

[Slide 545] Gini Index

The Gini index as a measure for the (im-)purity of a data set D w.r.t. k classes c_1, ..., c_k is given by:

G(D) = 1 - \sum_{i=1}^{k} \Pr(c_i \mid D)^2

► If a dataset contains only one class, the probability of that class is 1, the dataset has minimal impurity, and the Gini index is 0.
► When each class is equally represented, we have \Pr(c_i \mid D) = 1/k, the dataset is maximally impure, and the Gini index is (k-1)/k (i.e., approaching 1 as k → ∞).
► Probabilistic interpretation of the squared term: if we randomly draw two objects from D (independently, with replacement), \sum_i \Pr(c_i \mid D)^2 is the probability that they belong to the same class; the Gini index is one minus that probability.
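As an illustration, not from the slides, the two purity measures can be compared on hypothetical class counts (Python, standard library only):

```python
import math

def class_entropy(counts):
    """H(D) for a labeled set given as per-class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """G(D) = 1 minus the sum of squared class probabilities."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(class_entropy([99, 1]), gini([99, 1]))     # almost pure set: both values close to 0
print(class_entropy([50, 50]), gini([50, 50]))   # balanced set, k = 2: entropy 1.0 bit, Gini 0.5
```

Both measures are minimal for pure sets and maximal for the uniform class distribution; they differ only in how strongly they penalize intermediate mixtures.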
[Slide 546] Outline.

[Slide 547] Recommended Reading

Textbooks:
► Zaki and Meira Jr., Chapter 19
► Mitchell, Chapter 3
► Tan et al., Chapter 4.3
► Tan et al., Chapter 3.3
► Witten et al., Chapter 6.1

Fundamental paper:
► Quinlan

[Slide 548] Outline (next: Decision Tree Learning).

[Slide 549] Decision Tree - Basic Idea

Data: five example objects with attributes ID, age, and car type (1: 23, family car; 2: 17, sports car; 3: 43, sports car; 4: 68, family car; 5: 32, truck) and a class label "risk" (the label values are not legible in the transcript). Decision tree: the root tests the car type, a further node tests the age (e.g., age > 60 ⇒ risk: low).

► A decision tree provides explicit knowledge on the data.
► The classification model is interpretable (a hierarchy of rules).

[Slide 550] Decision Tree - Properties

► A decision tree is a tree with the following properties:
► an inner node represents a test on an attribute,
► an edge represents a test result on the parent node,
► a leaf node represents one of the classes.
► Construction: top-down, based on the training set.
► Application (prediction):
► traversal according to the tests from the root to some leaf node (deterministic path),
► class assignment: the class of the leaf node reached in the traversal.

(Figure: the car-type/age example tree, with edges such as "= truck" and "≤ 60".)

[Slide 551] Decision Tree - Basic Algorithm

1. Given a dataset, select an attribute and a split point (greedy selection, following some split strategy).
2. Partition the data according to the test on the split attribute.
3. Repeat the procedure recursively for each partition.

The recursion stops if the partition is "pure" (contains only examples of a single class). A minimal code sketch of this greedy procedure is given after the example data below.

[Slide 552] Example: Should We Play Tennis Today?

ID | forecast  | temperature | humidity | wind   | play tennis?
 1 | sunny     | hot         | high     | weak   | no
 2 | sunny     | hot         | high     | strong | no
 3 | overcast  | hot         | high     | weak   | yes
 4 | rainy     | mild        | high     | weak   | yes
 5 | rainy     | cool        | normal   | weak   | yes
 6 | rainy     | cool        | normal   | strong | no
 7 | overcast  | cool        | normal   | strong | yes
 8 | sunny     | mild        | high     | weak   | no
 9 | sunny     | cool        | normal   | weak   | yes
10 | rainy     | mild        | normal   | weak   | yes
11 | sunny     | mild        | normal   | strong | yes
12 | overcast  | mild        | high     | strong | yes
13 | overcast  | hot         | normal   | weak   | yes
14 | rainy     | mild        | high     | strong | no
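The sketch announced on slide 551, not part of the slides themselves: a minimal recursive decision-tree learner for categorical attributes in Python (standard library only), using the information-gain criterion that is introduced on the following slides.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction achieved by splitting on a categorical attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                        # partition is pure -> leaf
        return labels[0]
    if not attrs:                                    # no attribute left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in set(row[best] for row in rows):     # one branch per attribute value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return (best, branches)
```

Applied to the play-tennis table above (rows as dictionaries over the four attributes, the "play tennis?" column as labels), the greedy choice roots the tree at "forecast", matching the example tree on the next slide.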
[Slide 553] Example Decision Tree

The root tests the forecast:
► forecast = sunny → test humidity: high → no, normal → yes
► forecast = overcast → yes
► forecast = rainy → test wind: strong → no, weak → yes

[Slide 554] Outline (next: Splits and Split Criteria).

[Slide 555] Types of Splits

Categorical attributes:
► binary test: attribute = a (yes/no),
► set-valued test: attribute ∈ A,
► multi-way split: one branch per value a_1, a_2, a_3, ...

Numerical attributes:
► attribute < a vs. attribute ≥ a,
► attribute ≤ a vs. attribute > a.

[Slide 556] Problem: Where to Split Numerical Attributes?

Where should we define a split for some numerical attribute?
► We want to maximize the separation of classes.
► Idea: sort the attribute values (the slide shows the sorted values 0.9, 0.8, 0.3, 0.15, 0.01 with their class labels); the positions between neighboring values are potential split points, which are then tested with a split criterion (a small sketch of this enumeration follows after the next slide).
► Alternative: fit a Gaussian distribution to each class; the intersections of the Gaussian pdfs are potential split points.

[Slide 557] Quality Measure for Splits

To select attributes for splitting and to select split points within attributes, we need to assess the quality of the partitions induced by a split on some attribute.

Given:
► a training set TR,
► disjoint and complete partitionings T = T_1, ..., T_m of TR: \bigcup_i T_i = TR and \forall i \neq j: T_i \cap T_j = \emptyset,
► the relative frequency of each class c_i in each partition T_j: p_i = \Pr(c_i \mid T_j).

Required:
► a relative measure for the purity w.r.t. classes of some set of partitions,
► a split that optimizes this measure.
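A small sketch, not from the slides, of the split-point enumeration idea of slide 556 (Python; the values and labels are hypothetical because the concrete ones on the slide are not fully legible in the transcript). A common refinement, used here, is to keep only midpoints between neighbors whose class labels differ:

```python
def candidate_splits(values, labels):
    """Sort the attribute values and return midpoints between consecutive values
    at which the class label changes (candidate split points x_i <= v)."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

# hypothetical sorted values with class labels
print(candidate_splits([0.01, 0.15, 0.3, 0.8, 0.9], ["A", "A", "B", "B", "A"]))
# -> [0.225, 0.85]: only these candidates need to be scored with a split criterion
```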
[Slide 558] Measure: Information Gain

► Information gain is a measure based on the entropy:

H(T) = -\sum_{i=1}^{k} \Pr(c_i \mid T) \log_2 \Pr(c_i \mid T)

► Information gain measures the reduction of entropy (i.e., the gain of information) achieved by a split of the set T into partitions T_1, ..., T_m:

\text{information gain}(T, T_1, ..., T_m) = H(T) - \sum_{i=1}^{m} \frac{|T_i|}{|T|} H(T_i)

► Higher information gain means a larger reduction of entropy.
► We choose the attribute and split point that maximize the information gain.

[Slide 559] Example: Should We Play Tennis Today? (the same table as on slide 552, repeated for reference).

[Slide 560] Example: Information Gain

The whole set has 9 "yes" and 5 "no": H(T) = 0.940.

Split on humidity:
► high: 3 "yes", 4 "no": H(T_1) = 0.985,
► normal: 6 "yes", 1 "no": H(T_2) = 0.592,
► inf. gain(T, T_i(humidity)) = 0.940 - (7/14) · 0.985 - (7/14) · 0.592 = 0.151.

Split on wind:
► weak: 6 "yes", 2 "no": H(T_1) = 0.811,
► strong: 3 "yes", 3 "no": H(T_2) = 1.0,
► inf. gain(T, T_i(wind)) = 0.940 - (8/14) · 0.811 - (6/14) · 1.0 = 0.048.

[Slide 561] Measure: Gini Index

G(D) = 1 - \sum_{i=1}^{k} \Pr(c_i \mid D)^2

In an analogous way, we can use the weighted Gini index of the induced partitions to compare partitionings:

G(T_1, ..., T_m) = \sum_{i=1}^{m} \frac{|T_i|}{|T|} G(T_i)

► A smaller value of the Gini index means lower impurity.
► We choose the attribute and the split that minimize the Gini index.
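The information-gain numbers of slide 560 can be reproduced with a few lines of Python (not part of the slides; the counts are read off the play-tennis table):

```python
import math

def H(*counts):
    """Entropy of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

H_T = H(9, 5)                                          # 0.940: 9 "yes", 5 "no"
gain_humidity = H_T - 7/14 * H(3, 4) - 7/14 * H(6, 1)  # split on humidity: high / normal
gain_wind     = H_T - 8/14 * H(6, 2) - 6/14 * H(3, 3)  # split on wind: weak / strong
print(round(gain_humidity, 3), round(gain_wind, 3))    # 0.152 0.048
```

(The slide reports 0.151 for humidity because it rounds the intermediate entropies; either way, humidity is the better split by this criterion.)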
[Slide 562] Example: Should We Play Tennis Today? (the same table as on slide 552, repeated for reference).

[Slide 563] Example: Gini Index

The whole set: |T| = 14, with 9 "yes" and 5 "no".

Split on humidity:
► high: 3 "yes", 4 "no": G(T_1) = 1 - ((3/7)^2 + (4/7)^2) = 24/49,
► normal: 6 "yes", 1 "no": G(T_2) = 1 - ((6/7)^2 + (1/7)^2) = 12/49,
► G(split on humidity) = (7/14) · (24/49) + (7/14) · (12/49) = 18/49 ≈ 0.367.

Split on wind:
► weak: 6 "yes", 2 "no": G(T_1) = 1 - ((6/8)^2 + (2/8)^2) = 3/8,
► strong: 3 "yes", 3 "no": G(T_2) = 1 - ((3/6)^2 + (3/6)^2) = 1/2,
► G(split on wind) = (8/14) · (3/8) + (6/14) · (1/2) = 3/7 = 21/49 ≈ 0.429.

Humidity therefore yields the purer split of the two (18/49 < 21/49), in agreement with the information-gain comparison on slide 560.
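The same comparison in code (a small sketch, not from the slides), reproducing the weighted Gini values 18/49 and 21/49:

```python
def gini(*counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

g_humidity = 7/14 * gini(3, 4) + 7/14 * gini(6, 1)   # 18/49, about 0.367
g_wind     = 8/14 * gini(6, 2) + 6/14 * gini(3, 3)   # 21/49, about 0.429
print(round(g_humidity, 3), round(g_wind, 3))        # humidity gives the lower (better) impurity
```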
[Slide 564] Outline (next: Geometric Interpretation and Bias).

[Slide 565] Axis-parallel Hyperplanes

► A hyperplane h(x) is defined as the set of all points x \in \mathbb{R}^d that satisfy

h(x): w \cdot x^T - b = 0,

where w is a normal vector to the hyperplane and b defines the offset of the hyperplane from the origin.
► For axis-parallel hyperplanes, the normal vector is parallel to one of the axes, i.e., w \in \{e_1, ..., e_d\}, where e_i \in \mathbb{R}^d has value 1 in dimension i and value 0 in every other dimension. Then

h(x): e_i \cdot x^T - b = 0 \;\Leftrightarrow\; h(x): x_i - b = 0.

► The choice of b yields different hyperplanes parallel to axis x_i.

[Slide 566] Hyperplanes and Split Points

► For decision trees on real-valued attributes, axis-parallel hyperplanes relate to split points.
► A hyperplane splits the data space \mathbb{R}^d into two half-spaces.
► All points with h(x) < 0 are on one side of the hyperplane, all points with h(x) > 0 are on the other side.
► We can therefore write the split point as:

h(x) \leq 0 \;\Leftrightarrow\; x_i - b \leq 0 \;\Leftrightarrow\; x_i \leq b

► As b can be chosen to be any value of the dimension x_i, the generic form of a split point (or test) is x_i \leq v, where v = b is some value in the domain of x_i.

[Slide 567] (Figure slide; the rotated caption text is not recoverable from the transcript. In context it presumably illustrates how a decision tree partitions the data space into class regions bounded by piecewise axis-parallel hyperplanes.)

[Slide 568] Bias Prevents Detection of Certain Simple Decision Rules

Consider a dataset with a relatively simple decision rule:
► if x + y ≥ 1 ⇒ class "+",
► else ⇒ the other class.

► A decision tree is biased towards finding a more complex model, as it can only use piecewise axis-parallel hyperplanes as split points.
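A small experiment illustrating this bias (a sketch assuming numpy and scikit-learn are available; neither is mentioned on the slides): points in the unit square are labeled by the single oblique rule x + y ≥ 1, and an unconstrained decision tree has to approximate that one line with many axis-parallel splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 2))                     # points in [0, 1]^2
y = (X[:, 0] + X[:, 1] >= 1).astype(int)      # one oblique line separates the classes

tree = DecisionTreeClassifier().fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())  # a deep tree with many leaves for a single linear rule
```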
[Slide 569] Outline (next: Overfitting and Error-Reduction Pruning).

[Slide 570] Overfitting Example

(Only a fragment of this slide survives in the transcript: a decision tree grown on training data with very specific split thresholds such as x_2 < 14.3129.)

[Transcript gap: the slides between 570 and 607, covering error-reduction pruning, the decision-tree discussion, neural networks, and the introduction of support vector machines with optimal linear separation, are missing from this transcript.]

[Slide 607] Maximal Margin (fragment)

Only the tail of this slide is legible. The conditions on the separating hyperplane \langle w, p \rangle + b = 0 are:
► condition 1: correct classification of the training data (the fragment shows the set \{p_i \in TR \mid y_i = 1\} and the requirement \langle w, p_i \rangle + b > 0),
► condition 2: maximal margin, i.e., maximize \xi, where

\xi = \min_{p_i \in TR} \frac{1}{\sqrt{\langle w, w \rangle}} \left( \langle w, p_i \rangle + b \right)

[Slide 608] Outline (next: Non-Linearly Separable Problems).

[Slide 609] Soft Margin

If the data are not linearly separable, or the separation is not very stable, we can optimize with a trade-off between the training error and the size of the margin.

[Slide 610] Soft Margin

Consider the training error during optimization:
► \xi_i (slack variable): the distance of p_i from the border of the margin,
► regularization of the influence of each individual observation.

[Slide 611] Transform Feature Space

► If the problem is not linearly separable in the given feature space, the problem can be transformed into another feature space.
► This means the hypothesis space is extended.

[Slide 612] Feature Transformation - Principle

Try to find a linear separation in the transformed feature space.

Example transformation: an object with attributes (a, b, c) is mapped to

(a, b, c, a·a, a·b, a·c, b·b, b·c, c·c).

A hyperplane in this feature space corresponds to a polynomial (2nd-degree) separation surface in the original feature space.

[Slide 613] Feature Transformation - Example

Feature space: x = (x_1, x_2) \in \mathbb{R}^2.
Transformed feature space, e.g.:

\phi(x) = (x_1^2, x_2^2, \sqrt{2} \cdot x_1, \sqrt{2} \cdot x_2, \sqrt{2} \cdot x_1 x_2, 1) \in \mathbb{R}^6

(Figure: a class region that is not linearly separable in \mathbb{R}^2 becomes linearly separable after the transformation.)
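A small check, not from the slides, that the explicit degree-2 map of slide 613 behaves as claimed: its scalar products can be computed directly in the original space, which is exactly the idea behind the kernel trick introduced on the following slides (Python; the two points are arbitrary examples).

```python
import math

def phi(x1, x2):
    """Explicit degree-2 feature map from slide 613."""
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1, math.sqrt(2) * x2,
            math.sqrt(2) * x1 * x2, 1.0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, y = (0.3, -1.2), (2.0, 0.5)
lhs = dot(phi(*x), phi(*y))    # scalar product in the transformed space R^6
rhs = (dot(x, y) + 1) ** 2     # evaluated directly in the original space R^2
print(lhs, rhs)                # equal up to rounding
```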
[Slide 614] Outline (next: Kernels).

[Slide 615] Kernel Trick

For the optimization, we eventually need to compute scalar products between points. A special similarity function, a kernel, is used for an implicit feature transformation:
► instead of computing new features \phi(x) for all objects and then computing scalar products in the new space,
► a kernel computes the scalar product in the transformed space directly, without explicitly computing the transformed space first:

K(p_i, p_j) = \langle \phi(p_i), \phi(p_j) \rangle

► This is an interesting property for computationally expensive feature transformations ("kernel trick").

[Slide 616] Kernel

When is a function a kernel function?

A function K defined on pairs of points is a kernel function if, for any points x_1, ..., x_n, the kernel matrix (Gram matrix)

KM(K) = \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_n) \\ \vdots & \ddots & \vdots \\ K(x_n, x_1) & \cdots & K(x_n, x_n) \end{pmatrix}

is positive semi-definite (i.e., the Gram matrix does not have any negative eigenvalues).

[Slide 617] Necessary Conditions for a Kernel Function

Necessary (but not sufficient) conditions for K being a kernel function are:

► symmetry: K_\phi(p_i, p_j) = \langle \phi(p_i), \phi(p_j) \rangle = \langle \phi(p_j), \phi(p_i) \rangle = K_\phi(p_j, p_i),
► Cauchy-Schwarz: K_\phi(p_i, p_j)^2 \leq K_\phi(p_i, p_i) \cdot K_\phi(p_j, p_j).
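A numerical illustration of the Gram-matrix criterion (a sketch assuming numpy is available; the points are random examples): build the kernel matrix for a handful of points and check symmetry and the sign of its eigenvalues.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

X = np.random.default_rng(1).random((5, 2))          # five random 2-d points
K = np.array([[rbf(a, b) for b in X] for a in X])    # Gram matrix KM(K)

print(np.allclose(K, K.T))          # symmetry
print(np.linalg.eigvalsh(K).min())  # smallest eigenvalue; non-negative (up to rounding) for a valid kernel
```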
[Slide 618] Combinations of Kernels

For given kernels K_1, K_2, a constant a \in \mathbb{R}^+, and a symmetric, positive semi-definite matrix B, the following functions K are also kernels:
► K(x, y) = K_1(x, y) \cdot K_2(x, y),
► K(x, y) = K_1(x, y) + K_2(x, y),
► K(x, y) = a \cdot K_1(x, y),
► K(x, y) = x^T B y.

[Slide 619] Typical Kernels

Typical examples of kernel functions are:
► linear: K(x, y) = \langle x, y \rangle,
► polynomial: K(x, y) = (\langle x, y \rangle + c)^d,
► radial basis function: K(x, y) = e^{-\gamma \cdot \|x - y\|^2},
► Gauss kernel: K(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)},
► sigmoid: K(x, y) = \tanh(\gamma \cdot \langle x, y \rangle + c).

[Slide 620] Decision Boundaries

(Figure: decision boundaries obtained with a radial basis function kernel and with a polynomial kernel of 2nd degree.)
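The kernels listed on slide 619, written out as functions (a sketch assuming numpy; the parameter names gamma, sigma, c, d are the usual conventions, not taken from the slides). They correspond to the 'linear', 'poly', 'rbf', and 'sigmoid' kernel options found in common SVM implementations such as scikit-learn's SVC; note that the sigmoid function is not positive semi-definite for all parameter choices, which is one reason why symmetry alone is not sufficient (slide 617).

```python
import numpy as np

def linear(x, y):
    return float(np.dot(x, y))

def polynomial(x, y, c=1.0, d=2):
    return float((np.dot(x, y) + c) ** d)

def rbf(x, y, gamma=1.0):
    return float(np.exp(-gamma * np.linalg.norm(x - y) ** 2))

def gauss(x, y, sigma=1.0):
    return float(np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2)))

def sigmoid(x, y, gamma=1.0, c=0.0):
    return float(np.tanh(gamma * np.dot(x, y) + c))
```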
[Slide 621] Outline (next: Discussion).

[Slide 622] Discussion

Pros:
► can generate classifiers of high accuracy,
► can be formulated as a convex optimization problem, thus efficient optimization algorithms can find the global optimum,
► relatively low tendency to overfit (based on generalization theory), though one should prefer simpler kernels,
► efficient application of the learned model,
► compact models.

Cons:
► often extensive training time,
► non-trivial implementation (note that we did not go into the details of the optimization algorithms),
► derived classification models are hard to interpret.

[Slide 623] Outline (next: Regression).

[Slide 624] Recommended Reading

► Zaki and Meira Jr., Part V (a further appendix reference on this slide is not legible in the transcript).

[Slide 625] Regression vs. Classification

► In the classification problem, each object belongs to one of a set of (discrete) classes c_i \in C = \{c_1, ..., c_k\}.
► The function we want to learn or approximate is a function f: D \to C, i.e., f(x) \in C.
► In the regression problem, the prediction aims at a real value.
► To each object x belongs a unique value in \mathbb{R}, i.e., the function we want to predict or approximate is a function f: D \to \mathbb{R}.
► Examples:
► prediction of the selling rates of some product,
► recommended amount of fertilizer for a given type of soil.
► The predicted value f(x) is also called the predicted variable, response variable, or dependent variable, and is typically denoted by y.

[Slide 626] Different Regression Tasks

► Linear regression: we assume that the predicted variable y follows a linear function.
► Multiple regression: the input variable x is multi-dimensional (a vector).
► Non-linear regression: the general case; the regression function does not need to be linear.

[Slide 627] Linear Regression

Given: objects x_i sampled from a random variable X; s training objects are additionally annotated with a value following the random variable Y.

Assumption: Y = \alpha + \beta X. The observed values y_i deviate from the line with constant variance (distances \varepsilon_i of y_i at x_i from the line).

Objective: find a line describing the expected values E(Y \mid X).

Solution: method of least squares: determine \alpha and \beta such that

L = \sum_{i=1}^{s} \varepsilon_i^2 = \sum_{i=1}^{s} (y_i - \alpha - \beta x_i)^2

is minimal. (Figure: observations over "years of experience" with the fitted line.)
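A direct implementation of this least-squares fit (a sketch, not from the slides; the closed-form expressions it uses are derived on the next slide, and the data are hypothetical):

```python
def fit_line(xs, ys):
    """Least-squares estimates for offset alpha and slope beta."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

xs = [1, 2, 3, 5, 8]       # hypothetical: years of experience
ys = [30, 34, 41, 50, 66]  # hypothetical: observed response values
print(fit_line(xs, ys))    # (alpha, beta) of the fitted line y = alpha + beta * x
```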
[Slide 628] Solution

Setting the partial derivatives of L to zero returns estimates for the line's

slope: \beta = \frac{\sum_{i=1}^{s} (x_i - E(X))(y_i - E(Y))}{\sum_{i=1}^{s} (x_i - E(X))^2}

offset: \alpha = E(Y) - \beta E(X)

[Slide 629] Multiple Regression

In multiple (linear) regression we are seeking a regression hyperplane.

Given: objects x \in \mathbb{R}^d.
Assumption: Y depends on a linear combination of the d attributes of x:

y = \alpha + \sum_{k=1}^{d} \beta_k x_k

Solution: minimize the quadratic error

L = \sum_{i=1}^{s} \varepsilon_i^2 = \sum_{i=1}^{s} \left( y_i - \alpha - \sum_{k=1}^{d} \beta_k x_{i,k} \right)^2

[Slide 630] Non-linear Regression

Given: random variables X and Y.
Assumption: Y depends on X non-linearly.
Example: we assume a polynomial dependency:

Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3

Solution: define new variables X_1 = X, X_2 = X^2, X_3 = X^3 and solve the multiple linear regression

Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3

[Slide 631] Overfitting

The higher the degree of the polynomial, the more degrees of freedom the approximation has, and the better the observations can be approximated. (The slide is repeated with fits of polynomial degree 0 through 6 to the same eight observations.)

► Outliers might influence the model strongly.
► Overfitting problem.
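Polynomial regression via the variable transformation of slide 630, and the overfitting effect of slide 631, in a few lines (a sketch assuming numpy; the observations are hypothetical):

```python
import numpy as np

def fit_polynomial(xs, ys, degree):
    """Build the design matrix [1, x, x^2, ..., x^degree] and solve the
    multiple linear least-squares problem (variable transformation, slide 630)."""
    X = np.vander(np.asarray(xs, dtype=float), degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(ys, dtype=float), rcond=None)
    return coeffs                                  # [alpha, beta_1, ..., beta_degree]

xs = list(range(1, 9))                             # 8 hypothetical observations
ys = [1.2, 2.1, 2.9, 3.2, 4.8, 4.9, 6.1, 7.2]
for d in (1, 3, 7):
    X = np.vander(np.asarray(xs, dtype=float), d + 1, increasing=True)
    sse = float(np.sum((np.asarray(ys) - X @ fit_polynomial(xs, ys, d)) ** 2))
    print(d, round(sse, 4))                        # the training error shrinks as the degree grows
```

With degree 7 the eight observations are interpolated exactly (zero training error), which is precisely the overfitting behavior the slides warn about.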
[Slide 632] Outline (next: Summary).
[Slide 633] Summary

► Entropy and Gini index as measures of randomness and purity.
► Decision tree learning:
► basic greedy algorithm for decision tree learning,
► strategies for the selection of attributes and of split points,
► bias and geometric interpretation (separating piecewise axis-parallel hyperplanes),
► overfitting and error-reduction pruning (model selection).
► Linear vs. non-linear separation and learning.
► Basics on artificial neural nets.
► Basics on support vector machines.
► Basics on regression.
