Machine Learning 1: Supervised Learning PDF

Summary

These notes introduce supervised learning in machine learning, covering concepts such as training, utilization, and generalization. The document discusses the types of problems addressed by supervised learning (prediction, curve fitting, classification, and regression) and how class labels are encoded, explains the importance of a well-chosen training set and adequate model complexity for good generalization, and then covers support vector machines for linearly separable problems and decision trees built with the ID3 algorithm.

Full Transcript


MACHINE LEARNING 1 UNIT 1: SUPERVISED LEARNING 2.1 INTRODUCTION  Two phases of learning: ◦ Training  The examples are presented to the system (training set)  The system “learns” from the examples  The system gradually modifies its adjustable parameters until the output matches the desired output (target)  We need a measure of performance/accuracy/error ◦ Utilization  New examples never seen before  We ask the system to generalize  Need of a test set 2.1 INTRODUCTION  Generalization: ◦ We do not want the system to memorize  Give the correct answer only to the training examples  Hard for humans, easy for a computer ◦ We want to learn to generalize  Easier for humans (to conceptualize) than for a computer (to represent concepts)  The animal and human mind looks for patterns even where there are none  Pareidolias  We have to extract the essence, the structure of the data, and not just learn the correct answer for some cases ◦ Important: test in new cases once the system is trained  Check if the good results are beyond the training set 2.1 INTRODUCTION  The dataset is made up of tuples ◦ ◦ Training set  We want a model or system that, when faced with new attribute values, is capable of returning a value consistent with the data set ◦ Train a model ◦ Generalization ability  To be applied to new data 2.1 INTRODUCTION  Training: ◦ The ability of a model to solve a problem (knowledge) is linked to the examples used during learning. ◦ The training set must:  Being meaningful:  Sufficient number of examples  If the training set is small, the network will not be able to solve the problem efficiently.  Be representative:  Various examples: all regions of the state space must be sufficiently covered.  If a learning set contains many more instances of one type than the rest, the model will specialize in that subset and not be of general application. 2.1 INTRODUCTION  If a model has been well chosen, with adequate complexity, and trained, with a well chosen training set, it will have good generalization ability ◦ Will work well with new instances  What happens if there are instances in some region where there were no data in the training? ◦ Poorly chosen training set: it was not representative  That area where we want to use it was not represented in the training set 2.1 INTRODUCTION  What happens if there are instances in some region where there were no data in the training? ◦ Example:  Train a model with 763,231,921 Asian camels with 2 humps  Result: the model will predict that camels have 2 humps  What if we use as “input” an African camel (dromedary)???  It will return that it has 2 humps, but it only has one!!!  The model was used with an instance where there were no training samples  Poorly chosen training set: it was not representative  The area where we want to use it was not fully represented in the training set 2.1 INTRODUCTION  What happens if there are instances in some region where there were no data in the training? ◦ Example:  Do all cats have fur?  It is not so important to observe many as that all breeds are represented 2.1 INTRODUCTION  Parameters vs. hyperparameters ◦ Parameters: Each of the values that define the behavior of a model  They are internal to the model. 
 Their values are set by the training process ◦ Hyperparameters: Parameters whose value is used to control the learning process  Based on them, the values of the parameters are determined  Their values are set by the user  Examples (Model: parameters / hyperparameters):  Polynomial: coefficients / order  ANN: weights, biases / architecture, learning rate, activation functions, etc.  SVM: α / kernel, C, etc. 2.1 INTRODUCTION  Problems that can be solved in supervised learning: ◦ Prediction ◦ Curve fitting ◦ Classification ◦ Regression 2.1 INTRODUCTION  Applications: ◦ Prediction  It attempts to determine the function that maps a set of input variables to one (or more) output variables.  It is basically numerical  It is based on statistical assumptions  Examples:  Monitoring seat reservations in airline companies  Short-term financial predictions 2.1 INTRODUCTION  Applications: ◦ Curve fitting:  Model the curve that lies behind a set of examples (xi, yi) 2.1 INTRODUCTION  Applications: ◦ Classification  Classify objects into a finite number of classes, given their properties (inputs)  Look for a mapping function that allows class 1 to be separated from class 2, class 2 from class 3, and so on  The number of classes is finite and known  The class membership of each sample is known 2.1 INTRODUCTION  Applications: ◦ Regression:  Generic  From a dataset, try to find a model of the relationship between two sets of variables  Instances: (inputs, targets)  Unknown relationship  The model must be generalizable, that is, it should return coherent outputs for new, unknown samples  The other problems (prediction, curve fitting) can be seen as particular cases of this one  To assess the goodness of a regression system, some type of error (mean error, MSE, etc.) is used 2.1 INTRODUCTION  Supervised classification: ◦ Encoding of the output if the model does not return categorical outputs (e.g. ANNs), 2 possibilities:  As real numbers  Only if in the real world there is an order between the equivalent classes  Example: classify the quality of a product as low/medium/high  Outputs 0/0.5/1 or 0/1/2 …  The classification problem becomes a regression problem  As boolean values (much more common)  More than 2 classes (one-hot encoding)  2 classes (A/-A, -/+, A/B, etc.)
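To make the two encodings above concrete, here is a small illustrative sketch in Python/numpy (the label list is invented; only the 0/0.5/1 ordinal scheme and the one-hot layout come from the slides):

```python
import numpy as np

# Hypothetical labels for the product-quality example (low/medium/high).
labels = ["low", "medium", "high", "low", "high"]
classes = ["low", "medium", "high"]

# Encoding as real numbers: only valid because the classes are ordered.
# Here 0/0.5/1 is used, as in the slides; 0/1/2 would work the same way.
ordinal_map = {"low": 0.0, "medium": 0.5, "high": 1.0}
y_ordinal = np.array([ordinal_map[l] for l in labels])

# One-hot encoding: one output per class, 1 for the true class, 0 elsewhere.
y_onehot = np.zeros((len(labels), len(classes)))
for i, l in enumerate(labels):
    y_onehot[i, classes.index(l)] = 1.0

print(y_ordinal)   # [0.  0.5 1.  0.  1. ]
print(y_onehot)    # 5x3 matrix with a single 1 per row
```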
2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  One output per class is used  Target: each target takes the value 1 if that instance belongs to that class, or 0 otherwise  Output: real values:  ANN: The softmax function is applied to the outputs yi of the last layer: ŷi = e^yi / Σj=1..C e^yj  From real values, it returns values between 0 and 1  The sum of all of them is equal to 1  Each value: probability of belonging to that class  Final output: class with the highest certainty/probability 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  ANN: Example of softmax: 4 real-valued outputs are mapped to 4 values in [0,1] whose sum is 1; the chosen class is the one with the largest value 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  ANN: Error calculation:  Cross-entropy:  Given two distributions p(x), with the true values, and q(x), with the estimates, the cross-entropy over a set of instances x is H(p, q) = −Σx p(x) log q(x)  Given the output of an ANN for an example n, with targets yn = p(x) and outputs ŷn = q(x): H(y, ŷ) = −Σn=1..N yn · log ŷn 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  ANN: Error calculation:  Cross-entropy: H(y, ŷ) = −Σn=1..N yn · log ŷn  yn · log ŷn is the dot product of two vectors of C components (C classes): yn · log ŷn = Σc=1..C ync log ŷnc  Since there is only one target equal to 1 and the rest are 0, only the output corresponding to the positive target is evaluated  Averaging across all examples: J = −(1/N) Σn=1..N log ŷn, where ŷn is the output for the positive class of example n 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  The target is encoded as 0/1 or -1/1 values  Depends on the classifier  For example:  ANN: 0/1  SVM: -1/1 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  The model returns real numbers as output  In addition to the classification value, it gives information on the safety or certainty of the classification  ANNs  Targets: 0/1  The sigmoid function f(n) = 1 / (1 + e^−n) is usually applied as transfer function  High values: output tends to 1  Greater certainty of classification as positive  Low values: output tends to 0  Greater certainty of classification as negative  Classification: threshold at 0.5 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  The model returns real numbers as output  In addition to the classification value, it gives information on the safety or certainty of the classification  SVM  Targets: -1/1  Output tends to ∞: higher certainty of classification as positive  Output tends to -∞: higher certainty of classification as negative  Classification:  Threshold at 0  Look at the sign
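A minimal numpy sketch of the softmax output and the averaged cross-entropy cost J described above; the logits and targets are made-up numbers and the helper names are my own:

```python
import numpy as np

def softmax(z):
    # Subtracting the row maximum is a standard numerical-stability trick;
    # it does not change the result of e^{z_i} / sum_j e^{z_j}.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical raw outputs of the last layer for N = 3 examples and C = 4 classes.
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.2, 3.0, 0.5,  0.0],
                   [1.0, 1.0, 1.0,  5.0]])
# One-hot targets: exactly one 1 per row.
targets = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1]], dtype=float)

probs = softmax(logits)                  # rows in [0, 1], each summing to 1
predicted_class = probs.argmax(axis=1)   # class with the highest probability

# J = -(1/N) * sum_n log(output for the positive class of example n):
# only the output matching the target equal to 1 contributes.
J = -np.mean(np.log(np.sum(targets * probs, axis=1)))
print(predicted_class, J)
```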
2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  ANN: Error calculation (yn: target, ŷn: output):  Difference between output and target  ME, MSE, etc., based on Σn=1..N (ŷn − yn)  Binary cross-entropy: −(1/N) Σn=1..N [yn log ŷn + (1 − yn) log(1 − ŷn)] 2.1 INTRODUCTION  Supervised classification: ◦ Many models only allow binary classification  Two classes  For example, Support Vector Machines ◦ To be able to apply them to multi-class problems:  Voting based on the combination of binary classifiers  "One-vs-All"  "One-vs-One" 2.1 INTRODUCTION  Supervised classification: ◦ One-vs-All or One-vs-Rest:  Consider the problem as a collection of binary problems (2 classes)  For k classes, k binary classifiers are trained (k>2)  One per class  i-th classifier (i=1,…,k):  Data belonging to class i → "positive"  Rest of the data → "negative"  Similar to one-hot encoding  To select the class to which each sample belongs:  The sample is used as input to the k classifiers  A voting scheme is usually used among the k classifiers 2.1 INTRODUCTION  Supervised classification: ◦ One-vs-All or One-vs-Rest:  Usual voting scheme: winner-takes-all  Let fi(x, w) be the output of classifier i  Numerical output  The label of the classifier with the greatest certainty that the sample x is in the +1 class is assigned, that is, f(x, w) = arg maxi fi(x, w) 2.1 INTRODUCTION  Supervised classification: ◦ One-vs-One:  In this method k(k−1)/2 models are built, separating each class from each other  That is, all possible pairs of one class vs. another are created and these binary problems are solved; in total there are k(k−1)/2 of them  For each pair (i, j), i < j, of class labels, a binary classifier is constructed, classifying the points of class i with the label +1 and those of class j with the label −1  Subsequently, a voting method is used to select the class of each sample (see the sketches below)
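The following sketch illustrates the two-class error measures just listed, using a sigmoid output thresholded at 0.5; the raw outputs and targets are invented for illustration:

```python
import numpy as np

def sigmoid(n):
    # Transfer function f(n) = 1 / (1 + e^(-n)); output in (0, 1).
    return 1.0 / (1.0 + np.exp(-n))

# Hypothetical raw network outputs and 0/1 targets for N = 4 examples.
raw = np.array([2.3, -1.1, 0.4, -3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])       # targets
y_hat = sigmoid(raw)                      # outputs in (0, 1)

pred = (y_hat >= 0.5).astype(int)         # classification: threshold at 0.5

mse = np.mean((y_hat - y) ** 2)           # mean squared error
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # binary cross-entropy
print(pred, mse, bce)
```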
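And a sketch of One-vs-Rest winner-takes-all voting with k = 3 hypothetical linear binary classifiers fi(x) = wi·x + bi (the weights and bias values are made up; a One-vs-One scheme would instead build k(k−1)/2 pairwise classifiers and take a majority vote):

```python
import numpy as np

# Hypothetical parameters of k = 3 binary linear classifiers f_i(x) = w_i . x + b_i,
# each trained (elsewhere) to separate class i ("positive") from the rest ("negative").
W = np.array([[ 1.0, -0.5],
              [-0.8,  1.2],
              [ 0.1,  0.9]])
b = np.array([0.2, -0.1, -0.4])

def one_vs_rest_predict(x):
    # Winner-takes-all: assign the label of the classifier with the largest
    # (most "positive") output, i.e. argmax_i f_i(x, w).
    scores = W @ x + b
    return int(np.argmax(scores))

x_new = np.array([0.3, 0.7])
print(one_vs_rest_predict(x_new))
```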
SUPPORT VECTOR MACHINES  SVM ◦ Classification technique ◦ They are derived from the statistical learning theory postulated by Vapnik and Chervonenkis ◦ They were introduced in 1992, and have received much interest in recent years for:  Very high classification accuracy  Very firm mathematical basis  In many fields they have shown to outperform the most successful systems (ANNs) ◦ Classifier in two classes  Generalizable to more classes  Extendable for regression problems SUPPORT VECTOR MACHINES  From a series of observations, each consisting of a pair of data: ◦ A vector xi ∈ R n , i = 1,..., l ◦ A label yi ∈{+1,−1}  Suppose we have a hyperplane separating the positive (+1) from the negative (-1) samples.
◦ Linearly separable problems ◦ Example, in R2: SUPPORT VECTOR MACHINES  Linearly separable problems 𝑤 𝑤 𝑤 ◦ Example, in R2 : hyperplane : w1*x1 + w2*x2 + b 𝑥 𝑥 +1 𝑥 -1 w Tx + b = 0 x2 _ All samples in the hyperplane satisfy: w*x+b = 0 All samples meet: w*x+b > 0 All samples meet: w*x+b < 0 (in this image the subscripts of x and x denote x1 _the dimension, not that they are samples 1 2 1 and 2) SUPPORT VECTOR MACHINES  From a series of observations, each consisting of a pair of data: ◦ A vector xi ∈ R n , i = 1,..., l ◦ A label yi ∈{+1,−1}  Suppose we have a hyperplane separating the positive (+1) from the negative (-1) samples. ◦ Linearly separable problems ◦ Hyperplane, general case: wT x + b =0  The points xi that are on the hyperplane satisfy wT x + b = 0  The points xi that are "on one side" of the hyperplane satisfy wT x + b > 0  The points xi that are "on the other side" of the hyperplane satisfy wT x + b < 0 SUPPORT VECTOR MACHINES  In the real world, the vast majority of problems are not linearly separable.  Study of the SVMs in three different cases: ◦ Linearly separable problems  Linear SVMs  Separation region: hyperplane ◦ Nonlinearly separable problems  Linear SVMs  Separation region: hyperplane  Some error in classification is allowed ◦ Nonlinear SVM  Even admitting some error, it is not possible to separate the data  More complex separation regions SUPPORT VECTOR MACHINES  Linearly separable problems ◦ We want to find the hyperplane w⋅ x + b = 0 defined by the pair (w, b), such that the point xi can be separated according to the function f ( xi ) = 1 w ⋅ xi + b > 0 and yi =1 -1 w ⋅ xi + b < 0 and yi =-1 i.e. 1 and yi =1 f ( xi ) = sign( w ⋅ xi + b) = -1 and yi =-1 where x N and b  That is, the function f classifies each point as +1 or -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems Any can be good, but +1 which is the best? -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen +1 -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen, +1 given this new instance… -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen, +1 given this new instance… it is classified as -1 -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen, +1 given this new instance… it is classified as -1 -1 when it should be +1 SUPPORT VECTOR MACHINES  Linearly separable problems Problem of the +1 chosen hyperplane: it is too close to the -1 samples A sample close to the hyperplane could "fall" to either side SUPPORT VECTOR MACHINES  Linearly separable problems So, which hyperplane +1 best classifies new -1 cases? SUPPORT VECTOR MACHINES  Linearly separable problems So, which hyperplane +1 best classifies new -1 cases? The one that is more separated from the samples of both classes SUPPORT VECTOR MACHINES  Linearly separable problems w x +b= 0 +1 We define the -1 hyperplane SUPPORT VECTOR MACHINES  Linearly separable problems +1 We define the margin -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 The idea is to maximize the margin. 
-1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 Other hyperplanes with other maximized margins -1 would be possible. SUPPORT VECTOR MACHINES  Linearly separable problems +1 Other hyperplanes with other maximized margins -1 would be possible. SUPPORT VECTOR MACHINES  Linearly separable problems The hyperplane with the +1 largest margin is the best classifier of the -1 data. This is the simplest class of SVM, the LSVM. SUPPORT VECTOR MACHINES  Linearly separable problems +1 Support vectors are the samples that touch the -1 margin boundary. SUPPORT VECTOR MACHINES  Linearly separable problems Possible decision frontiers for linearly separable data SUPPORT VECTOR MACHINES  Linearly separable problems b11 _ B1 b12 _ B2 Decision margin SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Given a set of training examples, ◦ Build a hyperplane “w*x + b” as a decision surface such that the separation of the two classes is maximum (generalization principle)  Decision margin as wide as possible  Training seeks to find w and b SUPPORT VECTOR MACHINES  Linearly separable problems + 1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ hyperplanes of the margins are defined : "positive" hyperplane: w*x + b = +1 + 1 “negative” hyperplane: w*x + b = -1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ hyperplanes of the margins are defined : "positive" hyperplane: w*x + b = +1 + 1 “negative” hyperplane: w*x + b = -1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 It is desired that within the margin there are no instances: SUPPORT VECTOR MACHINES  Linearly separable problems ◦ hyperplanes of the margins are defined : "positive" hyperplane: w*x + b = +1 + 1 “negative” hyperplane: w*x + b = -1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 It is desired that within the margin there are no instances: All samples meet: wx+b≥+1 All samples meet: wx+b≤-1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ If we work with a set of vectors (samples) S, we say that it is linearly separable if there exists (w,b) such that the inequalities ( w ⋅ xi + b) ≥ 1 yi =1 (positive cases) i=1,…,L ( w ⋅ xi + b) ≤ −1 yi =-1 (negative cases) ◦ are valid for all elements of the set S  For the linearly separable case of S, a unique optimal hyperplane can be found, for which the margin between the training points of two different classes is maximized SUPPORT VECTOR MACHINES  Linearly separable problems ◦ These equations ( w ⋅ xi + b) ≥ 1 yi =1 i=1,…,L ( w ⋅ xi + b) ≤ −1 yi =-1 ◦ can be reformulated to yi ( w ⋅ xi + b) ≥ 1 i=1,…,L SUPPORT VECTOR MACHINES  Linearly separable problems ◦ The closest examples to the hyperplane are the so-called support vectors  Only these matter, the rest of the training examples can be ignored r SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Decision function "positive" hyperplane: w*x + b = +1 “negative” hyperplane: w*x + b = -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 All samples meet: wx+b≥+1 All samples meet: wx+b≤-1 The support vectors satisfy: w x + b = +1 w x + b = -1 SUPPORT VECTOR MACHINES  Linearly separable problems “positive” hyperplane : w x + b = +1 “negative” hyperplane : w x + b = -1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Maximize margin width, d wx + b = 0 x1 – x2 _ d x2 _ x1 _ w wx + b = 1 wx + b = -1 Decision frontier and margin of 
SVM SUPPORT VECTOR MACHINES  Linearly separable problems ◦ We want to find w and b such that the margin is maximized ◦ Let the “margin” d be the distance between the “positive” and “negative” hyperplanes  It is desired to maximize that value d  For each training example: wT xi + b ≤ − d if yi =-1 2 w xi + b ≥ d T if yi =1 2  Namely: ( ) yi wT xi + b ≥ d 2 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ For each support vector, the above inequality is an equality ( ) yi wT xi + b = d 2 can be transformed to the previous formulation ( ) yi wT xi + b = 1 by rescaling w and b by d/2  The distance from an T example xi to the separating plane is w xi + b r= w  A support vector is at the margin boundary T  Putting both equations together: r = y i (w x i + b) = 1 w w  Therefore, the separation margin to be maximized is: 2 d = 2r = w SUPPORT VECTOR MACHINES  Linearly separable problems 2 ◦ The margin is equal to: w ◦ The idea is to find a hyperplane with the maximum "margin"  This is an optimization problem: T maximize: 2 w subject to: yi ( w xi + b) ≥ 1 which can also be expressed as 1 T minimize: w 2 subject to: yi ( w xi + b) ≥ 1 2 2  where w is the Euclidean norm of w  That is, we need to optimize a quadratic function subject to linear constraints  One constraint for each sample SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Quadratic optimization problems are a well-known class of mathematical problems for which several algorithms exist. ◦ But the problem can be transformed to make it easier to handle:  Lagrange multipliers (αi) are associated with each of the inequalities of the original problem  The dual problem is constructed: l w −  α i [ yi (w·x i + b ) − 1] 2 LP ≡ 1 2 i =1 α i ≥ 0 para todo α i i=1,…,L  A value αi is associated to each sample xi and to each inequality SUPPORT VECTOR MACHINES  Linearly separable problems ◦ To construct the dual problem, a value αi is associated to each inequality (and, therefore, to each sample), to include the constraints within the expression  Original problem: 1 T minimize: w 2 subject to:yi ( w xi + b) ≥ 1 2  Dual problem: l w −  α i [ yi (w·x i + b ) − 1] 2 LP ≡ 1 2 i =1 α i ≥ 0 para todo α i i=1,…,L  This problem must be minimized with respect to w and b, and maximized with respect to α i >= 0 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Dual problem minimized with respect to w and b:  Calculate the derivative Lp with respect to a and b and set it equal to 0 ∂L l ∂b P =0 α y i =1 i i =0 ∂LP l =0 w =  α i yi xi ∂w i =1  Substitute this expression into the above equation: l w −  α i [ yi (w·x i + b ) − 1] l 1 l l 2 LP ≡ 1 LD =  α i −  α iα j yi y j x i x j 2 T i =1 l w =  α i yi xi i =1 2 i =1 j =1 i =1  which must be maximized with respect to αi, with the constraint l  i =1 α i yi = 0 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ The equation to optimize is therefore l 1 l l maximize: LD =  α i −  α iα j yi y j x i x j T i =1 2 i =1 j =1 l subject to: α y i =1 i i =0 α i ≥ 0 i=1,…,L para todo α i ◦ In this expression, w and b no longer appear, only the αi and yi  one for each sample SUPPORT VECTOR MACHINES  Linearly separable problems ◦ The equation to optimize is therefore l 1 l l maximize: LD =  α i −  α iα j yi y j x i x j T i =1 2 i =1 j =1 l subject to: α y i =1 i i =0 α i ≥ 0 i=1,…,L para todo α i ◦ This optimization can be solved by quadratic programming (QP) techniques (Bertsekas 1995) SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Given a solution to 
that problem, the solution to the "primary" problem would be, for w: l w =  α i yi x i i =1  Previous expression  Sum for each training sample  Each αi not equal to 0 indicates that the corresponding xi is a support vector  If αi is 0, that vector xi has not influence on the calculation of w (product by 0) SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Given a solution to that problem, the solution to the "primary" problem would be, for b:  A support vector satisfies yk ( wT xk + b) = 1  Then the parameter b is calculated: yk ( wT xk + b) = 1 wT xk + b = 1 yk = yk b = yk − wT xk for any support vector αk > 0  For anyone it will give the same value  In practice, the value of b is averaged to avoid rounding problems: NVS 1  i xi )T for each support vector b= ( y − w NVS : number of support NVS i =1 vectors SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Therefore, the solution is: l b = yk − wT xk w =  α i yi x i l for any αk > 0 i =1 b = yk −  α i yi xiT xk i =1 ◦ In this way, the classification function is: l f ( x) =  α i yi xiT x + b i =1  It is not necessary to calculate the w  In this function, a scalar product is performed between the point to classify and each of the support vectors  Instances that are not support vectors have αi = 0 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Furthermore, if the αi ≥ 0, Karush Kuhn Tucker conditions can be applied to convert the inequalities: yi ( wT xi + b) ≥ 1 into several equality equations for each example xi: α i ( yi ( w xi + b) − 1) = 0 T ◦ All examples where αi > 0 are the support vectors  If α i ≠ 0, then T yi ( w xi + b) − 1 = 0 yi ( wT xi + b) = 1 and therefore xi is at the limit of the decision margin SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: 1 0.9 x1 x2 Y 0.8 0.3858 0.4687 1 0.7 0.4871 0.611 -1 0.6 0.9218 0.4103 -1 x2 0.5 0.7382 0.8936 -1 0.4 0.1763 0.0579 1 0.4057 0.3529 1 0.3 0.9355 0.8132 -1 0.2 0.2146 0.0099 1 0.1 0 0 0.2 0.4 0.6 0.8 1 x1 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: 1 0.9 x1 x2 Y αi 0.8 0.7 0.3858 0.4687 1 65.5261 0.4871 0.611 -1 65.5261 0.6 0.9218 0.4103 -1 0 x2 0.5 0.7382 0.8936 -1 0 0.4 0.1763 0.0579 1 0 0.3 0.4057 0.3529 1 0 0.9355 0.8132 -1 0 0.2 0.2146 0.0099 1 0 0.1 0 0 0.2 0.4 0.6 0.8 1 x1 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: 1 0.9 x1 x2 Y αi 0.8 0.7 0.3858 0.4687 1 65.5261 0.4871 0.611 -1 65.5261 0.6 0.9218 0.4103 -1 0 x2 0.5 0.7382 0.8936 -1 0 0.4 0.1763 0.0579 1 0 0.3 0.4057 0.3529 1 0 0.9355 0.8132 -1 0 0.2 0.2146 0.0099 1 0 0.1 0 0 0.2 0.4 0.6 0.8 1 x1 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: ◦ Solution:  Calculation of w: l 0.3858 0.4871 − 6.64 w =  α i yi x i = 65.5621*1*   + 65.5621* (−1) *   =  i =1  0.4687   0. 
611   − 9.32   Calculation of b:  With the first instance: 0.3858 b = yk − w xk = 1 − [− 6.64 − 9.32]*  T  = 1 − (−6.929996) = 7.929996 0.4687   With the second instance: 0.4871 b = yk − wT xk = −1 − [− 6.64 − 9.32]*   = −1 − (−8.928862) = 7.928864  0.611   Both values are averaged: b = 7.929 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: ◦ To classify a new sample x: f ( x) = w ⋅ x + b l or f ( x ) =  αi yi xiT x + b i =1  Classification:  If f(x)>0 1  If f(x) 0 1 1 0 2 Normal 3 1 2 0 Normal (entropy = 0.971) 4 0 2 1 Normal 5 1 1 1 carcinogenic 6 2 2 1 carcinogenic DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node (“Body”):  Therefore, the entropy as a result of this division with the attribute "Body" is  Weighted sum of the entropies of the branches 1·0 5 · 0.971 0.81 1 5  gives an measure of the uncertainty in the branches when dividing with that attribute DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node Example 1 Antennas 1 (“Queues”): Tails 0 Nuclei 2 Body Striped Class Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal Tails 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic Example Antennas Nuclei Body Class 5 1 1 Striped carcinogenic Example Antennas Nuclei Body Class Example Antennas Nuclei Body Class 3 1 0 Striped Normal 1 1 2 Striped Normal 4 0 1 Striped Normal 2 1 1 White carcinogenic 6 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node (“Tails”):  For each branch:  Branch “0”: 2 instances  Equal number of instances of each class: total uncertainty  Entropy 1  Branch “1”: 1 instances  All instances of the same class: full certainty  Entropy 0  Branch “2”: 3 instances  2 instances of the “Normal” class, 1 “Carcinogenic”  Entropy > 0 (entropy = 0.9183) DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node (“Tails”):  Therefore, the entropy as a result of this division with the attribute “Tails” is  Weighted sum of the entropies of the branches 2·1 1 · 0 3 · 0.918 0.793 2 1 3  gives an measure of the uncertainty in the branches when dividing with that attribute DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node:  Entropy comparison:  Split with “Body”: Entropy = 0.81  Split with “Tails”: Entropy = 0.79  The “Tails” attribute generates a partition with lower entropy (higher certainty) in the branches  The entropy is be calculated with the rest of the attributes  Choose the one with less entropy  This process is applied recursively on each sub-tree DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n)  Example:  I = -(3/6)log2(3/6) - (3/6)log2 (3/6) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n)  Examples:  1 single class: nc = n  I = - ( nc/n) log2 (nc/n) = 0  2 classes, with the same number of samples in each: nc = n/2  I = - 2 * (nc/n) log2 (nc/n) = - 2 * (1/2) log2 (1/2) = 1  5 classes, with the same number of samples in each: nc = n/5  I = - 5 * (nc/n) log2 (nc/n) = - 5 * (1/5) log2 (1/5) = 2.32  That is, the more distributed the samples are in 
different classes, the higher the entropy  More uncertainty, less information about the data DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij  The same operation is performed, but only among the instances with value Vj in the attribute Ai  Sub-table  Result of splitting the data with the attribute Ai  The entropy in that subtable is measured DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I42 ? A4 = Body, V4 = {White, Striped}  Entropy of the examples that in "Body" have “Striped" Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I42 ? A4 = Body, V4 = {White, Striped}  Subtable with the examples that in "Body" have "Striped": Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I42 ? A4 = Body, V4 = {White, Striped}  Subtable with the examples that in "Body" have "Striped": Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic  I42 = - (3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I41 ? A4 = Body, V4 = {White, Striped}  Entropy of the examples that in "Body" have "White" Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I41 ? 
A 4 = Body, V4 = {White, Striped}  Sub-table of the examples that in “Body” have “White”: Example Antennas Tails Nuclei Body Class 2 1 0 1 White carcinogenic  I41 = 0  Full certainty DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij  The same operation is performed, but only between the samples with value Vj in the attribute Ai  Subtable  The entropy in that subtable is measured  As before, the more distributed the samples with value Vj in the attribute Ai among the different classes, the higher the entropy  More uncertainty DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij j  The weighted average of the entropies for each of the possible values in the attribute Ai  The weighted average of the entropies of the sub-tables for each value of Ai DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij A = {Antennas, Tails, Nuclei, Body} j A4 = Body, V4 = {White, Striped}  Example: I(A4)  Entropy of the “Body” attribute  I(A4) = (1/6)*0 + (5/6)*1.971 = 0.8091  Weighted average of the entropy of that attribute with each of the possible values DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij j  Gives a measure of the uncertainty associated with the attribute  The uncertainty that would remain when splitting the dataset with all values of that attribute  When creating the sub-tables, one for each value of the attribute DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij j ◦ Information gain of attribute Ai  G(Ai) = I – I(Ai) DECISION TREES  ID3: Example: ◦ We want to generate a decision tree that classifies between normal cells and carcinogenic cells according to the data in the following table: Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas):  Split the table based on the values of that attribute Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic Example Antennas Tails Nuclei Body Class 6 2 2 1 Striped carcinogenic  Calculate the entropies of those subtables  Sum of those entropies weighted by the number of instances DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 
Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0 Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0 Number of Antennas value 0, normal Number of Antennas value 0 Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0 Number of Antennas value 0, normal Number of Antennas Number of Antennas value 0 value 0 Number of Antennas value 0, carcinogenic Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0  Entropy = 0:  Total certainty about the class in which the samples are classified Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1 Number of Antennas value 1, normal Number of Antennas value 1 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1 Number of Antennas value 1, normal Number of Antennas Number of Antennas value 1 value 1 Number of Antennas value 1, carcinogenic Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1  Entropy = 1:  Highest entropy (uncertainty) that can be had with 2 classes  The samples are distributed into the two classes with 50% each  If there were more samples of one class than another, there would be more certainty, less entropy Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 2 of the attribute A1 (Antennas) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White 
carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 2 of the attribute A1 (Antennas)  I12 = ‐ (0/1)log2(0/1) - (1/1)log2(1/1) = 0 Number of Antennas value 2, normal Number of Antennas Number of Antennas value 2 value 2 Number of Antennas value 2, carcinogenic Example Antennas Tails Nuclei Body Class 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2 ( 0/1) = 0  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) – (2/4)log2 (2/4) = 1  Information entropy of the value 2 of the attribute A1 (Antennas)  I12 = ‐ (0/1)log2(0/1) – (1/1)log2 (1/1) = 0 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n)Iij = (1/6) ∙ I10 + (4/6) ∙ I11 + (1/6) ∙ I12 = 0.66 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n)Iij = (1/6) ∙ I10 + (4/6) ∙ I11 + (1/6) ∙ I12 = 0.66 Number of Antennas value 0 Number of Antennas value 2 Number of Antennas value 1 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n)Iij = (1/6) ∙ I10 + (4/6) ∙ I11 + (1/6) ∙ I12 = 0.66 Number of Antennas value 0 Number of Antennas value 2 Number of Antennas value 1 Number of Antennas (instances) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of attribute A2 (Tails)  I20 = - (1/2)log2 (1/2) – (1/2)log2(1/2) = 1  Information entropy of value 1 of attribute A2 (Tails)  I21 = -(0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of value 2 of attribute A2 (Tails)  I22 = -(2/3)log2(2/3) – (1/3)log2(1/3) = 0.9183  Information entropy of attribute A2 (Tails)  I(A2) = Σ (nij/n) Iij = (2/6) ∙ I20 + (1/6) ∙ I21 + (3/6) ∙ I22 = 0.79 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of attribute A3 (Nuclei)  I30 = - (1/1)log2(1/1) – (0/1)log2( 0/1) = 0  Information entropy of the value 1 of attribute A3 (Nuclei)  I31 = - (1/4)log2(1/4) – (3/4)log2(3/4) = 0.8113  Information entropy of value 2 of attribute A3 (Nuclei)  I32 = - (1/1)log2(1/1) – (0/1)log2(0/1) = 0  Information entropy of attribute A3 (Nuclei)  I(A3) = Σ (nij/n) Iij = (1/6) ∙ I30 + (4/6) ∙ I31 + (1/6) ∙ I32 = 0.54 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the White value of attribute A4 (Body)  I40 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of the Striped value of attribute A4 (Body)  I41 = -(2/5)log2(2/5 ) – (3/5)log2 ( 3/5) = 0.9710  Information entropy of attribute A4 (Body)  I(A4) 
= Σ (nij/n) Iij = (1/6) ∙ I40 + (5/6) ∙ I41 = 0.81 DECISION TREES  ID3: Example: ◦ The attribute with the lowest entropy is chosen:  The entropy of each attribute determines the level of uncertainty when separating the dataset with all the values of that attribute  The attribute with lower entropy indicates that by separating the samples with that attribute, each of the resulting sets will have more certainty in the data.  That is, the instances of the different classes will be further unified in each of those sets (each of the values of that attribute)  There will be less disparity of instances of different classes in each set  That is, it better explains the sets  Attribute with lowest Entropy: Nuclei DECISION TREES  ID3: Example: ◦ The attribute with the lowest entropy is chosen: Nuclei ◦ The dataset is split and that attribute is removed:  Nuclei = 0 Example Antennas Tails Body Class 3 1 2 Striped Normal  Nuclei = 1 Example Antennas Tails Body Class 2 1 0 White carcinogenic 4 0 2 Striped Normal 5 1 1 Striped carcinogenic 6 2 2 Striped carcinogenic  Nuclei = 2 Example Antennas Tails Body Class 1 1 0 Striped Normal DECISION TREES  ID3: Example: Nuclei 0 2 Example Antennas Tails Body Class Example Antennas Tails Body Class 3 1 2 Striped Normal 1 1 0 Striped Normal 1 Example Antennas Tails Body Class 2 1 0 White carcinogenic 4 0 2 Striped Normal 5 1 1 Striped carcinogenic 6 2 2 Striped carcinogenic DECISION TREES  ID3: Example: Nuclei 0 2 Normal Normal 1 Example Antennas Tails Body Class 2 1 0 White carcinogenic 4 0 2 Striped Normal 5 1 1 Striped carcinogenic 6 2 2 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = - (1/1)log2(1/1) – (0/1)log2(0/1) = 0  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = - (0/2)log2(0/2) – (2/2)log2(2/2) = 0  Information entropy of the value 2 of the attribute A1 (Antennas)  I12 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n) Iij = (1/4) ∙ I10 + (2/4) ∙ I11 + (1/4) ∙ I12 = 0 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of attribute A2 (Tails)  I20 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of the value 1 of the Tails attribute  I21 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of the value 2 of the Tails attribute  I22 = - (1/2)log2 (1/2) – (1/2)log2 (1/2) = 1  Information entropy of attribute Tails  I(A2) = Σ (nij/n) Iij = (1/4) ∙ I20 + (1/4) ∙ I21 + (2/4) ∙ I22 = 0.5 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the “White” value of attribute A4 (Body)  I40 = - (0/1)log2 ( 0/1 ) – (1/1)log2(1/1) = 0  Information entropy of the “Striped” value of attribute A4 (Body)  I41 = - (1/3)log2(1/3) – (2/3)log2(2/3) = 0.9183  Information entropy of attribute A4 (Body)  I(A4) = Σ (nij/n) Iij = (1/4) ∙ I40 + (3/4) ∙ I41 = 0.6887 DECISION TREES  ID3: Example: ◦ The attribute with the lowest entropy is chosen: Antennas ◦ The dataset is split and that attribute is removed:  Antennas = 0 Example Tails Body Class 4 2 Striped Normal  Antennas = 1 Example Tails Body Class 2 0 White carcinogenic 5 1 Striped carcinogenic  Antennas = 2 Example Tails Body Class 6 2 Striped carcinogenic DECISION TREES  ID3: Example Nuclei 0 2 1 Normal Normal Antennas 0 1 2 Normal Carcinogenic Carcinogenic DECISION TREES  ID3: Some practical questions: ◦ How 
much to grow the tree? ◦ Is this attribute selection measure appropiate? ◦ How to handle training data with missing (unknown) values in attributes? ◦ How to handle attributes with different costs? ◦ How to handle continuous attributes? DECISION TREES  ID3: How much to grow the tree? ◦ If the tree is grown until it correctly classifies all training examples: overfitting  If there is noise in the examples, the noise is learned  If there are few examples associated with the leaf nodes, they are not representative of the true function. ◦ ID3 produces trees that overfit training examples, and do not work properly with new examples  The model is not able to generalize DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning: ◦ Post-pruning: DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning:  Stop growing the tree before it reaches the point where it perfectly classifies the training examples  When to do it? DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning:  Stop growing the tree before it reaches the point where it perfectly classifies the training examples  When to do it?  Apply a statistical test to estimate whether expanding a particular node is likely to produce an improvement beyond the training set DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning:  Stop growing the tree before it reaches the point where it perfectly classifies the training examples  When to do it? ◦ Post-pruning:  Allow it to overfit the data, and then prune it by replacing subtrees with a leaf  Requires higher computational cost, but offers better results DECISION TREES  ID3: Solutions to overfitting : Post-pruning: ◦ Starting from the bottom, examine the subtrees of the non-terminal nodes ◦ Pruning a node entails:  1.- Delete its corresponding subtree with root in that node  2.- Convert the node to a leaf node  3.- Assign the most common classification of the training examples associated to that node ◦ Prune only if the resulting pruned tree improves or equals the performance of the original tree  Test (validation) ◦ The tree is pruned iteratively, always choosing the node whose pruning implies the greatest improvement, until there is no improvement DECISION TREES  ID3: Is this attribute selection measure adequate? ◦ The ID3 algorithm favors choosing attributes with many values  For example, attribute "Date" in a set with the examples on different days  This attribute perfectly characterizes each training example, and we would have a perfect tree of depth 1, with one leaf node per sample  The gain information is the highest  Having so many branches forces the examples to be separated into very small sets  But it would be very bad with unseen examples (test) DECISION TREES  ID3: Is this attribute selection measure appropriate? ◦ There are other measures instead of information gain  Gain ratio: penalizes attributes with many values and uniformly distributed G ( Ai ) H ( Ai ) where H(Ai) denotes the entropy with respect to the values of Ai nij nij H ( Ai ) = − log 2 j n n  It does not take into account the classes, only the different values it takes  When an attribute has many different values, it has a very large entropy, so it is divided by a larger value.  The gain ratio favors those attributes that, with equal gain, separate the data into fewer subsets DECISION TREES  ID3: How to handle attributes with missing values? 
◦ When classifying a new case with a missing value:  Instances with unknown values are classified according to the highest probability provided by the tree  Example:  Nuclei = ?  Antennas = 1 Nuclei 0 2 1 Normal Antennas Normal 0 2 1 Normal Carcinogenic Carcinogenic DECISION TREES  ID3: How to handle attributes with different costs? ◦ Are all attributes equally valuable when classifying? ◦ Enter the cost in the attribute selection measure  For example: G( Aj ) G( A ) 2 j −1 w: constant between 0 Coste Cost ( A j ) Coste( A j ) + 1) w ((Cost and 1 that weights the importance of cost versus profit ◦ They try to place the low-cost ones at the top of the tree, only calling the high-cost ones when necessary  That is, when we want to improve the classification DECISION TREES  ID3: How to handle continuous attributes? ◦ They are discretized:  They are partitioned into a discrete set of intervals  Example: 2 intervals: A=c  Find a value of c with the highest information gain  First, order the examples by the attribute A  This value is always where the value of the class changes DECISION TREES  ID3: How to handle continuous attributes? ◦ They are discretized:  Process:  Order the examples according to the continuous value of A  Take each pair of contiguous values of A with a different value of C and

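As a closing illustration of the ID3 attribute selection described above, the sketch below recomputes the dataset entropy and the attribute entropies for the six-cell example (roughly 0.66, 0.79, 0.54 and 0.81, so Nuclei is selected first); encoding the table as Python dictionaries is my own choice:

```python
import math
from collections import Counter

# The six-cell example from the ID3 section: (Antennas, Tails, Nuclei, Body) -> Class.
data = [
    ({"Antennas": 1, "Tails": 0, "Nuclei": 2, "Body": "Striped"}, "Normal"),
    ({"Antennas": 1, "Tails": 0, "Nuclei": 1, "Body": "White"},   "Carcinogenic"),
    ({"Antennas": 1, "Tails": 2, "Nuclei": 0, "Body": "Striped"}, "Normal"),
    ({"Antennas": 0, "Tails": 2, "Nuclei": 1, "Body": "Striped"}, "Normal"),
    ({"Antennas": 1, "Tails": 1, "Nuclei": 1, "Body": "Striped"}, "Carcinogenic"),
    ({"Antennas": 2, "Tails": 2, "Nuclei": 1, "Body": "Striped"}, "Carcinogenic"),
]

def entropy(examples):
    # I = -sum_c (n_c / n) log2 (n_c / n) over the classes present in `examples`.
    n = len(examples)
    counts = Counter(cls for _, cls in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def attribute_entropy(examples, attr):
    # I(Ai) = sum_j (n_ij / n) * I_ij: weighted entropy of each value's sub-table.
    n = len(examples)
    values = {x[attr] for x, _ in examples}
    total = 0.0
    for v in values:
        subset = [(x, c) for x, c in examples if x[attr] == v]
        total += (len(subset) / n) * entropy(subset)
    return total

I = entropy(data)                                   # 1.0 (3 Normal, 3 Carcinogenic)
for attr in ["Antennas", "Tails", "Nuclei", "Body"]:
    Ia = attribute_entropy(data, attr)
    print(attr, round(Ia, 2), "gain:", round(I - Ia, 2))
# Nuclei has the lowest entropy (~0.54), so ID3 places it at the root.
```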
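Continuing the previous sketch (it reuses the entropy, attribute_entropy and data definitions given there), this is one way to compute the gain ratio mentioned as an alternative attribute-selection measure; it penalizes attributes with many, evenly spread values:

```python
import math
from collections import Counter

def split_info(examples, attr):
    # H(Ai) = -sum_j (n_ij / n) log2 (n_ij / n): entropy of the attribute's values,
    # ignoring the classes; it is large when the attribute has many distinct values.
    n = len(examples)
    counts = Counter(x[attr] for x, _ in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attr):
    # Gain ratio = G(Ai) / H(Ai); reuses entropy() and attribute_entropy()
    # from the ID3 sketch above.
    g = entropy(examples) - attribute_entropy(examples, attr)
    return g / split_info(examples, attr)

for attr in ["Antennas", "Tails", "Nuclei", "Body"]:
    print(attr, round(gain_ratio(data, attr), 3))
```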
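Finally, going back to the linear SVM worked example earlier in the notes, this sketch recovers w and b from the two non-zero α values given there and uses the sign of w·x + b to classify a new point; the test point is arbitrary:

```python
import numpy as np

# Support vectors, labels and alpha values taken from the worked example in the
# linear-SVM section above (only two alphas are non-zero).
sv_x = np.array([[0.3858, 0.4687],    # y = +1
                 [0.4871, 0.6110]])   # y = -1
sv_y = np.array([1.0, -1.0])
alpha = np.array([65.5261, 65.5261])

# w = sum_i alpha_i * y_i * x_i  (only the support vectors contribute).
w = (alpha * sv_y) @ sv_x             # approximately [-6.64, -9.32]

# b = y_k - w . x_k for any support vector; averaged over all of them
# to avoid rounding problems, as the slides suggest.
b = np.mean(sv_y - sv_x @ w)          # approximately 7.93

def classify(x):
    # Decision function f(x) = sign(w . x + b): +1 on one side, -1 on the other.
    return int(np.sign(w @ x + b))

print(w, b, classify(np.array([0.2, 0.1])))   # a point from the +1 region
```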