Machine Learning 1: Supervised Learning PDF

Summary

These notes introduce supervised learning in machine learning, covering concepts such as training, utilization, and generalization. The document discusses the types of problems addressed by supervised learning (prediction, curve fitting, classification, and regression) and how class labels are encoded, explains the importance of a well-chosen training set and adequate model complexity for good generalization, and then covers support vector machines for linearly separable problems and decision trees built with the ID3 algorithm.

Full Transcript


MACHINE LEARNING 1 UNIT 1: SUPERVISED LEARNING 2.1 INTRODUCTION  Two phases of learning: ◦ Training  The examples are presented to the system (training set)  The system “learns” from the examples  The system gradually modifies its adjustable parameters until the output matches the desired output (target)  We need a measure of performance/accuracy/error ◦ Utilization  New examples never seen before  We ask the system to generalize  Need of a test set 2.1 INTRODUCTION  Generalization: ◦ We do not want the system to memorize  Give the correct answer only to the training examples  Hard for humans, easy for a computer ◦ We want to learn to generalize  Easier for humans (to conceptualize) than for a computer (to represent concepts)  The animal and human mind looks for patterns even where there are none  Pareidolias  We have to extract the essence, the structure of the data, and not just learn the correct answer for some cases ◦ Important: test in new cases once the system is trained  Check if the good results are beyond the training set 2.1 INTRODUCTION  The dataset is made up of tuples ◦ ◦ Training set  We want a model or system that, when faced with new attribute values, is capable of returning a value consistent with the data set ◦ Train a model ◦ Generalization ability  To be applied to new data 2.1 INTRODUCTION  Training: ◦ The ability of a model to solve a problem (knowledge) is linked to the examples used during learning. ◦ The training set must:  Being meaningful:  Sufficient number of examples  If the training set is small, the network will not be able to solve the problem efficiently.  Be representative:  Various examples: all regions of the state space must be sufficiently covered.  If a learning set contains many more instances of one type than the rest, the model will specialize in that subset and not be of general application. 2.1 INTRODUCTION  If a model has been well chosen, with adequate complexity, and trained, with a well chosen training set, it will have good generalization ability ◦ Will work well with new instances  What happens if there are instances in some region where there were no data in the training? ◦ Poorly chosen training set: it was not representative  That area where we want to use it was not represented in the training set 2.1 INTRODUCTION  What happens if there are instances in some region where there were no data in the training? ◦ Example:  Train a model with 763,231,921 Asian camels with 2 humps  Result: the model will predict that camels have 2 humps  What if we use as “input” an African camel (dromedary)???  It will return that it has 2 humps, but it only has one!!!  The model was used with an instance where there were no training samples  Poorly chosen training set: it was not representative  The area where we want to use it was not fully represented in the training set 2.1 INTRODUCTION  What happens if there are instances in some region where there were no data in the training? ◦ Example:  Do all cats have fur?  It is not so important to observe many as that all breeds are represented 2.1 INTRODUCTION  Parameters vs. hyperparameters ◦ Parameters: Each of the values that define the behavior of a model  They are internal to the model. 
 Their values are set by the training process ◦ Hyperparameters: Parameters whose value is used to control the learning process  Based on them, the values of the parameters are determined  Their values are set by the user  Examples (Model: parameters / hyperparameters):  Polynomial: coefficients / order  ANN: weights, biases / architecture, learning rate, activation functions, etc.  SVM: α / kernel, C, etc. 2.1 INTRODUCTION  Problems that can be solved in supervised learning: ◦ Prediction ◦ Curve fitting ◦ Classification ◦ Regression 2.1 INTRODUCTION  Applications: ◦ Prediction  It attempts to determine the function that maps a set of input variables to one (or more) output variables.  It is basically numerical  It is based on statistical assumptions  Examples:  Monitoring seat reservations in airline companies  Short-term financial predictions 2.1 INTRODUCTION  Applications: ◦ Curve fitting:  Model the curve that lies behind a set of examples (xi, yi) 2.1 INTRODUCTION  Applications: ◦ Classification  Classify objects into a finite number of classes, given their properties (inputs)  Look for a mapping function that allows class 1 to be separated from class 2, class 2 from class 3, and so on  The number of classes is finite and known  The class membership of each sample is known 2.1 INTRODUCTION  Applications: ◦ Regression:  Generic  From a dataset, try to find a model of the relationship between two sets of variables  Instances: (inputs, targets)  Unknown relationship  The model must be generalizable, that is, it should return coherent outputs for new, unknown samples  The other problems (prediction, curve fitting) can be seen as particular cases of this one  To assess the goodness of a regression system, some type of error (mean error, MSE, etc.) is used 2.1 INTRODUCTION  Supervised classification: ◦ Encoding of the output if the model does not return categorical outputs (e.g. ANNs), 2 possibilities:  As real numbers  Only if in the real world there is an order between the equivalent classes  Example: classify the quality of a product as low/medium/high  Outputs 0/0.5/1 or 0/1/2 …  The classification problem becomes a regression problem  As boolean values (much more common)  More than 2 classes (one-hot encoding)  2 classes (A/-A, -/+, A/B, etc.)
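To make the two encodings above concrete, here is a small illustrative sketch in Python/numpy (the label list is invented; only the 0/0.5/1 ordinal scheme and the one-hot layout come from the slides):

```python
import numpy as np

# Hypothetical labels for the product-quality example (low/medium/high).
labels = ["low", "medium", "high", "low", "high"]
classes = ["low", "medium", "high"]

# Encoding as real numbers: only valid because the classes are ordered.
# Here 0/0.5/1 is used, as in the slides; 0/1/2 would work the same way.
ordinal_map = {"low": 0.0, "medium": 0.5, "high": 1.0}
y_ordinal = np.array([ordinal_map[l] for l in labels])

# One-hot encoding: one output per class, 1 for the true class, 0 elsewhere.
y_onehot = np.zeros((len(labels), len(classes)))
for i, l in enumerate(labels):
    y_onehot[i, classes.index(l)] = 1.0

print(y_ordinal)   # [0.  0.5 1.  0.  1. ]
print(y_onehot)    # 5x3 matrix with a single 1 per row
```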
2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  One output per class is used  Target: each target takes the value 1 if that instance belongs to that class, or 0 otherwise  Output: real values:  ANN: The softmax function is applied to the outputs yi of the last layer: ŷi = e^yi / Σj=1..C e^yj  From real values, it returns values between 0 and 1  The sum of all of them is equal to 1  Each value: probability of belonging to that class  Final output: class with the highest certainty/probability 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  ANN: Example of softmax: 4 real-valued outputs are mapped to 4 values in [0,1] whose sum is 1; the chosen class is the one with the largest value 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  ANN: Error calculation:  Cross-entropy:  Given two distributions p(x), with the true values, and q(x), with the estimates, the cross-entropy over a set of instances x is H(p, q) = −Σx p(x) log q(x)  Given the output of an ANN for an example n, with targets yn = p(x) and outputs ŷn = q(x): H(y, ŷ) = −Σn=1..N yn · log ŷn 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  More than 2 classes (one-hot encoding):  ANN: Error calculation:  Cross-entropy: H(y, ŷ) = −Σn=1..N yn · log ŷn  yn · log ŷn is the dot product of two vectors of C components (C classes): yn · log ŷn = Σc=1..C ync log ŷnc  Since there is only one target equal to 1 and the rest are 0, only the output corresponding to the positive target is evaluated  Averaging across all examples: J = −(1/N) Σn=1..N log ŷn, where ŷn is the output for the positive class of example n 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  The target is encoded as 0/1 or -1/1 values  Depends on the classifier  For example:  ANN: 0/1  SVM: -1/1 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  The model returns real numbers as output  In addition to the classification value, it gives information on the safety or certainty of the classification  ANNs  Targets: 0/1  The sigmoid function f(n) = 1 / (1 + e^−n) is usually applied as transfer function  High values: output tends to 1  Greater certainty of classification as positive  Low values: output tends to 0  Greater certainty of classification as negative  Classification: threshold at 0.5 2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  The model returns real numbers as output  In addition to the classification value, it gives information on the safety or certainty of the classification  SVM  Targets: -1/1  Output tends to ∞: higher certainty of classification as positive  Output tends to -∞: higher certainty of classification as negative  Classification:  Threshold at 0  Look at the sign
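A minimal numpy sketch of the softmax output and the averaged cross-entropy cost J described above; the logits and targets are made-up numbers and the helper names are my own:

```python
import numpy as np

def softmax(z):
    # Subtracting the row maximum is a standard numerical-stability trick;
    # it does not change the result of e^{z_i} / sum_j e^{z_j}.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical raw outputs of the last layer for N = 3 examples and C = 4 classes.
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.2, 3.0, 0.5,  0.0],
                   [1.0, 1.0, 1.0,  5.0]])
# One-hot targets: exactly one 1 per row.
targets = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1]], dtype=float)

probs = softmax(logits)                  # rows in [0, 1], each summing to 1
predicted_class = probs.argmax(axis=1)   # class with the highest probability

# J = -(1/N) * sum_n log(output for the positive class of example n):
# only the output matching the target equal to 1 contributes.
J = -np.mean(np.log(np.sum(targets * probs, axis=1)))
print(predicted_class, J)
```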
2.1 INTRODUCTION  Supervised classification: ◦ Output encoding: As boolean values  2 classes (A/-A, -/+, A/B, etc.):  ANN: Error calculation (yn: target, ŷn: output):  Difference between output and target  ME, MSE, etc., based on Σn=1..N (ŷn − yn)  Binary cross-entropy: −(1/N) Σn=1..N [yn log ŷn + (1 − yn) log(1 − ŷn)] 2.1 INTRODUCTION  Supervised classification: ◦ Many models only allow binary classification  Two classes  For example, Support Vector Machines ◦ To be able to apply them to multi-class problems:  Voting based on the combination of binary classifiers  "One-vs-All"  "One-vs-One" 2.1 INTRODUCTION  Supervised classification: ◦ One-vs-All or One-vs-Rest:  Consider the problem as a collection of binary problems (2 classes)  For k classes, k binary classifiers are trained (k>2)  One per class  i-th classifier (i=1,…,k):  Data belonging to class i → "positive"  Rest of the data → "negative"  Similar to one-hot encoding  To select the class to which each sample belongs:  The sample is used as input to the k classifiers  A voting scheme is usually used among the k classifiers 2.1 INTRODUCTION  Supervised classification: ◦ One-vs-All or One-vs-Rest:  Usual voting scheme: winner-takes-all  Let fi(x, w) be the output of classifier i  Numerical output  The label of the classifier with the greatest certainty that the sample x is in the +1 class is assigned, that is, f(x, w) = arg maxi fi(x, w) 2.1 INTRODUCTION  Supervised classification: ◦ One-vs-One:  In this method k(k−1)/2 models are built, separating each class from each other  That is, all possible pairs of one class vs. another are created and these binary problems are solved; in total there are k(k−1)/2 of them  For each pair (i, j), i < j, of class labels, a binary classifier is constructed, classifying the points of class i with the label +1 and those of class j with the label −1  Subsequently, a voting method is used to select the class of each sample (see the sketches below)
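The following sketch illustrates the two-class error measures just listed, using a sigmoid output thresholded at 0.5; the raw outputs and targets are invented for illustration:

```python
import numpy as np

def sigmoid(n):
    # Transfer function f(n) = 1 / (1 + e^(-n)); output in (0, 1).
    return 1.0 / (1.0 + np.exp(-n))

# Hypothetical raw network outputs and 0/1 targets for N = 4 examples.
raw = np.array([2.3, -1.1, 0.4, -3.0])
y = np.array([1.0, 0.0, 1.0, 0.0])       # targets
y_hat = sigmoid(raw)                      # outputs in (0, 1)

pred = (y_hat >= 0.5).astype(int)         # classification: threshold at 0.5

mse = np.mean((y_hat - y) ** 2)           # mean squared error
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # binary cross-entropy
print(pred, mse, bce)
```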
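And a sketch of One-vs-Rest winner-takes-all voting with k = 3 hypothetical linear binary classifiers fi(x) = wi·x + bi (the weights and bias values are made up; a One-vs-One scheme would instead build k(k−1)/2 pairwise classifiers and take a majority vote):

```python
import numpy as np

# Hypothetical parameters of k = 3 binary linear classifiers f_i(x) = w_i . x + b_i,
# each trained (elsewhere) to separate class i ("positive") from the rest ("negative").
W = np.array([[ 1.0, -0.5],
              [-0.8,  1.2],
              [ 0.1,  0.9]])
b = np.array([0.2, -0.1, -0.4])

def one_vs_rest_predict(x):
    # Winner-takes-all: assign the label of the classifier with the largest
    # (most "positive") output, i.e. argmax_i f_i(x, w).
    scores = W @ x + b
    return int(np.argmax(scores))

x_new = np.array([0.3, 0.7])
print(one_vs_rest_predict(x_new))
```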
SUPPORT VECTOR MACHINES  SVM ◦ Classification technique ◦ They are derived from the statistical learning theory postulated by Vapnik and Chervonenkis ◦ They were introduced in 1992, and have received much interest in recent years for:  Very high classification accuracy  Very firm mathematical basis  In many fields they have shown to outperform the most successful systems (ANNs) ◦ Classifier in two classes  Generalizable to more classes  Extendable for regression problems SUPPORT VECTOR MACHINES  From a series of observations, each consisting of a pair of data: ◦ A vector xi ∈ R n , i = 1,..., l ◦ A label yi ∈{+1,−1}  Suppose we have a hyperplane separating the positive (+1) from the negative (-1) samples.
◦ Linearly separable problems ◦ Example, in R2: SUPPORT VECTOR MACHINES  Linearly separable problems 𝑤 𝑤 𝑤 ◦ Example, in R2 : hyperplane : w1*x1 + w2*x2 + b 𝑥 𝑥 +1 𝑥 -1 w Tx + b = 0 x2 _ All samples in the hyperplane satisfy: w*x+b = 0 All samples meet: w*x+b > 0 All samples meet: w*x+b < 0 (in this image the subscripts of x and x denote x1 _the dimension, not that they are samples 1 2 1 and 2) SUPPORT VECTOR MACHINES  From a series of observations, each consisting of a pair of data: ◦ A vector xi ∈ R n , i = 1,..., l ◦ A label yi ∈{+1,−1}  Suppose we have a hyperplane separating the positive (+1) from the negative (-1) samples. ◦ Linearly separable problems ◦ Hyperplane, general case: wT x + b =0  The points xi that are on the hyperplane satisfy wT x + b = 0  The points xi that are "on one side" of the hyperplane satisfy wT x + b > 0  The points xi that are "on the other side" of the hyperplane satisfy wT x + b < 0 SUPPORT VECTOR MACHINES  In the real world, the vast majority of problems are not linearly separable.  Study of the SVMs in three different cases: ◦ Linearly separable problems  Linear SVMs  Separation region: hyperplane ◦ Nonlinearly separable problems  Linear SVMs  Separation region: hyperplane  Some error in classification is allowed ◦ Nonlinear SVM  Even admitting some error, it is not possible to separate the data  More complex separation regions SUPPORT VECTOR MACHINES  Linearly separable problems ◦ We want to find the hyperplane w⋅ x + b = 0 defined by the pair (w, b), such that the point xi can be separated according to the function f ( xi ) = 1 w ⋅ xi + b > 0 and yi =1 -1 w ⋅ xi + b < 0 and yi =-1 i.e. 1 and yi =1 f ( xi ) = sign( w ⋅ xi + b) = -1 and yi =-1 where x N and b  That is, the function f classifies each point as +1 or -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 How to classify this data? -1 SUPPORT VECTOR MACHINES  Linearly separable problems Any can be good, but +1 which is the best? -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen +1 -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen, +1 given this new instance… -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen, +1 given this new instance… it is classified as -1 -1 SUPPORT VECTOR MACHINES  Linearly separable problems If this hyperplane is chosen, +1 given this new instance… it is classified as -1 -1 when it should be +1 SUPPORT VECTOR MACHINES  Linearly separable problems Problem of the +1 chosen hyperplane: it is too close to the -1 samples A sample close to the hyperplane could "fall" to either side SUPPORT VECTOR MACHINES  Linearly separable problems So, which hyperplane +1 best classifies new -1 cases? SUPPORT VECTOR MACHINES  Linearly separable problems So, which hyperplane +1 best classifies new -1 cases? The one that is more separated from the samples of both classes SUPPORT VECTOR MACHINES  Linearly separable problems w x +b= 0 +1 We define the -1 hyperplane SUPPORT VECTOR MACHINES  Linearly separable problems +1 We define the margin -1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 The idea is to maximize the margin. 
-1 SUPPORT VECTOR MACHINES  Linearly separable problems +1 Other hyperplanes with other maximized margins -1 would be possible. SUPPORT VECTOR MACHINES  Linearly separable problems +1 Other hyperplanes with other maximized margins -1 would be possible. SUPPORT VECTOR MACHINES  Linearly separable problems The hyperplane with the +1 largest margin is the best classifier of the -1 data. This is the simplest class of SVM, the LSVM. SUPPORT VECTOR MACHINES  Linearly separable problems +1 Support vectors are the samples that touch the -1 margin boundary. SUPPORT VECTOR MACHINES  Linearly separable problems Possible decision frontiers for linearly separable data SUPPORT VECTOR MACHINES  Linearly separable problems b11 _ B1 b12 _ B2 Decision margin SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Given a set of training examples, ◦ Build a hyperplane “w*x + b” as a decision surface such that the separation of the two classes is maximum (generalization principle)  Decision margin as wide as possible  Training seeks to find w and b SUPPORT VECTOR MACHINES  Linearly separable problems + 1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ hyperplanes of the margins are defined : "positive" hyperplane: w*x + b = +1 + 1 “negative” hyperplane: w*x + b = -1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ hyperplanes of the margins are defined : "positive" hyperplane: w*x + b = +1 + 1 “negative” hyperplane: w*x + b = -1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 It is desired that within the margin there are no instances: SUPPORT VECTOR MACHINES  Linearly separable problems ◦ hyperplanes of the margins are defined : "positive" hyperplane: w*x + b = +1 + 1 “negative” hyperplane: w*x + b = -1 -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 It is desired that within the margin there are no instances: All samples meet: wx+b≥+1 All samples meet: wx+b≤-1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ If we work with a set of vectors (samples) S, we say that it is linearly separable if there exists (w,b) such that the inequalities ( w ⋅ xi + b) ≥ 1 yi =1 (positive cases) i=1,…,L ( w ⋅ xi + b) ≤ −1 yi =-1 (negative cases) ◦ are valid for all elements of the set S  For the linearly separable case of S, a unique optimal hyperplane can be found, for which the margin between the training points of two different classes is maximized SUPPORT VECTOR MACHINES  Linearly separable problems ◦ These equations ( w ⋅ xi + b) ≥ 1 yi =1 i=1,…,L ( w ⋅ xi + b) ≤ −1 yi =-1 ◦ can be reformulated to yi ( w ⋅ xi + b) ≥ 1 i=1,…,L SUPPORT VECTOR MACHINES  Linearly separable problems ◦ The closest examples to the hyperplane are the so-called support vectors  Only these matter, the rest of the training examples can be ignored r SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Decision function "positive" hyperplane: w*x + b = +1 “negative” hyperplane: w*x + b = -1 w*x + b = 0 w*xi + b ≥ 0 for yi = +1 w*xi + b ≤ 0 for yi = -1 All samples meet: wx+b≥+1 All samples meet: wx+b≤-1 The support vectors satisfy: w x + b = +1 w x + b = -1 SUPPORT VECTOR MACHINES  Linearly separable problems “positive” hyperplane : w x + b = +1 “negative” hyperplane : w x + b = -1 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Maximize margin width, d wx + b = 0 x1 – x2 _ d x2 _ x1 _ w wx + b = 1 wx + b = -1 Decision frontier and margin of 
SVM SUPPORT VECTOR MACHINES  Linearly separable problems ◦ We want to find w and b such that the margin is maximized ◦ Let the “margin” d be the distance between the “positive” and “negative” hyperplanes  It is desired to maximize that value d  For each training example: wT xi + b ≤ − d if yi =-1 2 w xi + b ≥ d T if yi =1 2  Namely: ( ) yi wT xi + b ≥ d 2 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ For each support vector, the above inequality is an equality ( ) yi wT xi + b = d 2 can be transformed to the previous formulation ( ) yi wT xi + b = 1 by rescaling w and b by d/2  The distance from an T example xi to the separating plane is w xi + b r= w  A support vector is at the margin boundary T  Putting both equations together: r = y i (w x i + b) = 1 w w  Therefore, the separation margin to be maximized is: 2 d = 2r = w SUPPORT VECTOR MACHINES  Linearly separable problems 2 ◦ The margin is equal to: w ◦ The idea is to find a hyperplane with the maximum "margin"  This is an optimization problem: T maximize: 2 w subject to: yi ( w xi + b) ≥ 1 which can also be expressed as 1 T minimize: w 2 subject to: yi ( w xi + b) ≥ 1 2 2  where w is the Euclidean norm of w  That is, we need to optimize a quadratic function subject to linear constraints  One constraint for each sample SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Quadratic optimization problems are a well-known class of mathematical problems for which several algorithms exist. ◦ But the problem can be transformed to make it easier to handle:  Lagrange multipliers (αi) are associated with each of the inequalities of the original problem  The dual problem is constructed: l w −  α i [ yi (w·x i + b ) − 1] 2 LP ≡ 1 2 i =1 α i ≥ 0 para todo α i i=1,…,L  A value αi is associated to each sample xi and to each inequality SUPPORT VECTOR MACHINES  Linearly separable problems ◦ To construct the dual problem, a value αi is associated to each inequality (and, therefore, to each sample), to include the constraints within the expression  Original problem: 1 T minimize: w 2 subject to:yi ( w xi + b) ≥ 1 2  Dual problem: l w −  α i [ yi (w·x i + b ) − 1] 2 LP ≡ 1 2 i =1 α i ≥ 0 para todo α i i=1,…,L  This problem must be minimized with respect to w and b, and maximized with respect to α i >= 0 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Dual problem minimized with respect to w and b:  Calculate the derivative Lp with respect to a and b and set it equal to 0 ∂L l ∂b P =0 α y i =1 i i =0 ∂LP l =0 w =  α i yi xi ∂w i =1  Substitute this expression into the above equation: l w −  α i [ yi (w·x i + b ) − 1] l 1 l l 2 LP ≡ 1 LD =  α i −  α iα j yi y j x i x j 2 T i =1 l w =  α i yi xi i =1 2 i =1 j =1 i =1  which must be maximized with respect to αi, with the constraint l  i =1 α i yi = 0 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ The equation to optimize is therefore l 1 l l maximize: LD =  α i −  α iα j yi y j x i x j T i =1 2 i =1 j =1 l subject to: α y i =1 i i =0 α i ≥ 0 i=1,…,L para todo α i ◦ In this expression, w and b no longer appear, only the αi and yi  one for each sample SUPPORT VECTOR MACHINES  Linearly separable problems ◦ The equation to optimize is therefore l 1 l l maximize: LD =  α i −  α iα j yi y j x i x j T i =1 2 i =1 j =1 l subject to: α y i =1 i i =0 α i ≥ 0 i=1,…,L para todo α i ◦ This optimization can be solved by quadratic programming (QP) techniques (Bertsekas 1995) SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Given a solution to 
that problem, the solution to the "primary" problem would be, for w: l w =  α i yi x i i =1  Previous expression  Sum for each training sample  Each αi not equal to 0 indicates that the corresponding xi is a support vector  If αi is 0, that vector xi has not influence on the calculation of w (product by 0) SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Given a solution to that problem, the solution to the "primary" problem would be, for b:  A support vector satisfies yk ( wT xk + b) = 1  Then the parameter b is calculated: yk ( wT xk + b) = 1 wT xk + b = 1 yk = yk b = yk − wT xk for any support vector αk > 0  For anyone it will give the same value  In practice, the value of b is averaged to avoid rounding problems: NVS 1  i xi )T for each support vector b= ( y − w NVS : number of support NVS i =1 vectors SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Therefore, the solution is: l b = yk − wT xk w =  α i yi x i l for any αk > 0 i =1 b = yk −  α i yi xiT xk i =1 ◦ In this way, the classification function is: l f ( x) =  α i yi xiT x + b i =1  It is not necessary to calculate the w  In this function, a scalar product is performed between the point to classify and each of the support vectors  Instances that are not support vectors have αi = 0 SUPPORT VECTOR MACHINES  Linearly separable problems ◦ Furthermore, if the αi ≥ 0, Karush Kuhn Tucker conditions can be applied to convert the inequalities: yi ( wT xi + b) ≥ 1 into several equality equations for each example xi: α i ( yi ( w xi + b) − 1) = 0 T ◦ All examples where αi > 0 are the support vectors  If α i ≠ 0, then T yi ( w xi + b) − 1 = 0 yi ( wT xi + b) = 1 and therefore xi is at the limit of the decision margin SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: 1 0.9 x1 x2 Y 0.8 0.3858 0.4687 1 0.7 0.4871 0.611 -1 0.6 0.9218 0.4103 -1 x2 0.5 0.7382 0.8936 -1 0.4 0.1763 0.0579 1 0.4057 0.3529 1 0.3 0.9355 0.8132 -1 0.2 0.2146 0.0099 1 0.1 0 0 0.2 0.4 0.6 0.8 1 x1 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: 1 0.9 x1 x2 Y αi 0.8 0.7 0.3858 0.4687 1 65.5261 0.4871 0.611 -1 65.5261 0.6 0.9218 0.4103 -1 0 x2 0.5 0.7382 0.8936 -1 0 0.4 0.1763 0.0579 1 0 0.3 0.4057 0.3529 1 0 0.9355 0.8132 -1 0 0.2 0.2146 0.0099 1 0 0.1 0 0 0.2 0.4 0.6 0.8 1 x1 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: 1 0.9 x1 x2 Y αi 0.8 0.7 0.3858 0.4687 1 65.5261 0.4871 0.611 -1 65.5261 0.6 0.9218 0.4103 -1 0 x2 0.5 0.7382 0.8936 -1 0 0.4 0.1763 0.0579 1 0 0.3 0.4057 0.3529 1 0 0.9355 0.8132 -1 0 0.2 0.2146 0.0099 1 0 0.1 0 0 0.2 0.4 0.6 0.8 1 x1 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: ◦ Solution:  Calculation of w: l 0.3858 0.4871 − 6.64 w =  α i yi x i = 65.5621*1*   + 65.5621* (−1) *   =  i =1  0.4687   0. 
611   − 9.32   Calculation of b:  With the first instance: 0.3858 b = yk − w xk = 1 − [− 6.64 − 9.32]*  T  = 1 − (−6.929996) = 7.929996 0.4687   With the second instance: 0.4871 b = yk − wT xk = −1 − [− 6.64 − 9.32]*   = −1 − (−8.928862) = 7.928864  0.611   Both values are averaged: b = 7.929 SUPPORT VECTOR MACHINES  Linearly Separable Problems: Example: ◦ To classify a new sample x: f ( x) = w ⋅ x + b l or f ( x ) =  αi yi xiT x + b i =1  Classification:  If f(x)>0 1  If f(x) 0 1 1 0 2 Normal 3 1 2 0 Normal (entropy = 0.971) 4 0 2 1 Normal 5 1 1 1 carcinogenic 6 2 2 1 carcinogenic DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node (“Body”):  Therefore, the entropy as a result of this division with the attribute "Body" is  Weighted sum of the entropies of the branches 1·0 5 · 0.971 0.81 1 5  gives an measure of the uncertainty in the branches when dividing with that attribute DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node Example 1 Antennas 1 (“Queues”): Tails 0 Nuclei 2 Body Striped Class Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal Tails 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic Example Antennas Nuclei Body Class 5 1 1 Striped carcinogenic Example Antennas Nuclei Body Class Example Antennas Nuclei Body Class 3 1 0 Striped Normal 1 1 2 Striped Normal 4 0 1 Striped Normal 2 1 1 White carcinogenic 6 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node (“Tails”):  For each branch:  Branch “0”: 2 instances  Equal number of instances of each class: total uncertainty  Entropy 1  Branch “1”: 1 instances  All instances of the same class: full certainty  Entropy 0  Branch “2”: 3 instances  2 instances of the “Normal” class, 1 “Carcinogenic”  Entropy > 0 (entropy = 0.9183) DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node (“Tails”):  Therefore, the entropy as a result of this division with the attribute “Tails” is  Weighted sum of the entropies of the branches 2·1 1 · 0 3 · 0.918 0.793 2 1 3  gives an measure of the uncertainty in the branches when dividing with that attribute DECISION TREES  ID3: Definitions: ◦ When choosing an attribute as a node:  Entropy comparison:  Split with “Body”: Entropy = 0.81  Split with “Tails”: Entropy = 0.79  The “Tails” attribute generates a partition with lower entropy (higher certainty) in the branches  The entropy is be calculated with the rest of the attributes  Choose the one with less entropy  This process is applied recursively on each sub-tree DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n)  Example:  I = -(3/6)log2(3/6) - (3/6)log2 (3/6) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n)  Examples:  1 single class: nc = n  I = - ( nc/n) log2 (nc/n) = 0  2 classes, with the same number of samples in each: nc = n/2  I = - 2 * (nc/n) log2 (nc/n) = - 2 * (1/2) log2 (1/2) = 1  5 classes, with the same number of samples in each: nc = n/5  I = - 5 * (nc/n) log2 (nc/n) = - 5 * (1/5) log2 (1/5) = 2.32  That is, the more distributed the samples are in 
different classes, the higher the entropy  More uncertainty, less information about the data DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij  The same operation is performed, but only among the instances with value Vj in the attribute Ai  Sub-table  Result of splitting the data with the attribute Ai  The entropy in that subtable is measured DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I42 ? A4 = Body, V4 = {White, Striped}  Entropy of the examples that in "Body" have “Striped" Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I42 ? A4 = Body, V4 = {White, Striped}  Subtable with the examples that in "Body" have "Striped": Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I42 ? A4 = Body, V4 = {White, Striped}  Subtable with the examples that in "Body" have "Striped": Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic  I42 = - (3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I41 ? A4 = Body, V4 = {White, Striped}  Entropy of the examples that in "Body" have "White" Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij A = {Antennas, Tails, Nuclei, Body}  Example: I41 ? 
A 4 = Body, V4 = {White, Striped}  Sub-table of the examples that in “Body” have “White”: Example Antennas Tails Nuclei Body Class 2 1 0 1 White carcinogenic  I41 = 0  Full certainty DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij  The same operation is performed, but only between the samples with value Vj in the attribute Ai  Subtable  The entropy in that subtable is measured  As before, the more distributed the samples with value Vj in the attribute Ai among the different classes, the higher the entropy  More uncertainty DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij j  The weighted average of the entropies for each of the possible values in the attribute Ai  The weighted average of the entropies of the sub-tables for each value of Ai DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij A = {Antennas, Tails, Nuclei, Body} j A4 = Body, V4 = {White, Striped}  Example: I(A4)  Entropy of the “Body” attribute  I(A4) = (1/6)*0 + (5/6)*1.971 = 0.8091  Weighted average of the entropy of that attribute with each of the possible values DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij j  Gives a measure of the uncertainty associated with the attribute  The uncertainty that would remain when splitting the dataset with all values of that attribute  When creating the sub-tables, one for each value of the attribute DECISION TREES  ID3: Definitions: ◦ Information entropy of a dataset  I = ‐ Σc (nc/n) log2 (nc/n) ◦ Information entropy of value j of attribute Ai  Iij = ‐ Σ(n /n ) log2 (nijc/nij) c ijc ij ◦ Information entropy of attribute Ai  I(Ai) = Σ(nij/n)Iij j ◦ Information gain of attribute Ai  G(Ai) = I – I(Ai) DECISION TREES  ID3: Example: ◦ We want to generate a decision tree that classifies between normal cells and carcinogenic cells according to the data in the following table: Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas):  Split the table based on the values of that attribute Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic Example Antennas Tails Nuclei Body Class 6 2 2 1 Striped carcinogenic  Calculate the entropies of those subtables  Sum of those entropies weighted by the number of instances DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 
Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0 Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0 Number of Antennas value 0, normal Number of Antennas value 0 Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0 Number of Antennas value 0, normal Number of Antennas Number of Antennas value 0 value 0 Number of Antennas value 0, carcinogenic Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2(0/1) = 0  Entropy = 0:  Total certainty about the class in which the samples are classified Example Antennas Tails Nuclei Body Class 4 0 2 1 Striped Normal DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1 Number of Antennas value 1, normal Number of Antennas value 1 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1 Number of Antennas value 1, normal Number of Antennas Number of Antennas value 1 value 1 Number of Antennas value 1, carcinogenic Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) - (2/4)log2 (2/4) = 1  Entropy = 1:  Highest entropy (uncertainty) that can be had with 2 classes  The samples are distributed into the two classes with 50% each  If there were more samples of one class than another, there would be more certainty, less entropy Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 5 1 1 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 2 of the attribute A1 (Antennas) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White 
carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 2 of the attribute A1 (Antennas)  I12 = ‐ (0/1)log2(0/1) - (1/1)log2(1/1) = 0 Number of Antennas value 2, normal Number of Antennas Number of Antennas value 2 value 2 Number of Antennas value 2, carcinogenic Example Antennas Tails Nuclei Body Class 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = ‐ (1/1)log2(1/1) – (0/1)log2 ( 0/1) = 0  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = ‐ (2/4)log2 (2/4) – (2/4)log2 (2/4) = 1  Information entropy of the value 2 of the attribute A1 (Antennas)  I12 = ‐ (0/1)log2(0/1) – (1/1)log2 (1/1) = 0 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n)Iij = (1/6) ∙ I10 + (4/6) ∙ I11 + (1/6) ∙ I12 = 0.66 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n)Iij = (1/6) ∙ I10 + (4/6) ∙ I11 + (1/6) ∙ I12 = 0.66 Number of Antennas value 0 Number of Antennas value 2 Number of Antennas value 1 Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n)Iij = (1/6) ∙ I10 + (4/6) ∙ I11 + (1/6) ∙ I12 = 0.66 Number of Antennas value 0 Number of Antennas value 2 Number of Antennas value 1 Number of Antennas (instances) Example Antennas Tails Nuclei Body Class 1 1 0 2 Striped Normal 2 1 0 1 White carcinogenic 3 1 2 0 Striped Normal 4 0 2 1 Striped Normal 5 1 1 1 Striped carcinogenic 6 2 2 1 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of attribute A2 (Tails)  I20 = - (1/2)log2 (1/2) – (1/2)log2(1/2) = 1  Information entropy of value 1 of attribute A2 (Tails)  I21 = -(0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of value 2 of attribute A2 (Tails)  I22 = -(2/3)log2(2/3) – (1/3)log2(1/3) = 0.9183  Information entropy of attribute A2 (Tails)  I(A2) = Σ (nij/n) Iij = (2/6) ∙ I20 + (1/6) ∙ I21 + (3/6) ∙ I22 = 0.79 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of attribute A3 (Nuclei)  I30 = - (1/1)log2(1/1) – (0/1)log2( 0/1) = 0  Information entropy of the value 1 of attribute A3 (Nuclei)  I31 = - (1/4)log2(1/4) – (3/4)log2(3/4) = 0.8113  Information entropy of value 2 of attribute A3 (Nuclei)  I32 = - (1/1)log2(1/1) – (0/1)log2(0/1) = 0  Information entropy of attribute A3 (Nuclei)  I(A3) = Σ (nij/n) Iij = (1/6) ∙ I30 + (4/6) ∙ I31 + (1/6) ∙ I32 = 0.54 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the White value of attribute A4 (Body)  I40 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of the Striped value of attribute A4 (Body)  I41 = -(2/5)log2(2/5 ) – (3/5)log2 ( 3/5) = 0.9710  Information entropy of attribute A4 (Body)  I(A4) 
= Σ (nij/n) Iij = (1/6) ∙ I40 + (5/6) ∙ I41 = 0.81 DECISION TREES  ID3: Example: ◦ The attribute with the lowest entropy is chosen:  The entropy of each attribute determines the level of uncertainty when separating the dataset with all the values of that attribute  The attribute with lower entropy indicates that by separating the samples with that attribute, each of the resulting sets will have more certainty in the data.  That is, the instances of the different classes will be further unified in each of those sets (each of the values of that attribute)  There will be less disparity of instances of different classes in each set  That is, it better explains the sets  Attribute with lowest Entropy: Nuclei DECISION TREES  ID3: Example: ◦ The attribute with the lowest entropy is chosen: Nuclei ◦ The dataset is split and that attribute is removed:  Nuclei = 0 Example Antennas Tails Body Class 3 1 2 Striped Normal  Nuclei = 1 Example Antennas Tails Body Class 2 1 0 White carcinogenic 4 0 2 Striped Normal 5 1 1 Striped carcinogenic 6 2 2 Striped carcinogenic  Nuclei = 2 Example Antennas Tails Body Class 1 1 0 Striped Normal DECISION TREES  ID3: Example: Nuclei 0 2 Example Antennas Tails Body Class Example Antennas Tails Body Class 3 1 2 Striped Normal 1 1 0 Striped Normal 1 Example Antennas Tails Body Class 2 1 0 White carcinogenic 4 0 2 Striped Normal 5 1 1 Striped carcinogenic 6 2 2 Striped carcinogenic DECISION TREES  ID3: Example: Nuclei 0 2 Normal Normal 1 Example Antennas Tails Body Class 2 1 0 White carcinogenic 4 0 2 Striped Normal 5 1 1 Striped carcinogenic 6 2 2 Striped carcinogenic DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of the attribute A1 (Antennas)  I10 = - (1/1)log2(1/1) – (0/1)log2(0/1) = 0  Information entropy of the value 1 of the attribute A1 (Antennas)  I11 = - (0/2)log2(0/2) – (2/2)log2(2/2) = 0  Information entropy of the value 2 of the attribute A1 (Antennas)  I12 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of attribute A1 (Antennas)  I(A1) = Σ (nij/n) Iij = (1/4) ∙ I10 + (2/4) ∙ I11 + (1/4) ∙ I12 = 0 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the value 0 of attribute A2 (Tails)  I20 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of the value 1 of the Tails attribute  I21 = - (0/1)log2(0/1) – (1/1)log2(1/1) = 0  Information entropy of the value 2 of the Tails attribute  I22 = - (1/2)log2 (1/2) – (1/2)log2 (1/2) = 1  Information entropy of attribute Tails  I(A2) = Σ (nij/n) Iij = (1/4) ∙ I20 + (1/4) ∙ I21 + (2/4) ∙ I22 = 0.5 DECISION TREES  ID3: Example: ◦ Entropies are calculated:  Information entropy of the “White” value of attribute A4 (Body)  I40 = - (0/1)log2 ( 0/1 ) – (1/1)log2(1/1) = 0  Information entropy of the “Striped” value of attribute A4 (Body)  I41 = - (1/3)log2(1/3) – (2/3)log2(2/3) = 0.9183  Information entropy of attribute A4 (Body)  I(A4) = Σ (nij/n) Iij = (1/4) ∙ I40 + (3/4) ∙ I41 = 0.6887 DECISION TREES  ID3: Example: ◦ The attribute with the lowest entropy is chosen: Antennas ◦ The dataset is split and that attribute is removed:  Antennas = 0 Example Tails Body Class 4 2 Striped Normal  Antennas = 1 Example Tails Body Class 2 0 White carcinogenic 5 1 Striped carcinogenic  Antennas = 2 Example Tails Body Class 6 2 Striped carcinogenic DECISION TREES  ID3: Example Nuclei 0 2 1 Normal Normal Antennas 0 1 2 Normal Carcinogenic Carcinogenic DECISION TREES  ID3: Some practical questions: ◦ How 
much to grow the tree? ◦ Is this attribute selection measure appropiate? ◦ How to handle training data with missing (unknown) values in attributes? ◦ How to handle attributes with different costs? ◦ How to handle continuous attributes? DECISION TREES  ID3: How much to grow the tree? ◦ If the tree is grown until it correctly classifies all training examples: overfitting  If there is noise in the examples, the noise is learned  If there are few examples associated with the leaf nodes, they are not representative of the true function. ◦ ID3 produces trees that overfit training examples, and do not work properly with new examples  The model is not able to generalize DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning: ◦ Post-pruning: DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning:  Stop growing the tree before it reaches the point where it perfectly classifies the training examples  When to do it? DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning:  Stop growing the tree before it reaches the point where it perfectly classifies the training examples  When to do it?  Apply a statistical test to estimate whether expanding a particular node is likely to produce an improvement beyond the training set DECISION TREES  ID3: Solutions to overfitting : ◦ Pre-pruning:  Stop growing the tree before it reaches the point where it perfectly classifies the training examples  When to do it? ◦ Post-pruning:  Allow it to overfit the data, and then prune it by replacing subtrees with a leaf  Requires higher computational cost, but offers better results DECISION TREES  ID3: Solutions to overfitting : Post-pruning: ◦ Starting from the bottom, examine the subtrees of the non-terminal nodes ◦ Pruning a node entails:  1.- Delete its corresponding subtree with root in that node  2.- Convert the node to a leaf node  3.- Assign the most common classification of the training examples associated to that node ◦ Prune only if the resulting pruned tree improves or equals the performance of the original tree  Test (validation) ◦ The tree is pruned iteratively, always choosing the node whose pruning implies the greatest improvement, until there is no improvement DECISION TREES  ID3: Is this attribute selection measure adequate? ◦ The ID3 algorithm favors choosing attributes with many values  For example, attribute "Date" in a set with the examples on different days  This attribute perfectly characterizes each training example, and we would have a perfect tree of depth 1, with one leaf node per sample  The gain information is the highest  Having so many branches forces the examples to be separated into very small sets  But it would be very bad with unseen examples (test) DECISION TREES  ID3: Is this attribute selection measure appropriate? ◦ There are other measures instead of information gain  Gain ratio: penalizes attributes with many values and uniformly distributed G ( Ai ) H ( Ai ) where H(Ai) denotes the entropy with respect to the values of Ai nij nij H ( Ai ) = − log 2 j n n  It does not take into account the classes, only the different values it takes  When an attribute has many different values, it has a very large entropy, so it is divided by a larger value.  The gain ratio favors those attributes that, with equal gain, separate the data into fewer subsets DECISION TREES  ID3: How to handle attributes with missing values? 
◦ When classifying a new case with a missing value:  Instances with unknown values are classified according to the highest probability provided by the tree  Example:  Nuclei = ?  Antennas = 1 Nuclei 0 2 1 Normal Antennas Normal 0 2 1 Normal Carcinogenic Carcinogenic DECISION TREES  ID3: How to handle attributes with different costs? ◦ Are all attributes equally valuable when classifying? ◦ Enter the cost in the attribute selection measure  For example: G( Aj ) G( A ) 2 j −1 w: constant between 0 Coste Cost ( A j ) Coste( A j ) + 1) w ((Cost and 1 that weights the importance of cost versus profit ◦ They try to place the low-cost ones at the top of the tree, only calling the high-cost ones when necessary  That is, when we want to improve the classification DECISION TREES  ID3: How to handle continuous attributes? ◦ They are discretized:  They are partitioned into a discrete set of intervals  Example: 2 intervals: A=c  Find a value of c with the highest information gain  First, order the examples by the attribute A  This value is always where the value of the class changes DECISION TREES  ID3: How to handle continuous attributes? ◦ They are discretized:  Process:  Order the examples according to the continuous value of A  Take each pair of contiguous values of A with a different value of C and

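As a closing illustration of the ID3 attribute selection described above, the sketch below recomputes the dataset entropy and the attribute entropies for the six-cell example (roughly 0.66, 0.79, 0.54 and 0.81, so Nuclei is selected first); encoding the table as Python dictionaries is my own choice:

```python
import math
from collections import Counter

# The six-cell example from the ID3 section: (Antennas, Tails, Nuclei, Body) -> Class.
data = [
    ({"Antennas": 1, "Tails": 0, "Nuclei": 2, "Body": "Striped"}, "Normal"),
    ({"Antennas": 1, "Tails": 0, "Nuclei": 1, "Body": "White"},   "Carcinogenic"),
    ({"Antennas": 1, "Tails": 2, "Nuclei": 0, "Body": "Striped"}, "Normal"),
    ({"Antennas": 0, "Tails": 2, "Nuclei": 1, "Body": "Striped"}, "Normal"),
    ({"Antennas": 1, "Tails": 1, "Nuclei": 1, "Body": "Striped"}, "Carcinogenic"),
    ({"Antennas": 2, "Tails": 2, "Nuclei": 1, "Body": "Striped"}, "Carcinogenic"),
]

def entropy(examples):
    # I = -sum_c (n_c / n) log2 (n_c / n) over the classes present in `examples`.
    n = len(examples)
    counts = Counter(cls for _, cls in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def attribute_entropy(examples, attr):
    # I(Ai) = sum_j (n_ij / n) * I_ij: weighted entropy of each value's sub-table.
    n = len(examples)
    values = {x[attr] for x, _ in examples}
    total = 0.0
    for v in values:
        subset = [(x, c) for x, c in examples if x[attr] == v]
        total += (len(subset) / n) * entropy(subset)
    return total

I = entropy(data)                                   # 1.0 (3 Normal, 3 Carcinogenic)
for attr in ["Antennas", "Tails", "Nuclei", "Body"]:
    Ia = attribute_entropy(data, attr)
    print(attr, round(Ia, 2), "gain:", round(I - Ia, 2))
# Nuclei has the lowest entropy (~0.54), so ID3 places it at the root.
```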
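Continuing the previous sketch (it reuses the entropy, attribute_entropy and data definitions given there), this is one way to compute the gain ratio mentioned as an alternative attribute-selection measure; it penalizes attributes with many, evenly spread values:

```python
import math
from collections import Counter

def split_info(examples, attr):
    # H(Ai) = -sum_j (n_ij / n) log2 (n_ij / n): entropy of the attribute's values,
    # ignoring the classes; it is large when the attribute has many distinct values.
    n = len(examples)
    counts = Counter(x[attr] for x, _ in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attr):
    # Gain ratio = G(Ai) / H(Ai); reuses entropy() and attribute_entropy()
    # from the ID3 sketch above.
    g = entropy(examples) - attribute_entropy(examples, attr)
    return g / split_info(examples, attr)

for attr in ["Antennas", "Tails", "Nuclei", "Body"]:
    print(attr, round(gain_ratio(data, attr), 3))
```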
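Finally, going back to the linear SVM worked example earlier in the notes, this sketch recovers w and b from the two non-zero α values given there and uses the sign of w·x + b to classify a new point; the test point is arbitrary:

```python
import numpy as np

# Support vectors, labels and alpha values taken from the worked example in the
# linear-SVM section above (only two alphas are non-zero).
sv_x = np.array([[0.3858, 0.4687],    # y = +1
                 [0.4871, 0.6110]])   # y = -1
sv_y = np.array([1.0, -1.0])
alpha = np.array([65.5261, 65.5261])

# w = sum_i alpha_i * y_i * x_i  (only the support vectors contribute).
w = (alpha * sv_y) @ sv_x             # approximately [-6.64, -9.32]

# b = y_k - w . x_k for any support vector; averaged over all of them
# to avoid rounding problems, as the slides suggest.
b = np.mean(sv_y - sv_x @ w)          # approximately 7.93

def classify(x):
    # Decision function f(x) = sign(w . x + b): +1 on one side, -1 on the other.
    return int(np.sign(w @ x + b))

print(w, b, classify(np.array([0.2, 0.1])))   # a point from the +1 region
```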