
01_classification.pdf

Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...

Transcript


CLASSIFICATION
Business Intelligence — Dr Sara Migliorini

CLASSIFICATION
Classification is the task of predicting the class label associated with a given object (data instance). Each data instance of the classification task is characterized by a tuple (x, y) where:
- x is the set of attribute values that describe the instance;
- y is the class label of the instance (a categorical value).

CLASSIFICATION MODEL
A classification model is an abstract representation of the relationship between the attribute set and the class label: f(x) = ŷ. The classification is correct if ŷ = y.

TYPES OF CLASSIFICATION
A classification is binary if each data instance can be categorized into one of two classes, and multiclass if the number of classes is greater than 2. Each class label must be of nominal type; this distinguishes classification from other predictive tasks, such as regression, where the predicted value is quantitative.

PURPOSES OF CLASSIFICATION
A classification model serves two important roles:
- Predictive model: it is used to classify previously unlabeled instances.
- Descriptive model: it is used to identify the characteristics that distinguish instances of different classes (e.g., in medical diagnosis: how does the model reach its decision?).
Not all attributes may be relevant for the classification task, and some may not be able to distinguish the classes by themselves (they must be used in concert with other attributes).

GENERAL FRAMEWORK
A classifier is a tool used to perform a classification task (e.g., assigning labels to unlabeled data instances). It is typically described in terms of a model. The classification task involves two phases:
- Induction: construction of a classification model by applying a learning algorithm to a training set. The training set contains attribute values and class labels for each instance.
- Deduction: application of the classification model to unseen test instances to predict their class labels.

GENERAL FRAMEWORK
A classification technique is a general approach to classification. It consists of a family of related models and a number of algorithms for learning these models: decision trees, classification rules, association rules, Bayesian networks, etc. Every model defines a classifier, but not every classifier is defined by a single model.

GENERAL FRAMEWORK
(Diagram) The training set feeds the learning algorithm, which learns the model (induction); the model is then applied to the test set for prediction/classification (deduction). Generalization performance is the capability to provide good accuracy during the deduction step.

CLASSIFICATION EVALUATION
Confusion matrix (binary case):

              Predicted P   Predicted N
  Actual P    fPP (true positives)   fPN (false negatives)
  Actual N    fNP (false positives)  fNN (true negatives)

CLASSIFICATION EVALUATION
Accuracy: acc = # of correct predictions / total # of predictions = (fPP + fNN) / (fPP + fPN + fNP + fNN)

CLASSIFICATION EVALUATION
Error rate: err = # of wrong predictions / total # of predictions = (fPN + fNP) / (fPP + fPN + fNP + fNN)
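As a quick illustration of the two formulas above, here is a minimal Python sketch that computes accuracy and error rate from the four confusion-matrix entries; the counts themselves are hypothetical.

```python
# Confusion-matrix counts (hypothetical example values).
f_PP, f_PN = 40, 10   # actual P: true positives, false negatives
f_NP, f_NN = 5, 45    # actual N: false positives, true negatives

total = f_PP + f_PN + f_NP + f_NN
accuracy = (f_PP + f_NN) / total    # fraction of correct predictions
error_rate = (f_PN + f_NP) / total  # fraction of wrong predictions

print(accuracy, error_rate)         # 0.85 0.15
assert abs(accuracy + error_rate - 1.0) < 1e-12  # acc + err = 1
```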
TYPES OF CLASSIFIER

TYPES OF CLASSIFIERS
Binary vs multiclass: binary classifiers assign each data instance to one of two possible labels; if there are more than two possible labels, the technique is known as a multiclass classifier.
Deterministic vs probabilistic: a deterministic classifier produces a discrete-valued label for each data instance it classifies. A probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, providing additional information about the confidence in assigning the label.

TYPES OF CLASSIFIERS
Linear vs nonlinear: a linear classifier uses a linear separating hyperplane; a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces.
Global vs local: a global classifier fits a single model to the entire data set; a local classifier partitions the input space into smaller regions and fits a distinct model to the training instances in each region.

TYPES OF CLASSIFIERS
Generative vs discriminative: generative classifiers learn a generative model of every class in the process of predicting class labels; they describe the underlying mechanism that generates the instances belonging to each class. Discriminative classifiers directly predict the class labels without explicitly describing the distribution of each class.

DECISION TREE CLASSIFIER

DECISION TREE
A decision tree classifier solves the classification problem by asking a series of carefully crafted questions about the attributes of the test instance. The series of questions and their possible answers are organized into a hierarchical structure called a decision tree:
- root node;
- internal nodes: contain attribute test conditions;
- leaf (terminal) nodes: each one is associated with a class label.

DECISION TREE
Given a decision tree, classifying a test instance consists in traversing the tree starting from the root node, applying the test conditions and following the appropriate branch based on the outcome of each test. Once a leaf is reached, the class label associated with that leaf is assigned to the test instance.

DECISION TREE: EXAMPLE
Training set (loan default):

  ID  Home owner  Marital status  Annual income  Defaulted?
   1  Yes         Single          125,000        No
   2  No          Married         100,000        No
   3  No          Single           70,000        No
   4  Yes         Married         120,000        No
   5  No          Divorced         95,000        Yes
   6  No          Single           60,000        No
   7  Yes         Divorced        220,000        No
   8  No          Single           85,000        Yes
   9  No          Married          75,000        No
  10  No          Single           90,000        Yes

Example tree: the root contains the attribute test condition on Home owner (the splitting attribute). If Home owner = Yes, Defaulted = No. If Home owner = No, test Marital status: if Married, Defaulted = No; if Single or Divorced, test Annual income < 78,000: if yes, Defaulted = No, otherwise Defaulted = Yes.
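A minimal sketch of how the example tree classifies an instance by traversing the test conditions from the root; the test instance shown at the bottom is hypothetical.

```python
def classify(home_owner, marital_status, annual_income):
    """Traverse the example decision tree and return the predicted label."""
    if home_owner == "Yes":
        return "No"                      # Defaulted = No
    # Home owner = No: test Marital status
    if marital_status == "Married":
        return "No"                      # Defaulted = No
    # Single or Divorced: test Annual income < 78,000
    return "No" if annual_income < 78_000 else "Yes"

# Hypothetical test instance: not a home owner, single, income 80,000
print(classify("No", "Single", 80_000))  # -> "Yes"
```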
DECISION TREE: CONSTRUCTION
Many possible decision trees can be constructed from a particular data set. Some trees are better than others, and finding the optimal tree is a computationally expensive task (the search space has exponential size). Efficient algorithms have been developed to obtain suboptimal trees with reasonable accuracy in a reasonable amount of time: greedy algorithms that build the tree in a top-down fashion, making locally optimal choices when deciding which attribute to use for splitting.

DECISION TREE: ALGORITHMS
Hunt's algorithm; CART; ID3, C4.5, C5.0; SLIQ, SPRINT.

HUNT'S ALGORITHM
Hunt's algorithm builds the decision tree recursively. The tree initially contains a single node that is associated with all the training instances. If a node is associated with instances from more than one class, it is expanded using an attribute test condition determined by a splitting criterion.

HUNT'S ALGORITHM
Expansion step: a child leaf node is created for each possible outcome of the attribute test condition, and the instances associated with the parent node are distributed to the children based on the test outcomes. The expansion step is applied recursively to each child node as long as it contains labels from more than one class. Once all instances associated with a leaf node have identical labels, the algorithm terminates.

HUNT'S ALGORITHM: EXAMPLE (on the loan training set above)
Step 1: the tree is a single leaf labeled Defaulted = No, since the majority of the instances are not defaulted. Its error rate is 30%. The leaf can be further expanded, since it contains instances with different labels.

Step 2: "Home owner" is chosen as the splitting attribute. All instances with Home owner = Yes have identical labels, so that child becomes the leaf Defaulted = No. The child for Home owner = No contains instances with both labels and needs to be further expanded.

Step 3: the Home owner = No node is split on Marital status: the Married child becomes the leaf Defaulted = No, while the Single/Divorced child still contains instances with both labels and needs to be further expanded.
Step 4: the Single/Divorced node is split on Annual income < 78,000: the child for incomes below 78,000 becomes the leaf Defaulted = No, the other child becomes the leaf Defaulted = Yes. This yields the example tree shown earlier.

HUNT'S ALGORITHM: LIMITATIONS
Some of the child nodes can be empty if none of the training instances has a particular attribute value: declare each of them a leaf node and assign it the class label that occurs most frequently in the parent node. If all training instances associated with a node have identical attribute values but different class labels, the node cannot be expanded any further: declare it a leaf node and assign it the class label that occurs most frequently in its training instances.

HUNT'S ALGORITHM: LIMITATIONS
Splitting criterion: the decision tree is built with a greedy strategy; the "best" attribute for the split is selected locally, and this can lead to getting stuck in a local optimum (not a global optimum). Stopping criterion: the algorithm stops expanding a node only when all the training instances associated with it belong to the same class; this is not always the best solution, and early termination may be preferable.

SPLITTING CRITERION: EXPRESSING ATTRIBUTE TEST CONDITIONS
Binary attributes: only two possible outcomes (simplest case), e.g., Home owner (Yes/No).
Nominal attributes: they can have multiple values, so the split can be (a) multi-way, with one branch per value (e.g., Marital status: Single / Divorced / Married), or (b) binary, aggregating the possible values into two groups (e.g., {Single, Divorced} vs {Married}); for k values there are 2^(k-1) - 1 possible binary groupings.

SPLITTING CRITERION: EXPRESSING ATTRIBUTE TEST CONDITIONS
Ordinal attributes: like nominal attributes, they can produce binary or multi-way splits, but the only admissible groupings are those that do not violate the order of the attribute values. E.g., for Shirt size, {Small, Medium} vs {Large, Extra large} and {Small} vs {Medium, Large, Extra large} are valid, while {Small, Large} vs {Medium, Extra large} is not.

SPLITTING CRITERION: EXPRESSING ATTRIBUTE TEST CONDITIONS
Continuous attributes: the attribute test condition can be expressed as a comparison test producing a binary split (e.g., A < v) or as a multi-way split (range query) obtained through a discretization strategy. In a multi-way split the value ranges must be mutually exclusive and cover the entire range of the attribute values (e.g., Annual income > 80K yes/no; or Annual income < 10K, 10K ≤ v < 50K, ≥ 50K).

SELECTING THE BEST ATTRIBUTE TEST CONDITION
Many measures can be used to determine the goodness of an attribute test condition. The idea is to give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes (pure = same class label). Pure nodes do not need to be expanded any further. Larger trees are more susceptible to model overfitting, more difficult to interpret, and require more training and test time.

IMPURITY MEASURES
The impurity of a node measures how dissimilar the class labels are for the data instances belonging to that node. For a node t (lower is better), with p_i(t) the relative frequency of training instances of class i at node t, c the total number of classes, and the convention 0 log2 0 = 0:

  Entropy(t) = − Σ_{i=0}^{c−1} p_i(t) log2 p_i(t)
  Gini(t) = 1 − Σ_{i=0}^{c−1} p_i(t)^2
  Classification error(t) = 1 − max_i p_i(t)

All measures give 0 impurity when the node contains only instances from a single class, and maximum impurity when the node has an equal proportion of instances from multiple classes.
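A minimal sketch of the three impurity measures, written as Python functions over a list of class counts; the example counts reproduce the node examples discussed next.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)  # 0*log2(0) treated as 0

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    return 1 - max(counts) / sum(counts)

# Nodes with class counts [Class 0, Class 1]
for counts in ([0, 6], [1, 5], [3, 3]):
    print(counts, round(entropy(counts), 3), round(gini(counts), 3),
          round(classification_error(counts), 3))
# [0, 6] 0.0 0.0 0.0
# [1, 5] 0.65 0.278 0.167
# [3, 3] 1.0 0.5 0.5
```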
IMPURITY MEASURES
Examples (nodes with class counts):
- Node N1: Class 0 = 0, Class 1 = 6. Entropy = −[(0/6)log2(0/6) + (6/6)log2(6/6)] = 0; Gini = 1 − [(0/6)^2 + (6/6)^2] = 0; Error = 1 − max[0/6, 6/6] = 0.
- Node N2: Class 0 = 1, Class 1 = 5. Entropy = −[(1/6)log2(1/6) + (5/6)log2(5/6)] = 0.650; Gini = 1 − [(1/6)^2 + (5/6)^2] = 0.278; Error = 1 − max[1/6, 5/6] = 0.167.
- Node N3: Class 0 = 3, Class 1 = 3. Entropy = −[(3/6)log2(3/6) + (3/6)log2(3/6)] = 1; Gini = 1 − [(3/6)^2 + (3/6)^2] = 0.5; Error = 1 − max[3/6, 3/6] = 0.5.

IMPURITY MEASURES: COMPARISON
The measures produce different values, but they are consistent with each other.

COLLECTIVE IMPURITY
Consider an attribute test condition that splits a node containing N training instances into k children {v1, v2, ..., vk}, and let N(vj) be the number of training instances associated with child node vj, whose impurity is I(vj). The collective impurity is the average of the child impurities weighted by the number of instances associated with each child:

  I(children) = Σ_{j=1}^{k} (N(vj) / N) I(vj)

COLLECTIVE IMPURITY: WEIGHTED ENTROPY
Example on the loan training set (class counts Yes/No for Defaulted):
- Marital status: Single (Yes: 2, No: 3), Married (Yes: 0, No: 3), Divorced (Yes: 1, No: 1).
  I(MS = Single) = −[(2/5)log2(2/5) + (3/5)log2(3/5)] = 0.971
  I(MS = Married) = −[(0/3)log2(0/3) + (3/3)log2(3/3)] = 0
  I(MS = Divorced) = −[(1/2)log2(1/2) + (1/2)log2(1/2)] = 1
  I(Marital status) = 5/10 * 0.971 + 3/10 * 0 + 2/10 * 1 = 0.686 (collective impurity)
- Home owner: Yes (Yes: 0, No: 3), No (Yes: 3, No: 4).
  I(Home owner = Yes) = −[(0/3)log2(0/3) + (3/3)log2(3/3)] = 0
  I(Home owner = No) = −[(3/7)log2(3/7) + (4/7)log2(4/7)] = 0.985
  I(Home owner) = 3/10 * 0 + 7/10 * 0.985 = 0.690 (collective impurity)

IDENTIFYING THE BEST ATTRIBUTE TEST CONDITION
To determine the goodness of an attribute test condition, we compare the degree of impurity of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after splitting). The larger the difference, the better the test condition:

  Gain = I(parent) − I(children)

IDENTIFYING THE BEST ATTRIBUTE TEST CONDITION
The gain is always non-negative, since I(parent) ≥ I(children). Splitting criterion: select the attribute that maximizes the gain. If the impurity measure is the entropy, the gain is also called information gain (Δgain).

IDENTIFYING THE BEST ATTRIBUTE TEST CONDITION: EXAMPLE
Before splitting, 7 instances have Defaulted = No and 3 have Defaulted = Yes:
  I(parent) = −[(3/10)log2(3/10) + (7/10)log2(7/10)] = 0.881
Using the weighted entropies computed above:
  Δgain(Home owner) = I(parent) − I(Home owner) = 0.881 − 0.690 = 0.191
  Δgain(Marital status) = I(parent) − I(Marital status) = 0.881 − 0.686 = 0.195
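A minimal sketch that computes the weighted child entropy and the information gain for the two candidate splits above. With full-precision arithmetic the gains come out as 0.192 and 0.196, which match the slide's 0.191 and 0.195 up to the rounding of the intermediate impurities.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)

def weighted_impurity(children):
    """children: list of per-child class-count lists."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent = [3, 7]                               # [Yes, No] before splitting
home_owner = [[0, 3], [3, 4]]                 # children: Yes, No
marital_status = [[2, 3], [0, 3], [1, 1]]     # children: Single, Married, Divorced

for name, children in [("Home owner", home_owner), ("Marital status", marital_status)]:
    gain = entropy(parent) - weighted_impurity(children)
    print(name, round(gain, 3))
# Home owner 0.192
# Marital status 0.196
```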
SPLITTING BASED ON GINI: CATEGORICAL ATTRIBUTES
Parent node: No = 7, Yes = 3, so Gini = 1 − [(7/10)^2 + (3/10)^2] = 0.420.
Candidate splits (weighted Gini of the children):
- Home owner: Yes (Yes: 0, No: 3) vs No (Yes: 3, No: 4): Gini = 0.343
- Marital status: {Single} (Yes: 2, No: 3) vs {Married, Divorced} (Yes: 1, No: 4): Gini = 0.400
- Marital status: {Single, Married} (Yes: 2, No: 6) vs {Divorced} (Yes: 1, No: 1): Gini = 0.400
- Marital status: {Single, Divorced} (Yes: 3, No: 4) vs {Married} (Yes: 0, No: 3): Gini = 0.343

SPLITTING BASED ON GINI: QUANTITATIVE ATTRIBUTES
For a test of the form Annual income ≤ τ, τ can be chosen equal to any value between the minimum and the maximum of the attribute. It is sufficient to consider only the values observed in the training set as candidate split positions: we compute the Gini index at each candidate position and choose the τ that produces the lowest value.

SPLITTING BASED ON GINI: QUANTITATIVE ATTRIBUTES
Brute-force approach: computing the Gini index at one position takes O(N) operations (scan the N instances to count how many fall below and above it), so evaluating all N candidate positions and comparing them takes O(N^2). If we first sort the training instances by the value of the splitting attribute, the candidate split positions are the midpoints between every two adjacent sorted values: sorting costs O(N log N), and a single scan that updates a count matrix and computes the Gini index at each candidate costs O(N), for O(N log N) overall.

SPLITTING BASED ON GINI: QUANTITATIVE ATTRIBUTES
The best split position is the one that produces the lowest Gini index. A further reduction is possible by considering as candidate splits only the positions where adjacent sorted instances change their class label.
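A minimal sketch of the sorted-candidates procedure on the Annual income attribute of the loan data: sort the (value, label) pairs, take the midpoints between adjacent values as candidates, and keep the one with the lowest weighted Gini. For clarity this sketch rescans the data at each candidate (the O(N^2) variant); the O(N log N) version would instead update a running count matrix during a single scan.

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split value <= threshold."""
    data = sorted(zip(values, labels))
    n = len(data)
    best = (None, float("inf"))
    for i in range(n - 1):
        if data[i][0] == data[i + 1][0]:
            continue                          # no midpoint between equal values
        tau = (data[i][0] + data[i + 1][0]) / 2
        left = [lab for v, lab in data if v <= tau]
        right = [lab for v, lab in data if v > tau]
        w_gini = (len(left) * gini([left.count("Yes"), left.count("No")]) +
                  len(right) * gini([right.count("Yes"), right.count("No")])) / n
        if w_gini < best[1]:
            best = (tau, w_gini)
    return best

# Annual income (in thousands) and Defaulted labels from the example training set
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, labels))             # -> (97.5, 0.3)
```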
INFORMATION GAIN
One potential limitation of entropy and Gini index is that they tend to favor qualitative attributes with a large number of distinct values. Having a low impurity value alone is insufficient to find a good attribute test condition for a node: a greater number of children makes the decision tree more complex and consequently more susceptible to overfitting. The number of children produced by the splitting attribute should also be taken into consideration when deciding the best attribute test condition.

INFORMATION GAIN
Solutions:
- Generate only binary decision trees (CART).
- Gain ratio (C4.5):

  Gain ratio = Δ_info / split info = (Entropy(parent) − Entropy(children)) / (− Σ_{i=1}^{k} (N(vi)/N) log2 (N(vi)/N))

The split information measures the entropy of splitting a node into its child nodes and evaluates whether the split results in a large number of equally sized child nodes. If each partition has the same number of instances, N(vi)/N = 1/k and the split information equals log2(k); if the number of splits k is large, the gain ratio is therefore reduced.

ALGORITHM FOR DECISION TREE INDUCTION (CONSTRUCTION)
The tree-growing procedure takes E = set of training instances and F = attribute set, recursively selects the best attribute to split the data, and expands the nodes of the tree:
- createNode() extends the decision tree by creating a new node; each node has either an attribute test condition (node.test_cond) or a class label (node.label).
- find_best_split() determines the best attribute test condition for partitioning the training instances associated with a node, using an impurity measure (e.g., entropy, Gini index).
- Classify() determines the class label to be assigned to a leaf node: leaf.label = argmax_i p(i|t), where p(i|t) is the fraction of training instances from class i assigned to node t, i.e., the label that occurs most frequently in the node's training instances.
- stopping_cond() checks whether all instances have identical class labels or identical attribute values; a threshold on the number of instances can also be used to prevent the data fragmentation problem (too few training instances at a leaf node to make statistically significant decisions).
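To tie the pieces above together, here is a minimal sketch of a recursive tree-growing procedure in Python. It is not the exact algorithm of the slides: it handles categorical attributes only, uses information gain as the splitting criterion, and represents nodes as dictionaries; the "Income<78K" attribute is a hand-made discretization of Annual income added just for the sketch. Note that a pure gain-based greedy choice picks Marital status at the root (gain 0.196 vs 0.192), so the learned tree differs from the illustrative tree shown earlier.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def stopping_cond(E, F):
    # Stop when all labels are identical, no attributes remain,
    # or all instances have identical attribute values.
    return len({y for _, y in E}) == 1 or not F or all(x == E[0][0] for x, _ in E)

def classify(E):
    # Majority label among the training instances of the node.
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_best_split(E, F):
    parent = entropy([y for _, y in E])
    def gain(a):
        groups = {}
        for x, y in E:
            groups.setdefault(x[a], []).append(y)
        return parent - sum(len(g) / len(E) * entropy(g) for g in groups.values())
    return max(F, key=gain)

def tree_growth(E, F):
    if stopping_cond(E, F):
        return {"label": classify(E)}             # leaf node
    a = find_best_split(E, F)
    children = {}
    for v in {x[a] for x, _ in E}:
        subset = [(x, y) for x, y in E if x[a] == v]
        children[v] = tree_growth(subset, [b for b in F if b != a])
    return {"test_cond": a, "children": children}

# Loan example; "Income<78K" is a hand-made discretization of Annual income.
rows = [("Yes", "Single", "No", "No"), ("No", "Married", "No", "No"),
        ("No", "Single", "Yes", "No"), ("Yes", "Married", "No", "No"),
        ("No", "Divorced", "No", "Yes"), ("No", "Single", "Yes", "No"),
        ("Yes", "Divorced", "No", "No"), ("No", "Single", "No", "Yes"),
        ("No", "Married", "Yes", "No"), ("No", "Single", "No", "Yes")]
attrs = ["Home owner", "Marital status", "Income<78K"]
E = [(dict(zip(attrs, r[:3])), r[3]) for r in rows]
print(tree_growth(E, attrs))   # prints the learned tree as a nested dictionary
```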
CHARACTERISTICS OF DECISION TREES
Applicability: decision tree induction is a nonparametric approach which does not require any prior assumption about the probability distribution governing the classes and attributes of the data. It can be applied to a great variety of datasets, it handles both categorical and continuous attributes, it can deal with both binary and multiclass problems, and the resulting models are relatively easy to interpret.

CHARACTERISTICS OF DECISION TREES
Expressiveness: a decision tree can encode any function of discrete-valued attributes; every leaf node represents a combination of attribute values.
Computational efficiency: decision tree algorithms apply heuristics to build the tree, and once a decision tree has been built, classifying a record is extremely fast.

CHARACTERISTICS OF DECISION TREES
Handling missing values:
- Probabilistic split (C4.5): distribute the data instances based on the probability that the missing attribute takes a particular value.
- Surrogate split method (CART): instances with a missing value are assigned to one of the child nodes based on the value of another, non-missing surrogate attribute.
- Separate class method (CHAID): missing values are treated as a separate categorical value.
- Other strategies, e.g., discarding instances with missing values during pre-processing.

CHARACTERISTICS OF DECISION TREES
Handling interactions among attributes: attributes are considered interacting if they are able to distinguish between classes when used together, but individually provide little or no information. Trees can perform poorly when there are interacting attributes.
Handling irrelevant attributes: an attribute is irrelevant if it is not useful for the classification task; the presence of a small number of irrelevant attributes does not affect the decision tree construction process.

CHARACTERISTICS OF DECISION TREES
Handling redundant attributes: an attribute is redundant if it is strongly correlated with another attribute. Redundant attributes show similar gains in purity, so only one of them will be selected.
Choice of the impurity measure: it has little effect on the performance of the decision tree, since the measures are all consistent with each other.

CHARACTERISTICS OF DECISION TREES
Rectilinear splits: each test condition involves a single attribute at a time, so the tree-growing procedure can be viewed as partitioning the attribute space into disjoint regions until each of them contains only records of the same class. The border of each region is called a decision boundary, and the decision boundaries are rectilinear, i.e., parallel to the coordinate axes. Problem: when the true decision boundary is, for instance, a diagonal line, rectilinear splits approximate it poorly. An oblique decision tree could be more expressive and produce a more compact tree, but it is more computationally expensive to build.

MODEL OVERFITTING/UNDERFITTING
Model overfitting happens when the model fits the training data well but shows poor generalization performance (low accuracy on the test data). Model underfitting happens when the model is too simplistic and thus incapable of fully representing the true relationship between the attributes and the class labels.

MODEL OVERFITTING
If the training and test error rates are quite close to each other, the performance on the training set is representative of the generalization performance. Overfitting shows up when the training error keeps decreasing while the test error increases.

MODEL OVERFITTING: CAUSES
Model overfitting happens when an overly complex model is selected: it captures specific patterns in the training data but fails to learn the true nature of the relationships between attributes and class labels in the overall data.
Limited training size: the dataset can only provide a limited representation of the overall data, so the learned pattern may not fully represent the true pattern in the data. Remedy: increase the training size.

MODEL OVERFITTING: CAUSES
High model complexity: an overly complex model has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances. If the number of parameter combinations is large but the number of training instances is small, overfitting is highly likely. Another typical symptom: the decision boundary gets distorted by a noise point.

MODEL SELECTION
We want to select the model with the lowest generalization error rate. The training error cannot reliably be used as the sole criterion for model selection. Generic approaches: use a validation set, or incorporate the model complexity into the error estimate.

VALIDATION SET
Evaluate the model on a separate validation set that has not been used for training. Given a training set D.train, partition it into two smaller parts: D.tr (about 2/3), used for training, and D.val (about 1/3), used for computing the validation error rate. The model m with the lowest error_val(m) is selected as the preferred model.
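A minimal sketch of validation-set model selection, assuming scikit-learn is installed; the synthetic dataset and the candidate max_depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
# Split D.train into D.tr (2/3) and D.val (1/3)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

best = None
for depth in (1, 2, 3, 5, 8, None):          # candidate models of increasing complexity
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    err_val = 1 - model.score(X_val, y_val)  # validation error rate
    print(depth, round(err_val, 3))
    if best is None or err_val < best[1]:
        best = (depth, err_val)
print("selected max_depth:", best[0])
```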
MODEL COMPLEXITY
The chance of model overfitting increases as a model becomes more complex. Given two models with the same error, the simpler model is preferred over the more complex one (principle of parsimony). The estimated generalization error can combine both aspects:

  gen.err(m) = train.err(m, D.train) + α x complexity(m)

MODEL COMPLEXITY
The complexity of a decision tree can be estimated as the ratio of the number of leaf nodes to the number of training instances, k / Ntrain, where k is the number of leaf nodes and Ntrain is the number of training instances. For a larger training size, we can learn a decision tree with a larger number of leaf nodes without it becoming overly complex.

MODEL SELECTION: PRE-PRUNING (EARLY STOPPING RULE)
Stop the algorithm before generating a fully grown tree. Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same. More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features; stop if expanding the current node does not improve the impurity measure (e.g., Gini index or information gain).

MODEL SELECTION: PRE-PRUNING (EARLY STOPPING RULE)
Advantage: it avoids the computation associated with generating overly complex subtrees that overfit the training data. Disadvantage: optimal subtrees may not be reached, due to the greedy nature of decision tree induction.

MODEL SELECTION: POST-PRUNING
Grow the decision tree to its entirety, then trim the nodes of the decision tree in a bottom-up fashion (subtree replacement, subtree raising). In subtree replacement, if the generalization error improves after trimming, the subtree is replaced by a leaf node whose class label is determined from the majority class of the instances in the subtree.
Advantage: better results, because the pruning decisions are based on a fully grown tree. Disadvantage: additional computation is needed to grow the full tree.

RULE-BASED CLASSIFIER

RULE-BASED CLASSIFIER
It uses a collection of "if ... then ..." rules (a rule set) to classify data instances. Each rule ri can be expressed as

  ri: (condition_i) -> yi

where condition_i is the antecedent or precondition: a conjunction of attribute test conditions

  (A1 op v1) ∧ (A2 op v2) ∧ ... ∧ (Ad op vd)

in which the (Ai, vi) are attribute-value pairs and op is a comparison operator in {=, ≠, >, ≥, <, ≤}.

Example rule set (derived from a decision tree on binary attributes P, Q, R):
  r2: (P = No, Q = Yes) -> +
  r3: (P = Yes, R = No) -> +
  r4: (P = Yes, R = Yes, Q = No) -> -
  r5: (P = Yes, R = Yes, Q = Yes) -> +
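A minimal sketch of applying an ordered rule set of this form: rules are checked in order and the first one whose antecedent is satisfied fires. The default label used when no rule matches is an assumption added for the sketch; it is not part of the rules listed above.

```python
# Each rule: (antecedent as attribute -> required value, predicted label)
rules = [
    ({"P": "No",  "Q": "Yes"},             "+"),   # r2
    ({"P": "Yes", "R": "No"},              "+"),   # r3
    ({"P": "Yes", "R": "Yes", "Q": "No"},  "-"),   # r4
    ({"P": "Yes", "R": "Yes", "Q": "Yes"}, "+"),   # r5
]
DEFAULT = "-"   # assumed default class for uncovered instances

def classify(instance, rules, default=DEFAULT):
    for antecedent, label in rules:
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return label            # first matching rule fires
    return default

print(classify({"P": "Yes", "Q": "No", "R": "Yes"}, rules))  # -> "-" (r4 fires)
print(classify({"P": "No",  "Q": "No", "R": "No"},  rules))  # -> "-" (default)
```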
RULE-BASED CLASSIFIERS: CHARACTERISTICS
Expressiveness: very similar to a decision tree, since a decision tree can be represented by a set of mutually exclusive and exhaustive rules (rectilinear partitions of the attribute space). Multiple rules can be triggered for a given instance, so more complex models can be learned.

RULE-BASED CLASSIFIERS: CHARACTERISTICS
Redundant attributes: their presence is handled, because once an attribute has been used, the remaining redundant attributes introduce little or no information gain and are ignored.
Interacting attributes: poor performance, as with decision trees.
Missing values in the test set: they are not treated well, because if a rule involves an attribute that is missing in the test instance, it is difficult to ignore the rule and proceed with the subsequent one in the set.

RULE-BASED CLASSIFIERS: CHARACTERISTICS
Imbalanced class distributions: they can be handled through the rule ordering.
Efficiency: fast model building and very fast classification.
Robustness: robust to outliers.

NEAREST NEIGHBOR CLASSIFIER

K-NEAREST NEIGHBOR
A nearest neighbor classifier finds the training examples that are most similar to the attributes of the test instance. Each example is represented as a data point in a d-dimensional space, where d is the number of attributes. Given a test instance, its proximity to the training instances is computed according to one of the available proximity measures. The k-nearest neighbors of a given instance z are the k training examples closest to z.

NEAREST NEIGHBOR CLASSIFIERS
Basic idea: "if it walks like a duck and quacks like a duck, then it's probably a duck". For an unknown record, compute the distance to the training records and choose the k "nearest" ones.
It requires: a set of labeled records; a proximity metric to compute the distance/similarity between a pair of records (e.g., Euclidean distance); the value of k, the number of nearest neighbors to retrieve; and a method for using the class labels of the k nearest neighbors to determine the class label of the unknown record (e.g., majority vote).

K-NEAREST NEIGHBOR
The instance is classified based on the class labels of its neighbors; when the neighbors have more than one label, the test instance is assigned to the majority class of its nearest neighbors. The choice of the value of k is crucial:
- If k is too small, the nearest neighbor classifier is subject to overfitting.
- If k is too large, the nearest neighbor classifier can misclassify the test instance, because the set of nearest neighbors includes training examples that are far away from it.

K-NEAREST NEIGHBOR
With a plain majority vote, every neighbor has the same impact on the classification. Distance-weighted voting instead weights each neighbor by its distance from the test instance x':

  wi = 1 / d(x', xi)^2
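A minimal sketch of a k-nearest-neighbor classifier with Euclidean distance and optional distance-weighted voting (wi = 1/d^2); the tiny 2-D dataset is made up for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, x, k=3, weighted=False):
    """train: list of (point, label); x: query point."""
    neighbors = sorted(train, key=lambda p: dist(p[0], x))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, x)
        votes[label] += 1.0 if not weighted else 1.0 / max(d, 1e-12) ** 2
    return max(votes, key=votes.get)

# Hypothetical 2-D training points
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (2, 2), k=3))                 # -> "A"
print(knn_predict(train, (4, 4), k=5, weighted=True))  # -> "B"
```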
NEAREST NEIGHBOR: CHARACTERISTICS
Instance-based learning: it does not build a global model, but uses the training examples directly to make predictions for a test instance; similarity is based on the definition of a proximity measure.
Expressiveness: it can build decision boundaries of arbitrary shape.
Efficiency: it does not build a model, but it requires computing the proximity between the test instance and every training example at classification time (slow classification).

NEAREST NEIGHBOR: CHARACTERISTICS
Interacting attributes: it can handle interacting attributes without any problem.
Irrelevant attributes: they can distort commonly used proximity measures.
Missing values in the test set: proximity computations normally require the presence of all attributes.
Robustness: with smaller values of k it is susceptible to noise, and it can produce wrong predictions unless an appropriate proximity measure is chosen and the appropriate data pre-processing steps are taken.

BAYESIAN CLASSIFIER

NAÏVE BAYES CLASSIFIER
Many classification problems involve uncertainty: attributes and class labels may be unreliable, and the set of attributes chosen for the classification may not be fully representative of the target class. There is a need not only to predict class labels but also to provide a measure of confidence associated with every prediction: a probabilistic classification model.

PROBABILITY
The probability of an event e, e.g. P(X = xi), measures how likely it is for the event to occur (relative frequency). Variables that have probabilities associated with each possible outcome (value) are known as random variables.
Joint probability: consider two random variables X and Y which can take k distinct values, and let nij be the number of times we observe X = xi and Y = yj out of a total number of N occurrences:

  P(X = xi, Y = yj) = nij / N

MARGINAL PROBABILITY
By summing the joint probabilities over a random variable Y, we obtain the probability of observing the other variable X independently of Y (marginalization):

  Σ_{j=1}^{k} P(X = xi, Y = yj) = (Σ_{j=1}^{k} nij) / N = ni / N = P(X = xi)

CONDITIONAL PROBABILITY
Let P(Y|X) denote the conditional probability of observing the random variable Y whenever the random variable X takes a particular value (the probability of observing Y conditioned on the outcome of X):

  P(Y|X) = P(X, Y) / P(X)

so that P(X, Y) = P(Y|X) P(X) = P(X|Y) P(Y), since the joint probability is symmetric. Notation: P(X) denotes the probability of any generic outcome of X, while P(xi) denotes the probability of the specific outcome xi.

BAYES THEOREM
  P(Y|X) = P(X|Y) P(Y) / P(X)
It provides a relationship between P(Y|X) and P(X|Y). Proof: P(Y|X) = P(X, Y) / P(X), and since P(X, Y) = P(X|Y) P(Y), substituting gives P(Y|X) = P(X|Y) P(Y) / P(X).

BAYES THEOREM: EXAMPLE
You know that Alex attends 40% of the parties he is invited to: P(A=1) = 0.40. If Alex goes to a party, there is an 80% chance that Martha comes along: P(M=1|A=1) = 0.80. Conversely, if Alex is not going to the party, the chance of Martha coming is reduced to 30%: P(M=1|A=0) = 0.30. If Martha has responded that she will come to the party, what is the probability that Alex will also go, i.e., P(A=1|M=1)?
  P(A=1|M=1) = P(M=1|A=1) P(A=1) / P(M=1)
  P(M=1) = P(M=1|A=0) P(A=0) + P(M=1|A=1) P(A=1)
  P(A=1|M=1) = 0.80 x 0.40 / (0.30 x 0.60 + 0.80 x 0.40) = 0.32 / 0.50 = 0.64
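A minimal sketch of the party example above, applying the law of total probability and Bayes' theorem to obtain P(A=1|M=1) = 0.64.

```python
p_a1 = 0.40            # P(A=1): Alex attends
p_m1_given_a1 = 0.80   # P(M=1 | A=1)
p_m1_given_a0 = 0.30   # P(M=1 | A=0)

# Law of total probability: P(M=1) = P(M=1|A=0)P(A=0) + P(M=1|A=1)P(A=1)
p_m1 = p_m1_given_a0 * (1 - p_a1) + p_m1_given_a1 * p_a1

# Bayes theorem: P(A=1|M=1) = P(M=1|A=1)P(A=1) / P(M=1)
p_a1_given_m1 = p_m1_given_a1 * p_a1 / p_m1
print(round(p_m1, 2), round(p_a1_given_m1, 2))   # 0.5 0.64
```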
BAYES THEOREM FOR CLASSIFICATION
We are interested in computing the probability of observing a class label y for a data instance given its set of attribute values x, i.e., the posterior probability P(y|x) of the target class:

  P(y|x) = P(x|y) P(y) / P(x)

- P(x|y) is the class-conditional probability: it measures the likelihood of observing x from the distribution of instances belonging to class y (the fraction of the training instances of a given class for every possible combination of attribute values).
- P(y) is the prior probability: it measures the distribution of the class labels, independently of the observed attribute values (the fraction of the training instances that belong to class y).
- P(x) is the probability of the evidence: P(x) = Σ_i P(x|yi) P(yi).

NAÏVE BAYES ASSUMPTION
A large number of attribute-value combinations results in poor estimates of the class-conditional probabilities. The naïve Bayes assumption helps in obtaining reliable estimates of class-conditional probabilities even in the presence of a large number of attributes: writing x = {x1, ..., xd}, the attributes are assumed to be conditionally independent of each other given the class,

  P(x|y) = Π_{i=1}^{d} P(xi|y)

so that each P(xi|y) can be estimated independently, and the number of parameters needed to learn the class-conditional probabilities drops from one per combination of attribute values (exponential in d) to one per attribute value and class.

NAÏVE BAYES CLASSIFIER: EXAMPLE
Training set (weather data, class P / N):

  Outlook   Temperature  Humidity  Windy  Class
  Sunny     Hot          High      False  N
  Sunny     Hot          High      True   N
  Overcast  Hot          High      False  P
  Rain      Mild         High      False  P
  Rain      Cool         Normal    False  P
  Rain      Cool         Normal    True   N
  Overcast  Cool         Normal    True   P
  Sunny     Mild         High      False  N
  Sunny     Cool         Normal    False  P
  Rain      Mild         Normal    False  P
  Sunny     Mild         Normal    True   P
  Overcast  Mild         High      True   P
  Overcast  Hot          Normal    False  P
  Rain      Mild         High      True   N

Priors: P(y=P) = 9/14, P(y=N) = 5/14.
Conditional probabilities:
- Outlook: P(sunny|P) = 2/9, P(overcast|P) = 4/9, P(rain|P) = 3/9; P(sunny|N) = 3/5, P(overcast|N) = 0/5, P(rain|N) = 2/5.
- Temperature: P(hot|P) = 2/9, P(mild|P) = 4/9, P(cool|P) = 3/9; P(hot|N) = 2/5, P(mild|N) = 2/5, P(cool|N) = 1/5.
- Humidity: P(high|P) = 3/9, P(normal|P) = 6/9; P(high|N) = 4/5, P(normal|N) = 1/5.
- Windy: P(true|P) = 3/9, P(false|P) = 6/9; P(true|N) = 3/5, P(false|N) = 2/5.

NAÏVE BAYES CLASSIFIER: EXAMPLE
Data to be labelled: X = (Outlook = rain, Temperature = hot, Humidity = high, Windy = false).
For class y = P: P(X|y=P) P(y=P) = P(rain|P) P(hot|P) P(high|P) P(false|P) P(P) = 3/9 x 2/9 x 3/9 x 6/9 x 9/14 = 0.010582
For class y = N: P(X|y=N) P(y=N) = P(rain|N) P(hot|N) P(high|N) P(false|N) P(N) = 2/5 x 2/5 x 4/5 x 2/5 x 5/14 = 0.018286
Since the value obtained for N is larger, X is assigned to class N.
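A minimal sketch that derives the counts from the table above and reproduces the two unnormalized posteriors (0.010582 for P and 0.018286 for N) for the test instance X.

```python
from collections import Counter, defaultdict

rows = [
    ("Sunny","Hot","High","False","N"), ("Sunny","Hot","High","True","N"),
    ("Overcast","Hot","High","False","P"), ("Rain","Mild","High","False","P"),
    ("Rain","Cool","Normal","False","P"), ("Rain","Cool","Normal","True","N"),
    ("Overcast","Cool","Normal","True","P"), ("Sunny","Mild","High","False","N"),
    ("Sunny","Cool","Normal","False","P"), ("Rain","Mild","Normal","False","P"),
    ("Sunny","Mild","Normal","True","P"), ("Overcast","Mild","High","True","P"),
    ("Overcast","Hot","Normal","False","P"), ("Rain","Mild","High","True","N"),
]

class_counts = Counter(r[-1] for r in rows)
cond = defaultdict(int)                 # cond[(class, attribute index, value)] = count
for r in rows:
    for i, v in enumerate(r[:-1]):
        cond[(r[-1], i, v)] += 1

def score(x, y):
    """Unnormalized posterior P(x|y) P(y) under the naive Bayes assumption."""
    s = class_counts[y] / len(rows)                    # prior P(y)
    for i, v in enumerate(x):
        s *= cond[(y, i, v)] / class_counts[y]         # P(x_i | y)
    return s

x = ("Rain", "Hot", "High", "False")
for y in ("P", "N"):
    print(y, round(score(x, y), 6))
# P 0.010582
# N 0.018286
```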
NAÏVE BAYES CLASSIFIER: CHARACTERISTICS
It can easily work with incomplete information about data instances (missing values). It requires that the attributes are conditionally independent of each other given the class label: correlated attributes can degrade the performance. It is robust to isolated noise points (they do not significantly impact the conditional probability estimates) and to irrelevant attributes. Fast model building and fast classification.

OTHER CLASSIFIERS

BAYESIAN NETWORKS
An approach that relaxes the naïve Bayes assumption: it allows modeling probabilistic relationships between attributes and class labels. Bayesian networks are probabilistic graphical models where nodes represent random variables and the edges between the nodes express their probabilistic relationships.

LOGISTIC REGRESSION
Naïve Bayes classifiers and Bayesian networks are probabilistic generative models, because they describe the behavior of the instances in the attribute space that are generated by each class y. Classification models that directly assign class labels without computing class-conditional probabilities are called discriminative models. Logistic regression is similar to regression models, but the computed values are the probabilities used to determine the class label of a data instance.

SUPPORT VECTOR MACHINE (SVM)
SVM is a discriminative classification model that learns linear or nonlinear decision boundaries in the attribute space to separate the classes. Separating hyperplane:

  w^T x + b = 0

where x represents the attributes and (w, b) are the parameters of the hyperplane. A data instance xi can belong to either side of the hyperplane depending on the sign of w^T xi + b.

SUPPORT VECTOR MACHINE (SVM)
The goal is to find a linear hyperplane (decision boundary) that separates the data. Many separating hyperplanes are possible (B1, B2, ... in the figures): which one is better, and how do we define "better"? SVM selects the hyperplane that maximizes the margin, i.e., the distance between the decision boundary and the closest instances of the two classes; by this criterion B1 is better than B2.

ARTIFICIAL NEURAL NETWORKS
Different tasks call for different architectures: image understanding → CNNs; time series analysis → RNNs; denoising → auto-encoders; text and speech recognition → LLMs; etc. They are not part of this course (there are dedicated AI courses available).

ENSEMBLE METHODS
Sometimes a single classification model is not enough. Classification accuracy can be improved by aggregating the predictions of multiple classifiers. The idea is to construct a set of base classifiers from the same dataset and then perform classification by taking a vote on the predictions made by each base classifier.

ENSEMBLE METHODS: RANDOM FOREST
A random forest is an ensemble learning technique represented by a set of decision trees. A number of decision trees are built at training time and the class is assigned by majority voting: random subsets are drawn from the original training data, a tree is learned on each subset using a random set of features, and the resulting classifiers are aggregated.

CLASS IMBALANCE PROBLEM
In many data sets there is a disproportionate number of instances that belong to one class with respect to the other (skew, or class imbalance), and it can be difficult to find sufficiently many labeled samples of the rare class. A classifier trained on an imbalanced data set shows a bias toward improving its performance on the majority class, and accuracy, acc = (fPP + fNN)/(fPP + fPN + fNP + fNN), is not well suited for evaluating models in the presence of class imbalance in the test data.

DEALING WITH IMBALANCED DATA: TRAINING SET TRANSFORMATION
The first step in learning from imbalanced data is to transform the training set into a balanced training set.
- Undersampling: the frequency of the majority class is reduced to match the frequency of the minority class. Some useful examples may not be chosen for training, yielding an inferior classification model.
- Oversampling: artificial examples of the minority class are created to make them equal in proportion to the majority class. Artificially duplicating the instances of the minority class (doubling their weight), or assigning higher weights to instances of the minority class, can lead to poor generalization capabilities; an alternative is to generate synthetic instances of the minority class in the neighborhood of the existing ones.

DEALING WITH IMBALANCED DATA: EVALUATING PERFORMANCE
Accuracy and the confusion matrix alone are not well-suited metrics in the presence of imbalanced classes: acc = (TP + TN)/(TP + TN + FP + FN).
Consider a binary problem where class 0 has 9,900 instances and class 1 has 100, and a model that predicts everything to be class 0. Its accuracy is 9900/10000 = 99%, which is misleading because the model does not detect any class 1 object.

DEALING WITH IMBALANCED DATA: EVALUATING PERFORMANCE
Recall: r = TP / (TP + FN), the number of objects correctly assigned to the positive class divided by the number of objects that really belong to that class.
Precision: p = TP / (TP + FP), the fraction of correct predictions of the positive class over the total number of positive predictions.
F1 = 2rp / (r + p), the harmonic mean of recall and precision.
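A minimal sketch that evaluates the "always predict class 0" model of the example above, treating class 1 as the positive class; the guards for the undefined 0/0 cases are a convention assumed here.

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0   # convention: 0 when no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# "Always class 0" model on 9,900 negatives and 100 positives (positive class = class 1)
tp, fp, tn, fn = 0, 0, 9900, 100
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)          # 0.99  -> looks excellent
print(prf(tp, fp, fn))   # (0.0, 0.0, 0.0) -> the rare class is never detected
```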
DEALING WITH IMBALANCED DATA: ROC CURVE
ROC = Receiver Operating Characteristic: a graphical approach for displaying the trade-off between detection rate and false alarm rate. The ROC curve plots the True Positive Rate, TPR = TP / (TP + FN), on the y-axis against the False Positive Rate, FPR = FP / (FP + TN), on the x-axis.

DEALING WITH IMBALANCED DATA: ROC CURVE
Notable (TPR, FPR) points: (0, 0) declares everything to be of the negative class; (1, 1) declares everything to be of the positive class; (1, 0) is the ideal point. The diagonal line represents random guessing, and points below the diagonal correspond to predictions that are the opposite of the true class. A good classification model should be located as close as possible to the upper left corner of the diagram (above the diagonal).

HOW TO CONSTRUCT AN ROC CURVE
Use a classifier that produces a continuous-valued score for each instance: the more likely the instance is to be in the + class, the higher the score. Sort the instances in decreasing order of score. Starting from the lowest-ranked test instance, assign the selected instance and those ranked above it to the positive class and those ranked below it to the negative class; count TP, FP, TN, FN at each threshold and compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN).

Example test set (score, true class): (0.95, +), (0.93, +), (0.87, -), (0.85, -), (0.85, -), (0.85, +), (0.76, -), (0.53, +), (0.43, -), (0.25, +).

  Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP            5     4     4     3     3     3     3     2     2     1     0
  FP            5     5     4     4     3     2     1     1     0     0     0
  TN            0     0     1     1     2     3     4     4     5     5     5
  FN            0     1     1     2     2     2     2     3     3     4     5
  TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

Plotting the (FPR, TPR) pairs gives the ROC curve.

USING ROC FOR MODEL COMPARISON
It may happen that no model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. The Area Under the ROC Curve (AUC) summarizes the curve in a single number: the ideal model has area 1, while random guessing has area 0.5.

MULTICLASS PROBLEM
(1-r approach) Decompose the multiclass problem into K binary problems: for each class yi ∈ Y, a binary problem is created where all instances that belong to yi are considered positive examples, while the remaining instances are considered negative examples.
(1-1 approach) Construct K(K-1)/2 binary classifiers, each used to distinguish between a pair of classes (yi, yj); instances that do not belong to yi or yj are ignored. The final prediction is made by combining the predictions of the binary classifiers.
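A minimal sketch that computes one (FPR, TPR) point per distinct score threshold for the example above; ties are handled in a single step here, so the intermediate tie columns of the table collapse into one point.

```python
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

P = labels.count("+")
N = labels.count("-")
points = []
for t in sorted(set(scores)) + [float("inf")]:      # last threshold predicts nothing positive
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
    points.append((fp / N, tp / P))                 # (FPR, TPR)
print(points)
# [(1.0, 1.0), (1.0, 0.8), (0.8, 0.8), (0.8, 0.6), (0.6, 0.6), (0.2, 0.4), (0.0, 0.4), (0.0, 0.2), (0.0, 0.0)]
```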
