Summary

This document is a transcript of lecture slides for Data Mining Unit 4 (Classification). According to the deck's outline, it covers the classification problem, decision trees as a basic classifier, other classification methods, data splitting and performance assessment, and imbalanced classification; the transcript below includes worked examples on impurity measures (entropy, Gini, classification error) and split quality.

Full Transcript


Data Mining Unit 4: Classification
Elizabeth Wanner ([email protected]). Thanks to Dr Felipe Campelo for preparing the slides.

Outline
- Problem definition
- A basic classifier: decision trees
- Other classification methods
- Data splitting and performance assessment
- Imbalanced classification

Problem definition

Classification refers to the problem of predicting the class (a nominal label, sometimes called the dependent variable) of an observation based on the values of its other attributes (usually called independent variables or predictors). Classification is a type of supervised learning: the model that maps attribute values to class values is learned from examples for which the true class is known.

We can define the problem slightly more formally as: given a collection of labelled records (the training set), each represented by a set of predictive features paired with a value of a target variable (the class), find a model that maps the predictive feature values to a class value. The goal of this task is to assign a class label to previously unseen records as accurately as possible. (Figure source: Tan et al. (2019), Ch. 3)

Or, to put it more concisely: classification is the task of learning a target function - a classification model - that maps each attribute vector x to one of the predefined class labels y. (Table source: Tan et al. (2019), Ch. 3)

As an example, consider the vertebrate classification problem in the table from Tan et al. (2019), Ch. 3: each row is a labelled example, with predictor attributes (body temp., skin cover, etc.) and a class column. The objective is to learn from the labelled examples (i.e., examples for which we know the class value) to predict the class of new organisms, based on the available predictor variables. Notice that the first column (vertebrate name) would be useless for building a predictive model. Can you guess why?

General classification framework

Classification generally involves:
- A modelling approach (a general model structure + a learning algorithm). The process of using a learning algorithm to build a classification model from the training data is known as induction.
- A set of data used for model induction (training data).
- A set of data used for assessing the model (test data).

As usual, there are exceptions - for instance, classifiers that don't build an explicit model, different approaches to training/testing, etc. We'll talk about those later. (Figure source: Tan et al. (2019), Ch. 3)
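To make this framework concrete, here is a minimal sketch (not part of the original slides) of induction and assessment using scikit-learn, with its bundled iris data standing in for a labelled training set; the library, dataset, and parameter choices are illustrative assumptions, not the lecturer's setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # stand-in collection of labelled records

# Keep part of the data unseen during induction so the assessment is honest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # induction
print(accuracy_score(y_test, model.predict(X_test)))                  # assessment on test data
```

The key point is that performance is measured on records the learner never saw while the model was being induced.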
Some important considerations

There are some very important points about classification (and statistical learning in general) that are unfortunately often neglected, but that we should keep in mind. Most of those probably stem from ignoring what we really want when we build a classification model. Any guesses?

What we really want is a model that is good at solving a practical problem! This means it should be able to generate good predictions for unseen data (i.e., data that was not present at the time of model training), which in turn means that we need to test the model on data that is:
1. representative of the type of data the model is likely to encounter once it is deployed (this may be harder than it sounds);
2. as unseen as we can make it (more on this later)!

Decision Trees

One of the simplest types of classifiers (and one commonly used to build more sophisticated ones) is the decision tree. A decision tree is basically a directed acyclic graph (DAG) representing a sequence of queries to single attributes. Each tree has three types of nodes: root nodes (zero incoming, zero or more outgoing links); internal nodes (exactly one incoming and two or more outgoing links); and leaf nodes (exactly one incoming, zero outgoing links). The most commonly used decision trees are built as binary trees (non-leaf nodes always have exactly two outgoing links). In a decision tree, root and internal nodes are each associated with a question (a query on some attribute), and leaf nodes are associated with an answer (an inferred value for the class attribute). (Figure source: Tan et al. (2019), Ch. 3)

Building decision trees

Decision trees are often built iteratively. At any point, the algorithm needs to determine (i) whether to add a non-leaf or a leaf node; (ii) if a non-leaf node is added, which attribute to query; and (iii) how to use that attribute to split the data. To build a decision tree we need training data (i.e., a set of observations with known class labels) and a way to:
- decide which attribute to query at each step;
- decide how to split the data based on that attribute;
- decide when to stop growing the tree.

These aspects give rise to several different tree induction algorithms and a large number of decision tree variants, including Hunt's Algorithm, Ross Quinlan's methods (ID3, C4.5 / 4.8 / 5.0), Breiman's CART, and several methods developed by researchers at IBM (e.g., Mehta's SLIQ, Shafer's SPRINT). Today we'll only discuss the general principles of decision tree induction, but the required reading for this unit provides more details on the specific algorithms. A sketch of these general principles is given below.
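The sketch below is my own illustration, not an algorithm from the slides. It shows the general shape of greedy binary-tree induction on numeric attributes: at each node it searches every attribute/threshold pair for the split with the lowest weighted Gini impurity (Gini is defined in the "Determining impurity" section that follows), and it stops growing when a node is pure or a maximum depth is reached. Function names and the max-depth stopping rule are assumptions chosen for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search: try every attribute and threshold and keep the split
    with the lowest weighted impurity of the resulting children."""
    best = None  # (weighted impurity, attribute index, threshold)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] < thr
            if left.sum() == 0 or left.sum() == len(y):
                continue  # degenerate split: one child would be empty
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if best is None or score < best[0]:
                best = (score, j, thr)
    return best

def majority_leaf(y):
    """Leaf node labelled with the majority class of the records that reach it."""
    values, counts = np.unique(y, return_counts=True)
    return {"leaf": values[np.argmax(counts)]}

def grow_tree(X, y, depth=0, max_depth=3):
    """Recursive induction: stop when the node is pure or too deep,
    otherwise add an internal node that queries the best attribute."""
    if len(np.unique(y)) == 1 or depth == max_depth:
        return majority_leaf(y)
    split = best_split(X, y)
    if split is None:
        return majority_leaf(y)
    _, j, thr = split
    left = X[:, j] < thr
    return {"attr": j, "thr": thr,
            "true":  grow_tree(X[left],  y[left],  depth + 1, max_depth),
            "false": grow_tree(X[~left], y[~left], depth + 1, max_depth)}

# Usage (with numeric feature matrix X_train and label vector y_train):
# tree = grow_tree(X_train, y_train)
```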
Splitting criteria

(Figures source: Tan et al. (2019), Ch. 3)
- Binary attributes only give rise to simple yes/no (true/false) questions.
- Nominal attributes can be split either in a multiway or a binary fashion (by turning n-ary questions into true/false ones about a single attribute value, e.g., "Is value == married?").
- Interval and ratio attributes can also be split either multiway or binary (more common). Unlike for nominal variables, splitting strategies for continuous attributes need to explicitly determine the splitting threshold(s).
- Ordinal attributes can be treated either as categorical (ignoring the ordering) or as (semi-)numerical (defining a threshold that takes the ordering into consideration).

Selecting an attribute test condition

At each node of a decision tree, any attribute can be selected as the splitting criterion. Selecting the best one is usually done with a greedy approach (i.e., one that only considers the highest gain at that node, rather than the overall condition of the tree). The best split at a node is the one that does the best job of separating the data into groups in which a single class predominates. For example, splitting on "X1 < 2.75?" produces the mixed groups {Class 1: 11, Class 2: 21, Class 3: 1} and {Class 1: 39, Class 2: 29, Class 3: 49}, whereas splitting on "X2 < 2.5?" produces the much purer groups {Class 1: 0, Class 2: 0, Class 3: 50} and {Class 1: 50, Class 2: 50, Class 3: 0}.

Determining impurity

Let p(i | t) represent the proportion of records at node t that belong to class i. We need a measure, based on the proportions across all classes, that quantifies how heterogeneous (or "impure") a given mixture of classes is. Common measures of impurity:

Entropy(t) = - sum_{i=1}^{C} p(i|t) log2 p(i|t)
Gini(t) = 1 - sum_{i=1}^{C} [p(i|t)]^2
Misclassification error(t) = 1 - max_i p(i|t)

(Figures source: Tan et al. (2019), Ch. 3)
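As a quick reference, the three measures can be written directly from the per-class record counts at a node. This is a small sketch of my own (the slides give only the formulas); the helper names are chosen for illustration.

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class record counts (0 * log2(0) treated as 0)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini index of a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    """Misclassification error of a node."""
    n = sum(counts)
    return 1 - max(counts) / n
```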
Entropy: examples

Maximum entropy happens when all classes are equally represented (largest variability, maximum impurity). For a node with {Class 1: 50, Class 2: 50, Class 3: 50}: p(i|t) = 50/150 = 1/3 for all classes, so
Entropy(t) = - sum_{i=1}^{3} (1/3) log2(1/3) = 1.585

Minimum entropy happens when there's only a single class represented in a group (lowest variability, minimum impurity). For the pure node {Class 1: 0, Class 2: 0, Class 3: 50}: p(1|t) = 0/50 = 0, p(2|t) = 0/50 = 0, p(3|t) = 50/50 = 1, so
Entropy(t) = -[2 x (0 x log2 0) + (1 x log2 1)] = 0
Note: 0 x log2(0) → 0 (but your calculator may throw an error or a NaN here).

Intermediate values of entropy occur for different mixtures, depending on the heterogeneity of the classes. For a node with {Class 1: 50, Class 2: 50, Class 3: 0}: p(1|t) = 50/100 = 0.5, p(2|t) = 50/100 = 0.5, p(3|t) = 0/100 = 0, so
Entropy(t) = -[2 x (0.5 x log2 0.5) + (0 x log2 0)] = 1
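The three worked values above can be double-checked numerically; here is a tiny sketch using scipy (an assumed dependency, not mentioned in the slides), whose entropy function normalises raw counts to proportions:

```python
from scipy.stats import entropy  # normalises the counts to proportions internally

print(entropy([50, 50, 50], base=2))  # ~1.585 (all classes equally represented)
print(entropy([0, 0, 50], base=2))    # 0.0    (pure node)
print(entropy([50, 50, 0], base=2))   # 1.0    (two classes equally mixed, one absent)
```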
The two groups produced by the "X1 < 2.75?" split shown earlier illustrate this further:
- Group with {Class 1: 39, Class 2: 29, Class 3: 49} (117 records): p(1|t) = 39/117 = 0.33, p(2|t) = 29/117 = 0.25, p(3|t) = 49/117 = 0.42, giving Entropy(t) = 1.553.
- Group with {Class 1: 11, Class 2: 21, Class 3: 1} (33 records): p(1|t) = 11/33 = 0.33, p(2|t) = 21/33 = 0.64, p(3|t) = 1/33 = 0.03, giving Entropy(t) = 1.096.

Measuring split quality

A natural way to measure split quality is to quantify the relative reduction in total entropy (i.e., the improvement in node purity after a split). We saw that the data impurity at a given node can be measured by the entropy of the data at that node. The total impurity of the child nodes after a given partition is measured by their total weighted entropy:

I(children) = sum_{t=1}^{k} (n_t / n_p) Entropy(t)

where k is the number of child nodes, n_t is the number of records in child t, and n_p is the number of records in the parent node.
Example: splitting a parent node with {Class 1: 50, Class 2: 50, Class 3: 50} (n_p = 150, Entropy = 1.585) on "X2 < 2.5?" yields a true branch with {Class 1: 0, Class 2: 0, Class 3: 50} (50 records, Entropy = 0) and a false branch with {Class 1: 50, Class 2: 50, Class 3: 0} (100 records, Entropy = 1).
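Putting the pieces together for this split, here is a short sketch of my own, following the weighted-entropy formula from the slides, that computes I(children) and the resulting reduction in entropy (commonly called the information gain of the split):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

parent   = [50, 50, 50]               # 150 records, Entropy = 1.585
children = [[0, 0, 50], [50, 50, 0]]  # "X2 < 2.5?" true branch (50 recs), false branch (100 recs)

n_p = sum(parent)
i_children = sum(sum(child) / n_p * entropy(child) for child in children)
gain = entropy(parent) - i_children

print(round(i_children, 3))  # 0.667 -> I(children), weighted entropy of the child nodes
print(round(gain, 3))        # 0.918 -> entropy reduction achieved by this split
```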