CS488-Ch-9A Machine Learning PDF
School of Computer Science and Technology
Fantahun B. (PhD)
Summary
These lecture notes cover the concept of machine learning within the context of an introduction to Artificial Intelligence course (CS488). The document discusses learning agents, learning strategies, and different types of learning like supervised, unsupervised, and reinforcement learning. It also touches upon topics such as regression, classification, and clustering.
Full Transcript
HiLCoE School of Computer Science and Technology
Course Title: Introduction to Artificial Intelligence (CS488)
Credit Hour: 4
Instructor: Fantahun B. (PhD), [email protected], Office: 3##

Ch-9A: Machine Learning

Agenda: Learning Agents; Learning Strategies; Learning Element; Supervised vs. Unsupervised vs. Reinforcement Learning; Regression; Classification; Clustering

Objectives
After completing this chapter, students will be able to:
- Define learning agents;
- Identify learning strategies;
- Identify learning elements that can be modified/learnt;
- Compare and contrast supervised vs. unsupervised learning;
- Explain inductive learning: decision trees;
- Implement a simple, example machine learning algorithm;
- Explain how the decision tree classifier works and implement it;
- Explain how the perceptron works and try to implement it;
- Explain the working of the K-Means clustering algorithm and try to implement it.

Machine Learning: Learning
An agent is learning if it improves its performance on future tasks after making observations about the world.
Q#1. Why would we want an agent to learn?
Q#2. If the design of the agent can be improved, why wouldn't the designers just program in that improvement to begin with?
We can identify three main reasons:
1. The designers cannot anticipate all possible situations that the agent might find itself in. For example, a robot designed to navigate mazes must learn the layout of each new maze it encounters.
2. The designers cannot anticipate all changes over time; a program designed to predict tomorrow's stock market prices must learn to adapt when conditions change from boom to bust.
3. Sometimes human programmers have no idea how to program a solution themselves. For example, most people are good at recognizing the faces of family members, but even the best programmers are unable to program a computer to accomplish that task, except by using learning algorithms.

Learning is essential:
- for unknown environments, i.e., when the designer lacks omniscience;
- as a system construction method, i.e., expose the agent to reality rather than trying to write it down;
- because it modifies the agent's decision mechanisms to improve performance.

Machine Learning: Learning Components
Any component of an agent can be improved by learning from data. The improvements, and the techniques used to make them, depend on four major factors:
1. Which component is to be improved.
2. What prior knowledge the agent already has.
3. What representation is used for the data and the component.
4. What feedback is available to learn from.

The components of these agents include:
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe classes of states whose achievement maximizes the agent's utility.
Each of these components can be learned.

Example: an agent training to become a taxi driver.
- Every time the instructor shouts "Brake!" the agent might learn a condition-action rule for when to brake (component 1); the agent also learns every time the instructor does not shout.
- By seeing many camera images that it is told contain buses, it can learn to recognize them (component 2).
- By trying actions and observing the results, for example braking hard on a wet road, it can learn the effects of its actions (component 3).
- Then, when it receives no tip from passengers who have been thoroughly shaken up during the trip, it can learn a useful component of its overall utility function (component 4).

Machine Learning: Representation and Prior Knowledge
We can think of several examples of representations for agent components:
- propositional logic and first-order logical sentences for the components in a logical agent;
- Bayesian networks for the inferential components of a decision-theoretic agent; etc.
This chapter covers inputs that form a factored representation (a vector of attribute values) and outputs that can be either a continuous numerical value or a discrete value.

Machine Learning: Types of Learning
Other ways to look at different types of learning:
In inductive learning, we generalize conclusions from specific given facts. For example:
a. Apple is a fruit.
b. Apple tastes sweet.
Conclusion: All fruits taste sweet.
Deductive learning uses already available facts and information to draw a valid conclusion. It uses a top-down approach. For example:
a. All carnivores eat meat.
b. Lion is a carnivore.
Conclusion: Lion eats meat.

Machine Learning: Feedback to Learn From
There are three types of feedback that determine the three main types of learning: supervised, unsupervised, and reinforcement learning.

Unsupervised learning: the agent learns patterns in the input even though no explicit feedback is supplied. The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, a taxi agent might gradually develop a concept of "good traffic days" and "bad traffic days" without ever being given labeled examples of each by a teacher.

Reinforcement learning: the agent learns from a series of reinforcements, i.e., rewards or punishments. For example, the lack of a tip at the end of the journey gives the taxi agent an indication that it did something wrong, while the two points for a win at the end of a chess game tell the agent it did something right. It is up to the agent to decide which of the actions prior to the reinforcement were most responsible for it.

Supervised learning: the agent observes some example input-output pairs and learns a function that maps from input to output.
In component 1 above, the inputs are percepts and the outputs are provided by a teacher who says "Brake!" or "Turn left." In component 2, the inputs are camera images and the outputs again come from a teacher who says "that's a bus." In component 3, the theory of braking is a function from states and braking actions to stopping distance in feet. In this case the output value is available directly from the agent's percepts (after the fact); the environment is the teacher.

Supervised Learning
In practice, these distinctions are not always so crisp. In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples. Even the labels themselves may not be the oracular truths that we hope for. Imagine that you are trying to build a system to guess a person's age from a photo. You gather some labeled examples by snapping pictures of people and asking their age. That is supervised learning. But in reality some of the people lied about their age. It is not just that there is random noise in the data; rather, the inaccuracies are systematic, and uncovering them is an unsupervised learning problem involving images, self-reported ages, and true (unknown) ages. Thus, both noise and lack of labels create a continuum between supervised and unsupervised learning.

The task of supervised learning is this: given a training set of N example input-output pairs (x1, y1), (x2, y2), ..., (xN, yN), where each yj was generated by an unknown function y = f(x), discover a function h that approximates the true function f. Here x and y can be any value; they need not be numbers. The function h is a hypothesis. Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set. To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set.

We say a hypothesis generalizes well if it correctly predicts the value of y for novel examples. Sometimes the function f is stochastic: it is not strictly a function of x, and what we have to learn is a conditional probability distribution, P(Y | x). When the output y is one of a finite set of values (such as sunny, cloudy, or rainy), the learning problem is called classification; it is called Boolean or binary classification if there are only two values. When y is a number (such as tomorrow's temperature), the learning problem is called regression. (Technically, solving a regression problem is finding a conditional expectation or average value of y, because the probability that we have found exactly the right real-valued number for y is 0.)

We don't know what f is, but we will approximate it with a function h selected from a hypothesis space, H. In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better. We say that a learning problem is realizable if the hypothesis space contains the true function. Unfortunately, we cannot always tell whether a given learning problem is realizable, because the true function is not known.
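To make the idea of searching a hypothesis space and scoring on a test set concrete, the following is a minimal sketch that is not part of the slides; the data-generating function, the sample sizes, and the choice of polynomials as the hypothesis space H are all assumptions made for illustration.

    # Minimal sketch (illustrative, not from the slides): hypothesis search over
    # polynomial degrees, scored on a held-out test set. The target f is assumed.
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * x)                      # unknown "true" function (assumed)
    x = rng.uniform(0, 3, 30)
    y = f(x) + rng.normal(0, 0.1, x.size)            # noisy observations

    x_train, y_train = x[:20], y[:20]                # training set
    x_test,  y_test  = x[20:], y[20:]                # test set, distinct from training

    for degree in (1, 3, 9):                         # three candidate hypothesis classes
        h = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        train_err = np.mean((h(x_train) - y_train) ** 2)
        test_err  = np.mean((h(x_test)  - y_test)  ** 2)
        print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

A more flexible hypothesis usually fits the training data better, but its test error can be worse, which is the fit-versus-generalization tradeoff described above.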
Supervised learning can be done by choosing the hypothesis h* that is most probable given the data:

    h* = argmax_{h in H} P(h | data) = argmax_{h in H} P(data | h) P(h)

Decision Tree
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

Decision Tree: Apply model to test data (a sequence of slides walks a test record down the tree; figures only).

Decision Tree: Tree Induction Issues
- Determine how to split the records:
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting.

Decision Tree: ID3 Algorithm
ID3 (Iterative Dichotomiser), invented by Ross Quinlan, uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.

ID3(Examples, Target_attribute, Attributes)
  Examples: training examples
  Target_attribute: class labels
  Attributes: features (other attributes)
  Returns a decision tree that correctly classifies the given examples.
  - Create a root node for the tree.
  - If all examples are +, return the single-node tree Root with label +.
  - If all examples are -, return the single-node tree Root with label -.
  - If Attributes is empty, return the single-node tree Root with label = the most common value of Target_attribute in Examples.

Decision Tree: Measures for selecting the best split
There are many measures that can be used to determine the best way to split the records. These measures are defined in terms of the class distribution of the records before and after splitting. Widely used measures include:
- Information Gain (used by the ID3 and C4.5 algorithms)
- Gini Index (used by the CART algorithm)

Decision Tree: Information Gain - Entropy
Entropy is an information-theory metric that measures the impurity or uncertainty in a group of observations. It determines how a decision tree chooses to split data.
Consider a dataset with N classes. The entropy may be calculated using the formula below, where pi is the probability of randomly selecting an example in class i:

    E(S) = - sum for i = 1..N of pi * log2(pi)

Example: Let's have a dataset made up of three colors: red, purple, and yellow. If we have one red, three purple, and four yellow observations in our set, our equation becomes:

    E = -( pr*log2(pr) + pp*log2(pp) + py*log2(py) )

where pr, pp, and py are the probabilities of choosing a red, purple, and yellow example respectively.
We have pr = 1/8 because only 1/8 of the dataset represents red, pp = 3/8 as 3/8 of the dataset is purple, and py = 4/8 since half the dataset is yellow (so py = 1/2). Evaluating the formula gives Entropy E = 1.41.

When all observations belong to the same class, the entropy will always be zero:

    E = -(1 * log2(1)) = 0

Such a dataset has no impurity. This implies that such a dataset would not be useful for learning. However, if we have a dataset with, say, two classes, half made up of yellow and the other half purple, the entropy will be one:

    E = -[ (0.5 * log2(0.5)) + (0.5 * log2(0.5)) ] = 1

This kind of dataset is good for learning. The entropy for the training set S shown in the table on the slide can be calculated in the same way.

Decision Tree: Information Gain
The information gain of an attribute A on a training set S is:

    Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)

where:
- S: training set
- A: a data attribute
- Values(A): the set of possible values of attribute A
- Sv: the subset of S whose records have value v for A
- |Sv|: cardinality of the set Sv
- |S|: cardinality of the set S

We can define information gain as a measure of how much information a feature provides about a class. Information gain helps to determine the order of attributes in the nodes of a decision tree. The larger the information gain, the better the split, which also means the lower the resulting entropy. Hence, we should select the attribute with the maximum information gain as the split node.

The main node is referred to as the parent node, whereas sub-nodes are known as child nodes. We can use information gain to measure how good a split is:

    IG = E(parent) - E(children)

where E(children) is the weighted average entropy of the child nodes.

Decision Tree: Information Gain, Example 1
Draw a decision tree for the given dataset using the ID3 algorithm.

    Instance  a1  a2  Target
    1         T   T   +
    2         T   T   +
    3         T   F   -
    4         F   F   +
    5         F   T   -
    6         F   T   -

Values(a1) = {T, F}. The full set has 3 positive and 3 negative examples, so E(S) = 1.0; E(S_a1=T) = 0.9183 and E(S_a1=F) = 0.9183, giving Gain(S, a1) = 1.0 - (3/6)(0.9183) - (3/6)(0.9183) = 0.0817.
Values(a2) = {T, F}. E(S_a2=T) = 1.0 and E(S_a2=F) = 1.0, giving Gain(S, a2) = 1.0 - (4/6)(1.0) - (2/6)(1.0) = 0.0.
Since Gain(S, a1) = 0.0817 is larger than Gain(S, a2) = 0.0, a1 becomes the root; splitting each a1 branch on a2 then yields the pure leaves +, -, -, +.

Decision Tree: Information Gain, Example 2
Draw a decision tree for the given dataset using the ID3 algorithm (the 10-record dataset appears only as a figure on the slide).
- Values(a1) = {True, False}: Gain(S, a1) = 0.9709 - (5/10)(0.7219) - (5/10)(0) = 0.6099
- Values(a2) = {Hot, Cool}: Gain(S, a2) = 0.1245
- Values(a3) = {High, Normal}: Gain(S, a3) = 0.4199
Tree construction: a1 has the maximum gain (0.6099 vs. 0.1245 and 0.4199), so it is selected as the root. On the remaining impure branch (entropy 0.7219), the recomputed gains shown are 0.3219 for a2 and 0.7219 for a3, so a3 is selected next.
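As a concrete illustration of the entropy and information-gain formulas above, here is a minimal Python sketch (not part of the original slides; all function and variable names are illustrative) that reproduces the Example 1 numbers and picks the maximum-gain attribute, as ID3 does:

    # Minimal sketch (illustrative, not from the slides): entropy, information
    # gain, and ID3-style attribute selection on the Example 1 dataset.
    from collections import Counter
    from math import log2

    def entropy(labels):
        """E(S) = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """Gain(S, A) = E(S) - sum_v (|S_v| / |S|) * E(S_v)."""
        total = len(labels)
        gain = entropy(labels)
        for v in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
            gain -= (len(subset) / total) * entropy(subset)
        return gain

    # Example 1 dataset from the slides.
    rows = [{"a1": "T", "a2": "T"}, {"a1": "T", "a2": "T"}, {"a1": "T", "a2": "F"},
            {"a1": "F", "a2": "F"}, {"a1": "F", "a2": "T"}, {"a1": "F", "a2": "T"}]
    labels = ["+", "+", "-", "+", "-", "-"]

    gains = {a: info_gain(rows, labels, a) for a in ("a1", "a2")}
    print(gains)                               # a1 = 0.0817..., a2 = 0.0 (up to float error)
    print("root:", max(gains, key=gains.get))  # ID3 picks the max-gain attribute: a1

A full ID3 implementation would apply this selection recursively to each branch until the base cases listed earlier are reached.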
Decision Tree: Information Gain, Exercise
Using information gain, build a decision tree for the training set given on the slide. Have you got the answer shown on the slide or a different one?

Decision Tree: Gini Index
The other way of splitting a decision tree is via the Gini Index. The Gini Index, or Gini impurity, measures the degree or probability of a particular variable being wrongly classified when it is randomly chosen. But what is actually meant by "impurity"? If all the elements belong to a single class, then the node can be called pure. The Gini Index for a given node t is:

    GINI(t) = 1 - sum over classes j of [p(j | t)]^2

where p(j | t) is the relative frequency of class j at node t.

The degree of the Gini Index varies between 0 and 1:
- 0 denotes that all elements belong to a single class (pure);
- 1 denotes that the elements are randomly distributed across various classes (impure);
- 0.5 denotes elements equally distributed across some classes.

The Gini Index is used in CART, SLIQ, and SPRINT. When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = sum for i = 1..k of (ni / n) * GINI(i)

where ni is the number of records at child i and n is the number of records at node p. We select the attribute with the smallest Gini split value.

Exercise: Using the Gini Index, build a decision tree for the training set given on the slide.
Solution (worked on the slide): calculating the Gini of the attributes, we have:
- Gini(outlook) = 0.34 (the minimum)
- Gini(temperature) = 0.43
- Gini(humidity) = 0.63
- Gini(windy) = 0.42
We select the attribute Outlook as the root node of the tree. Exercise: complete the tree (using the Gini Index).

Decision Tree: Continuous-valued Attributes
How can we handle continuous-valued attributes? Consider the training set below.

    Temperature  PlayTennis
    40           No
    48           No
    60           Yes
    72           Yes
    80           Yes
    90           No

- Change the continuous-valued attribute to a discrete-valued attribute.
- Identify boundaries where there is a sudden change of class (No to Yes or Yes to No).
- Calculate the average of the boundary points: (48 + 60) / 2 = 54 and (80 + 90) / 2 = 85.
- Calculate the information gain for each candidate boundary point, 54 and 85: the gain for the split at 54 is 0.4591 and the gain for the split at 85 is 0.1908.
- The gain 0.4591 is the maximum, so the boundary point 54 is chosen. The attribute now has only two discrete values, < 54 and > 54, and we can proceed with the remainder of the decision tree process.
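To make the candidate-boundary computation concrete, here is a small Python sketch (not from the slides; function names are illustrative) that evaluates the information gain of the two candidate thresholds for the temperature data above:

    # Minimal sketch (illustrative, not from the slides): information gain for
    # the candidate thresholds 54 and 85 on the Temperature / PlayTennis data.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def threshold_gain(values, labels, t):
        """Gain of splitting a numeric attribute at threshold t."""
        left  = [lab for v, lab in zip(values, labels) if v < t]
        right = [lab for v, lab in zip(values, labels) if v >= t]
        n = len(labels)
        return (entropy(labels)
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))

    temperature = [40, 48, 60, 72, 80, 90]
    play        = ["No", "No", "Yes", "Yes", "Yes", "No"]

    for t in (54, 85):
        print(t, round(threshold_gain(temperature, play, t), 4))
    # Prints 0.4591 for 54 and about 0.1909 for 85 (the slide rounds to 0.1908),
    # so 54 is the better boundary point.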
Machine Learning: ANNs - Perceptron
Biological vs. Artificial Neuron (comparison figures on the slides).

Artificial Neural Networks
According to the basic findings of neuroscience, mental activity consists primarily of electrochemical activity in networks of brain cells called neurons. ANNs are inspired by this hypothesis; some of the earliest AI work aimed to create artificial neural networks. Researchers Warren McCulloch and Walter Pitts published their concept of a simplified brain cell in 1943, called the McCulloch-Pitts (MCP) neuron. They described such a nerve cell as a simple logic gate with binary outputs.

An artificial neuron is a mathematical function based on a model of biological neurons: each neuron takes inputs, weighs them separately, sums them up, and passes this sum through a nonlinear function to produce an output. An artificial neural network whose inputs are directly connected to its outputs is called a single-layer network (perceptron). The perceptron was introduced by Frank Rosenblatt in 1957. He proposed a perceptron learning rule based on the original MCP neuron. A perceptron is an algorithm for supervised learning of binary classifiers; it enables the neuron to learn by processing the elements of the training set one at a time.

Perceptron Learning Rule
The perceptron learning rule states that the algorithm automatically learns the optimal weight coefficients. After each training example, every weight is adjusted in proportion to the error between the target output t and the predicted output o:

    wi <- wi + eta * (t - o) * xi

where eta is the learning rate and xi is the i-th input.

Example
A worked example on the slides updates the weights step by step over several epochs. Do you think the result of this iteration is correct? If it is not, how do you correct it? You stop when the weights do not change after seeing all the inputs; in the worked example this happens at epoch 4, with the weights shown on the slide.
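Below is a minimal Python sketch of the perceptron learning rule, not taken from the slides, that trains a perceptron on the AND gate (the first part of the exercise that follows); the step activation, the zero initial weights, the bias handling, and the learning rate are assumptions of this sketch.

    # Minimal sketch (illustrative, not from the slides): training a perceptron
    # with the rule w_i <- w_i + eta*(t - o)*x_i to realize the AND gate.
    def step(s):
        return 1 if s >= 0 else 0          # threshold (step) activation

    inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
    targets = [0, 0, 0, 1]                 # AND gate

    w = [0.0, 0.0]                         # weights (assumed initial values)
    b = 0.0                                # bias (threshold) term
    eta = 0.1                              # learning rate (assumed)

    for epoch in range(10):                # up to 10 passes; the exercise hand-computes 5
        changed = False
        for (x1, x2), t in zip(inputs, targets):
            o = step(w[0] * x1 + w[1] * x2 + b)
            if o != t:                     # update only on misclassification
                w[0] += eta * (t - o) * x1
                w[1] += eta * (t - o) * x2
                b    += eta * (t - o)
                changed = True
        print(f"epoch {epoch + 1}: w = {w}, b = {b}")
        if not changed:                    # converged: weights unchanged for all inputs
            break

With these assumptions the weights stop changing after a handful of epochs and the learned weights realize AND. XOR, by contrast, is not linearly separable, so the same loop never reaches an epoch with no updates, which is what the second part of the exercise below is meant to expose.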
Exercise (10 pts, due on ….)
How to submit: based on the instructions given in class.
1. Estimate the weights of a perceptron which realizes the AND function (gate).
2. Try to estimate the weights of a perceptron which realizes the XOR function (gate).
Note: Use 5 epochs for each. Use a table to record the intermediate values (epoch, inputs, sum, predicted output, target output, error, weight vector). Report if the perceptron does not converge in 5 epochs. If this happens, why do you think it happened?
3. Using information gain, build a decision tree for the dataset shown in slide #57.

K-Nearest Neighbor Classifier
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point. While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.

KNN uses the k "closest" points (nearest neighbors) to make a classification. Basic idea: "If it walks like a duck and quacks like a duck, then it's probably a duck."

KNN requires three things:
- the set of stored records;
- a distance metric to compute the distance between records;
- the value of k, the number of nearest neighbors to retrieve.
To classify an unknown record:
1. Compute its distance to the other training records.
2. Identify the k nearest neighbors.
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.) The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Compute the distance between two points using the Euclidean distance, and determine the class from the nearest-neighbor list by taking the majority vote of the class labels among the k nearest neighbors.

Example: Determine whether "David" is Loyal, taking k = 3 (the customer table appears on the slide). Using the Euclidean distance, the distances from David are:
- sqrt((35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2) = 15.16
- sqrt((22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2) = 15
- sqrt((63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2) = 152.23
- sqrt((59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2) = 122
- sqrt((25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2) = 15.74
The answer is "Yes": "Yes" is the class of David. Why?
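Here is a minimal Python sketch of the 3-NN majority vote just described; the attribute vectors follow the distance computations above, but the class labels are placeholders assumed for illustration, since the actual customer table is only shown as a figure on the slide.

    # Minimal sketch (illustrative, not from the slides): 3-NN majority vote.
    # Attribute vectors follow the distance computations above; the class labels
    # here are assumed placeholders (the real table is a slide figure).
    from collections import Counter
    from math import dist                # Euclidean distance (Python 3.8+)

    train = [
        ((35, 35, 3), "Yes"),            # label assumed for illustration
        ((22, 50, 2), "Yes"),            # label assumed for illustration
        ((63, 200, 1), "No"),            # label assumed for illustration
        ((59, 170, 1), "No"),            # label assumed for illustration
        ((25, 40, 4), "Yes"),            # label assumed for illustration
    ]
    david = (37, 50, 2)
    k = 3

    # Sort the records by distance to the query point and keep the k closest.
    neighbors = sorted(train, key=lambda rec: dist(rec[0], david))[:k]
    votes = Counter(label for _, label in neighbors)
    print(neighbors)                     # the records at distances 15, 15.16, 15.74
    print(votes.most_common(1)[0][0])    # majority vote -> "Yes" with these labels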
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.

K-Nearest Neighbor Classifier: Scaling Issues
Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes. For example:
- the height of a person may vary from 1.5 m to 1.8 m;
- the weight of a person may vary from 90 lb to 300 lb;
- the income of a person may vary from $10K to $1M.
All such distance-based algorithms are affected by the scale of the variables.

Consider data with an age variable, giving the age of a person in years, and an income variable, giving the monthly income of the person in rupees:

    ID  Age  Income (Rupees)
    1   25   80,000
    2   30   100,000
    3   40   90,000
    4   30   50,000
    5   40   110,000

Here Age ranges from 25 to 40, whereas Income ranges from 50,000 to 110,000. The Euclidean distance between observations 1 and 2 is:

    ED = [(100000 - 80000)^2 + (30 - 25)^2]^(1/2) = 20000.000625

We can see that the distance is dominated by the high magnitude of the income. We can use scaling and/or normalization to fix this problem.

With min-max scaling (normalization), the table becomes:

    ID  Age    Income
    1   0.000  0.500
    2   0.333  0.833
    3   1.000  0.667
    4   0.333  0.000
    5   1.000  1.000

(The slides also show the corresponding standardized table.) After standardization, the Euclidean distance between observations 1 and 2 is:

    ED = [(0.608 + 0.260)^2 + (-0.447 + 1.192)^2]^(1/2) = 1.1438

We can clearly see that the distance is no longer biased towards the income variable; it now gives similar weight to both variables.
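A minimal Python sketch (not from the slides; function names are illustrative) of the min-max scaling step above, x' = (x - min) / (max - min), showing its effect on the distance between observations 1 and 2:

    # Minimal sketch (illustrative, not from the slides): min-max scaling of the
    # Age/Income table and its effect on the Euclidean distance.
    from math import dist

    rows = [(25, 80_000), (30, 100_000), (40, 90_000), (30, 50_000), (40, 110_000)]

    def min_max_scale(table):
        cols = list(zip(*table))                     # column-wise view
        lows  = [min(c) for c in cols]
        highs = [max(c) for c in cols]
        return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
                for row in table]

    scaled = min_max_scale(rows)
    for i, row in enumerate(scaled, start=1):
        print(i, [round(v, 3) for v in row])         # matches the 0.000 .. 1.000 table

    print(round(dist(rows[0], rows[1]), 6))          # ~20000.000625, income dominates
    print(round(dist(scaled[0], scaled[1]), 4))      # ~0.4714, both attributes matter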
K-Nearest Neighbor Classifier: Ties in Neighbor Selection
A tie can occur when two or more points are equidistant from an unclassified observation, making it difficult to choose which neighbors to include.
- In figure i, it is obvious which two neighbors should be selected.
- In figure ii, three neighbors are equidistant from the unclassified observation M.
- In figure iii, one of the nearest neighbors is obvious, but the other three have an equal chance of being selected.
There are three prevailing approaches to this complication:
1. Choose a different k. 3-nearest-neighbor classification solves the issue of neighbor selection in figures i and ii, but it does not solve the problem in figure iii.
2. Randomly choose between the tied values. Applying this approach to figure ii, each of the three observations would have an equal chance of being selected as one of the two neighbors. In figure iii, the A closest to N would be selected, and one of the remaining three observed points would be randomly selected.
3. Allow observations in until a natural stopping point. The idea here is to choose the smallest number of neighbors such that k is greater than or equal to two and no ties exist. For figure i, the two nearest observations would be selected. For figure ii, since there is a three-way tie, all three neighbors are considered; the constraint of only two neighbors is lifted for this particular observation. Similarly, all four observations are selected in figure iii. Note that this method would admit as many values as necessary to avoid the tie: if figure ii instead had 10 equidistant points, all 10 would be allowed into the model.

K-Nearest Neighbor Classifier: Ties in Classification
Let's assume we settled on the third method above and now have our selected neighbors (table i). Now we can begin classifying our unclassified points. The default classification method is majority vote, which works well for M: since there are two As nearby and only one B, M is predicted to be an A. However, things are more complicated for L and N. There are several approaches:
1. Choose an odd k. Some suggest simply choosing an odd value for k. This method does not always work, since the number of classes could be odd as well (A, B, C).
2. Randomly select between tied neighbors. Using this method, L and N both have an equal chance of being classified as A or B. But is this fair? Does it make sense that N, which is very close to an observed A, has an equal chance of being a B?
3. Weight by distance. Addressing the concern created by the previous method, it is possible to weight the neighbors so that those nearest to the unclassified point have a "greater vote". This approach would result in both L and N being classified as A, since the A neighbors are closer than the others by comparison. This method seems to address most scenarios, although it is still possible that it won't solve every issue.
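As a concrete illustration of the third approach, here is a minimal Python sketch (not from the slides; the points, labels, and the 1/distance weighting are assumptions made for illustration) of a distance-weighted vote:

    # Minimal sketch (illustrative, not from the slides): distance-weighted voting.
    # Each neighbor votes with weight 1/distance (a common choice, assumed here),
    # so closer neighbors count for more than distant ones. Points are invented.
    from collections import defaultdict
    from math import dist

    def weighted_vote(neighbors, query):
        """neighbors: list of (point, label); returns the label with the largest
        total 1/d weight relative to the query point."""
        scores = defaultdict(float)
        for point, label in neighbors:
            d = dist(point, query)
            scores[label] += 1.0 / d if d > 0 else float("inf")  # exact match dominates
        return max(scores, key=scores.get)

    # An N-like situation: one class-A neighbor very close, one class-B neighbor
    # farther away. A plain majority vote would be tied; the weighted vote picks A.
    query = (0.0, 0.0)
    neighbors = [((0.5, 0.0), "A"), ((3.0, 0.0), "B")]
    print(weighted_vote(neighbors, query))   # -> "A"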