Week 8 - Lecture 7 Supervised ML.pdf
Lecture 7: Introduction to Supervised Machine Learning
A/Prof Guo Huaqun (Linda)

Acknowledgement
This set of slides is based on:
- The 1st version from A/Prof. Vivek Balachandran
- The 2nd version from Prof. Yu Chien Siang

ClassPoint Tool
classpoint.app (https://classpoint.app) - input the class code.

Learning Outcomes
- Define the concept of Machine Learning.
- Understand the three types of Machine Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning.
- Understand the following supervised learning algorithms and apply them in the lab assignment and coursework: K Nearest Neighbors, Decision Tree, Random Forest.

Machine Learning - What?
- Like humans: learning from the past.
- The science of getting computers to learn without being explicitly programmed.
- Machine Learning is an application of AI wherein the system gets the ability to automatically learn and improve based on experience.

Machine Learning Applications
Top 5 Machine Learning Applications For 2022: https://www.youtube.com/watch?v=aKhz79s-Row

Supervised Learning
- A method in which the machine learns using labelled data.
- It is like learning under the guidance of a teacher: the training dataset acts as a teacher which is used to train the machine.
- The model is trained on a pre-defined dataset before it starts making decisions when given new data.

Unsupervised Learning
- A method in which the machine is trained on unlabelled data, without any guidance.
- It is like learning without a teacher.
- The model is given a dataset and is left to automatically find patterns and relationships in that dataset by creating clusters.
- The model learns through observation and finds structures in the data.

Reinforcement Learning
- Involves an agent that interacts with its environment by producing actions and discovering errors or rewards.
- It is like being stuck on an isolated island, where you must explore the environment and learn how to live and adapt to the living conditions on your own.
- The model learns through trial and error: it learns on the basis of the reward or penalty given for every action it performs.

Quiz
Please draw lines to connect supervised learning, unsupervised learning and reinforcement learning to the words related to each machine learning type: unlabelled data, labelled data, no pre-defined data, understand patterns and discover the output, predict the future, follow the trial-and-error method.

Types of Machine Learning (comparison)
- Definition: Supervised - the machine learns by using labelled data; Unsupervised - the machine is trained on unlabelled data without any guidance; Reinforcement - an agent interacts with its environment by producing actions and discovering errors or rewards.
- Types of problems: Supervised - classification or regression; Unsupervised - clustering or association; Reinforcement - reward based.
- Types of data: Supervised - labelled data; Unsupervised - unlabelled data; Reinforcement - no pre-defined data.
- Training: Supervised - external supervision; Unsupervised - no supervision; Reinforcement - no supervision.
- Approach: Supervised - map labelled input to known output; Unsupervised - understand patterns and discover the output; Reinforcement - follow the trial-and-error method.

Quiz: Which type of machine learning do you think will be smarter than humans?
a) Supervised Learning  b) Unsupervised Learning  c) Reinforcement Learning  d) None
Supervised Learning Process
A 2-stage process:
- Learning (training): learn a model using the training data.
- Testing: test the model using unseen test data to assess the model's accuracy.
(Figure: training pipeline - training input -> feature extractor -> features + label -> machine learning algorithm; testing pipeline - test input -> feature extractor -> features -> classifier model -> label.)

Instance-based Learning
- Learning = storing all training instances.
- Classification = assigning the target function to a new instance.
- Referred to as "lazy" learning: the model is created at the point of classification.
- Disadvantage of instance-based methods: the cost of classifying new instances can be high, because nearly all computation takes place at classification time rather than learning time - it is slower in classification.

K Nearest Neighbors (KNN)
- The most basic instance-based method.
- Data is represented in a vector space.
- Supervised learning.
Video: KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Beginners (27 min) https://www.youtube.com/watch?v=4HKqjENq9OU

KNN Algorithm Features
- All instances correspond to points in an n-dimensional Euclidean space.
- Classification is delayed until a new instance arrives.
- Classification is done by comparing the feature vectors of the different points.
- The target function may be discrete or real-valued. For discrete values, KNN returns the most common value among the k nearest training examples.

K-Nearest Neighbor (How it works)
(Figures: classification with 1 nearest neighbor and with 3 nearest neighbors.)

KNN Algorithm
Training algorithm: for each training example, add the example to the list.
Classification algorithm: given a query instance xq to be classified, let x1, ..., xk be the k instances nearest to xq, and return
    f̂(xq) <- argmax over v in V of Σ (i = 1..k) δ(v, f(xi))
where δ(a, b) = 1 if a = b, else δ(a, b) = 0, and V is the finite set of classes/labels.

What is a good value for k?
- Determined experimentally: start with k = 1 and use a test set to validate the error rate of the classifier; repeat with k = k + 1.
- Choose the value of k for which the error rate is minimum.
- Is k = 10 or k = 11 better? How to test efficacy: N-fold cross validation!

N-fold Cross Validation
- Split the data into N blocks and perform the classification N times.
- In each round of testing, N-1 blocks are used for training and 1 block is used for testing.
- A total of N results are obtained; average the error across all N classifications.

Distance Calculation
- All instances correspond to points in an n-dimensional Euclidean space: X = (X1, X2, ..., Xn).
- The distance between two instances is measured as the Euclidean distance. For two points (x1, y1) and (x2, y2):
    D = sqrt((x1 - x2)^2 + (y1 - y2)^2)

Continuous-valued Target Functions
- KNN also approximates continuous-valued target functions f : R^d -> R.
- Calculate the mean value of the k nearest training examples rather than their most common value:
    f̂(xq) <- (Σ (i = 1..k) f(xi)) / k
https://study.com/academy/lesson/discrete-continuous-functions-definition-examples.html

Curse of Dimensionality
- Imagine instances are described by 20 features (attributes) but only 3 are relevant to the target function.
- Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional, because it is dominated by the large number of irrelevant features.
https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality

Possible Solutions
- Weight the features: use cross-validation to automatically choose the weights z1, ..., zn.
- Feature subset selection.
(Figure: two wind turbines seem very close to each other in two dimensions but separate when viewed in a third dimension - the same effect the curse of dimensionality has on data.)
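The classification rule above can be written down directly. Below is a minimal sketch in plain Python (standard library only); the 2-D points and labels are made up for illustration and are not from the lecture. It covers both the discrete-valued case (majority vote, the δ-sum above) and the continuous-valued case (mean of the k nearest target values).

```python
# Minimal sketch of k-NN: Euclidean distance + majority vote / mean of k nearest.
import math
from collections import Counter

def euclidean(a, b):
    # D = sqrt((a1-b1)^2 + ... + (an-bn)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=3):
    """training is a list of (feature_vector, label) pairs; return the majority label of the k nearest."""
    neighbours = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def knn_regress(query, training, k=3):
    """For a real-valued target: mean value of the k nearest training examples."""
    neighbours = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    return sum(value for _, value in neighbours) / k

# Hypothetical 2-D training set with two classes.
train = [((1.0, 1.2), "A"), ((0.8, 0.9), "A"), ((3.1, 3.0), "B"), ((2.9, 3.3), "B"), ((1.1, 0.7), "A")]
print(knn_classify((1.0, 1.0), train, k=3))   # -> "A"
```

Since all the work happens in `knn_classify`, the "lazy learning" point above is visible here: there is no training step beyond keeping the list of examples.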
KNN - When?
When to use KNN:
- Classification problems.
- The data has definitive, manageable features (say fewer than 20).
- Lots of training data.
Advantages:
- No training period.
- Easy implementation.
- New data can be added seamlessly.
Disadvantages:
- Slow at query time (classification).
- Easily fooled by irrelevant features (attributes).

Decision Tree
(Figure: example tree - Age: Old -> Sex (Female -> Healthy, Male -> Diseased); Young -> Healthy.)
Tree structure:
- An inverted tree starting with a root node.
- An internal node is a test on an attribute.
- A branch represents an outcome of the test.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split the training examples into distinct classes as much as possible.
- A new case is classified by following a matching path to a leaf node.

Training: Make a Decision Tree
- Create a list of attributes that can be measured.
- Decide on the target attributes that specify the different classes.
- Create an experience table with these attributes that we have seen in the past.
- Convert the experience table into a decision tree, e.g. using the ID3 algorithm.

Experience Table: To Play or Not!
Previous weather data for the days James played tennis; this data is used to train the model. Given a new set of weather data, predict whether James will play tennis.

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Decision Tree Sample
Outlook: Sunny -> Humidity (High -> No, Normal -> Yes); Overcast -> Yes; Rain -> Wind (True -> No, False -> Yes).

Decision Tree Dissected
- Each internal node tests an attribute.
- Each branch corresponds to an attribute value.
- A leaf node gives a class/decision.

A Fine New Day!
Test case: sunny, hot, normal, true - a combination different from the training data.

Decision Tree Classification
Test case: Outlook = sunny, Temperature = hot, Humidity = normal, Windy = true.
Following the tree: Outlook = sunny -> Humidity = normal -> Yes. Will he play? Yes! (See the sketch after this section.)

Building a Decision Tree
- Top-down tree construction: at the start, all training examples are at the root; partition the examples recursively by choosing one attribute at a time.
- Bottom-up tree pruning: remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
Deciding on the splitting attribute:
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples.
- A goodness function is used for this purpose. Typical goodness functions: information gain (ID3/C4.5), information gain ratio, Gini index.
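The sample tree and the "fine new day" test case above can be checked with a few lines of code. This is a sketch only: it assumes the tree is stored as a nested dict, which is one possible representation and not something prescribed by the lecture.

```python
# Sketch: the sample decision tree above as a nested dict, and a classifier
# that follows a matching path from the root down to a leaf.
tree = {
    "Outlook": {
        "sunny":    {"Humidity": {"high": "No", "normal": "Yes"}},
        "overcast": "Yes",
        "rain":     {"Windy": {True: "No", False: "Yes"}},
    }
}

def classify(node, sample):
    # An internal node is a test on one attribute; a leaf is a plain class label string.
    if isinstance(node, str):
        return node
    attribute, branches = next(iter(node.items()))
    return classify(branches[sample[attribute]], sample)

new_day = {"Outlook": "sunny", "Temperature": "hot", "Humidity": "normal", "Windy": True}
print(classify(tree, new_day))   # -> "Yes" (sunny + normal humidity)
```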
Which Attribute to Select?
(Figure: the four candidate splits - Outlook, Temperature, Humidity and Windy - each shown with the Yes/No outcomes in its branches.)

Attribute Selection
- Which is the best attribute? The one which will result in the smallest tree.
- Select the attribute that produces the "purest" nodes (all yes or all no), measured by the uncertainty.
- Popular impurity measure: information gain. Information gain increases with the average purity of the subsets that an attribute produces.
- Strategy: choose the attribute that results in the greatest information gain.

Which Attribute to Split On?
Whole set: 9 Yes, 5 No.
- Split on Outlook: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), Rain (3 Yes, 2 No).
- Split on Windy: True (3 Yes, 3 No), False (6 Yes, 2 No).

Measure the Purity of the Split
- We want to be more certain about Yes/No after the split.
- Pure set (4 yes / 0 no): completely certain (100%).
- Impure set (3 yes / 3 no): completely uncertain (50%).
- Cannot simply use P("yes" | set): the measure must be symmetric - (4 yes / 0 no) is as pure as (0 yes / 7 no).

Entropy
Entropy is the measure of randomness or unpredictability in the dataset.
- S is a sample of training examples in a subset.
- p+ is the proportion of positive examples; p- is the proportion of negative examples.
- Entropy measures the impurity of subset S:
    Entropy(S) = -p+ log2 p+ - p- log2 p-
- Interpretation: if an item X belongs to S, how many bits are needed to tell whether X is positive or negative?
- Impure (3 yes / 3 no): Entropy(S) = -3/6 log2(3/6) - 3/6 log2(3/6) = -log2(1/2) = -(log2 1 - log2 2) = -(0 - 1) = 1 bit.
- Pure (4 yes / 0 no): Entropy(S) = -4/4 log2(4/4) - 0/4 log2(0/4) = -log2 1 - 0 = 0 bits.
(Note: log2 x = log x / log 2.)
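A quick sketch of the two-class entropy formula, reproducing the 1-bit and 0-bit examples above. It assumes the usual convention that a zero-probability term contributes 0, which the slide leaves implicit in the 0/4 log2(0/4) term.

```python
# Sketch of Entropy(S) = -p+ log2 p+ - p- log2 p- for a two-class subset.
import math

def entropy(p_pos, p_neg):
    """Two-class entropy in bits; zero-probability terms contribute nothing."""
    total = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(entropy(3/6, 3/6))   # impure 3 yes / 3 no -> 1.0 bit
print(entropy(4/4, 0/4))   # pure   4 yes / 0 no -> 0.0 bits
```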
Information Gain
- Entropy tells how pure or impure one subset is. How do we combine the entropy of all the subsets produced by a split?
- Aggregate the information from the different subsets. Average them? Not a simple average (why?).
- Weight the entropy value of each subset by the proportional size of the subset.
- Information gain is the entropy difference before and after the split:
    Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)
where S is the set of examples, A is the attribute, and Sv is the subset of S for each value v of A.

Find the Gain of Outlook
S = {9 Yes, 5 No}, A = Outlook, V = {Sunny, Overcast, Rain}, with Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No) and Rain (3 Yes, 2 No).
- Entropy of the set of outcomes before the split (at the root node):
    E(9/14, 5/14) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
- Entropy of Sunny: E(2/5, 3/5) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971
- Entropy of Overcast: E(4/4, 0/4) = -4/4 log2(4/4) - 0/4 log2(0/4) = 0
- Entropy of Rain: E(3/5, 2/5) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971
- Expected information for the attribute subsets: (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.69
- Information gain: Gain(S, Outlook) = 0.94 - 0.69 = 0.25

Find the Gain of Temperature
Hot (2 Yes, 2 No), Mild (4 Yes, 2 No), Cool (3 Yes, 1 No).
- Entropy of Hot: E(2/4, 2/4) = 1
- Entropy of Mild: E(4/6, 2/6) = 0.92
- Entropy of Cool: E(3/4, 1/4) = 0.81
- Expected information: (4/14)(1) + (6/14)(0.92) + (4/14)(0.81) = 0.29 + 0.39 + 0.23 = 0.91
- Information gain: Gain(S, Temperature) = 0.94 - 0.91 = 0.03

Find the Gain of Humidity
High (3 Yes, 4 No), Normal (6 Yes, 1 No).
- Entropy of High: E(3/7, 4/7) = 0.985
- Entropy of Normal: E(6/7, 1/7) = 0.592
- Expected information: (7/14)(0.985) + (7/14)(0.592) = 0.79
- Information gain: Gain(S, Humidity) = 0.94 - 0.79 = 0.15

Find the Gain of Windy
True (3 Yes, 3 No), False (6 Yes, 2 No).
- Entropy of True: E1; entropy of False: E2.
- Compute the expected information for the attribute subsets, then the information gain: Gain(S, Windy) = ?
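The same arithmetic can be automated. Below is a sketch, assuming the experience table is simply typed in as Python tuples; it recomputes the gains worked out above and can be used to check the Windy exercise.

```python
# Sketch: Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv) over the experience table.
import math
from collections import Counter

rows = [  # (Outlook, Temperature, Humidity, Windy, Play?)
    ("sunny", "hot", "high", False, "No"),   ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"), ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),  ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"), ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"), ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),  ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"), ("rain", "mild", "high", True, "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Windy"]

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(rows, col):
    before = entropy([r[-1] for r in rows])
    subsets = Counter(r[col] for r in rows)
    after = sum((n / len(rows)) * entropy([r[-1] for r in rows if r[col] == v])
                for v, n in subsets.items())
    return before - after

for i, name in enumerate(attributes):
    print(f"Gain(S, {name}) = {gain(rows, i):.2f}")
# Outlook ~ 0.25, Temperature ~ 0.03, Humidity ~ 0.15, Windy ~ 0.05
```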
Computing Information Gain
Information gain for each attribute:
- Gain("Outlook") = 0.25
- Gain("Temperature") = 0.03
- Gain("Humidity") = 0.15
- Gain("Windy") = 0.05
Find the attribute with the maximum gain: the root node is Outlook!

Decision Tree
(Figure: the tree after the first split on Outlook, with branches Sunny, Overcast and Rain.)
- The Overcast branch already ends in the leaf node 'Yes'.
- For the two subtrees under Sunny and Rain, the remaining attributes (Humidity, Temperature, Windy) are compared by computing their information gain again.

Overfitting
- A tree may overfit the training data.
- Symptoms: the tree is too deep and has too many branches, some of which may reflect anomalies due to noise or outliers.
- If we keep splitting until each node contains one example (a singleton is pure), we get good accuracy on the training data but poor accuracy on the test data.
Two approaches to avoid overfitting:
- Pre-pruning: halt tree construction early - stop splitting when the split is not statistically significant. This is difficult to decide, because we do not know what may happen subsequently if we keep growing the tree.
- Post-pruning: remove branches or sub-trees from a "fully grown" tree. This method is commonly used. It uses a statistical method to estimate the errors at each node for pruning; a validation set may be used for pruning as well.

An Example: Postpruning
Postpruning waits until the full decision tree has been built and then prunes the attributes. Two techniques:
- Subtree Replacement
- Subtree Raising

Subtree Replacement
The entire subtree is replaced by a single leaf node.
(Figure: in a tree with internal nodes A, B, C and leaves 1-5, the subtree rooted at C is replaced by leaf node 2.)
Node 2 replaces the subtree. This generalizes the tree a little more, but may increase accuracy.

Subtree Raising
The entire subtree is raised onto another node.
(Figure: the subtree rooted at C, with leaves 1, 2 and 3, is raised into the position of node B.)

Random Forest (RF)
Ensemble classifier:
- Consists of many decision trees, created from subsets of the data (random sampling of subsets).
Classification:
- Classify using each of the trees in the random forest; each classifier predicts the outcome.
- The final decision is made by voting.
- The method combines Breiman's "bagging" idea and the random selection of features.
Video: Random Forest Algorithm - Random Forest Explained (45 min) https://www.youtube.com/watch?v=eM4uJ6XGnSM

An Example!
(Figure: several decision trees built from different attribute subsets (Age, Sex, Height, Work status), each ending in Healthy or Diseased leaves.)
New sample: old, retired, male, short. What is the output based on RF?
a) Healthy  b) Diseased
Each tree classifies the new sample and gives its own prediction (Diseased, Healthy, ...); the final decision by the majority rule is Diseased.

Advantages of Random Forests
- It is one of the most accurate learning algorithms available: for many data sets, it produces a highly accurate classifier.
- It runs efficiently on large databases.
- It can handle thousands of input variables without variable deletion.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
- It is faster due to the smaller size of the trees.

Summary
- Machine Learning and supervised learning algorithms.
- Supervised learning process: train and test.
- KNN: lazy learning; N-fold cross validation.
- Decision Trees: tree structure; entropy/information gain; overfitting.
- Random Forest: subsets of data; a collection of trees.
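For the lab assignment and coursework mentioned in the learning outcomes, the three algorithms covered here are commonly available off the shelf. The following is a minimal sketch assuming scikit-learn is installed; the dataset is synthetic and all parameter values are illustrative, not prescribed by the lecture.

```python
# Sketch (assumes scikit-learn): train/test the three classifiers from this lecture on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Made-up data: 300 samples, 10 numeric features (purely for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "KNN (k=5)":     KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)                                   # learning (training) stage
    print(name, "test accuracy:", model.score(X_test, y_test))    # testing stage on unseen data

# Choosing k experimentally with N-fold cross validation (N = 5 here):
for k in range(1, 12):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy {scores.mean():.3f}")
```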