Machine Learning for Smart Industry Lecture 08: Unsupervised Learning PDF

Summary

This document contains lecture notes on unsupervised learning in machine learning, covering clustering (k-means, DBSCAN, Gaussian mixture models), association rules, and dimensionality reduction. The notes are written for a smart-industry audience and introduce the different algorithms and methods; the transcript also bundles slide text from earlier lectures of the course (introduction, linear regression and classification, kernel methods, support vector machines, and decision trees).

Full Transcript

MS3 / AMDA – MACHINE LEARNING FOR SMART INDUSTRY – LECTURE 08: UNSUPERVISED LEARNING – AMEYA REGE

WHAT WAS COVERED IN THE PREVIOUS LECTURE?
Dimensionality reduction; principal component analysis (PCA); PCA – regression; PCA – classification; PCA in feature space; singular value decomposition.

WHAT WILL WE LEARN TODAY?
Supervised vs. unsupervised learning; unsupervised learning; clustering (k-means, DBSCAN, Gaussian mixture model); association rules.

THIS COURSE
(Course-map diagram: supervised learning – regression, classification; unsupervised learning – clustering, pattern search; reinforcement learning; neural nets and deep learning – MLP, CNN, RNN, GAN.)

SUPERVISED VS. UNSUPERVISED LEARNING
(Figure: classification vs. clustering.)

UNSUPERVISED LEARNING
(Diagram: data -> ML algorithm -> model.)
Unsupervised learning appears where machine learning algorithms are used to analyze and cluster unlabeled data sets. The algorithms discover hidden patterns or data groupings without the need for human intervention. So, how do they differ from the supervised learning algorithms that you have worked with so far? No output labels are provided.

"If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake." – Yann LeCun

What are the different learning algorithms used for unsupervised learning?
- Clustering
- Association rules
- Dimensionality reduction (conceptually covered in Lecture 07)

CLUSTERING
Clustering involves grouping unlabeled data into clusters based on their similarities. The goal is to identify patterns and relationships in the data without any prior know-how concerning the meaning of the data. Common clustering families: exclusive clustering, density-based clustering, hierarchical clustering, probabilistic clustering.

ASSOCIATION RULES
A rule-based ML technique that finds useful relations (associations) between parameters of a large data set. Commonly used in market-basket analysis, allowing companies to better understand relationships between different products. Common algorithms: Apriori, FP-Growth, Eclat.

DIMENSIONALITY REDUCTION
Aims to reduce the number of features in a dataset while preserving as much information as possible. Commonly used in the preprocessing stage. Common algorithms: principal component analysis, singular value decomposition, autoencoders.

CLUSTERING – EXCLUSIVE CLUSTERING
As the name suggests, exclusive clustering stipulates that a datapoint can exist in only one cluster; it is also called hard clustering. The most common example is the k-means clustering algorithm. The alternative to exclusive clustering is overlapping clustering, where a data point is allowed to belong to multiple clusters with independent degrees of membership; an example of overlapping clustering is the fuzzy k-means algorithm.

K-MEANS CLUSTERING – A CENTROID-BASED CLUSTERING APPROACH
A centroid-based clustering algorithm, based on a partitioning method. The goal is to group data points on the basis of their closeness; you can choose the Euclidean, Manhattan, or Minkowski distance as the similarity measure. The dataset is separated into a predetermined number of clusters, and the centroids are recalculated from the observations assigned to each cluster. Drawback: the number of clusters k has to be selected. (Figures: Wikipedia; O'Reilly, Hands-on Machine Learning with Scikit-Learn and TensorFlow.)

K-MEANS CLUSTERING – HOW TO SELECT k?
One common technique is the elbow method, based on the inertia metric: the mean squared distance between each datapoint and its closest centroid. (Figure: Hands-on Machine Learning with Scikit-Learn and TensorFlow.)

K-MEANS CLUSTERING – GENERAL OVERVIEW
- If the data are well separated, k-means performs best; it is not suitable for overlapping datapoints.
- Fast in comparison with other clustering algorithms.
- Does not provide clear information regarding the quality of the clusters.
- Not suitable for non-spherical clusters.
The clusters are called "means" because the centroids are the mean values of the datapoints categorized in them.

K-MEANS CLUSTERING – ALGORITHM
1. Randomly initialize k instances (cluster centroids, or means).
2. Categorize each datapoint to its closest mean and label the instances.
3. Update each mean's coordinates to the average of the datapoints categorized in that cluster so far.
4. Repeat for n iterations.
(Figures: Hands-on Machine Learning with Scikit-Learn and TensorFlow.) The result is sensitive to the centroid initialization!
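As a concrete companion to the algorithm above, here is a minimal sketch of k-means and the elbow method with scikit-learn. The make_blobs toy dataset and all parameter values are illustrative assumptions, not part of the lecture.

```python
# Minimal k-means + elbow-method sketch (assumes scikit-learn and matplotlib).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 2-D points drawn around 4 centres (an assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Elbow method: fit k-means for several k and record the inertia
# (inertia_ is the sum of squared distances to the closest centroid;
# the lecture's "mean squared distance" is this divided by the sample count).
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.title("Elbow method")
plt.show()

# Fit the final model with the k read off at the "elbow" (here 4).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
labels = kmeans.predict(X)           # hard cluster assignment per point
centroids = kmeans.cluster_centers_  # the "means" that give the method its name
```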
DENSITY-BASED CLUSTERING
Finds groups or categories based on the density of datapoints. Benefit: it determines the number of clusters automatically (contrary to k-means) and is also less sensitive to the initial positions. Suited for non-separable data and irregularly shaped or overlapping clusters. Manages dense and sparse data regions, which makes it very adaptable to diverse morphologies.

DBSCAN
Density-Based Spatial Clustering of Applications with Noise. Dense regions are separated from regions with a lower density of datapoints; the method is based on the notions of "clusters" and "noise". The key idea relates to density: for each datapoint of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. The parameters are therefore the radius and the minimum number of points, and datapoints are classified as core points, border points, or noise/anomalies.

DBSCAN – ALGORITHM
1. Find all datapoints within a radius and identify the core points.
2. For each core point (if not already assigned to a cluster), create a new cluster.
3. Recursively find all its density-connected points and assign them to the same cluster as the selected core point.
4. Iterate through the remaining unvisited datapoints.
5. Points that do not belong to any cluster are noise.
(Figure: Hands-on Machine Learning with Scikit-Learn and TensorFlow.)

GAUSSIAN MIXTURE MODEL
A probabilistic model: it assumes that the instances were generated from a mixture of several (k) Gaussian distributions whose parameters are unknown. The dataset X is assumed to have been generated as follows: for each instance, a cluster is picked randomly from the k clusters, the probability of choosing cluster j being defined by the cluster's weight (or mixture weight) φ^(j). The index of the cluster chosen for the i-th instance is denoted z^(i). If z^(i) = j, the location x^(i) is sampled randomly from the Gaussian distribution with mean μ^(j) and covariance matrix Σ^(j), i.e. x^(i) ~ N(μ^(j), Σ^(j)).

Goal: given the dataset X, estimate the weights φ and all distribution parameters μ^(1) to μ^(k) and Σ^(1) to Σ^(k). But how? (Figure: Hands-on Machine Learning with Scikit-Learn and TensorFlow.)

What do we do in k-means? Expectation: initialize the cluster parameters randomly and assign instances to clusters. Maximization: update the clusters. Then repeat the previous two steps. Here we generalize this approach: we find not only the cluster centers, but also their size, shape and orientation, and their relative weights; this is soft clustering. During the expectation step, the algorithm estimates, for each instance, the probability that it belongs to each cluster. During the maximization step, each cluster is updated using all the instances in the dataset, where each instance is weighted by the estimated probability that it belongs to that cluster.

K-MEANS VS. GAUSSIAN MIXTURE MODEL
(Figure: Hands-on Machine Learning with Scikit-Learn and TensorFlow.)
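Both algorithms described above are available in scikit-learn. A minimal sketch follows; the make_moons toy data and all parameter values (eps, min_samples, the number of mixture components) are assumptions for illustration.

```python
# Minimal DBSCAN and Gaussian-mixture sketch (assumes scikit-learn is available).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# Non-spherical toy data where plain k-means struggles (illustrative assumption).
X, _ = make_moons(n_samples=500, noise=0.06, random_state=0)

# DBSCAN: eps is the neighbourhood radius, min_samples the minimum number of
# points required in that neighbourhood for a core point; label -1 marks noise.
db = DBSCAN(eps=0.15, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN clusters found:", n_clusters)
print("points labelled as noise:", int(np.sum(db.labels_ == -1)))

# Gaussian mixture model: soft clustering fitted by expectation-maximization.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print("mixture weights (phi):", gmm.weights_)
print("means (mu):", gmm.means_)
hard = gmm.predict(X)        # most probable component per point
soft = gmm.predict_proba(X)  # membership probabilities (the "soft" part)
```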
ASSOCIATION RULES – BASICS
Let I = {i_1, i_2, ..., i_n} be a set of n binary attributes called items, and let D = {t_1, t_2, ..., t_m} be a set of transactions called the database, where each transaction in D has a unique transaction ID and contains a subset of the items in I. An association rule is then X ⇒ Y with X, Y ⊆ I. In words: whenever X occurs in a transaction, Y will occur too.

ASSOCIATION RULES – EXAMPLE – SUPERMARKET INDUSTRY
I = {milk, bread, butter, beer, diapers, eggs, fruit}

  transaction ID | milk | bread | butter | beer | diapers | eggs | fruit
  1              |  1   |  1    |  0     |  0   |  0      |  0   |  1
  2              |  0   |  0    |  1     |  0   |  0      |  1   |  1
  3              |  0   |  0    |  0     |  1   |  1      |  0   |  0
  4              |  1   |  1    |  1     |  0   |  0      |  1   |  1
  5              |  0   |  1    |  0     |  0   |  0      |  0   |  0

Example association rule: {butter, bread} ⇒ {milk}.

Two constraints are introduced: a minimum threshold on support and on confidence.

Support: how frequently an itemset appears in the dataset; for a rule A ⇒ B, support = P(A ∩ B), the fraction of transactions that contain all items involved. E.g. the itemset {milk, bread, butter} has a support of 1/5 = 0.2 (20%). A minimum threshold is defined for the support; e.g. with the support threshold set to ≥ 0.4, the rule {milk} ⇒ {eggs} is out.

Confidence: the percentage of transactions satisfying X that also satisfy Y,
conf(X ⇒ Y) = P(Y | X) = supp(X ∪ Y) / supp(X).
E.g. the rule {bread, butter} ⇒ {milk} has confidence (1/5) / (1/5) = 1.0 (100%): every time a customer buys bread and butter, they buy milk. A minimum threshold is defined for the confidence as well.

Alternatively, rules can be ranked by support × confidence:

  X ⇒ Y                                  | support × confidence
  if buy milk, then buy bread            | 0.4 × 1.0  = 0.4
  if buy milk, then buy eggs             | 0.2 × 0.5  = 0.1
  if buy bread, then buy fruit           | 0.4 × 0.66 = 0.264
  if buy fruit, then buy eggs            | 0.4 × 0.66 = 0.264
  if buy milk and bread, then buy fruit  | 0.4 × 1.0  = 0.4
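To make the support and confidence numbers in the supermarket example reproducible, here is a minimal plain-Python sketch over the same five transactions. The helper functions support and confidence are illustrative, not from the lecture.

```python
# Support and confidence for the supermarket example, computed directly.
transactions = [
    {"milk", "bread", "fruit"},                    # transaction 1
    {"butter", "eggs", "fruit"},                   # transaction 2
    {"beer", "diapers"},                           # transaction 3
    {"milk", "bread", "butter", "eggs", "fruit"},  # transaction 4
    {"bread"},                                     # transaction 5
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """supp(X ∪ Y) / supp(X): how often Y appears when X does."""
    return support(set(X) | set(Y)) / support(X)

print(support({"milk", "bread", "butter"}))       # 0.2, as in the notes
print(confidence({"bread", "butter"}, {"milk"}))  # 1.0 -> 100 %
X, Y = {"milk", "bread"}, {"fruit"}
print(support(X | Y) * confidence(X, Y))          # 0.4 * 1.0 = 0.4
```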
WHAT DID WE LEARN TODAY?
Clustering: k-means, DBSCAN, Gaussian mixture model; association rules.

WHAT DID WE LEARN IN THIS COURSE?
Classification vs. regression; support vector machines; decision trees; ensemble learning / random forests; neural networks.

ANNOUNCEMENT – THANK YOU
AMEYA REGE, MS3 / AMDA, [email protected]

MS3 / AMDA – MACHINE LEARNING FOR SMART INDUSTRY – LECTURE 01: INTRODUCTION – AMEYA REGE, RONALD AARTS

WHAT WILL WE LEARN TODAY?
Who are we? How is the course organized? Lecture and tutorial planning; assignments and exam; course material. Society 5.0 and Industry 5.0. What is machine learning? Definition, history of AI, focus of this course.

WHO ARE WE?
Ameya Rege – Assistant Professor at MS3. Background: solid mechanics, computing, materials (Mumbai, India; Los Angeles, USA; Aachen/Cologne, Germany; Staffordshire, England).
Ronald Aarts – Associate Professor at MS3. Interests: system identification, multibody dynamics, control (Eindhoven, NL; Paris, FR; Linz, AT).

ORGANIZATION OF THE COURSE – AI IN ENGINEERING
Computer vision; reinforcement learning; machine learning for smart industry.

COURSE ORGANIZATION
8 lectures (2x per week):

  Lecture 01  Introduction to machine learning      Ameya Rege / Ronald Aarts
  Lecture 02  Linear regression and classification  Ronald Aarts
  Lecture 03  Nonlinear regression with kernels     Ronald Aarts
  Lecture 04  Support vector machines               Ameya Rege
  Lecture 05  Decision trees                        Ameya Rege
  Lecture 06  Neural networks                       Ameya Rege
  Lecture 07  Feature selection and PCA             Ronald Aarts
  Lecture 08  Unsupervised learning                 Ameya Rege

8 tutorials (2x per week), covering 4 assignments, one per week: Python how-to and linear regression; regression and classification; neural networks; decision trees.

Assignments are due the following Monday at 10:30, and the solutions are made available right after the submission deadline. This is followed by a self-evaluation, due on Wednesday at 10:30 – why? The prerequisite for appearing at the exam is submitting all assignments on time and passing with 60% of the total number of points for all assignments. The examination takes place on 16 December 2024 (Chromebook/Remindo), with a resit on 31 January 2025.

SOCIETY 5.0 – INDUSTRY 5.0

WHAT IS MACHINE LEARNING? CAN YOU NAME THESE TWO GENTLEMEN?
Prof. John Hopfield, Princeton University, and Prof. Geoffrey Hinton, University of Toronto.

MACHINE LEARNING – DEFINITIONS
Oxford Dictionary: "the use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and draw inferences from patterns in data."
Arthur Samuel, 1959: "field of study that gives computers the ability to learn without being explicitly programmed."
Tom Mitchell, 1997: "a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

CLOSER LOOK AT THE DEFINITION BY TOM MITCHELL
Your spam filter is a machine learning program that, given examples of spam emails (e.g., flagged by users) and examples of regular (non-spam) emails, can learn to flag spam. The examples that the system uses to learn are called the training set; each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails.
This particular performance measure is called accuracy, and it is often used in classification tasks. REWIND THE CLOCK BACK Arthur Samuels (1901-1990) first coined the term “machine learning” in 1952 He improved the checkers game by what is now called alpha-beta pruning. The program recorded all positions it had already seen and combined this with the values of the reward function. Strictly speaking, rewinding the clock, AI was considered (and rejected) by many well-known scholars – albeit in a very simplistic sense. E.g., Gottfried Leibniz made a thought experiment where he expanded the brain into the size of a mill. He found it difficult to imagine that a mind capable of perception could be constructed using only mechanical processes – thus mimicking, brain – a complex machine. FIRST AI WORKSHOP FIRST AI WORKSHOP “To proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” THE TURING TEST CAN MACHINES THINK? Alan Turing (1912-1954) THE TURING TEST: EUGENE GOOTSMAN Eugene Gootsman is a chatbot regarded to have passed the Turing test in 2001, and then in the contest marking Turing’s 60th death anniversary in 2014 Developed by Russian-born Vladimir Veselov, Ukranian-born Eugene Demchenko, and Russian-born Sergey Ulasen You can check an exemplary conversation here Can a computer really have a mind, understanding, and consciousness? THE TURING TEST: CHINESE ROOM John Searle thought experiment aimed at the following: Assuming the AI research had succeeded in programming a computer to behave as if it understands Chinese. Would the computer be actually understanding Chinese (conversation) or just simulating the ability to understand it? He explained that without understanding (better intentionality), we cannot call the computer as “thinking” → thus, the computer does not have a mind of its own → thus, falsifying the strong AI hypothesis What is strong AI? Is there some classification? TYPES OF AI Narrow AI Everything that exists in AI today Can be trained to perform tasks, often faster and better than humans Cannot perform outside its defined task E.g., Siri, Alexa, ChatGPT (yes! Because it is limited to single task of text-based chat) General AI Artificial General Intelligence (AGI) is a theoretical concept, for now ☺ Can use previous learnings to accomplish new tasks in a different context without needs of humans to train underlying models Would make AGI to learn and perform like humans Super AI Strictly theoretical Could make super AI think, reason, learn, make judgements, and possess cognitive abilities that surpass those of humans ARTIFICIAL INTELLIGENCE OR MACHINE LEARNING WHEN DID AI BECOME BIG, AND WHY? Ok, we have seen scholars writing positively or negatively about AI, but when did AI really progress! – Well, there went a lot between the 1950’s and what we are about to look into! One common term used in machine learning: artificial neural network (we will see what that is a bit later!) The first artificial neural network was first developed by Belmont Farley and Wesley Clarke at MIT in 1954 – albeit with just 128 neurons (given lack of compute power)! It could train to recognise simple patterns. So the 1950’s were when machines were shown to be capable of AI Let us jump to the 1980’s now!!! Why? WHEN DID AI BECOME BIG, AND WHY? 
AI winter (mid-1960’s and 1970’s) Among others, John Hopfield at Princeton revitalized the AI field with his work on associative neural networks in 1982 with the paper: Neural networks and physical systems with emergent collective computational abilities David Rumelhart, Geoffery Hinton, and Ronald Williams, in 1986, published Learning representations by back-propagating errors, thus, popularizing the algorithm (although the algorithm was used in 1970’s for training ANNs) In 1989, Alexander Waibel et al., introduced convolutional neural networks (very commonly used today) In 1990, Yann LeCun at Bell Labs used CNNs to recognize handwritten digits – first real application of ANNs This was followed by second AI winter, due to failure of several companies – technology did not seem viable WHEN DID AI BECOME BIG, AND WHY? 1990’s and beyond saw increased compute power One significant milestone came in 1997, when Deep Blue became the first computer chess-playing system to beat the reigning world champion Gary Kasparov First decade saw the emergence of Big Data – large amount of data along with cheaper and faster computers became accessible In 2009, Fei-Feli Li developed ImageNet, a dataset of three million images – served as a basis for testing new generation of image processing systems WHEN DID AI BECOME BIG, AND WHY? DEEP LEARNING REVOLUTION Key to deep learning revolution was hardware advances, particularly GPU In 2009, Raina, Madhavan and Ng reported a 100M deep network trained on 30 Nvidia GeForce GTX 280 GPUs CNNs were combined with long short-term memory (LSTM) and showed success for image classification and generating captions In 2014, Ian Goodfellow et al. introduced the generative adversarial network (GAN) WHEN DID AI BECOME BIG, AND WHY? LARGE LANGUAGE MODELS Transformer – a deep learning architecture proposed by Google researchers in 2017 Large language models (based on transformers) introduced by OpenAI (ChatGPT: GPT-3 in 2020) another by DeepMind (Gato in 2022) and another by Meta AI (Llama in 2023) Optimus Prime is and many more!! een coole robot Information extraction Large language model Object recognition Sentiment analysis Following instructions Image captioning CRITICISM OF AI CAN WE AFFORD AI? The computational and environmental costs of training large models are becoming increasingly unsustainable CRITICISM OF AI BIAS Problem framing – humans are framing the problem with all bias Data collection – unrepresentative and having reflecting human prejudices Data preparation – one can introduce bias during preparation as to which features to select “AI is trying to decide if you get a job, AI is deciding if you get access to medical treatment – and if we get that wrong, you have real world consequences.” – Joy Buolamwini, MIT Media Lab, now founder Algorithmic Justice League “Debiasing humans is harder than debiasing AI systems.”, Olga Russakovsky, Princeton University CRITICISM OF AI ETHICAL CONCERNS What is real? What is fake? 
CRITICISM OF AI ETHICAL CONCERNS FOCUS OF THIS COURSE AI Natural language Machine learning Speech recognition Expert systems Computer vision Robotics Planning processing Supervised learning Text generation Speech to text Machine vision Unsupervised learning Question answering Text to speech Image recognition Deep learning Classification Context extraction Machine translation MACHINE LEARNING In 1949, Donald Hebb published Organization of Behavior – where he described how function of neurons contributed to learning Dendrites x1 w1 Nucleus Inputs Outputs x2 Inputs w2 ෍ 𝑥𝑖 𝑤𝑖 + 𝑏 𝜑.. activation Axon wn b function xn Outputs bias Biological neuron Artificial neuron MACHINE LEARNING Neural nets and deep learning MLP CNN RNN GAN Unsupervised learning Supervised learning Clustering Regression Pattern search Classification Reinforcement learning SUPERVISED VS. UNSUPERVISED LEARNING Classification Clustering CLASSIFICATION Classification of spam emails Inbox Classifier Spam folder CLASSIFICATION Classification of failure modes in reinforced concrete frames Sliding / flexural hinging Sliding / shear failure Crushing / flexural hinging Crushing / shear failure CLASSIFICATION VS. REGRESSION CLASSIFICATION VS. REGRESSION REGRESSION NEURAL NETWORKS DECISION TREES DECISION TREES RANDOM FOREST WHAT SKILLS OF ME WILL BE TESTED? Mathematics Statistics / Data engineering Probability theory Programming Computing KNOW OUR TUTORS Merit Fernhout Timm Gödde [email protected] [email protected] THANK YOU AMEYA REGE & RONALD AARTS MS3 / AMDA [email protected] & [email protected] ET / MS3 / AMDA MACHINE LEARNING FOR SMART INDUSTRY 2. LINEAR REGRESSION AND CLASSIFICATION RONALD AARTS PREVIOUS LECTURE Introduction AI Introduction ML Introduction supervised learning Data sets 2 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION AND Supervised learning – Overview Risk minimisation CLASSIFICATION Classification example Linear models Distance and norm Linear regression – Least Squares fit (LSQ) Basis functions Matrix solution Examples with underfit/overfit Residual Nonlinear? 3 Machine Learning for Smart Industry - 2. Linear regression and classification SUPERVISED LEARNING – OVERVIEW From Bojana Rosić Dataset 𝒟 = 𝑥𝑖 , 𝑦𝑖 𝑁 𝑖=1 of size 𝑁 with inputs 𝑥𝑖 and corresponding outputs or labels 𝑦𝑖. Supervised learning: Both inputs 𝑥𝑖 and corresponding outputs 𝑦𝑖 are known. Assume there is some mapping 𝑦𝑖 = 𝑓 ∗ 𝑥𝑖 , where function 𝑓 ∗ is sometimes denoted as the oracle. 4 Machine Learning for Smart Industry - 2. Linear regression and classification SUPERVISED LEARNING – OVERVIEW Assumed mapping with oracle function 𝑓 ∗ : 𝑦𝑖 = 𝑓 ∗ 𝑥𝑖. Accounting for noise → distribution; Or by adding some random noise term 𝜖𝑖 : 𝑦𝑖 = 𝑓 ∗ 𝑥𝑖 + 𝜖𝑖. Outputs 𝑦𝑖 continuous → regression. Outputs 𝑦𝑖 discrete → classification. MNIST dataset with handwritten digits 5 Machine Learning for Smart Industry - 2. Linear regression and classification SUPERVISED LEARNING – RISK MINIMISATION Mapping with oracle function 𝑓 ∗ , with random noise term 𝜖𝑖 : 𝑦𝑖 = 𝑓 ∗ 𝑥𝑖 + 𝜖𝑖. 𝑁 Goal of supervised learning: Obtain “good” approximation of oracle 𝑓 ∗ from dataset 𝒟, i.e. 𝑥𝑖 , 𝑦𝑖 = 𝑓 ∗ 𝑥𝑖 𝑖=1. This “best approximation” function 𝑓 is picked from a collection of functions, the hypothesis space ℋ. 
Closeness of 𝑓 to 𝑓 ∗ is defined with suitable loss function: 𝐿 𝑓 𝑥𝑖 , 𝑓 ∗ 𝑥𝑖 = 𝐿 𝑓 𝑥𝑖 , 𝑦𝑖 1 1 Combined for all data in dataset into risk: 𝑅 𝑓 = 𝑁 σ𝑁 ∗ 𝑖=1 𝐿 𝑓 𝑥𝑖 , 𝑓 𝑥𝑖 = 𝑁 σ𝑁 𝑖=1 𝐿 𝑓 𝑥𝑖 , 𝑦𝑖 Finding best predictive model 𝑓መ from optimisation problem: min 𝑅 𝑓 → training 𝑓∈ℋ 6 Machine Learning for Smart Industry - 2. Linear regression and classification RISK MINIMISATION Is this risk minimisation what we really want? The risk is evaluated from the samples in the dataset, which results in the empirical risk minimisation 1 min 𝑅emp 𝑓 = min 𝑁 σ𝑁 𝑖=1 𝐿 𝑓 𝑥𝑖 , 𝑦𝑖 𝑓∈ℋ 𝑓∈ℋ 𝑁 where the samples 𝑥𝑖 𝑖=1 are independent and identically distributed, described with some sample distribution 𝜇. However, 𝑓መ should also perform well on new samples, so mathematically solve population risk minimisation min 𝑅pop 𝑓 = min 𝔼𝑥∼𝜇 𝐿 𝑓 𝑥𝑖 , 𝑓 ∗ 𝑥𝑖 𝑓∈ℋ 𝑓∈ℋ which however in practice is mostly not feasible. Challenge of generalisation. 7 Machine Learning for Smart Industry - 2. Linear regression and classification SUPERVISED LEARNING – SUMMARY Main questions to be addressed: 1. Approximation: How to define the hypothesis space ℋ? Not too large, not too small. 2. Optimisation: How to perform the optimisation to get an approximation 𝑓መ of oracle 𝑓 ∗ ? 3. Generalisation: How well does 𝑓መ perform on unseen data? From Qianxiao Li 8 Machine Learning for Smart Industry - 2. Linear regression and classification SUPERVISED LEARNING – CLASSIFICATION EXAMPLE Labelled data set: Cat Fox Dog 9 Machine Learning for Smart Industry - 2. Linear regression and classification SUPERVISED LEARNING – CLASSIFICATION EXAMPLE Learn model to map image to label: Cat Fox Dog Hypothesis space ℋ is the set of labels. Optimisation/training to obtain the model. 10 Machine Learning for Smart Industry - 2. Linear regression and classification SUPERVISED LEARNING – CLASSIFICATION EXAMPLE Image not in training set: Cat Fox Dog Generalisation: Do we get the correct answer? 11 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR MODELS Well suited to illustrate the concepts introduced so far. Baseline for more complex problems. How to measure closeness in loss function / risk minimisation? 12 Machine Learning for Smart Industry - 2. Linear regression and classification DISTANCE AND NORM 𝑥 𝑥 2 ∞ 𝑥 1 How to measure closeness / distance? Norms: Euclidian or 2-norm: 𝑥 2 = 𝑥12 + 𝑥22 + ⋯ + 𝑥𝑛2 Manhattan, taxicab or 1-norm: 𝑥 1 = 𝑥1 + 𝑥2 +…+ 𝑥𝑛 Maximum or ∞-norm: 𝑥 ∞ = max 𝑥1 , 𝑥2 , … , 𝑥𝑛 13 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – LEAST SQUARES FIT (LSQ) Hypothesis space ℋ is the space of linear functions: ℋ = 𝑓: 𝑓 𝑥 = 𝑤0 + 𝑤1 𝑥, 𝑤0 ∈ ℝ, 𝑤1 ∈ ℝ The Euclidian or 2-norm is used for the loss function: 1 min 𝑅emp 𝑓 = min 𝑁 σ𝑁 𝑖=1 𝑓 𝑥𝑖 − 𝑦𝑖 2 𝑓∈ℋ 𝑓∈ℋ Or the explicit solution for the parameters: 1 min 𝑅emp 𝑤0 , 𝑤1 = min 𝑁 σ𝑁 𝑖=1 𝑤0 + 𝑤1 𝑥𝑖 − 𝑦𝑖 2 𝑤0 ∈ℝ,𝑤1 ∈ℝ 𝑓∈ℋ Solution, see e.g. page 14 in [Qianxiao Li (2020)]. From Qianxiao Li → Explicit solution. 14 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – BASIS FUNCTIONS Hypothesis space ℋ is the space of functions where the parameters 𝑤 appear in a linear relation: ℋ = 𝑓: 𝑓 𝑥 = 𝜑𝑇 𝑤 𝑇 E.g. parameter vector 𝑤 = 𝑤0 𝑤1 and row vector with basis functions 𝜑𝑇 = 1 𝑥 𝑦1 1 𝑥1 Collect all data 𝑥𝑖 , 𝑦𝑖 𝑁 𝑖=1 in vector 𝑦 = ⋮ and matrix Φ = ⋮ ⋮ 𝑦𝑁 1 𝑥𝑁 2 Minimise Euclidian or 2-norm: 𝑅emp 𝑤 = 𝑦 − Φ𝑤 2 15 Machine Learning for Smart Industry - 2. 
Linear regression and classification LINEAR REGRESSION – BASIS FUNCTIONS 2 Minimise Euclidian or 2-norm: 𝑅emp 𝑤 = 𝑦 − Φ𝑤 2 𝜕 Derivatives should be zero: 𝑅 𝑤 =0 𝜕𝑤 emp Solution for parameter estimate: Φ 𝑇 𝑦 − Φ𝑤 ෝ =0 Φ 𝑇 𝑦 − Φ 𝑇 Φ𝑤 ෝ=0 ෝ = Φ𝑇 Φ 𝑤 −1 Φ𝑇 𝑦 With (Moore-Penrose) pseudoinverse: Φ† = Φ𝑇 Φ −1 Φ𝑇 → Note Φ† Φ = 𝐼 → The parameter vector 𝑤 ෝ is the unique and rather straightforward solution of a linear problem. 16 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – EXAMPLES ℋ = 𝑓: 𝑓 𝑥 = 𝑤0 + 𝑤1 𝑥, 𝑤0 ∈ ℝ, 𝑤1 ∈ ℝ Possible underfitting? From Qianxiao Li 17 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – EXAMPLES ℋ = 99th order polynomial Possible overfitting? From Qianxiao Li 18 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – RESIDUAL Check your fit result by evaluating a plot of the residual 𝑓መ 𝑥𝑖 − 𝑓 ∗ 𝑥𝑖 = 𝑓መ 𝑥𝑖 − 𝑦𝑖 E.g. 𝑤 ෝ0 + 𝑤 ෝ1 𝑥𝑖 − 𝑦𝑖 Random and magnitude small and acceptable in view of expected (measurement) noise? → Okay Showing some structure as function of 𝑥𝑖 ? → Consider more complex hypothesis space. → Remove possible outliers. 19 Machine Learning for Smart Industry - 2. Linear regression and classification WHAT IF NOT LINEAR REGRESSION / LSQ? Linearity in the parameters and 2-norm results in straightforward mathematical analysis: - Linear algebra. - A single analytical solution that is also the global minimum. What if not linear in the parameters or using a different norm? - Typically, no explicit analytical solution. - Likely multiple solutions. - Start with initial guess and iterate to better solution, i.e. smaller norm, e.g. with gradient descent → See e.g. page 16-17 in [Qianxiao Li (2020)]. → To be discussed later. 20 Machine Learning for Smart Industry - 2. Linear regression and classification NEXT WEEK Connect regression and Machine Learning. Classification. 21 Machine Learning for Smart Industry - 2. Linear regression and classification ET / MS3 / AMDA MACHINE LEARNING FOR SMART INDUSTRY 3. NONLINEAR REGRESSION USING KERNELS RONALD AARTS PREVIOUS LECTURE Linear regression (“classical” approach) Basis functions Matrix solution for parameter vector 2 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION AND Linear regression recap Regularisation CLASSIFICATION Nonlinear Connection to Machine Learning Classification Kernel functions 3 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – RECAP Hypothesis space ℋ is the space of functions where the parameters 𝑤 appear in a linear relation: ℋ = 𝑓: 𝑓 𝑥 = 𝜑𝑇 𝑤 with basis functions or feature maps 𝜑𝑇 = 𝜑(𝑥)𝑇. 𝑦1 𝜑 𝑥1 𝑇 𝑁 Collect all data 𝑥𝑖 , 𝑦𝑖 𝑖=1 in vector 𝑦 = ⋮ and matrix Φ = ⋮. 𝑦𝑁 𝜑 𝑥𝑁 𝑇 2 Minimise Euclidian or 2-norm: 𝑅emp 𝑤 = 𝑦 − Φ𝑤 2 Solution for parameter estimate: ෝ = Φ𝑇 Φ 𝑤 −1 Φ𝑇 𝑦 = Φ† 𝑦 with (Moore-Penrose) pseudoinverse: Φ† = Φ𝑇 Φ −1 Φ𝑇 4 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – MORE GENERAL Linear basis hypothesis space ℋ𝑀 where 𝑀 refers to the number of parameters 𝑤 in a linear relation: ℋ𝑀 = 𝑓: 𝑓 𝑥 = σ𝑀−1 𝑗=0 𝑤𝑗 𝜑𝑗 𝑥 Input 𝑥𝑖 ∈ ℝ𝑑 can be a vector of dimension 𝑑 and 𝜑𝑗 is basis function ℝ𝑑 ⟶ ℝ. 
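Before the matrix layout that follows, here is a minimal NumPy sketch of the least-squares solution ŵ = (ΦᵀΦ)⁻¹Φᵀy recalled above, using a polynomial basis. The cubic toy data, the noise level, and the basis size M = 4 are assumptions for illustration.

```python
# Least-squares fit with a polynomial basis, solved via the pseudoinverse.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.5 * x**3 - 0.5 * x + 0.1 * rng.standard_normal(x.size)  # assumed toy data

M = 4  # number of basis functions phi_j(x) = x**j, j = 0 .. M-1
Phi = np.vander(x, N=M, increasing=True)  # design matrix, shape (N, M)

# w_hat = (Phi^T Phi)^{-1} Phi^T y, computed with the Moore-Penrose pseudoinverse.
w_hat = np.linalg.pinv(Phi) @ y

# Empirical risk (mean squared error) and the residuals to inspect for structure.
residual = Phi @ w_hat - y
print("w_hat:", w_hat)
print("R_emp:", np.mean(residual**2))
```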
𝑦1 𝜑0 𝑥1 … 𝜑𝑀−1 𝑥1 Collect all data 𝑥𝑖 , 𝑦𝑖 𝑁 𝑖=1 in vector 𝑦 = ⋮ and matrix Φ = ⋮ ⋱ ⋮ 𝑦𝑁 𝜑0 𝑥𝑁 ⋯ 𝜑𝑀−1 𝑥𝑁 2 Minimise Euclidian or 2-norm: 𝑅emp 𝑤 = 𝑦 − Φ𝑤 2 Solution for parameter estimate: ෝ = Φ𝑇 Φ 𝑤 −1 Φ𝑇 𝑦 = Φ† 𝑦 with (Moore-Penrose) pseudoinverse: Φ† = Φ𝑇 Φ −1 Φ𝑇 5 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – REGULARISATION Solution for parameter estimate: ෝ = Φ† 𝑦 𝑤 with (Moore-Penrose) pseudoinverse: Φ† = Φ𝑇 Φ −1 Φ𝑇 Involves inverse of 𝑀 × 𝑀 matrix Φ𝑇 Φ which may be (close to) singular → Pseudoinverse Φ† still exists. Possibly infinite number of solutions: ෝ 𝑢 = Φ† 𝑦 + 𝐼 − Φ† Φ 𝑢, for any 𝑢. 𝑤 Pick one solution. Smallest norm 𝑤 ෝ 𝑢 ? ෝ 0 = Φ† 𝑦 𝑤 1 2 More generally: Add regularisation term: min𝑀 𝑅emp 𝑤 = min𝑀 𝑁 𝑦 − Φ𝑤 2 + 𝜆𝐶 𝑤 𝑤∈ℝ 𝑤∈ℝ with regularisation function 𝐶: ℝ𝑀 ⟶ ℝ+ and parameter 𝜆 > 0 controlling the “strength” of the regularisation. 6 Machine Learning for Smart Industry - 2. Linear regression and classification LINEAR REGRESSION – 𝑙 2 REGULARISATION 𝑙 2 regularisation / ridge regression: 𝐶 𝑤 = 𝑤 2 1 2 2 Hence: min𝑀 𝑅emp 𝑤 = min𝑀 𝑁 𝑦 − Φ𝑤 2 +𝜆 𝑤 𝑤∈ℝ 𝑤∈ℝ Solution: ෝ = Φ𝑇 Φ + 𝜆𝐼𝑀 𝑤 −1 Φ𝑇 𝑦 with 𝑀 × 𝑀 identity matrix 𝐼𝑀 7 Machine Learning for Smart Industry - 2. Linear regression and classification NONLINEAR REGRESSION Not linear-in-the-parameters? Non-Euclidian norm? Matrix formalism doesn’t apply. No explicit solution → Iterative solution. E.g. gradient descent: Update uses local gradient of cost function 𝑅emp 𝑤 → May end in local minimum instead of global minimum: From Wikipedia From Wikipedia 8 Machine Learning for Smart Industry - 2. Linear regression and classification CONNECTION TO MACHINE LEARNING Nonlinear optimisation is quite common in machine learning. Numerical implementation available in TensorFlow: Select hypothesis space ℋ. Choose optimisation procedure, e.g. the Adam optimizer, which is “a stochastic gradient descent method”. [Kingma et al., 2014]: The method is "computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters". Check generalisation: Split data: Training data → To fit the parameters. Validation data → For unbiased evaluation of fit while tuning hyperparameters. Test data → For unbiased evaluation of final model. 9 Machine Learning for Smart Industry - 2. Linear regression and classification TENSORFLOW Some tips & tricks for using TensorFlow: Import and preprocess data, e.g. using the pandas package. Check that the format of the data matches with the requirements of TensorFlow, e.g. column versus row vectors. Consider splitting the dataset into training, validation and test data (or just two of these sets). Consider scaling of the numerical values. Define the model with tf.keras.Sequential() In the first assignments, the models are rather simple with a single layer. Compile the model with model.compile() The optimizer can be specified in this command or defined previously with e.g. tf.keras.optimizers.Adam() Train the model with model.fit() Evaluate the model: model.evaluate() Compute predictions: model.predict() 10 Machine Learning for Smart Industry - 2. Linear regression and classification CLASSIFICATION Outputs 𝑦𝑖 are discrete labels → First consider only two labels. Example: 𝑦𝑖 ∈ −1, +1 or 𝑦𝑖 ∈ 0,1 with input vector 𝑥𝑖 ∈ ℝ2 → Hypothesis space: ℋ = 𝑓: 𝑓 𝑥 = 𝑔 σ𝑀−1 𝑗=0 𝑤𝑗 𝜑𝑗 𝑥 with activation function 𝑔: ℝ → ℝ to produce binary output labels. 
Theoretically the activation involves a “hard” switch, e.g. if its continuous input value is either below or above some threshold, e.g. 𝑔 𝑧 = sign(𝑧). Alternatively, a “smooth” transition can be implemented with e.g. 𝑔 𝑧 = tanh(𝑧) → Ignore activation 𝑔(𝑧), then still linear. Otherwise, nonlinear! 11 Machine Learning for Smart Industry - 2. Linear regression and classification CLASSIFICATION Outputs 𝑦𝑖 are multiple discrete labels. Categorical data: Labels have no natural order. Example: 𝑦𝑖 ∈ "A", "B","C" with input vector 𝑥𝑖 ∈ ℝ𝑑. Can be represented with one-hot embedding, where each 𝑦𝑖 is a one-hot vector, i.e. all 0 except for one 1: 1 0 0 𝑦𝑖 = 0 , 𝑦𝑖 = 1 or 𝑦𝑖 = 0. 0 0 1 Oracle function 𝑓 ∗ 𝑥𝑖 maps vector 𝑥𝑖 to a vertex of 2-dimensional hypercube (in general 𝐾-dimensional hypercube). Hypothesis space is also 𝐾-dimensional. 12 Machine Learning for Smart Industry - 2. Linear regression and classification CLASSIFICATION 1 0 0 𝑦𝑖 is a one-hot vector, i.e. all 0 except for one 1, e.g. 0 , 1 or 0. 0 0 1 Hypothesis space is 𝐾-dimensional: 𝑀−1 ℋ = 𝑓: 𝑓 𝑥 = 𝑔 ෍ 𝑤𝑗 𝜑𝑗 𝑥 , 𝑤𝑗 ∈ ℝ𝐾 𝑗=0 with 𝐾-dimensional weight vector 𝑤𝑗 and activation function 𝑔: ℝ𝐾 → ℝ𝐾 to produce output comparable to one-hot vector. Theoretically the activation function selects (first) maximum value in input vector. Or smooth approximation like softmax function: exp 𝑧𝑘 𝑔sm 𝑧 𝑘 = σ𝑀−1 𝑙=0 exp 𝑧𝑙 Nonlinear! 13 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL FUNCTIONS 1 2 Recall LSQ with 𝑙 2 regularisation / ridge regression: min𝑀 𝑅emp 𝑤 = min𝑀 𝑁 𝑦 − Φ𝑤 2 +𝜆 𝑤 2 𝑤∈ℝ 𝑤∈ℝ With solution: ෝ = Φ𝑇 Φ + 𝜆𝐼𝑀 𝑤 −1 Φ𝑇 𝑦 𝑀 × 𝑀 inverse Prediction for new sample 𝑥 ∈ ℝ𝑑 : 𝑓መ 𝑥 = 𝜑 𝑥 𝑇 𝑤 ෝ with 𝜑 𝑥 = 𝜑0 𝑥 , … , 𝜑𝑀−1 𝑥 Alternative solution: ෝ = Φ𝑇 ΦΦ𝑇 + 𝜆𝐼𝑁 𝑤 −1 𝑦 𝑁 × 𝑁 inverse Proof: See page 26 in [Qianxiao Li (2020)]. 14 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL FUNCTIONS Alternative solution: ෝ = Φ𝑇 ΦΦ𝑇 + 𝜆𝐼𝑁 𝑤 −1 𝑦 Prediction for new sample 𝑥 ∈ ℝ𝑑 : 𝑓መ 𝑥 = 𝜑 𝑥 𝑇 𝑤 ෝ = 𝜑 𝑥 𝑇 Φ𝑇 ΦΦ𝑇 + 𝜆𝐼𝑁 −1 𝑦 Define α = ΦΦ𝑇 + 𝜆𝐼𝑁 −1 𝑦 Then 𝑓መ 𝑥 = σ𝑁 𝑇 𝑁 𝑖=1 𝛼𝑖 𝜑 𝑥 𝜑 𝑥𝑖 = σ𝑖=1 𝛼𝑖 𝑘 𝑥, 𝑥𝑖 Previously 𝑓መ 𝑥 expressed in feature maps 𝜑 𝑥. Now in kernel function 𝑘 𝑥, 𝑥′ = 𝜑 𝑥 𝑇 𝜑 𝑥′ 15 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL FUNCTIONS Kernel function 𝑘: ℝ𝑑 × ℝ𝑑 → ℝ 𝑘 𝑥, 𝑥′ = 𝜑 𝑥 𝑇 𝜑 𝑥′ Solution depends no longer on explicit feature map, but on this kernel function: 𝑓መ 𝑥 = σ𝑁 𝑁 𝑖=1 𝛼𝑖 𝑘 𝑥, 𝑥𝑖 = σ𝑖=1 𝐺 + 𝜆𝐼𝑁 −1 𝑦 𝑖 𝑘 𝑥, 𝑥𝑖 with Gram matrix 𝐺𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 16 Machine Learning for Smart Industry - 2. Linear regression and classification FEATURE MAP → KERNEL FUNCTION Example: Simple linear regression with 𝑑 = 1, 𝑀 = 2, and 𝜑0 𝑥 = 1, 𝜑1 𝑥 = 𝑥. Then kernel function 𝑘 𝑥, 𝑥′ = 𝜑 𝑥 𝑇 𝜑 𝑥′ = 1 + 𝑥𝑥′ 17 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL → FEATURE MAP Example: Kernel function with 𝑑 = 2: 𝑘 𝑥, 𝑥′ = 1 + 𝑥 𝑇 𝑥′ 2 = 1 + 𝑥1 𝑥1′ + 𝑥2 𝑥2′ 2 This can be written as: 𝑘 𝑥, 𝑥′ = 1, 2𝑥1 , 2𝑥2 , 2𝑥1 𝑥2 , 𝑥12 , 𝑥22 ∙ 1, 2𝑥′1 , 2𝑥′2 , 2𝑥′1 𝑥′2 , 𝑥′12 , 𝑥′22 Hence this kernel function corresponds to the 6-dimensional feature map φ 𝑥 = 1, 2𝑥1 , 2𝑥2 , 2𝑥1 𝑥2 , 𝑥12 , 𝑥22 18 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL FUNCTIONS Kernel function 𝑘: ℝ𝑑 × ℝ𝑑 → ℝ can be defined without explicit feature map 𝜑 𝑥. 
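The kernel ridge regression solution above, f̂(x) = Σᵢ αᵢ k(x, xᵢ) with α = (G + λI)⁻¹ y, can be coded directly; a minimal NumPy sketch with an RBF kernel follows, and the named SPD kernels are listed next. The 1-D toy data and the values of σ and λ are assumptions for illustration.

```python
# Kernel ridge regression with an RBF (Gaussian) kernel, directly from the formulas.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-3, 3, 40))                  # assumed 1-D toy data
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(40)

sigma, lam = 0.5, 1e-2  # assumed hyperparameters: kernel width and regularisation

def rbf_kernel(a, b):
    """k(x, x') = exp(-|x - x'|^2 / (2 sigma^2)), evaluated for all pairs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))

# Gram matrix G_ij = k(x_i, x_j) and coefficients alpha = (G + lambda I)^{-1} y.
G = rbf_kernel(x_train, x_train)
alpha = np.linalg.solve(G + lam * np.eye(len(x_train)), y_train)

# Prediction for new inputs: f_hat(x) = sum_i alpha_i k(x, x_i).
x_new = np.linspace(-3, 3, 200)
f_hat = rbf_kernel(x_new, x_train) @ alpha
```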
Symmetric Positive Definite (SPD) kernel examples: Linear: 𝑘 𝑥, 𝑥′ = 𝑥 𝑇 𝑥′ Polynomial: 𝑘 𝑥, 𝑥′ = 1 + 𝑥 𝑇 𝑥′ 𝑚 (𝑚 > 0) 𝑥−𝑥′ 2 Gaussian / radial basis function (RBF): 𝑘 𝑥, 𝑥′ = exp − (𝜎 > 0) 2𝜎 2 19 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL RIDGE REGRESSION Hypothesis space for SPD kernel: ∞ ∞ ℋ = 𝑓: 𝑓 𝑥 = ෍ 𝑤𝑗 𝜑𝑗 𝑥 , 𝑤𝑗 ∈ ℝ with 𝑘 𝑥, 𝑥 ′ = ෍ 𝜑𝑗 𝑥 𝜑𝑗 𝑥′ 𝑗=0 𝑗=0 Solution of regularised empirical risk minimisation: 𝑓መ 𝑥 = σ𝑁 𝑖=1 𝐺 + 𝜆𝐼𝑁 −1 𝑦 𝑖 𝑘 𝑥, 𝑥𝑖 with Gram matrix 𝐺𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗 without explicit knowledge of feature maps 𝜑𝑗. 20 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL RIDGE REGRESSION Solution of regularised empirical risk minimisation: 𝑓መ 𝑥 = σ𝑁 𝑖=1 𝐺 + 𝜆𝐼𝑁 −1 𝑦 𝑖 𝑘 𝑥, 𝑥𝑖 with Gram matrix 𝐺𝑖𝑗 = 𝑘 𝑥𝑖 , 𝑥𝑗. Implementation in TensorFlow: Define the kernel function, e.g. RBF or polynomial kernel. It should depend on the hyperparameter 𝜎 (also called gamma). Another hyperparameter is the number of basis functions to be used, e.g. for the RBF kernel the number of centres. Consider splitting the dataset into training, validation and test data (or just two of these sets). Consider scaling of the numerical values. Transform the data according to the kernel function. Etcetera. 21 Machine Learning for Smart Industry - 2. Linear regression and classification KERNEL RIDGE REGRESSION EXAMPLES From page 31 in [Qianxiao Li (2020)]: Choosing hyperparameters, e.g. the width 𝜎 of the RBF kernel. 22 Machine Learning for Smart Industry - 2. Linear regression and classification NEXT LECTURE Support vector machines (SVM). From Qianxiao Li 23 Machine Learning for Smart Industry - 2. Linear regression and classification MS3 / AMDA MACHINE LEARNING FOR SMART INDUSTRY LECTURE 04: SUPPORT VECTOR MACHINES AMEYA REGE WHAT WAS COVERED IN THE PREVIOUS LECTURE? WHAT WAS COVERED IN THE PREVIOUS LECTURE? Supervised learning Risk minimization Classification Linear regression Nonlinear regression Tensorflow Kernel functions WHAT WILL WE LEARN 1. SVM classification – Linear – Hard Margin TODAY? 2. Hyperplane 3. SVM classification – Soft Margin 4. SVM classification – using kernels 5. SVM regression WHAT IS A SUPPORT VECTOR MACHINE? WHAT IS A SUPPORT VECTOR MACHINE? A Support Vector Machine (SVM) is a powerful machine learning algorithm widely used for both linear and nonlinear classification, as well as regression It is also good for outlier detection tasks Applications of SVMs Text classification, Image classification, Spam detection, Face detection, Anomaly detection SVM CLASSIFICATION The primary objective of the SVM algorithm is to identify the optimal hyperplane in an N-dimensional space that can effectively separate data points into different classes in the feature space. The algorithm ensures that the margin between the closest points of different classes, known as support vectors, is maximized. x2 How to choose the best line? x1 HYPERPLANE Hyperplane is a boundary which classifies the data set (e.g., spam vs. ham) In 2-d  line, 3-d  plane In N-dimensional space  a flat affine subspace of dimension N-1 x2 x1 HYPERPLANE Best hyperplane? x2 One that maximizes the separation margin between the two classes Hard margin It is the maximal-margin hyperplane selected based on maximizing the distance between the hyperplane and the nearest data point on each side What has the hyperplane got to do with SVM? Why support vector machine? 
x1 HYPERPLANE SOME MATH support vectors x2 Support vectors are datapoints closest to the hyperplane x1 HYPERPLANE SOME MATH Equation of the linear hyperplane 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = 0 𝒘𝒘  normal vector to the hyperplane 𝑏𝑏  determines the offset of the hyperplane from origin along 𝑤𝑤 | 𝑤𝑤 | 𝑥𝑥𝑤𝑤 = 𝑥𝑥 cos 𝜃𝜃 = 𝑙𝑙 𝒘𝒘 𝑤𝑤 𝑥𝑥 cos 𝜃𝜃 = 𝑤𝑤 𝑙𝑙 𝒘𝒘 𝒙𝒙 = 𝑤𝑤 𝑙𝑙 𝑙𝑙 = −𝑏𝑏 𝒙𝒙 𝜃𝜃 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = 0 HYPERPLANE DEFINITION Let 𝒘𝒘 be non-zero 𝑑𝑑-dimensional vector and let 𝑏𝑏 be real number. The set 𝐻𝐻 = 𝒙𝒙 ∈ ℝ𝑑𝑑 ∶ 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = 0 is hyperplane. 𝒘𝒘 𝑤𝑤 is a normal, 𝑏𝑏 𝑙𝑙 = is distance to origin. | 𝒘𝒘 | 𝑙𝑙 SVM FORMULATION x2 𝑤𝑤 𝑏𝑏 | 𝑤𝑤 | 𝑑𝑑1 𝑑𝑑2 𝐻𝐻2 𝐻𝐻 𝐻𝐻1 x1 SVM FORMULATION x2 Consider the dataset given in the figure to the left, with 𝑛𝑛 points 2 𝑤𝑤 𝑥𝑥1 , 𝑦𝑦1 , 𝑥𝑥2 , 𝑦𝑦2 , … , 𝑥𝑥𝑛𝑛 , 𝑦𝑦𝑛𝑛 | 𝑤𝑤 | Here, 𝑦𝑦𝑖𝑖 are either 1 or -1  either or Remind the equation for hyperplane 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = 0 So, for the margins 𝐻𝐻2 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = 1 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = −1 𝐻𝐻 Thus, the distance between the two hyperplanes 𝐻𝐻1 2 x1 | 𝑤𝑤 | SVM FORMULATION x2 2 𝑤𝑤 | 𝑤𝑤 | Thus, for the space 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 ≥ 1 And for the space 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 ≥ −1 𝐻𝐻2 We can thus generalise 𝑦𝑦𝑖𝑖 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 − 1 ≥ 0 𝐻𝐻 𝐻𝐻1 x1 SVM FORMULATION x2 2 𝑤𝑤 | 𝑤𝑤 | Thus, we must maximise the margin 1 | 𝑤𝑤 | Thus, we must minimize | 𝑤𝑤 | 𝐻𝐻2 Condition: 𝐻𝐻 no data points between 𝐻𝐻1 and 𝐻𝐻2 𝐻𝐻1 x1 SVM FORMULATION x2 2 𝑤𝑤 | 𝑤𝑤 | Minimise 1 min | 𝑤𝑤 |  min | 𝑤𝑤 |2 2 Optimisation problem 1 min | 𝑤𝑤 |2 𝐻𝐻2 2 𝑦𝑦𝑖𝑖 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 − 1 ≥ 0 𝐻𝐻 𝐻𝐻1 x1 SUPPORT VECTOR MACHINES IS THE DATA ALWAYS LINEARLY SEPARABLE? x2 x1 SUPPORT VECTOR MACHINES IS THE DATA ALWAYS LINEARLY SEPARABLE? x2 Soft margins x1 SOFT MARGIN SVM SOFT MARGINS What happens when data is not linearly separable? In our previous case, hard margin, we went for minimizing ||𝑤𝑤|| 1 This is essentially, minimising margin 1 In the case of soft margins, we add a penalty and modify the minimisation equation to + λ ∑ penalty margin Today, we will learn a bit about the penalty called “hinge loss” SOFT MARGIN x2 x2 x1 x1 Requires violation of constraints  introduce slack variables SOFT MARGIN SLACK VARIABLE ξ=0 0 1 Some datapoints may violate only the margin Some datapoints violate the hyperplane Observation on the correct side of the hyperplane Observation on the wrong side of the hyperplane What is the cost of misclassification? SOFT MARGIN C VARIABLE Cost of misclassification is greater than or equal to the summation of all the epsilons of each data point Typically denoted by cost or C. Logic dictates: When C is large  slack variable can be large  greater misclassification / violation of margin When C is small  slack variable can be small  not many datapoints fall on the wrong side of the margin / hyperplane  fewer misclassification SOFT MARGIN Classification error C VARIABLE What role does the C variable play in our optimization condition? 
Previously Misclassified data 1 min | 𝑤𝑤 |2 x2 2 Now 𝑛𝑛 1 min | 𝑤𝑤 |2 + 𝐶𝐶 ξ𝑖𝑖 2 𝑖𝑖=1 𝑦𝑦𝑖𝑖 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 ≥ 1 − ξ𝑖𝑖 x1 SOFT MARGIN C VARIABLE Large C Small C x2 x2 x1 x1 SOFT MARGIN C VARIABLE Small C ignores outliers and hence large margin  focus on first term Large C gives importance to outliers and hence small margin (focus on second term) C ∞ enforces all outliers and hence hard margin  minimizes only second term Goal maximise margin and minimize classification error SOFT MARGIN C VARIABLE Large C Small C x2 x2 x1 x1 SOFT MARGIN Let us bring back our optimization condition for soft margin case 𝑛𝑛 1 min | 𝑤𝑤 |2 + 𝐶𝐶 ξ𝑖𝑖 𝑦𝑦𝑖𝑖 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 ≥ 1 − ξ𝑖𝑖 2 𝑖𝑖=1 The right side equation can be rewritten as 𝑦𝑦𝑖𝑖 𝑓𝑓(𝑥𝑥𝑖𝑖 ) ≥ 1 − ξ𝑖𝑖 where 𝑓𝑓 𝑥𝑥𝑖𝑖 = 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 Hinge loss Thus, ξ𝑖𝑖 = max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 𝑛𝑛 resulting in 1 min | 𝑤𝑤 |2 + 𝐶𝐶 max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 2 𝑖𝑖=1 HINGE LOSS Loss ξ𝑖𝑖 = max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = 0 and 𝑦𝑦𝑖𝑖 = 1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,1 = 1 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = 1 and 𝑦𝑦𝑖𝑖 = 1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,0 = 0 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = 0.5 and 𝑦𝑦𝑖𝑖 = 1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,0.5 = 0.5 𝑓𝑓 𝒙𝒙𝑖𝑖 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = −1 and 𝑦𝑦𝑖𝑖 = 1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,2 = 2 HINGE LOSS Loss ξ𝑖𝑖 = max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = 0 and 𝑦𝑦𝑖𝑖 = −1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,1 = 1 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = 1 and 𝑦𝑦𝑖𝑖 = −1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,0 = 0 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = 0.5 and 𝑦𝑦𝑖𝑖 = −1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,0.5 = 0.5 𝑓𝑓 𝒙𝒙𝑖𝑖 For 𝑓𝑓 𝑥𝑥𝑖𝑖 = −1 and 𝑦𝑦𝑖𝑖 = −1  max 0,1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝑥𝑥𝑖𝑖 = max 0,2 = 2 HINGE LOSS Correct classification and 𝒇𝒇 𝒙𝒙𝒊𝒊 ≥ 𝟏𝟏: 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 will always be positive and greater than 1  1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 will always be negative Loss  Loss function will always be zero Correct classification and 𝒇𝒇 𝒙𝒙𝒊𝒊 < 𝟏𝟏: 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 will always be positive but less than 1  1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 will always be positive [0,1]  Loss function will be = 1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 Incorrect classification: 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 𝑦𝑦𝑖𝑖 or 𝑓𝑓 𝒙𝒙𝑖𝑖 will be negative  1 − 𝑦𝑦𝑖𝑖 𝑓𝑓 𝒙𝒙𝑖𝑖 will always be positive and greater than 1  Loss function will always be positive and increase linearly with 𝑓𝑓 𝒙𝒙𝑖𝑖 SVM USING KERNELS SUPPORT VECTOR MACHINES USING KERNEL As known, kernel is a mathematical function used in SVM to map input data to a higher-dimensional feature space Thus, generation of a hyperplane becomes possible where one does not have linearly separable data Let us go back to our problem and plot the 1D dataset for classification SUPPORT VECTOR MACHINES USING KERNEL Now, we map the 1D data to 2D and successfully make it linearly separable 𝑦𝑦 𝑦𝑦 is a new variable created as a function of distance from origin. The nonlinear function that creates this variable is the kernel. SUPPORT VECTOR MACHINES USING KERNEL You would like to see 2D  3D for sure, or? SUPPORT VECTOR MACHINES USING KERNEL Another example – credit Wikipedia KERNELS Need for kernels? Handle non-linearly separable data by transforming the feature space  generally achieved without explicitly performing transformation (very expensive!!!) Flexible options  diverse kernels can be chosen depending on the data Feature extraction  kernels implicitly take care of feature extraction by projecting data into a space where it becomes linearly separable So, what is the kernel trick? 
KERNELS Key idea  no explicit mapping to higher dimensional space Kernel function computes the similarity between data points in the higher dimensional space without having to directly compute the coordinate of each data point in space Computationally efficient Handle large, complex data including nonlinear relationships between features SUPPORT VECTOR MACHINES KERNELS Linear kernel  no transformation necessary Input space Simple and fast 𝒙𝒙 = 𝒙𝒙1 , … , 𝒙𝒙𝑛𝑛 Data is already linearly separable Not suitable for complex, nonlinear data 𝐾𝐾 𝑥𝑥𝑖𝑖 , 𝑥𝑥𝑗𝑗 = 𝑥𝑥𝑖𝑖 𝑥𝑥𝑗𝑗 SUPPORT VECTOR MACHINES KERNELS Polynomial kernel Input space Allows more complex decision boundaries 𝒙𝒙 = 𝒙𝒙1 , … , 𝒙𝒙𝑛𝑛 Can capture interaction between features 𝑑𝑑 Suitable for nonlinear data 𝐾𝐾 𝑥𝑥𝑖𝑖 , 𝑥𝑥𝑗𝑗 = 𝛾𝛾𝑥𝑥𝑖𝑖 𝑥𝑥𝑗𝑗 + c Computationally more expensive Risk of overfitting SUPPORT VECTOR MACHINES KERNELS Radial basis function (RBF) kernel Input space Highly flexible 𝒙𝒙 = 𝒙𝒙1 , … , 𝒙𝒙𝑛𝑛 Can handle very complex nonlinear relations −𝛾𝛾| 𝑥𝑥𝑖𝑖 −𝑥𝑥𝑗𝑗 |2 𝐾𝐾 𝑥𝑥𝑖𝑖 , 𝑥𝑥𝑗𝑗 = 𝑒𝑒 Computationally expensive with large datasets SUPPORT VECTOR MACHINES KERNELS How to choose the right one? Complexity of data Computational resources available at hand Model performance Feature scaling Feature scaling is mapping the feature values of a dataset into the same range. Crucial when distances between two observations differs for non-scaled and scaled cases Two typical approaches  normalization and standardization SUPPORT VECTOR MACHINES FEATURE SCALING Normalization 𝑥𝑥 − min(𝑥𝑥) 𝑧𝑧 = Normalization maps values into [0,1] max 𝑥𝑥 − min(𝑥𝑥) Standardization 𝑥𝑥 − μ Standardization shifts feature values to have a mean of 0 𝑧𝑧 = 𝜎𝜎  Maps with standard deviation of 1 Centers the data and is more flexible to new values SVM CLASSIFICATION It is not always binary!!  Multiclass SVM One against all One against one SVM REGRESSION SVM REGRESSION 𝒘𝒘 SVM classification Given 𝑥𝑥𝑖𝑖 , 𝑦𝑦𝑖𝑖 , 𝑖𝑖 = 1, … , 𝑛𝑛, 𝑥𝑥 ∈ ℝ, 𝑦𝑦 ∈ 1, … , 𝑘𝑘 Predict class 𝑦𝑦 ∗ of a new point 𝑥𝑥 ∗ SVM regression Given 𝑥𝑥𝑖𝑖 , 𝑦𝑦𝑖𝑖 , 𝑖𝑖 = 1, … , 𝑛𝑛, 𝑥𝑥 ∈ ℝ, 𝑦𝑦 ∈ ℝ Predict function 𝑦𝑦 = 𝑓𝑓 𝑥𝑥 that describes the data set that has the maximum absolute deviation 𝜀𝜀 from all the training data 𝜀𝜀 SVM REGRESSION SVM classification 𝒘𝒘 𝐻𝐻: 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = 0 2 min 𝑤𝑤 𝑦𝑦𝑖𝑖 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 ≥ 1 SVM regression 𝑦𝑦 𝑥𝑥 = 𝑓𝑓 𝑥𝑥 = 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 2 min 𝑤𝑤 𝜀𝜀 𝑦𝑦𝑖𝑖 − 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 ≤ 𝜀𝜀 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 − 𝑦𝑦𝑖𝑖 ≤ 𝜀𝜀 SVM SOFT REGRESSION SVM soft classification misclassified 𝒘𝒘 𝐻𝐻: 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 = 0 1 min | 𝑤𝑤 |2 + 𝐶𝐶 ∑𝑛𝑛𝑖𝑖=1 ξ𝑖𝑖 2 𝑦𝑦𝑖𝑖 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 ≥ 1 − ξ𝑖𝑖 SVM soft regression outlier 𝑦𝑦 𝑥𝑥 = 𝑓𝑓 𝑥𝑥 = 𝒘𝒘 𝒙𝒙 + 𝑏𝑏 1 min | 𝑤𝑤 |2 + 𝐶𝐶 ∑𝑛𝑛𝑖𝑖=1(ξ𝑖𝑖 +ξ∗𝑖𝑖 ) 2 𝜀𝜀 𝑦𝑦𝑖𝑖 − 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 ≤ 𝜀𝜀 + ξ𝑖𝑖 𝒘𝒘 𝒙𝒙𝑖𝑖 + 𝑏𝑏 − 𝑦𝑦𝑖𝑖 ≤ 𝜀𝜀 + ξ∗𝑖𝑖 NEXT WEEK x1 w1 Outputs x2 Inputs w2 𝑥𝑥𝑖𝑖 𝑤𝑤𝑖𝑖 + 𝑏𝑏 𝜑𝜑.. activation xn wn b function bias Artificial neuron THANK YOU AMEYA REGE MS3 / AMDA [email protected] MS3 / AMDA MACHINE LEARNING FOR SMART INDUSTRY LECTURE 05: DECISION TREES AMEYA REGE WHAT WAS COVERED IN THE PREVIOUS LECTURE? WHAT WAS COVERED IN THE PREVIOUS LECTURE? Support vector machines Hyperplane SVM for classification Hard and soft margin Using kernels SVM for regression WHAT WILL WE LEARN 1. Decision trees TODAY? 2. Entropy 3. Gini impurity 4. Ensemble learning models 5. Bagging and pasting 6. Random forest 7. Boosting WHAT IS A DECISION TREE? 
TREE Terminology of a tree Roots Branches Leaves DECISION TREE A decision tree is a non-parametric supervised learning approach used for classification and regression applications It uses a flowchart like a tree structure to show the predictions that result from a series of feature-based splits Root node Internal node Internal node Leaf node Leaf node Leaf node Leaf node DECISION TREE Some terminologies Root node: initial node at the beginning of the decision tree Decision node – also called internal node: node splitting from root (or Root node other decision) node Leaf node: no further splitting possible Decision node Decision node Branch / sub-tree Pruning Leaf node Leaf node Leaf node Leaf node DECISION TREES ALGORITHM Start at the root The algorithm always begins at the top  root node which represents the entire dataset Ask the best questions Most important question or feature that splits the data into most distinct groups Branch out Depending on the answer, the data is divided into smaller subsets  creating new branches Repeat Continue asking questions and splitting the data until the final leaf nodes are reached  predicted outcome or classification DECISION TREES ADVANTAGES Simplicity and interpretability Visual representation closely mirrors human decision-making process Versatility Usable for both, classification and regression No need for features scaling Handles nonlinear relationships between features and target variables DECISION TREES DISADVANTAGES Overfitting If the depth of the tree is large, they may tend to overfit Instability Small changes to data can results in generation of a completely new tree Bias towards features with more labels Features with more labels dominate the tree structure DECISION TREES DECISION TREES ASSUMPTIONS Binary splits Typically, each node divides data into two subsets based on a single feature or condition Recursive partitioning Assumes data can be subdivided into smaller more manageable subsets  parent node – child node Feature independence / Equal importance to each feature (unless scaled to emphasize certain features) Features used for splitting nodes are independent Homogeneity Homogeneous subgroups  samples within a node are as similar as possible with regards to target value No missing values / no outliers Top-down approach (greedy) Each split is chosen to maximise information gain / minimize impurity ENTROPY ENTROPY Solid Liquid Gas Measure of disorder ENTROPY Entropy is the uncertainty in our dataset 𝑛𝑛 𝐸𝐸 𝑆𝑆 = − 𝑝𝑝𝑖𝑖,𝑘𝑘 log 2 𝑝𝑝𝑖𝑖,𝑘𝑘 Shannon’s information theory 𝑘𝑘=1 𝑝𝑝𝑖𝑖,𝑘𝑘 ≠0 where 𝑆𝑆 is the subset of the training example 𝑝𝑝𝑖𝑖,𝑘𝑘 probability of class 𝑘𝑘 instances among the training instances in the 𝑖𝑖𝑡𝑡𝑡 node Claude Shannon (1916-2001) ENTROPY IN DECISION TREES If I ask you some question …? Entropy measures the impurity of a node 32 yes A set’s entropy is zero when it contains instances of only one class 8 no Pure split means yes or no Entropy of feature 1 26 yes 6 yes 13 13 2 2 − log 2 − log 2 = 0.5665 4 no 4 no 15 15 15 15 Entropy of feature 2 3 3 2 2 − log 2 − log 2 = 0.9710 5 5 5 5 Left node has lower entropy  high purity ENTROPY IN DECISION TREES Information gain Measures the reduction of uncertainty given some feature  deciding factor for which attribute should be selected as decision node Information gain = 𝐸𝐸 𝑌𝑌 − 𝐸𝐸(𝑌𝑌|𝑋𝑋) Entropy of complete dataset Entropy of dataset given some feature ENTROPY IN DECISION TREES LET US GIVE OUR EXAMPLE A MEANING Shall we go hiking on Sunday? 
Feature 1: Energy High Low 30 yes Feature 2: Motivation No motivation 10 no Neutral High motivation ENTROPY IN DECISION TREES 30 yes HIKING EXAMPLE 10 no Shall we go hiking on Sunday? Feature 1: Energy 27 yes 3 yes High Low 1 no 9 no 3 3 1 1 𝐸𝐸 parent = − log 2 − log 2 = 0.8113 4 4 4 4 27 27 1 1 𝐸𝐸 parent|Energy = high = − log 2 − log 2 = 0.3712 28 28 28 28 3 3 9 9 𝐸𝐸 parent|Energy = low = − log 2 − log 2 = 0.9183 12 12 12 12 Weighted average of entropy of each node: 28 12 𝐸𝐸 parent Energy = 0.3712 + 0.9183 = 0.5353 40 40 Information gain = 𝐸𝐸 parent − 𝐸𝐸 parent Energy = 0.276 ENTROPY IN DECISION TREES HIKING EXAMPLE 30 yes 10 no Feature 2: Motivation No motivation 17 yes 8 yes 5 yes Neutral High motivation 1 no 2 no 7 no 𝐸𝐸 parent = 0.8113 1
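The transcript breaks off in the middle of the motivation-feature calculation. As a check on the energy-feature numbers above, here is a minimal sketch; because the per-branch yes/no counts are hard to recover from the extracted slide text, it reuses the child entropies reported in the notes (0.3712 and 0.9183) directly.

```python
# Entropy and information gain for the hiking example (Shannon entropy, base 2).
import math

def entropy(counts):
    """E(S) = -sum_k p_k log2(p_k) over the non-empty classes in `counts`."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Parent node: 30 "yes" and 10 "no" observations.
e_parent = entropy([30, 10])
print(round(e_parent, 4))           # 0.8113, as in the notes

# Child entropies for Energy = high (28 obs.) and Energy = low (12 obs.),
# taken from the notes; their weighted average gives E(parent | Energy).
e_cond = (28 / 40) * 0.3712 + (12 / 40) * 0.9183
print(round(e_cond, 4))             # 0.5353

# Information gain of splitting on Energy.
print(round(e_parent - e_cond, 3))  # ~0.276
```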