COMP9517_24T2W4_Pattern_Recognition_Part_1.pdf
Transcript
COMP9517 Computer Vision 2024 Term 2 Week 4
Dr Dong Gong
Pattern Recognition Part 1

Computer Vision
– Low-level vision: pixel-level manipulations/editing, feature extraction
– Mid-level vision: image to images (stitching, panorama image), image to 3D modelling (stereo model estimation/analysis), image to motion analysis, segmentation (non-semantic)
– High-level vision: semantic understanding/analysis, predicting what is in the image
– Higher level? Reasoning?

Introduction
Pattern recognition: the scientific discipline whose goal is to automatically recognise patterns and regularities in the data (e.g. images)
Examples
– Object recognition (e.g. image classification)
– Text classification (e.g. spam/non-spam emails)
– Speech recognition (e.g. automatic subtitling)
– Event detection (e.g. in surveillance)
– Recommender systems (e.g. in webshops)

Pattern Recognition Categories
Based on different learning paradigms:
– Supervised learning: learning patterns in a set of data with available labels (ground truth)
– Unsupervised learning: finding patterns in a set of data without any labels available
– Semi-supervised learning: using a combination of labelled and unlabelled data to learn patterns
– Weakly supervised learning: using noisy/limited/imprecise supervision signals in learning of patterns

Applications in Computer Vision
– Making decisions about image content
– Classifying objects in an image
– Recognising activities

Applications in Computer Vision
– Character recognition
– Human activity recognition
– Face detection/recognition
– Image-based medical diagnosis
– Biometric authentication

The 3 major international CV conferences are CVPR, ICCV, ECCV, and there are others
Top machine learning conferences with CV research: NeurIPS, ICML, ICLR, …
https://csrankings.org/#/index?vision&mlmining&australasia

Pattern Recognition Overview

Pattern Recognition (First Lecture)
Pattern recognition concepts
‒ Definition and description of basic terminology
‒ Recap of feature extraction and representation
Supervised learning for classification
‒ Nearest class mean classification
‒ K-nearest neighbours classification
‒ Bayesian decision theory and classification
‒ Decision trees for classification
‒ Ensemble learning and random forests

Pattern Recognition (Second Lecture)
Supervised learning for classification
‒ Simple linear classifiers
‒ Support vector machines
‒ Multiclass classification
‒ Classification performance evaluation
Supervised learning for regression
‒ Linear regression
‒ Least-squares regression
‒ Regression performance evaluation

Pattern Recognition Concepts
Objects are (identifiable) physical entities of which images are taken
Regions (ideally) correspond to objects after image segmentation
Classes are disjoint subsets of objects sharing common features
Labels are associated with objects and indicate to which class they belong
Classification is the process of assigning labels to objects based on features
Classifiers are algorithms/methods performing the classification task
Patterns are regularities in object features and are used by classifiers
(Diagram: image of a man → features → classifier → “Bob”)

Pattern Recognition Systems
Basic stages involved in the design of a classification system:
image acquisition → pre-processing → feature extraction → feature selection → learning system → system evaluation (test)

More Pattern Recognition Concepts
Pre-processing aims to enhance images for further processing
Feature extraction reduces the data by measuring certain properties
Feature descriptors represent scalar properties of objects
Feature vectors capture all the properties measured from the data
Feature selection aims to keep only the most descriptive features
Models are (mathematical or statistical) descriptions of classes
Training samples are objects with known labels used to build models
Cost is the consequence of making an incorrect decision/assignment
Decision boundary is the demarcation between regions in feature space

Pattern Recognition Example
Recall: feature representation (salmon or sea bass?)
(Figures: the two fish classes plotted in a 1D feature space and in a 2D feature space)

Feature Vector Representation
Vector $x = [x_1, x_2, \dots, x_d]$ where each $x_j$ is a feature
‒ Object measurement
‒ Count of object parts
‒ Colour of the object
‒ …
Features represent knowledge about the object and go by other names such as predictors, descriptors, covariates, independent variables, …
Feature vector examples
‒ For fish recognition: [length, colour, lightness, …]
‒ For letter/digit recognition: [holes, moments, SIFT, …]

Feature Extraction
Characterise objects by measurements that are
– Similar for objects in the same class/category
– Different for objects in different classes
Use distinguishing features
– Invariant to object position (translation)
– Invariant to object orientation (rotation)
– Invariant to … (depends on the application)
– Good examples are shape, colour, texture
Design of features is often based on prior experience or intuition

Feature Extraction
Select features that are robust to
– Rigid transformations (translation and rotation)
– Occlusions and other 3D-to-2D projective distortions
– Non-rigid/articulated object deformations (e.g. fingers around a cup)
– Variations in illumination and shadows
Feature selection is problem- and domain-dependent and requires domain knowledge
Classification techniques can help to make feature values less noise sensitive and to select valuable features out of a larger set
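Not from the slides: a minimal sketch of what a hand-crafted feature vector such as [length, width, lightness] could look like in code, assuming a binary object mask and a matching grayscale image are already available; the inputs and the feature choices are illustrative only.

```python
import numpy as np

def fish_features(mask: np.ndarray, gray: np.ndarray) -> np.ndarray:
    """Toy feature vector [length, width, lightness] from a binary object
    mask and the corresponding grayscale image (same shape, illustrative)."""
    ys, xs = np.nonzero(mask)             # pixel coordinates of the object
    length = xs.max() - xs.min() + 1      # horizontal extent of the bounding box
    width = ys.max() - ys.min() + 1       # vertical extent of the bounding box
    lightness = gray[mask > 0].mean()     # average intensity inside the object
    return np.array([length, width, lightness], dtype=float)

# Example: a 3x5 rectangular "fish" in a 10x10 image
mask = np.zeros((10, 10), dtype=np.uint8)
mask[4:7, 2:7] = 1
gray = np.full((10, 10), 0.5)
print(fish_features(mask, gray))          # length=5, width=3, lightness=0.5
```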
Supervised Learning Overview
Feature space $X$, label space $Y$
Training set $\{(x_i, y_i)\} \subset X \times Y$
Functions $f \in F: X \to Y$
Learning: find $\hat{f}$ such that $y_i \approx \hat{f}(x_i)$
Model $\hat{f}$; prediction for a test sample $x$: $\hat{y} = \hat{f}(x)$

Pattern Recognition Models
Generative models
– Model the “mechanism” by which the data was generated
– Represent each class by a probabilistic “model” $p(x \mid y)$ and $p(y)$
– Obtain the joint probability of the data as $p(x, y) = p(x \mid y)\, p(y)$
– Find the decision boundary implicitly via the most likely $p(y \mid x)$
– Applicable to unsupervised learning tasks (unlabelled data)
Discriminative models
– Focus on explicit modelling of the decision boundary
– Applicable to supervised learning tasks (labelled data)

Classification
A classifier performs object recognition by assigning a class label to an object, using the object description in the form of features, e.g. $x \to y = \text{“salmon”}$
Perfect classification is often impossible; instead determine the probability for each possible class, e.g. $x \to (p_{\text{salmon}} = 0.7,\ p_{\text{bass}} = 0.3)$
Difficulties are caused by variability in feature values for objects in the same class versus objects in different classes
– Variability may arise due to complexity but also due to noise
– Noisy features and missing features are major issues
– It is not always possible to determine the values of all features for an object

Binary Classification
Given a training set of $N$ observations $(x_i, y_i)$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, classification is the problem of estimating $y_i = f(x_i)$

Nearest Class Mean Classifier
Based on the minimum distance principle, where the class exemplars are just the class centroids (or means)
Training: given training sample pairs $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, the centroid for each class $k$ is obtained as
$\mu_k = \frac{1}{|c_k|} \sum_{x_i \in c_k} x_i$
Testing: each unknown object with feature vector $x$ is classified as class $k$ if $x$ is closer to the centroid of class $k$ than to any other class centroid

Nearest Class Mean Classifier
Compute the Euclidean distance (or another distance metric) between feature vector $x$ and the centroid of each class
Choose the closest class, if it is close enough (reject otherwise)
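A minimal NumPy sketch of the nearest class mean classifier described above: the centroids are the per-class means of the training features, and a test sample takes the label of the closest centroid. The optional rejection threshold is an added assumption, not something the slides specify.

```python
import numpy as np

def ncm_fit(X, y):
    """Compute one centroid (mean feature vector) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def ncm_predict(x, classes, centroids, reject_dist=None):
    """Assign x to the class with the closest centroid (Euclidean distance)."""
    d = np.linalg.norm(centroids - x, axis=1)
    k = int(np.argmin(d))
    if reject_dist is not None and d[k] > reject_dist:
        return None                      # "reject otherwise"
    return classes[k]

# Toy 2D example: two compact, well-separated classes
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array(["salmon", "salmon", "sea bass", "sea bass"])
classes, centroids = ncm_fit(X, y)
print(ncm_predict(np.array([1.1, 0.9]), classes, centroids))   # salmon
```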
Nearest Class Mean Classifier
Class 2 has two modes: where is its centroid, if one centroid is used?
If modes are detected, two subclass centroids can be used

Nearest Class Mean Classifier
Pros
✓ Simple
✓ Fast
✓ Works well when classes are compact and far from each other
Cons
✗ Poor results for complex classes (multimodal, non-spherical)
✗ Cannot handle outliers and noisy data well

K-Nearest Neighbours Classifier
K-NN is a classifier that decides the class label for a sample based on the K nearest samples in the data set
(Figure: decision regions of the nearest class mean classifier versus KNN)

K-Nearest Neighbours Classifier
For every new test sample, the distances between the test sample and all training samples are computed, and the K nearest training samples are used to decide the class label of the test sample
The sample will be assigned to the class which has the majority of members in the neighbourhood
The neighbours are selected from a set of training samples for which the class is known

K-Nearest Neighbours Classifier
Euclidean distance is commonly used for continuous variables
Hamming distance is commonly used for discrete variables
https://towardsdatascience.com/knn-using-scikit-learn-c6bed765be75
(Hamming distance image credit: https://medium.com/geekculture/total-hamming-distance-problem-1b74decd71c9)

K-Nearest Neighbours Classifier
Pros
✓ Very simple and intuitive
✓ Easy to implement
✓ No a priori assumptions
✓ No training step
✓ Decision surfaces are non-linear
Cons
✗ Slow algorithm for big data sets
✗ Needs homogeneous (similar nature) feature types and scales
✗ Does not perform well when the number of variables grows (curse of dimensionality)
✗ Finding the optimal K (number of neighbours) to use can be challenging

K-Nearest Neighbours: Applications
Automated MS-lesion segmentation by KNN
Manually labelled image used as training set
Features used: intensity and voxel locations (x, y, z coordinates)
https://www.midasjournal.org/browse/publication/610
Sun, Yiyou, et al. “Out-of-distribution detection with deep nearest neighbors.” International Conference on Machine Learning (ICML), PMLR, 2022.
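A small sketch of K-NN classification as described above, using scikit-learn's KNeighborsClassifier; the toy feature values and the choice K = 3 are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: [length, lightness] features with known labels
# (note: in practice features should be on comparable scales, see the cons above)
X_train = np.array([[60, 0.20], [65, 0.25], [70, 0.30],    # salmon
                    [90, 0.60], [95, 0.55], [100, 0.65]])  # sea bass
y_train = ["salmon", "salmon", "salmon", "sea bass", "sea bass", "sea bass"]

# K = 3 nearest neighbours, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Each test sample gets the majority label of its 3 nearest training samples
print(knn.predict([[68, 0.28], [92, 0.60]]))   # ['salmon' 'sea bass']
```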
Bayesian Decision Theory
Classifier decisions may or may not be correct, so they should be probabilistic (soft decisions rather than hard decisions)
Probability distributions may be used to make classification decisions with the least expected error rate
Introducing prior knowledge

Bayesian Decision Theory
Bayesian classification assigns an object to the class to which it most likely belongs based on observed features
Assume the following to be known:
– Prior probability $p(c_i)$ for each class $c_i$ (to be learned from data or assigned)
– Class conditional distribution $p(x \mid c_i)$
Compute the posterior probability $p(c_i \mid x)$ as follows: if all the classes are disjoint, then by Bayes’ rule the posterior probabilities are given by
$p(c_i \mid x) = \frac{p(x \mid c_i)\, p(c_i)}{\sum_j p(x \mid c_j)\, p(c_j)}$
using $p(x, c_i) = p(x \mid c_i)\, p(c_i) = p(c_i \mid x)\, p(x)$ and $p(x) = \sum_j p(x, c_j) = \sum_j p(x \mid c_j)\, p(c_j)$

Bayesian Decision Theory
Given an observation $x$ for which $p(c_1 \mid x)$ is greater than $p(c_2 \mid x)$, we would naturally prefer to decide that it belongs to $c_1$
So the decision rule is $c = \arg\max_i p(c_i \mid x)$
This is equivalent to $c = \arg\max_i p(x \mid c_i)\, p(c_i)$, because
$p(c_i \mid x) = \frac{p(x \mid c_i)\, p(c_i)}{\sum_j p(x \mid c_j)\, p(c_j)} = \frac{p(x \mid c_i)\, p(c_i)}{p(x)} \propto p(x \mid c_i)\, p(c_i)$

Bayesian Decision Theory
For a given $x$, the probability of error is
$p(\text{error} \mid x) = p(c_2 \mid x)$ if we decide $c_1$, and $p(c_1 \mid x)$ if we decide $c_2$
We can minimise the probability of error by deciding $c_1$ if $p(c_1 \mid x) > p(c_2 \mid x)$, and $c_2$ if $p(c_2 \mid x) > p(c_1 \mid x)$
This is called the Bayes decision rule

Bayesian Decision Rule: Example
Suppose we want to classify fish type: salmon, sea bass, other
From past experience we already know the probability of each class:

  p(c_i)   Salmon   Sea bass   Other
  Prior    0.3      0.6        0.1

If we were to decide based only on the prior, our best bet would be to always classify as sea bass
– This is called the decision rule based on the prior
– It never predicts the other classes
– It can behave very poorly

Bayesian Decision Rule: Example
Let us use some feature, e.g. length, to add more information
From experience we know the class conditionals for length:

  p(x | c_i)                 Salmon   Sea bass   Other
  length > 100 cm            0.5      0.3        0.2
  50 cm < length < 100 cm    0.4      0.5        0.2
  length < 50 cm             0.1      0.2        0.6

Now we can estimate the posterior probability for each class: $p(c_i \mid x) \propto p(x \mid c_i)\, p(c_i)$

Bayesian Decision Rule: Example
If we have a fish with length 70 cm, what would be our prediction?
$p(\text{salmon} \mid x = 70\,\text{cm}) \propto p(70\,\text{cm} \mid \text{salmon}) \cdot p(\text{salmon}) = 0.4 \cdot 0.3 = 0.12$
$p(\text{sea bass} \mid x = 70\,\text{cm}) \propto p(70\,\text{cm} \mid \text{sea bass}) \cdot p(\text{sea bass}) = 0.5 \cdot 0.6 = 0.30$
$p(\text{other} \mid x = 70\,\text{cm}) \propto p(70\,\text{cm} \mid \text{other}) \cdot p(\text{other}) = 0.2 \cdot 0.1 = 0.02$
Based on these probabilities, we predict the type as sea bass
Question: what if $p(70\,\text{cm} \mid \text{salmon}) = p(70\,\text{cm} \mid \text{sea bass}) = 0.5$? What is the effect of the prior $p(c_i)$?
Question: if the price of salmon is twice that of sea bass, and sea bass is also more expensive than the other types of fish, is the cost of a wrong decision the same for any misclassification?
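A short sketch reproducing the posterior calculation for the 70 cm fish, with the priors and class conditionals taken from the tables above.

```python
# Bayes decision rule for the fish example: p(c | x) ∝ p(x | c) p(c)
priors = {"salmon": 0.3, "sea bass": 0.6, "other": 0.1}
# Class conditionals p(50 cm < length < 100 cm | c), from the table above
likelihood = {"salmon": 0.4, "sea bass": 0.5, "other": 0.2}

# Unnormalised posteriors (enough for the arg-max decision)
scores = {c: round(likelihood[c] * priors[c], 4) for c in priors}
print(scores)                         # {'salmon': 0.12, 'sea bass': 0.3, 'other': 0.02}

# Normalising by p(x) = sum_c p(x | c) p(c) gives proper posteriors
p_x = sum(scores.values())
posteriors = {c: s / p_x for c, s in scores.items()}
print(max(posteriors, key=posteriors.get))   # sea bass
```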
Bayesian Decision Risk
If we only care about the classification accuracy (the prices of all types of fish are the same), then we can make the decision by maximising the posterior probability
If the prices are not the same, we minimise the loss
– Loss is the cost of an action $\alpha_i$ when the true class is $c_j$: $\lambda(\alpha_i \mid c_j)$ (not a probability)
– The expected loss associated with action $\alpha_i$ is $R(\alpha_i \mid x) = \sum_j \lambda(\alpha_i \mid c_j)\, p(c_j \mid x)$
– $R(\alpha_i \mid x)$ is also called the conditional risk
– An optimal Bayes decision strategy is to minimise the conditional risk

Bayesian Decision Risk: Example
Suppose we have the following loss function $\lambda(\alpha_i \mid c_j)$:

  λ(α_i | c_j)             Salmon   Sea bass   Other
  If predicted salmon      0        2          3
  If predicted sea bass    10       0          4
  If predicted other       20       7          0

R(sell as salmon | 50 < x < 100)
  = λ(sell as salmon | salmon) · p(salmon | 50 < x < 100)
  + λ(sell as salmon | sea bass) · p(sea bass | 50 < x < 100)
  + λ(sell as salmon | other) · p(other | 50 < x < 100)
  ∝ 0 · p(50 < x < 100 | salmon) · p(salmon) + 2 · p(50 < x < 100 | sea bass) · p(sea bass) + 3 · p(50 < x < 100 | other) · p(other)
  = 0 · 0.4 · 0.3 + 2 · 0.5 · 0.6 + 3 · 0.2 · 0.1 = 0.66

Bayesian Decision Risk: Example
R(sell as salmon | 50 < x < 100) ∝ 0.66
R(sell as sea bass | 50 < x < 100) ∝ 1.28
R(sell as other | 50 < x < 100) ∝ 4.5
So, if the length of the fish is in the range [50, 100], the loss is minimised if the type is predicted as salmon
Is the loss function in the example good? How should the loss function be defined?

Bayesian Decision Theory
Estimating the probabilities
– Values of $p(x \mid c_i)$ and $p(c_i)$ can be computed empirically from the samples
– If we know that the distributions follow a parametric model (defining the model), we may estimate the model parameters using the samples (learning the parameters)
Example
– Suppose class $i$ can be described by a normal distribution whose covariance matrix $\Sigma_i$ is known but whose mean $\mu_i$ is unknown
– An estimate of the mean may then be the average of the labelled samples available in the training set: $\hat{\mu}_i = \bar{x}$

Bayesian Decision Rule Classifier
Pros
✓ Simple and intuitive
✓ Considers uncertainties
✓ Permits combining new information with current knowledge
Cons
✗ Computationally expensive
✗ Choice of priors can be subjective

Decision Trees: Introduction
Most pattern recognition methods address problems where feature vectors are real-valued and there exists some notion of a metric
Some classification problems involve nominal data, with discrete descriptors and without a natural notion of similarity or ordering
– Example: {high, medium, low}, {red, green, blue}
– Nominal data are also called categorical data
Nominal data can be classified using rule-based methods
Continuous values can also be handled with rule-based methods

Decision Trees: Example
Suppose we have only two types of fish, salmon and sea bass, and assume we have only two features:
– Length ($x_1$)
– Width ($x_2$)
We classify fish based on these two features
(Figure: scatter plot of the training samples in the length/width feature space, with split thresholds at $x_1 = 80, 110$ and $x_2 = 10, 30$)
Decision tree (leaf counts are training samples):
$x_1 > 80$? no → $x_2 > 10$? no → 1 salmon, 0 sea bass; yes → 0 salmon, 4 sea bass
$x_1 > 80$? yes → $x_2 > 30$? yes → 3 salmon, 0 sea bass; no → $x_1 > 110$? no → 1 salmon, 5 sea bass; yes → 7 salmon, 0 sea bass

Decision Trees: Example
For any new sample, start from the top of the tree, answer the questions, follow the appropriate path to the bottom, and then decide the label
$x_1 > 80$? no → $x_2 > 10$? no → salmon; yes → sea bass
$x_1 > 80$? yes → $x_2 > 30$? yes → salmon; no → $x_1 > 110$? yes → salmon; no → sea bass
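The example tree above written out as nested rules; a sketch only, with the thresholds (80, 110, 10, 30) read off the figure.

```python
def classify_fish(length: float, width: float) -> str:
    """Decision tree from the example: x1 = length, x2 = width."""
    if length <= 80:
        return "salmon" if width <= 10 else "sea bass"
    if width > 30:
        return "salmon"
    return "salmon" if length > 110 else "sea bass"

print(classify_fish(length=70, width=8))     # salmon
print(classify_fish(length=95, width=20))    # sea bass
print(classify_fish(length=120, width=20))   # salmon
```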
Decision Trees: Summary
Approach
– Classify a sample through a sequence of questions
– The next question asked depends on the answer to the current question
– The sequence of questions is displayed in a directed decision tree, or simply a tree
Structure
– Nodes in the tree represent features
– Leaf nodes contain the class labels
– One feature (or a few) is used at a time to split the search space
– Each branching node has one child for each possible value of the parent feature
Classification
– Begins at the root node and follows the appropriate links to a leaf node
– Assigns the class label of the leaf node to the test sample

Decision Trees Construction
Binary decision tree: a binary tree structure that has a decision function associated with each node
Simple case: numeric feature values, and a decision function that selects the left/right branch if the value of a feature is below/above a threshold
Advantage: each node uses only one feature and one threshold value
For any given set of training samples, there may be more than one possible decision tree to classify them, depending on the feature order
We must select the features that give the ‘best’ tree based on some criterion
Computationally, the smallest tree is preferred

Decision Trees Construction
Algorithm
1. Select a feature to place at the node (the first one is the root)
2. Make one branch for each possible value (nominal) or range (numerical)
3. For each branch node, repeat steps 1 and 2 using only those samples that actually reach the branch
4. When all samples at a node have the same classification (or label), stop growing that part of the tree

Decision Trees Construction
How to determine which feature to split on?
– One way is to use measures from information theory
– Entropy and information gain

Constructing Optimal Decision Tree
To construct optimal decision trees from training data we must define ‘optimal’
One simple criterion based on information theory is entropy
The entropy of a set of events $y = \{y_1, y_2, \dots, y_n\}$ is
$H(y) = \sum_{i=1}^{n} -p(y_i) \log_2 p(y_i)$
where $p(y_i)$ is the probability of event $y_i$
(Image: https://researchoutreach.org/wp-content/uploads/2019/02/Biro-2nd-Law.jpg)
Entropy may be viewed as the average uncertainty of the information source
– If the source information has no uncertainty: $H = 0$
– If the source information is uncertain: $H > 0$

Decision Trees: Entropy
Example of entropy computations: let us look at the fish example

  p(c_i)   Salmon   Sea bass   Other
  Prior    0.3      0.6        0.1

The entropy (uncertainty) of the information (type of fish) is
$H = -(0.3 \log_2 0.3 + 0.6 \log_2 0.6 + 0.1 \log_2 0.1) = 1.29$
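A short sketch of the entropy computation above, using the base-2 logarithm as on the slides.

```python
import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i); zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fish priors from the example above: salmon 0.3, sea bass 0.6, other 0.1
print(entropy([0.3, 0.6, 0.1]))   # ≈ 1.295 (reported as 1.29 on the slide)
print(entropy([1.0]))             # -0.0, i.e. no uncertainty (H = 0)
```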
Decision Trees: Information Gain
Information gain is an entropy-based measure that guides which feature to use for producing optimal decision trees: it measures the information gained by selecting a specific feature while building the decision tree
If $S$ is a set of training samples and $f$ is one feature of the samples, then the information gain with respect to feature $f$ is
$IG(S, f) = H(S) - H(S \mid f) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(f)} \frac{|S_v|}{|S|} \mathrm{Entropy}(S_v)$
Use the feature with the highest information gain to split on

Decision Trees: Information Gain Example
Prior probabilities can be estimated using the frequency of events in the training data
Let us look again at the fish example with two features, length and width, but for the sake of example assume three categories for each:
$x_1 \in \{\text{Small}, \text{Medium}, \text{Large}\}$
$x_2 \in \{\text{Small}, \text{Medium}, \text{Large}\}$
The table lists 15 observations/samples from the two classes salmon and sea bass (#Salmon = 6, #Sea bass = 9):

  x1   x2   Type
  S    S    Salmon
  M    S    Salmon
  M    S    Salmon
  S    M    Sea bass
  S    L    Sea bass
  S    M    Sea bass
  M    M    Sea bass
  M    L    Sea bass
  L    M    Salmon
  L    M    Salmon
  L    L    Salmon
  S    L    Sea bass
  M    L    Sea bass
  M    M    Sea bass
  M    L    Sea bass

Decision Trees: Information Gain Example
Before selecting the first feature we need to know the base entropy:
$p(\text{salmon}) = 6/15 = 0.4$, $p(\text{sea bass}) = 9/15 = 0.6$
$H_{\text{base}} = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 = 0.97$
To estimate $IG(S, x_1)$ we need the frequency table for $x_1$ to compute $H(S \mid x_1)$:

  x1   Salmon   Sea bass   Total
  S    1        4          5
  M    2        5          7
  L    3        0          3
                           15

Decision Trees: Information Gain Example
$H(S \mid x_1) = \frac{5}{15} H(1,4) + \frac{7}{15} H(2,5) + \frac{3}{15} H(3,0) = 0.64$
$IG(S, x_1) = H_{\text{base}} - H(S \mid x_1) = 0.97 - 0.64 = 0.33$

  x2   Salmon   Sea bass   Total
  S    3        0          3
  M    2        4          6
  L    1        5          6
                           15

$H(S \mid x_2) = \frac{3}{15} H(3,0) + \frac{6}{15} H(2,4) + \frac{6}{15} H(1,5) = 0.62$
$IG(S, x_2) = H_{\text{base}} - H(S \mid x_2) = 0.97 - 0.62 = 0.35$
Since $IG(S, x_2) > IG(S, x_1)$, the decision tree starts with splitting on $x_2$
Divide the dataset by its branches and repeat the same process on every branch
A branch with entropy greater than 0 needs further splitting

Decision Trees: Example
Splitting on $x_2$ first and then on $x_1$ gives the following leaf counts (#salmon, #sea bass) and labels:
$x_2 = S$: $x_1 = S$ → (1, 0) salmon; $x_1 = M$ → (2, 0) salmon; $x_1 = L$ → (0, 0)
$x_2 = M$: $x_1 = S$ → (0, 2) sea bass; $x_1 = M$ → (0, 2) sea bass; $x_1 = L$ → (2, 0) salmon
$x_2 = L$: $x_1 = S$ → (0, 2) sea bass; $x_1 = M$ → (0, 3) sea bass; $x_1 = L$ → (1, 0) salmon
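A sketch that recomputes the information-gain numbers above from the per-feature frequency tables; the exact values differ slightly from the slides, which round the intermediate entropies.

```python
import math

def entropy_counts(*counts):
    """Entropy of a class distribution given per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

h_base = entropy_counts(6, 9)      # 6 salmon, 9 sea bass -> ≈ 0.971

# Per-value (salmon, sea bass) counts for each feature, from the tables above
x1_counts = {"S": (1, 4), "M": (2, 5), "L": (3, 0)}
x2_counts = {"S": (3, 0), "M": (2, 4), "L": (1, 5)}

def info_gain(split_counts, n_total=15):
    h_cond = sum(sum(c) / n_total * entropy_counts(*c) for c in split_counts.values())
    return h_base - h_cond

print(round(info_gain(x1_counts), 3))   # 0.328 (≈ 0.33 on the slide)
print(round(info_gain(x2_counts), 3))   # 0.344 (≈ 0.35 on the slide); x2 has the higher gain
```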
Decision Trees Classifier
Pros
✓ Easy to interpret
✓ Can handle both numerical and categorical data
✓ Robust to outliers and missing values
✓ Gives information on the importance of features (feature selection)
Cons
✗ Tends to overfit
✗ Only axis-aligned splits
✗ Greedy algorithm (may not find the best tree)

Ensemble Learning
Ensemble learning combines multiple models to improve the predictive performance compared to that obtained from any of the constituent models
Multiple models can be created using
– Different classifiers/learning algorithms
– Different parameters for the same algorithm
– Different training examples
– Different feature sets

Random Forests
Random forests is an ensemble learning method
It constructs an ensemble of decision trees by training
It outputs the majority of all class outputs by the individual trees
It overcomes the decision trees’ habit of overfitting
https://en.wikipedia.org/wiki/Random_forest

Random Forests: Breiman’s algorithm
Training
– Let $N$ be the number of training instances and $M$ the number of features
– Sample $N$ instances at random with replacement from the original data (bagging)
– At each node select $m \ll M$ features at random and split on the best feature (“feature bagging”)
– Grow each tree to the largest extent possible (no pruning)
– Repeat $B$ times, keeping the value of $m$ constant during the forest growing
Testing
– Push a new sample down a tree and assign the label of the terminal node it ends up in
– Iterate over all trees in the ensemble to get $B$ predictions for the sample
– Report the majority vote of all trees as the random forest prediction

Random Forests
The random forest error rate depends on two factors:
1. Correlation between any two trees in the forest
   – Increased correlation increases the forest error rate; uncorrelated trees lead to better generalisation
2. Strength of each individual tree in the forest
   – A strong tree has a low error rate
   – Increasing the strength of the individual trees decreases the forest error rate
Selecting parameter $m$
– Reducing $m$ reduces both the correlation and the strength
– Increasing $m$ increases both the correlation and the strength
– Somewhere in between is an “optimal” range of values for $m$

Random Forests
Pros
✓ High accuracy among traditional classification algorithms for many problems
✓ Works efficiently and effectively on large datasets
✓ Handles thousands of input features without feature selection
✓ Handles missing values effectively
Cons
✗ Less interpretable than an individual decision tree
✗ More complex and more time-consuming to construct than decision trees

Random Forests: An Application
Random forests for predicting Alzheimer’s disease
Features: cortical thickness, Jacobian maps, sulcal depth
https://doi.org/10.1016/j.nicl.2014.08.023

References and Acknowledgements
Shapiro & Stockman, Chapter 4
Duda, Hart & Stork, Chapters 1, 2.1
Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Chapters 2 and 12
More references
– Sergios Theodoridis & Konstantinos Koutroumbas, Pattern Recognition, 2009
– Ian H. Witten & Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2005
Some diagrams are extracted from the above resources

Example exam question
Which one of the following statements is correct for random forest classifiers?
A. Increasing the correlation between the individual trees decreases the random forest classification error rate.
B. Reducing the number of selected features at each node increases the correlation between the individual trees.
C. Reducing the number of selected features at each node increases the strength of the individual trees.
D. Increasing the strength of the individual trees decreases the random forest classification error rate.
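Not part of the slides: a brief sketch of the bagging plus feature-bagging procedure described in Breiman's algorithm above, using scikit-learn's RandomForestClassifier on synthetic data; the dataset and all parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a feature-vector dataset (M = 20 features)
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# B = 100 trees; each tree is grown on a bootstrap sample of the training set
# (bagging) and considers only m = sqrt(M) randomly chosen features per split
# (feature bagging). scikit-learn aggregates the trees by averaging their
# predicted class probabilities, which approximates the majority vote on the slide.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```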