
COMP9517_24T2W4_Pattern_Recognition_Part_2.pdf


Full Transcript


COMP9517 Computer Vision, 2024 Term 2, Week 4, Dr Dong Gong: Pattern Recognition Part 2

Pattern Recognition (First Lecture)
Pattern recognition concepts
– Definition and description of basic terminology
– Recap of feature extraction and representation
Supervised learning for classification
– Nearest class mean classification
– K-nearest neighbours classification
– Bayesian decision theory and classification
– Decision trees for classification
– Ensemble learning and random forests

Pattern Recognition (Second Lecture)
Supervised learning for classification
– Linear classification
– Support vector machines
– Multiclass classification
– Classification performance evaluation
Supervised learning for regression
– Linear regression
– Least-squares regression
– Regression performance evaluation

Separability
Separable classes: if a discrimination subspace exists that separates the feature space such that only objects from one class are in each region, the recognition task is said to have separable classes.
Linearly separable: if the object classes can be separated using a hyperplane as the discrimination subspace, the feature space is said to be linearly separable.
[Figure: a linearly separable versus a non-linearly separable feature space.]

Linear Classifier
Given a training set of N observations (x_i, y_i), with x_i ∈ R^d, y_i ∈ {−1, +1}, i = 1, …, N, a binary classification problem can be modeled by learning a separation function f(x) from the data such that
f(x_i) > 0 if y_i = +1 and f(x_i) < 0 if y_i = −1,
so in this approach y_i f(x_i) > 0 for every correctly classified sample.

A linear classifier has the form
f(x) = W^T x + b = w_1 x_1 + w_2 x_2 + … + w_d x_d + b,
corresponding to a line in 2D, a plane in 3D, and a hyperplane in nD. We use the training data to learn the weights W and the offset b; the x_j are the features.

Which hyperplane is the best? For generalization purposes, a large margin is preferred: a hyperplane that only barely separates the training samples generalizes poorly, whereas one with a wide margin generalizes well.

Support Vector Machines (SVMs)
Maximize the margin, that is, the distance to the closest sample. This leads to an optimization problem. The examples closest to the hyperplane are the support vectors.
The primal optimization problem for the linear SVM (the hard-margin SVM) formalizes this idea, and at test time the decision rule is simply the sign of f(x) = W^T x + b. Why this particular formulation? The preliminaries below explain.
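As a minimal illustration of the linear decision rule (not taken from the slides; the weights, offset, and samples below are made-up placeholders), the following NumPy sketch evaluates f(x) = W^T x + b and assigns labels by its sign:

```python
import numpy as np

# Hypothetical weights and offset for a 2D linear classifier f(x) = W^T x + b
W = np.array([1.5, -0.8])
b = 0.25

def decision_function(X):
    """Evaluate f(x) = W^T x + b for each row of X."""
    return X @ W + b

def predict(X):
    """Assign +1 if f(x) > 0, otherwise -1."""
    return np.where(decision_function(X) > 0, 1, -1)

# Toy samples: a correctly classified pair satisfies y_i * f(x_i) > 0
X = np.array([[2.0, 1.0], [-1.0, 3.0]])
y = np.array([1, -1])
print(predict(X))                      # predicted labels
print(y * decision_function(X) > 0)    # margin condition y_i * f(x_i) > 0
```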
Support Vector Machines: some preliminaries
A hyperplane in the (possibly high-dimensional) feature space is defined by the linear model W^T x + b = 0, and the distance from a point x to this hyperplane is |W^T x + b| / ||W||.

SVM objective
Maximize the distance from the hyperplane to the closest examples, with the positive-class and negative-class samples on opposite sides of the hyperplane. If the closest correctly classified examples satisfy y_i (W^T x_i + b) = 1, this distance is 1 / ||W||. The problem can be equivalently reformulated as the "standard" formulation of the hard-margin linear SVM:
minimize (1/2) ||W||^2 subject to y_i (W^T x_i + b) ≥ 1 for all i.
(The additional proof steps shown on the slides are not required.)

Hard-margin linear SVM
This is a quadratic programming optimization problem subject to linear constraints, hence a convex optimization problem, and it has a dual form obtained with the Lagrangian method. The hard-margin SVM does not allow any misclassification of samples.

Soft Margin Support Vector Machines
The hard-margin SVM assumes the classes are linearly separable, but what if this separability assumption does not hold? Introduce "slack" variables ξ_i to allow misclassification of instances: ξ_i is the distance of x_i to the corresponding class margin if x_i is on the wrong side of that margin, and 0 otherwise.
When the classes were linearly separable we had y_i (W^T x_i + b) ≥ 1. For data that violate this, the constraint is relaxed to y_i (W^T x_i + b) ≥ 1 − ξ_i with ξ_i ≥ 0, so the total violation over all data is Σ_i ξ_i.
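To make these definitions concrete, here is a small NumPy sketch (not from the slides; the hyperplane parameters and points are made up) that computes the point-to-hyperplane distance |W^T x + b| / ||W|| and the slack ξ_i = max(0, 1 − y_i (W^T x_i + b)) implied by the relaxed constraint:

```python
import numpy as np

# Hypothetical hyperplane parameters for W^T x + b = 0
W = np.array([2.0, -1.0])
b = -0.5

X = np.array([[1.0, 0.0], [0.5, 2.0], [0.2, 0.1]])  # toy points
y = np.array([1, -1, 1])                             # toy labels

f = X @ W + b                          # signed values of W^T x + b
dist = np.abs(f) / np.linalg.norm(W)   # distance of each point to the hyperplane

# Slack implied by y_i (W^T x_i + b) >= 1 - xi_i with xi_i >= 0:
# zero if the point is beyond its class margin, positive otherwise
xi = np.maximum(0.0, 1.0 - y * f)

print(dist)
print(xi)
```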
This total violation is a measure of how much the margin constraints are violated, and we now optimize the hard-margin objective plus a penalty on it:
minimize (1/2) ||W||^2 + C Σ_i ξ_i subject to y_i (W^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.
Soft-margin SVMs are better able to handle noisy data. A small C gives more tolerance of misclassified samples in exchange for a larger margin; a large C focuses on avoiding mistakes at the expense of a smaller margin; letting C go to infinity recovers the hard-margin SVM. This is still a quadratic programming optimization problem.

Nonlinear Support Vector Machines
To generate nonlinear decision boundaries, we can map the features into a new feature space in which the classes are linearly separable and then apply the SVM there. The mapping into a higher-dimensional space can be done implicitly using a kernel function, which reduces the complexity of the optimization problem.
Source: https://medium.com/analytics-vidhya/how-to-classify-non-linear-data-to-linear-data-bb2df1a6b781

Support Vector Machines: pros and cons
Pros
– Very effective in high-dimensional feature spaces
– Effective when the number of features is larger than the training data size
– Among the best algorithms when the classes are (well) separable
– Work very well when the data is sparse
– Can be extended to nonlinear classification via the kernel trick
Cons
– Takes more time to process larger datasets
– Does not perform well for overlapping classes
– Hyperparameter tuning is needed for sufficient generalization

Multiclass Classification
If there are more than two classes, we must build a multiclass classifier. Some methods can be used directly for multiclass classification: k-nearest neighbours, decision trees, and Bayesian techniques. Methods that cannot be directly applied to multiclass problems can still be used by building multiple binary classifiers. Two possible techniques for multiclass classification with binary classifiers (illustrated in the sketch below):
– One versus rest: builds one classifier for each class versus the rest and assigns a test sample to the class with the highest confidence score
– One versus one: builds one classifier for every pair of classes and assigns a test sample to the class that receives the highest number of predictions
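The following scikit-learn sketch ties these pieces together on synthetic data (scikit-learn and the blob dataset are my own choices, not part of the slides): a soft-margin linear SVM whose C trades margin width against violations, and RBF-kernel SVMs wrapped in one-vs-rest and one-vs-one schemes for a three-class problem.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Synthetic 2D data with three classes (a stand-in for real image features)
X, y = make_blobs(n_samples=150, centers=3, cluster_std=1.5, random_state=0)

# Soft-margin SVM on a binary subproblem (classes 0 and 1 only).
# A small C tolerates margin violations for a wider margin; a very large C
# approaches hard-margin behaviour.
mask = y < 2
svm_lenient = SVC(kernel="linear", C=0.1).fit(X[mask], y[mask])
svm_strict = SVC(kernel="linear", C=1000.0).fit(X[mask], y[mask])
print("support vectors (C=0.1): ", len(svm_lenient.support_vectors_))
print("support vectors (C=1000):", len(svm_strict.support_vectors_))

# Kernel trick: an RBF kernel gives a nonlinear decision boundary, and the
# one-vs-rest / one-vs-one wrappers extend the binary SVM to all three classes.
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
print("one-vs-rest training accuracy:", np.mean(ovr.predict(X) == y))
print("one-vs-one  training accuracy:", np.mean(ovo.predict(X) == y))
```

Note that scikit-learn's SVC already applies a one-vs-one scheme internally when given more than two classes; the explicit wrappers above just make the two reduction strategies from the slides visible.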
Evaluation of Classification Error
Error rate: measures how well or how poorly the system solves the problem it was designed for.
Reject class: a generic class for objects that cannot be placed in any of the known classes.
Classification error: the classifier makes a classification error whenever it classifies an input object as class C_i when the true class is C_j, with i ≠ j and C_i not the reject class.
Performance: determined by both the errors and the rejections made; classifying all inputs into the reject class means the system makes no errors but is useless!

Empirical error rate: the number of errors on independent test data divided by the number of classifications attempted.
Empirical reject rate: the number of rejects on independent test data divided by the number of classifications attempted.
Independent test data: sample objects with known true class (labels), including objects from the reject class, that were not used in designing the feature extraction and classification algorithms. The samples used for training and testing should be representative; the available data is split, for example, into 80% training and 20% test data.

False Alarms and False Dismissals
For two-class classification problems, the two possible types of errors have a special meaning and are not symmetric. For example, in medical diagnosis:
– If the person does NOT have the disease but the system incorrectly says they do, the error is a false alarm or false positive (also called a type I error).
– If the person DOES have the disease but the system incorrectly says they do NOT, the error is a false dismissal or false negative (also called a type II error).
The consequences and costs of the two errors can be very different. Both are bad, but false negatives are generally more catastrophic, so the aim is to minimize false negatives, possibly at the cost of increasing false positives. The optimal or acceptable balance between the two errors depends on the application.

Receiver Operating Characteristic (ROC) Curve
For binary classification, each sample gets a predicted probability p1 of belonging to the positive class, and classification is performed by thresholding p1. Different thresholds give different false positive and true positive rates on the whole dataset, and varying the threshold traces out the ROC curve, which plots the true positive (correct detection) rate against the false positive (false alarm) rate. Generally, false alarms go up with attempts to correctly detect higher percentages of known objects. The area under the ROC curve (AUC or AUROC) summarizes overall performance and is one way to evaluate the quality of a classifier based on its ROC. See https://en.wikipedia.org/wiki/Receiver_operating_characteristic
[Figure: ROC curve with the true positive rate on the vertical axis and the false positive rate on the horizontal axis.]
The cancer screening example from the slides:
Truth     | Classification | Error?
Cancer    | Cancer         | Correct detection (no error)
No cancer | Cancer         | False alarm (error)
Cancer    | No cancer      | False dismissal (error)
No cancer | No cancer      | Correct dismissal (no error)
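As a sketch of how the ROC curve and its AUC might be computed in practice (assuming scikit-learn is available; the labels and scores below are made-up positive-class probabilities):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels (1 = cancer, 0 = no cancer) and
# predicted probabilities p1 of the positive class
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])
p1 = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])

# Sweeping the threshold on p1 gives one (false positive rate, true positive rate)
# pair per threshold; together these trace out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, p1)
print("FPR:", fpr)
print("TPR:", tpr)

# The area under the ROC curve summarizes performance across all thresholds
print("AUC:", roc_auc_score(y_true, p1))
```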
Confusion Matrix
Example: handwritten digit recognition. The confusion matrix is a matrix whose entry (i, j) records the number of times an object of class i was classified as class j. It is often used to report the results of classification experiments: the diagonal entries indicate successes, and high off-diagonal numbers indicate confusion between classes.

Binary Confusion Matrix
For binary classification the confusion matrix is:
                       Predicted P               Predicted N
Actual P    # True Positives (TP)     # False Negatives (FN)
Actual N    # False Positives (FP)    # True Negatives (TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN), the fraction of correct classifications out of the total.

Precision versus Recall
Precision (P), also called correctness: the fraction of relevant elements among the selected elements, Precision = TP / (TP + FP).
Recall (R), also called sensitivity or completeness: the fraction of selected elements among the relevant elements, Recall = TP / (TP + FN).
F1 score: the harmonic mean of precision and recall, F1 = 2PR / (P + R).
See https://en.wikipedia.org/wiki/Precision_and_recall

More Terminology and Metrics
A table of further metrics computed from the confusion matrix and often used in classification is given at https://en.wikipedia.org/wiki/Confusion_matrix
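The following sketch (scikit-learn is assumed; the label vectors are illustrative) builds a binary confusion matrix and derives accuracy, precision, recall, and F1 from its entries, mirroring the formulas above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])  # actual classes
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # predicted classes

# Rows = actual, columns = predicted; with labels ordered [1, 0] the layout
# matches the table above: [[TP, FN], [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```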
Regression
Suppose we have a training set of N observations (x_i, y_i), with x_i ∈ R^d, y_i ∈ R, i = 1, …, N. The training process learns f(x) from the training data such that y_i = f(x_i). The setup is the same as for classification, except that the output variable now takes a continuous value (the slides illustrate this for d = 1, where the data are points in the (x, y) plane).

Linear Regression
Linear regression assumes there is a linear relationship between the output and the features:
f(x) = w_0 + w_1 x_1 + w_2 x_2 + … + w_d x_d
With x = [1, x_1, x_2, …, x_d] (the features, with a leading 1 for the intercept) and W = [w_0, w_1, …, w_d] (the weights), this can be written as f(x) = xW. How do we find the best line? The most basic estimation approach is least-squares fitting.

Least Squares Regression
The idea is to minimize the residual sum of squares (the sum of the squared errors):
RSS(W) = Σ_{i=1..N} (y_i − f(x_i))^2 = (Y − XW)^T (Y − XW),
where Y = [y_1, y_2, …, y_N]^T contains all sample values and X = [x_1, x_2, …, x_N]^T contains all sample features. The best fit is Ŵ = argmin_W RSS(W). RSS is a quadratic function that can be differentiated with respect to W:
∂RSS/∂W = −2 X^T (Y − XW)
∂²RSS/∂W∂W^T = 2 X^T X
If we assume that X has full rank, then X^T X is positive definite, so RSS is a convex function with a minimum. Setting the derivative to zero gives
∂RSS/∂W = 0 ⇒ X^T (Y − XW) = 0 ⇒ Ŵ = (X^T X)^{−1} X^T Y.

Linear Regression: Example
Assume we have the length and width of some fish and we want to estimate their weight from this information (features):
Length (x1)  Width (x2)  Weight (y)
100          40          5
102          35          4.5
92           33          4
83           29          3.9
87           36          3.5
95           30          3.6
87           37          3.4
104          38          4.8
101          34          4.6
97           39          4.3
Start with one feature (say x1), which is easier to visualize: y = w_0 + w_1 x_1, with a design matrix X whose rows are [1, x_1] and a target vector Y containing the weights. For one feature we obtain W = (X^T X)^{−1} X^T Y ≈ (−1.8, 0.0635) with RSS(W) ≈ 0.9438. For two features we repeat the same procedure with rows [1, x_1, x_2] in X, which gives W ≈ (−2.125, 0.0591, 0.0194) with RSS(W) ≈ 0.9077. (A NumPy sketch of the single-feature computation follows the evaluation metrics below.)

Regression Evaluation Metrics
Root Mean Square Error (RMSE): the standard deviation of the predicted values from the observed values,
RMSE = sqrt( (1/N) Σ_{i=1..N} (y_i − ŷ_i)^2 ).
Mean Absolute Error (MAE): the average of the absolute differences between the predicted and observed values,
MAE = (1/N) Σ_{i=1..N} |y_i − ŷ_i|.
RMSE penalizes big differences between predicted and observed values more heavily. Smaller values of RMSE and MAE are more desirable.

R-Squared (R²): indicates how well the selected feature(s) explain the output variable,
R² = 1 − Σ_{i=1..N} (y_i − ŷ_i)^2 / Σ_{i=1..N} (y_i − ȳ)^2.
R-squared tends to always increase when extra features are added.
Adjusted R-Squared: indicates how well the selected feature(s) explain the output, adjusted for the number of features,
R²_adj = 1 − (1 − R²)(N − 1) / (N − d − 1),
where N is the number of samples and d is the number of features. Larger values of R-squared and adjusted R-squared are more desirable.
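Here is a minimal NumPy sketch of the single-feature fish fit referenced above, using the normal equation and the metrics just defined; the printed coefficients and RSS should come out close to the values reported on the slides, up to rounding:

```python
import numpy as np

# Fish data from the slides: length x1 and weight y (single-feature model)
x1 = np.array([100, 102, 92, 83, 87, 95, 87, 104, 101, 97], dtype=float)
y = np.array([5, 4.5, 4, 3.9, 3.5, 3.6, 3.4, 4.8, 4.6, 4.3])

# Design matrix with a leading column of ones for the intercept w0
X = np.column_stack([np.ones_like(x1), x1])

# Normal equation W = (X^T X)^{-1} X^T Y (np.linalg.lstsq is the numerically
# safer route for larger problems)
W = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ W

rss = np.sum((y - y_hat) ** 2)                 # slide reports about 0.9438
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
mae = np.mean(np.abs(y - y_hat))
r2 = 1 - rss / np.sum((y - np.mean(y)) ** 2)

print("W =", W)  # compare with the coefficients on the slide (about -1.8, 0.0635)
print(f"RSS={rss:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R^2={r2:.4f}")
```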
Normalization of Features (Preprocessing)
Goal: change the scale of numeric values to a common scale. Commonly applied techniques:
– Z-score: re-scales the data (features) so that it has a standard normal distribution (μ = 0, σ = 1), which works well for normally distributed data: (x − μ) / σ
– Min-max normalization: re-scales the range of the data to [0, 1] so that the minimum value is mapped to 0 and the maximum value to 1: (x − x_min) / (x_max − x_min)

Cross Validation
Ideally a trained model should also work well on new (unseen) data, which means the model should neither underfit nor overfit the training data. Cross validation (CV) is a technique to assess model performance across all of the data, and it can also be used for hyperparameter tuning.
– Train-test split: the available data is randomly split into a training set and a test set (usually in an 80:20 ratio) for training and testing the model, respectively.
– K-fold cross validation: the data is split into K subsets (folds); at each iteration one fold is held out for testing and the rest are used for training. This is repeated K times until every fold has been used once as the test set, and the performance of the model is the average of the performance on the K test sets.
Cross validation can be used for hyperparameter tuning (or model selection):
– Hold out a test set
– Do cross validation on the rest of the data, split into training and validation sets
– The test set cannot be used for selecting the hyperparameters
Source: https://erdogant.github.io/hgboost/pages/html/Cross%20validation%20and%20hyperparameter%20tuning.html

References and Acknowledgements
– Shapiro & Stockman, Chapter 4
– Duda, Hart & Stork, Chapters 1 and 2.1
– Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Chapters 2 and 12
– Theodoridis & Koutroumbas, Pattern Recognition, 2009
– Ian H. Witten & Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2005
Some diagrams are extracted from the above resources.

Example Exam Question
Which one of the following statements is correct for pattern recognition?
A. Pattern recognition is defined as the process of model training on a training dataset and then testing on an independent test set.
B. The dimension of feature vectors should be smaller than the number of training samples in order to avoid the overfitting problem.
C. The simple kNN classifier needs homogeneous feature types and scales so that the classification performance can be better.
D. SVM is a powerful classifier that can separate classes even when the feature space exhibits significant overlaps between classes.
