Nearest Neighbors and Support Vector Machine PDF

Document Details

Uploaded by MesmerizingGyrolite5380

Ajou University

So Yeon Kim

Tags

machine learning, nearest neighbors, support vector machines, artificial intelligence

Summary

This presentation covers Nearest Neighbors (k-NN) and Support Vector Machines (SVMs). It details the algorithms, their advantages and disadvantages, and includes examples and diagrams. The presentation is aimed at an undergraduate-level audience.

Full Transcript

Slide 1: Title
Ajou University
Nearest Neighbors and Support Vector Machine
Medical Artificial Intelligence (의료 인공지능)
So Yeon Kim
*Slides adapted from Roger Grosse

Slide 2: Nearest Neighbors

Slide 3: K-Nearest Neighbor (k-NN)
Training data:
No   Strength   Smooth   Quality
x1   5          5        Good
x2   1          2        Bad
x3   6          7        Good
x4   3          5        Bad
Test data:
x5   2          3        ?

Slide 4: K-Nearest Neighbor (k-NN)
Instance-based learning: k-NN classifies new data points by finding the k closest instances in the training set and assigning the most common label among them.
1-NN: predicts the label based on the single closest instance in the training dataset.
k-NN: finds the k closest instances and predicts the label by majority voting.

Slide 5: k-NN classification
For the given test data point (x5 in the table above):
1. Calculate the distance between the test point and all the training data points.

Slide 6: k-NN classification
2. Select the k closest training data points (the k nearest neighbors).
3. Predict the label (class) by majority voting.
Example from the slide's figure: k = 1 → Good, k = 3 → Bad. (A short code sketch of this procedure appears after the k-NN summary below.)

Slide 7: Distance measure
"Nearest" can be formalized in terms of Euclidean distance, or other measures such as:
Mahalanobis distance
Correlation
Cosine similarity

Slide 8: Choosing k
Nearest neighbors is sensitive to noise or mis-labeled data ("class noise").
Small k: good at capturing fine-grained patterns, but may overfit, i.e. be sensitive to random idiosyncrasies in the training data.
Large k: makes stable predictions by averaging over many examples, but may underfit, i.e. fail to capture important regularities.

Slide 9: k = 1
[Figure: k-NN classification with k = 1. Image credit: "The Elements of Statistical Learning"]

Slide 10: k = 15
[Figure: k-NN classification with k = 15. Image credit: "The Elements of Statistical Learning"]

Slide 11: Choosing k
Choosing an appropriate value of k is important.
Heuristic: a decision-making method based on trial and error and experience, used in situations where information is incomplete.

Slide 12: Choosing k
k is an example of a hyperparameter.
We can tune hyperparameters using a validation set.

Slide 13: Summary (Advantages)
k-NN does not build an explicit model during the training phase; it simply stores the dataset and performs its computations at prediction time (lazy learning).
Simple and easy to program.
As more data becomes available, k-NN can better approximate decision boundaries.

Slide 14: Summary (Disadvantages)
k-NN needs to compute the distance between the query point and all training samples at prediction time, which becomes computationally expensive and memory-intensive for large datasets.
k-NN can be biased toward the majority class if the data is imbalanced.
Curse of dimensionality: in high-dimensional data, the distance between points becomes less meaningful.
The performance of k-NN depends heavily on the choice of k.
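The k-NN procedure on slides 3-6 can be made concrete with a minimal sketch. The toy Strength/Smooth table comes from slide 3; everything else (the use of NumPy, the function name knn_predict) is illustrative and not part of the original slides.

```python
# Minimal k-NN sketch for the toy table on slide 3 (illustrative only).
import numpy as np
from collections import Counter

# Training data: features are (Strength, Smooth), label is Quality.
X_train = np.array([[5, 5], [1, 2], [6, 7], [3, 5]], dtype=float)
y_train = ["Good", "Bad", "Good", "Bad"]

def knn_predict(x_query, X, y, k=3):
    """Classify x_query by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X - x_query, axis=1)   # 1. Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # 2. indices of the k closest points
    votes = Counter(y[i] for i in nearest)        # 3. majority vote over their labels
    return votes.most_common(1)[0][0]

x5 = np.array([2, 3], dtype=float)  # test point from slide 3
print(knn_predict(x5, X_train, y_train, k=1))  # x2 is the closest point, so 1-NN predicts "Bad" here
print(knn_predict(x5, X_train, y_train, k=3))  # neighbors x2, x4, x1 vote Bad, Bad, Good -> "Bad"
```

Tuning k (slides 11-12) would amount to evaluating knn_predict for several values of k on a held-out validation set and keeping the one that performs best.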
Slide 15: Support Vector Machine

Slide 16: Separating Hyperplanes
Suppose we are given these data points from two different classes and want to find a linear classifier that separates them.

Slide 17: Separating Hyperplanes
The decision boundary looks like a line.

Slide 18: Separating Hyperplanes
There are multiple separating hyperplanes, described by different parameters $(w, b)$, e.g. $b_1 + w_1^T x = 0$ and $b_2 + w_2^T x = 0$.

Slide 19: Separating Hyperplanes
Which of these is optimal?

Slide 20: Separating Hyperplanes
Optimal Separating Hyperplane: a hyperplane that separates two classes and maximizes the distance to the closest point from either class, i.e., maximizes the margin of the classifier.

Slide 21: Maximizing Margin
Max-margin objective: $\max_{w,b} C$ (maximize the margin $C$), which is equivalent to $\min_{w,b} \tfrac{1}{2}\|w\|_2^2$.

Slide 22: Maximizing Margin
The important training examples are the data points that lie on the edge of this margin; they are called support vectors.
The position of the decision boundary is determined entirely by the support vectors.
Hence, this algorithm is called the (hard) Support Vector Machine (SVM), or Support Vector Classifier.

Slide 23: Maximizing Margin for Non-Separable Data Points
How can we apply the max-margin principle if the data are not linearly separable?

Slide 24: Maximizing Margin for Non-Separable Data Points
Allow some points to be within the margin, or even misclassified; we represent this with slack variables $\xi_i$.

Slide 25: Maximizing Margin for Non-Separable Data Points
Constrain or penalize the total amount of slack $\sum_{i=1}^{N} \xi_i$.
Soft-margin SVM objective: $\min_{w,b,\xi} \; \tfrac{1}{2}\|w\|_2^2 + \gamma \sum_{i=1}^{N} \xi_i$.
$\gamma$ is a hyperparameter that trades off the margin with the amount of slack.

Slide 26: Margin
[Figure: decision boundaries for a hard margin and for soft margins with C = 1000, C = 10, and C = 0.1; the largest C is labeled "overfitting" and the smallest "underfitting". Here C plays the role of the slack penalty $\gamma$ above.]

Slide 27: Kernel trick
What if our data looks like this? [Figure: data that is not linearly separable]

Slide 28: Kernel trick
Kernel trick: a method that transforms data into a higher dimension using kernels, enabling the construction of a decision boundary that is linearly separable in that space, e.g. polynomial kernels and RBF (Radial Basis Function) kernels.
If the kernel trick is applied to a problem that a soft margin could already handle, the model may overfit.

Slide 29: Kernel trick
[Figure: decision boundaries for SVC with linear kernel, SVC with RBF kernel, and SVC with polynomial (degree 3) kernel; x-axis: sepal length (cm).] (A scikit-learn sketch of this comparison follows the transcript.)

Slide 30: Thank You! Q&A
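To make the margin and kernel comparisons on slides 26-29 concrete, here is a short scikit-learn sketch. The kernel choices (linear, RBF, polynomial of degree 3), the C values 1000 and 0.1, and the use of iris sepal features echo the slides; the train/test split, the remaining parameter values, and the printed accuracies are illustrative assumptions rather than results from the presentation. In scikit-learn's SVC, the parameter C penalizes the total slack, i.e. it plays the role of $\gamma$ in the soft-margin objective above.

```python
# Illustrative sketch: soft-margin SVMs with different kernels and C values,
# trained on two iris features as in the slide 29 figure (not the slides' exact setup).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # keep the two sepal features so the setting matches the 2-D plots
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "linear kernel":                SVC(kernel="linear", C=1.0),
    "RBF kernel":                   SVC(kernel="rbf", C=1.0),
    "polynomial (degree 3) kernel": SVC(kernel="poly", degree=3, C=1.0),
    "RBF, C = 1000 (little slack)": SVC(kernel="rbf", C=1000.0),  # close to a hard margin
    "RBF, C = 0.1 (more slack)":    SVC(kernel="rbf", C=0.1),     # wider margin, more violations allowed
}

for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name:30s} test accuracy: {clf.score(X_te, y_te):.2f}")
```

Plotting the decision regions of these fitted models over the two features would reproduce the kind of comparison shown in the slide 26 and slide 29 figures.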
