KNN Algorithm Explanation PDF
Summary
This document provides a detailed explanation of the K-Nearest Neighbors (KNN) algorithm, a supervised machine learning technique. It covers various aspects, including mathematical explanations, choosing the optimal K value for the algorithm, classification tasks, regression techniques, and its advantages and disadvantages.
Full Transcript
INTRODUCTION:
The abbreviation KNN stands for "K-Nearest Neighbors". It is a supervised machine learning algorithm that can be used to solve both classification and regression problem statements. KNN is a very popular algorithm. Its popularity springs from the fact that it is very easy to understand and interpret, yet many times its accuracy is comparable to, or even better than, that of other, more complicated algorithms. KNN is supervised (the training data is labeled), non-parametric, and lazy (instance-based): it does not explicitly learn a model, but saves all the training data and uses the whole training set for classification or prediction. This is in contrast to other techniques such as SVM, where you can discard all non-support vectors without any problem. As a result the training process is very fast, since it just stores all the values from the data set. The real problem is the huge memory consumption (because we have to store all the data) and the time complexity at testing time (since classifying a given observation requires a pass over the whole data set). But in general it is a very useful algorithm for small data sets (or if you have lots of time and memory) and for educational purposes.

MATHEMATICAL EXPLANATION OF THE ALGORITHM:
The important assumption is that the data lies in a metric space, meaning we can define a metric for calculating the distance between data points. Defining a distance metric can be a real challenge. An interesting idea is to learn the distance metric using machine learning (mainly by converting the data to a vector space, representing the differences between objects as distances between vectors, and learning those differences), but this is another topic that we will talk about later. The most used distance metrics are:
- Euclidean Distance
- Manhattan Distance

CHOOSING K VALUE:
CASE-1: Suppose I pick my value of K=1. Meaning, I have to pick the 1-nearest neighbor of my query point, and whatever the class label of that 1-NN is, I assign it to the query point. In that case, my 1-NN is a positive point and hence I need to assign a positive class label to my query point. But wait! My query point lies on the negative side of the decision surface. Yes, that's true, but that's from a geometric point of view. When we apply KNN with K=1, the closest neighbor is the positive point and hence the algorithm classifies the query point as positive. Because a single neighbor decides the label, the prediction is very sensitive to individual (possibly noisy) points; this is also known as overfitting.

CASE-2: Here, I pick my value of K=5. Now I need to consider the 5 nearest neighbors. When I consider the 5-NN and take the majority vote, I can classify the query point as negative.

CASE-3: Here, my value of K=n, where n is the total number of points in my dataset. For the sake of our understanding, let's assume I have 1000 data points, and out of these 1000 data points 600 are negative and 400 are positive, so my value of K=1000. It means we'll consider all 1000 data points as neighbors. Now, when we take the majority vote over all the data points, the query point will be classified as negative, since the number of negative points is greater than the number of positive points. Note that with K=n every query point receives the majority class regardless of where it lies, which is the opposite extreme: underfitting.

The working of KNN can be explained on the basis of the below algorithm (a minimal implementation is sketched after the steps):
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the query point to every point in the training data.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.
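To make the six steps concrete, here is a minimal sketch of a KNN classifier written from scratch in Python. It is illustrative rather than definitive: the function names (euclidean_distance, knn_classify) and the toy data are made up for this example, and it uses a plain Euclidean distance with an unweighted majority vote, exactly as in the steps above.

from collections import Counter
import math

def euclidean_distance(a, b):
    # Step-2: straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, points, labels, k):
    # Step-2: distance from the query point to every training point.
    distances = [(euclidean_distance(query, p), label) for p, label in zip(points, labels)]
    # Step-3: keep the K nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step-4: count how many neighbors fall in each category.
    votes = Counter(label for _, label in nearest)
    # Step-5: assign the category with the most neighbors (majority vote).
    return votes.most_common(1)[0][0]

# Toy data: two features per point, labels "pos" / "neg".
points = [(1.0, 1.2), (0.8, 0.9), (1.1, 0.7), (3.0, 3.2), (3.3, 2.9), (2.9, 3.1)]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

# Step-1: choose K; here K=3.
print(knn_classify((1.0, 1.0), points, labels, k=3))  # expected: "neg"
print(knn_classify((3.1, 3.0), points, labels, k=3))  # expected: "pos"

Note that no "training" happens anywhere: the whole data set is kept and every prediction scans it, which is exactly the memory and test-time cost discussed in the introduction.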
KNN Classification
Classification is a prediction task with a categorical target variable. Classification models learn how to classify any new observation, and the assigned class can be either right or wrong, not in between. A classic example of classification is the iris dataset, in which you use physical measurements of plants to predict their species. A famous algorithm that can be used for classification is logistic regression. For classification problems, KNN assigns a class label based on a majority vote, meaning that the most frequent label among the neighbors is applied.

KNN Regression
Regression is a prediction task in which the target variable is numeric. A famous example of regression is the Housing Prices Challenge on Kaggle. In this machine learning contest, participants try to predict the sale prices of houses based on numerous independent variables. For regression problems, the prediction is the average of the target values of the k nearest neighbors, so the output is a continuous value, typically a floating-point number (a minimal sketch follows after the advantages and disadvantages below). We can evaluate the accuracy of the results by looking at how close the model's predictions are to the observed values.

The example figure contrasts the two tasks: the left part of the image is a classification, where the target variable is the shape of the observation, a categorical variable; the right part is a regression, where the target variable is numeric. The decision rules could be exactly the same for the two examples, but their interpretations are different. For a single prediction, classifications are either right or wrong, while regressions have an error on a continuous scale. Having a numeric error measure is more practical, so many classification models predict not only the class but also the probability of belonging to each class. Some models can only do regression, some can only do classification, and some can do both. The KNN algorithm adapts seamlessly to both classification and regression. You'll learn exactly how this works in the next part of this tutorial.

KNN Advantages and Disadvantages
Advantages:
- No Training Period: KNN has no training period; the stored data itself is the model and serves as the reference for future predictions, which makes it very time-efficient for quickly putting a model to work on the available data.
- Easy Implementation: KNN is very easy to implement, as the only thing to calculate is the distance between points based on their features, and this distance can easily be computed with a formula such as the Euclidean or Manhattan distance.
- New data can be added at any time: since there is no training period, adding new data does not affect the model.

Disadvantages:
- Does not work well with large datasets, as calculating the distance to every data instance is very costly.
- Does not work well with high dimensionality, as the distance has to be computed over every dimension and becomes less meaningful as dimensions grow.
- Sensitive to noisy and missing data.
- Feature Scaling: data in all dimensions should be scaled (normalized or standardized) properly, otherwise features with larger numeric ranges dominate the distance (see the scaling sketch below).
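As a companion to the regression discussion above, here is a minimal sketch of KNN regression in the same from-scratch style as the classifier earlier: instead of taking a majority vote, the prediction is the average of the target values of the K nearest neighbors. The helper name knn_regress and the toy housing-style data are made up for illustration only.

import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_regress(query, points, targets, k):
    # Distance from the query point to every training point.
    distances = [(euclidean_distance(query, p), t) for p, t in zip(points, targets)]
    # Keep the K nearest neighbors and average their numeric targets.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(t for _, t in nearest) / k

# Toy data: (area in 100 m^2, number of rooms) -> price in thousands.
points = [(0.5, 2), (0.6, 2), (1.2, 4), (1.3, 4), (2.0, 6)]
targets = [150.0, 160.0, 320.0, 340.0, 500.0]

print(knn_regress((1.25, 4), points, targets, k=2))  # average of the two closest prices: 330.0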
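Finally, regarding the feature-scaling caveat in the list above: because KNN relies on raw distances, a feature with a large numeric range (such as an income in dollars) will dominate one with a small range (such as an age in years). A common remedy, sketched here under the assumption that scikit-learn is available, is to standardize the features before fitting the nearest-neighbor model; the variable names and toy data are invented for illustration.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: [age in years, income in dollars]; income would dominate unscaled distances.
X = [[25, 30000], [32, 42000], [47, 90000], [51, 110000], [38, 60000], [29, 35000]]
y = ["no", "no", "yes", "yes", "yes", "no"]

# StandardScaler rescales each feature to zero mean and unit variance,
# so both dimensions contribute comparably to the Euclidean distance.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

print(model.predict([[45, 85000]]))  # expected: ['yes']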