KNN Algorithm Explanation PDF
Summary
This document provides an introduction to the K-Nearest Neighbors (KNN) algorithm, a popular machine learning technique. It explains the concept, the mathematical background (including the Euclidean and Manhattan distances), how to choose the value of K, and the use of KNN for both classification and regression. It also covers the bias-variance tradeoff and the computational cost associated with varying K, along with practical applications.
Full Transcript
INTRODUCTION: The abbreviation KNN stands for "K-Nearest Neighbors". It is a supervised machine learning algorithm that can be used to solve both classification and regression problem statements. KNN is a very popular algorithm; its accuracy is comparable to, and sometimes better than, that of other, more complicated algorithms. KNN is supervised (the training data is labeled), non-parametric, and lazy (instance based): it does not explicitly learn a model, but instead saves all the training data and uses the whole training set for classification or prediction. This makes the training process very fast, since it only stores the values from the data set. The real costs are the large memory consumption (all the data has to be stored) and the time complexity at testing time.

MATHEMATICAL EXPLANATION OF THE ALGORITHM: The important assumption is that the data lies in a metric space, meaning we can define a metric for calculating the distance between data points. Defining a good distance metric can be a real challenge; an interesting idea is to learn the distance metric itself using machine learning. The most commonly used distance metrics are:
- Euclidean Distance
- Manhattan Distance

Equations
1. Euclidean Distance (L2 norm): For two points A(x1, y1) and B(x2, y2) in 2D space, the Euclidean distance is d = √((x1 − x2)² + (y1 − y2)²).
2. Manhattan Distance (L1 norm): For two points (x1, y1) and (x2, y2), the Manhattan distance is d = |x1 − x2| + |y1 − y2|.

CHOOSING THE K VALUE:
CASE-1: Suppose I pick K=1. I have to find the single nearest neighbor of my query point, and whatever the class label of that 1-NN is, I assign it to the query point. In this case my 1-NN is a positive point, and hence I assign a positive class label to the query point. But wait! My query point lies on the negative side of the decision surface. That is true, but only from a geometric point of view: when we apply KNN with K=1, the closest neighbor is the positive point, and hence the algorithm classifies the query point as positive. Fitting individual points this closely is what overfitting looks like.
CASE-2: Here I pick K=5, so I consider the 5 nearest neighbors. Taking the majority vote over these 5 neighbors, I classify the query point as negative.
CASE-3: Here K=n, where n is the total number of points in my dataset. For the sake of understanding, assume I have 1000 data points, of which 600 are negative and 400 are positive, and I set K=1000. This means all 1000 data points are treated as neighbors. Taking the majority vote over all of them, the query point will always be classified as negative, simply because there are more negative points than positive points.

THE K-NN ALGORITHM:
Step-1: Select the number K of neighbors.
Step-2: Calculate the distance (for example, the Euclidean distance) from the query point to the training points.
Step-3: Take the K nearest neighbors according to the calculated distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category with the highest count among its neighbors.
Step-6: Our model is ready.

Majority Voting (Classification) or Averaging (Regression): For classification problems, the algorithm counts the occurrences of each class among the K neighbors and assigns the class with the highest count to the new data point. For regression problems, the algorithm predicts the average value of the target variable among the neighbors.
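The steps above can be made concrete with a short, from-scratch sketch. The following is a minimal illustration in Python, assuming NumPy is available; the function name knn_classify, the toy points, and the use of the Euclidean distance are illustrative assumptions, not part of the original document.

```python
# Minimal KNN classification sketch (illustrative; names and data are assumptions).
from collections import Counter
import numpy as np

def knn_classify(X_train, y_train, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query point to every training point.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the class labels among the neighbors and take the majority.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny toy example: two clusters of 2D points labeled "neg" and "pos".
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # negative cluster
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])  # positive cluster
y_train = np.array(["neg", "neg", "neg", "pos", "pos", "pos"])

print(knn_classify(X_train, y_train, query=np.array([2.0, 2.0]), k=3))
```

With this toy data, a query at (2.0, 2.0) lies inside the first cluster, so the majority vote among its 3 nearest neighbors returns "neg".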
KNN Classification
Classification is a prediction task with a categorical target variable. Classification models learn how to classify any new observation; the assigned class is either right or wrong, never in between. A classic example of classification is the iris dataset, in which you use physical measurements of plants to predict their species. Another well-known algorithm that can be used for classification is logistic regression. For classification problems, KNN assigns a class label based on a majority vote, meaning that the most frequent label among the neighbors is applied.

Why is the Choice of K Important?
- Bias-Variance Tradeoff: A smaller K value (like K=1) has low bias but high variance. It follows the training data closely, capturing noise and making the model sensitive to outliers (overfitting). A larger K value provides a smoother, less variable fit: the prediction is based on several points, making it less sensitive to noise in the training data but potentially adding bias, which leads to high bias and low variance (underfitting).
- Computational Cost: Larger values of K mean more computation, since more points need to be considered to make a prediction.
- Class Boundary: The decision boundary becomes smoother with increasing K. With K=1 the decision boundary is jagged, while a large K produces a more generalized boundary.

KNN Regression
Regression is a prediction task in which the target variable is numeric. A famous example of regression is the Housing Prices Challenge on Kaggle, in which participants try to predict the sale prices of houses from numerous independent variables. For regression problems, KNN predicts the average of the target values of the k nearest neighbors, so the output is a continuous (floating-point) value. We can evaluate the quality of the results by looking at how close the model's predictions are to the known target values in the testing set.

A figure in the original document shows a classification and a regression side by side on the same example. The left part of the figure is a classification: the target variable is the shape of the observation, which is a categorical variable. The right part is a regression: the target variable is numeric. The decision rules could be exactly the same for the two examples, but their interpretations are different. For a single prediction, classifications are either right or wrong, while regressions have an error on a continuous scale. Having a numeric error measure is more practical, so many classification models predict not only the class but also the probability of being in each class. Some models can only do regression, some can only do classification, and some can do both. The KNN algorithm adapts seamlessly to both classification and regression; a minimal regression sketch follows below.
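As a counterpart to the classification sketch above, here is a minimal KNN regression sketch, again assuming NumPy; the knn_regress function and the toy size-versus-price data are hypothetical and only meant to illustrate averaging over the k nearest neighbors.

```python
# Minimal KNN regression sketch (illustrative; names and data are assumptions).
import numpy as np

def knn_regress(X_train, y_train, query, k=3):
    """Predict a numeric target as the average over the k nearest training points."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]                        # k closest points
    return y_train[nearest].mean()                             # average their targets

# Toy example: house size -> price; the numbers are made up for illustration.
X_train = np.array([[50.0], [60.0], [80.0], [100.0], [120.0]])
y_train = np.array([150.0, 180.0, 240.0, 300.0, 360.0])

# Averages the prices of the 3 houses closest in size to the query.
print(knn_regress(X_train, y_train, query=np.array([70.0]), k=3))
```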
KNN Advantages and Disadvantages
Advantages:
- No Training Period: KNN has no training period, because the data itself is the model and serves as the reference for future predictions. This makes it very time efficient when a model has to be built quickly on the available data.
- Easy Implementation: KNN is very easy to implement; the only thing to be calculated is the distance between points based on their feature values, and this distance can be computed with a standard formula such as the Euclidean or Manhattan distance.
- New data can be added at any time, since there is no training period and the added data will not affect the existing model.

Disadvantages:
- Does not work well with large datasets, because calculating the distance to every data instance is very costly.
- Does not work well with high-dimensional data, because computing the distance across every dimension makes the distance calculation more expensive and less meaningful.
- Sensitive to noisy and missing data.
- Requires feature scaling: data in all dimensions should be scaled (normalized or standardized) properly; see the scaling sketch at the end of this section.

Practical Applications of KNN
Market Segmentation: Businesses often face the challenge of segmenting their vast customer base to deliver personalized experiences. KNN assists in this endeavor by classifying customers into specific segments based on their purchasing behaviors, preferences, and other factors. Using these segments, businesses can then tailor their marketing strategies and offerings, ensuring each customer feels uniquely catered to.
Recommendation Systems: We've all encountered online platforms suggesting movies, songs, or products based on our past interactions. At the heart of many of these recommendation systems lies the KNN algorithm. By examining a user's history and finding the k other users with the most similar tastes, KNN can suggest items that those similar users have liked, providing personalized recommendations that resonate with individual preferences.
Image Recognition: The digital age has brought an influx of images, and categorizing them manually is no small feat. KNN rises to the occasion by analyzing the features of an image and classifying it based on the features of the k nearest images in the dataset. Whether it is facial recognition in security systems or sorting photographs in online galleries, KNN simplifies the task, making image categorization more efficient and accurate.
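Returning to the feature-scaling point listed under the disadvantages: because KNN is purely distance based, a feature measured on a large scale can dominate the metric unless every dimension is standardized. The following sketch assumes scikit-learn is installed and reuses the iris dataset mentioned earlier; the parameter choices (n_neighbors=5, test_size=0.25) are arbitrary illustrations, not recommendations from the original document.

```python
# Standardize features before KNN so that no single feature dominates the distance
# (illustrative sketch; assumes scikit-learn is installed, data set choice is arbitrary).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# StandardScaler rescales each feature to zero mean and unit variance before the
# KNN classifier computes distances between points.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```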