Applied Data Science Lecture 5 PDF
Dr. Nermeen Ghazy
Summary
This document is a lecture on applied data science. It covers the k-Nearest Neighbors (k-NN) algorithm, including distance calculations, weights, and voting procedures for classification. The lecture also describes an implementation in RapidMiner, with its data preparation steps, and presents a worked example of weighted voting.
Full Transcript
Applied Data Science DS403
Dr. Nermeen Ghazy

Reference Books

K Nearest Neighbors (k-NN)
Similar records congregate in a neighborhood in n-dimensional space, with the same target class labels. This is the central logic behind the approach used by the k-nearest neighbor algorithm, simply referred to as k-NN.

K Nearest Neighbors (k-NN)
How It Works: Any record in a dataset can be visualized as a point in an n-dimensional space, where n is the number of attributes. Although it is hard for us to visualize more than three dimensions, mathematical functions are scalable to any dimension and, hence, all the operations that can be done in two-dimensional space can also be performed in n-dimensional space.

K Nearest Neighbors
[Figure: scatter plot of the Iris dataset, petal length vs. petal width, colored by species]
The colors indicate the species of Iris, the target class variable. For an unseen test record A, with (petal length, petal width) values (2.1, 0.5), one can visually deduce that the predicted species for data point A would be I. setosa. This is based on the fact that test data point A is in the neighborhood of other data points that belong to the species I. setosa. Similarly, unseen test data point B has values (5.7, 1.9) and is in the neighborhood of I. virginica; hence, the test data point can be classified as I. virginica. However, if a data point lies between the boundaries of two species, as with (5.0, 1.8), the classification can be tricky because the neighborhood contains more than one species in the vicinity.

K Nearest Neighbors
One technique is to find the nearest training data point from an unseen test data point in multi-dimensional space. That is how the k-NN algorithm works. The k in the k-NN algorithm indicates the number of close training record(s) that need to be considered when making the prediction for an unlabeled test record. When k = 1, the model finds the single nearest training record and adopts its class label as the prediction. k is usually assigned an odd number for a two-class problem, so that ties in the vote are avoided.

K Nearest Neighbors
The key task in the k-NN algorithm is determining the nearest training record(s) to the unlabeled test record using a measure of proximity. Once the nearest training record(s) are determined, the subsequent class voting of the nearest training records is straightforward.

K Nearest Neighbors
Measure of Proximity: The effectiveness of the k-NN algorithm hinges on determining how similar or dissimilar a test record is compared with the memorized training records. A measure of proximity between two records is a measure of the proximity of their attributes. To quantify the similarity between two records, a range of techniques is available, such as distance, correlation, and cosine similarity.

Measure of Proximity
Distance: The distance between two points X (x1, x2) and Y (y1, y2) in two-dimensional space can be calculated by:

d(X, Y) = √((x1 - y1)² + (x2 - y2)²)

For datasets with n attributes, where X is (x1, x2, ..., xn) and Y is (y1, y2, ..., yn):

d(X, Y) = √((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)

For example, the first two records of the four-dimensional Iris dataset are X = (4.9, 3.0, 1.4, 0.2) and Y = (4.6, 3.1, 1.5, 0.2). The distance between X and Y is:

d(X, Y) = √((0.3)² + (-0.1)² + (-0.1)² + 0²) = √0.11 ≈ 0.33
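The following is a minimal Python sketch (not part of the lecture slides, which use RapidMiner) of this distance calculation; it reproduces the ≈0.33 result for the two Iris records above.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two records with the same number of attributes."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# First two records of the four-dimensional Iris dataset.
X = (4.9, 3.0, 1.4, 0.2)
Y = (4.6, 3.1, 1.5, 0.2)

print(round(euclidean_distance(X, Y), 2))  # 0.33
```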
Measure of Proximity
Weights: The premise of the k-NN algorithm is that data points closer to each other are similar and, hence, have the same target class labels. The weights are included in the final voting step, where the predicted class is calculated. The weights (wi) should satisfy two conditions: 1. they should be inversely proportional to the distance of the test data point from the neighbor (closer neighbors receive larger weights), and 2. the sum of all weights should be equal to one. One of the calculations for the weights is shown in the equation, where wi is the weight of the ith neighbor ni, k is the total number of neighbors, and x is the test data point.

Measure of Proximity
Weights: The weights are used in predicting the target class: the class outcomes yi of the k nearest neighbors are combined through a weighted vote, where yi is the class outcome of neighbor ni. The distance measure works well for numeric attributes.

K-NN: How to Implement?
Since the key functionality is referencing or looking up the training dataset, one could implement the entire algorithm in spreadsheet software like MS Excel, using lookup functions. In RapidMiner, the k-NN implementation is similar to other classification and regression processes, with data preparation, modeling, and performance evaluation operators.

K-NN: Data Preparation
The dataset used in this example is the standard Iris dataset with 150 examples and four numeric attributes. The Normalize operator accepts numeric attributes and generates transformed numeric attributes. The dataset is then split into two equal, exclusive datasets using the Split Data operator, which partitions the data into training and test datasets.

K-NN: Modeling Operator and Parameters
The k-NN modeling operator is available in Modeling > Classification > Lazy Modeling. These parameters can be configured in the operator settings:
1. k: The value of k in k-NN can be configured. This defaults to one nearest neighbor. This example uses k = 3.
2. Weighted vote: In the case of k > 1, this setting determines whether the algorithm takes the distance values into consideration for the weights.
3. Measure types: There are more than two dozen distance measures available in RapidMiner.
4. Measure: The selection of the measure puts restrictions on the type of input the model accepts.

K-NN: Execution and Interpretation
The result output is observed as:
1. k-NN model: The model for k-NN is just the set of training records. Hence, no additional information is provided in this view, apart from the statistics of the training records.
2. Performance vector: The output of the Performance operator provides the confusion matrix with correct and incorrect predictions for the whole test dataset.
3. Labeled test dataset: The predictions can be examined at the record level.
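The lecture builds this process in RapidMiner's visual workflow rather than in code. As a rough analogue only (scikit-learn is not part of the lecture, and its defaults may differ from RapidMiner's operators), the same pipeline can be sketched in Python: normalize the Iris attributes, split the data into two equal partitions, fit k-NN with k = 3 and distance-weighted voting, and inspect the confusion matrix.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# The standard Iris dataset: 150 examples, four numeric attributes.
X, y = load_iris(return_X_y=True)

# Normalize the numeric attributes (analogous to the Normalize operator).
X = StandardScaler().fit_transform(X)

# Split into two equal, exclusive partitions (analogous to the Split Data operator).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

# k-NN with k = 3 and distance-weighted voting (the "weighted vote" setting).
model = KNeighborsClassifier(n_neighbors=3, weights="distance")
model.fit(X_train, y_train)

# Performance: confusion matrix and accuracy on the held-out test partition.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
```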
Example
The k-Nearest Neighbors (k-NN) algorithm is illustrated here with an example that includes weighted voting. We'll set k = 3 and use Euclidean distance to measure the proximity between points. Here is the data:

Training Data
Suppose we have five training points, each with two features (X and Y), with class labels:

Training Point   X   Y   Class
A                1   2   1
B                2   3   1
C                3   3   2
D                5   4   2
E                4   2   1

Test Point
Let's assume a new test point T with values (X = 3, Y = 2). We want to use k-NN to classify this point.

Step 1: Calculating Distances
We start by calculating the Euclidean distance between the test point T and each point in the training data:

d(T, A) = √((3 - 1)² + (2 - 2)²) = √4 = 2
d(T, B) = √((3 - 2)² + (2 - 3)²) = √2 ≈ 1.41
d(T, C) = √((3 - 3)² + (2 - 3)²) = √1 = 1
d(T, D) = √((3 - 5)² + (2 - 4)²) = √8 ≈ 2.83
d(T, E) = √((3 - 4)² + (2 - 2)²) = √1 = 1

Step 2: Identifying the Nearest Neighbors
We order the points by distance:

Point   Distance   Class
C       1          2
E       1          1
B       1.41       1
A       2          1
D       2.83       2

With k = 3, the nearest neighbors are:
1. C (Class 2, Distance 1)
2. E (Class 1, Distance 1)
3. B (Class 1, Distance 1.41)

Step 3: Calculating Weights
We calculate weights using the distances (here, the inverse of the distance is used as the weight):

Weight_C = 1 / 1 = 1
Weight_E = 1 / 1 = 1
Weight_B = 1 / 1.41 ≈ 0.707

Step 4: Voting Based on Weights
Now, we calculate the total weight for each class:

Class 1: Total Weight_1 = Weight_E + Weight_B = 1 + 0.707 ≈ 1.707
Class 2: Total Weight_2 = Weight_C = 1

Final Result
Based on the total weights:
Class 1: 1.707
Class 2: 1

Thus, point T is classified as Class 1, as it has the highest total weight. This example illustrates how weights are calculated in the k-NN algorithm and how they influence the final classification decision.
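As a cross-check, the worked example can be reproduced with a short Python sketch (again, not part of the lecture slides); it computes the Euclidean distances, selects the k = 3 nearest neighbors, applies inverse-distance weights, and tallies the weighted votes, giving Class 1 a total weight of about 1.71 versus 1.0 for Class 2.

```python
import math
from collections import defaultdict

# Training data from the example: point name -> ((X, Y), class label).
training = {
    "A": ((1, 2), 1),
    "B": ((2, 3), 1),
    "C": ((3, 3), 2),
    "D": ((5, 4), 2),
    "E": ((4, 2), 1),
}
T = (3, 2)  # test point
k = 3

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Steps 1 and 2: distance from T to every training point, sorted ascending.
ranked = sorted((euclidean(xy, T), cls) for xy, cls in training.values())
neighbors = ranked[:k]  # (1.0, Class 1), (1.0, Class 2), (1.41, Class 1) -> E, C, B

# Steps 3 and 4: inverse-distance weights, summed per class.
totals = defaultdict(float)
for dist, cls in neighbors:
    totals[cls] += 1.0 / dist  # assumes no neighbor sits at distance zero

print(dict(totals))                 # {1: ~1.71, 2: 1.0}
print(max(totals, key=totals.get))  # 1 -> T is classified as Class 1
```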