k-Nearest Neighbor Classification

Created by @FinerLimeTree

Questions and Answers

What is a potential issue with using the nearest neighbor in classification?

  • The nearest neighbor may not represent the dataset.
  • The nearest neighbor may be irrelevant.
  • The nearest neighbor may be an outlier. (correct)
  • The nearest neighbor always provides the correct class.

What is the purpose of the decision set in the k-Nearest Neighbor classifier?

  • To determine the average of all neighbors.
  • To include all instances in the dataset for voting.
  • To eliminate the influence of outliers.
  • To consider the k nearest neighbors for classification. (correct)

Which of the following represents a common method for making a classification decision among neighbors?

  • Taking the majority vote among neighbors. (correct)
  • Using the farthest neighbor's class.
  • Considering the median class of all neighbors.
  • Selecting the class of the nearest neighbor only.

What does the function $h(x_q)$ represent in k-Nearest Neighbor classification?

The class that corresponds to the majority of the k nearest neighbors.

What is the significance of weighted votes in k-Nearest Neighbor classification?

It gives more influence to closer neighbors for a more accurate decision.

Which of the following best describes the decision rule in k-Nearest Neighbor classification?

It applies a majority vote, possibly weighted by distance, to decide the class.

What distance metric is used to measure the distance along axes at right angles?

Manhattan Distance

Which metric assesses the proportion of correct predictions in a model evaluation?

Accuracy

What normalization method rescales features to a fixed range, typically [0, 1]?

Min-Max Scaling

Which method helps in determining the best value of K to prevent overfitting?

Cross-Validation

In model evaluation, what does the F1 Score represent?

The harmonic mean of precision and recall

Which distance metric is effective for measuring similarity in high-dimensional data?

Cosine Similarity

Which technique can be used to reduce feature space and mitigate overfitting?

Dimensionality Reduction

Which of the following is NOT a component of a confusion matrix?

Accuracy (the matrix itself contains only true/false positives and negatives)

    Study Notes

    k-Nearest Neighbor Classification

    • The nearest neighbor may sometimes be an outlier, leading to misleading classification results.
    • To address this, a k-Nearest Neighbor (k-NN) classifier considers multiple neighbors instead of just one.
    • The decision set consists of the k nearest neighbors utilized for determining the classification outcome.
    • The decision rule is the method for assigning a class based on the classes observed among the k neighbors.
    • Common approaches for decision rules include:
      • Majority vote: The class with the most votes from neighbors is chosen.
      • Weighted votes: Neighbors contribute to the vote based on their distance, allowing closer neighbors to have a greater influence.
    • The function for classifying an instance $x_q$ is defined as:
      • $h(x_q) = \text{argmax}_{c} \sum_{i=1}^{k} w_i \, \delta(c, f(x_i))$
      • Here, $\delta(a, b)$ equals 1 if $a = b$ (the classes match), and 0 otherwise.
    • This formulation provides a flexible way to account for varying distances among neighbors, enhancing the robustness of class predictions; a minimal code sketch follows.
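
The weighted decision rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the names (`knn_predict`, `X_train`, `y_train`) are illustrative, not from the source.

```python
import numpy as np

def knn_predict(X_train, y_train, x_q, k=5, weighted=True):
    """Classify query point x_q by (optionally weighted) majority vote."""
    # Euclidean distances from the query to every training instance.
    dists = np.linalg.norm(X_train - x_q, axis=1)
    # Indices of the k nearest neighbors (the decision set).
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        # Closer neighbors get larger weights; unweighted voting uses w_i = 1.
        w = 1.0 / (dists[i] + 1e-9) if weighted else 1.0
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    # argmax over classes, mirroring h(x_q) above.
    return max(votes, key=votes.get)

# Example: two small clusters in 2-D.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # -> 0
```

Setting `weighted=False` recovers the plain majority vote (all $w_i = 1$).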

    Distance Metrics

    • Euclidean Distance: Calculated as the straight-line distance between points; widely used in k-NN applications.
    • Manhattan Distance: Measures the distance based on horizontal and vertical paths; suited for grid-like data structures.
    • Minkowski Distance: Versatile distance metric defined by a parameter 'p'; adapts to both Euclidean (p=2) and Manhattan (p=1) distances.
    • Cosine Similarity: Evaluates the angle between two vectors; particularly useful in high-dimensional spaces to determine directional similarity.
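
For concreteness, here is a sketch of the four metrics above in NumPy (the function names are illustrative):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))          # straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                  # right-angle (city-block) paths

def minkowski(a, b, p=2):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)  # p=1 -> Manhattan, p=2 -> Euclidean

def cosine_similarity(a, b):
    # Angle-based similarity; 1.0 means the vectors point the same way.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=3), cosine_similarity(a, b))
```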

    Model Evaluation

    • Confusion Matrix: Provides a visual representation of true/false predictions, classifying them into true positives, true negatives, false positives, and false negatives.
    • Accuracy: Represents the ratio of correct predictions to total predictions; though useful, may be misleading in the presence of class imbalance.
    • Precision and Recall: Precision assesses the correctness of positive predictions, while recall measures the model's ability to identify all actual positives.
    • F1 Score: The harmonic mean of precision and recall; balances both into a single overall effectiveness measure.
    • Cross-Validation: Statistical method for validating model performance by dividing the dataset into subsets, ensuring reliable performance metrics.
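
A short sketch of these metrics using scikit-learn, assuming it is installed; the labels `y_true`/`y_pred` are toy values made up for illustration.

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
# F1 is the harmonic mean: 2 * P * R / (P + R).
print("f1:       ", f1_score(y_true, y_pred))
```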

    Data Normalization

    • Importance: Essential to ensure all features contribute equivalently to distance computations, avoiding biases from different scales.
    • Min-Max Scaling: Rescales feature values into a specified range, typically from 0 to 1, for uniform contribution to distance metrics.
    • Z-score Normalization: Centers data around the mean while scaling it according to its standard deviation, effective for normally distributed features.
    • Robust Scaling: Utilizes median and interquartile ranges to reduce sensitivity to outliers, ensuring more stable results in diverse datasets.
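
The three schemes can be sketched directly in NumPy on a toy feature column (variable names are illustrative; note how robust scaling tames the outlier):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # note the outlier

min_max = (x - x.min()) / (x.max() - x.min())   # rescale to [0, 1]
z_score = (x - x.mean()) / x.std()              # center on mean, scale by std
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)         # median/IQR, outlier-tolerant

print(min_max)
print(z_score)
print(robust)
```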

    Overfitting Prevention

    • Choosing K: Selecting an optimal K value is vital to avoid overfitting; larger values can create smoother decision boundaries while reducing sensitivity to noise.
    • Cross-Validation: Aids in identifying the best K by evaluating model performance across various data partitions.
    • Dimensionality Reduction: Implements methods like Principal Component Analysis (PCA) to simplify the feature space, reducing the risk of overfitting.
    • Feature Selection: Involves picking only the most relevant features to simplify the model and enhance prediction performance.
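
A sketch of choosing K by cross-validation with scikit-learn, assuming it is available; the Iris dataset stands in for real data and the candidate range 1–15 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Score each candidate K with 5-fold cross-validation.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 16)
}
best_k = max(scores, key=scores.get)
print(f"best K = {best_k} (mean CV accuracy = {scores[best_k]:.3f})")
```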

    Applications in Classification

    • Image Recognition: Utilizes k-NN for classifying images by comparing them to known categories through their feature similarities.
    • Recommendation Systems: Employs user behavior patterns to suggest relevant products or content, enhancing user experience.
    • Medical Diagnosis: Classifies patient symptoms against historical data to assist in diagnosing diseases accurately.
    • Text Classification: Categorizes documents or messages based on their content and contextual similarity.
    • Anomaly Detection: Identifies rare patterns that deviate from the common data behavior, crucial for fraud detection and quality assurance.


    Description

    Explore the k-Nearest Neighbor (k-NN) classification method, which takes into account multiple neighbors for more accurate results. This quiz covers concepts like decision sets and decision rules that influence how classifications are determined. Test your understanding of this essential machine learning technique.
