KNN - Nearest Neighbor Methods - PDF

Summary

This document presents an overview of nearest neighbor methods and spatial indexing, focusing on the k-NN classifier, dissimilarity measures, and data structures for fast neighbor search. The author discusses practical issues such as normalizing data, handling categorical attributes, and choosing k in k-NN classification.

Full Transcript

Nearest neighbor methods. Spatial indexing
Szymon Jaroszewicz

Motivation

How do human experts solve problems? One approach is to look at similar cases they have seen before. This is exactly the idea behind k-nearest neighbor methods.

The k nearest neighbors classifier (k-NN)

To classify a new instance x:
1. Find the set NNk(x) = {(x1, y1), ..., (xk, yk)} ⊂ D containing the k nearest neighbors of x, i.e. (x′, y′) ∈ D \ NNk(x) implies d(x, x′) ≥ d(x, xi) for all 1 ≤ i ≤ k, where d is some dissimilarity measure.
2. Pick the class y which is most frequent among {y1, ..., yk}.

The k nearest neighbors classifier (k-NN)

- There is no model building stage: the training sample itself is the model.
- k-NN is a memory based method.
- k-NN is a lazy classifier.

Dissimilarity measures

Definition. A function d(x, y) is a dissimilarity measure if, for all x, y:
1. d(x, y) ≥ 0
2. d(x, y) = 0 ⇔ x = y
3. d(x, y) = d(y, x)
We don't require the triangle inequality to hold, so a dissimilarity need not be a distance (but it does not hurt).

Dissimilarity measures

Most frequently used:
- Euclidean distance
- squared Euclidean distance (does not satisfy the triangle inequality)
- modifications of them: weighted Euclidean distance d(x, y) = √(Σi wi(xi − yi)²), Mahalanobis distance
Other metrics such as Lp distance can be used too. Problem specific dissimilarities are possible, e.g. taking into account symmetries.
Common problems: unequal attribute scales, categorical attributes.

Unequal attribute scales

Suppose we have two attributes:
- age ∈ (0, 10²)
- yearly income ∈ (0, 10¹⁰)
Their scales differ hugely. In practice, the decision would be made based on income only. Solution: normalize attribute values.

Unequal attribute scales – normalization

Two basic techniques:
1. Normalize attributes' values to the range [0, 1] or [−1, 1].
2. Standardize each attribute: Xi′ = (Xi − µi) / σi
Another option is to use weighted Euclidean distance. In practice it is a good idea to normalize attributes for other methods also, e.g. logistic regression.

Dissimilarity measures – categorical attributes

Categorical attributes cannot be directly used in distance computations. Solutions:
1. Treat binary attributes as real valued {0, 1} attributes.
2. While computing the distance assume (xi − yi)² := 0 if xi = yi, and 1 if xi ≠ yi.
3. Generalize the preceding approach by introducing a dissimilarity matrix for each categorical attribute. This makes sense if some values are more similar than others, e.g. if the attribute is language, German might be more similar to English than to Russian.
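
A minimal sketch (my own, not from the lecture) combining the last few slides: standardize the numeric attributes, then use a squared-Euclidean-style dissimilarity with the 0/1 rule for a categorical attribute. The function names and toy records are illustrative assumptions.

```python
import numpy as np

def standardize(X):
    # Xi' = (Xi - mu_i) / sigma_i, column by column
    return (X - X.mean(axis=0)) / X.std(axis=0)

def dissimilarity(a_num, a_cat, b_num, b_cat):
    # squared-Euclidean part for numeric attributes plus one 0/1 term
    # per categorical attribute (0 if the values are equal, 1 otherwise)
    num_part = np.sum((a_num - b_num) ** 2)
    cat_part = np.sum(np.asarray(a_cat) != np.asarray(b_cat))
    return num_part + cat_part

# toy records: (age, yearly income) plus a categorical "language" attribute
num = np.array([[25.0, 30_000.0], [47.0, 1_000_000.0], [26.0, 32_000.0]])
cat = [["German"], ["Russian"], ["English"]]

Z = standardize(num)   # without this step, income alone would dominate the distance
print(dissimilarity(Z[0], cat[0], Z[1], cat[1]))
print(dissimilarity(Z[0], cat[0], Z[2], cat[2]))
```
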
k-NN classifier: ties

Recall that we pick the class which is most frequent among the k nearest neighbors of the classified point. What if there is a tie?
- For two-class problems: pick an odd k, and ties will never happen.
- Make the decision based on the closest point (switch to 1-NN).
- Add more nearest neighbors until a decision can be made.
- Use dissimilarity based weighting...

k-NN classifier: dissimilarity based weighting

Each of the k nearest neighbors (xi, yi) casts a vote proportional to 1 / d(x, xi), where x is the new instance being classified.
Advantages:
- natural interpretation: closer points have more weight
- ties are much less likely
- if the model is used for scoring, the scores are more diverse (otherwise only k different values of the score are possible)
- if the model is used to predict class probabilities, the predictions are more diverse (otherwise only k different probability values are possible)

k-NN classifier: the choice of k

Typically, in the literature, small values of k are suggested, e.g. 3 or 5. This is often a mistake! Much larger values of k are often advantageous:
- large values of k decrease the variance of the model
- larger values of k guarantee finite Vapnik-Chervonenkis dimension

k-NN classifier: the choice of k

An example given in the paper On Bias, Variance, 0/1-loss, and the Curse-of-Dimensionality by Jerome H. Friedman:
- X1, ..., Xm uniformly distributed on [0, 1]ᵐ (the unit hypercube)
- Y = 1[X1 > 0.5]; X1 functionally determines Y
- 20 attributes, 3200 samples
Let's see how the choice of k affects classification error...

k-NN classifier: the choice of k

[Figure: classification error as a function of k; source: Friedman, 1997]
Optimal k = 1651

k-NN classifier: the choice of k

For comparison, the mean squared error of the predicted probabilities:
[Figure: MSE of predicted probabilities as a function of k; source: Friedman, 1997]
A very different picture! Justification: the true conditional probabilities ∈ {0, 1}. Remember: getting the probabilities right is not always necessary for good classification.

k-NN classifier: the choice of k

The reason for the large optimal k is high model stability (low model variance); see Friedman 1997 for a detailed discussion. A word of caution: this only works for balanced classes; otherwise a bias correction needs to be applied (e.g. records weighted such that class priors are equal), see the paper for details.

k-NN classifier: the choice of k

For any constant k, the Vapnik-Chervonenkis dimension of the k-NN algorithm is infinite, i.e. the error does not converge to the minimum as the sample size grows.
Klęsk, Korzeń, 2011: the Vapnik-Chervonenkis dimension of the k-NN algorithm is finite if k grows proportionally to the size of the training set N.
Conclusion: the number of nearest neighbors should be larger for larger training sets; otherwise serious overfitting may occur.

k-NN classifier: the choice of k

A practical solution: choose the best value of k based on cross-validation. Make sure to include larger values of k in the selection process.
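
A hedged sketch of this advice using scikit-learn: standardize, use the 1/d distance weighting from the earlier slide, and cross-validate over a grid of k values that includes large ones. The synthetic data set and the particular grid are my own choices, not the lecture's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=3200, n_features=20, random_state=0)

# weights="distance" implements the 1/d(x, xi) voting scheme
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(weights="distance"))

# include large values of k, not just 3 or 5
grid = {"kneighborsclassifier__n_neighbors": [3, 5, 15, 51, 151, 501, 1501]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```
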
Asymptotic error of 1-NN

Let us pick a small region R of the sample space. Suppose R is small enough that the class distribution P(Y | x′) is constant for all x′ ∈ R. Suppose also that a new case x to be classified belongs to R.
- The true class of x follows the distribution P(Y | x).
- x can be closest to any point in R, so the class of the nearest neighbor of x also follows P(Y | x).
- This is like picking the class for x at random from P(Y | x). And the error is...

Asymptotic error of 1-NN

It can be shown that
1 − Σ_y P²(Y = y | x) ≤ 2(1 − max_y P(Y = y | x))
When N → ∞, the nearest neighbor will lie very close to x, and the assumption of the conditional distribution being constant will hold. Thus, the above inequality holds for every x. So the error e of the 1-NN classifier satisfies e ≤ 2e*, where e* is the Bayes error.

Asymptotic error of k-NN

What if k > 1? If k, N → ∞ and k/N → 0, then e → e*, i.e. the error tends to the Bayes error. This is another indication that k should grow with N.

Other developments

- Learn problem specific distance metrics (still active research).
- Reduce training set size: e.g. remove examples which do not affect classification results.
- Efficiency: multidimensional indexing is a very active research area.

KNN – current applications

KNN is old but still used. Facebook, Google, and others develop libraries for fast nearest neighbor search.
- Recommendations: find the most similar users, find the most similar products.
- Answering queries: develop a vector representation of texts (e.g. using neural networks), convert answers to the representation, convert a query to the same representation, find the answer with the closest vector.
- Quick adding of exceptions: KNN can accompany another model; when the model output needs to be fixed for a given case, add the case to KNN (the Yandex story).

KNN – current applications

Few shot learning:
- humans can learn to recognize an object after seeing it even once
- neural networks require thousands of images of a new object
- solution: few-shot learning
One approach: prototypical networks
1. A neural network converts an image to a 64-element vector.
2. Average the vectors for each class ⇒ prototype.
3. Classify an image to the nearest prototype (k-NN).
4. To add a new class, just add a new prototype.

Literature

1. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Chapter 13.3.3 [problem specific distance measure taking into account symmetries]
2. J. Friedman, On Bias, Variance, 0/1-loss, and the Curse-of-Dimensionality, Data Mining and Knowledge Discovery 1(1), 1997, pp. 55-77

k-nearest neighbors – implementation

The most obvious approach:
1. for a new case x to be classified,
2. compute the distances to all training cases,
3. pick the k closest ones.
Time complexity for classifying a single observation: O(N). Can be very slow! Solution: spatial indexing.
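
As a concrete illustration of this most obvious implementation, here is a small NumPy sketch (the data and the function name are made up): every prediction scans all N training cases.

```python
# Brute-force k-NN for a single query: distances to all N training cases,
# then the k closest ones and a majority vote. O(N) per classified observation.
import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training case
    nearest = np.argpartition(dists, k)[:k]       # indices of the k smallest distances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # most frequent class among the k

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20))
y_train = (X_train[:, 0] > 0).astype(int)
print(knn_predict(np.zeros(20), X_train, y_train, k=25))
```
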
Spatial indexing

Binary search trees

[Figure: example binary search tree with keys 5.1, 2.7, 8.4, 1.0, 3.1, 5.2, 9.9, 3.0, 3.2]

Binary search trees – finding an element

Procedure Find(x, node)
Output: Boolean value indicating whether the tree contains x
1 If node.x == x: Return True
2 If node.isLeaf: Return False
3 If x < node.x
4   Return Find(x, node.left)
5 Else
6   Return Find(x, node.right)
Initial call: Find(x, root)

Binary search trees – finding a nearest neighbor

Procedure FindNN(x, node)
1 If node.isLeaf
2   Return node.x
3 If x < node.x
4   nn ← FindNN(x, node.left)
5   If node.x − x < d(x, nn)   # need to also check the root of the subtree
6     nn ← node.x              # node.x is closer to x than all points in the right subtree, so we can skip the right subtree
7   Return nn
8 Else ...

Binary search trees – time complexity

1. If the tree is balanced, the search is O(log n), where n is the number of points in the tree.
2. The same time complexity holds for finding the nearest neighbor.
3. ... but in one dimension this is not that interesting.

Tree based multidimensional indexing

We would like to generalize BSTs to m dimensions. We want to quickly find nearest neighbors (hopefully in O(log n) time).
General idea: if the current nearest neighbor is closer to x than all nodes in some part of the tree, we can skip that part.

KD-trees (k-dimensional trees)

The simplest spatial index. It is quite easy to implement if you need to do it yourself; other spatial data structures are actually quite difficult to implement. If possible you shouldn't implement algorithms yourself, use libraries.
Main idea: a KD-tree is a binary search tree, except that at every level of the tree we compare on a different dimension.

KD-tree – example

[Figure: example KD-tree over the points (5, 1), (2, 6), (8, 3), (4, 3), (3, 8), (7, 2), (6, 4), (1, 7), (4, 7); consecutive levels split on X1, X2, X1, ...]

KD-tree – example

[Figure: the same points and the corresponding recursive splits of the plane]

KD-tree – construction algorithm

Procedure BuildKDtree(D, depth)
Input: m-dimensional dataset D
Output: KD-tree T for D
1 i ← 1 + (depth mod m)
2 M ← median of the i-th coordinate of D
3 T ← new node
4 T.x ← M
5 T.left ← BuildKDtree({x ∈ D : xi ≤ M}, depth + 1)
6 T.right ← BuildKDtree({x ∈ D : xi > M}, depth + 1)
7 Return T
Initial call: BuildKDtree(D, 0)

KD-tree – construction algorithm

Time complexity:
- finding the median by sorting: O(n log² n)
- finding the median using a tricky linear time algorithm: O(n log n)

KD-trees – finding a nearest neighbor

Procedure FindNN(x, node)
Output: Nearest neighbor of x
1 If node.isLeaf: Return node.x
2 i ← 1 + (node.depth mod m)
3 If xi < node.xi
4   nn ← FindNN(x, node.left)
5   If node.xi − xi < d(x, nn)
6     If d(x, node.x) < d(x, nn): nn ← node.x
7     nn2 ← FindNN(x, node.right)
8     If d(x, nn2) < d(x, nn): nn ← nn2
9   Return nn
10 Else ...
Identical to the BST version, except that we compare on a different coordinate at each level.
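
To complement the pseudocode, a compact Python sketch of a KD-tree. It uses a common point-per-node variant rather than the leaf-based layout above; class and function names are my own, and the last two lines are only a brute-force sanity check.

```python
import numpy as np

class Node:
    def __init__(self, point, left=None, right=None, depth=0):
        self.point, self.left, self.right, self.depth = point, left, right, depth

def build_kdtree(points, depth=0):
    # split on coordinate (depth mod m), at the median point of that coordinate
    if len(points) == 0:
        return None
    i = depth % points.shape[1]
    points = points[np.argsort(points[:, i])]
    mid = len(points) // 2
    return Node(points[mid],
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1),
                depth)

def find_nn(x, node, best=None):
    # descend towards x first, then visit the other subtree only if the
    # splitting plane is closer to x than the best point found so far
    if node is None:
        return best
    if best is None or np.linalg.norm(x - node.point) < np.linalg.norm(x - best):
        best = node.point
    i = node.depth % len(x)
    near, far = ((node.left, node.right) if x[i] < node.point[i]
                 else (node.right, node.left))
    best = find_nn(x, near, best)
    if abs(node.point[i] - x[i]) < np.linalg.norm(x - best):
        best = find_nn(x, far, best)
    return best

rng = np.random.default_rng(0)
data = rng.random((1000, 2))
q = np.array([0.3, 0.7])
print(find_nn(q, build_kdtree(data)))
print(data[np.argmin(np.linalg.norm(data - q, axis=1))])  # brute-force check, same point
```
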
KD-trees – finding a nearest neighbor

Time complexity:
- for random points: O(log n)
- in practice often: O(n^((m−1)/m))
- works best when n ≫ 2ᵐ
Other spatial data structures are supposed to work better in higher dimensions, but for very high dimensions the performance degrades. In practice we typically still gain quite a lot compared to linear search.

R-trees

- Every node contains several points.
- All points in a node (and its subnodes) are contained in a hyperrectangle.
- Hyperrectangles are allowed to overlap.
- During nearest neighbor search: if the current nearest neighbor is closer to x than all points in a subtree (i.e. its hyperrectangle), skip that subtree.

R-trees

[Figure: example R-tree; source: Wikipedia]

R-trees

If the tree is constructed well, O(log n) search is possible. For higher dimensions, the nodes' hyperrectangles overlap more and more, which kills performance. The implementation is quite complicated. Many variants exist: R*-trees, Hilbert R-trees, etc. R-trees are very popular in GIS (Geographic Information Systems), spatial databases, etc.:
- mainly used in 2-d or 3-d
- disk based implementations are possible

Ball-trees

Like R-trees, but the points in each node are contained in an m-dimensional ball, not in a hyperrectangle. This means less overlap. Ball trees are supposed to work well in higher dimensions, giving O(log n) search time. Currently this seems to be the spatial structure of choice for exact NN search.

Approximate NN search

Exact methods don't work for large problems, so we need to approximate:
- Locality Sensitive Hashing (random projections)
- clustering (simple but effective)
- random walks on local small world graphs
New methods appear quite regularly.

Approximate NN search – libraries

- ANNOY (Approximate Nearest Neighbors Oh Yeah) from Spotify; uses a random projection based tree
- FAISS from Facebook; a general package with many methods
- NMSLIB (Non-Metric Space Library); Hierarchical Navigable Small Worlds (HNSW) implementation
- ScaNN from Google; vector quantization (k-means), but modifies the loss function to be more sensitive in important directions
- ann-benchmarks.com: a benchmarking site
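
For exact search backed by a ball tree, a library call is usually all that is needed. A small interface sketch using scikit-learn's NearestNeighbors; the random data stands in for a real feature matrix, and the approximate libraries listed above each have their own, different APIs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 64))        # e.g. 64-dimensional embedding vectors
queries = rng.normal(size=(5, 64))

# exact nearest neighbor search using a ball tree index
index = NearestNeighbors(n_neighbors=10, algorithm="ball_tree").fit(X)
distances, indices = index.kneighbors(queries)   # 10 nearest training rows per query
print(indices.shape)                             # (5, 10)
```
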
