
Full Transcript

Pattern Recognition (Αναγνώριση Προτύπων)
Nonparametric Techniques (Μη-Παραμετρικές Τεχνικές)
Panos Trahanias, UNIVERSITY OF CRETE, DEPARTMENT of COMPUTER SCIENCE

Non-Parametric Techniques

• Issues with parametric techniques:
  - The shape of the distribution is often not known.
  - In practice, many distributions are multimodal (they contain more than one mode), whereas most parametric models are unimodal.
  - Approximating a multimodal distribution as a product of unimodal ones is not appropriate.
• Non-parametric techniques: estimation of the distribution function from scratch, directly from the data.
  - Estimation of the pdfs p(x|ωi) from the data via a generalization of the multi-dimensional histogram.
  - Direct estimation of the posterior probabilities P(ωi|x) and of the discriminant functions.

Density (Distribution) Estimation

• Based on the fact that the probability P that a sample x falls within a region R is given by
  $P = P(\mathbf{x} \in R) = \int_{R} p(\mathbf{x})\,d\mathbf{x}$
• The integral above can be approximated by the product of p(x*) (for some point x* in R) with the volume V of R, while P itself is approximated by the fraction of samples that fall within R:
  $P \approx \frac{k}{n}, \qquad \int_{R} p(\mathbf{x})\,d\mathbf{x} \approx p(\mathbf{x}^{*})\,V \quad\Rightarrow\quad p(\mathbf{x}^{*}) \approx \frac{k/n}{V}$
  where V is the volume of R (in the one-dimensional case, V is the length of R), k is the number of samples within R, and n is the total number of samples.

Density (Distribution) Estimation

• To estimate the density at x, we choose a sequence of regions R1, R2, …, Rn that contain x, where Rn is used when n samples are available. Let Vn be the volume of Rn, kn the number of samples within the n-th region, and pn(x) the n-th estimate of p(x). Then
  $p_n(\mathbf{x}) = \frac{k_n/n}{V_n}$
• For kn/n to be a good estimate of P, and hence pn(x) a good estimate of p(x), the following should hold:
  $\lim_{n\to\infty} V_n = 0, \qquad \lim_{n\to\infty} k_n = \infty, \qquad \lim_{n\to\infty} \frac{k_n}{n} = 0$

Two Approaches

• There are two ways to create the sequence of regions Rn so that pn(x) converges to p(x):
  - Shrink an initial region by defining the sequence of volumes Vn as a function of n, e.g. $V_n = V_1/\sqrt{n}$ → density estimation via the Parzen Windows method.
  - Define kn as a function of n, e.g. $k_n = \sqrt{n}$, and let Vn grow until it contains kn samples → density estimation via the kn-Nearest Neighbor method.

Parzen Windows

• The method is based on counting the number of samples within a given region, where the region shrinks as the number of samples grows larger.
• The number of samples within the region is computed via a "window function" φ(·), the Parzen window. For a hypercube window of edge hn centered at x,
  $\varphi(\mathbf{u}) = \begin{cases} 1, & |u_j| \le 1/2, \; j = 1,\dots,d \\ 0, & \text{otherwise} \end{cases}$
  the number of samples that fall inside Rn (the hypercube with center x and width hn) is
  $k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$
  and the density estimate becomes
  $p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right), \qquad V_n = h_n^{d}$
  where Vn is the volume of the window function and hn is its width.

Parzen Windows

• Writing $\delta_n(\mathbf{x}) = \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right)$, the estimate becomes
  $p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(\mathbf{x}-\mathbf{x}_i)$
• The window φ(·) can be a general function, not necessarily a hypercube. For pn(x) to be a valid p.d.f. for every n, it should hold that
  $\varphi(\mathbf{u}) \ge 0 \quad \text{and} \quad \int \varphi(\mathbf{u})\,d\mathbf{u} = 1$
• pn(x) is a linear combination of φ(·), where every sample xi contributes to the estimate of p(x) according to its distance from x. If φ(·) is a valid p.d.f., then pn(x) converges to p(x) as n increases.
• A typical choice for φ(·) is (you guessed it!) the Gaussian function: p(x) is then computed as a superposition of Gaussians, each centered at the corresponding training sample, and the parameter hn is the Gaussian standard deviation!
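To make the estimator above concrete, here is a minimal sketch (not from the slides; the function name and toy data are illustrative assumptions) of a one-dimensional Parzen-window density estimate with a Gaussian window φ:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Gaussian Parzen-window estimate p_n(x) = (1/n) * sum_i (1/h) * phi((x - x_i)/h),
    with phi(u) = exp(-u^2/2) / sqrt(2*pi)."""
    u = (x - samples[:, None]) / h                 # shape: (n_samples, n_query_points)
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=0) / h                    # average of the n scaled windows

# Toy bimodal data (illustrative): a unimodal parametric model would fit this poorly.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(3.0, 1.0, 100)])

xs = np.linspace(-5, 7, 400)
for h in (0.1, 0.5, 2.0):                          # small h: noisy; large h: over-smoothed
    p = parzen_estimate(xs, samples, h)
    print(f"h={h}: estimate integrates to ~ {np.trapz(p, xs):.3f}")  # should be close to 1
```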
Parzen Windows

Figure: the distribution to be estimated is a superposition of Gaussians centered at the training samples x1, x2, x3, with standard deviation σ = hn.

Effect of Window Width hn

$\delta_n(\mathbf{x}) = \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right), \qquad V_n = h_n^{d}$

• The parameter hn affects both the window width and its amplitude: when hn is large (small), the window becomes wide (narrow), its amplitude is small (large), and x must be far from (close to) xi for $\delta_n(\mathbf{x}-\mathbf{x}_i)$ to change drastically compared to its value at $\delta_n(0)$.

Effect of Window Width hn

How the window width affects the estimate of the p.d.f. p(x):
• If hn is large, the estimate pn(x) results as a superposition of n "wide" functions centered at the training samples and constitutes a smooth, "out-of-focus" estimate of p(x) with low resolution.
• If hn is small, pn(x) results as a superposition of n "narrow" functions, i.e. an "erratic" or noisy estimate of p(x).

Convergence Properties of pn(x)

• Mean of the random variable pn(x):
  $\bar{p}_n(\mathbf{x}) = E[p_n(\mathbf{x})] = \int \delta_n(\mathbf{x}-\mathbf{v})\,p(\mathbf{v})\,d\mathbf{v}$
  i.e. the convolution of p(x) with the window function, hence a blurred version of p(x).
• It holds that $\delta_n(\mathbf{x}-\mathbf{v}) \to \delta(\mathbf{x}-\mathbf{v})$ as $V_n \to 0$, so $\bar{p}_n(\mathbf{x}) \to p(\mathbf{x})$ as $n \to \infty$.
• Variance of the random variable pn(x):
  $\operatorname{var}[p_n(\mathbf{x})] = \frac{1}{nV_n}\int \frac{1}{V_n}\,\varphi^{2}\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} \;-\; \frac{1}{n}\,\bar{p}_n^{\,2}(\mathbf{x}) \;\le\; \frac{\sup\varphi(\cdot)\;\bar{p}_n(\mathbf{x})}{nV_n}$
• Accordingly, we have small variance for large Vn! In the limit n → ∞, we may let Vn approach 0 and the variance will also approach 0, provided that nVn → ∞.
• Valid choices: $V_n = V_1/\sqrt{n}$ or $V_n = V_1/\ln n$.

Parzen Windows Examples

One-dimensional case with a Gaussian window:
$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right), \qquad \varphi(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^{2}/2}$

Two-dimensional case:
$p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n^{2}}\,\varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right), \qquad \varphi(\mathbf{u}) = \frac{1}{2\pi}\,e^{-\mathbf{u}^{T}\mathbf{u}/2}$

In both cases, as n approaches infinity the estimate becomes accurate, regardless of the window width.

Parzen Windows for Pattern Classification

• Estimate the likelihoods p(x|ωi) from the training data via the Parzen windows approach and use the Bayes rule for classification, i.e. compute the posterior probabilities and classify to the class with the largest posterior probability.
• Pros: no problem-specific assumptions are needed, only the existence of a training data set!
• Cons: a very large number of data is required for the estimate to converge to the actual distribution, and this requirement grows exponentially as the dimension d increases: the curse of dimensionality! The only way to overcome this is prior knowledge about "good" training data.
• The training error can become arbitrarily small (even zero) by choosing small windows. Still, this is not desirable, since it results in overfitting and reduces performance in the classification of test data.
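As an illustration of this classification scheme, the following sketch (a toy example under assumed data and equal priors, not code from the course) estimates p(x|ωi) with Gaussian Parzen windows for two classes and assigns each test point to the class with the largest posterior:

```python
import numpy as np

def parzen_pdf(x, samples, h):
    # Gaussian Parzen estimate of p(x) in 1-D: (1/n) sum_i (1/h) phi((x - x_i)/h)
    u = (x - samples[:, None]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).mean(axis=0) / h

def classify(x, class_samples, priors, h):
    # Bayes rule: pick the class maximizing p(x|w_i) * P(w_i)
    posteriors = np.stack([prior * parzen_pdf(x, s, h)
                           for s, prior in zip(class_samples, priors)])
    return posteriors.argmax(axis=0)

# Toy two-class training data (illustrative assumption).
rng = np.random.default_rng(1)
train = [rng.normal(-1.0, 0.7, 80),    # class 0
         rng.normal(2.0, 0.7, 80)]     # class 1

x_test = np.array([-1.5, 0.4, 2.3])
print(classify(x_test, train, priors=[0.5, 0.5], h=0.5))
# most likely output: [0 0 1]; the middle point lies close to the decision boundary
```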
Parzen Windows for Pattern Classification

• A very small window leads to a very small/detailed separation of the feature space: not desirable!
• A larger window leads to a larger training error, but to better generalization performance: a desirable property.
• In practice we are looking for small windows in areas with many (dense) training samples, and large windows in areas with few (sparse) training samples. How can this be accomplished?

Probabilistic Neural Networks (PNN)

• Input layer: {xk; k=1,…,d}, d nodes, each corresponding to a feature of the d-dimensional input vector x.
• wjk: weights that link the k-th input node with the j-th node of the hidden layer (pattern node); the input and hidden layers are fully connected.
• Hidden layer: n nodes, each corresponding to a pattern, i.e. a training sample, j=1,…,n.
• Output layer: c nodes, each representing a class.
• aji: weights that link the j-th hidden node with the i-th output node, i=1,…,c; the hidden and output layers are sparsely connected.

PNN Training

• The j-th training sample (pattern) xj is normalized to unit length and given as input (to the input nodes).
• The weights are set as wjk = xjk.
• Initially aji = 0 for j=1,…,n and i=1,…,c. A unique link with weight aji = 1 is then established from the j-th hidden node to the output node that corresponds to the (known) class of xj.

PNN Classification

• Each pattern node computes the inner product of its weight vector with the normalized input x, $net_j = \mathbf{w}_j^{t}\mathbf{x}$, and outputs the activation
  $e^{(net_j - 1)/\sigma^{2}}$
  where σ is the width of the Gaussian Parzen window.
• Each class node sums the outputs of the pattern nodes that are connected to it. Accordingly, the activation of each class node represents the p.d.f. estimate obtained with a Gaussian Parzen window with covariance matrix $\sigma^{2} I_{d\times d}$, where I is the identity matrix.

kn-Nearest Neighbor (KNN)

• Window width: instead of choosing it as a function of the number of training samples ($V_n = V_1/\sqrt{n}$), why not choose it as a function of the training data themselves?
• Note: a large window is desirable in areas with a small number of data, whereas a small window is appropriate in dense (data-rich) areas!
• kn-nearest-neighbor estimation algorithm:
  - Choose an initial region centered at the point x where p(x) is to be estimated.
  - Increase the window until a predefined number kn of samples falls inside it; these are the kn nearest neighbors of x.
  - The pdf (density) is estimated as $p_n(\mathbf{x}) = \frac{k_n/n}{V_n}$
• Convergence conditions for pn(x): $\lim_{n\to\infty} k_n = \infty$ and $\lim_{n\to\infty} k_n/n = 0$.

KNN

• If $k_n = \sqrt{n}$ and, assuming that pn(x) is a good estimate of p(x), then $V_n \approx \frac{1}{\sqrt{n}\,p(\mathbf{x})}$.
• Accordingly, Vn is once more of the form $V_1/\sqrt{n}$, but this time the initial volume V1 is determined by the pdf p(x) of the training data and is not an arbitrary choice.
• Moreover, for each n, the size of the region Vn is a function of x, i.e. Vn = Vn(x).

How to Choose kn

Note that as kn increases, the estimation accuracy increases. In classification problems, we usually tune kn (or hn for Parzen windows) so that the classifier attains the lowest error on the validation dataset.

KNN Classification

• KNN can be employed to estimate the posterior probabilities: the posterior probability in a region around x is calculated as the fraction of the samples within the region that carry the class label ωi.
  - n: total number of patterns from all classes
  - ki: number of patterns from class i in the region around x
  - k: total number of patterns from all classes in the region around x
  $p_n(\mathbf{x}, \omega_i) = \frac{k_i/n}{V}, \qquad P_n(\omega_i \mid \mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}$

KNN Classification

• To classify a pattern x:
  - From the n training samples, identify the k nearest neighbors of x regardless of their class (k is chosen odd for a two-class problem).
  - Let ki be the number of these samples belonging to class i.
  - Classify x to the class with the largest ki!

Nearest Neighbor Classifier

• The simplest KNN classifier is obtained for k=1! In this case x assumes the class of its nearest neighbor. Is this a good classifier?
• The NN classifier partitions the feature space into a Voronoi tessellation, where each cell assumes the class label of the training sample within it.
• Given an infinite number of training samples, the classification error probability is bounded by twice the error of the Bayesian classifier (valid for small error probabilities).
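A minimal sketch of the k-nearest-neighbor rule above (majority vote among the k closest training samples under the Euclidean distance); the toy data and function name are illustrative assumptions:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    counts = np.bincount(y_train[nearest])        # k_i: neighbors belonging to each class
    return counts.argmax()                        # class with the largest k_i

# Toy two-class training set (illustrative assumption).
rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal([0.0, 0.0], 0.8, (50, 2)),    # class 0
                     rng.normal([3.0, 3.0], 0.8, (50, 2))])   # class 1
y_train = np.array([0] * 50 + [1] * 50)

print(knn_classify(np.array([0.5, 0.2]), X_train, y_train))   # expected: 0
print(knn_classify(np.array([2.8, 3.1]), X_train, y_train))   # expected: 1
```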
Error Rates

Figure: error-rate bounds of the NN and k-NN classifiers relative to the Bayes error.

Computational Issues

• Partial distances: use a subset of r of the d dimensions to compute the distance of the unknown pattern from the training samples:
  $D_r(\mathbf{a}, \mathbf{b}) = \left( \sum_{k=1}^{r} (a_k - b_k)^{2} \right)^{1/2}$
• Search trees: establish "search trees" where the training samples are selectively linked so that, for the classification of a new pattern, distance computations are required only to some entry/root patterns and to the patterns linked to them.
• Editing, pruning, or condensing: prune patterns that are surrounded by patterns of the same class.
  1. Compute the Voronoi diagram of the training patterns.
  2. For each pattern, if any of its neighbors is not of the same class, mark it.
  3. Prune the non-marked patterns and compute the revised Voronoi diagram.

Metrics and KNN Classification

• Properties of a metric:
  - non-negativity: $D(\mathbf{a}, \mathbf{b}) \ge 0$
  - reflexivity: $D(\mathbf{a}, \mathbf{b}) = 0$ iff $\mathbf{a} = \mathbf{b}$
  - symmetry: $D(\mathbf{a}, \mathbf{b}) = D(\mathbf{b}, \mathbf{a})$
  - triangle inequality: $D(\mathbf{a}, \mathbf{b}) + D(\mathbf{b}, \mathbf{c}) \ge D(\mathbf{a}, \mathbf{c})$
• Euclidean distance: $L_2(\mathbf{a}, \mathbf{b}) = \left( \sum_{k=1}^{d} (a_k - b_k)^{2} \right)^{1/2}$
• Minkowski metric: $L_p(\mathbf{a}, \mathbf{b}) = \left( \sum_{k=1}^{d} |a_k - b_k|^{p} \right)^{1/p}$
• Manhattan distance: $L_1(\mathbf{a}, \mathbf{b}) = \sum_{k=1}^{d} |a_k - b_k|$
• Chess-board distance: $L_\infty(\mathbf{a}, \mathbf{b}) = \max_{k=1,\dots,d} |a_k - b_k|$

Figure: the set of points at distance 1 from the center under each of the metrics Lp.

Metrics and KNN Classification

• When the pattern space is transformed by multiplying each feature by a constant, distances in the transformed space may differ significantly from the initial ones. Evidently, this affects the performance of the KNN classifier.
• It is of utmost importance to employ metrics that are invariant under basic transformations, such as translation (shift), scaling, rotation, line thickness, and shear.
• Example case: optical character recognition (OCR). Consider the Euclidean distance between a "5" and the same "5" translated by s pixels. For translations larger than 1 pixel, the distance between the initial 5 and the translated 5 is larger than the distance between the initial 5 and an 8, and hence an NN classifier that employs the Euclidean distance will classify erroneously.

Tangent Distance

• Let there be r transformations αi.
• Let x' be a pattern.
• Transformed pattern: $F_i(\mathbf{x}'; \alpha_i)$.
• Tangent vector: $TV_i = F_i(\mathbf{x}'; \alpha_i) - \mathbf{x}'$.
• Tangent matrix: $T_{d\times r} = [\,TV_1, \dots, TV_r\,]$.
• Tangent space: the space defined by the r linearly independent tangent vectors TVi that pass through x'. It forms a linear approximation of the space of transformed versions of x'.
• Tangent distance:
  $D_{\tan}(\mathbf{x}', \mathbf{x}) = \min_{\mathbf{w}} \left\| (\mathbf{x}' + T\mathbf{w}) - \mathbf{x} \right\|$
  i.e. the Euclidean distance of x from the tangent space of x'.

Reduced Coulomb Energy (RCE) Networks

• During training, an RCE network tunes the window width around each pattern according to its distance from the nearest pattern of a different class.
• Each training pattern initiates a new circle, and the circle radii are adapted so that they do not contain patterns of different classes.
• In the figure, black circles represent class 1 and pink circles class 2, whereas dark red areas represent ambiguous regions where no classification decision can be made.
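The following toy sketch is one possible reading of the radius rule described above (not code from the course): each prototype's radius is set just below the distance to its nearest pattern of a different class, and a test point covered by circles of more than one class is reported as ambiguous.

```python
import numpy as np

def rce_train(X, y, eps=1e-6):
    """Set each prototype's radius just below the distance to its
    nearest training pattern of a different class."""
    radii = np.empty(len(X))
    for j, (xj, yj) in enumerate(zip(X, y)):
        other = X[y != yj]                                    # patterns of the other classes
        radii[j] = np.linalg.norm(other - xj, axis=1).min() - eps
    return radii

def rce_classify(x, X, y, radii):
    """Return the class if x lies only inside circles of one class,
    'ambiguous' if circles of several classes cover it, 'none' otherwise."""
    inside = np.linalg.norm(X - x, axis=1) <= radii
    classes = set(y[inside].tolist())
    if len(classes) == 1:
        return classes.pop()
    return "ambiguous" if classes else "none"

# Toy two-class prototypes (illustrative assumption).
X = np.array([[0.0, 0.0], [1.0, 0.2], [3.0, 3.0], [3.5, 2.5]])
y = np.array([0, 0, 1, 1])
radii = rce_train(X, y)

for point in ([0.5, 0.1], [3.2, 2.8], [1.9, 1.5]):
    print(point, "->", rce_classify(np.array(point), X, y, radii))
# expected: class 0, class 1, ambiguous (the last point lies inside circles of both classes)
```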
