Pattern Recognition (PDF)
Mansoura University
Nagham Elsayed Mekky
Summary
This document is a lecture on pattern recognition, focusing on Bayesian decision theory. It covers terminology (priors, likelihoods, posteriors), conditional and overall risk, the minimum-risk (Bayes) decision rule, discriminant functions and decision surfaces, and the normal density.
Full Transcript
Pattern Recognition
Nagham Elsayed Mekky
Faculty of Computers and Information Sciences, Mansoura University, Mansoura

Chapter 2: Bayesian Decision Theory

Terminology
State of nature ω (class label): e.g., ω1 for sea bass, ω2 for salmon.
Probabilities P(ω1) and P(ω2) (priors): e.g., prior knowledge of how likely it is to get a sea bass or a salmon.
Probability density function p(x) (evidence): e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness).

Terminology (cont'd)
Conditional probability density p(x|ωj) (likelihood): e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj (e.g., lightness distributions for the salmon and sea-bass populations).

Terminology (cont'd)
Conditional probability P(ωj|x) (posterior): e.g., the probability that the fish belongs to class ωj given feature x. Ultimately, we are interested in computing P(ωj|x) for each class ωj.

Conditional Risk (or Expected Loss)
Suppose we observe x and take action αi. The conditional risk (or expected loss) of taking action αi is defined as:
R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j) \, P(\omega_j \mid x)

Overall Risk
Suppose α(x) is a general decision rule that determines which of the actions α1, α2, ..., αl to take for every x. The overall risk is defined as:
R = \int R(\alpha(x) \mid x) \, p(x) \, dx
The optimum decision rule is the Bayes rule.

Overall Risk (cont'd)
The Bayes rule minimizes R by:
(i) computing R(αi|x) for every αi given an x, and
(ii) choosing the action αi with the minimum R(αi|x).
The resulting minimum R* is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:
R^{*} = \min R

Example: Two-category classification
Define the actions
α1: decide ω1
α2: decide ω2
and let λij = λ(αi|ωj) be the loss incurred for deciding ωi when the true state of nature is ωj.
The conditional risks are:
R(\alpha_1 \mid x) = \lambda_{11} P(\omega_1 \mid x) + \lambda_{12} P(\omega_2 \mid x)
R(\alpha_2 \mid x) = \lambda_{21} P(\omega_1 \mid x) + \lambda_{22} P(\omega_2 \mid x)

Example: Two-category classification (cont'd)
Minimum risk decision rule: decide ω1 if
R(\alpha_1 \mid x) < R(\alpha_2 \mid x)
or, equivalently,
(\lambda_{21} - \lambda_{11}) \, p(x \mid \omega_1) P(\omega_1) > (\lambda_{12} - \lambda_{22}) \, p(x \mid \omega_2) P(\omega_2)
or, using the likelihood ratio,
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}
(likelihood ratio > threshold)

Special Case: Symmetrical or Zero-One Loss Function
Assign the same loss to all errors:
\lambda(\alpha_i \mid \omega_j) = 0 \text{ if } i = j, \quad 1 \text{ if } i \neq j
The conditional risk corresponding to this loss function is:
R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)

Special Case: Zero-One Loss Function (cont'd)
The decision rule becomes: decide ω1 if
P(\omega_1 \mid x) > P(\omega_2 \mid x)
or
p(x \mid \omega_1) P(\omega_1) > p(x \mid \omega_2) P(\omega_2)
or
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}
The overall risk turns out to be the average probability of error!

Example
Assuming general loss: decide ω1 if p(x|ω1)/p(x|ω2) > θb.
Assuming zero-one loss: decide ω1 if p(x|ω1)/p(x|ω2) > P(ω2)/P(ω1) = θa, otherwise decide ω2.
\theta_a = \frac{P(\omega_2)}{P(\omega_1)}, \qquad \theta_b = \frac{P(\omega_2)\,(\lambda_{12} - \lambda_{22})}{P(\omega_1)\,(\lambda_{21} - \lambda_{11})}, \qquad \text{assuming } \lambda_{12} > \lambda_{21}
(Figure: decision regions for the two thresholds.)

Example (cont'd)
If we penalize mistakes in classifying ω1 patterns as ω2 more than the converse (i.e., λ12 > λ21), then Eq. 17 leads to the threshold θb marked in the figure. Note that the range of x values for which we classify a pattern as ω1 gets smaller, as it should.

2.4 Classifiers, Discriminant Functions and Decision Surfaces
2.4.1 The Multi-Category Case
There are many different ways to represent pattern classifiers. One of the most useful is in terms of a set of discriminant functions gi(x), i = 1, ..., c. The classifier is said to assign a feature vector x to class ωi if
g_i(x) > g_j(x) \quad \text{for all } j \neq i
Thus, the classifier is viewed as a network or machine that computes c discriminant functions and selects the category corresponding to the largest discriminant.
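To make the two-category minimum-risk rule above concrete, here is a minimal Python sketch. The Gaussian class-conditional densities, the priors, and the loss values are made-up illustrative choices (not from the lecture); the threshold follows the likelihood-ratio form derived in the example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D class-conditional densities for a lightness feature x
p1 = norm(loc=4.0, scale=1.0)      # p(x|w1), e.g., sea bass
p2 = norm(loc=6.0, scale=1.0)      # p(x|w2), e.g., salmon
P1, P2 = 0.6, 0.4                  # priors P(w1), P(w2)

# Loss matrix: lam[i, j] = lambda(alpha_{i+1} | w_{j+1})
lam = np.array([[0.0, 2.0],        # lambda_11, lambda_12
                [1.0, 0.0]])       # lambda_21, lambda_22  (lambda_12 > lambda_21)

def decide(x):
    """Decide w1 if the likelihood ratio exceeds the minimum-risk threshold theta_b."""
    likelihood_ratio = p1.pdf(x) / p2.pdf(x)
    theta_b = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * (P2 / P1)
    return "w1" if likelihood_ratio > theta_b else "w2"

print(decide(4.5), decide(5.5))    # -> 'w1 w2' for these made-up parameters
```

With a zero-one loss matrix the same threshold reduces to θa = P(ω2)/P(ω1), matching the special case in the lecture.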
Discriminants for Bayes Classifier
Assuming a general loss function: gi(x) = −R(αi|x).
Assuming the zero-one loss function: gi(x) = P(ωi|x).

Discriminants for Bayes Classifier (cont'd)
Is the choice of gi unique? Replacing gi(x) with f(gi(x)), where f(·) is a monotonically increasing function, does not change the classification results. Equivalent choices include:
g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i) P(\omega_i)}{p(x)}
g_i(x) = p(x \mid \omega_i) P(\omega_i)
g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)
We'll use this last discriminant extensively!

Case of Two Categories
It is more common to use a single discriminant function instead of two. Examples:
g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)
g(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}

Decision Regions and Boundaries
Discriminants divide the feature space into decision regions R1, R2, ..., Rc, separated by decision boundaries. A decision boundary is defined by g1(x) = g2(x).

2.5 The Normal Density
Univariate density. We begin with the continuous univariate normal or Gaussian density,
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right]
where μ is the mean (or expected value) of x and σ² is the expected squared deviation, or variance.

Multivariate normal density
p(x) ∼ N(μ, Σ). The general multivariate normal density in d dimensions is written as
p(x) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu)^{t} \Sigma^{-1} (x - \mu) \right]
where x is a d-component column vector, μ is the d-component mean vector, Σ is the d-by-d covariance matrix, |Σ| and Σ⁻¹ are its determinant and inverse, respectively, and (x − μ)^t is the transpose of x − μ. The expected value of a vector or a matrix is found by taking the expected values of its components. In other words, if xi is the ith component of x, μi the ith component of μ, and σij the ijth component of Σ, then
\mu_i = E[x_i], \qquad \sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]

2.6 Discriminant Functions for the Normal Density N(μ, Σ)
In Sect. 2.4.1 we saw that minimum-error classification can be achieved by use of the discriminant function
g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)
This expression can be readily evaluated if the densities p(x|ωi) are multivariate normal, i.e., if p(x|ωi) ∼ N(μi, Σi). In this case, then,
g_i(x) = -\frac{1}{2} (x - \mu_i)^{t} \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)

Multivariate Gaussian Density: Case I
Let us examine the discriminant function and resulting classification for a number of special cases.
Σi = σ²I (diagonal matrix): the features are statistically independent and each feature has the same variance.

Multivariate Gaussian Density: Case I (cont'd)
In this case the discriminant functions are linear:
g_i(x) = w_i^{t} x + w_{i0}, \qquad w_i = \frac{\mu_i}{\sigma^{2}}, \qquad w_{i0} = -\frac{1}{2\sigma^{2}} \mu_i^{t} \mu_i + \ln P(\omega_i)
A classifier that uses linear discriminant functions is called "a linear machine". We call wi0 the threshold or bias in the ith direction.

Multivariate Gaussian Density: Case I (cont'd)
Properties of the decision boundary:
It passes through x0.
It is orthogonal to the line linking the means.
What happens when P(ωi) = P(ωj)? If P(ωi) ≠ P(ωj), then x0 shifts away from the most likely category.
If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
The hyperplane separating Ri and Rj passes through the point
x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^{2}}{\| \mu_i - \mu_j \|^{2}} \ln \frac{P(\omega_i)}{P(\omega_j)} \, (\mu_i - \mu_j)
and is always orthogonal to the line linking the means!
If P(\omega_i) = P(\omega_j), then x_0 = \frac{1}{2} (\mu_i + \mu_j).
(Figures: as the priors change, x0 shifts away from the most likely category.)
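The Case I discriminants above are linear in x, so the whole classifier reduces to a few dot products. The following minimal Python sketch uses made-up means, a made-up shared variance σ², and made-up priors purely for illustration.

```python
import numpy as np

# Hypothetical Case I parameters: Sigma_i = sigma^2 * I for every class
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # mu_i
priors = [0.7, 0.3]                                      # P(w_i)
sigma2 = 1.5                                             # shared variance sigma^2

def g(i, x):
    """Linear discriminant g_i(x) = w_i^t x + w_i0, with
    w_i = mu_i / sigma^2 and w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)."""
    w_i  = means[i] / sigma2
    w_i0 = -(means[i] @ means[i]) / (2 * sigma2) + np.log(priors[i])
    return w_i @ x + w_i0

def classify(x):
    """Assign x to the class with the largest discriminant (the 'linear machine')."""
    return int(np.argmax([g(i, x) for i in range(len(means))]))

# This point is slightly closer to mu_1, but the larger prior P(w_0) = 0.7 still wins,
# illustrating how unequal priors move the linear boundary.
print(classify(np.array([1.6, 1.6])))   # -> 0
```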
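The boundary point x0 above can also be checked numerically. This is a small sketch with the same kind of made-up Case I parameters, showing that x0 sits at the midpoint of the means for equal priors and shifts away from the more likely mean otherwise.

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 3.0])   # hypothetical means
sigma2 = 1.5                                               # shared variance sigma^2

def x0(P_i, P_j):
    """x0 = 0.5*(mu_i + mu_j) - sigma^2 / ||mu_i - mu_j||^2 * ln(P_i / P_j) * (mu_i - mu_j)"""
    d = mu_i - mu_j
    return 0.5 * (mu_i + mu_j) - sigma2 / (d @ d) * np.log(P_i / P_j) * d

print(x0(0.5, 0.5))   # equal priors -> midpoint of the means, [1.5 1.5]
print(x0(0.8, 0.2))   # P(w_i) larger -> x0 moves toward mu_j, away from the more likely mean mu_i
```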