Bayesian Classification PDF

Document Details

Uploaded by AltruisticMystery9422

Author: Szymon Jaroszewicz

Tags

bayesian classification, machine learning, discriminant analysis, statistical classification

Summary

These lecture notes provide an overview of Bayesian classification, covering the Bayes theorem, discriminant functions, linear and quadratic discriminant analysis, and the naive Bayesian classifier. They also cover the two-class case, odds and log-odds, and covariance estimation.

Full Transcript


Bayesian classification

Decision trees modeled $P(Y \mid X_1,\dots,X_m)$ directly. The final classifier was
$$M(X_1,\dots,X_m) = \begin{cases} 1 & \text{if } P(Y \mid X_1,\dots,X_m) > \tfrac{1}{2} \\ 0 & \text{if } P(Y \mid X_1,\dots,X_m) \le \tfrac{1}{2} \end{cases}$$
Today we will look at another approach: use the Bayes theorem and try to model $P(X_1,\dots,X_m \mid Y)$.

Denote by $X$ the vector $(X_1,\dots,X_m)$ of all explanatory variables and apply the Bayes theorem to $P(Y \mid X)$:
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
Q: do we really need the denominator? A: we can ignore it, since it does not depend on $Y$:
$$P(Y \mid X) \propto P(X \mid Y)\,P(Y)$$
(we can always recover the probabilities for $Y$ by normalization).

The Bayesian classifier is
$$M(x) = \arg\max_y P(X = x \mid Y = y)\,P(Y = y).$$
The problem is reduced to estimating $P(X = x \mid Y = y)$. Another formulation uses discriminant functions
$$d_y(x) = \log P(X = x \mid Y = y) + \log P(Y = y), \qquad M(x) = \arg\max_y d_y(x).$$

Bayesian classification – two classes

Suppose $\mathrm{Dom}(Y) = \{0, 1\}$. Yet another formulation:
$$M(x) = \begin{cases} 1 & \text{if } \dfrac{P(X=x \mid Y=1)}{P(X=x \mid Y=0)} \cdot \dfrac{P(Y=1)}{P(Y=0)} > 1 \\ 0 & \text{otherwise} \end{cases}$$
and
$$M(x) = \begin{cases} 1 & \text{if } \log\dfrac{P(X=x \mid Y=1)}{P(X=x \mid Y=0)} + \log\dfrac{P(Y=1)}{P(Y=0)} > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Odds and log-odds

Odds of an event $E$:
$$O(E) = \frac{P(E)}{P(\neg E)} = \frac{P(E)}{1 - P(E)}, \qquad O(E) \in [0, \infty).$$
If $P(E) = 0.2$, then $O(E) = 0.2/0.8 = 0.25 = 1:4$; the odds are 4 to 1 against $E$. If $P(E) = 0.5$, then $O(E) = 1 = 1:1$; both outcomes are equally likely. The likelihood ratio (known from statistics) is
$$\frac{P(X = x \mid Y = 1)}{P(X = x \mid Y = 0)}.$$
Log-odds: the logarithm of the odds. Log-likelihood ratio: the logarithm of the likelihood ratio.

In these terms, the two-class Bayesian classifier assigns class 1 exactly when the likelihood ratio times the odds exceeds 1, or equivalently when the log-likelihood ratio plus the log-odds is positive, and class 0 otherwise.

Optimality of the Bayesian classifier

If the probabilities $P(X \mid Y)$ and $P(Y)$ are correct (equal to the true probabilities), then the Bayesian classifier has the lowest possible classification error; its classification error is equal to the Bayes error.

Bayesian classification

A major question was left open: how to estimate $P(X \mid Y)$? This seems more difficult than estimating $P(Y \mid X)$, so we make some simplifying assumptions. The two most important approaches are:
- assume $P(X \mid Y)$ follows the normal distribution: linear and quadratic discriminant analysis;
- assume $X_1,\dots,X_m$ are independent within each class: the naive Bayesian classifier.
Other approaches include kernel / spline density estimation and k-nearest neighbors (which typically has a non-Bayesian interpretation).
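A minimal sketch of the generic Bayes-rule classifier described above, showing that normalization by $P(X)$ does not change the argmax and how the two-class log-odds decision is made. The function and parameter names (`likelihood`, `priors`) are illustrative assumptions, not part of the notes.

import math

def bayes_classify(x, likelihood, priors):
    """Generic Bayes classifier: argmax_y P(X=x|Y=y) P(Y=y).

    likelihood: function (x, y) -> estimate of P(X=x | Y=y)
    priors:     dict y -> P(Y=y)
    Returns the predicted class and the normalized posterior P(Y=y | X=x)."""
    scores = {y: likelihood(x, y) * p for y, p in priors.items()}
    total = sum(scores.values())                 # the denominator P(X=x)
    posterior = {y: s / total for y, s in scores.items()}
    prediction = max(scores, key=scores.get)     # normalization does not change the argmax
    return prediction, posterior

def two_class_decision(x, likelihood, priors):
    """Equivalent two-class rule: 1 iff log-likelihood ratio + log-odds > 0."""
    log_lr = math.log(likelihood(x, 1)) - math.log(likelihood(x, 0))
    log_odds = math.log(priors[1]) - math.log(priors[0])
    return 1 if log_lr + log_odds > 0 else 0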
Discriminant analysis

Bayesian classification – continuous attributes

We model the probability density function of $x$ within each class. Notation: $\pi_y = P(Y = y)$ are the priors, and $f_y = f_y(x)$ is the conditional density within class $y$. The Bayesian classifier is
$$M(x) = \arg\max_y f_y(x)\,\pi_y \qquad \text{or} \qquad M(x) = \arg\max_y \left[\log f_y(x) + \log \pi_y\right].$$
All other formulations have their analogues.

Multidimensional normal distribution

Mean vector $\mu$, covariance matrix $\Sigma$. The joint density of the $m$-dimensional multivariate normal distribution is
$$f(X) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right).$$
Example: $\mu = \begin{pmatrix}0\\0\end{pmatrix}$, $\Sigma = \begin{pmatrix}1 & 0.5\\ 0.5 & 1\end{pmatrix}$. [Figure: contour plot of the joint PDF.] Contours of the joint PDF are ellipsoids.

Mahalanobis distance

The Mahalanobis distance of a point $x$ from the mean $\mu$ of a multivariate normal distribution with covariance matrix $\Sigma$ is
$$d_\Sigma(x, \mu) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}.$$
It measures the distance from the mean of the multivariate normal distribution, taking into account the variance of the distribution in the direction of the point (i.e. how likely it is to get that far from the mean in a given direction). Points with equal Mahalanobis distance from the mean lie on the same contour of the joint multivariate PDF: they lie on an ellipsoid centered at the mean, and the PDF has equal values at those points. When $\Sigma = I$, it reduces to the standard Euclidean distance.

Discriminant analysis

Assume that the density $f_y(x)$ of $x$ within class $y$ is multivariate normal with mean $\mu_y$ and covariance matrix $\Sigma_y$. The Bayesian classifier is $M(x) = \arg\max_y d_y(x)$, where the discriminant functions $d_y$ are (skipping the $-\frac{m}{2}\log 2\pi$ term)
$$d_y(x) = \log f_y(x) + \log \pi_y = -\tfrac{1}{2}(x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) - \tfrac{1}{2}\log|\Sigma_y| + \log \pi_y = -\tfrac{1}{2}\,[d_{\Sigma_y}(x, \mu_y)]^2 - c_y,$$
where the first term is minus one half the squared Mahalanobis distance and $c_y$ collects the remaining class-dependent constants.

Quadratic Discriminant Analysis (QDA)

The discriminant functions $d_y$ above are quadratic in $x$; therefore the procedure is called Quadratic Discriminant Analysis.
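A minimal numpy sketch of the QDA discriminant function $d_y(x)$ above, assuming the per-class parameters $\mu_y$, $\Sigma_y$, $\pi_y$ have already been estimated; the function names and the `params` layout are illustrative. In practice one would use `slogdet` and a Cholesky factorization for numerical stability.

import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    """d_y(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) - 1/2 log|Sigma| + log pi
    (the -m/2 log 2*pi term is dropped; it is the same for every class)."""
    diff = x - mu
    maha_sq = diff @ np.linalg.solve(Sigma, diff)    # squared Mahalanobis distance
    return -0.5 * maha_sq - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(pi)

def qda_classify(x, params):
    """params: dict y -> (mu_y, Sigma_y, pi_y); returns argmax_y d_y(x)."""
    return max(params, key=lambda y: qda_discriminant(x, *params[y]))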
QDA – two-class case

Suppose $\mathrm{Dom}(Y) = \{0, 1\}$ and classify $x$ as 1 when $d_{0\text{-}1}(x) = d_1(x) - d_0(x) > 0$. We have
$$2 d_1(x) = -(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \log|\Sigma_1| + 2\log\pi_1 = -x^T \Sigma_1^{-1} x + 2\mu_1^T \Sigma_1^{-1} x - \mu_1^T \Sigma_1^{-1} \mu_1 - \log|\Sigma_1| + 2\log\pi_1 = -x^T \Sigma_1^{-1} x + 2\mu_1^T \Sigma_1^{-1} x + c_1,$$
and analogously for $2 d_0(x)$. Hence
$$2 d_{0\text{-}1}(x) = x^T(\Sigma_0^{-1} - \Sigma_1^{-1})x + 2(\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1})x + c_1 - c_0 = x^T A x + b^T x + c.$$
The surface $d_{0\text{-}1}(x) = 0$ is the decision boundary. The decision boundary for two-class QDA is therefore a quadratic surface: in 2 dimensions a conic section (a line or pair of lines, a circle or ellipse, a parabola, or a hyperbola), or their $m$-dimensional generalization (plane, hyperplane, hyperellipsoid, etc.).

QDA – example

$$\mu_0 = \begin{pmatrix}0\\0\end{pmatrix},\ \Sigma_0 = \begin{pmatrix}1 & 0.5\\0.5 & 1\end{pmatrix},\qquad \mu_1 = \begin{pmatrix}-1\\0.5\end{pmatrix},\ \Sigma_1 = \begin{pmatrix}1 & -0.5\\-0.5 & 2\end{pmatrix},\qquad \pi_0 = \pi_1 = 0.5$$
[Figure: the resulting quadratic decision boundary in the plane.]

QDA – more than two classes

The decision boundary between each pair of classes is quadratic; the final decision boundary consists of segments of quadratic surfaces and can be quite complex.

QDA – parameter estimation

In practice we do not know the $\pi_y$, $\mu_y$, $\Sigma_y$; we need to estimate them. Writing $N_y = |\{(x_i, y_i) \in D : y_i = y\}|$ for the number of training examples in class $y$:
$$\hat\pi_y = \frac{N_y}{N}, \qquad \hat\mu_y = \frac{1}{N_y} \sum_{(x_i,y_i)\in D:\, y_i = y} x_i, \qquad \hat\Sigma_y = \frac{1}{N_y - 1} \sum_{(x_i,y_i)\in D:\, y_i = y} (x_i - \hat\mu_y)(x_i - \hat\mu_y)^T.$$
Problems: QDA requires the estimation of several covariance matrices; each covariance matrix is estimated on a subset of the data, and that subset can be small for rare classes. To summarize: many parameters are estimated, possibly based on little data, which leads to instability of the model.

Covariance estimation

Covariance estimation is more difficult than it seems; we will return to it shortly.

Linear Discriminant Analysis (LDA)

An idea: assume the covariance matrices within all classes are identical,
$$\Sigma_{y_0} = \Sigma_{y_1} = \dots = \Sigma_{y_{k-1}} = \Sigma.$$
For two classes:
$$2 d_{0\text{-}1}(x) = x^T(\Sigma_0^{-1} - \Sigma_1^{-1})x + 2(\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1})x + c = 2(\mu_1^T - \mu_0^T)\Sigma^{-1} x + c = b^T x + c.$$
The decision boundary is a line / plane / hyperplane. For more than 2 classes, the decision boundary consists of line/plane/hyperplane segments.

Linear Discriminant Analysis – parameter estimation

$$\hat\Sigma = \frac{1}{N - k} \sum_{y} \sum_{(x_i,y_i)\in D:\, y_i = y} (x_i - \hat\mu_y)(x_i - \hat\mu_y)^T$$
This is a weighted average of the covariance matrices within each class. There are significantly ($k$ times) fewer parameters in the model, and each parameter is estimated on the full dataset.
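A sketch of the estimators above with numpy, computing per-class priors, means and covariances (for QDA) and the pooled within-class covariance (for LDA). The data layout (an $N \times m$ array `X` and a label vector `y`) is assumed for illustration.

import numpy as np

def estimate_parameters(X, y):
    """Per-class priors, means, covariances (QDA) and the pooled covariance (LDA)."""
    N, m = X.shape
    classes = np.unique(y)
    k = len(classes)
    priors, means, covs = {}, {}, {}
    pooled = np.zeros((m, m))
    for c in classes:
        Xc = X[y == c]
        priors[c] = len(Xc) / N
        means[c] = Xc.mean(axis=0)
        centered = Xc - means[c]
        covs[c] = centered.T @ centered / (len(Xc) - 1)   # per-class covariance (QDA)
        pooled += centered.T @ centered                   # accumulate within-class scatter
    pooled /= (N - k)                                     # pooled covariance (LDA)
    return priors, means, covs, pooled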
LDA vs. QDA

[Figure: comparison of the LDA and QDA decision boundaries.]

The linear method seems very different, but: the biggest differences are in the areas of low probability of $X$, and the less the within-class covariances differ, the more similar LDA is to QDA (for small differences they are practically identical). Very often we have to pick a classifier which has a higher bias but whose parameters are easier to estimate.

Fisher's discriminant analysis

Historically the first classification algorithm (1936). It makes no assumption of normality. For two classes it is (almost) analogous to the LDA described above. Main idea: find the direction such that, after projecting the data points onto this direction, the classes are as separated as possible.

Means / variances of random vectors

The scalar product $a^T x$ is the projection of $x$ on $a$. If a random vector $x$ has mean $\bar{x}$ and $A$ is a matrix, then $Ax$ has mean $A\bar{x}$. If a random vector $x$ has covariance matrix $\Sigma$ and $A$ is a matrix, then $Ax$ has covariance matrix $A\Sigma A^T$.

Fisher's discriminant analysis – two-class case

Two classes; assume identical covariance matrices $\Sigma_W$ within both classes. The measure of separation between the classes along direction $a$ is $(a^T\bar{x}_1 - a^T\bar{x}_2)^2$ (the squared distance between the projected means). Correct the separation by the variance along $a$, namely $a^T \Sigma_W a$. So we need to find $a$ maximizing
$$\frac{(a^T\bar{x}_1 - a^T\bar{x}_2)^2}{a^T \Sigma_W a}.$$
This turns out to be the same direction as in the LDA case above.

Fisher's discriminant analysis for more than 2 classes

Assume identical covariance matrices $\Sigma_W$ in all classes. $\Sigma_W$ is the within-class variance and
$$\Sigma_B = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$$
is the between-class variance. Find $a$ maximizing
$$\frac{a^T \Sigma_B a}{a^T \Sigma_W a}.$$
The solution can be obtained by solving the generalized eigenvalue problem $\Sigma_B a = \lambda \Sigma_W a$; the solution is the first generalized eigenvector. A single direction may not be enough to separate the classes well; more directions can be obtained by taking the second, third, fourth, ... generalized eigenvectors. As a result we obtain a set of directions, projecting on which separates the classes well. This can be seen as a class-aware variant of PCA (but of course it is different from PCA).
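A minimal sketch of extracting Fisher directions by solving the generalized eigenvalue problem $\Sigma_B a = \lambda \Sigma_W a$ with scipy; the scatter-matrix construction mirrors the definitions above, and the function name and data layout are illustrative.

import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, n_directions=2):
    """Directions maximizing a^T S_B a / a^T S_W a, via S_B a = lambda S_W a."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    m = X.shape[1]
    S_W = np.zeros((m, m))   # within-class scatter
    S_B = np.zeros((m, m))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_W += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(Xc) * diff @ diff.T
    # Generalized symmetric eigenproblem; eigenvalues are returned in ascending order.
    # (If S_W is singular, a small ridge term could be added to it first.)
    eigvals, eigvecs = eigh(S_B, S_W)
    order = np.argsort(eigvals)[::-1]             # largest ratios first
    return eigvecs[:, order[:n_directions]]       # columns are the Fisher directions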
Fisher's discriminant analysis: literature

A very good description can be found in: Jacek Koronacki, Jan Ćwik, Statystyczne systemy uczące się, Chapter 1 (detailed description, illustrative figures).

Discriminant analysis – closing remarks

Discriminant analysis is not that popular anymore; not many people use it for classification nowadays, and it is usually better to use logistic regression or the naive Bayesian classifier instead. Fisher's discriminant analysis still has important applications as a replacement for PCA. We now move to the naive Bayesian classifier, which is much more important from the practical point of view and is based on a different assumption.

Covariance estimation

How to estimate the covariance matrix? LDA/QDA require covariance matrices, and in practice this is much harder than it seems: the basic formula becomes inaccurate with small samples and large numbers of features.

Covariance estimation – shrinkage

Idea: shrink towards a diagonal covariance matrix,
$$\hat\Sigma_s = (1 - \alpha)\,\hat\Sigma + \alpha\,\frac{\operatorname{Tr}\hat\Sigma}{p}\, I,$$
where $\hat\Sigma$ is the classical estimate, $\alpha$ is the shrinkage factor, and $\operatorname{Tr}\hat\Sigma / p$ is an estimate of the diagonal variance. How to choose $\alpha$? Many algorithms are possible; see the documentation.
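A minimal numpy sketch of the shrinkage estimator above. The fixed default `alpha` is an illustrative choice; in practice $\alpha$ would be chosen by cross-validation or an analytic rule (e.g. Ledoit–Wolf-style estimators, as provided by libraries such as scikit-learn).

import numpy as np

def shrinkage_covariance(X, alpha=0.1):
    """Shrink the sample covariance towards a scaled identity:
    Sigma_s = (1 - alpha) * Sigma_hat + alpha * (tr(Sigma_hat) / p) * I."""
    Sigma_hat = np.cov(X, rowvar=False)            # classical estimate, X is (N, p)
    p = Sigma_hat.shape[0]
    target = (np.trace(Sigma_hat) / p) * np.eye(p)
    return (1 - alpha) * Sigma_hat + alpha * target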
Naive Bayesian classifier

Naive Bayesian classifier (NB)

Bayesian classifiers directly model $P(X = x \mid Y = y)$. This is a multidimensional distribution, so simplifying assumptions need to be made. Discriminant analysis assumed normality; the naive Bayesian classifier makes another assumption.

The "naive" assumption: the predictor variables $X_1, \dots, X_m$ are conditionally independent given $Y$:
$$P(X_1, \dots, X_m \mid Y) = P(X_1 \mid Y)\,P(X_2 \mid Y) \cdots P(X_m \mid Y).$$
The estimate of the conditional probability is then
$$P(Y \mid X) \propto P(X \mid Y)\,P(Y) = P(X_1 \mid Y)\,P(X_2 \mid Y)\cdots P(X_m \mid Y)\cdot P(Y)$$
and the final model is
$$M(x) = \arg\max_y P(X = x \mid Y = y)\,P(Y = y) = \arg\max_y P(x_1 \mid y)\,P(x_2 \mid y)\cdots P(x_m \mid y)\cdot P(Y = y).$$

Naive Bayes (NB) – implementation

Overall the implementation is very simple: just estimate $m + 1$ small conditional distributions $P(Y)$, $P(X_i \mid Y)$. But a correct implementation requires care. Implementation problems: (1) zero probabilities; (2) multiplying probabilities leads to very small values.

Naive Bayes (NB) – zero probabilities

If, for a given $y$, there is an $X_i$ such that $P(X_i = x_i \mid y) = 0$, then automatically $P(y \mid x) = 0$. This is a problem because (1) some values of $X_i$ are rare and will very often lead to (unjustified) zero probabilities, and (2) it is possible that $P(y \mid x) = 0$ for all $y$. Solution: make all probabilities nonzero. This can be achieved using the Laplace correction.

Laplace correction

Typically the probability of $X = x$ is estimated using
$$\hat P(X = x) = \frac{|\{x \in D : x[X] = x\}|}{|D|} = \frac{n_x}{n}.$$
If $n_x = 0$, then $\hat P(X = x) = 0$. The Laplace correction is
$$\hat P(X = x) = \frac{n_x + 1}{n + k}, \qquad k = |\mathrm{Dom}(X)|.$$
This is equivalent to using prior knowledge equivalent to a database of size $k$ with each value of $X$ present in exactly one record. It is also equivalent to Bayesian estimation of the probability with a uniform prior.

Laplace correction – m-estimate

A generalization, called the m-estimate, is
$$\hat P(X = x) = \frac{n_x + m}{n + mk}$$
for some constant $m$. For $m = 1$ it reduces to the Laplace correction. This is equivalent to using prior knowledge equivalent to a database of size $mk$ with each value of $X$ present in exactly $m$ records.

The Laplace correction is a very useful tool: it makes algorithms stable even when there is very little data and an attribute (or a set of attributes) has many values. Example uses: smoothing probability estimates in the leaves of decision trees (some authors suggest not pruning the trees but using the m-estimate in the leaves instead), estimating n-gram frequencies in text processing, and estimating probabilities for the NB classifier.

Naive Bayes (NB) – small probabilities

Multiplying probabilities leads to very small values; in double precision arithmetic they will become exactly equal to zero, and a problem arises when the estimate $P(y \mid x) = 0$ for all $y$. Solution: use logarithms,
$$M(x) = \arg\max_y \left[\log P(X = x \mid Y = y) + \log P(Y = y)\right] = \arg\max_y \left[\log P(Y = y) + \sum_{i=1}^{m} \log P(x_i \mid y)\right].$$

Naive Bayes (NB) – missing values

Missing values are easy to handle in naive Bayesian classifiers. When building the model, just ignore records with missing values (in the counted attributes only). When classifying a new instance, omit the attributes with missing values from the product / sum.
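A compact sketch combining the pieces above (Laplace correction and summing logarithms) into a categorical naive Bayes classifier. The class name, data layout (records as tuples of categorical values) and helper names are illustrative.

import math
from collections import Counter, defaultdict

class CategoricalNB:
    """Naive Bayes for categorical attributes, with Laplace correction."""

    def fit(self, X, y):
        # X: list of tuples of categorical values, y: list of class labels
        n = len(y)
        self.class_counts = Counter(y)
        self.priors = {c: cnt / n for c, cnt in self.class_counts.items()}
        m = len(X[0])
        # counts[(i, c)][v] = how often attribute i takes value v in class c
        self.counts = defaultdict(Counter)
        self.domains = [set() for _ in range(m)]
        for xs, c in zip(X, y):
            for i, v in enumerate(xs):
                self.counts[(i, c)][v] += 1
                self.domains[i].add(v)
        return self

    def _log_cond(self, i, v, c):
        # Laplace correction: (n_xv + 1) / (n_c + k), with k = |Dom(X_i)| seen in the data
        k = len(self.domains[i])
        return math.log((self.counts[(i, c)][v] + 1) /
                        (self.class_counts[c] + k))

    def predict(self, xs):
        # argmax_y [ log P(y) + sum_i log P(x_i | y) ]  -- logarithms avoid underflow
        def score(c):
            return math.log(self.priors[c]) + sum(
                self._log_cond(i, v, c) for i, v in enumerate(xs))
        return max(self.priors, key=score)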
Naive Bayes (NB) – continuous attributes

So far we only talked about categorical attributes. What happens when numerical attributes are present? Three main solutions: (1) the normal assumption, (2) discretization, (3) kernel estimation.

Naive Bayes (NB) – normal assumption

Assume that each numerical attribute $X_i$ is (conditionally on $Y$) normally distributed. For each class $y$ it is enough to estimate $\mu^y_{X_i}$ and $\sigma^y_{X_i}$ for $X_i$ within this class. In the NB formula, $P(X_i = x \mid Y = y)$ is replaced by the density of the normal distribution:
$$f_{X_i \mid Y = y}(x) = \frac{1}{\sigma^y_{X_i}\sqrt{2\pi}} \exp\left(-\frac{(x - \mu^y_{X_i})^2}{2\,(\sigma^y_{X_i})^2}\right).$$

Mathematical warning: how can we replace a probability $P(X = x \mid Y = y)$ with the value of a p.d.f. at a point? Worse, probabilities and densities get mixed together in a product. For small $\varepsilon$ and p.d.f. $f$, the expression $\varepsilon f(x)$ is approximately the probability that $X \in [x - \frac{\varepsilon}{2}, x + \frac{\varepsilon}{2}]$. We may substitute this expression into the NB formula and factor all the $\varepsilon$'s out; only the densities remain. In other words: we are looking at the probabilities (conditional on $Y$, of course) that $X$ is in a small interval around its value in the classified instance.

The naive Bayesian classifier with the normal assumption is equivalent to QDA with the following restriction: all $X$'s are conditionally independent, while the $X$'s are still allowed to have different standard deviations; in other words, the matrices $\Sigma_k$ are diagonal. In this case the axes of the quadratic decision surfaces are aligned with the coordinate axes.
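A small sketch of the normal-assumption variant: per-class, per-attribute means and standard deviations, with the normal log-density replacing $\log P(X_i = x \mid y)$. The array layout and function names are assumed for illustration.

import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate log-priors and per-class, per-attribute (mu, sigma); X is (N, m)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (np.log(len(Xc) / len(X)),          # log P(Y = c)
                     Xc.mean(axis=0),                    # mu for each attribute
                     Xc.std(axis=0, ddof=1))             # sigma for each attribute
    return params                                        # (a small variance floor is advisable)

def predict_gaussian_nb(params, x):
    """argmax_y [ log pi_y + sum_i log f_{X_i|Y=y}(x_i) ]."""
    def score(c):
        log_prior, mu, sigma = params[c]
        log_dens = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
        return log_prior + log_dens.sum()
    return max(params, key=score)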
Naive Bayes (NB) – discretization

Another approach: convert numerical attributes to categorical attributes. Example: the numerical variable age can be replaced with a categorical variable taking the values '[0, 10)', '[10, 30)', '[30, 50)', '[50, 70)', '[70, ∞)'. Such a conversion is called discretization; we will discuss it when we talk about data preprocessing. This is equivalent to estimating the conditional p.d.f.'s using histograms.

Naive Bayes (NB) – kernel estimation

A third approach: use kernel density estimation to estimate the probability density function of each numerical attribute within each class $y$.

Kernel density estimation

Suppose we have a numerical variable $X$ and a sample $x_1, x_2, \dots, x_n$. An estimate of the p.d.f. of $X$ is
$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$
where $K$ is a kernel function and $h$ is a smoothing parameter or bandwidth. Various kernels are possible; typically the Gaussian kernel is used:
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}.$$

The choice of the bandwidth $h$: a low $h$ means little smoothing and a jagged estimate; a high $h$ gives a very smooth estimate in which detail is lost. The optimal $h$ is proportional to $n^{-1/5}$ (unfortunately we do not know the coefficient of proportionality). For the Gaussian kernel, when the approximated distribution is normal,
$$h_{\mathrm{opt}} \approx 1.06\,\hat\sigma\, n^{-1/5}$$
(Silverman's rule of thumb: use it even for non-normal distributions).

Kernel density estimation – example

True $f$: Gaussian with mean zero and variance 1; a sample of size 100; Gaussian kernels for different $h$. [Figure: kernel density estimates for $h = 0.1$, $h = 0.5$, $h = 1.0$ and $h_{\mathrm{opt}} \approx 0.40$, compared with the true density $f$.]
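A minimal numpy sketch of the kernel density estimate above, using the Gaussian kernel and Silverman's rule-of-thumb bandwidth; the function names are illustrative.

import numpy as np

def silverman_bandwidth(sample):
    """h_opt ~ 1.06 * sigma_hat * n^(-1/5) (Silverman's rule of thumb)."""
    return 1.06 * np.std(sample, ddof=1) * len(sample) ** (-1 / 5)

def kde(x, sample, h=None):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h) with a Gaussian kernel K."""
    if h is None:
        h = silverman_bandwidth(sample)
    u = (np.asarray(x)[..., None] - sample) / h           # scaled distances to each sample point
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)          # Gaussian kernel values
    return K.sum(axis=-1) / (len(sample) * h)

# Example: estimate the density of a standard normal sample at a few points.
rng = np.random.default_rng(0)
sample = rng.standard_normal(100)
print(kde([-1.0, 0.0, 1.0], sample))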
Naive Bayes (NB) – general remarks

Very often more advanced methods can beat NB, but it almost always works quite well; NB is really hard to break. In my opinion it is usually best to start with NB and use it as a reference model: if complicated models are not much better than NB, it is better not to use them. NB is also very useful in certain specific applications, e.g. spam filtering. We will now try to figure out why it works well in practice and discuss some extensions.

Why does Naive Bayes work so well?

Three reasons: the 'naive' assumption is not so naive; logarithmic sample complexity; and a model giving wrong class probabilities can still be a good classifier.

The 'naive' assumption is not so naive

The assumption corresponds to the case when the attributes are symptoms of the class; in other words, the class is the cause behind the values of the attributes. Example: the class is flu, the attributes are cough and temperature. Cough and temperature are not independent: if someone has a temperature, they likely have the flu and thus likely cough. But if we already know that a person has the flu, then finding out that they have a temperature does not tell us anything about coughing. The symptoms cough and temperature are independent conditional on the class flu. Don't confuse the two notions: the attributes need not be independent, only conditionally independent.

NB as a sum of log-likelihood ratios

Two-class case:
$$\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \frac{P(X_1 \mid Y=1)\cdots P(X_m \mid Y=1)\cdot P(Y=1)}{P(X_1 \mid Y=0)\cdots P(X_m \mid Y=0)\cdot P(Y=0)}.$$
Taking logarithms,
$$\log\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \underbrace{\log\frac{P(Y=1)}{P(Y=0)}}_{\text{log-odds}} + \underbrace{\log\frac{P(X_1 \mid Y=1)}{P(X_1 \mid Y=0)}}_{\text{log-likelihood ratio for } X_1} + \dots + \underbrace{\log\frac{P(X_m \mid Y=1)}{P(X_m \mid Y=0)}}_{\text{log-likelihood ratio for } X_m}.$$
So NB is a sum of the log-odds and the log-likelihood ratios for all variables.

NB is a linear model

Two-class case, all variables binary ($\{0, 1\}$):
$$\log\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \log\frac{P(Y=1)}{P(Y=0)} + \sum_{i=1}^{m}\log\frac{P(X_i \mid Y=1)}{P(X_i \mid Y=0)}$$
$$= \log\frac{P(Y=1)}{P(Y=0)} + \sum_{i=1}^{m} X_i \log\frac{P(X_i=1 \mid Y=1)}{P(X_i=1 \mid Y=0)} + \sum_{i=1}^{m} (1 - X_i)\log\frac{P(X_i=0 \mid Y=1)}{P(X_i=0 \mid Y=0)}$$
$$= b + \sum_{i=1}^{m} a_{i1} X_i + \sum_{i=1}^{m} a_{i0}(1 - X_i) = \left(b + \sum_{i=1}^{m} a_{i0}\right) + \sum_{i=1}^{m} (a_{i1} - a_{i0})\, X_i.$$
Conclusion: NB is a linear model.
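A small worked sketch of the derivation above for binary attributes: it converts per-attribute conditional probabilities into the intercept and linear coefficients $(a_{i1} - a_{i0})$, so the NB log posterior ratio is literally a linear function of the $X_i$. The probability tables in the example are made-up illustrative inputs.

import math

def nb_as_linear_model(prior1, prior0, p1, p0):
    """Return (intercept, weights) such that the NB log posterior ratio equals
    intercept + sum_i weights[i] * X_i  for binary attributes X_i.

    p1[i] = P(X_i = 1 | Y = 1), p0[i] = P(X_i = 1 | Y = 0)  (assumed smoothed > 0)."""
    b = math.log(prior1 / prior0)
    a1 = [math.log(q1 / q0) for q1, q0 in zip(p1, p0)]                 # a_{i1}
    a0 = [math.log((1 - q1) / (1 - q0)) for q1, q0 in zip(p1, p0)]     # a_{i0}
    intercept = b + sum(a0)
    weights = [w1 - w0 for w1, w0 in zip(a1, a0)]
    return intercept, weights

# Example with three binary attributes (illustrative numbers only):
intercept, weights = nb_as_linear_model(0.4, 0.6, [0.8, 0.3, 0.5], [0.2, 0.4, 0.5])
x = [1, 0, 1]
score = intercept + sum(w * xi for w, xi in zip(weights, x))
print("predict class 1" if score > 0 else "predict class 0")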
NB vs. logistic regression

Since NB is a linear model, what is the difference from other linear models, e.g. logistic regression? The main difference: NB chooses the coefficient for each variable independently of all the other variables, while logistic regression chooses all coefficients simultaneously. We may expect that logistic regression will thus give better coefficients; this is true asymptotically, but what about finite samples?

Theorem (Ng 2001). If there exists a naive Bayesian model $NB^*$ with (population) classification error $\varepsilon^*$, then to learn a naive Bayesian model $\widehat{NB}$ which, with probability $\ge 1 - \delta$, has (population) classification error not greater than $\varepsilon^* + \epsilon$, one needs a training sample of size
$$n = O\!\left(\frac{1}{\epsilon^2}\log\frac{m}{\delta}\right),$$
where $m$ is the number of attributes, and $\epsilon, \delta$ are given constants.

Definition. An algorithm $A$ is rotationally invariant if, for every vector $x$, training data $D$, and every square matrix $R$ such that $R^T R = I$, $|R| = 1$, we have $A(D)(x) = A(DR)(xR)$. Informally: the results do not change after rotating the data and the vectors to be classified. Examples of rotationally invariant algorithms: logistic regression, SVM, and any method which uses PCA as a preprocessing step.

Theorem (Ng 2004). For every rotationally invariant algorithm there is a distribution $P$ such that $Y = 1[X_1 > 0]$ and learning the model (with given accuracy and error probability) requires
$$\Omega\!\left(\frac{m}{\epsilon}\right)$$
training samples.

Conclusions: NB requires a number of samples that grows logarithmically with the number of attributes, while logistic regression requires a number of samples that grows linearly with the number of attributes. If there is a lot of data, logistic regression will almost always be better than NB; if there is little training data, NB may have smaller error. NB has high bias and low variance.

NB vs. logistic regression – example

[Figure: comparison of NB and logistic regression; source: Ng, 2001 (m – training sample size).]

Discriminative vs. generative models

Discriminative models model the conditional distribution $P(Y \mid X)$; an example is logistic regression; they have good large-sample behavior. Generative models model the full joint distribution $P(Y, X)$; an example is naive Bayesian classification; they have good small-sample behavior.

Spam filter – a typical application of NB

Text classification (simplest approach): each document is a data record, each possible word is an attribute, and the value of an attribute is 1 if the word is present in the document, 0 otherwise. There is a huge number of attributes (tens or hundreds of thousands) and a moderate number of data records (messages need to be labeled by hand). NB is well suited to such tasks, and most spam filters use NB in some form or other.

Good class probabilities are not needed for good class prediction

NB does not predict class probabilities well, but it turns out that this is not necessary for good classification: it is enough for the predicted probability to be on the correct side of the threshold for the predicted class to be correct. Some authors suggest this to be one of the reasons why NB works: the predicted probability is usually on the 'right side', and the variance is low, so the predictions are stable.

Naive Bayesian classifier – extensions

Most extensions try to weaken the conditional independence assumption. Others (which we won't talk about) modify the conditional distributions (e.g. to work better with text data). We will discuss variable selection and Tree Augmented naive Bayes.

Naive Bayesian classifier – attribute selection

Conditionally dependent attributes violate the naive assumption and can lead to problems. For example, repeat the same attribute several times: no new information is added, but the attribute becomes much more important. Idea: select a subset of attributes which are (approximately) conditionally independent. We will talk about attribute selection when we talk about data preprocessing.

Bayesian networks

A directed acyclic graph: the vertices are attributes and the edges represent the dependency structure between the attributes. Each vertex is labeled with the conditional distribution of its variable given its parents. The joint distribution of a Bayesian network is defined as the product of the conditional distributions of all variables. Bayesian networks allow representing joint distributions of a large number of variables and efficient inference with such distributions.

Bayesian network – an example

[Figure: the Burglary / Earthquake / Alarm / Neighbour-calls network with its conditional probability tables: $P(B) = 0.01$, $P(E) = 0.02$, $P(A \mid B, E) = 0.96$, $P(A \mid B, \neg E) = 0.94$, $P(A \mid \neg B, E) = 0.29$, $P(A \mid \neg B, \neg E) = 0.01$, $P(N \mid A) = 0.9$, $P(N \mid \neg A) = 0.01$.] The graphical structure is easy to understand and the edges correspond to direct causes. The joint distribution is
$$P(E, B, A, N) = P(E)\cdot P(B)\cdot P(A \mid E, B)\cdot P(N \mid A).$$

Tree Augmented naive Bayes

Chow and Liu (1968) have shown how to build an optimal Bayesian network whose graph is a tree: (1) assign to each pair of variables a weight equal to their mutual information; (2) build a maximum weighted spanning tree; (3) pick any variable as the root of the tree and direct all edges away from it. Tree Augmented naive Bayes: assume that (conditionally on the class $Y$) the joint distribution of the predictor variables is represented by a Bayesian network whose graph is a tree.
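A rough sketch of the Chow–Liu construction described above for discrete variables: pairwise mutual information as edge weights, a maximum-weight spanning tree built with Prim's algorithm, and the tree rooted at an arbitrary variable. The data layout and function names are illustrative; note that TAN proper would compute the mutual information conditionally on the class.

import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information between two discrete variables (lists of values)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def chow_liu_tree(columns, root=0):
    """columns: list of equal-length value lists, one per variable.
    Returns directed edges (parent, child) of a maximum-MI spanning tree rooted at `root`."""
    m = len(columns)
    weight = {(i, j): mutual_information(columns[i], columns[j])
              for i in range(m) for j in range(i + 1, m)}
    # Prim's algorithm for a maximum-weight spanning tree, grown from the root.
    in_tree, edges = {root}, []
    while len(in_tree) < m:
        best = max(((i, j) for i in in_tree for j in range(m) if j not in in_tree),
                   key=lambda e: weight[(min(e), max(e))])
        edges.append(best)            # edge directed away from the already-built tree
        in_tree.add(best[1])
    return edges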
Literature

1. Andrew Ng, Michael Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes, NIPS 2001, http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf [sample complexity of NB]
2. G. I. Webb, J. Boughton, and Z. Wang, Not So Naive Bayes: Aggregating One-Dependence Estimators, Machine Learning 58(1):5–24, 2005, http://www.csse.monash.edu.au/~webb/Files/WebbBoughtonWang05.pdf [AODE]
3. N. Friedman, D. Geiger, and M. Goldszmidt, Bayesian network classifiers, Machine Learning 29:131–163, 1997, http://www.cs.huji.ac.il/~nir/Papers/FrGG1.pdf [Tree augmented naive Bayes]
