Parameter Estimation PDF
Document Details
University of Crete
Panos Trahanias
Summary
This document provides an overview of parameter estimation methods for pattern recognition, focusing on maximum likelihood estimation (MLE) and Bayesian approaches and on how they are applied to known distribution families such as the normal (Gaussian) distribution. The concepts are explained with formulas and examples, highlighting how the two methods converge when large amounts of training data are available. The document is intended as a learning resource within a computer science curriculum, likely for undergraduates.
Full Transcript
Pattern Recognition: Parameter Estimation
Panos Trahanias, University of Crete, Department of Computer Science

Parameter Estimation
The Bayesian classifier cannot be employed when the class-conditional probability density functions p(x|ωi) and the prior probabilities P(ωi) are not known. The distributions can be estimated when appropriate data are available ➡ a hard task! If the shape of the distribution is known, e.g. normal (Gaussian), but its parameters, e.g. mean and variance, are not, then the problem is formulated as one of parameter estimation.

Parameter Estimation
Two basic approaches:
a. Maximum Likelihood Estimation
b. Bayesian Parameter Estimation

Maximum Likelihood Estimation
Assume that we take random samples from a given distribution with unknown parameters. Let us use the vector θ to denote the set of parameters. If, for example, we know that the distribution is Gaussian but do not know its mean and covariance, then the parameters are
θ = (μ, Σ) = (μ1, ..., μd, σ1², ..., σd², cov(xm, xn); m, n = 1, ..., d; m ≠ n).

MLE Task
For each class, estimate the vector θ using a training set Dn = {x1, ..., xn} that contains n independent (i.i.d.) samples. The likelihood of θ with respect to the sample set Dn is
p(Dn|θ) = ∏k=1..n p(xk|θ).
The maximum likelihood estimate of θ is the value that maximizes this function. Intuitively, it is the value of θ that best agrees with, or best explains, the observed samples.

Likelihood Function
(Figure slide, no text recovered.)

Log-Likelihood
The θ that maximizes the likelihood function also maximizes its logarithm, which is more convenient to work with:
l(θ) = ln p(Dn|θ) = ∑k=1..n ln p(xk|θ).
The θ that maximizes this function can be obtained by setting its derivative with respect to θ equal to zero and solving for θ.

Special Cases – Normal Distribution
Normal distribution with unknown mean μ:
○ p(xk|μ) ~ N(μ, Σ)
○ The maximum likelihood estimator of μ satisfies ∑k=1..n Σ⁻¹(xk − μ̂) = 0.
○ Left-multiplying by Σ, we obtain μ̂ = (1/n) ∑k=1..n xk.
Accordingly, it is the arithmetic mean of the training samples!

Special Cases – Normal Distribution
Normal distribution with unknown mean μ and variance σ²:
○ θ = (θ1, θ2) = (μ, σ²)
○ Setting the derivatives of the log-likelihood of a single sample xk with respect to θ1 and θ2 equal to zero gives conditions (1) and (2).

Special Cases – Normal Distribution
Taking all samples into consideration, from (1) and (2) we obtain
μ̂ = (1/n) ∑k=1..n xk and σ̂² = (1/n) ∑k=1..n (xk − μ̂)².

Special Cases – Normal Distribution
The case of the multidimensional Gaussian distribution:
μ̂ = (1/n) ∑k=1..n xk and Σ̂ = (1/n) ∑k=1..n (xk − μ̂)(xk − μ̂)ᵀ.
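The MLE results above are easy to verify numerically. The following is a minimal sketch in Python (assuming NumPy is available); the function name gaussian_mle and the synthetic data are illustrative choices, not part of the original slides.

# Minimal sketch: ML estimates of the mean and covariance of a multivariate
# Gaussian from n i.i.d. samples, i.e. the arithmetic mean and the sample
# covariance normalized by n (not n - 1).
import numpy as np

def gaussian_mle(X):
    # X has shape (n, d): n samples of a d-dimensional Gaussian
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                  # arithmetic mean of the samples
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / n    # ML covariance estimate
    return mu_hat, sigma_hat

# Illustrative usage with synthetic data
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)
print(sigma_hat)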
Bayesian Estimator
In contrast to MLE, where the unknown parameters are assumed to have constant (but unknown) values, the Bayesian estimator (BE) treats the unknown parameters as random variables that follow an a priori known p.d.f. Therefore, BE estimates a distribution over the values of θ rather than the values themselves. BE provides more information, but is often harder to compute. The availability of training data allows the conversion of the prior information into a posterior p.d.f. ➡ the phenomenon of learning (Bayesian learning), where each new observation refines the posterior probability.

Bayesian Estimation
Bayesian learning for pattern classification problems:
○ The computation of the posterior p.d.f. is the basis of Bayesian classification.
○ Goal: computation of P(ωi|x, D) given the set of training samples D = {D1, ..., Dc}, where the samples in the set Dj correspond to class j, j = 1, ..., c.
○ For each new, unclassified sample x, the Bayes rule gives
P(ωi|x, D) = p(x|ωi, D) P(ωi|D) / ∑j=1..c p(x|ωj, D) P(ωj|D).

Bayesian Estimation
We assume that:
○ The priors P(ωi) are known, thus P(ωi|D) = P(ωi).
○ Only the samples of the set Di carry information about the p.d.f. p(x|ωi, Di) ➡ c independent estimation problems arise, one per class; p(x|ωi, Di) can also be written as p(x|Di), and this is the function we want to estimate.

General Methodology of Bayesian Estimation
For each class, we know the form of the p.d.f. p(x|θ), but the value of the parameter vector θ is unknown. We have some initial knowledge of θ in the form of an a priori p.d.f. p(θ). For each class, we have a set Dn = {x1, ..., xn} of n statistically independent samples. Then:
p(x|D) = ∫ p(x|θ) p(θ|D) dθ.
Key relationship: it connects the conditional p.d.f. p(x|D) with the posterior p.d.f. p(θ|D) of the parameter vector. It states that p(x|D) is a linear combination (weighted average) of p(x|θ) with weights p(θ|D). If p(θ|D) has a steep, unique maximum at θ*, then p(x|D) ≈ p(x|θ*).

General Methodology of Bayesian Estimation
For the computation of p(θ|D), we have
p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ.
Due to the independence of the training samples,
p(D|θ) = ∏k=1..n p(xk|θ).
If p(D|θ) is concentrated around θ* with a large peak at that point, and if p(θ*) is not 0, then p(θ|D) also has a large peak at θ*, and therefore p(x|D) ≈ p(x|θ*). But the point θ*, as described above, is the MLE estimate of θ!

Bayesian Learning
An issue of interest is the computation and convergence of the sequence of p.d.f.s p(θ|Dn), where we have restored the index n denoting the number of training samples in the set Dn. Writing Dn = {Dn−1, xn}, we have
p(θ|Dn) = p(xn|θ) p(θ|Dn−1) / ∫ p(xn|θ) p(θ|Dn−1) dθ.
This relationship generates the sequence of p.d.f.s p(θ|x1), p(θ|x1, x2), ..., p(θ|x1, ..., xn) ➡ Bayesian recursive (or incremental) learning.

Bayesian Learning – Normal Distribution
Problem: computation of the p.d.f.s p(θ|Dn) and p(x|Dn) when we assume that p(x|θ) = p(x|μ) = N(μ, σ²) (i.e. θ = μ) and p(μ) = N(μ0, σ0²). Here μ0 is our best a priori knowledge of μ and σ0² expresses our uncertainty about it. It follows that p(μ|Dn) = N(μn, σn²), where
μn = (nσ0² / (nσ0² + σ²)) x̄n + (σ² / (nσ0² + σ²)) μ0 and σn² = σ0²σ² / (nσ0² + σ²),
with x̄n = (1/n) ∑k=1..n xk the sample mean of Dn. μn represents our best knowledge of μ after observing n training samples, and σn² measures our uncertainty. Also, p(x|Dn) = N(μn, σ² + σn²).
✔ As σ0² → ∞, we have μn → x̄n for every n, that is, we return to the maximum likelihood estimate.
✔ As n → ∞, we have μn → x̄n and σn² → 0, that is, for a fairly large number of training samples the accuracy of the estimate of μ no longer depends on the uncertainty of our a priori knowledge, σ0².

Bayesian Learning – Normal Distribution
(Figure slides: the posterior p(μ|Dn) for increasing numbers of samples n.)
As n → ∞, the posterior p.d.f. p(μ|Dn) becomes more and more concentrated (peaked) around its mean. This phenomenon is called Bayesian learning.

MLE vs Bayesian Estimation
In virtually every case, MLE and Bayesian estimation are equivalent in the limit of infinite training data. Criteria for choosing between them:
○ Computational complexity ➡ MLE
○ Interpretability ➡ MLE
○ Confidence in the prior information ➡ Bayesian
A Bayesian estimator with a flat/uniform prior is equivalent to MLE.
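To make the behaviour of μn and σn² concrete, here is a minimal Python sketch (assuming NumPy is available) that evaluates the closed-form posterior parameters given above for a univariate Gaussian with known variance and a normal prior; the function and variable names and the synthetic data are illustrative assumptions, not part of the original slides. As n grows, μn approaches the sample mean (the MLE) and σn² shrinks toward zero, illustrating both the large-sample equivalence of the two methods and the concentration of the posterior called Bayesian learning.

# Minimal sketch: Bayesian learning of the mean mu of N(mu, sigma^2) with known
# sigma^2 and conjugate prior p(mu) = N(mu0, sigma0^2).
import numpy as np

def posterior_mean_params(x, mu0, sigma0_sq, sigma_sq):
    # Returns (mu_n, sigma_n_sq) of the posterior p(mu | x_1, ..., x_n).
    n = len(x)
    sample_mean = np.mean(x)  # the MLE of mu
    mu_n = (n * sigma0_sq * sample_mean + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(0)
true_mu, sigma_sq = 3.0, 4.0   # parameters of the data-generating Gaussian
mu0, sigma0_sq = 0.0, 1.0      # prior belief about mu and its uncertainty
for n in (1, 10, 100, 10000):
    x = rng.normal(true_mu, np.sqrt(sigma_sq), size=n)
    mu_n, sigma_n_sq = posterior_mean_params(x, mu0, sigma0_sq, sigma_sq)
    # mu_n moves from mu0 toward the sample mean, and sigma_n_sq -> 0, as n grows.
    print(n, mu_n, sigma_n_sq, np.mean(x))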