AI 305 - Generative Learning Algorithms PDF
Summary
These lecture notes cover generative learning algorithms, including Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naïve Bayes. They develop the classifiers from Bayes' theorem using class-conditional distributions such as the Gaussian, and provide illustrative examples and comparative analyses of the methods.
Full Transcript
Introduction to Machine Learning (AI 305): Generative learning algorithms

Agenda
Linear Discriminant Analysis
Quadratic Discriminant Analysis
Naïve Bayes

Introduction
We have considered classification learning algorithms that model p(y|x; θ), the conditional distribution of y given x. For instance, logistic regression modeled p(y|x; θ) as h_θ(x) = g(θᵀx), where g is the sigmoid function. In this lecture, we'll talk about a different type of learning algorithm.
Consider a classification problem in which we want to learn to distinguish between elephants (y = 1) and dogs (y = 0), based on some features of an animal. Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line, that is, a decision boundary, that separates the elephants from the dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary the animal falls, and makes its prediction accordingly.

Different approach
First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we match it against the elephant model and against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we saw in the training set.

Generative learning algorithms
Algorithms that try to learn p(y|x) directly (such as logistic regression), or that try to learn mappings directly from the space of inputs X to the labels (such as the perceptron algorithm), are called discriminative learning algorithms. Here, we'll talk about algorithms that instead try to model p(x|y) (and p(y)). These algorithms are called generative learning algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.

Generative learning algorithms, continued
The approach is to model the distribution of X in each of the classes separately, p(x|y), together with the class priors p(y), and then use Bayes' theorem to flip things around and obtain p(y|x), the posterior distribution. When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis. The approach is quite general, however, and other distributions can be used as well. We will focus on normal distributions.

Bayes' theorem for classification
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes' theorem:

p(Y = k | X = x) = p(X = x | Y = k) p(Y = k) / p(X = x)

One writes this slightly differently for discriminant analysis:

p(Y = k | X = x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)

where:
f_k(x) = p(X = x | Y = k) is the density of X in class k. Here we will use normal densities, separately in each class.
π_k = p(Y = k) is the marginal or prior probability of class k.
p(Y = k | X = x) is the posterior probability that an observation X = x belongs to the kth class, that is, the probability that the observation belongs to the kth class given the predictor value for that observation.

Bayes' theorem for classification, continued
The equation above suggests that instead of directly computing the posterior probability p(Y = k | X = x), we can simply plug estimates of π_k and f_k(x) into it.
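As a concrete illustration of plugging estimates of π_k and f_k(x) into Bayes' theorem, here is a minimal Python sketch (not from the original notes); the priors, class means, and shared standard deviation are made-up values for a hypothetical one-predictor, two-class problem.

import numpy as np
from scipy.stats import norm

# Hypothetical two-class, one-predictor toy problem (illustrative values only)
priors = np.array([0.5, 0.5])      # pi_k = p(Y = k)
means = np.array([-1.25, 1.25])    # class means mu_k
sigma = 1.0                        # shared standard deviation

def posterior(x):
    # f_k(x): normal density of X within class k, evaluated for both classes at once
    densities = norm.pdf(x, loc=means, scale=sigma)
    numerator = priors * densities        # pi_k * f_k(x)
    return numerator / numerator.sum()    # divide by the sum over l of pi_l * f_l(x)

# Classifying to the largest posterior reproduces the Bayes classifier for this toy model
print(posterior(0.4), posterior(0.4).argmax())

Because norm.pdf broadcasts over the vector of class means, a single call evaluates every f_k(x) at once; the same pattern extends directly to K > 2 classes.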
In general, estimating π_k is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the kth class. However, estimating the density function f_k(x) is much more challenging. As we will see, to estimate f_k(x) we will typically have to make some simplifying assumptions.

Bayes' theorem for classification, continued
The Bayes classifier, which assigns an observation x to the class for which p(Y = k | X = x) is largest, has the lowest possible error rate out of all classifiers (this is only true if all of the terms in the previous equation are correctly specified). Therefore, if we can find a way to estimate f_k(x), we can plug it into the equation in order to approximate the Bayes classifier. Next we discuss three classifiers that use different estimates of f_k(x) to approximate the Bayes classifier: linear discriminant analysis, quadratic discriminant analysis, and naive Bayes.

Classify to the highest density
(Figure: two one-dimensional class densities, shown for priors π_1 = 0.5, π_2 = 0.5 in the left panel and π_1 = 0.3, π_2 = 0.7 in the right panel.) We classify a new point according to which density is highest. When the priors are different, we take them into account as well, and compare π_k f_k(x), i.e. p(x|y) p(y). In the right panel we favor the pink class: the decision boundary has shifted to the left.

Why discriminant analysis?
When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

Linear Discriminant Analysis when p = 1
Assume that p = 1, i.e. we have only one predictor. To estimate f_k(x), we first make some assumptions about its form: we assume that f_k(x) is normal (Gaussian). In the one-dimensional setting, the normal density takes the form

f_k(x) = (1 / (√(2π) σ_k)) exp(−(x − μ_k)² / (2σ_k²))

Here μ_k is the mean and σ_k² the variance in class k. We will assume that all the σ_k = σ are the same, i.e. the same variance in every class. Plugging this into Bayes' formula, we get a rather complex expression for p_k(x) = p(Y = k | X = x):

p_k(x) = [π_k (1 / (√(2π) σ)) exp(−(x − μ_k)² / (2σ²))] / [Σ_{l=1}^{K} π_l (1 / (√(2π) σ)) exp(−(x − μ_l)² / (2σ²))]

Discriminant functions
To classify at the value X = x, we need to see which of the p_k(x) is largest. Taking logs, and discarding terms that do not depend on k, this is equivalent to assigning x to the class with the largest discriminant score:

δ_k(x) = x · μ_k / σ² − μ_k² / (2σ²) + log(π_k)

Note that δ_k(x) is a linear function of x. If there are K = 2 classes and π_1 = π_2 = 0.5, then one can see that the decision boundary is at

x = (μ_1 + μ_2) / 2

Estimating the parameters
The parameters are estimated from the training data by

π̂_k = n_k / n
μ̂_k = (1 / n_k) Σ_{i: y_i = k} x_i
σ̂² = (1 / (n − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)²

where n is the total number of training observations and n_k is the number of training observations in class k. The estimate σ̂² is a pooled (weighted average) estimate of the within-class sample variances.

LDA, continued
The LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance σ², and plugging estimates of these parameters into the Bayes classifier. Later, we will consider a less stringent set of assumptions, allowing the observations in the kth class to have a class-specific variance σ_k².
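To make the plug-in estimates and discriminant scores above concrete, here is a short NumPy sketch of one-dimensional LDA; it is not part of the original notes, and the eight training points are invented for illustration.

import numpy as np

# Invented 1-D training data for two classes (illustrative only)
x = np.array([-2.1, -1.3, -0.4, -1.8, 0.9, 1.7, 2.2, 1.1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classes = np.unique(y)
n, K = len(y), len(classes)

# Plug-in estimates: priors, class means, pooled within-class variance
pi_hat = np.array([(y == k).mean() for k in classes])
mu_hat = np.array([x[y == k].mean() for k in classes])
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in classes) / (n - K)

def lda_predict(x0):
    # Assign x0 to the class with the largest discriminant score delta_k(x0)
    delta = x0 * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)
    return classes[np.argmax(delta)]

print(lda_predict(0.1), lda_predict(-0.5))

With equal class sizes the estimated priors are equal, so the decision boundary sits at the midpoint of the two estimated class means, matching the formula above.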
Linear Discriminant Analysis when p > 1
We now extend the LDA classifier to the case of multiple predictors. We will assume that X = (X_1, X_2, ..., X_p) is drawn from a multivariate Gaussian (multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.
(Figure: two multivariate Gaussian density functions with p = 2. Left: the two predictors are uncorrelated, with Var(X_1) = Var(X_2) and Cor(X_1, X_2) = 0. Right: the two variables have a correlation of 0.7 and unequal variances.)
Formally, the multivariate Gaussian density for X ~ N(μ, Σ) is defined as

f(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

Here E(X) = μ is the mean of X (a vector with p components), and Cov(X) = Σ is the p × p covariance matrix of X.

Example
Let π_1 = π_2 = π_3 = 1/3. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. (Figure, left: ellipses containing 95% of the probability for each of the three classes; the dashed lines are the Bayes decision boundaries. Right: 20 observations generated from each class, with the corresponding LDA decision boundaries indicated by solid black lines and the Bayes decision boundaries again shown as dashed lines.)

Quadratic Discriminant Analysis
When the f_k(x) are Gaussian densities with the same covariance matrix Σ in each class, this leads to linear discriminant analysis. Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates of the parameters into Bayes' theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix: an observation from the kth class is of the form X ~ N(μ_k, Σ_k), where Σ_k is the covariance matrix of the kth class. Under this assumption, the Bayes classifier assigns an observation X = x to the class for which the quadratic discriminant score

δ_k(x) = −(1/2) (x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) − (1/2) log|Σ_k| + log(π_k)

is largest.

Example
LDA and QDA in two scenarios. In the left-hand panel, the two Gaussian classes have a common correlation of 0.7 between X_1 and X_2; the Bayes decision boundary is linear and is accurately approximated by the LDA decision boundary. The QDA decision boundary is inferior, because it suffers from higher variance without a corresponding decrease in bias. In contrast, the right-hand panel displays a situation in which the orange class has a correlation of 0.7 between the variables and the blue class has a correlation of −0.7. The Bayes decision boundary is quadratic, and so QDA approximates this boundary more accurately than LDA does.

Naive Bayes
In previous sections, we used Bayes' theorem to develop the LDA and QDA classifiers. Here, we use Bayes' theorem to motivate the popular naive Bayes classifier.
Naive Bayes assumes the features are independent within each class. It is useful when p is large, so that multivariate methods like QDA and even LDA break down. Gaussian naive Bayes corresponds to assuming that each Σ_k is diagonal.
Naive Bayes can be used for mixed feature vectors (qualitative and quantitative): if x_j is qualitative, replace the density f_kj(x_j) with a probability mass function (histogram) over the discrete categories. Despite its strong assumptions, naive Bayes often produces good classification results.

LDA on Credit Data
Confusion matrix for the LDA predictions on the 10,000 training observations of the Default data:

                  True No    True Yes    Total
Predicted No        9644        252       9896
Predicted Yes         23         81        104
Total               9667        333      10000

LDA makes (23 + 252)/10000 errors, a 2.75% misclassification rate! Some notes:
This is a training error rate, and we may be overfitting; that is not a big concern here, since n = 10000 and p = 2.
If we classified to the prior (always to class No in this case), we would make 333/10000 errors, or only 3.33%.
Of the true No's, we make 23/9667 = 0.2% errors; of the true Yes's, we make 252/333 = 75.7% errors!
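The three classifiers discussed in this section are all available in scikit-learn. The sketch below is not from the notes: it fits each classifier to a synthetic, imbalanced two-class dataset that merely stands in for the Default data (the dataset and its parameters are assumptions made for illustration) and prints each model's training confusion matrix.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Synthetic, imbalanced two-class data standing in for the Default data (illustrative only)
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training confusion matrix:")
    print(confusion_matrix(y, model.predict(X)))

On data this imbalanced, the overall error rate can look small even when most of the rare positive class is misclassified, which is exactly the point of the error-rate discussion that follows.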
Types of errors
False positive rate: the fraction of negative examples that are classified as positive (0.2% in this example).
False negative rate: the fraction of positive examples that are classified as negative (75.7% in this example).
We produced this table by classifying to class Yes if

P̂r(Default = Yes | Balance, Student) ≥ 0.5

We can change the two error rates by changing the threshold from 0.5 to some other value in [0, 1]: classify to Yes if P̂r(Default = Yes | Balance, Student) ≥ threshold, and vary the threshold.

Varying the threshold
(Figure: overall error rate, false positive rate, and false negative rate plotted against the threshold, for thresholds between 0.0 and 0.5.) In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less. The Equal Error Rate (EER) is the point at which the two error rates are identical; we generally aim to make it as small as possible. How the two errors are traded off can be chosen according to the domain of the problem.

ROC Curve
(Figure: ROC curve, plotting the true positive rate against the false positive rate.) The ROC plot displays both error rates simultaneously, across all thresholds. Sometimes we use the AUC, the area under the curve, to summarize the overall performance; a higher AUC is better.

Logistic Regression versus LDA
For a two-class problem, one can show that for LDA

log(p_1(x) / (1 − p_1(x))) = c_0 + c_1 x_1 + … + c_p x_p

so it has the same form as logistic regression. The difference is in how the parameters are estimated. Logistic regression uses the conditional likelihood based on p(Y | X) (known as discriminative learning). LDA uses the full likelihood based on p(X, Y) (known as generative learning). Despite these differences, in practice the results are often very similar.

Summary
Logistic regression is very popular for classification, especially when K = 2.
LDA is useful when n is small, or when the classes are well separated and the Gaussian assumptions are reasonable; also when K > 2.
Naive Bayes is useful when p is very large.
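As a closing illustration of the last two slides, the sketch below fits logistic regression and LDA to the same synthetic stand-in data (not the real Default data), varies the classification threshold, and summarizes each model with the AUC; all dataset parameters are assumptions made for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced two-class data standing in for the Default data (illustrative only)
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.95, 0.05], random_state=1)

logit = LogisticRegression().fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)

for name, model in [("logistic regression", logit), ("LDA", lda)]:
    scores = model.predict_proba(X)[:, 1]      # estimated p(Y = 1 | X)
    for threshold in (0.5, 0.2, 0.1):
        pred = (scores >= threshold).astype(int)
        fpr = ((pred == 1) & (y == 0)).sum() / (y == 0).sum()   # false positive rate
        fnr = ((pred == 0) & (y == 1)).sum() / (y == 1).sum()   # false negative rate
        print(f"{name}, threshold {threshold}: FPR = {fpr:.3f}, FNR = {fnr:.3f}")
    print(f"{name}: AUC = {roc_auc_score(y, scores):.3f}")

Lowering the threshold trades false negatives for false positives, and on data like this the two models tend to give very similar AUCs, echoing the point above that discriminative and generative fits often behave similarly in practice.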