
LDA: Latent Dirichlet Allocation


Summary

This document provides a comprehensive overview of Latent Dirichlet Allocation (LDA), a statistical model for discovering topics in a collection of documents. It covers the generative process, the benefits of the Dirichlet priors, the likelihood function, and several approximate inference techniques (Gibbs sampling and variational inference), and it also surveys variants and extensions of LDA. The material is aimed at an academic or research audience; a grasp of the underlying probabilistic concepts is useful for following the content.

Full Transcript


LDA: Latent Dirichlet Allocation [Blei12; BlNgJo01; BlNgJo03]

Topic and word distributions use "sparse" Dirichlet distributions, parameterized via the hyperparameters $\alpha$ and $\beta$. Careful: different sources use different symbols for the priors, the document-topic distribution, and the topic-word distribution (compare [BlNgJo01], [Blei12; BlNgJo03], [GrSt04; WaMiMc09], [ZhMa16], and Wikipedia). In the following, $\theta_d$ denotes the topic distribution of document $d$, $\varphi_k$ the word distribution of topic $k$, and $\alpha$, $\beta$ the corresponding Dirichlet priors.

Generative Process of LDA

LDA assumes that documents are generated as follows:
1. For each document $d$, draw a topic distribution $\theta_d \sim \operatorname{Dir}(\alpha)$ from a Dirichlet distribution.
2. For each topic $k$, draw a word distribution $\varphi_k \sim \operatorname{Dir}(\beta)$ from a Dirichlet distribution.
3. For each word position $i$ in each document $d$, draw a topic $z_{d,i} \sim \operatorname{Mult}(\theta_d)$ from the multinomial.
4. Draw a word $w_{d,i} \sim \operatorname{Mult}(\varphi_{z_{d,i}})$ from the multinomial of topic $z_{d,i}$.
We could also draw the document length (e.g., from a Poisson distribution, as in [BlNgJo03]), but it is usually left out.

Benefits of LDA

Why priors?
▶ yield a full generative model (we can now also generate new documents)
▶ generalizes better to additional documents
▶ penalty for solutions with many topics
▶ preference for solutions with fewer topics
➜ less likely to overfit (fewer solutions are considered to be "good")!
Why the Dirichlet prior? [WaMiMc09]
▶ conjugate to the multinomial distribution (Dirichlet + multinomial ⇝ Dirichlet)
▶ simple

Likelihood Function of LDA

LDA adds the Dirichlet priors to model the likelihood of word and topic distributions:
$p(w, z, \theta, \varphi \mid \alpha, \beta) = \prod_k p(\varphi_k \mid \beta) \prod_d p(\theta_d \mid \alpha) \prod_i p(z_{d,i} \mid \theta_d)\, p(w_{d,i} \mid \varphi_{z_{d,i}})$
Note: $p(\theta_d \mid \alpha)$ and $p(\varphi_k \mid \beta)$ are our prior probabilities, i.e., Dirichlet densities. A better model also needs to have more likely word distributions.

More Complicated Models

Many variations and extensions of LDA have been proposed:
▶ we can vary the prior assumptions (to draw $\theta_d$, $\varphi_k$), e.g., Rethinking LDA: Why Priors Matter [WaMiMc09], but conjugate priors like the Dirichlet-multinomial are easier to handle and compute
▶ learn the number of topics $k$ and the priors $\alpha$, $\beta$ (may require labeled data)
▶ hierarchical topic models: nested Chinese restaurant processes [BGJT03]
▶ Hierarchical Dirichlet Processes [TJBB06]
▶ Pitman-Yor and Poisson-Dirichlet processes [PiYo97; SaNa10]
▶ Correlated Topic Models [BlLa05]
▶ application to other domains (instead of text)
▶ temporal embeddings [RuBl18]

Computing LDA

We do not generate random documents; instead, we need to compute the likelihood of a document and optimize the (hyper-)parameters to best explain the documents. We cannot solve this exactly, so we need to approximate:
▶ variational inference [BlNgJo01; BlNgJo03]
▶ Gibbs sampling [Grif02; PrStDo00]
▶ expectation propagation [MiLa02]
▶ collapsed Gibbs sampling [GrSt04]
▶ collapsed variational inference [TeNeWe06]
▶ sparse collapsed Gibbs sampling [YaMiMc09]
▶ tensor decomposition [AGHK14; WLSH14]
▶ Metropolis-Hastings-Walker sampling [LARS14]
Each has benefits and disadvantages.

Monte-Carlo Methods

We need to estimate complex functions that we cannot handle analytically (🔗 Wikipedia). Estimates of a function $f$ usually look like this:
$\mathbb{E}[f] = \int f(x)\, p(x)\, dx$
where $p(x)$ is the likelihood of the input parameters being $x$. Monte-Carlo methods estimate this from a sample set $x_1, \ldots, x_N$:
$\mathbb{E}[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$
Important: we require the $x_i$ to occur with probability $p(x_i)$.
Example: estimate $\pi$ by choosing points in the unit square uniformly, and testing if they are within the unit circle (here, uniform sampling is okay). This works because points in the circle occur with probability $\pi/4$ (a small code sketch of this example follows after the next section).

Markov Chain Monte Carlo (MCMC)

Monte Carlo is simple, but how do we get such samples $x_i$ according to their probabilities $p(x_i)$? In a Markov process (🔗 Wikipedia), the new state $x_{t+1}$ only depends on the previous state $x_t$. We need to design a transition function $T(x_{t+1} \mid x_t)$ such that the resulting chain has $p$ as its stationary distribution, i.e., samples occur with probability $p(x_i)$, as desired. For $T$, we can use, e.g., Gibbs sampling. We then can estimate our hidden variables! Because of autocorrelation, it is common to use only every $n$-th sample. (We require $T$ above to be ergodic, but omit the details in this lecture.) A really nice introduction to Markov chain Monte Carlo (MCMC) and Gibbs sampling can be found in [ReHa10]. A more formal introduction is in the textbook [Bish07].
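
The π example above is easy to try out. Below is a minimal sketch (not from the original slides; the function and variable names are illustrative), using NumPy to sample uniformly from the unit square:

```python
import numpy as np

def estimate_pi(n_samples=1_000_000, seed=0):
    """Monte-Carlo estimate of pi: sample points uniformly in the unit
    square and count how many fall inside the (quarter) unit circle."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples)
    y = rng.random(n_samples)
    inside = x * x + y * y <= 1.0          # hits inside the circle
    # points land inside with probability pi/4, so scale the hit rate by 4
    return 4.0 * inside.mean()

print(estimate_pi())  # prints a value close to 3.1416 (varies with the sample)
```

With uniform sampling, plain Monte Carlo suffices; MCMC and Gibbs sampling, discussed next, are needed when we cannot sample from $p(x)$ directly.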
Gibbs Sampling – Updating Variables Incrementally

Assume that our state is a vector $x = (x_1, \ldots, x_m)$ with $m$ components (🔗 Wikipedia); we can update one variable at a time:
$x_j^{(t+1)} \sim p\big(x_j \mid x_1^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_m^{(t)}\big) \quad \text{for } j = 1, \ldots, m$
Our transition function $T$ then is to do this for each $j$. Informally: in every iteration $t$, for every variable $x_j$, we choose a new value randomly, but we prefer values that are more likely given the current state of the other variables. More likely values of $x_j$ will be returned more often (indeed, with the desired likelihood $p(x_j \mid x_{\setminus j})$).

Gibbs Sampling – Benefits and Details

$p(x_j \mid x_{\setminus j})$ may not depend on all other variables, but only on the "Markov blanket" of $x_j$. (Markov blanket: parents, children, and other parents of the node's children in the graphical model; 🔗 Wikipedia.) Sometimes we can also "integrate out" some variables to further simplify; this is also called a "collapsed" Gibbs sampler (details on the next slide). If we have a conjugate prior (e.g., Beta for Bernoulli, Dirichlet for multinomial), then we get the same family (but with different parameters) as posterior, which usually yields much simpler equations.

Collapsed Gibbs Sampler [GrSt04]

We need to draw the topic of each word according to $p(z_{d,i} \mid z_{\setminus (d,i)}, w, \alpha, \beta)$. After integrating out $\theta$ and $\varphi$, we get the word-topic probability as a ratio of $\Gamma$ functions. By exploiting properties of the $\Gamma$ function, we can simplify this to
$p(z_{d,i} = k \mid z_{\setminus (d,i)}, w, \alpha, \beta) \;\propto\; (\tilde n_{d,k} + \alpha) \cdot \frac{\tilde n_{k,w_{d,i}} + \beta}{\tilde n_k + |V|\,\beta}$
where $n_{d,k}$ is the number of occurrences of each topic-document assignment, $n_{k,w}$ the number of occurrences of each topic-word assignment, and $n_k$ the number of occurrences of each topic; $\tilde n_{d,k}$, $\tilde n_{k,w}$, and $\tilde n_k$ are the same counts without the current word and its topic assignment, and $|V|$ is the vocabulary size. A detailed derivation can be found in Appendix D of [Chan11].

Inference with Gibbs Sampling

Putting everything together:
Initialization:
1. choose prior parameters $\alpha$ and $\beta$
2. for every document $d$ and word $i$, choose a topic $z_{d,i}$ randomly
3. initialize the counts $n_{d,k}$, $n_{k,w}$, and $n_k$
For every Markov-chain iteration:
1. for every document $d$ and word $i$:
   1. remove the old $z_{d,i}$, $w_{d,i}$ from $n_{d,k}$, $n_{k,w}$, and $n_k$
   2. sample a new random topic $z_{d,i}$ (using the probability above)
   3. update $n_{d,k}$, $n_{k,w}$, and $n_k$ with the new $z_{d,i}$, i.e., re-add the word
2. if past burn-in, and only for every $n$-th sample (decorrelation):
   1. Monte-Carlo update all $\theta_d$, $\varphi_k$ from the counts
(A compact code sketch of this procedure follows below.)
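
To make the procedure above concrete, here is a compact sketch of a collapsed Gibbs sampler for LDA (an illustration under the notation above, not code from the slides; `docs`, `n_topics`, and `vocab_size` are assumed inputs, and for brevity burn-in and thinning are omitted, so only the final state is used for the point estimates):

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # topic counts per document
    n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(n_topics)                 # total word count per topic
    # initialization: assign a random topic to every word and fill the counts
    z = [[rng.integers(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):                 # Markov-chain iterations
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current word/topic assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # p(z=k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                # re-add the word with its newly sampled topic
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # point estimates of the document-topic and topic-word distributions
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + vocab_size * beta)
    return theta, phi
```

A proper implementation would additionally discard a burn-in phase and average the estimates over every $n$-th retained sample, as described above.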
Motivation of Variational Inference

Many interesting distributions are too difficult to compute! Posterior probability of LDA (necessary for EM):
$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$
We cannot compute the denominator $p(w \mid \alpha, \beta)$ because of the integrals over $\theta$ (and $\varphi$) it contains! ➜ Can we approximate the solution? Idea: can we find a simpler probability $q$ with a different (simpler) dependency structure?

Variational Inference [Bish07] [WaJo08]

Variational inference turns the inference into an optimization of the approximation: we have to posit a family of functions $q$, then optimize its (variational) parameters. The Kullback-Leibler divergence of two distributions,
$KL(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx,$
measures how far $q$ diverges from $p$. Here, we will use $q(\theta, z)$ and $p(\theta, z \mid w, \alpha, \beta)$. We can decompose the log-likelihood of the model as
$\log p(w \mid \alpha, \beta) = \mathcal{L}(q) + KL\big(q(\theta, z) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)$
where $\mathcal{L}(q)$ is the evidence lower bound discussed on the next slide. We need to choose a $q$ that is tractable, but that can also approximate $p$ well. Note: $KL(q \,\|\, p)$ is not symmetric: where $q$ is small, $p$ does not matter much; where $q$ is large, we need $q \approx p$.

Evidence Lower Bound (ELBO)

The Kullback-Leibler divergence is still not manageable, but we can use the ELBO instead:
$\log p(w \mid \alpha, \beta) \;\ge\; \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)] \;=\; \mathcal{L}(q)$
using Jensen's inequality to pull the logarithm into the integral/expected value. Maximizing the ELBO is equivalent to minimizing the Kullback-Leibler divergence:
$\mathcal{L}(q) = \log p(w \mid \alpha, \beta) - KL\big(q(\theta, z) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big)$
by using the definition of conditional probability; since $\log p(w \mid \alpha, \beta)$ does not depend on $q$, maximizing $\mathcal{L}(q)$ minimizes the divergence. Because of the inequality used, this is a lower bound. If we find the tightest lower bound, we should be close to the true optimum! Idea: solve this, e.g., with gradient descent on $\mathcal{L}(q)$.

Mean Field Variational Inference for LDA

Choosing a $q$ for LDA: the variable dependencies on the priors make the problem difficult, hence:
1. make all variables independent and give each its own prior:
2. each document $d$ gets its own variational Dirichlet parameter $\gamma_d$ (for $\theta_d$)
3. each word position $(d,i)$ gets its own variational topic weights $\tilde\varphi_{d,i}$ (for $z_{d,i}$)
4. each topic $k$ gets its own variational Dirichlet parameter $\lambda_k$ (for $\varphi_k$)
5. drop the actual word nodes
Note: we want a very broad search space, so we can get close to the true optimum. But: the independence assumptions introduce some inaccuracies.

Variational Approximation for LDA

Variational approximation (plate diagram of the fully factorized $q$), compared with the graphical model of LDA (diagrams not reproduced in this transcript).

Mean Field Variational Inference for LDA /2

The resulting variational posterior now is
$q(\theta, \varphi, z) = \prod_k q(\varphi_k \mid \lambda_k) \prod_d q(\theta_d \mid \gamma_d) \prod_i q(z_{d,i} \mid \tilde\varphi_{d,i})$
Note: because of the independence, this nicely decomposes.

Optimizing Variational LDA

Using this posterior, we get the optimization problem of maximizing the ELBO $\mathcal{L}(\gamma, \lambda, \tilde\varphi)$ over the variational parameters; the objective then turns out to be a sum of expectations under $q$ that can be evaluated in closed form. We can now use this in the expectation-maximization algorithm.

Variational Expectation Maximization for LDA

We can now maximize our variational parameters.
For each topic $k$ and term $w$, update $\lambda$:
$\lambda_{k,w} = \beta + \sum_d \sum_i \mathbb{1}[w_{d,i} = w]\; \tilde\varphi_{d,i,k}$
For each document $d$, update $\gamma$:
$\gamma_{d,k} = \alpha + \sum_i \tilde\varphi_{d,i,k}$
For each document $d$ and word position $i$, update $\tilde\varphi$:
$\tilde\varphi_{d,i,k} \propto \exp\Big(\Psi(\gamma_{d,k}) + \Psi(\lambda_{k,w_{d,i}}) - \Psi\big(\textstyle\sum_{w'} \lambda_{k,w'}\big)\Big)$
where $\Psi$ is the usual digamma function. Note: $\Psi$ is fairly expensive to compute, hence it is common to use a fast approximation. (A small code sketch of these updates follows at the end of this transcript.)

Variational Inference vs. Gibbs Sampling

Both approaches have their benefits and drawbacks:
▶ Gibbs sampling has an additional inner loop to estimate frequencies
▶ variational inference uses the expensive digamma function very often
▶ Gibbs sampling is usually more accurate [AWST09]
▶ since the ELBO is not convex, we may end up in poor local optima
▶ variational inference is usually faster [TJBB06]
Collapsed variational inference improves the accuracy of VI [TeNeWe06], but it remains worse than collapsed Gibbs sampling and is more expensive.
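
As a companion to the variational EM updates above, here is a minimal sketch of one round of the mean-field updates (an illustration under the notation above, not code from the slides; it uses NumPy and SciPy's digamma, and a full implementation would iterate the per-document updates of $\tilde\varphi$ and $\gamma$ to convergence before updating $\lambda$):

```python
import numpy as np
from scipy.special import digamma

def mean_field_round(docs, gamma, lam, alpha, beta):
    """One round of mean-field updates for smoothed LDA.
    gamma: (D, K) variational Dirichlet parameters for document-topic dists.
    lam:   (K, V) variational Dirichlet parameters for topic-word dists."""
    new_lam = np.full_like(lam, beta)      # start from the prior, add counts below
    # E_q[log phi_{k,w}] = digamma(lam_{k,w}) - digamma(sum_w lam_{k,w})
    e_log_phi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    for d, doc in enumerate(docs):
        phi_tilde = np.zeros((len(doc), gamma.shape[1]))
        for i, w in enumerate(doc):
            # phi_tilde is proportional to exp(digamma(gamma_dk) + E_q[log phi_{k,w}])
            log_p = digamma(gamma[d]) + e_log_phi[:, w]
            log_p -= log_p.max()           # numerical stability before exp
            p = np.exp(log_p)
            phi_tilde[i] = p / p.sum()     # q(z_{d,i} = k)
        gamma[d] = alpha + phi_tilde.sum(axis=0)   # update gamma_d
        for i, w in enumerate(doc):
            new_lam[:, w] += phi_tilde[i]          # accumulate lambda update
    return gamma, new_lam
```

The digamma calls correspond exactly to the $\Psi$ terms in the update formulas; as noted above, they dominate the runtime, which is why fast approximations are commonly used.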
