14 – Evaluation of Topic Modeling
Evaluation of Probabilistic Topic Models

Probabilistic models compute probabilities for observing the documents. During training, we try to maximize the likelihood of observing our training data.
➜ can we use this as a quality measure?

Beware of the usual caveats:
▶ overfitting: because we optimize this value, we will be too optimistic
▶ model complexity: a more complex model (e.g., more topics) will fit the data better
▶ comparability: different models compute the probabilities differently (e.g., different preprocessing or ignoring word order affects the perceived probabilities)
▶ computational problems when approximating the reference probabilities [WMSM09]
But: the likelihood was even found to be negatively correlated with human judgment [CBGW09].

Shannon Entropy

The entropy of a sample is:
    H(p) = −∑_x p(x) · log₂ p(x)
We have seen entropy before in cluster evaluation: mutual information.
Intuition: the minimum average number of bits required to encode the values of an infinite sample with an idealized (not necessarily real) compression technique.
➜ the number of bits of “information” in each object
Examples:
▶ fair coin toss: H = log₂ 2 = 1 bit
▶ fair die: H = log₂ 6 ≈ 2.58 bits
▶ two dice: H = log₂ 36 ≈ 5.17 bits
▶ sum of two dice: H ≈ 3.27 bits
▶ uniform 2…12: H = log₂ 11 ≈ 3.46 bits

Cross-Entropy 🔗 Wikipedia

Cross-entropy compares two probability distributions p and q:
    H(p, q) = −∑_x p(x) · log₂ q(x)
Note that H(p, p) = H(p), that H(p, q) ≥ H(p) for all q, and that
    H(p, q) = H(p) + D_KL(p ‖ q)
i.e., the Kullback–Leibler divergence is the excess entropy.
Intuition: encode the data distributed as p with the encoding scheme obtained from q.
But: we do not know the true probabilities p! We can use a validation sample and hope that it approximates the true p well enough.

Perplexity

Given a validation set V and model probabilities q, perplexity is defined as
    PP(V) = (∏_{x∈V} 1/q(x))^{1/|V|} = 2^{−(1/|V|) ∑_{x∈V} log₂ q(x)}
Because we do not know the true distribution p, we assume every sample in V has weight 1/|V|.
Note: this is the geometric mean of 1/q(x).
Warning: occasionally, you will see perplexity defined and used based on entropy directly, i.e., as 2^{H(q)}; this will not give the same results unless p = q.
Warning: often perplexity is normalized to a per-word perplexity, not a per-document perplexity.
Warning: if q(x) = 0 for any x (considered “impossible” by the model), the value becomes ∞! Caused by, e.g., a previously unseen word! ⇝ remove rare words, use smoothing, etc.

Interpreting Perplexity 🔗 Wikipedia

Perplexity relates to the average number of equally likely outcomes.
▶ Consider a fair coin, q(x) = 1/2 and H = 1. Then PP = 2¹ = 2.
▶ Consider a fair die, q(x) = 1/6 and H = log₂ 6. Then PP = 2^{log₂ 6} = 6 – what does this mean?
It is common to interpret perplexity as “we expected this to occur in 1 of PP cases”. If we expect correct data to occur more often, we fit the data better.
Equivalent: we expected these events to occur with probability 1/PP.
But: this uses the geometric average, not the arithmetic average.
Expecting a word to occur is on a different scale than expecting a document to occur!
Sum of two dice: PP = 2^{3.27} ≈ 9.7 – but what does this intuitively mean?
▶ Average probability of each outcome? No: 1/11 ≈ 0.091 ≠ 1/9.7.
▶ Weighted average probability of each outcome? No: ∑_x p(x)² ≈ 0.113 ≈ 1/8.9 ≠ 1/9.7.
▶ Weighted geometric average: ∏_x p(x)^{p(x)} = 2^{−H} ≈ 1/9.7. ✓
⇝ okay to judge the relative quality of a fit, but not intuitively interpretable as a probability.

Coherence [MWTL11; NLGB10; RöBoHi15]

Several variants of this measure have been discussed in the literature. The basic idea is that in a coherent topic, the top words co-occur in the same documents.
Let P(wᵢ) be the fraction of documents containing word wᵢ.
Let P(wᵢ, wⱼ) be the fraction of documents containing the neighboring words wᵢ and wⱼ.
These are computed within the original corpus or a reference corpus such as Wikipedia.
Given the top words w₁, …, w_N of a topic, use, e.g.,
    C_UMass = ∑_{i=2}^{N} ∑_{j=1}^{i−1} log ( (P(wᵢ, wⱼ) + ε) / P(wⱼ) )    [MWTL11]
    C_UCI = ∑_{i<j} log ( (P(wᵢ, wⱼ) + ε) / (P(wᵢ) · P(wⱼ)) )    [NLGB10]
where ε is a constant to avoid the logarithm of 0.
Note: these measures are only two common choices
… and at least 237912 other possible recombinations/parameterizations exist [RöBoHi15]
(code sketches for entropy, perplexity, and coherence follow below)
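The following is a minimal Python sketch (not part of the original slides) that reproduces the dice examples above: Shannon entropy, cross-entropy, and perplexity computed as the geometric mean of 1/q(x). The function names and the validation-sample setup are illustrative choices, not a prescribed API.

```python
from collections import Counter
from math import log2
import random

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x) = H(p) + KL(p || q)."""
    return -sum(px * log2(q[x]) for x, px in p.items() if px > 0)

def perplexity(validation, q):
    """Geometric mean of 1/q(x) over a validation sample, computed in log space.
    A sample with q(x) = 0 raises an error here, matching the warning above."""
    return 2 ** (-sum(log2(q[x]) for x in validation) / len(validation))

# Distributions from the slides:
coin = {s: 1 / 2 for s in "HT"}              # H = 1 bit,   perplexity 2
die = {i: 1 / 6 for i in range(1, 7)}        # H ~ 2.58,    perplexity 6
dice_sum = Counter()                         # sum of two dice, H ~ 3.27
for i in range(1, 7):
    for j in range(1, 7):
        dice_sum[i + j] += 1 / 36

print(entropy(coin), entropy(die), entropy(dice_sum))   # 1.0  2.585  3.274
print(2 ** entropy(dice_sum))                           # ~9.68: not 11, not ~8.9

# Encoding the dice-sum data with a uniform model over 2..12 costs extra bits:
uniform = {k: 1 / 11 for k in range(2, 13)}
print(cross_entropy(dice_sum, uniform))                 # log2(11) ~ 3.46 > 3.27

# Perplexity on a validation sample drawn from the true distribution:
random.seed(0)
sample = random.choices(list(dice_sum), weights=list(dice_sum.values()), k=10_000)
print(perplexity(sample, dice_sum))                     # close to 2 ** H ~ 9.68
print(perplexity(sample, uniform))                      # exactly 11 (constant q)
```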
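Below is a small Python sketch (again not from the slides) of document co-occurrence based coherence in the spirit of [MWTL11] and [NLGB10]. The helper names, the ε value, and the toy corpus are made up for illustration; real implementations differ in details such as window-based co-occurrence counting and the smoothing constant (the original UMass formulation adds 1 to raw document counts rather than using a tiny ε).

```python
from itertools import combinations
from math import log

def document_frequencies(docs):
    """P(w) and P(w_i, w_j) as fractions of documents containing the word(s)."""
    n = len(docs)
    p_word, p_pair = {}, {}
    for doc in docs:
        words = set(doc)
        for w in words:
            p_word[w] = p_word.get(w, 0.0) + 1.0 / n
        for a, b in combinations(sorted(words), 2):
            p_pair[a, b] = p_pair.get((a, b), 0.0) + 1.0 / n
    return p_word, p_pair

def joint(p_pair, a, b):
    return p_pair.get(tuple(sorted((a, b))), 0.0)

def umass_coherence(top_words, p_word, p_pair, eps=1e-12):
    """Sum over word pairs of log((P(w_i, w_j) + eps) / P(w_j)), cf. [MWTL11]."""
    return sum(log((joint(p_pair, wi, wj) + eps) / p_word[wj])
               for i, wi in enumerate(top_words) for wj in top_words[:i])

def uci_coherence(top_words, p_word, p_pair, eps=1e-12):
    """Pairwise PMI of the top words, cf. [NLGB10]."""
    return sum(log((joint(p_pair, wi, wj) + eps) / (p_word[wi] * p_word[wj]))
               for wi, wj in combinations(top_words, 2))

# Toy corpus with two "topics"; coherent top-word sets score higher than mixed ones.
docs = [["topic", "model", "word"], ["topic", "model", "corpus"],
        ["soccer", "goal", "league"], ["soccer", "goal", "match"]]
p_word, p_pair = document_frequencies(docs)
print(umass_coherence(["topic", "model"], p_word, p_pair))   # ~0 (always co-occur)
print(umass_coherence(["topic", "goal"], p_word, p_pair))    # very negative
print(uci_coherence(["topic", "model"], p_word, p_pair))     # log 2 > 0
```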
Evaluation of Topic Models – “Reading Tea Leaves” [CBGW09; LaNeBa14]

Topic model evaluation is difficult:
“There is a disconnect between how topic models are evaluated and why we expect topic models to be useful.” — David Blei [Blei12]
▶ manual inspection of the most important words in each topic (“eye-balling”)
▶ perplexity and coherence
▶ often evaluated with a secondary task (e.g., classification, IR) [WMSM09]
▶ by the ability to explain held-out documents with existing clusters [WMSM09] (a document is “well explained” if it has a high probability in the model)
▶ word intrusion task [CBGW09] (can a user identify a word that was artificially injected into the most important words?)
▶ topic intrusion task [CBGW09] (can the user identify a topic that does not apply to a test document?)

Conclusions

Topic modeling is related to cluster analysis:
▶ topics are often comparable to cluster centers
▶ documents may belong to multiple topics; most clusterings are “hard”
▶ the algorithms share ideas, such as EM
▶ the algorithms contain special adaptations for text (e.g., sparsity, priors)
▶ computationally quite expensive; scalability is a concern
▶ evaluation is a major challenge, as with clustering
▶ subjective quality often does not agree with quality measures