Maximum Entropy Models
Conditional Probability vs. Joint Probability

We want to find the most likely inner states $s_1, \ldots, s_T$, i.e., maximize the conditional probability:

    $\arg\max_{s_1,\ldots,s_T} P(s_1,\ldots,s_T \mid o_1,\ldots,o_T)$

Equivalently, we can maximize the joint probability given the observations $o_1, \ldots, o_T$, because $P(o_1,\ldots,o_T)$ is constant:

    $\arg\max_{s_1,\ldots,s_T} P(s_1,\ldots,s_T, o_1,\ldots,o_T)$

In the Hidden Markov Model, we used the joint probability. With Maximum Entropy Taggers, we optimize the conditional probability directly.

From Unsupervised to Supervised

Hidden Markov Models are unsupervised: we fit the states to the observed sentence. For most applications this is insufficient:
▶ states could be predefined by the application (e.g., verb, noun, adjective, …)
▶ we want to learn transitions from training data and generalize to new test data
▶ test data may have unknown words (transitions?)
▶ words are not independent

Markov Models (supervised ⇝ state not "hidden" in training):
▶ training by counting frequencies
▶ estimates unreliable for low frequencies (incl. the zero-value problem)
▶ "naïve Bayes" on sequences
➜ can we do better?

Handling of Rare and Unknown Words

We may need state transitions also for unknown words. It is often better to handle rare words the same way as unknown words, because we have too few examples to reliably estimate transition probabilities.
Idea 1: use the total average transitions, as if all words were unknown. But: in natural language text, most unknown words will actually be nouns (names)!
Idea 2: use word similarity to infer the most likely type. Uppercase: most likely a noun. Lowercase: ?
Can we learn the features / weights?

Features in Natural Language Processing [Ratn96]

Feature functions are binary functions $f_j(o_t, s_t) \in \{0, 1\}$, e.g., $f_j(o_t, s_t) = 1$ if $o_t$ is capitalized and $s_t$ is "noun", and $0$ otherwise.
Typical features used with natural language:
▶ word at the current position (as in the Hidden Markov Model)
▶ previous words, next words (consider "to book" vs. "the book")
▶ prefix or suffix of the current word (e.g., "-ing")
▶ position in the sentence: first word, last word
▶ lowercase, Capitalized, ALLCAPS, CamelCase, Numbers, M1xed
▶ contains a hyphen, number
▶ word shape: "Angela" → "Xxxxxx"
▶ …
E.g., in CoreNLP: edu/stanford/nlp/ie/NERFeatureFactory.java

Maximum Entropy Markov Models (MEMM) [McFrPe00; Ratn96]

Log-linear model with feature weights $\lambda_j$ (recall that $f_j(o, s) \in \{0, 1\}$: we have intuitive and efficient presence/absence semantics) and the conditional probability:

    $P(s \mid s', o) = \frac{1}{Z(o, s')} \exp\Big( \sum_j \lambda_j f_j(o, s) \Big)$

For optimization, we need to constrain the expectation of each feature ("consistency"):

    $\frac{1}{m_{s'}} \sum_{k=1}^{m_{s'}} f_j(o_k, s_k) = \frac{1}{m_{s'}} \sum_{k=1}^{m_{s'}} \sum_{s} P(s \mid s', o_k)\, f_j(o_k, s)$

with $m_{s'}$ the number of time steps from a particular previous state $s'$.
Choose the maximum entropy model that satisfies this constraint, e.g., via Generalized Iterative Scaling (GIS [DaRa72]) with smoothing to reduce overfitting.

Label Bias Problem [LaMcPe01]

(worked example with a state transition diagram omitted)
Because each state normalizes its outgoing transitions locally, a state with more out-edges has a smaller average outgoing weight, so the most likely path always prefers the states with few outgoing edges, largely independent of the observations. More out-edges ⇝ label bias problem.
Solution: global normalization, CRF.
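
To make the binary feature functions from "Features in Natural Language Processing" concrete, here is a minimal sketch in Python. The function names and the tiny feature set are illustrative assumptions only; they do not reproduce CoreNLP's NERFeatureFactory.

```python
import re

def word_shape(word):
    """Map a word to its shape, e.g. 'Angela' -> 'Xxxxxx'."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return shape

def extract_features(words, t, prev_state):
    """Return the names of the binary features that are active (value 1)
    at position t; every feature not listed is implicitly 0."""
    w = words[t]
    feats = {
        f"word={w.lower()}",                                     # current word
        f"prev_word={words[t-1].lower() if t > 0 else '<s>'}",   # previous word
        f"suffix3={w[-3:].lower()}",                             # suffix, e.g. "-ing"
        f"shape={word_shape(w)}",                                # word shape
        f"prev_state={prev_state}",                              # Markov dependency on previous tag
    }
    if t == 0:
        feats.add("first_word")
    if w[0].isupper():
        feats.add("capitalized")
    if "-" in w:
        feats.add("has_hyphen")
    if any(c.isdigit() for c in w):
        feats.add("has_digit")
    return feats
```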
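
A sketch of how the MEMM's log-linear conditional probability $P(s \mid s', o)$ could be evaluated with such presence/absence features. The tag set and weights are invented for illustration; learning the weights (e.g., via GIS) is not shown.

```python
import math

def memm_probabilities(feats, weights, states):
    """P(s | s', o) proportional to exp(sum_j lambda_j f_j(o, s)), locally
    normalized over the possible next states.  `feats` holds the active
    (value-1) features, `weights` maps (feature, state) -> lambda."""
    scores = {s: math.exp(sum(weights.get((f, s), 0.0) for f in feats))
              for s in states}
    z = sum(scores.values())            # per-step normalizer Z(o, s')
    return {s: score / z for s, score in scores.items()}

# Illustrative usage with made-up weights: a capitalized unknown word
# should lean towards NOUN.
states = ["NOUN", "VERB", "ADJ"]
weights = {("capitalized", "NOUN"): 1.5, ("suffix3=ing", "VERB"): 1.2}
feats = {"capitalized", "shape=Xxxxxx", "prev_state=DET"}
print(memm_probabilities(feats, weights, states))
```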
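
A toy numerical illustration of the label bias effect described on the last slide; the two states and their transition weights are invented, but they show how local normalization favours the state with fewer out-edges.

```python
# State A has 2 out-edges, state B has 5.  With per-state (local)
# normalization the transition weights must sum to 1 in each state,
# so A's transitions average 0.5 while B's average only 0.2 -- the best
# path tends to run through A even when the observation favours B.
probs_A = [0.5, 0.5]                    # average outgoing weight 0.5
probs_B = [0.4, 0.3, 0.1, 0.1, 0.1]     # average outgoing weight 0.2

path_through_A = 0.5 * max(probs_A)     # reach A with prob 0.5, then best edge
path_through_B = 0.5 * max(probs_B)     # reach B with prob 0.5, then best edge
print(path_through_A, path_through_B)   # 0.25 vs. 0.2: the path through A wins
```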