Chapter 5: Part-of-Speech Tagging



University of Jordan, Computer Information Systems Department, King Abdullah School of Information Technology

Outline
- (Mostly) English Word Classes
- Tagsets for English
- Part-of-Speech Tagging
- Rule-Based Part-of-Speech Tagging
- HMM Part-of-Speech Tagging
  - Computing the Most Likely Tag Sequence: An Example
  - Formalizing Hidden Markov Model Taggers
  - Using the Viterbi Algorithm for HMM Tagging
  - Extending …

Part-of-Speech Tagging: Terminology
- Tagging: the process of associating labels (tags) with each token in a text, using an algorithm to select a tag for each word, e.g.:
  - hand-coded rules
  - statistical taggers
  - the Brill (transformation-based) tagger
  - hybrid taggers: combinations, e.g. by "vote"
- Example of words and their tags: the/DET boy/N throws/V the/DET ball/N over/P the/DET wall/N (a minimal NLTK illustration follows the word-class slides below)
- Tag set: the collection of tags used for a particular task, e.g. the Brown or LOB tagset

Motivation
- Speech synthesis: pronunciation, e.g. "record" as a noun (NN) and "record" as a verb (VB) are pronounced differently
- Speech recognition: class-based N-grams
- Information retrieval: stemming (the POS tells us which morphological affixes a word can take), noun-phrase recognition and query-document matching based on meaningful units rather than individual terms, and selecting high-content words (POS lets us ignore function words like "the")
- Information extraction: tagging and partial parsing help identify useful terms and the relationships between them

Motivation (continued)
- Question answering: analysing a query to understand what type of entity the user is looking for and how it relates to the other noun phrases mentioned in the question
- Word-sense disambiguation: POS tags help build automatic word-sense disambiguation algorithms, e.g. "flies" (NN) vs. "flies" (VBZ)
- Corpus analysis of language and lexicography: finding instances or frequencies of particular constructions in large corpora, which helps in building lexicons

What is a word class?
Words that somehow "behave" alike:
- appear in similar contexts
- perform similar functions in sentences
- undergo similar transformations
For example, if we know the difference between possessive pronouns (my, your, etc.) and personal pronouns (I, you, he, etc.), we have some idea of what the next word will be: a possessive pronoun is likely to be followed by a noun, a personal pronoun by a verb.

Word Classes
Word classes can be divided into two broad categories: closed and open.
- Closed classes: new words are rarely coined. These are often called "function words" (grammatical words: short, frequent, and important to the grammar rather than to the content).
- Open classes: often gain newly coined or borrowed terms (e.g. "to fax" or "to email").

Open Class Words
- Nouns: people, places, things
  - Proper nouns: Ahmad, Jordan, IBM
  - Common nouns:
    - count nouns: goat/goats, relationship/relationships
    - mass nouns: snow, salt, communism
- Verbs: actions and processes: draw, provide, differ, go
- Adjectives: properties, qualities: white, old, good
- Adverbs:
  - directional adverbs: home, here, downhill
  - degree adverbs: extremely, very, somewhat
  - manner adverbs: slowly, delicately
  - temporal adverbs: yesterday, Monday

Closed Class Words
Examples:
- prepositions: on, under, over, …
- particles: up, down, on, off, …
- determiners: a, an, the, …
- pronouns: she, who, I, …
- conjunctions: and, but, or, …
- auxiliary verbs: can, may, should, …
- numerals: one, two, three, third, …
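As a concrete illustration of the terminology above, the short sketch below runs NLTK's off-the-shelf tagger over the sentence from the words-and-tags example. It is only meant to show what "associating labels with each token" looks like in practice; this tagger and its Penn Treebank tagset are not the taggers developed in this chapter, and the exact output may vary with the NLTK version and the models installed.

>>> import nltk
>>> # one-time model downloads may be needed, e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
>>> tokens = nltk.word_tokenize("The boy throws the ball over the wall")
>>> nltk.pos_tag(tokens)   # Penn Treebank tags such as DT, NN, VBZ, IN
[('The', 'DT'), ('boy', 'NN'), ('throws', 'VBZ'), ('the', 'DT'), ('ball', 'NN'), ('over', 'IN'), ('the', 'DT'), ('wall', 'NN')]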
Some of the best-known tagsets
- Brown corpus: 87 tags (more when tags are combined, e.g. for "isn't")
- LOB corpus: 132 tags
- Penn Treebank: 45 tags
- Lancaster UCREL C5 (used to tag the BNC): 61 tags
- Lancaster C7: 145 tags

The Brown Corpus
- An early digital corpus (1961), Francis and Kucera, Brown University
- Contents: 500 texts, each 2,000 words long, from American books, newspapers, and magazines
- Representing genres such as science fiction, romance fiction, press reportage, scientific writing, and popular lore

Python: help(nltk.corpus.brown)
>>> help(nltk.corpus.brown)
 |  paras(self, fileids=None, categories=None)
 |  raw(self, fileids=None, categories=None)
 |  sents(self, fileids=None, categories=None)
 |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 |  words(self, fileids=None, categories=None)

nltk.corpus.brown
>>> nltk.corpus.brown.words()[:20]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that']
>>> nltk.corpus.brown.tagged_words()[:20]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS')]
>>> nltk.corpus.brown.tagged_sents()[:3]
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), …
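Since the Brown corpus is directly accessible as shown above, it is easy to check how ambiguous its vocabulary actually is; the upcoming slides quote roughly 12% of word types and 40% of word tokens as ambiguous. The sketch below is an illustration only (the exact figures depend on case-folding and on whether combined tags are collapsed); it counts word types that occur with more than one tag.

>>> import nltk
>>> from nltk.probability import ConditionalFreqDist
>>> tagged = nltk.corpus.brown.tagged_words()
>>> cfd = ConditionalFreqDist((w.lower(), t) for (w, t) in tagged)
>>> ambiguous_types = sum(1 for w in cfd.conditions() if len(cfd[w]) > 1)
>>> ambiguous_types / len(cfd.conditions())                                      # fraction of word types with 2+ tags
>>> sum(cfd[w].N() for w in cfd.conditions() if len(cfd[w]) > 1) / len(tagged)   # fraction of word tokens that are ambiguous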
Penn Treebank
- http://www.cis.upenn.edu/~treebank/
- First large syntactically annotated corpus
- 1 million words from the Wall Street Journal
- Part-of-speech tags and syntax trees

>>> nltk.corpus.treebank.tagged_sents()[:3]
[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), ('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), ('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ','), ('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')]]

Penn Treebank Tagset
A 48-tag simplification of the Brown Corpus tagset.

Part-of-Speech Tagging
How do we assign POS tags to words in a sentence?
- Get/V the/Det bass/N
- Time/[V,N] flies/[V,N] like/[V,Prep] an/Det arrow/N
  - Time/N flies/V like/Prep an/Det arrow/N
- Fruit/N flies/N like/V a/Det banana/N
- Fruit/N flies/V like/Prep a/Det banana/N
Problem: words often have more than one word class.

Potential Sources of Disambiguation
- Many words have only one POS tag (e.g. is, Mary, very, smallest)
- Others have a single most likely tag (e.g. a, dog)
- But tags also tend to co-occur regularly with other tags (e.g. Det, N)

How hard is POS tagging?
In the Brown corpus, 12% of word types are ambiguous, and 40% of word tokens are ambiguous.

Number of tags:        1      2     3    4   5   6   7
Number of word types:  35340  3760  264  61  12  2   1 ("still")

Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to "race" given its lexical frequency.
Solution: choose the tag with the greater probability, P(race | VB) vs. P(race | NN).
Actual estimates from the Switchboard corpus: P(race | NN) = .00041, P(race | VB) = .00003
This suggests we should always tag race/NN (correct 41/44 ≈ 93% of the time).

Part-of-Speech Tagging approaches
- Rule-based tagger: ENGTWOL
- Stochastic tagger: HMM-based
- Transformation-based tagger: Brill

Rule-Based Tagging
- ENGTWOL tagger (now ENGCG-2): http://www.lingsoft.fi/cgi-bin/engcg
- Basic idea: assign all possible tags to words, then remove tags according to a set of rules, such as:
  if word+1 is an adjective, adverb, or quantifier, and the following word is a sentence boundary, and word-1 is not a verb like "consider", then eliminate non-adverb tags, else eliminate the adverb tag.
- Typically more than 1000 hand-written rules, but rules may also be machine-learned.
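The lexical probabilities quoted in the "race" example above come from the Switchboard corpus, but comparable estimates can be read off any tagged corpus. The sketch below is illustrative only; it uses the Brown corpus, whose tagset happens to use the same NN and VB labels, and estimates each probability as a relative frequency.

>>> import nltk
>>> from nltk.probability import ConditionalFreqDist
>>> emit = ConditionalFreqDist((t, w.lower()) for (w, t) in nltk.corpus.brown.tagged_words())
>>> emit['NN'].freq('race')   # estimate of P(race | NN) = count(race tagged NN) / count(NN)
>>> emit['VB'].freq('race')   # estimate of P(race | VB) = count(race tagged VB) / count(VB)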
Stochastic Tagging
Statistical methods for reducing ambiguity in sentences have been developed using large corpora of word usage (e.g. the Brown Corpus). The words and sentences in these corpora are pre-tagged with the parts of speech, grammatical structure, and frequencies of usage for words and sentences taken from a broad sample of written language. Part-of-speech tagging is the process of selecting the most likely part of speech from among the alternatives for each word in a sentence. In practice, only about 10% of the distinct words in a million-word corpus (Brown) have two or more parts of speech.

Stochastic Tagging
- Simple method: choose the most frequent tag in the training text for each word (unigram).
- Result: about 90% accuracy (a baseline).
- Other approaches apply additional tools to do better (by up to roughly an extra 10%); the HMM tagger is one example.

Training and Testing of Machine Learning Algorithms
- Algorithms that "learn" from data see a set of examples and try to generalize from them.
- Training set: the examples trained on.
- Test set: also called held-out or unseen data. Use this for evaluating your algorithm; it must be kept separate from the training set, otherwise you cheated!
- "Gold standard" evaluation corpus: an evaluation set that a community has agreed on and uses as a common benchmark. Not "seen" until development is finished; used ONLY for evaluation.

Cross-Validation of Learning Algorithms
- Cross-validation set: part of the training set, used for tuning parameters of the algorithm without "polluting" (tuning to) the test data.
- You can train on x% of the training data and cross-validate on the remaining (100 − x)%, e.g. train on 90% and cross-validate (test) on the remaining 10%. Repeat several times with different splits.
- This lets you choose the best settings to then use on the real test set.
- You should only evaluate on the test set at the very end, after you have made your algorithm as good as possible on the cross-validation set.

Strong Baselines
- When designing NLP algorithms, you need to evaluate them by comparing them to others.
- Baseline algorithm: an algorithm that is relatively simple but can be expected to do "ok"; it should get the best score possible by doing the obvious thing.

A Tagging Baseline
- Find the most likely tag for the most frequent words.
- Frequent words are ambiguous, and you are likely to see them in any collection: you will always see "to" but might not see "armadillo".
- How to do this? First find the most frequent words and their most likely tags in the training data, then train a tagger that looks up these results in a table.

Find the most frequent words and the most likely tag of each
>>> import nltk
>>> from nltk.probability import FreqDist, ConditionalFreqDist
>>> wordcounts = FreqDist()
>>> tagscounts = ConditionalFreqDist()
>>> for sent in nltk.corpus.brown.tagged_sents():
...     for (w, t) in sent:
...         wordcounts[w] += 1
...         tagscounts[w][t] += 1
>>> frequentwords = sorted(w for (w, _) in wordcounts.most_common(1000))
>>> table = dict((word, tagscounts[word].max()) for word in frequentwords)
>>> table['to']
'TO'
>>> table['the']
'AT'
>>> table['said']
'VBD'

Unigram Tagger
- Train on a set of sentences, keeping track of how many times each word is seen with each tag.
- After training, associate with each word its most likely tag.
- Problem: many words are never seen in the training data.
- Solution: have a default tag to "back off" to.
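The unigram tagger just described can also be trained directly from tagged sentences instead of from a hand-built lookup table. Below is a minimal sketch, assuming the Brown "news" category as data and a simple 90/10 train/test split; the category and split point are illustrative choices, not something prescribed by the slides.

>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> split = int(len(tagged_sents) * 0.9)
>>> train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
>>> unigram_tagger.evaluate(test_sents)   # tagging accuracy on the held-out sentences (newer NLTK: .accuracy)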
Use our own tagger class
>>> baseline_tagger = nltk.UnigramTagger(model=table)
>>> sent = nltk.corpus.brown.sents(categories='news')[3]
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None), ('handful', None), ('of', 'IN'), ('such', 'JJ'), ('reports', None), ('was', 'BEDZ'), ('received', 'VBN'), ("''", "''"), (',', ','), ('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','), ('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None), ('interest', 'NN'), ('in', 'IN'), ('the', 'AT'), ('election', None), (',', ','), ('the', 'AT'), ('number', 'NN'), ('of', 'IN'), ('voters', None), ('and', 'CC'), ('the', 'AT'), ('size', 'NN'), ('of', 'IN'), ('this', 'DT'), ('city', 'NN'), ("''", "''"), ('.', '.')]
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.6113133241840205

Unigram tagger with backoff
>>> baseline_tagger = nltk.UnigramTagger(model=table, backoff=nltk.DefaultTagger('NN'))
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', 'NN'), ('a', 'AT'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'IN'), ('such', 'JJ'), ('reports', 'NN'), ('was', 'BEDZ'), ('received', 'VBN'), ("''", "''"), (',', ','), ('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), (',', ','), ('``', '``'), ('considering', 'NN'), ('the', 'AT'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('the', 'AT'), ('number', 'NN'), ('of', 'IN'), ('voters', 'NN'), ('and', 'CC'), ('the', 'AT'), ('size', 'NN'), ('of', 'IN'), ('this', 'DT'), ('city', 'NN'), ("''", "''"), ('.', '.')]
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.6910814089941723

What's wrong with unigram?
- The most frequent tag isn't always right!
- We need to take the context into account:
  - Which sense of "to" is being used?
  - Which sense of "like" is being used?

Stochastic POS Tagging
Problem: given a text W = (w1, w2, …, wn-1, wn), what is the most likely POS sequence T = (t1, t2, …, tn-1, tn)?
Goal: choose the best sequence of tags T for the sequence of words in a sentence.
Assume: we have a large training corpus labeled with the correct POS tags, which we can use to gather statistics (estimate probabilities).
Choose (t1, t2, …, tn) so as to maximize P(t1, …, tn | w1, …, wn).

Stochastic POS Tagging
So we need to find
T′ = argmax_T P(T | W)
By Bayes' rule,
P(T | W) = P(T) P(W | T) / P(W)
The denominator is fixed for the maximization, so we can ignore it:
T′ = argmax_T P(T) P(W | T)
where P(T) is the prior and P(W | T) is the likelihood.

Likelihood and prior
1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it:
   P(W | T) ≈ ∏_{i=1}^{n} P(wi | ti)
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ ∏_{i=1}^{n} P(ti | ti-1)
3. The most probable tag sequence estimated by the bigram tagger:
   T′ ≈ argmax_T ∏_{i=1}^{n} P(ti | ti-1) P(wi | ti)

Stochastic POS Tagging: The Prior
For the prior P(t1, …, tn): assume that each ti depends only on ti-1 (the Markov assumption). By the chain rule,
P(t1, …, tn) = P(t1) P(t2 | t1) P(t3 | t2, t1) P(t4 | t3, t2, t1) …
             ≈ ∏_{i=1}^{n} P(ti | ti-1), where P(t1 | t0) = P(t1 | Ø) = P(t1)
The P(ti | ti-1) are transition probabilities.

Likelihood and prior: the bigram assumption
The BIGRAM assumption: the probability of a tag appearing depends only on the previous tag. Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams and are used as the basis for simple statistical analysis of text. The bigram assumption is related to the first-order Markov assumption.
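As a quick aside relating back to "What's wrong with unigram?": NLTK provides n-gram lookup taggers that condition on the previous tag and back off when they lack evidence. A minimal sketch, reusing the train_sents/test_sents split from the earlier example; note that this is a lookup-style tagger, not the HMM tagger formalized in the following slides.

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)   # usually somewhat higher than the unigram tagger alone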
Stochastic POS Tagging: Transition Probabilities
[Diagram: a chain of hidden tag states t1 → t2 → t3, entered with probability P(t1) and linked by transition probabilities P(t2 | t1) and P(t3 | t2); each state emits possible words.]
The states of the HMM correspond to the tags, and the output alphabet consists of the words in the dictionary (or classes of words). The "states" ti cannot be directly observed.

Hidden Markov Model (HMM)
Probabilistic parameters of a hidden Markov model (example):
- x — states
- y — possible observations
- a — state transition probabilities
- b — output probabilities

Stochastic POS Tagging: The (Lexical) Likelihood
For the likelihood, assume that the choice of wi depends only on its own tag ti. By the chain rule,
P(W | T) = P(w1, w2, …, wn | t1, t2, …, tn) ≈ ∏_{i=1}^{n} P(wi | ti)
Combining the prior and the likelihood:
T′ ≈ argmax_T ∏_{i=1}^{n} P(ti | ti-1) P(wi | ti)

Stochastic POS Tagging: The (Lexical) Likelihood
T′ ≈ argmax_T ∏_{i=1}^{n} P(ti | ti-1) P(wi | ti)
This gives us the "best guess" of the hidden states (POS tags) t1 … tn of the HMM, given the values of the "observables" (words) w1 … wn.
Probability estimates come from a tagged corpus, e.g.:
- transition: P(N | DET) ≈ #(DET followed by N) / #(DET)
- lexical: P(like | V) ≈ #("like" occurring as V) / #(V)

Markov model
A Markov model is a probabilistic model of symbol sequences in which the probability of the current event is conditioned only on the previous event.

Improving accuracy in POS tagging
We could store long sequences of words and their POS tags, but this is not feasible: it takes too much storage, and it is difficult (impossible!) to find all sequences in advance. If we make some independence assumptions, we can restrict the number of words we consider at any time:
- bigrams — examine two words at a time
- trigrams — examine three words at a time
- n-grams — examine n words at a time

Improving accuracy in POS tagging
- Bigrams for the initial word: since the first word is not preceded by anything, a hypothetical category Ø is used, e.g. P(w | Ø).
- Some words do not occur in the hand-tagged corpus with a particular tag (e.g. "man" as a verb). These are handled by giving the missing bigrams a very small default value, e.g. 0.0001, i.e. adding some occurrences of "man" as V, A, P, etc. to the corpus when estimating probabilities.

HMM Tagger
Intuition: pick the most likely tag for this word. HMM taggers choose the tag sequence that maximizes
P(word | tag) × P(tag | previous n tags)
i.e. find the POS tags that generate the sequence of words: look for the most probable sequence of tags T underlying the observed words W.

An Example
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How should we tag to/TO race/??? and the/DT race/???
t = argmax P(word | tag) × P(tag | previous tag)
Max [ P(VB | TO) P(race | VB), P(NN | TO) P(race | NN) ]
From the Brown corpus:
P(NN | TO) = .021, × P(race | NN) = .00041  →  .000007
P(VB | TO) = .34,  × P(race | VB) = .00003  →  .00001   (winner)

Disambiguating "race"

Markov chain = "First-order observable Markov Model"
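The comparison in the "race" example can be reproduced, at least approximately, from corpus counts. The sketch below is illustrative only: it estimates transition probabilities from tag bigrams and lexical probabilities from (tag, word) pairs in the Brown corpus, so the resulting numbers will differ from the figures quoted on the slide.

>>> import nltk
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.util import bigrams
>>> tagged = nltk.corpus.brown.tagged_words()
>>> trans = ConditionalFreqDist(bigrams(t for (w, t) in tagged))     # tag-to-tag transition counts
>>> emit = ConditionalFreqDist((t, w.lower()) for (w, t) in tagged)  # tag-to-word emission counts
>>> trans['TO'].freq('VB') * emit['VB'].freq('race')   # ≈ P(VB | TO) × P(race | VB)
>>> trans['TO'].freq('NN') * emit['NN'].freq('race')   # ≈ P(NN | TO) × P(race | NN)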
Bigram-HMM Tagger
How do we compute the most probable tag sequence? With the Viterbi algorithm.
Suppose we need to tag the English sentence "time flies":
time /(V, N)?   flies /(V, N)?

Bigram-HMM Tagger
[Lattice: a start state Ø, states N and V for "time", and states N and V for "flies", linked by transition probabilities such as P(N | N) and P(V | V) and weighted by the lexical probabilities P(time | N), P(time | V), P(flies | N), P(flies | V).]
Step 1 ("time"):
P1(N) = P(N | Ø) × P(time | N)
P1(V) = P(V | Ø) × P(time | V)
Step 2 ("flies"):
P2(N) = max { P1(N) × P(N | N) × P(flies | N),  P1(V) × P(N | V) × P(flies | N) }
P2(V) = max { P1(N) × P(V | N) × P(flies | V),  P1(V) × P(V | V) × P(flies | V) }
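To make the lattice computation above concrete, here is a small self-contained Viterbi sketch for a bigram HMM. The probability tables are toy placeholder numbers for the two-tag "time flies" example, not values estimated from any corpus.

from typing import Dict, List, Tuple

def viterbi(words: List[str], tags: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> Tuple[float, List[str]]:
    """Return (probability, tag sequence) of the most probable path for `words`."""
    # Initialisation: P1(t) = P(t | start) * P(w1 | t)
    lattice = {t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}
    # Recursion: Pi(t) = max over prev of Pi-1(prev) * P(t | prev) * P(wi | t)
    for w in words[1:]:
        lattice = {
            t: max(
                ((p * trans_p[prev][t] * emit_p[t].get(w, 0.0), path + [t])
                 for prev, (p, path) in lattice.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    # Termination: best final state and its back-pointer path
    return max(lattice.values(), key=lambda x: x[0])

# Toy placeholder probabilities for "time flies" with tags N and V
tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}                                     # P(t | start)
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}   # P(t_i | t_i-1)
emit_p = {"N": {"time": 0.10, "flies": 0.01},
          "V": {"time": 0.01, "flies": 0.10}}                      # P(w_i | t_i)

print(viterbi(["time", "flies"], tags, start_p, trans_p, emit_p))
# With these toy numbers the result is approximately (0.0049, ['N', 'V']), i.e. time/N flies/V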
