Language Models - FIDIT

Document Details

Uploaded by FinerCopernicium5470

Sveučilište u Rijeci, Fakultet informatike i digitalnih tehnologija

Sanda Martinčić-Ipšić

Tags

language models, natural language processing, probability, statistics

Summary

This FIDIT presentation introduces language models, covering fundamental concepts such as probability, conditional probability, and Bayes' theorem. It also discusses corpus statistics, word frequency distributions (Zipf's law), collocations, n-gram language models, and their evaluation with perplexity.

Full Transcript


Language Models
Sanda Martinčić-Ipšić, Full Professor
[email protected]

Basics of probability
Probability theory: how likely it is that something happens / occurs.
Probability: a number from the interval [0, 1].
Function: P(A) is the probability of event A, 0 ≤ P(A) ≤ 1.
P(A) = 1 => a certain event; P(A) = 0 => an impossible event.
Probability distribution function.

Conditional and unconditional probability
Unconditional probability (prior probability): P(A) – event A does not depend on other events.
Conditional probability (posterior probability), likelihood: P(A|B) – the probability that A occurs given that B has occurred.
Example: P(stavi) is the probability of the word "stavi" in a text; P(na|stavi) is the probability that the word "na" follows the word "stavi" in the text.

Conditional and unconditional probability II
P(A|B) = P(A,B) / P(B)
P(B|A) = P(A,B) / P(A)

Bayes' theorem
Bayes' theorem computes P(A|B) from the given conditional probability P(B|A).
Known: P(A,B) = P(A|B) P(B) and P(A,B) = P(B|A) P(A), so
P(A|B) = P(B|A) P(A) / P(B) and P(B|A) = P(A|B) P(B) / P(A).

Corpus statistics
The context of a word in a text matters: in a different context a word can have a different meaning.
Marko jede bananu. ("Marko is eating a banana.") U banani smo. (idiomatically, "We are in trouble.")
Statistical methods try to automatically learn / determine lexical and structural relations in a text.

Word frequency
Word counts: Which are the most frequent words in a corpus? How many words are in the corpus (word tokens)? How many distinct words are in the corpus (word types)? What is the average word frequency? Which words occur only once?
Problem: most words in a corpus occur a small number of times, while a minority occur a very large number of times.

Distribution of words in a corpus: Zipf's law
Zipf's law: f ∝ 1/r, where f is the frequency of a word and r is its rank, i.e. its position in the frequency-ordered list.
Meaning: the vast majority of words are extremely rare; only a small number of words have high frequency; low density – sparseness of data (sparse data).

Example (EN): word frequencies in Tom Sawyer (table)

Example (EN): statistics of Tom Sawyer
Number of word tokens: 71,370.
Number of word types (unique words): 8,018.
Average frequency: 71,370 / 8,018 = 8.9.
Some words are very common: 12 words appear more than 700 times each, and 100 words account for more than 50.9% of the text.
49.8% of word types appear only once in the corpus: "hapax legomena", Greek for "read only once".
How can statistics work if we see a word only once – too few occurrences? (See the counting sketch further below.)

Frequency of frequencies (table); total number of word types = 8,018.

Zipf's law: empirical evaluation of Zipf's law for Tom Sawyer (figure)

Significance of Zipf's law
A good indicator of what a text is like and what it contains: a small number of high-frequency words, a medium number of mid-frequency words, and a large number of low-frequency words.
Problems to expect: sparseness of data, a big problem in NLP and ASR.

Collocations
Collocations: typical constructions, sequences of words; words that carry a shared meaning.
Typical constructions, phrases, compounds, names...: ham and eggs, make up, disk drive, The New York Times.
An arbitrary pair of words does not always carry meaning; bigrams analyze exactly all pairs.
Collocations are more than pairs: pairs that have a specific POS structure.
Used in: linguistics (phraseology), ASR and IR.

Bigrams – collocations: collocations in the New York Times corpus with and without filtering (table)

N-grams & Language Modeling
"A bad language model" (cartoon examples)
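To make the Tom Sawyer statistics above concrete, here is a minimal Python sketch (not from the slides) that computes token and type counts, the share of hapax legomena, and the rank × frequency product that Zipf's law predicts should stay roughly constant. The file name tom_sawyer.txt and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def corpus_statistics(text):
    """Token/type counts and a rank-frequency check for a plain-text corpus."""
    # Naive tokenization: lowercase and split on whitespace (an assumption;
    # the slides do not say how Tom Sawyer was tokenized).
    tokens = text.lower().split()
    freq = Counter(tokens)

    n_tokens = len(tokens)                              # word tokens
    n_types = len(freq)                                 # word types (distinct words)
    hapax = [w for w, c in freq.items() if c == 1]      # hapax legomena

    print(f"tokens: {n_tokens}, types: {n_types}")
    print(f"average frequency: {n_tokens / n_types:.1f}")
    print(f"hapax legomena: {len(hapax)} ({100 * len(hapax) / n_types:.1f}% of types)")

    # Zipf's law: f is roughly proportional to 1/r, so rank * frequency
    # should stay in the same order of magnitude across the top ranks.
    for rank, (word, count) in enumerate(freq.most_common(10), start=1):
        print(f"rank {rank:2d}  f={count:5d}  rank*f={rank * count:6d}  {word}")

if __name__ == "__main__":
    # Hypothetical input file; any plain-text corpus will do.
    with open("tom_sawyer.txt", encoding="utf-8") as fh:
        corpus_statistics(fh.read())
```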
Language model
The probability of word sequences: which word will be next?
P("And nothing but the truth") ≈ 0.001
P("And nuts sing on the roof") ≈ 0

Uses of language modeling
NLP: ASR / speech recognition (language model), handwriting recognition, spelling correction, optical character recognition, machine translation, information retrieval...

Idea
Ona želi pojesti veliku zelenu __________? ("She wants to eat a big green ____?") planinu (mountain), kočiju (carriage), košaru (basket), jabuku (apple)...
P(jabuku | veliku zelenu) = ?

Example: spell-checker
Context-sensitive spelling error correction based on n-grams (Mays 1991, Kukich 1992):
"The study was conducted be John Black."
"The design an construction of the..."
"...in about fifteen minuets..."
"He is trying to fine out..."

The principle of language modeling
How to compute P("And nothing but the truth")?
Step 1: decompose the probability:
P("And nothing but the truth") = P("And") × P("nothing" | "And") × P("but" | "And nothing") × P("the" | "And nothing but") × P("truth" | "And nothing but the")

N-gram language model
We have n words and assume that each word depends on the n-1 preceding words.
Predicting the next word amounts to computing the probability P(wn | w1, ..., wn-1).
N-gram: the probability that a word is uttered given that the n-1 preceding words have been uttered:
P(wn | w1 ... wn-1) = P(w1 ... wn) / P(w1 ... wn-1)

Trigram (3-grams)
P("the" | "... whole truth and nothing but") ≈ P("the" | "nothing but"): the probability of uttering "the" given that "nothing but" was uttered.
P("truth" | "... whole truth and nothing but the") ≈ P("truth" | "but the"): the probability of uttering "truth" given that "but the" was uttered.

Simplification – the Markov assumption
Markov assumption: only the n-1 previous words influence the next one; an (n-1)-order Markov model, or n-gram.
No previous word: unigram; one previous word: bigram; two previous words: trigram; three previous words: 4-gram (quadrigram)...
All n-grams that share the same preceding words form an equivalence class (the same equivalence class, or bin).

Computing n-grams
In a corpus we only have frequencies; n-gram probabilities are estimated from counts C(·) in the text:
P("the" | "nothing but") ≈ C("nothing but the") / C("nothing but")
P(wn | w1 ... wn-1) = C(w1 ... wn) / C(w1 ... wn-1)

Computing unigrams and bigrams
Unigram probabilities (1-gram), http://www.wordcount.org/main.php: most likely to transition to "the", least likely to transition to "conquistador".
Bigram probabilities (2-gram): given "the" as the last word, more likely to go to "conquistador" than to "the" again.

N-grams for language generation
Example sentences generated from a unigram and a bigram model.
Unigram: words are chosen independently but with their appropriate frequencies.
"REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE."

N-grams for language generation II
Bigram: second-order word approximation; the word transition probabilities are correct but no further structure is included.
"THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED."

Concordances
Studying a word in context together with its neighboring words.
Does context change the meaning? And how?
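The "second-order approximation" sentences above can be reproduced in spirit with a few lines of Python: estimate bigram transition counts from a corpus and sample one word at a time, so that each word depends only on the previous one. This is a sketch under stated assumptions (a made-up toy corpus, <s>/</s> boundary markers, a fixed random seed), not the exact procedure behind the slide examples.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Collect bigram counts, with <s> and </s> as sentence boundaries."""
    counts = defaultdict(Counter)
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts

def generate(counts, max_len=15, seed=0):
    """Sample a sentence word by word; each word depends only on the previous one."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while len(out) < max_len:
        followers = counts[word]
        # Pick the next word proportionally to its bigram count.
        word = rng.choices(list(followers), weights=followers.values())[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

# Toy training data (an assumption, for illustration only).
corpus = [
    "the head and the character of this point",
    "the character of the english writer",
    "this point is another method for the letters",
]
model = train_bigrams(corpus)
print(generate(model))
```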
Concordances – Tom Sawyer: Key Word In Context (KWIC) (figure)

Language Modeling: Introduction to N-grams

Probabilistic Language Models
Today's goal: assign a probability to a sentence.
Machine translation: P(high winds tonite) > P(large winds tonite).
Spell correction: "The office is about fifteen minuets from my house." Why? P(about fifteen minutes from) > P(about fifteen minuets from).
Speech recognition: P(I saw a van) >> P(eyes awe of an).
Plus summarization, question answering, etc.

Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5, ..., wn).
Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4).
A model that computes either of these, P(W) or P(wn | w1, w2, ..., wn-1), is called a language model ("language model" or "LM" is standard).

How to compute P(W)
How to compute this joint probability: P(its, water, is, so, transparent, that)?
Intuition: rely on the chain rule of probability.

Reminder: The Chain Rule
Recall the definition of conditional probability: P(B|A) = P(A,B) / P(A).
Rewriting: P(A,B) = P(A) P(B|A).
More variables: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C).
The chain rule in general: P(x1, x2, x3, ..., xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1).

How to estimate these probabilities
Could we just count and divide?
P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)
No! Too many possible sentences; we'll never see enough data to estimate these.

Markov Assumption
Simplifying assumption (Andrei Markov): P(the | its water is so transparent that) ≈ P(the | that), or maybe P(the | its water is so transparent that) ≈ P(the | transparent that).
In other words, we approximate each component in the product: P(wi | w1 ... wi-1) ≈ P(wi | wi-k ... wi-1).

Simplest case: Unigram model
Some automatically generated sentences from a unigram model:
"fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass"
"thrift, did, eighty, said, hard, 'm, july, bullish"
"that, or, limited, the"

Bigram model
Condition on the previous word:
"texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen"
"outside, new, car, parking, lot, of, the, agreement, reached"
"this, would, be, a, record, november"

N-gram models
We can extend to trigrams, 4-grams, 5-grams.
In general this is an insufficient model of language, because language has long-distance dependencies: "The computer which I had just put into the machine room on the fifth floor crashed."
But we can often get away with N-gram models.

Language Modeling: Estimating N-gram Probabilities

Estimating bigram probabilities
The maximum likelihood estimate: P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1).

An example
Applying P(wi | wi-1) = c(wi-1, wi) / c(wi-1) to a toy corpus: "<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>"; for example, P(I | <s>) = 2/3.

More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i'm looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i'm looking for a good place to eat breakfast
when is caffe venezia open during the day

Raw bigram counts: out of 9222 sentences (table)

Raw bigram probabilities: normalize by the unigram counts (table)

Bigram estimates of sentence probabilities
P(<s> I want english food </s>) = P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food) = .000031
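A small sketch of the maximum-likelihood bigram estimate P(wi | wi-1) = c(wi-1, wi) / c(wi-1) and of the sentence-probability product above, using the "I am Sam" toy corpus from the slides; the whitespace tokenization and lowercasing are my own simplifications.

```python
from collections import Counter, defaultdict

def mle_bigram_lm(sentences):
    """Maximum likelihood estimates: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    bigram_counts = defaultdict(Counter)
    history_counts = Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        history_counts.update(words[:-1])           # c(w_{i-1}); </s> is never a history
        for prev, cur in zip(words, words[1:]):
            bigram_counts[prev][cur] += 1
    def prob(cur, prev):
        # Unsmoothed MLE: unseen bigrams with a seen history get probability 0.
        return bigram_counts[prev][cur] / history_counts[prev]
    return prob

def sentence_probability(prob, sentence):
    """Multiply the bigram estimates over the <s> ... </s> padded sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= prob(cur, prev)
    return p

# The toy corpus from the slides.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
p = mle_bigram_lm(corpus)
print(p("i", "<s>"))                      # P(I | <s>) = 2/3
print(sentence_probability(p, "I am Sam"))
```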
What kinds of knowledge?
P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

Practical Issues
We do everything in log space, to avoid underflow (also, adding is faster than multiplying):
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
(See the log-space sketch further below.)

Language Modeling Toolkits
SRILM: http://www.speech.sri.com/projects/srilm/
KenLM: https://kheafield.com/code/kenlm/

Google N-Gram Release, August 2006
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Google Book N-grams
http://ngrams.googlelabs.com/
https://books.google.com/ngrams/

Language Modeling: Evaluation and Perplexity

Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones? It should assign higher probability to "real" or "frequently observed" sentences than to "ungrammatical" or "rarely observed" sentences.
We train the parameters of our model on a training set and test the model's performance on data we haven't seen.
A test set is an unseen dataset, different from the training set and totally unused during training.
An evaluation metric tells us how well our model does on the test set.

Extrinsic evaluation of N-gram models
The best evaluation for comparing models A and B: put each model into a task (spelling corrector, speech recognizer, MT system), run the task, and get an accuracy for A and for B (how many misspelled words were corrected properly, how many words were translated correctly); then compare the accuracies for A and B.

Empirically: the effect of different language models on radio-news recognition results
Same acoustic model (triphones); two different bigram language models: the weather-forecast vocabulary (1,462 distinct words) and the news vocabulary (10,230 distinct words).
(chart: recognition accuracy per language model for the 1mix, 5mix, 10mix, 15mix and 20mix acoustic models, ranging roughly from 92.57% to 96.24%)

Difficulty of extrinsic (in-vivo) evaluation of N-gram models
Extrinsic evaluation is time-consuming; it can take days or weeks.
So we sometimes use an intrinsic evaluation: perplexity.
It is a bad approximation unless the test data looks just like the training data, so it is generally only useful in pilot experiments; but it is helpful to think about.

Intuition of Perplexity
The Shannon Game: how well can we predict the next word?
"I always order pizza with cheese and ____" (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice 0.0001, ..., and 1e-100)
"The 33rd President of the US was ____"
"I saw a ____"
Unigrams are terrible at this game. (Why?)
A better model of a text is one which assigns a higher probability to the word that actually occurs.
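As a small illustration of the "Practical Issues" slide above (do everything in log space to avoid underflow), the following sketch computes the same product naively and as a sum of logs; the probability values are made up for illustration.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point;
# summing their logarithms does not (and addition is cheaper than multiplication).
probs = [0.0001] * 200          # illustrative per-word probabilities

naive_product = 1.0
for p in probs:
    naive_product *= p          # 1e-800 is far below the double-precision range

log_prob = sum(math.log(p) for p in probs)   # log(p1*...*pn) = log p1 + ... + log pn

print(naive_product)            # 0.0  (underflow)
print(log_prob)                 # about -1842.07, still perfectly representable
```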
Perplexity intuitively
ASR recognizing the digits "0, 1, 2, 3, 4, 5, 6, 7, 8, 9": easy – perplexity 10.
ASR recognizing 30,000 different names: hard – perplexity 30,000.
"A, B, C, D, E, F, G ... Z": perplexity is 26.
"Alpha, bravo, charlie, delta ... yankee, zulu": perplexity is 26.

Perplexity
The best language model is one that best predicts an unseen test set, i.e. gives the highest P(sentence).
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N)
By the chain rule, for bigrams this becomes PP(W) = ( ∏ 1 / P(wi | wi-1) )^(1/N).
Minimizing perplexity is the same as maximizing probability.

The Shannon Game intuition for perplexity
Josh Goodman: imagine a call-routing phone system that gets 120K calls and has to recognize "Operator" (say this occurs in 1 in 4 calls), "Sales" (1 in 4), "Technical Support" (1 in 4), and 30,000 different names (each name occurring once in the 120K calls).
We get the perplexity of this sequence of length 120K by first multiplying 120K probabilities (90K of which are 1/4 and 30K of which are 1/120K) and then taking the inverse 120,000th root:
Perp = (1/4 × 1/4 × 1/4 × 1/4 × 1/4 × ... × 1/120K × 1/120K × ...)^(-1/120K)
But this can be arithmetically simplified to just N = 4: the operator (1/4), sales (1/4), tech support (1/4), and the 30,000 names (1/120,000):
Perplexity = (1/4 × 1/4 × 1/4 × 1/120K)^(-1/4) = 52.6

Perplexity as branching factor
Suppose a sentence consists of random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit? (It is 10.)

Definition of perplexity
Perplexity is a weighted equivalent branching factor.
The perplexity of a language model loosely represents the branching factor after each word, i.e. the number of possible words that can follow it: a weighted branching factor, how many words can follow each word.
Perplexity measures the complexity of the language model; in ASR it does not measure the acoustic model.

Mathematical definition of perplexity
A measure of the complexity of a language model and an estimate of the size of the search space:
PP = 2^H(W), where H(W) is the entropy.

Entropy
P(W) is the probability of a sequence of N_W words under the language model; H(W) is its entropy:
H(X) = -Σ_{x∈X} P(x) log2 P(x)
H(W) = -(1/N_W) log2 P(W)

Perplexity: Is lower better?
Remarkable fact: the true model for the data has the lowest possible perplexity; the lower the perplexity, the closer we are to the true model.
Typically, perplexity correlates well with speech-recognition word error rate; it correlates better when both models are trained on the same data and doesn't correlate well when the training data changes.
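A minimal sketch of the perplexity definition PP(W) = P(w1 ... wN)^(-1/N), computed in log space; it reproduces the branching-factor intuition for random digits and the value of about 52.6 from Josh Goodman's call-routing example on the slide.

```python
import math

def perplexity(probabilities):
    """PP(W) = P(w1...wN)^(-1/N), computed via log2 for numerical stability."""
    n = len(probabilities)
    log_p = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_p / n)

# Uniform model over the 10 digits: perplexity 10, the branching factor.
print(perplexity([1 / 10] * 8))                        # 10.0

# Josh Goodman's call-routing example: operator, sales, tech support (1/4 each)
# and one of 30,000 names (1/120,000); a sequence of length N = 4 suffices.
print(perplexity([1 / 4, 1 / 4, 1 / 4, 1 / 120_000]))  # about 52.6
```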
Lower perplexity = better model
Training on 38 million words, testing on 1.5 million words (WSJ):
N-gram order: Unigram, Bigram, Trigram
Perplexity: 962, 170, 109

Perplexity of the bigram language model
A measure of the complexity of the language model and an estimate of the size of the search space:
PP = 2^H(L), where H(L) = -(1/n) log2 P(w1, w2, ..., wn) is the entropy and P(w1, w2, ..., wn) is the probability of the word sequence (illustrated in the sketch below).
VEPRAD corpus (perplexity, entropy): radio 11.17, 3.48; news (vijesti) 17.16, 4.01; telephone (telefon) 17.97, 4.17.

Language models of the recognition system
Radio weather forecasts: a bigram model smoothed by backing off to a unigram model; weather-forecast vocabulary; perplexity 11.17; 1,462 distinct words.
Radio news: two bigram language models smoothed by backing off to a unigram model; radio-news vocabulary; perplexity 17.16; 10,230 distinct words.

Frequency of triphones in radio forecasts and news
4,042 triphones across word boundaries in the radio forecasts, about 13% of the total possible number of triphones (31,585).
(chart: number of triphones per frequency band, >=500, 100-500, 50-100, 40-50, 30-40, 20-30, 10-20, 5-10 and below, for the forecasts and the news)
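A short sketch of the entropy-based formulation used above, PP = 2^H(W) with H(W) = -(1/N) log2 P(W), together with a cross-check that it coincides with the inverse-probability definition; the per-word bigram probabilities here are illustrative numbers, not the VEPRAD values.

```python
import math

def entropy_per_word(log2_probs):
    """H(W) = -(1/N) * log2 P(w1...wN), with log2 P(W) given as a sum of per-word log2 probabilities."""
    return -sum(log2_probs) / len(log2_probs)

def perplexity_from_entropy(log2_probs):
    """PP = 2^H(W); algebraically identical to P(W)^(-1/N)."""
    return 2 ** entropy_per_word(log2_probs)

# Illustrative per-word bigram probabilities for a short test sentence (made up).
word_probs = [0.25, 0.1, 0.05, 0.2]
log2_probs = [math.log2(p) for p in word_probs]

H = entropy_per_word(log2_probs)
print(f"entropy H(W)  = {H:.2f} bits per word")
print(f"perplexity 2^H = {perplexity_from_entropy(log2_probs):.2f}")

# Cross-check against the direct definition PP = P(W)^(-1/N).
P_W = math.prod(word_probs)
print(f"P(W)^(-1/N)    = {P_W ** (-1 / len(word_probs)):.2f}")
```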
