Basic Text Processing
Document Details
Daniel Jurafsky & James H. Martin
Summary
This document presents lecture notes on basic text processing, covering tokenization, word normalization, and sentence segmentation. It provides technical details on algorithms such as Byte Pair Encoding (BPE) for subword tokenization, and discusses examples from languages written without word spaces, such as Chinese, and from morphologically complex languages, such as Turkish.
Full Transcript
Basic Text Processing

References
Daniel Jurafsky & James H. Martin. 2020. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition. (Online preprint: https://web.stanford.edu/~jurafsky/slp3)

Outline
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences

Word tokenization
https://blog.floydhub.com/tokenization-nlp/

Issues in Tokenization
Can't just blindly remove punctuation:
m.p.h., Ph.D., AT&T, cap'n
prices ($45.55)
dates (01/02/06)
URLs (http://www.stanford.edu)
hashtags (#nlproc)
email addresses ([email protected])
Clitic: a word that doesn't stand on its own
"are" in we're, French "je" in j'ai, "le" in l'honneur, "nya" in "anaknya" (Indonesian)
When should multiword expressions (MWE) be words?
New York, rock 'n' roll

Space-based tokenization
A very simple way to tokenize.
For languages that use space characters between words: writing systems based on the Arabic, Cyrillic, Greek, Latin, etc. scripts.
Segment off a token between instances of spaces.
Use regular expressions! (A short regex sketch appears below, after the BPE learner slide.)

Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't use spaces to separate words!
How do we decide where the token boundaries should be?

Word tokenization in Chinese
Chinese words are composed of characters called "hanzi" (or sometimes just "zi").
Each one represents a meaning unit called a morpheme.
Each word has on average 2.4 of them.
But deciding what counts as a word is complex and not agreed upon.

How to do word tokenization in Chinese?
姚明进入总决赛 "Yao Ming reaches the finals"
3 words? 姚明 进入 总决赛 (YaoMing reaches finals)
5 words? 姚 明 进入 总 决赛 (Yao Ming reaches overall finals)
7 characters (don't use words at all)? 姚 明 进 入 总 决 赛 (Yao Ming enter enter overall decision game)

Word tokenization / segmentation
So in Chinese it's common to just treat each character (zi) as a token, making the segmentation step very simple.
In other languages (like Thai and Japanese), more complex word segmentation is required.
The standard algorithms are neural sequence models trained by supervised machine learning.

Another option for text tokenization
Instead of white-space segmentation or single-character segmentation:
Use the data to tell us how to tokenize.
Subword tokenization (because tokens can be parts of words as well as whole words).

Subword tokenization
Three common algorithms:
Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
Unigram language modeling tokenization (Kudo, 2018)
WordPiece (Schuster and Nakajima, 2012)
All have two parts:
A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens).
A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary.

Byte Pair Encoding

Byte Pair Encoding (BPE) token learner
Let the vocabulary be the set of all individual characters = {A, B, C, D, ..., a, b, c, d, ...}
Repeat:
  Choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B')
  Add a new merged symbol 'AB' to the vocabulary
  Replace every adjacent 'A' 'B' in the corpus with 'AB'
Until k merges have been done.
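Referring back to the space-based tokenization slide: below is a minimal Python sketch of regex-based tokenization. The pattern is purely illustrative (it is not the course's tokenizer) and deliberately mishandles the problem cases from the "Issues in Tokenization" slide, which is exactly why those cases matter.

```python
import re

# Illustrative only: match runs of word characters, or single punctuation marks.
# A naive pattern like this breaks m.p.h., Ph.D., $45.55, URLs, and clitics,
# which is why real tokenizers need more careful rules or learned models.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def naive_tokenize(text):
    return TOKEN_RE.findall(text)

print("the map we found".split())          # pure space-based tokenization
print(naive_tokenize("m.p.h. and Ph.D. aren't simple"))
# ['m', '.', 'p', '.', 'h', '.', 'and', 'Ph', '.', 'D', '.', 'aren', "'", 't', 'simple']
```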
BPE token learner algorithm

function BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab V
  V ← all unique characters in C                       # initial set of tokens is characters
  for i = 1 to k do                                    # merge tokens k times
    tL, tR ← most frequent pair of adjacent tokens in C
    tNEW ← tL + tR                                     # make new token by concatenating
    V ← V + tNEW                                       # update the vocabulary
    Replace each occurrence of tL, tR in C with tNEW   # and update the corpus
  return V

Figure 2.13: The token learner part of the BPE algorithm, which takes a corpus broken up into individual characters or bytes and learns a vocabulary by iteratively merging tokens. Figure adapted from Bostrom and Durrett (2020).

Byte Pair Encoding (BPE) Addendum
Most subword algorithms are run inside space-separated tokens.
So we commonly first add a special end-of-word symbol '_' before each space in the training corpus.
Next, separate the corpus into letters.

BPE token learner
The algorithm is usually run inside words (not merging across word boundaries), so the input corpus is first white-space-separated to give a set of strings, each corresponding to the characters of a word, plus a special end-of-word symbol '_', and its counts. Let's see its operation on the following tiny input corpus of 18 word tokens, with counts for each word (the word low appears 5 times, the word newer 6 times, and so on), which has a starting vocabulary of 11 letters:

corpus                 vocabulary
5  l o w _             _, d, e, i, l, n, o, r, s, t, w
2  l o w e s t _
6  n e w e r _
3  w i d e r _
2  n e w _
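As a concrete companion to Figure 2.13, here is a short Python sketch of the token learner, run on the tiny corpus above. It is a minimal reimplementation for illustration (using '_' as the end-of-word symbol), not the textbook's or any library's code; ties between pairs are broken by first occurrence, matching the choice made in the walkthrough that follows.

```python
from collections import Counter

# A minimal sketch of the BPE token learner (cf. Figure 2.13), run on the tiny
# 18-token corpus from the slides; '_' stands in for the end-of-word symbol.
corpus = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "s", "t", "_"): 2,
    ("n", "e", "w", "e", "r", "_"): 6,
    ("w", "i", "d", "e", "r", "_"): 3,
    ("n", "e", "w", "_"): 2,
}

def learn_bpe(corpus, k):
    vocab = {sym for word in corpus for sym in word}   # initial vocab = characters
    merges = []
    for _ in range(k):
        # count all pairs of adjacent symbols, weighted by word frequency
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)               # most frequent adjacent pair
        merged = a + b                                 # (ties broken by first occurrence)
        merges.append((a, b))
        vocab.add(merged)
        # replace every adjacent a, b in the corpus with the merged symbol
        new_corpus = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(merged); i += 2
                else:
                    out.append(word[i]); i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = learn_bpe(corpus, k=8)
print(merges)
# [('e','r'), ('er','_'), ('n','e'), ('ne','w'), ('l','o'), ('lo','w'), ('new','er_'), ('low','_')]
```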
The BPE algorithm first counts all pairs of adjacent symbols: the most frequent is the pair e r, because it occurs in newer (frequency of 6) and wider (frequency of 3) for a total of 9 occurrences. We then merge these symbols, treating er as one symbol, and count again:

Merge e r to er
corpus                 vocabulary
5  l o w _             _, d, e, i, l, n, o, r, s, t, w, er
2  l o w e s t _
6  n e w er _
3  w i d er _
2  n e w _

Now the most frequent pair is er _, which we merge; our system has learned that there should be a token for word-final er, represented as er_:

Merge er _ to er_
corpus                 vocabulary
5  l o w _             _, d, e, i, l, n, o, r, s, t, w, er, er_
2  l o w e s t _
6  n e w er_
3  w i d er_
2  n e w _

Next n e (total count of 8) gets merged to ne:

Merge n e to ne
corpus                 vocabulary
5  l o w _             _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2  l o w e s t _
6  ne w er_
3  w i d er_
2  ne w _

If we continue, the next merges are:

Merge        Current Vocabulary
(ne, w)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)   _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_

(Note that there can be ties; we could have instead chosen to merge r _ first, since that pair also has a frequency of 9.)

Once we've learned our vocabulary, the token segmenter is used to tokenize a test sentence. The token segmenter just runs, on the test data, the merges we have learned from the training data, greedily, in the order we learned them.

BPE token segmenter algorithm
On the test data, run each merge learned from the training data:
Greedily, and in the order we learned them (test frequencies don't play a role).
So: merge every e r to er, then merge every er _ to er_, etc.
Result:
The test word "n e w e r _" would be tokenized as the full word newer_.
The test word "l o w e r _" would be two tokens: low er_.
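A matching sketch of the token segmenter: it simply replays the learned merges, in order, on a test word. It assumes the `merges` list produced by the learner sketch above; like that sketch, it is illustrative rather than any library's implementation.

```python
# Sketch of the BPE token segmenter: apply the learned merges to a test word,
# greedily and in the order they were learned (test frequencies play no role).
# `merges` is the list produced by the learner sketch above (an assumption).
def segment(word, merges):
    symbols = list(word)                      # start from single characters
    for a, b in merges:                       # replay merges in learned order
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(symbols[i]); i += 1
        symbols = out
    return symbols

print(segment("newer_", merges))   # ['newer_']       -> a single full-word token
print(segment("lower_", merges))   # ['low', 'er_']   -> two subword tokens
```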
Properties of BPE tokens
Usually include frequent words and frequent subwords,
which are often morphemes like -est or -er.
A morpheme is the smallest meaning-bearing unit of a language:
unlikeliest has 3 morphemes: un-, likely, and -est.

Word Normalization

Word Normalization
Putting words/tokens in a standard format:
U.S.A. or USA
uhhuh or uh-huh
Fed or fed
am, is, be, are

Case folding
Applications like IR: reduce all letters to lower case, since users tend to use lower case.
Possible exception: upper case in mid-sentence?
e.g., General Motors, Fed vs. fed, SAIL vs. sail
For sentiment analysis, MT, and information extraction, case is helpful (US versus us is important).

Lemmatization
Represent all words as their lemma, their shared root = dictionary headword form:
am, are, is → be
car, cars, car's, cars' → car
Spanish quiero ('I want'), quieres ('you want') → querer ('want')
He is reading detective stories → He be read detective story

Lemmatization is done by Morphological Parsing
Morphemes: the small meaningful units that make up words
Stems: the core meaning-bearing units
Affixes: parts that adhere to stems, often with grammatical functions
Morphological parsers:
Parse cats into two morphemes, cat and s.
Parse Spanish amaren ('if in the future they would love') into the morpheme amar 'to love' and the morphological features 3PL and future subjunctive.
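In practice, a quick way to get lemmas from Python is a dictionary-based lemmatizer such as NLTK's WordNetLemmatizer. This is a lexicon lookup rather than the full morphological parsing described above, and it is not the course's own tool; the outputs shown in the comments are the expected WordNet lemmas, assuming the WordNet data has been downloaded.

```python
# Dictionary-based lemmatization with NLTK's WordNet lemmatizer (illustrative).
# Requires the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("cars"))            # expected: 'car'
print(wnl.lemmatize("are", pos="v"))    # expected: 'be'
print(wnl.lemmatize("is", pos="v"))     # expected: 'be'
```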
Stemming
Reduce terms to stems, chopping off affixes crudely.
For example, the passage:

This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.

produces the following stemmed output:

Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note

Porter Stemmer
Based on a series of rewrite rules run in series: a cascade, in which the output of each pass is fed as input to the next pass.
Some sample rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε, if the stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.), can be found on Martin Porter's homepage; see also the original paper (Porter, 1980).
Simple stemmers can be useful when we need to collapse across different variants of the same lemma. Nonetheless, they tend to commit errors of both over- and under-generalization (Krovetz, 1993).

Dealing with complex morphology is necessary for many languages
e.g., the Turkish word:
Uygarlastiramadiklarimizdanmissinizcasina
'(behaving) as if you are among those whom we could not civilize'
Uygar 'civilized' + las 'become' + tir 'cause' + ama 'not able' + dik 'past' + lar 'plural' + imiz 'p1pl' + dan 'abl' + mis 'past' + siniz '2pl' + casina 'as if'

Sentence Segmentation

Sentence Segmentation
! and ? are mostly unambiguous, but the period "." is very ambiguous:
Sentence boundary
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
Common algorithm:
Tokenize first: use rules or ML to classify a period as either (a) part of the word or (b) a sentence boundary. An abbreviation dictionary can help.
Sentence segmentation can then often be done by rules based on this tokenization. (A toy sketch follows at the end of these notes.)

Q&A
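To close, here is a toy sketch of the rule-based period disambiguation described on the sentence segmentation slide. The abbreviation dictionary and the example sentence are purely illustrative; a real system would use a much larger dictionary or a trained classifier.

```python
# Toy rule-based sentence segmentation: decide whether a final '.' belongs to
# the token (abbreviation) or marks a sentence boundary.
# The abbreviation dictionary below is illustrative only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = (token.endswith(("!", "?")) or
                         (token.endswith(".") and token.lower() not in ABBREVIATIONS))
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:                       # flush any trailing material
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith joined Apple Inc. in 2020. Did growth hit 4.3 percent? Yes."))
# ['Dr. Smith joined Apple Inc. in 2020.', 'Did growth hit 4.3 percent?', 'Yes.']
# Limitation: a sentence that genuinely ends in an abbreviation is missed.
```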