2_part-basic.pdf

Introduction to NLP Part: Basic Text Processing Christin Seifert April 22, 2024 OVERVIEW 1. Text Pre-Processing 2. Regular Expressions 3. Minimum Edit Distance 1 Text Pre-Processing PREPROCESSING OF TEXT Possible elements of text pre-processing 1. Remove layout elements 2. Detect the language of the text 3. Detect term (word) boundaries 4. Detect sentence boundaries 5. Remove stop words 6. Stemming and normalization 2 PRE - PROCESSING Example phrase: to sleep perchance to dream Token Sequence of characters representing a useful semantic unit, instance of a type 5 tokens: to sleep perchance to dream Type Common concept for tokens, element of the vocabulary (”something with a unique meaning in the real world”) 4 types: sleep perchance to dream Term Representation of a type that is stored in the dictionary of an index, might be a normalized version if stop words are not important for the task at hand we have 3 terms: sleep perchance dream Procedure: 1. Map all possible tokens to the corresponding type 2. Store all terms of the relevant type in the dictionary. That’s our vocabulary. 3 What are tokens, types and terms? Coffee is a drink. Coffee is the solution! 4 EXAMPLE Hell, if I could explain it to the average person, it wouldn’t have been worth the Nobel prize.1 Tokens Hell if I could explain it to the average person it wouldn’t have been worth the Nobel prize Types (e.g.) Hell would it Nobel prize... Count of terms as one or multiple tokens (“Nobel prize”) depends on the goal. Count of types (e.g, is “would” and “wouldn’t” of same type) depends on the goal. 1 Richard Feynman, People Magazine, 1985 5 Would this also work similarly in your mother tongue? 5 LANGUAGE DETECTION Later processing might depend on the language of the text. For instance stop word lists are language dependent. Simplest method is based on Letter Frequencies Percentage of occurrence Letter English German A 8.17 6.51 E 12.70 17.40 I 6.97 7.55 O 7.51 2.51 U 2.76 4.35 Better – more elaborate – methods use statistics over more than one letter, e.g., statistics over two, three or even more consecutive letters (N-Gram Frequencies). 6 TOKENIZATION Goal Split text into tokens to identify the units later processing should work on (e.g., part-of-speech taggers consider one token at a time) Simplest method: whitespace and punctuation as delimiter “San Francisco” → ? “I’m” and “I am”→ ? “camel case” and “camel-case” and “camelcase” and “CamelCase”→ ? “UK” and “U.K.”→ ? “kmh” and “km/h” → ? German compound nouns: e.g., Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company French articles: “L’atelier”, meaning the studio 7 TOKENIZATION Tokenization using supervised learning Train a model with annotated training data, use the trained model to tokenize unknown text Hidden-Markov-Models and conditional random fields are commonly used Tokenization using dictionaries Build a dictionary (i.e., list) of tokens Go over the sentence and always take the longest fitting token (greedy algorithm!) Remark: Works well for Asian languages without white spaces and short words, problematic for European languages. 8 SENTENCE DETECTION Naive approach Every period and every “?” and “!” mark the end of a sentence Problem Periods also mark abbreviations, decimal points, email-addresses, e.g., Dr. House is calling. The height is 0.14 inch. Mendeley Ldt. now is part of the Elsevier Corporation. Solution Build a binary classifier, deciding for each period whether it is the end of the sentence (EOS) or not (NEOS). 9 SENTENCE DETECTION Are there more than 2 blank lines after me? More Features: NO YES Is the next letter Is the final EOS capitalized? punctuation "?" or "!" or ":"? Is the period NO YES surrounded by digits? Is the final EOS punctuation "." Has the next word a YES high probability of NO occuring at the NEOS Am I a known abbreviation? beginning of a NO YES sentence (e.g, Die, EOS NEOS The)? 10 STOP WORDS Extremely common words that appear in nearly every text As stop words are so common, their occurrence does not characterize a text We might decide to just drop them2 Stop word list Ignore words that are given on a list (black list), e.g., articles (a, an, the), conjunction (and, or, but,...), pre- and postposition (in, for, from) Problem Special names and phrases (The Who, Let It Be,...) Solution Make another list... (white list) 2 e.g., reasonable for text classification, but not for grammar checking 11 NORMALIZATION Some tokens carry the same meaning for most applications, e.g. Search for UK should return documents with U.K. Search for Car should return documents with cars3 Search for beauty should return documents with beautiful Possible steps Ignore cases (UK → uk) Stemming: removal of affixes (e.g., hammers → hammer) Lemmatization: map inflections and variant forms to their base form (e.g., beautiful → beauty) Rule based removal of hyphens, periods, white spaces, accents, diacritics4 3 Documents with the synonym automobile require different processing 4 Be aware of possible different meaning after removal. 12 STEMMING & LEMMATIZATION Goal of both Reduce different grammatical forms of a word to their base form. Stemming Find the word stem by chopping off/replacing last part of words (Example: Porter’s algorithm). Lemmatization Find the correct dictionary form. Examples Map do, doing, done to common infinitive do Map digitalizing, digitalized to digital Map master’s, masters’, master to master 13 PORTER STEMMER Most common English stemmer Set of rules, applied in predefined sequence, for example step rule example 1a sses → ss misses → miss ies → i libraries → librari ss → ss miss → miss s→∅ houses → house 1b (*vowel*)ing → ∅ dancing → danc king → king (*vowel*)ed → ∅ danced → danc 2 for longer stems ational → ate international → internate ator → ate terminator → terminate 3 for longer stems able → ∅ understandable → understand al → ∅ survival → surviv 14 STEMMING – COMPARISON Figure 1: Comparison of different stemming algorithms, examples from 15 Regular Expressions Assume you have a very long text document (more than 100 pages), and you want to know whether your name is mentioned. How would you find out? What if your name is misspelled? Or only your first name is mentioned? 16 REGULAR EXPRESSIONS Regular Expressions are a formal language for specifying text strings. How can we search for any of these? woodchuck woodchucks Woodchuck Woodchucks https://en.wikipedia.org/wiki/File: Groundhog_With_Burrow_Material.jpg 17 REGULAR EXPRESSIONS Basic Character Matching Specific Character: A matches A literally (case-sensitive) Text A nalytics 1/2. Specific Character List: [Aa] matches A or a. Text A n a lytics 1/2. Range: [A-Z] matches all uppercase alphabet. T ext A nalytics 1/2. [A-Za-z0-9] matches all alphanumeric characters. Text Analytics 1 / 2. Dot(.): matches any character Text Analytics 1/2. Backslash(\): Escapes special characters, \. matches dot. Text Analytics 1/2. Note: In Python, re.search() returns the first match, re.findAll() returns all matches. Build, test and debug Regex at https://regex101.com 18 REGULAR EXPRESSIONS Disjunction & Negation Disjunction: red|green matches red or green a|b|c == [abc] matches a or b or c Negation: [ˆA-Z] matches all characters except capital letters. T ext A nalytics 1/2. aˆb matches literally. the expression aˆb 19 REGULAR EXPRESSIONS Anchors Start of String: ˆ[A-Z] matches all uppercase alphabet at the start of the string/line. T ext Analytics 1/2. End of String:.$ matches any character at the end of the string/line. Text Analytics 1/2. Text Analytics 1/2 ? Text Analytics 1/2 ! \.$ matches a dot at the end of the string. Text Analytics 1/2. Text Analytics 1/2? Text Analytics 1/2! In multiline mode, Anchors match at the start/end of each line, otherwise at the start/end of the whole string. 20 REGULAR EXPRESSIONS Quantifiers Question mark (?): zero or one occurrence of the previous character, colou?r matches either colour or color. Asterisk(*): zero or more occurrences of the previous character, wo*w matches ww, wow, woow, woooooooow, etc. Plus(+): one or more occurrences of the previous character, wo+w matches wow, woow, wooow, woooooooow, etc. {m}: exactly m occurrences of the previous character, wo{3}w matches wooow. {m,n}: between m and n occurrences of the previous character, wo{3,5}w matches wooow, woooow, wooooow. 21 REGULAR EXPRESSIONS Quantifiers Quantifiers are greedy, i.e., they match as much as possible. matches the whole string b. Append a question mark for lazy matching. matches and in the string b. 22 REGULAR EXPRESSIONS Groups & Special Parentheses (... ): indicates a group, which can later be referenced by \index Special: \s matches whitespace characters \d matches digits \w matches alphanumeric characters \b matches word boundaries, i.e. \bfoo\b matches foo but not foobar 23 REGULAR EXPRESSIONS – APPLICATION CHATBOT ELIZA: one of the first chatbots You: Men are all alike. Eliza: IN WHAT WAY You: They’re always bugging us about something or other. Eliza: CAN YOU THINK OF A SPECIFIC EXAMPLE You: Well, my boyfriend made me come here. Eliza: YOUR BOYFRIEND MADE YOU COME HERE You: He says i’m depressed much of the time. Eliza: I AM SORRY TO HEAR YOU ARE DEPRESSED Try out https://www.masswerk.at/elizabot/eliza.html Note: This is not the original implementation, results might slightly differ 24 REGULAR EXPRESSIONS – APPLICATION CHATBOT ELIZA: one of the first chatbots ELIZA works with a cascade of regular expression substitutions, e.g., s/.* all.*/IN WHAT WHAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ s/.* YOU ARE (depressed|sad|unhappy|sick).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad|unhappy|sick).*/WHY DO YOU THINK YOU ARE \1/ 25 REGULAR EXPRESSIONS How can we search for any of these? woodchuck woodchucks Woodchuck Woodchucks [Ww]oodchucks? https://en.wikipedia.org/wiki/File: Groundhog_With_Burrow_Material.jpg 26 REGULAR EXPRESSIONS – EXAMPLE Find all instances of the word the in a text. the misses capitalized examples. [Tt]he incorrectly matches other or Theology. [ˆa-zA-Z][tT]he[ˆa-zA-Z] incorrectly matches the. and does not match the at the start of a string/line. \b[Tt]he\b 27 ERRORS Types of Errors The process we just went through was based on fixing two kinds of errors: Matching strings that we should not have matched (there, then, other) False positives Not matching things that we should have matched (The) False negatives In NLP we are always dealing with these kinds of errors. Reducing the error rate for an application often involves two antagonistic efforts: Increasing accuracy or precision (minimizing false positives). Increasing coverage or recall (minimizing false negatives). 28 SUMMARY Summary Regular expressions play a surprisingly large role in text processing. Sophisticated sequences of regular expressions are often the first model for any text processing task. For many hard tasks, we use machine learning classifiers, but regular expressions are used in preprocessing / feature engineering for the classifiers. can be very useful in capturing generalizations. 29 Minimum Edit Distance STRING SIMILARITY String Similarity Spelling Correction: The user typed graffe. Which is closest? graf / graft / grail / giraffe / graffle Computational Biology: Task is to align two sequences of nucleotides: AGGCTATCACCTGACCTCCAGGCCGATGCCC, and TAGCTATCACGACCGCGGTCGATTTGCCCGAC Resulting alignment: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC String similarity has also applications in Machine Translation, Information Extraction, and Speech Recognition. 30 EDIT DISTANCE Minimum Edit Distance The minimum edit distance between two strings is the minimum number of editing operations: insertion, deletion, substitution, needed to transform one string into the other. 31 MINIMUM EDIT DISTANCE Two strings and their alignment: 32 MINIMUM EDIT DISTANCE Two strings and their alignment: If each operation has cost of 1 ⇒ Distance is 5. If substitutions cost 2 (alternative formulation of Levenshtein distance) ⇒ Distance is 8. 33 APPLICATIONS OF EDIT DISTANCES Evaluating Machine Translation5 and Speech Recognition 1 Spokesman confirms government adviser was shot 2 Spokesman said the adviser was shot dead S I D I6 Evaluating Named Entity Extraction and Entity Coreference IBM Inc. announced today IBM profits Stanford President John Hennessy announced yesterday for Stanford University President John Hennessy 5 Often only the agreement between reference are measured, e.g., by ROUGE (Recall-Oriented Understudy for Gisting Evaluation) or BLEU (Bi-Lingual Evaluation Understudy) 6 Legend: Substitution, Insertion, Deletion 34 How do we (efficiently) compute the minimum edit distance? 34 COMPUTING MINIMUM EDIT DISTANCE Search Searching for a path, i.e., a sequence of edits, from the source string to target string Initial State the source string Operators insert, delete, substitute Goal State the target string Path Cost number of edits from source to target (that’s what we want to minimize) 35 COMPUTING MINIMUM EDIT DISTANCE Search: The space of all possible edit sequences is huge. We cannot afford to search this space naively. Many paths end up at the same state: We do not have to keep track of all of them Just the shortest path to all of these visited states Minimum Edit Distance For two strings X of length n and Y of length m, we define D(i, j) as the edit distance between X [1... i] and Y [1... j]. The edit distance between X and Y is thus D(n, m). X [1... i] are the first i characters of X Y [1... j] are the first j characters of Y 36 COMPUTING MINIMUM EDIT DISTANCE Dynamic Programming A tabular computation of D(n, m). Solving problems by combining solutions to sub-problems. Bottom-up approach. We compute D(i, j) for small i, j And compute larger D(i, j) based on previously computed values This means, we compute D(i, j) for all (0 < i < n) and (0 < j < m). 37 COMPUTING MINIMUM EDIT DISTANCE Algorithm Overview 1. Initialization D(i, 0) = i //distance between source string (prefix) and empty string is the cost of deleting all characters D(0, j) = j //distance between empty string target string (prefix) is the cost of adding all characters 2. Recurrence Relation For each i = 1... m For each j = 1... n    D(i − 1, j) + 1# deleting one more character in i   D(i, j − 1) + 1# adding one more character in i  D(i, j) = min    +2 if X (i) ̸= Y (j)7 D(i − 1, j − 1)     +0 if X (i) = Y (j) 3. Termination: D(n, m) is distance 7 Insertions and deletions cost 1, substitutions cost of 2. 38 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1 D(i, j) = min (  ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)    +0 if X (i) = Y (j) 39 THE EDIT DISTANCE TABLE Intuition When we are in this cell, we have transformed INT to E When we know how to to INT -> E, E we can get INT -> EX by adding an X When we know how to do IN -> EX, we can do INT -> EX by deleting the T T ? When we know how to do IN -> E, we can do INT -> EX by replacing T with X N I When we are in this cell, we have transformed IN to E When we are in this cell, we have transformed IN to EX # # E X E C 40 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1 D(i, j) = min (  ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)    +0 if X (i) = Y (j) 41 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1  D(i, j) = min (   ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)   +0 if X (i) = Y (j) 42 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(1, 1) = E 4   D(0, 1) + 1 = 1 + 1 = 2   T 3 min    N 2 I 1 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1  D(i, j) = min (   ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)   +0 if X (i) = Y (j) 43 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(1, 1) = E 4   D(0, 1) + 1 = 1 + 1 = 2   T 3 min D(1, 0) + 1 = 1 + 1 = 2    N 2 I 1 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N     D(i − 1, j) + 1   D(i, j − 1) + 1  D(i, j) = min (   ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)   +0 if X (i) = Y (j)  44 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(1, 1) = E 4   D(0, 1) + 1 = 1 + 1 = 2   T 3 min D(1, 0) + 1 = 1 + 1 = 2    N 2 I 1 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N     D(i − 1, j) + 1   D(i, j − 1) + 1  D(i, j) = min (   ̸ Y (j) +2 if X (i) =  D(i − 1, j − 1)   +0 if X (i) = Y (j)  45 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(1, 1) =  E 4     D(0, 1) + 1 = 1 + 1 = 2 T 3 min D(1, 0) + 1 = 1 + 1 = 2    D(0, 0) + 2 = 0 + 2 = 2  N 2 I 1 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1    D(i, j − 1) + 1  D(i, j) = min    +2 if X (i) ̸= Y (j)  D(i − 1, j − 1)     +0 if X (i) = Y (j) 46 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(1, 1) =  E 4     D(0, 1) + 1 = 1 + 1 = 2 T 3 min D(1, 0) + 1 = 1 + 1 = 2    D(0, 0) + 2 = 0 + 2 = 2  N 2 I 1 2 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1    D(i, j − 1) + 1  D(i, j) = min    +2 if X (i) ̸= Y (j)  D(i − 1, j − 1)     +0 if X (i) = Y (j) 47 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 ? I 1 2 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1 D(i, j) = min (  ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)    +0 if X (i) = Y (j) 48 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 E 4 T 3 ? N 2 3 ? I 1 2 3 ? # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1 D(i, j) = min (  ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)    +0 if X (i) = Y (j) 49 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 E 4 ? T 3 4 N 2 3 4 I 1 2 3 4 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1 D(i, j) = min (  ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)    +0 if X (i) = Y (j) 50 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(4, 1) = E 4 ?   D(3, 1) + 1 = 4 + 1 = 5   T 3 4 min    N 2 3 4 I 1 2 3 4 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1  D(i, j) = min (   ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)   +0 if X (i) = Y (j) 51 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(4, 1) = E 4 ?   D(3, 1) + 1 = 4 + 1 = 5   T 3 4 min D(4, 0) + 1 = 4 + 1 = 5    N 2 3 4 I 1 2 3 4 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N     D(i − 1, j) + 1   D(i, j − 1) + 1  D(i, j) = min (   ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)   +0 if X (i) = Y (j)  52 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(4, 1) =  E 4 ?     D(3, 1) + 1 = 4 + 1 = 5 T 3 4 min D(4, 0) + 1 = 4 + 1 = 5    D(3, 0) + 0 = 3 + 0 = 3  N 2 3 4 I 1 2 3 4 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1    D(i, j − 1) + 1  D(i, j) = min    +2 if X (i) ̸= Y (j)  D(i − 1, j − 1) +0 if X (i) = Y (j)    53 THE EDIT DISTANCE TABLE N 9 O 8 I 7 T 6 N 5 D(4, 1) =  E 4 3     D(3, 1) + 1 = 4 + 1 = 5 T 3 4 min D(4, 0) + 1 = 4 + 1 = 5    D(3, 0) + 0 = 3 + 0 = 3  N 2 3 4 I 1 2 3 4 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1    D(i, j − 1) + 1  D(i, j) = min    +2 if X (i) ̸= Y (j)  D(i − 1, j − 1) +0 if X (i) = Y (j)    54 THE EDIT DISTANCE TABLE N 9 8 9 10 11 12 11 10 9 8 O 8 7 8 9 10 11 10 9 8 9 I 7 6 7 8 9 10 9 8 9 10 T 6 5 6 7 8 9 8 9 10 11 N 5 4 5 6 7 8 9 10 11 10 E 4 3 4 5 6 7 8 9 10 9 T 3 4 5 6 7 8 7 8 9 8 N 2 3 4 5 6 7 8 7 8 7 I 1 2 3 4 5 6 7 6 7 8 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1   D(i, j − 1) + 1 D(i, j) = min (  ̸ Y (j) +2 if X (i) = D(i − 1, j − 1)    +0 if X (i) = Y (j) 55 COMPUTING ALIGNMENTS Alignment Edit distance is often not sufficient, but we need to align each character of the two strings to each other. We do this by keeping a backtrace: 1. Every time we enter a cell, we remember where we came from. 2. When we reach the end, we trace back the path to read off the alignment.   ↓ deletion   D(i − 1, j) + 1 deletion     D(i, j − 1) + 1 insertion  D(i, j) = min   ( ̸ Y (j) +2 if X (i) = ptr (i, j) = ← insertion D(i − 1, j − 1)    substitution  +0 if X (i) = Y (j)  ↙ substitution  56 BACKTRACE N 9 O 8 I 7 All options (↙↓←) T 6 equally valid. N 5 D(1, 1) = E 4   D(0, 1) + 1 = 1 + 1 = 2  T 3   min D(1, 0) + 1 = 1 + 1 = 2 N 2    D(0, 0) + 2 = 0 + 2 = 2  I 1 2 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1    D(i, j − 1) + 1  D(i, j) = min    +2 if X (i) ̸= Y (j)  D(i − 1, j − 1)     +0 if X (i) = Y (j) 57 BACKTRACE N 9 O 8 I 7 T 6 Only ↙ valid. N 5 D(4, 1) = E 4 3    D(3, 1) + 1 = 4 + 1 = 5  T 3 4  min D(4, 0) + 1 = 4 + 1 = 5  N 2 3 4   D(3, 0) + 0 = 3 + 0 = 3  I 1 2 3 4 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N    D(i − 1, j) + 1    D(i, j − 1) + 1  D(i, j) = min    +2 if X (i) ̸= Y (j)  D(i − 1, j − 1) +0 if X (i) = Y (j)    58 MIN EDIT DISTANCE WITH BACKTRACE 8 The highlighted path is one possible optimal alignment. 8 slide by d. jurafsky, cs124 59 ALGORITHM COMPLEXITY Time O(nm) we need to calculate all values in the table Space O(nm) we need to store all values in the table Backtrace O(n + m) in the worst case n deletions and m insertions 60 WEIGHTED EDIT DISTANCE While we worked with Levenshtein Distance in our example, i.e., fixed costs (insertion:1, deletion:1, substitution: 0 or 2), the algorithm allows for arbitrary weights. Why would we add weights to the computation? Spelling Correction: some letters are more likely to be mistyped than others Biology: certain kinds of deletions or insertions are more likely than others 61 MINIMUM EDIT DISTANCE Confusion Matrix for spelling errors 62 MINIMUM EDIT DISTANCE Substitutions are more likely to happen between letters next to each other on the keyboard 63 WEIGHTED MIN EDIT DISTANCE 1. Initialization D(0, 0) = 0 D(i, 0) = D(i − 1, 0) + del[x(i)]; 1

Document Details

Tags

Related

Full Transcript