5. Regular expressions and Regular languages (Part_1).pdf
Document Details
Uploaded by Deleted User
Full Transcript
Regular Expressions and Regular Languages 1 Outline Regular Expressions Regular Languages Practical Activities (Pumping Lemma) 2 Regular Expressions Definitions Equivalence to Finite Automata...
Regular Expressions and Regular Languages 1 Outline Regular Expressions Regular Languages Practical Activities (Pumping Lemma) 2 Regular Expressions Definitions Equivalence to Finite Automata 3 Regular Expressions and Text Searching Everybody does it Emacs, vi, perl, grep, etc.. Regular expressions are a compact textual representation of a set of strings representing a language. 4 Example Find all the instances of the word “the” in a text. /the/ /[tT]he/ /\b[tT]he\b/ 5 Errors The process we just went through was based on two fixing kinds of errors Matching strings that we should not have matched (there, then, other) False positives (Type I) Not matching things that we should have matched (The) False negatives (Type II) 6 Errors Reducing the error rate for an application often involves two antagonistic efforts: Increasing accuracy, or precision, (minimizing false positives) Increasing coverage, or recall, (minimizing false negatives). 7 REs: What are they? Regular expressions describe languages by an algebra. 8 Link: https://www.youtube.com/watch?v=eOfMcdeyrMU 11 DFA 10 Converting the regular expression (a|b)* to a DFA 11 Converting the regular expression (a*|b*)* to a DFA 12 Converting the regular expression ab(a|b)* to a DFA 13 Operations on Languages REs use three operations: union concatenation Kleene star (*) [cleany star] 14 Union (aka: disjunction, OR, |, +) The union of languages is the usual thing, since languages are sets. Example: {01,111,10}{00, 01} = {01,111,10,00}. 01 happens to be in both sets, so it will be once in the union 15 Concatenation: represented by juxtaposition (no punctuation) or middle dot ( · ) The concatenation of languages L and M is denoted LM. It contains every string wx such In the example, we take 01 from the first language, that w is in L and x is in M. and we concatenate it with 00 in the second language. Example: {01,111,10}{00, 01} That gives us 0100. We then take 01 from the = {0100, 0101, 11100, 11101, first language again, and we concatenate it with 01 1000, 1001}. in the second language, and that gives us 0101. Then we take 111 from the first language and we concatenated it with 00 in the second language and this gives us 11100 …. and so on. 19 Kleene Star: represented by an asterisk aka star (*) If L is a language, then L*, the Kleene star or just “star,” is the set of strings formed by concatenating zero or more strings from L, in any order. L* = {ε} L LL LLL … Example: {0,10}* = {ε, 0, 10, 00, 010, 100, 1010,…} If you take no strings from L, that would give you the empty string. 17 IMPORTANT! FROM NOW ON, LET’S STICK TO THE FOLLOWING CONVENTIONS (OTHERWISE WE WILL BE CONFUSED): Union (aka: disjunction, OR) represented by: | or + Concatenation: represented by juxtaposition (= no punctuation) or middle dot ( · ) Kleene Star: represented by * 18 Precedence of Operators Parentheses may be used wherever needed to influence the grouping of operators. Order of precedence is * (highest), then concatenation, then + (lowest). Remember: + = union/disjunction 19 Examples: REs 1) The regular expression 01 represents the concatenation of the language consisting of one string, 0 and the language consisting of one string, 1. The result is the language containing the one string 01. 1. L(01) = {01}. 2)The language of 01+0 is the union of the language containing only string 01 and the language containing 2. L(01+0) = {01, 0}. only string 0. 3. L(0(1+0)) = {01, 00}. 3)The language of 0 concatenated with 1+0 is the two strings 01 and 00. Notice that we need parentheses to force the + to group first. Without Note order of precedence them, since concatenation takes precedence over +, we get the interpretation in the second example. of operators. 4)The language of 0* is the star of the language 4. L(0*) = {ε, 0, 00, 000,… }. containing only the string 0. This is all strings of 0’s, including the empty string. 5. L((0+10)*(ε+1)) = all 5)This example denotes the language with all strings strings of 0s and 1s without two of 0s and 1s without two consecutive 0s. To see why this works, in every such string, each 1 is either consecutive 1s. followed immediately by a 0, or it comes at the end of the string. (0+10)* denotes all strings in which every 1 is followed by a 0. These strings are surely in the language we want. But we also want these strings followed by a final 1. Thus, we concatenate the language of (0+10)* with epsilon+1. This concatenation gives us all the strings where 1s are followed by 0s, plus all those strings with an additional 1 at the end. 20 Equivalence of REs and Finite Automata For every RE, there is a finite automaton that accepts the same language. And we need to show that for every finite automaton, there is a RE defining its language. 21 Summary Automata and regular expressions define exactly the same set of languages: the regular languages. 22 REGULAR LANGUAGES 23 The Chomsky Hierachy Hierarchy of classes of formal languages One language is of greater generative power or complexity than another if it can define a language that other cannot define. Context-free grammars are more powerful that regular grammars Regular Context- (DFA) Context- Recursively- free sensitive enumerable (PDA) (LBA (TM ) ) 24 Regular Languages A language L is regular if it is the language accepted by some DFA. Note: the DFA must accept only the strings in L, no others. Some languages are not regular. 25 Only languages that meet the following criteria are regular languages: 26 Regular language derive their name from the fact that the strings they recognize are (in a formal computer science sense) “regular.” This implies that there are certain kinds of strings that it will be very hard, if not impossible, to recognize with regular expressions, especially nested syntactic structures in natural language. 27 Formal languages vs regular languages A formal language is a set of strings, each string composed of symbols from a finite set called an alphabet. Ex: {a,b!} Formal languages are not the same as regular languages…. 28 But Many Languages are Regular They appear in many contexts and have many useful properties. 29 How to tell if a language is not regular The most common way to prove that a language is regular is to build a regular expression for the language. 30 Pumping Lemma 31 Practical Activity 1 The language L contains all strings over the alphabet {a,b} that begin with a and end with b, ie: Write a regular expression that defines the language L. 32 Practical Activity 1: Possible Solution 33 Your Solutions In between the concatenation of a and b there must be 0 or more unions (disjuctions) of a and b. Reference: slides 17-22 37 Practical Activity 2 Draw a deterministic finite-state automaton that accepts the following regular expression: Test the automaton with these legal strings in the language : 0 abc a ab Alternative notation style: cccabc cbacccabababccc ( (ab) | c)* …. ie: 0 or more occurences of the disjunction ab | c 38 Practical Activity 2: Possible Correct Solution Having the initial state as a final state gives us the empty string as an element in the language. 36 Your solutions (1): when we interpret ”+” as disjunction, these solutions are wrong because ”c” happens only after ”a” and ”b”… Test these automata with the string on page 35 37 Your solutions (2): same as previous slide. In addition, here no final states are shown… Test these automata with the string on page 35 38 Practical Activity 3 Construct a grep regular expression that matches patterns containing at least one “ab” followed by any number of bs. Construct a grep regular expression that matches any number between 1000 and 9999. 39 Practical Activity 3: Possible Solutions grep ‘\(ab\)+b*’ [1-9][0- 9]{3} 40 Exercises: E&G (2013) Övning 9.40 Optional: as many as you can AGer having completed the exercises, check out the solutions at the end of the book. 41 Regular Expressions & Regular Languages slideshare: http://www.slideshare.net/marinasantini1/regular-expressions-and-regular-languages Mathematics for Language Technology http://stp.lingfil.uu.se/~matsd/uv/uv15/mfst/ Marina Santini [email protected] Department of Linguistics and Philology Uppsala University, Uppsala, Sweden 1 Reading Required Reading: E&G (2013): Ch. 9 (pp. 252-256) Compendium (3): 7.2, 7.3, 8.2.3 Mats Dahllöf: Reguljära uttryck http://stp.lingfil.uu.se/~matsd/uv/uv14/mfst/dok/oh6.pdf Further Reading: Chapters 2 in Jurafsky D. & Martin J. (2009) Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Online draG version: http://stp.lingfil.uu.se/~santinim/ml/2014/ JurafskyMartinSpeechAndLanguageProcessing2ed_draG%202007.pdf 43 The End 44