Module 1 - Introduction to Compilation and Lexical Analysis PDF
Document Details
Uploaded by RightCedar
false
null
Tags
Summary
This document provides an introduction to compilation and lexical analysis, including objectives, outcomes, and a detailed syllabus of topics covered. It details the fundamental concepts of language translators, intermediate code generation, code optimization, and the role of compilers in software systems. It also presents a summary of related materials and references.
Full Transcript
Objective To provide fundamental knowledge of various language translators To make students familiar with lexical analysis and parsing techniques To understand the various actions carried out in semantic analysis To make the students get familiar with how the intermediate code is gen...
Objective To provide fundamental knowledge of various language translators To make students familiar with lexical analysis and parsing techniques To understand the various actions carried out in semantic analysis To make the students get familiar with how the intermediate code is generated To understand the principles of code optimization techniques and code generation To provide foundation for study of high-performance compiler design Outcomes 1. Apply the skills on devising, selecting, and using tools and techniques towards compiler design 2. Develop language specifications using context free grammars (CFG 3. Apply the ideas, the techniques, and the knowledge acquired for the purpose of developing software systems 4. Constructing symbol tables and generating intermediate code 5. Obtain insights on compiler optimization and code generation Syllabus Module: 1 Introduction to Compilation and Lexical Analysis 7 hours Introduction to LLVM - Structure and Phases of a Compiler-Design Issues-Patterns Lexemes-Tokens-Attributes-Specification of Tokens-Extended Regular Expression- Regular expression to Deterministic Finite Automata (Direct method) - Lex - A Lexical Analyzer Generator Module: 2 Syntax Analysis 8 hours Role of Parser- Parse Tree - Elimination of Ambiguity – Top Down Parsing – Recursive Descent Parsing - LL (1) Grammars – Shift Reduce Parsers- Operator Precedence Parsing - LR Parsers, Construction of SLR Parser Tables and Parsing- CLR Parsing- LALR Parsing Module: 3 Semantic Analysis 5 hours Syntax Directed Definition – Evaluation Order - Applications of Syntax Directed Translation - Syntax Directed Translation Schemes - Implementation of L-attributed Syntax Directed Definition Module: 4 Intermediate Code Generation 5 hours Variants of Syntax trees - Three Address Code- Types – Declarations - Procedures - Assignment Statements - Translation of Expressions - Control Flow - Back Patching- Switch Case Statements. Cont.. Module: 5 Code Optimization 6 hours Loop optimizations- Principal Sources of Optimization -Introduction to Data Flow Analysis - Basic Blocks - Optimization of Basic Blocks - Peephole Optimization- The DAG Representation of Basic Blocks -Loops in Flow Graphs - Machine Independent Optimization Implementation of a naïve code generator for a virtual Machine- Security checking of virtual machine code Module: 6 Code Generation 5 hours Issues in the design of a code generator- Target Machine- Next-Use Information – Register Allocation and Assignment- Runtime Organization- Activation Records. Module: 7 Parallelism 7 hours Parallelization- Automatic Parallelization- Optimizations for Cache Locality and Vectorization- Domain Specific Languages-Compilation- Instruction Scheduling and Software Pipelining- Impact of Language Design and Architecture Evolution on Compilers Static Single Assignment Module: 8 Contemporary Issues 2 hours Text Books & References Text Book A. V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman, Compilers: Principles, techniques, & tools, 2007, Second Edition, Pearson Education, Boston Reference Books Watson, Des. A Practical Approach to Compiler Construction. Germany, Springer International Publishing, 2017 Method of Evaluation QUIZ-1 10 Marks (Moodle (LMS)) QUIZ-2 10 Marks (Moodle (LMS)) DA 10 Marks (Presentation) CAT - 1 15 Marks CAT - 2 15 Marks FAT 40 Marks Total 100 Marks Content - Module -1 Introduction to Compilation And Lexical Analysis Introduction to LLVM Structure and Phases of a Compiler Design Issues Patterns & Lexemes Tokens & Attributes Specification of Tokens Extended Regular Expression Regular Expression to Deterministic Finite Automata (Direct method) Lex - A Lexical Analyzer Generator Translator A translator is a program that takes one form of program as input and converts it into another form. Types of translators are: 1. Interpreter Source 2. Assembler Translator Target Program Program 3. Compiler Error Messages Interpreter Interpreter is a translator that reads a program written in source language and translates it into an equivalent program in target language line by line void main() 0000 1100 0010 { 0000 int a=1,b=2,c; Interpreter 1111 1100 0010 c=a+b; 1010 1100 0010 printf(“%d”,c); 0011 1100 0010 } 1111 Source Error Target Program Messages Program Assembler Assembler is a translator which takes the assembly code as an input and generates the machine code as an output. MOV id3, R1 0000 1100 0010 MUL #2.0, R1 0100 MOV id2, R2 0111 1000 0001 MUL R2, R1 Assembler 1111 0101 1110 MOV id1, R2 1100 0000 1000 ADD R2, R1 1011 MOV R1, id1 1100 0000 1000 Assembly Error Messages Machine Code Code Compiler A compiler is a translator that reads a program written in source language and translates it into an equivalent program in target language. (Object/Machine Language/0’s, 1’s). void main() 0000 1100 0010 { 0100 int a=1,b=2,c; Compiler 0111 1000 0001 c=a+b; 1111 0101 1110 printf(“%d”,c); 1100 0000 1000 } 1011 Source Error Target Program Messages Program Features of a Compiler Speed of machine code The correctness of machine code The meaning of code should not change Good error detection Checking the code correctly according to grammar Uses of Compilers Helps to make the code independent of the platform Makes the code free of syntax and semantic errors Generate executable files of code Translates the code from one language to another Applications of Compilers Optimization of computer architectures Design of new computer architectures Artificial Intelligence Security Embedded Systems High-Performance Computing … Analysis Synthesis model of compilation There are two parts of compilation. 1. Analysis Phase 2. Synthesis Phase void main() Analysis Synthesis { Phase Phase 0000 1100 int a=1,b=2,c; 0111 1000 c=a+b; 0001 printf(“%d”,c); Intermediate 1111 0101 } Representation 1000 1011 Source Code Target Code Analysis phase & Synthesis phase Analysis Phase Synthesis Phase Analysis part breaks up the source The synthesis part constructs program into constituent pieces the desired target program from and creates an intermediate the intermediate representation. representation of the source Synthesis phase consist of the program. following sub phases: Analysis phase consists of three 1. Code optimization sub phases: 2. Code generation 1. Lexical analysis 2. Syntax analysis 3. Semantic analysis Phases of compiler Compiler Synthesis Analysis phase phase Lexical analysis Intermediate Code code optimization Syntax analysis generation Code Semantic generation analysis Lexical analysis Lexical Analysis is also called linear analysis or scanning. Position = initial + rate * 60 Lexical Analyzer divides the given source statement into the tokens. Ex: Position = initial + rate * 60 would be Lexical analysis grouped into the following tokens: id1 = id2 + id3 * 60 Position (identifier) = (Assignment operator) initial (identifier) + (Plus Operator) rate (identifier) * (Multiplication Operator) 60 (digit/Number) Lexeme Reads the stream of char making up the source program & group the char into meaningful sequences called lexeme. Lexical analyzer represents the lexeme in the form of tokens. It is a sequence of characters in the source code that are matched by given predefined language rules for every lexeme to be specified as a valid token. Example: main is lexeme of type keyword(token) (,),{,} are lexemes of type punctuation(token) Patterns Specifies a set of rules that a scanner follows to create a token. Example: For a keyword to be identified as a valid token, the pattern is the sequence of characters that make the keyword. For identifier to be identified as a valid token, the pattern is the predefined rules that it must start with alphabet, followed by alphabet or a digit. Token vs Lexeme Vs Pattern Criteria Token Lexeme Pattern Interpretation of All the reserved keywords of int, goto The sequence of characters type Keyword that language(main, printf, etc.) that make the keyword. Interpretation of Name of a variable, function, etc addsum, a, id1 It must start with the type Identifier alphabet, followed by the alphabet or a digit. Interpretation of All the operators are considered +, = +, = type Operator tokens. Interpretation of Each kind of punctuation is (, ), {, } (, ), {, } type Punctuation considered a token. E.G. Semicolon, bracket, comma, etc. Interpretation of A grammar rule or boolean “Welcome to Any string of characters type Literal literal. GeeksforGeeks!” (except ‘ ‘) between ” and “ Phases of Compiler Compiler Synthesis Analysis phase phase Lexical analysis Intermediate Code code optimization Syntax analysis generation Code Semantic generation analysis Syntax analysis Position = initial + rate*60 Syntax Analysis is also called Parsing or Lexical analysis Hierarchical Analysis. id1 = id2 + id3 * 60 It takes token produced by lexical analyzer as Input & generates the parse tree. Syntax analysis The syntax analyzer checks each line of the = code and spots every error. If code is error free then syntax analyzer id1 + generates the tree. * id2 id3 60 Phases of compiler Compiler Synthesis Analysis phase phase Lexical analysis Intermediate Code code optimization Syntax analysis generation Code Semantic generation analysis Semantic analysis Semantic analyzer determines the meaning = of a source string. id1 + It performs following operations: id2 * int to 1. Matching of parenthesis in the expression. real id3 60 2. Matching of if..else statement. 3. Performing arithmetic operation that are type Semantic analysis compatible. 4. Checking the scope of operation. = *Note: Consider id1, id2 and id3 are real id1 + id2 * id3 inttoreal 60 Phases of compiler Compiler Synthesis Analysis phase phase Lexical analysis Intermediate Code code optimization Syntax analysis generation Code Semantic generation analysis Intermediate code generator Two important properties of intermediate = code : + id1 1. It should be easy to produce. id2 * 2. Easy to translate into target program. t3 id3 inttoreal Intermediate form can be represented using t2 t1 60 “three address code”.[Postfix notation & Syntax Analysis tree] Intermediate code Three address code consist of a sequence of t1= int to real(60) instruction, each of which has at most three t2= id3 * t1 operands. t3= id2 + t2 id1= t3 Phases of compiler Compiler Synthesis Analysis phase phase Lexical analysis Intermediate Code code optimization Syntax analysis generation Code Semantic generation analysis Code optimization It improves the intermediate code. This is necessary to have a faster Intermediate code execution of code or less consumption t1= int to real(60) of memory. t2= id3 * t1 t3= t2 + id2 id1= t3 Code optimization t1= id3 * 60.0 id1 = id2 + t1 Phases of compiler Compiler Synthesis Analysis phase phase Lexical analysis Intermediate Code code optimization Syntax analysis generation Code Semantic generation analysis Code generation The intermediate code instructions are translated into sequence of machine Code optimization instruction. t1= id3 * 60.0 id1 = id2 + t1 Code generation MOV id3, R2 MUL #60.0, R2 MOV id2, R1 ADD R2,R1 MOV R1, id1 Id3→R2 Id2→R1 Symbol table Symbol table are data structures that are used by compilers to hold information about source-program constructs. It is used to store information about the occurrences of various entities such as, objects, classes, variable names, functions, etc., It is used by both analysis phase and synthesis phase. Symbol table is used for the following purposes It is used to store the name of all the entities in a structured form at one place It is used to verify if a variable has been declared It is used to determine the scope of a name It is used to implement type checking by verifying assignments and expression in the source code are semantically correct. Cont., Symbol table can be a linear (Linked list) or hash table It maintain a entry for each name as, Example: If a symbol table has to store information about the following variable declaration: static int interest; then it should store the entry such as: Operations of Symbol table Scope Management void pro_one() void pro_two() { { int one_1; int two_1; int one_2; int two_2; { \ { \ int one_3; |_ inner scope 1 int two_3; |_ inner scope 3 int one_4; | int two_4; | } / } / int one_5; int two_5; } { \... int one_6; |_ inner scope 2 int one_7; | } / } Phases of compiler Source program Analysis Lexical analysis Phase Syntax analysis Semantic analysis Symbol Error detection table and recovery Intermediate code Variable Type Address Code optimization Name Position Float 0001 Code generation Synthesis Initial Float 0005 Phase Rate Float 0009 Target Program Exercise 1 Write output of all the phases of compiler for following statements: 1. X = b-c*2 2. I = p*n*r/100 3. F = C * (9/5) + 32 4. α = [-b + √(b^2 – 4*a*c)]/2*a 5. Volume of frustum: V = (Π*h)/12 (d^2 + d*b + b^2) Grouping of Phases Front end & back end (Grouping of phases) Front end Depends primarily on source language and largely independent of the target machine. It includes following phases: 1. Lexical analysis 2. Syntax analysis 3. Semantic analysis 4. Intermediate code generation 5. Creation of symbol table Back end Depends on target machine and do not depends on source program. It includes following phases: 1. Code optimization 2. Code generation phase 3. Error handling and symbol table operation Context of Compiler (Cousins of compiler) Context of compiler (Cousins of compiler) Skeletal Source Program In addition to compiler, many other system programs are required to Preprocessor generate absolute machine code. Source Program These system programs are: Compiler Target Assembly Preprocessor Program Assembler Assembler Linker Relocatable Loader Object Code Libraries & Linker / Loader Object Files Absolute Machine Code Context of compiler (Cousins of compiler) Skeletal Source Program Preprocessor Some of the task performed by preprocessor: Preprocessor 1. Macro processing: Allows user to define macros. Source Ex: #define PI 3.14159265358979323846 Program 2. File inclusion: A preprocessor may include the Compiler header file into the program. Ex: Target Assembly #include Program 3. Rational preprocessor: It provides built in macro for construct like while statement or if statement. Assembler 4. Language extensions: Add capabilities to the Relocatable language by using built-in macros. Object Code Libraries & Ex: the language SQL is a database query Linker / Loader Object Files language embedded in C. Statement beginning with ## are taken by preprocessor to be database access statement unrelated to C and Absolute Machine translated into procedure call on routines that Code perform the database access. Context of compiler (Cousins of compiler) Skeletal Source Program Compiler Preprocessor A compiler is a program that reads a program written in source language and translates it Source Program into an equivalent program in target language. Compiler Target Assembly Program Assembler Relocatable Object Code Libraries & Linker / Loader Object Files Absolute Machine Code Context of compiler (Cousins of compiler) Skeletal Source Program Assembler Preprocessor Assembler is a translator which takes the assembly program (mnemonic) as an input Source Program and generates the machine code as an output. Compiler Target Assembly Program Assembler Relocatable Object Code Libraries & Linker / Loader Object Files Absolute Machine Code Context of compiler (Cousins of compiler) Skeletal Source Program Linker Linker makes a single program from a several files of Preprocessor relocatable machine code. Source Program These files may have been the result of several Compiler different compilation, and one or more library files. Target Assembly Loader Program Assembler The process of loading consists of: Taking relocatable machine code Relocatable Machine Code Altering the relocatable address Libraries & Linker / Loader Placing the altered instructions and data in Object Files main memory at the proper location. Absolute Machine Code Pass structure Pass structure One complete scan of a source program is called pass. Pass includes reading an input file and writing to the output file. In a single pass compiler analysis of source statement is immediately followed by synthesis of equivalent target statement. In a two pass compiler intermediate code is generated between analysis and synthesis phase. It is difficult to compile the source program into single pass due to: forward reference Pass structure Forward reference: A forward reference of a program entity is a reference to the entity which precedes its definition in the program. Can be solved by postponing the generation of target code until more information concerning the entity becomes available. It leads to multi pass model of compilation. Pass I: Perform analysis of the source program and note relevant information. Pass II: In Pass II: Generate target code using information noted in pass I. Types of compiler Types of compiler 1. One pass compiler It is a type of compiler that compiles whole process in one-pass. 2. Two pass compiler It is a type of compiler that compiles whole process in two-pass. It generates intermediate code. 3. Incremental compiler The compiler which compiles only the changed line from the source code and update the object code. 4. Native code compiler The compiler used to compile a source code for a same type of platform only. 5. Cross compiler The compiler used to compile a source code for a different kinds platform. Types of compiler 1. One pass compiler 2. Two pass compiler Token, Pattern & Lexemes Interaction of scanner & parser Token Source Lexical Parser Program Analyzer Get next token Symbol Table Upon receiving a “Get next token” command from parser, the lexical analyzer reads the input character until it can identify the next token. Lexical Analysis Secondary tasks: Stripping out from the source program comments and white spaces in the form of blank, tab, and new line characters. Correlating error messages from the compiler with the source program. Inserting lexemes for user-defined names into the symbol table. Issues in Lexical and Syntax Analysis Why to separate lexical analysis & parsing? 1. Simplicity in design Separation allows the simplification of one or the other. Example: A parser with comments or white spaces is more complex 2. Improves compiler efficiency Optimization of lexical analysis because a large amount of time is spent reading the source program and partitioning it into tokens. 3. Enhance compiler portability Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. Token, Pattern & Lexemes Token Pattern The set of rules called pattern Sequence of character having a associated with a token. collective meaning is known as Example: “non-empty sequence of digits”, token. “letter followed by letters and digits” Categories of Tokens: 1.Identifier Lexemes 2.Keyword The sequence of character in a source program matched with a pattern for a 3.Operator token is called lexeme. 4.Special symbol Example: Rate, DIET, count, Flag 5.Constant Example: Token, Pattern & Lexemes C code: printf("Total = %d\n", score); ▪ printf and score are lexemes matching the pattern for token id, ▪ "Total = %d\n" is a lexeme matching literal In many programming languages, the following classes cover most or all of the tokens: Example: Token, Pattern & Lexemes Example: total = sum + 45 Tokens: total Identifier1 = Operator1 Tokens sum Identifier2 + Operator2 45 Constant1 Lexemes Lexemes of identifier: total, sum Lexemes of operator: =, + Lexemes of constant: 45 Attributes of Tokens The When token more names than and one associated lexeme canattribute match a values pattern,for thethe lexical Fortran analyzer statement must E =provide M * C **the 2 subsequent compiler phases additional information about the areparticular written below lexeme asthat a sequence matched. of pairs. number matches both 0 and 1 analyzer returns to the parser not only a token name, but an represented by the token; Normally, information about an identifier e.g., its lexeme, its type, and the is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table Divide the following program into appropriate lexemes Input Buffering Finding lexemes: We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. we need to look at least one additional character ahead. For instance, We cannot be sure we have seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a two- character operator like ->, ==, or 3 return (relop,NE) = other 5 4 return (relop,LT) return (relop,EQ) > = 6 7 return (relop,GE) other 8 return (relop,GT) Transition diagram : Unsigned number digit digit digit start digit. digit E +or - digit other 1 2 3 4 5 6 7 8 E digit 3 other other 5280 9 10 39.37 1.894 E - 4 A transition diagram for whitespace 2.56 E + 7 45 E + 6 96 E 2 Hard coding and automatic generation lexical analyzers Hard coding lexical analyzer Lexical analysis is about identifying the pattern from the input. To recognize the pattern, transition diagram is constructed. Example: to represent identifier in ‘C’, the first character must be letter and other characters are either letter or digits. To recognize this pattern, hard coding lexical analyzer will work with a transition diagram. The automatic generation lexical analyzer takes special notation as input. Example: lex compiler tool will take regular expression as input and finds out the pattern matching to that regular expression. Letter or digit Start Letter other 1 2 3 Finite Automata Finite Automata are recognizers. FA simply say “Yes” or “No” about each possible input string. Finite Automata is a mathematical model consist of: 1. Set of states 𝑺 2. Set of input symbol 𝜮 3. A transition function move 4. Initial state 𝑺𝟎 5. Final states or accepting states 𝐅 Types of finite automata Types of finite automata are: DFA b Deterministic finite automata (DFA): have for each state exactly one edge a b b 1 2 3 4 leaving out for each symbol. a a b a NFA DFA Nondeterministic finite automata a (NFA): There are no restrictions on the edges leaving a state. There can be a b b 1 2 3 4 several with the same symbol as label and some edges can be labeled with 𝜖. b NFA Regular expression to NFA using Thompson's rule 1. For ∈ , construct the NFA 3. For regular expression 𝑠𝑡 start start 𝜖 𝑖 N(s) N(t) 𝑓 𝑖 𝑓 Ex: ab 2. For 𝑎 in 𝛴, construct the NFA start a a b 𝑖 𝑓 1 2 3 Regular expression to NFA using Thompson's rule 4. For regular expression 𝑠|𝑡 5. For regular expression 𝑠* 𝜖 N(s) 𝜖 𝜖 start 𝜖 𝜖 start 𝑖 N(s) 𝑓 𝑖 𝑓 𝜖 N(t) 𝜖 𝜖 Ex: a* 𝜖 Ex: (a|b) a 2 3 𝜖 𝜖 𝜖 𝑎 𝜖 1 2 3 4 1 6 𝜖 𝜖 𝜖 4 5 b Regular expression to NFA using Thompson's rule a*b 𝜖 𝑎 𝜖 𝑏 1 2 3 4 5 𝜖 𝜖 b*ab 𝜖 𝑏 𝜖 𝑎 𝑏 1 2 3 4 5 6 𝜖 Exercise Convert following regular expression to NFA: 1. abba 2. bb(a)* 3. (a|b)* 4. a* | b* 5. a(a)*ab 6. aa*+ bb* 7. (a+b)*abb 8. 10(0+1)*1 9. (a+b)*a(a+b) 10. (0+1)*010(0+1)* 11. (010+00)*(10)* 12. 100(1)*00(0+1)* Conversion from NFA to DFA using subset construction method Subset construction algorithm Input: An NFA 𝑁. Output: A DFA D accepting the same language. Method: Algorithm construct a transition table 𝐷𝑡𝑟𝑎𝑛 for D. We use the following operation: OPERATION DESCRIPTION − 𝑐𝑙𝑜𝑠𝑢𝑟𝑒(𝑠) Set of NFA states reachable from NFA state 𝑠 on – transition alone. 𝑴𝒐𝒗𝒆 (𝑇, 𝑎) Set of NFA states to which there is a transition on input symbol 𝑎 from some NFA state 𝑠 in 𝑇. Conversion from NFA to DFA (a|b)*abb 𝜖 a 2 3 𝜖 𝜖 𝜖 𝜖 a b b 0 1 6 7 8 9 10 𝜖 𝜖 4 5 b 𝜖 Conversion from NFA to DFA 𝜖 a 2 3 𝜖 𝜖 𝜖 𝜖 a b b 0 1 6 7 8 9 10 𝜖 𝜖 4 5 b 𝜖 𝜖- Closure(0)= {0, 1, 7, 2, 4} = {0,1,2,4,7} ---- A Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} 𝜖 𝜖 4 5 b 𝜖 A= {0, 1, 2, 4, 7} Move(A,a) = {3,8} 𝜖- Closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} C = {1,2,4,5,6,7} 𝜖 𝜖 4 5 b 𝜖 A= {0, 1, 2, 4, 7} Move(A,b) = {5} 𝜖- Closure(Move(A,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B C = {1,2,4,5,6,7} 𝜖 𝜖 4 5 b 𝜖 B = {1, 2, 3, 4, 6, 7, 8} Move(B,a) = {3,8} 𝜖- Closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D C = {1,2,4,5,6,7} 𝜖 𝜖 4 5 D = {1,2,4,5,6,7,9} b 𝜖 B= {1, 2, 3, 4, 6, 7, 8} Move(B,b) = {5,9} 𝜖- Closure(Move(B,b)) = {5, 6, 7, 1, 2, 4, 9} = {1,2,4,5,6,7,9} ---- D Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D C = {1,2,4,5,6,7} B 𝜖 𝜖 4 5 D = {1,2,4,5,6,7,9} b 𝜖 C= {1, 2, 4, 5, 6 ,7} Move(C,a) = {3,8} 𝜖- Closure(Move(C,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D C = {1,2,4,5,6,7} B C 𝜖 𝜖 4 5 D = {1,2,4,5,6,7,9} b 𝜖 C= {1, 2, 4, 5, 6, 7} Move(C,b) = {5} 𝜖- Closure(Move(C,b))= {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D C = {1,2,4,5,6,7} B C 𝜖 𝜖 4 5 D = {1,2,4,5,6,7,9} B b 𝜖 D= {1, 2, 4, 5, 6, 7, 9} Move(D,a) = {3,8} 𝜖- Closure(Move(D,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D C = {1,2,4,5,6,7} B C 𝜖 𝜖 4 5 D = {1,2,4,5,6,7,9} B E b E = {1,2,4,5,6,7,10} 𝜖 D= {1, 2, 4, 5, 6, 7, 9} Move(D,b)= {5,10} 𝜖- Closure(Move(D,b)) = {5, 6, 7, 1, 2, 4, 10} = {1,2,4,5,6,7,10} ---- E Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D C = {1,2,4,5,6,7} B C 𝜖 𝜖 4 5 D = {1,2,4,5,6,7,9} B E b E = {1,2,4,5,6,7,10} B 𝜖 E= {1, 2, 4, 5, 6, 7, 10} Move(E,a) = {3,8} 𝜖- Closure(Move(E,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B Conversion from NFA to DFA 𝜖 a 2 3 States a b 𝜖 𝜖 A = {0,1,2,4,7} B C 𝜖 𝜖 a b b 0 1 6 7 8 9 10 B = {1,2,3,4,6,7,8} B D C = {1,2,4,5,6,7} B C 𝜖 𝜖 4 5 D = {1,2,4,5,6,7,9} B E b E = {1,2,4,5,6,7,10} B C 𝜖 E= {1, 2, 4, 5, 6, 7, 10} Move(E,b)= {5} 𝜖- Closure(Move(E,b))= {5,6,7,1,2,4} = {1,2,4,5,6,7} ---- C Conversion from NFA to DFA a b States a b B D a A = {0,1,2,4,7} B C a B = {1,2,3,4,6,7,8} B D A a a b C = {1,2,4,5,6,7} B C D = {1,2,4,5,6,7,9} B E b C E E = {1,2,4,5,6,7,10} B C b Transition Table b Note: Accepting state in NFA is 10 DFA 10 is element of E So, E is acceptance state in DFA Exercise Convert following regular expression to DFA using subset construction method: 1. (a+b)*a(a+b) 2. (a+b)*ab*a DFA optimization 1. Construct an initial partition Π of the set of states with two groups: the accepting states 𝐹 and the non-accepting states 𝑆 − 𝐹. 2. Apply the repartition procedure to Π to construct a new partition Π𝑛𝑒𝑤. 3. If Π 𝑛𝑒𝑤 = Π, let Π𝑓𝑖𝑛𝑎𝑙 = Π and continue with step (4). Otherwise, repeat step (2) with Π = Π𝑛𝑒𝑤. for each group 𝐺 of Π do begin partition 𝐺 into subgroups such that two states 𝑠 and 𝑡 of 𝐺 are in the same subgroup if and only if for all input symbols 𝑎, states 𝑠 and 𝑡 have transitions on 𝑎 to states in the same group of Π. replace 𝐺 in Π𝑛𝑒𝑤 by the set of all subgroups formed. end DFA optimization 4. Choose one state in each group of the partition Π𝑓𝑖𝑛𝑎𝑙 as the representative for that group. The representatives will be the states of 𝑀′. Let s be a representative state, and suppose on input a there is a transition of 𝑀 from 𝑠 to 𝑡. Let 𝑟 be the representative of 𝑡′s group. Then 𝑀′ has a transition from 𝑠 to 𝑟 on 𝑎. Let the start state of 𝑀′ be the representative of the group containing start state 𝑠0 of 𝑀, and let the accepting states of 𝑀′ be the representatives that are in 𝐹. 5. If 𝑀′ has a dead state 𝑑, then remove 𝑑 from 𝑀′. Also remove any state not reachable from the start state. DFA optimization States a b {𝐴, 𝐵, 𝐶, 𝐷, 𝐸} A B C B B D Nonaccepting States Accepting States C B C {𝐴, 𝐵, 𝐶, 𝐷} {𝐸} D B E E B C {𝐴, 𝐵, 𝐶} {𝐷} States a b {𝐴, 𝐶} {𝐵} A B A B B D Now no more splitting is possible. D B E E B A If we chose A as the representative Optimized for group (AC), then we obtain Transition Table reduced transition table Conversion from regular expression to DFA Rules to compute nullable, firstpos, lastpos nullable(n) The subtree at node 𝑛 generates languages including the empty string. firstpos(n) The set of positions that can match the first symbol of a string generated by the subtree at node 𝑛. lastpos(n) The set of positions that can match the last symbol of a string generated be the subtree at node 𝑛. followpos(i) The set of positions that can follow position 𝑖 in the tree. Rules to compute nullable, firstpos, lastpos Node n nullable(n) firstpos(n) lastpos(n) A leaf labeled by true ∅ ∅ A leaf with false {i} {i} position 𝐢 n | nullable(c1) firstpos(c1) lastpos(c1) or c1 c2 nullable(c2) firstpos(c2) lastpos(c2) if (nullable(c1)) if (nullable(c2)) n. nullable(c1) then firstpos(c1) then lastpos(c1) and c1 firstpos(c2) lastpos(c2) c2 nullable(c2) else firstpos(c1) else lastpos(c2) n ∗ true firstpos(c1) lastpos(c1) c1 Rules to compute followpos 1. If n is concatenation node with left child c1 and right child c2 and i is a position in lastpos(c1), then all position in firstpos(c2) are in followpos(i) 2. If n is * node and i is position in lastpos(n), then all position in firstpos(n) are in followpos(i) Conversion from regular expression to DFA Example 1: (a|b)* abb # Step 1: Construct Syntax Tree. Step 2: Nullable node. # 𝟔 Here, * is only nullable node. 𝑏. 𝟓 𝑏 𝟒 ∗ 𝑎 𝟑 | 𝑎 𝑏 𝟏 𝟐 Conversion from regular expression to DFA Step 3: Calculate firstpos Firstpos {1,2,3}. {1,2,3}. A leaf with position 𝒊 = {𝒊} {6} # {1,2,3}. 𝟔 {5} 𝑏 n |. 𝟓 firstpos(c1) firstpos(c2) {1,2,3} {4} 𝑏 𝟒 c1 c2 {1,2} ∗ {3} 𝑎 n ∗ 𝟑 firstpos(c1) c1 {1,2} | n if (nullable(c1)). 𝑎 𝑏 thenfirstpos(c1) {1} 𝟏 {2}𝟐 firstpos(c2) c1 c2 else firstpos(c1) Conversion from regular expression to DFA Step 3: Calculate lastpos Lastpos {1,2,3}. {6} {1,2,3}. {5} A leaf with position 𝒊 = {𝒊} {6} # {6} {1,2,3}. {4} 𝟔 n {5} 𝑏 {5} | {1,2,3}. {3} {4} 𝑏 {4} 𝟓 lastpos(c1) lastpos(c2) c1 c2 𝟒 {1,2} ∗ {1,2} {3} 𝑎 {3} n ∗ 𝟑 lastpos(c1) c1 {1,2} | {1,2} n if (nullable(c2)) then. 𝑎 𝑏 lastpos(c1) lastpos(c2) {1} {1} {2} {2} else lastpos(c2) 𝟏 𝟐 c1 c2 Conversion from regular expression to DFA Step 4: Calculate followpos Position followpos 5 6 Firstpos {1,2,3}. {6} Lastpos {1,2,3}. {5} {6} # {6} {1,2,3}. {4} 𝟔 {5} 𝑏 {5} {1,2,3}. {3} {4} 𝑏 {4} 𝟓. 𝟒 {1,2} ∗ {1,2} {3} 𝑎 {3} {1,2,3} 𝒄 {5} 𝟏 {6} 𝒄𝟐 {6} 𝟑 {1,2} | {1,2} 𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {5} 𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 6 𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 5 = 6 {1} {1} {2} {2} 𝟏 𝟐 Conversion from regular expression to DFA Step 4: Calculate followpos Position followpos 5 6 {1,2,3}. {6} 4 5 {1,2,3}. {5} {6} # {6} {1,2,3}. {4} 𝟔 {5} 𝑏 {5} {1,2,3}. {3} {4} 𝑏 {4} 𝟓. 𝟒 {1,2} ∗ {1,2} {3} 𝑎 {3} {1,2,3} 𝒄 {4} 𝟏 {5} 𝒄𝟐 {5} 𝟑 {1,2} | {1,2} 𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {4} 𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 5 𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 4 = 5 {1} {1} {2} {2} 𝟏 𝟐 Conversion from regular expression to DFA Step 4: Calculate followpos Position followpos 5 6 Firstpos {1,2,3}. {6} 4 5 Lastpos {1,2,3}. {5} 3 4 {6} # {6} {1,2,3}. {4} 𝟔 {5} 𝑏 {5} {1,2,3}. {3} {4} 𝑏 {4} 𝟓. 𝟒 {1,2} ∗ {1,2} {3} 𝑎 {3} {1,2,3} 𝒄 {3} 𝟏 {4} 𝒄𝟐 {4} 𝟑 {1,2} | {1,2} 𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {3} 𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 4 𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 3 = 4 {1} {1} {2} {2} 𝟏 𝟐 Conversion from regular expression to DFA Step 4: Calculate followpos Position followpos 5 6 Firstpos {1,2,3}. {6} 4 5 Lastpos {1,2,3}. {5} 3 4 {6} # {6} 2 3 {1,2,3}. {4} 𝟔 {5} 𝑏 {5} 1 3 {1,2,3}. {3} {4} 𝑏 {4} 𝟓. 𝟒 {1,2} ∗ {1,2} {3} 𝑎 {3} {1,2} 𝒄 {1,2} {3} 𝒄 {3} 𝟏 𝟐 𝟑 {1,2} | {1,2} 𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑐1) = {1,2} 𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑐2 = 3 𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 1 = 3 {1} {1} {2} {2} 𝟏 𝟐 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 2 = 3 Conversion from regular expression to DFA Step 4: Calculate followpos Position followpos 5 6 Firstpos {1,2,3}. {6} 4 5 Lastpos {1,2,3}. {5} 3 4 {6} # {6} 2 1,2, 3 {1,2,3}. {4} 𝟔 {5} 𝑏 {5} 1 1,2, 3 {1,2,3}. {3} {4} 𝑏 {4} 𝟓 𝟒 {1,2} * {1,2} {1,2} ∗ {1,2} {3} 𝑎 {3} 𝒏 𝟑 {1,2} | {1,2} 𝑖 = 𝑙𝑎𝑠𝑡𝑝𝑜𝑠(𝑛) = {1,2} 𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 𝑛 = 1,2 𝑎 𝑏 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 1 = 1,2 {1} {1} {2} {2} 𝟏 𝟐 𝑓𝑜𝑙𝑙𝑜𝑤𝑝𝑜𝑠 2 = 1,2 Conversion from regular expression to DFA Initial state = 𝑓𝑖𝑟𝑠𝑡𝑝𝑜𝑠 of root = {1,2,3} ----- A Position followpos State A 5 6 δ( (1,2,3),a) = followpos(1) U followpos(3) 4 5 3 4 =(1,2,3) U (4) = {1,2,3,4} ----- B 2 1,2,3 1 1,2,3 δ( (1,2,3),b) = followpos(2) =(1,2,3) ----- A States a b A={1,2,3} B A B={1,2,3,4} Conversion from regular expression to DFA State B Position followpos δ( (1,2,3,4),a) = followpos(1) U followpos(3) 5 6 =(1,2,3) U (4) = {1,2,3,4} ----- B 4 5 3 4 δ( (1,2,3,4),b) = followpos(2) U followpos(4) 2 1,2,3 =(1,2,3) U (5) = {1,2,3,5} ----- C 1 1,2,3 State C δ( (1,2,3,5),a) = followpos(1) U followpos(3) States a b A={1,2,3} B A =(1,2,3) U (4) = {1,2,3,4} ----- B B={1,2,3,4} B C C={1,2,3,5} B D δ( (1,2,3,5),b) = followpos(2) U followpos(5) D={1,2,3,6} =(1,2,3) U (6) = {1,2,3,6} ----- D Conversion from regular expression to DFA State D Position followpos δ( (1,2,3,6),a) = followpos(1) U followpos(3) 5 6 =(1,2,3) U (4) = {1,2,3,4} ----- B 4 5 3 4 δ( (1,2,3,6),b) = followpos(2) 2 1,2,3 1 1,2,3 =(1,2,3) ----- A b a States a b A={1,2,3} B A a b b B={1,2,3,4} B C A B C D C={1,2,3,5} B D a a D={1,2,3,6} B A b DFA Example 2 Convert the following RE to DFA (Direct Method): ba(a+b)*ab Practice Problems Convert the following regular expressions to deterministic finite automata: 1. (a|b)* 2. (a*|b*)* 3. ((ε|a)|b*)* 4. (a|b)*abb(a|b)* Solutions to practice problems (a*|b*)* (a|b)*