Compiler Design Course Hondout PDF - Ambo University
Document Details
Uploaded by CarefreeRhenium
Ambo University
2020
Yoobsan Bechera (BSc)
Tags
Summary
This document is a compiler design hondout for third-year computer science students at Ambo University. It covers topics like lexical analysis, syntax analysis, and code generation. The document was prepared in May 2020.
Full Transcript
Ambo University @Woliso Campus School of Technology and Informatics Computer Science Department Compiler Design Course Full hondout Program: Regular Batch: 3rd Semister: II Course Code:CoSc 3112 Prepared and Compiled by: Yoobsan Bechera (BSc)...
Ambo University @Woliso Campus School of Technology and Informatics Computer Science Department Compiler Design Course Full hondout Program: Regular Batch: 3rd Semister: II Course Code:CoSc 3112 Prepared and Compiled by: Yoobsan Bechera (BSc) May/25/2020, AU, Oromia, Ethiopia Table of Contents Chapter 1 Introduction to Compiler Design................................................................................................. 1 What is a compiler?.................................................................................................................................. 1 Programs related to compilers................................................................................................................... 1 The translation process............................................................................................................................. 5 Analysis (front end).............................................................................................................................. 5 Synthesis (back end)............................................................................................................................. 5 Grouping of phases................................................................................................................................. 10 Major Data and Structures in a Compiler............................................................................................... 11 Compiler construction tools.................................................................................................................... 12 Review Exercise.......................................................................................................................................... 13 Chapter 2..................................................................................................................................................... 14 Lexical analysis........................................................................................................................................... 14 Introduction............................................................................................................................................. 14 Token, pattern, lexeme............................................................................................................................ 14 Specification of patterns using regular expressions................................................................................ 15 Regular expression: Language Operations.......................................................................................... 16 Regular expressions for tokens........................................................................................................... 18 Automata................................................................................................................................................. 20 Deterministic Finite Automata (DFA).................................................................................................... 24 From regular expression to an NFA........................................................................................................ 28 From an NFA to a DFA (subset construction algorithm)....................................................................... 29 The Lexical- Analyzer Generator: Lex................................................................................................... 31 Summary of Chapter 2............................................................................................................................ 32 Review Exercise...................................................................................................................................... 34 Chapter – 3.................................................................................................................................................. 35 Syntax analysis............................................................................................................................................ 35 Introduction............................................................................................................................................. 35 Context free grammar (CFG).................................................................................................................. 36 Derivation............................................................................................................................................... 37 Page i of 115 Parse tree................................................................................................................................................. 37 Elimination of ambiguity Precedence/Association................................................................................. 38 Left Recursion......................................................................................................................................... 39 Left factoring.......................................................................................................................................... 40 Top-down parsing................................................................................................................................... 41 Non-recursive predictive parsing............................................................................................................ 43 FIRST and FOLLOW......................................................................................................................... 45 LL (1) Grammars…................................................................................................................................ 48 Bottom-Up and Top-Down Parsers......................................................................................................... 49 The Parser Generator: Yacc.................................................................................................................... 61 Review Exercise...................................................................................................................................... 63 CHAPTER 4 Syntax-Directed Translation................................................................................................ 65 Introduction............................................................................................................................................. 65 Syntax-Directed Definitions and Translation Schemes.......................................................................... 65 Annotated Parse Tree.............................................................................................................................. 67 Annotating a Parse Tree with Depth-First Traversals......................................................................... 67 Dependency Graphs for Attributed Parse Trees..................................................................................... 69 Evaluation Order..................................................................................................................................... 72 S-Attributed Definitions.......................................................................................................................... 72 Bottom-Up Evaluation of S-Attributed Definitions............................................................................ 74 Canonical LR(0) Collection for The Grammar....................................................................................... 75 CHAPTER 5 Type checking....................................................................................................................... 76 Position of type checker.......................................................................................................................... 76 Static versus Dynamic Checking............................................................................................................ 76 Type systems........................................................................................................................................... 80 Specification of a simple type checker................................................................................................... 83 A Simple Language example.................................................................................................................. 83 CHAPTER 6 Intermediate Code Generation............................................................................................. 88 Intermediate Representations.................................................................................................................. 88 Types of Intermediate Representations................................................................................................... 89 Intermediate languages........................................................................................................................... 89 Syntax-Directed Translation of Abstract Syntax Trees.......................................................................... 90 Page ii of 115 Abstract Syntax Trees versus DAGs....................................................................................................... 91 Stack Machine Code............................................................................................................................... 91 Three-Address Statements...................................................................................................................... 93 Syntax-Directed Translation into Three-Address Code.......................................................................... 95 Implementation of Three-Address Statements........................................................................................ 97 Implementation of Three-Address Statements: Quads........................................................................... 97 Implementation of Three-Address Statements: Triples.......................................................................... 98 More triplet representations.................................................................................................................... 98 Implementation of Three-Address Statements: Indirect Triples............................................................. 98 CHAPTER 7 Code Generation............................................................................................................... 103 Introduction........................................................................................................................................... 103 Issue in the Design of a Code Generator............................................................................................... 103 The Target Machine: Addressing Modes.............................................................................................. 108 Program and Instruction Costs.............................................................................................................. 110 Page iii of 115 Chapter 1 Introduction to Compiler Design What is a compiler? A program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language). Why we design compiler? Why we study compiler construction techniques? Compilers provide an essential interface between applications and architectures Compilers embody a wide range of theoretical techniques Since different platforms, or hardware architectures along with the operating systems (Windows, Macs, Unix), require different machine code, you must compile most programs separately for each platform. Programs related to compilers Interpreter: Is a program that reads a source program and executes it Works by analyzing and executing the source program commands one at a time Page 1 of 115 Does not translate the whole source program into object code Interpretation is important when: Programmer is working in interactive mode and needs to view and update variables Running speed is not important Commands have simple formats, and thus can be quickly analyzed and executed Modification or addition to user programs is required as execution proceeds Interpreter and compiler differences Interpreter: o Interpreter takes one statement then translates it and executes it and then takes another statement. o Interpreter will stop the translation after it gets the first error. o Interpreter takes less time to analyze the source code. o Over all execution speed is less. Compiler: While compiler translates the entire program in one go and then executes it. Generates the error report after the translation of the entire program. Takes a large amount of time in analyzing and processing the high level language code. Overall execution time is faster. Page 2 of 115 Interpreter: Well-known examples of interpreters: o Basic interpreter, Lisp interpreter, UNIX shell command interpreter, SQL interpreter, java interpreter… In principle, any programming language can be either interpreted or compiled: o Some languages are designed to be interpreted, others are designed to be compiled Interpreters involve large overheads: o Execution speed degradation can vary from 10:1 to 100:1 o Substantial space overhead may be involved E.g., Compiling Java Programs The Java compiler produces bytecode not machine code Bytecode is converted into machine code using a Java Interpreter You can run bytecode on any computer that has a Java Interpreter installed Android and Java Assemblers: Page 3 of 115 Translator for the assembly language. Assembly code is translated into machine code Output is relocatable machine code. Linker Links object files separately compiled or assembled Links object files to standard library functions Generates a file that can be loaded and executed Loader Loading of the executable codes, which are the outputs of linker, into main memory. Pre-processors A pre-processor is a separate program that is called by the compiler before actual translation begins. Such a pre-processor: o Produce input to a compiler o can delete comments, o Macro processing (substitutions) o Include other files... Page 4 of 115 The translation process A compiler consists of internally of a number of steps, or phases, that perform distinct logical operations. The phases of a compiler are shown in the next slide, together with three auxiliary components that interact with some or all of the phases: The symbol table, the literal table, and error handler. There are two important parts in compilation process: Analysis and Synthesis Analysis and Synthesis Analysis (front end) Breaks up the source program into constituent pieces and Creates an intermediate representation of the source program. During analysis, the operations implied by the source program are determined and recorded in hierarchical structure called a tree. Synthesis (back end) The synthesis part constructs the desired program from the intermediate representation. Analysis of the source program Page 5 of 115 Analysis consists of three phases: 1) Linear/Lexical analysis 2) Hierarchical/Syntax analysis 3) Semantic analysis Lexical analysis or Scanning The stream of characters making up the source program is read from left to right and is grouped into tokens. A token is a sequence of characters having a collective meaning. A lexical analyzer, also called a lexer or a scanner, receives a stream of characters from the source program and groups them into tokens. Examples: Identifiers Keywords Symbols (+, -, …) Numbers … Blanks, new lines, tabulation marks will be removed during lexical analysis. Example: a[index] = 4 + 2; a identifier [ left bracket all are tokens index identifier ] right bracket = assignment operator 4 number + plus operator 2 number ; semicolon A scanner may perform other operations along with the recognition of tokens. Page 6 of 115 It may inter identifiers into the symbol table, and It may inter literals into literal table. Lexical Analysis Tools There are tools available to assist in the writing of lexical analyzers. o lex - produces C source code (UNIX/linux). o flex - produces C source code (gnu). o JLex - produces Java source code. We will use Lex. Syntax analysis or Parsing The parser receives the source code in the form of tokens from the scanner and performs syntax analysis. The results of syntax analysis are usually represented by a parse tree or a syntax tree. Syntax tree each interior node represents an operation and the children of the node represent the arguments of the operation. The syntactic structure of a programming language is determined by context free grammar (CFG). Ex. Consider again the line of C code: a[index] = 4 + 2 Sometimes syntax trees are called abstract syntax trees, since they represent a further abstraction from parse trees. Example is shown in the following figure. Page 7 of 115 Syntax Analysis Tools There are tools available to assist in the writing of parsers. o yacc - produces C source code (UNIX/Linux). o bison - produces C source code (gnu). o CUP - produces Java source code. We will use yacc. Semantic analysis The semantics of a program are its meaning as opposed to syntax or structure The semantics consist of: o Runtime semantics – behavior of program at runtime o Static semantics – checked by the compiler Static semantics include: o Declarations of variables and constants before use o Calling functions that exist (predefined in a library or defined by the user) o Passing parameters properly o Type checking. The semantic analyzer does the following: o Checks the static semantics of the language o Annotates the syntax tree with type information Ex. Consider again the line of C code: a[index] = 4 + 2 Page 8 of 115 Synthesis of the target program The target code generator Intermediate code generator Intermediate code generator Comes after syntax and semantic analysis Separates the compiler front end from its backend Intermediate representation should have 2 important properties: o Should be easy to produce o Should be easy to translate into the target program Intermediate representation can have a variety of forms: o Three-address code, P-code for an abstract machine, Tree or DAG representation Code generator The machine code generator receives the (optimized) intermediate code, and then it produces either: o Machine code for a specific machine, or Page 9 of 115 o Assembly code for a specific machine and assembler. Code generator o Selects appropriate machine instructions o Allocates memory locations for variables o Allocates registers for intermediate computations The code generator takes the IR code and generates code for the target machine. Here we will write target code in assembly language: a[index]=6 MOV R0, index ;; value of index -> R0 MUL R0, 2 ;; double value in R0 MOV R1, &a ;; address of a ->R1 ADD R1, R0 ;; add R0 to R1 MOV *R1, 6 ;; constant 6 -> address in R1 &a –the address of a (the base address of the array) *R1-indirect registers addressing (the last instruction stores the value 6 to the address contained in R1) Grouping of phases The discussion of phases deals with the logical organization of a compiler. In practice most compilers are divided into: o Front end - language-specific and machine-independent. o Back end - machine-specific and language-independent. Compiler passes: Page 10 of 115 A pass consists of reading an input file and writing an output file. Several phases may be grouped in one pass. For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped together into one pass. Single pass: o Is a compiler that passes through the source code of each compilation unit only once o A one-pass compiler does not "look back" at code it previously processed. o A one-pass compilers is faster than multi-pass compilers o They are unable to generate efficient programs, due to the limited scope available. Multi pass: o Is a type of compiler that processes the source code or abstract syntax tree of a program several times o A collection of phases is done multiple times Major Data and Structures in a Compiler Token o Represented by an integer value or an enumeration literal o Sometimes, it is necessary to preserve the string of characters that was scanned o For example, name of an identifiers or value of a literal Syntax Tree o Constructed as a pointer-based structure o Dynamically allocated as parsing proceeds o Nodes have fields containing information collected by the parser and semantic analyzer Symbol Table o Keeps information associated with all kinds of tokens: Identifiers, numbers, variables, functions, parameters, types, fields, etc. Page 11 of 115 o Tokens are entered by the scanner and parser o Semantic analyzer adds type information and other attributes o Code generation and optimization phases use the information in the symbol table Performance Issues o Insertion, deletion, and search operations need to be efficient because they are frequent o More than one symbol table may be used Literal Table o Stores constant values and string literals in a program. o One literal table applies globally to the entire program. o Used by the code generator to: Assign addresses for literals. o Avoids the replication of constants and strings. o Quick insertion and lookup are essential Compiler construction tools Various tools are used in the construction of the various parts of a compiler. Scanner generators o Ex. Lex, flex, JLex o These tools generate a scanner /lexical analyzer/ if given a regular expression. Parser Generators o Ex. Yacc, Bison, CUP o These tools produce a parser /syntax analyzer/ if given a Context Free Grammar (CFG) that describes the syntax of the source language. Syntax directed translation engines o Ex. Cornell Synthesizer Generator o It produces a collection of routines that walk the parse tree and execute some tasks. Page 12 of 115 Automatic code generators o Take a collection of rules that define the translation of the IC to target code and produce a code generator. This completes our brief description of the phases of compiler (chapter 1). For any unclear, comment, question, doubt things and etc please don’t hesitate to announce me as you can. Review Exercise 1) What is compiler? 2) Why designing it is necessary? 3) What are the main phases of compiler construction? Explain each. 4) Consider the line of C++ code: float [index] = a-c. write its: A. Lexical analysis D. Syntax analysis B. Semantic analysis E. Intermediate code generator C. Code generator 5) Consider the line of C code: int [index] = (4 + 2) / 2. write its: A. Lexical analysis C. Syntax analysis B. Semantic analysis D. Intermediate code generator E. Code generator 6) Explain diagrammatically how high level languages can be changed to machine understandable language (machine code). a) What is the role of compilers in this action? b) What is another programs used in this process and what makes them different from compilers. Page 13 of 115 Chapter 2 Lexical analysis Introduction The role of lexical analyzer is: o to read a sequence of characters from the source program o group them into lexemes and o Produce as output a sequence of tokens for each lexeme in the source program. The scanner can also perform the following secondary tasks: o stripping out blanks, tabs, new lines o stripping out comments o keep track of line numbers (for error reporting) Interaction of the Lexical Analyzer with the Parser token: smallest meaningful sequence of characters of interest in source program Token, pattern, lexeme A token is a sequence of characters from the source program having a collective meaning. A token is a classification of lexical units. For example: id and num Lexemes are the specific character strings that make up a token. For example: abc and 123A Page 14 of 115 Patterns are rules describing the set of lexemes belonging to a token. For example: “letter followed by letters and digits” Patterns are usually specified using regular expressions. [a-zA-Z]* Example: printf("Total = %d\n", score); Example: The following table shows some tokens and their lexemes in Pascal (a high level, case insensitive programming language) In general, in programming languages, the following are tokens: o Keywords, operators, identifiers, constants, literals, punctuation symbols… Specification of patterns using regular expressions o Regular expressions o Regular expressions for tokens Regular expression: Definitions Represents patterns of strings of characters. An alphabet Σ is a finite set of symbols (characters) A string s is a finite sequence of symbols from Σ o |s| denotes the length of string s o ε denotes the empty string, thus |ε| = 0 A language L is a specific set of strings over some fixed alphabet Σ A regular expression is one of the following: Symbol: a basic regular expression consisting of a single character a, where a is from: Page 15 of 115 o an alphabet Σ of legal characters; o the metacharacter ε: or o the metacharacter ø. In the first case, L(a)={a}; in the second case, L(ε)= {ε}; and in the third case, L(ø)= { }. {} – contains no string at all and {ε} – contains the single string consists of no character Alternation: an expression of the form r|s, where r and s are regular expressions. o In this case , L(r|s) = L(r) U L(s) ={r,s} Concatenation: An expression of the form rs, where r and s are regular expressions. o In this case, L(rs) = L(r)L(s)={rs} Repetition: An expression of the form r*, where r is a regular expression. o In this case, L(r*) = L(r)* ={ε, r,…} Regular expression: Language Operations Union of L and M o L ∪ M = {s |s ∈ L or s ∈ M} Concatenation of L and M o LM = {xy | x ∈ L and y ∈ M} Exponentiation of L o L0 = {ε}; Li = Li-1L Kleene closure of L o L* = ∪i=0,…,∞ Li Positive closure of L o L+ = ∪i=1,…,∞ Li Note: The following short hands are often used: r+ =rr* r* = r+| ε r? =r|ε Page 16 of 115 RE‟s: Examples a) L(01) = ? b) L(01|0) = ? c) L(0(1|0)) = ? o Note order of precedence of operators. L(0*) = ? L((0|10)*(ε|1)) = ? RE‟s: Examples Solution L(01) = {01}. L(01|0) = {01, 0}. L(0(1|0)) = {01, 00}. o Note order of precedence of operators. L(0*) = {ε, 0, 00, 000,… }. L((0|10)*(ε|1)) = all strings of 0‟s and 1‟s without two consecutive 1‟s. RE‟s: Examples (more) 1- a | b = ? 2- (a|b)a = ? 3- (ab) | ε = ? 4- ((a|b)a)* = ? Reverse 1 – Even binary numbers =? 2 – An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}. Consider the set of all strings over this alphabet that contains exactly one b. RE‟s: Examples (more) Solutions Page 17 of 115 1- a | b = {a,b} 2- (a|b)a = {aa,ba} 3- (ab) | ε ={ab, ε} 4- ((a|b)a)* = {ε, aa,ba,aaaa,baba,....} Reverse 1 – Even binary numbers (0|1)*0 2 – An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}. Consider the set of all strings over this alphabet that contains exactly one b. (a | c)*b(a|c)* {b, abc, abaca, baaaac, ccbaca, cccccb} Exercises Describe the languages denoted by the following regular expressions: 1- a(a|b)*a 2- ((ε|a)b*)* 3- (a|b)*a(a|b)(a|b) 4- a*ba*ba*ba* Regular Expressions (Summary) Definition: A regular expression is a string over ∑ if the following conditions hold: o ε, Ø, and a Є ∑ are regular expressions o If α and β are regular expressions, so is αβ o If α and β are regular expressions, so is α+β o If α is a regular expression, so is α* o Nothing else is a regular expression if it doesn‟t follow from (1) to (4) Let α be a regular expression, the language represented by α is denoted by L(α). Regular expressions for tokens Regular expressions are used to specify the patterns of tokens. Each pattern matches a set of strings. It falls into different categories: Page 18 of 115 Reserved (Key) words: They are represented by their fixed sequence of characters, o Ex. if, while and do.... If we want to collect all the reserved words into one definition, we could write it as follows: Reserved = if | while | do |... Special symbols: including arithmetic operators, assignment and equality such as =, :=, +, -, * Identifiers: which are defined to be a sequence of letters and digits beginning with letter, o we can express this in terms of regular definitions as follows: letter = A|B|…|Z|a|b|…|z digit = 0|1|…|9 or letter= [a-zA-Z] digit = [0-9] identifiers = letter(letter|digit)* Numbers: Numbers can be: o sequence of digits (natural numbers), or o decimal numbers, or o numbers with exponent (indicated by an e or E). Example: 2.71E-2 represents the number 0.0271. We can write regular definitions for these numbers as follows: nat = [0-9]+ signedNat = (+|-)? Nat number = signedNat(“.” nat)?(E signedNat)? Literals or constants: This can include: o numeric constants such as 42, and o String literals such as “ hello, world”. Page 19 of 115 relop < | | >= Comments: Ex. Delimiter newline | blank | tab | comment White space = (delimiter )+ Example: Divide the following Java program into appropriate tokens. public class Dog { private String name; private String color; public Dog(String n, String c) { name = n; color = c; } public String getName() { return name; } public String getColor() { return color; } public void speak() { System.out.println("Woof"); } } Automata Abstract machines Characteristics Input: input values (from an input alphabet ∑) are applied to the machine Output: outputs of the machine States: at any instant, the automation can be in one of the several states Page 20 of 115 State relation: the next state of the automation at any instant is determined by the present state and the present input Types of automata o Finite State Automata (FSA) Deterministic FSA (DFSA) Nondeterministic FSA (NFSA) o Push Down Automata (PDA) Deterministic PDA (DPDA) Nondeterministic PDA (NPDA) Finite State Automaton o Finite Automaton, Finite State Machine, FSA or FSM o An abstract machine which can be used to implement regular expressions (etc.). o Have a finite number of states, and a finite amount of memory (i.e., the current state). o Can be represented by directed graphs or transition tables Design of a Lexical Analyzer/Scanner Finite Automata Page 21 of 115 Lex – turns its input program into lexical analyzer. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string. Finite automata come in two flavors: a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. ε, the empty string, is a possible label. b) Deterministic finite automata (DFA) have, for each state, and for each symbol of its input alphabet exactly one edge with that symbol leaving that state. The Whole Scanner Generator Process Overview Direct construction of Nondeterministic finite Automation (NFA) to recognize a given regular expression. o Easy to build in an algorithmic way o Requires ε-transitions to combine regular sub expressions Construct a deterministic finite automation (DFA) to simulate the NFA o Use a set-of-state construction Minimize the number of states in the DFA (optional) Generate the scanner code. Design of a Lexical Analyzer … o Token Pattern o Pattern Regular Expression o Regular Expression NFA o NFA DFA o DFA‟s or NFA‟s for all tokens Lexical Analyzer Page 22 of 115 Non-Deterministic Finite Automata (NFA) Definition: An NFA M consists of five tuples: ( Σ,S, T, S0, F) o A set of input symbols Σ, the input alphabet o a finite set of states‟ S, o a transition function T: S × (Σ U { ε}) -> S (next state), o a start state S0 from S, and o a set of accepting/final states F from S. The language accepted by M, written L(M), is defined as: The set of strings of characters c1c2...cn with each ci from Σ U { ε} such that there exist states s1 in T(s0,c1), s2 in T(s1,c2),... , sn in T(sn-1,cn) with sn an element of F. It is a finite automata which has choice of edges o The same symbol can label edges from one state to several different states. An edge may be labeled by ε, the empty string o We can have transitions without any input character consumption. Transition Graph The transition graph for an NFA recognizing the language of regular expression (a|b)*abb Page 23 of 115 Transition Table The mapping T of an NFA can be represented in a transition table The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA Acceptance of input strings by NFA An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting states The string aabb is accepted by the NFA: Another NFA An -transition is taken without consuming any character from the input. What does the NFA above accepts? aa*|bb* Deterministic Finite Automata (DFA) A deterministic finite automaton is a special case of an NFA o No state has an ε-transition o For each state S and input symbol a there is at most one edge labeled a leaving S Page 24 of 115 Each entry in the transition table is a single state o At most one path exists to accept a string o Simulation algorithm is simple DFSA: Example DFA example A DFA that accepts (a|b)*abb Simulating a DFA: Algorithm How to apply a DFA to a string? INPUT: o An input string x terminated by an end-of-file character eof. o A DFA D with start state So, accepting states F, and transition function move. OUTPUT: Answer ''yes" if D accepts x; "no" otherwise METHOD o Apply the algorithm in (next slide) to the input string x. o The function move(s, c) gives the state to which there is an edge from state s on input c. o The function nextChar() returns the next character of the input string x. Page 25 of 115 Example: DFA: Exercise Construct DFAs for the string matched by the following definition: digit =[0-9] nat=digit+ signednat=(+|-)?nat number=signednat(“.”nat)?(E signedNat)? Why do we study RE,NFA,DFA? Goal: To scan the given source program Process: o Start with Regular Expression (RE) o Build a DFA How? We can build a non-deterministic finite automaton, NFA (Thompson's construction) Convert that to a deterministic one, DFA (Subset construction) Minimize the DFA (optional) (different algorithms) Implement it Existing scanner generator: Lex/Flex RENFADFA Minimize DFA states Page 26 of 115 Step 1: Come up with a Regular Expression (a|b)*ab Step 2: Use Thompson's construction to create an NFA for that expression r RENFADFA Minimize DFA states Step 1: Come up with a Regular Expression (a|b)*ab Step 2: Use Thompson's construction to create an NFA for that expression RENFADFA Minimize DFA states Step 3: Use subset construction to convert the NFA to a DFA States 0 and 2 behave the same way, so they can be merged. Step 4: Minimize the DFA states Design of a Lexical Analyzer Generator Two algorithms: 1- Translate a regular expression into an NFA (Thompson‟s construction) Page 27 of 115 2- Translate NFA into DFA (Subset construction) From regular expression to an NFA It is known as Thompson‟s construction. Rules: 1- For a ε, a regular expressions, construct: 2- For a composition of regular expression: Case 1: Alternation: regular expression (s|r), assume that NFAs equivalent to r and s have been constructed. Case 2: Concatenation: regular expression sr. From RE to NFA: Exercises Construct NFA for token identifier. letter(letter|digit)* Construct NFA for the following regular expression: (a|b)*abb Page 28 of 115 From an NFA to a DFA (subset construction algorithm) Rules: Start state of D is assumed to be unmarked. Start state of D is = ε-closer (S0), Where S0 -start state of N. ε- closure ε-closure (S‟) – is a set of states with the following characteristics: 1- S‟ € ε-closure(S‟) itself 2- if t € ε-closure (S‟) and if there is an edge labeled ε from t to v, then v € ε-closure (S‟) 3- Repeat step 2 until no more states can be added to ε-closure (S‟). E.g: for NFA of (a|b)*abb ε-closure (0)= {0, 1, 2, 4, 7} and ε-closure (1)= {1, 2, 4} Algorithm While there is unmarked state X = { s0, s1, s2,..., sn} of D do Begin Mark X For each input symbol „a‟ do Begin Let T be the set of states to which there is a transition „a‟ from state si in X. Y= ε-Closer (T) If Y has not been added to the set of states of D then { Page 29 of 115 Mark Y an “Unmarked” state of D add a transition from X to Y labeled „a‟ if not already presented } End End NFA for identifier: letter(letter|digit)* Example: Convert the following NFA into the corresponding DFA. letter (letter|digit)* Exercise: convert NFA of (a|b)*abb in to DFA. Other Algorithms How to minimize a DFA? (see Dragon Book 3.9, pp.173) How to convert RE to DFA directly? (see Dragon Book 3.9.5 pp.179) Page 30 of 115 The Lexical- Analyzer Generator: Lex The first phase in a compiler is, it reads the input source and converts strings in the source to tokens. Lex: generates a scanner (lexical analyzer or lexer) given a specification of the tokens using REs. o The input notation for the Lex tool is referred to as the Lex language and o The tool itself is the Lex compiler. The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram. By using regular expressions, we can specify patterns to lex that allow it to scan and match strings in the input. Each pattern in lex has an associated action. Typically an action returns a token, representing the matched string, for subsequent use by the parser. It uses patterns that match strings in the input and converts the strings to tokens. General Compiler Infra-structure Page 31 of 115 Scanner, Parser, Lex and Yacc Generating a Lexical Analyzer using Lex We will see more about lex and its construction in lab case. Summary of Chapter 2 Tokens: The lexical analyzer scans the source program and produces as output a sequence of tokens, which are normally passed, one at a time to the parser. Some tokens may consist only of a token name while others may also have an associated lexical value that gives information about the particular instance of the token that has been found on the input. Lexemes: Each time the lexical analyzer returns a token to the parser, it has an associated lexeme - the sequence of input characters that the token represents. Buffering: Because it is often necessary to scan ahead on the input in order to see where the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input. Using a pair of buffers cyclically and ending each buffer's contents with a sentinel that warns of its end are two techniques that accelerate the process of scanning the input. + Patterns. Each token has a pattern that describes which sequences of characters can form the lexemes corresponding to that token. The set of words or strings of characters that match a given pattern is called a language. Page 32 of 115 Regular Expressions: These expressions are commonly used to describe patterns. Regular expressions are built from single characters, using union, concatenation, and the Kleene closure, or any-number-of, operator. Regular Definitions: Complex collections of languages, such as the patterns that describe the tokens of a programming language, are often defined by a regular definition, which is a sequence of statements that each define one variable to stand for some regular expression. The regular expression for one variable can use previously defined variables in its regular expression. Transition Diagrams: The behavior of a lexical analyzer can often be described by a transition diagram. These diagrams have states, each of which represents something about the history of the characters seen during the current search for a lexeme that matches one of the possible patterns. There are arrows, or transitions, from one state to another, each of which indicates the possible next input characters that cause the lexical analyzer to make that change of state. + Finite Automata. These are a formalization of transition diagrams that include a designation of a start state and one or more accepting states, as well as the set of states, input characters, and transitions among states. Accepting states indicate that the lexeme for some token has been found. Unlike transition diagrams, finite automata can make transitions on empty input as well as on input characters. Deterministic Finite Automata: A DFA is a special kind of finite automaton that has exactly one transition out of each state for each input symbol. Also, transitions on empty input are disallowed. The DFA is easily simulated and makes a good implementation of a lexical analyzer, similar to a transition diagram. Nondeterministic Finite Automata: Automata that are not DFA7s are called nondeterministic. NFA's often are easier to design than are DFA's. Another possible architecture for a lexical analyzer is to tabulate all the states that NFA7s for each of the possible patterns can be in, as we scan the input characters. Conversion among Pattern Representations: It is possible to convert any regular expression into an NFA of about the same size, recognizing the same language as the regular expression defines. Further, any NFA can be converted to a DFA for the same pattern, although in the worst case (never encountered in common programming languages) the size of the automaton can grow exponentially. It is also possible to convert any nondeterministic or deterministic finite automaton into a regular expression that defines the same language recognized by the finite automaton. Lex: There is a family of software systems, including Lex and Flex, that are lexical- analyzer generators. The user specifies the patterns for tokens using an extended regular- expression notation. Lex converts these expressions into a lexical analyzer that is Page 33 of 115 essentially a deterministic finite automaton that recognizes any of the patterns. Minimization of Finite Automata: For every DFA there is a minimum state DM accepting the same language. Moreover, the minimum-state DFA for a given language is unique except for the names given to the various states. Review Exercise 1) Divide the following C + + program: float linitedSquare(x) float x { return (x=l0.0)?100:x*x; into appropriate lexemes. Which lexemes should get associated lexical values? What should those values be? 2) Write regular definitions for the following languages: a) All strings of lowercase letters that contain the five vowels in order. b) All strings of lowercase letters in which the letters are in ascending lexicographic order. c) Comments, consisting of a string surrounded by , without an intervening */, unless it is inside double-quotes ("). d) All strings of digits with no repeated digits. Hint: Try this problem first with a few digits, such as {0, 1, 2). !! e) All strings of digits with at most one repeated digit. !! f) All strings of a's and b's with an even number of a's and an odd number of b's. g) The set of Chess moves, in the informal notation, such as p-k4 or kbp x qn.!! h) All strings of a's and b's that do not contain the substring abb. i) All strings of a's and b's that do not contain the subsequence abb. 3) Construct the minimum-state DFA7s for the following regular expressions: a) (a|b)*a(a|b) b) (a|b)*a(a|b) (a|b) c) (a|b)*a(a|b) (a|b)(a|b) Page 34 of 115 Chapter – 3 Syntax analysis Introduction Syntax: the way in which tokens are put together to form expressions, statements, or blocks of statements. o The rules governing the formation of statements in a programming language. Syntax analysis: the task concerned with fitting a sequence of tokens into a specified syntax. Parsing: To break a sentence down into its component parts with an explanation of the form, function, and syntactical relationship of each part. The syntax of a programming language is usually given by the grammar rules of a context free grammar (CFG). Parser The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by a CFG or not. o If it satisfies, the parser creates the parse tree of that program. o Otherwise, the parser gives the error messages. A CFG: o Gives a precise syntactic specification of a programming language. o A grammar can be directly converted in to a parser by some tools (yacc). The parser can be categorized into two groups: Top-down parser Page 35 of 115 o The parse tree is created top to bottom, starting from the root to leaves. Bottom-up parser o The parse tree is created bottom to top, starting from the leaves to root. Both top-down and bottom-up parser scan the input from left to right (one symbol at a time). Efficient top-down and bottom-up parsers can be implemented by making use of context- free- grammar. o LL for top-down parsing o LR for bottom-up parsing Context free grammar (CFG) A context-free grammar is a specification for the syntactic structure of a programming language. Context-free grammar has 4-tuples: G = (T, N, P, S) where o T is a finite set of terminals (a set of tokens) o N is a finite set of non-terminals (syntactic variables) o P is a finite set of productions of the form A → α where A is non-terminal and α is a strings of terminals and non-terminals (including the empty string) S ∈ N is a designated start symbol (one of the non- terminal symbols) Example: grammar for simple arithmetic expressions Page 36 of 115 Derivation A derivation is a sequence of replacements of structure names by choices on the right hand sides of grammar rules. Example: E → E + E | E – E | E * E | E / E | -E E→(E) E → id E => E + E means that E + E is derived from E o we can replace E by E + E o we have to have a production rule E → E+E in our grammar. E=>E+E =>id+E=>id+id means that a sequence of replacements of non-terminal symbols is called a derivation of id+id from E. If we always choose the left-most non-terminal in each derivation step, this derivation is called left-most derivation. Example: E=>-E=>-(E)=>-(E+E)=>-(id+E)=>-(id+id) If we always choose the right-most non-terminal in each derivation step, this derivation is called right-most derivation. Example: E=>-E=>-(E)=>-(E+E)=>-(E+id)=>-(id+id) We will see that the top-down parser try to find the left-most derivation of the given source program. We will see that the bottom-up parser try to find right-most derivation of the given source program in the reverse order. Parse tree A parse tree is a graphical representation of a derivation. It filters out the order in which productions are applied to replace non-terminals. A parse tree corresponding to a derivation is a labeled tree in which: o the interior nodes are labeled by non-terminals, o the leaf nodes are labeled by terminals, and o the children of each internal node represent the replacement of the associated non- terminal in one step of the derivation. Page 37 of 115 Parse tree and Derivation Ambiguity: example Ambiguity: example… Elimination of ambiguity Precedence/Association These two derivations point out a problem with the grammar: The grammar do not have notion of precedence, or implied order of evaluation To add precedence o Create a non-terminal for each level of precedence o Isolate the corresponding part of the grammar Page 38 of 115 o Force the parser to recognize high precedence sub expressions first For algebraic expressions o Multiplication and division, first (level one) o Subtraction and addition, next (level two) To add association o Left-associative : The next-level (higher) non-terminal places at the last of a production o Elimination of ambiguity o To disambiguate the grammar : o we can use precedence of operators as follows: * Higher precedence (left associative) + Lower precedence (left associative) o We get the following unambiguous grammar: Left Recursion Elimination of Left recursion A grammar is left recursive, if it has a non-terminal A such that there is a derivation A=>+Aα for some string α. Page 39 of 115 Top-down parsing methods cannot handle left-recursive grammar. So a transformation that eliminates left-recursion is needed. To eliminate left recursion for single production A Aα |β could be replaced by the non- left- recursive productions A β A‟ A‟ α A‟| ε Generally, we can eliminate immediate left recursion from them by the following technique. First we group the A-productions as: A Aα1 |Aα2 |…. |Aαm |β1 | β2|….| βn Where no βi begins with A. then we replace the A productions by: A β1A‟ | β2A‟ | … | βnA‟ A‟ α1Α‟ | α2A‟ | … | αmA‟ |ε Left factoring When a non-terminal has two or more productions whose right-hand sides start with the same grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-factored. In general: A αβ1 | αβ2 , where α-is a non-empty and the first symbol of β1 and β2. When processing α we do not know whether to expand A to αβ1 or to αβ2, but if we re-write the grammar as follows: A αA‟ A‟ β1 | β2 so, we can immediately expand A to αA‟. Example: given the following grammar: Page 40 of 115 S iEtS | iEtSeS | a Eb Left factored, this grammar becomes: S iEtSS‟ | a S‟ eS | ε Eb Syntax analysis Every language has rules that prescribe the syntactic structure of well-formed programs. The syntax can be described using Context Free Grammars (CFG) notation. The use of CFGs has several advantages: o helps in identifying ambiguities o it is possible to have a tool which produces automatically a parser using the grammar o a properly designed grammar helps in modifying the parser easily when the language changes Top-down parsing Recursive Descent Parsing (RDP) This method of top-down parsing can be considered as an attempt to find the left most derivation for an input string. It may involve backtracking. To construct the parse tree using RDP: o We create one node tree consisting of S. o Two pointers, one for the tree and one for the input, will be used to indicate where the parsing process is. Page 41 of 115 o Initially, they will be on S and the first input symbol, respectively. o Then we use the first S-production to expand the tree. The tree pointer will be positioned on the left most symbol of the newly created sub-tree. As the symbol pointed by the tree pointer matches that of the symbol pointed by the input pointer, both pointers are moved to the right. Whenever the tree pointer points on a non-terminal, we expand it using the first production of the non-terminal. Whenever the pointers point on different terminals, the production that was used is not correct, thus another production should be used. We have to go back to the step just before we replaced the non-terminal and use another production. If we reach the end of the input and the tree pointer passes the last symbol of the tree, we have finished parsing. Example: G: S cAd A ab|a Draw the parse tree for the input string cad using the above method. Exercise: 1) Consider the following grammar: SA A A + A | B++ By Draw the parse tree for the input “ y+++y++” 2) Using the grammar below, construct a parse tree for the following string using RDP algorithm: ( ( id. id ) id ( id ) ( ( ) ) ) S→E E → id |(E.E) |(L) |() Page 42 of 115 L→LE |E Non-recursive predictive parsing It is possible to build a non-recursive parser by explicitly maintaining a stack. This method uses a parsing table that determines the next production to be applied. The input buffer contains the string to be parsed followed by $ (the right end marker) The stack contains a sequence of grammar symbols with $ at the bottom. Initially, the stack contains the start symbol of the grammar followed by $. The parsing table is a two dimensional array M[A, a] where A is a non-terminal of the grammar and a is a terminal or $. The parser program behaves as follows. The program always considers o X, the symbol on top of the stack and o a, the current input symbol. Predictive Parsing… There are three possibilities: 1. x = a = $ : the parser halts and announces a successful completion of parsing 2. x = a ≠ $ : the parser pops x off the stack and advances the input pointer to the next symbol 3. X is a non-terminal: the program consults entry M[X, a] which can be an X-production or an error entry. Page 43 of 115 If M[X, a] = {X uvw}, X on top of the stack will be replaced by uvw (u at the top of the stack). As an output, any code associated with the X-production can be executed. If M[X, a] = error, the parser calls the error recovery method. A Predictive Parser table A Predictive Parser: Example Non-recursive predictive parsing Example: G: E TR R +TR Input: 1+2 R -TR Rε T 0|1|…|9 Page 44 of 115 FIRST and FOLLOW The construction of both top-down and bottom-up parsers are aided by two functions, FIRST and FOLLOW, associated with a grammar G. During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol. During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens. We need to build a FIRST set and a FOLLOW set for each symbol in the grammar. The elements of FIRST and FOLLOW are terminal symbols. o FIRST() is the set of terminal symbols that can begin any string derived from . o FOLLOW() is the set of terminal symbols that can follow : t FOLLOW() derivation containing t Construction of a predictive parsing table Makes use of two functions: FIRST and FOLLOW. FIRST o FIRST(α) = set of terminals that begin the strings derived from α. o If α => ε in zero or more steps, ε is in FIRST(α). o FIRST(X) where X is a grammar symbol can be found using the following rules: 1- If X is a terminal, then FIRST(x) = {x} 2- If X is a non-terminal: two cases a) If X ε is a production, then add ε to FIRST(X) b) For each production X y1y2…yk, place a in FIRST(X) if for some i, a Є FIRST (yi) and ε Є FIRST (yj), for 1