Compiler Design: Syntax Analysis

Aksum University

Summary

This document is a chapter excerpt on syntax analysis, a crucial part of compiler design. Learn basic concepts like context-free grammar, derivation methods (leftmost & rightmost), and syntax analyzers. The chapter also illustrates these concepts using examples of palindrome language, parse trees and parsing techniques.


Chapter 3: Syntax Analysis

Syntax analysis, or parsing, is the second phase of a compiler. In this chapter, we shall learn the basic concepts used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence, due to the limitations of regular expressions: regular expressions cannot check balancing of tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata. CFG is a superset of regular grammar: every regular grammar is also context-free, but there exist problems beyond the scope of regular grammar. CFG is a helpful tool for describing the syntax of programming languages.

Syntax Analyzers

A syntax analyzer or parser takes the input from a lexical analyzer in the form of a token stream. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.

This way, the parser accomplishes two tasks: parsing the code to look for errors, and generating a parse tree as the output of the phase. Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies, which we will learn later in this chapter.

3.1. Context-Free Grammar

In this section, we will first see the definition of context-free grammar and introduce the terminology used in parsing technology. A context-free grammar has four components:

- A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
- A set of tokens, known as terminal symbols (Σ).
  Terminals are the basic symbols from which strings are formed.
- A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
- One of the non-terminals is designated as the start symbol (S), from where the production begins.

Strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production for that non-terminal.

Example

We take the palindrome language, which cannot be described by means of a regular expression. That is, L = {w | w = wR} is not a regular language. But it can be described by means of a CFG, as illustrated below:

G = (V, Σ, P, S)

where:

V = {Q, Z, N}
Σ = {0, 1}
P = {Q → Z | N | 0 | 1 | ε, Z → 0Q0, N → 1Q1}
S = {Q}

(The alternatives Q → 0 | 1 are included so that odd-length palindromes are also generated.) This grammar describes the palindrome language, containing strings such as 1001, 11100111, 00100, 1010101, 11111, etc.

Derivation

A derivation is basically a sequence of production rule applications used to obtain the input string. During parsing, we take two decisions for some sentential form of the input:

- Deciding which non-terminal is to be replaced.
- Deciding the production rule by which the non-terminal will be replaced.

At each point in this derivation process, the string is a sequence of terminal and non-terminal symbols. Such a string is called a sentential form if it occurs in some step of a valid derivation. Any sentential form can be derived from the start symbol in zero or more steps. Similarly, from any sentential form we can derive a valid sentence in zero or more steps. To decide which non-terminal to replace with a production rule, we have two options.
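Membership in the palindrome language above can be checked with a small recursive routine whose structure mirrors the productions Z → 0Q0 and N → 1Q1 (a sketch; the function name is our own, and it includes the base cases for odd-length palindromes):

```python
def in_palindrome_language(w: str) -> bool:
    """Check membership in L = {w | w = wR} over {0,1}, mirroring the
    grammar: Q -> Z | N | 0 | 1 | eps, Z -> 0Q0, N -> 1Q1."""
    if any(c not in "01" for c in w):
        return False
    if len(w) <= 1:            # Q -> eps | 0 | 1
        return True
    if w[0] == w[-1]:          # Z -> 0Q0  or  N -> 1Q1
        return in_palindrome_language(w[1:-1])
    return False

print(in_palindrome_language("1001"))   # True: a palindrome
print(in_palindrome_language("10"))     # False: not a palindrome
```

The recursion peels one matching symbol off each end, exactly as one application of Z → 0Q0 or N → 1Q1 would during a derivation.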
Left-most Derivation

If the sentential form of an input is scanned and replaced from left to right, it is called a left-most derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation

If we scan and replace the input with production rules from right to left, it is known as a right-most derivation. The sentential form derived from the right-most derivation is called the right-sentential form.

Example

Production rules:

E → E + E
E → E * E
E → id

Input string: id + id * id

The left-most derivation is:

E → E * E
  → E + E * E
  → id + E * E
  → id + id * E
  → id + id * id

Notice that the left-most non-terminal is always processed first.

The right-most derivation is:

E → E + E
  → E + E * E
  → E + E * id
  → E + id * id
  → id + id * id

Parse Tree

A parse tree is a graphical depiction of a derivation. It is convenient for seeing how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this with the example from the last topic: we take the left-most derivation of id + id * id:

E → E * E
  → E + E * E
  → id + E * E
  → id + id * E
  → id + id * id

(The original figures show the parse tree being constructed step by step, one derivation step per figure.)

In a parse tree:

- All leaf nodes are terminals.
- All interior nodes are non-terminals.
- In-order traversal gives the original input string.

A parse tree depicts the associativity and precedence of operators. The deepest sub-tree is traversed first, therefore the operator in that sub-tree gets precedence over the operator in the parent nodes.

Ambiguity

A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than one left-most or right-most derivation) for at least one string.
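A left-most derivation like the one above can be replayed mechanically: at each step, replace the left-most non-terminal with the body of the chosen production. A sketch (the helper name `leftmost_step` is our own):

```python
def leftmost_step(sentential, production):
    """Replace the leftmost occurrence of the production's non-terminal."""
    lhs, rhs = production           # e.g. ("E", ["E", "+", "E"])
    i = sentential.index(lhs)       # leftmost occurrence of the non-terminal
    return sentential[:i] + rhs + sentential[i + 1:]

# Derivation of id + id * id, choosing E -> E * E first so that * ends up
# deeper in the tree, as in the left-most derivation shown in the text.
form = ["E"]
for body in (["E", "*", "E"], ["E", "+", "E"], ["id"], ["id"], ["id"]):
    form = leftmost_step(form, ("E", body))
    print(" ".join(form))
# The last printed sentential form is: id + id * id
```

Each printed line corresponds to one sentential form of the derivation in the text.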
Example

E → E + E
E → E – E
E → id

For the string id + id – id, the above grammar generates two parse trees. (A language for which every grammar is ambiguous is said to be inherently ambiguous; a particular grammar being ambiguous does not by itself make its language inherently ambiguous.)

Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.

Associativity

If an operand has operators on both sides, the side on which an operator takes this operand is decided by the associativity of those operators. If the operation is left-associative, the operand will be taken by the left operator; if the operation is right-associative, the right operator will take the operand.

Example

Operations such as addition, multiplication, subtraction, and division are left-associative. If the expression contains

id op id op id

it will be evaluated as

(id op id) op id

For example, (id + id) + id. Operations like exponentiation are right-associative, i.e., the order of evaluation in the same expression will be

id op (id op id)

For example, id ^ (id ^ id).

Precedence

If two different operators share a common operand, the precedence of the operators decides which will take the operand. That is, 2 + 3 * 4 can have two different parse trees, one corresponding to (2 + 3) * 4 and another corresponding to 2 + (3 * 4). By setting precedence among operators, this problem can easily be removed. As in the previous example, * (multiplication) mathematically has precedence over + (addition), so the expression 2 + 3 * 4 will always be interpreted as 2 + (3 * 4).

These methods decrease the chances of ambiguity in a language or its grammar.

Left Recursion

A grammar is left-recursive if it has a non-terminal ‘A’ whose derivation contains ‘A’ itself as the left-most symbol.
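The effect of the precedence and associativity conventions above can be demonstrated with a tiny two-level recursive evaluator for + and * (a sketch; the function names and tokenization are our own):

```python
import re

def evaluate(expr: str) -> int:
    """Evaluate +, * over integers with * binding tighter than +,
    both left-associative, as the precedence rules above prescribe."""
    tokens = re.findall(r"\d+|[+*]", expr)

    def parse_term(pos):                      # term := number (* number)*
        value, pos = int(tokens[pos]), pos + 1
        while pos < len(tokens) and tokens[pos] == "*":
            value, pos = value * int(tokens[pos + 1]), pos + 2
        return value, pos

    def parse_expr(pos):                      # expr := term (+ term)*
        value, pos = parse_term(pos)
        while pos < len(tokens) and tokens[pos] == "+":
            rhs, pos = parse_term(pos + 1)
            value += rhs
        return value, pos

    return parse_expr(0)[0]

print(evaluate("2+3*4"))   # 14, i.e. 2+(3*4), not (2+3)*4 = 20
```

Because terms (products) are parsed below expressions (sums), the grammar itself encodes the precedence, so no ambiguity remains.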
A left-recursive grammar is considered a problematic situation for top-down parsers. Top-down parsers start parsing from the start symbol, which is itself a non-terminal. When the parser encounters the same non-terminal in its derivation, it cannot judge when to stop expanding the left non-terminal, and it goes into an infinite loop.

Example:

(1) A → Aα | β
(2) S → Aα | β
    A → Sd

(1) is an example of immediate left recursion, where A is a non-terminal symbol and α represents a string of terminals and non-terminals. (2) is an example of indirect left recursion: a top-down parser will first expand A, which in turn yields a string containing A itself, and the parser may loop forever.

Removal of Left Recursion

One way to remove left recursion is the following technique. The production

A → Aα | β

is converted into the productions

A → βA'
A' → αA' | ε

This does not change the strings derived from the grammar, but it removes the immediate left recursion.

The second method is the following algorithm, which eliminates all direct and indirect left recursion.

Algorithm

START
Arrange the non-terminals in some order A1, A2, A3, …, An
for each i from 1 to n {
  for each j from 1 to i-1 {
    replace each production of the form Ai → Ajγ with
      Ai → δ1γ | δ2γ | … | δnγ
    where Aj → δ1 | δ2 | … | δn are the current Aj productions
  }
  eliminate the immediate left recursion among the Ai productions
}
END

Example

The production set

S → Aα | β
A → Sd

after applying the above algorithm becomes

S → Aα | β
A → Aαd | βd

Then, removing the immediate left recursion with the first technique:

A → βdA'
A' → αdA' | ε

Now no production has either direct or indirect left recursion.

Left Factoring

If more than one production rule of a grammar has a common prefix string, a top-down parser cannot choose which of the productions it should take to parse the string in hand.
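Both grammar transformations, removing immediate left recursion and left factoring, can be sketched in a few lines. Productions are represented as lists of symbols; the function names, the `A'` naming convention, and the `"eps"` marker are our own:

```python
from os.path import commonprefix

def remove_immediate_left_recursion(nt, productions):
    """A -> A a | b   becomes   A -> b A',  A' -> a A' | eps."""
    new_nt = nt + "'"
    alphas = [p[1:] for p in productions if p[0] == nt]   # left-recursive alts
    betas = [p for p in productions if p[0] != nt]
    if not alphas:
        return {nt: productions}          # nothing to do
    return {nt: [b + [new_nt] for b in betas],
            new_nt: [a + [new_nt] for a in alphas] + [["eps"]]}

def left_factor(nt, productions):
    """A -> a b | a g   becomes   A -> a A',  A' -> b | g  (one step)."""
    prefix = commonprefix(productions)    # longest common prefix of the alts
    if not prefix:
        return {nt: productions}
    new_nt = nt + "'"
    return {nt: [list(prefix) + [new_nt]],
            new_nt: [p[len(prefix):] or [["eps"][0]] for p in productions]}

# E -> E + T | T   ==>   E -> T E',  E' -> + T E' | eps
print(remove_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
# A -> a b | a c   ==>   A -> a A',  A' -> b | c
print(left_factor("A", [["a", "b"], ["a", "c"]]))
```

`os.path.commonprefix` works on any list of sequences, so it conveniently computes the longest common prefix of symbol lists here.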
Example

If a top-down parser encounters a production like

A → αβ | αγ | …

then it cannot determine which production to follow to parse the string, as both productions start with the same symbol. To remove this confusion, we use a technique called left factoring.

Left factoring transforms the grammar to make it usable by top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.

Example

The above productions can be written as

A → αA'
A' → β | γ | …

Now the parser has only one production per prefix, which makes it easier to take decisions.

Limitations of Syntax Analyzers

Syntax analyzers receive their input, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. A syntax analyzer has the following drawbacks:

- it cannot determine whether a token is valid,
- it cannot determine whether a token is declared before it is used,
- it cannot determine whether a token is initialized before it is used,
- it cannot determine whether an operation performed on a token type is valid or not.

These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.

Types of Parsing

Syntax analyzers follow production rules defined by means of a context-free grammar. The way the production rules are implemented (derivation) divides parsing into two types: top-down parsing and bottom-up parsing.

Top-down Parsing

When the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input, it is called top-down parsing.

- Recursive descent parsing: a common form of top-down parsing. It is called recursive as it uses recursive procedures to process the input. Recursive descent parsing may suffer from backtracking.
- Backtracking: if one derivation of a production fails, the syntax analyzer restarts the process using different rules of the same production. This technique may process the input string more than once to determine the right production.

Bottom-up Parsing

As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the parse tree up to the start symbol.

Example:

Input string: a + b * c

Production rules:

S → E
E → E + T
E → E * T
E → T
T → id

Let us start bottom-up parsing:

a + b * c

Read the input and check if any production matches the input:

a + b * c
T + b * c
E + b * c
E + T * c
E * c
E * T
E
S

Top-Down Parsing

We have learnt that the top-down parsing technique parses the input and starts constructing a parse tree from the root node, gradually moving down to the leaf nodes. (The original figure here depicts the types of top-down parsers.)

Recursive Descent Parsing

Recursive descent is a top-down parsing technique that constructs the parse tree from the top, reading the input from left to right. It uses procedures for every terminal and non-terminal entity. This parsing technique recursively parses the input to make a parse tree, which may or may not require backtracking; the associated grammar (if not left-factored) cannot avoid backtracking. A form of recursive descent parsing that does not require any backtracking is known as predictive parsing. This parsing technique is regarded as recursive, as it uses a context-free grammar, which is recursive in nature.

Backtracking

Top-down parsers start from the root node (start symbol) and match the input string against the production rules to replace them (if matched).
To understand this, take the following example CFG:

S → rXd | rZd
X → oa | ea
Z → ai

For the input string read, a top-down parser will behave like this:

It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e. ‘r’. The production S → rXd matches it, so the top-down parser advances to the next input letter, ‘e’. The parser tries to expand non-terminal ‘X’ and checks its first production from the left (X → oa). It does not match the next input symbol, so the top-down parser backtracks to obtain the next production rule of X, (X → ea). Now the parser matches all the input letters in order, and the string is accepted.

Predictive Parser

A predictive parser is a recursive descent parser that has the capability to predict which production is to be used to replace the input string. The predictive parser does not suffer from backtracking. To accomplish its task, the predictive parser uses a look-ahead pointer, which points to the next input symbols. To make the parser backtracking-free, the predictive parser puts some constraints on the grammar and accepts only the class of grammars known as LL(k) grammars.

Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree. Both the stack and the input contain an end symbol $ to denote that the stack is empty and the input is consumed. The parser refers to the parsing table to take any decision on the input and stack-element combination.

- Input buffer: our string to be parsed. We will assume that its end is marked with a special symbol $.
- Output: a production rule representing a step of the derivation sequence (left-most derivation) of the string in the input buffer.
- Stack: contains the grammar symbols.
  - At the bottom of the stack, there is a special end-marker symbol $.
  - Initially the stack contains only the symbol $ and the start symbol S ($S is the initial stack).
  - When the stack is emptied (i.e. only $ is left on the stack), parsing is complete.
- Parsing table:
  - A two-dimensional array M[A, a].
  - Each row is a non-terminal symbol.
  - Each column is a terminal symbol or the special symbol $.
  - Each entry holds a production rule.

In recursive descent parsing, the parser may have more than one production to choose from for a single instance of input, whereas in a predictive parser each step has at most one production to choose. There might be instances where no production matches the input string, making the parsing procedure fail.

LL Parser

An LL parser accepts LL grammar. LL grammar is a subset of context-free grammar, with some restrictions giving a simplified version in order to achieve easy implementation. LL grammar can be implemented by means of two algorithms: recursive descent or table-driven.

An LL parser is denoted LL(k). The first L in LL(k) stands for parsing the input from left to right, the second L stands for left-most derivation, and k represents the number of lookaheads. Generally k = 1, so LL(k) is commonly written as LL(1).

LL Parsing Algorithm

We may stick to deterministic LL(1) for the parser explanation, as the size of the table grows exponentially with the value of k. Secondly, if a given grammar is not LL(1), then usually it is not LL(k) for any other given k either.

Implementation of a predictive parser:

1. Eliminate left recursion, left factoring, and ambiguity from the grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct the predictive parsing table.
4. Parse the given input string using the stack and the parsing table.

FIRST and FOLLOW

The construction of a predictive parser is aided by two functions associated with a grammar G.
These functions, FIRST and FOLLOW, allow us to fill in the entries of a predictive parsing table for G, whenever possible. The sets of tokens yielded by the FOLLOW function can also be used as synchronizing tokens during panic-mode error recovery.

FIRST(α) is the set of terminals that can begin strings derived from α.

FOLLOW(X) is the set of terminals that can immediately follow X. That is, t ∈ FOLLOW(X) if there is a derivation containing Xt. This can also occur if the derivation contains XYZt, where Y and Z both derive ε.

Rules for computing FIRST

To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set:

1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → aA is a production, then add a to FIRST(X).
4. If X is a non-terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1Y2…Yi-1 derives ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).

Rules for computing FOLLOW

To compute FOLLOW(X) for all non-terminals X, apply the following rules until nothing more can be added to any FOLLOW set:

1. If S is the start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αXβ, then everything in FIRST(β) except ε is placed in FOLLOW(X).
3. If there is a production A → αX, or a production A → αXβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(X).

Algorithm for construction of the predictive parsing table

Input: grammar G
Output: parsing table M
Method:

1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
4.
Make each undefined entry of M an error.

Algorithm for parsing a given input string using an LL(1) parser

The symbol at the top of the stack (say X) and the current symbol in the input string (say a) determine the parser action. There are four possible parser actions:

1. If X and a are both $, the parser halts (successful completion).
2. If X and a are the same terminal symbol (different from $), the parser pops X from the stack and moves to the next symbol in the input buffer.
3. If X is a non-terminal, the parser looks at the parsing table entry M[X, a]. If it holds a production rule X → Y1Y2…Yk, the parser pops X from the stack and pushes Yk, Yk-1, …, Y1 onto the stack. The parser also outputs the production rule X → Y1Y2…Yk to represent a step of the derivation.
4. None of the above: error.

All empty entries in the parsing table are errors. If X is a terminal symbol different from a, this is also an error case.

Example: consider the following grammar:

E → E + T | T
T → T * F | F
F → (E) | id

After eliminating left recursion, the grammar is:

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

FIRST and FOLLOW:

      FIRST        FOLLOW
E     { (, id }    { $, ) }
E'    { +, ε }     { $, ) }
T     { (, id }    { +, $, ) }
T'    { *, ε }     { +, $, ) }
F     { (, id }    { +, *, $, ) }

(The original document shows the resulting parsing table and the trace of parsing the input id+id*id$ here.)

LL(1) grammar: the parsing table entries are single entries, i.e. each location holds no more than one entry. This type of grammar is called an LL(1) grammar.

Consider the following grammar:

S → iEtS | iEtSeS | a
E → b

After left factoring, we have:

S → iEtSS' | a
S' → eS | ε
E → b

To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals:

FIRST(S) = { i, a }      FOLLOW(S) = { $, e }
FIRST(S') = { e, ε }     FOLLOW(S') = { $, e }
FIRST(E) = { b }         FOLLOW(E) = { t }

In the resulting parsing table, one entry contains more than one production, so the grammar is not an LL(1) grammar.
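The four parser actions above can be turned into a small table-driven driver. The table below is a transcription of the parsing table for the left-recursion-free expression grammar (the dictionary layout and function name are our own; an empty list stands for an ε-production):

```python
# M[A, a]: rows are non-terminals, columns are lookahead terminals.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    """Table-driven LL(1) driver implementing the four actions above."""
    stack = ["$", "E"]                      # $ at the bottom, start symbol on top
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == "$" and tokens[i] == "$":
            return True                     # action 1: halt successfully
        if top not in NONTERMS:
            if top != tokens[i]:
                return False                # mismatched terminal: error
            i += 1                          # action 2: match and advance
            continue
        rule = TABLE.get((top, tokens[i]))
        if rule is None:
            return False                    # empty table entry: error
        stack.extend(reversed(rule))        # action 3: push RHS reversed
    return False

print(ll1_parse(["id", "+", "id", "*", "id"]))   # True
```

Pushing the right-hand side in reverse keeps Y1 on top of the stack, so it is expanded or matched first, which is what produces a left-most derivation.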
Bottom-Up Parsing

Bottom-up parsing starts from the leaf nodes of a tree and works upward until it reaches the root node. Here, we start from a sentence and then apply production rules in reverse in order to reach the start symbol. (The original figure here depicts the bottom-up parsers available.)

Shift-Reduce Parsing

Shift-reduce parsing uses two unique steps for bottom-up parsing, known as the shift step and the reduce step.

- Shift step: the shift step refers to the advancement of the input pointer to the next input symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is treated as a single node of the parse tree.
- Reduce step: when the parser finds a complete grammar rule (RHS) on top of the stack and replaces it with the corresponding LHS, this is known as the reduce step. It occurs when the top of the stack contains a handle. To reduce, a pop is performed on the stack, removing the handle, and the LHS non-terminal symbol is pushed in its place.

Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).

Example: consider the grammar

S → aABe
A → Abc | b
B → d

The sentence to be recognized is abbcde.

Reductions                 Rightmost derivation
abbcde   (A → b)           S → aABe
aAbcde   (A → Abc)           → aAde
aAde     (B → d)             → aAbcde
aABe     (S → aABe)          → abbcde
S

The reductions trace out the right-most derivation in reverse.

Handles

A handle of a string is a substring that matches the right side of a production, and whose reduction to the non-terminal on the left side of the production represents one step along the reverse of a rightmost derivation.

Example: consider the grammar

E → E + E
E → E * E
E → (E)
E → id

and the input string id1 + id2 * id3:

E → E + E
  → E + E * E
  → E + E * id3
  → E + id2 * id3
  → id1 + id2 * id3

In the above derivation, the substrings reduced at each step (underlined in the original) are called handles.
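The reduction sequence for abbcde can be replayed on an explicit stack. Note that the decision of when to shift and when to reduce is hand-scripted here (a real parser derives it from a table or from precedence relations); the rule labels A1, A2, B1, S1 are our own:

```python
# Rules of the grammar S -> aABe, A -> Abc | b, B -> d, keyed by our labels.
GRAMMAR = {"A1": ("A", ["b"]), "A2": ("A", ["A", "b", "c"]),
           "B1": ("B", ["d"]), "S1": ("S", ["a", "A", "B", "e"])}

def run(tokens, script):
    """Replay a shift/reduce script, printing the stack after each step."""
    stack, i = [], 0
    for action in script:
        if action == "shift":
            stack.append(tokens[i])
            i += 1
        else:                                   # reduce by the named rule
            lhs, rhs = GRAMMAR[action]
            assert stack[-len(rhs):] == rhs     # the handle must be on top
            del stack[-len(rhs):]               # pop the handle...
            stack.append(lhs)                   # ...and push the LHS
        print(action, "->", "".join(stack))
    return stack

# abbcde: these reductions trace the rightmost derivation in reverse.
final = run(list("abbcde"),
            ["shift", "shift", "A1", "shift", "shift", "A2",
             "shift", "B1", "shift", "S1"])
print(final == ["S"])   # True: reduced all the way to the start symbol
```

Observe that after the first b is reduced (A1), the second b must NOT be reduced by A → b; it belongs to the handle Abc instead, which is exactly why shift-reduce parsers need a way to recognize handles.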
Handle Pruning: Stack Implementation of Shift-Reduce Parsing

There are four possible actions of a shift-reduce parser:

- Shift: the next input symbol is shifted onto the top of the stack.
- Reduce: replace the handle on the top of the stack by the corresponding non-terminal.
- Accept: successful completion of parsing.
- Error: the parser discovers a syntax error and calls an error-recovery routine.

The initial stack contains only the end-marker $. The end of the input string is also marked by the end-marker $.

Example: consider the following grammar:

E → E + E
E → E * E
E → (E)
E → id

and the input string id1 + id2 * id3$. (The original document traces the stack and input contents for this parse here.)

Conflicts in Shift-Reduce Parsing

Two kinds of conflicts occur in shift-reduce parsing:

1. Shift-reduce conflict: the parser cannot decide whether to shift or to reduce.
2. Reduce-reduce conflict: the parser cannot decide which of several reductions to make.

Types of Shift-Reduce Parsers

There are two main categories of shift-reduce parsers:

1. Operator-precedence parser: simple, but handles only a small class of grammars.
2. LR parsers: cover a wide range of grammars.
   a. SLR – Simple LR parser
   b. CLR – the most general LR parser (Canonical LR)
   c. LALR – intermediate LR parser (Look-Ahead LR)

SLR, CLR and LALR work the same way; only their parsing tables differ.

Operator-Precedence Parser

An operator grammar has the property that no production right side is empty or has two adjacent non-terminals. This property enables the implementation of efficient operator-precedence parsers.

Examples:

E → AB            E → EOE           E → E+E
A → a             E → id            E → E*E
B → b             O → + | * | /     E → E/E | id

Not an operator   Not an operator   Operator
grammar           grammar           grammar

Precedence Relations

These parsers rely on three precedence relations between terminals: ⋖ (yields precedence), ≐ (has the same precedence), and ⋗ (takes precedence). The determination of correct precedence relations between terminals is based on the traditional notions of associativity and precedence of operators. (Unary minus causes a problem.)
The intention of the precedence relations is to find the handle of a right-sentential form, with ⋖ marking the left end of the handle and ⋗ marking the right end.

Precedence Relations from Associativity and Precedence

1. If operator θ1 has higher precedence than operator θ2, make θ1 ⋗ θ2 and θ2 ⋖ θ1. For example, * has higher precedence than +, so * ⋗ + and + ⋖ *.
2. If operators θ1 and θ2 have equal precedence, make θ1 ⋗ θ2 and θ2 ⋗ θ1 if the operators are left-associative, or make θ1 ⋖ θ2 and θ2 ⋖ θ1 if they are right-associative. For example, + and - are left-associative, so + ⋗ +, + ⋗ -, - ⋗ -, - ⋗ +.
3. For all operators θ: θ ⋖ id, id ⋗ θ, θ ⋖ (, ( ⋖ θ, ) ⋗ θ, θ ⋗ ), θ ⋗ $, and $ ⋖ θ.

To find a handle, scan the string backwards from right to left until a ⋗ is seen, then scan forwards over any ≐ until a ⋖ is seen; the handle lies between the ⋖ and the ⋗. (The original document traces the parse of id + id * id this way: E → id is applied three times, then E → E * E, then E → E + E, and the parse is accepted.)

Advantages of Operator-Precedence Parsing

1. It is easy to implement.
2. Once the operator precedence relations are made between all pairs of terminals of a grammar, the grammar itself can be ignored; it is not referred to anymore during implementation.

Disadvantages of Operator-Precedence Parsing

1. It is hard to handle tokens like the minus sign (-), which has two different precedences (unary and binary).
2. Only a small class of grammars can be parsed using an operator-precedence parser.

LR Parser

The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of context-free grammars, which makes it the most efficient syntax-analysis technique. LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a right-most derivation in reverse, and k denotes the number of lookahead symbols used to make decisions.

Advantages of LR parsing:

- It recognizes virtually all programming-language constructs for which a CFG can be written.
- It is an efficient, non-backtracking shift-reduce parsing method.
- The class of grammars that can be parsed with LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.

Drawbacks of LR parsing:

- It is too much work to construct an LR parser by hand for a programming-language grammar. A specialized tool, called an LR parser generator, is needed. Examples: YACC/Bison.
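Before moving on to LR parsing in detail, the precedence relations above can be exercised in code. This is a deliberately simplified driver for E → E+E | E*E | id: it tracks only terminals and discards the non-terminal placeholders, as operator-precedence parsers effectively do (the table layout and function name are our own):

```python
# Relations built from rules 1-3 above: "<" is yields-precedence,
# ">" is takes-precedence ("=" does not occur for this grammar).
REL = {
    ("+", "+"): ">", ("+", "*"): "<", ("+", "id"): "<", ("+", "$"): ">",
    ("*", "+"): ">", ("*", "*"): ">", ("*", "id"): "<", ("*", "$"): ">",
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("$", "+"): "<", ("$", "*"): "<", ("$", "id"): "<",
}

def op_precedence_parse(tokens):
    """Shift while the top-of-stack terminal yields precedence to the
    lookahead; reduce (pop the handle's terminal) when it takes precedence."""
    stack = ["$"]
    tokens = tokens + ["$"]
    i = 0
    while True:
        top, nxt = stack[-1], tokens[i]
        if top == "$" and nxt == "$":
            return True                    # accept
        rel = REL.get((top, nxt))
        if rel == "<":
            stack.append(nxt)              # shift
            i += 1
        elif rel == ">":
            stack.pop()                    # reduce: id by E -> id,
        else:                              # an operator by E -> E op E
            return False                   # no relation: error

print(op_precedence_parse(["id", "+", "id", "*", "id"]))   # True
```

Because * ⋗ $ but $ ⋖ +, the * handle is reduced before the + handle, which is exactly how the relations encode that * has higher precedence.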
There are three widely used algorithms for constructing an LR parser:

- SLR(1) – Simple LR parser:
  - works on the smallest class of grammars
  - few states, hence a very small table
  - simple and fast construction
- LR(1) – canonical LR parser:
  - works on the complete set of LR(1) grammars
  - generates a large table and a large number of states
  - slow construction
- LALR(1) – Look-Ahead LR parser:
  - works on an intermediate size of grammar
  - the number of states is the same as in SLR(1)

LR parsing configuration

An LR parser consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action and goto).

- The driver program is the same for all LR parsers.
- The parsing program reads characters from an input buffer one at a time.
- The program uses a stack to store a string of the form s0X1s1X2s2…Xmsm, where sm is on top. Each Xi is a grammar symbol and each si is a state.
- The parsing table consists of two parts: the action and goto functions.

Items and the LR(0) Automaton

How does a shift-reduce parser know when to shift and when to reduce? For example, with stack contents $T and next input symbol *, how does the parser know that T on top of the stack is not a handle, so that the appropriate action is to shift and not to reduce by E → T?

An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse. States represent sets of "items". An LR(0) item (item for short) of a grammar G is a production of G with a dot at some position of the body. Thus, the production A → XYZ yields the four items:

A → •XYZ  (the parser is ready to scan XYZ)
A → X•YZ  (the parser has scanned X and is ready to scan YZ)
A → XY•Z  (the parser has scanned X and Y and is ready to scan Z)
A → XYZ•  (the parser is ready to detect the handle XYZ)

One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the basis for constructing a deterministic finite automaton that is used to make parsing decisions.
Such an automaton is called an LR(0) automaton. In particular, each state of the LR(0) automaton represents a set of items in the canonical LR(0) collection.

To construct the canonical LR(0) collection for a grammar, we define an augmented grammar and two functions, CLOSURE and GOTO. If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and the production S' → S. The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input. That is, acceptance occurs when and only when the parser is about to reduce by S' → S.

Closure of Item Sets

If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the following rules:

1. Initially, add every item in I to CLOSURE(I).
2. For any item A → α•Bβ in CLOSURE(I), where the symbol B after the dot is a non-terminal, add the production rules of B as items with the dot before the first symbol.
3. Repeat (2) for any new item added under (2), until no more items can be added.

Example:

G: S → E + S | E
   E → num

CLOSURE({S' → •S}) = { S' → •S, S → •E + S, S → •E, E → •num }

The Function GOTO

The second useful function is GOTO(I, X), where I is a set of items and X is a grammar symbol. Intuitively, the GOTO function is used to define the transitions in the LR(0) automaton for a grammar. The states of the automaton correspond to sets of items, and GOTO(I, X) specifies the transition from the state for I under input X.

LR(0) parsing: Building the DFA

a. Augment the grammar with a new start production (state I0): S' → S $
b. Build the node for state I0:
   i. Create an item from the rule introduced in (a), with the dot before the first symbol.
   ii. Build the closure of S' → •S $.
c. For each unique symbol that follows a dot in the closure of I0:
   i. Create a new state representing the recognition of the symbol.
   ii. Link the original state to the new state with an arc labeled with the recognized symbol.
   iii.
Add a DFA node for this state.
   iv. The closure of the new state is calculated as follows:
      a) It contains only those items from the previous state where the recognized symbol was the symbol after the dot.
      b) For each of these items, move the dot AFTER the recognized symbol.
      c) Where the new next symbol is a non-terminal, add the closure of these items to the closure.
   v. Where the closure includes an item with the dot at the end, place a double circle around the state (some actions from this state cause a REDUCE).
d. Repeat (c) for each of the new states, except: where a new state has the same closure as an existing one, link to that state instead.

LR(0) Parser: Parsing an Input String

Initially we have:

1. The input vector: a vector of the input tokens, terminated by the token $.
2. The next-token pointer: a pointer to the next input token to be processed, initially pointing at the first token.
3. The stack: initially we place I0 on it.

Recursively:

1. Apply action: apply the action given in the action column for the current state (the topmost state on the stack).
   a. If shift, move the next token and the state S onto the stack, then move the pointer to the next token.
   b. If reduce, look up the rule given in the action, and remove n*2 items from the stack (where n is the number of symbols on the RHS of the rule). Then place the LHS of the production on the stack.
   c. If accept, then parsing is finished; we have a complete analysis.
2. Goto state: the top of the stack now contains a state and a symbol (terminal or non-terminal). In the goto table, look up the row for the state and the column for the symbol.
   a. If there is a number there, put this number on the top of the stack (it is the new state number).
   b. If there is no goto entry, then return ERROR (the input cannot be parsed by the grammar).

Structure of the parse table:

1. Each table row is indexed by a state; each state corresponds to a top-of-stack configuration.
2.
Each column is indexed by a symbol; terminals and non-terminals are both valid.
3. Entries are exactly one of: Shift, Reduce, or Accept (in the action part), or a goto state number (in the goto part).

(The original document includes an empty action/goto table template with terminal columns id, *, +, -, $, non-terminal columns E, T, A, and state rows I0 through I3.)

Example: consider the following grammar G:

0. S' → S
1. S → (L)
2. S → x
3. L → S
4. L → L,S

a. Draw the LR(0) DFA states.
b. Draw the parse table.
c. Parse the input (x,x).

Solution:

a) The LR(0) DFA states and b) the LR(0) parse table are shown in the original figures.

c) Parsing the input (x,x):

Stack          Input      Action    Output
1              (x,x)$     S3
1(3            x,x)$      S2
1(3x2          ,x)$       R2        S → x
1(3S7          ,x)$       R3        L → S
1(3L5          ,x)$       S8
1(3L5,8        x)$        S2
1(3L5,8x2      )$         R2        S → x
1(3L5,8S9      )$         R4        L → L,S
1(3L5          )$         S6
1(3L5)6        $          R1        S → (L)
1S4            $          Accept

LL vs LR

LL                                                  LR
Does a leftmost derivation.                         Does a rightmost derivation in reverse.
Starts with the root non-terminal on the stack.     Ends with the root non-terminal on the stack.
Ends when the stack is empty.                       Starts with an empty stack.
Uses the stack for designating what is still        Uses the stack for designating what is
to be expected.                                     already seen.
Builds the parse tree top-down.                     Builds the parse tree bottom-up.
Continuously pops a non-terminal off the stack,     Tries to recognize a right-hand side on the
and pushes the corresponding right-hand side.       stack, pops it, and pushes the corresponding
                                                    non-terminal.
Expands the non-terminals.                          Reduces the non-terminals.
Reads the terminals when it pops one off            Reads the terminals while it pushes them
the stack.                                          onto the stack.
Pre-order traversal of the parse tree.              Post-order traversal of the parse tree.
