Compiler Design
===============
Summary
-------

This document is a textbook on compiler design. It covers the fundamental concepts of language processing, compiler structure, and various types of grammars, and describes the phases of a compiler, including lexical analysis, syntax analysis, and code generation.
**Introduction:** Language processing, structure of a compiler, the evolution of programming languages, the science of building a compiler, applications of compiler technology, programming language basics. **Types of Grammar.** **Phases of a Compiler.** **Lexical Analysis:** the role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, the lexical-analyzer generator Lex. **Syntax Analysis:** the role of the **parser**, context-free grammars, writing a grammar, top-down parsing, bottom-up parsing, introduction to LR parsers; more powerful LR parsers (LR(1), LALR); using ambiguous grammars; error recovery in LR parsing. **Syntax-Directed Translation:** definitions, evaluation order of SDDs, applications of syntax-directed translation, syntax-directed translation schemes. **Intermediate Code Generation:** variants of syntax trees, three-address code, types and declarations, translation of expressions, type checking, control flow, backpatching. **Runtime Environments:** stack allocation of space, access to non-local data on the stack, heap management. **Code Generation:** issues in the design of a code generator, the target language, addresses in the target code, basic blocks and flow graphs, a simple code generator. **Code Optimization:** the principal sources of optimization, peephole optimization, introduction to data-flow analysis.

Preprocessor
============

1. *Macro processing:* A preprocessor may allow a user to define macros that are shorthand for longer constructs.

2. *File inclusion:* A preprocessor may include header files into the program text.

COMPILER
--------

**Types of Grammar**

In the literary sense of the term, grammars denote syntactic rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages like English, Sanskrit, and Mandarin. The theory of formal languages finds its applicability extensively in the fields of computer science. **Noam Chomsky** gave a mathematical model of grammar in 1956 which is effective for writing computer languages.

**Grammar**

A grammar **G** can be formally written as a **4-tuple** (N, T, S, P) where −

- **N** or **V~N~** is a set of variables or non-terminal symbols.
- **T** or **∑** is a set of terminal symbols.
- **S** is a special variable called the start symbol, S ∈ N.
- **P** is a set of production rules for terminals and non-terminals. A production rule has the form α → β, where α and β are strings over V~N~ ∪ ∑ and at least one symbol of α belongs to V~N~.

Example: Grammar G1 − ({S, A, B}, {a, b}, S, {S → AB, A → a, B → b})

Here,

- **S, A,** and **B** are non-terminal symbols;
- **a** and **b** are terminal symbols;
- **S** is the start symbol, S ∈ N;
- Productions, **P : S → AB, A → a, B → b**.

Example: Grammar G2 − ({S, A}, {a, b}, S, {S → aAb, aA → aaAb, A → ε})

Here,

- **S** and **A** are non-terminal symbols;
- **a** and **b** are terminal symbols;
- **ε** is the empty string;
- **S** is the start symbol, S ∈ N;
- Productions **P : S → aAb, aA → aaAb, A → ε**.
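To make the 4-tuple definition concrete, here is a minimal C sketch that encodes grammar G1 as data. The type and field names (`Production`, `Grammar`, and so on) are illustrative choices, not from the text:

```c
#include <stdio.h>

/* A production rule alpha -> beta, both sides stored as strings.
   Non-terminals are upper-case letters, terminals lower-case. */
typedef struct { const char *lhs, *rhs; } Production;

/* A grammar as the 4-tuple (N, T, S, P). */
typedef struct {
    const char *nonterminals;  /* N */
    const char *terminals;     /* T */
    char start;                /* S, with S in N */
    const Production *rules;   /* P */
    int nrules;
} Grammar;

int main(void) {
    /* G1 = ({S, A, B}, {a, b}, S, {S -> AB, A -> a, B -> b}) */
    static const Production p[] = { {"S", "AB"}, {"A", "a"}, {"B", "b"} };
    Grammar g1 = { "SAB", "ab", 'S', p, 3 };

    printf("Start symbol: %c\n", g1.start);
    for (int i = 0; i < g1.nrules; i++)
        printf("  %s -> %s\n", g1.rules[i].lhs, g1.rules[i].rhs);
    return 0;
}
```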
**Derivations from a Grammar**

Strings may be derived from other strings using the productions in a grammar. If a grammar **G** has a production **α → β**, we say that **x α y** derives **x β y** in **G**. This derivation is written as ***x α y ⇒~G~ x β y***.

Example: consider the grammar G2 = ({S, A}, {a, b}, S, {S → aAb, aA → aaAb, A → ε}). One of the strings that can be derived is:

S ⇒ aAb (using production S → aAb)
⇒ aaAbb (using production aA → aaAb)
⇒ aaaAbbb (using production aA → aaAb)
⇒ aaabbb (using production A → ε)

According to **Noam Chomsky**, there are four types of grammars − Type 0, Type 1, Type 2, and Type 3. The following table shows how they differ from each other:

| Grammar Type | Grammar Accepted          | Language Accepted               | Automaton                |
|--------------|---------------------------|---------------------------------|--------------------------|
| Type 0       | Unrestricted grammar      | Recursively enumerable language | Turing machine           |
| Type 1       | Context-sensitive grammar | Context-sensitive language      | Linear-bounded automaton |
| Type 2       | Context-free grammar      | Context-free language           | Pushdown automaton       |
| Type 3       | Regular grammar           | Regular language                | Finite automaton         |

**Type-3 Grammar**

**Type-3 grammars** generate **regular languages**. Type-3 grammars must have a single non-terminal on the left-hand side and a right-hand side consisting of a single terminal or a single terminal followed by a single non-terminal.

The productions must be in the form **X → a** or **X → aY**, where **X, Y ∈ N** (non-terminals) and **a ∈ T** (terminal). The rule **S → ε** is allowed if **S** does not appear on the right side of any rule.

**Example**

X → ε
X → a | aY
Y → b

**Type-2 Grammar**

**Type-2 grammars** generate **context-free languages**. The productions must be in the form **A → γ**, where **A ∈ N** (non-terminal) and **γ ∈ (T ∪ N)\*** (a string of terminals and non-terminals). The languages generated by these grammars are recognized by non-deterministic pushdown automata.

**Example**

S → Xa
X → a
X → aX
X → abc
X → ε

**Type-1 Grammar**

**Type-1 grammars** generate **context-sensitive languages**. The productions must be in the form **α A β → α γ β**, where **A ∈ N** (non-terminal) and **α, β, γ ∈ (T ∪ N)\*** (strings of terminals and non-terminals). The strings **α** and **β** may be empty, but **γ** must be non-empty. The rule **S → ε** is allowed if S does not appear on the right side of any rule. The languages generated by these grammars are recognized by linear bounded automata.

**Example**

AB → AbBc
A → bcA
B → b

**Type-0 Grammar**

**Type-0 grammars** generate **recursively enumerable languages**. The productions have no restrictions; they comprise all phrase-structure grammars, i.e. all formal grammars. They generate the languages that are recognized by a Turing machine. The productions can be in the form **α → β**, where **α** is a string of terminals and non-terminals with at least one non-terminal (**α** cannot be null) and **β** is a string of terminals and non-terminals.

**Example**

S → ACaB
Bc → acB
CB → DB
aD → Db

STRUCTURE OF THE COMPILER DESIGN
--------------------------------

Compilation has two parts:

a. Analysis (machine-independent / language-dependent)
b. Synthesis (machine-dependent / language-independent)

The compilation process is partitioned into a number of sub-processes called **phases**: lexical analysis, syntax analysis, intermediate code generation, code optimization, and code generation, supported throughout by symbol-table management and error handling.

LEXICAL ANALYSIS
================

OVERVIEW OF LEXICAL ANALYSIS
----------------------------

- To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
- Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input stream. This is done by token recognizers, which are designed using transition diagrams and finite automata (a minimal recognizer is sketched below).
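As a sketch of such a transition-diagram-based recognizer, the following self-contained C program implements a two-state finite automaton for identifiers of the form letter (letter | digit)\*. The state names and the helper `is_identifier` are illustrative assumptions, not the text's implementation:

```c
#include <ctype.h>
#include <stdio.h>

/* States of a tiny transition diagram for identifiers:
   START --letter--> IN_ID --letter/digit--> IN_ID; anything else rejects. */
enum State { START, IN_ID, REJECT };

/* Returns 1 iff s matches: letter (letter | digit)* */
int is_identifier(const char *s) {
    enum State state = START;
    for (; *s; s++) {
        switch (state) {
        case START:
            state = isalpha((unsigned char)*s) ? IN_ID : REJECT;
            break;
        case IN_ID:
            state = isalnum((unsigned char)*s) ? IN_ID : REJECT;
            break;
        default:
            break; /* REJECT is a trap state */
        }
        if (state == REJECT) return 0;
    }
    return state == IN_ID; /* accept only if we saw at least one letter */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("count1"),  /* 1 */
                         is_identifier("1count"),  /* 0 */
                         is_identifier(""));       /* 0 */
    return 0;
}
```

A generated lexical analyzer (e.g. one produced by Lex) is essentially a table-driven generalization of this hand-coded automaton.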
ROLE OF LEXICAL ANALYZER
------------------------

The lexical analyzer reads the source program character by character, groups the characters into lexemes, and produces as output a sequence of tokens that it passes on to the parser.

TOKEN, LEXEME, PATTERN
----------------------

- A **token** is a pair consisting of a token name and an optional attribute value; the token name is an abstract symbol representing a kind of lexical unit.
- A **lexeme** is the actual sequence of characters in the source program that matches the pattern for a token.
- A **pattern** is a description of the form that the lexemes of a token may take.

For example:

| Token  | Pattern (informal description)        | Sample lexemes |
|--------|---------------------------------------|----------------|
| if     | the characters i, f                   | if             |
| id     | letter followed by letters and digits | pi, score      |
| number | any numeric constant                  | 3.14, 0        |

Lex specifications:
===================

1. The *declarations* section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. `#define PIE 3.14`), and regular definitions.

2. The *translation rules* of a Lex program are statements of the form *p~i~ { action~i~ }*, where each *p~i~* is a regular expression (pattern) and each *action~i~* is a program fragment describing what the lexical analyzer should do when *p~i~* matches a lexeme.

3. The third section holds whatever *auxiliary procedures* are needed by the *actions*. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.

SYNTAX ANALYSIS
===============

PARSING TECHNIQUES
------------------

ROLE OF THE PARSER
------------------

The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. The parser should report any syntax errors in an intelligible fashion. The two types of parsers employed are top-down parsers and bottom-up parsers.

TOP-DOWN PARSING
----------------

- Starts with the start symbol **S**
- Goes down to the tree leaves using productions

BOTTOM-UP PARSING
-----------------

- Starts from the tree leaves
- Proceeds upward to the root, which is the start symbol **S**

Context-Free Grammar
====================

- **N** is a set of non-terminal symbols.
- **T** is a set of terminals, where **N ∩ T = ∅**.
- **P** is a set of rules, **P: N → (N ∪ T)\***, i.e., the left-hand side of a production rule **P** does not have any right or left context.
- **S** is the start symbol.

Example
-------

- The grammar ({A}, {a, b, c}, P, A), P : A → aA, A → abc.
- The grammar ({S}, {a, b}, P, S), P : S → aSa, S → bSb, S → ε.
- The grammar ({S, F}, {0, 1}, P, S), P : S → 00S | 11F, F → 00F | ε.

Top-down Parsing
----------------

- **Recursive descent parsing:** a common form of top-down parsing, so called because it uses recursive procedures to process the input (a sketch for the third example grammar above follows this list). Recursive descent parsing may suffer from backtracking.
- **Backtracking:** if one derivation of a production fails, the syntax analyzer restarts the process using different rules of the same production. This technique may process the input string more than once to determine the right production.
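Here is a minimal recursive-descent sketch in C for the third example grammar above, S → 00S | 11F, F → 00F | ε. The names `S`, `F`, `match`, and `parse` are illustrative. Because each alternative begins with a distinct terminal, one symbol of lookahead selects the production, so this particular grammar needs no backtracking:

```c
#include <stdio.h>

/* Recursive-descent parser for:  S -> 00S | 11F    F -> 00F | epsilon */
static const char *input;

static int F(void);

/* Consume c if it is the next input symbol; report success. */
static int match(char c) { return *input == c && (input++, 1); }

/* S -> 00S | 11F : one token of lookahead picks the production. */
static int S(void) {
    if (*input == '0') return match('0') && match('0') && S();
    return match('1') && match('1') && F();
}

/* F -> 00F | epsilon */
static int F(void) {
    if (*input == '0') return match('0') && match('0') && F();
    return 1; /* epsilon */
}

static int parse(const char *s) {
    input = s;
    return S() && *input == '\0'; /* succeed only if all input is consumed */
}

int main(void) {
    /* expected output: 1 1 0 */
    printf("%d %d %d\n", parse("001100"), parse("0011"), parse("0100"));
    return 0;
}
```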
Bottom-up Parsing
-----------------

A bottom-up parser starts from the leaves (the input string) and works upward toward the root (the start symbol **S**) by repeatedly reducing the right-hand side of a production to its left-hand side.

LR PARSER
=========

LR PARSING INTRODUCTION
-----------------------

WHY LR PARSING
--------------

- LR parsers can be constructed to recognize virtually all programming-language constructs for which context-free grammars can be written.
- The LR parsing method is the most general non-backtracking shift-reduce parsing method known, yet it can be implemented as efficiently as other shift-reduce methods.
- The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.
- An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.

MODELS OF LR PARSERS
--------------------

An LR parser consults a parsing table; for the current state and input symbol, the table entry specifies one of four actions:

- shift s, where s is a state
- reduce by a grammar production A → β
- accept
- error

ALGORITHM FOR EASY CONSTRUCTION OF AN LALR TABLE
------------------------------------------------

1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items for G'.

2. For each core present among the sets of LR(1) items, find all sets having that core and replace these sets by their union.

3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The parsing actions for state i are constructed from Ji in the same manner as in the construction of the canonical LR parsing table. If there is a conflict, the grammar is not LALR(1) and the algorithm fails.

4. The goto table is constructed as follows: if J is the union of one or more sets of LR(1) items, that is, J = I0 ∪ I1 ∪ ... ∪ Ik, then the cores of goto(I0, X), goto(I1, X), ..., goto(Ik, X) are the same, since I0, I1, ..., Ik all have the same core. Let K be the union of all sets of items having the same core as goto(I0, X); then goto(J, X) = K.

HANDLING ERRORS
---------------

A parser should detect each error, report it accurately, and recover quickly enough to find subsequent errors. Two common recovery strategies are panic-mode and phrase-level recovery, described below.

DANGLING ELSE
-------------

In a statement such as **if E1 then if E2 then S1 else S2**, the **else** can be associated with either **if**. The usual resolution is to match each **else** with the closest previous unmatched **then**.

PANIC-MODE ERROR RECOVERY
-------------------------

On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens (such as a semicolon or closing brace) is found, and then resumes parsing.

PHRASE-LEVEL RECOVERY
---------------------

On discovering an error, the parser performs a local correction on the remaining input, for example replacing a comma by a semicolon, deleting an extraneous semicolon, or inserting a missing one, and then continues.

SEMANTIC ANALYSIS
-----------------

- Semantic analysis computes additional information related to the meaning of the program once the syntactic structure is known.
- In typed languages such as C, semantic analysis involves adding information to the symbol table and performing type checking.
- The information to be computed is beyond the capabilities of standard parsing techniques; therefore it is not regarded as syntax.
- As for lexical and syntax analysis, for semantic analysis we need both a representation formalism and an implementation mechanism.
- As representation formalism, this lecture illustrates what are called syntax-directed translations.

SYNTAX DIRECTED TRANSLATION
---------------------------

- The principle of syntax-directed translation states that the meaning of an input sentence is related to its syntactic structure, i.e., to its parse tree.
- By syntax-directed translations we indicate those formalisms for specifying translations for programming-language constructs guided by context-free grammars.
- We associate **attributes** to the grammar symbols representing the language constructs.
- Values for attributes are computed by **semantic rules** associated with grammar productions.
- Evaluation of semantic rules may:
  - generate code;
  - insert information into the symbol table;
  - perform semantic checks;
  - issue error messages;
  - etc.

There are two notations for attaching semantic rules:

1. **Syntax Directed Definitions.** High-level specifications hiding many implementation details (also called **attribute grammars**).

2. **Translation Schemes.** More implementation-oriented: they indicate the order in which semantic rules are to be evaluated.

Syntax Directed Definitions
---------------------------

- **Syntax Directed Definitions** are a generalization of context-free grammars in which:

  1. grammar symbols have an associated set of **attributes**;
  2. productions are associated with **semantic rules** for computing the values of attributes.

- Such a formalism generates **annotated parse trees**, where each node of the tree is a record with a field for each attribute (e.g., X.a indicates the attribute a of the grammar symbol X).
- The value of an attribute of a grammar symbol at a given parse-tree node is defined by a semantic rule associated with the production used at that node.

Attributes fall into two classes:

1. **Synthesized attributes,** computed from the values of the attributes of the children nodes.

2. **Inherited attributes,** computed from the values of the attributes of the parent and sibling nodes.

Syntax Directed Definitions: An Example
---------------------------------------

- **Example.** Let us consider the grammar for arithmetic expressions. The syntax-directed definition associates to each non-terminal a synthesized attribute called *val*.
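A typical version of this definition (the standard Dragon-Book arithmetic-expression SDD, reproduced here since the original table did not survive) reads:

| Production    | Semantic Rule                 |
|---------------|-------------------------------|
| L → E n       | print(E.val)                  |
| E → E~1~ + T  | E.val := E~1~.val + T.val     |
| E → T         | E.val := T.val                |
| T → T~1~ \* F | T.val := T~1~.val × F.val     |
| T → F         | T.val := F.val                |
| F → ( E )     | F.val := E.val                |
| F → **digit** | F.val := **digit**.lexval     |

Evaluating these rules at each parse-tree node, children before parents, yields the value of the whole expression at the root.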
S-ATTRIBUTED DEFINITIONS
------------------------

- A syntax-directed definition that uses synthesized attributes exclusively is said to be an **S-attributed definition**.
- **Evaluation order.** Semantic rules in an S-attributed definition can be evaluated by a bottom-up, or post-order, traversal of the parse tree.
- **Example.** The above arithmetic grammar is an example of an S-attributed definition; its annotated parse trees are obtained by computing *val* at each node bottom-up.

INTERMEDIATE CODE GENERATION
----------------------------

Logical Structure of a Compiler Front End
-----------------------------------------

Static Checking
---------------

Static checks performed by the front end include:

- **Control-flow checks** — e.g. a break statement must occur within a loop construct.
- **Uniqueness checks** — e.g. labels in case statements must be distinct.
- **Name-related checks.**

Intermediate Representations
----------------------------

Using an intermediate representation gives:

- a clear distinction between the machine-independent and machine-dependent parts of the compiler;
- easier retargeting: the implementation of language processors for new machines is facilitated;
- the opportunity to apply machine-independent code-optimization techniques.

Intermediate representations span the gap between the source and target languages.

- ***High-level representations***
  - closer to the source language
  - easy to generate from an input program
  - code optimizations may not be straightforward
- ***Low-level representations***
  - closer to the target machine
  - suitable for register allocation and instruction selection
  - easier for optimizations and final code generation
- Some representations are specific to the language being implemented: P-code for Pascal, bytecode for Java.

LANGUAGE INDEPENDENT 3-ADDRESS CODE
-----------------------------------

Addresses and Instructions
--------------------------

- Three-address code (TAC) consists of a sequence of instructions, each of which may have up to three addresses, prototypically t1 = t2 op t3.
- An address may be one of:
  - A name. Each name is a symbol-table index; for convenience, we write names as the identifiers themselves.
  - A constant.
  - A compiler-generated temporary. Each time a temporary address is needed, the compiler generates another name from the stream t1, t2, t3, etc.
- Temporary names allow code optimization to move instructions easily. At target-code generation time, these names will be allocated to registers or to memory.
- TAC instructions:
  - binary arithmetic and logical operations;
  - unary arithmetic (-) and logical (!) operations, and type conversions;
  - copy instructions: x = y;
  - unconditional jump: goto L, where L is a symbolic label of an instruction;
  - function calls: y = p(x1, ..., xn) is written y = call p, n, with return y;
  - indexed copy instructions: x = y\[i\] and x\[i\] = y. The first sets x to the value in the location i memory units beyond y; the second sets the contents of the location i memory units beyond x to y;
  - address and pointer instructions:
    - x = &y sets the value of x to the location (address) of y;
    - x = \*y, where y is presumably a pointer or a temporary whose value is a location; the value of x is set to the contents of that location;
    - \*x = y sets the value of the object pointed to by x to the value of y.

Types of three address code
---------------------------

- **Assignment statement:** x := y op z
- **Unary operation:** x := op y
- **Copy statement:** x := y
- **Unconditional jump:** goto L
- **Conditional jump:** create a label L and generate code for the expression exp; if exp evaluates to true, control goes to the statement labelled L, otherwise to the statement immediately following the if statement.
- **Function call:** param x1; ...; param xn; call p, n; return y
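As a worked example in the notation above (the classic one from the Dragon Book), the assignment a = b \* -c + b \* -c translates into the following three-address code, with temporaries t1 through t5 generated by the compiler:

```
t1 = minus c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a  = t5
```

The quadruple and triple representations described next are alternative ways of storing exactly such instruction sequences.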
QUADRUPLES
----------

A quadruple is a record structure with four fields: op, arg1, arg2, and result.

TRIPLES
=======

A triple has only three fields: op, arg1, and arg2; the result of an operation is referred to by its position in the triple list rather than by an explicit result field.

Indirect Triples
----------------

Indirect triples consist of a listing of pointers to triples, rather than a listing of the triples themselves; this makes it easier for an optimizing compiler to reorder instructions.

Backpatching
------------

- The problem in generating three-address code in a single pass is that we may not know the labels that control must go to at the time jump statements are generated.
- To get around this problem, a series of branching statements is generated with the targets of the jumps temporarily left unspecified.
- **Backpatching** is putting the address in place of the label once the proper label is determined. Three functions are used:

1. makelist(i) — creates a new list containing only i, an index into the array of quadruples, and returns a pointer to the list it has made.

2. merge(i, j) — concatenates the lists pointed to by i and j, and returns a pointer to the concatenated list.

3. backpatch(p, i) — inserts i as the target label for each of the statements on the list pointed to by p.

RUNTIME ENVIRONMENT
-------------------

- Runtime organization of different storage locations.
- Representation of scopes and extents during program execution.
- Components of an executing program reside in blocks of memory (supplied by the OS).
- Three kinds of entities need to be managed at runtime:
  - generated code for the various procedures and programs;
  - data objects;
  - the stack of procedure activations.

STORAGE ORGANIZATION
--------------------

1. Statically allocated storage holds the generated target code and data objects whose size is known at compile time.

2. The stack is used to store:
   - procedure activations;
   - the status of the machine just before calling a procedure, so that the status can be restored when the called procedure returns.

3. The heap stores data allocated under program control (e.g. by malloc() in C).

CODE GENERATION
---------------

BASIC BLOCKS AND FLOW GRAPHS
----------------------------

BASIC BLOCKS
------------

1. We first determine the set of leaders, the first statements of basic blocks. The rules we use are the following:

   I. The first statement is a leader.
   II. Any statement that is the target of a conditional or unconditional goto is a leader.
   III. Any statement that immediately follows a goto or conditional goto statement is a leader.

2. For each leader, its basic block consists of the leader and all statements up to, but not including, the next leader or the end of the program. For instance, the consecutive statements (8) t6 := prod + t5 and (9) prod := t6 of the chapter's running example belong to the same basic block, since neither is a leader.

PRINCIPAL SOURCES OF OPTIMIZATION
---------------------------------

DEAD-CODE ELIMINATION
---------------------

Code that is never executed, or that computes a value that is never used, can be eliminated.

PEEPHOLE OPTIMIZATION
---------------------

Peephole optimization improves target code by examining a short sliding window of instructions and replacing it with a shorter or faster sequence. Characteristic peephole optimizations include:

- redundant-instruction elimination
- flow-of-control optimizations
- algebraic simplifications
- use of machine idioms

REDUNDANT LOADS AND STORES
--------------------------

Consider the instruction sequence:

1. MOV R0, a

2. MOV a, R0

Instruction (2) can be deleted: whenever it would execute, instruction (1) has already ensured that the value of a is in register R0.

UNREACHABLE CODE
----------------

An unlabeled instruction that immediately follows an unconditional jump can never be reached and may be removed.

Compiler Structure
------------------

- Source code is parsed to produce an AST.
- The AST is transformed into a CFG (control-flow graph).
- Data-flow analysis operates on the control-flow graph (and other intermediate representations).
- A CFG is a directed graph in which:
  - each node represents a statement;
  - edges represent control flow.
- Statements may be:
  - assignments x := y op z or x := op y;
  - copy statements x := y;
  - branches goto L or if x relop y goto L, etc.

Consider the program:

    x := a + b;
    y := a * b;
    while (y > a) {
        a := a + 1;
        x := a + b;
    }

- An expression e is **available** at a program point p if:
  - e is computed on every path to p, and
  - the value of e has not changed since the last time e was computed on each such path.
- Optimization: if an expression is available, it need not be recomputed.
- The analysis answers the question: is expression e available? The data-flow facts tracked for the program above are:
  - a + b is available
  - a \* b is available
  - a + 1 is available

An iterative computation of these facts is sketched below.
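Here is a minimal C sketch of the iterative available-expressions computation for the program above. It assumes a four-block CFG (B1 = the two initial assignments, B2 = the loop test, B3 = the loop body, B4 = the exit) with one bit per expression; the block layout, bit encoding, and all names are illustrative choices, not from the text:

```c
#include <stdio.h>

/* Available-expressions analysis on an assumed CFG:
 *   B1: x := a + b; y := a * b;        (entry)
 *   B2: loop test (y > a);  preds: B1, B3
 *   B3: a := a + 1; x := a + b;  pred: B2 (back edge to B2)
 *   B4: exit;  pred: B2
 * Each tracked expression is one bit: a+b, a*b, a+1. */
enum { E_AB = 1, E_AmB = 2, E_A1 = 4, ALL = 7 };
enum { B1, B2, B3, B4, NB };

int main(void) {
    /* gen/kill per block:
     * B1 computes a+b and a*b; it assigns only x and y, so kills nothing.
     * B3 assigns a, killing every expression over a (including the a+1 it
     *    just computed), then recomputes a+b: gen = {a+b}, kill = {a*b, a+1}. */
    int gen[NB]  = { E_AB | E_AmB, 0, E_AB, 0 };
    int kill[NB] = { 0, 0, E_AmB | E_A1, 0 };
    int npred[NB] = { 0, 2, 1, 1 };
    int pred[NB][2] = { {0, 0}, {B1, B3}, {B2, 0}, {B2, 0} };

    int in[NB] = { 0 }, out[NB];
    for (int b = 0; b < NB; b++) out[b] = ALL;  /* optimistic initialization */
    out[B1] = gen[B1];                          /* entry block: IN[B1] = {} */

    for (int changed = 1; changed; ) {          /* iterate to a fixed point */
        changed = 0;
        for (int b = B2; b < NB; b++) {
            in[b] = ALL;                        /* meet = intersection */
            for (int p = 0; p < npred[b]; p++)
                in[b] &= out[pred[b][p]];
            int o = gen[b] | (in[b] & ~kill[b]);  /* transfer function */
            if (o != out[b]) { out[b] = o; changed = 1; }
        }
    }
    for (int b = 0; b < NB; b++)
        printf("B%d: IN=%d OUT=%d (bits: 1=a+b, 2=a*b, 4=a+1)\n",
               b + 1, in[b], out[b]);
    return 0;
}
```

Under these assumptions the fixed point says that a + b is available at the loop header, because the body recomputes it after a changes, while a \* b and a + 1 are killed by the assignment to a.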