Compiler Design - Lecture Notes
Document Details
Uploaded by SpotlessBasil7947
IIT Delhi
Summary
This document presents lecture notes covering the design of compilers. It explores the key phases involved, like lexical and syntax analysis, discussing concepts such as tokenization and parsing, and related aspects of computer science. The content should prove useful for undergraduate students seeking an overview of compiler design principles.
Full Transcript
21CSC304J - Compiler Design

WHY DO WE NEED A COMPILER? HOW DOES IT WORK?

Prerequisites:
- Basic knowledge of programming languages.
- Basic knowledge of FSA and CFG.
- Knowledge of a high-level programming language for the programming assignments.

Textbook: Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, "Compilers: Principles, Techniques, and Tools", Addison-Wesley, 1986.

Course Outline:
- Introduction to Compiling
- Lexical Analysis
- Syntax Analysis: Context-Free Grammars; Top-Down Parsing, LL Parsing; Bottom-Up Parsing, LR Parsing
- Syntax-Directed Translation: Attribute Definitions; Evaluation of Attribute Definitions
- Semantic Analysis, Type Checking
- Run-Time Organization
- Intermediate Code

TRANSLATOR
A translator converts a program in a high-level language into an equivalent machine-language program. Types: compiler, interpreter, assembler.

COMPILER
A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language:
Source program (written in a high-level programming language) -> COMPILER -> Target program (the equivalent program in machine code, a relocatable object file).
The compiler also reports error messages so that the programmer can take corrective action before the next compilation.

INTERPRETER
An interpreter translates and executes the source program line by line, and typically gives better error diagnostics than a compiler:
Source program + input -> Interpreter -> Output (and error messages).

ASSEMBLER
An assembler is a translator that translates assembly-language code into machine-language code:
Assembly language code -> Assembler -> Machine language code.

Language Processor (Cousins of the Compiler)
Source program -> Preprocessor -> Modified source program -> Compiler -> Target assembly program -> Assembler -> Relocatable machine code -> Linker/Loader.

STRUCTURE OF THE COMPILER
A compiler has two parts: the analysis part and the synthesis part.

Phases of the Compiler (standard compiler structure):
Source code (character stream)
-> Lexical analysis -> token stream
-> Parsing -> abstract syntax tree          [front end, machine-independent]
-> Intermediate code generation -> intermediate code
-> Optimization -> intermediate code
-> Code generation -> assembly code         [back end, machine-dependent]

Translation of an Assignment Statement

LEXICAL ANALYSIS
Lexical analysis is the first phase of the compiler, also termed scanning. The source program is scanned to read the stream of characters, and those characters are grouped into sequences called lexemes, from which tokens are produced as output.

Token: a sequence of characters that represents a lexical unit matching a pattern, such as a keyword, operator, or identifier.
Lexeme: an instance of a token, i.e., the group of characters forming the token.
Pattern: the rule that the strings forming a token must match.

Example: c = a + b * 5;

The primary functions of the lexical phase:
- Identify the lexical units in the source code.
- Classify lexical units into classes such as constants and reserved words, and enter them in different tables.
- Ignore comments in the source program.
- Identify tokens that are not part of the language.

Example: x = y + 10
Tokens:
  x  -> identifier
  =  -> assignment operator
  y  -> identifier
  +  -> addition operator
  10 -> number

SYNTAX ANALYSIS
Syntax analysis is the second phase of the compiler, also called parsing. It constructs the parse tree with the help of the tokens and determines the structure of the source program from the grammar (syntax) of the language. The parser converts the tokens produced by the lexical analyser into a tree-like representation called a parse tree. A syntax tree is a compressed representation of the parse tree in which the operators appear as interior nodes and the operands of an operator are the children of the node for that operator.
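The grouping of characters into lexemes and the classification into tokens can be sketched with a small scanner. The token names and regular expressions below are illustrative, not the course's official set:

```python
import re

# Illustrative token specification: (token name, pattern), tried in order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),   # whitespace is recognized but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    """Group the character stream into lexemes and emit (token, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Running `tokenize("x = y + 10")` reproduces the token list given above: identifier, assignment operator, identifier, addition operator, number.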
Input: tokens. Output: syntax tree.

Tasks performed in this phase:
- Obtain tokens from the lexical analyser.
- Check whether the expression is syntactically correct.
- Report all syntax errors.
- Construct a hierarchical structure known as a parse tree.

SEMANTIC ANALYSIS
Semantic analysis checks the meaning of the program; this includes type conversion and type coercion.

INTERMEDIATE CODE GENERATOR, CODE OPTIMIZATION, CODE GENERATOR
The remaining phases are the intermediate code generator, the code optimizer, and the code generator; all phases consult the symbol table.

EXERCISE NO 1
Q1. Show the phases of a compiler on the translation of the assignment statement:
position = initial + rate * 60

Grouping of Phases: Front End and Back End
Front end: comprises the phases that depend on the input (source language) and are independent of the target machine (target language). It includes lexical and syntactic analysis, symbol-table management, semantic analysis, and the generation of intermediate code. Some code optimization can also be done by the front end, and it includes error handling for the phases concerned.
Back end: comprises the phases that depend on the target machine and are independent of the source language. This includes code optimization and code generation.

PASSES
The phases of a compiler can be implemented in a single pass by marking the primary actions, viz. reading the input file and writing the output file. Several phases of the compiler are grouped into one pass in such a way that the operations of each phase are carried out during that pass.
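The exercise statement position = initial + rate * 60 is the classic worked example. A sketch of the three-address intermediate code a compiler might emit for it; the temporary names t1..t3 and the inttofloat coercion follow the textbook's example, and the generator itself is illustrative:

```python
# Illustrative generator emitting three-address code for:
#   position = initial + rate * 60
# inttofloat models the type coercion performed during semantic analysis.
class TempGen:
    """Generates fresh temporary names t1, t2, ..."""
    def __init__(self):
        self.n = 0
    def new(self):
        self.n += 1
        return f"t{self.n}"

def gen_three_address():
    temps = TempGen()
    code = []
    t1 = temps.new(); code.append(f"{t1} = inttofloat(60)")   # coerce int 60 to float
    t2 = temps.new(); code.append(f"{t2} = rate * {t1}")      # * binds tighter than +
    t3 = temps.new(); code.append(f"{t3} = initial + {t2}")
    code.append(f"position = {t3}")
    return code
```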
Lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped into one pass.

Types of Compiler
- Single-pass compilers
- Two-pass compilers
- Multi-pass compilers

Compiler Construction Tools
1. Scanner generators
2. Parser generators
3. Syntax-directed translation engines
4. Automatic code generators
5. Data-flow analysis engines

Role of the Lexical Analyzer
Interaction of the lexical analyzer with the parser: the parser requests a token with getNextToken(); the lexical analyzer reads the source program and returns the token (with its attributes) to the parser, which passes its result on to semantic analysis. Both the lexical analyzer and the parser consult the symbol table.

Lexical Analysis
The lexical analyzer reads input characters and produces a sequence of tokens as output (nextToken()), trying to understand each element in the program. A token is a group of characters having a collective meaning.

Example: const pi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, 'pi')
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (;, -)

Buffering Techniques: Input Buffering
How can we increase the speed of reading the source program character by character? Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. There are two buffering schemes: the one-buffer scheme and the two-buffer scheme.

Main Concept of Input Buffering
Two pointers into the input are maintained:
- Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
- Pointer forward scans ahead until a pattern match is found.

Sentinels
To reload the buffer and advance the forward pointer we would otherwise need two tests: one for the end of the buffer, and one to determine what character was read. These can be combined into a single test if we hold a sentinel character at the end of the buffer. A sentinel is a special character that cannot be part of the source program (e.g., eof). Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
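A minimal sketch of the sentinel idea (buffer reloading is omitted): placing a sentinel at the end of the buffer lets the scanning loop make one test per character instead of two (an end-of-buffer test plus a character test):

```python
EOF = "\0"  # stands in for the eof sentinel; it cannot occur in the source

def scan_with_sentinel(buffer):
    """Advance the forward pointer using only one combined test per character."""
    forward = 0
    lexeme = []
    while True:
        ch = buffer[forward]
        if ch == EOF:        # the single combined test: end of buffer / end of input
            break
        lexeme.append(ch)
        forward += 1
    return "".join(lexeme)
```

In a real two-buffer scheme, hitting the sentinel triggers a reload of the other buffer unless the true end of input has been reached.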
Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input, the buffer is only partially filled and the sentinel marks where the input ends.

Conclusion
Lex is a tool used in the lexical analysis phase to recognize tokens using regular expressions. The Lex tool itself is a lex compiler. lex.l is an input file written in a language that describes the generation of a lexical analyzer. The lex compiler transforms lex.l into a C program known as lex.yy.c. lex.yy.c is compiled by the C compiler into a file called a.out. The output of the C compiler is the working lexical analyzer, which takes a stream of input characters and produces a stream of tokens. yylval is a global variable shared by the lexical analyzer and the parser to return the name and attribute value of a token; the attribute value can be a numeric code, a pointer into the symbol table, or nothing. Another tool for lexical analyzer generation is Flex.

Structure of Lex Programs
A Lex program has the following form:
- Declarations: declarations of variables, constants, and regular definitions.
- Translation rules: regular expressions and code segments, each of the form Pattern {Action}, where Pattern is a regular expression or regular definition and Action is a segment of code.
- Auxiliary functions: additional functions used in the actions; these are compiled separately and loaded with the lexical analyzer.

The lexical analyzer produced by Lex starts its process by reading one character at a time until a valid match for a pattern is found; once a match is found, the associated action takes place to produce the token.

Conflict Resolution in Lex
A conflict arises when several prefixes of the input match one or more patterns. This is resolved by the following rules:
- Always prefer a longer prefix to a shorter prefix.
- If two or more patterns are matched by the longest prefix, the pattern listed first in the Lex program is preferred.
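The two conflict-resolution rules can be sketched in Python (the rule list below is illustrative; real Lex compiles the rules into one DFA rather than trying them one by one):

```python
import re

# Lex-style rules in listed order: the keyword pattern comes before the
# identifier pattern, as it would in a Lex specification.
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-z]+"),
]

def longest_match(text):
    """Longest prefix wins; on a tie, the rule listed first wins (as in Lex)."""
    best = None  # (length, -rule_index, token, lexeme)
    for i, (tok, pat) in enumerate(RULES):
        m = re.match(pat, text)
        if m:
            cand = (len(m.group()), -i, tok, m.group())
            if best is None or cand > best:
                best = cand
    return (best[2], best[3]) if best else None
```

On the input "if" both rules match a prefix of length 2, so the first-listed rule wins and IF is produced; on "ifx" the identifier rule matches a longer prefix, so ID wins.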
Lookahead Operator
The lookahead operator is an additional operator read by Lex in order to distinguish an additional pattern for a token. The lexical analyzer reads one character ahead of the valid lexeme and then retracts to produce the token. At times, certain characters are needed at the end of the input to match a pattern; in such cases, a slash (/) is used to indicate the end of the part of the pattern that matches the lexeme.

For example, in some languages keywords are not reserved, so the statements IF (I, J) = 5 and IF (condition) THEN create a conflict: should IF be produced as an array name or as a keyword? To resolve this, the Lex rule for the keyword IF can be written using the lookahead operator.

Issues in Lexical Analysis
1. Lookahead
2. Ambiguity

Regular Expressions
Regular expressions are a notation to represent lexeme patterns for tokens; we use regular expressions to describe the tokens of a programming language. A regular expression is built up from simpler regular expressions using defining rules. Each regular expression denotes a language; a language denoted by a regular expression is called a regular set.

Defining rules (over an alphabet Σ):
- ε is a regular expression, and L(ε) = {ε}.
- If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}.
- (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
- (r)(s) is a regular expression denoting the language L(r)L(s).
- (r)* is a regular expression denoting (L(r))*.
- (r) is a regular expression denoting L(r).

Shorthands:
- (r)+ = (r)(r)*
- (r)? = (r) | ε
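The defining rules can be exercised with Python's `re` module, restricted to full-string matching so that "matches" means "is in the language". This is a sketch; Python regex syntax happens to coincide with the notation above for |, concatenation, *, +, and ?:

```python
import re

def in_language(r, s):
    """True iff the whole string s belongs to L(r)."""
    return re.fullmatch(r, s) is not None
```

For example, `in_language("a|b", s)` tests membership in L(a) ∪ L(b), `in_language("(a)(b)", s)` tests L(a)L(b), and `in_language("(0|1)*", s)` tests the Kleene closure.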
We may remove parentheses by using precedence rules:
- * : highest
- concatenation : next
- | : lowest
So ab*|c means (a(b)*)|(c).

Examples (Σ = {0,1}):
- 0|1 => {0, 1}
- (0|1)(0|1) => {00, 01, 10, 11}
- 0* => {ε, 0, 00, 000, 0000, ...}
- (0|1)* => all strings of 0's and 1's, including the empty string

Regular Definitions
Writing a regular expression for some languages can be difficult, because the expression can become quite complex. In those cases we may use regular definitions: we give names to regular expressions and use these names as symbols to define other regular expressions. A regular definition is a sequence of definitions of the form
d1 -> r1
d2 -> r2
...
dn -> rn
where each di is a distinct name and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, ..., di-1}.

Example: if we try to write the regular expression for identifiers without using regular definitions, the expression is complex:
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*

Example: unsigned numbers in Pascal
digit -> 0 | 1 | ... | 9
digits -> digit+
opt-fraction -> (. digits)?
opt-exponent -> (E (+|-)? digits)?
unsigned-num -> digits opt-fraction opt-exponent

Extensions
- One or more instances: (r)+
- Zero or one instance: r?
- Character classes: [abc]; for example, [0-9]

Example:
digit -> [0-9]
letter_ -> [A-Za-z_]
id -> letter_ (letter_ | digit)*

Recognition of Tokens
The starting point is the language grammar, to understand the tokens:
stmt -> if expr then stmt | if expr then stmt else stmt | ε
expr -> term relop term | term
term -> id

Recognition of Tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
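The regular definitions for unsigned numbers in Pascal can be composed mechanically, each name expanding into the expression it stands for. A sketch using Python regex syntax (not Lex syntax); variable names mirror the definitions:

```python
import re

# digit -> [0-9];  digits -> digit+;  opt-fraction -> (. digits)?
# opt-exponent -> (E (+|-)? digits)?;  unsigned-num -> digits opt-fraction opt-exponent
digit = r"[0-9]"
digits = digit + r"+"
opt_fraction = r"(\." + digits + r")?"
opt_exponent = r"(E[+-]?" + digits + r")?"
unsigned_num = digits + opt_fraction + opt_exponent

def is_unsigned_num(s):
    """True iff s is an unsigned Pascal number per the definitions above."""
    return re.fullmatch(unsigned_num, s) is not None
```

Note that the dot must be escaped in regex syntax, and that a bare fraction like ".5" is rejected because unsigned-num requires digits before the optional fraction.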
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | <= | = | <> | > | >=
We also need to handle whitespace.

Operations on Languages
- Concatenation: L1 L2 = { s1 s2 | s1 ∈ L1 and s2 ∈ L2 }
- Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
- Exponentiation: L0 = {ε}, L1 = L, L2 = LL
- Kleene closure: L*
- Positive closure: L+

Transition Diagrams
The patterns are converted into transition diagrams while constructing the lexical analyzer. The slides give transition diagrams for relop, for reserved words and identifiers (e.g., the keyword "then", with a final edge labeled "non-letter-or-digit"), for unsigned numbers, and for whitespace.

Finite Automata
Regular expressions are the specification; finite automata are the implementation. A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise. A better way to convert a regular expression into a recognizer is to construct a generalized transition diagram from the expression; this diagram is called a finite automaton. A finite automaton can be deterministic or non-deterministic.

A finite automaton consists of:
- an input alphabet Σ
- a set of states S
- a start state q0
- a set of accepting states F ⊆ S
- a set of transitions state -> input -> state

A transition s1 -a-> s2 is read: "in state s1, on input a, go to state s2". At the end of the input: if in an accepting state, accept; otherwise reject. If no transition is possible, reject.

State Graphs: a circle denotes a state, an arrow marks the start state, a double circle denotes an accepting state, and a labeled edge denotes a transition.
CS416 Compiler Design

We call the recognizer of the tokens a finite automaton.
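The relop transition diagram can be hand-coded as a function that looks one character ahead and retracts when the lookahead is not part of the operator. The token/attribute names (LT, LE, ...) are illustrative:

```python
# Hand-coded sketch of the relop transition diagram:
# relop -> < | <= | = | <> | > | >=
def relop(s):
    """Return (token, attribute) for a relational-operator prefix of s, else None."""
    if s.startswith("<="):
        return ("relop", "LE")
    if s.startswith("<>"):
        return ("relop", "NE")
    if s.startswith("<"):
        return ("relop", "LT")   # retract: only '<' is consumed
    if s.startswith(">="):
        return ("relop", "GE")
    if s.startswith(">"):
        return ("relop", "GT")   # retract: only '>' is consumed
    if s.startswith("="):
        return ("relop", "EQ")
    return None
```

The two-character tests must come before the one-character tests, mirroring the diagram's lookahead states.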
A finite automaton can be deterministic (DFA) or non-deterministic (NFA). This means that we may use either a deterministic or a non-deterministic automaton as a lexical analyzer; both deterministic and non-deterministic finite automata recognize regular sets.

Which one?
- Deterministic: a faster recognizer, but it may take more space.
- Non-deterministic: slower, but it may take less space.
Deterministic automata are widely used as lexical analyzers. First we define regular expressions for the tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
- Algorithm 1: Regular Expression -> NFA -> DFA (two steps: first to an NFA, then to a DFA).
- Algorithm 2: Regular Expression -> DFA (convert the regular expression directly into a DFA).

Non-Deterministic Finite Automaton (NFA)
A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
- Q, a set of states
- Σ, a set of input symbols (the alphabet)
- move, a transition function mapping state-symbol pairs to sets of states
- q0, a start (initial) state
- F, a set of accepting (final) states
ε-transitions are allowed in NFAs; in other words, we can move from one state to another without consuming any symbol. An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.

Deterministic and Non-deterministic Automata
- Deterministic finite automata (DFA): one transition per input symbol per state; no ε-moves.
- Non-deterministic finite automata (NFA): can have multiple transitions for one input in a given state; can have ε-moves.
Finite automata have finite memory: they need only encode the current state.
Jeya R

DFA: for every string x there is a unique path from the initial state associated with x; x is accepted if and only if this path ends at a final state.
NFA: for a string x there may exist no path, or more than one path, from the initial state associated with x.
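A DFA is just a table lookup per input character. A sketch for the example that follows (any number of 1's followed by a single 0); the state numbering is mine: state 0 reads 1's, state 1 is accepting after the final 0, and state 2 is a dead state:

```python
# DFA accepting any number of 1's followed by a single 0, over {0,1}.
DELTA = {
    (0, "1"): 0, (0, "0"): 1,   # stay on 1's; the first 0 moves to accept
    (1, "1"): 2, (1, "0"): 2,   # anything after the 0 is rejected
    (2, "1"): 2, (2, "0"): 2,   # dead state
}
START, ACCEPTING, DEAD = 0, {1}, 2

def dfa_accepts(s):
    """Follow the unique path for s; accept iff it ends in an accepting state."""
    state = START
    for ch in s:
        state = DELTA.get((state, ch), DEAD)
        if state == DEAD:
            return False   # reject early: no path to acceptance remains
    return state in ACCEPTING
```

Check that "1110" is accepted, as the notes ask.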
A Simple Example
A finite automaton that accepts only "1". A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start state to some accepting state.

Another Simple Example
A finite automaton accepting any number of 1's followed by a single 0, over the alphabet {0,1}. Check that "1110" is accepted.

The slides then give an example NFA together with its transition table.

Converting a Regular Expression into an NFA (Thompson's Construction)
This is one way to convert a regular expression into an NFA; there can be other, more efficient ways. Thompson's construction is a simple and systematic method, and it guarantees that the resulting NFA has exactly one start state and exactly one final state. The construction starts from the simplest parts (alphabet symbols); to create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined.

- To recognize the empty string ε: start state i with an ε-edge to final state f.
- To recognize a symbol a in the alphabet Σ: start state i with an a-edge to final state f.
- If N(r1) and N(r2) are NFAs for regular expressions r1 and r2:
  - For r1 | r2: a new start state i with ε-edges to the start states of N(r1) and N(r2), and ε-edges from their final states to a new final state f.
  - For r1 r2: N(r1) is followed by N(r2); the final state of N(r1) becomes the start state of N(r2). The start state of N(r1) is the start state, and the final state of N(r2) is the final state of the result.
  - For r*: a new start state i with ε-edges to the start state of N(r) and to a new final state f, and ε-edges from the final state of N(r) back to the start state of N(r) and on to f.

Example: (a|b)*a. Build the NFAs for a and for b, combine them with the union construction for (a|b), apply the star construction for (a|b)*, and concatenate with the NFA for a.

RE to NFA: (a|b)*abb
The resulting NFA has states 0 to 10 with the transitions:
0 -ε-> 1, 7;  1 -ε-> 2, 4;  2 -a-> 3;  4 -b-> 5;  3 -ε-> 6;  5 -ε-> 6;
6 -ε-> 1, 7;  7 -a-> 8;  8 -b-> 9;  9 -b-> 10 (state 10 is accepting).

RE to DFA using Thompson's Construction and Subset Construction
A given regular expression is converted into an NFA using the above rules for the operators (union, concatenation, and closure) and their precedence; then the NFA is converted into a DFA.

Steps:
1. Convert the regular expression into an NFA.
2. Find the ε-closure of all states.
3. Start with the ε-closure of the start state of the NFA.
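Thompson's construction can be sketched directly from the rules above. The encoding is my own: a fragment is (start, final, trans), where trans maps (state, symbol) to a set of next states and the symbol None stands for an ε-edge. Each constructor yields exactly one start and one final state, as the construction guarantees:

```python
import itertools

_ids = itertools.count()

def new_state():
    return next(_ids)

def symbol(a):
    """NFA recognizing the single symbol a."""
    s, f = new_state(), new_state()
    return (s, f, {(s, a): {f}})

def union(n1, n2):
    """r1 | r2: new start with ε-edges into both alternatives, ε-edges out to a new final."""
    s, f = new_state(), new_state()
    trans = {**n1[2], **n2[2]}
    trans[(s, None)] = {n1[0], n2[0]}
    trans[(n1[1], None)] = {f}
    trans[(n2[1], None)] = {f}
    return (s, f, trans)

def concat(n1, n2):
    """r1 r2: ε-edge from the final state of N(r1) to the start of N(r2)."""
    trans = {**n1[2], **n2[2]}
    trans[(n1[1], None)] = {n2[0]}
    return (n1[0], n2[1], trans)

def star(n):
    """r*: ε-edges allow skipping N(r) entirely or looping back through it."""
    s, f = new_state(), new_state()
    trans = dict(n[2])
    trans[(s, None)] = {n[0], f}
    trans[(n[1], None)] = {n[0], f}
    return (s, f, trans)

def run(nfa, text):
    """Simulate the NFA by tracking the ε-closure of the current state set."""
    start, final, trans = nfa
    def eclosure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in trans.get((q, None), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen
    current = eclosure({start})
    for ch in text:
        current = eclosure({r for q in current for r in trans.get((q, ch), ())})
    return final in current
```

The worked example (a|b)*a is then `concat(star(union(symbol("a"), symbol("b"))), symbol("a"))`.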
4. Apply the input symbols and find the ε-closure:
   Dtran[state, input symbol] = ε-closure(move(state, input symbol))
   where Dtran is the transition function of the DFA.
5. Analyze the output state to find whether it is a new state.
6. If a new state is found, repeat steps 4 and 5 until no more new states are found.
7. Construct the transition table for the Dtran function.
8. Draw the transition diagram with the ε-closure of the NFA's start state as the start state.

Example: (a|b)*abb

Step 1: Convert the RE into an NFA (states 0 to 10, as above).

Step 2: Start by finding the ε-closure of state 0:
ε-closure(0) = {0,1,2,4,7} = A

Step 3: Apply the input symbols a and b to A:
Dtran[A, a] = ε-closure(move({0,1,2,4,7}, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
Dtran[A, b] = ε-closure(move({0,1,2,4,7}, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C

Step 4: Apply the input symbols to the new state B:
Dtran[B, a] = ε-closure(move({1,2,3,4,6,7,8}, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
Dtran[B, b] = ε-closure(move({1,2,3,4,6,7,8}, b)) = ε-closure({5,9}) = {1,2,4,5,6,7,9} = D

Step 5: Apply the input symbols to the new state C:
Dtran[C, a] = ε-closure(move({1,2,4,5,6,7}, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
Dtran[C, b] = ε-closure(move({1,2,4,5,6,7}, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C

Step 6: Apply the input symbols to the new state D:
Dtran[D, a] = ε-closure(move({1,2,4,5,6,7,9}, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
Dtran[D, b] = ε-closure(move({1,2,4,5,6,7,9}, b)) = ε-closure({5,10}) = {1,2,4,5,6,7,10} = E

Step 7: Apply the input symbols to the new state E:
Dtran[E, a] = ε-closure(move({1,2,4,5,6,7,10}, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
Dtran[E, b] = ε-closure(move({1,2,4,5,6,7,10}, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C

No new states appear, so the DFA has states A (start), B, C, D, and E; E is accepting because it contains NFA state 10.
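The ε-closure and Dtran computations above can be reproduced mechanically. A sketch in Python; the encoding is mine, and the NFA transitions are the ones listed for (a|b)*abb:

```python
# Subset construction over the NFA for (a|b)*abb (states 0..10, accept 10).
NFA_EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
NFA_MOVE = {(2, "a"): {3}, (7, "a"): {8}, (4, "b"): {5}, (8, "b"): {9}, (9, "b"): {10}}

def eclosure(states):
    """All NFA states reachable from `states` via ε-edges alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in NFA_EPS.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def move(states, a):
    return {r for q in states for r in NFA_MOVE.get((q, a), ())}

def subset_construction():
    """Compute Dtran[T, a] = ε-closure(move(T, a)) over all reachable DFA states."""
    start = eclosure({0})
    dstates, worklist, dtran = {start}, [start], {}
    while worklist:
        T = worklist.pop()
        for a in "ab":
            U = eclosure(move(T, a))
            dtran[(T, a)] = U
            if U and U not in dstates:   # step 5/6: new state found, process it too
                dstates.add(U)
                worklist.append(U)
    return start, dtran
```

Running it yields exactly the sets A through E derived by hand above.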
Minimization of DFA
The slides work through the minimization of the DFA obtained above.

Regular Expression to DFA (Direct Method)
Example regular expression: (a|b)*abb
Augmented regular expression: (a|b)*abb# = (a|b)*.a.b.b.#
Compute nullable, firstpos, and lastpos over the syntax tree of the augmented expression, then compute followpos by the rules for cat-nodes and star-nodes, and draw the DFA.

Direct-method exercise: RE to DFA for (a|b)*abb(a|b)*. Find firstpos, lastpos, and followpos, and draw the DFA.

NFA TO DFA (SUBSET CONSTRUCTION)
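DFA minimization can be sketched as partition refinement: start with the partition {accepting, non-accepting} and split a group whenever two of its states transition into different groups. The transition table below is the one derived above for (a|b)*abb; the code itself is an illustrative sketch, not the slides' algorithm verbatim:

```python
# Minimizing the DFA A..E for (a|b)*abb; the accepting state is E.
STATES, FINAL, SIGMA = {"A", "B", "C", "D", "E"}, {"E"}, "ab"
DTRAN = {("A", "a"): "B", ("A", "b"): "C",
         ("B", "a"): "B", ("B", "b"): "D",
         ("C", "a"): "B", ("C", "b"): "C",
         ("D", "a"): "B", ("D", "b"): "E",
         ("E", "a"): "B", ("E", "b"): "C"}

def minimize():
    """Refine {final, non-final} until every group maps consistently on every symbol."""
    parts = [FINAL, STATES - FINAL]
    while True:
        def group_of(s):
            return next(i for i, p in enumerate(parts) if s in p)
        new_parts = []
        for p in parts:
            buckets = {}
            for s in p:
                # signature: which group each input symbol leads to
                sig = tuple(group_of(DTRAN[(s, a)]) for a in SIGMA)
                buckets.setdefault(sig, set()).add(s)
            new_parts.extend(buckets.values())
        if len(new_parts) == len(parts):   # no group was split: fixed point
            return {frozenset(p) for p in new_parts}
        parts = new_parts
```

For this DFA the refinement merges A and C (both go to B on a and to C on b), so the minimized DFA has four states.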