Lex, JFlex, and Yacc (PDF)
These lecture notes cover Lex, JFlex, and Yacc, tools used in compiler construction for lexical analysis and parsing. The notes detail the functionality of each tool, its input format, and its use in building programs.
LEX, JFLEX and YACC (Week 10)

Lex and Yacc
Lex and yacc are a matched pair of tools. Lex breaks files down into sets of "tokens," roughly analogous to words. Yacc takes sets of tokens and assembles them into higher-level constructs, analogous to sentences. Lex's output is mostly designed to be fed into some kind of parser, and yacc is designed to work with the output of lex.

Lex and yacc are tools for building programs. Their output is itself code:
- It needs to be fed into a compiler.
- Additional user code may be added to use the code generated by lex and yacc.

Lex: A lexical analyzer generator
Lex is a program designed to generate scanners, also known as tokenizers, which recognize lexical patterns in text. Lex is an acronym that stands for "lexical analyzer generator." Its main purpose is to facilitate lexical analysis: the processing of character sequences in source code to produce tokens for use as input to other programs, such as parsers. Other tools for lexical analyzer generation are Flex and JFlex.

lex.l is an input file written in a language that describes the generation of a lexical analyzer. The lex compiler transforms lex.l into a C program known as lex.yy.c. lex.yy.c is then compiled by the C compiler into a file called a.out. The output of the C compiler is the working lexical analyzer, which takes a stream of input characters and produces a stream of tokens.

A lex file has three sections:
1. The first is a sort of "control" information.
2. The second is the actual token or grammar rule definitions.
3. The last is C code to be copied verbatim to the output.

In the example lex file, lines 1 through 3 are more C code to be copied.
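The three-section layout can be sketched with a minimal, hypothetical specification (count.l is not from the slides; depending on the system, linking the generated scanner may need -ll or -lfl, or a user-supplied yywrap()):

```lex
%{
/* Control section: C code copied verbatim to the top of lex.yy.c. */
#include <stdio.h>
int word_count = 0;
%}

%%
[a-zA-Z]+   { word_count++; }       /* pattern {action}: count each word */
[ \t\n]+    { /* skip whitespace */ }
.           { /* ignore anything else */ }
%%

/* User-code section: copied verbatim to the bottom of the output. */
int main(void) {
    yylex();                        /* run the generated scanner on stdin */
    printf("words: %d\n", word_count);
    return 0;
}
```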
In the control section, you can indicate C code to be copied to the output by enclosing it in "%{" and "%}". This section holds #include files and declarations of variables and constants. (We wouldn't need one at all if we didn't use cout in the middle section.) Line 4 is "%%", which means we're done with the control section and moving on to the token section.

Lines 5-8 all follow the same simple format: each defines a regular expression and an action (a code segment):

Pattern {Action}

The pattern is a regular expression and the action is a code segment. When lex reads through an input file and matches one of the regular expressions, it executes the corresponding action. The action is just C++ code that is copied into the eventual lex output; it can be a single statement, or curly braces enclosing a whole series of statements.

Line 9 is another "%%" delimiter, meaning we're done with the second section and can go on to the third. Lines 10-13 are the third section, which is exclusively for copied C code:
- a main() function, containing the important call to the yylex() function;
- additional functions used in actions.
These functions are compiled separately and loaded with the lexical analyzer in the lex output file.

The lexical analyzer produced by lex starts its process by reading one character at a time until a valid match for a pattern is found. Once a match is found, the associated action takes place to produce a token. The token is then given to the parser for further processing.

Operators: " \ [ ] ^ - ? . * | ( ) $ / { } % < >
- Letters and digits match themselves.
- The period '.' matches any character (except newline).
- Brackets [ ] enclose a sequence of characters, termed a character class. This matches any character in the sequence.
- A '-' in a character class denotes an inclusive range, e.g. [0-9] matches any digit.
A '^' at the beginning of a character class denotes negation: [^0-9] matches any character that is not a digit.
- A quoted character "x" matches that character literally.
- \n and \t match newline and tab.
- Parentheses ( ) group subexpressions.
- Bar | separates alternatives.
- Star * matches zero or more occurrences, + one or more occurrences, and ? zero or one occurrence.

This example can be compiled by running:

% lex snazzle.lex

This will produce the file "lex.yy.c", which we can then compile with g++:

% g++ lex.yy.c -lfl -o snazzle

Notice the "-lfl", which links in the dynamic lex libraries.

Yacc: Overview
Yacc (or bison) is a parser generator: it takes a specification for a context-free grammar and produces code for a parser.
- Input: a set of grammar rules and actions.
- Output: C code implementing a parser as the function yyparse(), written by default to the file y.tab.c.

Using Yacc
[Pipeline diagram: the lexical rules go to flex, which generates lex.yy.c (containing yylex()); the grammar rules go to yacc, which generates y.tab.c (containing yyparse()). yylex() turns the input into tokens, which yyparse() consumes to produce the parsed input. "yacc -v" also writes y.output, which describes the states and transitions of the parser (useful for debugging); "yacc -d" writes the token definitions to y.tab.h.]

yacc: input format
A yacc input file has the following structure:

definitions
%%
rules
%%
user code

The definitions section and the first %% are required; the second %% and the user code section are optional. Since the definitions and rules sections may be empty, the shortest possible legal yacc input is:

%%

int yyparse()
- Called once from main() [user-supplied].
- Repeatedly calls yylex() until done.
- On a syntax error, calls yyerror() [user-supplied].
- Returns 0 if all of the input was processed; returns 1 if aborting due to a syntax error.

Example:

int main() { return yyparse(); }

Yacc: Grammar Rules
Information about tokens:
- Token names are declared using '%token':
  %token name1 name2 ...
- Any name not declared as a token is assumed to be a nonterminal.
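The three-part input format can be sketched as a minimal, hypothetical yacc file (calc.y is our own example, not from the slides; NUMBER and PLUS are assumed token names the scanner would supply):

```yacc
%{
/* Definitions section: verbatim C copied into y.tab.c. */
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}

%token NUMBER PLUS

%%
/* Rules section: 'expr' is a nonterminal because it is not declared a token. */
expr : expr PLUS NUMBER
     | NUMBER
     ;
%%

/* User-code section: supplies main(), which calls the generated yyparse(). */
int main(void) { return yyparse(); }
```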
The start symbol of the grammar can be declared using '%start' [optional]:

%start name

If not declared explicitly, it defaults to the nonterminal on the LHS of the first grammar rule listed. Stuff to be copied verbatim into the output (e.g., declarations, #includes) is enclosed in %{ … %}.

Yacc: Grammar Rules
Grammar productions correspond to yacc rules as follows:

A → B1 B2 … Bm    A : B1 B2 … Bm ;
B → C1 C2 … Cn    B : C1 C2 … Cn ;
C → D1 D2 … Dk    C : D1 D2 … Dk ;

A rule's RHS can have arbitrary C code embedded within { … }. E.g.:

A : B1 { printf("after B1\n"); x = 0; } B2 { x++; } B3 ;

In yacc, left recursion is more efficient than right recursion, because the LR parser can reduce a left-recursive rule as it goes instead of piling the whole input onto its stack:

A : A x | …    rather than    A : x A | …

Yacc: Error Handling
The "token" 'error' is reserved for error handling:
- It can be used in rules.
- It suggests places where errors might be detected and recovery can occur.

Example:

stmt : IF '(' expr ')' stmt
     | IF '(' error ')' stmt    /* intended to recover from errors in 'expr' */
     | FOR …
     | …

Adding Semantic Actions
Semantic actions for a rule are placed in its body:
- An action consists of C code enclosed in { … }.
- It may be placed anywhere in the rule's RHS.

Example:

expr : ID { symTbl_lookup(idname); }
decl : type_name { tval = … } id_list ;

JFLEX
JFlex is a fast lexical analyzer generator for Java.
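JFlex follows the same three-section idea as lex, but with the user-code section first and Java actions instead of C. A minimal sketch (WordScanner and the directive choices are our own illustrative assumptions, not from the slides):

```jflex
/* User code: copied to the top of the generated Java class. */
import java.io.*;

%%
/* Options and declarations section. */
%class WordScanner
%unicode
%type String

%%
/* Lexical rules: pattern { Java action }. */
[a-zA-Z]+      { return yytext(); }   /* return each word as a token */
[ \t\r\n]+     { /* skip whitespace */ }
.              { /* ignore anything else */ }
```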