CSC 448: Compiler Design Lecture 3 PDF
Document Details
Uploaded by SpiritualMaxwell8882
DePaul University
2018
Joseph Phillips
Tags
Summary
This lecture provides an introduction to compiler design, focusing on lexical analysis techniques. It covers the practice of scanning and using the Flex tool for tokenization.
Full Transcript
CSC 448: Compiler Design Lecture 3 Joseph Phillips De Paul University 2018 April 9 Copyright © 2015-8 Joseph Phillips All rights reserved Reading Charles Fischer, Ron Cytron, Richard LeBlanc Jr. “Crafting a Compiler” A...
CSC 448: Compiler Design Lecture 3 Joseph Phillips De Paul University 2018 April 9 Copyright © 2015-8 Joseph Phillips All rights reserved Reading Charles Fischer, Ron Cytron, Richard LeBlanc Jr. “Crafting a Compiler” Addison-Wesley. 2010. – Chapter 3: Scanning – Theory and Practice Topics: Scanning (Practice) Flex Overview: Remember our hand-coded tokenizer? Symbol* scanner () throw(const char*) case '-' : { symbolPtr->symbol_ = SUBTRACT_SYMBOL; while ( isspace(inputCharStream_.peek()) ) break; inputCharStream_.advance(); case 'p' : if ( inputCharStream_.isAtEnd() ) symbolPtr->symbol_ = PRINT_SYMBOL; return( &endSymbol ); break; if ( isdigit(inputCharStream_.peek()) ) case 'i' : return( scanDigits() ); symbolPtr->symbol_ = INT_DECLARE_SYMBOL; break; char ch = inputCharStream_.peek(); Symbol* symbolPtr = (Symbol*) case 'f' : malloc(sizeof(Symbol)); symbolPtr->symbol_ = FLOAT_DECLARE_SYMBOL; break; inputCharStream_.advance(); default : switch (ch) if ( islower(ch) ) { { case '=' : symbolPtr->symbol_ = ID_SYMBOL; symbolPtr->value_.varName_= ch; symbolPtr->symbol_ = ASSIGN_SYMBOL; break; break; } case '+' : throw "Unexpected character in input"; symbolPtr->symbol_ = ADD_SYMBOL; } break; return(symbolPtr); } Remember our hand-coded tokenizer? Symbol* scanDigits () throw() { else std::string lex(""); { lex += inputCharStream_.peek(); while ( isdigit(inputCharStream_.peek()) ) inputCharStream_.advance(); { while ( isdigit(inputCharStream_.peek()) ) lex += inputCharStream_.peek(); { inputCharStream_.advance(); lex += inputCharStream_.peek(); } inputCharStream_.advance(); } Symbol* symbolPtr; symbolPtr= (Symbol*) if (inputCharStream_.peek() != '.') malloc(sizeof(Symbol)); { symbolPtr->symbol_ = FLOAT_SYMBOL; symbolPtr= (Symbol*) symbolPtr->value_.floatPt_ malloc(sizeof(Symbol)); = strtod(lex.c_str(),NULL); symbolPtr->symbol_ = INT_SYMBOL; } symbolPtr->value_.integer_ = strtol(lex.c_str(),NULL,10); return(symbolPtr); } } Q: Eeww! What a pain-in-the-ass! Can’t we just specify regular expressions and someone else make a tokenizer in C? A: Yes you may! With flex History of flex lex: – “Lexical Analyzer” – Written by Mike Lesk and Eric Schmidt for ATT Unix in 1970s – Uses regular expression to define lexemes – C code actions do-able for each flex: – “fast lexical analyzer” – Written by Vern Paxson circa 1987 Our first lex/flex program (1) %{ // Compile with: // $ flex o ex1_echoer.c ex1_echoer.lex // $ gcc ex1_echoer.c o ex1_echoer %} %% \n printf("\n");. printf("%c",yytext); %% int yywrap () { return(1); } int main () { yylex(); return(0); } Our first lex/flex program (2) I know! I know! It looks funny... just compile and run it! $ flex o ex1_echoer.c ex1_echoer.lex $ gcc ex1_echoer.c o ex1_echoer $./ex1_echoer I type something I type something I type something else I type something else Hey! Are *YOU* copying me?!?! Hey! Are *YOU* copying me?!?! ^C # I typed CtrlC Our first lex/flex program (3) So, it just copies what I type: $./ex1_echoer < ex1_echoer.lex %{ // Compile with: // $ flex o ex1_echoer.c ex1_echoer.lex // $ gcc ex1_echoer.c o ex1_echoer %} %% \n printf("\n");. printf("%c",yytext); %% int yywrap () { return(1); } int main () { yylex(); return(0); } Our first lex/flex program (4) Now let’s try to understand the bad-boy: In general: %{ // #includes and Cexterns go here %} // regular expression defns go here %% // regular expressions rules go here %% C functions go here Our first lex/flex program (5) %{ // Compile with: // $ flex o ex1_echoer.c ex1_echoer.lex // $ gcc ex1_echoer.c o ex1_echoer %} Not much going on here... just storing some comments Our first lex/flex program (6) %% \n printf("\n");. printf("%c",yytext); %% Regular expression rules to match – Period (.) matches any char, – \n means newline printf(..) is the C-code to do when regular expression matches – yytext[ ] holds the input read so far Our first lex/flex program (7) %% int yywrap () { return(1); } int main () { yylex(); return(0); } yylex() is the C function that lex/flex produces yywrap() is a function/macro that yylex() uses: – Return value 0: keep on reading when reach EOF (yywrap() would open the next file and then return 0) – Return value 1: return the zero-token to report EOF Pressing Ctrl-C was annoying! Another solution? (1) %{ // Compile with: // $ flex o ex2_echoer.c ex2_echoer.lex // $ gcc ex2_echoer.c o ex2_echoer %} %% quit return(0); // This is our new rule! \n printf("\n");. printf("%c",yytext); %% int yywrap () { return(0); } int main () { printf("Type \"quit\" to quit:\n"); yylex(); return(0); } Pressing Ctrl-C was annoying! Another solution? (2) $ flex o ex2_echoer.c ex2_echoer.lex $ gcc ex2_echoer.c o ex2_echoer $./ex2_echoer Type "quit" to quit: Hello Hello Goodbye Goodbye quit $ Your Turn! Any problems with that rule for reading files? Your Turn Again! Write a Lex program that counts the number of newline characters, and the total number of chars. (Similar to Unix’s wc) And Again! Write a program to count the number of vowels in a file Lex regular expressions (1): Rules – Period (.) Matches any char – | The thing before or after – [] Defines character class (all included) and [0-9] both define digits – [^ ] Defines character class Everything but all included [^0-9] defines all chars other than digits Lex regular expressions (2): Rules – ( ) Just used for grouping – * Matches 0 or more of previous thing – + Matches 1 or more of previous thing – ? Matches 0 or 1 of previous thing (e.g. “optional”) – { } How many times the previous thing should match A{1,3} matches ‘A’ between 1 to 3 times – / Match preceding thing only if followed by following thing 0/1 matches 0 only when followed by 1 (1 still in input) Lex regular expressions (3): Rules: – ^ Match at beginning of line ^begin only matches “begin” at beginning of line – $ Match at end of line end$ only matches “end” at end of line – \ C-escape sequence – "(whatever)" Interpret all enclosed characters literally (other than C escape chars) Regular expression rule example (1) %{ // Compile with: // $ flex o ex4_echoer.c ex4_echoer.lex // $ gcc ex4_echoer.c o ex4_echoer %} %% \n // Ignore newlines ("+"|)?[09]+ { int i=atoi(yytext); printf("Integer: %d\n",i); } [AZaz_][AZaz_09]* { printf("Identifier: %s\n",yytext); }. // Ignore other chars %% int yywrap () { return(1); } int main () { yylex(); return(0); } // yytext[] array into which flex puts currently read lexeme Regular expression rule example (2) $./ex4_echoer < ex4_echoer.lex Identifier: Compile Identifier: with Identifier: flex Identifier: o Identifier: ex4_echoer Identifier: c Identifier: ex4_echoer Identifier: lex Identifier: gcc Identifier: ex4_echoer Identifier: c Identifier: o Identifier: ex4_echoer Identifier: n Identifier: ignore Identifier: newline Integer: 0 Integer: 9... You can define regular expressions at a higher level %{ // Compile with: // $ flex o ex5_echoer.c ex5_echoer.lex // $ gcc ex5_echoer.c o ex5_echoer %} DIGIT [09] ID [AZaz_][AZaz_09]* %% \n // Ignore newlines ("+"|)?{DIGIT}+ { int i=atoi(yytext); printf("Integer: %d\n",i); } {ID} { printf("Identifier: %s\n",yytext); }. // Ignore other chars %% int yywrap () { return(1); } int main () { yylex(); return(0); } Your Turn! Revise the previous program to handle floating point numbers (including scientific notation) Your Turn Again! Write a program that recognizes the tokens of 4 th example lex program: Ignore \n \t \r, spaces, , // comments Print these out as “Percent Begin curly”, etc: %{, %}, %%, \n, (, ), |, (char)-(char), (comma), {, }, (period), *, +, [, ], (semi-colon), Print "quoted strings" (don’t worry about \") Print identifiers (include reserved words) Print integers You control the input! input() (or yyinput() in C++) – Returns next read char. You may call it yourself, e.g. to fast-forward to the end of a comment. " Other helper vars and fncs (1) yyin – FILE* object from which yylex() reads input. – Could set to stdin, or to value from fopen() the file to read. yywrap() – Called when reach EOF. – If returns 0 then means “I set up yyin to read from the beginning of the next file.” – If return 1 then means “That’s it. There is no more input” Other helper vars and fncs (2) YY_INPUT: macro to efficiently read buffer-fulls of chars: YY_INPUT(destinationBuffer,intVarSetToNumCharsRead,buf ferSize) // Example definition: #define YY_INPUT(buf,result,max_size) \ { \ int c = getchar(); \ if (c == EOF) return(YY_NULL); \ buf = c; \ result = 1; \ } Other helper vars and fncs (3) Another example of YY_INPUT %{ #undef YY_INPUT #define YY_INPUT(buf,res,size) \ { res = ourInput(buf,size); } %} %%.. %% Other helper vars and fncs (4) int ourInput (char* buffer, int size) { int n; errno = 0; while ( (n=fread(buffer,1,size,yyin)) == 0 && ferror(yyin)) { if (errno != EINTR) { fprintf(stderr,"Reading file failed: %s\n",strerror(errno)); exit(EXIT_FAILURE); } errno = 0; clearerr(yyin); } return(n); }