Lexical Analyzer Lecture PDF

LECTURE 02: LEXICAL ANALYZER
Dr Manal Mostafa

LEXICAL ANALYSIS
- Lexical analysis attempts to isolate the words in an input string.
- A word is also called a lexeme, a lexical item, or a lexical token.
- It is a string of input characters that is taken as a unit and passed to the next phase of compilation.
- The output of the lexical phase is a stream of tokens.

THE ROLE OF THE LEXICAL ANALYZER
- Read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
- The stream of tokens is sent to the parser for syntax analysis.
- The lexical analyzer also interacts with the symbol table.
- When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

INTERACTIONS BETWEEN THE LEXICAL ANALYZER AND THE PARSER
[Figure: interactions between the lexical analyzer and the parser]

OTHER TASKS BESIDES IDENTIFICATION OF LEXEMES
- Stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
- Correlating error messages generated by the compiler with the source program.

TWO IMPORTANT PROCESSES OF LEXICAL ANALYZERS
- Lexical analyzers are divided into a cascade of two processes:
  a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
  b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

TOKENS, PATTERNS, AND LEXEMES
- A token is a pair consisting of a token name and an optional attribute value.
- The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.
- The token names are the input symbols that the parser processes.

EXAMPLE
- This C statement shows some typical tokens, their patterns, and some sample lexemes:
printf("Total = %d\n", score);
- Both printf and score are lexemes matching the pattern for token id.
- "Total = %d\n" is a lexeme matching the pattern for token literal.

HOW TO HANDLE KEYWORDS?
- A lexeme is a sequence of characters that forms a single unit of meaning in a language.
  int age = 25;
  Lexemes: int, age, =, 25, and ;
- Role: Lexemes are the actual substrings that are matched against patterns defined by the lexical rules of a language.
- A token is a category that represents the type of a lexeme. When the lexical analyzer identifies a lexeme, it assigns it a token that symbolizes its role in the language.
- Tokens: KEYWORD(int), IDENTIFIER(age), OPERATOR(=), NUMBER(25), and SEMICOLON(;).
- Role: Tokens are used by parsers to understand the ...

LEXICAL ANALYSIS
- Recognize tokens and ignore white space and comments.
  If (x1 * x2 ... = and >
- Typically implemented through a buffer:
  - Keep input in a buffer.
  - Move pointers over the input.

INPUT BUFFERING
- We cannot be sure we have seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.
- In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.

id → letter (letter | digit)*
num → digit+ ('.' digit+)? (E ('+' | '-')? digit+)?
delim → blank | tab | newline
ws → delim+
- Construct an analyzer that will return 3 pairs

TRANSITION DIAGRAM FOR RELOPS
[Figure: transition diagram for relops. Accepting states are labeled "token is relop" together with the matched lexeme (the labels that survive include >=, >, <, and =); the states reached on "other" are marked *.]

TRANSITION DIAGRAM FOR IDENTIFIER
[Figure: start state, then an edge labeled letter to a state that loops on letter or digit, then an edge labeled other to an accepting state marked *.]

TRANSITION DIAGRAM FOR WHITE SPACES
[Figure: start state, then an edge labeled delim to a state that loops on delim, then an edge labeled other to an accepting state marked *.]

TRANSITION DIAGRAM FOR UNSIGNED NUMBERS
[Figure: three transition diagrams. The first recognizes numbers with a fraction and an exponent (edges labeled digit, '.', E, + or -, and others); the second recognizes real numbers (digits, '.', digits); the third recognizes integers (digits). Each accepts on others in a state marked *.]
- The lexeme for a given token must be the longest possible.
- Assume the input is 12.34E56. If matching started in the third diagram, the accept state would be reached after 12, which is not the longest possible lexeme.
- Therefore, the matching should always start with the first transition diagram.
- If failure occurs in one transition diagram, retract the forward pointer to the start state and activate the next diagram.
- If failure occurs in all diagrams, a lexical error has occurred.

ANOTHER TRANSITION DIAGRAM FOR UNSIGNED NUMBERS
[Figure: a single combined transition diagram for unsigned numbers, with edges labeled digit, '.', E, + or -, and others.]
- A more complex transition diagram is difficult to implement and may give rise to errors during coding; however, there are better ways to implement it.

LEXICAL ANALYZER GENERATOR
- Input to the generator:
  - List of regular expressions in priority order.
  - Associated actions for each regular expression (each action generates a kind of token and other bookkeeping information).
- Output of the generator:
  - A program that reads the input character stream and breaks it into tokens.
  - Reports of lexical errors (unexpected characters), if any.

LEX: A LEXICAL ANALYZER GENERATOR
[Figure: token specifications → LEX → lex.yy.c (C code for the lexical analyzer) → C compiler → object code for the lexical analyzer; input program → lexical analyzer → tokens.]
- Refer to the LEX User's Manual.

HOW DOES LEX WORK?
- Regular expressions describe the languages that can be recognized by finite automata.
- Translate each token regular expression into a nondeterministic finite automaton (NFA).
- Convert the NFA into an equivalent DFA.
- Minimize the DFA to reduce the number of states.
- Emit code driven by the DFA tables.
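
The regular definitions, the longest-match rule, and the idea of a prioritized rule list can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions: it uses Python's re module instead of the DFA tables that LEX generates, and the names KEYWORDS, TOKEN_SPEC, and tokenize, the keyword set, and the relop pattern are hypothetical choices made for illustration (the lecture text does not spell out a relop definition).

```python
import re

# Assumption: a small keyword set chosen only for this illustration.
KEYWORDS = {"if", "then", "else", "int"}

# Rules in priority order, as a lexer generator would expect.
# num, id, and ws follow the lecture's regular definitions;
# the relop alternatives are an assumption.
TOKEN_SPEC = [
    ("ws",    r"[ \t\n]+"),                          # ws -> delim+
    ("num",   r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?"),  # digit+ ('.' digit+)? (E sign? digit+)?
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),              # letter (letter | digit)*
    ("relop", r"<=|<>|>=|<|>|="),                    # longer alternatives listed first
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]


def tokenize(text):
    """Return a list of (token name, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_lexeme = None, ""
        # Try every rule at the current position and keep the longest
        # lexeme; on a tie the earlier (higher-priority) rule wins.
        for name, pattern in COMPILED:
            match = pattern.match(text, pos)
            if match and len(match.group()) > len(best_lexeme):
                best_name, best_lexeme = name, match.group()
        if best_name is None:
            # No rule matches here: report a lexical error.
            raise SyntaxError(f"lexical error at position {pos}: {text[pos]!r}")
        pos += len(best_lexeme)
        if best_name == "ws":
            continue  # whitespace is stripped, not passed to the parser
        if best_name == "id" and best_lexeme in KEYWORDS:
            best_name = "keyword"  # keywords are recognized as ids, then reclassified
        tokens.append((best_name, best_lexeme))
    return tokens
```

Calling tokenize("if x1 >= 12.34E56") yields the pairs (keyword, if), (id, x1), (relop, >=), and (num, 12.34E56): the number is matched as one lexeme instead of stopping after 12, which is the longest-match behavior described for the transition diagrams. Reclassifying identifiers through a keyword set is one common way to handle keywords; reserving a separate rule per keyword would work as well.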
