Questions and Answers
What is the primary output of the lexical analyzer?
The lexical analyzer is responsible for syntax analysis.
False
What is a lexeme?
A sequence of characters that forms a single unit of meaning in a language.
The process of __________ removes comments and consecutive whitespace characters from the input.
Match the following terms with their correct descriptions:
Which of the following is NOT a responsibility of the lexical analyzer?
Lexemes can include keywords, identifiers, and literals.
A token consists of a token name and an optional __________ value.
What is the function of a token in lexical analysis?
Tokens are used by parsers to understand the structure of the entire program.
Name two types of tokens mentioned in the content.
In C, single-character operators like '-' could also signify the beginning of a __________ operator.
Match the following token types with their examples:
Which of the following is NOT a type of token?
Whitespace and comments are ignored during the recognition of tokens.
What is the purpose of an input buffer in lexical analysis?
What should happen if a failure occurs in all transition diagrams?
The longest possible lexeme for a given token should be identified.
What is the output of a lexical analyzer generator?
If the input is '12.34E56', the accept state will be reached after the first ___ digits.
Match the terms with their definitions:
What is required for the input to the lexical analyzer generator?
It is acceptable to use a simpler transition diagram to implement a lexical analyzer.
What happens when a forward pointer fails in a transition diagram?
Study Notes
Lexical Analyzer
- Attempts to isolate words in an input string
- A word (also called a lexeme, lexical item, or lexical token) is a string of input characters treated as a unit and passed to the next compilation phase
- The output of the lexical phase is a stream of tokens
Role of the Lexical Analyzer
- Reads source program characters, grouping them into lexemes and producing a sequence of tokens for each lexeme
- The stream of tokens is sent to the parser for syntax analysis
- Interacts with the symbol table, entering lexemes (identifiers) into it
Interactions Between Lexical Analyzer and Parser
- Source program → Lexical Analyzer → Token → Parser → Semantic analysis
Other Tasks Besides Identifying Lexemes
- Removing comments and whitespace
- Correlating error messages with the source program
Important Processes of Lexical Analyzers
- Divided into a cascade of two processes
- Scanning: Simple processes not requiring tokenization, e.g. deleting comments, compacting whitespace
- Lexical analysis proper: The more complex portion, which produces the sequence of tokens from the output of the scanner
Tokens, Patterns, and Lexemes
- A token is a pair: token name and optional attribute value
- The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword or a sequence of input characters denoting an identifier
- Token names are input symbols processed by the parser
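- As a concrete sketch of this pair, the following C fragment is an illustration; the token names and the get_next_token() interface are assumptions, not a prescribed API:

```c
/* A minimal sketch of a token as a (name, optional attribute value) pair.
 * Token names and the get_next_token() interface are illustrative. */
#include <stdio.h>

typedef enum { TOK_KEYWORD, TOK_ID, TOK_NUMBER, TOK_LITERAL, TOK_PUNCT, TOK_EOF } TokenName;

typedef struct {
    TokenName name;        /* abstract symbol consumed by the parser */
    union {                /* optional attribute value */
        int    sym_index;  /* e.g., symbol-table entry of an identifier */
        double num_value;  /* e.g., value of a numeric constant */
    } attr;
} Token;

/* Stub standing in for the real lexical analyzer. */
static Token get_next_token(void) {
    Token t = { TOK_EOF, { 0 } };
    return t;
}

int main(void) {
    /* Parser-side view: pull tokens until the end of the input. */
    for (Token t = get_next_token(); t.name != TOK_EOF; t = get_next_token())
        printf("token %d\n", (int)t.name);
    return 0;
}
```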
Example
- C statement: printf("Total = %d\n", score);
- printf and score are lexemes that match the pattern for token id
- "Total = %d\n" is a lexeme matching the pattern for token literal
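- Read as a token stream (the exact token names here are illustrative), the statement becomes: <id, printf> <(> <literal, "Total = %d\n"> <,> <id, score> <)> <;>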
Handling Keywords
- A lexeme is a sequence of characters forming a single unit of meaning in a language
- Example: int age = 25;
- Lexemes: int, age, =, 25, ;
- Tokens categorize lexemes: KEYWORD(int), IDENTIFIER(age), OPERATOR(=), NUMBER(25), SEMICOLON
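- A minimal sketch of how a lexer can tell keywords from identifiers, assuming a small hypothetical reserved-word table; the table contents and token names mirror the example above but are otherwise illustrative:

```c
/* Scan an identifier-like lexeme and classify it as a keyword or identifier
 * by looking it up in a reserved-word table. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum { KEYWORD, IDENTIFIER, OPERATOR, NUMBER, SEMICOLON } TokenName;

static const char *keywords[] = { "int", "if", "else", "return" };

static TokenName classify_word(const char **p, char *lexeme, size_t cap) {
    size_t n = 0;
    while (isalnum((unsigned char)**p) || **p == '_') {   /* gather the lexeme */
        if (n + 1 < cap) lexeme[n++] = **p;
        (*p)++;
    }
    lexeme[n] = '\0';
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return KEYWORD;                                /* reserved word */
    return IDENTIFIER;                                     /* ordinary name */
}

int main(void) {
    const char *src = "int age";
    char lexeme[64];
    while (*src) {
        if (isalpha((unsigned char)*src) || *src == '_') {
            TokenName t = classify_word(&src, lexeme, sizeof lexeme);
            printf("%s(%s)\n", t == KEYWORD ? "KEYWORD" : "IDENTIFIER", lexeme);
        } else {
            src++;   /* skip whitespace and anything else in this sketch */
        }
    }
    return 0;
}
```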
Lexical Analysis
- Recognize tokens, ignore whitespace and comments
- Example: if (x1 * x2 < 1.0) { }
- Generates a token stream along the lines of: IF ( ID(x1) * ID(x2) < NUMBER(1.0) ) { }
Error Reporting
- Correlate lexical errors (e.g., unexpected characters) with positions in the source program
Modeling and Recognizing Tokens
- Model tokens using regular expressions
- Recognize them using finite automata (finite state machines)
Tokens, Patterns, and Lexemes (continued)
- Pattern describes the form of a token's lexemes
- Keyword pattern is the keyword itself
- Identifiers and other tokens have more complex patterns matching multiple strings
- A lexeme is a sequence of characters that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token
Examples of Tokens
Token | Informal Description | Sample Lexemes |
---|---|---|
if | characters i, f | if |
else | characters e, l, s, e | else |
comparison | <, >, <=, >=, ==, != | <=, != |
id | letter followed by letters and digits | pi, score, D2 |
number | any numeric constant | 3.14159, 0, 6.02e23 |
literal | anything but ", surrounded by "s | "core dumped" |
Tokens
- In programming languages, these classes cover most tokens:
  - One token for each keyword (e.g., if, else)
  - Tokens for the operators (e.g., comparison operators)
  - One token for identifiers
  - One or more tokens for constant values (numbers, literals)
  - Tokens for punctuation (e.g., parentheses, commas, semicolons)
Attributes for Tokens
- Lexical analyzers often return attribute values describing the lexeme
- Token name influences parsing decisions
- Attribute value impacts translation after parsing
Example (Fortran statement attributes)
- E = M * C ** 2
- Token sequence:
  - <id, pointer to symbol-table entry for E>
  - <assign_op>
  - <id, pointer to symbol-table entry for M>
  - <mult_op>
  - <id, pointer to symbol-table entry for C>
  - <exp_op>
  - <number, integer value 2>
Lexical Errors
- Lexical errors are difficult to detect without the aid of other components
- Example: fi ( a == f(x))
- The lexical analyzer returns fi as a lexeme for the token id; the parser then handles errors such as a misspelled if or a call to an undeclared function fi
Interface to Other Phases
- The lexical analyzer reads characters and may push characters back onto the input when it looks ahead during recognition
- This is implemented with a buffer that holds the input and pointers that move through it
Input Buffering
- The end of an identifier is not known until a non-letter/non-digit character is encountered
- A single-character operator may also be the beginning of a two-character operator (e.g., in C, - may begin -> or --, = may begin ==, < may begin <=)
- A two-buffer scheme handles large lookaheads safely
- "Sentinels" improve efficiency by avoiding separate buffer-end checks
Buffer Pairs
- Two buffers hold the input
- While one buffer is being processed, the other is reloaded from the source and then used
- Each buffer holds N characters (e.g., N = 4096 bytes); one system read command fills a buffer
- An "eof" character marks the end of the source file, which is needed when fewer than N characters remain
Buffer Pairs (continued)
- lexemeBegin marks the start of the current lexeme
- forward scans ahead looking for a pattern match
- Advancing forward may require reloading one of the two buffers
Sentinels
- Reduce the number of end-of-buffer checks needed while still allowing lookahead
- Combine the end-of-buffer test with the test on the current character
- The character "eof" serves as the sentinel (it cannot be part of the program itself)
Lookahead Code with Sentinels
- A switch statement on the current character checks for eof: at the end of a buffer it reloads the other buffer, and at the true end of the input it terminates (see the sketch below)
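- A minimal sketch of this scheme, assuming a hypothetical fill_buffer() that reads from stdin and using '\0' as a stand-in for the eof sentinel; names and sizes are illustrative:

```c
/* Two-buffer input scheme with sentinels: each buffer ends in a sentinel, and
 * a sentinel inside a buffer marks the true end of the input. */
#include <stdio.h>

#define BUF_SIZE 4096
#define EOF_SENTINEL '\0'            /* stands in for the "eof" character */

static char buf[2][BUF_SIZE + 1];    /* extra slot for the sentinel */
static char *forward;                /* scanning pointer */
static int   current;                /* buffer that 'forward' is in */

static void fill_buffer(int which) {
    size_t n = fread(buf[which], 1, BUF_SIZE, stdin);
    buf[which][n] = EOF_SENTINEL;    /* also marks the true end of input */
}

static void init_input(void) {
    current = 0;
    fill_buffer(0);
    forward = buf[0];
}

/* Advance 'forward' by one character, reloading the other buffer when the
 * sentinel at the end of a full buffer is reached. Returns EOF_SENTINEL at
 * the true end of the input. */
static char next_char(void) {
    char c = *forward++;
    if (c == EOF_SENTINEL) {
        if (forward == buf[current] + BUF_SIZE + 1) {   /* sentinel at end of a full buffer */
            current = 1 - current;                       /* switch buffers */
            fill_buffer(current);
            forward = buf[current];
            c = *forward++;
        }
        /* otherwise: sentinel inside the buffer, so this is the end of input */
    }
    return c;
}

int main(void) {
    init_input();
    for (char c = next_char(); c != EOF_SENTINEL; c = next_char())
        putchar(c);                  /* echo the input to demonstrate the scheme */
    return 0;
}
```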
How to Describe Tokens
- Programming language tokens can be described using regular languages
- Regular languages are well-understood, easy to implement
Regular Expressions Notations (Basic Concepts)
- Alphabet: Any finite set of symbols (e.g., letters, digits, punctuation)
- String: Finite sequence of symbols in an alphabet; "empty string" (ε) has length zero
- Language: Any countable set of strings over a fixed alphabet
Regular Expressions Notations (Continued)
- Epsilon (ε): A regular expression ε denoting the language containing only the empty string
- Repetition (L*): Set of strings created by concatenating elements of L zero or more times, e.g., ((a | b) . a)*
Terms for Parts of Strings
- Prefix: String obtained by removing zero or more symbols from the end
- Suffix: String obtained by removing zero or more symbols from the beginning
- Substring: String obtained by deleting any prefix and suffix
- Proper prefix/suffix/substring: Not equal to the entire string
- Subsequence: Any string formed by deleting zero or more (not necessarily consecutive) positions
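- For instance, with the string banana: ban is a proper prefix, nana is a proper suffix, nan is a substring, and bnn is a subsequence (its positions need not be consecutive)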
Operations on Languages
- Union (L U M): All strings in L or in M
- Concatenation (L M): All strings formed by concatenating a string in L and a string in M
- Kleene closure (L*): All strings formed by concatenating zero or more strings in L
- Positive closure (L+): All strings formed by concatenating one or more strings in L
Operations on Languages (Continued)
- (p) parentheses: Define the order of operations.
- pq: Concatenation of two token definitions into a valid token definition
- p | q: The union of two token definitions
- p* (Kleene star): Matches zero or more instances of token definition p
- p+: Matches one or more instances of token definition p
- p?: Matches zero or one instance of token definition p
- [a-z]: Character set alternation (representing a range of characters)
Example (Languages)
- L=letters, D=digits
- L∪D = letters and digits
- LD = letter followed by a digit
Example of C Identifier Tokens
- Valid C identifiers are sequences of letters, digits, and underscores that do not begin with a digit.
- Example regular expression: (letter | _) (letter | _ | digit)*
Example (Unsigned Numbers)
- Unsigned numbers are integers and floating points, examples: 5280, 0.01234, 6.336E4, 1.89E-4
- Regular expression to define them: digits optionalFraction optionalExponent
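- Spelled out in the usual notation (this expansion of the sub-patterns is an illustration consistent with the examples above):

```
digit             → [0-9]
digits            → digit digit*
optionalFraction  → ( . digits ) | ε
optionalExponent  → ( E (+ | - | ε) digits ) | ε
number            → digits optionalFraction optionalExponent
```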
Transition Diagrams
- Implement regular expressions; consists of:
- Set of states S
- Alphabet Σ
- Start state n
- Final states F
- Set of transitions (state -> input -> state)
Pictorial Notations
- A state is drawn as a circle
- A final (accepting) state is drawn as a double circle
- A transition from state i to state j on input symbol a is drawn as an arrow labeled a (written i -> a -> j)
How to Recognize Tokens
- Example: the relop token (relational operators), and how to construct an analyzer that returns <token, attribute> pairs
Transition Diagram for Relops
- Diagram showing the transitions for relational operators (<, <=, =, >, >=)
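- A sketch of this diagram as code, covering the five operators listed above; the next_char()/retract() helpers and the token codes are assumptions for the example:

```c
/* Transition-diagram-style recognizer for the relops <, <=, =, >, >=.
 * next_char()/retract() stand in for the buffered-input interface. */
#include <stdio.h>

typedef enum { LT, LE, EQ, GT, GE, RELOP_FAIL } Relop;

static const char *src;    /* demo input */
static size_t pos;
static char next_char(void) { return src[pos++]; }   /* '\0' ends the input */
static void retract(void)   { if (pos > 0) pos--; }

static Relop get_relop(void) {
    int state = 0;
    for (;;) {
        char c = next_char();
        switch (state) {
        case 0:                      /* start state */
            if (c == '<') state = 1;
            else if (c == '=') return EQ;
            else if (c == '>') state = 2;
            else { retract(); return RELOP_FAIL; }   /* not a relop: give the character back */
            break;
        case 1:                      /* seen '<' */
            if (c == '=') return LE;
            retract();               /* '<' not followed by '=' */
            return LT;
        case 2:                      /* seen '>' */
            if (c == '=') return GE;
            retract();
            return GT;
        }
    }
}

int main(void) {
    src = "<=";
    pos = 0;
    printf("relop code: %d\n", get_relop());   /* prints the code for LE */
    return 0;
}
```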
Transition Diagram for Identifier
- Diagram showing the transition for identifiers recognition
Transition Diagram for Unsigned Numbers
- Transitions showing how to identify unsigned numbers (integers and real numbers)
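- A sketch of the unsigned-number diagram as an explicit state machine; it checks whether a whole string matches the pattern (a real lexer would instead do longest-match with retraction, as described above):

```c
/* State machine for the pattern digits (. digits)? (E (+|-)? digits)?. */
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

static bool is_unsigned_number(const char *s) {
    enum { START, INT, FRAC_START, FRAC, EXP_START, EXP_SIGN, EXP } state = START;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        switch (state) {
        case START:      if (isdigit(c)) state = INT;         else return false; break;
        case INT:        if (isdigit(c)) state = INT;         /* integer part   */
                         else if (c == '.') state = FRAC_START;
                         else if (c == 'E') state = EXP_START;
                         else return false;                   break;
        case FRAC_START: if (isdigit(c)) state = FRAC;        else return false; break;
        case FRAC:       if (isdigit(c)) state = FRAC;        /* fraction part  */
                         else if (c == 'E') state = EXP_START;
                         else return false;                   break;
        case EXP_START:  if (isdigit(c)) state = EXP;
                         else if (c == '+' || c == '-') state = EXP_SIGN;
                         else return false;                   break;
        case EXP_SIGN:   if (isdigit(c)) state = EXP;         else return false; break;
        case EXP:        if (isdigit(c)) state = EXP;         else return false; break;
        }
    }
    /* Accepting states: a complete integer, fraction, or exponent part. */
    return state == INT || state == FRAC || state == EXP;
}

int main(void) {
    printf("%d %d %d\n",
           is_unsigned_number("5280"),
           is_unsigned_number("12.34E56"),
           is_unsigned_number("1.89E-"));   /* prints: 1 1 0 */
    return 0;
}
```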
LEX (Lexical Analyzer Generator)
- Generates a lexical analyzer program from regular expressions
- Takes a list of regular expressions (in priority order) as input
- Produces code that generates tokens and the associated book-keeping information
- The generated program reads the input, breaks it into tokens, and reports errors (unexpected characters)
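- For example, a small Lex/Flex-style specification along these lines; the patterns and token codes are illustrative, not taken from the original material:

```lex
%{
#include <stdio.h>
/* Illustrative token codes; a real compiler would share these with the parser. */
enum { TOK_IF = 256, TOK_ID, TOK_NUMBER, TOK_RELOP };
%}

digit   [0-9]
letter  [A-Za-z_]

%%
[ \t\n]+                                 { /* skip whitespace */ }
"if"                                     { return TOK_IF; }
{letter}({letter}|{digit})*              { return TOK_ID; }
{digit}+(\.{digit}+)?(E[+-]?{digit}+)?   { return TOK_NUMBER; }
"<"|"<="|"="|">"|">="                    { return TOK_RELOP; }
.                                        { fprintf(stderr, "unexpected character: %s\n", yytext); }
%%

int yywrap(void) { return 1; }

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)          /* yylex() returns 0 at end of input */
        printf("token %d, lexeme \"%s\"\n", tok, yytext);
    return 0;
}
```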
Description
This quiz explores the role and processes of the lexical analyzer in a compiler. It covers how the lexical analyzer isolates words, generates tokens, and interacts with the parser, and it highlights related tasks such as removing comments and whitespace and reporting errors during analysis.