Lexical Analyzer Overview
24 Questions


Questions and Answers

What is the primary output of the lexical analyzer?

  • A list of lexemes
  • A compilation error report
  • A stream of source code
  • A sequence of tokens (correct)

    The lexical analyzer is responsible for syntax analysis.

    False

    What is a lexeme?

    A sequence of characters that forms a single unit of meaning in a language.

    The process of __________ removes comments and consecutive whitespace characters from the input.

    scanning

    Match the following terms with their correct descriptions:

    Lexeme = A single unit of meaning in a language
    Token = An abstract symbol representing a lexical unit
    Scanner = Responsible for simple processes like whitespace deletion
    Parser = Analyzes the token stream for syntax correctness

    Which of the following is NOT a responsibility of the lexical analyzer?

    Generate syntax trees from tokens

    Lexemes can include keywords, identifiers, and literals.

    True

    A token consists of a token name and an optional __________ value.

    attribute

    What is the function of a token in lexical analysis?

    To represent the type of lexeme

    Tokens are used by parsers to understand the structure of the entire program.

    False

    Name two types of tokens mentioned in the content.

    KEYWORD, IDENTIFIER

    In C, single-character operators like '-' could also signify the beginning of a __________ operator.

    two-character

    Match the following token types with their examples:

    KEYWORD = int
    IDENTIFIER = age
    OPERATOR = =
    NUMBER = 25

    Which of the following is NOT a type of token?

    VARIABLE

    Whitespace and comments are ignored during the recognition of tokens.

    True

    What is the purpose of an input buffer in lexical analysis?

    To hold the input and move pointers over it.

    What should happen if a failure occurs in all transition diagrams?

    Raise a lexical error

    The longest possible lexeme for a given token should be identified.

    True

    What is the output of a lexical analyzer generator?

    A program that reads the input character stream and breaks it into tokens.

    If the input is '12.34E56', the accept state will be reached after the first ___ digits.

    two

    Match the terms with their definitions:

    Lexeme = The longest sequence of characters in a token
    Token = A category of input recognized by the lexer
    Lexical error = Unexpected characters found during analysis
    Lexical analyzer = A program that breaks input into tokens

    What is required for the input to the lexical analyzer generator?

    List of regular expressions in priority order

    It is acceptable to use a simpler transition diagram to implement a lexical analyzer.

    True

    What happens when a forward pointer fails in a transition diagram?

    Retract to the start state and activate the next diagram.

    Study Notes

    Lexical Analyzer

    • Attempts to isolate the words in an input string
    • A word is also called a lexeme, lexical item, or lexical token
    • A lexeme is a string of input characters treated as a unit and passed to the next compilation phase
    • The output of the lexical phase is a stream of tokens

    Role of the Lexical Analyzer

    • Reads source program characters, grouping them into lexemes and producing a sequence of tokens for each lexeme
    • The stream of tokens is sent to the parser for syntax analysis
    • Interacts with the symbol table, entering lexemes (identifiers) into it

    Interactions Between Lexical Analyzer and Parser

    • Source program → Lexical Analyzer → Token → Parser → Semantic analysis

    Other Tasks Besides Identifying Lexemes

    • Removing comments and whitespace
    • Correlating error messages with the source program

    Important Processes of Lexical Analyzers

    • Divided into a cascade of two processes
      • Scanning: Simple processes not requiring tokenization, e.g. deleting comments, compacting whitespace
      • Lexical analysis: The more complex portion, where the scanner produces the sequence of tokens.

    Tokens, Patterns, and Lexemes

    • A token is a pair: token name and optional attribute value
    • The token name is an abstract symbol representing a lexical unit (e.g., a keyword, or a sequence of input characters denoting an identifier)
    • Token names are input symbols processed by the parser

    Example

    • C statement: printf ("Total = %d\n", score);
    • printf and score are lexemes that match the pattern for token id.
    • "Total = %d\n" is a lexeme matching literal.

    Handling Keywords

    • A lexeme is a sequence of characters forming a single unit of meaning in a language
    • Example: int age = 25;
    • Lexemes = int, age, =, 25, ;
    • Tokens categorize lexemes (KEYWORD(int), IDENTIFIER(age), OPERATOR(=) , NUMBER(25), SEMICOLON)
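The grouping of `int age = 25;` into lexemes and token categories can be sketched as a small scanner. This is a minimal illustration, not a production lexer: the token names and the `next_token` function are hypothetical, and only `int` is treated as a keyword.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token categories for the example "int age = 25;" */
typedef enum { KEYWORD, IDENTIFIER, OPERATOR, NUMBER, SEMICOLON, END } TokenType;

/* Return the token type of the next lexeme starting at *p,
   copying the lexeme into buf and advancing *p past it. */
TokenType next_token(const char **p, char *buf) {
    while (**p == ' ') (*p)++;                       /* skip whitespace */
    const char *s = *p;
    if (*s == '\0') { buf[0] = '\0'; return END; }
    if (isalpha((unsigned char)*s) || *s == '_') {   /* identifier or keyword */
        while (isalnum((unsigned char)**p) || **p == '_') (*p)++;
        size_t n = (size_t)(*p - s);
        memcpy(buf, s, n); buf[n] = '\0';
        return strcmp(buf, "int") == 0 ? KEYWORD : IDENTIFIER;
    }
    if (isdigit((unsigned char)*s)) {                /* number */
        while (isdigit((unsigned char)**p)) (*p)++;
        size_t n = (size_t)(*p - s);
        memcpy(buf, s, n); buf[n] = '\0';
        return NUMBER;
    }
    buf[0] = *s; buf[1] = '\0'; (*p)++;              /* single-character token */
    return *s == ';' ? SEMICOLON : OPERATOR;
}
```

Called repeatedly on `"int age = 25;"`, this yields KEYWORD(int), IDENTIFIER(age), OPERATOR(=), NUMBER(25), SEMICOLON — the categorization described above.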

    Lexical Analysis

    • Recognize tokens, ignore whitespace and comments
    • Example: if (x1 * x2 < 1.0){}
    • Generates the token stream: if ( x1 * x2 < 1.0 ) { }

    Specifying and Recognizing Tokens

    • Model tokens using regular expressions
    • Recognize them using finite automata

    Tokens, Patterns, and Lexemes (continued)

    • Pattern describes the form of a token's lexemes
    • Keyword pattern is the keyword itself
    • Identifiers and other tokens have more complex patterns matching multiple strings
    • A lexeme matches a token's pattern, identified by the lexical analyzer

    Examples of Tokens

    Token Informal Description Sample Lexemes
    if characters i, f if
    else characters e, l, s, e else
    Comparison <, >, <=, >=, ==, != <=, !=
    id letter followed by letters and digits pi, score, D2
    number any numeric constant 3.14159, 0, 6.02e23
    literal anything but ", surrounded by "s "core dumped"

    Tokens

    • In programming languages, these classes cover most tokens:
      • One token for each keyword (e.g., if, else)
      • Tokens for operators (e.g., comparison operators)
      • One token for identifiers
      • One or more tokens for constant values (numbers, literals)
      • Tokens for punctuation (e.g., parentheses, commas, semicolons)

    Attributes for Tokens

    • Lexical analyzers often return attribute values describing the lexeme
    • Token name influences parsing decisions
    • Attribute value impacts translation after parsing

    Example (Fortran statement attributes)

    • E = M * C ** 2
    • <id, pointer to symbol-table entry for E>, <assign_op>, <id, pointer to symbol-table entry for M>, <mult_op>, <id, pointer to symbol-table entry for C>, <exp_op>, <number, integer value 2>
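The token-name/attribute pairing above can be sketched as a small C structure. The type and constructor names here are hypothetical, and the attribute union is deliberately minimal (real compilers store richer attribute types):

```c
#include <assert.h>

/* Token names from the E = M * C ** 2 example above. */
typedef enum { TOK_ID, TOK_ASSIGN_OP, TOK_MULT_OP, TOK_EXP_OP, TOK_NUMBER } TokenName;

/* A token pairs an abstract name with an optional attribute value:
   a symbol-table index for identifiers, a numeric value for numbers,
   and nothing at all for operators. */
typedef struct {
    TokenName name;
    union {
        int  symtab_index;   /* for TOK_ID: entry in the symbol table */
        long value;          /* for TOK_NUMBER: the numeric value */
    } attr;
} Token;

Token make_id(int symtab_index) {
    Token t = { TOK_ID, { .symtab_index = symtab_index } };
    return t;
}

Token make_number(long value) {
    Token t = { TOK_NUMBER, { .value = value } };
    return t;
}
```

The statement would then be emitted as make_id(entry for E), {TOK_ASSIGN_OP}, make_id(entry for M), {TOK_MULT_OP}, make_id(entry for C), {TOK_EXP_OP}, make_number(2).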

    Lexical Errors

    • Lexical errors are difficult to detect without the aid of other components
    • Example: fi ( a == f(x))
    • The lexical analyzer returns fi as a valid id token; it cannot tell whether fi is a misspelled keyword if or an undeclared function name, so a later phase (e.g., the parser) must report the error

    Interface to Other Phases

    • Lexical analyzer reads characters and pushes back characters to look ahead for recognition
    • Implemented through a buffer to keep input in and move pointers

    Input Buffering

    • Uncertain of identifier end until a non-letter/digit char is encountered
    • A single-character operator can be the beginning of a two-character operator (e.g., < may begin <=, = may begin ==)
    • A two-buffer scheme manages large lookaheads safely
    • "Sentinels" improve efficiency by avoiding buffer end checks

    Buffer Pairs

    • Two buffers hold the input
    • While one buffer is being scanned, the other is reloaded from the input
    • Each buffer holds N characters, e.g., N = 4096 bytes; one system read command fills a buffer
    • The character "eof" marks the end of the source file, useful when fewer than N characters remain

    Buffer Pairs (continued)

    • lexemeBegin marks the start of the current lexeme
    • forward scans ahead looking for pattern matches
    • Moving forward may require reloading either buffer
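The lexemeBegin/forward discipline can be sketched as a function that extends a match as far as possible and reports how far forward moved. The function name is hypothetical; it illustrates only the pointer mechanics, not the buffer reloading:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* lexemeBegin marks where the current lexeme starts; forward scans
   ahead until the pattern can no longer be extended. The lexeme is
   then the text between the two pointers. */
int scan_identifier(const char *lexemeBegin, char *out) {
    const char *forward = lexemeBegin;
    while (isalnum((unsigned char)*forward) || *forward == '_')
        forward++;                        /* extend the match as far as possible */
    size_t n = (size_t)(forward - lexemeBegin);
    memcpy(out, lexemeBegin, n);
    out[n] = '\0';
    return (int)n;                        /* how far forward moved */
}
```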

    Sentinels

    • Reduce the two tests per input character (is this the end of the buffer? what is the character?) to one
    • Combine the end-of-buffer test with the test for the current character
    • The character "eof" serves as the sentinel (it cannot be part of the source program)

    Lookahead Code with Sentinels

    • Switch statement checks for eof and buffer ends.
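The switch-based lookahead can be sketched as a tiny double buffer over an in-memory string. This is an illustrative sketch: the `DoubleBuffer` type and function names are hypothetical, `'\0'` stands in for the "eof" sentinel, and N is made tiny so the reload path is exercised.

```c
#include <assert.h>
#include <string.h>

#define N 4            /* tiny buffer half for illustration; the slides use 4096 */
#define SENTINEL '\0'  /* stand-in for the "eof" sentinel character */

/* Two buffer halves, each N characters plus one sentinel slot. */
typedef struct {
    char buf[2][N + 1];
    const char *src;   /* remaining source text, standing in for the file */
    char *forward;     /* scanning pointer */
    int cur;           /* which half forward is currently in */
} DoubleBuffer;

/* Fill one half from the source and place the sentinel after the data. */
static void reload(DoubleBuffer *db, int half) {
    size_t n = strlen(db->src);
    if (n > N) n = N;
    memcpy(db->buf[half], db->src, n);
    db->buf[half][n] = SENTINEL;
    db->src += n;
}

void db_init(DoubleBuffer *db, const char *src) {
    db->src = src;
    db->cur = 0;
    reload(db, 0);
    db->forward = db->buf[0];
}

/* Return the next character, or -1 at end of input. The sentinel test is
   the only per-character check; only when the sentinel is actually seen
   do we decide between "end of this half" and "end of input". */
int db_next(DoubleBuffer *db) {
    char c = *db->forward++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (db->forward - 1 == db->buf[db->cur] + N) {
        int other = 1 - db->cur;       /* sentinel at end of half: switch halves */
        reload(db, other);
        db->cur = other;
        db->forward = db->buf[other];
        return db_next(db);
    }
    return -1;                         /* sentinel mid-buffer: true end of input */
}
```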

    How to Describe Tokens

    • Programming language tokens can be described using regular languages
    • Regular languages are well-understood, easy to implement

    Regular Expressions Notations (Basic Concepts)

    • Alphabet: Any finite set of symbols (e.g.,letters,digits, punctuation)
    • String: Finite sequence of symbols in an alphabet; "empty string" (ε) has length zero
    • Language: Any countable set of strings over a fixed alphabet

    Regular Expressions Notations (Continued)

    • Epsilon (ε): A regular expression denoting the language containing only the empty string
    • Repetition (L*): Set of strings created by concatenating elements of L zero or more times, e.g., ((a | b) a)*

    Terms for Parts of Strings

    • Prefix: String obtained by removing zero or more symbols from the end
    • Suffix: String obtained by removing zero or more symbols from the beginning
    • Substring: String obtained by deleting any prefix and suffix
    • Proper prefix/suffix/substring: Not equal to the entire string
    • Subsequence: Any string formed by deleting zero or more (not necessarily consecutive) positions

    Operations on Languages

    • Union (L U M): All strings in L or in M
    • Concatenation (L M): All strings formed by concatenating a string in L and a string in M
    • Kleene closure (L*): All strings formed by concatenating zero or more strings in L
    • Positive closure (L+): All strings formed by concatenating one or more strings in L

    Operations on Languages (Continued)

    • (p) parentheses: Define the order of operations.
    • pq: Concatenation of two token definitions into a valid token definition
    • p | q: The union of two token definitions
    • p* (Kleene star): Matches zero or more instances of token definition p
    • p+: Matches one or more instances of token definition p
    • p?: Matches zero or one instance of token definition p
    • [a-z]: Character class (alternation over a range of characters)

    Example (Languages)

    • L = letters, D = digits
    • L ∪ D = the set of letters and digits
    • LD = a letter followed by a digit

    Example of C Identifier Tokens

    • Valid C identifiers are sequences of letters, digits, and underscores, beginning with a letter or underscore
    • Regular expression: (letter | _) (letter | _ | digit)*
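The identifier pattern translates directly into a checking function. A minimal sketch (the function name is illustrative):

```c
#include <assert.h>
#include <ctype.h>

/* Checks a string against (letter | _)(letter | _ | digit)*,
   the pattern for C identifiers given above. */
int is_c_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                        /* must not start with a digit (or be empty) */
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;
    return 1;
}
```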

    Example (Unsigned Numbers)

    • Unsigned numbers are integers and floating-point numbers, e.g., 5280, 0.01234, 6.336E4, 1.89E-4
    • Regular expression to define them: digits optionalFraction optionalExponent, i.e., digit+ (. digit+)? (E (+ | -)? digit+)?
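The unsigned-number pattern can be transcribed into a hand-written recognizer that mirrors the three parts of the regular expression. A sketch (the function name is illustrative, and it accepts only if the whole string matches):

```c
#include <assert.h>
#include <ctype.h>

/* Recognizer for digits optionalFraction optionalExponent,
   i.e. digit+ (. digit+)? (E (+|-)? digit+)?  */
int is_unsigned_number(const char *s) {
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) s++;          /* digit+ */
    if (*s == '.') {                                 /* optional fraction */
        s++;
        if (!isdigit((unsigned char)*s)) return 0;   /* '.' must be followed by digits */
        while (isdigit((unsigned char)*s)) s++;
    }
    if (*s == 'E') {                                 /* optional exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    return *s == '\0';                               /* accept only at end of string */
}
```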

    Transition Diagrams

    • Implement regular expressions; consists of:
      • Set of states S
      • Alphabet Σ
      • Start state n
      • Final states F
      • Set of transitions (state -> input -> state)

    Pictorial Notations

    • A state's representation
    • A final state's representation
    • A transition from state i to j on input symbol a (written i -> a -> j)

    How to Recognize Tokens

    • Example: relational operators (relop); the goal is to construct an analyzer that returns <token, attribute> pairs

    Transition Diagram for Relops

    • Diagram showing the transitions for relational operators (<, <=, =, >, >=)
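The relop transition diagram can be transcribed as a switch over at most two characters of lookahead. A sketch: the token names and `scan_relop` are hypothetical, `<>` (not-equal, as in Pascal-style relops) is included for completeness, and `*len` tells the caller how far the forward pointer actually moved, so it can retract after a one-character lookahead fails.

```c
#include <assert.h>

/* Relational-operator tokens (illustrative names). */
typedef enum { LT, LE, NE, EQ, GT, GE, RELOP_FAIL } Relop;

/* Transcription of the relop transition diagram: <, <=, <>, =, >, >= */
Relop scan_relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;          /* retract: the lookahead char is not ours */
    case '=':
        *len = 1; return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;
    default:
        *len = 0; return RELOP_FAIL;  /* failure: activate the next diagram */
    }
}
```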

    Transition Diagram for Identifier

    • Diagram showing the transition for identifiers recognition

    Transition Diagram for Unsigned Numbers

    • Transitions showing how to identify unsigned numbers (integers and real numbers)

    LEX (Lexical Analyzer Generator)

    • Generates lexical analyzer programs from regular expressions
    • Takes a list of regular expressions (in priority order) as input
    • Creates code generating the token and book-keeping information
    • Program reads the input, breaks it into tokens, reports errors (unexpected characters)
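A minimal sketch of what such an input specification looks like in Lex/Flex, consistent with the description above. The token codes and the exact pattern set are illustrative, not from the source; note that keywords are listed before the identifier rule because the rules are matched in priority order.

```lex
%{
/* Hypothetical token codes; a real scanner would share these with the parser. */
enum { KEYWORD = 1, IDENTIFIER, NUMBER, RELOP };
%}

digit   [0-9]
letter  [a-zA-Z_]

%%
"if"|"else"|"int"           { return KEYWORD; /* keywords first: priority order */ }
{letter}({letter}|{digit})* { return IDENTIFIER; }
{digit}+(\.{digit}+)?       { return NUMBER; }
"<"|"<="|"="|">"|">="       { return RELOP; }
[ \t\n]+                    ; /* skip whitespace */
.                           { /* unexpected character: report a lexical error */ }
%%
```

From this list of regular expressions, the generator produces a program (`yylex`) that reads the input character stream, returns the matching token codes, and falls through to the last rule on unexpected characters.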


    Related Documents

    Lexical Analyzer Lecture PDF

    Description

    This quiz explores the role and processes of the lexical analyzer in programming. It covers how the lexical analyzer isolates words, generates tokens, and interacts with the parser. It also highlights related tasks such as removing comments and correlating error messages with the source program.
