Lab 02: Implementing a Simple Tokenizer Using Flex

Summary

This document is a lab handout on implementing a simple tokenizer using Flex, a tool for generating lexical analyzers. It covers tokenization, Flex, the structure of a Flex file, and a worked implementation with code examples. It is useful for compiler-design students learning how to build a tokenizer and how to use Flex.

Full Transcript


LAB-02: Implementing a Simple Tokenizer Using Flex (CSE-3104 Compiler Lab)

What is Tokenization?

- Tokenization is the process of breaking down a stream of characters into meaningful chunks called tokens.
- Tokens include keywords, operators, literals, identifiers, etc.
- Tokenization (lexical analysis) is the first step in building a compiler.

Why use Flex?

- Flex is a tool that generates scanners (lexical analyzers) which match regular expressions against input text.
- It converts source code or text into tokens for further parsing or processing.

Flex File Structure

A Flex file consists of three parts (a minimal skeleton appears in the first sketch following these notes):

- Definitions: declarations, macros, header includes.
- Rules: regular expressions and the actions to be performed when they match.
- User subroutines (code): C/C++ code that handles actions or other program logic.

Simple Tokenizer

This simple tokenizer can identify keywords (if, else), identifiers (x), numbers (10), and operators (+, -).

yywrap()

- Flex generates a function called yywrap() to handle the situation where the lexer (scanner) reaches the end of an input file.
- When yylex() (the lexical analyzer function) finishes processing the current file or stream, it calls yywrap() to determine what to do next (see the second sketch below).
- A typical yywrap() is expected to return:
  - 0 if the lexer should continue reading the next input file (when multiple files were passed to the program).
  - 1 (or any non-zero value) if the lexer has finished reading all input and should stop processing.

%option noyywrap

- When you include %option noyywrap, Flex does not generate or call yywrap().
- This means the lexer will not try to wrap to the next file and will not call yywrap() at the end of the current input.
- If you are only processing a single file or standard input and do not need to handle multiple input sources, you can use this option and avoid defining yywrap() yourself.

Simple Tokenizer Reading Input from a File

- argc (argument count): holds the number of command-line arguments passed to the program, including the program's name itself.
- argv (argument vector): a pointer to an array of strings (character pointers) representing the actual command-line arguments. argv[0] is the program name, argv[1] is the first argument, and so on.
- The condition argc > 1 checks whether the user has provided an argument on the command line. If argc is greater than 1, an input file has been specified as the first argument (i.e., argv[1]).
- fopen is a standard C library function used to open a file. Here argv[1] (which should be a file path) is opened in "r" mode, indicating that we want to open the file for reading. fopen returns a FILE * (a pointer to a FILE object), which is used to interact with the file. If the file could not be opened (e.g., the file does not exist or there are permission issues), fopen returns NULL.
- !file checks whether the file pointer is NULL. perror("fopen") prints a descriptive error message to the standard error stream (stderr) based on the reason fopen failed (e.g., "No such file or directory"). return EXIT_FAILURE stops the program and returns a non-zero exit code, indicating an error.
- yyin is a special global variable used by Flex to specify the input stream. By default, yyin is set to stdin (standard input), but we can change it to read from a file. Here we assign the opened FILE *file to yyin, which tells Flex to read from the specified file instead of standard input. yylex() then reads from the input stream (yyin) and matches the patterns specified in the Flex rules. The third sketch below puts all of these pieces together.
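First sketch: a minimal illustration of the three-part Flex file layout described under "Flex File Structure". The specific rule and messages are illustrative assumptions, not the lab's exact file.

    %{
    /* Definitions section: C declarations, macros, header includes */
    #include <stdio.h>
    %}
    %option noyywrap

    %%
    [0-9]+    { printf("NUMBER: %s\n", yytext); }   /* Rules section */
    .|\n      { /* skeleton: ignore everything else */ }
    %%

    /* User subroutines section: ordinary C code */
    int main(void) {
        yylex();    /* run the scanner on standard input */
        return 0;
    }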
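Second sketch: if yywrap() is kept (i.e., %option noyywrap is omitted), a hand-written version placed in the user-subroutines section might look like this. The file name "second.txt" is a hypothetical example used only to illustrate the 0 / non-zero return convention.

    /* Hypothetical yywrap(): switches yyin to one more file, then stops.
       "second.txt" is an assumed file name for illustration. */
    int yywrap(void) {
        static int wrapped = 0;
        if (!wrapped) {
            FILE *next = fopen("second.txt", "r");
            if (next) {
                yyin = next;    /* continue scanning from the next file */
                wrapped = 1;
                return 0;       /* 0: more input remains, keep scanning */
            }
        }
        return 1;               /* non-zero: all input consumed, stop */
    }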
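Third sketch: a complete tokenizer along the lines described above, combining the token rules (keywords, identifiers, numbers, operators) with the argc/argv, fopen, and yyin logic. The exact regexes and output format are assumptions consistent with the notes, not necessarily the lab's reference solution.

    %{
    #include <stdio.h>
    #include <stdlib.h>
    %}
    %option noyywrap

    %%
    "if"|"else"             { printf("KEYWORD: %s\n", yytext); }
    [a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENTIFIER: %s\n", yytext); }
    [0-9]+                  { printf("NUMBER: %s\n", yytext); }
    "+"|"-"                 { printf("OPERATOR: %s\n", yytext); }
    [ \t\n]+                { /* skip whitespace */ }
    .                       { printf("UNKNOWN: %s\n", yytext); }
    %%

    int main(int argc, char *argv[]) {
        if (argc > 1) {                        /* a file path was supplied */
            FILE *file = fopen(argv[1], "r");  /* open argv[1] for reading */
            if (!file) {
                perror("fopen");               /* e.g. "No such file or directory" */
                return EXIT_FAILURE;
            }
            yyin = file;                       /* scan the file instead of stdin */
        }
        yylex();
        return EXIT_SUCCESS;
    }

Assuming the file is saved as tokenizer.l (a hypothetical name), it could be built and run roughly as: flex tokenizer.l && cc lex.yy.c -o tokenizer && ./tokenizer input.txt. Note that the keyword rule is listed before the identifier rule: both match "if" with the same length, and Flex breaks such ties in favor of the earlier rule.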
Challenges in Tokenization

- Ambiguity in tokens: how do you handle situations like distinguishing == from =?
- Handling more complex numbers: the current tokenizer only handles integers, but how would you handle floating-point numbers?
- Whitespace: the tokenizer simply skips over spaces and newlines, but what if we need to track line numbers for error reporting?

Improving Tokenizer Features

1. How would you modify the tokenizer to support floating-point numbers?
   - The current tokenizer matches integers with [0-9]+.
   - How would you modify this regex to match 1.23, 3.1415, etc.?
2. How can you handle multi-character operators like == (as opposed to the single-character =)?
   - Would you handle these as separate tokens or use lookahead techniques?
3. How would you handle comments and whitespace in your tokenizer?
   - Ignore spaces and tabs?
   - Skip single-line (//) and multi-line (/* ... */) comments?
4. Can you implement error handling for invalid tokens (e.g., unrecognized characters)?
   - Consider how to handle unexpected characters (like # or @) and report meaningful error messages.

One possible set of answers is sketched below. Thank You
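These questions are left as exercises; the sketch below is one hedged illustration of possible answers (the regexes and token names are assumptions, not the lab's solutions). It adds a float rule, relies on Flex's longest-match behavior to separate == from =, skips both comment styles, counts newlines in line_no, and reports unrecognized characters with a line number.

    %{
    #include <stdio.h>
    int line_no = 1;    /* track line numbers for error reporting */
    %}
    %option noyywrap

    %%
    [0-9]+"."[0-9]+              { printf("FLOAT: %s\n", yytext); }  /* 1.23, 3.1415 */
    [0-9]+                       { printf("INT: %s\n", yytext); }
    "=="                         { printf("EQ\n"); }      /* longest match beats "=" */
    "="                          { printf("ASSIGN\n"); }
    "//".*                       { /* skip single-line comment */ }
    "/*"([^*]|"*"+[^*/])*"*"+"/" { /* skip multi-line comment; newlines inside
                                      it are not counted in this sketch */ }
    \n                           { line_no++; }   /* count lines instead of ignoring */
    [ \t]+                       { /* skip spaces and tabs */ }
    .                            { fprintf(stderr, "line %d: invalid token '%s'\n",
                                           line_no, yytext); }
    %%

    int main(void) {
        yylex();
        return 0;
    }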
