MOD 04 - Language Description & Lexical Analysis.pdf

Language Description & Lexical Analysis Pr. Moulay Youssef Hadi Learning Objectives  After completing this module, you should be able to:  Specify when lexical analyzers act in the process of translating PLs into machine language.  Explain the inputs and outputs of a lexical analyzer and its relationship to other compilation phases.  Apply regular expressions (REs) to describe the syntax of legal tokens.  Map between REs, Finite State Automata (FSAs) and their implementation.  Describe how a lexical analyzer can be implemented.  Distinguish the errors caught by a lexical analyzer from those caught in later phases of the translation process. 2 Language Description & Lexical Analysis Compiler phases Phases of compilation process : Lexical Analysis 1. Lexical Analysis: Converts source code into tokens. 2. Syntax Analysis: Constructs a parse tree from the Syntax Analysis Analysis Phase tokens. 3. Semantic Analysis: Checks for semantic errors and Semantic Analysis annotates the parse tree with type information. 4. Intermediate Code Generation: Produces an Intermediate Code intermediate representation of the code. Generation 5. Optimization: Improves the intermediate code for Code Optimization Synthesis Phase performance. 6. Code Generation: Translates the intermediate code Target Code Generation into target machine code. 3 Language Description & Lexical Analysis Introduction to lexical analysis  Lexical analysis is a fundamental step in the compilation process, responsible for breaking down the source code into a sequence of meaningful units called tokens.  This process lays the groundwork for the subsequent phases of parsing, semantic analysis, and code generation.  By understanding the task of lexical analysis and the tools used to implement it, such as regular expressions and finite automata, we can gain valuable insights into the inner workings of compilers and their ability to transform high-level programming languages into executable machine code. Tokens --------- --------- Lexemes --------- --------- Lexical analyzer Syntax analyzer ---- Source code 4 Request for tokens Language Description & Lexical Analysis Introduction to lexical analysis  Purpose:  Tokenization: Divide the source code into tokens or lexemes, such as keywords, identifiers, literals, and operators.  Ignored Characters: Discard whitespace characters, comments, and other non-essential elements from the source code.  Process: 1. Input: Receive the source code as input, typically in the form of a character stream. 2. Scanning: Read characters from the input stream and group them into lexemes according to predefined rules.  Patterns: Define regular expressions or finite automata to recognize the lexical patterns of tokens.  Tokenization: Identify the type of each lexeme (e.g., keyword, identifier, operator) and its position in the source code. 3. Output: Generate a stream of tokens, each representing a lexeme along with its associated information (e.g., token type, value, position). 5 Language Description & Lexical Analysis Introduction to lexical analysis  Common Lexical Elements (Tokens):  Keywords: Reserved words with special meanings in the language (e.g., if, else, while, int).  Identifiers: Names given to variables, functions, classes, etc. (e.g., variable, functionName, ClassName).  Literals: Constants representing fixed values (e.g., 123, "hello", 3.14).  Operators: Symbols representing operations or calculations (e.g., +, -, *, /).  Delimiters: Symbols used to separate or group elements in the code (e.g., (), {}, ;, ,).  Comments: Annotations or explanations ignored by the compiler (e.g., // single-line comment, ). 6 Language Description & Lexical Analysis Introduction to lexical analysis  Techniques and Tools:  Regular Expressions: Define patterns for recognizing tokens using regular expressions.  Finite Automata: Construct finite automata to recognize lexical patterns efficiently.  Lexical Analyzers (Lexers): Implement lexical analyzers using tools like Lex, Flex, or hand-written code.  Tokenization: Categorize lexemes into different token types and generate a token stream.  Error Handling:  Lexical Errors: Detect and report lexical errors such as invalid characters or unrecognized tokens.  Error Recovery: Implement strategies for recovering from errors and continuing the lexical analysis process. 7 Language Description & Lexical Analysis Introduction to lexical analysis  Three major terms are used in lexical analysis:  Lexical unit: a pair consisting of a lexical unit name and an optional attribute value (optional).  Pattern: is a description of the form that the lexemes in a lexical unit can take. It can be simple or complex.  Lexeme: is a sequence of characters in the source program that is recognized by the lexical unit pattern. Example Symbols (lexical units): Identifier, String, Numeric constant. Keyword (IF, WHILE, DO,...) Operator (special symbol): =,3*B DO A SUP_TOKEN … … 9 Language Description & Lexical Analysis Introduction to lexical analysis After lexical analysis, the code might be represented as a sequence of tokens:  Example: 1. Keyword: int Consider the following C code snippet: 2. Identifier: main 3. Delimiter: ( int main() { 4. Delimiter: ) int x = 10; 5. Delimiter: { return x * 2; 6. Keyword: int } 7. Identifier: x 8. Operator: = 9. Literal: 10 10. Delimiter: ; 11. Keyword: return 12. Identifier: x 13. Operator: * 14. Literal: 2 15. Delimiter: ; 10 16. Delimiter: } Language Description & Lexical Analysis Terminology  Alphabet  The term "alphabet" refers to the set of symbols from which strings (sequences of symbols) are formed.  An alphabet is defined as a finite, non-empty set of symbols.  Formally, let Σ be an alphabet. Then: ∑= {𝑎1 , 𝑎2 , … , 𝑎𝑛 } Where:  𝑎1 , 𝑎2 , … , 𝑎𝑛 are individual symbols or characters.  𝑛 is the cardinality (number of elements) of the alphabet Σ, which must be finite. Example ∑ = 𝑎, 𝑏, … , 𝑧 11 ∑ = 𝛼, 𝛽, … , 𝜑, +,∗, −,/ Language Description & Lexical Analysis Terminology  Alphabet  Properties: 1. Finite Set: The alphabet Σ contains a finite number of symbols. 2. Non-Empty: The alphabet Σ cannot be empty; it must contain at least one symbol.  Examples: 1. Binary Alphabet: {0,1} Contains two symbols: 0 and 1. 2. English Alphabet: {𝑎,𝑏,𝑐,...,𝑧} Contains 26 lowercase letters from a to z. 3. Numeric Alphabet: {0,1,2,...,9} Contains 10 digits from 0 to 9. 12 Language Description & Lexical Analysis Terminology  A "word" refers to a basic unit of language, typically representing a single entity within the source code.  Words are identified during lexical analysis (scanning) and are often referred to as tokens.  Each word/token represents a specific syntactic construct in the programming language. Example 1 On the alphabet ∑ = {0, 1}, we can construct words like: 101001, 11, 100. 𝛆 is the empty word. The concatenation of two words is a word. Example 2 Let ∑ = {0, 1}, if α = 100 and β = 1010 then αβ = 1001010. α² = αα = 100100 α3 = ααα = 100100100 13 Language Description & Lexical Analysis Terminology Question How to formally define the symbols of a language? Answer The best models defining lexical units which lexemes belong are regular languages. Two ways to describe regular languages: Regular expressions Finite automata 14 Language Description & Lexical Analysis Formal language A language based on an alphabet Σ is a set of words built on Σ. Example Σ* is the set of all words constructed on Σ. Note A language constructed on an alphabet Σ is a part of Σ*. We distinguish two particular languages : ø, Σ 15 Language Description & Lexical Analysis Formal language Example of languages Let ∑={a, …, z} L0 = {a} L1 = {aa, ab} L2 = {αaα / α ∈Σ *} = {a, bab, cdacd,...} L3 = {α ∈Σ* / |α|a ≤ 10} : the set of all words whose number of occurrences of a ≤ 10. |. |x∈Σ : Σ∗ →IN , |. |x calculate the number of occurrences of x in a word. 16 Language Description & Lexical Analysis Formal language Operations on formal languages  Union  𝐿1 ‫𝐿 ڂ‬2 = {𝑥 / 𝑥 ∈ 𝐿1 𝑜𝑟 𝑥 ∈ 𝐿2}  Intersection  𝐿1 ‫𝐿 ځ‬2 = {𝑥 / 𝑥 ∈ 𝐿1 𝑎𝑛𝑑 𝑥 ∈ 𝐿2}  Difference  𝐿1 − 𝐿2 = {𝑥 / 𝑥 ∈ 𝐿1 𝑎𝑛𝑑 𝑥 ∉ 𝐿2}  𝐿1 = 𝛴 ∗ − 𝐿1  Concatenation  𝐿1𝐿2 = {𝑥𝑦 /𝑥 ∈ 𝐿1 𝑎𝑛𝑑 𝑦 ∈ 𝐿2} L1={a,b,c} L2={1,2}  𝐿𝑛 = 𝐿𝐿 … 𝐿 L1L2={a1,a2,b1,b2,c1,c2} 17 L2L1={1a,1b,1c,2a,2b,2c} 𝑛 𝑡𝑖𝑚𝑒𝑠 Language Description & Lexical Analysis Formal language Kleene closure  Let L be a language  𝐿∗ = ‫≥𝐾ڂ‬0 𝐿𝑘 Example 𝐿1 = 𝑎, 𝑏 𝐿∗1 = 𝐿01 ‫𝐿 ڂ‬11 ‫𝐿 ڂ‬21 ‫𝐿 ڂ‬31 ‫𝜀 = … ڂ‬, 𝑎, 𝑏, 𝑎𝑎, 𝑎𝑏, 𝑏𝑏, …  𝐿+ is called positive Kleene closure of 𝐿 : 𝐿+ = 𝐿∗ − {𝜀}  Proposition  The following identities are true for all languages (X, Y, Z) :  X(YZ) = (XY)Z  X𝜀 = 𝜀X = X  X(Y ‫ ڂ‬Z) = XY ‫ ڂ‬XZ  (X ‫ ڂ‬Y)Z = XZ ‫ ڂ‬YZ 18  (X*)* = X* Language Description & Lexical Analysis Regular languages Definition  A regular language L on an alphabet Σ is defined recursively as follows:  {𝜀} is a regular language on Σ.  Let 𝑎 ∈ Σ then {a} is a regular language on Σ.  If 𝑅 is a regular language, then 𝑅𝑘 and 𝑅∗ are regular languages on Σ.  If 𝑅1 and 𝑅2 are two regular languages, then 𝑅1 ‫𝑅 ڂ‬2 and 𝑅1 𝑅2 are regular languages. 19 Language Description & Lexical Analysis Regular expression (RE)  Given an alphabet Σ, the regular expressions and the languages they describe are defined as follows: 1. Basic Elements:  Let a ∈ Σ, then a is a regular expression (RE) which describes the language {a}.  ε (epsilon) is a regular expression that describes the language {ε}, where ε denotes the empty string. 20 Language Description & Lexical Analysis Regular expression (RE) 2. Operations on Regular Expressions:  Union (Alternation): If α and β are two regular expressions describing languages A and B respectively, then α+β (or α|β) is a regular expression which describes the language A∪B. This denotes the union of languages A and B.  Concatenation: If α and β are regular expressions describing languages A and B respectively, then αβ is a regular expression which describes the concatenated language AB. This means strings formed by concatenating any string from A with any string from B.  Kleene Star: If α is a regular expression describing a language A, then α* is a regular expression which describes the language A*. This denotes the Kleene star operation, representing the set of strings that can be made by concatenating zero or more strings from A. 21 Language Description & Lexical Analysis Regular expression (RE)  Example 1: Basic Elements  Alphabet Σ: {a,b}  Regular Expressions:  a is a RE that describes the language {a}.  b is a RE that describes the language {b}.  ε is a RE that describes the language {ε}, which is the language containing only the empty string. 22 Language Description & Lexical Analysis Regular expression (RE)  Example 2: Union (Alternation)  Regular Expressions:  a + b or a | b is a RE that describes the language {a, b}.  Language Description:  The language described by a + b is the set of strings that contains either a or b. 23 Language Description & Lexical Analysis Regular expression (RE)  Example 3: Concatenation  Regular Expressions:  ab is a RE that describes the language {ab}.  a(b+c) or a(b|c) is a RE that describes the language {ab,ac}.  Language Description:  The language described by ab is the set containing the single string "ab".  The language described by a(b+c) is the set of strings that starts with a and is followed by either b or c. 24 Language Description & Lexical Analysis Regular expression (RE)  Example 4: Kleene Star  Regular Expressions:  a* is a RE that describes the language {ε,a,aa,aaa,…}.  (ab)* is a RE that describes the language {ε,ab,abab,ababab,…}.  Language Description:  The language described by a* is the set of all strings consisting of zero or more occurrences of the character a.  The language described by (ab)* is the set of all strings consisting of zero or more occurrences of the string "ab". 25 Language Description & Lexical Analysis Regular expression (RE)  Example 5: Combined Operations  Regular Expressions:  (a+b)c or (a|b)c is a RE that describes the language {ac,bc}.  (a+b)*c or (a|b)*c is a RE that describes the language {c,ac,bc,aac,abc,bac,bbc,…}.  Language Description:  The language described by (a+b)c is the set of strings where either a or b is followed by c.  The language described by (a+b)*c is the set of strings consisting of zero or more occurrences of a or b followed by c. 26 Language Description & Lexical Analysis Regular expression (RE)  Example 6: Complex Regular Expression  Regular Expressions:  (a*b*)c is a RE that describes the language {c, ac, bc, aac, abc, abbc, aabbc, …}.  Language Description:  The language described by (a*b*)c is the set of strings where zero or more occurrences of a followed by zero or more occurrences of b are followed by c. 27 Language Description & Lexical Analysis Regular expression (RE) Exercice 1 Give the regular expressions which describe: 1. The language of words on Σ={a, b} that start with a and end by b. 2. The language of all words on {a, b} concatenated with words on {c, d}: ab, aab, abb, aaab, abab, abababbab, , a, b, c,d,ac,abc, abcd, ababcdcdc. Answer a(a|b)*b (a|b)*(c|d)* 28 Language Description & Lexical Analysis Regular expression (RE) Exercice 2 1. Does the word w belong to the language described by the RE r in the following cases: w = 10100010 r =(0+10)* w= 01110110 r =(0+(11)*)* w= 000111100 r = ((011+11)*(00)*)* 2. Simplify the following REs: 𝜀 + ab + ab + abab(ab)* aa(b* + a) + a(ab* + aa) a(a+b)* + aa(a+b)*+ aaa(a+b)* 29 Language Description & Lexical Analysis Regular expression (RE) Answer 𝜀 + ab + abab(ab)* = 𝜀 + ab + ab(ab)* = 𝜀 + ab(𝜀 + (ab)*) = 𝜀 + ab(ab)* = 𝜀 + (ab)* = (ab)* aa(b* + a) + a(ab* + aa) = aa(b* + a)+ aa(b* + a) = aa(b* + a) a(a + b)*+ aa(a + b)*+ aaa(a + b)* = (a + aa + aaa)(a + b)* 30 Language Description & Lexical Analysis Regular expression (RE) Exercice 3 1. Prove the following equalities: b + ab* + aa*b + aa*ab* = a*(b + ab*) a*(b+ab*) = b + aa*b* 2. Give an expression r of the language formed on {a,b} having at most 3a. 31 Language Description & Lexical Analysis Regular expression (RE) Answer 1. Verification b + ab* + aa*b + aa*ab* = a*(b+ab*)  (b + ab*) + a+(b + ab*) = (𝜀 + a+)(b + ab*) = a*(b+ab*) a*(b + ab*) = a*b + a*ab* = (𝜀 + a+)b + a+b* = b + a+b + a+b* = b + a+ (b + b*) = b + a+b* = b+aa*b* 2. b*(a+𝜀)b*(a+𝜀)b*(a+𝜀)b* 32 Language Description & Lexical Analysis Introduction to automata  Example :  A coffee machine delivers a cup of coffee for 3 DH. This distributor accepts 1DH and 2DH coins but does not give change. A C button allows you to get coffee. At first the C button is blocked. If you insert enough coins, the button C becomes free and you cannot insert coins.  Model the operation of the distributor. Automata: Initial state e0: button blocked State e1: when we insert 1DH State e2: when we insert 2DH State e3: button C is released 33 Language Description & Lexical Analysis Introduction to automata → : Initial state → : final state Actual State Entry ribbon a e D 2 i N S Transition table Reading head 34 Finit Automata Language Description & Lexical Analysis Introduction to automata  An automaton reads a word written on its input ribbon. It starts from an initial state and with each letter read, it changes state. If at the end of the word, it is in a final state, we say that it accepts the word.  Word belongs to language ➔ last state is a final state.  Word does not belong to language ➔ Otherwise. 35 Language Description & Lexical Analysis Deterministic finite automaton (DFA)  Deterministic finite automaton (DFA) Definition A DFA A is defined as a 5-tuple 𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > : Q is the set of states, Σ is an alphabet, 𝜹: Q × Σ → 𝑸 is the transition function, 𝒒𝟎 is the initial state (𝒒𝟎 ∈ 𝑸 ), 𝑭 is the set of final states (𝑭 ⊆ Q ). 36 Language Description & Lexical Analysis Deterministic finite automaton (DFA)  Automata can be represented by graphs where the states are the nodes and the edges represent the transition function. Initial state final state a  If 𝜹 𝒒𝟎 , 𝒂 = 𝒒𝟏 ➔ 𝒒𝟎 𝒒𝟏 Exercice Represent the following automaton graphically: 𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > with Σ={0,1} , 𝑸 = {𝒒𝟎 , 𝒒𝟏 } , 𝒒𝟎 is the initial state and 𝑭 = 𝒒𝟏. 𝜹 0 1 𝒒𝟎 𝒒𝟎 𝒒𝟏 37 𝒒𝟏 𝒒𝟎 𝒒𝟏 Language Description & Lexical Analysis Language accepted by DFA  We say that an automaton 𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > accepts a word 𝑥 = 𝑥1 𝑥2 … 𝑥𝑛 , if and only if 𝜹(… 𝜹 𝜹 𝜹 𝒒𝟎 , 𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 … ), 𝒙𝒏 ) ∈ F. Example of verification 11111 (Yes) 𝛿(𝛿 𝛿 𝛿 𝛿 𝑞0 , 1 , 1 , 1 , 1 , 1) ∈ F 10010 (No) 𝛿(𝛿 𝛿 𝛿 𝛿 𝑞0 , 1 , 0 , 0 , 1 , 0) ∉ F 38 Language Description & Lexical Analysis Nondeterministic finite automaton (NFA)  An NFA is defined by a 5-tuple M = < Q, Σ , δ , q0, F >, where:  Q is a finite set of states.  Σ is a finite set of input symbols (alphabet).  δ is the transition function, defined as δ : Q × ( Σ ∪ { ε } ) → 𝟐𝑸.  This function takes a state and an input symbol (or epsilon, which represents an empty string) and returns a set of states.  q0 ∈ Q is the initial state.  F ⊆ Q is the set of accepting (or final) states. 39 Language Description & Lexical Analysis Nondeterministic finite automaton (NFA)  Characteristics of an NFA 1. Multiple Transitions: From a given state and input symbol, the NFA can transition to any number of possible next states. 2. Epsilon Transitions: The NFA can transition from one state to another without consuming any input symbols (using epsilon transitions). 3. Acceptance of Strings: An NFA accepts an input string if there exists at least one sequence of transitions (including epsilon transitions) starting from the initial state and ending in an accepting state, such that the entire input string is consumed. 40 Language Description & Lexical Analysis Nondeterministic finite automaton (NFA) a a  Example of an NFA  Consider an NFA M = < Q, Σ , δ , q0, F > with the following components: a b q0 q1 q2  States: Q={q0,q1,q2}  Alphabet: Σ={a,b}  Transition function δ: b  δ(q0,a)={q0,q1} (on input a, from state q0, the NFA can stay in q0 or move to q1)  δ(q0,b)={q0} (on input b, from state q0, the NFA can stay in q0)  δ(q1,b)={q2} (on input b, from state q1, the NFA can move to q2)  δ(q2,a)={q2} (on input a, from state q2, the NFA can stay in q2)  Initial state: q0 41  Accepting states: F={q2} Language Description & Lexical Analysis Nondeterministic finite automaton (NFA)  Example Execution:  To check if the NFA accepts the string "ab":  Start at q0.  On input a, move to q0 or q1.  From q0 on input b, stay in q0 (this path does not lead to acceptance).  From q1 on input b, move to q2 (this path leads to acceptance).  Since there exists a path where the NFA ends in an accepting state (q2), the NFA accepts the string "ab". a a a b q0 q1 q2 42 b Language Description & Lexical Analysis Nondeterministic finite automaton (NFA)  It is difficult to move directly from a RE to a DFA.  It’s easy to move from a RE to a NFA. Definition A NFA 𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > is characterized by : 𝜹: 𝑸 × Σ ‫𝜀 ڂ‬ → 𝑷(𝑸)  𝑃({𝑎, 𝑏, 𝑐}){∅, {𝑎} , {𝑏} , {𝑐} , {𝑎, 𝑏} , {𝑎, 𝑐} , {𝑏, 𝑐} , {𝑎, 𝑏, 𝑐}}  Several arcs, recognizing the same symbol, can exit the same state.  There may be 𝜀 −transitions 𝜀 is empty word, not 43 belong on the alphabet Language Description & Lexical Analysis Nondeterministic finite automaton (NFA) Example Represent with a Graph the following NFA: A= with Σ={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -,.} and Q={q0 , q1 , q2 , q3 , q4 , q5 } q0 is the initial state and F={q5} 44 Language Description & Lexical Analysis Transformation of NFA into DFA  The transformation of a Nondeterministic Finite Automaton (NFA) into a Deterministic Finite Automaton (DFA) is known as the subset construction or powerset construction method.  This process involves creating states in the DFA that correspond to sets of NFA states. 45 Language Description & Lexical Analysis Transformation of NFA into DFA 1. Definitions  NFA: A 5-tuple (Q,Σ,δ,q0,F)  Q: Finite set of states  Σ: Alphabet  δ: Transition function δ:Q×Σ→2𝑄  q0: Initial state  F: Set of accepting states  DFA: A 5-tuple (Q′,Σ,δ′,q0′,F′)  Q′: Finite set of states  Σ: Alphabet  δ′: Transition function δ′:Q′×Σ→Q′  q0′: Initial state 46  F′: Set of accepting states Language Description & Lexical Analysis Transformation of NFA into DFA 2. Subset Construction Algorithm 1. Initialization:  The initial state of the DFA, q0′, is the epsilon-closure of the NFA's initial state q0. The epsilon-closure of a state is the set of states reachable from that state by epsilon transitions alone. 2. Processing States:  For each state in the DFA, which is a set of NFA states, calculate its transitions for each symbol in Σ.  For a state S in the DFA and an input symbol a:  Compute the set of NFA states reachable from any state in S on input a.  Take the epsilon-closure of this set to form the new DFA state. 3. Mark Accepting States: 47  A state in the DFA is an accepting state if it includes at least one accepting state of the NFA. Language Description & Lexical Analysis Transformation of NFA into DFA 3. Example  Consider the following NFA:  States: Q={q0,q1,q2} a a  Alphabet: Σ={a,b}  Transition function δ: a b q0 q1 q2  δ(q0,a)={q0,q1}  δ(q0,b)={q0}  δ(q1,b)={q2} b  δ(q2,a)={q2}  Initial state: q0 48  Accepting state: F={q2} Language Description & Lexical Analysis Transformation of NFA into DFA a a a b q0 q1 q2 3. Example b b Step-by-Step Construction of the DFA: 1. Initial State:  q0′=ϵ-closure({q0})={q0} a q0 q0,q1 2. Transitions from {q0} :  On a:  δ({q0},a)={q0,q1}  New state in DFA: {q0,q1}  On b:  δ({q0},b)={q0}  New state in DFA: {q0} 49 Language Description & Lexical Analysis Transformation of NFA into DFA a a a b q0 q1 q2 3. Example a b b Step-by-Step Construction of the DFA: 3. Transitions from {q0,q1} :  On a: a q0 q0,q1  δ({q0,q1},a)={q0,q1} (reachable from both b q0 and q1 on a)  New state in DFA: {q0,q1} (already exists)  On b: q0,q2  δ({q0,q1},b)={q0,q2} (reachable from q0 on b is q0, and from q1 on b is q2)  New state in DFA: {q0,q2} 50 Language Description & Lexical Analysis Transformation of NFA into DFA a a a b q0 q1 q2 3. Example a b b Step-by-Step Construction of the DFA: 4. Transitions from {q0,q2} :  On a: a q0 q0,q1  δ({q0,q2},a)={q0,q1,q2} b  New state in DFA: {q0,q1,q2} b  On b: a  δ({q0,q2},b)={q0} q0,q1,q2 q0,q2  New state in DFA: {q0} (already exists) 51 Language Description & Lexical Analysis Transformation of NFA into DFA a a a b q0 q1 q2 3. Example a b b Step-by-Step Construction of the DFA: 5. Transitions from {q0,q1,q2} :  On a: a q0 q0,q1  δ({q0,q1,q2},a)={q0,q1,q2} b  New state in DFA: {q0,q1,q2} (already exists) b  On b: a  δ({q0,q1,q2},b)={q0,q2} q0,q1,q2 q0,q2  New state in DFA: {q0,q2} (already exists) a b 52 Language Description & Lexical Analysis Transformation of NFA into DFA a a a b q0 q1 q2 3. Example a b DFA States: b  Q′={{q0},{q0,q1},{q0,q2},{q0,q1,q2}} DFA Transitions: a  δ′({q0},a)={q0,q1} q0 q0,q1  δ′({q0},b)={q0} b  δ′({q0,q1},a)={q0,q1} b  δ′({q0,q1},b)={q0,q2}  δ′({q0,q2},a)={q0,q1,q2} a  δ′({q0,q2},b)={q0} q0,q1,q2 q0,q2  δ′({q0,q1,q2},a)={q0,q1,q2}  δ′({q0,q1,q2},b)={q0,q2} DFA Initial State:  q0′={q0} a b DFA Accepting States: 53  F′={{q0,q2},{q0,q1,q2}} (any DFA state containing q2) Language Description & Lexical Analysis Transformation of NFA into DFA  Exercise:  Given the following NFA, convert it into an equivalent DFA using the subset construction method.  NFA Definition  States: Q={q0,q1,q2}  Alphabet: Σ={a,b}  Transition function δ:  δ(q0,a)={q0,q1}  δ(q0,b)={q0}  δ(q1,a)={q2}  δ(q1,b)={q2}  δ(q2,a)={}  δ(q2,b)={q2}  Initial state: q0 54  Accepting state: F={q2} Language Description & Lexical Analysis Minimization of a DFA: Hopcroft's Algorithm  The process of minimizing a DFA involves reducing the number of states while preserving the language it accepts.  Hopcroft's algorithm is a well-known method for minimizing a DFA efficiently. 55 Language Description & Lexical Analysis Minimization of a DFA: Hopcroft's Algorithm  Steps of Hopcroft's Algorithm 1. Initialization:  Divide the states of the DFA into two groups: accepting states and non-accepting states. 2. Refinement:  Iteratively refine these groups (partitions) until no further refinement is possible.  For each group and each input symbol, check if the transitions under that symbol lead to states in different groups. If they do, split the group accordingly. 3. Construct the Minimized DFA:  Create new states for each of the final groups.  Define the transitions based on the transitions in the original DFA.  The initial state of the minimized DFA corresponds to the group containing the initial state of the original DFA.  The accepting states of the minimized DFA correspond to the groups containing accepting 56 states of the original DFA. Language Description & Lexical Analysis Minimization of a DFA: Hopcroft's Algorithm  Example a  Consider a DFA with the following components: b  States: Q={q0,q1,q2,q3,q4}  Alphabet: Σ={a,b} a q0 q1 q3  Transition function δ:  δ(q0,a)=q1  δ(q2,b)=q0 b b a  δ(q0,b)=q2  δ(q3,a)=q4 a a b  δ(q1,a)=q0  δ(q3,b)=q1  δ(q1,b)=q3  δ(q4,a)=q3 q2 q4  δ(q2,a)=q4  δ(q4,b)=q2  Initial state: q0 b 57  Accepting states: F={q0,q3} a Language Description & Lexical Analysis a b Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3 b b a a a  Step-by-Step Minimization b P0 1. Initialization: q2 q4  Partition the states into two groups: q3 q0 b  Accepting states: P0={q0,q3}  Non-accepting states: P1={q1,q2,q4} q1 P1 q4 q2 58 a Language Description & Lexical Analysis a b Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3 b b a a a  Step-by-Step Minimization b P0 2. Refinement: q2 q4  Iteratively refine the partitions based on the transitions. q0 q3 b  Iteration 1:  For each group and each input symbol, check if transitions lead to different groups.  Checking P0={q0,q3} :  On a: q1 P1 q4  δ(q0,a)=q1 (in P1)  δ(q3,a)=q4 (in P1) q2  No split needed.  On b:  δ(q0,b)=q2 (in P1)  δ(q3,b)=q1 (in P1) 59  No split needed. a Language Description & Lexical Analysis a b Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3 b b a a a  Step-by-Step Minimization b P0 2. Refinement: q2 q4  Iteration 1:... q0 q3 b  Checking P1={q1,q2,q4} :  On a:  δ(q1,a)=q0 (in P0)  δ(q2,a)=q4 (in P1) P1  δ(q4,a)=q3 (in P0) P1 q1 q4  Split P1 into {q2} and {q1,q4}.  Updated partitions:  P0={q0,q3}  P1={q1,q4}  P2={q2} P2 q2 60 a Language Description & Lexical Analysis a b Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3  Step-by-Step Minimization P0 b a b a a 2. Refinement: b  Iteration 2: q2 q4... q0 q3  Checking P1={q1,q4} : b  On a:  δ(q1,a)=q0 (in P0)  δ(q4,a)=q3 (in P0)  No split needed. P1  On b: q1 q4  δ(q1,b)=q3 (in P0)  δ(q4,b)=q2 (in P2)  Split P1 No split needed.  Updated partitions:  P0={q0,q3} P2 q2  P1={q1, q4}  P2={q2} a Language Description & Lexical Analysis a b Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3 Step-by-Step Minimization b b a  a a 3. Construct the Minimized DFA: b  New states: Each partition becomes a new state. q2 q4  P0={q0,q3} becomes q0′  P1={q1, q4} becomes q1′ b  P2={q2} becomes q2′  Transition function:  δ′(q0′,a)=q1′ (since δ(q0,a)=q1 and δ(q3,a)=q4)  δ′(q0′,b)=q2′ (since δ(q0,b)=q2 and δ(q3,b)=q1)  δ′(q1′,a)=q0′ (since δ(q1,a)=q0 and δ(q4,a)=q3)  δ′(q1′,b)=q0′ (since δ(q1,b)=q3 and δ(q4,b)=q2)  δ′(q2′,a)=q3′ (since δ(q2,a)=q4)  δ′(q2′,b)=q0′ (since δ(q2,b)=q0)  Initial state: q0′  Accepting states: {q0′} (since q0′ contains q0 and q3) 62 Language Description & Lexical Analysis Minimization of a DFA: Hopcroft's Algorithm b  Minimized DFA  States:  Q′={q0′,q1′,q2′} a  Alphabet:  Σ={a,b}  Transition Function: q0’ q1’  δ′(q0′,a)=q1′  δ′(q0′,b)=q2′  δ′(q1′,a)=q0′ a  δ′(q1′,b)=q0′ b  δ′(q2′,a)=q3′ b q2’  δ′(q2′,b)=q0′  Initial State: q0′   Accepting States:  F′={q0′} 63 Language Description & Lexical Analysis Minimization of a DFA: Hopcroft's Algorithm  Exercise: Minimize a DFA Using Hopcroft's Algorithm  Given the following DFA, use Hopcroft's algorithm to minimize it.  DFA Definition  States: Q={q0,q1,q2,q3,q4,q5}  Alphabet: Σ={a,b}  Transition function δ:  δ(q0,a)=q1  δ(q3,a)=q0  δ(q0,b)=q2  δ(q3,b)=q3  δ(q1,a)=q0  δ(q4,a)=q4  δ(q1,b)=q3  δ(q4,b)=q5  δ(q2,a)=q4  δ(q5,a)=q4  δ(q2,b)=q5  δ(q5,b)=q5  Initial state: q0 64  Accepting states: F={q0,q3} Language Description & Lexical Analysis Equivalence of finite automata to regular expressions Theorem A language is regular if and only if it is generated by a finite automaton. 65 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  The construction of a Nondeterministic Finite Automaton (NFA) from a regular expression (RE) can be systematically achieved using Thompson's Construction algorithm.  This method allows us to build an NFA that recognizes the same language as the given regular expression. 66 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Thompson's Construction Algorithm 1. Basic Operations:  Single Symbol: For a single symbol a, the NFA is constructed as:  Epsilon (ε): For the epsilon symbol, the NFA is:  Empty String: The NFA accepts only the empty string. 67 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Thompson's Construction Algorithm 2. Concatenation (AB):  If A and B are two NFAs constructed for regular expressions R and S respectively, the NFA for RS is constructed by connecting the final state of A to the initial state of B using epsilon transitions. 68 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Thompson's Construction Algorithm 3. Union (A | B):  If A and B are two NFAs for regular expressions R and S respectively, the NFA for R ∣ S is constructed by introducing a new initial state with epsilon transitions to the initial states of both A and B, and new final state with epsilon transitions from the final states of both A and B. 69 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Thompson's Construction Algorithm 3. Kleene Star (A)* :  If A is an NFA for regular expression R, the NFA for R* is constructed by adding epsilon transitions from the final state of A back to its initial state, and from a new initial state to the initial state of A and from the final state of A to a new final state. 70 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Example  Let's construct an NFA from the regular expression : (a|b)*abb 1. Construct NFA for a and b:  For a: a q0 q1  For b: b q2 q3 71 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Example  Let's construct an NFA from the regular expression : (a|b)*abb 2. Construct NFA for a|b:  Using union:  Combined NFA for a|b: a ε q0 q1 ε q4 q5 ε b ε q2 q3 72 73 Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Example  Let's construct an NFA from the regular expression : (a|b)*abb 3. Construct NFA for (a|b)*: ε  Using Kleene star:  Combined NFA for (a|b)*: a ε q0 q1 ε ε q6 q4 q5 q7 ε ε b ε q2 q3 ε Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Example  Let's construct an NFA from the regular expression : (a|b)*abb 4. Construct NFA for abb:  Concatenation: ε a ε q0 q1 ε ε a b b q6 q4 q5 q7 q8 q9 q10 q11 ε ε b ε q2 q3 ε Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Example  Let's construct an NFA from the regular expression : (a|b)*abb 4. Combine (a|b)* with abb using concatenation:  Concatenation: ε a ε q0 q1 ε ε ε a b b q6 q4 q5 q7 q8 q9 q10 q11 ε ε b ε q2 q3 ε Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Example  Let's construct an NFA from the regular expression : (a|b)*abb  Full NFA for (a|b)*abb: 1. States:  Q={q0,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11} 2. Alphabet:  Σ={a,b} 3. Transitions:  δ(q6, ε) = {q4, q7}  δ(q2, b) = {q3}  δ(q8, a) = {q9}  δ(q4, ε) = {q0, q2}  δ(q3, ε) = {q5}  δ(q9, b) = {q10}  δ(q0, a) = {q1}  δ(q5, ε) = {q4, q7}  δ(q10, b) = {q11}  δ(q1, ε) = {q5}  δ(q7, ε) = {q8} 4. Initial State:  q6 5. Accepting State: 76  F={q11} Language Description & Lexical Analysis Construction of an NFA from a Regular Expression  Exercice:  Construct an NFA for the regular expression ((ab)*|a*) over the alphabet Σ = {a, b} 77 Language Description & Lexical Analysis Flex tool  Flex (Fast Lexical Analyzer) is a tool for generating scanners, also known as lexical analyzers.  A scanner reads input text and breaks it into tokens, which are the basic building blocks for syntax analysis in the compilation process.  Flex is widely used in conjunction with Bison, a parser generator, to create a complete compiler or interpreter.  Overview of Flex  Input: A specification file containing patterns and corresponding actions.  Output: A C source file (typically lex.yy.c) containing the code for the lexical analyzer.  Usage: The generated lexical analyzer reads input text, matches it against the specified patterns, and performs the associated actions. 78 Language Description & Lexical Analysis Flex tool 79 Language Description & Lexical Analysis Flex tool  Structure of a Flex Specification File  A Flex specification file consists of three sections separated by %%: < DEFINITIONS > %% < RULES> %% < USER SUB-PROGRAMS >  Only the first "%%" is required 80 Language Description & Lexical Analysis Flex tool  Structure of a Flex Specification File  Definitions Section: Contains declarations and definitions, such as C code, #include directives, and macro definitions.  Rules Section: Contains patterns and associated actions. Each pattern is a regular expression, and each action is a block of C code to execute when the pattern is matched.  User Code Section: Contains additional C code, such as helper functions, that are used by the actions. 81 Language Description & Lexical Analysis Flex tool Defining Regular Expressions in Flex  Regular expressions in Flex are used to describe patterns in the input text. These patterns are matched against the input to identify tokens.  Flex uses a syntax similar to traditional regular expressions with some additional features tailored for lexical analysis. Basic Elements of Regular Expressions in Flex 1. Literals:  Matches the exact character(s).  Example: a matches the character 'a’. 2. Character Classes:  Matches any one of the characters in the set.  Example: [abc] matches 'a', 'b', or 'c'. 82 Language Description & Lexical Analysis Flex tool Basic Elements of Regular Expressions in Flex 3. Ranges:  Matches any character within the specified range.  Example: [a-z] matches any lowercase letter. 4. Negated Character Classes:  Matches any character not in the set.  Example: [^a-z] matches any character except lowercase letters. 5. Predefined Character Classes: . matches any character except newline.  \d matches any digit (equivalent to [0-9]).  \w matches any word character (equivalent to [a-zA-Z0-9_]).  \s matches any whitespace character (space, tab, newline, etc.). 83 Language Description & Lexical Analysis Flex tool Basic Elements of Regular Expressions in Flex 6. Quantifiers:  * matches zero or more occurrences.  + matches one or more occurrences.  ? matches zero or one occurrence.  {n} matches exactly n occurrences.  {n,} matches n or more occurrences.  {n,m} matches between n and m occurrences. 7. Anchors:  ^ matches the beginning of a line.  $ matches the end of a line. 8. Grouping and Alternation: 84  (abc) groups the characters 'abc' together.  a|b matches either 'a' or 'b'. Language Description & Lexical Analysis Flex tool Basic Elements of Regular Expressions in Flex Flex-Specific Extensions 1. Named Patterns:  Define reusable patterns using the name syntax.  Example: DIGIT [0-9] defines a named pattern DIGIT. 2. Action Blocks:  {... } is used to enclose C code that is executed when the pattern is matched.  Example: DIGIT { printf("Digit: %s\n", yytext); } 3. State Transitions:  matches a pattern only in the specified start condition state.  Example: hello matches "hello" only in the INITIAL state. 85 Language Description & Lexical Analysis Flex tool Predefined variables and functions Flex, the fast lexical analyzer generator, provides several predefined variables and functions that help in writing lexical analyzers.  Predefined Variables 1. yytext:  Type: char *  Description: A pointer to the matched text. It points to the start of the current token in the input stream.  Usage: Access the matched text in actions, e.g., printf("Matched text: %s\n", yytext); 2. yyleng:  Type: int  Description: The length of the matched text, excluding the null terminator.  Usage: Get the length of the current token, e.g., printf("Length of matched text: %d\n", yyleng); 86 Language Description & Lexical Analysis Flex tool Predefined variables and functions  Predefined Variables 3. yylval:  Type: Depends on the user definition  Description: Used to return values from lexical analyzer to the parser. Typically used in conjunction with Yacc/Bison.  Usage: Set semantic values for tokens, e.g., yylval = atoi(yytext); 4. yyin:  Type: FILE *  Description: The file pointer from which Flex reads input. By default, it is stdin.  Usage: Change the input source, e.g., yyin = fopen("input.txt", "r"); 5. yyout:  Type: FILE *  Description: The file pointer to which Flex writes its output. By default, it is stdout.  Usage: Change the output destination, e.g., yyout = fopen("output.txt", "w"); Language Description & Lexical Analysis Flex tool Predefined variables and functions  Predefined Functions 1. yylex():  Type: int  Description: The main lexical analyzer function. It matches the next token in the input and executes the corresponding action.  Usage: Called in the main function or by a parser to get the next token, e.g., int token = yylex(); 2. yywrap():  Type: int  Description: Called when the end of the input file is reached. By default, it returns 1, indicating no more input. Returning 0 allows further input.  Usage: Override to handle end-of-file conditions, e.g., int yywrap() { return 1; // Indicate end of input } Language Description & Lexical Analysis Flex tool Predefined variables and functions  Predefined Functions 3. yyrestart(FILE *new_file):  Type: void  Description: Switches the input stream to new_file.  Usage: Change input source at runtime, e.g., yyrestart(fopen("new_input.txt", "r")); 4. yy_scan_string(const char *str):  Type: YY_BUFFER_STATE  Description: Scans a string as if it were input from a file.  Usage: Scan an in-memory string, e.g., yy_scan_string("some text"); 5. yy_delete_buffer(YY_BUFFER_STATE buffer):  Type: void  Description: Deletes a buffer created by yy_scan_string or similar functions.  Usage: Clean up buffer, e.g., YY_BUFFER_STATE buffer = yy_scan_string("some text"); yy_delete_buffer(buffer); Language Description & Lexical Analysis Flex tool Example: %{ #include %} %% [0-9]+ { printf("Integer: %s\n", yytext); } [a-zA-Z_][a-zA-Z0-9_]* { printf("Identifier: %s\n", yytext); } " " { } \n { }. { printf("Unknown character: %s\n", yytext); } %% int main(void) { printf("Enter input:\n"); yylex(); // Start the lexical analysis return 0; } int yywrap() { return 1; // Indicate end of input } Language Description & Lexical Analysis Flex tool %{ #include %} NUMBER_TOKEN [0-9]* ID_TOKEN [a-zA-Z_][a-zA-Z0-9_]* PLUS_TOKEN "+" Example of a Flex Specification File MINUS_TOKEN "-" TIMES_TOKEN "*"  Here is an example of a Flex DIV_TOKEN "/" specification file for a simple lexical MOD_TOKEN "%" SPACE_TOKEN [ \t\n]+ analyzer that recognizes arithmetic %% expressions with numbers, identifiers, {NUMBER_TOKEN} {printf("NUMBER_TOKEN\n");} and operators. {ID_TOKEN} {printf("ID_TOKEN\n");} {PLUS_TOKEN} {printf("PLUS_TOKEN\n");} {MINUS_TOKEN} {printf("MINUS_TOKEN\n");} {TIMES_TOKEN} {printf("TIMES_TOKEN\n");} {DIV_TOKEN} {printf("DIV_TOKEN\n");} {MOD_TOKEN} {printf("MOD_TOKEN\n");} {MOD_TOKEN} {printf("MOD_TOKEN\n");}. {printf("ERROR\n");} %% int main(int argc, char ** argv){ if(argc >= 1){ yyin = fopen(argv, "r"); yylex(); 91 }else{ printf("Insufficient number of arguments\n"); } } Language Description & Lexical Analysis Flex tool Example of a Flex Specification File  Explanation of the Example  Definitions Section: %{ #include %}  This section, enclosed in %{ and %}, contains C code that will be copied verbatim to the top of the generated C file. Here, it includes the standard I/O library for printing. 92 Language Description & Lexical Analysis Flex tool Example of a Flex Specification File  Explanation of the Example  Token Definitions NUMBER_TOKEN [0-9]* ID_TOKEN [a-zA-Z_][a-zA-Z0-9_]*...  These lines define named patterns (macros) for the various tokens using regular expressions:  NUMBER_TOKEN: Matches zero or more digits ([0-9]*).  ID_TOKEN: Matches an identifier, which starts with a letter or underscore and is followed by letters, digits, or underscores ([a-zA-Z_][a-zA-Z0-9_]*).  PLUS_TOKEN: Matches the plus sign ("+").  MINUS_TOKEN: Matches the minus sign ("-").  TIMES_TOKEN: Matches the asterisk ("*") used for multiplication.  DIV_TOKEN: Matches the forward slash ("/") used for division.  MOD_TOKEN: Matches the percent sign ("%") used for modulo operation.  SPACE_TOKEN: Matches one or more whitespace characters ([ \t\n]+). Language Description & Lexical Analysis Flex tool Example of a Flex Specification File  Explanation of the Example  Rules Section %% {NUMBER_TOKEN} {printf("NUMBER_TOKEN\n");}.... {printf("ERROR\n");} %%  This section, enclosed between %% markers, contains the rules for matching the patterns defined above and the actions to perform when a match is found:  {NUMBER_TOKEN}: If the input matches NUMBER_TOKEN, it prints "NUMBER_TOKEN".  {ID_TOKEN}: If the input matches ID_TOKEN, it prints "ID_TOKEN".  {PLUS_TOKEN}: If the input matches PLUS_TOKEN, it prints "PLUS_TOKEN".  {MINUS_TOKEN}: If the input matches MINUS_TOKEN, it prints "MINUS_TOKEN".  {TIMES_TOKEN}: If the input matches TIMES_TOKEN, it prints "TIMES_TOKEN".  {DIV_TOKEN}: If the input matches DIV_TOKEN, it prints "DIV_TOKEN".  {MOD_TOKEN}: If the input matches MOD_TOKEN, it prints "MOD_TOKEN".  {SPACE_TOKEN}: If the input matches SPACE_TOKEN, it prints “SPACE_TOKEN". .: If the input matches any other character not specified above, it prints "ERREUR". Language Description & Lexical Analysis Flex tool Example of a Flex Specification File  Explanation of the Example  Main Function int main(int argc, char ** argv){ if(argc >= 1){ yyin = fopen(argv, "r"); yylex(); }else{ printf("Insufficient number of arguments\n"); } }  The main function:  Checks if there is at least one command-line argument (which should be the name of the input file).  If so, it opens the file and sets yyin to read from it.  Calls yylex(), which starts the lexical analysis and tokenization process.  If no input file is provided, it prints "Insufficient number of arguments". Language Description & Lexical Analysis Flex tool Example of a Flex Specification File  Generating and Compiling the Lexical Analyzer 1. Save the code in a file (e.g., lexer.l). 2. Generate the Lexical Analyzer:  Run Flex on the specification file to generate the C source file for the lexical analyzer. flex lexer.l 3. Compile the Lexical Analyzer:  Use a C compiler to compile the generated C source file (lex.yy.c) along with any other necessary files. gcc -o lexer lex.yy.c –lfl 4. Run the lexer: 96./lexer input_file Language Description & Lexical Analysis Flex tool Example of a Flex Specification File  Example Input and Output  Suppose you have an input file input_file with the following content: 123 + abc - 456 * def / ghi % jkl  Running the lexer on this file will produce the following output: NUMBER_TOKEN PLUS_TOKEN ID_TOKEN MINUS_TOKEN NUMBER_TOKEN TIMES_TOKEN ID_TOKEN DIV_TOKEN ID_TOKEN MOD_TOKEN 97 ID_TOKEN Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code  A Comprehensive example ‘lexer.c’ of a lexical analyzer in C that recognizes arithmetic expressions with numbers, identifiers, and operators, including some basic error handling. #include #include #include typedef enum { TOKEN_IDENTIFIER, TOKEN_NUMBER, TOKEN_OPERATOR, TOKEN_PAREN_OPEN, TOKEN_PAREN_CLOSE, TOKEN_UNKNOWN, TOKEN_END } TokenType; typedef struct { TokenType type; 98 char text; } Token; Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code... const char *tokenTypeToString(TokenType type) { switch (type) { case TOKEN_IDENTIFIER: return "Identifier"; case TOKEN_NUMBER: return "Number"; case TOKEN_OPERATOR: return "Operator"; case TOKEN_PAREN_OPEN: return "Paren Open"; case TOKEN_PAREN_CLOSE: return "Paren Close"; case TOKEN_UNKNOWN: return "Unknown"; case TOKEN_END: return "End"; default: return "Invalid"; } } void printToken(Token token) { printf("Token: %-12s Text: %s\n", tokenTypeToString(token.type), token.text); }... 99 Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code... Token getNextToken(const char **input) { while (isspace(**input)) (*input)++; // Skip whitespace if (**input == '\0') { return (Token){TOKEN_END, ""}; } Token token; if (isalpha(**input)) { // Identifier token.type = TOKEN_IDENTIFIER; int length = 0; while (isalnum(**input)) { token.text[length++] = *(*input)++; } token.text[length] = '\0'; } else if (isdigit(**input)) {... 100 Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code... // Number token.type = TOKEN_NUMBER; int length = 0; while (isdigit(**input)) { token.text[length++] = *(*input)++; } token.text[length] = '\0'; } else if (strchr("+-*/=", **input)) { // Operator token.type = TOKEN_OPERATOR; token.text = *(*input)++; token.text = '\0'; } else if (**input == '(') { // Opening Parenthesis token.type = TOKEN_PAREN_OPEN; token.text = *(*input)++; token.text = '\0'; 101 } else if (**input == ')’) {... Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code... // Closing Parenthesis token.type = TOKEN_PAREN_CLOSE; token.text = *(*input)++; token.text = '\0'; } else { // Unknown token token.type = TOKEN_UNKNOWN; token.text = *(*input)++; token.text = '\0'; } return token; }... 102 Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code... int main() { const char *input = "var1 = 42 + (var2 * 3) / num3"; printf("Input: %s\n", input); Token token; while ((token = getNextToken(&input)).type != TOKEN_END) { printToken(token); } return 0; } 103 Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code  Explanation of the Code  TokenType Enumeration: Defines the different types of tokens: TOKEN_IDENTIFIER, TOKEN_NUMBER, TOKEN_OPERATOR, TOKEN_PAREN_OPEN, TOKEN_PAREN_CLOSE, TOKEN_UNKNOWN, and TOKEN_END.  Token Structure: Represents a token with a type and a text value.  tokenTypeToString Function: Converts a TokenType to a string for printing purposes.  printToken Function: Prints a token's type and text value.  getNextToken Function:  Takes a pointer to the input string and returns the next token.  Skips whitespace.  Recognizes identifiers (alphanumeric strings starting with a letter).  Recognizes numbers (strings of digits).  Recognizes operators (+, -, *, /, =).  Recognizes parentheses (( and )).  Recognizes unknown tokens (any other single character).  main Function:  Defines the input string to be tokenized. 104  Calls getNextToken in a loop to tokenize the entire input string and print each token. Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code  Compilation and Execution  To compile and run the lexical analyzer, follow these steps: 1. Save the code to a file named lexer.c. 2. Compile the code using a C compiler, e.g., gcc: gcc -o lexer lexer.c 3. Run the executable:./lexer 105 Language Description & Lexical Analysis Lexical analyzer in C Code Example of Lexical Analyzer in C Code  Example Output  When you run the program with the input string "var1 = 42 + (var2 * 3) / num3", the output will be: Input: var1 = 42 + (var2 * 3) / num3 Token: Identifier Text: var1 Token: Operator Text: = Token: Number Text: 42 Token: Operator Text: + Token: Paren Open Text: ( Token: Identifier Text: var2 Token: Operator Text: * Token: Number Text: 3 Token: Paren Close Text: ) Token: Operator Text: / Token: Identifier Text: num3 106 Language Description & Lexical Analysis The End

MOD 04 - Language Description & Lexical Analysis.pdf

Document Details

Related

Full Transcript