Chapter 1 Lexical Analysis

I am not yet so lost in lexicography as to forget that words are the daughters of earth, and that things are the sons of heaven. Language is only the instrument of science, and words are but the signs of ideas.
Samuel Johnson (1709–1784)

The word “lexical” in the traditional sense means “pertaining to words”. In terms of programming languages, words are entities like variable names, numbers, keywords etc. Such word-like entities are traditionally called tokens. A lexical analyser, also called a lexer or scanner, will as input take a string of individual letters and divide this string into a sequence of classified tokens. Additionally, it will filter out whatever separates the tokens (the so-called white-space), i.e., lay-out characters (spaces, newlines etc.) and comments.

The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis phase. In theory, the work that is done during lexical analysis can be made an integral part of syntax analysis, and in simple systems this is indeed often done. However, there are reasons for keeping the phases separate:

Efficiency: A specialised lexer may do the simple parts of the work faster than the parser, which uses more general methods, can. Furthermore, the size of a system that is split in two phases may be smaller than a combined system. This may seem paradoxical but, as we shall see, there is a non-linear factor involved which may make a separated system smaller than a combined system.

Modularity: The syntactical description of the language need not be cluttered with small lexical details such as white-space and comments.

Tradition: Languages are often designed with separate lexical and syntactical phases in mind, and the standard documents of such languages typically separate lexical and syntactical elements of the languages.

It is usually not terribly difficult to write a lexer by hand: You first read past initial white-space, then you, in sequence, test to see if the next token is a keyword, a number, a variable or whatnot. However, this is not a very good way of handling the problem: You may read the same part of the input repeatedly while testing each possible token, and in some cases it may not be clear where one token ends and the next begins. Furthermore, a handwritten lexer may be complex and difficult to maintain. Hence, lexers are normally constructed by lexer generators, which transform (somewhat) human-readable specifications of tokens and white-space into efficient programs.

We will see the same general strategy in the chapter about syntax analysis: Specifications in a well-defined human-readable notation are transformed into efficient programs.

For lexical analysis, specifications are traditionally written using regular expressions: An algebraic notation for describing sets of strings. The generated lexers are in a class of extremely simple programs called finite automata. This chapter will describe regular expressions and finite automata, their properties and how regular expressions can be converted to finite automata. Finally, we discuss some practical aspects of lexer generators.
1.1 Regular Expressions

The set of all integer constants or the set of all variable names are examples of sets of strings, where the individual digits or letters used to form these constants or names are taken from a particular alphabet, i.e., a set of characters. A set of strings is called a language. For integers, the alphabet consists of the digits 0–9 and for variable names, the alphabet contains both letters and digits (and perhaps a few other characters, such as hyphens and underscores).

Given an alphabet, we will describe sets of strings over this alphabet by regular expressions, an algebraic notation that is compact and relatively easy for humans to use and understand. The idea is that regular expressions that describe simple sets of strings can be combined to form bigger regular expressions that describe more complex sets of strings. Regular expressions are often called “regexps” for short.

When talking about regular expressions, we will use the letters r, s and t in italics to denote unspecified regular expressions. When letters stand for themselves (i.e., in regular expressions that describe strings that use these letters) we will use typewriter font, e.g., a or b. The letters u, v and w in italics will be used to denote unspecified single strings, i.e., members of some language. As an example, abw denotes any string starting with ab. When we say, e.g., “The regular expression s” (note the typewriter font) we mean the regular expression that describes a single one-letter string “s”, but when we say “The regular expression s” (note the italics), we mean a regular expression of any form which we just happen to call s.

We use the notation L(s) to denote the language (i.e., set of strings) described by the regular expression s. For example, L(a) is the set {“a”}. To find L(s) for a given regular expression s, we use derivation: Rules that rewrite a regular expression into a string of letters. These rules allow a single regular expression to be rewritten into several different strings, so L(s) is the set of strings that s can be rewritten to using these rules. L(s) is often an infinite set, but each string in the set is finite and can be obtained by a finite number of derivation steps.

Figure 1.1 shows the different forms of regular expression, the derivation rules for these, and an informal description of what the regular expression forms mean. Note that we use a double arrow (⇒) to denote derivation. In addition to the specific derivation rules in Fig. 1.1, we also use some general rules to make derivation reflexive and transitive:

  s ⇒ s                            Derivation is reflexive
  r ⇒ t if r ⇒ s and s ⇒ t        Derivation is transitive

Note that, while we use the same notation for concrete strings and regular expressions denoting one-string languages, the context will make it clear which is meant. We will often show strings and sets of strings without using quotation marks, e.g., write {a, bb} instead of {“a”, “bb”}. When doing so, we sometimes use ε to denote the empty string, so the derivation s∗ ⇒ shown in Fig. 1.1 can also be written as s∗ ⇒ ε.

We can use the derivation rules to find the language for a regular expression. As an example, L(a(b|c)) = {ab, ac} because a(b|c) ⇒ a(b) = ab and a(b|c) ⇒ a(c) = ac. L((a|b)∗) is infinite and contains any sequence of as and bs, including the empty sequence. For example, the string ab is in L((a|b)∗) because

  (a|b)∗ ⇒ (a|b)(a|b)∗ ⇒ a(a|b)∗ ⇒ a(a|b)(a|b)∗ ⇒ ab(a|b)∗ ⇒ ab
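To make the definition of L(s) concrete, here is a minimal Python sketch of a membership test for regular expressions. It is our own illustration, not from the book: the nested-tuple encoding and the name matches are ours. It decides w ∈ L(r) by trying every way a derivation could split the string, so it is exponential and only meant to mirror the derivation rules, not to be an efficient recogniser; efficient recognition is what the rest of this chapter builds up to.

    # Regular expressions as nested tuples: ("sym", c), ("seq", r, s),
    # ("alt", r, s), ("star", r), and "eps" for the empty string.

    def matches(r, w: str) -> bool:
        """Decide w in L(r) by trying all derivations (exponential; illustration only)."""
        if r == "eps":
            return w == ""
        kind = r[0]
        if kind == "sym":                  # L(a) = {"a"}
            return w == r[1]
        if kind == "alt":                  # L(r|s) = L(r) union L(s)
            return matches(r[1], w) or matches(r[2], w)
        if kind == "seq":                  # split w into a prefix from r and a suffix from s
            return any(matches(r[1], w[:i]) and matches(r[2], w[i:])
                       for i in range(len(w) + 1))
        if kind == "star":                 # either the empty string, or a non-empty
            return w == "" or any(         # prefix from r followed by the rest from r*
                matches(r[1], w[:i]) and matches(r, w[i:])
                for i in range(1, len(w) + 1))
        raise ValueError(r)

    # L(a(b|c)) = {"ab", "ac"}, as derived above:
    a_bc = ("seq", ("sym", "a"), ("alt", ("sym", "b"), ("sym", "c")))
    assert matches(a_bc, "ab") and matches(a_bc, "ac") and not matches(a_bc, "a")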
Parentheses and Precedence Rules

Fig. 1.1 Regular expressions and their derivation

When we use the symbols above to construct composite regular expressions such as a|ab∗, it is not a priori clear how the different subexpressions are grouped. We will sometimes (like we did above) use parentheses to make the grouping of symbols explicit such as in (a|(ab))∗. Additionally, we use precedence rules, similar to the algebraic convention that multiplication binds stronger than addition, so 3 + 4 × 5 is equivalent to 3 + (4 × 5) and not (3 + 4) × 5. For regular expressions, we use the following conventions: ∗ binds tighter than concatenation, which binds tighter than alternative (|). The example a|ab∗ from above is, hence, equivalent to a|(a(b∗)).

The | operator is associative and commutative. Concatenation is associative (but obviously not commutative) and distributes over |. Figure 1.2 shows these and other algebraic properties of regular expressions, including properties of some of the shorthands introduced below.

Fig. 1.2 Some algebraic properties of regular expressions

Suggested exercise: 1.1.

1.1.1 Shorthands

While the constructions in Fig. 1.1 suffice to describe, e.g., number strings and variable names, we will often use extra shorthands for convenience. For example, if we want to describe non-negative integer constants, we can do so by saying that a number constant is a sequence of one or more digits, which is expressed by the regular expression

  (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗

The large number of different digits makes this expression rather verbose. It gets even worse when we get to variable names, where we must enumerate all alphabetic letters (in both upper and lower case). Hence, we introduce a shorthand for sets of letters. A sequence of letters enclosed in square brackets represents the set of these letters. For example, we use [ab01] as a shorthand for a|b|0|1. Additionally, we can use interval notation, abbreviating the set of all digits to [0–9]. We can combine several intervals within one bracket and for example write [a–zA–Z] to denote all alphabetic letters in both lower and upper case.

When using intervals, we must be aware of the ordering for the symbols involved. For the digits and letters used above, there is usually no confusion. However, if we write, e.g., [0–z] it is not immediately clear what is meant. When using such notation in lexer generators, a character set encoding such as ASCII, ISO 8859-1, or UTF-8 is usually implied, so the symbols are ordered as defined by these encodings. To avoid confusion, we will in this book use the interval notation only for intervals of digits or alphabetic letters.

Getting back to the example of integer constants above, we can now write this much shorter as [0–9][0–9]∗. Since s∗ denotes zero or more occurrences of s, we needed to write the set of digits twice to describe that one or more digits are required. Such non-zero repetition is quite common, so we introduce another shorthand, s+, to denote one or more occurrences of s. With this notation, we can abbreviate our description of integers to [0–9]+. On a similar note, it is common that we can have zero or one occurrence of something (e.g., an optional sign to a number). Hence we introduce the shorthand s? for s|ε. The shorthand symbols + and ? bind with the same precedence as ∗.
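The claim that the shorthands add nothing new can be spot-checked in Python's re module, whose [0-9] and + behave as described above (this correspondence is our assumption for the illustration; re also has features that go beyond regular expressions):

    import re

    # The long form and the [0-9]+ shorthand describe the same language.
    long_form  = re.compile(r"(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*")
    short_form = re.compile(r"[0-9]+")

    for w in ["7", "2024", "", "12a", "x"]:
        assert bool(long_form.fullmatch(w)) == bool(short_form.fullmatch(w))
    # Both accept "7" and "2024" and reject "", "12a" and "x".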
We must stress that these shorthands are just that: they do not allow more languages to be described, they just make it possible to describe some languages more compactly. In the case of s+, it can even make an exponential difference: If + is nested n deep, recursive expansion of s+ to ss∗ yields 2ⁿ − 1 occurrences of ∗ in the expanded regular expression. For example, ((a+b)+c)+ expands to

  aa∗b(aa∗b)∗c(aa∗b(aa∗b)∗c)∗

1.1.2 Examples

We have already seen how we can describe non-negative integer constants using regular expressions. Here are a few examples of other typical programming language elements:

Keywords. A keyword like if is described by a regular expression that looks exactly like that keyword, e.g., the regular expression if (which is the concatenation of the two regular expressions i and f).

Variable names. In the programming language C, a variable name consists of letters, digits and the underscore symbol and it must begin with a letter or underscore. This can be described by the regular expression [a–zA–Z_][a–zA–Z_0–9]∗.

Integers. An integer constant is an optional sign followed by a non-empty sequence of digits: [+-]?[0–9]+. In some languages, a signed constant is not a single token, but a concatenation of two tokens: the sign and an unsigned number constant. This will usually allow whitespace between the sign and the number, which is not possible with the above.

Floats. In C, a floating-point constant can have an optional sign. After this, the mantissa part is described as a sequence of digits followed by a decimal point and then another sequence of digits. Either one (but not both) of the digit sequences can be empty. Finally, there is an optional exponent part, which is the letter e (in upper or lower case) followed by an (optionally signed) integer constant. If there is an exponent part to the constant, the mantissa part can be written as an integer constant (i.e., without the decimal point). Some examples: 3.14, -3., .23, 3e+4, 11.22e-3. This rather involved format can be described by the following regular expression:

  [+-]?(([0–9]+.[0–9]∗ | .[0–9]+)([eE][+-]?[0–9]+)? | [0–9]+[eE][+-]?[0–9]+)

This regular expression is complicated by the fact that the exponent is optional if the mantissa contains a decimal point, but not if it does not (as that would make the number an integer constant). We can make the description simpler if we make the regular expression for floats also include integers, and instead use other means of distinguishing integers from floats (see Sect. 1.8 for details). If we do this, the regular expression can be simplified to

  [+-]?(([0–9]+(.[0–9]∗)? | .[0–9]+)([eE][+-]?[0–9]+)?)

Some languages require digits on both sides of the decimal point (if there is a decimal point). This simplifies the description considerably, as there are fewer special cases:

  [+-]?([0–9]+(.[0–9]+)?([eE][+-]?[0–9]+)?)

(These expressions are tried out in the sketch following these examples.)

String constants. A string constant starts with a quotation mark followed by a sequence of symbols and finally another quotation mark. There are usually some restrictions on the symbols allowed between the quotation marks. For example, line-feed characters or quotes are typically not allowed, though these may be represented by special “escape” sequences of other characters, such as “\n\n” for a string containing two line-feeds.
As a (much simplified) example, we can describe string constants, where the allowed symbols are alphanumeric characters and sequences consisting of the backslash symbol followed by a letter (where each such pair is intended to represent a non-alphanumeric symbol), by the following regular expression:

  “([a–zA–Z0–9]|\[a–zA–Z])∗”

Suggested exercises: 1.2, 1.11(a).
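The integer and float expressions above can be tried out directly in Python's re syntax. The transcription is ours: interval dashes become '-', and the literal decimal point must be escaped as \. because a bare '.' means “any character” in re.

    import re

    integer = re.compile(r"[+-]?[0-9]+")
    floats  = re.compile(r"[+-]?(([0-9]+\.[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?"
                         r"|[0-9]+[eE][+-]?[0-9]+)")

    for s in ["3.14", "-3.", ".23", "3e+4", "11.22e-3"]:
        assert floats.fullmatch(s)          # the examples from the text
    assert not floats.fullmatch("17")       # an integer constant is not a float here
    assert integer.fullmatch("-42") and not integer.fullmatch("4.2")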
1.2 Nondeterministic Finite Automata

In our quest to transform regular expressions into efficient programs, we use a stepping stone: Nondeterministic finite automata. By their nondeterministic nature, these are not quite as close to “real machines” as we would like, so we will later see how these can be transformed into deterministic finite automata, which are easily and efficiently executable on normal hardware.

A finite automaton is, in the abstract sense, a machine that has a finite number of states and a finite number of transitions between pairs of states. A transition between two states is usually labelled by a character from the input alphabet, but we will also use transitions marked with ε, the so-called epsilon transitions.

A finite automaton can be used to decide if an input string is a member in some particular set of strings. To do this, we select one of the states of the automaton as the starting state. We start in this state, and in each step we can do one of the following:

- Follow an epsilon transition to another state, or
- Read a character from the input and follow a transition labelled by that character.

When all characters from the input are read, we see if the current state is marked as being accepting. If this is the case, the string we have read from the input is in the language defined by the automaton. Otherwise, it is not.

At each step, we may have a choice of several actions: We can choose between either an epsilon transition or a transition on an alphabet character, and if there are several transitions with the same symbol, we can choose between these. This makes the automaton nondeterministic, as the choice of action is not determined solely by looking at the current state and the next input character. It may be that some choices lead to an accepting state while others do not. This does, however, not mean that the string is sometimes in the language and sometimes not: We will include a string in the language if it is possible to make a sequence of choices that makes the string lead to an accepting state. You can think of it as solving a maze with symbols written in the corridors. If you can find the exit while walking over the letters of the string in the correct order, the string is recognised by the maze.

We can formally define a nondeterministic finite automaton by:

Definition 1.1 A nondeterministic finite automaton consists of a set S of states. One of these states, s0 ∈ S, is called the starting state of the automaton, and a subset F ⊆ S of the states are accepting states. Additionally, we have a set T of transitions. Each transition t connects a pair of states s1 and s2 and is labelled with a symbol, which is either a character c from an alphabet Σ, or the symbol ε, which indicates an epsilon-transition. A transition from state s to state t on the symbol c is written as s -c-> t.

Starting states are sometimes called initial states and accepting states can also be called final states (which is why we use the letter F to denote the set of accepting states). We use the abbreviations FA for finite automaton, NFA for nondeterministic finite automaton and (later in this chapter) DFA for deterministic finite automaton.

We will mostly use a graphical notation to describe finite automata. States are denoted by circles, optionally containing a number or name that identifies the state. This name or number has, however, no operational significance, it is solely used for identification purposes. Accepting states are denoted by using a double circle instead of a single circle. The initial state is marked by an unlabelled arrow pointing to it from outside the automaton. A transition is denoted by an arrow connecting two states. Near its midpoint, the arrow is labelled by the symbol (possibly ε) that triggers the transition. Note that the arrow that marks the initial state is not a transition and is, hence, not labelled by a symbol.

Repeating the maze analogy, the circles (states) are rooms and the arrows (transitions) are one-way corridors. The double circles (accepting states) are exits, while the unlabelled arrow pointing to the starting state is the entrance to the maze.

Fig. 1.3 Example of an NFA

Figure 1.3 shows an example of a nondeterministic finite automaton having three states. State 1 is the starting state, and state 3 is accepting. There is an epsilon-transition from state 1 to state 2, transitions on the symbol a from state 2 to states 1 and 3, and a transition on the symbol b from state 1 to state 3. This NFA recognises the language described by the regular expression a∗(a|b). As an example, the string aab is recognised by the following sequence of transitions:

  from  to  by
  1     2   ε
  2     1   a
  1     2   ε
  2     1   a
  1     3   b

At the end of the input we are in state 3, which is accepting. Hence, the string is accepted by the NFA. You can check this by placing a coin at the starting state and following the transitions by moving the coin.

Note that we sometimes have a choice of several transitions. If we are in state 2 and the next symbol is an a, we can, when reading this, either go to state 1 or to state 3. Likewise, if we are in state 1 and the next symbol is a b, we can either read this and go to state 3, or we can use the epsilon transition to go directly to state 2 without reading anything. If we, in the example above, had chosen to follow the a-transition to state 3 instead of state 1, we would have been stuck: We would have no legal transition, and yet we would not be at the end of the input (and even if we were, we are not in an accepting state). But, as previously stated, it is enough that there exists a path leading to acceptance, so the string aab is accepted by the NFA.

A program that decides if a string is accepted by a given NFA will have to check all possible paths to see if any of these accepts the string. This requires either backtracking until a successful path is found, or simultaneously following all possible paths. Both of these methods are too time-consuming to make NFAs suitable for efficient recognisers. We will, hence, use NFAs only as a stepping stone between regular expressions and the more efficient DFAs. We use this stepping stone because it makes the construction simpler than direct construction of a DFA from a regular expression.
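As just noted, deciding acceptance for an NFA means checking all possible paths. The following Python sketch does this by backtracking, for the NFA of Fig. 1.3 (the dictionary encoding of states and transitions is our own):

    # NFA of Fig. 1.3: start state 1, accepting state 3;
    # transitions 1 -ε-> 2, 2 -a-> 1, 2 -a-> 3, 1 -b-> 3.
    EPS = "ε"
    trans = {(1, EPS): {2}, (2, "a"): {1, 3}, (1, "b"): {3}}
    start, accepting = 1, {3}

    def accepts(w: str) -> bool:
        seen = set()                       # (state, position) pairs already tried
        def go(s, i):
            if (s, i) in seen:
                return False               # already explored (or an ε-cycle): give up here
            seen.add((s, i))
            if i == len(w) and s in accepting:
                return True
            # either follow an ε-transition without reading ...
            if any(go(t, i) for t in trans.get((s, EPS), ())):
                return True
            # ... or read the next symbol and follow a matching transition
            if i < len(w):
                return any(go(t, i + 1) for t in trans.get((s, w[i]), ()))
            return False
        return go(start, 0)

    assert accepts("aab") and accepts("b") and not accepts("abab")

The seen set prevents the search from looping on epsilon-cycles; a practical recogniser would instead follow all paths simultaneously or convert to a DFA, as the following sections describe.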
1.3 Converting a Regular Expression to an NFA

We will construct an NFA compositionally from a regular expression, i.e., we will construct the NFA for a composite regular expression from the NFAs constructed from its subexpressions. To be precise, we will from each subexpression construct an NFA fragment and then combine these fragments into bigger fragments. A fragment is not a complete NFA, so we complete the construction by adding the necessary components to make a complete NFA.

An NFA fragment consists of a number of states with transitions between these and additionally two incomplete transitions: One pointing into the fragment and one pointing out of the fragment. The incoming half-transition is not labelled by a symbol, but the outgoing half-transition is labelled by either ε or an alphabet symbol. These half-transitions are the entry and exit to the fragment and are used to connect it to other fragments or additional “glue” states.

Construction of NFA fragments for regular expressions is shown in Fig. 1.4. The construction follows the structure of the regular expression by first making NFA fragments for the subexpressions, and then joining these to form an NFA fragment for the whole regular expression. The NFA fragments for the subexpressions are shown as dotted ovals with the incoming half-transition on the left and the outgoing half-transition on the right. The symbol on the outgoing half-transition is not shown when an NFA fragment is shown as a dotted oval (it is “hidden” inside the oval).

Fig. 1.4 Constructing NFA fragments from regular expressions

When an NFA fragment has been constructed for the whole regular expression, the construction is completed by connecting the outgoing half-transition to an accepting state. The incoming half-transition serves to identify the starting state of the completed NFA. Note that, even though we allow an NFA to have several accepting states, an NFA constructed using this method will have only one: the one added at the end of the construction.

An NFA constructed this way for the regular expression (a|b)∗ac is shown in Fig. 1.5. We have numbered the states for future reference.

Fig. 1.5 NFA for the regular expression (a|b)∗ac

1.3.1 Optimisations

We can use the construction in Fig. 1.4 for any regular expression by expanding out all shorthands, e.g. converting s+ to ss∗, [0–9] to 0|1|2|···|9, s? to s|ε, and so on. However, this will result in very large NFAs for some expressions, so we use a few optimised constructions for the shorthands, as shown in Fig. 1.6. Additionally, we show an alternative construction for the regular expression ε. This construction does not quite follow the formula used in Fig. 1.4, as it does not have two half-transitions. Rather, the line-segment notation is intended to indicate that the NFA fragment for ε just connects the half-transitions of the NFA fragments that it is combined with.

Fig. 1.6 Optimised NFA construction for regular expression shorthands

In the construction for [0–9], the vertical ellipsis is meant to indicate that there is a transition for each of the digits in [0–9]. This construction generalises in the obvious way to other sets of characters, e.g., [a–zA–Z0–9]. We have not shown a special construction for s? as s|ε will do fine when we use the optimised construction for ε.

As an example, an NFA for [0–9]+ is shown in Fig. 1.7. Note that while this is optimised, it is not optimal. You can (in several different ways) make an NFA for this language using only two states.

Fig. 1.7 Optimised NFA for [0–9]+

Suggested exercises: 1.3(a), 1.11(b).
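The construction of Fig. 1.4 can be sketched compactly in Python. Where the book uses half-transitions, the sketch below uses explicit entry and exit states glued with ε-transitions, so the fragments are shaped slightly differently from Fig. 1.4, but the resulting NFA recognises the same language (the tuple encoding of regular expressions is ours, as in the earlier sketch):

    import itertools

    EPS = "ε"
    new_state = itertools.count().__next__    # generates fresh state numbers

    def build(r, trans):
        """Return (entry, exit) states of an NFA fragment for r; transitions go into trans."""
        def edge(a, label, b):
            trans.setdefault((a, label), set()).add(b)
        s, e = new_state(), new_state()
        if r == "eps":                        # the empty string: a single ε-move
            edge(s, EPS, e)
            return s, e
        kind = r[0]
        if kind == "sym":                     # single symbol: s -c-> e
            edge(s, r[1], e)
        elif kind == "seq":                   # rs: chain the two fragments
            s1, e1 = build(r[1], trans)
            s2, e2 = build(r[2], trans)
            edge(s, EPS, s1); edge(e1, EPS, s2); edge(e2, EPS, e)
        elif kind == "alt":                   # r|s: ε-branch in, ε-join out
            for sub in r[1:]:
                s1, e1 = build(sub, trans)
                edge(s, EPS, s1); edge(e1, EPS, e)
        elif kind == "star":                  # r*: skip it entirely, or loop through r
            s1, e1 = build(r[1], trans)
            edge(s, EPS, s1); edge(s, EPS, e)
            edge(e1, EPS, s1); edge(e1, EPS, e)
        return s, e

    # NFA for (a|b)*ac; the state numbering will differ from Fig. 1.5.
    regex = ("seq", ("seq", ("star", ("alt", ("sym", "a"), ("sym", "b"))),
                            ("sym", "a")), ("sym", "c"))
    trans = {}
    start, accept = build(regex, trans)

The resulting trans, start and accept can be fed to an acceptance test like the one in the previous sketch, or to the subset construction of Sect. 1.5.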
1.4 Deterministic Finite Automata

Nondeterministic automata are, as mentioned earlier, not quite as close to “the machine” as we would like. Hence, we now introduce a more restricted form of finite automaton: The deterministic finite automaton, or DFA for short. DFAs are special cases of NFAs that obey a number of additional restrictions:

- There are no epsilon-transitions.
- There may not be two identically labelled transitions out of the same state.

This means that we never have a choice of several next-states: The state and the next input symbol uniquely determine the transition (or lack of same). This is why these automata are called deterministic. Figure 1.8 shows a DFA equivalent to the NFA in Fig. 1.3. Using the maze analogy, finding an exit is easy, as you are never in doubt about which corridor to follow.

Fig. 1.8 Example of a DFA

The transition relation of a DFA is a partial function, and we often write it as a function: move(s, c) is the state (if any) that is reached from state s by a transition on the symbol c. If there is no such transition, move(s, c) is undefined.

It is very easy to implement a DFA on a computer: A two-dimensional table can be cross-indexed by state and symbol to yield the next state (or an indication that there is no transition), essentially implementing the move function by table lookup. Another (one-dimensional) table can indicate which states are accepting.

DFAs have the same expressive power as NFAs: A DFA is a special case of NFA, and any NFA can (as we shall shortly see) be converted to an equivalent DFA. However, the benefit of deterministic transitions comes at a cost: The resulting DFA can be exponentially larger than the NFA (see Sect. 1.9). In practice (i.e., when describing tokens for a programming language) the increase in size is usually modest, which is why most lexical analysers are based on DFAs.

Suggested exercises: 1.8(a, b), 1.9.
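Following the implementation note above, a DFA is little more than a table. In the sketch below, a Python dictionary stands in for the two-dimensional transition table, for a DFA recognising a∗(a|b), the language of the NFA in Fig. 1.3 (the state layout of Fig. 1.8 may differ from our encoding):

    # State 0 is the start state; a missing entry means "undefined".
    move_table = {
        (0, "a"): 1, (0, "b"): 2,
        (1, "a"): 1, (1, "b"): 2,
        # state 2 has no outgoing transitions
    }
    accepting = [False, True, True]

    def run(w: str) -> bool:
        s = 0
        for c in w:
            s = move_table.get((s, c))     # the move function as table lookup
            if s is None:                  # undefined transition: reject
                return False
        return accepting[s]

    assert run("aab") and run("a") and run("b") and not run("") and not run("ba")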
1.5 Converting an NFA to a DFA

As promised, we will show how NFAs can be converted to DFAs such that we, by combining this with the conversion of regular expressions to NFAs shown in Sect. 1.3, can convert any regular expression to a DFA.

The conversion is done by simulating all possible transitions in an NFA at the same time. This means that we operate with sets of NFA states: When we have several choices of a next state, we take all of the choices simultaneously and form a set of the possible next-states. Given a set of NFA states and a symbol, we follow all transitions on that symbol from all states in the set, which gives us a new set of NFA states. So we get transitions from sets of NFA states to sets of NFA states. The transitions are deterministic because we from one set of NFA states and one symbol have exactly one (possibly empty) set of NFA states that the transition moves to. The idea is that different sets of NFA states become different single states in the DFA that we construct.

Epsilon-transitions complicate the construction a bit: Whenever we are in an NFA state with an outgoing epsilon-transition, we can always choose to follow the epsilon-transition without reading any symbol. Hence, given a symbol, a next-state can be found by either following a transition with that symbol, or by first doing any number of epsilon-transitions and then a transition with the symbol. We handle this in the construction by extending sets of NFA states by adding all NFA states that can be reached from states in the set using only epsilon-transitions. We define the epsilon-closure of a set of NFA states as the set extended with all NFA states that can be reached from these using any number of epsilon-transitions. More formally:

Definition 1.2 Given a set M of NFA states, we define ε-closure(M) to be the least (in terms of the subset relation) set X that is a solution to the set equation

  X = M ∪ {t | s ∈ X and s -ε-> t ∈ T}

where T is the set of transitions in the NFA.

We will later on see several examples of set equations like the one above, so we use some time to discuss how such equations can be solved.

1.5.1 Solving Set Equations

The following is a very brief description of how to solve set equations like the above. If you find it confusing, you can read the Appendix and in particular Sect. A.4 first.

In general, a set equation over a single set-valued variable X has the form

  X = F(X)

where F is a function from sets to sets. Not all such equations are solvable, so we will restrict ourselves to special cases, which we will describe below. We will use calculation of epsilon-closure as the driving example.

In Definition 1.2, we must find a set X that solves the equation

  X = M ∪ {t | s ∈ X and s -ε-> t ∈ T}

To cast this equation into the form X = F(X) for a function F, we define FM to be

  FM(X) = M ∪ {t | s ∈ X and s -ε-> t ∈ T}

There may be several solutions to the equation X = FM(X). For example, if the NFA has a pair of states that connect to each other by epsilon transitions, adding this pair to a solution that does not already include the pair will create a new solution. The epsilon-closure of M is the least solution to the equation (i.e., the smallest set X that satisfies the equation).

FM has a property that is essential to our solution method: If X ⊆ Y then FM(X) ⊆ FM(Y). We say that FM is monotonic.

When we have an equation of the form X = F(X) and F is monotonic, we can find the least solution to the equation in the following way: We first guess that the solution is the empty set (written as ∅) and check to see if we are right: We compare ∅ with F(∅). If these are equal, we are done and ∅ is the solution. If not, we use the following properties:

- The least solution S to the equation satisfies S = F(S).
- ∅ ⊆ S implies that F(∅) ⊆ F(S).

to conclude that F(∅) ⊆ S. Hence, F(∅) is either S or a subset of S, so we can use it as a new guess. We now form the chain

  ∅ ⊆ F(∅) ⊆ F(F(∅)) ⊆ …

If at any point an element in the sequence is identical to the previous, we have a fixed-point, i.e., a set S such that S = F(S). This fixed-point of the sequence will be the least (in terms of set inclusion) solution to the equation. This is not difficult to verify, but we will omit the details. Since we are iterating a function until we reach a fixed-point, we call this process fixed-point iteration. If we are working with sets over a finite domain (e.g., sets of NFA states from a specific NFA), we will eventually reach a fixed-point, as there can be no infinite chain of strictly increasing sets.

We can use this method for calculating the epsilon-closure of the set {1} with respect to the NFA shown in Fig. 1.5. Since we want to find ε-closure({1}), M = {1}, so FM = F{1}. We start by guessing that X is the empty set:

  F{1}(∅) = {1} ∪ {t | s ∈ ∅ and s -ε-> t ∈ T}
          = {1}

As ∅ ≠ {1}, we continue.
  F{1}(F{1}(∅)) = F{1}({1})
                = {1} ∪ {t | s ∈ {1} and s -ε-> t ∈ T}
                = {1} ∪ {2, 5} = {1, 2, 5}

  F{1}(F{1}(F{1}(∅))) = F{1}({1, 2, 5})
                      = {1} ∪ {t | s ∈ {1, 2, 5} and s -ε-> t ∈ T}
                      = {1} ∪ {2, 5, 6, 7} = {1, 2, 5, 6, 7}

  F{1}(F{1}(F{1}(F{1}(∅)))) = F{1}({1, 2, 5, 6, 7})
                            = {1} ∪ {t | s ∈ {1, 2, 5, 6, 7} and s -ε-> t ∈ T}
                            = {1} ∪ {2, 5, 6, 7} = {1, 2, 5, 6, 7}

We have now reached a fixed-point and found our solution. Hence, we conclude that ε-closure({1}) = {1, 2, 5, 6, 7}.

We have done a good deal of repeated calculation in the iteration above: We have calculated the epsilon-transitions from state 1 three times and those from states 2 and 5 twice each. We can make an optimised fixed-point iteration by exploiting that the function is not only monotonic, but also distributive: F(X ∪ Y) = F(X) ∪ F(Y). This means that, when we during the iteration add elements to our set, we in the next iteration need only calculate F for the new elements and add the result to the set. In the example above, we get

  F{1}(∅) = {1}
  F{1}({1}) = {1} ∪ {2, 5} = {1, 2, 5}
  F{1}({1, 2, 5}) = F{1}({1}) ∪ F{1}({2, 5})
                  = {1, 2, 5} ∪ ({1} ∪ {6, 7}) = {1, 2, 5, 6, 7}
  F{1}({1, 2, 5, 6, 7}) = F{1}({1, 2, 5}) ∪ F{1}({6, 7})
                        = {1, 2, 5, 6, 7} ∪ ({1} ∪ ∅) = {1, 2, 5, 6, 7}

We can use this principle to formulate a work-list algorithm for finding the least fixed-point for an equation over a distributive function F. The idea is that we step-by-step build a set that eventually becomes our solution. In the first step, we calculate F(∅). The elements in this initial set are unmarked. In each subsequent step, we take an unmarked element x from the set, mark it and add F({x}) (unmarked) to the set. Note that if an element already occurs in the set (marked or not), it is not added again. When, eventually, all elements in the set are marked, we are done.

This is perhaps best illustrated by an example (the same as before). We start by calculating F{1}(∅) = {1}. The element 1 is unmarked, so we pick this, mark it and calculate F{1}({1}) and add the new elements 2 and 5 to the set. As we continue, we get this sequence of sets (marked elements shown with ✓):

  {1}
  {1✓, 2, 5}
  {1✓, 2✓, 5}
  {1✓, 2✓, 5✓, 6, 7}
  {1✓, 2✓, 5✓, 6✓, 7}
  {1✓, 2✓, 5✓, 6✓, 7✓}

Since all elements in the last set are marked, this is a solution to the equation.

We will later also need to solve simultaneous equations over sets, i.e., several equations over several sets. These can also be solved by fixed-point iteration in the same way as single equations, though the work-list version of the algorithm becomes a bit more complicated.
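The work-list algorithm is only a few lines of Python. The ε-transitions below are our transcription, inferred from the calculations in this section (1 → 2, 1 → 5, 5 → 6, 5 → 7, 8 → 1); the authoritative NFA is Fig. 1.5:

    eps_edges = {1: {2, 5}, 5: {6, 7}, 8: {1}}

    def eps_closure(M):
        closure = set(M)
        worklist = list(M)                 # the unmarked elements
        while worklist:
            s = worklist.pop()             # pick an element and mark it
            for t in eps_edges.get(s, ()):
                if t not in closure:       # add new elements, unmarked
                    closure.add(t)
                    worklist.append(t)
        return closure

    assert eps_closure({1}) == {1, 2, 5, 6, 7}
    assert eps_closure({8}) == {8, 1, 2, 5, 6, 7}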
1.5.2 The Subset Construction

After this brief detour into the realm of set equations, we are now ready to continue with our construction of DFAs from NFAs. The construction is called the subset construction, as each state in the DFA is a subset of the states from the NFA.

Algorithm 1.3 (The subset construction) Given an NFA N with states S, starting state s0 ∈ S, accepting states F ⊆ S, transitions T, and alphabet Σ, we construct an equivalent DFA D with states S′, starting state s0′, accepting states F′, and a transition function (called “move”) by:

  s0′ = ε-closure({s0})
  move(s′, c) = ε-closure({t | s ∈ s′ and s -c-> t ∈ T})
  S′ = {s0′} ∪ {move(s′, c) | s′ ∈ S′, c ∈ Σ}
  F′ = {s′ ∈ S′ | s′ ∩ F ≠ ∅}

The DFA uses the same alphabet Σ as the NFA.

A little explanation: The starting state s0′ of the DFA is the epsilon-closure of the set containing just the starting state s0 of the NFA, i.e., the states that are reachable from s0 solely by epsilon-transitions. A transition in the DFA on a symbol c is done by finding the set s′ of NFA states that comprise the DFA state, following all transitions on c in the NFA from all NFA states s in s′, combining the resulting sets of NFA states, and finally closing this under epsilon transitions. The set S′ of states in the DFA is the set of DFA states that can be reached from s0′ using the move function. S′ is defined as a set equation which can be solved as described in Sect. 1.5.1. A state s′ in the DFA is an accepting state if at least one of the NFA states in s′ is accepting.

As an example, we will convert the NFA in Fig. 1.5 to a DFA. The initial state in the DFA is ε-closure({1}), which we have already calculated to be s0′ = {1, 2, 5, 6, 7}. This is now entered into the set S′ of DFA states as unmarked (following the work-list algorithm from Sect. 1.5.1). We now pick an unmarked element from the uncompleted S′. We have only one choice: s0′. We now mark this and calculate the transitions for it. We get

  move(s0′, a) = ε-closure({t | s ∈ {1, 2, 5, 6, 7} and s -a-> t ∈ T})
               = ε-closure({3, 8})
               = {3, 8, 1, 2, 5, 6, 7} = s1′

  move(s0′, b) = ε-closure({t | s ∈ {1, 2, 5, 6, 7} and s -b-> t ∈ T})
               = ε-closure({8})
               = {8, 1, 2, 5, 6, 7} = s2′

  move(s0′, c) = ε-closure({t | s ∈ {1, 2, 5, 6, 7} and s -c-> t ∈ T})
               = ε-closure({}) = {}

Note that the empty set of NFA states is not a DFA state, so there will be no transition from s0′ on c.

We now add s1′ and s2′ to our incomplete S′, which now is {s0′✓, s1′, s2′}. We now pick s1′, mark it and calculate its transitions:

  move(s1′, a) = ε-closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} and s -a-> t ∈ T})
               = ε-closure({3, 8})
               = {3, 8, 1, 2, 5, 6, 7} = s1′

  move(s1′, b) = ε-closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} and s -b-> t ∈ T})
               = ε-closure({8})
               = {8, 1, 2, 5, 6, 7} = s2′

  move(s1′, c) = ε-closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} and s -c-> t ∈ T})
               = ε-closure({4})
               = {4} = s3′

We have seen s1′ and s2′ before, so only s3′ is added: {s0′✓, s1′✓, s2′, s3′}. We next pick s2′:

  move(s2′, a) = ε-closure({t | s ∈ {8, 1, 2, 5, 6, 7} and s -a-> t ∈ T})
               = ε-closure({3, 8})
               = {3, 8, 1, 2, 5, 6, 7} = s1′

  move(s2′, b) = ε-closure({t | s ∈ {8, 1, 2, 5, 6, 7} and s -b-> t ∈ T})
               = ε-closure({8})
               = {8, 1, 2, 5, 6, 7} = s2′

  move(s2′, c) = ε-closure({t | s ∈ {8, 1, 2, 5, 6, 7} and s -c-> t ∈ T})
               = ε-closure({}) = {}

No new elements are added, so we pick the remaining unmarked element s3′:

  move(s3′, a) = ε-closure({t | s ∈ {4} and s -a-> t ∈ T}) = ε-closure({}) = {}
  move(s3′, b) = ε-closure({t | s ∈ {4} and s -b-> t ∈ T}) = ε-closure({}) = {}
  move(s3′, c) = ε-closure({t | s ∈ {4} and s -c-> t ∈ T}) = ε-closure({}) = {}

Since all states are now marked, this completes the construction of S′ = {s0′, s1′, s2′, s3′}. Only s3′ contains the accepting NFA state 4, so this is the only accepting state of our DFA. Figure 1.9 shows the completed DFA.

Fig. 1.9 DFA constructed from the NFA in Fig. 1.5

Suggested exercises: 1.3(b), 1.5.
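Algorithm 1.3, combined with the work-list loop, gives a short program. The NFA below is our transcription of Fig. 1.5, inferred from the calculations above; frozenset is used so that sets of NFA states can themselves be set elements and dictionary keys:

    EPS = "ε"
    trans = {
        (1, EPS): {2, 5}, (5, EPS): {6, 7}, (8, EPS): {1},
        (6, "a"): {8}, (7, "b"): {8},        # the (a|b) loop
        (2, "a"): {3}, (3, "c"): {4},        # the trailing "ac"
    }
    start, accepting, alphabet = 1, {4}, "abc"

    def eps_closure(M):
        closure, work = set(M), list(M)
        while work:
            s = work.pop()
            for t in trans.get((s, EPS), ()):
                if t not in closure:
                    closure.add(t)
                    work.append(t)
        return closure

    def subset_construction():
        s0 = frozenset(eps_closure({start}))
        dfa_states, work, move = {s0}, [s0], {}
        while work:                           # work-list of unmarked DFA states
            S = work.pop()
            for c in alphabet:
                T = eps_closure({t for s in S for t in trans.get((s, c), ())})
                if T:                         # the empty set is not a DFA state
                    T = frozenset(T)
                    move[(S, c)] = T
                    if T not in dfa_states:
                        dfa_states.add(T)
                        work.append(T)
        accept = {S for S in dfa_states if S & accepting}
        return s0, dfa_states, move, accept

    s0, dfa_states, move, accept = subset_construction()
    assert len(dfa_states) == 4 and len(accept) == 1    # matches Fig. 1.9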
1.6 Size Versus Speed

In the above example, we get a DFA with 4 states from an NFA with 8 states. However, as the states in the constructed DFA are (nonempty) sets of states from the NFA, there may potentially be 2ⁿ − 1 states in a DFA constructed from an n-state NFA. It is not too difficult to construct classes of NFAs that expand exponentially in this way when converted to DFAs, as we shall see in Sect. 1.9.1. Since we are mainly interested in NFAs that are constructed from regular expressions as in Sect. 1.3, we might ask ourselves if these NFAs might not be in a suitably simple class that does not risk exponential-sized DFAs. Alas, this is not the case. Just as we can construct a class of NFAs that expand exponentially, we can construct a class of regular expressions where the smallest equivalent DFAs are exponentially larger. This happens rarely when we use regular expressions or NFAs to describe tokens in programming languages, though.

It is possible to avoid the blow-up in size by operating directly on regular expressions or NFAs when testing strings for inclusion in the languages these define. However, there is a speed penalty for doing so. A DFA can be run in time k ∗ |v|, where |v| is the length of the input string v and k is a small constant that is independent of the size of the DFA.¹ Regular expressions and NFAs can be run in time close to c ∗ |N| ∗ |v|, where |N| is the size of the NFA (or regular expression) and the constant c typically is larger than k. All in all, DFAs are a lot faster to use than NFAs or regular expressions, so it is only when the size of the DFA is a real problem that one should consider using NFAs or regular expressions directly.

¹ If memory access is assumed to be constant time, regardless of memory size.

1.7 Minimisation of DFAs

Even though the DFA in Fig. 1.9 has only four states, it is not minimal. It is easy to see that states s0′ and s2′ are equivalent: Neither is accepting, and they have identical transitions. We can hence collapse these states into a single state and get a three-state DFA.

DFAs constructed from regular expressions through NFAs are often non-minimal, though they are rarely very far from being minimal. Nevertheless, minimising a DFA is not terribly difficult and can be done fairly fast, so many lexer generators perform minimisation.

An interesting property of DFAs is that any regular language (a language that can be expressed by a regular expression, NFA or DFA) has a unique equivalent minimal DFA (if states are not labelled). Hence, we can decide equivalence of two regular expressions (or NFAs or DFAs) by converting both to minimal DFAs and comparing the results.

As hinted above, minimisation of DFAs is done by collapsing equivalent states. However, deciding whether two states are equivalent is not just done by testing if their immediate transitions are identical, since transitions to different states may be equivalent if the target states turn out to be equivalent. For example, in Fig. 1.9, a transition to s0′ (if any such existed) should be considered equivalent to a transition to s2′ (since s0′ and s2′ are equivalent). Hence, we use a strategy where we first assume all states to be equivalent and then distinguish them only if we can prove them different. We use the following rules for this:

- An accepting state is not equivalent to a non-accepting state.
- If two states s1 and s2 have transitions on the same symbol c to states t1 and t2 that we have already proven to be different, then s1 and s2 are different. This also applies if only one of s1 or s2 has a defined transition on c.
This leads to the following algorithm.

Algorithm 1.4 (DFA minimisation) Given a DFA D over the alphabet Σ with states S, where F ⊆ S is the set of the accepting states, we construct a minimal DFA Dmin, where each state is a group of equivalent states from D. The groups in the minimal DFA are consistent: For any pair of states s1, s2 in a group and any symbol c, move(s1, c) and move(s2, c) are both in the same group, or both are undefined. In other words, we can not tell s1 and s2 apart by looking at their transitions. We minimise the DFA D in the following way:

(1) We start with two groups: the set of accepting states F and the set of non-accepting states S \ F. Both these groups are initially unmarked.

(2) We pick any unmarked group G and check if it is consistent. If it is, we mark it. If G is not consistent, we split it into maximal consistent subgroups and replace G by these. A consistent subgroup is maximal if adding any other state from G to it will make it inconsistent. All groups (not just the newly added) are then unmarked.

(3) If there are no unmarked groups left, we are done, and the remaining groups are all consistent, and each group will be a state of the minimal DFA. Otherwise, we go back to step 2.

The starting state of the minimal DFA is the group that contains the original starting state, and any group of accepting states is an accepting state in the minimal DFA.

The time needed for minimisation using Algorithm 1.4 depends on the strategy used for picking groups in step 2. With random choices, the worst case is quadratic in the size of the DFA, but there exist strategies for choosing groups and data structures for representing these that guarantee a worst-case time that is O(n × log(n)), where n is the number of states in the (non-minimal) DFA. In other words, the method can be implemented so it uses little more than linear time to do minimisation. We will not here go into further detail but just refer to the literature for the optimal algorithm.

We will, however, note that we can make a slight optimisation to Algorithm 1.4: A group that consists of a single state needs never be split, so we need never select such in step 2, and we can stop when all unmarked groups are singletons.

1.7.1 Example

As an example of minimisation, take the DFA in Fig. 1.10.

Fig. 1.10 Non-minimal DFA

We now make the initial division into two groups: The accepting and the non-accepting states.

  G1 = {0, 6}
  G2 = {1, 2, 3, 4, 5, 7}

These are both unmarked. We next pick any unmarked group, say G1. To check if this is consistent, we make a table of its transitions:

  G1    a    b
  0     G2   −
  6     G2   −

This is consistent, so we just mark it and select the remaining unmarked group G2 and make a table for this:

  G2    a    b
  1     G2   G2
  2     G2   G2
  3     −    G2
  4     G1   G2
  5     G2   G2
  7     G1   G2

G2 is evidently not consistent, so we split it into maximal consistent subgroups and erase all marks (including the one on G1):

  G1 = {0, 6}
  G3 = {1, 2, 5}
  G4 = {3}
  G5 = {4, 7}

We now pick G3 for consideration:

  G3    a    b
  1     G5   G3
  2     G4   G3
  5     G5   G3

This is not consistent either, so we split again and get

  G1 = {0, 6}
  G4 = {3}
  G5 = {4, 7}
  G6 = {1, 5}
  G7 = {2}

We now pick G5 and check this:

  G5    a    b
  4     G1   G6
  7     G1   G6

This is consistent, so we mark it and pick another group, say, G6:

  G6    a    b
  1     G5   G7
  5     G5   G7

This, also, is consistent, so we have only one unmarked non-singleton group left: G1.
  G1    a    b
  0     G6   −
  6     G6   −

As we mark this, we see that there are no unmarked groups left other than singletons. Hence, the groups now form a minimal DFA equivalent to the one in Fig. 1.10. The minimised DFA is shown in Fig. 1.11.

Fig. 1.11 Minimal DFA

1.7.2 Dead States

Algorithm 1.4 works under some, as yet, unstated assumptions:

- The move function is total, i.e., there are transitions on all symbols from all states, or
- There are no dead states in the DFA.

A dead state is a state from which no accepting state can be reached. Dead states do not occur in DFAs constructed from NFAs without dead states, and NFAs with dead states can not be constructed from regular expressions by the method shown in Sect. 1.3. Hence, as long as we use minimisation only on DFAs constructed by this process, we are safe. However, if we get a DFA of unknown origin, we risk that it may contain both dead states and undefined transitions.

A transition to a dead state should rightly be equivalent to an undefined transition, as neither can yield future acceptance. The only difference is that we discover this earlier on an undefined transition than when we make a transition to a dead state. However, Algorithm 1.4 will treat these differently and may hence decree a group to be inconsistent even though it is not. This will make the algorithm split a group that does not need to be split, hence producing a non-minimal DFA. Consider, for example, the following DFA:

States 1 and 2 are, in fact, equivalent, as starting from either one, any sequence of as (and no other sequences) will lead to an accepting state. A minimal equivalent DFA consists of a single accepting state with a transition to itself on a. But Algorithm 1.4 will see a transition on b out of state 2 but no transition on b out of state 1, so it will not keep states 1 and 2 in the same group. As a result, no reduction in the DFA is made.

There are two solutions to this problem:

(1) Make sure there are no dead states. This can be ensured by invariant, as is the case for DFAs constructed from regular expressions by the methods shown in this chapter, or by explicitly removing dead states before minimisation. Dead states can be found by a simple reachability analysis for directed graphs (if you can’t reach an accepting state from state s, s is a dead state). In the above example, state 3 is dead and can be removed (including the transition to it). This makes states 1 and 2 stay in the same group during minimisation.

(2) Make sure there are no undefined transitions. This can be achieved by adding a new dead state (which has transitions to itself on all symbols) and replacing all undefined transitions by transitions to this dead state. After minimisation, the group that contains the added dead state will contain all dead states from the original DFA. This group can now be removed from the minimal DFA (which will once more have undefined transitions). In the above example, a new (non-accepting) state 4 has to be added. State 1 has a transition to state 4 on b, state 3 has a transition to state 4 on both a and b, and state 4 has transitions to itself on both a and b. After minimisation, states 1 and 2 will be joined, as will states 3 and 4. Since state 4 is dead, all states joined with it are also dead, so we can remove the combined state 3 and 4 from the resulting minimised automaton.

Suggested exercises: 1.6, 1.11(c).
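To round off this section, Algorithm 1.4 can be sketched as repeated splitting: start from the partition {accepting, non-accepting} and split every group by where its transitions lead, until no group splits. This variant recomputes all groups in each round instead of using the mark/split work-list, but it arrives at the same partition. The transition table is one concrete DFA consistent with the tables of the example above; the actual Fig. 1.10 may differ in the exact target states, though not in the resulting grouping:

    move = {(0, "a"): 1, (6, "a"): 5,
            (1, "a"): 4, (1, "b"): 2, (5, "a"): 7, (5, "b"): 2,
            (2, "a"): 3, (2, "b"): 1, (3, "b"): 2,
            (4, "a"): 0, (4, "b"): 1, (7, "a"): 6, (7, "b"): 5}
    accepting = {0, 6}
    states, alphabet = set(range(8)), "ab"

    def minimise():
        groups = [accepting, states - accepting]      # the initial two groups
        while True:
            def group_of(s):                          # None for an undefined transition
                return None if s is None else next(i for i, g in enumerate(groups) if s in g)
            def signature(s):                         # which group each symbol leads to
                return tuple(group_of(move.get((s, c))) for c in alphabet)
            new_groups = []
            for g in groups:                          # split each group by signature
                parts = {}
                for s in g:
                    parts.setdefault(signature(s), set()).add(s)
                new_groups.extend(parts.values())
            if len(new_groups) == len(groups):        # nothing split: all consistent
                return new_groups
            groups = new_groups

    print(sorted(map(sorted, minimise())))
    # [[0, 6], [1, 5], [2], [3], [4, 7]] — the five groups found above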
1.8 Lexers and Lexer Generators

We have, in the previous sections, seen how we can convert a language description written as a regular expression into an efficiently executable representation (a DFA). What we want is something more: A program that does lexical analysis, i.e., a lexer:

- A lexer has to distinguish between several different types of tokens, e.g., numbers, variables and keywords. Each of these is described by its own regular expression.
- A lexer does not check if its entire input is included in the languages defined by the regular expressions. Instead, it has to cut the input into pieces (tokens), each of which is included in one (or more) of the languages. If there are several ways to split the input into legal tokens, the lexer has to decide which of these it should use.

A program that takes a set of token definitions (each consisting of a regular expression and a token name) and generates a lexer is called a lexer generator.

The simplest approach would be to generate a DFA for each token definition and apply the DFAs one at a time to the input. This can, however, be quite slow, so we will instead from the set of token definitions generate a single DFA that tests for all the tokens simultaneously. This is not difficult to do: If the tokens are defined by regular expressions r1, r2, …, rn, then the regular expression r1 | r2 | … | rn describes the union of the languages of r1, r2, …, rn, and the DFA constructed from this combined regular expression will scan for all token types at the same time.

However, we also wish to distinguish between different token types, so we must be able to know which of the many tokens was recognised by the combined DFA. We can accomplish this with the following construction of a combined DFA (sketched in code below):

(1) Construct NFAs N1, N2, …, Nn for each of r1, r2, …, rn.
(2) Mark the accepting states of the NFAs by the name of the tokens they accept.
(3) Combine the NFAs to a single NFA by adding a new starting state which has epsilon-transitions to each of the starting states of the NFAs.
(4) Convert the combined NFA to a DFA.
(5) Each accepting state of the DFA consists of a set of NFA states, at least one of which is an accepting state which we marked by token type in step 2. These marks are used to mark the accepting states of the DFA, so each of these will indicate all the token types it accepts.

If the same accepting state in the DFA can accept several different token types, it is because these overlap. This is not unusual, as keywords usually overlap with variable names, and a description of floating point constants may include integer constants as well. In such cases, we can do one of two things:

- Let the lexer generator generate an error and require the user to make sure the tokens are disjoint.
- Let the user of the lexer generator choose which of the tokens is preferred.

It can be quite difficult (though always possible) with regular expressions to define, e.g., the set of names that are not keywords. Hence, it is common to let the lexer choose according to a prioritised list. Normally, the order in which tokens are defined in the input to the lexer generator indicates priority (earlier defined tokens take precedence over later defined tokens). Hence, keywords are usually defined before variable names, which means that, for example, the string “if” is recognised as a keyword and not a variable name.
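Steps (2) and (3) of the construction are illustrated below with two hand-written NFAs, one for the keyword if and one for a simplified identifier token, combined under a fresh start state. All names and state numberings are ours, chosen for the sketch:

    EPS = "ε"

    def nfa_if(base):
        # recognises the keyword "if": base -i-> base+1 -f-> base+2
        trans = {(base, "i"): {base + 1}, (base + 1, "f"): {base + 2}}
        return trans, base, {base + 2: "IF"}      # accepting state marked "IF"

    def nfa_id(base):
        # simplified identifiers [a-z]+: a first letter, then a letter loop
        trans = {}
        for c in "abcdefghijklmnopqrstuvwxyz":
            trans.setdefault((base, c), set()).add(base + 1)
            trans.setdefault((base + 1, c), set()).add(base + 1)
        return trans, base, {base + 1: "ID"}

    def combine(nfas):
        """Step (3): a fresh start state 0 with ε-transitions to each NFA's start."""
        trans, marks = {(0, EPS): set()}, {}
        for t, start, m in nfas:
            trans[(0, EPS)].add(start)
            trans.update(t)          # state numbers are disjoint by construction
            marks.update(m)
        return trans, 0, marks

    combined, start, marks = combine([nfa_if(1), nfa_id(4)])
    # marks == {3: "IF", 5: "ID"}; after the subset construction, a DFA state
    # containing both 3 and 5 keeps only the higher-priority mark, IF.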
When an accepting state in a DFA contains accepting NFA states with different marks, the mark corresponding to the highest priority (earliest defined) token is used. Hence, we can simply erase all but one mark from each accepting state. This is a very simple and effective solution to the problem.

When we described minimisation of DFAs, we used two initial groups: One for the accepting states and one for the non-accepting states. As there are now several kinds of accepting states (one for each token), we must use one group for each token, so we will have a total of n + 1 initial groups when we have n different tokens.

To illustrate the precedence rule, Fig. 1.12 shows an NFA made by combining NFAs for variable names, the keyword if, integers and floats, as described by the regular expressions in Sect. 1.1.2. The individual NFAs are (simplified versions of) what you get from the method described in Sect. 1.3. When a transition is labelled by a set of characters, it is a shorthand for a set of transitions each labelled by a single character. The accepting states are labelled with token names as described above.

Fig. 1.12 Combined NFA for several tokens

The corresponding minimised DFA is shown in Fig. 1.13. Note that state G is a combination of states 9 and 12 from the NFA, so it can accept both NUM and FLOAT, but since integers take priority over floats, we have marked G with NUM only. Similarly, state C is a combination of states 4 and 6 in the NFA, so it can accept both IF and ID, but since keywords take precedence over identifiers, we choose to let it accept only IF.

Splitting the Input Stream

As mentioned, the lexer must cut the input into tokens. This may be done in several ways. For example, the string if17 can be split in many different ways:

- As one token, which is the variable name if17.
- As the variable name if1 followed by the number 7.
- As the keyword if followed by the number 17.
- As the keyword if followed by the numbers 1 and 7.
- As the variable name i followed by the variable name f17.
- And several more.

A common convention is that it is the longest prefix of the input that matches any token which will be chosen. Hence, the first of the above possible splittings of if17 will be chosen. Note that the principle of the longest match takes precedence over the order of definition of tokens, so even though the string starts with the keyword if, which has higher priority than variable names, the variable name is chosen because it is longer.

Modern languages like C, Java or Haskell follow this convention, and so do most lexer generators, but some (mostly older) languages like FORTRAN do not. There are also fairly new languages that have exceptions to the longest-prefix rule: In F#, the expression f-1 is split into three tokens: f, -, and 1, even though -1 is a valid token for a signed number. For example, f -1 is split into just two tokens (as the space character is not considered a token), and is read as “the function f applied to the number -1”, whereas f-1 is read as “f minus one”. When other conventions are used, lexers must either be written by hand to handle these conventions, or the conventions used by the lexer generator must be overridden. Some lexer generators allow the user to have some control over the conventions used.

The principle of the longest matching prefix is handled by letting the DFA read as far as it can, until it either reaches the end of the input, or no transition is defined on the next input symbol.
If the current state at this point is accepting, we are in luck, and can simply output the corresponding token. If not, we must go back to the last time we were in an accepting state and output the token indicated by this. The characters read since then are put back in the input stream. The lexer must, hence, retain the symbols it has read since the last accepting state, so it in such situations can re-insert these in the input. If we are not at the end of the input stream, we restart the DFA (in its initial state) on the remaining input to find the next tokens.

As an example, consider lexing of the string 3e-y with the DFA in Fig. 1.13. We get to the accepting state G after reading the digit 3. However, we can continue making legal transitions to state I on e and then to state J on - (as these could be the start of the exponent part of a real number). It is only when we, in state J, find that there is no transition on y that we realise that this is not the case. We must now go back to the last accepting state (G) and output the number 3 as the first token and re-insert - and e in the input stream, so we can continue with e-y when we look for the subsequent tokens, which will be the identifier e followed by a lexical error, since no prefix of -y matches any tokens.

Fig. 1.13 Combined DFA for several tokens

Lexical Errors

If no prefix of the input string forms a valid token, a lexical error has occurred. When this happens, the lexer will usually report an error. At this point, it may stop reading the input or it may attempt continued lexical analysis by skipping characters until a valid prefix is found. The purpose of the latter approach is to try finding further lexical errors in the same input, so several of these can be corrected by the user before re-running the lexer. Some of these subsequent errors may, however, not be real errors, but may be caused by the lexer not skipping enough characters (or skipping too many) after the first error is found. If, for example, the start of a comment is ill-formed, the lexer may try to interpret the contents of the comment as individual tokens, and if the end of a comment is ill-formed, the lexer will read until the end of the next comment (if any) before continuing, hence skipping too much text.

When the lexer finds an error, the consumer of the tokens that the lexer produces (e.g., the rest of the compiler) can not usually itself produce a valid result. However, the compiler may try to find other errors in the remaining input, again allowing the user to find several errors in one edit-compile cycle. Again, some of the subsequent errors may really be spurious errors caused by lexical error(s), so the user will have to guess at the validity of every error message except the first, as only the first error message is guaranteed to be a real error. Nevertheless, such error recovery has, when the input is so large that restarting the lexer from the start of input incurs a considerable time overhead, proven to be an aid in productivity by locating more errors in less time. In an integrated development environment, the lexer may work interactively with a text editor, point to a lexical error in the text, allow the user to edit the file, and restart from the first modified position in the file when the user recompiles the program.
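The longest-prefix rule with priority tie-breaking can be sketched as follows. For brevity the sketch tries each token's regular expression at the current position (using Python's re) instead of running one combined DFA and backtracking, but on the examples from the text it behaves the same way; the token set is a simplified version of Sect. 1.1.2, with signs left to separate tokens:

    import re

    TOKEN_SPECS = [                    # earlier entries take priority on ties
        ("IF",    re.compile(r"if")),
        ("NUM",   re.compile(r"[0-9]+")),
        ("FLOAT", re.compile(r"([0-9]+\.[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?"
                             r"|[0-9]+[eE][+-]?[0-9]+")),
        ("ID",    re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
        ("WS",    re.compile(r"[ \t\n]+")),
    ]

    def lex(text):
        pos = 0
        while pos < len(text):
            best_len, best_tok = 0, None
            for name, rx in TOKEN_SPECS:       # longest match wins;
                m = rx.match(text, pos)        # definition order breaks ties
                if m and m.end() - pos > best_len:
                    best_len, best_tok = m.end() - pos, name
            if best_tok is None:
                raise SyntaxError("lexical error at position %d" % pos)
            if best_tok != "WS":               # white-space is just ignored
                yield (best_tok, text[pos:pos + best_len])
            pos += best_len

    tokens = lex("if17 3e-y")
    print(next(tokens), next(tokens), next(tokens))
    # ('ID', 'if17') ('NUM', '3') ('ID', 'e')   — as in the text above;
    # a further next(tokens) raises SyntaxError: no prefix of "-y" is a token.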
Lexical Errors

If no prefix of the input string forms a valid token, a lexical error has occurred. When this happens, the lexer will usually report an error. At this point, it may stop reading the input, or it may attempt continued lexical analysis by skipping characters until a valid prefix is found. The purpose of the latter approach is to try to find further lexical errors in the same input, so several of these can be corrected by the user before re-running the lexer. Some of these subsequent errors may, however, not be real errors, but may be caused by the lexer not skipping enough characters (or skipping too many) after the first error is found. If, for example, the start of a comment is ill-formed, the lexer may try to interpret the contents of the comment as individual tokens, and if the end of a comment is ill-formed, the lexer will read until the end of the next comment (if any) before continuing, hence skipping too much text.

When the lexer finds an error, the consumer of the tokens that the lexer produces (e.g., the rest of the compiler) cannot usually itself produce a valid result. However, the compiler may try to find other errors in the remaining input, again allowing the user to find several errors in one edit-compile cycle. Again, some of the subsequent errors may really be spurious errors caused by earlier lexical errors, so the user will have to guess at the validity of every error message except the first, as only the first error message is guaranteed to be a real error. Nevertheless, such error recovery has proven to be an aid in productivity by locating more errors in less time, in particular when the input is so large that restarting the lexer from the start of the input incurs a considerable time overhead. In an integrated development environment, the lexer may work interactively with a text editor: it can point to a lexical error in the text, allow the user to edit the file, and restart from the first modified position in the file when the user recompiles the program.

1.8.1 Lexer Generators

A lexer generator will typically use a notation for regular expressions similar to the one described in Fig. 1.1, but may require alphabet characters to be quoted to distinguish them from the symbols used to build regular expressions. For example, an * intended to match a multiplication symbol in the input is distinguished from an * used to denote repetition by quoting the * symbol, e.g., as '*' or "*". Additionally, some lexer generators extend regular expressions in various ways, e.g., allowing a set of characters to be specified by listing the characters that are not in the set. This is useful, for example, to specify that a comment continues until the next newline character.

The input to the lexer generator will normally contain a list of regular expressions that each denote a token. Each of these regular expressions has an associated action. The action describes what is passed on to the consumer (e.g., the parser), typically an element from a token data type, which describes the type of token (NUM, ID, etc.) and sometimes additional information such as the value of a number token, the name of an identifier token, and the position of the token in the input file. The information needed to construct such values is typically provided by the lexer generator through library functions or variables that can be used in the actions.

Normally, the lexer generator requires white-space and comments to be defined by regular expressions. The actions for these regular expressions are typically empty, meaning that white-space and comments are just ignored. Note, however, that in languages (such as Python, Haskell, and F#) where indentation is used to indicate grouping of statements or expressions, white-space will sometimes have to generate tokens to indicate the start and end of such groupings.

An action can be more than just returning a token. If, for example, escape sequences (for defining, e.g., control characters) are allowed in string constants, the actions for string tokens will typically translate the string containing these sequences into a string where they have been substituted by the characters they represent.

If a language has a large number of keywords, then a DFA that recognises all of these as individual tokens can be fairly large. In such cases, the keywords are not described as separate regular expressions in the lexer definition, but instead treated as special cases of the identifier token. The action for identifiers will then look the name up in a table of keywords and return the appropriate token type (or an identifier token if the name is not found in the table of keywords). A similar strategy can be used if the language allows identifiers to be equal to keywords, so they are distinguished by context.

Another use of non-trivial lexer actions is for nested comments. In principle, a regular expression (or finite automaton) cannot recognise arbitrarily deeply nested comments (see Sect. 1.9), but by using a global counter, the actions for comment tokens can keep track of the nesting level.
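A minimal sketch of these two kinds of non-trivial actions, written here in Python for a hypothetical action-based lexer (the token names, the keyword set, and the (* ... *) comment delimiters are illustrative assumptions):

KEYWORDS = {"if": "IF", "then": "THEN", "else": "ELSE"}

def identifier_action(lexeme, position):
    # One identifier rule with a table lookup replaces one DFA branch
    # per keyword: reserved names get their own token type.
    return (KEYWORDS.get(lexeme, "ID"), lexeme, position)

comment_depth = 0  # global counter: a finite automaton alone cannot count nesting

def open_comment_action(lexeme, position):   # action for the delimiter (*
    global comment_depth
    comment_depth += 1

def close_comment_action(lexeme, position):  # action for the delimiter *)
    global comment_depth
    if comment_depth == 0:
        raise SyntaxError("unmatched comment terminator at %d" % position)
    comment_depth -= 1

print(identifier_action("if", 0))    # ('IF', 'if', 0)
print(identifier_action("iffy", 5))  # ('ID', 'iffy', 5)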
Sometimes lexer generators allow several different starting points. In the example in Figs. 1.12 and 1.13, all regular expressions share the same starting state. However, a single lexer may be used, e.g., both for tokens in the programming language and for tokens in the input data to that language. Often, there will be a good deal of sharing between these token sets (the tokens allowed in the input may, for example, be a subset of the tokens allowed in programs). Hence, it is useful to allow these to share an NFA, as this will save space. The resulting DFA will have several starting states. An accepting state may now have more than one token name attached, as long as these come from different token sets (corresponding to different starting points).

In addition to using this feature for several sources of text (program and input), it can be used locally within a single text to read very complex tokens. For example, nested comments and complex-format strings (with nontrivial escape sequences) can be easier to handle if this feature is used.

1.9 Properties of Regular Languages

We have talked about regular languages as the class of languages that can be described by regular expressions or finite automata, but this in itself may not give a clear understanding of what is possible and what is not possible to describe by a regular language. We will now state a few properties of regular languages, show some non-obvious examples of regular and non-regular languages, and give informal rules of thumb that can (sometimes) be used to decide if a language is regular.

1.9.1 Relative Expressive Power

First, we repeat that regular expressions, NFAs and DFAs have exactly the same expressive power: They can all describe all regular languages and only these. Some languages may, however, have much shorter descriptions in one of these forms than in the others. We have already argued that we can construct, from a regular expression, an NFA whose size is linear in the size of the regular expression, and that converting an NFA to a DFA can potentially give an exponential increase in size (see below for a concrete example of this). Since DFAs are also NFAs, NFAs are clearly at least as compact as (and sometimes much more compact than) DFAs. Similarly, we can see that NFAs are at least as compact (up to a small constant factor) as regular expressions. But we have not yet considered whether the converse is true: Can an NFA be converted to a regular expression of proportional size? The answer is, unfortunately, no: There exist classes of NFAs (and even DFAs) that need regular expressions that are exponentially larger to describe them. This is, however, mainly of academic interest, as we rarely have to make conversions in this direction.

If we are only interested in whether a language is regular, rather than in the size or efficiency of its description, it does not matter which of the formalisms we choose, so we can in each case choose the formalism that suits us best. Sometimes it is easier to describe a regular language using a DFA or NFA instead of a regular expression. For example, the set of binary number strings that represent numbers that divide evenly by 5 can be described by a 6-state DFA (see Exercise 1.10), but it requires a very complex regular expression to do so. For programming language tokens, regular expressions are typically quite suitable.

The subset construction (Algorithm 1.3) maps sets of NFA states to DFA states. Since there are 2^n − 1 non-empty sets of n NFA states, the resulting DFA can potentially have exponentially more states than the NFA. But can this potential ever be realised? To answer this, it is not enough to find one n-state NFA that yields a DFA with 2^n − 1 states (any one-state NFA does that). We need to find a family of ever bigger NFAs, all of which yield exponentially-sized DFAs. We also need to argue that the resulting DFAs are minimal. One construction that has these properties is the following: For each integer n > 1, construct an n-state NFA in the following way:

1. State 0 is the starting state, and state n−1 is accepting.
2. If 0 ≤ i < n−1, state i has a transition to state i+1 on the symbol a.
3. All states have transitions to themselves and to state 0 on the symbol b.
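A quick empirical check of this construction (a sketch, not code from the book): simulate the subset construction directly on sets of NFA states and count the reachable DFA states.

def reachable_dfa_states(n):
    """Count the DFA states produced by the subset construction for
    the n-state NFA described above (the empty, dead set excluded)."""
    def step(states, symbol):
        if symbol == "a":
            return frozenset(i + 1 for i in states if i < n - 1)
        else:  # "b": every state maps to itself, and state 0 is added
            return states | {0}
    start = frozenset({0})
    seen, work = {start}, [start]
    while work:
        current = work.pop()
        for symbol in "ab":
            nxt = step(current, symbol)
            if nxt and nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)

for n in range(2, 9):
    print(n, reachable_dfa_states(n))  # prints 2^n - 1 for each n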
Figure 1.14 shows such an NFA for n = 4.

Fig. 1.14 A 4-state NFA that gives 15 DFA states

We can represent a set of these states by an n-bit number: Bit i of the number is 1 if and only if state i is in the set. The set that contains only the initial NFA state is, hence, represented by the binary number 1, zero-extended to n bits. We shall see that the way a transition maps a set of states to a new set of states can be expressed as an operation on the number:

- A transition on a maps the number x to (2x mod 2^n).
- A transition on b maps the number x to (x OR 1), using bitwise OR.

This is not hard to verify, so we leave this to the interested reader. It is also easy to see that, starting from the number 1, these two operations can generate any n-bit number. Hence, any subset can be reached by a sequence of transitions, which means that the subset construction will generate a DFA state for every possible non-empty subset of the NFA states.

But is the DFA minimal? If we look at the NFA, we can see that, if i < n−1, an a leads from state i to state i+1, so for each NFA state i there is exactly one sequence of as that leads to the accepting state, and that sequence has n−1−i as. Hence, a DFA state whose subset contains the NFA state i will lead to acceptance on a string of n−1−i as, while a DFA state whose subset does not contain i will not. Hence, for any two different DFA states, we can find an NFA state i that is in one of the sets but not the other, and use that to construct a string that will distinguish the DFA states. Hence, all the DFA states are distinct, so the DFA is minimal.

1.9.2 Limits to Expressive Power

The most basic property of a DFA is that it is finite: It has a finite number of states and nowhere else to store information. This means, for example, that any language that requires unbounded counting cannot be regular. An example of this is the language {a^n b^n | n ≥ 0}, that is, any sequence of as followed by a sequence of the same number of bs. If we must decide membership in this language by a DFA that reads the input from left to right, we must, at the time we have read all the as, know how many there were, so we can compare this number to the number of bs. But since a finite automaton cannot count arbitrarily high, the language is not regular. A similar non-regular language is the language of matching parentheses. However, if we limit the nesting depth of parentheses to a constant n, we can recognise this language by a DFA that has n+1 states (0 to n), where state i corresponds to i unmatched opening parentheses. State 0 is both the starting state and the only accepting state.

Some surprisingly complex languages are regular. As all finite sets of strings are regular languages, the set of all legal Java programs of less than a billion characters is a regular language, though it is by no means a simple one. While it can be argued that it would be an acceptable limitation for a language to allow only programs of less than a billion characters, it is not practical to describe such a programming language as a regular language: The description would be far too large. Even if we ignore such absurdities, we can sometimes be surprised by the expressive power of regular languages. As an example, given any integer constant n, the set of numbers (written in binary or decimal notation) that divide evenly by n is a regular language (see Exercise 1.10).
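The idea behind this divisibility example (and the hint of Exercise 1.10) is to use one DFA state per remainder. A minimal Python sketch of that idea for binary numbers, where the current state is the remainder of the number read so far (an illustration, not the book's solution to the exercise):

def divisible_by(n, digits):
    """Simulate a DFA accepting non-empty binary strings whose value
    is a multiple of n, using one state per remainder modulo n."""
    state = None                          # start state, before any digit is read
    for d in digits:
        value_so_far = 0 if state is None else state
        # Reading one more bit doubles the value read so far and adds
        # the bit; remainders modulo n are preserved by this step.
        state = (2 * value_so_far + int(d)) % n
    return state == 0                     # accept only in the remainder-0 state

print(divisible_by(5, "11001"))  # True  (25 is a multiple of 5)
print(divisible_by(5, "101"))    # True  (5)
print(divisible_by(5, "111"))    # False (7)
print(divisible_by(5, ""))       # False (the empty string is rejected)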
1.9.3 Closure Properties

We can also look at closure properties of regular languages. It is clear that regular languages are closed under set union: If we have regular expressions s and t for two languages, the regular expression s|t describes the union of these languages. Similarly, regular languages are closed under concatenation and unbounded repetition, as these correspond to basic operators of regular expressions.

Less obviously, regular languages are also closed under set difference and set intersection. To see this, we first look at set complement: Given a fixed alphabet Σ, the complement of the language L is the set of all strings built from the alphabet Σ, except the strings found in L. We write the complement of L as L̄. To get the complement of a regular language L, we first construct a DFA for the language L and make sure that all states have transitions on all characters from the alphabet (as described in Sect. 1.7.2). Now, we simply change every accepting state to non-accepting and vice versa, and thus get a DFA for L̄. We can now (by using the set-theoretic equivalent of De Morgan's law) construct L1 ∩ L2 as the complement of (L̄1 ∪ L̄2). Given intersection, we can then get set difference by L1 \ L2 = L1 ∩ L̄2.

Regular sets are also closed under a number of common string operations, such as prefix, suffix, subsequence and reversal. The precise meanings of these words in the present context are defined below.

Prefix. A prefix of a string w is any initial part of w, including the empty string and all of w. The prefixes of abc are hence ε, a, ab and abc.

Suffix. A suffix of a string is what remains of the string after a prefix has been taken off. The suffixes of abc are hence abc, bc, c and ε.

Subsequence. A subsequence of a string is obtained by deleting any number of symbols from anywhere in the string. The subsequences of abc are hence abc, bc, ac, ab, c, b, a and ε.

Reversal. The reversal of a string is the string read backwards. The reversal of abc is hence cba.

As with complement, these can be obtained by simple transformations of the DFAs for the language.
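A minimal sketch of the complement construction just described, assuming a DFA represented as a dictionary of transitions (the representation is an illustrative choice): first complete the automaton with a dead state, then swap accepting and non-accepting states.

def complement_dfa(states, alphabet, trans, start, accepting):
    """Return a DFA for the complement language.
    trans maps (state, character) to the next state; it may be partial."""
    dead = "DEAD"                            # fresh dead state (assumed unused)
    all_states = set(states) | {dead}
    total = dict(trans)
    for s in all_states:
        for c in alphabet:
            total.setdefault((s, c), dead)   # make the transition function total
    flipped = all_states - set(accepting)    # swap accepting and non-accepting
    return all_states, total, start, flipped

# Example: complement of "strings over {a, b} ending in a".
states, trans, start, acc = complement_dfa(
    {"0", "1"}, "ab",
    {("0", "a"): "1", ("0", "b"): "0", ("1", "a"): "1", ("1", "b"): "0"},
    "0", {"1"})
print(sorted(acc))  # ['0', 'DEAD'] -- accepts ε and strings ending in b

Intersection then follows via De Morgan's law as in the text; it can also be obtained directly by a product construction that runs two DFAs in lockstep.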
Suggested exercises: 1.12.

1.10 Further Reading

There are many variants of the method shown in Sect. 1.3. The version presented here has been devised for use in this book in an attempt to make the method easy to understand and manageable to do by hand. Other variants can be found in [2, 3].

It is possible to convert a regular expression to a DFA directly without going through an NFA. One such method [2, 8] computes, at one stage during the calculation, information equivalent to an NFA (without epsilon-transitions), but more direct methods based on algebraic properties of regular expressions also exist [4, 10]. These, unlike NFA-based methods, generalise fairly easily to handle regular expressions extended with explicit set-intersection and set-difference operators.

A good deal of theoretic information about regular expressions and finite automata can be found in [5]. An efficient DFA minimisation algorithm can be found in [6].

Lexer generators can be found for most programming languages. For C, the most common are Lex [7] and Flex [11]. Some lexer generators, e.g., Quex [12], generate the states of the DFA as program code instead of using table-lookup. This makes the generated lexers fast, but can use much more space than a table-driven program. Quex is also able to handle Unicode characters.

Finite automata and notation reminiscent of regular expressions are also used to describe behaviour of concurrent systems [9]. In this setting, a state represents the current state of a process, and a transition corresponds to an event to which the process reacts by changing state.

1.11 Exercises

Exercise 1.1 Given the regular expression s = (a|b)(c|d|ε),

(a) Using the derivation rules in Fig. 1.1, show that L(s) contains the string ac.
(b) Find the complete set L(s).

Exercise 1.2 In the following, a number-string is a non-empty sequence of decimal digits, i.e., something in the language defined by the regular expression [0-9]+. The value of a number-string is the usual interpretation of a number-string as an integer number. Note that leading zeroes are allowed. For each of the following languages, write a regular expression that describes that language.

(a) All number-strings that have the value 42.
(b) All number-strings that do not have the value 42.
(c) All number-strings that have a value that is strictly greater than 42.

Exercise 1.3 Given the regular expression a*(a|b)aa:

(a) Construct an equivalent NFA using the method in Sect. 1.3.
(b) Convert this NFA to a DFA using Algorithm 1.3.

Exercise 1.4 Given the regular expression ((a|b)(a|bb))*:

(a) Construct an equivalent NFA using the method in Sect. 1.3.
(b) Convert this NFA to a DFA using Algorithm 1.3.

Exercise 1.5 Make a DFA equivalent to the following NFA:

Exercise 1.6 Minimise the following DFA:

Exercise 1.7 Minimise the following DFA:

Exercise 1.8 Construct DFAs for each of the following regular languages. In all cases the alphabet is {a, b}.

(a) The set of strings that has exactly 3 bs (and any number of as).
(b) The set of strings where the number of bs is a multiple of 3 (and there can be any number of as).
(c) The set of strings where the difference between the number of as and the number of bs is a multiple of 3.

Exercise 1.9 Construct a DFA that recognises balanced sequences of parentheses with a maximal nesting depth of 3, e.g., ε, ()(), (()(())) or (()())()(), but not (((()))) or (()(()(()))).

Exercise 1.10 Given that binary number strings are read with the most significant bit first and may have leading zeroes, construct DFAs for each of the following languages:

(a) Binary number strings that represent numbers that are multiples of 4, e.g., 0, 100 and 10100.
(b) Binary number strings that represent numbers that are multiples of 5, e.g., 0, 101, 10100 and 11001. Hint: Make a state for each possible remainder after division by 5, and then add a state to avoid accepting the empty string.
(c) Given a number n, what is the minimal number of states needed in a DFA that recognises binary numbers that are multiples of n? Hint: write n as a · 2^b, where a is odd.

Exercise 1.11 The empty language, i.e., the language that contains no strings, can be recognised by a DFA (any DFA with no accepting states will recognise this language), but it cannot be defined by any regular expression using the constructions in Sect. 1.1. Hence, the equivalence between DFAs and regular expressions is not complete. To remedy this, a new regular expression φ is introduced such that L(φ) = ∅. We will now look at some of the implications of this extension.

(a) Argue why each of the following algebraic rules, where s is an arbitrary regular expression, is true:

φ|s = s
φs = φ
sφ = φ
φ* = ε
(b) Extend the construction of NFAs from regular expressions to include a case for φ.
(c) What consequence will this extension have for converting the NFA to a minimal DFA? Hint: dead states.

Exercise 1.12 Show that regular languages are closed under prefix, suffix, subsequence and reversal, as postulated in Sect. 1.9. Hint: show how an NFA N for a regular language L can be transformed to an NFA Np for the set of prefixes of strings from L, and similarly for the other operations.

Exercise 1.13 Which of the following statements are true? Argue each answer informally.

(a) Any subset of a regular language is itself a regular language.
(b) Any superset of a regular language is itself a regular language.
(c) The set of anagrams of strings from a regular language forms a regular language. (An anagram of a string is obtained by rearranging the order of characters in the string, but without adding or deleting any. The anagrams of the string abc are hence abc, acb, bac, bca, cab and cba.)

Exercise 1.14 In Figs. 1.12 and 1.13 we used character sets on transitions as shorthands for sets of transitions, each with one character. We can, instead, extend the definition of NFAs and DFAs such that such character sets are allowed on a single transition. For a DFA (to be deterministic), we must require that transitions out of the same state have disjoint character sets.

(a) Sketch how Algorithm 1.3 must be modified to handle transitions with sets in such a way that the disjointness requirement for DFAs is ensured.
(b) Sketch how Algorithm 1.4 must be modified to handle character sets. A new requirement for DFA minimality is that the number of transitions as well as the number of states is minimal. How can this be ensured?

Exercise 1.15 As mentioned in Sect. 1.4, DFAs are often implemented by tables where the current state is cross-indexed by the next symbol to find the next state. If the alphabet is large, such a table can take up quite a lot of room. If, for example, 16-bit Unicode is used as the alphabet, there are 2^16 = 65536 entries in each row of the table. Even if each entry in the table is only one byte, each row will take up 64 KB of memory, which may be a problem.

A possible solution is to split each 16-bit Unicode character c into two 8-bit characters c1 and c2. In the regular expressions, each occurrence of a character c is hence replaced by the regular expression c1c2. This regular expression is then converted to an NFA and then to a DFA in the usual way. The DFA may (and probably will) have more states than the DFA using 16-bit characters, but each state in the new DFA uses only 1/256th of the space used by the original DFA.

(a) How much larger is the new NFA compared to the old?
(b) Estimate the expected size (measured as number of states) of the new DFA compared to the old. Hint: Some states in the NFA can be reached only after an even number of 8-bit characters are read, and the rest only after an odd number of 8-bit characters are read. What does this imply for the sets constructed during the subset construction?
(c) Roughly, how much time does the new DFA require to analyse a string compared to the old?
(d) If space is a problem for a DFA over an 8-bit alphabet, do you expect that a similar trick (splitting each 8-bit character into two 4-bit characters) will help reduce the space requirements? Justify your answer.

Exercise 1.16 If L is a regular language, so is L\{ε}, i.e., the set of all nonempty strings in L.
So we should be able to transform a regular expression for L into a regular expression for L\{ε}. We want to do this with a function nonempty that is recursive over the structure of the regular expression for L, i.e., of the form:

nonempty(ε) = φ
nonempty(a) = ...    where a is an alphabet symbol
nonempty(s|t) = nonempty(s) | nonempty(t)
nonempty(st) = ...
nonempty(s?) = ...
nonempty(s*) = ...
nonempty(s+) = ...

where φ is the regular expression for the empty language (see Exercise 1.11).

(a) Complete the definition of nonempty by replacing the occurrences of "..." in the rules above by expressions similar to those shown in the rules for ε and s|t.
(b) Use this definition to find nonempty(a*b*).

Exercise 1.17 If L is a regular language, so is the set of all prefixes of strings in L (see Sect. 1.9.3). So we should be able to transform a regular expression for L into a regular expression for the set of all prefixes of strings in L. We want to do this with a function prefixes that is recursive over the structure of the regular expression for L, i.e., of the form:

prefixes(ε) = ε
prefixes(a) = a?    where a is an alphabet symbol
prefixes(s|t) = prefixes(s) | prefixes(t)
prefixes(st) = ...
prefixes(s*) = ...
prefixes(s+) = ...

(a) Complete the definition of prefixes by replacing the occurrences of "..." in the rules above by expressions similar to those shown in the rules for ε, a and s|t.
(b) Use this definition to find prefixes(ab*c).

References

1. Aho, A.V., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms. Addison-Wesley (1974)
2. Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques and Tools. Addison-Wesley (2007)
3. Appel, A.W.: Modern Compiler Implementation in ML. Cambridge University Press (1998)
4. Brzozowski, J.A.: Derivatives of regular expressions. J. ACM 11(4), 481–494 (1964)
5. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages and Computation, 2nd edn. Addison-Wesley (2001)
6. Keller, J.P., Paige, R.: Program derivation with verified transformations – a case study. Commun. Pure Appl. Math. 48(9–10) (1996)
7. Lesk, M.E.: Lex: A Lexical Analyzer Generator. Tech. Rep. 39, AT&T Bell Laboratories, Murray Hill, N.J. (1975)
8. McNaughton, R., Yamada, H.: Regular expressions and state graphs for automata. IEEE Trans. Electron. Comput. 9(1), 39–47 (1960)
9. Milner, R.: Communication and Concurrency. Prentice-Hall (1989)
10. Owens, S., Reppy, J., Turon, A.: Regular-expression derivatives re-examined. J. Funct. Program. 19(2), 173–190 (2009). https://doi.org/10.1017/S0956796808007090
11. Paxson, V.: Flex, version 2.5, a fast scanner generator (1995). http://www.gnu.org/software/flex/manual/html_mono/flex.html
12. Schäfer, F.R.: Quex - fast universal lexical analyzer generator (2004–2011). http://quex.sourceforge.net. Accessed Sept 2014

Chapter 2 Syntax Analysis

Syntax and vocabulary are overwhelming constraints—the rules that run us. Language is using us to talk—we think we're using the language, but language is doing the thinking, we're its slavish agents.
Harry Mathews (1930–2017)

Where lexical analysis splits a text into tokens, the purpose of syntax analysis (also known as parsing) is to recombine these tokens. Not back into a list of characters, but into something that reflects the structure of the text. This "something" is typically a data structure called the syntax tree of the text. As the name indicates, this is a tree structure.
The leaves of this tree are the tokens found by the lexical analysis, and if the leaves are read from left to right, the sequence is the same as in the input text. Hence, what is important in the syntax tree is how these leaves are combined to form the structure of the tree, and how the interior nodes of the tree are labelled. In addition to finding the structure of the input text, the syntax analysis must also reject invalid texts by reporting syntax errors.

As syntax analysis is less local in nature than lexical analysis, more advanced methods are required. We, however, use the same basic strategy: A notation suitable for human understanding and algebraic manipulation is transformed into a machine-like low-level notation suitable for efficient execution. This process is called parser generation.

The notation we use for human manipulation is context-free grammars,¹ which is a recursive notation for describing sets of strings and imposing a structure on each such string. This notation can in some cases be translated almost directly into recursive programs, but it is often more convenient to generate stack automata. These are similar to the finite automata used for lexical analysis, but they can additionally use a stack, which allows counting and non-local matching of symbols. We shall see two ways of generating such automata. The first of these, LL(1), is relatively simple, but works only for a somewhat restricted class of grammars. The SLR construction, which we present later, is more complex, but handles a wider class of grammars. Sadly, neither of these works for all context-free grammars. Tools that handle all context-free grammars do exist, but they can incur a severe speed penalty, which is why most parser generators restrict the class of input grammars to what can be parsed efficiently using a specific method. This may require rewriting an otherwise correct grammar for a language into a form that can be handled by the method, or providing additional information to resolve ambiguities. We will show some techniques for doing so.

¹ Because derivation is independent of context, as we will see.

2.1 Context-Free Grammars

Like regular expressions, context-free grammars describe sets of strings, i.e., languages. Additionally, a context-free grammar also defines structure on the strings in the language it defines. A language is defined over some alphabet, for example the set of tokens produced by a lexer or the set of alphanumeric characters. The symbols in the alphabet are called terminals.

A context-free grammar recursively defines several sets of strings. Each set is denoted by a name, which is called a nonterminal. The set of nonterminals is disjoint from the set of terminals. One of the nonterminals is chosen to denote the main language described by the grammar. This nonterminal is called the start symbol of the grammar, and plays a role similar to the start state of a finite automaton.

The sets are described by a number of productions. Each production describes some of the possible strings that are contained in the set denoted by a nonterminal. A production has the form

N → X1 ... Xn

where N is a nonterminal and X1 ... Xn are zero or more symbols, each of which is either a terminal or a nonterminal. The intended meaning of this notation is to say that the set denoted by N contains strings that are obtained by concatenating strings from the sets denoted by X1 ... Xn.
In this setting, a terminal denotes a set consisting of a single string consisting of a single symbol, just like an alphabet character in a regular expression denotes a set consisting of a single string consisting of a single character. We will, when no confusion is likely, equate a nonterminal with the set of strings it denotes, like we did for alphabet characters in regular expressions.

Some examples:

A → a

says that the set denoted by the nonterminal A contains the one-character string a.

A → aA

says that the set denoted by A contains all strings formed by putting an a in front of a string taken from the set denoted by A. Together, these two productions indicate that A contains all non-empty sequences of as and is hence (in the absence of other productions) equivalent to the regular expression a+.

We can define a grammar equivalent to the regular expression a* by the two productions

B →
B → aB

where the first production indicates that the empty string is part of the set B. Compare this grammar with the definition of s* in Fig. 1.1. Productions with empty right-hand sides are called empty productions. These are in some variants of grammar notation written with an ε on the right-hand side instead of leaving it empty.

So far, we have not described any set that could not just as well have been described using regular expressions. Context-free grammars are, however, capable of expressing much more complex languages. In Sect. 1.9, we noted that the language {a^n b^n | n ≥ 0} is not regular. It is, however, easily described by the grammar

S →
S → aSb

The second production ensures that the as and bs are paired symmetrically around the middle of the string, so they occur in equal number.

The examples above have used only one nonterminal per grammar. When several nonterminals are used, we must make it clear which of these is the start symbol. By convention (if nothing else is stated), the nonterminal on the left-hand side of the first production is the start symbol. As an example, the grammar

T → R
T → aTa
R → b
R → bR

has T as start symbol and denotes the set of strings that start with any number of as, followed by a non-zero number of bs, and then the same number of as with which it started.

In some variants of grammar notation, a shorthand notation is used where all the productions for the same nonterminal are combined into a single rule, with the alternative right-hand sides separated by the | symbol known from regular expressions.
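To make derivation concrete, here is a small Python sketch (an illustration, not part of the book's notation) that enumerates the strings derivable from the grammar S → / S → aSb above, confirming that they are exactly the strings a^n b^n:

def derive(depth):
    """Return the set of terminal strings derivable from S using at
    most depth expansions of the two productions."""
    if depth == 0:
        return set()
    strings = {""}                                         # production S ->
    strings |= {"a" + s + "b" for s in derive(depth - 1)}  # production S -> aSb
    return strings

print(sorted(derive(4), key=len))  # ['', 'ab', 'aabb', 'aaabbb']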
