Syntax Definition - Grammars PDF
Document Details
Tags
Summary
This document provides a basic introduction to syntax definition and grammars. It explains concepts such as variables, terminals, and productions. It also discusses different types of grammars according to the Chomsky hierarchy.
Full Transcript
80 Syntax Definition — Grammars Grammar is basically defined as a set of 4-tuple (V, T, P, S), where V is set of nonterminals (variables), T is set of terminals (primitive symbols), P is set of productions (rules), which govern the relationship between nonter-...
80 Syntax Definition — Grammars Grammar is basically defined as a set of 4-tuple (V, T, P, S), where V is set of nonterminals (variables), T is set of terminals (primitive symbols), P is set of productions (rules), which govern the relationship between nonter- minals and terminals, And S is start symbol with which strings in grammar are derived. These productions or rules define the strings belonging to the language. The motivation for these grammars was from the description of natural language. Let us examine the rules used to define a sentence in the English language. → → | → ate | sat | ran → Rama | Bhama → She | He Using these set of rules many sentences can be derived by substituting for variables. 1. Rama ate. 2. Bhama sat. 3. She ran. 4. He sat. Language defined by a grammar G is denoted by L(G). Here L(G) represents a set of strings w derived from G. Start symbol S in one or more steps derives the strings or sen- tences of grammar that is represented by S + w. The sentential form of grammar is a combi- nation of terminals and nonterminals. It is derived from S using one or more derivations. It is represented by S + α. Two grammars G1 and G2 are equivalent if L(G1) = L(G2). Some of the notations used to represent the grammar are: 1. Terminals symbols: these are represented by N Lower case letters of alphabet like a, c, z, etc. N Operator symbols like +, –, etc. N Punctuation symbols like ( , } , , etc. N Digits like 0…9, etc. N Bold face strings like int, main, if, else etc. 2. Nonterminal symbols: these are represented by N Upper case letters of alphabet like A, C, Z, etc. N Letter S is the start symbol. N Lower case strings like expr, stmt, etc. 3. Grammar symbols can be terminals or nonterminals. They can be represented by Greek letters α, β. 4. Lower case letters u, v, w, x, y, z are generally used to represent a string of terminals. Types of Grammars—Chomsky Hierarchy 81 Language acceptance: The language starts with the start symbol; at every step, replace the nonterminal by the right hand side of the rule. Continue this until a string of terminals is derived. The string of termi- nals gives the language accepted by grammar. Consider the language represented by a+, represented by a set {a, aa, aaa, ….}. To gener- ate strings of this language we define grammar as S → a and S → aS. Now we get strings as follows: Strings starting with S: Sa {a} S aS aa {aa} S aS aaS aaa {aaa} 3.2 Types of Grammars—Chomsky Hierarchy Linguist Noam Chomsky defined a hierarchy of languages, in terms of complexity. This four-level hierarchy, called the Chomsky hierarchy, corresponds to four classes of machines. Each higher level in the hierarchy incorporates the lower levels, that is, anything that can be computed by a machine at the lowest level can also be computed by a machine at the next highest level. The Chomsky hierarchy classifies grammars according to the form of their productions into the following levels: a. Type 0 Grammars—Unrestricted Grammars (URG): These grammars include all formal grammars. In URG, all the productions are of the form →, where and may have any number of terminals and nonterminals, that is, no restrictions on either side of productions. Every grammar is included in it if it has at least one nonterminal on the left hand side. Example: The language specified by L(G) = {a2i | I > = 1} is an unrestricted grammar. This grammar is also called phrase structured grammar. The grammar is represented by the following productions: S→ACaB Ca→aaC CB→DB CB→E aD→Da AD→AC aE→Ea AE→ε The string “aa” can be derived from the above grammar as follows: S ⇒ ACaB ⇒ AaaCB ⇒ AaaE ⇒ AaEa ⇒ AEaa ⇒ aa They generate exactly all languages that can be recognized by a turing machine. The lan- guage that is recognized by a turing machine is defined as all the strings on which it halts. These languages are also known as the recursively enumerable languages. 82 Syntax Definition — Grammars b. Type 1 Grammars—Context Sensitive Grammars (CSG): These grammars define the context sensitive languages. In CSG, all the productions of the form α → β where |α| | β|, α and β may have any number of terminals and nonterminals. These grammars can have rules of the form αAβ → αγβ with A as nonterminal and α, β and γ are strings of terminals and nonterminals. We can replace A by γ, where A lies between α and β. Hence, the name context sensitive grammar. The strings α and β may be empty, but γ must be nonempty. It cannot include the rule S → ε. These languages are exactly all languages that can be recognized by a Linear Bound Automata. Example: The language specified by L(G) = {anbn | n > = 1} is a context sensitive gram- mar. The grammar is represented by the following productions. S→SBC|aC B→a CB→BC Ba → aa C→b The string “aaabbb” can be derived from the above grammar as follows: S SBC SBCBC aCBCBCaBCCBC aaCCBC aaCBCC aaBCC aaaCCC aaabbb c. Type 2 Grammars—Context Free Grammars (CFG): These grammars define the context free languages. These are defined by rules of the form α → β with |α| | β where |α| = 1 and is a nonterminal and β is a string of terminals and nonterminals. We can replace α by β regardless of where it appears. Hence, the name context free grammars. These lan- guages are exactly those languages that can be recognized by a nondeterministic pushdown automaton. Context free languages define the syntax of all programming languages. Example: The language specified by L(G) = {anbn | n >= 1} is a context free grammar. The grammar is represented by the following productions. S → aSb | ε The string “aabb” can be derived from the above grammar as follows: S aSb aaSbb aaεbbaabb d. Type 3 Grammars—Regular Grammars (RG): These grammars generate the regular lan- guages. Such a grammar restricts its rules to a single nonterminal on the left hand side. The right hand side consists of either a single terminal or a string of terminals with a sin- gle nonterminal on the left or the right end. Here rules can be of the form A → a B | a or A → Ba | a. The rule S → ε is also allowed here. These languages are exactly those languages that can be decided by a finite state automaton. This family of formal languages can be obtained even by regular expressions (RE). Regular languages are used to define search patterns and the lexical structure of programming languages. Example: Right linear grammar: A → a A | a Left linear grammar: A → A a | a An example of a regular grammar G with V = {S, A}, Σ = {a, b, c}, P consists of the following rules: 84 Syntax Definition — Grammars Example 1: Give a CSG but not CFG for (an | n 1) S → aS | X aS → aa X→a Give a CFG but not regular for (an | n 1) S→AS|a A→a Give a regular grammar for (an | n 1) S → aS | a Every regular grammar is context free, every CFG is context sensitive, and every CSG is unrestricted. 3.3 Grammar Representations We need a way to represent grammars so that we can talk about them in general. For this, we have some standard representations. 1. Backus-Naur Form (BNF): A metalanguage, BNF is widely used as a notation for the gram- mars of computer programming languages, command sets and communication protocols, as well as a notation for representing parts of natural language grammars. BNF is the most general representation scheme. Example: ::= We use symbols like :=, |, , (, ) etc. The main operators here are: 1. Alteration “|” 2. Concatenation.” 3. Grouping “( )” 4. Production Sign “ → ” or “::=” For example A:= | and B:= a | b is in BNF. Some more examples are: ::= | | ::= + | - | or ::= | where: N ‘::=’ means “is defined as” N ‘|’ means “or” N Terms enclosed in angle brackets, such as , are nonterminal symbols. N All other symbols (shown in boldface above [although it’s not obvious for some charac- ters, such as ‘-’, in some browsers]) are terminal symbols. N The concatenation of strings is represented by the juxtaposition of their descriptions. (In the example above, note that there is no space between and.) Grammar Representations 85 Note the recursive definition in the last example above. This is the standard way of specifying sequences of elements in BNF, and for describing nested syntax, as in: ::= | if then < uncond_stmt > | if then < uncond_stmt > else < stmt > BNF was first used in the Algol-60 Report. 2. Extended Backus-Naur Form (EBNF): This improves the readability and conciseness of BNF through extensions: N Kleene cross—a sequence of one or more elements of the class marked. ::= + N Kleene star—a sequence of zero or more elements of the class marked. : = * N Braces can be used for grouping elements, allowing us to avoid having to define interme- diate classes such as. : = {|}* Note: We generally use braces to mean zero or more repetitions of the enclosed elements (i.e., a Kleene star is assumed). N Square brackets may be used to indicate optional elements (i.e., zero or one occurrence of the enclosed elements). : = [ + | – ] Do not use a Kleene cross or star with square brackets. Stacking alternatives result in an even more pictorial representation of the syntax. : = { }* : = [+ –] EBNF is a simple extension of BNF, which tries to extend BNF and makes it easier to represent complex forms. We add two new symbols to our original BNF symbol set viz. 1. {}—Repetition or Closure: Used when one pattern is repeated more than once. 2. []—Optional: Used to represent optional patterns whose inclusion is not necessary. Example, to represent A → A | The corresponding RE is A : * We can represent this in EBNF format as A → { } Another example is: Consider the BNF form as follows: stmt → if (expr) stmt | if (expr) stmt else stmt The equivalent EBNF form is: stmt → if (expr) stmt [else stmt] Note: An advantage of EBNF is that it is very easy to write the corresponding parser for it. 86 Syntax Definition — Grammars If Stmt if cond then stmts ; Elseif else stmts Else Elseif cond then stmts if Figure 3.1 Syntax Diagrams 3. Syntax Diagrams: They are also called railroad diagrams. Syntax diagrams are a way to represent a context free grammar. They represent a graphical alternative to Backus- Naur Form BNF or EBNF. Figure 3.1 shows syntax diagrams of the conditional state- ment “if.” Syntax diagrams were first used in the Pascal User Manual and Report by Jensen and Wirth. The grammar can be represented as a set of syntax diagrams. Each diagram defines a variable or nonterminal. There is a main diagram that defines the language in the following way: to belong to the language, a word must describe a path in the main diagram. Each diagram has an entry point and an end point. The diagram describes possible paths between these two points by going through other nonterminals and terminals. Termi- nals are represented by round boxes, while non terminals are represented by square boxes. Example 2: We use grammar that defines arithmetic expressions as an example. Following is the simplified BNF grammar for arithmetic expressions: ::= | “+” ::= | “*” ::= | | “(“ “)” ::= “x” | “y” | “z” ::= | ::= “0”|“1”|“2”|“3”|“4”|”5”|“6”|“7”|“8”|“9” This grammar can also be expressed in EBNF: Expr = Term, {“+”, Term}; Term = Fact, {“*”, Fact}; Fact = num | id | “(“, Expr, “)”; id = “x” | “y” | “z”; num = digit, {digit}; digit = “0”|“1”|“2”|“3”|“4”|“5”|“6”|“7”|“8”|“9”; The set of rail road diagrams or syntax diagrams for this grammar are shown in Figure 3.2. Nonterminals are represented by rectangular boxes, whereas terminals are represented by circles or ellipses. 88 Syntax Definition — Grammars V, T, P, S are the four important components in the grammatical description of a language. V—the set of variables, also called nonterminals. Each variable represents a set of strings, simply a language. T—the set of terminals, which are a set of symbols that forms the strings of the language, also called terminal symbols. P—the finite set of productions or rules that represent the recursive definition of language. Each production consists of a variable, production symbol → , and a string of terminals and nonterminals. The string is called body of production. S—the start symbol. It is one of the variables that represent the language being defined. The language generated (defined, derived, produced) by a CFG is the set of all strings of terminals that can be produced from the start symbol S using the productions as substitu- tions. A language generated by a CFG is called a context free language (CFL). Example 3: terminal: a nonterminal: S productions: S → aS S→ε is a simple CFG that defines L(G) = a*. where V = {S} T = {a} Example 4: The CFG for defining palindrome over {a or b}. The productions P are: S→ε|a|b S → aSa S → bSb And the grammar is G = ({S}, {a,b}, P, S) Example 5: The CFG for set of strings with equal no of a’s and b’s. The productions P are: S → SaSbS | SbSaS | ε And the grammar is G = ({S}, {a,b}, P, S) Example 6: The context free grammar for syntactically correct infix algebraic expressions in the variables x, y, and z: And the grammar is G = ({S,T}, {+,*,(,),–,x,y,z}, P, S) S→T+S|T–S|T T→T*T|T|T T → (S) T→x|y|z This grammar can generate the string (x + y) * x – z * y|(x + x). Example 7: A context free grammar for the language consisting of all strings over {a, b} which contain a different number of a’s than b’s is