Summary

These notes provide an overview of system software, outlining its different categories and functions. They cover system control, support, and development programs, along with important concepts such as language translators.

Full Transcript


SSCD - UNIT 1

1. Introduction to System Software
System Software is a set of computer programs designed to operate, control, and extend the processing capabilities of a computer. These programs manage system resources and provide essential functions such as file editing, storage management, and I/O management.

Categories of System Software:
1. System Control Programs: Control the execution of programs and manage storage and processing resources. Example: the Operating System (OS).
2. System Support Programs: Provide routine service functions to other programs and users. Example: utility programs (file backup, disk cleanup).
3. System Development Programs: Assist in creating application programs. Example: language translators such as interpreters and compilers.
(Diagram: role of system software.)

2. System Control Programs: Operating System (OS)
An Operating System (OS) is an integrated set of specialized programs that manage the overall resources and operations of a computer.

3. System Development Programs: Language Translators
Language translators, also known as language processors, convert high-level language (HLL) code into machine-readable language and check for syntax errors.
Types of Language Translators:
1. Assembler: Translates assembly language into machine code.
2. Interpreter: Translates and executes one statement at a time.
3. Compiler: Translates an entire program into machine code before execution.
(Diagram: assembler task.)

4. Macro Processor
A macro processor lets a programmer define abbreviations (macros) for repetitive parts of a program. During assembly, each macro call is replaced with the corresponding source code statements (macro expansion). Example: RDBUFF and WRBUFF. (See the C preprocessor sketch at the end of this overview.)

5. Linkers and Loaders
a) Linker: Combines multiple object code files into a single executable program.
Steps in program execution:
- Translation: The source code is converted into machine code by a compiler or assembler, resulting in the translated address.
- Linking: The linker combines object files and resolves external references, producing the linked address.
- Relocation: Adjusts the program's memory addresses to fit a specific memory location.
- Loading: The loader places the program into memory for execution, assigning it a load-time address.
b) Loader: Places the program into memory for execution. Its functions are allocation, loading, relocation, and linking.

6. Debugger
A debugger is used to test programs and locate bugs. Two types:
- CorDBG: command-line debugger (requires debug compilation).
- DbgCLR: graphical debugger used in Visual Studio.
Popular debuggers: Firefox JavaScript Debugger, GDB (GNU Debugger), Microsoft Visual Studio Debugger.

7. Text Editor
A text editor is software used to create and modify text files. Common features: cursor movement, deleting and replacing text, searching and replacing, saving and loading files. Examples: Notepad and WordPad (Windows); vi and emacs (UNIX).

8. Microservices
Microservices is an architectural style in which an application is divided into small, loosely coupled services. Each service is independently deployable and organized around a specific business capability. Benefits: high maintainability, rapid deployment, and scalability. Examples: Netflix and Amazon use microservices to handle vast numbers of API requests.

9. Containers
Containers are lightweight, portable units that encapsulate an application and its dependencies, ensuring it runs consistently across environments. Multiple containers can run on a single virtual machine (VM), making them efficient and scalable.
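The macro idea in point 4 is easiest to see in C, whose preprocessor is itself a small macro processor. A minimal sketch (the SWAP macro is an illustrative example, not from the notes):

    #include <stdio.h>

    /* A macro abbreviates a repetitive code pattern; the preprocessor
       replaces every use with the full statements before compilation,
       much as an assembler macro processor expands RDBUFF/WRBUFF. */
    #define SWAP(a, b) do { int tmp = (a); (a) = (b); (b) = tmp; } while (0)

    int main(void) {
        int x = 1, y = 2;
        SWAP(x, y);               /* expanded in place, before compilation */
        printf("%d %d\n", x, y);  /* prints: 2 1 */
        return 0;
    }

Note that the replacement happens during translation (preprocessing/assembly), not while the program runs.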
10. Assembly Language
Assembly Language is a low-level programming language that uses symbolic codes (mnemonics) to represent machine instructions. It is closely tied to the architecture of the computer's hardware.

Elements of an Assembly Language Program (ALP):
1. Labels: Symbolic names for memory addresses, making code easier to read and manage.
2. Instructions: Commands to the processor, consisting of an opcode (operation code) and operands (data or memory locations).
3. Directives: Non-executable statements used to define data, set up memory, and control the assembly process (e.g., DB, DW, END).
4. Comments: Explanatory notes for the reader, ignored by the assembler.
5. Registers: Small, fast storage locations within the CPU used to hold data temporarily during execution.

11. Statement Classes for an ALP
1. Declaration Statements: Define data storage elements such as variables or constants. Example: X DB 10 (defines a byte variable X initialized to 10).
2. Imperative Statements: Directly instruct the processor to perform operations such as arithmetic, data movement, or branching. Example: MOV AX, BX (moves the value of register BX into AX).
3. Constants: Fixed values used in the program; may be numeric or character data. Example: NUM EQU 5 (defines NUM as a constant with value 5).
4. Assembler Directives: Instructions to the assembler for managing the assembly process (e.g., memory allocation or setting start points). Example: ORG 1000H (sets the starting address to 1000H).

Difference Between Literal and Immediate Operand:
- Literal Operand: A constant placed in memory before execution; its address is used at run time. Example: MOV AX, ='10' (loads the literal 10 via its memory location).
- Immediate Operand: A constant embedded directly in the instruction. Example: MOV AX, 10 (loads the value 10 into AX directly).

12. Advantages of Assembly Language
1. Precise Hardware Control: Allows detailed control over hardware components, enabling optimized code.
2. Direct Hardware Access: Provides direct access to registers and hardware, ideal for hardware-specific tasks.
3. Efficient Resource Use: Optimized for low-level control, making resource usage more efficient.
4. Ideal for Embedded Systems: Well suited to programming microcontrollers and sensors.
5. Security Research: Useful in reverse engineering and finding security vulnerabilities.

13. Disadvantages of Assembly Language
1. Complexity: Difficult to learn, especially for beginners.
2. Machine Dependency: Highly machine-specific, limiting portability.
3. Hard to Maintain: Challenging to manage, especially in large projects.
4. Time-Consuming: Lengthy and hard to read, slowing development.
5. Difficult Debugging: Debugging is harder than in higher-level languages.

14. Types of Assembler
1. Single-Pass Assembler: Does all of its work in one scan of the source program.
2. Multi-Pass Assembler: Requires multiple scans of the source program.

15. Working of an Assembler
Pass 1:
1. Define symbols and literals, recording them in the symbol table (SYMTAB) and literal table (LITTAB) respectively.
2. Keep track of the location counter (LC).
3. Process pseudo-operations.
4. Assign memory addresses to variables and labels.
Pass 2:
1. Generate object code by converting symbolic opcodes into numeric opcodes.
2. Generate data for literals and look up the values of symbols.
3. Read the source code a second time, translating it into object code with the help of the tables built in pass 1.
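A minimal C sketch of the pass-1 bookkeeping above (location counter plus SYMTAB), assuming for illustration that every instruction occupies one word:

    #include <stdio.h>
    #include <string.h>

    /* Pass-1 sketch: each label is entered into SYMTAB with the
       current location counter; instructions only advance the LC. */
    struct sym { char name[16]; int addr; };
    struct sym symtab[64];
    int nsyms = 0, loc_cntr = 0;

    void define_label(const char *label) {
        strcpy(symtab[nsyms].name, label);
        symtab[nsyms].addr = loc_cntr;      /* Enter(this_label, loc_cntr) */
        nsyms++;
    }

    int main(void) {
        define_label("LOOP");  loc_cntr += 1;   /* LOOP: MOVER AREG, A */
        loc_cntr += 1;                          /*       ADD   AREG, B */
        define_label("NEXT");                   /* NEXT: ...           */
        for (int i = 0; i < nsyms; i++)
            printf("%-6s %d\n", symtab[i].name, symtab[i].addr);
        return 0;                               /* prints LOOP 0, NEXT 2 */
    }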
16. Two Phases of an Assembler
1. Analysis Phase: Breaks the source code into components (opcodes, operands, labels) and checks for errors.
   Processes:
   - Lexical Analysis: Scans the program and identifies tokens (symbols, instructions).
   - Syntax Analysis: Ensures instructions follow the proper format.
   - Symbol Table Creation: Stores the addresses of labels and variables.
2. Synthesis Phase: Converts the analysed components into machine code and generates the object code.
   Processes:
   - Opcode Generation: Converts assembly instructions into binary (machine) code.
   - Address Calculation: Resolves memory addresses for variables and instructions.
   - Object Code Generation: Produces the final machine code ready for execution.

17. ORIGIN Directive
Specifies where to load instructions and data into memory.
Syntax: ORIGIN <address specification>
Functionality:
- Sets the location counter (LC) to the given address.
- Useful for non-consecutive memory allocation.
- If no ORIGIN directive is specified, the LC starts at zero.
Example: ORIGIN Loop + 2
Note: The assembler uses the LC to hold the offset address of the statement currently being processed.

18. EQU Directive
Improves program readability by defining symbols for values or addresses.
Syntax: <symbol> EQU <value or address>
Functionality:
- No memory is allocated; an entry is made in the symbol table instead.
- Occurrences of the defined name are replaced with its value during assembly.
Examples:
- FACTOR EQU 03H, so ADD AL, FACTOR assembles as ADD AL, 03H.
- A EQU 100 binds the symbol A to address 100.

19. LTORG Directive (Literal Origin)
Specifies where literals are placed in memory during assembly.
Functionality:
- By default the assembler places literals after the END statement.
- LTORG lets the programmer choose a specific location for literals.
Advantages:
- Organizes literal data efficiently in memory.
- Reduces wasted space by aligning and arranging literals.
- Assembles duplicate literals into a single memory location.
Use case: the assembler collects literals into a "literal pool" at each LTORG instruction.

20. Label Processing
If a label is present:
- Set this_label := symbol in the label field.
- Enter the label and the current location counter into the symbol table: Enter(this_label, loc_cntr) in SYMTAB.

21. Literal Processing
If an LTORG statement is encountered:
- Process the literals of the current pool, using the literal table (LITTAB) and pool table (POOLTAB).
- Allocate memory for the literals and update the location counter (loc_cntr).
- Update the pool table pointer: pooltab_ptr := pooltab_ptr + 1.
- Record the start of the next pool: POOLTAB[pooltab_ptr] := littab_ptr.
(A C sketch of this bookkeeping appears after point 23.)

22. START, ORIGIN, and EQU Processing
1. START or ORIGIN statement: set loc_cntr := value in the operand field.
2. EQU statement: set this_addr := value of the address specification in the operand field, and update the SYMTAB entry for the symbol.

23. Declaration (DS/DC) Statement Processing
If a declaration statement (DS/DC) is encountered:
- Get the code for the declaration statement.
- Determine the size required by the declaration.
- Update the SYMTAB entry for the label.
- Increment the location counter: loc_cntr := loc_cntr + size.
- Generate intermediate code.
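The LITTAB/POOLTAB bookkeeping from points 19 and 21, as a hedged C sketch (literal texts, sizes, and the starting address are all illustrative):

    #include <stdio.h>
    #include <string.h>

    /* Literal-pool sketch: literals collected in LITTAB get addresses
       only when an LTORG (or END) is processed; POOLTAB remembers
       where each pool begins. */
    char littab[16][8];            /* literal text                     */
    int  litaddr[16];              /* assigned address (-1 = pending)  */
    int  littab_ptr = 0;
    int  pooltab[8] = {0};         /* index of first literal per pool  */
    int  pooltab_ptr = 0;
    int  loc_cntr = 100;

    void add_literal(const char *lit) {
        strcpy(littab[littab_ptr], lit);
        litaddr[littab_ptr++] = -1;          /* address not yet known  */
    }

    void process_ltorg(void) {
        for (int i = pooltab[pooltab_ptr]; i < littab_ptr; i++)
            litaddr[i] = loc_cntr++;         /* one word per literal   */
        pooltab[++pooltab_ptr] = littab_ptr; /* start a new pool       */
    }

    int main(void) {
        add_literal("='5'");
        add_literal("='1'");
        process_ltorg();                     /* LTORG encountered      */
        for (int i = 0; i < littab_ptr; i++)
            printf("%s at %d\n", littab[i], litaddr[i]);
        return 0;
    }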
24. Imperative Statement Processing
If an imperative statement is encountered:
- Get the machine code from the machine operation table (MOT).
- Update the location counter by the instruction length: loc_cntr := loc_cntr + length of instruction.
- If the operand is a literal:
  1. Set this_lit := literal in the operand field.
  2. Enter it in the literal table: LITTAB[littab_ptr] := this_lit.
  3. Increment the literal table pointer: littab_ptr := littab_ptr + 1.
- If the operand is a symbol:
  1. Get the symbol table entry number of the operand: this_entry := SYMTAB entry number of the operand.
  2. Increment the symbol table pointer: symtab_ptr := symtab_ptr + 1.

25. Intermediate Code Variants
Variant I:
- Each statement is denoted by a (class, number) pair, where the class is IS (imperative statement), AD (assembler directive), or DL (declaration statement).
- Register operands are written as a single digit.
- A memory operand is represented by a pair (operand class, code), where the operand class is C (constant), S (symbol), or L (literal).
- For instructions such as BC, the first operand may be a condition-code value (1 to 6).
Variant II:
- AD and DL statements are represented as in Variant I.
- For IS statements, the first operand is written as is; the second operand uses the Variant I format only if it is a literal, otherwise it is copied as is from the source.

26. Concepts
1. Forward Reference: A label or symbol is referenced before it is defined.
2. Back Patching: Updating addresses in the intermediate representation once the final address is known; particularly useful for handling forward references.

UNIT 2

1. Linker
A system program that combines the target program's code with other programs and library routines for execution.
Steps for execution:
1. Translation: Converts source code into machine language.
2. Linking: Combines object modules into a binary program.
3. Relocation: Modifies addresses so the program executes correctly.
4. Loading: Loads the program into memory for execution.
An object module contains:
- Target code (machine language).
- Information about required programs and library routines.
Binary program: the combination of target code and linked program routines, ready for execution.

Address Types:
1. Translation-time address: assigned by the translator (e.g., via ORIGIN).
2. Linked address: assigned by the linker.
3. Load-time address: assigned by the loader.
4. Relocation is necessary whenever the translated and linked (or linked and load) addresses differ.

Relocation concept: modifying a program's addresses so that it executes correctly from a designated memory area.
When relocation occurs:
- If linked origin ≠ translated origin, relocation is done by the linker.
- If load origin ≠ linked origin, relocation is done by the loader.

2. Linking
Purpose: binds external references to the correct link-time addresses.
Key terms:
- ENTRY: a public definition that can be referenced from other program units.
- EXTRN: an external reference to a symbol defined in another program unit.
Components of an object module for linking:
1. Header: translated origin, size, and execution start address.
2. Program section: the machine code.
3. Relocation table (RELOCTAB): translated addresses of address-sensitive instructions.
4. Linking table (LINKTAB): public definitions (PD) and external references (EXT).
Types of programs:
1. Non-relocatable: cannot execute from any memory area other than its translated origin.
2. Relocatable: carries the information needed to adjust its address-sensitive instructions.
3. Self-relocating: contains logic to perform its own relocation; useful in time-sharing operating systems.
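A small C illustration of the relocation rule above: every word listed in RELOCTAB is adjusted by the relocation factor (load origin minus linked origin). Opcodes and addresses here are made up for the example:

    #include <stdio.h>

    /* Words 1 and 3 of this "program" hold addresses and are listed
       in RELOCTAB; the others are opcodes and stay untouched. */
    int code[] = { 0x01, 500, 0x02, 510, 0x03 };
    int reloctab[] = { 1, 3 };        /* offsets needing relocation */

    void relocate(int linked_origin, int load_origin) {
        int factor = load_origin - linked_origin;   /* relocation factor */
        for (int i = 0; i < 2; i++)
            code[reloctab[i]] += factor;
    }

    int main(void) {
        relocate(500, 900);   /* linked for 500, loaded at 900 */
        printf("%d %d\n", code[1], code[3]);   /* prints: 900 910 */
        return 0;
    }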
3. Link Libraries
1. Static link libraries: all library modules are copied into the final executable during linking.
   Advantages: faster execution (everything is pre-integrated); no compatibility issues.
   Disadvantages: larger executables; updates require relinking/recompilation.
2. Dynamic link libraries: linking happens at run time, using shared libraries.
   Advantages: saves memory and disk space by sharing libraries; easier updates and compatibility with shared modules.
   Disadvantages: potentially slower execution; dependency on compatible library versions, so changes may require reworking applications.

4. Loader
A program that prepares object programs for execution and initiates their execution.
Functions of the loader:
1. Allocation: allocates memory space for the programs.
2. Linking: resolves symbolic references between object modules.
3. Relocation: adjusts address-dependent locations to match the allocated memory.
4. Loading: physically places machine instructions and data into memory.
Relocation concept: modifying address-dependent instructions so a program can execute correctly from any memory area.
(Diagram: language processing system.)

5. Loader Schemes
1. Compile-and-go loaders:
   - The assembler places code directly into memory; the "loader" is a single instruction that transfers control to the start of the assembled program.
   - Advantages: easy to implement.
   - Disadvantages: wastes memory, requires retranslation on every run, and handles multiple segments poorly.
2. General loader scheme:
   - The loader is smaller than the assembler, freeing memory for the program.
   - Does not require reassembly to rerun a program.
3. Absolute loaders:
   - Similar to compile-and-go, but the object code is written out (historically punched on cards) rather than left in memory.
   - Advantages: more memory available to the program; simple to implement.
   - Disadvantages: the programmer must specify addresses, which becomes cumbersome with multiple subroutines.
4. Relocating loaders:
   - Avoid reassembling all subroutines when one changes; allocation and linking are handled automatically.
   - Allow multiple procedure segments but only one data segment.
   - Use relocation bits to indicate which address fields need adjustment.
5. Direct-linking loaders:
   - Offer flexible intersegment referencing and allow independent translation of programs.
   - Require the assembler to supply segment lengths, symbol lists, and address constants.

6. Information Provided by the Assembler to the Loader
- Transfer vector: the addresses and names of referenced subroutines, plus the total program length and transfer vector length.
- Relocation bits: indicate whether an address field requires relocation (1 = yes, 0 = no).

7. Object Deck Components Produced by the Assembler
1. ESD (External Symbol Dictionary): information about all symbols defined or referenced, with entries of type:
   - SD: Segment Definition (name from the START card).
   - LD: Local Definition (specified on an ENTRY card).
   - ER: External Reference (specified on an EXTRN card).
2. TXT (Text): the actual object code of the program.
3. RLD (Relocation and Linkage Directory): the constants that must be changed due to relocation, and how to change them.
4. END: marks the end of the object deck and, for the main program, specifies the execution start address.
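The relocation-bit scheme in point 6 can be sketched in the same style: one bit per program word, with 1 meaning "add the relocation factor". All values are illustrative:

    #include <stdio.h>

    int words[4]      = { 7, 100, 3, 104 };  /* mixed opcodes and addresses */
    int reloc_bits[4] = { 0, 1,   0, 1   };  /* 1 = address, needs adjusting */

    int main(void) {
        int factor = 200;   /* load origin minus assembled origin (assumed) */
        for (int i = 0; i < 4; i++)
            if (reloc_bits[i]) words[i] += factor;
        printf("%d %d %d %d\n", words[0], words[1], words[2], words[3]);
        return 0;            /* prints: 7 300 3 304 */
    }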
UNIT 3 - Scanner

1. Compiler
A compiler is a program that translates a source program written in one language (the source language) into an equivalent program in another language (the target language).

2. Compiler Passes
A pass is a full traversal of the source program or of its internal representation. Compilers may make one or several passes:
1. One-pass compiler: analysis and synthesis are done in a single flow from the beginning to the end of the program.
2. Two-pass compiler: the first pass analyses the source program; the second pass performs synthesis based on that analysis.

3. Compiler Phases
1. Lexical Analysis: tokenizes the source code.
2. Syntax Analysis: checks structure against the grammar.
3. Semantic Analysis: ensures semantic correctness (e.g., type checking).
4. Intermediate Code Generation: produces an abstract, machine-independent representation.
5. Code Optimization: improves the intermediate code for performance.
6. Target Code Generation: converts the optimized intermediate code into machine-specific code.

4. Symbol Table
A symbol table is a crucial data structure the compiler uses to track information about names (variables, functions, classes) in the source code. Its roles include:
- Storing names and attributes of entities.
- Verifying declarations and ensuring type correctness.
- Resolving the scope of variables.
Common data structures for symbol tables: lists, linked lists, binary trees, and hash tables.
(Diagram: front end and back end of a compiler.)

5. Assignment Statement Translation
Translating high-level assignment statements (e.g., x = y + 2;) into target code or an intermediate representation, handling variable assignment and expression evaluation.

6. Regular Expression (RE) Rules
1. ε is a RE denoting the language {ε}.
2. If a is a symbol of the alphabet Σ, then a is a RE denoting {a}.
3. If r and s are REs denoting the languages L(r) and L(s), then:
   - r | s denotes the union of L(r) and L(s);
   - rs denotes the concatenation of L(r) and L(s);
   - r* denotes zero or more repetitions of L(r);
   - (r) denotes the same language L(r).

7. Axioms of Regular Expressions
1. r | s = s | r (union is commutative)
2. r | (s | t) = (r | s) | t (union is associative)
3. (rs)t = r(st) (concatenation is associative)
4. r* = (r | ε)* (relation between * and ε)
5. r** = r* (* is idempotent)
6. εr = r and rε = r (ε is the identity element for concatenation)

8. Lexical Analyzer
The lexical analyzer is the first phase of a compiler. It converts the input program into tokens, which are passed to the syntax analyzer; it behaves like a subroutine, returning a token each time the parser requests one.
Key tasks:
- Recognizes and extracts lexemes (smallest logical units) such as keywords, operators, and identifiers.
- Removes white space, tabs, newlines, and comments.
- Expands macros (e.g., #define MAX 5).
- Reports unrecognized symbols (foreign words).
- Generates tokens and passes them to the syntax-analysis phase.
- Is implemented using finite automata.
Terms in lexical analysis:
- Lexeme: the smallest logical unit, e.g., sum, buffer, for.
- Token: a class of similar lexemes, e.g., Identifier, Keyword, Number.
- Pattern: describes the structure of a token, often as a regular expression, e.g., DIGIT [0-9].
Types of tokens:
- Operators and punctuation: {}, [], +, -, =, etc.
- Keywords: if, while, return, etc.
- Identifiers: the token id together with the actual name.
- Constants: the kind and value of the constant (int, float, char).
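To make the lexeme/token/pattern distinction concrete, here is a hand-coded C recognizer for the pattern DIGIT+ ([0-9]+) mentioned above; it is a sketch of what a generated lexer automates:

    #include <ctype.h>
    #include <stdio.h>

    /* The lexeme "123" matches the pattern [0-9]+ and is grouped
       into a single NUMBER token with an attribute value. */
    int main(void) {
        const char *input = "123+x";
        int i = 0, value = 0;
        while (isdigit((unsigned char)input[i])) {   /* DFA loop for [0-9]+ */
            value = value * 10 + (input[i] - '0');
            i++;
        }
        printf("token NUMBER, lexeme length %d, value %d\n", i, value);
        return 0;
    }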
Design of a Lexical Analyzer:
- Actions are implemented using transition diagrams; finite automata recognize the tokens.
- Lexical analyzers can be built by:
  1. Hand coding (the historical approach).
  2. Using generators, which produce a lexical analyzer from a specification of tokens (as REs) and actions, constructing a DFA (deterministic finite automaton).
- Basic tasks: recognizing keywords, identifiers, constants, and literals; removing extra white space and comments; generating tokens and passing them to the syntax analyzer.

9. Input Buffering in Lexical Analysis
The lexical analyzer reads characters one at a time to discover tokens, and often must examine many characters beyond the next token before identifying it. Input buffering manages this with a buffer divided into two halves (e.g., 100 characters each). When the look-ahead pointer moves past one half, the other half is loaded with the next characters from the source file.
Example: in DECLARE(arg1, arg2, ...), we cannot tell whether DECLARE is a keyword or an identifier until we see the character after the parentheses.

10. Lexical Errors
- Matched but ambiguous tokens: resolved by the parser (e.g., fi could be a misspelling of if).
- Unmatched errors, recovered by:
  - Panic mode: delete characters until a valid token is found.
  - Repairing the input: deleting extraneous characters, inserting missing characters, replacing incorrect characters, or transposing adjacent characters.

11. LEX Specification
1. Declaration section: variables, constants, and regular definitions.
2. Translation rules section: pattern-action rules (P1 → action1, P2 → action2).
3. Subroutine section: auxiliary procedures.

12. Key LEX Functions
1. getchar(): reads the next character from the input stream, letting the analyzer process source code one character at a time while recognizing tokens such as keywords and identifiers.
2. ungetc(ch, stdin): pushes a character back onto the input stream; useful when the last character read does not belong to the current token.
3. yylex(): the main scanning function; it starts or resumes scanning, identifies the next token, executes the associated action, and returns the token to the caller.
4. yywrap(): called when the end of the input file is reached; it decides whether to continue scanning another file or to report end-of-file (EOF).
5. yytext: a variable pointing to the text of the matched token as a null-terminated string, giving the analyzer easy access to the token's characters.

13. Regular Expressions in LEX
.  : matches any single character except newline.
*  : matches zero or more occurrences.
[] : matches any one character within the brackets.
^  : matches the start of a line.
$  : matches the end of a line.
{} : specifies how many times the previous pattern may match.
+  : matches one or more occurrences.
?  : matches zero or one occurrence.
|  : alternation (matches either pattern).
() : groups expressions into a new RE.
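A minimal LEX specification showing the three sections from point 11 (the token codes and patterns are illustrative, not from the notes):

    %{
    /* Declaration section: ordinary C code copied into the lexer. */
    #include <stdlib.h>
    #define NUMBER 256
    #define ID     257
    int yylval;
    %}

    %%
    [0-9]+      { yylval = atoi(yytext); return NUMBER; }  /* pattern -> action */
    [a-zA-Z]+   { return ID; }
    [ \t\n]     ;                                          /* skip white space  */
    %%

    /* Subroutine section: auxiliary procedures. */
    int yywrap(void) { return 1; }   /* no further input files */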
14. Lexical Analyzer Functions
1. installID(): adds an identifier (a variable or function name) to the symbol table, together with attributes such as its type and scope. This lets the compiler track every identifier used in the program, which is essential for semantic analysis and the later stages of compilation.
2. installNum(): stores a numeric constant in a separate table dedicated to numeric values, along with its type (e.g., integer or floating point). Keeping constants apart from identifiers helps manage the different kinds of symbols and supports accurate type checking and optimization.

15. Error Handling in Lexical Analysis
The lexical analyzer detects and reports errors such as illegal characters and misspelled keywords.
Error recovery:
1. Skip illegal characters.
2. Pass the error to the parser for context-based handling.
3. Issue an error message at the end of the file if the error remains unresolved.

UNIT 3 - Yacc

1. Overview of Yacc
Yacc (Yet Another Compiler-Compiler) is a tool for generating parsers; it processes the syntactic structure of the input according to grammar rules. Type: LALR(1) parser generator.
Outputs of Yacc:
1. Tables generated from the grammar rules.
2. Driver routines, written in C, that perform the parsing.
3. y.output: a detailed report file about the grammar.

2. Yacc and Lex Workflow
1. Lex: the lexical analyzer reads the input and produces tokens (keywords, identifiers, etc.); its output is lex.yy.c (C code for the lexer).
2. Yacc: parses the tokens produced by Lex according to the grammar rules; its output is y.tab.c (C code for the parser).
3. Process flow: input program → Lex → tokens → Yacc → abstract syntax tree (AST) → output.

3. Structure of a Yacc Program
A Yacc program has three main sections:
1. Declaration section:
   - Ordinary C declarations enclosed in %{ and %}.
   - Declarations of grammar tokens using %token.
2. Translation rules section: grammar rules with their semantic actions. Each rule has:
   - a left-hand side (a non-terminal);
   - a right-hand side (a production of terminals and non-terminals);
   - a semantic action specifying how the left-hand side's value is computed.
3. Supporting C routines section: auxiliary procedures such as:
   - main(): entry point of the program;
   - yyparse(): starts parsing the input;
   - yyerror(): reports syntax errors.
(A minimal skeleton appears after point 5.)

4. Key Functions in Yacc
1. yyparse(): the entry point of a Yacc parser; it attempts to parse the input and returns 0 on success.
2. yyerror(): reports syntax errors detected during parsing; called whenever the parser encounters a syntax problem.
3. yylex(): called by the parser to obtain the next token from Lex, together with its attribute value.

5. Lex and Yacc Schema Summary
Component — Description
Tokens — basic units of syntax, produced by Lex.
Grammar rules — define the syntactic structure, in Yacc.
Semantic actions — C code attached to grammar rules for processing.
AST (abstract syntax tree) — represents the structure of the parsed input.
Lexical analyzer — scans the input and produces tokens.
Parser — processes tokens into structured data or output.
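A skeleton Yacc program with the three sections from point 3, meant to be combined with a Lex-generated yylex(); the grammar is a deliberately tiny example:

    %{
    /* Declaration section: C declarations between %{ and %}. */
    #include <stdio.h>
    int yylex(void);
    void yyerror(const char *msg);
    %}

    %token NUM            /* grammar token declaration */

    %%
    /* Translation rules section: rules with semantic actions. */
    line : expr '\n'      { printf("= %d\n", $1); }
         ;
    expr : expr '+' NUM   { $$ = $1 + $3; }   /* LHS value from $1 and $3 */
         | NUM            { $$ = $1; }        /* same as the default action */
         ;
    %%

    /* Supporting C routines section. */
    void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }
    int main(void) { return yyparse(); }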
6. Semantic Actions in Yacc
Semantic actions are snippets of C code executed whenever a rule is reduced during parsing; they associate computations or processing tasks with grammar rules.
Important symbols in semantic actions:
1. $$: the attribute value of the non-terminal on the left-hand side of the rule; used to store the result of the rule's computation.
2. $i: the value of the i-th symbol (terminal or non-terminal) on the right-hand side; $1 is the first symbol, $2 the second, and so on.
When a grammar rule is reduced, its action computes $$ from $1, $2, ..., $n.

7. Example Grammar Rule and Semantic Action
For the rule E → E + T | T:
1. First production, E → E + T: the result is computed as $$ = $1 + $3, where $1 is the value of E (first symbol on the right) and $3 is the value of T (third symbol).
2. Second production, E → T: the result is assigned directly: $$ = $1.
Note that + is the second symbol ($2); being a terminal, it has no computation associated with it.
Default semantic action: if no action is specified, Yacc applies $$ = $1 (the value of the first right-hand-side symbol is assigned to the left-hand side).

Symbol — Meaning
$$ — value of the left-hand-side non-terminal.
$1, $2, ..., $i — values of the right-hand-side symbols (terminals or non-terminals); $i is the i-th symbol's value.
Default action — $$ = $1; (when no semantic action is specified).
yylex() — returns the next token from Lex.
yylval — communicates the token's attribute value to the parser.
yyerror(msg) — reports syntax errors.

8. Symbol Values and Actions in Yacc
Every symbol in a Yacc parser has an associated value, used to carry information from the lexer to the parser and between grammar rules.

Symbol — Value
Number — the actual numeric value (e.g., int, double).
Literal string — a pointer to the string in memory (char *).
Variable — a pointer to the symbol table entry describing the variable.

Data types in Yacc: common types are int, double, and char * for numbers, strings, and pointers; complex symbols use pointers to structures. Yacc provides a C union typedef called YYSTYPE that combines all these types into one.

Example grammar:
1. Statements:
   - Name = E: assigns a value to a variable.
   - E: prints the computed value of E.
2. Expression E:
   - E + Num: adds two numbers.
   - E - Num: subtracts two numbers.
   - Num: takes the value of a single number.
Semantic actions: in E → E + Num, $$ = $1 + $3 means the new value of E is the sum of the first E's value ($1) and Num's value ($3).

The lexer (Lex):
- Purpose: scans the input to identify tokens (numbers, variable names).
- Token values: numbers are converted to integers and passed to the parser; variable names are passed as strings.
- yylval: the special variable in which the lexer stores a token's value before handing the token to the parser.
9. Steps to Compile and Run a Yacc Parser
1. Generate the parser and lexer files:
   - Yacc processes the grammar and creates y.tab.c (the parser's C code) and y.tab.h (the header with token definitions).
   - Lex processes the lexical rules and creates lex.yy.c (the lexer's C code).
2. Compile and link: compile y.tab.c and lex.yy.c together and link them with the Yacc and Lex libraries to produce the parser.
3. Run the parser: execute the compiled program to process input and see the results.

10. Precedence and Associativity in Yacc
Precedence determines the order in which operators are applied in an expression, following mathematical convention: multiplication and division take precedence over addition and subtraction.
Examples: a + b * c → a + (b * c); d / e - f → (d / e) - f.
Operators are grouped into precedence levels from lowest to highest; C, for example, has 15 levels of precedence.
Associativity defines how operators of the same precedence are grouped:
- left-associative: a - b - c → (a - b) - c;
- right-associative: a = b = c → a = (b = c).

11. Ways to Specify Precedence and Associativity
1. Implicit method: rewrite the grammar with a separate rule for each precedence level, e.g., + and - at one level, * and / at a higher level, and parentheses and numbers at the highest.
2. Explicit method: use declarations in the grammar:
   - %left for left-associative operators;
   - %right for right-associative operators;
   - %nonassoc for operators that cannot be combined with themselves (e.g., unary minus);
   - operators are listed in order of increasing precedence.
Conflict resolution: shift/reduce conflicts during parsing are resolved using the precedence and associativity declarations; for a + b * c the parser gives multiplication priority over addition. (A declaration sketch appears at the end of this unit's notes.)

12. Token Numbers
Tokens (symbols such as +, -, numbers, variables) are identified by small integers:
- literal tokens use their ASCII values;
- symbolic tokens (like variables) are assigned unique numbers by Yacc;
- users can also assign token numbers manually.
Example: + is token number 43 (its ASCII value), while a user-defined token UP might be number 50.

y.tab.h: a header file generated by Yacc containing the token definitions and other information the lexer needs, ensuring the lexer and parser work together seamlessly.

Token values: every token has an associated value, stored in the variable yylval. Examples: numbers → their numeric values (e.g., 123); variables → pointers to symbol table entries; strings → pointers to the actual text.

%union declaration: defines a union data type for token values, allowing different kinds of data (numbers, strings, pointers) to be stored. The union definition is included in y.tab.h, making it accessible to both the lexer and the parser.
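A declaration-section fragment showing the explicit method from point 11; operators on later lines bind tighter, and the levels here are an illustrative subset of C's:

    /* lowest precedence first */
    %right '='              /* a = b = c groups as a = (b = c) */
    %left  '+' '-'          /* a - b - c groups as (a - b) - c */
    %left  '*' '/'          /* * and / outrank + and -         */
    %nonassoc UMINUS        /* pseudo-token for unary minus    */

    %%
    expr : expr '+' expr            { $$ = $1 + $3; }
         | expr '*' expr            { $$ = $1 * $3; }
         | '-' expr %prec UMINUS    { $$ = -$2; }
         ;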
UNIT 4

1. Intermediate Code Generator
An intermediate code generator translates a source program into an intermediate representation that the compiler's back end can later transform into target machine code.
Benefits of machine-independent intermediate code:
- Retargeting: the front end of the compiler can stay the same; only the back end (specific to the target machine) changes, making it easier to build compilers for different machines.
- Optimization: a machine-independent optimizer can improve the intermediate code, leading to more efficient target code.
Three kinds of intermediate representation:
1. Syntax tree: represents the hierarchical structure of the source program and how its components relate to one another.
2. Postfix notation: a linearized form of the syntax tree in which each operator appears after its operands; easy for machines to process since no parentheses are needed.
3. Three-address code (TAC): a compact, simplified form in which each instruction has at most three operands (e.g., x = y + z); convenient for manipulation and optimization.

2. Syntax Tree vs. DAG
- Syntax tree: depicts the structure of expressions as a tree-like hierarchy. For a := b * -c + b * -c, the tree shows the operations in their original order.
- Directed acyclic graph (DAG): like the syntax tree but more compact; it identifies and shares redundant subexpressions (the two b * -c operations in the example).

3. Postfix Notation
A linear form of the syntax tree in which operators are written after their operands.
Example, for a := b * -c + b * -c:
   a b c uminus * b c uminus * + assign
This form is efficient for machine processing and easy to convert to target code. (A stack-evaluation sketch appears after point 4.)

4. Three-Address Code (TAC)
Three-address code is an intermediate representation in which each instruction involves at most three addresses or operands.
General form: x := y op z
- x, y, z are variables, constants, or temporaries created by the compiler;
- op is an operator such as +, -, *, /, or a logical operation.
Example: x + y * z can be translated as
   t1 := y * z
   t2 := x + t1
where t1 and t2 are temporary variables.
Advantages:
1. Simplifies complex expressions: breaks arithmetic expressions and flow-of-control statements into simple steps, easing target-code generation and optimization.
2. Flexibility: intermediate values are explicitly named, so the code is easier to rearrange and optimize than postfix notation.
3. Clear structure: each statement has explicit operands and a result, making program logic easy to represent and manipulate.

Types of three-address statements:
1. Assignment with a binary operation: x := y op z, e.g., t1 := y * z.
2. Unary assignment: x := op y, e.g., t1 := -c (unary minus).
3. Copy statement: x := y, e.g., t2 := x.
4. Unconditional jump: goto L (jump to label L).
5. Conditional jump: if x relop y goto L, e.g., if x < y goto L.
6. Procedure call and return: param x1, param x2, call p, n, return y — e.g., calling a procedure p(x1, x2).
7. Indexed assignment: x := y[i] or x[i] := y, e.g., a := b[i].
8. Address and pointer assignment: x := &y, x := *y, *x := y, e.g., p := &x or *p := x.
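A C sketch of how the postfix form from point 3 evaluates with a stack, using b = 2 and c = 3 (values chosen arbitrarily):

    #include <stdio.h>

    /* Operands are pushed; each operator pops its arguments and
       pushes the result, so no parentheses are ever needed. */
    int stack[16], top = -1;
    void push(int v) { stack[++top] = v; }
    int  pop(void)   { return stack[top--]; }

    int main(void) {
        int b = 2, c = 3;
        push(b); push(c); push(-pop());   /* b c uminus */
        push(pop() * pop());              /* *          */
        push(b); push(c); push(-pop());   /* b c uminus */
        push(pop() * pop());              /* *          */
        push(pop() + pop());              /* +          */
        printf("a = %d\n", pop());        /* assign: a = -12 */
        return 0;
    }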
Examples:
1. Based on the syntax tree, for a := b * (-c) + b * (-c):
   t1 := -c
   t2 := b * t1
   t3 := -c
   t4 := b * t3
   t5 := t2 + t4
   a := t5
2. Based on the DAG, for the same expression:
   t1 := -c
   t2 := b * t1
   t3 := b * t1   (t1 is reused, since it is the same computation)
   t4 := t2 + t3
   a := t4

5. Implementation of Three-Address Statements
Three-address code can be implemented with different data structures — quadruples, triples, and indirect triples — each with its own advantages, especially in optimizing compilers.

1. Quadruples
A quadruple is a record or object with four fields:
- op: the operator (+, -, *, etc.);
- arg1: the first operand (a variable or constant);
- arg2: the second operand (a variable or constant);
- result: where the result goes (a variable or temporary).
Example: x := y + z is represented as op = +, arg1 = y, arg2 = z, result = x.
Advantage: in an optimizing compiler, instructions that use a temporary do not need to change when instructions are moved around, because the temporary is named explicitly. Temporary names must be entered into the symbol table as they are created.

2. Triples
A triple has only three fields:
- op: the operator;
- arg1, arg2: the operands, each a variable, constant, or pointer to another triple.
Instead of storing the result in a named temporary, a triple is referred to by its position in the sequence of triples.
Example: x := y + z becomes the triple op = +, arg1 = y, arg2 = z, with later references pointing to its position in the sequence.

3. Indirect Triples
Indirect triples store pointers to triples in a separate list rather than the triples themselves. For a sequence such as
   t1 := y + z
   t2 := t1 * 3
   t3 := t2 - x
we store pointers to these triples in a list (an array or other structure), so the triples can be moved or optimized without directly altering the references to them.
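The quadruple layout above, written out as a C record (the string fields stand in for symbol-table pointers):

    #include <stdio.h>

    /* Four-field record: the TAC sequence t1 := -c ; t2 := b * t1
       encoded directly.  Field values are illustrative. */
    struct quad {
        const char *op, *arg1, *arg2, *result;
    };

    int main(void) {
        struct quad code[] = {
            { "uminus", "c", "",   "t1" },
            { "*",      "b", "t1", "t2" },
        };
        for (int i = 0; i < 2; i++)
            printf("(%s, %s, %s, %s)\n",
                   code[i].op, code[i].arg1, code[i].arg2, code[i].result);
        return 0;
    }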
6. Need for Semantic Analysis in Compilers
Semantic analysis checks the meaning of the program and ensures it follows the language's semantic rules, in addition to its syntax (grammar). It occurs after parsing and before code generation.
Purpose:
- Adds semantic information: enriches the parse tree with meaning, such as type information and variable bindings.
- Checks program correctness: performs type checking and ensures variables and functions are declared and used properly.
- Early optimization: some optimizations, such as constant folding, may occur in this phase.
- Symbol table management: maintains symbol tables recording variables, functions, and their relationships.
Types of semantic errors:
1. Type mismatch: using variables in operations incompatible with their types.
2. Undeclared variable: using a variable that has not been declared.
3. Reserved identifier misuse: using reserved keywords (like int, for) as identifiers.
4. Multiple declarations in a scope: declaring the same variable more than once in the same scope.
5. Out-of-scope access: accessing a variable that is not in the current scope.
6. Parameter mismatch: actual parameters that do not match a function's formal parameters.

7. Syntax-Directed Definitions (SDD)
A generalization of context-free grammars (CFGs):
- Attributes: each grammar symbol has an associated set of attributes (properties).
- Semantic rules: each production is linked with rules that compute the attribute values.
Together these form annotated parse trees, in which each node carries attribute values (X.a denotes attribute a of symbol X).
Purpose: an SDD specifies how the grammar is to be interpreted, by associating attributes with symbols and defining how they are computed during parsing.

8. Attribute Grammar
A special form of CFG in which attributes attached to non-terminals carry context-sensitive information. It helps define the semantics of a programming language, in addition to its syntax.
Types of attributes:
- Synthesized attributes: computed from the values at child nodes of the parse tree.
- Inherited attributes: taken from parent or sibling nodes.

9. Synthesized Attributes
Attributes whose values are computed from the child nodes of a parse tree.
Example: for the production S → ABC, a rule such as S.a = f(A.a, B.a, C.a) combines the children's values.
Computation: the annotated parse tree is evaluated bottom-up, passing values from the leaves toward the root. (A C evaluation sketch appears after point 11.)

10. Inherited Attributes
Attributes whose values are passed down from parent or sibling nodes in the parse tree.
Example: for A → BCD with rules { C.in = A.in, C.type = B.type }, C inherits in from its parent A and type from its left sibling B.
Computation: the annotated parse tree is evaluated top-down (a pre-order traversal).

11. Implementing SDDs with Dependency Graphs
Purpose: a dependency graph determines the order in which attributes should be evaluated in an SDD.
- Nodes: one per attribute instance in the parse tree.
- Edges: if attribute b depends on attribute c, there is a directed edge from c to b; c must be evaluated first, then b.
Evaluation order: a topological sort of the dependency graph — if there is an edge from node M to node N, M must be evaluated before N. If the graph contains a cycle, there is no valid evaluation order (no way to evaluate the SDD).
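A bottom-up evaluation sketch in C, mirroring synthesized attributes (point 9): each node's val is computed only after its children's, as in E.val = E1.val + T.val. The tree is hand-built for the example:

    #include <stdio.h>

    /* Post-order traversal = bottom-up attribute evaluation. */
    struct node { char op; int val; struct node *left, *right; };

    int eval(struct node *n) {
        if (n->left == NULL) return n->val;       /* leaf: value given by lexer */
        int l = eval(n->left), r = eval(n->right);
        n->val = (n->op == '+') ? l + r : l * r;  /* synthesized attribute */
        return n->val;
    }

    int main(void) {
        struct node three = { 0, 3, NULL, NULL }, four = { 0, 4, NULL, NULL };
        struct node sum   = { '+', 0, &three, &four };   /* E -> E + T */
        printf("E.val = %d\n", eval(&sum));              /* E.val = 7  */
        return 0;
    }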
12. Types of Syntax-Directed Translation (SDT)
1. S-attributed definitions:
   - Use only synthesized attributes (computed from child nodes).
   - Example: A → BC { A.a = f(B.a, C.a) }.
   - Evaluation is bottom-up, since parent values depend on child values.
   - Semantic actions are placed at the right end of productions, e.g., A → BCD { ... }.
2. L-attributed definitions:
   - Use both synthesized and inherited attributes, but an inherited attribute may depend only on the parent or on left siblings.
   - Example: A → BCD { B.a = A.a, C.a = B.a }.
   - Evaluation is left-to-right and depth-first.
   - Semantic actions may be placed anywhere on the right-hand side, e.g., A → { ... } BC, A → B { ... } C, or A → BC { ... }.
   - Note: every S-attributed definition is automatically L-attributed.

13. Type Checking
Type checking ensures that a program follows the type rules of the programming language: operations on variables, functions, and expressions must be compatible with their data types.

1. Static type checking: done at compile time. The compiler uses type information from declarations (stored in the symbol table) to check for errors, e.g., that the mod operator works only with integers.
Limitation: static checking cannot handle run-time errors such as integer overflow (large values exceeding a data type's range).
Four kinds of static checks:
1. Type checks: operands must have compatible types (e.g., adding two integers).
2. Flow-of-control checks: control transfers (in loops, conditionals, etc.) must have valid targets.
3. Uniqueness checks: identifiers must be defined exactly once where the language requires it, with no conflicting types.
4. Name-related checks: names (variables, functions) must be used correctly and declared before use.

2. Dynamic type checking: done at run time. Each data location carries its type information, and operations are checked during execution — e.g., checking operand types before an addition, or verifying that an array index is within bounds. Used by languages such as JavaScript and LISP, where types are not declared at compile time. It incurs a run-time performance penalty but can catch errors that static checking cannot.

3. Implicit type conversion: the compiler may convert types automatically in certain cases to avoid type errors. In C, when an integer and a floating-point number are added, the integer is automatically promoted to a float. Stricter languages such as Ada and Pascal require explicit conversion, offering more control but less convenience. (A one-line C illustration follows.)
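The implicit-conversion rule in miniature: the int operand is promoted before the addition, so no static type error is reported.

    #include <stdio.h>

    int main(void) {
        int i = 3;
        double d = 1.5;
        double r = i + d;    /* i implicitly promoted to 3.0 */
        printf("%f\n", r);   /* prints: 4.500000 */
        return 0;
    }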
UNIT 5

1. Code Generation
The code generator is the final phase of a compiler: it takes an intermediate representation of the source program as input and produces an equivalent target program.
Main tasks:
1. Instruction selection: choose appropriate machine instructions for the target program, efficient in both speed and size.
2. Register allocation and assignment: allocation decides which variables will use registers; assignment picks the specific register for each.
3. Instruction ordering: arrange the instructions to optimize performance.

2. Issues in Code Generator Design
- Input: an error-free intermediate representation (parsing and type checking already done).
- Target program: absolute machine language, relocatable machine language, or assembly language.
- Memory management: map variable names to memory addresses using the symbol table; convert labels to instruction addresses.
- Instruction selection: the target instructions should be complete (cover all operations) and efficient (use machine idioms, optimize speed/size).
- Register usage: register instructions are faster and shorter than memory-based ones. Two steps: (1) register allocation — choose which variables go in registers; (2) register assignment — pick the specific register for each.
- Evaluation order: an efficient computation order reduces the number of registers needed.

Basic Blocks and Flow Graphs
A basic block is a sequence of consecutive statements with a single entry (control enters only at the beginning) and a single exit (control leaves only at the end).

3. DAG Representation of Basic Blocks
A DAG (directed acyclic graph) represents the computations in a basic block, showing the relationships between variables and expressions.
Key features:
1. Nodes: leaves represent variable names or constants; interior nodes represent operators and computed expressions; output nodes represent variables that remain live after the block executes.
2. Labels: leaves are labeled with variable names or constants; interior nodes with operators and their associated variables.
3. Uses: tracks how values are computed and used; helps detect common subexpressions.
Applications of DAGs:
1. Common subexpression detection: identifies and eliminates redundant computations.
2. Variable usage analysis: determines which variables are live or needed.
3. Optimization: improves efficiency by restructuring computations.

4. Transformations on Basic Blocks
Structure-preserving transformations:
1. Common subexpression elimination: reuse previously computed expressions instead of recomputing them.
2. Dead-code elimination: remove statements that do not affect the program's outcome.
3. Renaming temporary variables: assign fresh temporary names to simplify and normalize the block.
4. Interchange of statements: reorder independent statements to improve efficiency.

5. Code Generation Algorithm
Components:
1. Register descriptor: tracks the contents of each register (initially all empty).
2. Address descriptor: tracks where each variable's current value lives (register and/or memory).
Steps, for a statement x = y op z:
- Use getreg to find a register or location L to hold the result.
- Check y's address descriptor to locate its current copy y'; if y is not already in L, generate MOV y', L.
- Generate OP z', L (using the register copy if z is in both memory and a register).
- Update x's address descriptor to record its new location.
- Free the registers holding y and z if those values are no longer needed.
(A toy emission sketch appears after point 6.)

6. Code Optimization
Goals:
1. Preserve meaning: the output must match the program's intent.
2. Improve efficiency: faster execution and reduced resource usage.
3. Fast optimization: minimal impact on compilation time.
Levels of optimization:
1. High-level: near the source code, language-dependent — function inlining, tail-recursion elimination, loop reordering.
2. Intermediate-level: language-independent, on intermediate code — common subexpression elimination, constant propagation, dead-code elimination, strength reduction.
3. Low-level: architecture-specific, on machine code — register allocation, instruction scheduling, peephole optimization.
Types of optimization:
- Machine-independent: transforms intermediate code without considering specific hardware. Includes function-preserving transformations (dead-code elimination, strength reduction) and loop optimizations (invariant code motion, unrolling, hoisting).
- Machine-dependent: tailored to the target machine architecture, exploiting CPU registers, the memory hierarchy, and specific hardware features; includes branch prediction, floating-point unit utilization, and profile-based optimizations.
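A toy C sketch of the x = y op z step from point 5; getreg is hard-wired to R0 and the emitted syntax is illustrative:

    #include <stdio.h>

    /* Emit code for x = y + z, assuming y and z start in memory:
       load y into the register chosen by getreg, then apply op. */
    int main(void) {
        const char *reg = "R0";        /* trivial getreg() result */
        printf("MOV y, %s\n", reg);    /* MOV y', L               */
        printf("ADD z, %s\n", reg);    /* OP z', L                */
        /* address descriptor update: x's value now lives in R0 */
        return 0;
    }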
Key Techniques
1. Common subexpression elimination: reuses existing computations.
2. Dead-code elimination: removes unnecessary statements.
3. Strength reduction: replaces expensive operations with simpler ones.
4. Register allocation: efficiently maps variables to CPU registers.
5. Peephole optimization: improves small sections of code for better performance.
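Two of these techniques in miniature: the second C function is what an optimizer might make of the first, computing the common subexpression once and reducing the multiplication to a shift.

    #include <stdio.h>

    int before(int a, int b) {
        return (a + b) * 2 + (a + b);   /* a+b computed twice; *2 is costly */
    }

    int after(int a, int b) {
        int t = a + b;                  /* common subexpression, computed once */
        return (t << 1) + t;            /* strength reduction: x*2 -> x<<1     */
    }

    int main(void) {
        printf("%d %d\n", before(2, 3), after(2, 3));   /* prints: 15 15 */
        return 0;
    }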
