Full Transcript

9th August 2024

Computer Architecture (23ELC204) Pattern → L-T-P-C: 2-0-0-2

Dr. Vidya H.A, Professor & Chairperson. Senior Member of IEEE, Fellow of the Institution of Engineers (FIE), Life Member of ISTE, Fellow Life Member of the Indian Society of Lighting Engineers (ISLE). Department of Electrical & Electronics Engg., Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bengaluru Campus, Kasavanahalli, Carmelaram P.O., Bengaluru - 560 035, Karnataka, India.

Processor Architecture with MIPS as an Example

MIPS (Microprocessor without Interlocked Pipelined Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States. There are multiple versions of MIPS, including MIPS I, II, III, IV, and V, as well as five releases of MIPS32/64 (for 32- and 64-bit implementations, respectively). The early MIPS architectures were 32-bit; 64-bit versions were developed later. As of April 2017, the current version of MIPS is MIPS32/64 Release 6. MIPS32/64 primarily differs from MIPS I–V by defining the privileged kernel-mode System Control Coprocessor in addition to the user-mode architecture.

9th August 2024

Classification of Microprocessors

Besides the classification based on word length, microprocessors are also classified by architecture, i.e. by the instruction set of the microprocessor. These categories are RISC (Reduced Instruction Set Computer), CISC (Complex Instruction Set Computer) and EPIC (Explicitly Parallel Instruction Computing).

RISC: It stands for Reduced Instruction Set Computer. It is a type of microprocessor architecture that uses a small set of instructions of uniform length. These are simple instructions which are generally executed in one clock cycle. 
RISC chips are relatively simple to design and inexpensive. The setback of this design is that the computer has to repeatedly perform simple operations to execute a larger program having a large number of processing operations. Examples: SPARC, PowerPC etc. Note: The full form of SPARC is Scalable Processor Architecture. SPARC is an open architecture that is highly scalable and designed for faster execution rates.

CISC: It stands for Complex Instruction Set Computer. These processors offer users hundreds of instructions of variable sizes. CISC architecture includes a complete set of special-purpose circuits that carry out these instructions at a very high speed. These instructions interact with memory by using complex addressing modes. CISC processors reduce the program size, and hence fewer memory cycles are required to execute the programs. This increases the overall speed of execution. Examples: Intel architecture, AMD.

EPIC: It stands for Explicitly Parallel Instruction Computing. The best features of RISC and CISC processors are combined in this architecture. It implements parallel processing of instructions rather than using fixed-length instructions. The working of EPIC processors is supported by a set of complex instructions that contain both basic instructions as well as the information needed for parallel execution of instructions. This substantially increases the efficiency of these processors.

CISC vs RISC:

❖ CISC: A large number of instructions are present in the architecture.
  RISC: Very few instructions are present; the number of instructions is generally less than 100.

❖ CISC: Some instructions have long execution times. These include instructions that copy an entire block from one part of memory to another and others that copy multiple registers to and from memory.
  RISC: No instruction has a long execution time, due to the very simple instruction set. Some early RISC machines did not even have an integer multiply instruction, requiring compilers to implement multiplication as a sequence of additions.

❖ CISC: Variable-length encodings of the instructions are used. Example: IA32 instruction size can range from 1 to 15 bytes.
  RISC: Fixed-length encodings of the instructions are used. Example: in MIPS, all instructions are encoded as 4 bytes.

❖ CISC: Multiple formats are supported for specifying operands. A memory operand specifier can have many different combinations of displacement, base and index registers.
  RISC: Simple addressing formats are supported. Only base and displacement addressing is allowed.

❖ CISC: Supports arrays.
  RISC: Does not support arrays.

❖ CISC: Arithmetic and logical operations can be applied to both memory and register operands.
  RISC: Arithmetic and logical operations only use register operands. Memory is referenced only by load and store instructions, i.e. reading from memory into a register and writing from a register to memory, respectively.

❖ CISC: Implementation artifacts are hidden from machine-level programs. The ISA provides a clean abstraction between programs and how they get executed.
  RISC: Implementation artifacts are exposed to machine-level programs. A few RISC machines do not allow specific instruction sequences.

❖ CISC: Condition codes are used.
  RISC: No condition codes are used.

❖ CISC: The stack is used for procedure arguments and return addresses.
  RISC: Registers are used for procedure arguments and return addresses. Memory references can be avoided by some procedures.

Benefits of RISC Design

Some benefits that result from RISC design techniques are not directly attributable to the drive to increase performance, but are a result of the basic reduction in complexity: a simpler design allows both chip-area resources and human resources to be applied to features that enhance performance. Some of these benefits are described below. 
Shorter Design Cycle: The architectures of RISC processors can be implemented more quickly than their CISC counterparts: it is easier to fabricate and debug a streamlined, simplified architecture with no microcode than a complex architecture that uses microcode. CISC processors have such a long design cycle that they may not be completely debugged by the time they are technologically obsolete. The shorter time required to design and implement RISC processors allows them to make use of the best available technologies.

Effective Utilization of Chip Area: The simplicity of RISC processors also frees scarce chip geography for performance-critical resources such as larger register files, translation lookaside buffers (TLBs), coprocessors, and fast multiply and divide units. Such resources help RISC processors obtain an even greater performance edge.

User (Programmer) Benefits: Simplicity in architecture also helps the user by providing a uniform instruction set that is easier to use. This allows a closer correlation between the instruction count and the cycle count, making it easier to measure code optimization activities.

Advanced Semiconductor Technologies: Each new VLSI technology is introduced with tight limits on the number of transistors that fit on each chip. Since the simplicity of a RISC processor allows it to be implemented in fewer transistors than its CISC counterpart, the first computers capable of exploiting these new VLSI technologies have been using and will continue to use RISC architecture.

Optimizing Compilers: RISC architecture is designed so that the compilers, not assembly languages, have the optimal working environment. RISC philosophy assumes that high-level language programming is used, which contradicts the older CISC philosophy that assumes assembly language programming is of primary importance. 
The trend toward high-level language instructions has led to the development of more efficient compilers to convert high-level language instructions to machine code. Primary measures of compiler efficiency are the compactness of its generated code and the shortness of its execution time. During the development of more efficient compilers, analysis of instruction streams revealed that the greatest amount of time was spent executing simple instructions and performing load and store operations, while the more complex instructions were used less frequently. It was also learned that compilers produce code that is often a narrow subset of the processor instruction set architecture (ISA). A compiler works more efficiently with instructions that perform simple, well-defined operations and generate minimal side-effects. Compilers do not use complex instructions and features; the more complex, powerful instructions are either too difficult for the compiler to employ or those instructions do not precisely fit high-level language requirements. Thus, a natural match exists between RISC architectures and efficient, optimizing compilers. This match makes it easier for compilers to generate the most effective sequences of machine instructions to accomplish tasks defined by the high-level language. MIPS RISCompiler Language Suite: Some compiler products are derived from disparate sources and consequently do not fit together very well. Instead of treating each language’s compiler as a separate entity, the MIPS RISCompiler language suite shares common elements across the entire family of compilers. In this way the language suite offers both tight integration and broad language coverage. 
The MIPS language suite supports:
❖ industry-standard front ends for the following languages (C, FORTRAN, Pascal),
❖ a common intermediate language, offering an efficient way to add language front ends over time,
❖ all of the back-end optimization and code generation,
❖ the same object format and calling conventions,
❖ mixed-language programs,
❖ debugging of programs written in all languages, including mixtures.

This language suite approach yields high-quality compilers for all languages, since common elements make up the majority of each of the language products. In addition, this approach provides the ability to develop and execute multi-language programs, promoting flexibility in development, avoiding the necessity of recoding proven program segments, and protecting the user's software investment. The common back end also exports optimizing and code-generating improvements immediately throughout the language suite, thereby reducing maintenance.

Compatibility

The R4000 processor provides complete application software compatibility with the MIPS R2000, R3000, and R6000 processors. Although the MIPS processor architecture has evolved in response to a compromise between software and hardware resources in the computer system, the R4000 processor implements the MIPS ISA for user-mode programs. This guarantees that user programs conforming to the ISA execute on any MIPS hardware implementation.

Processor General Features

Full 32-bit and 64-bit Operations: The R4000 processor contains 32 general-purpose 64-bit registers. (When operating as a 32-bit processor, the general-purpose registers are 32 bits wide.) All instructions are 32 bits wide.

Efficient Pipeline: The superpipeline design of the processor results in an execution rate approaching one instruction per cycle. Pipeline stalls and exceptional events are handled precisely and efficiently. 
Memory Management Unit (MMU): The R4000 processor uses an on-chip TLB that provides rapid virtual-to-physical address translation.

Cache Control: The R4000 primary instruction and data caches reside on-chip, and can each hold 8 Kbytes. In the R4400 processor, the primary caches can each hold 16 Kbytes. Architecturally, each primary cache can be increased to hold up to 32 Kbytes. An off-chip secondary cache (R4000SC and R4000MC processors only) can hold from 128 Kbytes to 4 Mbytes. All processor cache control logic, including the secondary cache control logic, is on-chip.

Floating-Point Unit: The FPU is located on-chip and implements the ANSI/IEEE standard 754-1985.

R4000 Processor Configurations

The R4000 processor is packaged in three different configurations. All processors are implemented in sub-1-micron CMOS technology.
❖ R4000PC is designed for cost-sensitive systems such as inexpensive desktop systems and high-end embedded controllers. It is packaged in a 179-pin PGA, and does not support a secondary cache.
❖ R4000SC is designed for high-performance uniprocessor systems. It is packaged in a 447-pin LGA/PGA and includes integrated control for large secondary caches built from standard SRAMs.
❖ R4000MC is designed for large cache-coherent multiprocessor systems. It is packaged in a 447-pin LGA/PGA and, in addition to the features of R4000SC, includes support for a wide variety of bus designs and cache-coherency mechanisms.

Table 1 lists the features in each of the three configurations (X indicates the feature is present).

Table 1. R4000 Features

MIPS R4000 Processor

64-bit Architecture: The natural mode of operation for the R4000 processor is as a 64-bit microprocessor; however, 32-bit applications maintain compatibility even when the processor operates as a 64-bit processor. 
The R4000 processor provides the following:
❖ 64-bit on-chip floating-point unit (FPU),
❖ 64-bit integer arithmetic logic unit (ALU),
❖ 64-bit integer registers,
❖ 64-bit virtual address space,
❖ 64-bit system bus.

Figure 1 is a block diagram of the R4000 processor internals.

10th August 2024

Figure 1. R4000 Processor Internal Block Diagram

Note: A coprocessor is a computer processor used to supplement the functions of the primary processor (the CPU). Operations performed by the coprocessor may be floating-point arithmetic, graphics, signal processing, string processing, cryptography or I/O interfacing with peripheral devices.

Superpipeline Architecture: The R4000 processor exploits instruction parallelism by using an eight-stage superpipeline which places no restrictions on the instructions issued. Under normal circumstances, two instructions are issued each cycle. The internal pipeline of the R4000 processor operates at twice the frequency of the master clock. The processor achieves high throughput by pipelining cache accesses, shortening register access times, implementing virtual-indexed primary caches, and allowing the latency of functional units to span more than one pipeline clock cycle.

System Interface: The R4000 processor supports a 64-bit System interface that can construct uniprocessor systems with a direct DRAM interface (with or without a secondary cache) or cache-coherent multiprocessor systems. The System interface includes:
❖ a 64-bit multiplexed address and data bus,
❖ 8 check bits,
❖ a 9-bit parity-protected command bus,
❖ 8 handshake signals.

The interface is capable of transferring data between the processor and memory at a peak rate of 400 Mbytes/second when running at 50 MHz.

CPU Register Overview: The central processing unit (CPU) provides the following registers:
❖ 32 general-purpose registers,
❖ a Program Counter (PC) register,
❖ 2 registers that hold the results of integer multiply and divide operations (HI and LO). 
❖ Floating-point unit (FPU) registers.

CPU registers can be either 32 bits or 64 bits wide, depending on the R4000 processor mode of operation. Figure 2 shows the CPU registers.

Figure 2. CPU Registers

Two of the CPU general-purpose registers have assigned functions:
❖ r0 is hardwired to a value of zero, and can be used as the target register for any instruction whose result is to be discarded. r0 can also be used as a source when a zero value is needed.
❖ r31 is the link register used by Jump and Link instructions. It should not be used by other instructions.

The CPU has three special-purpose registers:
❖ PC: Program Counter register,
❖ HI: Multiply and Divide register, higher result,
❖ LO: Multiply and Divide register, lower result.

The two Multiply and Divide registers (HI, LO) store:
❖ the product of integer multiply operations, or
❖ the quotient (in LO) and remainder (in HI) of integer divide operations.

The R4000 processor has no Program Status Word (PSW) register as such; this is covered by the Status and Cause registers incorporated within the System Control Coprocessor (CP0).

CPU Instruction Set Overview: Each CPU instruction is 32 bits long. As shown in Figure 3, there are three instruction formats:
❖ immediate (I-type),
❖ jump (J-type),
❖ register (R-type).

Figure 3. CPU Instruction Formats

Each format contains a number of different instructions, which are described further in this chapter. Instruction decoding is greatly simplified by limiting the number of formats to these three. This limitation means that the more complicated (and less frequently used) operations and addressing modes can be synthesized by the compiler, using sequences of these same simple instructions.

The instruction set can be further divided into the following groupings:
❖ Load and Store instructions move data between memory and general registers. 
They are all immediate (I-type) instructions, since the only addressing mode supported is base register plus 16-bit signed immediate offset.
❖ Computational instructions perform arithmetic, logical, shift, multiply, and divide operations on values in registers. They include register (R-type, in which both the operands and the result are stored in registers) and immediate (I-type, in which one operand is a 16-bit immediate value) formats.
❖ Jump and Branch instructions change the control flow of a program. Jumps are always made to a paged, absolute address formed by combining a 26-bit target address with the high-order bits of the Program Counter (J-type format) or a register address (R-type format). Branches have 16-bit offsets relative to the program counter (I-type). Jump and Link instructions save their return address in register 31.
❖ Coprocessor instructions perform operations in the coprocessors. Coprocessor load and store instructions are I-type.
❖ Coprocessor 0 (system coprocessor) instructions perform operations on CP0 registers to control the memory management and exception handling facilities of the processor. These are listed in Table 18.
❖ Special instructions perform system calls and breakpoint operations. These instructions are always R-type.
❖ Exception instructions cause a branch to the general exception-handling vector based upon the result of a comparison. These instructions occur in both R-type (both the operands and the result are registers) and I-type (one operand is a 16-bit immediate value) formats.

Tables 2 through 17 list CPU instructions common to MIPS R-Series processors, along with those instructions that are extensions to the instruction set architecture. The extensions result in code space reductions, multiprocessor support, and improved performance in operating system kernel code sequences, for instance in situations where run-time bounds-checking is frequently performed. Table 18 lists CP0 instructions.

Table 2. 
CPU Instruction Set: Load and Store Instructions

29th August 2024

Table 3. CPU Instruction Set: Arithmetic Instructions (ALU Immediate)
Table 4. CPU Instruction Set: Arithmetic (3-Operand, R-Type)
Table 5. CPU Instruction Set: Multiply and Divide Instructions
Table 6. CPU Instruction Set: Jump and Branch Instructions
Table 7. CPU Instruction Set: Shift Instructions
Table 8. CPU Instruction Set: Coprocessor Instructions
Table 9. CPU Instruction Set: Special Instructions
Table 10. Extensions to the ISA: Load and Store Instructions
Table 11. Extensions to the ISA: Arithmetic Instructions (ALU Immediate)
Table 12. Extensions to the ISA: Multiply and Divide Instructions
Table 13. Extensions to the ISA: Branch Instructions
Table 14. Extensions to the ISA: Arithmetic Instructions (3-Operand, R-Type)
Table 15. Extensions to the ISA: Shift Instructions
Table 16. Extensions to the ISA: Exception Instructions
Table 17. Extensions to the ISA: Coprocessor Instructions
Table 18. CP0 Instructions

Data Formats and Addressing

The R4000 processor uses four data formats: a 64-bit doubleword, a 32-bit word, a 16-bit halfword, and an 8-bit byte. Byte ordering within each of the larger data formats (halfword, word, doubleword) can be configured in either big-endian or little-endian order. Endianness refers to the location of byte 0 within the multi-byte data structure. Figures 4 and 5 show the ordering of bytes within words and the ordering of words within multiple-word structures for the big-endian and little-endian conventions.

When the R4000 processor is configured as a big-endian system, byte 0 is the most-significant (leftmost) byte, thereby providing compatibility with MC68000 and IBM 370 conventions. Figure 4 shows this configuration.

30th August 2024

Figure 4. Big-Endian Byte Ordering
Figure 5. Little-Endian Byte Ordering

When configured as a little-endian system, byte 0 is always the least-significant (rightmost) byte, which is compatible with iAPX x86 and DEC VAX conventions. 
Figure 5 shows this configuration. Here, bit 0 is always the least-significant (rightmost) bit; thus, bit designations are always little-endian (although no instructions explicitly designate bit positions within words). Figures 6 and 7 show little-endian and big-endian byte ordering in doublewords.

Figure 6. Little-Endian Data in a Doubleword
Figure 7. Big-Endian Data in a Doubleword

The CPU uses byte addressing for halfword, word, and doubleword accesses with the following alignment constraints:
❖ Halfword accesses must be aligned on an even byte boundary (0, 2, 4...).
❖ Word accesses must be aligned on a byte boundary divisible by four (0, 4, 8...).
❖ Doubleword accesses must be aligned on a byte boundary divisible by eight (0, 8, 16...).

The following special instructions load and store words that are not aligned on 4-byte (word) or 8-byte (doubleword) boundaries: LWL, LWR, SWL, SWR, LDL, LDR, SDL, SDR. These instructions are used in pairs to provide addressing of misaligned words. Addressing misaligned data incurs one additional instruction cycle over that required for addressing aligned data. Figures 8 and 9 show the access of a misaligned word that has byte address 3.

Figure 8. Big-Endian Misaligned Word Addressing
Figure 9. Little-Endian Misaligned Word Addressing

Coprocessors (CP0-CP2)

The MIPS ISA defines three coprocessors (designated CP0 through CP2):
❖ Coprocessor 0 (CP0) is incorporated on the CPU chip and supports the virtual memory system and exception handling. CP0 is also referred to as the System Control Coprocessor.
❖ Coprocessor 1 (CP1) is reserved for the on-chip floating-point coprocessor, the FPU.
❖ Coprocessor 2 (CP2) is reserved for future definition by MIPS.

System Control Coprocessor, CP0: CP0 translates virtual addresses into physical addresses and manages exceptions and transitions between kernel, supervisor, and user states. CP0 also controls the cache subsystem, as well as providing diagnostic control and error recovery facilities. 
The CP0 registers shown in Figure 10 and described in Table 19 manipulate the memory management and exception handling capabilities of the CPU.

Figure 10. R4000 CP0 Registers
Table 19. System Control Coprocessor (CP0) Register Definitions

Floating-Point Unit (FPU), CP1: The MIPS floating-point unit (FPU) is designated CP1; the FPU extends the CPU instruction set to perform arithmetic operations on floating-point values. The FPU, with associated system software, fully conforms to the requirements of ANSI/IEEE Standard 754-1985, IEEE Standard for Binary Floating-Point Arithmetic. The FPU features include:
❖ Full 64-bit Operation: The FPU can contain either 16 or 32 64-bit registers to hold single-precision or double-precision values. The FPU also includes a 32-bit Status/Control register that provides access to all IEEE-Standard exception handling capabilities.
❖ Load and Store Instruction Set: Like the CPU, the FPU uses a load- and store-based instruction set. Floating-point operations are started in a single cycle and their execution overlaps other fixed-point or floating-point operations.
❖ Tightly-Coupled Coprocessor Interface: The FPU is on the CPU chip, and appears to the programmer as a simple extension of the CPU (accessed as CP1). Together, the CPU and FPU form a tightly-coupled unit with a seamless integration of floating-point and fixed-point instruction sets. Since each unit receives and executes instructions in parallel, some floating-point instructions can execute at the same rate (two instructions per cycle) as fixed-point instructions.

Addressing Modes

31st August 2024

Memory Management System (MMU)

The R4000 uses a 36-bit physical address and is thus able to address 64 Gbytes of physical memory. 
However, since it is rare for systems to implement a physical memory space this large, the CPU provides a logical expansion of memory space by translating addresses composed in the large virtual address space into available physical memory addresses. The R4000 processor supports the following two addressing modes:
❖ 32-bit mode, in which the virtual address space is divided into 2 Gbytes per user process and 2 Gbytes for the kernel.
❖ 64-bit mode, in which the virtual address space is expanded to 1 Tbyte (2^40 bytes) of user virtual address space.

The Translation Lookaside Buffer (TLB)

Virtual memory mapping is assisted by a translation lookaside buffer, which caches virtual-to-physical address translations. This fully associative, on-chip TLB contains 48 entries, each of which maps a pair of variable-sized pages ranging from 4 Kbytes to 16 Mbytes, in multiples of four.
❖ Instruction TLB: The R4000 processor has a two-entry instruction TLB (ITLB) which assists in instruction address translation. The ITLB is completely invisible to software and exists only to increase performance.
❖ Joint TLB: An address translation value is tagged with the most-significant bits of its virtual address (the number of these bits depends upon the size of the page) and a per-process identifier. If there is no matching entry in the TLB, an exception is taken and software refills the on-chip TLB from a page table resident in memory; this TLB is referred to as the joint TLB (JTLB) because it contains both data and instructions jointly. The JTLB entry to be rewritten is selected at random.

Operating Modes

The R4000 processor has three operating modes:
❖ User mode,
❖ Supervisor mode,
❖ Kernel mode.

The manner in which memory addresses are translated or mapped depends on the operating mode of the CPU. 
Cache Memory Hierarchy To achieve a high performance in uniprocessor and multiprocessor systems, the R4000 processor supports a two-level cache memory hierarchy that increases memory access bandwidth and reduces the latency of load and store instructions. This hierarchy consists of on-chip instruction and data caches, together with an optional external secondary cache that varies in size from 128 Kbytes to 4 Mbytes. The secondary cache is assumed to consist of one bank of industry-standard static RAM (SRAM) with output enables, arranged as a quadword (128-bit) data array, with a 25-bit-wide tag array. Check fields are added to both data and tag arrays to improve data integrity. The secondary cache can be configured as a joint cache, or split into separate instruction and data caches. The maximum secondary cache size is 4 Mbytes; the minimum secondary cache size is 128 Kbytes for a joint cache, or 256 Kbytes total for split instruction/data caches. The secondary cache is direct mapped, and is addressed with the lower part of the physical address. Primary Caches: The R4000 processor incorporates separate on-chip primary instruction and data caches to fill the high-performance pipeline. Each cache has its own 64-bit data path, and each can be accessed in parallel. The R4000 processor primary caches hold from 8 Kbytes to 32 Kbytes; the R4400 processor primary caches are fixed at 16 Kbytes. Cache accesses can occur up to twice each cycle. This provides the integer and floating-point units with an aggregate bandwidth of 1.6 Gbytes per second at a MasterClock frequency of 50 MHz. Secondary Cache Interface: The R4000SC (secondary cache) and R4000MC (multiprocessor) versions of the processor allow connection to an optional secondary cache. These processors provide all of the secondary cache control circuitry, including error checking and correcting (ECC) protection, on chip. 
The Secondary Cache interface includes:
❖ a 128-bit data bus,
❖ a 25-bit tag bus,
❖ an 18-bit address bus,
❖ SRAM control signals.

The 128-bit-wide data bus is designed to minimize cache miss penalties, and allows the use of standard low-cost SRAM in the secondary cache.

Summary

One of the first commercially available RISC chip sets was developed by MIPS Technology Inc. The system was inspired by an experimental system, also using the name MIPS, developed at Stanford [HENN84]. It has substantially the same architecture and instruction set as the earlier MIPS designs: the R2000 and R3000. The most significant difference is that the R4000 uses 64 rather than 32 bits for all internal and external data paths and for addresses, registers, and the ALU.

The use of 64 bits has a number of advantages over a 32-bit architecture. It allows a bigger address space: large enough for an operating system to map more than a terabyte of files directly into virtual memory for easy access. With 1-terabyte and larger disk drives now common, the 4-gigabyte address space of a 32-bit machine becomes limiting. Also, the 64-bit capacity allows the R4000 to process data such as IEEE double-precision floating-point numbers and character strings of up to eight characters in a single action.

The R4000 processor chip is partitioned into two sections, one containing the CPU and the other containing a coprocessor for memory management. The processor has a very simple architecture. The intent was to design a system in which the instruction execution logic was as simple as possible, leaving space available for logic to enhance performance (e.g., the entire memory-management unit). The processor supports thirty-two 64-bit registers. It also provides for up to 128 Kbytes of high-speed cache, half each for instructions and data. 
The relatively large cache (the IBM 3090 provides 128 to 256 Kbytes of cache) enables the system to keep large sets of program code and data local to the processor, off-loading the main memory bus and avoiding the need for a large register file with the accompanying windowing logic.

Instruction Set

All MIPS R-series instructions are encoded in a single 32-bit word format. All data operations are register-to-register; the only memory references are pure load/store operations. The R4000 makes no use of condition codes. If an instruction generates a condition, the corresponding flags are stored in a general-purpose register. This avoids the need for special logic to deal with condition codes, which affect the pipelining mechanism and the reordering of instructions by the compiler. Instead, the mechanisms already implemented to deal with register-value dependencies are employed. Further, conditions mapped onto the register file are subject to the same compile-time optimizations in allocation and reuse as other values stored in registers.

As with most RISC-based machines, the MIPS uses a single 32-bit instruction length. This single instruction length simplifies instruction fetch and decode, and it also simplifies the interaction of instruction fetch with the virtual memory management unit (i.e., instructions do not cross word or page boundaries). The three instruction formats (Figure 11) share common formatting of opcodes and register references, simplifying instruction decode. The effect of more complex instructions can be synthesized at compile time. Only the simplest and most frequently used memory-addressing mode is implemented in hardware. All memory references consist of a 16-bit offset from a 32-bit register. For example, the "load word" instruction is of the form:

lw r2, 128(r3) /* load word at address 128 offset from register 3 into register 2 */

Figure 11. MIPS Instruction Formats

Each of the 32 general-purpose registers can be used as the base register. 
One register, r0, always contains 0. The compiler makes use of multiple machine instructions to synthesize the typical addressing modes of conventional machines. Here is an example from [CHOW87], which uses the instruction lui (load upper immediate). This instruction loads the upper half of a register with a 16-bit immediate value, setting the lower half to zero. Consider an assembly-language instruction that uses a 32-bit immediate argument: the compiler synthesizes it with a lui that places the upper 16 bits of the constant in a register, followed by an instruction such as ori that supplies the lower 16 bits.

Quiz 1

12th September 2024

Instruction Pipeline

To improve the performance of a CPU we have two options:
1) Improve the hardware by introducing faster circuits.
2) Arrange the hardware such that more than one operation can be performed at the same time.

Since there is a limit on the speed of hardware and the cost of faster circuits is quite high, we have to adopt the second option. Pipelining is an arrangement of the hardware elements of the CPU such that its overall performance is increased. Simultaneous execution of more than one instruction takes place in a pipelined processor; thus, pipelined operation increases the efficiency of a system.

Design of a basic pipeline: In a pipelined processor, a pipeline has two ends, the input end and the output end. Between these ends there are multiple stages/segments, such that the output of one stage is connected to the input of the next stage and each stage performs a specific operation. Interface registers are used to hold the intermediate output between two stages. These interface registers are also called latches or buffers. All the stages in the pipeline, along with the interface registers, are controlled by a common clock.

Execution in a pipelined processor: The execution sequence of instructions in a pipelined processor can be visualized using a space-time diagram. For example, consider a processor having 4 stages and let there be 2 instructions to be executed.
We can visualize the execution sequence through the following space-time diagrams:

Non-overlapped execution:

Stage/Cycle |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8
S1          | I1 |    |    |    | I2 |    |    |
S2          |    | I1 |    |    |    | I2 |    |
S3          |    |    | I1 |    |    |    | I2 |
S4          |    |    |    | I1 |    |    |    | I2

Total time = 8 cycles

Overlapped execution:

Stage/Cycle |  1 |  2 |  3 |  4 |  5
S1          | I1 | I2 |    |    |
S2          |    | I1 | I2 |    |
S3          |    |    | I1 | I2 |
S4          |    |    |    | I1 | I2

Total time = 5 cycles

Pipeline Stages

A RISC processor has a 5-stage instruction pipeline to execute all the instructions in the RISC instruction set. Following are the 5 stages of the RISC pipeline with their respective operations:
❖ Stage 1 (Instruction Fetch): The CPU reads the instruction from the memory address held in the program counter.
❖ Stage 2 (Instruction Decode): The instruction is decoded, and the register file is accessed to get the values of the registers used in the instruction.
❖ Stage 3 (Instruction Execute): ALU operations are performed.
❖ Stage 4 (Memory Access): Memory operands are read from or written to the memory location specified in the instruction.
❖ Stage 5 (Write Back): The computed/fetched value is written back to the register specified in the instruction.

Instruction Pipeline - MIPS

With its simplified instruction architecture, the MIPS can achieve very efficient pipelining. It is instructive to look at the evolution of the MIPS pipeline, as it illustrates the evolution of RISC pipelining in general. The initial experimental RISC systems and the first generation of commercial RISC processors achieved execution speeds that approach one instruction per system clock cycle. To improve on this performance, two classes of processors have evolved to offer execution of multiple instructions per clock cycle: superscalar and super-pipelined architectures. In essence, a superscalar architecture replicates each of the pipeline stages so that two or more instructions at the same stage of the pipeline can be processed simultaneously.
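The cycle totals in the two space-time diagrams above can be checked with a short sketch (illustrative only; the stage and instruction counts are taken from the 4-stage, 2-instruction example):

```python
# Sketch: cycle counts for non-overlapped vs. overlapped (pipelined)
# execution of n instructions on a k-stage pipeline, one stage per cycle.

def sequential_cycles(k, n):
    """Each instruction passes through all k stages before the next one starts."""
    return k * n

def pipelined_cycles(k, n):
    """k cycles to fill the pipeline, then one instruction completes per cycle."""
    return k + (n - 1)

# The example from the space-time diagrams: 4 stages, 2 instructions.
print(sequential_cycles(4, 2))  # 8
print(pipelined_cycles(4, 2))   # 5
```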
13th September 2024

A super-pipelined architecture is one that makes use of more, and more fine-grained, pipeline stages. With more stages, more instructions can be in the pipeline at the same time, increasing parallelism.

Both approaches have limitations. With superscalar pipelining, dependencies between instructions in different pipelines can slow down the system. Also, overhead logic is required to coordinate these dependencies. With super-pipelining, there is overhead associated with transferring instructions from one stage to the next.

The MIPS R4000 is a good example of a RISC-based super-pipelined architecture. Figure 12a shows the instruction pipeline of the R3000.

Figure 12. Enhancing the R3000 Pipeline

In the R3000, the pipeline advances once per clock cycle. The MIPS compiler is able to reorder instructions to fill delay slots with code 70 to 90% of the time. All instructions follow the same sequence of five pipeline stages:
❖ Instruction fetch;
❖ Source operand fetch from register file;
❖ ALU operation or data operand address generation;
❖ Data memory reference;
❖ Write back into register file.

As illustrated in Figure 12a, there is not only parallelism due to pipelining but also parallelism within the execution of a single instruction. The 60-ns clock cycle is divided into two 30-ns stages. The external instruction and data access operations to the cache each require 60 ns, as do the major internal operations (OP, DA, IA). Instruction decode is a simpler operation, requiring only a single 30-ns stage, overlapped with register fetch in the same instruction. Calculation of an address for a branch instruction also overlaps instruction decode and register fetch, so that a branch at instruction i can address the ICACHE access of instruction i + 2.
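The one-instruction branch delay implied above (a branch at instruction i only redirects the fetch of instruction i + 2) can be illustrated with a toy fetch-order trace; the program encoding and function below are invented for illustration, not taken from the MIPS toolchain:

```python
# Sketch: fetch order with a single branch delay slot, as on the R3000.
# A taken branch at index i redirects fetch only after the instruction
# at i + 1 (the delay slot) has also been fetched.

def fetch_order(program, max_steps=10):
    """program: list of ('op',) or ('beq', target) tuples. Returns fetched indices."""
    order, pc = [], 0
    pending_target = None              # branch target waiting for the delay slot
    while pc < len(program) and len(order) < max_steps:
        order.append(pc)
        instr = program[pc]
        next_pc = pc + 1
        if pending_target is not None: # we just fetched the delay slot
            next_pc, pending_target = pending_target, None
        elif instr[0] == 'beq':        # taken branch: takes effect one fetch later
            pending_target = instr[1]
        pc = next_pc
    return order

prog = [('op',), ('beq', 4), ('op',), ('op',), ('op',)]
print(fetch_order(prog))   # [0, 1, 2, 4] - index 2 is the delay slot; 3 is skipped
```

Filling that delay slot with useful work, rather than a nop, is exactly what the MIPS compiler manages 70 to 90% of the time.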
Similarly, a load at instruction i fetches data that are immediately used by the OP of instruction i + 1, while an ALU/shift result gets passed directly into instruction i + 1 with no delay. This tight coupling between instructions makes for a highly efficient pipeline.

In detail, then, each clock cycle is divided into separate stages, denoted as ɸ1 and ɸ2. The functions performed in each stage are summarized in Table 20.

Table 20. R3000 Pipeline Stages

The R4000 incorporates a number of technical advances over the R3000. The use of more advanced technology allows the clock cycle time to be cut in half, to 30 ns, and the access time to the register file to be cut in half. In addition, there is greater density on the chip, which enables the instruction and data caches to be incorporated on the chip.

Before looking at the final R4000 pipeline, let us consider how the R3000 pipeline can be modified to improve performance using R4000 technology. Figure 12b shows a first step. Remember that the cycles in this figure are half as long as those in Figure 12a. Because they are on the same chip, the instruction and data cache stages take only half as long, so they still occupy only one clock cycle. Again, because of the speedup of the register file access, register read and write still occupy only half of a clock cycle.

Because the R4000 caches are on-chip, the virtual-to-physical address translation can delay the cache access. This delay is reduced by implementing virtually indexed caches and going to a parallel cache access and address translation. Figure 12c shows the optimized R3000 pipeline with this improvement. Because of the compression of events, the data cache tag check is performed separately, on the next cycle after cache access. This check determines whether the data item is in the cache.

In a super-pipelined system, existing hardware is used several times per cycle by inserting pipeline registers to split up each pipe stage.
Essentially, each super-pipeline stage operates at a multiple of the base clock frequency, the multiple depending on the degree of super-pipelining. The R4000 technology has the speed and density to permit super-pipelining of degree 2. Figure 13a shows the optimized R3000 pipeline using this super-pipelining. Note that this is essentially the same dynamic structure as Figure 12c.

Figure 13. Theoretical R3000 and Actual R4000 Super-pipelines

Further improvements can be made. For the R4000, a much larger and specialized adder was designed. This makes it possible to execute ALU operations at twice the rate. Other improvements allow the execution of loads and stores at twice the rate. The resulting pipeline is shown in Figure 13b.

The R4000 has eight pipeline stages, meaning that as many as eight instructions can be in the pipeline at the same time. The pipeline advances at the rate of two stages per clock cycle. The eight pipeline stages are as follows:
❖ Instruction fetch first half: The virtual address is presented to the instruction cache and the translation lookaside buffer (TLB).
❖ Instruction fetch second half: The instruction cache outputs the instruction and the TLB generates the physical address.
❖ Register file: Three activities occur in parallel:
✓ The instruction is decoded and a check is made for interlock conditions (i.e., whether this instruction depends on the result of a preceding instruction).
✓ The instruction cache tag check is made.
✓ Operands are fetched from the register file.
❖ Instruction execute: One of three activities can occur:
✓ If the instruction is a register-to-register operation, the ALU performs the arithmetic or logical operation.
✓ If the instruction is a load or store, the data virtual address is calculated.
✓ If the instruction is a branch, the branch target virtual address is calculated and the branch conditions are checked.
❖ Data cache first: The virtual address is presented to the data cache and TLB.
❖ Data cache second: The TLB generates the physical address, and the data cache outputs the data.
❖ Tag check: Cache tag checks are performed for loads and stores.
❖ Write back: The instruction result is written back to the register file.
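Under the simplifying assumption of no stalls, the timing of this eight-stage pipeline advancing two stages per clock can be sketched numerically (the function and its model are illustrative, not from the source):

```python
# Sketch: clock cycles to run n instructions on a super-pipelined processor
# with s stages advancing d stages per clock (d = degree of super-pipelining).
# Assumes no stalls; each stage occupies 1/d of a clock cycle.

def superpipelined_cycles(s, n, d):
    """s/d cycles to fill the pipeline, then one instruction completes every 1/d cycle."""
    return (s + (n - 1)) / d

# R4000-style pipeline: 8 stages, advancing 2 stages per clock cycle.
print(superpipelined_cycles(8, 1, 2))   # 4.0 - single-instruction latency in clocks
print(superpipelined_cycles(8, 8, 2))   # 7.5 - eight instructions in flight
```

With degree 2, an instruction still takes four base clock cycles to traverse all eight stages, but a new instruction can enter the pipeline every half cycle, which is where the throughput gain comes from.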
