Introduction to the Systems Approach

Michael J. Flynn and Wayne Luk

Summary

This document provides an introduction to systems engineering, focusing on computer system design. It surveys system architecture, the components of a system-on-chip (processors, memories, and interconnects), and how these components relate to overall system design.

Full Transcript

1 Introduction to the Systems Approach

1.1 SYSTEM ARCHITECTURE: AN OVERVIEW

The past 40 years have seen amazing advances in silicon technology and resulting increases in transistor density and performance. In 1966, Fairchild Semiconductor introduced a quad two-input NAND gate with about 10 transistors on a die. In 2008, the Intel quad-core Itanium processor had about 2 billion transistors. Figures 1.1 and 1.2 show the unrelenting advance in improving transistor density and the corresponding decrease in device cost.

Figure 1.1 The increasing transistor density on a silicon die.

Figure 1.2 The decrease of transistor cost over the years.

The aim of this book is to present an approach for computer system design that exploits this enormous transistor density. In part, this is a direct extension of studies in computer architecture and design. However, it is also a study of system architecture and design.

About 50 years ago, a seminal text, Systems Engineering—An Introduction to the Design of Large-Scale Systems, appeared. As the authors, H. H. Goode and R. E. Machol, pointed out, the system's view of engineering was created by a need to deal with complexity. As then, our ability to deal with complex design problems is greatly enhanced by computer-based tools.

A system-on-chip (SOC) architecture is an ensemble of processors, memories, and interconnects tailored to an application domain. A simple example of such an architecture is the Emotion Engine [147, 187, 237] for the Sony PlayStation 2 (Figure 1.3), which has two main functions: behavior simulation and geometry translation. This system contains three essential components: a main processor of the reduced instruction set computer (RISC) style and two vector processing units, VPU0 and VPU1, each of which contains four parallel processors of the single instruction, multiple data (SIMD) stream style. We provide a brief overview of these components and our overall approach in the next few sections.

While the focus of the book is on the system, in order to understand the system, one must first understand the components. So, before returning to the issue of system architecture later in this chapter, we review the components that make up the system.

Computer System Design: System-on-Chip, First Edition. Michael J. Flynn and Wayne Luk. © 2011 John Wiley & Sons, Inc.

1.2 COMPONENTS OF THE SYSTEM: PROCESSORS, MEMORIES, AND INTERCONNECTS

The term architecture denotes the operational structure and the user's view of the system. Over time, it has evolved to include both the functional specification and the hardware implementation.
The system architecture defines the system-level building blocks, such as processors and memories, and the interconnection between them. The processor architecture determines the processor's instruction set, the associated programming model, and its detailed implementation, which may include hidden registers, branch prediction circuits, and specific details concerning the ALU (arithmetic logic unit). The implementation of a processor is also known as its microarchitecture (Figure 1.4).

Figure 1.3 High-level functional view of a system-on-chip: the Emotion Engine of the Sony PlayStation 2 [147, 187]. Tasks synchronized with the main processor perform behavior simulation; tasks synchronized with the rendering engine perform geometry translation.

Figure 1.4 The processor architecture and its implementation.

The system designer has a programmer's or user's view of the system components, the system view of memory, the variety of specialized processors, and their interconnection. The next sections cover the basic components: the processor architecture, the memory, and the bus or interconnect architecture.

Figure 1.5 A basic SOC system model.

Figure 1.5 illustrates some of the basic elements of an SOC system. These include a number of heterogeneous processors interconnected to one or more memory elements, possibly with an array of reconfigurable logic. Frequently, the SOC also has analog circuitry for managing sensor data and analog-to-digital conversion, or to support wireless data transmission.

As an example, an SOC for a smart phone would need to support, in addition to audio input and output capabilities for a traditional phone, Internet access functions and multimedia facilities for video communication, document processing, and entertainment such as games and movies. A possible configuration for the elements in Figure 1.5 would have the core processor implemented by several ARM Cortex-A9 processors for application processing, and the media processor implemented by a Mali-400MP graphics processor and a Mali-VE video engine. The system components and custom circuitry would interface with peripherals such as the camera, the screen, and the wireless communication unit. The elements would be connected together by AXI (Advanced eXtensible Interface) interconnects.

If all the elements cannot be contained on a single chip, the implementation is probably best referred to as a system on a board, but it is often still called an SOC. What distinguishes a system on a board (or chip) from the conventional general-purpose computer plus memory on a board is the specific nature of the design target. The application is assumed to be known and specified so that the elements of the system can be selected, sized, and evaluated during the design process. The emphasis on selecting, parameterizing, and configuring system components tailored to a target application distinguishes a system architect from a computer architect.
In this chapter, we primarily look at the higher-level definition of the processor—the programmer's view or the instruction set architecture (ISA)—the basics of the processor microarchitecture, memory hierarchies, and the interconnection structure. In later chapters, we shall study in more detail the implementation issues for these elements.

1.3 HARDWARE AND SOFTWARE: PROGRAMMABILITY VERSUS PERFORMANCE

A fundamental decision in SOC design is to choose which components in the system are to be implemented in hardware and which in software. The major benefits and drawbacks of hardware and software implementations are summarized in Table 1.1.

TABLE 1.1 Benefits and Drawbacks of Software and Hardware Implementations

Implementation | Benefits | Drawbacks
Hardware | Fast, low power consumption | Inflexible, unadaptable, complex to build and test
Software | Flexible, adaptable, simple to build and test | Slow, high power consumption

A software implementation is usually executed on a general-purpose processor (GPP), which interprets instructions at run time. This architecture offers flexibility and adaptability, and provides a way of sharing resources among different applications; however, it is generally slower and more power hungry than implementing the corresponding function directly in hardware, which avoids the overhead of fetching and decoding instructions.

Most software developers use high-level languages and tools that enhance productivity, such as program development environments, optimizing compilers, and performance profilers. In contrast, the direct implementation of applications in hardware results in custom application-specific integrated circuits (ASICs), which often provide high performance at the expense of programmability—and hence flexibility, productivity, and cost.

Given that hardware and software have complementary features, many SOC designs aim to combine the individual benefits of the two. The obvious method is to implement the performance-critical parts of the application in hardware, and the rest in software. For instance, if 90% of the software execution time of an application is spent on 10% of the source code, up to a 10-fold speedup is achievable if that 10% of the code is efficiently implemented in hardware. We shall make use of this observation to customize designs in Chapter 6.
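The 90%/10% figure is an instance of Amdahl's law; the following short calculation (ours, not the book's) makes the claim concrete. If a fraction f of the execution time is accelerated by a factor s, the overall speedup is

\[
\text{Speedup} = \frac{1}{(1-f) + f/s},
\]

so with f = 0.9 and hardware fast enough that s tends to infinity, the speedup tends to 1/(1 - 0.9) = 10; even a finite s = 20 still yields 1/(0.1 + 0.9/20), or about 6.9.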
Custom ASIC hardware and software on GPPs can be seen as two extremes in the technology spectrum, with different trade-offs in programmability and performance; there are various technologies that lie between these two extremes (Figure 1.6). The two best known are application-specific instruction processors (ASIPs) and field-programmable gate arrays (FPGAs).

Figure 1.6 A simplified technology comparison: programmability versus peak performance (number of operations per watt), ranging from GPP, DSP, ASIP, FPGA, and CGRA to structured ASIC and custom ASIC. GPP, general-purpose processor; CGRA, coarse-grained reconfigurable architecture.

An ASIP is a processor with an instruction set customized for a specific application or domain. Custom instructions efficiently implemented in hardware are often integrated into a base processor with a basic instruction set. This capability often improves upon the conventional approach of using standard instruction sets to fulfill the same task, while preserving its flexibility. Chapters 6 and 7 explore further some of the issues involving custom instructions.

An FPGA typically contains an array of computation units, memories, and their interconnections, and all three are usually programmable in the field by application builders. FPGA technology often offers a good compromise: It is faster than software while being more flexible and having shorter development times than custom ASIC hardware implementations; like GPPs, FPGAs are offered as off-the-shelf devices that can be programmed without going through chip fabrication. Because of the growing demand for reducing the time to market and the increasing cost of chip fabrication, FPGAs are becoming more popular for implementing digital designs.

Most commercial FPGAs contain an array of fine-grained logic blocks, each only a few bits wide. It is also possible to have the following:

Coarse-Grained Reconfigurable Architecture (CGRA). It contains logic blocks that process byte-wide or multiple-byte-wide data, which can form building blocks of datapaths.

Structured ASIC. It allows application builders to customize the resources before fabrication. While it offers performance close to that of an ASIC, the need for chip fabrication can be an issue.

Digital Signal Processors (DSPs). The organization and instruction set of these devices are optimized for digital signal processing applications. Like microprocessors, they have a fixed hardware architecture that cannot be reconfigured.

Figure 1.6 compares these technologies in terms of programmability and performance. Chapters 6–8 provide further information about some of these technologies.

1.4 PROCESSOR ARCHITECTURES

Typically, processors are characterized either by their application or by their architecture (or structure), as shown in Tables 1.2 and 1.3.

TABLE 1.2 Processor Examples as Identified by Function

Processor Type | Application
Graphics processing unit (GPU) | 3-D graphics: rendering, shading, texture
Digital signal processor (DSP) | Generic, sometimes used with wireless
Media processor | Video and audio signal processing
Network processor | Routing, buffering

TABLE 1.3 Processor Examples as Identified by Architecture

Processor Type | Architecture/Implementation Approach
SIMD | Single instruction applied to multiple functional units (processors)
Vector (VP) | Single instruction applied to multiple pipelined registers
VLIW | Multiple instructions issued each cycle under compiler control
Superscalar | Multiple instructions issued each cycle under hardware control

The requirements space of an application is often large, and there is a range of implementation options. Thus, it is usually difficult to associate a particular architecture with a particular application. In addition, some architectures combine different implementation approaches, as seen in the PlayStation example of Section 1.1. There, the graphics processor consists of a four-element SIMD array of vector processing functional units (FUs). Other SOC implementations consist of multiprocessors using very long instruction word (VLIW) and/or superscalar processors.

From the programmer's point of view, sequential processors execute one instruction at a time.
However, many processors can execute several instructions concurrently in a manner that is transparent to the programmer, through techniques such as pipelining, multiple execution units, and multiple cores. Pipelining is a powerful technique that is used in almost all current processor implementations. Techniques to extract and exploit the inherent parallelism in the code at compile time or run time are also widely used.

Exploiting program parallelism is one of the most important goals in computer architecture. Instruction-level parallelism (ILP) means that multiple operations can be executed in parallel within a program. ILP may be achieved with hardware, compiler, or operating system techniques. At the loop level, consecutive loop iterations are ideal candidates for parallel execution, provided that there is no data dependency between subsequent loop iterations. Next, there is parallelism available at the procedure level, which depends largely on the algorithms used in the program. Finally, multiple independent programs can execute in parallel.

Different computer architectures have been built to exploit this inherent parallelism. In general, a computer architecture consists of one or more interconnected processor elements (PEs) that operate concurrently, solving a single overall problem.

1.4.1 Processor: A Functional View

Table 1.4 shows different SOC designs and the processor used in each design. For these examples, we can characterize them as general purpose, or special purpose with support for gaming or signal processing applications. This functional view tells little about the underlying hardware implementation. Indeed, several quite different architectural approaches could implement the same generic function. The graphics function, for example, requires shading, rendering, and texturing functions as well as perhaps a video function. Depending on the relative importance of these functions and the resolution of the created images, we could have radically different architectural implementations.

TABLE 1.4 Processor Models for Different SOC Examples

SOC | Application | Base ISA | Processor Description
Freescale e600 | DSP | PowerPC | Superscalar with vector extension
ClearSpeed CSX600 | General processing | Proprietary ISA | Array processor of 96 processing elements
PlayStation 2 [147, 187, 237] | Gaming | MIPS | Pipelined with two vector coprocessors
ARM VFP11 | General | ARM | Configurable vector coprocessor

1.4.2 Processor: An Architectural View

The architectural view of the system describes the actual implementation, at least in a broad-brush way. For sophisticated architectural approaches, more detail is required to understand the complete implementation.

Simple Sequential Processor. Sequential processors directly implement the sequential execution model. These processors process instructions sequentially from the instruction stream. The next instruction is not processed until all execution for the current instruction is complete and its results have been committed.

The semantics of the instruction determine the sequence of actions that must be performed to produce the specified result (Figure 1.7). These actions can be overlapped, but the result must appear in the specified serial order. These actions include:
1. fetching the instruction into the instruction register (IF),
2. decoding the opcode of the instruction (ID),
3. generating the address in memory of any data item residing there (AG),
4. fetching data operands into executable registers (DF),
5. executing the specified operation (EX), and
6. writing back the result to the register file (WB).

Figure 1.7 Instruction execution sequence: IF, ID, AG, DF, EX, WB.

A simple sequential processor model is shown in Figure 1.8. During execution, a sequential processor executes one or more operations per clock cycle from the instruction stream. An instruction is a container that represents the smallest execution packet managed explicitly by the processor. One or more operations are contained within an instruction. The distinction between instructions and operations is crucial to distinguishing between processor behaviors. Scalar and superscalar processors consume one or more instructions per cycle, where each instruction contains a single operation.

Figure 1.8 Sequential processor model.

Although conceptually simple, executing each instruction sequentially has significant performance drawbacks: A considerable amount of time is spent on overhead and not on actual execution. Thus, the simplicity of directly implementing the sequential execution model has significant performance costs.

Pipelined Processor. Pipelining is a straightforward approach to exploiting parallelism that is based on concurrently performing different phases (instruction fetch, decode, execution, etc.) of processing an instruction. Pipelining assumes that these phases are independent between different operations and can be overlapped; when this condition does not hold, the processor stalls the downstream phases to enforce the dependency. Thus, multiple operations can be processed simultaneously, with each operation at a different phase of its processing. Figure 1.9 illustrates the instruction timing in a pipelined processor, assuming that the instructions are independent.

Figure 1.9 Instruction timing in a pipelined processor: successive instructions enter the pipeline one cycle apart, each passing through IF, ID, AG, DF, EX, and WB.

For a simple pipelined machine, there is only one operation in each phase at any given time; thus, one operation is being fetched (IF); one operation is being decoded (ID); one operation is generating an address (AG); one operation is accessing operands (DF); one operation is in execution (EX); and one operation is storing results (WB). Figure 1.10 illustrates the general form of a pipelined processor. The most rigid form of a pipeline, sometimes called a static pipeline, requires the processor to go through all stages or phases of the pipeline whether required by a particular instruction or not. A dynamic pipeline allows the bypassing of one or more pipeline stages, depending on the requirements of the instruction. The more complex dynamic pipelines allow instructions to complete out of (sequential) order, or even to initiate out of order. The out-of-order processors must ensure that the sequential consistency of the program is preserved.
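A back-of-the-envelope model (our sketch, not from the book) shows why this overlap pays off for independent instructions: on an S-stage pipeline, n instructions complete in S + (n - 1) cycles rather than n x S, so the speedup approaches S as n grows.

    #include <stdio.h>

    /* Cycles to complete n independent instructions, executed strictly
       sequentially versus on an s-stage pipeline (one new instruction
       may enter the pipeline per cycle). */
    static long seq_cycles(long n, long s)  { return n * s; }
    static long pipe_cycles(long n, long s) { return s + (n - 1); }

    int main(void) {
        const long s = 6; /* IF, ID, AG, DF, EX, WB */
        for (long n = 1; n <= 1000000; n *= 100)
            printf("n=%7ld sequential=%8ld pipelined=%8ld speedup=%.2f\n",
                   n, seq_cycles(n, s), pipe_cycles(n, s),
                   (double)seq_cycles(n, s) / (double)pipe_cycles(n, s));
        return 0;
    }

With n = 1,000,000 and s = 6, the pipelined machine needs 1,000,005 cycles against 6,000,000, a speedup of almost exactly 6; real pipelines fall short of this bound because of the stalls and disruptions discussed in the text.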
Table 1.5 shows some SOC pipelined "soft" processors; a soft processor is implemented with FPGAs or similar reconfigurable technology.

TABLE 1.5 SOC Examples Using Pipelined Soft Processors [177, 178]

Processor | Word Length (bit) | Pipeline Stages | I/D-Cache Total (KB)* | Floating-Point Unit (FPU) | Usual Target
Xilinx MicroBlaze | 32 | 3 | 0–64 | Optional | FPGA
Altera Nios II fast | 32 | 6 | 0–64 | — | FPGA
ARC 600 | 16/32 | 5 | 0–32 | Optional | ASIC
Tensilica Xtensa LX | 16/24 | 5–7 | 0–32 | Optional | ASIC
Cambridge XAP3a | 16/32 | 2 | — | — | ASIC

*Configurable I-cache and/or D-cache.

Figure 1.10 Pipelined processor model.

ILP. While pipelining does not necessarily lead to executing multiple instructions at exactly the same time, there are other techniques that do. These techniques may use some combination of static scheduling and dynamic analysis to perform the actual evaluation phase of several different operations concurrently, potentially yielding an execution rate of greater than one operation every cycle. Since historically most instructions consist of only a single operation, this kind of parallelism has been named ILP (instruction-level parallelism).

Two architectures that exploit ILP are superscalar and VLIW processors. They use different techniques to achieve execution rates greater than one operation per cycle. A superscalar processor dynamically examines the instruction stream to determine which operations are independent and can be executed. A VLIW processor relies on the compiler to analyze the available operations (OP) and to schedule independent operations into wide instruction words, which then execute these operations in parallel with no further analysis.
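The dependence analysis involved is easy to see in a small fragment (an invented example, with C standing in for machine-level operations): a superscalar scheduler discovers at run time, and a VLIW compiler at compile time, that the first two operations below may issue together, while the third must wait.

    /* Invented fragment illustrating ILP; the names are arbitrary. */
    int ilp_example(int x, int y, int p, int q) {
        int a = x * y;   /* independent of the next operation               */
        int b = p + q;   /* no dependence on a: may issue in the same cycle */
        int c = a - b;   /* reads a and b: must issue after both complete   */
        return c;
    }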
Figure 1.11 shows the instruction timing of a pipelined superscalar or VLIW processor executing two instructions per cycle. In this case, all the instructions are independent so that they can be executed in parallel. The next two sections describe these two architectures in more detail.

Figure 1.11 Instruction timing in a pipelined ILP processor: pairs of independent instructions proceed through the pipeline together.

Superscalar Processors. Dynamic pipelined processors remain limited to executing a single operation per cycle by virtue of their scalar nature. This limitation can be avoided with the addition of multiple functional units and a dynamic scheduler to process more than one instruction per cycle (Figure 1.12). These superscalar processors can achieve execution rates of several instructions per cycle (usually limited to two, but more are possible depending on the application). The most significant advantage of a superscalar processor is that processing multiple instructions per cycle is done transparently to the user, and that it can provide binary code compatibility while achieving better performance.

Figure 1.12 Superscalar processor model.

Compared to a dynamic pipelined processor, a superscalar processor adds a scheduling instruction window that analyzes multiple instructions from the instruction stream in each cycle. Although processed in parallel, these instructions are treated in the same manner as in a pipelined processor. Before an instruction is issued for execution, dependencies between the instruction and its prior instructions must be checked by hardware. Because of the complexity of the dynamic scheduling logic, high-performance superscalar processors are limited to processing four to six instructions per cycle. Although superscalar processors can exploit ILP from the dynamic instruction stream, exploiting higher degrees of parallelism requires other approaches.

VLIW Processors. In contrast to the dynamic analyses in hardware that determine which operations can be executed in parallel, VLIW processors (Figure 1.13) rely on static analyses in the compiler. VLIW processors are thus less complex than superscalar processors and have the potential for higher performance. A VLIW processor executes operations from statically scheduled instructions that contain multiple independent operations. Because the control complexity of a VLIW processor is not significantly greater than that of a scalar processor, the improved performance comes without the complexity penalties.

Figure 1.13 VLIW processor model.

VLIW processors rely on the static analyses performed by the compiler and are unable to take advantage of any dynamic execution characteristics. For applications that can be scheduled statically to use the processor resources effectively, a simple VLIW implementation results in high performance. Unfortunately, not all applications can be effectively scheduled statically. In many applications, execution does not proceed exactly along the path defined by the code scheduler in the compiler. Two classes of execution variations can arise and affect the scheduled execution behavior:

1. delayed results from operations whose latency differs from the assumed latency scheduled by the compiler and
2. interruptions from exceptions or interrupts, which change the execution path to a completely different and unanticipated code schedule.

Although stalling the processor can control a delayed result, this solution can result in significant performance penalties. The most common execution delay is a data cache miss. Many VLIW processors avoid all situations that can result in a delay by avoiding data caches and by assuming worst-case latencies for operations. However, when there is insufficient parallelism to hide the exposed worst-case operation latency, the instruction schedule has many incompletely filled or empty instructions, resulting in poor performance.

Tables 1.6 and 1.7 describe some representative superscalar and VLIW processors.

TABLE 1.6 SOC Examples Using Superscalar Processors

Device | Number of Functional Units | Issue Width | Base Instruction Set
MIPS 74K Core | 4 | 2 | MIPS32
Infineon TriCore2 | 4 | 3 | RISC
Freescale e600 | 6 | 3 | PowerPC

TABLE 1.7 SOC Examples Using VLIW Processors

Device | Number of Functional Units | Issue Width
Fujitsu MB93555A | 8 | 8
TI TMS320C6713B | 8 | 8
CEVA-X1620 | 30 | 8
Philips Nexperia PNX1700 | 30 | 5
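As a schematic picture of what a "wide instruction word" means (an invented layout, not any real ISA), each VLIW word bundles one operation per functional-unit slot, and the compiler guarantees the slots are mutually independent:

    /* Invented three-slot VLIW instruction word; a slot the compiler
       cannot fill becomes a no-op, which is the source of the "empty
       instructions" problem noted above. */
    struct vliw_word {
        struct { int op, dst, src1, src2; } alu0; /* e.g., ADD r1, r2, r3 */
        struct { int op, dst, src1, src2; } alu1; /* e.g., MUL r4, r5, r6 */
        struct { int op, dst, addr;       } mem;  /* e.g., LD  r7, [r8]   */
    };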
SIMD Architectures: Array and Vector Processors. The SIMD class of processor architecture includes both array and vector processors. The SIMD processor is a natural response to the use of certain regular data structures, such as vectors and matrices. From the view of an assembly-level programmer, programming a SIMD architecture appears to be very similar to programming a simple processor, except that some operations perform computations on aggregate data. Since these regular structures are widely used in scientific programming, the SIMD processor has been very successful in these environments.

The two popular types of SIMD processor are the array processor and the vector processor. They differ both in their implementations and in their data organizations. An array processor consists of many interconnected processor elements, each having its own local memory space. A vector processor consists of a single processor that references a global memory space and has special function units that operate on vectors.

An array processor or a vector processor can be obtained by extending the instruction set of an otherwise conventional machine. The extended instructions enable control over special resources in the processor, or in some sort of coprocessor. The purpose of such extensions is to enable increased performance on special applications.

Array Processors. The array processor (Figure 1.14) is a set of parallel processor elements connected via one or more networks, possibly including local and global interelement communications and control communications. Processor elements operate in lockstep in response to a single broadcast instruction from a control processor (SIMD). Each processor element (PE) has its own private memory, and data are distributed across the elements in a regular fashion that is dependent on both the actual structure of the data and also the computations to be performed on the data.

Figure 1.14 Array processor model.

Direct access to global memory or another processor element's local memory is expensive, so intermediate values are propagated through the array through local interprocessor connections. This requires that the data be distributed carefully so that the routing required to propagate these values is simple and regular. It is sometimes easier to duplicate data values and computations than it is to support a complex or irregular routing of data between processor elements.

Since instructions are broadcast, there is no means local to a processor element of altering the flow of the instruction stream; however, individual processor elements can conditionally disable instructions based on local status information—these processor elements are idle when this condition occurs. The actual instruction stream consists of more than a fixed stream of operations. An array processor is typically coupled to a general-purpose control processor that provides both scalar operations as well as array operations that are broadcast to all processor elements in the array. The control processor performs the scalar sections of the application, interfaces with the outside world, and controls the flow of execution; the array processor performs the array sections of the application as directed by the control processor.
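A software sketch (ours; the array sizes and names are invented) of one broadcast instruction on such a machine: every enabled PE applies the same operation to its own private data, and PEs whose local condition disables the instruction simply sit idle.

    #include <stddef.h>

    #define NPE    8   /* number of processor elements (illustrative)   */
    #define LOCAL 16   /* words of private memory per PE (illustrative) */

    /* One broadcast SIMD instruction: add a per-PE scalar to each word
       of that PE's private memory. The outer loop is conceptually
       parallel; the enable mask models conditional disabling. */
    void broadcast_add(float local[NPE][LOCAL], const float addend[NPE],
                       const int enabled[NPE]) {
        for (int pe = 0; pe < NPE; pe++) {   /* in hardware: lockstep  */
            if (!enabled[pe]) continue;      /* disabled PEs stay idle */
            for (size_t i = 0; i < LOCAL; i++)
                local[pe][i] += addend[pe];  /* same op, private data  */
        }
    }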
A suitable application for use on an array processor has several key characteristics: a significant amount of data that have a regular structure, computations on the data that are uniformly applied to many or all elements of the data set, and simple and regular patterns relating the computations and the data. An example of an application that has these characteristics is the solution of the Navier–Stokes equations, although any application that has significant matrix computations is likely to benefit from the concurrent capabilities of an array processor.

Table 1.8 contains several array processor examples. The ClearSpeed processor is an example of an array processor chip that is directed at signal processing applications.

TABLE 1.8 SOC Examples Based on Array Processors

Device | Processors per Control Unit | Data Size (bit)
ClearSpeed CSX600 | 96 | 32
Atsana J2211 | Configurable | 16/32
Xelerator X10q | 200 | 4

Vector Processors. A vector processor is a single processor that resembles a traditional single-stream processor, except that some of the function units (and registers) operate on vectors—sequences of data values that are seemingly operated on as a single entity. These function units are deeply pipelined and have high clock rates. While the vector pipelines often have higher latencies compared with scalar function units, the rapid delivery of the input vector data elements, together with the high clock rates, results in a significant throughput.

Modern vector processors require that vectors be explicitly loaded into special vector registers and stored back into memory—the same course that modern scalar processors use for similar reasons. Vector processors have several features that enable them to achieve high performance. One feature is the ability to concurrently load and store values between the vector register file and the main memory while performing computations on values in the vector register file. This is an important feature since the limited length of vector registers requires that vectors longer than the register length be processed in segments—a technique called strip mining. Not being able to overlap memory accesses and computations would pose a significant performance bottleneck.
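A sketch of strip mining in scalar C (our illustration; VLEN is an invented register length): the loop over n elements is cut into segments of at most VLEN, each of which would map onto one vector-register operation.

    #define VLEN 64 /* vector register length, invented for illustration */

    /* c[i] = a[i] + b[i] for arbitrary n, processed in register-sized
       strips; on a vector machine the inner loop would become one
       vector load/add/store sequence. */
    void strip_mined_add(float *c, const float *a, const float *b, long n) {
        for (long i = 0; i < n; i += VLEN) {
            long seg = (n - i < VLEN) ? (n - i) : VLEN;
            for (long j = 0; j < seg; j++)
                c[i + j] = a[i + j] + b[i + j];
        }
    }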
Most vector processors support a form of result bypassing—in this case called chaining—that allows a follow-on computation to commence as soon as the first value is available from the preceding computation. Thus, instead of waiting for the entire vector to be processed, the follow-on computation can be significantly overlapped with the preceding computation on which it depends. Sequential computations can be efficiently compounded to behave as if they were a single operation, with a total latency equal to the latency of the first operation plus the pipeline and chaining latencies of the remaining operations, but none of the start-up overhead that would be incurred without chaining. For example, division could be synthesized by chaining a reciprocal with a multiply operation. Chaining typically works for the results of load operations as well as normal computations.

A typical vector processor configuration (Figure 1.15) consists of a vector register file, one vector addition unit, one vector multiplication unit, and one vector reciprocal unit (used in conjunction with the vector multiplication unit to perform division); the vector register file contains multiple vector registers.

Figure 1.15 Vector processor model.

Table 1.9 shows examples of vector processors. The IBM mainframes have vector instructions (and support hardware) as an option for scientific users.

TABLE 1.9 SOC Examples Using Vector Processors

Device | Vector Function Units | Vector Registers
Freescale e600 | 4 | 32, configurable
Motorola RSVP | 4 (64 bit, partitionable at 16 bits) | 2 streams (2 from, 1 to memory)
ARM VFP11 | 3 (64 bit, partitionable to 32 bits) | 4 × 8, 32 bit

"Configurable" implies a pool of N registers that can be configured as p register sets of N/p elements.

Multiprocessors. Multiple processors can cooperatively execute to solve a single problem by using some form of interconnection for sharing results. In this configuration, each processor executes completely independently, although most applications require some form of synchronization during execution to pass information and data between processors. Since the multiple processors share memory and execute separate program tasks (MIMD: multiple instruction stream, multiple data stream), their proper implementation is significantly more complex than that of the array processor. Most configurations are homogeneous, with all processor elements being identical, although this is not a requirement. Table 1.10 shows examples of SOC multiprocessors.

TABLE 1.10 SOC Multiprocessors and Multithreaded Processors

| Machanick SOC | IBM Cell | Philips PNX8500 | Lehtoranta
Number of CPUs | 4 | 1 | 2 | 4
Threads | 1 | Many | 1 | 1
Vector units | 0 | 8 | 0 | 0
Application | Various | Various | HDTV | MPEG decode
Comment | Proposal only | | Also called Viper 2 | Soft processors

The interconnection network in the multiprocessor passes data between processor elements and synchronizes the independent execution streams between processor elements. When the memory of the processor is distributed across all processors and only the local processor element has access to it, all data sharing is performed explicitly using messages, and all synchronization is handled within the message system. When the memory of the processor is shared across all processor elements, synchronization is more of a problem—certainly, messages can be used through the memory system to pass data and information between processor elements, but this is not necessarily the most effective use of the system.

When communications between processor elements are performed through a shared memory address space—either global or distributed between processor elements (the latter called distributed shared memory to distinguish it from distributed memory)—two significant problems arise. The first is maintaining memory consistency: the programmer-visible ordering effects of memory references, both within a processor element and between different processor elements. This problem is usually solved through a combination of hardware and software techniques. The second is cache coherency—the programmer-invisible mechanism that ensures all processor elements see the same value for a given memory location. This problem is usually solved exclusively through hardware techniques.

The primary characteristic of a multiprocessor system is the nature of the memory address space. If each processor element has its own address space (distributed memory), the only means of communication between processor elements is through message passing. If the address space is shared (shared memory), communication is through the memory system.
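A minimal shared-memory sketch (ours, assuming POSIX threads rather than any SOC-specific API): the two workers communicate through an address both can see, and a mutex provides the synchronization that, as noted above, the shared-memory style requires.

    #include <pthread.h>
    #include <stdio.h>

    static long shared_sum = 0;               /* shared address space   */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long partial = (long)arg;             /* this PE's local result */
        pthread_mutex_lock(&lock);            /* synchronize the update */
        shared_sum += partial;                /* communicate via memory */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)(100 + i));
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("sum = %ld\n", shared_sum);    /* prints 201 */
        return 0;
    }

In a distributed-memory organization the same exchange would instead be two explicit messages sent to a process that owns the sum.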
TABLE 1.15 Interconnect Models for Different SOC Examples

SOC | Application | Interconnect Type
ClearSpeed CSX600 | High-performance computing | ClearConnect bus
NetSilicon NET+40 | Networking | Custom bus
NXP LH7A404 | Networking | AMBA bus
Intel PXA27x | Mobile/wireless | PXBus
Matsushita i-Platform | Media | Internal connect bus
Emulex InSpeed SOC320 | Switching | Crossbar switch
MultiNOC | Multiprocessing system | Network-on-chip

1.7 AN APPROACH FOR SOC DESIGN

Two important ideas in a design process are figuring out the requirements and specifications, and iterating through different stages of design toward an efficient and effective completion.

1.7.1 Requirements and Specifications

Requirements and specifications are fundamental concepts in any system design situation. There must be a thorough understanding of both before a design can begin. They are useful at the beginning and at the end of the design process: at the beginning, to clarify what needs to be achieved; and at the end, as a reference against which the completed design can be evaluated.

The system requirements are the largely externally generated criteria for the system. They may come from competition, from sales insights, from customer requests, from product profitability analysis, or from a combination. Requirements are rarely succinct or definitive statements about the system. Indeed, requirements can frequently be unrealistic: "I want it fast, I want it cheap, and I want it now!" It is important for the designer to analyze the requirements carefully, and to spend sufficient time understanding the market situation to determine all the factors expressed in the requirements and the priorities those factors imply.

Some of the factors the designer considers in determining requirements include compatibility with previous designs or published standards, reuse of previous designs, customer requests/complaints, sales reports, cost analysis, competitive equipment analysis, and trouble reports (reliability) of previous products and competitive products. The designer can also introduce new requirements based on new technology, new ideas, or new materials that have not been used in a similar systems environment.

The system specifications are the quantified and prioritized criteria for the target system design. The designer takes the requirements and must produce a succinct and definitive set of statements about the eventual system. The designer may have no idea of what the eventual system will look like, but usually there is some "straw man" design in mind that seems to provide a feasibility framework for the specification. In any effective design process, it would be surprising if the final design significantly resembled the straw man design.

The specification does not complete any part of the design process; it initializes the process. Now the design can begin with the selection of components and approaches, the study of alternatives, and the optimization of the parts of the system.

1.7.2 Design Iteration

Design is always an iterative process. So, the obvious question is how to get the very first, initial design. This is the design that we can then iterate through and optimize according to the design criteria. For our purposes, we define several types of designs based on the stage of design effort.
Initial Design. This is the first design that shows promise in meeting the key requirements, while other performance and cost criteria are not considered. For instance, the processor, memory, or input/output (I/O) should be sized to meet high-priority real-time constraints. Promising components and their parameters are selected and analyzed to provide an understanding of their expected idealized performance and cost. Idealized does not mean ideal; it means a simplified model of the expected area occupied and computational or data bandwidth capability. It is usually a simple linear model of performance, such as the expected million instructions per second (MIPS) rate of a processor.

Optimized Design. Once the base performance (or area) requirements are met and the base functionality is ensured, the goal is to minimize the cost (area) and/or the power consumption or the design effort required to complete the design. This is the iterative step of the process. The first steps of this process use higher-fidelity tools (simulations, trial layouts, etc.) to ensure that the initial design actually does satisfy the design specifications and requirements. The later steps refine, complete, and improve the design according to the design criteria.

Figure 1.21 shows the steps in creating an initial design. This design is detailed enough to create a component view of the design and a corresponding projection of the component's expected performance. This projection is, at this step, necessarily simplified and referred to here as the idealized view of the component (Figure 1.22): idealized memory with a fixed access time, an idealized interconnect with fixed access time and ample bandwidth, idealized I/O, and n idealized processors selected by function.

Figure 1.21 The SOC initial design process.

Figure 1.22 Idealized SOC components.

System performance is limited by the component with the least capability. The other components can usually be modeled as simply presenting a delay to the critical component. In a good design, the most expensive component is the one that limits the performance of the system. The system's ability to process transactions should closely follow that of the limiting component. Typically, this is the processor or memory complex.

Usually, designs are driven by either (1) a specific real-time requirement, after which functionality and cost become important, or (2) functionality and/or throughput under cost–performance constraints. In case (1), the real-time constraint is provided by I/O considerations, which the processor–memory–interconnect system must meet. The I/O system then determines the performance, and any excess capability of the remainder of the system is usually used to add functionality to the system. In case (2), the object is to improve task throughput while minimizing the cost. Throughput is limited by the most constrained component, so the designer must fully understand the trade-offs at that point. There is more flexibility in these designs, and correspondingly more options in determining the final design.
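The "least capability" rule is easy to turn into a first-cut model (ours; the component rates below are invented for illustration): idealized rates for each component are compared, and the minimum bounds system throughput.

    #include <stdio.h>

    /* First-cut, idealized system model in the spirit of Figure 1.22:
       throughput is bounded by the least-capable component. */
    int main(void) {
        double rate[] = { 400e6,   /* processor: operations per second */
                          250e6,   /* memory: accesses per second      */
                          300e6 }; /* I/O: transfers per second        */

        double limit = rate[0];
        for (int i = 1; i < 3; i++)
            if (rate[i] < limit) limit = rate[i];

        /* Here the 250M-access/s memory is the critical component. */
        printf("system bound: %.0f transactions/s\n", limit);
        return 0;
    }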
The purpose of this book is to provide an approach for determining the initial design by

(a) describing the range of components—processors, memories, and interconnects—that are available in building an SOC;
(b) providing examples of requirements for various domains of applications, such as data compression and encryption; and
(c) illustrating how an initial design, or a reported implementation, can show promise in meeting specific requirements.

We explain this approach in Chapters 3–5 on a component-by-component basis to cover (a), with Chapter 6 covering techniques for system configuration and customization. Chapter 7 contains application studies to cover (b) and (c). As mentioned earlier, the designer must optimize each component for processing and storage. This optimization process requires extensive simulation. We provide access to basic simulation tools through our associated web site.

1.8 SYSTEM ARCHITECTURE AND COMPLEXITY

The basic difference between processor architecture and system architecture is that the system adds another layer of complexity, and the complexity of these systems limits the cost savings. Historically, the notion of a computer is a single processor plus a memory. As long as this notion is fixed (within broad tolerances), implementing that processor on one or more silicon dies does not change the design complexity. Once die densities enable a scalar processor to fit on a chip, the complexity issue changes.

Suppose it takes about 100,000 transistors to implement a 32-bit pipelined processor with a small first-level cache. Let this be a processor unit of design complexity. As long as we need to implement the 100,000-transistor processor, additional transistor density on the die does not much affect design complexity. More transistors per die, while increasing die complexity, simplify the problem of interconnecting the multiple chips that make up the processor. Once the unit processor is implemented on a single die, the design complexity issue changes. As transistor densities significantly improve after this point, there are obvious processor extension strategies to improve performance:

1. Additional Cache. Here we add cache storage and, as large caches have slower access times, a second-level cache.

2. A More Advanced Processor. We implement a superscalar or a VLIW processor that executes more than one instruction each cycle. Additionally, we speed up the execution units that affect the critical path delay, especially the floating-point execution times.

3. Multiple Processors. Now we implement multiple (superscalar) processors and their associated multilevel caches. This leaves us limited only by the memory access times and bandwidth.

Figure 1.23 Complexity of design, in processor equivalents per die (1 processor equivalent = 10 M transistors), for robust, small/limited, and multiple/SOC designs over the years.

The result of the above is a significantly greater design complexity (see Figure 1.23). Instead of the 100,000-transistor processor, our advanced processor has millions of transistors; the multilevel caches are also complex, as is the need to coordinate (synchronize) the multiple processors, since they require a consistent image of the contents of memory. The obvious way to manage this complexity is to reuse designs.
So, reusing several simpler processor designs implemented on a die is preferable to a new, more advanced, single processor. This is especially true if we can select specific processor designs suited to particular parts of an application. For this to work, we also need a robust interconnection mechanism to access the various processors and memory. So, when an application is well specified, the system-on-a-chip approach includes

1. multiple (usually) heterogeneous processors, each specialized for specific parts of the application;
2. the main memory with (often) ROM for partial program storage;
3. a relatively simple, small (single-level) cache structure or buffering schemes associated with each processor; and
4. a bus or switching mechanism for communications.

Even when the SOC approach is technically attractive, it has economic limitations and implications. Given the processor and interconnect complexity, if we limit the usefulness of an implementation to a particular application, we have to either (1) ensure that there is a large market for the product or (2) find methods for reducing the design cost through design reuse or similar techniques.

1.9 PRODUCT ECONOMICS AND IMPLICATIONS FOR SOC

1.9.1 Factors Affecting Product Costs

The basic cost and profitability of a product depend on many factors: its technical appeal, its cost, the market size, and the effect the product has on future products. The issue of cost goes well beyond the product's manufacturing cost. There are fixed and variable costs, as shown in Figure 1.24. Indeed, the engineering costs, frequently the largest of the fixed costs, are expended before any revenue can be realized from sales (Figure 1.25).

Figure 1.24 Project cost components: product cost comprises fixed costs (engineering; marketing, sales, and administration) and variable costs (manufacturing).

Depending on the complexity, designing a new chip requires a development effort of anywhere between 12 and 30 months before the first manufactured unit can be shipped. Even a moderately sized project may require 30 or more hardware and software engineers, plus CAD design and support personnel. For instance, the paper describing the Sony Emotion Engine has 22 authors [147, 187]. However, their salary and indirect costs might represent only a fraction of the total development cost. Nonengineering fixed costs include manufacturing start-up costs, inventory costs, initial marketing and sales costs, and administrative overhead.

2 Chip Basics: Time, Area, Power, Reliability, and Configurability

2.1 INTRODUCTION

The trade-off between cost and performance is fundamental to any system design. Different designs result either from the selection of different points on the cost–performance continuum or from differing assumptions about the nature of cost or performance.

The driving force in design innovation is the rapid advance in technology. The Semiconductor Industry Association (SIA) regularly makes projections, called the SIA road map, of technology advances, which become the basis and assumptions for new chip designs. While the projections change, the advance has been and is expected to continue to be formidable. Table 2.1 is a summary of the road map projections for the microprocessors with the highest performance introduced in a particular year. With the advances in lithography, the transistors are getting smaller.
The minimum width of the transistor gates is defined by the process technology. Table 2.1 refers to process technology generations in terms of nanometers; older generations are referred to in terms of microns (μm). So the previous generations are 65 and 90 nm, and 0.13 and 0.18 μm.

TABLE 2.1 Technology Road Map Projections

Year | 2010 | 2013 | 2016
Technology generation (nm) | 45 | 32 | 22
Wafer size, diameter (cm) | 30 | 45 | 45
Defect density (per cm^2) | 0.14 | 0.14 | 0.14
μP die size (cm^2) | 1.9 | 2.6 | 2.6
Chip frequency (GHz) | 5.9 | 7.3 | 9.2
Million transistors per cm^2 | 1203 | 3403 | 6806
Max power (W), high performance | 146 | 149 | 130

2.1.1 Design Trade-Offs

With increases in chip frequency and especially in transistor density, the designer must be able to find the best set of trade-offs in an environment of rapidly changing technology. Already the chip frequency projections have been called into question because of the resulting power requirements.

In making basic design trade-offs, we have five different considerations. The first is time, which includes partitioning instructions into events or cycles, basic pipelining mechanisms used in speeding up instruction execution, and cycle time as a parameter for optimizing program execution. Second, we discuss area. The cost or area occupied by a particular feature is another important aspect of the architectural trade-off. Third, power consumption affects both performance and implementation. Instruction sets that require more implementation area are less valuable than instruction sets that use less—unless, of course, they can provide commensurately better performance. Long-term cost–performance ratio is the basis for most design decisions. Fourth, reliability comes into play to cope with deep submicron effects. Fifth, configurability provides an additional opportunity for designers to trade off recurring and nonrecurring design costs.

FIVE BIG ISSUES IN SYSTEM-ON-CHIP (SOC) DESIGN

Four of the issues are obvious. Die area (manufacturing cost) and performance (heavily influenced by cycle time) are important basic SOC design considerations. Power consumption has also come to the fore as a design limitation. As technology shrinks feature sizes, reliability will dominate as a design consideration.

The fifth issue, configurability, is less obvious as an immediate design consideration. However, as we saw in Chapter 1, in SOC design the nonrecurring design costs can dominate the total project cost. Making a design flexible through reconfigurability is an important way to broaden the market—and reduce the per-part cost—of an SOC design. Configurability enables programmability in the field and can be seen to provide features that are "standardized in manufacturing while customized in application." The cyclical nature of the integrated circuit industry between standardization and customization has been observed by Makimoto and is known as Makimoto's wave, as shown in Figure 2.1.

Figure 2.1 Makimoto's wave: the industry alternates between standardization (discrete devices, standard memories, microprocessors, field programmability) and customization (custom LSIs for TVs and calculators, ASICs, systems on chip).

In terms of complexity, various trade-offs are possible. For instance, at a fixed feature size, area can be traded off for performance (expressed in terms of execution time, T).
Very large scale integration (VLSI) complexity theorists have shown that an A × T^n bound exists for processor designs, where n usually falls between 1 and 2. It is also possible to trade off time T for power P with a P × T^3 bound. Figure 2.2 shows the possible trade-offs involving area, time, and power in a processor design. Embedded and high-end processors operate in different design regions of this three-dimensional space. The power and area axes are typically optimized for embedded processors, whereas the time axis is typically optimized for high-end processors.

Figure 2.2 Processor design trade-offs: high-performance server processor designs follow P × T^3 = constant, while cost- and power-sensitive client processor designs follow A × T^n = constant.
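A quick worked illustration of how steep these bounds are (our numbers): halving the execution time, T' = T/2, at a fixed feature size implies

\[
A\,T^{n} = \text{const} \;\Rightarrow\; \frac{A'}{A} = \left(\frac{T}{T'}\right)^{n} = 2^{n} \le 4 \quad (n \le 2),
\qquad
P\,T^{3} = \text{const} \;\Rightarrow\; \frac{P'}{P} = \left(\frac{T}{T'}\right)^{3} = 8,
\]

so a 2x gain in speed can cost up to 4x the area and roughly 8x the power, which is why the embedded and high-end design regions in Figure 2.2 sit so far apart.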
In considering requirements, the SOC designer should carefully consider each trade-off item to derive the corresponding specifications. This chapter, coupled with an essential understanding of the system components developed in later chapters, provides the elements for translating SOC requirements into specifications and begins the study of optimizing among design alternatives.

2.2 CYCLE TIME

The notion of time receives considerable attention from processor designers. It is the basic measure of performance; however, breaking actions into cycles and reducing both cycle count and cycle time are important but inexact sciences. The way actions are partitioned into cycles is important. A common problem is having unanticipated "extra" cycles required by a basic action such as a cache miss.

Overall, there is only a limited theoretical basis for cycle selection and the resultant partitioning of instruction execution into cycles. Much design is done on a pragmatic basis. In this section, we look at some techniques for instruction partitioning, that is, techniques for breaking up the instruction execution time into manageable, fixed-time cycles.

In a pipelined processor, data flow through stages much as items flow on an assembly line. At the end of each stage, a result is passed on to a subsequent stage and new data enter. Within limits, the shorter the cycle time, the more productive the pipeline. The partitioning process has its own overhead, however, and very short cycle times become dominated by this overhead. Simple cycle time models can optimize the number of pipeline stages.

THE PIPELINED PROCESSOR

At one time, the concept of pipelining in a processor was treated as an advanced processor design technique. For the past several decades, however, pipelining has been an integral part of any processor or, indeed, controller design. It is a technique that has become a basic consideration in defining cycle time and execution time in a processor or system. The trade-off between cycle time and the number of pipeline stages is treated in the section on the optimum pipeline.

2.2.1 Defining a Cycle

A cycle (of the clock) is the basic time unit for processing information. In a synchronous system, the clock rate is a fixed value and the cycle time is determined by finding the maximum time needed to accomplish a frequent operation in the machine, such as an add or a register data transfer. This time must be sufficient for data to be stored into a specified destination register (Figure 2.3). Less frequent operations that require more time to complete require multiple cycles.

Figure 2.3 Possible sequence of actions within a cycle: the control lines become active, data pass to the ALU, the result is routed to and stored in the destination register, and the sample signal arrives subject to clock skew.
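A minimal sketch of this calculation follows. The individual path delays, setup time, and skew allowance are assumed values for a hypothetical datapath, not measurements from any real process.

    # Cycle time set by the slowest frequent operation (see Figure 2.3).
    # All delays in nanoseconds; every value is hypothetical.

    op_delay_ns = {
        "register-to-register transfer": 0.35,
        "ALU add": 0.70,
    }
    setup_ns = 0.10  # time for data to be stored into the destination register
    skew_ns = 0.10   # allowance for uncontrolled clock skew

    cycle_time_ns = max(op_delay_ns.values()) + setup_ns + skew_ns
    print(f"minimum cycle time: {cycle_time_ns:.2f} ns "
          f"(about {1.0 / cycle_time_ns:.2f} GHz)")

Here the ALU add is the critical frequent operation, so the cycle must cover its 0.70 ns path plus register setup and skew, giving 0.90 ns, or roughly a 1.1 GHz clock under these assumptions.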
This allows data from source registers to propagate through designated combinatorial logic into the destination register. Finally, after a suitable setup time, all registers are sampled by an edge or pulse produced by the clocking system. In a synchronous system, the cycle time is determined by the sum of the worst-case time for each step or action within the cycle. However, the clock itself may not arrive at the anticipated time (due to propagation or loading effects). We call the maximum deviation from the expected time of clock arrival the (uncontrolled) clock skew. In an asynchronous system, the cycle time is simply determined by the completion of an event or operation. A completion signal is generated, which then allows the next operation to begin. Asynchronous design is not generally used within pipelined processors because of the completion signal overhead and pipeline timing constraints. 2.2.2 Optimum Pipeline A basic optimization for the pipeline processor designer is the partitioning of the pipeline into concurrently operating segments. A greater number of seg- ments allow a higher maximum speedup. However, each new segment carries clocking overhead with it, which can adversely affect performance. If we ignore the problem of fitting actions into an integer number of cycles, we can derive an optimal cycle time, Δt, and hence the level of segmentation for a simple pipelined processor. Assume that the total time to execute an instruction without pipeline seg- ments is T nanoseconds (Figure 2.4a). The problem is to find the optimum number of segments S to allow clocking and pipelining. The ideal delay through a segment is T/S = Tseg. Associated with each segment is partitioning overhead. This clock overhead time C (in nanoseconds), includes clock skew and any register requirements for data setup and hold. www.TechnicalBooksPDF.com c02.indd 44 5/4/2011 10:35:10 AM CYCLE TIME 45 T (a) T/S (b) C T/S C (c) Clock overhead Clock overhead (c) plus skew (d) Result available Disruption S–1 Cycles delay Restart Figure 2.4 Optimal pipelining. (a) Unclocked instruction execution time, T. (b) T is partitioned into S segments. Each segment requires C clocking overhead. (c) Clocking overhead and its effect on cycle time, T/S. (d) Effect of a pipeline disruption (or a stall in the pipeline). Now, the actual cycle time (Figure 2.4c) of the pipelined processor is the ideal cycle time T/S plus the overhead: T Δt = + C.
