UEC509: Computer Architecture
UEC509: COMPUTER ARCHITECTURE
L T P Cr: 3 1 0 3.5

Course Objectives: To introduce the concept of parallelism used in modern RISC-based computers through the basic RISC-based DLX architecture; to enable students to understand and implement performance-enhancement methods such as memory optimization, multiprocessor configurations, pipelining, and interfacing of I/O structures using interrupts; and to develop the ability to evaluate the performance of these machines using the CPU time equation, the MIPS rating, and Amdahl's law.

Fundamentals of Computer Design: Historical perspective, computer types, Von Neumann architecture, Harvard architecture, functional units, basic operational concepts, bus structures, performance metrics, CISC and RISC architectures, control unit, hardwired and microprogrammed control units.

Instruction Set Principles: Classification of instruction set architectures, memory addressing, operations in the instruction set, type and size of operands, encoding an instruction set, program execution, role of registers, evaluation stacks and data buffers, the role of compilers, the DLX architecture, addressing modes of the DLX architecture, instruction format, DLX operations, effectiveness of DLX.

Pipelining and Parallelism: Idea of pipelining, the basic pipeline for DLX, pipeline hazards, data hazards, control hazards, design issues of pipeline implementation, multicycle operations, the MIPS pipeline, instruction-level parallelism, pipeline scheduling and loop unrolling, branch prediction, data, name, and control dependences, overcoming data hazards with dynamic scheduling, superscalar DLX architecture, the VLIW approach.

Memory Hierarchy Design: Introduction, cache memory, cache organization, write policies, reducing cache misses, cache associativity techniques, reducing cache miss penalty, reducing hit time, main memory technology, fast address translation, translation lookaside buffer, virtual memory, crosscutting issues in the design of memory hierarchies.

Multiprocessors: Characteristics of multiprocessor architectures, centralized shared-memory architectures, distributed shared-memory architectures, synchronization, models of memory consistency.

Input/Output Organization and Buses: Accessing I/O devices, interrupts, handling multiple devices, controlling device requests, exceptions, direct memory access, bus arbitration policies, synchronous and asynchronous buses, parallel port, serial port, standard I/O interfaces, Peripheral Component Interconnect (PCI) bus and its architecture, SCSI bus, Universal Serial Bus (USB) interface.

Course Learning Outcomes (CLOs): The students will be able to:
1. Understand and analyze a RISC-based processor.
2. Understand the concepts of parallelism and pipelining.
3. Evaluate the performance of a RISC-based machine with an enhancement applied and, as a design engineer, decide whether that enhancement is worth applying.
4. Understand memory hierarchy design and optimize it for best results.
5. Understand how input/output devices can be interfaced to a processor serially or in parallel, with their priority of access defined.

Text Books:
1. Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach, Elsevier (2009), 4th ed.
2. Hamacher, V. C., Vranesic, Z. G., and Zaky, S. G., Computer Organization, McGraw-Hill (2002), 2nd ed.

Reference Books:
1. Murdocca, M. J. and Heuring, V. P., Principles of Computer Architecture, Prentice Hall (1999), 3rd ed.
2. Ward, S. A. and Halstead, R. H., Computation Structures, MIT Press (1999), 2nd ed.

Evaluation Scheme: Will be announced later.

Computers are involved in personal, professional, institutional, and business activities, and human beings depend on computer systems. Computers save time, labor, and resources; they carry out 24/7 long-term automation of machines, robots, and other equipment, and collect data from sensors and other sources to help optimize operations. They also deliver computing services over the internet, including servers, storage, databases, networking, software, analytics, and intelligence. Application areas include personal use, scientific research, business, education, entertainment, banking, cloud computing, IoT, communication, engineering, medicine, games, accounting, and many more.

What is computer architecture, and why is a course on it needed?

The role of the computer designer is to incorporate the important attributes of a new machine design so as to achieve maximum performance and energy efficiency within the specified range of cost, power, size, and other constraints. To do so, the designer works at two levels:
1. Instruction set architecture (ISA)
2. Implementation: functional organization and hardware implementation

Instruction set architecture: the ISA includes the specification of an instruction set and the functional behavior of the hardware units that implement the instructions.
Implementation, functional organization: deals with the high-level aspects of the design, such as the memory system, memory interfacing, and the design of the CPU.
Implementation, hardware: the computer hardware consists of electronic circuits, RAM, ROM, magnetic and optical storage devices, displays, electromechanical devices, and communication facilities.
An analogy: computer architecture corresponds to the architecture of a building, while computer hardware corresponds to the structural design of that building.

The aims of this course are: to understand how a computing machine is designed, constructed, and operated; to understand the set of rules and methods describing the functionality, organization, and implementation of a computer system; to learn how to design and implement a computing machine that gives the best performance in terms of speed, efficiency, power consumption, and throughput; and to develop machines with optimized cost and size that are easy to use.

Computers can be classified into six general types on the basis of size, cost, computational capacity, and application:

Personal computers / desktop computers: used in homes, educational institutions, and business and engineering offices, primarily for dedicated individual use. The operational units (processing unit, storage unit, video output, audio output, keyboard) are clearly separate. Examples: desktop, laptop, and notebook computers.

Embedded computers: the operational units are packed into a compact size for a specific application or a family of similar applications. Embedded systems applications include industrial and home automation, appliances, telecommunication products, and vehicles.
Examples: mobile phones, the automobile industry, home appliances.

Workstations: single-user computer systems with a more powerful microprocessor and high-resolution graphics I/O capability. The size is comparable to a desktop computer but with more computational power. Used for engineering applications such as interactive design.

Enterprise systems / mainframes: used for business data analysis, with computational power and storage capacity higher than a workstation.

Servers: store databases and are capable of handling a large number of clients/users, with requests and responses transported over internet communication. They host large databases and provide information processing for government agencies and commercial organizations.

Supercomputers: the most expensive and physically the largest category of computers, employed for specialized applications that require a large amount of mathematical calculation, such as the highly demanding computations needed in weather forecasting, engineering design, aircraft design, and simulation.

Computer generations, based on technology and performance:
First generation (1940-1956), second generation (1956-1963), third generation (1964-1971), fourth generation (1972-1988), and fifth generation (1988 onwards).

1940-1956, first generation: vacuum tubes. These early computers used vacuum tubes as circuitry, magnetic drums for memory, and punch cards for input. ENIAC (Electronic Numerical Integrator and Calculator) was funded by the U.S. Army, built during World War II, and became operational in 1946. The machine was 100 feet long and 8.5 feet high, and used 18,000 vacuum tubes.
Advantages: processing time was in milliseconds; it was the fastest calculating device of its time, able to perform 5000 additions per second; machine language was used for programming.
Disadvantages: thousands of vacuum tubes were used; the tubes generated heat and consumed large amounts of power; the machines were expensive and large.
Examples: ENIAC, EDVAC, IBM 701, IBM 650.

1956-1963, second generation: transistors. The transistor, a semiconductor device invented at Bell Labs in 1947, replaced the vacuum tube. Transistors were cheaper, lower power, more compact, more reliable, and faster than first-generation circuitry. Magnetic cores were used for primary memory, with magnetic tape and magnetic disks as secondary storage.
Advantages: smaller in size; faster than the first generation; consumed less power; instructions were stored in memory; machine and assembly languages were used, along with early high-level languages (COBOL, FORTRAN).
Disadvantages: air conditioning was required; not used as personal systems; used mainly for commercial purposes.
Examples: IBM 7049 series, CDC 3600, IBM 1600 series.

1964-1971, third generation: integrated circuits. In an IC, small transistors are placed on a silicon chip; a single IC contains many transistors, resistors, and capacitors along with the associated circuitry.
Main features: smaller and faster; cheaper than the second generation; more reliable than the previous two generations; generated less heat; required less maintenance; used with I/O devices such as keyboards and monitors; used operating systems; produced in large volumes and widely popular.
Examples: IBM System/360 and /370, DEC PDP-8, UNIVAC 9000.

1972-1988, fourth generation: microprocessor-based computers, introduced around 1971.
More than one hundred thousand transistors could be fabricated on a single chip. Intel introduced the 4004, a four-bit microprocessor, followed by the 8008, 8085, 8086, and many more.
Features: microprocessors based on VLSI technology; millions of transistors on a single chip; small in size and portable; high speed; accurate and reliable; large memories can be interfaced.
Examples: the Intel processor series, AMD processors, Motorola processors.

1988 onwards, fifth generation: used for artificial-intelligence-based applications such as voice, face, and other biometric recognition. ULSI microprocessors fabricated at nanometer technology nodes; high-level languages such as C, C++, Java, and .NET.
Features: faster and cheaper; intelligent computers; parallel processing; elements of reasoning and self-analysis.
Examples: desktops, laptops, notebooks, ultrabooks, chromebooks.

Functional units of a computer:
- Memory: primary memory, cache, secondary and optical memory.
- Input devices: keyboard, touchpad, mouse, scanner, microphone, camera, joystick, trackball.
- Processing unit: performs arithmetic and logic operations and generates timing and control signals.
- Output devices: monitor, printer.
All units are interfaced through address, data, and control wires called buses.

Processing unit (CPU): performs computing operations, i.e. arithmetic, logic, and control operations, and executes the information stored in memory.
I/O (input/output) devices: provide a means of communicating with the CPU.
Memory: stores data temporarily or permanently.
1. RAM (Random Access Memory): temporary storage for the programs the computer is running; the data is lost when the computer is switched off.
2. ROM (Read Only Memory): contains programs and information essential to the operation of the computer; the information cannot be changed by the user and is not lost when power is off, so it is called non-volatile memory.

Primary memory (main memory): fast memory used to hold programs. Memory consists of one-bit storage cells that can be read or written individually; data is stored in fixed-size units called words. The word length of a computer may be 16, 32, or 64 bits, and each memory location must have its own address.
Cache memory: smaller, fastest, and nearest to the processor; it holds program sections recently executed or expected to be repeated many times by the processor.
Secondary storage: in addition to primary memory, a less expensive, permanent secondary storage is used to hold large data sets and programs. It is slower than primary memory. Examples: magnetic disks, hard disks, optical disks (DVD and CD), and flash memory devices.

The CPU is connected to memory and I/O through strips of wire called buses, which carry information from one place to another. There are three buses: the address bus, the data bus, and the control bus.
Address bus: used by the processor to identify an I/O device or memory location; each assigned address must be unique. The processor places the address on the address lines and a decoding circuit selects the corresponding device. More address lines mean that more locations or devices can be interfaced with the CPU: 2^m = n, where m is the number of address lines and n is the number of addressable locations (for example, 16 address lines can address 2^16 = 65,536 locations). The address bus is unidirectional.
Data bus: data is transferred or received over the data bus, which indicates the data-handling capacity of the CPU. A wider data bus means a more expensive CPU and computer. The data bus is bidirectional.
Control bus: used to decide the direction of data flow, i.e. whether data is being transferred or received.
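A quick numeric check of the 2^m = n relationship above; the chosen bus widths are only illustrative:

```python
# Number of addressable locations n for an m-bit address bus: n = 2**m.
for m in (16, 20, 32):
    n = 2 ** m
    print(f"{m} address lines -> {n:,} addressable locations")

# Expected output:
# 16 address lines -> 65,536 addressable locations
# 20 address lines -> 1,048,576 addressable locations
# 32 address lines -> 4,294,967,296 addressable locations
```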
The architecture of a computer refers to the arrangement of the CPU with program and data memory. Two types of architecture are used to interface the memory units with the processor: the Von Neumann architecture and the Harvard architecture.

Von Neumann architecture: proposed by John von Neumann to define the structure, layout, interaction, cooperation, implementation, and activity of the whole computer as a system. Its features are:
- A memory, an ALU, a control unit, and input and output devices.
- The CPU and memory are connected through a single set of address and data buses.
- Slower execution speed: there is no parallel processing of the program, so only one item of information can be accessed at a time.

Harvard architecture: Harvard University introduced a slightly different architecture in 1947, in which the memories for data and instructions are interfaced separately through different sets of data and address buses. Its features are:
- Parallel access to data and instructions.
- The two memories can use different cell sizes.
- Higher execution speed.
- More complex and costly; development of the control unit is expensive and needs more time.

The control unit generates the control signals that sequence the different operations of the CPU. Two techniques are used to implement it: the hardwired control unit and the microprogrammed control unit.
Hardwired control unit: implemented as a logic circuit (gates, flip-flops, decoders, etc.). This organization becomes very complicated for large control units.
Microprogrammed control unit: implemented using a programming approach; a sequence of micro-operations is carried out by executing programs consisting of micro-instructions.

Parameter                  Hardwired                              Microprogrammed
Speed                      Fast                                   Slow
Cost                       More costly                            Cheaper
Flexibility                Not flexible for new specifications    More flexible for new specifications
Complex instruction sets   Difficult to handle                    Easier to handle
Decoding                   Complex decoding/sequencing logic      Easier decoding/sequencing logic
Applications               RISC microprocessors                   CISC microprocessors
Instruction set size       Small                                  Large
Control memory             Not required                           Required
Chip area required         Less                                   More

Instruction: a command used to perform an operation.
Program: a list of instructions executed in sequence to perform a given task; the instructions are stored in program memory, while the operands (data) are stored in data memory.

The following steps are required to execute a program:
1) Fetch: the opcodes of the instructions are loaded from memory into the processor, one by one.
2) Decode: determine the type of operation and the type and location of the operand data.
3) Read the operands from memory into the processor, if required.
4) Execute: perform the required operation in the execution unit of the CPU.
5) Write back the result from the processor to memory, if required.

Example:
LOAD R1, LOC
ADD R0, R1, R2
STORE R0, LOC1

The CPU is interfaced with memory through its bus interface unit (BIU). The ALU works with a set of general-purpose registers, while the control unit and BIU use special-purpose registers:
- IR: holds the instruction currently being executed.
- PC: holds the address (memory location) of the next instruction to be fetched and executed.
- MAR: holds the address to be accessed.
- MDR: holds the content read from, or to be written to, the MAR address.
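A minimal sketch of the fetch-decode-execute steps above, applied to the three-instruction example; the tuple encoding, memory labels, and initial register values are invented for illustration and are not a real DLX encoding:

```python
# Toy fetch-decode-execute loop for: LOAD R1, LOC ; ADD R0, R1, R2 ; STORE R0, LOC1.
# Instructions are stored as tuples instead of binary opcodes.

data_mem = {"LOC": 7, "LOC1": 0}            # data memory (label -> value)
prog_mem = [                                 # program memory
    ("LOAD", "R1", "LOC"),                   # R1 <- Mem[LOC]
    ("ADD", "R0", "R1", "R2"),               # R0 <- R1 + R2
    ("STORE", "R0", "LOC1"),                 # Mem[LOC1] <- R0
]
regs = {"R0": 0, "R1": 0, "R2": 5}           # general-purpose registers

pc = 0                                       # program counter
while pc < len(prog_mem):
    ir = prog_mem[pc]                        # fetch: IR <- Mem[PC]
    pc += 1                                  # advance to the next instruction
    op = ir[0]                               # decode: operation and operands
    if op == "LOAD":                         # operand read from data memory
        _, rd, addr = ir
        regs[rd] = data_mem[addr]
    elif op == "ADD":                        # execute in the ALU
        _, rd, rs1, rs2 = ir
        regs[rd] = regs[rs1] + regs[rs2]
    elif op == "STORE":                      # write back the result to memory
        _, rs, addr = ir
        data_mem[addr] = regs[rs]

print(regs["R0"], data_mem["LOC1"])          # both print 12 (7 + 5)
```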
The functional capability of a CPU is determined by the instruction set architecture for which it was designed. From the instruction set architecture point of view, microprocessor chips can be classified into two categories: Complex Instruction Set Computers (CISC) and Reduced Instruction Set Computers (RISC).

CISC                                            RISC
1) Instructions operate directly on memory      1) Instructions operate on registers
2) Less RAM required to store instructions      2) More RAM required to store instructions
3) Pipelining is harder                         3) Pipelining is easier
4) More emphasis on hardware                    4) More emphasis on software
5) Microprogrammed control                      5) Hardwired control
6) Complex addressing modes                     6) Simple addressing modes
7) Large number of instructions                 7) Small number of instructions
8) Less work for the compiler                   8) More work for the compiler

Example: multiply two numbers in memory, source code a = a*b (where a and b are memory locations, a = 2:3 and b = 5:2).
CISC version:
MULT 2:3, 5:2
RISC version:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A

Performance evaluation (based on Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach, Elsevier).

How can we say one computer is faster than another? From the user's point of view, what matters is the time to complete one task; from the manager's point of view, the number of tasks completed per unit time.
Response time: how long it takes to do a task, e.g. 10 s or 15 s per task.
Throughput: total work done per unit time, e.g. tasks or transactions per unit time.
How are response time and throughput affected by replacing the processor with a faster version, or by adding more processors?

Define Performance = 1 / Execution time. "X is n times faster than Y" means
  program execution time on Y / program execution time on X = performance of X / performance of Y = n
Example: a program takes 10 s on X and 15 s on Y.
  Execution time on Y / Execution time on X = 15 s / 10 s = 1.5
  Performance of X / Performance of Y = 0.1 / 0.0667 = 1.5
So X is 1.5 times faster than Y.

Execution time can be defined as either elapsed time or CPU time.
Elapsed time: the latency to complete a task, including disk accesses, memory accesses, I/O activity, OS overhead, etc. The execution time seen by the user is elapsed time, not CPU time.
CPU time: the time the CPU spends computing, not waiting for I/O. CPU time spent in the program is user CPU time; CPU time spent in the OS performing tasks requested by the program is system CPU time.
For example, the report "90.7u 12.9s 2:39 65%" means 90.7 s of user CPU time, 12.9 s of system CPU time, 2 minutes 39 seconds (159 s) of elapsed time, and a CPU-time fraction of (90.7 + 12.9)/159 = 0.65, i.e. 65%.

                    Computer A   Computer B   Computer C
Program P1 (secs)         1           10           20
Program P2 (secs)      1000          100           20
Total time (secs)      1001          110           40
AM                    500.5           55           20

A is 10 times faster than B for program P1, and B is 10 times faster than A for program P2. A is 20 times faster than C for program P1, and C is 50 times faster than A for program P2. B is 2 times faster than C for program P1, and C is 5 times faster than B for program P2.

An average of the execution times that tracks total execution time is the arithmetic mean (AM):
  AM = (1/n) x sum over i = 1..n of Time_i
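The totals, arithmetic means, and "n times faster" ratios in the table above can be reproduced with a short script; the dictionary layout and machine labels are just for illustration:

```python
# Execution times (seconds) of programs P1 and P2 on machines A, B, C.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

for name, (p1, p2) in times.items():
    total = p1 + p2
    am = total / 2                      # arithmetic mean of the two times
    print(f"{name}: total = {total:>5} s, AM = {am:>6}")

# Relative performance = ratio of execution times (slower / faster).
print("A vs B on P1:", times["B"][0] / times["A"][0])   # 10.0 (A is 10x faster)
print("B vs A on P2:", times["A"][1] / times["B"][1])   # 10.0 (B is 10x faster)
print("C vs A on P2:", times["A"][1] / times["C"][1])   # 50.0 (C is 50x faster)
```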
What is the proper mixture of programs for the workload? Are programs P1 and P2 actually run equally often? For an unequal mix of programs in the workload, a weighting factor w_i is assigned to each program, and the performance of the workload is obtained by summing the products of the weighting factors and the execution times. This is called the weighted arithmetic mean:
  Weighted AM = sum over i = 1..n of Weight_i x Time_i

             A       B       C      W1     W2      W3
P1 (sec.)    1       10      20     0.5    0.909   0.999
P2 (sec.)    1000    100     20     0.5    0.091   0.001
AM(1)        500.5   55      20
AM(2)        91.91   18.19   20
AM(3)        2       10.09   20

(AM(1), AM(2), and AM(3) are the weighted means obtained with weight sets W1, W2, and W3 respectively.)

A second approach is to normalize execution times to a reference machine and then take the average of the normalized execution times; the SPEC (Standard Performance Evaluation Corporation) benchmarks use this approach. Performance of programs can then be predicted by simply multiplying normalized execution times. The average normalized execution time can be expressed as either an arithmetic or a geometric mean. The formula for the geometric mean is
  Geometric Mean = (product over i = 1..n of Execution time ratio_i)^(1/n)

             Normalized to A         Normalized to B         Normalized to C
             A      B      C         A      B      C         A       B      C
P1           1      10     20        0.1    1      2         0.05    0.5    1
P2           1      0.1    0.02      10     1      0.2       50      5      1
AM           1      5.05   10.01     5.05   1      1.1       25.03   2.75   1
GM           1      1      0.63      1      1      0.63      1.58    1.58   1
Total time   1      0.11   0.04      9.1    1      0.36      25.03   2.75   1

Evaluating these parameters helps to improve the performance of a computer during design and analysis. Make the common case fast, i.e. favor the frequent case over the infrequent case: to improve performance, improve the frequent case (which is usually simpler) rather than the rare case. For example, when adding two numbers in the ALU, the frequent case is no overflow and the rare case is overflow; optimizing the common case improves ALU performance, and the fact that the ALU slows down when overflow occurs matters little because overflow is rare. Amdahl's law is used to quantify the improvement.

Amdahl's law quantifies the performance of a machine with an enhancement relative to the original machine. It states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. The speedup obtained from an enhancement is
  Speedup = Performance using the enhancement for the entire task / Performance without the enhancement
          = Execution time without the enhancement / Execution time using the enhancement for the entire task

Fraction_enhanced: the fraction of the total computation time of the original machine that can benefit from the enhancement (always less than 1).
Speedup_enhanced: the improvement gained in the enhanced execution mode, i.e. how much faster the task would run if the enhanced mode could be used for the entire program.

The execution time of the new machine after the enhancement is
  Execution time_new = Execution time_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

The performance improvement gained from an enhancement is therefore limited by the fraction of time the enhancement can be used:
  Speedup_overall = Execution time_old / Execution time_new
                  = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
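A small helper implementing the Speedup_overall expression above; the function name and the rounding are my own, and the sample calls anticipate the worked examples that follow:

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's law: speedup of the whole task when only fraction_enhanced of the
    original execution time benefits from a speedup of speedup_enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Enhancement is 10x faster but usable only 40% of the time (first example below):
print(round(overall_speedup(0.4, 10), 2))   # 1.56

# FPSQR hardware (20% of time, 10x) vs. speeding up all FP instructions (50%, 2x):
print(round(overall_speedup(0.2, 10), 2))   # 1.22
print(round(overall_speedup(0.5, 2), 2))    # 1.33
```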
Example: Suppose that we are considering an enhancement that runs 10 times faster than the original machine but is usable only 40% of the time. What is the overall speedup gained by incorporating the enhancement?
  Speedup_overall = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
                  = 1 / (0.6 + 0.4/10) = 1 / 0.64 = 1.5625, i.e. about 1.56

Example: A machine needs to implement a floating-point (FP) square root for its graphics processor. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical benchmark on this machine. One proposal is to add FPSQR hardware that will speed up this operation by a factor of 10. The second alternative is to make all FP instructions run faster; FP instructions are responsible for a total of 50% of the execution time, and the design team believes it can make all FP instructions run two times faster with the same effort as required for the fast square root. Compare these two design alternatives.

Solution:
Case 1, dedicated FPSQR hardware: Fraction_enhanced = 0.2, Speedup_enhanced = 10
  Speedup_FPSQR = 1 / (1 - 0.2 + 0.2/10) = 1 / 0.82 = 1.22
Case 2, speeding up all FP instructions: Fraction_enhanced = 0.5, Speedup_enhanced = 2
  Speedup_FP = 1 / (1 - 0.5 + 0.5/2) = 1 / 0.75 = 1.33
Speeding up all FP instructions gives the larger overall speedup.
Q. When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a complicated fast floating-point unit, that unit takes space, and something else might have to be moved farther from the middle of the chip to accommodate it, adding an extra cycle of delay to reach that unit. The basic Amdahl's law equation does not take this trade-off into account.
a. If the new fast floating-point unit speeds up floating-point operations by, on average, 2x, and floating-point operations take 20% of the original program's execution time, what is the overall speedup (ignoring the penalty to any other instructions)?
b. Now assume that speeding up the floating-point unit slowed down data cache accesses, resulting in a 1.5x slowdown (or 2/3 speedup). Data cache accesses consume 10% of the execution time. What is the overall speedup now?
Answers:
a. 1 / (0.8 + 0.20/2) = 1.11
b. 1 / (0.7 + 0.20/2 + 0.10 x 1.5) = 1.05

Q. Your company has just bought a new Intel Core i5 dual-core processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual core, but their resource requirements are not equal. The first application requires 80% of the resources and the other only 20%. Assume that when you parallelize a portion of a program, the speedup for that portion is 2.
a. Given that 40% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation?
b. Given that 99% of the second application is parallelizable, how much speedup would that application observe if run in isolation?
c. Given that 40% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it?
d. Given that 99% of the second application is parallelizable, how much overall system speedup would you observe if you parallelized it?
Answers:
a. 1 / (0.6 + 0.4/2) = 1.25
b. 1 / (0.01 + 0.99/2) = 1.98
c. 1 / (0.2 + 0.8 x 0.6 + 0.8 x 0.4/2) = 1 / (0.2 + 0.48 + 0.16) = 1.19
d. 1 / (0.8 + 0.2 x 0.01 + 0.2 x 0.99/2) = 1 / (0.8 + 0.002 + 0.099) = 1.11
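A quick numeric check of the four answers above, again using the Amdahl's-law expression; the helper and variable names are mine:

```python
def speedup(parallel_fraction: float, factor: float = 2.0) -> float:
    """Speedup of one application when parallel_fraction of it runs factor times faster."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / factor)

a = speedup(0.40)                                     # first app in isolation
b = speedup(0.99)                                     # second app in isolation
# Whole-system speedup: weight each application's time by its resource share.
c = 1.0 / (0.20 + 0.80 * (0.60 + 0.40 / 2.0))         # only the first app parallelized
d = 1.0 / (0.80 + 0.20 * (0.01 + 0.99 / 2.0))         # only the second app parallelized
print(round(a, 2), round(b, 2), round(c, 2), round(d, 2))   # 1.25 1.98 1.19 1.11
```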
The CPU time equation. Most computers use a clock signal running at a constant frequency; these discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles.

CPU time for a program = CPU clock cycles for the program x clock cycle time
                       = CPU clock cycles for the program / clock rate

Relating the clock cycles to the instruction count of a program:
  CPU clock cycles for the program = IC x CPI
  CPI = CPU clock cycles for the program / Instruction count
where IC (instruction count) is the number of instructions executed by the program and CPI is the average number of clock cycles per instruction.

Now the CPU time for a program is given as
  CPU time = IC x CPI x clock cycle time = (IC x CPI) / clock rate

When different instruction classes take different numbers of cycles:
  CPU clock cycles = sum over i = 1..n of (CPI_i x IC_i)
where IC_i is the number of times instruction class i is executed in the program and CPI_i is the average number of clock cycles for that class. Then
  CPU time = [ sum over i of (CPI_i x IC_i) ] x clock cycle time
  Overall CPI = [ sum over i of (CPI_i x IC_i) ] / Instruction count = sum over i of CPI_i x (IC_i / Instruction count)

Example: for a program of 1000 instructions in which 200 instructions take 2 cycles each, 300 take 3 cycles, and 500 take 1 cycle, running on a 1 GHz clock:
  Overall CPI = (200x2 + 300x3 + 500x1) / 1000 = 1.8
  Execution time = (IC x CPI) / clock rate = (1000 x 1.8) / (1 x 10^9) = 1.8 microseconds

MIPS (Million Instructions Per Second) is the rate of instruction execution per unit time:
  MIPS = Instruction count (IC) / (Execution time x 10^6)
Since CPU time = (IC x CPI) / clock rate, this can also be written as
  MIPS = clock rate / (CPI x 10^6)
MIPS is inversely proportional to execution time and proportional to performance: the faster the machine, the larger the MIPS rating. It depends on the instruction set, however, so it cannot be used to compare machines with different instruction sets.
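The overall-CPI example and the MIPS rating can be computed directly from the instruction mix; the mix below is the one implied by the worked numbers above, and the 1 GHz clock is taken from the same example:

```python
# Instruction mix implied by the worked example: (count, CPI) per instruction class.
mix = [(200, 2), (300, 3), (500, 1)]
clock_rate = 1e9                                   # 1 GHz

ic = sum(count for count, _ in mix)                # total instruction count = 1000
cycles = sum(count * cpi for count, cpi in mix)    # total clock cycles = 1800
overall_cpi = cycles / ic                          # 1.8
cpu_time = ic * overall_cpi / clock_rate           # 1.8e-6 s = 1.8 microseconds
mips = ic / (cpu_time * 1e6)                       # same as clock_rate / (CPI * 1e6)

print(overall_cpi, cpu_time, round(mips, 1))       # 1.8 1.8e-06 555.6
```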
Instruction set architecture (ISA): the boundary between software and hardware, visible to the programmer or compiler writer. ISAs can be classified by the complexity of their instructions (CISC vs RISC) and by where operands are kept other than memory:
- Stack architecture: operands are implicitly on top of the stack.
- Accumulator architecture: one operand is implicitly in the accumulator.
- General-purpose register architecture: only explicit operands, held either in registers or in memory locations. This class subdivides into register-memory (memory can be accessed as part of any instruction), register-register or load-store (memory can only be accessed with load and store instructions), and memory-memory (operands are in memory only).

Characteristics of the DLX architecture:
- RISC-based processor with a simple load-store instruction set; supports pipelining.
- Registers: 32 general-purpose registers (GPRs) R0-R31, with R0 fixed to zero and R31 used by the JALR instruction to store the return address; 32 floating-point registers (FPRs) F0-F31, usable as 32-bit single-precision registers or, in pairs, as 64-bit double-precision registers.
- Data types: 8-bit bytes, 16-bit half words, 32-bit words; floating-point data as 32-bit single precision and 64-bit double precision.
- Memory: byte addressable, with 32-bit addresses; data is stored in Big Endian mode.

Exercise: determine the decimal value of the following 32 bits in IEEE single-precision format:
  0 10000011 11000……….0
(Assuming the elided mantissa bits are all zero: sign = 0, exponent = binary 10000011 = 131, so the unbiased exponent is 131 - 127 = 4; the significand is binary 1.11 = 1.75; the value is therefore 1.75 x 2^4 = 28.)

Memory addressing: Big Endian and Little Endian formats of storing data in memory.
- Big Endian: the most significant byte of the data is stored at the lowest address.
- Little Endian: the least significant byte is stored at the lowest address.
Example: for the hexadecimal data word 01234567 stored at byte addresses 0 to 3,
  Big Endian:    address 0 -> 01, 1 -> 23, 2 -> 45, 3 -> 67
  Little Endian: address 0 -> 67, 1 -> 45, 2 -> 23, 3 -> 01

DLX instructions are 32 bits long and can be encoded in three formats; the I-type and R-type formats are described below.

I-type (Immediate): used for all immediate operations, loads, stores, conditional branch instructions, and JR and JALR (with RD = 0, i.e. unused). The immediate field is sign-extended.

ADDI R1, R2, #5
  RS1 = R2, RD = R1, immediate = signext(5) = 0000000000000101
  Regs[R1] ← Regs[R2] + signext(5), i.e. Regs[R1] ← Regs[R2] + 5
LW R3, 6(R2)
  RS1 = R2, RD = R3, immediate = signext(6)
  Regs[R3] ← Mem[signext(6) + Regs[R2]], i.e. Regs[R3] ← Mem[6 + Regs[R2]]
SW 7(R4), R3
  RS1 = R4, RD = R3, immediate = signext(7)
  Mem[signext(7) + Regs[R4]] ← Regs[R3], i.e. Mem[7 + Regs[R4]] ← Regs[R3]
BEQZ R4, Addr
  RS1 = R4, immediate = signext(Addr)
  If Regs[R4] == 0 then PC ← PC + 4 + signext(Addr)
JR R3
  RS1 = R3
  PC ← Regs[R3]
JALR R3
  RS1 = R3
  Regs[R31] ← PC + 4; PC ← Regs[R3]

Encoding example: LW R5, 4100(R0) is an I-type instruction encoded as 0x8C051004:
  1000 1100 0000 0101 0001 0000 0000 0100
Breaking this word into the individual fields of the I format:
  100011 00000 00101 0001000000000100
  6-bit opcode = 100011, indicating the LW operation
  RS1 = 00000, i.e. R0
  RD = 00101, i.e. R5
  immediate = 0001000000000100, i.e. 4100

R-type (Register): used for all register-type ALU operations, reads and writes of special registers, and MOVE operations.

ADD R1, R2, R3 (SA unused)
  RS1 = R2, RS2 = R3, RD = R1
  Regs[R1] ← Regs[R2] + Regs[R3]
MOVFP2I R1, F1 (RS2, SA unused)
  RS1 = F1, RD = R1
  Regs[R1] ← Regs[F1]
SLL R1, R2, R3
  RD = R1, RS1 = R2, RS2 = R3
  Regs[R1] ← Regs[R2] << Regs[R3]
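The LW R5, 4100(R0) encoding worked out above can be checked by extracting the I-type fields from the 32-bit word; the field widths follow the 6/5/5/16 layout shown, and the helper name is mine:

```python
def decode_i_type(word: int) -> dict:
    """Split a 32-bit I-type instruction into opcode, rs1, rd, and immediate fields
    (6 + 5 + 5 + 16 bits, most significant field first)."""
    return {
        "opcode": (word >> 26) & 0x3F,      # bits 31..26
        "rs1":    (word >> 21) & 0x1F,      # bits 25..21
        "rd":     (word >> 16) & 0x1F,      # bits 20..16
        "imm":    word & 0xFFFF,            # bits 15..0
    }

fields = decode_i_type(0x8C051004)          # LW R5, 4100(R0)
print(fields)
# {'opcode': 35, 'rs1': 0, 'rd': 5, 'imm': 4100}
# opcode 35 decimal = 100011 binary (the LW opcode), RS1 = R0, RD = R5, immediate = 4100
```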