Computer Organization PDF
Document Details
University of Moratuwa
Summary
This document provides an overview of computer organization and architecture. It covers fundamental concepts such as computer operations, basic computer organization, and various hardware components including the CPU, RAM, and buses.
Full Transcript
STRUCTURE AND ORGANIZATION OF COMPUTER SYSTEMS

Computer Operations
A computer is an electronic device capable of performing arithmetic calculations quickly, and it serves different purposes for different people. It is a machine capable of solving problems and manipulating data: it accepts data, processes it with arithmetic and logical operations, and produces the desired output. At times it also acts as a monitoring and controlling device.

A computer can be seen as a device that transforms data:
- Accepts data
- Stores data
- Processes data as desired
- Retrieves the stored data
- Prints the result in the desired format

Basic Computer Organization
Every computer contains five essential units:
1. Arithmetic and Logic Unit (ALU)
2. Memory
3. Control Unit
4. Input Unit
5. Output Unit

Outline of the Components
Input: the process of entering data and programs into the computer system. The computer takes raw data as input and produces processed data, so the input unit carries data from the user to the computer in an organized manner for processing.

Storage (Memory): the process of saving data and instructions. Data has to be fed into the system before actual processing starts. The storage unit provides space for data and instructions:
- All data and instructions are stored here before and after processing.
- Intermediate results of processing are also stored here.

Processing: the task of performing arithmetic and logical operations. The Central Processing Unit (CPU) takes data and instructions from the storage unit and performs calculations based on the instructions given and the type of data provided; the results are then sent back to the storage unit.

Output: the process of producing results from the data to obtain useful information. The output produced after processing is kept inside the computer before being given to the user in human-readable form, and it may also be stored for further processing.

Control: the manner in which instructions are executed and the above operations are performed. The control unit directs all operations (input, processing and output) and takes care of the step-by-step processing of all operations inside the computer.

Computing Systems
Computers have two kinds of subsystems:
- Hardware: the physical devices (CPU, memory, bus, storage devices, ...)
- Software: the programs it runs (operating system, applications, utilities, ...)

Hardware: CPU
The Central Processing Unit (CPU) is the "brain" of the machine; it contains the circuitry that performs the arithmetic and logical machine-language statements. Its speed is measured (roughly) in megahertz or gigahertz (millions or billions of clock ticks per second). Examples: Intel Pentium, AMD K6, Motorola PowerPC, Sun SPARC.

Hardware: RAM
Random Access Memory (RAM) is "main" memory, which is fast but volatile, analogous to a person's short-term memory. It consists of many tiny "on-off" switches: by convention "on" is represented by 1 and "off" by 0. Each switch is called a binary digit, or bit; 8 bits make a byte; 2^10 bytes = 1024 bytes is a kilobyte (1 KB); 2^20 bytes is a megabyte (1 MB).
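The powers-of-two units above are easy to reproduce in code. A minimal C sketch (the variable names are illustrative only):

    #include <stdio.h>

    int main(void) {
        unsigned long kilobyte = 1UL << 10;   /* 2^10 = 1024 bytes      */
        unsigned long megabyte = 1UL << 20;   /* 2^20 = 1,048,576 bytes */
        unsigned long gigabyte = 1UL << 30;   /* 2^30 bytes             */
        printf("1 KB = %lu bytes\n", kilobyte);
        printf("1 MB = %lu bytes\n", megabyte);
        printf("1 GB = %lu bytes\n", gigabyte);
        return 0;
    }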
Hardware: Secondary Memory (Disk)
Secondary memory provides stable storage using magnetic or optical media, analogous to a person's long-term memory. It has larger capacity but is slower to access than RAM. Examples: floppy disk (measured in kilobytes), hard disk (measured in gigabytes, 2^30 bytes), CD-ROM (measured in megabytes), ...

Hardware: The Bus
The bus connects the CPU to the other hardware devices, analogous to a person's spinal cord. Its speed is measured in megahertz (like the CPU), but it is typically much slower than the CPU, making it the bottleneck in most of today's PCs.

Interconnected Components of a Computer
- The CPU (ALU, control unit, registers)
- The memory subsystem (stored data)
- The I/O subsystem (I/O devices)
These components are connected through buses. A bus is physically a set of wires to which the components of the computer are connected. There are three buses: the address bus, the data bus and the control bus.

Address Bus
Used to specify the address of the memory location to access. Each I/O device (monitor, mouse, CD-ROM, ...) also has a unique address. The CPU reads data or instructions from other locations by placing the address of the location on this bus; the CPU always drives the address bus and never reads from it.

Data Bus and Control Bus
- Data bus: the actual data is transferred via the data bus. When the CPU sends an address to memory, the memory returns the data to the CPU via the data bus.
- Control bus: a collection of individual control signals, e.g. whether the CPU will read or write data, whether the CPU is accessing memory or an I/O device, and whether memory or I/O is ready to transfer data.

I/O Bus (Local Bus)
In today's computers the I/O controller has an extra bus called the I/O bus, used to access all other I/O devices connected to the system. Example: the PCI bus.

Hardware: Cache
Although accessing RAM is faster than accessing secondary memory, it is still quite slow relative to the rate at which the CPU runs. To circumvent this problem, most systems add a fast cache memory close to the CPU to store recently used instructions and data (assumption: since such instructions/data were needed recently, they will be needed again in the near future).

Hardware: Summary
Putting the pieces together: programs are stored long-term in secondary memory and loaded into main memory to run, from which the CPU retrieves and executes their statements.

Software: OS
The operating system (OS) is loaded from secondary memory into main memory when the computer is turned on, and remains in memory until the computer is turned off. The OS acts as the "manager" of the system, making sure that each hardware device interacts smoothly with the others. It also provides the interface by which the user interacts with the computer, and awaits user input if no application is running. Examples: MacOS, Windows, UNIX, Linux, Solaris, ...

Software: Applications
Applications are non-OS programs that perform some useful task, including word processors, spreadsheets, databases, web browsers, C++ compilers, ... Example C++ compilers/environments:
- CodeWarrior (MacOS, Windows, Solaris)
- GNU C++ (UNIX, Linux)
- Turbo/Borland C++ (Windows)
- Visual C++ (Windows)

Putting It All Together
Programs and applications that are not running are stored on disk. When you launch a program, the OS controls the CPU and loads the program from disk to RAM; the OS then relinquishes the CPU to the program, which begins to run.

Structure and Function
A hierarchical view is necessary for the design and description of a complex system: the designer deals with one level of abstraction at a time, and at each level is concerned with structure and function.
- Structure is the way in which the components relate to each other.
- Function is the operation of the individual components as part of the structure.

Functions
Computer functions are:
- Data processing
- Data storage
- Data movement
- Control

Functional View
Typical operations, seen from the functional view of a computer:
- Data movement, e.g. keyboard to screen
- Storage, e.g. Internet download to disk
- Processing from/to storage, e.g. updating a bank statement
- Processing from storage to I/O, e.g. printing a bank statement

Structure - Top Level
Structure - The CPU
CPU organization: the CPU controls the computer; it fetches, decodes and executes instructions. The CPU has three internal sections: the register section, the ALU and the control unit.

Register Section
Includes a collection of registers and a bus. The registers of the processor's instruction set architecture are found in this section. There are also registers not accessible to the programmer, used to latch the address being accessed and as temporary storage.

Arithmetic/Logic Unit (ALU)
Performs most arithmetic and logical operations, retrieving and storing its operands via the register section of the CPU.

Memory Subsystem
Two types of memory:
- ROM (Read Only Memory): holds a program that is loaded into memory and cannot be changed, and it retains its data even without power. The ICs inside the PC form the ROM; the storage of program and data in ROM is permanent. The ROM stores some standard processing programs supplied by the manufacturer to operate the personal computer, and it can only be read by the CPU, not changed. The basic input/output program stored in ROM examines and initializes the various equipment attached to the PC when the machine is switched on. ROM is non-volatile memory.
- RAM (Random Access Memory): also called read/write memory. This type of memory can have a program loaded and then reloaded, and it loses its data when power is removed.

Different ROM Chips
- Masked ROM: programmed with data when fabricated; the data will not change once installed (hardwired).
- Programmable ROM (PROM): capable of being programmed by the user with a ROM programmer; not hardwired. It is not possible to modify or erase programs already stored in ROM, but you can store your own program in a PROM chip. Once the program is written it cannot be changed, and it remains intact even if power is switched off; programs written in PROM or ROM therefore cannot be erased or changed.
- Erasable PROM (EPROM): much like the PROM, but it can be programmed and then erased with light, overcoming the limitation of PROM and ROM. An EPROM chip can be programmed again and again: the information stored earlier is erased by exposing the chip to ultraviolet light for some time, after which the chip is reprogrammed using a special programming facility. While the EPROM is in use, the information can only be read.
- EEPROM: another form of EPROM that is reprogrammable electrically. EEPROM is user-modifiable read-only memory that can be erased and reprogrammed (written to) repeatedly through the application of a higher-than-normal electrical voltage, generated externally or internally in the case of modern EEPROMs. An EPROM usually must be removed from the device for erasing and programming, whereas an EEPROM can be programmed and erased in circuit. Flash memory is a non-volatile storage chip that can be electrically erased and reprogrammed.

Dynamic RAM (DRAM)
- Bits are stored as charge on leaky capacitors.
- The capacitors are charged and slowly leak, so they must be refreshed periodically to restore the original data (periodic refresh). Example: main computer RAM.

Static RAM (SRAM)
- Works much like a register: the contents stay valid and do not have to be refreshed.
- SRAM is faster than DRAM but costs more. Example: cache.
- About 6 MOSFET transistors per bit.

Synchronous Dynamic RAM (SDRAM)
- A DRAM that is synchronized with the system bus. Classic DRAM has an asynchronous interface, meaning it responds as quickly as possible to changes in its control inputs.
- SDRAM waits for a clock signal before responding to control inputs and is therefore synchronized with the computer's system bus.
- This allows the chip to have a more complex pattern of operation than an asynchronous DRAM, enabling higher speeds.

Double Data Rate SDRAM (DDR SDRAM)
- Compared to single data rate (SDR) SDRAM, the DDR SDRAM interface achieves higher transfer rates through stricter control of the timing of the electrical data and clock signals.
- Data is transferred on both the rising and falling edges of the clock signal.
- The name "double data rate" refers to the fact that a DDR SDRAM with a given clock frequency achieves nearly twice the bandwidth of SDR SDRAM running at the same clock frequency, due to this double pumping.

CPU DESIGN

Structure - The Control Unit

Sequential Logic
Combinational circuits have only inputs and outputs: the output is a function of the input only. Sequential circuits have a notion of state: the output is a function of the input and the previous state. Such a circuit is commonly referred to as a finite state machine (FSM). An FSM is distinguished from combinational logic in that the past history of inputs to the FSM influences its state and output. This is important for implementing memory circuits as well as control units in a computer.

Flip-Flop
The basic circuit for storing information in a digital machine is called a flip-flop. There are several fundamental types of flip-flops and many circuit designs, but they share two common characteristics:
- A flip-flop is a bistable device.
- It has two output signals, one of which is the complement of the other. Example: the RS flip-flop.

Registers
A single bit of information is stored in a flip-flop; a group of N flip-flops makes up an N-bit word. Registers are formed from a group of flip-flops arranged to hold and manipulate a data word using some common circuitry.
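To make the idea of a bistable, state-holding element concrete, here is a minimal C sketch that mimics the behaviour of an RS (set/reset) latch; the function names and the handling of the forbidden input are illustrative assumptions, not circuit-level detail from the notes:

    #include <stdio.h>

    /* One update step of an RS latch: the stored bit q only changes when
       S or R is asserted, which is what makes it a memory element. */
    static int rs_latch(int s, int r, int q) {
        if (s && r) return q;   /* forbidden input combination: keep q here for simplicity */
        if (s) return 1;        /* set   */
        if (r) return 0;        /* reset */
        return q;               /* hold the previous state */
    }

    int main(void) {
        int q = 0;
        q = rs_latch(1, 0, q);  printf("after set:   Q=%d\n", q);
        q = rs_latch(0, 0, q);  printf("after hold:  Q=%d\n", q);
        q = rs_latch(0, 1, q);  printf("after reset: Q=%d\n", q);
        return 0;
    }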
Computer Word Structure
The data and instruction words in computers tend to be organized systematically; in most early machines, instruction word = data word = memory word. An 8-bit word is called a byte, and 2^10 bytes = 1 KB, 2^10 KB = 1 MB, 2^10 MB = 1 GB. A number that identifies the location of a word in memory is called an address; it refers to a word location in memory and is always a binary number, although hexadecimal notation is commonly used to write it.

Von Neumann/Turing
The stored-program concept:
- Main memory storing programs and data
- ALU operating on binary data
- Control unit interpreting instructions from memory and executing them
- Input and output equipment operated by the control unit
The IAS machine at the Princeton Institute for Advanced Studies embodied this design and was completed in 1952.

The Stored Program Concept
Von Neumann's proposal was to store the program instructions right along with the data. This may sound trivial, but it represented a profound paradigm shift. The stored program concept was proposed decades ago; to this day, it is the fundamental architecture that fuels computers. Think about how amazing that is, given the short shelf life of computer products and technologies.

The Stored Program Concept and Its Implications
- Four key sub-components operate together to make the stored program concept work.
- The process that moves information through the sub-components is called the "fetch-execute" cycle.
- Unless otherwise indicated, program instructions are executed in sequential order.
- It is easy to access and modify data and programs.

Four Sub-Components
There are four sub-components in the von Neumann architecture: memory, input/output ("I/O"), the arithmetic-logic unit, and the control unit. While only four sub-components are called out, there is a fifth key player in this operation: a bus, a set of wires that connects the components together and over which data flows from one sub-component to another.

Structure of the Von Neumann Machine

Von Neumann Bottleneck
The shared bus between program memory and data memory leads to the von Neumann bottleneck: limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory. This limits the effective processing speed when the CPU has to perform minimal processing on large amounts of data.

Harvard Architecture
The name Harvard architecture comes from the Harvard Mark I relay-based computer. A Harvard architecture has physically separate signals and storage for program and data memory, so it is possible to access program memory and data memory simultaneously. Program memory is read-only and data memory is read/write, so it is impossible for the program contents to be modified by the program itself.

ALU
Performs arithmetic and logic operations on input data. It has a set of registers, including an accumulator register; the registers store the results of the arithmetic and logic operations.

Functional Parts of the ALU

ALU Operation
- The control unit receives an instruction (from the memory unit) specifying that a number stored in a particular memory location (address) is to be added to the number presently stored in the accumulator register.
- The number to be added is transferred from memory to the B register.
- The number in the B register and the number in the accumulator register are added together in the logic circuits (upon command from the control unit).
- The resulting sum is then sent to the accumulator to be stored.

Fetch/Execute Cycle
All computers can be summarized with just two basic components: (a) primary storage, or memory, and (b) the central processing unit, or CPU. The CPU is the "brains" of the computer; its function is to execute programs stored in memory. This is accomplished by the CPU fetching an instruction from memory, then executing the retrieved (fetched) instruction within the CPU before proceeding to fetch the next instruction from memory. The process continues until the program says to stop. A small sketch of this continuous fetch-execute loop follows after the IAS details below.

IAS - Details
- 1000 x 40-bit words, each holding a binary number or 2 x 20-bit instructions.
- Set of registers (storage in the CPU): Memory Buffer Register (MBR), Memory Address Register (MAR), Instruction Register (IR), Instruction Buffer Register (IBR), Program Counter (PC), Accumulator (ACC), Multiplier Quotient (MQ).

Structure of IAS - Detail
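To illustrate the fetch-decode-execute loop described above, here is a toy stored-program machine in C. The four-instruction set (LOAD, ADD, STORE, HALT) and the opcode*100+address encoding are invented for illustration; they are not the IAS instruction set.

    #include <stdio.h>

    /* Toy stored-program machine: one memory holds both instructions and data. */
    enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

    int main(void) {
        /* Each instruction is packed as opcode*100 + address. */
        int mem[16] = {
            LOAD  * 100 + 10,   /* 0: ACC <- mem[10]        */
            ADD   * 100 + 11,   /* 1: ACC <- ACC + mem[11]  */
            STORE * 100 + 12,   /* 2: mem[12] <- ACC        */
            HALT  * 100,        /* 3: stop                  */
            0, 0, 0, 0, 0, 0,
            7, 35, 0            /* 10,11: data; 12: result  */
        };
        int pc = 0, acc = 0, running = 1;

        while (running) {
            int instr   = mem[pc++];        /* fetch, then advance the PC */
            int opcode  = instr / 100;      /* decode                     */
            int address = instr % 100;
            switch (opcode) {               /* execute                    */
            case LOAD:  acc = mem[address];   break;
            case ADD:   acc += mem[address];  break;
            case STORE: mem[address] = acc;   break;
            case HALT:  running = 0;          break;
            }
        }
        printf("mem[12] = %d\n", mem[12]);  /* prints 42 */
        return 0;
    }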
Computer Organization – Motherboard/Chipset/Buses
The motherboard is the main circuit board inside a computer, containing the central processing unit (CPU), the bus, memory sockets, expansion slots, and more. The associated chips that help in overall system integration and are fabricated onto the motherboard are collectively called the chipset. Buses are the medium through which all the devices communicate, e.g. PCI, Micro Channel, ISA, EISA.

Computer Organization – CPU
The part of a computer which controls all the other parts, generally called the "processor". It is composed of:
- Control Unit (CU)
- Arithmetic and Logic Unit (ALU)
- Registers and associated memory

Control Unit (CU)
- The control unit fetches instructions from memory and decodes them to produce signals which control the other parts of the computer.
- This may cause it to transfer data between memory and the ALU, or to activate peripherals to perform input or output.

Arithmetic and Logic Unit (ALU)
- The combinational part of the CPU, which performs operations such as addition, subtraction and multiplication of integers, and bitwise AND, OR, NOT and other Boolean operations.
- The width in bits of the data which the ALU handles is usually the same as that quoted for the processor as a whole, whereas its external buses may be a different width.

Registers and Associated Memory
- Registers are like internal buffers for the CPU. The CPU uses the registers while decoding instructions and for computing the final output (in the ALU).
- Additionally, the CPU may make use of internal memory (mostly SRAM, as it is the fastest) for caching code and data for frequently accessed operations.

CPU – Instructions
These are basically control commands for the control unit of the CPU which dictate the task to be performed (e.g. add, subtract, move, shift, interrupt). The instructions can be considered an interface to the features offered by the CPU. Instructions are composed primarily of an "opcode", which specifies what the instruction is, and "operands", which specify the data on which the instruction operates (a small decoding sketch appears after the RISC/CISC comparison below).

CPU – Machine Programs
These are full-length programs composed of several instructions which perform the real-world task at hand. Every CPU has a Program Counter (PC) register which is used to keep track of the current instruction being executed.

CPU (Contd.)
Computers are classified into two broad categories depending on the complexity of the instructions supported by the CPU:
- RISC (Reduced Instruction Set Computer): a small set of instructions with a uniform format; the control unit design is very simple; less flexibility.
- CISC (Complex Instruction Set Computer): a huge set of instructions whose formats are not uniform; the control unit design is correspondingly complex; because they support many elaborate instructions, they are very flexible.
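As a concrete illustration of the opcode/operand split described above, here is a small C sketch that packs and unpacks a made-up 16-bit instruction format (4-bit opcode, two 6-bit register operands); the format is invented for illustration and is not taken from any real ISA.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical 16-bit format: [15:12] opcode, [11:6] operand A, [5:0] operand B. */
    static uint16_t encode(unsigned op, unsigned a, unsigned b) {
        return (uint16_t)(((op & 0xFu) << 12) | ((a & 0x3Fu) << 6) | (b & 0x3Fu));
    }

    int main(void) {
        uint16_t instr = encode(0x2, 5, 17);      /* e.g. "ADD r5, r17" in this toy format */
        unsigned opcode = (instr >> 12) & 0xFu;   /* what the instruction is               */
        unsigned ra     = (instr >> 6)  & 0x3Fu;  /* which data it operates on             */
        unsigned rb     =  instr        & 0x3Fu;
        printf("opcode=%u ra=%u rb=%u\n", opcode, ra, rb);
        return 0;
    }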
CPU – Control Unit
The Control Unit (CU) is the heart of the CPU. Every instruction that the CPU supports has to be decoded by the CU and executed appropriately, and every instruction has a sequence of microinstructions behind it that carries out the operation of that instruction.

Control Unit - Designs
- Hardwired (conventional): used in high-performance and RISC computers. As the name implies, the microinstructions are hardwired into the computer. This is much faster but much less flexible; typically a grid/array is used to establish the sequence of connections that constructs the appropriate circuitry.
- Microprogrammed: the instructions are programmable, i.e. the computer architect is able to program a particular instruction using microinstructions and download the entire sequence of microinstructions into the CU. Slower, since additional time is spent decoding and executing microinstructions, but very flexible.

Control Unit – Hardwired vs Microprogrammed
Hardwired:
- Composed of combinational and sequential circuits that generate the complete timing corresponding to the execution of each instruction.
- Time-consuming and expensive to design, and difficult to modify, but fast.
Microprogrammed:
- The design is simpler: the problem of timing each microinstruction is broken down, and the microinstruction cycle handles timing in a simple and systematic way.
- Easier to modify, meaning really flexible.
- Slower than hardwired control, since additional time is spent decoding microinstructions and looking them up.

CPU – Instruction Cycle
The operation of a computer consists of a sequence of instruction cycles. An instruction cycle is the time needed to execute one complete machine instruction. It can be subdivided into:
- Fetch: retrieving the instruction
- Indirect: retrieving indirect operands (if any)
- Execute: executing the instruction
- Interrupt: handling any exceptions (if any)
The most important part is the fetch-execute cycle, which consists of:
1. Extract the instruction from memory
2. Calculate the address of the next instruction by advancing the PC (Program Counter)
3. Decode the opcode
4. Calculate the address of the operands (if any)
5. Extract the operands from memory
6. Execute
7. Calculate the address of the result
8. Store the result in memory

Control Unit - Microprogramming
Each step of an instruction can be further broken down into micro-operations, as discussed. These micro-operations are accomplished using microinstructions; therefore, every instruction can be said to have a corresponding microprogram which accomplishes the task of that instruction. The basic components of microprogramming are:
- Instruction Register (IR): stores the current instruction.
- Control Store: a form of memory that acts as the repository for microprograms.
- Sequencer: a microprogram control unit, i.e. a control unit within the control unit.
- Microprogram Program Counter: similar to a program counter.
The Instruction Register stores the current machine-language instruction from the program in main memory. Each machine instruction is represented as a series of microinstructions in the Control Store, and the Micro Program Counter points to the location of the current microinstruction in the Control Store. Executing a machine instruction requires finding its microinstructions and executing them until the microprogram has completed.

Microprogramming – Ordinary Operation
- The sequencer causes the address of the current machine instruction's microprogram to be placed in the Micro Program Counter; it may also clear the microinstruction buffer.
- The sequencer initiates a control-store read and transfers the microinstruction to the microinstruction buffer.
- The microinstruction decoder decodes the microinstruction into micro-orders and issues those orders over the control lines.
- If the microinstruction is non-branching, the Micro Program Counter is incremented; otherwise a new address in the control store is calculated and placed in the Micro Program Counter.

Microprogramming – Organization of Microprograms
Each machine-language operation is actually a sequence of microinstructions, and each of these microprograms is placed in the control store. Also included are the microprograms for instruction fetch, interrupt initiation and other routines separate from the machine-language instructions.
Microprogramming – Branching / Non-Branching
- Branching microinstructions hold the branch address inside the microinstruction itself.
- Branching occurs only within the Control Store (as opposed to a machine-language branch to another location in main memory).
- A special bit designates whether a microinstruction is a branch or a non-branch.
- An internal address bus is used to transfer the new control-store location.
- Non-branching microinstructions simply require that the Micro Program Counter be incremented.

Microprogramming – Microinstruction Formats
Microinstructions are simply a list of control orders for the various buses and gates in the computer.
- Horizontal control: each microinstruction is a series of bits, each of which represents a single control line.
- Vertical control: groups of bits in the microinstruction represent commands; this requires decoding or demultiplexing before the control commands can be issued.
Horizontal control is faster but requires longer microinstructions.

Computer Organization - Memory
Memory consists of internal storage areas in the system. Memory operations rely on these registers:
- MAR (Memory Address Register): stores the address of the memory location that is to be accessed.
- MDR (Memory Data Register): stores the data that is either written to or read from memory.
- MBR (Memory Buffer Register): a buffer register for temporary storage.
- MCRs (Memory Control Registers): set up access to the memory.

Computer Organization – Input/Output System (I/O)
There is a need to communicate with components and with the external world; each of the communication endpoints is called a "port". The I/O system comprises the set of all I/O devices in the computer system, including the physical devices and the I/O interface devices. In the early days of computers, I/O devices were limited to the line printer and the punch-card reader; today there are numerous types of I/O devices (terminal, mouse, scanner, etc.). With CPU-controlled I/O, the CPU interrupts its current process to directly handle all I/O.

MEMORY ORGANIZATION

Computer Memory System
A memory unit is a collection of storage cells together with the circuits necessary to transfer information in and out of storage. The memory stores binary information in groups of bits called words. A word is the fundamental unit of information in a memory: it holds a series of 1's and 0's that can represent numbers, instruction codes, characters, etc. A group of 8 bits is called a byte, which is the fundamental unit of the memory metric; a word is usually a multiple of a byte. (The original slides include a memory-metric table of units.)

Memory Access
Memory is connected to the memory address lines and the memory data lines (and to a set of control lines, e.g. memory read and memory write). Whenever an address is presented to the memory, the data corresponding to that address appears on the data lines. Memory addressing has limits set by the number of address lines: larger systems have 32-64 address lines and can address from 2^32 bytes (about 4000 MB) up to 2^64 bytes of memory, but it is not practical to provide all the memory that the machine can address. Each memory word is given a specific address.

Unit of Transfer
- Internal: usually governed by the data bus width.
- External: usually a block, which is much larger than a word.
- Addressable unit: the smallest location which can be uniquely addressed; a word internally, a cluster on disks.
Access Methods
- Sequential: start at the beginning and read through in order; access time depends on the location of the data and the previous location. Example: tape.
- Direct: individual blocks have a unique address; access is by jumping to the vicinity plus a sequential search; access time depends on the location and the previous location. Example: disk.
- Random: individual addresses identify locations exactly; access time is independent of the location or the previous access. Example: RAM.
- Associative: data is located by comparison with the contents of a portion of the store; access time is independent of the location or the previous access. Example: cache.

Memory Hierarchy
- Registers: in the CPU.
- Internal or main memory: may include one or more levels of cache; "RAM".
- External memory: backing store.

Performance
- Access time (latency): the time between presenting the address and getting the valid data.
- Memory cycle time: time may be required for the memory to "recover" before the next access; cycle time = access time + recovery time. Primarily applied to random-access memory.
- Transfer rate: the rate at which data can be moved. For random-access memory it is 1/(cycle time). For non-random-access memory, T_N = T_A + N/R, where T_N is the average time to read or write N bits, T_A is the average access time, N is the number of bits and R is the transfer rate in bits per second.

Memory Access Time
Average access time for a two-level memory:
Ts = H x T1 + (1 - H) x (T1 + T2)
- H: fraction of all memory accesses that are found in the faster memory (the hit ratio)
- T1: access time of level 1
- T2: access time of level 2
- Access efficiency = T1 / Ts
Example: suppose a processor has two levels of memory. Level 1 contains 1000 words and has an access time of 0.01 us; level 2 contains 100,000 words and has an access time of 0.1 us. Words in level 1 can be accessed directly; if a word is in level 2, it is first transferred into level 1 and then accessed by the processor. For simplicity we ignore the time the processor needs to determine whether the word is in level 1 or level 2. For a high percentage of level-1 accesses, the average access time is much closer to that of level 1 than to that of level 2. Suppose 95% of the memory accesses are found in level 1; the calculation is worked below.
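Working the 95% case through the formula above; a small C sketch using the numbers from the example:

    #include <stdio.h>

    /* Average access time of a two-level memory: Ts = H*T1 + (1-H)*(T1+T2). */
    static double avg_access_time(double hit_ratio, double t1, double t2) {
        return hit_ratio * t1 + (1.0 - hit_ratio) * (t1 + t2);
    }

    int main(void) {
        double ts = avg_access_time(0.95, 0.01, 0.1);   /* times in microseconds         */
        printf("Ts = %.4f us\n", ts);                   /* 0.95*0.01 + 0.05*0.11 = 0.015 */
        printf("Access efficiency T1/Ts = %.2f\n", 0.01 / ts);
        return 0;
    }

So Ts = 0.015 us and the access efficiency is about 0.67: even though level 2 is ten times slower, the average access time is only 50% above T1.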
Memory Types
- Semiconductor: RAM
- Magnetic: disk and tape
- Optical: CD and DVD
- Others: bubble memory, holographic storage

Hierarchy List
Registers, L1 cache, L2 cache, main memory, disk cache, disk, optical, tape. Going down the hierarchy:
- cost per bit decreases
- capacity increases
- access time increases
- the frequency of access by the processor decreases

Speed and Cost
It is possible to build a computer which uses only static RAM (see later). It would be very fast and would need no cache (how can you cache cache?), but it would cost a very large amount.

Semiconductor Memory
The basic element of a semiconductor memory is the memory cell, which can be in one of two states. Read/write "RAM" is somewhat misnamed, as all semiconductor memory is random access; it is read/write, volatile, used for temporary storage, and either static or dynamic.

Dynamic RAM (DRAM)
- Bits stored as charge in capacitors; the charges leak, so refreshing is needed even when powered.
- Simpler construction, smaller per bit, less expensive.
- Needs refresh circuits; slower.
- Used for main memory.

Static RAM (SRAM)
- Bits stored as on/off switches; there are no charges to leak, so no refreshing is needed while powered.
- More complex construction, larger per bit, more expensive.
- Does not need refresh circuits; faster.
- Used for cache.

DRAM Organisation in Detail
- A 16 Mbit chip can be organised as 1M of 16-bit words. A bit-per-chip system has 16 lots of 1 Mbit chips, with bit 1 of each word in chip 1 and so on.
- A 16 Mbit chip can instead be organised as a 2048 x 2048 x 4-bit array. This reduces the number of address pins: the row address and column address are multiplexed, and 11 pins suffice for each (2^11 = 2048). Adding one more pin doubles the range of row and column values and therefore quadruples the capacity.

Typical 16 Mb DRAM (4M x 4)
Each horizontal line connects to the "select" terminal of each cell in its row; each vertical line connects to the data-in/sense terminal of each cell in its column. The address lines supply the address of the 4-bit word to be selected (A0-A10); these selection lines are decoded into 2048 row lines, and the same set of lines is used to specify one column out of the 2048 columns. Four data lines are used for the input and output of 4 data bits to and from a data buffer.

Refreshing
A refresh circuit is included on the chip: it disables the chip, counts through the rows, and reads and writes back each one. This takes time and slows down apparent performance.

Packaging; Module (256 KB) Organisation; Module Organisation (1 MB)
(Figures in the original slides show chip packaging and how 256 KB and 1 MB modules are built from individual chips.)

AUXILIARY MEMORY

Cache
A small amount of fast memory that sits between normal main memory and the CPU. The cache is usually filled from main memory when instructions or data are fetched into the CPU, and it may be located on the CPU chip or module.

Cache Operation - Overview
- The CPU requests the contents of a memory location.
- The cache is checked for this data.
- If present, it is delivered from the cache (fast).
- If not present, the required block is read from main memory into the cache and then delivered from the cache to the CPU.
- The cache includes tags to identify which block of main memory is in each cache slot (a minimal lookup sketch follows at the end of this cache discussion).

Cache Design
A cache has the following design properties: size, mapping function (discussed in Computer Architecture), replacement algorithm, write policy, block size, and number of caches (levels). The effectiveness of the cache relies on the locality of reference of most programs.

Locality of Reference
During the execution of a program, memory references tend to cluster, e.g. in loops. Spatial locality: most programs are highly sequential, so the next instruction usually comes from the next memory location, and data is usually structured, with the data in those structures stored in contiguous memory locations. Short loops are a common program structure, especially for the innermost sets of nested loops, which means the same small set of instructions is used over and over; generally, several operations are performed on the same data values, or variables.

Typical Cache Organization; Associative Memory
When a cache is used, there must be some way for the memory controller to determine whether the value currently being addressed is available from the cache. Both the address and the value from main memory are stored in the cache, with the address held in a type of memory called associative memory or, more descriptively, content-addressable memory: when a value is presented to the memory, the address of the value is returned if the value is stored there; otherwise an indication that the value is not in the associative memory is returned. The comparisons are done simultaneously, so the search is performed very quickly, but such memory is expensive.

Write Policy
A cache block must not be overwritten unless main memory is up to date; multiple CPUs may have individual caches, and I/O may address main memory directly.
- Write through: all writes go to main memory as well as to the cache, so multiple CPUs can monitor main-memory traffic to keep their local caches up to date. This generates lots of traffic and slows down writes.
- Write back: updates are initially made in the cache only, and an update bit for the cache slot is set when an update occurs; if the block is to be replaced, it is written to main memory only if the update bit is set. Other caches can get out of sync, and I/O must access main memory through the cache. N.B. about 15% of memory references are writes.
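To make the tag check in the cache-operation overview concrete, here is a minimal direct-mapped cache lookup in C. The sizes (16 lines of 16-byte blocks) and the hit/miss bookkeeping are illustrative assumptions; mapping functions are treated properly in the Computer Architecture material the notes refer to.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LINES      16           /* number of cache slots */
    #define BLOCK_SIZE 16           /* bytes per block       */

    struct line { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct line cache[LINES];

    /* Returns true on a hit: the slot chosen by the index bits already holds
       the requested block, identified by the stored tag. */
    static bool cache_lookup(uint32_t addr) {
        uint32_t index = (addr / BLOCK_SIZE) % LINES;   /* which slot         */
        uint32_t tag   = addr / (BLOCK_SIZE * LINES);   /* which memory block */
        if (cache[index].valid && cache[index].tag == tag)
            return true;                                /* hit                 */
        cache[index].valid = true;                      /* miss: fetch block   */
        cache[index].tag   = tag;                       /* (data fill omitted) */
        return false;
    }

    int main(void) {
        printf("first access:  %s\n", cache_lookup(0x1234) ? "hit" : "miss");
        printf("second access: %s\n", cache_lookup(0x1234) ? "hit" : "miss");
        return 0;
    }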
Error Correction
Semiconductor memory can be subject to errors:
- Hard failure: a permanent defect of a cell.
- Soft error: caused by power-supply problems or by external sources which generate radiation; random and non-destructive, with no permanent damage to the memory.
Errors are detected using a Hamming error-correcting code.

Error Correcting Code Function
When data is to be written into memory, a calculation is performed on the data to produce a code, and both the code and the data are stored. When the previously stored word is read out, the code is used to detect and possibly correct errors, e.g. using the Hamming error-correction code.

COMPUTER DATA ORGANIZATION AND REPRESENTATION

Representation of Data
There are several basic kinds of information:
- Annual earnings of an IT faculty member (numeric)
- Script of the Mahawansa (characters)
- Set of fingerprints on file in a police department (visual)
- A Michael Jackson song (audio)
We want a single, common way to represent all of these: they can be encoded as numbers, which in turn can appear as signals at the hardware level.

Data Representation
How do computers represent data? Most computers are digital and recognize only two discrete states, on or off: computers are electronic devices powered by electricity, which has only two states, on or off.

Bits, Bytes, and Words
Information in the computer is stored as groups of binary digits. An individual digit is called a bit; bits are grouped into 8-bit collections called bytes, and memory is normally measured in bytes. Bytes are further grouped into 4-byte or larger groupings to make up a modern computer word. The most common word size in modern computers is 32 bits (four 8-bit bytes); many programs do scientific calculations using double (64-bit) words.

32-bit Unsigned Integers
What is the largest integer value that can be expressed in 32 bits?
Maximum value: 11111111111111111111111111111111 = 2^32 - 1 = 4,294,967,295

Three Representations of Signed (+/-) Binary Integers
- Signed magnitude: the leading bit represents the sign and the remaining bits represent the corresponding unsigned number.
- One's complement: the negative representation is obtained by flipping all the bits of the unsigned number.
- Two's complement: the negative representation is obtained by flipping all the bits and adding one.

Range of 32-bit 2's Complement Integers
Maximum: 01111111111111111111111111111111 = 2,147,483,647 = 2^31 - 1
Minimum: 10000000000000000000000000000000 = -2,147,483,648 = -2^31

Floating Point Numbers
First look at 0.1 in binary: 0.000110011001100110011001100... (the pattern 1100 repeats). This is equal to 1.100110011001100110011001100... x 2^-4. We must store both the exponent (with its sign) and the mantissa (with its sign). Using 8 bits for the exponent and its sign and 24 bits for the mantissa and its sign, the internal representation of 0.1 is then:
exponent -0000100, mantissa +10011001100110011001100
Converting this stored number back to decimal gives 0.0999999940395355, so round-off error is incurred when storing floating-point numbers in a 32-bit representation. (The original slides include figures on the computer representation of real numbers and a 16-bit number-representation example.)

Floating Point Representations (32-bit Computers)
- 1 bit for the sign
- 8 bits for the exponent (including its sign)
- 23 bits for the mantissa
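A short C demonstration of the two ideas above: the bit pattern of a negative two's-complement integer, and the round-off error that 0.1 picks up in a 32-bit float. Note that IEEE 754 hardware rounds rather than truncates, so the stored value is slightly above 0.1, whereas the simplified format in the notes gives a value slightly below it; the exact digits printed may vary slightly with the C library.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Two's complement: -1 flips all bits of 0 and adds one -> all 1s. */
        int32_t minus_one = -1;
        printf("-1 as unsigned 32-bit: %u (0x%08X)\n",
               (uint32_t)minus_one, (uint32_t)minus_one);

        /* 0.1 cannot be represented exactly in binary floating point. */
        float tenth = 0.1f;
        printf("0.1f stored as: %.17f\n", (double)tenth);

        /* The error shows up when it accumulates: ten additions of 0.1f are not exactly 1. */
        float sum = 0.0f;
        for (int i = 0; i < 10; i++) sum += tenth;
        printf("0.1f added 10 times: %.17f\n", (double)sum);
        return 0;
    }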
Real Representation Problem 1
The use of only a small number of bits to store the exponent gives a finite range of negative and positive exponents.
- Single precision (32-bit): range approximately 10^-38 to 10^+38.
- We can use two words (more storage) and increase the exponent (and mantissa) size: double precision has a range of approximately 10^-308 to 10^+308, with an 11-bit exponent (bias of 1023) and a 52-bit mantissa.

Real Number Representation Problem 2
Not every fraction can be represented exactly in a given number of bits, and some numbers cannot be represented exactly in any number of bits.
1. There is a 'hole' near zero, because the number nearest to zero that can be represented is 2^-n, where n is the number of bits in the mantissa.
2. Round-off error is created when bits are stripped from the end of a mantissa.

Holes in Representation
Assume an 8-bit word with a 3-bit mantissa; the number line (shown in the original slides) represents all the possible positive numbers. Numbers near zero cannot be represented, and as the magnitude of the numbers gets larger, the interval between representable numbers (the round-off error) gets larger.

Round-Off Error
If we represent the perfectly good decimal fraction 0.1 as a binary fraction we get 0.000110011001100110011001100... (the pattern 0011 repeats). Suppose we have a computer that stores 11 bits in the mantissa: the binary fraction (after rounding) becomes 1.10011001101 x 2^-4 ≈ 0.1000061.

Machine Epsilon
A calculable quantity which quantifies the maximum round-off error in a computer that uses a certain word length is called the machine epsilon (εmach). εmach is the smallest representable real number that can be added to 1.0 such that the result differs from 1.0. For 32-bit (single precision) arithmetic its value is approximately 1.2 x 10^-7; for 64-bit (double precision) arithmetic it is approximately 2.2 x 10^-16.

Round-Off Error (continued)
Even though two numbers look the same, they might be different for the computer. Don't compare floating-point numbers for exact equality/inequality; compare using a tolerance, e.g. |a - b| < ε.

Character Representation
ASCII encoding (American Standard Code for Information Interchange) defines 128 characters, which requires 7 bits. Unicode extends ASCII to include other characters: an 8-bit extension adds an additional 128 characters to normal ASCII, and 16 bits allow the representation of almost any recognizable language plus most technical symbols.

Data Representation
Describes the methods by which data can be represented and transmitted in a computer: alphanumeric data, big endian vs little endian, images (bit map and vector), and audio. (The original slides give a table of example data representations.)

Data Representation - Coding Systems
What are two popular coding systems used to represent data? The American Standard Code for Information Interchange (ASCII) and the Extended Binary Coded Decimal Interchange Code (EBCDIC). These are sufficient for English and Western European languages; Unicode is often used for others.

Alphanumeric Data
Many applications process text (e.g. compilers and word processors); coding schemes include ASCII, EBCDIC and Unicode. (The original slides include an ASCII table in hex and an exercise asking what a given ASCII string represents.)

Sorting ASCII Characters
ASCII and EBCDIC codes are designed so the computer can do alphabetic comparisons.
- In Windows, comparisons are case-insensitive (in most instances).
- In Unix, comparisons are generally case-sensitive.
(The original slides also show the EBCDIC codes.)

UNICODE
A 16-bit code (encoding 65,536 characters), modelled on the ASCII character set. It encodes most characters currently in use and uses scripts to define the characters of a particular language.

Big Endian vs Little Endian
On most computers the storage unit is a byte, and multiple bytes are required to store most data types (e.g. an integer takes 4 bytes). How do we pack words into a byte-addressable memory? This is the choice between little endian and big endian; the differences in performance are minor. Intel processors use little endian and Motorola processors use big endian, and some programs likewise insist on a particular format: the Windows .bmp format is little endian, for instance, while Internet protocols are big endian, so conversion is required on little-endian processors. Some common file formats:
- Big endian: Adobe Photoshop, JPEG, MacPaint, SGI (Silicon Graphics), Sun Raster, WPG (WordPerfect graphics metafile)
- Little endian: BMP, GIF, PCX (Paintbrush), QTM (QuickTime), Microsoft RTF
And some can be either, selected by codes in the file.
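A small C sketch that checks the byte order of the machine it runs on and performs the byte swap needed when converting, for example, a little-endian value to the big-endian order used by Internet protocols:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Reverse the byte order of a 32-bit value (little endian <-> big endian). */
    static uint32_t swap32(uint32_t v) {
        return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
               ((v << 8) & 0x00FF0000u) | (v << 24);
    }

    int main(void) {
        uint32_t value = 0x11223344u;
        uint8_t first_byte;
        memcpy(&first_byte, &value, 1);   /* the byte stored at the lowest address */

        printf("this machine is %s endian\n",
               first_byte == 0x44 ? "little" : "big");
        printf("0x%08X byte-swapped is 0x%08X\n", value, swap32(value));
        return 0;
    }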
Pictures
Many different formats are used to store images in a computer. There are two main categories:
- Bit-map images, e.g. photographs and paintings: characterised by continuous variations in shading, colour, shape and texture, so it is necessary to store information about each point of the image.
- Vector images: made up of geometrical shapes (lines, circles, etc.), so it is sufficient to store the geometrical detail plus its position.

Bit-map Images
Many different bit-map formats exist, e.g. GIF, TIFF, ...

Bit-map Storage
Consider an image with 600 rows of 800 pixels, with one byte used to store each of the three colours of each pixel: total memory = 600 * 800 * 3 bytes ≈ 1.5 MB (checked in the sketch at the end of this section). An alternative representation is to use a palette, a lookup table which defines the colours in the image; an index into this table is then stored for each pixel. The size can also be reduced by lowering the resolution (i.e. increasing the size of each pixel) or by employing various compression algorithms. (The original slides show the GIF screen layout and the GIF file format.)

Vector Graphics
A series of objects such as lines and circles, e.g. PICT, TIFF, ... For example:
- line 0,50,100,50
- line 50,0,50,100
- char A, 75, 25

Example: PostScript
A page description language: an image consists of a program written in the PostScript language, encoded in ASCII or Unicode. It contains functions to draw lines, draw Bezier curves, join simple objects into more complex ones, translate or scale an object, and fill an object.

Audio Data
Sound is normally digitised from an audio source: the analog waveform is sampled at regular time intervals and the amplitude at each interval is recorded using an A-to-D converter, with the most positive peak set to the maximum binary number and the most negative peak set to zero. (The original slides illustrate digitizing an audio waveform.)

Wave (.WAV) Sound Format
Designed by Microsoft. It supports 8- or 16-bit sound samples at sample rates of 11.025 kHz, 22.05 kHz or 44.1 kHz, supports stereo or mono, and is a very simple format. (The original slides show the WAV file layout.)

Some Representative Sizes
If we encode sound at 44.1 kHz, each sample at 16 bits, in stereo (2 channels), this amounts to about 1.4 Mbit/s, so three minutes of sound takes roughly 30 MB of space (worked below). If we only encode the most important features, this is termed data compression, and it can reduce the file size by about 10:1.

Popular Methods
Real Audio is one method used for data compression; MP3 is another. Comparative file sizes:
- WAV file at 44 kHz, 16 bit: 5 MB
- Real Audio: 304 KB
- MP3: 409 KB
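The two storage estimates above are easy to check. A small C sketch (sizes only; no real image or sound data is involved):

    #include <stdio.h>

    int main(void) {
        /* Bit-map image: 600 x 800 pixels, 3 bytes (R, G, B) per pixel. */
        long image_bytes = 600L * 800L * 3L;
        printf("image: %ld bytes (about %.2f MB)\n",
               image_bytes, image_bytes / 1.0e6);

        /* Uncompressed audio: 44,100 samples/s, 2 bytes per sample, 2 channels, 3 minutes. */
        long bytes_per_second = 44100L * 2L * 2L;
        long audio_bytes = bytes_per_second * 180L;
        printf("audio: %ld bytes/s, %ld bytes for 3 minutes (about %.1f MB)\n",
               bytes_per_second, audio_bytes, audio_bytes / 1.0e6);
        return 0;
    }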
Data Representation - From Keyboard to Computer
How is a character sent from the keyboard to the computer?
Step 1: The user presses the letter T key on the keyboard.
Step 2: An electronic signal for the letter T is sent to the system unit.
Step 3: The signal for the letter T is converted to its ASCII binary code (01010100) and is stored in memory for processing.
Step 4: After processing, the binary code for the letter T is converted to an image on the output device.

INPUT OUTPUT ORGANIZATION

Input and Output
The computer communicates with the outside world using basic I/O hardware: ports, buses, devices and controllers. There is also I/O software to facilitate I/O operations: interrupt handlers, device drivers, device-independent software, and user-space I/O software. An important concept is that there are three ways to perform I/O operations: polling, interrupts and DMA.

Input-Output Organization: Peripheral Devices
The I/O subsystem provides an efficient mode of communication between the central system and the outside environment. A peripheral (or I/O device) is an input or output device attached to the computer:
- Monitor (visual output device): CRT, LCD
- Keyboard and other input devices: light pen, mouse, touch screen, joystick, digitizer
- Printer (hard-copy device): dot matrix (impact), thermal, ink jet, laser (non-impact)
- Storage device: magnetic tape, magnetic disk, optical disk

ASCII (American Standard Code for Information Interchange)
I/O communications usually involve the transfer of ASCII alphanumeric information. The ASCII code proper is 7 bits (00-7F hex, 0-127); codes 80-FF hex (128-255) are used for Greek, italic, graphics and other extended characters.

Input-Output Interface
An interface is needed between the CPU and peripherals because:
1. A conversion of signal values may be required.
2. A synchronization mechanism may be needed: the data transfer rate of peripherals is usually slower than the transfer rate of the CPU.
3. Data codes and formats in peripherals differ from the word format in the CPU and memory.
4. The operating modes of peripherals differ from each other, and each peripheral must be controlled so as not to disturb the operation of the other peripherals connected to the CPU.
An interface is a special hardware component between the CPU and a peripheral that supervises and synchronizes all input and output transfers.

I/O Buses and Interface Modules
- The I/O bus consists of data lines, address lines and control lines.
- Interface modules include SCSI (Small Computer System Interface), IDE (Integrated Device Electronics) and RS-232.
- I/O commands include control commands, status commands, input commands and output commands.

Synchronous and Asynchronous Data Transfer
- Synchronous data transfer: all data transfers occur simultaneously with the occurrence of a clock pulse; the registers in the interface share a common clock with the CPU registers.
- Asynchronous data transfer: the internal timing in each unit (CPU and interface) is independent, and each unit uses its own private clock for its internal registers.

Data Transfer: Strobe
A strobe is a control signal that indicates the time at which data is being transmitted; it can be source-initiated or destination-initiated. The disadvantage of the strobe method is that the initiating unit has no way of knowing whether the other unit actually placed or received the data, which is what the handshake method addresses.

Data Transfer: Handshake
A handshake is an agreement between two independent units; it can be source-initiated or destination-initiated. Timeout: if the return handshake signal does not respond within a given time period, the unit assumes that an error has occurred.
Data Transfer: Serial
- Synchronous transmission: the two units share a common clock frequency, and bits are transmitted continuously at the rate dictated by the clock pulses.
- Asynchronous transmission: special bits are inserted at both ends of the character code. Each character consists of three parts:
  1) start bit: always "0", indicating the beginning of a character
  2) character bits: the data
  3) stop bit: always "1"
- Asynchronous transmission rules: when a character is not being sent, the line is kept in the 1-state. The initiation of a character transmission is detected from the start bit, which is always "0", and the character bits always follow the start bit. After the last bit of the character is transmitted, a stop bit is detected when the line returns to the 1-state for at least one bit time.

Data Transfer: Controller
I/O units typically consist of a mechanical component (the device itself) and an electronic component (the device controller or adapter). The interface between the controller and the device is a very low-level interface; for example, a disk controller converts the serial bit stream coming off the drive into a block of bytes and performs error correction.

I/O Controller
The disk controller implements the disk side of the protocol and handles bad-block mapping, prefetching, buffering and caching. The controller has registers for data and control; the CPU and the controllers communicate via I/O instructions and registers, or via memory-mapped I/O.

I/O Ports
An I/O port has four registers: status, control, data-in and data-out.
- Status: states whether the current command is completed, a byte is available, the device has an error, etc.
- Control: the host uses it to start a command or change the mode of a device.
- Data-in: the host reads it to get input.
- Data-out: the host writes it to send output.
The registers are typically 1 to 4 bytes in size.

Memory-Mapped I/O
I/O registers may be reached through (a) a separate I/O and memory space, (b) memory-mapped I/O, or (c) a hybrid of the two. (The original slides also contrast a single-bus architecture with a dual-bus memory architecture.)

Polling
Polling uses the CPU to busy-wait and watch status bits, and to feed data into a controller register one byte at a time (a sketch follows after the interrupt discussion below). This is expensive for large transfers and is not acceptable except in small dedicated systems that are not running multiple processes.

Interrupts
Connections between devices and the interrupt controller actually use interrupt lines on the bus rather than dedicated wires. Host-controller interface: the CPU hardware has an interrupt-request line that the CPU senses after executing every instruction.
- The device raises an interrupt.
- The CPU catches the interrupt and saves the state (e.g. the instruction pointer).
- The CPU dispatches the interrupt handler.
- The interrupt handler determines the cause, services the device and clears the interrupt.

Why Interrupts?
I/O devices need attention. A real-life analogy: an alarm goes off when the food is ready, so you can do other things in between.

Support for Interrupts
- The ability to defer interrupt handling during critical processing is needed.
- An efficient way to dispatch to the proper device handler is needed: the interrupt comes with an address (an offset into the interrupt vector) that selects a specific interrupt handler.
- Multilevel interrupts (interrupt priority levels) are needed.

Interrupt Handler
At boot time, the OS probes the hardware buses to determine what devices are present and installs the corresponding interrupt handlers into the interrupt vector. During an I/O interrupt, the controller signals that the device is ready.

Other Types of Interrupts
Interrupt mechanisms are used to handle a wide variety of exceptions: division by zero, an invalid address, virtual memory paging, and system calls (software interrupts/signals, traps).
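A minimal sketch of polled (programmed) I/O over memory-mapped registers in C. The register addresses, bit mask and device are entirely hypothetical; on real hardware they come from the device's datasheet, and code like this only makes sense in a bare-metal or kernel context.

    #include <stdint.h>

    /* Hypothetical device registers (addresses invented for illustration). */
    #define DEV_STATUS (*(volatile uint32_t *)0x40001000u)
    #define DEV_DATA   (*(volatile uint32_t *)0x40001004u)
    #define STATUS_READY 0x1u          /* device can accept the next byte */

    /* Polled output: busy-wait on the status bit, then write one byte at a time. */
    static void polled_write(const uint8_t *buf, int len) {
        for (int i = 0; i < len; i++) {
            while ((DEV_STATUS & STATUS_READY) == 0)
                ;                      /* busy wait: this is the CPU cost of polling */
            DEV_DATA = buf[i];         /* feed the controller one byte               */
        }
    }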
Direct Memory Access (DMA)
DMA assists in the exchange of data between memory and an I/O controller. The CPU can request data from the I/O controller byte by byte, but this is inefficient (e.g. for disk data transfer), so a special-purpose processor called a DMA controller is used instead.

DMA-CPU Protocol
Using disk DMA as an example:
- The CPU programs the DMA controller, setting registers to specify the source and destination addresses, the byte count and control information (e.g. read or write), and then goes on with other work (a register-level sketch of this step follows after the buffering discussion below).
- The DMA controller proceeds to operate the memory bus directly, without help from the main CPU, requesting that the I/O controller move data to memory.
- The disk controller transfers the data to main memory.
- The disk controller acknowledges the transfer to the DMA controller.

DMA Issues
There is handshaking between the DMA controller and the device controller. Cycle stealing: the DMA controller takes away CPU cycles when it uses the CPU memory bus, blocking the CPU from accessing memory. In general, however, a DMA controller improves total system performance.

Discussion
Consider the trade-offs between programmed I/O, interrupt-driven I/O and I/O using DMA. Which one is the fastest for a single I/O request? Which one gives the highest throughput?

I/O Software Layers: Device Drivers
(The logical position of device drivers in the I/O software stack is shown in the original slides.) Communication between drivers and device controllers goes over the bus. A device driver is device-specific code to control an I/O device, usually written by the device's manufacturer. Each controller has some device registers used to give it commands; the number of device registers and the nature of the commands vary from device to device (e.g. a mouse driver accepts information from the mouse about how far it has moved, while a disk driver has to know about sectors, tracks, heads, etc.). A device driver is usually part of the OS kernel, either compiled with the OS or dynamically loaded into the OS during execution. Each device driver handles one device type (e.g. mouse) or one class of closely related devices (e.g. a SCSI disk driver handling multiple disks of different sizes and speeds). Categories: block devices and character devices.

Functions in Device Drivers
- Accept abstract read and write requests from the device-independent layer above.
- Initialize the device; manage power requirements and log events.
- Check whether the input parameters are valid.
- Translate valid input from abstract to concrete terms, e.g. convert a linear block number into the head, track, sector and cylinder numbers for a disk access.
- Check whether the device is in use (i.e. check the status bit).
- Control the device by issuing a sequence of commands; the driver determines what commands will be issued.

Buffering
A buffer is a memory area that stores data while it is transferred between two devices or between a device and an application. Reasons for buffering:
- To cope with a speed mismatch between the producer and the consumer of a data stream (double buffering is used).
- To adapt between devices that have different data-transfer sizes.
- To support copy semantics for application I/O: the application writes to an application buffer, and the OS copies it to a kernel buffer and then writes it to the disk.
- An unbuffered input strategy is ineffective, as the user process must be started up with every incoming character; buffering is also important on output.

Caching
A cache is a region of fast memory that holds copies of data and allows more efficient access.
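The "CPU programs the DMA controller" step above amounts to filling a few controller registers and starting the transfer. A hypothetical register layout in C (the base address, field names and control bits are invented for illustration; a real controller is programmed according to its datasheet):

    #include <stdint.h>

    /* Hypothetical DMA controller register block, memory-mapped at an invented address. */
    struct dma_regs {
        volatile uint32_t source;      /* physical source address      */
        volatile uint32_t destination; /* physical destination address */
        volatile uint32_t byte_count;  /* number of bytes to move      */
        volatile uint32_t control;     /* bit 0 = start                */
        volatile uint32_t status;      /* bit 0 = done                 */
    };
    #define DMA ((struct dma_regs *)0x40002000u)

    /* Program the controller and let it run: the CPU is then free to do other
       work and is typically told of completion by an interrupt, not by polling. */
    static void dma_start_read(uint32_t device_addr, uint32_t mem_addr, uint32_t nbytes) {
        DMA->source      = device_addr;
        DMA->destination = mem_addr;
        DMA->byte_count  = nbytes;
        DMA->control     = 0x1u;       /* start a device-to-memory transfer */
    }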
Buffering Strategies
(a) Unbuffered input; (b) buffering in user space; (c) buffering in the kernel followed by copying to user space; (d) double buffering in the kernel. (The original slides illustrate these four strategies.)

Error Reporting
- Programming I/O errors: these occur when a process asks for something impossible (e.g. writing to an input device such as the keyboard, or reading from an output device such as a printer).
- Actual I/O errors: errors that occur at the device level (e.g. reading a disk block that has been damaged, or trying to read from a video camera that is switched off).
The device-independent I/O software detects the errors and responds to them by reporting to the user-space I/O software.

COMPUTER PERFORMANCE

Improvements in Chip Organization and Architecture
- Increase the hardware speed of the processor: fundamentally due to shrinking logic gate size, more gates packed more tightly increase the clock rate, and the propagation time for signals is reduced.
- Increase the size and speed of caches by dedicating part of the processor chip to them; cache access times drop significantly.
- Change the processor organization and architecture to increase the effective speed of execution, e.g. through parallelism.

Problems with Clock Speed and Logic Density
- Power: power density increases with the density of logic and with clock speed, making the heat hard to dissipate.
- RC delay: the speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting the gates, and the delay increases as the RC product increases; wire interconnects become thinner, increasing resistance, and wires sit closer together, increasing capacitance.
- Memory latency: memory speeds lag processor speeds.
The solution is to put more emphasis on organizational and architectural approaches.

More Complex Execution Logic
Enable parallel execution of instructions. A pipeline works like an assembly line: different stages of execution of different instructions proceed at the same time along the pipeline. Superscalar designs allow multiple pipelines within a single processor, so that instructions that do not depend on one another can be executed in parallel.

End of the Road?
The internal organization of processors is already complex and extracts a great deal of parallelism, so further significant increases are likely to be relatively modest; the benefits from caches are reaching their limit; and increasing the clock rate runs into the power dissipation problem. Some fundamental physical limits are being reached.

Multiple Cores
Put multiple processors on a single chip with a large shared cache. Within a processor, the increase in performance is roughly proportional to the square root of the increase in complexity, whereas if software can use multiple processors, doubling the number of processors almost doubles performance. So it is better to use two simpler processors on the chip rather than one more complex processor, and with two processors larger caches are also justified, since the power consumption of memory logic is less than that of processing logic. Example: the IBM POWER4, with two cores based on the PowerPC. (The original slides show the POWER4 chip organization.)

ARM Evolution
Designed by ARM in Cambridge, England, and licensed to manufacturers. High speed, small die, low power consumption; used in PDAs, hand-held games and phones (e.g. iPod, iPhone). Acorn produced the ARM1 and ARM2 in 1985 and the ARM3 in 1989; Acorn, VLSI and Apple Computer then founded ARM Ltd.

ARM Systems Categories
- Embedded real time
- Application platform (Linux, Palm OS, Symbian OS, Windows Mobile)
- Secure applications

Performance Assessment: Clock Speed
Key parameters are performance, cost, size, security, reliability and power consumption. The system clock speed is given in Hz or multiples thereof; related terms are clock rate, clock cycle, clock tick and cycle time. Signals in the CPU take time to settle down to 1 or 0.
Signals may change at different speeds.
Operations need to be synchronised.
Instruction execution proceeds in discrete steps
✓ fetch, decode, load and store, arithmetic or logical
✓ usually requiring multiple clock cycles per instruction.
Pipelining gives simultaneous execution of instructions.
So, clock speed is not the whole story.

Instruction Execution Rate
Millions of instructions per second (MIPS).
Millions of floating point instructions per second (MFLOPS), or GFLOPS.
Heavily dependent on the instruction set, compiler design, processor implementation, and cache & memory hierarchy.

Computer Performance:
Response time, or latency
— How long does it take for my job to run?
— How long does it take to execute a job?
— How long must I wait for the database query?
Throughput
— How many jobs can the machine run at once?
— What is the average execution rate?
— How much work is getting done?
Q1: If we upgrade a machine with a new processor, what do we increase?
Q2: If we add a new machine to the lab, what do we increase?

Execution Time
Elapsed time
✓ counts everything (disk and memory accesses, I/O, etc.)
✓ a useful number, but often not good for comparison purposes.
CPU time
✓ doesn't count I/O or time spent running other programs
✓ can be broken up into system time and user time.
Our focus: user CPU time
✓ time spent executing the lines of code that are "in" our program.

Clock Cycles
Instead of reporting execution time in seconds, we often use cycles.
Clock "ticks" indicate when to start activities (one abstraction):
✓ cycle time = time between ticks = seconds per cycle
✓ clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 200 MHz clock has a cycle time of 1 / (200 × 10^6) s = 5 ns.

How to Improve Performance
So, to improve performance (everything else being equal) you can either decrease the number of cycles required for a program, or decrease the clock cycle time (or, said another way, increase the clock rate).

How to Improve Performance
How many cycles are required for a program?
One could assume that # of cycles = # of instructions. This assumption is incorrect: different instructions take different amounts of time on different machines. Why? Different instructions need different numbers of cycles:
✓ multiplication takes more time than addition
✓ floating point operations take longer than integer ones
✓ accessing memory takes more time than accessing registers.
Important point: changing the cycle time often changes the number of cycles required for various instructions (more later).

Example
Our favorite program runs in 10 seconds on computer A, which has a 400 MHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target? (One worked answer appears after the vocabulary list below.)

Now that we understand cycles
A given program will require
✓ some number of instructions (machine instructions)
✓ some number of cycles
✓ some number of seconds.
We have a vocabulary that relates these quantities:
✓ cycle time (seconds per cycle)
✓ clock rate (cycles per second)
✓ CPI (cycles per instruction); a floating point intensive application might have a higher CPI
✓ MIPS (millions of instructions per second); this would be higher for a program using simple instructions.
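Returning to the 400 MHz example above, here is one way to work it using this vocabulary (our own arithmetic, treating the quoted times and clock rate as exact): the program uses a fixed number of cycles on machine A, machine B needs 1.2 times as many cycles, and B must finish in 6 s.

\[
\text{cycles}_A = 10\ \mathrm{s} \times 400 \times 10^{6}\ \tfrac{\text{cycles}}{\mathrm{s}} = 4 \times 10^{9}\ \text{cycles}
\]
\[
\text{clock rate}_B = \frac{1.2 \times 4 \times 10^{9}\ \text{cycles}}{6\ \mathrm{s}} = 0.8 \times 10^{9}\ \tfrac{\text{cycles}}{\mathrm{s}} = 800\ \mathrm{MHz}
\]

So the designer should target roughly 800 MHz, i.e., double machine A's clock rate.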
Performance
Performance is determined by execution time.
Do any of the other variables equal performance?
✓ # of cycles to execute the program?
✓ # of instructions in the program?
✓ # of cycles per second?
✓ average # of cycles per instruction?
✓ average # of instructions per second?

Performance
Famous equation:
Execution time = Instruction count × CPI × Clock cycle time
(equivalently, seconds/program = instructions/program × cycles/instruction × seconds/cycle)

CPI Example
Suppose we have two implementations of the same instruction set architecture (ISA). For some program,
Machine A has a clock cycle time of 10 ns and a CPI of 2.0
Machine B has a clock cycle time of 20 ns and a CPI of 1.2
Which machine is faster for this program, and by how much?
If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?

Number of Instructions Example
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? By how much? What is the CPI for each sequence?

MIPS Example
Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
Which compiler's code will be faster according to MIPS?
Which will be faster according to execution time?

Benchmarks
Performance is best determined by running a real application
✓ use programs typical of the expected workload
✓ or typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
Small benchmarks
✓ nice for architects and designers
✓ easy to standardize
✓ can be abused.
SPEC (System Performance Evaluation Cooperative)
✓ companies have agreed on a set of real programs and inputs
✓ can still be abused (Intel's "other" bug)
✓ a valuable indicator of performance (and compiler technology).

Aspects of CPU Performance