Parallel Processing Systems PDF

Document Details


Thapar Institute of Engineering & Technology

Tags

parallel processing, computer architecture, multiprocessor systems, computer science

Summary

This presentation covers different types of parallel computer architectures, including SISD, SIMD, MISD and MIMD. It also touches on cache coherency in multiprocessor setups.

Full Transcript


Parallel Processing Systems

Parallel Computer Classifications

- Parallel computers are those that emphasize parallel processing among operations in some way.
- Parallel computers can be characterized by the data and instruction streams forming various types of computer organizations.
- They can also be classified by computer structure, e.g. multiple processors with separate memories versus one shared global memory.
- Parallel processing levels can also be defined by the size of the instructions in a program, called grain size.

Flynn's Classification

- Flynn's classification is based on the multiplicity of instruction streams and data streams observed by the CPU during program execution. If Is and Ds are the minimum numbers of instruction and data streams flowing at any point in the execution, then computer organizations can be categorized as follows:
  - Single Instruction, Single Data stream (SISD)
  - Single Instruction, Multiple Data stream (SIMD)
  - Multiple Instruction, Single Data stream (MISD)
  - Multiple Instruction, Multiple Data stream (MIMD)

Single Instruction, Single Data stream (SISD)

- In this organization, instructions are executed sequentially by one CPU containing a single processing element (PE), i.e. an ALU, under one control unit.
- SISD machines are therefore conventional serial computers that process one stream of instructions and one stream of data.
- Examples: the CDC 6600, which is unpipelined but has multiple functional units, and the Amdahl 470V/6, which has pipelined instruction processing.

Single Instruction, Multiple Data stream (SIMD)

- Multiple processing elements work under the control of a single control unit, giving one instruction stream and multiple data streams.
- All processing elements receive the same instruction, broadcast from the control unit (CU).
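The SIMD broadcast just described — one control unit issuing a single instruction that every processing element applies to its own data — can be sketched as a toy Python model. This is a software analogy only, not real SIMD hardware, and the function names are invented for illustration:

```python
# Toy model of SIMD execution: one control unit broadcasts a single
# instruction, and every processing element (PE) applies it to its own
# local data element in lock-step.

def simd_broadcast(instruction, data_elements):
    """Apply one broadcast instruction to every PE's local data."""
    return [instruction(x) for x in data_elements]

# Each list slot plays the role of one PE's local memory module.
local_data = [1, 2, 3, 4]

# The CU broadcasts "multiply by 10"; all PEs execute the same
# instruction (modelled sequentially here, since Python itself is
# SISD-style from this point of view).
result = simd_broadcast(lambda x: x * 10, local_data)
print(result)  # [10, 20, 30, 40]
```

The key property the sketch captures is that there is exactly one instruction per step, while the data differs per processing element.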
- Main memory can also be divided into modules that generate multiple data streams, acting as a distributed memory, as shown in the figure.
- All processing elements execute the same instruction simultaneously and are said to be 'lock-stepped' together.
- Examples: ILLIAC-IV, PEPE, BSP, STARAN, MPP, DAP and the Connection Machine (CM-1).

Multiple Instruction, Single Data stream (MISD)

- Multiple processing elements are organized under the control of multiple control units.
- Each control unit handles one instruction stream, processed through its corresponding processing element.
- Each processing element, however, processes only a single data stream at a time.
- All processing elements interact with the common shared memory to organize the single data stream.
- The only known example of a computer capable of MISD operation is the C.mmp, built by Carnegie-Mellon University.

Multiple Instruction, Multiple Data stream (MIMD)

- In MIMD, multiple processing elements and multiple control units are organized so that multiple instruction streams operate on multiple data streams.
- To handle the multiple instruction streams, the control units and processing elements are organized so that the processing elements handle multiple data streams from main memory, as shown in Figure 7.
- Each processor works on its own data with its own instructions; tasks executed by different processors can start and finish at different times.
- Examples: IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084.

Reference: Eric Aubanel, "Elements of Parallel Computing", CRC Press, Taylor and Francis Group.

Cache Coherence in a Centralized Shared-Memory Architecture for Multiprocessors

Multiprocessor Architectures with Multiple Caches

- A multiprocessor architecture with multiple caches supports the caching of both shared and private data.
- Private data is used by a single processor, while shared data is used by multiple processors; shared data essentially provides communication among the processors through reads and writes.
- When a private item is cached, its location migrates to the cache, reducing the average access time as well as the required memory bandwidth.
- When shared data are cached, the shared value may be replicated in multiple caches, which reduces access latency and required memory bandwidth in addition to allowing the sharing itself.

Multiprocessor Cache Coherence

- The figure below illustrates the problem and shows how two different processors can hold two different values for the same location. This is generally referred to as the cache-coherence problem.
- When any processor writes to a shared variable in its own cache, all other caches that hold an old copy of that variable must be updated or invalidated.

A memory system is coherent if:
1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated and no other writes to X occur between the two accesses.
3. Writes to the same location are serialized: that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the location as 2 and later read it as 1.

Cache Coherence Protocols

- Directory based — the sharing status of a block of physical memory is kept in just one location, called the directory.
- Snooping — every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, and no centralized state is kept.
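The coherence problem and the invalidate-on-write remedy used by snooping protocols can be modelled in a few lines of Python. This is a toy sketch with invented class and variable names; the bus, the block states, and write-back behaviour are all omitted:

```python
# Toy illustration of the cache-coherence problem and the
# invalidate-on-write fix (write-through for simplicity).

class Cache:
    def __init__(self):
        self.lines = {}                      # address -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def invalidate(self, addr):
        self.lines.pop(addr, None)           # drop the stale copy, if any

def write_invalidate(writer, others, memory, addr, value):
    """Writer updates its copy and memory; all other copies are invalidated."""
    writer.lines[addr] = value
    memory[addr] = value
    for cache in others:
        cache.invalidate(addr)

memory = {0x10: 0}
cache_a, cache_b = Cache(), Cache()

cache_a.read(0x10, memory)                   # A caches X = 0
cache_b.read(0x10, memory)                   # B caches X = 0

write_invalidate(cache_a, [cache_b], memory, 0x10, 1)

# B's stale copy was invalidated, so its next read misses and returns
# the up-to-date value, satisfying coherence condition 2 above.
print(cache_b.read(0x10, memory))  # 1
```

Without the `invalidate` step, B would keep returning the stale 0 — exactly the two-different-values situation the figure illustrates.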
In snooping schemes, the caches are usually on a shared-memory bus, and all cache controllers monitor, or snoop on, the bus to determine whether they have a copy of a block that is requested on the bus.

Write Invalidate Protocol

- A processor obtains exclusive access to a data item before it writes that item.
- The protocol is called write invalidate because it invalidates other copies on a write.
- It is by far the most common protocol, for both snooping and directory schemes.

Write Update (Write Broadcast) Protocol

- This protocol updates all the cached copies of a data item when that item is written.
- In the example, assume that neither cache initially holds X and that the value of X in memory is 0; a blank entry indicates no activity or no cached copy. When CPU A broadcasts the write, both the cache in CPU B and the memory location of X are updated.

A write-invalidate cache-coherence protocol for a write-back cache, showing the states and state transitions for each block in the cache.

Explanation:

- The left side of the diagram shows state transitions based on actions of the CPU associated with this cache; the right side shows transitions based on operations on the bus.
- A read miss in the exclusive or shared state and a write miss in the exclusive state occur when the address requested by the CPU does not match the address in the cache block. Such a miss is a standard cache replacement miss.
- An attempt to write a block in the shared state always generates a miss, even if the block is present in the cache, since the block must be made exclusive.
- Whenever a bus transaction occurs, all caches that contain the cache block specified in the transaction take the action dictated by the right half of the diagram.
- The protocol assumes that memory provides data on a read miss for a block that is clean in all caches. In actual implementations, these two sets of state diagrams are combined.

Reference: Computer Architecture: A Quantitative Approach, 2nd edition, John L. Hennessy and David A. Patterson.

Memory-to-Memory Interconnection Networks

Butterfly Network

- Interconnects n processors to n memory modules.
- Uses 2 x 2 switches arranged in n/2 rows and log2(n) columns of interconnections, with one switch at each interconnection.
- Interconnection rules (with r the row and c the column index): downward connection r + 2^c; upward connection r - 2^c.

Benes Network

- A Benes network consists of two butterfly networks connected back to back.
- Like the butterfly, it interconnects n processors to n memory modules using 2 x 2 switches in n/2 rows and log2(n) columns of interconnections, one switch per interconnection, with the same interconnection rules (downward r + 2^c, upward r - 2^c).

Input/Output Organization
Computer Architecture (UEC509)

References:
1. Hennessy, J. L., Patterson, D. A., Computer Architecture: A Quantitative Approach, Elsevier (2009) 4th ed.
2. Hamacher, V. Carl, Vranesic, Z. G. and Zaky, S. G., Computer Organization, McGraw-Hill (2002) 2nd ed.

Accessing I/O Devices

- More than one I/O device may be connected through a common set of three buses.
- Each device must be assigned a unique address.
- Two mapping techniques are used to assign addresses to I/O: memory-mapped I/O and I/O-mapped I/O.

Accessing I/O through Polling

- Normally, the data transfer rate of I/O devices is slower than the speed of the processor, so mechanisms are needed to synchronize data transfers between them.
- Program-controlled I/O: the processor continuously checks a status flag to achieve the necessary synchronization; this is called polling.
- Two other mechanisms are used for synchronizing data transfers between the processor and I/O devices: interrupt-driven I/O and direct memory access (DMA).

Interrupt-Driven I/O

- I/O devices send a request when their data is ready.
- The processor completes the current instruction and sends an acknowledgement to the respective I/O device.
- Example: suppose the processor is executing a program and is at the instruction located at address i when an interrupt occurs.
- The routine executed in response to an interrupt request is called the interrupt-service routine (ISR). When an interrupt occurs, control must be transferred to the ISR; after the ISR completes, control returns to the main program.

Interrupt-Service Routine (ISR)

- The CPU suspends execution of the current program, saving the address of the next instruction to be executed (the current contents of the PC) and any other needed data.
- The CPU sets the PC to the starting address of the ISR.
- The CPU proceeds to the fetch cycle and fetches the first instruction of the ISR, which is generally part of the OS.
- The ISR typically determines the nature of the interrupt and performs whatever actions are needed; for example, it determines which I/O module generated the interrupt and may branch to a program that will write more data out to that I/O module.
- Once the ISR is completed, the CPU resumes execution of the user program at the point of interruption.

Daisy Chain in Interrupt Connections

- All interrupt connections are in series, and the first device has the highest priority.
- A single interrupt request (IR) line is shared; the interrupt acknowledge (INTA) signal is used to respond to the devices, scanning from device 1 onwards.

Multiple Interrupts

- Two methods can be used to handle multiple interrupts: handle them sequentially, or handle them according to interrupt priority.

Direct Memory Access
- DMA uses a special control unit to transfer a block of data directly between an I/O device and memory, bypassing the processor.
- The DMA controller uses the processor's buses.
- It is not a processor, so it has no instruction set of its own.
- DMA can transfer a block of data from I/O to memory, memory to I/O, or memory to memory without any intervention from the processor.
- To initiate a DMA transfer, the processor loads the following information into the DMA controller: the starting address, the number of words to be transferred, the direction of transfer, and the mode of transfer.
- After the transfer completes, the DMA controller informs the processor by raising an interrupt signal.

Operation of DMA with the CPU
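The initiation sequence above — program the controller's registers, let it move the block, take a completion interrupt — can be sketched as a toy Python model. All class, register, and function names here are invented for illustration; a real controller would arbitrate for the bus cycle by cycle:

```python
# Toy model of a DMA transfer: the processor programs the controller's
# registers, the controller moves the block with no further CPU
# involvement, and it raises an interrupt when the word count is exhausted.

class DMAController:
    def __init__(self, memory, device_buffer):
        self.memory = memory
        self.device = device_buffer

    def program(self, start_addr, word_count, direction, on_done):
        # Registers loaded by the processor to initiate the transfer.
        self.start_addr = start_addr
        self.word_count = word_count
        self.direction = direction     # 'dev_to_mem' or 'mem_to_dev'
        self.on_done = on_done         # completion-interrupt handler

    def run(self):
        for i in range(self.word_count):
            if self.direction == 'dev_to_mem':
                self.memory[self.start_addr + i] = self.device[i]
            else:
                self.device[i] = self.memory[self.start_addr + i]
        self.on_done()                 # raise the completion interrupt

memory = [0] * 16
device = [7, 8, 9, 10]
done = []

dma = DMAController(memory, device)
dma.program(start_addr=4, word_count=4, direction='dev_to_mem',
            on_done=lambda: done.append('interrupt'))
dma.run()

print(memory[4:8], done)  # [7, 8, 9, 10] ['interrupt']
```

Note that the "CPU" in this sketch only appears in `program()` and in the interrupt handler, mirroring the point that the block transfer itself proceeds without processor intervention.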
