2- Parallel_Platform.pdf
Full Transcript
Parallel Platform Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 1
Agenda 2.1 Implicit Parallelism: Trends in Microprocessor Architecture – Instruction Level Parallelism (ILP) – Pipelining & Superscalar Execution – Very Long Instruction Word (VLIW) & Vector Processors Memory subsystem – Cache Multiprocessing & Multithreading – Threads Parallel Architecture – 2.3.2 Communication model for parallel platform – 2.4.6 Cache coherence Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 2
Scope of Parallelism Conventional architectures coarsely comprise a processor, a memory system, and the datapath. Each of these components presents significant performance bottlenecks. Parallelism addresses each of these components in significant ways. Different applications utilize different aspects of parallelism, e.g., – data intensive applications utilize high aggregate throughput, – server applications utilize high aggregate network bandwidth, and – scientific applications typically utilize high processing and memory system performance. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 3
Implicit Parallelism: Trends in Microprocessor Architectures Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors. The question of how best to utilize these resources is an important one. Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle. The precise manner in which these instructions are selected and executed provides impressive diversity in architectures. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 4
Revision: Von Neumann architecture The von Neumann architecture is a computer architecture based on the design described in 1945 by the mathematician and physicist John von Neumann. It describes a design for an electronic digital computer with parts consisting of a processing unit containing an arithmetic logic unit and processor registers, a control unit containing an instruction register and program counter, a memory to store both data and instructions, external mass storage, and input and output mechanisms. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 5
Revision: CPU vs. Microprocessor Can you recognize the differences? CPU and Microprocessor are two words used interchangeably in many cases. CPU is the term we use for the unit that is responsible for the processing of data. Microprocessor is the chip that in most cases contains many other blocks to enhance the performance of the CPU. Microprocessor is the broader term, since it may contain multiple CPUs in its internal design, as in multi-threaded or multi-core microprocessors. For our learning in this class we view them as similar: Microprocessor = CPU. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 6
Revision: How do the units interact (Figure: registers AX–DX with enable inputs, two 4x1 MUXes with select inputs S0/S1, and an ALU/adder, all driven by the control unit.) The control unit is in complete control of the system at all times. It controls the behavior using select and enable inputs to the different elements of the system. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 7
Revision: The three cycle instruction execution model To execute a program, the user must first enter its instructions in binary format into the memory. The microprocessor then "reads" each instruction from memory, "interprets" it, and "executes" it. To use the right names for the cycles: – The microprocessor fetches each instruction, – decodes it, – then executes it. This sequence continues until all instructions are performed. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 8
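A minimal sketch in C of the fetch–decode–execute loop just described; the instruction encoding, the memory contents, and the register file are hypothetical illustration values, not part of the slides.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the three-cycle model on a made-up ISA:
   each instruction is one 32-bit word; opcode 0 = halt, 1 = r[a] += r[b]. */
uint32_t memory[256] = { 0x01000102, 0 };  /* program entered in binary form */
uint32_t r[4];                             /* processor registers            */
uint32_t pc = 0;                           /* program counter                */

int main(void) {
    for (;;) {
        uint32_t inst = memory[pc++];               /* fetch   */
        uint32_t op = inst >> 24;                   /* decode  */
        uint32_t a = (inst >> 8) & 0xFF, b = inst & 0xFF;
        if (op == 0) break;                         /* execute */
        if (op == 1) r[a] += r[b];
    }
    printf("halted at pc=%u\n", pc);
    return 0;
}
```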
What Is Instruction-Level Parallelism? Instruction-level parallelism (ILP) is parallelism among instructions of a program that enables concurrent execution of the overlapped instructions. All processors since about 1985 have used pipelining to concurrently execute the overlapped instructions and improve performance. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 9
CPU overview (Figure: CPU block diagram with program counter (PC), instruction cache, instruction fetch & decode, register file, ALU, and data cache.) Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 10
A Simple Implementation of MIPS Instruction set In a simple implementation every instruction takes 5 clock cycles to finish, as follows: – 1. Instruction fetch cycle (IF): Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC. – 2. Instruction decode/register fetch cycle (ID): Decode the instruction and read the registers corresponding to the register source specifiers from the register file. Do the equality test on the registers as they are read, for a possible branch. Compute the possible branch target address by adding the sign-extended offset to the incremented PC. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 11
A Simple Implementation of MIPS Instruction set – 3. Execution/effective address cycle (EX): The ALU operates on the operands prepared in the prior cycle and does one of the following: Memory reference: adds the base register and the offset to form the effective address. Register-Register: performs the operation specified by the opcode on the values read from the register file. Register-Immediate: performs the operation specified by the opcode on the first value read from the register file and the sign-extended immediate. – 4. Memory access (MEM): Load: memory does a read using the effective address computed in the previous cycle. Store: memory writes the data from the second register read from the register file using the effective address. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 12
A Simple Implementation of MIPS Instruction set – 5. Write-back cycle (WB): Register-Register ALU instruction or Load instruction: Write the result into the register file, whether it comes from – the memory system (for a load) or – the ALU (for an ALU instruction). Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 13
Conventional Processor Performance Processor with five stages. Each stage takes one clock cycle. Performance of that processor in cycles is: Cycles Per Instruction (CPI) = 5 cycles Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 14
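A small worked calculation for the unpipelined five-stage processor above: total cycles = instructions x CPI. The instruction count and clock frequency used below are made-up illustration values.

```c
#include <stdio.h>

/* Unpipelined 5-stage processor: every instruction takes CPI = 5 cycles. */
int main(void) {
    const double cpi = 5.0;
    const double insts = 1e9;      /* hypothetical program: 10^9 instructions */
    const double clock_hz = 2e9;   /* hypothetical 2 GHz clock                */
    double cycles = insts * cpi;
    printf("cycles = %.0f, time = %.2f s\n", cycles, cycles / clock_hz);
    return 0;
}
```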
Pipelining Execution Pipelining overlaps various stages of instruction execution to achieve performance. At a high level of abstraction, an instruction can be executed while the next one is being decoded and the next one is being fetched. This is akin to an assembly line for the manufacture of cars. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 15
What is pipelining? Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of parallelism that exists among the actions needed to execute an instruction. Today, pipelining is the key implementation technique used to make fast CPUs. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, although on a different car. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 16
What is pipelining? Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe. Instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line. In an automobile assembly line, throughput is defined as the number of cars per hour and is determined by how often a completed car exits the assembly line. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 17
The Classic Five-Stage Pipeline for a RISC Processor We can pipeline the execution described for the MIPS with almost no changes by simply starting a new instruction on each clock cycle. Although each instruction takes 5 clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of five different instructions. Figure A.1 Simple RISC pipeline. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 18
Processor pipeline The time required between moving an instruction one step down the pipeline is a processor cycle. The pipeline designer's goal is to balance the length of each pipeline stage. In steady state, where the pipeline is always full of instructions: Time per instruction on the pipelined processor = (Time per instruction on the unpipelined machine) / (Number of pipe stages). The speedup from pipelining therefore equals the number of pipe stages. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 19
Pipelined Processor Performance Processor with five stages. Each stage takes one clock cycle. Performance of that processor in cycles is: Cycles Per Instruction (CPI) = 1 cycle Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 20
The Classic Five-Stage Pipeline for a RISC Processor We should make sure we don't try to perform two different operations with the same data path resource on the same clock cycle. Figure A.2 The pipeline can be thought of as a series of data paths shifted in time. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 21
The Classic Five-Stage Pipeline for a RISC Processor Three observations: – Separate instruction and data memories are used. – The register file is used in two stages: Two reads in ID: second half of clock cycle. One write in WB: first half of clock cycle. – Increment and store the PC every clock, during the IF stage, in preparation for the next instruction. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 22
Pipelining Execution Pipelining, however, has several limitations. The speed of a pipeline is eventually limited by the slowest stage. For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors). Why? However, in typical program traces, every 5th-6th instruction is a conditional jump! This requires very accurate branch prediction. Different types of dependences among the instructions limit the level of concurrency among them. With pipelining we CANNOT reduce the CPI below ONE. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 23
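A short sketch of the two ideas above: it prints which stage each instruction occupies per cycle in an ideal five-stage pipeline with no stalls, and evaluates the speedup formula. The instruction count and the assumption of zero stalls are illustrative, not from the slides.

```c
#include <stdio.h>

/* Ideal 5-stage pipeline, no stalls: instruction i occupies stage (cycle - i)
   during cycles i .. i+4. Also evaluates the speedup formula from the slide. */
int main(void) {
    const char *stage[5] = { "IF", "ID", "EX", "MEM", "WB" };
    int n = 6;                                  /* instructions shown (made up) */
    for (int cycle = 0; cycle < n + 4; cycle++) {
        printf("cycle %2d:", cycle + 1);
        for (int i = 0; i < n; i++) {
            int s = cycle - i;                  /* stage instruction i is in   */
            if (s >= 0 && s < 5) printf("  I%d:%-3s", i + 1, stage[s]);
        }
        printf("\n");
    }
    double unpipelined = 5.0;                   /* time per instruction, CPI=5 */
    double pipelined = unpipelined / 5.0;       /* divided by number of stages */
    printf("ideal speedup = %.1f (the number of pipe stages)\n",
           unpipelined / pipelined);
    return 0;
}
```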
Superscalar Execution One simple way of alleviating these bottlenecks is to use multiple pipelines. The question then becomes one of selecting these instructions. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 24
Exploiting ILP Using Multiple Issue Pipelining can achieve an ideal CPI of one. The pipeline discussed so far considers only ONE instruction to issue per cycle, even with the availability of multiple FUs of different types. To improve performance further we would like to decrease the CPI to less than one. The CPI cannot be reduced below one if we issue only one instruction every clock cycle. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 25
Superscalar Processors The goal of multi-issue (superscalar) processors is to allow multiple instructions to issue in a clock cycle. Two conditions need to be satisfied to achieve multi-issue capability: – Multiple FUs of different or similar types are available in the processor hardware. – Instruction scheduling must be converted from serial to parallel. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 26
Instruction Scheduling in Superscalar Processors The condition for successful scheduling is that all kinds of hazard detection and prevention are available at either the hardware or software level. (Figure: example code translated to assembly.) Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 27
ILP: Data Dependences Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited. In particular, to exploit ILP we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls. If two instructions are dependent, they are not parallel and must be executed in order, although they may often be partially overlapped. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 28
Data Dependences & Data Hazards If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. The dependence implies that there would be a chain of one or more data hazards between the two instructions. Executing the instructions simultaneously will cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or eliminating the overlap. The presence of a data dependence in an instruction sequence reflects a data dependence in the source code from which the instruction sequence was generated. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 29
Data Dependences & Data Hazards A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence. Because of the dependence, we must preserve what is called program order. Program Order: the order in which the instructions would execute if executed sequentially, one at a time, as determined by the original source program. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 30
Data Dependences Consider the following MIPS code sequence that increments a vector of values in memory (starting at 0(R1), and with the last element at 8(R2)) by a scalar in register F2.
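The MIPS sequence itself appears only as a figure in the original slide. Below is a C rendition of the same loop, sketched here as an illustration (variable names are hypothetical), with comments marking the two dependence chains the slide's arrows refer to.

```c
/* Illustrative C version of the slide's loop: add the scalar s to every
   element of a vector held in memory. */
void increment_vector(double *a, long n, double s) {
    for (long i = 0; i < n; i++) {
        double x = a[i];   /* load: writes x                                  */
        x = x + s;         /* add: reads x written by the load -> dependence  */
        a[i] = x;          /* store: reads x written by the add -> dependence */
        /* the loop index carries the second chain: its update feeds the next */
        /* iteration's address calculation and the loop-exit test             */
    }
}
```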
Two types of dependences exist in this code sequence: 1- data dependences through the floating-point registers (the loaded value feeds the add, and the add result feeds the store); 2- data dependences through the integer register R1 used to address the vector. The arrows show the order that must be preserved for correct execution. An arrow points from an instruction that must precede the instruction that the arrowhead points to. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 31
Data Hazards and instruction order H/W, S/W, or both could be used for ILP exploitation. The goal of both our software and hardware ILP exploitation techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that the necessary program order is preserved, so parallelism will not affect the original goal of correctly executing the code. Data hazards are classified into different types according to the original order of execution of the dependent instructions. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 32
Instruction Scheduling in Superscalar Processors Instruction Scheduling Methods: – Statically scheduled superscalar processors (i.e., Very Long Instruction Word (VLIW)) – Dynamically scheduled superscalar processors Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 33
Dynamic (H/W) scheduling for Superscalar Execution Scheduling of instructions is determined by a number of factors: – True Data Dependency: The result of one operation is an input to the next. – Resource Dependency: Two operations require the same resource. – Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori. – The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors. – The complexity of this hardware is an important constraint on superscalar processors. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 34
Superscalar Execution: Issue Mechanisms In the simpler model, instructions can be issued only in the order in which they are encountered. That is, if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order execution. In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue or Out-Of-Order (o-o-o) execution. Performance of in-order issue is generally limited. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 35
Superscalar Execution: An Example (Figure: issue slots versus number of clock cycles for three schedules of the same computation, completing in 4 cycles, 5 cycles, and 6 cycles respectively.) Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 36
Superscalar Execution: An Example In the above example, there is some wastage of resources due to data dependencies. The example also illustrates that different instruction mixes with identical semantics can take significantly different execution time. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 37
Superscalar Execution: Efficiency Considerations Not all functional units can be kept busy at all times. If during a cycle no functional units are utilized, this is referred to as vertical waste. If during a cycle only some of the functional units are utilized, this is referred to as horizontal waste. Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited. Conventional microprocessors typically support four-way superscalar execution. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 38
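A small sketch of the vertical/horizontal waste accounting defined above, for a four-way superscalar processor. The per-cycle utilization trace is a made-up illustration, not measured data.

```c
#include <stdio.h>

/* Vertical waste: all slots of a cycle go unused.
   Horizontal waste: unused slots in a cycle that issues at least one instruction. */
int main(void) {
    const int width = 4;                        /* issue slots per cycle (4-way) */
    int used[] = { 4, 2, 0, 3, 1, 0, 4, 2 };    /* slots used each cycle (made up) */
    int cycles = sizeof used / sizeof used[0];
    int vertical = 0, horizontal = 0;
    for (int c = 0; c < cycles; c++) {
        if (used[c] == 0) vertical += width;              /* whole cycle wasted   */
        else              horizontal += width - used[c];  /* leftover slots wasted */
    }
    printf("vertical waste = %d slots, horizontal waste = %d slots (of %d total)\n",
           vertical, horizontal, width * cycles);
    return 0;
}
```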
Static (S/W) Scheduling The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design. To address these issues, Very Long Instruction Word (VLIW) processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently. These instructions are packed and dispatched together, and thus the name very long instruction word. This concept was used with some commercial success in the Multiflow Trace machine (circa 1984). Variants of this concept are employed in the Intel IA64 processors. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 39
Statically Scheduled Superscalar processors (VLIW) VLIW processors issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction. VLIW processors are inherently statically scheduled by the compiler. VLIWs use multiple, independent functional units. A VLIW packages the multiple operations into one very long instruction. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 40
Very Long Instruction Word (VLIW) Processors: Considerations Issue hardware is simpler. The compiler has a bigger context from which to select co-scheduled instructions. Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative. Branch and memory prediction is more difficult. VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution, and branch prediction are critical. Typical VLIW processors are limited to 4-way to 8-way parallelism. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 41
Memory Subsystem Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 42
Performance: Processor vs. Memory The rate of increase in processor speed is much higher than that of memory. This increases the gap between processor and memory speeds, which mandates intermediate stages to bridge this gap. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 43
Introduction & Motivation Principle of locality: Programs tend to reuse data and instructions they have used recently. – Rule of thumb: A program spends 90% of its execution time in only 10% of the code. – We can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. – Two types of locality: Temporal locality: recently accessed items are likely to be accessed in the near future. Spatial locality: items whose addresses are near one another tend to be referenced close together in time. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 44
Introduction & Motivation The principle of locality says that most programs do not access all code or data uniformly. Locality occurs in time (temporal locality) and in space (spatial locality). The hardware design guideline that smaller hardware is faster led to hierarchy-based memories of different speeds and sizes. Parallel & Distributed Processing- 1502412 Dr.
Ali El-Moursy 45 Computer Memory Hierarchy Since fast memory is expensive, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the next lower level. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 46 Computer Memory Hierarchy Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 47 Cache: Concept Cache: is the name given to the highest or first level of the memory hierarchy encountered once the address leaves the processor. The principle of locality applies at many levels, and taking advantage of locality to improve performance is popular Cache: is now applied whenever buffering is employed to reuse commonly occurring items. Examples include file caches, name caches, and so on. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 48 Cache Hierarchy Cache is a small fast storage (compared to the Main Memory) that is closer to the processor so its access is faster than Main Memory. Originally it is being build as one level and later with the advance in its design it is now built as multi-level. Starting closer to the processor the cache level is said to be the higher-level of cache. It is also the smallest, fastest and most expensive cache design. Moving toward Main Memory each cache level is said to be lower, slower and larger in size. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 49 Cache Design The smallest unit of data in the computer system is the byte. Bytes are grouped into bigger unit called word. Each instruction is basically one word. Also each memory reference instruction is presumably access one word of data. Unit of data access between Processor and Cache is performed on a word bases. Block: continuous data words are even grouped into bigger sizes called blocks. Data transfer between cache and Main Memory is done on a block-level also called cache-line Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 50 Cache blocks & locality Cache supports both types of localities. Temporal locality tells us that we are likely to need this word again in the near future, – Cache includes a sub-set of data in the memory since it is much smaller than memory size. – Data in the cache are the most recently used data. – It is useful to place those data in the cache where it can be accessed quickly. Spatial locality there is a high probability that the other data in the block will be needed soon. – Block includes data close to each other in the memory Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 51 Cache Operations & definitions Cache access: when the processor is to access the computer system memory (usually called memory reference) to get information (either get instruction or data and usually called data), the access first send the memory effective address to the cache to get the word of requested data. Cache hit: If data exist in the cache then cache will supply the processor with data so it will fulfill the request. We call this a cache hit. Cache miss: If data doesn’t exist in cache then data should be supplied by lower-level memory (lower-level cache or Main Memory) Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 52 Cache miss At cache miss the lower-level memory is accessed to get the full block that contains the requested data word and place it in the cache for later usage. The fetched block from the lower-level memory will replace an existing cache block (if no block is free) Parallel & Distributed Processing- 1502412 Dr. 
Ali El-Moursy 53
Cache access time Cache hit access time matches processor speed. Cache miss access time matches Main Memory speed. The time required for a cache miss depends on both the latency and bandwidth of the memory. – Latency determines the time to retrieve the first word of the block. – Bandwidth determines the time to retrieve the rest of the block. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 54
Multiprocessing & Multithreading Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 55
Multiprocessing Execution on a Uniprocessor The micro-architecture designs explored in previous sections are all for a uniprocessor that could execute one process (running program) at any given time. In the early 90's, Operating System (OS) designers introduced the concept of multitasking. In multitasking, multiple processes can run concurrently on a single uniprocessor using the concept of time-sharing. During each time slot one of the processes runs on the processor. After the time slot finishes, the OS swaps out the process and swaps in another one to execute on the processor. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 56
Multiprocessing Execution on a Uniprocessor Swapping processes in and out is also called context switching. In order to continue the process's execution at the correct instruction, and not to lose the processing performed so far on the process's data, a group of architecture variables is saved during swap out and then reloaded to the processor during swap in. This set of variables is called the process state. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 57
Process State Process State: Defines the execution of a program on a processor and reflects the progress in program execution at any given instant in time. Process State represents the values of the machine structures including: – Program Counter (PC) – Architecture (logical) registers – Program Memory Space Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 58
What is a thread? Software Thread: part of the main process, with its own instructions and data, that could be executed separately from the main process. A process is one of two types: – Serial program, which has only ONE thread. – Parallel program, which consists of multiple threads. Each thread in a parallel program has a separate state per thread (instructions, PC, register state, and so on) necessary to allow it to execute. All threads in a parallel program share the memory space. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 59
Introduction: What is a thread? (Figure: a process and the threads it contains.) Creating a thread by the main process to run independently is called thread forking. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 60
Multithreading for Latency Hiding A thread is a single stream of control in the flow of a program. We illustrate threads with a simple example: for (i = 0; i < n; i++) c[i] = dot_product(get_row(a, i), b); Each dot-product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as: for (i = 0; i < n; i++) c[i] = create_thread(dot_product, get_row(a, i), b); Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 61
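The create_thread call above is pseudocode. A minimal sketch of the same idea with POSIX threads is shown below; dot_product, get_row, and the problem size n are assumed helpers/values taken from the pseudocode, not a real library API.

```c
#include <pthread.h>
#include <stdlib.h>

/* One thread per row computes c[i] = dot_product(row_i, b), then the
   creator waits for all of them. dot_product/get_row/n are assumed. */
extern double dot_product(const double *row, const double *v);
extern const double *get_row(const double *a, long i);
extern long n;

struct task { const double *a, *b; double *c; long i; };

static void *worker(void *p) {
    struct task *t = p;
    t->c[t->i] = dot_product(get_row(t->a, t->i), t->b);
    return NULL;
}

void parallel_dot_products(const double *a, const double *b, double *c) {
    pthread_t *tid = malloc(n * sizeof *tid);
    struct task *tasks = malloc(n * sizeof *tasks);
    for (long i = 0; i < n; i++) {                 /* fork one thread per row   */
        tasks[i] = (struct task){ a, b, c, i };
        pthread_create(&tid[i], NULL, worker, &tasks[i]);
    }
    for (long i = 0; i < n; i++)                   /* join: wait for every thread */
        pthread_join(tid[i], NULL);
    free(tasks);
    free(tid);
}
```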
Multithreading: Exploit Thread-Level Parallelism Although increasing performance by using ILP has the great advantage that it is reasonably transparent to the programmer, as we have seen, ILP can be quite limited or hard to exploit in some applications. There may be significant parallelism occurring naturally at a higher level in the application that cannot be exploited with the approaches discussed for ILP. An online transaction-processing system has natural parallelism among the multiple queries and updates that are presented by requests. So queries and updates can be processed in parallel, since they are largely independent of one another. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 62
Multithreading: Parallelize Application using Threads By building the application as a parallel program using the threads concept, we exploit higher-level parallelism, which is called thread-level parallelism (TLP). There are many important applications where TLP occurs naturally, as it does in many server applications. In some embedded applications, the software is being written from scratch, and expressing the inherent parallelism is easy. Large, established applications written without parallelism in mind pose a significant challenge and can be extremely costly to rewrite to exploit TLP. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 63
Multithreading: How to utilize TLP by the Hardware? One natural question to ask is whether it is possible for a processor oriented at ILP to exploit TLP. Motivation: the observation is that a data path designed to exploit higher amounts of ILP will find that functional units are often idle because of either stalls or dependences in the code. Could the parallelism among threads be used as a source of independent instructions that might keep the processor busy during stalls? Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists? Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 64
Multithreading: Hardware support to exploit TLP Multithreading: refers to the support in the hardware of the processor to allow multiple threads to share the functional units of a single processor in an overlapping fashion, and accordingly exploit or utilize TLP. How can a processor support TLP for parallel programs? – Program memory space is shared among all threads of a parallel application, but each thread has a separate state. – The processor must duplicate the independent state of each thread, including the logical registers and the PC. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 65
Multithreading: Granularity of Functional Unit (FU) Sharing How four different approaches use the issue slots of a superscalar processor. The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 66
Multi-thread, Multi-core & Multi-processor (Figure: pipeline and memory-subsystem resources — instruction cache, fetch queue, decode, rename, issue queues, register files, functional units, load/store queue, L1/L2 caches, and main memory — shown as replicated or shared in a multithreaded processor, a multi-core processor, a uni-processor, and a multiprocessor.) Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 67
Multi-thread, Multi-core & Multi-processor The most characteristic feature of multithreading is the sharing of the CPU pipeline among two or more running processes or threads.
In multi-core the sharing of resource is moved from the CPU to the memory subsystem mainly the cache. Since the cache is usually built as multi-level (L1, L2 and even L3) different flavors of Multi-core processors has recently designed In some of them the sharing of the cache starts as high as the L1 while in others only L3 is shared. In Multi-processor systems the sharing is moved one level further to the system memory. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 68 Parallel Architecture Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 69 A Taxonomy of Parallel Processing All computer processing are categorized to: – Single instruction stream, single data stream (SISD)—This category is the serial process. – Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Each processor has its own data memory (hence multiple data), but there is a single instruction memory and control processor, which fetches and dispatches instructions. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 70 A Taxonomy of Parallel Architecture All computer processing are categorized to: – Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this type has been built to date. – Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own instructions and operates on its own data. MIMD computers exploit thread-level parallelism , since multiple threads operate in parallel. In general, thread-level parallelism is more flexible than data-level parallelism and thus more generally applicable. It is the architecture of choice for general-purpose multiprocessors Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 71 Communication Model of Parallel Platforms Existing MIMD multiprocessors fall into two classes, depending on the memory organization and interconnect strategy. There are two primary forms of data exchange between parallel tasks – Accessing a shared data space and – Exchanging messages. Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. Platforms that support messaging are also called message passing platforms or multicomputers. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 72 Centralized (Symmetric) Shared-Memory Architecture This type has at most a few dozen processor chips (and less than 100 cores). – For multiprocessors with small processor counts, it is possible for the processors to share a single centralized memory. – By using multiple point-to-point connections, or a switch, and adding additional memory banks, a centralized shared-memory design can be scaled to a few dozen processors. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 73 Multiprocessor Architecture Basic structure of a centralized shared-memory multiprocessor. Multiple processor-cache subsystems share the same physical memory, typically connected by one or more buses or a switch. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 74 Centralized (Symmetric) Shared-Memory Architecture The key architectural property is the uniform access time to all of memory from all the processors. 
These multiprocessors are most often called symmetric (shared-memory) multiprocessors (SMPs),and This style of architecture is sometimes called uniform memory access (UMA) Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 75 Multiprocessor Architecture By sharing the memory sub-system a new set of issues, problems and concerns are to be addressed. Those issues are similar in the large variation of multiprocessors and highly dependent on the scale of the multiprocessor Scale represents the number of processors sharing the memory sub-system. We have chosen to focus on the mainstream of multiprocessor design: multiprocessors with small to medium numbers of processors ( 4 to 32 ). Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 76 Symmetric Shared-Memory & Cache Symmetric shared-memory machines usually support the caching of both shared and private data. When shared data are cached, the shared value may be replicated in multiple caches. Advantage: – This replication provides a reduction in contention that may exist for shared data items that are being read by multiple processors simultaneously. Disadvantage: – Caching of shared data, however, introduces a new problem: cache coherence Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 77 Cache Coherence in Multiprocessor Systems Interconnects provide basic mechanisms for data transfer. In the case of shared address space machines, additional hardware is required to coordinate access to data that might have multiple copies in the network. The underlying technique must provide some guarantees on the semantics. This guarantee is generally one of serializability, i.e., there exists some serial order of instruction execution that corresponds to the parallel schedule. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 78 What is Multiprocessor Cache Coherency? Caching shared data introduces a new problem because the view of memory held by two different processors is through their individual caches, which, without any additional precautions, could end up seeing two different values. This difficulty is generally referred to as the cache coherence problem. We also assume a write-through cache Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 79 Cache Coherence in Multiprocessor Systems Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 80 Distributed-Memory Architecture The second group consists of multiprocessors with physically distributed memory. To support larger processor counts, memory must be distributed among the processors rather than centralized Otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors without incurring excessively long access latency. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 81 Distributed-Memory Architecture A distributed-memory multiprocessor consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 82 Distributed-Memory Architecture Advantage: – It is a cost-effective way to scale the memory bandwidth if most of the accesses are to the local memory in the node. – It reduces the latency for accesses to the local memory. Disadvantage: – Communicating data between processors becomes more complex, and that it requires more effort in the software to take advantage of the increased memory bandwidth afforded by distributed memories. 
Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 83 Models for Communication and Distributed-Memory Architecture Message-Passing Multiprocessors: :- – Also called Separate Address Space. The address space consists of multiple private address spaces that are logically disjoint and cannot be addressed by a remote processor. – In such multicomputer system, the same physical address on two different machines refers to two different locations in two different memories. – Each processor-memory module is essentially a separate computer. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 84 Message-Passing Platforms These platforms comprise of a set of processors and their own (exclusive) memory. Instances of such a view come naturally from clustered workstations and non-shared-address- space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Libraries such as MPI and PVM provide such primitives. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 85 Models for Communication and Distributed-Memory Architecture Separate Address Space: – Message Passing is its main programming paradigm for this class of machines. Distributed shared-memory (DSM) : – Also called Shared Address Space. As it does in a symmetric shared-memory architecture, the physically separate memories can be addressed as one logically shared address space meaning that a memory reference can be made by any processor to any memory location. – The DSM multiprocessors are also called NUMAs (nonuniform memory access), since the access time depends on the location of a data word in memory. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 86 Shared-Address-Space Platforms Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared-address-space. If the time taken by a processor to access any memory word in the system global or local is identical, the platform is classified as a uniform memory access (UMA), else, a non-uniform memory access (NUMA) machine. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 87 NUMA and UMA Shared-Address-Space Platforms Memory Memory Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 88 NUMA and UMA Shared-Address-Space Platforms The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance. Programming these platforms is easier since reads and writes are implicitly visible to other processors. However, read-write data to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming). Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem. A weaker model of these machines provides an address map, but not coordinated access. These models are called non cache coherent shared address space machines. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 89 Shared-Address-Space vs. Shared Memory Machines It is important to note the difference between the terms shared address space and shared memory. We refer to the former as a programming abstraction and to the latter as a physical machine attribute. It is possible to provide a shared address space using a physically distributed memory. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 90 Message Passing vs. 
Shared Address Space Platforms Message passing requires little hardware support, other than a network. Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner). Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 91
Parallel Systems – Hardware: a Centralized Memory Architecture or a Distributed Memory Architecture. – Programmer vision: a Centralized Memory Architecture presents a Shared-Address Space (UMA); a Distributed Memory Architecture presents either a Shared-Address Space (NUMA) or a Separate-Address Space. – Parallel Programming Paradigm: Multi-Thread for UMA, Multi-Thread for NUMA, and MPI for the Separate-Address Space. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 92
Backup Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 93
Cache Coherence Protocols For each cache block we identify a new field called the state of the cache block. Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. There are two classes of protocols, which use different techniques to track the sharing status: – Directory based — The sharing status of a block of physical memory is kept in just one location, called the directory. Directory-based coherence has slightly higher implementation overhead than snooping, but it can scale to larger processor counts. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 94
Cache Coherence Protocols There are two classes of protocols, which use different techniques to track the sharing status: – Snooping — Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept. The caches are all accessible via some broadcast medium (a bus or switch), and all cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 95
Snooping Protocols There are two classes of Snooping protocols: Write Invalidate: a processor has exclusive access to a data item before it writes that item. – This style of protocol is called a write invalidate protocol because it invalidates other copies on a write. – It is the most common protocol, both for snooping and for directory schemes. – Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: – All other cached copies of the item are invalidated. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 96
Snooping Protocols There are two classes of Snooping protocols: Write Invalidate: a processor has exclusive access to a data item before it writes that item. – If two processors do attempt to write the same data simultaneously, one of them wins the race, causing the other processor's copy to be invalidated. – For the other processor to complete its write, it must obtain a new copy of the data, which must now contain the updated value. – Therefore, this protocol enforces write serialization. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 97
Snooping Protocols There are two classes of Snooping protocols: Write Update: The alternative to an invalidate protocol is to update all the cached copies of a data item when that item is written. – Because a write update protocol must broadcast all writes to shared cache lines, it consumes considerably more bandwidth. – For this reason, all recent multiprocessors have opted to implement a write invalidate protocol. Parallel & Distributed Processing- 1502412 Dr.
Ali El-Moursy 98 Write invalidate Protocol (MSI) MSI is a very famous write invalidate protocol in which each data block has four different states:- – Modify (M) state: block is modified by the processor so the block ownership is given to the local data block copy All other copies are in the invalid state (I) Any remote read/write request should: – first update the local block from the cache that has the modified block then Change status according to the new access Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 99 Cache Coherence in Multiprocessor Systems Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 100 Cache Coherence: Update and Invalidate Protocols If a processor just reads a value once and does not need it again, an update protocol may generate significant overhead. If two processors make interleaved test and updates to a variable, an update protocol is better. Both protocols suffer from false sharing overheads (two words that are not shared, however, they lie on the same cache line). Most current machines use invalidate protocols. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 101 Maintaining Coherence Using Invalidate Protocols Each copy of a data item is associated with a state. One example of such a set of states is, shared, invalid, or dirty. In shared state, there are multiple valid copies of the data item (and therefore, an invalidate would have to be generated on an update). In dirty state, only one copy exists and therefore, no invalidates need to be generated. In invalid state, the data copy is invalid, therefore, a read generates a data request (and associated state changes). Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 102 Maintaining Coherence Using Invalidate Protocols Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 103 Maintaining Coherence Using Invalidate Protocols Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 104 Snoopy Cache Systems How are invalidates sent to the right processors? In snoopy caches, there is a broadcast media that listens to all invalidates and read requests and performs appropriate coherence operations locally. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 105 Performance of Snoopy Caches Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic. If a data item is read by a number of processors, it transitions to the shared state in the cache and all subsequent read operations become local. If processors read and update data at the same time, they generate coherence requests on the bus - which is ultimately bandwidth limited. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 106 Directory Based Systems In snoopy caches, each coherence operation is sent to all processors. This is an inherent limitation. Why not send coherence requests to only those processors that need to be notified? This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 107 Directory Based Systems Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 108 Performance of Directory Based Schemes The need for a broadcast media is replaced by the directory. The additional bits to store the directory may add significant overhead. The underlying network must be able to carry all the coherence requests. 
The directory is a point of contention; therefore, distributed directory schemes must be used. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 109
Dichotomy of Parallel Computing Platforms An explicitly parallel program must specify concurrency and interaction between concurrent subtasks. The former is sometimes also referred to as the control structure and the latter as the communication model. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 110
Control Structure of Parallel Programs Parallelism can be expressed at various levels of granularity - from instruction level to processes. Between these extremes exist a range of models, along with corresponding architectural support. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 111
Control Structure of Parallel Programs Processing units in parallel computers either operate under the centralized control of a single control unit or work independently. If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD). If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD). Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 112
SIMD and MIMD Processors Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 113
SIMD Processors Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines. Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc. SIMD relies on the regular structure of computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines if a processor should participate in a computation or not. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 114
Conditional Execution in SIMD Processors Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 115
MIMD Processors In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program multiple data streams (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 116
SIMD-MIMD Comparison SIMD computers require less hardware than MIMD computers (single control unit). However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles. Not all applications are naturally suited to SIMD processors. In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time. Parallel & Distributed Processing- 1502412 Dr. Ali El-Moursy 117
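As a concrete illustration of the SPMD style and of the send/receive primitives mentioned earlier for message-passing platforms, here is a minimal MPI sketch in C; the payload values and message tag are arbitrary illustration choices.

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: every process runs this same program and branches on its rank.
   Rank 0 sends one integer to every other rank; the others receive it.
   Payload and tag values below are made up for illustration. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int dest = 1; dest < size; dest++) {
            int work = 100 + dest;                    /* illustrative payload */
            MPI_Send(&work, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        int work;
        MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d of %d received %d\n", rank, size, work);
    }

    MPI_Finalize();
    return 0;
}
```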