Lecture 2 on Uniprocessors and Multiprocessors PDF

Summary

This lecture provides an overview of uniprocessors and multiprocessors, including topics like compiler optimizations, levels of parallelism, superscalar architectures, instruction pipelines, and hazards. The content is suitable for an undergraduate-level computer science course.

Full Transcript


Uniprocessors Overview

PART I: Uniprocessors and Compiler Optimizations
- Uniprocessors
  - Processor architectures
  - Instruction set architectures
  - Instruction scheduling and execution
- Data storage
  - Memory hierarchy
  - Caches, TLB
- Compiler optimizations

PART II: Multiprocessors and Parallel Programming Models
- Multiprocessors
  - Pipeline and vector machines
  - Shared memory
  - Distributed memory
  - Message passing
- Parallel programming models
  - Shared vs distributed memory
  - Hybrid
  - BSP

Levels of Parallelism

[figure: levels of parallelism, from superscalar processors (fine grain) to multicore processors (stacked, fine grain to coarse grain)]

Superscalar Architectures

[figures: superscalar architectures]

Instruction Pipeline

- An instruction pipeline increases the instruction bandwidth
- Classic 5-stage pipeline:
  - IF: instruction fetch
  - ID: instruction decode
  - EX: execute (functional units)
  - MEM: load/store
  - WB: write back to registers and forward results into the pipeline when needed by another instruction
- [figure: five instructions are in the pipeline at different stages]

Instruction Pipeline Example

- Example 4-stage pipeline
- Four instructions (green, purple, blue, red) are processed in this order (a small simulation sketch follows this list):
  - Cycle 0: instructions are waiting
  - Cycle 1: green is fetched
  - Cycle 2: purple is fetched, green decoded
  - Cycle 3: blue is fetched, purple decoded, green executed
  - Cycle 4: red is fetched, blue decoded, purple executed, green in write-back
  - Etc.
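The cycle-by-cycle table above can be reproduced with a few lines of code. The following C sketch is not part of the lecture; the stage and instruction names are simply the ones used on the slide. It prints which instruction occupies which stage of an ideal, stall-free 4-stage pipeline in each cycle:

    #include <stdio.h>

    int main(void) {
        const char *stages[] = { "IF", "ID", "EX", "WB" };
        const char *instr[]  = { "green", "purple", "blue", "red" };
        const int n_stages = 4, n_instr = 4;

        /* With no stalls, instruction i (0-based) occupies stage s in cycle i + s + 1 */
        for (int cycle = 1; cycle <= n_instr + n_stages - 1; cycle++) {
            printf("Cycle %d:", cycle);
            for (int i = 0; i < n_instr; i++) {
                int s = cycle - i - 1;   /* stage index of instruction i in this cycle */
                if (s >= 0 && s < n_stages)
                    printf("  %s:%s", instr[i], stages[s]);
            }
            printf("\n");
        }
        return 0;
    }

A one-cycle IF stall of the purple instruction, as in the hazards example below, would shift purple and every later instruction down by one cycle and leave a bubble (NOP) in the ID stage at cycle 3.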
Instruction Pipeline Hazards

- Pipeline hazards arise:
  - From data dependences (forwarding in the WB stage can eliminate some data dependence hazards)
  - From instruction fetch latencies (e.g. an I-cache miss)
  - From memory load latencies (e.g. a D-cache miss)
- A hazard is resolved by stalling the pipeline, which causes a bubble of one or more cycles
- Example:
  - Suppose a stall of one cycle occurs in the IF stage of the purple instruction
  - Cycle 3: purple cannot be decoded and a no-operation (NOP) is inserted

N-way Superscalar

- When instructions have the same word size, the processor can fetch multiple instructions without having to know the instruction content of each
- N-way superscalar processors fetch N instructions each cycle, which increases the instruction-level parallelism
- [figure: 2-way superscalar RISC pipeline]

The Instruction Set in a Superscalar Architecture

- Pentium processors translate CISC instructions to RISC-like µOps
- Advantages:
  - Higher instruction bandwidth
  - Maintains instruction set architecture (ISA) compatibility
- The Pentium 4 has a 31-stage pipeline divided into three main stages:
  - Fetch and decode
  - Execution
  - Retirement
- [figure: simplified block diagram of the Intel Pentium 4]

Instruction Fetch and Decode

- The Pentium 4 decodes instructions into µOps and deposits the µOps in a trace cache
  - This allows the processor to fetch the µOps trace of an instruction that is executed again (e.g. in a loop)
- Instructions are fetched:
  - Normally in the same order as stored in memory
  - Or from branch targets predicted by the branch prediction unit
- The Pentium 4 only decodes one instruction per cycle, but can deliver up to three µOps per cycle to the execution stage
- RISC architectures typically fetch multiple instructions per cycle

Instruction Execution Stage

- Executes multiple µOps in parallel: instruction-level parallelism (ILP)
- The scheduler marks a µOp for execution when all operands of the µOp are ready
  - The µOps on which a µOp depends must be executed first
  - A µOp can be executed out of the order in which it appeared
- Pentium 4: a µOp is re-executed when its operands were not ready
- On the Pentium 4 there are 4 ports to which a µOp can be sent; each port has one or more fixed execution units:
  - Port 0: ALU0, FPMOV
  - Port 1: ALU1, INT, FPEXE
  - Port 2: LOAD
  - Port 3: STORE

Retirement

- Looks for instructions to mark completed:
  - Are all µOps of the instruction executed?
  - Are all µOps of the preceding instruction retired? (puts instructions back in order)
- Notifies the branch prediction unit when a branch was incorrectly predicted
  - The processor stops executing the wrongly predicted instructions and discards them (takes ≈10 cycles)
- The Pentium 4 retires up to 3 instructions per clock cycle

Software Optimization to Increase CPU Throughput

- Processors run at maximum speed (a high instructions-per-cycle (IPC) rate) when:
  1. There is a good mix of instructions (with low latencies) to keep the functional units busy
  2. Operands are available quickly from registers or the D-cache
  3. The FP to memory operation ratio is high (FP : MEM > 1)
  4. The number of data dependences is low
  5. Branches are easy to predict
- The processor can only improve #1 to a certain level with out-of-order scheduling, and partly #2 with hardware prefetching
- Compiler optimizations effectively target #1-3
- The programmer can help improve #1-5

Instruction Latency and Throughput

- Latency: the number of clocks to complete an instruction when all of its inputs are ready
- Throughput: the number of clocks to wait before starting an identical instruction
  - Identical instructions are those that use the same execution unit
- The example a=u*v; b=w*x; c=y*z shows three multiply operations, assuming there is only one multiply execution unit (a sketch follows this list)
- Typical actual latencies (in cycles):
  - Integer add: 1
  - FP add: 3
  - FP multiplication: 3
  - FP division: 31
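As a rough illustration, not from the lecture, the contrast between latency and throughput shows up in how independent and dependent multiplies can be scheduled. The cycle numbers in the comments assume a single pipelined multiply unit with a latency of about 3 cycles and a throughput of 1:

    /* Three independent multiplies: a new one can start every cycle
       (throughput), so the group finishes in roughly latency + 2 cycles. */
    double independent(double u, double v, double w, double x, double y, double z) {
        double a = u * v;   /* may start in cycle 0 */
        double b = w * x;   /* may start in cycle 1 */
        double c = y * z;   /* may start in cycle 2 */
        return a + b + c;
    }

    /* A dependent chain: each multiply must wait the full latency for its input. */
    double dependent(double u, double v, double w, double x) {
        double t = u * v;   /* result ready after about 3 cycles */
        return (t * w) * x; /* each further multiply waits about 3 more cycles */
    }

The exact overlap depends on the compiler and the processor's scheduler, but the dependent chain is bounded by the latency while the independent group is bounded by the throughput.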
Instruction Latency Case Study

- Consider two versions of Euclid's algorithm:
  1. Modulo version
  2. Repetitive subtraction version
- Which is faster?

Modulo version:

    int find_gcf1(int a, int b)
    {
        while (1) {
            a = a % b;
            if (a == 0) return b;
            if (a == 1) return 1;
            b = b % a;
            if (b == 0) return a;
            if (b == 1) return 1;
        }
    }

Repetitive subtraction version:

    int find_gcf2(int a, int b)
    {
        while (1) {
            if (a > b)
                a = a - b;
            else if (a < b)
                b = b - a;
            else
                return a;
        }
    }

- Cycles estimated for the case a=48 and b=40:

    Modulo version                      Repetitive subtraction version
    Instruction   #   Latency  Cycles   Instruction   #   Latency  Cycles
    Modulo        2   68       136      Subtract      5   1        5
    Compare       3   1        3        Compare       5   1        5
    Branch        3   1        3        Branch        14  1        14
    Other         6   1        6        Other         0
    Total         14           148      Total         24           24

- Execution time for all values of a and b in [1..9999]:

    Modulo version            18.55 sec
    Repetitive subtraction    14.56 sec
    Blended version           12.14 sec

Data Dependences

- Instruction-level parallelism is limited by data dependences
- Types of dependences:
  - RAW: read-after-write, also called flow dependence
  - WAR: write-after-read, also called anti dependence
  - WAW: write-after-write, also called output dependence
- The example (w*x)*(y*z) shows a RAW dependence
- WAR and WAW dependences exist because of storage location reuse (overwrite with a new value)
  - WAR and WAW are sometimes called false dependences
  - RAW is a true dependence

Data Dependence Case Study

- Removing redundant operations by (re)using (temporary) space may increase the number of dependences
- Example: two versions to initialize a finite difference matrix
  1. Recurrent version with a lower FP operation count
  2. Non-recurrent version with fewer dependences
- Which is fastest depends on the effectiveness of loop optimization and instruction scheduling by the compiler (and processor) to hide latencies, and on the number of distinct memory loads

With recurrence (WAR; cross-iteration dependence not shown):

    dxi = 1.0/h(1)
    do i = 1,n
       dxo = dxi
       dxi = 1.0/h(i+1)
       diag(i) = dxo + dxi
       offdiag(i) = -dxi
    enddo

Without recurrence (RAW; cross-iteration dependence not shown):

    do i = 1,n
       dxo = 1.0/h(i)
       dxi = 1.0/h(i+1)
       diag(i) = dxo + dxi
       offdiag(i) = -dxi
    enddo

Case Study (1): Intel Core 2 Duo 2.33 GHz

    dxi = 1.0/h;
    for (i=1; i…

…0 less frequently

- Rewrite conjunctions to logical expressions: if (t1==0 && t2==0 && t3==0) ⇒ if ((t1|t2|t3) == 0)
- Use max/min or arithmetic to avoid branches: if (a >= 255) a = 255; ⇒ a = min(a, 255) (a branch-free sketch follows this list)
- Note that in C/C++ the cond?then:else operator and the && and || operators result in branches!
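A hedged sketch of the two rewrites above (not from the lecture; all_zero and clamp255 are hypothetical helper names). The clamp relies on two assumptions worth stating: a is non-negative and int is 32 bits, and right-shifting a negative int is an arithmetic shift, which is implementation-defined in C but true for common compilers:

    /* Conjunction rewritten as one bitwise OR and a single test,
       instead of the three branches implied by &&. */
    int all_zero(int t1, int t2, int t3) {
        return (t1 | t2 | t3) == 0;
    }

    /* One branch-free way to realize a = min(a, 255) for 0 <= a:
       if a > 255, (255 - a) is negative and the arithmetic shift
       produces all ones, so the OR saturates the result to 255. */
    int clamp255(int a) {
        return (a | ((255 - a) >> 31)) & 255;
    }

In practice a compiler will often turn the plain a >= 255 ? 255 : a form into a conditional move anyway, so measuring is the only way to know whether such a rewrite pays off.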
Data Storage

- Memory hierarchy
- Performance of storage
- CPU and memory
- Virtual memory, TLB, and paging
- Cache

Memory Hierarchy

- Storage systems are organized in a hierarchy by:
  - Speed (faster toward the top)
  - Cost (cheaper per byte toward the bottom)
  - Volatility

Performance of Storage

- Registers are fast, typically one clock cycle to access data
- Cache access takes tens of cycles
- Memory access takes hundreds of cycles
- Movement between levels of the storage hierarchy can be explicit or implicit

Memory Access

- A logical address is translated into a physical address in virtual memory using a page table
  - The translation lookaside buffer (TLB) is an efficient on-chip address translation cache
  - Memory is divided into pages
  - Virtual memory systems store pages in memory (RAM) and on disk
  - Page fault: the page is fetched from disk
- L1 caches (on chip):
  - I-cache stores instructions
  - D-cache stores data
- L2 cache (E-cache, on/off chip):
  - Is typically unified: stores both instructions and data

Translation Lookaside Buffer

- Logical-to-physical address translation uses a TLB lookup to limit page table reads (the page table is stored in physical memory)
- Logical pages are mapped to physical pages in memory (typical page size is 4 KB)

Caches

- Direct mapped cache: each location in main memory can be cached by just one cache location
- N-way associative cache: each location in main memory can be cached by one of N cache locations
- Fully associative cache: each location in main memory can be cached by any cache location
- An experiment shows the cache miss rates of the SPEC CPU2000 benchmarks

Cache Details

- An N-way associative cache has a set of N cache lines per row
- A cache line can be 8 to 512 bytes
  - Longer lines increase memory bandwidth performance, but space can be wasted when applications access data in a random order
- A typical 8-way L1 cache (on chip) has 64 rows with 64-byte cache lines
  - Cache size = 64 rows x 8 ways x 64 bytes = 32768 bytes
- A typical 8-way L2 cache (on/off chip) has 1024 rows with 128-byte cache lines

Cache Misses

- Compulsory misses
  - Caused by the first reference to a datum
  - Effectively reduced by prefetching
- Capacity misses
  - The cache size is finite
  - Reduced by limiting the working set size of the application
- Conflict misses
  - Replacement misses are caused by the replacement policy's choice of victim cache line to evict
  - Mapping misses are caused by the level of associativity

Example from the slide:

    for (i = 0; i < 100000; i++)
        X[i] = some_static_data[i];   /* compulsory miss: first read of some_static_data */
    for (i = 0; i < 100000; i++)
        X[i] = X[i] + Y[i];           /* capacity miss: X[i] no longer in cache;
                                         conflict miss: X[i] and Y[i] are mapped to the
                                         same cache line (e.g. when the cache is direct
                                         mapped) */

False Sharing

- On a multi-core processor each core has an L1 cache and shares the L2 cache with the other cores
- False sharing occurs when the caches of two processors (or cores) hold two different non-shared data items that reside on the same cache line
- The cache coherency protocol marks the cache line dirty (on all cores) to force a reload
- To avoid false sharing (a padding sketch follows this list):
  - Allocate non-shared data on different cache lines (using malloc)
  - Limit the use of global variables
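To make the avoidance advice concrete, here is a minimal sketch (not from the lecture) in which two threads update their own counters; padding each counter to a full cache line keeps the two hot fields on different lines, so the coherency protocol never has to bounce a shared line between the cores. The 64-byte line size and the pthread-based setup are assumptions, not something the slide prescribes:

    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64   /* assumed line size; check the target processor */

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];   /* keeps each counter on its own line */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg) {
        struct padded_counter *c = arg;
        for (long i = 0; i < 100000000L; i++)
            c->value++;                        /* private data, no sharing intended */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, &counters[id]);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }

Without the pad member the two longs would sit on the same cache line, and every increment on one core would invalidate the line in the other core's L1 cache; the program would still be correct, just slower.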
