

Full Transcript


058165 - Parallel Computing
Fabrizio Ferrandi, a.a. 2023-2024
PRAM Architectures, Algorithms, Performance Evaluation

Acknowledgements
Material from:
- Parallel Computing lectures by Prof. Ran Ginosar, Technion - Israel Institute of Technology
- David Rodriguez-Velazquez, Spring '09, CS-6260, Dr. Elise de Doncker
- Joseph F. JaJa, Introduction to Parallel Algorithms, 1992, www.umiacs.umd.edu/~joseph/
- Uzi Vishkin, PRAM concepts (1981-today), www.umiacs.umd.edu/~vishkin

Overview
- What is a machine model? Why do we need a model?
- RAM
- PRAM
- Steps in a computation
- Write conflicts
- Examples

A Parallel Machine Model
What is a machine model?
- It describes a "machine"
- It puts a value on the operations of the machine
Why do we need a model?
- It makes it easy to reason about algorithms
- It lets us achieve complexity bounds
- It lets us analyze maximum parallelism

RAM (Random Access Machine)
- Unbounded number of local memory cells
- Each memory cell can hold an integer of unbounded size
- The instruction set includes simple operations, data operations, comparisons, and branches
- All operations take unit time
- Time complexity = number of instructions executed
- Space complexity = number of memory cells used

PRAM (Parallel Random Access Machine)
Definition:
- An abstract machine for designing algorithms applicable to parallel computers
- M' is a system of infinitely many RAMs M1, M2, ...; each Mi is called a processor of M'. All processors are assumed to be identical.
- Each processor can recognize its own index i
- Input cells X(1), X(2), ...
- Output cells Y(1), Y(2), ...
- Shared memory cells A(1), A(2), ...

PRAM (Parallel RAM)
- Unbounded collection of RAM processors P0, P1, ...
- Processors have no tape
- Each processor has unbounded registers
- Unbounded collection of shared memory cells
- All processors can access all memory cells in unit time
- All communication goes via shared memory

PRAM: Steps in a Computation
A step consists of 5 phases, carried out in parallel by all the processors. Each processor:
- reads a value from one of the input cells X(1), ..., X(n)
- reads one of the shared memory cells A(1), A(2), ...
- performs some internal computation
- may write into one of the output cells Y(1), Y(2), ...
- may write into one of the shared memory cells A(1), A(2), ...
E.g., for all i do A[i] = A[i-1] + 1: read A[i-1], add 1, write A[i] - all happening synchronously.

PRAM (Parallel RAM)
- Some subset of the processors can remain idle
[Figure: processors P0, P1, P2, ..., Pn all connected to the shared memory cells]
- Two or more processors may read simultaneously from the same cell
- A write conflict occurs when two or more processors try to write simultaneously into the same cell

Shared Memory Access Conflicts
PRAMs are classified based on their read/write abilities (realistic and useful):
- Exclusive Read (ER): processors can simultaneously read only from distinct memory locations
- Exclusive Write (EW): processors can simultaneously write only to distinct memory locations
- Concurrent Read (CR): all processors can simultaneously read from any memory location
- Concurrent Write (CW): all processors can simultaneously write to any memory location
- Combinations: EREW, CREW, CRCW

Concurrent Write (CW)
What value gets written finally?
- Priority CW: processors have priorities, which decide the value; the highest-priority processor is allowed to complete its write
- Common CW: all processors are allowed to complete their writes iff all the values to be written are equal.
  Any algorithm for this model has to make sure that this condition is satisfied; if it is not, the algorithm is illegal and the machine state will be undefined.
- Arbitrary/Random CW: one randomly chosen processor is allowed to complete its write

Strengths of PRAM
PRAM is an attractive and important model for designers of parallel algorithms. Why?
- It is natural: the number of operations executed per cycle on p processors is at most p
- It is strong: any processor can read/write any shared memory cell in unit time
- It is simple: it abstracts away any communication or synchronization overhead, which makes the complexity and correctness analysis of PRAM algorithms easier
- It can be used as a benchmark: if a problem has no feasible/efficient solution on a PRAM, it has no feasible/efficient solution on any parallel machine

Computational Power
Model A is computationally stronger than model B (A >= B) iff any algorithm written for B will run unchanged on A in the same parallel time, with the same basic properties.
Priority >= Arbitrary >= Common >= CREW >= EREW
(most powerful, least realistic ... least powerful, most realistic)

Definitions
- T*(n): time to solve a problem of input size n on one processor, using the best sequential algorithm
- Tp(n): time to solve it on p processors
- SUp(n) = T*(n)/Tp(n): speedup on p processors
- Ep(n) = T1(n)/(p*Tp(n)): efficiency (work on 1 processor / work that could be done on p processors)
- T∞(n): shortest run time on any number of processors
- C(n) = P(n)*T(n): cost (processors x time)
- W(n): work = total number of operations
Properties:
- T* != T1 in general; T1 >= T* >= Tp >= T∞
- SUp <= p and SUp <= T1/T∞
- Ep <= 1; Ep = T1/(p*Tp) <= T1/(p*T∞); if T* ≈ T1, then Ep ≈ SUp/p
- T1 is in O(C), Tp is in O(C/p); W <= C
- There is no use making p larger than the max SU: E -> 0 and execution gets no faster
- p ≈ area, W ≈ energy, W/Tp ≈ power

Speedup and Efficiency
Warning: this is only a (bad) example: an 80%-parallel Amdahl's-law chart. We'll see why it's bad when we analyze (and refute) Amdahl's law. Meanwhile, consider only the trend.
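As a quick numeric sanity check of the definitions above, here is a small sketch with made-up values (T1 = 1000 unit-time operations, p = 16 processors, Tp = 80 steps; these numbers are illustrative, not from the slides):

```python
# Illustrative numbers only (not from the slides): T1 is sequential time,
# p is the processor count, Tp is the parallel time.
T1, p, Tp = 1000, 16, 80

SU = T1 / Tp         # speedup (taking T* ~ T1), bounded by p
C = p * Tp           # cost = processors x time
E = T1 / (p * Tp)    # efficiency, bounded by 1

print(SU, C, E)      # 12.5 1280 0.78125
assert SU <= p and E <= 1
```

The gap between SU and p (and between E and 1) is exactly the overhead the bounds on the definitions slide describe.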
Example 1: Matrix-Vector Multiply
Y := AX, with A of size n x n and X of size n.
Partition A by rows into p blocks A1, A2, ..., Ap, each Ai of size r x n, with p <= n and r = n/p.
Example: (256 x 256, 256): A = [A1; A2; ...; A32], each Ai of size 8 x 256 - 32 processors, each Ai block is 8 rows.
Processor Pi reads Ai and X, computes, and writes Yi.
"Embarrassingly parallel" - no cross-dependence.

MVM Algorithm
i is the processor index.
  begin
    1. global read  (Z := X)
    2. global read  (B := Ai)
    3. compute      W := BZ
    4. global write (W -> Yi)
  end
- Step 1: concurrent read of X(1:n) - best supported by broadcast? Transfers n elements.
- Step 2: simultaneous reads of different sections of A; transfers n^2/p elements to each processor.
- Step 3: compute; n^2/p operations per processor.
- Step 4: simultaneous writes; transfers n/p elements from each processor. No write conflicts.
[The same algorithm is repeated on the next slide alongside a figure.]

Matrix-Vector Multiply: The PRAM Algorithm
i is both the core index and the slice index:
  begin
    yi = Ai x
  end
[Figure: four cores, each computing one slice yi = Ai x.]
A, x, y are in shared memory (concurrent read of x); temporaries are in private memories.

Performance of MVM
- T1(n^2) = O(n^2)
- Tp(n^2) = O(n^2/p) --- linear speedup, SU = p
- Cost = O(p * n^2/p) = O(n^2); W = C; W/Tp = p --- linear power
- Ep = T1/(p*Tp) = n^2/(p * n^2/p) = 1 --- perfect efficiency
[Figure: speedup curves in linear and log scale, n^2 = 1024]

Example 2: SPMD Sum of A(1:n) on PRAM (given n = 2^k)
Trace (n = 8): at step h, processor i adds cells 2i-1 and 2i:
  h=1: i=1 adds 1,2; i=2 adds 3,4; i=3 adds 5,6; i=4 adds 7,8
  h=2: i=1 adds 1,2; i=2 adds 3,4
  h=3: i=1 adds 1,2
  begin
    1. global read  (a := A(i))
    2. global write (a -> B(i))
    3. for h = 1:k
         if i <= n/2^h then begin
           global read  (x := B(2i-1))
           global read  (y := B(2i))
           z := x + y
           global write (z -> B(i))
         end
    4.
       if i = 1 then global write (z -> S)
  end

Logarithmic Sum
The PRAM algorithm:
  // sum vector A(*)
  begin
    B(i) := A(i)
    for h = 1:log(n)
      if i <= n/2^h then B(i) = B(2i-1) + B(2i)
  end
  // B(1) holds the sum
[Figure: balanced binary summation tree over a1, ..., a8, with levels h = 1, 2, 3]

Performance of Sum (p = n)
- T*(n) = T1(n) = n
- Tp=n(n) = 2 + log n
- SUp = n/(2 + log n)
- Cost = p*(2 + log n) ≈ n log n
- Ep = T1/(p*Tp) = n/(n log n) = 1/log n
Speedup and efficiency decrease as p = n grows.

Performance of Sum (n >> p)
- T*(n) = T1(n) = n (e.g., n = 1,000,000)
- Tp(n) = n/p + log p
- SUp = n/(n/p + log p) ≈ p
- Cost = p*(n/p + log p) ≈ n
- Work = n + p ≈ n
- Ep = T1/(p*Tp) = n/(p*(n/p + log p)) ≈ 1
Speedup and power are linear, cost is fixed, efficiency is 1 (the maximum).

Work Doing Sum
For n = 8 on p = 8 processors: T8 = 2 + log 8 = 5, C = 8*5 = 40 -- the machine could do 40 steps; W = 2n = 16 -- only 16 of 40 used, 24 wasted.
Ep = 2/log n = 2/3 ≈ 0.67; W/C = 16/40 = 0.4.
[Figure: summation tree; work = 16]

Simplifying Pseudo-Code
Replace
  global read  (x := B)
  global read  (y := C)
  z := x + y
  global write (z -> A)
by
  A := B + C   --- A, B, C shared variables

Example 3: Matrix Multiply on PRAM
C := AB, with A and B of size n x n, n = 2^k.
Recall MM: C(i,j) = sum over l = 1..n of A(i,l)*B(l,j).
p = n^3 processors.
Steps: processor P(i,j,l) computes A(i,l)*B(l,j); the n processors P(i,j,1:n) then compute the sum over l.

MM Algorithm
Each processor knows its i, j, l indices.
  begin
    1. T(i,j,l) = A(i,l)*B(l,j)                // step 1: compute products - concurrent read
    2. for h = 1:k
         if l <= n/2^h then
           T(i,j,l) = T(i,j,2l-1) + T(i,j,2l)  // step 2: sum
    3. if l = 1 then C(i,j) = T(i,j,1)         // step 3: store - exclusive write
  end
Runs on a CREW PRAM. What is the purpose of "if l = 1" in step 3? What happens if it is eliminated?

Performance of MM
- T1 = n^3
- Tp=n^3 = log n
- SU = n^3/log n
- Cost = n^3 log n
- Ep = T1/(p*Tp) = 1/log n

Some Variants of PRAM
- Bounded number of shared memory cells: small-memory PRAM. (If the input data set exceeds the capacity of the shared memory, I/O values can be distributed evenly among the processors.)
- Bounded number of processors: small PRAM. If the number of threads of execution is higher, processors may interleave several threads.
- Bounded size of a machine word: word size of a PRAM.
- Handling access conflicts:
  constraints on simultaneous access to shared memory cells

Lemma
Assume p' ...
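The SPMD/logarithmic sum above can be checked with a small sequential simulation of the synchronous rounds a PRAM would execute. This is a sketch under the slides' assumptions (n = 2^k, with the 1-based cells B(1..n) mapped onto a 0-based Python list); the function name pram_sum is ours, not from the lecture:

```python
import math

def pram_sum(A):
    """Simulate the logarithmic-sum PRAM algorithm: after round h, B(i)
    holds the sum of a block of 2^h inputs; B(1) holds the total at the end."""
    n = len(A)                        # assumed to be a power of two
    B = list(A)                       # B(i) := A(i)
    for h in range(1, int(math.log2(n)) + 1):
        # processors i <= n/2^h act in parallel; increasing-i order is a
        # safe sequential stand-in, since B(i) is read before it is written
        for i in range(1, n // 2**h + 1):
            B[i - 1] = B[2*i - 2] + B[2*i - 1]   # B(i) = B(2i-1) + B(2i)
    return B[0]

print(pram_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # 36, in 2 + log2(8) PRAM steps
```

Replacing the inner body with a three-index temporary array T(i,j,l) gives exactly the reduction used in step 2 of the matrix-multiply algorithm.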
