Parallel Architectures PDF

Summary

This document provides an overview of parallel architectures, covering massively parallel computers, SIMD and MIMD machines, pipeline architectures, and vector operations. It also covers memory organization (shared and distributed memory models) and programming issues for these architecture types. The document is suitable for postgraduate study in computer science or a related field.

Full Transcript

## ARCHITECTURES OF MASSIVELY PARALLEL SYSTEMS

### General structure of a (massively) parallel computer

The document depicts a diagram of the general structure of a massively parallel computer:

- The main components are a **memory**, an **interconnection network**, and **processing elements (PEs)**.
- The memory stores **data** and **instructions**.
- The processing elements are connected through the **interconnection network**.
- Three components are crucial: the **processor**, the **memory**, and the **interconnection network**.

### Outline

- Introduction
- Principle of SIMD/MIMD architectures
- Elementary processor
- Interconnection network organization
- Memory organization
- Some examples

### Bibliographic References

1. D. Etiemble, *Architecture of RISC Processors*, Collection 2Ai, Armand Colin.
2. C. Germain-Renaud, J.P. Sansonnet, *Massively Parallel Computers*, Collection 2Ai, Armand Colin.
3. P.M. Kogge, *The Architecture of Pipelined Computers*, Advanced Computer Science Series, McGraw-Hill.
4. H.S. Stone, *High Performance Computer Architecture*, Series in Electrical and Computer Engineering, Addison-Wesley.
5. J. Hennessy, D. Patterson, *Computer Architecture: A Quantitative Approach*, Morgan Kaufmann Publishers.
6. F.T. Leighton, *Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes*, Morgan Kaufmann Publishers.

### Classification (1)

#### Modes of Control

The first classification (Flynn) is based on instruction streams and data streams:

- **SISD**: Single Instruction stream, Single Data stream
- **SIMD**: Single Instruction stream, Multiple Data streams
- **MIMD**: Multiple Instruction streams, Multiple Data streams

#### Kuck's classification

- The **data stream** is replaced with the **execution stream**.
- A **distinction** is made between the ability to operate on scalars and the ability to operate on arrays.

#### Memory Organization

- **Shared memory**: a single address space is shared by all processors. Access time is (almost) independent of the processor and of the position of the data in memory.
- **Distributed memory**: each processor has its own address space. Access time depends on the processor and on the position of the data in memory.

### Pipeline

#### Idea

- The processing unit's operation is subdivided into elementary operations.
- Operations are executed following the pattern of an assembly line.

#### Example

A floating-point adder (4 stages):

1. Exponent subtraction
2. Mantissa alignment
3. Mantissa addition
4. Normalization

#### General Principle

The pipeline technique applies to floating-point operations, memory read/write operations, and the control unit alike.

### Pipeline (continued)

- *T<sub>i</sub>*: time to cross stage *i* of the pipeline.
- *T*(*n*): time to execute a sequence of *n* operations, given by *T*(*n*) = *α* + *n* *τ*, where *α* is the startup time and *τ* the cycle time.
- Execution speed: *V*(*n*) = *n*/*T*(*n*) = *n*/(*α* + *n* *τ*).

#### Definitions

- *α*: startup time.
- *r<sub>∞</sub>*: asymptotic speed.
- *n<sub>1/2</sub>*: length of the sequence needed to reach half of the asymptotic performance.

#### Properties

- *r<sub>∞</sub>* = 1/*τ*
- *n<sub>1/2</sub>* = *α*/*τ*
- *T*(*n*) = (1/*r<sub>∞</sub>*)(*n* + *n<sub>1/2</sub>*)
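A quick numerical check of these formulas, with values assumed for illustration: for the 4-stage floating-point adder above, take *τ* = 1 cycle and suppose the startup time equals the pipeline fill time, *α* = 4 cycles. Then *r<sub>∞</sub>* = 1 result per cycle and *n<sub>1/2</sub>* = 4: a sequence of *n* = 4 operations runs at *V*(4) = 4/(4 + 4) = 0.5, exactly half the asymptotic speed, while *n* = 100 operations take *T*(100) = 4 + 100 = 104 cycles instead of the 400 cycles an unpipelined 4-cycle adder would need.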
### Parallelism between Functional Units

- Multiple independent functional units can execute different instructions simultaneously.
- Example: an adder, a multiplier, and a memory read/write unit.
- This parallelism also applies to pipelined units.
- Such low-level parallelism is mostly used for evaluating arithmetic expressions.

### SIMD Machines

- The document presents a diagram in which several processing elements (PEs) are connected to an interconnection network; each PE has access to a set of memory banks.
- Instructions are broadcast to all PEs. Each PE can either execute the broadcast instruction or remain inactive.
- Memory is structured into independent banks capable of delivering *M* words in parallel.
- The network realigns data: it gathers in the same processor the data that need to be combined.
- The interconnection network acts as a synchronous data switch.

### SIMD Machines (continued)

- The document presents a code example computing *c*(*i*) = *a*(*i*) + *b*(*i*); a strip-mining sketch of it is given after this section.
- Vectors are partitioned into blocks of at most *P* elements: **the SIMD machine operates on blocks of *P* elements**.
- The document describes the cases in which accesses to the memory banks of an SIMD machine do or do not conflict.
- The interconnection network is necessary: it realigns the data for correct processing.

### Interest of the Inhibition Bit

- The document presents a code example using a conditional statement.
- The inhibition bit is used to avoid unnecessary operations in a loop.
- Operations are performed on blocks.
- The document describes a method to compute the total execution time of a given loop, taking the use of the inhibition bit into account.
- Another application is the handling of incomplete blocks.

### SIMD Performance Curves

The document describes a typical SIMD performance curve with a plateau pattern: performance rises with the number of iterations and then levels off once all processing elements are busy; equivalently, the cycle count grows as a staircase, jumping each time an additional block of *P* elements is started.
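Below is a minimal strip-mining sketch of the *c*(*i*) = *a*(*i*) + *b*(*i*) example, written as ordinary Fortran (the program name and the values of `p`, `n`, `a`, and `b` are illustrative assumptions, not taken from the slides). The outer loop steps through blocks of at most *P* elements; on an SIMD machine, each pass over the inner loop corresponds to one synchronous step executed by all PEs, with the inhibition bit deactivating the PEs that fall outside the last, incomplete block.

```fortran
program simd_blocks
  implicit none
  integer, parameter :: p = 64   ! number of processing elements (assumed)
  integer, parameter :: n = 200  ! vector length (assumed)
  real :: a(n), b(n), c(n)
  integer :: i, j

  a = 1.0
  b = 2.0

  ! One pass over the inner loop = one synchronous SIMD step:
  ! each PE handles one component of the current block of p elements.
  do j = 1, n, p
     ! The bound min(j+p-1, n) plays the role of the inhibition bit:
     ! PEs past the end of the vector stay inactive in the last block.
     do i = j, min(j + p - 1, n)
        c(i) = a(i) + b(i)
     end do
  end do

  print *, 'c(1) =', c(1), '  c(n) =', c(n)
end program simd_blocks
```

With *n* = 200 and *P* = 64, the loop needs ⌈200/64⌉ = 4 SIMD steps, and any length between 193 and 256 would cost those same 4 steps: this is exactly the staircase behaviour of the performance curve above.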
### Local Memory Variant

- The document presents a diagram in which each PE has its own local memory.
- The interconnection network connects the PEs and their local memories.
- Basic differences compared to shared memory:
  - The number of processors equals the number of memory banks.
  - Local memory is easier to integrate.
- In some cases, the local memory is used as a buffer in front of the shared memory.

### SIMD Machines: First Overview

- **Advantages:**
  - Simple, very repetitive, and modular; adaptable to a high number of processors.
  - Simple elementary processor; fine-grained parallelism.
  - Synchronous, predictable operation, which simplifies the management of resource conflicts.
  - Simple programming model: vector manipulation.
- **Disadvantages:**
  - Limited and rigid parallelism.
  - Specialized architecture with a limited scope.

### MIMD Architectures with Shared Memory

- The document presents a diagram showing an interconnection network that connects processing elements (PEs), each PE having access to a single shared set of memory banks (BM).
- **Characteristics:**
  - Fully autonomous processors that handle their own synchronization.
  - A uniform, shared address space.
  - The network operates like a telephone exchange.
  - No explicit data movement between processors.

### MIMD Architectures with Distributed Memory

- The document presents a diagram in which each processor has its own set of dedicated memory banks.
- **Characteristics:**
  - Fully autonomous processors that handle their own synchronization.
  - A distributed address space, each part bound to its associated memory bank.
  - The network carries sets of messages exchanged between processors.
  - Explicit data movement between processors.

### Topology of Distributed Memory Machines

- **2D mesh**: the document presents a diagram of a 2D grid network of interconnected processor/memory pairs.
- It can be extended to a 3D grid or to a torus.

### Topology of Distributed Memory Machines (continued)

- The document presents a diagram of a 3D hypercube network: processor/memory pairs are placed at the nodes, and communication links are drawn as edges between them. It also describes the 4D hypercube structure.
- **Hypercube**: in the space (Z/2Z)<sup>n</sup> of dimension *n*, processor/memory pairs are placed at the vertices of an *n*-dimensional hypercube, and communication links correspond to its edges, i.e. to pairs of vertices whose binary labels differ in exactly one bit.

### Another View of the Hypercube

- The document presents a diagram of the 3D hypercube with processing elements labelled 0 to 7 and their associated memory banks.

### MIMD with Shared Memory: First Overview

- **Advantages:**
  - Simplified programming for the user, the system, and the compiler.
  - No unnecessary data migration.
- **Disadvantages:**
  - Difficult to implement; no modularity.
  - Network and memory performance: every memory access goes through the network.

### MIMD with Distributed Memory: First Overview

- **Advantages:**
  - Simple.
  - Scalable.
  - High memory performance (local accesses).
- **Disadvantages:**
  - Programming.
  - Message handling.
  - Memory performance (non-local accesses).
  - The match between the communication topology and the algorithm must be taken into account.

### Trend: Virtual Shared Memory

The programmer is provided with a single address space; the system then handles data distribution and message management.

### Elementary Processors: SIMD Machines

- A complete processor is not required.
- Specialized processors exist:
  - 1-bit processors (Connection Machine, DAP)
  - 4-bit processors (MasPar)
  - Coupling with floating-point operators (Connection Machine)
- By simplifying the control logic, a large number of processors can be integrated on a single chip.
- These processors can be reconfigured to operate on floating-point (32/64-bit) or integer data.

### Elementary Processors: MIMD Machines

- There are fewer constraints: reusing commercial microprocessors reduces development costs and benefits from economies of scale.
- Two difficulties:
  - Synchronization instructions are required; recent generations of microprocessors include such mechanisms.
  - Communication management is required.

### Communication Management Solutions

- The network takes care of the complete routing logic.
- Each processor has a communication component.
- The Transputer (a partial solution).

### Side Note

Each processor can be equipped with a vector coprocessor.

### Transputer

- **Idea**: a basic building block for multiprocessing, in which a processor, memory, a floating-point unit, and communication links are all integrated on a single chip.
- Connections are point-to-point with other transputers.
- Inmos T800 example:
  - 4 KB of on-chip memory
  - 32-bit micro-coded CPU
  - Floating-point unit (1.5 Mflops)
  - Interrupt handler
  - External memory interface allowing up to 8 MB of external memory
  - 4 communication channels

### Transputer: Interesting Features

- Fast process switching (under 1 μs), so concurrency is possible.
- Communications can proceed in parallel with CPU activity.
- **Limitation**: connections are point-to-point, and routing must be done by the Transputer itself.

### Parallel Algorithm?

- Parallelism leads to varying performance improvements.
- The gain depends on the application and on how well it matches the hardware.
- **New complexity metrics:**
  - Vector length.
  - Width of parallelism.
  - Communication volume.

### Parallel Algorithm? (Definitions)

- **C**: the computation to be performed.
- **T<sub>1</sub>**: time required by the best sequential algorithm to compute **C**.
- **T<sub>p</sub>(A)**: time required by algorithm **A** to compute **C** using *p* processors.
- **Speedup**: *S<sub>p</sub>(A)* = *T<sub>1</sub>* / *T<sub>p</sub>(A)*
- **Efficiency**: *E<sub>p</sub>(A)* = *S<sub>p</sub>(A)* / *p*

### Parallel Algorithm? (Properties)

- *S<sub>p</sub>(A)* ≤ *p*
- *E<sub>p</sub>(A)* ≤ 1
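A small worked instance of these definitions, with numbers assumed for illustration: if the best sequential algorithm needs *T<sub>1</sub>* = 120 s and algorithm *A* performs the same computation in *T<sub>8</sub>(A)* = 20 s on *p* = 8 processors, then *S<sub>8</sub>(A)* = 120/20 = 6 and *E<sub>8</sub>(A)* = 6/8 = 0.75, consistent with the properties *S<sub>p</sub>(A)* ≤ *p* and *E<sub>p</sub>(A)* ≤ 1.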
### General Result for a Perfect Parallel Machine

**Proposition**: if a computation **C** can be performed with *N<sub>p</sub>* operations in time *T<sub>p</sub>* on *p* processors, then it can be performed on *q* processors (*q* ≤ *p*) in a time *T<sub>q</sub>* such that:

*N<sub>p</sub>*/*q* ≤ *T<sub>q</sub>* ≤ *N<sub>p</sub>*/*q* + *T<sub>p</sub>*

### Proof

**Case q = 1**: a single processor executes the *N<sub>p</sub>* operations one after another, so *N<sub>p</sub>* ≤ *T<sub>1</sub>* ≤ *N<sub>p</sub>*, i.e. *T<sub>1</sub>* = *N<sub>p</sub>*, and both bounds hold.

**Lower bound**:

- At best, *q* operations can be executed in parallel at each time step.
- To execute *N<sub>p</sub>* operations, at least ⌈*N<sub>p</sub>*/*q*⌉ time steps are therefore needed.

**Upper bound**:

- Let *a<sub>i</sub>* be the number of operations executed in parallel at time step *i* of the *p*-processor schedule (1 ≤ *i* ≤ *T<sub>p</sub>*).
- These *a<sub>i</sub>* operations can be executed on *q* processors in ⌈*a<sub>i</sub>*/*q*⌉ steps, which leads to *T<sub>q</sub>* ≤ Σ<sub>i=1</sub><sup>T<sub>p</sub></sup> ⌈*a<sub>i</sub>*/*q*⌉.
- Since ⌈*a<sub>i</sub>*/*q*⌉ ≤ *a<sub>i</sub>*/*q* + 1 - 1/*q*, we obtain *T<sub>q</sub>* ≤ (1/*q*) (Σ<sub>i=1</sub><sup>T<sub>p</sub></sup> *a<sub>i</sub>*) + (1 - 1/*q*) *T<sub>p</sub>*.
- Knowing that Σ<sub>i=1</sub><sup>T<sub>p</sub></sup> *a<sub>i</sub>* = *N<sub>p</sub>*, it follows that *T<sub>q</sub>* ≤ *T<sub>p</sub>* + (*N<sub>p</sub>* - *T<sub>p</sub>*)/*q*.
- Since (1 - 1/*q*) *T<sub>p</sub>* ≤ *T<sub>p</sub>*, this gives *T<sub>q</sub>* ≤ *N<sub>p</sub>*/*q* + *T<sub>p</sub>*, which completes the proof.
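A worked instance of the proposition, with numbers assumed for illustration: suppose **C** takes *N<sub>p</sub>* = 1000 operations in *T<sub>p</sub>* = 10 steps on *p* = 100 processors, every step being fully busy with *a<sub>i</sub>* = 100 operations. On *q* = 10 processors, each original step can be emulated in ⌈100/10⌉ = 10 steps, giving *T<sub>10</sub>* = 100, which indeed satisfies *N<sub>p</sub>*/*q* = 100 ≤ *T<sub>10</sub>* ≤ *N<sub>p</sub>*/*q* + *T<sub>p</sub>* = 110.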
### Vectors

- **Definition**: a vector is defined by a sequence of addresses of the form {*A<sub>i</sub>* + (*k* - 1) *R*}<sub>*k*=1,N</sub>, where:
  - *A<sub>i</sub>*: starting address of the vector.
  - *R*: step, increment, or stride of the vector.
  - *N*: length of the vector.
- **Notation**: *A*[*i*; *N*; *R*]
- **Extension**: bidimensional vectors, i.e. sequences of addresses of the form {*A<sub>1</sub>* + (*i* - 1) *R* + (*j* - 1) *L*}<sub>*i*=1,N<sub>1</sub>; *j*=1,N<sub>2</sub></sub>.
- **Note**: Fortran 8X includes vector types similar to those defined above.

### Vector Machines

- **Definition**: a "vector" machine is characterized by its instruction set, which can operate on both scalars and vectors.
- Typically, the hardware consists of:
  - Scalar units, which execute scalar instructions only.
  - Vector units, which execute instructions operating on vectors.
- **Note**: SIMD machines can be considered vector machines.
- **Note**: some architectures have a single unit capable of executing both scalar and vector instructions (for example, the floating-point operators of the Cray series).

### Vector Operations

Extensions of classic arithmetic expressions, with at least one vector operand.

#### Examples

1. *VECM<sub>1</sub>* op *VECM<sub>2</sub>* = *VECM<sub>3</sub>*
2. *VECM<sub>1</sub>* op *scal* = *VECM<sub>2</sub>*
3. *VECM<sub>1</sub>* op (*scal* op *VECM<sub>2</sub>*) = *VECM<sub>3</sub>*
4. op *VECM<sub>1</sub>* = *scal*
5. *VECM<sub>1</sub>* op *VECM<sub>2</sub>* = *scal*

- **VECM**: vector operand, i.e. a sequence of addresses in memory.
- **scal**: scalar operand.
- **op**: arithmetic or logical operation.

#### Component-by-component operations

- Component-by-component operations are equivalent to *VECM<sub>1</sub>*(*i*) op *VECM<sub>2</sub>*(*i*) = *VECM<sub>3</sub>*(*i*) for every *i*.
- **The last two operations (4 and 5) are called reduction operations.**

### Vector Examples

- **Attention**: only vectors of the same length can be combined.
- `A[1; N; 2] = B[3; N; 1] + d * C[5; N; 3]` is equivalent to:
  - `DO i = 1, N`
  - `  A(2*i - 1) = B(i + 2) + d * C(3*i + 2)`
  - `ENDDO`
- `c = SUM(A[1; N; 2])` is equivalent to:
  - `c = 0`
  - `DO i = 1, N`
  - `  c = A(2*i - 1) + c`
  - `ENDDO`
- **Attention**: right-hand operands are assumed to be read entirely before the assignment is performed.

### Software View of Vector Operations

1. Load all right-hand operands of the assignment.
2. Execute the operations synchronously, step by step.
3. Store all results into the left-hand operands.
4. Vector assignment operations are executed sequentially, one after another.

### Example

- `DO i = 1, 40`
- `  X(i) = Y(i) + Z(i)`
- `  A(i) = X(i + 1) + 1`
- `ENDDO`
- This is **not equivalent** to the sequence of vector operations below (written with first: last: stride triplets):
  - `X[1: 40: 1] = Y[1: 40: 1] + Z[1: 40: 1]`
  - `A[1: 40: 1] = X[2: 41: 1] + 1`
- In the loop, `A(i)` uses the old value of `X(i + 1)`, whereas the vector version assigns all of `X[1: 40: 1]` first, so `A` would be computed from the new values.
- **Attention**: right-hand operands are assumed to be read entirely before the assignment is performed.

### Mask

- **Definition**: a mask vector *VM* consists of bits that control the execution of a vector operation.
- **Behavior**:
  - If *VM*(*i*) = 1, the *i*-th component of the result vector receives the value produced by the *i*-th operation defined by the vector instruction.
  - If *VM*(*i*) = 0, the *i*-th component of the result vector is not affected by the vector instruction.
- **Example**: the execution of the vector instruction `C[1: N: 1] = A[1: N: 1] + B[1: N: 1]` controlled by the mask vector *VM* is equivalent to:
  - `DO i = 1, N`
  - `  IF VM(i) = 1 THEN C(i) = A(i) + B(i)`
  - `ENDDO`
- **Attention**: in general, all *N* additions are performed even when the mask is used to filter the result vector; the cost is proportional to the dimension *N*, not to the number of non-zero elements of *VM*.

### Interest of the Mask Notion

- **Handling conditional branches inside a loop**:
  - `DO i = 1, N`
  - `  IF A(i) > 0 THEN C(i) = D(i) + E(i)`
  - `ENDDO`
- **Equivalent vector instructions**:
  - `VM[1: N: 1] = A[1: N: 1] > 0`
  - `C[1: N: 1] = D[1: N: 1] + E[1: N: 1]`, executed under the control of *VM*.
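The mask concept maps directly onto the masked array assignment that Fortran 90 (the outcome of the Fortran 8X effort cited above) standardized as the `WHERE` construct. A minimal sketch, with the array size and contents assumed for illustration:

```fortran
program mask_demo
  implicit none
  integer, parameter :: n = 8
  real :: a(n), c(n), d(n), e(n)

  a = (/ 1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0 /)
  d = 10.0
  e = 0.5
  c = 0.0

  ! Masked vector assignment: only components with a(i) > 0 receive
  ! d(i) + e(i); the others keep their previous value, exactly like
  ! the components filtered out by the VM mask vector of the slides.
  where (a > 0.0) c = d + e

  print *, c
end program mask_demo
```

As the slides warn, a vector machine may still perform all *n* additions here; the mask only controls which results are stored into `c`.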
