Basic Communication Operations
University of Sharjah
Dr. Ali El-Moursy
Summary
These lecture notes cover basic communication operations in parallel and distributed processing. The topics include network topologies like linear arrays, meshes, and hypercubes, and message passing costs; concepts are illustrated with examples and diagrams.
Full Transcript
4. Basic Communication Operations
Parallel & Distributed Processing - 0403412
Dr. Ali El-Moursy

Topic Overview
2.4.3 Network Topologies
2.5 Message Passing Costs in Parallel Computers
One-to-All Broadcast and All-to-One Reduction
All-to-All Broadcast and Reduction
All-Reduce
Scatter and Gather
All-to-All Personalized Communication
Improving the Speed of Some Communication Operations

Network Topologies: Linear Arrays, Meshes, and k-d Meshes
In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring. A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. A further generalization to d dimensions has nodes with 2d neighbors. A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

Network Topologies: Ring, Mesh, and Hypercube
The figure shows examples of a ring, a mesh, and a hypercube.

Message Passing Costs in Parallel Computers
The total time to transfer a message over a network comprises the following:
– Startup time (ts): time spent at the sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
– Per-hop time (th): a function of the number of hops; includes factors such as switch latencies, network delays, etc.
– Per-word transfer time (tw): includes all overheads that are determined by the length of the message, such as link bandwidth and error checking and correction.

Simplified Cost Model for Communicating Messages
The cost of communicating a message of m words between two nodes l hops away is given by
tcomm = ts + l th + twm.
In this expression, th is typically much smaller than ts and tw. For this reason, the second term on the right-hand side can be ignored, particularly when m is large. Furthermore, it is often not possible to control the routing and placement of tasks. For these reasons, we can approximate the cost of message transfer by
tcomm = ts + twm.

Simplified Cost Model for Communicating Messages
It is important to note that this expression for communication time is valid only for uncongested networks. If a link carries multiple messages, the corresponding tw term must be scaled up by the number of messages. Different communication patterns congest different networks to varying extents. It is important to understand this and to account for it in the communication time accordingly.

Basic Communication Operations: Introduction
Many interactions in practical parallel programs occur in well-defined patterns involving groups of processors. Efficient implementations of these operations can improve performance, reduce development effort and cost, and improve software quality. Efficient implementations must leverage the underlying architecture. For this reason, we refer to specific architectures here. We select a descriptive set of architectures to illustrate the process of algorithm design.

Basic Communication Operations: Introduction
Group communication operations are built using point-to-point messaging primitives.
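As a concrete illustration (not part of the original slides), the sketch below shows a single point-to-point transfer in MPI, the primitive from which the group operations discussed in the rest of these notes can be built. It assumes an MPI environment with at least two processes; the message contents and sizes are arbitrary example values.

```c
/* Hedged sketch: one point-to-point transfer of m words from rank 0 to
 * rank 1. Cost model from the slides: tcomm = ts + tw*m on an
 * uncongested network.
 * Compile: mpicc p2p.c -o p2p   Run: mpirun -np 2 ./p2p
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int m = 4;                 /* message size in words */
    int msg[4] = {10, 20, 30, 40};   /* example payload */

    if (rank == 0) {
        /* Source node: one send of m words. */
        MPI_Send(msg, m, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Destination node: matching receive. */
        MPI_Recv(msg, m, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", msg[0], msg[1], msg[2], msg[3]);
    }

    MPI_Finalize();
    return 0;
}
```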
Recall from our discussion of architectures that communicating a message of size m over an uncongested network takes time ts + twm. We use this as the basis for our analyses. Where necessary, we take congestion into account explicitly by scaling the tw term. We assume that the network is bidirectional and that communication is single-ported.

One-to-All Broadcast and All-to-One Reduction
One processor has a piece of data (of size m) that it needs to send to everyone. The dual of one-to-all broadcast is all-to-one reduction. In all-to-one reduction, each processor has m units of data. These data items must be combined piece-wise (using some associative operator, such as addition or min), and the result made available at a target processor.

One-to-All Broadcast and All-to-One Reduction
The figure shows one-to-all broadcast copying A0 from the source process to every process, and all-to-one reduction combining A0, B0, C0, D0, E0, and F0 from the six processes into H0 = A0 * B0 * C0 * D0 * E0 * F0 on the target process, where * is an operator such as sum, min, max, or a logical operation (AND, OR, XOR, ...).

One-to-All Broadcast and All-to-One Reduction on Rings
The simplest way is to send p-1 messages from the source to the other p-1 processors; this is not very efficient. Instead, use recursive doubling: the source sends a message to a selected processor, and we now have two independent problems defined over halves of the machine. Reduction can be performed in an identical fashion by inverting the process.

One-to-All Broadcast
One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.

All-to-One Reduction
Reduction on an eight-node ring with node 0 as the destination of the reduction.

Broadcast and Reduction: Example
Consider the problem of multiplying a matrix with a vector. The n x n matrix is assigned to an n x n (virtual) processor grid. The vector is assumed to be on the first row of processors. The first step of the product requires a one-to-all broadcast of each vector element along the corresponding column of processors. This can be done concurrently for all n columns. The processors then compute the local product of the vector element and the local matrix entry. In the final step, the results of these products are accumulated to the first row using n concurrent all-to-one reduction operations along the columns (using the sum operation).

Broadcast and Reduction: Matrix-Vector Multiplication Example
One-to-all broadcast and all-to-one reduction in the multiplication of a 4 x 4 matrix with a 4 x 1 vector.

Broadcast and Reduction on a Mesh
We can view each row and column of a square mesh of p nodes as a linear array of √p nodes. Broadcast and reduction operations can then be performed in two steps: the first step does the operation along a row, and the second step does it along each column concurrently. This process generalizes to higher dimensions as well.
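The recursive-doubling broadcast just described can be sketched with point-to-point messages. This is a hedged illustration rather than the slides' own code: it assumes the number of processes p is a power of two and that process 0 is the source, and it uses the standard bitmask partner selection (the same template is presented for the hypercube later in these notes).

```c
/* Sketch: one-to-all broadcast by recursive doubling, built from
 * point-to-point messages. Assumes p is a power of two and rank 0 is
 * the source. In step i, every rank that already holds the data sends
 * it to the rank obtained by flipping bit i of its label, so the number
 * of holders doubles each step: log p steps, time (ts + tw*m) log p.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int x = (rank == 0) ? 42 : -1;   /* the message X, held only by the source */

    int d = 0;
    while ((1 << d) < p) d++;        /* d = log2(p) */

    int mask = (1 << d) - 1;
    for (int i = d - 1; i >= 0; i--) {
        mask ^= (1 << i);                     /* clear bit i of the mask */
        if ((rank & mask) == 0) {             /* lower i bits zero: rank is active */
            int partner = rank ^ (1 << i);    /* flip bit i to find the partner */
            if ((rank & (1 << i)) == 0)
                MPI_Send(&x, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(&x, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }

    printf("rank %d now holds X = %d\n", rank, x);
    MPI_Finalize();
    return 0;
}
```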
Broadcast and Reduction on a Mesh: Example
One-to-all broadcast on a 16-node mesh (four steps).

Broadcast and Reduction on a Hypercube
A hypercube with 2^d nodes can be regarded as a d-dimensional mesh with two nodes in each dimension. The mesh algorithm can be generalized to a hypercube, and the operation is carried out in d (= log p) steps.

Broadcast and Reduction on a Hypercube: Example
One-to-all broadcast on a three-dimensional hypercube. The binary representations of node labels are shown in parentheses.

Broadcast and Reduction Algorithms
All of the algorithms described above are adaptations of the same algorithmic template. We illustrate the algorithm for a hypercube, but, as has been seen, the algorithm can be adapted to other architectures. The hypercube has 2^d nodes, and my_id is the label of a node. X is the message to be broadcast, which initially resides at the source node 0.

Broadcast and Reduction Algorithms
One-to-all broadcast of a message X from the source on a hypercube.

Broadcast and Reduction Algorithms
Single-node accumulation on a d-dimensional hypercube. Each node contributes a message X containing m words, and node 0 is the destination.

Cost Analysis
The broadcast or reduction procedure involves log p point-to-point simple message transfers, each at a time cost of ts + twm. The total time is therefore given by
T = (ts + twm) log p.

All-Reduce
In all-reduce, each node starts with a buffer of size m, and the final result of the operation is an identical buffer of size m on each node, formed by combining the original p buffers using an associative operator. This is semantically identical to an all-to-one reduction followed by a one-to-all broadcast, but that formulation is not the most efficient. Instead, all-reduce uses the pattern of all-to-all broadcast; the only difference is that the message size does not increase here. The time for this operation is (ts + twm) log p. All-reduce is different from all-to-all reduction, in which p simultaneous all-to-one reductions take place, each with a different destination for the result.

All-Reduce
The figure shows the naive formulation: an all-to-one reduction first produces H0 = A0 * B0 * C0 * D0 * E0 * F0 on one process, and a one-to-all broadcast then copies H0 to every process, where * is an operator such as sum, min, max, or a logical operation (AND, OR, XOR, ...).

All-Reduce
The figure shows the all-reduce operation directly: every process starts with its own value and ends with the same combined value H0 = A0 * B0 * C0 * D0 * E0 * F0.

Scatter and Gather
In the scatter operation, a single node sends a unique message of size m to every other node (also called one-to-all personalized communication). In the gather operation, a single node collects a unique message from each node.
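In MPI, these two operations are available directly as MPI_Scatter and MPI_Gather. The following is a minimal, hedged usage sketch (not from the slides), with node 0 as the root and two words per process; the buffer contents are arbitrary example values.

```c
/* Sketch: scatter and gather with MPI collectives (root = node 0).
 * Scatter: the root sends a distinct block of m words to each process.
 * Gather: the root collects one distinct block of m words from each process.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int m = 2;                        /* words per process */
    int *all = NULL;
    if (rank == 0) {                        /* root holds the p blocks A0..A(p-1) */
        all = malloc(p * m * sizeof(int));
        for (int i = 0; i < p * m; i++) all[i] = i;
    }

    int mine[2];
    /* One-to-all personalized communication: block i goes to rank i. */
    MPI_Scatter(all, m, MPI_INT, mine, m, MPI_INT, 0, MPI_COMM_WORLD);

    mine[0] += 100;                         /* some local work on the block */

    /* Gather is the exact inverse: the root collects the p blocks again. */
    MPI_Gather(mine, m, MPI_INT, all, m, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("gathered first block: %d %d\n", all[0], all[1]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```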
While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast). The gather operation is exactly the inverse of the scatter operation and can be executed as such.

Scatter and Gather
The figure shows scatter distributing the blocks A0, A1, ..., A5 held by one process so that process i receives Ai, and gather collecting A0, B0, C0, D0, E0, and F0 from the six processes onto a single process.

Example of the Scatter Operation
The scatter operation on an eight-node hypercube.

Cost of Scatter and Gather
There are log p steps; in each step, the machine size halves and the data size halves. The time for this operation is
T = ts log p + twm(p - 1).
This time holds for a linear array as well as a 2-D mesh. These times are asymptotically optimal in message size.

All-to-All Broadcast and Reduction
A generalization of broadcast in which each processor is the source as well as the destination. A process sends the same m-word message to every other process, but different processes may broadcast different messages.

All-to-All Broadcast and Reduction
The figure shows all-to-all broadcast leaving every process with the full set A0, B0, C0, D0, E0, F0, and all-to-all reduction leaving Hx on process x, where Hx = Ax * Bx * Cx * Dx * Ex * Fx, * is an operator such as sum, min, max, or a logical operation, and x is an index (0, 1, 2, ..., 5).

All-to-All Broadcast and Reduction on a Ring
The simplest approach is to perform p one-to-all broadcasts. This is not the most efficient way, though. Instead, each node first sends to one of its neighbors the data it needs to broadcast. In subsequent steps, it forwards the data received from one of its neighbors to its other neighbor. The algorithm terminates in p-1 steps.

All-to-All Broadcast and Reduction on a Ring
All-to-all broadcast on an eight-node ring.

All-to-All Broadcast and Reduction on a Ring
All-to-all broadcast on a p-node ring.

All-to-All Broadcast on a Mesh
The operation is performed in two phases. In the first phase, each row of the mesh performs an all-to-all broadcast using the procedure for the linear array. In this phase, all nodes collect √p messages corresponding to the √p nodes of their respective rows. Each node consolidates this information into a single message of size m√p. The second communication phase is a columnwise all-to-all broadcast of the consolidated messages.

All-to-All Broadcast on a Mesh
All-to-all broadcast on a 3 x 3 mesh. The groups of nodes communicating with each other in each phase are enclosed by dotted boundaries. By the end of the second phase, all nodes get (0, 1, 2, ..., 8) (that is, a message from each node).

All-to-All Broadcast on a Mesh
All-to-all broadcast on a square mesh of p nodes.
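A hedged sketch of the ring algorithm described above, assuming one word of data per process and using MPI_Sendrecv for the simultaneous send and receive in each of the p-1 steps; the data values are arbitrary.

```c
/* Sketch: all-to-all broadcast on a logical ring in p-1 steps.
 * Each process starts with one word; at the end every process holds all
 * p words. In each step a process forwards to its right neighbor the
 * word it received from its left neighbor in the previous step (its own
 * word in the first step). Total time (ts + tw*m)(p - 1).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;

    int *result = malloc(p * sizeof(int));
    result[rank] = 100 + rank;            /* this process's own message */

    int outgoing = result[rank];
    for (int step = 1; step < p; step++) {
        int incoming;
        /* Send to the right neighbor, receive from the left neighbor. */
        MPI_Sendrecv(&outgoing, 1, MPI_INT, right, 0,
                     &incoming, 1, MPI_INT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* The word arriving in step k originated at rank (rank - k + p) % p. */
        result[(rank - step + p) % p] = incoming;
        outgoing = incoming;              /* forward it in the next step */
    }

    printf("rank %d collected all %d messages\n", rank, p);
    free(result);
    MPI_Finalize();
    return 0;
}
```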
All-to-All Broadcast on a Hypercube
A generalization of the mesh algorithm to log p dimensions. The message size doubles at each of the log p steps.

All-to-All Broadcast on a Hypercube
All-to-all broadcast on an eight-node hypercube.

All-to-All Broadcast on a Hypercube
All-to-all broadcast on a d-dimensional hypercube.

All-to-All Reduction
Similar communication pattern to all-to-all broadcast, except in the reverse order. On receiving a message, a node must combine it with the local copy of the message that has the same destination as the received message before forwarding the combined message to the next neighbor.

Cost Analysis
On a ring, the time is given by (ts + twm)(p - 1). On a mesh, the time is given by 2ts(√p - 1) + twm(p - 1). On a hypercube, we have
T = ts log p + twm(p - 1).

All-to-All Broadcast: Notes
All of the algorithms presented above are asymptotically optimal in message size. It is not possible to port algorithms for higher-dimensional networks (such as a hypercube) to a ring, because this would cause contention.

All-to-All Broadcast: Notes
Contention for a channel when the hypercube is mapped onto a ring.

All-to-All Broadcast vs. Gather
The figure contrasts the two operations: after an all-to-all broadcast, every process holds A0, B0, C0, D0, E0, F0, whereas after a gather only the root process holds them.

All-to-All Broadcast: Gather + Vector Broadcast
The figure shows that an all-to-all broadcast is equivalent to a gather of A0, B0, ..., F0 onto one process followed by a broadcast of the gathered vector to all processes.
Note: in the MPI standard, all-to-all broadcast is called MPI_Allgather.

All-to-All Reduce vs. Scatter
The figure contrasts the two operations: all-to-all reduction leaves Hx on process x, whereas scatter distributes the blocks A0, A1, ..., A5 of a single process so that process x receives Ax.

All-to-All Reduce: Vector Reduce + Scatter
The figure shows that an all-to-all reduction is equivalent to an element-wise (vector) reduction producing H0, H1, ..., H5 on one process, followed by a scatter that leaves Hx on process x.
Note: in the MPI standard, all-to-all reduce is called MPI_Reduce_scatter.

All-to-All Personalized Communication
Each node has a distinct message of size m for every other node. This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes. All-to-all personalized communication is also known as total exchange.
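The MPI routines named above can be exercised with a short sketch; this hedged example (not from the slides) uses MPI_Allgather for all-to-all broadcast and MPI_Reduce_scatter for all-to-all reduction, with one word per process and sum as the operator.

```c
/* Sketch: all-to-all broadcast (MPI_Allgather) and all-to-all reduction
 * (MPI_Reduce_scatter), one word per process, sum as the operator.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* All-to-all broadcast: everyone contributes one word and ends up
     * with the full vector of p words. */
    int mine = 100 + rank;
    int *all = malloc(p * sizeof(int));
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    /* All-to-all reduction: everyone contributes a vector of p words;
     * element x of the element-wise sum ends up on process x. */
    int *contrib = malloc(p * sizeof(int));
    int *counts  = malloc(p * sizeof(int));
    for (int x = 0; x < p; x++) {
        contrib[x] = rank * p + x;   /* this process's value destined for x */
        counts[x]  = 1;              /* each process receives one element */
    }
    int hx;
    MPI_Reduce_scatter(contrib, &hx, counts, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: all[0] = %d, Hx = %d\n", rank, all[0], hx);
    free(all); free(contrib); free(counts);
    MPI_Finalize();
    return 0;
}
```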
All-to-All Personalized Communication
The figure shows the operation on six processes: process 0 starts with A0, A1, ..., A5, process 1 with B0, B1, ..., B5, and so on; afterwards, process 0 holds A0, B0, C0, D0, E0, F0, process 1 holds A1, B1, C1, D1, E1, F1, and so on. Effectively, all-to-all personalized communication is equivalent to every process scattering its data (or, equivalently, every process gathering its share), which performs a transpose of the data matrix distributed across the processes.
Note: in the MPI standard, all-to-all personalized communication is called MPI_Alltoall.

All-to-All Personalized Communication: Example
Consider the problem of transposing a matrix. Each processor contains one full row of the matrix. The transpose operation in this case is identical to an all-to-all personalized communication operation.

All-to-All Personalized Communication: Example
All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.

All-to-All Personalized Communication on a Ring
Each node sends all pieces of data as one consolidated message of size m(p - 1) to one of its neighbors. Each node extracts the information meant for it from the data received and forwards the remaining (p - 2) pieces of size m each to the next node. The algorithm terminates in p - 1 steps. The size of the message reduces by m at each step.

All-to-All Personalized Communication on a Ring
All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y}, where x is the label of the node that originally owned the message and y is the label of the node that is the final destination of the message. The label ({x1,y1}, {x2,y2}, ..., {xn,yn}) indicates a message formed by concatenating n individual messages.

All-to-All Personalized Communication on a Ring: Cost
We have p - 1 steps in all. In step i, the message size is m(p - i). The total time is given by
T = (ts + twmp/2)(p - 1).
The tw term in this equation can be reduced by a factor of 2 by communicating messages in both directions.

All-to-All Personalized Communication on a Mesh
Each node first groups its p messages according to the columns of their destination nodes. All-to-all personalized communication is performed independently in each row with clustered messages of size m√p. Messages in each node are then sorted again, this time according to the rows of their destination nodes. All-to-all personalized communication is performed independently in each column with clustered messages of size m√p.

All-to-All Personalized Communication on a Mesh
The distribution of messages at the beginning of each phase of all-to-all personalized communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i}, ..., {8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed in dotted boundaries.
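A hedged sketch of the transpose example using MPI_Alltoall, assuming the matrix is n x n with n equal to the number of processes and that process i initially holds row i; the matrix entries are arbitrary example values.

```c
/* Sketch: transposing an n x n matrix with all-to-all personalized
 * communication (MPI_Alltoall). Assumes n == number of processes;
 * process i starts with row i of A and ends with row i of A^T
 * (i.e., column i of the original matrix).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    double *row  = malloc(n * sizeof(double));   /* row i of A */
    double *trow = malloc(n * sizeof(double));   /* row i of A^T */
    for (int j = 0; j < n; j++)
        row[j] = rank * n + j;                   /* A[i][j] = i*n + j */

    /* Element j of my row is the distinct message destined for process j;
     * element i of what I receive comes from process i, so trow[i] = A[i][rank]. */
    MPI_Alltoall(row, 1, MPI_DOUBLE, trow, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d: first entry of its transposed row = %.0f\n", rank, trow[0]);
    free(row); free(trow);
    MPI_Finalize();
    return 0;
}
```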
All-to-All Personalized Communication on a Mesh
The figure shows the clustered messages after the first grouping on a 3 x 3 mesh: node 0 holds ({0,1}, {0,4}, {0,7}) and ({0,2}, {0,5}, {0,8}), node 1 holds ({1,0}, {1,3}, {1,6}) and ({1,2}, {1,5}, {1,8}), and node 2 holds ({2,0}, {2,3}, {2,6}) and ({2,1}, {2,4}, {2,7}).

All-to-All Personalized Communication on a Mesh: Cost
The time for the first phase is identical to that of all-to-all personalized communication in a ring with √p processors:
T = ts(√p - 1) + twm√p (√p - 1)((√p - 1) + 1)/2
T = ts(√p - 1) + twmp(√p - 1)/2
The time in the second phase is identical to that of the first phase. Therefore, the total time is twice this time.

All-to-All Personalized Communication on a Mesh: Cost
The time for the first phase is identical to that in a ring with √p processors, i.e., (ts + twmp/2)(√p - 1). The time in the second phase is identical to that of the first phase. Therefore, the total time is twice this time, i.e.,
T = (2ts + twmp)(√p - 1).
It can be shown that the time for the local rearrangement of messages is much less than this communication time.

All-to-All Personalized Communication on a Hypercube
Generalize the mesh algorithm to log p steps. At any stage in all-to-all personalized communication, every node holds p packets of size m each. While communicating in a particular dimension, every node sends p/2 of these packets (consolidated as one message). A node must rearrange its messages locally before each of the log p communication steps.

All-to-All Personalized Communication on a Hypercube
An all-to-all personalized communication algorithm on a three-dimensional hypercube.

All-to-All Personalized Communication on a Hypercube: Cost
We have log p iterations, and mp/2 words are communicated in each iteration. Therefore, the cost is
T = (ts + twmp/2) log p.
This is not optimal!

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
Each node simply performs p - 1 communication steps, exchanging m words of data with a different node in every step. A node must choose its communication partner in each step so that the hypercube links do not suffer congestion. In the jth communication step, node i exchanges data with node (i XOR j). With this schedule, all paths in every communication step are congestion-free, and none of the bidirectional links carries more than one message in the same direction.

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
Seven steps in all-to-all personalized communication on an eight-node hypercube.

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
A procedure to perform all-to-all personalized communication on a d-dimensional hypercube. The message Mi,j initially resides on node i and is destined for node j.

All-to-All Personalized Communication on a Hypercube: Cost Analysis of Optimal Algorithm
There are p - 1 steps, and each step involves a non-congesting message transfer of m words. We have
T = (ts + twm)(p - 1).
This is asymptotically optimal in message size, unlike the first algorithm, whose time is (ts + twmp/2) log p.
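A hedged sketch of the optimal pairwise-exchange schedule described above, with one word per destination; the number of processes is assumed to be a power of two.

```c
/* Sketch: optimal all-to-all personalized communication by pairwise
 * exchange. Assumes p is a power of two. In step j = 1..p-1, process i
 * exchanges the block destined for (i XOR j) with process (i XOR j):
 * p-1 congestion-free steps of m words each, T = (ts + tw*m)(p - 1).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* sendblk[j] is the one-word message this process has for process j;
     * recvblk[j] will hold the message process j had for this process. */
    int *sendblk = malloc(p * sizeof(int));
    int *recvblk = malloc(p * sizeof(int));
    for (int j = 0; j < p; j++) sendblk[j] = rank * p + j;
    recvblk[rank] = sendblk[rank];            /* the message kept locally */

    for (int j = 1; j < p; j++) {
        int partner = rank ^ j;               /* node (i XOR j) */
        MPI_Sendrecv(&sendblk[partner], 1, MPI_INT, partner, j,
                     &recvblk[partner], 1, MPI_INT, partner, j,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    printf("rank %d: received block from node 0 = %d\n", rank, recvblk[0]);
    free(sendblk); free(recvblk);
    MPI_Finalize();
    return 0;
}
```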
One-to-All Broadcast
The figure revisits one-to-all broadcast: broadcasting a message A can also be done by splitting A into p parts A0, A1, ..., A5, scattering the parts so that process i receives Ai, and then performing an all-to-all broadcast so that every process reassembles the full set A0, A1, ..., A5.

Revision: Cost Analysis
On a hypercube, we have:
– For scatter: T = ts log p + twm(p - 1).
– For all-to-all broadcast: T = ts log p + twm(p - 1).
– After splitting the data into p parts, the data to be transferred in each of these operations is of size m/p.

Improving Performance of Operations
Splitting and routing messages in parts: if the message can be split into p parts, a one-to-all broadcast can be implemented as a scatter operation followed by an all-to-all broadcast operation. The time for this is
T = 2(ts log p + tw(m/p)(p - 1)) ≈ 2(ts log p + twm).
All-to-one reduction can likewise be performed as an all-to-all reduction (the dual of all-to-all broadcast) followed by a gather operation (the dual of scatter).

Improving Performance of Operations
Since an all-reduce operation is semantically equivalent to an all-to-one reduction followed by a one-to-all broadcast, the asymptotically optimal algorithms for these two operations can be used to construct a similar algorithm for the all-reduce operation. The intervening gather and scatter operations cancel each other. Therefore, an all-reduce operation requires only an all-to-all reduction followed by an all-to-all broadcast.

Backup

Circular Shift
A special permutation in which node i sends a data packet to node (i + q) mod p in a p-node ensemble (0 < q < p).

Circular Shift on a Mesh
The implementation on a ring is rather intuitive: it can be performed in min{q, p - q} neighbor-to-neighbor communications. Mesh algorithms follow from this as well: we shift in one direction (all processors) followed by the other direction. The associated time has an upper bound of
T = (ts + twm)(√p + 1).

Circular Shift on a Mesh
The communication steps in a circular 5-shift on a 4 x 4 mesh.

Circular Shift on a Hypercube
Map a linear array with 2^d nodes onto a d-dimensional hypercube. To perform a q-shift, we expand q as a sum of distinct powers of 2. If q is the sum of s distinct powers of 2, then the circular q-shift on a hypercube is performed in s phases. The time for this is upper-bounded by
T = (ts + twm)(2 log p - 1).
If E-cube routing is used, this time can be reduced to ts + twm.

Circular Shift on a Hypercube
The mapping of an eight-node linear array onto a three-dimensional hypercube to perform a circular 5-shift as a combination of a 4-shift and a 1-shift.

Circular Shift on a Hypercube
Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8.

The Prefix-Sum Operation
Given p numbers n0, n1, ..., n(p-1) (one on each node), the problem is to compute the sums sk = n0 + n1 + ... + nk for all k between 0 and p-1. Initially, nk resides on the node labeled k, and at the end of the procedure, the same node holds sk.
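In MPI, this operation is available directly as MPI_Scan, which computes an inclusive prefix reduction across ranks; a minimal, hedged usage sketch with example values nk = k + 1:

```c
/* Sketch: prefix sums with MPI_Scan. Rank k contributes n_k and
 * receives s_k = n_0 + n_1 + ... + n_k (inclusive scan). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nk = rank + 1;     /* the local number n_k (example values) */
    int sk = 0;            /* will hold the prefix sum s_k */

    MPI_Scan(&nk, &sk, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: n_k = %d, s_k = %d\n", rank, nk, sk);
    MPI_Finalize();
    return 0;
}
```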
The Prefix-Sum Operation
Computing prefix sums on an eight-node hypercube. At each node, square brackets show the local prefix sum accumulated in the result buffer, and parentheses enclose the contents of the outgoing message buffer for the next step.

The Prefix-Sum Operation
The operation can be implemented using the all-to-all broadcast kernel. We must account for the fact that in prefix sums the node with label k uses information only from the subset of nodes whose labels are less than or equal to k. This is implemented using an additional result buffer. The content of an incoming message is added to the result buffer only if the message comes from a node with a smaller label than that of the recipient node. The contents of the outgoing message (denoted by parentheses in the figure) are updated with every incoming message.

The Prefix-Sum Operation
Prefix sums on a d-dimensional hypercube.
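A hedged sketch of that procedure built from point-to-point exchanges; the number of processes is assumed to be a power of two, with example values nk = k + 1.

```c
/* Sketch of the hypercube prefix-sum procedure described above: each
 * process keeps a result buffer (its own prefix sum) and a message
 * buffer (the sum over its current subcube). In each dimension it
 * exchanges the message buffer with its partner, always folds the
 * incoming value into the message buffer, and folds it into the result
 * buffer only if it came from a smaller label. Assumes p is a power of two.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int nk = rank + 1;        /* local number n_k (example values) */
    int result = nk;          /* result buffer: will become s_k */
    int msg = nk;             /* outgoing message buffer */

    for (int bit = 1; bit < p; bit <<= 1) {
        int partner = rank ^ bit;             /* neighbor across this dimension */
        int incoming;
        MPI_Sendrecv(&msg, 1, MPI_INT, partner, 0,
                     &incoming, 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        msg += incoming;                      /* always update the message buffer */
        if (partner < rank) result += incoming;  /* only smaller labels contribute */
    }

    printf("rank %d: s_k = %d\n", rank, result);
    MPI_Finalize();
    return 0;
}
```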