Programming Using Message Passing Paradigm

Document Details

Uploaded by Ameera

University of Sharjah

Dr. Ali El-Moursy

Tags

message passing, parallel processing, distributed computing, programming paradigms

Summary

This document is a transcript of lecture slides on programming with the message-passing paradigm. It covers the principles and building blocks of message passing (send and receive operations), the MPI (Message Passing Interface) standard and its core routines, Cartesian topologies, and parallel matrix-matrix multiplication, including Cannon's algorithm.

Full Transcript


6. Programming Using Message Passing Paradigm
Parallel & Distributed Processing (0403412), Dr. Ali El-Moursy

Topic Overview

Principles of Message-Passing Programming
The Building Blocks: Send and Receive Operations
MPI: the Message Passing Interface
Topologies and Embedding
Overlapping Communication with Computation
Collective Communication and Computation Operations
Groups and Communicators

Principles of Message-Passing Programming

The logical view of a machine supporting the message-passing paradigm consists of p processes, each with its own exclusive address space. Each data element must belong to one of the partitions of that space; hence, data must be explicitly partitioned and placed. All interactions (read-only or read/write) require the cooperation of two processes: the process that has the data and the process that wants to access it. These constraints make the underlying costs very explicit to the programmer.

Message-passing programs are often written using the asynchronous or loosely synchronous paradigms. In the asynchronous paradigm, all concurrent tasks execute asynchronously. In the loosely synchronous model, tasks or subsets of tasks synchronize to perform interactions; between these interactions, tasks execute completely asynchronously. Most message-passing programs are written using the single program multiple data (SPMD) model.

The Building Blocks: Send and Receive Operations

The prototypes of these operations are as follows:

send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)

Consider the following code segments:

P0                      P1
a = 100;                receive(&a, 1, 0);
send(&a, 1, 1);         printf("%d\n", a);
a = 0;

The semantics of the send operation require that the value received by process P1 must be 100, not 0. This motivates the design of the send and receive protocols.

Non-Buffered Blocking Message Passing Operations

A simple method for enforcing send/receive semantics is for the send operation to return only when it is safe to do so. In the non-buffered blocking send, the operation does not return until the matching receive has been encountered at the receiving process. Idling and deadlocks are major issues with non-buffered blocking sends: a simple exchange in which both processes send before they receive can deadlock (a sketch of such an exchange appears after the buffered case below).

(Figure: handshake for a blocking non-buffered send/receive operation.) It is easy to see that in cases where the sender and receiver do not reach their communication points at similar times, there can be considerable idling overhead.

Buffered Blocking Message Passing Operations

A simple solution to the idling and deadlocking problems outlined above is to rely on buffers at the sending and receiving ends. The sender simply copies the data into a designated buffer and returns once the copy has completed. The data must be buffered at the receiving end as well. Buffering trades off idling overhead for buffer-copying overhead.
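To make the deadlock scenario mentioned above concrete, here is a sketch (the exact example on the original slide is not in the transcript) in which both processes send before they receive, written with the generic send/receive prototypes defined earlier:

P0                      P1
send(&a, 1, 1);         send(&a, 1, 0);
receive(&b, 1, 1);      receive(&b, 1, 0);

With non-buffered blocking semantics, neither send can return until the other process posts its receive, so both processes wait forever. With buffered blocking sends, each send returns as soon as its data has been copied into a buffer, and this particular exchange completes.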
(Figure: blocking buffered transfer protocols, (a) in the presence of communication hardware with buffers at the send and receive ends, and (b) in the absence of communication hardware, where the sender interrupts the receiver and deposits the data in a buffer at the receiving end.)

Bounded buffer sizes can have a significant impact on performance:

P0                                P1
for (i = 0; i < 1000; i++) {      for (i = 0; i < 1000; i++) {
    produce_data(&a);                 receive(&a, 1, 0);
    send(&a, 1, 1);                   consume_data(&a);
}                                 }

What if the consumer is much slower than the producer? With bounded buffers the sends eventually block once the buffer space fills; with unbounded buffers, memory consumption grows without limit.

Deadlocks are still possible with buffering, since receive operations block:

P0                      P1
receive(&a, 1, 1);      receive(&a, 1, 0);
send(&b, 1, 1);         send(&b, 1, 0);

Non-Blocking Message Passing Operations

This class of protocols returns from the send or receive operation before it is semantically safe to do so, so the programmer must ensure the semantics of the send and receive. Non-blocking operations are generally accompanied by a check-status operation. When used correctly, these primitives can overlap communication overheads with useful computation. Message-passing libraries typically provide both blocking and non-blocking primitives.

(Figure: non-blocking non-buffered send and receive operations, (a) in the absence of communication hardware and (b) in the presence of communication hardware.)

Send and Receive Protocols

(Figure: space of possible protocols for send and receive operations.)

MPI: the Message Passing Interface

MPI defines a standard library for message passing that can be used to develop portable message-passing programs in either C or Fortran. The MPI standard defines both the syntax and the semantics of a core set of library routines. Vendor implementations of MPI are available on almost all commercial parallel computers. It is possible to write fully functional message-passing programs using only six routines. The minimal set of MPI routines:

MPI_Init          Initializes MPI.
MPI_Finalize      Terminates MPI.
MPI_Comm_size     Determines the number of processes.
MPI_Comm_rank     Determines the label (rank) of the calling process.
MPI_Send          Sends a message.
MPI_Recv          Receives a message.

Starting and Terminating the MPI Library

MPI_Init is called prior to any other MPI routine; its purpose is to initialize the MPI environment. MPI_Finalize is called at the end of the computation and performs various clean-up tasks to terminate the MPI environment. The prototypes of these two functions are:

int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()

MPI_Init also strips off any MPI-related command-line arguments. All MPI routines, data types, and constants are prefixed by "MPI_". The return code for successful completion is MPI_SUCCESS.
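As an illustration of these two calls (a minimal sketch, not taken from the slides), the program below checks MPI_Init's return code against MPI_SUCCESS before making any other MPI call:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* MPI_Init must precede all other MPI calls; it also strips
       MPI-related command-line arguments from argc/argv. */
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed\n");
        return EXIT_FAILURE;
    }

    /* ... the actual computation and MPI communication go here ... */

    MPI_Finalize();   /* last MPI call before the program exits */
    return 0;
}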
Communicators

A communicator defines a communication domain: a set of processes that are allowed to communicate with each other. Information about communication domains is stored in variables of type MPI_Comm. Communicators are used as arguments to all message-transfer MPI routines. A process can belong to many different (possibly overlapping) communication domains. MPI defines a default communicator called MPI_COMM_WORLD, which includes all the processes.

Querying Information

The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label of the calling process, respectively. The calling sequences of these routines are as follows:

int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)

The rank of a process is an integer that ranges from zero up to the size of the communicator minus one.

Our First MPI Program

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("From process %d out of %d, Hello World!\n", myrank, npes);
    MPI_Finalize();
    return 0;
}

Sending and Receiving Messages

The basic functions for sending and receiving messages in MPI are MPI_Send and MPI_Recv, respectively. The calling sequences of these routines are as follows:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

MPI provides equivalent datatypes for all C datatypes; this is done for portability reasons. The datatype MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data items created by packing non-contiguous data. The message tag can take values ranging from zero up to the MPI-defined constant MPI_TAG_UB.

MPI Datatypes

MPI Datatype           C Datatype
MPI_CHAR               signed char
MPI_SHORT              signed short int
MPI_INT                signed int
MPI_LONG               signed long int
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED_SHORT     unsigned short int
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long int
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
MPI_BYTE               (no C equivalent)
MPI_PACKED             (no C equivalent)

MPI allows specification of wildcard arguments for both source and tag. If source is set to MPI_ANY_SOURCE, then any process in the communication domain can be the source of the message. If tag is set to MPI_ANY_TAG, then messages with any tag are accepted. On the receive side, the message must be of length equal to or less than the length specified in the count field.

On the receiving end, the status variable can be used to get information about the MPI_Recv operation. The corresponding data structure contains:

typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
};

The MPI_Get_count function returns the precise count of data items received:

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
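The following sketch (an illustrative program, not from the slides) ties these pieces together: rank 0 accepts one message from any other rank using the wildcards, then uses the status fields and MPI_Get_count to find out who sent it, with what tag, and how many items actually arrived:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0 && npes > 1) {
        int buf[4], count;
        MPI_Status status;

        /* Accept a message from any source, with any tag; the receive
           buffer may be larger than the message actually sent. */
        MPI_Recv(buf, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        printf("Received %d ints from rank %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    } else if (myrank == 1) {
        int data[2] = {7, 8};   /* shorter than the receive buffer */
        MPI_Send(data, 2, MPI_INT, 0, 5, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}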
Avoiding Deadlocks

Recall the prototypes:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send blocks until the matching receive is posted, this code deadlocks: process 1 first waits for the message with tag 2, but process 0 cannot issue that send until its first send (tag 1) has been received.

Now consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes):

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...

Once again we have a deadlock if MPI_Send is blocking: every process is stuck in its send, and no receive is ever posted.

We can break the circular wait and avoid the deadlock by reversing the send/receive order on alternate processes:

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1) {
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
}
else {
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
}
...

Sending and Receiving Messages Simultaneously

To exchange messages, MPI provides the following function:

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

The arguments are essentially the union of the arguments to the send and receive functions. If we wish to use the same buffer for both the send and the receive, we can use:

int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

Topologies and Embeddings

MPI allows a programmer to organize processes into logical k-d meshes. The process ids in MPI_COMM_WORLD can be mapped to other communicators (corresponding to higher-dimensional meshes) in many ways. The goodness of any such mapping is determined by the interaction pattern of the underlying program and the topology of the machine. MPI does not provide the programmer any control over these mappings.

Creating and Using Cartesian Topologies

We can create Cartesian topologies using the function:

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)

This function takes the processes in the old communicator and creates a new communicator with ndims dimensions, where dims[i] specifies the number of processes along dimension i. Each process can then be identified in the new Cartesian topology by a vector of ndims coordinates.
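As a concrete sketch (illustrative, not from the slides), the fragment below organizes the processes of MPI_COMM_WORLD into a periodic sqrt(p) x sqrt(p) grid and prints each process's grid coordinates, using MPI_Cart_coords, which is described next:

#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, my2drank;
    int dims[2], periods[2] = {1, 1}, coords[2];   /* wraparound in both dimensions */
    MPI_Comm comm_2d;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    /* Assume a perfect-square number of processes for a square grid. */
    dims[0] = dims[1] = (int)sqrt((double)npes);

    /* reorder = 1 lets MPI renumber ranks to better fit the physical topology. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);

    if (comm_2d != MPI_COMM_NULL) {   /* processes left out of the grid get MPI_COMM_NULL */
        MPI_Comm_rank(comm_2d, &my2drank);
        MPI_Cart_coords(comm_2d, my2drank, 2, coords);
        printf("rank %d sits at grid position (%d, %d)\n",
               my2drank, coords[0], coords[1]);
        MPI_Comm_free(&comm_2d);
    }

    MPI_Finalize();
    return 0;
}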
Since sending and receiving messages still require (one-dimensional) ranks, MPI provides routines to convert ranks to Cartesian coordinates and vice versa:

int MPI_Cart_coords(MPI_Comm comm_cart, int rank, int maxdims, int *coords)
int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)

The most common operation on Cartesian topologies is a shift. To determine the ranks of the source and destination of such a shift, MPI provides the following function:

int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step, int *rank_source, int *rank_dest)

Matrix-Matrix Multiplication

Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B. The serial complexity is O(n³). We do not consider better serial algorithms (such as Strassen's method), although these can be used as serial kernels inside the parallel algorithms. A useful concept here is that of block operations: an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 <= i, j < q) such that each block is an (n/q) x (n/q) submatrix. In this view, we perform q³ block matrix multiplications, each involving (n/q) x (n/q) matrices.

Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 <= i, j < √p) of size (n/√p) x (n/√p) each. Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 <= k < √p. The simple algorithm therefore performs an all-to-all broadcast of the blocks of A along rows and of the blocks of B along columns, and then each process performs its local submatrix multiplications.

Revision: Matrix Multiplication with Two-Dimensional Partitioning

With a two-dimensional distribution, each process needs to access the entire row of blocks of matrix A and the entire column of blocks of matrix B that intersect its own block. (Figure: the blocks accessed by process P5 under a two-dimensional partitioning.)

Revision of communication cost: for an all-to-all broadcast among p processors, each contributing a message of size m, the cost is (t_s + t_w·m)(p - 1) on a ring and 2·t_s(√p - 1) + t_w·m(p - 1) on a 2-D mesh, where t_s is the message startup time and t_w is the per-word transfer time. In the case of matrix-matrix multiplication, the row-wise all-to-all broadcast takes place among the √p processors of a row (a ring), and the message size to be transferred is n²/p. Hence

T = (t_s + t_w·n²/p)(√p - 1).

The all-to-all broadcast is performed twice, once row-wise and once column-wise. Hence

T = 2(t_s + t_w·n²/p)(√p - 1).

The two broadcasts therefore take time 2(t_s + t_w·n²/p)(√p - 1). The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices, which takes √p · (n/√p)³ = n³/p time. The parallel run time is approximately

T_P = n³/p + 2(t_s + t_w·n²/p)(√p - 1).

A major drawback of this algorithm is that it is not memory optimal: after the broadcasts each process holds √p blocks of A and √p blocks of B, so the total memory requirement over all processes grows as Theta(n²·√p) rather than Theta(n²).

Matrix-Matrix Multiplication: Cannon's Algorithm

In this algorithm, we schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different block Ai,k. These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.

(Figure: communication steps in Cannon's algorithm on 16 processes.)
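Each compute step, both in the simple algorithm and in Cannon's algorithm, multiplies the two blocks currently resident on a process and accumulates the result into that process's block of C. A minimal sketch of such a local kernel (the name MatrixMultiply and its exact interface are assumptions here; it is the serial block multiply that the MPI routine on the following slides is expected to call) is:

/* Sketch of a local block multiply: c = c + a*b for nlocal x nlocal blocks
   stored in row-major order. c must be zeroed before the first call, since
   the kernel accumulates into it. */
void MatrixMultiply(int nlocal, double *a, double *b, double *c)
{
    int i, j, k;

    for (i = 0; i < nlocal; i++)
        for (j = 0; j < nlocal; j++)
            for (k = 0; k < nlocal; k++)
                c[i*nlocal + j] += a[i*nlocal + k] * b[k*nlocal + j];
}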
Cannon's algorithm proceeds as follows. First, align the blocks of A and B in such a way that each process can multiply its local submatrices: shift all submatrices Ai,j to the left (with wraparound) by i steps and all submatrices Bi,j up (with wraparound) by j steps. Perform a local block multiplication. Then each block of A moves one step left and each block of B moves one step up (again with wraparound). Perform the next block multiplication, add it to the partial result, and repeat until all √p blocks have been multiplied.

In the alignment step, since the maximum distance over which a block shifts is √p - 1, the two shift operations require a total of 2(t_s + t_w·n²/p) time (each alignment shift is a single message per process over the wraparound network). Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes t_s + t_w·n²/p time. The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n³/p. The parallel time is approximately

T_P = n³/p + 2√p(t_s + t_w·n²/p).

Unlike the previous algorithm, Cannon's algorithm is memory optimal: each process holds only one block of A and one block of B at any time, so the total memory used remains Theta(n²).

Cannon's Matrix-Matrix Multiplication with MPI's Topologies

MatrixMatrixMultiply(int n, double *a, double *b, double *c, MPI_Comm comm)
{
    int i;
    int nlocal;
    int npes, dims[2], periods[2];
    int myrank, my2drank, mycoords[2];
    int uprank, downrank, leftrank, rightrank, coords[2];
    int shiftsource, shiftdest;
    MPI_Status status;
    MPI_Comm comm_2d;

    /* Get information about the communicator */
    MPI_Comm_size(comm, &npes);
    MPI_Comm_rank(comm, &myrank);

    /* Set up a square Cartesian topology with wraparound connections */
    dims[0] = dims[1] = (int)sqrt((double)npes);
    periods[0] = periods[1] = 1;
    MPI_Cart_create(comm, 2, dims, periods, 1, &comm_2d);

    /* Get the rank and coordinates with respect to the new topology */
    MPI_Comm_rank(comm_2d, &my2drank);
    MPI_Cart_coords(comm_2d, my2drank, 2, mycoords);

    /* Compute the ranks of the left/right and up/down neighbors */
    MPI_Cart_shift(comm_2d, 0, -1, &rightrank, &leftrank);
    MPI_Cart_shift(comm_2d, 1, -1, &downrank, &uprank);

    /* Determine the dimension of the local matrix block */
    nlocal = n/dims[0];

    /* Perform the initial matrix alignment, first for A and then for B:
       each block of A is shifted along dimension 0 by the coordinate in
       dimension 1, and each block of B along dimension 1 by the coordinate
       in dimension 0 (both with wraparound). */
    MPI_Cart_shift(comm_2d, 0, -mycoords[1], &shiftsource, &shiftdest);
    MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1, shiftsource, 1, comm_2d, &status);

    MPI_Cart_shift(comm_2d, 1, -mycoords[0], &shiftsource, &shiftdest);
    MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1, shiftsource, 1, comm_2d, &status);
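The listing above ends before the main loop. A sketch of how the routine plausibly continues, following the compute-and-shift description above (MatrixMultiply is the assumed local block-multiply kernel sketched earlier; the neighbor ranks and alignment logic are those computed in the listing):

    /* Main compute-and-shift loop: dims[0] = sqrt(p) iterations. */
    for (i = 0; i < dims[0]; i++) {
        MatrixMultiply(nlocal, a, b, c);   /* c = c + a*b on the resident blocks */

        /* Shift matrix a one step left and matrix b one step up (with wraparound). */
        MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, leftrank, 1, rightrank, 1, comm_2d, &status);
        MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, uprank, 1, downrank, 1, comm_2d, &status);
    }

    /* Restore the original distribution of a and b by undoing the initial alignment. */
    MPI_Cart_shift(comm_2d, 0, +mycoords[1], &shiftsource, &shiftdest);
    MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1, shiftsource, 1, comm_2d, &status);

    MPI_Cart_shift(comm_2d, 1, +mycoords[0], &shiftsource, &shiftdest);
    MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1, shiftsource, 1, comm_2d, &status);

    /* Release the Cartesian communicator created for this routine. */
    MPI_Comm_free(&comm_2d);
}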
