Message Passing Computing Lecture PDF
Summary
This document includes lecture notes on message passing computing. It covers the topics of message passing, process creation, and parallel programming models such as MPMD and SPMD using MPI.
Full Transcript
Chapter-2 MESSAGE PASSING COMPUTING

Introduction
In this chapter we study the following topics: (a) the basic concepts of message passing computing, (b) the structure of message passing programs, (c) the techniques to specify message passing between processes, and (d) how to evaluate message-passing parallel programs, both theoretically and practically.

Parallel Programming with Message Passing Libraries
It is necessary to know (a) what processes are to be executed, (b) when to pass messages between concurrent processes, and (c) what to pass in the messages.

Process Creation
A process is an instance of a program in execution. There are two methods of process creation: (a) static process creation and (b) dynamic process creation.

Static method: All processes are specified explicitly by command-line actions before execution, and the system executes a fixed number of processes.

Dynamic method: Processes can be created, and their execution initiated, during the execution of other processes. Process creation constructs or library/system calls are used to create processes. Processes can also be destroyed, and creation and destruction may be done conditionally, so the number of processes may vary during execution.

Static vs Dynamic
Dynamic process creation is a more powerful technique than static process creation, but it introduces significant overhead whenever processes are created.

Process Identification (ID)
In most applications the processes are neither all the same nor all different: one controlling process is the MASTER process, and the remaining processes are SLAVES (or WORKERS). The slave processes are identical in form but have different process identifications (IDs). The process ID can be used to modify the actions of a process or to compute different destinations for messages.

Programming Model
There are two models for parallel programming:
1. MPMD - Multiple Program, Multiple Data
2. SPMD - Single Program, Multiple Data

Multiple Program Multiple Data (MPMD) model
This is the most general parallel programming model, in which a completely separate and different program is written for each processor. In practice it is sufficient to have just two different programs: a master program and a slave program. One processor executes the master program and multiple processors execute identical slave programs. The slave programs are identical in form and are differentiated only by their process IDs. Pictorially the MPMD model is as follows:

[Figure: MPMD model - each source file is compiled to suit its processor, giving a separate executable for Processor 0 through Processor p-1.]

The MPMD model is more suitable for dynamic process creation. In dynamic process creation, two distinct programs (a master program and a slave program) are written and compiled separately before execution. An example of a library call for dynamic process creation is

    spawn(name_of_process);

which starts another process immediately. A spawned process is a previously compiled executable program.

[Figure: spawning a process - Process 1 issues spawn(), which starts execution of Process 2.]

Single Program Multiple Data (SPMD) model
In this model the different programs are merged into one program, called the source program. Within the source program, control statements select different parts for each process. Once this selection is in place, the source program is compiled into executable code for each processor. Each processor loads a copy of this code into its local memory for execution, and all processors can start executing their code together. This approach is used in MPI. The SPMD model is more suitable for static process creation.
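As an illustrative sketch (not taken from these notes) of how dynamic process creation looks with actual MPI-2 calls, MPI_Comm_spawn() plays the role of the generic spawn(name_of_process) call above; the executable name "slave_prog" and the count of 4 slaves are assumptions made for the example.

    /* Hypothetical master program: dynamic process creation, MPI-2 style.
       MPI_Comm_spawn() starts copies of a previously compiled executable,
       much like the generic spawn(name_of_process) call described above. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm intercomm;   /* intercommunicator linking master and spawned slaves */
        int errcodes[4];

        MPI_Init(&argc, &argv);

        /* Start 4 instances of the (assumed) slave executable "slave_prog". */
        MPI_Comm_spawn("slave_prog", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, errcodes);

        /* ... the master can now exchange messages with the slaves
           through intercomm ... */

        MPI_Finalize();
        return 0;
    }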
Pictorially the SPMD model is as follows:

[Figure: SPMD model (the basic MPI way) - a single source file is compiled to suit each processor, producing executables for Processor 0 through Processor p-1.]

Message Passing Routines
Message passing library calls have the following basic forms:
(i) send(parameter_list)
(ii) recv(parameter_list)
where send() is placed in the source process originating the message and recv() is placed in the destination process to collect the message being sent. For the C language the calls have the forms:

    send(&x, destination_id)
    recv(&y, source_id)

[Figure: generic send/recv between Process 1 (source) and Process 2 (destination), showing the movement of data; actual formats are given later.]

Mechanisms for Message-Passing Routines
Various mechanisms are provided for message-passing routines to make code efficient and the routines flexible. Some of them are: (a) synchronism, (b) blocking and non-blocking operation, (c) message selection, and (d) broadcast, gather and scatter.

Synchronous Message Passing
Synchronous routines return only when the message transfer has been completed. A synchronous send routine waits until the complete message has been accepted by the receiving process before returning. A synchronous receive routine waits until the message it is expecting arrives and is stored before returning. Hence synchronous routines intrinsically perform two actions: they transfer data and they synchronize processes.

[Figure: synchronous send() and recv() using a three-way protocol - (a) when send() occurs before recv(), Process 1 issues a request to send and is suspended until Process 2 reaches its recv() and returns an acknowledgment; the message is then transferred and both processes continue. (b) When recv() occurs before send(), Process 2 is suspended until the request to send arrives and is acknowledged; the message is then transferred and both processes continue.]

Asynchronous Message Passing
Asynchronous routines do not wait for the message transfer to complete before returning. Usually a message buffer is needed between the source and the destination to hold messages being sent prior to being accepted by recv(). If recv() is reached before send(), the message buffer is empty and the recv() waits for a message. If send() is reached before recv(), the sending process can continue with subsequent work once the local actions have completed and the message is safely on its way, which can decrease the overall execution time. In practice, buffers can only be of finite length: the send() is held up if all the available buffer space has been occupied and waits until storage becomes available (i.e. the routine then behaves like a synchronous routine).

[Figure: asynchronous message passing using a buffer - Process 1 writes the message into a message buffer and continues; Process 2 later reads the message from the buffer.]

Blocking and Non-blocking Message Passing
Former definitions: blocking describes routines that do not allow the process to continue until the transfer is complete; non-blocking describes routines that return whether or not the message has been received. Under these definitions, blocking = synchronous and non-blocking = asynchronous.

MPI Definitions of Blocking and Non-blocking
Blocking (locally blocking) routines return after their local actions complete, though the message transfer may not have been completed. Non-blocking routines return immediately. Non-blocking routines assume that the data storage used for the transfer is not modified by subsequent statements before the transfer has used it, and it is left to the programmer to ensure this.
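A minimal sketch (two processes and a tag value of 0 are assumed) contrasting MPI's locally blocking standard send with its synchronous send, which returns only once the matching receive has started:

    #include <mpi.h>

    void send_examples(int myrank)
    {
        int x = 42;
        int tag = 0;                  /* assumed tag value */
        MPI_Status status;

        if (myrank == 0) {
            /* Standard (locally blocking) send: returns once x may be reused,
               possibly before the message has been received. */
            MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);

            /* Synchronous send: returns only after the matching receive has
               started, so it also synchronizes the two processes. */
            MPI_Ssend(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (myrank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            MPI_Recv(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        }
    }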
Message Selection
The send() in a source process sends a message only to the destination process whose ID is given as a parameter in send(), and the recv() in a destination process receives a message only from the source process whose ID is given as a parameter in recv(). To make this selection more flexible, a message tag is attached to the message.

Message Tag
A message tag is typically a user-chosen non-negative integer (including zero) used to differentiate between different types of messages being sent. The tag is carried within the message. If special type matching is not required, a wild card (a special symbol or number) can be used in place of a message tag, so that the recv() will match any send(). The reason for using message tags is to provide a more powerful message selection mechanism.

Example: to send a message x with message tag 5 from a source Process 1 to a destination Process 2 and assign it to y:
[Figure: Process 1 sends x with tag 5; Process 2 waits for a message from Process 1 with a tag of 5 and stores it in y.]

Broadcast, Gather and Scatter
These are "group" message passing routines which send a message (or messages) to a group of processes or receive messages from a group of processes; such an operation is known as a collective operation. Although not absolutely necessary, these routines have higher efficiency than the equivalent set of separate point-to-point routines.

Broadcast routines send the same message to all processes concerned with the problem. Multicast routines send the same message to a defined group of processes, but here multicast routines will be considered the same as broadcast routines.
[Figure: broadcast - the root's data is sent to Process 0 through Process p-1; action and MPI form shown.]

Scatter routines send each element of an array in the root process to a separate process; the contents of the ith location of the array are sent to the ith process.
[Figure: scatter - the elements of the root's array are distributed to Process 0 through Process p-1.]

Gather routines have one process collect individual values from a set of processes.
[Figure: gather - values from Process 0 through Process p-1 are collected into the root.]

Reduce is a gather operation combined with a specified arithmetic/logical operation. For example, values could be gathered and then added together by the root.
[Figure: reduce - values from Process 0 through Process p-1 are combined (e.g. summed) at the root.]

MPI (Process Creation and Execution)
Generally, parallel computations are decomposed into concurrent processes. Creating and starting MPI processes is purposely not defined in the MPI standard and depends upon the implementation. As the SPMD model of computation is used, one program is written and executed by multiple processors. Typically, an executable MPI program is started on the command line.

Communicators
A communicator defines the scope of a communication operation in MPI. Processes have ranks associated with a communicator. Initially, all processes are enrolled in a "universe" (a communicator) called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p-1 for p processes. It is the default communicator for simple programs. Other communicators can be established for groups of processes.

Using the SPMD Computational Model
The SPMD model is ideal when each process executes the same code, but usually one or more processes need to execute different code. To facilitate this within a single program, statements are inserted to select which portion of the code is executed by each process. Hence in the SPMD model both the master code and the slave code must be in the same program. For example:

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        ...
        int myrank;
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        if (myrank == 0)
            master();
        else
            slave();
        ...
        MPI_Finalize();
    }
where master() and slave() are to be executed by the master process and the slave processes, respectively.

*** Python version (mpi4py) ***

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    ...
    myrank = comm.Get_rank()
    if myrank == 0:
        master()
    else:
        slave()
    ...

Unsafe Message Passing
Message-passing communications can be a source of erroneous operation. This situation occurs due to the use of wild cards: a receive using a wild card may match a message other than the intended one. The problem can be solved by MPI communicators, discussed next.

[Figure: unsafe message passing with wild cards between Process 0 and Process 1 - (a) the intended behavior, where the receive matches the intended source; (b) a possible behavior, where the receive matches a different message.]

MPI Solution: Communicators
A communicator is a communication domain that defines a set of processes that are allowed to communicate among themselves. In MPI, communicators are used for all point-to-point and collective MPI message-passing communications. Two types of communicators exist: intracommunicators (within one group) and intercommunicators (between groups).

MPI_COMM_WORLD
MPI_COMM_WORLD exists as the first communicator for all the processes in the application. In simple applications it can be used for all point-to-point and collective operations, so there is no need to create new communicators. A set of MPI routines exists for forming new communicators from existing ones. Processes have a "rank" in a communicator.

Message Passing Routines
The basic type of message passing routine is point-to-point, classified by its completion behavior as blocking or non-blocking.

MPI Point-to-Point Communication
MPI uses send and receive routines with message tags and a communicator. Wild cards can be used in place of the message tag (MPI_ANY_TAG) and in place of the source ID in receive routines (MPI_ANY_SOURCE). The datatype of the message is defined in the send/receive parameters. Explicit send and receive buffers are not required, which helps reduce the storage requirements for large messages.

Completion
There are several versions of send and receive. Locally complete means the process (e.g. the sender) is done with its own part of the operation (e.g. the data has been sent out). Globally complete means the entire communication between sender and receiver has fully finished (both the send and the receive operations have completed).

MPI Blocking Routines
Blocking routines return when they are "locally complete", that is, when the location used to hold the message can be used again or altered without affecting the message being sent. A blocking send will send the message and return; this does not mean that the message has been received, just that the process is free to move on without adversely affecting the message. Essentially the source process is blocked only for the minimum time required to access the data. A blocking receive routine returns when it is locally complete, which means the message has been received into the destination location and the destination location can be read.

*** Python version ***
The general formats of the parameters of the blocking send and receive are:

    send(buf, dest, tag=0)
    recv(buf=None, source=ANY_SOURCE, tag=ANY_TAG, status=None)

Example: to send an integer x from process 0 to process 1:

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        int x;
        MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        int x;
        MPI_Status status;
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
    }
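A small sketch (the buffer, count and printed output are illustrative) showing the wild cards MPI_ANY_SOURCE and MPI_ANY_TAG mentioned above used with a blocking receive, together with the MPI_Status fields that identify the actual sender and tag:

    #include <mpi.h>
    #include <stdio.h>

    void receive_any(void)
    {
        int x;
        MPI_Status status;

        /* Accept a single integer from any source with any tag. */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);

        /* The status structure records who actually sent it and with which tag. */
        printf("Received %d from rank %d with tag %d\n",
               x, status.MPI_SOURCE, status.MPI_TAG);
    }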
MPI Non-blocking Routines
A non-blocking send, MPI_Isend(), returns "immediately", even before the source location is safe to be altered. A non-blocking receive, MPI_Irecv(), returns even if there is no message to accept. In these routines "I" refers to the word immediate.

Non-blocking Routine Formats

    MPI_Isend(buf, count, datatype, dest, tag, comm, request)
    MPI_Irecv(buf, count, datatype, source, tag, comm, request)

Completion is detected by MPI_Wait() and MPI_Test(). MPI_Wait() waits until the operation has completed and only then returns; MPI_Test() returns immediately with a flag set indicating whether the operation has completed at that time. The request parameter is used to determine whether the operation has completed. The non-blocking receive routine provides the ability for a process to continue with other activities while waiting for the message to arrive.

Example: to send an integer x from process 0 to process 1 and allow process 0 to continue:

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        int x;
        MPI_Request req1;
        MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
        compute();
        MPI_Wait(&req1, &status);
    } else if (myrank == 1) {
        int x;
        MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
    }

Send Communication Modes
Standard mode: the send may complete before the receive starts, if buffer space is available.
Buffered mode: the send can complete before the receive; the user must attach a buffer with MPI_Buffer_attach().
Synchronous mode: the send completes only when the matching receive has started, so the two processes synchronize.
Ready mode: the send may start only if the matching receive has already been posted.
Each of the four modes can be applied to both blocking and non-blocking send routines. Only the standard mode is available for the blocking and non-blocking receive routines.

Collective Communication
Collective communication involves a set of processes, defined by an intracommunicator. Message tags are not present. The principal collective operations are:
MPI_Bcast() - broadcast from root to all other processes
MPI_Gather() - gather values from a group of processes
MPI_Scatter() - scatter a buffer in parts to a group of processes
MPI_Alltoall() - send data from all processes to all processes
MPI_Reduce() - combine values from all processes into a single value
MPI_Reduce_scatter() - combine values and scatter the results

Example: to gather items from a group of processes into process 0, using dynamically allocated memory in the root process:

    int data[10];
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
        buf = (int *)malloc(grp_size * 10 * sizeof(int));
    }
    MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

Here each process contributes 10 integers, and the receive count (10) is the number of items received from each process, not the total. NOTE: MPI_Gather() gathers from all processes, including the root.
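A hedged sketch combining three of the principal collective operations listed above (broadcast, scatter and reduce); the root rank 0, the block size of 10 items per process and the summation are assumptions chosen for the example, not part of the original notes:

    #include <mpi.h>
    #include <stdlib.h>

    #define NPER 10                    /* items handled by each process (assumed) */

    void collective_sketch(int myrank, int p)
    {
        int n = 0;                     /* value to be broadcast from the root */
        int local[NPER];               /* this process's block of the data */
        int *full = NULL;              /* whole array, allocated on the root only */
        int local_sum = 0, global_sum = 0, i;

        if (myrank == 0) {
            full = (int *)malloc(p * NPER * sizeof(int));
            /* ... root would fill 'full' and set n here ... */
        }

        /* Broadcast: the same value n is sent from the root to every process. */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Scatter: each process receives its own block of NPER items. */
        MPI_Scatter(full, NPER, MPI_INT, local, NPER, MPI_INT, 0, MPI_COMM_WORLD);

        for (i = 0; i < NPER; i++)
            local_sum += local[i];

        /* Reduce: the partial sums are combined into one value on the root. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }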
Barrier
As in all message-passing systems, MPI provides a means of synchronizing processes by stopping each one until all of them have reached a specific "barrier" call. Barriers are discussed further in Chapter 6 (Synchronized Computations).

Evaluating Parallel Programs
As we will discuss various methods of achieving parallelism in subsequent chapters, it is necessary to know how to evaluate them. The first task is to calculate the computation time and the communication time of the parallel algorithm.

Equations for Parallel Execution Time
Sequential execution time, ts: estimated by counting the computational steps of the best sequential algorithm.
Parallel execution time, tp: in addition to the number of computational steps, tcomp, we need to estimate the communication overhead, tcomm. Then

    tp = tcomp + tcomm

Computational Time
It is assumed that all processors are identical and operate at the same speed (a homogeneous system). Count the number of computational steps; when more than one process is executed simultaneously, count the computational steps of the most complex process. Generally, the computation time is a function of n and p, i.e. tcomp = f(n, p). Often we break the computation down into parts separated by message passing and determine the computation time of each part, so that

    tcomp = tcomp1 + tcomp2 + tcomp3 + ...

Communication Time
Communication time depends upon (a) the number of messages, (b) the size of each message, (c) the mode of transfer, (d) the underlying interconnection structure, and (e) network contention. As a first approximation we use

    tcomm1 = tstartup + n * tdata

for the communication time of message 1. Here tstartup is the startup time (also called the message latency), essentially the time needed to send a message with no data; it includes the time to pack the message at the source and unpack it at the destination, and it is assumed to be constant. tdata is the transmission time to send one data word, also assumed constant, and there are n data words in message 1. The transmission rate is usually measured in bits/second and is b/tdata bits/second when there are b bits in a data word.

The final communication time, tcomm, is the sum of the communication times of all the sequential messages from a process, i.e.

    tcomm = tcomm1 + tcomm2 + tcomm3 + ...

Typically the communication patterns of all processes are the same and are assumed to take place together, so it is sufficient to consider only one process. Since the startup and data transmission times, tstartup and tdata, are both measured in units of one computational step, we can add tcomp and tcomm together to obtain the parallel execution time, tp.

[Figure: idealized communication time - a straight line of communication time against the number of data items n, with intercept equal to the startup time.]

Benchmark Factors
With ts, tcomp and tcomm, we can establish the speedup factor and the computation/communication ratio for any given algorithm (or implementation):

    speedup factor S(p) = ts / tp
    computation/communication ratio = tcomp / tcomm

Both are functions of the number of processors, p, and the number of data elements, n, and both give an indication of the scalability of the parallel solution with increasing number of processors and problem size. The computation/communication ratio highlights the effect of communication with increasing problem size and system size.

Time Complexity
Time complexity is especially useful for comparing the execution times of algorithms and can be expressed with the following notation (it hides the lower-order terms, since they become insignificant as the problem size increases).

O-notation: f(x) = O(g(x)) if and only if there exist positive constants c and x0 such that f(x) <= c g(x) for all x >= x0. This means that beyond a certain point (x0) the function f(x) does not grow faster than g(x), except by a constant factor c. It is a way of saying that g(x) is an "upper limit" on how fast f(x) can grow.
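A brief worked instance of this definition (the polynomial is chosen purely for illustration):

    f(x) = 4x^2 + 2x + 7 \le 4x^2 + 2x^2 + 7x^2 = 13x^2 \quad \text{for all } x \ge 1

so the constants c = 13 and x0 = 1 satisfy the definition, and hence f(x) = O(x^2).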
Debugging and Evaluating Parallel Programs Empirically

Low-Level Debugging
We always want our parallel program to run correctly, which can be a significant intellectual challenge. Errors in a sequential program can be debugged by (a) instrumenting the code, i.e. inserting code that outputs intermediate calculated values as the program runs, or (b) using a debugger. These methods do not work efficiently for parallel programs. Geist et al. suggest a three-step approach for debugging message-passing programs:
Step 1: If possible, run the program as a single process and debug it as a normal sequential program.
Step 2: Execute the program using two to four multitasked processes on a single computer. Now examine actions such as checking that messages are indeed being sent to the correct places.
Step 3: Execute the program using the same two to four processes, but now spread across several computers. This helps to find problems caused by network delays related to synchronization and timing.

Visualization Tools
MPI implementations provide visualization tools (also known as profilers) as part of the overall parallel programming environment, for example the Upshot program visualization system. With these tools, programs can be watched as they execute in a space-time diagram (or process-time diagram).

[Figure: space-time diagram of a parallel program - the activity of Process 1, Process 2 and Process 3 over time, distinguishing computing, waiting, message-passing system routines and messages.]

Measuring Execution Time
Time-complexity analysis gives an insight into the potential of a parallel algorithm and is useful in comparing different algorithms, but how well an algorithm actually performs can only be known when it is coded and executed on a multiprocessor system. To measure the execution time of a program, or the elapsed time between two points in the code, in seconds, we use regular system calls such as clock(), time() or gettimeofday(). Elapsed time includes the time spent waiting for messages, and it is assumed that no other program is running on the processor at the same time. To measure the execution time between point L1 and point L2 in the code, we might have a construction such as:

    L1: time(&t1);
        ...
    L2: time(&t2);
        ...
        elapsed_time = difftime(t2, t1);
        printf("Elapsed time = %5.2f seconds", elapsed_time);

MPI provides the routine MPI_Wtime() for returning the time (in seconds).

Communication Time by the Ping-Pong Method
Using the ping-pong method, we can find the point-to-point communication time of a specific system. The method is as follows: one process, say P0, sends a message to another process, say P1. Immediately upon receiving the message, P1 sends the message back to P0. The time involved in this round trip is recorded at P0, and half of this time is the time of a one-way communication.

Program for the ping-pong method:

    Processor P0
        ...
    L1: time(&t1);
        send(&x, P1);
        recv(&x, P1);
    L2: time(&t2);
        elapsed_time = 0.5 * difftime(t2, t1);
        printf("Elapsed time = %5.2f seconds", elapsed_time);

    Processor P1
        ...
        recv(&x, P0);
        send(&x, P0);
        ...
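For completeness, a minimal sketch (not from the original notes) of the same ping-pong measurement written with actual MPI calls and MPI_Wtime(); the tag value and the single-integer message size are assumptions made for illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank, x = 0, tag = 0;
        double t1, t2;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 0) {                 /* P0 times the round trip */
            t1 = MPI_Wtime();
            MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
            MPI_Recv(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
            t2 = MPI_Wtime();
            printf("One-way time = %f seconds\n", 0.5 * (t2 - t1));
        } else if (myrank == 1) {          /* P1 echoes the message back */
            MPI_Recv(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            MPI_Send(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

In practice the round trip would be repeated many times and averaged to smooth out timer resolution and transient effects.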