Chapter 8
Multiple Instruction, Multiple Data (MIMD) Systems

Parallel Architectures

Overview
The multiple instruction stream, multiple data stream - MIMD architecture, as seen in Chapter 7, is mainly related to multiprocessors, where different instructions can work on different data in parallel. Basically, multiprocessors are computers [1] consisting of tightly coupled processors whose coordination and usage are generally controlled by a single operating system [2]. The processors share memory through a shared address space.

[1] One can refer to a multiprocessor as a single chip, for example. In this case, that single chip contains more than one processor and they are tightly coupled. Notice that a multiprocessor is different from a multicore.
[2] It is also possible to have a hypervisor system, in which each single-core processor can run a different operating system or even bare-metal software. The hypervisor abstracts the hardware and creates partitions where different operating systems can execute. An example can be found in the white paper: https://www.cs.unc.edu/~anderson/teach/comp790/papers/Wind-River-Hypervisor.pdf

Memory Organization
Multiple processors can share either:
1. cache memory, main memory, and I/O system;
2. main memory and I/O system;
3. I/O system; or
4. nothing, usually communicating through networks.

Are all these options feasible or interesting? Note that it is important to avoid bottlenecks in the architecture, and the answer also depends on the project requirements that have to be fulfilled. Multiprocessors can be classified in terms of their memory organization: symmetric (shared-memory) multiprocessors - SMP, and distributed shared memory - DSM.

Symmetric (Shared-Memory) Multiprocessors - SMP
SMPs are also known as centralized shared-memory multiprocessors. Usually, SMPs have approximately 32 cores or fewer, and all cores share a single centralized memory to which every processor has equal, i.e., symmetric, access. In this case, the memory and the bus can become a bottleneck. To avoid this, SMPs can use large caches and multiple buses. Finally, SMP provides uniform memory access - UMA: the access time to all of memory is the same from all processors.

Fig. 8.1 illustrates the SMP concept based on a multicore chip. There, each processor has its own local [3] caches. The shared cache is accessed through a bus, and the main memory is accessed through another bus connected to the shared cache.

Figure 8.1: Basic SMP organization overview.

[3] A local cache can also be referred to as a "private cache".

Distributed Shared Memory - DSM
The DSM organization counts on a larger number of processors, e.g., 16 to 64 processor cores. The distributed memory is used to increase bandwidth and to reduce access latency. However, communication among processors becomes more complex than in SMP, and more effort is required from the software to really take advantage of the increased memory bandwidth. The I/O system is also distributed, and each node can be a small distributed system with centralized memory. Finally, DSM has nonuniform memory access - NUMA, i.e., the access time depends on the location of a data word in memory, which can be any memory in the interconnection network.

Fig. 8.2 shows the DSM concept, where each processing node can be a multicore with its own memory and I/O system in a multiprocessor chip. All processor cores share the entire memory. However, the access time to the memory attached to a core is much faster than to remote memory, i.e., memory attached to another core.

Figure 8.2: Basic DSM organization overview.

Memory Architecture: UMA and NUMA
In SMP, which follows UMA, the processors share a single memory and have uniform access times to that centralized memory. On the other hand, in DSM, which follows NUMA, the processors share the same address space, but not necessarily the same physical memory. In multicomputers, processors with independent memories and address spaces can communicate through interconnection networks; they can even be complete computers connected in a network, i.e., clusters.

When considering a multithreaded context, thread communication is performed through a shared address space in both the SMP and DSM architectures, as pointed out at the beginning of this chapter. This means that any processor can reference any memory location as long as it has the proper access rights, because the address space is shared.
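To illustrate the NUMA behavior described above, the sketch below allocates one buffer on the calling core's local node and another pinned to a possibly remote node. This is a minimal sketch, assuming a Linux system with the libnuma library available (linked with -lnuma); the buffer size and node choice are illustrative, and the program only demonstrates memory placement, not access timing.

// numaAlloc.c - minimal sketch of local vs. remote memory placement (assumes Linux + libnuma).
// Build (assumption): gcc -g -Wall -o numaAlloc numaAlloc.c -lnuma
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void){
    if (numa_available() < 0){                 // is NUMA supported on this system?
        printf("NUMA is not available here\n");
        return 1;
    }
    int lastNode = numa_max_node();            // highest NUMA node number
    size_t size = 64UL * 1024 * 1024;          // 64 MiB per buffer (illustrative)

    char *local  = numa_alloc_local(size);             // placed on the caller's node
    char *remote = numa_alloc_onnode(size, lastNode);  // pinned to node lastNode

    if (local != NULL && remote != NULL){
        memset(local, 0, size);    // touching the pages commits them on the chosen nodes
        memset(remote, 0, size);   // accesses here may have to cross the interconnect
        printf("local buffer and buffer on node %d allocated\n", lastNode);
    }
    if (local != NULL)  numa_free(local, size);
    if (remote != NULL) numa_free(remote, size);
    return 0;
}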
Communication Models
Different communication models are used with the SMP and DSM organizations.

SMP Communication Model
In SMP, with a centralized memory, it is possible to use threads and the fork-join model [4]. A practical alternative is to use an application programming interface named open multi-processing - OpenMP [5]. In this case, communication is implicit, i.e., it happens through memory accesses.

[4] This can also be applied to DSM.
[5] https://www.openmp.org/

Threads. In this context, there are processes and threads. A process is an executing program, generally viewed as a working unit in the operating system. Processes have their own address space. A lightweight process is called a thread. Threads usually share their process's address space.

Fork-Join Model. Here, a process creates a child, i.e., a thread, by using fork. A process (parent) and its threads (children) share the same address space. Next, the process waits for its threads to finish their computation by calling join. Creating a process is expensive for the operating system, and that is one reason to use threads to perform a computation. A minimal sketch of this model is given below.
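The following sketch makes the fork-join model concrete using POSIX threads, which the chapter does not prescribe but which map directly onto the model: pthread_create plays the role of fork and pthread_join the role of join. The worker function and the thread count are illustrative assumptions.

// forkJoin.c - minimal fork-join sketch with POSIX threads (illustrative only).
// Build (assumption): gcc -g -Wall -o forkJoin forkJoin.c -pthread
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4                  // illustrative thread count

void *work(void *arg){
    long id = (long) arg;              // children share the parent's address space
    printf("thread %ld doing its part of the computation\n", id);
    return NULL;
}

int main(void){
    pthread_t threads[NUM_THREADS];

    // "fork": the parent creates the children
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, work, (void *) t);

    // "join": the parent waits for every child to finish
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    printf("all threads joined\n");
    return 0;
}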
OpenMP. A code example is presented in Listing 8.1.

Listing 8.1: Simple OpenMP example.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void hello(void){
    int myRank = omp_get_thread_num();
    int threadCount = omp_get_num_threads();
    printf("Hi from thread %d of %d\n", myRank, threadCount);
}

int main(int argc, char *argv[]){
    int threadCount = strtol(argv[1], NULL, 10);

#pragma omp parallel num_threads(threadCount)
    hello();

    return 0;
}

The code from Listing 8.1 can be compiled and executed as follows. This will generate four threads.

$ gcc -g -Wall -fopenmp -o ompHello ompHello.c
$ ./ompHello 4

DSM Communication Model
In DSM, with a distributed memory, it is possible to make use of the message passing model [6]. There is a library project implementing a message passing interface - MPI [7]. Here, communication is explicit, i.e., messages are passed, and this solution also brings synchronization issues.

[6] This can also be applied to SMP.
[7] https://www.open-mpi.org/

Message Passing Model. In this model, tasks (or processes) cooperate to perform a given computation. Each task has its own address space, which is not visible to the other tasks. Whenever a task wants to communicate with another task, it sends and receives messages; the memory contents of task A are then copied to task B's memory. This message passing creates synchronization, i.e., dependency, among the tasks communicating with each other.

MPI. The code in Listing 8.2 illustrates the usage of MPI. The program has to be substantially modified, since the parallelization has to be stated explicitly by the programmer.

Listing 8.2: Simple MPI usage example.
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void){
    char greetings[100];
    int commSize;  // number of processes
    int myRank;    // rank of the current process

    MPI_Init(NULL, NULL);                      // introduce this process to the others
    MPI_Comm_size(MPI_COMM_WORLD, &commSize);  // number of processes involved
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);    // rank of the current process

    // master/slave verification
    if (myRank != 0){
        sprintf(greetings, "Hi from process %d of %d", myRank, commSize);
        MPI_Send(greetings, strlen(greetings)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Hi from process %d of %d\n", myRank, commSize);
        for (int q = 1; q < commSize; q++){
            MPI_Recv(greetings, 100, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", greetings);
        }
    }
    MPI_Finalize();
    return 0;
}

The code from Listing 8.2 can be compiled and executed as follows. This will generate four processes.

$ mpicc -g -Wall -o mpiHello mpiHello.c
$ mpiexec -n 4 ./mpiHello

Market Share
SMPs hold the larger market share, both in dollars and in units, in the form of multiprocessors on a chip. Multicomputers have gained some market share due to the popularization of clusters for systems on the internet, e.g., more than 100 processors forming massively parallel processors - MPP.

Considerations on SMP
Large and efficient cache systems can greatly reduce the need for memory bandwidth. SMPs, as illustrated in Fig. 8.1, provide some cost benefits, as they do not need much extra hardware and are based on general-purpose processors - GPP. Caches provide not only locality but also replication. Is that a problem? This question is analyzed next.