Full Transcript


Chapter 8: Multiple Instruction, Multiple Data (MIMD) Systems

Parallel Architectures

Overview

The multiple instruction stream, multiple data stream (MIMD) architecture, as seen in Chapter 7, is mainly related to multiprocessors, where different instructions can work on different data in parallel. Basically, multiprocessors are computers [1] consisting of tightly coupled processors whose coordination and usage are generally controlled by a single operating system [2]. The processors share memory through a shared address space.

[1] One can refer to a multiprocessor as a single chip, for example. In this case, that single chip contains more than one processor and they are tightly coupled. Notice that a multiprocessor is different from a multicore.
[2] It is also possible to have a hypervisor system, in which each single-core processor can run a different operating system or even bare-metal software. The hypervisor then abstracts the hardware and creates partitions where different operating systems can execute. An example can be found in the white paper: https://www.cs.unc.edu/~anderson/teach/comp790/papers/Wind-River-Hypervisor.pdf

Memory Organization

Multiple processors can share either:
1. cache memory, main memory, and I/O system;
2. main memory and I/O system;
3. I/O system; or
4. nothing, usually communicating through networks.

Are all these options feasible or interesting? Note that it is important to avoid bottlenecks in the architecture, and the answer also depends on the project requirements that have to be fulfilled. Multiprocessors can be classified in terms of their memory organization: symmetric (shared-memory) multiprocessors (SMP) and distributed shared memory (DSM).

Symmetric (Shared-Memory) Multiprocessors - SMP

SMP are also known as centralized shared-memory multiprocessors. Usually, SMP have approximately 32 cores or fewer, and they all share a single centralized memory to which all processors have equal, i.e., symmetric, access. In this case, the memory and the bus can become a bottleneck. To avoid this, SMP can use large caches and multiple buses. Finally, SMP has uniform memory access (UMA): the access time to all of memory is the same from all processors.

Fig. 8.1 illustrates the SMP concept based on a multicore chip. Each processor has its own local [3] caches. Access to the shared cache is through a bus, and the main memory is also accessed through a bus connected to the shared cache.

[3] A local cache can also be referred to as a "private cache".

Figure 8.1: Basic SMP organization overview.

Distributed Shared Memory - DSM

The DSM organization targets a larger number of processors, e.g., 16 to 64 processor cores. The distributed memory is used to increase bandwidth and to reduce access latency. However, communication among processors becomes more complex than in SMP, and more effort is required in software to really take advantage of the increased memory bandwidth. The I/O system is also distributed, and each node can be a small distributed system with centralized memory. Finally, DSM has nonuniform memory access (NUMA), i.e., the access time depends on the location of a data word in memory, which can be any memory in the interconnection network.

Fig. 8.2 shows the DSM concept, where each processing node can be a multicore with its own memory and I/O system in a multiprocessor chip. All processor cores share the entire memory. However, the access time to the memory attached to a core is much faster than to remote memory, i.e., memory attached to another core.

Figure 8.2: Basic DSM organization overview.

Memory Architecture: UMA and NUMA

In SMP, following the UMA model, the processors share a single memory and have uniform access times to the centralized memory. On the other hand, in DSM, which follows NUMA, the processors share the same address space, but not necessarily the same physical memory. And in multicomputers, processors with independent memories and address spaces can communicate through interconnection networks; they can even be complete computers connected in a network, i.e., clusters.

When considering a multithreaded context, thread communication is performed through a shared address space in both the SMP and DSM architectures, as pointed out at the beginning of this chapter. It means that any processor can reference any memory location as long as it has proper access rights. This is because the address space is shared.
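The local-versus-remote access cost in a NUMA machine can be observed from user code. The sketch below is a minimal illustration, not part of the chapter, and assumes a Linux system with the libnuma library available (link with -lnuma); the buffer size and node numbers are arbitrary choices for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

#define N (64 * 1024 * 1024)   /* 64 MiB buffer, arbitrary size for the example */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int lastNode = numa_max_node();

    /* Allocate one buffer on node 0 (local if this thread runs there)
       and one on the highest-numbered node (likely remote). */
    char *localBuf  = numa_alloc_onnode(N, 0);
    char *remoteBuf = numa_alloc_onnode(N, lastNode);
    if (!localBuf || !remoteBuf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touching the buffers forces the pages to be backed on their nodes.
       Timing these two loops (e.g., with clock_gettime) would expose the
       nonuniform access latency discussed above. */
    memset(localBuf, 1, N);
    memset(remoteBuf, 1, N);

    numa_free(localBuf, N);
    numa_free(remoteBuf, N);
    return 0;
}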
Communication Models

Different communication models are used with the SMP and DSM organizations.

SMP Communication Model

In SMP, with a centralized memory, it is possible to use threads and the fork-join model [4]. A practical alternative is to use the application programming interface named open multi-processing (OpenMP) [5]. In this case, communication is implicit, i.e., it happens through memory accesses.

[4] This can also be applied to DSM.
[5] https://www.openmp.org/

Threads. In this context, there are processes and threads. A process is an executing program, generally viewed as a unit of work in the operating system. Processes have their own address space. A lightweight process is called a thread; threads usually share their process's address space.

Fork-Join Model. Here, a process creates child threads by using fork. The process (parent) and its threads (children) share the same address space. Next, the process waits for its threads to finish their computation by calling join. Creating a process is expensive for the operating system, and that is one reason to use threads to perform a computation, as the sketch below illustrates.
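As a concrete illustration of fork-join over a shared address space, the sketch below uses POSIX threads directly instead of the OpenMP runtime. It is a minimal example written for this section (the work split of 1000 iterations per thread is arbitrary): the parent "forks" workers with pthread_create, "joins" them with pthread_join, and because all threads share the process's address space, each worker can simply write its partial result into a shared array. Compile with -pthread.

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

static long partialSum[NUM_THREADS];   /* shared address space: visible to all threads */

static void *worker(void *arg) {
    long id = (long)arg;
    /* Each thread computes a private partial result and stores it in its own slot. */
    long sum = 0;
    for (long i = id * 1000; i < (id + 1) * 1000; i++)
        sum += i;
    partialSum[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* Fork: create the child threads. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);

    /* Join: wait for every child to finish before using the results. */
    long total = 0;
    for (long t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partialSum[t];
    }
    printf("total = %ld\n", total);
    return 0;
}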
OpenMP. A code example is presented in Listing 8.1.

Listing 8.1: Simple OpenMP example.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void hello(void) {
    int myRank = omp_get_thread_num();
    int threadCount = omp_get_num_threads();
    printf("Hi from thread %d of %d\n", myRank, threadCount);
}

int main(int argc, char *argv[]) {
    int threadCount = strtol(argv[1], NULL, 10);

#pragma omp parallel num_threads(threadCount)
    hello();

    return 0;
}

The code from Listing 8.1 can be compiled and executed as follows. This will generate four threads.

$ gcc -g -Wall -fopenmp -o ompHello ompHello.c
$ ./ompHello 4

DSM Communication Model

In DSM, with a distributed memory, it is possible to make use of the message passing model [6]. There is a library project implementing a message passing interface (MPI) [7]. Here, communication is explicit, i.e., it happens through message passing. This solution also brings synchronization issues.

[6] This can also be applied to SMP.
[7] https://www.open-mpi.org/

Message Passing Model. In this model, tasks (or processes) cooperate to perform a given computation. Each task has its own address space, which is not visible to other tasks. Whenever a task wants to communicate with another task, it sends and receives messages; the memory contents of task A are then copied to task B's memory. This message passing creates synchronization, i.e., dependencies among the tasks communicating with each other.

MPI. The code in Listing 8.2 illustrates the usage of MPI. The program has to be substantially modified, since the parallelization has to be expressed explicitly by the programmer.

Listing 8.2: Simple MPI usage example.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void) {
    char greetings[100];
    int commSize;   // number of processes
    int myRank;     // rank of the current process

    MPI_Init(NULL, NULL);                      // introduce this process to the others
    MPI_Comm_size(MPI_COMM_WORLD, &commSize);  // number of processes involved
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);    // rank of the current process

    // master/slave verification
    if (myRank != 0) {
        sprintf(greetings, "Hi from process %d of %d", myRank, commSize);
        MPI_Send(greetings, strlen(greetings) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("Hi from process %d of %d\n", myRank, commSize);
        for (int q = 1; q < commSize; q++) {
            MPI_Recv(greetings, 100, MPI_CHAR, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", greetings);
        }
    }
    MPI_Finalize();
    return 0;
}

The code from Listing 8.2 can be compiled and executed as follows. This will generate four processes.

$ mpicc -g -Wall -o mpiHello mpiHello.c
$ mpiexec -n 4 ./mpiHello

Market Share

SMP have the bigger market share, both in dollars and in units, in the form of multiprocessors in a chip. Multicomputers gained some market share due to the popularization of clusters for systems on the internet, e.g., more than 100 processors forming massively parallel processors (MPP).

Considerations on SMP

Large and efficient cache systems can greatly reduce the need for memory bandwidth. SMP, as illustrated in Fig. 8.1, provide some cost benefits as they do not need much extra hardware and are based on general-purpose processors (GPP). Caches provide not only locality but also replication. Is that a problem? This question is analyzed next.

Cache Coherence

SMP Problems

Consider the memory organization of SMP. Let's assume that a processor P1 writes memory position X in its local cache, and processor P2 then reads Mem[X]. What happens? This relates to the consistency and coherence problems.

In the cache coherence context, informally, a system is coherent if it returns the last value written to a data item. The memory system's coherence and consistency are complementary: coherence ensures the use of the most current data, and consistency synchronizes the reads and writes between processors.

Consistency

Consistency example. Let's take the following example related to consistency, involving the codes in Listings 8.3 and 8.4.

Listing 8.3: Code snippet example to verify consistency, running on processor P1.

a = 0;
...
a = 1;
if (b == 0)
    ...

Listing 8.4: Code snippet example to verify consistency, running on processor P2.

b = 0;
...
b = 1;
if (a == 0)
    ...

Initially, both variables a and b are in the cache with a value equal to zero. Which one of the if statements will be taken? Or both of them, from P1 and P2? Most of the time, this case needs to be handled by the programmer, i.e., through synchronization mechanisms.
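To see the problem concretely, the sketch below (written for this section, not taken from the chapter) reproduces the pattern of Listings 8.3 and 8.4 with two POSIX threads sharing plain, deliberately unsynchronized variables; it is an illustrative data race, not production code. Depending on the interleaving and on how the compiler and hardware reorder the independent store and load in each thread, neither, one, or even both if branches can be taken, which is exactly why synchronization mechanisms (locks, barriers, or C11 atomics with sequentially consistent ordering) are needed.

#include <stdio.h>
#include <pthread.h>

/* Plain shared variables, no synchronization, as in Listings 8.3 and 8.4. */
static int a = 0, b = 0;
static int p1TookIf = 0, p2TookIf = 0;

static void *p1(void *arg) {           /* mimics the code on processor P1 */
    (void)arg;
    a = 1;
    if (b == 0)
        p1TookIf = 1;
    return NULL;
}

static void *p2(void *arg) {           /* mimics the code on processor P2 */
    (void)arg;
    b = 1;
    if (a == 0)
        p2TookIf = 1;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Depending on the interleaving and on store/load reordering,
       none, one, or even both branches may have been taken. */
    printf("P1 took if: %d, P2 took if: %d\n", p1TookIf, p2TookIf);
    return 0;
}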
Coherence

A memory system is coherent if the following three properties are fulfilled.

1. Program order preservation – a read by processor P of location X, following a write by P to X, with no writes to X by another processor between the write and the read by P, always returns the value written by P.

// code running on P
P: write A to Mem[X]
P: read Mem[X]          // returns A

2. Coherent view of memory – a read by processor P1 of location X, following a write by processor P2 to X, returns the written value, if the write and the read are sufficiently separated in time and no other writes to X occur between the two accesses.

P2: write A to Mem[X]   // code running on P2
P1: read Mem[X]         // code running on P1; returns A

3. Write serialization – writes to the same location are serialized, i.e., two writes to the same location by any two processors are seen in the same order by all processors. Consider the following: if the value 1 and then the value 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.

P2: write 1 to Mem[Y]   // code running on P2
P1: write 2 to Mem[Y]   // code running on P1

Cache Coherence Protocols

There are some basic schemes to enforce coherence. One is to keep track of the sharing status of each data block; the cache block status is kept using status bits associated with that block. Another is a hardware-based solution for multiprocessors to maintain cache coherence, named cache coherence protocols. There are mainly two classes: snooping and directory-based. In snooping, every cache containing a copy of the data from a physical memory block tracks the sharing status of the block. In the directory-based approach, the sharing status of a particular block of physical memory is kept in one location, i.e., the directory. Generally, SMP uses a single directory and DSM uses distributed directories.

Snooping Coherence Protocols

Each cache keeps a copy of the memory block data together with the "share status" of the block, e.g., shared/non-shared. As the caches also share the memory bus, they snoop on the memory traffic to check whether they have copies of the "in-transit" block. Snooping can be based on the write invalidate protocol or on the write update protocol, also known as the write broadcast protocol.

Write Invalidate Protocol. In this case, writing to a shared block invalidates the copies of that block in the other caches. Thus, when trying to access an invalidated block, there is a cache miss, and the correct, most up-to-date data comes from the "dirty" cache block. Finally, the protocol handles the memory update, following the write back approach. Writing to non-shared blocks does not cause any problem, since there is no possibility of coherence loss within non-shared blocks.

What about the write through approach? Memory would always be coherent under write through. In that case, reading an invalidated block would not require a write to memory, i.e., read misses do not cause writes to memory; it is always a fetch from memory. However, a read miss under write back causes a memory write if the block to be replaced has its dirty bit set.

Write invalidate protocol example.
Let's assume that neither cache initially holds the data block from memory location X and that the value of X in memory is equal to 0. This is illustrated in Fig. 8.3. The figure shows the values in the processors' caches and in memory after both the processor activity and the bus activity have been completed. A blank stands for "no activity" or "no copy cached".

When processor A reads X, there is a cache miss, and this is communicated on the bus. The data block from memory is then cached. In the same way, processor B reads memory location X; that cache miss is also communicated through the bus, and the data is cached. Next, processor A writes the value 1 to memory location X. The new data goes only to its cache at this moment; remember this follows the write back approach. Since memory location X was shared, an "invalidation for X" is communicated on the bus. Finally, processor B reads X again, but this time an "invalid X" is noticed, due to processor A's last write. This is then a cache miss for X, but the data block comes from processor A instead of from memory, as processor A cancels the response from memory since it holds the correct data value. At last, both processor B's cache and memory location X are updated.

Figure 8.3: Write invalidate protocol example assuming write back caches, for better performance.

Write Update Protocol. In this case, the difference lies only in the write behavior. The write update protocol updates all the cached copies of a data item when that item is written. It must therefore broadcast all writes to shared cache lines, which consumes considerably more bandwidth than the write invalidate protocol. Therefore, virtually all recent multiprocessors have opted to implement a write invalidate protocol.

Brief Protocols Comparison. Points in favor of the write invalidate protocol:

1. Multiple writes of the same word without intervening reads do not require multiple broadcasts, just one initial block invalidation, as seen in the previous example;
2. Each word written to a cache block requires a write broadcast in the write update protocol, although only the first write to any word within the block needs to generate an invalidation. The write invalidate protocol acts on cache blocks, while the write update protocol must act on individual words; and
3. The write invalidate protocol tends to generate less traffic on the memory bus. Remember that the memory bus can be seen as a bottleneck in SMP.

Point in favor of the write update protocol:

1. The delay between writing a word on one processor and reading the written value on another processor is smaller in the write update protocol, as the written data is updated immediately in the reader's cache.
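A practical consequence of invalidation traffic that programmers run into is false sharing: two threads repeatedly write different words that happen to sit in the same cache block, so each write invalidates the block in the other cache even though no data is logically shared. The OpenMP sketch below is a minimal illustration under assumptions not stated in the chapter (a 64-byte cache block and at least two hardware threads); timing both phases typically shows the padded version running noticeably faster.

#include <stdio.h>
#include <omp.h>

#define ITER 100000000L

/* Two counters in the same cache block: writes by different threads
   to either field keep invalidating the block in the other cache.   */
struct { long a; long b; } sharedLine;

/* Padding pushes the fields into different blocks and avoids the traffic.
   The 64-byte line size is an assumption, not something the chapter states. */
struct { long a; char pad[64]; long b; } paddedLine;

int main(void) {
    double t0 = omp_get_wtime();
#pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            for (long i = 0; i < ITER; i++) sharedLine.a++;
        else
            for (long i = 0; i < ITER; i++) sharedLine.b++;
    }
    double t1 = omp_get_wtime();

#pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            for (long i = 0; i < ITER; i++) paddedLine.a++;
        else
            for (long i = 0; i < ITER; i++) paddedLine.b++;
    }
    double t2 = omp_get_wtime();

    printf("same block: %.2fs, padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}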
Write Invalidate Implementation. Regarding block invalidation, the key to the implementation is getting access to the memory bus and using this access to invalidate a block, i.e., the processor sends the block address over the bus. The other processors snoop on the bus and check whether they have that block in their caches, i.e., by checking the address and invalidating the block. The mechanism relies on serialized writes: the need to acquire the bus, an exclusive resource, forces the serialization of the writes.

When locating a data item on a cache miss, write through or write back caches can be used. In a write through cache, all written data are always sent to memory, so the most recent value of a data item can always be fetched from memory. In a write back cache, if the block is clean (shared), the protocol simply acts like the write through approach. Otherwise, if the block is dirty (modified/exclusive), the protocol sends the dirty block in response to the read request and aborts the memory read/fetch operation.

Considering this context, a simple protocol with three states is discussed next. The protocol states are:

1. invalid – the block is considered not present in the cache;
2. shared – the block is present in one or more local/private caches (potentially shared), but not necessarily in the cache of the processor that requested the data;
3. modified or exclusive – the block has been updated in a local/private cache. Here, the most up-to-date data is present in only one cache, and not necessarily in the cache of the processor that requested the data.

The state changes from invalid to shared on the first read of the block, even if there is only one copy. On the first write, it becomes exclusive. Fig. 8.4 illustrates the states along with the triggers activating the state transitions. These stimulus actions apply to a private cache block, rather than to a specific address in the cache. For the sake of clarity, Fig. 8.4 shows the same finite-state machine from two different perspectives: one based on the "requests from the CPU" (left-hand side), and the other based on the "requests from the bus" (right-hand side). The finite-state machine controller is usually implemented in each core. Table 8.1 details the state transitions from the CPU perspective, and Table 8.2 details the state transitions from the bus perspective.

Figure 8.4: Write invalidate, cache coherence protocol FSM for a private write back cache. Hits and misses in the local cache.

Table 8.1: Finite-state machine cache transition details based on requests from the CPU.

Addressed cache block's state | Trigger | Cache action type | Details
Invalid | CPU read miss | Regular miss | Place read miss on bus.
Invalid | CPU write miss | Regular miss | Place write miss on bus.
Shared | CPU read hit | Regular hit | Read data in local cache.
Shared | CPU read miss | Replacement | Place read miss on bus; there was an address conflict, needing a block replacement.
Shared | CPU write hit | Coherence | Place invalidate on bus; there is no data fetch here, as this is mainly intended to change the state.
Shared | CPU write miss | Replacement | Place write miss on bus; there was an address conflict, needing a block replacement.
Exclusive | CPU write/read hit | Regular hit | Write/read data in local cache.
Exclusive | CPU write miss | Replacement | Write back the cache block and place a write miss on bus; there was an address conflict, needing a block replacement.
Exclusive | CPU read miss | Replacement | Write back the cache block and place a read miss on bus; there was an address conflict, needing a block replacement.

Table 8.2: Finite-state machine cache transition details based on requests from the bus.

Addressed cache block's state | Trigger | Cache action type | Details
Shared | Write miss for this block | Coherence | Invalidate this block, as there was an attempt to write a shared block.
Shared | Invalidate for this block | Coherence | Invalidate this block, as there was an attempt to write a shared block.
Shared | Read miss for this block | No action | The shared cache or memory will service this read miss.
Exclusive | Write miss for this block | Coherence | Write back the cache block and abort the memory access; there was an attempt to write a block that is exclusive elsewhere.
Exclusive | Read miss for this block | Coherence | Place the cache block on the bus, write back the block, and abort the memory access; another processor attempted to read a block held exclusively by this cache.
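The behavior summarized in Fig. 8.4 and Tables 8.1 and 8.2 can be prototyped in a few lines of C. The following is a simplified sketch written for this section, not the chapter's implementation: it models a single block address, two private caches snooping a common "bus" (here, a direct function call), and the invalid/shared/modified states, and it replays the read-read-write-read sequence of the example in Fig. 8.3.

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } State;

typedef struct { State state; int value; } CacheLine;

static CacheLine cache[2];   /* two private caches, one block each */
static int memory = 0;       /* the single memory location X       */

/* Bus-side behavior (Table 8.2): the other cache snoops the request. */
static void snoop(int other, int isWrite) {
    if (cache[other].state == MODIFIED) {
        memory = cache[other].value;          /* write back the dirty block  */
        cache[other].state = isWrite ? INVALID : SHARED;
    } else if (isWrite && cache[other].state == SHARED) {
        cache[other].state = INVALID;         /* invalidate on a write/invalidate request */
    }
}

/* CPU-side behavior (Table 8.1). */
static int cpu_read(int p) {
    if (cache[p].state == INVALID) {          /* read miss: place it on the bus */
        snoop(1 - p, 0);
        cache[p].value = memory;
        cache[p].state = SHARED;
    }
    return cache[p].value;
}

static void cpu_write(int p, int v) {
    if (cache[p].state != MODIFIED)           /* write miss or upgrade: invalidate the other copy */
        snoop(1 - p, 1);
    cache[p].value = v;
    cache[p].state = MODIFIED;
}

int main(void) {
    printf("A reads X: %d\n", cpu_read(0));   /* miss, fetched from memory        */
    printf("B reads X: %d\n", cpu_read(1));   /* miss, fetched from memory        */
    cpu_write(0, 1);                          /* A writes 1: B's copy invalidated */
    printf("B reads X: %d\n", cpu_read(1));   /* miss, served by A's dirty block  */
    return 0;
}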
Directory-based Protocol

Is there any cache coherence problem with respect to DSM, i.e., with a shared address space? In DSM, the memory bandwidth is increased as the memory becomes distributed. This separates local memory traffic from remote traffic and reduces the bandwidth demands on the overall memory system and interconnection network. When using snooping-based protocols in DSM, a considerable amount of messages would have to be broadcast on the bus, which would negate the benefits of the DSM memory organization. The need for broadcasting messages has to be eliminated.

Given this, the directory protocol is an alternative to the snooping-based coherence protocol. A directory keeps the state of every block that may be cached. The information contained in the directory includes which caches have copies of the block, whether it is dirty, and the block status, i.e., shared, uncached, or modified. The solution is based on distributing the directory along with the memory, as shown in Fig. 8.5, since a single directory is not a scalable solution, especially in a multicore environment. Therefore, different coherence requests can go to different directories, just as different memory requests go to different memories. Each directory is responsible for tracking the caches that share the memory addresses of the memory portion in its node.

Figure 8.5: Directory added to each node to implement cache coherence in DSM.

The directory-based protocol's state diagrams are the same as those used in snooping, as depicted in Fig. 8.4. However, the implementation is based on message passing between nodes, rather than snooping on the bus. The protocol needs to know which node has the block in order to invalidate it. In this case, each node keeps information on which processors have each memory block: a field with one associated bit per system processor for each memory block.
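A common way to hold this per-block information is a full bit vector: one presence bit per processor plus a small state field for each block of the node's local memory. The structure below is a hedged sketch of that bookkeeping only, not a complete directory protocol; the 64-processor limit and the helper names are assumptions made for the example.

#include <stdint.h>
#include <stdio.h>

#define NUM_PROCS 64         /* one presence bit per processor fits in a 64-bit word */

typedef enum { UNCACHED, SHARED_DIR, MODIFIED_DIR } DirState;

/* One entry per memory block owned by this node's local memory. */
typedef struct {
    DirState state;          /* uncached, shared, or modified (dirty) */
    uint64_t sharers;        /* bit i set => processor i holds a copy */
} DirEntry;

/* Record that processor p obtained a read-only copy of the block. */
static void dir_add_sharer(DirEntry *e, int p) {
    e->sharers |= (uint64_t)1 << p;
    e->state = SHARED_DIR;
}

/* On a write: send an invalidation message to every current sharer
   (messages are only printed here), then record the single owner.  */
static void dir_write(DirEntry *e, int owner) {
    for (int p = 0; p < NUM_PROCS; p++)
        if (((e->sharers >> p) & 1) && p != owner)
            printf("send invalidate for this block to processor %d\n", p);
    e->sharers = (uint64_t)1 << owner;
    e->state = MODIFIED_DIR;
}

int main(void) {
    DirEntry x = { UNCACHED, 0 };
    dir_add_sharer(&x, 0);   /* P0 reads the block                     */
    dir_add_sharer(&x, 3);   /* P3 reads the block                     */
    dir_write(&x, 0);        /* P0 writes: only P3 must be invalidated */
    return 0;
}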
Multicomputers

Is there any cache coherence problem with respect to multicomputers based on message passing? Since multicomputers are defined as "processors with independent memories and address spaces", no other computer writes to a given computer's memory, only the computer itself. Thus, there is no coherence problem across computers. However, inside each computer, the cache coherence problems discussed above are still present.

Thread-Level Parallelism

Overview

Multithreading is exploited more and more, as ILP is already very well explored and limited nowadays. Given this, threads are a very common way of expressing explicit parallelism, "controlled" by the programmer. They are also a very different approach compared with the transparency found in ILP and vector instructions. However, multithreading comes together with these other advancements in performance. Multithreading allows multiple threads to share the functional units of a single processor in an overlapping way.

A more general method to exploit thread-level parallelism (TLP) considers a multiprocessor with multiple independent threads operating at once and in parallel. Multithreading does not duplicate the entire processor as a multiprocessor does. Instead, multithreading shares most of the processor core among a set of threads, duplicating only the private state, i.e., the registers and the program counter.

Hardware Approaches for Multithreading

There are several approaches to multithreading, including fine-grained, coarse-grained, and simultaneous multithreading. The fine-grained approach switches between threads on each clock cycle, causing the execution of instructions from multiple threads to be interleaved; it uses round-robin scheduling and skips stalled threads. The coarse-grained approach switches threads only on costly stalls, e.g., L2 or L3 cache misses. Finally, the simultaneous multithreading (SMT) approach is the most common implementation. It is a variation of fine-grained multithreading implemented on top of a multiple-issue, dynamically scheduled processor; SMT is where superscalar processors and multithreading meet. In SMT, register renaming and dynamic scheduling allow multiple instructions from independent threads to be executed without noticing dependencies among them.

Fig. 8.6 illustrates a view of a superscalar processor with no multithreading (MT) support, along with a superscalar processor with coarse and fine MT, and a superscalar processor with SMT. The Intel Core i7 and IBM Power7 processors use SMT.

Figure 8.6: Conceptual illustration, greatly simplified. Shades of gray correspond to 4 different threads in the multithreading processors; a white box indicates that the corresponding execution slot is unused in that clock cycle. In the superscalar, only one thread can be executed. The coarse MT is able to run two different threads in the considered timeframe. Fine MT can execute all four threads by switching between them on each clock cycle. Finally, SMT can execute up to two threads in parallel. Notice also the number of "white boxes" in each case.
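As a rough intuition for the fine-grained policy, the toy simulation below (a hedged sketch written for this section, not a description of real hardware) issues from four modelled threads, picking one thread per cycle in round-robin order and skipping threads that are stalled in that cycle; the stall pattern is made up for the example.

#include <stdio.h>

#define NUM_THREADS 4
#define CYCLES      12

/* stalled[t][c] != 0 means thread t cannot issue in cycle c
   (e.g., it is waiting on a cache miss). The pattern is invented for the example. */
static const int stalled[NUM_THREADS][CYCLES] = {
    {0,0,1,1,0,0,0,0,1,0,0,0},   /* thread 0 */
    {0,0,0,0,0,1,1,0,0,0,0,0},   /* thread 1 */
    {0,1,0,0,0,0,0,0,0,0,1,0},   /* thread 2 */
    {0,0,0,0,1,0,0,0,0,0,0,0},   /* thread 3 */
};

int main(void) {
    int next = 0;                       /* round-robin pointer */
    for (int c = 0; c < CYCLES; c++) {
        int issued = -1;
        /* Fine-grained MT: try each thread once, starting at 'next',
           and skip the ones that are stalled this cycle.            */
        for (int i = 0; i < NUM_THREADS; i++) {
            int t = (next + i) % NUM_THREADS;
            if (!stalled[t][c]) { issued = t; break; }
        }
        if (issued >= 0) {
            printf("cycle %2d: issue from thread %d\n", c, issued);
            next = (issued + 1) % NUM_THREADS;
        } else {
            printf("cycle %2d: all threads stalled, slot wasted\n", c);
        }
    }
    return 0;
}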
