Chapter 6  Branch Prediction & Speculative Superscalar Processors

Speculative Execution Overview

As seen in Chapter 5, it is possible to unroll certain types of loops without speculative execution. Is it still possible to unroll a loop without speculative execution when the conditional branch instruction has a RAW dependence within the loop iteration? No: in that case, speculative execution is required. Given this context, new requirements for speculative superscalar execution arise. More hardware is needed, e.g., reservation stations and functional units. More complex control and dependence detection is also needed. In addition, extra storage is required to undo the effects of instructions executed under a wrong prediction, i.e., a branch outcome misprediction. Finally, the architecture needs more accurate branch prediction so that the cost of the new hardware pays off in performance.

Control Dependencies

As the number of instructions executed per clock cycle increases, the potential instruction flow also increases. With the resulting lower CPI, delays caused by branches can seriously impact performance. A "simple" strategy to minimize this problem is: do not stop or slow down at branches. This strategy is detailed in the next sections.

Dynamic Branch Prediction

Overview

As seen in Chapter 3, static branch prediction strategies include predicted-not-taken and predicted-taken. Dynamic branch prediction, on the other hand, exploits the fact that branch instructions can be executed many times during program execution, e.g., inside loops. With more instructions executed in parallel, possible branch delays have a greater impact, and the predicted-not-taken approach may be inefficient. Consider the code snippet in Listing 6.1: the loop counts R3 from its initial value up to that value plus 5,000, so with the predicted-not-taken approach the branch prediction misses on essentially every one of the 5,000 iterations.

Listing 6.1: Predicted-not-taken code example.

    addi R1, R1, 0
    addi R4, R3, 5000
loop:
    load R2, 0(R3)
    addi R3, R3, 1
    add  R1, R1, R2
    bne  R3, R4, loop

Branch-Prediction Buffers

1-bit Prediction

A possible solution to the problem illustrated in Listing 6.1 is to use branch-prediction buffers. This is done by associating one bit with each branch decision. Thus, if the branch was taken in the last loop iteration, the prediction is "taken"; otherwise, the prediction is "untaken". Note that the branch target address still needs to be computed. The branch-prediction buffer, also known as the branch history table, is the simplest dynamic branch-prediction scheme. It uses a small memory indexed by the lower portion of the branch instruction address. Each entry holds one bit indicating whether the branch was recently taken or untaken. The approach does not guarantee that the prediction is correct: the stored outcome may have been put there by another branch instruction with the same low-order address bits. Thus, the prediction is a hint assumed to be correct, and the next instruction fetch begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back. This 1-bit branch-prediction buffer behaves like a 2-state finite-state machine (FSM) with the states TAKEN and UNTAKEN.
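To make the scheme concrete, the sketch below simulates a 1-bit branch-prediction buffer in C, following the description above: a small memory indexed by the low-order bits of the branch address, holding one prediction bit per entry, with the bit inverted on a misprediction. The table size, the function names, and the use of 32-bit addresses are illustrative assumptions, not part of the notes.

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 16                /* assumed table size (power of two) */

static bool bht[BHT_ENTRIES];         /* one prediction bit per entry, initially "untaken" */

/* Index the table with the low-order bits of the branch instruction address. */
static unsigned bht_index(uint32_t branch_pc)
{
    return branch_pc & (BHT_ENTRIES - 1);
}

/* Current hint: true = predict taken, false = predict untaken. */
bool bht_predict(uint32_t branch_pc)
{
    return bht[bht_index(branch_pc)];
}

/* After the branch resolves, invert the stored bit if the hint was wrong. */
void bht_update(uint32_t branch_pc, bool actually_taken)
{
    unsigned i = bht_index(branch_pc);
    if (bht[i] != actually_taken)
        bht[i] = actually_taken;      /* misprediction: flip the prediction bit */
}

A taken branch that was predicted untaken (or vice versa) simply flips the stored bit, which is exactly the 2-state TAKEN/UNTAKEN FSM behavior described above.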
The reason for using just the low-order address bits is that they already vary between nearby instructions: branch instructions that are close to each other in the code still map to different entries. For example, refer to Listing 6.2.

Listing 6.2: Example of low-order address bits used to index the small memory holding the prediction bits.

address
1111 0001 --> lower bits = 0001
1111 0010 --> lower bits = 0010

There can be an inconvenient, or even problematic, case regarding these low-order address bits. Consider Listing 6.3: two distinct branch instructions end up with the very same index.

Listing 6.3: Two different branch instructions mapping to the same small-memory address.

1111 0001 --> small memory address = 0001
1111 0010 --> small memory address = 0010
...
ffff 0010 --> small memory address = 0010

Besides, the 1-bit prediction scheme has the following shortcoming: even if a branch is almost always taken, as in a loop, the scheme mispredicts twice, rather than once, each time the branch is not taken. That is because the misprediction at the loop exit flips the prediction bit, causing a second misprediction when the loop branch is taken again. The code in Listing 6.4 illustrates this case.

Listing 6.4: Example of the double misprediction in the 1-bit prediction scheme.

do {
    // possible duplicated error (misprediction)
    // with respect to the "predicted-taken" in the inner loop
    do {
        sum += v[i][j++];
    } while (j
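Listing 6.4 is cut off in this excerpt. A minimal, self-contained sketch of the same nested-loop case is given below, together with a count of the mispredictions a 1-bit predictor incurs on the inner-loop branch. The bounds ROWS and COLS and the instrumentation are assumptions for illustration, not part of the original listing.

#include <stdio.h>
#include <stdbool.h>

#define ROWS 10   /* assumed outer-loop bound */
#define COLS 10   /* assumed inner-loop bound */

int main(void)
{
    int v[ROWS][COLS] = {0};
    int sum = 0;

    bool predict_taken = true;   /* 1-bit predictor state for the inner-loop branch */
    int mispredictions = 0;

    int i = 0;
    do {
        int j = 0;
        do {
            sum += v[i][j++];

            bool taken = (j < COLS);          /* actual outcome of the inner-loop branch */
            if (predict_taken != taken) {     /* 1-bit scheme: flip the bit on a miss */
                mispredictions++;
                predict_taken = taken;
            }
        } while (j < COLS);
        i++;
    } while (i < ROWS);

    /* After the first pass, each execution of the inner loop costs two
       mispredictions: one at the loop exit (predicted taken, actually untaken)
       and one on re-entry (predicted untaken, actually taken). */
    printf("sum = %d, inner-branch mispredictions = %d\n", sum, mispredictions);
    return 0;
}

With the assumed bounds, the inner-loop branch is taken 9 times and not taken once per pass, yet the 1-bit scheme accumulates 2*(ROWS-1)+1 = 19 mispredictions instead of the 10 that a perfect "mispredict only at the exit" scheme would give.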
