(The Morgan Kaufmann Series in Computer Architecture and Design) David A. Patterson, John L. Hennessy - Computer Organization and Design RISC-V Edition_ The H 13.PDF
Document Details
Uploaded by MagnanimousCloisonnism
Vilnius University
Full Transcript
140 Chapter 2 Instructions: Language of the Computer Java was announced. Today, Java virtual machines are found in billions of devices, in everything from cell phones to Internet browsers....
140 Chapter 2 Instructions: Language of the Computer Java was announced. Today, Java virtual machines are found in billions of devices, in everything from cell phones to Internet browsers. The downside of interpretation is lower performance. The incredible advances in performance of the 1980s and 1990s made interpretation viable for many important applications, but the factor of 10 slowdown when compared to traditionally compiled C programs made Java unattractive for some applications. To preserve portability and improve execution speed, the next phase of Java’s Just In Time compiler development was compilers that translated while the program was running. Such (JIT) The name Just In Time compilers (JIT) typically profile the running program to find where commonly given to a compiler that operates at the “hot” methods are and then compile them into the native instruction set on runtime, translating the which the virtual machine is running. The compiled portion is saved for the next interpreted code segments time the program is run, so that it can run faster each time it is run. This balance into the native code of the of interpretation and compilation evolves over time, so that frequently run Java computer. programs suffer little of the overhead of interpretation. As computers get faster so that compilers can do more, and as researchers invent betters ways to compile Java on the fly, the performance gap between Java and C or C++ is closing. Section 2.15 goes into much greater depth on the implementation of Java, Java bytecodes, JVM, and JIT compilers. Check Which of the advantages of an interpreter over a translator was the most important Yourself for the designers of Java? 1. Ease of writing an interpreter 2. Better error messages 3. Smaller object code 4. Machine independence 2.13 A C Sort Example to Put it All Together One danger of showing assembly language code in snippets is that you will have no idea what a full assembly language program looks like. In this section, we derive the RISC-V code from two procedures written in C: one to swap array elements and one to sort them. 2.13 A C Sort Example to Put it All Together 141 FIGURE 2.23 A C procedure that swaps two locations in memory. This subsection uses this procedure in a sorting example. The Procedure swap Let’s start with the code for the procedure swap in Figure 2.23. This procedure simply swaps two locations in memory. When translating from C to assembly language by hand, we follow these general steps: 1. Allocate registers to program variables. 2. Produce code for the body of the procedure. 3. Preserve registers across the procedure invocation. This section describes the swap procedure in these three pieces, concluding by putting all the pieces together. Register Allocation for swap As mentioned on page 104, the RISC-V convention on parameter passing is to use registers x10 to x17. Since swap has just two parameters, v and k, they will be found in registers x10 and x11. The only other variable is temp, which we associate with register x5 since swap is a leaf procedure (see page 108). This register allocation corresponds to the variable declarations in the first part of the swap procedure in Figure 2.23. Code for the Body of the Procedure swap The remaining lines of C code in swap are temp = v[k]; v[k] = v[k+1]; v[k+1] = temp; Recall that the memory address for RISC-V refers to the byte address, and so words are really 4 bytes apart. Hence, we need to multiply the index k by 4 before adding it to the address. Forgetting that sequential word addresses differ by 4 instead of by 1 is a common mistake in assembly language programming. 142 Chapter 2 Instructions: Language of the Computer Hence, the first step is to get the address of v[k] by multiplying k by 4 via a shift left by 2: slli x6, x11, 2 // reg x6 = k * 4 add x6, x10, x6 // reg x6 = v + (k * 4) Now we load v[k] using x6, and then v[k+1] by adding 4 to x6: lw x5, 0(x6) // reg x5 (temp) = v[k] lw x7, 4(x6) // reg x7 = v[k + 1] // refers to next element of v Next we store x9 and x11 to the swapped addresses: sw x7, 0(x6) // v[k] = reg x7 sw x5, 4(x6) // v[k+1] = reg x5 (temp) Now we have allocated registers and written the code to perform the operations of the procedure. What is missing is the code for preserving the saved registers used within swap. Since we are not using saved registers in this leaf procedure, there is nothing to preserve. The Full swap Procedure We are now ready for the whole routine. All that remains is to add the procedure label and the return branch. swap: slli x6, x11, 2 // reg x6 = k * 4 add x6, x10, x6 // reg x6 = v + (k * 4) lw x5, 0(x6) // reg x5 (temp) = v[k] lw x7, 4(x6) // reg x7 = v[k + 1] sw x7, 0(x6) // v[k] = reg x7 sw x5, 4(x6) // v[k+1] = reg x5 (temp) jalr x0, 0(x1) // return to calling routine The Procedure sort To ensure that you appreciate the rigor of programming in assembly language, we’ll try a second, longer example. In this case, we’ll build a routine that calls the swap procedure. This program sorts an array of integers, using bubble or exchange sort, which is one of the simplest if not the fastest sorts. Figure 2.24 shows the C version of the program. Once again, we present this procedure in several steps, concluding with the full procedure. 2.13 A C Sort Example to Put it All Together 143 FIGURE 2.24 A C procedure that performs a sort on the array v. Register Allocation for sort The two parameters of the procedure sort, v and n, are in the parameter registers x10 and x11, and we assign register x19 to i and register x20 to j. Code for the Body of the Procedure sort The procedure body consists of two nested for loops and a call to swap that includes parameters. Let’s unwrap the code from the outside to the middle. The first translation step is the first for loop: for (i = 0; i < n; i += 1) { Recall that the C for statement has three parts: initialization, loop test, and iteration increment. It takes just one instruction to initialize i to 0, the first part of the for statement: addi x19, x0, 0 It also takes just one instruction to increment i, the last part of the for statement: addi x19, x19, 1 // i += 1 The loop should be exited if i < n is not true or, said another way, should be exited if i ≥ n. This test takes just one instruction: for1tst: bge x19, x11, exit1 // go to exit1 if x19 ≥ x1 (i≥n) The bottom of the loop just branches back to the loop test: j for1tst // branch to test of outer loop exit1: The skeleton code of the first for loop is then addi x19, x0, 0 // i = 0 for1tst: bge x19, x11, exit1 // g o to exit1 if x19 ≥ x1 (i≥n) … (body of first for loop) … 144 Chapter 2 Instructions: Language of the Computer addi x19, x19, 1 // i += 1 j for1tst // branch to test of outer loop exit1: Voila! (The exercises explore writing faster code for similar loops.) The second for loop looks like this in C: for (j = i – 1; j >= 0 && v[j] > v[j + 1]; j -= 1) { The initialization portion of this loop is again one instruction: addi x20, x19, -1 // j = i – 1 The decrement of j at the end of the loop is also one instruction: addi x20, x20, -1 j = 1 The loop test has two parts. We exit the loop if either condition fails, so the first test must exit the loop if it fails (j < 0): for2tst: blt x20, x0, exit2 // go to exit2 if x20 < 0 (j < 0) This branch will skip over the second condition test. If it doesn’t skip, then j ≥ 0. The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤ v[j + 1]. First we create the address by multiplying j by 4 (since we need a byte address) and add it to the base address of v: slli x5, x20, 2 // reg x5 = j * 4 add x5, x10, x5 // reg x5 = v + (j * 4) Now we load v[j]: lw x6, 0(x5) // reg x6 = v[j] Since we know that the second element is just the following word, we add 4 to the address in register x5 to get v[j + 1]: lw x7, 4(x5) // reg x7 = v[j + 1] We test v[j] ≤ v[j + 1] to exit the loop: ble x6, x7, exit2 // go to exit2 if x6 ≤ x7 The bottom of the loop branches back to the inner loop test: jal, x0 for2tst // branch to test of inner loop Combining the pieces, the skeleton of the second for loop looks like this: addi x20, x19, -1 // j = i - 1 for2tst: blt x20, x0, exit2 // go to exit2 if x20 < 0 (j < 0) slli x5, x20, 2 // reg x5 = j * 4 add x5, x10, x5 // reg x5 = v + (j * 4) lw x6, 0(x5) // reg x6 = v[j] 2.13 A C Sort Example to Put it All Together 145 lw x7, 4(x5) // reg x7 = v[j + 1] ble x6, x7, exit2 // go to exit2 if x6 ≤ x7... (body of second for loop)... addi x20, x20, -1 // j -= 1 jal, x0 for2tst // branch to test of inner loop exit2: The Procedure Call in sort The next step is the body of the second for loop: swap(v,j); Calling swap is easy enough: jal x1, swap Passing Parameters in sort The problem comes when we want to pass parameters because the sort procedure needs the values in registers x10 and x11, yet the swap procedure needs to have its parameters placed in those same registers. One solution is to copy the parameters for sort into other registers earlier in the procedure, making registers x10 and x11 available for the call of swap. (This copy is faster than saving and restoring on the stack.) We first copy x10 and x11 into x21 and x22 during the procedure: addi x21, x10, 0 // copy parameter x10 into x21 addi x22, x11, 0 // copy parameter x11 into x22 Then we pass the parameters to swap with these two instructions: addi x10, x21, 0 // first swap parameter is v addi x11, x20, 0 // second swap parameter is j Preserving Registers in sort The only remaining code is the saving and restoring of registers. Clearly, we must save the return address in register x1, since sort is a procedure and is itself called. The sort procedure also uses the callee-saved registers x19, x20, x21, and x22, so they must be saved. The prologue of the sort procedure is then addi sp, sp, -20 // make room on stack for 5 regs sw x1, 16(sp) // save x1 on stack sw x22, 12(sp) // save x22 on stack sw x21, 8(sp) // save x21 on stack sw x20, 4(sp) // save x20 on stack sw x19, 0(sp) // save x19 on stack 146 Chapter 2 Instructions: Language of the Computer The tail of the procedure simply reverses all these instructions, and then adds a jalr to return. The Full Procedure sort Now we put all the pieces together in Figure 2.25, being careful to replace references to registers x10 and x11 in the for loops with references to registers x21 and x22. Once again, to make the code easier to follow, we identify each block of code with its purpose in the procedure. In this example, nine lines of the sort procedure in C became 34 lines in the RISC-V assembly language. FIGURE 2.25 RISC-V assembly version of procedure sort in Figure 2.27. 2.13 A C Sort Example to Put it All Together 147 Elaboration: One optimization that works with this example is procedure inlining. Instead of passing arguments in parameters and invoking the code with a jal instruction, the compiler would copy the code from the body of the swap procedure where the call to swap appears in the code. Inlining would avoid four instructions in this example. The downside of the inlining optimization is that the compiled code would be bigger if the inlined procedure is called from several locations. Such a code expansion might turn into lower performance if it increased the cache miss rate; see Chapter 5. Figure 2.26 shows the impact of compiler optimization on sort program Understanding performance, compile time, clock cycles, instruction count, and CPI. Note that unoptimized code has the best CPI, and O1 optimization has the lowest instruction Program count, but O3 is the fastest, reminding us that time is the only accurate measure of Performance program performance. Figure 2.27 compares the impact of programming languages, compilation versus interpretation, and algorithms on performance of sorts. The fourth column shows that the unoptimized C program is 8.3 times faster than the interpreted Java code for Bubble Sort. Using the JIT compiler makes Java 2.1 times faster than the unoptimized C and within a factor of 1.13 of the highest optimized C code. ( Section 2.15 gives more details on interpretation versus compilation of Java and the Java and jalr code for Bubble Sort.) The ratios aren’t as close for Quicksort in Column 5, presumably because it is harder to amortize the cost of runtime compilation over the shorter execution time. The last column demonstrates the impact of a better algorithm, offering three orders of magnitude a performance increase by when sorting 100,000 items. Even comparing interpreted Java in Column 5 to the C compiler at highest optimization in Column 4, Quicksort beats Bubble Sort by a factor of 50 (0.05 × 2468, or 123 times faster than the unoptimized C code versus 2.41 times faster). FIGURE 2.26 Comparing performance, instruction count, and CPI using compiler optimization for Bubble Sort. The programs sorted 100,000 32-bit words with the array initialized to random values. These programs were run on a Pentium 4 with a clock rate of 3.06 GHz and a 533 MHz system bus with 2 GB of PC2100 DDR SDRAM. It used Linux version 2.4.20. 148 Chapter 2 Instructions: Language of the Computer FIGURE 2.27 Performance of two sort algorithms in C and Java using interpretation and optimizing compilers relative to unoptimized C version. The last column shows the advantage in performance of Quicksort over Bubble Sort for each language and execution option. These programs were run on the same system as in Figure 2.29. The JVM is Sun version 1.3.1, and the JIT is Sun Hotspot version 1.3.1. 2.14 Arrays versus Pointers A challenge for any new C programmer is understanding pointers. Comparing assembly code that uses arrays and array indices to the assembly code that uses pointers offers insights about pointers. This section shows C and RISC-V assembly versions of two procedures to clear a sequence of words in memory: one using array indices and one with pointers. Figure 2.28 shows the two C procedures. The purpose of this section is to show how pointers map into RISC-V instructions, and not to endorse a dated programming style. We’ll see the impact of modern compiler optimization on these two procedures at the end of the section. FIGURE 2.28 Two C procedures for setting an array to all zeros. clear1 uses indices, while clear2 uses pointers. The second procedure needs some explanation for those unfamiliar with C. The address of a variable is indicated by &, and the object pointed to by a pointer is indicated by *. The declarations declare that array and p are pointers to integers. The first part of the for loop in clear2 assigns the address of the first element of array to the pointer p. The second part of the for loop tests to see if the pointer is pointing beyond the last element of array. Incrementing a pointer by one, in the bottom part of the for loop, means moving the pointer to the next sequential object of its declared size. Since p is a pointer to integers, the compiler will generate RISC-V instructions to increment p by four, the number of bytes in an RISC-V integer. The assignment in the loop places 0 in the object pointed to by p.