Check Yourself

I. What is the range of byte addresses for conditional branches in RISC-V (K = 1024)?
1. Addresses between 0 and 4K − 1
2. Addresses between 0 and 8K − 1
3. Addresses up to about 2K before the branch to about 2K after
4. Addresses up to about 4K before the branch to about 4K after

II. What is the range of byte addresses for the jump-and-link instruction in RISC-V (M = 1024K)?
1. Addresses between 0 and 512K − 1
2. Addresses between 0 and 1M − 1
3. Addresses up to about 512K before the branch to about 512K after
4. Addresses up to about 1M before the branch to about 1M after

2.11 Parallelism and Instructions: Synchronization

Parallel execution is easier when tasks are independent, but often they need to cooperate. Cooperation usually means some tasks are writing new values that others must read. To know when a task is finished writing so that it is safe for another to read, the tasks need to synchronize. If they don’t synchronize, there is a danger of a data race, where the results of the program can change depending on how events happen to occur.

data race: Two memory accesses form a data race if they are from different threads to the same location, at least one is a write, and they occur one after another.

For example, recall the analogy of the eight reporters writing a story on pages 44–45 of Chapter 1. Suppose one reporter needs to read all the prior sections before writing a conclusion. Hence, he or she must know when the other reporters have finished their sections, so that there is no danger of sections being changed afterwards. That is, they had better synchronize the writing and reading of each section so that the conclusion will be consistent with what is printed in the prior sections.

In computing, synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. In this section, we focus on the implementation of lock and unlock synchronization operations. Lock and unlock can be used straightforwardly to create regions where only a single processor can operate, called a mutual exclusion, as well as to implement more complex synchronization mechanisms.

The critical ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location. That is, nothing else can interpose itself between the read and the write of the memory location. Without such a capability, the cost of building basic synchronization primitives will be high and will increase unreasonably as the processor count increases.

There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically. In general, architects do not expect users to employ the basic hardware primitives, but instead expect system programmers will use the primitives to build a synchronization library, a process that is often complex and tricky.

Let’s start with one such hardware primitive and show how it can be used to build a basic synchronization primitive. One typical operation for building synchronization operations is the atomic exchange or atomic swap, which interchanges a value in a register for a value in memory.
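For readers who want to experiment, the sketch below shows the same primitive at the C level using the standard <stdatomic.h> interface; the variable names are illustrative, and on a RISC-V machine a compiler would typically implement atomic_exchange with an atomic swap instruction or with the load-reserved/store-conditional pair introduced next.

/* A minimal sketch of the atomic exchange primitive in C11. */
#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    atomic_int shared = 5;

    /* Atomically write 9 into shared and get back the value that was there;
       nothing can interpose itself between the read and the write. */
    int old = atomic_exchange(&shared, 9);

    printf("old = %d, new = %d\n", old, atomic_load(&shared));  /* prints old = 5, new = 9 */
    return 0;
}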
To see how to use this to build a basic synchronization primitive, assume that we want to build a simple lock where the value 0 is used to indicate that the lock is free and 1 is used to indicate that the lock is unavailable. A processor tries to set the lock by doing an exchange of 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 (unavailable) if some other processor had already claimed access, and 0 (free) otherwise. In the latter case, the value is also changed to 1 (unavailable), preventing any competing exchange in another processor from also retrieving a 0 (free).

For example, consider two processors that each try to do the exchange simultaneously: this race is prevented, since exactly one of the processors will perform the exchange first, returning 0 (free), and the second processor will return 1 (unavailable) when it does the exchange. The key to using the exchange primitive to implement synchronization is that the operation is atomic: the exchange is indivisible, and two simultaneous exchanges will be ordered by the hardware. It is impossible for two processors trying to set the synchronization variable in this manner to both think they have simultaneously set the variable.

Implementing a single atomic memory operation introduces some challenges in the design of the processor, since it requires both a memory read and a write in a single, uninterruptible instruction.

An alternative is to have a pair of instructions in which the second instruction returns a value showing whether the pair of instructions was executed as if the pair was atomic. The pair of instructions is effectively atomic if it appears as if all other operations executed by any processor occurred before or after the pair. Thus, when an instruction pair is effectively atomic, no other processor can change the value between the pair of instructions.

In RISC-V this pair of instructions includes a special load called a load-reserved word (lr.w) and a special store called a store-conditional word (sc.w). These instructions are used in sequence: if the contents of the memory location specified by the load-reserved are changed before the store-conditional to the same address occurs, then the store-conditional fails and does not write the value to memory. The store-conditional is defined to both store the value of a (presumably different) register in memory and to change the value of another register to a 0 if it succeeds and to a nonzero value if it fails. Thus, sc.w specifies three registers: one to hold the address, one to indicate whether the atomic operation failed or succeeded, and one to hold the value to be stored in memory if it succeeded. Since the load-reserved returns the initial value, and the store-conditional returns 0 only if it succeeds, the following sequence implements an atomic exchange on the memory location specified by the contents of x20:

again: lr.w  x10, (x20)       // load-reserved
       sc.w  x11, x23, (x20)  // store-conditional
       bne   x11, x0, again   // branch if store fails (x11 is nonzero)
       addi  x23, x10, 0      // put loaded value in x23

Any time a processor intervenes and modifies the value in memory between the lr.w and sc.w instructions, the sc.w writes a nonzero value into x11, causing the code sequence to try again. At the end of this sequence, the contents of x23 and the memory location specified by x20 have been atomically exchanged.
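The same retry pattern can be written portably in C11 with a weak compare-and-exchange, which, like sc.w, is allowed to fail even when no other processor wrote the location and is therefore wrapped in a loop. This is only a sketch under that assumption; the function name exchange and the variable names are illustrative, and a RISC-V compiler would typically turn the loop into an lr.w/sc.w sequence much like the one above.

#include <stdatomic.h>

/* Atomically exchange the contents of *shared with new_value and
   return the old contents, mirroring the lr.w/sc.w loop above. */
int exchange(atomic_int *shared, int new_value) {
    int old = atomic_load(shared);          /* like lr.w: read the current value */
    while (!atomic_compare_exchange_weak(shared, &old, new_value)) {
        /* like a failed sc.w: the store did not happen, and old has been
           refreshed with the value now in memory, so simply try again */
    }
    return old;                             /* like addi x23, x10, 0: hand back the old value */
}

Calling exchange(&lock, 1), for example, performs the same exchange-of-1 lock attempt described earlier in this section.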
Elaboration: Although it was presented for multiprocessor synchronization, atomic exchange is also useful for the operating system in dealing with multiple processes in a single processor. To make sure nothing interferes in a single processor, the store-conditional also fails if the processor does a context switch between the two instructions (see Chapter 5).

Elaboration: An advantage of the load-reserved/store-conditional mechanism is that it can be used to build other synchronization primitives, such as atomic compare and swap or atomic fetch-and-increment, which are used in some parallel programming models. These involve more instructions between the lr.w and the sc.w, but not too many. Since the store-conditional will fail after either another attempted store to the load-reservation address or any exception, care must be taken in choosing which instructions are inserted between the two instructions. In particular, only integer arithmetic, forward branches, and backward branches out of the load-reserved/store-conditional block can safely be permitted; otherwise, it is possible to create deadlock situations where the processor can never complete the sc.w because of repeated page faults. In addition, the number of instructions between the load-reserved and the store-conditional should be small to minimize the probability that either an unrelated event or a competing processor causes the store-conditional to fail frequently.

Elaboration: While the code above implemented an atomic exchange, the following code would more efficiently acquire a lock at the location in register x20, where a value of 0 means the lock is free and 1 means the lock is acquired:

       addi  x12, x0, 1       // copy locked value
again: lr.w  x10, (x20)       // load-reserved to read lock
       bne   x10, x0, again   // check if it is 0 yet
       sc.w  x11, x12, (x20)  // attempt to store new value
       bne   x11, x0, again   // branch if store fails

We release the lock just by using a regular store to write 0 into the location:

       sw    x0, 0(x20)       // free lock by writing 0

Check Yourself: When do you use primitives like load-reserved and store-conditional?
1. When cooperating threads of a parallel program need to synchronize to get proper behavior for reading and writing shared data.
2. When cooperating processes on a uniprocessor need to synchronize for reading and writing shared data.

2.12 Translating and Starting a Program

This section describes the four steps in transforming a C program in a file in nonvolatile storage (disk or flash memory) into a program running on a computer. Figure 2.20 shows the translation hierarchy. Some systems combine these steps to reduce translation time, but programs go through these four logical phases. This section follows this translation hierarchy.

FIGURE 2.20 A translation hierarchy for C. The figure shows: C program → Compiler → Assembly language program → Assembler → Object: Machine language module, which together with Object: Library routine (machine language) feeds the Linker → Executable: Machine language program → Loader → Memory. A high-level language program is first compiled into an assembly language program and then assembled into an object module in machine language. The linker combines multiple modules with library routines to resolve all references. The loader then places the machine code into the proper memory locations for execution by the processor. To speed up the translation process, some steps are skipped or combined.
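To make the hierarchy in Figure 2.20 concrete, here is a sketch of how one small C file typically moves through the four phases on a UNIX system; the file name hello.c and the use of a GCC-style cc driver are illustrative assumptions rather than part of the figure.

/* hello.c, a tiny program used only to walk through the translation hierarchy.
 * Typical commands (assuming a GCC-style driver named cc):
 *   cc -S hello.c      compiler:  hello.c -> hello.s  (assembly language program)
 *   cc -c hello.s      assembler: hello.s -> hello.o  (object: machine language module)
 *   cc hello.o         linker:    hello.o + library routines -> a.out (executable)
 *   ./a.out            loader:    the operating system places a.out in memory and runs it
 */
#include <stdio.h>

int main(void) {
    printf("hello\n");   /* printf is a library routine the linker must resolve */
    return 0;
}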
Some compilers produce object modules directly, and some systems use linking loaders that perform the last two steps. To identify the type of file, UNIX follows a suffix convention for files: C source files are named x.c, assembly files are x.s, object files are named x.o, statically linked library routines are x.a, dynamically linked library routines are x.so, and executable files by default are called a.out. MS-DOS uses the suffixes .C, .ASM, .OBJ, .LIB, .DLL, and .EXE to the same effect.
