Computer Organization and Architecture, 8th Edition - Cache Memory
William Stallings
Summary
This chapter from "Computer Organization and Architecture" by William Stallings covers cache memory: the memory hierarchy, cache memory principles, the elements of cache design, and the specific cache organizations of the Pentium 4 and ARM processors.
Full Transcript
CHAPTER CACHE MEMORY 4.1 Computer Memory System Overview Characteristics of Memory Systems The Memory Hierarchy 4.2 Cache Memory Principles 4.3 Elements of Cache Design Cache Addresses Cache Size Mapping Function Replacement Algorithms Write Policy Line Size Number of Caches 4.4 Pentium 4 Cache Organization 4.5 ARM Cache Organization 4.6 Recommended Reading 4.7 Key Terms, Review Questions, and Problems Appendix 4A Performance Characteristics of Two-Level Memories Locality Operation of Two-Level Memory Performance 110 4.1 / COMPUTER MEMORY SYSTEM OVERVIEW 111 KEY POINTS ◆ Computer memory is organized into a hierarchy. At the highest level (clos- est to the processor) are the processor registers. Next comes one or more levels of cache, When multiple levels are used, they are denoted L1, L2, and so on. Next comes main memory, which is usually made out of dynamic random-access memory (DRAM). All of these are considered internal to the computer system. The hierarchy continues with external memory, with the next level typically being a fixed hard disk, and one or more levels below that consisting of removable media such as optical disks and tape. ◆ As one goes down the memory hierarchy, one finds decreasing cost/bit, in- creasing capacity, and slower access time. It would be nice to use only the fastest memory, but because that is the most expensive memory, we trade off access time for cost by using more of the slower memory. The design challenge is to organize the data and programs in memory so that the ac- cessed memory words are usually in the faster memory. ◆ In general, it is likely that most future accesses to main memory by the processor will be to locations recently accessed. So the cache automatically retains a copy of some of the recently used words from the DRAM. If the cache is designed properly, then most of the time the processor will request memory words that are already in the cache. Although seemingly simple in concept, computer memory exhibits perhaps the widest range of type, technology, organization, performance, and cost of any feature of a com- puter system. No one technology is optimal in satisfying the memory requirements for a computer system. As a consequence, the typical computer system is equipped with a hierarchy of memory subsystems, some internal to the system (directly accessible by the processor) and some external (accessible by the processor via an I/O module). This chapter and the next focus on internal memory elements, while Chapter 6 is devoted to external memory. To begin, the first section examines key characteristics of computer memories.The remainder of the chapter examines an essential element of all modern computer systems: cache memory. 4.1 COMPUTER MEMORY SYSTEM OVERVIEW Characteristics of Memory Systems The complex subject of computer memory is made more manageable if we classify memory systems according to their key characteristics. The most important of these are listed in Table 4.1. The term location in Table 4.1 refers to whether memory is internal and exter- nal to the computer. Internal memory is often equated with main memory. But there are other forms of internal memory. The processor requires its own local memory, in 112 CHAPTER 4 / CACHE MEMORY Table 4.1 Key Characteristics of Computer Memory Systems Location Performance Internal (e.g. processor registers, main Access time memory, cache) Cycle time External (e.g. 
optical disks, magnetic Transfer rate disks, tapes) Physical Type Capacity Semiconductor Number of words Magnetic Number of bytes Optical Unit of Transfer Magneto-optical Word Physical Characteristics Block Volatile/nonvolatile Access Method Erasable/nonerasable Sequential Organization Direct Memory modules Random Associative the form of registers (e.g., see Figure 2.3). Further, as we shall see, the control unit portion of the processor may also require its own internal memory. We will defer dis- cussion of these latter two types of internal memory to later chapters. Cache is another form of internal memory. External memory consists of peripheral storage devices, such as disk and tape, that are accessible to the processor via I/O controllers. An obvious characteristic of memory is its capacity. For internal memory, this is typically expressed in terms of bytes (1 byte = 8 bits) or words. Common word lengths are 8, 16, and 32 bits. External memory capacity is typically expressed in terms of bytes. A related concept is the unit of transfer. For internal memory, the unit of transfer is equal to the number of electrical lines into and out of the memory module. This may be equal to the word length, but is often larger, such as 64, 128, or 256 bits. To clarify this point, consider three related concepts for internal memory: Word: The “natural” unit of organization of memory. The size of the word is typically equal to the number of bits used to represent an integer and to the in- struction length. Unfortunately, there are many exceptions. For example, the CRAY C90 (an older model CRAY supercomputer) has a 64-bit word length but uses a 46-bit integer representation. The Intel x86 architecture has a wide variety of instruction lengths, expressed as multiples of bytes, and a word size of 32 bits. Addressable units: In some systems, the addressable unit is the word. How- ever, many systems allow addressing at the byte level. In any case, the rela- tionship between the length in bits A of an address and the number N of addressable units is 2A = N. Unit of transfer: For main memory, this is the number of bits read out of or written into memory at a time. The unit of transfer need not equal a word or an 4.1 / COMPUTER MEMORY SYSTEM OVERVIEW 113 addressable unit. For external memory, data are often transferred in much larger units than a word, and these are referred to as blocks. Another distinction among memory types is the method of accessing units of data. These include the following: Sequential access: Memory is organized into units of data, called records. Ac- cess must be made in a specific linear sequence. Stored addressing information is used to separate records and assist in the retrieval process. A shared read– write mechanism is used, and this must be moved from its current location to the desired location, passing and rejecting each intermediate record. Thus, the time to access an arbitrary record is highly variable. Tape units, discussed in Chapter 6, are sequential access. Direct access: As with sequential access, direct access involves a shared read–write mechanism. However, individual blocks or records have a unique address based on physical location. Access is accomplished by direct access to reach a general vicinity plus sequential searching, counting, or waiting to reach the final location. Again, access time is variable. Disk units, discussed in Chapter 6, are direct access. Random access: Each addressable location in memory has a unique, physically wired-in addressing mechanism. 
The time to access a given location is inde- pendent of the sequence of prior accesses and is constant. Thus, any location can be selected at random and directly addressed and accessed. Main memory and some cache systems are random access. Associative: This is a random access type of memory that enables one to make a comparison of desired bit locations within a word for a specified match, and to do this for all words simultaneously. Thus, a word is retrieved based on a portion of its contents rather than its address. As with ordinary random-access memory, each location has its own addressing mechanism, and retrieval time is constant independent of location or prior access patterns. Cache memories may employ associative access. From a user’s point of view, the two most important characteristics of memory are capacity and performance. Three performance parameters are used: Access time (latency): For random-access memory, this is the time it takes to perform a read or write operation, that is, the time from the instant that an ad- dress is presented to the memory to the instant that data have been stored or made available for use. For non-random-access memory, access time is the time it takes to position the read–write mechanism at the desired location. Memory cycle time: This concept is primarily applied to random-access mem- ory and consists of the access time plus any additional time required before a second access can commence. This additional time may be required for tran- sients to die out on signal lines or to regenerate data if they are read destruc- tively. Note that memory cycle time is concerned with the system bus, not the processor. Transfer rate: This is the rate at which data can be transferred into or out of a memory unit. For random-access memory, it is equal to 1/(cycle time). 114 CHAPTER 4 / CACHE MEMORY For non-random-access memory, the following relationship holds: n TN = TA + (4.1) R where TN = Average time to read or write N bits TA = Average access time n = Number of bits R = Transfer rate, in bits per second (bps) A variety of physical types of memory have been employed. The most com- mon today are semiconductor memory, magnetic surface memory, used for disk and tape, and optical and magneto-optical. Several physical characteristics of data storage are important. In a volatile memory, information decays naturally or is lost when electrical power is switched off. In a nonvolatile memory, information once recorded remains without deterioration until deliberately changed; no electrical power is needed to retain information. Magnetic-surface memories are nonvolatile. Semiconductor memory may be either volatile or nonvolatile. Nonerasable memory cannot be altered, except by destroying the storage unit. Semiconductor memory of this type is known as read-only memory (ROM). Of necessity, a practical nonerasable memory must also be nonvolatile. For random-access memory, the organization is a key design issue. By organi- zation is meant the physical arrangement of bits to form words. The obvious arrangement is not always used, as is explained in Chapter 5. The Memory Hierarchy The design constraints on a computer’s memory can be summed up by three ques- tions: How much? How fast? How expensive? The question of how much is somewhat open ended. If the capacity is there, applications will likely be developed to use it. The question of how fast is, in a sense, easier to answer. To achieve greatest performance, the memory must be able to keep up with the processor. 
That is, as the processor is executing instructions, we would not want it to have to pause waiting for instructions or operands. The final question must also be considered. For a practical system, the cost of memory must be reason- able in relationship to other components. As might be expected, there is a trade-off among the three key characteristics of memory: namely, capacity, access time, and cost. A variety of technologies are used to implement memory systems, and across this spectrum of technologies, the following relationships hold: Faster access time, greater cost per bit Greater capacity, smaller cost per bit Greater capacity, slower access time The dilemma facing the designer is clear. The designer would like to use mem- ory technologies that provide for large-capacity memory, both because the capacity is needed and because the cost per bit is low. However, to meet performance 4.1 / COMPUTER MEMORY SYSTEM OVERVIEW 115 requirements, the designer needs to use expensive, relatively lower-capacity memo- ries with short access times. The way out of this dilemma is not to rely on a single memory component or technology, but to employ a memory hierarchy. A typical hierarchy is illustrated in Figure 4.1. As one goes down the hierarchy, the following occur: a. Decreasing cost per bit b. Increasing capacity c. Increasing access time d. Decreasing frequency of access of the memory by the processor Thus, smaller, more expensive, faster memories are supplemented by larger, cheaper, slower memories. The key to the success of this organization is item (d): decreasing frequency of access. We examine this concept in greater detail when we discuss the cache, later in this chapter, and virtual memory in Chapter 8. A brief explanation is provided at this point. g- Re rs e ist Inb e me oard ch mo Ca ry in Ma ory m me k Ou i c dis t et sto boar gn OM rag d Ma D-R W e C D-R W C R M D- DV -RA D DV Of e f sto -line c tap rag eti e gn Ma Figure 4.1 The Memory Hierarchy 116 CHAPTER 4 / CACHE MEMORY Example 4.1 Suppose that the processor has access to two levels of memory. Level 1 contains 1000 words and has an access time of 0.01 ms; level 2 contains 100,000 words and has an access time of 0.1 ms. Assume that if a word to be accessed is in level 1, then the processor accesses it directly. If it is in level 2, then the word is first transferred to level 1 and then accessed by the processor. For simplicity, we ignore the time required for the processor to determine whether the word is in level 1 or level 2. Figure 4.2 shows the gen- eral shape of the curve that covers this situation. The figure shows the average access time to a two-level memory as a function of the hit ratio H, where H is defined as the fraction of all memory accesses that are found in the faster memory (e.g., the cache), T1 is the access time to level 1, and T2 is the access time to level 2.1 As can be seen, for high percentages of level 1 access, the average total access time is much closer to that of level 1 than that of level 2. In our example, suppose 95% of the memory accesses are found in the cache. Then the average time to access a word can be expressed as (0.95)(0.01 ms) + (0.05)(0.01 ms + 0.1 ms) = 0.0095 + 0.0055 = 0.015 ms The average access time is much closer to 0.01 ms than to 0.1 ms, as desired. 
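The calculation in Example 4.1 follows the general two-level model T = H x T1 + (1 - H) x (T1 + T2), in which every access incurs the level 1 time and a miss adds a full level 2 access. The short Python sketch below reproduces the example's numbers; the function name and the list of hit ratios tried are ours, and the time values are simply those quoted in the example.

    # Average access time of a simple two-level memory (model of Example 4.1).
    # Every access pays the level 1 time; a miss adds a full level 2 access.
    def average_access_time(hit_ratio, t1, t2):
        return hit_ratio * t1 + (1.0 - hit_ratio) * (t1 + t2)

    t1 = 0.01   # level 1 access time (same time units as the example)
    t2 = 0.1    # level 2 access time
    for h in (0.5, 0.9, 0.95, 0.99):
        print(f"H = {h:.2f}: average access time = {average_access_time(h, t1, t2):.4f}")
    # H = 0.95 gives 0.0150, matching the example.

As the loop shows, the average time approaches T1 only as the hit ratio approaches 1, which is exactly the behavior sketched in Figure 4.2.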
T1 T2 T2 Average access time T1 0 1 Fraction of accesses involving only level 1 (hit ratio) Figure 4.2 Performance of accesses involving only level 1 (hit ratio) 1 If the accessed word is found in the faster memory, that is defined as a hit. A miss occurs if the accessed word is not found in the faster memory. 4.1 / COMPUTER MEMORY SYSTEM OVERVIEW 117 The use of two levels of memory to reduce average access time works in prin- ciple, but only if conditions (a) through (d) apply. By employing a variety of tech- nologies, a spectrum of memory systems exists that satisfies conditions (a) through (c). Fortunately, condition (d) is also generally valid. The basis for the validity of condition (d) is a principle known as locality of reference [DENN68]. During the course of execution of a program, memory refer- ences by the processor, for both instructions and data, tend to cluster. Programs typ- ically contain a number of iterative loops and subroutines. Once a loop or subroutine is entered, there are repeated references to a small set of instructions. Similarly, operations on tables and arrays involve access to a clustered set of data words. Over a long period of time, the clusters in use change, but over a short period of time, the processor is primarily working with fixed clusters of memory references. Accordingly, it is possible to organize data across the hierarchy such that the percentage of accesses to each successively lower level is substantially less than that of the level above. Consider the two-level example already presented. Let level 2 mem- ory contain all program instructions and data. The current clusters can be temporarily placed in level 1. From time to time, one of the clusters in level 1 will have to be swapped back to level 2 to make room for a new cluster coming in to level 1. On aver- age, however, most references will be to instructions and data contained in level 1. This principle can be applied across more than two levels of memory, as sug- gested by the hierarchy shown in Figure 4.1. The fastest, smallest, and most expen- sive type of memory consists of the registers internal to the processor. Typically, a processor will contain a few dozen such registers, although some machines contain hundreds of registers. Skipping down two levels, main memory is the principal inter- nal memory system of the computer. Each location in main memory has a unique address. Main memory is usually extended with a higher-speed, smaller cache. The cache is not usually visible to the programmer or, indeed, to the processor. It is a de- vice for staging the movement of data between main memory and processor regis- ters to improve performance. The three forms of memory just described are, typically, volatile and employ semiconductor technology. The use of three levels exploits the fact that semiconduc- tor memory comes in a variety of types, which differ in speed and cost. Data are stored more permanently on external mass storage devices, of which the most com- mon are hard disk and removable media, such as removable magnetic disk, tape, and optical storage. External, nonvolatile memory is also referred to as secondary mem- ory or auxiliary memory. These are used to store program and data files and are usu- ally visible to the programmer only in terms of files and records, as opposed to individual bytes or words. Disk is also used to provide an extension to main memory known as virtual memory, which is discussed in Chapter 8. Other forms of memory may be included in the hierarchy. 
For example, large IBM mainframes include a form of internal memory known as expanded storage. This uses a semiconductor technology that is slower and less expensive than that of main memory. Strictly speaking, this memory does not fit into the hierarchy but is a side branch: Data can be moved between main memory and expanded storage but not between expanded storage and external memory. Other forms of secondary memory include optical and magneto-optical disks. Finally, additional levels can be effectively added to the hierarchy in software. A portion of main memory can be 118 CHAPTER 4 / CACHE MEMORY used as a buffer to hold data temporarily that is to be read out to disk. Such a tech- nique, sometimes referred to as a disk cache,2 improves performance in two ways: Disk writes are clustered. Instead of many small transfers of data, we have a few large transfers of data. This improves disk performance and minimizes processor involvement. Some data destined for write-out may be referenced by a program before the next dump to disk. In that case, the data are retrieved rapidly from the soft- ware cache rather than slowly from the disk. Appendix 4A examines the performance implications of multilevel memory structures. 4.2 CACHE MEMORY PRINCIPLES Cache memory is intended to give memory speed approaching that of the fastest memories available, and at the same time provide a large memory size at the price of less expensive types of semiconductor memories. The concept is illustrated in Figure 4.3a. There is a relatively large and slow main memory together with a smaller, faster cache memory. The cache contains a copy of portions of main mem- ory. When the processor attempts to read a word of memory, a check is made to Block Transfer Word Transfer CPU Cache Main memory Fast Slow (a) Single cache Level 1 Level 2 Level 3 Main CPU (L1) cache (L2) cache (L3) cache memory Fastest Fast Less Slow fast (b) Three-level cache organization Figure 4.3 Cache and Main Memory 2 Disk cache is generally a purely software technique and is not examined in this book. See [STAL09] for a discussion. 4.2 / CACHE MEMORY PRINCIPLES 119 determine if the word is in the cache. If so, the word is delivered to the processor. If not, a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor. Because of the phenome- non of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block. Figure 4.3b depicts the use of multiple levels of cache. The L2 cache is slower and typically larger than the L1 cache, and the L3 cache is slower and typically larger than the L2 cache. Figure 4.4 depicts the structure of a cache/main-memory system. Main memory consists of up to 2n addressable words, with each word having a unique n-bit address. For mapping purposes, this memory is considered to consist of a number of fixed- length blocks of K words each. That is, there are M = 2n/K blocks in main memory. The cache consists of m blocks, called lines.3 Each line contains K words, plus a tag of a few bits. 
Each line also includes control bits (not shown), such as a bit to indicate Line Memory number Tag Block address 0 0 1 1 2 2 Block 3 (K words) C1 Block length (K Words) (a) Cache Block 2n 1 Word length (b) Main memory Figure 4.4 Cache/Main Memory Structure 3 In referring to the basic unit of the cache, the term line is used, rather than the term block, for two rea- sons: (1) to avoid confusion with a main memory block, which contains the same number of data words as a cache line; and (2) because a cache line includes not only K words of data, just as a main memory block, but also include tag and control bits. 120 CHAPTER 4 / CACHE MEMORY whether the line has been modified since being loaded into the cache. The length of a line, not including tag and control bits, is the line size. The line size may be as small as 32 bits, with each “word” being a single byte; in this case the line size is 4 bytes. The number of lines is considerably less than the number of main memory blocks (m V M). At any time, some subset of the blocks of memory resides in lines in the cache. If a word in a block of memory is read, that block is transferred to one of the lines of the cache. Because there are more blocks than lines, an individual line can- not be uniquely and permanently dedicated to a particular block. Thus, each line in- cludes a tag that identifies which particular block is currently being stored. The tag is usually a portion of the main memory address, as described later in this section. Figure 4.5 illustrates the read operation. The processor generates the read ad- dress (RA) of a word to be read. If the word is contained in the cache, it is delivered START Receive address RA from CPU Is block Access main No containing RA memory for block in cache? containing RA Yes Fetch RA word Allocate cache and deliver line for main to CPU memory block Load main memory block Deliver RA word into cache line to CPU DONE Figure 4.5 Cache Read Operation 4.3 / ELEMENTS OF CACHE DESIGN 121 Address Address buffer System bus Control Control Processor Cache Data buffer Data Figure 4.6 Typical Cache Organization to the processor. Otherwise, the block containing that word is loaded into the cache, and the word is delivered to the processor. Figure 4.5 shows these last two opera- tions occurring in parallel and reflects the organization shown in Figure 4.6, which is typical of contemporary cache organizations. In this organization, the cache con- nects to the processor via data, control, and address lines. The data and address lines also attach to data and address buffers, which attach to a system bus from which main memory is reached. When a cache hit occurs, the data and address buffers are disabled and communication is only between processor and cache, with no system bus traffic. When a cache miss occurs, the desired address is loaded onto the system bus and the data are returned through the data buffer to both the cache and the processor. In other organizations, the cache is physically interposed between the processor and the main memory for all data, address, and control lines. In this latter case, for a cache miss, the desired word is first read into the cache and then trans- ferred from cache to processor. A discussion of the performance parameters related to cache use is contained in Appendix 4A. 4.3 ELEMENTS OF CACHE DESIGN This section provides an overview of cache design parameters and reports some typ- ical results. We occasionally refer to the use of caches in high-performance comput- ing (HPC). 
HPC deals with supercomputers and supercomputer software, especially for scientific applications that involve large amounts of data, vector and matrix 122 CHAPTER 4 / CACHE MEMORY Table 4.2 Elements of Cache Design Cache Addresses Write Policy Logical Write through Physical Write back Cache Size Write once Mapping Function Line Size Direct Number of caches Associative Single or two level Set Associative Unified or split Replacement Algorithm Least recently used (LRU) First in first out (FIFO) Least frequently used (LFU) Random computation, and the use of parallel algorithms. Cache design for HPC is quite dif- ferent than for other hardware platforms and applications. Indeed, many researchers have found that HPC applications perform poorly on computer architectures that employ caches [BAIL93]. Other researchers have since shown that a cache hierar- chy can be useful in improving performance if the application software is tuned to exploit the cache [WANG99, PRES01].4 Although there are a large number of cache implementations, there are a few basic design elements that serve to classify and differentiate cache architectures. Table 4.2 lists key elements. Cache Addresses Almost all nonembedded processors, and many embedded processors, support vir- tual memory, a concept discussed in Chapter 8. In essence, virtual memory is a facil- ity that allows programs to address memory from a logical point of view, without regard to the amount of main memory physically available. When virtual memory is used, the address fields of machine instructions contain virtual addresses. For reads to and writes from main memory, a hardware memory management unit (MMU) translates each virtual address into a physical address in main memory. When virtual addresses are used, the system designer may choose to place the cache between the processor and the MMU or between the MMU and main memory (Figure 4.7). A logical cache, also known as a virtual cache, stores data using virtual addresses. The processor accesses the cache directly, without going through the MMU. A physical cache stores data using main memory physical addresses. One obvious advantage of the logical cache is that cache access speed is faster than for a physical cache, because the cache can respond before the MMU performs 4 For a general discussion of HPC, see [DOWD98]. 4.3 / ELEMENTS OF CACHE DESIGN 123 Logical address Physical address MMU Main Processor memory Cache Data (a) Logical cache Logical address Physical address MMU Main Processor memory Cache Data (b) Physical cache Figure 4.7 Logical and Physical Caches an address translation. The disadvantage has to do with the fact that most virtual memory systems supply each application with the same virtual memory address space. That is, each application sees a virtual memory that starts at address 0. Thus, the same virtual address in two different applications refers to two different physical addresses. The cache memory must therefore be completely flushed with each appli- cation context switch, or extra bits must be added to each line of the cache to iden- tify which virtual address space this address refers to. The subject of logical versus physical cache is a complex one, and beyond the scope of this book. For a more in-depth discussion, see [CEKL97] and [JACO08]. Cache Size The first item in Table 4.2, cache size, has already been discussed. 
We would like the size of the cache to be small enough so that the overall average cost per bit is close to that of main memory alone and large enough so that the overall average access time is close to that of the cache alone. There are several other motivations for min- imizing cache size. The larger the cache, the larger the number of gates involved in ad- dressing the cache. The result is that large caches tend to be slightly slower than small ones—even when built with the same integrated circuit technology and put in the 124 CHAPTER 4 / CACHE MEMORY Table 4.3 Cache Sizes of Some Processors Year of Processor Type Introduction L1 Cachea L2 Cache L3 Cache IBM 360/85 Mainframe 1968 16 to 32 kB — — PDP-11/70 Minicomputer 1975 1 kB — — VAX 11/780 Minicomputer 1978 16 kB — — IBM 3033 Mainframe 1978 64 kB — — IBM 3090 Mainframe 1985 128 to 256 kB — — Intel 80486 PC 1989 8 kB — — Pentium PC 1993 8 kB/8 kB 256 to 512 KB — PowerPC 601 PC 1993 32 kB — — PowerPC 620 PC 1996 32 kB/32 kB — — PowerPC G4 PC/server 1999 32 kB/32 kB 256 KB to 1 MB 2 MB IBM S/390 G4 Mainframe 1997 32 kB 256 KB 2 MB IBM S/390 G6 Mainframe 1999 256 kB 8 MB — Pentium 4 PC/server 2000 8 kB/8 kB 256 KB — High-end server/ IBM SP 2000 64 kB/32 kB 8 MB — supercomputer CRAY MTAb Supercomputer 2000 8 kB 2 MB — Itanium PC/server 2001 16 kB/16 kB 96 KB 4 MB SGI Origin 2001 High-end server 2001 32 kB/32 kB 4 MB — Itanium 2 PC/server 2002 32 kB 256 KB 6 MB IBM POWER5 High-end server 2003 64 kB 1.9 MB 36 MB CRAY XD-1 Supercomputer 2004 64 kB/64 kB 1 MB — IBM POWER6 PC/server 2007 64 kB/64 kB 4 MB 32 MB IBM z10 Mainframe 2008 64 kB/128 kB 3 MB 24–48 MB a Two values separated by a slash refer to instruction and data caches. b Both caches are instruction only; no data caches. same place on chip and circuit board. The available chip and board area also limits cache size. Because the performance of the cache is very sensitive to the nature of the workload, it is impossible to arrive at a single “optimum” cache size. Table 4.3 lists the cache sizes of some current and past processors. Mapping Function Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. Further, a means is needed for determining which main memory block currently occupies a cache line. The choice of the mapping function dictates how the cache is organized. Three tech- niques can be used: direct, associative, and set associative. We examine each of these in turn. In each case, we look at the general structure and then a specific example. 4.3 / ELEMENTS OF CACHE DESIGN 125 Example 4.2 For all three cases, the example includes the following elements: The cache can hold 64 KBytes. Data are transferred between main memory and the cache in blocks of 4 bytes each. This means that the cache is organized as 16K = 214 lines of 4 bytes each. The main memory consists of 16 Mbytes, with each byte directly addressable by a 24-bit address (224 = 16M). Thus, for mapping purposes, we can consider main mem- ory to consist of 4M blocks of 4 bytes each. DIRECT MAPPING The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line. The mapping is ex- pressed as i = j modulo m where i = cache line number j = main memory block number m = number of lines in the cache Figure 4.8a shows the mapping for the first m blocks of main memory. Each block of main memory maps into one unique line of the cache. 
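Using the parameters of Example 4.2 (24-bit byte addresses, 4-byte lines, 16K lines), the line number is simply a field of the address. The following illustrative Python fragment shows the idea; the constants and function names are ours, and the 8/14/2 tag/line/word split is the one worked out in Example 4.2a below.

    # Direct mapping with the Example 4.2 parameters: 24-bit addresses,
    # 4-byte lines (w = 2 word bits), 16K lines (r = 14 line bits), 8 tag bits.
    WORD_BITS = 2
    LINE_BITS = 14

    def split_address(address):
        word = address & ((1 << WORD_BITS) - 1)                  # byte within the line
        line = (address >> WORD_BITS) & ((1 << LINE_BITS) - 1)   # cache line index
        tag = address >> (WORD_BITS + LINE_BITS)                 # stored with the line
        return tag, line, word

    def block_to_line(block_number):
        return block_number % (1 << LINE_BITS)                   # i = j modulo m

    tag, line, word = split_address(0x16339C)
    print(hex(tag), hex(line), word)    # 0x16 0xce7 0 (compare Figure 4.10)
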
The next m blocks of main memory map into the cache in the same fashion; that is, block Bm of main memory maps into line L0 of cache, block Bm1 maps into line L1, and so on. The mapping function is easily implemented using the main memory address. Figure 4.9 illustrates the general mechanism. For purposes of cache access, each main memory address can be viewed as consisting of three fields. The least signifi- cant w bits identify a unique word or byte within a block of main memory; in most contemporary machines, the address is at the byte level. The remaining s bits specify one of the 2s blocks of main memory. The cache logic interprets these s bits as a tag of s - r bits (most significant portion) and a line field of r bits. This latter field iden- tifies one of the m = 2r lines of the cache. To summarize, Address length = (s + w) bits Number of addressable units 2sw words or bytes Block size = line size = 2w words or bytes 2s + w Number of blocks in main memory = = 2s 2w Number of lines in cache = m = 2r Size of cache = 2rw words or bytes Size of tag = (s - r) bits 126 CHAPTER 4 / CACHE MEMORY b t b B0 L0 m lines Bm–1 Lm–1 First m blocks of Cache memory main memory (equal to size of cache) b = length of block in bits t = length of tag in bits (a) Direct mapping t b L0 b One block of main memory Lm–1 Cache memory (b) Associative mapping Figure 4.8 Mapping from Main Memory to Cache: Direct and Associative The effect of this mapping is that blocks of main memory are assigned to lines of the cache as follows: Cache line Main memory blocks assigned 0 0, m, 2m, Á , 2s - m 1 1, m + 1, 2m + 1, Á , 2s - m + 1 o o m - 1 m - 1, 2m - 1, 3m - 1, Á , 2s - 1 Thus, the use of a portion of the address as a line number provides a unique mapping of each block of main memory into the cache. When a block is actually read into its assigned line, it is necessary to tag the data to distinguish it from other blocks that can fit into that line. The most significant s - r bits serve this purpose. 4.3 / ELEMENTS OF CACHE DESIGN 127 s+w Cache Main memory Memory address Tag Data WO Tag Line Word W1 B0 L0 W2 s–r r w W3 s–r s w W4j Li W(4j+1) Compare w W(4j+2) Bj W(4j+3) (Hit in cache) 1 if match 0 if no match Lm–1 0 if match 1 if no match (Miss in cache) Figure 4.9 Direct-Mapping Cache Organization Example 4.2a Figure 4.10 shows our example system using direct mapping.5 In the ex- ample, m = 16K = 214 and i = j modulo 214. The mapping becomes Cache Line Starting Memory Address of Block 0 000000, 010000, Á , FF0000 1 000004, 010004, Á , FF0004 o o 2 14 - 1 00FFFC, 01FFFC, Á , FFFFFC Note that no two blocks that map into the same line number have the same tag num- ber. Thus, blocks with starting addresses 000000, 010000, Á , FF0000 have tag numbers 00, 01, Á , FF, respectively. Referring back to Figure 4.5, a read operation works as follows. The cache system is presented with a 24-bit address. The 14-bit line number is used as an index into the cache to access a particular line. If the 8-bit tag number matches the tag number currently stored in that line, then the 2-bit word number is used to select one of the 4 bytes in that line. Oth- erwise, the 22-bit tag-plus-line field is used to fetch a block from main memory. The actual address that is used for the fetch is the 22-bit tag-plus-line concatenated with two 0 bits, so that 4 bytes are fetched starting on a block boundary. 5 In this and subsequent figures, memory values are represented in hexadecimal notation. 
See Chapter 19 for a basic refresher on number systems (decimal, binary, hexadecimal). 128 CHAPTER 4 / CACHE MEMORY Main memory address (binary) Tag Tag Line + Word (hex) Data 00 000000000000000000000000 13579246 00 000000000000000000000100 00 000000001111111111111000 00 000000001111111111111100 Line Tag Data number 16 000101100000000000000000 77777777 00 13579246 0000 16 000101100000000000000004 11235813 16 11235813 0001 16 000101100011001110011100 FEDCBA98 16 FEDCBA98 0CE7 FF 11223344 3FFE 16 000101101111111111111100 12345678 16 12345678 3FFF 8 bits 32 bits FF 111111110000000000000000 16K line cache FF 111111110000000000000100 FF 111111111111111111111000 11223344 FF 111111111111111111111100 24682468 Note: Memory address values are 32 bits in binary representation; other values are in hexadecimal 16-MByte main memory Tag Line Word Main memory address = 8 bits 14 bits 2 bits Figure 4.10 Direct Mapping Example The direct mapping technique is simple and inexpensive to implement. Its main disadvantage is that there is a fixed cache location for any given block. Thus, if a program happens to reference words repeatedly from two different blocks that map into the same line, then the blocks will be continually swapped in the cache, and the hit ratio will be low (a phenomenon known as thrashing). Selective Victim Cache Simulator 4.3 / ELEMENTS OF CACHE DESIGN 129 s+w Cache Main memory Memory address Tag Data W0 Tag Word W1 W2 B0 L0 W3 s w w Lj s W4j w W(4j+1) Compare Bj W(4j+2) W(4j+3) (Hit in cache) 1 if match 0 if no match s Lm–1 0 if match 1 if no match (Miss in cache) Figure 4.11 Fully Associative Cache Organization One approach to lower the miss penalty is to remember what was discarded in case it is needed again. Since the discarded data has already been fetched, it can be used again at a small cost. Such recycling is possible using a victim cache. Victim cache was originally proposed as an approach to reduce the conflict misses of direct mapped caches without affecting its fast access time. Victim cache is a fully associative cache, whose size is typically 4 to 16 cache lines, residing between a direct mapped L1 cache and the next level of memory. This concept is explored in Appendix D. ASSOCIATIVE MAPPING Associative mapping overcomes the disadvantage of di- rect mapping by permitting each main memory block to be loaded into any line of the cache (Figure 4.8b). In this case, the cache control logic interprets a memory ad- dress simply as a Tag and a Word field. The Tag field uniquely identifies a block of main memory. To determine whether a block is in the cache, the cache control logic must simultaneously examine every line’s tag for a match. Figure 4.11 illustrates the logic. Note that no field in the address corresponds to the line number, so that the number of lines in the cache is not determined by the address format. To summarize, Address length = (s + w) bits Number of addressable units = 2sw words or bytes Block size = line size = 2w words or bytes 2s + w Number of blocks in main memory = w = 2s 2 Number of lines in cache = undetermined Size of tag = s bits 130 CHAPTER 4 / CACHE MEMORY Example 4.2b Figure 4.12 shows our example using associative mapping. A main mem- ory address consists of a 22-bit tag and a 2-bit byte number. The 22-bit tag must be stored with the 32-bit block of data for each line in the cache. Note that it is the leftmost (most significant) 22 bits of the address that form the tag. Thus, the 24-bit hexadecimal address 16339C has the 22-bit tag 058CE7. 
This is easily seen in binary notation: memory address 0001 0110 0011 0011 1001 1100 (binary) 1 6 3 3 9 C (hex) tag (leftmost 22 bits) 00 0101 1000 1100 1110 0111 (binary) 0 5 8 C E 7 (hex) Main memory address (binary) Tag Word Tag (hex) Data 000000 000000000000000000000000 13579246 000001 000000000000000000000100 Line Tag Data Number 3FFFFE 11223344 0000 058CE7 FEDCBA98 0001 058CE6 000101100011001110011000 058CE7 000101100011001110011100 FEDCBA98 058CE8 000101100011001110100000 3FFFFD 33333333 3FFD 000000 13579246 3FFE 3FFFFF 24682468 3FFF 22 bits 32 bits 16K line cache 3FFFFD 111111111111111111110100 33333333 3FFFFE 111111111111111111111000 11223344 3FFFFF 111111111111111111111100 24682468 Note: Memory address values are in binary representation; 32 bits other values are in hexadecimal 16-MByte main memory Tag Word Main memory address = 22 bits 2 bits Figure 4.12 Associative Mapping Example 4.3 / ELEMENTS OF CACHE DESIGN 131 With associative mapping, there is flexibility as to which block to replace when a new block is read into the cache. Replacement algorithms, discussed later in this section, are designed to maximize the hit ratio. The principal disadvantage of asso- ciative mapping is the complex circuitry required to examine the tags of all cache lines in parallel. Cache Time Analysis Simulator SET-ASSOCIATIVE MAPPING Set-associative mapping is a compromise that ex- hibits the strengths of both the direct and associative approaches while reducing their disadvantages. In this case, the cache consists of a number sets, each of which consists of a number of lines. The relationships are m = n * k i = j modulo n where i = cache set number j = main memory block number m = number of lines in the cache n = number of sets k = number of lines in each set This is referred to as k-way set-associative mapping. With set-associative mapping, block Bj can be mapped into any of the lines of set j. Figure 4.13a illus- trates this mapping for the first n blocks of main memory. As with associative map- ping, each word maps into multiple cache lines. For set-associative mapping, each word maps into all the cache lines in a specific set, so that main memory block B0 maps into set 0, and so on. Thus, the set-associative cache can be physically imple- mented as n associative caches. It is also possible to implement the set-associative cache as k direct mapping caches, as shown in Figure 4.13b. Each direct-mapped cache is referred to as a way, consisting of n lines. The first n lines of main memory are direct mapped into the n lines of each way; the next group of n lines of main memory are similarly mapped, and so on. The direct-mapped implementation is typically used for small degrees of associativity (small values of k) while the asso- ciative-mapped implementation is typically used for higher degrees of associativ- ity [JACO08]. For set-associative mapping, the cache control logic interprets a memory address as three fields: Tag, Set, and Word. The d set bits specify one of n 2d sets. The s bits of the Tag and Set fields specify one of the 2s blocks of main mem- ory. Figure 4.14 illustrates the cache control logic. With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache. 
With k-way set-associative mapping, the tag in a memory 132 CHAPTER 4 / CACHE MEMORY B0 L0 k lines L k–1 Cache memory– set 0 Bv–1 First v blocks of main memory (equal to number of sets) Cache memory–set v–1 (a) v Associative–mapped caches B0 L0 One set v lines Bv–1 L v–1 First v blocks of cache memory—way 1 Cache memory—way k main memory (equal to number of sets) (b) k Direct–mapped caches Figure 4.13 Mapping from Main Memory to Cache: k-way Set Associative address is much smaller and is only compared to the k tags within a single set. To summarize, Address length = (s + w) bits Number of addressable units = 2sw words or bytes Block size = line size = 2w words or bytes 2s + w Number of blocks in main memory = w = 2s 2 Number of lines in set = k Number of sets = n = 2d 4.3 / ELEMENTS OF CACHE DESIGN 133 s+w Cache Main memory Memory address Tag Data B0 Tag Set Word F0 B1 s–d d w F1 Set 0 s–d Fk1 Fk s+w Bj Compare Fki Set 1 (Hit in cache) F2k1 1 if match 0 if no match 0 if match 1 if no match (Miss in cache) Figure 4.14 K-Way Set Associative Cache Organization Number of lines in cache = m = kn = k * 2d Size of cache = k * 2dw words or bytes Size of tag = (s - d) bits Example 4.2c Figure 4.15 shows our example using set-associative mapping with two lines in each set, referred to as two-way set-associative. The 13-bit set number identifies a unique set of two lines within the cache. It also gives the number of the block in main memory, modulo 213. This determines the mapping of blocks into lines. Thus, blocks 000000, 008000, Á , FF8000 of main memory map into cache set 0. Any of those blocks can be loaded into either of the two lines in the set. Note that no two blocks that map into the same cache set have the same tag number. For a read operation, the 13-bit set number is used to determine which set of two lines is to be examined. Both lines in the set are ex- amined for a match with the tag number of the address to be accessed. In the extreme case of n = m, k = 1, the set-associative technique reduces to direct mapping, and for n = 1, k = m, it reduces to associative mapping. The use of two lines per set (n = m/2, k = 2) is the most common set-associative organization. 134 Main memory address (binary) Tag (hex) Tag Set + Word Main memory address = Data Tag Set Word 000 000000000000000000000000 13579246 000 000000000000000000000100 9 bits 13 bits 2 bits 000 000000001111111111111000 000 000000001111111111111100 Set Tag Data number Tag Data 02C 000101100000000000000000 77777777 000 13579246 0000 02C 77777777 02C 000101100000000000000100 11235813 02C 11235813 0001 02C 000101100011001110011100 FEDCBA98 02C FEDCBA98 0CE7 1FF 11223344 1FFE 02C 000101100111111111111100 12345678 02C 12345678 1FFF 1FF 24682468 9 bits 32 bits 9 bits 32 bits 1FF 111111111000000000000000 16K line cache 1FF 111111111000000000000100 1FF 111111111111111111111000 11223344 1FF 111111111111111111111100 24682468 32 bits 16–MByte main memory Note: Memory address values are in binary representation; other values are in hexadecimal Figure 4.15 Two-Way Set Associative Mapping Example 4.3 / ELEMENTS OF CACHE DESIGN 135 1.0 0.9 0.8 0.7 0.6 Hit ratio 0.5 0.4 0.3 0.2 0.1 0.0 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1M Cache size (bytes) Direct 2-way 4-way 8-way 16-way Figure 4.16 Varying Associativity over Cache Size It significantly improves the hit ratio over direct mapping. Four-way set associative (n = m/4, k = 4) makes a modest additional improvement for a relatively small ad- ditional cost [MAYB84, HILL89]. 
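The two-way organization of Example 4.2c can be sketched in the same style as the direct-mapped fragment above; here the 24-bit address splits into a 9-bit tag, a 13-bit set number, and a 2-bit word field. The names below are ours, and the lookup over a set is shown only schematically.

    # Two-way set-associative mapping with the Example 4.2c parameters:
    # 9-bit tag, 13-bit set field, 2-bit word field.
    WORD_BITS = 2
    SET_BITS = 13

    def split_address(address):
        word = address & ((1 << WORD_BITS) - 1)
        set_index = (address >> WORD_BITS) & ((1 << SET_BITS) - 1)
        tag = address >> (WORD_BITS + SET_BITS)
        return tag, set_index, word

    # A set is a small list of (tag, data) lines; hit if any tag in the set matches.
    def lookup(cache_sets, address):
        tag, set_index, _ = split_address(address)
        for stored_tag, data in cache_sets[set_index]:
            if stored_tag == tag:
                return data        # cache hit
        return None                # cache miss

    tag, set_index, word = split_address(0x16339C)
    print(hex(tag), hex(set_index))    # 0x2c 0xce7 (compare Figure 4.15)
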
Further increases in the number of lines per set have little effect. Figure 4.16 shows the results of one simulation study of set-associative cache performance as a function of cache size [GENU04]. The difference in performance between direct and two-way set associative is significant up to at least a cache size of 64 kB. Note also that the difference between two-way and four-way at 4 kB is much less than the difference in going from for 4 kB to 8 kB in cache size. The complexity of the cache increases in proportion to the associativity, and in this case would not be justifiable against increasing cache size to 8 or even 16 Kbytes. A final point to note is that beyond about 32 kB, increase in cache size brings no significant increase in performance. The results of Figure 4.16 are based on simulating the execution of a GCC compiler. Different applications may yield different results. For example, [CANT01] reports on the results for cache performance using many of the CPU2000 SPEC benchmarks. The results of [CANT01] in comparing hit ratio to cache size follow the same pattern as Figure 4.16, but the specific values are somewhat different. Cache Simulator Multitask Cache Simulator 136 CHAPTER 4 / CACHE MEMORY Replacement Algorithms Once the cache has been filled, when a new block is brought into the cache, one of the existing blocks must be replaced. For direct mapping, there is only one possible line for any particular block, and no choice is possible. For the associative and set- associative techniques, a replacement algorithm is needed. To achieve high speed, such an algorithm must be implemented in hardware. A number of algorithms have been tried. We mention four of the most common. Probably the most effective is least recently used (LRU): Replace that block in the set that has been in the cache longest with no reference to it. For two-way set associative, this is easily imple- mented. Each line includes a USE bit. When a line is referenced, its USE bit is set to 1 and the USE bit of the other line in that set is set to 0. When a block is to be read into the set, the line whose USE bit is 0 is used. Because we are assuming that more recently used memory locations are more likely to be referenced, LRU should give the best hit ratio. LRU is also relatively easy to implement for a fully associative cache. The cache mechanism maintains a separate list of indexes to all the lines in the cache. When a line is referenced, it moves to the front of the list. For replace- ment, the line at the back of the list is used. Because of its simplicity of implementa- tion, LRU is the most popular replacement algorithm. Another possibility is first-in-first-out (FIFO): Replace that block in the set that has been in the cache longest. FIFO is easily implemented as a round-robin or circular buffer technique. Still another possibility is least frequently used (LFU): Replace that block in the set that has experienced the fewest references. LFU could be imple- mented by associating a counter with each line. A technique not based on usage (i.e., not LRU, LFU, FIFO, or some variant) is to pick a line at random from among the candidate lines. Simulation studies have shown that random replacement provides only slightly inferior performance to an algorithm based on usage [SMIT82]. Write Policy When a block that is resident in the cache is to be replaced, there are two cases to consider. 
If the old block in the cache has not been altered, then it may be overwrit- ten with a new block without first writing out the old block. If at least one write op- eration has been performed on a word in that line of the cache, then main memory must be updated by writing the line of cache out to the block of memory before bringing in the new block. A variety of write policies, with performance and eco- nomic trade-offs, is possible. There are two problems to contend with. First, more than one device may have access to main memory. For example, an I/O module may be able to read-write directly to memory. If a word has been altered only in the cache, then the corresponding memory word is invalid. Further, if the I/O device has altered main memory, then the cache word is invalid. A more complex problem oc- curs when multiple processors are attached to the same bus and each processor has its own local cache. Then, if a word is altered in one cache, it could conceivably in- validate a word in other caches. The simplest technique is called write through. Using this technique, all write operations are made to main memory as well as to the cache, ensuring that main memory is always valid. Any other processor–cache module can monitor traffic to main memory to maintain consistency within its own cache. The main disadvantage 4.3 / ELEMENTS OF CACHE DESIGN 137 of this technique is that it generates substantial memory traffic and may create a bottleneck. An alternative technique, known as write back, minimizes memory writes. With write back, updates are made only in the cache. When an update occurs, a dirty bit, or use bit, associated with the line is set. Then, when a block is replaced, it is written back to main memory if and only if the dirty bit is set. The problem with write back is that portions of main memory are invalid, and hence accesses by I/O modules can be allowed only through the cache. This makes for complex circuitry and a potential bottleneck. Experience has shown that the percentage of memory references that are writes is on the order of 15% [SMIT82]. However, for HPC ap- plications, this number may approach 33% (vector-vector multiplication) and can go as high as 50% (matrix transposition). Example 4.3 Consider a cache with a line size of 32 bytes and a main memory that re- quires 30 ns to transfer a 4-byte word. For any line that is written at least once before being swapped out of the cache, what is the average number of times that the line must be written before being swapped out for a write-back cache to be more efficient that a write- through cache? For the write-back case, each dirty line is written back once, at swap-out time, taking 8 * 30 = 240 ns. For the write-through case, each update of the line requires that one word be written out to main memory, taking 30 ns. Therefore, if the average line that gets written at least once gets written more than 8 times before swap out, then write back is more efficient. In a bus organization in which more than one device (typically a processor) has a cache and main memory is shared, a new problem is introduced. If data in one cache are altered, this invalidates not only the corresponding word in main memory, but also that same word in other caches (if any other cache happens to have that same word). Even if a write-through policy is used, the other caches may contain in- valid data. A system that prevents this problem is said to maintain cache coherency. 
Possible approaches to cache coherency include the following: Bus watching with write through: Each cache controller monitors the address lines to detect write operations to memory by other bus masters. If another master writes to a location in shared memory that also resides in the cache memory, the cache controller invalidates that cache entry. This strategy de- pends on the use of a write-through policy by all cache controllers. Hardware transparency: Additional hardware is used to ensure that all up- dates to main memory via cache are reflected in all caches. Thus, if one proces- sor modifies a word in its cache, this update is written to main memory. In addition, any matching words in other caches are similarly updated. Noncacheable memory: Only a portion of main memory is shared by more than one processor, and this is designated as noncacheable. In such a system, all accesses to shared memory are cache misses, because the shared memory is never copied into the cache. The noncacheable memory can be identified using chip-select logic or high-address bits. 138 CHAPTER 4 / CACHE MEMORY Cache coherency is an active field of research. This topic is explored further in Part Five. Line Size Another design element is the line size. When a block of data is retrieved and placed in the cache, not only the desired word but also some number of adjacent words are retrieved. As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality, which states that data in the vicinity of a referenced word are likely to be referenced in the near future. As the block size increases, more useful data are brought into the cache. The hit ratio will begin to decrease, however, as the block becomes even bigger and the probability of using the newly fetched information becomes less than the probability of reusing the information that has to be replaced. Two specific effects come into play: Larger blocks reduce the number of blocks that fit into a cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after they are fetched. As a block becomes larger, each additional word is farther from the requested word and therefore less likely to be needed in the near future. The relationship between block size and hit ratio is complex, depending on the locality characteristics of a particular program, and no definitive optimum value has been found. A size of from 8 to 64 bytes seems reasonably close to optimum [SMIT87, PRZY88, PRZY90, HAND98]. For HPC systems, 64- and 128-byte cache line sizes are most frequently used. Number of Caches When caches were originally introduced, the typical system had a single cache. More recently, the use of multiple caches has become the norm. Two aspects of this design issue concern the number of levels of caches and the use of unified versus split caches. MULTILEVEL CACHES As logic density has increased, it has become possible to have a cache on the same chip as the processor: the on-chip cache. Compared with a cache reachable via an external bus, the on-chip cache reduces the processor’s ex- ternal bus activity and therefore speeds up execution times and increases overall system performance. When the requested instruction or data is found in the on-chip cache, the bus access is eliminated. 
Because of the short data paths internal to the processor, compared with bus lengths, on-chip cache accesses will complete appre- ciably faster than would even zero-wait state bus cycles. Furthermore, during this period the bus is free to support other transfers. The inclusion of an on-chip cache leaves open the question of whether an off- chip, or external, cache is still desirable. Typically, the answer is yes, and most contem- porary designs include both on-chip and external caches. The simplest such organization is known as a two-level cache, with the internal cache designated as level 1 (L1) and the external cache designated as level 2 (L2). The reason for including an L2 cache is the following: If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access 4.3 / ELEMENTS OF CACHE DESIGN 139 DRAM or ROM memory across the bus. Due to the typically slow bus speed and slow memory access time, this results in poor performance. On the other hand, if an L2 SRAM (static RAM) cache is used, then frequently the missing information can be quickly retrieved. If the SRAM is fast enough to match the bus speed, then the data can be accessed using a zero-wait state transaction, the fastest type of bus transfer. Two features of contemporary cache design for multilevel caches are noteworthy. First, for an off-chip L2 cache, many designs do not use the system bus as the path for transfer between the L2 cache and the processor, but use a separate data path, so as to reduce the burden on the system bus. Second, with the continued shrinkage of processor components, a number of processors now incorporate the L2 cache on the processor chip, improving performance. The potential savings due to the use of an L2 cache depends on the hit rates in both the L1 and L2 caches. Several studies have shown that, in general, the use of a second-level cache does improve performance (e.g., see [AZIM92], [NOVI93], [HAND98]). However, the use of multilevel caches does complicate all of the design issues related to caches, including size, replacement algorithm, and write policy; see [HAND98] and [PEIR99] for discussions. Figure 4.17 shows the results of one simulation study of two-level cache per- formance as a function of cache size [GENU04]. The figure assumes that both caches have the same line size and shows the total hit ratio. That is, a hit is counted if the desired data appears in either the L1 or the L2 cache. The figure shows the im- pact of L2 on total hits with respect to L1 size. L2 has little effect on the total num- ber of cache hits until it is at least double the L1 cache size. Note that the steepest part of the slope for an L1 cache of 8 Kbytes is for an L2 cache of 16 Kbytes. Again for an L1 cache of 16 Kbytes, the steepest part of the curve is for an L2 cache size of 32 Kbytes. Prior to that point, the L2 cache has little, if any, impact on total cache 0.98 0.96 0.94 0.92 0.90 L1 16k Hit ratio 0.88 L1 8k 0.86 0.84 0.82 0.80 0.78 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1M 2M L2 cache size (bytes) Figure 4.17 Total Hit Ratio (L1 and L2) for 8-Kbyte and 16-Kbyte L1 140 CHAPTER 4 / CACHE MEMORY performance. The need for the L2 cache to be larger than the L1 cache to affect per- formance makes sense. If the L2 cache has the same line size and capacity as the L1 cache, its contents will more or less mirror those of the L1 cache. 
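The contribution of a second cache level can be estimated with the same kind of model used in Example 4.1, extended by one level: an L1 miss proceeds to L2, and an L2 miss proceeds to main memory. The sketch below uses assumed, purely illustrative access times and hit ratios; they are not taken from Figure 4.17.

    # Average access time with L1 and L2 caches in front of main memory.
    # h2 is the local L2 hit ratio, i.e. measured over L1 misses only.
    # All timings and ratios below are illustrative assumptions.
    def avg_access_time(h1, h2, t_l1, t_l2, t_mem):
        return t_l1 + (1.0 - h1) * (t_l2 + (1.0 - h2) * t_mem)

    print(avg_access_time(h1=0.90, h2=0.80, t_l1=1, t_l2=10, t_mem=100))
    # 1 + 0.10 * (10 + 0.20 * 100) = 4.0 time units

Under these assumed numbers, the L2 cache cuts the average access time from 1 + 0.10 x 100 = 11 time units (no L2) to 4, which is why a second level pays off even with a modest local hit ratio.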
With the increasing availability of on-chip area for cache, most contemporary microprocessors have moved the L2 cache onto the processor chip and added an L3 cache. Originally, the L3 cache was accessible over the external bus. More recently, most microprocessors have incorporated an on-chip L3 cache. In either case, there appears to be a performance advantage to adding the third level (e.g., see [GHAI98]).

UNIFIED VERSUS SPLIT CACHES When the on-chip cache first made an appearance, many of the designs consisted of a single cache used to store references to both data and instructions. More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data. These two caches both exist at the same level, typically as two L1 caches. When the processor attempts to fetch an instruction from main memory, it first consults the instruction L1 cache, and when the processor attempts to fetch data from main memory, it first consults the data L1 cache.

There are two potential advantages of a unified cache:

• For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically. That is, if an execution pattern involves many more instruction fetches than data fetches, then the cache will tend to fill up with instructions, and if an execution pattern involves relatively more data fetches, the opposite will occur.

• Only one cache needs to be designed and implemented.

Despite these advantages, the trend is toward split caches, particularly for superscalar machines such as the Pentium and PowerPC, which emphasize parallel instruction execution and the prefetching of predicted future instructions. The key advantage of the split cache design is that it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit. This is important in any design that relies on the pipelining of instructions. Typically, the processor will fetch instructions ahead of time and fill a buffer, or pipeline, with instructions to be executed. Suppose now that we have a unified instruction/data cache. When the execution unit performs a memory access to load and store data, the request is submitted to the unified cache. If, at the same time, the instruction prefetcher issues a read request to the cache for an instruction, that request will be temporarily blocked so that the cache can service the execution unit first, enabling it to complete the currently executing instruction. This cache contention can degrade performance by interfering with efficient use of the instruction pipeline. The split cache structure overcomes this difficulty.

4.4 PENTIUM 4 CACHE ORGANIZATION

The evolution of cache organization is seen clearly in the evolution of Intel microprocessors (Table 4.4). The 80386 does not include an on-chip cache. The 80486 includes a single on-chip cache of 8 KBytes, using a line size of 16 bytes and a four-way set-associative organization.

Table 4.4 Intel Cache Evolution

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology.
Processor on which feature first appears: 386

Problem: Increased processor speed results in external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at the same speed as the processor.
Processor on which feature first appears: 486
Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main memory.
Processor on which feature first appears: 486

Problem: Contention occurs when both the instruction prefetcher and the execution unit simultaneously require access to the cache. In that case, the prefetcher is stalled while the execution unit's data access takes place.
Solution: Create separate data and instruction caches.
Processor on which feature first appears: Pentium

Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Solution: Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The back-side bus (BSB) is dedicated to the L2 cache.
Processor on which feature first appears: Pentium Pro

Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Solution: Move L2 cache onto the processor chip.
Processor on which feature first appears: Pentium II

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.
Solution: Add external L3 cache.
Processor on which feature first appears: Pentium III

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.
Solution: Move L3 cache on-chip.
Processor on which feature first appears: Pentium 4

All of the Pentium processors include two on-chip L1 caches, one for data and one for instructions. For the Pentium 4, the L1 data cache is 16 KBytes, using a line size of 64 bytes and a four-way set-associative organization. The Pentium 4 instruction cache is described subsequently. The Pentium 4 also includes an L2 cache that feeds both of the L1 caches. The L2 cache is eight-way set associative with a size of 512 KB and a line size of 128 bytes. An L3 cache was added for the Pentium III and became on-chip with high-end versions of the Pentium 4.

Figure 4.18 provides a simplified view of the Pentium 4 organization, highlighting the placement of the three caches. The processor core consists of four major components:

• Fetch/decode unit: Fetches program instructions in order from the L2 cache, decodes these into a series of micro-operations, and stores the results in the L1 instruction cache.

• Out-of-order execution logic: Schedules execution of the micro-operations subject to data dependencies and resource availability; thus, micro-operations may be scheduled for execution in a different order than they were fetched from the instruction stream. As time permits, this unit schedules speculative execution of micro-operations that may be required in the future.

• Execution units: These units execute micro-operations, fetching the required data from the L1 data cache and temporarily storing results in registers.

• Memory subsystem: This unit includes the L2 and L3 caches and the system bus, which is used to access main memory when the L1 and L2 caches have a cache miss and to access the system I/O resources.

[Figure 4.18 Pentium 4 Block Diagram: system bus; L3 cache (1 MB); L2 cache (512 KB); instruction fetch/decode unit; L1 instruction cache (12K µops); out-of-order execution logic; integer and FP register files; load and store address units, integer ALUs, and FP/MMX execution units; L1 data cache (16 KB); 64-bit and 256-bit data paths]

Table 4.5 Pentium 4 Cache Operating Modes

CD   NW   Cache Fills   Write Throughs   Invalidates
0    0    Enabled       Enabled          Enabled
1    0    Disabled      Enabled          Enabled
1    1    Disabled      Disabled         Disabled

Note: CD = 0; NW = 1 is an invalid combination.
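Table 4.5 is essentially a small decision table, and it can be encoded directly. The following sketch merely reproduces the table's rows in C for illustration; it does not read or write the actual processor control register, which is discussed in the text below.

/* Sketch: decode of the CD (cache disable) and NW (not write-through)
 * control bits, following Table 4.5. Purely illustrative; it does not
 * touch the real control register. */
#include <stdio.h>

struct cache_mode {
    int fills_enabled;          /* line fills on a miss              */
    int write_throughs_enabled; /* writes propagated to main memory  */
    int invalidates_enabled;    /* invalidation requests honored     */
};

/* Returns 0 on success, -1 for the invalid combination CD = 0, NW = 1. */
static int decode_mode(int cd, int nw, struct cache_mode *m)
{
    if (cd == 0 && nw == 1)
        return -1;                      /* invalid per Table 4.5 */
    m->fills_enabled          = (cd == 0);
    m->write_throughs_enabled = (nw == 0);
    m->invalidates_enabled    = !(cd == 1 && nw == 1);
    return 0;
}

int main(void)
{
    struct cache_mode m;
    if (decode_mode(1, 0, &m) == 0)
        printf("CD=1 NW=0: fills %s, write-throughs %s, invalidates %s\n",
               m.fills_enabled ? "enabled" : "disabled",
               m.write_throughs_enabled ? "enabled" : "disabled",
               m.invalidates_enabled ? "enabled" : "disabled");
    return 0;
}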
Unlike the organization used in all previous Pentium models, and in most other processors, the Pentium 4 instruction cache sits between the instruction decode logic and the execution core. The reasoning behind this design decision is as follows: As discussed more fully in Chapter 14, the Pentium processor decodes, or translates, Pentium machine instructions into simple RISC-like instructions called micro-operations. The use of simple, fixed-length micro-operations enables the use of superscalar pipelining and scheduling techniques that enhance performance. However, the Pentium machine instructions are cumbersome to decode; they have a variable number of bytes and many different options. It turns out that performance is enhanced if this decoding is done independently of the scheduling and pipelining logic. We return to this topic in Chapter 14.

The data cache employs a write-back policy: Data are written to main memory only when they are removed from the cache and there has been an update. The Pentium 4 processor can be dynamically configured to support write-through caching.

The L1 data cache is controlled by two bits in one of the control registers, labeled the CD (cache disable) and NW (not write-through) bits (Table 4.5). There are also two Pentium 4 instructions that can be used to control the data cache:

• INVD invalidates (flushes) the internal cache memory and signals the external cache (if any) to invalidate.

• WBINVD writes back and invalidates internal cache and then writes back and invalidates external cache.

Both the L2 and L3 caches are eight-way set-associative with a line size of 128 bytes.

4.5 ARM CACHE ORGANIZATION

The ARM cache organization has evolved with the overall architecture of the ARM family, reflecting the relentless pursuit of performance that is the driving force for all microprocessor designers.

Table 4.6 ARM Cache Features

Core              Cache Type   Cache Size (kB)   Cache Line Size (words)   Associativity   Location   Write Buffer Size (words)
ARM720T           Unified      8                 4                         4-way           Logical    8
ARM920T           Split        16/16 D/I         8                         64-way          Logical    16
ARM926EJ-S        Split        4-128/4-128 D/I   8                         4-way           Logical    16
ARM1022E          Split        16/16 D/I         8                         64-way          Logical    16
ARM1026EJ-S       Split        4-128/4-128 D/I   8                         4-way           Logical    8
Intel StrongARM   Split        16/16 D/I         4                         32-way          Logical    32
Intel Xscale      Split        32/32 D/I         8                         32-way          Logical    32
ARM1136-JF-S      Split        4-64/4-64 D/I     8                         4-way           Physical   32

Table 4.6 shows this evolution. The ARM7 models used a unified L1 cache, while all subsequent models use a split instruction/data cache. All of the ARM designs use a set-associative cache, with the degree of associativity and the line size varying. ARM cached cores with an MMU use a logical cache for processor families ARM7 through ARM10, including the Intel StrongARM and Intel Xscale processors. The ARM11 family uses a physical cache. The distinction between logical and physical cache is discussed earlier in this chapter (Figure 4.7).

An interesting feature of the ARM architecture is the use of a small first-in-first-out (FIFO) write buffer to enhance memory write performance. The write buffer is interposed between the cache and main memory and consists of a set of addresses and a set of data words. The write buffer is small compared to the cache, and may hold up to four independent addresses. Typically, the write buffer is enabled for all of main memory, although it may be selectively disabled at the page level. Figure 4.19, taken from [SLOS04], shows the relationship among the write buffer, cache, and main memory.

[Figure 4.19: processor; word, byte access; cache (fast); block transfer; main memory (slow)]
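To make the operation of such a write buffer concrete, the following sketch models a simple FIFO buffer in C. The four-entry depth reflects the "up to four independent addresses" noted above; the posting and draining routines and the stall-on-full behavior are illustrative assumptions, not a description of any particular ARM implementation.

/* Sketch: a simple FIFO write buffer model. The processor posts writes
 * into the buffer and continues; a (notional) memory controller later
 * drains entries to main memory in FIFO order. */
#include <stdint.h>
#include <stdio.h>

#define WB_DEPTH 4   /* up to four independent addresses, per the text */

struct wb_entry { uint32_t addr; uint32_t data; };

struct write_buffer {
    struct wb_entry entry[WB_DEPTH];
    int head;    /* index of the oldest entry    */
    int count;   /* number of occupied entries   */
};

/* Processor side: returns 1 if the write was buffered without stalling,
 * 0 if the buffer is full and the processor would have to wait. */
static int wb_post(struct write_buffer *wb, uint32_t addr, uint32_t data)
{
    if (wb->count == WB_DEPTH)
        return 0;                                   /* would stall */
    int tail = (wb->head + wb->count) % WB_DEPTH;
    wb->entry[tail].addr = addr;
    wb->entry[tail].data = data;
    wb->count++;
    return 1;
}

/* Memory side: drain the oldest entry to (simulated) main memory. */
static void wb_drain(struct write_buffer *wb)
{
    if (wb->count == 0)
        return;
    struct wb_entry *e = &wb->entry[wb->head];
    printf("memory write: [0x%08x] <- 0x%08x\n",
           (unsigned)e->addr, (unsigned)e->data);
    wb->head = (wb->head + 1) % WB_DEPTH;
    wb->count--;
}

int main(void)
{
    struct write_buffer wb = { .head = 0, .count = 0 };
    wb_post(&wb, 0x1000, 0xdeadbeef);   /* processor continues at once   */
    wb_post(&wb, 0x1004, 0xcafebabe);
    wb_drain(&wb);                      /* slower memory catches up later */
    wb_drain(&wb);
    return 0;
}

A real write buffer would also have to handle a subsequent read of an address still waiting in the buffer, either by stalling the read or by forwarding the buffered data; that complication is omitted here.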