Computer Systems Organization Lecture Notes PDF

Summary

These lecture notes provide an introduction to computer systems organization. The document details the fundamental components of a digital computer, including processors, memory, and input/output devices. It also covers topics such as instruction execution, different CPU design philosophies (RISC and CISC), and memory hierarchies.

Full Transcript


Computer Systems Organization

Introduction

This chapter provides an introduction to the three primary components of a digital computer: processors, memories, and input/output (I/O) devices. These components form an interconnected system that is fundamental to computer architecture. The chapter serves as a foundational overview before delving into more detailed discussions in the following chapters.

1. Processors

Figure 1: The organization of a simple computer with one CPU and two I/O devices.

A simple bus-oriented computer consists of a CPU, main memory, and I/O devices, all interconnected by a bus that carries address, data, and control signals. The CPU, acting as the "brain" of the computer, fetches, decodes, and executes instructions stored in memory. It contains distinct parts, including the control unit, which fetches instructions, and the arithmetic logic unit, which performs operations. The CPU also has high-speed internal memory in the form of registers, including the Program Counter (PC) and the Instruction Register (IR), which are essential for managing instruction execution.

1.1 CPU Organization

Figure 2: The data path of a typical von Neumann machine.

The internal organization of a typical von Neumann CPU includes a data path consisting of registers, the Arithmetic Logic Unit (ALU), and buses connecting them. The registers provide inputs to the ALU, which performs operations such as addition or subtraction, and the results are stored in output registers. There are two main types of instructions: register-memory instructions, which move data between memory and registers, and register-register instructions, which perform operations on values held in registers. The data path cycle, which runs two operands through the ALU and stores the result, is central to CPU performance.

1.2 Instruction Execution

Figure 3: An interpreter for a simple computer (written in Java).

Instruction execution in a CPU follows a fetch-decode-execute cycle: instructions are fetched from memory, decoded, and executed. This cycle can be carried out by hardware or by a software interpreter; the latter allows for simpler and cheaper processors. Interpreters were particularly useful for handling complex instruction sets, reducing hardware costs by using software to manage instruction execution.

This approach gained popularity in the 1970s, leading to the development of simple, low-cost computers with complex instruction sets. Interpreter-based designs also allowed for flexibility, such as adding new instructions or fixing bugs. However, overly complex instruction sets, like that of the VAX, could become inefficient, leading to failure in both performance and market adoption.
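Since the transcript does not reproduce the Java interpreter of Figure 3, here is a minimal sketch of the fetch-decode-execute loop it illustrates. The instruction set (LOAD, ADD, STORE, HALT), the accumulator, and the opcode/address encoding are illustrative assumptions for these notes, not the textbook's actual machine.

```java
// Minimal fetch-decode-execute interpreter sketch. The opcodes and the
// encoding (opcode in the high bits, operand address in the low 12 bits)
// are invented for illustration.
public class TinyInterpreter {
    static final int HALT = 0, LOAD = 1, ADD = 2, STORE = 3;

    static void run(int[] memory, int startAddress) {
        int pc = startAddress;            // Program Counter: next instruction
        int ac = 0;                       // accumulator register
        boolean running = true;
        while (running) {
            int ir = memory[pc++];        // fetch into the Instruction Register
            int opcode = ir >> 12;        // decode: opcode...
            int address = ir & 0xFFF;     // ...and operand address
            switch (opcode) {             // execute
                case LOAD:  ac = memory[address];  break;
                case ADD:   ac += memory[address]; break;
                case STORE: memory[address] = ac;  break;
                case HALT:  running = false;       break;
                default: throw new IllegalStateException("bad opcode: " + opcode);
            }
        }
    }

    public static void main(String[] args) {
        int[] mem = new int[4096];
        mem[100] = 7; mem[101] = 35;           // data
        mem[0] = (LOAD << 12) | 100;           // AC = mem[100]
        mem[1] = (ADD << 12) | 101;            // AC += mem[101]
        mem[2] = (STORE << 12) | 102;          // mem[102] = AC
        mem[3] = HALT << 12;                   // stop
        run(mem, 0);
        System.out.println(mem[102]);          // prints 42
    }
}
```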
1.3 RISC vs. CISC

In the late 1970s, experimentation with complex instructions led to two competing CPU design philosophies: RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer). RISC, pioneered by John Cocke at IBM and later by groups at Berkeley (RISC I) and Stanford (MIPS), focused on simple instructions that could execute quickly, emphasizing instruction throughput rather than instruction complexity. RISC designs featured small instruction sets (around 50 instructions), enabling fast execution by avoiding interpretation.

In contrast, CISC architectures, such as the DEC VAX and the IBM mainframes, had much larger instruction sets (200-300 instructions) and more complex operations. The CISC approach aimed to reduce the semantic gap between machine instructions and high-level programming languages, but at the cost of slower execution for the complex instructions.

A major debate ensued, with RISC advocates arguing that even if more instructions were needed to accomplish a task, the faster execution of simple instructions would yield better overall performance. Despite the performance potential of RISC, however, CISC machines, particularly Intel's processors, maintained dominance thanks to backward compatibility and large software investments. Intel adopted a hybrid approach, incorporating RISC-like features within its CISC architecture: a RISC core executes the most common instructions quickly, while the more complex ones are interpreted, balancing speed with software compatibility.

1.4 Design Principles of Modern Computers

The design principles of modern computers, often called the RISC design principles, focus on efficiency and performance:

1. Direct hardware execution: Common instructions are executed directly by hardware rather than interpreted by microinstructions, enhancing speed. Rarely used complex instructions may be broken into smaller parts.
2. Maximizing instruction issue: The goal is to issue as many instructions per second as possible. This can involve executing multiple instructions simultaneously (parallelism), even if instructions are not always issued in program order.
3. Easy instruction decoding: Instructions should have a regular, fixed length with few fields, simplifying decoding and increasing the rate at which instructions can be issued.
4. Memory access via loads and stores: Only LOAD and STORE instructions should access memory; all other instructions operate on CPU registers, minimizing unpredictable memory delays.
5. Abundant registers: Providing many registers (at least 32) allows fetched data to be used efficiently without constant reloading from memory, improving overall performance.

1.5 Instruction-Level Parallelism

Instruction-level parallelism (ILP) improves performance by executing multiple instructions in parallel within a single CPU. Two key techniques are pipelining and superscalar architecture.

1. Pipelining: Instructions are divided into stages, each handled by a separate hardware unit, allowing multiple instructions to be processed simultaneously at different stages.

Figure 4: (a) A five-stage pipeline. (b) The state of each stage as a function of time. Nine clock cycles are illustrated.

In Fig. 4(b) we see how the pipeline operates as a function of time. During clock cycle 1, stage S1 is working on instruction 1, fetching it from memory. During cycle 2, stage S2 decodes instruction 1 while stage S1 fetches instruction 2. During cycle 3, stage S3 fetches the operands for instruction 1, stage S2 decodes instruction 2, and stage S1 fetches instruction 3. During cycle 4, stage S4 executes instruction 1, S3 fetches the operands for instruction 2, S2 decodes instruction 3, and S1 fetches instruction 4. Finally, during cycle 5, S5 writes the result of instruction 1 back while the other stages work on the following instructions. The simulation sketch below reproduces this cycle-by-cycle pattern.
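To make the description of Fig. 4(b) concrete, the following small program (an illustration written for these notes, not taken from the textbook) prints which instruction occupies each of the five stages on every clock cycle.

```java
// Prints the occupancy of a five-stage pipeline (S1..S5) per clock cycle,
// mirroring the description of Fig. 4(b). Illustrative sketch only.
public class PipelineDemo {
    public static void main(String[] args) {
        String[] stageNames = {"S1 fetch", "S2 decode", "S3 operands",
                               "S4 execute", "S5 write-back"};
        int cycles = 9;                      // nine clock cycles, as in the figure
        for (int cycle = 1; cycle <= cycles; cycle++) {
            System.out.print("cycle " + cycle + ":");
            for (int s = 0; s < stageNames.length; s++) {
                int instr = cycle - s;       // instruction number in stage s, if any
                // Stage s is idle until instruction 1 reaches it at cycle s+1.
                System.out.print("  " + stageNames[s] + "="
                                 + (instr >= 1 ? "I" + instr : "-"));
            }
            System.out.println();
        }
    }
}
```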
2. Superscalar architecture: Superscalar processors use multiple pipelines or functional units to issue and execute several instructions per cycle. With dual pipelines (as in the Intel Pentium), two instructions can be fetched and executed in parallel if they are compatible and do not conflict over resources. Superscalar CPUs with more functional units can issue several instructions per clock cycle (e.g., 4 to 6), allowing even higher instruction throughput.

Figure 5: A superscalar processor with five functional units.

Implicit in the idea of a superscalar processor is that the S3 stage can issue instructions considerably faster than the S4 stage is able to execute them.

1.6 Processor-Level Parallelism

To achieve significantly higher performance, increasing instruction-level parallelism (as in pipelining and superscalar processing) provides only limited gains, typically up to a factor of five or ten. To go beyond this, computer designers have turned to systems with multiple CPUs or processing elements, leading to various forms of parallel computing architectures.

Array computers

Array computers are particularly effective for problems involving highly structured, regular computations, such as performing the same calculation on many data sets simultaneously. These systems are designed for parallel execution, and there are two approaches to achieving this speed-up:

1. Array processors: An array processor contains numerous identical processors that execute the same instructions on different data sets in parallel, each working on data from its own memory. These machines exemplify the SIMD (Single Instruction Multiple Data) architecture, in which a single control unit broadcasts instructions to all the processors, which execute them in parallel.

Figure 7: An array processor of the ILLIAC IV type.

2. Vector processors: Vector processors operate similarly but use a single, pipelined adder to perform operations on pairs of data taken from vector registers. These registers hold multiple data elements loaded from memory, and the processor performs operations such as addition on all the pairs in a vector at once.

Multiprocessors

Figure 8: (a) A single-bus multiprocessor. (b) A multicomputer with local memories.

Unlike array processors, which share a single control unit across many processing elements, multiprocessors consist of multiple independent CPUs sharing a common memory. These CPUs are tightly coupled: they work closely together and must coordinate their access to the shared memory to avoid conflicts. Two multiprocessor arrangements are common:

1. Bus-based multiprocessors: The simplest multiprocessor uses a single shared bus to connect multiple CPUs to a shared memory. With many processors accessing memory simultaneously, however, bus contention can become a bottleneck.
2. Local memory: To alleviate this, some multiprocessors give each CPU a small private memory for program code and non-shared data, reducing bus traffic. Other techniques, such as caching, are also used to improve efficiency.

Multiprocessors offer a simpler programming model with a single shared memory, which makes them easier to program. For example, in a program that searches for cancer cells in a photograph, different processors can be assigned to different regions, all accessing the entire image in shared memory; a sketch of this division of labor follows.
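The sketch below (written for these notes, not taken from the textbook) illustrates the shared-memory model with Java threads: several workers scan disjoint slices of one shared array, standing in for the regions of the photograph. The match test against the value 255 is an invented stand-in for recognizing a suspicious cell.

```java
// Shared-memory parallelism sketch: N worker threads scan disjoint slices
// of one shared "image" array. The match test (value == 255) is an
// invented stand-in for detecting a suspicious cell.
import java.util.concurrent.atomic.AtomicInteger;

public class SharedMemorySearch {
    public static void main(String[] args) throws InterruptedException {
        int[] image = new int[1_000_000];          // shared memory, visible to all workers
        image[123_456] = 255;                      // plant one "cell" to find
        int workers = 4;
        AtomicInteger hits = new AtomicInteger();  // shared counter, updated safely

        Thread[] threads = new Thread[workers];
        int slice = image.length / workers;
        for (int w = 0; w < workers; w++) {
            final int from = w * slice;
            final int to = (w == workers - 1) ? image.length : from + slice;
            threads[w] = new Thread(() -> {
                for (int i = from; i < to; i++)    // scan this worker's region only
                    if (image[i] == 255) hits.incrementAndGet();
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();         // wait until every region is scanned
        System.out.println("matches found: " + hits.get());   // prints 1
    }
}
```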
Multicomputers

For larger systems with hundreds or thousands of processors, the shared-memory architecture of multiprocessors becomes impractical because of the complexity of connecting all the CPUs to the memory. Multicomputers solve this by abandoning shared memory altogether: each CPU has its own private memory, and the CPUs communicate by sending messages to one another.

Hybrid systems

Because of the respective advantages and limitations of multiprocessors and multicomputers, much research focuses on hybrid systems that combine the benefits of both. These systems attempt to provide the ease of programming of shared-memory systems without the complexity and cost of building large shared-memory architectures.

2. Primary Memory

The memory is the part of the computer where programs and data are stored. Some computer scientists (especially British ones) use the term store or storage rather than memory, although increasingly the term "storage" refers to disk storage. Without a memory from which the processors can read and write information, there would be no stored-program digital computers.

2.1 Bits

Bit: The basic unit of memory, holding a value of either 0 or 1. At least two values are needed for a functional memory system.

Binary efficiency: Computers use binary (0s and 1s) because it is reliable: digital information is stored through variations in a physical quantity, such as voltage, and distinguishing just two values minimizes errors. Binary is thus the most dependable method for encoding digital data.

Binary Coded Decimal (BCD): Some large computers, such as IBM mainframes, use BCD to store decimal numbers, with 4 bits per digit. The 4 bits give 16 combinations, of which 10 are used for the digits 0-9. For example, the decimal number 1944 is 0001 1001 0100 0100 in BCD but 0000011110011000 in pure 16-bit binary.

Efficiency comparison: 16 bits of BCD can represent only 10,000 combinations (0-9999), whereas 16 bits of binary can represent 65,536 combinations, making binary more efficient in most cases. Hypothetically, a device that could reliably store 10 distinct values (0-9) could hold decimal digits directly, but two-valued devices remain the more reliable choice. The sketch below contrasts the two encodings.
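A small program (an illustration written for these notes) that encodes a decimal number in BCD, 4 bits per digit, and prints it next to the pure binary encoding, reproducing the 1944 example above.

```java
// Encodes a non-negative decimal number in BCD (4 bits per digit) and
// prints it alongside the pure binary encoding, as in the 1944 example.
public class BcdDemo {
    static long toBcd(int n) {
        long bcd = 0;
        int shift = 0;
        while (n > 0) {
            bcd |= (long) (n % 10) << shift;  // pack one decimal digit into 4 bits
            n /= 10;
            shift += 4;
        }
        return bcd;
    }

    public static void main(String[] args) {
        int n = 1944;
        System.out.println("BCD:    " + Long.toBinaryString(toBcd(n)));  // 1100101000100
        System.out.println("binary: " + Integer.toBinaryString(n));     // 11110011000
    }
}
```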
2.2 Memory Addresses

Memory in a computer is organized into cells (or locations), each with a unique number called its address. Programs use these addresses to access specific memory locations. If there are n cells, the addresses range from 0 to n-1. Each cell has the same number of bits, denoted k, so it can store one of 2^k possible bit combinations. Memory addresses are represented in binary, and with m-bit addresses, up to 2^m cells can be addressed directly.

Figure 2-9: Three ways of organizing a 96-bit memory.

If a memory address has m bits, it can represent 2^m unique addresses. For example, a 4-bit address can represent up to 2^4 = 16 unique cells, with addresses ranging from 0 to 15. The number of bits in the address determines how many memory cells can be accessed directly, but it is unrelated to the amount of data each cell holds.

Examples: Suppose a memory has 12 cells (as in Fig. 2-9(a)): at least a 4-bit address is needed to represent the numbers 0 to 11 (2^4 = 16, which covers 0 to 11). For a smaller memory with 8 cells (as in Fig. 2-9(b) and (c)), 3-bit addresses suffice, because 2^3 = 8 is exactly enough to cover addresses 0 to 7.

2.3 Byte Ordering

Byte numbering: The bytes in a word can be numbered from left to right (big endian) or from right to left (little endian).

Big endian (Fig. 2-11(a)): Bytes are numbered starting from the high-order end, as in SPARC and IBM mainframes. The most significant byte (MSB) is stored at the lowest memory address. This format is often more intuitive, as the bytes appear in memory in the same order as we write them. For example, the 4-byte (32-bit) hexadecimal number 0x12345678 is stored like this:

Address  Value
0x00     12
0x01     34
0x02     56
0x03     78

Little endian (Fig. 2-11(b)): Bytes are numbered starting from the low-order end, as is typical of Intel architectures. The least significant byte (LSB) is stored at the lowest memory address, so the bytes appear "backwards" in memory. The same number 0x12345678 is stored like this:

Address  Value
0x00     78
0x01     56
0x02     34
0x03     12

Figure 2-11: (a) Big endian memory. (b) Little endian memory.
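A short sketch (an illustration for these notes) that writes 0x12345678 into a byte array in both orders using java.nio.ByteBuffer and prints the per-address layout shown in the two tables above.

```java
// Writes 0x12345678 into a 4-byte buffer in big-endian and little-endian
// order and prints the byte stored at each address.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    static void dump(String label, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(4).order(order);
        buf.putInt(0x12345678);
        System.out.println(label + ":");
        for (int addr = 0; addr < 4; addr++)
            System.out.printf("  0x%02X  %02X%n", addr, buf.get(addr));
    }

    public static void main(String[] args) {
        dump("big endian", ByteOrder.BIG_ENDIAN);        // 12 34 56 78
        dump("little endian", ByteOrder.LITTLE_ENDIAN);  // 78 56 34 12
    }
}
```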
Error-Correcting Codes

Computer memories can experience errors from various causes, such as voltage spikes. Error-correcting codes (ECC) add redundancy to memory words to detect and correct these errors. A memory word consists of m data bits and r check bits, giving a total length of n = m + r. An n-bit unit containing m data bits and r check bits is known as a codeword.

The Hamming distance between two codewords is the number of bit positions in which they differ; it can be computed by taking the bitwise EXCLUSIVE OR (XOR) of the codewords and counting the 1 bits in the result. For example, the codewords 10001001 and 10110001 have a Hamming distance of 3, meaning that 3 single-bit errors are needed to convert one codeword into the other.

To detect d single-bit errors, a code needs a Hamming distance of d + 1. To correct d single-bit errors, the code needs a Hamming distance of 2d + 1. For example, to detect a single-bit error (d = 1), a code needs a Hamming distance of d + 1 = 2; such a code can tell that some single bit is wrong, but it cannot correct it. To correct a single-bit error (d = 1), a code needs a Hamming distance of 2d + 1 = 3; with distance 3, the code can locate and correct any single-bit error.

A parity bit is a simple error-detecting code that adds one extra bit to the data:

1. Even parity: The parity bit is set so that the total number of 1s, including the parity bit, is even.
2. Odd parity: The parity bit is set so that the total number of 1s, including the parity bit, is odd.

A single parity bit has a Hamming distance of 2, so it can detect single-bit errors but cannot correct them.

As a larger example, consider a code with four valid codewords:

0000000000
0000011111
1111100000
1111111111

To determine the minimum Hamming distance of this code, we compare each codeword with every other one and count the bit differences:

1. 0000000000 vs. 0000011111: 5 bits differ (the last 5 bits).
2. 0000000000 vs. 1111100000: 5 bits differ (the first 5 bits).
3. 0000000000 vs. 1111111111: 10 bits differ (all bits).
4. 0000011111 vs. 1111100000: 10 bits differ.
5. 0000011111 vs. 1111111111: 5 bits differ (the first 5 bits).
6. 1111100000 vs. 1111111111: 5 bits differ (the last 5 bits).

With a Hamming distance of 5, this code can detect up to 4 single-bit errors (since d + 1 = 5) and correct up to 2 single-bit errors (since 2d + 1 = 5).

For a code that can correct any single-bit error in an m-bit word, the number of check bits r must satisfy

m + r + 1 ≤ 2^r

This formula determines the minimum number of check bits needed for different memory word sizes, balancing data and check bits against error detection and correction capability.

Figure 2-13: Number of check bits for a code that can correct a single error.
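The following sketch (written for these notes) computes Hamming distances with XOR and bit counting, verifies the distance-5 code above, and searches for the smallest r satisfying m + r + 1 ≤ 2^r.

```java
// Hamming-distance and check-bit calculations for the ECC discussion.
public class EccDemo {
    // Hamming distance: XOR the codewords, then count the 1 bits.
    static int hammingDistance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    // Smallest r satisfying m + r + 1 <= 2^r (single-error correction).
    static int minCheckBits(int m) {
        int r = 1;
        while (m + r + 1 > (1 << r)) r++;
        return r;
    }

    public static void main(String[] args) {
        System.out.println(hammingDistance(0b10001001, 0b10110001)); // 3

        int[] code = {0b0000000000, 0b0000011111, 0b1111100000, 0b1111111111};
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < code.length; i++)
            for (int j = i + 1; j < code.length; j++)
                min = Math.min(min, hammingDistance(code[i], code[j]));
        System.out.println(min);             // 5 -> detects 4 errors, corrects 2

        System.out.println(minCheckBits(8)); // 4 check bits for an 8-bit word
    }
}
```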
The Venn diagram of Fig. 2-14(a) contains three circles, A, B, and C, which together form seven regions. As an example, we encode the 4-bit memory word 1100 in the four regions AB, ABC, AC, and BC, as shown in Fig. 2-14(a). We then add a parity bit to each of the three remaining regions to produce even parity, as illustrated in Fig. 2-14(b).

Figure 2-14: (a) Encoding of 1100. (b) Even parity added. (c) Error in AC.

Now suppose that the bit in the AC region goes bad, changing from 0 to 1, as shown in Fig. 2-14(c). The computer can see that circles A and C have the wrong (odd) parity. The only single-bit change that fixes both is to restore AC to 0, thereby correcting the error. In this way, the computer can repair single-bit memory errors automatically.

Cache Memory

Figure 2-16: The cache is logically between the CPU and main memory. Physically, there are several possible places it could be located.

In a typical architecture, the CPU, cache, and main memory are arranged as follows:

CPU: The central processing unit performs computations and issues memory requests.
Cache: Positioned between the CPU and main memory, the cache is much faster than main memory. It holds frequently accessed data and instructions, allowing the CPU to retrieve them quickly without going to the slower main memory.
Main memory: The larger, slower memory where the bulk of data and program instructions reside. It feeds data into the cache when a cache miss occurs (i.e., when the CPU cannot find the required data in the cache).

Access flow:

1. Memory request: When the CPU needs a word, it first checks the cache.
2. Cache hit: If the word is found in the cache (a hit), it is retrieved quickly.
3. Cache miss: If the word is not in the cache (a miss), the CPU must access the slower main memory to fetch it.
4. Data transfer: On a miss, not only the requested word but also nearby words are loaded into the cache for future accesses.
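A minimal sketch of this access flow, assuming a direct-mapped cache with one-word lines (a simplification: as step 4 notes, real caches transfer multi-word lines on a miss). All sizes and names are invented for illustration.

```java
// Direct-mapped cache sketch with one-word lines: on a miss the word is
// fetched from main memory and installed; on a hit it is returned directly.
// Real caches fetch a whole multi-word line on a miss (see step 4 above).
public class CacheDemo {
    static final int LINES = 256;                 // invented cache size
    int[] mainMemory = new int[65536];
    int[] cachedWord = new int[LINES];            // cached data, one word per line
    int[] cachedTag  = new int[LINES];            // which address each line holds
    boolean[] valid  = new boolean[LINES];
    int hits, misses;

    int read(int address) {
        int line = address % LINES;               // each address maps to one line
        if (valid[line] && cachedTag[line] == address) {
            hits++;                               // cache hit: fast path
            return cachedWord[line];
        }
        misses++;                                 // cache miss: go to main memory
        cachedWord[line] = mainMemory[address];
        cachedTag[line] = address;
        valid[line] = true;
        return cachedWord[line];
    }

    public static void main(String[] args) {
        CacheDemo c = new CacheDemo();
        c.mainMemory[1000] = 42;
        c.read(1000);                             // miss: loaded from memory
        c.read(1000);                             // hit: served from the cache
        System.out.println("hits=" + c.hits + " misses=" + c.misses); // hits=1 misses=1
    }
}
```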
3. Secondary Memory

Secondary memory addresses the limitation that main memory is always too small to hold all the data users want to store. As technology advances, the demand for storage grows, leading to scenarios in which substantial collections, such as a comprehensive film archive, must be digitized and stored. Such data can run to hundreds of terabytes, far exceeding the capacity of main memory.

Figure 2-18: A five-level memory hierarchy.

3.1 Memory Hierarchies

The hierarchy is structured as follows:

1. CPU registers: The fastest and smallest storage, accessed at full CPU speed and typically holding around 128 bytes in total.
2. Cache memory: Larger than the registers, ranging from 32 KB to a few megabytes, and accessed slightly more slowly.
3. Main memory: Holds tens of megabytes to several gigabytes, with access times of tens of nanoseconds.
4. Magnetic disks: The primary medium for permanent storage, providing several gigabytes to tens of gigabytes, with access times measured in milliseconds.
5. Magnetic tape and optical disks: Used primarily for archival storage; access times can run to seconds, and capacity is limited only by the budget.

Three properties change as one moves down the hierarchy:

Access time: Increases, from CPU registers down to tape and optical disks.
Storage capacity: Increases, from small CPU registers to large magnetic disks and tapes.
Cost per bit: Decreases, with main memory costing dollars per megabyte, magnetic disk pennies per megabyte, and magnetic tape even less.

3.3 Magnetic Disks

Figure 2-19: A portion of a disk track. Two sectors are illustrated.

A magnetic disk is a storage device with platters coated in a magnetizable material; a disk head writes and reads data by magnetizing areas of the surface. Disks were originally large (up to 50 cm in diameter) but are now much smaller (3-12 cm).

Track: A circular sequence of bits on the disk surface.
Sector: Each track is divided into sectors, each typically holding 512 bytes of data. A sector consists of a preamble (for synchronization), the data, and an error-correcting code (ECC).
Disk arm: Moves radially to position the head over different tracks (concentric circles).
Zone bit recording: Tracks are grouped into zones; outer zones have more sectors per track, increasing capacity.
Disk controller: Manages read/write commands, arm movement, error correction, and the remapping of bad sectors.
Perpendicular recording: A newer technique that increases data density by recording bits vertically.
Winchester disks: Sealed drives that keep dust out, ensuring high surface quality.

Figure 2-21: A disk with five zones. Each zone has many tracks.

3.4 Floppy Disk

Originally invented by IBM for mainframe maintenance, floppy disks quickly became popular for software distribution on personal computers. Early versions were physically flexible, hence the name "floppy." Floppy disks are organized into tracks and sectors like hard disks, but unlike hard disks, where the heads float on a cushion of air, floppy-disk heads touch the surface, causing more wear on both the media and the heads. Floppy disks were widely used for about 20 years; modern computers usually ship without floppy drives, newer storage media having replaced them.

3.5 RAID

RAID stands for Redundant Array of Independent Disks. RAID addresses the slow performance of disks relative to the CPU by using multiple disks in parallel, enhancing both performance and reliability. Unlike a Single Large Expensive Disk (SLED), a RAID uses multiple disks that appear to the operating system as one large disk. The RAID levels are illustrated in Figure 2-23.

Figure 2-23: RAID levels 0 through 5. Backup and parity drives are shown shaded.

RAID level 0 (striping) [Fig. 2-23(a)]: Data is split into strips and distributed across multiple drives. Performance: high for large requests, with support for parallel I/O. Reliability: low; there is no redundancy, so data is lost if any drive fails.

RAID level 1 (mirroring) [Fig. 2-23(b)]: Data is duplicated on primary and backup disks. Performance: improved read speed; write speed similar to a single disk. Reliability: high fault tolerance; data is recoverable if a drive fails.

RAID level 2 (Hamming-code parity) [Fig. 2-23(c)]: Data is split bit by bit across multiple drives, with a Hamming code for error correction. Performance: high throughput, but the drives must be synchronized. Usage: suitable for systems with many drives, at high overhead.

RAID level 3 (single parity disk) [Fig. 2-23(d)]: Uses one parity disk for error correction; all drives are synchronized. Reliability: can survive a single drive failure. Limitation: handles only one I/O request at a time, like a single drive.

RAID level 4 (striping with dedicated parity) [Fig. 2-23(e)]: Stripes data across the drives with a dedicated parity disk. Advantage: data can be recovered if a drive fails. Drawback: small updates are slow because the parity disk becomes a bottleneck.

RAID level 5 (distributed parity) [Fig. 2-23(f)]: Distributes the parity information across all the disks, eliminating the parity-disk bottleneck. Reliability: high fault tolerance, though recovery after a drive failure is complex. Best for systems needing high reliability and performance without a parity-disk bottleneck. A sketch of parity-based recovery follows.
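RAID levels 3 through 5 rely on the fact that the parity strip is the XOR of the data strips, so any one lost strip can be rebuilt by XOR-ing the survivors. The following sketch (an illustration for these notes, not from the textbook) demonstrates that reconstruction.

```java
// Parity-based recovery as used in RAID 3-5: parity = XOR of all data
// strips, so a single failed strip is rebuilt by XOR-ing the survivors.
import java.util.Arrays;

public class RaidParityDemo {
    public static void main(String[] args) {
        byte[][] strips = {
            {10, 20, 30, 40},    // drive 0
            {5, 6, 7, 8},        // drive 1
            {1, 2, 3, 4},        // drive 2
        };

        byte[] parity = new byte[4];          // parity strip (drive 3)
        for (byte[] strip : strips)
            for (int i = 0; i < parity.length; i++)
                parity[i] ^= strip[i];

        // Simulate losing drive 1, then rebuild it from the others + parity.
        byte[] rebuilt = xor(xor(parity, strips[0]), strips[2]);
        System.out.println(Arrays.equals(rebuilt, strips[1]));  // true
    }

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = a.clone();
        for (int i = 0; i < out.length; i++) out[i] ^= b[i];
        return out;
    }
}
```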

3.6 CD-ROMs

Origins of optical disks: Initially created for television recording, optical disks soon became popular for computer storage thanks to their high capacity and low cost.

LaserVision and audio CDs: Early large-diameter optical disks were unsuccessful except in Japan. Philips and Sony later launched the Compact Disc (CD) in 1980, the first mass-market digital storage medium.

Red Book standard: The published technical details of the CD ensured cross-compatibility among music publishers and electronics manufacturers, so that CDs from various sources worked in different players.

Pit-land structure: A low-power laser reads the pits (depressions) and lands (flat areas), as shown in Figure 2-24. The laser detects the transitions between pits and lands to read the binary data, ensuring reliable encoding.

Figure 2-24: Recording structure of a Compact Disc or CD-ROM.

CD-ROM development: Recognizing the CD's potential for data storage, Philips and Sony published the Yellow Book in 1984, setting standards for CD-ROMs and strengthening error correction for data reliability.

Data organization: The CD-ROM format organizes data into 98-frame sectors (Fig. 2-25), each containing a 16-byte preamble, 2048 data bytes, and a 288-byte error-correcting code. Mode 1 provides this enhanced error correction; Mode 2 merges the data and ECC fields for applications where error correction is less crucial (e.g., audio). In 1986, Philips introduced the Green Book, enabling graphics and multimedia by interleaving audio, video, and data in the same sector.

Figure 2-25: Logical data layout on a CD-ROM.