Topic 02 Central Processing Unit (CPU)-FQ PDF
Document Details
Dr. Faizan Qamar
Summary
These lecture notes cover the Central Processing Unit (CPU) in computer architecture. The topics include CPU organization, the von Neumann architecture, CPU components, and instruction set architecture. The notes also differentiate between the RISC and CISC design approaches.
Full Transcript
TTTK1153 Computer Organization & Architecture
Chapter 2: Central Processing Unit (CPU)
Lecturer: Dr. Faizan Qamar
Designed by Ts. Dr. Mohd Nor Akmal Khalid

COURSE JOURNEY
▰ Topic 1: Introduction to Computer Systems
▰ Topic 2: Central Processing Unit
▰ Topic 3: Memory & Storage Systems
▰ Topic 4: Input/Output Systems & Interconnection
▰ Topic 5: Computer Arithmetic & Instruction Set Architecture
▰ Topic 6: Data Path, Control Design, and Pipelining
▰ Topic 7: Parallel Architectures and Multicore Processors
▰ Topic 8: Advanced Memory Systems
▰ Topic 9: Assembly Language

Central Processing Unit
What is the CPU? The Central Processing Unit (CPU) controls the operation of the computer and performs its data processing functions. It is the primary component of a computer and carries out most of the processing: it executes program instructions by performing basic arithmetic, logic, control, and input/output operations, acting as the brain of the computer and coordinating all other hardware components.

CPU Organization
Importance of a CPU: it supports all digital systems, determines performance and speed, facilitates task execution, drives multitasking, and dictates system compatibility.
Important applications:
Multimedia: handles real-time video encoding/decoding and rendering.
Computation: executes complex algorithms, manages neural-network computations.
Software Engineering: runs development environments, compiles code.
Information Systems: processes data queries, manages databases.

Von Neumann Architecture
There are two types of computers. Fixed-program computers have a limited number of specific functions and cannot be reprogrammed (e.g., a calculator). Stored-program computers can be programmed to perform multiple tasks and have a memory unit attached to them. The modern concept of the stored-program computer was introduced by John von Neumann.

Components of a CPU
Control Unit (CU): Manages the execution of instructions and coordinates the activities of other components.
Arithmetic Logic Unit (ALU): Performs mathematical calculations and logical operations.
Memory Unit: Stores both data and instructions in the same memory space.
Input/Output (I/O) Devices: Allow the system to interact with the external environment, enabling data input and output.

Control Unit
The control unit directs the operation of the other units by providing timing and control signals, and it directs the flow of data between the CPU and the other devices. It tells the computer's memory, arithmetic logic unit (ALU), and input/output (I/O) devices how to respond to the instructions that have been sent to the processor.

Arithmetic Logic Unit (ALU)
The ALU is a combinational digital electronic circuit that performs arithmetic and bitwise operations on integer binary numbers. The set of basic operations hardwired into the CPU is known as the instruction set, and each basic operation is represented by a combination of bits called an opcode (machine-level language). When executing instructions in a machine-language program, the CPU decides which operation to perform by "decoding" the opcode.

ALU 'Decoding'
A, B: inputs (operands). X: output (result). O: input code or instruction from the control unit (opcode). S: output status flags, such as carry-in, carry-out, overflow, and division by zero.
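To make opcode decoding concrete, here is a minimal Python sketch of an ALU that selects its operation from an opcode and reports a few status flags. The opcode values, the 8-bit word width, and the flag names are assumptions made for the example, not values from the notes.

```python
# Minimal sketch of ALU opcode decoding (illustrative only; the opcode
# values and the 8-bit width are assumptions, not taken from the slides).

def alu(opcode, a, b, width=8):
    """Decode an opcode, apply the operation to operands A and B,
    and return the result X plus a small set of status flags S."""
    mask = (1 << width) - 1          # keep results within the word width
    status = {"carry_out": False, "zero": False}

    if opcode == 0b000:              # AND
        x = a & b
    elif opcode == 0b001:            # OR
        x = a | b
    elif opcode == 0b010:            # ADD
        total = a + b
        x = total & mask
        status["carry_out"] = total > mask
    elif opcode == 0b011:            # SUB (two's-complement subtraction)
        total = a + ((~b + 1) & mask)
        x = total & mask
        status["carry_out"] = total > mask
    else:
        raise ValueError(f"unknown opcode: {opcode:#05b}")

    status["zero"] = (x == 0)
    return x, status

# Example: opcode 0b010 selects ADD; 200 + 100 overflows an 8-bit result.
result, flags = alu(0b010, 200, 100)
print(result, flags)    # 44, carry_out=True, zero=False
```

In real hardware the selection is done by combinational decoding logic rather than branching, but the input/output relationship is the same.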
Logic Gates: One-Bit ALU
Inputs: two one-bit inputs, A and B (the operands), plus a carry-in bit (Cin) for handling the carry from a previous bit in multi-bit operations such as addition or subtraction.
Operation signals: these determine which operation the ALU will perform (e.g., AND, OR, addition). Common control signals include S0, S1, and S2; these control lines select the operation based on binary codes.
Outputs: the result (R), the one-bit result of the operation, and the carry-out (Cout), the carry bit produced by operations such as addition or subtraction.

Building an ALU
Start by designing a 1-bit ALU. Building a larger, multi-bit ALU is simple for some operations but complex for others, so a bottom-up approach is used: build small units of functionality and put them together to build larger ones. Example: design a 1-bit ALU that computes either AND or OR (if Op = 0, do AND; if Op = 1, do OR).

Building an ALU: logic building blocks
AND gate (A AND B): truth table (0,0 → 0), (0,1 → 0), (1,0 → 0), (1,1 → 1); performs logical AND, used for AND operations in the ALU.
OR gate (A OR B): (0,0 → 0), (0,1 → 1), (1,0 → 1), (1,1 → 1); performs logical OR, used for OR operations in the ALU.
XOR gate (A XOR B): (0,0 → 0), (0,1 → 1), (1,0 → 1), (1,1 → 0); used in binary addition and subtraction.
NOT gate (NOT A): (0 → 1), (1 → 0); used to invert inputs, helpful in subtraction.
Multiplexer: output depends on the control signals; selects the result from multiple logic operations (AND, OR, XOR, etc.).
Full adder (adds A, B, and Cin): Sum = A XOR B XOR Cin; Cout = (A AND B) OR (Cin AND (A XOR B)); performs addition with carry in the ALU.

Example: truth table of a 1-bit AND/OR ALU (diagram)
The 'actual' ALU (diagram)

CPU Architectures

Registers
A register is a type of computer memory used to quickly accept, store, and transfer data and instructions that are being used immediately by the CPU. Registers can be accessed at close to the speed of the processor (1 to 3.8 GHz). The disadvantage is that the number of registers is limited (they are costly) and their capacity is small.

Registers in the CPU (diagram)

Main Registers
Accumulator (AC): a register in which intermediate arithmetic and logic results are stored.
Program Counter (PC): provides temporary housing for the next instruction to be executed in a sequence of instructions. As one instruction is retrieved and executed, the program counter queues up the next instruction in the sequence, effectively minimizing delays between the steps needed to complete a task.
Memory Address Register (MAR): holds the memory address from which data will be fetched by the CPU, or the address to which data will be sent and stored, together with the control signals to memory.
Memory Data Register (MDR): stores the data that has been sent from memory. Data stored in the MDR is sent on to the CIR, where it will be decoded.
Current Instruction Register (CIR): holds the instruction currently being executed or decoded.
Instruction Buffer Register (IBR): a temporary register in which the opcode of the currently fetched instruction is stored.
Memory Buffer Register (MBR): a temporary register in which the contents of the last memory fetch are stored.

Several Other Registers
General-Purpose Registers (GPRs): registers (R0, …, Rn) that can store both data and addresses; used for passing parameters to functions, storing return values, and holding intermediate values during computations.
Stack Pointer (SP): keeps track of the call stack; it stores the address of the last program request on the stack.
Index Register (XR): holds an index number that is relative to the address part of the instruction.
Base Register (BR): holds a base address; the direct address field of the instruction gives a displacement relative to this base address.
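As a rough sketch of how these registers cooperate, the following Python model steps through a single instruction fetch; the memory contents, the 16-bit word layout, and the opcode/operand split are invented for illustration and are not from the notes. The "Working" section that follows walks through the same steps in prose.

```python
# Toy model of one instruction fetch using the registers described above.
# The memory contents and the opcode/operand split are illustrative only.

memory = {0: 0x1A05, 1: 0x2B07}   # address -> 16-bit word (made-up program)

pc  = 0                 # Program Counter: address of the next instruction
mar = pc                # PC is copied into the Memory Address Register
mbr = memory[mar]       # memory returns the word into the Memory Buffer Register
pc += 1                 # PC is incremented to point at the next instruction
cir = mbr               # the word moves to the Current Instruction Register
opcode  = cir >> 12     # CIR is split into opcode ...
operand = cir & 0x0FFF  # ... and operand (here: top 4 bits / low 12 bits)

print(f"opcode={opcode:#x}, operand={operand:#x}, next PC={pc}")
# opcode=0x1, operand=0xa05, next PC=1
```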
Working
Whenever the next instruction is to be executed, the program counter (PC) holds the memory address of that instruction. First, the content of the program counter is copied into the Memory Address Register (MAR). The MAR acts as a buffer: its content is placed on the address bus and used as the reference to locate that address in memory. The content of the addressed location is then transferred to the Memory Buffer Register (MBR) over the data bus. At the same time, the content of the PC is incremented by 1. The content of the MBR is then transferred to the Current Instruction Register (CIR), which breaks the instruction down into opcode and operand. The Accumulator (AC) is the register in which intermediate arithmetic logic unit results are stored.

Instruction Set Architecture (ISA)
The ISA defines the instructions, registers, and data and memory formats. It defines the part of the processor that is visible to the programmer: instruction formats, opcodes, and registers. Examples: x86, ARM, MIPS.

RISC vs. CISC
RISC and CISC are two different approaches to CPU design that determine how processors execute instructions. The two architectures take distinct approaches, affecting the performance, complexity, and efficiency of the processor.
RISC (Reduced Instruction Set Computer): based on the principle that a smaller set of simple instructions that can be executed quickly will lead to better performance. Simplified instructions, faster execution; used in AI and performance-critical systems.
CISC (Complex Instruction Set Computer): aims to minimize the number of instructions per program by providing a rich set of complex instructions, each capable of performing multiple low-level operations in one instruction. More complex instructions, easier coding; widely used in general-purpose applications.

RISC vs. CISC: feature comparison
Design: software-centric (RISC) vs. hardware-centric (CISC).
RAM usage: heavy use of RAM (RISC) vs. efficient use of RAM (CISC).
Cycles: simple, single-cycle instructions (RISC) vs. complex, multi-cycle instructions (CISC).
Instruction length: fixed (RISC) vs. variable (CISC).
Pipelining: high efficiency (RISC) vs. lower efficiency due to complexity (CISC).
Code density: lower (RISC) vs. higher (CISC).
Complexity: in the compiler (RISC) vs. in microprograms (CISC).
Example ISAs: ARM, MIPS (RISC) vs. x86 (CISC).
Power: lower power consumption (RISC) vs. higher power consumption (CISC).
Applications: AI and performance-critical systems (RISC) vs. general-purpose applications (CISC).

RISC vs. CISC
CISC (Complex Instruction Set Computer): Complex instructions: CISC uses complicated instructions that can perform multiple tasks in one instruction. Microprogramming: each complex instruction is broken down into smaller steps (micro-instructions) that are executed over multiple cycles. Flexible but slower: CISC offers flexibility with powerful instructions, but each one may take longer to execute.
RISC (Reduced Instruction Set Computer): Simple instructions: RISC uses simple instructions that do one task and execute in a single cycle. No microprogramming: instructions are executed directly without being broken into smaller steps. Fast and efficient: RISC focuses on speed, executing each instruction quickly but performing simpler tasks.

Understanding CPU Execution Time
Execution time refers to the total time a CPU takes to complete a given task. It is influenced by how many instructions the program has, how efficiently the CPU processes those instructions, and the speed of the CPU clock. Execution time can be viewed as the total number of clock cycles the CPU needs to execute all instructions, multiplied by the time per clock cycle:
CPU Execution Time = CPU Clock Cycles × Clock Cycle Time
Clock Cycle Time = 1 / Clock Speed

Relating CPU Clock Cycles to Instructions
The total number of CPU clock cycles is determined by how many instructions are executed and how many cycles each instruction takes on average. This average is known as CPI (Cycles Per Instruction):
CPI = CPU Clock Cycles / Number of Instructions

CPU Performance (Primer)
If a processor has a frequency of 3 GHz, the clock ticks 3 billion times per second; as we will soon see, with each clock tick one or more (or fewer) instructions may complete. If a program runs for 10 seconds on a 3 GHz processor, how many clock cycles did it run for? If a program runs for 2 billion clock cycles on a 1.5 GHz processor, what is its execution time in seconds?

CPU Performance
CPU Execution Time = Number of Instructions × CPI × Clock Cycle Time
Suppose you have a program with: Number of Instructions = 1,000,000; CPI = 2.5; Clock Cycle Time = 0.5 ns.
Calculation: CPU Execution Time = 1,000,000 × 2.5 × 0.5 ns = 1,250,000 ns = 1.25 milliseconds.
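These formulas are easy to check in a few lines of Python; the sketch below simply encodes them (the function names are my own) and reproduces the worked example.

```python
# Encode the performance formulas above and reproduce the slide's example.

def clock_cycle_time(clock_speed_hz):
    # Clock Cycle Time = 1 / Clock Speed
    return 1.0 / clock_speed_hz

def cpu_execution_time(num_instructions, cpi, cycle_time_s):
    # CPU Execution Time = Number of Instructions x CPI x Clock Cycle Time
    return num_instructions * cpi * cycle_time_s

# Worked example: 1,000,000 instructions, CPI = 2.5, clock cycle time = 0.5 ns
t = cpu_execution_time(1_000_000, 2.5, 0.5e-9)
print(t)                                # 0.00125 s = 1.25 milliseconds

# Primer question: 2 billion clock cycles on a 1.5 GHz processor
print(2e9 * clock_cycle_time(1.5e9))    # about 1.33 seconds
```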
Pipelining and Superscalar Architectures
Pipelining increases CPU performance by executing multiple instructions simultaneously: the steps of successive instructions are overlapped. Stages: Fetch, Decode, Execute, Memory, Write-back. Benefits: increased instruction throughput and CPU performance.
Superscalar designs provide multiple instruction execution units for even faster processing, executing more than one instruction per clock cycle. Features: multiple execution units, instruction-level parallelism. Benefits: enhanced performance through parallel execution.

Example of a pipeline (diagram)

Example of Superscalar Architectures
▪ In a superscalar design, multiple execution units are present, for example, one for integer arithmetic, one for floating-point arithmetic, and one for Boolean operations.
▪ Two or more instructions are fetched at once, decoded, and placed into a holding buffer until they can be executed.
▪ As soon as an execution unit becomes available, it looks in the holding buffer to see if there is an instruction it can handle; if so, it removes the instruction from the buffer and executes it.

Parallel Processing
A method in which tasks are divided into smaller subtasks that can run concurrently. The CPU splits a large task into multiple smaller, independent tasks, and these subtasks are processed simultaneously, allowing faster task completion. Often used in applications such as scientific simulations, graphics rendering, and data analysis.
Data parallelism: distributes data across multiple processors to perform the same task on different data sets.
Task parallelism: distributes different tasks across multiple processors, allowing each core to work on a unique task.
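To illustrate data parallelism versus task parallelism, here is a small Python sketch using the standard library; the workloads (squaring numbers, a fake "render" and "analyse" step) are toy placeholders rather than anything from the notes.

```python
# Minimal sketch contrasting data parallelism and task parallelism
# using Python's standard library (the workloads are toy examples).
from concurrent.futures import ProcessPoolExecutor

def square(x):
    # Same operation applied to different pieces of data (data parallelism).
    return x * x

def render_frame():
    return "frame rendered"

def analyse_log():
    return "log analysed"

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Data parallelism: one task, many data items, spread across cores.
        squares = list(pool.map(square, range(8)))

        # Task parallelism: different, independent tasks run concurrently.
        f1 = pool.submit(render_frame)
        f2 = pool.submit(analyse_log)
        print(squares, f1.result(), f2.result())
```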
1. Multi-tasking
Handles two or more programs at the same time, from the single user's perspective. The CPU can only perform one task at a time; however, it runs so fast that two or more jobs appear to execute at the same time.

2. Parallel Processing
Uses two or more CPUs to handle one or more jobs; for example, computer networking (e.g., torrent downloads).

Example of Single Instruction Multiple Data (SIMD) on a Fermi GPU (diagram)

Multiprocessing
The use of multiple CPU cores to execute multiple processes or threads simultaneously. Each core in a multi-core processor can handle a separate task, allowing true concurrent execution. Common in modern CPUs, enabling complex multitasking and heavy workload management.
Symmetric Multiprocessing (SMP): all cores share the same memory and operate as peers, coordinating tasks dynamically.
Asymmetric Multiprocessing (AMP): a master core controls task allocation to subordinate cores; often used in embedded systems.

Example of Multiprocessors (diagrams): single-bus multiprocessors; multiprocessors with local memories.

Benefits of Parallel Processing & Multiprocessing
Improved performance: faster task completion due to concurrent execution; ideal for intensive computing tasks like 3D rendering, simulations, and large-scale data processing.
Enhanced efficiency: efficient utilization of CPU resources by distributing workloads across multiple cores, reducing idle time.
Greater reliability: redundancy in multi-core systems can improve fault tolerance, as tasks can be rerouted if a core fails.
Scalability: systems can scale performance by adding more cores, enhancing the capability to handle more complex tasks or higher workloads.

Cache Memory
Cache reduces memory access time, increasing CPU efficiency. It is a small, fast memory located close to the CPU that stores frequently accessed data. Multiple levels of cache (L1, L2, L3) allow faster data access for the CPU: L1 is the fastest and smallest, L2 is larger and slower, and L3 is the largest and is shared across cores. Purpose: reduce the average time to access data from main memory.

Cache Memory
The cache is logically between the CPU and main memory. Physically, there are several possible places it could be located.

Cache Memory
Cache is typically built from static random-access memory (SRAM). Hit rate: there is a fundamental tradeoff between cache latency and hit rate; larger caches have better hit rates but longer latency. The solution is to use multiple levels of cache, with small fast caches backed up by larger, slower caches.

Cache Memory Mapping
Direct-mapped cache: each memory block is mapped to one specific cache location. Each entry uses three elements: a data block, a tag holding part of the address, and a valid bit flag.
Fully associative cache: allows any memory block to be stored in any cache location, offering flexibility but increasing complexity.
Set-associative cache: a compromise between direct and fully associative mapping; each memory block maps to one of "N" specific cache locations (N-way set-associative mapping).
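As a rough illustration of direct mapping, the Python sketch below performs a direct-mapped cache lookup; the number of lines, the block size, and the addresses are arbitrary values chosen for the example.

```python
# Toy direct-mapped cache lookup: the index selects a line, the tag confirms
# the block. 16 lines and 16-byte blocks are arbitrary illustrative choices.

NUM_LINES  = 16
BLOCK_SIZE = 16

# Each line holds (valid, tag, data); all lines start invalid.
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_LINES)]

def lookup(address):
    block = address // BLOCK_SIZE      # which memory block the address is in
    index = block % NUM_LINES          # direct mapping: exactly one possible line
    tag   = block // NUM_LINES         # remaining bits identify the block
    line  = cache[index]
    if line["valid"] and line["tag"] == tag:
        return "hit"
    # Miss: fetch the block from main memory (simulated) and fill the line.
    line.update(valid=True, tag=tag, data=f"block {block} from main memory")
    return "miss"

print(lookup(0x1A3))   # miss (cold cache)
print(lookup(0x1A0))   # hit  (same block as 0x1A3)
print(lookup(0x3A0))   # miss (same index, different tag: a conflict)
```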
Data Writing Policies
Write-through: data is written simultaneously to both the cache and main memory, ensuring consistency but increasing latency because of the more frequent writes.
Write-back: data is initially written only to the cache and may later be written to main memory, allowing more efficient operation but risking data inconsistency.
Data consistency is managed by checking the dirty bit, an indicator that shows whether the data has been modified. An active dirty bit indicates that the data in the cache is more recent than the data in main memory, a scenario more common with write-back.

Modern CPU Applications
Handling complex computation and AI workloads, such as optimizing neural networks and parallelizing matrix operations. Multimedia processing, such as real-time video processing and graphics rendering. Software engineering, running development tools and agent simulations. Information systems, managing structured and unstructured databases and data analytics.

THANK YOU!
Next Lecture: Memory and Storage Systems