Operating System Virtual Memory (combined PDF)

Document Details

Uploaded by AvailableHeisenberg

Boston University

Tags

operating systems, virtual memory, computer science, memory management

Summary

This document collects lecture transcripts on virtual memory concepts, including virtualization of the CPU and memory, address translation, paging, page swapping, and the TLB. It explains the role of the Memory Management Unit (MMU) in translating virtual addresses to physical addresses and examines free space management and segmentation (a different approach to memory management). The transcript also includes related lectures on C programming (pointers, arrays, structs, and malloc), CPU timesharing and interrupts, caching, and pipelining.

Full Transcript

The Operating System PART TWO Virtual Memory What you have all been waiting for! What we covered last time Virtualization ○ Taking a physical shared device and giving each process the illusion of its own separate, non-shared device. How the Operating System Virtualizes the CPU ○ Interrupts (ie timer) ○ Saving/Restoring process context ○ Scheduler “Refresher” on Virtualization: Another Analogy Two virtual cokes The physical coke Virtualizing the CPU gives the illusion that each process runs alone on it’s own CPU. OS virtualizes CPU via time sharing and scheduling! Virtualizing Memory gives the illusion that each process has its own address space. Physical Memory “Byte Addressable” memory hardware Process’s View of Memory (aka the illusion) “Address Space” Ranging from 0x00 to 0xff (some max address which depends on the size of the address space) mov [0x54261], 3 ○ All assembly instructions we have seen are VIRTUAL addresses The Problem p1 p2 p3 p4 We want to support many process address spaces with our single physical memory. Attempt 1: Place them in non-overlapping regions! But what happens when the CPU executes a memory lookup? proc 3 mov [0x54261], 3 ?? Also, what happens when, (via timesharing), we switch and start running another process? Attempt 1: Address Translation The MMU is a piece of hardware physical_address = in CPU. virtual_address + base CPU (running Assert that virtual address is proc3) WITHIN the bound. virtual Memory Management Marks the addr. Unit start of proc3 (MMU) VAS. base physical bound addr. Marks the end of proc3 VAS. Physical Memory 0x0 When the OS Now we have: 0x1a switches running Proc 3 addr space processes, it When OS runs proc 3, it sets must save and the BASE register to: 0x2b restore 0x1a base/bound … register in It sets the BOUND register 0x30 MMU. to: 0x2b Proc 1 addr space Just like OS save/restores If proc3 accesses address: 0x41 … other context! 0x03, it goes to physical 0xee address: 0x1a + 0x03 = 0x1d Proc 2 addr space 0xff Free Space Management When the OS sees a request to start new process, it needs to find FREE space for that process’s memory. To do so the OS: Maintains free list: an internal data structure to track what physical memory is FREE. Dynamic Relocation: Move around other processes’ address spaces to get bigger chunks of free space. ○ hanging the base/bound register allows us to dynamically move around in physical memory, where our program’s memory resides! ○ The OS can freely optimize the free space in memory (unlike malloc) Good Properties of Address Translation FAST (occurs in hardware via MMU) Isolated Address Spaces ○ Processes CAN’T write/read in other processes’ address spaces! Process CAN’T tell it’s addresses are being translated! Are we done? Are there any issues? How big is the Virtual Address Space of a Process on our computer? How much RAM do we have? (experiment time!) Issue: Wasted Space… Unused space! Bottom of stack = 0x7fff38b0d134 Code = 0x5dd5e9531189 Difference = 37 terabytes? INTERNAL FRAGMENTATION! Idea: Segmentation Can we segment up our address space, and allocate each dynamic section separately? - I.e. three sets of base/bounds registers, one for static (data/text), one for stack, one for heap. - On mmap adjust base/bound of heap section - When stack grows to be large, adjust base/bound of stack section. Problems: - Too many assumptions about address space, not generalizable. 
- Difficult to manage variable sized pieces of memory floating around Attempt 2: Paging Let’s not allocate the entire virtual address space in physical memory when the process starts. Instead, let’s allocate physical memory in chunks, as needed! Paging Paging Break up virtual address space and physical memory into fixed-sized chunks called “pages”. Map virtual pages to a corresponding physical page. Create mapping Demand Paging Only allocate the page when it is actually used by the process. Mini Example 64 byte Virtual Address Space 128 byte Physical Address Space 16 byte pages Mapping VPN to PPN: The Page Table What will map VPN to PPN? ○ A page table! The OS manages/sets up the page table, which the MMU uses to perform translation. VPN PPN 0 3 1 7 2 5 3 2 How to do Address Translation now? MMU replaces virtual page number with physical page 0x15 number (from page table), to get physical address. 64 bit address space. Need to represent 0 - 63 in binary. ○ 6 bit virtual addresses! Have 128 bytes of physical 0x75 memory… ○ 7 bit physical addresses! A Note on Offset PPN 7 The offset bits tell us WHERE in the 0x0 A 0x1 B page to grab our byte. 0x2 C 0x3 D 4 bits of offset, indicates 24 = 16 byte 0x4 E 0x5 F pages. 0x6 G 0x7 H What is the value of the byte at 0x8 I 0x9 J the below physical address 0xa K access? 0xb L 0xc M 0xd N 0xe O 0xf P What’s in a Page Table? Let’s imagine that the Page Table lives in physical memory. The OS manages the page table. The MMU now contains a register to point to the page table. Linear Page Table uses VPN as an array index to find PPN. Page Table Entry Includes also: Resident (or Present/Valid) bit. Dirty bit (if page has been modified/written). Could also have permission bits, etc! Let’s Zoom Out – Multiple Running Processes MMU Example 256 bytes/page VPN R D PPN 16 virtual pages 8 physical pages 0 1 0 2 12-bit VA (4 bits of vpn, 8 1 0 – – bits offset) 11-bit PA (3 bits of ppn, 8 2 1 0 4 offset). 3 1 0 1 Virtual Address: 0x3c8 4 1 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 Problem: Our memory is SMALL! Fix 1: Only allocate page on demand. (pretty good!) But… Let’s say we have an array of 100.bmp images = 1 GB of memory. 33 processes and our computer crashes? Additional Modification: Page Swapping Disk has a LOT of storage (1 TB). Plan: Swap pages from and then onto disk! Back into play: the Resident bit and dirty bit! Resident Bit = 0 when the page is in storage, =1 when it is in memory Dirty bit = 1 when we have modified a page in memory (might need to change it in storage) vpn vpn vpn 1 3 1 What happens when we pp4 (5) pp6 (0) access a page not in pp8 (13) physical memory? Page Fault! 1. Let’s say we access page VPN 5. We see it is not resident, but instead on Disk. a. PAGE FAULT →MMU invokes OS handler 2. If memory is full, replace a page (say VPN 1), PPN 4. a. Update page table, write page to disk if dirty. Replacement Policy How to decide which page to kick back into storage? For this class: LRU. (Mark the least recently used page, and swap that page back into storage) Bonus: Locality! Cons: Expensive. (Time to scan for LRU, dirty pages take longer to swap) Neat Bonuses of Page Tables Finally, a side effect that is not negative! mmap ○ Explicitly tell OS to map file on disk (in storage) to a region in process VAS. FAST!! Sharing ○ Can map the SAME physical memory into multiple processes virtual memory to SHARE memory. 
○ More efficient– 100 copies of the same running process can have ONE physical memory page corresponding to the instructions for that process Next Time How many memory lookups do we need to perform every time we access memory? Hint: the page table itself resides where? 2 memory lookups (one to get the page table entry and one with the physical address) ○ Memory access is SLOW So… how do we make page tables faster? Examining Virtual Memory through Examples Address sizes are tied to the size of the addressable space they describe. If you have 2^10 bytes of space, you need 10-bit addresses. If you have 12-bit addresses, you can have 2^12 bytes of memory! This holds true in physical memory OR virtual memory, i.e. 10-bit virtual addresses means you have 2^10 bytes in your address space; 10-bit physical addresses means you have 2^10 bytes of physical memory. The math of page tables. Physical Address: PPN + offset physical_address_size = log2(number_of_physical_pages) + log2(page_size) physical_memory_size = number of physical pages * page size Virtual Address: VPN + offset virtual_address_size = log2(number of virtual pages) + log2(page size) virtual_address_space_size = number of virtual pages * page size C Programming PART THREE Arrays, Strings, and Pointer Arithmetic Strangely, these all go hand-in-hand! Administrative Details: Midterm Exam on Thursday! ○ Covering all of Assembly, and L15 (Basic C and functions) What C is on the exam? You will not write C code, but you may have to read a translation of C into assembly and understand the calling convention for functions. Read Piazza for exam logistics! What we covered last time C Pointers! ○ A variable that contains an address! How do we declare a pointer to a character? char* pointer_to_c; ○ Operations: Address of (&) - gets the address of a variable; Indirection (*) - dereferences a pointer Arrays in C This is an example of two statically declared arrays ○ We will get to dynamic allocation next week! Must declare the size of the array ○ Can be implicit via assignment Must also declare the type Where does this array exist in memory? ○ The stack! ○ If I declare an array as a global variable, where does that go? (Side note: sizeof(op) gives the size in bytes of either the TYPE or the VARIABLE op, i.e. sizeof(int) is 4.) Array Access Syntax: Subscripting Index an array via the arr[index] syntax. ○ Gets the ith value of the array Array Syntax No bounds checking! What will this print? Why? What will happen when I go out of bounds? That's undefined behaviour … >:) Why? A picture of our arrays in memory: arr occupies 0x50, 0x54, 0x58, 0x5c (each element 99) and arr2 occupies 0x60, 0x64, 0x68, 0x6c (101, 102, 103, 104). However, this layout is not guaranteed! Sometimes out-of-bounds accesses will cause errors, or overwrite other data: undefined behavior. Diving Deeper: C Arrays are similar to pointers! The value of the array variable is equal to the address of the starting element! An expression that has type "array of type" is converted to an expression with type "pointer to type" implicitly ○ What will *arr do? Subtle difference between array and pointer: ○ Except when using the sizeof operator, & operator ○ Subtle detail: arr == &arr == &arr[0] Picture in Memory int arr[] = {1,2,3,4}; … int* p = arr; // equivalent to: int *p = &arr[0]; In memory, arr sits at 0x50 (values 1, 2, 3, 4 at 0x50-0x5c) and p sits at 0x75, holding the value 0x50. Arrays as Parameters These two methods, sumit2 and sumit, are no different: int * A is the same as int A[] The callee can do operations as if A is an array or a pointer! (A short sketch follows below.)
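The transcript names sumit and sumit2 but does not include their bodies, so the following is a minimal, assumed version of what that slide describes: the array argument decays to a pointer to its first element, the two parameter spellings compile to the same thing, and the size information is lost, which is why a length is passed separately.

#include <stdio.h>

// The array argument decays to a pointer to its first element,
// so a separate length parameter is needed.
int sumit(int *A, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += A[i];            // identical to *(A + i)
    return total;
}

// Declaring the parameter as int A[] means exactly the same thing.
int sumit2(int A[], int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += *(A + i);
    return total;
}

int main(void) {
    int arr[] = {1, 2, 3, 4};
    printf("%zu\n", sizeof(arr));      // 16: the real array, 4 ints of 4 bytes
    printf("%d\n", sumit(arr, 4));     // the value passed is &arr[0]
    printf("%d\n", sumit2(arr, 4));    // same result
    return 0;
}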
Under the hood the same thing happens: C is pass by value, and the value is the address of the array. NOTE: array loses size information Pointer Arithmetic When you add to a pointer, it increments by the size of the type it is pointing to. Arrays in Memory Let’s use gdb to inspect an array of characters! Let’s use array indexing! p &arr p &arr p &arr p p_arr p p_arr + 1 x /1gd p+1 Pointer Arithmetic, Array Indexing! Notice: *(a + i) == a[i] When i == 1, what is *(a+1)? a = 0x50 a + 1 = 0x54 a 0x50 1 *(0x54) = 2 0x54 2 0x58 3 0x5c 4 When i == 2, what is *(a+2)? a = 0x50 a + 2 = 0x58 *(0x58) = 3 Connection: In assembly, C Strings == Arrays we have been inspecting strings in memory… Subtle Detail: [‘H’, ‘e’, ‘l’, ‘l’, ‘o’, ‘ ‘, ‘C’, ‘S’, ‘2’, ‘1’, ‘0’, ‘\0’] Printing arr_msg will result in an address on the stack Printing p_msg will result in an address in the static data section of memory Standard Library for C Strings Next Time… Malloc! (Review Structs) Linked Lists! Structs! Structs as Parameters { int: 40 int: 1999 double: 3.2 C is pass by } value… arguments: { int: 40 41 int: 1999 double: 3.2 } return addr age_up stackframe Pointers to Structs We can declare a pointer to a struct! We can dereference that pointer in order to modify the original struct. Improved Access to Struct: “right arrow selection” Why (*sp).age++? Why not *sp.age++? Instead: sp->age++ sp->age is equivalent to (*sp).age Fixing age_up(); What should our parameters to age_up be? What should our body code do? How should we call age_up? Problems with Pointers What will happen? Pointers As Return Values. What will happen? Summary: Pointers as Return Values What is a Segmentation Fault? A fault raised by the hardware when we attempt to access memory that we should not. Then, switch control to the OS and we follow our fault handling plan. ○ (in this case, kill the program). Do you remember the fault handling plan in the beta processor? C Programming PART Four Dynamically Allocated Data Structures The Grand Finale of our Journey with C! Administrative Details: PS4 Extension! ○ Next week, wednesday! ○ You have all information to complete, especially after today ;) What we covered last time C pointer arithmetic ○ When you increment or decrement an int* the value of the address changes by how many bytes? ○ Answer: By multiples of FOUR bytes! (The size of an int) C arrays & strings A long time ago, very quickly, structs Structs A struct is just a collection of different variable types. Access fields of a struct via the. operator. Nice to have an array of structs for example, grouping together a few different types for convenience. Under the hood: Each field is layed out sequentially in memory. Pointers to Structs We can declare a pointer to a struct! We can dereference that pointer in order to modify the original struct. Improved Access to Struct: “right arrow selection” Why (*sp).age++? Why not *sp.age++? Instead: sp->age++ sp->age is equivalent to (*sp).age Modifying Struct via pointer. What should our parameters to age_up be? What should our body code do? How should we call age_up? Problem 1: What will happen? Problem 2: What will happen? Notice: we have a pointer as a return type! Summary: Pointers as Return Values What is a Segmentation Fault? A fault raised by the hardware when we attempt to access memory that we should not. Then, switch control to the OS and we follow our fault handling plan. ○ (in this case, kill the program). Do you remember the fault handling plan in the beta processor? 
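The age_up fix is asked about but not shown in the transcript, so here is a hedged sketch. The struct tag and field names (person, age, birth_year, gpa) are assumed; the slides only show an int/int/double layout holding 40, 1999, and 3.2.

#include <stdio.h>

// Field names are assumed for illustration; the slides only show
// the values 40, 1999, and 3.2 laid out as int, int, double.
struct person {
    int age;
    int birth_year;
    double gpa;
};

// Passing a pointer lets the callee modify the caller's struct,
// instead of a copy sitting in age_up's stack frame.
void age_up(struct person *sp) {
    sp->age++;                 // equivalent to (*sp).age++
}

int main(void) {
    struct person tom = {40, 1999, 3.2};
    age_up(&tom);                    // pass the ADDRESS of tom
    printf("%d\n", tom.age);         // 41: the original struct changed
    return 0;
}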
But what if we really want to create a data structure in a function and return it? How to fix: Dynamically Allocated Memory Problem: We need memory that won’t go out of scope. We don’t know how much memory we need until runtime. The Heap: Grows and shrinks with EXPLICIT instructions Java vs C Java has implicit memory management via reference counting and garbage collection. When memory on the heap is not used, automatically freed. Malloc ps -eo %cpu,pid,%mem,command -n What happens when: Notice: How much space did we malloc? Get a block of memory Equivalent to a 500,000 length double array! We can treat the pointer to the malloc’ed block as an array! Free Example: Problematic Treatment of Memory Memory Leak! pp q Why is this a problem? Malloc a Struct Memory Management Library: Under the Hood Example: bytes, word aligned allocation Heap setup via a system call. Uses mmap() OS syscall, gives us a pool of memory. Malloc is apart of C standard library: Has internal data structure with additional information about freed and allocated blocks. Problems: fragmentation! Linked List! 4 3 144 10 NULL Implement a linked list: Where you can add elements to the front Where you can remove elements head from the front Where you can print the contents of the linked list The Node Type What should our struct for a linked list node look like? What if we want to modify the head of our list (ie ADD)? add_ll(l,5) 4 3 144 10 NULL 5 head remove(l) to_remove 4 3 144 10 NULL head print(l) 4 3 144 10 NULL head Other pitfalls: Using a pointer to a block after calling “free” ○ free(p); // after this p is still the same address ○ Could cause an error, but might instead cause chaos! ○ Could overwrite data of other part of program “Use after Free Bug” ○ Set pointer to NULL after free. And that’s your intro to C! What’s next?? What’s to come… So we understand how one program runs on your computer, but there are a few open questions: How can we make our computer better? What mechanisms do modern computers use to achieve decent performance? More on Pipelining, caching, and other fun tricks! How is it possible we have multiple processes running on our computer without interfering with each other? ○ More on the Operating System! C Programming PART TWO The scariest part of C! Boo! Pointers What we know so far: C Basics Assignment Arithmetic Operations Loops/Conditionals How C works under the hood: Functions Functions! (C syntax, it’s not so bad.) Return type Parameters “Pass by value” Function Body Functions… under the hood! (We saw this before!) compile Turn off optimizations with gcc, to get a complete picture of the stack. Notice: rdi is the parameter. New Question: What is rbp? Zooming Out: Memory of a Running Process We must understand what is going on in memory as.text we run a program. Reminder: Which way does the stack grow? Conventions for Function Calling Why the calling convention is important: Linking with other libraries Understanding common compiler-generated assembly snippets Calling Convention Describes: Which registers to use for parameters Rules for saving/restoring registers in functions: ○ Caller/Callee saved Registers Calling Convention: Parameters 0xff View of the Stack Caller’s frame includes: Any local variables / saved registers Base needed pointer Any arguments for callee that not fit rbp in registers Return Address Callee Frame includes: Any saved register values Local variables Arguments if it is going to call another function 0x00 Which registers does the caller have to save? 
Which registers does the callee have to save? We divide the registers into 2 groups. Caller saved: (the callee can expect these to be free to use) Callee saved: (the caller can expect that these will not be modified by functions) However, sometimes, we need to use many registers, so we will save restore to maintain this convention! X86’s Caller-Saved Registers X86’s Callee-Saved Registers Exam Reference Sheet Example (caller saved register): mov rax r12 , 1 mov rax, 1 call printf push rax call printf cmp rax r12 , 1 pop rax cmp rax, 1 # can you say for certain that rax is still one after the call returns? # now, can you say for certain that rax is still one after the call returns? How RBP is updated. Function Setup sumit: An example of a callee-saved register. # Preserve current frame pointer push rbp # Create new frame pointer pointing to current stack top mov rbp, rsp rbp # allocate 20 bytes worth of locals on stack. sub rsp, 20 Caller stack frame. Function rsp Teardown rbp rsp Old rbp # clean up space for local variables add rsp, 20 # Restore old stack frame locals mov rsp, rbp rsp # Restore frame pointer pop rbp ret The Final Flow Caller: Save any caller-saved registers on stack Setup arguments in registers (or on stack) call ○ Push return address on stack Callee: Save any callee-saved registers on the stack ○ I.e. rbp (the special callee-saved frame pointer) Grow stack for any local vars EXECUTE Shrink stack/pop restore registers ret ○ Pop address from stack ○ Jump to popped return address The Flexible Call Stack: Stack Frames + Scope void hello() { void imagine() { hello Hello’s stack frame world(); … world World’s stack } } frame void world() { imagine forever Forever’s stack Imagine void forever() { frame stackframe imagine(); forever(); forever forever(); Forever’s stack frame } } forever Forever’s stack frame Now: Pointers! Pointers C allows us to access and modify memory directly. How? A pointer is a variable that contains the address of a variable Pointers are very common in C code Certain operations are only possible via using pointers Powerful but can cause tricky bugs Need to make sure we are not holding nonsense addresses Everything has an address 0xff 0x09 int i = 3 00000000 00000000 00000000 00000011 0x05 char c = ‘a’ 01100011 0x04 0x00 Syntax of a pointer type When we have a pointer type variable, we need to know what we are pointing to. Designates variable is a pointer char *address_of_c; int *address_of_i; Pointer to what type? Details on the C Pointer Type the value of the pointer is often represented in HEX. printf(“print a pointer %p\n”, i); int *i ~8 bytes 0x543f Pointer Operations & Get the A(N)Ddress of something… “Address of” Operator Directly related to what QWORD PTR [ ] does! * And the other one, the Indirection Operator! Okay, Let’s make some pointers… 0xff … 0x40 3 0x38 ‘a’ 0x30 0x40 0x28 0x38 What would the … value of &ai be? 0x0 Another example: Using a pointer int* ai int i 3 5 A puzzle: : A Reason Why Pointers are Useful: What will this print? No change will occur, because C is pass by value. C is pass by value because it is copying arguments into registers, copying onto the stack. Changes do not occur on the original variable. How could we fix this? Pointers as Parameters Pointers as Parameters to Functions Let’s write a function that changes the value of an integer (pointed to by a passed in argument) to the number 2. 
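One way to write the function the slide asks for, as a small sketch (the name set_to_two is a placeholder, not from the slides): the parameter is an address, and writing through it changes the caller's variable, which plain pass-by-value cannot do.

#include <stdio.h>

// Store 2 at whatever address the caller hands us.
void set_to_two(int *p) {
    *p = 2;                   // indirection: write through the pointer
}

int main(void) {
    int i = 3;
    set_to_two(&i);           // & gives the address of i
    printf("%d\n", i);        // prints 2: the caller's variable changed
    return 0;
}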
Completing our understanding of scanf Structs int = 4 bytes, Struct definition: double = 8 bytes Structs Groups together variables of different types. Ordering of definition defines layout in memory (side note: compiler pads structs to a memory-friendly alignment) Gdb with C Make sure to compile with DEBUG flag : ○ gcc struct_ex.c -Wall -g -o struct_ex Print out 16 bytes in memory at location of tom. What is the first 4 byte integer in decimal? 0x00000028 = 40 What is the second value in human readable hex? 0x000007cf = 1999 Structs as Parameters { int: 40 int: 1999 double: 3.2 C is pass by } value, for arguments: { ALL data int: 40 41 types. int: 1999 double: 3.2 } return addr age_up stackframe Pointers to Structs We can declare a pointer to a struct! We can dereference that pointer in order to modify the original struct. Improved Access to Struct: “right arrow selection” Why (*sp).age++? Why not *sp.age++? Instead: sp->age++ sp->age is equivalent to (*sp).age Fixing age_up(); What should our parameters to age_up be? What should our body code do? How should we call age_up? Problems with Pointers What will happen? Pointers As Return Values. What will happen? Summary: Pointers as Return Values What is a Segmentation Fault? Hardware fault raised by the hardware when we attempt to access memory that we should not. Then, switch control to the OS and we follow our fault handling plan. ○ (in this case, kill the program). Do you remember the fault handling plan in the beta processor? Next Time… Pointer Arithmetic! Arrays! Strings! The Operating System PART ONE Timesharing the CPU Why our computers can run more than one program… What we covered last time Ways to improve the performance of various hardware components ○ Caching improves memory latency ○ Pipelining improves processor throughput The Operating System (from a user’s perspective) ○ Processes ○ Syscalls! (read, write) Now: The mechanisms the Operating System Uses Question: What is the job of the Operating System? The Job of the Operating System: Virtualization Virtual physical apples apple Take a physical device that is shared (cpu, memory) and VIRTUALIZE them. aka: Give the illusion that the physical device is NOT shared. Isolate each process from each other. A thought experiment: Where is the virtualization occurring in this picture of our computation stack? The new question: How does the OS virtualize different physical compute resources? Today and to the end: A tiny taste of OS How is it possible we can run multiple programs on our computer at the same time? How can 1 cpu run several programs at the same time? How can our single unit of physical memory support several programs at the same time? Answer: The Operating System! Virtualizing the CPU! Virtualizing Memory! Our Issue: Two Processes, One CPU Question: What is going on under Proc Proc the hood? 1 2 At the hardware level? At the OS level? CPU HARDWARE F r1 Fetch-> Decode ->Execute r2 r3 Experiment: taskset -c 0 time./prime --option VS taskset -c 0 time./prime --option & taskset -c 0 time./prime --option lscpu taskset time Observations “User” time stays the same. “Elapsed” time for both processes is doubled. They both get ~50% of the CPU! What is the mechanism?? First Attempt: Direct Execution? Operating System Process Create entry for process list Allocate memory for program Load program into memory Clear registers Execute call main() Run main() Execute return from main Free memory of process Remove from process list What’s the problem here? Wait… Didn’t we have system calls? 
Create entry for process list Allocate memory for program Load program into memory Set up stack with argc/argv Clear registers Execute call main() Run main() Execute syscall Return from syscall to Process Perform syscall Execute return from main() Free memory of process Remove from process list Better idea: Taking Turns! OS Proc B Proc A What state must the OS save and restore in order for the return to proc A to work as expected? What does the OS need to know about a process? The contents of it’s registers: context! The PC, of where it is last executed. Other things: - Open files (table of file descriptors) - Process ID (pid) 1. Run Process 0 for some time 2. Switch control to the OS a. (save p0 context) 3. Run OS code, setup Process 1 a. (restore p1 context) 4. Switch execution to process 1 5. Run process 1 for some time. a. (save p1 context) But how can the OS regain control of the CPU? How do we switch control to the OS? Processes yield with a syscall, switching control to the OS and allowing the OS to perform a privileged operation/ Experiment: Let’s time a different program. time./loader What do you expect? Second Attempt: Cooperative Approach When does a process voluntarily give control to the OS? Plan: wait for a syscall. Then, when in OS, switch control back to another process. OS Proc B Proc A syscall syscall What’s the problem here? Third Attempt: Interrupts Use involuntary switches to OS (via interrupts) in order to switch off between processes! Interrupts Occur When: Physical IO devices have input (ie keyboard typing) Hardware storage disk has state change that needs attention Wait… Multiple running processes! So what is the hardware OS, virtualizing the CPU doing? Hardware support for Interrupts!? First… How and when do interrupts occur? A physical device needs to communicate with the OS. In reality: Let’s take a look at this IRQ signal… Jmp to Signal OS code IRQ Lifecycle of a Timing Interrupt Minimal Hardware Implementation of Interrupts Check for Interrupt Request signal (IRQs) ○ On IRQ j Copy PC+4 into Reg[XP] Install j (address of instruction) as new PC (handler code) OS Interrupt Handler ○ Save process state (context) ○ Call C procedure specific to WHICH interrupt arrived (j) ○ Re-install procedure state (from saved context) ○ Return execution to Reg[XP] At BOOT time OS sets up the hardware’s trap table (which contains the addresses in the kernel for the interrupt handlers) One of those Interrupting Devices… Is a Timer! Proc Proc 1 2 Timer Device Operating System Process Create entry for process list Allocate memory for program Load program into memory Clear registers Execute call main() CPU Free memory of process Remove from process list Basic Idea: Scheduling A protocol in the OS (or in any system), which makes a decision about which process to run. Each time we interrupt/switch to OS, decide whether or not to INVOKE (start up) the scheduler. The scheduler policy decides whose turn it is, taking into account: - Priority - Fairness - Blocked for resource? - And many other metrics!! Example: Round Robin (RR) Scheduling Each time slice, take turns. One time slice to Process 1, one time p1 p2 p3 p4 slice to Process 2, and so on and so forth. OS (running RR scheduler) CPU What is running on the time CPU? Nice! We virtualized the cpu… But what about memory? Virtual Memory: The goal Each process to have own illusion of one big memory. WITHOUT having to go through OS for every read/write (SLOW!!) 
WITHOUT the possibility of reading/writing other processes' memory… Review: What we know about OS so far The user's perspective… https://piazza.com/class_profile/get_resource/lr3vhra0lk74qt/ltegk38clmt2gi https://pages.cs.wisc.edu/~remzi/OSTEP/cpu-mechanisms.pdf The Operating System PART THREE Virtual Memory Making it faster! And better! What we covered last time Virtualization How the Operating System Virtualizes Memory ○ Address Translation: The MMU takes VAs and translates them to PAs ○ Paging: Split physical memory into chunks (pages) and map virtual pages to physical pages Why does paging help us save physical memory? ○ Page Tables: Data structure containing the mapping, managed by the OS What is in a page table entry? ○ Swapping: When memory is full, let's use storage! When the MMU looks for a page and sees it is not resident, what happens? Goals for today Review (with examples) what we saw last time! Review Page Faults ○ See an example of a page fault! Learn why the page table is slow… and make it faster! See some demonstrations of cool paging tricks that systems programmers take advantage of! Virtualizing the CPU gives the illusion that each process runs alone on its own CPU. OS virtualizes the CPU via time sharing and scheduling! Virtualizing Memory gives the illusion that each process has its own address space (diagram: virtual addresses → MMU → physical addresses). Paging: Review (figure: two example page tables, one per process, each mapping VPN → resident bit, PPN). Demo: Take a look at a page map! ps -a cat /proc//maps Properties of Addresses, why paging works nicely! Address sizes are tied to the size of the addressable space they describe. 2^b bytes can be addressed by b bits. 2^p pages can be indexed by p bits. If you have 2^10 bytes of space ○ You need 10-bit addresses! If you have 12-bit addresses ○ You can have 2^12 bytes of memory! If you have 3 bits of offset ○ You can address an 8-byte page! If you have 8 bits of VPN ○ You can have 2^8 pages! Why are offset bits the lower bits of the address? Physical pages are laid out contiguously in memory (figure: PPN 0 at 0x000-0x0FF, PPN 1 at 0x100-0x1FF, PPN 2 at 0x200-0x2FF, PPN 3 at 0x300-0x3FF). ○ All addresses in a physical page have the same high-order bits! ○ Locality in a page! Incrementing the PPN just increases the address by page_size. If I want to access the 17th byte of PPN 1, what address would I use? 0x100 + 0x11 = 0x111 = 0b1 0001 0001 (the low bits are the offset into the page, the high bits say which page). The math of paging (hint: you don't need to memorize this). Virtual address = VPN + offset, physical address = PPN + offset. Physical Address: PPN + offset physical_address_size = log2(number_of_physical_pages) + log2(page_size) physical_memory_size = number of physical pages * page size Virtual Address: VPN + offset virtual_address_size = log2(number of virtual pages) + log2(page size) virtual_address_space_size = number of virtual pages * page size Review: Page Swapping Only 32 GB of memory… (Not bad, but could be better!) Disk has a LOT of storage (1 TB). Plan: Swap pages to and from disk! When a page is not in memory (it is in storage), trigger a PAGE FAULT Back into play: the Resident bit and Dirty bit! Resident Bit = 0 when the page is in storage, = 1 when it is in memory Dirty bit = 1 when we have modified a page in memory. When evicting this page, we NEED to write it back to storage. Page Fault! 1. Let's say we access page VPN 5. We see it is not resident, but instead on disk. a. PAGE FAULT → MMU invokes OS handler 2. If memory is full, replace a page (say VPN 1), PPN 4. a.
Update page table, write page to disk if dirty. Replacement Policy How to decide which page to kick back into storage? For this class: LRU. (Mark the least recently used page, and swap that page back into storage) Bonus: Locality! Cons: Expensive. (Time to scan for LRU, dirty pages take longer to swap) Example: Page Fault Page size: 128 bytes RAM Virtual pages = 16 VPN R D PPN Vpn 4 Physical Pages = 8 0 1 0 2 1 0 – – Vpn 3 2 0 - - Offset bits = 7 Vpn 0 VPN bits = 4 3 1 0 1 4 1 0 0 Vpn 6 PPN bits = 3 5 0 - 6 1 1 3 Vpn 12 7 0 - - mov rax, [0x56b] Vpn 14 8 1 0 7 9 1 1 6 Vpn 9 10 0 - - 11 0 - - Vpn 8 12 1 0 4 1 0 1 0 1 1 0 1 0 1 1 13 0 - - 14 1 1 5 15 0 - - Page Fault!! LRU = VPN 14 disk Example: Page Fault Page size: 128 bytes RAM Virtual pages = 16 VPN R D PPN Vpn 4 Physical Pages = 8 0 1 0 2 1 0 – – Vpn 3 2 0 - - Offset bits = 7 Write VPN 14 to Vpn 0 VPN bits = 4 3 1 0 1 disk from RAM 4 1 0 0 Vpn 6 PPN bits = 3 5 0 - Read VPN 10 from 6 1 1 3 to PPN 5 disk Vpn 12 7 0 - - mov rax, [0x56b] Vpn 10 8 1 0 7 9 1 1 6 Vpn 9 PA = 0x2eb 10 1 0 5 11 0 - - Vpn 8 12 1 0 4 1 0 1 0 1 1 0 1 0 1 1 13 0 - - 14 0 - - 1 0 1 1 1 0 1 0 1 1 15 0 - - LRU = VPN 14 9 Experiment Time– Let’s Trigger Page Faults /usr/bin/time -v./pf./pf-city.sh Think: why do no page faults occur, if our loop is small? Think: At what # of copies of pf running should we start to see page faults? One Final Problem… How many accesses into memory occur on one mov [0x54123], 3 instruction? Ideally– one However, with our page table design, we ALWAYS have an additional lookup into the page table itself! Page Tables are slow! How could we fix this? A cache for Page Table Entries themselves! (note this IS not managed by software (the OS), but instead by hardware!) Translation Lookaside Buffer (TLB) Let’s add a cache inside the MMU! TLB is a small, fully associative cache! VPN vpn ppn R vpn ppn p PPN PPN TLB Locality! Miss Hit Hit Hit Spatial Hit Run Hit locality Miss again! Hit Hit Hit Hit Temporal Hit hit locality Hit Miss Hit hit Hit hit Hit Details… What about multiple processes?? Is the TLB shared between processes? (Hint: where does it reside?) Option 1: Clear TLB on context switches between processes. Option 2: ○ Add a little bit of information– The ASID (Address Space identifier), which is similar to PID, to know WHICH process the cache entry belongs to. ○ The OS will need to set ASID register in MMU. TLB TLB Replacement Policy: Also can assume something like LRU. No ASID? Assume TLB cleared on process switches. VPN R D PPN Vpn 4 VPN R D PPN 0 1 0 2 Vpn 3 4 1 0 1 1 0 – – 0 1 0 0 2 0 - - Vpn 0 3 1 0 1 LRU = VPN 0 4 1 0 0 Vpn 6 5 0 - Vpn 12 6 1 1 3 7 0 - - Vpn 14 MMU Steps: 1. check TLB for VPN LRU = VPN 6 Vpn 9 a. HIT: use PPN! Vpn 8 b. MISS: go to page table in memory i. Evict LRU TLB line, replace with new PTE Experiment: Unfriendly to the TLB? Think: what type of program might generate a lot of TLB misses? Plan: Random accesses into an array! Question: How big does the array need to be? Much bigger than a single page! Demo: TLB-okay1, TLB-okay2, TLB-miss Using perf Neat Bonuses of Page Tables Finally, a side effect that is not negative! mmap ○ Explicitly tell OS to map file on disk (in storage) to a region in process VAS. FAST!! Sharing ○ Can map the SAME physical memory into multiple processes virtual memory to SHARE memory. 
Dynamically Linked Libraries Demo: mmap Demo mmap + sharing Linux supplies: The programs ld.so and ld-linux.so* find and load the shared objects (shared libraries) needed by a program, prepare the program to run, and then run it. Dynamic Libraries libc.so is a shared dynamic library! There we go! Next Time: Last lecture! Tiny bit o’ Review Advanced Topics Examples! Poll: What are you more interested in? ○ Applications of what we learned to performance (writing REALLY fast code!) ○ Applications of what we learned to security (using systems knowledge to break things!) Memory Hierarchy and Caches CS210 Fall 24 Let’s talk about memory As of now, memory has been the solution to everything Where to put your program? Load it in memory Not enough space in the registers? Put it in memory It is typically the most costly component in terms of space and time The performance of our processor depends greatly on the bandwidth of the connection between the CPU and main memory! How do we minimize the problems caused by this bottleneck? Memory Technologies Each have different tradeoffs! Note that as the capacity increases, the latency increases. How can we get both large capacity and low latency? Memory Hierarchy Use hierarchical system of these memory technologies to emulate a large, fast, and cheap memory Memory Hierarchy The Idea: Place the most often accessed data in the small fast memory technologies, and leave the rest in the larger memories Question How do we know what data will be most often accessed by a program? By using the Locality Principle! Memory Reference Patterns The Locality Principle Thus, the data that will be most often used is likely to be data that has been accessed before by this program or is nearby those accesses. We will want to populate our cache with these blocks of data Caches A cache is a small interim storage device that retains (aka caches) data from recently accessed location Memory access is fast if data is cached Otherwise, it is slower as it needs to access a larger cache or memory to retrieve the data A Typical Hierarchy Cache Access Main Processor Cache Memory Accesses memory by sending an address to the cache Cache Hit: data for that address is stored in the cache, then returned quickly Cache Miss: the data for that address is not in the cache, then go one level down to check the next memory device Cache Access Main Processor Cache Memory How does data get into the cache initially? As a result of a Cache Miss. Bring the block of data just accessed in main memory and store it in the cache The next time this data is accessed, there won’t be another cache miss! Cache Metrics Hit Ratio: Miss Ratio: Average Memory Access Time: How do we design our cache? Which subsets of a program’s memory should the cache hold? When should the cache copy a subset of a program’s data from main memory to the cache, or vice versa? How can a system determine whether a program’s data is present in the cache? Direct Mapped Cache Direct Mapped Cache Direct Mapped Cache Direct Mapped Cache Address Division Offset: The least significant bits of the memory address. Determined by the block sizes of the cache. Ex: Block size = 8 bytes, then Addr[2:0] indicates the offset Index: The middle bits of the memory address. Indicates which line to access. Determined by the number of lines in the cache. Ex: Lines = 128, then Addr[9:3] indicates the index Tag: The remaining bits of the memory address. 
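The offset/index/tag split described above can be sketched in a few lines of C. The block size (8 bytes, so Addr[2:0] is the offset) and line count (128, so Addr[9:3] is the index) are the example numbers from the slide; the address value itself is made up for illustration.

#include <stdio.h>

#define BLOCK_SIZE 8     // bytes per block -> 3 offset bits, Addr[2:0]
#define NUM_LINES  128   // cache lines     -> 7 index bits,  Addr[9:3]

int main(void) {
    unsigned int addr = 0x1234;   // an arbitrary example address

    unsigned int offset = addr & (BLOCK_SIZE - 1);               // low 3 bits
    unsigned int index  = (addr / BLOCK_SIZE) & (NUM_LINES - 1); // next 7 bits
    unsigned int tag    = addr / (BLOCK_SIZE * NUM_LINES);       // the rest

    printf("addr=0x%x offset=%u index=%u tag=0x%x\n", addr, offset, index, tag);
    return 0;
}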
Example: Example: Example: Example: Example: Types of Cache Misses: Compulsory/Cold-start misses: The first time a program accesses a memory location. Programs cannot avoid these. Capacity Misses: The program accesses more memory than what the cache can physically store. Conflict Misses: Some cache designed limit where the data can reside, which can lead to misses A direct mapped cache suffers from this quite a bit. Associative Caches Allows for flexibility to choose more than one location to store data in the cache. 2-Way Set Associative Caches Question What happens when a cache is full, and you have a cache miss? The data is then retrieved from main memory, and we place the data in the cache Where do we put the data in an associative cache? Replacement Strategies LRU (Least Recently Used) FIFO/LRR (First In First Out, Least Recently Replaced) Random And any others that you can think of Example: Example: Example: Example: Example: How much associativity? Why stop at 2, why not have a 4 –way associative cache? Or 8? Or just however many possible (aka a fully associative cache)? Our search functionality becomes more and more complex the more ways we have. An overview: Direct-mapped Compare the address with only one tag Memory block can be stored in exactly one line Results in conflict misses N-Way set-associative Compare the address with n tags simultaneously Memory block can be stored in exactly one set, but in any of the lines in the set Must consider the replacement policy Next time So far, we have only been working with loads/reads. What happens when we need to store/write to memory? The End: Systems & Performance Taking all you learned and putting it to good use! Final Exam (the season finale) Tue 12/17/24 9:00AM - 11:00AM in LAW AUD Not Cumulative! Topics: Advanced C (pointers + malloc) Pipelining Caching OS: timesharing + Virtual Memory Bonus Questions : ○ Midterm 1: Architecture Unit Question ○ Midterm 2: Intel Assembly + Basic C Unit Question More on Bonus Questions There will be 2 additional questions at the end of the exam that will not count towards your final exam score, but: Each question will contribute to your midterm 1 and midterm 2 scores respectively These questions will increase the respective midterm score by up to 3% But, your midterm score will cap at 90% (so if you received 90% on midterm 1, and got full points on the corresponding bonus question, your final midterm 1 score will remain at 90%) Course Evaluation! Completely anonymous Very helpful in improving this class as well as the curriculum in the department Make your voice heard! https://go.blueja.io/UZbqyRwov0Cj5sBEG4eueA Goals Review what topics should be fresh in your memory for final exam Hands-On Applications of Programs that: ○ Cause Page Faults ○ Cause TLB misses Why this all matters (and why you should be happy you understand CS210) ○ Performance! ○ The future of computing! What did we do in CS210? Dropped on a deserted island with no food, water, or… computer!! The Final “Make it Better” Unit C programming ○ Malloc (dynamic memory allocation) ○ Calling Convention (how C is compiled into assembly) ○ Pointers (pointers to structs, pointers to pointers, etc!) Hardware Improvements: ○ Pipelining Metrics: Throughput, Latency! ○ Caching Direct-Mapped, Set Associate, fully-associative Locality Cache coherence – when writes occur! Operating System: Running multiple programs on one computer! ○ Timesharing CPU + interrupts ○ Virtual Memory Address translation, Page tables, TLB Demo 1: Page Faults! 
When a page is not in memory (instead in storage) trigger a PAGE FAULT ○ Chose a victim page via REPLACEMENT policy If victim page is DIRTY, write it back to storage. ○ Read the requested page from disk into memory ○ Update the page table to reflect evicted victim page and newly resident page The Resident bit and Dirty bit: Resident Bit = 0 when the page is in storage, = 1 when it is in memory. Dirty bit = 1 when we have WRITTEN to a page in memory. When evicting a dirty page, we need to write it back to storage. How would you trigger a page fault? Option 1: Have a program malloc a lot of memory. (How much?) Option 2: Have a program malloc a lot of memory, and also use all of it, frequently. Option 3: Have a program malloc a lot of memory, use all of it, frequently. Run many copies of this program. Experiment /usr/bin/time -v./pf./pf-city.sh Think: why do no page faults occur, if our write loop is small? Think: At what # of copies of pf running should we start to see page faults? Demo 2: Translation Lookaside Buffer (TLB) Let’s add a cache inside the MMU! TLB is a small, fully associative cache! VPN vpn ppn R vpn ppn p PPN PPN How would you trigger many TLB misses? Option 1: Allocate a super big array! (How big?) Option 2: Access an array randomly. Option 3: Allocate a super big array and access it randomly. Experiment: TLB-okay1, TLB-okay2, TLB-miss Using perf Applications of Systems Knowledge ❖ Improve the systems of today Notice inefficiencies & study the tradeoffs and concessions made in modern systems ❖ Design the systems of the future Knowing common systems design paradigms: abstraction, caching, etc. allows us to apply these same ideas to future technologies! ❖ Be a super-user of systems Know how computers work (under the covers) so you can: Write PERFORMANT code! Secure (or hack) systems of today! Hint: this section will review: Malloc + dynamic memory Caching Compilers + assembly to C So… Have we “made it better?” Performance Application with CS210 Knowledge You are working at a biotech startup… They have a python program, matmul.py, which they use to pre-process some genetic data. It takes 6 hours!! (Genes data is pretty big, 4096x4096 matrices!) They pay for Amazon cloud computing to run this process. ○ Costs $$$! You, a CS210 student, reimplement it in C… Guess how long matmul takes once you are finished? 3 hours 1 hour 16 minutes 20 seconds < 1 second …my C code just segfaults Review: Matrix Multiplication Experiment 1: Python v C? Setup: 1000 x 1000 matrices Use time! Result: For a 1000x1000 matrix: Python Implementation: 1m58.605s C implementation: 3 seconds! x80 speedup! Experiment 2: Caching Optimization valgrind --tool=cachegrind --cache-sim=yes./matmul1 matrix1.bin 500 500 matrix2.bin 500 500 ~8% miss rate! Loop Ordering? for i in rows: for j in col1: for k in col2: C[i][j] += A[k][j] * B[i][k] Consider loop order i,j,k vs k,j,i. What would be the best? Improvement valgrind --tool=cachegrind --cache-sim=yes./matmul matrix1.bin 500 500 matrix2.bin 500 500 With loop-reordering: Almost 0% D1 miss rate! On 2000x2000 matrix: Without loop-reordering: 37.6 seconds With loop-reordering: 20.63 seconds Okay. git add. git commit -m “switched loops around” git push Why are you doing that!? Makes no sense… Explain yourself… for i in rows: for j in col1: for k in col2: Why? Spatial locality! 
C[i][j] += A[i][k] * B[k][j] i,j,k loop ordering: j,k,i loop ordering: N i,k,j loop ordering: git commit -m “reordered loops to improve cache spatial locality” Experiment 3: Optimization Flags + Compiler Better assembly = faster code! -O1: With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time -O is the recommended optimization level for large machine-generated code as a sensible balance between time taken to compile and memory use: higher optimization levels perform optimizations with greater algorithmic complexity than at -O. -O2: Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code. -O3: Optimize yet more. Improvement: Summary Setup: 2000x2000 matrix Matmul (no optimization) = 20.63 sec matmul-o1 = 5.3 sec matmul-o2 = 5.2 sec matmul-o3 = 4.04 sec You Right Now: Wow! I should take a Compilers Course! Future Classes: Understand Parallelization CS350 and CS351 anyone? On 2000 x 2000 matrices Single Threaded: 4 seconds Parallelized: 2 seconds Improvement: Summary 4096x4096 matrix multiplication: Python program: 6 hours C program naive: 19 minutes C program cache-optimized: 177 seconds C program w/ compiler optimization: 54 seconds C program with trivial parallelization: 3 seconds! More improvements: Parallel in a more memory-friendly way (tiling, divide and conquer) = 1.9 sec Use intel’s Vectorization HARDWARE = 0.7 seconds!! CS210 Lecture 19 Performance and Pipelining Gotta go fast! Sabrina M. Neuman Assistant Professor, Boston University CS Looking Back: Building the System Stack C Software Higher-Level Languages Lower-Level Languages (e.g., C) Assembly Language Machine Compilers or Interpreters Language Assembly & Machine Language Done Processor Instruction Set Architecture Processor Architecture ALU Functional Units (e.g., ALU) Digital Logic Gates Logic Gates Devices & Circuits Hardware + “0’s and 1’s” 2 Next: Performance and Infrastructure Software Higher-Level Languages Lower-Level Languages (e.g., C) Virtual Memory & Caching More Helpful Operating Systems Infrastructure Compilers or Interpreters Assembly & Machine Language Instruction Set Architecture OS Processor Architecture Functional Units (e.g., ALU) Memory: Digital Logic Gates Virtual Memory & Caching Devices & Circuits Hardware + “0’s and 1’s” 3 Today: Processor Performance & Pipelining Beta Processor x86 Processors General Purpose General Purpose ISA: RISC ISA: CISC Low Throughput High Throughput 4 Defining Terms: Latency Latency Car Factory Input: Output: Car Order New Car tPD=5 tPD=2 tPD=1 tPD=2 Latency: – Time it takes to process a single input – Latency is a time 6 Latency Example 0 Car Factory Input: Output: Car Order New Car tPD=5 tPD=2 tPD=1 tPD=2 7 Latency Example 5 Car Factory Input: Output: Car Order New Car tPD=5 tPD=2 tPD=1 tPD=2 8 Latency Example 7 Car Factory Input: Output: Car Order New Car tPD=5 tPD=2 tPD=1 tPD=2 9 Latency Example 8 Car Factory Input: Output: Car Order New Car tPD=5 tPD=2 tPD=1 tPD=2 10 Latency Example 10 Car Factory Input: Output: Car Order New Car tPD=5 tPD=2 tPD=1 tPD=2 10 Latency: – Time it takes to process a single input – E.g., LatencyCarFactory=10 time units 11 Defining Terms: Throughput Throughput Car Factory Input: Output: Car Order New Car … tPD=5 tPD=2 tPD=1 tPD=2 Throughput: – How often a new output is produced – Throughput is a 
rate. Throughput Example: the Car Factory again (stage tPDs = 5, 2, 1, 2), stepped through frame by frame: without pipelining, a new car comes out at times 10, 20, 30, … Throughput: – How often a new output is produced – E.g., ThroughputCarFactory = 1 output per 10 time units = 1/10. Performance of a Processor. Recap: Parts of the Beta (datapath figure): Fetch an Instruction; Decode the Instruction and Read Registers; Execute the Instruction; Read/Write Memory as Needed; Write Back to Registers as Needed. Latency: tPD of the Beta (datapath figure). The longest path is for a Load: tPD Beta = tPD Fetch Instruction + tPD Decode Instruction & Read Registers + tPD Execute Operation on ALU + tPD Memory Read + tPD Write Back to Register File. Latency: Time it takes to produce a single output. Max Processor Clock Speed Set By tPD (datapath figure): tCLK is set by the Beta's tPD. Processor Performance: Want this to be small!
29 Processor Performance Throughput: Rate at which outputs are produced 30 Processor Performance Option 1: Faster Clock Option 2: More Instructions Per Cycle Throughput: Rate at which outputs are produced 31 A Need for Speed! Option 1: Pipelining  Faster Clock  Higher Throughput Option 2: Parallelism  More Instructions/Cycle  Higher Throughput 32 A Need for Speed! Option 1: Pipelining  Faster Clock  Higher Throughput Option 2: Parallelism  More Instructions/Cycle  Higher Throughput 33 Intro to Pipelining: Doing Laundry First cars, now laundry? The Laundry Example Input: Output: Dirty Laundry Clean Laundry 35 One Load at a Time Doing laundry one load at a time is slow Like waiting for 1 instruction to go all the way through the Beta processor… tPDs 30 60 36 Doing N Loads of Laundry Don’t want to wait for the whole tPD of both units every time… tPDs 30 60 37 Doing N Loads… The CS210 Way Drier is empty at first, but we keep Instead, let’s “pipeline” the laundry it busy later in “steady state”. process! tPDs 30 60 38 Performance: Latency vs. Throughput One Load at a Time One Load at a Time = Pipelining = Pipelining One Load at a Time = Pipelining = tPDs 30 60 39 Pipelining Combinational Logic Back to Combinational Logic… 41 Back to Combinational Logic… Time  42 Pipelined Logic IDEA: 43 Pipelining Conventions So, this is a 2-Stage Pipeline… 44 A Pipelining Recipe A B C 4ns 3ns 8ns D F 4ns 5ns E 2ns # Pipeline Stages: 0  Combinational Logic! Longest Path: 4+3+8+5 = 20ns → Latency: 20ns = 20ns → Throughput: 1 output / 20ns = 1/20ns 45 Start with Outputs  A B C 4ns 3ns 8ns D F 4ns 5ns E 2ns # Pipeline Stages: 1 Longest Path: 4+3+8+5 = 20ns → Latency: 1*20ns = 20ns → Throughput: 1 output / 20ns = 1/20ns 46 Draw Stages Across All Paths  A B C 4ns 3ns 8ns  D F 4ns 5ns E 2ns # Pipeline Stages: 2 Longest Path: MAX(4+3, 8+5) = 13ns → Latency: 2*13ns = 26ns → Throughput: 1 output / 13ns = 1/13ns 47 Focus on Bottlenecks  A B C 4ns 3ns 8ns  D F 4ns 5ns E 2ns # Pipeline Stages: 3 Longest Path: MAX(4+3, 8, 5) = 8ns → Latency: 3*8ns = 24ns → Throughput: 1 output / 8ns = 1/8ns 48 Use Fewer Registers if You Can  A B C 4ns 3ns 8ns  D F 4ns 5ns E 2ns # Pipeline Stages: 3 Longest Path: MAX(4+3, 8, 5) = 8ns → Latency: 3*8ns = 24ns → Throughput: 1 output / 8ns = 1/8ns 49 Pipelining Example 50 CPUs: Pipelining  High Throughput Do something kind of like this… 51 Deeper Pipelines  Higher Throughput Skipping some details we won’t cover in CS210 this term… 52 A Need for Speed! Option 1: Pipelining  Faster Clock  Higher Throughput Option 2: Parallelism  More Instructions/Cycle  Higher Throughput 53 Bottleneck: Slow Execution Units 54 Interleaving Units  More Laundry Done How do fast and furious CS210 Students do laundry? tPDs 30 60 55 Add Parallelism tPDs 30 60 56 More Execution Units  Parallelism Do all those ALUs have to be the same? Do they have to be doing the same instruction at the same time? 57 Next Time: Improving Memory Performance! 
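The latency and throughput numbers in the pipelining recipe above can be reproduced with a short calculation. This is only a sketch of that arithmetic, using the module delays from the recipe slides (4, 3, 8, 5, and 2 ns): the clock period is set by the slowest stage, latency is the number of stages times the clock period, and throughput is one output per clock period.

#include <stdio.h>

// The clock period of a pipelined circuit is set by its slowest stage.
static double max_delay(const double *stage, int n) {
    double m = stage[0];
    for (int i = 1; i < n; i++)
        if (stage[i] > m)
            m = stage[i];
    return m;
}

int main(void) {
    // Module delays from the recipe: A=4, B=3, C=8, D=4, F=5, E=2 ns.
    double unpipelined = 4 + 3 + 8 + 5;          // longest combinational path: 20 ns
    double two_stage[]   = { 4 + 3, 8 + 5 };     // slowest stage: 13 ns
    double three_stage[] = { 4 + 3, 8, 5 };      // slowest stage: 8 ns

    printf("combinational: latency %.0f ns, throughput 1/%.0f ns\n",
           unpipelined, unpipelined);

    double t2 = max_delay(two_stage, 2);
    printf("2-stage: latency %.0f ns, throughput 1/%.0f ns\n", 2 * t2, t2);

    double t3 = max_delay(three_stage, 3);
    printf("3-stage: latency %.0f ns, throughput 1/%.0f ns\n", 3 * t3, t3);
    return 0;
}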
Bonus Example: Pipelined Car Factory. The Car Factory again (Engine Room, then Tires, Wheels, and Doors Room; stage tPDs = 5, 2, 1, 2), stepped through as a sequence of pipelined animation frames starting from time 0.
