4018 CMD Networking and Computer Architecture
Performance, OS and Discussion

Recap
Assembly language: basic instructions, addressing modes, symbols and labels, branching, subroutines, relation to higher-level languages.

  Addr.   Instr.                    Pattern
  0022    PUSH EAX                  0011010011000..
  0024    PUSH EBX                  0011010010010..
  0026    MOV EAX,0                 0001010011000..
  002A    LoopStart: MUL EBX,EAX    0101010010011..
  …
  0032    INC EAX                   0100010011000..
  0034    CMP EAX,10                1000011100101..
  0038    JNE LoopStart             1100100000000..
  …
  0054    JE Case1                  …
  …       Case1:

Performance Problem
Every time we invent a faster computer, we find a task that needs more work to be done…
The computer is not just a computational machine: we do not only use computers to do calculations. It is a simulation and modelling tool that can be applied to all sorts of problems. So, we need a faster computer. It is not always possible to reduce computational needs with better algorithms; hardware, and the software that supports it, matter too.

Limiting Factors and Other Considerations
- Design-model wise, the Von Neumann bottleneck: memory connected to the CPU by a narrow lane…
- Clock rate: related to our ability to make transistors smaller, and to thermal/cooling factors.
- Software issues: order, priority and time given to tasks, managing resources, etc.
- Specific CPU architecture features: instruction set, number of registers, bus width, etc.

A Faster Bus
To address the bottleneck, we can try a faster system bus: more bandwidth between the processor and the memory.
Note that everything inside the computer is regulated through a single clock; the speeds of the various busses (system, I/O, etc.) are defined in relation to it. For instance, 66 MHz for the system bus and half that rate (33 MHz) for other busses (early 486DX machines).
Modern computers allow far more control, using a very high frequency clock and fine control over how this is divided.

Buffers
Faster and slower bus lanes, however, create a problem: we need buffers to store items temporarily as they move between faster and slower busses, and we need to add circuitry to control those.
Such controllers don't need to be as complicated as a CPU, but they might employ similar circuitry (moving bits around, incrementing and decrementing counters), and since they operate at high frequencies they might need cooling.
Such controllers can be part of what is often referred to as a chipset.

Additional Lanes
Another thing we can try is more lanes between the CPU and memory. Dual and quad memory configurations are common in modern PCs: essentially more wires connecting the two. For instance, dual channel memory means 2 busses connecting to the CPU.
(This lecture is not sponsored by G-Skill.)

Cache
A faster bus needs faster memory, but we also want that memory to be sufficiently large. Those two objectives can be contradictory! We can have fast, but it will be small; and we can have big, but it will be slow…
What if we did not have to access the main memory all the time? Cache memory was introduced to help us avoid having to do so. There is typically a hierarchy of such memories.
Note: how many caches there are and how they are used varies greatly based on manufacturer and architecture.

Memory Hierarchy
[Figure: the memory hierarchy as a pyramid, fastest and smallest at the top: Registers, L1 Cache, L2 Cache, L3 Cache, Memory (RAM), Disc. Latency increases and capacity grows as we move down.]
- Registers are fastest!
- Not all computers have all three types of cache.
- L1 (primary) cache is the "CPU" cache.
- L2 is on chip but not directly connected to the CPU.
- L3 is not always on chip and can be used to feed the L2 cache. L4 cache?
- In multicore computers, caches can exist on a per-core basis.

Hit or Miss
The principle of locality: instructions or data currently in use, or believed to be coming into use soon, will be held in the cache.
Algorithms (replacement policies) determine what we keep in cache. They must be simple, because everything in the CPU happens really fast!!!
They are based on criteria such as hit and miss:
- Hit: when we find what we need there.
- Miss: if we don't (we want a high hit rate).

Replacement Policy Examples
Random: if it's a miss, replace a random block in the cache with the one we were looking for.
- Advantage: simple to implement in hardware.
- Disadvantage: ignores the principle of locality.
Least Recently Used (LRU): widely used, does what it says.
- Advantage: takes locality into account.
- Disadvantage: needs to keep track of accesses, thus more complicated to implement.
(See the sketch below.)
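To make the trade-off concrete, here is a minimal sketch (not from the lecture: the cache size, access trace and function name are all made up) that simulates a tiny fully associative cache under both policies and counts hits:

```python
import random
from collections import OrderedDict

def hit_rate(trace, capacity, policy):
    """Tiny fully associative cache simulator; returns the hit rate.
    policy is 'random' or 'lru'; capacity is the number of blocks held."""
    cache = OrderedDict()              # keys ordered oldest -> most recent
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)   # record the use (needed for LRU)
        else:
            if len(cache) >= capacity:
                if policy == "random":
                    cache.pop(random.choice(list(cache)))  # evict any block
                else:
                    cache.popitem(last=False)              # evict least recently used
            cache[block] = None
    return hits / len(trace)

# 80% of accesses hit a small "hot" set (locality); the rest are one-off blocks.
random.seed(1)
trace = [random.choice([0, 1, 2]) if random.random() < 0.8 else 100 + i
         for i in range(10_000)]
print(f"random: {hit_rate(trace, 4, 'random'):.2f}")
print(f"lru:    {hit_rate(trace, 4, 'lru'):.2f}")
```

On a trace with this kind of locality, LRU keeps the hot blocks resident while random eviction regularly throws one of them out, which is exactly the advantage/disadvantage pairing listed above.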
Pipeline
Another trick we can use is to reduce the amount of time the CPU remains idle; cache helps with that.
Breaking instructions into multiple stages and performing different stages simultaneously can ensure a constant flow of tasks for our CPU. Pipelines allow us to achieve a form of parallelism.
In its simplest form, the stages can correspond to those of our fetch-execute cycle.

Problems
Note that it is not always possible to perform all those stages at the same time; we will see some examples of this next.
A solution is flushing the pipeline: essentially, clear all stages and start again. This is not particularly difficult to do, but when doing so we lose time (how much depends on the number of stages!!!).

Problem 1
Some dependency exists between instructions and the data they manipulate.
[Figure: a pipeline with stages Fetch, Decode, Op. Fetch, Execute, Write. MOV EAX, EDX (copy EDX to EAX) is followed by ADD EAX, EBX (add EAX to EBX); the ADD cannot proceed because "I haven't finished copying the contents of EDX into EAX", so a NOP is inserted between them.]
A NOP is not ideal; it would make more sense to process another useful instruction in the meantime. A faster solution?

Problem 2
Occasionally, different phases might require access to the same resource. For instance, one set of MAR and MDR registers that allows access to the main store.
[Figure: MOV EBX, [X] (copy the contents of memory location X to EBX) followed by MOV [Y], EAX (copy EAX to location Y); we can't write to and read from memory at the same time. Inserting a NOP between them is one possible solution…]

Problem 3
As part of the program logic, we might want to "jump" to a different location in the program. This is accomplished by using instructions that overwrite the value of the Program Counter.
[Figure: MOV EAX, EBX followed by JNZ LOOP (if the Zero Flag is not set, copy the loop address into the PC). Which is the next instruction? Until the jump resolves, the stages behind it hold NOPs, which is the same as flushing the pipeline.]

Branch Prediction
Making an "educated" guess is better than waiting (speedy decision making is essential).
Static prediction:
- Taken: assume the branch will always be taken.
- Not taken: assume it will not.
- Opcode: set the behaviour for each jump/branch instruction.
Dynamic prediction:
- 1-bit predictor: remember last time…
- 2-bit predictor: what about the last 4 times! Etc. (See the sketch below.)
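The dynamic predictors can be sketched as saturating counters. Below is a minimal illustration (the function name and trace are invented; the counter predicts "taken" when it sits in its upper states) of why the 2-bit version mispredicts a typical loop branch less often than the 1-bit one:

```python
def predict_trace(outcomes, bits=2):
    """Sketch of an n-bit saturating-counter predictor for a single branch.
    outcomes is a sequence of booleans (True = branch taken)."""
    counter, top = 0, (1 << bits) - 1      # counter saturates between 0 and top
    correct = 0
    for taken in outcomes:
        prediction = counter > top // 2    # upper half of the states = "taken"
        if prediction == taken:
            correct += 1
        # move towards "taken" or "not taken", saturating at the ends
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

# A loop branch: taken 9 times, then falls through once, repeatedly.
loop = ([True] * 9 + [False]) * 100
print("1-bit:", predict_trace(loop, bits=1))  # steady state: 2 misses per loop
print("2-bit:", predict_trace(loop, bits=2))  # steady state: 1 miss per loop
```

The 1-bit predictor flips its mind on every exception, so it pays twice per loop (at the exit and again at the next entry); the 2-bit predictor needs two strikes before changing its prediction, so the single loop exit costs only one misprediction.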
Longer Pipelines
Breaking phases down into smaller, simpler ones can improve speed. Duplicating units can resolve some problems, such as operand fetch, and allows parallel execution of instructions on the same core/CPU!* (Superscalar: for instance, each AMD Zen 4 core has 4 ALUs!)
Yet another trade-off: a longer pipeline means more loss if we make the wrong predictions. The "infamous" Pentium 4 Prescott featured an extremely long pipeline with 31 stages!!!
*parallel computing is well outside the scope of this module

More Cores!
Adding more execution units, cores and pipelines clearly increases costs, but it is common today due to limits in clock rates and in cooling/power consumption. The benefits depend on our ability to parallelise tasks.
But speeding up the processor is not the only way to make a computer faster.

Subsystems
The general idea: get other, specialised parts of the machine to deal with specific tasks. A good example is the graphics system.
Modern cards do more than just accelerate graphics: they can accelerate tasks that benefit from mass parallelism, from physics calculations to signal processing for audio and video encoding/decoding.

I/O Devices
Each I/O device is given a unique I/O port number by the Operating System; no two devices can have the same I/O address.
This I/O port number is basically a memory address where data is temporarily stored as it moves in and out of the device. Remember, processors can only process data from the main memory of the computer.

DMA
Smart devices can help efficiency by moving data without the help of the CPU. Direct Memory Access (DMA) is often supported by modern hard drives.
Data is loaded into/stored from memory to the device while the CPU is doing something else. When the job is done, the device contacts the CPU (IRQ), and a new address and length of data to be transferred might be provided. (A sketch of this overlap follows below.)

DMA Controllers
DMA controllers are microprocessors that transfer data to or from specific memory addresses. If DMA and I/O are combined on the same card/device, they can be tightly coordinated.
They are often found on devices such as sound cards, network adapters, modems, hard drives, etc.
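Purely as an illustration of the idea (a toy simulation, not how a real driver is written: the thread stands in for the DMA controller, and all names are invented), the "CPU" below keeps computing while a transfer completes in the background and then signals an interrupt handler:

```python
import threading, time

memory = bytearray(16)               # a stand-in for main memory
done = threading.Event()

def dma_transfer(dest, data, on_complete):
    """Pretend DMA controller: copies data into 'memory' without the CPU."""
    time.sleep(0.1)                  # models a slow device
    memory[dest:dest + len(data)] = data
    on_complete()                    # raise the "interrupt" (IRQ)

def irq_handler():
    print("IRQ: transfer complete")
    done.set()

# Start the transfer, then let the "CPU" do useful work in the meantime.
threading.Thread(target=dma_transfer,
                 args=(0, b"hello from dma!", irq_handler)).start()
work = sum(i * i for i in range(1_000_000))   # the CPU is not blocked
done.wait()
print(memory[:15].decode(), "| cpu result:", work)
```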
A Problem
But how do we know how to interact with such devices so we can take advantage of such features?
The performance of a computer does not depend only on hardware; software has an important role to play too. A good example of that is compilers: their ability to produce efficient assembly code has a direct impact on application performance.

Break?

A Device with no OS
Early devices had no operating systems, but we soon realised this was a good way of doing nothing useful very slowly. Application needs and complexity have increased a lot since then.
[Image: the latest iPhone, without an OS.]

Operating Systems
Many attempts at a definition:
- "An operating system is the collection of software and data that manages the system and performs resource sharing, user dialogue provision and timesharing."
- And another one: "The software component of a computer system that is responsible for the management and coordination of activities and the sharing of the resources of the computer."
- And so on…

Onions!
Layers of abstraction hide the complexities of the hardware. APIs (Application Programmer Interfaces) provide programmers with an easy way of interacting with devices. The outer layer can be thought of as the user interface.

Operating System Layers
[Figure: a more detailed onion, from the inside out: Hardware, Kernel, Executive, Subsystems, Interface, Applications.]
- Hardware abstraction: provides all the processor-dependent code.
- Kernel: allows running processes simultaneously.
- Executive services: drivers, memory management, etc.
- Subsystems: provide the API.
- User interface: launching and using applications, etc.

OS Components/Functions
A long list of things: the kernel (or supervisor program), process management, main memory management, file management, I/O system management, networking, security, a command-interpreter system, the user interface, etc.
Early operating systems only offered a subset of those: mainly a file system and some basic I/O operations.

Kernel
The aim here is to hide hardware complexity from the user/programmer.
The kernel always runs from, and is loaded into, the main memory of the computer; as such it needs to be small and efficient.
The kernel has oversight/control of the computer system: interrupt handling, memory and device management, process communication, etc.

Process Management
A process is an instance of a program being executed. It needs resources such as CPU time, memory, files, and I/O devices.
The OS manages processes; it is responsible for:
- Process creation and deletion.
- Process suspension and resumption.
- Provision of mechanisms for process synchronization, process communication, and deadlock handling!
[Figure: Process A and Process B, both blocked: A is waiting for B while B is waiting for A, a deadlock.]

But How?
But how does the OS take control of the hardware? We can't rely on processes not to crash, or to pass control back to the OS at regular intervals.
And how do we control access to our most precious resource, memory, and make sure apps can only access what they should?
Modern operating systems need hardware with certain features to allow the amount of control needed.

Interrupts
We already discussed interrupts briefly: events that stop/suspend execution of whatever the CPU is doing to make it do something else. They can be generated by hardware devices, and that "something else" is defined by the OS.
One event that can generate an interrupt is a timer! This can control how long processes get the CPU for, thus allowing scheduling, and taking control back from applications that have crashed and need to be terminated.

Time-slicing
Time-slicing: the general idea is that timer interrupts limit the amount of time each process runs for, allowing single-core processors to do multitasking and to share the CPU resource among multiple users.
Note that suspending and resuming processes has a cost: if the slices are too thin there is a lot of overhead, but if the slices are too thick the system is less responsive.

Memory Management
Applications/processes and the data they operate on need to be loaded into memory: another finite resource that needs management. In modern systems the OS is responsible for:
- Keeping track of which parts of memory are currently being used, and by whom.
- Deciding which processes to load when memory space becomes available.
- Allocating and un-allocating memory space as needed.

Real Mode
If a computer runs in real mode, memory addresses correspond to physical memory (or real memory, if you like) locations.
Applications have to manage memory themselves, and are limited to what is available. There are no restrictions or safeguards to prevent an application from accessing the memory of other applications, so programming errors occasionally result in the computer crashing, necessitating a restart.
This is not the biggest problem for single-user, single-task computers; most DOS applications were written in such a way that they would only work in this mode.

Virtual Memory
A better idea: create a layer of abstraction. In virtual mode:
- Applications are given access to all the memory they need in one big continuous block. This can be more than what is available! This is logical memory, called virtual memory.
- Hardware provides mechanisms for efficient translation from logical to physical memory (i.e. specialised registers and circuitry to accelerate translation).
- The operating system keeps track of memory use and allocates physical memory to processes as needed.

Virtual Addresses
Those are logical addresses. A classical definition splits them into three parts: segment number, page number, word number.
- Segment: a logical division of the address space; different segments have different purposes (data blocks, program routines, etc.).
- Page: a physical division of a segment; a more manageable 'lump' of memory that, for a given machine, will be of a standard size.
- Word: the smallest addressable unit, used directly as the least significant part of the address.

Paging
The memory needed by applications is often bigger than the physical memory of the computer! To solve this problem, we can use secondary storage (i.e. an HDD) to temporarily store pages when they are not needed. The operating system maintains page tables that tell us where each page is.
We can swap out (store to secondary storage) pages to free memory for processes that need to run… If a process requires access to a page currently in secondary storage, then this needs to be loaded back into memory, but to do so we might need to first swap out something else…
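A minimal sketch of the two ideas together, assuming made-up field widths (4-bit segment, 8-bit page, 12-bit word) and a toy page table; real hardware and OS structures are far more involved:

```python
# Assumed (illustrative) field widths: 4-bit segment, 8-bit page, 12-bit word.
SEG_BITS, PAGE_BITS, WORD_BITS = 4, 8, 12

def split_virtual(addr):
    """Split a 24-bit virtual address into (segment, page, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    page = (addr >> WORD_BITS) & ((1 << PAGE_BITS) - 1)
    seg = addr >> (WORD_BITS + PAGE_BITS)
    return seg, page, word

# Toy page table: (segment, page) -> physical frame number, or None when the
# page has been swapped out to secondary storage.
page_table = {(0, 0): 5, (0, 1): None, (1, 0): 2}
free_frames = [7]

def translate(addr):
    seg, page, word = split_virtual(addr)
    frame = page_table[(seg, page)]
    if frame is None:                    # page fault: bring the page back in
        frame = free_frames.pop()        # (a real OS may first swap a page out)
        page_table[(seg, page)] = frame
        print(f"page fault on seg {seg} page {page}: loaded into frame {frame}")
    return (frame << WORD_BITS) | word   # physical = frame number + word offset

print(hex(translate(0x001ABC)))   # seg 0, page 1, word 0xABC: faults first
print(hex(translate(0x001ABC)))   # now resident: no fault
```

Note how the word number passes through unchanged as the least significant bits, exactly as the classical definition above describes; only the segment and page numbers are looked up.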
Taking Control
- The OS is loaded when the computer boots, in real mode.
- It places the kernel in an actual memory location; this needs to be in memory at all times.
- The kernel contains the interrupt handler definitions.
- It sets up the parameters and activates virtual mode: the OS has control.
- Applications can now be loaded. They can't interfere with each other, since they can only see non-overlapping logical memory. They can only run for as long as, and access what, the OS allows them to see, and it is not possible for them to swap to real mode.

Discussion
Thanks to paging, one could argue that we have created a computer with infinite memory!!! In practice there are limits: secondary storage is also a finite resource, and because swapping pages in and out can be a time-consuming task, there is a big performance penalty.
Virtual memory can also make systems more secure, assuming the OS handles this correctly. But even then, it does not solve all our security problems.

Other OS Functions?
There is a second-year OS module, so we will not be discussing file systems, scheduling and other aspects in more detail. But we need an honourable mention of UNIX for introducing TCP/IP as part of the kernel.

Future?
In the long term: quantum computing, new semiconductor materials (i.e. graphene), new memory technologies, DNA data storage, analogue computers, etc.
In the short term: RISC is back!

RISC and CISC
CISC (Complex Instruction Set Computing): one way to increase efficiency is introducing more complex instructions that, for certain problems, can do more work per tick.
- Advantage: special instructions can speed up certain tasks.
- Examples: the x86 Intel and AMD processors.
RISC (Reduced Instruction Set Computing): another way; limiting the number and complexity of instructions can allow for other architectural improvements, like more busses, registers, etc.
- Advantage: lower power consumption and more/smaller cores.
- Examples: ARM (including Apple Silicon) and RISC-V processors.

So, who could win? In terms of platforms:
- Mobile devices: RISC has a clear advantage here due to lower power consumption.
- Servers: depends on purpose, but theoretically RISC could pack more cores and be more efficient.
- Desktops: could go either way, considering our ability to parallelise tasks, the use case, and needs for backwards compatibility.
[Image: iMac G4, PowerPC RISC processor, 2003.]

The End
End of part one of the module! Next week: networking.
