Document Details

Uploaded by UnquestionableLogic729

University of Manchester

2024

Frank Podd

Tags

concurrent systems, microcontrollers, CPU efficiency, computer science

Summary

This document is a handbook on concurrent systems for EEEN30141, focusing on examples of concurrent systems, autonomous peripherals, and efficient CPU use in microcontrollers. It details topics like bit banging, microcontroller peripheral logic, overlapping I/O, and DMA. The handbook is for the semester 1 2024 class.

Full Transcript

EEEN30141 Concurrent Systems. Semester 1 2024. Frank Podd. Week 1

Table of contents

1 Background
1.1 Examples of concurrent systems
2 Wk1 (Task10) Autonomous Peripherals
2.1 Efficient CPU use in microcontrollers
2.1.1 Bit banging versus microcontroller peripheral logic
2.1.2 Multiple ways to use data-transfer peripherals
2.1.3 Blocking I/O
2.2 Overlapping I/O
2.2.1 Non-blocking I/O, aka overlapping I/O
2.2.2 Polling
2.2.3 Interrupts
2.2.4 Multiple threads
2.3 DMA
2.3.1 Sleep modes
2.4 Further reading
2.5 Summary
3 Wk1 (Task1) Faster processing
3.1 Clock speed and power density
3.2 Methods to increase computational speed
3.2.1 Concurrent system definition
3.2.2 Task interleaving (time slicing)
3.3 Algorithms and ordered tasks
3.3.1 Totally ordered tasks
3.3.2 Unordered and partially ordered tasks
3.4 Concurrency and Amdahl's law - is it worth it?

1 Background

In this unit on concurrent systems, we will cover topics relating to multiple processor systems and systems using operating systems. The definition of concurrent processing in the next chapter relates to how CPUs can work together efficiently. This narrow definition excludes further essential hardware features that improve CPU performance, such as autonomous peripheral operation and interrupts. You will have encountered these features in your MCE2 unit, so we will use microcontroller examples to discuss them in this unit. This course discusses how software called the operating system kernel controls all other programs and performs CPU time-slicing through a scheduler. We will investigate programming time-sliced concurrent systems using the C++ thread library running on a PC. The issues in this PC-based concurrent system apply to other types of processing hardware, from multi-core microcontrollers to graphics cards and supercomputers. Any sections of text that say, "If you want to read more about…" are just for your interest and understanding. They are not required reading for the unit.

1.1 Examples of concurrent systems

We will define a concurrent system once we have discussed the underlying terminology. Concurrent systems are needed when multiple tasks must be performed simultaneously, either for speed or to facilitate a system running many processes. Some computation tasks need to be performed as quickly as possible, e.g. car airbags. For other tasks, increasing computational performance will result in a more accurate outcome. For instance, a weather forecast needs a simulation model that is as large as possible, but the forecast simulation must also be completed within a useful period. There are two ways to speed up our computation: either increase the speed of our computer or have multiple processors work jointly on the problem. Your PC runs multiple programs concurrently—it checks the keyboard, sends graphics to the monitor, and runs programs such as a word processor and a browser together, all while virus-checking the data and internet traffic. The automobile cruise control system keeps a vehicle moving at a constant speed in the face of disturbances such as driving over a hill. The system needs to monitor several inputs (wheel speed, accelerator/brake/clutch pedals, and steering wheel switches) and process a control algorithm to set the car's throttle as a feedback mechanism. A challenging application is an air traffic control system with tens of aircraft flying within range, both in the air and taxiing on the ground. The system must also communicate with other air traffic control centres, prioritise specific aircraft, and have a responsive graphical user interface. Conversely, we will see that two PCs without a data network are not concurrent systems, as they run independent programs; instead, they are seen as separate systems.
2 Wk1 (Task10) Autonomous Peripherals

2.1 Efficient CPU use in microcontrollers

Processing units are pervasive in electronic systems, from supercomputers to washing machine microcontrollers. This unit is about getting the most out of processing units. We will discuss computational speedup through concurrent systems concepts, covering hardware acceleration, having multiple processors, and using a single processor to perform multiple tasks simultaneously. However, before we look at more powerful CPUs, this chapter will examine how basic microcontrollers improve their performance by adding logic circuits. Microcontrollers have a central processor (CPU), but they also have extra logic circuits, mainly related to "peripherals", that can perform tasks separately and simultaneously with the CPU. You have covered some of these peripheral logic units in your previous courses, such as the UART and Timer peripherals. We will revisit them now, focusing on how they improve CPU efficiency.

2.1.1 Bit banging versus microcontroller peripheral logic

Older CPUs communicated with the outside world by directly controlling their GPIO pins using the GPIO peripheral. The CPU program could set this pin's level high and low itself. This direct method of sending signals is called "bit banging", and you can still do this on a modern microcontroller if you do not have peripheral logic available to meet the required communication standards. CPUs can send data using just GPIO pins, so they can communicate together using, for example, a UART protocol. UART data, e.g. RS232, runs at a much slower clock speed than the CPU. Using the CPU to send the data directly, or "bit banging", is a very flexible way of sending data, as the CPU is fast enough to create any digital signal. For example, sending a byte of data (8 data bits + 1 start bit + 1 stop bit) at a data rate of 9600 baud takes about 1 ms. However, in one millisecond, a PIC processor with a 10 MHz clock (2.5 MHz instruction rate) could have performed 2,500 machine code instructions instead of bit banging the data on the GPIO pin. The GPIO peripheral has no processing ability, and bit-banging data is a very inefficient use of the CPU resource.

A more efficient implementation is to have specialised logic circuitry for sending the data over a serial port. An example of this circuitry is the UART peripheral. The CPU could pass the information at high speed to this peripheral logic, perhaps using a FIFO buffer, and then the CPU can do other work while the logic transmits the data at a slower rate. This type of extra logic is an example of "autonomous peripheral operation" and is a form of simultaneous processing. Autonomous peripheral operation is extra hardware logic on the IC that increases data processing speed, minimises latencies and saves power. It is performed in parallel to the CPU's computations. You have seen this with the peripheral logic functionality of the PIC18F and STM32F4 microcontrollers. The need for efficient data transfer is particularly pertinent for autonomous peripheral operation.

2.1.2 Multiple ways to use data-transfer peripherals

In the MCE2 and Data Networking units, you will have used communication peripherals such as SPI or UART to exchange data. You may have used several methods in your code to do this. There are many possibilities for getting the system to use the peripherals. The primary choice is between blocking and non-blocking methods.
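Before comparing those methods, here is a minimal sketch that makes the bit-banging cost from section 2.1.1 concrete. It is illustrative only: gpio_write, delay_us and TX_PIN are assumed board-support names, not a real vendor API.

```cpp
#include <cstdint>

// Assumed board-support helpers (hypothetical names, not a real vendor API):
extern void gpio_write(int pin, int level);   // drive a GPIO pin high or low
extern void delay_us(unsigned microseconds);  // busy-wait for a given time
constexpr int TX_PIN = 1;                     // illustrative pin number

// Bit-bang one 8N1 UART frame (1 start + 8 data + 1 stop bit) at 9600 baud.
void uart_tx_bitbang(std::uint8_t byte) {
    constexpr unsigned bit_us = 104;  // one bit time: 1/9600 s is about 104 us
    gpio_write(TX_PIN, 0);            // start bit
    delay_us(bit_us);
    for (int i = 0; i < 8; ++i) {     // data bits, least significant bit first
        gpio_write(TX_PIN, (byte >> i) & 1);
        delay_us(bit_us);             // the CPU busy-waits, doing nothing useful
    }
    gpio_write(TX_PIN, 1);            // stop bit
    delay_us(bit_us);
}   // roughly 1 ms of CPU time spent per byte
```

With a UART peripheral, the whole function collapses to a single write into the transmit register or FIFO, and the CPU gets those 2,500 instructions back. The blocking and non-blocking methods below differ in how the CPU then coordinates with that peripheral.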
An example of data transfer is when the program is trying to read 8 bytes from the UART serial port. It could be minutes before an external device sends this data, so should your program stop until this data has arrived?

2.1.3 Blocking I/O

If your answer to that question is "yes", then you would use a "blocking" read function call, which will not return until the data is ready. One advantage of blocking function calls is that they make programming easy; the disadvantage is that the program will run slowly. Blocking commands will not return execution to the following line of code until the data transfer has been completed. This means no simultaneous processing is happening, and your CPU is doing nothing valuable while it waits. Blocking I/O is the easiest method, as your code runs sequentially and synchronously, so you can relax. However, blocking commands will slow down your program.

2.2 Overlapping I/O

2.2.1 Non-blocking I/O, aka overlapping I/O

Conversely, if your answer to whether to stop until the data has arrived was "no, don't wait", you can use a non-blocking read function call. The read function will return immediately, even if no data is available. The programmer's job is to see how much data was returned by the read call and, if necessary, keep checking (polling) to know when the rest of the data has arrived. The non-blocking method is also called overlapping I/O. Using a non-blocking command to send data means the command will return execution to the next line of code in your program immediately, long before the peripheral completes the data transfer. Your program will continue executing while the peripheral transfers the data to the outside world. It can do this because of autonomous peripheral operation. The data transfer occurs concurrently (overlapping) with the main program execution.

Which one of these is the "best" method depends on the program and the system it is running on, your code preference, ease of debugging, and CPU efficiency. Blocking is the easiest to code (as there is no simultaneous processing to worry about), but it is the least efficient use of the processor. There are several ways to implement non-blocking (overlapping I/O) commands. These are:

1. Checking for completion or FIFO state using
   a. Polling
   b. Interrupts
2. Direct Memory Access data transfer with an interrupt on completion.

With a non-blocking command, the call to the peripheral returns straight away, but the calling program may need to know, for example, when the reading of a given number of bytes of data is complete or when the sending FIFO is empty so that it can be refilled.

2.2.2 Polling

The simplest way to do this is for the main program to keep asking the peripheral, e.g., "Are you finished yet?", but put this command in a loop, along with a "sleep()" statement. The duration of the sleep statement might be, for example, half the expected time for the peripheral to complete the operation. This method of repeatedly asking the peripheral a question is called "polling". Polling is slightly better than using a blocking command, and it can be the best solution for some cases, but in general, it is not a very efficient use of the CPU.

Another way to implement non-blocking I/O is to use hardware interrupts, i.e. interrupts generated by peripheral logic circuits. Rather than the CPU polling the peripheral to check its status, the peripheral can create a hardware interrupt informing the main program that its status has changed.
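As a minimal sketch of the polling loop just described, assuming a hypothetical non-blocking driver call named uart_read_nonblocking (not a real API):

```cpp
#include <chrono>
#include <cstddef>
#include <thread>

// Assumed non-blocking driver call (hypothetical): copies whatever is in the
// UART receive FIFO into the buffer and returns how many bytes it copied.
extern std::size_t uart_read_nonblocking(char* buffer, std::size_t wanted);

void read_eight_bytes(char* buffer) {
    std::size_t received = 0;
    while (received < 8) {
        // Poll the peripheral: "are you finished yet?"
        received += uart_read_nonblocking(buffer + received, 8 - received);
        if (received < 8) {
            // Sleep for about half the expected transfer time before asking
            // again (8 bytes at 9600 baud take about 8 ms, so sleep for 4 ms).
            std::this_thread::sleep_for(std::chrono::milliseconds(4));
        }
    }
}
```

Most wake-ups find nothing new, which is exactly the wasted CPU work that a hardware interrupt removes.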
For example, you could have a hardware interrupt to say when the write buffer is empty, which may indicate that the data transfer is complete.

2.2.3 Interrupts

Autonomous peripherals can offload some CPU tasks to small parallel logic circuits. These peripherals often use event-driven communications (interrupts) to communicate with the CPU. Rather than the CPU continually polling (asking) the GPIO peripheral device to see if a button is pressed, another approach is to have the peripheral tell the CPU when something has changed. The peripheral device will then "interrupt" the CPU to say that something has happened, and this type of signal is appropriately called an "interrupt". This method has the advantage that the CPU can be working on another task, and the peripheral logic can check the button and inform the CPU when the button is pressed, making the system more efficient.

Interrupts are provided for in the CPU hardware. They enable small bits of code to stop the currently executing code, run the interrupt code, then resume the previous code. The interrupts call the associated interrupt service routine (a small subroutine), and this can happen at any time since the peripheral and the main CPU are working concurrently. You, therefore, have no idea where the execution of your main program might be when this interrupt occurs. Another way to say this is that it happens "asynchronously". It is better to use interrupts with an event-driven implementation, where your code mainly checks for changes caused by the interrupts, either through callbacks or flags, rather than having a single main program. The CPU processes hardware and software interrupts similarly to how subroutines work (storing registers, including the program counter, ready for retrieval later). Interrupt callbacks need to return quickly, so they cannot be used for lengthy processing. If you want to read more about exceptions, interrupt handling, priorities, and interrupt vectors, review your MCE2 notes.

The interrupt can be triggered by peripheral logic. Peripheral logic is logic external to the CPU, such as Timer, Memory, or digital and analogue I/O. Interrupts happen at the CPU hardware level, and the CPU is designed to make sure they are as fast and efficient as possible.

2.2.4 Multiple threads

There is also a "concurrent" method to perform blocking I/O. You should revisit this once you understand the following terminology… "A separate I/O processing thread of execution can perform the blocking I/O, while the main program can continue."

2.3 DMA

The final method, which is not always available, is to use Direct Memory Access (DMA) data transfers. This is another example of autonomous peripheral operation. To use DMA, your program (which runs on the CPU) must instruct the DMA controller and peripheral of the start memory address to use and how many bytes of data to transfer. Then, the peripheral and the DMA will continue without any further action needed by the CPU. The DMA will use an interrupt to tell the CPU when the data transfer is completed; the peripheral works concurrently with the CPU, which is a very efficient method.

Many communication and data capture devices can transfer large amounts of data. The CPU can pass this data between the peripheral and a block of memory locations. A more efficient method is called Direct Memory Access. Direct Memory Access requires extra logic that allows a peripheral to communicate with the memory directly.
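The following is a minimal sketch of this DMA pattern, assuming hypothetical memory-mapped register names rather than any particular vendor's controller:

```cpp
#include <cstdint>

// Hypothetical DMA controller registers (illustrative names, not a real API):
extern volatile std::uint32_t DMA_SRC_ADDR;  // address of the peripheral data register
extern volatile std::uint32_t DMA_DST_ADDR;  // start address of the RAM buffer
extern volatile std::uint32_t DMA_COUNT;     // number of bytes to transfer
extern volatile std::uint32_t DMA_CTRL;      // control/start register
extern const std::uint32_t ADC_DATA_REG;     // illustrative ADC data register address

volatile bool transfer_done = false;         // flag shared with the interrupt handler
std::uint8_t buffer[512];                    // area of main memory for the DMA to fill

void start_adc_capture() {
    DMA_SRC_ADDR = ADC_DATA_REG;             // where the DMA reads each sample from
    DMA_DST_ADDR = static_cast<std::uint32_t>(
        reinterpret_cast<std::uintptr_t>(buffer));
    DMA_COUNT = sizeof buffer;               // how many bytes to transfer
    DMA_CTRL = 1;       // start the transfer; the CPU is now free, or can sleep
}

// Interrupt service routine the DMA raises on completion: set a flag and return.
void dma_complete_isr() {
    transfer_done = true;
}
```

Once DMA_CTRL is written, the transfer proceeds without the CPU touching each byte, and the completion interrupt simply sets a flag for the main program to check.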
Thus, the peripheral can capture data and fill an area of main memory with data while the CPU is doing something else or is asleep. Direct Memory Access is often used with ADC/DAC peripherals. For example, an analogue-to-digital converter peripheral (ADC) can save data to memory continuously without needing the CPU. Direct Memory Access (DMA) facilitates some peripheral devices' direct access to memory without the CPU having to copy data around. The peripheral will often access the main memory, but performing a peripheral-to-peripheral data transfer may be possible.

2.3.1 Sleep modes

Sleep modes are also available, allowing the CPU (and other unaffected peripheral blocks) to be dormant when unused to save energy. This is particularly useful for battery-powered devices, such as IoT devices. This feature is provided by an autonomous peripheral that stays awake to turn the microcontroller on when needed. In some microcontrollers, the whole chip, apart from the sleep peripheral and timer, will power down to be woken up every hour, for example. More advanced microcontrollers can power down the CPU to a "deep sleep" but keep communication peripherals, such as SPI/UART/I2C, working and ready to receive data. Then, when their receiving FIFO buffer is full, the CPU will be woken up to deal with the data.

2.3.1.1 Timer peripheral

One final peripheral… An important peripheral for running an Operating System is the Timer peripheral, which can be set to interrupt the CPU repeatedly. A programmer may also use the Timer peripheral in interactive programs, for example, to update the GUI every 100 ms.

2.4 Further Reading

If you want to read more about autonomous peripherals found on microcontrollers, here are examples you can Google:

"Low-Power Background Autonomous Mode (LPBAM)" in STM32U5 microcontrollers
Core Independent Peripherals (CIP) in Microchip PIC18F
Peripherals Interconnect Matrix in STM32 microcontrollers
Other data processing acceleration, e.g. DSP blocks.

If you want to read more about interrupts, an in-depth discussion is given in "Interrupt processing in concurrent systems": https://www.ardent-tool.com/CPU/docs/AMD/anatomy/misc/articles/walker.pdf

2.5 Summary

Autonomous peripheral operation is a microcontroller hardware feature that offloads CPU tasks into additional logic circuits. These additional logic circuits can perform the tasks independently and concurrently to the CPU. These autonomous (independent) peripherals often use interrupts to communicate with the CPU, reducing latencies. They also use direct memory access (DMA) to improve throughput and CPU efficiency. Sleep peripherals are also used in IoT devices to minimise power and enable the device to be battery-powered. Autonomous peripheral operation prevents unnecessary waiting on slow peripherals and increases system responsiveness by running parts of your program in parallel rather than sequentially. Although autonomous peripheral operation improves CPU efficiency, it is not included in the narrow definition of concurrent systems.

3 Wk1 (Task1) Faster processing

The primary purpose of computer processing systems is to manipulate data. The faster the computer can do this, the better. The four ways to increase the computational speed of the system are:

1. Increasing CPU clock speed
2. CPU features, such as instruction pipelining and autonomous peripheral operation
3. Using multiple CPUs
4. Making CPU usage more efficient through time-slicing.
In the past, the primary way to increase processing speed was to increase the processor's clock speed and the number of logic gates on an IC, following Moore's law. To facilitate this, each new processor design used new manufacturing methods that could shrink the size of the components. However, this has now reached a limit due to power dissipation (caused by gate density and clock speed) and quantum tunnelling (caused by gate density). We will not discuss the topic of quantum tunnelling, but the next section investigates why we cannot keep increasing clock speed.

3.1 Clock speed and power density

CPUs are synchronous devices that work at the rate of the input clock. So, if we could just make a CPU that can work at a higher clock speed, it would process data faster. But that has physical consequences for the system. The average dynamic power dissipation (P_av) in a continuously active (a = 1) switching logic gate is related to the clock frequency (f_clk), the switching capacitance (C), and the supply voltage (V_dd):

P_av = f_clk · C · V_dd²

If we keep increasing the clock frequency, the logic gate will produce more heat. If we combine that with a higher gate density, then much more power is produced per square millimetre of IC. At some point, we will not be able to get the heat out of these tiny areas quickly enough, and the circuit will break. In passing, you can see that the heat created is proportional to the square of the supply voltage, so halving the supply voltage cuts the dynamic power to a quarter. Making the CPU core work at the lowest possible voltage is therefore a significant benefit. We might not care how much energy our PC's CPU uses (e.g. 80 W), but there are some applications using microcontrollers where we want the system to run off a coin cell battery for years, so using an average power of only a few milliwatts or less. You may have come across the clock-gating features of the STM32F4 microcontroller, where you can just turn on the parts of the microcontroller that you need.

In summary, we can reduce the dynamic power consumption of the CPU by:

1. Reducing the CPU voltage.
2. Reducing the CPU clock frequency and using lower clock speeds in parts of the chip.
3. Reducing the gate capacitance.
4. Only powering the parts of the processor that you need.

3.2 Methods to increase computational speed

Increasing clock speed as an approach to make computation faster is limited by power dissipation. That leaves us with the three other approaches mentioned above:

1. CPU features, such as instruction pipelining and autonomous peripheral operation
2. Using multiple CPUs
3. Making CPU usage more efficient through time-slicing.

Computer programs often have parts that need to wait for something external, such as slow memory copies or external data transmission. During this time, the CPU does nothing. In MCE2, you learnt about the CPU's use of instruction pipelining to speed up the system. Shortly, you will learn more about autonomous peripheral operation (e.g. interrupts and DMA). An alternative way to increase data processing speed is to have multiple processors work together on the same problem. This is most obviously seen with modern PC processors; for example, the Intel i7 processor contains 8 CPU "cores". Entirely separate programs are easily run on multiple cores in parallel. Parallel processing is one type of concurrent system.

3.2.1 Concurrent system definition

One way to make the system more efficient is to let the CPU work on different tasks while there is a delay in the running tasks.
An Operating System is a software layer written to enable multiple tasks to run on the system in a way that makes them appear to be running simultaneously. Operating systems allow multiple programs to work virtually simultaneously, even on a single core. Operating systems also facilitate the sharing of hardware resources and provide system security. This virtual simultaneous running of multiple tasks is the other type of concurrent system. A system as complicated as a personal computer could not work on a single CPU without task interleaving. This processor time-slicing (fake concurrency) means that the software can handle user inputs, apps and graphics output virtually simultaneously. It would be too hard to write a single program to do all of that while still being responsive. It is possible to speed up a single program by splitting it into parts and running those parts on multiple cores or virtually simultaneously on a single CPU core.

This course on concurrent systems covers the use of:

1. Multiple processors to work simultaneously on a program
2. A single processor system to run more efficiently by using an operating system software layer to time-slice its processing.

We will discuss time-slicing now—it is how your computer's operating system runs multiple programs virtually simultaneously on a single CPU.

3.2.2 Task Interleaving (time slicing)

Imagine a computer must run two programs: a stopwatch app and a word processor app. The user wants to use the apps at the same time. Each of the two programs (independent processes) comprises ordered tasks. The two independent processes are a stopwatch app (labelled 'S') and a word processor (labelled 'W'). The programs can be split into subtasks that each take one time slice, e.g., 1 ms. The subtasks are given the postfix labels 1, 2, 3, etc. The subtasks must be computed in the given strict order for each program. Table 1 shows an example of the execution on a CPU with two cores.

Table 1 Parallel processing of two programs

Time slice   CPU Core 1   CPU Core 2
1            W1           S1
2            W2           S2
3            W3           S3
4            W4           S4
…            …            …

It is easy to run the two programs on separate CPU cores. However, it is still possible to run both programs on a CPU with only one core and still make it appear to the user that they are running simultaneously. The CPU core could give each program a time slice in turn, as shown in Table 2. Or, if the Word app is a higher priority program than the stopwatch app, the CPU might order the processing, as shown in Table 3.

Table 2 Interleaved processing of two programs

Time slice   CPU Core 1
1            W1
2            S1
3            W2
4            S2
…            …

Table 3 Interleaved processing of two programs, with the word processor given higher priority

Time slice   CPU Core 1
1            W1
2            W2
3            S1
4            W3
…            …

In all three cases, the sub-tasks (W1, W2, W3 and S1, S2, S3) have been performed in strict order for each of the programs since the programs comprise totally ordered tasks. Running multiple programs virtually simultaneously is a form of concurrency, as is parallel processing.

3.3 Algorithms and ordered tasks

To understand how we might make an algorithm run faster, we need to consider how the algorithm is constructed. You will have written programs in various languages, such as Assembly Language in MCE1 and C/C++ in MCE2. You have programmed the Microchip PIC microcontroller and the ST/ARM microcontroller. All these programs are compiled into a list of machine code instructions. Your microcontrollers ran without an operating system.
A CPU running without an operating system layer is called a "bare metal" system. The CPU in your "bare metal" microcontroller executes the lists of instructions in a strict sequence. Even though the execution of each line of code may jump about due to machine code branch instructions/subroutine calls (if/for/goto in C++), the order of machine code execution cannot be changed arbitrarily by the CPU.

3.3.1 Totally ordered tasks

The statements of a simple sequential program form a "totally ordered set". This means that, for fixed input data, the program instructions always occur in the same order. A CPU can run two independent programs by executing ten lines from program A, then ten lines from program B, then the next ten lines of program A, and so on. If the CPU is very fast, it will appear to the user that these two programs are running simultaneously (concurrently), even when each program is a "totally ordered set".

3.3.2 Unordered and partially ordered tasks

However, there are many activities whose order of machine code instructions could be altered without consequence. For an example of an unordered task, consider an algorithm that adds a set of numbers together. The order of summation, i.e., the order in which we add the array's elements, does not matter to the result, as long as we add them all. This is an example of an unordered task.

For an example of a partially ordered task, consider a vision-processing task, such as an animal-type classifier. This algorithm has independent sub-tasks of a tail detector and a legs detector. There is also an animal-type estimator that uses the output of the tail and legs detectors. The tail and legs detector algorithms work independently, so it does not matter which finishes first. However, the animal-type estimator cannot start until the tail detector and legs detector algorithms have been completed. This is a partially ordered task: some of the algorithm's sub-tasks can be in any order, while others must be in a strict sequence. If we can solve a problem using unordered or partially ordered sub-tasks, we can use multiple CPU cores and keep these cores fully active to solve the problem quickly.

3.4 Concurrency and Amdahl's law - is it worth it?

Rather than one powerful CPU on the computer's motherboard, new computer processors have multiple cores, and graphics cards can contain thousands of very simple processors. However, having multiple processors is not a panacea. A single CPU might use time slicing to be efficient when one thread is waiting for a hardware resource. However, if multiple threads on multiple cores are all waiting for a particular shared hardware resource, the efficiency may drop due to lock contention. More fundamentally, running tasks in parallel may not significantly speed up the algorithm if the algorithm consists of partially ordered tasks. Having partially ordered tasks means that some parts of the code are ordered, so it is impossible to parallelise them and speed them up with more CPUs. The importance of this depends on the ratio between the bits of code you can parallelise and those bits you cannot. If we run the code on a single, bare metal processor, we can find the fraction (p) of the execution time spent in the parallelisable (unordered) sections as opposed to the totally ordered sections. If most of the time executing the code is spent in the bits that cannot be parallelised, you are wasting your time trying to use multiple CPUs.
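As a concrete sketch of exploiting an unordered task, the summation example above can be split across two threads using the C++ thread library that this unit uses later (illustrative, not tuned for performance):

```cpp
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(1'000'000, 1.0);
    double sum_lo = 0.0, sum_hi = 0.0;
    auto mid = data.begin() + data.size() / 2;

    // The order of summation does not matter, so the two halves
    // can be added independently and concurrently.
    std::thread lo([&] { sum_lo = std::accumulate(data.begin(), mid, 0.0); });
    std::thread hi([&] { sum_hi = std::accumulate(mid, data.end(), 0.0); });
    lo.join();
    hi.join();

    // The final combining step is ordered: it must wait for both halves.
    std::cout << "total = " << (sum_lo + sum_hi) << "\n";
}
```

The final combining step is the ordered part that cannot be parallelised; Amdahl's law, next, quantifies how much such ordered sections limit the overall speed-up.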
Amdahl's law (my version in the equation below) gives a feel for how your code might speed up using parallel processors, but in reality, it is very algorithm-dependent. With p the fraction of the code that can be parallelised and s the speed-up factor of the parallelised part (e.g. the number of processors):

Time after speed up / Time without speed up = (1 − p) + p/s

For example, if your algorithm needs 10 minutes to run on a single CPU and 90% of the code can be parallelised, then even with an infinite number of parallel processors, your code will never be quicker than 1 minute. However, for big datasets where you really need fast processing, it is pretty likely that the large majority of time is spent in parallelisable code. We will revisit and test Amdahl's law once we learn how to program parallel algorithms.

EEEN30141 Concurrent Systems. Semester 1, 2024. © Frank Podd. Week 2

Table of Contents

1 Wk2 (Task2) Processes & Threads
1.1 Background to processes and threads
1.1.1 Program/process/thread
1.1.2 Process control block
1.2 Multi-threading
1.2.1 Parallel Multithreading
1.2.2 Time-slice multi-threading
1.2.3 Simultaneously Multi-Threaded / Hyperthreading
1.3 OS Time-slicing the CPU is like interrupts
1.4 Thread context switch
1.4.1 Context switch
1.4.2 Reasons a context switch is triggered
1.4.3 Process context switch overhead
1.4.4 A thread's life – states and state changes
1.5 Definition of a concurrent system
1.5.1 Creating concurrent systems
1.6 Summary
1.6.1 Summary of thread operation
1.7 Further reading

1 Wk2 (Task2) Processes & Threads

The terms process and thread relate to running a machine code program. Unfortunately, there is no standard terminology for these terms, and their use can depend on the particular Operating System and the programming language. This section will discuss these terms and provide our definitions.

1.1 Background to processes and threads

1.1.1 Program/process/thread

A computer program comprises machine code, initial data (e.g., strings and data arrays), metadata from the compiler toolchain and links to other, possibly hidden, libraries. As our system is running an operating system, our computer code will call the underlying operating system functions to enable the program to be run and access, for example, the filesystem or "print" information to the screen. This is why executable code compiled for a PC will not run on that same hardware if a different operating system is installed. On a personal computer system, when this read-only machine code and initialisation data are stored in its non-volatile memory (hard disk drive), it is called a program.

Figure 1 shows the steps from writing code to running it on a CPU with an operating system for compiled languages. The program (text) code that you write must be compiled into machine code (binary) instructions, e.g., by a C compiler. If your code uses libraries, then these need to be linked in with the compiled code to create an executable binary that can be run.

Figure 1 Compiled language stages: Code, compiled binary, executable binary, process

This executable is stored on the hard disk until the user selects it to run. At that point, to run the app quickly, the operating system copies the executable program into faster volatile memory and provides it with some memory space and CPU time. When the program is ready to run, or running, it is called a process. Although there will be only one program on the hard disk, the program could be running multiple times. For example, in Figure 9 the user wanted to run three calculator apps; the same calculator program on the hard disk is duplicated three times as three separate processes. The process is given a unique virtual memory space to prevent it from affecting other processes. The process contains the machine code to be run (the code segment) and a data segment that includes space for statically allocated variables such as global variables. The machine code is executed by another operating system object called a thread of execution (abbreviated to thread).

A process has one or more threads (the "main" program is one thread). Time-slice concurrency can occur between processes and within a process. Since threads are another operating system concept, they are created by operating system calls; however, these system calls may be hidden from the programmer if the programming language has built-in constructs for concurrency. The thread object executes the code stored in the process's code segment. A single-threaded program has just one thread, which runs all the program code. However, it is possible to write code that generates multiple threads, allowing different parts of your program to run concurrently (a multi-threaded program). The process contains information on all its threads and must contain at least one thread.
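As a preview of the C++ thread library we will use later in the unit, here is a minimal multi-threaded process: the main thread asks the operating system (via the library) to create a second thread, and both then execute code from the same process's code segment.

```cpp
#include <iostream>
#include <thread>

void worker() {                // code run by the second thread of execution
    std::cout << "hello from the worker thread\n";
}

int main() {                   // the process starts with one thread: main()
    std::thread t(worker);     // hidden OS call creates a second thread
    std::cout << "hello from the main thread\n";
    t.join();                  // wait for the worker thread to terminate
}
```

Both threads share the process's memory space; whether they run in parallel on separate cores or are time-sliced on one core is the operating system's decision.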
A major purpose of an operating system software layer is to share CPU processing between multiple processes. If there is only one CPU core, the operating system will time-slice CPU time between the threads of the different processes. If there are multiple CPUs, the operating system will time-slice the threads on each core and run the cores in parallel.

1.1.2 Process control block

When the operating system starts running a program, the newly created process has a single thread, and its program counter value will contain the entry point of that machine code program (probably the main() statement). The thread stores its own program counter variable and associated information like the call-return stack, so it knows where it is up to with executing the program if it is interrupted by the operating system. It is the job of the process to store the information about the resources (e.g. open files) associated with the program and which all threads can access. It is the thread's job to execute the machine code even when the operating system turns it on and off; in other words, it must remember where it was up to, the CPU register values, and so on. When the operating system time-slices the CPU between programs, the data associated with the process and its thread(s) must be switched in and out of the CPU. By default, a process has one thread, but if you write the program to be multi-threaded, the process will have multiple threads that can be run quasi-simultaneously.

The complete information about the process and its threads is called its "process control block". This information enables the operating system to start and stop all the program's threads of execution so that they can share the CPU. The process control block contains two types of information: the process context and the thread context. The process context contains information about the process that is useful for the operating system, e.g., global variables and files shared between the threads. The thread context stores details on running the thread in the CPU, such as the CPU status information when it was stopped. Each thread has its own thread context. The process control block, along with the process context and thread context for two threads, is shown in Figure 2. The process control block must be kept in an area of memory protected from process access to prevent a malicious process from being able to change its own context information or the context of another process.

Figure 2 Process control block, process context and thread context

1.2 Multi-threading

A totally ordered program will not have any concurrency, and its machine code commands will be executed strictly in sequence. For a totally ordered program, this sequence can include conditional if statements, but no parts of it can be run concurrently. This process is said to have one thread of execution. This is how we have run programs in the MCE1 unit, with just a single sequence of machine code instructions. For the CPU to run the program, it needs a host of parts, including the ALU, instruction decoder, registers, program counter (PC), and call-return stack, as well as some data memory associated with the process, and perhaps some file pointers the code has opened. The required information associated with our single process running on a single core bare metal system looks like Figure 3.
Figure 3 Single process, single thread

The CPU in a modern PC can often have two processing cores. This means the CPU can run two separate processes (programs) in parallel; we will call these programs Process 1 and Process 2. We will assign core 1 of the CPU to execute process 1. Let us assume that the programs are both totally ordered, so each process will only have one thread of execution, so core 1 will run thread 1 of process 1. Each core of the CPU has its own set of registers. We will assume that, for security, each core is given a different memory area to use so that one app cannot spy on another. This situation is shown in Figure 4.

Figure 4 CPU with two cores running two totally ordered processes

1.2.1 Parallel Multithreading

We can run parts of the same program simultaneously, and we now discuss "parallel multithreading", the first half of concurrent systems. Say we have a partially ordered program with two parts of code that can be run simultaneously. Each part of code that can be run concurrently is called a thread. Our single multi-threaded process will have two threads of execution. Each thread can run in parallel on separate cores. However, the data associated with this process must be shared between the two threads, as shown in Figure 5. This is where the trouble begins, and dealing with this is an essential part of this unit on concurrent systems.

Figure 5 CPU with two cores running two threads of a single process

Although the above figure states that an OS is unnecessary, we will use many features of an operating system to deal with data sharing between the threads of a process.

1.2.2 Time-slice multi-threading

We now turn to "time-slice multi-threading", the other half of concurrent systems. Assume that we have the same partially ordered program that has two threads. However, rather than having two CPUs, we now return to that CPU with a single core that can only execute one machine code instruction at a time. This CPU can only run one thread. However, if it is running an operating system, it can time-slice between different threads. In this case, many threads may be waiting to be executed, and the CPU will work on one of them, then switch to another and another. But this needs an Operating System layer of code to remember the CPU's state for each thread and switch these values into the CPU registers when the Operating System time-slices between the threads. The data memory and file pointers are associated with the process (program) rather than the individual threads, so they are shared and accessible by all the threads. This multi-threading on a single-core CPU setup is shown in Figure 6.

Figure 6 Time-slice multithreading - Single core with two threads and using an OS

1.2.3 Simultaneously Multi-Threaded / Hyperthreading

Having multiple cores is essential for running multiple operating systems in parallel. Trying to run two operating systems on the same core might be problematic—which one would oversee the time-slicing of the threads? You should be aware of one more concept: "Hyperthreading", also known as "Simultaneously Multi-Threaded" (SMT). This is a hardware feature of a single-core CPU that gives it two (logical/virtual) cores and can run two processes simultaneously, but these virtual cores share resources and so are not as powerful as separate cores.
An example of a hyperthreading CPU is the Intel 8th-gen i7 processor, which has six physical cores and runs 12 threads simultaneously (the extra threads run on logical/virtual cores). Interestingly, the Intel 9th-gen i7 processor has eight cores but no hyperthreading, so it can only run eight threads simultaneously, but these are run on physical (real) cores. Perhaps hyperthreading has had its day? The ARM Cortex-A65 also has a Simultaneously Multi-Threaded core (hyperthreading).

Our final diagram shows the most complicated scenario: multicore hyperthreading. This example involves a hyperthreading CPU with two cores running two processes, each with two threads, as shown in Figure 7. The operating system code controls the execution behaviour of both cores.

Figure 7 Multicore hyperthreading CPU running multiple processes with multiple threads

1.3 OS Time-slicing the CPU is like interrupts

An important feature of an operating system is the ability to time-slice the CPU so that many tasks can work virtually at the same time (but on the microsecond time scale, they get the CPU one after another). The operating system kernel software sets up a timer interrupt to run its scheduler code and gives different threads of code access to the CPU. For this to work, the information for running the code must be stored and retrieved so that one thread of code does not affect another, and the code can restart at the point it left off. To perform this code execution (thread) swapping, the operating system works in software as interrupts work in hardware. You will remember that when an interrupt occurs, the current state of the CPU is stored (to registers or the stack), the interrupt code is run, and the CPU state is restored so that the main code can continue as if the interrupt did not happen. The state of every thread's CPU registers must be stored, and this information is called the thread's "context". Swapping the CPU states between the threads is called "thread context switching".

A hardware interrupt is given a priority level, determining if one interrupt has precedence over another. Similarly, the operating system can prioritise which thread gets precedence over another, and the operating system kernel's scheduler algorithm decides. Some schedulers allow the programmer to give threads priority level numbers.

Hardware interrupts must be short pieces of code, as the main thread cannot continue until the interrupt has been completed. However, this restriction is removed with operating system CPU time-slicing, since the purpose of the scheduler is to give all code the CPU time it warrants.

1.4 Thread context switch

The processing power of the CPU is switched from one thread to another. If the new thread belongs to the same process as the previous thread, the change is quicker than if it belongs to a different process.

1.4.1 Context switch

When the operating system time-slices the CPU between threads of execution, it must update the CPU status, memory, etc. The thread context provides the CPU values from the last run of the thread. The process context includes information on the program code, memory space, static variables, and files. Context switching is the name for this operating system procedure that saves the context of the currently running process and loads the context or state of the next process/thread, ready for its continued execution. Context switching enables time slicing on a single CPU.
This allows the system to run multiple processes seemingly simultaneously and permits a single process to be multi-threaded. When the operating system moves the CPU usage from one thread to another, it must update the system with the process and thread context of the following thread. If the new thread is from the same process as the previous running thread, then only the thread context needs to be switched, and the process context can remain. However, if the new thread is from a different process, then both the thread context and process context need switching. The process context contains large memory management structures that take longer to switch, including processor caches. Therefore, switching between threads from the same process is much quicker than context switching between threads of different processes.

A time-slice concurrent system aims to improve CPU use efficiency. However, time slicing the CPU has the overhead of context switching, which uses up CPU time (a few milliseconds for around 20 milliseconds of time slice) without performing any useful work. Some processors provide sets of shadow registers to reduce the time taken to perform a thread context switch. CPU efficiency may not be the most important reason for having a time-slice operating system; the most important reason is its multitasking and multi-threading ability.

1.4.2 Reasons a context switch is triggered

There are multiple reasons why the CPU might switch out a running thread:

1. The current thread is waiting for input or output from a device or user.
2. The current thread is waiting for interaction with another process or thread.
3. A thread with a higher priority has entered the ready state. This can also happen if there is a processor (memory) exception.
4. The program calls an operating system function to provide a kernel service, and the thread then enters kernel mode via an interrupt or exception. This is often called a trap instruction, although ARM Cortex calls it a supervisor call (svc).
5. It is the turn of another process.
6. The thread has completed its code or exited for another reason.
7. The thread has cooperatively given up its execution.

1.4.3 Process context switch overhead

When a thread switch occurs where the new thread belongs to the same process, only the CPU registers must be swapped. There is a much bigger computational overhead when performing a process context switch. You will remember that CPUs speed up execution by implementing an instruction pipeline (fetch, decode, execute and other stages), and there may be multiple layers of cache memory that have been filled with data and instructions that the following instructions might need. With a process context switch, not only do the CPU registers need restoring, but the instruction pipeline needs to be cleared, and the layers of cache memory need to be refilled.

1.4.4 A thread's life – states and state changes

The operating system will have a long list of threads that require time on the CPU. When the user starts a new process, more threads will be created, and some threads will "terminate" (finish), such as when the user gets tired of playing Tetris. If a thread needs to wait for a hardware resource, then there is no point in giving it more CPU time until that resource becomes available. While a thread waits for a resource, the operating system gives it the "Waiting" status.

1.4.4.1.1 Thread states

This brings us to the "life cycle" of a thread.
Some references talk about the life cycle of a process, which is the same if they assume the process consists of only a single thread and not a multi-threaded program. This is shown in Figure 8. Once the thread is created by a new process or by operating system calls from within the process's machine code, it is given the "Ready" state and added to the "Runnable" queue. At some point, the operating system's scheduler function will give it a slice of time on the CPU, which is when the thread has the Running status. Once its time for the CPU is up, it is put back into the Ready status, waiting for its turn with the CPU. At some point, the thread may complete because it has run all its code, or the user may kill it; either way, it is finished, given the Terminated status, and removed from the queue of threads to run.

Once a thread has been created, it moves between states until it is terminated. These three states are:

Ready: The thread is waiting to be chosen by the scheduler for execution.
Running: The thread executes on the CPU.
Waiting: The thread is temporarily unable to progress. It may be waiting for the completion of an IO transaction or for another thread to perform some action.

When it is in either the Ready or Running state, the thread is classed as Runnable. When it is in the Waiting state, it is moved to the Unrunnable list.

Having different threads running for short periods one after another on the CPU is not straightforward. Each thread might belong to a different process, and each process has its own data memory area. Each thread will also have its own CPU register values, call-return stack, etc. The data associated with the process and thread will need to be switched in and out of the CPU, which is called a "Context Switch".

1.4.4.2 Switching between threads – state changes

Part of the job of the Operating System software layer is to create a time-slice concurrent system, i.e. a system that can run multiple parts of an algorithm simultaneously, or at least virtually simultaneously. The kernel is the lowest level of Operating System software. It constructs thread objects with links to the code and all the thread metadata, allowing them to be switched in and out of the CPU and achieving concurrency on single and multiple CPU systems. The kernel software that performs this is called Process Management, an important part of which is thread scheduling. The kernel's process management thread scheduling code must switch between the threads, allowing them to share the CPU. The thread running on the CPU may hit a line of code that must wait until some hardware resource becomes available. At this point, the kernel will switch the thread out of the CPU, put it into a Waiting state (non-runnable), and then give the CPU to the next runnable thread. There are quite a few reasons why the CPU might switch out a running thread, and we will discuss that in the scheduling section.

Once a thread has completed its code, the information about it can be deleted. When all the threads associated with a process have been completed, that process has finished, and the whole process context information can be deleted. However, if you do pause a process before it completes, you need to be able to resume it without it breaking, i.e., restoring all memory locations, CPU registers, and files that it was using. This information is called the "process context", and the act of changing processes or sub-processes is called "context switching".
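To make the stored context concrete, here is an illustrative sketch of the kind of per-thread record a kernel might save and restore on each switch; the field names are hypothetical, not any particular operating system's layout.

```cpp
#include <cstdint>

// Illustrative per-thread context record (hypothetical field names).
// On a context switch the kernel saves the CPU state into the outgoing
// thread's record and reloads the CPU from the incoming thread's record.
struct ThreadContext {
    std::uint64_t program_counter;  // where the thread's code was paused
    std::uint64_t stack_pointer;    // top of the thread's call-return stack
    std::uint64_t registers[16];    // general-purpose register values
    std::uint64_t status_flags;     // CPU status word
    int priority;                   // used by the scheduler to pick threads
};
```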
Context switching was discussed in detail in section 1.4.1.

Figure 8 Life cycle of a thread in an operating system (from P.N.Green's lecture slides)

Figure 8 shows the transitions between the thread states. The thread moves between states until it is terminated. A thread in the Waiting state is temporarily unable to progress. It may be waiting for the completion of an IO transaction or for another thread to perform some action. When the reason for the thread to be waiting has been resolved, that thread is moved back into the ready queue. Once the operating system has created the new thread, it moves it into the Ready queue. Then, when it is that thread's turn for the CPU, the "scheduler dispatch" transition restores the thread's context to the CPU and its state changes to Running.

Revisiting those reasons for ending the running state of a thread:

1. Waiting for input or output from a device or user – the thread's state moves from Running to Waiting.
2. Waiting for interaction with another process or thread – the thread's state moves from Running to Waiting.
3. It is the turn of another process – the thread's state moves from Running to Ready through the "end of turn" transition.
4. The thread has completed its code or exited for another reason – the thread's state moves from Running to Terminated.

1.5 Definition of a concurrent system

An operating system's kernel is a software layer that provides system concurrency, among other things. In the past, a CPU had just one processing core, but modern CPUs have multiple cores, e.g. 6. The kernel lets multiple programs run at the same time on multi-core hardware, which is called parallel concurrency. For a single-core system, the programs can run virtually simultaneously through time slicing. These two types of concurrency are shown in Figure 9.

Figure 9 Parallel concurrency vs time-sliced concurrency

Time-slice processing is when a single CPU switches between tasks (threads) without necessarily completing each one. These threads can be from the same program or different programs. It is also known as multi-tasking, but a better term is multi-threading. Concurrent processing executes order-independent or partially ordered tasks using parallel or time-slice processing. Parallel processing refers to simultaneous processing on multiple processing units (CPUs). When tasks of the same program are run simultaneously, we call this parallel concurrency. When the threads of time-slice processing are from the same program, we call it time-slice concurrency. It is also known as fake concurrency. The term concurrent system covers parallel or time-sliced processing since they have the same problems and solutions. The same "concurrent" algorithm could execute on a single or multiple processors. But beware because, unlike our definition, some computer science texts define concurrency as just relating to time-slicing. Concurrent systems methods can speed up execution in multi-processor/multi-core systems and increase the efficiency of single-core processors.

1.5.1 Creating concurrent systems

To create a time-sliced concurrent system, only the operating system software layer needs to enable multithreading. This works even if there is only one CPU core. To create a parallel processing concurrent system, multiple CPU processor cores and a software layer (e.g., an operating system) that can coordinate multiprocessing are needed.
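Using the standard C++ thread library, a program can query how many hardware threads (physical cores, or logical cores on a hyperthreaded CPU) are available before deciding how many threads to create:

```cpp
#include <iostream>
#include <thread>

int main() {
    // Number of concurrent hardware threads; may return 0 if the value
    // is not computable on this platform.
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "hardware threads available: " << n << "\n";
}
```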
We use the terms multiprocessor (>1 CPU core) and multi-threading (>=1 CPU core).

1.6 Summary
Our definition of concurrent systems covers both multiple processors working simultaneously on tasks and an operating system time-slicing task processing. Concurrent systems are designed to improve CPU efficiency and reduce data processing time. Whether this is possible depends on the type of algorithm being executed. Each algorithm can be analysed and classified as containing tasks in one of three classes: totally ordered, partially ordered, or unordered. Partially ordered and unordered tasks can be processed in parallel, which means the algorithm can be completed more quickly. You will see in the following sections that totally ordered tasks cannot be split and processed simultaneously, but an operating system can time-slice multiple totally ordered tasks to improve CPU efficiency.

Parallel processing concurrent systems (multi-processor) are needed for speed and for system behaviour verification. Since CPU clock speeds are limited, having multiple CPUs is another way to process data faster. Having multiple CPUs also means that one CPU can focus on critical tasks, such as a crash detection system, and its code can then be verified to a high standard.

Single-processor time-sliced concurrent systems are needed for CPU efficiency and to enable multiple programs to appear to be running simultaneously. A single task running on a processor may spend a lot of time idling and not processing, which is an inefficient use of the CPU’s processing power. For example, in a sequential program, the CPU may need to idle while waiting for some data to arrive.

Amdahl’s law shows that doubling the number of processors is unlikely to double the system speed. In addition, there is overhead in setting the processors up and in communication between them.

1.6.1 Summary of thread operation
A thread is an operating system object representing machine code that can be run in parallel with other threads. A process (a program instance in memory) might be a single thread or consist of multiple threads. The information about the running threads and the process they belong to is called the process context. Switching between threads associated with different processes is called “context switching”. Once a thread has completed its code, the information about it can be deleted. When all the threads associated with a process have completed, that process has finished, and all the process context information can be deleted. Although thread context information is private, any information in the process context part is available to all of the process’s threads. The process context information is private to that process and cannot be seen by other processes. It is the programmer’s responsibility to ensure that the threads of a process work well together, and that is the art of multi-threaded programming.

1.7 Further reading
“If you want to read more about…”

Best one: https://w3.cs.jmu.edu/kirkpams/OpenCSF/Books/csf/html/ProcVThreads.html
This covers most of the thread topics we discuss: https://levelup.gitconnected.com/multithreading-and-concurrency-concepts-i-wish-i-knew-before-the-interview-11895226179
This covers most of the thread topics we discuss, and it has Java code for the Dining philosophers problem!
https://www.oreilly.com/library/view/foundations-of-scalable/9781098106058/ch04.html
https://www.cs.fsu.edu/~baker/realtime/restricted/notes/concurrency.html
Many of this unit’s topics are mentioned in these two links:
https://codeahoy.com/learn/computersos/ch2/
https://codeahoy.com/learn/computersos/ch3/
Threads (ch 9) and using them (ch 10, 11): https://codeahoy.com/learn/thinkos/ch9/
https://www.codemag.com/Article/100113/Chapter-6-Process-Management
https://www.geeksforgeeks.org/context-switch-in-operating-system/
MS Windows – https://learn.microsoft.com/en-gb/windows/win32/procthread/multitasking
Write multiple single-threaded processes or one multi-threaded process? For Google’s thinking when writing Chrome: http://backblaze.com/blog/whats-the-diff-programs-processes-and-threads
Threads on microcontrollers: a thread status diagram is shown for ARM’s CMSIS real-time operating system (RTOS) here: https://arm-software.github.io/CMSIS_6/latest/RTOS2/group__CMSIS__RTOS__ThreadMgmt.html

EEEN30141 Concurrent Systems. Semester 1 2024. Frank Podd
Week 3
Wk3: Multi-tasking, operating systems and schedulers

Table of Contents
1 Wk3 Cooperative and pre-emptive multitasking ................................ 2
1.1 Cooperative multitasking ................................ 2
1.1.1 Reasons a context switch is triggered ................................ 3
1.1.2 Cooperative multitasking with the cyclic executive ................................ 3
1.1.3 Is the cyclic executive a concurrent system? ................................ 4
1.2 Pre-emptive multitasking operating systems ................................ 4
1.2.1 Pre-emptive OS’s extra reasons to context switch ................................ 5
1.2.2 Hardware interrupts ................................ 5
1.2.3 Schedulers ................................ 5
2 Wk3 Event-driven programming ................................ 7
2.1.1 Message passing and the message pump ................................ 7
2.1.2 GUI programming ................................ 7
3 Wk3 Operating System Kernel ................................ 9
3.1.1 Security ................................ 9
3.1.2 Brief mention of real-time operating system ................................ 10
3.1.3 Viewing operating systems processes and statistics ................................ 11
4 OS implementations ................................ 14
4.1 Containers and the virtualisation of operating systems ................................ 14
4.1.1 Types of virtual machines ................................ 14
4.1.2 Compiled and interpreted languages ................................ 14
4.2 Microcontroller OS and concurrent systems ................................ 15
4.1 Bare metal vs OS ................................ 16
4.1.1 Summary OS versus bare metal ................................ 17
4.1.2 Blocking I/O in a thread ................................ 18
4.2 CPU States and Privileged Access ................................ 19
4.2.1 Requirements of an operating system ................................ 19
4.2.2 Privileged access ................................ 19
5 Further Reading ................................ 20

1 Wk3 Cooperative and pre-emptive multitasking
We have seen how a process (a program instance in memory) has a main thread of execution and can have other code threads running concurrently. Many programs need to run at the same time, which means that the operating system must find CPU time to run all threads of code virtually simultaneously. We have discussed how the thread and process contexts contain all the information needed to swap threads in and out of the CPU, similar to the way hardware interrupts can temporarily interrupt a running process. Here we look at how and when this context switch happens.

In a time-sliced concurrent system, there may be only a single CPU core, but many threads of code need to be run. One approach might be to let one thread run to completion, but this would likely leave the other threads starved of time on the CPU, resulting in a very unresponsive computer. We will discuss a few different approaches to avoiding thread “starvation”. At one end of the spectrum is cooperative multitasking, where the programmer needs to consider how their program will give up time on the CPU to allow other code to run. At the other end is pre-emptive multitasking, where the programmer can consider their process to be the only one running on the system, and it is up to the operating system software to schedule CPU time for the various threads of code.

1.1 Cooperative multitasking
Cooperative multitasking (aka non-pre-emptive multitasking) is where there is a list of tasks to perform, but the tasks themselves decide when to give up the CPU to another task. The operating system never initiates a context switch from a running process to another process, which simplifies the operating system code. This method is called “cooperative multitasking”, as all the running processes must cooperate if the system is to work. The act of voluntarily giving up the CPU is called “yielding”. A task might yield the CPU when it has completed or when it is waiting for some hardware to respond. In C++11, the function call is std::this_thread::yield().
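To show how little machinery cooperative multitasking needs, here is a minimal sketch of a cooperative “scheduler”. It is illustrative only: each task cooperates simply by doing a short piece of work and returning (yielding).

// Minimal cooperative "scheduler" sketch (illustrative only).
// Each task does a small amount of work and then returns - returning is
// how a task cooperates, i.e. yields the CPU to the next task.
typedef void (*Task)(void);

static int sensorReads = 0, displayUpdates = 0;

static void taskA(void) { ++sensorReads;    }  // e.g. read a sensor, then yield
static void taskB(void) { ++displayUpdates; }  // e.g. update a display, then yield

int main(void)
{
    Task tasks[] = { taskA, taskB };
    // A real system would loop forever; bounded here so the sketch terminates.
    for (int cycle = 0; cycle < 100; ++cycle)
        for (unsigned i = 0; i < sizeof(tasks) / sizeof(tasks[0]); ++i)
            tasks[i]();  // a task that never returned would starve the others
    return 0;
}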
The cooperative multitasking operating system code can be straightforward and can run on a memory-constrained processor. Its cooperative scheduler only needs to start the next task and wait for it to yield. This system avoids the performance overhead of running a complex operating system, and it gives the programmer more control over when to give up the CPU. A disadvantage of cooperative multitasking is that if one task can wait a long time for a hardware resource to become available, the other functions will be delayed and the system will slow down.

Below is the list of reasons a context switch is triggered in multitasking systems. A pre-emptive operating system extends this list with additional reasons to switch context.

1.1.1 Reasons a context switch is triggered
There are multiple reasons why the CPU might switch out a running thread:
1. The current thread is waiting for input or output from a device or user.
2. The current thread is waiting for interaction with another process or thread.
3. The thread has completed its code or has exited for another reason.
4. The thread has cooperatively given up (yielded) its execution.

1.1.2 Cooperative multitasking with the cyclic executive
When you have written a program on a bare-metal microcontroller, such as for MCE1, you constructed a loop that looped forever. Within that loop, you had code that checked the pattern on the input switches, processed that data value using conditional branch instructions, and displayed a result on the LEDs. The more impressive name for the endlessly looping main program is a “cyclic executive”.

The cyclic executive’s simple structure can form a cooperative multitasking system. The program’s structure is one you will be familiar with: there is a list of tasks to do in a given order, and each task is a function call. The tasks are called in turn, and each call returns cooperatively (it is either a short task, or the task incorporates a state machine so it remembers where it is up to). This cyclic executive approach runs as a single program. However, it is challenging to make it efficient, for example if a line of code in your main program must wait for a data value.

Here is a C-language example of a cyclic executive performing three functions at set times in order to form a real-time system. It reads the temperature every 10 ms and displays the average of the two most recent readings every 20 ms. In reality, the function calls themselves will take some time, so the actual update rate will be slightly lower.

// Read temp every cycle (10 ms) and average and display every other cycle
// (20 ms). This assumes rdTemperature() and display() take no time…

// Hardware-access functions assumed to be provided elsewhere:
unsigned int rdTemperature(void);
void delay(unsigned int ms);
void display(unsigned int value);

int main(void)
{
    unsigned int firstT = 0, avT = 0;
    while (1)                        // Cyclic executive loops forever
    {
        firstT = rdTemperature();
        delay(10);
        avT = (firstT + rdTemperature()) / 2;
        delay(10);
        display(avT);
    }
}
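As mentioned above, a longer task can be made cooperative by incorporating a state machine so that each call performs one short step and returns. The sketch below assumes hypothetical hardware helper functions (stubbed out here so it compiles); it is not the MCE1 code.

// A longer task made cooperative by splitting it into steps with a state
// machine. startConversion(), conversionDone(), readResult() and display()
// are hypothetical hardware helpers, stubbed out so the sketch compiles.
static void startConversion(void) {}
static int  conversionDone(void)  { return 1; }
static int  readResult(void)      { return 21; }
static void display(int v)        { (void)v; }

void temperatureTask(void)
{
    static int state = 0;          // remembers where the task is up to
    switch (state) {
    case 0: startConversion(); state = 1; break;    // kick off the reading
    case 1: if (conversionDone()) state = 2; break; // check, but do not block
    case 2: display(readResult()); state = 0; break;
    }
}   // each call does one short step and returns, so other tasks are not delayed

int main(void)
{
    for (int i = 0; i < 3; ++i)
        temperatureTask();  // would be called repeatedly from the cyclic executive loop
    return 0;
}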
1.1.3 Is the cyclic executive a concurrent system?
The above example of a cyclic executive appears to be a real-time system, provided the function calls always take the same time to complete. The program also performs multiple tasks: reading a sensor, processing the data, and displaying the results. However, since this course concerns concurrent systems, we should ask whether this example of a cyclic executive is a concurrent system. The answer is debatable, but we will say “no” for the following reasons.

A cyclic executive may enable different tasks to be performed together (e.g. read input, process data, display output), and it is likely to be a good choice for a simple real-time application. You can see that one task, such as reading the input, is completed before the next task (data analysis) starts. This cyclic executive is not classed as a time-sliced concurrent system since:
- It does not use fine-grained time slicing (you might argue that it is a coarse-grained concurrent system).
- There is no context switching between tasks – there is just one process, and every task within the program shares the same data and permissions.
- One task is completed before the next one starts.
- There is no way to switch tasks if one task gets stuck (no pre-emption).

1.2 Pre-emptive multitasking operating systems
A pre-emptive operating system is a multitasking operating system with a pre-emptive scheduler. The pre-emptive scheduler switches the CPU between the list of threads automatically. If a pre-emptive operating system is used, the programmer does not need to consider multitasking in order to run multiple programs concurrently; it happens automatically. If the programmer wants to create a program that uses multiple threads, a pre-emptive OS can simplify that coding. To facilitate the simultaneous running of independent programs, the system runs an operating system, and the programs that need to be run are created as processes with one or more threads. The operating system’s kernel then performs the process management task of giving all the runnable threads time on the CPU in turn.

With the cyclic executive approach, the programmer decides which sub-tasks get called, when, and for how long; in a time-sliced concurrent system, the programmer has very little control over which tasks get called when, for how long, or in which order. The benefit of a time-sliced concurrent system is that the operating system handles all the intricacies needed to run programs and tasks concurrently. It simplifies adding a program to the list of programs running on the system. You can also split your algorithm into partially ordered parts, create separate threads, and run them concurrently. An operating system can use the CPU much more efficiently, and is easier to work with, than a cyclic executive approach. The time-slice concurrent system is labelled “fine-grained” since it uses small time steps for access to the CPU. The operating system also enables you to write multithreaded programs. However, multithreaded programming comes with a caveat: you need extra programming knowledge, and there is a risk of introducing hard-to-find, intermittent bugs.

In pre-emptive multitasking, an operating system’s scheduling algorithm decides when and for how long a task is given time on a CPU. The scheduler will withdraw the CPU from the task when:
- The task has completed its predetermined CPU duration.
- The task has cooperatively yielded it.
- A software interrupt has occurred, such as a terminate command.

1.2.1 Pre-emptive OS’s extra reasons to context switch
1. A thread with a higher priority has entered the ready state. This can also happen if there is a processor (memory) exception.
2. The program calls an operating system function to provide a kernel service, and the thread then enters kernel mode via an interrupt or exception. This is often called a trap exception instruction, although the ARM Cortex calls it a supervisor call (SVC).
3. The OS kernel’s scheduler decides that it is the turn of another thread with the same priority level.

1.2.2 Hardware interrupts
Hardware interrupts happen at the lowest level of the CPU, so any interrupt handlers will be called by the hardware below the level of the operating system scheduler. The hardware callback will call parts of the operating system kernel’s device drivers. This will temporarily stop the kernel’s scheduler.

1.2.3 Schedulers
There are many different algorithms that the kernel’s scheduling manager could use to schedule threads. For example, the scheduler could be designed for:
1. Fairness – all processes are treated equally (equal time-slices).
2. Priority – the programmer can specify which threads are important. The POSIX Linux scheduler with the pthread library has 99 priority levels, with 99 being the highest priority. Windows Win32 has 31 “base” priority levels. There is no way to set thread priority in standard C++11; instead, you must call native_handle().

In personal computer operating systems, it is common to assign priorities to threads. The scheduler orders the queue of threads in execution order based on their priority level. Threads with a higher priority will continually execute until they enter the waiting state or they complete. By default, threads are put into a standard priority level. Threads with the same priority level are treated as equal and executed with time-slices in a round-robin method (the scheduler cycles through the list of threads with that same priority in equal portions and in circular order). If a thread enters the runnable state with a higher priority than the currently running thread, the running thread is pre-empted, and the CPU core switches to running the new thread.

If a high-priority thread executes without yielding execution, the lower-priority threads will suffer “starvation”. Resource starvation is when a thread never gets access to the resources it needs. Have a look at the thread priority scheduling example in Windows C# at https://learn.microsoft.com/en-us/dotnet/standard/threading/scheduling-threads to get an idea of the effect of thread priorities. Note that there is a Thread.Sleep() statement to allow the other threads to run (this is the “yield” reason in the list above). Without that sleep statement, the highest-priority threads would run to completion before the other threads had any time on the CPU – the lower-priority threads would be starved of CPU time.
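For example, since standard C++11 has no portable priority API, one common approach on a POSIX system such as Linux is to pass the thread’s native_handle() to the pthread library. The sketch below is platform-specific and illustrative only; note that raising a real-time priority usually requires elevated (root) permissions.

#include <pthread.h>   // POSIX threads; Linux-specific sketch
#include <thread>

void work() { /* the thread's job would go here */ }

int main()
{
    std::thread t(work);

    // On POSIX systems native_handle() returns the underlying pthread_t.
    // SCHED_FIFO priorities run 1..99 (99 highest).
    sched_param sp{};
    sp.sched_priority = 10;
    // Returns 0 on success; non-zero (e.g. EPERM) if not permitted.
    pthread_setschedparam(t.native_handle(), SCHED_FIFO, &sp);

    t.join();
    return 0;
}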
2 Wk3 Event-driven programming
One example of using a forever loop in the main() program is in creating an event-driven program or system. This program structure requires an operating system to run concurrent threads and service hardware interrupts. The forever loop can run a state machine-type loop, or do nothing except wait for all threads to finish and the system to exit. Before the event loop starts, the program will create its threads and set up callbacks/interrupts for hardware status changes. This type of programming is used on web servers and is the default for programs with graphical user interfaces.

In this case, the forever loop is called the event loop, whose job is to process messages from other parts or threads of the program. It is called the main event loop as it is the central control method of the program. When another part of the program creates a new message, the event loop calls the relevant event handler callback to action (dispatch) that event.

“If you want to read more about…” some efficient methods for event-based programming, try a web search for “I/O completion ports” and “asynchronous procedure calls”.

2.1.1 Message passing and the message pump
One way that processes and threads can interact is by sending messages to each other. For this to happen, an underlying software layer must act as a postal service to receive and send out these messages. The operating system usually provides this facility, which is called a “message pump” since it pumps messages between the processes. We will discuss inter-process/thread communication by message passing later in the unit.

2.1.2 GUI programming
Event-driven programming with message passing is frequently used when programming graphical user interfaces. There are many options for GUIs. For Windows, you can use Microsoft Visual Studio, but there are cross-platform GUI toolkits such as Qt, wxWidgets, GTK, UNO, and Dear ImGui. Qt also has a Python wrapper. An example of event-driven programming with the Qt GUI implementation in C++ is given below. You will need to install Qt if you want to try it out.

// Qt C++ Hello World pushbutton example, showing event-driven programming.
// Taken from https://wiki.qt.io/Qt_for_Beginners
#include <QApplication>
#include <QPushButton>

int main(int argc, char **argv)
{
    QApplication app(argc, argv);

    // Set up a GUI object – a push button that will send a message to the
    // event loop when it is pressed.
    QPushButton button;
    button.setText("My text");
    button.show();

    return app.exec(); // app.exec() is the event loop. It does not return until the app closes.
}

The result of this program is shown in Figure 1.

Figure 1 Result of Qt simple example, taken from https://wiki.qt.io/Qt_for_Beginners

Event-driven programming uses a cyclic executive, along with state machines and callback functions, rather than using multiple threads. Event-driven programs can outperform multi-threaded and multi-process designs, since there is no overhead of context switching.
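To illustrate the idea without a GUI framework, here is a minimal hedged sketch of an event loop: other code posts messages to a queue, and the loop dispatches each message to a handler. Real frameworks such as Qt add thread-safe queues, timers, and many event types.

#include <iostream>
#include <queue>
#include <string>

// A minimal event-loop sketch (illustrative; not how Qt is implemented).
struct Event { std::string name; };

std::queue<Event> eventQueue;

void dispatch(const Event& e) {               // call the matching handler
    if (e.name == "buttonPress") std::cout << "Button pressed\n";
    else if (e.name == "quit")   std::cout << "Shutting down\n";
}

int main() {
    eventQueue.push(Event{"buttonPress"});     // other code posts messages
    eventQueue.push(Event{"quit"});
    while (!eventQueue.empty()) {              // the event (message) loop
        Event e = eventQueue.front();
        eventQueue.pop();
        dispatch(e);
        if (e.name == "quit") break;           // loop runs until the app exits
    }
    return 0;
}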
3 Wk3 Operating System Kernel
Operating systems provide a software library layer on top of the hardware and enable the processor to work as a time-sliced concurrent system. We have described the thread scheduler, but the operating system kernel has a range of other benefits. These benefits include:
1. Enabling the simultaneous running of independent programs by sharing CPU time between the programs without any additional effort from the programmer.
2. Sharing hardware resources (memory, Ethernet) where appropriate, preventing other resources from being shared (e.g. a serial communication port), and providing device driver interfaces to encapsulate (hide away) the hardware details.
3. Preventing a poorly written program from crashing the whole system.
4. Preventing a malicious program from stealing system information such as passwords.
5. Enabling multithreaded programs.
6. Reducing the analysis time and programming effort needed compared with a cyclic executive implementation.

An operating system is a software layer that oversees the computer's resources, such as memory, CPU, clock, and peripherals. The operating system includes the kernel and other system programs. It is also in charge of system security, making sure that a user’s program cannot compromise the system by monopolising resources or improperly accessing other programs or the kernel. Conversely, it must provide ways for programs to access the resources and enable inter-program communication. Using device drivers, the operating system can abstract away the details of the hardware and enable the same program to work on systems where the underlying hardware works differently. It can also provide high-level interfaces on top of the hardware, such as the multilayer OSI networking model.

3.1.1 Security
The kernel has complete control of the system, so it must not be susceptible to hacking. On personal computer operating systems, the kernel is separated from other code in memory. To do this, the critical code of the kernel is placed in a protected memory area inaccessible to programs or other code. This area of memory is called kernel space. Non-critical code and the programs the user selects are placed in a separate area of memory called user space. Other forms of memory protection are used on simpler operating systems where the kernel does not have a separate memory address space. These approaches prevent malicious programs from easily taking control of the system and prevent badly written programs from crashing the operating system.

One way the operating system can separate access permissions for programs is to give them the same security permissions as the user who started them. For example, the user may only have access to certain parts of the filing system or only be able to use specific hardware resources. The “root” user is the kernel, and programs with this username will have full system access. It is, therefore, important to avoid having “root” as the program owner as much as possible. You can see each process’s user in the operating system process-information programs, for example in Figure 4 below.

3.1.1.1 Functionality provided by an operating system
The operating system software layer typically has the following features:
- Time-sliced concurrent multitasking.
- Device drivers (which also protect the hardware from application software using kernel/user modes).
- Memory management – local variables stored on the run-time stack, dynamically created variables stored on the heap, and fixed memory addresses in the process context for static and extern variables, literals, and instructions.

3.1.2 Brief mention of real-time operating system
A real-time operating system guarantees that events can be processed within a specific time frame. This real-time category can be split into hard real-time and soft real-time systems. A hard real-time system requires exact timing, for example to trigger an actuator with millisecond accuracy; an example application area is manufacturing. However, this precise timing requirement severely restricts what the operating system can do, for instance reducing the system’s security against malicious applications. This may be acceptable, as all the running programs may be written by the system manufacturer rather than being third-party applications. A soft real-time system is permitted to occasionally miss deadlines where the consequences are low-risk, such as dropped frames when watching videos. A real-time system might have a monitoring thread that can pause less important tasks to ensure the critical tasks get the time they need on the CPU.
There are many approaches to scheduling. A bare-metal microcontroller often just relies on a watchdog timer to detect whether the software has crashed and then reboots the system. In MCE1, we always turned the watchdog off using WDG=OFF in the assembly header.

The pros and cons of real-time operating systems over a bare-metal implementation are the same as those of operating systems over bare metal in general. The main disadvantage of an RTOS is its increased footprint and complexity: the RTOS itself uses extra code space and RAM, and generally each task has its own stack. The advantage is that an RTOS can make it easier to manage complex projects in which multiple things happen simultaneously and need to be handled with specific timing requirements. The topic of real-time operating systems will be covered in more depth later in this unit.

3.1.3 Viewing operating systems processes and statistics
One of the primary purposes of computer operating systems is to permit multiple programs to work at the same time (or at least to appear to work simultaneously). You can visualise the processes and threads running on your PC; the tools depend on the operating system.

3.1.3.1 Windows OS
With the Microsoft Windows operating system, you can use the Task Manager application. The result is shown in the left image of Figure 3, which shows 175 running processes comprising 2278 threads. To get more information, you can install Microsoft’s Sysinternals Process Explorer, shown on the right side of that figure. This gives the priority level of each process, the number of threads in that process, and the CPU cycles delta (CPU cycles per second given to that process) – indicating a total of around 12 GHz of CPU cycles (the Apple M1 CPU, for comparison, has a 3.2 GHz clock with 10 cores, so such a total is possible on a multi-core CPU).

Figure 2 Windows Task Manager
Figure 3 Processes and threads running on a Windows OS

3.1.3.2 Apple macOS
On Apple macOS, you can use the Activity Monitor GUI; see Figure 4. You can see that the top line in the list of processes is the kernel, which has been started by “root” and is above the security level of the other programs owned by the user called mchssfp3. Overall, the operating system is concurrently running 938 programs, consisting of 5,606 threads! By double-clicking on a process, you can get more information. Figure 5 shows that both the multi-threaded kernel and the single-threaded Bash program imply there are 52 context switches per ms, 19 every microsecond, which seems unlikely! The kernel has had 14% of the CPU, with an accumulated 2 hrs of CPU time; in contrast, Bash has had only 0.1 seconds of CPU time.

Figure 4 Processes and threads running on Apple macOS
Figure 5 Statistics on the multi-threaded kernel process and a single-threaded bash process

3.1.3.3 Linux
Figure 6 shows the Linux Gnome System Monitor GUI, in which you can see the process ID and the thread priority (the pipewire app has a “Very High” priority). Another option is to use the Linux command line and the “top” command. Figure 7 shows the upper part of the screen; there are 213 tasks (processes) running. To see the number of threads instead, press “H” to toggle. Figure 8 shows that there are 508 threads, and it gives the full screen showing the process ID, the user (discussed in the security section), the thread priority, and memory usage information.

Figure 6 Linux Gnome System Monitor GUI
Figure 7 Linux “top” command showing the number of tasks (processes). Press “H” to swap between showing tasks and threads.
Figure 8 Linux “top” command showing the number of threads, process ID, user ID, priority and memory usage

4 OS implementations
4.1 Containers and the virtualisation of operating systems
We previously discussed why a program compiled for the hardware running one operating system will not run on the same hardware under a different operating system. You could dual-boot your computer, but it is more convenient if both operating systems can run simultaneously. Running one operating system on top of another is called a virtual machine. Virtual machines can generally have different operating systems from the host machine, but the program should be compiled for the same CPU (the same instruction set architecture).

4.1.1 Types of virtual machines
There are four types of virtual machine, in increasing order of complexity: process virtual machines, containers, system virtual machines, and virtual machine emulators.

A process virtual machine is a software layer that runs an interpreted platform-independent language, such as the JVM, which runs Java programs, and the PVM, which runs Python programs. Interpreted languages are discussed below.

Containers use a lightweight OS-level virtualisation method in which the host kernel creates multiple isolated user-space instances. A container is an executable package that includes the executable program and all the dependencies that the executable needs to run (for example, a particular version of the Python environment).

System virtual machines (aka full virtualisation) let guest virtual machines use the hardware through a hypervisor, a software layer that runs on the host operating system and below the guest virtual machines’ operating systems. Modern computer CPUs have functionality to aid the hypervisor software. Examples of hypervisors are Parallels, VMware, and VirtualBox.

It is even possible to have a virtual machine emulator that can run a different instruction set architecture from the host machine by translating the machine code between instruction sets. An example emulator is QEMU. However, we will discuss only standard virtual machines.

4.1.2 Compiled and interpreted languages
Compiled languages, such as C, are converted (compiled and linked) from their source language to binary machine code in the instruction set architecture of the CPU, either for a bare-metal system or for one with an operating system. Code compiled to run with an operating system will have some extra code added so it can interface with the OS. Rather than compiling the whole program into machine code in one go, “interpreted languages” use interpreters to convert and execute the program one statement at a time.
