dokumenpub_c-high-performance-2nbsped-9781839216541-352-410.pdf
Document Details
Uploaded by GoodlySloth8585
Tags
Related
- bsc-3-sem-computer-science-ad-1833-s-2023.pdf
- Lecture 1: Introduction to C++ Programming and Computer Science PDF
- Intro to Computer Science - For Loop & Control Structures PDF
- Chapter 3 Computer Science 12 - Federal Board PDF
- Computer Science I Lecture Notes PDF
- Introduction to Computer Science PDF Fall 2024
Full Transcript
11 Concurrency After covering lazy evaluation and proxy objects in the last chapter, we will now explore how to write concurrent programs in C++ using threads with shared memory. We will look at ways to make concurrent programs correct by writing programs that are free from data races and deadlocks....
11 Concurrency After covering lazy evaluation and proxy objects in the last chapter, we will now explore how to write concurrent programs in C++ using threads with shared memory. We will look at ways to make concurrent programs correct by writing programs that are free from data races and deadlocks. This chapter will also contain advice on how to make concurrent programs run with low latency and high throughput. Before we go any further, you should know that this chapter is not a complete introduction to concurrent programming, nor will it cover all the details of concurrency in C++. Instead, this chapter is an introduction to the core building blocks of writing concurrent programs in C++, mixed with some performancerelated guidelines. If you haven't written concurrent programs before, it is wise to go through some introductory material to cover the theoretical aspects of concurrent programming. Concepts such as deadlocks, critical sections, condition variables, and mutexes will be very briefly discussed, but this will serve more as a refresher than a thorough introduction to the concepts. The chapter covers the following: The fundamentals of concurrent programming, including parallel execution, shared memory, data races, and deadlocks An introduction to the C++ thread support library, the atomic library, and the C++ memory model A short example of lock-free programming Performance guidelines [ 325 ] Concurrency Understanding the basics of concurrency A concurrent program can execute multiple tasks at the same time. Concurrent programming is, in general, a lot harder than sequential programming, but there are several reasons why a program may benefit from being concurrent: Efficiency: The smartphones and desktop computers of today have multiple CPU cores that can execute multiple tasks in parallel. If you manage to split a big task into subtasks that can be run in parallel, it is theoretically possible to divide the running time of the big task by the number of CPU cores. For programs that run on machines with one single core, there can still be a gain in performance if a task is I/O bound. While one subtask is waiting for I/O, other subtasks can still perform useful work on the CPU. Responsiveness and low latency contexts: For applications with a graphical user interface, it is important to never block the UI so that the application becomes unresponsive. To prevent unresponsiveness, it is common to let long-running tasks (like loading a file from disk or fetching some data from the network) execute in separate background threads so that the thread responsible for the UI is never blocked by long-running tasks. Another example where low latency matters is real-time audio. The function responsible for producing buffers of audio data is executed in a separate high-priority thread, while the rest of the program can run in lower-priority threads to handle the UI and so on. Simulation: Concurrency can make it easier to simulate systems that are concurrent in the real world. After all, most things around us happen concurrently, and sometimes it is very hard to model concurrent flows with a sequential programming model. We will not focus on simulation in this book, but will instead focus on performance-related aspects of concurrency. Concurrency solves many problems for us, but introduces new ones, as we will discuss next. What makes concurrent programming hard? There are a number of reasons why concurrent programming is hard, and, if you have written concurrent programs before, you have most likely already encountered the ones listed here: [ 326 ] Chapter 11 Sharing state between multiple threads in a safe manner is hard. Whenever we have data that can be read and written to at the same time, we need some way of protecting that data from data races. You will see many examples of this later on. Concurrent programs are usually more complicated to reason about because of the multiple parallel execution flows. Concurrency complicates debugging. Bugs that occur because of data races can be very hard to debug since they are dependent on how threads are scheduled. These kinds of bugs can be hard to reproduce and, in the worstcase scenario, they may even cease to exist when running the program using a debugger. Sometimes an innocent debug trace to the console can change the way a multithreaded program behaves and make the bug temporarily disappear. You have been warned! Before we start looking at concurrent programming using C++, a few general concepts related to concurrent and parallel programming will be introduced. Concurrency and parallelism Concurrency and parallelism are two terms that are sometimes used interchangeably. However, they are not the same and it is important to understand the differences between them. A program is said to run concurrently if it has multiple individual control flows running during overlapping time periods. In C++, each individual control flow is represented by a thread. The threads may or may not execute at the exact same time, though. If they do, they are said to execute in parallel. For a concurrent program to run in parallel, it needs to be executed on a machine that has support for parallel execution of instructions; that is, a machine with multiple CPU cores. At first glance, it might seem obvious that we always want concurrent programs to run in parallel if possible, for efficiency reasons. However, that is not necessarily always true. A lot of synchronization primitives (such as mutex locks) covered in this chapter are required only to support the parallel execution of threads. Concurrent tasks that are not run in parallel do not require the same locking mechanisms and can be a lot easier to reason about. [ 327 ] Concurrency Time slicing You might ask, "How are concurrent threads executed on machines with only a single CPU core?" The answer is time slicing. It is the same mechanism that is used by the operating system to support the concurrent execution of processes. In order to understand time slicing, let's assume we have two separate sequences of instructions that should be executed concurrently, as shown in the following figure: Figure 11.1: Two separate sequences of instructions The numbered boxes represent the instructions. Each sequence of instructions is executed in a separate thread, labeled T1 and T2. The operating system will schedule each thread to have some limited time on the CPU and then perform a context switch. The context switch will store the current state of the running thread and load the state of the thread that should be executed. This is done often enough so that it appears as if the threads are running at the same time. A context switch is timeconsuming, though, and most likely will generate a lot of cache misses each time a new thread gets to execute on a CPU core. Therefore, we don't want context switches to happen too often. The following figure shows a possible execution sequence of two threads that are being scheduled on a single CPU: Figure 11.2: A possible execution of two threads. The dots indicate context switches The first instruction of thread T1 starts, and is then followed by a context switch to let thread T2 execute the first two instructions. As programmers, we must make sure that the program can run as expected, regardless of how the operating system scheduler is scheduling the tasks. If a sequence, for some reason, is invalid, there are ways to control the order in which the instructions get executed by using locks, which will be covered later on. [ 328 ] Chapter 11 If a machine has multiple CPU cores, it is possible to execute the two threads in parallel. However, there is no guarantee (it's even unlikely) that the two threads will execute on one core each throughout the lifetime of the program. The entire system shares time on the CPU, so the scheduler will let other processes execute as well. This is one of the reasons why the threads are not scheduled on dedicated cores. Figure 11.3 shows the execution of the same two threads, but now they are running on a machine with two CPU cores. As you can see, the second and third instructions of the first thread (white boxes) are executing at the exact same time as the other thread is executing — the two threads are executing in parallel: Figure 11.3: Two threads executing on a multicore machine. This makes it possible to execute the two threads in parallel. Let's discuss shared memory next. Shared memory Threads created in the same process share the same virtual memory. This means that a thread can access any data that is addressable within the process. The operating system, which protects memory between processes using virtual memory, does nothing to protect us from accidentally accessing memory inside a process that was not intended to be shared among different threads. Virtual memory only protects us from accessing memory allocated in different processes to our own. Sharing memory between multiple threads can be a very efficient way to handle communication between threads. However, sharing memory in a safe way between threads is one of the major challenges when writing concurrent programs in C++. We should always strive to minimize the number of shared resources between threads. Fortunately, not all memory is shared by default. Each thread has its own stack for storing local variables and other data necessary for handling function calls. Unless a thread passes references or pointers to local variables to other threads, no other thread will be able to access the stack from that thread. This is one more reason to use the stack as much as possible (if you are not already convinced that the stack is a good place for your data after reading Chapter 7, Memory Management). [ 329 ] Concurrency There is also thread local storage, sometimes abbreviated to TLS, which can be used to store variables that are global in the context of a thread but which are not shared between threads. A thread local variable can be thought of as a global variable where each thread has its own copy. Everything else is shared by default; that is, dynamic memory allocated on the heap, global variables, and static local variables. Whenever you have shared data that is mutated by some thread, you need to ensure that no other thread is accessing that data at the same time or you will have a data race. Remember the figure from the Process memory section of Chapter 7, Memory Management, which illustrated the virtual address space of a process? Here it is again, but modified to show how it looks when a process contains multiple threads. As you can see in the following figure, each thread has its own stack memory, but there is only one heap for all threads: Figure 11.4: A possible layout of the virtual address space for a process The process contains three threads in this example. The heap memory is by default shared by all threads. [ 330 ] Chapter 11 Data races A data race happens when two threads are accessing the same memory at the same time and at least one of the threads is mutating the data. If your program has a data race, it means that your program has undefined behavior. The compiler and optimizer will assume that there are no data races in your code and optimize it under that assumption. This may result in crashes or other completely surprising behavior. In other words, you can under no circumstances allow data races in your program. The compiler usually doesn't warn you about data races since they are hard to detect at compile time. Debugging data races can be a real challenge and sometimes requires tools such as ThreadSanitizer (from Clang) or Concurrency Visualizer (a Visual Studio extension). These tools typically instrument the code so that a runtime library can detect, warn about, or visualize potential data races while running the program you are debugging. Example: A data race Figure 11.5 shows two threads that are going to update an integer called counter. Imagine that these threads are both incrementing a global counter variable with the instruction ++counter. It turns out that incrementing an int might involve multiple CPU instructions. This can be done in different ways on different CPUs, but let's pretend that ++counter generates the following made-up machine instructions: R: Read counter from memory +1: Increment counter W: Write new counter value to memory [ 331 ] Concurrency Now, if we have two threads that are going to update the counter value that initially is 42, we would expect it to become 44 after both threads have run. However, as you can see in the following figure, there is no guarantee that the instructions will be executed sequentially to guarantee a correct increment of the counter variable. Figure 11.5: The two threads are both incrementing the same shared variable Without a data race, the counter would have reached the value 44, but instead, it only reaches 43. In this example, both threads read the value 42 and increment that value to 43. Then, they both write the new value, 43, which means that we never reach the correct answer of 44. Had the first thread been able to write the value 43 before the next thread started to read, we would have ended up with 44 instead. Note also that this would have been possible even if there was only one CPU core. The scheduler could have scheduled the two threads in a similar way so that both read instructions were executed before any writes. Again, this is one possible scenario, but the important thing is that the behavior is undefined. Anything could happen when your program has a data race. One such example is tearing, which is the common term for torn reads and torn writes. This happens when a thread writes parts of a value to memory while another thread reads the value at the same time and therefore ends up with a corrupt value. [ 332 ] Chapter 11 Avoiding data races How can we avoid data races? There are two main options: Use an atomic data type instead of the int. This will tell the compiler to execute the read, increment, and write atomically. We will spend more time discussing atomic data types later in this chapter. Use a mutually exclusive lock (mutex) that guarantees that multiple threads never execute a critical section at the same time. A critical section is a place in the code that must not be executed simultaneously since it updates or reads shared memory that potentially could generate data races. It is also worth emphasizing that immutable data structures — data structures that are never changed — can be accessed by multiple threads without any risk of data races. Minimizing the use of mutable objects is good for many reasons, but it becomes even more important when writing concurrent programs. A common pattern is to always create new immutable objects instead of mutating existing objects. When the new object is fully constructed and represents the new state, it can be swapped with the old object. In that way, we can minimize the critical sections of our code. Only the swap is a critical section, and hence needs to be protected by an atomic operation or a mutex. Mutex A mutex, short for mutual exclusion lock, is a synchronization primitive for avoiding data races. A thread that needs to enter a critical section first needs to lock the mutex (locking is sometimes also called acquiring a mutex lock). This means that no other thread can lock the same mutex until the first thread that holds the lock has unlocked the mutex. In that way, the mutex guarantees that only one thread at a time is inside a critical section. In Figure 11.6, you can see how the race condition demonstrated in the section A data race example can be avoided by using a mutex. The instruction labeled L is a lock instruction and the instruction labeled U is an unlock instruction. The first thread executing on Core 0 reaches the critical section first and locks the mutex before reading the value of the counter. It then adds 1 to the counter and writes it back to memory. After that, it releases the lock. [ 333 ] Concurrency The second thread, executing on Core 1, reaches the critical section just after the first thread has acquired the mutex lock. Since the mutex is already locked, the thread is blocked until the first thread has updated the counter undisturbed and released the mutex: Figure 11.6: The mutex lock is protecting the critical section and avoids data races on the counter variable The net result is that the two threads can update the mutable shared variable in a safe and correct way. However, it also means that the two threads can no longer be run in parallel. If most of the work a thread does cannot be done without serializing the work, there is, from a performance perspective, no point in using threads. The state where the second thread is blocked by the first thread is called contention. This is something we strive to minimize, because it hurts the scalability of a concurrent program. Adding more CPU cores will not improve performance if the degree of contention is high. Deadlock When using mutex locks to protect shared resources, there is a risk of getting stuck in a state called deadlock. A deadlock can happen when two threads are waiting for each other to release their locks. Neither of the threads can proceed and they are stuck in a deadlock state. One condition that needs to be fulfilled for a deadlock to occur is that one thread that already holds a lock tries to acquire an additional lock. When a system grows and gets larger, it becomes more and more difficult to track all locks that might be used by all threads running in a system. This is one reason for always trying to minimize the use of shared resources, and this demonstrates the need for exclusive locking. [ 334 ] Chapter 11 Figure 11.7 shows two threads in a waiting state, trying to acquire the lock held by the other thread: Figure 11.7: An example of a deadlock state Let's discuss synchronous and asynchronous tasks next. Synchronous and asynchronous tasks I will refer to synchronous tasks and asynchronous tasks in this chapter. Synchronous tasks are like ordinary C++ functions. When a synchronous task is finished doing whatever it is supposed to do, it will return the control to the caller of the task. The caller of the task is waiting or blocked until the synchronous task has finished. An asynchronous task, on the other hand, will return the control back to the caller immediately and instead perform its work concurrently. [ 335 ] Concurrency The sequence in Figure 11.8 shows the difference between calling a synchronous and asynchronous task, respectively: Figure 11.8: Synchronous versus asynchronous calls. The asynchronous task returns immediately but continues to work after the caller has regained control. If you haven't seen asynchronous tasks before, they might look strange at first, since ordinary functions in C++ always stop executing when they encounter a return statement or reach the end of the function body. Asynchronous APIs are getting more and more common, though, and it is likely that you have encountered them before, for example, when working with asynchronous JavaScript. Sometimes, we use the term blocking for operations that block the caller; that is, make the caller wait until the operation has finished. With a general introduction to concurrency behind us, it's time to explore the support for threaded programming in C++. Concurrent programming in C++ The concurrency support in C++ makes it possible for a program to execute multiple tasks concurrently. As mentioned earlier, writing a correct concurrent C++ program is, in general, a lot harder than writing a program that executes all tasks sequentially in one thread. This section will also demonstrate some common pitfalls to make you aware of all the difficulties involved in writing concurrent programs. Concurrency support was first introduced in C++11 and has since been extended into C++14, C++17, and C++20. Before concurrency was part of the language, it was implemented with native concurrency support from the operating system, POSIX Threads (pthreads), or some other library. [ 336 ] Chapter 11 With concurrency support directly in the C++ language, we can write cross-platform concurrent programs, which is great! Sometimes, however, you have to reach for platform-specific functionality when dealing with concurrency on your platform. For example, there is no support in the C++ standard library for setting thread priorities, configuring CPU affinity (CPU pinning), or setting the stack size of new threads. It should also be said that the thread support library has been extended quite a bit with the release of C++20, and more features are likely to be added in future versions of the language. The need for good concurrency support is increasing because of the way hardware is being developed, and there is a lot yet to be discovered when it comes to the efficiency, scalability, and correctness of highly concurrent programs. The thread support library We will now take a tour through the C++ thread support library and cover its most important components. Threads A running program contains at least one thread. When your main function is called, it is executed on a thread usually referred to as the main thread. Each thread has an identifier, which can be useful when debugging a concurrent program. The following program prints the thread identifier of the main thread: int main() { std::cout