
Desislav Andreev and Stanimir Lukanov, C++ Programming for Linux Systems: Create Robust Enterprise Software for Linux and Unix-Based Operating Systems, Packt Publishing, 2023




…concurrent execution that are not directly related to data races. We continue with cache-friendly code.

Discussing multiprocessor systems – cache locality and cache friendliness in C++

You probably recall Chapter 2 at this point, where we discussed multithreaded and multi-core processors. The respective computational units were presented as processors. We also visualized the transport of instructions from the NVM (the disk) to the processors, through which we explained the creation of processes and software threads.

We want our code to be as performant as required. The most important aspect of getting code to perform well is the choice of appropriate algorithms and data structures. With a bit of thought, you can squeeze the most out of every last CPU cycle. One of the most common examples of misusing algorithms is sorting a large, unordered array with bubble sort. So, make sure to learn your algorithms and data structures – together with the knowledge from this section and beyond, this will make you a really powerful developer.

As you already know, the further we get from the RAM and the closer we get to the processor registers, the faster the operations become and the smaller the memory capacity gets. Each time the processor loads data from the RAM into the cache, it will either sit and wait for that data to show up or execute other, unrelated tasks. Thus, from the perspective of the current task, CPU cycles are wasted. Of course, reaching 100% CPU utilization might be impossible, but we should at least be aware when the CPU is doing needless work. All of this might sound abstract at this point, but concurrent systems will suffer if we act carelessly.

The C++ language provides access to multiple tools for further performance improvements, including prefetching mechanisms through hardware instructions and branch prediction optimization. Even without doing anything in particular, modern compilers and CPUs do a great job with these techniques. Still, we can improve performance further by providing the right hints, options, and instructions. It's also a good idea to be aware of the data in the cache to help reduce the time taken to access it. Remember that the cache is just a type of fast, temporary storage for data and instructions. So, we can use the features of C++ to our advantage when we treat the cache well, known as cache-friendly code. Note the inverse of this statement as well: misusing C++ features will lead to poor cache performance, or at least not the best performance possible. You've probably already guessed that this is related to the system's scale and the requirement for fast data access. Let's discuss this further in the next section.

Considering cache locality through cache-friendly code

We mentioned the concept of cache-friendly code already, but what does it truly mean? First of all, you need to be aware of cache locality. Our first goal is to make frequently used data easily accessible, so the process runs faster. The second goal is to store in memory only what we need to store. Let's keep the allocations small. For example, if you need to store a number of dice values (1-6), you don't need unsigned long longs; those values will fit in an unsigned int or even an unsigned char. As a result, caching has become a major aspect of almost every system.
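To make the dice example concrete, here is a minimal sketch (the container names and the element count are illustrative, not taken from the book) comparing the memory footprint of one million rolls stored as single bytes versus 8-byte integers:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t kRolls = 1'000'000;  // illustrative number of dice rolls

    // Dice values (1-6) fit comfortably in a single byte...
    std::vector<std::uint8_t> compact(kRolls, 3);
    // ...while unsigned long long spends 8 bytes per value.
    std::vector<unsigned long long> wasteful(kRolls, 3);

    std::cout << "compact:  " << compact.size() * sizeof(compact[0]) << " bytes\n"
              << "wasteful: " << wasteful.size() * sizeof(wasteful[0]) << " bytes\n";
    // Assuming 64-byte cache lines, the compact vector packs 64 values per line
    // versus 8, so far fewer lines have to be fetched from RAM.
}

The smaller element type is not only about total memory: every cache line the processor brings in now carries eight times as many useful values.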
Earlier in the book, we mentioned that slower hardware, such as disks, sometimes has its own cache memory to reduce the time taken to access frequently opened files. OSs can also cache frequently used data, for example, files, as chunks of virtual address space, improving performance even more. This reuse of recently accessed data is known as temporal locality. Consider the following scenario: a piece of data is not found in the cache on the first try – this is known as a cache miss. It is then looked up in the RAM, found, and loaded into the cache as one or more cache blocks, or cache lines. Afterwards, if this data is requested a number of subsequent times and is still found in the cache, known as a cache hit, it will remain in the cache and guarantee faster access, or at least faster than the first cache miss. You can observe this in the following diagram:

Figure 9.2 – Representation of temporal locality on the hardware level

As we mentioned with the prefetching mechanisms earlier, an object with multiple cache hits is a strong hint that the data around it might also be referenced soon. This causes the processor to request, or prefetch, that additional nearby data from the RAM and load it a priori, so it will already be in the cache when it is eventually needed. This exploits spatial locality: accessing nearby memory benefits from the fact that caching is done in chunks, known as cache lines, so we pay for a single transfer and get several bytes of memory. The prefetching technique assumes that the code already has spatial locality in order to improve performance.

Both locality principles are based on assumptions, but code branching requires good design: the simpler the branch tree, the easier it is to predict. Again, you need to carefully consider the data structures and algorithms to be used. You also need to aim for contiguous memory access and reduce the code to simple loops and small functions; for example, switch from linked lists to arrays or matrices. For small-sized objects, the std::vector container is still the optimal choice. Additionally, we ideally seek a data structure whose objects fit into one cache line – but sometimes this is just not possible because of the application's requirements. Our process should access the data in contiguous blocks, where each one has the size of a cache line (typically 64 bytes, though this depends on the system). If we want to do parallel evaluations, it is preferable for each CPU core (processor) to handle data in cache lines different from other cores' data. If not, the cache hardware will have to move data back and forth between cores, the CPU will again waste time on meaningless work, and performance will worsen instead of improving. This is known as false sharing, which we'll look at in the following section.

A glance at false sharing

As a rule, small pieces of data will be put together in a single cache line unless the programmer instructs otherwise, as we will see in the following examples. This is the way processors work in order to keep latency low – they handle one cache line per core at a time. Even if it's not full, a cache line's size is the smallest block the CPU will handle. As mentioned earlier, if the data in that cache line is requested by two or more threads independently, this will slow down the multithreaded execution. Dealing with the effects of false sharing means getting predictability.
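To see the problem in code, here is a minimal sketch (the struct name and iteration count are illustrative, not a listing from the book): two counters declared next to each other will normally share one cache line, so two threads incrementing them independently keep invalidating each other's copy of that line even though no data is actually shared.

#include <iostream>
#include <thread>

// Two adjacent counters with no padding between them will normally
// end up in the same cache line.
struct Counters {
    long long first  = 0;
    long long second = 0;
};

int main() {
    Counters counters;

    // Each thread updates its "own" counter, yet every increment invalidates
    // the other core's copy of the shared cache line (false sharing).
    std::thread t1([&counters] {
        for (int i = 0; i < 10'000'000; ++i) ++counters.first;
    });
    std::thread t2([&counters] {
        for (int i = 0; i < 10'000'000; ++i) ++counters.second;
    });

    t1.join();
    t2.join();
    std::cout << counters.first + counters.second << '\n';
}

There is no data race here, since each thread touches a different object, yet the run is typically noticeably slower than it would be if the two counters lived in separate cache lines.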
Just as code branching can be predicted, the system programmer can ensure that an object is the size of a cache line, so each separate object resides in its own memory block. In addition, all computations can happen in the local scope, and the shared data modifications can take place at the end of a given procedure. Of course, such measures waste some resources, but it's a matter of design and preferences. Nowadays, we can also use compiler optimizations to improve this predictability and performance, but we shouldn't always rely on them. Let's first check the size of our cache line:

#include <iostream>
#include <new>

using std::hardware_destructive_interference_size;

int main() {
    std::cout << "Cache line size: "
              << hardware_destructive_interference_size
              << " bytes\n";
    return 0;
}
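Building on that constant, the following is a minimal sketch of one common mitigation (the struct name, thread count, and iteration count are illustrative, not a listing from the book), assuming your standard library provides std::hardware_destructive_interference_size (C++17): aligning each per-thread counter to the cache line size pads it into its own line, so the threads no longer contend.

#include <iostream>
#include <new>
#include <thread>
#include <vector>

// Aligning each counter to the destructive-interference size pads it out to
// a full cache line, so threads updating different counters do not contend.
struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
    long long value = 0;
};

int main() {
    constexpr int kThreads = 4;                 // illustrative thread count
    std::vector<PaddedCounter> counters(kThreads);

    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; ++i) {
        workers.emplace_back([&counters, i] {
            for (int n = 0; n < 10'000'000; ++n)
                ++counters[i].value;            // touches only its own cache line
        });
    }
    for (auto& t : workers) {
        t.join();
    }

    long long total = 0;
    for (const auto& c : counters) {
        total += c.value;
    }
    std::cout << "total = " << total << '\n';
}

The padding trades a little memory per counter for the predictability discussed above: on a typical 64-byte system, each PaddedCounter now occupies exactly one cache line.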
