AI - Performance Methodology in the Cloud – Part 4

Study Notes

Performance characterization is a process in which the performance monitoring unit (PMU) within the CPU lets you collect hardware counters, some of which can reveal patterns.
One such pattern is false sharing. False sharing occurs when two independently declared variables, each accessed by a different thread, lie on the same cache line, which is the unit in which a processor accesses its cache.
From a software point of view, thread-0 is accessing variable-A and thread-1 is accessing variable-B. Because both variables are on the same hardware cache line, however, the moment thread-0 changes that line, the line has to ping-pong to the other CPU so that the change stays consistent.
If thread-1 then makes a change, the line must go back, be made consistent again, and ping-pong once more. This ping-pong movement of the cache line, which is needed to keep the caches consistent and coherent, ends up diminishing performance.
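The layout side of false sharing can be sketched in a few lines. The example below is only an illustration: it assumes a 64-byte cache line (common on x86, but not universal) and assumes the structure itself starts on a cache-line boundary. The names `Packed`, `Padded`, and `same_cache_line` are hypothetical and chosen for this sketch; it checks offsets only and does not measure any performance effect.

```python
import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes; verify for your CPU


class Packed(ctypes.Structure):
    # Two independent counters declared back to back.
    _fields_ = [
        ("a", ctypes.c_uint64),  # imagined to be updated by thread-0
        ("b", ctypes.c_uint64),  # imagined to be updated by thread-1
    ]


class Padded(ctypes.Structure):
    # Same counters, with filler bytes so that b starts on the next line.
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("pad", ctypes.c_uint8 * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def same_cache_line(struct_type, field_x, field_y, line_size=CACHE_LINE):
    """Return True if both fields start in the same cache line.

    Assumes the structure begins on a cache-line boundary.
    """
    off_x = getattr(struct_type, field_x).offset
    off_y = getattr(struct_type, field_y).offset
    return off_x // line_size == off_y // line_size


if __name__ == "__main__":
    print("Packed a/b share a line:", same_cache_line(Packed, "a", "b"))
    print("Padded a/b share a line:", same_cache_line(Padded, "a", "b"))
```

In the packed layout the two counters are candidates for false sharing; padding trades a little memory for keeping each thread's variable on its own line.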
Through characterization at runtime, you can also find out that the P-states and the C-states, meaning the performance and idle states of the CPU, have a lot to do with this problem. Software cannot easily detect that, because a CPU goes to sleep whenever its work is only intermittent.
You can see that, with the C-state level left at its default, which is 9 on the cloud, there is a latency excursion, after which latency goes down again.
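One way to see which idle states a machine offers, and how long each takes to wake from, is to read the Linux cpuidle entries in sysfs. The sketch below assumes a Linux system that exposes `/sys/devices/system/cpu/cpu0/cpuidle`; many cloud instances do not, in which case the function simply returns an empty list. The function name `read_cstates` is hypothetical.

```python
from pathlib import Path


def read_cstates(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, exit_latency_us) per idle state, shallowest first.

    Returns an empty list if the directory is not present.
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    states = []
    for state_dir in base.glob("state[0-9]*"):
        # Directory names look like state0, state1, ...; keep the number
        # so that state10 sorts after state9.
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text().strip())
        states.append((index, name, latency_us))
    return [(name, latency_us) for _, name, latency_us in sorted(states)]


if __name__ == "__main__":
    for name, latency_us in read_cstates():
        print(f"{name}: exit latency {latency_us} us")
```

Deeper states generally report larger exit latencies, which is one reason a mostly idle CPU can show a latency excursion when work arrives.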
Characterization is the process of determining the cause of a performance issue.
Characterization can be done with the help of tools such as perf or PerfSpect.
The top-down hierarchy starts at level 1, where the CPU can be stalled because it is frontend bound, meaning the frontend is not supplying it with instructions to execute.
Under each of these levels, further levels tell you where in the frontend you were bound.
If the cause was latency, they tell you where you were latency bound.
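As a rough illustration of level 1, the sketch below splits pipeline slots into the four commonly used top-down categories: frontend bound, bad speculation, retiring, and backend bound. It follows the published top-down formulas under the assumption of a 4-wide core. The parameter names are placeholders for this sketch, not actual PMU event names, and the exact events and width differ by microarchitecture, so tools such as perf or PerfSpect remain the reference for real measurements.

```python
def topdown_level1(cycles, uops_not_delivered, uops_issued,
                   uops_retired, recovery_cycles, width=4):
    """Return level-1 top-down fractions of total pipeline slots.

    width is the assumed number of issue slots per cycle.
    """
    slots = width * cycles
    # Slots where the frontend delivered no uop to a ready backend.
    frontend = uops_not_delivered / slots
    # Slots wasted on uops that never retired, plus recovery time.
    bad_spec = (uops_issued - uops_retired + width * recovery_cycles) / slots
    # Slots that did useful, retired work.
    retiring = uops_retired / slots
    # Whatever remains is attributed to backend stalls.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


def dominant_category(fractions):
    """Return the category with the largest share of slots."""
    return max(fractions, key=fractions.get)


if __name__ == "__main__":
    # Made-up counter values, purely for illustration.
    result = topdown_level1(cycles=1000, uops_not_delivered=2000,
                            uops_issued=1300, uops_retired=1200,
                            recovery_cycles=25)
    print(result, dominant_category(result))
```

Once the dominant category is known, the lower levels of the hierarchy narrow down where inside that category the slots were lost.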
To figure out which part of the service is blocked, imagine that after the map and zoom, you're left with the block diagram shown below.
You can then scrutinize the code fragments that cause these issues.
perf is a tool that can be used to profile code and figure out which part of the code is consuming the greatest number of cycles.
perf can record data based on L3 misses as well as cycle counts.
By default, perf records samples based on CPU cycles spent, but it can also record samples based on L3 misses, and either kind of profile can be viewed as a flame graph.
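To make "which part of the code consumes the most" concrete, the sketch below totals samples per leaf function from folded stacks, the one-line-per-stack text format (`frame;frame;frame count`) that flame graph tooling commonly consumes. It assumes the samples have already been recorded and collapsed into that format by other tools; the function name `self_samples` and the sample data are hypothetical. The same aggregation applies whether the samples were taken on cycles or on a cache-miss event.

```python
from collections import Counter


def self_samples(folded_lines):
    """Sum sample counts per leaf (innermost) frame of each folded stack."""
    totals = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated token; frames may
        # themselves contain spaces, so split from the right.
        stack, count = line.rsplit(" ", 1)
        leaf = stack.split(";")[-1]
        totals[leaf] += int(count)
    return totals


if __name__ == "__main__":
    # Made-up folded stacks, purely for illustration.
    example = [
        "main;parse;memcpy 30",
        "main;compute 50",
        "main;render;memcpy 25",
    ]
    for func, count in self_samples(example).most_common():
        print(func, count)
```

A leaf-only total answers "where were the samples taken"; a flame graph adds the call paths that led there.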