AI - Performance Methodology in the Cloud – Part 4 – Harshad

AdvancedIntelligence

15 Questions

What is performance characterization?

A process of determining the cause of a performance issue

True or false: False sharing is where two independently declared variables are accessed by the same thread on a processor.

False

True or false: P-states and C-states have no effect on performance.

False

What is false sharing?

When two independently declared variables, each accessed by a different thread on a processor, lie on the same cache line

True or false: The perf tool can be used to identify which part of the code is consuming the greatest number of cycles.

True

What is the consequence of false sharing?

Diminished performance

True or false: The top-down hierarchy in performance characterization starts with level 2.

False

What tool can be used to profile code and figure out which part of the code is consuming the greatest number of cycles?

perf

True or false: Flame graphs are used to record data based on CPU cycles spent.

False

What is the top-down hierarchy of performance characterization?

Frontend bound, latency bound, and L3 misses

What data does perf record by default?

CPU cycles spent

What is the purpose of characterization?

To determine the cause of a performance issue

What is the purpose of the ping-pong movement of the cache line?

To maintain the consistency and coherency of the caches

What is the purpose of P-states and C-states?

To manage the CPU's power states (P-states) and idle states (C-states), letting it save power or sleep when it has no work to do

What tool can be used to record data based on L3 misses and cycle counts?

perf

Study Notes

  • Performance characterization is a process in which you collect counters from the performance monitoring unit (PMU) within the CPU; some of these counters can reveal patterns.
  • One such pattern is false sharing. False sharing occurs when two independently declared variables, each accessed by a different thread on a processor, lie on the same cache line, which is the unit of access for a processor within the cache.
  • From a software point of view, thread-0 accesses variable-A and thread-1 accesses variable-B. Because both variables sit on the same hardware cache line, the moment thread-0 changes that cache line, the line has to ping-pong to the other CPU to keep the change consistent.
  • If thread-1 then makes a change, the line must go back, be made consistent, and ping-pong again. This ping-pong movement of the cache line, which is needed to maintain the consistency and coherency of the caches, results in diminished performance.
  • Through characterization at runtime, you can also find that the P-states and C-states, meaning the power and idle states of the CPU, can have a lot to do with a performance problem. Software cannot easily detect this, because a CPU goes to sleep when it has no work to do between intermittent bursts.
  • With C-state levels left at the default, which is 9 on the cloud, there is a latency excursion, after which latency goes down again.
  • Characterization is the process of determining the cause of a performance issue.
  • Characterization can be done with the help of tools such as perf or PerfSpect.
  • The top-down hierarchy starts at level 1, where the CPU can be stalled because it's frontend bound, meaning it's not getting any instructions to execute.
  • Under each of these levels, you have other levels that tell you where you were bound in the frontend.
  • If it was latency, it tells you where you were latency bound.
  • To figure out which part of the service is blocked, imagine that after the map and zoom, you're left with the block diagram shown below.
  • You can then scrutinize the code fragments that cause these issues.
  • perf is a tool that can be used to profile code and figure out which part of the code is consuming the greatest number of cycles.
  • perf can record data based on L3 misses and cycle counts.
  • By default, perf records data based on CPU cycles spent, but it can also record data based on other events such as L3 misses, and the results can be viewed as flame graphs.
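
The false sharing pattern in the notes comes down to memory layout, so a small layout check may make it concrete. The sketch below is illustrative only and is not from the original material: the 64-byte cache line size, the structure names, and the helper functions are all assumptions, and it further assumes each structure starts on a cache line boundary. It uses Python's `ctypes` to lay out two counters the way a C struct would and reports whether they land on the same cache line.

```python
import ctypes

# Assumption: 64-byte cache lines, common on x86 CPUs; verify for your hardware.
CACHE_LINE_BYTES = 64


class PackedCounters(ctypes.Structure):
    """Two independent counters declared back to back (hypothetical layout)."""
    _fields_ = [
        ("counter_a", ctypes.c_uint64),  # imagined as updated only by thread-0
        ("counter_b", ctypes.c_uint64),  # imagined as updated only by thread-1
    ]


class PaddedCounters(ctypes.Structure):
    """Same counters, with padding that pushes counter_b onto the next line."""
    _fields_ = [
        ("counter_a", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE_BYTES - ctypes.sizeof(ctypes.c_uint64))),
        ("counter_b", ctypes.c_uint64),
    ]


def cache_line_index(struct_type, field_name):
    """Cache line a field falls on, assuming the struct is line-aligned."""
    offset = getattr(struct_type, field_name).offset
    return offset // CACHE_LINE_BYTES


def may_false_share(struct_type, field_x, field_y):
    """True if both fields sit on the same cache line."""
    return cache_line_index(struct_type, field_x) == cache_line_index(struct_type, field_y)


if __name__ == "__main__":
    for layout in (PackedCounters, PaddedCounters):
        shared = may_false_share(layout, "counter_a", "counter_b")
        print(layout.__name__, "same cache line:", shared)
```

Padding trades memory for isolation: each counter gets its own line, so a write by one thread no longer forces the other thread's line to ping-pong. Whether that trade is worthwhile depends on how often both counters are written.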

Explore the process of performance characterization and profiling within the CPU, including the identification of false sharing patterns, the impact of P-states and C-states, and the use of tools like 'perf' for code profiling. Learn how to pinpoint performance issues and optimize code for improved efficiency.
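
As a rough illustration of the perf workflow described in the notes, the sketch below builds `perf record` and `perf report` command lines for either the default cycles event or a cache-miss event. The helper names and the `./my_service` binary are hypothetical. `LLC-load-misses` is used here as a stand-in for L3 misses; it is one of perf's generic cache event names, but whether it is available, and which cache level it maps to, depends on the CPU, so `perf list` should be checked first.

```python
import shlex


def build_perf_record(command, event="cycles", call_graph=True, output="perf.data"):
    """Build an argv list for `perf record` sampling the given event.

    `command` is the program to profile, as a list of arguments.
    """
    argv = ["perf", "record", "-e", event, "-o", output]
    if call_graph:
        argv.append("-g")  # capture call stacks, which flame graphs need
    argv.append("--")  # everything after this is the profiled command
    argv.extend(command)
    return argv


def build_perf_report(input_path="perf.data"):
    """Build an argv list for a text-mode `perf report` on a recording."""
    return ["perf", "report", "-i", input_path, "--stdio"]


if __name__ == "__main__":
    # Hypothetical target binary; substitute the service being characterized.
    target = ["./my_service"]
    print(shlex.join(build_perf_record(target)))
    print(shlex.join(build_perf_record(target, event="LLC-load-misses",
                                       output="perf.l3.data")))
    print(shlex.join(build_perf_report("perf.l3.data")))
```

The helpers only assemble the commands and do not run them, so the output can be reviewed or pasted into a shell on a machine where perf is installed and the event is supported.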
