Monday, April 4, 2022

Becoming less confused about perf record

I previously wrote about generating bogus flamegraphs, and now I can be clearer about what I didn't understand. The default behavior for perf record is frequency-based sampling, and on modern Linux the default is probably equivalent to perf record -F 4000. For now, assume the command line also used -p $pid and did not use -e. In that case perf record samples one process using the cycles event.

Disclaimer -- I am still waiting for an expert to review this.
Update - a fix was pushed

With frequency-based sampling (-F $frequency) perf record tries to generate $frequency samples per second. The perf tutorial wiki does a good job explaining what happens next. A sample (stack trace) is taken when the PMU cycles counter overflows. The challenge is that the number of cycles consumed per second by your process varies (more when it is CPU-bound, less when it is IO-bound or mutex-bound). So the sample period (the number of cycles at which overflow occurs) must be changed dynamically to provide $frequency samples/second. A key point here is that perf is sampling based on the number of cycles consumed by the process named in -p. If it were instead sampling based on wall-clock time, for example running PMP once/second, then no adjustment would be needed (just wake up $frequency times/second and get samples).
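To make the adjustment concrete, here is a minimal sketch in Python. It is not perf's or the kernel's actual algorithm. The helper next_period and its arguments are hypothetical names made up for illustration, and it assumes the process keeps consuming cycles at the rate observed in the previous interval.

```python
def next_period(cycles_last_interval, interval_seconds, target_freq):
    """Pick a sample period (in cycles) that would yield about
    target_freq samples per second if the process keeps consuming
    cycles at the rate seen in the last interval.

    Hypothetical helper for illustration; not perf's real code.
    """
    cycles_per_second = cycles_last_interval / interval_seconds
    # A period below 1 cycle makes no sense, so clamp at 1.
    return max(1, round(cycles_per_second / target_freq))


# A CPU-bound second (about 3 billion cycles) versus a mostly blocked one.
busy = next_period(3_000_000_000, 1.0, 4000)
idle = next_period(4_000_000, 1.0, 4000)
print(busy, idle)
```

With these made-up numbers the busy interval gets a period of 750000 cycles and the mostly blocked one gets 1000 cycles, which is why samples taken at different times can carry very different periods.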

The sample periods are displayed in perf script output and here is an example where 1 cycle and 258 cycles are the sample periods. The sample periods also serve as the weight of a sample. For example, if there are 3 stack traces with the sample periods shown below, then perf report will report that g,h,i accounts for 50% of the time, because its period of 2 is half of the total of 4 cycles.

stack   period (cycles)
a,b,c   1
d,e,f   1
g,h,i   2
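The arithmetic behind that 50% can be sketched in a few lines of Python. This is only an illustration of period-as-weight using the three stacks from the table. The function weighted_shares is a hypothetical name, not something provided by perf.

```python
from collections import defaultdict


def weighted_shares(samples):
    """Return each stack's share of the total, using the sample
    period as the weight. Hypothetical helper for illustration."""
    totals = defaultdict(int)
    for stack, period in samples:
        totals[stack] += period
    grand_total = sum(totals.values())
    return {stack: total / grand_total for stack, total in totals.items()}


# (stack, period in cycles) pairs taken from the table above.
samples = [("a,b,c", 1), ("d,e,f", 1), ("g,h,i", 2)]
shares = weighted_shares(samples)
for stack, share in shares.items():
    print(f"{stack}  {share:.0%}")
```

Here g,h,i gets 2 of the 4 total cycles, so 50%, while a,b,c and d,e,f get 25% each.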

The equivalence of sample period with what I called weight in my previous post confused me at first. If perf record is run with -c instead of -F, then sample periods are not adjusted and all samples have the same sample period (the same weight, a value likely to be much larger than 1). That is another workaround for FlameGraph issue 165, as suggested in the issue report.

As I wrote in a LinkedIn discussion: this is interesting and complicated. Given perf record -g -p $pid and a stack trace taken after a sample period of X cycles, the assumption is that the stack trace represents where CPU was consumed by that process for the entire X cycles of that sample period. When the sample period is constant for all stack traces (perf record -c) this doesn't matter, because all stack traces count equally (have equal weight). But it does matter when the sample period is adjusted (perf record -F or perf record) and stack traces have different sample periods (the X in X cycles changes) as in this example.
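A small Python sketch shows why the distinction only matters when periods vary. The stack names and the fixed period of 100000 are made up for illustration, the periods 1 and 258 echo the perf script example mentioned earlier, and the share function is a hypothetical helper rather than anything from perf or FlameGraph.

```python
def share(samples, stack, use_period):
    """Share of one stack, counting each sample either by its period
    (use_period=True) or as 1 (use_period=False). Hypothetical helper."""
    def weight(period):
        return period if use_period else 1

    total = sum(weight(period) for _, period in samples)
    mine = sum(weight(period) for name, period in samples if name == stack)
    return mine / total


# Constant period, as with perf record -c: counting and weighting agree.
fixed = [("a,b,c", 100000), ("d,e,f", 100000), ("g,h,i", 100000)]
# Varying periods, as with perf record -F: they can disagree a lot.
varying = [("a,b,c", 1), ("d,e,f", 1), ("g,h,i", 258)]

for label, samples in (("fixed", fixed), ("varying", varying)):
    by_count = share(samples, "g,h,i", use_period=False)
    by_period = share(samples, "g,h,i", use_period=True)
    print(f"{label}: by count {by_count:.3f}, by period {by_period:.3f}")
```

With the fixed period both views give g,h,i one third of the total. With the varying periods, counting samples still gives one third, but weighting by period gives 258 out of 260 cycles.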

In addition to the links above, these helped reduce my confusion. So did a co-worker from the kernel team:

