Saturday, January 16, 2021

Linux /proc Filesystem for MySQL DBAs - Part IV, Creating Off-CPU Flame Graphs

 My upcoming FOSDEM 2021 MySQL Devroom talk based on this series of blog posts is already prepared and recorded. But in the process I noted that some more details should be shared than one can cover in a 20 minutes talk. So I decided to continue the series, and today I am going to show what one can do with kernel stack samples collected from /proc, for example, with the psn tool we discussed in the previous post.

There we've seen that psn allows to summarize kernel stack samples among other per-thread details, for example, we can get the following for Percona Server running under some sysbench load:

openxs@ao756:~$ sudo psn -G kstack -p `pidof mysqld`

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, stack for 5 seconds... finished.


=== Active Threads =========================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm     | state                  | kstack                                                                                                                                                                             
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      99 |        0.99 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     
      81 |        0.81 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                 
      70 |        0.70 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_poll()->do_sys_poll()->poll_schedule_timeout()                                                                                                    
...
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_sendto()->SYSC_sendto()->sock_sendmsg()->unix_stream_sendmsg()->sock_alloc_send_pskb()->alloc_skb_with_frags()->__alloc_skb()->__kmalloc_reserve.isra.34()->__kmalloc_node_track_caller()


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.02, measure time: 1.89

So we have the number of samples in the first column, for threads in the process we see in the third column, per state we see in the fourth column having the given kernel stack fancy formatted and presented in the fifth column. That is, we get number of samples with the given kernel stack trace, for waiting threads (among others).

Now let me recall where I've seen some similarly structured information processed before? Here it is, "Off-CPU Analysis" by Brendan Gregg! There he described different approaches, with /proc sampling among them, but mostly concentrated on his eBPF/bcc based tool, offcputime. that summarizes off-CPU time by stack trace. I used it on Fedora for a coupled of years already, but on this old Ubuntu 16.04 it is not supposed to work (kernel 4.8+ is needed):

openxs@ao756:~$ uname -a
Linux ao756 4.4.0-198-generic #230-Ubuntu SMP Sat Nov 28 01:30:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -h
usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]
                  [--stack-storage-size STACK_STORAGE_SIZE]
                  [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                  [duration]

Summarize off-CPU time by stack trace

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -t TID, --tid TID     trace this TID only
  -u, --user-threads-only
                        user threads only (no kernel threads)
  -k, --kernel-threads-only
                        kernel threads only (no user threads)
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -d, --delimited       insert delimiter between kernel/user stacks
  -f, --folded          output folded format
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
  -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
  -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
  --state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offcputime             # trace off-CPU stack time until Ctrl-C
    ./offcputime 5           # trace for 5 seconds only
    ./offcputime -f 5        # 5 seconds, and output in folded format
    ./offcputime -m 1000     # trace only events that last more than 1000 usec
    ./offcputime -M 10000    # trace only events that last less than 10000 usec
    ./offcputime -p 185      # only trace threads for PID 185
    ./offcputime -t 188      # only trace thread 188
    ./offcputime -u          # only trace user threads (no kernel)
    ./offcputime -k          # only trace kernel threads (no user)
    ./offcputime -U          # only show user space stacks (no kernel)
    ./offcputime -K          # only show kernel space stacks (no user)

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -K -f 5 -p `pidof mysqld`
could not open bpf map: stack_traces, error: Invalid argument
Traceback (most recent call last):
  File "/usr/share/bcc/tools/offcputime", line 235, in <module>
    b = BPF(text=bpf_text)
  File "/usr/lib/python2.7/dist-packages/bcc/__init__.py", line 325, in __init__
    raise Exception("Failed to compile BPF text")
Exception: Failed to compile BPF text

and it does not work as you can see other than wiht -h option to get the usage details. That's unfortunate, as this slow netbook could really benefit from the Off-CPU analysis with its 2 slow cores and encrypted slow HDD :) The offcputime tool output was then used to create nice flame graphs that allow to identify the most often happening kinds of wait:

# /usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
...
# git clone https://github.com/brendangregg/FlameGraph # cd FlameGraph # ./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

Flame graph created like that shows all off-CPU stack traces, with stack depth on the y-axis, and the width corresponds to the total time in each stack, along with application level stacks.

I could not get these on my Ubuntu with offcputime. But I tried to find out what kind of outp[ut the tool produces with the -f option we see used above. We can see this without running the tool from this file, https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt:

"A -f option will emit output using the "folded stacks" format, which can be
read directly by flamegraph.pl from the FlameGraph open source software
(https://github.com/brendangregg/FlameGraph). Eg:

# ./offcputime -K -f 5
bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;...;schedule 8
..."

So, the format is simple: first column is the program name and kernel stack trace, separated by ';', and then the number of times we've noted this stack for the program. I can surely get the output of psn converted to such simple format!

I've started a sysbench test like this:

openxs@ao756:~$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=sbtest --threads=16 --time=1200 --report-interval=10 run
sysbench 1.1.0-faaff4f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 16 tps: 69.66 qps: 1423.20 (r/w/o: 997.63/284.64/140.92) lat (ms,95%): 383.33 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 16 tps: 79.81 qps: 1596.57 (r/w/o: 1117.32/319.63/159.62) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
...

and in another shell after several attempts filtered out and agrregated exactly what I needed (I'll skip explanations of my lame awk and sed usage, this transformation can be surely done in a more efficient way and with less commands in the pipe):

openxs@ao756:~$ sudo psn -d 60 -G kstack | grep -v Running | awk -F\| '{ print $3, $5, $1 }' |  sed 's/->/;/g' | grep '^.(' | sed 's/(//g' | sed 's/)//g' | awk '{ print $1";"$2, $3 }' > /tmp/psnstacks.txt

Kernel stacks of all the processes were collected for 60 seconds, processed to match the format needed to build a flame graph, and saved into the file. 

The next step is to process it to create an .svg file with the graph:

openxs@ao756:~$ git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph based on proc sampling" --countname=hits < /tmp/psnstacks.txt > ~/Documents/psn1.svg
openxs@ao756:~$ ls -l ~/Documents/psn1.svg
-rw-rw-r-- 1 openxs openxs 25999 Jan 16 15:02 /home/openxs/Documents/psn1.svg

The result (with a typo I made in the original command) was the following (this is a .png made from a screenshot of a browser used to work with the psn1.svg):

You can create Off-CPU flame graphs like the above on any Linux 2.6.x by sampling /proc!


I see that basically most of the time when mysqld threads are not running is spent on jbd2_log_wait_commit system call waits. Now we can go to the Linux kernel source code and study what it is about:

/*
 * Wait for a specified commit to complete.
 * The caller may not hold the journal lock.
 */
int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
{
    int err = 0;

    read_lock(&journal->j_state_lock);
#ifdef CONFIG_PROVE_LOCKING
    /*
     * Some callers make sure transaction is already committing and in that
     * case we cannot block on open handles anymore. So don't warn in that
     * case.
     */
    if (tid_gt(tid, journal->j_commit_sequence) &&
        (!journal->j_committing_transaction ||
         journal->j_committing_transaction->t_tid != tid)) {
        read_unlock(&journal->j_state_lock);
        jbd2_might_wait_for_commit(journal);
        read_lock(&journal->j_state_lock);
    }
#endif
...

 So we are waiting for commit of write to the journal of ext4 filesystem here. Fair enough.

* * *

To summarize, the fact that you do not have bcc tools installed or use old enough kernel for them not to work at all does not prevent you from doing off-CPU analysis based on /proc sampling approach. With some simple scripting and command line text processing applied you can get folded stacks for further analysis based on flame graphs. This may help to resolve some obscure MySQL performance problems that go beyong bad execution plans or spinning and wasting CPU cicles...

Moreover, the impact of such sampling is notably lower than with bcc tools. This is a topic I am going to study in my next blog post in this series, based on testing on Fedora 31 box where all the tools just work.

No comments:

Post a Comment