Sunday, May 9, 2021

An Off-CPU Analysis Attempt to Find the Reason for a Performance Regression in MariaDB 10.4

I have not written any new blog posts for more than 2 months already. Busy days... But now I am on vacation and my Percona Live Online 2021 talk on flame graphs is coming soon, so I decided to refresh my experience with the bcc tools and try to get some insights into one of the bugs I've reported for MariaDB using off-CPU flame graphs.

The idea was to check why the sysbench oltp_read_write test started to work notably slower in newer versions of MariaDB 10.4.x after 10.4.15. Everything below was done on my good, old and slow Acer netbook with a recently updated Ubuntu: 

openxs@ao756:~/git/server$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
openxs@ao756:~/git/server$ uname -a
Linux ao756 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I've compiled current MariaDB 10.4 from the GitHub sources in my usual way:

openxs@ao756:~/git/server$ git checkout 10.4
...
openxs@ao756:~/git/server$ git pull
...
openxs@ao756:~/git/server$ git submodule update --init --recursive
Submodule path 'extra/wolfssl/wolfssl': checked out '9c87f979a7f1d3a6d786b260653d566c1d31a1c4'
Submodule path 'libmariadb': checked out '180c543704d627a50a52aaf60e24ca14e0ec4686'
Submodule path 'wsrep-lib': checked out 'f271ad0c6e3c647df83c1d5ec9cd26d77cef2337'
Submodule path 'wsrep-lib/wsrep-API/v26': checked out '76cf223c690845bbf561cb820a46e06a18ad80d1'
openxs@ao756:~/git/server$ git branch
  10.3
* 10.4
  10.5
  10.6
openxs@ao756:~/git/server$ git log -1
commit 583b72ad0ddbc46a7aaeda1c1373b89d4bded9ea (HEAD -> 10.4, origin/bb-10.4-merge, origin/10.4)
Merge: 473e85e9316 a4139f8d68b
Author: Oleksandr Byelkin <sanja@mariadb.com>
Date:   Fri May 7 11:50:24 2021 +0200

    Merge branch 'bb-10.4-release' into 10.4
openxs@ao756:~/git/server$ rm CMakeCache.txt
openxs@ao756:~/git/server$ cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/maria10.4 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DPLUGIN_TOKUDB=NO -DWITH_SSL=system
...
-- Generating done
-- Build files have been written to: /home/openxs/git/server
openxs@ao756:~/git/server$ time make -j 3
...
[100%] Building C object extra/mariabackup/CMakeFiles/mariabackup.dir/__/__/libmysqld/libmysql.c.o
[100%] Linking CXX executable mariabackup
[100%] Built target mariabackup

real    74m9,550s
user    134m10,837s
sys     7m0,387s

openxs@ao756:~/git/server$ make install && make clean
...
openxs@ao756:~/git/server$ cd ~/dbs/maria10.4
openxs@ao756:~/dbs/maria10.4$ bin/mysqld_safe --no-defaults --port=3309 --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 &
[1] 27483
openxs@ao756:~/dbs/maria10.4$ 210507 19:15:37 mysqld_safe Logging to '/home/openxs/dbs/maria10.4/data/ao756.err'.
210507 19:15:37 mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/maria10.4/data

openxs@ao756:~/dbs/maria10.4$ bin/mysql --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8
Server version: 10.4.19-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> drop database if exists sbtest;
Query OK, 5 rows affected (1.297 sec)

MariaDB [(none)]> create database sbtest;
Query OK, 1 row affected (0.001 sec)

MariaDB [(none)]> exit
Bye

and compared it to 10.4.15 using the following test:

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=4 prepare
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Initializing worker threads...

Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest4'...
Creating table 'sbtest1'...
Inserting 100000 records into 'sbtest1'
Inserting 100000 records into 'sbtest4'
Inserting 100000 records into 'sbtest2'
Inserting 100000 records into 'sbtest3'
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest5'...
Inserting 100000 records into 'sbtest5'
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest3'...
Creating a secondary index on 'sbtest5'...

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        35630  (118.54 per sec.)
    queries:                             712600 (2370.87 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.5612s
    total number of events:              35630

Latency (ms):
         min:                                    3.30
         avg:                                   33.70
         max:                                 2200.17
         95th percentile:                      164.45
...

openxs@ao756:~/dbs/maria10.4.15$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        56785  (189.25 per sec.)
    queries:                             1135700 (3784.99 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.0501s
    total number of events:              56785

Latency (ms):
         min:                                    3.15
         avg:                                   21.13
         max:                                  704.36
         95th percentile:                      108.68
...

So, basically, with the same test, all tables fitting into the buffer pool (1G), and all but a few default settings, current MariaDB 10.4.19 demonstrates a large drop in throughput (10.4.15 is about 60% faster, i.e. 10.4.19 is roughly 37% slower) and a roughly 50% increase in 95th percentile latency on this netbook (much more than the 15% or so reported previously on a faster quad-core Fedora desktop).
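
For reference, the deltas work out as follows from the sysbench results quoted earlier (plain arithmetic on the numbers above, nothing else assumed):

```shell
# Deltas between the two sysbench runs quoted above (10.4.15 vs 10.4.19).
awk 'BEGIN {
    tps_old = 189.25; tps_new = 118.54        # transactions/sec
    p95_old = 108.68; p95_new = 164.45        # 95th percentile latency, ms
    printf "throughput drop: %.1f%%\n", (1 - tps_new / tps_old) * 100   # 37.4%
    printf "old/new ratio:   %.2fx\n",  tps_old / tps_new               # 1.60x
    printf "p95 increase:    %.1f%%\n", (p95_new / p95_old - 1) * 100   # 51.3%
}'
```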

If you read MDEV-24272 carefully, you'll see the regression was tracked down to a specific commit, but I wanted to apply various tools to actually see where, specifically, more time is spent now. Profiling with perf and creating on-CPU flame graphs did not give me any clear insight that would explain the increase in latency, so my next idea was to trace the time spent off-CPU, that is, to find out how long the MariaDB server waits and where in the code that mostly happens.

For this I've used the offcputime tool:

openxs@ao756:~$ /usr/share/bcc/tools/offcputime -h
usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]
                  [--stack-storage-size STACK_STORAGE_SIZE]
                  [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                  [duration]

Summarize off-CPU time by stack trace

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -t TID, --tid TID     trace this TID only
  -u, --user-threads-only
                        user threads only (no kernel threads)
  -k, --kernel-threads-only
                        kernel threads only (no user threads)
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -d, --delimited       insert delimiter between kernel/user stacks
  -f, --folded          output folded format
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
  -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
  -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
  --state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offcputime             # trace off-CPU stack time until Ctrl-C
    ./offcputime 5           # trace for 5 seconds only
    ./offcputime -f 5        # 5 seconds, and output in folded format
    ./offcputime -m 1000     # trace only events that last more than 1000 usec
    ./offcputime -M 10000    # trace only events that last less than 10000 usec
    ./offcputime -p 185      # only trace threads for PID 185
    ./offcputime -t 188      # only trace thread 188
    ./offcputime -u          # only trace user threads (no kernel)
    ./offcputime -k          # only trace kernel threads (no user)
    ./offcputime -U          # only show user space stacks (no kernel)
    ./offcputime -K          # only show kernel space stacks (no user)

I've stored the outputs in /dev/shm to reduce the impact on disk I/O, which I suspected as one of the reasons for the regression:

openxs@ao756:~$ ls /dev/shm
openxs@ao756:~$ mkdir /dev/shm/out
openxs@ao756:~$ ls /dev/shm
out

Basically, the following commands were used to collect folded (ready for building flame graphs) user-space-only stacks with the off-CPU time spent in them, over 60 seconds of tracing while the sysbench test was running, first against a clean setup of MariaDB 10.4.15 and then against current 10.4.19. Flame graphs were then built from those stacks, keeping only the ones that contain the start_thread frame (to filter out irrelevant details):

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10415.out
WARNING: 27 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10415.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10415.svg

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10419.out
WARNING: 24 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10419.svg
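
Since each line of the folded output ends with the total blocked time in microseconds for that stack, the overall off-CPU time in such a file can be sanity-checked with a one-liner (the helper name and the sample file below are mine, just for illustration):

```shell
# Sum the last field (blocked time in microseconds) over all folded
# stack lines ("comm;frame1;...;frameN usec") and print seconds.
sum_offcpu() {
    awk '{ total += $NF } END { printf "%.2f\n", total / 1000000 }' "$1"
}

# Tiny made-up folded file:
printf 'mysqld;start_thread;io_handler_thread 30000000\nmysqld;start_thread;do_command 13000000\n' > /tmp/folded.sample
sum_offcpu /tmp/folded.sample    # prints 43.00 (seconds)
```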

As a result I've got the following graphs (the .png screenshots below were taken from the real .svg files). On 10.4.15 we spent around 43 seconds (out of the 60 monitored) off-CPU, mostly in do_command(), waiting for network I/O, and the graph was the following:

Off-CPU time for MariaDB 10.4.15

In the case of 10.4.19 the graph is very different: we seem to have spent 79 seconds off-CPU, mostly in the background io_handler_thread():

Off-CPU time for MariaDB 10.4.19

I was surprised to see more than 60 seconds spent off-CPU in this case. This is expected, though: offcputime sums the blocked time over all threads, so when several MariaDB threads wait at the same time the total can easily exceed the wall-clock duration of the trace.
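
To see how individual functions contribute to those totals, one can sum the blocked time only over stacks containing a given frame. A quick grep-based helper (the helper name and sample file are mine, for illustration):

```shell
# Sum blocked microseconds only for stacks containing a given frame.
# With many threads blocked in parallel, per-frame totals summed over
# all threads can legitimately exceed the trace duration.
offcpu_for_frame() {
    grep -- "$1" "$2" | awk '{ total += $NF } END { printf "%.2f\n", total / 1000000 }'
}

# Tiny made-up folded file:
printf 'mysqld;start_thread;io_handler_thread 50000000\nmysqld;start_thread;do_command 29000000\n' > /tmp/folded.sample2
offcpu_for_frame io_handler_thread /tmp/folded.sample2    # prints 50.00
```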

I've then tried to use a differential flame graph to highlight the call stacks that the main difference is related to. I've created it from the existing folded stack traces with the following command:

openxs@ao756:~$ ~/git/FlameGraph/difffolded.pl /dev/shm/out/offcpu_10415.out /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Diff Flame Graph" --countname=us > /tmp/offcpu_diff.svg
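
Under the hood, difffolded.pl essentially joins the two folded files on the stack string and emits lines of the form "stack count_before count_after", which flamegraph.pl then colors by relative change. A minimal awk sketch of that merge (simplified, ignoring the real script's options such as normalization; the function name is mine):

```shell
# Rough sketch of the difffolded.pl merge: read two folded files
# ("stack usec" per line, stacks contain no spaces) and print
# "stack count_before count_after", using 0 where a stack is missing.
diff_folded() {
    awk '
        NR == FNR { before[$1] = $NF; next }   # first file: remember counts
        { after[$1] = $NF }                    # second file: remember counts
        END {
            for (s in before) if (!(s in after)) print s, before[s], 0
            for (s in after)  print s, (s in before ? before[s] : 0), after[s]
        }' "$1" "$2"
}

# Tiny made-up example (output order may vary):
printf 'mysqld;start_thread;do_command 10\nmysqld;start_thread;io_handler_thread 5\n' > /tmp/before.folded
printf 'mysqld;start_thread;io_handler_thread 30\n' > /tmp/after.folded
diff_folded /tmp/before.folded /tmp/after.folded
```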

The resulting graph is presented below:

The source of increase is highlighted in red

Here we clearly see that the main increase in time spent waiting in 10.4.19 is related to io_handler_thread(), but the increase happened in almost all background threads.

* * *

To summarize:

  • When a performance regression happens, you should check not only the code paths where the software started to spend more time working, but also where it started to wait more.
  • In the case of https://jira.mariadb.org/browse/MDEV-24272 we clearly started to flush more to disk from the very beginning of the sysbench oltp_read_write test in newer versions of MariaDB 10.4.x, and on my slow encrypted HDD this matters a lot. A load that was supposed to be CPU-bound (as the buffer pool is large enough) becomes disk-bound.
  • Flame graphs are great for highlighting the difference, and in this post I've shown both the classical smart way to produce them without too much impact and a way to highlight the difference between two of them with a differential flame graph produced by the difffolded.pl tool created by Brendan Gregg.
  • Other cases where flame graphs may help MySQL or MariaDB DBAs will be discussed in my upcoming Percona Live 2021 talk on May 12. See you there!
  • I'll get back to this nice regression bug to study the test case in more detail with other tools, maybe more than once. Stay tuned!
