Thursday, August 18, 2022

How I do performance tests for RocksDB

My primary tool for doing RocksDB performance tests is db_bench, but as always there are layers of shell scripts to automate the process. First, I will define two terms. A benchmark is a sequence of benchmark steps. A common pattern for big data benchmarks is load and then query where load and query are the steps. I do small data (OLTP) so the common pattern for me is load and then several read+write steps. To save time the load step can be replaced by copying from a backup or snapshot.

I have shared notes on how to do benchmarks for an LSM with a focus on RocksDB and LevelDB. These are based on the many mistakes I have made and occasionally learned from: see here and here.

The layers of shell scripts are:

  • tools/benchmark.sh - runs one benchmark step by invoking db_bench
  • tools/benchmark_compare.sh - runs a benchmark by invoking a sequence of benchmark steps
  • x.sh - selects configuration options based on HW size then calls benchmark_compare.sh
  • x3.sh - selects configuration options based on workload (IO-bound vs cached) then calls x.sh
I also forked the scripts above to work on older versions of RocksDB (versions 4 and 5). The forks are here and were done to reduce the complexity in the scripts above.

The scripts also let me run a benchmark for many versions of RocksDB. When I test many versions it might take more than a week for x3.sh to finish (tmux and screen help here), but it runs unattended which saves me time and in the end I get ascii formatted reports that make it easy to compare results across versions.

RocksDB has many configuration options. The benchmark.sh script has good default values for many options, but the values for some options depend on HW size (amount of RAM, number of CPUs) which is one reason there are layers of scripts.

The benchmark tool, db_bench, also has many configuration options to select different benchmark steps (load in key order, point queries, etc). The primary purpose for benchmark_compare.sh is to run the benchmark steps (and configure db_bench to do that) in an order that is useful to me and minimizes noise. For example, if benchmark steps were to follow overwrite then they might inherit a large and varying amount of compaction debt which would cause variance. The sequence of benchmark steps begins here.

The x.sh script selects values for RocksDB configuration options based on HW capacity. For now I have hardwired this based on known servers that I use (see here) but eventually this will accept CPU and memory sizes as input and go from there.

The x3.sh script is invoked by me. It defines four workloads:
  • byrx
    • byrx is short for cached by RocksDB and the database fits in the RocksDB block cache
  • byos
    • byos is short for cached by OS and the database fits in the OS page cache but is larger than the RocksDB block cache. This simulates fast storage for reads and lets me increase stress on the RocksDB block cache code.
  • iobuf
    • iobuf is short for IO-bound with buffered IO. The database is larger than RAM and RocksDB uses buffered IO.
  • iodir
    • iodir is short for IO-bound with O_DIRECT. The database is larger than RAM and RocksDB uses O_DIRECT for user reads and compaction.
An example command line is:

nohup bash x3.sh 22 no 1800 c30r240 40000000 2000000000 iodir &

Benchmark steps

Note that I have pending changes for benchmark.sh and benchmark_compare.sh that are not yet pushed upstream.

The benchmark steps are:
  • fillseq - loads the database in key order. There is not much compaction debt when this finishes.
  • read-only
    • revrange - reverse range scans, the real use for this is to let compacton catch up
    • fwdrange - forward range scans, this has too much noise that I have yet to explain
    • readrandom - point queries
    • multireadrandom - more point queries, but enhanced by io_uring
  • fragment the LSM tree, prior to this keys of SST files do not overlap
    • overwritesome - does overwrite with –num set to 10% of the keys
    • flush_mt_l0 - flushes the memtable, flushes the L0 then waits for compaction to catch up
  • read+write - perf is reported for reads, the background writer has a 2MB/s rate limit
    • revrangewhilewriting - short, reverse range scans
    • fwdrangewhilewriting - short, forward range scans
    • readwhilewriting - point queries
  • write-only
    • overwrite - the writer does not have a rate limit. If there were more write-only tests that followed then I would use overwriteandwait which waits for compaction to finish when overwrite ends.

No comments:

Post a Comment