Thursday, January 4, 2024

RocksDB 8.x benchmarks: large server, IO-bound

This post has results for performance tests of all 8.x versions from 8.0.0 to 8.9.2 using a large server and an IO-bound workload. In a previous post I shared results for the same hardware with a cached database.

tl;dr

  • There is a small regression that arrives in RocksDB 8.6 for overwriteandwait (write-only, random writes), but only with buffered IO. I think this is caused by changes to compaction readahead. For now I will reuse RocksDB issue 12038 for this.
  • I focus on the benchmark steps that aren't read-only because they suffer less from noise. These benchmark steps are fillseq, revrangewhilewriting, fwdrangewhilewriting, readwhilewriting and overwriteandwait. I also focus on leveled more than universal, in part because there is more noise with universal but also because the workloads I care about most use leveled.

Builds

I compiled RocksDB 8.0.0, 8.1.1, 8.2.1, 8.3.3, 8.4.4, 8.5.4, 8.6.7, 8.7.3, 8.8.1 and 8.9.2 with gcc. These are the latest patch releases of each 8.x minor version.

Benchmark

All tests used a server with 40 cores, 80 HW threads, 2 sockets, 256GB of RAM and many TB of fast NVMe SSD with Linux 5.1.2, XFS and SW RAID 0 across 6 devices. For the results here the database is larger than memory (IO-bound). The benchmark was repeated for leveled and universal compaction using both buffered IO and O_DIRECT.

Everything used the LRU block cache and the default value for compaction_readahead_size. Soon I will switch to using the hyper clock cache once RocksDB 9.0 arrives.
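For reference, a minimal sketch of how I'd pin the cache choice on the db_bench command line. This isn't the exact command the benchmark scripts generate, the flag names are from my reading of db_bench --help on recent releases, and the cache size is just a placeholder:
# sketch: select the LRU block cache explicitly; the hyper clock cache would be --cache_type=hyper_clock_cache
# the cache size below is a placeholder, not the value used for these tests
./db_bench --benchmarks=readwhilewriting \
  --cache_type=lru_cache \
  --cache_size=$(( 64 * 1024 * 1024 * 1024 ))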

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 24 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run them is: 
bash x3.sh 24 no 3600 c40r256bc180 40000000 4000000000 iobuf iodir
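The wrapper hides the per-step details, but each step boils down to a db_bench call along these lines. This is a rough sketch rather than the exact flags the scripts generate, and I am assuming the 3600 argument above is the per-step duration in seconds:
# rough sketch of one read-write step (readwhilewriting) as a direct db_bench call
./db_bench --benchmarks=readwhilewriting \
  --use_existing_db=1 \
  --threads=24 \
  --duration=3600 \
  --db=/path/to/db   # placeholder path
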
A spreadsheet with all results is here and performance summaries are here.

Results: leveled

There is one fake regression in overwriteandwait for RocksDB 8.6.7. The issue is that the db_bench benchmark client ignored a new default value for compaction_readahead_size. That has been fixed in 8.7.
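A workaround on 8.6 is to not rely on the db_bench default and pass the option explicitly, as in the sketch below. The 2MB value is my reading of the new library default, so treat it as an assumption:
# force db_bench to use the intended readahead rather than its stale flag default
./db_bench --benchmarks=overwrite \
  --compaction_readahead_size=$(( 2 * 1024 * 1024 ))   # 2MB, assumed to be the new default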

There is one real regression in overwriteandwait that probably arrived in 8.6 and is definitely in 8.7 through 8.9. The throughput for overwriteandwait drops about 5% from 8.5 to 8.7+. I assume this is from changes to compaction readahead that arrived in 8.6. These changes affect readahead done when buffered IO is used but not when O_DIRECT is used, and in the charts below the regression does not repeat with O_DIRECT.
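For what it's worth, these are the db_bench flags that separate the two IO modes. I assume the iobuf and iodir arguments to the wrapper select between them:
# buffered IO, where the regression shows up
./db_bench --benchmarks=overwrite --use_direct_reads=0 --use_direct_io_for_flush_and_compaction=0
# O_DIRECT, where it does not
./db_bench --benchmarks=overwrite --use_direct_reads=1 --use_direct_io_for_flush_and_compaction=1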

From the performance summary for overwriteandwait with buffered IO (see here)
  • compaction wall clock time (c_wsecs) increases by ~3% from ~18200 in 8.5 to ~18700 in 8.7+
  • compaction CPU seconds (c_csecs) decreases by ~5% from ~18000 in 8.5 to ~17200 in 8.7+
  • the c_csecs / c_wsecs ratio is ~0.99 for 8.0 thru 8.5 and drops to ~0.92 in 8.7+, so one side effect of the change in 8.6 is that compaction threads see more IO latency (worked out in the short calculation after this list)
  • this issue doesn't repeat with O_DIRECT, see here
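Working through the arithmetic with the approximate values above:
echo "scale=3; 18000/18200" | bc   # ~0.99 = c_csecs / c_wsecs in 8.5
echo "scale=3; 17200/18700" | bc   # ~0.92 = c_csecs / c_wsecs in 8.7+
echo "18200 - 18000" | bc          # ~200 seconds off-CPU (IO wait) in 8.5
echo "18700 - 17200" | bc          # ~1500 seconds off-CPU (IO wait) in 8.7+
So while compaction wall clock time only grows by ~3%, the time that compaction threads spend off-CPU, presumably waiting on reads, grows by roughly 7x.
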
From iostat metrics during overwriteandwait with buffered IO
  • rawait (r_await) drops from 0.21 in 8.5 to ~0.08 in 8.7+
  • rareq-sz (rareqsz) drops from 28.3 in 8.5 to ~9 in 8.7+
  • the decrease in rawait was expected given the decrease in rareq-sz; the real problem is the drop in rareq-sz, as the only reads during overwriteandwait are from compaction
  • this issue doesn't repeat with O_DIRECT
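The tables below use shorthand for iostat -x columns: rps, rmbps, rrqmps, rawait and rareqsz are r/s, rMB/s, rrqm/s, r_await and rareq-sz, with the matching w* columns for writes. A sketch of how I'd collect them, assuming sysstat's iostat and a placeholder device name:
# sample extended per-device stats once per second, skipping the since-boot report
# md0 is a placeholder for the SW RAID device
iostat -y -x -m 1 | grep md0
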
leveled, buffered IO
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz ver
3762    4762    70.0    0.00    0.21    28.3    5648    576.2   0.00    0.06    104.3   8.5.4
3879    21308   90.9    0.00    0.05    4.2     4393    447.7   0.00    0.06    104.8   8.6.7
3790    9029    79.9    0.00    0.07    9.0     5229    535.8   0.00    0.06    105.2   8.7.3
3790    9678    74.5    0.00    0.08    8.3     5283    539.6   0.00    0.06    104.8   8.8.1
3790    9808    75.5    0.00    0.08    8.5     5298    540.1   0.00    0.06    104.7   8.9.2

leveled, O_DIRECT
c       rps     rmbps   rrqmps  rawait  rareqsz wps     wmbps   wrqmps  wawait  wareqsz ver
3765    5236    619.4   0.00    0.32    120.5   5779    687.7   0.00    0.07    121.1   8.5.4
4187    37528   340.4   0.00    0.09    9.2     1908    218.5   0.00    0.06    118.0   8.6.7
3754    5170    612.6   0.00    0.33    121.1   5708    679.1   0.00    0.07    121.4   8.7.3
3759    5084    602.8   0.00    0.35    121.1   5612    668.0   0.00    0.08    121.5   8.8.1
3759    5048    598.1   0.00    0.37    121.1   5574    663.3   0.00    0.08    121.4   8.9.2

These charts show relative QPS which is (QPS for my version / QPS for RocksDB 8.0).

First is with buffered IO (no O_DIRECT)
Next is with O_DIRECT (no OS page cache)

Results: universal

Summary
  • Just like above for leveled, there is a bogus regression for overwriteandwait with RocksDB 8.6
  • Results here have more variance than the results for leveled above. While I have yet to prove this, I suspect universal compaction benchmarks are inherently prone to more variance, so I don't think there are regressions here.
These charts show relative QPS which is (QPS for my version / QPS for RocksDB 8.0).

First is with buffered IO (no O_DIRECT)
Next is with O_DIRECT (no OS page cache)

