Friday, June 24, 2022

Fixing mmap performance for RocksDB

RocksDB inherited support for mmap from LevelDB. Performance was worse than expected because filesystem readahead fetched more data than needed, as I explained in a previous post. I am not a fan of the standard workaround, which is to tune kernel settings to reduce readahead, because that affects everything running on the server. The DBMS knows more about its IO patterns and can use madvise to provide hints to the OS, just as RocksDB uses fadvise for POSIX IO.
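
To make the madvise hint concrete, here is a minimal sketch (not the actual RocksDB code) of mapping a file for random access and disabling readahead on that mapping; the function name and error handling are mine, and the posix_fadvise call mentioned in the comment is the buffered-IO analog.

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  // Sketch: map a file read-only and tell the kernel the access pattern is
  // random, which suppresses readahead for this mapping. For buffered IO the
  // analogous hint is posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM).
  void* MapWithoutReadahead(const char* path, size_t* len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the fd is closed
    if (base == MAP_FAILED) return nullptr;
    madvise(base, st.st_size, MADV_RANDOM);  // hint: do not readahead
    *len_out = st.st_size;
    return base;
  }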

Good news: issue 9931 has been fixed and the results are impressive.

Benchmark

I used db_bench with an IO-bound workload, the same as in my previous post. Two binaries were tested:

  • old - this binary was compiled at git hash ce419c0f and does not have the fix for issue 9931
  • fix - this binary was compiled at git hash 69a32ee and has the fix for issue 9931
Note that git hashes ce419c0f and 69a32ee are adjacent in the commit log.

The verify_checksums option was false for all tests. The CPU overhead would be much larger were it true because checksum verification would be done on each block access. Tests were repeated with cache_index_and_filter_blocks set to true and false. That didn't have a big impact on results.
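
For reference, here is a minimal sketch of the options discussed above using the public RocksDB C++ API; it is not the db_bench driver, and the helper function names are mine.

  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  // Options sketch: mmap reads, with cache_index_and_filter_blocks toggled
  // between test runs.
  rocksdb::Options MakeOptions(bool cache_index_and_filter) {
    rocksdb::Options options;
    options.allow_mmap_reads = true;  // read SST files via mmap

    rocksdb::BlockBasedTableOptions table_options;
    table_options.cache_index_and_filter_blocks = cache_index_and_filter;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
  }

  // verify_checksums=false avoids checksum work on every block access.
  rocksdb::ReadOptions MakeReadOptions() {
    rocksdb::ReadOptions read_options;
    read_options.verify_checksums = false;
    return read_options;
  }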

Results

The graphs have results for these binary+config pairs:

  • cache0.old - cache_index_and_filter_blocks=false, does not have fix for issue 9931
  • cache0.fix - cache_index_and_filter_blocks=false, has fix for issue 9931
  • cache1.old - cache_index_and_filter_blocks=true, does not have fix for issue 9931
  • cache1.fix - cache_index_and_filter_blocks=true, has fix for issue 9931
The improvements from the fix are impressive for benchmark steps that do reads for user queries -- see the green and red bars. The average read request size (rareq-sz in iostat) is:
  • for readwhilewriting: 115KB without the fix, 4KB with the fix
  • for fwdrangewhilewriting: 79KB without the fix, 4KB with the fix

Tell me how you really feel about mmap + DBMS

It hasn't been great for me. Long ago I did some perf work with an mmap DBMS, and Linux 2.6 kernels suffered from severe mutex contention in the VM code, so performance was lousy back then. But I didn't write this to condemn mmap: IO-bound workloads where the read working set is much larger than memory might simply not be the best fit for mmap.

If you compare the improved mmap numbers above with the POSIX/buffered IO numbers in my previous post, peak QPS for the IO-bound tests (everything but fillseq and overwrite) is ~100k/second with mmap vs ~250k/second with buffered IO.

From the vmstat results collected during the benchmark I see:
  • more mutex contention with mmap based on the cs column
  • more CPU overhead with mmap based on the us and sy columns

Legend:
* qps - throughput from the benchmark
* cs - context switches from the cs column in vmstat
* us - user CPU from the us column in vmstat
* sy - system CPU from the sy column in vmstat

Average values
IO      qps     cs      us      sy
mmap     91757  475279  15.0    7.0
bufio   248543  572470  13.8    7.0

Values per query (each column divided by QPS; for us and sy the result is then multiplied by 1,000,000)
IO      qps     cs      us      sy
mmap    1       5.2     163     76
bufio   1       2.3      55     28
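
A small sketch of the per-query arithmetic, using the averages above; the 1,000,000 scaling for us and sy is inferred from the published numbers, and small differences are rounding.

  #include <cstdio>

  // Reproduce the per-query table from the averages: cs is divided by QPS,
  // us and sy are divided by QPS and scaled by 1,000,000 (inferred factor).
  int main() {
    struct Row { const char* io; double qps, cs, us, sy; };
    const Row rows[] = {
        {"mmap", 91757, 475279, 15.0, 7.0},
        {"bufio", 248543, 572470, 13.8, 7.0},
    };
    for (const Row& r : rows) {
      std::printf("%-6s cs/q=%.1f us/q=%.1f sy/q=%.1f\n",
                  r.io, r.cs / r.qps, r.us / r.qps * 1e6, r.sy / r.qps * 1e6);
    }
    return 0;
  }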
