Wednesday, July 26, 2023

Tuning MyRocks for the Insert Benchmark on a small server

I used the Insert Benchmark on a small server to see if I could improve the configuration (my.cnf) I have been using.

tl;dr

  • With jemalloc the peak RSS for mysqld is larger with rocksdb_use_hyper_clock_cache=ON so I reduce the value of rocksdb_block_cache_size from 8G to 6G for some configurations. This isn't fully explained but experts are working on it.
  • The base config (a0) is good enough and the other configs don't provide a significant improvement. This isn't a big surprise: the hyper clock cache and subcompactions are a big deal on larger servers, but the server in this case is small and the workload has low concurrency.
  • In some cases the a3 config that disables intra-L0 compaction hurts write throughput. This result is similar to what I measured on a larger server.

Builds

I used MyRocks from FB MySQL 8.0.28 using the rel_native_lto build with source from June 2023 at git hash ef5b9b101. 

Benchmark

The insert benchmark was run in three configurations.

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • cached by OS - all tables fit in the OS page cache but not the 1G RocksDB block cache
  • IO-bound - the database is larger than memory
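As a rough way to see how the three cases relate to database size, here is a minimal Python sketch. The function name and the example sizes are hypothetical, and it treats total RAM as the upper bound for the OS page cache, which overstates what is really available to it.

```python
def classify_setup(db_gb, block_cache_gb, ram_gb):
    """Name the benchmark setup implied by the database size.

    Assumption: the OS page cache can use at most ram_gb, which is optimistic
    because mysqld and the RocksDB block cache also need memory.
    """
    if db_gb <= block_cache_gb:
        return "cached by RocksDB"
    if db_gb <= ram_gb:
        return "cached by OS"
    return "IO-bound"


if __name__ == "__main__":
    # Hypothetical database sizes on a server with 16G of RAM.
    print(classify_setup(db_gb=4, block_cache_gb=8, ram_gb=16))
    print(classify_setup(db_gb=4, block_cache_gb=1, ram_gb=16))
    print(classify_setup(db_gb=100, block_cache_gb=8, ram_gb=16))
```

The same database can be "cached by RocksDB" or "cached by OS" depending only on the block cache size, which is why each config comes in a 1G and a larger variant.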

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 800 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
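The write pattern in the steps above can be sketched in Python. This is a toy model, not the benchmark client: the function names are hypothetical, a deque stands in for one table, and "head" is assumed to mean the end with the largest primary key. The second helper just multiplies out the background write rate for the query steps, assuming the target rate is sustained for the full 3600 seconds.

```python
from collections import deque

STEP_SECONDS = 3600  # q100, q500 and q1000 each run for 3600 seconds


def run_l_i1(initial_rows, ops):
    """Toy model of l.i1 on one table: insert at the head, delete from the tail."""
    table = deque(range(initial_rows))  # primary keys in ascending order
    next_pk = initial_rows
    for _ in range(ops):
        table.append(next_pk)  # insert at the head (largest key)
        next_pk += 1
        table.popleft()        # delete from the tail (smallest key)
    return table


def background_writes(inserts_per_sec, clients=1, seconds=STEP_SECONDS):
    """Total background inserts and deletes for a query step (same rate for both)."""
    inserts = inserts_per_sec * clients * seconds
    return {"inserts": inserts, "deletes": inserts}


if __name__ == "__main__":
    table = run_l_i1(initial_rows=20, ops=100)
    print(len(table), table[0], table[-1])  # row count is unchanged, keys have shifted
    for step, rate in (("q100", 100), ("q500", 500), ("q1000", 1000)):
        print(step, background_writes(rate))
```

With 1 client, the query steps add up to 360,000, 1,800,000 and 3,600,000 background inserts respectively, with the same number of deletes, so the table size stays constant there as well.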

Configurations

The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files -- one with a 1G block cache, one with a larger block cache. The larger block cache size is 8G when LRU is used and 6G when hyper clock cache is used (see tl;dr).

  • a0 (1G, 8G) - base config
  • a1 (1G, 6G) - adds rocksdb_use_hyper_clock_cache=ON
  • a2 (1G, 8G) - adds rocksdb_block_cache_numshardbits=3
  • a3 (1G, 8G) - disables intra-L0 compaction via a hack
  • a4 (1G, 8G) - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
  • a5 (1G, 8G) - enables subcompactions via rocksdb_max_subcompactions=2
  • a6 (1G, 6G) - combines a1, a2, a5
  • a7 (1G, 6G) - combines a1, a5
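The list above can also be written down as data, which makes the combined configs (a6, a7) and the block cache rule easy to check. This is a sketch in Python rather than the actual my.cnf files: the dictionary layout and function names are hypothetical, the a3 hack is left empty because it is not spelled out here, and the two level0 triggers for a4 are RocksDB column family options, so how they are set in my.cnf is not shown.

```python
SMALL_CACHE = "1G"
LRU_LARGE_CACHE = "8G"          # larger block cache when LRU is used
HYPER_CLOCK_LARGE_CACHE = "6G"  # larger block cache when hyper clock cache is used

A1 = {"rocksdb_use_hyper_clock_cache": "ON"}
A2 = {"rocksdb_block_cache_numshardbits": "3"}
A4 = {"level0_slowdown_writes_trigger": "8", "level0_stop_writes_trigger": "12"}
A5 = {"rocksdb_max_subcompactions": "2"}

# Changes relative to the base config (a0).
VARIANTS = {
    "a0": {},
    "a1": A1,
    "a2": A2,
    "a3": {},  # disables intra-L0 compaction via a hack that is not listed here
    "a4": A4,  # column family options, not plain mysqld variables
    "a5": A5,
    "a6": {**A1, **A2, **A5},
    "a7": {**A1, **A5},
}


def cache_sizes(name):
    """Return the (small, large) block cache sizes used for a variant."""
    uses_hcc = VARIANTS[name].get("rocksdb_use_hyper_clock_cache") == "ON"
    return (SMALL_CACHE, HYPER_CLOCK_LARGE_CACHE if uses_hcc else LRU_LARGE_CACHE)


def describe(name):
    """One-line summary of a variant in the same style as the list above."""
    small, large = cache_sizes(name)
    changes = ", ".join(f"{k}={v}" for k, v in VARIANTS[name].items())
    return f"{name} ({small}, {large}): {changes or 'no listed overrides'}"


if __name__ == "__main__":
    for name in VARIANTS:
        print(describe(name))
```

Deriving the larger cache size from the hyper clock cache setting, rather than hard-coding it per variant, keeps a6 and a7 consistent with a1 automatically.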

Results

Performance reports are here for Cached by RocksDB, Cached by OS and IO-bound.

The conclusion is that the base config (a0) is good enough and the other configs don't provide a significant improvement. This isn't a big surprise: the hyper clock cache (a1) and subcompactions (a5) are a big deal on larger servers, but the server in this case is small and the workload has low concurrency. The a3 config is bad for performance on the IO-bound workload -- intra-L0 compaction is useful.

When evaluating this based on average throughput (see summaries for Cached by RocksDB, Cached by OS and IO-bound) the base config (a0) is good enough and the other configs don't provide significant improvements, although for IO-bound the a3 config is bad for the l.i1 benchmark step because it increases write stalls.

All configs have similar response time distributions for Cached by RocksDB, Cached by OS and IO-bound with one exception. For IO-bound the a3 config does worse on the l.i1 benchmark step.

The charts showing various metrics at 1-second intervals look similar with one exception. Links are in the performance summaries: grep for "per 1-second interval" in Cached by RocksDB, Cached by OS and IO-bound. The exception is on IO-bound with the a3 config -- see the IPS charts for the l.i1 benchmark step with the a0 config and a3 config, where the a3 config has much more variance.








