Small Datum: Tuning MyRocks for the Insert Benchmark on a large server

I used the Insert Benchmark on a small server to see if I could improve the configuration (my.cnf) I have been using.

tl;dr

The good

Performance is much better when subcompactions and the hyper clock cache are enabled.

The not as good

Peak RSS is larger for configs that enable the hyper clock cache. The problem here is not as large as it is on pure open source builds that I test because the version of jemalloc used here is tuned differently. Understanding this is a work in progress.
The select min(transactionid) from TABLE query used at the start of benchmark steps can take up to 20 seconds when it should finish in a few milliseconds. I have yet to try to fix this via tuning.

Updates

Added values from the RocksDB perf context counters to explain the slow select min() query

Builds

I used MyRocks from FB MySQL 8.0.28 with source from June 2023 at git hash ef5b9b101.

Benchmark

The insert benchmark was run in three configurations.

cached by RocksDB - all tables fit in the RocksDB block cache
cached by OS - all tables fit in the OS page cache but not the 4G RocksDB block cache
IO-bound - the database is larger than memory

The test HW has 80 cores with hyperthreads enabled, 256G of RAM and fast local-attached NVMe storage.

The benchmark is run with 24 clients and a client per table. The benchmark is a sequence of steps.

l.i0

insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound

create 3 secondary indexes. I usually ignore performance from this step.

l.i1

insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.

q100

do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

q500

do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

q1000

do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Configurations

The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files -- one with a 4G block cache, one with 180G cache.

c (4G, 180G) - base config
c1 (4G, 180G) - adds rocksdb_use_hyper_clock_cache=ON
c2 (4G, 180G) - adds rocksdb_block_cache_numshardbits=4
c3 (4G, 180G) - disables intra-L0 compaction via a hack
c4 (4G, 180G) - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
c5 (4G, 180G) - enables subcompactions via rocksdb_max_subcompactions=4
c6 (4G, 180G) - combines c1, c2, c5
c7 (4G, 180G) - combines c1, c5

Results

Performance reports are here for Cached by RocksDB, Cached by OS and IO-bound.

Reviewing results based on average throughput:

Cached by RocksDB

The c7 config (combines c1 and c5) is the best followed by c1 and c5 because subcompactions and hyper clock cache are great for perf. The c2 config is bad for l.i0 perhaps because there is more mutex contention with fewer block cache shards. The c3 config is bad for l.i0 and l.i1 because disabling intra-L0 compaction is bad for write throughput. The c4 config is also bad for l.i0 and l.i1 because it makes the L0 less write friendly and more read friendly.

Cached by OS

See the description above for Cached by RocksDB

IO-bound

See the description above for Cached by RocksDB

Reviewing results based on response time histograms:

Cached by RocksDB

What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs. For q1000 the query response time histograms are much better for the c6 and c7 configs.

Cached by OS

What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs while the base config has a 25-second write stall. For q1000 the query response time histograms are much better for the c6 and c7 configs.

IO-bound

What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs. For q1000 the query response time histograms are much better for the c6 and c7 configs.

Reviewing results based on variance over per-second intervals:

Cached by RocksDB

For l.i1 see the max insert response time charts for the base, c3, c4 and c5 configs. The c5 config greatly reduces the variance.

Cached by OS

For l.i0 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. The c3 and c4 configs have much more variance.
For l.i1 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. Results are much better with c5.

IO-bound

For l.i0 compare IPS (insert rate) and max response time charts for c3, c4 and c5. There is much less variance with c5.
For l.i1 compare the IPS (insert rate) and max response time charts for c3, c4 and c5. There is much less variance with c5.

Slow get_min queries

Some of the benchmark steps do a simple query at startup to determine the min value of the transactionid column. This is the PK column and the query should be fast but when MVCC GC falls behind it can be slow. I provide more detail on the problem here.

-- Cached by RocksDB

config q100 q500 q1000

c 9.744 9.029 8.521

c1 11.762 9.487 8.761

c2 8.938 8.635 7.625

c3 16.680 15.582 16.392

c4 15.819 14.861 16.046

c5 15.999 15.251 17.754

c6 16.348 14.795 18.964

c7 15.046 13.760 18.254

-- Cached by OS

config q100 q500 q1000

c 15.080 15.462 15.896

c1 14.064 13.622 11.954

c2 10.148 9.641 7.242

c3 16.876 18.810 18.159

c4 16.252 18.571 18.138

c5 16.290 18.533 18.411

c6 16.530 17.341 18.236

c7 17.148 19.591 18.775

-- IO-bound

config q100 q500 q1000

c 9.310 8.747 11.018

c1 7.853 8.661 10.508

c2 9.156 9.188 11.096

c3 17.336 20.003 20.281

c4 19.281 19.794 21.849

c5 17.313 23.940 22.283

c6 18.848 19.926 22.483

c7 17.170 17.790 21.088

Peak RSS

This shows peak RSS for mysqld during the l.x (create index) benchmark step. It is ~5GB larger for the configs that enable the hyper clock cache. The difference is larger on the open-source builds that I do. I think jemalloc here is tuned better than on the open-source builds.

peak

config RSS (GB)

c 185.3

c1 191.3

c2 185.9

c3 183.3

c4 186.5

c5 188.0

c6 191.3

c7 191.3

Update 1: perf context counters

RocksDB has perf context counters that let you understand the work RocksDB does for a specific operation. AFAIK these can be access per session, but in my case I queried them via a global table assuming there wasn't much other activity on the server and ignoring compaction (which is risky to ignore).

The counters are in the information_schema.rocksdb_perf_context_global table. I ran the query and selected the values of all counters from that table before and after running the query and then computed the diff between the two sets of measurements. The counters are only populated if you set rocksdb_perf_context_level to something >= 2 in my.cnf.

My notes and more numbers from this experiment are here.

This is what the counters look like when the query is fast (0.001s) at the start of benchmark step l.i1.

4 SEEK_CHILD_SEEK_COUNT

5 BLOCK_CACHE_HIT_COUNT

12 IO_THREAD_POOL_ID

46 ITER_READ_BYTES

47 USER_KEY_COMPARISON_COUNT

And then when the query is slow (12.715s) at the start of benchmark step q100. The problems are the large values for *_SKIPPED_COUNT. The large value for USER_KEY_COMPARISON_COUNT might just be from background compaction.

65466 BLOCK_READ_COUNT

225827 BLOCK_CACHE_HIT_COUNT

36521600 INTERNAL_DELETE_SKIPPED_COUNT

36521600 INTERNAL_KEY_SKIPPED_COUNT

256113421 USER_KEY_COMPARISON_COUNT

511093703 BLOCK_READ_BYTE

511093703 IO_BYTES_READ

Small Datum

Wednesday, July 26, 2023

Tuning MyRocks for the Insert Benchmark on a large server

No comments:

Post a Comment