Wednesday, July 26, 2023

Tuning MyRocks for the Insert Benchmark on a large server

I used the Insert Benchmark on a small server to see if I could improve the configuration (my.cnf) I have been using.

tl;dr

  • The good
    • Performance is much better when subcompactions and the hyper clock cache are enabled.
  • The not as good
    • Peak RSS is larger for configs that enable the hyper clock cache. The problem here is not as large as it is on pure open source builds that I test because the version of jemalloc used here is tuned differently. Understanding this is a work in progress.
    • The select min(transactionid) from TABLE query used at the start of benchmark steps can take up to 20 seconds when it should finish in a few milliseconds. I have yet to try to fix this via tuning.
Updates
  • Added values from the RocksDB perf context counters to explain the slow select min() query

Builds

I used MyRocks from FB MySQL 8.0.28 with source from June 2023 at git hash ef5b9b101. 

Benchmark

The insert benchmark was run in three configurations.

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • cached by OS - all tables fit in the OS page cache but not the 4G RocksDB block cache
  • IO-bound - the database is larger than memory

The test HW has 80 cores with hyperthreads enabled, 256G of RAM and fast local-attached NVMe storage.

The benchmark is run with 24 clients and a client per table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Configurations

The configuration (my.cnf) files are here and I use abbreviated names for them in this post. For each variant there are two files -- one with a 4G block cache, one with 180G cache.

  • c (4G180G) - base config
  • c1 (4G180G) - adds rocksdb_use_hyper_clock_cache=ON
  • c2 (4G180G) - adds rocksdb_block_cache_numshardbits=4
  • c3 (4G180G) - disables intra-L0 compaction via a hack
  • c4 (4G180G) - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
  • c5 (4G180G) - enables subcompactions via rocksdb_max_subcompactions=4
  • c6 (4G180G) - combines c1, c2, c5
  • c7 (4G180G) - combines c1, c5
Results

Performance reports are here for Cached by RocksDBCached by OS and IO-bound.

Reviewing results based on average throughput:
  • Cached by RocksDB
    • The c7 config (combines c1 and c5) is the best followed by c1 and c5 because subcompactions and hyper clock cache are great for perf. The c2 config is bad for l.i0 perhaps because there is more mutex contention with fewer block cache shards. The c3 config is bad for l.i0 and l.i1 because disabling intra-L0 compaction is bad for write throughput. The c4 config is also bad for l.i0 and l.i1 because it makes the L0 less write friendly and more read friendly.
  • Cached by OS
    • See the description above for Cached by RocksDB
  • IO-bound
    • See the description above for Cached by RocksDB
Reviewing results based on response time histograms:
  • Cached by RocksDB
    • What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs. For q1000 the query response time histograms are much better for the c6 and c7 configs.
  • Cached by OS
    • What I wrote in the Cached by RocksDB section for average throughput applies here.  For l.i1 the histograms are much better for the c5, c6 and c7 configs while the base config has a 25-second write stall. For q1000 the query response time histograms are much better for the c6 and c7 configs.
  • IO-bound
    • What I wrote in the Cached by RocksDB section for average throughput applies here. For l.i1 the histograms are much better for the c5, c6 and c7 configs. For q1000 the query response time histograms are much better for the c6 and c7 configs.
Reviewing results based on variance over per-second intervals:
  • Cached by RocksDB
    • For l.i1 see the max insert response time charts for the base, c3, c4 and c5 configs. The c5 config greatly reduces the variance.
  • Cached by OS
    • For l.i0 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. The c3 and c4 configs have much more variance.
    • For l.i1 see the IPS (insert rate) charts for the base, c3, c4 and c5 configs. Results are much better with c5.
  • IO-bound
    • For l.i0 compare IPS (insert rate) and max response time charts for c3, c4 and c5. There is much less variance with c5.
    • For l.i1 compare the IPS (insert rate) and max response time charts for c3, c4 and c5. There is much less variance with c5.
Slow get_min queries

Some of the benchmark steps do a simple query at startup to determine the min value of the transactionid column. This is the PK column and the query should be fast but when MVCC GC falls behind it can be slow. I provide more detail on the problem here.

-- Cached by RocksDB
config  q100    q500    q1000
c        9.744   9.029   8.521
c1      11.762   9.487   8.761
c2       8.938   8.635   7.625
c3      16.680  15.582  16.392
c4      15.819  14.861  16.046
c5      15.999  15.251  17.754
c6      16.348  14.795  18.964
c7      15.046  13.760  18.254

-- Cached by OS
config  q100    q500    q1000
c       15.080  15.462  15.896
c1      14.064  13.622  11.954
c2      10.148   9.641   7.242
c3      16.876  18.810  18.159
c4      16.252  18.571  18.138
c5      16.290  18.533  18.411
c6      16.530  17.341  18.236
c7      17.148  19.591  18.775

-- IO-bound
config  q100    q500    q1000
c        9.310   8.747  11.018
c1       7.853   8.661  10.508
c2       9.156   9.188  11.096
c3      17.336  20.003  20.281
c4      19.281  19.794  21.849
c5      17.313  23.940  22.283
c6      18.848  19.926  22.483
c7      17.170  17.790  21.088

Peak RSS

This shows peak RSS for mysqld during the l.x (create index) benchmark step. It is ~5GB larger for the configs that enable the hyper clock cache. The difference is larger on the open-source builds that I do. I think jemalloc here is tuned better than on the open-source builds.

        peak
config  RSS (GB)
c       185.3
c1      191.3
c2      185.9
c3      183.3
c4      186.5
c5      188.0
c6      191.3
c7      191.3

Update 1: perf context counters

RocksDB has perf context counters that let you understand the work RocksDB does for a specific operation. AFAIK these can be access per session, but in my case I queried them via a global table assuming there wasn't much other activity on the server and ignoring compaction (which is risky to ignore).

The counters are in the information_schema.rocksdb_perf_context_global table. I ran the query and selected the values of all counters from that table before and after running the query and then computed the diff between the two sets of measurements. The counters are only populated if you set rocksdb_perf_context_level to something >= 2 in my.cnf.

My notes and more numbers from this experiment are here.

This is what the counters look like when the query is fast (0.001s) at the start of benchmark step l.i1.

4               SEEK_CHILD_SEEK_COUNT
5               BLOCK_CACHE_HIT_COUNT
12              IO_THREAD_POOL_ID
46              ITER_READ_BYTES
47              USER_KEY_COMPARISON_COUNT

And then when the query is slow (12.715s) at the start of benchmark step q100. The problems are the large values for *_SKIPPED_COUNT. The large value for USER_KEY_COMPARISON_COUNT might just be from background compaction.

65466           BLOCK_READ_COUNT
225827          BLOCK_CACHE_HIT_COUNT
36521600        INTERNAL_DELETE_SKIPPED_COUNT
36521600        INTERNAL_KEY_SKIPPED_COUNT
256113421       USER_KEY_COMPARISON_COUNT
511093703       BLOCK_READ_BYTE
511093703       IO_BYTES_READ
































No comments:

Post a Comment