Friday, July 14, 2023

Insert+delete benchmark, medium server and MyRocks, part 2

This has more results for MyRocks vs the insert benchmark on a medium server. It expands on work from my previous post by finding a few changes to the config file (my.cnf) that improve performance.

tl;dr

  • Variance during the l.i1 benchmark step is much better for MyRocks than for Postgres or InnoDB. The challenge for a b-tree is the read-modify-write cycle during secondary index maintenance; MyRocks does that maintenance via blind writes (RocksDB Put operations), so it is read-free.
  • Query rates at 1-second intervals have an interesting sawtooth pattern. I assume this is the CPU overhead from searching more data as the write buffer and/or L0 fill up and then empty at regular intervals.
  • Things that help performance
    • Enabling the hyper clock cache improves performance but has a cost: with it enabled, the peak RSS of mysqld is larger when using jemalloc. I have enlisted help from experts to figure that out.
    • Enabling subcompactions reduces the time for L0->L1 compactions, which means fewer write stalls.
  • Things that hurt performance
    • Disabling intra-L0 compaction hurts throughput. Hopefully someone on the RocksDB team is amused by this because I have been skeptical of its benefit after encountering a few problems with it. I was wrong.
    • Reducing level0_slowdown_writes_trigger and level0_stop_writes_trigger to keep the L0 from getting too large hurts throughput. I tried this to reduce the amount of extra compaction debt that can arrive in the L0.
    • I have one more test in progress for a config that disables intra-L0 and reduces the level0 triggers (see the previous two bullet points).

Updates

I tried one more config, c8, that combines c3 and c4 (disables intra-L0, reduces the level0 slowdown and stop triggers). Performance was similar to c3, which wasn't good.

Hyper clock cache, jemalloc and RSS

The VSZ and RSS for mysqld are larger when the hyper clock cache is enabled, and this caused OOM in a few tests until I realized that and reduced the block cache size for any config that enables the hyper clock cache. This occurs with jemalloc, and you really should be using jemalloc or tcmalloc with RocksDB because RocksDB can be an allocator stress test. I have sought help from experts to debug this and that is still pending.
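
For anyone reproducing this: the snippet below shows two common ways to point mysqld at jemalloc. It is a sketch rather than my exact setup and the library path is an example that varies by distro.

  # option 1: in my.cnf, for installs that start mysqld via mysqld_safe
  [mysqld_safe]
  malloc-lib=/usr/lib64/libjemalloc.so.2    # example path, differs by distro

  # option 2: preload it when starting mysqld directly
  LD_PRELOAD=/usr/lib64/libjemalloc.so.2 bin/mysqld --defaults-file=./my.cnf &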

The table below shows the impact.

Legend:
* v-delta: peak VSZ minus bc (GB), measured during create index
* r-delta: peak RSS minus bc (GB), measured during create index
* bc(GB):  rocksdb_block_cache_size in GB

config  v-delta r-delta bc(GB)
base    41.0    22.3    80
c1      49.5    33.3    60
c2      27.0    14.8    80
c3      27.6    10.8    80
c4      27.4    14.7    80
c5      28.6    14.6    80
c6      50.8    32.6    60
c7      53.2    33.3    60

The value of max_subcompactions

Until this round of tests my benchmarks had been using max_subcompactions=1, which disables subcompactions. The reason is that most of my testing in prior years was on small servers with 4 or 8 CPU cores and it wasn't clear that I had enough spare CPU to make use of subcompactions. Docs for subcompactions are here.
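
The change itself is one line in my.cnf. This is the delta that the c5 config adds to the base config (the option name is listed in the Benchmarks section below):

  [mysqld]
  # let one L0->L1 compaction job be split across up to 4 threads
  rocksdb_max_subcompactions=4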

On the medium server, setting max_subcompactions=4 has a huge impact that is visible in the Avg(sec) column of the compaction statistics. For the base config the value is 45.744 seconds and for the c5 config it drops to 2.861 seconds. This is the time for an L0->L1 compaction.

When L0->L1 compactions take too long (45+ seconds here), more data piles up in the L0 and future L0->L1 compactions are larger and slower. From the Read(GB) columns I know the amount of data read during L0->L1 compaction and from the Comp(cnt) columns I know the number of L0->L1 compactions. From that I learn that on average an L0->L1 compaction reads 2.2 GB when subcompactions are disabled versus 0.5 GB when they are enabled. So L0->L1 is faster with subcompactions because there are more threads doing the work and there is less work per compaction job.
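
The columns mentioned above come from the per-level compaction stats that RocksDB writes to its LOG file; MyRocks can also dump them at runtime, for example via SHOW ENGINE ROCKSDB STATUS. The per-job estimate is just the ratio of two columns from the L0 row; the numbers below are hypothetical and only show the shape of the calculation.

  -- dump RocksDB internal stats, including the compaction stats table
  SHOW ENGINE ROCKSDB STATUS;

  avg read per L0->L1 job = Read(GB) / Comp(cnt)
  hypothetical example:     2200 GB / 1000 jobs = 2.2 GB/job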

The benefit is also visible in the write stall metrics where the Cumulative stall time drops from 2.8% to 0.0%. A stale overview of stall counters is here. The code has been improved since I wrote that and I have yet to revisit it.
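
The Cumulative stall line is part of the same compaction stats output. If I remember correctly MyRocks also exports stall counters as status variables, so a quick way to look for stalls is the statement below; the counters are cumulative, so sample them twice and take the difference.

  SHOW GLOBAL STATUS LIKE 'rocksdb_stall%';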

The benefit of max_subcompactions is visible in the average insert rates for the l.i1 benchmark step. See the summary tables for Cached by RocksDB, Cached by OS and IO-bound. Compare the values for the base config (no subcompactions) and the c5 config (subcompactions).

Disabling intra-L0 compaction

An overview of intra-L0 compaction is here. I have been wary of intra-L0 because it has caused me a few problems in the past and because it makes the LSM tree shape more dynamic and harder to reason about. But the results here show that I would be foolish to not embrace it.

I modified MyRocks so that I could disable intra-L0 by setting max_blob_size to a value greater than zero. This was a convenient hack, not proper code, and the c3 config used it to disable intra-L0. A positive side-effect of this change is that the average time for L0->L1 compaction jobs drops from 45.744 to 24.105 seconds per the Avg(sec) column. But a negative side-effect is that the Cumulative stall time increases from 2.8% with the base config to 36.0% with the c3 config, meaning that write stalls were much worse, and the result was a lower average insert rate with the c3 config.

The benefit of intra-L0 is visible in the average insert rates for the l.i1 benchmark step. See the summary tables for Cached by RocksDB, Cached by OS and IO-bound. Compare the values for the base config (intra-L0 enabled) and the c3 config (intra-L0 disabled).

Reducing L0 slowdown and stop triggers

Next up is the c4 config that reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12. The hope was that by reducing them the L0 would not have as much data when there was stress (convoys would be smaller). Perhaps that was true, but it was lousy for performance.
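
These are RocksDB column family options rather than MyRocks system variables, so they are set via the CF options string in my.cnf. A sketch of the c4 delta is below; in a real config these entries are merged into the existing rocksdb_default_cf_options value rather than replacing it, and the actual my.cnf files are linked in the Benchmarks section.

  [mysqld]
  # slow down writes at 8 L0 files and stop them at 12 (defaults: 20 and 36)
  rocksdb_default_cf_options=level0_slowdown_writes_trigger=8;level0_stop_writes_trigger=12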

From compaction stats I see that the average time for L0->L1 compaction jobs is 45.744 seconds with the base config and drops to 10.188 seconds with the c4 config. However, the Cumulative stall time increases from 2.8% with the base config to 15.4% with the c4 config, resulting in a lower insert rate for the c4 config.

The benefit of larger values for the slowdown and stop triggers is visible in the average insert rates for the l.i1 benchmark step. See the summary tables for Cached by RocksDB, Cached by OS and IO-bound. Compare the values for the base config (default values) and the c4 config (smaller values).

Benchmarks

The medium server is a c2-standard-30 from GCP with 15 cores, hyperthreads disabled, 120G of RAM, and 1.5T of XFS via SW RAID 0 over 4 local NVMe devices.

An overview of the insert benchmark is here, here and here. The insert benchmark was run for 8 clients. The read+write steps (q100, q500, q1000) were run for 3600 seconds each. The delete per insert option was set for l.i1, q100, q500 and q1000.

Benchmarks were repeated for three setups:
  • cached by RocksDB - all data fits in the 80G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • cached by OS - all data fits in the OS page cache but not the 4G RocksDB block cache. The benchmark tables have 160M rows and the database size is ~12G.
  • IO-bound - the database is larger than memory. The benchmark tables have 4000M rows and the database size is ~281G.
The following configurations were tested:
  • base config
  • c1 - adds rocksdb_use_hyper_clock_cache=ON
  • c2 - adds rocksdb_block_cache_numshardbits=4.
  • c3 - disables intra-L0 compaction
  • c4 - reduces level0_slowdown_writes_trigger from 20 to 8 and level0_stop_writes_trigger from 36 to 12
  • c5 - enables subcompactions via rocksdb_max_subcompactions=4
  • c6 - combines c1, c2, c5
  • c7 - combines c1, c5
The my.cnf files are here.
  • cached by RocksDB, IO-bound: base config, c1, c2, ..., c7. The c1, c6 and c7 configs use a 60G RocksDB block cache to avoid OOM because they enable the hyper clock cache, while the others use an 80G block cache.
  • cached by OS: base config, c1, c2, ..., c7
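
To show what these look like in my.cnf syntax, the sketch below covers the c1 and c2 deltas relative to the base config; the other configs follow the same pattern (see the sections above and the linked my.cnf files).

  [mysqld]
  # c1: enable the hyper clock cache; the configs that enable it also use a
  # 60G block cache instead of 80G to avoid OOM (see the RSS section above)
  rocksdb_use_hyper_clock_cache=ON
  rocksdb_block_cache_size=60G
  # c2: use fewer block cache shards (2^4 = 16)
  rocksdb_block_cache_numshardbits=4
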
The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start. The inserts are done to the table head and the deletes are done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background.
Reports

Performance reports are here for Cached by RocksDB, Cached by OS and IO-bound. The c7 config provides a small (1% to 10%) improvement for average throughput in most of the benchmark steps.

Most of the configs have worse create index (l.x benchmark step) performance because I used rocksdb_merge_combine_read_size=1G for the base config but only 128M for the other configs while I was debugging the OOM issue with the hyper clock cache.

From the response time tables
  • For Cached by RocksDB the max response times are all less than one second and the distributions are similar for all configs
  • For Cached by OS the max response times are all less than one second and the distributions are mostly similar for all configs, but for q1000 the distributions are slightly better for c5, c6 and c7
  • For IO-bound the c6 and c7 configs had the worst max response times (~2 seconds) on l.i1 while most others were less than one second. Also for l.i1, the distributions were slightly worse for c5, c6 and c7 versus the others. But that might not be a fair comparison because c5, c6 and c7 sustained a higher insert rate.
Charts for insert/delete/query rates and max response time at 1-second intervals are below. Note that the charts are from the first client and there are 8 clients.
  • Cached by RocksDB
    • l.i0, l.i1, q100, q500, q1000
    • The l.i1 insert rates are stable (see here)
    • The q100, q500 and q1000 query rates have an interesting sawtooth pattern. This is most likely from the write buffer or L0 filling up, then emptying. Note that the pattern is more compressed when the insert rate is higher (q1000).
  • Cached by OS
    • l.i0, l.i1, q100, q500, q1000
    • The l.i1 insert rates are stable (see here)
    • The q100, q500 and q1000 query rates have an interesting sawtooth pattern. This is most likely from the write buffer or L0 filling up, then emptying. Note that the pattern is more compressed when the insert rate is higher (q1000).
  • IO-bound
    • l.i0, l.i1, q100, q500, q1000
    • The l.i1 insert rates are stable (see here)
    • The q100, q500 and q1000 query rates have an interesting sawtooth pattern. This is most likely from the write buffer or L0 filling up, then emptying. Note that the pattern is more compressed when the insert rate is higher (q1000).