Monday, July 24, 2023

Insert benchmark perf over time for MyRocks 5.6 on a large server

I used the Insert Benchmark to compare performance for MyRocks builds from March 2022 through today to understand how performance changes over time. I was unable to go back further in time because of conflicts between old code and a new compiler toolchain.

tl;dr

  • A build from March 2022 gets ~15% more inserts/s and ~5% more queries/s when compared to recent builds

Builds

All builds use MyRocks 5.6.35 but from different points in time. I used the same compiler toolchain and gcc for all builds. The builds are:
  • fbmy5635_202203072101 - from 7 March 2022 at git hash 84ce624a with RocksDB 6.28.2
  • fbmy5635_202304122154 - from 12 April 2023 at git hash f2161d019 with RocksDB 7.10.2
  • fbmy5635_202305292102 - from 29 May 2023 at git hash 509203f4 with RocksDB 8.2.1
  • fbmy5635_jun23_7e40af67 - from 23 June 2023 at git hash 7e40af67 with RocksDB 8.2.1

Benchmark

The Insert Benchmark was run in two configurations:

  • cached by RocksDB - RocksDB block cache caches all tables
  • IO-bound - the database is larger than memory

The test HW has 80 cores with hyperthreads enabled, 256G of RAM and fast local-attached NVMe storage.

The benchmark is run with 24 clients, one client per table. It is a sequence of steps:

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100
    • do queries as fast as possible with 100 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
  • q500
    • do queries as fast as possible with 500 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
  • q1000
    • do queries as fast as possible with 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
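The step sequence above can be written down as data, which makes the aggregate background write rate for the q100, q500 and q1000 steps easy to see. This is only a sketch: the Step class, its field names and total_bg_insert_rate are hypothetical names chosen for illustration and are not options of the Insert Benchmark client. It also assumes every client reaches its target rate.

```python
from dataclasses import dataclass
from typing import Optional

CLIENTS = 24  # from the post: 24 clients, one client per table


@dataclass(frozen=True)
class Step:
    """One benchmark step. Field names are illustrative, not real client options."""
    name: str
    description: str
    seconds: Optional[int] = None               # fixed run time, when the step has one
    bg_writes_per_client: Optional[int] = None  # target inserts/s (and deletes/s) per client


STEPS = [
    Step("l.i0", "insert rows without secondary indexes"),
    Step("l.x", "create 3 secondary indexes"),
    Step("l.i1", "inserts and deletes with secondary index maintenance"),
    Step("q100", "queries with background inserts and deletes", 1800, 100),
    Step("q500", "queries with background inserts and deletes", 1800, 500),
    Step("q1000", "queries with background inserts and deletes", 1800, 1000),
]


def total_bg_insert_rate(step: Step, clients: int = CLIENTS) -> Optional[int]:
    """Target background inserts/s summed over all clients.

    Deletes run at the same rate, so the delete rate is the same number.
    Returns None for steps that are not rate limited.
    """
    if step.bg_writes_per_client is None:
        return None
    return step.bg_writes_per_client * clients


if __name__ == "__main__":
    for step in STEPS:
        rate = total_bg_insert_rate(step)
        if rate is not None:
            print(f"{step.name}: {rate} inserts/s and {rate} deletes/s "
                  f"in total for {step.seconds} seconds")
```

With 24 clients the q1000 step targets 24,000 inserts/s plus 24,000 deletes/s in the background while the queries run.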

Configurations

The benchmark used the cy9c5_u configuration for MyRocks. Much more detail on the benefit of the c5 configuration is here. It adds rocksdb_max_subcompactions, which makes L0 -> L1 compactions go faster and reduces (or removes) write stalls.

Results

Performance reports are here for Cached by RocksDB and for IO-bound. From the summaries for cached and IO-bound, the March 2022 build gets more throughput than the June 2023 build:

  • ~15% more inserts/second on the write-heavy l.i1 benchmark step
  • ~5% more queries/second on the read+write benchmark steps (q100, q500, q1000)
The root cause appears to be more CPU overhead in the June 2023 build.
  • For l.i1 and Cached by RocksDB the cpupq column has the CPU cost per insert (see here) and it is 185 for the March 2022 build vs 209 for the June 2023 build (a 13% difference). The difference for IO-bound is 9% (see here). Note that this metric includes all CPU, including that from background threads, not just CPU from foreground inserts. So it can be misleading, but I trust it to explain the differences here until I do more perf debugging.
  • For q100 and Cached by RocksDB the cpupq column is 5.8% larger for the June 2023 build relative to the March 2022 build (see here, 252 vs 238). And for IO-bound the cpupq is 4.1% larger for the June 2023 build relative to the March 2022 build (see here, 302 vs 290).
Also, the response time histograms are slightly better for the March 2022 build (see here for Cached by RocksDB and for IO-bound).
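The cpupq comparisons above are simple relative differences, and a short sketch makes the arithmetic explicit. The cpupq values are the ones quoted above. The function name and the dictionary layout are my own, for illustration only.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage by which new exceeds old."""
    return (new - old) / old * 100.0


# (March 2022 build, June 2023 build) cpupq values quoted in the text above.
CPUPQ = {
    ("l.i1", "Cached by RocksDB"): (185, 209),
    ("q100", "Cached by RocksDB"): (238, 252),
    ("q100", "IO-bound"): (290, 302),
}

if __name__ == "__main__":
    for (step, config), (mar22, jun23) in CPUPQ.items():
        print(f"{step}, {config}: cpupq {mar22} -> {jun23}, "
              f"{pct_increase(mar22, jun23):.1f}% more CPU per operation")
```

Computed from the rounded values shown here, the results land within about a tenth of a percentage point of the figures quoted above. A small gap like that is expected if the reports compute the ratios from unrounded values, which is an assumption on my part.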

Finally, graphs for the insert & query rates at 1-second intervals:
  • Cached by RocksDB: l.i0, l.i1, q100, q500, q1000
    • For q1000 the March 2022 build had the best chart WRT max insert response time and that is also reflected in the IPS (inserts/s) graphs where it has the least noise
  • IO-bound: l.i0, l.i1, q100, q500, q1000
    • For l.i1 the max insert response time charts shift up for builds after March 2022
    • For q100 the max query response time charts have two horizontal lines -- one near 0 and the other near 10,000 usecs for the March 2022 build but near 20,000 usecs for the more recent builds.
    • For q1000 the max insert response time charts have one thick line for the March 2022 build but two less-thick lines for the more recent builds. The max query response time charts are similar to q100.






