Monday, September 25, 2023

Variance in peak RSS with jemalloc 5.2.1

Peak RSS for jemalloc 5.2.1 varies a lot with the Insert Benchmark and MyRocks. The variance is a function of how you build and configure jemalloc. The worst case (largest peak RSS) is the jemalloc 5.2.1 provided by Ubuntu 22.04, and I have yet to figure out how to reproduce that result using jemalloc 5.2.1 compiled from source.

I previously shared results to show that jemalloc and tcmalloc are better than glibc malloc for RocksDB. That was followed by a post that showed the peak RSS with different jemalloc versions. This post has additional results for jemalloc 5.2.1 using different jemalloc config options.

tl;dr

  • Peak RSS has large spikes with jemalloc 4.4 and 4.5, and smaller ones with 5.0 and 5.1. Tobin Baker suggested these might be from changes to the usage of MADV_FREE and MADV_DONTNEED. The spikes start to show during the l.i1 benchmark step and then are obvious during q100, q500 and q1000.
  • For tests that use the Hyper Clock cache there is a large peak RSS with Ubuntu-provided jemalloc 5.2.1 that is obvious during the l.x and l.i1 benchmark steps. I can't reproduce this using jemalloc 5.2.1 compiled from source despite my attempts to match the configuration.
  • Benchmark throughput generally improves over time from old jemalloc (4.0) to modern jemalloc (5.3).

Builds

My previous post explains the benchmark and HW. 

To get the jemalloc config details I added malloc-conf="stats_print:true" to my.cnf, which causes jemalloc stats and the config details to be written to the MySQL error log on shutdown.
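
A minimal sketch of that change is below; the MALLOC_CONF environment variable line is an equivalent that should work for any program that uses jemalloc, not necessarily what was used here.

  # my.cnf addition described above
  malloc-conf="stats_print:true"

  # equivalent via the environment for any program that uses jemalloc;
  # stats and config details are printed to stderr at exit
  export MALLOC_CONF=stats_print:true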

I compiled many versions of jemalloc from source -- 4.0.4, 4.1.1, 4.2.1, 4.3.1, 4.4.0, 4.5.0, 5.0.1, 5.1.0, 5.2.0, 5.2.1, 5.3.0. All of these used the default jemalloc config and, while it isn't listed below, the default value for background_thread is false.

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: false
  config.prof_libgcc: false
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false
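
A sketch of the from-source builds with the default config, assuming a standard autoconf build from a release tarball (which ships a configure script):

  cd jemalloc-5.2.1
  ./configure
  make -j
  sudo make install   # installs libjemalloc under /usr/local/lib by default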

The config for Ubuntu-provided 5.2.1 is below. This is also the config used by je-5.2.1.prof (see below) and it also gets background_thread=false. It differs from the default config above in two ways:

  • uses config.prof: true
  • uses config.prof_libgcc: true

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: true
  config.prof_libgcc: true
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false

Finally, I tried one more config when compiling from source to match the config that is used at work. I get that via: 

configure --disable-cache-oblivious --enable-opt-safety-checks --enable-prof --disable-prof-libgcc --enable-prof-libunwind --with-malloc-conf="background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"

With that, the option values are as follows and the background thread is enabled. The build that uses this is named je-5.2.1.prod below.


  config.cache_oblivious: false
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: "background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"
  config.opt_safety_checks: true
  config.prof: true
  config.prof_libgcc: false
  config.prof_libunwind: true
  config.stats: true
  config.utrace: false
  config.xmalloc: false
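
One quick way to confirm which config a given libjemalloc build has, without starting MySQL, is to preload it into a short-lived program with stats_print enabled and look for the config.* lines in the output (the library path below is just an example):

  # stats, including the config.* lines, are printed to stderr at exit
  LD_PRELOAD=/usr/local/lib/libjemalloc.so.2 MALLOC_CONF=stats_print:true ls > /dev/null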

Now I have results for variants of jemalloc 5.2.1 and the names here match the names I used on the spreadsheets that show peak RSS.

  • je-5.2.1.ub - Ubuntu-provided 5.2.1
  • je-5.2.1 - compiled from source with default options
  • je-5.2.1.prof - compiled from source with configure --enable-prof --enable-prof-libgcc to get a config that matches je-5.2.1.ub
  • je-5.2.1.prod - compiled from source using the configure command line shown above (the config used at work)
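
A common way to point mysqld at a specific jemalloc build is to preload the library; the sketch below assumes that approach and uses a hypothetical install path. mysqld_safe also has a --malloc-lib option that serves the same purpose.

  # hypothetical paths; adjust to the build being tested
  LD_PRELOAD=/path/to/jemalloc-5.2.1/lib/libjemalloc.so.2 bin/mysqld --defaults-file=my.cnf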

Benchmarks

I ran the Insert Benchmark using a 60G RocksDB block cache. The benchmark was run twice -- once using the (older) LRU block cache, once using the (newer) Hyper Clock cache.

The benchmark was run in the IO-bound setup and the database is larger than memory. The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (one client per table). The benchmark is a sequence of steps and the peak RSS problem is worst for the l.x benchmark step, which creates indexes and allocates a lot of memory while doing so:

  • l.i0
    • insert 500 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows per table at the end of the benchmark step matches the number at the start because inserts are done at the head of the table and deletes are done from the tail.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark was run with 2 my.cnf files, c5 and c7, edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (older code) while c7 uses the Hyper Clock cache.
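
Assuming the MyRocks option names rocksdb_block_cache_size and rocksdb_use_hyper_clock_cache (a sketch, not copied from the actual c5/c7 files), the relevant difference is roughly:

  # c5: LRU block cache
  rocksdb_block_cache_size=40G

  # c7: Hyper Clock cache
  rocksdb_block_cache_size=40G
  rocksdb_use_hyper_clock_cache=ON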

Results: perf reports

My standard perf reports are here for both types of block caches: LRU and Hyper Clock.

  • Throughput is generally improving over time from old jemalloc (4.0) to modern jemalloc (5.3). See the tables with absolute and relative throughput in the Summary section for LRU and for Hyper Clock.
  • HW performance metrics are mostly similar regardless of the peak RSS spikes. See the tables for LRU and for Hyper Clock. The interesting columns include: cpupq has CPU per operation, cpups has the average value for vmstat's us + sy, csps has the average value for vmstat's cs and cspq has context switches per operation.

So the good news is that tests here don't find performance regressions, although the more interesting test would be on larger HW with more concurrency.

Results: peak RSS

I measured the peak RSS during each benchmark step. The spreadsheet is here.
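
One way to measure it, and a sketch of what I assume here, is to sample the RSS of the mysqld process during each benchmark step and keep the per-step maximum:

  # sample mysqld RSS once per second; take the max per benchmark step
  while true; do grep VmRSS /proc/$(pidof mysqld)/status; sleep 1; done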

Summary

  • The larger values for jemalloc 4.4, 4.5, 5.0 and 5.1 might be from changes in how MADV_FREE and MADV_DONTNEED were used.
  • Peak RSS is larger for je-5.2.1.ub during l.x and l.i1. I have been unable to reproduce that with jemalloc compiled from source despite matching the configuration.

2 comments:

  1. Any chance you'd be able to run the benchmarks with mimalloc and tcmalloc?

    1. Already did with tcmalloc. I won't repeat it for mimalloc.
      For tcmalloc and glibc malloc see https://smalldatum.blogspot.com/2023/08/rocksdb-and-glibc-malloc-dont-play-nice.html
