Small Datum: RocksDB and glibc malloc don't play nice together

Sunday, August 27, 2023

RocksDB and glibc malloc don't play nice together

Pineapple and ham work great together on pizza. RocksDB and glibc malloc don't work great together. The primary problem is that for RocksDB processes the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. I have written about this before -- see here and here. RocksDB is a stress test for an allocator.

tl;dr

For a process using RocksDB the RSS with glibc malloc is much larger than with jemalloc or tcmalloc. There will be more crashes from the OOM killer with glibc malloc.

Benchmark

The benchmark is explained in a previous post.

The insert benchmark was run in the IO-bound setup and the database is larger than memory.

The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (one client per table). The benchmark is a sequence of steps.

l.i0

insert 500 million rows per table

create 3 secondary indexes. I usually ignore performance from this step.

l.i1

insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.

q100, q500, q1000

do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.
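The insert/delete pattern of the l.i1 step can be sketched as a queue: new rows go in at the head, old rows come out of the tail, and the row count stays constant. This is only a toy model, assuming one delete per insert; the function name and the deque standing in for a table are hypothetical and are not part of the benchmark client.

```python
from collections import deque


def run_li1_step(table: deque, next_key: int, n_ops: int) -> int:
    """Toy model of l.i1: each operation inserts one row at the head
    and deletes one row from the tail, so len(table) never changes.
    Returns the next unused key."""
    for _ in range(n_ops):
        table.append(next_key)  # insert at the head (largest key)
        next_key += 1
        table.popleft()         # delete from the tail (smallest key)
    return next_key


# Hypothetical usage: a 5-row table, then 3 insert+delete pairs.
rows = deque(range(5))
next_unused = run_li1_step(rows, next_key=5, n_ops=3)
```

After the call the table still holds 5 rows, but the key range has shifted forward by 3. That steady churn of allocations and frees is the kind of workload that stresses an allocator.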

Configurations

The benchmark was run with 2 my.cnf files, c5 and c7, both edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (older code) while c7 uses the Hyper Clock cache.

Malloc

The test was repeated with 4 malloc implementations:

je-5.2.1 - jemalloc 5.2.1, the version provided by Ubuntu 22.04
je-5.3.0 - jemalloc 5.3.0, the current jemalloc release, built from source
tc-2.9.1 - tcmalloc 2.9.1, the version provided by Ubuntu 22.04
glibc 2.35 - glibc malloc 2.35, the version provided by Ubuntu 22.04

Results

I measured the peak RSS during each benchmark step.
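One way to read peak RSS on Linux is the VmHWM field in /proc/<pid>/status. The sketch below is an assumption about how such a measurement could be taken, not the tooling used for these results, and the function names are hypothetical.

```python
import re
from pathlib import Path


def parse_status_kb(status_text: str, field: str) -> int:
    """Return a kB-valued field (e.g. VmHWM, VmRSS) from the text of
    /proc/<pid>/status. Raises KeyError if the field is missing."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB$"
    match = re.search(pattern, status_text, re.MULTILINE)
    if match is None:
        raise KeyError(field)
    return int(match.group(1))


def peak_rss_kb(pid: int) -> int:
    """Peak resident set size (VmHWM) of a live Linux process, in kB."""
    status_text = Path(f"/proc/{pid}/status").read_text()
    return parse_status_kb(status_text, "VmHWM")
```

Calling peak_rss_kb with the mysqld pid at the end of a benchmark step would give the high-water mark since the process started. To get a per-step peak, one option is to sample VmRSS periodically during the step and keep the maximum. Another is to reset the high-water mark between steps by writing 5 to /proc/<pid>/clear_refs (Linux 4.0 or later).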

The benchmark completed for all malloc implementations using the c5 config, but with glibc malloc the process would have hit OOM had some benchmark steps run longer. All of the configs used a 40G RocksDB block cache.

Using the c7 config, the benchmark completed for jemalloc and tcmalloc but failed with OOM for glibc malloc during the q1000 step. Had the l.i1, q100 and q500 steps run longer, the OOM would have happened sooner.

4 comments:

Anonymous, August 29, 2023 at 6:52 PM
The Cloudflare post "The effect of switching to TCMalloc on RocksDB memory use" is also highly recommended reading on what happens in glibc malloc.

If one is going to change the allocator, it might also make sense to change the OOM score, so that e.g. backup software is more likely to get killed:

[Service]
Environment="LD_PRELOAD=/usr/lib64/libtcmalloc.so.4"
OOMScoreAdjust=-600

Don't forget about "systemctl daemon-reload" and "systemctl cat" before a restart.

Anecdotally, I saw no obvious difference the couple of times I tried libtcmalloc_minimal.so.4 instead, which is described as "does not include the heap profiler and checker (perhaps to reduce binary size for a static binary)".
Anonymous, September 2, 2023 at 4:00 PM
The title specifically says RocksDB, but the contents of the post refer to MyRocks/MySQL benchmarks, which is not what I was expecting given the title. Most of us understand this distinction, but it raises the question: is this RSS/glibc issue truly a "RocksDB issue" or a "MySQL running MyRocks" issue?