Tuesday, January 3, 2023

RocksDB microbenchmarks: crc32c, xxh3 and lz4 uncompress

The RocksDB benchmark tool, db_bench, includes several microbenchmarks that measure the performance of hash and checksum functions, compression and decompression. The microbenchmarks report the per-block latency of these operations, and the typical block size for me is 4kb or 8kb. A script that I use to run these is here.
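To make "latency per block in a tight loop" concrete, here is a minimal sketch in Python. It is not db_bench and it does not use the RocksDB code: zlib.crc32 (plain CRC-32, not crc32c) is swapped in as a stand-in checksum, and the function name, block contents and iteration count are assumptions made for illustration.

```python
import time
import zlib


def per_block_latency_ns(block_size=4096, iterations=100_000):
    """Return (mean ns per checksum call, last checksum) for a tight loop.

    zlib.crc32 is only a stand-in here; RocksDB uses its own crc32c and xxh3.
    """
    # Hypothetical block contents: a repeating 0..255 byte pattern.
    block = bytes(range(256)) * (block_size // 256)
    checksum = 0
    start = time.perf_counter_ns()
    for _ in range(iterations):
        # The same block is hashed every time, so it stays hot in CPU caches.
        checksum = zlib.crc32(block)
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations, checksum


if __name__ == "__main__":
    for size in (4096, 8192):
        ns, _ = per_block_latency_ns(size)
        print(f"{size} byte block: {ns:.0f} ns per call")
```

Because the loop hashes one cache-resident block over and over, the number it prints is closer to a best case than to what a real workload would see, which is the distortion the disclaimer below warns about.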

The goal for this work is to determine whether there are compiler and other software perf bugs that can be fixed. One such bug has already been found and fixed for clang. These tests can also help me find bugs in the Makefiles used by RocksDB and opportunities to improve the compiler flags.

Disclaimer

  • these are microbenchmarks run in a tight loop which can distort results
  • it would be great to learn that some of these problems can be fixed via compiler options

tl;dr

  • Hopefully perf for crc32c with clang on x86 will improve once the bug fix reaches Ubuntu 22
  • Perf for xxh3 on Arm can be improved because c6i.2xl is ~2.4X to ~5X faster than c7g.2xl
  • Perf for xxh3 on Arm with gcc can be improved because clang is ~1.6X faster than gcc

Updates:
  1. RocksDB uses xxh3 from the dev branch as of Aug, 2021 and c6i.2xl (x86) is ~1.5X faster than c7g.2xl (Graviton3) with that code. With latest code from the dev branch c6i.2xl is only ~1.14X faster -- when xxh3 at 4kb is the metric. Scroll down to Update 1 for more details.
  2. If compiling RocksDB on ARM you might want to edit CXXFLAGS and CFLAGS in Makefile (see here). That is hardwired to -march=armv8-a+crc+crypto and you might want to try -march=native or -mcpu=native. I tried all of these. None of them changed xxh3 perf, but the current hardwired value might not be great for modern ARM.

Hardware

I tested several CPUs using RocksDB compiled with gcc and clang and share a few interesting results. In all cases I used Ubuntu 22.04 with gcc 11.3.0 and clang 14.0.0. The servers tested are:

  • Intel at home
    • Intel NUC8i7beh (i7-8559u) with turbo boost disabled via BIOS
  • AMD at home
    • Beelink SER 4700u with Ryzen 7 4700u with CPU frequency boost disabled via: echo '0' > /sys/devices/system/cpu/cpufreq/boost
  • x86 on AWS
    • c6i.2xlarge with Intel Xeon Platinum 8375C CPU @ 2.90GHz with hyperthreading disabled
  • Arm on AWS
    • c7g.2xlarge with Graviton3

Benchmarks

The script is here. I run it like: for i in 1 2 3; do bash cpu.sh > o.$i; done

Compiler command lines:

make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j8 static_lib db_bench

make CC=clang CXX=clang++ DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 -j8 static_lib db_bench


Results: NUC

Earlier this year a perf bug in clang was found and fixed to improve crc32c perf on x86.

The results are here:
  • For crc32c gcc is ~1.4X faster than clang (see here and here). Perhaps the bug fixed earlier this year has yet to reach clang in Ubuntu 22.04.
  • For xxh3 clang is ~1.05X faster than gcc (see here and here)
  • For lz4 uncompress clang is 1.05X to 1.1X faster than gcc (see here and here)

Results: Beelink

The results are here:
  • For crc32c gcc is ~1.1X faster than clang (see here and here)
  • For xxh3 clang is ~1.15X faster than gcc (see here and here)

Results: AWS x86

The results are here:
  • For crc32c gcc is ~1.6X faster than clang (see here and here)
  • For xxh3 gcc is ~1.2X faster than clang (see here and here)
  • For lz4 uncompress gcc is ~1.05X faster than clang (see here and here)

Results: AWS Arm

The results are here:
  • For xxh3 clang is ~1.6X faster than gcc (see here and here)

Results: AWS x86 vs AWS Arm

The results are here:
  • For crc32c
    • For gcc c6i.2xl is ~1.4X faster than c7g.2xl (see here and here)
    • For clang c7g.2xl is ~1.2X faster than c6i.2xl (see here and here)
  • For xxh3
    • For gcc c6i.2xl is ~5X faster than c7g.2xl (see here and here)
    • For clang c6i.2xl is ~2.4X faster than c7g.2xl (see here and here)
  • For lz4 uncompress
    • For gcc c6i.2xl is ~1.05X faster than c7g.2xl (see here and here)
    • For clang c6i.2xl is ~1.01X to ~1.06X faster than c7g.2xl (see here and here)

More AWS details

The output from cat /proc/cpuinfo is here for c7g.2xl and c6i.2xl.

Compiler command lines for crc32c are here. Command lines for xxh3 are the same.

Update 1

Mystery resolved, maybe.
  • RocksDB uses xxhash.h from the xxHash dev branch with the last update from Aug 6, 2021 per this commit which gets xxHash as of this commit.
  • Using benchHash from xxHash repo, xxh3 perf on c7g.2xl improved between release branch and latest on dev branch. Release is at version 0.8.1, last commit was from Nov 29, 2021.
Perf for x86 hasn't changed much between the 0.8.1 release branch and the latest on dev. In contrast, perf for ARM has improved a lot. The impact is that c6i.2xl (x86) was ~1.5X faster than c7g.2xl (ARM) with xxh3 for 4kb using older code (the bits in RocksDB). Now c6i.2xl (x86) is only ~1.14X faster.

Using benchHash from xxHash repo and looking at xxh3 for 4kb (the 4th number on the line that starts with "xxh3", the number is MB/s), compiled with gcc -O3, all for c7g.2xl (ARM):
  • 14596 from release branch
  • 21705 from latest on dev branch at git hash 4ebd833a2
  • 14745 from dev branch at git hash 2c611a76f which is what RocksDB uses
  • 21863 from dev branch at git hash 620facc5 which is an ARM specific optimization from Aug, 2022. There were other diffs before and after this one that also help xxh3 on ARM. For reference, here is perf for the diff (c4359b17) immediately preceding 620facc5.
And for c6i.2xl, x86:
  • 23119 from dev branch at latest (4ebd833a2) for c6i.2xl (x86)
  • 23167 from release branch for c6i.2xl (x86)
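
The within-platform change can be worked out directly from the benchHash numbers listed above. The sketch below uses only those four MB/s values. The helper names are made up for illustration, and treating 1 MB as 10^6 bytes when converting throughput to time per 4kb block is an assumption (the conversion changes by about 5% if benchHash means 2^20 bytes).

```python
# xxh3 at 4kb in MB/s from benchHash with gcc -O3, copied from the lists above.
ARM_RELEASE = 14596      # c7g.2xl, release branch
ARM_DEV_LATEST = 21705   # c7g.2xl, dev branch at 4ebd833a2
X86_RELEASE = 23167      # c6i.2xl, release branch
X86_DEV_LATEST = 23119   # c6i.2xl, dev branch at 4ebd833a2


def speedup(new_mb_per_s, old_mb_per_s):
    """Throughput ratio: values above 1.0 mean the new code is faster."""
    return new_mb_per_s / old_mb_per_s


def ns_per_block(mb_per_s, block_bytes=4096, bytes_per_mb=1_000_000):
    """Convert throughput to nanoseconds per block.

    bytes_per_mb is an assumption; pass 1 << 20 if MB means 2^20 bytes.
    """
    return block_bytes / (mb_per_s * bytes_per_mb) * 1e9


if __name__ == "__main__":
    print(f"c7g.2xl dev vs release: {speedup(ARM_DEV_LATEST, ARM_RELEASE):.2f}X")
    print(f"c6i.2xl dev vs release: {speedup(X86_DEV_LATEST, X86_RELEASE):.2f}X")
    print(f"c7g.2xl release: {ns_per_block(ARM_RELEASE):.0f} ns per 4kb block")
    print(f"c7g.2xl dev:     {ns_per_block(ARM_DEV_LATEST):.0f} ns per 4kb block")
```

With these inputs the ARM ratio is about 1.49 and the x86 ratio is about 1.00, which matches the summary that ARM improved a lot while x86 barely moved.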
