Thursday, February 2, 2023

RocksDB microbenchmarks: compilers, Arm and x86

This revisits my previous work to understand the impact of compilers and optimizer flags on the performance of RocksDB microbenchmarks. Benchmarks were run on Arm and x86 servers on both AWS and GCP using db_bench from RocksDB and benchHash from xxHash.

tl;dr

  • Why won't AWS tell us whether Graviton3 is Neoverse N1 or N2?
  • Much time will be spent figuring out which compiler flags to use for Arm
  • clang on x86 has a known problem with crc32c
  • Good advice appears to be to use -march=native on x86 and -mcpu=native on Arm. I don't have a strong opinion on -O2 vs -O3
  • Relative to x86, Arm does worse on xxh3 than on lz4, zstd and crc32c. Using the 8kb input case to compare latency from the best results the (c7g / c6i) ratios are crc32c = 1.35, lz4 uncompress/compress = 1.05 / 1.17, zstd uncompress/compress = 1.15 / 1.13 and then xxh3 = 2.38, so xxh3 is the outlier.
  • On AWS with these single-threaded workloads c6i (x86) was faster than c7g (Arm). I am not sure it is fair to compare the GCP CPUs (c2 is x86, t2a is Arm).

Hardware

For AWS I used c7g for Arm and c6i for x86. For GCP I used t2a for Arm and c2 for x86. See here for more info on the AWS instance types and GCP machine types. The GCP t2a is from the Arm Neoverse N1 family. There is speculation that the AWS c7g is from the Arm Neoverse N2 family but I can't find a statement from AWS on that.

All servers used Ubuntu 22.04.

Benchmarks

The first set of tests were microbenchmarks from db_bench (part of RocksDB) that measure the latency per 4kb and per 8kb page for crc32c, xxh3, lz4 (de)compression and zstd (de)compression. A script for that is here.

The second set of tests were microbenchmarks from benchHash (part of xxHash) that measure the time to do xxh3 and other hash functions for different input sizes including 4kb and 8kb. I share the xxh3 results at 4kb and 8kb.

Compiling

db_bench and xxHash were compiled using clang and gcc with a variety of flags. On Ubuntu 22.04 clang is version 14.0.0-1ubuntu and gcc is version 11.3.0.

For RocksDB the make command lines are:

make DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 static_lib db_bench

make CC=/usr/bin/clang CXX=/usr/bin/clang++ DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 static_lib db_bench


For db_bench:

  • to use -O3 rather than -O2 I edited Makefile here
  • to select -march or -mcpu for Arm I edited Makefile here

By default, xxHash is built with "-O3" and the default compiler, and that combination doesn't give the best performance.

For db_bench and x86:

  • rx.gcc.march.o2 - gcc with the default, -O2 -march=native
  • rx.clang.march.o2 - clang with the default, -O2 -march=native
  • rx.gcc.march.o3 - gcc with -O3 -march=native
  • rx.clang.march.o3 - clang with -O3 -march=native

For db_bench and Arm:

  • rx.gcc.march.o2 - gcc with the default, -O2 -march=armv8-a+crc+crypto
  • rx.gcc.mcpu.o2 - gcc with -O2 -mcpu=native
  • rx.gcc.neo.o2 - gcc with -O2 -mcpu=neoverse-512tvb on c7g and -O2 -mcpu=neoverse-n1 on t2a
  • rx.clang.march.o2 - clang with the default, -O2 -march=armv8-a+crc+crypto
  • rx.clang.mcpu.o2 - clang with -O2 -mcpu=native
  • rx.clang.neo.o2 - clang with -O2 -mcpu=neoverse-512tvb on c7g and -O2 -mcpu=neoverse-n1 on t2a
  • rx.gcc.march.o3 - gcc with -O3 -march=armv8-a+crc+crypto
  • rx.gcc.mcpu.o3 - gcc with -O3 -mcpu=native
  • rx.gcc.neo.o3 - gcc with -O3 -mcpu=neoverse-512tvb on c7g and -O3 -mcpu=neoverse-n1 on t2a
  • rx.clang.march.o3 - clang with -O3 -march=armv8-a+crc+crypto
  • rx.clang.mcpu.o3 - clang with -O3 -mcpu=native
  • rx.clang.neo.o3 - clang with -O3 -mcpu=neoverse-512tvb on c7g and -O3 -mcpu=neoverse-n1 on t2a

For xxHash/benchHash and x86:

  • xx.gcc.o2 - CFLAGS="-O2" make
  • xx.gcc.o3 - CFLAGS="-O3" make, the default that matches what you get without CFLAGS
  • xx.gcc.march.o2 - CFLAGS="-O2 -march=native" make
  • xx.gcc.march.o3 - CFLAGS="-O3 -march=native" make
  • xx.clang.o2 - same as xx.gcc.o2 except adds CC=/usr/bin/clang
  • xx.clang.o3 - same as xx.gcc.o3 except adds CC=/usr/bin/clang
  • xx.clang.march.o2 - same as xx.gcc.march.o2 except adds CC=/usr/bin/clang
  • xx.clang.march.o3 - same as xx.gcc.march.o3 except adds CC=/usr/bin/clang

For xxHash/benchHash and Arm:

  • xx.gcc.o2 - gcc with CFLAGS="-O2" make
  • xx.gcc.o3 - gcc with CFLAGS="-O3" make, the default that matches what you get without CFLAGS
  • xx.gcc.mcpu.o2 - gcc with CFLAGS="-O2 -mcpu=native"
  • xx.gcc.mcpu.o3 - gcc with CFLAGS="-O3 -mcpu=native"
  • xx.gcc.neo.o2 - gcc with -O2 -mcpu=neoverse-512tvb on c7g and -O2 -mcpu=neoverse-n1 on t2a
  • xx.gcc.neo.o3 - gcc with -O3 -mcpu=neoverse-512tvb on c7g and -O3 -mcpu=neoverse-n1 on t2a
  • xx.clang.o2 - same as xx.gcc.o2 except adds CC=/usr/bin/clang
  • xx.clang.o3 - same as xx.gcc.o3 except adds CC=/usr/bin/clang
  • xx.clang.mcpu.o2 - same as xx.gcc.mcpu.o2 except adds CC=/usr/bin/clang
  • xx.clang.mcpu.o3 - same as xx.gcc.mcpu.o3 except adds CC=/usr/bin/clang
  • xx.clang.neo.o2 - same as xx.gcc.neo.o2 except adds CC=/usr/bin/clang
  • xx.clang.neo.o3 - same as xx.gcc.neo.o3 except adds CC=/usr/bin/clang
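
The xxHash names above follow a regular pattern, so the command line can be derived from the name. The following Python is a minimal sketch of that mapping, not the script used for these tests. The function name xxhash_build_command is hypothetical, and passing CC as an environment variable in front of make is an assumption based on the CFLAGS="..." make form shown above.

```python
def xxhash_build_command(name, server=None):
    """Map a name such as xx.clang.march.o3 to a make command line.

    Hypothetical helper for illustration; the name-to-flags mapping
    follows the lists above.
    """
    parts = name.split(".")
    if parts[0] != "xx" or len(parts) not in (3, 4):
        raise ValueError(f"unexpected name: {name}")
    compiler, opt = parts[1], parts[-1]
    target = parts[2] if len(parts) == 4 else None
    if compiler not in ("gcc", "clang") or opt not in ("o2", "o3"):
        raise ValueError(f"unexpected name: {name}")

    flags = ["-" + opt.upper()]  # o2 -> -O2, o3 -> -O3
    if target == "march":
        flags.append("-march=native")  # used on x86
    elif target == "mcpu":
        flags.append("-mcpu=native")   # used on Arm
    elif target == "neo":
        # Arm with an explicit CPU name that depends on the server
        cpus = {"c7g": "neoverse-512tvb", "t2a": "neoverse-n1"}
        if server not in cpus:
            raise ValueError("neo builds need server='c7g' or 't2a'")
        flags.append("-mcpu=" + cpus[server])
    elif target is not None:
        raise ValueError(f"unexpected name: {name}")

    cmd = 'CFLAGS="' + " ".join(flags) + '" make'
    if compiler == "clang":
        # Assumption: CC is passed as an environment variable
        cmd = "CC=/usr/bin/clang " + cmd
    return cmd


print(xxhash_build_command("xx.gcc.march.o2"))
# CFLAGS="-O2 -march=native" make
print(xxhash_build_command("xx.clang.neo.o3", server="c7g"))
# CC=/usr/bin/clang CFLAGS="-O3 -mcpu=neoverse-512tvb" make
```

The db_bench builds are not covered by this sketch because they required edits to the RocksDB Makefile rather than only CFLAGS.
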

Results

All of the results are in this spreadsheet. Read the previous section to decode the names and understand which compiler options were used. The first four sets of tables have results for the following db_bench microbenchmarks. The numbers are the average latency in nanoseconds per 4kb or 8kb input. Results for 4kb are on the left and for 8kb on the right.
  • crc32c -  time to compute crc32c for a 4kb or 8kb page
  • uncomp.lz4 - time for lz4 decompression with a 4kb or 8kb page
  • comp.lz4 - time for lz4 compression with a 4kb or 8kb page
  • uncomp.zstd - time for zstd decompression with a 4kb or 8kb page
  • comp.zstd - time for zstd compression with a 4kb or 8kb page
The last 4 tables have results for xxh3 from both db_bench and benchHash for 4kb and 8kb inputs. The numbers are the average latency in nanoseconds per 4kb or 8kb input.

I used colors to highlight outliers in red (worst case) and green (best case).

The following are links to the results for db_bench microbenchmarks:
  • c6i (x86, AWS) with 4kb and with 8kb inputs
    • clang is much worse than gcc on crc32c. Otherwise compilers and flags don't change results. The crc32c issue for clang is known and a fix is making its way upstream.
  • c7g (Arm, AWS) with 4kb and with 8kb inputs
    • Compilers and flags don't change results. This one fascinates me.
    • The c6i CPU was between 1.1X and 1.25X faster than c7g.
  • c2 (x86, GCP) with 4kb and with 8kb inputs
    • Same as c6i (clang is worse at crc32c, known problem)
  • t2a (Arm, GCP) with 4kb and with 8kb inputs
    • Results for crc32c with rx.gcc.neo.o2 and rx.gcc.neo.o3 are ~1.15X slower than everything else. In this case neo means -mcpu=neoverse-n1.

The following are links to the results for xxh3 from db_bench and benchHash:
  • c6i (x86, AWS)
    • The best results come from -O3 -march=native for both clang and gcc
  • c7g (Arm, AWS)
    • The best results are from clang with -mcpu=native or -mcpu=neoverse-512tvb. In that case -O2 vs -O3 doesn't matter. The second best results (with gcc or clang with different flags) aren't that far from the best results.
    • The best results on c7g are more than 2X slower than the best on c6i. This difference is much larger than the differences for the db_bench microbenchmarks (crc32c, lz4, zstd). I don't know why Arm struggles more with xxh3.
  • c2 (x86, GCP)
    • The best results are from gcc with -O3 -march=native and gcc does better than clang. The benefit from adding -march=native is huge.
  • t2a (Arm, GCP)
    • The best result is from xx.gcc.o3, which is odd (and reproduces). Ignoring that, the results for clang and gcc are similar.

Update 1

This is another round of tests with benchHash from xxHash on c6i and c7g. I show performance as a function of XXH_VECTOR (for c7g and c6i) and XXH3_NEON_LANES (for c7g).

Results from c7g with gcc
* with the default make, XXH_VECTOR is set to XXH_NEON by the detection code
* with CFLAGS="-mcpu=native", XXH_VECTOR is set to XXH_SVE by the detection code
* some results are from xxhash.h edited to set XXH3_NEON_LANES

Numbers as latency in nanoseconds
4kb     8kb     XXH_VECTOR
326     635     XXH_SCALAR      make -O3
277     528     XXH_NEON        make -O3, set XXH3_NEON_LANES=2
245     448     XXH_NEON        make -O3, set XXH3_NEON_LANES=4
193     377     XXH_NEON        make -O3, XXH3_NEON_LANES=6 (default)
220     419     XXH_NEON        make -O3, set XXH3_NEON_LANES=8
195     378     XXH_SVE         make -O3 -mcpu=native

Numbers as GB/second
4kb     8kb     XXH_VECTOR
12.6    12.9    XXH_SCALAR      make -O3
14.8    15.5    XXH_NEON        make -O3, set XXH3_NEON_LANES=2
16.7    18.3    XXH_NEON        make -O3, set XXH3_NEON_LANES=4
21.3    21.8    XXH_NEON        make -O3, XXH3_NEON_LANES=6 (default)
18.7    19.5    XXH_NEON        make -O3, set XXH3_NEON_LANES=8
21.0    21.7    XXH_SVE         make -O3 -mcpu=native
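
The throughput numbers follow from the latency numbers: bytes divided by nanoseconds gives bytes per nanosecond, which is the same as GB/second when GB means 10^9 bytes. A minimal sketch in Python, assuming 4kb = 4096 bytes and 8kb = 8192 bytes (the helper name is made up for illustration):

```python
def throughput_gb_per_sec(input_bytes, latency_ns):
    # bytes per nanosecond is the same as 10^9 bytes per second
    return input_bytes / latency_ns


# (4kb, 8kb) latency in nanoseconds, copied from the c7g latency table
c7g_latency_ns = {
    "XXH_SCALAR": (326, 635),
    "XXH_NEON, 6 lanes": (193, 377),
    "XXH_SVE": (195, 378),
}

for name, (ns_4kb, ns_8kb) in c7g_latency_ns.items():
    gb_4kb = throughput_gb_per_sec(4096, ns_4kb)
    gb_8kb = throughput_gb_per_sec(8192, ns_8kb)
    print(f"{name:18s} {gb_4kb:5.1f} {gb_8kb:5.1f}")
```

Values computed this way agree with the throughput table to within about 0.1, which is what I would expect given that the latencies are rounded to whole nanoseconds.
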

Results from c6i with gcc
* with the default make, XXH_VECTOR is set to XXH_SSE2 by the detection code
* with CFLAGS="-march=native", XXH_VECTOR is set to XXH_AVX512 by the detection code
* all results are from make with -O3 and -DXXH_VECTOR=...

Numbers as latency in nanoseconds
4kb     8kb     XXH_VECTOR
298     595     XXH_SCALAR
172     340     XXH_SSE2
 96     185     XXH_AVX2
 75     143     XXH_AVX512

Numbers as GB/second
4kb     8kb     XXH_VECTOR
13.78   13.8    XXH_SCALAR
23.8    24.1    XXH_SSE2
42.7    44.2    XXH_AVX2
54.4    57.4    XXH_AVX512
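
To put the two servers side by side, the following Python sketch computes the speedup of each XXH_VECTOR setting over XXH_SCALAR on c6i, and the c7g / c6i latency ratio for the best result on each server. The numbers are copied from the latency tables above and the variable names are only for illustration.

```python
# (4kb, 8kb) latency in nanoseconds, copied from the c6i latency table
c6i = {
    "XXH_SCALAR": (298, 595),
    "XXH_SSE2": (172, 340),
    "XXH_AVX2": (96, 185),
    "XXH_AVX512": (75, 143),
}
# Best c7g result: XXH_NEON with XXH3_NEON_LANES=6
c7g_best = (193, 377)

# Speedup of each setting relative to scalar on c6i
scalar_4kb, scalar_8kb = c6i["XXH_SCALAR"]
for name, (ns_4kb, ns_8kb) in c6i.items():
    print(f"{name:11s} {scalar_4kb / ns_4kb:4.2f}X {scalar_8kb / ns_8kb:4.2f}X")

# Latency ratio between the best result on each server
c6i_best = c6i["XXH_AVX512"]
ratio_4kb = c7g_best[0] / c6i_best[0]
ratio_8kb = c7g_best[1] / c6i_best[1]
print(f"c7g / c6i: {ratio_4kb:.2f} at 4kb, {ratio_8kb:.2f} at 8kb")
```

With these numbers AVX512 is about 4X faster than scalar on c6i, and the best c7g result has about 2.6X the latency of the best c6i result.
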

2 comments:

  1. easyaspi314 the NEON nerd here, XXH3 tends to have more trouble on ARM because of how sensitive they are to the pipeline. It is recommended to toy around with XXH3_NEON_LANES to get the best performance when optimizing for a specific target. By default it is set to 6 on generic ARM processors because it is best for the average mobile Cortex, but especially on higher end chips this may vary.

    Replies
    1. Thank you. There will be updates from me once I follow the advice.