Monday, September 18, 2023

Postgres 16.0 vs the Insert Benchmark on a small server

This post has results for Postgres 16.0 and 16 rc1 vs the Insert Benchmark on a small server. My previous post had results for 16 beta1, beta2 and beta3.

tl;dr

  • Postgres 16.0 looks good
  • All versions of Postgres (old and new) have too much variance on this benchmark with the IO-bound setups. See the rant below.
Rant

There is too much variance with Postgres vs the Insert Benchmark in IO-bound setups. This isn't a new issue; it occurs for all Postgres versions I have tested, from version 10 or 11 through 16. I spent months testing different configuration options to see if I could tune this problem away. While better tuning helped, it does not fix the problem. I think OtterTune said it best, and while each new Postgres version has incremental improvements to vacuum, I don't think the problem will ever be fixed incrementally.

I hope that something significant is done, because while Postgres is a great DBMS I would be reluctant to use it for a write-intensive workload, and it risks falling far behind the closed-source vendor forks.

So I focus more on the Cached by Postgres setup and less on the IO-bound setup. The Cached by Postgres setup is great when searching for CPU regressions and I do not see any in Postgres 16.0.

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16 rc1 and 16.0 from source. The builds are named o3_native_lto, which is shorthand for compiling with: -O3 -march=native -mtune=native -flto.
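A minimal sketch of how such a build might be done. The flags are from the post; the configure invocation and install prefix are assumptions, shown as comments since they depend on a local source tree:

```shell
# Compiler flags used for the o3_native_lto builds (from the post)
CFLAGS="-O3 -march=native -mtune=native -flto"

# Typical from-source build sequence (paths and prefix are assumptions):
#   ./configure --prefix=$HOME/pg160 CFLAGS="$CFLAGS"
#   make -j8 && make install

echo "$CFLAGS"
```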

Benchmark

The Insert Benchmark was run in two setups.

  • Cached by Postgres - all tables are cached by Postgres
  • IO-bound - the database is larger than memory

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows per table where X is 20 for cached and 800 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another X million rows per table with secondary index maintenance where X is 300 for Cached and 30 for IO-bound. The number of rows per table at the end of the benchmark step matches the number at the start, with inserts done at the table head and deletes done from the tail.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client, and the same rate of deletes/s, done in the background. Run for 7200 seconds.
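The l.i1 pattern above can be sketched in a few lines. This is a toy simulation under my own assumptions (PK-ordered rows modeled as a deque), not the benchmark client itself; it just illustrates that inserts at the head matched by deletes from the tail leave the row count unchanged:

```python
from collections import deque

def l_i1_step(table, n_ops):
    """Simulate the l.i1 pattern: each insert at the table head (a new,
    larger PK value) is matched by a delete from the tail (the smallest
    PK value), so the row count is unchanged at the end of the step."""
    next_pk = table[-1] + 1 if table else 0
    for _ in range(n_ops):
        table.append(next_pk)   # insert at the head
        next_pk += 1
        table.popleft()         # delete from the tail
    return table

t = deque(range(1000))          # table as loaded by l.i0
t = l_i1_step(t, 5000)
assert len(t) == 1000           # row count matches the start
assert t[0] == 5000             # the oldest rows were deleted
```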
Configurations

I used the a1 and a2 configurations. The difference between them is that a1 has wal_compression set to lz4 while a2 has it set to off.
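The difference might look like this in postgresql.conf. This is a sketch of only the setting named in the post; any other settings the a1 and a2 configs share are not shown here:

```
# a1 config
wal_compression = lz4

# a2 config
wal_compression = off
```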

Results

Reports are here for Cached (a1 config, a2 config) and IO-bound (a1 config, a2 config).

The results for average throughput are interesting and confusing. The tables linked below use absolute and relative throughput where relative throughput is (QPS for my version / QPS for base version) and the base version is Postgres 15.2.

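The relative-throughput calculation is just a ratio; a minimal sketch follows. The QPS numbers here are made-up placeholders for illustration, not measured results:

```python
def relative_throughput(qps_my_version, qps_base):
    """Relative throughput = QPS for my version / QPS for the base version.
    Values > 1 mean the version is faster than the base (Postgres 15.2)."""
    return qps_my_version / qps_base

# Hypothetical values, not results from the benchmark reports
qps = {"15.2": 10000.0, "16.0": 10600.0}
r = relative_throughput(qps["16.0"], qps["15.2"])
print(round(r, 2))   # → 1.06
```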
  • Cached, a1 config (see here)
    • From the summary tables with relative throughput I see that 16.0 does better on the l.i1 benchmark step and then worse on the q100 and q500 steps that follow -- relative throughput values are 1.06, 0.91 and 0.94. Perhaps 16.0 is able to process writes faster during l.i1 but then that means the benchmark steps that follow (q100, q500) inherit more MVCC GC debt. I will revisit this.
    • From the HW metrics the CPU overhead (cpupq is CPU/operation) with version 16.0 is less in l.i1 but more in q100 and q500 vs earlier versions. This explains the throughput differences. I assume the increases in q100 and q500 are from doing more MVCC GC (because l.i1 was faster and given that the cpupq in q1000 returns to normal, perhaps GC debt was repaid by then).
    • During l.i1 all versions tested have much variance. The per-second insert rates regularly jump between 10k/s and 34k/s.
    • The max insert response time graphs have an interesting pattern for q100 and q500 which is less frequent in q1000. I assume that q100 and q500 have more MVCC GC debt to repay after l.i1 but that debt is gone by q1000.
    • For q100 the per-second QPS charts have less noise in 16.0 than in 15.2. This is somewhat true in q500 (see 16.0 and 15.2) but by q1000 they are similar (see 16.0 and 15.2).
  • Cached, a2 config (see here)
    • The comment above for the a1 config might apply here given that the relative throughput is 1.05 for l.i1 and 0.97 for q100. See the summary tables.
    • From the HW metrics results are similar to the a1 config except the MVCC GC debt was repaid by the end of q100.
    • During l.i1 all versions tested have much variance. The per-second insert rates regularly jump between 10k/s and 34k/s.
  • IO-bound, a1 config (see here)
    • I am reluctant to characterize these because there is too much variance -- especially during l.i1 (which suffers from vacuum-induced variance) and q100 (follows l.i1, inherits MVCC GC debt).
    • There is much variance for insert rates measured at 1-second intervals. Results for 15.2 show a gradual decline to ~500/s, then a sudden jump to 4000/s followed by another decline to ~500/s. The gradual declines are expected; the sudden jumps are not. This benchmark step ran for ~20,000s and I'd rather not run it for days in the hope that the variance fades. Results for 16.0 also have sudden jumps, but with 16.0 the rates decline to ~400/s, which is worse than what 15.2 does.
  • IO-bound, a2 config (see here)
    • I am reluctant to characterize these because there is too much variance -- especially during l.i1 (which suffers from vacuum-induced variance) and q100 (follows l.i1, inherits MVCC GC debt).
    • With the l.i1 benchmark step the gradual fade, sudden jump pattern is similar to the results above for the a1 config (see 16.0 and 15.2).
