Sunday, November 21, 2021

Well actually, LSM and B-tree

 Things I read about LSM that make me go well actually:

  • LSM does sequential writes - well actually, writes are sequential logically (per-file) but not physically (per-device). From the physical perspective compaction writes are large & concurrent rather than sequential: each file is written sequentially by one thread, but compaction is usually multi-threaded, and there are also small writes to the WAL. With concurrent compaction threads the large (maybe 1MB+) writes are interleaved at the device.
  • Compaction does merge-sort - well actually, that is a k-way merge, not a merge-sort (a small sketch of the merge follows this list).
  • Leveled compaction suffers from write-amplification - well actually, write-amplification was 10X smaller with MyRocks when the first workload was moved to it from InnoDB. Write-amp with leveled compaction is larger than with tiered, but that is the (C)RUM Conjecture in action: leveled means less space-amp at the cost of more write-amp. One of the problems is that the naive (hand-wavy) estimate of write-amp for leveled (~8 or ~10 per level) is frequently too pessimistic (a worked version of that naive estimate also follows this list).
  • LSM suffers from write stalls - well actually, that is more true than the previous bullet points. But the real problem for an LSM is figuring out how to smooth the write stalls (reduce response time variance) and RocksDB needs to get better at that. Everything (b-tree, LSM, etc.) suffers from write stalls when ingest exceeds write-back throughput and the easy way to reproduce that is a benchmark with WAL fsync disabled. A b-tree benefits from feedback that serves as a small write stall - when the buffer pool is full of dirty pages and the working set is not cached then a write likely needs to wait for one page to be written back before it can read something into the buffer pool. But an LSM doesn't have that feedback because a write usually just needs something to be put into the memtable, non-unique secondary index maintenance doesn't require reads, and even reads needed to check unique constraints (likely for a PK) can be avoided in some cases with bloom filters. Checkpoint is one place where a b-tree can suffer from large write stalls. Not that many years ago the insert benchmark was great at documenting long write stalls related to checkpoint in InnoDB and WiredTiger. Fortunately those have been fixed. Postgres also had serious problems, now fixed, from checkpoint IO creating bursts of write IO that starve other IO (like reads for user queries). Checkpoint is a hard problem.
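
As an aside on the k-way merge point: here is a minimal sketch in Python that merges already-sorted runs with a min-heap, the same way the standard library's heapq.merge does. The runs and names are illustrative, not how RocksDB implements compaction.

    import heapq

    def k_way_merge(sorted_runs):
        # Merge k already-sorted runs of (key, value) pairs with a min-heap.
        # This is a k-way merge, not a merge-sort: one pass over sorted inputs.
        heap = []
        for run_id, run in enumerate(sorted_runs):
            it = iter(run)
            first = next(it, None)
            if first is not None:
                # run_id breaks ties on equal keys and keeps the heap from
                # ever comparing the iterators themselves
                heapq.heappush(heap, (first[0], run_id, first[1], it))
        while heap:
            key, run_id, value, it = heapq.heappop(heap)
            yield key, value
            nxt = next(it, None)
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], run_id, nxt[1], it))

    # Three small sorted runs, roughly what compaction sees as its inputs
    runs = [[("a", 1), ("d", 4)], [("b", 2)], [("c", 3), ("e", 5)]]
    print(list(k_way_merge(runs)))
    # [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]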

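To make the naive write-amp estimate for leveled concrete, here is a back-of-the-envelope sketch. The fanout and level count are made-up parameters, and as noted above real measurements are frequently smaller than this.

    def naive_leveled_write_amp(fanout=10, num_levels=4, wal=1, flush=1):
        # Naive (hand-wavy) estimate: one write for the WAL, one for the
        # memtable flush into L0, then each key is rewritten ~fanout times
        # per level it passes through. Real systems usually measure lower.
        return wal + flush + fanout * num_levels

    print(naive_leveled_write_amp())  # 42 with these made-up parameters
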
6 comments:

  1. Agree that checkpoint is a hard problem. Spent 2.5 years implementing the last version of checkpoints in NDB 7.6. You can starve other IO activities, but you can also run out of UNDO/REDO log if you don't complete the checkpoint in time. In an in-memory DBMS you can also starve the CPU by being too active in checkpointing.

    So handling this requires a fair bit of adaptive coding where you keep track of IO load,
    CPU load and how much UNDO log and REDO log is remaining before it is full.
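
    Not the actual NDB code, just a toy sketch of the kind of adaptive pacing I mean, with made-up metric names and thresholds:

        def checkpoint_write_budget(io_util, cpu_util, redo_free, undo_free,
                                    base_mb_s=50.0, max_mb_s=500.0):
            # Toy pacing: back off when IO/CPU are busy serving other work,
            # speed up when the REDO/UNDO logs are close to running out.
            budget = base_mb_s
            if io_util > 0.8 or cpu_util > 0.8:
                budget *= 0.5
            log_free = min(redo_free, undo_free)  # fraction of log still free
            if log_free < 0.10:
                budget = max_mb_s                 # emergency: finish at any cost
            elif log_free < 0.25:
                budget = max(budget, 0.5 * max_mb_s)
            return budget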

    This is not even mentioning the technical issues in getting the checkpoint to be
    correct. I wrote a 30-page comment with a logical proof that the partial checkpoint
    algorithm in NDB actually works, phew :)

    ReplyDelete
    Replies
    1. Checkpoint is an interesting problem that must be re-solved each time a new db engine arrives. As you write, the adaptive aspects and constraints mean it usually takes that new db engine a while to get it right.

      Delete
  2. Another reason why WiredTiger (b-tree) checkpoints are much more difficult than RocksDB (LSM) checkpoints:
    WT's checkpoint does not contain garbage versions; every key-value in the checkpoint file is single-versioned. To satisfy this, checkpoint threads have to do much more comparison work, scanning all the tuples on all the dirty pages.
    RocksDB's checkpoint is not a clean cut (of key-value versions). That means a checkpoint may contain several versions of a key-value, and version GC is deferred to later compaction. So it's quite easy to implement: just remember the SST file names and the associated LSN in the WAL file.
    WT's clean-cut checkpoint leads to more stable performance when key-values are updated quite frequently, or when key-values are frequently range-deleted.
    RocksDB is not as stable as WT in those two situations.
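
    A conceptual sketch (not the real RocksDB API) of why the LSM side is easy - a checkpoint is basically a list of immutable SST file names plus a WAL sequence number:

        import os
        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class LsmCheckpoint:
            # Conceptual LSM checkpoint: immutable SST files plus a WAL position.
            # Older versions of a key may still live inside those SSTs; compaction
            # garbage-collects them later, the checkpoint itself does not.
            sst_files: List[str] = field(default_factory=list)
            wal_sequence_number: int = 0

        def take_checkpoint(db_dir: str, wal_seqno: int) -> LsmCheckpoint:
            # The SSTs are immutable, so remembering their names (or hard-linking
            # them into a checkpoint directory) is enough; no pages are rewritten.
            ssts = sorted(f for f in os.listdir(db_dir) if f.endswith(".sst"))
            return LsmCheckpoint(sst_files=ssts, wal_sequence_number=wal_seqno)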

    ReplyDelete
    Replies
    1. Thank you for an interesting comment. If WT checkpoint only writes the latest version then something special must be done to put older versions elsewhere if long-running snapshots are to work. Is that supported yet in Mongo+WT?

      Delete
    2. I also remembered the time I tested Linkbench with long-open snapshots concurrent with the benchmark. Some MVCC engines are less happy about that than others. Need to repeat the test and include Postgres.

      Blog post on that is here

      Delete
    3. In the WT 3.x series, in-use older versions are stored in the WT-LAS (LAS means lookaside) file. WT 10.x is re-designed: the LAS file is removed and older versions are stored in the HISTORY file.
      The HISTORY file is a new mechanism, which lets WT support longer transactions than the old LAS-file mechanism. However, I haven't had time to investigate the differences between the HISTORY file and the LAS file in technical detail. Maybe WT's core developers should provide more detailed documents about this.

      Delete