Tuesday, October 22, 2019

A review of uDepot - keeping up with fast storage

This is a review of Reaping the performance of fast NVM storage with uDepot, which was published in FAST 2019. The paper is worth reading. uDepot is a hash-based index+log, using my index structure terminology. The goal is to create a database engine that can utilize all of the IO capacity of a fast storage device -- think millions of read IOPS and ~10 usec latency. As I wrote yesterday, stranding read IO is one way to waste a storage device, but so are excess space and write amplification.

By hash-based index+log I mean that updates are appended to a log and an in-memory hash index points into the log. Log space is managed as segments and grains. Segments are large (1 GB) and contain smaller grains (4 KB). GC reclaims previously written segments by copying out live grains. The index must be updated during GC.
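
To make that concrete, here is a rough sketch of the layout. The segment and grain sizes are from the paper; the field names and the entry format are my guesses for illustration, not uDepot's.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Sizes from the paper; everything else below is illustrative.
constexpr uint64_t kSegmentSize = 1ULL << 30;  // 1 GB segments
constexpr uint64_t kGrainSize   = 4096;        // 4 KB grains

struct LogAddress {
    uint32_t segment;     // which segment in the log
    uint32_t grain;       // grain offset within that segment
    uint32_t num_grains;  // record size rounded up to whole grains
};

// In-memory hash index: key -> location of the latest copy of the record in
// the log. A put appends a new copy to the log and overwrites this entry;
// GC that relocates a live record must also repoint the entry.
std::unordered_map<std::string, LogAddress> kv_index;
```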

Evaluating this according to the CRUM conjecture:
  • read amplification - uDepot does one storage read per point query as there is no block cache. The cost of a block cache is complexity, memory and CPU cycles. The benefit of a block cache is a reduction in storage traffic. uDepot doesn't support range queries because it is hash-based.
  • write amplification - space and write amplification are inversely related with index+log. Doing GC more frequently reduces space-amp at the cost of more write-amp. The paper doesn't discuss this tradeoff and I don't know whether it is configurable (yet) in uDepot.
  • space amplification - see the comment for write-amp. From the paper it wasn't clear whether grains could be shared by small records. If not shared then there will be more space-amp.
  • cache amplification - the hash index needs at least 8 bytes in memory per record. There are hash-based approaches that use less memory per record - SkimpyStash and SILT need ~1 byte/record. The need for something in memory per record is common to index+log approaches because records are not clustered in the log. The memory requirements for uDepot are reduced because it doesn't use a block cache. A back-of-the-envelope estimate of the difference follows this list.
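
As a rough estimate of the cache-amp difference (the record count below is made up), 8 bytes per index entry vs ~1 byte/record works out to:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t records = 2'000'000'000;  // hypothetical 2B records
    // 8 bytes of DRAM per entry vs ~1 byte/record for SkimpyStash/SILT
    printf("8 bytes/record: %.1f GB of DRAM\n", records * 8.0 / 1e9);
    printf("1 byte/record:  %.1f GB of DRAM\n", records * 1.0 / 1e9);
    return 0;
}
```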

uDepot supports get, put and delete. It does not support a range scan because it is hash-based. While hash-based approaches can use much less CPU than tree-based approaches, and a hash index is sufficient if you don't need range scans, I am curious whether there is enough demand to justify the cost of building a production-quality hash-based index structure. I hope there is.

Implementation details

The hash index is an array of hash tables. The array can grow dynamically by doubling in size as needed. The paper does not explain whether the array can shrink. Growing is online and incremental, and the reported worst case blocks some operations for about 1 millisecond. The hash tables use Hopscotch hashing to support a high fill factor. There is an array of mutexes per hash table, and some benchmarks were run with 8192 mutexes per table. The hash index is eventually made durable in the log, so the last N changes to the index might not be durable. The paper claims the index can be recovered after a crash in a few seconds, but the process wasn't fully explained.
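
For readers unfamiliar with Hopscotch hashing, here is a sketch of why lookups stay cheap at a high fill factor: each base slot keeps a bitmap of which nearby slots hold keys that hashed to it, so a get probes at most one small neighborhood. This is the textbook scheme with invented names and sizes, not uDepot's code.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

constexpr int kNeighborhood = 32;  // H: max distance from the base slot

struct Slot {
    bool used = false;
    uint64_t key = 0;
    uint64_t grain_offset = 0;  // where the record lives in the log
    uint32_t hop_info = 0;      // bit i set => slot (base + i) holds a key hashed to base
};

struct HopscotchTable {
    std::vector<Slot> slots;
    explicit HopscotchTable(size_t n) : slots(n) {}

    std::optional<uint64_t> lookup(uint64_t key) const {
        size_t base = std::hash<uint64_t>{}(key) % slots.size();
        uint32_t hops = slots[base].hop_info;
        // Only the (at most H) slots flagged in hop_info need to be probed.
        while (hops != 0) {
            int i = __builtin_ctz(hops);  // index of the next candidate neighbor
            const Slot& s = slots[(base + i) % slots.size()];
            if (s.used && s.key == key) return s.grain_offset;
            hops &= hops - 1;             // clear the lowest set bit
        }
        return std::nullopt;
    }
};
```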

The log has large segments (1 GB each) that contain smaller grains (4 KB each). A segment stores either records or the index. I wrote above that uDepot might not share grains between small records, which wastes space. GC copies live grains out of a segment to make the segment free. The GC process -- how and when segments are selected for GC -- was not explained. uDepot expects a raw device. This avoids filesystem overhead, but using a filesystem makes life easier in production. The paper did not quantify the overhead saved by not using a filesystem.
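
A minimal sketch of what that GC step looks like for an index+log store, assuming some policy has already picked the victim segment (the paper doesn't describe victim selection) and with types invented for illustration:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Grain {
    bool live = false;  // cleared when the record is deleted or superseded
    uint64_t key = 0;   // owning key, needed to repoint the index after a move
};

struct Segment {
    std::vector<Grain> grains;  // 1 GB / 4 KB = 262,144 grains per segment
};

// Copy every live grain out of the victim into the open segment and repoint
// the index at the new location; afterwards the victim holds no live data
// and can be handed back to the free segment pool.
void gc_segment(Segment& victim, Segment& open_segment,
                const std::function<void(uint64_t key, size_t new_grain)>& update_index) {
    for (Grain& g : victim.grains) {
        if (!g.live) continue;                 // dead grain: nothing to copy
        open_segment.grains.push_back(g);      // relocate the live grain
        update_index(g.key, open_segment.grains.size() - 1);
        g.live = false;
    }
}
```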

More implementation details

The implementation raises two interesting questions. What is the best way to do fast IO? What is the best way to implement a thread per core server?

For fast IO uDepot uses SPDK or Linux AIO. I assume that it could work great with io_uring once io_uring becomes widely available. Linux has a habit of eventually catching up to modern hardware once said hardware is sufficiently available. It will be interesting to see whether io_uring removes the need for SPDK. Figures 7 and 8 in the paper show a dramatic improvement from async IO with a thread per core compared to sync IO with many threads.
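
For reference, this is roughly what a single 4 KB read looks like with io_uring via liburing. It is only a sketch of the interface -- the device path, queue depth and error handling are placeholders -- not how uDepot issues IO.

```cpp
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);            // 64-entry submission queue

    // Placeholder path; a raw NVMe device with O_DIRECT needs aligned buffers.
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }
    void* buf = aligned_alloc(4096, 4096);

    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);    // read one 4 KB grain at offset 0
    io_uring_submit(&ring);

    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);               // block until the read completes
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```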

For thread per core uDepot uses TRT -- Task Run Time. This provides coroutines for systems programming. TRT uses cooperative multitasking, so it must know when to reschedule tasks. IO and synchronization are done via TRT interfaces to help in that regard. Under the covers it can use async IO and switch tasks while the IO or sync call would otherwise block. One benefit from coroutines is a reduction in the number of context switches.
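
To illustrate the pattern (not TRT itself): a task suspends where it would otherwise block on IO and hands itself back to a run queue, and a single-threaded scheduler resumes tasks in turn. A minimal C++20 coroutine sketch with invented names:

```cpp
#include <coroutine>
#include <deque>
#include <iostream>

// A minimal cooperative task, assuming C++20 coroutines.
struct Task {
    struct promise_type {
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> handle;
};

// Run queue of suspended tasks; on a real server this would be per core.
std::deque<std::coroutine_handle<>> run_queue;

// Awaitable that parks the current task on the run queue. In a system like
// TRT this is where an async IO would be submitted and the task resumed
// when the IO completes.
struct Yield {
    bool await_ready() { return false; }
    void await_suspend(std::coroutine_handle<> h) { run_queue.push_back(h); }
    void await_resume() {}
};

Task worker(int id) {
    for (int step = 0; step < 3; ++step) {
        std::cout << "task " << id << " step " << step << "\n";
        co_await Yield{};  // pretend this is a blocking read; yield instead
    }
}

int main() {
    Task t1 = worker(1);
    Task t2 = worker(2);
    run_queue.push_back(t1.handle);
    run_queue.push_back(t2.handle);
    while (!run_queue.empty()) {   // cooperative, single-threaded scheduler
        auto h = run_queue.front();
        run_queue.pop_front();
        h.resume();                // runs until the next co_await or the end
    }
    t1.handle.destroy();
    t2.handle.destroy();
    return 0;
}
```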

I am curious about the future of coroutines for systems programming in C and C++. RethinkDB used a thread per core model and started with callbacks, then realized that coroutines made development easier -- see here and here. Coroutines are coming, or have come, to Seastar. Boost supports fibers and coroutines. I assume they will eventually arrive in standard C++.

3 comments:

  1. Mark,


    First of all, thank you for providing an excellent review of our paper.

    You raise a lot of good questions, so I thought it might be a good opportunity to address them here, since we had to omit some details from the paper due to the page limit.


    Many of the questions are answered in an older paper that we wrote about a log-structured host translation layer called SALSA. SALSA is a log-structured block device that can be used by unmodified applications (e.g., databases and filesystems). In uDepot, we use SALSA for space allocation and GC (but not for IO, i.e., we do not use the SALSA device). So, for details about space management and GC, please have a look at our SALSA paper: https://kkourt.io/papers/salsa-mascots18.pdf, https://kkourt.io/papers/salsa-mascots18-pr.pdf.


    - Q: "Doing GC more frequently reduces space-amp at the cost of more write-amp. The paper doesn't discuss this tradeoff and I don't know whether it is configurable (yet) in uDepot."


    uDepot's behavior under heavy GC is similar to SALSA (see the SALSA paper for a detailed evaluation under GC). Please note that for the uDepot Memcache use case, no GC happens; instead, we discard items from the cache rather than relocating them.


    - Q: "Space amplification - see the comment for write-amp. From the paper it wasn't clear whether grains could be shared by small records. If not shared then there will be more space-amp."


    This is correct, grains cannot be shared by small records. Having said that, users with small key-value sizes can configure uDepot with a grain size all the way down to 32 bytes -- at the cost of RMW for sub-sector updates.


    - Q: "uDepot expects a raw device. This will avoid filesystem overhead but using a filesystem makes life easier in production. The paper did not explain the overhead saved by not using a filesystem."


    uDepot works with raw devices, and with files when the IO backend supports it (e.g., Linux AIO but not SPDK). Using the raw device guarantees sequential (LBA) writes to the device. Also, even though using an FS makes deployment easier, operating on the raw device generally performs better than on a filesystem, especially for parallel writes (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/ext4/file.c?h=v5.4-rc4#n234).


    - Q: "What is the best way to do fast IO?"


    Indeed, we think that io_uring and some form of coroutines (C++20 or Rust's async) are the best bet for doing fast IO.


    -- Kornilios and Nikolas

    1. Hi,
      There are missing slides on "comparisons with Aerospike, ScyllaDB". What were the results?

    2. Hi,

      You can find the comparison with Aerospike and ScyllaDB in the paper (https://www.usenix.org/system/files/fast19-kourtis.pdf), see figures 9 and 10.

      Thanks,
      Nikolas
