How to Read 100s of Millions of Records per Second from a Single Disk

This article gives an overview of the implementation and performance of two recent additions to LocustDB, an extremely fast open-source analytics database built in Rust. The first addition is support for persistent storage, the second is an lz4 compression pass for data stored on disk or cached in memory. Benchmarking LocustDB on a dataset of 1.46 billion taxi rides demonstrates excellent performance for cold queries, reaching > 95% of sequential read speed on SSD and > 70% of sequential read speed on HDD. Comparing to ClickHouse, queries with similar data volume are as fast or faster, and disk space usage is reduced by 40%.

Implementation

Compression and serialization

A large fraction of the work went into finding a better abstraction for in-memory data and encoding schemes that allows for easy serialization. LocustDB splits up tables into a set of partitions with a fixed number of rows. Within each partition, the data for each column is stored sequentially in one contiguous block of memory/disk. LocustDB automatically applies a variety of compression schemes to data which among other benefits allows it to store datasets with billions of entries completely within RAM.

Before this work, I abstracted over the in-memory encoding schemes using Column and Codec traits that had a distinct implementation for each different encoding scheme. I was able to simplify this significantly, and now both Column and Codec are just basic structs that can be easily serialized and actually make it even easier to support any number of different encoding schemes.

The Column struct consists of some metadata (name, length of column, range of possible values) a list of data sections (an enum for a predefined set of Vecs storing primitive values) and a Codec. The Codec struct provides a uniform and flexible way of defining a variety of different compression schemes. Basically, it is a simple stack machine program composed of a list of well-defined CodecOP instructions that can be converted into a query plan that will decompress the data.

To illustrate, we might encode an integer column by subtracting 413372 from all values, casting them into 2 byte unsigned integers (u16), and then applying a final lz4 pass. The decoder can be expressed as [LZ4(output_type=u16), Add(413372, input_type=u16)]. Other codecs might read additional data sections. For example, dictionary compressed string columns, which store each value as an integer indexing into the string dictionary, have 3 different data sections. The first is the (lz4 compressed) list of indices for each value, the second stores the length and offset for each string, and the third is an array of bytes that contains the actual string data. This is how we can express the decoding operation as a stack machine: [LZ4(output_type=u8), PushDataSection(1), PushDataSection(2), DictLookup(index_type=u8)].

This design has a number of nice properties:

  1. Columns use essentially the same format on disk and in memory and can be straighforwardly and cheaply serialized and deseralized.
  2. It is possible to add any number of new encoding schemes while still maintaining backwards compatibility for existing data.
  3. Decoders can be turned into the same class of operations as user queries, and can benefit from any existing and future optimization passes applied by the query engine.
  4. Often we only want to partially decode data. For example, you might have a query like SELECT country, count(0) FROM users that counts the number of users broken down by a dictionary encoded country column. In this case we can compute the counts directly on the dictionary indices and only convert back to string values at the very end, which speeds up the query by more than an order of magnitude. This kind of optimization can now be automatically derived for any decoding scheme, all we need to do is specify what operations commute with each decoding layer (e.g. does it preserve width? ordering? summation?).

Background: RocksDB

RocksDB is an embeddable persistent key-value store for fast storage based on log structured merge trees. It has replaced InnoDB as the storage backend for all of Facebook’s MySQL clusters, resulting in significant performance gains, and is employed across a wide range of applications and workloads within and outside of Facebook. The interface exposed by RocksDB is very simple at it’s core and has three key operations that can be simplified as follows (&[u8] is Rust for a reference to an array of bytes):

fn put(key: &[u8], value: &[u8]);
fn get(key: &[u8]) -> &[u8];
fn scan(first_key: &[u8]) -> Iterator<Item=(&[u8], &[u8])>;

The put and get operations store and retrieve a value at a key respectively, where both keys and values are arbitrary sequences of bytes. RocksDB internally stores data sorted by key (within each level), and scan provides an efficient way to iterate over ranges of adjacent keys and their respective values.

Disk storage schema

The storage backend for LocustDB is very simple (or rather, all of the actual complexity is taken care for us by RocksDB). Point in case, the RocksDB specific code fits into a single 350 line file, 250 lines of which are imports and serialization boilerplate.

LocustDB uses two separate key spaces. The first one stores only meta data about what partitions and columns exist. This is a small amount of data that can be loaded quickly in full at start-up. The second key space stores the actual column data sorted first by column name and then by a monotonically increasing partition id. On SSD, the data for a column/partition is currently just loaded with a point lookup (get) as required by query workers, which works as long as partition sizes are large enough. On HDD, a single reader thread loads column data in larger contiguous chunks (scan). There’s a little bit of additional complexity in scheduling disk reads and synchronizing worker threads with minimal locking, but Rust’s strong support for concurrency makes this quite manageable. To allow for small partition sizes on SSDs and make good use of very fast HDDs or systems with multiple disks this approach will require some refinement, but at least on my hardware it already works well enough.

Performance

Caveats

For a variety of reasons, the results I present should be taken with a grain of salt. LocustDB is optimized for bulk loads, while ClickHouse will also work for continuous ingestion. The data set and queries I used for the benchmarks is not particularly suited to evaluate cold reads or compression, it’s just what I had handy. The ingestion processes for LocustDB and especially Clickhouse apply various data transformations to the 52 different columns, and I have not verified that the resulting tables are completely comparable in every respect.

If you want to know what performance looks like on your own data (or just enjoy cryptic error messages and program crashes), you will be pleased to hear that LocustDB can now be downloaded as a binary that will run on most x64 Linux systems without any additional setup.

Setup

All results were produced on a dataset of 1.46 billion taxi rides. For more details on the dataset and queries see this blogpost. The HDD was a 2TB Seagate BarraCuda and the SSD a 2014 Crucial MX100 512GB. For additional details, see the methodology section.

Cold read performance

Measurements of query runtime when none of the data is cached in memory. On HDD and after purging caches LocustDB can take a few seconds to start, and the ClickHouse server similarly can take a few seconds to become responsive after restart. This time is not included in the results. For ClickHouse there are two sets of HDD results collected with--max-threads=1 and --max-threads=12. The former significantly improves cold query times on HDD (presumably because it reduces the number of separate data streams and therefore disk seeks), but would of course hurt queries for data cached in-memory in a major way. A superficial scan of ClickHouse docs didn’t turn up any way to just limit the number of disk-reading threads, but I may well have missed the right setting.

#QueryLocustDB (SSD)/sClickHouse (SSD)/sLocustDB (HDD)/sClickHouse (HDD, 1 thread)/sClickHouse (HDD, 12 threads)/s
1Simple Count0.661.00.786.46.3
2Average6.89.02739128
3Count by 2114.1443528
4Count by 318116876147
5Sparse Filter9.4203852268
6Top N13185156236
7Reducible Cardinality6.78.1176498

Drawing strong conclusions from these results is inadvisable because they conflate ingestion schema differences, compression ratios and querying efficiency. Query 1 is very fast because it reads from a single highly compressible column that is reduced to a couple of megabytes by lz4. Queries 2 and 3 are reading a year from a timestamp column, and judging from bandwidth reported by ClickHouse and dstat ClickHouse seems to be able to read a dedicated, highly compressible 2 byte width year column whereas LocustDB reads the full UNIX timestamp. LocustDB enjoys an advantage on query 5 which reads from a delta encoded trip_id column. Data volume on the other queries should be roughly the same, but probably somewhat smaller for LocustDB for which various floating point columns were (mostly losslessly) transformed into integers at ingestion time which helps compression, and which is also able to automatically choose a different compression scheme for each partition.

The main bottleneck for cold queries is IO, and we can judge the storage backend by looking at how close it comes to achieving sequential read speeds (550MiB/s for the SSD and 180MiB/s for the HDD according to fio). Below table shows (over)estimates of the read speed of LocustDB during the queries, computed as the size LocustDB reports for the referenced columns divided by runtime:

Query1234567
Bandwidth (SSD)/MB/s11580560560580580350
Bandwidth (HDD)/MB/s10150140150140150140
Approximate data volume/GiB0.0073.75.99.35.07.12.2

The bandwidths are off by about 7%, maybe because of memory allocator overhead or some other inaccuracy in how LocustDB reports memory usage. The actual bandwidths as gleaned from looking at dstat output are closer to 530MB/s on SSD and 130MB/s on HDD. Either way, these calculations do show that query time is directly proportional to data volume and at least on SSD we are very close to hitting the IO limit. The only outliers are query 1 which reads very little data, and query 7 which reads 2.2GiB of data but is computationally expensive.

Size on disk

The full uncompressed data set encoded as CSV is 602GiB, and reduces to 114GiB after applying gzip. According to du, the dataset takes up 80GiB  and 132GiB on disk when converted to the internal storage formats for LocustDB and ClickHouse respectively. This is roughly as expected. Both LocustDB and ClickHouse use efficient columnar data layouts which are further compressed with lz4. In addition to this, LocustDB automatically applies dictionary encoding to string columns as appropriate which results in significant additional gains, and an additional 5GiB is shaved off by delta encoding the trips column.

Imagine you are running a 100 node ClickHouse cluster. Look at your cluster, now look at me, now back to your cluster. It is now a 60 node LocustDB cluster. You’re welcome, and please send 10% of your cash savings my way. It’s only fair! (Of course you might be constrained on something other than disk space, and in any case LocustDB is not and may never be production ready.)

LocustDB implements a :memtree command that displays full breakdowns of memory usage and encoding schemes. The full dataset does not quite fit into memory on my machine and the output below is from a non-representative sample of 45% of partitions (click the tabs for increasingly detailed breakdowns):

:memtree(1)

trips  32.0GiB  1464785771 rows

:memtree

trips  32.0GiB  1464785771 rows
├─ average_wind_speed       45% resident     0.61GiB  1.9%   0.99B/row
├─ cab_type                 45% resident      3.2MiB 0.01%   5.1mB/row
├─ dropoff                  45% resident      5.9GiB   19%    9.8B/row
├─ dropoff_borocode         45% resident       40MiB 0.12%    65mB/row
├─ dropoff_boroct2010       45% resident     0.18GiB 0.55%   0.29B/row
├─ dropoff_boroname         45% resident       41MiB 0.12%    65mB/row
├─ dropoff_cdeligibil       45% resident       51MiB 0.15%    81mB/row
├─ dropoff_ct2010           45% resident     0.17GiB 0.53%   0.28B/row
├─ dropoff_ctlabel          45% resident     0.13GiB 0.39%   0.21B/row
├─ dropoff_datetime         45% resident      2.3GiB  7.2%    3.8B/row
├─ dropoff_latitude         45% resident       11MiB 0.03%    18mB/row
├─ dropoff_longitude        45% resident       40MiB 0.12%    63mB/row
├─ dropoff_ntacode          45% resident       84MiB 0.26%   0.13B/row
├─ dropoff_ntaname          45% resident       86MiB 0.26%   0.14B/row
├─ dropoff_nyct2010_gid     45% resident     0.13GiB 0.40%   0.21B/row
├─ dropoff_puma             45% resident       92MiB 0.28%   0.15B/row
├─ ehail_fee                45% resident      3.1MiB 0.01%   5.0mB/row
├─ extra                    45% resident     0.41GiB  1.3%   0.67B/row
├─ fare_amount              45% resident      1.1GiB  3.3%    1.8B/row
├─ improvement_surcharge    45% resident      6.9MiB 0.02%    11mB/row
├─ max_temperature          45% resident     0.45GiB  1.4%   0.73B/row
├─ min_temperature          45% resident     0.44GiB  1.4%   0.73B/row
├─ mta_tax                  45% resident       15MiB 0.04%    24mB/row
├─ passenger_count          45% resident     0.30GiB 0.95%   0.50B/row
├─ payment_type             45% resident     0.31GiB 0.96%   0.50B/row
├─ pickup                   45% resident      5.9GiB   19%    9.8B/row
├─ pickup_borocode          45% resident     0.11GiB 0.35%   0.18B/row
├─ pickup_boroct2010        45% resident      1.1GiB  3.5%    1.8B/row
├─ pickup_boroname          45% resident     0.11GiB 0.35%   0.18B/row
├─ pickup_cdeligibil        45% resident     0.17GiB 0.52%   0.27B/row
├─ pickup_ct2010            45% resident      1.1GiB  3.5%    1.8B/row
├─ pickup_ctlabel           45% resident     0.89GiB  2.8%    1.5B/row
├─ pickup_datetime          45% resident      2.1GiB  6.6%    3.4B/row
├─ pickup_latitude          45% resident       16MiB 0.05%    25mB/row
├─ pickup_longitude         45% resident     0.17GiB 0.55%   0.29B/row
├─ pickup_ntacode           45% resident     0.45GiB  1.4%   0.73B/row
├─ pickup_ntaname           45% resident     0.45GiB  1.4%   0.74B/row
├─ pickup_nyct2010_gid      45% resident     0.89GiB  2.8%    1.5B/row
├─ pickup_puma              45% resident     0.45GiB  1.4%   0.73B/row
├─ precipitation            45% resident     0.36GiB  1.1%   0.58B/row
├─ rate_code_id             45% resident       63MiB 0.19%   0.10B/row
├─ snow_depth               45% resident       94MiB 0.29%   0.15B/row
├─ snowfall                 45% resident       71MiB 0.22%   0.11B/row
├─ store_and_fwd_flag       45% resident     0.16GiB 0.51%   0.27B/row
├─ tip_amount               45% resident     0.88GiB  2.7%    1.4B/row
├─ tolls_amount             45% resident     0.14GiB 0.45%   0.23B/row
├─ total_amount             45% resident      1.4GiB  4.4%    2.3B/row
├─ trip_distance            45% resident      1.6GiB  5.0%    2.7B/row
├─ trip_id                  45% resident       26MiB 0.08%    41mB/row
├─ trip_type                45% resident       51KiB 0.00%    81μB/row
└─ vendor_id                45% resident     0.30GiB 0.94%   0.50B/row

:memtree(3)

trips  32.0GiB  1464785771 rows
├─ average_wind_speed       45% resident     0.61GiB  1.9%   0.99B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.61GiB  100%   0.99B/row  100%
│  └─ LZ4(U8) Add(U8)                         2.0KiB 0.00%   6.6mB/row  0.05%
├─ cab_type                 45% resident      3.2MiB 0.01%   5.1mB/row
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)        3.2MiB  100%   5.1mB/row  100%
├─ dropoff                  45% resident      5.9GiB   19%    9.8B/row
│  ├─ LZ4(U8) StrHexUnpack                    5.9GiB  100%     13B/row  73%
│  ├─ LZ4(U16) Data(1) Data(2) Dict(U16)      8.2MiB 0.13%    8.2B/row  0.16%
│  └─                                          12KiB 0.00%    72μB/row  27%
├─ dropoff_borocode         45% resident       40MiB 0.12%    65mB/row
│  ├─ LZ4(U8) ToI64(U8)                        40MiB  100%    88mB/row  73%
│  └─                                          14KiB 0.03%    80μB/row  27%
├─ dropoff_boroct2010       45% resident     0.18GiB 0.55%   0.29B/row
│  ├─ LZ4(U32) ToI64(U32)                    0.18GiB  100%   0.39B/row  73%
│  └─                                          16KiB 0.01%    96μB/row  27%
├─ dropoff_boroname         45% resident       41MiB 0.12%    65mB/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         41MiB  100%    89mB/row  73%
│  └─                                          14KiB 0.03%    80μB/row  27%
├─ dropoff_cdeligibil       45% resident       51MiB 0.15%    81mB/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         51MiB  100%   0.11B/row  73%
│  └─                                          16KiB 0.03%    96μB/row  27%
├─ dropoff_ct2010           45% resident     0.17GiB 0.53%   0.28B/row
│  ├─ LZ4(U32) ToI64(U32)                    0.17GiB  100%   0.38B/row  73%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ dropoff_ctlabel          45% resident     0.13GiB 0.39%   0.21B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.13GiB  100%   0.28B/row  73%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ dropoff_datetime         45% resident      2.3GiB  7.2%    3.8B/row
│  ├─ ToI64(U32)                              1.8GiB   76%    4.0B/row  72%
│  ├─ LZ4(U32) ToI64(U32)                    0.53GiB   23%    3.2B/row  27%
│  ├─ LZ4(I64)                                 20MiB 0.85%    5.0B/row  0.64%
│  └─ Add(U32)                                8.0MiB 0.34%    4.0B/row  0.32%
├─ dropoff_latitude         45% resident       11MiB 0.03%    18mB/row
│  ├─ LZ4(U8) ToI64(U8)                       5.3MiB   47%    22mB/row  39%
│  ├─ LZ4(U16) Add(U16)                       2.5MiB   22%    33mB/row  12%
│  ├─ LZ4(U8) Add(U8)                         2.4MiB   21%    23mB/row  16%
│  ├─ LZ4(U16) ToI64(U16)                     1.1MiB  9.9%    28mB/row  6.4%
│  └─                                          14KiB 0.12%    80μB/row  27%
├─ dropoff_longitude        45% resident       40MiB 0.12%    63mB/row
│  ├─ LZ4(U8) Add(U8)                          26MiB   66%    80mB/row  53%
│  ├─ LZ4(U16) Add(U16)                        13MiB   33%   0.10B/row  20%
│  └─                                          16KiB 0.04%    96μB/row  27%
├─ dropoff_ntacode          45% resident       84MiB 0.26%   0.13B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         84MiB  100%   0.18B/row  73%
│  └─                                          14KiB 0.02%    80μB/row  27%
├─ dropoff_ntaname          45% resident       86MiB 0.26%   0.14B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         86MiB  100%   0.19B/row  73%
│  └─                                          14KiB 0.02%    80μB/row  27%
├─ dropoff_nyct2010_gid     45% resident     0.13GiB 0.40%   0.21B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.13GiB  100%   0.29B/row  73%
│  └─                                          16KiB 0.01%    96μB/row  27%
├─ dropoff_puma             45% resident       92MiB 0.28%   0.15B/row
│  ├─ LZ4(U16) ToI64(U16)                      91MiB   99%   0.20B/row  73%
│  └─ LZ4(U8) ToI64(U8)                      0.83MiB 0.90%   5.0mB/row  27%
├─ ehail_fee                45% resident      3.1MiB 0.01%   5.0mB/row
│  └─ LZ4(U8) ToI64(U8)                       3.1MiB  100%   5.0mB/row  100%
├─ extra                    45% resident     0.41GiB  1.3%   0.67B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.19GiB   46%   0.87B/row  35%
│  ├─ LZ4(U16) Add(U16)                      0.18GiB   45%   0.52B/row  58%
│  ├─ LZ4(U32) Add(U32)                        20MiB  4.9%    1.1B/row  2.9%
│  ├─ LZ4(U8) ToI64(U8)                        16MiB  3.9%   0.62B/row  4.2%
│  └─ LZ4(U8) Add(U8)                        0.22MiB 0.05%   0.33B/row  0.11%
├─ fare_amount              45% resident      1.1GiB  3.3%    1.8B/row
│  ├─ LZ4(U32) Add(U32)                      0.63GiB   59%    1.8B/row  59%
│  ├─ LZ4(U16) ToI64(U16)                    0.41GiB   39%    1.8B/row  39%
│  ├─ LZ4(U16) Add(U16)                        20MiB  1.8%    1.8B/row  1.8%
│  ├─ LZ4(U32) ToI64(U32)                     3.5MiB 0.32%    1.8B/row  0.32%
│  └─ Add(U16)                               0.16MiB 0.01%    2.1B/row  0.01%
├─ improvement_surcharge    45% resident      6.9MiB 0.02%    11mB/row
│  ├─ LZ4(U8) ToI64(U8)                       4.0MiB   58%    15mB/row  44%
│  ├─ LZ4(U8) Add(U8)                         2.9MiB   41%   8.3mB/row  55%
│  └─ LZ4(U16) Add(U16)                        48KiB 0.68%    12mB/row  0.64%
├─ max_temperature          45% resident     0.45GiB  1.4%   0.73B/row
│  ├─ ToI64(U8)                              0.45GiB  100%    1.0B/row  73%
│  └─ LZ4(U8) ToI64(U8)                       1.1MiB 0.24%   6.7mB/row  27%
├─ min_temperature          45% resident     0.44GiB  1.4%   0.73B/row
│  ├─ ToI64(U8)                              0.41GiB   92%    1.0B/row  67%
│  ├─ LZ4(U8) ToI64(U8)                        22MiB  4.9%   0.12B/row  31%
│  └─ Add(U8)                                  12MiB  2.7%    1.0B/row  1.9%
├─ mta_tax                  45% resident       15MiB 0.04%    24mB/row
│  ├─ LZ4(U16) Add(U16)                       7.6MiB   52%    30mB/row  40%
│  ├─ LZ4(U8) ToI64(U8)                       3.7MiB   25%    15mB/row  39%
│  ├─ LZ4(U8) Add(U8)                         3.4MiB   23%    26mB/row  21%
│  └─ LZ4(U16) ToI64(U16)                      40KiB 0.27%    20mB/row  0.32%
├─ passenger_count          45% resident     0.30GiB 0.95%   0.50B/row
│  └─ LZ4(U8) ToI64(U8)                      0.30GiB  100%   0.50B/row  100%
├─ payment_type             45% resident     0.31GiB 0.96%   0.50B/row
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.31GiB  100%   0.50B/row  100%
├─ pickup                   45% resident      5.9GiB   19%    9.8B/row
│  ├─ LZ4(U8) StrHexUnpack                    5.9GiB  100%     13B/row  73%
│  ├─ LZ4(U16) Data(1) Data(2) Dict(U16)      8.2MiB 0.13%    8.2B/row  0.16%
│  └─                                          12KiB 0.00%    72μB/row  27%
├─ pickup_borocode          45% resident     0.11GiB 0.35%   0.18B/row
│  ├─ LZ4(U8) ToI64(U8)                      0.11GiB  100%   0.25B/row  73%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ pickup_boroct2010        45% resident      1.1GiB  3.5%    1.8B/row
│  ├─ LZ4(U32) ToI64(U32)                     1.1GiB  100%    2.5B/row  73%
│  └─                                          16KiB 0.00%    96μB/row  27%
├─ pickup_boroname          45% resident     0.11GiB 0.35%   0.18B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.11GiB  100%   0.25B/row  73%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ pickup_cdeligibil        45% resident     0.17GiB 0.52%   0.27B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.17GiB  100%   0.37B/row  73%
│  └─                                          16KiB 0.01%    96μB/row  27%
├─ pickup_ct2010            45% resident      1.1GiB  3.5%    1.8B/row
│  ├─ LZ4(U32) ToI64(U32)                     1.1GiB  100%    2.5B/row  73%
│  └─                                          14KiB 0.00%    80μB/row  27%
├─ pickup_ctlabel           45% resident     0.89GiB  2.8%    1.5B/row
│  ├─ ToI64(U16)                             0.89GiB  100%    2.0B/row  73%
│  ├─ LZ4(U16) ToI64(U16)                    0.19MiB 0.02%   0.19B/row  0.16%
│  └─                                          14KiB 0.00%    80μB/row  27%
├─ pickup_datetime          45% resident      2.1GiB  6.6%    3.4B/row
│  ├─ ToI64(U32)                              1.8GiB   85%    4.0B/row  73%
│  ├─ LZ4(U32) ToI64(U32)                    0.32GiB   15%    1.9B/row  27%
│  ├─ LZ4(U16) Add(U16)                      0.36MiB 0.02%    1.2B/row  0.05%
│  └─ Add(U16)                               0.16MiB 0.01%    2.1B/row  0.01%
├─ pickup_latitude          45% resident       16MiB 0.05%    25mB/row
│  ├─ LZ4(U8) ToI64(U8)                       9.4MiB   60%    34mB/row  45%
│  ├─ LZ4(U8) Add(U8)                         3.5MiB   22%    33mB/row  17%
│  ├─ LZ4(U16) Add(U16)                       2.6MiB   16%    40mB/row  10%
│  ├─ LZ4(U16) ToI64(U16)                    0.25MiB  1.6%    35mB/row  1.1%
│  └─                                          14KiB 0.08%    80μB/row  27%
├─ pickup_longitude         45% resident     0.17GiB 0.55%   0.29B/row
│  ├─ LZ4(U8) Add(U8)                        0.13GiB   74%   0.37B/row  57%
│  ├─ LZ4(U16) Add(U16)                        47MiB   26%   0.46B/row  16%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ pickup_ntacode           45% resident     0.45GiB  1.4%   0.73B/row
│  ├─ Data(1) Data(2) Dict(U8)               0.45GiB  100%    1.0B/row  73%
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         99KiB 0.02%    97mB/row  0.16%
│  └─                                          14KiB 0.00%    80μB/row  27%
├─ pickup_ntaname           45% resident     0.45GiB  1.4%   0.74B/row
│  ├─ Data(1) Data(2) Dict(U8)               0.45GiB  100%    1.0B/row  73%
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.94MiB 0.20%   5.6mB/row  27%
├─ pickup_nyct2010_gid      45% resident     0.89GiB  2.8%    1.5B/row
│  ├─ ToI64(U16)                             0.89GiB  100%    2.0B/row  73%
│  ├─ LZ4(U16) ToI64(U16)                    0.19MiB 0.02%   0.19B/row  0.16%
│  └─                                          16KiB 0.00%    96μB/row  27%
├─ pickup_puma              45% resident     0.45GiB  1.4%   0.73B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.45GiB  100%   1.00B/row  73%
│  └─ LZ4(U8) ToI64(U8)                      0.83MiB 0.18%   5.0mB/row  27%
├─ precipitation            45% resident     0.36GiB  1.1%   0.58B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.35GiB   98%   0.66B/row  87%
│  └─ LZ4(U8) ToI64(U8)                       6.1MiB  1.7%    76mB/row  13%
├─ rate_code_id             45% resident       63MiB 0.19%   0.10B/row
│  └─ LZ4(U8) ToI64(U8)                        63MiB  100%   0.10B/row  100%
├─ snow_depth               45% resident       94MiB 0.29%   0.15B/row
│  ├─ LZ4(U16) ToI64(U16)                      92MiB   98%   0.56B/row  27%
│  └─ LZ4(U8) ToI64(U8)                       2.3MiB  2.4%   5.0mB/row  73%
├─ snowfall                 45% resident       71MiB 0.22%   0.11B/row
│  ├─ LZ4(U16) ToI64(U16)                      66MiB   92%   0.42B/row  25%
│  └─ LZ4(U8) ToI64(U8)                       5.4MiB  7.6%    12mB/row  75%
├─ store_and_fwd_flag       45% resident     0.16GiB 0.51%   0.27B/row
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.16GiB  100%   0.27B/row  100%
├─ tip_amount               45% resident     0.88GiB  2.7%    1.4B/row
│  ├─ LZ4(U16) Add(U16)                      0.46GiB   52%    1.5B/row  49%
│  ├─ LZ4(U16) ToI64(U16)                    0.30GiB   34%    1.3B/row  39%
│  └─ LZ4(U32) Add(U32)                      0.12GiB   14%    1.7B/row  11%
├─ tolls_amount             45% resident     0.14GiB 0.45%   0.23B/row
│  ├─ LZ4(U32) Add(U32)                        59MiB   40%   0.30B/row  31%
│  ├─ LZ4(U16) ToI64(U16)                      47MiB   32%   0.19B/row  41%
│  ├─ LZ4(U16) Add(U16)                        38MiB   26%   0.22B/row  27%
│  └─ LZ4(U32) ToI64(U32)                     2.2MiB  1.5%   0.28B/row  1.3%
├─ total_amount             45% resident      1.4GiB  4.4%    2.3B/row
│  ├─ LZ4(U32) Add(U32)                      0.92GiB   65%    2.5B/row  60%
│  ├─ ToI64(U16)                             0.47GiB   33%    2.0B/row  38%
│  ├─ LZ4(U32) ToI64(U32)                      13MiB 0.88%    2.5B/row  0.81%
│  └─ Add(U16)                                6.3MiB 0.44%    2.0B/row  0.50%
├─ trip_distance            45% resident      1.6GiB  5.0%    2.7B/row
│  ├─ LZ4(U32) ToI64(U32)                     1.3GiB   80%    2.5B/row  84%
│  ├─ LZ4(I64)                               0.32GiB   20%    3.5B/row  15%
│  ├─ ToI64(U16)                              2.7MiB 0.16%    2.1B/row  0.20%
│  └─ LZ4(U32) Add(U32)                       2.5MiB 0.15%    2.5B/row  0.16%
├─ trip_id                  45% resident       26MiB 0.08%    41mB/row
│  ├─ LZ4(U32) Add(U32) Delta(I64)             26MiB  100%    41mB/row  100%
│  └─ LZ4(U32) Delta(U32)                     1.6KiB 0.01%    21mB/row  0.01%
├─ trip_type                45% resident       51KiB 0.00%    81μB/row
│  └─                                          51KiB  100%    81μB/row  100%
└─ vendor_id                45% resident     0.30GiB 0.94%   0.50B/row
   └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.30GiB  100%   0.50B/row  100%

:memtree(4)

trips  32.0GiB  1464785771 rows
├─ average_wind_speed       45% resident     0.61GiB  1.9%   0.99B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.61GiB  100%   0.99B/row  100%
│  └─ LZ4(U8) Add(U8)                         2.0KiB 0.00%   6.6mB/row  0.05%
├─ cab_type                 45% resident      3.2MiB 0.01%   5.1mB/row
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)        3.2MiB  100%   5.1mB/row  100%
│     ├─ .0                                   3.0MiB
│     ├─ .1                                   5.1KiB
│     └─ .2                                   5.1KiB
├─ dropoff                  45% resident      5.9GiB   19%    9.8B/row
│  ├─ LZ4(U8) StrHexUnpack                    5.9GiB  100%     13B/row  73%
│  ├─ LZ4(U16) Data(1) Data(2) Dict(U16)      8.2MiB 0.13%    8.2B/row  0.16%
│  │  ├─ .0                                  0.19MiB
│  │  ├─ .1                                   1.0MiB
│  │  └─ .2                                   7.0MiB
│  └─                                          12KiB 0.00%    72μB/row  27%
├─ dropoff_borocode         45% resident       40MiB 0.12%    65mB/row
│  ├─ LZ4(U8) ToI64(U8)                        40MiB  100%    88mB/row  73%
│  └─                                          14KiB 0.03%    80μB/row  27%
├─ dropoff_boroct2010       45% resident     0.18GiB 0.55%   0.29B/row
│  ├─ LZ4(U32) ToI64(U32)                    0.18GiB  100%   0.39B/row  73%
│  └─                                          16KiB 0.01%    96μB/row  27%
├─ dropoff_boroname         45% resident       41MiB 0.12%    65mB/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         41MiB  100%    89mB/row  73%
│  │  ├─ .0                                    41MiB
│  │  ├─ .1                                    30KiB
│  │  └─ .2                                    30KiB
│  └─                                          14KiB 0.03%    80μB/row  27%
├─ dropoff_cdeligibil       45% resident       51MiB 0.15%    81mB/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         51MiB  100%   0.11B/row  73%
│  │  ├─ .0                                    50MiB
│  │  ├─ .1                                    15KiB
│  │  └─ .2                                   3.8KiB
│  └─                                          16KiB 0.03%    96μB/row  27%
├─ dropoff_ct2010           45% resident     0.17GiB 0.53%   0.28B/row
│  ├─ LZ4(U32) ToI64(U32)                    0.17GiB  100%   0.38B/row  73%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ dropoff_ctlabel          45% resident     0.13GiB 0.39%   0.21B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.13GiB  100%   0.28B/row  73%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ dropoff_datetime         45% resident      2.3GiB  7.2%    3.8B/row
│  ├─ ToI64(U32)                              1.8GiB   76%    4.0B/row  72%
│  ├─ LZ4(U32) ToI64(U32)                    0.53GiB   23%    3.2B/row  27%
│  ├─ LZ4(I64)                                 20MiB 0.85%    5.0B/row  0.64%
│  └─ Add(U32)                                8.0MiB 0.34%    4.0B/row  0.32%
├─ dropoff_latitude         45% resident       11MiB 0.03%    18mB/row
│  ├─ LZ4(U8) ToI64(U8)                       5.3MiB   47%    22mB/row  39%
│  ├─ LZ4(U16) Add(U16)                       2.5MiB   22%    33mB/row  12%
│  ├─ LZ4(U8) Add(U8)                         2.4MiB   21%    23mB/row  16%
│  ├─ LZ4(U16) ToI64(U16)                     1.1MiB  9.9%    28mB/row  6.4%
│  └─                                          14KiB 0.12%    80μB/row  27%
├─ dropoff_longitude        45% resident       40MiB 0.12%    63mB/row
│  ├─ LZ4(U8) Add(U8)                          26MiB   66%    80mB/row  53%
│  ├─ LZ4(U16) Add(U16)                        13MiB   33%   0.10B/row  20%
│  └─                                          16KiB 0.04%    96μB/row  27%
├─ dropoff_ntacode          45% resident       84MiB 0.26%   0.13B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         84MiB  100%   0.18B/row  73%
│  │  ├─ .0                                    82MiB
│  │  ├─ .1                                  0.95MiB
│  │  └─ .2                                  0.47MiB
│  └─                                          14KiB 0.02%    80μB/row  27%
├─ dropoff_ntaname          45% resident       86MiB 0.26%   0.14B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         86MiB  100%   0.19B/row  73%
│  │  ├─ .0                                    82MiB
│  │  ├─ .1                                  0.95MiB
│  │  └─ .2                                   2.3MiB
│  └─                                          14KiB 0.02%    80μB/row  27%
├─ dropoff_nyct2010_gid     45% resident     0.13GiB 0.40%   0.21B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.13GiB  100%   0.29B/row  73%
│  └─                                          16KiB 0.01%    96μB/row  27%
├─ dropoff_puma             45% resident       92MiB 0.28%   0.15B/row
│  ├─ LZ4(U16) ToI64(U16)                      91MiB   99%   0.20B/row  73%
│  └─ LZ4(U8) ToI64(U8)                      0.83MiB 0.90%   5.0mB/row  27%
├─ ehail_fee                45% resident      3.1MiB 0.01%   5.0mB/row
│  └─ LZ4(U8) ToI64(U8)                       3.1MiB  100%   5.0mB/row  100%
├─ extra                    45% resident     0.41GiB  1.3%   0.67B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.19GiB   46%   0.87B/row  35%
│  ├─ LZ4(U16) Add(U16)                      0.18GiB   45%   0.52B/row  58%
│  ├─ LZ4(U32) Add(U32)                        20MiB  4.9%    1.1B/row  2.9%
│  ├─ LZ4(U8) ToI64(U8)                        16MiB  3.9%   0.62B/row  4.2%
│  └─ LZ4(U8) Add(U8)                        0.22MiB 0.05%   0.33B/row  0.11%
├─ fare_amount              45% resident      1.1GiB  3.3%    1.8B/row
│  ├─ LZ4(U32) Add(U32)                      0.63GiB   59%    1.8B/row  59%
│  ├─ LZ4(U16) ToI64(U16)                    0.41GiB   39%    1.8B/row  39%
│  ├─ LZ4(U16) Add(U16)                        20MiB  1.8%    1.8B/row  1.8%
│  ├─ LZ4(U32) ToI64(U32)                     3.5MiB 0.32%    1.8B/row  0.32%
│  └─ Add(U16)                               0.16MiB 0.01%    2.1B/row  0.01%
├─ improvement_surcharge    45% resident      6.9MiB 0.02%    11mB/row
│  ├─ LZ4(U8) ToI64(U8)                       4.0MiB   58%    15mB/row  44%
│  ├─ LZ4(U8) Add(U8)                         2.9MiB   41%   8.3mB/row  55%
│  └─ LZ4(U16) Add(U16)                        48KiB 0.68%    12mB/row  0.64%
├─ max_temperature          45% resident     0.45GiB  1.4%   0.73B/row
│  ├─ ToI64(U8)                              0.45GiB  100%    1.0B/row  73%
│  └─ LZ4(U8) ToI64(U8)                       1.1MiB 0.24%   6.7mB/row  27%
├─ min_temperature          45% resident     0.44GiB  1.4%   0.73B/row
│  ├─ ToI64(U8)                              0.41GiB   92%    1.0B/row  67%
│  ├─ LZ4(U8) ToI64(U8)                        22MiB  4.9%   0.12B/row  31%
│  └─ Add(U8)                                  12MiB  2.7%    1.0B/row  1.9%
├─ mta_tax                  45% resident       15MiB 0.04%    24mB/row
│  ├─ LZ4(U16) Add(U16)                       7.6MiB   52%    30mB/row  40%
│  ├─ LZ4(U8) ToI64(U8)                       3.7MiB   25%    15mB/row  39%
│  ├─ LZ4(U8) Add(U8)                         3.4MiB   23%    26mB/row  21%
│  └─ LZ4(U16) ToI64(U16)                      40KiB 0.27%    20mB/row  0.32%
├─ passenger_count          45% resident     0.30GiB 0.95%   0.50B/row
│  └─ LZ4(U8) ToI64(U8)                      0.30GiB  100%   0.50B/row  100%
├─ payment_type             45% resident     0.31GiB 0.96%   0.50B/row
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.31GiB  100%   0.50B/row  100%
│     ├─ .0                                  0.31GiB
│     ├─ .1                                    30KiB
│     └─ .2                                    11KiB
├─ pickup                   45% resident      5.9GiB   19%    9.8B/row
│  ├─ LZ4(U8) StrHexUnpack                    5.9GiB  100%     13B/row  73%
│  ├─ LZ4(U16) Data(1) Data(2) Dict(U16)      8.2MiB 0.13%    8.2B/row  0.16%
│  │  ├─ .0                                  0.19MiB
│  │  ├─ .1                                   1.0MiB
│  │  └─ .2                                   7.0MiB
│  └─                                          12KiB 0.00%    72μB/row  27%
├─ pickup_borocode          45% resident     0.11GiB 0.35%   0.18B/row
│  ├─ LZ4(U8) ToI64(U8)                      0.11GiB  100%   0.25B/row  73%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ pickup_boroct2010        45% resident      1.1GiB  3.5%    1.8B/row
│  ├─ LZ4(U32) ToI64(U32)                     1.1GiB  100%    2.5B/row  73%
│  └─                                          16KiB 0.00%    96μB/row  27%
├─ pickup_boroname          45% resident     0.11GiB 0.35%   0.18B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.11GiB  100%   0.25B/row  73%
│  │  ├─ .0                                  0.11GiB
│  │  ├─ .1                                    30KiB
│  │  └─ .2                                    30KiB
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ pickup_cdeligibil        45% resident     0.17GiB 0.52%   0.27B/row
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.17GiB  100%   0.37B/row  73%
│  │  ├─ .0                                  0.17GiB
│  │  ├─ .1                                    15KiB
│  │  └─ .2                                   3.8KiB
│  └─                                          16KiB 0.01%    96μB/row  27%
├─ pickup_ct2010            45% resident      1.1GiB  3.5%    1.8B/row
│  ├─ LZ4(U32) ToI64(U32)                     1.1GiB  100%    2.5B/row  73%
│  └─                                          14KiB 0.00%    80μB/row  27%
├─ pickup_ctlabel           45% resident     0.89GiB  2.8%    1.5B/row
│  ├─ ToI64(U16)                             0.89GiB  100%    2.0B/row  73%
│  ├─ LZ4(U16) ToI64(U16)                    0.19MiB 0.02%   0.19B/row  0.16%
│  └─                                          14KiB 0.00%    80μB/row  27%
├─ pickup_datetime          45% resident      2.1GiB  6.6%    3.4B/row
│  ├─ ToI64(U32)                              1.8GiB   85%    4.0B/row  73%
│  ├─ LZ4(U32) ToI64(U32)                    0.32GiB   15%    1.9B/row  27%
│  ├─ LZ4(U16) Add(U16)                      0.36MiB 0.02%    1.2B/row  0.05%
│  └─ Add(U16)                               0.16MiB 0.01%    2.1B/row  0.01%
├─ pickup_latitude          45% resident       16MiB 0.05%    25mB/row
│  ├─ LZ4(U8) ToI64(U8)                       9.4MiB   60%    34mB/row  45%
│  ├─ LZ4(U8) Add(U8)                         3.5MiB   22%    33mB/row  17%
│  ├─ LZ4(U16) Add(U16)                       2.6MiB   16%    40mB/row  10%
│  ├─ LZ4(U16) ToI64(U16)                    0.25MiB  1.6%    35mB/row  1.1%
│  └─                                          14KiB 0.08%    80μB/row  27%
├─ pickup_longitude         45% resident     0.17GiB 0.55%   0.29B/row
│  ├─ LZ4(U8) Add(U8)                        0.13GiB   74%   0.37B/row  57%
│  ├─ LZ4(U16) Add(U16)                        47MiB   26%   0.46B/row  16%
│  └─                                          14KiB 0.01%    80μB/row  27%
├─ pickup_ntacode           45% resident     0.45GiB  1.4%   0.73B/row
│  ├─ Data(1) Data(2) Dict(U8)               0.45GiB  100%    1.0B/row  73%
│  │  ├─ .0                                  0.45GiB
│  │  ├─ .1                                  0.93MiB
│  │  └─ .2                                  0.46MiB
│  ├─ LZ4(U8) Data(1) Data(2) Dict(U8)         99KiB 0.02%    97mB/row  0.16%
│  │  ├─ .0                                    96KiB
│  │  ├─ .1                                   2.0KiB
│  │  └─ .2                                   1.0KiB
│  └─                                          14KiB 0.00%    80μB/row  27%
├─ pickup_ntaname           45% resident     0.45GiB  1.4%   0.74B/row
│  ├─ Data(1) Data(2) Dict(U8)               0.45GiB  100%    1.0B/row  73%
│  │  ├─ .0                                  0.45GiB
│  │  ├─ .1                                  0.93MiB
│  │  └─ .2                                   2.0MiB
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.94MiB 0.20%   5.6mB/row  27%
│     ├─ .0                                  0.91MiB
│     ├─ .1                                   3.4KiB
│     └─ .2                                   4.0KiB
├─ pickup_nyct2010_gid      45% resident     0.89GiB  2.8%    1.5B/row
│  ├─ ToI64(U16)                             0.89GiB  100%    2.0B/row  73%
│  ├─ LZ4(U16) ToI64(U16)                    0.19MiB 0.02%   0.19B/row  0.16%
│  └─                                          16KiB 0.00%    96μB/row  27%
├─ pickup_puma              45% resident     0.45GiB  1.4%   0.73B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.45GiB  100%   1.00B/row  73%
│  └─ LZ4(U8) ToI64(U8)                      0.83MiB 0.18%   5.0mB/row  27%
├─ precipitation            45% resident     0.36GiB  1.1%   0.58B/row
│  ├─ LZ4(U16) ToI64(U16)                    0.35GiB   98%   0.66B/row  87%
│  └─ LZ4(U8) ToI64(U8)                       6.1MiB  1.7%    76mB/row  13%
├─ rate_code_id             45% resident       63MiB 0.19%   0.10B/row
│  └─ LZ4(U8) ToI64(U8)                        63MiB  100%   0.10B/row  100%
├─ snow_depth               45% resident       94MiB 0.29%   0.15B/row
│  ├─ LZ4(U16) ToI64(U16)                      92MiB   98%   0.56B/row  27%
│  └─ LZ4(U8) ToI64(U8)                       2.3MiB  2.4%   5.0mB/row  73%
├─ snowfall                 45% resident       71MiB 0.22%   0.11B/row
│  ├─ LZ4(U16) ToI64(U16)                      66MiB   92%   0.42B/row  25%
│  └─ LZ4(U8) ToI64(U8)                       5.4MiB  7.6%    12mB/row  75%
├─ store_and_fwd_flag       45% resident     0.16GiB 0.51%   0.27B/row
│  └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.16GiB  100%   0.27B/row  100%
│     ├─ .0                                  0.16GiB
│     ├─ .1                                    14KiB
│     └─ .2                                   5.1KiB
├─ tip_amount               45% resident     0.88GiB  2.7%    1.4B/row
│  ├─ LZ4(U16) Add(U16)                      0.46GiB   52%    1.5B/row  49%
│  ├─ LZ4(U16) ToI64(U16)                    0.30GiB   34%    1.3B/row  39%
│  └─ LZ4(U32) Add(U32)                      0.12GiB   14%    1.7B/row  11%
├─ tolls_amount             45% resident     0.14GiB 0.45%   0.23B/row
│  ├─ LZ4(U32) Add(U32)                        59MiB   40%   0.30B/row  31%
│  ├─ LZ4(U16) ToI64(U16)                      47MiB   32%   0.19B/row  41%
│  ├─ LZ4(U16) Add(U16)                        38MiB   26%   0.22B/row  27%
│  └─ LZ4(U32) ToI64(U32)                     2.2MiB  1.5%   0.28B/row  1.3%
├─ total_amount             45% resident      1.4GiB  4.4%    2.3B/row
│  ├─ LZ4(U32) Add(U32)                      0.92GiB   65%    2.5B/row  60%
│  ├─ ToI64(U16)                             0.47GiB   33%    2.0B/row  38%
│  ├─ LZ4(U32) ToI64(U32)                      13MiB 0.88%    2.5B/row  0.81%
│  └─ Add(U16)                                6.3MiB 0.44%    2.0B/row  0.50%
├─ trip_distance            45% resident      1.6GiB  5.0%    2.7B/row
│  ├─ LZ4(U32) ToI64(U32)                     1.3GiB   80%    2.5B/row  84%
│  ├─ LZ4(I64)                               0.32GiB   20%    3.5B/row  15%
│  ├─ ToI64(U16)                              2.7MiB 0.16%    2.1B/row  0.20%
│  └─ LZ4(U32) Add(U32)                       2.5MiB 0.15%    2.5B/row  0.16%
├─ trip_id                  45% resident       26MiB 0.08%    41mB/row
│  ├─ LZ4(U32) Add(U32) Delta(I64)             26MiB  100%    41mB/row  100%
│  └─ LZ4(U32) Delta(U32)                     1.6KiB 0.01%    21mB/row  0.01%
├─ trip_type                45% resident       51KiB 0.00%    81μB/row
│  └─                                          51KiB  100%    81μB/row  100%
└─ vendor_id                45% resident     0.30GiB 0.94%   0.50B/row
   └─ LZ4(U8) Data(1) Data(2) Dict(U8)       0.30GiB  100%   0.50B/row  100%
      ├─ .0                                  0.30GiB
      ├─ .1                                    10KiB
      └─ .2                                   5.1KiB

What does mB stand for you ask? Why, millibyte of course. Encountering this exotic unit is a surprise, to be sure, but a welcome one.

LZ4 decompression speed

LocustDB is able to keep memory-cached data lz4 compressed as well, which significantly reduces memory usage at the cost of query times. The following table show results for query runtimes in LocustDB for memory-cached data with and without lz4 compression:

Query1234567
Runtime (no lz4)/s0.250.550.543.11.60.224.2
Runtime (lz4)/s0.310.880.723.82.41.34.8

LZ4 overhead is fairly modest for most queries. Queries 5 and 6 select a small number of rows using filter/top n which can be done with a simple index lookup when the referenced columns are fixed width, but requires decoding the entire column when they are lz4 encoded.

Startup time

Loading all 80GiB of data from disk in bulk takes only 164s on SSD (523MB/s) and 625s on HDD (137MB/s). This is very unfortunate because it deprives me of the opportunity to write a paper about fancy techniques for quickly restarting your database when reading from disk is too slow.

Limitations and future work

Storage backend

The storage backend already works well enough for bulk load, but to support continuous ingestion (at reasonable latencies) we would need to store much smaller partitions when data first comes in, have an additional mechanism to periodically merge small partitions into larger ones, time partition data to allow it to be expired, and maybe find ways to reduce overhead from RocksdDB compactions which is quite significant. The fact that obtaining full sequential read speed on HDD has so far eluded me is very trying, but my resolution not to write a custom persistent database remains strong (how hard can it be though, really?). Some additional tuning and other improvements could probably close the gap, and either way RocksDB already provides a strong baseline.

Smarter partitioning

At least on SSD, we are already close to maxing out sequential read speed. So are there still any avenues for improvement (other than better compression)? The answer is yes, there are many situations where we can avoid reading a partition entirely. For example, you might have a query that filters down to the last hour of data in which case scanning a full 30 days is quite wasteful. This can be avoided by partitioning the data by timestamp, and storing the range of timestamps as part of the column metadata. Similarly, we might use bloom filters to quickly check whether a specific string we are filtering on is present in a column.

Compression

The compression ratios in LocustDB are starting to approach the limits of what can be achieved with current methods. I see three remaining avenues for further improvements.

1. Entropy encoding

LZ4 only eliminates redundancy at the byte level, and additional compression can be achieved by applying an entropy encoder that uses less bits for common symbols (Huffman coding or ANS seem like good candidates). This starts to add significant decoding overhead, but might still be strictly beneficial for data stored on disk. In fact, ClickHouse already has (experimental) support for compressing data stored on disk with Zstandard.

2. LZ4 high compression

LZ4 has a configurable parameter that governs the trade off between CPU time spent on compression and compression ratio. LocustDB currently uses the default, but using the high compression option might be reasonable if you’re not bottlenecked by ingestion (decoding speed is unaffected).

3. Domain specific compression

Since LocustDB has very flexible support for compression, it would be quite reasonable to add additional compression schemes for a variety of data distributions. These would only apply to specific datasets, but yield spectacular gains when they do. Examples include the timeseries compression scheme used by Gorilla, and the way stack traces are encoded in Scuba.

What’s next

Compression and disk storage were the two remaining features I wanted to prove out. LocustDB is still missing any support for running on (large) clusters, but getting this to production quality for general use cases would likely require a small team working on LocustDB full-time for a few months and actually operating a deployment. This exceeds the resources available to me. So my next goal will be to add the polish and missing features that will turn LocustDB into a useful command line tool, a sort of Swiss Army knife for ad-hoc data analysis. Think initial configuration overhead of awk or grep, speed of optimized Rust, expressive power of SQL, and the ability to scale to multi-terabyte datasets on a single machine. If you like the sound of that and are interested in speeding development along, or want to cheer from the sidelines, then join us on github.


Methodology

Summary of pertinent facts:

  • In between cold queries, I restarted the db processes and ran sudo sh -c "sync; sudo echo 1 > /proc/sys/vm/drop_caches" to clear OS page caches.
  • Results for cold query times are the medians of three runs.
  • The results for lz4/no lz4 comparison were run with sudo cpupower frequency-set -g performance.
  • Results for lz4/no lz4 comparison are the medians of the last three of four runs.
  • I used a LocustDB partition size of 65536 for HDD results and 1048576 for SSD and in-memory results (the larger partition size results in modest speedups for memory-cached queries, and significant speedups for reading from SSD using point lookups).
  • After ingestion, I restarted LocustDB and waited a few minutes for RocksDB background compactions to complete before collecting results.
  • I used ClickHouse version 1.1.54388 in all benchmarks.

200K lines worth of transcripts for all the terminal sessions that produced the results, including printouts of ClickHouse config files, commit hashes and various system properties:

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.