Exploring Column-Oriented Data in Rust with frunk HLists

13 Jan 2019
Paul Kernfeld dot com

What is column-oriented storage?

Row-oriented storage and column-oriented storage are two major ways of laying out data in memory. In Rust, there is a simple way to think of this: row-oriented storage is like an array of structs, whereas column-oriented storage is like a struct of arrays. It’s easy to use row-oriented storage in Rust, so this post is going to explore column-oriented storage.

To make this post a bit more concrete, I’m going to work with this example struct:

struct Planet {
    mass_kg: f64,
    radius_m: f64,
    habitable: bool,
}

When I create a Vec<Planet>, each Planet will be stored contiguously in memory; this is row-oriented storage. For most purposes, this is nice. For example, appending another Planet into my Vec usually only requires writing to a single region of memory.

In column-oriented storage, all values for each column will be stored contiguously. For Planet, this might look like:

struct PlanetFrame {
    mass_kg: Vec<f64>,
    radius_m: Vec<f64>,
    habitable: Vec<bool>,
}

This can be useful for a few reasons:

Column-oriented storage is most prevalent in databases like Redshift and BigQuery and “big data” file formats like ORC and Parquet, but it can also be useful for in-memory data. Daniel Abadi has a great piece on this: Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?

The problem that I’m investigating here is: how can I take a struct like Planet and automatically transform it into a column-oriented version like PlanetFrame?

What is frunk?

The frunk crate is a “generic functional programming toolbelt for Rust.” Its flagship data type is HList, which is short for “heterogenous list.” This is a kind of list where each element can have a different type, all of which are known at compile time. Here’s some example frunk code to give you a feel of how HList works:

// Create a new HList
let h = hlist![1,  "hi"];
assert_eq!(h.len(), 2);

// Destructure the HList
let hlist_pat!(a, b) = h;
assert_eq!(a, 1);
assert_eq!(b, "hi");

One thing to note: all type-checking with HLists happens at compile-time, so the elements inside the list don’t need to be boxed. The author has written a great introduction if you’d like to learn more about HLists in Rust.

The reason that I chose frunk for this problem is because it allows me to implement column-oriented storage without writing any macros.

The problem at hand

For my proof-of-concept column-oriented Vec storage, I chose to satisfy a minimal set of requirements:

To meet these requirements, I built a column-oriented data structure that is an HList of columns. Briefly, I made Row and Frame traits, and implemented them for frunk::HNil and frunk::HCons.

Below, I’ll share some more details on my implementation.

Selecting a column

hlist already has a way to select columns using the hlist_pat macro. It looks like this:

let hlist_pat![_mass_kg, radius_m, ...] = &planets;
assert_eq!(radius_m, &[6.37814e6, 3.3972e6]);

This is great for collections with only a few columns. Any type errors will be caught at compile time, which is cool.

I think this macro could be burdensome for collections with many columns. Perhaps it could be augmented to look something like:

let hlist_pat![SKIP_8, flight_number, ..., departure_date, SKIP_5] = &flights;

Selecting a row

Naively, I would want the method signature for selecting a row to be something like fn row(&self) -> &Planet. Unfortunately, no Planet actually exists in memory, so returning a reference is not straightforward at all. Instead, I pursued a method that creates and returns a new Planet.

Overall, this implementation wasn’t too difficult. One note is that all fields must implement Clone, because we need to be able to copy it from the column-oriented storage into the newly created Planet object.

To make this more efficient, you could make a function like fn write_row(&self, write_into: &mut Planet) to write a row from the frame into an existing Planet. This way, you could get the value of a row without needing to allocate new memory.

Iterating over rows

I wanted to implement iterators using closures. This could theoretically be done by giving Frame a method fn iter(&self) -> impl Iterator<Item=Self::MyRow>. Unfortunately, a trait can’t currently have a method with impl Trait in return position. To solve this, I made HConsIterator and HNilIterator structs. This is a fairly common pattern for transforming objects like Iterator and Future.

My implementation is not too efficient because it does a lot of allocations. Another option would be to use an iterator of references. This is not possible using Rust’s built-in Iterator trait, so you need a third-party lib like streaming_iterator.

Example code

Here’s the final example code that I came up with. You can see the whole project on GitHub.

#[macro_use]
extern crate frunk;
extern crate frunk_core;
use frunk::indices::Here;
use frunk::prelude::*;
use frunk::Generic;
use frunk_column::*;
use std::iter::FromIterator;

fn main() {
    #[derive(Clone, Copy, Debug, Generic, PartialEq)]
    struct Planet {
        mass_kg: f64,
        radius_m: f64,
        habitable: bool,
    }

    let earth = Planet {
        mass_kg: 5.976e+24,
        radius_m: 6.37814e6,
        habitable: true,
    };
    let mars = Planet {
        mass_kg: 6.421e+23,
        radius_m: 3.3972e6,
        habitable: false,
    };

    let mut planets = <Planet as Generic>::Repr::new_frame();
    planets.push(frunk::into_generic(earth));
    planets.push(frunk::into_generic(mars));

    let hlist_pat![_mass_kg, radius_m, ...] = &planets;
    assert_eq!(radius_m, &[6.37814e6, 3.3972e6]);

    assert_eq!(planets.row(1), Some(frunk::into_generic(mars)));

    assert_eq!(
        Vec::from_iter(planets.into_iter().map::<Planet, _>(frunk::from_generic)),
        vec![earth, mars]
    );
}

Reflections

Type ergonomics

To transform my struct into a frunk HList, I used the Generic derived trait from frunk. This was a bit inconvenient because I needed to add a lot of boilerplate in many situations. For example, creating a new frame:

let mut planets = <Planet as Generic>::Repr::new_frame();
planets.push(frunk::into_generic(earth));
planets.push(frunk::into_generic(mars));

Or, collecting the rows into a Vec:

Vec::from_iter(planets.into_iter().map::<Planet, _>(frunk::from_generic))

I think it would be possible to smooth some of this over with convenience methods.

Trait ergonomics

One limitation is that I can’t implement any third-party traits (e.g. IntoIterator) for Frame. This is because the structs for which I’ve implemented Frame are part of the frunk crate, so implementing IntoIterator would be an orphan trait. I worked around this by giving Frame a method with a similar signature: fn into_iter(self) -> Self::Iterator.

Conclusion

I was impressed that frunk let me implement column-oriented storage so quickly and easily. I didn’t have to write a single macro! However, I’d also like to protoype other solutions to solve the problems that I highlighted above.

I would love to hear any suggestions for further directions to explore for columnar data storage in Rust. If you have any, please contact me via the Reddit post, rust-lang.org forum thread, or via email.

Thanks to Ben Striegel, Anton Lazarev, and Brandon Maister for comments.

Appendix: Further Reading