tomland: Bidirectional TOML serialization

Date: January 14, 2019
Author: Dmitrii Kovanikov

tomland is a bidirectional TOML parsing and pretty-printing library. “Bidirectional” here means that the library allows the user to specify in one place how to convert Haskell data types to TOML format and how to parse TOML configurations to custom user types. This approach allows eliminating one layer of bugs, specifically, when you update your data type and only one part of the conversion (for example, parser) but accidentally forget to update the other one (printer). With the bidirectional approach implemented in tomland there’s no chance of you overlooking anything.

Here is just a quick look at how those bidirectional parsers look like in code:

data User = User
    { userName :: Text
    , userAge  :: Int
    }

userCodec :: TomlCodec User
userCodec = User
    <$> Toml.text "name" .= userName
    <*> Toml.int  "age"  .= userAge

You can see a bigger example in the repository README:

The idea is straightforward but the implementation requires some tricks. As a result of this, in order to provide convenient and type-safe interfaces tomland uses the following non-trivial Haskell techniques:

  1. GADTs and existential wrappers for core types and theorem-proving for more compile-time guarantees.
  2. Category theory and Category instance for the custom data type in order to implement tagged partial bidirectional isomorphism.
  3. Monadic profunctors approach to compose those isomorphisms.
  4. Custom implementation of the prefix tree specific to the structure of nested keys in TOML format.
  5. Property-based tests with the hedgehog library that covers typeclass laws and roundtrip properties.
  6. Type-safe EDSL for specifying TOML AST.

Despite the fact that a lot of fancy features are used, the library is still well-tested. The whole code of the library itself comprises 1600 LOC and 1100 LOC of tests. But without the powerful Haskell type system, the size of such tests could be much bigger.

This blog post describes the architecture and details of the implementation of tomland.

Why bidirectional TOML?🔗

TOML is great for static configurations. Its specification is very simple but expressive enough to cover most cases. Also, it’s unambiguous and human-readable which makes it much smoother to read and edit the configuration.

The state of TOML parsing libraries in the Haskell ecosystem can still be improved. There are already several libraries:

But not all of them have nice ways to convert a TOML AST to custom Haskell types. For example, htoml implements this conversion via ToJSON/FromJSON typeclasses from the aeson library and you might not always want to have extra dependencies if you don’t need JSON. Also, none of the libraries have pretty-printing abilities. Usually, you don’t need pretty-printing for TOML but there were at least two use-cases in my experience where I needed pretty-printing:

  1. summoner: Summoner is the tool for scaffolding modern Haskell projects. It can read TOML configuration where you can specify some settings in advance. But for the moment, you need to write this TOML config file manually. Though there are plans to implement a feature that allows you to build your configuration settings interactively via summoner itself and for this we need an ability to convert the Settings data type to TOML configuration.
  2. life-sync: this tool allows you to sync your configs across multiple machines. It stores all tracked files in its own TOML configuration file. In order to keep this file up-to-date, life-sync needs to parse from this configuration file and it needs to convert its Haskell types to TOML.

I’m not going to compare TOML with JSON/YAML/Dhall, as that is not in the scope of this blog post. Instead, let’s dive into the implementation details of the library.

Core architecture of Tomland🔗

Key concepts🔗

Below I’m going to mention the key architecture decisions behind tomland:

  1. A function-based approach for the description of how to convert Haskell data types to/from TOML instead of a typeclasses-based approach. This approach has a lot of pleasant advantages for decoding libraries: no orphan instances and no ambiguity when there are multiple options for conversion. Here tomland goes along with libraries like hedgehog, sv and waargonaut.
  2. Two-phase approach: parsing (from text to intermediate AST) and decoding (from AST to custom user types). And similarly for pretty-printing.
  3. Make invalid states unrepresentable: TOML supports arrays of values. But the TOML specification also forbids us from having heterogeneous arrays. This is captured at compile time by tomland.
  4. Parser combinators for parsing: tomland uses the modern parsing library megaparsec.

Type checking TOML🔗

It was mentioned earlier that tomland uses GADTs to represent AST. Let’s take a closer look at this. When you want to convert unstructured runtime data to a GADT in Haskell you have to use a 2-step approach:

  1. Parse this data into a plain old ADT.
  2. Type check a simple ADT to a GADT.

TOML has several primitive types, but let’s just consider Integer, Text and Array for simplicity. Here is how an ADT for untyped representation of TOML values looks like:

data UValue
    = UInteger Integer
    | UText Text
    | UArray [UValue]

And here is a GADT version of this data type, that guarantees that all the elements of the array have the same type (it uses extensions -XDataKinds and -XGADTs):

data TValue = TInteger | TText | TArray

data Value (t :: TValue) where
    Integer :: Integer   -> Value 'TInteger
    Text    :: Text      -> Value 'TText
    Array   :: [Value t] -> Value 'TArray

Let’s assume that we have already implemented a parser for UValue (instead of assuming you can also look at the real code). Now we need to convert UValue to Value. But it’s not that simple: (1) this operation can fail and (2) at compile-time we don’t know the type of the resulted Value.

To overcome the second challenge, we need to introduce an existential wrapper around Value.

data AnyValue = forall (t :: TValue) . AnyValue (Value t)

NOTE: you can’t return Value t from the function but you can return AnyValue since the type variable t is hidden inside the AnyValue constructor.

Our conversion function can fail only because of a single reason: when elements of UArray field have different constructors. The algorithm to check that all elements of the list have the same type is the following:

  1. Pattern match on the list. If the list is empty then this is a valid array.
  2. If the list is not empty then we assume that the type of the first element is the expected type of the whole array. So we just need to check every element of the list against this first element and ensure that all of them have the same type. But this needs to be done carefully in order to convince GHC that we checked everything properly.

First, let’s implement the sameValue function that checks whether two Values have the same type. Here we’re going to use propositional equality which is implemented in the following way:

data a :~: b where
    Refl :: a :~: a

And here is sameValue:

data TypeMismatchError = TypeMismatchError
  { typeExpected :: TValue
  , typeActual   :: TValue
  }

valueType :: Value t -> TValue
valueType (Integer _) = TInteger
valueType (Text _)    = TText
valueType (Array _)   = TArray

sameValue :: Value a -> Value b -> Either TypeMismatchError (a :~: b)
sameValue Integer{} Integer{} = Right Refl
sameValue Text{}    Text{}    = Right Refl
sameValue Array{}   Array{}   = Right Refl
sameValue l         r         = Left $ TypeMismatchError
                                         { typeExpected = valueType l
                                         , typeActual   = valueType r
                                         }

Now we’re ready to implement a proper simple type checking algorithm!

typeCheck :: UValue -> Either TypeMismatchError AnyValue
typeCheck (UInteger n) = Right $ AnyValue $ Integer n
typeCheck (UText s)    = Right $ AnyValue $ Text s
typeCheck (UArray a)   = case a of
    []   -> Right $ AnyValue $ Array []  -- empty list is valid array
    x:xs -> do
        AnyValue v <- typeCheck x  -- type check first element of array
        AnyValue . Array <$> checkElem v xs  -- check `xs` against `v`
  where
    checkElem :: Value t -> [UValue] -> Either TypeMismatchError [Value t]
    checkElem v []     = Right [v]  -- no values = no problems
    checkElem v (x:xs) = do
        AnyValue vx <- typeCheck x
        Refl <- sameValue v vx
        (v :) <$> checkElem vx xs

In order to convince GHC that the elements of the array have the same type, we pattern match on Refl. After this pattern matching GHC can see the type equality between the given element and the head of the list. That’s why we can put them in the same list.

Prefix tree🔗

We managed to deal only with values. But in TOML, values are assigned to textual keys. Here is an example of this:

server.id    = 42
server.host  = "best-haskell-blog.com"
server.ports = [8080, 8081, 9080, 9081]

Here server.ports is a key and its corresponding value is an array of integers. Keys are represented in the following way in tomland:

newtype Piece = Piece { unPiece :: Text }
newtype Key = Key { unKey :: NonEmpty Piece }

Keys in TOML are dot-separated words. If some keys share a common prefix, they semantically belong to the same object (though you don’t have to follow this rule, it helps readability a lot). TOML also has tables that help to visually separate parts of the static configuration. Here is an example of tables and nested tables:

[server]
    [server.production]
        id    = 42
        host  = "best-haskell-blog.com"
        ports = [8080, 8081]
    [server.staging]
        id    = 777
        host  = "okayish-haskell-blog.com"
        ports = [9080, 9081]

To represent this structure nicely we need some prefix tree. There are several libraries on Hackage that implement very decent prefix trees but unfortunately, none of them fit our use-case. So we decided to implement our own data structure. It has the following shape (using two mutually recursive data structures):

type Prefix = Key

type PrefixMap a = HashMap Piece (PrefixTree a)

data PrefixTree a
    = Leaf Key a
    | Branch Prefix (Maybe a) (PrefixMap a)

When we want to insert a new key into PrefixTree, we need to split the given key and the key inside the current tree node into common and remaining parts and then split the current node properly. The result of the routine that finds key prefixes is captured by the following data type:

data KeysDiff
    = Equal      -- ^ Keys are equal.
    | NoPrefix   -- ^ Keys don't have any common part.
    | FstIsPref  -- ^ The first key is the prefix of the second one.
        !Key     -- ^ Rest of the second key.
    | SndIsPref  -- ^ The second key is the prefix of the first one.
        !Key     -- ^ Rest of the first key.
    | Diff       -- ^ Keys have a common prefix.
        !Key     -- ^ Common prefix.
        !Key     -- ^ Rest of the first key.
        !Key     -- ^ Rest of the second key.

-- | Takes two keys and returns their difference.
keysDiff :: Key -> Key -> KeysDiff

Every TOML configuration stores the following items:

  1. Top-level keys.
  2. Zero or more tables.
  3. Zero or more array of tables.

The structure of TOML is recursive. So every table and array of tables has the same structure. That’s why a TOML AST is represented in the following way:

data TOML = TOML
    { tomlPairs       :: HashMap Key AnyValue
    , tomlTables      :: PrefixMap TOML
    , tomlTableArrays :: HashMap Key (NonEmpty TOML)
    }

That’s all regarding TOML AST! The following sections of this tutorial cover the bidirectional conversion aspect of the library.

Tagged partial bidirectional isomorphism🔗

First, we need to learn how to switch between basic Haskell types and AnyValue in both directions. Converting from Text to AnyValue is pretty simple. Just apply theText constructor of Value and then wrap into AnyValue. However, conversion from AnyValue to Text may fail because there might be a different constructor. It turns out that conversion to AnyValue may also fail. Consider ByteString: in order to convert ByteString to AnyValue you first need to convert it to Text and this operation may fail. Alternatively, you can convert ByteString to the array of integers.

You can see now that we need a way to convert types in both directions. Just like isomorphisms, with the only difference that the conversion in both directions may fail. Thus it’s partial. And we also want to report good error messages to our users. So this is actually tagged partial bidirectional isomorphism. In tomland this concept is captured by the following data type with a much shorter name:

data BiMap e a b = BiMap
    { forward  :: a -> Either e b
    , backward :: b -> Either e a
    }

If we look at types as sets of values then BiMap can be represented by the following graph:

BiMap illustration

Type parameters a and b represent the values between which we’re converting. And e is the type of error.

NOTE: In tomland BiMap is specialized to the following error type:

data TomlBiMapError
    = WrongConstructor Text Text  -- expected vs. actual
    | WrongValue MatchError
    | ArbitraryError Text

type TomlBiMap = BiMap TomlBiMapError

When we want to write a bidirectional mapping between Text and AnyValue we need to provide two functions. If we want to do the same for ByteString and AnyValue we can implement two similar functions But you may notice that the logic for the Text converter is a subset of the logic for the ByteString one.

It turns out that BiMap composes very nicely! And we need to write only BiMap between ByteString and Text and compose it with BiMap between Text and AnyValue. This is possible thanks to the following Category instance for BiMap because BiMap forms category. Category allows us to compose objects of the type Type -> Type -> Type, but for that, you need to implement the composition operator and the identity object which is a neutral element for composition.

instance Category (BiMap e) where
    id :: BiMap e a a
    id = BiMap Right Right

    (.) :: BiMap e b c -> BiMap e a b -> BiMap e a c
    bc . ab = BiMap
        { forward  =  forward ab >=>  forward bc
        , backward = backward bc >=> backward ab
        }

This instance can be described clearly by the following picture:

Category composition illustration

It’s not convenient to use the dot-operator from Control.Category to compose BiMaps because this operator is already specialized to arrow (->) in Prelude. But we still can use the ‘>>>’ and ‘<<<’ operators for composing objects and this is quite nice!

_ByteStringText :: TomlBiMap ByteString Text
_Text           :: TomlBiMap Text       AnyValue

_ByteString     :: TomlBiMap ByteString AnyValue
_ByteString = _ByteStringText >>> _Text

tomland implements a lot of BiMap combinators out-of-the-box for you so you should be able to avoid writing your own mini-tomland inside your application if you want to use it on common data structures.

Codec🔗

As of now, we are converting between primitive types and TOML values. But usually, in Haskell, we work with a lot of custom data types, including both, product and sum types. This section covers how to compose different BiMaps into a single bidirectional converter for a Haskell data type.

The idea for bidirectional converters is not new. It’s based on the approach called monadic profunctors and is described in detail in this blog post:

And there is the codec library which implements this approach for aeson and binary:

tomland uses the same idea but different connecting pieces and expands this approach to support sum types, not only record types.

Monadic profunctors🔗

Here is the core Codec type for bidirectional conversion:

data Codec r w c a = Codec
    { codecRead  :: r a
    , codecWrite :: c -> w a
    }

It looks scary but I will try to explain its meaning.

  • r is usually some Reader monad. It stores an intermediate AST representation inside the context which we’re going to query to fetch pieces of our big custom user type.
  • w is some Writer or State which stores the result of converting a current piece of custom user type to an intermediate AST.
  • a is the type that we’re converting. For example, Integer, Text or Person.
  • c is a fake type variable which is required in order to make this magic work. It should be equal to a but if we make it a directly in the type definition, we won’t be able to write the required instances.

Codec has Functor, Applicative, Monad, Alternative and Profunctor instances.

NOTE: The Profunctor instance is not implemented in tomland because, currently, it requires the addition of the profunctors package to our dependencies; this would make tomland heavier, but it’s not that necessary to have this instance, it’s enough to just have functions from this instance, so we can wait until the profunctors package is migrated to base. At the time of writing, there is an open GHC ticket for this:

The type of codecWrite semantically should be equivalent to a -> w (): we take a value of our type a, change the context, and ignore the result. But if we do so, we’re not able to implement a desirable, convenient interface.

Another way to look at Codec: it is just a special case of Product monad. Here is a quick reminder of what the Product data type is:

data Product f g a = Pair (f a) (g a)

instance (Monad f, Monad g) => Monad (Product f g) where
    Pair m n >>= f = Pair (m >>= fstP . f) (n >>= sndP . f)
      where
        fstP (Pair a _) = a
        sndP (Pair _ b) = b

So Codec is semantically equivalent to the following Product specialization:

type Codec r w c a = Product r (ReaderT c w) a

Product stores two monadic actions and performs operations over them in parallel. So at one step of >>= it changes both monads. But the changes are parallel and independent. This means that you can describe two monadic actions in a single place, but later you can choose to work with only one of them.

It’s an interesting (and not that difficult) exercise to implement the Functor, Applicative, and Monad instances for Codec. So let’s look closer at dimap from profunctor part:

dimap
    :: (Functor r, Functor w)
    => (c -> d)       -- ^ Mapper for consumer
    -> (a -> b)       -- ^ Mapper for producer
    -> Codec r w d a  -- ^ Source 'Codec' object
    -> Codec r w c b  -- ^ Target ‘Codec’ object
dimap f g codec = Codec
  { codecRead  = g <$> codecRead codec
  , codecWrite = fmap g . codecWrite codec . f
  }

In real life we’re going to substitute a in place of c:

type BiCodec r w a = Codec r w a a

So, a simpler version of dimap for BiCodec looks like this:

dimap
    :: (Functor r, Functor w)
    => (b -> a)
    -> (a -> b)
    -> BiCodec r w a
    -> BiCodec r w b
dimap f g codec = Codec
  { codecRead  = g <$> codecRead codec
  , codecWrite = fmap g . codecWrite codec . f
  }

Now the idea behind Profunctor becomes much clearer! If we have a bidirectional converter for a value of type a and we have two pure functions to convert from a to b and vice versa we can have a bidirectional converter for b.

There’s only a single important piece left: a convenient operator to compose different Codecs:

infixl 5 .=
(.=) :: Codec r w field a -> (object -> field) -> Codec r w object a
codec .= getter = codec { codecWrite = codecWrite codec . getter }

An attentive reader may notice that the (.=) operator is lmap from Profunctor typeclass.

TOML codec🔗

We covered only general data types but it’s beneficial for our understanding to see how it is specialized to write bidirectional codecs for TOML:

type Env = ExceptT DecodeException (Reader TOML)
type St = MaybeT (State TOML)
type TomlCodec = BiCodec Env St

Our r variable stores a a TOML AST inside the context and can return an error. So r a in Codec specializes to:

codecRead @tomland :: TOML -> Either DecodeException a

And St stores produced TOML in context. All monadic actions will insert keys inside this AST and later it can be converted to just Text. So it’s equivalent to the following type:

codecWrite @tomland :: a -> TOML -> (Maybe a, TOML)

NOTE: TOML data type from tomland implements Semigroup and Monoid instances, so we could use Writer monad instead of State. But it’s more efficient to use State.

Now we need a function that can lift TomlBiMap to TomlCodec. But this function requires a key. It will lookup TOML AST to find AnyValue under given Key and then it will try to convert AnyValue according to the given BiMap. This function is simply called match in tomland:

match :: TomlBiMap a AnyValue -> Key -> TomlCodec a

Now we can write helper functions to make the interface easier to work with:

integer :: Key -> TomlCodec Integer
integer = match _Integer

text :: Key -> TomlCodec Text
text = match _Text

arrayOf :: TomlBiMap a AnyValue -> Key -> TomlCodec [a]
arrayOf = match . _Array

Note how we managed to decompose the problem of finding a value under a key and converting this value to Haskell type into two separate problems. BiMap takes care of all errors during conversion between types and doesn’t know anything about TOML and keys. And match doesn’t know what value we’re working with (because of parametric polymorphism), all it does is, it just looks up the required key inside the map and handles the two resulting cases: when the key is found and when it is not. For pretty-printing, match inserts the value under the given key.

Tables in TOML can be used to represent codecs for nested data structures. Remember the example with the production and staging TOML servers? Here is how theCodec for this data type looks like:

data Settings = Settings
    { settingsId    :: Int
    , settingsHost  :: Text
    , settingsPorts :: [Int]
    }

settingsCodec :: TomlCodec Settings
settingsCodec = Settings
    <$> Toml.int               "id"    .= settingsId
    <*> Toml.text              "host"  .= settingsHost
    <*> Toml.arrayOf Toml._Int "ports" .= settingsPorts

data ServerConfig = ServerConfig
    { serverConfigProduction :: Settings
    , serverConfigStaging    :: Settings
    }

serverConfigCodec :: TomlCodec ServerConfig
serverConfigCodec = flip Toml.table "server" $ ServerConfig
    <$> Toml.table settingsCodec "production" .= settingsProduction
    <*> Toml.table settingsCodec "staging"    .= settingsStaging

You may notice that most of the time we’re using Applicative instance of Codec, despite the fact that this approach is called monadic profunctors. But, it turns out that applicative functors are more convenient here. Though, of course, you can have a lot of fun with the Monad instance as well. For example, configuration can contain a field with a key name like spec-version and the value of this field determines how this particular configuration should be parsed.

And here are a couple of examples with possible error messages:

ghci> showRes = either print print

ghci> showRes $ decode (arrayOf _Int "a") "a = [2, 'foo']"
tomland decode error:  Parse error during conversion from TOML to custom user type:
  1:9:
  |
1 | a = [2, 'foo']
  |         ^
unexpected '''
expecting ',', ']', or integer

ghci> showRes $ decode (arrayOf _Int "b") "a = [2, 3]"
tomland decode error:  Key b is not found

ghci> showRes $ decode (arrayOf _Int "a") "a = [2, 3]"
[2,3]

ghci> showRes $ decode (arrayOf _Int "a") "a = ['foo', 'bar']"
tomland decode error:  Invalid constructor
  * Expected: TInteger
  * Actual:   Text "foo"

EDSL🔗

As an additional flavour, tomland has the implementation of EDSL for specifying TOML objects. The implementation is very short.

type TDSL = State TOML ()

mkToml :: TDSL -> TOML
mkToml env = execState env mempty

(=:) :: Key -> Value a -> TDSL
(=:) k v = modify $ insertKeyVal k v

table :: Key -> TDSL -> TDSL
table k = modify . insertTable k . mkToml

However, we need a couple of extra instances to make the API more convenient to work with. Due to the fact that our Value is a GADT, we can implement type-safe instances for sum types!

instance (t ~ 'TInteger) => Num (Value t) where
    (Integer a) + (Integer b) = Integer $ a + b
    (Integer a) * (Integer b) = Integer $ a * b
    abs (Integer a) = Integer (abs a)
    signum (Integer a) = Integer (signum a)
    fromInteger = Integer
    negate (Integer a) = Integer (negate a)

instance (t ~ 'TText) => IsString (Value t) where
    fromString = Text . fromString @Text

NOTE: Here we’re using type equality constraint trick in order to provide better type inference.

With these tools we’re able to specify TOML files in a safer way:

serverToml :: TOML
serverToml = table "server" $ do
    table "production" $ do
        "id"    =: 42
        "host"  =: "best-haskell-blog.com"
        "ports" =: Array [8080, 8081]
    table "staging" $ do
        "id"    =: 777
        "host"  =: "okayish-haskell-blog.com"
        "ports" =: Array [9080, 9081]

This is very useful for testing when you want to test conversion but you don’t want to deal with parsing. With EDSL you can create a TOML AST which is guaranteed to be correct and safe by the construction.

Tests🔗

You may notice that tomland was written with the help of a lot of heavy tools. But types don’t substitute tests, so we have A LOT of them. I will shortly describe what we test in tomland to assure you that we care very much about the correctness of this library. tomland is already used in production and in commercial projects so it’s necessary to have reliable tests.

  1. Unit tests for parsing. The TOML specification is relatively small, but still, there’re a lot of things to keep an eye on. We have ~600 LOC of unit tests for parsing only. And we test not only that we parse correct TOML (files) successfully, but also that we parse incorrect TOML files unsuccessfully (with good error-reporting).
  2. Unit tests for PrefixTree on lookup and insert.
  3. Property tests for PrefixTree on insert with lookup.
  4. Property tests for PrefixTree for Semigroup laws.
  5. Property tests for TOML on Semigroup and Monoid laws.
  6. Roundtrip property tests for parsing and printing TOML AST.
  7. Roundtrip property tests for decoding and encoding TOML.
  8. Roundtrip property tests for every BiMap.
  9. Golden unit tests for pretty-printing.

Benchmarks🔗

And we also have benchmarks to compare tomland with other libraries! Not all libraries are being actively maintained, so there were some difficulties in building them on newer GHC versions, therefore we have only compared tomland with htoml, htoml-megaparsec, and toml-parser. Here is the performance table:

Library parse :: Text -> AST transform :: AST -> Haskell
tomland 305.5 μs 1.280 μs
htoml 852.8 μs 33.37 μs
htoml-megaparsec 295.0 μs 33.62 μs
toml-parser 164.6 μs 1.101 μs

You may see that tomland is not the fastest one (though still very fast), but performance hasn’t been optimized so far and:

  1. toml-parser doesn’t support arrays of tables and because of that it’s hardly possible to specify a list of custom data types in TOML with this library.
  2. tomland supports the latest TOML spec while htoml and htoml-megaparsec don’t have support for all types, values and formats.
  3. tomland is the only library that has pretty-printing.
  4. toml-parser doesn’t have ways to convert a TOML AST to custom Haskell types and htoml* libraries use a typeclasses-based approach via aeson library.
  5. tomland is bidirectional :)

Conclusion🔗

I hope that this blog post helps you to better understand some advanced topics and Haskell in general. You can see how very different pieces of Haskell can be combined inside a single library and can help to come up with a better architecture. Probably you can borrow some ideas from tomland for your next library. Or even give tomland a try for static configuration in your application!

Acknowledgment🔗

I want to say special thanks to vrom911, willbasky, jiegillet and ghallak for their massive help with the tomland library! Their contribution is invaluable.