pl-rants

Rants about programming languages

Jan 15, 2019

A Nix parser in OCaml. Part 2: Parser

After I got lexing out of the way, the time had come to build the actual parser. It was the first time I had used Menhir. The manual is extremely well written, and there is also a chapter in the Real World OCaml book, so I had no trouble understanding how to use the tool.

The First Iteration

The main problem was finding a (semi-)formal description of the Nix language. I re-read the chapter on the expression language in the official Nix manual several times, but it lacked some details. There was also a page on the NixOS wiki which described the language more formally, or, rather, in a way that was more suitable for my purpose. However, it too lacked some necessary details. For example, in the pattern expression used in lambda definitions it is possible to supply default values, e.g.

{x ? true, y, z} : if x then y else z

However, it is not clear what exactly can be put there. Can I use a let expression?

{x ? let n = g n; in !(n g), y, z}: ...

Or is it, maybe, restricted to simple arithmetic expressions?

{x ? 1+1, y, z}: ...

Oh Lisp,
  Remembering thee,
  Thy simplicity and clarity
  Is unrivalled.

    — Unknown author, circa early XXI century

To understand all those nuances I had to experiment in the REPL. A lot. Sometimes what seemed logical to me turned out not to work, the language being more restrictive than I anticipated. Other times it was the opposite: it allowed more general expressions where I would have expected an atomic value.

The result was quite a messy grammar specification with randomly sprinkled precedence annotations :). So it wasn't a huge surprise when I got

$ dune build
      menhir nix/parser.{ml,mli}
Warning: 12 states have shift/reduce conflicts.
Warning: one state has reduce/reduce conflicts.
Warning: 153 shift/reduce conflicts were arbitrarily resolved.
Warning: one reduce/reduce conflict was arbitrarily resolved.

After spending at least a couple of days resolving the conflicts and continuously wrangling the REPL, I decided that I had had enough and started from scratch.

The Second Iteration

This time around I started from the operator precedence table. Avoiding associativity declarations altogether, I explicitly encoded all the productions.

expr1:
| e = binary_expr(expr2, "->" {Impl}, expr1)
| e = expr2
    { e }

expr2:
| e = binary_expr(expr2, "||" {Or}, expr3)
| e = expr3
    { e }

expr3:
| e = binary_expr(expr3, "&&" {And}, expr4)
| e = expr4
    { e }

(* ... other productions here ... *)

expr14:
| (* ... *)

where

%inline binary_expr(Lhs, Op, Rhs):
lhs = Lhs; op = Op; rhs = Rhs
    { BinaryOp(op, lhs, rhs) }

is a parametrised production rule which expands into a rule that parses the left-hand side, the operator, and the right-hand side, and constructs a BinaryOp value.

I think the ability to parametrise rules is one of the more powerful features of Menhir, which facilitates writing succinct grammars. There is a subtle difference between %inline and non-%inline rules. Namely, %inline rules are expanded in-place and may reduce the number of states/reductions in some cases.

For example

%inline expr8_ops:
| "+" {Plus}
| "-" {Minus}

expr8:
| e = binary_expr(expr8, expr8_ops, expr9)
| e = expr9
    { e }

expands expr8_ops into

expr8:
| e = binary_expr(expr8, "+" {Plus}, expr9)
| e = binary_expr(expr8, "-" {Minus}, expr9)
| e = expr9
   { e }

which after expanding binary_expr becomes

expr8:
| lhs = expr8; "+"; rhs = expr9
    { BinaryOp(Plus, lhs, rhs) }
| lhs = expr8; "-"; rhs = expr9
    { BinaryOp(Minus, lhs, rhs) }
| e = expr9
    { e }

Menhir comes with a standard library containing a number of pre-defined rules. For example, separated_nonempty_list accepts a separator rule and a rule for an individual element, and produces a list built from the semantic values of the elements:

attr_path:
| cs = separated_nonempty_list(".", attr_path_component)
    { cs (* a list of attr path components *) }

Even more exciting, Menhir allows parameters which are themselves parametrised rules, essentially providing a higher-order DSL. I haven't exploited the feature yet, but I do hope to get back to it at some point.
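Still, a rough, untested sketch of what such a higher-order rule could look like is below; the rule names wrapped_list and parens, and the "(" / ")" token aliases, are made up for illustration and are not part of the actual grammar.

(* 'wrap' is itself a one-parameter parametrised rule; it is applied
   to 'elem' before the list is assembled. *)
wrapped_list(wrap, elem, sep):
| xs = separated_nonempty_list(sep, wrap(elem))
    { xs }

(* A one-parameter rule that could be passed in as 'wrap': *)
%inline parens(X):
| "("; x = X; ")"
    { x }

(* wrapped_list(parens, expr, ",") would then accept "(e1), (e2), ..." *)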

A minor thing, but worth mentioning: the explicit encoding of the operator precedence rules makes operator associativity obvious. For example

expr_1:
| expr_1; op; expr_2
| expr_2

defines a left-associative operator op (expr_1 is on the left of op), while

expr_1:
| expr_2; op; expr_1
| expr_2

makes op associate to the right. Non-associative operators can be defined as

expr_1:
| expr_2; op; expr_2
| expr_2

I think the explicitness of this approach provides better locality and isolation, while the traditional approach, where the operator precedence table is encoded elsewhere, may lead to some tricky, hard-to-debug errors. In my mind it is not dissimilar to lexical versus dynamic scoping. When a language obeys lexical scoping rules, one can fully understand what a function does, and what its behaviour depends on, by looking at its code in isolation. With dynamic scoping, the behaviour of a function may change dramatically depending on the context it is put into.

LR(2) grammar?

Meticulously following the operator precedence[1] table and carefully, explicitly encoding the other rules eliminated all the shift/reduce conflicts, with a single reduce/reduce conflict left. I think it is a legitimate conflict that makes the grammar LR(2), not LR(1). The essence of the conflict is this: we may have an attribute set and a pattern in a lambda definition

let attset = {b1 = 1; b2 = 5;} # ...
let func = {p1, p2}: # ...

In the degenerate case the list of formal parameters p1..pn and the list of bindings b1..bn may be empty, i.e.

let attset = {} ... # tokens: LBRACE RBRACE ...
let func = {}: ... # tokens: LBRACE RBRACE COLON ...

The parser, upon seeing RBRACE, cannot decide whether it should reduce the LBRACE RBRACE sequence to an attr_set or treat it as the beginning of a param_set.
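Schematically, the two clashing rules look something like this (rule names, helper rules, and constructors here are illustrative, not the exact ones from my grammar):

(* Both alternatives can start with LBRACE followed immediately by
   RBRACE; only the token *after* RBRACE (a COLON or not) tells them
   apart, which requires two tokens of lookahead. *)
attr_set:
| LBRACE; bs = list(binding); RBRACE
    { AttrSet bs }

lambda:
| LBRACE; ps = separated_list(COMMA, formal); RBRACE; COLON; body = expr
    { Lambda (ps, body) }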

My first solution was to mandate a non-empty list of formal parameters in param_set, which is somewhat substantiated by the fact that an empty parameter set makes no sense in a lazy language. Take Haskell, for example. Since all computations are suspended, declaring

f = 1

is equivalent to

let f () = 1

in a strict language. In other words, passing an empty parameter set to a pure function yields exactly the same result as if the body of the function were assigned to a value and that value were used in place of the function call. Hence, in a lazy language, it makes no sense to write

f x = z

where expression z does not have x as a free variable.

The creators of the language and the package authors think differently, though: parsing all the files in nixpkgs uncovered quite a number of such cases.

I got curious how the original, Bison-generated Nix parser could handle it. The answer was the %glr-parser directive in the parser specification file. GLR parsers are capable of processing non-LR(1) grammars by "splitting the brain" and maintaining separate parse stacks simultaneously, until following one of the paths yields an error[2].

Menhir does not support the feature, so as a workaround I introduced another token, EMPTY_CURLY, which stands for "{}" and eliminated the conflict. It may still fail to parse some valid expressions if there is a comment inside the curly braces, e.g.

let f = { /* an empty parameter set */ }: 1

but I found that empty attribute sets are much more common, so I made the LBRACE RBRACE sequence unambiguously reduce to attr_set.
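The lexer side of the workaround could look roughly like this; a sketch only, not the actual lexer from Part 1, with the token names matching those used above.

rule token = parse
  (* "{" followed only by whitespace and then "}" collapses into a
     single EMPTY_CURLY token, so the parser never sees a bare
     LBRACE RBRACE pair for an empty set.  A comment between the
     braces breaks the pattern, which is exactly the limitation
     described above. *)
  | '{' [' ' '\t' '\r' '\n']* '}'  { EMPTY_CURLY }
  | '{'                            { LBRACE }
  | '}'                            { RBRACE }
  (* ... the rest of the rules ... *)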

Incompatibilities

In addition to the above problem (i.e. parsing empty curly braces), I got a number of other errors. To be precise, the parser errored on 129 files out of 14,674 (which is less than 1%). The majority of the errors were due to a trailing comma in parameter sets:

let f = {a, b,}: a + b

I am not sure whether it was done that way for some legacy reason or whether it just accidentally crept into the Bison parser, but since the Nix manual doesn't mention the possibility of writing parameter sets in such a way, I left it as is.
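Had I wanted to accept the trailing comma, a right-recursive rule along the following lines would (I believe) do it without introducing new conflicts; formals and formal are placeholder names, not the exact rules from my grammar.

(* The comma is shifted first; only then does the next token decide
   whether another formal follows or the list ends. *)
formals:
| f = formal
    { [f] }
| f = formal; ","
    { [f] }
| f = formal; ","; fs = formals
    { f :: fs }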

Another class of problems, rare but quite annoying, was the re-definition of some built-in values:

let a = { true = true; false = "hi there"; null = 8756; }; # ...

The problem I see here is that re-declaring them may lead to some "funny" code

nix-repl> let false = true; in if false then "true" else "false"
"true"

Although I changed the grammar to allow those, I think it's a bad practice and should be made syntactically impossible.
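Allowing them roughly boils down to accepting those tokens wherever a binding name is expected. A sketch of the idea, assuming the lexer produces dedicated tokens for the three literals (the rule and token names here are made up):

binding_name:
| id = ID     { id }
| "true"      { "true" }
| "false"     { "false" }
| "null"      { "null" }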

Another oddity I encountered was an empty binding list in let expressions. That is, some of the .nix files in nixpkgs contain code like

let /* no bindings here */ in blah

That definitely makes no sense whatsoever so I left the requirement of a non-empty binding list for let expressions in place.

Last but not least was the special treatment of the or keyword. or is a true keyword, an essential part of the language, not just some value or a built-in function

nix-repl> let x = {a = 1;}.b or "no b in the attr set"; in x
"no b in the attr set"

but it is still possible to re-define it!

nix-repl> let a = {or = 1;}; in a.or
1

I decided against supporting that in my grammar, because it seems more of a historical accident than a deliberate technical decision and thus "shall not pass" :)
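For reference, the select-with-default form of or can be expressed with a production along these lines; the rule and constructor names are illustrative, not the ones from my grammar.

(* e.path or default; without the trailing 'or' branch it is a plain
   attribute selection. *)
select_expr:
| e = atomic_expr; "."; p = attr_path; "or"; d = atomic_expr
    { SelectOrDefault(e, p, d) }
| e = atomic_expr; "."; p = attr_path
    { Select(e, p) }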

Performance testing

I am always keen to understand how "good" something is. For computer programs one obvious metric is how fast they are. However, it makes little sense to measure the run time of a program in isolation, since the results may differ by an order of magnitude across hardware configurations.

To have a baseline I used the parser that comes with Nix itself. To be pedantic, it does a little more than just parsing (which is the reason it cannot be used for tooling). Still, I think it is as good a starting point as anything else.

nix-instantiate

$ time find ../nixpkgs -name '*.nix' \
     | xargs nix-instantiate --parse  > /dev/null

real 0m4.567s
user 0m3.665s
sys  0m1.068s

So it takes the Nix parser 4.6s to parse all the files in the nixpkgs source tree.

$ time find ../nixpkgs -name '*.nix' | xargs cat  > /dev/null

real 0m0.672s
user 0m0.173s
sys  0m0.687s

Subtracting 0.7s spent in find-ing we get 3.9s. The observed memory consumption maxed out at 216MB.

Menhir-generated parser

Running the Menhir-generated parser revealed an ego-pleasing result :)

$ time find ../nixpkgs -name '*.nix' | xargs ./parser  > /dev/null

real 0m3.091s
user 0m2.496s
sys  0m0.761s

That is 3.1s, which gives 2.4s of parsing/printing time! Memory consumption is even better: 64MB at its peak.

I forgot to mention that I wrote a crude printer for the parse tree which outputs something similar to what nix-instantiate --parse does. I haven't measured the time spent printing, because I could not "turn off" the printing in nix-instantiate.
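A printer of this kind is essentially a recursive walk over the tree. A toy sketch covering only the BinaryOp constructor used earlier; the actual AST and printer are, of course, richer, and this is not the real code:

(* Toy parse tree: only BinaryOp/Plus/Minus come from the grammar
   snippets above; Int is added just to make the example complete. *)
type op = Plus | Minus

type expr =
  | Int of int
  | BinaryOp of op * expr * expr

let print_op = function Plus -> "+" | Minus -> "-"

let rec print = function
  | Int i -> string_of_int i
  | BinaryOp (op, lhs, rhs) ->
      Printf.sprintf "(%s %s %s)" (print lhs) (print_op op) (print rhs)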

Menhir-generated parser with precedence levels

It was interesting to compare the "pure" grammar with one that uses precedence level declarations. I tried to mimic (with some obvious changes) the Bison parser specification from the Nix source tree and got marginally (but reproducibly) better results:

$ time find ../nixpkgs -name '*.nix' | xargs ./parser-prec  > /dev/null

real 0m3.086s
user 0m2.479s
sys  0m0.775s

I think the difference is mostly due to the smaller number of states - 165 vs 183 - in the generated automaton.

Menhir-generated parser with the "table" backend

The last contestant was the parser produced by enabling the table-based backend in Menhir. The manual says that it might be 2-5 times slower:

A (not quite scientific) benchmark suggests that the parsers produced by ocamlyacc and menhir –table have comparable speed, whereas those produced by menhir are between 2 and 5 times faster.

$ time find ../nixpkgs -name '*.nix' | xargs ./parser-table  > /dev/null

real 0m4.478s
user 0m3.856s
sys  0m0.779s

Albeit not 2-5 times slower, it came very close to nix-instantiate --parse. Which is not bad at all, since the table backend enables the more sophisticated, incremental API, and a 30-40% performance hit is a cheap price to pay for the feature (see the YouTube video to get a taste of why it might be useful).
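To give a flavour of what the incremental API looks like, here is a rough driver sketch; it is not code from this project, and Parser, its start symbol expr, and Lexer.token are assumed names.

let parse_incrementally (lexbuf : Lexing.lexbuf) =
  let module I = Parser.MenhirInterpreter in
  (* Wrap the ocamllex-generated lexer into a token supplier. *)
  let supplier = I.lexer_lexbuf_to_supplier Lexer.token lexbuf in
  let succeed ast = Ok ast in
  (* The checkpoint passed on failure describes the exact parser
     state, which is what enables fine-grained error messages. *)
  let fail _checkpoint = Error "syntax error" in
  I.loop_handle succeed fail supplier
    (Parser.Incremental.expr lexbuf.Lexing.lex_curr_p)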

all.nix

To reduce the effect of reading a bunch of small files I created an all.nix file containing one gigantic attribute set, where each attribute name was a file name and the value was the content of that file enclosed in parentheses, i.e.

{
   "foo/blah.nix" = ( {stdenv, ...}: stdenv.mkDerivation {} );
   "foo/goo.nix" =  # ...
}

As a result I got a ~50MB file. With cache-influenced IO effects out of the way I could use a more scientific approach to benchmarking:

$ bench 'nix-instantiate --parse all.nix > /dev/null'
benchmarking nix-instantiate --parse all.nix > /dev/null
time                 3.929 s    (3.801 s .. 4.066 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 3.908 s    (3.889 s .. 3.936 s)
std dev              26.52 ms   (7.154 ms .. 35.10 ms)
variance introduced by outliers: 19% (moderately inflated)
$ bench './parser all.nix > /dev/null'
benchmarking ./parser all.nix > /dev/null
time                 2.817 s    (2.733 s .. 2.889 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.846 s    (2.827 s .. 2.862 s)
std dev              20.81 ms   (8.792 ms .. 26.60 ms)
variance introduced by outliers: 19% (moderately inflated)
$ bench './parser-prec all.nix > /dev/null'
benchmarking ./parser-prec all.nix > /dev/null
time                 2.752 s    (2.725 s .. 2.776 s)
                     1.000 R²   (NaN R² .. 1.000 R²)
mean                 2.787 s    (2.771 s .. 2.809 s)
std dev              22.03 ms   (7.445 ms .. 29.87 ms)
variance introduced by outliers: 19% (moderately inflated)
$ bench './parser-table all.nix > /dev/null'
benchmarking ./parser-table all.nix > /dev/null
time                 5.252 s    (5.141 s .. 5.471 s)
                     1.000 R²   (NaN R² .. 1.000 R²)
mean                 5.253 s    (5.195 s .. 5.295 s)
std dev              60.23 ms   (30.78 ms .. 84.31 ms)
variance introduced by outliers: 19% (moderately inflated)

Memory consumption for the Menhir-generated parsers was no more than 250MB, while nix-instantiate peaked at 558MB.

The table below presents the results in a concise form.

parser            parsing nixpkgs/*.nix   parsing all.nix
nix-instantiate   4.567s / 216MB          3.929s / 558MB
parser            3.091s / 64MB           2.817s / 240MB
parser-prec       3.086s / 64MB           2.752s / 240MB
parser-table      4.478s / 65MB           5.252s / 249MB

Can I say that OCamllex/Menhir outperformed the venerable Flex/Bison combo here? To be cautious, I'd say "not certain", since whatever extra work nix-instantiate --parse does may outweigh the parsing time. But I can proclaim with no hesitation that OCamllex/Menhir-generated parsers can be as fast as Flex/Bison-generated ones.

Conclusion

Menhir is awesome. It has --trace and --interpret flags which make debugging a grammar not an appalling task. The --dump flag outputs the finite automaton in a human-readable form into a text file, which I found invaluable for understanding the conflicts. And I applaud the idea of parametrised production rules.

I think the incremental API will be essential for some of the tools I have in mind, although given the performance results and still feeling the thrill of competition, I am tempted to write a Nix expression interpreter first :)

I am not sure I should be comparing it to Bison, but Bison is the only other heavy-weight parser-generator tool I have some experience with. Each has its strengths, weaknesses, and feature set. Menhir is limited to LR(1) grammars, which may be a deal breaker for some applications. Also, Bison's manual is insanely good. I would read it just for fun, like a morning paper or a work of fiction. It is just that good. Menhir's manual, while of very good quality, pales in comparison.

With all of that I have merely re-instantiated the thought that if I ever find myself in a situation where I desperately need to write a compiler or two, OCaml would be the tool I reach for first.

The source code of the parser specification is available on GitHub.

Footnotes:

[1]

Not quite to the letter, though. I found a discrepancy between the documentation and implementation of the implication operator and reported the issue on Nix's GitHub

[2]

The Bison manual has a very good explanation of the technique.