A Nix parser in OCaml. Part 2: Parser
After I got lexing out of the way, the time had come to build the actual parser. It was the first time I had used Menhir. The manual is extremely well written, and there is also a chapter in the Real World OCaml book, so I didn't have any issues understanding how to use the tool.
The First Iteration
The main problem was to find a (semi-)formal description of the Nix language. I re-read the chapter on the expression language in the official Nix manual several times, but it lacked some details. There was also a page on the NixOS wiki which described the language more formally, or, rather, in a way that was more suitable for my purpose. However, it also lacked some necessary details. For example, in the pattern expression used in lambda definitions it is possible to supply default values, e.g.
```nix
{x ? true, y, z}: if x then y else z
```
However, it is not clear what exactly can be put there. Can I use a let expression?

```nix
{x ? let n = g n; in !(n g), y, z}: ...
```
Or, maybe, it is restricted to simple arithmetic expressions?

```nix
{x ? 1+1, y, z}: ...
```
> Oh Lisp,
> Remembering thee,
> Thy simplicity and clarity
> Is unrivalled.
>
> — Unknown author, circa early XXI century
To understand all those nuances I had to experiment in the repl. A lot. Sometimes, what seemed logical to me wasn't working as I expected, being more restrictive than I anticipated. Other times it was the opposite, allowing more general expressions where I would have expected an atomic value.
The result was quite a messy grammar specification with randomly sprinkled precedence annotations :). So it wasn't a huge surprise when I got
```
$ dune build
menhir nix/parser.{ml,mli}
Warning: 12 states have shift/reduce conflicts.
Warning: one state has reduce/reduce conflicts.
Warning: 153 shift/reduce conflicts were arbitrarily resolved.
Warning: one reduce/reduce conflict was arbitrarily resolved.
```
After spending at least a couple of days resolving the conflicts and continuously wrangling the REPL, I decided I'd had enough and started from scratch.
The Second Iteration
This time around I started from the operator precedence table. Avoiding associativity declarations altogether, I explicitly encoded all the productions.
```
expr1:
  | e = binary_expr(expr2, "->" {Impl}, expr1) { e }
  | e = expr2 { e }
expr2:
  | e = binary_expr(expr2, "||" {Or}, expr3) { e }
  | e = expr3 { e }
expr3:
  | e = binary_expr(expr3, "&&" {And}, expr4) { e }
  | e = expr4 { e }
(* ... other productions here ... *)
expr14:
  | (* ... *)
```
where
```
%inline binary_expr(Lhs, Op, Rhs):
  lhs = Lhs; op = Op; rhs = Rhs { BinaryOp(op, lhs, rhs) }
```
is a parametrised production rule which expands into a rule that parses the left-hand side, the operator, and the right-hand side, and constructs a `BinaryOp` value.
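The corresponding AST fragment might be declared along these lines (a sketch based on the constructor names above; the actual type in the parser may differ):

```ocaml
(* Hypothetical AST fragment for binary operators. *)
type binary_op =
  | Impl | Or | And | Plus | Minus  (* ... and the rest ... *)

type expr =
  | BinaryOp of binary_op * expr * expr
  | Var of string
  (* ... other constructors ... *)
```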
I think the ability to parametrise rules is one of the more powerful features of Menhir, and it facilitates writing succinct grammars. There is a subtle difference between `%inline` and non-`%inline` rules: `%inline` rules are expanded in place and may reduce the number of states/reductions in some cases.
For example,

```
%inline expr8_ops:
  | "+" {Plus}
  | "-" {Minus}
expr8:
  | e = binary_expr(expr8, expr8_ops, expr9) { e }
  | e = expr9 { e }
```
expands `expr8_ops` into
```
expr8:
  | e = binary_expr(expr8, "+" {Plus}, expr9) { e }
  | e = binary_expr(expr8, "-" {Minus}, expr9) { e }
  | e = expr9 { e }
```
which, after expanding `binary_expr`, becomes
```
expr8:
  | lhs = expr8; "+"; rhs = expr9 { BinaryOp(Plus, lhs, rhs) }
  | lhs = expr8; "-"; rhs = expr9 { BinaryOp(Minus, lhs, rhs) }
  | e = expr9 { e }
```
Menhir comes with a standard library containing a number of pre-defined rules. For example, `separated_nonempty_list` accepts a separator rule and a rule for an individual element, and returns a list built from the semantic actions of the elements:
```
attr_path:
  | cs = separated_nonempty_list(".", attr_path_component)
      { cs (* a list of attr path components *) }
```
Even more exciting is that Menhir allows parameters which can be parametrised rules themselves, essentially providing a higher-order DSL. I haven't exploited the feature yet, but do hope to get back to it at some point.
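To illustrate (a sketch of my own, not from the parser; the rule names are made up), a parametrised rule can pass its parameter down into other parametrised rules, combinator style:

```
(* A hypothetical combinator-style rule: X is handed down into the
   standard-library rules delimited and separated_nonempty_list, which
   are themselves parametrised. Token aliases "(", ")" and "," are
   assumed to be declared. *)
%inline plist(X):
  | xs = delimited("(", separated_nonempty_list(",", X), ")") { xs }

(* e.g. plist(expr) parses "(e1, e2, e3)" into [e1; e2; e3] *)
```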
Also a minor thing, but worth mentioning: the explicit encoding of the operator precedence rules makes operator associativity obvious. For example,
```
expr_1:
  | expr_1; op; expr_2
  | expr_2
```
defines a left-associative operator `op` (`expr_1` is on the left), while
```
expr_1:
  | expr_2; op; expr_1
  | expr_2
```
makes `op` associate to the right. Non-associative operators can be defined as
```
expr_1:
  | expr_2; op; expr_2
  | expr_2
```
I think the explicitness of this approach provides for better locality and isolation, while the traditional approach, where the operator precedence table is encoded elsewhere, may lead to some tricky and hard-to-debug errors. In my mind it is not dissimilar to lexical scoping versus dynamic scoping. When a language obeys lexical scoping rules, one can fully understand what a function does, and what its behaviour depends on, by looking at its code in isolation. With dynamic scoping, the behaviour of a function may change dramatically depending on the context it is put into.
LR(2) grammar?
Meticulously following the operator precedence table[1] and explicitly, carefully encoding the other rules eliminated all the shift/reduce conflicts, with a single reduce/reduce conflict left. I think it is a legitimate conflict that makes the grammar LR(2), not LR(1). The essence of the conflict is this: we may have an attribute set and a pattern in a lambda definition
```nix
let attset = {b1 = 1; b2 = 5;} # ...
let func = {p1, p2}: # ...
```
In the degenerate case the list of formal parameters `p1..pn` and the list of bindings `b1..bn` may be empty, i.e.
```nix
let attset = {} ...  # tokens: LBRACE RBRACE ...
let func = {}: ...   # tokens: LBRACE RBRACE COLON ...
```
The parser, upon seeing `RBRACE`, cannot decide whether it should reduce the `LBRACE RBRACE` sequence to an `attr_set` or whether it is the beginning of a `param_set`.
My first solution was to mandate a non-empty list of formal parameters in `param_set`, which is somewhat substantiated by the fact that an empty parameter set makes no sense in a lazy language. Take Haskell, for example. Since all computations are suspended, declaring

```haskell
f = 1
```

is equivalent to

```
let f () = 1
```

in a strict language. In other words, passing an empty parameter set to a pure function yields exactly the same result as if the body of the function were assigned to a value and that value were used in place of the function call. Hence, in a lazy language, it makes no sense to write

```haskell
f x = z
```

where the expression `z` does not have `x` as a free variable.
The creators of the language and package authors think differently, though: parsing all the files in `nixpkgs` uncovered quite a number of such cases.
I got curious how this could work in the original, Bison-generated Nix parser. The answer was the `%glr-parser` directive in the parser specification file. GLR parsers are capable of processing non-LR(1) grammars by "splitting the brain" and maintaining several parse stacks simultaneously, until following one of the paths yields an error[2].
Menhir does not support this feature, so as a workaround I introduced another token for "{}", `EMPTY_CURLY`, which eliminated the conflict. It may still fail to parse some valid expressions if there is a comment inside the curly braces, e.g.

```nix
let f = { /* an empty parameter set */ }: 1
```

but I found that empty attribute sets are much more common, so I made the `LBRACE RBRACE` sequence unambiguously reduce to `attr_set`.
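In grammar terms, the workaround might look roughly like this (a sketch with made-up rule and constructor names, not the actual specification):

```
(* With a dedicated EMPTY_CURLY token, "{}" always means an empty
   attribute set, and param_set must contain at least one formal, so
   the reduce/reduce conflict disappears. *)
attr_set:
  | EMPTY_CURLY                                  { AttrSet [] }
  | LBRACE; bs = nonempty_list(binding); RBRACE  { AttrSet bs }

param_set:
  | LBRACE; ps = separated_nonempty_list(COMMA, formal); RBRACE
      { ParamSet ps }
```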
Incompatibilities
In addition to the above problem (i.e. parsing empty curly braces), I got a number of other errors. To be precise, the parser errored on 129 files out of 14,674 (which is less than 1%). The majority of the errors were due to a trailing comma in parameter sets:
```nix
let f = {a, b,}: a + b
```
I am not sure whether it was done that way for some legacy reason or just accidentally crept into the Bison parser, but since the Nix manual doesn't mention the possibility of writing parameter sets this way, I left it as is.
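For what it is worth, had I wanted to accept the trailing comma, a conflict-free encoding could look something like this (my own sketch; `formal` and the list-building actions are assumptions):

```
(* Accepts "a", "a,", "a, b", "a, b," and so on. After shifting a
   comma the lookahead token decides whether another formal follows or
   the list is done, which stays within LR(1). *)
formals:
  | f = formal                      { [f] }
  | f = formal; COMMA               { [f] }
  | f = formal; COMMA; fs = formals { f :: fs }
```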
Another class of problems, a rare but quite annoying one, was in allowing re-definitions of some special values:

```nix
let a = { true = true; false = "hi there"; null = 8756; }; in a
```
The problem I see here is that re-declaring them may lead to some "funny" code:

```
nix-repl> let false = true; in if false then "true" else "false"
"true"
```
Although I changed the grammar to allow those, I think it's a bad practice and should be made syntactically impossible.
Another oddity I encountered was an empty binding list in `let` expressions. That is, some of the .nix files in `nixpkgs` contain code like

```nix
let /* no bindings here */ in blah
```
That definitely makes no sense whatsoever, so I left the requirement of a non-empty binding list for `let` expressions in place.
Last but not least was the special treatment of the `or` keyword. `or` is a true keyword, an essential part of the language, not just some value or a built-in function:

```
nix-repl> let x = {a = 1;}.b or "no b in the attr set"; in x
"no b in the attr set"
```
but it is still possible to re-define it!

```
nix-repl> let a = {or = 1;}; in a.or
1
```
I decided against fixing this because it seemed more of a historical accident than a technical decision, and thus it "shall not pass" :)
Performance testing
I am always keen to understand how "good" something is. For computer programs, one obvious metric is how fast they are. However, it makes no sense to measure the run time of a program in isolation, since the results may differ by an order of magnitude on different hardware configurations.
To have a baseline I used the parser that comes with Nix itself. To be pedantic, it does a little bit more than just parsing (and that is the reason why it cannot be used for tooling). Still, I think it is as good a starting point as anything else.
nix-instantiate
```
$ time find ../nixpkgs -name '*.nix' \
    | xargs nix-instantiate --parse > /dev/null

real    0m4.567s
user    0m3.665s
sys     0m1.068s
```
So it takes the Nix parser 4.6s to parse all the files in the `nixpkgs` source tree.
```
$ time find ../nixpkgs -name '*.nix' | xargs cat > /dev/null

real    0m0.672s
user    0m0.173s
sys     0m0.687s
```
Subtracting the 0.7s spent in `find`-ing, we get 3.9s. The observed memory consumption maxed out at 216MB.
Menhir-generated parser
Running the Menhir-generated parser revealed an ego-pleasing result :)
```
$ time find ../nixpkgs -name '*.nix' | xargs ./parser > /dev/null

real    0m3.091s
user    0m2.496s
sys     0m0.761s
```
That is 3.1s, which gives 2.4s of parsing/printing time! Memory consumption is even better: 64MB at its peak.
I forgot to mention that I wrote a crude printer for the parse tree which outputs something similar to what `nix-instantiate --parse` does. I haven't measured the time it spends there because I could not "turn off" the printing in `nix-instantiate`.
Menhir-generated parser with precedence levels
It was interesting to compare the "pure" grammar with the one having precedence level declarations. I tried to mimic (with some obvious changes) the Bison parser specification from the Nix source tree and got marginally (but repeatedly) better results:
```
$ time find ../nixpkgs -name '*.nix' | xargs ./parser-prec > /dev/null

real    0m3.086s
user    0m2.479s
sys     0m0.775s
```
I think the difference is mostly due to the smaller number of states in the generated automaton: 165 vs 183.
Menhir-generated parser with the "table" backend
The last contestant was the parser produced by enabling the table-based backend in Menhir. The manual says that it might be 2-5 times slower:

> A (not quite scientific) benchmark suggests that the parsers produced by ocamlyacc and menhir --table have comparable speed, whereas those produced by menhir are between 2 and 5 times faster.
```
$ time find ../nixpkgs -name '*.nix' | xargs ./parser-table > /dev/null

real    0m4.478s
user    0m3.856s
sys     0m0.779s
```
Albeit not 2-5 times slower, it was very close to `nix-instantiate --parse`. Which is not bad at all, since the table backend enables a more sophisticated, incremental API, and a 30-40% performance hit is a cheap price to pay for the feature (see the YouTube video to get a taste of why it might be useful).
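To give a flavour of that API: with the table backend the generated parser can expose checkpoints that the driver advances manually. A minimal driver could look like the sketch below (my own; it assumes a generated module `Parser` with an entry point `expr` and a lexer `Lexer.token`, neither of which is shown here).

```ocaml
(* A sketch of a driver for Menhir's incremental API. The parser hands
   back a checkpoint after every step; the driver feeds tokens on
   demand and could just as well pause, inspect the state, or attempt
   error recovery at this point. *)
let parse_incrementally (lexbuf : Lexing.lexbuf) =
  let module I = Parser.MenhirInterpreter in
  let rec loop checkpoint =
    match checkpoint with
    | I.InputNeeded _ ->
        (* The parser wants one more token. *)
        let token = Lexer.token lexbuf in
        let startp = lexbuf.Lexing.lex_start_p
        and endp = lexbuf.Lexing.lex_curr_p in
        loop (I.offer checkpoint (token, startp, endp))
    | I.Shifting _ | I.AboutToReduce _ ->
        (* Let the automaton take its internal step. *)
        loop (I.resume checkpoint)
    | I.HandlingError _ -> failwith "syntax error"
    | I.Accepted ast -> ast
    | I.Rejected -> assert false
  in
  loop (Parser.Incremental.expr lexbuf.Lexing.lex_curr_p)
```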
all.nix
To reduce the effect of reading a bunch of small files, I created an `all.nix` file which had one gigantic attribute set where each attribute name was a file name and the value was the content of the file enclosed in parentheses, i.e.

```nix
{
  "foo/blah.nix" = ( {stdenv, ...}: stdenv.mkDerivation {} );
  "foo/goo.nix" = # ...
}
```
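Such a file could be generated along these lines (a sketch of my own, not the script I actually used; error handling and attribute-name quoting are elided):

```ocaml
(* Walk a directory tree and wrap every .nix file into an attribute
   binding of the form  "name" = ( contents );  *)
let read_file path =
  let ic = open_in_bin path in
  let s = really_input_string ic (in_channel_length ic) in
  close_in ic; s

let rec walk dir f =
  Array.iter
    (fun entry ->
      let path = Filename.concat dir entry in
      if Sys.is_directory path then walk path f
      else if Filename.check_suffix path ".nix" then f path)
    (Sys.readdir dir)

let () =
  print_string "{\n";
  walk "../nixpkgs" (fun path ->
    Printf.printf "\"%s\" = (\n%s\n);\n" path (read_file path));
  print_string "}\n"
```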
As a result I got a ~50MB file. With cache-influenced IO effects out of the way, I could use a more scientific approach to benchmarking:
```
$ bench 'nix-instantiate --parse all.nix > /dev/null'
benchmarking nix-instantiate --parse all.nix > /dev/null
time                 3.929 s    (3.801 s .. 4.066 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 3.908 s    (3.889 s .. 3.936 s)
std dev              26.52 ms   (7.154 ms .. 35.10 ms)
variance introduced by outliers: 19% (moderately inflated)
```

```
$ bench './parser all.nix > /dev/null'
benchmarking ./parser all.nix > /dev/null
time                 2.817 s    (2.733 s .. 2.889 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.846 s    (2.827 s .. 2.862 s)
std dev              20.81 ms   (8.792 ms .. 26.60 ms)
variance introduced by outliers: 19% (moderately inflated)
```

```
$ bench './parser-prec all.nix > /dev/null'
benchmarking ./parser-prec all.nix > /dev/null
time                 2.752 s    (2.725 s .. 2.776 s)
                     1.000 R²   (NaN R² .. 1.000 R²)
mean                 2.787 s    (2.771 s .. 2.809 s)
std dev              22.03 ms   (7.445 ms .. 29.87 ms)
variance introduced by outliers: 19% (moderately inflated)
```

```
$ bench './parser-table all.nix > /dev/null'
benchmarking ./parser-table all.nix > /dev/null
time                 5.252 s    (5.141 s .. 5.471 s)
                     1.000 R²   (NaN R² .. 1.000 R²)
mean                 5.253 s    (5.195 s .. 5.295 s)
std dev              60.23 ms   (30.78 ms .. 84.31 ms)
variance introduced by outliers: 19% (moderately inflated)
```
Memory consumption for the Menhir-generated parsers was no more than 250MB, while `nix-instantiate` peaked at 558MB.
The table below presents the results in a concise form.
| parser | parsing nixpkgs/*.nix | parsing all.nix |
|---|---|---|
| nix-instantiate | 4.567s / 216MB | 3.929s / 558MB |
| parser | 3.091s / 64MB | 2.817s / 240MB |
| parser-prec | 3.086s / 64MB | 2.752s / 240MB |
| parser-table | 4.478s / 65MB | 5.252s / 249MB |
Can I say that OCamllex/Menhir outperformed the venerable Flex/Bison combo here? To be cautious, I'd say "not certain", since whatever extra work `nix-instantiate --parse` does may outweigh the parsing time. But I can proclaim with no hesitation that OCamllex/Menhir parsers can be as fast as Flex/Bison-generated ones.
Conclusion
Menhir is awesome. It has `--trace` and `--interpret` flags which make debugging a grammar not an appalling task. The `--dump` flag outputs the finite automaton in a human-readable form into a text file, which I found invaluable for understanding the conflicts. And I applaud the idea of parametrised production rules.
I think the incremental API will be essential for some of the tools I have in mind, although given the performance results and still feeling the thrill of competition, I am tempted to write a Nix expression interpreter first :)
I am not sure I should be comparing it to Bison, but it is the only heavyweight parser-generator tool I have some experience with. They have their own strengths, weaknesses, and feature sets. Menhir is limited to LR(1) grammars, which may be a deal breaker for some applications. Also, Bison's manual is insanely good. I would read it just for fun, like a morning paper or a fiction book. It is just that good. Menhir's manual, while of very good quality, pales in comparison.
With all of that I just re-instantiated the thought that if I ever find myself in a situation where I desperately need to write a compiler or two, OCaml would be the tool I reach out to first.
The source code of the parser specification is available on GitHub.
Footnotes:

[1] Not quite to the letter, though. I found a discrepancy between the documentation and the implementation of the implication operator and reported the issue on Nix's GitHub.

[2] The Bison manual has a very good explanation of the technique.