A Case for a New Language
How often do we need to craft a new programming language? It is a common belief that the answer should be "never", or at least "not unless you absolutely have to, and even then think twice and decide against it". However, if we slightly reformulate the question to "shall I write a DSL for that particular problem?", the answer becomes a roaring "YES".
The reason for such polar opinions on essentially the same question, I suppose, is that people often associate domain-specific languages with some sort of sophisticated library used within an existing language. In other words, one does not write a new programming language, but leverages existing language features.
Why is this better? Well, we can use all the existing tools (i.e. bells and whistles) such as IDEs, syntax highlighting on GitHub, type checkers (for those lucky enough to use statically typed languages, or to pretend they do), debuggers, and what not. I call such languages "embeddable" and their host languages "embedding".
On the other hand, we have a slew of standalone DSLs we use routinely without treating them as such - regular expressions, JSON, XML, to name just a few. Among the less well-known ones, I would point out the mini-languages for the command-line parameters of the find and tcpdump tools. Those are non-trivial languages, necessarily complex in order to express all the underlying functionality. And that is reasonable: the more complex something is, the more complex its configuration language becomes. A great (albeit somewhat infamous) example of this phenomenon is Sendmail, whose configuration language is accidentally Turing-complete.
My point is that a configuration language can be, and sometimes should be, treated as a programming language. Often an existing language is enough - just slap in a YAML/TOML file (or an XML one if you want to sell it as "enterprise-grade") and move on. But what if it isn't? Then you have no choice but to craft something of your own. And with that comes the whole stack of problems typically associated with "big" programming languages: frontends (i.e. lexical analysers and parsers), a typechecker (why not?), and backends - interpreters, optimisers, compilers… We also have to think about the language's semantics, their coherency and consistency. But above all else, we want to make such languages laser-focused on the single problem they are born to solve. And make them perform well there - in fact, better than anything else.
It just so happened that I came across such a case.
The Problem
The project I am currently working on involves receiving and processing gRPC messages from some A-brand network devices. The messages contain ordered sets of key-value pairs. The meaning and naming conventions for the keys and values vary greatly and are inconsistent across different message types. The problem was how to describe various processing rules for those messages. As a starting point I decided to use a simple JSON file with some relevant information and hard-coded processing rules. That was OK for a few ad-hoc rules, but later on our customers kept asking for more and more complex processing, and the tipping point was the requirement of remote configurability. In other words, customers wanted to supply those configurations themselves.
That was when I first thought about a programming language. We can treat an ordered set of key-value pairs as a list of tuples. And the very first thing which comes to mind when we talk about list processing is Lisp :)
When I was about to start sketching a Lisp parser/interpreter, a colleague of mine made an interesting observation. Apparently, the key-value pairs were not as chaotic as I thought, but occurred as a result of flattening some hierarchical data. Trees.
What languages do we have for working with trees? I don't think I know many. So I looked at XPath and XSLT, and also at GraphQL, but neither was quite the right fit. We needed something like SQL with CTEs. And in fact those two could have worked, but the expressions quickly became unwieldy, even for relatively simple rules.
A Solution
I knew that SQL had the right semantics. We wanted to select a subset of key-value pairs and aggregate/process the values in some way. And a natural way to select elements is by using paths, so the language had to have a way to express paths succinctly. I conjectured that file-system-like path expressions were concise enough.
/foo/goo/blah/x/y
So I made those into first-class syntactic constructs, like numbers or strings. In order to express variability I added a "match all" syntax which matches any path segment
/foo/*/blah/*/y
and to save parts of the path for later use I utilised a "var" syntax
/foo/$k/blah/$l/y
Additionally, the paths could have annotated segments
/foo/goo[instance='a2']/blah/x[a='S 512']/y
I added star- and var- syntax there as well.
/foo/goo[instance=*]/blah/x[a=$a_val]/y
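The matching rules sketched above can be modelled in a few lines of Python. This is purely illustrative (annotated segments are omitted, and match_path is an invented name, not the DSL's actual internals):

```python
def match_path(pattern, path):
    """Return a dict of $var bindings if pattern matches path, else None."""
    pat_segs = pattern.strip('/').split('/')
    path_segs = path.strip('/').split('/')
    if len(pat_segs) != len(path_segs):    # patterns have a fixed number of segments
        return None
    bindings = {}
    for pat, seg in zip(pat_segs, path_segs):
        if pat == '*':
            continue                       # "match all": any segment is fine
        elif pat.startswith('$'):
            bindings[pat] = seg            # "var": capture the segment for later use
        elif pat != seg:
            return None                    # a literal segment must match exactly
    return bindings

print(match_path('/foo/$k/blah/$l/y', '/foo/goo/blah/x/y'))
# {'$k': 'goo', '$l': 'x'}
```

Note that a successful match with no vars returns an empty dict (truthy enough for "it matched"), while a failed match returns None.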
Closures
Now that I had a way to express arbitrary paths (with a fixed number of segments), I was looking for a syntactic construct to filter the resulting node sets. In SQL we have
SELECT a, b FROM c WHERE <condition>
I got SELECT and FROM covered with paths, but the WHERE clause was still missing. Moreover, that <condition> could be arbitrarily complex, possibly depending on previous results. The solution was to "capture" the environment and delay the evaluation of an expression until it is needed. In other words, a closure (aka a lambda function):
{ $1 < 5 && $foo == 'abc' }
where, by convention, the caller binds $1, $2, and so on.
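The capture-and-delay mechanism can be sketched like this (make_closure and call_closure are hypothetical names for this model, not the actual implementation):

```python
def make_closure(body, env):
    """Package an unevaluated body with a copy of the defining environment."""
    return ('closure', body, dict(env))

def call_closure(closure, *args):
    _, body, env = closure
    frame = dict(env)                 # start from the captured environment
    for i, arg in enumerate(args, start=1):
        frame[f'${i}'] = arg          # the caller binds $1, $2, ...
    return body(frame)

# { $1 < 5 && $foo == 'abc' } with $foo captured from the surrounding scope
env = {'$foo': 'abc'}
pred = make_closure(lambda e: e['$1'] < 5 and e['$foo'] == 'abc', env)
print(call_closure(pred, 3))   # True
print(call_closure(pred, 7))   # False
```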
Gluing it all together
The combination of function calls and variable bindings was deemed flexible enough for the task.
The main workhorse of the DSL became the select
function
select(/foo/$x/$goo/z/y/w, { value($1) < 3 })
It accepts a path and a predicate which may use arithmetic, boolean, and regular expressions.
I also added a group_by function which takes a collection of elements and splits it into a collection of collections based on a key function, i.e. a closure that returns the grouping key
# group elements by their length
group_by($xs, { len($1) })
Intermediate results may be bound to variables
# return a collection of even numbers between 1 and 10
let $even = {$1 % 2 == 0} in filter(range(1, 10), $even)
and as in many declarative and functional languages, expressions may occur almost anywhere.
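To make the select/group_by semantics concrete, here is a toy model over a flattened message treated as an ordered list of (path, value) pairs. The representation and all names are assumptions of the sketch:

```python
def select(pairs, pattern, pred):
    """Keep (path, value) pairs whose path matches pattern and whose value passes pred."""
    def seg_ok(p, s):
        return p == '*' or p.startswith('$') or p == s
    def matches(path):
        ps = pattern.strip('/').split('/')
        ss = path.strip('/').split('/')
        return len(ps) == len(ss) and all(seg_ok(p, s) for p, s in zip(ps, ss))
    return [(path, value) for path, value in pairs if matches(path) and pred(value)]

def group_by(xs, key):
    """Split a collection into a collection of collections, based on a key function."""
    groups = {}
    for x in xs:
        groups.setdefault(key(x), []).append(x)
    return list(groups.values())

message = [('/foo/a/z/y/w', 1), ('/foo/b/z/y/w', 5), ('/foo/c/q/y/w', 2)]
print(select(message, '/foo/$x/z/y/w', lambda v: v < 3))
# [('/foo/a/z/y/w', 1)]
print(group_by(['a', 'bb', 'cc', 'd'], len))
# [['a', 'd'], ['bb', 'cc']]
```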
Side effects
SQL, or at least its query part, is a pure, declarative, side-effect-free language. My initial thought was to follow suit. However, processing a single message may yield multiple results. While it was possible to accumulate them all into one big structure, I decided that the escape hatch side effects offer was too convenient to overlook.
let $xs = range(1, 10) in for_each($xs, { yield('value=', $1) })
and I added "expression chaining" to perform multiple side effects
yield('hello'); yield('world'); let $x = 'hi ' in (yield('greetings: '); yield($x + $name))
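A minimal model of this side-effecting part, assuming yield simply appends to an output buffer (the buffer design and the emit name are illustrative, not the actual implementation):

```python
output = []

def emit(*parts):                # models the DSL's yield
    output.append(''.join(str(p) for p in parts))

def for_each(xs, f):             # drives a closure over a finite collection
    for x in xs:
        f(x)

# let $xs = range(1, 10) in for_each($xs, { yield('value=', $1) }), shortened
for_each(range(1, 4), lambda x: emit('value=', x))
print(output)   # ['value=1', 'value=2', 'value=3']
```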
Totality
Applied to programming languages, the word "total" means that every program written in such a language provably terminates. I am not sure it is possible to prove that this language is total (at least, I know I don't have enough knowledge to attempt it), but it is a crucial property the language should possess. We don't want a message-processing script to get stuck in an infinite loop!
One way to achieve this is to prohibit conditional loops, recursion (some forms of it, at least), and infinite sequences. We still have iteration via for_each, but it is finite, because ranges and node sets (the results of select) are also finite. And, to rule out recursion, I decided against an apply operator that could invoke a closure. Otherwise we could have written infinite recursion, for instance
let $f = { apply($1, $1) } in apply($f, $f)
That's why only the built-in functions, which we have full control over, can evaluate closures. Did it make the language always terminating? I daresay "yes": a script either terminates or crashes because of a type mismatch.
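Translating the rejected apply operator into Python shows the danger: a closure applied to itself never terminates. Python's stack limit surfaces the loop as an error, while the DSL would simply spin forever:

```python
def apply(f, x):              # the rejected operator: invoke closure f on x
    return f(x)

f = lambda g: apply(g, g)     # the DSL's { apply($1, $1) }

detected = False
try:
    apply(f, f)               # let $f = { apply($1, $1) } in apply($f, $f)
except RecursionError:
    detected = True           # Python's stack limit catches the endless descent
print(detected)               # True
```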
Type system (or lack thereof)
As much as I'd love to have a static type system, that was not quite possible here. The reason is that the key-value pairs that come in the messages may have various data types, such as signed/unsigned integers of different bit lengths, strings, booleans, and blobs. While I could have provided "typed" versions of the "value-extracting" function, e.g. str_val($node) or int_val($node), they wouldn't prevent run-time errors when applied to a node of a different type. And if a type system cannot guard against run-time errors, what's the merit of having one? I still think that some limited form of static analysis is possible, but that's an exercise for the future. So it ended up being a dynamically, albeit strongly, typed language.
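The argument can be illustrated with a hypothetical tagged-node model of such an extractor (the tuple representation and error handling are assumptions of this sketch):

```python
def int_val(node):
    """Extract an integer payload; fails at run time on any other node type."""
    tag, value = node
    if tag != 'int':
        raise TypeError(f'expected an int node, got {tag}')
    return value

ok = int_val(('int', 42))      # the happy path: returns 42
failed = False
try:
    int_val(('str', 'hello'))  # a typed extractor cannot prevent this mismatch
except TypeError:
    failed = True
print(ok, failed)              # 42 True
```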
Implementation
On one hand, I tried to keep the language small but flexible, goal-oriented, full of familiar syntactic constructs, and just easy to use. On the other hand, we had only two weeks to build a working prototype (yeah, I have one of the most awesome bosses around), so I had to trade some of the conveniences for a simpler grammar and ease of implementation.
The implementation language was Elixir, because it is the go-to language for anything that runs 24/7 in our team. While Elixir can leverage Erlang's parser generator - yecc - which accepts LALR(1) grammars, I prototyped the grammar in OCaml (Menhir, to be precise) because it provides much nicer facilities for debugging grammars and inspecting the automaton. In the end I settled on a grammar with just one shift/reduce conflict, which I resolved with precedence declarations.
Writing an interpreter was a relatively straightforward task (see e.g. this post). It is just a tree-walk interpreter and, unless performance proves to be an issue, I am not going to look at bytecode compilation, VMs, JIT, … Yeah, definitely not going to… well, maybe only if I have some free time on one of those long weekends ;)
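For flavour, here is a toy tree-walk evaluator in the same spirit, over a tuple-based AST. It is purely illustrative: the real interpreter is written in Elixir and its AST surely looks different.

```python
def eval_node(node, env):
    """Recursively evaluate a tuple-encoded AST in an environment of bindings."""
    kind = node[0]
    if kind == 'lit':                    # ('lit', 5)
        return node[1]
    if kind == 'var':                    # ('var', '$x')
        return env[node[1]]
    if kind == 'binop':                  # ('binop', '+', lhs, rhs)
        op = node[1]
        l = eval_node(node[2], env)
        r = eval_node(node[3], env)
        if op == '+':
            return l + r
        if op == '<':
            return l < r
        if op == '==':
            return l == r
        raise ValueError(f'unknown operator: {op}')
    if kind == 'let':                    # ('let', '$x', bound_expr, body)
        return eval_node(node[3], {**env, node[1]: eval_node(node[2], env)})
    raise ValueError(f'unknown node: {kind}')

# let $x = 2 + 3 in $x < 10
ast = ('let', '$x', ('binop', '+', ('lit', 2), ('lit', 3)),
       ('binop', '<', ('var', '$x'), ('lit', 10)))
print(eval_node(ast, {}))   # True
```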
Conclusion
When I started the project I had a true-to-life business justification. Yet I hesitated a lot, and I spent maybe two weeks in total trying to express the rules in JSON/XPath/YAML/whatelse, because… well, because I also shared the common opinion. Was it successful? I believe so. Even if we only ever use the language internally to pre-configure our software, I think it's a great time-saver. I just hope that this somewhat emotional post will help me, or someone else, to make the right choice faster :)
Oh, and I have a sure-kill argument for the managers who might oppose your well-deserved fun time spent on designing a programming language every now and then. We all know that microservices are the "cutting edge" of software architecture design principles. The best practices nowadays are unthinkable without an agile process of creating loosely coupled services that do one thing and do it well. Then why are we so opposed to exactly the same idea at "compile time"? Moreover, now that using Tomcat and servlets is not as trendy anymore, why should we blindly welcome the idea of embeddable DSLs and frown upon standalone mini-languages? I suggest we adopt the "To the cloud!" slogan some run-time folks use and make it "Let them be Dragon(slayer)s!" or "Finite Automata to the masses!".