Compilers 2: Compiling Backwards

Reading Time: 9 minutes

In May I took Dave Beazley’s week-long course on compilers. This is the second post in a series about that course. Here’s where you can see all the posts so far.

In the last post, we talked about how computers work—specifically, what happens under the hood when we issue them a command. That’s really the last step in the compilation-to-execution flow: executing an instruction on a machine. Before we continue, let’s envision that entire compilation-to-execution flow with this handy illustration from Crafting Interpreters (thank you, Bob):

Compiler Mountain

On the left is where our source code in our programming language goes in. On the right is where machine instructions come out—the kind we might run on the machines we dissected in the previous post.

In the middle, toward the apex of the mountain, you see “syntax tree.” That’s our base camp for this post. The syntax tree is a data model representation of the instructions contained in our source code. For example, suppose we have the following expression in Wabbit:

2 + 3 * 4;

This code comprises two binary operations: the operation that multiplies 3 and 4, and then the operation that adds together 2 and the result of the operation that multiplies 3 and 4. A data model might represent that set of instructions like this:

BinaryOperator('+', Integer('2'),
       BinaryOperator('*', Integer('3'), Integer('4')))

It’s called a syntax tree because it has a treelike structure with higher precedence (executed first) types, statements, and expressions nested inside lower precedence ones.

Terminology Break!

type – one of the primitive classifications of values in the source language. Examples: integer, float, or string.

statement – a code structure in the source language that has behavior but no return value. Example: a print statement.

expression – a code structure in the source language that has a return value. Example: a binary operation.

– Me, explaining the words I am using

Here’s an illustration of how the syntax tree might look for the line 2 + 3 * 4; that shows you the tree-like shape:

In this case, the multiplication operation happens before the addition operation, and interpretation of the operands happens even before that.

In other words, an interpreter could traverse this syntax tree by recursively interpreting the expressions. It would navigate to the bottom of the tree, interpret the integers 3 and 4 into the target instruction language, multiply those together in the target instruction language, then step out and interpret the integer 2, and finally add 2 to the result of the multiplication operation.

The process of getting from source code to a data model is called parsing. Our example demonstrates the result of a recursive descent parser, which translates source code into a syntax tree that allows recursive traversal at the interpretation step (there are other kinds of parsers, but we’ll leave that discussion for later).

Here’s what you need to know about parsing right this minute: it’s an absolute bear. It’s hard AF. It’s hard because it turns freeform text—yes, text in a highly structured language, but still freeform text—into a data representation. Furthermore, the parsing step needs to catch parsing errors where the source code does not adhere to the rules of the language. In a one-week class, we could easily lose the whole week to parsing alone.

This is why Dave defers the topic of parsing to later in the week. For this post, we’ll practice working with the data model by starting from an accurate data model and attempting to print the line of code that parses to it. We traverse, from our basecamp at the syntax tree, back down the mountain to the left. Here’s an example:

model1 = [
    Print(value=BinaryOperator('*', BinaryOperator('+', Integer(2), Integer(3)), Negative(value=Integer(4)))),
    Print(value=BinaryOperator('/', BinaryOperator('-', Float(2), Float(3)), Negative(value=Float(4.0)))),
]

source1 = """
    print 2 + 3 * -4;
    print 2.0 - 3.0 / -4.0;
"""

assert to_source(model1).join("\n") === source1 

Our code goes in a function called to_source() that accepts a data model and returns a string representing the source code. We could even get fancy and handle something beyond arithmetic, like constants and variables:

model2 = [
    ConstantAssignment(Name('pi'), Float(3.14149)),
    VariableDeclaration(name=Name('tau'), type=WabbitType('float')),
    VariableReAssignment(
        name=Name('tau'),
        value=BinOp('*', Float(2.0), Name('pi'))),
    Print(value=BinOp('+', Negative(value=Name('tau')), Float(10.0)))
]

source2 = """
    const pi = 3.14159;  
    var tau float;
    tau = 2.0 * pi;
    print -tau + 10.0;
"""

assert to_source(model2).join("\n") === source2

How do we implement this?

I have a blog series where I’m tracking my progress through Bob Nystrom’s book, Crafting Interpreters, in which he implements an interpreter (well, two) from scratch. That implementation stores behavior related to each of the data model objects inside the data model classes themselves, so a BinaryOperator handles its own interpretation and so on. This approach takes a little upfront planning, but it eases the process of adding data model objects later on. It’s probably closer to what “real” interpreters do. Here’s the thing though: there is no time limit on how long you can take to read Bob’s book. Plus, the implementation is provided for you.

Neither of these is true with Dave’s class: we have a week, and not everyone ends up with the same implementation. Heck, some students even use different languages than the recommended default (Python) for the exercises. So instead of optimizing for maintainability and organization, the in-class exercises optimize for getting something working as fast as possible. We’re not really doing OO here. So I don’t want to hear it when you see my implementation of to_source(), which is…

def to_source(node, dents=0):
    if isinstance(node, Integer):
        stringification = [repr(node.value)]
    elif isinstance(node, Float):
        stringification = [repr(node.value)]
    elif isinstance(node, BinaryOperator):
        stringification = [f"{to_source(node.left)} {node.operation} {to_source(node.right)}"]
    elif isinstance(node, Comparison):
        stringification = [f"{to_source(node.left)} {node.op} {to_source(node.right)}"]
    elif isinstance(node, Print):
        stringification = [f"print {to_source(node.value)};"]
    elif isinstance(node, VariableAssignment):
        stringification = [f"var {to_source(node.name)}{(' ' + to_source(node.type) + ' ') if node.type else ' '}= {to_source(node.value)};"]
    elif isinstance(node, ConstantAssignment):
        stringification = [f"const {to_source(node.name)} = {to_source(node.value)};"]
    elif isinstance(node, VariableDeclaration):
        stringification = [f"var {to_source(node.name)} to_source({node.type});"]
    elif isinstance(node, VariableReAssignment):
        stringification = [f"{to_source(node.name)} = {to_source(node.value)};"]
    elif isinstance(node, Name):
        stringification = [f"{node.name}"]
    elif isinstance(node, Negative):
        stringification = [f"-{to_source(node.value)}"]
    elif isinstance(node, Conditional):
        ifpart = [
            f"if {to_source(node.if_condition)} {{",
            f"{to_source(node.if_block)}",
            f"}}"
        ]
        if node.else_block:
            ifpart[2] == f"}} else {{"
            ifpart.append(f"{to_source(node.else_block)}")
            ifpart.append(f"}}")

        stringification = ifpart
    elif isinstance(node, WhileLoop):
        whilepart = [
            f"while {to_source(node.while_condition)} {{",
            f"{to_source(node.while_block)}",
            f"}}"
        ]
        stringification = whilepart
    elif isinstance(node, Block):
        stringification = [f"{to_source(statement, dents=(dents+1))}" for statement in node.statements]
    elif isinstance(node, CompoundExpression):
        statements = ''
        for statement in node.statements:
            statements += f" {to_source(statement)}"
        stringification = [f"{{ {statements} }}"]
    elif isinstance(node, Break):
        stringification = ['break;']
    elif isinstance(node, Continue):
        stringification = ['continue;']
    elif isinstance(node, WabbitType):
        stringification = [node.string_representation]
    elif isinstance(node, WabbitBoolean):
        stringification = [f"{node.value}"]
    elif isinstance(node, Unit):
    #     what goes here
        pass
    else:
        raise RuntimeError(f"Can't convert {node} to source")

    indentation = '\t' * dents
    return "\n".join([indentation + stringo for stringo in stringification])

…this massive conditional. Don’t @ me. You’ve been warned.

This conditional switches on the type of the incoming data model object. If it’s a compound data model object that contains other types, to_source recursively calls to_source on each of the objects contained in the compound object.

Switching on the type of the data model object implies that the data model objects have types somewhere. Indeed, they do. Here it is in full, but here’s the upshot: I stored each piece of each structure as an attribute on a class named for the structure itself. For example, the BinaryOperator class looks like this:

class BinaryOperator:
    '''
    Example: left + right
    '''
    def __init__(self, operation, left, right):
        self.operation = operation
        self.left = left
        self.right = right

    def __repr__(self):
        return f"BinaryOperator({self.operation}, {self.left}, {self.right})"

It has the following attributes:

  • operation: a string representing what should be done to the two operands. For example, a ‘*’ here means “multiply the operands”
  • left: the left operand. This could be an integer, a float, or an expression that evaluates to an integer or float.
  • right: the right operand. Same as the left operand, but on the right.

Those attributes provide the context that allows to_source to know what to do with each component of a data model type. For example, upon encountering a BinaryOperator, to_source calls itself on the left and right attributes, and it generates a string that contains the stringification of the left operand, then the string operator, then the stringification of the right operand, then a semicolon. That’s what the Wabbit code would look like that would produce the BinaryOperator data model.

Here’s the interesting part.

We could construct a rudimentary interpreter with the exact same format as this source code printer. It would use the same data model, and it could even use the same sort of giant conditional, as to_source. The difference: instead of arranging stringifications of the data model inputs, it would execute the Python version of the commands encoded in each data model input. If we treat Python as the target for the instruction set, then the interpreter code as described would constitute the right part of the mountain diagram shown above.

We’ll try exactly that in the next post of the series.

If you liked this piece, you might also like:

The Crafting Interpreters series, which covers what I’m learning from Bob Nystrom’s book

The Raft series, of course! Learn to implement a distributed system from the comfort of your home, complete with gifs and smack talk!

The time I talked about what counts as a programming language in front of an audience of programming language designers. I strategically saved this talk for a pandemic-era Zoom conference to eliminate the possibility that they’d actually throw tomatoes.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.