DEV Community

zchtodd
zchtodd

Posted on • Originally published at codetodd.com

Building parsers for fun and profit with Pyparsing

I sometimes need to either create a domain specific language, or at least be able to parse a subset of some existing language. This can be incredibly useful.

Recently, I worked on an application in which users create filters that classify data. The filters are boolean expressions that are constructed using a friendly web interface, but power users can directly edit the filters as text. This direct editing mode, however, means that we’ll need to parse the text to perform validations and to understand what the user intends.

Parsing is a huge subject and can be an equally huge undertaking. Luckily, this is well trodden ground, and there’s no need to start from scratch. The pyparsing module is a powerful toolkit that takes care of the time consuming and error prone aspects of parsing, allowing you to get directly to the essence of what you’re trying to do.

In my case, what I’m trying to do is parse a simplified version of the SQL where clause.

Building anything with Pyparsing involves starting with the basic building blocks of your desired grammar. The basic elements of my simplified SQL grammar are numbers, strings, operators and identifiers (the actual variables or column names).

import pyparsing as pp

lparen = pp.Suppress("(")
rparen = pp.Suppress(")")

and_ = pp.Literal("AND")
or_ = pp.Literal("OR")

operator = pp.oneOf(("=", "!=", ">", ">=", "<", "<="))

alphaword = pp.Word(pp.alphanums + "_")
string = pp.QuotedString(quoteChar="'")

number = (
    pp.Word(pp.nums) + pp.Optional("." + pp.OneOrMore(pp.Word(pp.nums)))
)

identifier = alphaword

Pyparsing takes advantage of operator overloading in Python to make grammar construction compact and elegant. Each time two Pyparsing objects are combined they will form a new object that is immediately returned, thus allowing you to build up a grammar. Every object has a parseString method, meaning all the subsets of your grammar can be independently tested.

>>> number.parseString("1.32")
(['1', '.', '32'], {})

The default result of calling parseString is comparable to what you’d expect to see during the lexing or tokenizing phase of a compiler. This is a useful and necessary step in the parsing process, but it leaves a ton of work for us to do.

At the end of the day, I need to understand the structure of the filter expression, which is best represented as a tree.

The setParseAction method is what you’ll need if you want to encode the structure of a grammar into a tree. The setParseAction method accepts any callable object, meaning you can pass functions or class objects. As your various subgrammars are encountered in an input, Pyparsing will invoke your setParseAction callables, with the results being passed into higher level grammar constructs.

Let’s look again at the last example:

class String(object):
    def __init__(self, result):
        self.value = result[0]

    def generate(self):
        return "'{}'".format(self.value)

class Number(object):
    def __init__(self, result):
        self.value = result[0]

    def generate(self):
        return self.value

class Identifier(object):
    def __init__(self, result):
        self.name = result[0]

    def generate(self):
        return self.name

class Condition(object):
    def __init__(self, tokens):
        self.identifier = tokens[0][0]
        self.op = tokens[0][1]
        self.rval = tokens[0][2]

    def generate(self):
        return " ".join((self.identifier.generate(), self.op, self.rval.generate()))

lparen = pp.Suppress("(")
rparen = pp.Suppress(")")

and_ = pp.Literal("AND")
or_ = pp.Literal("OR")

operator = pp.oneOf(("=", "!=", ">", ">=", "<", "<="))

alphaword = pp.Word(pp.alphanums + "_")
string = pp.QuotedString(quoteChar="'").setParseAction(String)

number = (
 pp.Word(pp.nums) + pp.Optional("." + pp.OneOrMore(pp.Word(pp.nums)))
).setParseAction(Number)

identifier = alphaword.setParseAction(
 Identifier
)

condition = pp.Group(
 identifier + (operator + (string | number))
).setParseAction(Condition)

If we try parsing a test string again, we’ll see that the result is returned in terms of the classes defined above, instead of as a string that has been tokenized.

>>> result = condition.parseString("x = 4")
>>> result[0].identifier
<test.Identifier object at 0x7f327b9ed240>
>>> result[0].op
'='
>>> result[0].rval
<test.Number object at 0x7f327b9ed320>

The AND/OR part of the grammar is where things could start to become a little tricky, but it’s also where Pyparsing really shines. Up to this point, the grammar has been very simple. When we introduce AND/OR to this grammar, two important things have happened: the grammar now has operator precedence, and has also become recursive.

If you’re familiar with SQL you probably know that AND has higher precedence than OR, just as multiplication has higher precedence than addition. We’ll have to be careful how we construct our grammar so that operator precedence is encoded properly.

Let’s take a look at how to add operator precedence and nested expression parsing to this grammar:

class Expression(object):
    def __init__(self, tokens):
        self.and_conditions = tokens[::2]

    def generate(self):
        return (
            "("
            + " OR ".join(
                (and_condition.generate() for and_condition in self.and_conditions)
            )
            + ")"
        )

class AndCondition(object):
    def __init__(self, tokens):
        self.conditions = tokens[::2]

    def generate(self):
        result = " AND ".join((condition.generate() for condition in self.conditions))
        if len(self.conditions) > 1:
            result = "(" + result + ")"
        return result

class Condition(object):
    def __init__(self, tokens):
        self.identifier = tokens[0][0]
        self.op = tokens[0][1]
        self.rval = tokens[0][2]

    def generate(self):
        return " ".join((self.identifier.generate(), self.op, self.rval.generate()))

class String(object):
    def __init__(self, result):
        self.value = result[0]

    def generate(self):
        return "'{}'".format(self.value)

class Number(object):
    def __init__(self, result):
        self.value = result[0]

    def generate(self):
        return self.value

class Identifier(object):
    def __init__(self, result):
        self.name = result[0]

    def generate(self):
        return self.name

lparen = pp.Suppress("(")
rparen = pp.Suppress(")")

and_ = pp.Literal("AND")
or_ = pp.Literal("OR")

operator = pp.oneOf(("=", "!=", ">", ">=", "<", "<="))

alphaword = pp.Word(pp.alphanums + "_")
string = pp.QuotedString(quoteChar="'").setParseAction(String)

number = (
 pp.Word(pp.nums) + pp.Optional("." + pp.OneOrMore(pp.Word(pp.nums)))
).setParseAction(Number)

identifier = alphaword.setParseAction(
 Identifier
)

expr = pp.Forward()

condition = pp.Group(
 identifier + (operator + (string | number))
).setParseAction(Condition)

condition = condition | (lparen + expr + rparen)

and_condition = (condition + pp.ZeroOrMore(and_ + condition)).setParseAction(
 AndCondition
)

expr << (and_condition + pp.ZeroOrMore(or_ + and_condition))
expr = expr.setParseAction(Expression)

It can be helpful to plan out a grammar in terms of something like Backus Naur form before committing it to code. Sketching out this simplified SQL grammar, I came up with the following:

# OP := = | != | > | < | <= | >=
# CONDITION := <word> <op> <number> | (<expression>)
# AND_CONDITION := <condition> | <condition> AND <condition>
# EXPRESSION := <and_condition> | <and_condition> OR <and_condition>

The proper precedence is taken care of because the tree formed by our grammar places OR at a higher level, forcing AND conditions to be resolved first. You might also have noticed that conditions can contain expressions, while conditions are themselves part of expressions. This is a circular dependency, but is also required if we’re going to make nested expressions possible.

Luckily, Pyparsing provides the Forward class for just such cases, allowing us to use part of our grammar before we’ve actually defined it.

Now that we can parse this grammar, what can we do from here? As just one example, I’ve added generate methods to each component class that will produce a string representing that part of the tree. This will produce almost an exact replica of the input string, but in the AndCondition I’ve modified things slightly so that AND’d conditions are grouped together within parentheses in the output.

This is a simple example of the kinds of transformations you can do once you’re able to parse a grammar. Of course, you could also perform all kinds of validations, or even execute the logic itself.

Hopefully you’ve enjoyed this brief intro to Pyparsing. I may do more tutorials or examples in the future, because parsing is a pretty fun subject.

Top comments (0)