pl-rants

Rants about programming languages

Nov 12, 2018

Regular expressions vs Parser Combinators in Haskell

Update 17/11/2018 - added a simpler implementation for the parser combinators.

Somewhat accidentally I discovered people posting links to my blog on Reddit. It was very pleasing and surprising, to be honest. It means that people out there actually read my posts and find them worthy of sharing :) Thank you.

Reading through the comments, I saw something that gave me pause. Namely, folks were suggesting parser combinators instead of regular expressions. I was under the impression that regular expressions, or more precisely the DFAs they compile to, were a better fit for the kind of problem I was trying to solve: they are succinct to write and (presumably) faster than parser combinators.

That being said, when it comes to Haskell my intuition has been off more often than not. So I thought I should try solving the same problem using both approaches, comparing the size/complexity of the code and running some benchmarks.

Warning: it is a somewhat lengthy read with large chunks of source code, so it might make sense to skip straight to the conclusion.

The Problem

Given a short (fewer than 256 characters) ASCII string, replace every contiguous sequence of the characters /:\ .(),'" with a single underscore _.
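
For example (an illustration of the expected behaviour):

  "foo/bar baz.txt"  becomes  "foo_bar_baz_txt"
  "a, b,, c"         becomes  "a_b_c"

Note how a contiguous run such as ",, " is collapsed into a single underscore.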

It is the kind of problem that I would solve with a regular expression without a second thought. I have been solving very similar problems in many programming languages for more than a decade now, using various regex libraries, so it is natural for me even to formulate the problem in the language of regular expressions, let alone solve it with one.

When I had to solve it in Haskell, though, I was somewhat overwhelmed by the lack of a replace_all function[1] and the complexity of the (=~) operator.

Naive approach

After figuring out how to use the (=~) operator I wrote something similar to the following:

import Text.Regex.PCRE
import qualified Data.ByteString.Char8 as S

replaceAll :: S.ByteString -> S.ByteString
replaceAll = (S.intercalate (S.pack "_")) . replace
  where
    -- split the input into the pieces between matches
    replace source =
      let (before, match, after) = op source
      in  if S.null match
          then [before]
          else before : replace after
    -- (=~) returns the text before, inside, and after the first match
    op :: S.ByteString -> (S.ByteString, S.ByteString, S.ByteString)
    op = ( =~ (S.pack "[/:\\\\. (),'\"]+"))

The function searches for the first match and builds a list from the part before the match and the result of calling itself on the remainder of the string. Once the list is built, it is joined using the _ symbol.
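
As a quick sanity check, here is a hypothetical GHCi session with the module above loaded:

ghci> replaceAll (S.pack "http://example.com/index.html")
"http_example_com_index_html"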

Pre-compiled regex

The naive approach works fine, but the regular expression is re-compiled each time the function is called. Happily, regex-pcre provides a lower-level API that makes it possible to pre-compile the regular expression:

import Text.Regex.PCRE.ByteString

compiledRe :: IO Regex
compiledRe = do
  -- the pattern is statically known to be valid,
  -- so the partial match on Right is safe here
  Right re <- compile compBlank execBlank re_text
  return re
  where
    re_text = S.pack "[/:\\\\. (),'\"]+"

Another function that can be used to improve performance is execute. It takes the compiled regex and returns an array of match positions. Those positions can be used to avoid allocating the matched string (which we do not really need). This results in the function:

import Data.Array ((!))

replaceAll' :: Regex -> S.ByteString -> IO S.ByteString
replaceAll' re s = do
  pieces <- replace re s
  return $ S.concat pieces
  where
    replace re s' = do
        Right result <- execute re s'
        case result of
          Just a ->
            -- a!0 is the (offset, length) pair of the whole match
            let before = S.take (fst (a!0)) s'
                after = S.drop (fst (a!0) + snd (a!0)) s'
            in do
              rest <- replace re after
              return $ before : (S.pack "_") : rest
          Nothing ->
            return [s']

which runs almost twice as fast as the naive version.
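
For completeness, here is a minimal sketch of how the two pieces fit together (the main wrapper and the sample input are mine, not part of the benchmark):

main :: IO ()
main = do
  re <- compiledRe   -- compile the pattern once
  out <- replaceAll' re (S.pack "http://example.com/index.html")
  S.putStrLn out     -- prints http_example_com_index_html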

Other regex libraries

I added some other libraries to the benchmark.

pcre-light

The library also allows the regex to be pre-compiled, but it does not have execute. Its match function behaves more like the regex libraries in high-level languages, returning a list of matched strings as opposed to the array of match positions that execute returns. So I ended up writing some extra bits of string-manipulation code:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as S
import Text.Regex.PCRE.Light

re = compile "[/:\\\\. (),'\"]+" []


replaceAll :: S.ByteString -> S.ByteString
replaceAll = S.concat . replace
  where
    replace s =
      case match re s [] of
        Just [x] ->
          -- split at the first occurrence of the matched substring
          let (h, t) = S.breakSubstring x s in
            h : "_" : replace (S.drop (S.length x) t)
        Nothing ->
          [s]

pcre-utils

The library already provides a substitute function which accepts a compiled regular expression, so replaceAll essentially becomes a one-liner:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as S
import Text.Regex.PCRE.ByteString.Utils

Right re = compile' compBlank execBlank "[/:\\\\. (),'\"]+"

replaceAll :: S.ByteString -> S.ByteString
replaceAll s = case substitute' re s "_" of
                 Right result -> result

regex-tdfa

On the regex wiki page the library is mentioned as being both fast and natively implemented. All the other libraries here are essentially interfaces to libpcre, so I was curious to compare the performance.

The implementation of the replaceAll function is nearly identical to the one for regex-pcre, with the exception that compile and execute are pure:

{-# LANGUAGE OverloadedStrings #-}

import Data.Array ((!))
import qualified Data.ByteString.Char8 as S
import Text.Regex.TDFA
import Text.Regex.TDFA.ByteString

-- the default compile and execute options are sufficient here
compOpts = defaultCompOpt
execOpts = defaultExecOpt

Right re = compile compOpts execOpts "[/:\\. (),'\"]+"

replaceAll :: S.ByteString -> S.ByteString
replaceAll = S.concat . replace
  where
    replace s =
      let Right result = execute re s in
        case result of
          Just a ->
            let before = S.take (fst (a!0)) s
                after = S.drop (fst (a!0) + snd (a!0)) s
            in
              before : "_" : replace after
          Nothing ->
            [s]

C version

I was also curious to measure how fast calling libpcre from C code would be, so I wrote a simple benchmark for that. I have not spent much time optimising it, so I cannot guarantee that it is "as fast as it can be". Nevertheless, it should provide a reasonable estimate.

Parser combinators

attoparsec

That was my first time using the library (although I had read about it many times) and I was simply blown away. Apparently it does some kind of dark magic with ByteString which makes it run more than 10 times faster than Parsec on this particular task. However, figuring out a way to use it was a bit of a brain-teaser for me. I ended up writing two mutually recursive functions: parse_good parses "good" characters and passes the remainder to parse_bad, which in turn parses "bad" characters and hands the rest back to parse_good. Either one stops once the input is exhausted.

import Prelude hiding (takeWhile)
import qualified Data.ByteString.Char8 as S
import Data.Attoparsec.ByteString.Char8

chars = inClass "/:\\. (),'\""

good = takeTill chars   -- a run of characters to keep
bad = takeWhile chars   -- a run of characters to replace

replaceAll :: S.ByteString -> S.ByteString
replaceAll = S.concat . replace
  where
    parse_good s =
      case parse good s of
        Done left x ->
          if S.null left then [x] else x : parse_bad left
        -- Partial means the parser consumed the entire input
        Partial _ ->
          [s]

    parse_bad s =
      case parse bad s of
        Done left _ ->
          if S.null left then [] else S.pack "_" : parse_good left
        Partial _ ->
          [S.pack "_"]

    replace = parse_good

Update 17/11/2018 - after thinking about it for some time I came up with a simpler solution which uses the sepBy1 combinator:

replaceAll = (S.intercalate (S.pack "_")) . replace
  where
    replace s =
      case parseOnly p s of
        Right x -> x
    -- the idea is that the source string is one
    -- or more "good" sequences delimited by "bad"
    -- sequences, where a "good" sequence can be empty;
    -- the separator uses takeWhile1 because sepBy1 loops
    -- if its separator can succeed without consuming input
    p = (good `sepBy1` takeWhile1 chars) <* endOfInput
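
For example, on the input "/a b/" the parser yields ["", "a", "b", ""], and intercalate turns that into "_a_b_"; the empty "good" pieces at the ends are what preserve the leading and trailing underscores.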

To my surprise it performed worse, being 25-30% slower than the initial solution. My hypothesis is that GHC does some crazy optimisation for the former case which it cannot perform on the latter.

parsec

I had had a chance to use the library earlier and the experience was very positive. The library's documentation is comprehensive and is complemented by papers and multiple blog posts.

I used essentially the same trick with two mutually recursive functions, the only difference being that the matched String is converted to a ByteString for every "good" part.

Update 17/11/2018 - as was the case for attoparsec, using the sepBy1 combinator greatly simplified the code. Unlike attoparsec, though, with Parsec it also delivers better performance. I decided to remove the source code of the first version (it is still available in the git repo) as it is both poorly written and slow.

import qualified Data.ByteString.Char8 as S
import Text.Parsec
import Text.Parsec.ByteString (Parser)

chars = "/:\\. (),'\""

-- "good" parser returns the packed bytestring
good = S.pack <$> many (noneOf chars)

-- "bad" parser returns `()` which `sepBy1` ignores anyway
bad = skipMany1 (oneOf chars)

replaceAll :: S.ByteString -> S.ByteString
replaceAll = (S.intercalate (S.pack "_")) . replace
  where
    replace s = case parse p "" s of
                  Right bs -> bs
    p = (good `sepBy1` bad)

Results and conclusion

Before drawing any conclusions it is worth mentioning that these results are for one particular problem, obtained on a fixed, randomly generated data set (~1000 strings). Different problems (regexes) and different data sets might produce a totally different picture.

I used the excellent criterion library to run the benchmarks. The author did an amazing job, both with the library itself and with its documentation.
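
The harness boils down to something like the following sketch, with stand-in names (replaceAllPcre, replaceAllAtto, inputs) rather than the exact benchmark code:

import Criterion.Main
import qualified Data.ByteString.Char8 as S

main :: IO ()
main = defaultMain
  [ bgroup "replaceAll"
      [ bench "regex-pcre" $ nf (map replaceAllPcre) inputs
      , bench "attoparsec" $ nf (map replaceAllAtto) inputs
      ]
  ]
  where
    -- replaceAllPcre / replaceAllAtto: implementations from the sections above
    -- inputs: a stand-in for the ~1000 randomly generated strings
    inputs :: [S.ByteString]
    inputs = replicate 1000 (S.pack "http://example.com/index.html")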

The benchmark results were obtained on my mid-2015 MacBook Pro with a 2.2 GHz Intel Core i7 and 16 GB of RAM, running macOS Mojave 10.14.1.

Without further ado, here is a comparison chart that demonstrates the measured performance.

[Figure: recomb.png - measured performance comparison]

I could not believe my eyes when I saw that Attoparsec was marginally faster (4.501 ms vs 4.636 ms) than C with libpcre! While a hand-crafted parser would likely outperform it, it is pretty amazing how well the library did.

Criterion also generated a pretty HTML page that details the obtained results (obviously excluding the C version).

The source code of the benchmark is available on GitHub. Pull requests are more than welcome and if the numbers change I will update the charts/sections accordingly.

Footnotes:

[1] Discovered it later in pcre-utils.