Regular expressions vs Parser Combinators in Haskell
Table of Contents
Update 17/11/2018 - added simpler implementation for parser combinators.
Somewhat accidentally I discovered people posting links to my blog on Reddit. It was very pleasing and surprising, to be honest. That means that people out there actually read them and found them worthy of sharing :) Thank you.
Reading through the comments I saw something that made me stagger. Namely, folks were suggesting using parser combinators instead of regular expressions. I was under impression that regular expressions, or more precisely, DFAs they compile to, were better fit for the kind of problem I was trying to solve. They were succinct to program and (presumably) faster than parser combinators.
That being said, when it comes to Haskell my intuition has been off more often than not. So I thought I should try solving the same problem using both approaches, comparing the size/complexity of the code and running some benchmarks.
Warning: It is somewhat a lengthy read with large chunks of source code, so it might make sense to skip straight to the conclusion.
The Problem
Given a short (less than 256) string of ASCII characters replace any contiguous sequence of any of
/:\ .(),'"
characters by underscore_
.
It is a kind of problem that I would solve with regular expression without a second thought. I mean, I've been solving very similar problems in many programming languages for more than a decade now using various regex libraries. So it is natural for me even to formulate it in regular expression language, let alone solve it.
When I had to solve it in Haskell, though, I was somewhat overwhelmed by the
lack of replace_all
function1 and the complexity of (=~)
operator.
Naive approach
After figuring out how to use (=~)
operator I wrote something similar to the
following:
import Text.Regex.PCRE import qualified Data.ByteString.Char8 as S replaceAll :: S.ByteString -> S.ByteString replaceAll = (S.intercalate (S.pack "_")) . replace where replace source = let (before, match, after) = op source in if S.null match then [before] else before : replace after op :: S.ByteString -> (S.ByteString, S.ByteString, S.ByteString) op = ( =~ (S.pack "[/:\\\\. (),'\"]+"))
The function searches for the first match, builds a list of the part "before"
the matched string and the result of calling the function on the remainder of
the string. When the list is built it is joined using _
symbol.
Pre-compiled regex
The naive approach works fine but the problem is that the DFA is re-built each time the function is called. Happily, regex-pcre provides lower-level API that allows to pre-compile the regular expression:
import Text.Regex.PCRE.ByteString compiledRe :: IO Regex compiledRe = do Right re <- compile compBlank execBlank re_text return re where re_text = S.pack "[/:\\\\. (),'\"]+"
Another function that can be used to improve performance is execute
. It
takes the compiled regex and returns an array of match positions. Those
positions could be used to avoid allocating match
string (which we don't
really need). This results in the function:
replaceAll' :: Regex -> S.ByteString -> IO S.ByteString replaceAll' re s = do pieces <- replace re s return $ S.concat pieces where replace re s' = do Right result <- execute re s' case result of Just a -> let before = S.take (fst (a!0)) s' after = S.drop (fst (a!0) + snd (a!0)) s' in do rest <- replace re after return $ before : (S.pack "_") : rest Nothing -> return [s']
which runs almost twice as fast as the naive version.
Other regex libraries
I added some other libraries to the benchmark.
pcre-light
The library also allows to compile the regex, but doesn't have execute
. Its
match
function behaves more like Regex libraries in high-level languages,
returning a list of matched strings, as opposed to the array of matched
positions from execute
. So I ended up writing some extra bits of string
manipulation code:
import qualified Data.ByteString.Char8 as S import Text.Regex.PCRE.Light re = compile "[/:\\\\. (),'\"]+" [] replaceAll :: S.ByteString -> S.ByteString replaceAll = S.concat . replace where replace s = case match re s [] of Just [x] -> let (h, t) = S.breakSubstring x s in h : "_" : replace (S.drop (S.length x) t) Nothing -> [s]
pcre-utils
The library already provides substitute
function which accepts a compiled
regular expression, so replaceAll
essentially becomes a one-liner:
import qualified Data.ByteString.Char8 as S import Text.Regex.PCRE.ByteString.Utils Right re = compile' compBlank execBlank "[/:\\\\. (),'\"]+" replaceAll :: S.ByteString -> S.ByteString replaceAll s = case substitute' re s "_" of Right result -> result
regex-tdfa
On the regex wiki page the library is mentioned as being both, fast and having
native implementation. All the other libraries were essentially just an
interface to libpcre
so I was curios to compare the performance.
Implementation of replaceAll
function is nearly identical to one from
regex-pcre
with the exception of compile
and execute
being pure:
import Text.Regex.TDFA import Text.Regex.TDFA.ByteString Right re = compile compOpts execOpts "[/:\\. (),'\"]+" replaceAll :: S.ByteString -> S.ByteString replaceAll = S.concat . replace where replace s = let Right result = execute re s in case result of Just a -> let before = S.take (fst (a!0)) s after = S.drop (fst (a!0) + snd (a!0)) s in before : "_" : replace after Nothing -> [s]
C version
I was also curious to measure how fast would using libpcre
from C code. So I
wrote a simple benchmark for that. I haven't spent much time optimising it so I
can't guarantee that it is "as fast as it can be". Nevertheless, it should
provide a reasonable estimate.
Parser combinators
attoparsec
That was my first time using the library (although I read about it many times)
and I was simply blown away. Apparently it is doing some kind of dark magic
with ByteString
which makes it run more than 10 times faster than Parsec on
this particular task. However, figuring out a way to use it was a bit of a
brain-teaser for me. I ended up writing two mutually recursive
functions. parse_good
parses "good" characters and passes the remainder to
parse_bad
which in turn parses "bad" characters and hands over the rest to
parse_good
. Either one would stop when given an empty string.
import Prelude hiding (takeWhile) import qualified Data.ByteString.Char8 as S import Data.Attoparsec.ByteString.Char8 chars = inClass "/:\\. (),'\"" good = takeTill chars bad = takeWhile chars replaceAll :: S.ByteString -> S.ByteString replaceAll = S.concat . replace where parse_good s = case parse good s of Done left x -> if S.null left then [x] else x : (parse_bad left) Partial _ -> [s] parse_bad s = case parse bad s of Done left _ -> if S.null left then [] else S.pack "_" : (parse_good left) Partial _ -> [S.pack "_"] replace = parse_good
Update 17/11/2018 - After thinking about it for some time I came up with a
simpler solution which used sepBy1
combinator:
replaceAll = (S.intercalate (S.pack "_")) . replace where replace s = case parseOnly p s of Right x -> x -- the idea is that the source string is one -- or more "good" sequences delimited by "bad" -- sequences where "good" sequence can be empty p = (good `sepBy1` bad) <* endOfInput
To my surprise it behaved slightly worse, being 25-30% slower than the initial solution. My hypothesis here is that GHC does some crazy optimisation of the former case which it can not perform on the latter.
parsec
I had a chance to use that library earlier and had very positive experience. The library's documentation is comprehensive and is complemented by papers and multiple blog posts.
I used essentially the same trick with two mutually recursive functions. The
only difference being that the matched String
is converted to a ByteString
for every "good" part:+
Update 17/11/2018 - As it was the case for attoparsec
, using sepBy1
combinator greatly simplified the code. Unlike attoparsec
though, with
`Parsec" it delivers better performance. I decided to remove the source code of
the first version (which is still available in the git repo) as it is both
poorly written and delivers bad performance.
import qualified Data.ByteString.Char8 as S import Text.Parsec import Text.Parsec.ByteString (Parser) chars = "/:\\. (),'\"" -- "good" parser returns the packed bytestring good = many (noneOf chars) >>= (\xs -> return $ S.pack xs) -- "bad" parser returns `()` which `sepBy1` ignores anyway bad = skipMany1 (oneOf chars) replaceAll :: S.ByteString -> S.ByteString replaceAll = (S.intercalate (S.pack "_")) . replace where replace s = case parse p "" s of Right bs -> bs p = (good `sepBy1` bad)
Results and conclusion
Before drawing any conclusion it is worth mentioning that the results are for a particular problem obtained on a fixed, randomly generated data set (~1000 strings). Different problems (regexes) and different data sets may have produced a totally different picture.
I used the excellent criterion library to run the benchmarks. The author did an amazing job both with the library and documenting it.
The benchmarks results were obtained on my MacBook Pro mid 2015 with 2.2 GHz Intel Core i7 and 16GB of RAM, running macOS Mojave 10.14.1.
Without further ado, here is a comparison chart that demonstrates the measured performance.
I could not believe my eyes when I saw that Attoparsec was marginally faster (4.501 ms vs 4.636 ms) than C with libpcre! While it is likely that a hand-crafted parser would outperform it, it is pretty amazing how well the library did.
Criterion also generated a pretty html page that details the obtained results (obviously excluding the C version).
The source code of the benchmark is available on GitHub. Pull requests are more than welcome and if the numbers change I will update the charts/sections accordingly.
Footnotes:
discovered it later in pcre-utils