Speeding up text processing in Python (is hard)

If you’re doing text or string manipulation in Python, what do you do if your code is too slow? Assuming your algorithm is reasonably efficient, the next step is to try a faster alternative to Python: a compiled extension.

Unfortunately, this is harder than it seems. Some options don’t offer an easy path to optimization; others are actually slower. To see this limitation in action, we’ll consider some alternatives:

  • Pure Python, with the default Python interpreter.
  • Cython.
  • mypyc.
  • Rust.
  • Pure Python, with the PyPy interpreter.

We’ll also consider what can be done if these options don’t help.

Update: Stefan Behnel suggested a way to speed up the Cython version, compared to the code in the original version of this article.

An example task: name matching

In order to have something to measure and discuss, we will consider a concrete example: matching people’s names.

You are a volunteer with a local group pushing for the construction of more protected bike lanes; bikes are vastly cheaper and more sustainable than driving, and they’re often just as fast as cars in dense cities. The key bottleneck to adoption: safer infrastructure.

Having gathered signatures for a petition, you want to match them to the names of voters in the city’s voter registration database. This means you need to match names in two formats:

  • The names you have collected for your petition are a single field, for example "Itamar Turner-Trauring", manually typed by a human. Since they are manually typed, spaces and capitalization might be inconsistent.
  • The names in the local voter database are stored in two fields (first and last name), in upper case, for example "ITAMAR" and "TURNER-TRAURING".

Names are complicated

How do you convert a single string containing a full name into two strings containing first and last names?

  1. If someone wrote down only two parts to their name, you split that in two and you’re done.
  2. Three parts is less common, and the middle part has three different ways it might be reflected in the voter database:
    1. Omitted: "John Q. Public" becomes ("JOHN", "PUBLIC").
    2. Part of the first name: "Marie Louise Mignot" becomes ("MARIE LOUISE", "MIGNOT").
    3. Part of the last name: "Shakira Mebarak Ripoll" becomes ("SHAKIRA", "MEBARAK RIPOLL").
  3. Finally, one part and four or more parts are rare enough where you live that you decide to skip them.

To normalize names we come up with the following pure Python implementation:

def single_name_to_first_last_names(
    name: str,
) -> list[tuple[str, str]]:
    parts = name.upper().split()
    if len(parts) == 2:
        return [tuple(parts)]
    elif len(parts) == 3:
        a, b, c = parts
        return [(a, c), (a, f"{b} {c}"), (f"{a} {b}", c)]
    else:
        return []

Given a name, it gives us a list of (first, last) pairs we can try to look up in the voter database.
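
For example, a two-part name produces a single candidate, while a three-part name produces all three possible splits:

>>> single_name_to_first_last_names("Itamar Turner-Trauring")
[('ITAMAR', 'TURNER-TRAURING')]
>>> single_name_to_first_last_names("John Q. Public")
[('JOHN', 'PUBLIC'), ('JOHN', 'Q. PUBLIC'), ('JOHN Q.', 'PUBLIC')]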

Additional implementations

We can run the above code on the normal Python interpreter (aka CPython) and the sometimes much faster PyPy. But we can also build some alternatives.

Cython

Cython allows mixing Python and C code; it can also compile most normal Python code as-is. We can install it:

$ pip install cython

Then create a .pyx file with a tweaked version of the above code:

def single_name_to_first_last_names(
        name: str
) -> list[tuple[str, str]]:
    parts = name.upper().split()
    cdef int length = len(parts)  # C variable!
    if length == 2:
        return [tuple(parts)]
    elif length == 3:
        a, b, c = parts
        return [(a, c), (a, f"{b} {c}"), (f"{a} {b}", c)]
    else:
        return []

And then compile it to a Python extension:

$ cythonize -i with_cython.pyx

We can now from with_cython import single_name_to_first_last_names to use the code.

In practice that change doesn’t seem to make a significant difference to speed, but it does show how you can mix the two languages. Stefan Behnel suggested a reason why: Cython isn’t quite up to doing type inference on parts = name.upper().split(). So we can change the code to make the types more explicit, and hopefully get a faster result:

def single_name_to_first_last_names(
        name: str
) -> list[tuple[str, str]]:
    upper: str = name.upper()
    parts: list[str] = upper.split()
    cdef int length = len(parts)  # C variable!
    if length == 2:
        return [tuple(parts)]
    elif length == 3:
        a, b, c = parts
        return [(a, c), (a, f"{b} {c}"), (f"{a} {b}", c)]
    else:
        return []

Given the heavy use of Python string APIs, it’s not clear how we could switch to faster C code beyond that single C integer. We could in theory use C or C++ string processing APIs. But given the safety issues with those languages, it’s probably best to avoid that.

mypyc

mypyc is an off-shoot of the mypy Python type checker. It understands Python type annotations and attempts to use them to optimize your code; like Cython, it emits a compiled extension.

We can install it like so:

$ pip install 'mypy[mypyc]'

The original pure Python code above has a type error when compiled with mypyc: tuple(parts) produces a tuple[str, ...], which doesn’t match the declared return type of list[tuple[str, str]]. So we use a tweaked version:

def single_name_to_first_last_names(
    name: str,
) -> list[tuple[str, str]]:
    parts = name.upper().split()
    if len(parts) == 2:
        return [(parts[0], parts[1])]  # Make mypyc happy
    elif len(parts) == 3:
        a, b, c = parts
        return [(a, c), (a, f"{b} {c}"), (f"{a} {b}", c)]
    else:
        return []

And then compile it to an extension:

$ mypyc with_mypyc.py

We can now from with_mypyc import single_name_to_first_last_names.

Rust

Rust is a fast, memory-safe language. You can build Python extensions using the PyO3 library. Here’s the function implementation, pretty much matching the Python implementation strategy; we do the string processing with Rust’s string APIs.

use pyo3::prelude::*;
use pyo3::types::PyList;

#[pyfunction]
fn single_name_to_first_last_names(
        py: Python<'_>, name: String
) -> &PyAny {
    let name = name.to_uppercase();
    // Like Python's str.split(): ignore the empty strings that
    // repeated spaces would otherwise produce.
    let parts: Vec<&str> = name.split(' ')
        .filter(|s| !s.is_empty())
        .collect();
    if parts.len() == 2 {
        PyList::new(py, [(parts[0], parts[1])])
    } else if parts.len() == 3 {
        let (a, b, c) = (parts[0], parts[1], parts[2]);
        // Every element of the array must have the same Rust type,
        // so each field is an owned String.
        PyList::new(py, [
            (a.to_string(), c.to_string()),
            (a.to_string(), [b, c].join(" ")),
            ([a, b].join(" "), c.to_string()),
        ])
    } else {
        PyList::empty(py)
    }
}
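
Unlike the other options, building this requires a Rust toolchain and a bit of scaffolding: the function also has to be registered in a #[pymodule], and the usual way to compile and install the extension during development is maturin. A sketch of the build step, assuming the crate is set up with maturin’s PyO3 template and the module is named with_rust:

$ pip install maturin
$ maturin develop --release

We can then from with_rust import single_name_to_first_last_names as before.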

Performance results, and some observations

Let’s compare the speed of all five options from slowest to fastest:

Version             Interpreter     Names/second
PyO3 0.20 (Rust)    CPython 3.11     3.8 million
Pure Python         CPython 3.11     5.8 million
mypyc 1.7.1         CPython 3.11     6.6 million
Cython 3.0.6        CPython 3.11     7.6 million
Pure Python         PyPy 7.3.13     13.5 million

Why these results? Let’s discuss each alternative in turn.

Why is Rust slower?

Given Rust is a fast, compiled language, which in some cases can run 20×-100× as fast as Python code, why is the Rust implementation so much slower in this case? (Yes, I am using the release profile.)

I haven’t done any profiling, but here are some theories:

  • There is overhead from converting between Python and Rust objects. CPython’s memory representation of strings is complex, while Rust uses UTF-8: switching between the two takes work.
  • Python’s string processing APIs are actually quite fast and optimized. In some cases they might actually do better than Rust’s. For example, since the string representation CPython uses differs based on the string contents, it’s possible for CPython to use more optimized implementations specifically for ASCII (see the snippet after this list).
  • Given the function is so short, we’re mostly using CPython APIs anyway.
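
On that last point: CPython stores a string using one byte per character when the text allows it, growing to two or four bytes per character only when necessary (PEP 393). A quick illustration using sys.getsizeof:

import sys

# Same number of characters, different storage per character:
print(sys.getsizeof("a" * 1000))   # ASCII: 1 byte per character
print(sys.getsizeof("é" * 1000))   # Latin-1 range: still 1 byte per character
print(sys.getsizeof("中" * 1000))  # needs 2 bytes per character
print(sys.getsizeof("🚲" * 1000))  # needs 4 bytes per character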

I did write a slightly faster version that used an array instead of a Vec, but it still only processed 4 million names/second, slower than pure Python on CPython.

Why are Cython and mypyc only a little faster than CPython?

Most of the code is calls into CPython string APIs that do the majority of the work. And those string APIs are already pretty well optimized. As a result, we’re mostly just running the same code as the Python version.
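
One way to convince yourself of this: the string methods our function calls are built-in methods implemented in C, which CPython exposes as method descriptors. A compiled wrapper that calls them is doing essentially the same work the interpreter does:

>>> type(str.upper), type(str.split)
(<class 'method_descriptor'>, <class 'method_descriptor'>)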

Why is PyPy faster?

PyPy has a just-in-time compiler so it can notice code that is being called a lot with the same types, and generate a more optimized machine code version on the fly. No compilation needed!

Can we do better?

Can we do better? Maybe.

Right now we loop over names in Python, something like:

for name in petition_names:
    variants = single_name_to_first_last_names(name)
    for first, last in variants:
        if in_voter_database(first, last):
            # We found a match!
            ...  # ... do something with it ...

We could instead pass the full list of petition names into the function we are optimizing. If we were to write it in Rust, it might be slower, or no faster. But we could release the GIL for much of the processing, which would allow us to take advantage of multiple cores. So while it wouldn’t save CPU time, it might save us clock time by using multiple cores at once.
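
For illustration, here’s what a batched interface might look like on the Python side; match_petition_names and the set-based voter_database argument are hypothetical, not part of the code above:

def match_petition_names(
    petition_names: list[str],
    voter_database: set[tuple[str, str]],
) -> list[str]:
    """Return the petition names matching a (first, last) voter record."""
    matched = []
    for name in petition_names:
        for first, last in single_name_to_first_last_names(name):
            if (first, last) in voter_database:
                matched.append(name)
                break  # One match per petition name is enough.
    return matched

A Rust version of this whole loop could release the GIL while it runs, splitting the work across threads.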

Or we could switch away from Python data structures to something like a Polars DataFrame, which might give us similar benefits while allowing us to do our task without writing any custom Rust code.
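
As a rough sketch of what that could look like (the column names and sample data are made up, and only the simple two-part case is handled):

import polars as pl

df = pl.DataFrame({"name": ["Itamar Turner-Trauring", "John Q. Public"]})

# Uppercase and split every name in one vectorized pass over the column.
df = df.with_columns(
    parts=pl.col("name").str.to_uppercase().str.split(" ")
)

# Bulk-handle the two-part names; three-part names would need similar,
# if more involved, expressions.
two_part = df.filter(pl.col("parts").list.len() == 2).with_columns(
    first=pl.col("parts").list.get(0),
    last=pl.col("parts").list.get(1),
)

Polars can also run many of these operations across multiple cores, without us managing threads ourselves.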

More broadly, this function is just one part of the whole processing algorithm, and our time might be better spent optimizing it in other ways or at a higher level of abstraction.

How should you optimize your string processing code?

So how should you optimize your Python text processing code?

  1. Simply running your program with PyPy may give you a nice boost without any changes at all… assuming your dependencies work on PyPy.
  2. Cython and mypyc are also worth trying; you’re adding a bit of packaging and setup overhead due to the need to compile extensions, but you don’t really need to rewrite your code.
  3. For text, a compiled language like Rust might not help if you’re just converting a small function. Restructuring your code to use bulk operations might change that, or at least allow you to use parallelism in cases where optimization fails.