The wrong way to speed up your code with Numba

If your NumPy-based code is too slow, you can sometimes use Numba to speed it up. Numba is a compiled language that uses the same syntax as Python, and it compiles at runtime, so it’s very easy to write. And because it re-implements a large part of the NumPy APIs, it can also easily be used with existing NumPy-based code.
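To give a feel for what that looks like, here’s a minimal sketch (the function and the names in it are just an illustration, not code from this article): you add a decorator, and the function is compiled to machine code the first time it’s called with a given argument type.

import numpy as np
from numba import jit

@jit
def total(numbers):
    # A plain Python loop, compiled to machine code by Numba.
    result = 0
    for n in numbers:
        result += n
    return result

data = np.arange(1_000_000)
total(data)  # first call with this argument type: compiles, then runs
total(data)  # later calls reuse the already-compiled machine code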

However, Numba’s NumPy support can be a trap: it can lead you to miss huge optimization opportunities by sticking to NumPy-style code. So in this article we’ll show an example of:

  • The wrong way to use Numba, writing NumPy-style full array transforms.
  • The right way to use Numba, namely for loops.

An example: converting color images to grayscale

Consider a color image encoded with red, green, and blue channels:

from skimage import io

RGB_IMAGE = io.imread("dizzymouse.jpg")
print("Shape:", RGB_IMAGE.shape)
print("dtype:", RGB_IMAGE.dtype)
print("Memory usage (bytes):", RGB_IMAGE.size)

Here’s the output:

Shape: (525, 700, 3)
dtype: uint8
Memory usage (bytes): 1102500
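Since each uint8 element takes exactly one byte, that memory figure is just the product of the array’s dimensions:

print(525 * 700 * 3)  # 1102500 bytes, about 1.1MB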

And here’s what the image looks like:

A night-time color photo of a roller coaster, with the track going into the mouth of a giant cat, and a sign saying "Dizzy Mouse"

We want to convert this image to grayscale. Instead of having three channels for red, green, and blue, we’ll have just one channel that measures brightness, with 0 being black and 255 being white. Here’s one simplistic way to do this transformation:

import numpy as np

def tg_numpy(color_image):
    # Weighted sum of the red, green, and blue channels, using the
    # standard ITU-R BT.601 luma weights; the intermediate results
    # are full-size float64 arrays.
    result = np.round(
        0.299 * color_image[:, :, 0] +
        0.587 * color_image[:, :, 1] +
        0.114 * color_image[:, :, 2]
    )
    return result.astype(np.uint8)

GRAYSCALE = tg_numpy(RGB_IMAGE)

And here’s what the resulting image looks like:

A grayscale version of the original image

Using Numba, the wrong way

Numba lets us compile Python code to machine code, simply by adding the @numba.jit decorator. When the decorated function uses NumPy APIs, the resulting machine code doesn’t call into the NumPy library; instead, Numba provides its own mostly-compatible reimplementations of those APIs, written in the Numba language.

One way we can use Numba, then, is to take our existing NumPy code, and just add a decorator:

from numba import jit

@jit
def tg_numba(color_image):
    result = np.round(
        0.299 * color_image[:, :, 0] +
        0.587 * color_image[:, :, 1] +
        0.114 * color_image[:, :, 2]
    )
    return result.astype(np.uint8)

GRAYSCALE2 = tg_numba(RGB_IMAGE)
assert np.array_equal(GRAYSCALE, GRAYSCALE2)

Is this any faster? Let’s see:

Code                     Elapsed microseconds   Peak allocated memory (bytes)
tg_numpy(RGB_IMAGE)      2,712                  6,021,410
tg_numba(RGB_IMAGE)      2,446                  5,889,234

So it is faster, but only a little. This isn’t surprising: NumPy internally is also implemented in a compiled language, so individual operations on arrays are already quite optimized.
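These numbers will vary from machine to machine. Here’s a minimal sketch of how you might reproduce the elapsed-time comparison using the standard library’s timeit module (measuring peak allocations needs a memory profiler, which this sketch doesn’t cover). Note the warm-up call, so that Numba’s one-time compilation cost isn’t included in the measurement:

import timeit

tg_numba(RGB_IMAGE)  # warm-up: triggers compilation for this argument type

for name, fn in [("tg_numpy", tg_numpy), ("tg_numba", tg_numba)]:
    seconds = timeit.timeit(lambda: fn(RGB_IMAGE), number=100) / 100
    print(name, round(seconds * 1_000_000), "microseconds per call")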

It’s also worth looking at the memory usage. Our original image is about 1.1MB, yet we’re allocating around 6MB to transform it into a grayscale image that is itself only about 0.37MB. The reason is temporary arrays: each intermediate result is a full-size float64 array, float64 uses 8× as much memory per element as uint8, and up to two of those temporaries are alive at any given time. And since the Numba version uses the same algorithm as the original NumPy code, complete with temporary arrays, it has the same problem: peak allocated memory is roughly 6× the size of the input image.
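Here’s the back-of-the-envelope arithmetic behind that peak (a rough accounting, not an exact reconstruction of how the measurements were taken):

height, width = 525, 700

# One full-image float64 temporary:
print(height * width * 8)      # 2,940,000 bytes, about 2.9MB

# Up to two such temporaries alive at once gets us close to the
# measured peaks of roughly 5.9-6MB:
print(2 * height * width * 8)  # 5,880,000 bytes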

Using Numba, the right way

Our current code creates temporary floating point arrays, and then multiplies and adds them. But there really is no reason to have a whole temporary floating point array; that’s a result of the limits of how NumPy works. It needs to operate on whole arrays (so-called “vectorization”) so that it doesn’t use slow Python code. From an algorithm perspective, we can convert each pixel individually.

Numba doesn’t have the same limits as NumPy and normal Python: you can use for loops in Numba and your code will still run quickly. So in this case we can use a for loop to operate pixel by pixel, at the very least reducing the memory allocations in our function. Let’s try that out:

@jit
def tg_numba_for_loop(color_image):
    # Allocate the uint8 result up front; no float64 temporaries needed.
    result = np.empty(color_image.shape[:2], dtype=np.uint8)
    for y in range(color_image.shape[0]):
        for x in range(color_image.shape[1]):
            # Convert one pixel at a time, using the same BT.601 weights.
            r, g, b = color_image[y, x, :]
            result[y, x] = np.round(
                0.299 * r + 0.587 * g + 0.114 * b
            )
    return result

GRAYSCALE3 = tg_numba_for_loop(RGB_IMAGE)
assert np.array_equal(GRAYSCALE, GRAYSCALE3)

And here’s the performance and memory usage:

Code                          Elapsed microseconds   Peak allocated memory (bytes)
tg_numpy(RGB_IMAGE)           2,724                  6,021,410
tg_numba(RGB_IMAGE)           2,440                  5,889,234
tg_numba_for_loop(RGB_IMAGE)  536                    376,733

By using Numba the right way, our code is both 5× faster and far more memory efficient.
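It’s worth stressing that this only works because Numba compiles the loop. As a side experiment (not part of the comparison above), you can run the undecorated version of the same function, which Numba exposes as the dispatcher’s py_func attribute; since every pixel access then goes through the Python interpreter, expect it to be dramatically slower:

# The identical loop, run by the Python interpreter rather than as compiled code.
# Expect this to take orders of magnitude longer; exact numbers depend on your machine.
GRAYSCALE_PYTHON = tg_numba_for_loop.py_func(RGB_IMAGE)
assert np.array_equal(GRAYSCALE, GRAYSCALE_PYTHON)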

Why is the for-loop version so much faster than the NumPy-style versions? It’s not about the number of CPU instructions: tg_numba_for_loop runs 9 million CPU instructions vs. 15 million for the NumPy version, a difference of less than 2× and nowhere near enough to explain a 5× difference in runtime. If you want a start at understanding what else is going on here, check out my upcoming book on speeding up low-level code.

Software architecture as a performance constraint

You can speed up your code at multiple levels. In this case, we used an especially powerful approach: switching to a better software architecture.

In particular, NumPy’s full-array paradigm puts hard limits on how you can implement your code. By switching to a compiled language where for loops are fast, you have far more options for how you structure your algorithm. As you can see, this lets you reduce memory usage, implement algorithms that would be impossible with NumPy alone, and often significantly speed up your code.