This blog post is adapted from material I learned during the 2021 San Diego Supercomputer Center (SDSC) Summer Institute. This was an introductory boot camp to high-performance computing (HPC), and one of the modules taught the application of Numba for in-line parallelization and speeding up of Python code.
What is Numba?
According to its official web page, Numba is a just-in-time (JIT) compiler that translates subsets of Python and NumPy code into fast machine code, enabling it to run at speeds approaching those of C or Fortran. This is because JIT compilation compiles a function only when it is first called. Numba also caches the compiled machine code for each combination of argument types passed to a function, which eliminates the need for recompilation every time the function is called again with the same data types.
This blog post will demonstrate a simple example of using Numba and its most commonly used decorator, @jit, via Jupyter Notebook. The Binder file containing all the executable code can be found here.
Note: The ‘@’ symbol is used to indicate a decorator.
Installing Numba and Setting up the Jupyter Notebook
First, in your command prompt, enter:
pip install numba
Alternatively, you can also use:
conda install numba
Next, import Numba:
import numpy as np
import numba
from numba import jit
from numba import vectorize
Great! Now let’s move on to using the @jit decorator.
Using @jit for executing functions on the CPU
The @jit decorator works best on numerical functions that use NumPy. It has two modes: nopython mode and object mode. Setting nopython=True tells the compiler to compile the entire decorated function so that it runs without involving the Python interpreter. This setting leads to the best performance. However, when:
- nopython=True fails,
- nopython=False is set, or
- nopython is not set at all,
the compiler falls back to object mode. In object mode, Numba identifies the loops it can compile into machine code and runs the remaining code in the interpreter.
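To see the difference between the two modes in practice, here is a minimal sketch (the function name is my own, and the use of pandas follows the kind of example shown in Numba's documentation) of a function that cannot be compiled in nopython mode:
import pandas as pd

@jit(nopython=True)
def use_pandas(values):
    # pandas objects are not understood by the nopython compiler,
    # so the first call raises a TypingError instead of compiling
    return pd.Series(values).mean()

# use_pandas(np.arange(10.0))  # raises a TypingError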
Here, @jit is demonstrated on a simple matrix multiplication function:
# A matrix-vector multiplication implemented as row-by-row dot products
@jit(nopython=True)
def matrix_multiplication(A, x):
    b = np.empty(shape=(x.shape[0], 1), dtype=np.float64)
    for i in range(x.shape[0]):
        b[i] = np.dot(A[i, :], x)
    return b
Remember – the use of @jit means that this function has not been compiled yet! Compilation only happens the first time you call the function:
A = np.random.rand(10, 10)
x = np.random.rand(10, 1)
matrix_multiplication(A, x)
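As a quick optional check of the lazy, per-type compilation described earlier, the compiled function exposes a .signatures attribute that lists the argument types it has been compiled for so far:
matrix_multiplication.signatures
# e.g. [(array(float64, 2d, C), array(float64, 2d, C))]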
But how much faster is Numba really? To find out, some benchmarking is in order. Jupyter Notebook has a handy magic command, %timeit, that runs a statement many times in a loop to report its average execution time. It can be used as follows:
%timeit matrix_multiplication(A,x)
# 11.4 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba has a special .py_func attribute that effectively allows the decorated function to run as the original, uncompiled Python function. Using this, we can compare its runtime to that of the compiled version:
%timeit matrix_multiplication.py_func(A,x)
# 35.5 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
From here, you can see that the Numba version runs about three times faster than the pure NumPy version. In addition to NumPy arrays, Numba also supports tuples, integers, floats, and Python lists. All other Python features supported by Numba can be found here.
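As a small illustration of this (the function below is my own sketch, not part of the original tutorial), a jitted function can loop over a homogeneous Python list and return its results as a tuple:
@jit(nopython=True)
def min_max_mean(values):
    # values can be a homogeneous Python list or a 1-D NumPy array
    lo = values[0]
    hi = values[0]
    total = 0.0
    for v in values:
        total += v
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi, total / len(values)

min_max_mean([1.0, 2.5, 3.0, -4.2])
# (-4.2, 3.0, 0.575)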
Besides explicitly declaring @jit at the start of a function definition, Numba also makes it simple to turn an existing NumPy function into a Numba function by applying jit(nopython=True) directly to the original function. This essentially uses the @jit decorator as a function. The function below, which calculates absolute percentage relative error, demonstrates how this is done:
# Calculate percentage relative error
def numpy_re(x, true):
    return np.abs((x - true) / true) * 100

numba_re = jit(nopython=True)(numpy_re)
And we can see how the Numba version is faster:
%timeit numpy_re(x, 0.66)
%timeit numba_re(x, 0.66)
where the NumPy version takes approximately 2.61 microseconds to run, while the Numba version takes 687 nanoseconds.
Inline parallelization with Numba
The @jit decorator can also be used to enable inline parallelization by setting parallel=True. Parallelization in Numba is done via multi-threading: the work is split across threads that run on the available CPU cores. An example of this can be seen in the code snippet below, a function that calculates the normal probability density of a set of data given arrays of means and standard deviations:
SQRT_2PI = np.sqrt(2 * np.pi)

@jit(nopython=True, parallel=True)
def normals(x, means, sds):
    result = np.exp(-0.5 * ((x - means) / sds)**2)
    return (1 / (sds * SQRT_2PI)) * result
As usual, the function is compiled the first time it is called:
means = np.random.uniform(-1,1, size=10**8)
sds = np.random.uniform(0.1, 0.2, size=10**8)
normals(0.6, means, sds)
To appreciate the speed-up that Numba’s multi-threading provides, compare the runtime for this with:
- A decorated version of the function with a disabled parallel pass
- The uncompiled, original NumPy function
The first example can be timed by:
normals_deco_nothread = jit(nopython=True)(normals.py_func)
%timeit normals_deco_nothread(0.6, means, sds)
# 3.24 s ± 757 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The first line of the code snippet makes an uncompiled copy of the normals function (via .py_func), and then applies the @jit decorator to it. This effectively creates a version of normals that uses @jit but is not multi-threaded. This run of the function took approximately 3.2 seconds.
For the second example, simply:
%timeit normals.py_func(0.6, means, sds)
# 7.38 s ± 759 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, compare both of these examples to the runtime of the decorated, multi-threaded normals function:
%timeit normals(0.6, means, sds)
# 933 ms ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The decorated, multi-threaded function is significantly faster (933 ms) than the decorated function without multi-threading (3.24 s), which in turn is faster than the uncompiled original NumPy function (7.38 s). However, the degree of speed-up may vary depending on the number of CPUs that the machine has available.
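For explicit loops, which Numba's automatic parallelization does not always cover, the same parallel=True setting can be combined with numba.prange to mark a loop as parallel. The sketch below (the function and data are my own, not part of the original tutorial) distributes the row sums of a large array across the available threads:
from numba import prange

@jit(nopython=True, parallel=True)
def row_sums(a):
    # each iteration of the prange loop is independent,
    # so Numba spreads the iterations across CPU threads
    n = a.shape[0]
    out = np.empty(n, dtype=np.float64)
    for i in prange(n):
        out[i] = a[i, :].sum()
    return out

row_sums(np.random.rand(10**4, 10**3))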
Summary
In general, the improvements achieved by using Numba on top of NumPy functions are marginal for simple, few-loop functions. Nevertheless, Numba is particularly useful for large datasets or high-dimensional arrays that require a large number of loops, and would benefit from the one-and-done compilation that it enables. For more information on using Numba, please refer to its official web page.