How to shrink NumPy, SciPy, Pandas, and Matplotlib for your data product

Scott Zelenka
Towards Data Science
7 min read · Sep 25, 2018


If you’re a Data Scientist or Python developer deploying data products in a microservice framework, there’s a good chance you need to leverage the common scientific ecosystem modules NumPy, SciPy, Pandas, or Matplotlib. The “micro” in microservice suggests you should keep components as tiny as possible. However, these four modules are gargantuan when installed through PIP!

Let’s explore why NumPy, SciPy, Pandas, and Matplotlib are so ginormous when installed through PIP, and devise strategies (with source code) to optimize your data product inside a microservice framework.

Python Scientific Modules

Almost all data products I’ve been involved in are deployed through Docker, so this post will use that framework. In keeping with the “micro” in microservice, the very first Docker best practice centers on keeping your image size small. If you’re interested in general best practices for reducing the size of your Docker image, there are many posts on Medium. This post focuses entirely on shrinking the disk space consumed by NumPy, SciPy, Pandas, and Matplotlib.

Just use the precompiled binaries?

The general advice you may read on the Internet regarding NumPy, SciPy, Pandas, and Matplotlib is to install them through your Linux package manager (e.g. apt-get install -y python-numpy). However, in practice, those package managers often trail multiple versions behind the official releases published on PyPI or GitHub. If your data product requires a specific version of these modules, this simply will not work.

Just trust the package manager?

The Anaconda distribution does a decent job of compiling these libraries with a smaller footprint than PIP, so why not just use that Python distribution? In my experience, if you’re already at a point where you need a specific version of NumPy, SciPy, Pandas, or Matplotlib, there’s a good chance you need specific versions of other packages as well. The unfortunate news is that those other packages probably won’t exist in the Conda repository. So simply using Anaconda or Miniconda and referencing the Conda repository will not satisfy your needs. Not to mention the additional bloat of Anaconda, which you don’t need in your Docker image. Remember, we want to keep things as tiny as possible.

Docker history

To view the layer-by-layer impact of the commands run to build your Docker image, the history command is very helpful:

$ docker history shrink_linalg
IMAGE               CREATED              CREATED BY                                      SIZE                COMMENT
435802ee0f42        About a minute ago   /bin/sh -c buildDeps='build-essential gcc gf…   508MB

In the above, we can see that one layer alone consumed 508MB, when all we did in that layer was install NumPy, SciPy, Pandas, and Matplotlib with the command:

pip install numpy==1.15.1 pandas==0.23.4 scipy==1.1.0 matplotlib==3.0.0
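For reference, here is a minimal sketch of the kind of Dockerfile layer behind that history entry, assuming a python:3.6-slim base. The buildDeps list matches the truncated history output above, but the exact packages are an assumption (and note that SciPy’s build also needs a BLAS/LAPACK library, covered later in this post):

FROM python:3.6-slim

# Install compilers, build the four modules from source, then purge the
# compilers in the same RUN so they never persist in a separate layer.
RUN buildDeps='build-essential gcc gfortran' \
    && apt-get update \
    && apt-get install -y --no-install-recommends $buildDeps \
    && pip install numpy==1.15.1 pandas==0.23.4 scipy==1.1.0 matplotlib==3.0.0 \
    && apt-get purge -y --auto-remove $buildDeps \
    && rm -rf /var/lib/apt/lists/*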

We can also look at the disk space consumed by each package within the image using the du command:

$ docker run shrink_linalg /bin/bash -c "du -sh /usr/local/lib/python3.6/site-packages/* | sort -h"
...
31M /usr/local/lib/python3.6/site-packages/matplotlib
35M /usr/local/lib/python3.6/site-packages/numpy
96M /usr/local/lib/python3.6/site-packages/pandas
134M /usr/local/lib/python3.6/site-packages/scipy

These modules are massive! You could argue that the capabilities of these modules justify that much disk space, but they don’t need to be this large. We can remove a significant portion (up to 60%) of the unnecessary disk space consumed without impacting the performance of the modules themselves!

BLAS Optimization

One of the advantages of NumPy and SciPy is the optimization that has gone into their linear algebra methods. There are a number of LinAlg libraries for this; in this post, we’re just using OpenBLAS. But we’ll need to verify that NumPy and SciPy were installed in a way that leverages this library.

$ docker run shrink_linalg /bin/bash -c "python -c 'import numpy as np; np.__config__.show();'"
...
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/lib']
language = c
define_macros = [('HAVE_CBLAS', None)]
...

The important thing here is to have OpenBLAS recognized and mapped to the correct directory. This will enable NumPy and SciPy to leverage the LinAlg optimizations from that library.
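If openblas_info comes back empty instead, the library was probably missing when PIP compiled the modules. Here is a minimal sketch of the fix on a Debian-based image, to be run before pip install; the package names libopenblas-dev and liblapack-dev are assumptions about your base image’s repositories:

# Make OpenBLAS headers and libraries available before pip compiles
# NumPy and SciPy, so their builds detect and link against OpenBLAS.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libopenblas-dev liblapack-dev \
    && rm -rf /var/lib/apt/lists/*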

PIP Install Options

When installing through PIP within a Docker container, there’s no point in keeping the cache around. Let’s add some flags to instruct PIP how to behave.

  • --no-cache-dir
    PIP uses caching to prevent duplicate HTTP requests to pull modules from the repository before installing them on the local system. When running inside a Docker image, there’s no need to preserve this cache, so disable it with this flag.
  • --compile
    Compile Python source files to bytecode. When running inside a Docker image, it’s highly unlikely that you’ll need to debug a mature installed module (such as NumPy, SciPy, Pandas, or Matplotlib).
  • --global-option=build_ext
    To inform the C compiler that we want to add additional flags during compile and link, we need to set these additional “global-option” flags; the combined command is shown after this list.
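Putting the three flags together with the version pins from earlier gives an install line like this (a sketch; the CFLAGS that the build_ext step picks up are covered in the next section):

pip install --no-cache-dir --compile --global-option=build_ext \
    numpy==1.15.1 pandas==0.23.4 scipy==1.1.0 matplotlib==3.0.0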

C Compiler Flags

Python has a wrapper for C-Extension called Cython, which enables developers to write C code in a Python-like syntax. The advantage of this, is that it allows for a lot of the optimization of C, but with the ease of writing Python. NumPy, SciPy, and Pandas leverage Cython a lot! Matplotlib appears to contain some Cython as well, but to a much lesser extent.

These compiler flags are passed to the GNU compiler installed within your Dockerfile. To cut to the chase, we’re going to investigate only a handful of them (combined into a single install command after this list):

  • Disable debug statements (-g0)
    Because we’re sticking our data product into a Docker image, it’s highly unlikely we’ll be doing any real-time debugging of the Cython code from NumPy, SciPy, Pandas, or Matplotlib.
  • Remove symbol files (-Wl,--strip-all)
    If we’re never going to debug the build of these packages within the Docker image, there’s no sense in keeping around the symbol files required for debugging either.
  • Apply nearly all supported optimizations that do not involve a space-speed tradeoff (-O2), or optimize for disk space (-Os)
    The challenge with these optimization flags is that while they can increase or decrease disk size, they also change the run-time performance of the compiled binary! Without explicitly testing the impact on your data product, these can be risky (but at the same time, potentially very effective).
  • Location of header files (-I/usr/include:/usr/local/include)
    Be explicit in telling GCC where to look for the header files needed to compile the Cython modules.
  • Location of library files (-L/usr/lib:/usr/local/lib)
    Be explicit in telling GCC where to look for the library files needed to compile the Cython modules.
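Combining the flags above into the CFLAGS environment variable, an install using only the low-risk choices (debug and symbol stripping plus explicit paths, with no -O juggling) might look like this sketch:

# -g0 disables debug statements; -Wl,--strip-all removes symbol files;
# -I and -L make the header and library search paths explicit.
CFLAGS="-g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib" \
    pip install --no-cache-dir --compile --global-option=build_ext \
    numpy==1.15.1 pandas==0.23.4 scipy==1.1.0 matplotlib==3.0.0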

But what’s the impact of setting these CFLAGS during an install via PIP?

The lack of debug information may give experienced developers pause, but if you really need it at a later date, you can always rebuild the Docker image with the flags enabled to reproduce any stack trace or core dump. The larger concern is exceptions which aren’t reproducible, which would not be diagnosable without the symbol files. But again, nobody does that in a Docker image.

PIP with CFLAG Comparison

(Chart: disk consumption inside the Docker image, by CFLAG strategy)

Taking the best optimization CFLAG strategy, you can reduce the disk footprint of your Docker image by 60%!

For my data products, I simply remove the debug/symbols and don’t perform any additional optimizations. If you’re curious how others perform similar CFLAG optimizations, check out the official Anaconda Recipes for NumPy, SciPy, Pandas, and Matplotlib. These are optimized to work within the Anaconda distribution of Python, and may or may not be relevant for your specific Docker data product deployment.

Unit tests

Just because we were able to shrink the compiled binaries of each module doesn’t mean we’re done: we should do a sanity check to validate that the C compiler flags we used didn’t break the functionality of these modules.

All of these packages leverage the pytest module, which we didn’t install in our Docker image because it provides no value in production. We can install it and execute the tests from inside, though:

  • NumPy tests inside our Docker image:
$ docker run shrink_linalg /bin/bash -c "pip install pytest; python -c \"import numpy; numpy.test('full');\""
4675 passed ... in 201.90 seconds
  • SciPy tests inside our Docker image:
$ docker run shrink_linalg /bin/bash -c "pip install pytest; python -c \"import scipy; scipy.test('full');\""
13410 passed ... in 781.44 seconds
  • Pandas tests inside our Docker image:
$ docker run shrink_linalg /bin/bash -c "pip install pytest; python -c \"import pandas; pandas.test();\""
22350 passed ... in 439.35 seconds

Quite a few tests were skipped or xfailed; depending on your data product and/or the versions installed, these could be expected, or you may want to investigate further. In most cases, if it’s not a hard failure, it’s likely okay to proceed.

Hopefully this helps guide you toward smaller Docker images when your data product requires NumPy, SciPy, Pandas, or Matplotlib! I encourage you to experiment with other CFLAGS and test both the disk space and the unit test performance.

Source Code
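As a single reference point, here is a minimal end-to-end Dockerfile sketch combining the pieces above; the base image and the OpenBLAS/LAPACK package names are assumptions, so adapt them to your own deployment:

FROM python:3.6-slim

# Strip debug statements and symbol files; make header/library paths explicit.
ENV CFLAGS="-g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib"

RUN buildDeps='build-essential gcc gfortran' \
    && apt-get update \
    && apt-get install -y --no-install-recommends $buildDeps libopenblas-dev liblapack-dev \
    && pip install --no-cache-dir --compile --global-option=build_ext \
        numpy==1.15.1 pandas==0.23.4 scipy==1.1.0 matplotlib==3.0.0 \
    && apt-get purge -y --auto-remove $buildDeps \
    && rm -rf /var/lib/apt/lists/*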
