Materials for my Strata Data talk

containers
data science
reproducibility
Published

September 10, 2018

I’m excited to be speaking at Strata Data in New York this Wednesday afternoon! My talk introduces the benefits of Linux containers and container application platforms for data science workflows.

There are a lot of introductory tutorials about Linux containers, some of which are even ostensibly targeted to data scientists. However, most of these assume that readers in general (and data scientists in particular) really want to get their hands dirty right away packaging software in containers: “here’s a container recipe, here’s a YAML file, now change these to meet your requirements and you’re ready to go.”

I’ve spent a lot of time packaging software and, while I’m not bad at it, there are definitely things I’d rather be doing. Unfortunately, the ubiquity of container tooling has democratized software packaging without making the hard parts any easier; in the worst case, container tooling just makes it really easy to produce bad or unsafe binary packages. So, instead of showing my audience how to make container recipes, I wanted to focus on a few high-level tools that can enable anyone to enjoy the benefits of containers without having to become a packaging expert.

In the remainder of this post, I’ll share some more information about the tools and communities I mentioned.

Binder

The first tool I discussed is Binder, which is a service that takes a link to a Git repository with iPython notebooks and a Python requirements file and will build and start up a Jupyter server in a container to serve those notebooks. The example I showed was [this notebook repository] (https://github.com/willb/probabilistic-structures/) from my DevConf.us talk, which you can run under Binder by clicking here. Finally, like all of the tools I’ll mention, Binder is open-source if you want to run your own or contribute.

source-to-image

If you want a little more flexibility to build container images from source repositories without dealing with the hassles of packaging, the source-to-image tool developed by the OpenShift team at Red Hat is a great place to get started. The source-to-image tooling lets developers or data scientists focus on code while leaving the details of building container images to a packaging expert who develops a particular builder image. In my talk, I showed how to use s2i to build the same notebook I’d served with Docker, using Graham Dumpleton’s excellent notebook s2i builder image and then deployed this image with OKD running on my laptop to get much the same result as I would with Binder; watch the embedded video to see what it looked like:

You aren’t restricted to reimplementing notebook serving with s2i, though; any time you want a repeatable way to create a container from a source repository is a candidate for a source-to-image build. Here are two especially cool examples:

It’s also possible to set up source-to-image builds to trigger automatically when your git repository is updated – check the OpenShift architecture documentation and the OpenShift developer documentation for more details.

radanalytics.io and Kubeflow

The radanalytics.io community is focused on enabling intelligent applications on Kubernetes and OpenShift. The community has produced a containerized distribution of Apache Spark, source-to-image builders (as mentioned above), container images for Jupyter notebooks, and TensorFlow training and serving containers, as well as a source-to-image builder to generate custom TensorFlow binaries optimized for any machine. If your work involves bridging the gap between prototypes and production, or if you work with a cross-functional team to build applications that depend on machine learning, check it out!

Kubeflow is a community effort to package a variety of machine learning libraries and environments for Kubernetes, so that data scientists can work against the same environments locally that their organizations will ultimately deploy in production. So far, the community has packaged JupyterHub, TensorFlow, Seldon, Katib, PyTorch, and other frameworks.

Both of these communities are extremely friendly to newcomers, so if you want to get started building tools to make it easier to use containers for data science or machine learning, they’re great places to start!