A thousand little details: developing software for ops

Writing software for ops is tricky:

  • If you’re a developer, you solve problems by writing software.
  • But if you’re in ops, usually more software means more problems.

So if you’re developing software for ops, how can you write software that doesn’t make ops’ life worse?

In this article I’ll go over some of the reasons writing software for ops is tricky, and some of the ways I’ve tried to mitigate these problems in one specific project (improving Docker packaging). My choices ranged from code as configuration, to building a template instead of a tool.

No magic bullets here, but some ideas (and questions) that you may find useful.

Two problems: too much time, too much space

There are two big reasons that from ops’ perspective more software means more problems.

First, operational software manages software, and so it encounters all the historical flaws of the software it interacts with. Consider Docker packaging:

  • Linux is a re-implementation of Unix, an operating system that started out as a single-user system and was then extended to support multiple users.
  • Linux then added ways of isolating some processes, to make them think they were the only user of the system.
  • Docker then used those APIs, and other Linux APIs, to build a usable abstraction layer on top.

But this abstraction only goes so far.

And so as a result of this history, Docker packaging involves the accumulated annoyances of decades of software development: signal handling (which dates to the early 1970s), TCP addressing (1980s), a number of minor design mistakes made in the Dockerfile format (2010s), and so on.

Second, above and beyond this historical complexity, there is also the fact that each organization, and often different teams in the same organization, deploy and release software in different ways. That means the problem space you need to address can be quite large.

Because of both these reasons—too much history, and too large of a problem space—it’s very easy to end up with ops tools that require writing even more software to work around their limitations.

So how do you write software for ops that isn’t an impediment, software that’s actually useful? There’s no easy general answer, but we can think it through in particular situations.

Let’s see what I did in the domain I’ve been focusing on: Docker packaging.

Decision point #1: How do you deal with a large problem space?

Different organizations and teams want to package things in different ways, in different languages, for different use cases. So how could I improve Docker packaging?

If the goal is to build something better, shrinking the problem space is a good move. In my case, I decided to focus on Docker packaging:

  1. For production use, not development environments.
  2. For Python only, not other programming languages.

This immediately made it easier to see ways to add value. For example, there are only so many ways to install packages in Python—I could detect and then automate most of them, and thereby meet the needs of most users.
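
As a rough sketch of what that detection might look like (the function name and return values are just for illustration, not the template’s actual code; the file names match the diagnostic output shown later in this article):

from pathlib import Path

def detect_install_method(project_dir="."):
    # A simplified sketch: real detection would also need to handle lock
    # files, several of these files being present at once, and other
    # package managers.
    project = Path(project_dir)
    if (project / "requirements.txt").exists():
        return "pip install -r requirements.txt"
    if (project / "Pipfile").exists():
        return "pipenv install --deploy"
    if (project / "setup.py").exists():
        return "pip install ."
    raise RuntimeError("No recognized way to install dependencies")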

Alternatives

  1. Dockerfile deals with scope by verging on being a mini-programming language: it can run shell scripts, pass in arguments, etc. So it addresses a large problem space, at the cost of being more complex.
  2. Terraform deals with scope by supporting plugins—you can configure new domains by adding a plugin for that domain.

Decision point #2: How should the solution be delivered?

There are many ways to solve a problem. How do you make Docker packaging for Python better?

One way is to write lots of documentation about it, which I’ve been doing as part of my guide to Docker packaging for Python. But documentation still requires someone to read all of it and apply it correctly.

Another alternative is building a new tool. But a tool can be constraining: what works for one organization won’t necessarily work for another, and so tools end up either too flexible and therefore difficult to use correctly (docker build and Dockerfile), or too constraining to be usable by more than one team.

In the end I decided on a template, code that gets checked in to the users’ repository. It offers some of the benefits of a tool: pre-written code that can automate common tasks. But it also offers more flexibility: the ability to configure any and all of it for a particular project.

As always, there are tradeoffs, and templates make upgrades harder—I discuss how I deal with this later in this article.

Decision point #3: How should configuration work?

While some requirements can just be automated away, other requirements need configuration. But configuration has its own set of problems.

Consider tagging: images in Docker can be tagged. So how should users choose these tags?

I could have used a configuration format like YAML:

tags:
  - latest
  - release-1.0

But this is a problem: users will want tags to match particular releases, which might be indicated by git tags. Or they might want to do Docker tagging based on the version control branch: one tag (latest, say) for the release branch, and other tags for feature branches.

Given a YAML file, users will need to write code to generate the YAML.
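
For example, to get per-branch tags out of a static YAML file, a user would end up wrapping the build in a script along these lines (a sketch; the config.yaml file name is made up for illustration):

# A wrapper script the YAML approach forces on users: figure out the
# git branch, then write the matching tags into the YAML file.
from subprocess import check_output

branch = check_output(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"],
    universal_newlines=True,
).strip()
tags = ["latest"] if branch == "master" else ["branch-" + branch]

with open("config.yaml", "w") as f:
    f.write("tags:\n")
    for tag in tags:
        f.write("  - " + tag + "\n")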

So if users are already writing code, why not make the configuration file be code in the first place? This means tags can be configured in plain old Python. For example, per-branch tags:

# Figure out the git branch:
from subprocess import check_output
_git_branch = check_output(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"],
    universal_newlines=True
).strip()

# Set the TAGS configuration option:
if _git_branch == "master":
    TAGS = ["latest"]
else:
    TAGS = ["branch-" + _git_branch]

Alternatives

  1. The configuration file could be in YAML or a similar format, but with support for some sort of simple substitution (or more complex templating) based on git tags and branches.
  2. The configuration file could be in YAML or a similar format, but support options that pre-configure some common policies.

Decision point #4: How should upgrades work?

If the software I’m delivering is a template, that poses a problem: users can modify any file at all, which means upgrades to newer releases will be a tricky, manual process.

My solution was to split the template into two parts:

  1. The part users are expected to modify.
  2. The part users aren’t expected to modify, but which they can modify if they need to.

The hope is that the majority of users will only modify category 1, and so will have easier upgrades. Modifications in category 2 will necessitate manual upgrades, but users who need to make those sorts of changes probably wouldn’t have been happy with the constraints of a tool anyway. So even if they face more difficult upgrades, the template is still providing value.
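
In practice the split might look something like this (a hypothetical layout; the only name taken from a real example is the docker-shared/ script that appears in the diagnostics below):

your-repo/
├── docker-config.py              # category 1: expected to be modified (hypothetical name)
└── docker-shared/                # category 2: not expected to be modified
    └── install-py-dependencies.py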

Decision point #5: How will users debug problems?

Since ops software involves a thousand little details you need to get right, debugging is always a joy. Or, really, it’s annoying and painful, because failures can happen at half a dozen different layers of abstraction.

So I decided, where possible, to create diagnostic tools that will at least explain to the user what decisions the template is making. For example, how it’s going to install Python dependencies:

$ python3 docker-shared/install-py-dependencies.py diagnose
Does requirements.txt exist? no.
Does Pipfile exist? no.
Does setup.py exist? yes.
setup.py exists, so I assume dependencies will be installed using that later on.

And there are also diagnostics for the configuration file, etc.
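
A diagnostic for a code-as-configuration file can be quite simple: run the file and report the values it ended up computing. Here’s a minimal sketch, assuming a Python configuration file like the one shown earlier (the docker-config.py file name is again hypothetical):

# Minimal sketch of a configuration diagnostic: execute the config file
# and print the option values (like TAGS) it actually produced.
import runpy

def diagnose_config(path="docker-config.py"):
    namespace = runpy.run_path(path)
    for name in sorted(namespace):
        if name.isupper():  # configuration options are ALL_CAPS, e.g. TAGS
            print("{} = {!r}".format(name, namespace[name]))

if __name__ == "__main__":
    diagnose_config()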

Alternatives

  1. You can write extensive documentation, which is part of what I ended up doing.
  2. You can offer forums where people can ask questions.
  3. You can provide commercial support.
  4. You can provide linting tools, e.g. hadolint for Dockerfiles.
  5. You can abstract the underlying details away so well that the tool Just Works (this is tricky!).

Other problems, other choices

The decisions I made for my particular template are not necessarily the right choices for everyone. If you’re building sufficiently constrained software you might be able to simplify away all the configuration, or perhaps you might be able to solve the problem with just documentation.

But before jumping in and writing a new tool, do spend some time thinking about the problems you’re facing: accumulated history and the need for flexibility.

And if you’re interested, you can also check out the product that is the end result of my decisions: a template to create production-ready Docker images for Python.