My bad opinions

2020/09/24

A Pipeline Made of Airbags

At a former job, we used to deploy live systems by doing hot code loading. We had a thing that automatically pushed all the new code onto each server's disk, but did nothing else with it. Then an engineer would log onto one Erlang node and open the REPL. All Erlang nodes were always connected in a mesh, so any one of them could talk to all the others.

For each deploy, there'd be a little copy/pastable script for the REPL, with all the steps having been tried in development and then in a staging environment ahead of time. We started from a small template that had 4 functions:

UpgradeNode()           % upgrade the current node; returns 'ok' or errors out
NodeVersions()          % return the version of all running nodes
NodesAt(Vsn[, Count])   % return names of all or Count nodes at a given version
RollingUpgrade(Nodes)   % upgrade the nodes in the list in a rolling manner

You'd copy/paste the script on the production instance you were on, call UpgradeNode(), see if it worked, then call RollingUpgrade(...) as aggressively or as carefully as you thought was warranted. If you wanted, dozens or hundreds of instances got live-deployed in a few milliseconds without losing a single connection. If you preferred, you could take it slow, do it in stages, and carefully monitor things.
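
For flavor, here's a minimal sketch of what such shell funs could look like. This is a reconstruction, not the actual script: it assumes a single hypothetical module (my_mod) whose new .beam file is already sitting on each node's code path, and it uses the module's vsn attribute as a stand-in version marker.

%% Hypothetical reconstruction of the deploy template, pasted in the shell.
UpgradeNode = fun() ->
    code:purge(my_mod),                         % drop any lingering old code
    {module, my_mod} = code:load_file(my_mod),  % make the on-disk .beam current
    ok
end.

NodeVersions = fun() ->
    [begin
         Attrs = rpc:call(N, my_mod, module_info, [attributes]),
         {N, proplists:get_value(vsn, Attrs)}   % vsn attribute as the marker
     end || N <- [node() | nodes()]]
end.

NodesAt = fun(Vsn) ->
    %% the optional Count variant from the template is left out of this sketch
    [N || {N, V} <- NodeVersions(), V =:= Vsn]
end.

RollingUpgrade = fun(Nodes) ->
    %% shell funs can be shipped over rpc and applied on the remote node
    [ok = rpc:call(N, erlang, apply, [UpgradeNode, []]) || N <- Nodes],
    ok
end.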

The system we deployed to handled over a hundred million messages per second. It was maintained and operated by one or two developers at any given time, and we could deploy whenever without anybody even being able to tell from the operational metrics. It was really neat, and fully stateful.

That system could, if we needed it (or if we weren't confident in the safety of a given live upgrade), still be rolled out slowly in stages, because we were also running immutable infrastructure. But that was optional, and we were free to break our changesets down so that each one was trivially live-deployable, which let us ship more of them, safely. It created new possibilities when designing code.

It eventually got bulldozed by multiple attempts at normalizing on Go practices, which essentially leave you on your own when it comes to the operational aspects of the development cycle. Then containerization took over, and the whole pipeline lost any way to do the nice stateful thing that saves everyone hours of rollouts, draining, and reconnection storms with state losses.

At another job, we took roughly 3 months to figure out and implement in-place upgrades of signed packages for embedded surveillance equipment (such as cameras in an airport), where you could roll out updates to production devices without shutting down the whole infrastructure and interrupting live security. Any failure would automatically crash and roll back to the current version (in milliseconds) unless we marked a file on disk that greenlit the new version as acceptable. It even had a small admin panel where you could do live upgrades across all versions for each device and watch things change without interrupting a single data stream.
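
I won't reproduce the real implementation here, but the shape of that greenlight pattern is roughly the following watchdog; the file path, the timeout, and reverting by rebooting into the prior image are all assumptions for illustration, not the actual mechanism.

%% Hypothetical sketch of the greenlight/rollback watchdog: boot the new
%% version, and unless someone confirms it in time, fall back.
-module(greenlight).
-export([start_link/0, init/0]).

start_link() ->
    {ok, spawn_link(?MODULE, init, [])}.

init() ->
    timer:sleep(timer:minutes(10)),              % grace period for validation
    case filelib:is_regular("/data/release.ok") of
        true ->
            ok;            % an operator marked the new version acceptable
        false ->
            init:reboot()  % fall back by rebooting into the previous image
    end.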

This also got cancelled; Docker was seen as simpler to deal with for the whole cloud part of things, and nowadays nobody can really deploy this live. Product requirements were changed to fit the adopted tech. Scheduled downtime more or less became a necessity in most cases, because you can't afford to blank out security coverage while you roll things out in critical systems.

These tools still exist, but they're no longer seen as a good practice to adopt almost anywhere. I'm still sour about the whole freaking docker-meets-kubernetes mandatorily-immutable ecosystem of modern-day DevOps, because it eagerly throws the baby out with the bathwater, every time. Immutable infrastructure is good, but on its own it's also pretty underwhelming. We're now feeling the network effects where the choice no longer really exists, because everything assumes you should just be stateless all the time.

There's also a broad misconception that kubernetes (or any other cluster scheduler) replaces the concept of supervision trees in Erlang/OTP. The fact is that they operate at different scopes. Erlang's "just let it crash and restart" often works at the request level, and sometimes at an even finer granularity. You can still benefit from the cluster-level control plane, but you get something much richer if you have both. The problem is that unless you've tried both, you don't really have a good conception of what's possible, and it's easy to be locked into thinking inside the box.
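
As a concrete (if simplified) illustration of that difference in granularity, here's what per-connection supervision looks like in OTP; conn_handler is a hypothetical worker module, not anything from the systems described above.

%% Minimal per-connection supervisor: a crash takes down a single handler
%% process, which is restarted almost instantly; the node, the VM, and
%% every other connection keep running untouched.
-module(conn_sup).
-behaviour(supervisor).
-export([start_link/0, start_conn/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

start_conn(Socket) ->
    supervisor:start_child(?MODULE, [Socket]).

init([]) ->
    {ok, {#{strategy => simple_one_for_one, intensity => 10, period => 5},
          [#{id => conn,
             start => {conn_handler, start_link, []},  % hypothetical worker
             restart => transient}]}}.              % restart abnormal exits only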

To me, abandoning all these live upgrades to have only k8s is like someone asking me to get rid of all error and exception handling and reboot the computer each time a small thing goes wrong. There's a definite useful aspect to being able to do that, but losing the ability to work on things at a finer granularity is a huge loss.

The thing with "let it crash" and restarting is that it's a good default backstop mechanism for your failures. This is where you start, and if you make that scenario acceptable, then you're always in a manageable situation. This is really great. The unspoken bit about Erlang and Elixir is that from that point on, your system is a living thing that you refine by adjusting the supervision tree's structure, or by gradually handling more and more of your edge cases as you learn how to handle them.
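
In practice that refinement can be as small as the following; this is a hypothetical gen_server-based worker, not code from the post, showing the pattern of handling the one failure mode you understand while letting everything else keep crashing into the supervisor's backstop.

%% Hypothetical worker showing gradual refinement of "let it crash".
-module(kv_worker).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2]).

init(Tab) -> {ok, #{table => Tab}}.

handle_call({fetch, Key}, _From, State = #{table := Tab}) ->
    case ets:lookup(Tab, Key) of
        [{Key, Value}] ->
            {reply, {ok, Value}, State};
        [] ->
            %% the one edge case we learned to handle: a missing key
            {reply, {error, not_found}, State}
        %% anything else (corrupt entry, unexpected shape) still crashes
        %% the worker, and the supervisor restores a known-good state
    end.

handle_cast(_Msg, State) -> {noreply, State}.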

What stateless containers and kubernetes do handle is that base case of "when a thing is wrong, replace it and get back to a good state." What they don't easily let you do is then start iterating to get better and better at not losing all your state and at recovering fast. The idea is that you should be able to plug in invariants and still bail out in bad cases, but also have the option of just keeping things running when they go right: no cache to warm, no synchronization to deal with, no sessions to re-negotiate, no reinstantiation, fewer feature flags to handle, and near-instant deploys rather than long ones.

We're isolating ourselves from a whole class of worthwhile optimizations, of ways of structuring our workflows, and of ways of conceptualizing software and configuration changes.

Immutable infra took the backstop and forced everyone to use only the backstop. It's like deploying the airbag at every single red light you encounter, forever.