From VMs to Containers to Kubernetes, an 18-month journey (Part 1)

Dali Kilani
Published in Lifen.Engineering
6 min read · Sep 25, 2018


Lifen is a SaaS healthcare platform for exchanging medical documents between medical professionals (see our introductory article for more info). The original implementation was largely Java-based and consisted of a handful of services (~5) following an SOA model, as opposed to a microservices model:

  • Authentication service
  • Medical directory service
  • Document Exchange service
  • A couple of other miscellaneous ones

The container adoption wave was taking the world by storm in 2015–2016, but Lifen started with a VM-based approach. How come?

Being in the healthcare space, we had relatively limited hosting options, because a provider (in France) has to comply with pretty stringent regulations to be able to offer hosted VMs to SaaS operators. These hosting providers tend to shy away from the newest technologies, partly because there are many “fads” and partly because, quite simply, most of their customers don’t ask for the new shiny things.

But here we are, a couple of cloud-minded devops people in charge of the Lifen platform in early 2017. Our provider at the time offered us a traditional “managed hosting” service: every change to our infrastructure (adding a VM, opening a port, installing a package on the system!) had to go through the provider’s support portal or a phone call. Not really what one would call a devops-compatible environment (as a side note for this provider: it’s not a good practice to put high friction in front of a customer who wants to spend money by ordering more things). At the same time, we found ourselves backed into a corner at a critical moment, when Heartbleed was discovered, because of conflicting libraries between two software components on one of our VMs, the one running HAProxy:

  • The updated OpenSSL version that fixed the Heartbleed vulnerability was needed by the HAProxy instance that terminated our HTTPS traffic
  • The custom OpenSSL patch maintained by the public institution in charge of the smart cards that identify medical professionals in France wasn’t available for that newly updated, Heartbleed-immune OpenSSL, and HAProxy needed that patch too

The dilemma: do we stop using smart cards in our signup flow, do we let Heartbleed through, or do we maintain two OpenSSL versions on the same VM and run two HAProxy instances, with LD_LIBRARY_PATH tricks to load the right one for each?

Any of these paths clearly meant major operational pain and major security exposure. Two HAProxy containers, each with its own OpenSSL, each listening on a different port on the host while using the “standard” ports inside the containers, would have been a simple, easy solution. We decided that containers were really our way forward because they isolated our application components from the underlying OS, and we started working toward that goal.
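
For illustration, a rough sketch of what that could have looked like (image names, tags and ports here are hypothetical, not what we actually ran):

    # HAProxy built against the Heartbleed-patched OpenSSL, for public HTTPS traffic
    docker run -d --name haproxy-https -p 443:443 example/haproxy-openssl-patched:latest

    # HAProxy built against the older OpenSSL carrying the smart-card patch,
    # exposed on a different host port but still listening on 443 inside the container
    docker run -d --name haproxy-smartcard -p 8443:443 example/haproxy-openssl-smartcard:latest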

Source: kyohei ito on Flickr

To switch to containers, we needed a hosting provider able to give us the right level of control while maintaining regulatory compliance. Luckily, OVH Healthcare started offering an HDS-compliant (HDS = Hébergement de Données de Santé in French, or Healthcare Data Hosting in English) hosting solution in late 2016, so we decided in June 2017 to start preparing for a “double whammy”:

  • Migrating to a different hosting provider AND
  • Switching to Docker containers as our standard packaging mechanism for all our applications/components.

At the same moment, we started introducing more technical stacks into our platform (Rails, NodeJS, Python). Containers’ ability to package multiple stacks into a standard format that could be deployed the same way was key for us to standardise our deployment processes across stacks, as well as log collection, monitoring, etc.
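
To illustrate the point: whatever the stack, the deliverable is the same kind of artifact, a Docker image described by a short Dockerfile. The one below is a generic Node.js sketch, not one of our actual files:

    FROM node:8-alpine
    WORKDIR /srv/app
    # install dependencies first so this layer is cached between builds
    COPY package.json package-lock.json ./
    RUN npm install --production
    # then add the application code itself
    COPY . .
    CMD ["node", "server.js"]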

We spent 2–3 weeks learning about the OVH “Dedicated Cloud” service, which consists of a managed vSphere cluster that complies with HDS regulations. Then we set up the cluster in a simple way and started developing Ansible playbooks to cover the most critical actions on it: adding a VM, configuring it as a Docker host, and running applications packaged as Docker containers on it with their data as host-mapped volumes. With that in hand, we started deploying a production-like environment. At the same time, we introduced Docker packaging into our CI/CD workflows for all our components/applications. This effort took us 6 weeks to complete.
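
To give an idea of the shape of those playbooks, here is a minimal sketch (the host group, paths and package names are illustrative, not our actual roles):

    - hosts: docker_hosts
      become: true
      tasks:
        - name: Install the Docker engine and docker-compose
          apt:
            name: [docker.io, docker-compose]
            state: present
            update_cache: yes

        - name: Copy the application's docker-compose.yml to the host
          copy:
            src: files/myapp/docker-compose.yml
            dest: /opt/myapp/docker-compose.yml

        - name: Pull the latest images
          command: docker-compose pull
          args:
            chdir: /opt/myapp

        - name: (Re)create the containers
          command: docker-compose up -d --force-recreate
          args:
            chdir: /opt/myapp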

A couple of learnings here:

  • If you adopt containers, adopt them across all environments (local dev, staging, production), or it will come back to hurt you
  • Train your team on containers/Docker so that they’re comfortable with the concepts and the day-to-day use
  • The combo of Docker containers + Java 8 is not great: Java 8 doesn’t “see” the container resource limits, which can break your resource management. Oracle finally started offering a more container-friendly runtime with Java 10 (we haven’t tried it yet); see the sketch after this list for the workarounds available in the meantime.
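
A sketch of those options (treat these as starting points rather than our production settings; flag availability depends on the exact JDK build):

    # Java 8u131+: experimental flags that make the JVM respect the cgroup memory limit
    java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForDocker -jar app.jar

    # Or simply pin the heap explicitly so it never exceeds the container limit
    java -Xmx512m -Xms512m -jar app.jar

    # Java 10+: container awareness is enabled by default (-XX:+UseContainerSupport)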

Finally, we needed to do a trial migration from the old hosting provider to OVH (if there is interest in the details of this exercise, please ask in the comments). The trial migration went well, so we did a final database dump and restore on the new setup and updated the DNS on a Sunday night, with just 1 hour of downtime, after executing a pre-planned playbook of about 30 steps. And on a Monday in mid-September 2017, we were happy to enter this new containerised world.

Let’s summarise what we gained through this transition:

  • Complete isolation between the application layer and the host OS (we changed from Debian to Ubuntu 16.04 during this migration, to benefit from its faster release cycle and patch releases)
  • Very good progress toward full automation through Ansible
  • 90% of applications were now deployed the same way:

  • Copy docker-compose.yml to the host
  • docker-compose pull
  • docker-compose up -d --force-recreate

Example docker-compose.yml below:
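
Here is a minimal sketch of such a file (the service name, image and port are placeholders rather than our actual configuration):

    version: "2"
    services:
      app:
        image: registry.example.com/app:latest
        restart: always
        # variables from the .env file next to this docker-compose.yml
        env_file: .env
        ports:
          - "8080:8080"
        logging:
          # send container logs to the host's syslog, tagged for later parsing
          driver: syslog
          options:
            tag: app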

Some notes here:

  • We use the Docker syslog logging driver and it has served us well in our setup
  • Tagging the logs allows for easy parsing of the syslog output later on
  • Environment variables are managed through a .env file that docker-compose injects into the container’s environment

But there were a couple of shortcomings:

  • The workload-to-host assignment was still a manual process through Ansible roles
  • The “workload port” to “Host port” assignment was also still a manual process through Ansible roles
  • There was no way to respond to spikes in activity through standard cloud mechanisms like autoscaling.

We started operating this new infrastructure, improving it continuously, and we were pretty happy for a while. But as our Lifen service picked up momentum in the market, we started feeling the pain:

  • In reality, despite our automation and our configuration management setup, we were dealing with pets as opposed to cattle in our dev, staging and production environments (which caused headaches and made us nervous about upgrades)
  • No auto-scaling means either over-provisioning (a waste of money, especially on HDS hosting that costs 2–3x normal hosting) or under-provisioning (a bad customer experience)
  • Scaling an application horizontally took hours to set up, as opposed to a single API call when using Heroku/Google App Engine/AWS Beanstalk/Azure App Service
  • To ensure isolation and redundancy for workloads in the current setup, we needed multiple VMs, which increased the number of OSes needing operational attention (patching, monitoring, configuration), and that comes at a cost. We wanted to pack more containers per VM, but not manage the container scheduling manually.

The obvious answer here is Kubernetes, and we embarked on that journey right away. That’s what we’ll cover in Part 2, so stay tuned!
