Saving a Client Over $175k on AWS: Improving DevOps Practices, Upskilling Teams, and Advocating for System Simplification

Bart Bucknill

October 24, 2023

After a critical subsystem suffers its second outage in six weeks (costing at least hundreds of thousands in lost revenue, plus unquantified reputational damage), you are confronted with an incomprehensible tale of labyrinthine legacy systems tangled into distributed-system spaghetti.

You ask the tech lead to take you through how the system works. How complex can it be? It should be simple: widgets come in, widgets go out. And yet there are (at least) 18 antediluvian services, 10 databases, myriad message queues, and who even knows what the other boxes are on the Kandinsky painting of a system diagram you've been presented with.

Can't it be made simpler, you ask? The answer comes back: yes, maybe, we'd love to, but what can we do with all these fires to put out?

More with less, faster, better, cheaper. No budget for a permanent team expansion (sigh).

You need people you trust to get to the root cause, and focus on the most impactful work. You call a software partner.

Loosely speaking, this describes how 8th Light consulted on a project involving a deeply entrenched legacy system that made money for our client, and lost money through regular outages. By the end of the initial engagement, we had helped stabilize the legacy system, completed necessary service migrations, fixed deployment pipelines, upskilled the client team, and replatformed a failing subsystem.

Challenges with a Complex Legacy System

When we joined the project, we were tasked with an initial audit to investigate what was causing the regular outages. What we found isn't uncommon with legacy systems: unnecessary complexity, operational issues, and poor observability all contributing to instability and hampering incident mitigation. The responsible team was overwhelmed by constant firefighting, and unable to make the big changes needed despite a strong desire to do so. Outages cost the business large sums and were a major headache for the responsible executives.

Our audit found a number of issues compounding into one massive challenge:

  1. The complex architecture had been designed to serve very high load requirements, far higher than the load the system had ever actually seen. By some measures, it had been designed for 30 times the maximum load observed today. There was also extensive dead code from outdated and abandoned features.

  2. There were operational problems too, partly resulting from gaps in the team's knowledge of cloud-native and DevOps best practices. Many of the services had been migrated from on-premises servers to EKS (AWS managed Kubernetes), but not all of the steps such a migration requires had been completed.

  3. Deployment pipelines were a frequent source of issues. Overly complex and not standardized, they often failed during incident mitigation. Once, a minor change pushed to a repository caused an outage simply by triggering the deployment pipeline; unknown to the developer, an additional manual step was required whenever that pipeline ran (one way to guard against this is sketched after this list).

  4. Observability was not as good as it could be. Dashboards were hard to find, sometimes unreliable, or simply so outdated that they did not display any data at all.
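To illustrate the third point: one mitigation is to make any manual prerequisite explicit and enforced, so a routine push cannot silently skip it. The sketch below is a hypothetical Python guard, not the client's actual pipeline; the environment variable name and the message wording are assumptions.

```python
#!/usr/bin/env python3
"""Illustrative pre-deploy guard (hypothetical, not the client's pipeline).

Fails the pipeline fast, with a clear message, unless the documented
manual step has been explicitly acknowledged.
"""
import os
import sys

REQUIRED_ACK = "CACHE_WARMED_ACK"  # hypothetical name for the manual step


def main() -> int:
    # Block the deployment unless the operator has confirmed the manual step.
    if os.environ.get(REQUIRED_ACK, "").lower() != "true":
        print(
            f"Deployment blocked: set {REQUIRED_ACK}=true only after "
            "completing the documented manual step for this service.",
            file=sys.stderr,
        )
        return 1
    print("Manual prerequisite acknowledged; continuing deployment.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run as the first stage of a pipeline, a guard like this turns a hidden, tribal-knowledge step into a loud, self-documenting failure instead of an outage.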

Solutions for Upskilling Teams and Simplifying Complex Systems

Based on our analysis, we proposed embedding with the team to undertake an initial phase of work focused on operational stabilization and cloud native/DevOps upskilling. Our knowledge of cloud native and DevOps best practices, advocacy for observability, and experience in skill development through behavior modeling positioned us well to help our client succeed.

And so began a close collaboration. Together with the client team, we made extensive changes to stabilize the systems in question, including migrating a service to Kubernetes, correcting inappropriate deployment configuration and dependency versions, and standardizing and fixing deployment pipelines. We upskilled alongside our client counterparts, demonstrating the growth mindset required in a fast-changing environment. We even ran a book club studying Kubernetes in Action. Over the course of six months, we improved observability, increasing the data available and making it easier to find, including building a synthetic monitoring tool to measure a key business metric that was previously unavailable.
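To give a flavor of the synthetic monitoring work: the idea is to exercise a key user journey on a schedule and record the outcome as a metric, so the business measure exists even when no real traffic is flowing. The sketch below is a minimal, hypothetical Python version; the URL, metric name, and output format are assumptions, not the client's actual tooling.

```python
"""Minimal synthetic monitor sketch, assuming an HTTP journey endpoint."""
import json
import time
import urllib.request

CHECK_URL = "https://example.internal/api/widgets/checkout"  # hypothetical
METRIC_NAME = "widget_checkout_success"                      # hypothetical


def run_check(timeout_s: float = 5.0) -> dict:
    """Exercise the journey once and return a metric datapoint."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=timeout_s) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    elapsed_ms = (time.monotonic() - started) * 1000
    return {
        "metric": METRIC_NAME,
        "success": ok,
        "latency_ms": round(elapsed_ms, 1),
        "timestamp": int(time.time()),
    }


if __name__ == "__main__":
    # A real monitor would push this to a metrics backend on a schedule;
    # here we simply print the datapoint.
    print(json.dumps(run_check()))
```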

With this stabilization and upskilling complete, we embarked on a fast-paced re-architecture of the most frequently failing subsystem. The observability improvements gave us a clear picture of how clients used this system and how it actually performed. This gave us the data we needed to propose a radically simplified architecture, eliminating a database and three Kafka consumers. There was considerable skepticism that our approach would be feasible; after all, the original system had been designed that way for a reason. By sharing the data on usage patterns, and through extensive performance testing, we were able to show that our solution would not only be adequate, but would in fact outperform the current system when measured against actual production workloads.
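That performance-testing claim rested on measuring the candidate design against realistic load. A heavily simplified sketch of that kind of test follows; the endpoint, request count, and concurrency level are hypothetical, and the real testing was driven by actual production workload data rather than a fixed loop.

```python
"""Simplified load-test sketch: concurrent requests, latency percentiles."""
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://example.internal/api/widgets"  # hypothetical
REQUESTS = 200
CONCURRENCY = 20


def timed_request(_: int) -> float:
    """Issue one request and return its latency in milliseconds."""
    started = time.monotonic()
    with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
        resp.read()
    return (time.monotonic() - started) * 1000


if __name__ == "__main__":
    # Fire REQUESTS requests across CONCURRENCY workers and summarize latency.
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(timed_request, range(REQUESTS)))
    p50 = statistics.median(latencies)
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"p50={p50:.0f}ms p95={p95:.0f}ms over {REQUESTS} requests")
```

Comparing numbers like these for the old and proposed designs, under the same workload, is what let us back the simpler architecture with evidence rather than opinion.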

Remember that Kandinsky-like architecture diagram? This work transformed an important section of it from many boxes and a tangle of lines, to just a service box and a DB box. More importantly, the number of major incidents dropped from an almost monthly cadence last year, to zero in the last six months.

So how did we save $175k? This visible cost was the approximate annual hosting fee for the database that our simpler design eliminated. In reality, the savings from reduced outages are far larger, as are the less visible savings in engineering hours spent maintaining the system and in time required to onboard new developers.

Conclusion

What started out as an assessment to uncover what was causing a serious business challenge turned into a significant win for the client’s developers, executives, and business. Together we radically simplified a legacy system — solving critical issues, paving the way for future improvements, and improving delivery velocity. We facilitated client skill development, ensuring long-term maintainability, and we directly reduced visible and invisible costs.