/ Shayon Mukherjee / blog

Incidents and the requirement of slowing down

March 29, 2024
~3 mins

Speed is key, except when its not. This adage, while seemingly prudent, overlooks a fundamental truth of complex systems: the paradox that slowing down can often be the fastest route to resolution. This contemplative approach to incident management is not merely a tactical choice but a philosophical stance on navigating the intricacies of system failures. It’s an acknowledgment that incidents have a tendency to breed more incidents.

Imagine a scenario in which a production cluster experiences a sudden blip, leading to saturated connections and dropped requests for critical workflows. The knee-jerk reaction might be to initiate an immediate upgrade of the cluster in an attempt to rectify the issue. However, this decision comes fraught with risks. For example, if your application isn’t used to sudden upgrades, introducing them during a stressful situation can lead to unforeseen complications. And even if upgrading the cluster is the answer to the solution, it’s crucial to understand the implications of the upgrade. Will it make matters worse? Will it introduce new issues? Will the upgrade process itself fail, leading to further disruptions?

The urgency often obscures the fact that incidents cause more incidents. Incidents are not isolated events but links in a chain, each capable of setting off a cascade of further issues. The key lies in resisting these impulses, favoring a methodical exploration of safe, effective and reversible solutions.

One thing that I try to practice is to see how much time I can buy before making wide-impacting, irreversible changes. If we are able to buy time and quickly explore + implement less lethal changes, we can avoid making the situation worse. And while this slowdown might seem counterintuitive, it’s a necessary step to ensure that we are not just reacting to the symptoms.

Central to this approach is the principle of tackling one issue at a time. In the chaos of an ongoing incident, it might seem counterintuitive to narrow one’s focus. Yet, this singularity of purpose allows us to precisely understand the effects of our actions. By methodically addressing one aspect of the problem, confirming its resolution, and then moving on to the next, we construct a clear path through the complexity.

This philosophy extends beyond the immediate tactical benefits. Slowing down in the midst of an incident fosters a culture of thoughtful response over reactive haste. It encourages a deeper understanding of the systems we manage and a commitment to resolving issues in a manner that is not just quick but lasting. This is the true benefit of deliberation during incidents—it not only reduces the likelihood of subsequent issues but also enhances our capacity to manage complex systems more effectively.

last modified March 29, 2024