Chatterbox Part 5 — Self-healing

This is the fifth part of the Chatterbox series. For your convenience, you can find the other parts in the table of contents in Part 1 – Origins.

One of the most important properties of a reliable system is its ability to heal itself. While it may sound easy, there are many things to consider. Let’s say that you want to detect that your process (or subprocess) died. But then:

  • How do you know it died? Maybe it’s just slow? Maybe it’s waiting on a mutex? Maybe it’s working hard in some long (infinite?) loop? Super simple — just add a ping!
  • How do you handle the ping? If you run it on a separate thread, how do you know it isn’t the only thread still running in your app? You need to manage daemon threads properly
  • What if the ping is slow? Maybe you should give it a couple of retries before killing the process?
  • How do you deliver the ping? A file? A socket? A named pipe?
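
To make the ping idea concrete, here is a minimal sketch in Python. This is my own illustration, not the actual Chatterbox code: `worker` and `is_alive` are hypothetical names, and a multiprocessing pipe stands in for the named pipe mentioned below.

```python
import multiprocessing as mp

def worker(conn):
    # Child process: answer pings. A real worker would run this loop
    # on a dedicated daemon thread next to the actual work.
    while True:
        msg = conn.recv()
        if msg == "ping":
            conn.send("pong")
        elif msg == "stop":
            break

def is_alive(conn, retries=3, timeout=1.0):
    """Ping the worker; declare it dead only after `retries` failed attempts,
    so a momentarily slow process is not killed prematurely."""
    for _ in range(retries):
        conn.send("ping")
        if conn.poll(timeout) and conn.recv() == "pong":
            return True
    return False

if __name__ == "__main__":
    parent, child = mp.Pipe()
    p = mp.Process(target=worker, args=(child,), daemon=True)
    p.start()
    print(is_alive(parent))  # the worker is healthy, so this prints True
    parent.send("stop")
    p.join()
```

The retry count and timeout are the knobs: too aggressive and you kill healthy-but-busy processes, too lax and a dead process lingers.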

Okay, let’s say we think the process isn’t responding. Let’s kill it. How?

  • Stopping it “the right way” probably won’t work because it is not responding
  • But if we kill it, it won’t release its resources
  • What if it’s being debugged? We can’t just kill it because the operating system will stop us
  • What if it is reporting errors (WER)? We can’t kill it either
  • What if it’s some zombie process and it cannot be killed at all?
  • What if it’s running remotely and we lost the connection?
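
The common pattern here is "stop politely, then escalate". A sketch of that escalation using Python's subprocess module (my own illustration; note that on Windows `terminate()` and `kill()` both call `TerminateProcess`, so the graceful step mostly matters on POSIX):

```python
import subprocess
import sys

def stop_process(proc, grace=5.0):
    """Try a graceful stop first; escalate to a hard kill if it doesn't react."""
    proc.terminate()              # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace)  # give it a chance to clean up resources
        return True
    except subprocess.TimeoutExpired:
        proc.kill()               # forceful kill; resources may leak
        try:
            proc.wait(timeout=grace)
            return True
        except subprocess.TimeoutExpired:
            return False          # likely a zombie or a debugged/WER'd process;
                                  # nothing more we can do from here, alert a human

if __name__ == "__main__":
    # Hypothetical stuck worker (here it does exit on SIGTERM).
    p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(p))  # prints True
```

The `False` branch is exactly the zombie/debugger/WER territory from the list above: sometimes even a forceful kill does not work, and the watchdog has to report rather than retry.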

Okay, let’s say we killed it. Let’s now restart:

  • What if resources are still locked?
  • What if it held a mutex? We can take ownership of the abandoned mutex, but how do we know the protected data is still consistent?
  • What if the process dies deterministically because of some poison message? How many times do we restart it? Do we do exponential backoff? Something else?
  • What if our watchdog dies and there is nothing to restart the process?
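
One possible answer to the poison-message question is a restart loop with exponential backoff and a cap on quick consecutive deaths. A sketch with hypothetical names and thresholds:

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=5, base_delay=1.0, max_delay=60.0):
    """Restart `cmd` whenever it dies, backing off exponentially. Give up
    after `max_restarts` quick deaths in a row; a process that crashes
    deterministically (e.g. on a poison message) would otherwise loop forever."""
    restarts, delay = 0, base_delay
    while True:
        started = time.monotonic()
        code = subprocess.call(cmd)
        if code == 0:
            return 0                         # clean exit, nothing to heal
        if time.monotonic() - started > 2 * max_delay:
            restarts, delay = 0, base_delay  # heuristic: it ran for a while,
                                             # so the crash is probably transient
        restarts += 1
        if restarts > max_restarts:
            return code                      # likely deterministic; stop retrying
        time.sleep(delay)
        delay = min(delay * 2, max_delay)

if __name__ == "__main__":
    code = supervise([sys.executable, "-c", "import sys; sys.exit(3)"],
                     max_restarts=2, base_delay=0.1, max_delay=0.2)
    print(code)  # prints 3: the supervisor gave up after 2 quick restarts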

And so on…

It’s not easy to implement a proper watchdog, but you need to realize one thing: your process WILL die. Sooner or later. Your machine will restart as well. You cannot catch all exceptions, you cannot handle all issues; sometimes you just need to restart.

My solution currently works like this:

  • The watchdog observes processes via a named pipe
  • Each process has a deep ping which does something meaningful (we’ll cover that in the next part)
  • If the ping fails 3 times, the watchdog restarts the process
  • The watchdog can take a memory dump to simplify debugging later on
  • Processes observe the watchdog and kill themselves if the watchdog dies
  • Instead of using system mutexes, I’m using my own custom ones that track ownership (PID and TID)
  • Each mutex is locked with a timeout (this is crucial: never wait indefinitely)
  • If process A detects that B has held a mutex for too long, it takes B’s memory dump and kills it to retake the lock
  • I had to override a lot of system settings (WER and others) so that I don’t end up with zombie processes which cannot be killed at all

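To make the custom-mutex idea concrete, here is a toy in-process version that records the owner's PID and TID and always acquires with a timeout. It is only a sketch of the concept under my own naming; the real cross-process version would need shared memory or a file, plus the dump-and-kill logic described above.

```python
import os
import threading
import time

class OwnedLock:
    """A lock that knows who holds it (PID, TID) and since when,
    and that never waits indefinitely."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None  # (pid, tid, acquired_at) or None

    def acquire(self, timeout):
        # Always bounded: on timeout the caller decides what to do
        # (retry, take a memory dump of the owner, or kill it).
        if not self._lock.acquire(timeout=timeout):
            return False
        self.owner = (os.getpid(), threading.get_ident(), time.monotonic())
        return True

    def release(self):
        self.owner = None
        self._lock.release()

    def held_longer_than(self, seconds):
        """Lets a watchdog spot a stuck owner and decide to dump/kill it."""
        return self.owner is not None and time.monotonic() - self.owner[2] > seconds

if __name__ == "__main__":
    lock = OwnedLock()
    assert lock.acquire(timeout=1.0)
    print(lock.owner[:2])  # this process's PID and the acquiring thread's TID
    lock.release()
```

Tracking the owner is what makes the "B holds the mutex for too long" check possible at all: with a plain system mutex you only know it is taken, not by whom or since when.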
It works. While I can never say it is bulletproof, I haven’t seen issues for months now, and it has survived many severe conditions (CPU 100% consumed for hours, no free memory, no disk space, etc.).