Kubernetes is a powerful orchestration tool that can make developers' lives much easier if it is used and configured correctly. One of the most important parts of managing it is being able to tell whether an application is working as expected or whether it should be restarted. Both can be achieved with proper health checks.

Types of health checks

We should start by explaining what a health check actually is. When our application is running in the wild, we need some means of checking whether it behaves the way we expect it to. If it is a web application, we probably access some HTTP endpoint and expect a 200 status code. Since we don't want to rely on any particular data being processed (or even present), we create a dummy endpoint such as /health that returns that status. We don't need to limit ourselves to HTTP endpoints, especially if our app is not a web server: we can instead come up with a shell command to run, which is expected to exit without errors (exit code 0).
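
For example, an HTTP-style check and a command-style check might look like this from the shell (the localhost URL and the redis-cli command are illustrative assumptions, not part of the sample application below):

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
200

$ redis-cli ping > /dev/null; echo $?
0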

Sample application(s)

For the sake of this short tutorial, I will deploy three services onto a Minikube (Kubernetes) cluster: master, worker, and cache. The flow is simple: imagine that the master is exposed to the world and accepts requests to calculate the power of a given number:

curl "http://master-address:port/?base=2&power=3"

Then, the master delegates the calculation to one of the workers. Since calculating the result can be time-consuming, we cache the result in Redis and read it from there when available, to save some computing power.
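
To give an idea of what that looks like, the calculation-plus-cache handler could be sketched roughly as below. Whether it lives in the master or the worker, the cache key format, and the go-redis Get/Set calls are my assumptions for illustration, not the original code:

// Handle a calculation request such as /?base=2&power=3.
// Assumes fmt, math, net/http, strconv and a go-redis client named redisClient.
http.HandleFunc("/", func(rw http.ResponseWriter, req *http.Request) {
    base, _ := strconv.ParseFloat(req.URL.Query().Get("base"), 64)
    power, _ := strconv.ParseFloat(req.URL.Query().Get("power"), 64)
    key := fmt.Sprintf("pow:%s:%s", req.URL.Query().Get("base"), req.URL.Query().Get("power"))

    // Return the cached result if we have already computed it.
    if cached, err := redisClient.Get(key).Result(); err == nil {
        fmt.Fprint(rw, cached)
        return
    }

    // Otherwise compute the result and store it in Redis for later requests.
    result := strconv.FormatFloat(math.Pow(base, power), 'f', -1, 64)
    redisClient.Set(key, result, 0)
    fmt.Fprint(rw, result)
})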

We would like to use Kubernetes checks to make sure that these services are working.

Liveness check

We start off by building a dummy handler:

http.HandleFunc("/checks/liveness", func(rw http.ResponseWriter, req *http.Request) {
    fmt.Fprint(rw, "OK")
})
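
The handler hangs off the default mux; assuming the service listens on port 8000 (the port the probe below points at), the server would be started with something like:

// Start the HTTP server; 8000 must match the port used in the probe definition.
// Assumes the standard library log package is imported.
log.Fatal(http.ListenAndServe(":8000", nil))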

Now, in order for Kubernetes to use it to determine the health of our service, we need to add the appropriate livenessProbe configuration to the container spec in the controller definition:

kind: ReplicationController
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - livenessProbe:
            httpGet:
              path: /checks/liveness
              port: 8000
            initialDelaySeconds: 5
          ...
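
initialDelaySeconds is not the only knob: the probe's frequency, timeout, and failure tolerance can be tuned as well. The values below are the Kubernetes defaults, shown only for illustration; they are not part of the sample deployment:

livenessProbe:
  httpGet:
    path: /checks/liveness
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10       # how often the probe runs
  timeoutSeconds: 1       # how long to wait for a response
  failureThreshold: 3     # consecutive failures before the container is restarted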

We can see that the application is up and alive immediately:

$ kubectl get pods -w
service "worker" created
replicationcontroller "worker" created
NAME           READY     STATUS              RESTARTS   AGE
worker-cbp5v   0/1       ContainerCreating   0          0s
worker-f7d2v   0/1       ContainerCreating   0          0s
...
worker-74kfv   0/1       Terminating         0          21s
worker-74kfv   0/1       Terminating         0          21s
...
worker-cbp5v   1/1       Running             0          1s
worker-f7d2v   1/1       Running             0          2s

What if we decide to rely on the database? We can add a Redis PING command to make sure it's up:

http.HandleFunc("/checks/liveness", func(rw http.ResponseWriter, req *http.Request) {
    if err := redisClient.Ping().Err(); err != nil {
        http.Error(rw, "FAIL", http.StatusInternalServerError)
        return
    }
    fmt.Fprint(rw, "OK")
})

If we redeploy the app, we see no change (as long as Redis is already up), but what happens if the database crashes?

# DELETE REDIS USING HELM
$ helm del --purge cache 
release "cache" deleted

$ kubectl get pods -w
NAME           READY     STATUS    RESTARTS   AGE
...
worker-8zp55   1/1       Running   0          20s
worker-wfrrt   1/1       Running   0          20s
...
worker-wfrrt   1/1       Running   1          36s
worker-8zp55   1/1       Running   1          36s
...
worker-wfrrt   1/1       Running   2          56s
worker-8zp55   1/1       Running   2          57s

As you can see, Kubernetes started to restart the worker pods, because the liveness check started to fail. You can see more details by describing one of the pods:

kubectl describe pod worker-8zp55
...
Events:
Type     Reason                 Age               From               Message
----     ------                 ----              ----               -------
...
Warning  Unhealthy              9s (x3 over 29s)  kubelet, minikube  Liveness probe failed: HTTP probe failed with statuscode: 500
Normal   Killing                9s                kubelet, minikube  Killing container with id docker://worker:Container failed liveness probe.. Container will be killed and recreated.

The worst part is that the READY column still suggested that everything was fine. What does that even mean?

Readiness check

While both liveness and readiness checks describe the state of the service, they serve quite different purposes. Liveness tells Kubernetes whether the pod has started and the application inside it is running correctly; if not, the pod gets restarted. This check should not include external dependencies such as databases, which are clearly something the service has no control over: we gain nothing by restarting the service when the database is down, because that won't bring the database back.

This is where readiness checks come in. A readiness check basically answers the question of whether the application can handle incoming requests. It should sanity-check both the internal and the external pieces that make the service work, so the connection to the database belongs there. If the check fails, the service is not restarted, but Kubernetes stops directing traffic to it. This is particularly useful when we perform a rolling update and the new version has some issue connecting to the DB: the new instances simply stay not ready instead of being restarted.

The solution here is to split the check into two:

// Liveness check
http.HandleFunc("/checks/liveness", func(rw http.ResponseWriter, req *http.Request) {
    fmt.Fprint(rw, "OK")
})
// Readiness check
http.HandleFunc("/checks/readiness", func(rw http.ResponseWriter, req *http.Request) {
    if err := redisClient.Ping().Err(); err != nil {
        http.Error(rw, "FAIL", http.StatusInternalServerError)
        return
    }
    fmt.Fprint(rw, "OK")
})

Then, we define separate checks in the Kubernetes manifest as well:

...
livenessProbe:
  httpGet:
    path: /checks/liveness
    port: 8000
  initialDelaySeconds: 5
readinessProbe:
  httpGet:
    path: /checks/readiness
    port: 8000
  initialDelaySeconds: 10

Now, if we crash the cache, the situation is under slightly more control:

$ helm del --purge cache 
release "cache" deleted

$ kubectl get pods -w
NAME           READY     STATUS    RESTARTS   AGE
master-q6xhm   1/1       Running   0          20h
worker-9tf2f   1/1       Running   0          2m
worker-kzpjv   1/1       Running   0          2m
...
worker-9tf2f   0/1       Running   0         2m
worker-kzpjv   0/1       Running   0         2m

Note: You should not rely on the cache to determine whether an application works; this setup is here just to showcase how Kubernetes checks work. In real life, you should build your applications in a way that tolerates the cache being temporarily unavailable.
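
One way to express that, sketched under the assumption that the worker can fall back to recomputing results when Redis is gone, is to stop failing the readiness check on cache errors and only log them:

// Readiness check that treats the cache as optional: the pod stays ready when
// Redis is down and the handlers fall back to computing results from scratch.
// (A sketch only; the logging and fallback behaviour are assumptions.)
http.HandleFunc("/checks/readiness", func(rw http.ResponseWriter, req *http.Request) {
    if err := redisClient.Ping().Err(); err != nil {
        log.Printf("cache unavailable, serving without it: %v", err)
    }
    fmt.Fprint(rw, "OK")
})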

Summary

As you can see, with a few lines of application code and some YAML configuration we can help Kubernetes determine whether our application is working correctly. By revealing that to our orchestration tool, we let it restart our services when necessary and stop sending traffic to them when they are not ready to handle it.

Versions

$ minikube version
minikube version: v0.28.2

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.6-gke.2", GitCommit:"384b4eaa132ca9a295fcb3e5dfc74062b257e7df", GitTreeState:"clean", BuildDate:"2018-08-15T00:10:14Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}

$ go version
go version go1.11 darwin/amd64

$ docker version
Client:
 Version:           18.06.0-ce
 API version:       1.38
 ...

Server:
 Engine:
  Version:          18.06.0-ce
  API version:      1.38 (minimum version 1.12)
  ...