This is the second part of the Kubernetes Scaling Guide, where we'll take a deeper look at resilience, which is an important aspect of having a scalable application.

You may want to check out Part 1 first, which covers the more basic scaling configurations.

Autoscaling

Defining a fixed number of replicas is an easy way to scale an application up (or down), but it is not very efficient. Imagine that your product suddenly goes viral and is mentioned all over Reddit, and thousands of potential users start banging on your virtual door. Do you think having two replicas is enough? Should you immediately scale up to six? Ten? When should you scale down? Thankfully, Kubernetes has a resource designed just for this - the Horizontal Pod Autoscaler. As the name suggests, HPA is responsible for making sure your service has just the right number of pods (not too many, not too few) for the current load.

We define an HPA configuration as a separate resource manifest:

# hpa.yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: kube-scale-app
spec:
  metrics:
  - resource:
      name: cpu
      targetAverageUtilization: 70
    type: Resource
  minReplicas: 1
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-scale-app

What this setting says is that, in a perfect scenario, all of our pods will be utilized at 70% of their resource requests. If the load is higher, HPA will request another pod to be created. If it's well below the target value, one of the pods will be deleted. All, obviously, within the requested range of one to ten replicas. Just to refresh, the resources configuration is defined in the deployment manifest as a property of a container:

# manifest.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-scale-app
spec:
  selector:
    matchLabels:
      app: kube-scale-app
  template:
    metadata:
      labels:
        app: kube-scale-app
    spec:
      containers:
      - name: kube-scale-app
        image: gcr.io/kubernetes-scaling-268016/worker-app:deed569      
        ports:
          - containerPort: 80
            protocol: TCP
        resources:
          limits:
            cpu: 200m
          requests:
            cpu: 100m
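
With both manifests in place, applying them and keeping an eye on the autoscaler is straightforward (the file names below simply match the comments in the snippets above):

$ kubectl apply -f manifest.yaml
$ kubectl apply -f hpa.yaml

# watch the HPA adjust the replica count as the load changes
$ kubectl get hpa kube-scale-app --watch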

Now, when we start our service and there is no load, we'll just have one instance:

$ kubectl get all

NAME                                  READY   STATUS    RESTARTS   AGE
pod/kube-scale-app-6b8f476c46-8pvdq   1/1     Running   0          3m13s

NAME                                                 REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/kube-scale-app   Deployment/kube-scale-app   10%/70%   1         10        1          3m14s
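
To produce some load, one simple option (assuming the Deployment is exposed through a Service named kube-scale-app; the Service manifest itself is not shown here) is a throwaway busybox pod that hits the endpoint in a loop:

$ kubectl run load-generator --image=busybox --restart=Never -- \
    /bin/sh -c "while true; do wget -q -O- http://kube-scale-app; done"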

Then, once we generate some heavy traffic, we start to see new pods being created:

$ kubectl get all

NAME                                  READY   STATUS    RESTARTS   AGE
pod/kube-scale-app-6b8f476c46-5rzql   1/1     Running   0          7m27s
pod/kube-scale-app-6b8f476c46-5vd98   1/1     Running   0          8m12s
pod/kube-scale-app-6b8f476c46-5xk4w   1/1     Running   0          8m15s
pod/kube-scale-app-6b8f476c46-6vkgk   1/1     Running   0          8m15s
pod/kube-scale-app-6b8f476c46-6xvzb   1/1     Running   0          6m6s
pod/kube-scale-app-6b8f476c46-8wx4q   1/1     Running   0          8m15s
pod/kube-scale-app-6b8f476c46-lbznj   1/1     Running   0          8m6s
pod/kube-scale-app-6b8f476c46-thjtf   1/1     Running   0          8m15s
pod/load-generator                    1/1     Running   0          110s

NAME                                                 REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/kube-scale-app   Deployment/kube-scale-app   70%/70%   1         10        8          24m

As you can see, we reached eight replicas. Then, when we stop the traffic, the pods are slowly deleted:

$ kubectl get all

NAME                                  READY   STATUS    RESTARTS   AGE
pod/kube-scale-app-6b8f476c46-5xk4w   1/1     Running   0          19m
pod/kube-scale-app-6b8f476c46-8wx4q   1/1     Running   0          19m

NAME                                                 REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/kube-scale-app   Deployment/kube-scale-app   0%/70%    1         10        2          36m

How slowly? By default, scaling up can be triggered as often as every 30 seconds, but scaling down only every five minutes. This makes sense (unless you are very worried about your costs), as in most cases serving all clients quickly is more important than saving a couple of bucks with a faster scale-down.
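
The scale-down delay in particular comes from a kube-controller-manager setting rather than from the HPA manifest itself; on a managed cluster like GKE you typically cannot change it, but for reference, the relevant flag (with its default value) looks like this:

# kube-controller-manager flag controlling how long the HPA waits before scaling down
--horizontal-pod-autoscaler-downscale-stabilization=5m0s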

There is one caveat you need to keep in mind with HPA, and it's related to the scaling property mentioned at the start of this topic, i.e. replicas. Basically, you should not use it together with HPA, because it causes an unexpected behavior (at least it was unexpected for me and the team when we discovered it). Imagine that you start with the replicas count set to two, and the HPA operating in the [2, 10] range. Traffic hits, and you end up with six instances. So far, so good. Then you discover a bug and quickly deploy a version with the fix. What happens then? Kubernetes looks at the desired number of replicas from the manifest first, and creates just two new replicas in place of the old six! Once created, they are immediately scaled up to satisfy the load, but it's still not a good idea to put the quality and speed of your application on the line. With the replicas property removed, you would get six new replicas created, matching the number of instances present before the deploy.
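
In other words, if your Deployment is managed by an HPA, a fragment like this (illustrative only, not part of the manifests above) is exactly what to avoid:

# manifest.yaml - do NOT combine a fixed replica count with an HPA
spec:
  replicas: 2   # every deploy resets the Deployment to two pods before the HPA scales it back up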

Cluster layout with Affinities

Once you are happy with the number of replicas of your application, you start worrying about your nodes. Let's say that your microservice system is deployed on three (virtual) machines. Two of them are utilized almost to their maximum, while the third one still has some resources left. Unfortunately, the service that has one replica deployed on that not-quite-full node needs to be scaled up. You end up with two (or more) replicas of the same service, all on one node. Then something terrible happens and that node crashes. Clients are not served anymore, everyone is angry. But it was so scalable, everything was in place... Well, there is something you can do to prevent this from happening.

The properties you need in these situations are affinities and anti-affinities, which behave very similarly. An affinity defines a preference for where a pod should be scheduled, while an anti-affinity describes which situations should be avoided during scheduling. In our case, we would like to avoid placing a pod on a node that already runs one with the same app label. To do that, we need to add the following settings to our deployment manifest:

# manifest.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-scale-app
spec:
  ...
  template:
    ...
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - kube-scale-app
              topologyKey: kubernetes.io/hostname
            weight: 100

This says that when a pod is scheduled, we prefer a node that does not yet host a pod with the label app set to kube-scale-app. By using preferredDuringSchedulingIgnoredDuringExecution we also do two things:

  • We only prefer, which means that if pods of this type are already present on every node, we will give up and allow the next replica to go onto one of them anyway (a stricter, required variant is sketched right after this list).
  • We ask for this only during scheduling, which means that once a pod is on a node, we won't try to evict (remove) it just because we noticed there is a more preferable placement on another node.
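
If you need a hard guarantee instead of a preference, there is also a required variant. Be careful with it, though: a replica that cannot satisfy the rule will simply stay Pending rather than land on an already occupied node. A minimal sketch of the stricter form, which would replace the preferred block above:

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - kube-scale-app
            topologyKey: kubernetes.io/hostname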

Let's say we have our service scheduled to two out of three nodes, and then we spin up a third pod. With the pod anti-affinity settings in place, the newly created replica will not be assigned to either of the nodes that already contain kube-scale-app; instead, it will be placed on the only node that does not run that service yet:

$ kubectl get pods -o wide

NAME                              READY   STATUS              AGE   NODE                                             
kube-scale-app-7476cf5f74-4fkwv   1/1     Running             70s   gke-standard-cluster-1-default-pool-f543d6c7-c0n5
kube-scale-app-7476cf5f74-6kn5b   1/1     Running             75s   gke-standard-cluster-1-default-pool-f543d6c7-88v9
kube-scale-app-7476cf5f74-9w6pw   0/1     ContainerCreating   2s    gke-standard-cluster-1-default-pool-f543d6c7-l9nj

Cluster scaling

We've taken care of autoscaling the service according to its load, and we've made sure it's evenly spread across the nodes. Is there anything else we should care about? It turns out there is.

There are certain situations when a node can be turned off in a controlled way. You can have a dynamically sized node pool, and if the nodes are underutilized, the pool can be scaled down. You can have a managed cluster where, while a Kubernetes version upgrade is being performed, all the old nodes (not all at once, hopefully) are brought down and replaced with the same number of new ones. Would this cause any disturbance?
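
For example, on GKE (where the examples in this post run) node-pool autoscaling is a cluster-level switch; a rough sketch, assuming the cluster and pool names inferred from the node names shown earlier, with the zone left for you to fill in:

$ gcloud container clusters update standard-cluster-1 \
    --enable-autoscaling --min-nodes=1 --max-nodes=5 \
    --node-pool=default-pool --zone=<your-zone>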

Wait, what if two nodes are removed at the same time, and those are the only places where one of your services is deployed? Then yes, it will be out of service for a brief moment. We would really like to avoid that, though. Can you do anything about this? I'm so glad you asked, and the answer is, obviously, yes! (If there were no way to do this, I would probably not be writing this section at all, right?) The more elaborate answer would be: yes, with a pod disruption budget!

What is a PDB exactly? It's a definition of the desired state of your service in case of a disruption, i.e. when nodes are going down and up. It exists as a separate resource type in Kubernetes, and it simply defines a minimum number of available replicas, plus a selector that determines which pods the setting relates to:

# pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-scale-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kube-scale-app

Now, when Kubernetes tries to remove nodes in a way that would leave one of the services unavailable, the operation is adjusted to ensure that this problematic situation is avoided. That can mean taking down only one of the nodes at a time, or rescheduling the pod first. In either case, once the pod is up, running, and available to serve traffic, the operation of tearing down the nodes continues.
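
You can see the budget in action during a voluntary disruption: draining a node goes through the eviction API and therefore respects the PDB. Assuming one of the node names shown earlier:

$ kubectl apply -f pdb.yaml
$ kubectl drain gke-standard-cluster-1-default-pool-f543d6c7-c0n5 --ignore-daemonsets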

Summary

We've reached the end of the Kubernetes scaling and resilience settings that I have used over the past three years. This two-post series reflects my own journey through the documentation, and the experiments that verified everything works as described there. While it was so much fun to discover one thing after another, I can see that there is value in knowing it upfront. Maybe that would have saved us (my team) some time in the process, avoiding some of the sporadic downtime we experienced? While it's too late for me, I believe that lots of people can benefit from our findings if they are anywhere earlier on their path to Kubernetes mastery. If I ever reach it myself, I'll let you know how it feels...