Scaling Percona Kubernetes Operator

You got yourself a Kubernetes cluster and are now testing our Percona Kubernetes Operator for Percona XtraDB Cluster. Everything is working great and you decide to increase the number of Percona XtraDB Cluster (PXC) pods from the default 3 to, let’s say, 5 pods.

It’s just a matter of running the following command:
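A sketch of that command, assuming the cluster is named “cluster1” (the default) and patching the size field of the pxc section of the custom resource:

    kubectl patch pxc cluster1 --type=merge --patch '{"spec": {"pxc": {"size": 5}}}'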

Good, the command ran without issues, so now you will have 5 PXC pods, right? Let’s check how the pods are doing:
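With kubectl, for example (the pod names assume a cluster named “cluster1”; the output below is illustrative and trimmed to the PXC pods):

    kubectl get pods
    NAME             READY   STATUS    RESTARTS   AGE
    cluster1-pxc-0   1/1     Running   0          25m
    cluster1-pxc-1   1/1     Running   0          23m
    cluster1-pxc-2   1/1     Running   0          22m
    cluster1-pxc-3   0/1     Pending   0          2m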

Not only do you see 4 pods instead of 5, but the new pod is also stuck in the “Pending” state. Further inspection shows that the kube-scheduler wasn’t able to find a node on which to deploy the pod:
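Something along these lines in the Events section (an illustrative excerpt; the exact wording varies with the Kubernetes version):

    kubectl describe pod cluster1-pxc-3
    ...
    Events:
      Type     Reason            Message
      ----     ------            -------
      Warning  FailedScheduling  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity.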

From that output, we can see what the issue is: affinity. Or, more specifically: anti-affinity.

Affinity makes a pod eligible to be scheduled (to run) on a node that already has pods with specific labels; anti-affinity makes a pod ineligible for such a node.

The operator provides an option called “antiAffinityTopologyKey”, which can take several values:

  • kubernetes.io/hostname – Pods will avoid residing within the same host.
  • failure-domain.beta.kubernetes.io/zone – Pods will avoid residing within the same zone.
  • failure-domain.beta.kubernetes.io/region – Pods will avoid residing within the same region.
  • none – No constraints are applied. All PXC pods can be scheduled on one node, and you can lose the entire cluster to a single node failure.

The default value is kubernetes.io/hostname, which pretty much means “only one PXC pod per node”.
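In the operator’s deploy/cr.yaml, the setting sits under the pxc section; a minimal excerpt of the default looks like this:

    pxc:
      size: 3
      affinity:
        antiAffinityTopologyKey: "kubernetes.io/hostname"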

In this case, the Kubernetes cluster is running on top of 3 AWS instances, so when one tries to grow beyond 3 pods, the scheduler has no node left on which to place the new pod.

Alternatives?

There are several options. A plain and simple (and obvious) one is to add new nodes to the k8s cluster.

Another option is to set the anti-affinity to “none”. Now, why would one want to remove the guarantee of having the pods distributed among the available nodes? Well, think about lower environments like QA or staging, where the HA requirements are not strict and you just need to deploy the operator on a couple of nodes (control plane/worker).

Now, here’s how the affinity setting can be changed:

Edit the cluster configuration. My cluster is called “cluster1” so the command is:
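    kubectl edit pxc/cluster1

(kubectl edit opens the live custom resource in your editor; “pxc” is the short name the operator registers for the PerconaXtraDBCluster resource.)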

Find the line where “antiAffinityTopologyKey” is defined, change “kubernetes.io/hostname” to “none”, and save. The modification is applied immediately.
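After the change, the relevant excerpt of the spec should read:

    affinity:
      antiAffinityTopologyKey: "none"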

Delete the old pods ONE BY ONE. Kubernetes will spawn a new one, so don’t worry about it. For example, to delete the pod named “cluster1-pxc-0”, run:
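    kubectl delete pod cluster1-pxc-0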

You will see the pods being recreated, and the one that was “Pending” moves on:
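For example, watching with -w (output again illustrative):

    kubectl get pods -w
    NAME             READY   STATUS              RESTARTS   AGE
    cluster1-pxc-0   1/1     Terminating         0          30m
    cluster1-pxc-3   0/1     ContainerCreating   0          7m
    cluster1-pxc-0   0/1     ContainerCreating   0          1s
    cluster1-pxc-3   1/1     Running             0          8m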

Finally, the goal of having 5 pods is achieved:
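Again illustrative, but it should look something like:

    kubectl get pods
    NAME             READY   STATUS    RESTARTS   AGE
    cluster1-pxc-0   1/1     Running   0          5m
    cluster1-pxc-1   1/1     Running   0          4m
    cluster1-pxc-2   1/1     Running   0          3m
    cluster1-pxc-3   1/1     Running   0          10m
    cluster1-pxc-4   1/1     Running   0          2m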

But what if one needs a more sophisticated option, one with some degree of guarantee that HA will be met? For those cases, the operator’s affinity can take an advanced approach, using nodeAffinity with “preferredDuringSchedulingIgnoredDuringExecution”.
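A sketch of what that can look like in deploy/cr.yaml, using the operator’s “advanced” affinity block (which is passed through to the pod spec); the zone values are placeholders for your own topology:

    pxc:
      affinity:
        advanced:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - us-east-1a
                  - us-east-1b

Being a “preferred” rule, the scheduler tries to honor it but will still place the pod elsewhere if it can’t, which is the degree-of-guarantee trade-off mentioned above.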

The whole description and configuration is available in the operator documentation: “Binding Percona XtraDB Cluster components to Specific Kubernetes/OpenShift Nodes”.

Also, in the future, the operator will make use of the “topologySpreadConstraints” spec to control the degree to which pods may be unevenly distributed.
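For reference, this is the shape of that spec on a plain Kubernetes pod (the label selector here is an assumption; how the operator will expose the setting isn’t defined yet):

    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/instance: cluster1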

Thanks to Ivan Pylypenko and Mykola Marzhan from the Percona Engineering Team for the guidance.



2 Comments
peterzaitsev

I think it is worth noting that if you change the anti-affinity you’re increasing the risks to cluster availability. It may be good for testing, but for production workloads you surely want to run the nodes on separate hosts, or otherwise in a way that concurrent failure of multiple pods is unlikely.