Automatically Including Kubernetes Control Plane Nodes in a Cloud Load Balancer

Recently I’ve been experimenting with setting up my own Kubernetes cluster using K3s, a lightweight distribution tailored for use in resource-constrained environments. For this project, I opted to host my nodes on Hetzner Cloud; although they don’t have a managed Kubernetes offering themselves, they have some basic support for self-hosted clusters via their Cloud Controller Manager (CCM) and Container Storage Interface (CSI) Driver.

The CCM is the component Kubernetes uses to provision and manage infrastructure on the underlying cloud provider. For example, adding a Service similar to the following would create a real load balancer on Hetzner:

apiVersion: v1
kind: Service
metadata:
  name: example
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: example
  type: LoadBalancer

Typically such a load balancer first directs traffic to a random eligible node in the cluster; from there, kube-proxy routes it internally to its final destination. The list of eligible nodes is maintained by Kubernetes and is periodically synced to the load balancer through the CCM.

In older versions of Kubernetes (1.20 and earlier), control plane / master nodes are excluded from said list by default based on the presence of a specific label (node-role.kubernetes.io/master=true). This is problematic for setups where every node runs the control plane (e.g. a single-node cluster): the list of eligible nodes ends up empty, so the load balancer has nowhere to send traffic.
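To make the failure mode concrete, here is a minimal sketch of the legacy rule. It is not the actual Kubernetes code; the function name and the dict-based node representation are made up for illustration, and it assumes only the presence of the label key matters, not its value.

```python
MASTER_LABEL = "node-role.kubernetes.io/master"


def legacy_eligible_nodes(nodes):
    """Return the names of nodes a load balancer may target under the legacy rule.

    `nodes` maps a node name to its labels (a simplified stand-in for Node objects).
    """
    # Assumption: only the presence of the label key is checked; the value is ignored.
    return [name for name, labels in nodes.items() if MASTER_LABEL not in labels]


single_node = {"node-1": {MASTER_LABEL: "true"}}
mixed = {"cp-1": {MASTER_LABEL: "true"}, "worker-1": {}}

print(legacy_eligible_nodes(single_node))  # [] -> the load balancer has no targets
print(legacy_eligible_nodes(mixed))        # ['worker-1']
```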

The issue was brought up in kubernetes/kubernetes#65618. One solution is to simply remove the label, although it is unclear whether there are unintended consequences. Of note, K3s reapplies the label on restart, potentially leading to instability. A Kubernetes Enhancement Proposal (KEP) was eventually created, regulating the usage of node-role labels.

Disabling LegacyNodeRoleBehavior

In Kubernetes 1.16, several feature gates (LegacyNodeRoleBehavior, ServiceNodeExclusion, and NodeDisruptionExclusion) and their corresponding labels were added as part of the migration process noted in the KEP. By turning LegacyNodeRoleBehavior off, we allow control plane nodes to be added to load balancers automatically. The gate is set by passing an additional argument to the CCM, which then needs to be redeployed to the cluster. For example, given Hetzner’s CCM manifest as a base, the feature gate can be added like so:

...
containers:
  - image: hetznercloud/hcloud-cloud-controller-manager:v1.10.0
    name: hcloud-cloud-controller-manager
    command:
      - "/bin/hcloud-cloud-controller-manager"
      - "--cloud-provider=hcloud"
      - "--leader-elect=false"
      - "--allow-untagged-cloud"
      - "--allocate-node-cidrs=true"
      - "--cluster-cidr=10.244.0.0/16"
      # Stop excluding nodes labeled node-role.kubernetes.io/master from load balancers
      - "--feature-gates=LegacyNodeRoleBehavior=false"
...
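As a rough model of what the flag changes, the sketch below mimics the kind of per-node check the service controller performs when building the load balancer's target list. It is a simplification rather than the upstream implementation: the Node class and the function are hypothetical, and the opt-out label name (node.kubernetes.io/exclude-from-external-load-balancers) is, as far as I know, the one tied to ServiceNodeExclusion.

```python
from dataclasses import dataclass, field

MASTER_LABEL = "node-role.kubernetes.io/master"
# Assumed to be the explicit opt-out label associated with ServiceNodeExclusion.
EXCLUDE_LB_LABEL = "node.kubernetes.io/exclude-from-external-load-balancers"


@dataclass
class Node:
    """Hypothetical, heavily simplified stand-in for a Kubernetes Node."""
    name: str
    labels: dict = field(default_factory=dict)
    ready: bool = True


def is_lb_eligible(node, *, legacy_node_role_behavior, service_node_exclusion):
    """Decide whether a node may be added to a cloud load balancer."""
    # Legacy behavior: any node carrying the master label is skipped.
    if legacy_node_role_behavior and MASTER_LABEL in node.labels:
        return False
    # Newer behavior: nodes are skipped only if they opt out explicitly.
    if service_node_exclusion and EXCLUDE_LB_LABEL in node.labels:
        return False
    # Nodes that are not ready are never eligible.
    return node.ready


control_plane = Node("cp-1", {MASTER_LABEL: "true"})
opted_out = Node("cp-2", {MASTER_LABEL: "true", EXCLUDE_LB_LABEL: ""})

# Gate on: the control plane node is excluded.
print(is_lb_eligible(control_plane, legacy_node_role_behavior=True,
                     service_node_exclusion=True))   # False
# Gate off (as in the manifest above): it becomes eligible.
print(is_lb_eligible(control_plane, legacy_node_role_behavior=False,
                     service_node_exclusion=True))   # True
# A node can still be kept out by labeling it explicitly.
print(is_lb_eligible(opted_out, legacy_node_role_behavior=False,
                     service_node_exclusion=True))   # False
```

The point of the sketch is that exclusion moves from an implicit role label to an explicit opt-out label.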

CCM-Specific Caveats

Something that caught me off guard was that feature gate settings in a given CCM do not necessarily match those of the cluster. This is because each CCM is built against packages from a specific version of Kubernetes, which may be older than the cluster it runs in. For example, v1.10 of Hetzner’s CCM imports v0.18.8 of the cloud-provider package (from Kubernetes 1.18.8). This matters because even after the feature gate is removed from Kubernetes itself, it may still need to be specified in the CCM if its dependencies have not been upgraded yet.
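One way to check which Kubernetes version a CCM is built against is to look at its go.mod file. The sketch below is a hypothetical helper, not part of any CCM: it pulls the k8s.io/cloud-provider version out of go.mod text and maps it to a Kubernetes release, relying on the convention that the v0.X.Y modules are published from Kubernetes 1.X.Y. The sample go.mod is invented for the example.

```python
import re


def vendored_kubernetes_version(go_mod_text, module="k8s.io/cloud-provider"):
    """Return the Kubernetes version (e.g. "1.18.8") that `module` was cut from.

    Returns None if the module is not listed with a v0.X.Y version.
    """
    pattern = re.compile(
        r"^\s*(?:require\s+)?" + re.escape(module) + r"\s+v0\.(\d+)\.(\d+)",
        re.MULTILINE,
    )
    match = pattern.search(go_mod_text)
    if match is None:
        return None
    minor, patch = match.groups()
    # Module v0.<minor>.<patch> corresponds to Kubernetes 1.<minor>.<patch>.
    return f"1.{minor}.{patch}"


# Invented go.mod contents, for illustration only.
sample_go_mod = """
module example.com/hypothetical-ccm

go 1.14

require (
    k8s.io/api v0.18.8
    k8s.io/cloud-provider v0.18.8
)
"""

print(vendored_kubernetes_version(sample_go_mod))  # 1.18.8
```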

From what I can see, this caveat is common across many different implementations including DigitalOcean’s and Linode’s, although I suspect it’s less of an issue for those two as their customers are likely on their managed offerings.

Kubernetes 1.21 and Beyond

LegacyNodeRoleBehavior is locked to false in Kubernetes 1.21 and is scheduled to be removed in the following release; as such, control plane nodes should no longer be excluded from load balancers by default. It will take some time before all CCMs upgrade their dependencies to match, however, so the feature gate may still be needed well beyond that point.
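Put together, whether the flag is needed depends on the Kubernetes version the CCM is built against rather than the version of the cluster. A small sketch of that decision follows, using only the version boundaries mentioned in this post (gate added in 1.16, locked to false in 1.21); the function and its return values are hypothetical.

```python
GATE_ADDED_MINOR = 16         # LegacyNodeRoleBehavior introduced in Kubernetes 1.16
GATE_LOCKED_FALSE_MINOR = 21  # locked to false from Kubernetes 1.21


def legacy_gate_status(ccm_k8s_minor):
    """Classify a CCM by the Kubernetes 1.<minor> release its dependencies come from."""
    if ccm_k8s_minor < GATE_ADDED_MINOR:
        # The gate does not exist yet, so the flag cannot help.
        return "unsupported"
    if ccm_k8s_minor < GATE_LOCKED_FALSE_MINOR:
        # Legacy behavior is still on by default; it has to be turned off explicitly.
        return "flag needed"
    # Control plane nodes are no longer excluded by default.
    return "flag not needed"


print(legacy_gate_status(18))  # flag needed (e.g. a CCM built against 1.18.8)
print(legacy_gate_status(21))  # flag not needed
```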

In any case, this was an interesting dive into one small part of Kubernetes. For additional reading, I found this post to give a pretty good overview of how CCMs operate.

May 19, 2021