ChaosMesh to Create Chaos in Kubernetes

In my talk at Percona Live (download the presentation), I spoke about how we can use Percona Kubernetes Operators to deploy our own Database-as-a-Service, based on fully open source components and independent of any particular cloud provider.

Today I want to mention an important tool that I use to test our Operators: ChaosMesh, which is part of the CNCF and recently reached GA with version 1.0.

ChaosMesh runs chaos engineering experiments against Kubernetes deployments, which lets you test how resilient a deployment is against different kinds of failures.

Obviously, this tool is important for Kubernetes database deployments, and I believe it can also be very useful for testing your application deployments, to understand how the application performs and handles different failures.

ChaosMesh allows you to emulate:

  • Pod Failure: kill a pod or inject errors into a pod (a sketch of such an experiment follows this list)
  • Network Failure: network partitioning, network delays, network corruption
  • IO Failure: IO delays and IO errors
  • Stress emulation: stress memory and CPU usage
  • Kernel Failure: return errors on system calls
  • Time skew: emulate time drift on pods
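As an illustration of how experiments are defined, a pod-kill experiment is just a Kubernetes custom resource. The sketch below assumes the Chaos Mesh v1alpha1 API; the experiment name, namespace, label selector, and schedule are assumptions:

    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: pod-kill-example        # hypothetical name
      namespace: chaos-testing
    spec:
      action: pod-kill              # kill the selected pod
      mode: one                     # pick one pod out of the selection
      selector:
        namespaces:
          - pxc                     # assumed namespace of the database pods
        labelSelectors:
          "app.kubernetes.io/instance": "cluster2"   # assumed label
      scheduler:
        cron: "@every 10m"          # repeat the kill every 10 minutes

Applying the resource with kubectl apply -f starts the experiment, and deleting it stops it.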

For our Percona Kubernetes Operators, I found Network Failure especially interesting, as clusters that rely on network communication should provide enough resiliency against network issues.

Let’s review an example of how we can emulate a network failure on one of the pods. Assume we have cluster2 running.
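The pod list can be checked with a plain kubectl call; the pxc namespace is an assumption:

    kubectl get pods -n pxc
    # a three-node cluster should show cluster2-pxc-0, cluster2-pxc-1 and
    # cluster2-pxc-2 in Running state, plus the proxy pods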

And we will isolate cluster2-pxc-1 from the rest of the cluster by running a NetworkChaos partition experiment against it.
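A minimal sketch of such a partition experiment, assuming the Chaos Mesh v1alpha1 API; the experiment name, the pxc namespace, and the scheduler settings are assumptions:

    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: pxc-network-partition   # hypothetical name
      namespace: pxc
    spec:
      action: partition             # cut traffic instead of delaying it
      mode: one
      selector:
        pods:
          pxc:                      # namespace -> pod names
            - cluster2-pxc-1
      direction: both               # drop traffic in both directions
      target:
        mode: all
        selector:
          pods:
            pxc:
              - cluster2-pxc-0
              - cluster2-pxc-2
      duration: "3s"                # how long the pod stays isolated
      scheduler:
        cron: "@every 1000s"

The experiment starts as soon as the resource is applied with kubectl apply -f.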

This will isolate the pod cluster2-pxc-1 for three seconds. Let’s see what happens with the workload we directed at the cluster2-pxc-0 node; the output is from the sysbench-tpcc benchmark.
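Such a workload can be started with the sysbench-tpcc scripts roughly as follows; the host, credentials, and sizing flags are assumptions:

    # https://github.com/Percona-Lab/sysbench-tpcc
    # (after a matching 'prepare' run)
    ./tpcc.lua --mysql-host=cluster2-pxc-0.cluster2-pxc \
      --mysql-user=root --mysql-password="$ROOT_PASSWORD" --mysql-db=sbt \
      --db-driver=mysql --tables=10 --scale=10 \
      --threads=16 --time=300 --report-interval=1 run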

And we can check the log from the cluster2-pxc-1 pod.
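The log can be followed with a plain kubectl call; the container name pxc is an assumption based on the operator’s pod layout:

    kubectl logs -n pxc cluster2-pxc-1 -c pxc -f --tail=100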

We can see that the node lost communication for three seconds and then recovered.

There is a Galera variable, evs.suspect_timeout, with a default of five seconds, which defines how long the nodes will wait before forming a new quorum without the affected node. So let’s see what happens if we isolate cluster2-pxc-1 for nine seconds.
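Only the duration in the NetworkChaos sketch shown earlier needs to change to push the partition past evs.suspect_timeout (the timeout itself could also be tuned through wsrep_provider_options, but here we keep the default):

    # same NetworkChaos spec as above, now longer than evs.suspect_timeout
    duration: "9s"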

The workload stalled for five seconds but continued after that. And we can see from the log what happened with node cluster2-pxc-1. The log is quite verbose, but in short:

  1. After five seconds, the node declared that it had lost connection to the other nodes
  2. It figured out it was in the minority and could not form a quorum, so it declared itself NON-PRIMARY
  3. After the network was restored, the node reconnected with the cluster
  4. The node caught up with the other nodes using the IST (Incremental State Transfer) method
  5. The cluster became a three-node cluster again (a quick check is shown after this list)
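Once the node rejoins, the membership can be double-checked from any node with the standard Galera status counters; the root credentials here are an assumption:

    kubectl exec -n pxc cluster2-pxc-0 -c pxc -- \
      mysql -uroot -p"$ROOT_PASSWORD" \
      -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%'"
    # wsrep_cluster_size should be back to 3 and wsrep_cluster_status Primary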

Conclusion

ChaosMesh is a great tool for testing the resiliency of a deployment, and in my opinion, it can be useful not only for database clusters but also for testing general applications, to make sure the application can sustain different failure scenarios.