MySQL master discovery methods, part 4: Proxy heuristics

Note: the method described here is an anti-pattern.

This is the fourth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which replication failure detection and recovery take place. I will share orchestrator-specific configuration and advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, or Severalnines, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Proxy Heuristics

In Proxy Heuristics all clients connect to the master through a proxy. The proxy observes the backend MySQL servers and determines who the master is.

This setup is simple and easy, but it is an anti-pattern. I recommend against using this method, as explained shortly.

Clients are all configured to connect to, say, cluster1-writer.proxy.example.net:3306. The proxy intercepts incoming requests based either on hostname or on port. It is aware of some or all of the MySQL backend servers in that cluster, and routes traffic to the master M.
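
ProxySQL, for example, can implement this style of heuristic out of the box: it monitors read_only on the backends and shuffles servers between a writer and a reader hostgroup accordingly. A minimal sketch of such a setup, with hypothetical hostnames and hostgroup IDs (10 for the writer, 20 for readers):

    -- Run against the ProxySQL admin interface; hostnames are hypothetical.
    INSERT INTO mysql_servers (hostgroup_id, hostname, port)
      VALUES (10, 'db1.cluster1.example.net', 3306),
             (10, 'db2.cluster1.example.net', 3306);

    -- ProxySQL polls @@global.read_only on each backend and places it in
    -- the writer hostgroup (10) when it reports 0, the reader hostgroup (20) otherwise.
    INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
      VALUES (10, 20, 'cluster1');

    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;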

A common heuristic I’ve seen in use is: pick the server that reports read_only=0.
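
In its simplest form, the proxy’s health check then amounts to polling each backend with something along these lines:

    -- Issued by the proxy against each backend, every second or so:
    SELECT @@global.read_only;
    -- 0     => consider this server the master; route writes to it
    -- 1     => consider it a replica
    -- error => consider it unhealthy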

Let’s take a look at how this works and what can go wrong.

An unplanned failover: illustration #1

Master M has died; the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows about both M and R.
  • Notices M fails health checks (select @@global.read_only returns an error, since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.

Success, we’re happy.

Configuration tip

With an automated failover solution, use read_only=1 in my.cnf at all times. Only the failover solution will set a server to read_only=0.

With this configuration, when M restarts, MySQL starts up as read_only=1.
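
In my.cnf terms, the tip amounts to:

    [mysqld]
    # Default to read-only on startup; only the failover
    # solution flips this to 0 at runtime, on the promoted master.
    read_only = 1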

An unplanned failover: illustration #2

Master M gets network isolated for 10 seconds, during which time we fail over. R gets promoted. Our tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows about both M and R.
  • Notices M fails health checks (select @@global.read_only times out while the box is network isolated).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • Ten seconds later, M comes back to life, still claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.
  • The proxy has no context. It does not know why both report the same thing. It is unaware of failovers. All it sees is what the backend MySQL servers report.
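
To make the ambiguity concrete, here is what the proxy’s checks might report at this point (an illustrative transcript, not real output):

    -- against M:  SELECT @@global.read_only;  --> 0
    -- against R:  SELECT @@global.read_only;  --> 0
    -- Both backends look like healthy masters; depending on the proxy's
    -- tie-breaking logic, writes may go to M, to R, or flap between the two.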

Therein lies the problem: you cannot trust multiple servers (the MySQL backends) to deterministically pick a leader (the master) without collaborating on some form of consensus communication.

An unplanned failover: illustration #3

Master M’s box is overloaded, serving too many connections and failing to respond to new ones.

Our tool decides to fail over.

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Notices M fails health checks (select @@global.read_only does not respond because of the load).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • Shortly after, M recovers (since no more writes are sent its way) and again claims read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.

Again, the proxy has no context, and neither do M and R, for that matter. The context (the fact we failed over from M to R) was known to our failover tool, but was lost along the way.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • M is available and responsive, we set it to read_only=1.
  • We set R to read_only=0.
  • All new connections route to R.
  • We should also instruct our proxy to kill all previous connections to M.

This works very nicely.
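
For completeness, a sketch of the SQL the failover tool would issue during such a graceful promotion (the connection-killing step is proxy-specific and omitted):

    -- On M, the demoted master: stop accepting writes.
    SET GLOBAL read_only = 1;

    -- On R, the promoted master: start accepting writes.
    -- The proxy's next health check will start routing traffic here.
    SET GLOBAL read_only = 0;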

Discussion

There is substantial risk to this method: failovers tend to correlate with network partitioning and load issues (illustrations #2 and #3), so the scenarios in which the heuristic breaks are precisely the likely ones.

The root of the problem is that we expect individual servers to resolve conflicts without speaking to each other: we expect the MySQL servers to correctly claim “I’m the master” without context.

We then add to that problem by using the proxy to “pick a side” without giving it any context, either.

Sample orchestrator configuration

By way of discouraging use of this method I do not present an orchestrator configuration file.
