Raft 4: Bare Minimum Log Replication

Reading Time: 11 minutes

In November, I took Dave Beazley’s week-long course on The Structure and Interpretation of Computer Programs.

In December, I took Rafting Trip, in which we attempt to implement the Raft algorithm from this paper. In this series, I’ll share memorable insights from the course (here’s where you can see all the posts so far).

In the previous post, we left off here:

We now have our building blocks in place. It’s time to get specific to Raft.

Raft’s ultimate purpose: to allow a server cluster to operate as a single source of truth without possessing a single point of failure. So we must coordinate the servers’ understandings of the up-to-date data. How do the servers catch each other up on the data if some of them go down and come back up?

It is this piece—log replication—that we turn to next.

So we need one server to be able to update the logs of another server after that other server goes down.

This requires a few things:

  1. Each server needs some kind of persistent log such that, if it goes down and comes back up, the state it knew about before it went down is preserved. (This is not strictly necessary; another server could catch up a downed server from zero if it had to. However, this would make the system much more costly to maintain as the shared data store got bigger.)
  2. Among multiple servers, we need to establish who is doing the catching up and who is being caught up.
  3. Once a server is established as the one doing the catching up, it must determine how much catching up the other servers need.
  4. Then that server must transmit the logs to be caught up.

In Raft, a lot happens to bring this to fruition:

  • Servers send “heartbeat” messages to each other periodically to tell each other that they’re still up.
  • The servers hold elections to determine which of them is the “leader.”
  • All servers redirect client requests to write to the data store to the leader.
  • If the servers don’t get a heartbeat from the leader for a while, they hold a leader election.
  • Once a new leader is elected, that leader’s job is to get all the other servers’ logs up to date.
  • In the case of a network partition, it’s possible for multiple servers to be elected leader. The edge cases here get pretty wild.

Here’s a walkthrough, if you’re curious. We’re not going to worry about all that yet. For now, we want to replicate logs, and the rest of this can wait. Here’s my plan for a minimum viable feature set for log replication:

  1. Get a server
  2. With a key value store
  3. that takes the command “you’re the leader”
  4. and updates the logs of all the other servers

We have our pared down challenge. Let’s go! 

How does Raft replicate logs?

The Raft paper explains that the newly elected leader sends a message to the other servers to append the last item in its log to their logs. If that's the only item the recipient server needs (or if it's already up to date), it responds with True. If, however, the recipient server's log is more out of date than that, it responds with False. The leader then instead sends the last two items in its log to append. This continues, back and forth, until the leader is sending exactly the set of entries that the follower server is missing.
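That back-and-forth can be sketched as a toy model with plain Python lists. This is my own simplification, not the paper's AppendEntries RPC: it ignores terms, indices into committed state, and the paper's consistency checks, and just grows the suffix the leader sends until the follower accepts it.

```python
# Toy sketch of Raft's back-off replication scheme (names are mine,
# not the paper's RPC spec). Logs are plain lists of command strings.

def append_entries(follower_log, prev_index, entries):
    """Follower accepts only if its log already reaches prev_index."""
    if len(follower_log) < prev_index:
        return False  # follower is missing earlier entries; leader backs up
    del follower_log[prev_index:]  # drop any conflicting tail
    follower_log.extend(entries)
    return True

def catch_up(leader_log, follower_log):
    """Send the last n entries, growing n until the follower says True."""
    n = 1
    while not append_entries(follower_log, len(leader_log) - n, leader_log[-n:]):
        n += 1

leader = ["set a 1", "set b 2", "set c 3", "set d 4"]
follower = ["set a 1"]
catch_up(leader, follower)
# follower now matches the leader entry for entry
```

The leader tries one entry, gets False twice, and succeeds on the third round by sending the last three entries.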

When I read this, it sounded to me like a monumental waste of time. Couldn’t the leader ask the follower “how much of the log do you have?” and then send over anything that the follower server doesn’t have? (Yes, but it’s more complicated than this—as I learned later, when the idea behind the ‘waste of time’ approach started to make sense).

The Raft paper does discuss such a scheme as an “optimization,” and I went ahead and prematurely optimized (perhaps to my detriment, as you’ll see at the end of this post). So far, our server responds to messages to get, set, and delete items in the key value store.

Now, I want it to respond to some new messages. I want it to respond to “how much of the log do you have?” with the state of its log. I want it to respond to “I have this much log” with either “Ah, you have just as much log as I do” or “Oh, you’re missing things! Here they are!” Finally, I want it to respond to “Oh, you’re missing things! Here they are” with an indication of successful catch up (and a thank you, because my software has manners):

[Diagram: Server Conversation]

In this schematic, the server on the left represents the leader. How does a server become the leader? I don’t want to take on leader elections yet, so instead we’ll spin up a client and manually nominate a leader from that client. The leader will then begin the conversation with each of the other servers about catching up their logs:

[Diagram: Nominating a Leader from a Client]

We implement those responses in this commit. Here’s the juicy bit from server.py:

[Screenshot: the message-handling code from server.py]
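A minimal sketch of that conversation might look like the following. The message strings and the in-memory log are illustrative guesses on my part; the actual commit differs in its details.

```python
# Sketch of the log-catchup conversation (message formats assumed,
# not taken verbatim from the commit).

class Server:
    def __init__(self, name):
        self.name = name
        self.key_value_store = {}
        self.log = []  # commands applied so far, as strings

    def respond(self, message):
        if message == "how much log do you have?":
            return f"I have this much log: {len(self.log)}"
        elif message.startswith("I have this much log:"):
            their_count = int(message.rsplit(":", 1)[1])
            if their_count == len(self.log):
                return "Ah, you have just as much log as I do"
            missing = self.log[their_count:]
            return "Oh, you're missing things! Here they are: " + "; ".join(missing)
        elif message.startswith("Oh, you're missing things!"):
            entries = message.split(": ", 1)[1].split("; ")
            self.log.extend(entries)
            return "Caught up. Thank you!"
```

Feeding each server's reply to the other plays out the whole polite exchange, ending with the follower's log matching the leader's.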

So the server knows how to have the conversation. The server also needs to know who to have this conversation with. Somehow, we need the servers to know about all the other servers that should be up, as well as the addresses of those servers so that they can send messages to them.

Up until now, each of our servers has only needed one socket: a socket to listen on, to receive messages and respond. Now, our servers need to manage another socket. They need to create a temporary socket to send messages to servers. They also need to provide some kind of return address so that those servers to whom they send messages can respond to them.

So we’re going to have servers register themselves on startup in a server registry file. We implement that in this commit. Again, the juicy bit:

[Screenshot: the server registration code]
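Registration can be sketched as a one-line append to a shared file. The file name and the "name host port" line format here are my assumptions based on the registry contents shown below, not the commit's exact code.

```python
# Sketch of startup registration: append a "name host port" line to a
# shared registry file. (File name and format assumed from the post.)

def register(name, host, port, registry_path="server_registry.txt"):
    with open(registry_path, "a") as f:
        f.write(f"{name} {host} {port}\n")
```

Because this always appends, restarting a server adds a duplicate line rather than replacing the old one, which matches the registry file shown below.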

I accidentally committed the server registry file, so here’s what it looks like:

[Screenshot: the server registry file]

Entries don’t get removed when a server shuts down. You see myserver localhost 10000 in there five times because I started a server named myserver on port 10000, then shut it off, then restarted it, then shut it off, and so on. A leader server should attempt to contact all servers in the cluster, whether or not those servers are up right now. This file is how the leader server will know who to contact. Every server in the cluster has to have been up at some point to get registered. I think this is fine. If a server never comes up, I don’t mind that none of the leader servers ever know to try to contact it.

Servers can get the addresses of the other servers in the cluster (this commit), like so:

[Screenshot: code to read server addresses from the registry]

Open a temporary socket to talk to a socket at one of those addresses (this commit),

[Screenshot: code to open a temporary socket and send a message]

And even send the same message to all of the other servers at once! (this commit.)

[Screenshot: code to broadcast a message to all the other servers]
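The three steps above can be sketched together: parse the registry, open a short-lived TCP socket to send one message, and broadcast to every registered server. Function names, the TCP choice, and the "last registration wins" rule for duplicate lines are my assumptions, not the commits' exact code.

```python
import socket

# Sketch of registry reading, temporary-socket sending, and broadcast.
# (Names and details assumed; the commits differ.)

def server_addresses(registry_path):
    """Parse 'name host port' lines; the last registration for a name wins."""
    addresses = {}
    with open(registry_path) as f:
        for line in f:
            name, host, port = line.split()
            addresses[name] = (host, int(port))
    return addresses

def send_message(message, host, port):
    """Open a temporary TCP socket, send one message, then close it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((host, port))
        sock.sendall(message.encode())

def broadcast(message, addresses, sender_name):
    """Send the same message to every registered server except ourselves."""
    for name, (host, port) in addresses.items():
        if name == sender_name:
            continue
        try:
            send_message(message, host, port)
        except OSError:
            pass  # that server is down right now; the leader moves on
```

Swallowing the `OSError` matters: the leader should keep working through the cluster even when some registered servers are currently down.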

Then, when a server receives a message that it has become the leader, it can start a conversation about log catchup with each of the other servers (this commit).

[Screenshot: the leader kicking off log catchup with each of the other servers]

Servers preface all of their sent messages with their name (this commit),

[Screenshot: code prefacing outgoing messages with the server's name]

whether they’re broadcasting or sending a message to a specific server.

[Screenshot: the broadcast and direct-send methods applying the name prefix]

Servers can then split this name from the message,

[Screenshot: code splitting the sender's name from the message]

such that they can get the address of a server with that name and send their response to the correct recipient.

[Screenshot: code looking up the sender's address to route the response]
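The prefix-and-split convention can be sketched in a couple of lines. The single-space delimiter is my assumption; it works as long as server names never contain spaces, since only the first space is split on.

```python
# Sketch of the name-prefix convention (delimiter assumed: one space;
# server names must therefore not contain spaces).

def prefix_with_name(name, message):
    """Preface an outgoing message with the sender's name."""
    return f"{name} {message}"

def split_name(raw):
    """Split off the sender's name so we can look up a reply address."""
    name, _, message = raw.partition(" ")
    return name, message
```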

Now we can start up a cluster of servers and watch the conversations happen. In this example, I started up a cluster of servers named every, tom, dick, and, harry (after a phrase my mother used when I was young to mean "a whole bunch of people"). Raft employs an odd number of servers in the cluster—usually five or seven, in examples—so these are our five servers.

Here, I have started up every Tom Dick and Harry, and I have started a client to send Dick some commands: set a 1, set b 2, set c 3, set d 4. Now Dick has four records that none of the other servers have. I then use the client to tell Dick youre_the_leader. What follows is Dick's log output as it catches up the server named every:

[Screenshot: Dick's log output while catching up the server named every]

See how much more fun this is when the servers are polite? I try to inject joy into my projects by giving my servers human names and teaching them basic courtesy. It even makes the mistakes more fun:

Ugh, I hate your elifs, Chelsea. Don’t you know about objects?

Have you read this blog? Of course I know about objects. But software doesn’t usually model objects, as we’ve discussed before:

Many authors [have] their examples reference physical objects. Uncle Bob’s books famously do a step-by-step software implementation of the Mark IV Coffee Maker. Practical Object-Oriented Design in Ruby, by Sandi Metz, describes the object-oriented approach to building a bicycle.

The goal here is noble: to choose a domain that readers already understand, so that they don’t have to learn new concepts unrelated to software in order to access the example.

The problem: we do not build coffeemakers and bicycles out of software. So when we’re discussing separation of concerns, the ease with which you can do that in these examples is fake. Yes, clearly there is a different set of responsibilities for a gear and a handlebar, or a water heater and a coffee pot. The boundaries between the concepts that we represent in most software—like users and accounts—are much, much fuzzier. It’s this messiness that programming texts miss, which forces programmers to go forth and bridge the gap between these toy examples and the abstract world on their own.

So how do we bridge the gap between the infinitely nuanced abstract world and our limited options for representing it in code?

Here, I am keeping all my functionality in one place until the data I have collected on the problem reaches the threshold at which the abstractions I define are satisfactorily likely to be accurate. Pulling out separate responsibilities from a preexisting blob is much less bamboozling than attempting to move responsibilities between a suite of inaccurately separated concerns. So, until I understand the problem better, the code looks like this. And you know what? It’s pretty legible.

In the next post, you’ll see some separated concerns begin to emerge.

If you liked this piece, you might also like:

This series on a framework for feature engineering (fun for the data scientists and ML engineers)

The risk-oriented testing screencast (example in Ruby, concept useful everywhere)

The time and space efficiency series (again, example in Ruby, concept applicable in other languages)
