Raft 9: Log Replication, Part 3

Reading Time: 12 minutes

In December, I took a course in which we attempted to implement the Raft distributed consensus algorithm from this paper. Parts 1-5 of this series share insights from the course. From now on, I’m guiding you through my continued work implementing Raft “for fun” (I know. I don’t understand me, either).

Here’s where you can see all the posts so far.


In the previous post, our follower servers learned to check that their existing logs were up to date before adding whatever the leader sent.

So now it’s time to actually do something about out-of-date logs.

Here’s what Raft says is supposed to happen (emphasis mine).

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs.

OK. So this is going to work a lot like a professional disagreement, where the holders of the discordant opinions walk back step by step to figure out where they last agreed. Except with software, it’s easier: we have to find the most recent index-term pair in the leader’s log that also appears in the follower’s log.

Let’s look back at the paper.

The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
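
In code, that retry loop has roughly this shape. This is a simplified sketch with made-up names (Entry, send_append_entries), not the paper’s pseudocode or my implementation; it just shows the decrement-and-retry structure:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int
    command: str

def catch_up_follower(leader_log, send_append_entries):
    """Start just past the leader's last entry; walk back one entry per rejection."""
    next_index = len(leader_log)
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index].term if prev_index >= 0 else 0
        # send_append_entries stands in for the real RPC. It returns True when the
        # follower's consistency check passes and the entries get appended.
        if send_append_entries(prev_index, prev_term, leader_log[next_index:]):
            return
        next_index -= 1  # rejected: back up one entry and try again
```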

So suppose a leader’s log looks like this:

[Screenshot: the leader’s log]

And a follower’s log only has that first line: “0 0 set zero 0.” Each call from the leader will indicate what its last index-term pair is:

[Screenshot: an AppendEntries call from the leader, indicating its latest index-term pair]

And if the follower doesn’t have it, it will say so. Then the leader will send its next most recent index-term pair—the one prior to the one mentioned in the follower’s response message—while adding the command that the follower doesn’t have to the list of commands for the follower to append.

[Screenshot: the leader backing up to the prior index-term pair and adding the missing command to the entries to append]

This process repeats until it reaches an index-term pair that the two servers share.
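
In miniature, that search looks something like this. Toy example: only the “0 0 set zero 0” entry comes from the logs above; the other two leader entries are made up.

```python
# Each toy entry is (index, term, command). Walk backward through the leader's log
# until we find an index-term pair the follower also has.
leader_log = [
    (0, 0, "set zero 0"),
    (1, 1, "set one 1"),   # made-up entry
    (2, 1, "set two 2"),   # made-up entry
]
follower_log = [
    (0, 0, "set zero 0"),
]

follower_pairs = {(index, term) for index, term, _command in follower_log}

for index, term, _command in reversed(leader_log):
    if (index, term) in follower_pairs:
        print(f"Logs agree at index {index}, term {term}")
        break
```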

[Screenshot: the leader and follower arriving at an index-term pair they both have]

You’ll notice something about my implementation here: the state machine on the follower is empty. Why? Because we don’t need to implement updates to a state machine that isn’t used. When a client asks a follower for a value based on a key, it replies that it is not the leader, and that only the leader responds to such requests. Were this server to become the leader, then we would need to update the state machine.
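
In sketch form, that check is about this simple (illustrative names, not my actual method):

```python
def respond_to_get(server, key):
    # Followers turn away reads rather than consulting a state machine they never built.
    if not server.is_leader:
        return "I am not the leader. Only the leader responds to get requests."
    return server.state_machine.get(key, "")
```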

I’m not thrilled with my implementation of the above because I didn’t come up with an elegant way to store what I need. Now, not only do I need to fetch commands by their index-term combination, but I also need to be able to fetch the command prior to the one associated with a given index-term combination.

I suppose I could have used a doubly linked list of key-value pairs, but Python doesn’t come with that kind of data structure, and frankly I didn’t feel like implementing a data structure the language doesn’t provide as a means to an end in a fun side project. So instead I did this:

[Screenshot: the new object, under a name I’m not proud of]

Just kidding. I named it better than that:

[Screenshot: the LogDataAccessObject class]

I now populate this object instead of the lone dictionary I had before, which used term-index combinations as keys and commands as values. That dictionary still exists (it goes into term_indexed_logs on the LogDataAccessObject), but in addition, I make a list of the index-term combinations and put that into ordered_logs. You can see those changes here:

[Screenshot: populating term_indexed_logs and ordered_logs]
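
If the screenshot is hard to read, here’s the shape of the thing as a simplified sketch. The class and attribute names are the real ones; the parsing and construction below are illustrative, assuming log lines that look like “index term command,” as in the “0 0 set zero 0” example earlier:

```python
class LogDataAccessObject:
    def __init__(self, term_indexed_logs, ordered_logs):
        # term-index pair -> the command logged at that point
        self.term_indexed_logs = term_indexed_logs
        # the same pairs, in log order, so we can step backward through them
        self.ordered_logs = ordered_logs


def log_access_object(log_lines):
    term_indexed_logs = {}
    ordered_logs = []
    for line in log_lines:
        index, term, command = line.split(" ", 2)
        pair = (term, index)
        term_indexed_logs[pair] = command
        ordered_logs.append(pair)
    return LogDataAccessObject(term_indexed_logs, ordered_logs)
```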

So then, when a leader gets a ‘Logs not updated’ message from a follower and needs to back up a command and send that in its next AppendEntries call, it can do the following (sketched in code after the list):

  1. Get the index in ordered_logs of the index-term pair that the follower didn’t have,
  2. Decrement this index to get the term-index pair of the prior command,
  3. Use this term-index pair to get the prior command from term_indexed_logs,
  4. Add that entry to the list of entries to instruct the follower to append,
  5. Use the term-index pair acquired in step 2 as the new pair for the follower to append entries after.
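
Here’s a simplified sketch of those five steps. The helper name and the shape of the entries are illustrative; my real code, in the screenshot below, has more going on:

```python
def back_up_one_entry(log_dao, rejected_pair, entries_to_append):
    # 1. Where does the pair the follower didn't have sit in the ordered log?
    position = log_dao.ordered_logs.index(rejected_pair)

    # 2. Step back one position to get the prior pair.
    # (The shared root entry described later in this post guarantees there is one.)
    prior_pair = log_dao.ordered_logs[position - 1]

    # 3. Look up the command stored at that prior pair.
    prior_command = log_dao.term_indexed_logs[prior_pair]

    # 4. That entry needs to go to the follower too; prepending keeps log order.
    entries_to_append.insert(0, (prior_pair, prior_command))

    # 5. The prior pair becomes the new "append after this" marker.
    return prior_pair, entries_to_append
```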

My actual block of code is kind of long and in teeny font in this blog post, but feel free to peruse it, and if you don’t want to, rest assured that the five steps above explain what it does.

[Screenshot: the leader’s code for backing up one entry and retrying AppendEntries]

Yes, I am aware that this block of code would be very slow if I had millions of log entries. For one thing it calls log_access_object(), which we saw has a ton of logic, three separate times, when it could easily cache the result after the first call. I’m sure we’ll have more to do in this block, which will give us an opportunity to revisit optimizing it later.

But what about deleting any discordant entries in the follower log?

At this point, the follower doesn’t delete any entries that appear after the one where it agrees with the leader, so anything the leader appends would go on top of those discordant entries. Let’s fix that by deleting any entries that come after the point where the leader and follower diverge. This commit is pretty small. We add a method in the KeyValueStore to clear out those log entries:

[Screenshot: the KeyValueStore method that clears out log entries after a given point]
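
A simplified sketch of what such a method can look like (the name and internals here are illustrative, not copied from the commit):

```python
def clear_log_entries_after(self, shared_pair):
    # Build the access object for this server's log, find the shared entry,
    # and keep only it and everything before it.
    log_dao = self.log_access_object()
    position = log_dao.ordered_logs.index(shared_pair)
    surviving_pairs = log_dao.ordered_logs[:position + 1]
    self.logs = [
        f"{index} {term} {log_dao.term_indexed_logs[(term, index)]}"
        for (term, index) in surviving_pairs
    ]
```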

And then we call it when the follower has found the term-index pair it’s supposed to append after, such that any commands after the one with that term-index pair go away:

[Screenshot: the follower calling that method once it finds the matching term-index pair]
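
Sketched out, the follower-side flow looks roughly like this (illustrative names; the real handler does more, and the exact response strings differ):

```python
def handle_append_entries(self, append_after_pair, entries):
    log_dao = self.log_access_object()
    if append_after_pair not in log_dao.term_indexed_logs:
        return "Logs not updated"          # tells the leader to back up one entry
    # Found the shared entry: drop everything after it, then append the leader's entries.
    self.clear_log_entries_after(append_after_pair)
    for pair, command in entries:
        self.write_to_log(pair, command)   # stand-in for the real append helper
    return "Logs updated"
```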

But wait, Chelsea! There’s another problem! What if the follower’s logs are completely empty?

Good catch. The code we have written will choke if we try to append entries to an empty log, because the leader will never find a term-index pair in the follower’s log that matches one of its own.

Could I have included conditional logic in the AppendEntries response block to deal with this? Of course.

Instead, in this commit, I start initializing the KeyValueStore’s highest_index to 1 instead of 0. Then, in the zero spot (when a server starts up and its log is empty), I add a “synthetic” (fake, fabricated, I made it up) command so that every server has the same first log entry.

[Screenshot: the synthetic command added to the log on startup]
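
In sketch form, the change looks like this. The exact text of the synthetic command is a stand-in; the point is that it occupies index 0, term 0, and sets a value on the empty-string key:

```python
class KeyValueStore:
    def __init__(self):
        self.highest_index = 1       # used to be 0; real commands now start at index 1
        # Synthetic root entry: index 0, term 0, a "set" on the empty-string key.
        self.logs = ["0 0 set  "]
        # ...the rest of __init__ omitted...
```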

This command is lying—it’s reachable, but you’d only try it if you know it’s there. It sets a value on the key of an empty string. In order to get ahold of it, you need to type into a client “get” and then hit the spacebar twice. It works:

[Screenshot: a client fetching the synthetic entry’s value]

But I don’t anticipate an empty string becoming an overloaded key that everyone wants to use in their state machine. And you know what, so what if they do? This can be overwritten without a problem. The only reason it’s there is to ensure that every server’s log does have one common term-index pair at its root.

Do I feel bad about this decision? Not even a little bit.

As far as I’m concerned, it removes an edge that I no longer have to handle.

Look, the Raft paper itself explicitly tells us to use an AppendEntries call as the heartbeat. As far as I can tell, this has no practical reason besides “Honestly any RPC would have done fine here, so we reused the one we already had.”

If they get to do that, I get to do this.

This post is already long, but I’ll mention one last thing.

It is a lot of back-and-forth for the leader to back up by just one entry every time a follower says it’s not caught up, and there are ways to reduce the number of calls here. I’ll include the Raft paper’s comments on this in an appendix at the bottom of this post.

In the meantime, though, we have basic log replication, as verified by the following set of manual tests:

  1. Fire up a leader and a follower server, with the leader having a log with multiple commands and the follower having an empty log.
  2. Check that the leader successfully replicates its logs on the follower.
  3. Without shutting down either server, issue some write commands to the leader.
  4. Check that the new write commands are accurately replicated on the follower log.
  5. Shut down both servers.
  6. Delete some number of commands from the tail of the follower’s log. Replace them with commands that are different from the leader’s, or at least have different index-term pairs at the front.
  7. Fire up the leader again.
  8. Check that the leader successfully replaces the bogus logs on the follower with the correct logs.
  9. Without shutting down either server, issue some write commands to the leader.
  10. Check that the new write commands are accurately replicated on the follower log.

In the next post, I’d typically refactor. But to be frank, my experience with this code base has contradicted most conventional wisdom about refactoring. First of all, as I mentioned in a prior post, I un-refactored something to make it legible when I returned to this code base after a hiatus. Then there’s this:

So maybe in the next post we’ll go straight into implementing elections and see where that takes us.

Appendix: Optimizing Log Catchup, as per the Raft Paper

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.
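
For the curious, the two halves of that optimization can be sketched in a few lines (illustrative names; this is not in my implementation yet, and the entries here are objects with a term attribute):

```python
def follower_conflict_info(follower_log, prev_index):
    # On rejection, report the term of the conflicting entry and the first
    # index this follower stores for that term.
    conflict_term = follower_log[prev_index].term
    first_index = prev_index
    while first_index > 0 and follower_log[first_index - 1].term == conflict_term:
        first_index -= 1
    return conflict_term, first_index


def leader_next_index_after_rejection(conflict_info):
    # Skip back over the follower's entire conflicting term instead of one entry per RPC.
    _conflict_term, first_index = conflict_info
    return first_index
```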

If you liked this piece, you might also like:

How to Jump-Start a New Programming Language, or maybe, even, gain a more concrete mental model of the one you already use!

Lessons from Space: Edge-Free Programming, which also explores an Android app’s design and the engineering choices behind it (plus, cool pictures of rockets!)

How does git detect renames?—This piece is about how git detects renames, but it’s also about how to approach questions and novel problems in programming in general
