Automerge is a library that provides conflict-free replicated data types—data structures that make it convenient to sync data. The team behind it hopes to enable software that can adhere to seven ideals.
I’ve never built software like this before, so it’s been really interesting for me to think about the complexities that come up when building apps like this. I’ve found Automerge to be a very well-designed data format that aims to be “conflict-free” on many levels.
In addition to the core Automerge library, they published a separate library, Automerge Repo (JS, Rust), which provides interfaces and implementations for common tasks around storage and syncing. I was particularly struck by their storage design when reading through the documentation—there were huge implications, but I had no idea how it worked!
Here’s the excerpt of what I was confused about:
> The upshot of all this then is that our model for storage is not a key value store with document IDs as keys and byte arrays as values, but instead a slightly more complex model where the keys are arrays of the form
>
> `[<document ID>, <chunk type>, <chunk identifier>]`
>
> where chunk type is either `"snapshot"` or `"incremental"` and the chunk ID is either the heads of the document at compaction time or the hash of the change bytes respectively. The storage backend then must implement range queries so the storage system can do things like “load all the chunks for document ID x”. In TypeScript that looks like this:
```typescript
export type StorageKey = string[]

export abstract class StorageAdapter {
  abstract load(key: StorageKey): Promise<Uint8Array | undefined>
  abstract save(key: StorageKey, data: Uint8Array): Promise<void>
  abstract remove(key: StorageKey): Promise<void>
  abstract loadRange(keyPrefix: StorageKey): Promise<{key: StorageKey, data: Uint8Array}[]>
  abstract removeRange(keyPrefix: StorageKey): Promise<void>
}
```
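To get a feel for that interface, here's a minimal in-memory sketch of my own (not Automerge Repo's actual implementation), with methods made synchronous for brevity where the real interface returns Promises. The interesting part is `loadRange`: "load all the chunks for document ID x" is just a prefix match on the key array.

```typescript
// My own in-memory sketch of the StorageAdapter contract -- not Automerge
// Repo's real implementation. Methods are synchronous here for brevity.
type StorageKey = string[];

class InMemoryStorage {
  private store = new Map<string, { key: StorageKey; data: Uint8Array }>();

  // Join with a separator that won't appear in key parts, so that
  // ["ab", "c"] and ["a", "bc"] map to different flat keys.
  private flatten(key: StorageKey): string {
    return key.join("\u0000");
  }

  save(key: StorageKey, data: Uint8Array): void {
    this.store.set(this.flatten(key), { key, data });
  }

  load(key: StorageKey): Uint8Array | undefined {
    return this.store.get(this.flatten(key))?.data;
  }

  // "Load all the chunks for document ID x" is a prefix match on the key array.
  loadRange(prefix: StorageKey): { key: StorageKey; data: Uint8Array }[] {
    return [...this.store.values()].filter(({ key }) =>
      prefix.every((part, i) => key[i] === part)
    );
  }

  removeRange(prefix: StorageKey): void {
    for (const { key } of this.loadRange(prefix)) {
      this.store.delete(this.flatten(key));
    }
  }
}
```

A prefix of `["test"]` matches every chunk of the document, while `["test", "incremental"]` matches only the incremental chunks.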
And here’s what stuck out to me: this raised more questions than answers! What do the `key`s look like? What do the `data` values look like? To answer my questions, I reimplemented Automerge Repo’s filesystem storage mechanism.
To be honest, I had a pretty foundational non-understanding of what Automerge documents look like and what they keep track of. I spent about a day or two reading documentation (there’s a lot of it spread across their website and the languages/bindings they support), but in the end, I found it most helpful to generate a document, mutate it, and print out the contents and metadata on each change.
I’m not sure if this summary helps—I definitely recommend experimenting with the library rather than relying on my second-hand information—but here it goes!
Automerge has two ways of serializing a document into its binary format. I think the functions are slightly misnamed since they’re called “save” even though there is no IO. Nitpicking aside, here is what they do:

- `save()` returns the full document along with the full history.
- `save_incremental()` returns a diff of everything that happened since the last save.

In both cases, the document keeps track of what was last saved so that `save_incremental()` is always able to return only what changed since the last save.
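The bookkeeping can be modeled with a toy class—my own sketch, not the real Automerge API, which works on binary changes rather than strings:

```typescript
// Toy model of the save()/save_incremental() bookkeeping: the document
// remembers how much of its history has already been saved, by either method.
class ToyDoc {
  private history: string[] = [];
  private savedUpTo = 0; // index into history of the last save of either kind

  change(op: string): void {
    this.history.push(op);
  }

  // save(): the full document along with the full history.
  save(): string[] {
    this.savedUpTo = this.history.length;
    return [...this.history];
  }

  // save_incremental(): only the changes made since the last save.
  saveIncremental(): string[] {
    const diff = this.history.slice(this.savedUpTo);
    this.savedUpTo = this.history.length;
    return diff;
  }
}
```

So after a full `save()`, the very next `saveIncremental()` returns only the changes made in between, never the whole history again.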
Let’s think about a simple example where a document goes through four edits: we start from an empty string and append one character at a time until we end up with `abcd`. I’m kinda skipping ahead to some kind of final implementation, but I only understood storage after I saved the raw data from `save()` into a file and tried loading it.
| Step | Content | Heads | History | Save |
|---|---|---|---|---|
| 1 | `""` | `[]` | `[]` | |
| 2 | `a` | `[3a5c]` | `[3a5c]` | |
| 3 | `ab` | `[2a06]` | `[3a5c, 2a06]` | `save()` |
| 4 | `abc` | `[a20f]` | `[3a5c, 2a06, a20f]` | `save_incremental()` |
| 5 | `abcd` | `[1d60]` | `[3a5c, 2a06, a20f, 1d60]` | `save_incremental()` |
| 6 | `abcd` | `[1d60]` | `[3a5c, 2a06, a20f, 1d60]` | `save()` |
Since we called “save” four times, we’ll end up with four files. Here they are below. Let’s say that the document ID is `test`—the ID here is just a kind of document name. I’ve ordered these alphabetically by filename, but I also took a few liberties to explain one of the interesting edge cases I saw. Please note especially that the Content column is a complete oversimplification!
Each filename follows the pattern `<document ID>.<chunk type>.<chunk identifier>`.

| Filename | Content |
|---|---|
| `test.incremental.1d60` | `___d` |
| `test.incremental.a20f` | `__c` |
| `test.snapshot.1d60` | `abcd` |
| `test.snapshot.2a06` | `ab` |
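The filenames are just the storage key joined with dots. A tiny helper—my naming, not Automerge Repo's—shows the shape, including the case where a snapshot's chunk identifier contains more than one head:

```typescript
// Hypothetical helper: build a filename from the [documentId, chunkType,
// chunk identifier] key shape. A snapshot's identifier is its heads at
// compaction time, and there can be more than one head after a merge.
type ChunkType = "snapshot" | "incremental";

function chunkFilename(documentId: string, chunkType: ChunkType, heads: string[]): string {
  return [documentId, chunkType, ...heads].join(".");
}
```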
Now, let’s load these files alphabetically.
| Step | Load file | Content | Note |
|---|---|---|---|
| 1 | `test.incremental.1d60` | `""` | No error but no "d" either |
| 2 | `test.incremental.a20f` | `""` | No error but no "cd" either |
| 3 | `test.snapshot.1d60` | `abcd` | |
| 4 | `test.snapshot.2a06` | `abcd` | No error; no-op |
It loads pretty smoothly, even when we load the history slightly out of order! I want to take a second to explain my oversimplification: loading an incomplete document doesn’t raise an error, but we can check whether the document is in an error state by checking if it’s missing any heads. There’s a silent failure, but it’s recoverable as long as we eventually load the missing data.
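One way to picture this silent-but-recoverable behavior is a loader that buffers changes whose parents haven't arrived yet and applies them once they can be connected to the history. This is my own toy model, not how Automerge is actually implemented:

```typescript
// Toy loader: each change names its parent hash, and a change is applied only
// once its parent is present; otherwise it waits in `pending`. Missing heads
// show up as a non-empty pending set -- no error, just an incomplete document.
interface Change {
  hash: string;
  parent: string | null; // null for the first change in the history
  char: string;
}

class ToyLoader {
  private applied: Change[] = [];
  private seen = new Set<string>();
  private pending: Change[] = [];

  load(change: Change): void {
    this.pending.push(change);
    // Keep applying until no more pending changes become ready.
    let progressed = true;
    while (progressed) {
      progressed = false;
      const stillPending: Change[] = [];
      for (const c of this.pending) {
        if (this.seen.has(c.hash)) continue; // duplicate: silent no-op
        if (c.parent === null || this.seen.has(c.parent)) {
          this.seen.add(c.hash);
          this.applied.push(c);
          progressed = true;
        } else {
          stillPending.push(c); // parent missing: wait, don't error
        }
      }
      this.pending = stillPending;
    }
  }

  content(): string {
    return this.applied.map((c) => c.char).join("");
  }

  isComplete(): boolean {
    return this.pending.length === 0;
  }
}
```

Loading a change whose parent is missing leaves the content untouched, and loading the same change twice is a no-op—both of which line up with the tables above.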
Here’s what happens when we start off with a valid document but make it temporarily invalid by loading some incremental changes out of order:
| Step | Load file | Content | Note |
|---|---|---|---|
| 1 | `test.snapshot.2a06` | `ab` | |
| 2 | `test.incremental.1d60` | `""` | No error but no "abd" either |
| 3 | `test.incremental.a20f` | `abcd` | |
| 4 | `test.snapshot.1d60` | `abcd` | No error; no-op |
And lastly, here’s what happens when we’ve already read everything!
| Step | Load file | Content | Note |
|---|---|---|---|
| 1 | `test.snapshot.1d60` | `abcd` | |
| 2 | `test.snapshot.2a06` | `abcd` | No error; no-op |
| 3 | `test.incremental.a20f` | `abcd` | No error; no-op |
| 4 | `test.incremental.1d60` | `abcd` | No error; no-op |
All this to say, Automerge makes it very easy to load data! As long as we have all of the information we need to build a complete history, it doesn’t matter how we store it or how we load it. This greatly simplifies application development since we don’t need to keep a sorted list of changes—note that a sorted list would require something like global state (some kind of centralized database) or an accurate timestamp (which is also a kind of global state).
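Here's a tiny illustration of that point, under the simplifying assumption that chunks are just sets of content-addressed changes: combining them is a set union, and unions give the same result in any order, with duplicates absorbed for free.

```typescript
// Toy illustration (mine, not Automerge's internals) of why load order
// doesn't matter: merging content-addressed chunks is a set union.
function mergeChunks(chunks: string[][]): Set<string> {
  const changes = new Set<string>();
  for (const chunk of chunks) {
    for (const change of chunk) changes.add(change); // duplicates are no-ops
  }
  return changes;
}
```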
Automerge’s flexibility around loading data lets multiple applications share a single data store:
Here’s a quick example of what would happen:
| Step | Client | Content | History | Note |
|---|---|---|---|---|
| 1 | A | `"abc"` | `[00]` | Load |
| 2 | B | `"abc"` | `[00]` | Load |
| 3 | B | `"123abc"` | `[00, b1]` | Edit and save |
| 4 | A | `"abcdef"` | `[00, a1]` | Edit and save |
| 5 | A | `"abcdef"` | `[00, a1]` | Compaction (delete duplicate data) |
| 6 | C | `"123abcdef"` | `[00, b1, a1]` | New client loads everything |
| 7 | C | `"123abcdef"` | `[00, b1, a1]` | Compaction |
At the end of Step 4, after all the editing and saving, we’d see the following files:

```
test.snapshot.00
test.incremental.a1
test.incremental.b1
```
At the end of Step 5, we’d see these files. It’s important that a client only compacts files it has loaded. Here, Client A never loaded the `b1` change, so it left it untouched:

```
test.snapshot.a1
test.incremental.b1
```
And at the end of Step 7, we’d see just one file. Client C compacted the files it knew about. Note that this filename shows that there are two heads as a result of our merge, `a1` and `b1`:

```
test.snapshot.b1.a1
```
At any point, though, any client can refresh its data by re-reading the entire document. Since Automerge filters out duplicate changes, all clients would eventually end up with `123abcdef`.
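To make the compaction rule concrete, here's a toy model of my own (filenames as plain strings, not real Automerge Repo code): a client writes one snapshot named after all of its current heads, then deletes only the chunk files it actually loaded—never files it hasn't seen.

```typescript
// Toy compaction: add a snapshot keyed by the client's current heads, then
// remove only the chunks this client loaded. Chunks it never saw survive.
function compact(
  loadedFiles: string[], // filenames this client read, e.g. "test.snapshot.00"
  allFiles: Set<string>, // everything currently in the store
  documentId: string,
  heads: string[], // heads of the client's document, e.g. ["b1", "a1"]
): Set<string> {
  const after = new Set(allFiles);
  after.add([documentId, "snapshot", ...heads].join("."));
  for (const f of loadedFiles) after.delete(f); // only remove what we loaded
  return after;
}
```

Running the two compactions from the table reproduces the file listings above: Client A's compaction leaves `test.incremental.b1` behind, and Client C's collapses everything into `test.snapshot.b1.a1`.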
At this point, we know enough of the building blocks to implement a storage API for Automerge. This is just one possibility—note that this doesn’t match the interface in the docs!
```typescript
export type StorageKey = string[]

export abstract class MyStorageAdapter {
  // We need a way to save a document snapshot
  abstract saveSnapshot(documentId: string, data: Uint8Array): Promise<void>
  // We need a way to save an incremental change
  abstract saveIncremental(documentId: string, heads: string[], data: Uint8Array): Promise<void>
  // We need a way to load a document
  abstract loadAll(documentId: string): Promise<{key: StorageKey, data: Uint8Array}[]>
  // We need a way to compact duplicate data
  abstract removeSnapshot(documentId: string, heads: string[]): Promise<void>
  abstract removeIncremental(documentId: string, heads: string[]): Promise<void>
}
```
I’m really amazed by Automerge’s deep “conflict-free” nature: its individual data types, its recommended data storage format, and its flexibility towards load order.
I mentioned that I’m working towards building a notes app, so I’m planning on continuing to look into Automerge. I’m still in the design and research phase of this project. I don’t know how far I’ll get, or if I’ll finish, but I’m glad to have had the excuse to look into this so far.