Making Bitcoin Fast: Introduction

I've been busy for the last month or so working on a really interesting engineering problem: Making Bitcoin's chainstate database fast. The chainstate database is the core part of Bitcoin that tracks current/active balances and transactions; from a data perspective it's what makes Bitcoin "tick". This has been a really fascinating project because the chainstate database has very unusual access patterns, and a lot of the normal intuition for database optimization is ineffective. It's also been fun because I quickly got productive results: I have a GitHub PR up right now that makes blockchain sync times up to 25% faster. There's more work to be done though. My goal is to make Bitcoin the fastest and most efficient blockchain out there, at least from a database/storage perspective.

To measure my results I've been simulating IBD ("initial block download") times for nodes with slow disks. In other words, I'm simulating how long it takes to do a full sync of the Bitcoin blockchain for users with fast internet bandwidth and not very fast hard drives. This is a scenario that ends up bottle-necked on disk I/O as a result of building the chainstate database. The preliminary results are very promising. Here are the results from my current best branch compared to Bitcoin v0.16.0 (the current stable release):

These are early measurements on code I have not yet submitted for review, and I'm testing on a configuration that's unusually bad for the current Bitcoin code. It's still exciting to see these kinds of numbers, and validates the work I've been doing. In the next post I'll explain the benchmarking methodology I used to create this graph, as well as details about what information is stored in the chainstate database.

The chainstate database itself is implemented in Bitcoin using LevelDB as the storage backend. Most of my work in the last month has been reading the LevelDB source from top to bottom, experimenting with changes to LevelDB internals, and testing different caching and buffering strategies within Bitcoin. I'm pretty deep in the weeds at this point, and have a solid grasp of most aspects of LevelDB internals regarding data layout and caching. A lot of important information in these areas is either completely undocumented, or only exists as out of date notes other people have written online. The result is that there's a lot of misinformation about LevelDB, and many projects (Bitcoin just being one of them) are leaving a lot of important optimizations on the table.

My primary goal in this series of blog posts is to document the darker corners of LevelDB, and explain the changes I'm making to the Bitcoin source code as a result of these discoveries. There's a lot of information locked up in my head that other people should know about. I want to keep a record of what I've found and how I've been testing different hypotheses.

Future articles will include the following:

I think this will be an interesting series of blog posts. Some of these will be somewhat short and some will be longer, but I think most of them will have interesting graphs and code examples. Stay tuned, and feel free to reach out to me over email if you want to discuss any of these topics in more detail.