Thursday, September 16, 2021

Huge Project Build Systems

Summary: Shake won't scale to millions of files, this post says what would be required to make it do so.

While Shake has compiled projects with hundreds of thousands of files, it's never scaled out to millions of files, and it would be unlikely to work well at that size. The most popular build systems that operate at that scale are Buck (from Facebook) and Bazel (from Google). In this post I go through the changes that would need to be made to make Shake scale.

The first issue is covered in my previous post, that Shake doesn't know if you change the build rules themselves. As you scale up, it becomes much more important that if you change the rules, everything is properly tracked. As the number of people involved in a project increases, the rate at which the build system changes will also increase. Both Buck and Bazel solve this problem using a deterministic Python-based configuration language called Starlark. If Shake stopped being a Haskell DSL, I'd argue that it stops being Shake and becomes something different, so it's unclear what could be done there.

The next issue is that every time Shake is invoked, it checks the modification time of every file, and then walks the entire dependency graph. That works fine at 10K files, but as you move to 1M files, it takes too long. The solution is two-fold, first be informed which files have changed using notification APIs (e.g. the Watchman tool), and then use reverse dependencies to only explore the portion of the graph that has changed. Happily, Pepe already has a patch adding reverse dependencies to Shake, so that isn't too infeasible.

The final issue is that Shake was designed as a single-machine build system, not for sharing results between users. When I first wrote Shake, I didn't have access to servers, and AWS was brand new. Now, over a decade later, servers are easy to obtain and large scale build systems need to share results, so that if one user builds a file, no one else needs to. Within the realm of multi-user build systems, there are two basic operations - sharing results and offloading commands.

Shake, with it's new cloud features, is able to share results between users using a shared drive. It works, and big companies are using it for real, but I'd consider it fairly experimental. For execution, Shake is unable to run actions remotely, so can't make use of something like Bazel's remote execution API. Since dependencies are specified at the rule level, and remote execution operates at the command level, there is a bit of a mismatch, and it's unclear what that might look like in Shake.

While Shake won't work at huge scales, it is still quite an effective build system at quite large scales. But, given the limitations, I imagine it will never get to the scale of Buck/Bazel. At the same time, Buck/Bazel lack dynamic dependencies, which makes them unable to express rules such as Haskell effectively.

Happily, I am involved with a new build system, the next generation of Buck. I joined Facebook two years ago, and since that time have been focused on this project. It's written in Rust, configured with Starlark (I've spent a lot of time working on an open-source Starlark interpreter in Rust), and should work at huge scales. It's not yet open source, but it will be - we are targeting early next year.

I think Shake is still a build system with a lot to offer, and continue to maintain and improve it. For people who want to scale beyond the range of Shake, I'd definitely recommend using the next generation of Buck, once it is available.

2 comments:

John Ericson said...

At some point, it might be nice to trade notes with you on my https://github.com/NixOS/rfcs/pull/92 vs any plans for dynamism with the next-gen Buck.

As is well-known, Nix already has "hermeticity / reproducibility", between sandboxing and the fact that outputs can't get mixed up with inputs, and the same "it's an interpreter" solution to safe caching accross build system changes as Bazel/Buck. "Floating content addressed derivations", which are already merged, give it early cut-off, and now this (using much of the same underlying mechanisms, actually) makes it dynamic -- either full Monadic or "Selective applicative functor", actually based just on on small tweaks, as just occurred to me.

Neil Mitchell said...

@John: Sounds like it would be an interesting thing to discuss. It does seem like these things are ultimately heading in the same direction :)