About a month ago, I was benchmarking the impact of a new Bazel feature and I noticed that a test build that should have taken only a few seconds took almost 10 minutes. My Internet connection was indeed flaking out, but something else didn’t seem right. So I looked and found that Bazel was doing network calls within a critical section, and those calls were the root cause of the massive slowdown. But how did such an obvious no-no get into the codebase? Read on to see how this happened and how gnarly it was to fix!


If you are a Bazel user, you know that we at Google use remote build services for pretty much all of our builds. (And, by the way: you can, and should, use them too!)

Remote-based builds work great when the client machine is attached to a high-speed, low-latency network, which was the common scenario for us in the office, pre-COVID-19. But now that most of us are working from home (WFH), we face less-than-ideal Internet connections, and these are surfacing subtle, long-standing issues throughout.

One of these issues became visible about a month ago when I was benchmarking the behavior of a new feature. To measure that feature’s impact, I was running a build at home and noticed that the build progressed at a glacial pace. I guessed my network connection was misbehaving—and it was. But things didn’t add up. Even under those conditions, the build seemed much slower than it should have been. In particular, all build actions appeared to complete sequentially despite the large number of --jobs (hundreds) we use for remote builds.

First analysis

Seeing actions print messages to the terminal in slow succession made me go “huh, that’s interesting”, which, you know, is how many horror movies, troubleshooting scenarios, and great discoveries start.

My instinctive reaction was to grab a thread dump from the slow Bazel instance with jstack while it was still running. The dump was insightful because all Skyframe threads looked identical in it:

"skyframe-evaluator-23" #1425 daemon prio=5 os_prio=0 cpu=9.67ms elapsed=6.10s tid=0x0000073ef9a2c000 nid=0xd1ad1 waiting for monitor entry  [0x00007f7673746000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.google.devtools.build.lib.runtime.UiEventHandler.actuallyHandle(UiEventHandler.java:286)
        - waiting to lock <0x00000005b2b58248> (a com.google.devtools.build.lib.runtime.UiEventHandler)
        at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:390)
        at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:409)
        at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:142)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.dumpRecordedOutErr(SkyframeActionExecutor.java:1646)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.dumpRecordedOutErr(SkyframeActionExecutor.java:1629)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$1800(SkyframeActionExecutor.java:131)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.actuallyCompleteAction(SkyframeActionExecutor.java:1094)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1065)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:975)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:129)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:81)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:464)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:842)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:312)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
        at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(java.base@11/ForkJoinTask.java:1407)
        at java.util.concurrent.ForkJoinTask.doExec(java.base@11/ForkJoinTask.java:290)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11/ForkJoinPool.java:1020)
        at java.util.concurrent.ForkJoinPool.scan(java.base@11/ForkJoinPool.java:1656)
        at java.util.concurrent.ForkJoinPool.runWorker(java.base@11/ForkJoinPool.java:1594)
        at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11/ForkJoinWorkerThread.java:177)

... and hundreds more "skyframe-evaluator N" entries like this ...

In other words: all Skyframe threads are waiting on a lock to enter the gigantic critical section in UiEventHandler#actuallyHandle (rev 5bb3bec).

IMPORTANT: What’s critical to know at this point is that there is one Skyframe thread for each unit of --jobs parallelism. In other words, these are the threads that execute actions, so if they are all waiting on the same lock, parallel execution is worthless.

The lock all these threads are waiting on belongs to the UI. Wait, what? That sounds dumb. Why are the Skyframe threads locking in the UI? Sure, this is not a graphical application, but it still has a UI. And haven’t we learned by now, in this day and age, that the UI must be decoupled from the actual processing?

Alright, alright, so they are all blocked in the same way. But blocked on what? As it turns out, they are all waiting for one other Skyframe thread to complete:

"skyframe-evaluator-178" #1612 daemon prio=5 os_prio=0 cpu=7.42ms elapsed=5.94s tid=0x0000073ef973c000 nid=0xd1bda runnable  [0x00007f766c984000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(java.base@11/Native Method)
        at java.io.FileInputStream.read(java.base@11/FileInputStream.java:279)
        at com.google.common.io.ByteStreams.toByteArrayInternal(ByteStreams.java:181)
        at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:221)
        at com.google.devtools.build.lib.util.io.FileOutErr$FileOutputReference.getFinalBytes(FileOutErr.java:555)
        at com.google.devtools.build.lib.runtime.UiEventHandler.writeToStream(UiEventHandler.java:415)
        at com.google.devtools.build.lib.runtime.UiEventHandler.actuallyHandle(UiEventHandler.java:368)
        - locked <0x00000005b2b58248> (a com.google.devtools.build.lib.runtime.UiEventHandler)
        at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:390)
        at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:409)
        at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:142)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.dumpRecordedOutErr(SkyframeActionExecutor.java:1646)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.dumpRecordedOutErr(SkyframeActionExecutor.java:1629)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.access$1800(SkyframeActionExecutor.java:131)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.actuallyCompleteAction(SkyframeActionExecutor.java:1094)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1065)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:975)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:129)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:81)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:464)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:842)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:312)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
        at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(java.base@11/ForkJoinTask.java:1407)
        at java.util.concurrent.ForkJoinTask.doExec(java.base@11/ForkJoinTask.java:290)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@11/ForkJoinPool.java:1020)
        at java.util.concurrent.ForkJoinPool.scan(java.base@11/ForkJoinPool.java:1656)
        at java.util.concurrent.ForkJoinPool.runWorker(java.base@11/ForkJoinPool.java:1594)
        at java.util.concurrent.ForkJoinWorkerThread.run(java.base@11/ForkJoinWorkerThread.java:177)

… and this thread, the one that holds the lock, is doing I/O, as we can see at the top of the stack trace. The thread is busy reading from a stream, and it all happens deep inside UiEventHandler#actuallyHandle. Sounds like a pretty obvious problem: I/O inside a critical section.

But if this was such an “obvious” performance problem, how come I was the first one to notice?

There are a couple of reasons. First, I’m on the team that supports builds from home, so I get to run a lot of them over a suboptimal network link. And, second, the problematic build I encountered spews tons of warning messages from the Objective-C compiler. Comparing its behavior to that of any well-behaved build (one that doesn’t generate thousands of warnings) didn’t show the problem.

Remote execution, FUSE, and outputs

So far we have diagnosed the potential problem: we have a critical section that does some I/O and all Skyframe threads have to go through it. Why is this specific I/O problematic and why are all threads trying to enter it?

To reach an answer, we must first understand how Bazel runs actions—and, in particular, how Bazel prints the output of an action to the local console when the action has run on a different, remote machine.

When an action runs remotely, its stdout/stderr streams are, obviously, generated remotely too. If Bazel wants to show them to the user, it has to explicitly download them from the remote service. And because users expect a build system to print the stdout/stderr of every action it runs (these streams typically contain warnings and errors), Bazel always downloads these files.

Typically, network calls are very obvious in code because they look different from everything else. If you are writing a network call, you are very aware that you are doing so. And if you are reading code, network calls are painfully visible because they aren’t simple function calls (as much as gRPC and the like want to make you believe they are).

WARNING: If you are writing or reviewing code and you notice a network call inside a critical section (a synchronized block in Java), all alarms should go off.
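To make the pattern concrete, here is a minimal, hypothetical sketch of what such code might look like; this is illustrative Java, not Bazel’s actual code:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical illustration of the anti-pattern; not Bazel's actual code.
class NoisyStatusReporter {
  // Imagine hundreds of worker threads calling report() concurrently.
  synchronized void report(URL remoteLog) throws IOException {
    try (InputStream in = remoteLog.openStream()) {
      // A network round-trip happens while holding the monitor: every
      // other worker queues up behind this one slow read.
      System.out.write(in.readAllBytes());
    }
  }
}

Spotting the openStream() call inside a synchronized method is easy here. The trouble starts when the network call doesn’t look like one.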

But things are more subtle in our stack at Google. As you may know, we have a FUSE file system to lazily fetch outputs from the remote execution service: Bazel can simply assume that such files are on disk and thus can access them via regular POSIX calls. The stdout/stderr outputs of remote actions are no exception: these are exposed via this FUSE file system too and Bazel can read them as necessary with regular file system operations.

To make things more obscure, Bazel has high-level abstractions on top of these streams (the OutErr and FileOutErr classes). I say obscure because the more abstractions you have between the code and its side effects, the harder it is to spot what is happening under the hood.

By this point you should start piecing things together: the critical section in UiEventHandler#actuallyHandle was hiding both file I/O via OutErr and network calls via FUSE. These two became pathological under specific conditions.

That’s a problem, but it still doesn’t explain why the Skyframe threads would choke on the UI code. Yes, we need to print the stdout and stderr of any action we run, but why weren’t the individual Skyframe threads downloading those files without holding a lock? After all, they do precisely this for regular file outputs, so they might as well download the stdout/stderr first and then feed them to the UI, right?

Enter another abstraction layer: the Reporter, which is used to communicate user-visible events across modules. The Reporter has a method called handle that runs arbitrary code and does so synchronously. Each Skyframe thread calls this method to report the stdout and stderr of an action on completion via SkyframeActionExecutor#dumpRecordedOutErr.

And finally we reach the true problem. The Reporter is just an interface, so the caller has no idea about the implementation behind it. As it turns out… the UiEventHandler class we’ve been talking about is a Reporter, and its synchronous handle method enters that huge critical section, which reads the stdout/stderr from within, potentially faulting remote files in via FUSE.
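In heavily simplified form (these are sketches I wrote for illustration, not the real Bazel types or signatures), the dispatch chain looks something like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified sketches of the players involved; the real classes differ.
class Event {
  final Path stdoutPath; // Backed by the lazy FUSE file system.

  Event(Path stdoutPath) {
    this.stdoutPath = stdoutPath;
  }
}

interface EventHandler {
  void handle(Event event); // Synchronous: returns once the event is processed.
}

class SimplifiedReporter {
  private final EventHandler handler;

  SimplifiedReporter(EventHandler handler) {
    this.handler = handler;
  }

  // Callers (the Skyframe threads) have no idea what this does under the
  // hood: it might be a cheap buffer append... or a blocking network fetch.
  void handle(Event event) {
    handler.handle(event);
  }
}

class SimplifiedUiEventHandler implements EventHandler {
  @Override
  public synchronized void handle(Event event) {
    try {
      // File I/O while holding the monitor. Under FUSE, this
      // innocent-looking read can be a network round-trip.
      byte[] output = Files.readAllBytes(event.stdoutPath);
      System.out.write(output, 0, output.length);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}

Every Skyframe thread funnels through that single synchronized handle, which is exactly what the thread dumps showed.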

Reproduction scenario

To verify this theory and quantify the seriousness of the bug, I wrote a trivial BUILD rule like this:

TSID = "1234567909"
[genrule(
    name = "foo-%d" % i,
    outs = [("foo-%d.txt") % i],
    cmd = ("echo %s - %d; echo %d >$@") % (TSID, i, i),
) for i in range(10000)]

… and I ran it like this, where I used the TSID trick to generate unique actions (which are important to reproduce this scenario!) in each attempt:

sed -itmp -e "/^TSID =/s/\".*$/\"$(date +%s)\"/" pkg/BUILD
bazel clean
bazel build //pkg/...

The goal of this test is to see what happens when we have thousands of uncached, independent actions, each writing a unique message to its stdout. Under these conditions, we expect Bazel to run as many actions concurrently as --jobs allows, given that they do not depend on each other. We also expect Bazel to print the outputs of the actions in no particular order. And because the TSID changes make every run generate novel outputs, we know those outputs will not be cached in any layer: not in the local page cache, not in the remote build system’s cache, and not in the local FUSE daemon.

Voilà. Running this test took almost 10 minutes! Yes… 10 minutes for a build that creates 10,000 different files with one line of text each. For comparison, my 2013 Mac Pro can create these files in 4 seconds with a parallelism of 12. How can Bazel behave so poorly, then? Essentially because every stdout read blocks on a network call via the lazy FUSE file system. If we do the math, this translates to about 60 ms per file, which doesn’t sound terrible on its own… but 60 ms times 10,000 is 10 minutes.

Coming up with a fix

When I found this bug over a month ago, I was very excited because I was convinced that it was a contributing factor to our still-ugly build tail latency, especially in WFH environments. But fixing the problem has proven to be much more complicated, and more frustrating, than I had hoped.

At first, I toyed with the idea of delivering events to the UI asynchronously so as not to block the Skyframe threads. In other words: make Reporter#handle() queue events instead of blocking. Fancy, right? Well, this quickly turned into a pretty complex solution, and stabilizing it would have required a ton of effort. Sure, it would have looked cool from a coding perspective, but the added complexity wasn’t warranted.
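For illustration, the rough shape I had in mind was something like this, reusing the simplified types from the earlier sketch (again, hypothetical code):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the abandoned asynchronous design.
class AsyncReporter {
  private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

  AsyncReporter(EventHandler delegate) {
    Thread consumer = new Thread(() -> {
      try {
        while (true) {
          // The slow I/O now happens here, off the Skyframe threads.
          delegate.handle(queue.take());
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    consumer.setDaemon(true);
    consumer.start();
  }

  // Skyframe threads return immediately instead of blocking on the UI.
  void handle(Event event) {
    queue.add(event);
  }
}

Simple on paper, but the devil is in the details this sketch ignores, such as flushing pending events when the build finishes, preserving ordering guarantees, and bounding the queue’s memory usage.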

As a second approach, I added code to prefetch the stdout/stderr of actions outside of the synchronized block by doing dummy reads of the files. The intent was to make our FUSE daemon do the downloads up front, so that the I/O calls within the critical section didn’t have to touch the network. This worked but, as you can imagine, janakdr@ pointed out that removing the I/O from within the critical section would be the right thing to do. Of course, I thought… but that comes with a trade-off.
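The prefetching workaround was as blunt as it sounds; something along these lines, again with the simplified types from before:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the prefetching workaround.
class OutputPrefetcher {
  static void reportActionOutput(Path stdout, SimplifiedReporter reporter)
      throws IOException {
    // Touch the file before anyone grabs the UI lock so that the FUSE
    // daemon downloads its contents now; the later read inside the
    // critical section is then served from the local cache.
    Files.readAllBytes(stdout);
    reporter.handle(new Event(stdout));
  }
}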

You might be wondering why the I/O happened inside the critical section in the first place. There were two reasons:

  1. The UI tried to dynamically limit the amount of data printed to the console based on a concept of “remaining capacity”. And because of this, we couldn’t know how much data to read from the output files before grabbing the state lock.

    michajlo@, who reviewed my code, raised the very good point that the current design of the UI looked overly complicated and that some features were getting in the way of fixing the issue described here. So we teamed up and simplified things, first by removing message deduplication and then by removing the dynamic capacity limiting logic.

  2. The UI doesn’t want the printing of a file to mess with its own console state tracking, which is super complicated. Printing a file should not interfere with the multi-line status display, and the outputs of different actions shouldn’t be intermixed. By doing the I/O in the critical section, we know we are the only writer to the console, so we know that dumping a large file won’t corrupt any other state. This is a property we have chosen to maintain for now.

As a consequence of these two changes, we can move all I/O outside of the critical section, but we still need to do the console writes inside it to preserve the properties mentioned above. The downside is that we need to buffer more data in memory: because we now read the outputs before grabbing the lock, we might have as many as --jobs concurrent threads doing so at once.
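The shape of the resulting fix, expressed with the earlier simplified types, is roughly this (a sketch; see rev ca2f946 for the real change):

import java.io.IOException;
import java.nio.file.Files;

// Sketch of the fix's shape: I/O outside the lock, console writes inside.
class FixedUiEventHandler implements EventHandler {
  @Override
  public void handle(Event event) {
    final byte[] output;
    try {
      // All file I/O (and thus any FUSE-triggered download) happens
      // before taking the lock; each thread buffers its own output.
      output = Files.readAllBytes(event.stdoutPath);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    synchronized (this) {
      // Only fast, local console writes remain inside the critical
      // section, which keeps the status display consistent without
      // serializing the network fetches.
      System.out.write(output, 0, output.length);
    }
  }
}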

Imposing sensible limits

Buffering the output of hundreds of concurrent actions could increase our high-watermark memory usage, so we had to add a limit to prevent memory blowups: we know there are misbehaving actions out there that generate MiBs of useless output.

The first suggestion was to add a flag to cap the maximum memory usage of the UI and to use a semaphore to track usage; think of each semaphore permit as one “used byte”. Neat design, but implementing it still required dealing with files larger than the configured cap, and printing only part of their contents was deemed too complex.
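A sketch of that rejected design, with the caveat that it was never actually written this way:

import java.util.concurrent.Semaphore;

// Hypothetical sketch of the rejected semaphore-based budget.
class UiMemoryBudget {
  // Each permit represents one byte of buffered stdout/stderr.
  private final Semaphore capacity;

  UiMemoryBudget(int maxTotalBytes) {
    this.capacity = new Semaphore(maxTotalBytes);
  }

  void acquireBytes(int bytes) throws InterruptedException {
    // Blocks when the aggregate UI budget is exhausted. Note the wrinkle
    // this design has to handle: a single file larger than maxTotalBytes
    // could never acquire enough permits.
    capacity.acquire(bytes);
  }

  void releaseBytes(int bytes) {
    capacity.release(bytes);
  }
}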

The second suggestion was to be more aggressive: cap how much any given action can print and not worry about tracking aggregate usage. After all, if an action prints more than, say, a few hundred KiBs of data to the console, it is already “misbehaving” in the sense that nobody could possibly want to read that much data. And by setting this cap to a conservative value, like 1 MiB, we limit aggregate memory usage too while keeping the implementation straightforward.

So in the end, we chose to go this second route in rev ca2f946.

The new flag --experimental_ui_max_stdouterr_memory_bytes caps the size of the stdout/stderr Bazel will print for any given action.
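Conceptually, the cap boils down to a bounded read. A simplified sketch (the real logic in rev ca2f946 is more involved than this):

import java.io.IOException;
import java.io.InputStream;

// Simplified sketch of the per-action output cap.
final class TruncatingReader {
  static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
    // Read at most maxBytes; anything beyond that is simply never read,
    // because nobody will scroll through megabytes of one action's output.
    return in.readNBytes(maxBytes);
  }
}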

Results

How does our fix perform?

For the pathological build described above, the fix cuts execution time from 6 minutes to 27 seconds on a machine I have in the office (great Internet connection). On my home machine, which has a more modest connection, execution time goes from 12 minutes to 60 seconds. These numbers do not match the 10 minutes I mentioned earlier, but these timings are quite variable from run to run.

The more interesting result will be to see how this improves overall build time latencies. As mentioned in the opening, I’m certain this bug is slowing down some of our interactive builds—and those are where you want to have the fastest possible edit/build/test cycle. Unfortunately, the fix has just been submitted today so it’ll still take a few weeks for me to have that kind of proof!

Wow, this turned out to be much longer than anticipated. Hope you enjoyed the tour.