Making Async Rust Reliable


Last year was an important year for Async Rust, culminating in the release of async fn in traits, one of Rust’s most long-awaited language features. I’m really proud of the work and expertise the Async Working Group has put into this effort, and grateful for the community feedback that helped shape the release.

We still have important work to do in the coming year, and setting priorities is one of the top things on my mind right now.[1] In this post, though, I want to take a step back and consider how we want Async Rust to look when we’re “done”.

I think many people can agree that async still feels far from finished. There are some obvious features it still lacks, but beyond that, it’s harder to articulate what it needs to really feel complete. In this post, I want to explore a key theme that I think will help answer that question.

Key theme: Reliability

If I could choose only one word to describe what sets Rust apart from other languages, it would be “reliability”. Rust is not the most succinct language or the easiest to learn. Rust is fast, and that is important, but there are other fast languages. Rust’s most important feature is that programs written in it behave predictably in a wide variety of situations. This makes the upfront costs of learning Rust worth it in the end. You might spend more time learning Rust and writing an initial implementation than you would in a more familiar language, but your program will behave more reliably in production, your reviewers will have less work to do, and you’ll spend less time fixing bugs.

With async I want the story to be just the same. But today, there are many more footguns and facerakes to avoid in Async Rust programming than in normal Rust. Some of this comes from the inevitable complexity of the problem space, but much of it – and maybe even most of it – comes from the interaction between that problem space, the language, and the libraries we use to write async code.

I believe that reliability, above all else, needs to be the guiding principle for Async over the coming years. We have a long way to go before async can truly be on equal footing with the rest of Rust, but I believe it can be done.

Aside: Was Async Rust a mistake?

I sometimes see this dichotomy between regular and async Rust leading people to question whether async was a mistake. I understand the sentiment, but it needs to be put into context.

Rust first became popular by composing a unique blend of well-studied language features in a new context, systems programming, which turned out to unlock many possibilities. These included “fearless concurrency”, freedom from undefined behavior and memory vulnerabilities, and a type system powerful enough to build abstractions which encapsulate unsafe code. This was a winning combination born out of decades in programming languages research and industry experience, but it also required quite a bit of iteration and upheaval before hitting 1.0.

When Async Rust came onto the scene, it further tested the limits of how far we could go with this existing combination of features, plus some syntax sugar and additions to the standard library. It turns out the answer is quite far. Straight-line async code with no allocation required, in a production-ready systems programming language – that is a niche that remains unfilled by any other language I know of today.

Rust is pushing the boundaries of what’s possible, like many new languages have done before it. The area where it pushes the most is plausibly async. The result naturally feels less polished and lovely than the core of the language. What’s required to get there, more than anything, is iteration fueled by experimentation.

This is the challenge and opportunity of Async Rust. The thing I take the most encouragement from is that Rust has attracted some of the most thoughtful and talented people in our industry to work both on the language and in the wider ecosystem. I believe we can get it done.

Control flow

Why does Async Rust feel less reliable? Almost all of the footguns I see in Async Rust today are a result of mismatched intuitions about the control flow of async programs. There are three important cases, in which control flow is unexpectedly:

  1. Stopped,
  2. Starved, or
  3. Detached from a caller.

Cancellation

In theory, a Rust future can be canceled and stop executing at any await point. In practice, the cancellation semantics of a Rust future are an implicit contract between the async callee and its caller.

In many cases, the author of an async fn wrote it without thinking carefully about the implications of cancellation. Consider the following function:[2]

async fn read_send(file: &mut File, channel: &mut Sender<...>) {
    loop {
        let data = read_next(file).await;
        let items = parse(&data);
        for item in items {
            channel.send(item).await;
        }
    }
}

This function advances the file handle it was passed to perform a batch read, then processes the items asynchronously one at a time. The problem is that if the caller drops its future in the middle of processing a batch, the remaining items to process are lost without any indication this has happened. file has already been advanced as a result of calling read_next.
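To make the failure concrete, here is a minimal sketch that uses only the standard library. Everything in it other than the shape of read_send is an assumption made for illustration: FakeFile, YieldOnce, send, NoopWake, and poll_then_cancel are hypothetical stand-ins, and the future is polled by hand instead of by a real executor.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a file: each read consumes one batch.
struct FakeFile {
    batches: VecDeque<Vec<u32>>,
}

/// Pending on the first poll, ready on the second: models an await
/// point where the task actually has to wait.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Hypothetical channel send: waits once, then delivers the item.
async fn send(out: &RefCell<Vec<u32>>, item: u32) {
    YieldOnce(false).await;
    out.borrow_mut().push(item);
}

/// Same shape as `read_send` above: read a batch, then send item by item.
async fn read_send(file: &mut FakeFile, out: &RefCell<Vec<u32>>) {
    while let Some(batch) = file.batches.pop_front() {
        for item in batch {
            send(out, item).await;
        }
    }
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `read_send` at most `polls` times, then drops it (cancellation).
/// Returns the delivered items and the number of batches left in the file.
fn poll_then_cancel(polls: usize) -> (Vec<u32>, usize) {
    let mut file = FakeFile {
        batches: VecDeque::from([vec![1, 2, 3], vec![4, 5, 6]]),
    };
    let out = RefCell::new(Vec::new());
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    {
        let mut fut = pin!(read_send(&mut file, &out));
        for _ in 0..polls {
            if fut.as_mut().poll(&mut cx).is_ready() {
                break;
            }
        }
    } // `fut` is dropped here, possibly in the middle of a batch.
    (out.into_inner(), file.batches.len())
}
```

Calling poll_then_cancel(2) from main returns ([1], 1): item 1 was delivered, items 2 and 3 were read out of the file but never sent, and nothing reports the loss. Given enough polls, all six items arrive.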

Sometimes code like read_send is not a problem, because every caller awaits the function, driving it to completion, and none of those callers is unexpectedly canceled either. The problem usually shows up when a caller who either isn’t thinking carefully about cancellation or isn’t aware of the function’s underlying cancellation contract uses a combinator like select:

let mut file = ...;
let mut channel = ...;
loop {
    futures::select! {
        _ = read_send(&mut file, &mut channel) => {},
        some_data = socket.read_packet() => {
            // ...
        }
    }
}

The behavior of the combinator is such that read_send gets called each time we enter the select, and its future is dropped each time we exit. What that means in practice is that every time data is received on the socket, we will lose an arbitrary number of entries from the last batch.

Solution I: Better primitives

Importantly, we don’t have to leave these facerakes lying around for people to step on. Async Rust has implicit cancellation contracts, so we should make sure that the tools we use to express async patterns force you to think about those contracts. That means deprecating select in favor of other combinators, like merge, which was written about by Yoshua Wuyts and later covered by boats.

let mut file = ...;
let mut channel = ...;
merge! {
    repeated(|| read_send(&mut file, &mut channel)) => (),
    some_data = socket.packet_stream() => {
        // ...
    }
}.await;
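The property that matters here is that neither side is dropped while the other makes progress. The sketch below is not the merge API itself; it is a hand-rolled, join-style driver built on the standard library, and yield_now, handle_packets, drive_both, block_on, and run_without_dropping are hypothetical names chosen for illustration.

```rust
use std::cell::RefCell;
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Yields to the caller once, modelling an await point that has to wait.
async fn yield_now() {
    let mut yielded = false;
    poll_fn(|cx| {
        if yielded {
            Poll::Ready(())
        } else {
            yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    })
    .await
}

/// Sends a batch item by item, waiting before each send.
async fn read_send(batch: Vec<u32>, log: &RefCell<Vec<String>>) {
    for item in batch {
        yield_now().await;
        log.borrow_mut().push(format!("item {item}"));
    }
}

/// Handles `count` packets, waiting for each one to arrive.
async fn handle_packets(count: u32, log: &RefCell<Vec<String>>) {
    for n in 0..count {
        yield_now().await;
        log.borrow_mut().push(format!("packet {n}"));
    }
}

/// Polls both futures in turn. Neither is dropped until both are done,
/// so progress on one side never cancels the other.
async fn drive_both(a: impl Future<Output = ()>, b: impl Future<Output = ()>) {
    let mut a = pin!(a);
    let mut b = pin!(b);
    let (mut a_done, mut b_done) = (false, false);
    poll_fn(|cx| {
        if !a_done {
            a_done = a.as_mut().poll(cx).is_ready();
        }
        if !b_done {
            b_done = b.as_mut().poll(cx).is_ready();
        }
        if a_done && b_done {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    })
    .await
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor, sufficient for this sketch only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// Runs both sides to completion and returns the order of events.
fn run_without_dropping() -> Vec<String> {
    let log = RefCell::new(Vec::new());
    block_on(drive_both(
        read_send(vec![1, 2, 3], &log),
        handle_packets(2, &log),
    ));
    log.into_inner()
}
```

The log interleaves items and packets, and every item of the batch shows up, because arriving packets never cause the read_send future to be dropped.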

It’s worth mentioning that allowing AsyncIterator::next() to have state, by making next the implementation target of the trait, would exacerbate this problem further. We have a lot of code written today that, for better or worse, assumes the future returned by next can be dropped freely. On a practical level, converting this code to a world where it can no longer assume this would be painful, and the inability to drop the next future would interact unfortunately with the existing deficiencies in our combinators.

Solution II: Better contracts

The other solution to this problem is to make cancellation an explicit part of the futures contract, embedded in the language. This would require the caller and/or callee to explicitly acknowledge the semantics they are choosing to invoke if those semantics are anything other than poll-to-completion.

Whatever form this takes, I think it will be much more disruptive to the ecosystem than recommending a different combinator library. Still, the kinds of implicit contracts we see today are out of line with Rust’s design philosophy, so I think we will want to do this eventually.

That could happen either through support for !Drop types (which could include support for async Drop) or a new kind of poll-to-completion future, as was experimented with in the completion crate. Either of these could be combined with one or more opt-in cancellation mechanisms, up to and including implicit cancellation points at every await point, as we have today.
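None of these language features exist today, so the closest thing that can be written now is a runtime check. The sketch below is only an approximation of a poll-to-completion contract: MustComplete is a hypothetical wrapper (not the completion crate's API) that records a violation when it is dropped before its inner future finishes. The names two_step and violations_after are likewise invented for the example.

```rust
use std::future::{poll_fn, Future};
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical wrapper: reports at runtime if the inner future is
/// dropped before it has completed. A compile-time contract would
/// need language support; this only detects the problem after the fact.
struct MustComplete<F> {
    inner: Pin<Box<F>>,
    done: bool,
    violations: Arc<AtomicUsize>,
}

impl<F: Future> Future for MustComplete<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let result = self.inner.as_mut().poll(cx);
        if result.is_ready() {
            self.done = true;
        }
        result
    }
}

impl<F> Drop for MustComplete<F> {
    fn drop(&mut self) {
        if !self.done {
            self.violations.fetch_add(1, Ordering::SeqCst);
            eprintln!("error: future dropped before completion");
        }
    }
}

/// A future that needs two polls to finish.
async fn two_step() -> u32 {
    let mut first = true;
    poll_fn(|cx| {
        if first {
            first = false;
            cx.waker().wake_by_ref();
            Poll::Pending
        } else {
            Poll::Ready(())
        }
    })
    .await;
    42
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a wrapped `two_step` at most `polls` times, drops it, and
/// returns how many contract violations were recorded.
fn violations_after(polls: usize) -> usize {
    let violations = Arc::new(AtomicUsize::new(0));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = MustComplete {
        inner: Box::pin(two_step()),
        done: false,
        violations: Arc::clone(&violations),
    };
    for _ in 0..polls {
        if Pin::new(&mut fut).poll(&mut cx).is_ready() {
            break;
        }
    }
    drop(fut);
    violations.load(Ordering::SeqCst)
}
```

Dropping after one poll records a violation; polling twice lets the future finish and records none. A language-level contract would turn the first case into a compile error.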

Starvation

I wrote about starvation in my recent post, for await and the battle of buffered streams. The primary cause of unexpected starvation is the alternating control flow between an async iterator and the async loop body that processes it.
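As a rough illustration of that alternation, the sketch below models a buffered stream whose in-flight requests only advance while the loop is waiting in next. All names (Request, Buffered, slow_body, polls_during_bodies) are hypothetical, and the stream is a simplified model, not a real AsyncIterator implementation.

```rust
use std::cell::Cell;
use std::future::{poll_fn, Future};
use std::pin::{pin, Pin};
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical in-flight request: finishes after `remaining` extra
/// polls and counts every poll in the shared `polls` counter.
struct Request {
    remaining: u32,
    polls: Rc<Cell<u32>>,
}

impl Future for Request {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls.set(self.polls.get() + 1);
        if self.remaining == 0 {
            return Poll::Ready(());
        }
        self.remaining -= 1;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Hypothetical buffered stream: requests are only polled inside `next`.
struct Buffered {
    in_flight: Vec<Request>,
}

impl Buffered {
    async fn next(&mut self) -> Option<()> {
        poll_fn(|cx| {
            if self.in_flight.is_empty() {
                return Poll::Ready(None);
            }
            for i in 0..self.in_flight.len() {
                if Pin::new(&mut self.in_flight[i]).poll(cx).is_ready() {
                    self.in_flight.remove(i);
                    return Poll::Ready(Some(()));
                }
            }
            Poll::Pending
        })
        .await
    }
}

/// Loop body that takes `ticks` polls to finish.
async fn slow_body(ticks: u32) {
    for _ in 0..ticks {
        let mut yielded = false;
        poll_fn(|cx| {
            if yielded {
                Poll::Ready(())
            } else {
                yielded = true;
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        })
        .await;
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// For each loop iteration, how often the in-flight requests were
/// polled while the loop body was running.
fn polls_during_bodies() -> Vec<u32> {
    let polls = Rc::new(Cell::new(0));
    let mut stream = Buffered {
        in_flight: (0..3)
            .map(|_| Request { remaining: 5, polls: Rc::clone(&polls) })
            .collect(),
    };
    block_on(async {
        let mut deltas = Vec::new();
        while let Some(()) = stream.next().await {
            let before = polls.get();
            slow_body(4).await;
            deltas.push(polls.get() - before);
        }
        deltas
    })
}
```

Every entry in the returned vector is zero: while the body runs, nothing polls the remaining requests, however long the body takes.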

Solution I: poll_progress

In response to my post, boats wrote about a solution called poll_progress, which I agree is the most direct solution to this problem. It does have the disadvantage of complicating the code generated by for await, in addition to making the AsyncIterator trait more complicated.

Solution II: Spawning

The only other solution that seems acceptable to me is that we declare the style of iterator described in my post, one that manages active connections behind the scenes, to be a bug. Instead, we can spawn tasks which progress independently of the iterator itself.

The big problem with this approach is that it requires an extra allocation, and therefore doesn’t work in embedded contexts lacking an allocator. Given that Rust is uniquely capable of filling the async/await niche in these contexts, it would be unfortunate to leave this problem unsolved there.

Detached execution

The third kind of control flow problem occurs when a spawned task continues execution without the knowledge of its transitive callers.

The best example I have found of this is an async sqlite library that closed its database handle asynchronously, in a task spawned from its destructor. Since sqlite can only have one handle open at a time, this meant that opening a new handle after dropping the old one could fail non-deterministically.

Solution I: Async Drop

This spawn-in-destructor pattern was necessitated by the lack of async Drop, which I believe we need a solution to. Solving this problem in the fully general case is a big challenge, which is covered well in a classic blog post by Sabrina Jewson.

The main problem comes from what to do when you end up with an async Drop type inside a synchronous context, including simple generic code like Vec. Preventing synchronous drops of async Drop types at compile time, in the general case, requires non-droppable linear types.

While I hope we can find a solution that works in all cases, I also don’t believe we have to. Even synchronous destructors are not guaranteed to run, and as a result they are used for resource management and not soundness. If we manage to catch a large majority of cases where the failure mode in the remaining cases is, say, a resource leak coupled with an error log, that would likely be good enough.
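One way to approximate that "good enough" behavior today is an explicit async close method combined with a synchronous destructor that only reports the leak. This is a sketch under stated assumptions, not a proposal for how async Drop should work: DbHandle, its leaks counter, block_on, and leaks_after are all hypothetical.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical database handle with asynchronous cleanup.
struct DbHandle {
    closed: bool,
    leaks: Arc<AtomicUsize>,
}

impl DbHandle {
    /// Cleanup that callers are expected to await.
    async fn close(mut self) {
        // A real implementation would await the actual I/O here.
        self.closed = true;
    } // `self` is dropped here with `closed == true`.
}

impl Drop for DbHandle {
    fn drop(&mut self) {
        // Fallback for synchronous drops: nothing is spawned, the
        // leak is counted and logged instead.
        if !self.closed {
            self.leaks.fetch_add(1, Ordering::SeqCst);
            eprintln!("error: DbHandle dropped without awaiting close()");
        }
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor, sufficient for this sketch only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// Returns the number of leaks recorded after either closing the
/// handle properly or dropping it synchronously.
fn leaks_after(call_close: bool) -> usize {
    let leaks = Arc::new(AtomicUsize::new(0));
    let handle = DbHandle { closed: false, leaks: Arc::clone(&leaks) };
    if call_close {
        block_on(handle.close());
    } else {
        drop(handle);
    }
    leaks.load(Ordering::SeqCst)
}
```

Awaiting close records no leak, while a plain drop records exactly one and logs it. The cleanup never runs detached from its caller.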

Solution II: Structured concurrency

This problem with detached execution also occurs with regular threads. The standard library recently gained a scoped thread API, which guarantees that all spawned threads have completed before the scope returns. The main advertised benefit of this API is the ability for scoped threads to borrow from the enclosing scope, as in the example below.

use std::thread;

let mut a = vec![1, 2, 3];
let mut x = 0;

thread::scope(|s| {
    s.spawn(|| {
        println!("a={a:?}");
    });
    s.spawn(|| {
        x += a[0] + a[2];
    });
    println!("hello from the main thread");
});

// After the scope, we can modify and access our variables again:
a.push(4);
assert_eq!(x, a.len());

Borrowing from scoped threads is a real benefit,[3] but in my opinion the most important aspect of the API is that the lifetime of each thread is embedded directly in the structure of your program, just like regular control flow. No thread ever outlives the scope s that it was spawned on. This makes reasoning about control flow with scoped threads only incrementally more complicated than the way you would reason about control flow in regular straight-line code.

The async version of this is called structured concurrency. In the past several years, it has been adopted by both Swift and Java.

Some of the benefits of structured concurrency come when all the code in a program uses it, but it can also be adopted incrementally. Structured spawn APIs are possible to write in Rust today as a shim on top of existing executors, so I’m surprised to see this happening almost nowhere in the ecosystem.
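To show how small such a shim can be, here is a sketch in which std::thread::spawn stands in for an executor's spawn function. TaskScope, scope, and run_scoped are hypothetical names. Tasks must be 'static, just as with a typical executor's spawn, so this gives structured lifetimes but not borrowing.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical scope that owns every task spawned through it.
struct TaskScope {
    handles: Vec<JoinHandle<()>>,
}

impl TaskScope {
    /// Spawns a task tied to this scope. `thread::spawn` is an
    /// assumption standing in for an executor's spawn function.
    fn spawn(&mut self, task: impl FnOnce() + Send + 'static) {
        self.handles.push(thread::spawn(task));
    }
}

impl Drop for TaskScope {
    fn drop(&mut self) {
        // Wait for every task, even if the scope body panicked.
        for handle in self.handles.drain(..) {
            let _ = handle.join();
        }
    }
}

/// Runs `body` and does not return until all tasks it spawned are done.
fn scope<R>(body: impl FnOnce(&mut TaskScope) -> R) -> R {
    let mut tasks = TaskScope { handles: Vec::new() };
    body(&mut tasks)
    // `tasks` is dropped here, which joins all spawned tasks.
}

/// Spawns `n` tasks in a scope and returns how many had finished
/// by the time the scope returned.
fn run_scoped(n: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    scope(|s| {
        for _ in 0..n {
            let counter = Arc::clone(&counter);
            s.spawn(move || {
                counter.fetch_add(1, Ordering::SeqCst);
            });
        }
    });
    counter.load(Ordering::SeqCst)
}
```

Because the scope joins its tasks before returning, run_scoped(n) always observes all n increments. With a detached spawn, the same read would race with the tasks.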

If and when the Rust standard library gets a Spawn trait, I think we should seriously consider making structured concurrency part of its API. In the meantime, I’d very much like to see experimentation with such APIs in the ecosystem.

Generators

Speaking of control flow, it’s much harder to reason about when you have to write a manual implementation of poll_next or even next. Generators solve this problem. I’ve found the recent posts by boats on this topic to be helpful.

I don’t have much to say here other than that we should add generators to the language, beginning with async, where it’s needed most. We still need to decide some details about the async iteration trait, including whether to include poll_progress, but these all seem solvable in the next year or so.

Conclusion

Reliability doesn’t cover the full extent of my ambitions for Async Rust, but I believe it contains the most important ones. If we could solve all the problems in this post and never touch async again, we would still have a high-leverage, state-of-the-art systems language that would be useful in a wide range of applications for many years to come.

There’s more I would like to say about the qualities that set Rust apart and their implications for the evolution of async, but I’ll leave it here for now.


  1. If you haven’t yet, I recommend giving Niko’s post on goals for async in 2024 a read.

  2. From A look back at asynchronous Rust by Tomaka.

  3. Unfortunately it is not possible for a work-stealing executor to allow borrowing in scoped tasks today, but I still think the benefits of explicit control flow are worth it.