How Rust optimizes async/await II: Program analysis


In Part 1, we covered how async fns in Rust are compiled to state machines. We saw that the internal compiler implementation uses generators and the yield statement to facilitate this transformation. We also saw that the optimal way to lay out one of these state machines in memory is using an enum-like representation, similar to the following:

enum SumGenerator {
    Unresumed { xs: Vec<i32> },
    Suspend0 { xs: Vec<i32>, iter0: Iter<'self, i32>, sum: i32 },
    Suspend1 { xs: Vec<i32>, iter1: Iter<'self, i32>, sum: i32 },
    Returned
}

Each variable stored in this enum is a local variable of our original function, needed to track the internal state of our state machine. Here, the key point is that iter0 and iter1 are never used at the same time, so we can reclaim the bytes of iter0, using them for iter1 when our state machine transitions to the next stage.
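To see why this enum-style layout saves space, here's a small standalone sketch; it's my own illustration, not the compiler's output, and the types are made up. An enum is only as large as its largest variant plus a discriminant, so fields from different variants can occupy the same bytes:

use std::mem::size_of;

// The enum needs room for only one variant at a time, plus a discriminant.
enum TwoStages {
    First { buf: [u8; 64] },
    Second { rev_buf: [u8; 64] },
}

// A struct, by contrast, must hold both fields at once.
struct BothStages {
    buf: [u8; 64],
    rev_buf: [u8; 64],
}

fn main() {
    // Typically prints 65 and 128: the enum reuses the 64 bytes across
    // variants, while the struct pays for both fields.
    println!("{} {}", size_of::<TwoStages>(), size_of::<BothStages>());
}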

Matching variables with variants

How do we produce a type like the enum above from nothing but a generator’s source code? Let’s revisit the code for this generator:

let xs = vec![1, 2, 3];
let mut gen = move || {
    let mut sum = 0;
    for x0 in xs.iter() {  // iter0
        sum += x0;
        yield sum;  // Suspend0
    }
    for x1 in xs.iter().rev() {  // iter1
        sum -= x1;
        yield sum;  // Suspend1
    }
};

Given this source code, the compiler must decide which local variables go in which variants. Let’s drill down into what’s actually happening here.

After the compiler parses the code, one of the first things it does is lower from a syntax tree to HIR [1], Rust's high-level intermediate representation. One reason to use HIR is that it allows us to take all the rich kinds of syntax that programmers love to use and rewrite them using fewer, more general forms. The benefit of this for compiler developers is that when working with HIR, we have fewer cases to think about.

One example of this is that HIR doesn't have for or while loops; it just has loop. [2] We can rewrite our for loops as regular loop constructs like this:

let xs = vec![1, 2, 3];
let mut gen = move || {
    let mut sum = 0;
    {
        let mut iter0 = xs.iter();
        loop {
            match iter0.next() {
                Some(x0) => {
                    sum += x0;
                    yield sum;  // Suspend0
                }
                None => break,
            }
        }
    }
    {
        let mut iter1 = xs.iter().rev();
        loop {
            match iter1.next() {
                Some(x1) => {
                    sum -= x1;
                    yield sum;  // Suspend1
                }
                None => break,
            }
        }
    }
};

Every yield in the source corresponds to a Suspend variant in our state machine. Now that we can see explicitly where each variable is introduced, how do we determine which variables to save, and which variants they should be in?

To answer this question, we perform liveness analysis at every yield point. Liveness analysis asks the question, for each variable V and suspend point S: Are there any values written to V before point S that might be read after S?

To see this in action, let’s apply it to the above code. First, look at iter0 and the first yield point (labeled Suspend0). Is iter0 ever written to prior to our yield point? Yes. Can the value be read after that point, once the generator resumes again? Yes, in the next iteration of the loop. Therefore, we must store the value of iter0 in the Suspend0 variant.

What about x0? We “write” to x0 prior to our yield, binding it in the Some match arm. Can that value be read after the yield? In the next iteration, we might read from x0. But that only occurs after writing a new value to x0! So, no value stored in x0 prior to a yield will ever be read after that yield. We shouldn’t store the value of x0 in the Suspend0 variant.

We can apply this same analysis to every single variable and every single yield point, to get the following table:

         Unresumed   Suspend0   Suspend1   Finished
xs           ✓           ✓          ✓
sum                      ✓          ✓
iter0                    ✓
x0
iter1                               ✓
x1

Notice that x0 and x1 don’t have any checkmarks in their row. That means we don’t have to store them in our generator at all! Instead, x0 and x1 are temporaries stored on the stack, just like in normal functions. By the time our generator’s resume() method returns, we won’t need the value again, no matter which point in the function we suspended (or returned) from. Liveness analysis proves this is sound, and saves us from using up extra bytes.

Liveness and Drop

Consider the following simple generator:

let gen = || {
    let s = String::from("hello, world!");
    yield;
};

Since we don’t use s after the yield, you might think that there’s no need to store s in our generator. But actually, that’s not the case: At the very end of our generator, s is implicitly dropped, which is a use of s! Therefore, we must store it in the generator.

Rust doesn’t try to be clever here and drop s before the first yield. If it did, it would be changing the semantics of our program in subtle ways. Usually these semantics aren’t something we care about, but sometimes they are: what if s was actually a MutexGuard, for instance? [3]

let gen = || {
    let s = some_mutex.lock();
    // use s here
    yield;
    // do I still own the lock here?
    // in a normal Rust function, the answer is always "yes",
    // and Rust preserves the same behavior in generators.
};

Note that generally speaking, you shouldn’t be holding a lock across a yield point in a generator. But if you do, the lock should behave exactly as it would in a normal function, and this is what happens. Liveness analysis has no bearing on when destructors run, but the running of destructors can change the result of the liveness analysis.
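Another way to observe the drop timing is with a type whose destructor has a visible side effect, as footnote 3 describes. Here's a hypothetical sketch (Noisy is made up for illustration):

// Hypothetical type whose destructor has a visible side effect, so the
// drop timing is observable.
struct Noisy;
impl Drop for Noisy {
    fn drop(&mut self) {
        println!("dropping!");
    }
}

let gen = || {
    let s = Noisy;
    yield;
    // s is dropped here, at the end of the generator body. If the compiler
    // dropped it before the yield instead, "dropping!" would print before
    // the first suspension, an observable change in behavior. Because that
    // drop is a use of s, the value must be stored across the yield.
};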

We can prevent s from being stored in our generator by using an explicit scope, which destroys the variable before yielding:

let gen = || {
    {
        let s = some_mutex.lock();
        // use s here
    }
    // we don't own the lock anymore.
    yield;
};

“Whether to save” vs “where to save”

Now that we have a mapping between variables and variants, can we lay out the bytes of our state machine?

Not so fast. Consider this example:

let gen = || {
    let x = read_line();
    yield;  // Suspend0
    let y = read_line();  // XXX
    process(x);
    yield;  // Suspend1
    process(y);
};

Let’s assume process is some function that takes a parameter by value and does important things with it. Applying the liveness analysis from earlier, we should get something like this:

     Unresumed   Suspend0   Suspend1   Finished
x                    ✓
y                               ✓

Based on our table, we can store x in Suspend0 and y in Suspend1. But unlike our earlier example, we cannot reuse the bytes from x to store y! If we did, the line marked XXX would overwrite the value of x with a new value. We would then move that value in process(x), thinking it’s the value of x, and finally try to reuse the already-moved value (as y, this time) in process(y).

What’s the key difference between this example and the last one? In this example, x is used after we initialize y.

The mapping we built above is useful for understanding our state machine at a given suspension point. But it’s not enough to understand all the constraints of our state machine’s memory layout. In particular, what happens while advancing the state machine, in the code between two suspension points, can place additional constraints on our memory layout.

To solve this problem, we need to draw a “boundary” around the uses of each variable, and take note of which variables are in use at the same time. If any two variables are ever in use at the same time, they must not use any of the same bytes in memory. In other words, their storage must be non-overlapping. As long as we uphold this invariant, we can lay out the state machine however we like, without breaking any code.

We might be tempted to use an analysis similar to before, looking at variable reads and writes to determine when a variable’s memory is safe to reuse. In our previous example, we could have asked the question: Is x used after the first write to y? This sometimes works, but not always. Consider the following example:

let gen = || {
    let x = read_line();
    let x_ref = &x;
    yield;  // Suspend0
    let y = read_line();  // XXX
    process(*x_ref);
    yield;  // Suspend1
    process(y);
};

This is a bit funkier. This time we take a reference to x, and later dereference that to get the original value of x. x is still in scope; therefore, any references to it should be valid as well. This also extends to unsafe — any pointers to x can still be dereferenced.

These cases can be quite difficult to deal with in general. What if, for example, we hand a pointer to x to a function, and it stores it away in some opaque data structure? How would we draw a box around all the times x is used? We can’t use the borrow checker when raw pointers are involved.

Rust does have well-defined semantics about when variables like x and y are destroyed for good: when they go out of scope! As it turns out, the safest thing we can do is to use information about the original source scopes. Since x and y were in the same scope before, Rust won’t overlap their bytes in the final state machine. That’s good; it means that we don’t produce any broken code.

However, it’s obvious that in this example, we probably do want to overlap x and y. Let’s reorder our last use of x to go before initializing y, and again restructure our code to use explicit scopes:

let gen = || {
    {
        let x = read_line();
        yield;  // Suspend0
        process(x);
    }
    {
        let y = read_line();  // XXX
        yield;  // Suspend1
        process(y);
    }
};

Now Rust can see that x and y never exist at the same time, and reuses the memory from x to store y.
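If you're curious, one rough way to observe this is to check the generator's size (exact numbers aren't guaranteed and depend on the types involved):

// With the explicit scopes above, x and y can share storage, so the
// generator only needs room for one of them, plus the discriminant and
// any other saved state. Without the scopes, expect it to be larger.
println!("generator size: {} bytes", std::mem::size_of_val(&gen));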

Gathering our constraints

Concretely, here’s what the compiler does to get the constraints on our memory layout:

  1. Trace through each control flow path in our function, recording when each variable comes into scope and goes out of scope. For every variable, we store a bit which tells us whether the variable might be in scope at this point. [4]
  2. At each point in the function, if two variables a and b can be in-scope at the same time, we record a conflict for them. This means that a and b cannot overlap in the final memory layout of our state machine.
  3. At the end, we’re left with a matrix of conflicts. For example, in our original example from the beginning of the post, we’d see this:
        xs    sum   iter0   iter1
xs             ✗      ✗       ✗
sum      ✗            ✗       ✗
iter0    ✗     ✗
iter1    ✗     ✗

For reference, here is the generator again:

let xs = vec![1, 2, 3];
let mut gen = move || {
    let mut sum = 0;
    for x0 in xs.iter() {  // iter0
        sum += x0;
        yield sum;  // Suspend0
    }
    for x1 in xs.iter().rev() {  // iter1
        sum -= x1;
        yield sum;  // Suspend1
    }
};

✗ marks a conflict between two variables. This “conflicts” relation is symmetric. Storing our matrix thus requires N^2 / 2 bits of memory at compile time, where N is the number of saved local variables in our generator. [5] Note that we only count saved local variables: if a variable is not live across any yield point, we aren’t saving it in our generator, so we don’t record conflict information for it.

The list of saved variables plus their conflict matrix is all we need to lay out the bytes of our state machine. We can place each variable anywhere, reusing memory without changing the behavior of our program, as long as we make sure that no two variables with a “conflict” ever use the same bytes.
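To make this concrete, here's a simplified sketch of such a conflict matrix; this is my own illustration, not rustc's actual implementation, which packs the entries into a compact bitset (see footnote 5):

// Simplified sketch of a symmetric conflict matrix over the saved locals.
// record_conflict marks both (a, b) and (b, a); the layout pass asks
// conflicts_with(a, b) before letting two variables share any bytes.
struct ConflictMatrix {
    n: usize,
    entries: Vec<bool>, // n * n entries; a real implementation packs bits
}

impl ConflictMatrix {
    fn new(n: usize) -> Self {
        ConflictMatrix { n, entries: vec![false; n * n] }
    }

    fn record_conflict(&mut self, a: usize, b: usize) {
        self.entries[a * self.n + b] = true;
        self.entries[b * self.n + a] = true; // the relation is symmetric
    }

    fn conflicts_with(&self, a: usize, b: usize) -> bool {
        self.entries[a * self.n + b]
    }
}

In our running example, xs and sum conflict with everything, while iter0 and iter1 never conflict with each other, so only iter0 and iter1 are allowed to overlap.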

An aside on MIR

So far in this post, we’ve been looking at the surface Rust syntax or a HIR-like syntax when we describe the analysis passes. This is useful because that’s the syntax programmers work with every day, and it’s easy enough for a human to reason about programs this way.

But compilers aren’t good at looking at a program written as nested scopes, conditionals, and loops, and reasoning directly about its behavior. For that, they need a control-flow graph. Rust represents a program’s control-flow graph in MIR, or mid-level intermediate representation.

By the time our code gets to MIR, it has been translated from a few high-level operations into many low-level operations, like assignments, borrows, and individual function calls. The entire MIR for our code example wouldn’t fit on one screen, but here’s part of it:

[Figure: part of the generator’s MIR, rendered as a control-flow graph of basic blocks]

MIR consists of basic blocks (the rectangles in the above diagram), which themselves consist of simple statements in linear order. All control flow happens along the edges of our control-flow graph, between the basic blocks.

By the time our code is lowered to MIR, we don’t have nested scopes anymore. Instead, what we’re left with are StorageLive(x) and StorageDead(x) statements for when each variable goes in and out of scope. These are the statements we use to mark a variable x as being in-scope. Also, all variable names have been replaced with numbers.
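To give a feel for it, here's a hand-written, heavily simplified MIR-like excerpt for the first loop's iterator; it is not the compiler's exact output, and most statements are omitted:

bb3: {
    StorageLive(_4);     // the local for iter0 comes into scope
    // ... initialize _4 with xs.iter() ...
    goto -> bb4;         // enter the loop
}

// ... the loop body, the yield, and the None => break edge ...

bb9: {
    StorageDead(_4);     // iter0 goes out of scope after the loop
    goto -> bb10;        // continue on to the second loop
}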

For an in-depth explanation of MIR, I recommend this blog post introducing MIR. [6]

Getting smarter

Using explicit scopes everywhere in our code is a little verbose. Why can’t we just write this?

let gen = || {
    let x = read_line();
    yield;  // Suspend0
    process(x);
    let y = read_line();  // XXX
    yield;  // Suspend1
    process(y);
};

In some cases, the compiler is smart enough to optimize this. Because process(x) accepts x by value, we are moving the value out of x. If we know there’s no value in x when we first write to y, it’s valid to overlap x and y in the final data structure. Indeed, this is what the compiler does today.
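To spell out that reasoning, here's the same generator annotated with what the compiler can conclude at each point (a sketch of the analysis, not actual compiler output):

let gen = || {
    let x = read_line();
    yield;                   // Suspend0: x is live, so x must be saved
    process(x);              // moves out of x; x is now statically uninitialized
    let y = read_line();     // x holds no value here, so y may reuse x's bytes
    yield;                   // Suspend1: only y is live and needs to be saved
    process(y);
};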

However, there are some important caveats:

Borrowing

If x is ever borrowed before it’s moved, we disable this optimization today. This is because it’s harder to prove that no value exists in x if a pointer to x exists somewhere in our program. [7]

Copy

In this section, we’ve been assuming move semantics. But if the type of x is Copy, what happens when we call process(x)? Instead of moving the value from x, we copy it! That means a value still exists in x, and we don’t do this optimization today. This puts us in the weird position that making a type Copy disables some optimizations!
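For example (a hypothetical sketch; read_num is a stand-in, and process is assumed to take its parameter by value):

let gen = || {
    let x: i32 = read_num(); // i32 is Copy
    yield;                   // Suspend0: x is live and must be saved
    process(x);              // copies x; a value still exists in x
    let y: i32 = read_num(); // x still holds a value, so today y does not
                             // reuse x's bytes
    yield;                   // Suspend1: y is live and must be saved
    process(y);
};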

In many cases, it’s possible to turn the copy of x into a move if we notice that x is unused after our call to process. We don’t do this today, but it’s future work that is tracked by #62952.

Not guaranteed

Even if you never borrow x and its type is not Copy, the semantics of this optimization may change. If you care about using as little memory as possible, using explicit scopes is your safest bet for now.

That said, you only need to think about this for variables which are indeed stored in the generator. That is, only variables which are written to before a yield point and then read from (or implicitly dropped) after that yield point.

Conclusion

In this post, we went over some subtleties that the compiler implementation must consider when optimizing generators. We looked at two different kinds of analysis, liveness analysis and storage conflict detection. These tell us, respectively, whether a variable needs to be stored at some suspension point, and which variables are allowed to have overlapping bytes in the final memory layout.

In the next post, we’ll look at an algorithm which finally decides where every variable goes in our state machine’s memory. Finally, we’ll tie all of this knowledge back to async/await, using a concrete example to see the stages our asynchronous code goes through in the compiler.

Thanks to Josh Dover and iximeow for reviewing a draft of this post. Thanks to Taylor Cramer, Petr Hosek, and Paul Kirth for reviewing a precursor to these posts.

Appendix: Implementation

After my last post, some people requested links to the implementation. If you’re curious what all of this translates to in actual code, you can take a look at the PRs below. The Rustc Guide is a great resource for understanding many important details about the compiler, and it may come in handy while looking through the code.

I’ve also added an appendix to the previous post.


  1. I pronounce it like “hear,” but some people prefer to say H-I-R. ↩︎

  2. Similarly, HIR doesn’t have if or if let, but represents both of them using the more general match. ↩︎

  3. Another example is a silly side-effecting Drop impl that prints a message every time it’s called. If we changed when a destructor was run, you could easily observe that behavior as a change to the program behavior. ↩︎

  4. The exact algorithm we’re applying here is called data-flow analysis. We say a variable “might be in scope,” because data-flow is always an approximation (we can’t trace through every possible execution of a program, so we must approximate conservatively). ↩︎

  5. Today, we still use twice the bits we need (N^2). This makes some of the code computing the layout a bit simpler, but it’s something we can optimize. ↩︎

  6. See also the relevant chapter of the Rustc Guide as a reference. ↩︎

  7. Even though it’s undefined behavior to read from the bytes of x after it’s been moved, writing to the bytes of x is something we allow today. But we might not in the future. Don’t rely on the ability to do this in your unsafe code.

    let gen = || {
        let mut x = read_line();
        let x_ptr = &mut x as *mut Foo;
        yield;  // Suspend0
        process(x);
        let y = read_line();  // XXX
        unsafe { *x_ptr = foo(); }
        yield;  // Suspend1
        process(y);
    };
    

    If we overlapped x and y, it would break the behavior of this code. This is something we want to be conservative about.

    We might choose to break this code one day, which would mean adding a new requirement on unsafe code: You cannot dereference the old address of a variable after it is moved, period. While this is a fairly reasonable requirement, it’s actually adding a new kind of undefined behavior that wasn’t there before. If you have a use case for violating this rule, please comment on this issue: #61849! ↩︎