Waiting for process groups, macOS edition

In the previous posts, we saw why waiting for a process group is complicated and we covered a specific, bullet-proof mechanism to accomplish this on Linux. Now is the time to investigate this same topic on macOS. Remember that the problem we are trying to solve (#10245) is the following: given a process group, wait for all of its processes to fully terminate.

macOS has a bunch of fancy features that other systems do not have, but process control is not among them. We do not have features like Linux’s child subreaper or PID namespaces to keep track of process groups. Therefore, we’ll have to roll our own. And the only way to do this is to scan the process table for processes with the desired process group identifier (PGID) and to keep polling until they are gone.

Unfortunately, there is no portable API to programmatically access the process table. Sure, you can imagine shelling out to ps(1) and parsing its output, but this would be very inefficient and error-prone. So we have no choice but to rely on Darwin-specific primitives for efficiency and reliability.

We can query the process table in Darwin by using sysctl(3) and looking under the kern.proc.pgrp.<PGID> name (MIB). (It’s interesting to see that this is modeled after the BSDs’ kvm_getprocs(3) interface and I’m not sure why Darwin had to merge it with sysctl(3).) However, this MIB doesn’t appear to be documented, so things might break in the future. The way I found this is by looking at Apple’s own ps.c source code.

Let’s coerce the sysctl(3) interface to give us what we want. The interface is not the easiest to use, but it’s not that difficult either; we just have to account for the fact that its return value is of variable size:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

// Waits for a process group to terminate.  Assumes that the process leader
// still exists in the process table (though it may be a zombie), and allows
// it to remain.
//
// May never converge if the processes in the group are still spawning their
// own subprocesses.
int wait_for_process_group(pid_t pgid) {
    int name[] = {CTL_KERN, KERN_PROC, KERN_PROC_PGRP, pgid};

    for (;;) {
        // Query the list of processes in the group by using sysctl(3).
        // This is "hard" because we don't know how big that list is, so we
        // have to first query the size of the output data and then account for
        // the fact that the size might change by the time we actually issue
        // the query.
        struct kinfo_proc *procs = NULL;
        size_t nprocs = 0;
        do {
            size_t len;
            if (sysctl(name, 4, NULL, &len, NULL, 0) == -1) {
                return -1;
            }
            procs = (struct kinfo_proc *)malloc(len);
            if (procs == NULL) {
                // Out of memory; bail out instead of retrying forever.
                return -1;
            }
            if (sysctl(name, 4, procs, &len, NULL, 0) == -1) {
                assert(errno == ENOMEM);
                free(procs);
                procs = NULL;
            } else {
                nprocs = len / sizeof(struct kinfo_proc);
            }
        } while (procs == NULL);
        assert(nprocs >= 1);  // Must have found the group leader at least.

        if (nprocs == 1) {
            // Found only one process, which must be the leader because we
            // purposely keep it around as a zombie.
            assert(procs->kp_proc.p_pid == pgid);
            free(procs);
            return 0;
        }
        // Release this snapshot; the next iteration allocates a fresh one.
        free(procs);

        // More than one process left in the process group.  Pause a little bit
        // before retrying to avoid burning CPU.
        struct timespec ts;
        ts.tv_sec = 0;
        ts.tv_nsec = 1000000;
        if (nanosleep(&ts, NULL) == -1) {
            return -1;
        }
    }
}

The above code snippet gives us a wait_for_process_group function that ensures all processes in a process group are gone, except for the leader. With this, we could implement the following algorithm:

1. Start process group.
2. Wait for the process group leader to terminate using waitpid(2). Remember that we want to report the exit status of the leader, so we must do this.
3. Wait for all other processes in the group to exit by calling wait_for_process_group.

This works… but there is a little problem. Once waitpid(2) returns, the kernel has cleared all knowledge of our process group leader, which means all of its children have become orphans and have been reparented to init(8). The PGID has thus been reclaimed and could now be used by a racing process. (This is very unlikely because the kernel will try hard to not reallocate PIDs too quickly, but it’s still a possibility and experience has shown me that, with scale, the unlikely is actually common.)

So how do we fix this? We need to introduce an extra step. We have to detect when the process group leader becomes a zombie without actually reaping its status so that the PGID remains assigned while we do our wait_for_process_group dance.

And to do this, we use another Darwin-specific primitive: kqueue(2). With this functionality, we can wait for the child process to change its status. (We could possibly do this by monitoring SIGCHLD but I haven’t found a way to make this not racy, and anything dealing with signals always makes me cringe.) The code looks as follows:

#include <sys/types.h>
#include <sys/event.h>

#include <assert.h>
#include <stddef.h>
#include <unistd.h>

// Waits for a process to terminate but does *not* collect its exit status,
// thus leaving the process as a zombie.
//
// According to the kqueue(2) documentation (and I confirmed it experimentally),
// registering for an event reports any pending such events, so this is not racy
// if the process happened to exit before we got to installing the kevent.
int wait_for_process(pid_t pid) {
    int kq;
    if ((kq = kqueue()) == -1) {
        return -1;
    }

    struct kevent kc;
    EV_SET(&kc, pid, EVFILT_PROC, EV_ADD | EV_ENABLE, NOTE_EXIT, 0, 0);

    int nev;
    struct kevent ke;
    if ((nev = kevent(kq, &kc, 1, &ke, 1, NULL)) == -1) {
        close(kq);  // Do not leak the kqueue descriptor on failure.
        return -1;
    }
    assert(nev == 1);
    assert(ke.ident == (uintptr_t)pid);
    assert(ke.fflags & NOTE_EXIT);

    return close(kq);
}

Alright. So now we have another function, called wait_for_process, that will wait until the given PID becomes a zombie. With this, our algorithm looks like:

1. Start process group.
2. Wait for the process group leader to become a zombie with wait_for_process. At this point, the PGID is still assigned to us.
3. Wait for all other processes in the group to exit by calling wait_for_process_group.
4. Collect the leader’s status by using waitpid(2).

Simple, right? It was all in the details—which is why I had to come up with the sample code in these posts.

Let’s put it all into practice by updating our original test tool:

#include <sys/wait.h>

#include <err.h>
#include <stdlib.h>
#include <unistd.h>

int wait_for_process(pid_t);
int wait_for_process_group(pid_t);

// Convenience macro to abort quickly if a syscall fails with -1.
//
// Not great error handling, but better have some than none given that you, the
// reader, might be copy/pasting this into real production code.
#define CHECK_OK(call) \
    do { if ((call) == -1) err(EXIT_FAILURE, #call); } while (0)

int main(int argc, char** argv) {
    if (argc < 2) {
        errx(EXIT_FAILURE, "Must provide a program name and arguments");
    }

    int fds[2];
    CHECK_OK(pipe(fds));
    pid_t pid;
    CHECK_OK((pid = fork()));

    if (pid == 0) {
        // Enter a new process group for all of our descendants.
        CHECK_OK(setpgid(getpid(), getpid()));

        // Tell the parent that we have successfully created the group.
        CHECK_OK(close(fds[0]));
        CHECK_OK(write(fds[1], "\0", sizeof(char)));
        CHECK_OK(close(fds[1]));

        // Execute the given program now that the environment is ready.
        execv(argv[1], argv + 1);
        err(EXIT_FAILURE, "execv");
    }

    // Wait until the child has created its own process group.
    //
    // This is a must to prevent a race between the parent waiting for the
    // group and the group not existing yet, and is the only safe way to do so.
    CHECK_OK(close(fds[1]));
    char dummy;
    CHECK_OK(read(fds[0], &dummy, sizeof(char)));
    CHECK_OK(close(fds[0]));

    // Now wait for the direct child to terminate and keep it around as a
    // zombie.  This ensures that the process group is not reparented to init,
    // which allows us to query it without racing other processes getting the
    // same group identifier.
    CHECK_OK(wait_for_process(pid));

    // The direct child is dead and the kernel would reparent all processes of
    // our group to init if we hadn't kept the child around as a zombie.  Wait
    // for all of them to vanish.
    CHECK_OK(wait_for_process_group(pid));

    // And now reap the exit status of the direct child.
    int status;
    CHECK_OK(waitpid(pid, &status, 0));
    return WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE;
}

And if we build and run it with the same sample command as before:

$ ./wait-all-darwin /bin/sh -c '/bin/sh -c "sleep 5; echo 2" & echo 1'
1
2
$

you’ll see that wait-all-darwin did not terminate until the nested subshell did, even though the outer shell exited quickly.

Great. However, be aware that this solution is not bullet-proof: if a subprocess creates a new process group, it will escape our algorithm and there is nothing we can do about it. (It looks like FreeBSD has a NOTE_TRACK event that could let us track those processes… but of course this doesn’t exist on Darwin.) Anyway, in the context of Bazel’s process wrapper, we are willing to live with this limitation.
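
To make that escape hatch concrete, here is a minimal sketch of a child leaving its parent's process group with setpgid(2). The function name child_escapes_group is made up for illustration and is not part of the tool above; the point is only that a scan keyed on the original PGID stops seeing such a child.

```c
#include <sys/types.h>
#include <sys/wait.h>

#include <assert.h>
#include <unistd.h>

// Forks a child that moves itself into a brand new process group and reports
// its new PGID through a pipe.  Returns 1 if the child ended up outside of our
// own group, 0 if it did not, and -1 on error.
int child_escapes_group(void) {
    int fds[2];
    if (pipe(fds) == -1) {
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: leave the parent's process group and report where we landed.
        close(fds[0]);
        pid_t pgid = -1;
        if (setpgid(0, 0) == 0) {
            pgid = getpgrp();
        }
        ssize_t n = write(fds[1], &pgid, sizeof(pgid));
        _exit(n == (ssize_t)sizeof(pgid) ? 0 : 1);
    }

    // Parent: read the child's new PGID and then reap the child.
    close(fds[1]);
    pid_t child_pgid = -1;
    ssize_t n = read(fds[0], &child_pgid, sizeof(child_pgid));
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    if (n != (ssize_t)sizeof(child_pgid) || child_pgid == -1) {
        return -1;
    }

    // setpgid(0, 0) makes the caller the leader of a group named after its
    // own PID, so a scan for our PGID would no longer find this child.
    assert(child_pgid == pid);
    return child_pgid != getpgrp();
}
```

The same thing happens with setsid(2), which creates a new session and a new process group in one go.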

Lastly, note that this solution should be generalizable to other systems. In the worst case, you’d need to use ps(1) to walk the process table, but you could still do it. The only issue is that you might race against other processes grabbing the PGID as described above, but it’s a pretty unlikely scenario and maybe it can be fixed with some SIGCHLD magic.
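
As a rough idea of what the ps(1) fallback could look like, the sketch below counts the members of a process group from a stream of "PID PGID" pairs. The name count_group_members is hypothetical, and the input format is an assumption: it is what an invocation along the lines of `ps -ax -o pid= -o pgid=` is expected to print, but the exact flags would need checking on each target system.

```c
#include <sys/types.h>

#include <assert.h>
#include <stdio.h>

// Counts the entries of `in` whose second column equals `pgid`.  The input is
// assumed to contain whitespace-separated "PID PGID" pairs, one per line.
// Returns the number of matches, or -1 if the input is malformed.
int count_group_members(FILE *in, pid_t pgid) {
    assert(in != NULL);

    long pid;
    long line_pgid;
    int count = 0;
    int n;
    while ((n = fscanf(in, "%ld %ld", &pid, &line_pgid)) == 2) {
        if (line_pgid == (long)pgid) {
            count++;
        }
    }

    // fscanf returns EOF once the input is exhausted; any other value means
    // that a line did not hold two numbers.
    return n == EOF ? count : -1;
}
```

A caller could feed this from popen(3), for example `popen("ps -ax -o pid= -o pgid=", "r")`, and poll until only the zombie leader remains. That would be noticeably slower than the sysctl(3) query and is subject to the PGID reuse race described above.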
