Thoughts on Emulating Command Buffers for OpenGL

If you take a look at modern graphics APIs, you'll find that all of them have the concept of a command buffer (Metal, D3D 12, Vulkan). One may think of command buffers as small "programs" that execute on the GPU and do things necessary to set the scene for your draw calls: binding resources, changing some pipeline state, initiating data transfer, performing synchronization, and so on. The application may benefit from being able to record multiple command buffers in parallel on different threads, and reusing previously recorded command buffers.

If you're writing a low-ish level cross-platform graphics API abstraction targeting different backends, it makes sense to expose command buffers to the user. But what if one of your target backends is OpenGL? OpenGL has no concept of command buffers, so if your abstract API supports them, there's no standard OpenGL equivalent to map them to. The closest thing I could find is the vendor-specific NV_command_list extension. I could identify three options to deal with this problem:

  • Have two paths in your API - one for backends that support command buffers, and another for those that don't. I don't like this approach very much: it makes the (likely already nontrivial) interface even more complicated, and it kind of defeats the purpose of being cross-platform by requiring the user to choose one method or the other based on which backend they're using;
  • Drop OpenGL support altogether. The reasoning is as follows: OpenGL is dead on Apple platforms; if your API has a D3D backend, that's what you'll be using on Windows anyway; and on Android there is Vulkan. The problem is that Vulkan support on Android is not as widespread as one would like. Dropping GL also poses a problem if you really want to support WebGL as one of the target backends. I do think, however, that this is the correct long-term approach. Vulkan support on Android will hopefully improve with time (both in terms of market share and driver quality), and for those who want to target the browser, there's WebGPU (though whether that effort will succeed remains to be seen).
  • Emulate command buffers on top of OpenGL. This means choosing some way of encoding commands into "fake" command buffers, and then "interpreting" them (by making actual OpenGL calls) when they are "submitted". The main drawback, of course, is that this is pure overhead.

I've decided to investigate the third approach more closely. To reiterate, I understand that it's nothing but overhead. One does not actually get any of the benefits of real command buffers from this; the only benefit is being able to expose the same API to the user regardless of the backend they choose. Making direct OpenGL calls will always be faster than going through this abstraction. But is there a way to make the cost low enough that it doesn't matter much?

For starters, let's look at how the commands could be represented. It turns out that you need something on the order of tens of bytes to encode a single command. It could be done with a tagged union like this:

enum cmd_type {
  CMD_TYPE_BIND_PIPELINE,
  CMD_TYPE_BIND_VERTEX_BUFFER,
  CMD_TYPE_SET_VIEWPORT,
  /* ... */
};

struct cmd {
  enum cmd_type type;   /* discriminates the union below */
  struct cmd *next;     /* intrusive link to the next command */
  union {
    cmd_bind_pipeline bind_pipeline;
    cmd_bind_vertex_buf bind_vert_buf;
    irect2d viewport;
    /* ... */
  };
};

The most naive way to record a command buffer would be to write the commands into a contiguous region of memory. Of course, the problem here is that we don't know a priori how large that region needs to be, so we'd have to resize it dynamically, and dynamic allocations are not cheap.
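As a rough sketch of what that naive scheme might look like (cmd_buffer, record_cmd, and the growth policy are all hypothetical, not from any real codebase):

#include <stdlib.h>
#include <string.h>

/* Hypothetical naive recording: one growing byte buffer per command buffer. */
struct cmd_buffer {
  unsigned char *data;
  size_t size;
  size_t capacity;
};

static void record_cmd(struct cmd_buffer *cb, const struct cmd *c) {
  if (cb->size + sizeof *c > cb->capacity) {
    /* This resize is exactly the problem: a realloc, and possibly a full
       copy, in the middle of recording. (Error handling omitted.) */
    cb->capacity = cb->capacity ? cb->capacity * 2 : 4096;
    cb->data = realloc(cb->data, cb->capacity);
  }
  memcpy(cb->data + cb->size, c, sizeof *c);  /* the next pointer is unused here */
  cb->size += sizeof *c;
}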

To avoid them, we can use a fixed-size block allocator. It works by preallocating a chunk of memory and dividing it into fixed-size blocks, maintaining a list of all the free blocks. On every allocation, we simply pop the first block off the free list; on every deallocation, we prepend the returned block to the list. Allocation and deallocation thus boil down to a few simple pointer operations. If we run out of blocks at some point, we can allocate a new chunk, chop it up into blocks, and add those to the free list as well.
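Here's a minimal sketch of such an allocator, assuming blocks are at least pointer-sized so a free block's own memory can hold the free-list link (all names are mine, for illustration):

#include <stdlib.h>

/* Fixed-size block allocator. A free block's own bytes store the link to
   the next free block, so block_size must be >= sizeof(void *).
   Error handling is omitted. */
struct block_allocator {
  void  *free_list;   /* head of the free-block list */
  size_t block_size;
};

static void add_chunk(struct block_allocator *a, size_t nblocks) {
  unsigned char *chunk = malloc(a->block_size * nblocks);
  for (size_t i = 0; i < nblocks; i++) {
    void *block = chunk + i * a->block_size;
    *(void **)block = a->free_list;   /* prepend to the free list */
    a->free_list = block;
  }
}

static void *block_alloc(struct block_allocator *a) {
  if (!a->free_list)
    add_chunk(a, 1024);               /* ran out: grab another chunk */
  void *block = a->free_list;
  a->free_list = *(void **)block;     /* pop the head */
  return block;
}

static void block_free(struct block_allocator *a, void *block) {
  *(void **)block = a->free_list;     /* push back onto the list */
  a->free_list = block;
}

Note that the chunks themselves are never freed here; a real implementation would keep track of them so memory can be released or recycled wholesale when a command buffer is reset.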

Given an allocator that doles out memory in command-sized blocks, recording a new command into an emulated command buffer amounts to allocating a new block, populating it with command data, and appending it to the command list. To avoid having to do synchronization, we can have one separate block allocator for each thread that records command buffers.
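With this scheme, a command buffer reduces to head and tail pointers into an intrusive list, and recording might look roughly like this (assuming the allocator's block size is sizeof(struct cmd); cmd_buffer_append and cmd_set_viewport are illustrative names):

/* A command buffer is just head/tail pointers into a list of commands. */
struct cmd_buffer {
  struct cmd *first;
  struct cmd *last;
};

static struct cmd *cmd_buffer_append(struct cmd_buffer *cb,
                                     struct block_allocator *a,
                                     enum cmd_type type) {
  struct cmd *c = block_alloc(a);  /* a is the recording thread's own allocator */
  c->type = type;
  c->next = NULL;
  if (cb->last)
    cb->last->next = c;
  else
    cb->first = c;
  cb->last = c;
  return c;
}

/* Recording a concrete command then just fills in the union member: */
static void cmd_set_viewport(struct cmd_buffer *cb,
                             struct block_allocator *a, irect2d rect) {
  cmd_buffer_append(cb, a, CMD_TYPE_SET_VIEWPORT)->viewport = rect;
}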

When the command buffer is "submitted" for execution, we go through each command in the list and "interpret" it by switching on the command type and making actual OpenGL calls.
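The interpreter is then a single loop over the list. A sketch (the GL calls in each case are only examples of what a real backend might emit; the irect2d field names are my assumption, and a GL header/loader is assumed to be included):

static void cmd_buffer_submit(const struct cmd_buffer *cb) {
  for (const struct cmd *c = cb->first; c; c = c->next) {
    switch (c->type) {
    case CMD_TYPE_SET_VIEWPORT:
      glViewport(c->viewport.x, c->viewport.y,
                 c->viewport.width, c->viewport.height);
      break;
    case CMD_TYPE_BIND_VERTEX_BUFFER:
      /* e.g. glBindBuffer(GL_ARRAY_BUFFER, ...) plus attribute setup */
      break;
    /* ... one case per command type ... */
    default:
      break;
    }
  }
}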

Since individual command blocks are not at all guaranteed to be near each other in memory, there is ample opportunity for cache misses during interpretation. This can be mitigated somewhat by increasing the block size: it then becomes possible to record several commands into a single block, ensuring that they're near each other in memory and reducing the number of cache misses (at the cost of wasting a bit of memory at the end of each command buffer).
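One way to sketch that variant: turn each block into a small array of commands plus a count, and bump-allocate within it, falling back to the block allocator only when the current block fills up (CMDS_PER_BLOCK is an arbitrary choice, and the allocator is assumed to now hand out blocks of sizeof(struct cmd_block)):

#define CMDS_PER_BLOCK 64   /* arbitrary; worth tuning against real workloads */

/* A block now holds many commands contiguously. */
struct cmd_block {
  struct cmd_block *next;                  /* next block in this buffer */
  size_t            used;                  /* commands recorded so far  */
  struct cmd        cmds[CMDS_PER_BLOCK];
};

static struct cmd *next_cmd_slot(struct cmd_block **tail,
                                 struct block_allocator *a) {
  struct cmd_block *b = *tail;
  if (b->used == CMDS_PER_BLOCK) {         /* current block is full */
    struct cmd_block *fresh = block_alloc(a);
    fresh->next = NULL;
    fresh->used = 0;
    b->next = fresh;
    *tail = b = fresh;
  }
  return &b->cmds[b->used++];
}

Interpretation then becomes two nested loops: walk the block list, and within each block iterate over cmds[0..used). In this layout the per-command next pointer inside struct cmd itself is dead weight and could be dropped.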

I'm still experimenting with this approach and I don't know if I'll end up sticking with it. At least it seems that I'm not the only one who has had similar ideas. I haven't found too many resources dealing with implementation details and pitfalls here. If you've done something similar, I'd love to hear from you.

