Jul 19, 2021 Tags: llvm, rust Series: llvm-internals
I’ve done a couple of posts on LLVM itself, mostly on things you can do with LLVM or how LLVM represents particular program features.
I’ve received some good feedback on these, but I’d like to focus a sub-series of posts on LLVM’s implementation itself: the file formats, parsing strategies, and algorithmic choices underlying LLVM’s public interfaces (APIs, CLIs, and consumable output files). I’ll be writing these posts as I work on a series of pure-Rust libraries for ingesting LLVM’s intermediate representation, with the end goal of being able to perform read-only analyses of programs compiled to LLVM IR in pure Rust.
For those who don’t know what LLVM is, this post has a broader background on LLVM’s components and intermediate representation.
LLVM’s intermediate representation is not just an in-memory series of changes to a program as it passes through compilation phases: it’s a stable model of the program that can be saved to disk and subsequently revivified for analyses that run in entirely new invocations of `clang` or other LLVM tools².
So: how does LLVM faithfully preserve its intermediate representation on disk?
The answer is twofold: LLVM provides both a textual form that looks roughly like a C-family programming language, and a binary form (bitcode, in LLVM’s parlance) that’s denser and (ostensibly) faster to parse. The two forms are semantically identical, and trivial to produce and convert between:
```
# create foo.bc from foo.c
$ clang -emit-llvm -c foo.c
$ file foo.bc
foo.bc: LLVM IR bitcode

# convert foo.bc into foo.ll (human-readable IR)
$ llvm-dis foo.bc
$ file foo.ll
foo.ll: ASCII text, with very long lines

# create foo.ll directly from foo.c
$ clang -S -emit-llvm foo.c
```
The textual form of LLVM’s IR is fun to read, but it’s “just” text: slow and (perhaps) annoying to parse with a machine, but easy for in-the-know humans to comprehend.
By contrast, the bitcode form is remarkably inscrutable, even by binary format standards: apart from a few printable strings, there are no obvious fixed-size or length-prefixed structures. In exchange for this inscrutability, it performs admirably at storing LLVM’s intermediate representation in a space-efficient manner. Compare the representations of SQLite’s 3.36.0 amalgamated release:
| `sqlite3.c` | `sqlite3.ll` | `sqlite3.bc` |
| --- | --- | --- |
| 8.0MB | 14.0MB | 1.8MB |
An order of magnitude smaller isn’t bad³ ⁴. Let’s find out what we need to do to parse the bitcode format for ourselves.
LLVM is a well-documented project, and the bitcode format is no exception. To summarize the documentation: what we call “bitcode” is really two formats layered together: a generic, self-describing container format (the *bitstream* format), and an encoding of LLVM’s IR on top of it. A bitstream is a sequence of *blocks*, which provide scoping, and *records*, which hold the actual data as sequences of integer fields.
This all sounds very nicely structured, which raises the question: why is there no visible structure in a bitcode file (when viewed with a hex editor)?
The answer is in the bit of bitcode and bitstream: blocks and records are not aligned on bytes (or larger boundaries like words); they’re encoded as variable-length bitfields. These bitfields can share a byte or span across multiple bytes, giving bitcode files remarkably high entropy when compared to binary objects (visualized with binvis.io):
| `sqlite3.bc` | `sqlite3.o` |
| --- | --- |
| 1.8MB | 1.3MB |
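To make the bit-level packing concrete, here’s a minimal sketch of a bit-granular reader over a byte buffer. This is purely illustrative (it’s neither LLVM’s actual reader nor the llvm-bitcursor API), and it assumes the bitstream’s low-to-high bit ordering, i.e. bits are consumed least-significant-first within each byte:

```rust
/// A hypothetical bit-granular cursor: reads `n`-bit fields that freely
/// share bytes and span byte boundaries, as bitstream fields do.
struct BitCursor<'a> {
    buf: &'a [u8],
    bit_pos: usize, // absolute position in the buffer, in bits
}

impl<'a> BitCursor<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, bit_pos: 0 }
    }

    /// Read `n` bits (n <= 64) as an unsigned integer, least-significant
    /// bit first, or `None` if the buffer is exhausted.
    fn read(&mut self, n: usize) -> Option<u64> {
        let mut value = 0u64;
        for i in 0..n {
            let byte = *self.buf.get(self.bit_pos / 8)?;
            let bit = (byte >> (self.bit_pos % 8)) & 1;
            value |= (bit as u64) << i;
            self.bit_pos += 1;
        }
        Some(value)
    }
}

fn main() {
    // 0xAB = 0b1010_1011: the first 4 bits (LSB-first) are 0b1011,
    // the next 4 are 0b1010.
    let mut cur = BitCursor::new(&[0xAB, 0xCD]);
    assert_eq!(cur.read(4), Some(0b1011));
    assert_eq!(cur.read(4), Some(0b1010));
    // Reads can also straddle bytes; here the next 8 bits happen to be
    // exactly the second byte.
    assert_eq!(cur.read(8), Some(0xCD));
    println!("ok");
}
```

Because every field is just “the next *n* bits”, nothing in the output lines up on byte boundaries, which is exactly why bitcode looks so structureless in a hex editor.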
Both blocks and data records are identified by abbreviation IDs, usually shortened to “abbrev IDs” in the LLVM documentation. In true self-specifying fashion, the majority of abbreviation IDs used by a bitcode file are not specified by the container format; they’re defined within the bitstream itself by the appearance of `DEFINE_ABBREV` records⁵. Indeed, the container format only specifies 4 extremely basic abbreviation IDs: two for managing block scope (`ENTER_SUBBLOCK` and `END_BLOCK`), `DEFINE_ABBREV` itself, and `UNABBREV_RECORD` as a general-purpose, inefficient “escape hatch” encoding for records⁶.
These abbreviation IDs (and the abbreviation formats they reference; more on that soon) are tightly scoped to the block that introduces them⁷, with one exception: bitstream blocks that describe an LLVM module may themselves contain a `BLOCKINFO` block, which in turn can define multiple abbreviations for any subsequent blocks that match the block IDs specified by the `BLOCKINFO`.
If all of this indirection and abbreviation weren’t complicated enough, the bitstream container format uses three distinct techniques for determining the length of records and their constituent bitfields:

1. Every block record starts with the `ENTER_SUBBLOCK` abbreviation ID and records a new abbreviation ID length. This new length is used to decode all subsequent abbreviation IDs in the current block, with sub-blocks defining their own new abbreviation ID lengths⁸.

2. All of the non-default abbreviation IDs have an associated abbreviation definition (via `DEFINE_ABBREV`), which specifies the layout of the block or record itself. This allows an emitter of LLVM bitcode to get very clever with the sizes of its records: the in-memory representation of the IR knows the exact count of each thing that needs to be stored, and can use this to select exactly the right number of bits for each field.

3. Finally, variable-width integers (“VBRs”, or `VBRn` in the LLVM docs). These occur in chunks of *n* bits, with the high bit of each chunk indicating a continuation into a subsequent chunk in the stream. These give the bitstream container the ability to represent common expected values in a small bitspace, while still providing an escape hatch for larger values⁹. There are also signed VBRs, which can pack signed integers about as efficiently; they’re distinguished from normal VBRs by context (e.g., cases in LLVM’s IR where negative offsets or constants are expected).
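As an illustration of the VBR scheme (my own sketch, not LLVM’s implementation): each *n*-bit chunk contributes its low *n* - 1 bits to the value, lowest chunk first, and its high bit signals whether another chunk follows:

```rust
/// Decode a VBR-n value from a sequence of already-extracted n-bit chunks.
/// Hypothetical helper for illustration; a real reader would pull each
/// chunk from the bitstream itself.
fn decode_vbr(n: u32, chunks: impl IntoIterator<Item = u64>) -> u64 {
    let payload_bits = n - 1;
    let hi_mask = 1u64 << payload_bits; // the continuation bit
    let mut value = 0u64;
    let mut shift = 0u32;
    for chunk in chunks {
        // Low n-1 bits are payload, accumulated low-to-high.
        value |= (chunk & (hi_mask - 1)) << shift;
        if chunk & hi_mask == 0 {
            break; // continuation bit clear: this was the last chunk
        }
        shift += payload_bits;
    }
    value
}

fn main() {
    // VBR6: values 0..=31 fit in a single chunk.
    assert_eq!(decode_vbr(6, [17]), 17);
    // 40 = 8 + (1 << 5): the first chunk carries the low 5 bits (8) plus
    // the continuation bit; the second chunk carries the remaining 1.
    assert_eq!(decode_vbr(6, [0b101000, 0b000001]), 40);
    println!("ok");
}
```

Small values cost one chunk; larger ones degrade gracefully by a chunk at a time, which is the whole point of the encoding.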
## BLOCKINFO
As mentioned above: every abbreviation has a corresponding abbreviation definition, whose location in the stream determines its abbreviation ID. In other words: abbreviation IDs are never explicitly assigned anywhere. They’re implicitly defined by the order of appearance of the `DEFINE_ABBREV` records that apply to the current block scope.
So, how do we determine which abbreviations apply to the current block scope? We apply three rules:

1. Abbreviation IDs `0` through `3` are defined by the bitstream container itself: they’re `END_BLOCK`, `ENTER_SUBBLOCK`, `DEFINE_ABBREV`, and `UNABBREV_RECORD`, as mentioned above.

2. Beginning with abbreviation ID `4`, we check the special `BLOCKINFO` block. The `BLOCKINFO` (essentially) contains a mapping of block IDs to abbreviation records; we assign each record that matches our current scope’s block ID a sequential abbreviation ID.

3. Finally, we check for any `DEFINE_ABBREV` records present in our current scope. These also receive sequential abbreviation IDs.

These rules can be difficult to wrap your head around in the abstract, so consider the following symbolic representation of an LLVM IR module in the bitstream container format (I’ve given the `DEFINE_ABBREV`s symbolic names for comprehension):
```
BLOCK MODULE_BLOCK (BLOCK ID #8)
  BLOCK BLOCKINFO (BLOCK ID #0)
    RECORD SETBID #8 (MODULE_BLOCK)
    RECORD DEFINE_ABBREV [...] // "A"
    RECORD DEFINE_ABBREV [...] // "B"
    RECORD SETBID #12 (FUNCTION_BLOCK)
    RECORD DEFINE_ABBREV [...] // "C"
  RECORD DEFINE_ABBREV [...] // "D"
  RECORD DEFINE_ABBREV [...] // "E"
  BLOCK FUNCTION_BLOCK (BLOCK ID #12)
    RECORD DEFINE_ABBREV [...] // "F"
```
Here, we have three potential block scopes (the top-level `MODULE_BLOCK`, with a nested `BLOCKINFO` and `FUNCTION_BLOCK`). The non-default abbreviation IDs for each are:

- `MODULE_BLOCK`: `A` (#4), `B` (#5), `D` (#6), `E` (#7)
- `BLOCKINFO`: None (only the built-in abbrev IDs!)
- `FUNCTION_BLOCK`: `C` (#4), `F` (#5)

Observe that there’s some subtlety here: the `MODULE_BLOCK` is specified as the top-level block and the `BLOCKINFO` is nested within it, but the `BLOCKINFO` can (and, in this case, does) contain abbreviation definitions that we must interpret in order to correctly parse the `MODULE_BLOCK`. As such, we potentially need to jump forwards in the bitstream to interpret the `BLOCKINFO` before we can correctly interpret our actual starting point¹⁰.
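The three scoping rules above can be sketched in code. Everything here is hypothetical (the types and names are mine, not LLVM’s or llvm-bitcursor’s), but it mirrors the lookup a parser would perform, using the symbolic module above as the test case:

```rust
use std::collections::HashMap;

/// Stand-in for a real abbreviation definition.
#[derive(Debug, Clone, PartialEq)]
struct Abbrev(String);

/// IDs 0..=3 are reserved by the container format itself.
const FIRST_APPLICATION_ABBREV_ID: u64 = 4;

/// Resolve an abbreviation ID within a block scope, per the three rules:
/// built-ins first, then BLOCKINFO-defined abbrevs matching this block ID
/// (in order of appearance), then the block's own DEFINE_ABBREVs.
fn resolve_abbrev(
    abbrev_id: u64,
    block_id: u64,
    blockinfo: &HashMap<u64, Vec<Abbrev>>,
    local_abbrevs: &[Abbrev],
) -> Option<Abbrev> {
    if abbrev_id < FIRST_APPLICATION_ABBREV_ID {
        // END_BLOCK, ENTER_SUBBLOCK, DEFINE_ABBREV, UNABBREV_RECORD:
        // no definition to look up.
        return None;
    }
    let idx = (abbrev_id - FIRST_APPLICATION_ABBREV_ID) as usize;
    let from_blockinfo = blockinfo.get(&block_id).map(Vec::as_slice).unwrap_or(&[]);
    from_blockinfo.iter().chain(local_abbrevs.iter()).nth(idx).cloned()
}

fn main() {
    let a = |s: &str| Abbrev(s.into());
    // BLOCKINFO defines A, B for block #8 and C for block #12.
    let blockinfo: HashMap<u64, Vec<Abbrev>> =
        HashMap::from([(8, vec![a("A"), a("B")]), (12, vec![a("C")])]);

    // MODULE_BLOCK (#8): A (#4), B (#5) from BLOCKINFO; D (#6), E (#7) local.
    assert_eq!(resolve_abbrev(6, 8, &blockinfo, &[a("D"), a("E")]), Some(a("D")));
    // FUNCTION_BLOCK (#12): C (#4) from BLOCKINFO; F (#5) local.
    assert_eq!(resolve_abbrev(5, 12, &blockinfo, &[a("F")]), Some(a("F")));
    println!("ok");
}
```

The BLOCKINFO-before-local ordering is what produces the ID assignments listed above, and it’s also why the parser may need that forward jump: the `blockinfo` map must be populated before the enclosing block’s records can be resolved.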
All this text, and we’re no visibly closer to actually getting IR semantics out of a bitcode file generated by LLVM. Whoops.

Everything here is important background for interpreting the bitstream container, but getting actual IR semantics out of it is an entirely different, separate beast. I intend to dedicate a subsequent post (give it about a month) to just that: taking the primitives described here and using them to build a Rust library that can interpret bitcode modules¹¹. That crate, in turn, will become the substrate of a higher-level crate that exposes pleasant APIs for actually analyzing LLVM IR, all without needing to build or link against your system’s copy of LLVM!
The first step of this (long) process is already done: I’m open-sourcing an initial version of llvm-bitcursor, which will provide a bit-granular cursor over a bitstream container. It already has support for `VBRn` decoding, too.
2. The easy example of this is LLVM’s implementation of LTO: the “completeness” (w.r.t. information) of any re-ingested LLVM IR is necessary for the correctness of any link-time optimizations. ↩
3. Interestingly, applying naive compression (just `gzip`) narrows the gap significantly: `sqlite3.ll.gz` is only 1.9MB, while `sqlite3.bc.gz` is 1.2MB. There are probably reasons why LLVM’s authors went with a relatively complex bitcode format instead of a simpler one (which would also compress even better than the current format), but I’m not privy to them. ↩
4. I’ve saved the commands used here. `clang` is `clang-10`, version `10.0.0-4ubuntu1`. ↩
5. This is where LLVM’s bitstream container begins to look a little bit like BER, which is not a pleasant similarity. ↩

6. More on each of these (and all of the actual IR-specific ones) in a subsequent post. ↩

7. Meaning that neither the parent nor child blocks of a particular block can use that abbreviation. ↩

8. One of many places where it’s appropriate to think of an LLVM bitstream interpreter as a cute little stack machine. ↩
9. To make this intuitive: if you have a space of values `{0..15}` with a distribution bias towards the range `{0..6}`, then you expect the vast majority of your values to fit in a single VBR3 chunk. The ones that don’t fit should be a minority, and their slightly larger representation (two chunks instead of one) is negligible. ↩
10. From my understanding, anyways. The docs make it explicit that `MODULE_BLOCK` is the top-level block when interpreting a single LLVM IR module, and that it can contain a `BLOCKINFO`. There seems to be no restriction on a `BLOCKINFO` containing `SETBID MODULE_BLOCK (#8)` records, which (according to the specification) should apply unconditionally to any matching blocks, including the parent. ↩
11. The library is going to take longer than the blog post. ↩