Sep 19, 2020 Tags: llvm, programming
This post is at least a year old.
(With apologies to Branson Reese.)
I’ve been working inside of codebases built on LLVM for a little over two years. I’m used to (and very comfortable) reading LLVM IR.
Despite all that, I still need to stop and think about getelementptr
’s semantics whenever I see
it. Thus: this post will serve to help myself (and maybe you, reader) feel more comfortable with
getelementptr
.
First of all, there’s this thing called LLVM.
LLVM originally stood Low Level Virtual Machine, but no longer stands for anything.
The LLVM Project has many parts, but is perhaps best known by ordinary developers for being the home of Clang, an optimizing C/C++ compiler1.
Beyond being an impressive piece of compiler engineering, LLVM’s greatest claim to fame is its Intermediate Representation, or IR. Having a (well-designed, mature, stable) IR confers several advantages to an optimizing compiler:
Broadly speaking, LLVM’s IR is divided into four (increasingly narrow) scopes:
Instructions, in turn, take Values. Functions, basic blocks, and instructions are themselves kinds of values2 (among many other kinds), completing the picture.
This blog post concerns a single LLVM instruction: getelementptr
.
Here’s how the
LLVM Language Reference defines and
describes getelementptr
as of LLVM 10:
1
2
3
<result> = getelementptr <ty>, <ty>* <ptrval>{, [inrange] <ty> <idx>}*
<result> = getelementptr inbounds <ty>, <ty>* <ptrval>{, [inrange] <ty> <idx>}*
<result> = getelementptr <ty>, <ptr vector> <ptrval>, [inrange] <vector index type> <idx>
The ‘
getelementptr
’ instruction is used to get the address of a subelement of an aggregate data structure. It performs address calculation only and does not access memory. The instruction can also be used to calculate a vector of such addresses.
In other words: it’s a souped up LLVM version of x86’s
lea
: it computes the effective address of
some field without actually touching any memory. I say “souped up” because it can do a few things
that (a single) lea
can’t:
A single lea
’s ability to compute an indirect address is constrained by
x86’s addressing modes: if an effective
address into some data structure can’t be computed by the combination of the
scale, index, base, and displacement parameters, then multiple lea
s are necessary to get
to the ultimate address. This is not true for getelementptr
: a single getelementptr
instruction can have as many nested indices as an LLVM frontend is content to generate.
Like most LLVM instructions, getelementptr
has a “vector” variant: instead of returning a
single computed address, it can return a vector of addresses computed together. Consequently,
for the vectorized variant, some (or even all) of the getelementptr
’s index arguments
must be vectors themselves.
I’ve read this official overview dozens of times, and I basically understand it. Nonetheless,
my eyes still swim whenever I actually read a getelementptr
in real IR. LLVM’s developers
are aware of how confusing getelementptr
is: they have an
entire separate page dedicated to explaining its syntax,
why it doesn’t touch memory, and the various constraints on constructing a valid one.
Unfortunately, what that page doesn’t have is a ton of side-by-side examples with real C code.
So that’s what I’m going to do today. All examples will use global variables and
-fno-discard-value-names
to make the generated IR a bit easier to read.
Let’s take a look at a trivial C function that just returns the first element in a global array:
1
2
3
4
long *nums = {1, 2, 3};
long index_first(void) {
return nums[0];
}
and the generated IR (cleaned up a bit):
1
2
3
4
5
6
7
8
@nums = dso_local global i64* inttoptr (i64 1 to i64*), align 8
define dso_local i64 @index_first() #0{
%0 = load i64*, i64** @nums, align 8
%arrayidx = getelementptr inbounds i64, i64* %0, i64 0
%1 = load i64, i64* %arrayidx, align 8
ret i64 %1
}
(View it on Godbolt.)
Now, step by step:
load
our global nums
into %0
, with type i64*
3We use getelementptr
to calculate the address of an i64
(the first argument), using:
%0
as our base address (the second argument)0
as our index (the third argument)…and we store our calculated address into %arrayidx
.
load
an i64
from our calculated address (%arrayidx
) into %1
ret
our loaded value (%1
)That’s not too bad at all! getelementptr
behaves exactly like the simple index operation that
it models.
What if we change nums
to an array? Arrays decompose to pointers in C so it should be the same, right?
1
2
3
4
long nums[] = {1, 2, 3};
long index_first(void) {
return nums[0];
}
Wrong! LLVM IR isn’t C, and doesn’t play by C’s rules. In particular, it has sized array types that can be cast to pointers, but that don’t decompose on their own:
1
2
3
4
5
6
@nums = dso_local global [3 x i64] [i64 1, i64 2, i64 3], align 16
define dso_local i64 @index_first() #0 {
%0 = load i64, i64* getelementptr inbounds ([3 x i64], [3 x i64]* @nums, i64 0, i64 0), align 16
ret i64 %0
}
(View it on Godbolt.)
This is simpler in some ways and more complicated in others: getelementptr
is now invoked
inline within the load
4, but we’ve also saved a load
and two SSA variables.
Breaking the getelementptr
down:
[3 x i64]
(i.e., an array of 3 i64
s) (first argument)nums
is our base address, which has type [3 x i64]*
(second argument)But wait, there are two i64 0
indexes instead of just one! What gives?
As best I can tell, this is a quirk of the new type of nums
: it’s not a pointer anymore, but an
array. However, because we’re accessing it through a pointer (because that’s how getelementptr
is defined), we actually need two 0
indexes: the first to piece the pointer, and the second to
index the array itself. Same operation, slightly more indirection within the IR5.
What happens when we add a level of source indirection, like an index variable?
1
2
3
4
5
long nums[] = {1, 2, 3};
long i;
long index_i(void) {
return nums[i];
}
1
2
3
4
5
6
7
8
9
@nums = dso_local global [3 x i64] [i64 1, i64 2, i64 3], align 16
@i = common dso_local global i64 0, align 8
define dso_local i64 @index_i() #0 {
%0 = load i64, i64* @i, align 8
%arrayidx = getelementptr inbounds [3 x i64], [3 x i64]* @nums, i64 0, i64 %0
%1 = load i64, i64* %arrayidx, align 8
ret i64 %1
}
(View it on Godbolt.)
It’s nearly identical to our second trivial example! The only thing that’s changed is our
last index, going from 0
to %0
(which corresponds to our loaded i
). This in
turn confirms what we thought about the two i64 0
s above: that the first (still present)
simply pierces the pointer, while the second performs the “actual” indexing.
Let’s try it with more flavor dimensionality:
1
2
3
4
5
long nums[3][3] = { {1, 2, 3}, {2, 3, 4}, {3, 4, 5} };
long i;
long index_i2(void) {
return nums[i][i];
}
yields:
1
2
3
4
5
6
7
8
9
@nums = dso_local local_unnamed_addr global [3 x [3 x i64]] [[3 x i64] [i64 1, i64 2, i64 3], [3 x i64] [i64 2, i64 3, i64 4], [3 x i64] [i64 3, i64 4, i64 5]], align 16
@i = common dso_local local_unnamed_addr global i64 0, align 8
define dso_local i64 @index_i2() local_unnamed_addr #0 {
%0 = load i64, i64* @i, align 8
%arrayidx1 = getelementptr inbounds [3 x [3 x i64]], [3 x [3 x i64]]* @nums, i64 0, i64 %0, i64 %0
%1 = load i64, i64* %arrayidx1, align 8
ret i64 %1
}
(View it on Godbolt.)
Exactly as expected: all we’re doing is indexing one layer deeper into the aggregate with the same
index (%0
) and, sure, enough, our getelementptr
has one additional argument.
Adding a little bit of math makes it clear that the order of the indexes is as we expect:
1
2
3
4
5
long nums[3][3] = { {1, 2, 3}, {2, 3, 4}, {3, 4, 5} };
long i;
long index_i2(void) {
return nums[i][i + 1];
}
becomes:
1
2
3
4
5
6
7
8
9
10
@nums = dso_local local_unnamed_addr global [3 x [3 x i64]] [[3 x i64] [i64 1, i64 2, i64 3], [3 x i64] [i64 2, i64 3, i64 4], [3 x i64] [i64 3, i64 4, i64 5]], align 16
@i = common dso_local local_unnamed_addr global i64 0, align 8
define dso_local i64 @index_i2() local_unnamed_addr #0 {
%0 = load i64, i64* @i, align 8
%add = add nsw i64 %0, 1
%arrayidx1 = getelementptr inbounds [3 x [3 x i64]], [3 x [3 x i64]]* @nums, i64 0, i64 %0, i64 %add
%1 = load i64, i64* %arrayidx1, align 8
ret i64 %1
}
(View it on Godbolt.)
%add
boils down to i + 1
, as expected.
We’ve been doing pointers and arrays so far, but getelementptr
is well defined for any
aggregate, including structures.
Let’s go all in:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
typedef struct foo {
struct {
long field1;
struct {
long field2;
long field3;
union {
long field4;
char field5[32];
} quux;
long field6;
long field7;
} baz;
long field8;
} bar;
} foo;
foo chunky;
char take_field(void) {
return chunky.bar.baz.quux.field5[17];
}
Once again, LLVM blesses us with a single getelementptr
:
1
2
3
4
5
6
7
8
9
10
11
%struct.foo = type { %struct.anon }
%struct.anon = type { i64, %struct.anon.0, i64 }
%struct.anon.0 = type { i64, i64, %union.anon, i64, i64 }
%union.anon = type { i64, [24 x i8] }
@chunky = common dso_local local_unnamed_addr global %struct.foo zeroinitializer, align 8
define dso_local signext i8 @take_field() local_unnamed_addr #0 {
%0 = load i8, i8* getelementptr inbounds (%struct.foo, %struct.foo* @chunky, i64 0, i32 0, i32 1, i32 2, i32 1, i64 9), align 1
ret i8 %0
}
(View it on Godbolt)
We’re off to a good start again: we’re addressing into a foo
, and the foo
that we’re addressing
into is the chunky
global. Then, in sequence:
i64 0
: Pierce the pointer! getelementptr
takes a pointer to chunky
, not chunky
itself.i32 0
: The first field of our foo
, i.e. foo.bar
.
foo
is bar
.i32 1
: The second field of our nested aggregate, i.e. foo.bar.baz
6
i32 0
here would have been foo.bar.field1
. Similarly, i32 2
would have been
foo.bar.field8
. In other words, we always restart at 0
when we enter a new aggregate.i32 2
: The third field of our next nested aggregate, i.e. foo.bar.baz.quux
.i32 1
: The second field of our next nested aggregate, i.e. foo.bar.baz.quux.field5
.
quux
is a union, not a struct. But LLVM just doesn’t care; it’s an aggregate
all the same.i64 9
: …what?What just happened? Where’s our 17
? Where the hell did 9
come from?
Let’s look at our union type more closely:
1
%union.anon = type { i64, [24 x i8] }
First of all, it’s not actually a union in LLVM land. LLVM does not have its own union type.
However, if we look more closely, we’ll see that LLVM has engaged in some cleverness:
%union.anon
is just another structure, but its second field (our field5
) is 24 bytes instead
of 32 — the first 8 are in field4
, just as the C abstract machine expects.
Thus, LLVM has correctly modeled C’s union semantics without unions of its own.
So, whence index 9
? Well, remember that we’re dealing with a [24 x i8]
, i.e. the last 24 bytes
of field5
. In other words, by virtue of our previous i32 1
index, we’ve already moved past
the first sizeof(i64) == 8
bytes of field5
, and 8 + 9 == 17
! Our index! LLVM hasn’t let us
down.
But wait: what happens if we try to grab a part of field5
that overlaps with field4
, like field5[6]
?
1
%0 = load i8, i8* getelementptr inbounds ([32 x i8], [32 x i8]* bitcast (%union.anon* getelementptr inbounds (%struct.foo, %struct.foo* @chunky, i64 0, i32 0, i32 1, i32 2) to [32 x i8]*), i64 0, i64 6), align 2
So, we have a nested getelementptr
expression, but it’s not nearly as bad as it looks:
getelementptr
grabs the entire union (foo.bar.baz.quux
)bitcast
turns it into a pointer to the appropriate union variant ([32 x i8]*
)getelementptr
takes the cast variant, pierces the pointer, and grabs the index (6
)
as expectedIn other words: we sometimes need two (or more) getelementptr
s7 to compute the effective
addresses of union fields.
To my embarrassment, the first structure example above took me about 30 minutes to understand. So
why stop the beating education there?
As mentioned all the way at the beginning, getelementptr
can take and return entire
vectors of addresses and indexes.
I spent about an hour trying to get LLVM to emit a vectorized getelementptr
, and failed miserably.
So I’m just going to contrive a few.
First, let’s look at a basic case: indexing every address in a vector with a constant scalar value:
1
%foo = getelementptr i64, <8 x i64*> %tbl, i64 0
Observe that the first argument is not a vector: it’s still just the type of thing we’re computing addresses for.
The second is where we introduce our vector: 8 i64
s. More precisely, it’s a pointer in this context:
even with vectors, getelementptr
expects its base address to be a pointer.
Finally, our constant scalar 0
. This gets multiplexed across every member of %tbl
, amounting to:
1
2
3
4
5
6
7
8
long *foo[8];
long tbl[8];
foo[0] = &tbl[0];
foo[1] = &tbl[0];
foo[2] = &tbl[0];
// ...
foo[7] = &tbl[0];
Voilà! 8 address computations in one!
Now let’s do it with a non-constant scalar:
1
%bar = getelementptr i64, <8 x i64*> %tbl, i64 %offset
If you understood the above, then this one shouldn’t be too big of a next step: instead of
always grabbing index 0 in %tbl
, we’re grabbing whatever value is in %offset
. In effect:
1
2
3
4
5
6
7
8
9
long *foo[8];
long tbl[8];
long offset;
foo[0] = &tbl[offset];
foo[1] = &tbl[offset];
foo[2] = &tbl[offset];
// ...
foo[7] = &tbl[offset];
So that’s scalars indexes. What about other vectors?
1
%bar = getelementptr i64, <8 x i64*> %tbl, <8 x i64> %offsets
First, observe that %offsets
is the same size as %tbl
: they’re both <8 x i64>
.
This is an invariant of the vectorized form: you can’t use differently sized vectors in
getelementptr
.
This is because getelementptr
is simple: it doesn’t do anything matrix-y, it just maps
each member of the base (or intermediate) vector to its corresponding index:
1
2
3
4
5
6
7
8
long *foo[8];
long tbl[8], offsets[8];
foo[0] = &tbl[offsets[0]];
foo[1] = &tbl[offsets[1]];
foo[2] = &tbl[offsets[2]];
// ...
foo[7] = &tbl[offsets[7]];
To cap things off: remember that there’s no limit on the number of index arguments that a
getelementptr
may have, and that they don’t have to be in any particular order:
1
%bar = getelementptr i64, <8 x i64*> %tbl, <8 x i64> %offsets, i64 %x, <8 x i64> %more_offsets, i64 6
corresponds to:
1
2
3
4
5
6
7
8
9
long *foo[8];
long tbl[8], offsets[8];
long x;
foo[0] = &tbl[offsets[0][x][0][6]];
foo[1] = &tbl[offsets[1][x][1][6]];
foo[2] = &tbl[offsets[2][x][2][6]];
// ...
foo[7] = &tbl[offsets[7][x][3][6]];
I haven’t talked at all about some of the possible keywords on the getelementptr
instruction,
like inbounds
and inrange
.
In brief:
inbounds
turns the result of the getelementptr
into a
poison value if the ultimate computed address
(or computed piecewise addresses, in vector mode) are not within the bounds of some (any?)
allocated object. Poison values are beyond the scope of this post but, as the name implies,
they’re not generally something you want8.
inrange
turns any load or store to or from the ultimate computed address into
undefined behavior unless the computed address is within the range of the inrange
-tagged index
(or indexes). inrange
currently only applies to getelementptr
expressions (like ones nested
inside another instruction), not full-fledged standalone instructions.
All told: getelementptr
is a little unintuitive and verbose, but not nearly as bad as I thought it
was.
It’s also extremely precise: getelementptr
tells us exactly what kind of address it’s
calculating, where it’s beginning, and the exact “path” through the aggregate that results in
a final effective address. It’s generic and flexible, just like LLVM itself.
Now I just need to understand the SelectionDAG.
And, by a significant margin, the only serious open-source competitor against GCC. ↩
Very literally: llvm::Function
et al. are derived from llvm::Value
. ↩
You’ll probably notice that we load from an i64**
, despite nums
being just an i64*
. This is more LLVM semantic funniness: load
takes a <ty>*
for its source, so loading an i64*
requires i64**
. ↩
If you’re a stickler, you’ll point out that these inline GEPs are not technically instructions in LLVM-land, but actually constant expressions (i.e. llvm::GetElementPtrConstantExpr
instead of llvm::GetElementPtrInst
). But the semantics of the two are identical for my explanatory purposes, so sue me. ↩
Sure enough, the LLVM docs explain this. ↩
I have no clue as to why LLVM switches between i64
and i32
in this GEP. Most LLVM documentation seems to suggest that it just doesn’t matter. ↩
But again, for the pedants: these are constant expressions instead of “real” GEPs, so they’re essentially free. ↩
Without inbounds
, getelementptr
will always return some result address, which may or may not be valid. ↩