In this article I give a brief tour of a program analysis technique called symbolic execution. As the title indicates, I aim to explain both the intuition behind the technique, and also the details of a simple implementation. My intended audience is professional programmers who have no background in program analysis. This is a group caught in an awkward position. Although I believe they could greatly benefit from what the program analysis community has to offer, this knowledge seems to be stuck in academia, hidden behind technical jargon, and kept out of reach from all but the most dedicated enthusiasts. I hope this article will spark some interest in the professional programming community, both in people interested in writing these tools and in people interested in using these tools. I start by describing the intuition behind the technique, and then walk through a simple implementation. The implementation is in Haskell, but I’ve done my best to treat the Haskell as pseudo code and explain the snippets in depth. The entire implementation is on GitHub.
1 Introduction
Suppose you were testing the following function. What test cases might you write?
# Compute the absolute value of an int.
def abs(x):
assert(type(x) is int)
if x < 0:
return -x
else:
return x
Two come to mind: (1) a test where x
is a negative number and (2) a test were x
is a positive number. Why these particular conditions? The rationale, of course, is to feed in values that test each code path. Other tests of interest might test the boundary condition (x = 0
) and the extremes (the minimum and maximum integer).1 The goal is to test the function on its interesting paths which are triggered by interesting values.
To a programmer, coming up with these test cases is second nature, but how might this process be described as an algorithm? A starting point is to ask what is known about the value x
at different points in the program.
- At the entry point to the function, very little is known about
x
. Since Python is dynamically typed,x
may not even be anint
. - After the
assert
,x
must be at least anint
, but it could be anyint
. - Inside the first branch of the
if
,x
must be negative, and at the firstreturn
statement,x
is negated. This branch must return a positiveint
. - Inside the second branch,
x
must be0
or positive, and thus it also returns a positiveint
.
Reasoning backwards from these observations, it’s clear that a negative int
is required to test the second branch, and either 0
or a positive int
is required to test the first.
There are a few interesting things to note here.
First, as we step through this program, we don’t think about specific values of x
like 13
. Instead, we think about the set of possible values x
might take on. At different points in the program, x
might be an element of all integers, negative integers, or positive integers plus 0. In other words, we think of x
as a symbolic value (i.e., a set of possible values) rather than a concrete value (i.e., a particular element of that set of possible values).
Second, as the program branches, the set of possible values is constrained. x
might be any integer after the first assert
, but if that value flows into the first branch, it’s known that it can only be a negative integer. In other words, branches enforce constraints on the possible values of x
.
This process, which programmers perform instinctually all the time, is the topic of this article: symbolic execution. When this idea is formalized and automated, it can produce very interesting results. To name a few:
- Automatic test generation: Just as a programmer performs this process informally to create test cases, once automated, this process can generate test cases exercising all explored paths.
- Detection of failing asserts: Since symbolic execution tracks the set of possible values that can flow into an
assert
statement, that information can be used to check if any of the values in that set violate theassert
. If they do, those values can be reported back to the programmer. - Detection of buggy behavior: There are some behaviors that, despite being well defined, are usually unintended and indicative of a bug. For example, in a languages with fixed width integers, overflows are usually unintended. Since a symbolic execution engine knows which values can flow where, it can detect if an addition might result in an overflow. For example, if
x
andy
are both unconstrained 32-bit integers, then for some values ofx
andy
,x+y
can result in an overflow. - Detection of invariant violations: Invariants are properties that are always true at all point in a program’s execution (e.g., the left child of a binary tree is always less than or equal to the parent, or the balance of a smart contract is always positive or zero). By automatically exploring all or many paths, symbolic execution can detect paths that violate invariants and report back which inputs trigger these paths.
2 Implementation
Before implementing the symbolic execution engine, I first need a language to symbolically execute. For this purpose, I’ve put together a small stack language. The first part of this section will describe the implementation of this stack language, and the second will describe a symbolic execution engine for this language.
I chose a stack language because of its resemblance to a low level bytecode. Symbolic execution engines often target low level bytecode or assembly so that they are language agnostic and don’t depend on the correctness of a language’s compiler.
2.1 The Stack Language
I will keep this section brief, since the language itself is not the focus, but it’s worthwhile to walk through the implemenation to see how it compares to the symbolic execution engine and also to describe the semantics of the language.
In this language, a program is a list of instructions. Most instructions operate on a global stack, which is a stack of 32-bit words. The Add
instruction, for example, pops the first two items off the stack, adds them, and pushes the result. Below is a program that reads two numbers from the console, adds them, and prints the result. Since I have not written a parser, the programs are Haskell lists.
addTwoNumbers :: [Instr] -- A program is a list of instructions
= [ Read -- Read from the console and push to the stack
addTwoNumbers Read
, Add -- Add the numbers on top of the stack and push result to the stack
, Print -- Print the number at the top of the stack
, Done -- Ends execution
, ]
Below is the entire instruction set along with an informal description of the semantics. The semantics are described using (before -- after)
notation. For example, (a b -- a + b)
means that before the instruction is executed, a
and b
are on the top of the stack. After the instruction is executed, the top element is the result of a + b
.
data Instr = Add {- (a b -- a + b) -}
| And {- (a b -- a /\ b) -}
| Or {- (a b -- a \/ b) -}
| Not {- (a -- 1 if a=0 else 0) -}
| Lt {- (a b -- a < b) -}
| Eq {- (a b -- a = b) -}
| Push Word32 {- ( -- a) a is the Word32 -}
| Pop {- (a -- ) -}
| Swap {- (a b -- b a) -}
| Dup {- (a -- a a) -}
| Over {- (a b c -- a b c a) -}
| RotL {- (a b c -- b c a) -}
| Read {- ( -- a ) a is read from console in base 10 -}
| Print {- (a -- ) a is printed to console in base 10 -}
| JmpIf {- (cond addr -- ) pc := addr if cond =/= 0 -}
| Store {- (addr val -- ) mem[addr] := val -}
| Load {- (addr -- mem[addr]) -}
| Done deriving (Eq, Show)
Most of these are straightforward. There are a few instructions for basic arithmetic (Add
, Eq
, Lt
, etc) and logic (Not
, Or
, And
, etc) and a few that manipulate the stack (Push
, Dup
, Over
, RotL
, etc).
A few of the instructions deserve special attention: JmpIf
, Load
, and Store
.
JmpIf
is the only control flow instruction. It pops the jump condition and target address off the stack. If the jump condition is true (i.e., is non-zero) then the instruction jumps to the target address by changing the value of pc
, the program counter. Otherwise, it just moves to the next instruction.
The machine also has a memory area that maps 32-bit words to 32-bit words. Store
pops the first two items off the stack. The first is the address and the second is the value. It then stores that value to the memory area using the address as the key. Load
loads from the address at the top of the stack and pushes the value at that address onto the stack.
To formalize this a bit more, execution of a program is defined as the manipulation of three data structures. Each instruction manipulates these in different ways.
- Program counter: this contains the address of the current instruction. An address is an index into the list of instructions that constitute the program, and thus
0
is the address of the first instruction. - Memory area: this is a map of 32-bit words to 32-bit words that can be read and modified by the
Load
andStore
instructions. - Stack: the stack of 32-bit words that most instructions operate on.
The program is executed by walking over the list of instructions and updating those three items accordingly. Each update is one step of the program. The implemenation itself is structured in this way. A function called run
walks the program and repeatedly calls a function called step
which takes the current state and returns the new state, depending on what the current instruction is.
Concretely, the program state is represented as a 3-tuple of an Int
, which is the program counter, a map of 32-bit words, which is the memory area, and the stack, which is a list of 32-bit words. The definition is below.
-- | Program state: (program counter, memory, stack)
type State = (Int, Map Word32 Word32, [Word32])
For most instructions, a single step increments the program counter and performs some operation on the stack. Below is the implemenation of Add
.
-- | Given the current state and an instruction, carry out the
-- | instruction by returning a new updated state.
-- | Only the implementation of `Add` is shown.
step :: State -> Instr -> IO State
:r:stack) Add = return (pc+1, mem, l+r : stack) step (pc, mem, l
So the step
function is parameterized over a 3-tuple, which constitutes the program state, and the current instruction. Each element of the program state and its update is described below.
pc
: The program counter. It’s set to the address of the currentAdd
instruction.mem
: The memory area mapping 32-bit words to 32-bit words.l:r:stack
: The stack, with the top two items pattern matched and bound tol
andr
.
Execution of the Add
instruction means taking the state described above, and updating it to the following:
pc+1
: The program counter is incremented to the next instruction.mem
: The memory area is unchanged.(l+r) : stack
:l
is added tor
and pushed onto the top of thestack
.
To execute the program, step
is called repeatedly by the run
function, which fetches the new instruction, feeds it to step
, and repeats until it encounters a Done
instruction.
The exact mechanics of the run
function are not terribly important, but it is reproduced below with comments for completeness.
-- | Given a program and an initial state, run the program and return
-- | the final value of the stack.
run :: Prog -> State -> IO [Word32]
@(pc, _, stack) =
run prg state-- Fetch the next instruction
case prg ! (Offset pc) of
Just Done ->
-- If the instruction is `Done`, stop execution and return the stack.
return stack
Just instr ->
-- Otherwise, execute the instruction.
-- Call step to get the new state, and do it all over again.
>>= run prg
step state instr Nothing ->
-- If there is no instruction at that address, throw an error.
error ("No instruction at " <> show pc)
The complete implementation is on GitHub.
2.2 The Symbolic Execution Engine
Symbolic execution proceeds similarly to normal execution. The program state and path constraints (this will be covered shortly), are updated repeatedly by a symStep
function. Two differences set this apart from normal execution:
- The memory area and stack store symbolic rather than concrete values. Instead of operating on a stack of 32-bit words, symbolic execution operates on a stack of symbolic expressions which describe sets of 32-bit words. These sets represent all the values a stack entry (or memory entry) might contain during execution.
- Multiple execution paths are explored. While exploring these paths, a set of conditions is accumulated describing the values that, under concrete execution, would trigger a particular path. These are called path constraints. In the introductory example, “
x
must be negative to flow into this branch” is an example of a path constraint. Ideally, all execution paths are explored to exhaustion, but this is often not possible in practice.
Before getting to the implementation, let’s explore these ideas in more detail.
2.2.1 Symbolic and Concrete
What’s the difference between a symbolic value and a concrete value? An example of a concrete 32-bit word is 1
. An example of a symbolic 32-bit word is x
where the symbol x
stands for any 32-bit word (or the set of all possible 32-bit words). Under symbolic execution, the stack and memory area contain these symbolic rather than concrete values. Let’s look at an example program to develop a better intuition for what this means.
Here again is the addition program from above. It reads two numbers from the console, adds them, and prints them.
addTwoNumbers :: [Instr] -- A program is a list of instructions
= [ Read -- Read from the console and push to the stack
addTwoNumbers Read
, Add -- Add the numbers on top of the stack and push result to the stack
, Print -- Print the number at the top of the stack
, Done -- Ends execution
, ]
Under symbolic execution, the Read
instruction will not actually query the user for input. Instead, it will create a symbolic value representing all the values a user might input. It will then push this symbolic value onto the stack. So after the Read
instruction, the stack will look like this:
Stack |
---|
x |
Where x
is a symbolic value that represents any 32-bit word. Similar to the programmer trying to test the abs
function in the introduction, symbolic execution reasons about all possible values rather than specific values.
After the second Read
, another symbolic value, y
, is pushed onto the stack:
Stack |
---|
y |
x |
Finally, after the Add
instruction, x
and y
are popped off the stack, and a data structure representing the addition of x
and y
is pushed onto the stack.
Stack |
---|
y + x |
At this point, it’s worth asking what can be inferred from the state of the symbolic stack. The values of x
and y
are not known, but the following can be inferred:
- The stack contains a single entry, and thus the next instruction,
Print
, is safe. - The entry could be any 32-bit word, because
y+x
could evaluate to any 32-bit word depending on the values ofx
andy
. - Because
x
andy
are unconstrained, the addition could result in an overflow.
So even with a program as simple as this, it’s possible to derive some interesting properties. This gets even more interesting when the program contains branches, but before moving on, let’s wrap up the example.
The final two instructions Print
and Done
will pop the final value off the stack and end symbolic execution. Because execution is symbolic, nothing is actually printed to the console.
There’s an additional question we can ask at this point: Given the symbolic execution as a whole, what can we infer?
- At no point does the program crash due to a lack of arguments on the stack.
- The program ends with an empty stack. No data is left over.
- The program counter always points to a valid instruction.
- The program terminates.
Because the program was explored exhaustively, the above is true for all possible executions of the program. You might think that this is trivially true because there is only one path through the program, but remember that this also explored all possible values. So although you could achieve 100% coverage with a single test, it would take 2^32 * 2^32
tests to verify these properties to for all values. No matter what input you give it, the program will never crash, and it will always exit with a clean stack.
That said, in the real world, it’s usually the case that exploring a program to exhaustion is intractable, and so often it’s not possible to make such strong statements about a program based off symbolic execution alone.
In the next section, we’ll explore how branching affects symbolic execution.
2.2.2 Path Constraints
As mentioned above, this becomes more interesting if the program contains branches. Here’s a version of the addition program that only prints the result if it’s greater than 15.
It’s not necessary to understand the exact mechanics of this program. Suffice it to say that the program reads in two numbers, adds them, compares the result to 15, and either jumps to a Print
instruction or ends execution immediately with a Done
instruction.
addInputsPrintOver15 :: [Instr]
=
addInputsPrintOver15 Read
[ Read
, Add
, Dup
, Push 15
, Lt
, Push 10 -- Address of Print instruction
, Swap
, JmpIf -- If result is over 15, jump to the Print instruction
, Done -- Otherwise, exit
, Print
, Done ] ,
Let’s fast forward to right before the JmpIf
instruction. Our stack will contain the following symbolic values:
Stack |
---|
15 < (y + x) |
10 |
y + x |
The topmost value is the symbolic representation of the condition: the two inputs x
and y
must sum to a value greater than 15
.
The second value is the address to jump to if the condition is true. This is 10
rather than some symbolic variable like x
because 10
is part of the program itself (Push 10
), so we know for sure this stack entry takes on the value 10
.
Finally, the result of y+x
is at the bottom of the stack in case it needs to be printed.
So how is JmpIf
evaluated? It’s not possible to tell from the expression itself if the condition is true or false. For some values of x
and y
, 15 < (y + x)
is true, and for some it’s false. This means the engine must explore both branches, and while it does, it will keep track of the conditions that must be true for a given branch. So, in the case that it jumps, it will record that 15 < (y + x)
must be true, and in the case that it doesn’t jump it will record that ~(15 < (y + x))
must be true. These constraints are known as path constraints and are the primary way that we learn things about the symbolic values in the program. These path constraints are carried along with the program state.
Examining the true case, 15 < (y + x)
is added to the path constraints and the program jumps to the Print
instruction. The new program state is below.
Stack |
---|
y + x |
Path constraints | 15 < (y + x) |
Program counter | 10 (the Print instruction) |
So in addition to the symbolic stack, it now has a path constraint which asserts something about the values of x
and y
at this point in the program. So, what can be inferred?
- The set of possible values
x
andy
could take on is still quite large, but it’s known with certainty that(y=1, x=1)
is not an element because15 < 2
is false. On the other hand,(y=10, x=15)
is an element of that set because15 < 25
is true. y+x
could overflow.- The upcoming
Print
instruction is safe because the stack contains at least one element.
The other branch is similar, except the path constraint is negated.
Stack |
---|
y + x |
Path constraints | ~(15 < (y + x)) |
Program counter | 9 (the Done instruction) |
Again, because of the branch, the possible values of x
and y
are constrained. We know (y=1,x=1)
is in this set, but (y=15,x=0) is not.
The key takeaway is that branches constraint the set of concrete values that the symbolic values might take on, and solving the path constraints reveals the values that will trigger a given path.
One last bit of the puzzle is how these constraints are solved. Usually these constraints are sent to an SMT (Satisfiability Modulo Theories) solver, which reports back if the constraints are satisfiable at all, and if so, what concrete values satisfy it. The inner workings of SMT solvers are beyond the scope of this article, so for now I will treat them as an oracle.
2.2.3 Implementation
Similar to concrete execution, symbolic execution proceeds by walking the program and repeatedly updating the program state. Diving right in, this is how the Add
instruction is implemented. Compare it to the implemenation of step
.
-- | Given a symbolic state and an instruction,
-- | return a new state updated according to the
-- | semantics of the instruction.
-- | Only `Add` is shown.
symStep :: SymState -> Instr -> [SymState]
:r:stack, cs) Add =
symStep (pc, i, mem, l+1, i, mem, (SAdd l r) : stack, cs)] [(pc
Similar to step
it takes a state and returns a new state updated accordingly, but there are two main differences:
SymState
rather thanState
is updated.SymState
represents the program’s symbolic state rather than concrete state.- A list of
SymStates
rather than a singleState
is returned.
SymState
is similar to State
except that it also holds onto a list of path constraints. Further, the stack and memory area hold onto Sym
rather than 32-bit words. Sym
is the data type used to represent symbolic expressions.
type SymState = ( Int -- Program counter
Int -- Unique variable counter
, Map Word32 Sym -- Memory area
, Sym] -- Symbolic stack
, [Sym] -- Path constraint list
, [ )
Note that the memory area maps 32-bit words to Sym
, so the addresses in the memory area are still concrete. More on that later.
Finally, Sym
the data type representing symbolic expressions is constructed like a typical expression tree. It resembles an AST for a simple expression language.
data Sym = SAdd Sym Sym -- Addition
| SEq Sym Sym -- Equality
| SNot Sym -- Logical Not
| SOr Sym Sym -- Logical Or
| SCon Word32 -- A concrete 32-bit word
| SAnd Sym Sym -- Logical And
| SLt Sym Sym -- Less than
| SAny Int -- Any 32-bit word or the set of all 32-bit words
deriving (Show, Eq, Ord)
Below are examples of how the Sym
data type is used to represent symbolic expressions. CAny 0
, for example, represents any 32-bit word, which is how the variables x
and y
are used throughout the article. The 0
in CAny 0
is used to uniquely identify a symbolic value (e.g., if it’s used multiple times).
Expression | Sym representation |
---|---|
y | CAny 0 |
y+x | CAdd (CAny 0) (CAny 1) |
10 | SCon 10 |
10 < (y + x) | SLt (SCon 10) (CAdd (CAny 0) (CAny 1)) |
~(10 < (y + x)) | SNot (SLt (SCon 10) (CAdd (CAny 0) (CAny 1)))) |
y + y | CAdd (CAny 0) (CAny 0) |
Now let’s turn our attention back to symStep
for Add
and examine the updates one by one. The components of the symbolic state are as follows:
pc
: This is still the program counter, and it’s still anInt
.i
: This is a counter that’s incremented each time a symbolic value is created. Its value is used to uniquely name symbolic values. This will make more sense when we examinesymStep
for theRead
instruction.mem
The memory area, mapping 32-bit words to symbolic expressions.l:r:stack
: The symbolic stack, with the first two entries bound tol
andr
.cs
: The list of path constraints.
These are updated in the following way:
pc
: The program counter is incremented to the next instruction.i
: No symbolic values are created, so this is not changed.mem
: Add does not affect the memory area, and so this is unchanged.l:r:stack
: The symbolic expression addingl
andr
is created (SAdd l r
) and pushed onto the stack.cs
: No path constraints are added, so this is unchanged.
Read
is an interesting instruction because it introduces new symbolic values. Below is its implementation.
Read =
symStep (pc, i, mem, stack, cs) +1, i+1, mem, SAny i : stack, cs)] [(pc
Here i
is used as an argument to SAny
, which is the data type that represents symbolic values. The symbolic value, SAny i
, is then pushed onto the stack to represent some arbitrary user input. i+1
is returned to keep the counter unique. And, of course, no actual I/O happens.
JmpIf
is especially interesting because it is the only branching instruction. Below is its implementation.
symStep :: SymState -> Instr -> [SymState]
:SCon addr:stack, path) JmpIf =
symStep (pc, i, mem, cond-- Create multiple states, one for each branch of the condition.
+1, i, mem, stack, SNot cond : path) -- False branch
[ (pc: path)] -- True branch
, (wordToInt addr, i, mem, stack, cond
:_:stack, path) JmpIf =
symStep (pc, i, mem, _-- Do not explore the true branch if the destination address
-- is symbolic.
+1, i, mem, stack, path)] [(pc
There are several notable differences between this and Add
or Read
.
First, this instruction returns multiple new states, one for each branch. In the first case, where the condition is false, pc
is incremented to the next instruction, and no jump occurs. In the case where the condition is true, pc
is set to the jump destination.
Second, also notice that path
, the path constraints, are updated. First the jump condition is popped off the stack and bound to cond
. When exploring the true case, cond
itself is added to the path constraints. When exploring the false case, its negation is added: SNot cond
.
Finally, the true branch is only explored if the destination address is concrete (note the SCon addr
pattern match in the arguments list). Why is this necessary? Consider what would happen if the address were not concrete. Suppose, for example, that the jump address were x
, any 32-bit word. This would mean that the program could jump to any address at all. Exploring this many branches quickly becomes intractable, so a compromise is made and only branches with a concrete destination address are explored. This is a case of trading exhaustiveness for tractability. In order to make the analysis tractable, a compromise is made: some bugs may be missed. As you can imagine, coming up with intelligent ways of handling situations where exhaustiveness must be traded for tractability is an active area of research.
The Load
and Store
instructions suffer from a similar problem: symbolic values can flow into the address argument. If the destination address is any 32-bit word, then Store
could store to any memory location at all. Similarly, it’s possible for Load
to load from any memory location at all. There are many ways of handling this. One way is to treat it as a branching instruction where each possible address is one branch of the program. In other words, one branch loads from address 0
, another branch loads from 1
, and so on. This is the safest option in the sense that it’s exhaustive, but it quickly becomes intractable. Other techniques trade exhaustiveness for tractability, by only exploring some subset of possible addresses.
To keep the implementation simple, this analysis does something much dumber: it ignores instructions that store to a symbolic address. Alternatively, when asked to load from a symbolic address a generic symbolic value is loaded. Since it could be loaded from anywhere, it over approximates the possible values by making the result any 32-bit word.
What are the consequences of this? If instructions are ignored, symbolic execution might diverge entirely from the semantics of the real program, and end up both missing bugs that really exist, and flagging bugs that don’t really exist. For example, in the case where a symbolic value is used for Load
s, the analysis might miss bugs where the actual memory address is empty.
Although real world symbolic execution engines use more sophisticated solutions for dealing with these situations, getting in the habit of thinking about these trade offs is worthwhile. A lot of research in this area is about coming up with better techniques for handling these cases, and carefully thinking about how it affects the analysis.
Below is the implementation of Store
.
-- If storing to a concrete address, update the memory
-- area accordingly.
SCon addr:w:stack, cs) Store =
symStep (pc, i, mem, +1, i, M.insert addr w mem, stack, cs)]
[(pc-- If storing to a symbolic address, ignore the
-- instruction.
:_:stack, cs) Store =
symStep (pc, i, mem, _+1, i, mem, stack, cs)] [(pc
And the implementation of Load
:
-- If loading from a concrete address, load the value
-- from the memory area.
SCon addr:stack, cs) Load =
symStep (pc, i, mem, case M.lookup addr mem of
Just w -> [(pc+1, i, mem, w:stack, cs)]
Nothing -> error "Nothing to Load at address."
-- If loading from a symbolic address, treat the
-- load as loading any possible 32-bit word.
:stack, cs) Load =
symStep (pc, i, mem, _+1, i+1, mem, SAny i: stack, cs)] [(pc
The last step of symbolic execution is calling symStep
repeatedly, either until all paths have been explore exhaustively, or until a certain execution depth it reached. This analysis returns a tree of states, where each path from the root to a leaf represents an explored execution path, and each node represents executing one instruction.
symRun :: Int -> Prog -> SymState -> T.Tree SymState
@(pc, _, _, _, _) =
symRun maxDepth prog state-- Fetch the current instruction.
case prog ! (Offset pc) of
Just Done ->
-- If the instruction is Done, terminate
-- execution of this branch.
T.Node state []
Just instr ->
if maxDepth > 0 then
-- Get all the new states (there might be more than one)
-- and recursively call `symRun` on them.
let newStates = symStep state instr
= fmap (symRun (maxDepth - 1) prog) newStates
children in T.Node state children
else
-- Max depth reached. Terminate execution.
T.Node state []
Nothing ->
error $ "No instruction at " <> show pc
Note here that path exploration proceeds depth first. In this way, symbolic execution can be reframed as a search problem. Better techniques for exploring the tree of possible paths is an active area of research.
The complete implementation is on GitHub.
2.2.4 Constraint Solving
The final step is to send the path constraints off to the contraint solver. In this toy example, the program is executed to an abitrary depth, and then the constraints solver is invoked on the resulting tree. Separating these into stages simplifies the implementation, but a real world engine would interleave these steps so that information from the constraint solver could be used during symbolic execution.
Translating the constraints is a matter of walking the constraint syntax tree, Sym
, and translating each node into whatever data structure the SMT solver uses. I’ve chosen to use sbv as my interface into Z3, an SMT solver. Because this is library specific, I won’t go into much detail on how this is done. Please see the appendix for more details.
What’s more interesting is the form these contraints take on. Suppose I have the following path constraints:
10 < (y + x)
y < x
These get translated into the following:
∃ y. ∃ x. 10 < (y + x) \and y < x
So each symbolic variable is existentially quantified and all path constraints are conjoined.
The SMT solver then reports back with (1) is the formula satisfiable at all? and (2) what assignments to x
and y
make the formula true?
So in the case of the formula above, it might report: Satisfiable, y = 1, x = 10
. What this means is the path is a possible path through the program (because its constraints are satisfiable) and it can be triggered by setting y
to 1
and x
to 10
.
What if the formula is unsatisfiable? That would mean that particular path could never be taken. This is one reason why constraint solving while symbolically executing is valuable: if a branch is impossible, it can be pruned before resources are wasted exploring it.
As always, the complete implementation is on GitHub.
2.3 Putting it all together
Let’s execute a few example programs symbolically and examine the output.
Below is the program that adds its inputs together and only prints if their sum is over 15.
addInputsPrintOver15 :: [Instr]
=
addInputsPrintOver15 Read
[ Read
, Add
, Dup
, Push 15
, Lt
, Push 10 -- Address of Print instruction
, Swap
, JmpIf -- If result is over 15, jump to the Print instruction
, Done -- Otherwise, exit
, Print
, Done ] ,
Symbolically executing it yields the following result.
PC: 0
Stack: []
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 1
Stack: ["val_0"]
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 2
Stack: ["val_1","val_0"]
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 3
Stack: ["(val_1 + val_0)"]
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 4
Stack: ["(val_1 + val_0)","(val_1 + val_0)"]
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 5
Stack: ["15","(val_1 + val_0)","(val_1 + val_0)"]
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 6
Stack: ["15 < (val_1 + val_0)","(val_1 + val_0)"]
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 7
Stack: ["10","15 < (val_1 + val_0)","(val_1 + val_0)"]
Path Constraints: "1"
Solved Values: Trivial
|
`- PC: 8
Stack: ["15 < (val_1 + val_0)","10","(val_1 + val_0)"]
Path Constraints: "1"
Solved Values: Trivial
|
+- PC: 9
| Stack: ["(val_1 + val_0)"]
| Path Constraints: "~(15 < (val_1 + val_0)) and 1"
| Solved Values: val_0 = 0 :: Word32, val_1 = 0 :: Word32,
|
`- PC: 10
Stack: ["(val_1 + val_0)"]
Path Constraints: "15 < (val_1 + val_0) and 1"
Solved Values: val_0 = 4294967280 :: Word32, val_1 = 0 :: Word32,
|
`- PC: 11
Stack: []
Path Constraints: "15 < (val_1 + val_0) and 1"
Solved Values: val_0 = 4294967280 :: Word32, val_1 = 0 :: Word32,
Each node in this tree represents a single snapshot of the symbolic state at that point in the program. A path through the tree represents a path through the program. PC
is the program counter, and the other fields are the symbolic state immediately before executing the instruction at the program counter.
After the initial Read
instructions are executed, two symbolic variables are pushed onto the stack (one for each Read
instruction). The first is called val_0
and the second is val_1
.
PC: 2
Stack: ["val_1","val_0"]
Path Constraints: "1"
Solved Values:
At this point, the only path constraint is 1
or True
, which is the trivially true path constraint. In other words, any inputs to the program will trigger this path–it’s a path that’s always run.
Things get more interesting at the branch.
PC: 8
Stack: ["15 < (val_1 + val_0)","10","(val_1 + val_0)"]
Path Constraints: "1"
Solved Values: Trivial
|
+- PC: 9
| Stack: ["(val_1 + val_0)"]
| Path Constraints: "~(15 < (val_1 + val_0)) and 1"
| Solved Values: val_0 = 0 :: Word32, val_1 = 0 :: Word32,
|
`- PC: 10
Stack: ["(val_1 + val_0)"]
Path Constraints: "15 < (val_1 + val_0) and 1"
Solved Values: val_0 = 4294967280 :: Word32, val_1 = 0 :: Word32,
PC: 8
is the address of the JmpIf
instruction, and it has two children representing each side of the branch. One branch represents the condition being false (the first child) and the other represents it being true (the second child). Which inputs would trigger which branch? The “Solved Values” section tells us that setting the first input, val_0
, to 0
and the second, val_1
, to 0
would trigger the false half of the branch, and indeed 0+0
is not greater than 15
. From the other node, we see that 4294967280
and 0
will trigger the other half of the branch, and indeed 4294967280+0
is greater than 15.
Here is another program. It loops forever.
loop :: [Instr]
= [ Push 0 -- Jump destination
loop Push 1 -- Condition: 1 is always true.
, JmpIf
, Done
, ]
The condition, 1
, is always true, so it always jumps back to the beginning of the program.
Below is a fragment of the symbolic program trace.
PC: 2
Stack: ["1","0"]
Path Constraints: "1"
Solved Values: Trivial
|
+- PC: 3
| Stack: []
| Path Constraints: "~(1) and 1"
| Solved Values: Unsatisfiable
|
`- PC: 0
Stack: []
Path Constraints: "1 and 1"
Solved Values: Trivial
The first node is the JmpIf
instruction, and the two children are each side of the branch. The first half of the branch, the false half, returns the result Unsatisfiable
in the “Solved Values” section. The solver correctly deduces that the path constraint is never true and so it must be the case that that branch is never taken. On the other hand, the true half of the branch returns Trivial
indicating that the path constraints are trivially satisfiable and that branch is always taken.
The two examples given above are perhaps interesting, but not too impressive by themselves. The real power of this technique is using the information gained from symbolic execution to flag issues like failing asserts and integer overflows. Although those are not covered in this tutorial, you can image what these might look like. Given a path constraint and some assert, the SMT solver can tell if there are any value that satisfy the constraints but negate the assert. If there are, it’s possible for the assert to fail. As for overflows, it’s possible to ask the solver if any satisfying values trigger the overflow bit. Beyond this, you could even imagine using temporal logic to assert properties over entire execution trees such as : if p
is true, then eventually q
will be true. The possibilities go on and on.
3 Conclusion
My goal for writing this tutorial is to bring this technique to a wider audience. In particular, I believe this is a promising technique because it can be mostly automated, and has seen some real world success. I hope if this interests you, you will keep exploring. A good place to start is this survey which contains a good overview of the more sophisticated techniques for dealing with memory addresses and jumps. In follow up posts, I’d like to talk more about some of the real world successes, and the existing tools you can use today.
4 Appendix
4.1 Constraint solving with sbv and z3
Constraint solving with sbv is a matter of walking the Sym
tree and translating each node into the appropriate sbv command. Below is how SAdd
is implemented.
import qualified Data.SBV.Dynamic as S
SAdd l r) =
symToSMT m (<$> symToSMT m l <*> symToSMT m r S.svPlus
Each operand to SAdd
is converted with a recursive call to symToSMT
and then sbv’s plus operation is applied, svPlus
.
It’s worth pointing out that sbv’s untyped API (Data.SBV.Dynamic
) must be used when creating sbv expressions at runtime.
Arbitrary 32-bit words are created using sbv’s svMkSymVar
function.
-- | Create an existential word of `i` bits with
-- | the name `name`.
sWordEx :: Int -> String -> S.Symbolic S.SVal
= ask >>= liftIO . S.svMkSymVar (Just S.EX) (S.KBounded False i) (Just (toList name)) sWordEx i name
As shown above, it’s possible to specify if the variable is bounded at all, by how many bits, and if the value is signed or not. You can also specify if the variable is existentially (S.EX
) or universally quanitified. Here, I use bounded, unsigned, 32-bit words that are existentially quantified: S.KBounded False 32
.
Before the Sym
tree is translated, the tree is walked and any symbolic variables are collected (SAny
).
-- | Walk the constraint gathering up the free
-- | variables.
gatherFree :: Sym -> S.Set Sym
@(SAny _) = S.singleton c
gatherFree cSAdd l r) = gatherFree l <> gatherFree r
gatherFree (SEq l r) = gatherFree l <> gatherFree r
gatherFree (SNot c) = gatherFree c
gatherFree (SOr l r) = gatherFree l <> gatherFree r
gatherFree (SAnd l r) = gatherFree l <> gatherFree r
gatherFree (SLt l r) = gatherFree l <> gatherFree r
gatherFree (SCon _) = mempty gatherFree (
This set of symbolic variables is used to declare all variables to the SMT solver before translating the expression. The declared variables are saved to a map and retreived when the symbolic expression references them.
SAny i) = do
symToSMT m (case M.lookup i m of
Just val -> return val
Nothing -> error "Missing symbolic variable."
In a language with fixed width integers, there is one more negative integer than positive integer, so calculating the absolute value of the minimum integer leads to surprising results. Python automatically switches to an abitrary precision representation when this happens so it’s not an issue.↩︎