TruffleRuby Native: Fast Even for Short Scripts

Introduction

Nowadays, it seems every major Ruby implementation has a Just-In-Time (JIT) compiler. Recently, YARV-MJIT has been merged to MRI (CRuby) trunk. JRuby relies on the Java Virtual Machine JIT compilers, and TruffleRuby uses Graal.

One big challenge for JIT compilers is to be beneficial on short-running scripts. In general, JIT compilers are better for long-running applications like web servers.

John Hawthorn recently wrote a blog post about using YARV-MJIT for a small Ruby script. In this post, I want to expand on that and analyze the performance of 43 short-running programs (from 0.04s to 20s). For such programs, quick startup and fast warmup are important to achieve good results.

JRuby and TruffleRuby on JVM do not perform well on short-running programs as their startup alone gives them a big disadvantage. On the other hand, TruffleRuby on SubstrateVM has much better startup and warmup.

TruffleRuby Native

TruffleRuby can run on 2 different virtual machines: the Java Virtual Machine (JVM) and SubstrateVM.

In both cases, the Graal dynamic/just-in-time compiler is used to compile Ruby code down to machine code and obtain great peak performance.

Since we look at short scripts, we pick TruffleRuby on SubstrateVM, also called TruffleRuby Native in this post. TruffleRuby is part of GraalVM, which can be downloaded from OTN. To run TruffleRuby Native, just pass the --native flag to bin/ruby:

$ time graalvm-0.31/bin/ruby -v -e 'puts "Slow Startup ..."'
truffleruby 0.31, like ruby 2.3.5 <GraalVM 0.31 with Graal> [linux-x86_64]
Slow Startup ...
6.26s user 0.18s system 339% cpu 1.899 total

$ time graalvm-0.31/bin/ruby --native -v -e 'puts "Fast Native Startup!"'
truffleruby 0.31, like ruby 2.3.5 <native build with Graal> [linux-x86_64]
Fast Native Startup!
0.08s user 0.01s system 99% cpu 0.084 total

That’s pretty fast. I have been working on TruffleRuby startup for a while and it’s starting to look nice. Not as good as MRI yet, but we’re getting there (that’s for another blog post).

$ time ruby -v -e 'puts "MRI Startup"'
ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-linux]
MRI Startup
0.03s user 0.00s system 98% cpu 0.040 total

The Benchmarks

Startup is interesting, but it would be much more interesting to try on real Ruby scripts. So I took my solutions to Advent of Code, for all 25 days and both puzzles of each day. This amounts to 43 benchmarks, as for some of the days a single script solves both puzzles.

I wrote these solutions. So, of course, they might be biased and might not be representative of other short-running Ruby scripts. But I wrote them in good faith, optimizing for a concise and elegant style, tweaking the code for performance only when it would run for too long. For Advent of Code, I enjoy writing code straight from the problem description rather than reasoning about the maths behind the puzzle. Note that I also made a couple tweaks to TruffleRuby after solving the puzzles (see below for details), which are now part of GraalVM 0.31.

In this blog post, I use the latest MRI/CRuby trunk as of writing (r62451) as the baseline. I also try the new bundled YARV-MJIT and compare against the latest RTL-MJIT from Vladimir Makarov and TruffleRuby Native from GraalVM 0.31.

Enumerable#sum and Kernel#yield_self are not defined in all implementations, as some of them target a different Ruby version than 2.5. So I used a compat.rb file defining these methods in Ruby only when the method does not exist (sum is needed for TruffleRuby and yield_self for RTL-MJIT). I verified that all implementations produce the same output.

I ran each benchmark 10 times and took the average. The maximal deviation from the average across the 10 runs is 8% for MRI trunk, 8% for YARV-MJIT, 13% for TruffleRuby Native and 78% for RTL-MJIT (due to its unstable startup time, between 54ms and >100ms; the maximal deviation is 10% for programs running over 1s).

For completeness, this is run on a laptop with Fedora 26, an Intel Core i7-7700HQ CPU @ 2.80GHz and an SSD. MRI was compiled with the system GCC 7.2.1 20170915 (Red Hat 7.2.1-2), which is also used by YARV-MJIT and RTL-MJIT.
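A minimal sketch of such a compat file, assuming plain-Ruby fallbacks guarded by method_defined? (this is an illustration, not the file used for these measurements, and the method bodies are simplified compared to the real core methods):

```ruby
# compat.rb (illustrative sketch): define a method only when the
# running implementation does not already provide it.

module Enumerable
  unless method_defined?(:sum)
    # Simplified fallback: adds up the elements (or the block results),
    # starting from init.
    def sum(init = 0)
      if block_given?
        inject(init) { |acc, e| acc + yield(e) }
      else
        inject(init) { |acc, e| acc + e }
      end
    end
  end
end

module Kernel
  unless method_defined?(:yield_self)
    # Yields the receiver to the block and returns the block's result.
    def yield_self
      return to_enum(:yield_self) unless block_given?
      yield self
    end
  end
end
```

Such a file can be loaded before each script, for example with ruby -r./compat.rb 1a.rb, so the solutions themselves stay unchanged.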

The results are in seconds. The implementations are compared with time differences (Δ) instead of speedup/slowdown factors to reflect how much time a user gains or loses (10x faster if it’s already <100ms does not make a difference to the user in such a use case).
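As a rough sketch of how the average, the maximal deviation and the Δ columns can be computed from a list of measured times (the helper names are hypothetical and the sample runs are made-up values, only the 11.rb figures come from the table):

```ruby
# Illustrative helpers; names and structure are assumptions.

# Arithmetic mean of the measured wall-clock times, in seconds.
def average(times)
  times.inject(:+).to_f / times.size
end

# Largest relative distance of a single run from the average
# (0.08 means 8%).
def max_deviation(times)
  avg = average(times)
  times.map { |t| (t - avg).abs / avg }.max
end

# Time difference against the baseline, in seconds.
# A negative value means less time than the baseline.
def delta(time, baseline)
  (time - baseline).round(3)
end

runs = [2.0, 2.5, 1.5]        # made-up sample, not measured data
puts average(runs)
puts max_deviation(runs)
puts delta(0.805, 13.818)     # 11.rb: TruffleRuby Native vs MRI trunk
```

Reporting the difference rather than a ratio keeps the sub-100ms scripts from dominating the comparison.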

Cells highlighted in green show gains compared to the baseline (a negative Δ). Cells in red highlight losses of more than 1 second (a Δ above +1.000).

Bench    MRI trunk  YARV-MJIT        Δ  RTL-MJIT        Δ  TruffleRuby Native        Δ
1a.rb        0.041      0.176   +0.135     0.094   +0.053               0.164   +0.123
1b.rb        0.040      0.179   +0.139     0.094   +0.054               0.103   +0.063
2a.rb        0.040      0.230   +0.191     0.172   +0.133               0.090   +0.050
2b.rb        0.040      0.198   +0.158     0.086   +0.047               0.115   +0.075
3a.rb        0.040      0.227   +0.187     0.185   +0.145               0.078   +0.038
3b.rb        0.040      0.239   +0.199     0.197   +0.156               0.098   +0.057
4a.rb        0.041      0.226   +0.185     0.132   +0.091               0.133   +0.092
4b.rb        0.045      0.195   +0.149     0.102   +0.057               0.401   +0.355
5a.rb        0.080      0.227   +0.147     0.187   +0.107               0.274   +0.194
5b.rb        3.312      3.583   +0.271     3.151   -0.160               0.534   -2.778
6.rb         0.087      0.222   +0.135     0.133   +0.046               0.283   +0.197
7a.rb        0.043      0.215   +0.171     0.111   +0.068               0.162   +0.119
7b.rb        0.046      0.239   +0.193     0.143   +0.097               0.260   +0.214
8a.rb        0.042      0.294   +0.252     0.249   +0.208               0.135   +0.093
8b.rb        0.042      0.316   +0.274     0.288   +0.246               0.144   +0.101
9.rb         0.042      0.232   +0.189     0.140   +0.098               0.135   +0.093
10a.rb       0.040      0.225   +0.186     0.173   +0.133               0.093   +0.053
10b.rb       0.046      0.283   +0.237     0.123   +0.077               0.181   +0.134
11.rb       13.818     15.706   +1.889    14.246   +0.429               0.805  -13.013
12a.rb       0.043      0.252   +0.209     0.122   +0.079               0.152   +0.108
12b.rb       0.044      0.181   +0.137     0.083   +0.039               0.179   +0.135
13a.rb       0.043      0.201   +0.158     0.102   +0.059               0.201   +0.158
13b.rb       1.830      1.979   +0.149     1.503   -0.327               0.456   -1.374
14a.rb       0.211      0.308   +0.097     0.272   +0.061               0.684   +0.472
14b.rb       0.244      0.307   +0.063     0.342   +0.098               0.984   +0.740
15a.rb      15.565     14.988   -0.576    14.069   -1.495               2.166  -13.399
15b.rb       8.802      8.404   -0.398     7.974   -0.828               1.278   -7.524
16a.rb       0.051      0.311   +0.260     0.195   +0.144               0.361   +0.310
16b.rb      19.581     22.147   +2.566    24.874   +5.293               8.994  -10.587
17a.rb       0.040      0.240   +0.200     0.133   +0.093               0.103   +0.063
17b.rb       3.027      2.411   -0.616     1.577   -1.451               0.588   -2.439
18a.rb       0.042      0.165   +0.123     0.078   +0.036               0.109   +0.067
18b.rb       0.076      0.184   +0.108     0.088   +0.012               0.810   +0.734
19.rb        0.068      0.187   +0.119     0.129   +0.061               0.530   +0.463
20a.rb       3.362      3.908   +0.545     3.071   -0.291               2.170   -1.192
20b.rb       2.100      2.394   +0.294     2.096   -0.004               5.101   +3.001
21.rb        3.408      3.697   +0.289     3.886   +0.478               3.731   +0.323
22a.rb       0.063      0.261   +0.198     0.167   +0.104               0.277   +0.214
22b.rb      16.493     18.182   +1.689    16.893   +0.400               2.734  -13.759
23a.rb       0.047      0.205   +0.158     0.080   +0.033               0.277   +0.230
23b.rb       4.297      3.906   -0.391     2.058   -2.239               0.582   -3.715
24.rb        5.726      5.899   +0.173     6.474   +0.748               1.901   -3.825
25.rb        1.980      2.094   +0.114     1.929   -0.051               1.563   -0.417
Total      105.068    116.023  +10.954   108.203   +3.135              40.118  -64.950

There seem to be essentially 2 categories in these benchmarks. The first category is scripts which take less than 1 second, on which none of the implementations with a JIT runs faster than MRI trunk. But the JIT implementations also don’t take more than 1 second on them, so it’s likely not a big difference to the user.

The second category is scripts which run for more than 1 second. There, TruffleRuby Native saves a significant amount of time (except on 20b.rb and 21.rb). YARV-MJIT and RTL-MJIT achieve some gains on that second category as well, although they are much more modest.

Overall, the last line (Total) shows that YARV-MJIT and RTL-MJIT are not improving the total time needed to run those scripts. It is a big challenge for JIT compilers to be beneficial on short-running scripts. However, since TruffleRuby Native gains so much on the second category, it manages to execute all scripts in less than half the time MRI trunk takes!

Analysis

All 3 contenders have slower startup than MRI trunk here. YARV-MJIT currently has the known problem of compiling a header on every startup and waiting for it. It seems RTL-MJIT has the same issue. TruffleRuby Native currently has to load its core library written in Ruby on startup, which makes it a bit slower than MRI.

The other issue is warmup, that is, how long it takes until the often-executed parts of the program are compiled. The MJIT approach of emitting C code and shelling out to an external compiler is far from optimal in terms of warmup. For instance, running ruby --jit --jit-verbose=1 15a.rb shows that compiling a method with YARV-MJIT takes at minimum 28ms and the median for all 82 methods compiled is 105ms.

With TruffleRuby Native, the TruffleRuby interpreter and Graal are compiled ahead-of-time by SubstrateVM to machine code. That machine code is saved in an executable (called the image). When starting the executable, we have an already warmed-up TruffleRuby interpreter and calling the JIT compiler is just a method call away. Since Graal is ahead-of-time compiled it starts compiling faster than on JVM and requires no classloading. So for instance, graalvm-0.31/bin/ruby --native --native.XX:+TraceTruffleCompilation 15a.rb shows that Graal takes 11ms to compile the block at line 20, compared to 70ms for YARV-MJIT.

On 20b.rb, the slowdown for TruffleRuby seems to be caused by Struct#== using Struct#values, which is not specialized compared to other Struct methods (it’s a bug, Struct#to_a is specialized).

Finally, it’s time to consider performance as a whole. We see that for slightly longer scripts, TruffleRuby can save up to 13.8 seconds (22b.rb). The maximum gain for YARV-MJIT is 0.6 seconds (17b.rb) and for RTL-MJIT 2.2 seconds (23b.rb).

For YARV-MJIT, it is still the early days and it does not have many optimizations. RTL-MJIT has more optimizations, but does not support Ruby inlining currently. TruffleRuby supports Ruby inlining and also inlining to and from the core library (for both the part written in Ruby and the part written in Java). It even supports inlining Ruby method calls from C extensions.

Integer#times and On-Stack-Replacement

Some programs (particularly short-running ones) are hard for a JIT compiler to optimize. For instance, let’s take John Hawthorn’s solution to Day 15:

def calculate(a, b, n = 40_000_000)
  n.times.count do
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    (a & 0xffff) == (b & 0xffff)
  end
end

p result: calculate(699, 124)

Let’s simplify a little bit by removing the Enumerator to understand better what is going on (the same reasoning applies as count ends up calling times with a block, just with more indirections):

def calculate(a, b, n = 40_000_000)
  count = 0
  n.times do
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    count += 1 if (a & 0xffff) == (b & 0xffff)
  end
  count
end

p result: calculate(699, 124)
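As a quick sanity check that the simplification keeps the same behavior, both variants can be put side by side and compared on a much smaller iteration count. The names calculate_enum and calculate_loop and the reduced n are assumptions made for this sketch, so that both definitions can live in one file and the check finishes quickly:

```ruby
# Variant using the Enumerator returned by Integer#times without a block.
def calculate_enum(a, b, n)
  n.times.count do
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    (a & 0xffff) == (b & 0xffff)
  end
end

# Variant calling Integer#times with a block and counting by hand.
def calculate_loop(a, b, n)
  count = 0
  n.times do
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    count += 1 if (a & 0xffff) == (b & 0xffff)
  end
  count
end

n = 100_000 # far below 40_000_000, only to keep the check fast
same = calculate_enum(699, 124, n) == calculate_loop(699, 124, n)
puts "variants agree: #{same}"
```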

Here, we would ideally compile the calculate method and inline everything called from there. But that method is only called once. So by the time the JIT compiler thinks it’s good to compile that method, we will be inside that method, never call it again, and keep executing in the non-compiled code.

The next best thing is compiling Integer#times, inlining its block and everything from there. In TruffleRuby, Integer#times is defined in Ruby:

class Integer < Numeric
  def times
    return to_enum(:times) { self } unless block_given?

    i = 0
    while i < self
      yield i
      i += 1
    end
    self
  end
end

But we meet the same problem. By the time we figure out that this times method has a loop with many iterations and calls a block (yield) many times, we will already be in the while loop. When we get out of the loop, the program finishes, so we will never use a compiled version of Integer#times.

That’s where On-Stack-Replacement (OSR) comes in. On-Stack-Replacement makes it possible to compile a loop and jump to the compiled loop from the interpreter. TruffleRuby can perform On-Stack-Replacement in while loops thanks to the support for OSR by Truffle and Graal.

Once Truffle detects that the interpreter has iterated in a loop many times (the default threshold is 100 000 loop iterations), it triggers an OSR compilation of that loop. Once that compilation finishes, the interpreter jumps into the compiled loop at the next iteration, executing the rest of the loop much faster.
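To make the mechanism concrete, here is a toy model of the counting and switching described above. It is only an illustration: the method name and the :interpreted/:compiled labels are made up, nothing is actually compiled, and the switch happens immediately at the threshold instead of after a compilation finishes as in Truffle.

```ruby
# Toy model of On-Stack-Replacement, not Truffle's implementation.
OSR_THRESHOLD = 100_000 # default threshold mentioned in the text

# Runs a loop of the given length and reports how many iterations ran
# in each mode. The loop starts "interpreted" and, once the iteration
# counter reaches the threshold, continues in "compiled" mode without
# restarting the loop.
def run_with_osr(iterations, threshold: OSR_THRESHOLD)
  counts = { interpreted: 0, compiled: 0 }
  mode = :interpreted
  i = 0
  while i < iterations
    # The switch happens in the middle of the running loop:
    # i and counts are carried over as they are.
    mode = :compiled if mode == :interpreted && i >= threshold
    counts[mode] += 1
    i += 1
  end
  counts
end

p run_with_osr(250_000)
p run_with_osr(50)
```

A loop shorter than the threshold never leaves the interpreter in this model, which mirrors why very short scripts see no benefit from the JIT.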

In this case, this works because Integer#times is written in Ruby and uses a while loop, which has OSR support. In the previous GraalVM version (0.30), Integer#times was written in Java and did not have OSR support (it would be possible but more complex). As a result, the block given to times was compiled but not the loop itself, which makes a big difference because calling a block from the interpreter is much slower than an inlined block call.

When I was playing with my own solution for Day 15, I tried redefining Integer#times in Ruby and that alone sped up the execution from 7 seconds to 2 seconds, illustrating the gains of On-Stack-Replacement. Interesting how defining more in Ruby can actually help performance.

Conclusion

Improving the performance of short-running programs with just-in-time compilers is challenging.

If the program executes for less than a second, none of the implementations with a JIT compiler managed to gain anything compared to MRI trunk. But, they also all took less than a second, so it probably doesn’t matter much for scripts run only once or a few times.

For programs running for longer than a second, TruffleRuby Native shows it is possible to gain a significant amount of time with a Just-In-Time compiler. This requires fast startup (well below 1 second) and fast warmup (otherwise the program finishes before the compiled code is used). Of course, the JIT compiler benefits from being more advanced, for example by supporting inlining and having a better understanding of Ruby’s constructs. In the case of higher-level loops like Integer#times called only once and with many iterations, On-Stack-Replacement is important to achieve good performance.

YARV-MJIT and RTL-MJIT are exciting but still very young JIT compilers for MRI. Improving warmup while shelling out to GCC (or Clang) is certainly challenging. Making GCC (or Clang) understand Ruby constructs better is also gonna be interesting. Let’s see what the future brings.

If you want to try TruffleRuby Native, you can download GraalVM from OTN. See Getting Started for details. We are working on making it easier to install TruffleRuby (e.g., with rvm/rbenv-install/ruby-install), but that has not landed yet.

If you liked this post, consider following @eregontp on Twitter for more Ruby, performance and concurrency blog posts.