TL;DR: CharsetDecoders got several times faster in JDK 17, leaving CharsetEncoders behind. After a few false starts and some help from the community I found a trick to speed up CharsetEncoders similarly. This may or may not speed up your apps in the future.

This is a technical read, but also a story about the process of failing and trying again, with no graphs and lots of distracting links to source code. Sorry. But there will be cake.

Decoding / Encoding

I’ve previously blogged about some JDK 17 improvements to charset decoding, where I reused intrinsics originally added as part of JEP 254 in a few places to realize localized speed-ups of 10x or more.

But dealing with text is a two-way street. When turning text data into chars, the Java APIs decode it into an internal representation (which since JEP 254 is either ISO-8859-1 or a variant of UTF-16). When communicating with the outside world, those chars have to be encoded from the internal representation into whichever character encoding is expected out there.
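
To make the two directions concrete, here is a minimal sketch using the standard APIs (the example data and charset choices are, of course, arbitrary):

    import java.nio.charset.StandardCharsets;

    public class RoundTrip {
        public static void main(String[] args) {
            byte[] incoming = {72, 101, 106};            // bytes from the outside world ("Hej")

            // Decoding: external bytes in some charset -> the internal char representation
            String text = new String(incoming, StandardCharsets.UTF_8);

            // Encoding: internal representation -> bytes in whatever charset the other side expects
            byte[] outgoing = text.getBytes(StandardCharsets.ISO_8859_1);

            System.out.println(text + " -> " + outgoing.length + " bytes");
        }
    }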

Encoding characters was left more or less unchanged in JDK 17, though, because there was seemingly no straightforward way to adapt existing intrinsics for a similar speed-up. The intrinsic that did exist only works specifically for ISO-8859-1.

I grew up using ISO 8859-1 - a legacy encoding with nice properties such as encoding every supported character using a single byte and not having any emojis. It even has cute nicknames such as “latin-1”. But much of the world uses something else these days. UTF-8 is probably a good guess.

Performance check: JDK 17

Let’s take CharsetEncodeDecode - a naive microbenchmark for testing CharsetDecoder and CharsetEncoder performance using ASCII-only data. We can pick which encodings to test, so let’s go with the most common ones. According to W3Techs.com these are the top 10 encodings used on public websites (run using the Oracle OpenJDK 17 bits on my x86 workstation):

CharsetEncodeDecode.encode:
      (type)  Mode  Cnt   Score   Error  Units
       UTF-8  avgt   30  12.019 ± 0.430  us/op
  ISO-8859-1  avgt   30   2.983 ± 0.197  us/op
Windows-1251  avgt   30  41.315 ± 0.715  us/op
Windows-1252  avgt   30  41.845 ± 1.406  us/op
      GB2312  avgt   30  45.914 ± 1.043  us/op
   Shift-JIS  avgt   30  46.328 ± 1.617  us/op
  ISO-8859-9  avgt   30  41.519 ± 0.776  us/op
Windows-1254  avgt   30  41.589 ± 0.911  us/op
      EUC-JP  avgt   30  64.199 ± 3.368  us/op
      EUC-KR  avgt   30  46.412 ± 1.367  us/op
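
For context, the core of what the encode benchmark measures boils down to something like this (a hedged sketch, not the actual CharsetEncodeDecode source; the field names and input size are illustrative):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.util.Arrays;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class EncodeSketch {
        @Param({"UTF-8", "ISO-8859-1", "Windows-1251"})
        String type;

        CharBuffer chars;           // ASCII-only input, reused across invocations
        ByteBuffer bytes;           // reused output buffer
        CharsetEncoder encoder;

        @Setup
        public void setup() {
            char[] data = new char[16384];
            Arrays.fill(data, 'a');                      // ASCII-only data
            chars = CharBuffer.wrap(data);
            encoder = Charset.forName(type).newEncoder();
            bytes = ByteBuffer.allocate((int) (data.length * encoder.maxBytesPerChar()));
        }

        @Benchmark
        public ByteBuffer encode() {
            chars.clear();
            bytes.clear();
            encoder.reset().encode(chars, bytes, true);  // encode the whole chunk in one go
            return bytes;
        }
    }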

In this ASCII-centric test encoding to ISO-8859-1 is faster than pretty much anything else: 4x faster than UTF-8, and more than 10x faster than any of the other encodings. This is thanks to the implEncodeISOArray method being intrinsified, something that was implemented as far back as JDK 8.

The UTF-8 encoder does have a helpful ASCII fast-path, explaining why the gap here is significantly smaller than for the other encodings:

            // ASCII only loop
            while (dp < dlASCII && sa[sp] < '\u0080')
                da[dp++] = (byte) sa[sp++];

Any charset that is ASCII-compatible would likely benefit from doing the same, but, somewhat surprisingly, most of them lack such a fast path.
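
For a single-byte, ASCII-compatible charset such a fast path doesn’t have to be fancy. A minimal sketch (not actual JDK code, and with hypothetical names) could look like this:

    // Bulk-copy the leading run of ASCII chars unchanged; the general per-char
    // lookup path would take over from wherever this stops.
    static int encodeAsciiRun(char[] src, int sp, int sl, byte[] dst, int dp) {
        int start = sp;
        while (sp < sl && dp < dst.length && src[sp] < '\u0080') {
            dst[dp++] = (byte) src[sp++];
        }
        return sp - start;   // number of chars handled by the fast path
    }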

Regardless: any encoding that is ASCII-compatible (and produces a single byte per ASCII char) is essentially doing the same thing in this benchmark, yet ends up with vastly different results. The difference is that ISO-8859-1 intrinsic, which speeds up ISO-8859-1 and ISO-8859-1 alone. That’s not very nice, but that’s how it’s been since JDK 8.

Dropping the ball

I admit feeling proud after optimizing the charset decoders: even though I’ve been working as a performance engineer on the OpenJDK for a while now, it’s exceedingly rare to be able to pull off a 10x optimization in anything, much less in something that might be front and center in many applications. The HotSpot JVM has impressive performance in almost every way, so getting just a few percent of a performance increase can require a lot of engineering effort.

But I couldn’t see a way to speed up encoding using anything similar. I knew there ought to be some way, but when optimizing the decoders I had basically re-used intrinsics that were already there in the JDK-internal Java API, and the intrinsics for encoding - basically only the aforementioned implEncodeISOArray - were not directly applicable to anything but ISO-8859-1.

With no real regression in sight other things took priority. I also went on parental leave for a while and had a nice summer break.

Nudged back into the game

Sometimes the reason to take a second look comes from surprising places. Including Twitter:

(Maybe someone reads my blog!?) I also got a few pings from others who seemed anxious to hear if I had found a solution to their woes (Hi, @yazicivo!).

The issue boils down to String.getBytes(UTF8) being much faster than using a CharsetEncoder - even when the latter is used carefully, reusing CharBuffers and ByteBuffers to be essentially allocation-free. My initial guesswork as to why was way off the mark, but the conversation led to a JMH test case which demonstrated the core issue. The reported performance issue was mostly a discrepancy between JDK 8 and later JDKs, caused by a large improvement in JDK 9 to certain String methods. But there also seemed to be some 5-10% regressions in some of these microbenchmarks when comparing back to JDK 8.
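
To make the comparison concrete, the two code paths in question look roughly like this (a sketch, not the actual JMH test case from that conversation):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.StandardCharsets;

    public class TwoWaysToUtf8 {
        public static void main(String[] args) {
            String s = "some mostly-ASCII payload";

            // 1. The String fast path: taps straight into JDK-internal, intrinsified code.
            byte[] viaString = s.getBytes(StandardCharsets.UTF_8);

            // 2. The CharsetEncoder path: allocation-free once the buffers are set up and
            //    reused, but without the intrinsified ASCII handling of the String path.
            CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
            CharBuffer in = CharBuffer.wrap(s);
            ByteBuffer out = ByteBuffer.allocate((int) (s.length() * encoder.maxBytesPerChar()));
            encoder.reset().encode(in, out, true);
            out.flip();

            System.out.println(viaString.length + " vs " + out.remaining() + " bytes");
        }
    }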

A nicely written, well-defined microbenchmark? And a noticeable regression in some variants? Caused by one of the best performance features in JDK 9? OK, I’m in!

Reluctantly crouched at the starting line

Since Compact Strings, String.getBytes has had specialized implementations for encoding to some built-in charsets such as UTF-8 and ISO-8859-1. These can be hard to beat, since all Strings that can be are represented internally in an ISO-8859-1 binary format. Encoding such a String to ISO-8859-1 means a simple Arrays.copyOf. Encoding to UTF-8 can be an array copy, too, if your String is all ASCII:

        if (!StringCoding.hasNegatives(val, 0, val.length))
            return Arrays.copyOf(val, val.length);

You’ll have to take my word for it when I say that allocating and copying arrays is pretty fast. And StringCoding.hasNegatives is a HotSpot intrinsic method that uses hand-optimized vector instructions to check for negative bytes in the String byte array. If there are none, the bytes are all ASCII and you get a fast copy.
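
Conceptually, the check that the intrinsic accelerates is nothing more than this scalar loop (a sketch of the logic, not the JDK source):

    // A negative byte has its high bit set, i.e. it is outside the ASCII range 0x00-0x7F.
    static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }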

Looking at the StringEncode JDK microbenchmark (which targets String.getBytes) we can see that this meant a pretty solid speed-up compared to JDK 8:

StringEncode.WithCharset.encodeCharset:
JDK (charsetName)  Mode  Cnt    Score   Error  Units 
  8         UTF-8  avgt   15  128.070 ± 5.013  ns/op
 17         UTF-8  avgt   15   68.964 ± 5.236  ns/op

These optimizations are absent in the code used by CharsetEncoder. Again I was where I had been months ago: staring at a set of intrinsified, optimized methods used in one place that seemed perfectly incompatible with the use case I sought to optimize. Alas… maybe there was some way to get halfway there?

He’s going the distance

One of the intrinsics I had looked at earlier is the one used to “compress” chars to bytes. (It’s really tossing away the upper byte of each char, which would be hilariously lossy unless you’ve checked that the upper bytes are all zeroes.) Which led to my first attempt at fixing this in the UTF-8 encoder:

            while (lastAscii < slASCII && sa[lastAscii] < '\u0080') {
                lastAscii++;
            }
            if (lastAscii > sp) {
                int len = lastAscii - sp;
                JLA.compressCharsToBytes(sa, sp, da, dp, len);
                sp = lastAscii;
                dp += len;
            }

That is: scan the input chars until we find a non-ASCII character, then use the StringUTF16.compress method (which I here exposed as JavaLangAccess.compressCharsToBytes) to compress-copy that to the destination.
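
In scalar terms the compress step simply keeps the low byte of each char, roughly like this (a sketch of the logic; the real method is intrinsified and vectorized):

    // Keep only the low byte of each char. Safe here because the preceding scan
    // has verified that every char in the range is ASCII (< 0x80).
    static void compressCharsToBytes(char[] src, int srcOff, byte[] dst, int dstOff, int len) {
        for (int i = 0; i < len; i++) {
            dst[dstOff + i] = (byte) src[srcOff + i];
        }
    }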

Even though the scan for non-ASCII isn’t explicitly vectorized with something similar to StringCoding.hasNegatives, this tweak still meant a sizable improvement on my system: a 1.45x speed-up in Carter Kozak’s microbenchmark, and almost 1.65x on CharsetEncodeDecode.encode:

      (type)  Mode  Cnt  Score   Error  Units
       UTF-8  avgt   30  7.134 ± 0.336  us/op

Good! But the fully intrinsified ISO-8859-1 encoding was still much faster.

He’s going for speed

So I took a long hard look at what that ISO-8859-1 intrinsic actually does. Then it finally dawned upon me. That thing that in hindsight is so obvious.

On x86, implEncodeISOArray is implemented by setting up a bitmask made up of 0xFF00 repeated 8, 16 or 32 times (depending on hardware capabilities). As long as no bits light up when testing that bitmask against a chunk of chars, all those chars can be ISO-8859-1-encoded.

Every chunk that passes through the filter is then subjected to some AVX magic that copies every other byte to the destination. (vpackuswb + vpermq = yeah, no, I get it but also don’t.) Thankfully that part isn’t something we have to care about. All that needs to be different for our needs are those bitmasks. And those were easy enough to spot.

So what we need here is the exact same thing, but with a (repeated) bitmask that detects any non-ASCII char: 0xFF80.
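
In scalar form the whole filter is just a mask test per char, 0xFF00 for ISO-8859-1 and 0xFF80 for ASCII (a conceptual sketch of what the intrinsic checks for each chunk):

    // Any bit caught by the mask means the char does not fit the target range.
    static boolean fitsLatin1(char c) { return (c & 0xFF00) == 0; }  // c <= 0xFF
    static boolean fitsAscii(char c)  { return (c & 0xFF80) == 0; }  // c <= 0x7F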

It took me a few hours of furiously copy-pasting code from various places in the C2 code to get it all up and running, but finally everything seemed properly duct-taped together and I had created a new, very derivative, intrinsic: _encodeAsciiArray. That first version is all there in the PR changelog, hacks and redundant scaffolding included.

But it worked!

After cleaning up the implementation (thanks to some much needed input from Tobias Hartmann!) and making sure most ASCII-compatible charsets get the same treatment, the numbers look almost too good to be true:

CharsetEncodeDecode.encode:
        (type)  Mode  Cnt   Score   Error  Units
         UTF-8  avgt   30   3.246 ± 0.192  us/op
    ISO-8859-1  avgt   30   3.028 ± 0.202  us/op
  Windows-1251  avgt   30   2.922 ± 0.092  us/op
  Windows-1252  avgt   30   2.880 ± 0.196  us/op
        GB2312  avgt   30  46.004 ± 0.903  us/op
     Shift-JIS  avgt   30  46.130 ± 1.142  us/op
    ISO-8859-9  avgt   30   3.112 ± 0.304  us/op
  Windows-1254  avgt   30   3.016 ± 0.235  us/op
        EUC-JP  avgt   30  64.867 ± 3.100  us/op
        EUC-KR  avgt   30  47.918 ± 1.847  us/op

UTF-8 is 4x faster. Several legacy encodings see 15x improvements. Not the result I expected when getting back to this - but very nice!

Combined effect

Microbenchmarks are great tools with surprisingly spotty predictive power for how real applications will behave. A large speed-up in an isolated test might mean little in practice unless that code just so happened to be the bottleneck. Which is rather unlikely. On one application we saw a significant drop in CPU time from the decoder optimization in 17 and… no improvement in throughput. Perhaps not too surprising, since encoding and decoding Strings often happens when doing I/O, where the bottleneck is probably the network or some disks. Things that regular microbenchmarks typically avoid touching.

To naively try and emulate the real world a tiny bit better I prepared EncodeDecodeLoop: a simple program which generates a chunk of ASCII data, encodes it and writes it to a file, then reads it back and decodes it (a rough sketch of the loop follows below). These are the numbers for Windows-1251, using a file, repeated 100,000 times:

          Real time   User time
JDK 11      31.149s      9.511s
JDK 17      27.526s      6.850s
JDK 18      21.586s      1.820s

A huge drop in user time - as expected - but also a significant speed-up in real time, despite the benchmark doing its worst to bottleneck on disk I/O. Yes, in essence this is yet another microbenchmark, but it suggests that maybe these optimizations - when taken together - will be powerful enough to be seen through the noise in some apps. And even when there’s no observable speed-up to be had, doing the same work using significantly fewer CPU cycles is always good.
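
The shape of that loop is roughly the following (a sketch of EncodeDecodeLoop rather than the exact program; the chunk size and charset wiring are illustrative):

    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Reader;
    import java.io.Writer;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    public class EncodeDecodeLoopSketch {
        public static void main(String[] args) throws Exception {
            Charset cs = Charset.forName("windows-1251");
            Path file = Files.createTempFile("encode-decode", ".txt");
            char[] ascii = new char[16384];
            Arrays.fill(ascii, 'a');                          // a chunk of ASCII data

            for (int i = 0; i < 100_000; i++) {
                // Encode while writing to the file ...
                try (Writer w = new OutputStreamWriter(Files.newOutputStream(file), cs)) {
                    w.write(ascii);
                }
                // ... then decode while reading it back.
                try (Reader r = new InputStreamReader(Files.newInputStream(file), cs)) {
                    char[] buf = new char[16384];
                    while (r.read(buf) != -1) {
                        // just drain the data
                    }
                }
            }
            Files.delete(file);
        }
    }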

Future Work

I’ve only implemented this optimization on x86. I’m not very good at reading assembler code (assembly?) on any platform, much less writing it; I really got lucky when spotting a straightforward way to fix the x86 routine. I’ve filed a follow-up bug to get it fixed on AArch64, and it’s being looked at by people who should know what they’re doing.

I’ve also taken a look at some of the encodings in that Top 10 list and seen that at least some of them actually are ASCII-compatible, just implemented differently. I have a patch for the EUC-JP encoding and am looking at a few of the others.

And finally, I think this work should be impactful, safe and uncontroversial to backport to 17u: while it ended up being a large improvement over JDK 8 - the most baseline of baselines - it also resolved some regressions.

Edit 2021-10-15: DoubleByte encodings

It turns out most of the encodings that didn’t see a speed-up above implement DoubleByte and are marked ASCII-compatible - which means they actually encode ASCII as single bytes. Thus getting those up to speed was easier than I thought. Ironically the first one I took a look at - EUC-JP - was an exception that took some extra care. I have prepared a PR (#6102) that improves all the Top 10 encodings to within a small factor of ISO-8859-1:

Benchmark                   (size)      (type)  Mode  Cnt  Score   Error  Units
CharsetEncodeDecode.encode   16384  ISO-8859-1  avgt   30  2.856 ± 0.078  us/op
CharsetEncodeDecode.encode   16384   Shift-JIS  avgt   30  5.287 ± 0.209  us/op
CharsetEncodeDecode.encode   16384      GB2312  avgt   30  5.490 ± 0.251  us/op
CharsetEncodeDecode.encode   16384      EUC-JP  avgt   30  7.657 ± 0.368  us/op
CharsetEncodeDecode.encode   16384      EUC-KR  avgt   30  5.718 ± 0.267  us/op