bytecount now ARMed and Web-Assembled

02 October 2023

It’s been some time since I wrote on this blog. I did blog at other venues and worked a lot, which left too little time to write here. So this is “just” a release announcement for my bytecount crate. And also a bit of talk around SIMD.

First things first, bytecount as of version 0.6.4 now supports ARM natively on stable using NEON intrinsics. Yay! While bytecount had unstable ARM SIMD support under the generic-simd feature (which used the nightly only packed-simd crate that mimics the upcoming std::simd API), for stable Rust ARM users were restricted to 64-bit integer based pseudo-SIMD, which used some bit twiddling hacks to simulate SIMD with scalar integer operations.

One reason for this is that for the longest time, the only ARM CPU I owned was in my mobile phone, a cheap-ish Android device that nonetheless had the 4+4 core ARM CPU setup that is now so common. Many folks probably don’t know, but there’s Termux, an Android terminal emulator + environment which has included Rust for quite some time. So I ran a small benchmark trying to count through 1 gigabyte of random bytes, which came in at around 6G/s. Not too bad, given that it used bit twiddling instead of actual SIMD.

Termux only supplies a stable Rust toolchain though, but that didn’t stop me from also benchmarking the generic-SIMD version. Using the RUSTC_BOOTSTRAP env variable lets one use unstable features on stable Rust (beware: This is not guaranteed to be safe for any use!), so I did the same benchmark with our generic SIMD implementation. The result came it at 11G/s.

As I recently had some motivation to get into SIMD stuff again, I set out to port bytecount to ARM. The algorithm is mostly the same; the only difference is the data layout: Where Intel’s AVX2 has 256 bit registers, and AVX-512 even 512 bit ones, NEON is still at 128 bits. However, there are enough registers available to check 4 of them for equality while still having enough to keep the count, so that’s what I did.

The implementation work was sped up considerably by me getting a MacBook Pro with an M2 from Flatfile for some optimization work which they generously let me use to test bytecount.

The final implementation is equal up to a small bit faster than the generic-SIMD one, because I can get away with a few less instructions in certain cases.

So bolstered by this success, I went on to tackle WebAssembly support next. The simd128 intrinsics in core::arch::wasm32 are actually quite similar to what you’d find on ARM, but named differently, plus for some reason there doesn’t seem to be a horizontal addition operation. However, that lack is compensated for by a number of widening addition ops that I could use to combine multiple values, so in the end the code looks quite similar to the aarch64 one.

This one got into the current version 0.6.5. So if you use bytecount, upgrade to enjoy a very healthy speed boost on ARM and on the web.