One Billion Rows Challenge in C
By Danny van Kooten on on Permalink.
Just a few days after finishing Advent of Code 2023, I got nerd sniped1 by Gunnar Morling with his One Billion Rows Challenge.
The challenge is to compute the average, min and max temperature value over 1 billion data points keyed by city name. As fast as possible, using only the Java standard library.
I conveniently ignored the Java constraint and did a 1BRC implementation in C instead. It's written in standard C11 and uses POSIX threads for processing the file in parallel.
Naive implementation
My first solution did things in a straightforward way:
- Read the file line-by-line.
- Parse the temperature using
strtod
. - Parse the city name using
strchr
- Store results in an array on the stack and perform a
O(n)
lookup on every iteration (ouch) to find the index of each city by its name.
The last item probably raises an eyebrow or two, but I just wanted to have a baseline since C does not have an easily available hashmap.
This approach took a whopping 8 minutes on my laptop with AMD Ryzen 4800U and 16 GB of RAM (although it was barely utilizing the latter at this point).
For reference, getting rid of the linear search on its own by introducing a hashmap brought down the runtime to ~90 seconds already.
Final implementation
The final implementation improves on pretty much all of the above points:
- Use a hashmap with a simple but effective multiplication hash, linear probing and a load factor of well under 0.5.
- Memory map the entire 12 GB file at once.
- Process file in parallel using 16 completely separate chunks.
- Custom unrolled integer parser.
So how much faster is this?
On my machine this finishes in just over 1.5 seconds, ~320x as fast as my naive implementation.
You can find the runtimes for the various states of my implementation below.
Description | Time |
---|---|
naive | 480s |
hashmap | 97s |
custom float/int parser | 45s |
fread in 64 MB chunks | 25s |
unroll parsing of city name + hash generation | 20s |
parallelize | 6s |
mmap dataset file | 1.6s |
The code for my implementation and each progression can be found on GitHub: dannyvankooten/1brc
Here's what worked, in order of significance.
Memory mapping vs. chunked reading:
A huge part of the program's runtime is spent on reading the 12 GB dataset file from disk.
Since the official challenge runs the benchmark 5 times and then discards the fastest and slowest run, we can gain a lot by ensuring the file hangs around in memory.
By using mmap
to map the file into memory, the kernel conveniently handles
this for us.
The performance difference between a warm and a hot cache is quite extreme:
$ echo 3 > /proc/sys/vm/drop_caches
$ time bin/analyze >/dev/null
real 0m5.048s
user 0m34.828s
sys 0m6.145s
And then, running again while the cache is warm:
$ time bin/analyze >/dev/null
real 0m1.949s
user 0m26.654s
sys 0m0.868s
Processing concurrently and then aggregating the final results
The data is processed in parallel using completely separate chunks using nproc
threads (16 on my machine).
Once all threads are done, the result of each thread is aggregated into the final result. This means we don't have to use any concurrent data structure or locking, a simple join on all threads suffices.
Hashing the city names
We get rid of the linear search to find the index of each city's result by using a simple but fast multiplication hash combined with linear probing2 to find a free spot and a load factor of well under 0.5.
Parsing the city name and generating its hash value is done simultaneously, in a single loop.
// hash everything up to ';'
// assumption: key (city name) is at least 1 char
unsigned int len = 1;
unsigned int hash = (unsigned char)buf[0];
while (buf[len] != ';') {
hash = (hash * 31) + (unsigned char)buf[len++];
}
// probe map until free spot or matching key
unsigned int idx = hashmap[hash & cap-1];
while (idx != 0 && memcmp(results[idx], buf, len) != 0) {
hash++;
idx = hashmap[hash & cap-1];
}
// idx is now either 0 (new entry)
// or contains the index of our key in the results array
Integer math
Since the challenge data only uses a single decimal, we can perform integer math and divide the final result by 10 instead.
Custom integer parser
Because temperatures are within the range of -99.0
to +99.0
, we can use a
custom parser that does a single branch condition for the position of the
decimal and then do the rest of the parsing without any additional branching.
/* Parse string with single decimal as integer */
/* Much faster than strod or atoi, but we can do better */
void parse_number(int *dest, char *s) {
char mod = 1;
int n = 0;
int d = 0;
// parse sign
if (*s == '-') {
mod = -1;
s++;
}
// parse characteristic
while (*s >= '0' && *s <= '9') {
n = (n * 10) + (*s++ - '0');
}
// skip separator
s++;
// parse mantissa
while (*s >= '0' && *s <= '9') {
d = (d * 10) + (*s++ - '0');
}
*dest = mod * n * 10 + d;
}
/* Unrolled version, using the property of our input */
/* that the decimal separator is always at index 2 or 3 */
char *parse_number_unrolled(int *dest, char *s) {
// parse sign
int mod;
if (*s == '-') {
mod = -1;
s++;
} else {
mod = 1;
}
if (s[1] == '.') {
*dest = ((s[0] * 10) + s[2] - ('0' * 11)) * mod;
return s + 4;
}
*dest = (s[0] * 100 + s[1] * 10 + s[3] - ('0' * 111)) * mod;
return s + 5;
}
Potential improvements that did not work
Some other things I tried that did not result in any noticeable performance improvements are:
Branchless min() and max()
I am not sure whether b < a
requires a branch on my machine, but I did not
find any performance improvements from using this trick.
int min(int a, int b) {
return a ^ ((b ^ a) & -(b < a));
}
Checking godbolt it seems that gcc 13.2 with -O2
on x86-64
architectures is smart enough to generate a branchless version anyway.
int max_1(int a, int b) {
return a > b ? a : b;
}
int max_2(int a, int b) {
return a ^ ((b ^ a) & -(b < a));
}
Results in the following (identical) assembly3.
max_1:
cmp esi, edi
mov eax, edi
cmovge eax, esi
ret
max_2:
cmp edi, esi
mov eax, esi
cmovle eax, edi
ret
Huge tables
Profiling my implementation using perf
shows that a lot of time is spent dealing with
pagefaults.
+ 93.39% 0.00% analyze libc.so.6 [.] clone3
+ 91.62% 0.00% analyze libc.so.6 [.] start_thread
+ 90.75% 66.58% analyze analyze [.] process_chunk
+ 23.52% 0.60% analyze [kernel.kallsyms] [k] asm_exc_page_fault
+ 22.51% 0.00% analyze [kernel.kallsyms] [k] exc_page_fault
+ 22.51% 0.00% analyze [kernel.kallsyms] [k] do_user_addr_fault
+ 20.44% 0.35% analyze [kernel.kallsyms] [k] handle_mm_fault
+ 19.52% 0.29% analyze [kernel.kallsyms] [k] __handle_mm_fault
+ 16.36% 0.56% analyze [kernel.kallsyms] [k] do_fault
+ 13.96% 0.00% analyze [kernel.kallsyms] [k] __do_fault
I tried getting mmap
to work with the MAP_HUGETLB
flag set, but it relies on
MAP_ANONYMOUS
being set as well, which if set will ignore the file descriptor
argument. Calling mmap
once more or using read
to get the entire file into
our mapped region proved slower than just letting the kernel deal with all page
faults.
Casting last 8 bytes as integer
A very interesting idea coined by Hello10244 on GitHub was that instead of parsing the temperature character by character, we could cast the last 8 bytes of every line as an integer directly. That would be huge, but I was not able to use it while retaining correct results.
Comments are welcome. You can email me at hi @ this domain.