One Billion Rows Challenge in C

By Danny van Kooten on Jan 4, 2024 on Permalink.

Just a few days after finishing Advent of Code 2023, I got nerd sniped¹ by Gunnar Morling with his One Billion Rows Challenge.

The challenge is to compute the average, min and max temperature value over 1 billion data points keyed by city name. As fast as possible, using only the Java standard library.

I conveniently ignored the Java constraint and did a 1BRC implementation in C instead. It's written in standard C11 and uses POSIX threads for processing the file in parallel.

Naive implementation

My first solution did things in a straightforward way:

Read the file line-by-line.
Parse the temperature using strtod.
Parse the city name using strchr
Store results in an array on the stack and perform a O(n) lookup on every iteration (ouch) to find the index of each city by its name.

The last item probably raises an eyebrow or two, but I just wanted to have a baseline since C does not have an easily available hashmap.

This approach took a whopping 8 minutes on my laptop with AMD Ryzen 4800U and 16 GB of RAM (although it was barely utilizing the latter at this point).

For reference, getting rid of the linear search on its own by introducing a hashmap brought down the runtime to ~90 seconds already.

Final implementation

The final implementation improves on pretty much all of the above points:

Use a hashmap with a simple but effective multiplication hash, linear probing and a load factor of well under 0.5.
Memory map the entire 12 GB file at once.
Process file in parallel using 16 completely separate chunks.
Custom unrolled integer parser.

So how much faster is this?

On my machine this finishes in just over 1.5 seconds, ~320x as fast as my naive implementation.

You can find the runtimes for the various states of my implementation below.

Description	Time
naive	480s
hashmap	97s
custom float/int parser	45s
fread in 64 MB chunks	25s
unroll parsing of city name + hash generation	20s
parallelize	6s
mmap dataset file	1.6s

The code for my implementation and each progression can be found on GitHub: dannyvankooten/1brc

Here's what worked, in order of significance.

Memory mapping vs. chunked reading:

A huge part of the program's runtime is spent on reading the 12 GB dataset file from disk.

Since the official challenge runs the benchmark 5 times and then discards the fastest and slowest run, we can gain a lot by ensuring the file hangs around in memory.

By using mmap to map the file into memory, the kernel conveniently handles this for us.

The performance difference between a warm and a hot cache is quite extreme:

$ echo 3 > /proc/sys/vm/drop_caches

$ time bin/analyze >/dev/null

real    0m5.048s
user    0m34.828s
sys     0m6.145s

And then, running again while the cache is warm:

$ time bin/analyze >/dev/null

real    0m1.949s
user    0m26.654s
sys     0m0.868s

Processing concurrently and then aggregating the final results

The data is processed in parallel using completely separate chunks using nproc threads (16 on my machine).

Once all threads are done, the result of each thread is aggregated into the final result. This means we don't have to use any concurrent data structure or locking, a simple join on all threads suffices.

Hashing the city names

We get rid of the linear search to find the index of each city's result by using a simple but fast multiplication hash combined with linear probing² to find a free spot and a load factor of well under 0.5.

Parsing the city name and generating its hash value is done simultaneously, in a single loop.

// hash everything up to ';'
// assumption: key (city name) is at least 1 char
unsigned int len = 1;
unsigned int hash = (unsigned char)buf[0];
while (buf[len] != ';') {
  hash = (hash * 31) + (unsigned char)buf[len++];
}

// probe map until free spot or matching key
unsigned int idx = hashmap[hash & cap-1];
while (idx != 0 && memcmp(results[idx], buf, len) != 0) {
    hash++;
    idx = hashmap[hash & cap-1];
}

// idx is now either 0 (new entry)
// or contains the index of our key in the results array

Integer math

Since the challenge data only uses a single decimal, we can perform integer math and divide the final result by 10 instead.

Custom integer parser

Because temperatures are within the range of -99.0 to +99.0, we can use a custom parser that does a single branch condition for the position of the decimal and then do the rest of the parsing without any additional branching.

/* Parse string with single decimal as integer */
/* Much faster than strod or atoi, but we can do better */
void parse_number(int *dest, char *s) {
  char mod = 1;
  int n = 0;
  int d = 0;

  // parse sign
  if (*s == '-') {
    mod = -1;
    s++;
  }

  // parse characteristic
  while (*s >= '0' && *s <= '9') {
    n = (n * 10) + (*s++ - '0');
  }

  // skip separator
  s++;

  // parse mantissa
  while (*s >= '0' && *s <= '9') {
    d = (d * 10) + (*s++ - '0');
  }

  *dest = mod * n * 10 + d;
}

/* Unrolled version, using the property of our input */
/* that the decimal separator is always at index 2 or 3 */
char *parse_number_unrolled(int *dest, char *s) {
  // parse sign
  int mod;
  if (*s == '-') {
    mod = -1;
    s++;
  } else {
    mod = 1;
  }

  if (s[1] == '.') {
    *dest = ((s[0] * 10) + s[2] - ('0' * 11)) * mod;
    return s + 4;
  }

  *dest = (s[0] * 100 + s[1] * 10 + s[3] - ('0' * 111)) * mod;
  return s + 5;
}

Potential improvements that did not work

Some other things I tried that did not result in any noticeable performance improvements are:

Branchless min() and max()

I am not sure whether b < a requires a branch on my machine, but I did not find any performance improvements from using this trick.

int min(int a, int b) {
    return a ^ ((b ^ a) & -(b < a));
}

Checking godbolt it seems that gcc 13.2 with -O2 on x86-64 architectures is smart enough to generate a branchless version anyway.

int max_1(int a, int b) {
    return a > b ? a : b;
}

int max_2(int a, int b) {
    return a ^ ((b ^ a) & -(b < a));
}

Results in the following (identical) assembly³.

max_1:
        cmp     esi, edi
        mov     eax, edi
        cmovge  eax, esi
        ret
max_2:
        cmp     edi, esi
        mov     eax, esi
        cmovle  eax, edi
        ret

Huge tables

Profiling my implementation using perf shows that a lot of time is spent dealing with pagefaults.

+   93.39%     0.00%  analyze  libc.so.6             [.] clone3
+   91.62%     0.00%  analyze  libc.so.6             [.] start_thread
+   90.75%    66.58%  analyze  analyze               [.] process_chunk
+   23.52%     0.60%  analyze  [kernel.kallsyms]     [k] asm_exc_page_fault
+   22.51%     0.00%  analyze  [kernel.kallsyms]     [k] exc_page_fault
+   22.51%     0.00%  analyze  [kernel.kallsyms]     [k] do_user_addr_fault
+   20.44%     0.35%  analyze  [kernel.kallsyms]     [k] handle_mm_fault
+   19.52%     0.29%  analyze  [kernel.kallsyms]     [k] __handle_mm_fault
+   16.36%     0.56%  analyze  [kernel.kallsyms]     [k] do_fault
+   13.96%     0.00%  analyze  [kernel.kallsyms]     [k] __do_fault

I tried getting mmap to work with the MAP_HUGETLB flag set, but it relies on MAP_ANONYMOUS being set as well, which if set will ignore the file descriptor argument. Calling mmap once more or using read to get the entire file into our mapped region proved slower than just letting the kernel deal with all page faults.

Casting last 8 bytes as integer

A very interesting idea coined by Hello1024⁴ on GitHub was that instead of parsing the temperature character by character, we could cast the last 8 bytes of every line as an integer directly. That would be huge, but I was not able to use it while retaining correct results.

Comments are welcome. You can email me at hi @ this domain.