Saturday, February 6, 2021

Why most programming language performance comparisons are most likely wrong

For as long as programming languages have existed, people have fought over which one of them is the fastest. These debates have ranged from serious scientific research to many a heated late night bar discussion. Rather than getting into that argument, let's look at the problem at a higher level, namely: how would you compare the performance of two different programming languages? The only really meaningful approach is to do it empirically, that is, by implementing a bunch of test programs in both programming languages, benchmarking them and then declaring a winner.

This is hard. Really hard. Insanely hard in some cases and very laborious in any case. Even though the problem seems straightforward, there are a ton of error sources that can trip up the unaware (and even many very-much-aware) performance tester.

Equivalent implementations?

In order to make the two implementations comparable they should be "of equal quality". That is, they should have been implemented by people with roughly the same amount of domain knowledge as well as programming skills in their chosen language. This is difficult to organise. If the implementations are written by different people, they may approach the problem with different algorithms making the relative performance not a question of programming languages per se, but of the programming approaches chosen by each programmer.

Even if both implementations are written by the same person using the same algorithm, there are still problems. Typically people are better at some programming languages than others, so they tend to produce faster implementations in their favourite language. This causes bias, because the result is not a measure of the programming languages themselves, but rather of the individual programmer. These sorts of tests can be useful for finding usability and productivity differences, but not so much for performance.

Thus you might want to evaluate existing programs written by many expert programmers. This is a good approach, but sometimes even seasoned researchers get it wrong. There is a paper that tries to compare different programming languages for performance and power efficiency using this approach. In their test results one particular program's C implementation was 30% faster than the same program written in C++. This single measurement casts a shadow over the entire paper. If we took the original C source, changed all the source files' extensions from .c to .cpp and recompiled, the end result should have the same performance to within a few percentage points. Thus we have to conclude that one of the following is happening (in decreasing order of probability):
  1. The C++ version is suboptimally coded.
  2. The testing methodology has a noticeable flaw.
  3. The compiler used has a major performance regression for C++ as opposed to C.
Or, in other words, the performance difference comes from somewhere other than the choice of programming language.

The difficulty of measurement

A big question is how one actually measures the performance of a given program. A common approach is to run the test multiple times in a row and then do something like the following:
  • Handle outliers by dropping the points at extreme ends (that is, the slowest and fastest measurements)
  • Calculate the mean and/or median for the remaining data points
  • Compare the results between the different programs; the one with the fastest time wins
Those who remember their high school statistics lessons might calculate standard deviation as well. This approach seems sound and rigorous, but it contains several sources of systematic error. The first of these is quite surprising and has to do with noise in measurements.
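
To make the description concrete, here is a minimal sketch of that common methodology in Python. Everything in it is illustrative: the command, the number of repetitions and the helper names are invented for the example, not taken from any real test suite.

    import statistics
    import subprocess
    import time

    def measure_once(cmd):
        # Run the program once and return its wall clock time in seconds.
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        return time.perf_counter() - start

    def naive_benchmark(cmd, repetitions=30):
        # The common approach described above: drop the extremes, then average.
        times = sorted(measure_once(cmd) for _ in range(repetitions))
        trimmed = times[1:-1]  # discard the fastest and slowest runs as "outliers"
        return statistics.mean(trimmed), statistics.median(trimmed), statistics.stdev(trimmed)

    # Example: mean, median, stdev = naive_benchmark(["./testprogram"])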

Most basic statistical tools assume that the error is normally distributed with an average value of zero. If you are measuring something like temperature or speed this is a reasonable assumption. In this case it is not. A program's measured time consists of the "true" time spent solving the problem and overhead that comes from things like OS interruptions, disk accesses and so on. If we assumed that the noise were gaussian with a zero average, it would mean that the physical machine has random processes that sometimes make the program run faster than it would if the machine were completely noise free. This is, of course, impossible. The noise is strongly non-gaussian simply because it can never have a negative value.

In fact, the measurement that is the closest to the platonic ideal answer is the fastest one. It has the least amount of noise interference from the system. That is the very same measurement that was discarded in the first step when outliers were cleaned out. Sometimes doing established and reasonable things makes things worse.
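
If you accept this reasoning, a more defensible summary statistic drops out almost for free: take the minimum of the repeated measurements instead of a trimmed mean. A one-line sketch, reusing the hypothetical measure_once helper from the previous snippet:

    def best_estimate(cmd, repetitions=30):
        # The fastest observed run is the one least disturbed by system noise.
        return min(measure_once(cmd) for _ in range(repetitions))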

Statistics even harder

Putting that aside, let's assume we have measurements for the two programs, which do look "sufficiently gaussian". Crunching the numbers shows that language #1 takes 10 seconds to run whereas language #2 takes 9 seconds. A 10% difference is notable, so we can conclude that language #2 is faster. Right?

Well, no. Suppose the actual measurement data look like this:

[Figure: the measured running times of the two programs, plotted side by side as two distributions.]

Is the one on the right faster or not? Maybe? Probably? Could be? Answering this question properly requires going all the way to university level statistics. First one formulates a null hypothesis, that is, that the two programs have no performance difference. Then one calculates how probable it would be to get measurements this different if both samples actually came from the same probability distribution. If that probability is small (typically under 5%), the null hypothesis is rejected and we conclude that one program is indeed faster than the other. This method is known as Student's t-test, and it is commonly used in heavy duty statistics. Note that some implementations of the test assume gaussian data, and if your data has some other shape, the results you get might not be reliable.
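
To make this concrete, here is a minimal sketch using SciPy with invented sample data (the numbers below are made up for illustration, not real measurements). It uses Welch's variant of the t-test, which does not require the two samples to have equal variances:

    from scipy import stats

    # Hypothetical measured running times in seconds for the two programs.
    lang1 = [10.1, 10.3, 9.8, 10.4, 10.0, 10.2, 9.9, 10.5]
    lang2 = [9.0, 9.6, 8.9, 9.4, 9.2, 10.1, 8.8, 9.3]

    # Null hypothesis: both samples come from the same distribution (no difference).
    result = stats.ttest_ind(lang1, lang2, equal_var=False)
    if result.pvalue < 0.05:
        print(f"p = {result.pvalue:.3f}: reject the null hypothesis, the difference looks real")
    else:
        print(f"p = {result.pvalue:.3f}: cannot claim one program is faster than the other")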

This works for one program, but a rigorous test has many different programs. There are statistical methods for evaluating those, but they get even more complicated. Looking up how they work is left as an exercise to the reader.

All computers' alignment is Chaotic Neutral

Statistics are hard, but fortunately computers are simple because they are deterministic, reliable and logical. For example, if you have a program and you modify it by adding a single NOP instruction somewhere in the instruction stream, the biggest impact it could possibly have is one extra instruction cycle, which is so vanishingly small as to be unmeasurable. If you do go out and measure it, though, the results might confuse and befuddle you. Not only can this one minor change make the program 10% slower (or possibly even more), it can even make it 10% faster. Yes. Doing pointless extra work can make your program faster.

If this is the first time you have encountered this issue you probably won't believe it. Some fraction of readers might already have gone to Twitter to post their opinion about this "stupid and wrong" article written by someone who is "clearly incompetent and an idiot". That's ok, I forgive you. Human nature and all that. You'll grow out of it eventually. The phenomenon is actually real and can be verified with measurements. How is it possible that having the CPU do extra work could make the program faster?

The answer is that it doesn't. The actual instruction is irrelevant. What actually matters is code alignment. As code shifts around in memory, its performance characteristics change. If a hot loop gets split across a cache line boundary it slows down; unsplitting it speeds it up. The NOP does not need to be inside the loop for this to happen; simply moving the entire code block up or down can cause this difference. Suppose you measure two programs in the most rigorous statistical way possible. If the performance delta between the two is under something like 10%, you cannot reasonably say that one is faster than the other unless you use a measurement method that eliminates alignment effects.

It's about the machine, not the language

As programs get faster and faster, optimisation undergoes an interesting phase transition. Once performance gets to a certain level, the game is no longer about what the compiler and CPU can do to run the developer's program as fast as possible. Instead it becomes about how the programmer can utilize the CPU's functionality as efficiently as possible. This includes things like arranging your data into a layout that the processor can crunch with minimal effort. In effect this means replacing language based primitives with hardware based primitives. In some circles optimization even works backwards: the programmer knows exactly which SIMD instructions they want a given loop to be compiled into and then fiddles around with the code until the compiler emits them. At this point the functionality of the programming language itself is immaterial.
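
As a rough illustration of the data layout point, compare keeping records as individual objects with keeping them as flat, contiguous arrays of primitives. The Python/NumPy sketch below is only meant to show the shape of the idea; the names and sizes are invented and it is not a rigorous benchmark.

    import numpy as np

    # "Array of structures": a list of objects, every field boxed and scattered in memory.
    class Particle:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    particles = [Particle(float(i), float(i) * 2.0) for i in range(1_000_000)]
    total_aos = sum(p.x for p in particles)   # pointer chasing on every element

    # "Structure of arrays": flat buffers the CPU can stream through, which NumPy
    # hands to vectorised machine code instead of the interpreter loop.
    xs = np.arange(1_000_000, dtype=np.float64)
    ys = xs * 2.0
    total_soa = xs.sum()                      # one tight loop over contiguous memory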

This is the main reason why languages like C and Fortran are still at the top of many performance benchmarks, but the techniques are not limited to those languages. Years ago I worked on a fairly large Java application that had been very thoroughly optimized. Its internals consisted of integer arrays. There were no classes or even Integer objects in the fast path; it was basically a recreation of C inside Java. It could have been implemented in pretty much any programming language. The performance differences between the implementations would have mostly come down to each compiler's optimizer. Optimizers produce wildly different performance outcomes even when using the same programming language, let alone different ones. Once this happens it's not really reasonable to claim that any one programming language is clearly faster than the others, since they all get reduced to glorified inline assemblers.

References

Most of the points discussed here have been gathered from other sources and presentations, which the interested reader is encouraged to look up. These include the many optimization talks by Andrei Alexandrescu as well as the highly informative CppCon keynote by Emery Berger. Despite its venue the latter is fully programming language agnostic, so you can watch it even if you are the sort of person who breaks out in hives whenever C++ is mentioned.

4 comments:

  1. It's more that "most programming language performance comparisons" are not even wrong.
    The comparisons are under-specified.
    The goals of the comparison are unspecified.

    The motivation may be akin to prophesy — Which programming language will allow whichever programmers we have to create the most performant software for whatever tasks we need to accomplish over the next X years, without compromising our unstated requirements for productivity and reliability and …?

    > Thus we have to conclude that one of the following is happening…

    The C program used pcre.h and the C++ program used boost/regex.hpp, and the researchers refused to discard outliers because they thought that would be misrepresented as "cherry-picking results", even when the outlier was an order of magnitude different.

    (The other outlier problem: the data tables published with that 2017 paper show a 15x difference between the measured times of the selected JS and TS fannkuch-redux programs. That should explain the TS and JS average Time difference.)

    Simply dropping those outliers would have avoided much of the criticism heaped upon them. (otoh There's no such thing as bad publicity.)

    > not really reasonable to claim that any one programming language is clearly faster than others…

    Python is as fast as C when it is C?

    But reasonable to claim that one programming language is more suitable for your current specific requirements?

    https://benchmarksgame-team.pages.debian.net/benchmarksgame/why-measure-toy-benchmark-programs.html#reimplement

  2. > If we took the original C source, changed all the sources' file extension from .c to .cpp and recompiled…

    Did you do that? Probably I made a mistake but —

    $ /usr/bin/gcc -pipe -Wall -O3 -fomit-frame-pointer -march=native -fopenmp regexredux.gcc-4.c -lpcre

    OK

    $ /usr/bin/gcc -pipe -Wall -O3 -fomit-frame-pointer -march=native -fopenmp regexredux.gcc-4.cpp -lpcre
    regexredux.gcc-4.cpp: In function ‘void replace(const char*, const char*, const string*, string*, pcre_jit_stack*)’:
    regexredux.gcc-4.cpp:40:37: error: invalid conversion from ‘void*’ to ‘char*’ [-fpermissive]
    40 | dst_String->data=realloc(dst_String->data, dst_String->capacity*=2);
    | ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    | |
    | void*

    etc etc

    Replies
    1. I did not actually test it myself (hence the text above says "should"). There are a few cases where a C++ compiler is stricter than C so you need to add casts and the like, but those should be syntactic sugar/salt rather than functional changes.

      But, as said, I have not tested this myself. So it is possible (though unlikely) that reality proves me to be completely wrong.

  3. 8 out of 12 showed compiler errors (not warnings)
    error binarytrees.gcc-3.cpp
    fannkuchredux.gcc-5.cpp
    error chameneosredux.gcc-5.cpp
    fasta.gcc-2.cpp
    error knucleotide.cpp
    error mandelbrot.gcc-6.cpp
    nbody.gcc-4.cpp
    pidigits.cpp
    error regexredux.gcc-4.cpp
    error revcomp.gcc-6.cpp
    error spectralnorm.gcc-4.cpp

    How does the performance of those cpp programs compare to the performance of the c programs?
