How I got here is quite a long story. The short version: I'm benchmarking trivial implementations of the Atkin sieve across different programming languages/runtimes. Why "trivial implementations"? Because:
- I did not want to spend too much time writing clever ones in each language
- otherwise half the benchmark would have been about my specific optimization skills (knowledge of platform internals, etc.) in each language
- I'm lazy
So... we try not to put too much importance on unimportant things and draw some interesting conclusions. First, as a baseline I use a very fast C implementation of the Atkin sieve, which I found here. This is the bottom line: a cleverly implemented version in a fast language. I do not expect anything else to get close.
It is almost unbeatable:
% time primes 1 10000000 > /dev/null
primes 1 10000000 > /dev/null 0.07s user 0.00s system 97% cpu 0.070 total
Note that I had to suppress the output, because this implementation not only computes the primes but also prints them.
Quite interestingly, an unoptimized Java implementation comes in just above 100 milliseconds. Granted, I did not write every number to standard output, and C is still faster. The point is that the Java code took about ten minutes to write, yet it is being compared against a carefully crafted implementation in a language considered faster. That is a "good enough" result for Java (and it shows some excellent compiler work behind it). Both Java and C had to find the primes below 10000000, and this is also what I asked of the other languages.
This is a kludge I used to test both regular Python (using numpy, which should provide a reasonably fast buffer to back the sieve) and PyPy, which should JIT integer computations to death.
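The code is roughly along these lines: a trivial, element-wise Sieve of Atkin (a sketch of the approach, not the exact version I benchmarked; it assumes limit >= 5):

```python
def atkin(limit):
    # Trivial, element-wise Sieve of Atkin (assumes limit >= 5).
    # Candidates are toggled by the three quadratic forms, then
    # multiples of prime squares are eliminated.
    sieve = [False] * (limit + 1)
    sqrt_limit = int(limit ** 0.5) + 1
    for x in range(1, sqrt_limit):
        for y in range(1, sqrt_limit):
            n = 4 * x * x + y * y
            if n <= limit and n % 12 in (1, 5):
                sieve[n] = not sieve[n]
            n = 3 * x * x + y * y
            if n <= limit and n % 12 == 7:
                sieve[n] = not sieve[n]
            n = 3 * x * x - y * y
            if x > y and n <= limit and n % 12 == 11:
                sieve[n] = not sieve[n]
    # Remove multiples of squares of the remaining candidates.
    for k in range(5, sqrt_limit):
        if sieve[k]:
            for m in range(k * k, limit + 1, k * k):
                sieve[m] = False
    return [2, 3] + [n for n in range(5, limit + 1) if sieve[n]]
```

With numpy the only change is backing the sieve with an array instead of a list; the loops stay element-wise, which is exactly the problem discussed below.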
I found it quite interesting that using or not using numpy does not change things much. Regular Python took 6.16s without numpy and 6.12s with it. This makes sense: after all, numpy is fast when used matrix-wise, while here I'm doing things element-wise. Perhaps I should devise a pure matrix-manipulation implementation and benchmark that instead.
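A matrix-wise version could generate all (x, y) pairs at once with broadcasting and take the parity of occurrence counts instead of toggling element by element. A rough sketch of what I have in mind (not benchmarked; the name atkin_numpy is just for illustration):

```python
import numpy as np

def atkin_numpy(limit):
    # Vectorized Sieve of Atkin sketch (assumes limit >= 5).
    s = int(limit ** 0.5) + 1
    x = np.arange(1, s)
    y = np.arange(1, s)
    x2 = (x * x)[:, None]          # column vector of x^2
    y2 = (y * y)[None, :]          # row vector of y^2

    # The three quadratic forms over the whole (x, y) grid at once.
    n1 = (4 * x2 + y2).ravel()
    n1 = n1[(n1 <= limit) & np.isin(n1 % 12, (1, 5))]

    n2 = (3 * x2 + y2).ravel()
    n2 = n2[(n2 <= limit) & (n2 % 12 == 7)]

    n3 = (3 * x2 - y2)[x[:, None] > y[None, :]]  # keep x > y (positive values)
    n3 = n3[(n3 <= limit) & (n3 % 12 == 11)]

    # Toggling an entry an odd number of times == odd occurrence count.
    counts = np.bincount(np.concatenate([n1, n2, n3]), minlength=limit + 1)
    sieve = (counts % 2).astype(bool)

    # Eliminate multiples of prime squares with slice assignment.
    for k in range(5, s):
        if sieve[k]:
            sieve[k * k :: k * k] = False

    return np.concatenate(([2, 3], np.flatnonzero(sieve)))
```

The inner loops disappear almost entirely, which is where numpy should actually start paying off.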
PyPy is surprisingly good: it takes 1.20 seconds to run the computation on the same 10000000 numbers.
After testing Java, the natural candidate is Clojure. How much inefficiency does the Clojure runtime add? The code I wrote is deliberately unidiomatic Clojure, more a direct translation from the Java version, precisely because the focus is now on the overhead. Apparently there are some big problems.
The first version I wrote did not have the int coercions the present version has, and took about 2.8 seconds. Replacing loops with doseqs made it slower (to about 3.x seconds), so I went back to the explicit loop version. After that, adding and removing coercions made the times vary: some coercions slowed things down, some sped them up. Another funny (in fact, logical) thing is that a given coercion does not always speed things up: for example, the coercion on n1 was an improvement only after I also coerced end. And for no reason I could fathom, coercing n2 and n3 made things worse.
In fact, the presented version takes 1.5-1.6 seconds. However, adding the coercion on n2 makes the whole thing jump to 2.7 seconds. While for the other coercions I understood why things actually got worse, in this case I just don't know what to think.
Moreover, uncommenting the count-primes function call makes the timing for atkin jump to 2.2 seconds. I plainly admit my inability to optimize Clojure code: the logic behind its performance just evades me.