Note: These speedups apparently require GHC 6.12, so the full speedup won't be seen on the Shootout until (or unless) they switch to GHC 6.12 at some point (possibly after the next Ubuntu stable release?). However, there was still about a 3 second speedup to hold us over until then ;).
Lately I've been sort of poking at the language shootout, to see what's there. The previous two posts have centered on this theme. However, unlike those posts, this one isn't just about showing that a Haskell program can be made faster, but rather how it can be made much faster.
Looking to reduce one of the greatest disparities, I turned to the fasta benchmark. In this case, the fastest submission is the C implementation, with a great time of 1.73 seconds. Avoiding success in the most complete way possible, the Haskell implementation scores in at about 20th place, roughly 9x slower than the fastest C. However, the entry doesn't have terrible company; at least it is bordered by a couple of (JIT-)compiled languages, F# and Go.
Ok, so what can be done?
I don't know. First we have to find out where the problem is! For this we can look to the GHC profiling output.
For this I just compile with the GHC options "-prof -auto-all", in addition to the other options, and then run the binary with "+RTS -p -xc -RTS". This produces a file called (for example) fastaold.prof.
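For concreteness, the commands look roughly like this (the file name, the -O2/--make flags, and the problem size are just placeholders for whatever you normally use, not necessarily what the shootout uses):

ghc --make -O2 -prof -auto-all fastaold.hs -o fastaold
./fastaold 25000000 +RTS -p -xc -RTS > /dev/null

The top of fastaold.prof then reports: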
COST CENTRE   MODULE   %time  %alloc
look          Main      38.7    71.7
choose        Main      26.6     0.0
rand          Main      14.6    20.7
and additionally:
                                             individual     inherited
COST CENTRE   MODULE   no.     entries     %time  %alloc   %time  %alloc
look          Main     282   403333334      38.7    71.7    79.8    92.4
rand          Main     285   200000000      14.6    20.7    14.6    20.7
choose        Main     283   623445679      26.6     0.0    26.6     0.0
The top output basically tells us the main offenders (by time of execution) in our program. The bottom is an extract from the more detailed per-call information. Clearly these functions are called a lot (look alone is entered over 400 million times). Therefore, it makes sense to try to optimize the lookup function for a symbol.
Okay, so clearly the most commonly used functions need optimization... (this is really revolutionary advice here...)
Briefly, these are the optimizations I have made. Not all of these may be really useful, but it is a true listing of what I did:
- Replaced the symbol/probability pair with its own datatype
- Ditto for probability/random-seed pairs
- Changed their representation from PDF to CDF, which saves some subtraction.
- Pushed look into the body of the fold function (this actually saves quite a bit of time, I guess it allows better specialization).
- Introduced a "cached" version of the list, which essentially pushes the first 4 entries up. I believe this gives more-or-less constant access time for those elements, without having to "uncons" everything as with a regular list (a rough sketch of these ideas follows below).
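To make the datatype/CDF/caching points a bit more concrete, here is a minimal sketch of those ideas. The names, probability values, and overall structure are made up for illustration; this is not the actual submission, which also has to thread the random seed through and write the output in bulk.

module Main (main) where

import Data.Word (Word8)

-- Strict symbol/cumulative-probability pair instead of a lazy tuple.
data Sym = Sym !Word8 !Double

-- "Cached" table: the first four entries are plain strict fields, so the
-- common lookups never have to uncons a list; the tail holds any leftovers.
data Table = Table !Sym !Sym !Sym !Sym [Sym]

-- Cumulative distribution (CDF): each entry carries the running total, so
-- the lookup is a plain comparison with no per-step subtraction.
exampleTable :: Table
exampleTable =
  Table (Sym 97 0.30)   -- 'a'
        (Sym 99 0.50)   -- 'c'
        (Sym 103 0.70)  -- 'g'
        (Sym 116 1.00)  -- 't'
        []

-- Pick the first symbol whose cumulative probability reaches the random value.
pick :: Double -> Table -> Word8
pick p (Table (Sym s1 c1) (Sym s2 c2) (Sym s3 c3) (Sym s4 c4) rest)
  | p <= c1   = s1
  | p <= c2   = s2
  | p <= c3   = s3
  | p <= c4   = s4
  | otherwise = go rest
  where
    go (Sym s c : xs)
      | p <= c    = s
      | otherwise = go xs
    go [] = s4  -- fall back to the last cached symbol on rounding error

main :: IO ()
main = print (pick 0.6 exampleTable)

Because the first four entries are strict fields rather than list cells, picking one of the common symbols is just a handful of comparisons, which is the effect the "cached" list was after.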
These seemed to pay off! I believe they don't even violate the spirit of the language shootout, as they are certainly a smaller modification than the one made in the top C version. That version essentially expands the CDF into a larger table, so that finding the correct symbol becomes basically a constant-time lookup; my version is still quite linear.
In the end I was able to make the following improvement, going from:
0m13.305s
to:
0m6.219s
This is a 2.14x improvement! The implementation is still much slower than the 1st place one. It's a little hard to say how this would affect the Shootout results themselves, given that I have different hardware and a different version of GHC (they have 6.10, I have 6.12). However, if one wants to perform the dubious step of applying my speedup to the current version listed on the Shootout page, it should move the Haskell entry from 20th place to 8th, and put it right between a C++ and a Java entry. That's pretty nice company to be around!
(Double check my work! Files hosted on GitHub)