Friday 26 February 2010

Putting Haskell Between C++ and Java (Language Shootout Exercise)

Note: These speedups apparently require GHC 6.12, so the entire speedup won't be seen on the Shootout until/if they switch to GHC 6.12 at some point (possibly after the next Ubuntu stable release?). However, there was still about a 3 second speedup to hold us over until then ;).

Lately I've been sort of poking at the language shootout, to see what's there. The previous two posts have centered on this theme. However, unlike those posts, this one doesn't want to show a Haskell program could be made faster, bur rather how it can be made much faster.

Looking to reduce one of the greatest disparities, I looked to the fasta benchmark. In this case, the fastest submission is the C implementation, with a great time of 1.73 seconds. Avoiding success in the most complete way possible, the Haskell implementation scores in at about 20th place, being 9x slower than the fastest C. However, the entry doesn't have terrible company, at least it is bordered by a couple (JIT) compiled languages, F# and Go.

Ok, so what can be done?

I don't know. First we have to find out where the problem is! For this we can look to the ghc profiling output.
For this I just compile with the ghc options "-prof -auto-all", in addition to the other options. When we use this, and then additionally run the binary with "+RTS -p -xc -RTS" we will get a file called (for example) fastaold.prof. This will report, first:


look                           Main                  38.7   71.7
choose                         Main                  26.6    0.0
rand                           Main                  14.6   20.7

and additionally:

 look     Main       282   403333334  38.7   71.7    79.8   92.4
   rand   Main       285   200000000  14.6   20.7    14.6   20.7
   choose Main       283   623445679  26.6    0.0    26.6    0.0

The top output basically tells us the main offenders (by time of execution) in our program. The bottom is an extract from the more detailed information. clearly these functions are called a lot. Therefore, it makes sense to try to optimize the lookup function for a symbol.


Okay, so clearly the most commonly used functions need optimization... (this is really revolutionary advice here...)

Briefly, these are the optimizations I have made. Not all of these may be really useful, but it is a true listing of what I did:


  • Replace the pair of symbol/probability with its own datatype
  • Ditto for probability/random-seed pairs
  • Changed there representation from PDF to CDF, saves some subtraction.
  • Pushed look into the body of the fold function (this actually saves quite a bit of time, I guess it allows better specialization).
  • Introduced a "cached" version of the list, which essentially pushed the first 4 entries up. I believe this makes it have more-or-less constant access time for these elements, without having to "uncons" everything, as with a regular list.

These seemed to pay off! I believe they even don't violate the spirit of the language shootout, as they certainly do less of a modification than the top C version. The top C version essentially seems to expand the CDF into a larger space, so that it basically becomes a constant-time lookup for the correct symbol: my version is still quite linear.

In the end I was able to make the following improvement, going from:
0m13.305s

to:

0m6.219s

This is a 2.14x improvement! The implementation is still much slower than the 1st place one. Its a little hard to say how this would affect the Shootout results themselves, given I have different hardware, and a different version of GHC (they have 6.10, I have 6.12). However, if one wants to perform the dubious step of applying my speedup to the current version listed on the Shootout page, it should move the Haskell entry from 20th place to 8th, and put it right between a C++ and a Java entry. That's pretty nice company to be around!

(Double check my work! Files hosted on GitHub)

Worker-threads to Strategies: Performance Follow-up

When I was last examining the simplification of the mandelbrot-shootout entry I neglected one point: to confirm that the multithreaded performance of the program was maintained after I had made my modifications. First we can see the parallel performance under 4 cores for the original:
Note that this is about a 3.5x speedup.

Now in the simplified version, where the strategies from Data.Parallel.Strategies are used, we see:




Which is slightly slower than the old version (which is consistent with performance under 1 core), with the speedup now in the ballpark of about 3.3x

The absolute times are about 50ms vs 53ms, respectively.

Sunday 21 February 2010

From Pointers and Worker-threads to ByteStrings and Strategies

Just for fun the other day, I thought I'd take a look through the Language Shootout submissions, to see what highly optimized programs tended to look like. It's really quite interesting to see where the strengths of each language lay. Often when  you're not playing to the style that the language caters to, you can get some fairly ugly looking results. I saw one such result in the mandelbrot submission for Haskell.

The original submission (with full author attribution can be found here) was using the more traditional concurrency primitives offered. You can see it (without attribution and comments to make program size more evident in comparison) below:



I noticed a few things in the implementation:

  1. It used (as previously mentioned) the low-level concurrency primitives
  2. Ptr, poke, and pointer arithmetic were used to create bit arrays.
  3. The state that was unfolded through the computation was a big tuple of stuff.
I think that these tend to clutter the meaning of the program. I was curious if I could try to reduce some of their usages to make the program a bit more clear, while still maintaining the relatively high performance.


Low level concurrency
At the core of the algorithm, it just processes one line of the output at a time. Even better, each line of the output is independent from the others! This independence indicates that we can use the higher-level parallel strategies in Control.Parallel.Strategies to parallelize the task.

Pointer-level Computation
This one had a pretty clear solution. Since the arrays that were used in the old implementation contained simple bytes (Word8), it is fairly easy to transition them to ByteStrings instead. By doing this we get to keep the purity at the same time as getting efficient operations on the bytes.

Folded State
This was nothing too scary, just paring the state (symbolized by T) down to something more manageable. I turned it from a tuple of x-position, x-byte-position, y-position, y-imaginary-position to just x-position.

Finally, after making the above changes, the resulting code is more pure, and (I think) easier to read:


One of the best things is, the performance is left largely unchanged (generated using Criterion). The time to generate a 1000x1000 image being distributed as:

And my revised version of the code sees:




It is a little hard to see due to the closeness of the graphs, but the average times are 175.75ms for the old, and 176.74 ms for the new, although the confidence intervals at 95% do overlap. With less confidence we may be able to say that the old version is faster.

The code for this exercise can be seen on the lovely GitHub.