Just for fun the other day, I thought I'd take a look through the Language Shootout submissions to see what highly optimized programs tend to look like. It's really quite interesting to see where the strengths of each language lie. Often when you're not playing to the style a language caters to, you can get some fairly ugly-looking results. I saw one such result in the mandelbrot submission for Haskell.
The original submission (with full author attribution) can be found here; it used the more traditional low-level concurrency primitives on offer. You can see it below (stripped of attribution and comments, to make the program's size easier to compare):
I noticed a few things in the implementation:
- It used (as previously mentioned) the low-level concurrency primitives.
- Ptr, poke, and pointer arithmetic were used to create bit arrays.
- The state that was unfolded through the computation was a big tuple of stuff.
I think these tend to obscure the meaning of the program, so I was curious whether I could reduce their use and make the program a bit clearer while still maintaining its relatively high performance.
Low-level Concurrency
At its core, the algorithm just processes one line of the output at a time. Even better, each line of the output is independent of the others! This independence means we can use the higher-level parallel strategies in Control.Parallel.Strategies to parallelize the task.
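As a rough sketch of the idea (my own code, not the shootout entry; `computeLine` is a hypothetical stand-in for the real per-row escape-time computation):

```haskell
import Control.Parallel.Strategies (parMap, rseq)
import qualified Data.ByteString as B

-- Render each scanline independently, in parallel.
-- rseq is enough of a strategy here: a strict ByteString is already
-- fully materialized once it is in weak head normal form.
render :: Int -> B.ByteString
render size = B.concat (parMap rseq computeLine [0 .. size - 1])
  where
    -- Hypothetical stand-in for the real per-row computation.
    computeLine :: Int -> B.ByteString
    computeLine y =
      B.pack [fromIntegral ((x * y) `mod` 256) | x <- [0 .. size - 1]]
```

Compiled with -threaded and run with +RTS -N, the runtime spreads the rows across the available cores with no explicit forkIO or MVar plumbing.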
Pointer-level Computation
This one had a pretty clear solution. Since the arrays in the old implementation contained simple bytes (Word8), it is fairly easy to switch them to ByteStrings instead. This way we keep purity while still getting efficient operations on the bytes.
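To illustrate (again my sketch, not the submission's code), the pointer pokes can become a pure bit-packing function over Data.ByteString; this version assumes the row width is a multiple of eight, whereas a full implementation would need to pad the final byte of each row when it is not:

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8)

-- Pack a row of pixels (True = inside the set) into bytes, eight
-- pixels per byte, most significant bit first, as the PBM output
-- format expects. Assumes the row width is a multiple of 8.
packBits :: [Bool] -> B.ByteString
packBits = B.pack . go
  where
    go [] = []
    go ps = let (byte, rest) = splitAt 8 ps
            in foldl step 0 byte : go rest
    step :: Word8 -> Bool -> Word8
    step acc p = (acc `shiftL` 1) .|. (if p then 1 else 0)
```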
Folded State
Nothing too scary here: I just pared the state (represented by the type T) down to something more manageable, turning it from a tuple of x-position, x-byte-position, y-position, and y-imaginary-position into just the x-position.
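Concretely, each line can then be unfolded with something along these lines (a sketch; `byteAt` is a hypothetical placeholder for computing the bits of eight adjacent pixels):

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Unfold one row of w pixels (w `div` 8 bytes), threading only the
-- current byte index through the unfold instead of a large tuple.
line :: Int -> B.ByteString
line w = fst (B.unfoldrN (w `div` 8) step 0)
  where
    step :: Int -> Maybe (Word8, Int)
    step x = Just (byteAt x, x + 1)
    -- Hypothetical placeholder for the escape-time bits of 8 pixels.
    byteAt :: Int -> Word8
    byteAt = fromIntegral
```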
Finally, after making the above changes, the resulting code is more pure, and (I think) easier to read:
Best of all, the performance is left largely unchanged (measured using Criterion). The time to generate a 1000x1000 image with the original version is distributed as:
And my revised version of the code sees:
It is a little hard to see because the graphs are so close, but the average times are 175.75 ms for the old version and 176.74 ms for the new, and the 95% confidence intervals overlap. At a lower confidence level we might be able to say that the old version is slightly faster.
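For anyone who wants to reproduce that kind of measurement, a Criterion harness along these lines should do; this is a sketch, with a trivial stand-in for `render` so it compiles on its own:

```haskell
import Criterion.Main (defaultMain, bench, nf)
import qualified Data.ByteString as B

-- Trivial stand-in so the harness is self-contained; substitute the
-- real renderer here.
render :: Int -> B.ByteString
render size = B.replicate (size * size `div` 8) 0

main :: IO ()
main = defaultMain
  [ bench "mandelbrot 1000x1000" (nf render 1000) ]
```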
The code for this exercise can be seen on the lovely GitHub.
Very nice!
Just a tiny style suggestion: I wondered why you still needed to import Foreign. It turns out it's just for the Word8 type. You could import that from Data.Word instead, to make it more obvious that nothing funny is going on.
Here's another: the code is lovely, but it's hard to follow without the signatures for the functions.
Reid: Thanks for pointing that out. I guess my rough scheme was to leave the imports unchanged unless they broke something, but you're absolutely right.
scvalex: Yes, normally I would document everything to the nines and include type annotations for extra clarity. Here, though, I am trying to make the size differences evident. Perhaps writing both programs in a literate style (maybe also using lhs2TeX) and seeing which is clearer and more succinct would be interesting as well (just not for me right now :)).