However, I feel that should reassure anyone who has been reading that, if you have also seen a blog post by Jon Harrop, you can be sure that it "really is this easy" to get the parallelism. There has been some discomfort felt by Jon that I chose the 4th version of Lennart's code over the 5th version. He questions the veracity of my claims, accusing me of "cherry picking" results, and goes on to try to discover exactly why I would have chosen the 4th version over the 5th. He eventually concludes that I specially designed my version to be more parallelizable. Let me help Jon, since he is having difficulty figuring out exactly why I chose the 4th version;
While the 5th version of Lennart's code is the fastest on 9 levels of spheres, it is over-optimized for this serial case:
- It is ugly (and I say that with no disrespect), it has unwrapped basically all of the vector operations that were previously function calls.
- It's not robust. It stack-overflows with the default settings on 12 levels of spheres, but actually Jon knew about that already.
- It performs worse at anything except 9 levels of spheres. When the stack-size is increased past its default to accommodate the program, it takes 50% longer than version 4 (which I based my version on), and uses 300% more memory than version 4. On 13 levels of spheres it eats all of my memory.
The Main Attraction
To be honest, this post won't focus too much on parallelism. However, the parallelism does provide some context to view laziness in. Now, in the previous posts I had made, I took the 4th version of Lennart's ray tracer and performed some small optimizations on them. Now, I don't know anything really about ray-tracing, so I'm glad that there were some hints given that indicated that it may be a great place to look at parallelism.
Combining the fact that at higher levels of spheres, the ray-tracer benchmark essentially becomes a test of allocation and initialization speed, it seems that doing both of these less would likely increase performance. Of course, this turns out to be true. Now a language which has lazy evaluation, such as Haskell, provides a nice opportunity for this optimization to be done. Also, since the language has support for lazy evaluation, it lets the programmer easily create software lazily-by-default, and strict-when-needed.
As applied to the ray-tracer benchmark, this turns out to be a very valuable property! I recently took the most unoptimized version that Lennart produced, and back-ported a few of the simpler optimizations that were done in the later versions, and some new ones, such as:
- specializing two versions of intersect
- removing the use of sqrt when able
- using sqrt epsilon_float as a constant
I believe the first one gets the largest gain, and was already present in the other more optimized versions. These seem to give a nice increase over the original version, and gets it within about 2.3x of the C++ version in single-core operation. The key here is to leave the lazy-evaluation alone, and to not force the entire scene to be constructed.
This can be seen in the timing results (I've tried to overlay them over the previous results, the chart is a little crowded, so I took the markers off the previous results)
As you can see, the Lazy version starts off the slowest. However, from the clumping of times in both the single and multi-core runs, it is clear that the lazy program scales much better when increasing the levels of spheres that were processed. In fact, I could take the levels up to seemingly arbitrary levels. 2000, 3000 levels of spheres, all completed in under 10 seconds. Now of course, everything being equal, it probably wouldn't be too hard to get the C++ program to do this as well. It isn't my intent to get anyone to believe that these things are impossible in other languages; Haskell just happens to make laziness easier.
The scaling behaviour under multiple cores is also revealing,
We can see that due to the reduced burden of creating and initializing the scene, the speedup due to parallelism is more or less the same as the complexity of the scene increases. So not only does the serial speed of the scene construction have a nicer runtime complexity, that same serial code can be parallelized more effectively than the non-lazy version.
In conclusion, the lazy version of the program has several nice benefits:
- It's relatively small (about 70 lines with my additions, and I like slightly more line-breaks than Lennart, so it could be even smaller).
- It performs better as the levels increase.
- The relative speedup vs. serial maintains its shape better as cores are added
But laziness, in my opinion, also has some pitfalls,
- It can be hard to switch to the idea of thinking when something really needs to be evaluated, to realize when and where laziness is really being seen.
- If you're not a little careful you can end up lazily creating some terms that are too large, and will degrade performance.
- Optimizing with respect to laziness can be a tricky balance.
However, in the end, I think that being aware of laziness is good food-for-thought. The more tools we have in the toolbox, often we find new and interesting ways to solve our problems in an efficient way. This will likely be my last post on this benchmark, it's been a fun little exercise. I hope someone else picks it up and finds another aspect to explore!
Again, the code for Lennart's 1st version, optimized, can be found on GitHub.