Tuesday, 19 January 2010

Multi-core Laziness: Doing Less in Parallel

Preamble: (You may want to ignore this; it doesn't pertain directly to the topic of this post.) I just started this blog, and I'm very thankful for the constructive comments I've received both here and on reddit. I appreciate the discussions that have ensued over the issues of parallelizing the code that I've been examining.

However, I feel that I should reassure anyone who has been reading that, if you have also seen a blog post by Jon Harrop, you can be sure that it "really is this easy" to get the parallelism. Jon has expressed some discomfort that I chose the 4th version of Lennart's code over the 5th. He questions the veracity of my claims, accusing me of "cherry picking" results, goes on to try to discover exactly why I would have chosen the 4th version over the 5th, and eventually concludes that I specially designed my version to be more parallelizable. Let me help Jon, since he is having difficulty figuring out exactly why I chose the 4th version:

While the 5th version of Lennart's code is the fastest on 9 levels of spheres, it is over-optimized for this serial case:

  1. It is ugly (and I say that with no disrespect): it has unwrapped basically all of the vector operations that were previously function calls.
  2. It's not robust: it stack-overflows with the default settings on 12 levels of spheres (something Jon already knew about).
  3. It performs worse at everything except 9 levels of spheres. When the stack size is increased past its default to accommodate the program, it takes 50% longer than version 4 (which I based my version on) and uses 300% more memory. On 13 levels of spheres it eats all of my memory.
So, when I made the choice to select version 4 of Lennart's code, it wasn't to bias the results towards parallelism. It was because, serial or parallel, version 5 offers only a minor speed increase, and only at 9 levels. For the other levels tested it was worse in terms of both memory and speed, and if I had chosen it I wouldn't have been able to get any results at all for 13 levels of spheres. I hope Jon feels better now.


The Main Attraction

To be honest, this post won't focus too much on parallelism; the parallelism mostly provides some context in which to view laziness. In my previous posts, I took the 4th version of Lennart's ray tracer and performed some small optimizations on it. I don't really know anything about ray-tracing, so I'm glad some hints were given that it might be a great place to look at parallelism.

At higher levels of spheres, the ray-tracer benchmark essentially becomes a test of allocation and initialization speed, so doing less of both should increase performance. Of course, this turns out to be true. A language with lazy evaluation, such as Haskell, provides a nice opportunity for this optimization: it lets the programmer easily create software that is lazy by default and strict only where needed.
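
To make that concrete, here is a minimal sketch of the idea. The Scene type and create function below are illustrative stand-ins, not Lennart's actual definitions:

```haskell
-- A bounding-hierarchy scene: the radius field is strict because every
-- intersection test needs it, but the list of children stays lazy, so
-- sub-scenes are only allocated when a ray actually descends into them.
data Scene = Sphere !Double            -- leaf sphere with its radius
           | Group  !Double [Scene]    -- bounding radius, lazy children

-- Building a level-n scene allocates almost nothing up front; the tree
-- below the root is constructed on demand as rays traverse it.
create :: Int -> Scene
create 1 = Sphere 1
create n = Group 3 [ create (n - 1) | _ <- [1 .. 5 :: Int] ]
```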

As applied to the ray-tracer benchmark, this turns out to be a very valuable property! I recently took the most unoptimized version that Lennart produced and back-ported a few of the simpler optimizations from the later versions, along with some new ones, such as:

  • specializing two versions of intersect
  • removing the use of sqrt where possible
  • precomputing sqrt epsilon_float as a constant
I believe the first gives the largest gain, and it was already present in the other, more optimized versions. Together these give a nice increase over the original version and bring it within about 2.3x of the C++ version in single-core operation. The key is to leave the lazy evaluation alone and not force the entire scene to be constructed.
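
For a rough idea of what these changes look like in code, here is a sketch only; the tuple-based vector representation, the epsilon_float value, and the ray_sphere' name are assumptions, not the actual source:

```haskell
-- Precompute sqrt epsilon_float once as a top-level constant rather
-- than recomputing it in every intersection test.
delta :: Double
delta = sqrt epsilon_float
  where epsilon_float = 2.220446049250313e-16   -- machine epsilon for Double

-- A specialized intersection test for shadow rays: it only answers
-- "does the ray hit this sphere at all?", so it never builds the full
-- hit record (distance, normal) that the general intersect returns.
ray_sphere' :: (Double, Double, Double)  -- ray origin
            -> (Double, Double, Double)  -- ray direction (unit length)
            -> (Double, Double, Double)  -- sphere centre
            -> Double                    -- sphere radius
            -> Bool
ray_sphere' (ox, oy, oz) (dx, dy, dz) (cx, cy, cz) radius =
  let vx = cx - ox
      vy = cy - oy
      vz = cz - oz
      b    = vx * dx + vy * dy + vz * dz
      disc = b * b - (vx * vx + vy * vy + vz * vz) + radius * radius
  in  disc >= 0 && b + sqrt disc >= 0
```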

This can be seen in the timing results. (I've tried to overlay them on the previous results; the chart is a little crowded, so I removed the markers from the previous results.)



As you can see, the lazy version starts off the slowest. However, from the clumping of times in both the single and multi-core runs, it is clear that the lazy program scales much better as the number of levels of spheres increases. In fact, I could take the levels up seemingly arbitrarily high: 2000 or 3000 levels of spheres all completed in under 10 seconds. Of course, all else being equal, it probably wouldn't be too hard to get the C++ program to do this as well. It isn't my intent to convince anyone that these things are impossible in other languages; Haskell just happens to make laziness easier.

The scaling behaviour under multiple cores is also revealing:


We can see that, due to the reduced burden of creating and initializing the scene, the speedup due to parallelism stays more or less the same as the complexity of the scene increases. So not only does the serial scene construction have a nicer runtime complexity, that same serial code can be parallelized more effectively than the non-lazy version.

In conclusion, the lazy version of the program has several nice benefits:
  • It's relatively small (about 70 lines with my additions, and I like slightly more line-breaks than Lennart, so it could be even smaller).
  • It performs better as the levels increase.
  • The relative speedup vs. serial maintains its shape better as cores are added.
But laziness, in my opinion, also has some pitfalls:
  • It can be hard to switch to thinking about when something really needs to be evaluated, and to recognize when and where laziness is actually taking effect.
  • If you're not a little careful, you can end up lazily building terms that grow too large, which will degrade performance (see the sketch after this list).
  • Optimizing with respect to laziness can be a tricky balance.
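
The second pitfall is the classic space-leak pattern; here is a textbook illustration (not taken from the ray tracer itself):

```haskell
import Data.List (foldl')

-- Lazy left fold: builds a ten-million-deep chain of (+) thunks before
-- anything is evaluated, which degrades performance and can blow the stack.
sumLazy :: [Double] -> Double
sumLazy = foldl (+) 0

-- Strict left fold: forces the accumulator at every step, so the running
-- total is always a single evaluated Double.
sumStrict :: [Double] -> Double
sumStrict = foldl' (+) 0

main :: IO ()
main = print (sumStrict [1 .. 1e7])
```
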
However, in the end, I think that being aware of laziness is good food for thought. The more tools we have in the toolbox, the more often we find new and interesting ways to solve our problems efficiently. This will likely be my last post on this benchmark; it's been a fun little exercise. I hope someone else picks it up and finds another aspect to explore!

Again, the code for Lennart's 1st version, optimized, can be found on GitHub.

11 comments:

  1. Don't worry about Harrop, and keep up the interesting articles. There's a large community of Haskellers eagerly following your explorations!

    ReplyDelete
  2. Just a comment on the 5 versions of the program. I didn't choose those versions, I just rewrote the 5 OCaml versions. And if you look in my blog I say that I'm glad version 5 didn't show any speedup because it's very ugly.

    ReplyDelete
  3. @Saynte: The difference between these programs is not just laziness. The version you are calling "lazy" actually uses hard-coded symbolically-derived bounds whereas the "eager" versions compute the bounds. If the lazy version had to compute the bounds as well then it would end up forcing the evaluation of the entire scene tree and would have no benefit. In other words, there is also a difference between the algorithms.

    @Lennart: Your "ugly" version is faster for 9, 10 and 11 levels of spheres.

    ReplyDelete
  4. I think that Saynte's original post made it clear that he was more interested in parallelization than single-core performance.

    I don't see your point about trying to force the idea that parallelization in C++ is just as easy as in other languages... it's already hard enough in C# as it is.

    "Naive" has something to do with being easy.

    ReplyDelete
  5. Edit: I'd like to add, before I'm told that your point wasn't that C++ parallelization can be done...

    I rather mean that most C++ programmers won't go around parallelizing their stuff as a default behavior. They'll really try to optimize it like crazy to get reasonable performance, and if they still don't reach it, they might go through the trouble of parallelizing it.

    Then, someone will have to maintain that crap.

    Obviously, unless I'm mistaken, HLVM has more to offer with regard to parallelization... and that is two thumbs up!

    ReplyDelete
  6. dons: Thanks for the encouragement! I want to see if I can find another small but interesting project to do.

    augustss: Ahh, I see! Thanks for the clarification.

    ReplyDelete
  7. I followed your articles about parallel programs in Haskell, and tried to reproduce your automatic gains in my simple Mandelbrot calculations... Unfortunately, I can't see ANY speed difference - it appears that using parBuffer doesn't have any speed impact on my code, only one core is used... why?

    My code is on github (github dot com slash ttsiodras slash parallelMandelbrot)

    ReplyDelete
  8. ttsiodras: You are making 3 errors I think:

    1) You're a little confused in your usage of the runtime system (RTS) options. They are passed when you run the program, not to the compiler.

    2) You have a 2-core machine, so you are unlikely to get a speedup with GHC, because it tends to hate using all of the cores (not sure if this is a GHC thing or a Linux thing, as I only run GHC on Linux).

    3) Your algorithm is asking for a 'reverse' of the list of pixels. This at the very least requires the whole list structure to be constructed. You can rewrite your x,y point generation to run backwards, so you never have to worry about reverse; a rough sketch is below.
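
    To illustrate that third point, here is a rough sketch of the shape of the fix, assuming a reasonably recent Control.Parallel.Strategies API; the names and dimensions are placeholders, not your actual code:

    ```haskell
    import Control.Parallel.Strategies (withStrategy, parBuffer, rdeepseq)

    width, height :: Int
    width  = 1024
    height = 1024

    -- Stand-in for the per-pixel Mandelbrot computation.
    pixel :: Int -> Int -> Double
    pixel x y = fromIntegral (x * y)

    -- Generate the rows in the order the output wants them, counting y
    -- downwards, instead of generating them the other way and calling
    -- `reverse` (which forces the whole list spine before any pixel work).
    rows :: [[Double]]
    rows = [ [ pixel x y | x <- [0 .. width - 1] ]
           | y <- [height - 1, height - 2 .. 0] ]

    -- Spark the rows in parallel with a rolling buffer; rdeepseq makes
    -- each spark compute the whole row, not just its first cons cell.
    image :: [[Double]]
    image = withStrategy (parBuffer 32 rdeepseq) rows

    -- Remember: the RTS options go on the program, e.g.
    --   ./mandel +RTS -N2 -RTS
    main :: IO ()
    main = print (sum (map sum image))
    ```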

    I've made these changes to your code/Makefile, and now it gets some small parallelism gains, maxing out at about 2x over 5/6 cores. If there's a way I can push the changes back to you, just let me know.

    There may be something I'm missing; I didn't look too closely.

    ReplyDelete
  9. Thank you!

    Just "git diff -u origin/master > patch" and send me the patch, at ttsiodras at gmail dot com - I will push the changes.

    A pessimistic comment, though: the end result appears to be that (a) on dual-core machines Haskell offers no speedup from parallelism (as opposed to OpenMP/C, which almost doubles the frame rate), and (b) even on quad-core and up, the speedup is not linear, even for algorithms as amenable to parallelization as the Mandelbrot calculations...

    You can see the OpenMP (and CUDA!) versions of the code at ttsiodras dot googlepages dot com slash mandelSSE dot html

    ReplyDelete
  10. saynte: Ah, really! I'd noticed that my application runs much, much better at -N7 than -N8. We're also using Linux, and this has been the case with various 6.8 and 6.10 GHCs. I'd not really been concerned about bad performance on two-core machines because we were so bad on one core that I assumed we needed four or more.

    ttsiodras: An optimistic comment: I can't speak for two-core machines, but for us GHC's SMP support makes a huge difference. When running our trading program, we make money at -N7 and lose money if we forget to use that option. In that situation, the extra cost of buying 8- or 16-core servers instead of 2- or 4-core ones is small beans compared to the fact that performance actually does go up, linearly or not, with more cores.

    ReplyDelete
  11. Curt: It's great to hear that you're getting some benefit from Haskell's parallel capabilities. If you move up to 6.12.1, I think you'll be able to use ThreadScope to help guide your optimizations and see if you can get even more performance out of your code.

    I still haven't used it myself, but it looks interesting enough to give it a quick shot in the future.

    ReplyDelete