Friday, 15 January 2010

Naïve Parallelization: C++ (OpenMP) and Haskell

Update: floodyberry from reddit has suggested a great improvement to the C++ version. This lets the speedup trend increase much more nicely in the case of 8 levels of spheres. He also had a good suggestion that the lack of speedup in the higher-level cases is due to it essentially becoming an allocation benchmark. The graphs have been updated accordingly.


Background
In my previous post I compared a few ray tracing implementations, and there were some questions on reddit about why the parallel Haskell version didn't scale beyond about a 3x improvement, even with 8 logical (4 physical) cores, on such an innately parallel task.

I couldn't answer that question when it was asked, and I can't answer it now. Sorry ;)

However, it did make me curious about how one could also parallelize the C++ version, to start examining whether this was a problem with the Haskell implementation or a deeper issue. Now, I am definitely not a master of concurrency and parallel programming, so I picked the simplest method of parallelizing C++ programs that I know: OpenMP. It's nice because it works through #pragma declarations and provides some primitive operations to help with synchronization. Additionally, it contains one pragma that I thought would be perfect for this sort of task: #pragma omp parallel for, which takes the for loop that follows and parallelizes its iterations. This is roughly like the parBuffer function that I employed when I did the same to the Haskell version of the code.
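As a quick illustration (a throwaway example of mine, not code from the ray tracer), the directive below splits an ordinary loop's iterations across threads, and the reduction clause is one of the synchronization helpers OpenMP provides; it should compile with g++ -fopenmp.

    #include <cstdio>

    int main() {
        const int n = 1000000;
        double sum = 0.0;

        // "parallel for" divides the iterations among the available threads;
        // the reduction clause safely combines each thread's partial sum.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            sum += i * 0.5;
        }

        std::printf("sum = %f\n", sum);
        return 0;
    }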


To be honest, I don't know whether this is the best parallelization of the loop, or whether the C++ program could be rewritten in such a way as to further increase parallel performance. However, that isn't really the question I'm trying to answer; the question is: how do naïve parallelizations in these two languages compare?
I focus on these simple modifications because, in the context of software engineering, we're often trying to balance correctness, execution speed, and cost (time to implement). So then, given two implementations of the ray-tracing task, how do they compare?

Parallelizing
The transformation from serial to parallel in the Haskell version can be achieved by simply taking the list of intensities, let us call this list picture', and applying parBuffer ni rwhnf to it. This results in a new list of intensities (identical to the original, just evaluated in parallel). Due to the pure functional nature of Haskell, we know that this operation will preserve the result of the original serial computation. If it was correct before, it is correct now: great! (Likewise, if it was incorrect before, it will be incorrect now too ;)).
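Here is a minimal sketch of that transformation; the per-pixel intensity function and the image size are placeholders of mine, not the post's tracer, and only the parBuffer step mirrors the text. It is written against the current Control.Parallel.Strategies API (using with rseq); with the GHC 6.12-era API the post relies on, the last binding would instead be the direct application parBuffer ni rwhnf picture'. Build with -threaded and run with +RTS -N to use multiple cores.

    import Control.Parallel.Strategies (parBuffer, rseq, using)

    -- Placeholder for the per-pixel work done by the ray tracer.
    intensity :: Int -> Int -> Double
    intensity x y = fromIntegral (x * y) / 65536

    main :: IO ()
    main = do
      let ni       = 256                                    -- spark buffer size
          picture' = [ intensity x y | y <- [0 .. 255], x <- [0 .. 255] ]
          -- The entire parallelization: the same list, evaluated ni elements
          -- ahead of the consumer, in parallel.
          picture  = picture' `using` parBuffer ni rseq
      mapM_ print picture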

The same transformation in C++ is actually fairly similar with OpenMP, which is also great! The typing overhead is not very high, but it is a bit more than in the Haskell version. As stated before, basically adding #pragma omp parallel for before the main for-loop of the program solves the parallelism aspect. However, and this is the important part, because of the use of a shared resource (cout) the body of the loop had to be modified, due to the race condition that was introduced. The race condition I mention is the use of cout to immediately output the results, which is of course sensitive to the order in which the intensities are printed. To solve this I introduced an area of memory into which each iteration wrote its result at the correct location, and another nested loop later to print these values to cout.
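To make the shape of that change concrete, here is a minimal sketch assuming a stand-in per-pixel function rather than the post's actual tracer: the parallel loop writes only into its own slot of a pre-allocated buffer, and a separate serial pass prints the buffer to cout in the original order. Compile with g++ -fopenmp.

    #include <cmath>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-in for tracing a single pixel.
    static double trace_pixel(int x, int y) {
        return std::sqrt(static_cast<double>(x * x + y * y));
    }

    int main() {
        const int n = 256;
        std::vector<double> buffer(n * n);  // each iteration writes only its own slot

        // Parallel section: no shared cout inside the loop, so no race condition.
        #pragma omp parallel for
        for (int y = 0; y < n; ++y) {
            for (int x = 0; x < n; ++x) {
                buffer[y * n + x] = trace_pixel(x, y);
            }
        }

        // Serial output pass restores the original ordering of the results.
        for (int y = 0; y < n; ++y) {
            for (int x = 0; x < n; ++x) {
                std::cout << buffer[y * n + x] << '\n';
            }
        }
        return 0;
    }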


I'm not a great programmer, I am well aware of that. That didn't stop me, though, from loving the fact that I could immediately transform a serial pure-functional solution into a parallel pure-functional solution and still get the same correct answer that I had in the serial version. In C++, however, the race condition was unavoidable: I had to design around it, modify the behaviour of the original loop, and break the output section into another couple of loops.


I am not in the least trying to imply that everyone would have used the same parallelization method that I did, and I would wager there are much better ways to do it, both in terms of speed of execution and safety of implementation. I just really believe that combining a pure-functional coding style with parallel primitives, like those offered in Haskell, allows one to produce code in a way that is less subject to errors, and still get some "free" parallelism in the end, too!


Performance
The second part of this post covers the performance characteristics. The benchmarks were run on a 32-bit Arch Linux installation on an Intel i7 920, with around 3GB of RAM. Note that the i7 920 doesn't have 8 physical cores; it has 4 physical cores with hyper-threading on each, allowing my Linux installation to see 8 logical cores. This means that I'm not as confident in the results above 4 cores. However, there is a continued increase in performance above 4 cores, so I have left them in for the curious.

Update: GHC 6.12.1 and GCC 4.4.2 were the compilers used, with a 2.6.32 Linux kernel.






There are two interesting things about these results, I believe:
  1. In every case the relative speedup experienced by the Haskell version is greater than that of the C++ version. One may argue that this is because Haskell performs worse in the single-core version and thus has more "room" to grow.
  2. The absolute performances get much closer as more cores are added. I'm not precisely sure how to interpret that: either both implementations are hitting a lower-level barrier (cache contention?) or the Haskell runtime has some sort of magic. I'm torn between the two possibilities!
Additionally, some Haskell runs take a dip at 8 cores; could this perhaps be due to the GC robbing one of the cores of some CPU time?

The code for these is available on GitHub, since I really hated formatting the Haskell in the previous post. Please feel free to modify the C++ code (originally adapted from here), and improve its parallel performance.
I don't want to give the impression that I'm saying that C++ isn't able to do better, or that this parallelization is the definitive one. I'm sure there's something obvious that I've missed in this task! However, if you can, try to make as few modifications as possible and show off how easy it was to parallelize.

5 comments:

  1. GHC expects that the cores it is given are really, totally available for its exclusive use, and it can show a pretty serious degradation in performance if this isn't the case; so you should leave a core out for the rest of the system in order to avoid contention.

    ReplyDelete
  2. This is actually a known GHC bug that manifests on Linux; it is called the 'last core parallel slowdown' bug, and it occurs on Linux systems when you attempt to use all the available cores, because the GC uses spinlocks. When the native thread the GC is running on gets descheduled, the GC suffers badly. On OS X and Solaris, however, you can still get better performance by using all your cores; on Linux the effects are quite bad.

    A corollary of this is that on my Debian 5.0 dual-core Core 2 laptop, running *any* parallel Haskell program with '-N2' will actually make it run slower, not faster. I hope it gets fixed soon. :(

    The bug is here on the GHC wiki - http://hackage.haskell.org/trac/ghc/ticket/3553

    ReplyDelete
  3. @Quentin: Yes, I had some idea of that before, but hadn't seen the behaviour on 6.12.1 yet. It's unfortunate because, as Austin points out above, it makes (very common) dual-core systems have difficulty benefiting from parallel programs.

    @Austin: Thanks for the bug ticket, I didn't know that it was filed!

    ReplyDelete
  4. In my one-project academic experience with OpenMP, I did come to certain conclusions. One was that cache contention really is a great limitation. My trick is that you should have more parallel threads running, not just 8. This is not very intuitive, but somehow it worked (a lot) better for me than just sticking with the available pragma directives for load balancing.

    ReplyDelete
  5. Yeah, I agree with Pedro.

    Of course I am not an expert in the subject; I am still trying some interesting things in OpenMP. Meanwhile I came across the following blog:

    http://openmp.blogspot.com/2011/01/home.html

    I recommend going through it; it would be helpful for newbies who might have the big question "WHY OPENMP?"

    ReplyDelete