Tuesday 19 January 2010

Multi-core Laziness: Doing Less in Parallel

Preamble: (You may want to ignore this, it doesn't pertain directly to the topic of this post) I just started this blog, and I'm very thankful for the constructive comments I've received both here, and on reddit. I appreciate the discussions that have ensued over the issues of parallelizing the code that I've been examining.

However, I feel that  should reassure anyone who has been reading that, if you have also seen a blog post by Jon Harrop, you can be sure that it "really is this easy" to get the parallelism. There has been some discomfort felt by Jon that I chose the 4th version of Lennart's code over the 5th version. He questions the veracity of my claims, accusing me of "cherry picking" results, and goes on to try to discover exactly why I would have chosen the 4th version over the 5th. He eventually concludes that I specially designed my version to be more parallelizable. Let me help Jon, since he is having difficulty figuring out exactly why I chose the 4th version;

While the 5th version of Lennart's code is the fastest on 9 levels of spheres, it is over-optimized for this serial case:

  1. It is ugly (and I say that with no disrespect), it has unwrapped basically all of the vector operations that were previously function calls. 
  2. It's not robust. It stack-overflows with the default settings on 12 levels of spheres, but actually Jon knew about that already
  3. It performs worse at anything except 9 levels of spheres. When the stack-size is increased past its default to accommodate the program, it takes 50% longer than version 4 (which I based my version on), and uses 300% more memory than version 4. On 13 levels of spheres it eats all of my memory. 
So, when I made the choice to select version 4 of Lennart's code, it wasn't to bias the results towards parallelism.  It was because, either serial or parallel, version 5 offers only minor speed increase and only at level 9. For the other levels tested it was worse in terms of both memory and speed. If I had chosen this version I wouldn't have even been able to get any results for 13 levels of spheres. I hope Jon feels better now.


The Main Attraction

To be honest, this post won't focus too much on parallelism. However, the parallelism does provide some context to view laziness in. Now, in the previous posts I had made, I took the 4th version of Lennart's ray tracer and performed some small optimizations on them. Now, I don't know anything really about ray-tracing, so I'm glad that there were some hints given that indicated that it may be a great place to look at parallelism.

Combining the fact that at higher levels of spheres, the ray-tracer benchmark essentially becomes a test of allocation and initialization speed, it seems that doing both of these less would likely increase performance. Of course, this turns out to be true. Now a language which has lazy evaluation, such as Haskell, provides a nice opportunity for this optimization to be done. Also, since the language has support for lazy evaluation, it lets the programmer easily create software lazily-by-default, and strict-when-needed.

As applied to the ray-tracer benchmark, this turns out to be a very valuable property! I recently took the most unoptimized version that Lennart produced, and back-ported a few of the simpler optimizations that were done in the later versions, and some new ones, such as:

  • specializing two versions of intersect
  • removing the use of sqrt when able
  • using sqrt epsilon_float as a constant
I believe the first one gets the largest gain, and was already present in the other more optimized versions. These seem to give a nice increase over the original version, and gets it within about 2.3x of the C++ version in single-core operation. The key here is to leave the lazy-evaluation alone, and to not force the entire scene to be constructed.

This can be seen in the timing results (I've tried to overlay them over the previous results, the chart is a little crowded, so I took the markers off the previous results)



As you can see, the Lazy version starts off the slowest. However, from the clumping of times in both the single and multi-core runs, it is clear that the lazy program scales much better when increasing the levels of spheres that were processed. In fact, I could take the levels up to seemingly arbitrary levels. 2000, 3000 levels of spheres, all completed in under 10 seconds. Now of course, everything being equal, it probably wouldn't be too hard to get the C++ program to do this as well. It isn't my intent to get anyone to believe that these things are impossible in other languages; Haskell just happens to make laziness easier.

The scaling behaviour under multiple cores is also revealing,


We can see that due to the reduced burden of creating and initializing the scene, the speedup due to parallelism is more or less the same as the complexity of the scene increases. So not only does the serial speed of the scene construction have a nicer runtime complexity, that same serial code can be parallelized more effectively than the non-lazy version.

In conclusion, the lazy version of the program has several nice benefits:
  • It's relatively small (about 70 lines with my additions, and I like slightly more line-breaks than Lennart, so it could be even smaller).
  • It performs better as the levels increase.
  • The relative speedup vs. serial maintains its shape better as cores are added
But laziness, in my opinion, also has some pitfalls,
  • It can be hard to switch to the idea of thinking when something really needs to be evaluated, to realize when and where laziness is really being seen.
  • If you're not a little careful you can end up lazily creating some terms that are too large, and will degrade performance.
  • Optimizing with respect to laziness can be a  tricky balance.
However, in the end, I think that being aware of laziness is good food-for-thought. The more tools we have in the toolbox, often we find new and interesting ways to solve our problems in an efficient way. This will likely be my last post on this benchmark, it's been a fun little exercise. I hope someone else picks it up and finds another aspect to explore!

Again, the code for Lennart's 1st version, optimized, can be found on GitHub.

Friday 15 January 2010

Naïve Parallelization: C++ (OpenMP) and Haskell

Update: floodyberry from reddit has suggested a great improvement to the C++ version. This lets the speedup trend increase much more nicely in the case of 8 levels of spheres. He also had a good suggestion that the lack-of speedup in the higher-level cases is due to it essentially becoming an allocation benchmark. The graphs have been updated accordingly.


Background
During my previous post I compared a few ray tracing implementations, and there were some questions on reddit about why the parallel Haskell version didn't scale beyond about 3x improvement, even up to 8 logical (4 physical) cores, on such an innately parallel task.

I couldn't answer that question when it was asked: and I can't answer it now. Sorry ;)

However, it did make me curious about how one could also parallelize the C++ version to start examining this problem to see if it was a problem with the Haskell implementation, or if it is a deeper issue. Now I am definitely not a master of concurrency and parallel programming, so I thought I would pick the simplest method of parallelizing C++ programs that I know: OpenMP. It's nice because it works through #pragma declarations and provides some primitive operations to help with synchronization. Additionally, it contains one pragma that I thought would be perfect for this sort of task: #pragma omp parallel for, which will take the for block that follows and will parallelize its operation. This is roughly like the parBuffer function that I employed when I did the same to the Haskell version of the code.


To be honest, I don't know if this is the best parallelization of the loop, or perhaps the C++ program can be rewritten in such a way to further increase parallel performance. However, this isn't really the question I'm trying to answer which is: how do naïve parallelizations in these two languages compare?
I focus on these simple modifications because in the context of software engineering we're often trying to balance correctness, execution speed, and cost (time to implement). So then, given two implementations of the ray-tracing task, how do these compare?

Parallelizing
The transformation from serial to parallel in the Haskell version can be had by simply taking the list of intensities, let us call this list picture', and applying parBuffer ni rwhnf to it. This results in a new list of intensities (identical to the original, just evaluated in parallel). Due to the pure functional nature of Haskell, we know that this operation will preserve the result of the original serial computation. If it was correct before, it is correct now: great! (Likewise, if it was incorrect before, it will be incorrect now too ;)).

The same transformation in C++ is actually fairly similar with OpenMP, which is also great! The typing overhead is not very high, but it is a bit more than the Haskell version. As stated before, basically appending
#pragma omp parallel for before the main for-loop of the program solves the parallelism aspect. However, and this it the important part, because of the use of shared resources (cout) the body of the loop had to be modified, due to the race-condition that was introduced. The race-condition I mention is the use of cout to immediately output the results, this is of course sensitive to the order in which the intensities are printed. To solve this I introduced an area of memory which each iteration wrote their result into the correct location, and another nested loop later to print these things again to cout


I'm not a great programmer, I am well aware of that. That didn't stop me, though, from loving the fact that I could immediately transform a serial pure-functional solution in a parallel pure-functional solution and still get the same correct answer that I had in the serial version. In C++ however, I saw the race-condition inevitability and had to design around it, modify the behaviour of the original loop, and break the output section into another couple loops.


I am not in the least trying to infer that everyone would have used the same parallelization method that I would, and I would wager there are much  better ways to do it, both in terms of speed of execution and safety of implementation. I just really believe that the combination of using a pure-functional coding style in combination with parallel primitives, like those offered in Haskell, allows one to produce code in a way that is less subject to errors, and still get some "free" parallelism in the end, too!


Performance
The second part of this post is the performance characteristics. This was run on a 32-bit ArchLinux installation running on an Intel i7 920, with around 3GB of RAM. Note that the i7 920 doesn't have 8 physical cores, it has 4 physical cores with hyper-threading on each, allowing my Linux installation to see 8 logical cores. This means that I'm not as confident in the results above 4 cores. However, there is a continued increase in performance above 4 cores, so I have left them in for the curious.

Update: GHC 6.12.1 and GCC 4.4.2 were the compilers used, with a 2.6.32 Linux kernel.






There are two interesting things about these results, I believe:
  1. In every case the relative speedup experienced by the Haskell version is greater than that of the C++ version. One may argue that this is because Haskell performs worse in the single-core version and thus has more "room" to grow into.
  2. The absolute performances get much closer as more cores are added. I'm not precisely sure how to interpret that, I think either both implementations are hitting a lower-level barrier (cache contention?) or the Haskell runtime has some sort of magic. I'm torn between the two possibilities!
Additionally, some Haskell runs take a dip on 8-cores, could this perhaps be due to the GC robbing one of the cores of some CPU time?

The code for these is available on GitHub, since I really hated formatting the Haskell in the previous post. Please feel free to modify the C++ code (originally adapted from here), and improve it's parallel performance.
I don't want to give the impression that I'm saying that C++ isn't able to do better, or that this parallelization is the definitive one. I'm sure there's something obvious that I've missed in this task! However, if you can, try to make as few modifications as possible and show-off how easy it was to parallelize.

Tuesday 12 January 2010

Haskell, Ray Tracing, Parallel Computation

Note: The initial post I made had an error in one of the functions which produced incorrect output. This has been fixed and the performance numbers have been updated. They didn't shift too far, but there has been a slight performance degradation in the tests that were run.

Since seeing a blog post made about the efficiency of ray-tracer implementations in several languages, I wanted to see if the placing of the Haskell entry could be improved. In the post, the Haskell implementation comes out 3.2 slower than the C++ version. I thought this may be a little too large, so I tried to tweak down the Haskell version to some better timings. Around 2x would be my goal, but even at the current 3x is still pretty good for a language which is at such a high level of abstraction.

So I went to work, I went back to the source: Lennart Augustsson's implementation. I chose the 4th version, as it would seem to not have the uglier performance hacks that are present in the final version, I like working with clean code if possible. I pulled out the implementation of the vector, as it can be had through the AC-vector package from Hackage; it's environmentally friendly to recycle.

I didn't actually end up modifying too much, apart from a few style things. I tweaked a few functions to get their complexity down slightly, I didn't look to the C++ code to see if a similar optimization could be made, as I want to be able to be able to relate my results to original comparison post, aside from hardware.

Results

These results were obtained on a ArchLinux 32-bit installation, using GHC 6.12.1, GCC 4.4.2, OCaml 3.11.1, and HLVM was included from SVN just for more data-points to align with the other blog post.

9 Layers, 512x512
12 Layers, 512x512

Details:

The scary compiler invocation for GHC is as follows:

ghc --make -O2 -fforce-recomp -fexcess-precision -funbox-strict-fields ray4.hs -o ray4-hs -fvia-c -optc-O3 -optc-ffast-math -funfolding-keeness-factor=10


The compile lines for the others can be found on the original ray tracing comparison page, and the HLVM version (as of the SVN version of this writing) has to be hacked as it doesn't support passing command line arguments to adjust the number of levels. Also, it used exorbitantly more memory than the other implementations, with 9 levels it ate approximately 422MB, where the Haskell version only used 18MB.

Parallel Ray Tracing

Now this isn't the really interesting part, the really interesting thing is when we start dipping into the parallel package/runtime support that was recently improved in GHC 6.12.1 . Since I'd never really had the chance to use it (never had to do any really large computationally intensive processing) I thought this would be the perfect time!

Surprisingly, I only had to append the following line to get the performance seen below:

parBuffer ni rwhnf picture'

and also invoking the compiler needs the following additions:

-threaded -feager-blackholing

Decomposing the addition to the source code, we basically use the parBuffer function from the Control.Parallel.Strategies package. Now I'm just starting with these parallel strategies but as I understand it, the buffering size (ni) is how many elements from the list (picture') are taken to be evaluated at a time, using the strategy provided (rwhnf). I thought this strategy was best because each element in the list is quite computationally expensive, so it makes sense not to try to evaluate them all at the same time, but to do them in batches.

The results were quite promising on 9-levels:

Using about 6 cores seems to be the sweet-spot, and tuning the runtime via +RTS -qb0 -N8 -RTS (use parallel garbage collector on generation-0, -Nn specifies number of cores to use) was essential as well.

I found this an interesting exercise, especially seeing how pure functional programming can still be quite performant compared to impure languages. I wouldn't mind some tips from anyone more knowledgeable about improving the parallel performance in this setting to see if it we could do even better!






Appendix - Haskell Code Used for the trials


{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies
import Data.List (foldl')
import Data.Vector
import System
import System.IO

infixl 7 .*, *|

(.*) :: Vector3 -> Vector3 -> Double
(.*) = vdot

(*|) :: Double -> Vector3 -> Vector3
(*|) = (*<>)

len = vmag

vect :: Double -> Double -> Double -> Vector3
vect x y z = Vector3 x y z

infinity, delta :: Double
infinity = 1/0
delta = sqrt e where e = encodeFloat (floatRadix e) (-floatDigits e)

unitise :: Vector3 -> Vector3
unitise r = 1 / len r *| r

data Scene = S !Vector3 !Double [Scene]
data Hit = H {l :: !Double, nv :: Vector3}

ray_sphere :: Vector3 -> Vector3 -> Double -> Double
ray_sphere dir v radius =
let disc = v .* v - radius * radius
b = (v .* dir)
b2 = b*b
in if disc < 0 || b2 < disc
then infinity
else let disk = sqrt (b2 - disc)
t1 = b - disk
in if t1 > 0 then t1 else b + disk

ray_sphere' :: Vector3 -> Vector3 -> Vector3 -> Double -> Bool
ray_sphere' orig dir center radius =
let v = center - orig
b = v .* dir
b2 = b * b
rest = v .* v - radius * radius
in b2 >= rest && (b > 0 || rest < 0)

intersect dir first@(H l _) !(S center radius scene) =
let l' = ray_sphere dir center radius
in if l' >= l
then first
else case scene of
[] -> H l' (unitise (l' *| dir - center))
scenes -> foldl' (intersect dir) first scenes

intersect' orig dir !(S center radius scenes) =
ray_sphere' orig dir center radius &&
(null scenes || any (intersect' orig dir) scenes)

ray_trace light dir scene =
case intersect dir (H infinity 0) scene of
H 0 _ -> infinity
H lambda normal ->
let g = normal .* light
in if g >= 0 then 0
else let p = lambda *| dir + delta *| normal
in if intersect' p neg_light scene then 0 else - g


neg_light = unitise (Vector3 1 3 (-2))

bound (S c r s) (S c' r' []) = S c (max r (len (c - c') + r')) s
bound b (S _ _ l) = foldl' bound b l

create 1 c r = S c r []
create level c r =
let a = 3 * r / sqrt 12
aux x' z' = create (level - 1 :: Int) (c + vect x' a z') (0.5 * r)
l = [S c r [], aux (-a) (-a), aux a (-a), aux (-a) a, aux a a]
in foldl' bound (S (c + vect 0 r 0) 0 l) l

ss = 4

light = unitise (vect (-1) (-3) 2)

pixel_vals n scene y x =
[ let
f a da = a - n / 2 + da / ss
d = unitise (vect (f x dx) (f y dy) n)
in ray_trace light d scene | dx <- [0..ss-1], dy <- [0..ss-1] ]

main = do
[level,ni] <- fmap (map read) getArgs
let n = fromIntegral ni
scene = create level (vect 0 (-1) 4) 1
scale x = 0.5 + 255 * x / (ss*ss)
f y x = toEnum . truncate . scale . sum $ pixel_vals n scene y x
picture' = [ f y x | y <- [n-1,n-2..0], x <- [0..n-1]]
picture = parBuffer ni rwhnf picture'
hSetEncoding stdout latin1
putStr $ "P5\n" ++ show ni ++ " " ++ show ni ++ "\n255\n" ++ picture