Sunday, 14 March 2010

GHC, LLVM, A Couple Shootout Entries: Faster for Free

After seeing a post by Alp Mestanogullari, I thought it might be interesting to see how the new LLVM backend to GHC is working out. The instructions in Alp's post are quite thorough, although they assume that you patch LLVM yourself. This is no longer necessary, as the LLVM project has accepted the required patch, which provides a new calling convention for GHC.

Thus, the instructions have to be changed a little:
  1. Get LLVM from SVN, compile/install it.
  2. Get GHC from the darcs repository.
  3. Patch it with both Don's and David's patches (found on the blog post at the top). These provide the LLVM backend and command-line options to GHC to invoke certain optimizations through LLVM.

  4. Apply the following, which makes GHC emit LLVM's numbered calling convention 10 (the number LLVM reserved for GHC) instead of fastcc:

    hunk ./compiler/llvmGen/Llvm/Types.hs 527

    - show CC_Fastcc = "fastcc"
    + show CC_Fastcc = show (CC_Ncc 10)
  5. Compile GHC. You may need to roll back to an earlier stable date; the last time I checked, the most recent sources didn't compile. I think 2010-02-25 was fine.
I would also reference Alp's post to pick up other details, as he gives much more time to this process than I do here.

Anyhow, once this was done I tried to run a couple of the shootout programs, and found that the LLVM backend seemed to hold up. Admittedly this is a very limited test, just testing the waters:

  • Mandelbrot -- old: 2.723s, LLVM: 2.641s
  • NBody -- old: 2.093s, LLVM: 2.029s
The only LLVM options used were -std-compile-opts and -O3. Note that these were run with reduced parameters (4000 for mandelbrot, 5000000 for nbody), and the times are averages over 5 runs of each binary. This shows that LLVM gives us a speedup of 3.08% and 3.15% respectively in these two limited cases. Even if the cases are small and few, it's still very encouraging that it's possible to get a bit more performance essentially free of cost! Hopefully the LLVM-backend work will continue to improve and pay off into the future!
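
For the curious, a compile line through the new backend would look roughly like the following. Treat the exact flag spellings as an assumption: -fllvm and -optlo (which forwards options to LLVM's opt) are the names used by the backend as it later landed upstream, and may differ slightly in the patched tree.

ghc --make -O2 -fllvm -optlo-std-compile-opts -optlo-O3 mandelbrot.hs -o mandelbrot-llvm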

Friday, 26 February 2010

Putting Haskell Between C++ and Java (Language Shootout Exercise)

Note: These speedups apparently require GHC 6.12, so the full speedup won't be seen on the Shootout until/unless they switch to GHC 6.12 at some point (possibly after the next Ubuntu stable release?). However, there was still about a 3 second speedup to hold us over until then ;).

Lately I've been poking at the language shootout to see what's there, and the previous two posts have centered on this theme. Unlike those posts, however, this one doesn't aim to show how a Haskell program could be made a little faster, but rather how it can be made much faster.

Looking to reduce one of the greatest disparities, I turned to the fasta benchmark. In this case, the fastest submission is the C implementation, with a great time of 1.73 seconds. Avoiding success in the most complete way possible, the Haskell implementation scores in at about 20th place, 9x slower than the fastest C. However, the entry doesn't keep terrible company; at least it is bordered by a couple of (JIT-)compiled languages, F# and Go.

Ok, so what can be done?

I don't know. First we have to find out where the problem is! For that we can look at GHC's profiling output: compile with the GHC options "-prof -auto-all" in addition to the usual ones, then run the resulting binary with "+RTS -p -xc -RTS".
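Concretely, the sequence looks something like this (the file name and input size are placeholders):

  ghc -O2 -prof -auto-all fastaold.hs -o fastaold
  ./fastaold 25000000 +RTS -p -xc -RTS

This produces a file called (for example) fastaold.prof, which reports, first: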


COST CENTRE                    MODULE              %time %alloc

look                           Main                  38.7   71.7
choose                         Main                  26.6    0.0
rand                           Main                  14.6   20.7

and additionally:

                                                individual     inherited
 COST CENTRE  MODULE  no.      entries  %time %alloc    %time %alloc

 look     Main       282   403333334  38.7   71.7    79.8   92.4
   rand   Main       285   200000000  14.6   20.7    14.6   20.7
   choose Main       283   623445679  26.6    0.0    26.6    0.0

The top output tells us the main offenders (by execution time) in our program. The bottom is an extract from the more detailed call-tree information; the columns show the number of entries plus individual and inherited time and allocation. Clearly these functions are called a lot, so it makes sense to try to optimize the lookup of a symbol.


Okay, so clearly the most commonly used functions need optimization... (this is really revolutionary advice here...)

Briefly, these are the optimizations I have made. Not all of these may be really useful, but it is a true listing of what I did:


  • Replaced the symbol/probability pair with its own datatype
  • Ditto for probability/random-seed pairs
  • Changed their representation from a PDF to a CDF, which saves some subtraction (see the sketch after this list).
  • Pushed look into the body of the fold function (this actually saves quite a bit of time; I guess it allows better specialization).
  • Introduced a "cached" version of the list, which essentially pushes the first 4 entries up. I believe this gives more-or-less constant access time for these elements, without having to "uncons" everything, as with a regular list.
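
To give a flavour of the first three changes, here is a minimal sketch; the names and the error message are mine, not the actual submission. A strict symbol/cumulative-probability pair can be scanned with one comparison per element, with no subtraction along the way:

data Sym = Sym !Char !Double   -- symbol paired with its *cumulative* probability

lookSym :: [Sym] -> Double -> Char
lookSym (Sym c p : rest) r
  | r <= p    = c              -- CDF: one comparison, no subtraction
  | otherwise = lookSym rest r
lookSym [] _ = error "probabilities should accumulate to 1.0"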

These seemed to pay off! I believe they don't even violate the spirit of the language shootout, as they are certainly a smaller modification than the top C version makes. That C version essentially expands the CDF into a larger array, so that finding the correct symbol becomes more or less a constant-time lookup; my version is still quite linear.

In the end I was able to make the following improvement, going from:
0m13.305s

to:

0m6.219s

This is a 2.14x improvement! The implementation is still much slower than the 1st place one. It's a little hard to say how this would affect the Shootout results themselves, given that I have different hardware and a different version of GHC (they have 6.10, I have 6.12). However, if one wants to perform the dubious step of applying my speedup to the current version listed on the Shootout page, it should move the Haskell entry from 20th place to 8th, and put it right between a C++ and a Java entry. That's pretty nice company to be around!

(Double check my work! Files hosted on GitHub)

Worker-threads to Strategies: Performance Follow-up

When I was last examining the simplification of the mandelbrot shootout entry, I neglected one point: confirming that the multithreaded performance of the program was maintained after my modifications. First we can see the parallel performance under 4 cores for the original:
Note that this is about a 3.5x speedup.

Now in the simplified version, where the strategies from Control.Parallel.Strategies are used, we see:




This is slightly slower than the old version (which is consistent with the performance under 1 core), with the speedup now in the ballpark of 3.3x.

The absolute times are about 50ms vs 53ms, respectively.

Sunday, 21 February 2010

From Pointers and Worker-threads to ByteStrings and Strategies

Just for fun the other day, I thought I'd take a look through the Language Shootout submissions to see what highly optimized programs tend to look like. It's really quite interesting to see where the strengths of each language lie. Often when you're not playing to the style that the language caters to, you can get some fairly ugly-looking results. I saw one such result in the mandelbrot submission for Haskell.

The original submission (with full author attribution) can be found here. It used the more traditional concurrency primitives on offer. You can see it below, stripped of attribution and comments to make the program size more evident in comparison:



I noticed a few things in the implementation:

  1. It used (as previously mentioned) the low-level concurrency primitives
  2. Ptr, poke, and pointer arithmetic were used to create bit arrays.
  3. The state that was unfolded through the computation was a big tuple of stuff.
I think these tend to clutter the meaning of the program, so I was curious whether I could reduce some of their usages to make the program a bit clearer, while still maintaining the relatively high performance.


Low level concurrency
At the core of the algorithm, it just processes one line of the output at a time. Even better, each line of the output is independent from the others! This independence indicates that we can use the higher-level parallel strategies in Control.Parallel.Strategies to parallelize the task.

Pointer-level Computation
This one had a pretty clear solution. Since the arrays used in the old implementation contained simple bytes (Word8), it is fairly easy to transition them to ByteStrings instead. By doing this we keep the purity while still getting efficient operations on the bytes.
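
As a rough sketch of these two points combined (a minimal reconstruction of the idea, with my own names and a placeholder pixel function, not the actual entry): each scanline becomes an independent strict ByteString, and a Strategy evaluates the list of lines in parallel chunks.

import Control.Parallel.Strategies (parListChunk, rwhnf, using)
import qualified Data.ByteString as B
import Data.Word (Word8)

-- one scanline of an n*n image; the pixel computation is a stand-in
row :: Int -> Int -> B.ByteString
row n y = B.pack [ pixel x | x <- [0 .. n - 1] ]
  where pixel x = fromIntegral ((x * y) `mod` 256) :: Word8

-- rows are independent, so evaluating them in parallel chunks is safe;
-- rwhnf suffices because a strict ByteString is fully built at WHNF
image :: Int -> [B.ByteString]
image n = map (row n) [0 .. n - 1] `using` parListChunk 16 rwhnf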

Folded State
This was nothing too scary: just paring the state (symbolized by T) down to something more manageable. I turned it from a tuple of x-position, x-byte-position, y-position, and y-imaginary-position into just the x-position.

Finally, after making the above changes, the resulting code is more pure, and (I think) easier to read:


One of the best things is that the performance is left largely unchanged (measurements generated using Criterion). The time to generate a 1000x1000 image is distributed as:

And my revised version of the code sees:




It is a little hard to see due to the closeness of the graphs, but the average times are 175.75ms for the old and 176.74ms for the new, although the 95% confidence intervals do overlap. With less confidence, we may be able to say that the old version is faster.

The code for this exercise can be seen on the lovely GitHub.

Tuesday, 19 January 2010

Multi-core Laziness: Doing Less in Parallel

Preamble: (You may want to ignore this, it doesn't pertain directly to the topic of this post) I just started this blog, and I'm very thankful for the constructive comments I've received both here, and on reddit. I appreciate the discussions that have ensued over the issues of parallelizing the code that I've been examining.

However, I feel that I should reassure anyone who has been reading that, if you have also seen a blog post by Jon Harrop, you can be sure that it "really is this easy" to get the parallelism. There has been some discomfort felt by Jon that I chose the 4th version of Lennart's code over the 5th version. He questions the veracity of my claims, accusing me of "cherry-picking" results, and goes on to try to discover exactly why I would have chosen the 4th version over the 5th. He eventually concludes that I specially designed my version to be more parallelizable. Let me help Jon, since he is having difficulty figuring out exactly why I chose the 4th version:

While the 5th version of Lennart's code is the fastest on 9 levels of spheres, it is over-optimized for this serial case:

  1. It is ugly (and I say that with no disrespect); it has unwrapped basically all of the vector operations that were previously function calls.
  2. It's not robust: it stack-overflows with the default settings on 12 levels of spheres, but actually Jon knew about that already.
  3. It performs worse at anything except 9 levels of spheres. When the stack size is increased past its default to accommodate the program, it takes 50% longer than version 4 (which I based my version on), and uses 300% more memory than version 4. On 13 levels of spheres it eats all of my memory.
So, when I made the choice to select version 4 of Lennart's code, it wasn't to bias the results towards parallelism. It was because, whether serial or parallel, version 5 offers only a minor speed increase, and only at level 9. For the other levels tested it was worse in terms of both memory and speed. If I had chosen this version I wouldn't have been able to get any results at all for 13 levels of spheres. I hope Jon feels better now.


The Main Attraction

To be honest, this post won't focus too much on parallelism; the parallelism just provides some context in which to view laziness. In the previous posts, I took the 4th version of Lennart's ray tracer and performed some small optimizations on it. Now, I don't know anything really about ray-tracing, so I'm glad there were hints indicating that it might be a great place to look at parallelism.

Given that at higher levels of spheres the ray-tracer benchmark essentially becomes a test of allocation and initialization speed, doing less of both should increase performance. Of course, this turns out to be true. A language with lazy evaluation, such as Haskell, provides a nice opportunity for this optimization, since it lets the programmer easily create software that is lazy-by-default and strict-when-needed.
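
As a toy illustration of that point (entirely my own example, not the ray tracer): the tree below is exponentially large, but a traversal that demands only one spine allocates just the nodes it touches.

data Tree = Node Int [Tree]

build :: Int -> Tree
build n = Node n [ build (n - 1) | n > 0, _ <- [1 .. 4 :: Int] ]

-- walking one spine touches O(depth) nodes out of O(4^depth)
leftmostDepth :: Tree -> Int
leftmostDepth (Node _ [])      = 0
leftmostDepth (Node _ (t : _)) = 1 + leftmostDepth t

main :: IO ()
main = print (leftmostDepth (build 30))  -- instant; most of the tree is never built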

As applied to the ray-tracer benchmark, this turns out to be a very valuable property! I recently took the most unoptimized version that Lennart produced and back-ported a few of the simpler optimizations from the later versions, plus some new ones, such as:

  • specializing two versions of intersect
  • removing the use of sqrt when able
  • using sqrt epsilon_float as a constant
I believe the first one gives the largest gain, and it was already present in the other, more optimized versions. These seem to give a nice increase over the original version, and get it within about 2.3x of the C++ version in single-core operation. The key here is to leave the lazy evaluation alone, and to not force the entire scene to be constructed.
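
For a concrete picture of the sqrt idea, ray_sphere' in the appendix code at the end of this page avoids the root entirely: rather than comparing b against a square root, it compares squared quantities, with a sign check to keep the squaring valid.

ray_sphere' orig dir center radius =
    let v    = center - orig
        b    = v .* dir
        rest = v .* v - radius * radius
    in  b * b >= rest && (b > 0 || rest < 0)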

The gains can be seen in the timing results (I've tried to overlay them on the previous results; the chart is a little crowded, so I took the markers off the previous results):



As you can see, the Lazy version starts off the slowest. However, from the clumping of times in both the single- and multi-core runs, it is clear that the lazy program scales much better as the levels of spheres increase. In fact, I could push the levels up seemingly arbitrarily: 2000 or 3000 levels of spheres all completed in under 10 seconds. Of course, everything being equal, it probably wouldn't be too hard to get the C++ program to do this as well. It isn't my intent to claim that these things are impossible in other languages; Haskell just happens to make laziness easier.

The scaling behaviour under multiple cores is also revealing,


We can see that, due to the reduced burden of creating and initializing the scene, the speedup due to parallelism stays more or less the same as the complexity of the scene increases. So not only does the serial scene construction have a nicer runtime complexity, that same serial code can also be parallelized more effectively than the non-lazy version.

In conclusion, the lazy version of the program has several nice benefits:
  • It's relatively small (about 70 lines with my additions, and I like slightly more line-breaks than Lennart, so it could be even smaller).
  • It performs better as the levels increase.
  • The relative speedup vs. serial maintains its shape better as cores are added
But laziness, in my opinion, also has some pitfalls,
  • It can be hard to switch to thinking about when something actually needs to be evaluated, and to recognize when and where laziness is really in effect.
  • If you're not a little careful you can end up lazily creating some terms that are too large, which will degrade performance.
  • Optimizing with respect to laziness can be a tricky balance.
However, in the end, I think that being aware of laziness is good food-for-thought. The more tools we have in the toolbox, the more often we find new and interesting ways to solve our problems efficiently. This will likely be my last post on this benchmark; it's been a fun little exercise. I hope someone else picks it up and finds another aspect to explore!

Again, the code for Lennart's 1st version, optimized, can be found on GitHub.

Friday, 15 January 2010

Naïve Parallelization: C++ (OpenMP) and Haskell

Update: floodyberry from reddit has suggested a great improvement to the C++ version, which lets the speedup trend increase much more nicely in the case of 8 levels of spheres. He also made the good suggestion that the lack of speedup in the higher-level cases is due to the task essentially becoming an allocation benchmark. The graphs have been updated accordingly.


Background
In my previous post I compared a few ray tracing implementations, and there were some questions on reddit about why the parallel Haskell version didn't scale beyond about a 3x improvement, even up to 8 logical (4 physical) cores, on such an innately parallel task.

I couldn't answer that question when it was asked: and I can't answer it now. Sorry ;)

However, it did make me curious about how one could also parallelize the C++ version, to start examining whether this was a problem with the Haskell implementation or a deeper issue. Now, I am definitely not a master of concurrency and parallel programming, so I picked the simplest method of parallelizing C++ programs that I know: OpenMP. It's nice because it works through #pragma declarations and provides some primitive operations to help with synchronization. Additionally, it contains one pragma that I thought would be perfect for this sort of task: #pragma omp parallel for, which takes the for block that follows and parallelizes its operation. This is roughly like the parBuffer function that I employed when I did the same to the Haskell version of the code.


To be honest, I don't know if this is the best parallelization of the loop, or whether the C++ program could be rewritten in such a way as to further increase parallel performance. However, that isn't really the question I'm trying to answer, which is: how do naïve parallelizations in these two languages compare?
I focus on these simple modifications because in the context of software engineering we're often trying to balance correctness, execution speed, and cost (time to implement). So then, given two implementations of the ray-tracing task, how do these compare?

Parallelizing
The transformation from serial to parallel in the Haskell version can be had by simply taking the list of intensities, let us call this list picture', and applying parBuffer ni rwhnf to it. This results in a new list of intensities (identical to the original, just evaluated in parallel). Due to the pure functional nature of Haskell, we know that this operation preserves the result of the original serial computation. If it was correct before, it is correct now: great! (Likewise, if it was incorrect before, it will be incorrect now too ;).)
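
In code (this matches the appendix of the post further down this page), the change is a single binding:

picture = parBuffer ni rwhnf picture'   -- same list, evaluated ni elements ahead in parallel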

The same transformation in C++ is actually fairly similar with OpenMP, which is also great! The typing overhead is not very high, but it is a bit more than in the Haskell version. As stated before, basically prepending
#pragma omp parallel for to the main for-loop of the program solves the parallelism aspect. However, and this is the important part, because of the use of a shared resource (cout), the body of the loop had to be modified to remove the race condition that was introduced. The race condition I mention is the use of cout to immediately output the results, which is of course sensitive to the order in which the intensities are printed. To solve this I introduced an area of memory into which each iteration writes its result at the correct location, and another nested loop afterwards to print the buffer to cout.


I'm not a great programmer, I am well aware of that. That didn't stop me, though, from loving the fact that I could immediately transform a serial pure-functional solution into a parallel pure-functional solution and still get the same correct answer that I had in the serial version. In C++, however, I saw the inevitable race condition, and had to design around it, modify the behaviour of the original loop, and break the output section into another couple of loops.


I am not in the least trying to imply that everyone would have used the same parallelization method that I did, and I would wager there are much better ways to do it, both in terms of speed of execution and safety of implementation. I just really believe that the combination of a pure-functional coding style with parallel primitives, like those offered in Haskell, allows one to produce code in a way that is less subject to errors, and still get some "free" parallelism in the end, too!


Performance
The second part of this post covers the performance characteristics. This was run on a 32-bit ArchLinux installation on an Intel i7 920, with around 3GB of RAM. Note that the i7 920 doesn't have 8 physical cores; it has 4 physical cores with hyper-threading on each, allowing my Linux installation to see 8 logical cores. This means that I'm not as confident in the results above 4 cores. However, there is a continued increase in performance above 4 cores, so I have left them in for the curious.

Update: GHC 6.12.1 and GCC 4.4.2 were the compilers used, with a 2.6.32 Linux kernel.






There are two interesting things about these results, I believe:
  1. In every case the relative speedup experienced by the Haskell version is greater than that of the C++ version. One may argue that this is because Haskell performs worse in the single-core version and thus has more "room" to grow into.
  2. The absolute performances get much closer as more cores are added. I'm not precisely sure how to interpret that, I think either both implementations are hitting a lower-level barrier (cache contention?) or the Haskell runtime has some sort of magic. I'm torn between the two possibilities!
Additionally, some Haskell runs take a dip at 8 cores; could this perhaps be due to the GC robbing one of the cores of some CPU time?

The code for these is available on GitHub, since I really hated formatting the Haskell in the previous post. Please feel free to modify the C++ code (originally adapted from here), and improve its parallel performance.
I don't want to give the impression that I'm saying C++ isn't able to do better, or that this parallelization is the definitive one. I'm sure there's something obvious that I've missed in this task! However, if you can, try to make as few modifications as possible and show off how easy it was to parallelize.

Tuesday, 12 January 2010

Haskell, Ray Tracing, Parallel Computation

Note: The initial post I made had an error in one of the functions which produced incorrect output. This has been fixed and the performance numbers have been updated. They didn't shift too far, but there has been a slight performance degradation in the tests that were run.

Since seeing a blog post about the efficiency of ray-tracer implementations in several languages, I wanted to see if the placing of the Haskell entry could be improved. In the post, the Haskell implementation comes out 3.2x slower than the C++ version. I thought this might be a little too large a gap, so I tried to tweak the Haskell version toward some better timings. Around 2x would be my goal, but even the current ~3x is still pretty good for a language at such a high level of abstraction.

So I went to work and went back to the source: Lennart Augustsson's implementation. I chose the 4th version, as it seems not to have the uglier performance hacks that are present in the final version; I like working with clean code if possible. I pulled out the implementation of the vector, as it can be had through the AC-Vector package from Hackage; it's environmentally friendly to recycle.

I didn't actually end up modifying too much, apart from a few style things. I tweaked a few functions to get their complexity down slightly. I didn't look at the C++ code to see if a similar optimization could be made there, as I want to be able to relate my results to the original comparison post, hardware aside.

Results

These results were obtained on an ArchLinux 32-bit installation, using GHC 6.12.1, GCC 4.4.2, and OCaml 3.11.1; HLVM was built from SVN, just for more data-points to align with the other blog post.

9 Layers, 512x512
12 Layers, 512x512

Details:

The scary compiler invocation for GHC is as follows:

ghc --make -O2 -fforce-recomp -fexcess-precision -funbox-strict-fields ray4.hs -o ray4-hs -fvia-c -optc-O3 -optc-ffast-math -funfolding-keeness-factor=10


The compile lines for the others can be found on the original ray tracing comparison page. The HLVM version (as of the SVN revision at this writing) had to be hacked, as it doesn't support passing command-line arguments to adjust the number of levels. It also used exorbitantly more memory than the other implementations: with 9 levels it ate approximately 422MB, whereas the Haskell version used only 18MB.

Parallel Ray Tracing

Now this isn't the really interesting part; the really interesting thing is when we start dipping into the parallel package/runtime support that was recently improved in GHC 6.12.1. Since I'd never really had the chance to use it (I've never had to do any really large, computationally intensive processing), I thought this would be the perfect time!

Surprisingly, I only had to append the following line to get the performance seen below:

parBuffer ni rwhnf picture'

and the compiler invocation needs the following additions:

-threaded -feager-blackholing

Decomposing the addition to the source code: we basically use the parBuffer function from the Control.Parallel.Strategies module. Now, I'm just starting out with these parallel strategies, but as I understand it, the buffer size (ni) is how many elements of the list (picture') are taken to be evaluated at a time, using the strategy provided (rwhnf). I thought this strategy was best because each element in the list is quite computationally expensive, so it makes sense not to try to evaluate them all at the same time, but to do them in batches.
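
A tiny self-contained illustration of the same idea (my own example, not the ray tracer; compile with -threaded and run with something like +RTS -N4 -RTS):

import Control.Parallel.Strategies (parBuffer, rwhnf)

-- a stand-in for one computationally expensive list element
expensive :: Int -> Int
expensive k = sum [1 .. 100000 * k]

main :: IO ()
main = print (sum (parBuffer 64 rwhnf (map expensive [1 .. 256])))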

The results were quite promising on 9-levels:

Using about 6 cores seems to be the sweet spot, and tuning the runtime via +RTS -qb0 -N8 -RTS (use the parallel garbage collector on generation 0; -Nn specifies the number of cores to use) was essential as well.

I found this an interesting exercise, especially seeing how pure functional programming can still be quite performant compared to impure languages. I wouldn't mind some tips from anyone more knowledgeable about improving the parallel performance in this setting, to see if we could do even better!






Appendix - Haskell Code Used for the trials


{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies
import Data.List (foldl')
import Data.Vector  -- Vector3, vdot, vmag, (*<>) from the AC-Vector package
import System
import System.IO

infixl 7 .*, *|

(.*) :: Vector3 -> Vector3 -> Double
(.*) = vdot

(*|) :: Double -> Vector3 -> Vector3
(*|) = (*<>)

len = vmag

vect :: Double -> Double -> Double -> Vector3
vect x y z = Vector3 x y z

infinity, delta :: Double
infinity = 1 / 0
delta = sqrt e where e = encodeFloat (floatRadix e) (-floatDigits e)  -- sqrt of the Double epsilon

unitise :: Vector3 -> Vector3
unitise r = 1 / len r *| r

data Scene = S !Vector3 !Double [Scene]
data Hit = H { l :: !Double, nv :: Vector3 }

ray_sphere :: Vector3 -> Vector3 -> Double -> Double
ray_sphere dir v radius =
    let disc = v .* v - radius * radius
        b    = v .* dir
        b2   = b * b
    in  if disc < 0 || b2 < disc
        then infinity
        else let disk = sqrt (b2 - disc)
                 t1   = b - disk
             in  if t1 > 0 then t1 else b + disk

-- hit test with no sqrt: compare squared quantities instead
ray_sphere' :: Vector3 -> Vector3 -> Vector3 -> Double -> Bool
ray_sphere' orig dir center radius =
    let v    = center - orig
        b    = v .* dir
        b2   = b * b
        rest = v .* v - radius * radius
    in  b2 >= rest && (b > 0 || rest < 0)

intersect dir first@(H l _) !(S center radius scene) =
    let l' = ray_sphere dir center radius
    in  if l' >= l
        then first
        else case scene of
               []     -> H l' (unitise (l' *| dir - center))
               scenes -> foldl' (intersect dir) first scenes

intersect' orig dir !(S center radius scenes) =
    ray_sphere' orig dir center radius &&
    (null scenes || any (intersect' orig dir) scenes)

ray_trace light dir scene =
    case intersect dir (H infinity 0) scene of
      H 0 _ -> infinity
      H lambda normal ->
          let g = normal .* light
          in  if g >= 0
              then 0
              else let p = lambda *| dir + delta *| normal
                   in  if intersect' p neg_light scene then 0 else - g

neg_light = unitise (Vector3 1 3 (-2))

-- fold a bounding sphere over a subtree
bound (S c r s) (S c' r' []) = S c (max r (len (c - c') + r')) s
bound b (S _ _ l) = foldl' bound b l

-- build `level` levels of spheres
create 1 c r = S c r []
create level c r =
    let a = 3 * r / sqrt 12
        aux x' z' = create (level - 1 :: Int) (c + vect x' a z') (0.5 * r)
        l = [S c r [], aux (-a) (-a), aux a (-a), aux (-a) a, aux a a]
    in  foldl' bound (S (c + vect 0 r 0) 0 l) l

ss = 4  -- supersampling factor

light = unitise (vect (-1) (-3) 2)

pixel_vals n scene y x =
    [ let f a da = a - n / 2 + da / ss
          d = unitise (vect (f x dx) (f y dy) n)
      in  ray_trace light d scene
    | dx <- [0 .. ss - 1], dy <- [0 .. ss - 1] ]

main = do
    [level, ni] <- fmap (map read) getArgs
    let n = fromIntegral ni
        scene = create level (vect 0 (-1) 4) 1
        scale x = 0.5 + 255 * x / (ss * ss)
        f y x = toEnum . truncate . scale . sum $ pixel_vals n scene y x
        picture' = [ f y x | y <- [n - 1, n - 2 .. 0], x <- [0 .. n - 1] ]
        -- the one-line parallelization: evaluate ni pixels ahead, in parallel
        picture = parBuffer ni rwhnf picture'
    hSetEncoding stdout latin1
    -- PGM header followed by the raw pixel bytes
    putStr $ "P5\n" ++ show ni ++ " " ++ show ni ++ "\n255\n" ++ picture