Tuesday, 12 January 2010

Haskell, Ray Tracing, Parallel Computation

Note: The initial version of this post had an error in one of the functions, which produced incorrect output. This has been fixed and the performance numbers have been updated; they didn't shift far, though there is a slight performance degradation in the tests that were run.

Since seeing a blog post about the efficiency of ray-tracer implementations in several languages, I wanted to see whether the placing of the Haskell entry could be improved. In that post, the Haskell implementation comes out about 3.2x slower than the C++ version. That gap seemed a little too large, so I tried to tweak the Haskell version towards better timings. Around 2x would be my goal, but even the current 3x is still pretty good for a language at such a high level of abstraction.

So I went to work and went back to the source: Lennart Augustsson's implementation. I chose the 4th version, as it seems not to have the uglier performance hacks present in the final version, and I like working with clean code where possible. I pulled out the implementation of the vector type, as it can be had through the AC-Vector package from Hackage; it's environmentally friendly to recycle.
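For reference, the handful of AC-Vector operations the code in the appendix relies on (Vector3, vdot, vmag) can be exercised with a tiny snippet like the following; this is just an illustration of mine, not code from the original comparison:

import Data.Vector   -- from the AC-Vector package: Vector3, vdot, vmag, ...

-- Dot product scaled by a magnitude, purely to show the API surface used later.
example :: Double
example = vdot v w / vmag v
  where
    v = Vector3 1 2 3
    w = Vector3 4 5 6

main :: IO ()
main = print example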

I didn't actually end up modifying too much, apart from a few style things. I tweaked a few functions to bring their complexity down slightly; I didn't look at the C++ code to see whether a similar optimisation could be made there, as I want to be able to relate my results to the original comparison post, hardware aside.

Results

These results were obtained on a 32-bit Arch Linux installation, using GHC 6.12.1, GCC 4.4.2, and OCaml 3.11.1; HLVM was built from SVN just to provide more data points to align with the other blog post.

[Chart: 9 layers, 512x512]
[Chart: 12 layers, 512x512]

Details:

The scary compiler invocation for GHC is as follows:

ghc --make -O2 -fforce-recomp -fexcess-precision -funbox-strict-fields ray4.hs -o ray4-hs -fvia-c -optc-O3 -optc-ffast-math -funfolding-keeness-factor=10


The compile lines for the others can be found on the original ray-tracing comparison page. The HLVM version (as of the SVN revision at the time of writing) had to be hacked, as it doesn't support passing command-line arguments to adjust the number of levels. It also used far more memory than the other implementations: with 9 levels it ate approximately 422MB, where the Haskell version only used 18MB.

Parallel Ray Tracing

Now this isn't the really interesting part; the really interesting part is when we start dipping into the parallel package and runtime support that was recently improved in GHC 6.12.1. Since I'd never really had the chance to use it (I'd never had to do any really large, computationally intensive processing), I thought this would be the perfect time!

Surprisingly, I only had to add the following definition to get the performance seen below:

picture = parBuffer ni rwhnf picture'

and the compiler invocation needs the following additional flags:

-threaded -feager-blackholing
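Putting those together with the earlier flags, the full build line for the parallel version ends up as:

ghc --make -O2 -fforce-recomp -fexcess-precision -funbox-strict-fields ray4.hs -o ray4-hs -fvia-c -optc-O3 -optc-ffast-math -funfolding-keeness-factor=10 -threaded -feager-blackholing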

Decomposing the addition to the source code: we basically use the parBuffer function from the Control.Parallel.Strategies module of the parallel package. I'm just starting out with these parallel strategies, but as I understand it, the buffer size (ni) is how many elements from the list (picture') are taken to be evaluated at a time, using the strategy provided (rwhnf, i.e. reduce each element to weak head normal form). I thought this strategy was best because each element in the list is quite computationally expensive, so it makes sense not to try to evaluate them all at the same time, but to do them in batches.
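As an aside, here is a minimal, self-contained sketch of the same pattern on a toy workload. The buffer size of 64 and the expensive function are made up for the example, and it assumes the same parallel-2.x API the post uses (newer versions of the parallel package rename rwhnf to rseq):

import Control.Parallel.Strategies (parBuffer, rwhnf)

-- A deliberately expensive pure function, standing in for one pixel's worth of work.
expensive :: Int -> Int
expensive k = length (filter odd [1 .. k * 10000])

main :: IO ()
main = do
  let results  = map expensive [1 .. 500]
      -- Keep a rolling window of 64 sparks ahead of the consumer; each element
      -- is evaluated to weak head normal form in parallel.
      parallel = parBuffer 64 rwhnf results
  print (sum parallel)

Compile it with -threaded and run with +RTS -N to actually use multiple cores.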

The results were quite promising at 9 levels:

Using about 6 cores seems to be the sweet spot, and tuning the runtime via +RTS -qb0 -N8 -RTS (-qb0 enables load balancing in the parallel garbage collector from generation 0, and -Nn specifies the number of cores to use) was essential as well.
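For reference, a run of the tuned parallel build looks something like the line below; the program takes the number of levels and the image size on the command line and writes a PGM image to standard output (the output filename is just an example):

./ray4-hs 9 512 +RTS -N6 -qb0 -RTS > spheres.pgm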

I found this an interesting exercise, especially seeing how pure functional programming can still be quite performant compared to impure languages. I wouldn't mind some tips from anyone more knowledgeable about improving the parallel performance in this setting, to see if we could do even better!

Appendix - Haskell Code Used for the Trials


{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies
import Data.List (foldl')
import Data.Vector   -- Vector3, vdot, vmag, ... from the AC-Vector package
import System        -- getArgs
import System.IO

infixl 7 .*, *|

-- Dot product.
(.*) :: Vector3 -> Vector3 -> Double
(.*) = vdot

-- Scalar multiplication.
(*|) :: Double -> Vector3 -> Vector3
(*|) = (*<>)

len = vmag

vect :: Double -> Double -> Double -> Vector3
vect x y z = Vector3 x y z

infinity, delta :: Double
infinity = 1/0
delta = sqrt e where e = encodeFloat (floatRadix e) (-floatDigits e)

unitise :: Vector3 -> Vector3
unitise r = 1 / len r *| r

-- A scene is a bounding sphere plus a (possibly empty) list of child scenes.
data Scene = S !Vector3 !Double [Scene]
data Hit = H {l :: !Double, nv :: Vector3}

-- Distance along the ray (direction dir) to the sphere, or infinity on a miss.
ray_sphere :: Vector3 -> Vector3 -> Double -> Double
ray_sphere dir v radius =
  let disc = v .* v - radius * radius
      b    = v .* dir
      b2   = b * b
  in if disc < 0 || b2 < disc
       then infinity
       else let disk = sqrt (b2 - disc)
                t1   = b - disk
            in if t1 > 0 then t1 else b + disk

-- Does the ray from orig in direction dir hit the sphere at all?
ray_sphere' :: Vector3 -> Vector3 -> Vector3 -> Double -> Bool
ray_sphere' orig dir center radius =
  let v    = center - orig
      b    = v .* dir
      b2   = b * b
      rest = v .* v - radius * radius
  in b2 >= rest && (b > 0 || rest < 0)

-- Fold over the scene hierarchy, keeping the nearest hit found so far.
intersect dir first@(H l _) !(S center radius scene) =
  let l' = ray_sphere dir center radius
  in if l' >= l
       then first
       else case scene of
              []     -> H l' (unitise (l' *| dir - center))
              scenes -> foldl' (intersect dir) first scenes

-- Shadow test: does the ray from orig hit anything in the scene?
intersect' orig dir !(S center radius scenes) =
  ray_sphere' orig dir center radius &&
  (null scenes || any (intersect' orig dir) scenes)

ray_trace light dir scene =
  case intersect dir (H infinity 0) scene of
    H 0 _ -> infinity
    H lambda normal ->
      let g = normal .* light
      in if g >= 0
           then 0
           else let p = lambda *| dir + delta *| normal
                in if intersect' p neg_light scene then 0 else - g

-- Direction from a surface point towards the light source (used for shadow rays).
neg_light = unitise (Vector3 1 3 (-2))

-- Grow the bounding sphere so that it encloses the given child scene.
bound (S c r s) (S c' r' []) = S c (max r (len (c - c') + r')) s
bound b (S _ _ l) = foldl' bound b l

-- Build the scene: a sphere plus four half-size sub-scenes, `level` levels deep.
create 1 c r = S c r []
create level c r =
  let a = 3 * r / sqrt 12
      aux x' z' = create (level - 1 :: Int) (c + vect x' a z') (0.5 * r)
      l = [S c r [], aux (-a) (-a), aux a (-a), aux (-a) a, aux a a]
  in foldl' bound (S (c + vect 0 r 0) 0 l) l

-- ss * ss rays per pixel (4x4 supersampling).
ss = 4

light = unitise (vect (-1) (-3) 2)

-- The ss*ss supersampled ray-trace values for pixel (x, y).
pixel_vals n scene y x =
  [ let f a da = a - n / 2 + da / ss
        d = unitise (vect (f x dx) (f y dy) n)
    in ray_trace light d scene | dx <- [0..ss-1], dy <- [0..ss-1] ]

main = do
  [level,ni] <- fmap (map read) getArgs
  let n        = fromIntegral ni
      scene    = create level (vect 0 (-1) 4) 1
      scale x  = 0.5 + 255 * x / (ss*ss)
      f y x    = toEnum . truncate . scale . sum $ pixel_vals n scene y x
      picture' = [ f y x | y <- [n-1,n-2..0], x <- [0..n-1]]
      -- the parallelism change from the text: evaluate pixels via parBuffer,
      -- sparking them ni at a time
      picture  = parBuffer ni rwhnf picture'
  hSetEncoding stdout latin1
  putStr $ "P5\n" ++ show ni ++ " " ++ show ni ++ "\n255\n" ++ picture

6 comments:

  1. Great! Good work!

    The conciseness of the change for parallel computation is especially nice!

    Did you send your code to Jon Harrop so that he can update his benchmark?

  2. That's really nice. I love that at no point do you really give any indication of parallelism until the end with parBuffer and its strategy.

    Doing the parallelism strategy yourself is not fun either.

    BTW, performant doesn't mean what you want it to mean; just say it performs very well. A performant is a performer (like a clown); whether or not they perform well has nothing to do with their status as a performer (think Carrot Top).

    http://dictionary.reference.com/browse/performant
    http://boulter.com/blog/2004/08/19/performant-is-not-a-word/
    You'll see here that it is marked as a neologism; that's because someone misused it and it stuck:
    http://en.wiktionary.org/wiki/performant
    http://weblogs.asp.net/jgalloway/archive/2007/05/10/performant-isn-t-a-word.aspx

  3. How did you install GHC 6.12.1 in Arch?
    It only has 6.10.4, and there's nothing in AUR either... did you compile GHC from source?

  4. @ertai: I didn't send my code to him, this was more of a personal exercise.

    @haskellish: Thanks for the remarks! I picked up performant (probably) from usage on the internet, but I'll put an end to my use of it as a word ;).

    @ttsiodras: I think it's in the 'extra' or 'community' repositories. Actually I think a new update came out yesterday (a new Arch update, not GHC itself I think).

  5. I wanted to improve your code, but the program does not compile: Vector3 is undefined. I checked out the darcs version of Data.Vector and could not find the definition there. Would you mind pointing to the proper place to look? I'm on GHC 6.10.4 though. Thanks.

  6. @Sanny:
    I used the AC-Vector package from Hackage, installed via "cabal install AC-Vector" on Linux.

    You'll likely need 6.12.1 to get nicer parallel performance though.
