[Haskell-cafe] Haskell version of ray tracer code is much slower than the original ML


Philip Armstrong

Jun 21, 2007, 6:55:26 AM
to haskel...@haskell.org
In odd spare moments, I took John Harrop's simple ray tracer[1] & made a
Haskell version:

http://www.kantaka.co.uk/cgi-bin/darcsweb.cgi?r=ray

darcs get http://www.kantaka.co.uk/darcs/ray

It's pretty much a straight translation into idiomatic Haskell (as far
as my Haskell is idiomatic anyway).

Unfortunately, it's a lot slower than the ML version, despite turning
all the optimisation options up as far as they'll go. Profiling
suggests that much of the time is spent in the intersection' function,
and that the code is creating (and garbage collecting) an awful lot of
(-|) vector subtraction thunks. Trying to make intersection' or
ray_sphere stricter (with seq) appears to have no effect whatsoever:
the output of -ddump-simpl is unchanged (with the arguments all
staying lazy).

Am I missing anything obvious? I don't want to carry out herculean
code rewriting efforts: that wouldn't really be in the spirit of the
thing.

cheers, Phil

[1] http://www.ffconsultancy.com/languages/ray_tracer/comparison.html

--
http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Philip Armstrong

Jun 21, 2007, 7:36:46 AM
to Haskel...@haskell.org, Sebastian Sylvan
On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
>Try using floats for the vector, and strict fields (add a ! to the
>fields in the data declaration).

Because the optimisation page on the haskell wiki is very explicit
about never using Float when you can use Double, that's why. An older
revision used Float and it was slower than the current one. Making the
datatypes strict also makes no difference.

I have tried the obvious things :)

>That's the simplest possible thing I can think of after about two
>seconds of looking anyway. It appears that the ML version uses float
>so I don't understand why you would use Double for the Haskell version
>at all, and then think you could do any sort of valid comparisons
>between them...

OCaML floats are Doubles, at least on x86.

cheers, Phil

Bulat Ziganshin

Jun 21, 2007, 8:27:32 AM
to Philip Armstrong, Haskel...@haskell.org
Hello Philip,

Thursday, June 21, 2007, 3:36:27 PM, you wrote:

> revision used Float and it was slower than the current one. Making the
> datatypes strict also makes no difference.

don't forget to use either -funpack-strict-fields or the {-# UNPACK #-} pragma

--
Best regards,
Bulat mailto:Bulat.Z...@gmail.com

Philip Armstrong

Jun 21, 2007, 8:45:24 AM
to Bulat Ziganshin, Haskel...@haskell.org
On Thu, Jun 21, 2007 at 04:23:37PM +0400, Bulat Ziganshin wrote:
>Thursday, June 21, 2007, 3:36:27 PM, you wrote:
>> revision used Float and it was slower than the current one. Making the
>> datatypes strict also makes no difference.
>
>don't forget to use either -funpack-strict-fields or the {-# UNPACK #-} pragma

ahem:

http://www.kantaka.co.uk/cgi-bin/darcsweb.cgi?r=ray;a=headblob;f=/Makefile

(Assuming you mean -funbox-strict-fields; -funpack-strict-fields
doesn't appear in the ghc 6.6.1 makefile as far as I can see.)

As I said, I've tried the obvious things & they didn't make any
difference. Now I could go sprinkling $!, ! and seq around like
confetti but that seems like giving up really.

Phil

Jon Harrop

Jun 21, 2007, 8:45:57 AM
to haskel...@haskell.org

Awesome stuff!

On Thursday 21 June 2007 12:36:27 Philip Armstrong wrote:
> On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
> >Try using floats for the vector, and strict fields (add a ! to the
> >fields in the data declaration).
>
> Because the optimisation page on the haskell wiki is very explicit
> about never using Float when you can use Double, that's why. An older
> revision used Float and it was slower than the current one. Making the
> datatypes strict also makes no difference.

Where exactly do the !s go and what do they do?

> >That's the simplest possible thing I can think of after about two
> >seconds of looking anyway. It appears that the ML version uses float
> >so I don't understand why you would use Double for the Haskell version
> >at all, and then think you could do any sort of valid comparisons
> >between them...
>
> OCaML floats are Doubles, at least on x86.

Yes. OCaml doesn't have a 32-bit float storage format, apart from an
entirely-float big array. Also, the ML in OCaml doesn't stand for
metalanguage. ;-)

There is probably some benefit to laziness in this example because many of the
spheres are occluded in the final image, so parts of the scene tree that are
eagerly generated in the other implementations may actually never be
traversed by the renderer.

I take it you saw the whole language comparison:

http://www.ffconsultancy.com/languages/ray_tracer/

I'll be uploading concurrent implementations ASAP. Haskell should do well
there... :-)

PS: You spelled my name wrong!
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The OCaml Journal
http://www.ffconsultancy.com/products/ocaml_journal/?e

Philip Armstrong

Jun 21, 2007, 9:12:32 AM
to Jon Harrop, haskel...@haskell.org
On Thu, Jun 21, 2007 at 01:39:24PM +0100, Jon Harrop wrote:
>Awesome stuff!
>
>On Thursday 21 June 2007 12:36:27 Philip Armstrong wrote:
>> On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
>> >Try using floats for the vector, and strict fields (add a ! to the
>> >fields in the data declaration).
>>
>> Because the optimisation page on the haskell wiki is very explicit
>> about never using Float when you can use Double, that's why. An older
>> revision used Float and it was slower than the current one. Making the
>> datatypes strict also makes no difference.
>
>Where exactly do the !s go and what do they do?

On the datatypes:

data Vector = V !Double !Double !Double

for instance. They tell the compiler to make those fields strict
rather than lazy. This may or may not help things...
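
(If you also want the fields unboxed rather than merely forced, that's
what -funbox-strict-fields or an explicit pragma buys you. A minimal
sketch, assuming the same constructor:

data Vector = V {-# UNPACK #-} !Double
                {-# UNPACK #-} !Double
                {-# UNPACK #-} !Double

The pragma only does anything for fields that are already strict.)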

>> OCaML floats are Doubles, at least on x86.
>
>Yes. OCaml doesn't have a 32-bit float storage format, apart from an
>entirely-float big array. Also, the ML in OCaml doesn't stand for
>metalanguage. ;-)

Point!

>I take it you saw the whole language comparison:
>
> http://www.ffconsultancy.com/languages/ray_tracer/

Yup. I'd been meaning to run off a haskell version for a
while.

>I'll be uploading concurrent implementations ASAP. Haskell should do well
>there... :-)

I'll be on the lookout.

>PS: You spelled my name wrong!

Sorry!

Phil

Derek Elkins

Jun 21, 2007, 1:09:23 PM
to Jon Harrop, haskel...@haskell.org
On Thu, 2007-06-21 at 13:39 +0100, Jon Harrop wrote:
> Awesome stuff!
>
> On Thursday 21 June 2007 12:36:27 Philip Armstrong wrote:
> > On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
> > >Try using floats for the vector, and strict fields (add a ! to the
> > >fields in the data declaration).
> >
> > Because the optimisation page on the haskell wiki is very explicit
> > about never using Float when you can use Double, that's why. An older
> > revision used Float and it was slower than the current one. Making the
> > datatypes strict also makes no difference.
>
> Where exactly do the !s go and what do they do?

> > >That's the simplest possible thing I can think of after about two
> > >seconds of looking anyway. It appears that the ML version uses float
> > >so I don't understand why you would use Double for the Haskell version
> > >at all, and then think you could do any sort of valid comparisons
> > >between them...
> >
> > OCaML floats are Doubles, at least on x86.
>
> Yes. OCaml doesn't have a 32-bit float storage format, apart from an
> entirely-float big array. Also, the ML in OCaml doesn't stand for
> metalanguage. ;-)

To be technical, it should be OCAML.

Mark T.B. Carroll

Jun 21, 2007, 1:29:45 PM
to Philip Armstrong, Haskel...@haskell.org
Philip Armstrong <ph...@kantaka.co.uk> writes:
(snip)

> Because the optimisation page on the haskell wiki is very explicit
> about never using Float when you can use Double, that's why.
(snip)

Is that still true if you use -fexcess-precision ?

-- Mark

peterv

Jun 21, 2007, 2:16:03 PM
to Philip Armstrong, Haskel...@haskell.org, Sebastian Sylvan
So float math is *slower* than double math in Haskell? That is interesting.
Why is that?

BTW, does Haskell support 80-bit "long double"s? The Intel CPU seems to use
that format internally.

Philip Armstrong

Jun 21, 2007, 2:31:08 PM
to Mark T.B. Carroll, Haskel...@haskell.org
On Thu, Jun 21, 2007 at 01:29:56PM -0400, Mark T.B. Carroll wrote:
>Philip Armstrong <ph...@kantaka.co.uk> writes:
>(snip)
>> Because the optimisation page on the haskell wiki is very explicit
>> about never using Float when you can use Double, that's why.
>(snip)
>
>Is that still true if you use -fexcess-precision ?

Why on earth would you use -fexcess-precision if you're using Floats?
The excess precision only applies to Doubles held in registers on x86
IIRC. (If you spill a Double from a register to memory, then you lose
the extra precision bits in the process).

Unless -fexcess-precision with ghc does something completely different
to the analogous gcc setting that is.

Phil

Philip Armstrong

Jun 21, 2007, 2:32:32 PM
to peterv, Haskel...@haskell.org
On Thu, Jun 21, 2007 at 08:15:36PM +0200, peterv wrote:
>So float math is *slower* than double math in Haskell? That is interesting.
>Why is that?
>
>BTW, does Haskell support 80-bit "long double"s? The Intel CPU seems to use
>that format internally.

As I understand things, that is the effect of using -fexcess-precision.

Obviously this means that the behaviour of your program can change
with seemingly trivial code rearrangements, but if you're messing with
floating point numbers then you ought to know what you're doing anyway.
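
(A toy example, not from the ray tracer, of why rearrangement matters
even before excess precision enters the picture: floating point
addition isn't associative.

main :: IO ()
main = do
  let a = 1.0e16 :: Double
      b = -1.0e16
      c = 1.0
  print ((a + b) + c)   -- prints 1.0
  print (a + (b + c))   -- prints 0.0

Excess precision adds a further source of differences, because the same
expression can round differently depending on when intermediates leave
the 80-bit registers.)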

Mark T.B. Carroll

Jun 21, 2007, 3:29:03 PM
to Philip Armstrong, Haskel...@haskell.org
Philip Armstrong <ph...@kantaka.co.uk> writes:
(snip)
> Why on earth would you use -fexcess-precision if you're using Floats?
>> The excess precision only applies to Doubles held in registers on x86
> IIRC. (If you spill a Double from a register to memory, then you lose
> the extra precision bits in the process).

Some googling suggests that point 2 on
http://www.haskell.org/hawiki/FasterFloatingPointWithGhc
might have been what I was thinking of.

-- Mark

Philip Armstrong

Jun 21, 2007, 3:43:17 PM
to Mark T.B. Carroll, Haskel...@haskell.org
On Thu, Jun 21, 2007 at 03:29:17PM -0400, Mark T.B. Carroll wrote:
>Philip Armstrong <ph...@kantaka.co.uk> writes:
>(snip)
>> Why on earth would you use -fexcess-precision if you're using Floats?
>> The excess precision only applies to Doubles held in registers on x86
>> IIRC. (If you spill a Double from a register to memory, then you lose
>> the extra precision bits in the process).
>
>Some googling suggests that point 2 on
>http://www.haskell.org/hawiki/FasterFloatingPointWithGhc
>might have been what I was thinking of.

That's the old wiki. The new one gives the opposite advice! (As does
the ghc manual):

http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
http://www.haskell.org/haskellwiki/Performance/Floating_Point

Phil

Philip Armstrong

Jun 22, 2007, 6:41:07 AM
to haskel...@haskell.org
On Thu, Jun 21, 2007 at 08:42:57PM +0100, Philip Armstrong wrote:
>On Thu, Jun 21, 2007 at 03:29:17PM -0400, Mark T.B. Carroll wrote:
>>Philip Armstrong <ph...@kantaka.co.uk> writes:
>>(snip)
>>>Why on earth would you use -fexcess-precision if you're using Floats?
>>>The excess precision only applies to Doubles held in registers on x86
>>>IIRC. (If you spill a Double from a register to memory, then you lose
>>>the extra precision bits in the process).
>>
>>Some googling suggests that point 2 on
>>http://www.haskell.org/hawiki/FasterFloatingPointWithGhc
>>might have been what I was thinking of.
>
>That's the old wiki. The new one gives the opposite advice! (As does
>the ghc manual):
>
> http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
> http://www.haskell.org/haskellwiki/Performance/Floating_Point

Incidentally, the latter page implies that ghc is being overly
pessimistic when compiling FP code without -fexcess-precision:

"On x86 (and other platforms with GHC prior to version 6.4.2), use
the -fexcess-precision flag to improve performance of floating-point
intensive code (up to 2x speedups have been seen). This will keep
more intermediates in registers instead of memory, at the expense of
occasional differences in results due to unpredictable rounding."

IIRC, it is possible to issue an instruction to the x86 FP unit which
makes all operations work on 64-bit Doubles, even though there are
80-bits available internally. Which then means there's no requirement
to spill intermediate results to memory in order to get the rounding
correct.

Ideally, -fexcess-precision should just affect whether the FP unit
uses 80 or 64 bit Doubles. It shouldn't make any performance
difference, although obviously the generated results may be different.

As an aside, if you use the -optc-mfpmath=sse option, then you only
get 64-bit Doubles anyway (on x86).

cheers, Phil

Simon Marlow

Jun 22, 2007, 8:17:18 AM
to Philip Armstrong, haskel...@haskell.org
Philip Armstrong wrote:
> On Thu, Jun 21, 2007 at 08:42:57PM +0100, Philip Armstrong wrote:
>> On Thu, Jun 21, 2007 at 03:29:17PM -0400, Mark T.B. Carroll wrote:
>
>> That's the old wiki. The new one gives the opposite advice! (As does
>> the ghc manual):
>>
>> http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
>> http://www.haskell.org/haskellwiki/Performance/Floating_Point
>
> Incidentally, the latter page implies that ghc is being overly
> pessimistic when compiling FP code without -fexcess-precision:
>
> "On x86 (and other platforms with GHC prior to version 6.4.2), use
> the -fexcess-precision flag to improve performance of floating-point
> intensive code (up to 2x speedups have been seen). This will keep
> more intermediates in registers instead of memory, at the expense of
> occasional differences in results due to unpredictable rounding."
>
> IIRC, it is possible to issue an instruction to the x86 FP unit which
> makes all operations work on 64-bit Doubles, even though there are
> 80-bits available internally. Which then means there's no requirement
> to spill intermediate results to memory in order to get the rounding
> correct.

For some background on why GHC doesn't do this, see the comment "MORE FLOATING
POINT MUSINGS..." in

http://darcs.haskell.org/ghc/compiler/nativeGen/MachInstrs.hs

The main problem is floats: even if you put the FPU into 64-bit mode, your float
operations will be done at 64-bit precision. There are other technical problems
that we found with doing this; the comment above elaborates.

GHC passes -ffloat-store to GCC, unless you give the flag -fexcess-precision.
The idea is to try to get reproducible floating-point results. The native code
generator is unaffected by -fexcess-precision, but it produces rubbish
floating-point code on x86 anyway.

> Ideally, -fexcess-precision should just affect whether the FP unit
> uses 80 or 64 bit Doubles. It shouldn't make any performance
> difference, although obviously the generated results may be different.
>
> As an aside, if you use the -optc-mfpmath=sse option, then you only
> get 64-bit Doubles anyway (on x86).

You probably want SSE2. If I ever get around to finishing it, the GHC native
code generator will be able to generate SSE2 code on x86 someday, like it
currently does for x86-64. For now, to get good FP performance on x86, you
probably want

-fvia-C -fexcess-precision -optc-mfpmath=sse2

Cheers,
Simon

Simon Marlow

Jun 22, 2007, 8:20:20 AM
to Philip Armstrong, Haskel...@haskell.org
Philip Armstrong wrote:
> On Thu, Jun 21, 2007 at 08:15:36PM +0200, peterv wrote:
>> So float math is *slower* than double math in Haskell? That is
>> interesting.
>> Why is that?
>>
>> BTW, does Haskell support 80-bit "long double"s? The Intel CPU seems
>> to use
>> that format internally.
>
> As I understand things, that is the effect of using -fexcess-precision.
>
> Obviously this means that the behaviour of your program can change
> with seemingly trivial code rearrangements,

Not just code rearrangements: your program will give different results depending
on the optimisation settings, whether you compile with -fvia-C or -fasm, and the
results will be different from those on a machine using fixed 32-bit or 64-bit
precision floating point operations.

Cheers,
Simon

Philip Armstrong

Jun 22, 2007, 9:13:37 AM
to haskel...@haskell.org
On Fri, Jun 22, 2007 at 01:16:54PM +0100, Simon Marlow wrote:
>Philip Armstrong wrote:
>>IIRC, it is possible to issue an instruction to the x86 FP unit which
>>makes all operations work on 64-bit Doubles, even though there are
>>80-bits available internally. Which then means there's no requirement
>>to spill intermediate results to memory in order to get the rounding
>>correct.
>
>For some background on why GHC doesn't do this, see the comment "MORE
>FLOATING POINT MUSINGS..." in
>
> http://darcs.haskell.org/ghc/compiler/nativeGen/MachInstrs.hs

Twisty. I guess 'slow, but correct, with switches to go faster at the
price of correctness' is about the best option.

>You probably want SSE2. If I ever get around to finishing it, the GHC
>native code generator will be able to generate SSE2 code on x86 someday,
>like it currently does for x86-64. For now, to get good FP performance on
>x86, you probably want
>
> -fvia-C -fexcess-precision -optc-mfpmath=sse2

Reading the gcc manpage, I think you mean -optc-msse2
-optc-mfpmath=sse. -mfpmath=sse2 doesn't appear to be an option.

(I note in passing that the ghc darcs head produces binaries from
ray.hs which are about 15% slower than ghc 6.6.1 ones btw. Same
optimisation options used both times.)

cheers, Phil

Claus Reinke

Jun 22, 2007, 10:01:16 AM
to Simon Marlow, haskel...@haskell.org
> -fvia-C -fexcess-precision -optc-mfpmath=sse2

is there, or should there be a way to define -O "profiles" for ghc?
so that -O would refer to the standard profile, -Ofp would refer
to the combination above as a floating point optimisation profile,
other profiles might include things like -funbox-strict-fields, and
-Omy42 would refer to my own favourite combination of flags..

perhaps this should be generalised to ghc flag profiles, to cover
things like '-fno-monomorphism-restriction -fno-mono-pat-binds'
or '-fglasgow-exts -fallow-undecidable-instances' and the like?

just a thought,
claus

Dougal Stanton

Jun 22, 2007, 10:12:30 AM
to Claus Reinke, Simon Marlow, haskel...@haskell.org
On 22/06/07, Claus Reinke <claus....@talk21.com> wrote:

> perhaps this should be generalised to ghc flag profiles, to cover
> things like '-fno-monomorphism-restriction -fno-mono-pat-binds'
> or '-fglasgow-exts -fallow-undecidable-instances' and the like?

You just *know* someone's gonna abuse that to make a genuine
-funroll-loops, right? ;-)

D.

Claus Reinke

Jun 22, 2007, 11:04:32 AM
to Simon Marlow, haskel...@haskell.org
on second thought, user-defined profiles are a two-edged sword,
negating the documentation advantages of in-source flags. better to
handle that in the editor/ide. but predefined flag profiles would still
seem to make sense?

there is something wrong about this wealth of options. it is great
that one has all that control over details, but it also makes it more
difficult to get things right (eg, i was surprised that -O doesn't
unbox strict fields by default). even a formula one driver doesn't
control every lever himself, that's up to the team.

for optimisations, i used to have a simple picture in mind (from
my c days, i guess?), when ghci is no longer fast enough, that is:

no -O: standard executables are fast enough, thank you

-O: standard executables aren't fast enough, do something
about it, but don't bother me with the details

-O2: i need your best _safe_ optimisation efforts, and i'm
prepared to pay for that with longer compilation times

-O3: i need your absolute best optimisation efforts, and i'm
prepared to verify myself that optimisations that cannot
automatically be checked for safety have no serious negative
effect on the results (it would be nice if you told me which
potentially unsafe optimisations you used in compilation)

on top of that, as an alternative to -O3, specific tradeoffs would
be useful, where i specify whether i want to optimize for space
or for time, or which kinds of optimization opportunities the
compiler should pay attention to, such as strictness, unboxing,
floating point ops, etc.. but even here i wouldn't want to give
platform-specific options, i'd want the compiler to choose the
most appropriate options, given my specified tradeoffs and
emphasis, taking into account platform and self-knowledge.

so, i'd say -Ofp, and the compiler might pick:

>> -fvia-C -fexcess-precision -optc-mfpmath=sse2

if i'm on a platform and compiler version where that is an
appropriate selection of flags to get the best floating point
performance. and it might pick a different selection of flags
on a different platform, or with a different compiler version.

> perhaps this should be generalised to ghc flag profiles, to cover
> things like '-fno-monomorphism-restriction -fno-mono-pat-binds'
> or '-fglasgow-exts -fallow-undecidable-instances' and the like?

that is a slightly different story, and it might be useful (a) to
provide flag groups (-fno-mono*) and (b) to specify implication
(just about every language extension flag implies -fglasgow-exts,
so there's no need to specify that again, and there might be
other opportunities for reducing groups of options with a
single maximum in the implication order; one might even
introduce pseudo-flags for grouping, such as -fhaskell2;-).

Philip Armstrong

Jun 22, 2007, 11:37:11 AM
to haskel...@haskell.org
On Thu, Jun 21, 2007 at 01:45:04PM +0100, Philip Armstrong wrote:
>As I said, I've tried the obvious things & they didn't make any
>difference. Now I could go sprinkling $!, ! and seq around like
>confetti but that seems like giving up really.

OK. Looks like I was mistaken. Strictness annotations *do* make a
difference! Humph. Wonder what I was doing wrong yesterday?

Anyway timings follow, with all strict datatypes in the Haskell
version:

Language   File      Time in seconds
Haskell    ray.hs    38.2
OCaml      ray.ml    23.8
g++-4.1    ray.cpp   12.6

(ML & C++ Code from
http://www.ffconsultancy.com/languages/ray_tracer/comparison.html)

Gcc seems to have got quite a bit better since Jon last benchmarked
this code.

Bulat Ziganshin

Jun 22, 2007, 2:32:51 PM
to Philip Armstrong, haskel...@haskell.org
Hello Philip,

Friday, June 22, 2007, 7:36:51 PM, you wrote:
> Language   File      Time in seconds
> Haskell    ray.hs    38.2
> OCaml      ray.ml    23.8
> g++-4.1    ray.cpp   12.6

can you share sourcecode of this variant? i'm interested to see how
much it is obfuscated

btw, *their* measurement said that ocaml is 7% faster :)

--
Best regards,
Bulat mailto:Bulat.Z...@gmail.com


Philip Armstrong

Jun 22, 2007, 2:54:37 PM
to Bulat Ziganshin, haskel...@haskell.org
On Fri, Jun 22, 2007 at 10:11:27PM +0400, Bulat Ziganshin wrote:
>Friday, June 22, 2007, 7:36:51 PM, you wrote:
>> Language   File      Time in seconds
>> Haskell    ray.hs    38.2
>> OCaml      ray.ml    23.8
>> g++-4.1    ray.cpp   12.6
>
>can you share sourcecode of this variant? i'm interested to see how
>much it is obfuscated

http://www.kantaka.co.uk/darcs/ray

The cpp & ml versions are precisely those available from the download
links on http://www.ffconsultancy.com/languages/ray_tracer/comparison.html

The optimisation options I used can be seen in the makefile.

>btw, *their* measurement said that ocaml is 7% faster :)

Indeed. The gcc-4.0 compiled binary runs at about 15s IIRC, but it's
still much better than 7% faster than the ocaml binary.

cheers, Phil

Claus Reinke

Jun 22, 2007, 7:43:03 PM
to haskel...@haskell.org
> http://www.kantaka.co.uk/darcs/ray

try making ray_sphere and intersect' local to intersect,
then drop their constant ray parameter. saves me 25%.
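
a self-contained sketch of the shape i mean (the real ray.hs differs in
detail; the types here are stand-ins modelled on the ocaml version, the
point is only that ray_sphere and intersect' close over the constant
orig/dir instead of taking the ray on every call):

data Vector = V !Double !Double !Double
data Scene  = Sphere !Vector !Double [Scene]   -- centre, radius, children

(-|) :: Vector -> Vector -> Vector
(V a b c) -| (V x y z) = V (a - x) (b - y) (c - z)

dot :: Vector -> Vector -> Double
dot (V a b c) (V x y z) = a*x + b*y + c*z

-- distance only; the real code also returns the surface normal
intersect :: Vector -> Vector -> Double -> Scene -> Double
intersect orig dir = intersect'
  where
    intersect' l (Sphere centre radius children)
      | l' >= l       = l
      | null children = l'
      | otherwise     = foldl intersect' l children
      where l' = ray_sphere centre radius

    ray_sphere centre radius                   -- no ray argument needed
      | t2 > 0    = if t1 > 0 then t1 else t2
      | otherwise = 1/0                        -- infinity: no hit
      where v    = centre -| orig
            b    = dot v dir
            disc = sqrt (b*b - dot v v + radius*radius)
            t1   = b - disc
            t2   = b + disc
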
claus

Jon Harrop

Jun 22, 2007, 10:35:22 PM
to haskel...@haskell.org
On Friday 22 June 2007 19:54:16 Philip Armstrong wrote:
> On Fri, Jun 22, 2007 at 10:11:27PM +0400, Bulat Ziganshin wrote:
> >btw, *their* measurement said that ocaml is 7% faster :)
>
> > Indeed. The gcc-4.0 compiled binary runs at about 15s IIRC, but it's
> still much better than 7% faster than the ocaml binary.

What architecture, platform, compiler versions and compile lines are you
using?

On my 2x 2.2GHz Athlon64 running x64 Debian I now get:

GHC 6.6.1: 26.5s ghc -funbox-strict-fields -O3 ray.hs -o ray
OCaml 3.10.0: 14.158s ocamlopt -inline 1000 ray.ml -o ray
g++ 4.1.3: 8.056s g++ -O3 -ffast-math ray.cpp -o ray

Also, the benchmarks and results that I cited before are more up to date than
the ones you're using. In particular, you might be interested in these faster
versions:

http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml
http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp

For "./ray 6 512", I get:

OCaml: 3.140s ocamlopt -inline 1000 ray.ml -o ray
C++: 2.970s g++ -O3 -ffast-math ray.cpp -o ray

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The OCaml Journal
http://www.ffconsultancy.com/products/ocaml_journal/?e

Donald Bruce Stewart

Jun 23, 2007, 12:41:43 AM
to Jon Harrop, haskel...@haskell.org
jon:

> On Friday 22 June 2007 19:54:16 Philip Armstrong wrote:
> > On Fri, Jun 22, 2007 at 10:11:27PM +0400, Bulat Ziganshin wrote:
> > >btw, *their* measurement said that ocaml is 7% faster :)
> >
> > Indeed. The gcc-4.0 compiled binary runs at about 15s IIRC, but it's
> > still much better than 7% faster than the ocaml binary.
>
> What architecture, platform, compiler versions and compile lines are you
> using?
>
> On my 2x 2.2GHz Athlon64 running x64 Debian I now get:
>
> GHC 6.6.1: 26.5s ghc -funbox-strict-fields -O3 ray.hs -o ray

Don't use -O3, it's *worse* than -O2, and somewhere between -Onot and -O iirc,

ghc -O2 -funbox-strict-fields -fvia-C -optc-O2 -optc-ffast-math -fexcess-precision

Are usually fairly good.

Andrew Coppin

Jun 23, 2007, 3:49:02 AM
to haskel...@haskell.org
Donald Bruce Stewart wrote:
> Don't use -O3, it's *worse* than -O2, and somewhere between -Onot and -O iirc,
>
> ghc -O2 -funbox-strict-fields -fvia-C -optc-O2 -optc-ffast-math -fexcess-precision
>
> Are usually fairly good.
>

Is this likely to be fixed ever?

Philip Armstrong

Jun 23, 2007, 3:54:35 AM
to haskel...@haskell.org
On Sat, Jun 23, 2007 at 12:42:33AM +0100, Claus Reinke wrote:
>>http://www.kantaka.co.uk/darcs/ray
>
>try making ray_sphere and intersect' local to intersect,
>then drop their constant ray parameter. saves me 25%.
>claus

I see: I guess I'm paying for laziness in the first parameter to
intersect' which I don't need. 25% brings it within spitting distance
of the OCaml binary too.

Phil

Philip Armstrong

Jun 23, 2007, 3:55:21 AM
to haskel...@haskell.org
On Sat, Jun 23, 2007 at 08:49:15AM +0100, Andrew Coppin wrote:
>Donald Bruce Stewart wrote:
>>Don't use -O3, it's *worse* than -O2, and somewhere between -Onot and -O
>>iirc,
>
>Is this likely to be fixed ever?

There is at least a bug report for it IIRC.

Phil

Philip Armstrong

Jun 23, 2007, 3:58:29 AM
to haskel...@haskell.org
On Sat, Jun 23, 2007 at 03:28:53AM +0100, Jon Harrop wrote:
>What architecture, platform, compiler versions and compile lines are you
>using?

32-bit x86, Debian unstable, gcc version 4.1.2, OCaml version
3.09.2-9, GHC version 6.6.1, compile line in the Makefile at

http://www.kantaka.co.uk/darcs/ray/Makefile

> http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml
> http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp

I gather these use algorithmic optimisations.

cheers, Phil

Jon Harrop

Jun 23, 2007, 5:44:00 AM
to haskel...@haskell.org
On Saturday 23 June 2007 08:58:10 Philip Armstrong wrote:
> On Sat, Jun 23, 2007 at 03:28:53AM +0100, Jon Harrop wrote:
> >What architecture, platform, compiler versions and compile lines are you
> >using?
>
> 32-bit x86...

Intel or AMD?

> I gather these use algorithmic optimisations.

Both. Versions 1-4 are progressively algorithmic, version 5 includes low-level
optimizations (mostly manual inlining and unrolling). Implementing version 1
is probably also interesting: it is the most concise version.

BTW, the ray tracer leverages the semantics of the float infinity (as the
parameter when there is no intersection). Shouldn't be a problem but some
compiler options might break it.

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The OCaml Journal
http://www.ffconsultancy.com/products/ocaml_journal/?e

Jon Harrop

Jun 23, 2007, 5:51:40 AM
to haskel...@haskell.org
On Saturday 23 June 2007 08:54:11 Philip Armstrong wrote:
> On Sat, Jun 23, 2007 at 12:42:33AM +0100, Claus Reinke wrote:
> >>http://www.kantaka.co.uk/darcs/ray
> >
> >try making ray_sphere and intersect' local to intersect,
> >then drop their constant ray parameter. saves me 25%.
> >claus
>
> I see: I guess I'm paying for laziness in the first parameter to
> intersect' which I don't need. 25% brings it within spitting distance
> of the OCaml binary too.

Can you post the code for this?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The OCaml Journal
http://www.ffconsultancy.com/products/ocaml_journal/?e

Claus Reinke

Jun 23, 2007, 7:05:31 AM
to haskel...@haskell.org
>>>http://www.kantaka.co.uk/darcs/ray
>>
>>try making ray_sphere and intersect' local to intersect,
>>then drop their constant ray parameter. saves me 25%.
>>claus

also try replacing that (foldl' intersect') with (foldr (flip intersect'))!

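(a tiny self-contained sketch; intersect' here is just a stand-in for
the real one, the point is only the shape of the fold:

import Data.List (foldl')

intersect' :: Double -> Double -> Double
intersect' hit x = min hit x

firstHit, firstHit' :: Double -> [Double] -> Double
firstHit  hit scene = foldl' intersect' hit scene          -- original
firstHit' hit scene = foldr (flip intersect') hit scene    -- suggested

both walk the whole list and give the same answer here, the difference
is only in how ghc compiles them.)
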
using a recent ghc head instead of ghc-6.6.1 also seems to
make a drastic difference (wild guess, seeing the unroll 1000
for ocaml: has there been a change to default unrolling in ghc?).

claus

[attachment: Ray1.hs]

Jon Harrop

Jun 23, 2007, 8:00:34 AM
to haskel...@haskell.org

Wow! Now I get:

GHC: 15.8s
OCaml: 14s
g++: 10s

That's very impressive, and more than I was expecting. I think it is worth
noting that you are now comparing optimized Haskell to unoptimized OCaml
though, so the OCaml has more "low-hanging optimization fruit". :-)

At this point, the single most productive optimization for the OCaml is to
evade the polymorphism and closure in the "intersect" function by inlining
the call to "List.fold_left". This makes the OCaml basically as fast as the
C++ on my machine:

GHC: 15.8s
OCaml: 10.6s
g++: 10s

let delta = sqrt epsilon_float

type vec = {x:float; y:float; z:float}
let zero = {x=0.; y=0.; z=0.}
let ( *| ) s r = {x = s *. r.x; y = s *. r.y; z = s *. r.z}
let ( +| ) a b = {x = a.x +. b.x; y = a.y +. b.y; z = a.z +. b.z}
let ( -| ) a b = {x = a.x -. b.x; y = a.y -. b.y; z = a.z -. b.z}
let dot a b = a.x *. b.x +. a.y *. b.y +. a.z *. b.z
let length r = sqrt(dot r r)
let unitise r = 1. /. length r *| r

let rec intersect orig dir (l, _ as hit) (center, radius, scene) =
  let l' =
    let v = center -| orig in
    let b = dot v dir in
    let disc = sqrt(b *. b -. dot v v +. radius *. radius) in
    let t1 = b -. disc and t2 = b +. disc in
    if t2>0. then if t1>0. then t1 else t2 else infinity in
  if l' >= l then hit else match scene with
  | [] -> l', unitise (orig +| l' *| dir -| center)
  | scenes -> intersects orig dir hit scenes
and intersects orig dir hit = function
  | [] -> hit
  | scene::scenes -> intersects orig dir (intersect orig dir hit scene) scenes

let light = unitise {x=1.; y=3.; z= -2.} and ss = 4

let rec ray_trace dir scene =
  let l, n = intersect zero dir (infinity, zero) scene in
  let g = dot n light in
  if g <= 0. then 0. else
    let p = l *| dir +| sqrt epsilon_float *| n in
    if fst (intersect p light (infinity, zero) scene) < infinity then 0. else
      g

let rec create level c r =
  let obj = c, r, [] in
  if level = 1 then obj else
    let a = 3. *. r /. sqrt 12. in
    let aux x' z' = create (level - 1) (c +| {x=x'; y=a; z=z'}) (0.5 *. r) in
    c, 3. *. r, [obj; aux (-.a) (-.a); aux a (-.a); aux (-.a) a; aux a a]

let level, n =
  try int_of_string Sys.argv.(1), int_of_string Sys.argv.(2) with _ -> 6, 512

let scene = create level {x=0.; y= -1.; z=4.} 1.;;

Printf.printf "P5\n%d %d\n255\n" n n;;
for y = n - 1 downto 0 do
  for x = 0 to n - 1 do
    let g = ref 0. in
    for dx = 0 to ss - 1 do
      for dy = 0 to ss - 1 do
        let aux x d = float x -. float n /. 2. +. float d /. float ss in
        let dir = unitise {x=aux x dx; y=aux y dy; z=float n} in
        g := !g +. ray_trace dir scene
      done;
    done;
    let g = 0.5 +. 255. *. !g /. float (ss*ss) in
    Printf.printf "%c" (char_of_int (int_of_float g))
  done;
done

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
The OCaml Journal
http://www.ffconsultancy.com/products/ocaml_journal/?e

Philip Armstrong

Jun 23, 2007, 2:08:14 PM
to Claus Reinke, haskel...@haskell.org
On Sat, Jun 23, 2007 at 12:05:01PM +0100, Claus Reinke wrote:
>>>>http://www.kantaka.co.uk/darcs/ray
>>>
>>>try making ray_sphere and intersect' local to intersect,
>>>then drop their constant ray parameter. saves me 25%.
>>>claus
>
>also try replacing that (foldl' intersect') with (foldr (flip intersect'))!

Thanks guys, this is exactly the kind of advice I was seeking.

OK, next question: Given that I'm using all the results from
intersect', why is the lazy version better than the strict one? Is ghc
managing to do some loop fusion?

>using a recent ghc head instead of ghc-6.6.1 also seems to
>make a drastic difference (wild guess, seeing the unroll 1000
>for ocaml: has there been a change to default unrolling in ghc?).

Um. I tried ghc head on the current version and it was about 15%
*slower* than 6.6.1

Perhaps it does better on the (slightly) optimised version?

Philip Armstrong

Jun 23, 2007, 2:23:56 PM
to Jon Harrop, haskel...@haskell.org
On Sat, Jun 23, 2007 at 10:32:31AM +0100, Jon Harrop wrote:
>On Saturday 23 June 2007 08:58:10 Philip Armstrong wrote:
>> On Sat, Jun 23, 2007 at 03:28:53AM +0100, Jon Harrop wrote:
>> >What architecture, platform, compiler versions and compile lines are you
>> >using?
>>
>> 32-bit x86...
>
>Intel or AMD?

AMD. Athlon 64 3000+ to be precise.

>> > http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml
>> > http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp
>>
>> I gather these use algorithmic optimisations.
>
>Both. Versions 1-4 are progressively algorithmic, version 5 includes low-level
>optimizations (mostly manual inlining and unrolling). Implementing version 1
>is probably also interesting: it is the most concise version.
>
>BTW, the ray tracer leverages the semantics of the float infinity (as the
>parameter when there is no intersection). Shouldn't be a problem but some
>compiler options might break it.

Thus far the Haskell version generates identical images to the OCaml
one.

Phil

Philip Armstrong

Jun 23, 2007, 2:36:50 PM
to Claus Reinke, haskel...@haskell.org
On Sat, Jun 23, 2007 at 07:07:49PM +0100, Philip Armstrong wrote:
>On Sat, Jun 23, 2007 at 12:05:01PM +0100, Claus Reinke wrote:
>>>>>http://www.kantaka.co.uk/darcs/ray
>>>>
>>>>try making ray_sphere and intersect' local to intersect,
>>>>then drop their constant ray parameter. saves me 25%.
>>>>claus
>>
>>also try replacing that (foldl' intersect') with (foldr (flip intersect'))!
>
>Thanks guys, this is exactly the kind of advice I was seeking.
>
>OK, next question: Given that I'm using all the results from
>intersect', why is the lazy version better than the strict one? Is ghc
>managing to do some loop fusion?

Incidentally, replacing foldl' (intersect ray) hit scene with foldr
(flip (intersect ray)) hit scene makes the current version (without
the lifting of ray out of intersect & ray_scene) almost exactly as
fast as the OCaml version on my hardware. That's almost a 40% speedup!

>>using a recent ghc head instead of ghc-6.6.1 also seems to
>>make a drastic difference (wild guess, seeing the unroll 1000
>>for ocaml: has there been a change to default unrolling in ghc?).
>
>Um. I tried ghc head on the current version and it was about 15%
>*slower* than 6.6.1
>
>Perhaps it does better on the (slightly) optimised version?

Nope, just tried it on the foldr version. It's still slower than 6.6.1
(although not by as much: 26s vs 24s for the 6.6.1 binary). This is
ghc head from this Friday.

Jon is probably correct that hoisting ray out is a code level
transformation that makes the Haskell version different to the OCaml
one, but personally I'd suggest that replacing foldl with foldr is
not: the end result is the same & both have to walk the entire list in
order to get a result.

So, on my hardware at least, the Haskell and OCaml version have
equivalent performance. I think that's pretty impressive. Getting
close to the C++ would probably require rather more effort!

Claus Reinke

Jun 24, 2007, 10:54:51 AM
to haskel...@haskell.org
>>also try replacing that (foldl' intersect') with (foldr (flip intersect'))!
> OK, next question: Given that I'm using all the results from
> intersect', why is the lazy version better than the strict one? Is ghc
> managing to do some loop fusion?

haskell tends to prefer foldr where mls prefer foldl, be it for
laziness and short-circuiting operators, or because a tail-recursive
function with a lazy accumulator is only an efficient way to construct
inefficient expressions.

so, the very first thing i tend to look for when someone ports a
program from ml to haskell are tail recursions with non-strict
accumulators. even using foldl', when constructing pairs in the
accumulator, there's no guarantee that the pair components will
be evaluated early even if the pairs themselves are forced. so
replacing foldl with foldr when porting from ml to haskell tends
to be a useful habit, unless there are good reasons for foldl.
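
a minimal illustration, not from the ray tracer, of both points:

import Data.List (foldl')

sumLazy, sumStrict :: [Double] -> Double
sumLazy   = foldl  (+) 0     -- builds one long chain of (+) thunks
sumStrict = foldl' (+) 0     -- forces the running sum at every step

-- foldl' only forces the accumulator to whnf: the pair is forced,
-- but its components still pile up as thunks unless you force them
-- yourself (with seq or bang patterns in step).
sumCount :: [Double] -> (Double, Int)
sumCount = foldl' step (0, 0)
  where step (s, n) x = (s + x, n + 1)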

however, things seem to be a little bit more involved in this
example: intersect' forces the first component, and ignores
the second, never building nasty delayed combinations of old
accumulator and list head in the new accumulator. but if you
compare the outputs of -ddump-simpl, or if you export all
definitions from main and compare the outputs of --show-iface,
you'll find differences related to the result of intersect':
whether or not that result can be passed unboxed.

Constructed Product Result Analysis for Haskell (2000)
http://citeseer.ist.psu.edu/baker-finch00constructed.html

i don't know the details, but in the foldr case, the result of
intersect' seems to be passed unboxed, in the foldl' case, it
isn't. i'll leave it to the experts to explain whether that has to
be the case or whether it is an omission in the optimizer.

claus



>>using a recent ghc head instead of ghc-6.6.1 also seems to
>>make a drastic difference

$ uname -a
CYGWIN_NT-5.1 cr3-lt 1.5.19(0.150/4/2) 2006-01-20 13:28 i686 Cygwin

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.6.1

$ gcc --version
gcc.exe (GCC) 3.4.2 (mingw-special)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ghc --make Ray1.hs
[1 of 1] Compiling Main ( Ray1.hs, Ray1.o )
Linking Ray1.exe ...

$ time ./Ray1.exe >out

real 0m55.705s
user 0m0.015s
sys 0m0.031s

$ /cygdrive/c/fptools/ghc/ghc-6.7.20070613/bin/ghc --make Ray1.hs -o Ray1head.exe
[1 of 1] Compiling Main ( Ray1.hs, Ray1.o )
Linking Ray1head.exe ...

$ time ./Ray1head.exe >out.head

real 0m24.989s
user 0m0.031s
sys 0m0.015s

$ diff -q --binary out out.head

$

Simon Marlow

Jun 26, 2007, 9:29:57 AM
to Philip Armstrong, haskel...@haskell.org
Philip Armstrong wrote:
> On Sat, Jun 23, 2007 at 08:49:15AM +0100, Andrew Coppin wrote:
>> Donald Bruce Stewart wrote:
>>> Don't use -O3 , its *worse* than -O2, and somewhere between -Onot and
>>> -O iirc,
>>
>> Is this likely to be fixed ever?
>
> There is at least a bug report for it IIRC.

It was fixed yesterday.

Cheers,
Simon
