
Ray tracer in Stalin


Jon Harrop

Aug 14, 2005, 11:28:32 AM

I've just done a little benchmarking of the ray tracer written in Scheme and
compiled using Stalin. Here are the results. On x86 (900MHz Athlon T-bird):

$ g++ -O3 -march=athlon-tbird -ffast-math ray.cpp -o ray
$ time ./ray 6 160 >image.pgm
real 0m2.152s

$ mlton ray.sml
$ time ./ray 6 160 >image.pgm
real 0m2.435s

$ ocamlopt -inline 100 -ffast-math ray.ml -o ray
$ time ./ray 6 160 >image.pgm
real 0m3.255s

$ stalin -d0 -d1 -d5 -d6 -On -q -d -architecture IA32-align-double \
    -no-clone-size-limit -split-even-if-no-widening -copt -O2 -copt \
    -fomit-frame-pointer -copt -malign-double ray
$ time ./ray 6 160 >image.pgm
real 0m3.712s

On AMD64 (1.8GHz Athlon64):

$ g++ -O3 -march=athlon-tbird -ffast-math ray.cpp -o ray
$ time ./ray 6 160 >image.pgm
real 0m0.987s

$ mlton ray.sml
$ time ./ray 6 160 >image.pgm
real 0m1.056s

$ ocamlopt -inline 100 -ffast-math ray.ml -o ray
$ time ./ray 6 160 >image.pgm
real 0m1.037s

$ stalin -d0 -d1 -d5 -d6 -On -q -d -architecture IA32-align-double \
    -no-clone-size-limit -split-even-if-no-widening -copt -O2 -copt \
    -fomit-frame-pointer -copt -malign-double ray
$ time ./ray 6 160 >image.pgm
real 0m1.773s

I hadn't expected a simple Lisp or Scheme implementation to be able to
compete in terms of performance but only 70% slower on 32-bit AMD64 when
C++ and OCaml are fully 64-bit is very impressive, IMHO.

--
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com

Matthia...@gmail.com

Aug 14, 2005, 11:54:25 AM

Jon Harrop wrote:
> I hadn't expected a simple Lisp or Scheme implementation to be able to
> compete [...]

Calling Stalin a "simple" implementation has to be /the/ understatement
of the year!

Ray Dillinger

Aug 14, 2005, 12:33:25 PM
Jon Harrop wrote:

> I hadn't expected a simple Lisp or Scheme implementation to be able to
> compete in terms of performance but only 70% slower on 32-bit AMD64 when
> C++ and OCaml are fully 64-bit is very impressive, IMHO.

There are things I don't like about Stalin (related mostly to its
"static" nature and skimpy error handling), but raw speed is one
thing it's terribly good at. A *LOT* of work has gone into the
Stalin compiler to make it compile code that runs fast. It is not
an example of a "simple" implementation.

Bear


Jon Harrop

Aug 14, 2005, 12:46:54 PM

Ooops! Sorry, I didn't mean to say "simple" there at all. Honestly, I wonder
who typed that... ;-)

Jon Harrop

Aug 14, 2005, 12:52:45 PM
Ray Dillinger wrote:
> There are things I don't like about Stalin (related mostly to its
> "static" nature and skimpy error handling), but raw speed is one
> thing it's terribly good at.

As a Lisp/Scheme virgin, I'm finding Stalin much easier going than the other
compilers I've tried (Bigloo, SBCL, CMUCL). Bigloo in particular has the
worst error reporting I've ever seen...

> A *LOT* of work has gone into the
> Stalin compiler to make it compile code that runs fast. It is not
> an example of a "simple" implementation.

Ahem. Yes, sorry about that.

Daniel C. Wang

Aug 14, 2005, 2:44:37 PM (to Jon Harrop)
{stuff deleted}

> I hadn't expected a simple Lisp or Scheme implementation to be able to
> compete in terms of performance but only 70% slower on 32-bit AMD64 when
> C++ and OCaml are fully 64-bit is very impressive, IMHO.


Stalin is a very sophisticated implementation that uses whole-program
compilation algorithms. It's very much like MLton. Chicken Scheme is
what I'd call a "simple" implementation.

Kjetil Svalastog Matheussen

Aug 14, 2005, 6:44:53 PM


It's a shame there's so much difference. It would be interesting
to see what results CMUCL or SBCL give. Have you posted a
request to comp.lang.lisp?

Anyway, perhaps you could try the following options for stalin? I got
better results with these:


-d0 -d1 -d5 -d6 -On -k -Ob -Om -Or -Ot -q -d
-architecture IA32-align-double -no-clone-size-limit -split-even-if-no-widening
-copt -O3 -copt -fomit-frame-pointer -copt -malign-double
-copt -ffast-math -copt -freg-struct-return


--

Jens Axel Søgaard

Aug 14, 2005, 7:24:51 PM
Jon Harrop wrote:

> I hadn't expected a simple Lisp or Scheme implementation to be able to
> compete in terms of performance but only 70% slower on 32-bit AMD64 when
> C++ and OCaml are fully 64-bit is very impressive, IMHO.

It is also worth noting the Stalin code has no type declarations.

--
Jens Axel Søgaard

Jon Harrop

Aug 14, 2005, 7:42:06 PM

Well, it uses "vec" and "list" but I'm not sure what you'd call a type
declaration.

Jon Harrop

Aug 14, 2005, 9:14:48 PM

I just remembered what I meant when I wrote that: "I hadn't expected a
simple implementation of the ray tracer written in Lisp or Scheme to be
able to compete in terms of performance...".

In other words, I was expecting fast Lisp/Scheme implementations of the ray
tracer to be obfuscated compared to the other languages but that doesn't
seem to be the case.

Förster vom Silberwald

Aug 15, 2005, 7:27:49 AM

Jon Harrop wrote:
> I've just done a little benchmarking of the ray tracer written in Scheme and
> compiled using Stalin. Here are the results. On x86 (900MHz Athlon T-bird):

Which Scheme version did you use? Is it the same as the one you used
under Bigloo (except for the +fl, etc. operators)?


Schneewittchen

Förster vom Silberwald

Aug 15, 2005, 7:32:06 AM

There is one snag which hit me: Stalin takes a rather, rather long time
to compile. Maybe this has changed in the meantime. I once used Stalin
on my old laptop under SuSE Linux 8.0, and every program took at least
33 seconds to compile. That was rather tedious, at least for smaller
programs.

Schneewittchen

Förster vom Silberwald

Aug 15, 2005, 7:45:22 AM

Jon Harrop wrote:

>Bigloo in particular has the worst error reporting I've ever seen...

It depends. There are three stages: bigloo -i, bigloo, and bigloo
-Obench, and each will give you different error messages. There is also
the debugger in Bee; okay, I cannot say much about the debugger since I
haven't used it much.

Do you think that OCaml's error messages are better? The problem in
functional code is often that the error itself is not related to the
place where it occurs.

Stalin is indeed an outstanding, unsurpassed project. However, do you
really want to use Stalin in a professional environment? Bigloo's
manpower is as tiny as Stalin's. However, Bigloo has a much better
presentation: home page, documentation, support, etc.

Schneewittchen
PS: Just out of curiosity: could you post on comp.lang.lisp your
observation that CMUCL has a rather weak error-reporting facility? I do
not like Common Lisp, but I always thought there were some rather strong
tools for debugging code.

Rob Thorpe

Aug 15, 2005, 1:15:47 PM

What version of Stalin are you guys using?

I've tried using Stalin 0.10alpha2, with little luck. It can compile
simple things like "hello.sc" and "xhello.sc", but apart from that it
can't compile anything, not even its own benchmarks.

Nothing works if I start specifying C compiler optimization flags.

Jon Harrop

Aug 15, 2005, 4:31:36 PM
Förster vom Silberwald wrote:
> Jon Harrop wrote:
>>Bigloo in particular has the worst error reporting I've ever seen...
>
> It depends. There are three stages: bigloo -i, bigloo, and bigloo
> -Obench. And all will give you different error messages.

I'll check those out, thanks.

> Do you think that OCaml's error messages are better?

Vastly better, yes. Stalin second, g++ third, then MLton and SML/NJ and
finally Bigloo. That may well be because I'm not invoking Bigloo properly
though...

> Stalin is indeed an outstanding unsurpassed project. However, do you
> really want to use Stalin in a professional environment? Bigloo's
> manpower is as tiny as Stalin's. However, Bigloo has a much better
> presentation home-page-wise, documentation, support, etc.

Yes. I am very impressed with Stalin.

> Schneewittchen
> PS: Just out of curiosity: could you post your observation on
> comp.lang.lisp that you found CMUCL has a rather weak error message
> reporting facility?

It isn't the error messages from CMUCL and SBCL that give me a headache so
much as all the extra work required to get reasonable performance. AFAIK,
there is no theoretical reason why the Lisp compilers can't do the
type-inference and static-type-checking optimisations that Stalin does (and
SML/NJ, MLton, OCaml etc.).

They do emit warnings telling the programmer where the type problems are,
but that is no substitute for having the compiler do the work for you.

Jon Harrop

Aug 15, 2005, 4:37:28 PM
Rob Thorpe wrote:
> What version of Stalin are you guys using?

I'm using the Stalin from Debian testing, which identifies itself as 0.10.

> I've tried using Stalin 0.10alpha2, with little luck. It can compile
> simple things like "hello.sc" and "xhello.sc", but apart from that it
> can't compile anything, even it's own benchmarks.
>
> Nothing works if I start specifying C compiler optimization flags.

Weird. I think that is exactly the same version that I'm using. The package
is "0.9+0.10alpha", whatever that means.

qo...@purdue.edu

Aug 15, 2005, 8:32:44 PM
Which Scheme version of ray are you compiling with Stalin?
Is it the version that I posted? If not, you can get much faster
results with Stalin as per my earlier post.

Jon Harrop

Aug 15, 2005, 9:10:09 PM

I'm using the version that you posted. Who are you, BTW?

Jorgen Schaefer

Aug 14, 2005, 11:31:33 PM
Jon Harrop <use...@jdh30.plus.com> writes:

> real 0m2.152s
> real 0m2.435s
> real 0m3.255s
> real 0m3.712s
> real 0m0.987s
> real 0m1.056s
> real 0m1.037s
> real 0m1.773s

Are you sure you want to compare run times which differ in amounts
of under one second?

Greetings,
-- Jorgen

--
((email . "for...@forcix.cx") (www . "http://www.forcix.cx/")
(gpg . "1024D/028AF63C") (irc . "nick forcer on IRCnet"))
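[Editor's note: Jorgen's caution is well taken; single runs that differ by fractions of a second are dominated by scheduler and cache noise. One common remedy, which the thread later adopts informally ("best of three"), is to time several runs and keep the fastest. A minimal harness, sketched here in Python purely as an illustration (the command being timed is a stand-in, not from the thread):]

```python
import subprocess
import sys
import time

def best_of(cmd, runs=3):
    """Run `cmd` several times and return the fastest wall-clock time.

    Taking the minimum over several runs damps out transient system noise,
    which matters when the timings being compared differ by under a second.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    # Hypothetical example: time interpreter start-up as a stand-in workload.
    print(f"{best_of([sys.executable, '-c', 'pass']):.3f}s")
```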

qo...@purdue.edu

Aug 16, 2005, 5:52:06 AM
I wrote Stalin.

Please send me the exact code that you ran. I believe that I can make
it considerably faster. It appears that you modified the code that I
posted to take command line arguments. If one is not careful, this
could slow things down considerably.

Förster vom Silberwald

Aug 16, 2005, 6:10:59 AM

Jon: Is it possible to post it here? Please!

Schneewittchen

Jon Harrop

Aug 16, 2005, 9:02:53 AM

Sure. Here's the code and some new (longer) timings.

Here are the compile lines I used on x86:

ocamlopt -inline 100 -ffast-math ray.ml -o ray
g++ -O3 -march=athlon-tbird -ffast-math ray.cpp -o ray
mlton ray.sml
javac ray.java
stalin -d0 -d1 -d5 -d6 -On -q -d -architecture IA32-align-double \
  -no-clone-size-limit -split-even-if-no-widening -copt -O2 -copt \
  -fomit-frame-pointer -copt -malign-double qobi

Running times for level=9, n=512 and ss=4 on 900MHz Athlon T-bird:

46.181s OCaml
33.086s C++ (gcc)
35.298s SML (MLton)
147.242s Java (Sun JDK 1.5)
73.976s Scheme (Stalin)

Compile times:

0.349s OCaml
5.205s C++ (gcc)
17.093s SML (MLton)
7.361s Java (Sun JDK 1.5)
94.647s Scheme (Stalin)


Here are the compile lines I used on AMD64:

ocamlopt -inline 100 ray.ml -o ray
g++ -O3 ray.cpp -o ray
mlton ray.sml
javac ray.java
stalin -d0 -d1 -d5 -d6 -On -q -d -architecture IA32-align-double \
  -no-clone-size-limit -split-even-if-no-widening -copt -O2 -copt \
  -fomit-frame-pointer -copt -malign-double qobi

For SBCL: (load (compile-file "nathan.lisp"))

Running times for level=9, n=512 and ss=4 on 1.8GHz Athlon64:

12.154s OCaml
12.226s C++ (gcc)
12.407s SML (MLton)
21.616s Java (Sun JDK 1.5)
26.867s Scheme (Stalin)
56.366s Lisp (SBCL)

MLton, Stalin and SBCL are run from 32-bit mode. All others from 64-bit
mode.

Compile times:

1.378s OCaml
2.279s C++ (gcc)
7.786s SML (MLton)
2.650s Java (Sun JDK 1.5)
94.647s Scheme (Stalin)
0.445s Lisp (SBCL)

Here's the Stalin implementation "qobi.sc" that I used:

(define-structure point x y z)

(define-structure obj c r objs)

(define-structure hit lam normal)

(define fail #f)

(define (fold-right f i l)
(if (null? l) i (f (car l) (fold-right f i (cdr l)))))

(define infinity (/ 1.0 0.0))

(define epsilon 1e-15)

(define (s*v s b)
(make-point (* s (point-x b)) (* s (point-y b)) (* s (point-z b))))

(define (v+v a b)
(make-point (+ (point-x a) (point-x b))
(+ (point-y a) (point-y b))
(+ (point-z a) (point-z b))))

(define (v-v a b)
(make-point (- (point-x a) (point-x b))
(- (point-y a) (point-y b))
(- (point-z a) (point-z b))))

(define (cpsv-v a b receiver)
(receiver (- (point-x a) (point-x b))
(- (point-y a) (point-y b))
(- (point-z a) (point-z b))))

(define (dot a b)
(+ (* (point-x a) (point-x b))
(* (point-y a) (point-y b))
(* (point-z a) (point-z b))))

(define (dot1 ax ay az b)
(+ (* ax (point-x b)) (* ay (point-y b)) (* az (point-z b))))

(define (dot2 ax ay az bx by bz) (+ (* ax bx) (* ay by) (* az bz)))

(define (unitise r) (s*v (/ 1.0 (sqrt (dot r r))) r))

(define (ray-sphere orig dir center radius)
(cpsv-v
center
orig
(lambda (vx vy vz)
(let* ((b (dot1 vx vy vz dir))
(disc (+ (- (* b b) (dot2 vx vy vz vx vy vz)) (* radius radius))))
(if (< disc 0.0)
infinity
(let ((t2 (+ b (sqrt disc))))
(if (< t2 0.0)
infinity
(let ((t1 (- b (sqrt disc))))
(if (> t1 0.0) t1 t2)))))))))

(define zero (make-point 0.0 0.0 0.0))

(define (intersect orig dir obj)
(let lp ((obj obj) (hit (make-hit infinity zero)))
(let ((l (ray-sphere orig dir (obj-c obj) (obj-r obj))))
(cond ((>= l (hit-lam hit)) hit)
((pair? (obj-objs obj)) (fold-right lp hit (obj-objs obj)))
((null? (obj-objs obj))
(make-hit l (unitise (v+v orig (v-v (s*v l dir) (obj-c obj))))))
(else (display "Bad obj") (display obj) (newline))))))

(define neg_light (unitise (make-point 1.0 3.0 -2.0)))

(define orig (make-point 0.0 0.0 -4.0))

(define (ray-trace dir scene)
(let* ((hit (intersect orig dir scene))
(lam (hit-lam hit))
(normal (hit-normal hit)))
(cond ((>= lam infinity) 0.0)
((< (hit-lam
(intersect
(v+v orig (v+v (s*v lam dir) (s*v (sqrt epsilon) normal)))
neg_light
scene))
infinity)
0.0)
(else (max 0.0 (dot normal neg_light))))))

(define (create level c r)
(let ((obj (make-obj c r '()))
(a (* 3.0 (/ r (sqrt 12.0)))))
(if (= level 1)
obj
(let ((aux (lambda (x z)
             (create (- level 1) (v+v c (make-point x a z)) (* 0.5 r)))))
(make-obj c
(* 3.0 r)
(list obj
(aux (- a) (- a))
(aux a (- a))
(aux (- a) a)
(aux a a)))))))

(define level 6)
(define level
  (or (and (= (length argv) 3) (string->number (vector-ref argv 1))) 9))

(define n 160)
(define n
  (or (and (= (length argv) 3) (string->number (vector-ref argv 2))) 512))

(define ss 4)

(define scene (create level (make-point 0.0 -1.0 0.0) 1.0))

(define ss2 (* ss ss))

(define (aux x d) (+ (- x (/ n 2.0)) (/ d ss)))

(define (g x y)
(do ((dx 0 (+ dx 1))
(sum 0 (do ((dy 0 (+ dy 1))
(sum sum (+ sum
(ray-trace (unitise
(make-point (aux x dx)
(aux (- (- n 1) y) dy)
(exact->inexact n)))
scene))))
((>= dy ss) sum))))
((>= dx ss) sum)))

(define (pixel x y)
  (write-char
   (integer->char
    (inexact->exact (truncate (+ 0.5 (* 255 (/ (g x y) ss2))))))))

(define n2 n)

(define (go)
(with-output-to-file "scheme.pgm"
(lambda ()
(display "P5")
(newline)
(display n)
(display " ")
(display n)
(newline)
(display 255)
(newline)
(do ((y 0 (+ y 1))) ((>= y n2))
(do ((x 0 (+ x 1))) ((>= x n))
(pixel x y))))))

(go)
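[Editor's note: the heart of the listing above is ray-sphere, which solves the usual quadratic for a ray/sphere intersection and returns the nearest positive root, or infinity for a miss. A direct transliteration into Python, given here purely to cross-check the arithmetic (the function and variable names are mine, not from the post):]

```python
import math

INFINITY = float("inf")

def ray_sphere(orig, d, center, radius):
    """Nearest positive intersection distance of a ray with a sphere,
    or infinity if the ray misses. Mirrors the Scheme ray-sphere: b is
    the projection of (center - orig) onto the unit direction d, and
    disc is the discriminant of |orig + t*d - center|^2 = radius^2."""
    vx = center[0] - orig[0]
    vy = center[1] - orig[1]
    vz = center[2] - orig[2]
    b = vx * d[0] + vy * d[1] + vz * d[2]
    disc = b * b - (vx * vx + vy * vy + vz * vz) + radius * radius
    if disc < 0.0:
        return INFINITY            # ray misses the sphere entirely
    t2 = b + math.sqrt(disc)
    if t2 < 0.0:
        return INFINITY            # sphere lies behind the ray origin
    t1 = b - math.sqrt(disc)
    return t1 if t1 > 0.0 else t2

# From the scene's camera at (0, 0, -4), a unit sphere at the origin is
# hit at distance 3 along +z:
# ray_sphere((0.0, 0.0, -4.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 1.0) → 3.0
```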

Kjetil Svalastog Matheussen

Aug 16, 2005, 10:20:43 AM

These are not optimal options for stalin. You could at least
use the same gcc options as for g++: -O3 -march=athlon-tbird
-ffast-math


> Running times for level=9, n=512 and ss=4 on 1.8GHz Athlon64:
>
> 12.154s OCaml
> 12.226s C++ (gcc)
> 12.407s SML (MLton)
> 21.616s Java (Sun JDK 1.5)
> 26.867s Scheme (Stalin)
> 56.366s Lisp (SBCL)
>
> MLton, Stalin and SBCL are run from 32-bit mode. All others from 64-bit
> mode.
>
> Compile times:
>
> 1.378s OCaml
> 2.279s C++ (gcc)
> 7.786s SML (MLton)
> 2.650s Java (Sun JDK 1.5)
> 94.647s Scheme (Stalin)
> 0.445s Lisp (SBCL)

Seems like a cut-and-paste error. Same compile time for Stalin on
both machines?


--


Jon Harrop

Aug 16, 2005, 1:00:04 PM
Kjetil Svalastog Matheussen wrote:
> These are not optimal options for stalin. You could at least
> use the same gcc-options as for g++: -O3 -march=athlon-tbird
> --fast-math

Good point!

> Seems like a cut and paste error. Same compile time for stalin on
> both machines?

Oops. I forgot to measure the compile time on x86. Here are some new results
for Stalin on x86:

Compile line:

time stalin -d0 -d1 -d5 -d6 -On -q -d -architecture IA32-align-double \
  -no-clone-size-limit -split-even-if-no-widening -copt -O3 -copt \
  -fomit-frame-pointer -copt -malign-double -copt -march=athlon-tbird -copt \
  -ffast-math qobi

Running time:

64.191s

Compile time:

574.331s

Förster vom Silberwald

Aug 19, 2005, 6:41:40 AM
Jon Harrop wrote:

> Yes. I am very impressed with Stalin.

The Scheme community is happy that there are many Scheme
implementations, and they will not raise their brows if you do not like
any specific Scheme implementation. Common Lispers call it a bug.
However, we call it a feature.

I have 3 missions:

a) Man must believe in God

b) Scheme is highly recommended for *scientific numerical calculations*

c) EVERY Scheme implementation has a lot to offer. The Scheme language
standard is a beauty in itself. Bigloo, for example, goes further and
will give you pattern matching:

==
(module test)

(define (map2e::obj fun::obj lis1::pair-nil lis2::pair-nil)
   (match-case (list lis1 lis2)
      [(() ()) '()]
      [((?h1 ???t1) (?h2 ???t2))
       (cons (fun h1 h2) (map2e fun t1 t2))]
      [(?- ?-) (error "map2e" "invalid list argument" (list lis1 lis2))]))

(print (map2e (lambda (x y) (+ x y))
              (list 2 34)
              (list 2 3)))
==

You shouldn't go with the impression Scheme is a tiny standard.

Corratec_Team_Racer
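[Editor's note: as an aside, the two-list map that the match-case above implements, including the error case for mismatched lists, can be sketched in Python; this is my own illustration of the semantics, not code from the thread:]

```python
def map2(fun, lis1, lis2):
    """Map a binary function over two lists of equal length,
    signalling an error on mismatched lengths, like the match-case
    example: empty/empty, head/tail on both, otherwise an error."""
    if not lis1 and not lis2:
        return []
    if lis1 and lis2:
        h1, *t1 = lis1
        h2, *t2 = lis2
        return [fun(h1, h2)] + map2(fun, t1, t2)
    raise ValueError(f"map2: invalid list arguments {lis1!r} {lis2!r}")

# map2(lambda x, y: x + y, [2, 34], [2, 3]) → [4, 37]
```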

Ulrich Hobelmann

Aug 19, 2005, 7:29:14 AM
Förster vom Silberwald wrote:
> Jon Harrop wrote:
>
>> Yes. I am very impressed with Stalin.
>
> The Scheme community is happy that there are many Scheme
> implementations and they will not raise their brows if you do not like
> any specific Scheme implementation. Common Lisper call it a bug.
> However, we call it a feature.

Actually Common Lisp has lots of different implementations, too.
Lispers complain that many useful features aren't present (in a
standardized way) in many Scheme implementations.

I haven't done much Scheme in a while, maybe the SRFIs changed something
about this.

> I have 3 missions:
>
> a) Man must believe in God

What man? What god? What about women? What about pagan gods, or about
one nature god?

> b) Scheme is highly recommended for *scientific numerical calculations"

But probably not as fast as Fortran for that. Yet.

> c) EVERY Scheme implementation has a lot to offer. The Scheme language
> standard is a beauty in itslef. Although, Bigloo for example goes
> farther and will give you pattern-matching for example:

It would be even more beautiful with a standard module system and
separate compilation features.

> You shouldn't go with the impression Scheme is a tiny standard.

But it is. Sure, there are the SRFIs, and they're getting widely
implemented, but not all of them in all implementations.

The biggest problem is that there is no standard way to ship Scheme
modules, because there is no module system.

--
I believe in Karma. That means I can do bad things to people
all day long and I assume they deserve it.
Dogbert

Jon Harrop

Aug 19, 2005, 7:48:17 AM
Ulrich Hobelmann wrote:

> Förster vom Silberwald wrote:
>> b) Scheme is highly recommended for *scientific numerical calculations"
>
> But probably not as fast as Fortran for that. Yet.

That depends what you're doing. For most interesting problems, it simply
isn't feasible to implement them in Fortran (it is too tedious and
error-prone to be done by hand).

Matthias Buelow

Aug 19, 2005, 8:33:23 AM
Förster vom Silberwald <chain...@hotmail.com> wrote:

[...]

>a) Man must believe in God

>c) The Scheme language


>standard is a beauty in itslef.

> (map2e::obj fun::obj lis1::pair-nil lis2::pair-nil)


> (match-case (list lis2 lis2)
> [(() ()) '()]
> [((?h1 ???t1) (?h2 ???t2))

Are you being ironic? The above doesn't even look very much like
Scheme anymore, let alone "beautyful"..

mkb.

David Golden

Aug 19, 2005, 9:11:38 AM
>> I have 3 missions:
>>
>> a) Man must believe in God
>
> What man? What god? What about women? What about pagan gods, or
> about one nature god?
>

I find there's a certain kind of person (or at least an outlook),
henceforth "tosser", that just wants other people to believe in _some_
"higher power" other than themselves. In concrete terms, imagine a
christian who'd rather deal with a muslim or even a sun-worshipper than
an atheist - i.e. they only want people to have "freedom" of choice of
a religion at most, rather than the important freedom from religion.

Some atheists, understandably, might have some difficulty trusting
people who determine their actions through their assumptions about the
whims of their imaginary friends in the sky. But apparently the
tosser has difficulty trusting people who don't. Might be about
psychological patterns of submission to authority and/or something like
the atheist would rather deal with people who have demonstrated
capacity for reason and love, the tosser would rather deal with people
who have demonstrated capacity for obedience and faith.

Or maybe it's just that it might be easier (I have no data) to convert
someone from one faith to another* (typically the tosser's), than from
no faith to a faith - i.e. the viral meme complex of the tosser wants a
hospitable environment, and one with potential hosts that are
susceptible to similar viral meme complexes is preferred to one without
such potential hosts where propagation is therefore impossible, even if
it means the meme complex has to compete with others for control of
hosts... :-)

*e.g. similar to how the christians in Ireland just demoted the celtic
pantheon to become the early irish saints by the appropriate string
substitutions in the legends, or muslims regarding Jesus as a
prophet...

Marco Antoniotti

Aug 19, 2005, 10:58:53 AM

But the following looks very much like Common Lisp


(use-package "UNIFY")

(defun map2 (fun list-1 list-2)
  (match-case (list list-1 list-2)
    ((() ()) ())
    (((?h1 . ?t1) (?h2 . ?t2))
     (cons (funcall fun h1 h2) (map2 fun t1 t2)))
    ((_ _) (error "Invalid list arguments to map2: ~S ~S." list-1 list-2))))

Not only does it look like Common Lisp; it is written in Common Lisp and
runs on every decent Common Lisp system.

http://common-lisp.net/project/cl-unification

Cheers
--
Marco

Ulrich Hobelmann

Aug 19, 2005, 11:52:48 AM
David Golden wrote:
> Or maybe it's just that it might be easier (I have no data) to convert
> someone from one faith to another* (typically the tosser's), than from
> no faith to a faith - i.e. the viral meme complex of the tosser wants a

But everybody knows that (flame-)wars are the worst between different
sects and similar religions. See Linux vs BSD (ok, the BSD side doesn't
really care :D), Lisp vs Scheme (ok, Lispers in general are quite
peaceful), vi vs Emacs, GNU Emacs vs XEmacs, square brackets vs
parentheses in Scheme ;)

> hospitable environment, and one with potential hosts that are
> susceptible to similar viral meme complexes is preferred to one without
> such potential hosts where propagation is therefore impossible, even if
> it means the meme complex has to compete with others for control of
> hosts... :-)

I think it's easiest to infect a clean host, i.e. one without clear
religious preference yet. Atheists often ARE very strongly opinionated,
so hard to convert, as are firm Muslims, Christians etc.

> *e.g. similar to how the christians in Ireland just demoted the celtic
> pantheon to become the early irish saints by the appropriate string
> substitutions in the legends, or muslims regarding Jesus as a
> prophet...

Hm, maybe some form of tolerance/subsumption that made it easier (i.e.
was necessary) to establish their religions... Like the Mac
implementing Unix in order to be accepted by programmer hordes.

Jon Harrop

Aug 19, 2005, 12:25:59 PM
Förster vom Silberwald wrote:
> (define
> (map2e::obj fun::obj lis1::pair-nil lis2::pair-nil)
> (match-case (list lis2 lis2)
> [(() ()) '()]
> [((?h1 ???t1) (?h2 ???t2))
> (cons
> (fun h1 h2)
> (map2 fun t1 t2))]
> [(?- ?-) (error "map2" "invalid list argument" (list lis lis2))]))

In Mathematica:

Map2[f_, {}, {}] := {}
Map2[f_, {h1_, t1___}, {h2_, t2___}] :=
  Prepend[Map2[f, {t1}, {t2}], f[h1, h2]]

I thought the "?h1" vs "h1_" and "???t1" vs "t1___" was quite
interesting. :-)

Matthias Buelow

Aug 19, 2005, 12:57:24 PM
Ulrich Hobelmann <u.hob...@web.de> wrote:

>I think it's easiest to infect a clean host, i.e. one without clear
>religious preference yet. Atheists often ARE very strongly opinionated,
>so hard to convert, as are firm Muslims, Christians etc.

We're looking at this tinted by our western glasses, where we're
used to meeting haunted-looking people clothed in shades of grey
at 9am on Sundays ringing at our door, trying to peddle magazines
called "Watchtower" and with a pained smile trying to convince you
of the pleasures of Jesus and Kingdom Come; something which is a
quite annoying but even more droll experience.

I'd say that when looking at the whole planet, religious beliefs
are not spread like a mental infection from person to person by
means of treacherous persuasion but rather unconditionally imposed
by priests upon the newborn, and, through mass psychosis, enforced
by whipping, stoning and leaving in the desert to die against
unfaithful individuals.

mkb.

Matthias Buelow

Aug 19, 2005, 12:58:22 PM
Marco Antoniotti <mar...@cs.nyu.edu> wrote:

>Not only it looks like Common Lisp. It is written in Common Lisp and it
>runs on every decent Common Lisp system.

And your point is?

mkb.

Marco Antoniotti

Aug 19, 2005, 1:05:03 PM

That you can have your cake and eat it too :)

Cheers
--
marco

Förster vom Silberwald

Aug 20, 2005, 4:04:16 AM

David Golden wrote:

> I find there's a certain kind of person (or at least an outlook),
> henceforth "tosser", that just wants other people to believe in _some_
> "higher power" other than themselves. In concrete terms, imagine a
> christian who'd rather deal with a muslim or even a sun-worshipper than
> an atheist - i.e. they only want people to have "freedom" of choice of
> a religion at most, rather than the important freedom from religion.

I went through all your concerns when I was younger (I am now 31).

However, I resigned from the church (Catholic, to be specific). In the
eyes of the church I am an atheist. How silly from their perspective.

Although, the Pope is right most of the time. When I was 18 or 21 I
happened to deride the Pope, too.

Things are very complicated. The criticism of Western civilisation and
its cult only holds at a first attempt. It isn't that easy.

Schneewittchen

Förster vom Silberwald

Aug 20, 2005, 4:07:18 AM
Marco Antoniotti wrote:
> But the following looks very much like Common Lisp

> (use-package "UNIFY")

> (defun map2 (fun list-1 list-2)
> (match-case (list list-1 list-2)
> ((() ()) ())
> (((?h1 . ?t1) (?h2 . ?t2))
> (cons (funcall fun h1 h2) (map2 fun t1 t2)))
> ((_ _) (error "Invalid list arguments to map2: ~S ~S." list-1
> list-2)))

That is very interesting. Thanks for the pointer. I didn't know that
Common Lisp had this inherent potential.

It would be interesting to hear whether Bigloo pinched this from Common
Lisp (or from that package), or whether the "UNIFY" package and Bigloo's
pattern-matching implementation derive from the same source in the
literature.

I must honestly say that I do not feel the urge for pattern matching in
Scheme very often. However, I find the Bigloo pattern-matching facility
cool (okay, I must honestly say: for the very first time I find Common
Lisp cool, too).

Beauty is in the eye of the beholder.

Schneewittchen
PS: Sorry, I didn't know that this would also appear on comp.lang.lisp
(I am writing from comp.lang.scheme, though).

qo...@purdue.edu

Aug 28, 2005, 3:38:21 PM
Recent posts contained some benchmark comparisons between variants of a
ray tracer written in C++, OCaml, SML, and Scheme, as compiled by g++,
ocamlopt, MLton, and Stalin. These results appeared to show that Stalin
was slower than g++, ocamlopt, and MLton. However, those benchmarks
appear to be flawed.
http://www.ffconsultancy.com/free/ray_tracer/languages.html contains 4
different variants of the ray tracer, each implemented in C++, Java,
OCaml, and SML. The different variants use different algorithms that
result in significantly different run times within each language. The
previously posted Scheme version corresponds to the first (and slowest)
variant. But it appears that it was benchmarked against other variants
with g++, ocamlopt, and MLton.

I have conducted as close to an apples-to-apples comparison as I can. I
wrote all 4 variants of the benchmark in Scheme. All of the source code
and scripts needed to run this benchmark are in

http://www-bcl.cs.nuim.ie/~qobi/ray.tgz

Times are u+s as reported by tcsh time under Debian Sarge/stable on a
1.4GHz IBM Thinkpad X40 with 1.5G RAM and 40G disk plugged into AC.
g++, OCaml, MLton, and Stalin are all the versions under Debian
Sarge/stable.

# apt-get install ocaml-nox
# apt-get install mlton
# apt-get install stalin

Running:

% ./run

yields:

Compile times
ray1 ray2 ray3 ray4
g++ 1.031 1.092 1.124 1.118
OCaml 0.304 0.320 0.329 0.340
MLton 3.691 3.696 3.784 3.789
Stalin 14.693 15.506 14.761 14.991

Run times (best of three)
ray1 ray2 ray3 ray4
g++ 17.114 13.029 10.646 10.442
OCaml 24.653 18.279 15.777 14.787
MLton 15.751 12.308 10.772 10.403
Stalin 11.642 8.293 7.525 7.654

Run times relative to Stalin
ray1 ray2 ray3 ray4
g++ 1.470 1.570 1.414 1.364
OCaml 2.117 2.204 2.096 1.932
MLton 1.353 1.483 1.431 1.359

On average, Stalin is 45% faster than g++, 108% faster than OCaml,
and 40% faster than MLton.

Notes:
1. ray[1234] correspond to:
1. Minimal
2. Tighter bounding volumes
3. Partially specialized shadow intersection
4. Fully specialized shadow intersection
2. ray[1234].{cpp,ml,sml} were taken verbatim from:
http://www.ffconsultancy.com/free/ray_tracer/languages.html
3. These 16 programs do not all produce the same pgm file. The pgm
files fall
into the following three equivalence classes:
image[1234]{cpp,ml}.pgm
image[1234]sml.pgm
image[1234]sc.pgm
I do not know enough about ray tracing, C++, ML, and SML to track
down the
source of the differences. Different results make these benchmarks
suspect.
One needs to verify that these programs all compute the same thing.
4. I wrote ray[1234].sc so that they match the corresponding
ray[1234].sml
in identifier names and structure. More optimization is possible on
the
Scheme versions. I have refrained from doing so to keep the code
relatively
clean and in correspondence with the SML code. The only hand
optimization
is in cpsv-v, dot1, dot2, the use of atoi instead if
string->number, and
in manually inlining some instances of fold.
5. There are many gratuitous differences between ray[12].sml,
   ray[23].sml, and ray[34].sml. I have written ray[1234].sc so that if
   you do:
       % diff ray[12].sc
       % diff ray[23].sc
       % diff ray[34].sc
   you see the minimal set of substantive changes between the variants.
6. I believe that one can find other sets of options to Stalin (or the
   other compilers) to get better results. And the optimal choice of
   options may vary according to the hardware you are running on and the
   versions of the software you are running.
7. I cannot get the Java versions to run with gcj. I do not have JDK 1.5
   installed on my laptop so I cannot include its performance in these
   results.
8. The g++ runs used the options:
-O3
-fomit-frame-pointer
-freg-struct-return
-malign-double
-march=prescott
-ffast-math

The OCaml runs used the options:
-inline 100
-ffast-math

The Stalin runs used the options:
-Ob
-Om
-On
-Or
-Ot
-dP
-dC
-dh
-q
-d
-db


-architecture IA32-align-double
-no-clone-size-limit
-split-even-if-no-widening
-copt -O3
-copt -fomit-frame-pointer

-copt -freg-struct-return
-copt -malign-double
-copt -march=prescott
-copt -ffast-math

Jon Harrop

Aug 29, 2005, 11:51:09 AM
qo...@purdue.edu wrote:
> ...However, those benchmarks appear to be flawed...The previously posted
> Scheme version corresponds to the first (and slowest) variant. But it
> appears that it was benchmarked against other variants with g++,
> ocamlopt, and MLton.

No, I only compared the least optimised versions in all languages (I just
checked).

> Times are u+s as reported by tcsh time under Debian Sarge/stable on a
> 1.4GHz IBM Thinkpad X40 with 1.5G RAM and 40G disk plugged into AC.

Intel Pentium 4?

> I do not know enough about ray tracing, C++, ML, and SML to track down
> the source of the differences. Different results make these benchmarks
> suspect.

Yes. They should produce identical output. The most likely source of
discrepancy is floating point arithmetic. I believe they all produce
identical output on my machine but I'll check again.

> One needs to verify that these programs all compute the same thing.
> 4. I wrote ray[1234].sc so that they match the corresponding
>    ray[1234].sml in identifier names and structure. More optimization
>    is possible on the Scheme versions. I have refrained from doing so
>    to keep the code relatively clean and in correspondence with the SML
>    code. The only hand optimization is in cpsv-v, dot1, dot2, the use
>    of atoi instead of string->number, and in manually inlining some
>    instances of fold.

The other (first) implementations are essentially completely unoptimised so
it is (arguably) cheating to manually optimise the Stalin code and not the
other implementations. IIRC, I included some low-level optimisations in the
later versions of the code, along with the algorithmic optimisations.

> 6. I believe that one can find other sets of options to Stalin (or the
>    other compilers) to get better results. And the optimal choice of
>    options may vary according to the hardware you are running on and
>    the versions of the software you are running.

Can you recommend any changes to the command-line arguments for AMD
hardware?

> 7. I cannot get the Java versions to run with gcj. I do not have JDK
>    1.5 installed on my laptop so I cannot include its performance in
>    these results.

GCJ gets appalling performance so you might as well not bother with it. JDK
1.5 is only used to calculate the value of delta. To use a pre-1.5 JDK,
just hardwire a floating point constant in there.

I suspect the differences (although they are very big) are simply due to
Intel vs AMD. Several other people have reported wildly different timings
on Intel hardware but I only have AMD hardware here...

I just ran your script and it failed on C++ and Stalin because of
-march=prescott, so I've changed it for my Athlon. The tabulate program
then produced lots of errors but that's probably because it was expecting
numerical input and got error messages. I'll try again...

Ulrich Hobelmann

Aug 29, 2005, 3:19:17 PM
Jon Harrop wrote:
>> Times are u+s as reported by tcsh time under Debian Sarge/stable on a
>> 1.4GHz IBM Thinkpad X40 with 1.5G RAM and 40G disk plugged into AC.
>
> Intel Pentium 4?

I think it's a Centrino (Pentium M). They have very poor FPU
performance (haven't been following the thread much, so I don't know if
that makes a difference).

> GCJ gets appalling performance so you might as well not bother with it.
> JDK 1.5 is only used to calculate the value of delta. To use a pre-1.5
> JDK, just hardwire a floating point constant in there.
>
> I suspect the differences (although they are very big) are simply due
> to Intel vs AMD. Several other people have reported wildly different
> timings on Intel hardware but I only have AMD hardware here...

Generally the AMDs have quite consistent performance I'd expect. P4s
can be really fast when the code is optimized for it, otherwise not so
much. Centrinos are good, except for FPU or SSE code.

--
My ideal for the future is to develop a filesystem remote interface
(a la Plan 9) and then have it implemented across the Internet as
the standard rather than HTML. That would be ultimate cool.
Ken Thompson

Jon Harrop

Aug 29, 2005, 6:06:21 PM

I just successfully ran your programs on my 900MHz Athlon t-bird and I get
similar results to yours:

Compile times
ray1 ray2 ray3 ray4

g++ 2.260 2.340 2.389 2.380
OCaml 0.340 0.350 0.340 0.350
MLton 9.260 9.400 9.409 9.529
Stalin 37.759 41.039 38.700 38.820

Run times (best of three)
ray1 ray2 ray3 ray4

g++ 26.119 19.000 15.690 15.350
OCaml 36.770 27.649 25.230 23.129
MLton 26.729 21.019 18.149 17.459
Stalin 23.270 11.930 11.170 11.539

Run times relative to Stalin
ray1 ray2 ray3 ray4

g++ 1.122 1.592 1.404 1.330
OCaml 1.580 2.317 2.258 2.004
MLton 1.148 1.761 1.624 1.512

On average, Stalin is 36% faster than g++, 104% faster than OCaml,
and 51% faster than MLton.

Stalin is now 3x faster than it was!

This raises some questions, however:

1. The C++, OCaml and Java implementations all appear to produce identical
output. The SML output differs because of an error in the rounding code. But
why does the Stalin code produce different output? It looks exactly the
same, which implies something more subtle and, potentially, more sinister
is going on. Is it really using double precision? Is it just using the
wrong value for delta?

2. Why is the Stalin code so much longer than all of the other
implementations? Is it equally representative of unoptimised code or does
it contain low-level optimisations?

3. What does cpsv-v do and what is the equivalent in the other languages?

Also, if I change the definition of delta to this:

(define delta 1.49011611938476562e-08)

then the timings change to this:

Compile times
ray1 ray2 ray3 ray4

g++ 2.289 2.600 2.279 2.420
OCaml 0.340 0.340 0.340 0.350
MLton 9.179 10.229 9.479 9.720
Stalin 40.020 42.640 40.139 40.379

Run times (best of three)
ray1 ray2 ray3 ray4

g++ 26.119 18.989 15.690 15.349
OCaml 36.779 27.670 25.250 23.150
MLton 27.130 21.539 18.219 17.819
Stalin 35.979 12.969 11.630 12.090

Run times relative to Stalin
ray1 ray2 ray3 ray4

g++ 0.725 1.464 1.349 1.269
OCaml 1.022 2.133 2.171 1.914
MLton 0.754 1.660 1.566 1.473

On average, Stalin is 20% faster than g++, 81% faster than OCaml,
and 36% faster than MLton.

Note that the least optimised Scheme implementation is 60% slower than it
was. So these results are at least very variable.

Jon Harrop

Aug 29, 2005, 6:14:53 PM

I just successfully ran your programs on my 900MHz Athlon t-bird and I get
similar results to yours:

Compile times
ray1 ray2 ray3 ray4


g++ 2.260 2.340 2.389 2.380
OCaml 0.340 0.350 0.340 0.350
MLton 9.260 9.400 9.409 9.529
Stalin 37.759 41.039 38.700 38.820

Run times (best of three)
ray1 ray2 ray3 ray4


g++ 26.119 19.000 15.690 15.350
OCaml 36.770 27.649 25.230 23.129
MLton 26.729 21.019 18.149 17.459
Stalin 23.270 11.930 11.170 11.539

Run times relative to Stalin
ray1 ray2 ray3 ray4


g++ 1.122 1.592 1.404 1.330
OCaml 1.580 2.317 2.258 2.004
MLton 1.148 1.761 1.624 1.512

On average, Stalin is 36% faster than g++, 104% faster than OCaml,
and 51% faster than MLton.

Stalin is now 3x faster than it was!

This raises some questions, however:

1. The C++, OCaml and Java implementations all appear to produce identical
output. The SML output differs and, by the looks of the image, it has
offset everything vertically by something like 1/4px, probably due to an
error in the rounding code. But why does the Stalin code produce different
output? It looks exactly the same, which implies something more subtle and,
potentially, more sinister is going on. Is it really using double
precision? Is it just using the wrong value for delta?

2. Why is the Stalin code so much longer than all of the other
implementations? Is it equally representative of unoptimised code or does
it contain low-level optimisations?

3. What does cpsv-v do and what is the equivalent in the other languages?

Why do the optimised Stalin implementations contain functions like "dot2"
when the other implementations do not?

It is also worth noting that making minor changes can have a huge effect on
the performance of the Stalin implementation. For example, changing the
definition of delta to:

(define delta 1.49011611938476562e-08)

makes the first (slowest) implementation take 60% longer here.

Vesa Karvonen

Aug 31, 2005, 4:58:07 PM
Jon Harrop <use...@jdh30.plus.com> wrote:

> I just successfully ran your programs on my 900MHz Athlon t-bird and I get
> similar results to yours:

[...]


> On average, Stalin is 36% faster than g++, 104% faster than OCaml,
> and 51% faster than MLton.

So, are you planning to publish these results on your language comparison
page (http://www.ffconsultancy.com/free/ray_tracer/languages.html)?

-Vesa Karvonen

P.S. I have a recollection of someone doubting the optimization ability of
Lisp (and Scheme) compilers.

Jon Harrop

Sep 1, 2005, 12:12:18 AM
Vesa Karvonen wrote:
> Jon Harrop <use...@jdh30.plus.com> wrote:
>> I just successfully ran your programs on my 900MHz Athlon t-bird and I
>> get similar results to yours:
> [...]
>> On average, Stalin is 36% faster than g++, 104% faster than OCaml,
>> and 51% faster than MLton.
>
> So, are you planning to publish these results on your language comparison
> page (http://www.ffconsultancy.com/free/ray_tracer/languages.html)?

I'd certainly like to. It'll take a bit more work though. In particular,
I'll remove all of the low-level optimisations that have been added to the
Scheme but not to the other languages. I've already had a play at doing
this and, although it slows the Stalin implementations down, it remains
very competitive.

I'll post back when I have new programs and results. I'm very busy ATM so
this could be a little while...

> P.S. I have a recollection of someone doubting the optimization ability of
> Lisp (and Scheme) compilers.

:-)

The free Lisp compilers are still a long way behind but Stalin seems to give
excellent performance without sacrificing usability.

qo...@purdue.edu

Sep 1, 2005, 11:02:39 AM
In response to my earlier post, there was some concern raised about
low-level optimization in the Scheme versions of the code. I have posted
new Scheme versions without such low-level optimizations and a new set of
benchmark results. Here is a summary of the changes.

1. The original post used

       ((foreign-procedure (char*) int "atoi" "stdlib") ...)

   to process the command-line arguments. The reason that I used this is
   that the Scheme string->number procedure can return any kind of
   number or #f. In Stalin, this creates an abstract value of type
   (union fixnum float false). Propagating such abstract values
   potentially widens the type of other values downstream. Using a
   foreign procedure call to atoi avoids this. The new versions use:

       (inexact->exact (truncate (string->number ...)))

   This also prevents widening of values downstream. It just takes
   longer for Stalin to compile. The run times remain the same. In any
   case, I don't see this as constituting an optimization as the C++
   version uses atoi, so either my original Scheme code or the new
   variant qualifies as an apples-to-apples comparison.

2. The original code had a manually-inlined version of fold in the
   intersect procedure:

       (let loop ((result hit) (l (scene-scenes scene)))
        (if (null? l) result (loop (aux (car l) result) (cdr l))))

   The new code replaces this with a call to fold:

       (fold aux hit (scene-scenes scene))

   The reason that I wrote the original code was that for some reason
   that I do not have the time to track down, Stalin fails to split the
   call to fold, even when specifying -no-clone-size-limit and
   -split-even-if-no-widening, and thus fails to specialize and in-line
   that call. Note that Stalin does split other calls to fold and thus
   does specialize and in-line those calls as intended. Note that the
   C++ version includes an open-coded iteration:

       for (Scenes::const_iterator it=child.begin(); it!=child.end(); ++it)
         hit2 = (*it)->intersect(hit2, ray);

   So I don't see the fact that I also open-coded a loop as an
   optimization. I believe that either my original Scheme code or the
   new variant qualifies as an apples-to-apples comparison. In any case,
   this change to the Scheme code introduces only a very small slowdown.
   I open-coded the loop in my original post simply because it resulted
   in Stalin-generated C code that exhibited the theoretical maximal
   amount of in-lining.

3. The original code had two extra variants of dot and one extra variant
   of v-v:

       (define (dot1 ax ay az b)
        (+ (* ax (point-x b)) (* ay (point-y b)) (* az (point-z b))))

       (define (dot2 ax ay az bx by bz) (+ (* ax bx) (* ay by) (* az bz)))

       (define (cpsv-v a b receiver)
        (receiver (- (point-x a) (point-x b))
                  (- (point-y a) (point-y b))
                  (- (point-z a) (point-z b))))

   These were used in ray-sphere:

       (define (ray-sphere orig dir center radius)
        (cpsv-v
         center
         orig
         (lambda (vx vy vz)
          (let* ((b (dot1 vx vy vz dir))
                 (disc (+ (- (* b b) (dot2 vx vy vz vx vy vz))
                          (* radius radius))))
           ...))))

   Note that I didn't write this code. This code was in the version
   posted on comp.lang.scheme that I used as a basis for my variants.
   The obvious reason that the original author wrote the code this way
   was to avoid some consing of intermediate temporary values. My newly
   posted versions eliminate the extra variants of dot and v-v and code
   ray-sphere as:

       (define (ray-sphere orig dir center radius)
        (let* ((v (v-v center orig))
               (b (dot v dir))
               (disc (+ (- (* b b) (dot v v)) (* radius radius))))
         ...))

   This does slow down the code somewhat. Note that Stalin can and does
   automatically unbox structures such as points with the -dI option.
   This will automatically achieve the same effect with the new code as
   the manually written optimization in the original code. Just that
   currently, Stalin aggressively unboxes everything that it can when
   you give the -dI option. Sometimes (like in this case) such unboxing
   can help but sometimes it can hurt drastically. In this benchmark,
   aggressive unboxing hurts a lot more than it helps. In principle, one
   can modify Stalin to decide automatically when to unbox. But it
   currently lacks such a capability. Note that the C++ code manually
   specifies what is boxed and what is not boxed using constructs like

       struct Vec {
         double x, y, z;
         ...
       };

   and const Vec &a. So I don't consider the Scheme code in the original
   post to contain optimizations not present in the C++ code. And I
   consider either the original code or this new code to be a fair
   apples-to-apples comparison.

4. The original code contained the following:

       double delta = sqrt(real.epsilon()), ...;
       let delta = sqrt epsilon_float
       val delta = Math.sqrt (Real.nextAfter(1.0, 2.0) - 1.0)
       (define delta (sqrt 1e-15))

   Since the SML/MLton version and the Scheme/Stalin version produced
   results that differ from each other and from the C++/g++ and
   OCaml/ocamlopt versions, there was a suggestion that this was the
   cause of the difference. The new code contains:

       double delta = sqrt(1e-15), ...;
       let delta = sqrt 1e-15
       val delta = Math.sqrt 1e~15
       (define delta (sqrt 1e-15))

   Note that the SML/MLton version and the Scheme/Stalin version still
   produce results that differ from each other and from the C++/g++ and
   OCaml/ocamlopt versions. Thus this cannot be the sole source of the
   difference.

The original source code and benchmark results are available at:

http://www-bcl.cs.nuim.ie/~qobi/ray1.tgz

The new source code and benchmark results are available at:

http://www-bcl.cs.nuim.ie/~qobi/ray1.tgz

With these changes running:

% ./run

yields:

Compile times
ray1 ray2 ray3 ray4

g++ 1.031 1.092 1.120 1.116
OCaml 0.307 0.317 0.327 0.334
MLton 3.618 3.668 3.638 3.634
Stalin 218.744 220.225 215.009 216.649

Run times (best of three)
ray1 ray2 ray3 ray4

g++ 16.985 12.930 10.573 10.362
OCaml 24.625 18.193 15.704 14.715
MLton 16.022 12.031 10.908 10.432
Stalin 13.091 9.300 8.538 8.987

Run times relative to Stalin
ray1 ray2 ray3 ray4

g++ 1.297 1.390 1.238 1.152
OCaml 1.881 1.956 1.839 1.637
MLton 1.223 1.293 1.277 1.160

On average, Stalin is 26% faster than g++, 82% faster than OCaml,
and 23% faster than MLton.

Note that Stalin still is significantly faster than all of the other
languages on all of the benchmarks.

Jon Harrop

Sep 1, 2005, 8:40:27 PM
qo...@purdue.edu wrote:
> (inexact->exact (truncate (string->number ...)))
>
> This also prevents widening of values downstream. It just takes longer
> for Stalin to compile. The run times remain the same. In any case, I
> don't see this as constituting an optimization as the C++ version uses
> atoi, so either my original Scheme code or the new variant qualifies as
> an apples-to-apples comparison.

Yes. This also looks like nicer Scheme code (to my untrained eye).

> Note that the C++ version includes an open-coded iteration:
>
>     for (Scenes::const_iterator it=child.begin(); it!=child.end(); ++it)
>       hit2 = (*it)->intersect(hit2, ray);
>
> So I don't see the fact that I also open-coded a loop as an
> optimization.

True. Having to define fold does worsen verbosity though (assuming fold is
built-in as well). Using a templated loop from C++ would incur no run-time
performance cost and minimal compile-time cost but I can never remember the
bizarre syntax... ;-)

> This does slow down the code somewhat. Note that Stalin can and does
> automatically unbox structures such as points with the -dI option.
> This will automatically achieve the same effect with the new code as
> the manually written optimization in the original code. Just that
> currently, Stalin aggressively unboxes everything that it can when you
> give the -dI option. Sometimes (like in this case) such unboxing can
> help but sometimes it can hurt drastically. In this benchmark,
> aggressive unboxing hurts a lot more than it helps.

That's interesting. When does unboxing hurt a lot? MLton also tries to
unbox, AFAIK. OCaml does not (in this context).

> Note that the C++ code manually specifies what is boxed and what is
> not boxed using constructs like
>
>     struct Vec {
>       double x, y, z;
>       ...
>     };
>
> and const Vec &a. So I don't consider the Scheme code in the original
> post to contain optimizations not present in the C++ code. And I
> consider either the original code or this new code to be a fair
> apples-to-apples comparison.

This I'm not so sure about. There is no notion of boxing in C/C++. Although
you can mimic it, of course, nobody would. If Stalin can automate unboxing
then that is fine. OCaml will not unbox. MLton probably will. I think
unboxing is one of the more important aspects of optimising compilers for
modern FPLs.

> Since the SML/MLton version and the Scheme/Stalin version produced
> results that differ from each other and from the C++/g++ and
> OCaml/ocamlopt versions, there was a suggestion that this was the cause
> of the difference. The new code contains:
>
>     double delta = sqrt(1e-15), ...;
>     let delta = sqrt 1e-15
>     val delta = Math.sqrt 1e~15
>     (define delta (sqrt 1e-15))

This I really don't like. Firstly, my current code demonstrates the use of
mach eps in various languages. Secondly, as I've said, this alteration can
have a big effect on performance. Finally, this is clearly reducing all
other implementations to the lowest common denominator. Is there no mach
eps in Stalin? If there isn't then the "correct" figure can be hardcoded.

> Run times relative to Stalin
> ray1 ray2 ray3 ray4
> g++ 1.297 1.390 1.238 1.152
> OCaml 1.881 1.956 1.839 1.637
> MLton 1.223 1.293 1.277 1.160
>
> On average, Stalin is 26% faster than g++, 82% faster than OCaml,
> and 23% faster than MLton.
>
> Note that Stalin still is significantly faster than all of the other
> languages
> on all of the benchmarks.

Yes. That is an incredible result. Stalin is clearly a fantastic piece of
work - well done!

May I put code derived from yours on my site if I include Stalin-compiled
Scheme in my language comparison?

qo...@purdue.edu

Sep 3, 2005, 11:48:49 PM
> That's interesting. When does unboxing hurt a lot?

You have to be more precise about what you mean by unboxing. There are
several independent issues. One kind of boxing is that some compilers
force all data objects to be of fixed width, i.e. 32 bits. Thus instead
of passing around doubles, you pass around pointers to doubles. In C:

    /* Boxed */
    double *w;
    double *f(double *x){double *y; struct t {double *z;}; ...}

    /* Unboxed */
    double w;
    double f(double x){double y; struct t {double z;}; ...}

From the above, note that this affects how arguments are passed, how
values are returned, how local and global variables are represented, and
how structure slots are represented. Each of these 5 different kinds of
representation decisions is independent. You can box some and unbox
others. And you can potentially even make that decision independently
for each argument, for each return point, for each local or global
variable, and for each structure slot. It is not as simple as saying a
compiler boxes or unboxes. Different compilers can do different mixtures
of the two.

Now there is a tradeoff. With an unboxed representation, you are moving
around 64-bit quantities. This causes more register pressure, more
spilling, more memory bandwidth, more cache misses, and more page
faults. With a boxed representation, you need to dereference to access a
value and you need to allocate to return or store a value (and you need
to reclaim such allocated storage when it is no longer used). Which is
more efficient depends a lot on the context in which you use doubles in
the program and what kind of code the compiler generates (i.e. the
in-liner, the parameter passing mechanism, the register allocator,
etc.). For example, if you write:

    (define (fn x y) (+ x y))
    ...
    (define (f3 x y) (fn x y))
    (define (f2 x y) (f3 x y))
    (define (f1 x y) (f2 x y))

and the compiler doesn't in-line and passes parameters on the stack, the
above code will be much faster with boxed representations than unboxed
ones because reducing the memory bandwidth will outweigh the cost of
allocation, reclamation, and indirection.

The same thing applies to structures. You can represent structures as:

    /* Boxed */
    struct s *w;
    struct s *f(struct s *x){struct s *y; struct t {struct s *z;}; ...}

    /* Unboxed */
    struct s w;
    struct s f(struct s x){struct s y; struct t {struct s z;}; ...}

And just as for doubles, this affects how arguments are passed, how
values are returned, how local and global variables are represented, and
how structure slots are represented. And just as for doubles, these 5
different kinds of representation decisions are independent. You can box
some and unbox others. And you can make these decisions independently
for each argument, for each return point, for each local or global
variable, and for each structure slot. There are two differences,
however. First, you can't unbox when the identity of the object matters.
This can happen for two reasons. One is that you side-effect the object.
The other is that you compare objects with EQ?. Second, you can't unbox
if you have a recursive data structure like:

    struct s {int x; struct s *next;}

The tradeoffs of boxing vs. unboxing of structures are different than
for doubles because doubles are fixed-width 64 bits while structures can
be smaller or (much) larger.

Stalin always unboxes floats/doubles. Without -dI Stalin never unboxes
structures. With -dI Stalin unboxes structures whenever it can prove
that identity doesn't matter and the structure is not recursive. This
requires whole-program interprocedural analysis. The unboxing decisions
can depend on the precision of such analysis and Stalin has many options
to control the analysis precision. But with -dI, Stalin will always
unbox structures whenever it can prove that it is semantically sound. It
does not analyze whether boxing or unboxing would be more efficient.

> There is no notion of boxing in C/C++.
> Although you can mimic it, of course, nobody would.

I don't know C++. But I do know C extremely well. And there definitely
is a notion of boxing/unboxing in C as I illustrated above. I just
presume that when I see things like * and & in C++ that they mean
similar things to what they mean in C, and in C they mean control over
boxing and unboxing.

> MLton also tries to unbox, AFAIK. OCaml does not (in this context).

> If Stalin can automate unboxing then that is fine. OCaml will not
> unbox. MLton probably will. I think unboxing is one of the more
> important aspects of optimising compilers for modern FPLs.

I don't know OCaml at all. I had quite detailed knowledge of MLton as of
about 4 years ago because MLton was originally written at NECI and I was
also at NECI at the time. (I started working on Stalin in 1993, at
University of Toronto and later at the Technion, before I arrived at
NECI, long before work started on MLton, after I arrived at NECI.) But I
have not followed MLton closely since I left NECI in 2001. In ML, the
situation is a little different than in Scheme. The semantics of ML does
not define the notion of identity of structures (i.e. there is no EQ?).
And the only thing you can side-effect is ref cells. I am pretty sure
that in 2001, MLton did not do the whole-program interprocedural
analysis needed to determine whether it could unbox ref cells. And I am
even more sure that it didn't do any automatic tradeoff analysis to
determine whether boxed or unboxed representations were more efficient
for each type and program point. (Steve, Suresh, and Henry, please
correct me if I am wrong.) I'm not aware of any compiler that does so,
but then again, I have not followed the literature closely on this and
there might be some that do.

> This I really don't like. Firstly, my current code demonstrates the
> use of mach eps in various languages. Secondly, as I've said, this
> alteration can have a big effect on performance. Finally, this is
> clearly reducing all other implementations to the lowest common
> denominator. Is there no mach eps in Stalin? If there isn't then the
> "correct" figure can be hardcoded.

When you compare compilers, you need to make sure that your benchmarks
are all performing the same computation. If different values of delta
lead to different computations then having the benchmarks use those
different values means that your benchmarks are invalid.

Stalin implements a (large) subset of R4RS plus some extensions. RnRS
does not define mach eps. One can access mach eps using the foreign
procedure interface of Stalin:

qobi@thakaa>cat foo.cpp
#include <limits>
using namespace std;
numeric_limits<double> real;
extern "C" double eps(void) {return real.epsilon();}
qobi@thakaa>cat bar.sc
(write ((foreign-procedure () float "eps")))
(newline)
qobi@thakaa>g++ -c foo.cpp
qobi@thakaa>stalin -On -db -d -copt foo.o bar
8.837u 0.549s 0:11.46 81.7% 0+0k 0+0io 0pf+0w
qobi@thakaa>./bar
2.220446049250313080847263336181640625e-16
qobi@thakaa>

> May I put code derived from yours on my site if I include
> Stalin-compiled Scheme in my language comparison?

Yes, of course.
