Pure-functional N-body benchmark implementation

fft1976

unread,

Aug 10, 2009, 5:41:58 AM8/10/09

to Clojure

I just uploaded to the group an implementation of the n-body benchmark
in Clojure (see nbody_init.clj)

http://shootout.alioth.debian.org/u32/benchmark.php?test=nbody&lang=java&box=1

My goal was to write a pure-functional version and to avoid any micro-
optimizations. There are no type declarations and plenty of laziness
and higher-order functions.

It seems to be 100 times slower than Java (I was expecting worse) on
my machine, putting it in the same class as Ruby, Perl and Python
(which are imperative).

The obvious things to do next here, in the order of ugliness, are:

1. switch to 3D vectors instead of arbitrary length functions
2. add type declarations
3. avoid maps
4. experiment with adding/removing "(vec ..."
5. get imperative or transient
6. switch to Java arrays/classes

I am curious about how much more "non-idiomatic" Clojure has to become
here for each step it takes towards Java-like performance in this
benchmark, and whether the most optimized version will match Java in
speed.

I hope others can pick this up and implement the optimizations, or
post their timings.

Jarkko Oranen

unread,

Aug 10, 2009, 7:46:41 AM8/10/09

to Clojure

On Aug 10, 12:41 pm, fft1976 <fft1...@gmail.com> wrote:
> I just uploaded to the group an implementation of the n-body benchmark
> in Clojure (see nbody_init.clj)
>

> http://shootout.alioth.debian.org/u32/benchmark.php?test=nbody〈=j...

>
> My goal was to write a pure-functional version and to avoid any micro-
> optimizations. There are no type declarations and plenty of laziness
> and higher-order functions.
>
> It seems to be 100 times slower than Java (I was expecting worse) on
> my machine, putting it in the same class as Ruby, Perl and Python
> (which are imperative).
>
> The obvious things to do next here, in the order of ugliness, are:
>
> 1. switch to 3D vectors instead of arbitrary length functions
> 2. add type declarations
> 3. avoid maps
> 4. experiment with adding/removing "(vec ..."
> 5. get imperative or transient
> 6. switch to Java arrays/classes
>
> I am curious about how much more "non-idiomatic" Clojure has to become
> here for each step it takes towards Java-like performance in this
> benchmark, and whether the most optimized version will match Java in
> speed.
>
> I hope others can pick this up and implement the optimizations, or
> post their timings.

I'm not going to start optimising, but I don't think there's a need to
avoid maps. You could use struct maps and
generate accessors for the base keys... but somehow I doubt that is
the bottleneck

--
Jarkko

fft1976

unread,

Aug 10, 2009, 2:35:07 PM8/10/09

to Clojure

On Aug 10, 4:46 am, Jarkko Oranen <chous...@gmail.com> wrote:

> I'm not going to start optimising,

Somebody'd better!

You always hear this dogma that one should write "elegant" code first
and optimize later, and when you do that, a few little changes can
make Clojure as fast as Java.

Here's your chance to show it :-)

> but somehow I doubt that is
> the bottleneck

The locality of changes is what I doubt.

Andy Fingerhut

unread,

Aug 10, 2009, 3:12:56 PM8/10/09

to Clojure

I've got a functional version in my clojure-benchmarks git repository,
too, and it is currently 138 times slower on my machine than the Java
version. I haven't looked in detail at your code, but at a high level
it appears to be similar to yours in structure -- i.e., it doesn't try
to go overboard with float/double declarations, and uses only Clojure
data structures. I haven't tried optimizing it yet, just because I've
done more optimizing with some of the other benchmarks first.

It would be nice to have a Clojure version under 10 times Java's run
time here, without resorting to Java data structures and methods.
There may be better motivational techniques than using the word
"dogma" and daring people to do better :-)

Then again, perhaps this may be an example where the best thing for
someone working on a real project is to take the inner loop code and
write it in Java, the way some people take inner loops in C progams
and hand-code them in assembly -- because they are in a situation
where it is worth the time to optimize. I would point out that some
of these benchmark programs in Haskell, for example, are not pure
functional implementations, and people have gone to (what seems to me)
great effort to optimize the heck out of their programs in ways that
most Haskell programmers would probably find excessive. Even some of
the C and C++ implementations seem to pull out __builtin_memmove calls
and explicitly using SSE macros in order to tweak out a few more
milliseconds of lower run time.

Andy

Jonathan Smith

unread,

Aug 10, 2009, 5:19:42 PM8/10/09

to Clojure

1.) use something mutable
2.) unroll all the loops (mapping is a loop)
3.) try not to coerce between seq/vec/hash-map too much.

in real world, stuff like the shootout is pretty useless, as generally
you'd reach for a better algorithm rather than implementing the
shackled, crippled, naive algorithms that the benchmark forces you to
implement.

(Not that they aren't useful to some extent, just that language
productivity and how fast you can iterate your software design in the
language is, IMO, a much better indicator of a good language than
micro benchmarking).

On Aug 10, 5:41 am, fft1976 <fft1...@gmail.com> wrote:
> I just uploaded to the group an implementation of the n-body benchmark
> in Clojure (see nbody_init.clj)
>

> http://shootout.alioth.debian.org/u32/benchmark.php?test=nbody〈=j...

Andy Fingerhut

unread,

Aug 10, 2009, 6:00:33 PM8/10/09

to Clojure

On Aug 10, 2:19 pm, Jonathan Smith <jonathansmith...@gmail.com> wrote:
> 1.) use something mutable
> 2.) unroll all the loops (mapping is a loop)
> 3.) try not to coerce between seq/vec/hash-map too much.
>
> in real world, stuff like the shootout is pretty useless, as generally
> you'd reach for a better algorithm rather than implementing the
> shackled, crippled, naive algorithms that the benchmark forces you to
> implement.
>
> (Not that they aren't useful to some extent, just that language
> productivity and how fast you can iterate your software design in the
> language is, IMO, a much better indicator of a good language than
> micro benchmarking).

I agree that they are useful to some extent. In the real world, you
would use a better algorithm, but those better algorithms can be
implemented in any language, too. As you say, a higher level language
that lets you iterate more quickly on multiple implementations of
different algorithms has advantages. At the end of the day, there is
something in the language implementations that can enable faster
execution times than others. A micro-benchmark can be useful to tell
you that if your application needs to do lots of double number
crunching, you definitely shouldn't be implementing it in Perl, for
example -- at least not that part of your application.

By the way, fft1976, this Clojure version of the n-body benchmark
takes about half the time of your code on my machine (for a shorter
test than the one for the benchmark web site -- I haven't compared yet
on the full length test since it takes about an hour for my version).

http://github.com/jafingerhut/clojure-benchmarks/blob/3e45bd8f6c3eba47f982a0f6083493a9f076d0e9/n-body/nbody.clj-5.clj

It doesn't use any mutable data structures, but it does use vectors,
has some double declarations, and avoids using map for operations on
pairs of 3d vectors, so it has some optimizations.

Using the new transient vectors recently added by Rich, and also
either using transient maps, or changing the implementation of planet
data to use vectors instead of the maps that I use, would probably
speed things up some more. I also still have a fair number of uses of
map that would probably be faster with loop/recur.

Andy

Andy Fingerhut

unread,

Aug 10, 2009, 8:15:21 PM8/10/09

to Clojure

OK, I've got a new Clojure program for the n-body benchmark, and it is
significantly faster than my previous one -- down from 138 x Java run
time, to 37 x Java run time. Still room for improvement somewhere
there, I'm sure, including perhaps using Java arrays instead of
Clojure vectors.

http://github.com/jafingerhut/clojure-benchmarks/blob/2423c5ba7bcded349bc21cfa4204bb840fd75bfa/n-body/nbody.clj-6.clj

The main changes from the slower version are:

+ make separate vectors for each "attribute" of a body in motion, i.e.
a separate vector of positions, a vector of velocities, etc., instead
of using maps.
+ Use loop/recur almost everywhere it makes sense. I still have a few
map calls in the function 'energy' and the functions it calls, and
maybe in the init code, but that is only done twice and once during
the whole execution, versus advance which is called 50,000,000 times
in the long version of the benchmark.
+ It uses the new transient/assoc!/conj!/persistent! functions for
Clojure vectors, so you need a pretty new copy of Clojure if you want
to run it yourself.

Updates summary of results is here:

http://github.com/jafingerhut/clojure-benchmarks/blob/14c5df3ee28d418a938de584f6675860043dd0f5/RESULTS

And, as usual, improvements to the performance of these Clojure
programs are welcome.

Andy

Isaac Gouy

unread,

Aug 10, 2009, 8:54:13 PM8/10/09

to Clojure

On Aug 10, 3:00 pm, Andy Fingerhut <andy_finger...@alum.wustl.edu>
wrote:

> On Aug 10, 2:19 pm, Jonathan Smith <jonathansmith...@gmail.com> wrote:
>
> > 1.) use something mutable
> > 2.) unroll all the loops (mapping is a loop)
> > 3.) try not to coerce between seq/vec/hash-map too much.
>

> > in real world, stuff like theshootoutis pretty useless, as generally

> > you'd reach for a better algorithm rather than implementing the
> > shackled, crippled, naive algorithms that the benchmark forces you to
> > implement.
>
> > (Not that they aren't useful to some extent, just that language
> > productivity and how fast you can iterate your software design in the
> > language is, IMO, a much better indicator of a good language than
> > micro benchmarking).
>
> I agree that they are useful to some extent. In the real world, you
> would use a better algorithm, but those better algorithms can be
> implemented in any language, too. As you say, a higher level language
> that lets you iterate more quickly on multiple implementations of
> different algorithms has advantages.

-snip-

An opportunity to use Clojure to iterate quickly to an implementation
that exploits all of the available cores?

Mark Engelberg

unread,

Aug 10, 2009, 8:57:18 PM8/10/09

to clo...@googlegroups.com

Andy,

My understanding is that any double that gets stored in a vector or
map is boxed, and therefore, the vast majority of your double
conversions aren't really doing anything, because when you pull them
out of the vector or map, they'll just be Double objects again.

I believe that the biggest reason for the performance difference
between Clojure and Java on this benchmark is that if you use a
Clojure data structure (e.g., a map) to represent each "body" object,
you get a lot of boxing and unboxing of the doubles.

So I hypothesize that the biggest bang for the buck is to find a way
in Clojure to create body objects with primitive double fields. Can
the new new do that? If not, your best bet may be to represent a body
object as a Java array of several doubles, rather than using a Clojure
map. The first double could represent the mass, the second could
represent vx, the second vy, and so on. Make macros for the accessors
that extract these values using aget.

It's speculation on my part, but I think this will get you much closer
to Java speed than your current approach.

fft1976

unread,

Aug 10, 2009, 10:22:00 PM8/10/09

to Clojure

On Aug 10, 5:15 pm, Andy Fingerhut <andy_finger...@alum.wustl.edu>
wrote:

> OK, I've got a new Clojure program for the n-body benchmark, and it is
> significantly faster than my previous one -- down from 138 x Java run
> time, to 37 x Java run time. Still room for improvement somewhere
> there, I'm sure, including perhaps using Java arrays instead of
> Clojure vectors.
>

> http://github.com/jafingerhut/clojure-benchmarks/blob/2423c5ba7bcded3...

>
> The main changes from the slower version are:
>
> + make separate vectors for each "attribute" of a body in motion, i.e.
> a separate vector of positions, a vector of velocities, etc., instead
> of using maps.
> + Use loop/recur almost everywhere it makes sense. I still have a few
> map calls in the function 'energy' and the functions it calls, and
> maybe in the init code, but that is only done twice and once during
> the whole execution, versus advance which is called 50,000,000 times
> in the long version of the benchmark.
> + It uses the new transient/assoc!/conj!/persistent! functions for
> Clojure vectors, so you need a pretty new copy of Clojure if you want
> to run it yourself.

I don't have the CVS version with transients, so I can't test your
version. This new version I just uploaded (nbody_v2.clj in the group
vault) is 55x slower than Java in my tests, which should make it 1.5x
slower than yours, but it uses no macros, transients, type hints,
struct-maps (it uses maps), Java arrays or mutation.

I tried struct-maps. They made things worse. Type hints do make it
faster, but only 1.3-1.4 times (perhaps I didn't add them well), so I
decided that they are not worthwhile at this point.

I'll write a Java-in-Clojure version of this in a week or two, if
nobody does it first.

fft1976

unread,

Aug 10, 2009, 11:08:39 PM8/10/09

to Clojure

On Aug 10, 2:19 pm, Jonathan Smith <jonathansmith...@gmail.com> wrote:

> 1.) use something mutable
> 2.) unroll all the loops (mapping is a loop)
> 3.) try not to coerce between seq/vec/hash-map too much.

Are you saying this w.r.t. my code or in general? If the former, be
specific, better yet, show us your code. I avoided (1) on purpose, as
I explained. The other choices I think are reasonable already. It
makes no sense to unroll a procedure that runs once, or macro-ify
something that gets inlined.

> in real world, stuff like the shootout is pretty useless, as generally
> you'd reach for a better algorithm rather than implementing the
> shackled, crippled, naive algorithms that the benchmark forces you to
> implement.

Some of us prefer facts and measurements, flawed as they may appear to
blind faith.

Many of the Shootout benchmarks are quite contrived, I agree, but the
N-Body benchmark is nice. I think it started elsewhere.

You'll have a hard time coming up with a better algorithm, unless you
are familiar with the field of numerical solutions of ordinary
differential equations. There are Runge-Kutta methods that can give a
constant factor speed-up, but they are just as easily implemented in
any other language.

Andy Fingerhut

unread,

Aug 11, 2009, 2:15:53 AM8/11/09

to Clojure

I've tried an approach like you suggest, using mutable Java arrays of
doubles, macros using aget / aset-double for reading and writing these
arrays, and loop/recur everywhere iteration is needed in the inner
loop. It is here:

http://github.com/jafingerhut/clojure-benchmarks/blob/173e68a52ca2e180f9d5f3ac4c81a737358d1491/n-body/nbody.clj-7.clj

I was surprised to find that it is significantly slower (about 3x)
than my best Clojure one so far, which uses the new transients, and
vectors of doubles to store the state of the bodies -- the one I
posted earlier today.

http://github.com/jafingerhut/clojure-benchmarks/blob/173e68a52ca2e180f9d5f3ac4c81a737358d1491/n-body/nbody.clj-6.clj

The updated summary of results is here:

http://github.com/jafingerhut/clojure-benchmarks/blob/173e68a52ca2e180f9d5f3ac4c81a737358d1491/RESULTS

I suspect I'm doing something wrong in my mutable Java array
implementation, but I don't see what it could be.

Thanks,
Andy

Mark Engelberg

unread,

Aug 11, 2009, 2:33:36 AM8/11/09

to clo...@googlegroups.com

On Mon, Aug 10, 2009 at 11:15 PM, Andy
Fingerhut<andy_fi...@alum.wustl.edu> wrote:
> I suspect I'm doing something wrong in my mutable Java array
> implementation, but I don't see what it could be.

There still seems to be a lot of boxing and unboxing going on. For example, in:
(let [[momx momy momz] (offset-momentum bodies)
the call to offset-momentum returns a vector of the three doubles, so
they would be boxed, so momx momy momz are not primitives.

Jonathan Smith

unread,

Aug 11, 2009, 2:42:32 AM8/11/09

to Clojure

On Aug 10, 11:08 pm, fft1976 <fft1...@gmail.com> wrote:
> On Aug 10, 2:19 pm, Jonathan Smith <jonathansmith...@gmail.com> wrote:
>
> > 1.) use something mutable
> > 2.) unroll all the loops (mapping is a loop)
> > 3.) try not to coerce between seq/vec/hash-map too much.
>
> Are you saying this w.r.t. my code or in general? If the former, be
> specific, better yet, show us your code. I avoided (1) on purpose, as
> I explained. The other choices I think are reasonable already. It
> makes no sense to unroll a procedure that runs once, or macro-ify
> something that gets inlined.
>

w.r.t your code.
At compilation you know a lot about the computation you are doing, so
there is a lot of room to unroll and 'cheat' instead of actually
looping. Obviously you won't bother optimizing the offset momentum
part of the code, but you will optimize your simulate-1 function a
lot.

The way your code is setup, you will spend a lot of time in funcall
overhead just because you used a lot of functions instead of doing the
calculation in bigger chunks. You can still have an elegant
implementation and good performance if you definline a few things.

Also it is good to put local constants in a let form at the top of the
file, they are faster to access by quite a bit.

And a last thing is that if you hack your clojure's implementation of
nth to inline the 3ary version of nth, you will see a big speedup in
the destructuring binds that you use in your 3-d library.

> > in real world, stuff like the shootout is pretty useless, as generally
> > you'd reach for a better algorithm rather than implementing the
> > shackled, crippled, naive algorithms that the benchmark forces you to
> > implement.
>
> Some of us prefer facts and measurements, flawed as they may appear to
> blind faith.
>

I kind of approach it as knowing that I get better abstraction from
lisps and knowing that I will pay the price somewhere down the line.
(most of the time in heavily looping numerics). :-)

> Many of the Shootout benchmarks are quite contrived, I agree, but the
> N-Body benchmark is nice. I think it started elsewhere.
>
> You'll have a hard time coming up with a better algorithm, unless you
> are familiar with the field of numerical solutions of ordinary
> differential equations. There are Runge-Kutta methods that can give a
> constant factor speed-up, but they are just as easily implemented in
> any other language.

I dunno, manipulating systems of equations seems to be one of lisps'
strong suits.

My current solution (I wrote it a few months ago) is up on github
here:
(It isn't as good as Andy's, i wrote it mostly to test a def-inline
that I wrote).

http://gist.github.com/165670

As I said, using mutable objects and loop unrolling will give definite
performance advantages.

My thought was to use a regular old java array and write accessor
macros to slot positions (similar to a struct in CL), but I never got
around to implementing it as it started to seem like all too much
mental masturbation.

Andy Fingerhut

unread,

Aug 11, 2009, 2:43:13 AM8/11/09

to Clojure

On Aug 10, 11:33 pm, Mark Engelberg <mark.engelb...@gmail.com> wrote:
> On Mon, Aug 10, 2009 at 11:15 PM, Andy
>

> Fingerhut<andy_finger...@alum.wustl.edu> wrote:
> > I suspect I'm doing something wrong in my mutable Java array
> > implementation, but I don't see what it could be.
>
> There still seems to be a lot of boxing and unboxing going on. For example, in:
> (let [[momx momy momz] (offset-momentum bodies)
> the call to offset-momentum returns a vector of the three doubles, so
> they would be boxed, so momx momy momz are not primitives.

That code is only run once during initialization, then never again.
It is well under 0.1% of the entire run time of the long benchmark.

The bulk of the computation is this call tree, where advance! is
called 50,000,000 times sequentially in the long benchmark:

advance!
bodies-update-velocities!
bodies-update-positions!

Andy

Christophe Grand

unread,

Aug 11, 2009, 2:50:55 AM8/11/09

to clo...@googlegroups.com

Hi Andy,

On Tue, Aug 11, 2009 at 8:15 AM, Andy Fingerhut <andy_fi...@alum.wustl.edu> wrote:

I've tried an approach like you suggest, using mutable Java arrays of
doubles, macros using aget / aset-double for reading and writing these
arrays, and loop/recur everywhere iteration is needed in the inner
loop. It is here:

aget-* and aset-* are slow, just use aget and aset with type hints.

Christophe

--
Professional: http://cgrand.net/ (fr)
On Clojure: http://clj-me.blogspot.com/ (en)

Andy Fingerhut

unread,

Aug 11, 2009, 3:39:04 AM8/11/09

to Clojure

On Aug 10, 11:50 pm, Christophe Grand <christo...@cgrand.net> wrote:
> Hi Andy,
>
> On Tue, Aug 11, 2009 at 8:15 AM, Andy Fingerhut <
>

> andy_finger...@alum.wustl.edu> wrote:
> > I've tried an approach like you suggest, using mutable Java arrays of
> > doubles, macros using aget / aset-double for reading and writing these
> > arrays, and loop/recur everywhere iteration is needed in the inner
> > loop. It is here:
>
> aget-* and aset-* are slow, just use aget and aset with type hints.
>
> Christophe

Wow, you ain't kiddin. I changed about 10 lines from my last version,
to avoid using aset-double, using aset and type hints until the
reflection warnings went away, and it sped up by a factor of 10. I'm
leaving the previous version's source and results there just so I can
have a future reference to point to at the difference it makes. New
results here:

http://github.com/jafingerhut/clojure-benchmarks/blob/9dc56d8ff53f0b8d363f213317587432bd8793de/RESULTS

Still almost 11 times slower than the Java version, but a lot more
respectable than my earlier attempts.

Should there be a similar warning like *warn-on-reflection* that tells
you to avoid using aset-* if you want better performance? Or at least
put a loud warning in the doc strings for those functions about their
poor performance? Or can they just be removed?

Thanks!

Andy

Mark Engelberg

unread,

Aug 11, 2009, 4:26:08 AM8/11/09

to clo...@googlegroups.com

On Tue, Aug 11, 2009 at 12:39 AM, Andy
Fingerhut<andy_fi...@alum.wustl.edu> wrote:
> Wow, you ain't kiddin. I changed about 10 lines from my last version,
> to avoid using aset-double, using aset and type hints until the
> reflection warnings went away, and it sped up by a factor of 10. I'm
> leaving the previous version's source and results there just so I can
> have a future reference to point to at the difference it makes. New
> results here:

I remember reading on someone's blog that even though it isn't
announced by the reflection warning, you can get a big performance
boost with aset and aget by explicitly casting the index to an int.
So that's another idea you might want to try.

fft1976

unread,

Aug 11, 2009, 4:42:37 AM8/11/09

to Clojure

On Aug 10, 11:42 pm, Jonathan Smith <jonathansmith...@gmail.com>
wrote:

> The way your code is setup, you will spend a lot of time in funcall
> overhead just because you used a lot of functions instead of doing the
> calculation in bigger chunks.

I thought, as I understood from Rich's lectures, JVM inlines whatever
it wants. Anyway, I put everything inside a LET. In my tests my normal
code (nbody_v2.clj) is 55x slower than Java, and yours is 40x slower.
Andy's transients-using and type-declaring version is 37x slower (in
his tests). After I put everything inside a LET, I get to be 52x
slower than Java. Definitely not worth it, considering how ugly it is.

Your version is not pure-functional, right? If so, I think I still
have the fastest pure-functional and non-type declaring version.

Let's just say that hacking Clojure's implementation of destructuring
is against the rules. If it were a good idea, I'm sure Rich would have
done it by now.

fft1976

unread,

Aug 11, 2009, 4:50:26 AM8/11/09

to Clojure

On Aug 11, 12:39 am, Andy Fingerhut <andy_finger...@alum.wustl.edu>
wrote:

>
> http://github.com/jafingerhut/clojure-benchmarks/blob/9dc56d8ff53f0b8...
>

Why isn't the array-using version as fast as Java? Shouldn't using
Java's data structures, mutation and no reflection supposed to be
equivalent to using Java?

fft1976

unread,

Aug 11, 2009, 5:25:17 AM8/11/09

to Clojure

On Aug 11, 12:39 am, Andy Fingerhut <andy_finger...@alum.wustl.edu>
wrote:

> On Aug 10, 11:50 pm, Christophe Grand <christo...@cgrand.net> wrote:
>
> > Hi Andy,
>
> > On Tue, Aug 11, 2009 at 8:15 AM, Andy Fingerhut <
>
> > andy_finger...@alum.wustl.edu> wrote:
> > > I've tried an approach like you suggest, using mutable Java arrays of
> > > doubles, macros using aget / aset-double for reading and writing these
> > > arrays, and loop/recur everywhere iteration is needed in the inner
> > > loop. It is here:
>
> > aget-* and aset-* are slow, just use aget and aset with type hints.
>
> > Christophe
>
> Wow, you ain't kiddin. I changed about 10 lines from my last version,
> to avoid using aset-double, using aset and type hints until the
> reflection warnings went away, and it sped up by a factor of 10. I'm
> leaving the previous version's source and results there just so I can
> have a future reference to point to at the difference it makes. New
> results here:
>

> http://github.com/jafingerhut/clojure-benchmarks/blob/9dc56d8ff53f0b8...

>
> Still almost 11 times slower than the Java version, but a lot more
> respectable than my earlier attempts.

Hmmm....

I just ran your version #8, and it's almost as slow as mine
(nbody_v2.clj): 53 times slower than Java, but I'm running Clojure 1.0
and

java version "1.6.0_0"
OpenJDK Runtime Environment (build 1.6.0_0-b11)
OpenJDK Client VM (build 1.6.0_0-b11, mixed mode, sharing)

Are you aware, by the way, that the "steady-state" Java runs 21 times
instead of once, so you have to divide its time?

fft1976

unread,

Aug 11, 2009, 5:36:31 AM8/11/09

to Clojure

On Aug 11, 2:25 am, fft1976 <fft1...@gmail.com> wrote:
> Hmmm....
>
> I just ran your version #8, and it's almost as slow as mine
> (nbody_v2.clj): 53 times slower than Java, but I'm running Clojure 1.0
> and

Strike that. I f'ed up the namespaces and was actually measuring my
own version. Yours is 8x slower than Java (I use "-server"
everywhere).

By the way, Gambit-C is 1.5x slower than C in this benchmark (and I'm
not sure if the source is optimal as opposed to being generic Scheme).
This is with "unsafe" optimization on though.

Jonathan Smith

unread,

Aug 11, 2009, 7:50:25 AM8/11/09

to Clojure

On Aug 11, 4:42 am, fft1976 <fft1...@gmail.com> wrote:
> On Aug 10, 11:42 pm, Jonathan Smith <jonathansmith...@gmail.com>
> wrote:
>
> > The way your code is setup, you will spend a lot of time in funcall
> > overhead just because you used a lot of functions instead of doing the
> > calculation in bigger chunks.
>
> I thought, as I understood from Rich's lectures, JVM inlines whatever
> it wants.

Up to a certain size of chunk of code. Things do get faster if you run
them a couple times, certainly, but you need to give it a good size
block of smallish functions to really take effect with the in-lining.
To my knowledge it won't inline a map or reduce yet.

> Anyway, I put everything inside a LET. In my tests my normal
> code (nbody_v2.clj) is 55x slower than Java, and yours is 40x slower.
> Andy's transients-using and type-declaring version is 37x slower (in
> his tests). After I put everything inside a LET, I get to be 52x
> slower than Java. Definitely not worth it, considering how ugly it is.
>

I don't think you have to put *everything* in the let, just your
constants. (so days per year and solar mass, the bodies themselves).

I don't know about ugly one way or another, but I mostly do the 'fns
in a let' thing to keep my code organized and structured. I only want
certain things to be external in the package... (And other things I
want to pre-compute...)

> Your version is not pure-functional, right? If so, I think I still
> have the fastest pure-functional and non-type declaring version.
>

Ok, sure.

> Let's just say that hacking Clojure's implementation of destructuring
> is against the rules. If it were a good idea, I'm sure Rich would have
> done it by now.

I didn't hack destructuring, I made a small change to whether or not
'nth' expands into an inline form in a certain case. It does actually
make a big difference on code that does a lot of destructuring.

fft1976

unread,

Aug 11, 2009, 2:43:59 PM8/11/09

to Clojure

On Aug 11, 4:50 am, Jonathan Smith <jonathansmith...@gmail.com> wrote:

> I don't think you have to put *everything* in the let, just your
> constants. (so days per year and solar mass, the bodies themselves).

How will they "escape" from the LET though? I see that in your code
everything is inside a LET. That's what I tried.

By the way, solar-mass, days-per-year, init-bodies only get accessed
once in my code. No point in optimizing that.

> > Your version is not pure-functional, right? If so, I think I still
> > have the fastest pure-functional and non-type declaring version.
>
> Ok, sure.

Yes! Touchdown.

Jonathan Smith

unread,

Aug 11, 2009, 3:25:32 PM8/11/09

to Clojure

On Aug 11, 2:43 pm, fft1976 <fft1...@gmail.com> wrote:
> On Aug 11, 4:50 am, Jonathan Smith <jonathansmith...@gmail.com> wrote:
>
> > I don't think you have to put *everything* in the let, just your
> > constants. (so days per year and solar mass, the bodies themselves).
>
> How will they "escape" from the LET though? I see that in your code
> everything is inside a LET. That's what I tried.
>

I don't think I quite understand what you mean by 'escape'? I mean,
generally speaking they get captured in the environment of the more
inner form (i.e. main has everything in its environment).

You could just as easily use something like defn- to regulate external/
internal-ness, I just happened to feel like using let in this
situation.

I think java has to do different stuff depending on whether something
is public or private. I think private is faster... I'm not a java guru
tho.

> By the way, solar-mass, days-per-year, init-bodies only get accessed
> once in my code. No point in optimizing that.
>

No i guess not. I used it in a couple of places I thought. Has been a
while since i looked at the code. :-)

> > > Your version is not pure-functional, right? If so, I think I still
> > > have the fastest pure-functional and non-type declaring version.
>
> > Ok, sure.
>
> Yes! Touchdown.

Woo!
And you can make it faster yet! :-)

Andy Fingerhut

unread,

Aug 11, 2009, 5:26:17 PM8/11/09

to Clojure

In case it matters to anyone, my intent in creating these Clojure
programs to compare their speed to others isn't to try to rip into
Clojure, or start arguments. It is for me to get my feet wet with
Clojure, and perhaps produce some examples that others can learn from
on what performs well in Clojure and what doesn't. So far, it has
even had the unexpected benefit of providing a ready-made test for
Christophe Grand to do a quick test of some improvements to his
implement of transients for Clojure maps. Cool beans.

Also, while it may appear I do benchmarks for a living, I don't :-)
I'd personally like to know before investing more time with Clojure
what kind of performance I can attain with it. Most of my commercial
software development to date has been in assembly and C, and the
expressiveness of Lisp is a breath of fresh air that helps me remember
why I love programming. Clojure's concurrency features, and keeping
around the power of Lisp macros, are a huge pull for me.

But there are those times that you want performance, and I'm curious
how much I can get out of Clojure code, vs. implementing certain inner
loops in Java, C, or what-have-you. Just because some part of your
code might lead you, for desire of better performance, to implement it
by hand in Java or C doesn't deter me from Clojure at all -- but I'd
like to know how likely that is.

OK, with that said, I've now got the n-body benchmark within 3.2 times
Java's run time on the same problem. The only change from my previous
version that sped things up was to replace 3-operand arithmetic
operations with 2-operand operations. Currently, this enables Clojure
to use primitive ops, instead of something slower.

Updated results here:

http://github.com/jafingerhut/clojure-benchmarks/blob/0f7fdd358086cb95c898d7c2656215408a8d0689/RESULTS

As always, suggestions or improved versions are welcome.

Thanks,
Andy

Aaron Cohen

unread,

Aug 11, 2009, 5:36:49 PM8/11/09

to clo...@googlegroups.com

At that point is it possible you're just paying the price of
PersistentVector for the "bodies" vector? Does it improve much if you
change bodies to an array?

fft1976

unread,

Aug 11, 2009, 6:14:49 PM8/11/09

to Clojure

On Aug 11, 2:26 pm, Andy Fingerhut <andy_finger...@alum.wustl.edu>
wrote:

> As always, suggestions or improved versions are welcome.

I noticed that when I wrap ~new-mass in (double ...) in this

(defmacro set-mass! [p new-mass] `(aset ~p 0 ~new-mass))

and other setters, I get warnings.

Andy Fingerhut

unread,

Aug 11, 2009, 8:13:42 PM8/11/09

to Clojure

On Aug 11, 2:36 pm, Aaron Cohen <remled...@gmail.com> wrote:
> At that point is it possible you're just paying the price of
> PersistentVector for the "bodies" vector? Does it improve much if you
> change bodies to an array?

About 7% faster changing bodies to a Java array of java.lang.Object's,
each of which happens to be a Java array of primitive doubles, as
before. Now about 3.0 x Java run time.

http://github.com/jafingerhut/clojure-benchmarks/blob/43ed2f42e6c1485541532e51eacc66488949c658/RESULTS

Thanks,
Andy

Aaron Cohen

unread,

Aug 12, 2009, 12:26:57 AM8/12/09

to clo...@googlegroups.com

I'm actually glad that the difference is that small there.

Nicolas Oury

unread,

Aug 12, 2009, 5:55:40 AM8/12/09

to clo...@googlegroups.com

Hello,

I tried to inline everything in the main loop (the updaters loops) and
obtained on my machine a 15% speed-up.

One of the possible slowdown may come from having arrays and not object.
Maybe, each access need to perform a size check on the array. Which is
not very costly but not negligible when you have very few operations on
each elements.

I read Hot Spot removes checks when it can decide they are useless. But
it is not clear in this program they are useless.
(You take an array in a structure in each iteration and access fields
from 0 to 6. If the size is not known then it has to test acces for 1,
then 2, then 3...)
Maybe you could try each time you take a [b (bodies i)] to check that
it's size is at least 7. If Hot Spot is clever enough it would reduce
the number of checks.
I am not sure it would do much good though, but I would like to know.

Else, you could try to put everything in one big array, and replace
bodies by their index. I wonder whether there would be a speed-up or not
from removing indirections.

Best,

Nicolas.

Aaron Cohen

unread,

Aug 12, 2009, 4:49:04 PM8/12/09

to clo...@googlegroups.com

I'm getting a very significant performance improvement by adding a
couple of JVM parameters (using jdk 1.6.0_14). They are:
-XX:+DoEscapeAnalysis
-XX:+UseBiasedLocking (I think the -server flag is required for those
two flags to do anything).

My runtime with n = 5,000,000 goes from ~7.5 seconds to ~4.5 seconds.

I can't currently check whether the java code gets the same
performance boost, but it's possible and even likely that the clojure
version would see a better improvement from those parameters than the
java one.

Aaron Cohen

unread,

Aug 12, 2009, 5:18:52 PM8/12/09

to clo...@googlegroups.com

I actually just confirmed that, on my computer at least, those flags
have basically no effect on the java version and a big difference on
the clojure one.

Nicolas Oury

unread,

Aug 13, 2009, 5:33:36 AM8/13/09

to clo...@googlegroups.com

-XX:+AggressiveOpts improves another 5-10%.

EscapeAnalysis seems more important than BiasedLocking.

I don't have a disassembling module installed. Could someone use the
PrintAssembly option and put the asm for the JITed method somewhere.

It could be interesting to see it side by side with the java one...

Best,

Nicolas.

Nicolas Oury

unread,

Aug 16, 2009, 6:50:45 AM8/16/09

to clo...@googlegroups.com

Dear all,

The good news: I have a version of the N-body benchmark that goes "as
fast as java".

The bad news: I am cheating a little bit...

As I suspected that a lot of time was spend in the array bound check
arithmetic, I replaced #^doubles in the implementation of body by an
object implemented in Java. (That's where I cheated).
I just created a Body struct with 7 public double fields.

This resulted in a x 2.3 speed-up from the last version.

Then replacing persistent vectors for bodies by Java array gives another
15% speed increase.

Then using my own Java written function to have a more specific
signature to array access gave another 15%. (This means that type cast
seems to have a cost in a very tight loop)

Starting with a program running in 66 sec, I have now a program running
in 17 sec (for 5. 10^7 iterations)
I have not ran the java program but as the 66 sec one was 3.2 times
slower than the Java one, I think the goal is reached.

But we had to cheat.
I think the most crucial point is that there is no way in Clojure to
implement structures with the same performance characteristics than
native java structures. (In this kind of loop, it speed up by 2-2.5
times the program to use such a java object.)

I think I read an article about an array bound check elimination for
jre7 that would get read of a big part of the cost. But having to encode
structures as arrays to have primitive fields is not very idiomatic
anyway.

So to sum up how to have micro-benchmarks as fast as java:
- manual inlining. There seems to have a cost to calling a function.
Maybe the new code generator will improve that.
- possibility to have "native structures". As for now that means writing
small java files. Maybe the new new will improve that. I also read
someone has a gen-struct macro. (biggest factor)
- the right set of options for the JVM. (Escape analysis, biased
locking, AggressiveOpts)
- possibility of having array access similar to those for primitives for
any type. (Less important than the other.)

I think it may be reassuring for people starting a project using Clojure
to know that once they find a bottleneck in a program they can rewrite
it to go at the same speed as the java program doing the same thing.

That's not a programming style I would advocate for a whole project
though.

Best regards,

Nicolas.

On Wed, 2009-08-12 at 00:26 -0400, Aaron Cohen wrote:

Meikel Brandmeyer

unread,

Aug 16, 2009, 3:52:48 PM8/16/09

to clo...@googlegroups.com

Hi,

Am 16.08.2009 um 12:50 schrieb Nicolas Oury:

> The bad news: I am cheating a little bit...

Why is this cheating? People wrote programs in
C and dropped down to Assembly if necessary.
People write programs in Python and drop down
to C if necessary.

Why can't we write programs in Clojure and
drop down to Java if necessary?

Sincerely
Meikel

Bradbev

unread,

Aug 16, 2009, 5:00:07 PM8/16/09

to Clojure

>
> Why can't we write programs in Clojure and
> drop down to Java if necessary?

That's what I find funny about these threads, Clojure's Java interop
is good, Java is easy to write performant code in. There is a clear
path to getting the best JVM performance possible from a Clojure
environment. I'm not saying that we shouldn't worry about Clojure's
general performance - but for microbenchmarks there is a very clear
optimization strategy.

Brad

Nicolas Oury

unread,

Aug 17, 2009, 4:32:50 AM8/17/09

to clo...@googlegroups.com

I was referring to the rules of the benchmark game. When you benchmark
language, using another language is not fair.

If you were to do your own program, of course you could use Java.
However, in the particular circumstance, it is a bit annoying to use
Java just to create a data structure type.

Best,

Nicolas.

Mark Engelberg

unread,

Aug 17, 2009, 4:38:23 AM8/17/09

to clo...@googlegroups.com

Here's what I've learned from following this benchmark thread:

From the various things I've read about Clojure's performance, I've
always had this sense that:
a) if you have a performance problem, there's probably some inner loop
that needs to be optimized, and so
b) you can use Clojure's type-hinting or primitive-casting within that
inner loop to get Java-like performance.

But these benchmarks are showing that when it comes to performance
degradation caused by boxing/unboxing primitives it doesn't always
occur in a single loop that is easily optimized. The most obvious
problem that is showing up in these benchmark examples is that often
these primitives tend to be combined in manipulated as some kind of
struct (e.g., an x,y,z vector). Or sometimes, the logic manipulating
these primitives can't easily be described in a single loop, but is
more easily expressed when spread out over several functions. But in
Clojure, just about everything you do (other than let/loop/recur)
boxes a primitive right back up again, making it very, very difficult
to write performant code. Very often, primitives need to be combined
as part of structured data, passed to a function, etc.

So I agree with those who are pointing to these examples and saying,
"Hey, what can we do to make it easier to optimize our Clojure
programs, while maintaining the basic structure and feel of the code?"

I think Nicolas Oury is right on when he suggests that adding a
Clojure way to build structures with primitive fields would be a
valuable addition from a performance-within-Clojure standpoint,
allowing for more localized rewrites to boost performance.

David Nolen

unread,

Aug 17, 2009, 10:15:15 AM8/17/09

to clo...@googlegroups.com

On Mon, Aug 17, 2009 at 4:32 AM, Nicolas Oury <nicola...@gmail.com> wrote:

I was referring to the rules of the benchmark game. When you benchmark
language, using another language is not fair.

If you were to do your own program, of course you could use Java.
However, in the particular circumstance, it is a bit annoying to use
Java just to create a data structure type.

I would say just be a little patient on this point. Seems like there are things developing related to Clojure-in-Clojure that would help with this. Clojure is already so useful and polished it's easy to forget just how early days it really is.

David

e

unread,

Aug 17, 2009, 10:26:14 AM8/17/09

to clo...@googlegroups.com

i don't know much about this (haven't followed closely, lately), but do the new Transients come into play to somewhat address this? Sounds like they were designed just for this sort of thing: inner-loop optimization and low-level mutation that still works functionally to everything outside...

Nicolas Oury

unread,

Aug 17, 2009, 11:02:53 AM8/17/09

to clo...@googlegroups.com

On this particular example, I think we are a bit further that what Transients currently offers. Even using a mutable primitive Java array results in code 2 or 3 times slower than the Java implementation of the benchmarks.

I have no doubt the struct and transients in Clojure will allow to do that at native speed, at some point in the future though.
And they are the right way to approach this kind of problem, I agree.

Best,

Nicolas.

David Nolen

unread,

Aug 17, 2009, 11:26:37 AM8/17/09

to clo...@googlegroups.com

On Sun, Aug 16, 2009 at 6:50 AM, Nicolas Oury <nicola...@gmail.com> wrote:

Dear all,

The good news: I have a version of the N-body benchmark that goes "as
fast as java".

The bad news: I am cheating a little bit...

You're only cheating if you care about the fantasy world that is microbenchmarks. I think few Clojurians do. Your solution seems like a justified and proper one if you ever need to implement n-body in a shipping software application.

So to sum up how to have micro-benchmarks as fast as java:
- manual inlining. There seems to have a cost to calling a function.
Maybe the new code generator will improve that.

You don't need to manually inline. Convert your function into a macro.

Bradbev

unread,

Aug 17, 2009, 12:25:32 PM8/17/09

to Clojure

On Aug 17, 1:32 am, Nicolas Oury <nicolas.o...@gmail.com> wrote:
> I was referring to the rules of the benchmark game. When you benchmark
> language, using another language is not fair.
>
> If you were to do your own program, of course you could use Java.
> However, in the particular circumstance, it is a bit annoying to use
> Java just to create a data structure type.
>

Ah, that makes more sense re the "cheating" then. Your insight for
array range check elimination got me thinking - why can't the accessor
macros (posx, etc) that use aset/aget have their ranges eliminated by
the JVM? After all, it should be a simple constant fold. I found
another 2-3x speed up by coercing the indexes with (int x), ie
(defmacro mass [p] `(double (aget ~p (int 0))))
I don't have the Java version running on my machine, but I saw
runtimes go from 833ms to 295ms for 100000 iterations, a 2.8x speed
up, which should put the "no cheating" version on the same standing as
the Java implementation.

Cheers,
Brad

Nicolas Oury

unread,

Aug 17, 2009, 1:12:09 PM8/17/09

to clo...@googlegroups.com

Seems to mean that I was wrong and that the cost is both in bound check
and unpacking the indices, mostly the second one.

Mark Engelberg

unread,

Aug 17, 2009, 7:45:22 PM8/17/09

to clo...@googlegroups.com

On Mon, Aug 17, 2009 at 9:25 AM, Bradbev<brad.be...@gmail.com> wrote:
> I found
> another 2-3x speed up by coercing the indexes with (int x), ie
> (defmacro mass [p] `(double (aget ~p (int 0))))

Which makes me wonder why aget doesn't automatically coerce an index
to an int. Would an input that can't be coerced to an int even really
make sense here?

Aaron Cohen

unread,

Aug 17, 2009, 7:47:57 PM8/17/09

to clo...@googlegroups.com

Seems like literals could be auto-coerced to whatever the type needs to be.

Brad Beveridge

unread,

Aug 18, 2009, 11:28:52 AM8/18/09

to FFT, clo...@googlegroups.com, nicola...@gmail.com, mark.en...@gmail.com

On 2009-08-17, at 8:58 PM, FFT <fft...@gmail.com> wrote:

> On Mon, Aug 17, 2009 at 9:25 AM, Bradbev<brad.be...@gmail.com>
> wrote:
>>

>> On Aug 17, 1:32 am, Nicolas Oury <nicolas.o...@gmail.com> wrote:
>>> I was referring to the rules of the benchmark game. When you
>>> benchmark
>>> language, using another language is not fair.
>>>
>>> If you were to do your own program, of course you could use Java.
>>> However, in the particular circumstance, it is a bit annoying to use
>>> Java just to create a data structure type.
>>>
>> Ah, that makes more sense re the "cheating" then. Your insight for
>> array range check elimination got me thinking - why can't the
>> accessor
>> macros (posx, etc) that use aset/aget have their ranges eliminated by
>> the JVM? After all, it should be a simple constant fold. I found
>> another 2-3x speed up by coercing the indexes with (int x), ie
>> (defmacro mass [p] `(double (aget ~p (int 0))))
>

> I'm not seeing this. Maybe you are running this on "-client"?
I'm running Java 1.5 32bit on OS X 10.5 with -server.

>
>> I don't have the Java version running on my machine, but I saw
>> runtimes go from 833ms to 295ms for 100000 iterations, a 2.8x speed
>> up, which should put the "no cheating" version on the same standing
>> as
>> the Java implementation.
>

> You can't get a consistent timing for anything less than 1-10M
> iterations here.
Why do you think that? Everything I've read says that hotspot kicks
in at 10,000, and I always do a warmup run.
I see consistent enough timings, within 50ms each run. When coerced
ints gives an immediate 3x speedup something is happening. What JVM
are you running & what settings? I'll compile the java version soon
so I can do a direct compare on a single machine. I take it that your
setup is showing clojure 3x slower than the java version?

Brad

FFT

unread,

Aug 17, 2009, 11:58:27 PM8/17/09

to clo...@googlegroups.com, brad.be...@gmail.com, nicola...@gmail.com, mark.en...@gmail.com

On Mon, Aug 17, 2009 at 9:25 AM, Bradbev<brad.be...@gmail.com> wrote:
>

> On Aug 17, 1:32 am, Nicolas Oury <nicolas.o...@gmail.com> wrote:
>> I was referring to the rules of the benchmark game. When you benchmark
>> language, using another language is not fair.
>>
>> If you were to do your own program, of course you could use Java.
>> However, in the particular circumstance, it is a bit annoying to use
>> Java just to create a data structure type.
>>
> Ah, that makes more sense re the "cheating" then. Your insight for
> array range check elimination got me thinking - why can't the accessor
> macros (posx, etc) that use aset/aget have their ranges eliminated by
> the JVM? After all, it should be a simple constant fold. I found
> another 2-3x speed up by coercing the indexes with (int x), ie
> (defmacro mass [p] `(double (aget ~p (int 0))))

I'm not seeing this. Maybe you are running this on "-client"?

> I don't have the Java version running on my machine, but I saw

> runtimes go from 833ms to 295ms for 100000 iterations, a 2.8x speed
> up, which should put the "no cheating" version on the same standing as
> the Java implementation.

You can't get a consistent timing for anything less than 1-10M iterations here.

Aaron Cohen

unread,

Aug 18, 2009, 3:32:07 PM8/18/09

to clo...@googlegroups.com

I don't see much of any difference here from those coercions either.
What clojure version are you using?
-- Aaron

Aaron Cohen

unread,

Aug 18, 2009, 4:00:16 PM8/18/09

to clo...@googlegroups.com

I reworked advance! a little bit and while it didn't have much
performance impact, I find this version a little clearer:

(defmacro doarray
"Executes an expression for each element of array a (presumably for
its side-effects), using an index named idx,
beginning with element 'start'"
[a idx start expr]
`(let [a# ~a end# (int (alength a#))]
(loop [~idx (int ~start)]
(if (< ~idx end#)
(do
~expr
(recur (unchecked-inc ~idx)))))))

(defn advance! [#^"[Ljava.lang.Object;" bodies delta-t]
(let [delta-t (double delta-t)]
(doarray bodies i1 0
(doarray bodies i2 (unchecked-inc i1)
(let [#^doubles b1 (aget bodies i1)
#^doubles b2 (aget bodies i2)
delta-posx (- (posx b1) (posx b2))
delta-posy (- (posy b1) (posy b2))
delta-posz (- (posz b1) (posz b2))
dist-squared (+ (+ (* delta-posx delta-posx)
(* delta-posy delta-posy))
(* delta-posz delta-posz))
dist (Math/sqrt dist-squared)
mag (/ delta-t (* dist-squared dist))
b1-scale (* (- mag) (mass b2))
dv1x (* delta-posx b1-scale)
dv1y (* delta-posy b1-scale)
dv1z (* delta-posz b1-scale)
b2-scale (* mag (mass b1))
dv2x (* delta-posx b2-scale)
dv2y (* delta-posy b2-scale)
dv2z (* delta-posz b2-scale)]
(add-to-vel! b1 dv1x dv1y dv1z)
(add-to-vel! b2 dv2x dv2y dv2z))))
(doarray bodies i 0
(let [#^doubles b (aget bodies i)]
(set-posx! b (+ (posx b) (* (velx b) delta-t)))
(set-posy! b (+ (posy b) (* (vely b) delta-t)))
(set-posz! b (+ (posz b) (* (velz b) delta-t)))))))

Reply all

Reply to author

Forward