The benchmark I did was the fannkuchen-redux benchmark; the code I came up with was the following:
http://github.com/Licenser/clj-shootout/blob/master/fannkuchen/src/profile/fannkuchen.clj
The Scala version I used is from the Alioth shootout:
http://github.com/Licenser/clj-shootout/blob/master/fannkuchen/scala/fannkuchen.scala
For the Clojure version, I tried to get rid of anything that isn't very close to Java. All data is stored in mutable arrays (three in all), and no extra data is allocated or thrown away. I turned everything I could into a macro (removing the overhead of function calls), and I tried to use unchecked and primitive math wherever possible. The code is not pretty and certainly not idiomatic, but it is somewhat fast: I managed to get my Clojure version to take only 3.5 times longer than the Scala version (I say "only" since the most idiomatic version was too slow to even compare, but it was just beautiful!).
After this I did some profiling (eevar will love me for that) and I noticed one thing that stood out: intCast is called a gazillion times (in a run of about 4 seconds it was called ca. 150 million times). That IS a lot, and it took real time (more than 10% of the overall execution time). That struck me as odd, so I did some digging and found an interesting point:
Java arrays use ints as indexes. Since long is Clojure's primitive, every single array access is typecast via the intCast(long) call, which is expensive since it does range checking. Now, since Clojure is open source (yay), I changed the aget Java code to take a long and do an unchecked typecast, as in '(int) i', when accessing the array, and made my own aget and aset methods that did no type checking (since I knew what they would be called with). This bought about 10% of execution time.
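To illustrate, the difference between the checked and the unchecked conversion looks roughly like this (a sketch of the idea in Clojure, not the actual RT source; the function names are made up, and unchecked-int is a coercion from current Clojure):

```clojure
;; Sketch of the range check that clojure.lang.RT/intCast performs on a
;; long index, versus a plain unchecked truncation. Not the actual
;; source, just the idea.
(defn checked-int-cast [^long x]
  (let [i (unchecked-int x)]
    (if (== i x)
      i
      (throw (IllegalArgumentException.
              (str "Value out of range for int: " x))))))

(defn unchecked-int-cast [^long x]
  ;; plain (int) truncation, no range check
  (unchecked-int x))
```

The checked version pays for a comparison and a branch on every array access; the unchecked one simply keeps the low 32 bits.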
So what I'd suggest is to add some unchecked-aget methods that take a long as index and don't range-check; for array-intensive code this is a 10% speedup, which isn't bad. Other than that, I am still looking into why Clojure is slower by such a margin.
Also, it would be great if there were a way to give functions signatures if desired, as in having (aget Object) that runs (int ...) on the index and (aget int/long) that just passes the value through without casting.
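At the user level, something like the following macros could serve as a stop-gap. The names uget/uset are hypothetical, the unchecked-int coercion is from current Clojure, and this only removes the range check; the array expression still needs a primitive type hint for the access itself to be fast:

```clojure
;; Hypothetical macros that replace the checked long->int index
;; conversion with an unchecked truncation. The caller is responsible
;; for the index actually fitting in an int.
(defmacro uget [a i]
  `(aget ~a (unchecked-int ~i)))

(defmacro uset [a i v]
  `(aset ~a (unchecked-int ~i) ~v))

;; usage:
(let [^longs arr (long-array [1 2 3])]
  (uset arr 0 9)
  (uget arr 0))
```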
Also, the profiles show that despite my best efforts there is still a large number of calls to Number casts. I'm not sure where they come from, likely from Clojure's core internals.
Here are some of the results I got: http://grab.by/57x0
and some profiling results: http://grab.by/57vO (look at the time taken for intCast)
Regards,
Heinz
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
I have to admit this is true...
(With the above realisation, I feel like I'm biting off a chunk of a
cake which I really wanted to keep just a short while ago.)
Sincerely,
Michał
I seem to recall in one of the original Clojure videos, Rich talked
about the relationship between Clojure and Java. There's a long
history of C programmers dropping down to assembly, Python programmers
dropping down to C, and so on. He explained that Java was Clojure's
"assembly". You use Clojure to write the logic that is too complex to
code compactly in Java. You write Java to code the low-level bits you
can't do in Clojure.
I must admit I was surprised by Rich's recent statement that Clojure
is not useful to him if he can't write high performance code in it,
because he "only writes the hard bits". To me this seemed like a
change of perspective from the days when it was accepted that the most
performance-critical parts would likely be implemented in Java;
perhaps the Clojure-in-Clojure project has gotten him more focused on
replacing Java entirely with Clojure. He says we wouldn't like a
Clojure that was written in Clojure as it currently stands, because it
would be too slow. That's probably true, but I never expected it to
be possible. Up until this recent discussion, I believed that the
main focus for Clojure-in-clojure was to find ways to move more parts
into Clojure without affecting performance, in order to achieve better
portability to other hosts. I didn't think it would ever be possible
to move all of it over.
I agree that it has been extraordinarily difficult to write
high-performance numerical code in Clojure. I agree that everyone
benefits if this gets easier. I want faster code as much as anyone.
But I'm worried that people's goals for static-typed-performance might
be unrealistic. I'm delighted that Clojure already performs so well
for a dynamically typed language, thanks to Rich's hard work
optimizing the Java code that underlies Clojure so everything can be
as fast as possible, but it seems that a lot of people here will be
disappointed if Clojure can't compete with Java/Scala on the
benchmarks.
I'm glad someone is starting to tackle these benchmarks. I think it's
especially interesting to see how fast the code can be when written
idiomatically, so I hope we'll see more of these results as well.
Once you start using mutable arrays and type hinting every single
variable, why aren't you just using Java?
With respect to this particular benchmark, I don't think it will be
possible to get idiomatic code in the same ballpark as Java, because
if you use Clojure's vectors to store the permuted values, they will
be boxed. Unless Clojure starts having primitive vectors, I don't see
how it's possible to get around this. But I'd love to be proven
wrong.
> When exactly did people start expecting Clojure to be as fast as Java
> and/or Scala?
>
In one of the earlier talks/videos, the claim was that Clojure is within 1x to 3x of Java performance.
Fast math performance was touched on here also
http://www.infoq.com/interviews/hickey-clojure
And that's pre 1.0
I don't disagree with your message as a whole, but:
> Once you start using mutable arrays and type hinting every single
> variable, why aren't you just using Java?
The argument I've heard is this: there are two ways to get a fast
program in a slow-by-default language.
The first is Python's way: write the whole thing in Python (high-
level), find out the parts that are slow, then rewrite those in a
different language (C). Now you need a C compiler, you need to build
separately on each platform, and so on.
The second is Common Lisp's way: write the whole thing in un-annotated
CL (high-level), find out the parts that are slow, then add
annotations or make representational changes (lists => vectors, alists
=> hash-maps) to allow the compiler to make them fast.
The argument is that being able to use *the same language* to write
anywhere on the spectrum from slow/succinct to fast/verbose is useful:
no complex combined build system, no bridging between environments,
iterative development, easier profiling, your code always works, etc.
If I can take one little part of my Clojure program and make it ugly
like Java to get it as fast as Java, it's better than having to
actually _use_ Java in a heterogeneous project.
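As a toy example of that spectrum in Clojure itself, here is the same sum written twice, once idiomatically and once in the hinted loop/recur style (using the primitive hints available in current Clojure; the function names are invented):

```clojure
;; The same computation, slow/succinct and fast/verbose, in one language.
(defn sum-idiomatic [xs]
  (reduce + xs))

;; Hinted version: primitive double array in, primitive double out,
;; no boxing inside the loop.
(defn sum-fast ^double [^doubles xs]
  (let [n (alength xs)]
    (loop [i 0, acc 0.0]
      (if (< i n)
        (recur (inc i) (+ acc (aget xs i)))
        acc))))
```

Only the hot function turns ugly; the rest of the program stays high-level.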
-R
> Don't have time to dive into this right now, but all I can say is now you are starting to get an idea of what the "other side" looks like. The kind of hoop jumping you have to do to get within 3x of Scala is absurd.
Yes, seeing what the "other side" looks like was one of the goals, and I agree it is absolutely insane. So, to be mean: adding a few (long) casts or +' in there won't make it much worse :P But that isn't the point; I wanted to see how far one can go.
> This style of coding is largely unnecessary with the latest equals branch. However in order to be competitive you're going to have to be a bit more clever with your combination of deftype/record, protocols, statics:
>
> * Data locality, solved by fields with deftype/record (this also allows defining a bunch of methods/fns that have access to the field type, avoiding casts)
> * primitive return solved by static
> * macro inlining is unnecessary with static
I took the approach that seemed most likely to be fast, and I figured the lower you go the faster you get. I am not entirely sure, but I think that using records and types instead of arrays would be slower than this implementation; I'll gladly be proven wrong :P
On Jun 25, 2010, at 4:50 , Mark Engelberg wrote:
> When exactly did people start expecting Clojure to be as fast as Java
> and/or Scala?
Don't get me wrong, I neither expect nor demand Clojure to be as fast as Java/Scala. This was (or is) a scientific experiment to see how fast it can be; Scala is just a nice measuring pole, especially since Rich mentioned that the changes in the equals branch bring us a bit closer to the point where people can't claim Clojure is that much slower.
> I'm glad someone is starting to tackle these benchmarks. I think it's
> especially interesting to see how fast the code can be when written
> idiomatically, so I hope we'll see more of these results as well.
> Once you start using mutable arrays and type hinting every single
> variable, why aren't you just using Java?
Because Java is ugly as hell; it's worse than the most horrible Clojure code you could possibly write :P (my 2 cents). Also, it would defeat the idea behind this test. The idiomatic (at least in my eyes) Clojure code is also included in the tests, but it is slower by an order of magnitude, which makes it un-benchmarkable for longer runs. On the other hand, the idiomatic code nearly reads like the problem description, which is wonderful, and I think in most cases I'd say 'screw speed and make it nice' ... actually I said that before :P ... but you get the point, I think.
Regards,
heinz
I've actually been thinking of learning scala to have something to
"drop down to" when clojure isn't fast enough. Anyone have experience
with this?
martin
> Have you tried manually copying the perm array (as in the Scala version)?
> It is probably not much better or worse, but it would be nice to use the same algorithm as Scala, for comparison purposes.
Yes, the original version did that. The impact of the change was minimal, but in the end I found it cleaner to use the low-level system function.
Regards,
Heinz
On my machine this is about 4x slower than the shootout Java
implementation. Using Java as the baseline and comparing my local
results to the shootout timings, it puts Clojure 1.1 on par with
Erlang, Go and OCaml.
On profiling I have a bunch of intCast(Object) and doubleCast(double)
totaling ~9% CPU time (no screenshot yet, sorry), I'll see if I can
eliminate them for a few percent improvement.
Whatever JVM flag tweaks can be done for Clojure can be done for Java
too, and I'm primarily interested in relative performance, which I
suppose is roughly the same as long as I use the same flags for both.
OK, I tried this. Object field access instead of arrays made a few
percent difference, but not enough to be significant.
Definterface and defprotocol, on the other hand, not only gave cleaner
code but was more than twice as fast. A huge win if you ask me :)
So summarizing this particular benchmark:
* 1.1 style optimization using primitive Java arrays peaks at ~4x
slower than Java.
* 1.2 style optimization using mutable primitive fields in a deftype
is only ~1.7x (70%) slower than Java.
Links:
* more detail including profiling snapshots, JVM version etc.
http://wiki.github.com/j-g-faustus/Clojure-test-code/
* 1.2 implementation:
http://github.com/j-g-faustus/Clojure-test-code/blob/master/shootout/nbody_type.clj
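For readers who want the shape of the "mutable deftype" style without digging through the nbody code, here is a minimal toy sketch (names and fields are invented, not taken from the linked implementation):

```clojure
;; A primitive field declared mutable, updated with set! from inside
;; the type's own methods. The field never leaves the primitive world.
(definterface IAccum
  (^double value [])
  (add [^double dx]))

(deftype Accum [^:unsynchronized-mutable ^double total]
  IAccum
  (value [_] total)
  (add [this dx]
    (set! total (+ total dx))
    this))
```

The definterface gives the methods known primitive signatures, so calls like (.add a 1.5) need no reflection and no boxing.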
I haven't tried the new numeric branches, there seems to be a
sufficient number of people with opinions on those already :)
Tried the equiv branch briefly: The "1.1 style" version is ~4%
quicker, but still ~4x slower than Java and ~2x slower than mutable
deftype.
But I found another issue: Array access apparently converts primitives
to boxed values at every access. This is perhaps because aget/aset is
a function and primitives cannot cross function boundaries?
That would explain the relative slowness of arrays.
Here is a test case http://gist.github.com/458669
And a profiler screenshot http://i589.photobucket.com/albums/ss334/j-g-faustus/profiling/array-test-50k.png
15% CPU time goes to Double.valueOf(double) in all-primitive array
access and another ~4% to intCast(int).
The number of calls to Double.valueOf(double) seems to suggest that it
is called only on aset, not on aget, though I can't think of any
reason how that could be.
Does anyone know more about this?
Regards
jf
Using the equiv branch, I get
* Java arrays: 18s
* Immutable deftype: 10s
* Mutable deftype: 2s
* Plain Java: 2s
I did:
"Java arrays 18s" is Java arrays in Clojure.
"Plain Java 2s" is Java arrays in Java.
I was just asking the other day on #clojure why Clojure had coercion functions at all and why type hints didn't work for primitives. Does this mean 1.2 will get rid of coercions (or make them no longer necessary) and allow type hints for primitives?
If so, that's really awesome. :-D
> user=> (time foo)
...
> "Elapsed time: 245.152 msecs"
I hate to be a wet blanket, but how accurate is this? The doc doesn't
even say whether it measures wall clock time or cpu time. Even if you
knew that, it's at best a foundation to build a real benchmarking
system on.
By "real benchmarking" I mean one that does things like run the
benchmark a few times to fill caches and let the JIT do its thing,
etc., makes sure it runs the target code enough times to get a chunk
of time large enough to mean something, and takes into account the
overhead of making multiple runs. Now do that a dozen or so times so
you can take some stats, so when you compare two timings you actually
have enough information to decide if one is really faster than the
other, or the difference is just fuzz in the measurement system.
Ok, I started building something like that, and immediately started
getting ludicrous results. Googling for Java benchmarks turned up
exactly *one* such tool (which presumably works on any runnable or
callable, but fails on Clojure functions) amongst the sets of
benchmark code (most of which didn't talk about how they measured the
time).
First thing: a timer that returns the time so I can work with it:
(defmacro nano-timer
  "Time body, returning nanoseconds and result"
  [body]
  `(do
     (System/runFinalization)
     (System/gc)
     (let [start# (System/nanoTime)
           res# ~body]
       [(- (System/nanoTime) start#) res#])))
Second thing: run the benchmark many times, subtracting out the loop
overhead:
(defmacro loop-timer
  "Time code in a loop, returning nanoseconds taken minus loop overhead"
  [code count]
  `(let [overhead# (first (nano-timer (dotimes [_# ~count])))
         mytime#   (first (nano-timer (dotimes [_# ~count] ~code)))]
     (- mytime# overhead#)))
Which has the annoying habit of returning negative numbers :-(. I'd
really like to know how the loop that calculates overhead takes more
time than the loop that actually runs the code, no matter how trivial
the code is!
Is anyone using anything more sophisticated than clojure.core/time for
benchmarking clojure code?
<mike
--
Mike Meyer <m...@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
No, but last time I thought about this I figured a very simple
(benchmark ...) would simply:
* Iterate with exponentially higher repeat counts until total runtime
reaches >= 1 second (say).
* Then, iterate with the same count until the timing over some period,
say 10 seconds, has a sufficiently low standard deviation.
* Then, take that repeat count and re-run enough to reach, say, 10
seconds, of time.
Obviously the 1, 10 and std dev thresholds would be selectable, with
some default. So that in the normal case for micro-benchmarking, you'd
just do:
(benchmark (run-my-stuff))
And it would usually "just work" even if run-my-stuff is a trivial
function that completes really really quickly (due to the exponential
growth).
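A rough sketch of the first step, the exponential calibration (the function name and thresholds are placeholders, not a real benchmarking library):

```clojure
;; Double the repeat count until a timed batch of calls to f takes at
;; least target-ns nanoseconds; return that count. This is only the
;; calibration phase of the scheme described above, with no warm-up
;; or statistics.
(defn calibrate [f target-ns]
  (loop [n 1]
    (let [start (System/nanoTime)]
      (dotimes [_ n] (f))
      (if (>= (- (System/nanoTime) start) target-ns)
        n
        (recur (* 2 n))))))
```

Because n doubles each round, even a trivial f reaches the target in a logarithmic number of batches.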
Thoughts? It's obviously not meant for really serious benchmarking,
but I think it would be a good thing to have in cases where one might
now otherwise use (time ..).
--
/ Peter Schuller
This is pretty much where I was heading when I started writing those
two macros. Wanting to be able to time very short runtimes without
timing the loop, I wrote loop-timer to deal with loop overhead. Only
to have it return negative values in the process of trying to get a
reasonable total run time. This makes that process, um, interesting.
> Is anyone using anything more sophisticated than clojure.core/time for
> benchmarking clojure code?
I wrote a benchmarking lib at http://github.com/hugoduncan/criterium
The references in the README are worth checking.
You might also find Lau's blog post of interest:
http://www.bestinclass.dk/index.clj/2010/03/functional-fluid-dynamics-in-clojure.html
--
Hugo Duncan
> On Thu, 01 Jul 2010 13:44:25 -0400, Mike Meyer <mwm-keyword-goo...@mired.org> wrote:
>
>> Is anyone using anything more sophisticated than clojure.core/time for
>> benchmarking clojure code?
>
> I wrote a benchmarking lib at http://github.com/hugoduncan/criterium
Actually, the initial thing I started with did not even use time; I used a very flawed method based on 'time java -cp ...'. But I did this on purpose, since my goal was to benchmark Clojure in comparison to Scala (or other languages), where neither (time) nor Criterium is usable, since they are language-specific.
Regards,
Heinz
> On Thu, 01 Jul 2010 13:44:25 -0400, Mike Meyer
> <mwm-keyword-goo...@mired.org> wrote:
>
> > Is anyone using anything more sophisticated than clojure.core/time for
> > benchmarking clojure code?
>
> I wrote a benchmarking lib at http://github.com/hugoduncan/criterium
>
> The references in the README are worth checking.
I'd seen the Elliptic group posting.
This looks nice, but doesn't work with 1.1 :-(. Do you know the last
commit that did?
Better yet, can I talk you into posting a 1.1 jar file to the
downloads area, maybe along with a 1.2 RC, so users who aren't
comfortable with the Java infrastructure
(http://www.mired.org/home/mwm/papers/simple-clojure.html) can play
along?
Thanks,
> Based on ideas in this article:
> http://www.ibm.com/developerworks/java/library/j-benchmark1.html
>
> The stackoverflow question where I found it, thanks to Michal Marczyk:
> http://stackoverflow.com/questions/3041299/how-to-benchmark-functions-in-clojure
>
>
> Getting *accurate* results can be hard, even with a benchmarking
> library. Criterium runs the code 60 times and does statistical
> analysis on the results, but I can still get variations above +/-10%
> from run to run in the REPL.
>
> I think benchmarking works best when
> * starting a new run each time - i.e. from the command line, a "clean
> slate" JVM
> * having something that runs long enough to stabilize the JVM -
> Criterium wants 1 min or more total runtime.
> * running it more than once and checking that results are tolerably
> consistent
> * looking for differences in orders of magnitude rather than a couple
> of percent more or less.
Criterium (and the Java code based on the same ideas) has what looks
like a serious problem for microbenchmarking:
After figuring out how many times to execute the body to make it last
a minute, it times the execution of the loop that does this. Which
means the reported time includes loop overhead.
When I tried the obvious solution (time an empty loop and subtract
that value from the original) on top of time, it sometimes generated
negative values, meaning the empty loop was taking more time than the
loop being timed. That's a pretty solid indication that the timing of
the target code was lost in the noise of the loop overhead.
Possibly applying this fix together with all of Criterium's code for
stabilizing the JVM would help with the problem. I'd fix it (pretty
easy) and try it myself, but there's no 1.1 jar and the current
sources require 1.2.
If nothing else adding code to measure the empty loop and punting if
the difference between that and the code loop is statistically
insignificant would seem like a good idea.
> This looks nice, but doesn't work with 1.1 :-(. Do you know the last
> commit that did?
I'm not sure that I would be too confident on the correctness of any
version that ran on 1.1.
> Better yet, can I talk you into posting a 1.1 jar file to the
> downloads area, maybe along with a 1.2 RC, so users who aren't
> comfortable with the Java infrastructure
> (http://www.mired.org/home/mwm/papers/simple-clojure.html) can play
> along?
I would need to take the time to backport the lib from its existing
state. As far as I remember there is only very light defrecord usage to
contend with. I'm afraid it is not high priority for me at the moment.
I would also wonder how significant the loop overhead is, especially for
comparing implementations of the same function, which would then have
similar loop overhead. It is interesting to time Thread/sleep calls with
different sleep periods (not that sleep is going to be accurate itself,
but a regression of loop count versus Criterium's measured time and the
sleep time might be interesting).
Another issue with timing fast, simple functions is that the JIT can
optimise a calculation to a constant if it is simple enough.
My major concern with Criterium would rather be on the percentage garbage
collection time reported. There still needs to be work done on removing
the allocation of result structures from the body of the timed results. I
would expect that to have a bigger impact than the loop overhead.
Hugo
--
Hugo Duncan
> > If nothing else adding code to measure the empty loop and punting if
> > the difference between that and the code loop is statistically
> > insignificant would seem like a good idea.
> It's actually notoriously hard to time the "empty loop" on the JVM. Once
> you've iterated a few thousand times, the JIT will kick in and recognize the
> loop does no work (or possibly even find a closed form solution for the loop
> if there are simple math ops within it) and remove it entirely.
If it tosses out your empty loop, then you're back where you started -
timing the loop as well as the operation.
If it finds the closed form solution for the real loop - that would
explain the results I'm seeing.
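One common workaround is to make the loop's result observable, the same idea as JMH's Blackhole, so the JIT cannot prove the work dead. A sketch (with the caveat from the quoted message that a simple enough accumulation can still be reduced to a closed form):

```clojure
;; Accumulate and return a value that depends on every iteration, so
;; the loop body cannot be eliminated as dead code. In a real
;; benchmark the accumulated value would come from the operation
;; being measured, not from the counter.
(defn keep-alive-sum ^long [^long n]
  (loop [i 0, acc 0]
    (if (< i n)
      (recur (inc i) (unchecked-add acc i))
      acc)))
```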
> Cliff Click devotes entire talks to benchmarking in Java, how hard it is to
> do correctly, and the common pitfalls:
> http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf
()#*$Q#% PPT. But I think some of the ideas can be incorporated....
Thanks,
> I don't see how the loop is relevant here; at least if the same benchmarking function is used for all the benchmarks you're doing, it shouldn't make a difference, since the overhead is the same.
It depends on what you're benchmarking. If the loop time is much
smaller than either the actual code time or the standard deviation in
the measurements, then it's just noise, and you can ignore it. If it's
on the order of the same size as the standard deviation, then it can
fool you into falsely concluding that there's no statistically
significant difference between two algorithms. Once it gets to be
around half the total benchmark time, your results are pretty much
worthless.