Re: JMH vs Caliper: reference thread

Aleksey Shipilev Feb 1, 2014 4:19 AM
Posted in group: mechanical-sympathy
Full disclosure: I work for Oracle, and do Java performance work in
OpenJDK. I also develop and maintain JMH, and JMH is my 4th (I think)
benchmark harness. Hence, my opinion is biased, though I try to stay
objective because we've been in Caliper's shoes...

Disclaimer: I am not the only maintainer and developer for JMH. It was
developed with heavy contributions from both JRockit (where it came
originally) and HotSpot performance teams. Hence, when I say "we", I mean
many JMH contributors.

IMO, Caliper is not that bad for large benchmarks. In fact, Caliper feels just
like the pre-JMH harnesses we had internally in Sun/BEA. And that is not a
coincidence, because Caliper's benchmark interface is the intuitive and
obvious one. The sad revelation that came upon me over the previous several
years is that the simplicity of a benchmark API does not correlate with
benchmark reliability.

I don't follow Caliper development, and I'm not in a position to bash Caliper,
so instead of claiming anything that Caliper does or does not do, let me
highlight the history of JMH redesigns over the years. That should help
in reviewing other harnesses, since I can easily say "been there, tried that,
it's broken <in this way>". Most of these things can even be guessed from the
API choices the harness makes. If the API does not provide the instruments to
avoid a pitfall, then it is very probable the harness makes no moves to avoid
it (except for the cases where magic dust is involved).

I tend to think this is a natural way for a benchmark harness to evolve, and
you can map this timeline back onto your favorite benchmark harness. The
pitfalls are many and tough; the non-exhaustive "important list" is as follows:

A. Dynamic selection of benchmarks. 

Since you don't know at "harness" compile time which benchmarks it will run,
the obvious choice is calling the benchmark methods via Reflection.
Back in the day, this pushed us to accept a "repetition" counter in
the method to amortize the reflective costs. This already introduces the
major pitfall about looping, see below.

But infrastructure-wise, the harness then has to choose the repetition count
intelligently. This almost always leads to calibration mechanics, which are
almost always broken when loop optimizations are in effect. If one
benchmark is "slower" and autobalances to a lower reps count, and another
benchmark is "faster" and autobalances to a higher reps count, then the
optimizer has more opportunity to optimize the "faster" benchmark even further.
That takes us away from seeing how exactly the benchmark performs, and
introduces another (hidden! and uncontrollable!) degree of freedom.

In retrospect, the early decision in JMH to generate synthetic
benchmark code around the method, which contains the loop (carefully chosen
by us to avoid the optimizations in current VMs -- separation of concerns,
basically), is paying off *very* nicely. We can then call that synthetic
stub via Reflection without even bothering about the costs.

...That is not to mention users can actually review the generated benchmark
code when looking for explanations of weird effects. We do that
frequently as an additional control.
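
To make the contrast concrete, here is a minimal hypothetical sketch (class
and method names are mine; this is neither actual Caliper code nor a JMH
sample): a reps-style method with its user-visible loop, next to a plain JMH
@Benchmark method where the measurement loop lives only in the generated stub.

import org.openjdk.jmh.annotations.Benchmark;

public class RepsVsGenerated {

    // Hypothetical reps-style method (the pattern described above, not actual
    // Caliper code): the user-visible loop invites loop optimizations, and the
    // harness has to calibrate `reps` per benchmark, adding a hidden degree of
    // freedom to the comparison.
    public int timeLog(int reps) {
        int dummy = 0;
        for (int i = 0; i < reps; i++) {
            dummy += (int) Math.log(i + 1);
        }
        return dummy;
    }

    // JMH style: no user-visible loop. The harness generates a synthetic stub
    // with the measurement loop around this method and calls that stub via
    // Reflection, so the reflective cost is paid once per iteration, not per
    // invocation. (The constant input has its own pitfall; see D below.)
    @Benchmark
    public double log() {
        return Math.log(42);
    }
}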

B. Loop optimizations.

This is by far my biggest gripe with almost every harness. What is the
usual answer to "My operation is very small, and the timer's granularity/latency
is not able to catch the effect"? Why, yes, of course: wrap it in an indexed loop.
This mistake is painfully obvious, and a real pain in the back to prevent. We
even have the JMH sample to break the habit of people coming in to build the
same style of benchmarks:

(BTW, the last time I tried Caliper a few years ago, it even refused to run
when calibration said the running time did not change when changing the
reps count. Well, THANK YOU, but I really WANT to run that benchmark!)
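
For illustration only (this is not the JMH sample itself, and the names are
mine), a minimal sketch of the pitfall: a hand-written measurement loop the
JIT is free to unroll and pipeline, next to a loop-free method where the
repetition is left to the harness-generated code.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LoopPitfallSketch {
    int x = 42;

    // Measures one operation per invocation; the repetition is done by the
    // harness-generated loop, which is designed to resist loop optimizations.
    @Benchmark
    public int single() {
        return x * x;
    }

    // The "obvious" fix for timer granularity: wrap the payload in a loop and
    // divide by N afterwards. The JIT can unroll and pipeline this loop, so
    // the derived "per-op" time can come out far below the real cost of a
    // single operation -- the classic looping pitfall.
    @Benchmark
    public int manualLoop() {
        int acc = 0;
        for (int i = 0; i < 1_000; i++) {
            acc += x * i;
        }
        return acc;
    }
}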

C. Dead-code elimination.

This is my favorite pet peeve. It is remarkably hard to introduce a
side effect into the benchmark which is both reliable and low-overhead.
The low-overhead part really requires JVM expertise to get right, and
pushing that onto users is very, very dumb. JMH's Blackhole classes took
a significant amount of our time to implement correctly, and we are still doing
tunings here and there to minimize their costs [to the extreme that we are
thinking about a proper VM interface to consume the values]. Remarkably,
we can hide all that complexity behind a simple user interface, and let
users concentrate on their workloads. This is what good harnesses do.


The usual ways to deal with DCE are broken in subtle ways:

 a) Returning the value from the reflective call: the JIT inflates the
reflective call, inlines it as a usual Java method, and DCE ensues.

 b) Writing the values into fields: doing that in a loop means the runtime
can just keep the latest value and DCE everything else; storing Objects into
fields usually entails GC store barriers; storing into fields usually entails
false sharing with some other important data...

 c) Accumulating the values in locals and printing them: still allows loop
pipelining and partial DCE-ing; and also, good luck with Objects!

You might want to investigate which one your favorite harness is using.
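
A minimal sketch of the JMH-side answer (illustrative class name; @Benchmark,
@State and Blackhole are real JMH API): an unused result that is fair game
for DCE, a returned result consumed by the generated stub, and a Blackhole
for multiple values.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class DeadCodeSketch {
    double x = Math.PI;

    // Broken: the result is never used, so the compiler may eliminate the
    // whole computation and you end up timing an empty method.
    @Benchmark
    public void dead() {
        Math.log(x);
    }

    // Returning the value lets the generated stub consume it reliably.
    @Benchmark
    public double returned() {
        return Math.log(x);
    }

    // For multiple values, hand them to the harness-injected Blackhole.
    @Benchmark
    public void consumed(Blackhole bh) {
        bh.consume(Math.log(x));
        bh.consume(Math.sqrt(x));
    }
}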

D. Constant folding

As much as dead-code elimination is a buzzword in the benchmarking community,
the symmetric effect is mostly overlooked. That is, DCE works by eliminating
part of the program graph because of unclaimed outputs. But there is
also the optimization that eliminates part of the program graph because of
predictable inputs. This JMH sample demonstrates the effect:

Avoiding this issue again requires JVM expertise, and it is cruel to push
users to do that. It takes a very careful design of the benchmark loop to break
load coalescing across loop iterations when you *also* want to provide low
overhead for fine-grained benchmarks. We spent a considerable amount of time
tuning up the measurement loops (and this is transparent to JMH users,
because you "just" recompile the benchmark code, and the new synthetic code
is generated, voila).

When a harness asks users to create the benchmark loop on their own, it pushes
users to deal with this issue on their own as well. I can count the people
who have the time, courage, and expertise to write this kind of code on the
fingers of one hand.
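
A minimal sketch of the distinction, in the spirit of the JMH constant-folding
sample (illustrative names): a predictable input the compiler can fold away,
versus an input loaded from a non-final state field.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ConstantFoldSketch {
    // A non-final instance field: the compiler cannot assume its value, so
    // the computation has to happen on every invocation.
    double x = Math.PI;

    // Predictable input: log(PI) can be computed once and the cached result
    // returned, so this measures "return a constant", not "compute a log".
    @Benchmark
    public double folded() {
        return Math.log(Math.PI);
    }

    // Input loaded from state: the folding opportunity goes away.
    @Benchmark
    public double measured() {
        return Math.log(x);
    }
}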

E. Non-throughput measures

Now, when the payload is wrapped in the benchmark loop, it seems impossible
to collect any non-throughput metrics. The two most significant ones we
learned about through our internal JMH use are: sampling the execution time,
and single-shot measurements.

Measuring individual timings is very tough, because timer overheads can
be very painful, and there are also the coordinated omission tidbits, yada-yada...
That is, without a smart scheme that samples only *some* invocations, you
will mostly drown in the timing overheads. It turns out sampling is rather
easy to implement in a harness which *already* generates the synthetic code.
This is why JMH's support for SampleTime was so clean and easy to implement.
(Success story: measuring FJP latencies in JDK 8 Streams)

Measuring single-invocation timings is needed for warmup studies: what's
the time to invoke the payload "cold"? Again, once you generate the code
around the benchmark, it is easy to provide the proper timestamping. When
your harness implements multiple forks, it is very easy to get thousands
of "cold" invocations without your coffee getting cold. What if your harness
requires a reps count and requires calibration? Forget it.

The second-order concern is to provide a clean JVM environment for this
kind of run. In JMH, there is a separation between the host JVM and the forked
JVM, where most of the infrastructural heavy lifting like regexp
matching, statistics, printing, etc. is handled in the host VM. The forked VM
fast-paths to "just" measuring, not contaminating itself with most of the infra
stuff. This makes the SingleShot benchmark modes very convenient in JMH.
(Success story: JDK 8 Lambda linkage/capture costs, and also JSR 292 things)

See the examples here: 

It is educational to compile the benchmarks and look at the generated code
to see the loops we generate for them (target/generated-sources/...)
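
A minimal sketch of how these modes look from the user side (illustrative
names; the annotations and modes are real JMH API, the iteration and fork
counts are arbitrary):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ModesSketch {
    double x = 42;

    // Sampling mode: only some invocations are timed, which keeps the timer
    // overhead from drowning the signal, yet yields a latency distribution
    // (percentiles), not just a mean.
    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    public double sampled() {
        return Math.log(x);
    }

    // Single-shot mode across many forks: each fresh JVM contributes a single
    // "cold" invocation, so thousands of cold-start data points come cheap.
    @Benchmark
    @BenchmarkMode(Mode.SingleShotTime)
    @Warmup(iterations = 0)
    @Measurement(iterations = 1)
    @Fork(100)
    public double cold() {
        return Math.log(x);
    }
}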

F. Synchronize iterations

Everything gets significantly more complicated when you start to support
multi-threaded benchmarks. It is *not* enough to shove in an executor and
run the benchmark in multiple threads. The simplest issue everyone overlooks
is that starting/stopping threads is not instantaneous, so you need to
care whether all your worker threads have indeed started. More in this JMH
example:

Without this, most heavily-threaded benchmarks are way, way off the
actual results. We routinely saw >30% differences prior to introducing this
kind of workaround. The only other harness I know of doing this is SPECjvm2008.

G. Multi-threaded sharing

Multi-threaded benchmarks are also interesting because they introduce
sharing. It is tempting to "just" make the benchmark object either shared
between the worker threads, or allocate completely distinct objects for
each worker thread. That's the obvious way to introduce sharing in the
benchmark API. 

However, the reality begs to differ: in many cases, you want the
state-bearing objects to have *different* shareability domains. E.g. in many
concurrent benchmarks, I want to have a shared state which holds the
concurrent primitive under test, and a distinct state which keeps my
per-thread scratch data.

In JMH, this forces you to introduce @State:

...together with some clean way of injecting the state objects into the run,
since the default benchmark object is not an appropriate substitute (it can't
be both shared and distinct).
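
A minimal sketch of what that looks like (illustrative names; the annotations
and scopes are real JMH API): a Scope.Benchmark state holds the shared
primitive under test, a Scope.Thread state holds per-thread scratch, and both
are injected as method parameters.

import java.util.concurrent.ConcurrentLinkedQueue;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

public class SharingSketch {

    // One instance shared by all worker threads: the concurrent primitive
    // under test lives here.
    @State(Scope.Benchmark)
    public static class Shared {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
    }

    // One instance per worker thread: the per-thread scratch data.
    @State(Scope.Thread)
    public static class Scratch {
        int next;
    }

    // Both kinds of state are injected as parameters of the benchmark method.
    @Benchmark
    public Integer offerAndPoll(Shared s, Scratch t) {
        s.queue.offer(t.next++);
        return s.queue.poll();
    }
}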

H. Multi-threaded setup/teardown

States often require setup and teardown. It gets interesting for two
reasons: 1) in many cases, you don't want any non-worker thread to touch the
state object, and want only the worker threads to set up/tear down state
objects, as in the cases where you initialize thread-local structures or
otherwise care about NUMA and locality -- this calls for tricky lazy-init
schemes; 2) in many cases, you have to call setup/teardown on shared objects,
which means you need to synchronize workers, and you can't do that on hot
paths by blocking the worker threads (schedulers kick in and ruin everything)
-- this calls for tricky busy-looping concurrency control.

Fortunately, it can be completely hidden under the API, like in JMH:
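
A minimal sketch of the user-facing side (illustrative names; @Setup,
@TearDown and Level are real JMH API; this is not the linked sample): the
setup of a Thread-scoped state runs in the worker thread that owns it, which
covers the locality case above.

import java.util.ArrayList;
import java.util.List;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

public class SetupTeardownSketch {

    @State(Scope.Thread)
    public static class PerThreadData {
        List<Integer> scratch;

        // For a Thread-scoped state, setup is executed by the worker thread
        // that owns the object, keeping the data local to the thread using it.
        @Setup(Level.Iteration)
        public void fill() {
            scratch = new ArrayList<>();
            for (int i = 0; i < 1_000; i++) {
                scratch.add(i);
            }
        }

        @TearDown(Level.Iteration)
        public void drop() {
            scratch = null;
        }
    }

    @Benchmark
    public int sum(PerThreadData d) {
        int s = 0;
        for (int v : d.scratch) {
            s += v;
        }
        return s;
    }
}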

I. False-god-damned-sharing

And of course, after you are done with all the API support for multi-threaded
benchmarks, you have to dodge some new unfortunate effects.
False-god-damned-sharing included. The non-exhaustive list of places where we
got false sharing, and it affected our results: 1) can't afford false sharing
on the "terminate" flag, which can be polled every nanosecond; 2) can't
afford false sharing in blackholes, because you deal with nanosecond-scale
events there; 3) can't afford false sharing in state objects, because you
know why; 4) can't afford false sharing in any other control structure which
is accessed by worker threads.

In JMH, we did a lot, scratch that, *A LOT* to avoid false sharing in the
infra code. We also automatically pad the state objects, providing at
least some level of protection for otherwise oblivious users.
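
For a flavor of the technique (this is not JMH's internal code, just the
shape of the classic inheritance-based padding idiom used by several
concurrency libraries):

// Padding fields in the super- and subclass surround the hot field, so it is
// unlikely to share a cache line with unrelated data that other threads write.
class PadBefore {
    long p01, p02, p03, p04, p05, p06, p07, p08;
}

class TerminationFlag extends PadBefore {
    volatile boolean terminate;   // polled by workers every few nanoseconds
}

public class PaddedTerminationFlag extends TerminationFlag {
    long p11, p12, p13, p14, p15, p16, p17, p18;
}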

J. Asymmetric benchmarks

Now that you take a breath after working hard to deal with all these issues,
you have to provide support for benchmarks which are asymmetric. I.e.
in the same run, you might want to have benchmark methods executing
_different_ chunks of code, and measure them _distinctly_. A working example is
Nitsan's queuing experiments:

...but let me instead show the JMH example:
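
A minimal sketch of the asymmetric setup (illustrative names; @Group,
@GroupThreads and Scope.Group are real JMH API; this is not the linked
example): producer and consumer methods bound into one group and measured
distinctly.

import java.util.concurrent.ArrayBlockingQueue;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Group;
import org.openjdk.jmh.annotations.GroupThreads;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Group)
public class AsymmetricSketch {
    ArrayBlockingQueue<Integer> q = new ArrayBlockingQueue<>(1024);

    // One thread of the group runs the producer side...
    @Benchmark
    @Group("pingPong")
    @GroupThreads(1)
    public void offer() throws InterruptedException {
        q.put(42);
    }

    // ...and another runs the consumer side; the two methods are measured
    // and reported separately within the same run.
    @Benchmark
    @Group("pingPong")
    @GroupThreads(1)
    public Integer take() throws InterruptedException {
        return q.take();
    }
}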

K. Inlining

The beast of beasts: for many benchmarks, the performance differences
can only be explained by inlining differences, which broke/enabled some
additional compiler optimizations. Hence, playing nice with the inliner is
essential for a benchmark harness. Again, pushing users to deal with this
completely on their own is cruel, and we can ease their pain a bit.

JMH does two things: 1) it peels the hottest measurement loop into a separate
method, which provides the entry point for compilation, and the inlining
budget starts there; 2) it provides the @CompilerControl annotation to control
inlining in some known places (@GMB and Blackhole methods are forcefully
inlined these days, for example).

Of course, we have a sample for that:
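
A minimal sketch of @CompilerControl in use (illustrative names; the
annotation and its modes are real JMH API; this is not the sample itself):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class InliningSketch {
    double x = Math.PI;

    // Keep the payload out of line: the call boundary stays put, so the
    // inliner cannot change the shape of the compiled benchmark behind your back.
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private double payload() {
        return Math.log(x);
    }

    @Benchmark
    public double notInlined() {
        return payload();
    }

    // Contrast: forcing the payload to inline lets optimizations work across
    // the call boundary, often changing the result noticeably.
    @CompilerControl(CompilerControl.Mode.INLINE)
    private double payloadInlined() {
        return Math.log(x);
    }

    @Benchmark
    public double inlined() {
        return payloadInlined();
    }
}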


The benchmarking harness business is very hard, and very non-obvious. My own
experience tells me even the smartest people make horrible mistakes in them,
myself included. We try to get around that by fixing more and more things
in JMH as we discover them, even if that means significant API changes.
Please do not trust the names behind the projects, whether it's Google or
Oracle -- the only thing that matters is whether the projects are up to the
technical challenges they face.

The job of a benchmark harness is to provide a reliable benchmarking
environment. It can go further than that (up to the point where the harness
can <strike>read mail</strike> submit results to GAE), but that is only
prudent if it gets its primary job done.

The issues above explain why I get all amused when people bring up trivial
things like IDE support and/or the ability to draw graphs as the
deal-breakers for benchmark harness choices. It's like looking at a
cold fusion reactor and deciding to run the coal power plant instead,
because the fusion reactor has an ugly shape and is painted in a color you
don't particularly like.