JMH vs Caliper: reference thread


Gleb Smirnov

Feb 1, 2014, 3:59:39 AM
to mechanica...@googlegroups.com
Hi All,

I have recently seen several discussions where people were trying to decide on which tool to use for benchmarking. Here's one of those.
I think there should exist some trusted reference material on the subject. A thread on mechanical-sympathy seems just fine. Hence, I kindly ask you to:

* List any and all problems and pitfalls that you believe either of the systems has;
* Share your relevant experience in using any of the systems;
* And generally anything that you think will help a party in doubt make the right choice.

The summary of the discussion should also make for an excellent blog post, I think.

Cheers,
Gleb

Aleksey Shipilev

Feb 1, 2014, 7:19:40 AM
to mechanica...@googlegroups.com
Full disclosure: I work for Oracle, and do Java performance work in
OpenJDK. I also develop and maintain JMH, and JMH is my 4-th (I think)
benchmark harness. Hence, my opinion is biased, and I try to stay objective
because we've been in Caliper's shoes...

Disclaimer: I am not the only maintainer and developer for JMH. It was
developed with heavy contributions from both JRockit (where it came
originally) and HotSpot performance teams. Hence, when I say "we", I mean
many JMH contributors.

IMO, Caliper is not as bad for large benchmarks. In fact, Caliper feels just
like pre-JMH harnesses we had internally in Sun/BEA. And that is not a
coincidence, because Caliper's benchmark interface is very intuitive and an
obvious one. The sad revelation that came upon me over the previous several
years is that the simplicity of a benchmark API does not correlate with
benchmark reliability.

I don't follow Caliper development, and I'm not in a position to bash Caliper,
so instead of claiming anything that Caliper does or does not do, let me
highlight the history of JMH redesigns over the years. That should help
in reviewing other harnesses, since I can easily say "been there, tried that,
it's broken <in this way>". Most of the things can even be guessed from the
API choices the harness makes. If the API cannot provide the instruments to
avoid a pitfall, then it is very probable the harness makes no moves to avoid it
(except for the cases where magic dust is involved).

I tend to think this is a natural way for a benchmark harness to evolve, and
you can map this timeline back to your favorite benchmark harness. The
pitfalls are many and tough; the non-exhaustive "important list" is as follows:

A. Dynamic selection of benchmarks. 

Since you don't know at "harness" compile time which benchmarks it will run,
the obvious choice is calling the benchmark methods via Reflection.
Back in the day, this pushed us to accept the same "repetition" counter in
the method to amortize the reflective costs. This already introduces the
major pitfall about looping, see below.

But infrastructure-wise, the harness then has to intelligently choose the
repetition count. This almost always leads to calibrating mechanics, which
are almost always broken when loop optimizations are in effect. If one
benchmark is "slower" and autobalances with a lower reps count, and another
benchmark is "faster" and autobalances with a higher reps count, then the
optimizer has more opportunity to optimize the "faster" benchmark even further.
That departs us from seeing how exactly the benchmark performs and
introduces another (hidden! and uncontrollable!) degree of freedom.

In retrospect, the early-days decision in JMH to generate synthetic
benchmark code around the method, which contains the loop (carefully chosen
by us to avoid the optimizations in current VMs -- separation of concerns,
basically), is paying off *very* nicely. We can then call that synthetic
stub via Reflection without even bothering about the costs.

...That is not to mention users can actually review the generated benchmark
code when looking for explanations of weird effects. We do that
frequently as an additional control.

B. Loop optimizations.

This is by far my biggest gripe with almost every harness. What is the
usual answer to "My operation is very small, and the timers' granularity/latency
is not able to catch the effect"? Why, yes, of course: wrap it in an indexed loop.
This mistake is painfully obvious, and a real pain in the back to prevent. We
even have a JMH sample to break the habit of people coming in to build the
same style of benchmarks:


(BTW, the last time I tried Caliper a few years ago, it even refused to run
when calibration said the running time did not change with the reps count.
Well, THANK YOU, but I really WANT to run that benchmark!)

C. Dead-code elimination.

This is my favorite pet peeve. It is remarkably hard to introduce a
side effect into the benchmark which is both reliable and low-overhead.
The low-overhead parts really require JVM expertise to get right, and
pushing that onto users is very, very dumb. JMH's Blackhole classes took
a significant amount of our time to implement correctly, and we are still doing
tunings here and there to minimize their costs [to the extreme that we are
thinking about a proper VM interface to consume the values]. Remarkably,
we can hide all that complexity behind a simple user interface, and let
users concentrate on their workloads. This is what good harnesses do.

Examples:


The usual ways to deal with DCE are broken in subtle ways:

 a) Returning the value from the reflective call: the JIT inflates the
reflective call, inlines it as a usual Java method, and DCE ensues.

 b) Writing the values into fields: doing that in the loop means the runtime
can keep only the latest write and DCE everything else; storing Objects in
fields usually entails GC store barriers; storing to fields usually entails
false sharing with some other important data...

 c) Accumulating the values in locals and printing them: this still allows loop
pipelining and partial DCE-ing; and also, good luck with Objects!

You might want to investigate which one your favorite harness is using.
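
For contrast, a minimal sketch of the two sinks JMH exposes to the user (written with current JMH annotation names; early JMH used @GenerateMicroBenchmark instead of @Benchmark):

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class DceBench {
    double x = Math.PI;

    @Benchmark
    public double returnIt() {
        return Math.log(x);       // the returned value is consumed by the harness
    }

    @Benchmark
    public void sinkIt(Blackhole bh) {
        bh.consume(Math.log(x));  // explicit sink, handy when there are several values
    }
}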

D. Constant foldings

As much as dead-code elimination is a buzzword in the benchmarking community,
the symmetric effect is mostly overlooked. That is, DCE works by eliminating
parts of the program graph because their outputs are unclaimed. But there is
also the optimization that eliminates parts of the program graph because their
inputs are predictable. This JMH sample demonstrates the effect:


Avoiding this issue again requires JVM expertise, and it is cruel to push
users to do that. It takes a very careful design of the benchmark loop to break
load coalescing across loop iterations, when you *also* want to provide low
overhead for fine-grained benchmarks. We spent a considerable amount of time
tuning up the measurement loops (and this is transparent to JMH users,
because you "just" recompile the benchmark code, and the new synthetic code
is generated, voila).

When a harness asks users to create the benchmark loop on their own, it pushes
users to deal with this issue on their own as well. I can count the people
who have the time, courage, and expertise to write this kind of code on the
fingers of one hand.
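
A minimal sketch of the trap, in the spirit of the constant-folding sample (illustrative only):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class FoldBench {
    double x = Math.PI;

    @Benchmark
    public double folded() {
        return Math.log(Math.PI); // predictable input: the whole computation may fold away
    }

    @Benchmark
    public double measured() {
        return Math.log(x);       // non-final field read: stays a real computation
    }
}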

E. Non-throughput measures

Now, once the payload is wrapped in the benchmark loop, it seems impossible
to collect any non-throughput metrics. The two most significant ones we
learned about through our internal JMH use are: sampling the execution time, and
single-shot measurements.

Measuring individual timings is very tough, because timer overheads can
be very painful, and there are also coordinated-omission tidbits, yada-yada...
That is, without a smart scheme that samples only *some* invocations, you
will mostly drown in the timing overheads. It turns out sampling is rather
easy to implement with a harness which *already* generates the synthetic code.
This is why JMH's support for SampleTime was so clean and easy to implement.
(Success story: measuring FJP latencies on JDK 8 Streams)

Measuring single-invocation timings is needed for warmup studies: what's
the time to invoke the payload "cold"? Again, once you generate the code
around the benchmark, it is easy to provide the proper timestamping. When
your harness implements multiple forks, it is very easy to get thousands
of "cold" invocations without leaving your coffee cold. What if your harness
requires a reps count and requires calibration? Forget it.

The second-order concern is to provide a clean JVM environment for this
kind of run. In JMH, there is a separation between the host JVM and the forked
JVM, where most of the infrastructural heavy lifting like regexp
matching, statistics, printing, etc. is handled in the host VM. The forked VM
fast-paths to "just" measuring, not contaminating itself with most of the infra
stuff. This makes the SingleShot benchmark modes very convenient in JMH.
(Success story: JDK 8 Lambda linkage/capture costs, and also JSR 292 things)

See the examples here: 

It is educational to compile the benchmarks and look at the generated code
to see the loops we are generating for them (target/generated-sources/...).
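
A minimal sketch of how the non-throughput modes are requested (current JMH names assumed; the workload, iteration and fork counts are arbitrary):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

public class ModesBench {
    @Benchmark
    @BenchmarkMode(Mode.SampleTime)       // sampled per-invocation timings
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void sampled() {
        Blackhole.consumeCPU(100);
    }

    @Benchmark
    @BenchmarkMode(Mode.SingleShotTime)   // cold, single-invocation timing
    @Warmup(iterations = 0)
    @Measurement(iterations = 1)
    @Fork(100)                            // many forks give many "cold" samples
    public void cold() {
        Blackhole.consumeCPU(100);
    }
}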

F. Synchronize iterations

Everything gets significantly more complicated when you start to support
multi-threaded benchmarks. It is *not* enough to shove in an executor and
run the benchmark in multiple threads. The simplest issue everyone overlooks
is that starting/stopping threads is not instantaneous, so you need to
make sure all your worker threads have indeed started. More in this JMH example:


Without this, most heavily-threaded benchmarks are way, way off the
actual results. We routinely saw >30% differences prior to introducing this
kind of workaround. The only other harness I know of doing this is SPECjvm2008.
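
For the record, JMH exposes this behavior as a knob too; a rough sketch through the Java API (the benchmark name here is hypothetical):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class SyncIterationsDemo {
    public static void main(String[] args) throws Exception {
        Options opts = new OptionsBuilder()
                .include(".*SomeThreadedBench.*")  // hypothetical benchmark
                .threads(8)
                .syncIterations(true)              // measure only when all workers are running
                .build();
        new Runner(opts).run();
    }
}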

G. Multi-threaded sharing

Multi-threaded benchmarks are also interesting because they introduce
sharing. It is tempting to "just" make the benchmark object either shared
between the worker threads, or allocate completely distinct objects for
each worker thread. That's the obvious way to introduce sharing in the
benchmark API. 

However, the reality begs to differ: in many cases, you want the
state-bearing objects to have *different* shareability domains. E.g. in many
concurrent benchmarks, I want to have the shared state which holds my
concurrent primitive to test, and a distinct state which keeps my scratch
data. 

In JMH, this forces you to introduce @State:

...together with some clean way of injecting the state objects into the run, 
since the default benchmark object is not the appropriate substitute (can't be 
both shared and distinct).
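
A minimal sketch of what the different shareability domains look like to the user (current JMH names assumed):

import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.*;

public class SharingBench {
    @State(Scope.Benchmark)          // one instance shared by all worker threads
    public static class Shared {
        AtomicLong counter = new AtomicLong();
    }

    @State(Scope.Thread)             // one instance per worker thread (scratch data)
    public static class Scratch {
        long local;
    }

    @Benchmark
    public long test(Shared shared, Scratch scratch) {
        return shared.counter.addAndGet(++scratch.local);
    }
}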

H. Multi-threaded setup/teardown

States often require setup and teardown. It gets interesting for two
reasons: 1) in many cases, you don't want any non-worker thread to touch the
state object, and want only the worker threads to set up/tear down state objects,
like in the cases where you initialize thread-local structures or otherwise
care about NUMA and locality -- this calls for tricky lazy-init schemes;
2) in many cases, you have to call setup/teardown on shared objects, which
means you need to synchronize workers, and you can't do that on hot paths
by blocking the worker threads (schedulers kick in and ruin everything) -- this
calls for tricky busy-looping concurrency control.

Fortunately, it can be completely hidden under the API, like in JMH:
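
A rough sketch of the user-facing side of that (current JMH names assumed; the lazy-init and synchronization tricks stay hidden in the harness):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class LifecycleBench {
    long[] scratch;

    @Setup(Level.Iteration)          // executed by the owning worker thread itself
    public void alloc() {
        scratch = new long[1024];    // touched where it is used: NUMA/locality friendly
    }

    @TearDown(Level.Iteration)
    public void drop() {
        scratch = null;
    }

    @Benchmark
    public long sum() {
        long s = 0;
        for (long v : scratch) s += v;
        return s;
    }
}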

I. False-god-damned-sharing

And of course, after you are done with all the API support for multi-threaded
benchmarks, you have to dodge some new unfortunate effects.
False-god-damned-sharing included. The non-exhaustive list of places where we
got false sharing that affected our results: 1) can't afford false sharing
on the "terminate" flag, which can be polled every nanosecond; 2) can't
afford false sharing in blackholes, because you deal with nanosecond-scale
events there; 3) can't afford false sharing in state objects, because you
know why; 4) can't afford false sharing in any other control structure which
is accessed by worker threads.

In JMH, we did a lot, scratch that, *A LOT* to avoid false sharing in the
infra code. We also automatically pad the state objects, providing at
least some level of protection for otherwise oblivious users.

J. Asymmetric benchmarks

Now that you take a breath after working hard dealing with all these issues,
you have to provide support for benchmarks which are asymmetric, i.e.
in the same run, you might want to have the benchmark methods executing
_different_ chunks of code, and measure them _distinctly_. A working example is
Nitsan's queuing experiments:


...but let me instead show the JMH example:
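
A rough sketch in the spirit of that example (current JMH names assumed; the reader/writer split here is illustrative):

import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.*;

@State(Scope.Group)
public class AsymmetricBench {
    AtomicLong v = new AtomicLong();

    @Benchmark
    @Group("pingpong")
    @GroupThreads(1)                 // one writer thread...
    public long writer() {
        return v.incrementAndGet();
    }

    @Benchmark
    @Group("pingpong")
    @GroupThreads(3)                 // ...measured distinctly from three reader threads
    public long reader() {
        return v.get();
    }
}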


K. Inlining

The beast of the beasts: for many benchmarks, the performance differences
can only be explained by inlining differences, which broke/enabled some
additional compiler optimizations. Hence, playing nice with the inliner is
essential for a benchmark harness. Again, pushing users to deal with this
completely on their own is cruel, and we can ease their pain a bit.

JMH does two things: 1) it peels the hottest measurement loop into a separate
method, which provides the entry point for compilation, and the inlining
budget starts there; 2) the @CompilerControl annotation controls inlining
in some known places (@GMB and Blackhole methods are forcefully inlined these
days, for example).

Of course, we have a sample for that:
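
A minimal sketch of the user-facing knob (the helper methods here are illustrative):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class InlineBench {
    int x = 21;

    @Benchmark
    public int inlined() {
        return helperInline(x);
    }

    @Benchmark
    public int notInlined() {
        return helperNoInline(x);
    }

    @CompilerControl(CompilerControl.Mode.INLINE)
    private int helperInline(int v) { return v * 2; }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private int helperNoInline(int v) { return v * 2; }
}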


BOTTOM-LINE:
----------------------------------

The benchmarking harness business is very hard, and very non-obvious. My own
experience tells me even the smartest people make horrible mistakes in them,
myself included. We try to get around that by fixing more and more things
in JMH as we discover more, even if that means significant API changes.
Please do not trust the names behind the projects: whether it's Google or
Oracle -- the only thing that matters is whether the projects are up to the
technical challenges they face.

The job of a benchmark harness is to provide a reliable benchmarking
environment. It could go further than that (up to the point where the harness
can <strike>read mail</strike> submit results to GAE), but that is only prudent
if it gets its primary job done.

The issues above explain why I get all amused when people bring up trivial
things like IDE support and/or the ability to draw graphs as the
deal-breakers for benchmark harness choices. It's like looking at a
cold fusion reactor and deciding to run the coal power plant instead,
because the fusion reactor has an ugly shape and is painted in a color you
don't particularly like.

-Aleksey.

Martin Thompson

Feb 1, 2014, 7:44:54 AM
to mechanica...@googlegroups.com
As someone who has written a lot of benchmarks over the years, and made a lot of mistakes that resulted in character-building experiences when they were bluntly, but rightly, pointed out by JVM engineers :-)

I'm finding that JMH is becoming my tool of choice. The more I use it, the more I'm impressed by how correct it is. I've seen too many cases with my own benchmarks, and the likes of Caliper, where code got optimised to the point of being misleading because of things like loop unrolling, dead code elimination, or de-optimisations resulting in megamorphic dispatch.

However, JMH could improve on the usability front. For example, it could be provided as a downloadable JAR and not require Maven to build it :-) Maybe even have an Ant task with the ability to fail a build on a configurable deviation from a baseline. Performance testing needs to be part of the continuous delivery pipeline so we have continuous performance profiling and testing.

Martin...

Aleksey Shipilev

Feb 1, 2014, 7:48:12 AM
to mechanica...@googlegroups.com
Thanks Martin,

On Saturday, February 1, 2014, at 16:44:54 UTC+4, Martin Thompson wrote:
However JMH could improve on the usability front. For example, it could be provided as a downloadable JAR and not require Maven to build it :-)

JMH has been on Maven Central for a few months now; see the update on the OpenJDK page:

(you can even ask the Maven archetype to generate the benchmark project for you)
 
Maybe even have an Ant task with ability to fail a build on a configurable deviation from a baseline. Performance testing needs to be part of continuous delivery pipeline so we have continuous performance profiling and testing.

I will gladly accept such a task into the mainline JMH workspace, subject to the due OpenJDK contribution process :)

-Aleksey.

Norman Maurer

Feb 1, 2014, 7:48:33 AM
to Martin Thompson, mechanica...@googlegroups.com

You could grab the jar just from here:

http://central.maven.org/maven2/org/openjdk/jmh/





-- 
Norman Maurer



Martin Thompson

Feb 1, 2014, 10:21:16 AM
to mechanica...@googlegroups.com
Maybe I need some sort of treatment to overcome the adverse reaction I seem to have to Maven. :-) I've just seen way, way too many projects end up in library bloat, as mvn makes it so easy to download and then depend on half the Internet. We have roach motel semantics with locks; I feel Maven turns your project into a roach motel for external dependencies. Oh how I dream of simpler days and makefiles... I've got good reasons to really dislike Maven. If you ever had the misfortune of having no choice but to script within it using Jelly, you will know the pain, as just one example.

BTW thanks for the great work on JMH! I'd be happy to help integrate it with CI pipelines when I get the time.

Marshall Pierce

Feb 1, 2014, 1:26:54 PM
to mechanica...@googlegroups.com, Martin Thompson
I have a general dislike of Maven, though perhaps not as vehement as
Martin's. :)

This thread reminded me of how sad it made me to have to use Maven the
last time I made a JMH project, so I threw together a demo project
showing how to use JMH with Gradle:
https://bitbucket.org/marshallpierce/gradle-jmh-demo

A Jenkins plugin that could enforce performance thresholds (and perhaps
generate pretty historical graphs) would be great. Maybe next weekend.

-Marshall


Kirk Pepperdine

Feb 1, 2014, 3:01:50 PM
to mechanica...@googlegroups.com
On Feb 1, 2014, at 4:21 PM, Martin Thompson <mjp...@gmail.com> wrote:

Maybe I need some sort of treatment to overcome the adverse reaction I seem to have to Maven. :-) I've just seen way way too many projects end up in library bloat as mvn makes it so easy to download and then depend on half the Internet. We have roach motel semantics with locks; I feel maven turns your project into a roach motel for external dependencies. Oh how I dream of simpler days and makefiles... I've good reasons to really dislike Maven. If you ever had the misfortune of having no choice but to script within it with Jelly you will know the pain as just one example.

+1000000



Henri Tremblay

Feb 1, 2014, 3:52:28 PM
to mechanica...@googlegroups.com
I tend to agree. Except that Maven hasn't been using Jelly since Maven 2, which came out years ago. But I can understand that the pain is still vivid (hey, let's do a programming language in XML!)



Martin Grajcar

Feb 1, 2014, 4:12:32 PM
to mechanica...@googlegroups.com
Actually, programming in XML is very productive... when your metric is the line count.

Back to the topic: I tried to convert one simple benchmark[1] from Caliper to JMH, only to find out that there's nothing like com.google.caliper.Param. This is sort of a showstopper when you want to measure how the branching probability influences the timing. Or is there some simple workaround?


Aleksey Shipilev

Feb 1, 2014, 4:21:03 PM
to mechanica...@googlegroups.com
On Sunday, February 2, 2014, at 1:12:32 UTC+4, Martin Grajcar wrote:
Back to the topic: I tried to convert one simple benchmark[1] from Caliper to JMH only to find out that there's nothing like com.google.caliper.Param. This is sort of showstopper when you want to measure how the branching probability influences the timing. Or is there some simple workaround?

We resist supporting first-class @Params in JMH, because it opens a significant can of worms (interaction with @State-s and asymmetric benchmarks, the generally non-trivial logic of traversing the parameter space, representing parameters in stable machine-readable formats, overriding parameters from the command line, etc.). The "workaround" we have in JMH, or rather, the recommended way of doing this kind of thing, is using the JMH API, like in: https://github.com/shipilev/article-exception-benchmarks/blob/master/src/main/java/net/shipilev/perf/exceptions/ExceptionsVsFlagsBench.java (see Main there).
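
A rough sketch of that approach (the benchmark name and system property here are made up):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ParamSweep {
    public static void main(String[] args) throws Exception {
        for (int probability : new int[]{10, 50, 90}) {
            Options opts = new OptionsBuilder()
                    .include(".*BranchBench.*")                            // hypothetical benchmark
                    .jvmArgsAppend("-Dbranch.probability=" + probability)  // read it back in @Setup
                    .build();
            new Runner(opts).run();
        }
    }
}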

-Aleksey.

Georges Gomes

Feb 1, 2014, 4:27:58 PM
to mechanica...@googlegroups.com
Hi Aleksey,

Interesting thank you.
Is that also how you recommend running the same bench through multiple implementations of the same interface?
I couldn't run the inheritance sample in any way... copy/paste works, obviously, but it's super painful and error-prone.

Many thanks
GG




Aleksey Shipilev

Feb 1, 2014, 4:32:06 PM
to mechanica...@googlegroups.com
Hi Georges,

On Sunday, February 2, 2014, at 1:27:58 UTC+4, Georges Gomes wrote:
Interesting thank you.
Is that also how you recommend to run the same bench through multiple implementations of the same interface? 
I couldn't run the inheritance sample in anyway... copy/paste works obviously but it's super painful and error prone.


Otherwise, we sometimes build benchmarks which read a String property and instantiate the proper implementation in @Setup, like: https://github.com/shipilev/dbpools-bench/blob/master/src/main/java/org/sample/Benchmark1Ex.java
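
A rough sketch of that pattern (the interface and implementations here are made up):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class ImplSelectionBench {
    interface Codec { int encode(int v); }

    Codec codec;

    @Setup
    public void chooseImpl() {
        String impl = System.getProperty("codec.impl", "fast");
        if ("fast".equals(impl)) {
            codec = v -> v << 1;     // hypothetical "fast" implementation
        } else {
            codec = v -> v * 2;      // hypothetical alternative
        }
    }

    @Benchmark
    public int encode() {
        return codec.encode(42);
    }
}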

-Aleksey.

Georges Gomes

Feb 1, 2014, 4:55:07 PM
to mechanica...@googlegroups.com
Thanks for the reply.
No, I can't make the JMHSample_24_Inheritance.java run.
It just doesn't get processed into the benchmark files in the manifest.
If you think it should do the job, I will dig into it and provide feedback.

The property technique is interesting, but I think people would appreciate self-contained benchmarks.
I like being able to call "jmh ^queues.*spsc.*" to run all the spsc queue benchmarks (for example).
Having to call main() makes some benchmarks not runnable like the others...

It doesn't matter for benchmarks that you are running during development of a particular piece.
But, like Martin, we are trying to integrate performance benchmarking (in other words, JMH) into our development process, and having a single, self-contained way to call them is important for the build process and CI automation. So the conventions for declaring and running JMH benchmarks are important.

This being said, I have been using JMH intensively in the past few weeks and I'm impressed with the quality and stability of the results. So many things are done right. It's difficult (impossible) to go back.

My favorite detail that makes a lot of difference for a multi-threaded bench: the sync mode
that "warms up" threads and only measures during "true" concurrent processing.


JMH examples are great, except the Sample_24_Inheritance one that doesn't work :)
Just kidding, must be me somehow :)

Cheers
GG




Aleksey Shipilev

Feb 1, 2014, 5:02:34 PM
to mechanica...@googlegroups.com
On Sunday, February 2, 2014, at 1:55:07 UTC+4, Georges Gomes wrote:
Thanks for the reply.
No, I can't make the JMHSample_24_Inheritance.java run.
It just doesn't get processed into the benchmark files in the manifest.
If you think it should do the job, I will dig into it and provide feedback.

Please get on jmh-dev: http://mail.openjdk.java.net/mailman/listinfo/jmh-dev, and we can follow up.
 
It doesn't matter for benchmarks that you are running during development of a particular piece.
But, like Martin, we are trying to integrate performance benchmarking (in other words, JMH) into our development process, and having a single, self-contained way to call them is important for the build process and CI automation. So the conventions for declaring and running JMH benchmarks are important.

I agree, but there are technicalities about @Param that make them hard to implement. The last time I tried was almost a year ago; maybe it's time to try again.

-Aleksey. 

tm jee

Feb 1, 2014, 7:01:20 PM
to mechanica...@googlegroups.com
Hi guys, 

what about

It is pretty good as well. 

Georges Gomes

Feb 2, 2014, 1:51:56 AM
to mechanica...@googlegroups.com
Hi

LatencyUtils is a great tool, but it's only measuring latency (and correcting it as well).
You still need to write the benchmark. And that's where things are difficult to get right.
(just look at Aleksey's comments)

Gil will comment better than I do but, in my point of view, LatencyUtils is more targeted at measurement in live or simulated environments.

My colleague Jean-Philippe Bempel would say: "That's the only absolute truth!"
But, during the optimization process, measuring around a small piece of the code is more convenient.
That's where JMH and Caliper are helpful.

This being said, I do agree with Jean-Philippe, and a "real-life" full end-to-end benchmark is mandatory at the end, or periodically.

Cheers
GG





Georges Gomes

Feb 2, 2014, 2:01:50 AM
to mechanica...@googlegroups.com
Privately

Thanks for your work. JMH is great.

Regarding @Param, if it's a hard problem for you, then I'm useless :)
But if I can help in any way, test betas, write samples, etc...
Just let me know

Kind regards
GG





Georges Gomes

Feb 2, 2014, 2:02:39 AM
to mechanica...@googlegroups.com
(privately failed! haha)

ymo

Feb 2, 2014, 11:52:21 PM
to mechanica...@googlegroups.com
Wonder if anyone here has used http://www.faban.org? It used to be a Sun tool, IIRC.

Aleksey Shipilev

Feb 16, 2014, 9:41:22 AM
to mechanica...@googlegroups.com
On 02/02/2014 02:02 AM, Aleksey Shipilev wrote:
> It doesn't matter for benchmarks that you are running during
> development of a particular piece.
> But, like Martin, we are trying to integrate performance
> benchmarking (in other words JMH) in our development process and
> having a single way to call them, self contained is important for
> the build process and CI automation. So the convention for declaring
> and running jmh benchs are important.
>
>
> I agree, but there are technicalities about @Param that make them hard
> to implement. Last time I tried almost a year ago, maybe it's time to
> try again.

...and some 2 weeks later, here's the basic support for @Params in
JMH:
http://mail.openjdk.java.net/pipermail/jmh-dev/2014-February/000453.html
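
A minimal sketch of what it looks like (the parameter name and values here are arbitrary):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class ParamBench {
    @Param({"10", "50", "90"})   // the harness runs the benchmark once per value
    int branchProbability;

    @Benchmark
    public int work() {
        return branchProbability * 31;
    }
}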

-Aleksey.

Chris Vest

Feb 16, 2014, 12:26:00 PM
to mechanica...@googlegroups.com
What if the thing I want to parameterise is the number of benchmark threads? Say I have a concurrent data structure, and I want to measure how different levels of concurrent access influence performance.

Cheers,
Chris

Aleksey Shipilev

Feb 16, 2014, 1:24:32 PM
to mechanical-sympathy
Remember I was telling you about the "can of worms"? Here you go.
Use the API then, Luke; that's the Swiss Army knife.
@Param is just the convenient shortcut.
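
A rough sketch of sweeping the thread count through the API (the benchmark name is made up):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ThreadSweep {
    public static void main(String[] args) throws Exception {
        for (int t : new int[]{1, 2, 4, 8}) {
            Options opts = new OptionsBuilder()
                    .include(".*ConcurrentMapBench.*")   // hypothetical benchmark
                    .threads(t)                          // number of worker threads for this run
                    .build();
            new Runner(opts).run();
        }
    }
}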

-Aleksey.

Peter Hughes

Feb 18, 2014, 5:58:19 PM
to mechanica...@googlegroups.com
For what it's worth, as a relative newcomer to the field, using Maven I was able to get a usable JMH project running from scratch in less than a minute or two. Granted, it didn't do much of anything, but I felt like a dissenting opinion should be offered ;) 

JMH itself is a dream to use; the code samples are truly excellent for getting to grips with how to approach various scenarios. The only thing that has proved less-than-excellent so far is hunting down particular annotations. For instance, I only discovered @OperationsPerInvocation after looking at Nitsan's JAQ benchmarks - a central documentation of these would prove very useful (if one already exists, then apologies, although I couldn't find any mentioned on the JMH homepage)
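
For anyone else hunting for it, a minimal sketch of how @OperationsPerInvocation is used, as far as I can tell (the batch size and workload are made up):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class BatchBench {
    static final int BATCH = 1000;
    int[] batch = new int[BATCH];

    @Benchmark
    @OperationsPerInvocation(BATCH)   // one invocation inherently covers BATCH items
    public int processBatch() {
        int acc = 0;
        for (int v : batch) {
            acc += v;                 // stand-in for per-item work
        }
        return acc;
    }
}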

  - Peter

Aleksey Shipilev

Feb 18, 2014, 6:04:40 PM
to mechanica...@googlegroups.com
On 02/19/2014 02:58 AM, Peter Hughes wrote:
> The only thing that has proved less-than-excellent so far is hunting
> down particular annotations. For instance, I only discovered
> @OperationsPerInvocation after looking at Nitsan's JAQ benchmarks - a
> central documentation of these would prove very useful (if one
> already exists, then apologies, although I couldn't find any
> mentioned on the JMH homepage)

We were thinking the samples would gradually introduce all the useful
annotations. But I agree, the Javadocs should be published somewhere. It
will take some time to figure out...

Meanwhile, it seems useful to link the annotation folder:
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-core/src/main/java/org/openjdk/jmh/annotations/

-Aleksey.

Rüdiger Möller

Apr 22, 2015, 3:45:24 PM
to mechanica...@googlegroups.com
What makes me shy away from using JMH more often is the ceremony required to run a bench (setting up a Maven project & stuff). It really would make a difference if I were able to quickly run benchmarks from within the IDE, ad hoc and quick, similar to how unit tests can be run from IntelliJ. A simple entry point like JMH.runTest( Class, method, [options] ) would be great :-)

Aleksey Shipilev

Apr 22, 2015, 3:51:44 PM
to mechanical-sympathy
But wait, there is a section "IDE Support" on the JMH page...
There is also a link to an IDEA plugin at the bottom of the same page...
And every JMH sample has a runnable main() method that is directly invokable...

Hm.

-Aleksey.


Rüdiger Möller

Apr 22, 2015, 7:36:39 PM
to mechanica...@googlegroups.com
Uuhh .. I admit I haven't looked into JMH for some time :-). Will have myself an update ...

pron

Apr 24, 2015, 9:08:06 AM
to mechanica...@googlegroups.com
One more point in favor of JMH: its pluggable profilers, and, in particular, the awesome perfasm. It recently helped us pinpoint and fix some unnecessary cache misses in a hot data structure.


Roland Deschain

Jul 23, 2015, 4:10:37 PM
to mechanical-sympathy, moru...@gmail.com
You should also check ScalaMeter (http://scalameter.github.io). It works for both Scala and Java, and has some powerful features.

ymo

Jul 24, 2015, 4:40:21 PM
to mechanical-sympathy, moru...@gmail.com, roland....@gmail.com
Can anyone familiar with these say if it is possible to:

1) Generate a load in a very *deterministic* manner in these benchmark tools?
2) When they are not deterministic (meaning they suffer from coordinated omission), what do they do?

I have not found a single benchmark tool so far that can claim to be deterministic in its load generation on the JVM.

Jin Mingjian

Jul 25, 2015, 1:19:14 AM
to mechanica...@googlegroups.com
Rudiger, me too :) The IDE support, which still relies on the Maven integration, may not be acceptable to some. I did a preliminary plain-jar version when JMH initially came out, but Aleksey's diligent updates soon killed my idea. Coming back to my project recently, I find even more good work done on JMH by Aleksey. I plan to see if I can maintain a plain version of JMH again. Let's see if Aleksey leaves room for us :)



Nitsan Wakart

Jul 27, 2015, 7:44:24 AM
to mechanica...@googlegroups.com, moru...@gmail.com, roland....@gmail.com
JMH has one type of load, which is "all-out"; it is very deterministically all-out.
You can construct a cost-under-load test using a per-invocation @Setup method which sleeps/throttles the invocation, and use the sample benchmark mode to capture percentiles of random sampling measurements around the invocation (this is random omission, which is not an issue as it is not biased). The benchmark I just described will suffer from coordinated omission, as there's no notion of a schedule: measurement has no 'intended' start time, so no correction for such is made.
As with all missing features in OSS projects, this should be viewed as an opportunity for contribution rather than an issue ;-)
There's no generic "load generation on the JVM" tool that I know of.
Some domain-specific load generators do tackle CO (e.g. YCSB post 0.2.0, Wrk2, Cassandra stress2, LDBC benchmarks for graph DBs, Gatling).
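
A rough sketch of that construction (illustrative only; Level.Invocation setup has its own overhead caveats, and the throttle value and workload are arbitrary):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class CostUnderLoadBench {
    @Setup(Level.Invocation)
    public void throttle() {
        LockSupport.parkNanos(10_000);   // ~10us between invocations stands in for the "load"
    }

    @Benchmark
    public void operation() {
        Blackhole.consumeCPU(64);        // stand-in for the measured operation
    }
}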


