JVM random performance


Roger Alsing

Aug 1, 2017, 1:26:55 PM
to mechanical-sympathy
Some context: I'm building an actor framework, similar to Akka but polyglot/cross-platform.
For each platform we have the same benchmarks, one of which is an in-process ping-pong benchmark.

On .NET and Go, we can spin up pairs of ping-pong actors equal to the number of cores in the CPU, and no matter how many more pairs we spin up, the total throughput remains roughly the same.
But on the JVM, if we do this, I can see that we max out at 100% CPU, as expected; yet if I instead spin up a lot more pairs, e.g. 20 * core_count, the total throughput triples.

I suspect this is because the system runs in a more steady-state fashion in the latter case: mailboxes are never completely drained, so actors don't have to switch between processing and idle.
Would this be fair to assume?
This is the reason why I believe this is a question for this specific forum.
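
To make this concrete, here is a minimal, hypothetical sketch of the kind of mailbox-driven ping-pong pairing described above; the class names and scheduling details are illustrative and are not taken from the actual framework:

import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

// Illustrative sketch only: each "actor" is just a mailbox drained on a shared
// pool; a pair bounces messages back and forth and counts them.
public class PingPongSketch {
    static final ExecutorService POOL = Executors.newWorkStealingPool();
    static final AtomicLong MESSAGES = new AtomicLong();

    static final class Mailbox {
        final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
        final AtomicBoolean scheduled = new AtomicBoolean();

        void post(Runnable msg) {
            queue.add(msg);
            // Only schedule a drain if the mailbox was idle; under heavy load it
            // is rarely empty, so it stays "hot" instead of toggling between
            // processing and idle.
            if (scheduled.compareAndSet(false, true)) {
                POOL.execute(this::run);
            }
        }

        void run() {
            Runnable msg;
            while ((msg = queue.poll()) != null) {
                msg.run();
            }
            scheduled.set(false);
            // Re-check in case a message raced in while we were going idle.
            if (!queue.isEmpty() && scheduled.compareAndSet(false, true)) {
                POOL.execute(this::run);
            }
        }
    }

    static void startPair() {
        Mailbox ping = new Mailbox();
        Mailbox pong = new Mailbox();
        Runnable[] onPing = new Runnable[1];
        onPing[0] = () -> {
            MESSAGES.incrementAndGet();
            pong.post(() -> {                 // "pong" replies...
                MESSAGES.incrementAndGet();
                ping.post(onPing[0]);         // ...and "ping" goes again
            });
        };
        ping.post(onPing[0]);
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        int pairs = cores * 20;               // compare cores * 1 vs cores * 20
        for (int i = 0; i < pairs; i++) {
            startPair();
        }
        long last = 0;
        for (int i = 0; i < 10; i++) {
            Thread.sleep(1000);
            long now = MESSAGES.get();
            System.out.println("msg/sec: " + (now - last));
            last = now;
        }
        System.exit(0);
    }
}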

Now to the real question: roughly 60-40 of the time when the benchmark is started, it runs steadily at 250 million msg/sec; the other times it runs at 350 million msg/sec.
The reason I find this strange is that it is stable over time: if I don't stop the benchmark, it continues at the same pace.

If anyone is bored and would like to try it out, the repo is here:

This is also consistent with or without various VM arguments.

I'm very interested to hear if anyone has any theories about what could cause this behavior.

One factor that seems to be involved is GC, but not in the obvious way; rather the reverse.
In the beginning, when the framework allocated more memory, it more often ran at the high speed.
And the fewer allocations I've managed to make without touching the hot path, the more the benchmark has started to toggle between these two numbers.

Thoughts?

Wojciech Kudla

Aug 1, 2017, 1:48:35 PM
to mechanical-sympathy

It definitely makes sense to have a look at GC activity, but I would suggest looking at safepoints from a broader perspective. Just use -XX:+PrintGCApplicationStoppedTime to see what's going on. If it's safepoints, you can get more detail with safepoint statistics.
Also, benchmark runs on the JVM may appear non-deterministic simply because compilation happens in background threads by default, and some runs may exhibit a different runtime profile because the compilation threads receive their time slices at different moments throughout the benchmark.
Are the results also jittery when run entirely in interpreted mode? It may be worth experimenting with various compilation settings (i.e. disable tiered compilation, employ different warmup strategies, play around with compiler control); see the example invocations below.
Are you pinning threads to CPUs in any way?
Are you running on a multi-socket setup?
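
A sketch of the kinds of runs being suggested here (all flags are standard HotSpot options of the JDK 8 era; the jar name is a placeholder for however the benchmark is actually launched):

# Safepoint visibility: total stopped time plus per-safepoint statistics
java -XX:+PrintGCApplicationStoppedTime \
     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
     -jar pingpong-benchmark.jar

# Fully interpreted run, to see whether the bimodality survives without the JIT
java -Xint -jar pingpong-benchmark.jar

# Single compiler tier, to reduce compilation-ordering effects
java -XX:-TieredCompilation -jar pingpong-benchmark.jar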


Georges Gomes

Aug 1, 2017, 2:22:23 PM
to mechanical-sympathy

Are you benchmarking on a multi-socket/NUMA server?

Kirk Pepperdine

Aug 1, 2017, 2:59:55 PM
to mechanica...@googlegroups.com
Hi,

From my observations there appear to be some race conditions in the HotSpot compilations that can affect hot/cold path decisions during warmup. If the race goes in your favor, all is well; if not… Also, the memory layout of the JVM will have some impact on which optimizations are applied: if you're in low RAM you'll get different optimizations than if you're running in high RAM. I'd suggest you run your benches in a highly controlled environment to start with, and then afterwards you can experiment to understand which environmental conditions your bench may be sensitive to.

Kind regards,
Kirk


Roger Alsing

Aug 1, 2017, 3:32:37 PM
to mechanical-sympathy
Does this tell anyone anything?

This is beyond my understanding of the JVM.

PS: no multi-socket or NUMA.

Regards
Roger

Gil Tene

Aug 1, 2017, 6:57:17 PM
to mechanical-sympathy
Add -XX:+PrintGCTimeStamps; also, run under time so we can see the total run time...
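
For example, combined with the stopped-time flag suggested earlier (a sketch; the jar name is a placeholder for however the benchmark is launched):

time java -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -jar pingpong-benchmark.jar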

Kirk Pepperdine

Aug 2, 2017, 4:55:22 AM
to mechanica...@googlegroups.com
There are a couple of very long safepoint times in there. By long I mean 6 or more milliseconds. However, without full GC logging it's difficult to know whether the safepointing is due to GC or something else.

Other than that… the logs all show pretty normal operation. Can you run this with -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:<logfile> as well as the flags you're using? I have some analytics that I could run, but I need timestamps and GC times for them to be meaningful.
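
A sketch of that invocation, keeping the stopped-time flag from earlier in the thread (the jar name and log file name are placeholders):

java -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:gc.log \
     -XX:+PrintGCApplicationStoppedTime \
     -jar pingpong-benchmark.jar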

I’d run myself but I’m currently running a couple of other benchmarks.

Kind regards,
Kirk


Roger Alsing

Aug 2, 2017, 9:44:29 AM
to mechanical-sympathy

Roger Alsing

Aug 2, 2017, 9:46:04 AM
to mechanical-sympathy
Adding to that,
I've also tried replacing the current ForkJoin thread pool with a custom thread/core-affine scheduler, and the behavior is exactly the same.
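
For reference, a hypothetical sketch of what such a scheduler can look like: one dedicated worker per core, with each mailbox always drained by the same worker. Real OS-level pinning is not shown; it needs a native call or a library such as OpenHFT's Java-Thread-Affinity, and none of these names come from the actual framework.

import java.util.concurrent.*;

// Hypothetical sketch of a "thread/core-affine" scheduler.
final class CoreAffineScheduler {
    private final BlockingQueue<Runnable>[] queues;

    @SuppressWarnings("unchecked")
    CoreAffineScheduler(int cores) {
        queues = new BlockingQueue[cores];
        for (int i = 0; i < cores; i++) {
            queues[i] = new LinkedBlockingQueue<>();
            final int core = i;
            Thread worker = new Thread(() -> {
                // A real implementation would pin the current thread to `core` here.
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        queues[core].take().run();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, "worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Route a mailbox's drain task to a fixed worker so its messages stay on one core.
    void schedule(int mailboxId, Runnable drainTask) {
        queues[mailboxId % queues.length].add(drainTask);
    }
}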

Kirk Pepperdine

Aug 2, 2017, 11:09:17 AM
to mechanica...@googlegroups.com
OK, my bet is that this is down to memory layout. So, to test it you may want to load up your favorite word processor with one or two documents and see if that consistently gives you the slower performance numbers.

Regards,
Kirk


Todd Lipcon

Aug 3, 2017, 1:32:33 AM
to mechanica...@googlegroups.com
I've seen this kind of bimodal performance behavior even in native apps. On some builds, a benchmark will consistently run in 6 seconds, and on the next build it will go back to 5 seconds. The two "timings" are very consistent for a given build, and the timing distribution taken across all builds is very clearly bimodal, with two narrow peaks.

At one point I spent several hours looking at two consecutively built binaries that had different performance characteristics and determined that the cause was loop alignment. Due to changes in code size elsewhere in the app (a few instructions here and there), one of the hot loops was getting misaligned off a 16-byte boundary, and this affected the performance of the microbenchmark.

I'd guess that the JIT has enough non-determinism in its heuristics that on different invocations of the program you could end up with different loop alignments, different placement of hot functions, different usage of huge pages, etc., and any of those could be enough to throw a microbenchmark out of whack by 20-30%.
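
As an aside (not something Todd mentions): one way to inspect generated code and loop placement on the JVM side is HotSpot's diagnostic disassembly output, which only produces useful listings if the hsdis disassembler plugin is on the JVM's library path; the jar name is a placeholder.

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -jar pingpong-benchmark.jar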

-Todd


Tony Finch

Aug 3, 2017, 11:29:09 AM
to mechanica...@googlegroups.com
Todd Lipcon <to...@lipcon.org> wrote:

> I've seen this kind of bimodal performance behavior even in native apps. On
> some builds, a benchmark will consistently run in 6 seconds, and on the
> next build, it will go back to 5 seconds.

I have heard of similar effects with a single build where the variation
between runs is due to differences in page mapping, e.g. in this link
where the author discusses "Sometimes the OS hands us a range of physical
pages that works well for our workloads, other times it doesn’t."

https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/ - I xn--zr8h punycode
Fitzroy: Southwesterly veering northwesterly later in northwest, 4 or 5
increasing 6 at times. Moderate or rough, occasionally very rough at first in
north. Occasional rain. Good, occasionally moderate.

Oleg Mazurov

Aug 8, 2017, 1:29:13 AM
to mechanical-sympathy
I ran the benchmark with a profiler and was able to reproduce both modes, fast and slow. The difference appears to be due to how HotSpot compiles the
DefaultMailbox.run() -> ActorContext.invokeUserMessage(msg) sequence. In fast mode, DefaultMailbox.run() inlines ActorContext.invokeUserMessage() and
both PingActor.autoReceive() and EchoActor.autoReceive(), thus recognizing that there are only two interface implementations of Actor.autoReceive().
In slow mode, invokeUserMessage() goes through a series of initial compilations and is finally deoptimized to call PingActor.autoReceive() and EchoActor.autoReceive()
via the itable, i.e. the generic interface call mechanism, which is quite expensive (could there be even more implementations than these two?).
Which path HotSpot takes may depend on how many different interface implementations it observes at the autoReceive() call site for each intermediate
compilation, and that is completely non-deterministic due to the nature of the benchmark.
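
One way to see which outcome a given run landed on is to log HotSpot's compilation and inlining decisions and look at the autoReceive() call site (standard diagnostic flags; the jar name is a placeholder for however the benchmark is launched):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining -jar pingpong-benchmark.jar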

    -- Oleg