Mystery: Why does the JVM show more latency for the same block of code after a busy spin pause?


J Crawford

Apr 29, 2017, 10:56:17 AM
to mechanical-sympathy
The mystery boils down to: 

    "The exact same block of code becomes slower after a busy spin pause."

I posted a short piece of source code that unequivocally demonstrates the problem: http://stackoverflow.com/questions/43696948/why-does-the-jvm-show-more-latency-for-the-same-block-of-code-after-a-busy-spin

Note that I'm not using Thread.sleep. Also note that there are no conditionals leading to a HotSpot/JIT de-optimization, as I change the pause using a math operation instead of an IF.

The story told by the short source code is:

- There is a block of math operations that I want to time.
- First I time the block pausing 1 nanosecond before I start my measurement. I do that 20,000 times.
- Then I change the pause from 1 nanosecond to 5 seconds and proceed to measure the latency as usual. I do that 15 times.
- Then I print the last 30 measurements, so you can see 15 measurements with the pause of 1 nanosecond and 15 measurements with the pause of 5 seconds.
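In sketch form, the harness looks like this (illustrative Java, not the exact code from the SO question; names and constants are mine, and they are shrunk so the sketch runs in milliseconds rather than 75 seconds):

```java
// Illustrative sketch of the harness described above. The pause grows via
// integer arithmetic (count / warmup is 0 until warmup is reached), not an IF.
public class JvmPauseSketch {

    // Spin for roughly 'nanos' nanoseconds without sleeping.
    static long busyPause(long nanos) {
        long start = System.nanoTime();
        long x = 0;
        while (System.nanoTime() - start < nanos) {
            x++; // keep the CPU busy
        }
        return x;
    }

    // Run 'iterations' measurements; after 'warmup' of them the pause
    // grows from 1 ns to 'longPauseNanos' via arithmetic, not a branch.
    static long[] run(int iterations, int warmup, long longPauseNanos) {
        long[] results = new long[iterations];
        long interval = 1;
        int count = 0;
        double calculation = 0;
        while (count < results.length) {
            busyPause(interval);
            long t = System.nanoTime();
            for (int i = 0; i < 100; i++) {            // block being timed
                calculation += Math.sqrt(i + calculation % 7);
            }
            results[count++] = System.nanoTime() - t;
            // integer division: 0 before warmup, so interval stays 1; then longPauseNanos
            interval = (count / warmup) * (longPauseNanos - 1) + 1;
        }
        if (calculation == Double.MAX_VALUE) System.out.println(calculation); // keep result alive
        return results;
    }

    public static void main(String[] args) {
        long[] r = run(30, 15, 1_000_000L); // 1 ms "long" pause for a quick demo
        for (long v : r) System.out.println(v);
    }
}
```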

As you can see below, the discrepancy is big, especially in the very first measurement after the pause change. Why is that!?

$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754

Thanks!

-JC

Ben Evans

Apr 29, 2017, 11:02:11 AM
to mechanica...@googlegroups.com
You should start by rewriting this as a JMH benchmark.

System.nanoTime-based benchmarks are incredibly difficult to write without error, and it is far too easy to be misled or to suffer from confirmation bias of one form or another.

As it stands, you haven't unequivocally demonstrated anything - sorry.

Ben

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

J Crawford

Apr 29, 2017, 11:14:20 AM
to mechanical-sympathy
Hi Ben,

Are you saying that there is a problem with System.nanoTime() for this short and simple source code, and that System.nanoTime() is not reliable for benchmarking anything? I suspected that, but I've also measured using different native timers, clock_gettime and rdtsc, and the exact same problem happens. So I have to believe that I unequivocally demonstrated this problem, and that using JMH or another profiling tool won't change/solve anything.

Thanks!

-JC

Ben Evans

Apr 29, 2017, 12:18:14 PM
to mechanica...@googlegroups.com
The JMH authors have done a lot of careful, meticulous work to produce
a great tool precisely so that I don't have to worry about a bunch of
the subtleties that are in the JIT compilation subsystem, or about the
myriad ways that overly simplistic benchmarks can mislead.

If you're sure that this is a real effect, then you should be able to
produce a simple JMH benchmark which shows the same behaviour. I
fundamentally don't believe virtually *anything* a simple nanoTime()
benchmark seems to be telling me without corroboration, and without
that confirmation I'm not going to spend a lot of time looking at the
purported result.

Ben

J Crawford

Apr 29, 2017, 12:38:54 PM
to mechanical-sympathy
Hi Ben,

That's a fair assumption. I did not see anything in the JIT logs (with inlining logging on), so I don't think it is JIT-related. Would you mind using JMH to benchmark the code and posting your findings here? It is very short and simple code. My concern is that I've never used JMH before, so it would be preferable for someone neutral with JMH experience to prove or disprove my point.

Thanks!

-JC

Aleksey Shipilev

Apr 29, 2017, 12:56:16 PM
to mechanica...@googlegroups.com
On 04/29/2017 04:56 PM, J Crawford wrote:
> The mystery boils down to:
>
> "/The exact same block of code becomes slower after a busy spin pause./"
>
> I posted a short source code that *unequivocally* proves and demonstrates the
> problem: http://stackoverflow.com/questions/43696948/why-does-the-jvm-show-more-latency-for-the-same-block-of-code-after-a-busy-spin

I don't think this is much of a mystery: wakeup from a long sleep causes a gradual transition from low power/frequency/scheduling states to higher ones.

This is clearly visible if you do several consecutive operations after the
sleep, e.g. by adding this:

  while (count < results.length) {
+     if ((count & 7) > 0) interval = 0;
      double x = busyPause(interval);

Results:
191
191
191
191
191
193
190
190
194
192
192
191
191
191
191
4682 <--- wakeup
323 <--- (yaaaaaawn, still sleepy)
204 <--- (okay, let's have some morning coffee)
189 <--- (alright, we are at full swing)
191
192
195
189
2868 <--- WHAT. I HAVEN'T BEEN SLEEPING DEEPLY YET
273
222
191
194
189
192


Pretty sure configuring the system to avoid deep power states would alleviate the jump in the first several iterations, at the expense of power draw.
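For reference, the usual Linux knobs for this are kernel boot parameters and the cpupower tool (treat these as a sketch; exact availability depends on the CPU idle driver and distribution):

```
# kernel boot parameters: keep cores out of deep C-states (costs power)
intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll

# pin the frequency governor to performance
cpupower frequency-set -g performance
```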

Thanks,
-Aleksey


Aleksey Shipilev

Apr 29, 2017, 1:12:19 PM
to mechanica...@googlegroups.com
On 04/29/2017 06:56 PM, Aleksey Shipilev wrote:
> On 04/29/2017 04:56 PM, J Crawford wrote:
>> The mystery boils down to:
>>
>> "/The exact same block of code becomes slower after a busy spin pause./"
>>
>> I posted a short source code that *unequivocally* proves and demonstrates the
>> problem: http://stackoverflow.com/questions/43696948/why-does-the-jvm-show-more-latency-for-the-same-block-of-code-after-a-busy-spin
>
> I don't think this is much of the mystery, because wakeup from a long sleep
> causes gradual transition from low power/frequency/scheduling states to higher
> ones.
>
> This is clearly visible if you do several consecutive operations after the
> sleep, e.g. by adding this:
>
> while(count < results.length) {
> + if ((count & 7) > 0) interval = 0;
> double x = busyPause(interval);

Ah, I misread the test. busyPause is not powering anything down. What's weirder, the first pair of nanoTime calls after exiting busyPause hiccups even without the calculations in between:

  while (count < results.length) {
      if ((count & 7) > 0) interval = 0;
      double x = busyPause(interval);

      long latency = System.nanoTime();
      latency = System.nanoTime() - latency;

      results[count++] = latency;
      interval = (count / WARMUP * (PAUSE - 1)) + 1;
  }

...
34
34
35
32
35
2547
51
35
34
36
33
35
...
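A self-contained variant of that experiment (my names, not the original code; the actual numbers will vary wildly by machine):

```java
// Measures back-to-back System.nanoTime() calls right after a busy spin,
// with nothing at all in between.
public class NanoTimePairAfterSpin {

    // Spin without sleeping for roughly 'nanos' nanoseconds.
    static void spin(long nanos) {
        long start = System.nanoTime();
        while (System.nanoTime() - start < nanos) { /* busy */ }
    }

    // Each sample is the elapsed time between two adjacent nanoTime calls
    // taken immediately after the spin ends.
    static long[] measure(int samples, long pauseNanos) {
        long[] results = new long[samples];
        for (int i = 0; i < samples; i++) {
            spin(pauseNanos);
            long t = System.nanoTime();
            results[i] = System.nanoTime() - t;
        }
        return results;
    }

    public static void main(String[] args) {
        for (long v : measure(20, 1_000_000L)) System.out.println(v);
    }
}
```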

-Aleksey


Ben Evans

Apr 29, 2017, 1:35:43 PM
to mechanica...@googlegroups.com
Hi,

JMH is one of the tools that you will need to get to grips with if
you're going to do performance work at this level. The other one which
I think is pretty much essential is JITWatch.

I would suggest that it's probably a better use of your time right now
to get familiar with JMH by trying some simple examples, then apply
what you've learned to this specific problem, rather than chase this
one down with a JMH benchmark that someone else has written.

That way you'll have learned something of how JMH works and you can
apply it to other latency problems - as it seems from your other posts
that you're quite interested in this area. "Teach a man to fish" and
all that...

There's an introductory presentation I did about JMH here:
https://www.infoq.com/presentations/jmh - there are loads of other
resources available online too.
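For orientation, a minimal JMH benchmark for this kind of measurement looks roughly like the sketch below. (This is a skeleton, not a benchmark of the original code; it needs the JMH dependency and annotation processor on the build path, e.g. via the jmh-java-benchmark-archetype.)

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// SampleTime mode records a latency distribution per invocation,
// which is what you want when hunting for hiccups/outliers.
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class PauseBenchmark {

    double calculation;

    @Benchmark
    public double timedBlock() {
        // the block of math operations under test goes here;
        // returning the result prevents dead-code elimination
        return calculation += Math.sqrt(calculation + 1);
    }
}
```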

Thanks,

Ben

Aleksey Shipilev

Apr 29, 2017, 1:44:34 PM
to mechanica...@googlegroups.com
On 04/29/2017 07:12 PM, Aleksey Shipilev wrote:
> On 04/29/2017 06:56 PM, Aleksey Shipilev wrote:
>> On 04/29/2017 04:56 PM, J Crawford wrote:
>>> The mystery boils down to:
>>>
>>> "/The exact same block of code becomes slower after a busy spin pause./"
>>>
>>> I posted a short source code that *unequivocally* proves and demonstrates the
>>> problem: http://stackoverflow.com/questions/43696948/why-does-the-jvm-show-more-latency-for-the-same-block-of-code-after-a-busy-spin
>>
>> I don't think this is much of the mystery, because wakeup from a long sleep
>> causes gradual transition from low power/frequency/scheduling states to higher
>> ones.
>>
>> This is clearly visible if you do several consecutive operations after the
>> sleep, e.g. by adding this:
>>
>> while(count < results.length) {
>> + if ((count & 7) > 0) interval = 0;
>> double x = busyPause(interval);
>
> Ah, I misread the test. busyPause is not powering down anything. What's weirder,
> the first pair of nanoTime calls after exiting busyPause is hiccuping even
> without calculations in between.

Still flying blind for fun, trying different things:
http://cr.openjdk.java.net/~shade/scratch/JvmPauseLatency.java

a) "Warming up" both branches in busyPause seems to help with the first hiccup. Notice
that with a small interval one of the loop branches would be non-taken, and C2
would put a trap there, blowing up on first execution. So much for "avoiding the
deoptimization"...

b) Inlining busyPause seems to avoid every hiccup but the first. I
wonder if this is actually a machine effect, e.g. a branch mispredict on
loop exit facing an unusual code shape ("ret" from a non-inlined method?) making the
lag worse. When the method is inlined, only the first trap would introduce lag, not
any subsequent one.

At this point you want to dive into the generated code, and there JMH perfasm is
your friend.

-Aleksey



Oleg Mazurov

May 13, 2017, 3:04:55 AM
to mechanical-sympathy
Here is my theory. After warmup, every call to busyPause() takes 5 seconds.
During that time the thread will be interrupted many times. Even if it runs
on the same CPU after every interrupt, the cache state will be gone.
Basically, what you measure after busyPause() is the cost of repopulating your caches.
To test this theory, add one line after busyPause() which does that proactively,
before the measurement:

    double x = busyPause(interval);
+   calculation += x / (rand.nextInt(5) + 1);
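The effect can be sketched in isolation like this (illustrative only; names are mine, and actual timings depend entirely on the machine and what ran in between):

```java
// Times a pass over an array either cold or after a "prefetch" pass,
// mirroring the one-liner above: touching the data before the timed
// section pulls it into cache, so the measurement excludes cache refills.
public class CacheWarmDemo {

    static long sum(long[] data) {
        long s = 0;
        for (long v : data) s += v;
        return s;
    }

    static long timedSum(long[] data, boolean touchFirst) {
        if (touchFirst) sum(data);               // pull the array into cache
        long t = System.nanoTime();
        long s = sum(data);
        long elapsed = System.nanoTime() - t;
        if (s == Long.MIN_VALUE) System.out.println(s); // keep 's' alive
        return elapsed;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];         // ~8 MB of longs
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println("cold:   " + timedSum(data, false));
        System.out.println("warmed: " + timedSum(data, true));
    }
}
```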


    -- Oleg

J Crawford

May 13, 2017, 11:03:22 PM
to mechanical-sympathy

Hi Oleg,

I made sure to use thread pinning, so the thread was pinned to an isolated CPU core. Furthermore, because the thread is spinning, I wouldn't expect it to be interrupted or released from the CPU core.

Ivan from SO came up with a theory that you must also warm up the pauses dynamically; check it out: http://stackoverflow.com/a/43699146/7833248

Also, I'm not sure what you are trying to accomplish/prove with that extra line after busyPause. Can you elaborate?

Thanks!

-JC

Oleg Mazurov

May 15, 2017, 1:35:57 PM
to mechanical-sympathy
User threads get interrupted all the time (see /proc/interrupts), and pinning your thread/process to a particular CPU
does not guarantee that the OS won't run other threads on the same CPU, just that yours won't run on others.
More important in this case, however, is the fact that System.nanoTime() is used to keep the CPU busy.
System.nanoTime() translates to clock_gettime(CLOCK_MONOTONIC), and that thing is far more complex than
just 'rdtsc' and a couple of arithmetic operations. For one thing, it occasionally makes system calls.
Not only do those system calls take much longer to execute, but you also relinquish control to the kernel at a
well-defined point, so it may decide to run stuff it wouldn't during scheduling interrupts.
Anyway, all of that supports a simple idea: after 5 seconds of "spinning" in System.nanoTime(),
you can be sure that the cached 'calculation', 'rand' and whatever 'rand' uses internally are long gone.
That is to say, when you measure your computation for the first time after busyPause(),
you measure, among other things, how long it takes to read all those values from the lower levels of
the memory hierarchy. My one-liner was intended to show that if we 'prefetch' those values, the
measured time goes down significantly.
Whatever other theory you may have, it now has to account for that change as well.
This is not a full solution to the mystery you started with, but it is an important factor not to be
neglected.

    -- Oleg