JNI performance question.

Vladimir Ozerov

Dec 9, 2014, 2:28:24 PM
to mechanica...@googlegroups.com
Hello everyone,

I encountered strange results when benchmarking Java code called from JNI and would like to ask the community for help.

Let's say I have a method M1 written in Java. I use a very simple approach to benchmark its latency: [nanoTime] -> [M1] -> [nanoTime], which gives me a result of 10mcs per call. As this method is relatively heavy, even such a simple benchmark gives reliable results. Running the same method with JMH confirmed that these numbers are correct.
I also have another scenario in which this method is called from a native application with Java started in embedded mode. When I run the same benchmark as follows: [native code] -> [JNIEnv->Call*Method] -> [nanoTime] -> [M1] -> [nanoTime], I get worse results - about 13mcs per call.
I tested several other scenarios in a similar way and saw a slowdown in all of them. E.g., in another case I got 1.5mcs when the method is called from Java (confirmed by JMH) and 3mcs when the same method is called from JNI.
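To make the setup concrete, the Java-side measurement is roughly the following (a simplified sketch; the class and helper names are made up, and the real M1 is of course much heavier than this stub):

public final class LatencyProbe {

    // Called either directly from Java or via JNIEnv->Call*Method from the
    // embedded-JVM scenario; the timing itself stays entirely on the Java side.
    public static Object timedM1(Object arg) {
        long start = System.nanoTime();
        Object result = M1(arg);
        long elapsedNanos = System.nanoTime() - start;
        System.out.println("M1 latency: " + elapsedNanos + " ns");
        return result;
    }

    private static Object M1(Object arg) {
        return arg; // stub standing in for the real method under test
    }
}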

Both cases were tested with the same JDK (1.7.0_33) and the same flags, all set to defaults. Profiling with Mission Control didn't show anything suspicious, except that GC ran a little less frequently in the native benchmark.

Could you please give me some hints on how to investigate this further? Could it be a problem with the JIT? GC? Safepoints? Something else?

Thank you.
Vladimir.

Aleksey Shipilev

Dec 9, 2014, 2:43:55 PM
to mechanica...@googlegroups.com
Hi,

On 09.12.2014 22:28, Vladimir Ozerov wrote:
> I tested several other scenarios in a similar way and saw a slowdown in
> all of them. E.g., in another case I got 1.5mcs when the method is called
> from Java (confirmed by JMH) and 3mcs when the same method is called
> from JNI.

I have a gut feeling this has something to do with the optimizability of
the target Java method. Methods called from JNI would not be inlined
by HotSpot, devirtualization would not happen, etc. etc. etc.

> Both cases were tested with the same JDK (1.7.0_33) and the same flags,
> all set to defaults. Profiling with Mission Control didn't show anything
> suspicious, except that GC ran a little less frequently in the native
> benchmark.

If "1.5mcs" stands for "1.5 microseconds", no wonder the
application-level profilers show nothing.

> Could you please give me some hints on how to investigate this further?
> Could it be a problem with the JIT? GC? Safepoints? Something else?

Since you already have JMH benchmarks for these, it would be helpful to
see the result of "-prof perfasm" for both approaches. I have never
looked at JNI transitions under perfasm, so dragons be there.

Thanks,
-Aleksey.

Vladimir Ozerov

Dec 9, 2014, 3:56:59 PM
to mechanica...@googlegroups.com
Aleksey, 

Thanks for your reply. My JMH benchmarks are pretty straightforward:
@Benchmark
public Object testM1() {
    return M1(ThreadLocalRandom.current().nextInt(...));
}
I ran them with different compiler options - INLINE, DONT_INLINE, EXCLUDE. All gave the same results, consistent with my dumb benchmark. Could these results rule out possible problems with the optimizer? Or is this assumption invalid?
I forgot to mention that I run the benchmarks on Windows 7/8, as this is the target platform for this use case. Therefore perfasm is not an option. But if there are no simple or well-known explanations for what I observe, it seems I'll have to dig into the assembly anyway.

Vladimir.

Aleksey Shipilev

Dec 9, 2014, 4:05:15 PM
to mechanical-sympathy
I think it is not very safe to presume that, if pure Java calls with compiler inline/dont_inline hints do not show the difference, the same applies to from-JNI calls. I agree the route to follow now is to dig into the assembly. It might be simpler to replicate the same thing on Linux and use perfasm there, though (I am biased in my belief that performance work on Windows is a nightmare).

The only thing I don't get from your experiment is why the score is the same when you have @CompilerControl(EXCLUDE) over the target method. It should keep the method from being compiled *at all*, and therefore the score should be much lower. Do you actually have @CompilerControl over M1()? If so, we may have to suspect the compiler hints are dysfunctional. See if those things actually work with -XX:+PrintCompilation -XX:+PrintInlining; you should see lines like "force inline by CompilerOracle", etc.
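For clarity, roughly what I mean in sketch form (class name and argument are placeholders; only the annotations and flags matter):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class M1Bench {

    // EXCLUDE should keep M1 out of the JIT entirely, so its score should
    // collapse compared to the compiled run if the hint is actually honored.
    @CompilerControl(CompilerControl.Mode.EXCLUDE)
    public Object M1(Object arg) {
        return arg; // placeholder body
    }

    @Benchmark
    public Object testM1() {
        // Pass -XX:+PrintCompilation -XX:+PrintInlining to the forked JVM
        // (e.g. via jvmArgsAppend) to check whether the hint was applied.
        return M1(Integer.valueOf(42));
    }
}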

-Aleksey.

Kirk Pepperdine

Dec 10, 2014, 2:20:28 AM
to mechanica...@googlegroups.com
Hi,

I’m not sure the size of the targeted workload is big enough to amortize the cost of the JNI call. A quick check with a C profiler would be prudent.

Regards,
Kirk

Aleksey Shipilev

Dec 10, 2014, 8:32:19 AM
to mechanical-sympathy
On 12/10/2014 12:04 AM, Aleksey Shipilev wrote:
> The only thing I don't get from your experiment is why the score is the
> same when you have @CompilerControl(EXCLUDE) over the target method. It
> should keep the method from being compiled *at all*, and therefore the
> score should be much lower. Do you actually have @CompilerControl over
> M1()? If so, we may have to suspect the compiler hints are dysfunctional.
> See if those things actually work with -XX:+PrintCompilation
> -XX:+PrintInlining; you should see lines like "force inline by
> CompilerOracle", etc.

I merged the additional tests into JMH workspace:
http://hg.openjdk.java.net/code-tools/jmh/rev/73451bc77ee0

That means you can "just" run the JMH tests from the workspace tip on
your target machine to see if JMH is functional there. My runs on Win 7
with JDK 8u11 yield no failures.

-Aleksey.

Jan van Oort

Dec 10, 2014, 11:37:05 AM
to mechanical-sympathy
Gut feeling: overhead of the JNI call. Those are not "for free" in Java: they require interaction with the OS and then loading of the shared library.

You may, however, want to run a prolonged test in order to see if the cost per call decreases (I think it should). In other words: your test might be too short.
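Something along these lines, as a rough sketch (names and batch sizes are made up), would show whether the per-call cost drops as the JIT warms up:

public final class WarmupProbe {

    // Time the measured call in batches; if the early batches are much more
    // expensive than the later ones, the original test was simply too short.
    public static void run(int batches, int callsPerBatch) {
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < callsPerBatch; i++) {
                M1(Integer.valueOf(i));
            }
            long perCall = (System.nanoTime() - start) / callsPerBatch;
            System.out.println("batch " + b + ": " + perCall + " ns/call");
        }
    }

    private static Object M1(Object arg) {
        return arg; // stub for the call under test
    }
}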



Fortuna audaces adiuvat - hos solos ? 


Vladimir Ozerov

Dec 10, 2014, 2:33:48 PM
to mechanica...@googlegroups.com
Thank you all for your answers.

Aleksey,
I believe that nothing was wrong with JMH. The method I measure looks as follows:
Object M1(Object arg) {
    <minor_stuff>
    try {
        return innerMethod(arg);
    }
    finally {
        <minor_stuff>
    }
}
So preventing compilation of M1, or of the thin benchmark around it that adds a few instructions, should not significantly affect performance, because "innerMethod" is still compiled/inlined. And this "innerMethod" is relatively heavy (e.g., according to the HotSpot log it triggers compilation of several hundred other methods of our product).

Jan,
We call Java from native code, not vice versa. And in both cases I perform the measurements inside Java, i.e. after control has passed from native code to the JVM. This is why I was so confused by the results.

What I did next was make the Java code executed in the Java and native benchmarks as similar as possible. Then I ran the benchmarks once again and the difference became negligible: say, 11mcs in Java vs 12mcs when called from JNI. I compared the compilation logs for both cases under the (maybe not very accurate) assumption that if method sizes are equal, they were probably compiled into the same assembly. And I saw several differences: one hot method from the very depths of our logic was not compiled in the native scenario, and another important method was deoptimized in the Java scenario. Most probably this is caused not by JNI, but by differences in the setup code, which is slightly different between the pure Java and native cases (and by the application's nature it cannot be made absolutely identical).
Given that my benchmarks are far from perfect, Windows is known to have poor timer granularity, and these numbers are acceptable for us, I decided to stop at this point. My final diagnosis: micro-effects caused by an imperfect experiment.

Vladimir.