Dalvik VM versus NDK performance


Biosopher

Jan 6, 2010, 12:35:04 PM
to android-ndk
I'm about to throw away the Java-based Fast Fourier Transform (FFT)
code that I just ported to Android and revert to my C implementation.
Before I do, I wanted to check whether others have had the same
experience running processor-intensive code on the Dalvik VM and
found that the NDK solved their problems.

The digital sound processing app I've written requires ~50,000 FFTs,
which take less than 2 seconds on a modern JIT-enabled JVM running
J2ME and less than 1 second as a C program on a similarly powered
mobile phone. On an HTC Hero, however, the same Java-based code takes
15 seconds. For an acceptable user experience, I need a solution that
runs in under 2.5 seconds.

To benchmark Android, I ran very simple tests (results below) on an
HTC Hero. The results show the Dalvik VM to be >20 times slower than
J2ME and 25-50 times slower than a C program performing the same
operations on a similarly powered mobile phone.

For example, this simple loop calling an empty method 2 million times
takes 1.4 seconds, even though it doesn't do anything. The same loop
runs in milliseconds as a C program and in about 50 ms on a similarly
powered J2ME phone.

public void performanceTest1() {
    for (int i = 0; i < 2000000; i++) {
        emptyMethod();
    }
}

private int emptyMethod() {
    return 0;
}
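(For reference, the timings above came from a simple wall-clock harness along these lines; the exact measurement code wasn't posted, so treat this as a sketch with an illustrative class name:)

```java
public class BenchmarkHarness {
    private static int emptyMethod() {
        return 0;
    }

    // Times 2,000,000 calls to an empty method and returns the
    // elapsed wall-clock time in milliseconds.
    public static long run() {
        int sink = 0; // accumulate the result so the call isn't trivially dead
        long start = System.currentTimeMillis();
        for (int i = 0; i < 2000000; i++) {
            sink += emptyMethod();
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("2,000,000 calls: " + elapsed + " ms (sink=" + sink + ")");
        return elapsed;
    }

    public static void main(String[] args) {
        run();
    }
}
```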

Doing something a little more complex like calculating the imaginary
component of a complex conjugate 2 million times takes 3.2 seconds.
Again, this takes milliseconds on other mobile phones running J2ME or
C.

public void performanceTest2() {
    for (int i = 0; i < 2000000; i++) {
        int a = 5;
        int b = 5;
        int c = 5;
        int x = 5;
        int y = 5;

        y = ((a >> 16) * ((c << 16) >> 16))
                + (((a & 0x0000FFFF) * ((c << 16) >> 16)) >> 16);
        y = -y;
        y += ((b >> 16) * (c >> 16))
                + (((b & 0x0000FFFF) * (c >> 16)) >> 16);
    }
}
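For readers puzzled by the shifting and masking above: it is a 16.16 fixed-point multiply, split into half-words so the intermediate products fit in 32 bits. On a VM with cheap 64-bit arithmetic the same operation can be written more directly (a hypothetical helper, not taken from the original FFT code):

```java
public class Fixed {
    // Convert a double to Q16.16 fixed point
    // (16 integer bits, 16 fractional bits).
    static int toFixed(double d) {
        return (int) (d * 65536.0);
    }

    // Multiply two Q16.16 values using a 64-bit intermediate; the
    // half-word splitting in the benchmark above achieves the same
    // result without a long.
    static int qmul(int a, int b) {
        return (int) (((long) a * (long) b) >> 16);
    }

    public static void main(String[] args) {
        int x = toFixed(2.5);
        int y = toFixed(4.0);
        System.out.println(qmul(x, y) == toFixed(10.0)); // true
    }
}
```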

Has anyone else seen this problem on Android? My assumption is that
since Dalvik runs interpreted code without a JIT, the NDK should
avoid these performance issues...but I wanted to post this reality
check to the NDK forum first.

Jack Palevich

Jan 6, 2010, 10:57:40 PM
to android-ndk
Dalvik does not currently use a JIT, but one is being developed. See
this thread for details:

http://groups.google.com/group/android-platform/browse_thread/thread/331d5f5636f5f532/dee6e0a81ae72264?#dee6e0a81ae72264

But it's not clear yet whether the Dalvik JIT will be back-ported to
every existing Android phone.

For that reason alone it seems like it would be best to use native
code for FFT type applications on Android.

Kevin Duffey

Jan 6, 2010, 11:22:08 PM
to andro...@googlegroups.com
Wow... that's sad. :( I didn't know the VM on the phone lacked a JIT. It's really sad that J2ME runs a lot faster than Java on our powerful Android devices. I'm glad a JIT is being developed, hopefully sooner rather than later. I don't understand how J2ME, with its weaker CPUs and less memory, has a JIT while our almost home-computer-class Android devices left it out. I do realize Android is a much more robust platform than J2ME and takes up more memory, but I'm rather surprised and disappointed to see J2ME being 20x faster (in this example). Will a JIT speed this up that much? If so, does that mean that once it's available, ALL apps on existing devices (if they can be upgraded to a JIT-enabled Dalvik) will get a fat boost?

I still think anything like FFT and other intensive operations should be done at the C level if possible. The main issue with Java -> JNI is the overhead of the calls. If there were a way to pass pointers from Java to C without copying the data across the boundary, so the C code could work directly on things like sound samples and images, that would be very nice. I'm a total noob to JNI, and I'm sure the Dalvik VM has more limitations than a normal Java VM. It's too bad the JNA library isn't part of Android. I found JNA on desktop apps so much easier to use than building C/JNI stubs. I only used JNA a little, but if I recall, it lets you call any compiled C library without the headers/stubs that JNI usually requires. Not sure whether that would be a big benefit for Android developers right now, but it certainly helped on platforms like Windows/Linux where you couldn't recompile libraries that shipped with the OS.
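For what it's worth, JNI does have a mechanism for this: direct ByteBuffers. Memory allocated with ByteBuffer.allocateDirect() lives outside the Java heap, and native code can obtain a raw pointer to it via GetDirectBufferAddress() and process the samples in place, with no copy across the boundary. A minimal Java-side sketch (the class name is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

public class DirectBufferDemo {
    // Allocate room for `count` 16-bit sound samples outside the Java
    // heap. A native method handed the backing buffer can call
    // GetDirectBufferAddress() and read/write the samples directly.
    static ShortBuffer allocateSamples(int count) {
        return ByteBuffer.allocateDirect(count * 2)
                         .order(ByteOrder.nativeOrder())
                         .asShortBuffer();
    }

    public static void main(String[] args) {
        ShortBuffer samples = allocateSamples(1024);
        samples.put(0, (short) 42);
        System.out.println(samples.isDirect()); // true
        System.out.println(samples.get(0));     // 42
    }
}
```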



Divkis

Jan 7, 2010, 3:32:53 AM
to android-ndk
Hi,

On Jan 6, 10:35 pm, Biosopher <astev...@gracenote.com> wrote:

> For example, this simple iteration over an empty method 2 million
> times takes 1.4 seconds even though it doesn’t do anything.  The same
> iteration is performed in milliseconds by a C program and about 50ms
> on a similarly powered J2ME phone.
>
> public void performanceTest1() {
>         for (int i = 0; i < 2000000; i++) {
>                 emptyMethod();
>         }
>
> }
>
> private int emptyMethod() {
>         return 0;
>
> }
>

I think it will depend on the compiler flags, as code like this would
otherwise be automatically optimized away.

> Doing something a little more complex like calculating the imaginary
> component of a complex conjugate 2 million times takes 3.2 seconds.
> Again, this takes milliseconds on other mobile phones running J2ME or
> C.
>
> public void performanceTest2() {
>         for (int i = 0; i < 2000000; i++) {
>                 int a  = 5;
>                 int b  = 5;
>                 int c  = 5;
>                 int x  = 5;
>                 int y  = 5;
>
>                 y = ((a >> 16) * ((c << 16) >> 16)) + (((a &
> 0X0000FFFF) * ((c <<
> 16) >> 16)) >> 16);
>                 y = -y;
>                 y  += ((b >> 16) * (c >> 16)) + (((b & 0X0000FFFF) *
> (c >> 16)) >>
> 16);
>         }
>
> }
>
> Has anyone else seen this problem on Android.  My assumption is that
> as Dalvik runs interpreted code without a JIT, then the NDK should
> avoid these performance issues...but I wanted to post this reality
> check to the NDK forum first.

If I understand correctly, for Android the code is compiled using the
standard desktop JDK compiler, while for JavaME a different compiler
is used, and that could be the source of the difference in
performance, as the two would optimize in different ways.

The Android SDK does not provide a separate compiler for the Android
platform, and hence it is possible that the source of the difference
lies somewhere there.

AFAIK even on JavaME there are several profiles, like CDC and CLDC,
based on the presence or absence of floating-point support. I don't
have much experience with JavaME per se, so let me know if there is
__NO__ difference between the Java compiler for desktop and for ME.

Unless you give more information about which device / VM / compiler
flags you are running with, it will be hard for anyone to guess why
the Dalvik VM is slow. From what I could gather from the Android
docs, Dalvik was chosen for speed even though it might not be
perfectly Java compliant. But we cannot really rule out licensing
issues with other VMs as Google's reason for avoiding them. Ideally
Google should have posted the performance differences, if any, for
whatever benchmarks they ran when comparing VMs.

Since you have not provided the snippet of code that is doing the
FFT, I am assuming it is all integer based; otherwise the performance
difference could be due to the presence of a VFP (hardware
floating-point unit).

Overall I think Android definitely needs more information /
documentation / benchmark runs before its performance can really be
compared with other platforms.

HTH,
DivKis

Biosopher

Jan 7, 2010, 1:02:17 PM
to android-ndk
I should clarify that the example above is a simplified version of
the actual test performed. The actual test returned the results of
the performanceTest2() method and wrote them to the log, which
prevented the C compiler from optimizing the code away. The Android
compiler, on the other hand, did not optimize away the empty method,
so it gave us a good benchmark for the overhead of simply making
repeated method calls.

I would appreciate any pointers on optimizing the compiled Java code
in a way that might improve the performance. The Dalvik performance
tests I ran were using the default compiler flags.

As an FYI, I've now rewritten the same code to run native via
Android's NDK and the performance is great in that case.

Biosopher

Jan 7, 2010, 2:39:40 PM
to android-ndk
Hi Divkis,

The performance tests were run on an HTC Hero using the default
compilation flags. As you noted, the FFTs are fixed point using ints
so no slowdown due to floats.

From what I've read, Dalvik was selected not so much for speed as for
the need to circumvent Sun's licensing of Java virtual machines.
This explains why Dalvik hasn't yet taken advantage of modern JVM
optimizations like a JIT.

Others have found similar performance issues with Dalvik:

http://android.serverbox.ch/?page_id=28&cpage=1#comment-61
http://occipital.com/blog/2008/10/31/android-performance-2-loop-speed-and-the-dalvik-vm/

fadden

Jan 12, 2010, 5:00:56 PM
to android-ndk
On Jan 6, 9:35 am, Biosopher <astev...@gracenote.com> wrote:
> For example, this simple iteration over an empty method 2 million
> times takes 1.4 seconds even though it doesn’t do anything.  The same
> iteration is performed in milliseconds by a C program and about 50ms
> on a similarly powered J2ME phone.
>
> public void performanceTest1() {
>         for (int i = 0; i < 2000000; i++) {
>                 emptyMethod();
>         }
> }
> private int emptyMethod() {
>         return 0;
> }

A JIT that performs inlining would turn this from 2 million method
calls into 2 million integer increments.

A JIT that performs strength-reduction would turn this into "i =
2000000".

So it's meaningless as a performance benchmark. It's really more of a
JIT feature detector.
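To make that concrete, here is roughly what each optimization leaves behind (an illustrative sketch, not actual Dalvik or HotSpot output):

```java
public class JitSteps {
    private static int emptyMethod() {
        return 0;
    }

    // As written: 2,000,000 method calls plus 2,000,000 increments.
    static int original() {
        int i;
        for (i = 0; i < 2000000; i++) {
            emptyMethod();
        }
        return i;
    }

    // After inlining: the constant-returning body replaces the call,
    // leaving an empty loop of 2,000,000 increments.
    static int afterInlining() {
        int i;
        for (i = 0; i < 2000000; i++) {
            // (body of emptyMethod() contributes nothing)
        }
        return i;
    }

    // After strength reduction: the empty loop collapses to a single
    // assignment.
    static int afterStrengthReduction() {
        int i = 2000000;
        return i;
    }

    public static void main(String[] args) {
        // All three leave i with the same final value.
        System.out.println(original() == afterInlining()
                && afterInlining() == afterStrengthReduction());
    }
}
```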


> Doing something a little more complex like calculating the imaginary
> component of a complex conjugate 2 million times takes 3.2 seconds.
> Again, this takes milliseconds on other mobile phones running J2ME or
> C.

This is the sort of thing that JITs do extremely well. (No, I'm not
going to post performance numbers for the Dalvik JIT, since it's not
shipping yet.)

Have you tried executing the code under J2ME with the JIT disabled
there? ("-Xint" might do the trick.) I'm curious how an apples-to-
apples test compares.

Biosopher

Jan 13, 2010, 4:50:26 PM
to android-ndk
Hi fadden,

As I mentioned back on Jan 7 at 10:02 am, this performance test was a
highly simplified but very useful probe of Dalvik's current
configuration. Of course a future Dalvik would ideally optimize this
away, but since the current Dalvik doesn't, the test shows in an
utterly simple case how the existing Dalvik underperforms.

The performance test could have been written in a more complex manner
to show the same result, but this one proved a major point in a
simple way: "Dalvik's current implementation introduces considerable
overhead even for very simple operations."

A more complex test like the following would have shown the same
performance issue but would have complicated the analysis (was
passing the value slow, was the return value slow, was the method
call slow?):

public void performanceTest2() {
    int val = 0;

    for (int i = 0; i < 2000000; i++) {
        val += simpleMethod(val);
    }
}

private int simpleMethod(int val) {
    return 1;
}

My first simple case showed that for the current Dalvik
implementation, the slow performance is entirely due to the method
calls, not to passing arguments and return values around.

fadden

Jan 14, 2010, 4:42:03 PM
to android-ndk
On Jan 13, 1:50 pm, Biosopher <astev...@gracenote.com> wrote:
> A more complex test like this would have shown the same performance
> issue but would have complicated the analysis (was passing the value
> slow, was the return value slow, was the method call slow):

I contend that, in a more complex test, the function call wouldn't
have been inlined by the JIT. So you would have been mostly comparing
the cost of computation performed by the method, rather than the cost
of "making a method call" vs. "incrementing an integer". "i++;"
always beats "func(); i++;".

If your code has a lot of trivial methods that can be inlined easily,
then your simple test is an accurate reflection of real-world
differences. In practice, unless you're calling getter/setter methods
in inner loops, it's not a result from which one can draw meaningful
conclusions.

I have benchmarks that show the VM outperforming native code on a
standard benchmark, because the VM can use the floating point hardware
and the NDK is configured for software FP. It is true that, based on
that benchmark, the VM outperforms native code on float-intensive
computations. I would not, however, state that the interpreter is
faster than the native CPU.

Benchmarks are easy to write and execute, but deriving meaning from
the results can be tricky.

Ultimately the only result that matters is the speed of your actual
code on the target platform with the now-shipping OS. For the
computations you're performing on the devices you have, the speed is
unacceptably slow, and writing it in native code is the best solution.

David Sauter

Jan 14, 2010, 5:12:51 PM
to andro...@googlegroups.com
> I have benchmarks that show the VM outperforming native code on a
> standard benchmark, because the VM can use the floating point hardware
> and the NDK is configured for software FP.

I don't suppose there's any way to flip that switch?

Or is this another thing on the wishlist for NDK 2.X?

David Sauter

Biosopher

Jan 14, 2010, 6:04:43 PM
to android-ndk
Just ran my NDK code on a Droid running 2.0.1 and the performance was
great: 0.8 seconds. This is 50% faster than the HTC Ion/Magic I
tested earlier, which ran it in 1.6 seconds.

HTC Ion/Magic specs: Qualcomm® MSM7200A™, 528 MHz
Droid specs: ARM® Cortex™-A8 processor, 550 MHz

If clock speed were the primary difference between the two chips
(only 22 MHz), then Android 2.0.1 gave a considerable boost to my NDK
performance!

Dianne Hackborn

Jan 15, 2010, 2:43:51 AM
to andro...@googlegroups.com
Not until you can specify an appropriate CPU target, or else your app would crash on all of the existing devices that support the baseline ARM native code but not the FPU.

--
Dianne Hackborn
Android framework engineer
hac...@android.com

Note: please don't send private questions to me, as I don't have time to provide private support, and so won't reply to such e-mails.  All such questions should be posted on public forums, where I and others can see and answer them.

Tristan Miller

Jan 15, 2010, 2:52:15 AM
to andro...@googlegroups.com

Does the Android kernel not emulate floating point instructions on processors that lack an FPU?  I was under the impression (despite it being a compile-time option for the kernel) that this is pretty standard for Linux on ARM.

Tristan Miller


guich

Jan 15, 2010, 9:25:25 AM
to android-ndk
Hi Biosopher,

We are porting the TotalCross VM to Android and I would love to see
the same benchmark run on our VM. We plan to have an alpha version in
one month. I believe you will be able to port your code to our API
without much trouble. In the meantime, you could run it on Pocket PC
or even on Win32 (WinXP).

You can download it at: www.totalcross.com

best

guich

David Turner

Jan 15, 2010, 1:06:09 PM
to andro...@googlegroups.com
On Thu, Jan 14, 2010 at 11:52 PM, Tristan Miller <trism...@gmail.com> wrote:

Does the Android kernel not emulate floating point instructions on processors that lack an FPU?  I was under the impression (despite being a compile time option for the kernel) that this is pretty standard for Linux ARM.

It's not standard anymore, and hasn't been for a very long time.

More specifically, the kernel used to trap VFPv1 instructions when executed by the CPU, emulate the computation, then return to userland. This resulted in extremely bad performance.

Nowadays, the standard Linux ARM ABI uses "soft-floating point", where all floating point computations are performed by calling software helper functions. The compiler inserts the
corresponding function calls automatically for you, so you don't need to worry about it. This gets *much* better results, though not as fast as a real hardware FPU.

Moreover, the ARMv7 CPUs all use VFPv3, which corresponds to a different set of FPU instruction encodings, and the kernel sources have strictly no support for these.

Robert Green

Jan 15, 2010, 2:05:50 PM
to android-ndk
Since the code is in an .so built by the 1.6 compiler, it's not the
NDK that's running faster (the NDK only builds the library; it
doesn't execute the code, the kernel/CPU does) but the CPU itself.
Clock-for-clock, the ARM Cortex-A8 seems to be faster than the
MSM7200A. Either that, or it's the fact that the MSM7200A is actually
running at 378 MHz while the Cortex-A8 runs at 550 MHz, plus it's a
little faster.

Either way, running the same .so on two different platforms tests the
kernel and CPU, not the dev kit :)

fadden

Jan 15, 2010, 4:30:42 PM
to android-ndk
On Jan 14, 3:04 pm, Biosopher <astev...@gracenote.com> wrote:
> Just ran my NDK code on a Droid running 2.0.1 and the performance was
> great: 0.8 seconds.  This is 50% faster than the HTC Ion/Magic I
> tested earlier which ran in 1.6 seconds.
[...]

> If the MHz was the primary difference between the two chips (only 22
> MHz), then Android 2.0.1 gave a considerable boost to my NDK
> performance!

Higher clock rate, dual-issue CPU, faster memory, etc. The Droid (and
Nexus One) feature substantially greater computing horsepower.

Switching from fixed-point math to float will also be pretty dramatic
on a Droid (esp. with double-precision ops).

Biosopher

Jan 18, 2010, 12:30:39 PM
to android-ndk
Keep us posted, guich. It would be nice to create cross-platform UIs
on Android.