Improving NestedVM's performance..?


Dieter Krachtus

Jul 3, 2007, 9:07:11 PM7/3/07
to NestedVM
Hi,

I believe NestedVM's performance is already impressive, but still not
enough for certain applications where performance is king. Can anything
substantial still be done about performance?

My experience is that I have never seen anything run faster than about
20% of the speed native code offers. Or am I wrong? I am curious where
you think performance could still be improved - GCC, changes to the JVM,
NestedVM itself? And how much would each improve things performance-wise?
Perhaps with some brainstorming together we can come up with some fresh
ideas.

Cheers,
Dieter

Roelof Berg

Jul 4, 2007, 3:03:41 AM7/4/07
to nest...@googlegroups.com
Hello Dieter,

an interesting topic might be changing the "calling convention". Maybe
there could be some performance impact there. Another option is forcing
gcc to do as much inlining as possible (perhaps by patching gcc, if the
gcc options are too weak for this. Adding "inline" keywords to the code
does not force gcc to inline - it only advises gcc that inlining might
be good at this place ...).

-----------------------------------------------

A bit "freaky" would be an approach that detects "macro blocks" of known
assembler code using simple pattern matching. When a pattern is
detected, a similar but faster Java replacement is called.

Example:

Very often you see the same assembler statements:
a1b2##c3********d4e5f6a1b2c3d4
(not literally these ones ...)

meaning e.g. something like:
la R##, 0x********
ld R1, C(1)
label1:
subu R2, R##,R1
beqz R2, label2
...
label2:

These could, for example, be translated into a jump to a special Java
subroutine:

x = memory[0x********];
for (i = x; i > 0; i--) {
    registers[##] = x;
    execute(0xStart, 0xEnd);
}

Maybe this is not the best example. Could there be macro blocks of this
sort that occur very often? E.g. headers and footers of functions
(stack handling). What else might occur often?

-----------------------------------------------

"Native MIPS" floating-point support would also improve performance for
floating-point-heavy applications. I wondered why our vector arithmetic
code takes so long: several minutes in NestedVM and only a few seconds
in native C++. Maybe this is floating-point related (IEEE conversions).
But maybe there is another reason (e.g. slow fetching/storing of values
from/to memory). I can check this by comparing an integer build of our
vector arithmetic code against a floating-point build.

Best regards,
Roelof Berg



Dieter Krachtus

Jul 4, 2007, 4:11:27 AM7/4/07
to nest...@googlegroups.com
Hi Roelof,

Interesting ideas, I wonder how big the impact on performance would be.
Perhaps it would help if one could isolate a part in NestedVM that is
general enough to be implemented natively using JNI. The idea is, to
split up NestedVM in a way that most of the heavy work is done in the
native part and communication between the native and Java parts is
minimal. Does this make any sense to you?

Cheers,
Dieter


--
--------------------------------------------------------------------------
Interdisciplinary Center for Scientific Computing (www.iwr.uni-heidelberg.de)

Dieter Krachtus Heidelberg University, IWR
Computational Biophysics
+49-6221-54 8805 (office R222) Im Neuenheimer Feld 368
private: dieterk...@web.de D - 69120 Heidelberg
dieter....@iwr.uni-heidelberg.de Germany
--------------------------------------------------------------------------

Brian Alliet

Jul 4, 2007, 8:10:34 PM7/4/07
to Dieter Krachtus, NestedVM
On Wed, Jul 04, 2007 at 03:07:11AM +0200, Dieter Krachtus wrote:
> I believe NestedVM's performance is already impressive but still not
> enough for certain solutions where performance is king. Can there be
> still done something substantially when it comes to performance?

The only idea for improving performance that I haven't yet implemented
is being "smarter" about using local variables. Right now registers are
mostly stored in fields. We store some common registers (like the stack
pointer) in local variables, but they have to be written back to the
fields whenever we leave a method. This means that if we store too many
registers in local vars, code like this:

if (a == b)
    return a + b; /* fast path */
else
    /* huge body of code that uses tons of regs */

suffers a huge performance hit, because we needlessly read and write a
ton of unused registers in the fast path (this is just a consequence of
the stupid way this is implemented).

Doing this right essentially is the same as doing register allocation,
except the registers are local vars and the stack is the fields.
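A minimal Java sketch of that trade-off (hypothetical class and field names, not NestedVM's actual generated code):

```java
// Sketch of the locals-vs-fields trade-off: "registers" live in fields,
// a method caches some of them in locals, and every cached local must be
// written back to its field before the method returns - even on a fast
// path that barely touched them.
public class RegisterCacheSketch {
    int r2, r3; // "register" fields (the spill slots)

    int run(boolean fastPath) {
        int r2 = this.r2, r3 = this.r3; // load cached registers into locals
        int result;
        if (fastPath) {
            result = r2 + r3;           // fast path: uses only r2 and r3
        } else {
            // long path: heavy register traffic would go here
            r2 = r2 * 2;
            r3 = r3 - 1;
            result = r2 * r3;
        }
        this.r2 = r2;                   // mandatory write-back on exit,
        this.r3 = r3;                   // paid even by the fast path
        return result;
    }

    public static void main(String[] args) {
        RegisterCacheSketch s = new RegisterCacheSketch();
        s.r2 = 3;
        s.r3 = 4;
        System.out.println(s.run(true)); // prints 7
    }
}
```

The more registers a method caches, the more load/store overhead its fast paths pay, which is why naively caching everything in locals backfires.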

> where performance could be still improved - the GCC, changes to the JVM,

Changes to the JVM? I'd love to be able to change the JVM. Stupid JVM
limitations are the cause of just about all of NestedVM's performance
problems. Removing the 64k method size limitation alone would do
wonders for NestedVM's performance.

Trouble is I don't think I have the right kind of connections to change
the JVM (which hasn't been changed, even by Sun, since version 1.0) and
get the new version deployed on every computer in the world. :)

I'm sure some changes to GCC could make the generated code a little
more NestedVM-friendly. The problem there is that the GCC codebase
scares me to death. I've made a few minor changes to it, but I don't
think the performance improvements that might come of NestedVM-specific
changes would justify the time I'd have to spend wrapping my head
around it.

-Brian

Brian Alliet

Jul 4, 2007, 8:26:32 PM7/4/07
to Roelof Berg, nest...@googlegroups.com
On Wed, Jul 04, 2007 at 09:03:41AM +0200, Roelof Berg wrote:
> an interesting topic might be changing the "calling convention". Maybe

I'm not sure how much changing the calling convention would help. Right
now most arguments are passed in registers which is about as efficient
as you can get. Changing the calling convention could also create one
more thing that is different about NestedVM. Keeping it as close to
other mips architectures as possible helps ensure that stuff just works
out of the box with NestedVM.

> there can be some performance impact. And possibly forcing gcc to do as
> much inlining as possible (maybe by patching gcc if the gcc options are
> to weak for this. Adding "inline" statements to the code does not force
> gcc to inline - it only advises gcc that inlining might be good at this
> place ...).

This is something you'd have to take up with the GCC guys. I'm
certainly not interested in adding features to GCC. You could also just
turn the functions you really want inlined into CPP macros (that's
what people did before the inline keyword).

> A bit "freaky" would be the approach to detect "macro blocks" of known
> assember code using a simple pattern-matching approach. When a pattern
> is detected a similar and faster java replacement is called.

This is definitely a good idea. Lots of times you'll see stuff like:

this.x = memory[someaddr];
this.y = memory[someaddr];
this.z = this.x + this.y;
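As a hand-written illustration (hypothetical class and field names, not actual NestedVM output), the redundant shape above next to the version a bytecode optimizer could produce:

```java
// Illustration of the redundant-load pattern and its optimized form.
public class RedundantLoadSketch {
    int[] memory = new int[16];
    int x, y, z;

    // Pattern as emitted: the same memory load twice, plus re-reading
    // the fields that were just written.
    void naive(int addr) {
        this.x = memory[addr];
        this.y = memory[addr];    // duplicate load of the same word
        this.z = this.x + this.y; // field re-reads of fresh values
    }

    // What an optimizer could produce: one load, value reused via a local.
    void optimized(int addr) {
        int v = memory[addr];     // single load
        this.x = v;
        this.y = v;
        this.z = v + v;           // no field re-reads
    }
}
```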

One thing I've wanted to do for a long time is write a JVM bytecode
optimizer that'll catch stuff like this. It would actually be useful
for LambdaVM too, so there is a chance I might actually write it
eventually. :)

> meaning e.g. something like:
> la R##, 0x********
> ld R1, C(1)
> label1:
> subu R2, R##,R1
> beqz R2, label2
> ...
> label2:

Loops should actually be handled pretty efficiently now, as long as
they are small enough not to cross a method boundary (and hence require
a trip back to the trampoline). The beqz would be turned into a
conditional jump bytecode.

> "Native MIPS" floating point support would also improve the performance
> for floating point related applications. I wondered why our vector

I'm not sure what you mean by "Native MIPS". As you mentioned, the
major performance cost with floating-point stuff is all the
doubleToLongBits/longBitsToDouble work that has to go on, since memory
is an int[] array. I can't see any way to fix this.
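Roughly, every double access has to be reassembled from two 32-bit memory words; a sketch of that cost (assuming big-endian word order, and not the exact NestedVM runtime code):

```java
// Sketch of reading/writing a double when memory is an int[]: every
// access pays a 64-bit reassembly plus a longBitsToDouble conversion
// (and the reverse on stores). Word order here assumes big-endian MIPS.
public class DoubleMemorySketch {
    static double readDouble(int[] memory, int wordAddr) {
        long bits = ((long) memory[wordAddr] << 32)
                  | (memory[wordAddr + 1] & 0xFFFFFFFFL);
        return Double.longBitsToDouble(bits);
    }

    static void writeDouble(int[] memory, int wordAddr, double d) {
        long bits = Double.doubleToLongBits(d);
        memory[wordAddr] = (int) (bits >>> 32); // high word
        memory[wordAddr + 1] = (int) bits;      // low word
    }

    public static void main(String[] args) {
        int[] memory = new int[4];
        writeDouble(memory, 0, 3.14159);
        System.out.println(readDouble(memory, 0)); // prints 3.14159
    }
}
```

The bit round-trip is lossless, but it puts a conversion on every single load and store, which is where floating-point-heavy code loses time.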

-Brian

Brian Alliet

Jul 4, 2007, 8:35:59 PM7/4/07
to Dieter Krachtus, nest...@googlegroups.com
On Wed, Jul 04, 2007 at 10:11:27AM +0200, Dieter Krachtus wrote:
> Perhaps it would help if one could isolate a part in NestedVM that is
> general enough to be implemented natively using JNI. The idea is, to

I'm not sure where you're going with this. It looks like you'd end up
with the worst of both worlds. Once you require JNI you'd given up the
"run anywhere" part. You might as well just compile your native code on
each platform you support.

Unless you're thinking of some optional JNI code that gets used on
platforms where it is available and falls back to the pure java version
where it isn't. Trouble is there aren't really any large chunks of code
in the runtime that are performance bottlenecks. The slowdown is in the
generated code. Stuff like array bounds checking, reading double
values, the trampoline stuff, etc. This isn't really stuff you can fix
with JNI.

-Brian

Roelof Berg

Jul 5, 2007, 5:21:38 AM7/5/07
to nest...@googlegroups.com
Thanks for the good feedback regarding the performance optimizations.
NestedVM is imho already great enough that it possibly wouldn't need
these optimizations :) But it's interesting to think about making it
an even more outstanding tool ;)

Brian Alliet wrote:
> One thing I've wanted to do for a long time is write a jvm bytecode
> optimizer that'll catch stuff like this.

I meant searching for patterns in the MIPS binary code that could be
replaced by equivalents in Java. I hope that was written
understandably. However, a Java bytecode optimizer (maybe as a
post-processor stage to NestedVM) could be a very useful tool. Excellent
idea :) I googled around and found some Java performance tools on
http://www.javaperformancetuning.com/resources.shtml.

Examples:
http://jarg.sourceforge.net/
http://jode.sourceforge.net/
http://www.cs.purdue.edu/s3/projects/bloat/

I will try out some of these and email the results to the newsgroup :)

> I'm not sure what you mean by "Native MIPS".

I meant: when the log() call is not executed by soft-float but by the
MIPS FPU instruction (mapped to the Java log() call), the "real" FPU of
the host system would be used ... In January someone explained in this
group how to integrate this. I always hoped to find the time to try
this out (but my boss rated it as a low-priority feature ... maybe I'll
try it at home some time ...).


Roelof

Roelof Berg

Jul 5, 2007, 5:59:50 AM7/5/07
to nest...@googlegroups.com
Roelof Berg wrote:
> Brian Alliet wrote:
>> One thing I've wanted to do for a long time is write a jvm bytecode
>> optimizer that'll catch stuff like this.
> I googled about that and [...] will try out some of this and email the
> results to the newsgroup :)
>
I made a quick test using ProGuard. The speed remained the same and the
code size only shrank by 1%. Possibly because the underlying MIPS code
remains untouched. (There were "omitted some parts" warnings - so
possibly I didn't set all the options right ...)

Roelof

Dieter Krachtus

Jul 5, 2007, 6:05:13 AM7/5/07
to Brian Alliet, nest...@googlegroups.com
Brian Alliet wrote:
> On Wed, Jul 04, 2007 at 10:11:27AM +0200, Dieter Krachtus wrote:
>
>> Perhaps it would help if one could isolate a part in NestedVM that is
>> general enough to be implemented natively using JNI. The idea is, to
>>
>
> I'm not sure where you're going with this. It looks like you'd end up
> with the worst of both worlds. Once you require JNI you'd given up the
> "run anywhere" part. You might as well just compile your native code on
> each platform you support.
>
Not necessarily. Sun's JRE is full of JNI calls and it really boosts
performance (io, math, pack200, image, graphics). Other Java
implementations like Harmony do some of this stuff in pure Java; e.g.
pack200 compression takes ages. JNI makes sense for low-level stuff
which is used often and in many places. But you are the expert - if you
say there are no basic operations in NestedVM or the generated code
which could be done faster natively, we can forget about that.

> The slowdown is in the
> generated code. Stuff like array bounds checking,
Ages ago I read something about turning off bounds checking in relation
to Java game programming, where performance is important. I guess the
JIT can do it in some cases - perhaps there are recent developments
related to turning off bounds checking. Perhaps some framework which
automatically allows you to write code that can be optimized by the
JIT.

Right now we can't get more than 20% of the performance of the native
code. How much would removing the 64K limit boost performance? Could
one achieve 50% of native performance? Just give me some number, even
if it is a wild guess.

I actually have a lot of experience with bytecode optimization and use
it on a daily basis. Could we push back the 64K limit with this, since
optimized methods would be smaller? Or am I getting something wrong
here? I use bytecode optimization in an Ant-based build system and it
basically works out of the box if you don't use things like reflection.
The current application I am working on has a size of 8MB (bytecode)
and after optimization it is 2.5MB.

Cheers,
Dieter

Brian Alliet

Jul 5, 2007, 8:39:38 AM7/5/07
to Roelof Berg, nest...@googlegroups.com
On Thu, Jul 05, 2007 at 11:21:38AM +0200, Roelof Berg wrote:
> I meant searching for patterns in the MIPS binary code that could have
> be replaced by equivalents in Java. I hope that was written

Right. I got that. I think you'd be able to catch most of this stuff
just as easily at the bytecode level though, plus the code could be
used outside NestedVM.

> I meant: When the log() call is not executed by soft-float but by the
> MIPS-fpu-command (mapped to the java log() call) the "real" fpu of the
> host-system would be used ... In January someone explained in this group
> how to integrate this. I always hoped of getting the time to try this
> out (but my boss rated this as a low-priority-feature ... maybe I'll try
> that at home some time ...).

Actually, MIPS doesn't have a log (or sin, cos, tan, etc.) instruction
(unlike x86, for example, which has opcodes for everything under the
sun). So the way we're doing it now is "native MIPS". The plan is to
implement them as syscalls, which would make absolutely no sense on a
real MIPS machine, but will help here, as we'd be using the "native
Java" (if that term even makes sense ...) implementation.

Here is the overview for doing the fast trig ops:

http://wiki.brianweb.net/NestedVM/FastTrigOps

-Brian

Brian Alliet

Jul 5, 2007, 9:05:52 AM7/5/07
to Dieter Krachtus, nest...@googlegroups.com
On Thu, Jul 05, 2007 at 12:05:13PM +0200, Dieter Krachtus wrote:
> Not necessarily. Suns JRE is full of JNI calls and it really boosts
> performance (io, math, pack200, image, graphics). Other Java

Indeed it is. Sun can do that though, since they distribute the JVM.
Once we use JNI though we have to distribute the native code that
implements those JNI operations, which is different for each platform.

As I said, this plus a pure Java fallback wouldn't be too bad. I really
can't think of anything that could be shifted out to a JNI call that
takes long enough to justify the JNI overhead though. I certainly could
be missing something, so if anyone has any ideas let me know.

> Ages ago I did read something about turning of bounds checking in
> relation to Java game programming, where performance is important. I
> guess the JIT can do it in some cases - perhaps there are recent

I don't know much about Sun's JIT compiler. I haven't been able to find
any documentation on it other than high-level overviews that look like
something the marketing guys wrote up.

I know the JVM does eliminate bounds checks in places where it can
prove they aren't needed (like "for(i=0;i<a.length;i++) sum+=a[i]"). I
doubt it can see any of this in NestedVM because of how much gcc
mangles up the code.

> How much would removing the 64K limit boost performance? Could one
> achieve 50% of native performance? Just give me some number even if it
> is a wild guess.

A wild guess... I think it might double performance. Right now a jump
to an address outside the current method looks like this:

run_0x0100() {
    int r29, r2, r3, r4;
    // tons of code
    // the jump
    this.r29 = r29;
    // ... save the rest of the regs in fields
    this.pc = 0x204;
    return;
}

trampoline() {
    while(true) {
        switch(pc >> 8) {
            case 0x1: run_0x100(); continue;
            case 0x2: run_0x200(); continue;
        }
    }
}

run_0x200() {
    int r29 = this.r29;
    // restore the rest of the regs
    switch(pc) {
        case 0x200:
            // other stuff we don't want, we want 0x204
        case 0x204:
            // now do the function
    }
}

All this nonsense is to get from one method to another. Without the 64k
method limitation we could put the entire program in one method and
just do a goto (yes, the JVM has a goto instruction).

> I actually have a lot of experience with bytecode optimization and use
> it on a daily basis. Could we push with this the 64K limit, since
> optimized methods would be smaller? Or do I get here something wrong? I

It really doesn't matter how much smaller you make the bytecode. Unless
it is small enough to fit the entire program within 64k of bytecode
(and the bytecode is much larger than the MIPS code), we're still stuck
with the trampoline stuff.

-Brian
