Is it worthwhile to recognize tail calls and turn them into GOTOs, or is that another thing the JIT
will take care of? What about loop unrolling? Branch ordering?
And even if the JIT takes care of things, there's presumably a cost for having it do them -- is it
worth applying the optimizations simply to save the JIT the effort and let it focus on more involved
work?
How do I even get my hands on some metrics to test these questions out for myself?
~~ Robert Fischer.
Grails Training http://GroovyMag.com/training
Smokejumper Consulting http://SmokejumperIT.com
Enfranchised Mind Blog http://EnfranchisedMind.com/blog
Check out my book, "Grails Persistence with GORM and GSQL"!
http://www.smokejumperit.com/redirect.html
Some information that might be helpful, from the Hotspot Wiki
(http://wikis.sun.com/display/HotSpotInternals/Home)
http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques
Tips for microbenchmarking
http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarks
Showing compilation steps/decisions
http://wikis.sun.com/display/HotSpotInternals/LogCompilation+overview
http://wikis.sun.com/display/HotSpotInternals/LogCompilation+tool
LogCompilation is the older tool; there's a new one called
PrintCompilation which will be in JDK 7 (I believe it may already be
in the current dev drops for J7 but may require a debug build to use;
not sure).
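For reference, here is how those flags are enabled on the command line (a sketch; -XX:+LogCompilation is a diagnostic option and must be unlocked first, and availability varies by JVM build; the class name HotLoop is made up for the example):

```java
// A trivial workload to trigger JIT compilation. Run with e.g.:
//   java -XX:+PrintCompilation HotLoop
//   java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation HotLoop
public class HotLoop {
    static int sum(int n) {
        int s = 0;
        for (int i = 0; i < n; i++) s += i;
        return s;
    }

    public static void main(String[] args) {
        long total = 0;
        // Call the method enough times that HotSpot decides to compile it,
        // so the compilation log actually has something to report.
        for (int i = 0; i < 20_000; i++) total += sum(1_000);
        System.out.println(total);
    }
}
```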
I believe the papers on the Ideal Graph Visualizer also describe the
optimization phases in some detail
http://wikis.sun.com/display/HotSpotInternals/Publications+JKU#PublicationsJKU-IdealGraphVisualizer
HTH
Patrick
Anecdotally, I've found that most micro optimizations to the bytecode
don't make a big difference. I've got bits of Duby's compiler that will
generate slightly different loop or branch logic from javac, and the
performance is no different. But there are a few things that seem to be
worthwhile:
- General reduction in bytecode size. Simply having less bytecode seems
to have a large effect on how well Hotspot is able to consume the code.
There are also size limitations on inlining code, so less bytecode often
means more inlining.
- Reduction in numbers/complexity of branches. This also plays into
total bytecode size, but I've seen improvements from simply flipping
loops around or calculating jump conditions in aggregate before making a
single jump.
- Outline as much code as humanly possible (as opposed to inlining).
JRuby's compiler originally just emitted all logic straight into the
method body. This turned out pretty badly; it was very slow, and there
was a tremendous amount of duplication. By pulling as much as possible
into static utility methods, bytecode size was drastically reduced and
performance went up substantially.
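As a hedged illustration of the outlining point (the names below are hypothetical, not JRuby's actual code): instead of emitting, say, range-clamping logic inline at every generated call site, the compiler emits one call to a shared static helper, keeping each generated method body small:

```java
// Hypothetical sketch: a compiler needs "coerce a long to an int, clamped
// to a range" at many generated call sites. Emitting the branches inline
// at each site bloats every method body; outlining them into one static
// helper keeps generated methods small and under inlining size thresholds.
public class RuntimeHelpers {
    // The shared, outlined helper: generated code emits a single
    // invokestatic here instead of repeating the branch logic.
    public static int clampToRange(long value, int min, int max) {
        if (value < min) return min;
        if (value > max) return max;
        return (int) value;
    }

    // What a generated method body reduces to: one call, not a dozen
    // bytecodes of comparisons and jumps.
    public static int generatedMethodBody(long input) {
        return clampToRange(input, 0, 100);
    }
}
```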
> Is it worthwhile to recognize tail calls and turn them into GOTOs, or is that another thing the JIT
> will take care of? What about loop unrolling? Branch ordering?
TCO is apparently "done" in the MLVM patchset, but until then avoiding
the recursive call can definitely help.
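For example, until TCO lands, a compiler can rewrite a self-tail-call into a loop itself. A sketch of the transformation (not MLVM's implementation):

```java
public class TailCalls {
    // Tail-recursive form: each call consumes a stack frame.
    static long sumRec(long n, long acc) {
        if (n == 0) return acc;
        return sumRec(n - 1, acc + n); // call in tail position
    }

    // Same function after the tail call is turned into a back-branch
    // (a GOTO in bytecode terms): constant stack usage, no call overhead.
    static long sumLoop(long n, long acc) {
        while (n != 0) {
            acc = acc + n;
            n = n - 1;
        }
        return acc;
    }
}
```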
> And even if the JIT takes care of things, there's presumably a cost for having it do them -- is it
> worth applying the optimizations simply to save the JIT the effort and let it focus on more involved
> work?
The number one way to "help" the JIT is to keep code bodies as small as
possible...beyond any other technique, this is the one I recommend.
I can elaborate on specific techniques if you like. We've gone through a
lot of experimentation on JRuby.
- Charlie
> What optimizing efforts are worthwhile on generated bytecode given the optimizing capabilities of
> the JIT? I'm having trouble finding an explicit or definitive list of optimizations, but trying to
> be clever about method inlining is pretty clearly wasted effort. I can't find any documentation on
> it, but I'm assuming the JIT also takes care of constant folding, strength reduction, and dead code
> removal, since they're so straightforward. Or am I wrong, and the JIT assumes the compiler author
> applied those filters already?
Other people can probably give you nitty-gritty details. The main
rule, I think, is "Generate dumb code: specifically, generate what a
fairly unclever Java programmer would write, as compiled by a
straightforward and unclever Java compiler." That's what the JIT will
have been mostly tested on. The cleverer you or your compiler are,
the more likely you are to hit a case the JIT can't handle well.
--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures
Looking at this same statement from a different angle: write bytecode
you'd like to maintain if it had started out as Java code.
And as far as the "clever" thing goes, I can promise you that's true. A
number of languages (including JRuby) have tried to work around
"finally" being inlined everywhere by jumping to it. Unfortunately it
turns out that the clever version is really hard to optimize in the best
case (exceptional and non-exceptional paths sharing the same code) and
can crash some JITs in the worst case (we saw crashes on both Java 5 and
Java 6 client compilers until we fixed this). Since javac never emits
this "clever" version, JVMs aren't tested or optimized for it.
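To illustrate what "inlined everywhere" means: javac copies the finally body onto every exit path rather than emitting it once and jumping to it (the old JSR/RET idiom). A sketch, with the duplication described in comments (exact bytecode varies by compiler version):

```java
// javac compiles this method by duplicating the finally body on each exit
// path: once after the normal return value is computed, once on the catch
// path, and once on the uncaught-exception path. The "clever" alternative
// of sharing one copy via a jump is what the post above warns against.
public class FinallyDemo {
    static int counter = 0;

    static int withFinally(boolean fail) {
        try {
            if (fail) throw new RuntimeException("boom");
            return 1;
        } catch (RuntimeException e) {
            return -1;
        } finally {
            counter++; // emitted once per exit path in the bytecode
        }
    }
}
```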
- Charlie
> Given that you are generating code, do you generate java directly or
> generate an AST?
From an AST, I suppose is the right answer. But it's a Lisp-family
language, so the surface syntax is already a tree -- I just keep
transforming the tree until its semantics are those of Java.
> If you generate an AST, do you unparse that into java and run javac on it or
> do you generate bytecode directly from it?
I generate Java from it. Javac is certainly appropriate, though one
could use gcj or janino, too. (No generics in the generated code.)
> If you generate bytecode directly from it, what library or libraries do you
> use?
N/A
> - General reduction in bytecode size. Simply having less bytecode seems
> to have a large effect on how well Hotspot is able to consume the code.
> There are also size limitations on inlining code, so less bytecode often
> means more inlining.
> - Reduction in numbers/complexity of branches. This also plays into
> total bytecode size, but I've seen improvements from simply flipping
> loops around or calculating jump conditions in aggregate before making a
> single jump.
> - Outline as much code as humanly possible (as opposed to inlining).
> JRuby's compiler originally just emitted all logic straight into the
> method body. This turned out pretty badly; it was very slow, and there
> was a tremendous amount of duplication. By pulling as much as possible
> into static utility methods, bytecode size was drastically reduced and
> performance went up substantially.
Thanks, Charlie. Those are great rules of thumb. They probably
belong somewhere on the internals wiki, despite the warnings that it
is not a tuning document.
Branch-to-branch doesn't matter much for HotSpot, since it uses SSA
and a sea-of-nodes IR. Recently, we have added a pre-pass to make sure
the IR is generated in RPO (reverse postorder), regardless of concrete
bytecode structure. This makes the initial JIT steps more strongly
normalizing, hence more reliably optimizing. I imagine the other JVMs
do similar "due diligence" on their JIT inputs.
For language runtimes the JVMs should supply more hooks for affecting
JIT-level inlining, maybe something as simple (in HotSpot's case) as
@sun.misc.Inline. I'm not aware of a credible effort to standardize
on it, though.
Branch-free idioms (as long as they are simple and do not greatly
expand bytecode size) are helpful. The Hotspot JIT is pretty good at
turning simple control flow into branch-free code when possible, but
in very simple cases it is more reliable to write it in Java code.
The main problem is that bytecodes cannot express a conditional move
directly; the best you can do is emit the control flow for
x = p ? x1 : x2. The important thing for encouraging conditional moves
is that the predicate p and one or both of the conditional values x1,
x2 should be free of side effects and exceptions -- constants or
variable references, for example.
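A sketch of the shapes being described (the class and method names are invented for illustration): a side-effect-free ternary gives the JIT its best chance at a conditional move, and for very simple cases the branch can be eliminated in the source itself:

```java
public class CondMove {
    // Both arms are plain variable references with no side effects or
    // exceptions, so the JIT is free to compile this to a conditional
    // move rather than a branch.
    static int maxCmovFriendly(int a, int b) {
        return (a > b) ? a : b;
    }

    // A fully branch-free variant written directly in the source, using
    // the sign bit as a mask. Note the caveat about bytecode size: this
    // is more bytecode than the ternary, and (a - b) can overflow for
    // inputs of opposite sign near the int extremes.
    static int maxBranchFree(int a, int b) {
        int diff = a - b;
        int mask = diff >> 31;        // all ones if a < b, else all zeros
        return a - (diff & mask);     // a if a >= b, otherwise b
    }
}
```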
Always use -XX:+PrintAssembly to see what you are doing.
BTW, +PrintCompilation is a very *old* flag; +LogCompilation is
slightly less old. The newest thing here, probably, is that
+PrintAssembly is now a diagnostic switch, which means it works for
product builds (after being unlocked). You need a disassembler
plugin, and I'm pleased to announce that this is now second-sourced
(x86/32 only at present) at http://kenai.com/projects/base-hsdis/ .
You may still need to do a build, but hopefully the extra option will
make it easier for some.
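Concretely, the unlock incantation looks like this (a sketch; the hsdis disassembler plugin must be installed for any disassembly to appear, and the class name PrintAsmDemo is made up for the example):

```java
// Run with:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly PrintAsmDemo
// With the hsdis plugin on the library path, HotSpot dumps the generated
// machine code for each method it compiles.
public class PrintAsmDemo {
    static int hot(int x) {
        return x * x + 1;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm up so the method actually gets JIT-compiled and disassembled.
        for (int i = 0; i < 100_000; i++) sink += hot(i);
        System.out.println(sink);
    }
}
```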
Best,
-- John
If I could simply tell hotspot to inline through all my dyncall
plumbing, it would save me a world of pain trying to bring unique code
all the way back to the original call site. Of course invokedynamic has
better ways around this, like TCO, but for now my plumbing totally
defeats inlining (as discussed in Cliff Click's presentation last fall).
In theory if I could tell Hotspot to immediately inline my
"CachingCallSite" logic everywhere it was encountered, I'd be that much
closer to getting the method handles to inline, wouldn't I?
BTW, I did also try making CachingCallSite's logic static methods that
manipulate a common data structure. It did not help.
The problem I run into now trying to get dyncalls to inline is that it
usually adds so much extra bytecode to the calling method (because the
dyncall plumbing is complex) that it actually slows things down more
than it speeds them up :(
- Charlie