Is it worthwhile to recognize tail calls and turn them into GOTOs, or is that another thing the JIT
will take care of? What about loop unrolling? Branch ordering?
And even if the JIT takes care of things, there's presumably a cost for having it do them -- is it
worth applying the optimizations simply to save the JIT the effort and let it focus on more involved
work?
How do I even get my hands on some metrics to test these questions out for myself?
~~ Robert Fischer.
Grails Training http://GroovyMag.com/training
Smokejumper Consulting http://SmokejumperIT.com
Enfranchised Mind Blog http://EnfranchisedMind.com/blog
Check out my book, "Grails Persistence with GORM and GSQL"!
http://www.smokejumperit.com/redirect.html
Some information that might be helpful, from the Hotspot Wiki
(http://wikis.sun.com/display/HotSpotInternals/Home)
http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques
Tips for microbenchmarking
http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarks
Showing compilation steps/decisions
http://wikis.sun.com/display/HotSpotInternals/LogCompilation+overview
http://wikis.sun.com/display/HotSpotInternals/LogCompilation+tool
LogCompilation is the older tool; there's a new one called
PrintCompilation which will be in JDK 7 (I believe it may already be
in the current dev drops for J7 but may require a debug build to use;
not sure).
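For reference, here is how those flags are enabled on the command line (a sketch; -XX:+LogCompilation is a diagnostic option and must be unlocked first, and availability varies by JVM build; the class name HotLoop is made up for the example):

```java
// A trivial workload to trigger JIT compilation. Run with e.g.:
//   java -XX:+PrintCompilation HotLoop
//   java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation HotLoop
public class HotLoop {
    static int sum(int n) {
        int s = 0;
        for (int i = 0; i < n; i++) s += i;
        return s;
    }

    public static void main(String[] args) {
        long total = 0;
        // Call the method enough times that HotSpot decides to compile it,
        // so the compilation log actually has something to report.
        for (int i = 0; i < 20_000; i++) total += sum(1_000);
        System.out.println(total);
    }
}
```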
I believe the papers on the Ideal Graph Visualizer also describe the
optimization phases in some detail
http://wikis.sun.com/display/HotSpotInternals/Publications+JKU#PublicationsJKU-IdealGraphVisualizer
HTH
Patrick
Anecdotally, I've found that most micro optimizations to the bytecode
don't make a big difference. I've got bits of Duby's compiler that will
generate slightly different loop or branch logic from javac, and the
performance is no different. But there are a few things that seem to be
worthwhile:
- General reduction in bytecode size. Simply having less bytecode seems
to have a large effect on how well Hotspot is able to consume the code.
There are also size limitations on inlining code, so less bytecode often
means more inlining.
- Reduction in numbers/complexity of branches. This also plays into
total bytecode size, but I've seen improvements from simply flipping
loops around or calculating jump conditions in aggregate before making a
single jump.
- Outline as much code as humanly possible (as opposed to inlining).
JRuby's compiler originally just emitted all logic straight into the
method body. This turned out pretty badly; it was very slow, and there
was a tremendous amount of duplication. By pulling as much as possible
into static utility methods, bytecode size was drastically reduced and
performance went up substantially.
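As a hedged illustration of the outlining point (the names below are hypothetical, not JRuby's actual code): instead of emitting, say, range-clamping logic inline at every generated call site, the compiler emits one call to a shared static helper, keeping each generated method body small:

```java
// Hypothetical sketch: a compiler needs "coerce a long to an int, clamped
// to a range" at many generated call sites. Emitting the branches inline
// at each site bloats every method body; outlining them into one static
// helper keeps generated methods small and under inlining size thresholds.
public class RuntimeHelpers {
    // The shared, outlined helper: generated code emits a single
    // invokestatic here instead of repeating the branch logic.
    public static int clampToRange(long value, int min, int max) {
        if (value < min) return min;
        if (value > max) return max;
        return (int) value;
    }

    // What a generated method body reduces to: one call, not a dozen
    // bytecodes of comparisons and jumps.
    public static int generatedMethodBody(long input) {
        return clampToRange(input, 0, 100);
    }
}
```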
> Is it worthwhile to recognize tail calls and turn them into GOTOs, or is that another thing the JIT
> will take care of? What about loop unrolling? Branch ordering?
TCO is apparently "done" in the MLVM patchset, but until then avoiding
the recursive call can definitely help.
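For example, until TCO lands, a compiler can rewrite a self-tail-call into a loop itself. A sketch of the transformation (not MLVM's implementation):

```java
public class TailCalls {
    // Tail-recursive form: each call consumes a stack frame.
    static long sumRec(long n, long acc) {
        if (n == 0) return acc;
        return sumRec(n - 1, acc + n); // call in tail position
    }

    // Same function after the tail call is turned into a back-branch
    // (a GOTO in bytecode terms): constant stack usage, no call overhead.
    static long sumLoop(long n, long acc) {
        while (n != 0) {
            acc = acc + n;
            n = n - 1;
        }
        return acc;
    }
}
```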
> And even if the JIT takes care of things, there's presumably a cost for having it do them -- is it
> worth applying the optimizations simply to save the JIT the effort and let it focus on more involved
> work?
The number one way to "help" the JIT is to keep code bodies as small as
possible...beyond any other technique, this is the one I recommend.
I can elaborate on specific techniques if you like. We've gone through a
lot of experimentation on JRuby.
- Charlie
> What optimizing efforts are worthwhile on generated bytecode given the optimizing capabilities of
> the JIT? I'm having trouble finding an explicit or definitive list of optimizations, but trying to
> be clever about method inlining is pretty clearly wasted effort. I can't find any documentation on
> it, but I'm assuming the JIT also takes care of constant folding, strength reduction, and dead code
> removal, since they're so straightforward. Or am I wrong, and the JIT assumes the compiler author
> applied those filters already?
Other people can probably give you nitty-gritty details. The main
rule, I think, is "Generate dumb code: specifically, generate what a
fairly unclever Java programmer would write, as compiled by a
straightforward and unclever Java compiler." That's what the JIT will
have been mostly tested on. The cleverer you or your compiler are,
the more likely you are to hit a case the JIT can't handle well.
--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures
Looking at this same statement from a different angle: write bytecode
you'd like to maintain if it had started out as Java code.
And as far as the "clever" thing goes, I can promise you that's true. A
number of languages (including JRuby) have tried to work around
"finally" being inlined everywhere by jumping to it. Unfortunately it
turns out that the clever version is really hard to optimize in the best
case (exceptional and non-exceptional paths sharing the same code) and
can crash some JITs in the worst case (we saw crashes on both Java 5 and
Java 6 client compilers until we fixed this). Since javac never emits
this "clever" version, JVMs aren't tested or optimized for it.
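To illustrate what "inlined everywhere" means: javac copies the finally body onto every exit path rather than emitting it once and jumping to it (the old JSR/RET idiom). A sketch, with the duplication described in comments (exact bytecode varies by compiler version):

```java
// javac compiles this method by duplicating the finally body on each exit
// path: once after the normal return value is computed, once on the catch
// path, and once on the uncaught-exception path. The "clever" alternative
// of sharing one copy via a jump is what the post above warns against.
public class FinallyDemo {
    static int counter = 0;

    static int withFinally(boolean fail) {
        try {
            if (fail) throw new RuntimeException("boom");
            return 1;
        } catch (RuntimeException e) {
            return -1;
        } finally {
            counter++; // emitted once per exit path in the bytecode
        }
    }
}
```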
- Charlie
> Given that you are generating code, do you generate java directly or
> generate an AST?
From an AST, I suppose is the right answer. But it's a Lisp-family
language, so the surface syntax is already a tree -- I just keep
transforming the tree until its semantics are those of Java.
> If you generate an AST, do you unparse that into java and run javac on it or
> do you generate bytecode directly from it?
I generate Java from it. Javac is certainly appropriate, though one
could use gcj or janino, too. (No generics in the generated code.)
> If you generate bytecode directly from it, what library or libraries do you
> use?
N/A
> - General reduction in bytecode size. Simply having less bytecode seems
> to have a large effect on how well Hotspot is able to consume the code.
> There are also size limitations on inlining code, so less bytecode often
> means more inlining.
> - Reduction in numbers/complexity of branches. This also plays into
> total bytecode size, but I've seen improvements from simply flipping
> loops around or calculating jump conditions in aggregate before making a
> single jump.
> - Outline as much code as humanly possible (as opposed to inlining).
> JRuby's compiler originally just emitted all logic straight into the
> method body. This turned out pretty badly; it was very slow, and there
> was a tremendous amount of duplication. By pulling as much as possible
> into static utility methods, bytecode size was drastically reduced and
> performance went up substantially.
Thanks, Charlie. Those are great rules of thumb. They probably
belong somewhere on the internals wiki, despite the warnings that it
is not a tuning document.
Branch-to-branch doesn't matter much for HotSpot, since it uses SSA
and a sea-of-nodes IR. Recently, we have added a pre-pass to make sure
the IR is generated in RPO (reverse postorder), regardless of concrete
bytecode structure. This makes the initial JIT steps more strongly
normalizing, hence more reliably optimizing. I imagine the other JVMs
do similar "due diligence" on their JIT inputs.
For language runtimes the JVMs should supply more hooks for affecting
JIT-level inlining, maybe something as simple (in HotSpot's case) as
@sun.misc.Inline. I'm not aware of a credible effort to standardize
on it, though.
Branch-free idioms (as long as they are simple and do not greatly
expand bytecode size) are helpful. The Hotspot JIT is pretty good at
turning simple control flow into branch-free code when possible, but
in very simple cases it is more reliable to write it in Java code.
The main problem is that bytecodes cannot express a conditional move
directly; the best you can do is emit the control flow for
x = p ? x1 : x2. The important thing for encouraging conditional moves
is that the predicate p and one or both of the conditional values x1,
x2 should be free of side effects and exceptions -- constants or
variable references, for example.
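A sketch of the shapes being described (the class and method names are invented for illustration): a side-effect-free ternary gives the JIT its best chance at a conditional move, and for very simple cases the branch can be eliminated in the source itself:

```java
public class CondMove {
    // Both arms are plain variable references with no side effects or
    // exceptions, so the JIT is free to compile this to a conditional
    // move rather than a branch.
    static int maxCmovFriendly(int a, int b) {
        return (a > b) ? a : b;
    }

    // A fully branch-free variant written directly in the source, using
    // the sign bit as a mask. Note the caveat about bytecode size: this
    // is more bytecode than the ternary, and (a - b) can overflow for
    // inputs of opposite sign near the int extremes.
    static int maxBranchFree(int a, int b) {
        int diff = a - b;
        int mask = diff >> 31;        // all ones if a < b, else all zeros
        return a - (diff & mask);     // a if a >= b, otherwise b
    }
}
```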
Always use -XX:+PrintAssembly to see what you are doing.
BTW, +PrintCompilation is a very *old* flag; +LogCompilation is
slightly less old. The newest thing here, probably, is that
+PrintAssembly is now a diagnostic switch, which means it works for
product builds (after being unlocked). You need a disassembler
plugin, and I'm pleased to announce that this is now second-sourced
(x86/32 only at present) at http://kenai.com/projects/base-hsdis/ .
You may still need to do a build, but hopefully the extra option will
make it easier for some.
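Concretely, the unlock incantation looks like this (a sketch; the hsdis disassembler plugin must be installed for any disassembly to appear, and the class name PrintAsmDemo is made up for the example):

```java
// Run with:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly PrintAsmDemo
// With the hsdis plugin on the library path, HotSpot dumps the generated
// machine code for each method it compiles.
public class PrintAsmDemo {
    static int hot(int x) {
        return x * x + 1;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm up so the method actually gets JIT-compiled and disassembled.
        for (int i = 0; i < 100_000; i++) sink += hot(i);
        System.out.println(sink);
    }
}
```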
Best,
-- John
If I could simply tell hotspot to inline through all my dyncall
plumbing, it would save me a world of pain trying to bring unique code
all the way back to the original call site. Of course invokedynamic has
better ways around this, like TCO, but for now my plumbing totally
defeats inlining (as discussed in Cliff Click's presentation last fall).
In theory if I could tell Hotspot to immediately inline my
"CachingCallSite" logic everywhere it was encountered, I'd be that much
closer to getting the method handles to inline, wouldn't I?
BTW, I did also try making CachingCallSite's logic static methods that
manipulate a common data structure. It did not help.
The problem I run into now trying to get dyncalls to inline is that it
usually adds so much extra bytecode to the calling method (because the
dyncall plumbing is complex) that it actually slows things down more
than it speeds them up :(
- Charlie