I'll start.
The most recent discovery for us is that -XX:NewRatio=1 performs better
than the default of 2 on server and 8 on client VMs. Because Ruby is, as
a rule, a lot more object-intensive (it has no unboxed numeric types,
for example), we really do need a larger young generation. Once
transient objects get promoted to the older generations the game is
over; we're spending all our time collecting Fixnums instead of working.
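To make the allocation pressure concrete, here's a pure-Java toy illustration (my own example, not JRuby code): boxed arithmetic allocates a fresh short-lived object on nearly every step, which is roughly what a dynamic language with no unboxed numerics does for every Fixnum operation.

```java
public class BoxingChurn {
    public static void main(String[] args) {
        // Boxed sum: each iteration above the Integer cache (-128..127)
        // allocates a new short-lived Integer, mimicking dynlang numerics.
        Integer boxed = 0;
        for (int i = 0; i < 1_000_000; i++) {
            boxed = boxed + 1; // unbox, add, re-box: one garbage object per lap
        }
        System.out.println(boxed); // -> 1000000
    }
}
```

All of those transient Integers are exactly the kind of garbage you want to die in a large young generation rather than be promoted.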
I posted this to hotspot-dev (Paul Hohensee answered but redirected me
to hotspot-gc-use) and was told to also consider two additional flags
that give the young generation more wiggle room:
MaxTenuringThreshold (0-15): the number of times an object must survive
young-gen collections before being promoted; I'm not sure about the
defaults on client and server VMs or across collectors.
SurvivorRatio: lower values increase the size of the survivor spaces,
allowing more objects to "just die" in the young generation instead of
being promoted.
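For what it's worth, you can ask a HotSpot VM what its current values for these flags are before overriding them. A small sketch using the HotSpot diagnostic MXBean (Sun/Oracle HotSpot only; the flag names are HotSpot-specific):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;

public class GcDefaults {
    // Prints the current values of the young-gen tuning flags discussed
    // above, along with where each value came from (default vs. command line).
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : new String[] {
                "NewRatio", "MaxTenuringThreshold", "SurvivorRatio" }) {
            VMOption opt = bean.getVMOption(flag);
            System.out.println(flag + " = " + opt.getValue()
                + " (origin: " + opt.getOrigin() + ")");
        }
    }
}
```

Running this under -client, -server, and different collectors is a quick way to see how the defaults actually vary.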
I have not tried either of these yet. Anyone else? Any other flags or
defaults you've found helpful for performance in your dynlang apps?
Here's another from me, which I've posted about previously:
-Xbootclasspath and friends. JRuby is a Ruby implementation, which
means it needs to perform reasonably well as a command-line tool.
Typical JVM startup precludes fast command-line startup, but it turns
out a large portion of that time is spent verifying bytecode.
bootclasspath allows JRuby and its dependencies to skip verification,
which in our case improved startup time almost 3X, putting it
comfortably under 0.5s on OS X. That was a *huge* find, akin to a silver
bullet for startup speed.
- Charlie
Another recommended flag from Peter Kessler is -Xmn, which explicitly
sets the size of the young generation. Obviously this requires a bit
more knowledge about actual heap size than NewRatio, but it allows us to
do better than half the heap for the young generation. He also had some
suggestions on how to study memory use, which I've copied verbatim below:
"
Why limit yourself to NewRatio? The best you can get that way is half
the heap for the young generation. If you really want a big young
generation (to give your temporary objects time to die without even
being looked at by the collector), use -Xmn (or -XX:NewSize= and
-XX:MaxNewSize=) to set it directly. Figure out what your live data
size is and use that as the base size for the old generation. Then
figure out what kinds of pauses the young generation collections impose,
and how much they promote, then amortize the eventual old generation
collection time over as many young generation collections as you can
give space to in the old generation. Then make your total heap (-Xmx)
as big as you can afford to get as big a young generation as that will
allow.
"
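To make Kessler's recipe concrete, here's a back-of-the-envelope sizing sketch. The numbers are entirely hypothetical (the live-data size and old-gen headroom factor are made up; measure your own application):

```java
public class HeapSizing {
    public static void main(String[] args) {
        // Hypothetical measurements -- substitute your own numbers.
        long liveDataMb = 256;   // live data measured after a full GC
        long youngGenMb = 768;   // young gen big enough for transients to die

        // Old-gen base = live data size, plus headroom (2x here, a guess)
        // to amortize full collections over many young collections.
        long oldGenMb = liveDataMb * 2;

        // Total heap = old gen + young gen, set with -Xmx; young gen
        // itself is set directly with -Xmn.
        long totalMb = oldGenMb + youngGenMb;
        System.out.println("-Xmn" + youngGenMb + "m -Xmx" + totalMb + "m");
        // -> -Xmn768m -Xmx1280m
    }
}
```

The point of doing it this way rather than via NewRatio is that the young generation can be larger than half the heap.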
Again, I've had no time to look into this. My response pretty well sums
up what I think this round of emails should involve:
"
I guess there's really two tasks I'm
looking into right now:
1. Discovering appropriate flags to tweak and "more appropriate"
defaults dynlangs might want to try
2. Exploring real-world dynlang applications and loads to refine those
better/best-guess defaults
I'd say this round of emails is focused mostly on 1, since 2 is going to
vary more across languages. And I think we can only start to explore 2
once we know 1.
"
- Charlie
> Here's another from me, which I've posted about previously:
> -Xbootclasspath and friends. JRuby is a Ruby implementation, which
> means it needs to perform reasonably well as a command-line tool.
> Typical JVM startup precludes fast command-line startup, but it turns
> out a large portion of that time is spent verifying bytecode.
> bootclasspath allows JRuby and its dependencies to skip verification,
> which in our case improved startup time almost 3X, putting it
> comfortably under 0.5s on OS X. That was a *huge* find, akin to a silver
> bullet for startup speed.
That's an excellent hacque for *any* non-Java language on the JVM,
dynamic or not. All such languages are going to have core libraries
that are as important to them as java.lang.*, so they need the benefit
of the fast startup.
>
> - Charlie
>
--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures
In our case it was a doubly nice find, since we generate bytecode at
runtime and can't afford to run with -Xverify:none. bootclasspath
allowed us to target verification-skipping at just the code we'd already
verified a bazillion times during our day-to-day test runs.
Of course I understand it didn't help Groovy as much when they tried it,
probably because we load many, many more tiny classes to act as method
invokers, and because Groovy uses reflection for everything. Reflection
is, on the whole, a much, much larger time sink on startup than
verification, so bootclasspath helped us a lot and Groovy very little.
- Charlie
Can you describe how to do that and exactly what it is? I'm not familiar
with it.
- Charlie
I use ASM, which is able to generate the StackMap by passing the
COMPUTE_FRAMES flag when creating a ClassWriter.
The algorithm ASM uses is described here:
http://asm.objectweb.org/doc/developer-guide.html#controlflow
Unlike JRuby, I don't generate code at runtime, so I have no idea about
the overhead introduced by enabling stack map computation.
> - Charlie
>
Rémi
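For reference, the ClassWriter usage Rémi describes looks like this. A minimal sketch, assuming the ASM library is on the classpath; the class and method generated here are just a made-up example:

```java
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import static org.objectweb.asm.Opcodes.*;

public class FramesDemo {
    // Generates a trivial class, asking ASM to compute stack map frames.
    public static byte[] generate() {
        // COMPUTE_FRAMES makes ASM compute the StackMapTable itself, at
        // the cost of extra type-merging work during class writing.
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
        cw.visit(V1_6, ACC_PUBLIC, "Demo", null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(
            ACC_PUBLIC | ACC_STATIC, "answer", "()I", null, null);
        mv.visitCode();
        mv.visitIntInsn(BIPUSH, 42);
        mv.visitInsn(IRETURN);
        mv.visitMaxs(0, 0); // sizes are recomputed under COMPUTE_FRAMES
        mv.visitEnd();
        cw.visitEnd();
        return cw.toByteArray();
    }
}
```

COMPUTE_FRAMES implies COMPUTE_MAXS, which is why the visitMaxs arguments are ignored.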
Java 1.6 (JSR 202) introduces a new verifier, the split verifier
(technically this technology comes from J2ME).
So if I follow you correctly, COMPUTE_FRAMES should not be used for
runtime code unless we're willing to swallow it being much more costly
than just allowing the normal verification process to handle that code
at runtime. That's good information; we don't generate code frequently
enough at runtime to have noticed a performance bottleneck in
classloading, but every little bit helps; I'll probably turn this off
for JITted methods generated at runtime.
- Charlie
Ahh ok, THAT StackMap. Yes, I'm using that for all code we generate,
including code at runtime, but it sounds like doing it at runtime may
actually be slower than letting the normal verifier run. I'll probably
be turning it off in that case.
- Charlie
Rémi
I think the case for us isn't particularly dire. Even the largest Ruby
methods will only generate several thousand bytes of bytecode, and the
verification cost is going to pale in comparison to all the Ruby
overhead. So it won't really make a huge difference either way.
Under Java 7 anonymous classloading, however, we could be spending a lot
more time generating and regenerating code. So we'll eventually need to
look into the real cost.
> Another thing I should mention about StackMap calculation in ASM is
> that ClassWriter by default loads classes to determine whether two
> given types have a common supertype. When transforming existing
> bytecode from within a custom class loader or Java agent, this may
> cause side effects and unexpected classloading. In that case
> ClassWriter.getCommonSuperClass() needs to be overridden and
> reimplemented without using Class.forName(..); e.g. see
> org.objectweb.asm.ClassWriterComputeFramesTest, which has a naive,
> non-caching implementation that uses ASM to extract the required
> information from classes.
Ahh, that's interesting, thanks for that. For the most part JRuby's code
generation has been designed to work as little islands, with no compiled
script having references to any code outside JRuby proper. So I think
that's probably not going to be an issue, but I'll keep it in mind as we
improve and advance JRuby's compiler.
- Charlie
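The override described above could look like the following. A sketch, again assuming ASM on the classpath; the blanket java/lang/Object fallback is the simplest possible reimplementation, and is only sound when the generated code never invokes methods on a merged reference (a real implementation would parse class files with ASM to walk the superclass chain):

```java
import org.objectweb.asm.ClassWriter;

public class AgentSafeWriter extends ClassWriter {
    public AgentSafeWriter() {
        super(ClassWriter.COMPUTE_FRAMES);
    }

    // Avoid the default Class.forName(..)-based lookup, which can trigger
    // unwanted classloading inside an agent or custom class loader.
    @Override
    protected String getCommonSuperClass(String type1, String type2) {
        // Naive fallback: java/lang/Object is a supertype of every
        // reference type, so merged frame slots are simply widened.
        return "java/lang/Object";
    }
}
```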