Coming up with better defaults on HotSpot


Charles Oliver Nutter

Sep 2, 2008, 4:03:19 PM
to jvm-la...@googlegroups.com
It's becoming more and more apparent to me that HotSpot's defaults,
while probably great for typical Java applications, are not so great for
e.g. dynamic languages running on the JVM. I think it's time that all
the various dynlang implementers pool their shared knowledge of what
flags need to be tweaked for better performance, so we can form a better
picture of what a fast dynlang hotspot should be.

I'll start.

The most recent discovery for us is that -XX:NewRatio=1 performs better
than the default 2 on server and 8 on client VMs. Because Ruby is as a
rule a lot more object-intensive (having no unboxed numeric types, for
example) we really do need a larger young generation. Once transient
objects get promoted to older generations the game is over; we're
spending all our time collecting Fixnums instead of working.

I posted this to hotspot-dev (Paul Hohensee answered but redirected me
to hotspot-gc-use) and was told to also consider two additional flags
that give the young generation more wiggle room:

MaxTenuringThreshold=0-15 ; the number of times an object must survive
young gen collections to be promoted; not sure about defaults on client
and server or across collectors

SurvivorRatio ; Lower values increase the size of the survivor space,
allowing more objects to "just die" in the young generation instead of
being promoted.

I have not tried either of these yet. Anyone else? Any other flags or
defaults you've found helpful for performance in your dynlang apps?
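For anyone who wants to check what values their particular VM is actually running with for flags like these, later HotSpot JDKs (7+) expose them through the HotSpot-specific diagnostic MXBean. A small sketch (the class name is mine, for illustration):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class FlagDefaults {
    public static void main(String[] args) {
        // HotSpot-only: other JVMs won't provide this bean.
        HotSpotDiagnosticMXBean hs = ManagementFactory
                .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Print the values this VM is actually running with, so tuned
        // settings can be compared against the defaults.
        for (String flag : new String[] {
                "NewRatio", "MaxTenuringThreshold", "SurvivorRatio"}) {
            System.out.println(flag + " = " + hs.getVMOption(flag).getValue());
        }
    }
}
```

Running with -XX:+PrintFlagsFinal gives the same information from the command line without any code.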

Here's another from me, which I've posted about previously:
-Xbootclasspath and friends. JRuby is a Ruby implementation, which
means it needs to perform reasonably well as a command-line tool.
Typical JVM startup precludes fast command-line startup, but it turns
out a large portion of that time is spent verifying bytecode.
bootclasspath allows JRuby and its dependencies to skip verification,
which in our case improved startup time almost 3X, putting it
comfortably under 0.5s on OS X. That was a *huge* find, akin to a silver
bullet for startup speed.

- Charlie

Charles Oliver Nutter

Sep 2, 2008, 4:41:42 PM
to jvm-la...@googlegroups.com
Charles Oliver Nutter wrote:
> It's becoming more and more apparent to me that HotSpot's defaults,
> while probably great for typical Java applications, are not so great for
> e.g. dynamic languages running on the JVM. I think it's time that all
> the various dynlang implementers pool their shared knowledge of what
> flags need to be tweaked for better performance, so we can form a better
> picture of what a fast dynlang hotspot should be.
>
> I'll start.
>
> The most recent discovery for us is that -XX:NewRatio=1 performs better
> than the default 2 on server and 8 on client VMs. Because Ruby is as a
> rule a lot more object-intensive (having no unboxed numeric types, for
> example) we really do need a larger young generation. Once transient
> objects get promoted to older generations the game is over; we're
> spending all our time collecting Fixnums instead of working.

Another recommended flag from Peter Kessler is -Xmn, which explicitly
sets the size of the young generation. Obviously this requires a bit
more knowledge about actual heap size than NewRatio, but it allows us to
do better than half the heap for the young generation. He also had some
suggestions on how to study memory use, which I've copied verbatim below:

"
Why limit yourself to NewRatio? The best you can get that way is half
the heap for the young generation. If you really want a big young
generation (to give your temporary objects time to die without even
being looked at by the collector), use -Xmn (or -XX:NewSize= and
-XX:MaxNewSize=) to set it directly. Figure out what your live data
size is and use that as the base size for the old generation. Then
figure out what kinds of pauses the young generation collections impose,
and how much they promote, then amortize the eventual old generation
collection time over as many young generation collections as you can
give space to in the old generation. Then make your total heap (-Xmx)
as big as you can afford to get as big a young generation as that will
allow.
"
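To start gathering the numbers Peter describes (live data size, how much the young collections promote), the java.lang.management API is one JDK-only option. A rough sketch (class name mine; pool names vary by collector):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapSurvey {
    public static void main(String[] args) {
        // Print each memory pool's current usage so you can see how the
        // heap is split between the young and old generations.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("%-25s used=%,d committed=%,d max=%,d%n",
                    pool.getName(), u.getUsed(), u.getCommitted(), u.getMax());
        }
        // Collection counts and times show how often each collector runs
        // and how long it pauses; promotion-heavy workloads show up as
        // old-generation collector activity.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-25s collections=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Sampling this after the app reaches steady state, right after a full collection, gives a decent estimate of live data size for sizing the old generation.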

Again, I've had no time to look into this. My response pretty well sums
up what I think this round of emails should involve:

"
I guess there are really two tasks I'm
looking into right now:

1. Discovering appropriate flags to tweak and "more appropriate"
defaults dynlangs might want to try
2. Exploring real-world dynlang applications and loads to refine those
better/best-guess defaults

I'd say this round of emails is focused mostly on 1, since 2 is going to
vary more across languages. And I think we can only start to explore 2
iff we know 1.
"

- Charlie

John Cowan

Sep 2, 2008, 4:47:17 PM
to jvm-la...@googlegroups.com
On Tue, Sep 2, 2008 at 4:03 PM, Charles Oliver Nutter
<charles...@sun.com> wrote:

> Here's another from me, which I've posted about previously:
> -Xbootclasspath and friends. JRuby is a Ruby implementation, which
> means it needs to perform reasonably well as a command-line tool.
> Typical JVM startup precludes fast command-line startup, but it turns
> out a large portion of that time is spent verifying bytecode.
> bootclasspath allows JRuby and its dependencies to skip verification,
> which in our case improved startup time almost 3X, putting it
> comfortably under 0.5s on OS X. That was a *huge* find, akin to a silver
> bullet for startup speed.

That's an excellent hacque for *any* non-Java language on the JVM,
dynamic or not. All such languages are going to have core libraries
that are as important to them as java.lang.*, so they need the benefit
of the fast startup.
>
> - Charlie

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

Charles Oliver Nutter

Sep 2, 2008, 4:52:27 PM
to jvm-la...@googlegroups.com
John Cowan wrote:
> On Tue, Sep 2, 2008 at 4:03 PM, Charles Oliver Nutter
> <charles...@sun.com> wrote:
>
>> Here's another from me, which I've posted about previously:
>> -Xbootclasspath and friends. JRuby is a Ruby implementation, which
>> means it needs to perform reasonably well as a command-line tool.
>> Typical JVM startup precludes fast command-line startup, but it turns
>> out a large portion of that time is spent verifying bytecode.
>> bootclasspath allows JRuby and its dependencies to skip verification,
>> which in our case improved startup time almost 3X, putting it
>> comfortably under 0.5s on OS X. That was a *huge* find, akin to a silver
>> bullet for startup speed.
>
> That's an excellent hacque for *any* non-Java language on the JVM,
> dynamic or not. All such languages are going to have core libraries
> that are as important to them as java.lang.*, so they need the benefit
> of the fast startup.

In our case it was a doubly nice find, since we generate bytecode at
runtime and can't afford to use -Xverify:none. bootclasspath allowed us to
target verification-skipping at just the code we'd verified a bazillion
times during our day-to-day test runs.

Of course I understand it didn't help Groovy as much when they tried it,
probably because we load many, many more tiny classes to act as method
invokers, and because Groovy uses reflection for everything. Reflection
is, on the whole, a much, much larger time sink on startup than
verification, so bootclasspath helped us a lot and Groovy very little.

- Charlie

Rémi Forax

Sep 2, 2008, 6:17:26 PM
to jvm-la...@googlegroups.com
Charles Oliver Nutter wrote:

> John Cowan wrote:
>
>> On Tue, Sep 2, 2008 at 4:03 PM, Charles Oliver Nutter
>> <charles...@sun.com> wrote:
>>
>>
>>> Here's another from me, which I've posted about previously:
>>> -Xbootclasspath and friends. JRuby is a Ruby implementation, which
>>> means it needs to perform reasonably well as a command-line tool.
>>> Typical JVM startup precludes fast command-line startup, but it turns
>>> out a large portion of that time is spent verifying bytecode.
>>> bootclasspath allows JRuby and its dependencies to skip verification,
>>> which in our case improved startup time almost 3X, putting it
>>> comfortably under 0.5s on OS X. That was a *huge* find, akin to a silver
>>> bullet for startup speed.
>>>
>> That's an excellent hacque for *any* non-Java language on the JVM,
>> dynamic or not. All such languages are going to have core libraries
>> that are as important to them as java.lang.*, so they need the benefit
>> of the fast startup.
>>
>
> In our case it was a doubly nice find, since we generate bytecode at
> runtime and can't afford to use -Xverify:none. bootclasspath allowed us to
> target verification-skipping at just the code we'd verified a bazillion
> times during our day-to-day test runs.
>
I have seen an impressive boost from generating StackMap info,
particularly when the generated code uses overlapping exception handlers.

> Of course I understand it didn't help Groovy as much when they tried it,
> probably because we load many, many more tiny classes to act as method
> invokers, and because Groovy uses reflection for everything. Reflection
> is, on the whole, a much, much larger time sink on startup than
> verification, so bootclasspath helped us a lot and Groovy very little.
>
> - Charlie
>
Rémi

Charles Oliver Nutter

Sep 2, 2008, 6:58:27 PM
to jvm-la...@googlegroups.com
Rémi Forax wrote:
> I have seen an impressive boost from generating StackMap info,
> particularly when the generated code uses overlapping exception handlers.

Can you describe how to do that and exactly what it is? I'm not familiar
with it.

- Charlie

Rémi Forax

Sep 2, 2008, 7:30:17 PM
to jvm-la...@googlegroups.com
Charles Oliver Nutter wrote:
Java 1.6 (JSR 202) introduces a new verifier, the split verifier
(technically this technology comes from J2ME), which is able to verify
the bytecode in one pass using information stored in a special
attribute named StackMap.
https://jdk.dev.java.net/verifier.html

I use ASM, which is able to generate the StackMap using the
COMPUTE_FRAMES flag when creating a ClassWriter.
The algorithm used by ASM is described here:
http://asm.objectweb.org/doc/developer-guide.html#controlflow

Unlike JRuby, I don't generate code at runtime, so
I have no idea about the overhead introduced by enabling stack map
computation.
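For anyone who wants to see this attribute without pulling in ASM: in Java 6+ class files it appears as an attribute named StackMapTable, and that name is stored as a constant-pool UTF-8 string, so a plain byte scan of any compiled class will find it. A quick JDK-only sketch (class and helper names are mine; javac emits these frames for any method containing a branch):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StackMapCheck {
    // A branch forces javac to emit a stack map frame at the join point.
    static int branchy(int x) {
        if (x > 0) { return x; }
        return -x;
    }

    public static void main(String[] args) throws IOException {
        byte[] cls = readBytes(StackMapCheck.class
                .getResourceAsStream("StackMapCheck.class"));
        // The attribute name lives in the constant pool as a UTF-8 string,
        // so a naive byte scan is enough to detect its presence.
        System.out.println("has StackMapTable: " + contains(cls, "StackMapTable"));
    }

    static byte[] readBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        return out.toByteArray();
    }

    static boolean contains(byte[] haystack, String needle) {
        byte[] n = needle.getBytes();
        outer:
        for (int i = 0; i <= haystack.length - n.length; i++) {
            for (int j = 0; j < n.length; j++)
                if (haystack[i + j] != n[j]) continue outer;
            return true;
        }
        return false;
    }
}
```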

> - Charlie
>
Rémi

John Rose

Sep 2, 2008, 7:55:22 PM
to jvm-la...@googlegroups.com
On Sep 2, 2008, at 4:30 PM, Rémi Forax wrote:

> Java 1.6 (JSR 202) introduces a new verifier, the split verifier
> (technically this technology comes from J2ME)

It comes from J2ME but we compressed the format when we adopted it... I was fresh off the Pack200 project.  (An oddity of history: The ME version has much more slack in it.)

-- John

Eugene Kuleshov

Sep 3, 2008, 1:35:50 AM
to JVM Languages
Rémi Forax wrote:

> > Can you describe how to do that and exactly what it is? I'm not familiar
> > with it.

> Java 1.6 (JSR 202) introduces a new verifier, the split verifier
> (technically this technology comes from J2ME), which is able to verify
> the bytecode in one pass using information stored in a special
> attribute named StackMap.
> https://jdk.dev.java.net/verifier.html
>
> I use ASM, which is able to generate the StackMap using the
> COMPUTE_FRAMES flag when creating a ClassWriter.
> The algorithm used by ASM is described here:
> http://asm.objectweb.org/doc/developer-guide.html#controlflow
>
> Unlike JRuby, I don't generate code at runtime so
> I have no idea about the overhead introduced by enabling stack map
> computation.

In my subjective tests, the StackMap calculation overhead is much
higher than the performance gain you'll get from the new verifier. This
is actually by design of the new verifier: the pre-verification step,
when the StackMap is generated, runs offline and can take a long time,
but verification at run time is really quick because it uses the
pre-calculated StackMap structures.
The only way to avoid that is to not use COMPUTE_FRAMES (skip the data
flow analysis algorithm) and generate the StackMap together with the
bytecode, if you know your code structure and type hierarchies.

regards,
Eugene

Charles Oliver Nutter

Sep 3, 2008, 3:21:38 AM
to jvm-la...@googlegroups.com
Eugene Kuleshov wrote:
> In my subjective tests, the StackMap calculation overhead is much
> higher than the performance gain you'll get from the new verifier. This
> is actually by design of the new verifier: the pre-verification step,
> when the StackMap is generated, runs offline and can take a long time,
> but verification at run time is really quick because it uses the
> pre-calculated StackMap structures.
> The only way to avoid that is to not use COMPUTE_FRAMES (skip the data
> flow analysis algorithm) and generate the StackMap together with the
> bytecode, if you know your code structure and type hierarchies.

So if I follow you correctly, COMPUTE_FRAMES should not be used for
runtime code unless we're willing to swallow it being much more costly
than just allowing the normal verification process to handle that code
at runtime. That's good information; we don't generate code frequently
enough at runtime to have noticed a performance bottleneck in
classloading, but every little bit helps; I'll probably turn this off
for JITted methods generated at runtime.

- Charlie

Charles Oliver Nutter

Sep 3, 2008, 3:22:38 AM
to jvm-la...@googlegroups.com
Rémi Forax wrote:
> I use ASM, which is able to generate the StackMap using the
> COMPUTE_FRAMES flag when creating a ClassWriter.
> The algorithm used by ASM is described here:
> http://asm.objectweb.org/doc/developer-guide.html#controlflow
>
> Unlike JRuby, I don't generate code at runtime so
> I have no idea about the overhead introduced by enabling stack map
> computation.

Ahh ok, THAT StackMap. Yes, I'm using that for all code we generate,
including code at runtime, but it sounds like doing it at runtime may
actually be slower than letting the normal verifier run. I'll probably
be turning it off in that case.

- Charlie

Rémi Forax

Sep 3, 2008, 5:11:13 AM
to jvm-la...@googlegroups.com
Charles Oliver Nutter wrote:
> Eugene Kuleshov wrote:
>
>> In my subjective tests, the StackMap calculation overhead is much
>> higher than the performance gain you'll get from the new verifier. This
>> is actually by design of the new verifier: the pre-verification step,
>> when the StackMap is generated, runs offline and can take a long time,
>> but verification at run time is really quick because it uses the
>> pre-calculated StackMap structures.
>> The only way to avoid that is to not use COMPUTE_FRAMES (skip the data
>> flow analysis algorithm) and generate the StackMap together with the
>> bytecode, if you know your code structure and type hierarchies.
>>
Eugene, even if you generate the StackMap 'by hand', there is a problem
if there is a resizing (GOTO => GOTO_W),
because the StackMap attribute can't be updated incrementally.
In that case, ASM reparses the whole bytecode with COMPUTE_FRAMES enabled.

Rémi

Eugene Kuleshov

Sep 3, 2008, 8:00:27 AM
to JVM Languages
On Sep 3, 5:11 am, Rémi Forax <fo...@univ-mlv.fr> wrote:

> Eugene, even if you generate the StackMap 'by hand', there is a problem
> if there is a resizing (GOTO => GOTO_W),
> because the StackMap attribute can't be updated incrementally.
> In that case, ASM reparses the whole bytecode with COMPUTE_FRAMES enabled.

Good point. Though the ASM developer guide says the StackMap
can't be resized in those cases [1], I still wonder whether it is
actually possible. The effect on the control flow graph is reasonably
isolated, and it may work if unpacked StackMap frames are used
(similarly to LocalVariablesSorter).

regards,
Eugene

[1] http://asm.objectweb.org/doc/developer-guide.html#codeattrs

Eugene Kuleshov

Sep 3, 2008, 8:46:20 AM
to JVM Languages
On Sep 3, 3:21 am, Charles Oliver Nutter <charles.nut...@sun.com>
wrote:

> So if I follow you correctly, COMPUTE_FRAMES should not be used for
> runtime code unless we're willing to swallow it being much more costly
> than just allowing the normal verification process to handle that code
> at runtime. That's good information; we don't generate code frequently
> enough at runtime to have noticed a performance bottleneck in
> classloading, but every little bit helps; I'll probably turn this off
> for JITted methods generated at runtime.

Correct. However, it is just my observation, and it may also depend on
what methods are being analyzed. I've been testing computation of the
StackMap when transforming all classes from rt.jar, and on my machine,
with Java 6, it takes ~2.5 extra seconds to compute frames for 24,000
classes (~2 times slower than just computing max stack).
But your experience may be different. So I would recommend taking my
comment with a grain of salt and doing a quick measurement of the
performance with and without the StackMap calculation. Though to
actually see the difference you may need to generate and then load a
large number of unique classes.

Another thing I should mention about the StackMap calculation in ASM is
that ClassWriter by default loads classes to determine whether two
given types have a common supertype. When transforming existing
bytecode from within a custom class loader or a Java agent, this may
cause side effects and unexpected classloading. In that case
ClassWriter.getCommonSuperClass() needs to be overridden and
reimplemented without using Class.forName(..); e.g. see
org.objectweb.asm.ClassWriterComputeFramesTest, which has a naive
non-caching implementation that uses ASM to extract the required
information from classes.

regards,
Eugene

Charles Oliver Nutter

Sep 3, 2008, 4:36:04 PM
to jvm-la...@googlegroups.com
Eugene Kuleshov wrote:
> Correct. However, it is just my observation, and it may also depend on
> what methods are being analyzed. I've been testing computation of the
> StackMap when transforming all classes from rt.jar, and on my machine,
> with Java 6, it takes ~2.5 extra seconds to compute frames for 24,000
> classes (~2 times slower than just computing max stack).
> But your experience may be different. So I would recommend taking my
> comment with a grain of salt and doing a quick measurement of the
> performance with and without the StackMap calculation. Though to
> actually see the difference you may need to generate and then load a
> large number of unique classes.

I think the case for us isn't particularly dire. Even the largest Ruby
methods will only generate several thousand bytes of bytecode, and the
verification cost is going to pale in comparison to all the Ruby
overhead. So it won't really make a huge difference either way.

Under Java 7 anonymous classloading, however, we could be spending a lot
more time generating and regenerating code. So we'll eventually need to
look into the real cost.

> Another thing I should remind about StackMap calculation in ASM is
> that ClassWriter is by default loading classes to identify if two
> given types have common super type. When transforming existing
> bytecode from within custom class loader or Java agent, this may cause
> side effects and unexpected classloading. In that case
> ClassWriter.getCommonSuperClass() need to be overwritten and
> reimplemented without using Class.forName(..), e.g. see
> org.objectweb.asm.ClassWriterComputeFramesTest that has naive non-
> caching implementation that uses ASM to extract required information
> from classes.

Ahh, that's interesting, thanks for that. For the most part JRuby's code
generation has been designed to work as little islands, with no compiled
script having references to any code outside JRuby proper. So I think
that's probably not going to be an issue, but I'll keep it in mind as we
improve and advance JRuby's compiler.

- Charlie
