Method Deoptimizations after Warmup


Alex Rothberg

Apr 25, 2014, 1:12:41 PM
to mechanica...@googlegroups.com
Let's say I have a method foo that looks like the following:

foo(){
    a();
    b();
    c();
    ...
    p = d();
    if(p){
        e();
    }else{
        f();
    }
}

and I run foo 10k times, where in every case d() returns true. I see foo get compiled at level 4, and hopefully I see one or more of a, b, c, etc. get inlined into this compiled version of foo. I then hit a case where d returns false. I believe this causes a deoptimization, with the old version of foo being declared "not entrant" and then zombie. My question is how long does it take for foo to be re-compiled at level 4? Will methods a, b, c run with the version compiled at level 4? Or will they also run as deoptimized versions (< level 4)?

I have a similar question for baz:

baz(Type t){
    a();
    b();
    c();
    ...
    t.d();
}

In this case I run baz 10k times with a single concrete type for t. Then I hit a case where t is a subclass of Type (i.e. the d call is virtual, and I am now invoking it on a different implementation of d). What does the JVM do here as far as deoptimizations? Does it matter when the class loader loads the subclass of Type (before vs. after I run baz 10k times)?

Vitaly Davidovich

Apr 25, 2014, 3:51:55 PM
to mechanica...@googlegroups.com

For foo() I would not actually expect a deopt, since it's just a change in branch direction and the CPU should pick up on that.  The else block may be moved out of line by the JIT initially (i.e. its code is placed further away in memory), but I'd not expect that to cause recompilation (unless the real case has some additional speculative optimizations).

For baz() the JIT will install a guard for class loading to catch cases where a deopt needs to occur.  It will then recompile with an inline cache for the (now) bimorphic case.  If any further subclasses are loaded, I think it just turns this into a regular virtual call.

This is my understanding, so it may be wrong.

Sent from my phone


Harish Babu

Apr 26, 2014, 1:06:40 AM
to mechanica...@googlegroups.com

I think it is possible that foo can deopt too, if the JIT is aggressive enough and its profile information indicates that the else branch is never taken. The JIT can then install unreached traps (HotSpot calls these "uncommon traps") on that path. I believe this helps with better code generation (optimizations and register allocation). But AFAIK OpenJDK is less aggressive about this for integer comparisons, while for pointer comparisons it is aggressive enough to install unreached traps.

Gil Tene

Apr 27, 2014, 12:15:43 PM
to mechanica...@googlegroups.com
A lot of "it depends" follows. See below.


On Friday, April 25, 2014 10:12:41 AM UTC-7, Alex Rothberg wrote:
Let's say I have a method foo that looks like the following:

foo(){
    a();
    b();
    c();
    ...
    p = d();
    if(p){
        e();
    }else{
        f();
    }
}

and I run foo 10k times, where in every case d() returns true. I see foo get compiled at level 4, and hopefully I see one or more of a, b, c, etc. get inlined into this compiled version of foo. I then hit a case where d returns false. I believe this causes a deoptimization, with the old version of foo being declared "not entrant" and then zombie.

It depends on a bunch of stuff, but yes, the else path could be optimized away under certain conditions as an untaken path, guarded by a check that would de-optimize the implementation of foo() if the path is taken. The reason and value of doing this is not apparent in the code above, but imagine the code looked like this:
    ....
    p = d();
    if(p){
        e();
        g = 5;
    }else{
        f();
        g = 1000;
    }
    ...

    if (g > 10) {
       ...
    } else {
       ...
    }

Here, assuming p is true allows the optimizer to constant-propagate g=5, and completely eliminate the dead code in the second else.
 
My question is how long does it take for foo to be re-compiled at level 4?

This varies by de-optimization cause, JVM version, choice of tiered compilation mode, and other flags, but generally foo() will wait to collect more profiling data before re-optimizing, doing so once some counter threshold has been reached.
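
If you want to watch these transitions as they happen, a simple way (a sketch; YourApp is a placeholder for your own main class) is to run with compilation logging enabled:

    java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining YourApp

A "made not entrant" line shows the old compiled version of foo() being discarded, and a later line for the same method shows when it gets re-compiled.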
 
Will methods a, b, c run with the version compiled at level 4? Or will they also run as deoptimized versions (< level 4)?

If a, b, c were separately hot and got compiled in their own right, as opposed to only being inlined into foo(), the calls to them would still go to optimized versions. Otherwise, they may also be interpreted, or compiled at lower optimization levels.
 

I have a similar question for baz:

baz(Type t){
    a();
    b();
    c();
    ...
    t.d();
}

In this case I run baz 10k times with a single concrete type for t. Then I hit a case where t is a subclass of Type (i.e. the d call is virtual, and I am now invoking it on a different implementation of d). What does the JVM do here as far as deoptimizations? Does it matter when the class loader loads the subclass of Type (before vs. after I run baz 10k times)?

This one has many cases:

1. CHA (Class Hierarchy Analysis):

At the time baz() is optimized, CHA proves that there is only one method implementation of t.d() (e.g. Type.d()) in the currently loaded universe. This leads to one of two results:

1.1: t.d() becomes a direct call to Type.d(), with no virtual dispatch, and no checking on the type of t.

1.2: t.d() gets inlined into baz(), with optimizations propagating into its body, with no checking of the type of t (e.g. constant propagation and dead code removal get to gain a lot here).

The choice of whether or not to inline depends on various inlining heuristics.

In either case, the CHA observation is registered as an assumption that the optimized version of baz() depends on. Any action that would make that assumption false will de-optimize baz() before the action can occur. E.g. if a class Doop, a subclass of Type with its own implementation of d(), is later loaded, the class loader will force a de-optimization of baz() before any instances of Doop are ever instantiated (this could happen at class loading time, or later, but no Doop instance can materialize while the optimized version of baz() is "live").
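
As a concrete (and entirely hypothetical) illustration of case 1, here is a self-contained sketch; all class names are invented, and exact compilation thresholds vary by JVM version and flags. The subclass is loaded reflectively so that the single-implementation assumption holds during warmup; running with -XX:+PrintCompilation should show baz being made not entrant around the Class.forName call:

    // ChaDeoptDemo.java -- hypothetical sketch, names invented.
    public class ChaDeoptDemo {
        static class Base {
            int d() { return 1; }
        }
        static class Sub extends Base {   // not loaded until Class.forName below
            @Override int d() { return 2; }
        }

        static int baz(Base t) {
            return t.d(); // while only Base is loaded, CHA can devirtualize this
        }

        public static void main(String[] args) throws Exception {
            Base b = new Base();
            long sum = 0;
            for (int i = 0; i < 1000000; i++) {
                sum += baz(b); // warmup: baz is compiled under the CHA assumption
            }
            // Loading Sub invalidates the assumption; the optimized baz must be
            // de-optimized before any Sub instance can reach it.
            Base s = (Base) Class.forName("ChaDeoptDemo$Sub")
                    .getDeclaredConstructor().newInstance();
            for (int i = 0; i < 1000000; i++) {
                sum += baz(s);
            }
            System.out.println(sum);
        }
    }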

2. Inline caching:

If CHA cannot prove that there is only one implementation of t.d(), the virtual dispatch can still be avoided by assuming that the call site is "monomorphic" (as opposed to "megamorphic", both being variants of polymorphic), replacing it with a guarded direct call. This is known as an "inline cache" in JIT parlance, which is a confusing name to most people who haven't heard of it before, because it has nothing to do with code inlining (the thing being "cached inline" is the expected type of the object being dispatched on).

If all t.d() calls to date have been made with t.class == Doof, the optimized code will replace:

t.d(...);  // virtual call

with something that is logically equivalent to:

if (t.class == Doof) {
  Doof.d(t, ...); // direct call
} else {
  // fix this call site to become t.d() instead of direct in future invocations
  t.d(); // virtual call
}

[remember that the first argument to a virtual method is the object instance].

In HotSpot, running into a different type for t will not de-optimize the code above. Instead, it will patch the call site in place to become a regular virtual dispatch.

The direct dispatch version is faster than the virtual dispatch version because it does not depend on indirection. The value of this varies by processor. E.g. on most RISC CPUs this is a huge win, but on x86 variants the BTB (branch target buffer) provides some overlapping benefits if it can correctly predict the target PC for the virtual dispatch.

There are variants of this optimization that deal with "bi-morphic" situations as well. But their value depends on the CPU type. E.g. a bi-morphic call site can underperform compared to a hot virtual dispatch call on CPUs with good BTBs.
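
To make the monomorphic/bimorphic/megamorphic distinction concrete, here is a small hypothetical sketch (all type names invented). The same dispatch() call site moves through the three states purely based on the receiver types it actually observes at runtime, independent of how many subclasses exist:

    // MorphismDemo.java -- hypothetical sketch, names invented.
    interface Shape { int area(); }
    class Circle implements Shape { public int area() { return 3; } }
    class Square implements Shape { public int area() { return 4; } }
    class Triangle implements Shape { public int area() { return 5; } }

    public class MorphismDemo {
        // One call site; its inline-cache state depends on the receiver
        // types seen here, not on how many subclasses were declared.
        static int dispatch(Shape s) { return s.area(); }

        public static void main(String[] args) {
            Shape[][] phases = {
                { new Circle() },                               // monomorphic
                { new Circle(), new Square() },                 // bimorphic
                { new Circle(), new Square(), new Triangle() }, // megamorphic
            };
            long sum = 0;
            for (Shape[] phase : phases) {
                for (int i = 0; i < 1000000; i++) {
                    sum += dispatch(phase[i % phase.length]);
                }
            }
            System.out.println(sum);
        }
    }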

3. Guarded inlining

This is actual code inlining, guarded by a test of the type to verify that it is the one expected. In this case, if all t.d() calls to date have been made with t.class == Doof, the optimizer will replace the t.d(); call with something equivalent to:

if (t.class == Doof) {
  // inlined equivalent of a direct call to Doof.d()
} else {
  // deoptimize here
}
The choice of whether to do inline caching (2) or guarded inlining (3) depends on many things.

 

Vitaly Davidovich

Apr 26, 2014, 6:59:09 PM
to mechanica...@googlegroups.com

Gil,

Thanks for the much more comprehensive answer.  One thing I'd like to point out is that, as you say, virtual dispatch itself isn't necessarily a problem due to BTB; the real wins come from inlining virtual calls as that increases the optimization horizon for the optimizer.  I think that's what your #3 is but your logical code snippet makes it look like it's just a direct call instead of inlining (unless I misinterpreted).

Sent from my phone

Gil Tene

Apr 26, 2014, 10:18:09 PM
to <mechanical-sympathy@googlegroups.com>
Yup, the text in #3 explains it, but the logical code should probably say "// inlined equivalent of a direct call to Doof.d()" in the comment.

Sent from Gil's iPhone

Jimmy Jia

Apr 27, 2014, 11:46:55 AM
to mechanica...@googlegroups.com
That's really helpful, thank you.

Aside from the general recommendation to run warmup on things as representative of actual workload as possible, is the best bet for understanding this sort of phenomenon to look at LogCompilation output and verify that I'm not seeing unexpected de-optimizations?

And how much does ReadyNow! help in this sort of situation? Specifically, when I hit this sort of de-optimization, my entire code path takes a substantial performance hit, which manifests in a frustrating way: my code suddenly gets much slower for a while the first time anything unexpected happens.

Gil Tene

Apr 27, 2014, 12:52:41 PM
to mechanica...@googlegroups.com
On Sunday, April 27, 2014 8:46:55 AM UTC-7, Jimmy Jia wrote:
That's really helpful, thank you.

Aside from the general recommendation to run warmup on things as representative of actual workload as possible, is the best bet for understanding this sort of phenomenon to look at LogCompilation output and verify that I'm not seeing unexpected de-optimizations?

Using -XX:+PrintCompilation is usually enough to track such transitions. I also like to look at generated code. For HotSpot, JP has some good instructions for doing that in his blog entry here. [side note: On Zing, I tend to use our ZVision tool to look at hot generated code on the fly, because I find having per-instruction "hotness" indication very helpful in reading the code quickly, mostly by focusing my attention on the parts that actually need reading.]
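
For the LogCompilation output mentioned in the question (the XML that tools consume), the usual incantation is something like the following, where YourApp is a placeholder:

    java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=hotspot.log YourApp

The resulting hotspot.log records compilation and deoptimization events in XML form, including uncommon-trap entries with their reason and action.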

There are many causes for deoptimization. If you want to play with what one looks like, you can use a simple example I built (FunInABox.java) to play around with how deoptimization behavior changes and shows up in output. FunInABox is a simple test that reliably exhibits a classic case of deoptimization happening when code is optimized before some classes it may call in the future are initialized. When FunInABox is called with no parameters, it will deopt. If called with a parameter (it doesn't care about or parse what the parameter is), it will prime the relevant classes and thereby avoid the deopt. Output with -XX:+PrintCompilation looks like this:


Lumpy.local-40% 
Lumpy.local-40% java -XX:+PrintCompilation FunInABox
     77   1       java.lang.String::hashCode (64 bytes)
    109   2       sun.nio.cs.UTF_8$Decoder::decodeArrayLoop (553 bytes)
    115   3       java.math.BigInteger::mulAdd (81 bytes)
    118   4       java.math.BigInteger::multiplyToLen (219 bytes)
    121   5       java.math.BigInteger::addOne (77 bytes)
    123   6       java.math.BigInteger::squareToLen (172 bytes)
    127   7       java.math.BigInteger::primitiveLeftShift (79 bytes)
    130   8       java.math.BigInteger::montReduce (99 bytes)
    140   1%      java.math.BigInteger::multiplyToLen @ 138 (219 bytes)
Starting warmup run (will only use ThingTwo):
    147   9       sun.security.provider.SHA::implCompress (491 bytes)
    153  10       java.lang.String::charAt (33 bytes)
    154  11       FunInABox$ThingTwo::getValue (10 bytes)
    154   2%      FunInABox::testRun @ 4 (38 bytes)
    161  12       FunInABox::testRun (38 bytes)
Warmup run [1000000 iterations] took 27 msec..

...Then, out of the box
Came Thing Two and Thing One!
And they ran to us fast
They said, "How do you do?"...

Starting actual run (will start using ThingOne a bit after using ThingTwo):
   5183  12      made not entrant  FunInABox::testRun (38 bytes)
   5184   2%     made not entrant  FunInABox::testRun @ -2 (38 bytes)
   5184   3%      FunInABox::testRun @ 4 (38 bytes)
   5184  13       FunInABox$ThingOne::getValue (10 bytes)
Test run [200000000 iterations] took 1299 msec...
Lumpy.local-41% 
Lumpy.local-41% 
Lumpy.local-41% java -XX:+PrintCompilation FunInABox KeepThingsTame
     75   1       java.lang.String::hashCode (64 bytes)
    107   2       sun.nio.cs.UTF_8$Decoder::decodeArrayLoop (553 bytes)
    113   3       java.math.BigInteger::mulAdd (81 bytes)
    115   4       java.math.BigInteger::multiplyToLen (219 bytes)
    119   5       java.math.BigInteger::addOne (77 bytes)
    121   6       java.math.BigInteger::squareToLen (172 bytes)
    125   7       java.math.BigInteger::primitiveLeftShift (79 bytes)
    127   8       java.math.BigInteger::montReduce (99 bytes)
    133   1%      java.math.BigInteger::multiplyToLen @ 138 (219 bytes)
Keeping ThingOne and ThingTwo tame (by initializing them ahead of time):
Starting warmup run (will only use ThingTwo):
    140   9       sun.security.provider.SHA::implCompress (491 bytes)
    147  10       java.lang.String::charAt (33 bytes)
    147  11       FunInABox$ThingTwo::getValue (10 bytes)
    147   2%      FunInABox::testRun @ 4 (38 bytes)
    154  12       FunInABox::testRun (38 bytes)
Warmup run [1000000 iterations] took 24 msec..

...Then, out of the box
Came Thing Two and Thing One!
And they ran to us fast
They said, "How do you do?"...

Starting actual run (will start using ThingOne a bit after using ThingTwo):
   5178  13       FunInABox$ThingOne::getValue (10 bytes)
Test run [200000000 iterations] took 2164 msec...
Lumpy.local-42%  

And how much does ReadyNow! help in this sort of situation? Specifically, when I hit this sort of de-optimization, my entire code path takes a substantial performance hit, which manifests in a frustrating way: my code suddenly gets much slower for a while the first time anything unexpected happens.

The "how much" depends on why de-optimization is happening...

ReadyNow (in Zing) is focused on reducing (and in some cases eliminating) deoptimization, for exactly the reason you mention above. A simple motivating use case for reducing de-optimization is trading at market open, where seemingly warmed-up code often encounters de-optimizations and a significant temporary slowdown at the most critical time of the day. The code goes back to being fast a few seconds later, once it is re-optimized, but by then many slow trades have been executed...

ReadyNow adds optional flags that reduce de-optimizations and the execution delays they cause. For example, it adds control options that avoid the deoptimization in the untaken {if (p) ... else ...} case discussed above.  With ReadyNow, aggressive class loading can be used to help load, verify, and resolve classes earlier in a run. In addition, early initialization can be used to avoid later deopts in cases where the initializers are empty (e.g. in the FunInABox example above, classes ThingOne and ThingTwo can be safely initialized early because they have no static initialization code).
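
For the class-initialization case specifically, the priming itself can be as simple as the following sketch (class names taken from the FunInABox example above; the actual priming code in that example may differ):

    // Eagerly load and initialize classes the hot path will touch later.
    // Class.forName(name) both loads and initializes the class, so its
    // first real use can no longer trigger an initialization deopt.
    static void primeClasses() throws ClassNotFoundException {
        Class.forName("FunInABox$ThingOne");
        Class.forName("FunInABox$ThingTwo");
    }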

The above (control of untaken path optimization, class loading, initialization) are some simple examples of de-optimization reduction. But deoptimization causes are wide and varied, and real-world scenarios can get pretty complicated. We keep adding to the ways we find to avoid them in Zing. We've been working with actual customer code, analyzing the causes of de-optimization at market open, and we've identified at least 6 different mechanisms that seem commonplace in our customer base... So ReadyNow keeps getting improved, and we have work identified in that space for at least the coming year.
 


Chris Newland

Apr 27, 2014, 1:19:06 PM
to mechanica...@googlegroups.com
Hi Jimmy,

(Apologies to all for the pseudo sales pitch)

I've built a free and open source tool called JITWatch for inspecting LogCompilation output, which might be of use to you (https://github.com/AdoptOpenJDK/jitwatch/wiki). One of the reports it can generate is a toplist of the most-deoptimised methods, so that might save you from wading through the XML.

I've experienced deopts when processing price data feeds as market behaviour changes, but I've never been able to simulate them with a unit test (probably because HotSpot's too smart for me!). I've just read Gil's latest reply with the example, so hopefully I can now make the tool more useful in this area. Thanks Gil!

Kind regards,

Chris