Strategies for optimizing closure invocations


Charles Oliver Nutter

Apr 2, 2011, 1:40:58 PM
to JVM Languages
I'm getting deeper into JRuby optimization lately, and starting to
ponder strategies for optimizing closures (or, in the case of Clojure
and Scala, optimizing pass-through methods that make megamorphic
callbacks to functions).

First, some explanation...

In Ruby, the "each" method on Array receives a block of code:

[1,2,3].each {|i| do_something_with i}

"each" is implemented as a simple loop over Array elements, each
iteration calling Block.yield on a Block object passed in. Block
aggregates the code body above with a Binding object holding the
surrounding closed-over state.

public IRubyObject eachCommon(ThreadContext context, Block block) {
    if (!block.isGiven()) {
        throw context.getRuntime().newLocalJumpErrorNoBlock();
    }
    for (int i = 0; i < realLength; i++) {
        // do not coarsen the "safe" catch, since it will
        // misinterpret AIOOBE from the yielded code. See JRUBY-5434
        block.yield(context, safeArrayRef(values, begin + i));
    }
    return this;
}

Similar idioms certainly exist in other JVM languages that allow
passing functions or closures around.
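
For readers outside the Ruby world, a minimal Java analogue of the same idiom might look like the following sketch (all names are invented for illustration, and it uses Java 8 lambda syntax for brevity): "each" is one shared piece of code whose callback site sees a different body from every caller.

```java
import java.util.function.Consumer;

public class EachDemo {
    // Generic iteration helper: one shared method, many distinct blocks.
    public static <T> void each(T[] values, Consumer<T> block) {
        for (T v : values) {
            block.accept(v); // this one call site quickly goes megamorphic
        }
    }

    public static void main(String[] args) {
        final int[] sum = {0};
        // One of the many block bodies that flow through the same site.
        each(new Integer[] {1, 2, 3}, i -> sum[0] += i);
        System.out.println(sum[0]); // prints 6
    }
}
```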

The problem here is that the "yield" call is always going to go
megamorphic very quickly. There will be dozens of places in a typical
Ruby (or Clojure, or Scala, or Groovy) application that use the same
list-iteration logic, and they all pass through the same generic code.
That means the best you can optimize is inlining the loop logic into
the caller...the closure body won't inline and will have to optimize
on its own.

This also has implications for escape analysis. If there's any state
(including the closure object itself) involved in doing that
iteration, it's now impossible for it to EA away, since the closure
can't inline all the way back into the caller. Any heap-based
structures are now firmly on the heap, and add to our allocation
overhead.
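
As a concrete sketch of the escape-analysis point (invented names, not JRuby code): the closure below captures a mutable box. Only if "each" and the block both inlined back into the caller could the JIT scalar-replace the box and the closure object; across a megamorphic yield, both stay on the heap.

```java
import java.util.function.IntConsumer;

public class EscapeDemo {
    public static void each(int[] values, IntConsumer block) {
        for (int v : values) block.accept(v); // megamorphic in practice
    }

    public static int sumSquares(int[] values) {
        // The closure captures this box. If everything inlined, escape
        // analysis could eliminate both the box and the closure object;
        // across a non-inlined call they must be real heap allocations.
        final int[] box = {0};
        each(values, v -> box[0] += v * v);
        return box[0];
    }

    public static void main(String[] args) {
        System.out.println(sumSquares(new int[] {1, 2, 3})); // prints 14
    }
}
```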

Now, strategies...

In general, what's needed is a way to specialize "each" for many
different call sites and closures.

The easiest for us implementers would be if the JVMs simply started to
see through these calls. Closure/function-receiving code is going to
become more and more common, and indeed was already rather common for
event-handling systems and the like. Rémi and I talked with Fredrik
(JRockit) two JVMLSs ago about how JRockit might be able to optimize
these cases. Fredrik believed it would be possible, but that some sort
of marker was needed on the "each" method to show JRockit that it's
code that calls code. I forget who suggested it, but we decided an
easy marker would be to have the signature receive a MethodHandle or
subtype. JRockit (and perhaps other JVMs) could use that as indication
that this method should be specialized to the caller and provided
closure.
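
A sketch of what that marker-by-signature might look like (purely hypothetical; no shipping JVM keys on this today, and the names are invented): the block parameter is typed as MethodHandle, which a cooperating JIT could read as "this is code that calls code; specialize it per caller."

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MarkerDemo {
    // The MethodHandle-typed parameter is the hypothetical marker: a
    // cooperating JIT could clone this method per hot call site so the
    // invoke below becomes monomorphic in each clone.
    public static int each(int[] values, MethodHandle block) {
        int sum = 0;
        for (int v : values) {
            try {
                sum += (int) block.invoke(v); // the call the JIT would inline
            } catch (Throwable t) {
                throw new RuntimeException(t);
            }
        }
        return sum;
    }

    public static int square(int x) { return x * x; }

    public static MethodHandle squareHandle() {
        try {
            return MethodHandles.lookup().findStatic(MarkerDemo.class, "square",
                    MethodType.methodType(int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(each(new int[] {1, 2, 3}, squareHandle())); // prints 14
    }
}
```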

Barring JVM help, we are likely to attempt to optimize this case in
JRuby directly. My strategy will be to see (via runtime profiling)
that particular closure-receiving methods are hot, and do the
specialization myself. This would boil down to having JRuby's JIT emit
both the caller's class body *and* a copy of the "each" body into the
compiled result. The caller, each, and the closure (also in the same
result) would lie along a monomorphic path, allowing all three to
inline together. This will be easy with Ruby code; if I see a
closure-passing call to another Ruby method, I emit a body for that
method too. For core JRuby methods, which are implemented in Java, it
will be trickier; I'll need to either have simple ways to duplicate
those methods in-place or I'll need to move them into Ruby.
Ironically, we may be seeing the beginning of an age where it's faster
to implement JRuby core classes in Ruby.
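
Hand-written for illustration (the real thing would be emitted by JRuby's JIT, and all names here are invented), the specialization amounts to this: each hot call site gets its own private copy of "each", so the yield site inside each copy sees exactly one block body and the whole chain can inline.

```java
public class SpecializeDemo {
    public interface Block { int call(int v); }

    // The generic, shared version: its block call sees many bodies.
    public static int eachShared(int[] values, Block block) {
        int sum = 0;
        for (int v : values) sum += block.call(v); // megamorphic site
        return sum;
    }

    // A per-call-site specialization: the JIT would emit one of these
    // for each hot caller, fusing the loop, "each", and the block body.
    public static int eachForSquaringSite(int[] values) {
        int sum = 0;
        for (int v : values) sum += v * v; // block body inlined by construction
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(eachShared(new int[] {1, 2, 3}, v -> v * v)); // prints 14
        System.out.println(eachForSquaringSite(new int[] {1, 2, 3}));    // prints 14
    }
}
```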

Have any of you other implementers thought about this? What are you
considering as strategies for optimizing closure invocations?

- Charlie

Robert Fischer

Apr 2, 2011, 6:30:09 PM
to jvm-la...@googlegroups.com
I've been looking into exactly this kind of problem for my functional
programming language. The sheer amount of megamorphic love that "fold"
receives got me thinking about this. Although there's some work that I
could do at compile time, there's really not terribly much, and I've
been under the impression that this is exactly the kind of situation
which makes life hard on the JVM's runtime optimization.

The big question is how much room there is for an optimization to
really benefit? How much tracking overhead becomes counter-productive
to optimization?

~~ Robert.


Charles Oliver Nutter

Apr 2, 2011, 8:34:14 PM
to jvm-la...@googlegroups.com
On Sat, Apr 2, 2011 at 5:30 PM, Robert Fischer <smokej...@gmail.com> wrote:
> I've been looking into exactly this kind of problem for my functional
> programming language. The sheer amount of megamorphic love that "fold"
> receives got me thinking about this. Although there's some work that I
> could do at compile time, there's really not terribly much, and I've
> been under the impression that this is exactly the kind of situation
> which makes life hard on the JVM's runtime optimization.
>
> The big question is how much room there is for an optimization to
> really benefit? How much tracking overhead becomes counter-productive
> to optimization?

I think the potential is *very* large. My escape analysis example was
only one key case that's impossible to optimize without being able to
inline closures. Others may include loop unrolling (JIT can't optimize
as well when it doesn't know what a downstream call might do),
bounds-check elimination (loops conditional on the closure result, for
example), and of course anything related to doing a slow CALL
(register allocation, memory optimizations, inline cache or vtable
dispatch...). Basically, it forces the body of some iteration into
always being a CALL, and all optimizations you'd normally get inside a
loop body are impossible.

I also want to make clear this affects all languages that pass
functions or closures around, including Groovy, JRuby, Scala, Clojure,
current Java with callbacks/anon classes, and upcoming Java 7 closure
support.

- Charlie

Charles Oliver Nutter

Apr 2, 2011, 8:35:34 PM
to jvm-la...@googlegroups.com
On Sat, Apr 2, 2011 at 7:34 PM, Charles Oliver Nutter
<hea...@headius.com> wrote:
> I also want to make clear this affects all languages that pass
> functions or closure around, including Groovy, JRuby, Scala, Clojure,
> current Java with callbacks/anon classes, and upcoming Java 7 closure
> support.

Sent this a bit quickly...

This may actually be a case where JRuby has a better chance of
optimizing since we are already mixed-mode. It won't be very difficult
for us to specialize a body of code to the point of use, allowing that
body and a closure passed to it to inline. Assuming that's not
possible for other languages, has anyone thought of other ways to make
closures/callbacks optimize as well as normal inlined monomorphic call
paths?

- Charlie

Kresten Krab Thorup

Apr 3, 2011, 11:38:11 AM
to JVM Languages
I would think that this is a case where a vm has a chance to shine.
Your "eachCommon" just needs to be inlined into the callsite, and the
call to yield becomes monomorphic. That's the primary optimization
that makes the JVM run fast.

So. If method handles or something else disables this optimization, or
makes it more expensive then perhaps we should consider adding a hint/
annotation to the vm saying "it may make sense to inline this". but I
would think that that is exactly the kind of assessment that the vm
can do better than static knowledge allows.

Kresten

Robert Fischer

Apr 3, 2011, 12:48:29 PM
to jvm-la...@googlegroups.com
For Ashlar, I represent my closures as MethodHandle instances. (I
actually represent pretty much everything as MethodHandle instances at
this point, but that's a conversation for another time.) Since I'm
statically typed, there's some work I can do at compile time in
certain cases (recognizing Pervasive#each calls and turning them into
loops, for instance). But let's set that aside and focus on runtime
optimizations.

Now, there's no way that I am aware of to go from a MethodHandle to a
java.lang.reflect.Method instance and an array of arguments, which
would then enable me to inline it at runtime. There are good reasons I
can imagine for not allowing that, starting first and foremost with
argument adapters (and pretty much all the mutations in the
MethodHandles class), which means that the MethodHandle has semantics
different from passing the given arguments to the given method.

So, if we're going to inline based on the MethodHandle, we need some
kind of wrapper which will tell us whether the MethodHandle is
inlinable, and which will tell us the method and arguments if it is
reversible. Note that the method and arguments need not correspond
directly to the underlying implementation—in theory, we could generate
some other code at runtime to represent adapters, etc. But for the
initial implementation, I'm considering a MethodHandle inlinable if it
corresponds directly to an API method, possibly with currying. That
solves half the problem: I can now go from MethodHandle to a
java.lang.reflect.Method, which allows me to inline it at runtime (at
least in the common case).

The other half of the problem is the call site. I use a custom CallSite
which exposes MethodHandle#invokeGeneric so that it can track
arguments passed in. I'd use that to track MethodHandle arguments
passed in, and if the same inlinable MethodHandle is hit often enough,
it should generate the inlined code, and then update the MethodHandle
to check for this common case (MethodHandles#guardWithTest comes in
here). This is pretty much exactly the logic I would hope a JVM would
eventually implement, and I'm presuming that the JVM already has (or
will soon have) the ability to optimize the MethodHandle resulting
from MethodHandles#guardWithTest calls.
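
A runnable sketch of that guardWithTest pattern (names invented; note that the invokeGeneric of the Java 7 previews became plain invoke in the final java.lang.invoke API): the site fast-paths one observed handle and falls back to generic dispatch for anything else.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Arrays;

public class GuardDemo {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    public static int twice(int x) { return 2 * x; }
    public static int thrice(int x) { return 3 * x; }

    // Slow path: dispatch through whatever handle arrives.
    public static int generic(MethodHandle fn, int x) {
        try { return (int) fn.invoke(x); }
        catch (Throwable t) { throw new RuntimeException(t); }
    }

    // Guard: is the incoming handle the one we specialized for?
    public static boolean sameHandle(MethodHandle expected, MethodHandle actual) {
        return expected == actual;
    }

    static MethodHandle findOwn(String name, MethodType type) {
        try { return LOOKUP.findStatic(GuardDemo.class, name, type); }
        catch (ReflectiveOperationException e) { throw new RuntimeException(e); }
    }

    // Build a call site specialized for one observed handle, falling
    // back to the generic path for anything else.
    public static MethodHandle siteFor(MethodHandle specialized) {
        MethodHandle fallback = findOwn("generic",
                MethodType.methodType(int.class, MethodHandle.class, int.class));
        MethodHandle guard = findOwn("sameHandle",
                MethodType.methodType(boolean.class, MethodHandle.class, MethodHandle.class))
                .bindTo(specialized);
        // Fast path: ignore the handle argument and call the specialized
        // target directly; this is the branch a JIT could inline through.
        MethodHandle fast = MethodHandles.dropArguments(specialized, 0, MethodHandle.class);
        return MethodHandles.guardWithTest(guard, fast, fallback);
    }

    public static int callThrough(MethodHandle site, MethodHandle fn, int x) {
        try { return (int) site.invoke(fn, x); }
        catch (Throwable t) { throw new RuntimeException(t); }
    }

    public static int[] demo() {
        MethodType fnType = MethodType.methodType(int.class, int.class);
        MethodHandle twiceH = findOwn("twice", fnType);
        MethodHandle thriceH = findOwn("thrice", fnType);
        MethodHandle site = siteFor(twiceH);
        return new int[] { callThrough(site, twiceH, 10),    // fast path
                           callThrough(site, thriceH, 10) }; // fallback
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // [20, 30]
    }
}
```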

Now, this is a fair bit of infrastructure and additional obscuring of
the user's code path, which is where the question comes in about
whether or not it is worth it. There are also serious concurrency
issues: how do we balance performance with shared state? For instance,
consider the implied counter for MethodHandle arguments—do we really
want the overhead of synchronization on *every* invocation of this
method? Imagining a global lock on "fold" just makes me blanch. We
could go "volatile" and force memory flushes in all those cases, but
is that really better? What's the alternative? (Taking the overhead
and the loss of opportunities to track it per-thread? Is that really
better?) I'm hesitant to spend time implementing an optimization that
doesn't pay off (developing my language is a hobby, not my job, so my
time is very limited), but if the gain is huge, it's probably worth
it.

Of course, I'd really like to declare it's the JVM's problem and look
forward longingly to the day when it's resolved there.

~~ Robert.

On Sat, Apr 2, 2011 at 8:35 PM, Charles Oliver Nutter

Jochen Theodorou

Apr 3, 2011, 6:04:07 PM
to jvm-la...@googlegroups.com
On 03.04.2011 17:38, Kresten Krab Thorup wrote:
> So. If method handles or something else disables this optimization, or
> makes it more expensive then perhaps we should consider adding a hint/
> annotation to the vm saying "it may make sense to inline this". but I
> would think that that is exactly the kind of assessment that the vm
> can do better than static knowledge allows.

I am wondering... in what case does it *not* make sense to inline? I
mean I can imagine reasons like certain limitations, but in the end...
Does anyone know?

bye Jochen

--
Jochen "blackdrag" Theodorou
The Groovy Project Tech Lead
http://blackdragsview.blogspot.com/
For Groovy programming sources visit http://groovy.codehaus.org

Rémi Forax

Apr 3, 2011, 6:41:00 PM
to jvm-la...@googlegroups.com
On 04/04/2011 12:04 AM, Jochen Theodorou wrote:
> On 03.04.2011 17:38, Kresten Krab Thorup wrote:
>> So. If method handles or something else disables this optimization, or
>> makes it more expensive then perhaps we should consider adding a hint/
>> annotation to the vm saying "it may make sense to inline this". but I
>> would think that that is exactly the kind of assessment that the vm
>> can do better than static knowledge allows.
>
> I am wondering... in what case does it *not* make sense to inline? I
> mean I can imagine reasons like certain limitations, but in the end...
> Does anyone know?

The problem is more that to optimize such a case you need backpointers
from a method to all its call sites. This data structure has a cost in
memory and in cycles to manage, and is only necessary if you have
higher-order methods (methods that take a lambda).
So detecting such methods is important.

I don't think the VM needs a flag.
A method that takes a MethodHandle as a parameter,
where the method handle value doesn't escape and is invoked (invoke*),
should be optimized.

Now, Hotspot has tiered compilation, so if the higher-order method is hot
(or an OSR version of it), it can be compiled, which can also
trigger recording of its call sites.

>
> bye Jochen
>

Rémi

Charles Oliver Nutter

Apr 3, 2011, 11:32:56 PM
to jvm-la...@googlegroups.com, Kresten Krab Thorup
On Sun, Apr 3, 2011 at 10:38 AM, Kresten Krab Thorup <kr...@trifork.com> wrote:
> I would think that this is a case where a vm has a chance to shine.
> Your "eachCommon" just needs to be inlined into the callsite, and the
> call to yield becomes monomorphic. That's the primary optimization
> that makes the JVM run fast.

I'm not sure if you're saying that current JVMs work this way or
saying that they should work this way. They don't do this right now.

eachCommon's call sites are considered independent of any callers.
Because the yield call ends up seeing numerous code bodies, that call
site is megamorphic and will never inline anything. eachCommon itself
will inline into the caller, but that's as far as it goes since
current JVMs do not specialize call sites based on the current
method's caller.

Of course I'd love to see JVMs be smart enough to specialize call
sites based on the caller backtrace.

> So. If method handles or something else disables this optimization, or
> makes it more expensive then perhaps we should consider adding a hint/
> annotation to the vm saying "it may make sense to inline this". but I
> would think that that is exactly the kind of assessment that the vm
> can do better than static knowledge allows.

If we could say "inline the called method as though it were right here
and specialize all its call sites based on this method" then this
would work. But simply inlining is not enough if call sites are not
specialized.

- Charlie

Charles Oliver Nutter

Apr 3, 2011, 11:41:28 PM
to jvm-la...@googlegroups.com
On Sun, Apr 3, 2011 at 11:48 AM, Robert Fischer <smokej...@gmail.com> wrote:
> Now, there's no way that I am aware of to go from a MethodHandle to a
> java.lang.reflect.Method instance and an array of arguments, which
> would then enable me to inline it at runtime. There are good reasons I
> can imagine for not allowing that, starting first and foremost with
> argument adapters (and pretty much all the mutations in the
> MethodHandles class), which means that the MethodHandle has semantics
> different from passing the given arguments to the given method.

In JRuby, since we support Java 5 and 6, the objects passed around as
"handles" are our own construct ("DynamicMethod"). There's metadata
there to know if it needs a heap-based scope, if it needs a Frame, the
shape of its arguments, and what JVM bytecode method it points at.
Given that information we should (and do, in dynopt mode) have enough
to either insert direct calls (skipping dynamic call logic) *or*
inline the target code directly into the current one at runtime.

Using only MethodHandle would be lovely, but as you point out you may
need other metadata if you're going to do any magic at runtime.

Is Ashlar mixed-mode or compiled only?

> Of course, I'd really like to declare it's the JVM's problem and look
> forward longingly to the day when it's resolved there.

Indeed :) Of course right now we're just waiting for invokedynamic and
method handles to be fast fast fast, so it will be some time before
the JVM guys can come up for air and start considering new problems
like closure inlining.

- Charlie

Charles Oliver Nutter

Apr 3, 2011, 11:45:29 PM
to jvm-la...@googlegroups.com
On Sun, Apr 3, 2011 at 5:04 PM, Jochen Theodorou <blac...@gmx.org> wrote:
> On 03.04.2011 17:38, Kresten Krab Thorup wrote:
>>
>> So. If method handles or something else disables this optimization, or
>> makes it more expensive then perhaps we should consider adding a hint/
>> annotation to the vm saying "it may make sense to inline this". but I
>> would think that that is exactly the kind of assessment that the vm
>> can do better than static knowledge allows.
>
> I am wondering... in what case does it *not* make sense to inline? I mean I
> can imagine reasons like certain limitations, but in the end... Does anyone
> know?

There's the black art of runtime optimization :)

I actually reported a performance bug in invokedynamic a couple weeks
back that turned out to be Hotspot inlining the *wrong* logic.
Basically, it was inlining recursive calls to fib as far as possible
without inlining the other calls made in fib that were actually more
expensive (and more valuable to inline and optimize. Indeed, the
question of what *not* to inline is at least as important as the
question of what *to* inline.

As a rule of thumb I suppose I'd say you want to inline methods that
have the best cost/size ratio; basically, inline the smallest, hottest
methods first. I'd love to hear from the JVM vendors what rules their
compilers follow, though (and whether there are papers the rest of us
could read to understand them better).

- Charlie

Kirk

Apr 4, 2011, 2:15:11 AM
to jvm-la...@googlegroups.com
>
>
> As a rule of thumb I suppose I'd say you want to inline methods that
> have the best cost/size ratio; basically, inline the smallest, hottest
> methods first. I'd love to hear from the JVM vendors what rules their
> compilers follow, though (and whether there are papers the rest of us
> could read to understand them better).

Indeed... I was looking for better tooling to let me look at the code cache, and talked to Cliff Click about this a couple of weeks ago. His comment was that he was surprised people were interested in this subject, along with how the code cache was managed; the code cache was intended to be self-managing, thus no need for metrics. If anyone can point to better docs/tooling in this area, please do.

Kirk

Jochen Theodorou

Apr 4, 2011, 10:46:12 AM
to jvm-la...@googlegroups.com
On 04.04.2011 05:45, Charles Oliver Nutter wrote:
> On Sun, Apr 3, 2011 at 5:04 PM, Jochen Theodorou<blac...@gmx.org> wrote:
[...]

>> I am wondering... in what case does it *not* make sense to inline? I mean I
>> can imagine reasons like certain limitations, but in the end... Does anyone
>> know?
>
> There's the black art of runtime optimization :)

ohhh, yes :)

> I actually reported a performance bug in invokedynamic a couple weeks
> back that turned out to be Hotspot inlining the *wrong* logic.

yes, I did read that... and from that I had the impression that the
inlining logic for MethodHandles is quite eager... which of course
could be used in this case... but it doesn't turn out well all the
time, I guess.

[...]


> As a rule of thumb I suppose I'd say you want to inline methods that
> have the best cost/size ratio; basically, inline the smallest, hottest
> methods first. I'd love to hear from the JVM vendors what rules their
> compilers follow, though (and whether there are papers the rest of us
> could read to understand them better).

I don't really know; I can only guess, because I'm really not that deep
into the JVM details. From what I've learned, I assume that in the
following example

> public static void doSomething(MyInterface x, Object[] args) {
>     x.call(args) //C_1
> }
> public static void doSomethingSmall(){...}
> MyInterface mi = new MyInterface() {
>     public void call(Object[] args) {
>         doSomethingSmall() //C_2
>     }
> }
> doSomething(mi, args) //C_3

doSomethingSmall at call site C_2 will be inlined. C_1 is megamorphic
(because it is called from many other places in many variations), so no
inlining. But C_3 could maybe be. I don't know if the megamorphic C_1
taints C_3, or if that does not matter at all... no idea.

What I assume is that the JVM has a kind of global concept of a callsite
being megamorphic or monomorphic. Even if C_3 would be inlined it would
not help C_2.

If - and I am being crazy here I guess, because I talk about things I
essentially have no idea of - if C_3 is inlined, doesn't that make C_1
kind of monomorphic in a local sense? Wouldn't that mean that if C_3
is inlined we can inline at C_2 and C_1 as well? Isn't that exactly
what we would need the JVM doing for "generalized code that calls
code"? Wouldn't that mean that when the JVM meets a megamorphic hot
call site, there could maybe be a transformation making it partially
monomorphic by isolating part of the call graph? Wouldn't that be a
strategy the JVM could implement in general, without a special marker?
Or does the JVM do this already?

bye blackdrag

Charles Oliver Nutter

Apr 4, 2011, 11:34:09 AM
to jvm-la...@googlegroups.com, Jochen Theodorou
On Mon, Apr 4, 2011 at 9:46 AM, Jochen Theodorou <blac...@gmx.org> wrote:
>> public static void doSomething(MyInterface x, Object[] args) {
>>  x.call(args) //C_1
>> }
>> public static void doSomethingSmall(){...}
>> MyInterface mi = new MyInterface() {
>>  public void call(Object[] args) {
>>    doSomethingSmall() //C_2
>>  }
>> }
>> doSomething(mi) //C_3
>
> doSomethingSmall in callsite C_2 will be inlined. C_1 is megamorphic
> (because called from many other places in many variations), so no inlining.
> But C_3 could maybe. I don't know if the megamorphic C_2 taints C_3, or if
> that does not matter at all... no idea.
>
> What I assume is that the JVM has a kind of global concept of a callsite
> being megamorphic or monomorphic. Even if C_3 would be inlined it would not
> help C_2.

Speaking from what I know of Hotspot...

C_2 and C_3 will both likely inline their respective targets. They're
not virtual calls, and they're small.

C_1 will inline if only one or two unique types are provided to the
doSomething method (Hotspot can inline monomorphic and bimorphic call
sites, I believe). If a new type later comes in, the code will be
deoptimized and the inlining reversed.

Note that another of my "bugs" comes into play here: even if multiple
types implement MyInterface and share the same implementation code,
Hotspot will consider them separate types for profiling purposes, and
as a result even if you only ever have one implementation of "call" in
the system, you still might not get it to inline. I'd love to see that
fixed :)
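
The shape of that profiling pollution, as an invented-name sketch: two receiver classes inherit one implementation, yet the type profile counts two distinct receiver types at the call site.

```java
public class ProfileDemo {
    // One shared implementation, inherited by both subclasses.
    public static abstract class Base {
        public int call(int x) { return x + 1; }
    }
    public static class A extends Base {}
    public static class B extends Base {}

    public static int dispatch(Base receiver, int x) {
        // Hotspot profiles receiver *types* here. A and B count as two
        // distinct types even though both run Base.call, so the site
        // can look polymorphic despite a single method body.
        return receiver.call(x);
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new A(), 1) + dispatch(new B(), 1)); // prints 4
    }
}
```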

> If - and I am being crazy here I guess, because I talk about things I
> essentially have no idea of - if C_3 is inlined doesn't that make C_1 kind
> of monomorphic in a local sense? Wouldn't that mean that if at C_3 is
> inlined we can inline at C_2 and C_1 as well? Isn't that exactly what we
> would need the JVM doing for "generalized code that calls code"? Wouldn't
> that mean that if the JVM means a megamorphic hot call site that there could
> be maybe a transformation making it partially monomorphic by isolating part
> of the call graph? Wouldn't that be a strategy the JVM could implement in
> general, without special marker? Or does the JVM do this already?

Again speaking from my non-expert understanding of Hotspot:

Even if C_3 inlines, C_1 remains megamorphic because Hotspot treats it
as a lone call site independent of calls around it. There's no concept
of "the C_1 call site when called from C_3".

This problem is actually rampant in JRuby (and probably Groovy too)
since we both call through generic pieces of code (e.g. our own call
site logic or dispatchers) that see dozens or hundreds of different
targets. It's the problem Cliff Click pointed out at the first JVMLS
when he profiled JRuby on Azul...our CachingCallSite, while reducing
call costs significantly, was preventing target methods from inlining
into callers. If I remember right, he described as a solution that
you'd have to be able to do repeat passes of type profiling, inlining,
type-profiling, inlining...to allow inlined results to themselves be
subject to a profiling pass. Hotspot does not do this. I do not know
about JRockit or J9, but I suspect that tiered compilers may be able
to implement this optimization more easily (since they can instrument
inlined code and re-optimize it based on that information).

- Charlie

Kresten Krab Thorup

Apr 5, 2011, 9:25:29 AM
to JVM Languages
> Even if C_3 inlines, C_1 remains megamorphic because Hotspot treats it
> as a lone call site independent of calls around it. There's no concept
> of "the C_1 call site when called from C_3".

That is actually not how I thought HotSpot works; isn't the whole
point of inlining that the embedded calls can be optimized with more
precise information? Strange... I'm disappointed.

Kresten


On Apr 4, 5:34 pm, Charles Oliver Nutter <head...@headius.com> wrote:

Charles Oliver Nutter

Apr 5, 2011, 10:26:13 AM
to jvm-la...@googlegroups.com, Kresten Krab Thorup
On Tue, Apr 5, 2011 at 8:25 AM, Kresten Krab Thorup <kr...@trifork.com> wrote:
>> Even if C_3 inlines, C_1 remains megamorphic because Hotspot treats it
>> as a lone call site independent of calls around it. There's no concept
>> of "the C_1 call site when called from C_3".
>
> That is actually not how I thought HotSpot works; isn't that the whole
> point of inlining that the embedded calls can be applied with more
> precise information?  Strange.... I'm disappointed.

It is disappointing. Cliff Click just posted an entry about this
issue, confirming that current JVMs don't do what you hoped they'd do
and discussing at least one strategy for solving "The Problem":
http://www.azulsystems.com/blog/cliff/2011-04-04-fixing-the-inlining-problem

Long story short...nope, JVMs don't do this right now. Pass a couple
functions into an iterator and JVMs won't inline those functions.
You're stuck with virtual calls and no visibility across that call
boundary to optimize the loop itself.

I'm sure this is one of the remaining major bottlenecks in JRuby. When
I switch to modes that profile at runtime and insert direct inlinable
invocations, JRuby perf improves 2-3x even on benchmarks we used to
think were bottlenecking on Fixnum object creation. In fact, I can get
JRuby to run nearly as fast as Java itself for equivalent work (i.e. a
Java-based Fixnum fibonacci versus a Ruby-based Fixnum fibonacci). If
inlining plain method calls has that kind of impact, imagine how much
performance we're losing by not being able to inline functions across
megamorphic iterator boundaries.

For my part, I will probably continue down the road of specializing at
the bytecode level, to allow more code to inline (at the cost of more
of the inlining budget, unfortunately) and doing runtime optimizations
as much as possible. Hopefully the JVMs will catch up to JRuby soon :)

- Charlie
