First, some explanation...
In Ruby, the "each" method on Array receives a block of code:
[1,2,3].each {|i| do_something_with i}
"each" is implemented as a simple loop over Array elements, each
iteration calling Block.yield on a Block object passed in. Block
aggregates the code body above with a Binding object holding the
surrounding closed-over state.
public IRubyObject eachCommon(ThreadContext context, Block block) {
    if (!block.isGiven()) {
        throw context.getRuntime().newLocalJumpErrorNoBlock();
    }
    for (int i = 0; i < realLength; i++) {
        // do not coarsen the "safe" catch, since it will
        // misinterpret AIOOBE from the yielded code.
        // See JRUBY-5434
        block.yield(context, safeArrayRef(values, begin + i));
    }
    return this;
}
Similar idioms certainly exist in other JVM languages that allow
passing functions or closures around.
The problem here is that the "yield" call is always going to go
megamorphic very quickly. There will be dozens of places in a typical
Ruby (or Clojure, or Scala, or Groovy) application that use the same
list-iteration logic, and they all pass through the same generic code.
That means the best you can do is inline the loop logic into the
caller...the closure body won't inline and will have to optimize
on its own.
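As an illustrative Java analogue (names hypothetical, not JRuby code): a single generic iteration method funnels every caller's closure through one and the same call site, so that site quickly accumulates many receiver bodies and profiles as megamorphic.

```java
import java.util.function.IntConsumer;

public class EachDemo {
    // One generic iteration method, analogous to eachCommon above: every
    // caller funnels its closure through the single body.accept call site.
    public static void each(int[] values, IntConsumer body) {
        for (int v : values) {
            body.accept(v); // this one call site sees every closure below
        }
    }

    public static int sum(int[] values) {
        int[] acc = {0};
        each(values, i -> acc[0] += i);                   // closure body #1
        return acc[0];
    }

    public static int countEven(int[] values) {
        int[] acc = {0};
        each(values, i -> { if (i % 2 == 0) acc[0]++; }); // closure body #2
        return acc[0];
    }
}
```

With dozens of distinct closure bodies flowing through `body.accept`, the JIT can inline `each` into its callers, but the closure dispatch inside the loop stays a virtual call.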
This also has implications for escape analysis. If there's any state
(including the closure object itself) involved in doing that
iteration, it's now impossible for it to EA away, since the closure
can't inline all the way back into the caller. Any heap-based
structures are now firmly on the heap, and add to our allocation
overhead.
Now, strategies...
In general, what's needed is a way to specialize "each" for many
different call sites and closures.
The easiest for us implementers would be if the JVMs simply started to
see through these calls. Closure/function-receiving code is going to
become more and more common, and indeed was already rather common for
event-handling systems and the like. Rémi and I talked with Fredrik
(JRockit) two JVMLSs ago about how JRockit might be able to optimize
these cases. Fredrik believed it would be possible, but that some sort
of marker was needed on the "each" method to show JRockit that it's
code that calls code. I forget who suggested it, but we decided an
easy marker would be to have the signature receive a MethodHandle or
subtype. JRockit (and perhaps other JVMs) could use that as indication
that this method should be specialized to the caller and provided
closure.
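A sketch of what such a marked signature might look like (all names hypothetical; this is not a real JRockit or JRuby API): the MethodHandle-typed parameter itself is the marker that this is "code that calls code."

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MarkedEach {
    // The MethodHandle parameter doubles as the proposed marker: a JVM
    // could treat any method that receives (and invokes) a MethodHandle
    // as a candidate for per-caller specialization.
    public static void each(int[] values, MethodHandle body) {
        for (int v : values) {
            try {
                body.invokeExact(v); // the "code that calls code" site
            } catch (Throwable t) {
                throw new RuntimeException(t);
            }
        }
    }

    // Demo closure state and body.
    public static int total;
    public static void add(int i) { total += i; }

    public static MethodHandle addHandle() {
        try {
            return MethodHandles.lookup().findStatic(
                MarkedEach.class, "add",
                MethodType.methodType(void.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```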
Barring JVM help, we are likely to attempt to optimize this case in
JRuby directly. My strategy will be to see (via runtime profiling)
that particular closure-receiving methods are hot, and do the
specialization myself. This would boil down to having JRuby's JIT emit
both the caller's class body *and* a copy of the "each" body into the
compiled result. The caller, each, and the closure (also in the same
result) would lie along a monomorphic path, allowing all three to
inline together. This will be easy with Ruby code; if I see a
closure-passing call to another Ruby method, I emit a body for that
method too. For core JRuby methods, which are implemented in Java, it
will be trickier; I'll need to either have simple ways to duplicate
those methods in-place or I'll need to move them into Ruby.
Ironically, we may be seeing the beginning of an age where it's faster
to implement JRuby core classes in Ruby.
Have any of you other implementers thought about this? What are you
considering as strategies for optimizing closure invocations?
- Charlie
The big question is how much room there is for such an optimization to
really pay off. At what point does the tracking overhead become
counter-productive?
~~ Robert.
I think the potential is *very* large. My escape analysis example was
only one key case that's impossible to optimize without being able to
inline closures. Others may include loop unrolling (JIT can't optimize
as well when it doesn't know what a downstream call might do),
bounds-check elimination (loops conditional on the closure result, for
example), and of course anything related to doing a slow CALL
(register allocation, memory optimizations, inline cache or vtable
dispatch...). Basically, it forces the body of some iteration into
always being a CALL, and all optimizations you'd normally get inside a
loop body are impossible.
I also want to make clear this affects all languages that pass
functions or closures around, including Groovy, JRuby, Scala, Clojure,
current Java with callbacks/anon classes, and upcoming Java 7 closure
support.
- Charlie
Sent this a bit quickly...
This may actually be a case where JRuby has a better chance of
optimizing since we are already mixed-mode. It won't be very difficult
for us to specialize a body of code to the point of use, allowing that
body and a closure passed to it to inline. Assuming that's not
possible for other languages, has anyone thought of other ways to make
closures/callbacks optimize as well as normal inlined monomorphic call
paths?
- Charlie
Now, there's no way that I am aware of to go from a MethodHandle to a
java.lang.reflect.Method instance and an array of arguments, which
would then enable me to inline it at runtime. There are good reasons I
can imagine for not allowing that, starting first and foremost with
argument adapters (and pretty much all the mutations in the
MethodHandles class), which means that the MethodHandle has semantics
different from passing the given arguments to the given method.
So, if we're going to inline based on the MethodHandle, we need some
kind of wrapper which will tell us whether the MethodHandle is
inlinable, and, if it is, the method and arguments.
Note that the method and arguments need not correspond
directly to the underlying implementation—in theory, we could generate
some other code at runtime to represent adapters, etc. But for the
initial implementation, I'm considering a MethodHandle inlinable if it
corresponds directly to an API method, possibly with currying. That
solves half the problem: I can now go from MethodHandle to a
java.lang.reflect.Method, which allows me to inline it at runtime (at
least in the common case).
The other half of the problem is the call site. I use a custom CallSite
which exposes MethodHandle#invokeGeneric so that it can track the
arguments passed in. If the same inlinable MethodHandle is hit often enough,
it should generate the inlined code, and then update the MethodHandle
to check for this common case (MethodHandles#guardWithTest comes in
here). This is pretty much exactly the logic I would hope a JVM would
eventually implement, and I'm presuming that the JVM already has (or
will soon have) the ability to optimize the MethodHandle resulting
from MethodHandles#guardWithTest calls.
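A minimal sketch of that guardWithTest step (all names hypothetical, and the "inlined" fast path is just a stand-in static method): once one handle dominates a site, guard a specialized target with an identity test on the incoming handle and fall back to generic invocation otherwise.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class GuardSketch {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    // Generic path: dispatch through whatever handle was passed in.
    public static int generic(MethodHandle body, int x) {
        try {
            return (int) body.invokeExact(x);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // Two sample closure bodies.
    public static int doubler(int x) { return 2 * x; }
    public static int negate(int x)  { return -x; }

    // Stand-in for code specialized to the hot doubler handle;
    // the handle argument is ignored on this fast path.
    public static int fastDouble(MethodHandle ignored, int x) { return 2 * x; }

    public static boolean isHot(MethodHandle expected, MethodHandle actual) {
        return expected == actual;
    }

    public static MethodHandle handleFor(String name) {
        try {
            return LOOKUP.findStatic(GuardSketch.class, name,
                MethodType.methodType(int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Build the guarded site once a handle is observed to be hot.
    public static MethodHandle specializedSite(MethodHandle hot) {
        try {
            MethodType siteType =
                MethodType.methodType(int.class, MethodHandle.class, int.class);
            MethodHandle test = LOOKUP.findStatic(GuardSketch.class, "isHot",
                MethodType.methodType(boolean.class,
                    MethodHandle.class, MethodHandle.class)).bindTo(hot);
            return MethodHandles.guardWithTest(test,
                LOOKUP.findStatic(GuardSketch.class, "fastDouble", siteType),
                LOOKUP.findStatic(GuardSketch.class, "generic", siteType));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Invoke helper so callers need not declare Throwable.
    public static int call(MethodHandle site, MethodHandle body, int x) {
        try {
            return (int) site.invokeExact(body, x);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

The hope, as described above, is that the JVM can see through the guardWithTest composition and inline the specialized target under the guard.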
Now, this is a fair bit of infrastructure and additional obscuring of
the user's code path, which is where the question comes in about
whether or not it is worth it. There are also serious concurrency
issues: how do we balance performance with shared state? For instance,
consider the implied counter for MethodHandle arguments—do we really
want the overhead of synchronization on *every* invocation of this
method? Imagining a global lock on "fold" just makes me blanch. We
could go "volatile" and force memory flushes in all those cases, but
is that really better? What's the alternative? (Taking the overhead
and the loss of opportunities to track it per-thread? Is that really
better?) I'm hesitant to spend time implementing an optimization that
isn't (developing my language is a hobby, not my job, so there's very
limited time resources), but if the gain is huge, it's probably worth
it.
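One hedged answer to the counter question, sketched below with hypothetical names: accept racy, unsynchronized per-site counters, much as HotSpot's own invocation counters are understood to be; a lost update or stale read merely delays the hot/cold decision, which profiling can tolerate.

```java
public class SiteProfile {
    // Deliberately non-volatile, unsynchronized state: occasional lost
    // updates or stale reads only delay or slightly skew the hot/cold
    // decision, and every invocation stays free of locks and fences.
    private Object lastSeen;
    private int hits;

    public static final int THRESHOLD = 1000;

    // Returns true once the same handle has been seen THRESHOLD times
    // in a row, i.e. the site looks worth specializing.
    public boolean record(Object handle) {
        if (handle == lastSeen) {
            return ++hits >= THRESHOLD;
        }
        lastSeen = handle;
        hits = 1;
        return false;
    }
}
```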
Of course, I'd really like to declare it's the JVM's problem and look
forward longingly to the day when it's resolved there.
~~ Robert.
On Sat, Apr 2, 2011 at 8:35 PM, Charles Oliver Nutter wrote:
I am wondering... in what case does it *not* make sense to inline? I
mean I can imagine reasons like certain limitations, but in the end...
Does anyone know?
bye Jochen
--
Jochen "blackdrag" Theodorou
The Groovy Project Tech Lead
http://blackdragsview.blogspot.com/
For Groovy programming sources visit http://groovy.codehaus.org
The problem is more that to optimize such a case you need backpointers from
a method to all its call sites. This data structure has a cost in memory and
in cycles to manage, and is only necessary if you have a higher-order method
(a method that takes a lambda).
So the detection of such methods is important.
I don't think the VM needs a flag. A method that takes a MethodHandle as a
parameter, where the method handle value doesn't escape and is invoked
(invoke*), should be optimized.
Now, Hotspot has tiered compilation, so if the higher-order method (or an
OSR version of it) is hot, it can be compiled, which can also trigger
recording of its call sites.
Rémi
I'm not sure if you're saying that current JVMs work this way or
saying that they should work this way. They don't do this right now.
eachCommon's call sites are considered independent of any callers.
Because the yield call ends up seeing numerous code bodies, that call
site is megamorphic and will never inline anything. eachCommon itself
will inline into the caller, but that's as far as it goes since
current JVMs do not specialize call sites based on the current
method's caller.
Of course I'd love to see JVMs be smart enough to specialize call
sites based on the caller backtrace.
> So. If method handles or something else disables this optimization, or
> makes it more expensive then perhaps we should consider adding a hint/
> annotation to the vm saying "it may make sense to inline this". but I
> would think that that is exactly the kind of assessment that the vm
> can do better than static knowledge allows.
If we could say "inline the called method as though it were right here
and specialize all its call sites based on this method" then this
would work. But simply inlining is not enough if call sites are not
specialized.
- Charlie
In JRuby, since we support Java 5 and 6, the objects passed around as
"handles" are our own construct ("DynamicMethod"). There's metadata
there to know if it needs a heap-based scope, if it needs a Frame, the
shape of its arguments, and what JVM bytecoded method it points at.
Given that information we should (and do, in dynopt mode) have enough
to either insert direct calls (skipping dynamic call logic) *or*
inline the target code directly into the current one at runtime.
Using only MethodHandle would be lovely, but as you point out you may
need other metadata if you're going to do any magic at runtime.
Is Ashlar mixed-mode or compiled only?
> Of course, I'd really like to declare it's the JVM's problem and look
> forward longingly to the day when it's resolved there.
Indeed :) Of course right now we're just waiting for invokedynamic and
method handles to be fast fast fast, so it will be some time before
the JVM guys can come up for air and start considering new problems
like closure inlining.
- Charlie
There's the black art of runtime optimization :)
I actually reported a performance bug in invokedynamic a couple weeks
back that turned out to be Hotspot inlining the *wrong* logic.
Basically, it was inlining recursive calls to fib as far as possible
without inlining the other calls made in fib that were actually more
expensive (and more valuable to inline and optimize). Indeed, the
question of what *not* to inline is at least as important as the
question of what *to* inline.
As a rule of thumb I suppose I'd say you want to inline methods that
have the best cost/size ratio; basically, inline the smallest, hottest
methods first. I'd love to hear from the JVM vendors what rules their
compilers follow, though (and whether there are papers the rest of us
could read to understand them better).
- Charlie
Indeed... I was looking for better tooling to let me look at the code cache, and talked to Cliff Click about this a couple of weeks ago. His comment was that he was surprised people were interested in this subject, along with how the code cache was managed; the code cache was intended to be self-managing, thus no need for metrics. If anyone can point to better docs/tooling in this area.....
Kirk
ohhh, yes :)
> I actually reported a performance bug in invokedynamic a couple weeks
> back that turned out to be Hotspot inlining the *wrong* logic.
yes, I did read that... and from that I got the impression that the
inlining logic for MethodHandles is quite eager... which of course
could be used in this case... but it doesn't turn out well all the
time, I guess.
[...]
> As a rule of thumb I suppose I'd say you want to inline methods that
> have the best cost/size ratio; basically, inline the smallest, hottest
> methods first. I'd love to hear from the JVM vendors what rules their
> compilers follow, though (and whether there are papers the rest of us
> could read to understand them better).
I don't really know; I can only guess, because I'm really not that
deep into the JVM details. From what I've learned, I assume that in the
following example
> public static void doSomething(MyInterface x, Object[] args) {
>     x.call(args); // C_1
> }
> public static void doSomethingSmall() {...}
> MyInterface mi = new MyInterface() {
>     public void call(Object[] args) {
>         doSomethingSmall(); // C_2
>     }
> };
> doSomething(mi, new Object[0]); // C_3
doSomethingSmall at call site C_2 will be inlined. C_1 is megamorphic
(because it is called from many other places in many variations), so no
inlining there. But C_3 maybe could be. I don't know if the megamorphic
C_1 taints C_3, or if that does not matter at all... no idea.
What I assume is that the JVM has a kind of global concept of a callsite
being megamorphic or monomorphic. Even if C_3 would be inlined it would
not help C_2.
If - and I am being crazy here I guess, because I talk about things I
essentially have no idea of - if C_3 is inlined doesn't that make C_1
kind of monomorphic in a local sense? Wouldn't that mean that if at C_3
is inlined we can inline at C_2 and C_1 as well? Isn't that exactly what
we would need the JVM doing for "generalized code that calls code"?
Wouldn't that mean that if the JVM sees a megamorphic hot call site
that there could be maybe a transformation making it partially
monomorphic by isolating part of the call graph? Wouldn't that be a
strategy the JVM could implement in general, without special marker? Or
does the JVM do this already?
bye blackdrag
Speaking from what I know of Hotspot...
C_2 and C_3 will both likely inline their respective targets. They're
not virtual calls, and they're small.
C_1 will inline if only one or two unique types are provided to the
doSomething method (Hotspot can inline monomorphic and bimorphic call
sites, I believe). If a new type later comes in, the code will be
deoptimized and the inlining reversed.
Note that another of my "bugs" comes into play here: even if multiple
types implement MyInterface and share the same implementation code,
Hotspot will consider them separate types for profiling purposes, and
as a result even if you only ever have one implementation of "call" in
the system, you still might not get it to inline. I'd love to see that
fixed :)
> If - and I am being crazy here I guess, because I talk about things I
> essentially have no idea of - if C_3 is inlined doesn't that make C_1 kind
> of monomorphic in a local sense? Wouldn't that mean that if at C_3 is
> inlined we can inline at C_2 and C_1 as well? Isn't that exactly what we
> would need the JVM doing for "generalized code that calls code"? Wouldn't
> that mean that if the JVM means a megamorphic hot call site that there could
> be maybe a transformation making it partially monomorphic by isolating part
> of the call graph? Wouldn't that be a strategy the JVM could implement in
> general, without special marker? Or does the JVM do this already?
Again speaking from my non-expert understanding of Hotspot:
Even if C_3 inlines, C_1 remains megamorphic because Hotspot treats it
as a lone call site independent of calls around it. There's no concept
of "the C_1 call site when called from C_3".
This problem is actually rampant in JRuby (and probably Groovy too)
since we both call through generic pieces of code (e.g. our own call
site logic or dispatchers) that see dozens or hundreds of different
targets. It's the problem Cliff Click pointed out at the first JVMLS
when he profiled JRuby on Azul...our CachingCallSite, while reducing
call costs significantly, was preventing target methods from inlining
into callers. If I remember right, he described as a solution that
you'd have to be able to do repeat passes of type profiling, inlining,
type-profiling, inlining...to allow inlined results to themselves be
subject to a profiling pass. Hotspot does not do this. I do not know
about JRockit or J9, but I suspect that tiered compilers may be able
to implement this optimization more easily (since they can instrument
inlined code and re-optimize it based on that information).
- Charlie
It is disappointing. Cliff Click just posted an entry about this
issue, confirming that current JVMs don't do what you hoped they'd do
and discussing at least one strategy for solving "The Problem":
http://www.azulsystems.com/blog/cliff/2011-04-04-fixing-the-inlining-problem
Long story short...nope, JVMs don't do this right now. Pass a couple
functions into an iterator and JVMs won't inline those functions.
You're stuck with virtual calls and no visibility across that call
boundary to optimize the loop itself.
I'm sure this is one of the remaining major bottlenecks in JRuby. When
I switch to modes that profile at runtime and insert direct inlinable
invocations, JRuby perf improves 2-3x even on benchmarks we used to
think were bottlenecking on Fixnum object creation. In fact, I can get
JRuby to run nearly as fast as Java itself for equivalent work (i.e. a
Java-based Fixnum fibonacci versus a Ruby-based Fixnum fibonacci). If
inlining plain method calls has that kind of impact, imagine how much
performance we're losing by not being able to inline functions across
megamorphic iterator boundaries.
For my part, I will probably continue down the road of specializing at
the bytecode level, to allow more code to inline (at the cost of more
of the inlining budget, unfortunately) and doing runtime optimizations
as much as possible. Hopefully the JVMs will catch up to JRuby soon :)
- Charlie