After poking around the code for the past few days, I'm finally
starting to wrap my head around Rubinius. Here's a proposal for how to
add dynamic method inlining.
Goal
----
Dynamically inline receiver method/block bytecodes directly into the
sender method bytecodes, thus removing method dispatch overhead (which
includes doing the lookup, of course, as well as moving arguments and
return values on and off the stack, creating and retiring contexts,
etc.). This should speed things up just by itself, and it will set the
stage for future JITting by removing sends (which can't be JITted).
Immediate targets
-----------------
There are a couple immediate inlining targets I have in mind:
1. Methods that only use local variables (including arguments) and
global variables. Easy to inline anywhere.
2. Methods that only use instance variables or class variables, where
the sender is the same object as the receiver. Easy, and a common case
(helper functions).
3. Blocks that don't use any variables except block arguments. I don't
know how common this really is, but it's easy.
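For concreteness, a few purely illustrative Ruby examples of these cases
(all names made up):

# Case 1: only locals/arguments and globals; inlinable anywhere.
def add_and_count(a, b)
  sum = a + b
  $add_calls = ($add_calls || 0) + 1
  sum
end

class Record
  def initialize(name); @name = name; end

  # Case 2: only instance variables, called on self (a helper);
  # inlinable when the sender is the same object as the receiver.
  def normalized_name
    @name.to_s.strip.downcase
  end

  def label
    "record: #{normalized_name}"   # a send to self, a natural target
  end
end

# Case 3: a block that touches nothing but its own block arguments.
squares = [1, 2, 3].map { |x| x * x }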
Everything else is hard because the sender's context and the receiver's
context aren't the same, and the method accesses variables that depend
on the receiver's context.
Basic control flow
------------------
Here's how it would work. Each sendsite is a potential inline target. At
any execution point with a sendsite, if the hit count is high enough,
the VM may decide to inline the send into the CompiledMethod it's
currently executing.
(Note that I'm using send count and not cpu time spent in the receiver
method. I think this is the right metric. What we can speed up through
inlining is only the mechanics of the send itself, so that's the
important feature.)
If the VM does decide to inline, it will stop executing, pass the
receiver method, the sender method, the bytecode position, etc. to a
(Ruby) inliner, which will produce a new version of the sender method
with the receiver inlined (or return nil if it doesn't want to.) These
bytecodes will be stored in the sender CompiledMethod object alongside
the original bytecodes. Further inline operations may replace them, but
the original, unmodified bytecodes will always be preserved for when we
have to deoptimize. (Note that the inlined version still contains the
original sendsite. Details below.)
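As a rough sketch of that handoff (INLINE_THRESHOLD, SendSite#hits,
receiver_method, inlined_bytecodes and Inliner.inline are all names I'm
assuming for illustration, not existing API):

# Stub standing in for the (Ruby) inliner described above.
module Inliner
  def self.inline(sender_cm, receiver_cm, ip)
    nil  # returning nil means we declined to inline
  end
end

INLINE_THRESHOLD = 1_000  # illustrative hit count

def maybe_inline(sender_cm, send_site, ip)
  return unless send_site.hits > INLINE_THRESHOLD
  new_bytecodes = Inliner.inline(sender_cm, send_site.receiver_method, ip)
  # Keep the inlined copy alongside the untouched original bytecodes,
  # so there is always something to fall back to on deoptimization.
  sender_cm.inlined_bytecodes = new_bytecodes if new_bytecodes
end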
If the receiver method gets overridden, we have to deoptimize. We do
basically what the sendsites do with selectors: not only does each
sendsite cache get flushed, but the enclosing CompiledMethod inline
version gets discarded. (I can think of fancier variants of this
procedure, but this is the simplest.) Then we're back to our original
state of having no inlining at all.
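In sketch form, the deoptimization step would be something like this
(again, the method and field names are made up):

def deoptimize(compiled_method)
  # Flush every sendsite cache in the method, as already happens for
  # selectors, and discard the inlined copy; the originals take over.
  compiled_method.send_sites.each { |site| site.flush_cache }
  compiled_method.inlined_bytecodes = nil
end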
How a method gets inlined
-------------------------
The basic idea in inlining a method is that we replace the send_X (or,
in the case of blocks, the meta_send_call) opcode with a preamble and a
translation of the receiver method.
The preamble will basically do a validity check and fall back to the
original send mechanics if it fails. So for case #2 above, where the
receiver method uses instance or class variables, the preamble will look
something like this (assuming the receiver object is at the top of the
stack):
dup_top
push_self
equal
goto_if_true doit
<original send mechanics>
goto done
doit: pop
<translated method>
done: <...>
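In plain Ruby terms the guard is just an identity check before taking
the inlined path; a toy rendering (class and method names made up):

class Example
  def helper
    42                        # the method we would inline
  end

  def call_with_guard(receiver)
    if receiver.equal?(self)
      42                      # <translated method>: helper's body copied in
    else
      receiver.helper         # <original send mechanics>
    end
  end
end

e = Example.new
e.call_with_guard(e)            # fast, inlined path
e.call_with_guard(Example.new)  # falls back to a real send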
If the receiver method only uses local variables (case #1 above), then
the preamble is more complicated, because we're checking for type
equality instead of instance equality. We'll take the sendsite cached
class as the receiver class. This might actually be a metaclass, so we
have to be a little careful in the check. Assuming we're sending to an
object at the top of the stack, and the SendSite cached class is Klass,
the preamble will be something like this:
dup_top
push_const :Klass
instance_of
goto_if_true doit
open_metaclass
push_const :Klass
equal
goto_if_true doit
<original send mechanics>
goto done
doit: pop
<translated method>
done: <...>
(Or is there a better way to do this comparison? Open_metaclass seems
like it actually creates the metaclass if it isn't there. But
instance_of doesn't check the metaclass...)
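For illustration, here's the case #1 check rendered as Ruby. Note that
writing the metaclass comparison this way forces the metaclass into
existence, which is exactly the concern above; the real bytecode (or a
dedicated instruction) would need to avoid that:

def case_one_guard?(receiver, klass)
  receiver.instance_of?(klass) ||
    (class << receiver; self; end).equal?(klass)
end

case_one_guard?("abc", String)  # => true, take the inlined path
case_one_guard?([], String)     # => false, fall back to the original send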
And something similar for case #3.
What's nice about this preamble is that if the dispatch IS to multiple
target types, this will build up an if-then-else-else-else sequence of
typechecks a la polymorphic inline caches, because the <original send
mechanics> section will contain the SendSites, and further inlining will
nest those preambles right in there.
So that's the preamble. The translation for an inlined method will be as
follows:
Method arguments are already at the top of the stack, so we'll pop them
off into new locals (namespaced somehow) and rewrite the references to
them, and discard the set_local_from_fp calls. (Maybe there'll be an
obvious way to detect and collapse the pushing and immediate popping.)
References to instance variables, class variables, and global variables
can all stay the same, since we're guaranteed that if these occur, the
sender is the same instance as the receiver.
And finally, branch targets have to be updated by adding the bytecode
offset to them.
And I think that's it!
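A very rough sketch of that translation pass, using a made-up
[opcode, argument] representation and assuming the inlinee's locals are
simply shifted into fresh slots of the sender (one possible way of
namespacing them):

def translate_inlinee(instructions, local_offset, ip_offset)
  instructions.map do |op, arg|
    case op
    when :push_local, :set_local
      [op, arg + local_offset]        # move into the sender's fresh slots
    when :goto, :goto_if_true, :goto_if_false
      [op, arg + ip_offset]           # branch targets shift with the code
    else
      [op, arg]
    end
  end
end

translate_inlinee([[:push_local, 0], [:goto_if_true, 6]], 4, 12)
# => [[:push_local, 4], [:goto_if_true, 18]]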
Data model modifications
------------------------
CompiledMethod:
- add an inlined-method field for bytecodes which have had some amount of
inlining done to them.
SendSite:
- Add an inline_attempted? flag?
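In sketch form (the exact field names here are illustrative):

class CompiledMethod
  # nil until some inlining has happened; the originals are never touched
  attr_accessor :inlined_bytecodes
end

class SendSite
  # so we don't keep retrying a site we've already given up on
  attr_accessor :inline_attempted
end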
Other minutia
-------------
If the receiver method is an AccessVarMethod or a DelegatedMethod, we'll
special-case that transformation. It should be pretty easy.
Current questions/problems
--------------------------
1. Backtraces. They'll be compressed by this.
2. Is there a good way to namespace the inlined locals so that they
don't conflict with the sender method locals?
3. How does one transition from the VM to a specific bit of Ruby code? I
mean, I know how you *could* do this, but does this happen at all
right now? There's some kind of cpu_perform_hook thing; is that
designed for this purpose?
Comments, suggestions, etc. appreciated. Especially if there's anything
obvious I've overlooked!
--
William <wmo...@masanjin.net>
I really like this one. It's simple to implement, and simple to check.
You've listed it as the first example, and I think it should be the
first case we tackle.
For this, I think we should introduce another bytecode instruction to
assist. Given the class and the object, it can perform the required
checks internally, to avoid creating the metaclass.
In addition, in the new VM, I'm working on code that lets the VM
determine type equivalence much more easily. The new instruction can use
this new mechanism as we switch to the new VM.
>
>
> And something similar for case #3.
>
> What's nice about this preamble is that if the dispatch IS to multiple
> target types, this will build up an if-then-else-else-else sequence of
> typechecks a la polymorphic inline caches, because the <original send
> mechanics> section will contain the SendSites, and further inlining
> will
> nest those preambles right in there.
>
> So that's the preamble. The translation for an inlined method will
> be as
> follows:
>
> Method arguments are already at the top of the stack, so we'll pop
> them
> off into new locals (namespaced somehow) and rewrite the references to
> them, and discard the set_local_from_fp calls. (Maybe there'll be an
> obvious way to detect and collapse the pushing and immediate popping.)
It's easy enough for us to rewrite access to locals to access via a
Tuple instead. In the method we're inlining into, we could enlarge
its regular locals by one, and stick the Tuple that is the locals for
the inlined method in there. Meaning:
push_local 0
Becomes
push_local 4
tuple_at 0
I've just made up the tuple_at instruction to assist in this. Could
also just be
push_int 0
push_local 4
send at 1
It should be noted that in the C++ VM, argument assignment is handled
by the VM. That doesn't change this scheme, just wanted to mention it.
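To make that concrete, a tiny illustration using a plain Array in place
of Rubinius's Tuple (slot numbers and values are made up):

inlined_locals = Array.new(2)      # the inlined method's own locals
outer_locals   = [10, 20, 30, 40, inlined_locals]  # extra slot at index 4
inlined_locals[0] = 99             # a "set_local 0" inside the inlinee
outer_locals[4][0]                 # "push_local 4; tuple_at 0"  => 99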
>
>
> References to instance variables, class variables, and global
> variables
> can all stay the same, since we're guaranteed that if these occur, the
> sender is the same instance as the receiver.
>
> And finally, branch targets have to be updated by adding the bytecode
> offset to them.
>
> And I think that's it!
>
> Data model modifications
> ------------------------
>
> CompiledMethod:
> - add a inlined method field for bytecodes which have had some
> amount of
> inlining done to them.
>
> SendSite:
> - Add a inline_attempted? flag?
I think this scheme fits very nicely with the new VM's backend
classes. In the new VM, each CompiledMethod object is backed by a
VMMethod C++ object, which contains VM specific data about the
CompiledMethod. It's here that the inlined version could live.
>
>
> Other minutia
> -------------
>
> If the receiver method is a AccessVarMethod or a DelegatedMethod,
> we'll
> special-case that transformation. It should be pretty easy.
Yeah, these also fit very nicely.
>
>
> Current questions/problems
> --------------------------
>
> 1. Backtraces. They'll be compressed by this.
This is something we'll have to fix. If backtraces don't show any form
of inlining occurred, the backtrace will be useless because it will
jump between 2 unrelated methods (and the line numbers will be wrong).
As one potential solution, we could maintain a map of ip_region to
inlined CompiledMethod. So if a method B is inlined into A, say
starting at ip 12, we'll know how much bytecode from B we put into A
(say 20 new instructions), so the map would contain:
[12, 20, B]
We'll just need to make our Backtrace class aware of these maps.
Something else to consider is multi-level inlining, i.e. if method C
was inlined into method B, and then B into A. Our map can still handle
this by being sorted most-specific-sequence-first. Say C was just
2 instructions into B at an offset of 3; A's map would contain:
[15, 17, C]
[12, 20, B]
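A hypothetical lookup against such a map, assuming entries of the form
[start_ip, instruction_count, method] sorted innermost first (the
numbers below are illustrative, not the ones above):

INLINE_MAP = [
  [15, 2,  "C"],   # C's instructions, inlined by way of B
  [12, 20, "B"],   # B's inlined region inside A
]

def inlined_source_for(ip)
  hit = INLINE_MAP.find { |start, count, _| ip >= start && ip < start + count }
  hit ? hit.last : "A"    # otherwise it's plain A bytecode
end

inlined_source_for(16)  # => "C"
inlined_source_for(13)  # => "B"
inlined_source_for(5)   # => "A"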
>
> 2. Is there a good way to namespace the inlined locals so that they
> don't conflict with the sender method locals?
See my scheme above.
>
> 3. How does one transition from the VM to a specific bit of Ruby
> code? I
> mean, I know how you *could* do this, but does this happen at all
> right now? There's some kind of cpu_perform_hook thing; is that
> designed for this purpose?
This is the most tricky part. Being stackless, there is no easy way to
interrupt Ruby code with other Ruby code. One trick we could use is to
run the Inliner on a separate Thread, using a Channel to wake it
and suspending the current Thread while the Inliner goes to work.
Using perform hook is an option too, but if we do this, we'd have to
actually execute the inlined version manually as the last thing we do
in the Inliner. Obviously future executions would run directly, but
the first time the Inliner would do the work, then execute the new
version itself. The backtrace in this case would include at least one
context of the inline, but that's ok.
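A rough sketch of the separate-thread idea, using standard Thread and
Queue in place of Rubinius's Thread and Channel (the request payload and
all names are illustrative):

require 'thread'

requests = Queue.new

Thread.new do
  loop do
    sender_cm, receiver_cm, ip, reply = requests.pop
    # ... run the Ruby inliner here and install the result ...
    reply.push(:done)              # wake the suspended sender
  end
end

# At a hot sendsite, hand off the work and block until it's finished:
reply = Queue.new
requests.push([:sender_cm, :receiver_cm, 12, reply])
reply.pop                          # suspend while the Inliner goes to work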
>
>
> Comments, suggestions, etc. appreciated. Especially if there's
> anything
> obvious I've overlooked!
Great work! I can't wait 'til we get started!
There's actually a second caveat to this I just thought of, which is
that there can be no references to self. Which means if you have
something like:
class C
  def a; 4 end
  def b; a + 5 end
end

def c
  x = C.new
  10000.times { x.b }
end
Then you have to wait for a to be inlined into b before you can inline b
into c. Which is fine; it just means we have to be careful to clear the
tried_inlining? flag in the sender method if we inline something into
it, because doing so might make it suddenly eligible for inlining.
Details!
> > dup_top
> > push_self
> > equal
> > goto_if_true doit
> > <original send mechanics>
> > goto done
> > doit: pop
> > <translated method>
> > done: <...>
>
> I really like this one. It's simple to implement, and simple to check.
> You've listed it as the first example, and I think it should be the
> first case we tackle.
Ok.
> > dup_top
> > push_const :Klass
> > instance_of
> > goto_if_true doit
> > open_metaclass
> > push_const :Klass
> > equal
> > goto_if_true doit
> > <original send mechanics>
> > goto done
> > doit: pop
> > <translated method>
> > done: <...>
> >
> > (Or is there a better way to do this comparison? Open_metaclass seems
> > like it actually creates the metaclass if it isn't there. But
> > instance_of doesn't check the metaclass...)
>
> For this, I think we should introduce another bytecode instruction to
> assist. Given the class and the object, it can perform the required
> checks internally, to avoid creating the metaclass.
Sounds good. I guess it would be something that's halfway between
kind_of and instance_of, that does check the metaclass but doesn't
traverse the inheritance hierarchy. A basically_kind_of. :)
> It's easy enough for us to rewrite access to locals to access via a
> Tuple instead. In the method we're inlining into, we could enlarge
> its regular locals by one, and stick the Tuple that is the locals for
> the inlined method in there. Meaning:
>
> push_local 0
>
> Becomes
>
> push_local 4
> tuple_at 0
That's very nice. No namespace issues at all.
> > SendSite:
> > - Add a inline_attempted? flag?
>
> I think this scheme fits very nicely with the new VM's backend
> classes. In the new VM, each CompiledMethod object is backed by a
> VMMethod C++ object, which contains VM specific data about the
> CompiledMethod. It's here that the inlined version could live.
Ok.
> > 1. Backtraces. They'll be compressed by this.
>
> This is something we'll have to fix. If backtraces don't show any form
> of inlining occurred, the backtrace will be useless because it will
> jump between 2 unrelated methods (and the line numbers will be wrong).
>
> As one potential solution, we could maintain a map of ip_region to
> inlined CompiledMethod.
That sounds fine. It gets a little hairy with the line numbers in there
too but it's workable.
> > 3. How does one transition from the VM to a specific bit of Ruby
> > code? I mean, I know how you *could* do this, but does this
> > happen at all right now? There's some kind of cpu_perform_hook
> > thing; is that designed for this purpose?
>
> This is the most tricky part. Being stackless, there is no easy way to
> interrupt Ruby code with other Ruby code. One trick we could use is to
> run the Inliner on a separate Thread, using a Channel to wake it
> and suspending the current Thread while the Inliner goes to work.
>
> Using perform hook is an option too, but if we do this, we'd have to
> actually execute the inlined version manually as the last thing we do
> in the Inliner. Obviously future executions would run directly, but
> the first time the Inliner would do the work, then execute the new
> version itself. The backtrace in this case would include at least one
> context of the inline, but that's ok.
I don't entirely understand the perform_hook thing, but what seems like
a good option would be to take the kernel-interrupt-style top half vs
bottom half approach, where (and maybe this is what you were suggesting)
if the VM wants to inline a call, it executes the inlining code
*instead* of executing the sender method. The inlining code (which is
passed both sender and receiver methods) finishes by calling a primitive
which does the rest of the VM-internal work: stores the new inlined
bytecodes in the CompiledMethod, pops a context to get out of the
inliner method, and then executes the new improved method.
I don't understand the internals of the system well enough to make an
informed decision here though.
--
William <wmo...@masanjin.net>
Actually I don't think it has to be this complicated. Just have a
primitive that assigns a bytecode sequence to a CompiledMethod's
"inlined bytecode" field, and treat the inliner as a regular old method
call that happens immediately before the sendsite executes. The inliner
does its work, possibly calls that primitive as the last operation, then
returns, and execution continues in the (uninlined) method body. The
next time that method's called, the inlined version will be executed.
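In sketch form (Inliner, build_inlined_bytecodes and
install_inlined_bytecodes are hypothetical names; the install call
stands in for that primitive):

class Inliner
  # Called like an ordinary method just before the hot sendsite executes.
  def self.run(sender_cm, receiver_cm, ip)
    bytecodes = build_inlined_bytecodes(sender_cm, receiver_cm, ip)
    sender_cm.install_inlined_bytecodes(bytecodes) if bytecodes
    nil  # return; this call still falls through to the uninlined body
  end

  def self.build_inlined_bytecodes(sender_cm, receiver_cm, ip)
    nil  # placeholder for the translation described earlier in the thread
  end
end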
--
William <wmo...@masanjin.net>