From other threads:
> Now we are placing arguments or return values in registers according
> to PDD03 and the other end has immediately access to the placed
> values, because the register file is in the interpreter.
> With the indirect addressing of the register frame, this argument
> passing is probably the most costly part of function calls and
> returns, because we have to copy the values.
and
>> Anyway, if you can pop both register frames -and- context structures,
>> you won't run GC too often, and everything will nicely fit into the
>> cache.
> I thought about that too, but I'd rather have registers adjacent, so
> that the caller can place function arguments directly into the callers
> in arguments.
> OTOH it doesn't really matter, if the context structure is in the
> frame too. We'd just need to skip that gap. REG_INT(64) or I64 is as
> valid as I0 or I4, as long as it's assured, that it's exactly
> addressing the incoming argument area of the called function.
A problem with this is that you can't be sure that you can actually
have the "next" frame of registers adjacent to the current frame--they
might already be taken. Imagine A calls B, then B creates a
continuation and stores it in a global, then returns. If A makes
another function call, it can't just use the adjacent register
frame--it's still "taken". (I'm referring here to the "sliding register
window" idea of course.) But this is reminding me of another idea I've
had.
I've been thinking for a while about another idea to decrease some of
the copying which we are now doing for function calls, as well as the
allocation of register sets during subroutine calls, and which would as
a side-effect allow us to reduce the need for register spilling.
First, a code snippet. Consider the following C code:
int a(int x) { return b(x); }
int b(int x) { return c(x + 1); }
int c(int x) { return x + 7; }
As compiled on the PPC, this code is very compact--each function is
implemented by just one or two instructions, and only one register is
used. There are two key factors here: (1) no register preservation is
needed, and (2) the call from a() to b() is just a branch, since the
call to a() has already loaded the registers with the correct values
for the call to b().
By my thinking, register usage falls into three basic categories:
1) Registers used for parameter passing and returns values for function
calls.
2) Registers used to hold values which need to be preserved across
function calls.
3) Registers used to hold values which do not need to be preserved
across function calls.
Really, (1) and (3) are similar--in either case, you have registers
whose values are allowed to change across function calls. (For register
which hold return values that's obvious, and for those used to pass
parameters, it seems fair that these be expected to change across a
function call.) So we really have two cases--volatile and non-volatile
register usage.
In the PPC ABI, half the registers are treated as volatile, and the
other half as non-volatile. For the PPC, this corresponds to
caller-preserved v. callee-preserved. Because of continuations[1],
parrot can't have callee-preserved registers, but it could still have
volatile v. non-volatile registers. Here's what I'm thinking:
Keep the old-scheme registers inside the interpreter structure, *as
well as* the new indirect registers. Call the registers in the
interpreter I0 to I31, and the indirect registers I32 to whatever.[2]
I'll call these the "direct" and "indirect" registers. By their very
nature, the direct registers are volatile, and the indirect registers
are non-volatile. What does this buy us? The following:
1) Parameter passing and return occur via the established calling
conventions, in the volatile registers. No extra copying is needed upon
function call or return--the volatile registers are physically the same
for the caller and the callee.
2) "Temporary" calculations can use the volatile registers. Values
which need preserving can use the non-volatile registers directly.
3) In functions which need no non-volatile registers, there's no need
to allocate the indirect registers at all.
4) Cases like my example code above can compile just as compactly as
the PPC case. With the volatile Parrot registers mapped to volatile CPU
registers (in the PPC case at least), this would be highly efficient:
you'd end up with asm similar to what you have in the C case above, and
you'd not have a need to save the CPU registers back to Parrot
registers (other than those used in the calling conventions) for Parrot
function calls out of JIT code.
5) Because this opens things up to more than 32 registers, code could
use as many registers as needed. You'd never have register spilling
per-se in order to accommodate local variables, though you'd still want
a smart register allocator which minimized the number of registers
needed (for memory locality as well as minimizing allocation). But a
very simple one-non-volatile-register-per-local would function
correctly, as a baseline implementation (correct but slow).
The main value here is in the ability to avoid allocation of indirect
registers in many cases, minimize copying, and simplify the register
allocation policy.
Further notes:
1) For the JIT case, correct code can be emitted to handle direct v.
indirect registers (relative to the interpreter pointer, or relative to
the new base pointers, depending on the register). For the other cores,
we'd either need to generate additional op variants to handle direct v.
indirect registers (probably lots more op variants), or we'd need to
have register-accessing code be conditional on the register number
(slow--lots of branches). The latter approach doesn't bother me for the
non-prederef cores (they're slow anyway, and mostly good for testing),
but I don't know enough about the prederef cores to see if we could do
something efficient, without an op explosion.
2) Since the number of indirect registers could vary from
function-to-function, we'd need a syntax to specify how may are needed
for a given function--probably an op to allocate the right number of
registers, *or* it could somehow be metadata kept as part of the sub
itself. With the op approach, for safety upon function call we could
swap into the indirect register pointer, a pointer to a zero-filled
read-only chunk of memory, to catch attempts to use indirect registers
before allocating any. But we'd still have to deal with the issue of
how to catch trying to access more registers than you actually
allocated. A compiler could ensure that this would never happen, but it
would feel dangerous if the VM didn't have a way to catch this, for
hand-generated assembly. (On the other hand, a bytecode verifier could
catch it.)
3) Really, only the direct registers are "real" registers,
philosophically--they're the only ones which live at fixed memory
locations. The indirect registers are fulfilling a purpose that other
VMs (or calling conventions) would use "the stack" for. By calling them
registers, we can use them directly for calculations, but still pretend
that we are RISC.
4) This may not be directly relevant to the "slow fib benchmark"
problem, *except* that for tail calls this does speed things up
because, by definition, you don't need any register preservation for
tail calls. This is true (and an advantage) even without actual
tail-call optimizations (ie, even if you have the usual frame overhead,
at least you don't need to allocate indirect registers). And I believe
there's a theorem that all recursive operation can be re-written to be
tail-recursive, so it's sort of relevant, though I guess what's
important about the fib benchmark isn't specifically that it's
recursive, but rather just that it creates a lot of frames (lots of
function calls, deep call depth).
Comments of course welcome, and encouraged. Thanks if you read all of
that--it ended up longer than I was expecting.
JEff
[1] Callee-preserved means "callee saved and then restored just before
return". With continuations, you can jump past the restoration code,
and ruin this strategy.
[2] These could be numbered starting from 32, or they could be referred
to via another notation indexed from 0--NVI0 for "non-volatile I
register 0", or something.