Not to sound like a Jackie Chan cartoon or anything, but...
If we go MMD all the way, we can skip the bytecode->C->bytecode transition for MMD functions that are written in parrot bytecode, and instead dispatch to them like any other sub.
Not to make this sound good or anything, of course. :-P -- Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
Dan Sugalski <d...@sidhe.org> wrote:
> If we go MMD all the way, we can skip the bytecode->C->bytecode
> transition for MMD functions that are written in parrot bytecode, and
> instead dispatch to them like any other sub.
Not really. Or not w/o significant overhead for MMD functions implemented in C. Opcodes like C<invoke> that branch somewhere get special treatment in the JIT (and other) run cores. These instructions are a branch source, and the next instruction is a branch target. This means that all CPU registers must be flushed to Parrot's register file and reloaded on the next instruction.
Prederefed run cores have to recalculate their program counter relative to the prederefed code.
So I'd rather not do that. I expect most of the functions being executed are implemented in C and not in PASM/PIR. Operator overloading has to have some cost :)
>Dan Sugalski <d...@sidhe.org> wrote:
>> If we go MMD all the way, we can skip the bytecode->C->bytecode
>> transition for MMD functions that are written in parrot bytecode, and
>> instead dispatch to them like any other sub.
>Not really. Or not w/o significant overhead for MMD functions
>implemented in C.
Well... about that. It's actually easily doable with a bit of trickery. We can either:
1) Mark the overload subs as special and change their calling conventions
2) Wrap the overload subs in some bytecode that Does The Right Thing--takes a continuation, pushes the registers to the stack, then calls the overload sub--when we add them to the MMD table.
-- Dan
Dan Sugalski <d...@sidhe.org> wrote:
> 1) Mark the overload subs as special and change their calling conventions
Different calling conventions are not really pleasant for the compiler(s). But doable.
> 2) Wrap the overload subs in some bytecode that Does The Right
> Thing--takes a continuation, pushes the registers to the stack, then
> calls the overload sub--when we add them to the MMD table.
That has the same cost + overhead as the current scheme, which is just that wrapper in C.
3) Inspect the delegated method or MMD sub and save only the needed register range. E.g. if an MMD sub doesn't use I and N registers, only S[0]..P[31] needs saving. That reduces memcpy cost by 3/5. It doesn't work when the sub calls another sub, of course, but for simple functions it'll work.
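The 3/5 figure can be checked with a little arithmetic. This is a minimal sketch, assuming a 32-bit build where I, S and P registers take 4 bytes each and N registers take 8 bytes. Those sizes are an assumption, chosen because they reproduce the 128-byte and 640-byte figures quoted elsewhere in this thread; the names are illustrative and not Parrot's.

```python
from fractions import Fraction

REGS_PER_BANK = 32
# Assumed register sizes in bytes (32-bit build); not taken from Parrot's headers.
REG_SIZE = {"I": 4, "N": 8, "S": 4, "P": 4}


def bytes_to_save(banks):
    """Bytes copied when the given register banks are saved in full."""
    return sum(REGS_PER_BANK * REG_SIZE[bank] for bank in banks)


full = bytes_to_save("INSP")    # save every bank
no_i_n = bytes_to_save("SP")    # the sub uses no I or N registers
saving = Fraction(full - no_i_n, full)

print(full, no_i_n, saving)     # prints: 640 256 3/5
```

Under the same assumption, a sub that touches only integer registers needs `bytes_to_save("I")`, i.e. 128 bytes.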
Dan Sugalski <d...@sidhe.org> wrote:
> Well... about that. It's actually easily doable with a bit of
> trickery. We can either:
I have trickery number 4) here. Dunno if it's doable, but worth considering IMHO:
Here is mmd.pasm (using bxor but substitute any math/bitwise/... op). Comments inline.
_main:
    new P16, .PerlInt
    new P17, .PerlInt
    new P18, .PerlInt
    set P17, 0b101
    set P18, 0b100
    # might call a PASM sub or not, who knows
    bxor P16, P17, P18
    #
    print P16
    print "\n"
>Dan Sugalski <d...@sidhe.org> wrote:
>> Well... about that. It's actually easily doable with a bit of
>> trickery. We can either:
>I have trickery number 4) here. Dunno if its doable, but worth
>considering IMHO:
It's doable but the problem you run into is that if you can't be sure that you're going to see a MMD-able PMC you need to do this everywhere, just to be sure. Since generally we're not going to be able to tell (joys of dynamic library loading) it'd mean we'd need to emit that code all the time. And if the binary ops always expand, we might as well make the compact versions just do the MMD stuff. -- Dan
Leopold Toetsch <l...@toetsch.at> wrote:
> 3) Inspect the delegated method or MMD sub and save only the needed
> register range.
Have this now running here locally and tested:
$ ./bench -b=^over
Numbers are relative to the first one. (lower is better)
          p-j-Oc  p-C-Oc  perl-th  perl  python  ruby
overload  100%    126%    300%     257%  -       -
This[1] doubles the overload benchmark performance. The "my_mul" function uses (changes) only integer regs, so 128 bytes are saved now instead of 640. Object vtable method delegation will also be faster.
Cachegrind reports these numbers:
CVS:
  I refs:    1,070,689,038
  D refs:      666,860,918
  D1 misses:     4,030,633
now:
  I refs:      464,016,706
  D refs:      316,245,506
  D1 misses:     1,530,816
Cache misses are still too high.
perl 5.8.0:
  I refs:    1,189,527,716
  D refs:      724,919,542
  D1 misses:        24,844
leo
[1] not alone. dod_register_pmc() of the return continuation in Parrot_runops_fromc() isn't really necessary. The old continuation is on the CPU stack. The passed continuation is in the registers. I was a bit too pessimistic, when coding this.
Dan Sugalski <d...@sidhe.org> wrote:
> At 11:35 AM +0200 4/30/04, Leopold Toetsch wrote:
>>Dan Sugalski <d...@sidhe.org> wrote:
>>> If we go MMD all the way, we can skip the bytecode->C->bytecode
>>> transition for MMD functions that are written in parrot bytecode, and
>>> instead dispatch to them like any other sub.
>>Not really. Or not w/o significant overhead for MMD functions
>>implemented in C.
> Well... about that. It's actually easily doable with a bit of
> trickery. We can either:
This still doesn't work. Function calls just look different than "plain" opcodes like "add Px, Py, Pz".
- it's not known if C<add> calls a PASM subroutine
- if it calls a PASM routine, registers have to be preserved. Which registers depends on the subroutine that actually gets called (ok, this information - which registers are changed by the sub - can be attached to the Sub's metadata)
- every opcode that possibly branches has a significant overhead for JIT and prederefed run cores: they must recalculate their PC from a byte code PC to a run loop PC.
Changing C<add> or any MMDed opcode to look like a branch is a severe performance impact for the non-overloaded case.
WRT performance: You can set
#define SAVE_ALL_REGS 0
in interpreter.c:912. This checks the sub's register usage and saves only the needed registers, e.g. only 128 bytes instead of 640 for the overload benchmark. This makes MMD calls via Parrot_runops faster than a plain function call + the check, if the operation is actually overloaded.
WRT continuations:
- It's highly unlikely that one would like to (ab)?use this functionality, i.e. take a continuation from an overloaded PASM and branch elsewhere.
- If we really need this "feature" it is doable. On each (re)entering of the run loop a Parrot_exception is created. We would need a run loop nesting level in the continuation. When a continuation is invoked and the nesting level differs, we could longjmp(3) until we reach the old nesting level and then resume at the continuation offset.
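The nesting-level idea can be modelled in a few lines. This is only a toy sketch in Python, not Parrot's C code: a Python exception stands in for longjmp(3), and every name in it (UnwindTo, Continuation, run_loop, demo) is made up for the illustration.

```python
class UnwindTo(Exception):
    """Stand-in for longjmp(3): carries the target nesting level and offset."""

    def __init__(self, level, offset):
        super().__init__(level, offset)
        self.level = level
        self.offset = offset


class Continuation:
    def __init__(self, level, offset):
        self.level = level      # run loop nesting level where it was taken
        self.offset = offset    # where to resume in that run loop


def run_loop(level, program, trace):
    """Run a list of callables as the 'bytecode' of one run loop level."""
    pc = 0
    while pc < len(program):
        try:
            program[pc](level, trace)
            pc += 1
        except UnwindTo as jump:
            if jump.level != level:
                raise               # not our level yet: keep unwinding outward
            pc = jump.offset        # reached the old level: resume there


def demo():
    trace = []
    cont = Continuation(level=0, offset=2)

    def inner_sub(level, trace):
        trace.append(f"inner sub at level {level}")
        raise UnwindTo(cont.level, cont.offset)     # invoke the continuation

    def overloaded_op(level, trace):
        trace.append("overloaded op enters nested run loop")
        run_loop(level + 1, [inner_sub], trace)
        trace.append("rest of the C function")      # skipped by the unwind

    outer = [
        lambda level, trace: trace.append("start"),
        overloaded_op,
        lambda level, trace: trace.append("resumed at offset 2"),
    ]
    run_loop(0, outer, trace)
    return trace
```

In this model the continuation invoked inside the nested loop propagates through the level-1 loop and the pending C frame, and the level-0 loop picks it up and continues at the stored offset.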
>This still doesn't work. Function calls just look different then
>"plain" opcodes like "add Px, Py, Pz".
>- it's not known, if C<add> calls a PASM subroutine
>- if it calls a PASM routine, registers have to be preserved. Which
>  registers depend on the subroutine that actually gets called (ok, this
>  information - which registers are changed by the sub - can be attached
>  to the Sub's metadata)
>- every opcode that possibly branches has a significant overhead for JIT
>  and prederefed run cores: they must recalculate their PC from a byte
>  code PC to a run loop PC.
>Changing C<add> or any MMDed opcode to look like a branch is a severe
>performance impact for the non-overloaded case.
If the JIT structure makes it untenable, it doesn't work, and that's fine. I don't think it has to be quite as bad as it is now, but on the other hand the performance hit in general needed to make this work better is probably not worth it.
Something to keep in mind once we get more of the base PMC types implemented, and have more of an idea how much of the MMD code ends up being bytecode vs C. -- Dan
Dan Sugalski <d...@sidhe.org> wrote:
> At 12:02 PM +0200 5/5/04, Leopold Toetsch wrote:
>>Changing C<add> or any MMDed opcode to look like a branch is a severe
>>performance impact for the non-overloaded case.
> If the JIT structure makes it untenable, it doesn't work, and that's
> fine. I don't think it has to be quite as bad as it is now, but on
> the other hand the performance hit in general needed to make this
> work better is probably not worth it.
It's not that slow any more. Running overloaded PASM has no more overhead than calling a sub. My Pentium 600 runs 1E6 overloaded C<bxor> functions in 1.5 seconds. The overload benchmark is at 3 times the speed of perl 5.8.2 now. The SSE version of memcpy gave it a big boost on Pentiums. And there is still SSE2, which I can't test.
> Something to keep in mind once we get more of the base PMC types
> implemented, and have more of an idea how much of the MMD code ends
> up being bytecode vs C.
I don't expect much of the basic functionality being in PASM.
Leopold Toetsch <l...@toetsch.at> writes:
> This still doesn't work. Function calls just look different then
> "plain" opcodes like "add Px, Py, Pz".
> - it's not known, if C<add> calls a PASM subroutine
> - if it calls a PASM routine, registers have to be preserved. Which
>   registers depend on the subroutine that actually gets called (ok, this
>   information - which registers are changed by the sub - can be attached
>   to the Sub's metadata)
No, we're in caller saves remember. The registers that need saving are dependent on the caller. Since the registers used by a function at any point are statically determined, maybe add's signature could be altered to take an integer 'save flags' argument specifying which registers need to be preserved for the caller, then if MMD determines that the call needs to go out to a PASM function, the appropriate registers can be saved.
Piers Cawley <pdcaw...@bofh.org.uk> wrote:
> Leopold Toetsch <l...@toetsch.at> writes:
>> - if it calls a PASM routine, registers have to be preserved. Which
>>   registers depend on the subroutine that actually gets called (ok, this
>>   information - which registers are changed by the sub - can be attached
>>   to the Sub's metadata)
> No, we're in caller saves remember.
Ok, yes. But MMD and delegated functions are a bit different. The caller doesn't know that it's a caller. The PASM is run from the inside of the C code.
> ... The registers that need saving are
> dependent on the caller.
Not quite for this case. Or in theory yes, but... As calling the subroutine mustn't change the caller's registers, it's just simpler to save those registers that the subroutine might change.
> ... Since the registers used by a function at any
> point are statically determined, maybe add's signature could be altered
> to take an integer 'save flags' argument specifying which registers
> need to be preserved for the caller,
This has a performance penalty for the non-MMD case. I can imagine that overloaded MMD functions are simpler (in respect of register usage) than the caller's code. So it seems that saving what the MMD sub might change, on behalf of the caller, is just more effective.
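As a rough illustration of saving, on behalf of the caller, what the sub might change: the code that runs the PASM sub copies only the banks the sub is known to modify, runs it, and copies them back. This is a toy model in Python; the `changes` metadata and all names here are hypothetical and not Parrot's API.

```python
def new_registers():
    """A toy register file: four banks of 32 registers each."""
    return {bank: [0] * 32 for bank in "INSP"}


def run_overloaded(regs, sub, changes):
    """Run `sub` for an opcode, preserving only the banks it may change.

    `changes` is the set of banks the sub is known to modify
    (hypothetical metadata attached to the sub).
    """
    saved = {bank: list(regs[bank]) for bank in changes}    # partial save
    result = sub(regs)
    for bank, values in saved.items():                      # partial restore
        regs[bank] = values
    return result


def my_mul(regs):
    # Scribbles on integer registers only, like the sub in the overload benchmark.
    regs["I"][0] = 6
    regs["I"][1] = 7
    return regs["I"][0] * regs["I"][1]


caller_regs = new_registers()
caller_regs["I"][0] = 99                # a value the caller still needs
product = run_overloaded(caller_regs, my_mul, {"I"})
```

The caller never issues a save itself, yet from its point of view nothing it holds in registers has changed after the opcode returns.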
Leopold Toetsch <l...@toetsch.at> writes:
> This has a performance penalty for the non-MMD case. I can imagine that
> overloaded MMD functions are simpler (in respect of register usage) then
> the caller's code. So it seems that saving, what the MMD sub might
> change on behalf of the caller is just more effective.
But generating the save signature for a given sub is a compile time cost that only needs to be paid once for each sub and shoved on an I register (which could, of course, be standardized). An MMD sub with a PASM implementation simply looks at the appropriate register, saves the right stuff, sets up a return continuation and has the interpreter invoke it. Which leaves a correctly set up continuation chain and a PASM implementation which can do whatever the heck it likes, including making continuations, closures etc that can be returned to multiple times because it got invoked in the normal runloop.
The work has to be done either way, but by arranging things so that everything looks like caller saves (and so that there is no MMD barrier to continuations) just seems to make the most sense. BTW, if it's a continuation barrier does that also mean it's an exception barrier?
Piers Cawley wrote:
> Leopold Toetsch <l...@toetsch.at> writes:
>>Not quite for this case. Or in theory yes, but... As calling the
>>subroutine mustn't have any changes to the caller's registers, it's just
>>simpler to save these registers that the subroutine might change.
> But generating the save signature for a given sub is a compile time cost
> that only needs to be paid once for each sub and shoved on an I register
... once per sub per location where the sub is called from. But there isn't any knowledge that a sub might be called. So the cost is actually more per PMC instruction that might eventually run a PASM MMD. This is, when it's done right, or ...
> (which could, of course, be standardized).
Yes. saveall, which is really expensive.
> ... An MMD sub with a PASM
> implementation simply looks at the appropriate register, saves the right
> stuff, sets up a return continuation and has the interpreter invoke
> it.
Well, that's exactly how it works now, with a bit differing in the meaning of "right stuff" :)
> The work has to be done either way, but by arranging things so that
> everything looks like caller saves (and so that there is no MMD barrier
> to continuations) just seems to make the most sense. BTW, if it's a
> continuation barrier does that also mean it's an exception barrier?
It looks like caller saves. The saved range of registers can't change that view. Whether the caller or the called sub defines the saved register range in no way changes *how* registers are saved. They are saved by the C code that actually runs the PASM. And the PASM is run from C code. These are the "problems".
And WRT continuation barrier: I have already said: if we really need that (an opcode function "jumps" somewhere) then it's possible. On each entry to the run loop a setjmp(3) is done, which is also the base for throwing exceptions from within an opcode function. There are no barriers, AFAIK.
Leopold Toetsch <l...@toetsch.at> writes:
> Piers Cawley wrote:
>> Leopold Toetsch <l...@toetsch.at> writes:
>>>Not quite for this case. Or in theory yes, but... As calling the
>>>subroutine mustn't have any changes to the caller's registers, it's just
>>>simpler to save these registers that the subroutine might change.
>> But generating the save signature for a given sub is a compile time cost
>> that only needs to be paid once for each sub and shoved on an I register
> ... once per sub per location where the sub is called from. But there
> isn't any knowledge that a sub might be called. So the cost is actually
> more per PMC instruction that might eventually run a PASM MMD. This is,
> when its done right, or ...
No. Once per compilation unit. Stick it in a high register and keep it nailed there for the duration of the sub. Specify this register as part of the calling conventions; the right value will then get restored at any function return and there's no need to regenerate it.
Piers Cawley <pdcaw...@bofh.org.uk> wrote:
> Leopold Toetsch <l...@toetsch.at> writes:
[ calculating registers to save ]
>> ... once per sub per location where the sub is called from. But there
>> isn't any knowledge that a sub might be called. So the cost is actually
>> more per PMC instruction that might eventually run a PASM MMD. This is,
>> when its done right, or ...
> No. Once per compilation unit.
An example:
.sub foo
    # a lot of string handling code
    # and some PMCs
    $P0 = concat $P0, $S0   # <<< 1) calculate: save P, S here
    # now a lot of float code
    # no strings used any more
    # and no branch back to 1)
    $N1 = 47.11             # $N1's live starts here
    $P0 = $P1 + $N1         # <<< 2) calculate: save P, N regs
    $P2 = $P0 + $N1         # <<< 3) calculate: save P regs
    # no N reg used here
.end
At 1) the caller is not interested in preserving N-registers; these aren't used there. Saving everything the caller needs saved ends up as C<saveall> in non-trivial subroutines.
Using your proposal would need a lot of storage for the saved register ranges.
If the calculation is done based on the called subroutine, it's not unlikely that only a few registers have to be preserved, e.g. no N-registers for the overloaded C<concat> and no string registers for the overloaded C<add>.
This doesn't violate the principle of caller saves: all that needs preserving from the caller's POV is preserved.
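To make the caller-side bookkeeping concrete, here is a sketch of the two ways a 'save flags' value could be computed for the foo example: one mask per call site, or a single mask for the whole compilation unit. The flag names and bit layout are invented for illustration; only the per-site register sets come from the comments in the example.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class Save(IntFlag):
    """Hypothetical 'save flags' bits, one per register bank."""
    I = 1
    N = 2
    S = 4
    P = 8


# What the caller needs preserved at each MMD-able instruction in `foo`.
call_sites = {
    "1) concat": Save.P | Save.S,
    "2) add": Save.P | Save.N,
    "3) add": Save.P,
}

# One mask per compilation unit has to cover every call site.
per_unit_mask = reduce(or_, call_sites.values())

for site, mask in call_sites.items():
    extra = per_unit_mask & ~mask
    print(site, "per-site:", int(mask), "per-unit:", int(per_unit_mask),
          "saved needlessly:", int(extra))
```

The per-site variant stores three masks for three instructions, while the per-unit variant stores one mask that already covers three of the four banks for this small sub.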
Leopold Toetsch <l...@toetsch.at> writes:
> If the calculation is done based on the called subroutine, it's not
> unlikely that only a few registers have to be preserved, e.g. no
> N-registers for the overloaded C<concat> and no string registers for the
> overloaded C<add>.
> This doesn't violate the principle of caller saves: all that needs
> preserving from the caller's POV is preserved.
But under this scheme, the implementing function will have to do a saveall for every function it calls because it doesn't know what registers its caller cares about. And you're almost certainly going to want to call other functions to do the heavy lifting for all the usual reasons of code reuse. I can see a situation where you end up with
simply because you want to follow good coding practice. You're right that, in the limiting case, my 'fingerprinting' approach is going to reduce to a saveall, but the example you give could be broken up into
> But under this scheme, the implementing function will have to do a
> saveall for every function it calls because it doesn't know what
> registers its caller cares about. And you're almost certainly going
> to want to call other functions to do the heavy lifting for all the
> usual reasons of code reuse.
Yep that's true. As well as with real caller saves. Which leads back to my (almost) warnocked "proposal":
Subject: Register stacks again
Date: Sat, 08 May 2004
>>But under this scheme, the implementing function will have to do a
>>saveall for every function it calls because it doesn't know what
>>registers its caller cares about. And you're almost certainly going
>>to want to call other functions to do the heavy lifting for all the
>>usual reasons of code reuse.
>Yep that's true. As well as with real caller saves. Which leads back
>to my (almost) warnocked "proposal":
If you want to go back to a frame pointer style of register stack access, that's doable, but that's the way it was in the beginning and the performance penalties in normal code outweighed the savings in stack pushes.
If you want to try it again to see if things are different I don't care, so long as the semantics expressed to the bytecode programs don't change. It will invalidate all the current JIT code on all the platforms so it's a not-insignificant thing to do. I also don't think we've sufficient real code to judge performance, so I think it's a bit premature to worry about it. -- Dan
Dan Sugalski <d...@sidhe.org> wrote:
> If you want to go back to a frame pointer style of register stack
> access, that's doable, but that's the way it was in the beginning and
> the performance penalties in normal code outweighed the savings in
> stack pushes.
JITted memory access through the frame pointer is as fast as with absolute memory addresses. The same is likely true for gcc/CGP core, when we force the frame pointer being a CPU register.
> If you want to try it again to see if things are different I don't
> care, so long as the semantics expressed to the bytecode programs
> don't change. It will invalidate all the current JIT code on all the
> platforms so it's a not-insignificant thing to do.
That's the problem, yes.
> ... I also don't think
> we've sufficient real code to judge performance, so I think it's a
> bit premature to worry about it.
This is of course true, the more for changing it in the first place :)
What about issues with JIT and prederefed cores and multi-threading: currently we need to "recompile" all bytecode per thread.
Leopold Toetsch <l...@toetsch.at> writes:
> Piers Cawley wrote:
>> But under this scheme, the implementing function will have to do a
>> saveall for every function it calls because it doesn't know what
>> registers its caller cares about. And you're almost certainly going
>> to want to call other functions to do the heavy lifting for all the
>> usual reasons of code reuse.
> Yep that's true. As well as with real caller saves. Which leads back to
> my (almost) warnocked "proposal":
Consider a sub, call it fred, that calls other subs and only uses PMC registers. At compile time, you wrap those calls in appropriate pushtopp/poptopp pairs.
Then, at runtime, 'fred' gets set up as the implementation for an op.
Which, given your implementation, means that each function call that fred makes should be protected with savetop/restoretop pairs. Oops.
Piers Cawley <pdcaw...@bofh.org.uk> wrote:
> Then, at runtime, 'fred' gets set up as the implementation for an op.
> Which, given your implementation, means that each function call that
> fred makes should be protected with savetop/restoretop pairs. Oops.
The implementation checks register usage of the called sub at *runtime*, or more precisely at first invocation of the sub and caches the value. It would need a notification (similar to the method cache), if the sub got recompiled.
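A minimal sketch of that caching scheme, in Python rather than the interpreter's C: register usage is computed on the first invocation, cached on the sub, and dropped again when the sub is recompiled. The Sub class, the toy body format and the `scans` counter are all hypothetical and only serve to show the cache behaviour.

```python
class Sub:
    """Toy stand-in for a PASM sub with cached register usage."""

    def __init__(self, name, body):
        self.name = name
        self.body = body            # list of (opcode, [register names]) tuples
        self._used_banks = None     # filled in on first invocation
        self.scans = 0              # how often the body was actually inspected

    def used_banks(self):
        """Return the register banks the sub touches, scanning at most once."""
        if self._used_banks is None:
            self.scans += 1
            self._used_banks = frozenset(
                reg[0] for _opcode, regs in self.body for reg in regs
            )
        return self._used_banks

    def recompile(self, new_body):
        """Replace the body and invalidate the cached register usage."""
        self.body = new_body
        self._used_banks = None


sub = Sub("my_mul", [("mul", ["I0", "I1", "I2"])])
first = sub.used_banks()            # scans the body
second = sub.used_banks()           # served from the cache
sub.recompile([("concat", ["S0", "S1", "S2"])])
after_recompile = sub.used_banks()  # scans again after invalidation
```

Without the invalidation in `recompile`, the cached set would still claim that only integer registers are touched, and the string registers changed by the new body would not be saved.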