I see two ways to fix it:
1) use frame pointer relative addressing:
+ prederefed code is usable by different threads too
- ~4 times increase in code size of core_ops_*.{c,o} [1]
2) Re-prederef on function calls, if frame pointer differs
+ no impact on code size
- needs precise code length of functions
- threads need distinct prederefed code
- possibly slower than 1)
Comments welcome,
leo
[1] due to absolute addressing a constant argument and a register
argument have the same code, set_i_ic and set_i_i are the same.
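
The difference between the two addressing schemes can be sketched in C.
This is only an illustration, not the actual Parrot source: RegFrame,
int_reg_offset() and the *_rel function names are hypothetical, and
INTVAL is a stand-in typedef. With absolute addressing one op body serves
both register and constant operands; with frame-relative addressing the
register variant adds the frame base, so the constant variant needs its
own body.

```c
#include <stddef.h>
#include <stdint.h>

typedef long INTVAL;                              /* stand-in for Parrot's INTVAL */
typedef struct { INTVAL int_reg[32]; } RegFrame;  /* hypothetical register frame */

/* Byte offset of integer register n inside the (hypothetical) frame. */
size_t int_reg_offset(int n)
{
    return offsetof(RegFrame, int_reg) + (size_t)n * sizeof(INTVAL);
}

/* Absolute addressing: each prederefed slot holds an address.  A register
 * source and a constant source look identical here, so one body covers
 * both set_i_i and set_i_ic, but the code is tied to one frame. */
void set_abs(void **pc)
{
    *(INTVAL *)pc[1] = *(INTVAL *)pc[2];
}

/* Frame-relative addressing: register slots hold byte offsets, so the
 * same prederefed code can run against any frame pointer. */
void set_i_i_rel(void **pc, RegFrame *frame)
{
    char *base = (char *)frame;
    *(INTVAL *)(base + (uintptr_t)pc[1]) = *(INTVAL *)(base + (uintptr_t)pc[2]);
}

/* The constant operand is still an absolute address, hence a second body. */
void set_i_ic_rel(void **pc, RegFrame *frame)
{
    char *base = (char *)frame;
    *(INTVAL *)(base + (uintptr_t)pc[1]) = *(INTVAL *)pc[2];
}
```

Running the same relative code against two different frames touches each
frame's own registers, which is what makes it shareable between threads.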
Or 3) Toss the prederef stuff entirely.
--
Dan
--------------------------------------it's like this-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
Which might not be quite as bad as it sounds: on at least one "strange
platform" (IA64 HP-UX) the native C compiler gets the switch core
running faster than the prederef core! (!)
Duraid
Well, the prederefed function core (parrot -P) is for sure not
necessary. That still leaves CGP and the switched core, which are
prederefed too. CGP is by far the fastest run core for JIT-less
architectures, if CGoto is available. The switched core can of course
run w/o prederef too.
But one thing is nice with prederef: it's by far the simplest way to
create a safe run core that verifies opcode arguments. This could of
course be done w/o predereferencing afterwards, but while you are
checking function args, predereferencing these is of almost zero cost.
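
A minimal sketch of that idea, assuming a simplified bytecode where each
operand is a plain integer register number. The function name
prederef_int_args and the constant NUM_INT_REGS are hypothetical, not
Parrot's API. The range check and the translation to a frame offset
happen in the same pass:

```c
#include <stddef.h>
#include <stdint.h>

typedef long INTVAL;            /* stand-in for Parrot's INTVAL */
enum { NUM_INT_REGS = 32 };     /* assumed register count */

/* Translate n_args integer-register operands from raw bytecode into
 * frame byte offsets.  Returns 0 on success, -1 if any register number
 * is out of range, in which case the code must not be run. */
int prederef_int_args(const long *raw, size_t n_args, void **out)
{
    for (size_t i = 0; i < n_args; i++) {
        if (raw[i] < 0 || raw[i] >= NUM_INT_REGS)
            return -1;          /* invalid register number */
        /* The check is done; storing the offset costs one multiply. */
        out[i] = (void *)(uintptr_t)((size_t)raw[i] * sizeof(INTVAL));
    }
    return 0;
}
```

Once this pass has succeeded, the op bodies can use the stored offsets
without any further checks.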
Using option 1) above isn't really complicated. The problem we have is
code size and opcode count, which is a problem with the CGoto core too.
Not too long ago I proposed tossing all opcode variants with
constants and just leaving:
set I, Ic
set N, Nc
set S, Sc
Immediate constants aren't really that useful with RISC CPUs. You might
have a look at e.g. jit/arm/jit_emit.h:459 ff.
leo
Err, the switched core *is* a prederefed core.
> Duraid
leo
I've now committed this case 1) as a fix for prederefed run cores. It's
unoptimized currently. make fulltest is passing again here.
>> Or 3) Toss the prederef stuff entirely.
>
>
> Well, the prederefed function core (parrot -P) is for sure not
> necessary.
Patches welcome to remove the plain prederefed function core
F<ops/core_ops_prederef.*>. F<lib/Parrot/OpTrans/CPrederef.pm> is still
needed as an abstract base class of CGP.pm and CSwitch.pm but can be
cleaned up too.
I'd still like to keep the CGP and CSwitch run cores. The latter as the safe
run core with argument checking and as a fallback, if CGOTO isn't
available on that platform. The former as an extension for JIT to run
non-JITted opcodes. Similar to the current JIT_CGP stuff on i386, but in
a more general way:
For a sequence of non-JITted opcodes: create a copy of the byte-code of
these non-JITted opcodes and append one opcode that returns to JIT. Then
fill it with the CORE_ops_prederef__ opcode. Generate code to call this
piece of code via cgp_core().
leo
While I want to keep the switch core, I'm still not seeing the need
for prederef with it. I'm presuming this crept in at some point and
just needs un-creeping?
> While I want to keep the switch core, I'm still not seeing the need
> for prederef with it. I'm presuming this crept in at some point and
> just needs un-creeping?
Using prederef for switch has one advantage: it's a bit faster. Before
the indirect register addressing it had another one: it took only 1/4th
of code size because of the collapsing of constant and register variants
into one switch case.
There is of course no need to prederef the switched core.
Maybe benchmarking the two variants yields a final answer.
leo
> Or 3) Toss the prederef stuff entirely.
And here is why I want to keep the CGP core:
sub_i_i_i
0x81bbef0 <cgp_core+33488>: mov 0x4(%esi),%ecx
0x81bbef3 <cgp_core+33491>: mov 0x8(%esi),%edx
0x81bbef6 <cgp_core+33494>: mov 0xc(%esi),%eax
0x81bbef9 <cgp_core+33497>: add $0x10,%esi
0x81bbefc <cgp_core+33500>: mov (%eax,%edi,1),%eax
0x81bbeff <cgp_core+33503>: mov (%edx,%edi,1),%edx
0x81bbf02 <cgp_core+33506>: sub %eax,%edx
0x81bbf04 <cgp_core+33508>: mov %edx,(%ecx,%edi,1)
0x81bbf07 <cgp_core+33511>: jmp *(%esi)
if_i_ic
0x81b4152 <cgp_core+1330>: mov 0x4(%esi),%eax
0x81b4155 <cgp_core+1333>: cmpl $0x0,(%eax,%edi,1)
0x81b4159 <cgp_core+1337>: je 0x81b4167 <cgp_core+1351>
0x81b415b <cgp_core+1339>: mov 0x8(%esi),%eax
0x81b415e <cgp_core+1342>: mov (%eax),%eax
0x81b4160 <cgp_core+1344>: shl $0x2,%eax
0x81b4163 <cgp_core+1347>: add %eax,%esi
0x81b4165 <cgp_core+1349>: jmp *(%esi)
0x81b4167 <cgp_core+1351>: add $0xc,%esi
0x81b416a <cgp_core+1354>: jmp *(%esi)
%esi ... cur_opcode
%edi ... register frame pointer
A register access is 2 CPU instructions only:
mov 8(%esi), %edx # cur_opcode[2], i.e. offset of REG_INT(x)
mov (%edx, %edi, 1), %edx # get *(base + offset)
That's all.
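
In C terms, the sub_i_i_i listing corresponds roughly to the following
sketch. It is an illustration rather than the generated core_ops code:
the REG_AT macro is hypothetical, INTVAL is a stand-in typedef, and the
op is written as a function returning the next cur_opcode instead of
ending in a computed goto.

```c
#include <stdint.h>

typedef long INTVAL;   /* stand-in for Parrot's INTVAL */

/* Hypothetical helper: the register stored at byte offset `slot` from
 * the register frame pointer, i.e. *(base + offset). */
#define REG_AT(base, slot) \
    (*(INTVAL *)((char *)(base) + (uintptr_t)(slot)))

/* cur_opcode[0] is the op's own address; [1]..[3] are register offsets. */
void **sub_i_i_i(void **cur_opcode, void *base)
{
    REG_AT(base, cur_opcode[1]) =
        REG_AT(base, cur_opcode[2]) - REG_AT(base, cur_opcode[3]);
    return cur_opcode + 4;   /* the add $0x10,%esi on a 32-bit target */
}
```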
$ ./parrot -C mops.pasm
Iterations: 100000000
Estimated ops: 200000000
Elapsed time: 2.156002
M op/s: 92.764291
That's an Athlon 800. The loop averages 8.5 CPU instructions per Parrot
instruction: 9 for sub_i_i_i and 8 for the taken if_i_ic listed above.
leo
0x001048d4 <cgp_core+35652>: lwz r0,8(r30)
0x001048d8 <cgp_core+35656>: lwz r2,12(r30)
0x001048dc <cgp_core+35660>: lwzx r0,r27,r0
0x001048e0 <cgp_core+35664>: lwzx r2,r27,r2
0x001048e4 <cgp_core+35668>: lwz r9,4(r30)
0x001048e8 <cgp_core+35672>: subf r0,r2,r0
0x001048ec <cgp_core+35676>: stwx r0,r27,r9
0x001048f0 <cgp_core+35680>: lwzu r2,16(r30)
0x001048f4 <cgp_core+35684>: mtctr r2
0x001048f8 <cgp_core+35688>: bctr
Only slightly longer, because of the branch sequence, but also quite
compact.
leo