
[RfD] parrot run loops


Leopold Toetsch

Jan 30, 2003, 4:07:26 AM1/30/03
to P6I, Dan Sugalski
or our runloops are wrong
or deep core stuff

All run loops get a pointer to the Parrot byte code for execution.
This has several impacts on the runloop itself and on branching and
jumping between instructions.
Since Parrot PASM jumps are expressed in terms of opcodes (absolute or
relative), every runloop has its own calculation routines for
branches.

Proposal:

Runloops should get the opcode offset from the start of the byte code
as their parameter, not the actual address of the byte code to run.

Current typical run loop Proposed
------------------------------------------------------------------

while(pc)                           while(offs)
    DO_OP(pc, interpreter);             DO_OP(offs, interpreter);

Changing this would seem to prohibit a jump to byte code offset 0, as
that would end the run loop (or it would not even start normally :-)
Solution: instruction 0 in the opcode stream is always HALT(), and all
bytecode starts at offset 1.
Advantage for e.g. JIT: exiting the runloop is centralized at one
defined place.

Consideration for individual runloops:

runops_fast_core:

This is the standard run loop in the absence of
CGoto and is exactly the typical runloop above:

#define DO_OP(PC,INTERP) \
    (PC = ((INTERP->op_func_table)[*PC])(PC,INTERP))

The addressing in the op_func_table would need a change to

------------------------------------------------------------------
offs = Itp->f_tbl[*(code_start+offs)](offs, ..)

This seems more expensive, but compilers should convert it to some
base-indexed instruction; it probably depends on how code_start is set up[1].
E.g.

code_start = interpreter->code->base.data; // new syntax
while (offs)
offs = interp->func_table[*(code_start+offs)](offs, ..)

runops_slow_core:
As above.

CGoto (cg_core):
Similar:

------------------------------------------------------------------
goto *ops_addr[*cur_opcode]; goto *ops_addr[*(code+offs)];


runops_prederef:

Similar runloop, but it has to recalculate offsets back and forth
between the two code pointers it needs. The addressing of operands is
done relative to the PC and would then be relative to code_start - so
no change (in terms of cost) here:

(*(INTVAL *)cur_opcode[1]) = (*(INTVAL *)cur_opcode[2]);
return cur_opcode + 3;

runops_jit:

Addressing is either entirely internal or based on offsets, which have
to be recalculated every time external (non-JITed) code is called that
might cause a control flow change. This is similar to:

run_compiled:

Has a switch-based runloop with offsets, but a linear code
representation, i.e. consecutive instructions of one block don't go
through a runloop lookup; they are just appended (giving a bigger
runtime image but faster execution). It can't do forward jumps or
restart operations, including eval.

------------------------------------------------------------------
switch(cur_opcode - start_code) switch(offset)


Special: invoke vtable call

This too has a pointer to the byte code of the current run loop. Due
to the dynamic nature of pmc->vtable->invoke, this instruction
may need to go anywhere. This is currently done, e.g. for
eval, with nasty tricks (leaving the inner run loop, then switching
code segments and reentering the runloop).

Passing the offset of the branch address (and adjusting code_start)
would then be enough to go anywhere in the byte code.

[1] Execution speed:

I did a short test and changed runops_fast_core to simulate
addressing relative to code start:

#undef DO_OP
#define DO_OP(PC,INTERP) \
    (PC = ((INTERP->op_func_table)[*(_cs+PC)])(PC,INTERP))

opcode_t *
runops_fast_core(struct Parrot_Interp *interpreter, opcode_t *pc)
{
    int _cs = 0;    /* interpreter->code->base.data */
    while (pc) {
        DO_OP(pc, interpreter);
    }
    return pc;
}

There is *no* noticeable change in execution speed in the mops program.


Summary:

All addressing in PASM is done in terms of opcodes; addressing in the
runloops is done via absolute code pointers. This makes it necessary
to recalculate opcode offsets for each branch that leaves the runloop
and again after returning from such external code.

Changing the addressing scheme to opcode offsets relative to code
start would simplify all kinds of (non-local) control flow changes. As
real world programs mostly consist of such subroutine calls, these
would be simplified a lot (and would then probably not need to leave
the runloop ;-)

The "fast" run loops (compiled C and JIT) would take most advantage of
this change.

Comments welcome
leo

Steve Fink

Jan 31, 2003, 3:59:08 AM1/31/03
to Leopold Toetsch, P6I
I don't really know a whole lot about this area, but I remember I was
surprised the first time I looked at this and discovered it was based
on pointers instead of offsets. I assumed there was some good reason
for it that I didn't know at the time (eg performance), but now I
doubt that. Your way seems much better to me. (It makes debugging
slightly easier too.) Have you tried it on a more RISC-y machine where
any performance loss might be more noticeable? I tried the change you
suggested on a PPC, and saw no speed difference.

Leopold Toetsch

Jan 31, 2003, 4:50:13 AM1/31/03
to Steve Fink, P6I
Steve Fink wrote:

> I don't really know a whole lot about this area, but I remember I was
> surprised the first time I looked at this and discovered it was based
> on pointers instead of offsets. I assumed there was some good reason
> for it that I didn't know at the time (eg performance), but now I
> doubt that. Your way seems much better to me. (It makes debugging
> slightly easier too.)


Yep

> ... Have you tried it on a more RISC-y machine where


> any performance loss might be more noticeable? I tried the change you
> suggested on a PPC, and saw no speed difference.


No, I just did this quick test on 2 machines (1 Athlon, 1 Pentium). On
RISC machines, with probably plenty of registers, I expect similar
results, i.e. no effect.

But even if there is a slight speed impact - which I doubt - it would
only harm the interpreted run loops, and only when the whole program
is just looping, like mops.pasm.

But for such programs, timings like these hold (on Athlon 800):

CGoto:        20
fast_core:    12
Prederef:     17
JIT:         800
C (no opt):  195
C -O3:       277


So when the whole thing is run-loop bound, you have already lost ;-)
Real world programs are not run-loop bound, so I don't see any harm in
changing the run loops to take an offset.

leo

Jason Gloudon

Feb 1, 2003, 4:36:20 PM2/1/03
to Leopold Toetsch, P6I, Dan Sugalski
On Thu, Jan 30, 2003 at 10:07:26AM +0100, Leopold Toetsch wrote:

> code_start = interpreter->code->base.data; // new syntax
> while (offs)
> offs = interp->func_table[*(code_start+offs)](offs, ..)

It's unclear to me whether you are saying the opcode functions would
still be passed the PC or the offset? If you pass the offset, the
opcode functions will have to recalculate the PC in order to access
opcode arguments.

> Changing the addressing scheme to opcode offsets relative to code
> start would simplify all kinds of (non local) control flow changes. As
> real world programs mostly consists of such subroutine calls, these
> would be simplified a lot (and would then not need leaving the runloop
> - probably ;-)

How would non-local control flow be simplified? You would still have
to leave the runloop, because the bytecode base has changed and
code_start would no longer be correct.

--
Jason

Leopold Toetsch

Feb 1, 2003, 6:46:50 PM2/1/03
to Jason Gloudon, P6I, Dan Sugalski
Jason Gloudon wrote:

> On Thu, Jan 30, 2003 at 10:07:26AM +0100, Leopold Toetsch wrote:


>> code_start = interpreter->code->base.data; // new syntax
>> while (offs)
>> offs = interp->func_table[*(code_start+offs)](offs, ..)


> It's unclear to me whether you are saying the opcode functions would still be
> passed the PC or offs(et) ? If you pass the offset, the opcode functions will
> have to re-calculate the PC in order to access opcode arguments.


The opcode functions currently address the registers relative to the
PC = cur_opcode = the absolute opcode pointer, e.g.

#define IREG(i) interpreter->ctx.int_reg.registers[cur_opcode[i]]

which would then be

#define IREG(i) interpreter->ctx.int_reg.registers[code_start[offs+i]]

So clearly only the offset is passed to the op functions.

Yes, I did miss this part of the probable slowdown.

No, I don't think it will be much - and the assembler (i.e. imcc)
could provide the index [offs + i] relative to code start,
*if* it really would slow down the basic run loops.


But again, please have a look at the different timings of the run
loops. We are speaking here of magnitudes from 10 to 800; only the
slow loops have a negative impact - if at all, and only if the program
is loop bound, which IMHO no real world program will be.


>>Changing the addressing scheme to opcode offsets relative to code
>>start would simplify all kinds of (non local) control flow changes.

> How would non local control flow be simplified ? You would still have to leave


> the runloop because the bytecode base has changed and code_start would no
> longer be correct.


A direct jump to a different code segment is currently not feasible,
because an absolute pointer to the byte code is passed around, which
additionally depends on the running run loop. E.g. prederefed code,
when predereferencing has not yet been done, has a pointer to the
convert routine, which dereferences registers and changes the function
pointer so that it then points to the real CPrederef opcode function.
When a code segment change op comes in here, it is either hard or
impossible to do it right.
When all addresses involved are offsets relative to code_start
(whatever this is: compiled, JIT or prederefed code), the operations
themselves can do the inter-code-segment jump by changing offset plus
code_start, and that's all.

And, foremost, my proposal makes it all simpler and more natural:
bytecode, and especially branches, are all in terms of opcode offsets;
our runloops are not. This is the point where all possible
complications start.
Currently the various OpTrans/*.pm functions take care of providing
most of these pointer manipulations (which e.g. fail for PBC compiled
to native C, when non-trivial branches are involved). Passing just the
offset would eliminate this can of worms.
And as I said: JIT and compiled C code only have offsets into the
bytecode. Opcode pointers are pointers to machine code; you can't
calculate with these and get back the offset to some byte code
location. The "pass the PC" concept fails here totally.

leo

Jason Gloudon

Feb 2, 2003, 12:16:03 AM2/2/03
to Leopold Toetsch, P6I, Dan Sugalski
On Sun, Feb 02, 2003 at 12:46:50AM +0100, Leopold Toetsch wrote:

> #define IREG(i) interpreter->ctx.int_reg.registers[code_start[offs+i]]

Where does the value of code_start come from?
code_start in an opcode function is not a constant, so the above is
really:

interpreter->ctx.int_reg.registers[interpreter->code->base.data[offs+i]]

following the names you've used in previous mail.

Intersegment jumps may not work readily for all runloops, but I don't
believe fixing that requires as big a change as you're suggesting.

--
Jason

Leopold Toetsch

Feb 2, 2003, 6:10:38 AM2/2/03
to Jason Gloudon, P6I, Dan Sugalski
Jason Gloudon wrote:

> On Sun, Feb 02, 2003 at 12:46:50AM +0100, Leopold Toetsch wrote:
>
>
>>#define IREG(i) interpreter->ctx.int_reg.registers[code_start[offs+i]]
>>
>
> Where does the value of code_start coming from ?


As stated in my first mail in this thread, code_start could be an auto
variable in the function doing the run loop, e.g. for the CGoto case.


> code_start in an opcode function is not a constant, so the above is really:
>
> interpreter->ctx.int_reg.registers[interpreter->code->base.data[offs+i]]

Yes, this would be the term for runops_{slow,fast}_core.

> Intersegment jumps may not work readily for all runloops, but I don't believe
> that requires as big a change as you're suggesting.

Looking at the longish IREG(i) above, it seems that my suggestion is
really suboptimal for runops_{slow,fast}_core. I still think it's the
right thing for all other run loops.

I wrote this proposal after reading the C source for the natively
compiled eval.pasm:

switch_label:
    switch (cur_opcode - start_code) {
        case 0: PC_0: {
            [...]
            dest = (opcode_t *)p->vtable->invoke(interpreter, p, &&PC_11);
            cur_opcode = dest;
            goto switch_label;
        }
        case 11: PC_11: {

The address passed to the invoke function is a local label in the run
loop. This can't work for coroutines, evals and so on.

leo

Dan Sugalski

Feb 3, 2003, 11:50:20 AM2/3/03
to Leopold Toetsch, P6I
At 10:07 AM +0100 1/30/03, Leopold Toetsch wrote:
>Changing the addressing scheme to opcode offsets relative to code
>start would simplify all kinds of (non local) control flow changes. As
>real world programs mostly consists of such subroutine calls, these
>would be simplified a lot (and would then not need leaving the runloop
>- probably ;-)

The big problem with this is that you're trading normal running
performance for non-local call performance. I don't think this is a
good idea--while you'll save maybe 50 cycles for each non-local call,
you're making each opcode pay another four or five cycles, perhaps a
bit more on the more register-starved architectures, if the extra
interpreter structure element fetching causes a register to get
flushed to the stack.

I'll go read through the rest of the thread to see if maybe there's a
different solution we can come up with to solve part of the problem,
or if maybe the problem you're looking to solve isn't what I think it
is.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Leopold Toetsch

Feb 3, 2003, 5:15:59 PM2/3/03
to Dan Sugalski, P6I
Dan Sugalski wrote:

> At 10:07 AM +0100 1/30/03, Leopold Toetsch wrote:
>
>> Changing the addressing scheme to opcode offsets relative to code
>

> The big problem with this is you're increasing non-local call
> performance for normal running. I don't think this is a good idea--while
> you'll save maybe 50 cycles for each non-local call, you're making each
> opcode pay another four or five cycles, perhaps a bit more on the more
> register-starved architectures if the extra interpreter structure
> element fetching causes a register to get stack flushed.


Yep. But first, we don't really know yet how many local/non-local
calls we have in real world programs, and the "some cycles per normal
op" impact does not apply to Prederef, JIT and compiled C. The latter
two are magnitudes faster than the slower/normal kinds of run loops -
in tight loops or runloop-based tests.


Anyway, I was able to fix most of the compiled C code quirks that this
RfD emanated from:

- invoke
- bsr/jsr + ret

always work on offsets now. The latter was necessary because PASM subs
called via set_addr/invoke/ret worked differently from bsr/ret in
compiled C code.

So it might only be necessary to clearly define which branches need a
run-loop based PC (whatever this might be) and which ops take an
offset in terms of opcode_t from code_start. The macros in core.ops
should somehow reflect this behaviour - currently I have changed these
to CUR_OPCODE + $x, which is always the offset (today's 3 commits).

leo
