YARL - yet another run loop: CSwitch

Leopold Toetsch

Feb 7, 2003, 3:49:29 AM

to P6I

Yesterday night I hacked together a switched prederefed run loop. It's
running ~50% faster than fast_core but a lot slower than the CGoto based
loops.
The question is: Should I put it in? I thought, for compilers lacking
computed goto (are there any?) it could be an alternative.
The disadvantage of such a switched loop is: It would need extra work
for the proposed "core ops extending" (calling another switched loop
from the "default:" statement ...) with some nontrivial overhead.
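
For readers who have not seen one, a switch-dispatched loop looks roughly
like the sketch below. It is a minimal illustration only: the opcode names,
their numbering and the register layout are hypothetical toy versions, not
Parrot's real ops, and operands are plain register numbers rather than
prederefed pointers.

```c
/* Hypothetical toy opcodes; Parrot's real op numbering differs. */
enum { OP_END, OP_SET_I_IC, OP_ADD_I_I_I, OP_SUB_I_I_I };

typedef long opcode_t;

/* Execute code[] against the integer register file regs[].
 * Each op is laid out as: opcode, then its operands. */
void run_switched(const opcode_t *code, long *regs)
{
    const opcode_t *pc = code;

    for (;;) {
        switch (*pc) {
        case OP_SET_I_IC:               /* regs[a] = constant */
            regs[pc[1]] = pc[2];
            pc += 3;
            break;
        case OP_ADD_I_I_I:              /* regs[a] = regs[b] + regs[c] */
            regs[pc[1]] = regs[pc[2]] + regs[pc[3]];
            pc += 4;
            break;
        case OP_SUB_I_I_I:              /* regs[a] = regs[b] - regs[c] */
            regs[pc[1]] = regs[pc[2]] - regs[pc[3]];
            pc += 4;
            break;
        case OP_END:
            return;
        default:
            /* An op outside the core set would have to be handed to
             * another dispatcher here; this sketch just stops. */
            return;
        }
    }
}
```

The "default:" branch is where the extra work for extended ops would go:
every op that is not a core op pays for the failed switch before it reaches
its real handler.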

leo

Jason Gloudon

Feb 7, 2003, 10:11:50 AM

to Leopold Toetsch, P6I

On Fri, Feb 07, 2003 at 09:49:29AM +0100, Leopold Toetsch wrote:

> Yesterday night I hacked together a switched prederefed run loop. It's
> running ~50% faster than fast_core but a lot slower than the CGoto based
> loops.

The speedups are great. The next question is how do you use this in a
multi-threaded program without wasting a lot of memory and losing the speedup
because of the extra memory bloat of having to prederef the same code for every
thread?

The x86 jit had the same problem (I don't know if anyone has changed that), the
code it generates is only good for a single interpreter.

--
Jason

Leopold Toetsch

Feb 7, 2003, 11:49:35 AM

to Jason Gloudon, P6I

Jason Gloudon wrote:

> On Fri, Feb 07, 2003 at 09:49:29AM +0100, Leopold Toetsch wrote:
>
>
>>Yesterday night I hacked together a switched prederefed run loop. It's
>>running ~50% faster than fast_core but a lot slower than the CGoto based
>>loops.

> The speedups are great.

I thought that the switched loop could be faster. One problem could be
that gcc doesn't generate a jump table. It seems to use a binary search
strategy for the case label lookup.

> ... The next question is how do you use this in a
> multi-threaded program without wasting a lot of memory and losing the speedup
> because of the extra memory bloat of having to prederef the same code for every
> thread?
>
> The x86 jit had the same problem (I don't know if anyone has changed that), the
> code it generates is only good for a single interpreter.

I don't know yet how multi-threading will be done. But when multiple
interpreters share the ->code data member (as newinterp/runinterp do),
then they will use the same JIT/prederef or whatever data.
All code related structures are already in the code segment since my
packfile patches.

leo

Jason Gloudon

Feb 8, 2003, 9:49:05 AM

to Leopold Toetsch, perl6-i...@perl.org

On Fri, Feb 07, 2003 at 05:49:35PM +0100, Leopold Toetsch wrote:

> I don't know yet how multi-threading will be done. But when multiple
> interpreters share the ->code data member (as newinterp/runinterp do),
> then they will use the same JIT/prederef or whatever data.

You can't do that for prederef in a multi-threaded process because prederef
stores the address of the registers in the interpreter structure in the
prederef data.

    case PARROT_ARG_I:
        pc_prederef[i] = (void *)&interpreter->ctx.int_reg.registers[pc[i]];
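
To make the consequence concrete, here is a minimal sketch with a
much-reduced, hypothetical Interp struct (the real interpreter structure is
far larger, and the helper names are invented for illustration). Once the
operands have been prederefed against one interpreter, running the same
prederefed data from a second interpreter still writes into the first
one's registers.

```c
#include <stddef.h>

#define NUM_REGS 4

/* Hypothetical, much-reduced interpreter: just an integer register file. */
typedef struct {
    long int_reg[NUM_REGS];
} Interp;

/* Prederef an operand list: replace each register number in pc[] with
 * the absolute address of that register inside `interp`. */
void prederef_args(Interp *interp, const long *pc, void **pc_prederef,
                   size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        pc_prederef[i] = (void *)&interp->int_reg[pc[i]];
}

/* A prederefed "set register" op: it writes through the stored pointer
 * and never looks at the interpreter that is executing it. */
void set_via_prederef(void **pc_prederef, size_t arg, long value)
{
    *(long *)pc_prederef[arg] = value;
}
```

With two Interp values a and b, prederefing against a and then executing
the op on behalf of b changes a.int_reg, not b.int_reg.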

--
Jason

Leopold Toetsch

Feb 8, 2003, 11:36:58 AM

to Jason Gloudon, perl6-i...@perl.org

Jason Gloudon wrote:

> On Fri, Feb 07, 2003 at 05:49:35PM +0100, Leopold Toetsch wrote:
>
>
>>I don't know yet how multi-threading will be done. But when multiple
>>interpreters share the ->code data member (as newinterp/runinterp do),
>>then they will use the same JIT/prederef or whatever data.
>>
>
> You can't do that for prederef in a multi-threaded process because prederef
> stores the address of the registers in the interpreter structure in the
> prederef data.

Ouch, yes. So does JIT.
So JIT/prederefed code must be separated for threads.

leo

Jerome Vouillon

Feb 14, 2003, 8:39:16 AM

to Leopold Toetsch, Jason Gloudon, perl6-i...@perl.org

I'm not sure this is the right thing to do. If we force gcc to store
in a machine register the address of the Parrot registers, it should
generate some code which is just as fast as the prederefed one. For
instance, the sub opcode would be

; %esi contains the code pointer
; %edi contains the address of the current thread registers
mov 0x4(%esi),%ecx
mov 0x8(%esi),%edx
mov 0xc(%esi),%eax
add $0x10,%esi
mov (%edi,%eax),%eax ; base %edi + offset of the register %eax
mov (%edi,%edx),%edx
sub %eax,%edx
mov %edx,(%edi,%ecx)
jmp *(%esi)

instead of

mov 0x4(%esi),%ecx
mov 0x8(%esi),%edx
mov 0xc(%esi),%eax
add $0x10,%esi
mov (%eax),%eax
mov (%edx),%edx
sub %eax,%edx
mov %edx,(%ecx)
jmp *(%esi)
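
In C terms, the first variant corresponds roughly to the sketch below. It
assumes a hypothetical, reduced Regs struct, and that the op stream holds
byte offsets into the register file instead of absolute addresses; the
`base` argument plays the role of %edi.

```c
#include <stddef.h>

#define NUM_REGS 4

/* Hypothetical per-thread register file. */
typedef struct {
    long int_reg[NUM_REGS];
} Regs;

/* "sub" with operands stored as byte offsets from the register base.
 * args[0] is the destination, args[1] and args[2] are the sources.
 * The op stream contains no absolute addresses, so the same stream can
 * be executed by any thread, each passing its own base. */
void sub_by_offset(char *base, const size_t *args)
{
    long *dst = (long *)(base + args[0]);
    long lhs = *(long *)(base + args[1]);
    long rhs = *(long *)(base + args[2]);

    *dst = lhs - rhs;
}
```

Calling sub_by_offset() twice with the same args but two different Regs
values updates each register file independently, which is exactly what the
absolute-pointer form cannot do.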

We still have a problem with all these opcodes with constants. I
think we should drop them. We can just keep some of the "set" opcodes
to load a constant into a register, and then all other opcodes would
work only with registers. I believe the overhead would be quite low
with the bytecode interpreter, and negligible with the JIT compiler.

-- Jérôme

Leopold Toetsch

Feb 14, 2003, 11:28:17 AM

to Jerome Vouillon, Jason Gloudon, perl6-i...@perl.org

Jerome Vouillon wrote:

[ prederef/JIT and threads ]

> I'm not sure this is the right thing to do. If we force gcc to store
> in a machine register the address of the Parrot registers, it should
> generate some code which is just as fast as the prederefed one. For
> instance, the sub opcode would be
>
> ; %esi contains the code pointer
> ; %edi contains the address of the current thread registers
> mov 0x4(%esi),%ecx
> mov 0x8(%esi),%edx
> mov 0xc(%esi),%eax
> add $0x10,%esi
> mov (%edi,%eax),%eax ; base %edi + offset of the register %eax

Yep. And for better locality of interpreter->ctx parrot interpreters
should be allocated from a separate mem pool, where all interpreters are
side by side.

> We still have a problem with all these opcodes with constants. I
> think we should drop them. We can just keep some of the "set" opcodes
> to load a constant into a register, and then all other opcodes would
> work only with registers.

Yes. And worse, we have e.g. add_i_ic_ic and such. I think a smaller
core would be faster anyway.

> -- Jérôme

leo

Dan Sugalski

Feb 14, 2003, 4:52:35 PM

to Leopold Toetsch, Jason Gloudon, perl6-i...@perl.org

Yup. This was a decision made a long time ago, back when Daniel
started the first JIT stuff. It got a big speedup, and was deemed to
be worth it, since the regular runloop was still darned fast. (And
that was before everyone got it going insanely fast :)

It's one of the reasons I haven't been too worried about speeding up
the core loop. I've been figuring we'll end up with three:

*) JIT
*) CGoto
*) Old indirect dispatch

and leave it at that. When Gregor was working on the prederef I
figured we'd use it as the third, since the JIT was new and I wasn't
sure it'd be possible to get it as a good general solution, but it's
developed so much that I'm not sure it's worth more loop development.
(I could, of course, be wrong... :)
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                        have teddy bears and even
                                      teddy bears get drunk

Leopold Toetsch

Feb 14, 2003, 6:13:26 PM

to Dan Sugalski, Jason Gloudon, perl6-i...@perl.org

Dan Sugalski wrote:

> At 5:36 PM +0100 2/8/03, Leopold Toetsch wrote:

[ threaded JIT/prederef ]

>> Ouch, yes. So does JIT.
>> So JIT/prederefed code must be separated for threads.

> Yup. This was a decision made a long time ago, back when Daniel started
> the first JIT stuff.

Not necessarily so now, it seems. Addressing registers by (thread +
index), as outlined in Jerome's recent reply to the YARL (switched run
loop) thread is equally fast on e.g. I386. Dunno if other $arch has a
similar addressing mode. But as all recent processors now are (very)
pipelined, one extra argument in the op stream only increases code size
but not execution speed.

> ... (And that was
> before everyone got it going insanely fast :)

Thank you Sir.

> It's one of the reasons I haven't been too worried about speeding up the
> core loop. I've been figuring we'll end up with three:
>
> *) JIT
> *) CGoto
> *) Old indirect dispatch

The fastest now, depending on the available $arch/compiler features, are:

- JIT
- CGP (makes CGoto obsolete)
- Switched Prederef (not in CVS)

but the plain function call core is still needed, e.g. by the JIT, for now.

> and leave it at that. When Gregor was working on the prederef I figured
> we'd use it as the third, since the JIT was new and I wasn't sure it'd
> be possible to get it as a good general solution, but it's developed so
> much that I'm not sure it's worth more loop development. (I could, of
> course, be wrong... :)

JIT (known by that acronym, though it isn't really just-in-time in Parrot)
is a very $arch-dependent feature. Today I sped up mul_i_ic by 2-50 for
some constants, which other $archs don't have implemented yet.

The CGP core is really fast for all compilers that have computed goto -
and honestly, the code that HLs emit will much resemble code that is
not well suited for JIT.

The optimizer in imcc is the real *challenge* in the whole story.

leo

Dan Sugalski

Feb 15, 2003, 2:06:58 PM

to Leopold Toetsch, Jason Gloudon, perl6-i...@perl.org

At 12:13 AM +0100 2/15/03, Leopold Toetsch wrote:
>Dan Sugalski wrote:
>
>>At 5:36 PM +0100 2/8/03, Leopold Toetsch wrote:
>
>
>[ threaded JIT/prederef ]
>
>>>Ouch, yes. So does JIT.
>>>So JIT/prederefed code must be separated for threads.
>
>>Yup. This was a decision made a long time ago, back when Daniel
>>started the first JIT stuff.
>
>
>Not necessarily so now, it seems. Addressing registers by (thread +
>index), as outlined in Jerome's recent reply to the YARL (switched
>run loop) thread is equally fast on e.g. I386. Dunno if other $arch
>has a similar addressing mode. But as all recent processors now are
>(very) pipelined, one extra argument in the op stream only increases
>code size but not execution speed.

Right, but that's for now, not necessarily forever.

>>It's one of the reasons I haven't been too worried about speeding
>>up the core loop. I've been figuring we'll end up with three:
>>
>>*) JIT
>>*) CGoto
>>*) Old indirect dispatch
>
>
>The fastest now, depending on the available $arch/compiler features, are:
>
> - JIT
> - CGP (makes CGoto obsolete)
> - Switched Prederef (not in CVS)
>
>but the plain function call core is still needed, e.g. by the JIT, for now.

Right, but that's something that'll get slowly phased out as the JIT
gets more mature and more opcodes get JITted. There's also the
potential for the JIT to get really aggressive, if we can find folks
with both the talent and the time to do it.

>
>>and leave it at that. When Gregor was working on the prederef I
>>figured we'd use it as the third, since the JIT was new and I
>>wasn't sure it'd be possible to get it as a good general solution,
>>but it's developed so much that I'm not sure it's worth more loop
>>development. (I could, of course, be wrong... :)
>
>JIT (known by that acronym, though it isn't really just-in-time in
>Parrot) is a very $arch-dependent feature. Today I sped up mul_i_ic
>by 2-50 for some constants, which other $archs don't have implemented
>yet.
>
>The CGP core is really fast for all compilers that have computed
>goto - and honestly, the code that HLs emit will much resemble
>code that is not well suited for JIT.

I'm not sure we'll come across anything that's less well suited to
the JIT than to a CG core, though I suppose there are potential code
density issues.

One of the big things I'm concerned about is a proliferation of core
loops, and the impact on the size of running programs. I want to keep
the number of cores that force preprocessing the input bytecode to a
minimum if at all possible. We also need to deal with those compilers
that don't have computed goto, a feature that is definitely not C89
compatible.
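
One way to cover both kinds of compiler from a single source is to pick the
dispatch style at compile time. The sketch below is only an illustration
with hypothetical toy opcodes: it uses GNU C "labels as values" when
__GNUC__ is defined and falls back to a plain switch otherwise.

```c
/* Hypothetical toy opcodes, not Parrot's. */
enum { OP_END, OP_INC, OP_DEC };

/* Runs code[] and returns the final accumulator value. */
long run_core(const int *code)
{
    const int *pc = code;
    long acc = 0;

#if defined(__GNUC__)
    /* GNU C extension: one indirect jump per op.
     * The table order must match the opcode numbering above. */
    static void *const labels[] = { &&do_end, &&do_inc, &&do_dec };
#   define DISPATCH() goto *labels[*pc]
    DISPATCH();
do_inc:
    acc++; pc++; DISPATCH();
do_dec:
    acc--; pc++; DISPATCH();
do_end:
    return acc;
#   undef DISPATCH
#else
    /* Portable fallback: a plain switch, valid C89. */
    for (;;) {
        switch (*pc) {
        case OP_INC: acc++; pc++; break;
        case OP_DEC: acc--; pc++; break;
        default:     return acc;        /* OP_END or unknown op */
        }
    }
#endif
}
```

Both branches implement the same ops, so which one gets compiled changes
only the dispatch cost, not the result.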

What I'd like is to keep us at four potential cores (since I realized
I forgot one):

1) Plain function dispatch
2) Switch core (for compilers with no computed goto)
3) CG core
4) JIT

with only the JIT allowed to rewrite the bytecode stream that it
executes. I do realize that this rules out some potential code, and
I'm not happy about that, but I'm worried that we're going to end up
with a dozen different cores, all of which are only half-maintained.

Having said that, if the core building is completely automatic, and
we can find an easy way (i.e. one that requires very little
programmer thought and effort) to build and test any random
collection of cores, I'm OK with that.
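
As a rough idea of what such testing could look like, the sketch below
registers two toy cores in a table and checks that they agree on the same
bytecode. Everything in it is hypothetical (opcodes, core names, the
registry); a generated build could emit the table from whatever cores were
configured.

```c
#include <stddef.h>

/* Hypothetical toy opcodes, not Parrot's. */
enum { OP_END, OP_INC, OP_DEC, OP_COUNT };

/* Core 1: plain function dispatch through an op table. */
typedef const int *(*op_func)(const int *pc, long *acc);

static const int *op_inc(const int *pc, long *acc) { (*acc)++; return pc + 1; }
static const int *op_dec(const int *pc, long *acc) { (*acc)--; return pc + 1; }

/* OP_END has no handler; the loop below stops before calling it.
 * Opcodes outside the enum are not handled by this sketch. */
static const op_func op_table[OP_COUNT] = { NULL, op_inc, op_dec };

long run_function_core(const int *code)
{
    long acc = 0;
    const int *pc = code;

    while (*pc != OP_END)
        pc = op_table[*pc](pc, &acc);
    return acc;
}

/* Core 2: switch dispatch. */
long run_switch_core(const int *code)
{
    long acc = 0;
    const int *pc;

    for (pc = code; ; pc++) {
        switch (*pc) {
        case OP_INC: acc++; break;
        case OP_DEC: acc--; break;
        default:     return acc;        /* OP_END or unknown op */
        }
    }
}

/* Registry of available cores. */
typedef long (*core_func)(const int *code);
static const core_func cores[] = { run_function_core, run_switch_core };

/* Returns 1 if every registered core computes the same result. */
int cores_agree(const int *code)
{
    size_t n = sizeof cores / sizeof cores[0];
    long expected = cores[0](code);
    size_t i;

    for (i = 1; i < n; i++)
        if (cores[i](code) != expected)
            return 0;
    return 1;
}
```

Adding a core then means adding one entry to the table; the comparison loop
does not change.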