leo
> Yesterday night I hacked together a switched prederefed run loop. It's
> running ~50% faster than fast_core but a lot slower than the CGoto based
> loops.
The speedups are great. The next question is how do you use this in a
multi-threaded program without wasting a lot of memory and losing the speedup
because of the extra memory bloat of having to prederef the same code for every
thread?
The x86 JIT had the same problem (I don't know if anyone has changed that): the
code it generates is only good for a single interpreter.
--
Jason
> On Fri, Feb 07, 2003 at 09:49:29AM +0100, Leopold Toetsch wrote:
>
>
>>Yesterday night I hacked together a switched prederefed run loop. It's
>>running ~50% faster than fast_core but a lot slower than the CGoto based
>>loops.
> The speedups are great.
I thought that the switched loop could be faster. One problem could be
that gcc doesn't generate a jump table; it seems to use a binary search
strategy for label lookup.
> ... The next question is how do you use this in a
> multi-threaded program without wasting a lot of memory and losing the speedup
> because of the extra memory bloat of having to prederef the same code for every
> thread?
>
> The x86 jit had the same problem (I don't know if anyone has changed that), the
> code it generates is only good for a single interpreter.
I don't know yet how multithreading will be done. But when multiple
interpreters share the ->code data member (as newinterp/runinterp do),
then they will use the same JIT/prederef or whatever data.
All code related structures are already in the code segment since my
packfile patches.
leo
> I don't know yet how multithreading will be done. But when multiple
> interpreters share the ->code data member (as newinterp/runinterp do),
> then they will use the same JIT/prederef or whatever data.
You can't do that for prederef in a multi-threaded process because prederef
stores the address of the registers in the interpreter structure in the
prederef data:

    case PARROT_ARG_I:
        pc_prederef[i] = (void *)&interpreter->ctx.int_reg.registers[pc[i]];
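
To spell out the consequence, here is a minimal sketch in C. The `Interp` struct, `NUM_INT_REGS`, `prederef_int_arg` and `set_int_prederefed` are hypothetical, simplified stand-ins and not Parrot's actual definitions; the only thing taken from the snippet above is that a prederefed operand holds the absolute address of a register inside one particular interpreter.

```c
#include <assert.h>

/* Hypothetical, heavily simplified interpreter: just an integer register file. */
#define NUM_INT_REGS 32

typedef struct {
    long int_reg[NUM_INT_REGS];
} Interp;

/* Prederef one integer-register operand: the register number from the
 * bytecode is replaced by the address of that register in THIS interpreter. */
static void *prederef_int_arg(Interp *interp, long regno)
{
    assert(regno >= 0 && regno < NUM_INT_REGS);
    return &interp->int_reg[regno];
}

/* Execute something like "set Ix, value" through a prederefed operand.
 * Note that no interpreter pointer is needed: the operand already says
 * whose register is written. */
static void set_int_prederefed(void *operand, long value)
{
    *(long *)operand = value;
}
```

If the operand was prederefed against interpreter A, then running the same prederefed stream on behalf of interpreter B still writes into A's registers. In this sketch that is why the prederefed data cannot simply be shared between threads.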
--
Jason
> On Fri, Feb 07, 2003 at 05:49:35PM +0100, Leopold Toetsch wrote:
>
>
>>I don't know yet how multithreading will be done. But when multiple
>>interpreters share the ->code data member (as newinterp/runinterp do),
>>then they will use the same JIT/prederef or whatever data.
>>
>
> You can't do that for prederef in a multi-threaded process because prederef
> stores the address of the registers in the interpreter structure in the
> prederef data.
Ouch, yes. So does JIT.
So JIT/prederefed code must be separated for threads.
leo
I'm not sure this is the right thing to do. If we force gcc to keep
the address of the Parrot registers in a machine register, it should
generate code which is just as fast as the prederefed one. For
instance, the sub opcode would be
; %esi contains the code pointer
; %edi contains the address of the current thread registers
mov 0x4(%esi),%ecx
mov 0x8(%esi),%edx
mov 0xc(%esi),%eax
add $0x10,%esi
mov (%edi,%eax),%eax ; base %edi + offset of the register %eax
mov (%edi,%edx),%edx
sub %eax,%edx
mov %edx,(%edi,%ecx)
jmp *(%esi)
instead of
mov 0x4(%esi),%ecx
mov 0x8(%esi),%edx
mov 0xc(%esi),%eax
add $0x10,%esi
mov (%eax),%eax
mov (%edx),%edx
sub %eax,%edx
mov %edx,(%ecx)
jmp *(%esi)
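
The same idea can be sketched at the C level. This is only an illustration under stated assumptions: `Interp`, `int_reg_offset`, `IREG` and `op_sub_i_i_i` are hypothetical names, and the operand layout (destination first, then the two sources) follows the assembly above. The operands hold byte offsets into the register file instead of absolute addresses, so they no longer depend on which interpreter runs the code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified interpreter: just an integer register file. */
#define NUM_INT_REGS 32

typedef struct {
    long int_reg[NUM_INT_REGS];
} Interp;

/* Translate a register number into a byte offset from the start of the
 * register file. The result is the same for every interpreter. */
static size_t int_reg_offset(long regno)
{
    assert(regno >= 0 && regno < NUM_INT_REGS);
    return (size_t)regno * sizeof(long);
}

/* Base + offset access, the C counterpart of (%edi,%eax). */
#define IREG(base, off) (*(long *)((char *)(base) + (off)))

/* sub Ix, Iy, Iz with args[0] = offset of Ix, args[1] = Iy, args[2] = Iz. */
static void op_sub_i_i_i(Interp *interp, const size_t *args)
{
    char *base = (char *)interp->int_reg;   /* would live in a machine register */
    IREG(base, args[0]) = IREG(base, args[1]) - IREG(base, args[2]);
}
```

One operand array can then be executed by any number of interpreters, each passing its own base; whether a given compiler really keeps the base in a register is something to check in the generated code.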
We still have a problem with all these opcodes with constants. I
think we should drop them. We can just keep some of the "set" opcodes
to load a constant into a register, and then all other opcodes would
work only with registers. I believe the overhead would be quite low
with the bytecode interpreter, and negligible with the JIT compiler.
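
As a rough illustration of that proposal, here is a tiny switch-based loop in C that knows only a "set" with a constant and a register-only sub. The opcode numbers, the `run` function and the flat `long` op stream are assumptions made for this sketch, not Parrot's real encoding.

```c
#include <assert.h>

/* Hypothetical opcode numbers for a reduced core. */
enum { OP_END, OP_SET_I_IC, OP_SUB_I_I_I };

/* Run a program whose only constant-taking op is set_i_ic.
 * iregs: integer registers, pc: op stream of opcode + operands. */
static void run(long *iregs, const long *pc)
{
    for (;;) {
        switch (pc[0]) {
        case OP_SET_I_IC:               /* set Ix, constant */
            iregs[pc[1]] = pc[2];
            pc += 3;
            break;
        case OP_SUB_I_I_I:              /* sub Ix, Iy, Iz (registers only) */
            iregs[pc[1]] = iregs[pc[2]] - iregs[pc[3]];
            pc += 4;
            break;
        case OP_END:
        default:
            return;
        }
    }
}
```

Something like "sub I0, I1, 5" would then be emitted as "set I2, 5" followed by "sub I0, I1, I2": one extra op and one scratch register in exchange for not needing a separate constant variant of sub.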
-- Jérôme
[ prederef/JIT and threads ]
> I'm not sure this is the right thing to do. If we force gcc to keep
> the address of the Parrot registers in a machine register, it should
> generate code which is just as fast as the prederefed one. For
> instance, the sub opcode would be
>
> ; %esi contains the code pointer
> ; %edi contains the address of the current thread registers
> mov 0x4(%esi),%ecx
> mov 0x8(%esi),%edx
> mov 0xc(%esi),%eax
> add $0x10,%esi
> mov (%edi,%eax),%eax ; base %edi + offset of the register %eax
Yep. And for better locality of interpreter->ctx, Parrot interpreters
should be allocated from a separate memory pool, where all interpreters
are side by side.
> We still have a problem with all these opcodes with constants. I
> think we should drop them. We can just keep some of the "set" opcodes
> to load a constant into a register, and then all other opcodes would
> work only with registers.
Yes. And worse, we have e.g. add_i_ic_ic and such. I think a smaller
core would be faster anyway.
> -- Jérôme
leo
Yup. This was a decision made a long time ago, back when Daniel
started the first JIT stuff. It got a big speedup, and was deemed to
be worth it, since the regular runloop was still darned fast. (And
that was before everyone got it going insanely fast :)
It's one of the reasons I haven't been too worried about speeding up
the core loop. I've been figuring we'll end up with three:
*) JIT
*) CGoto
*) Old indirect dispatch
and leave it at that. When Gregor was working on the prederef I
figured we'd use it as the third, since the JIT was new and I wasn't
sure it'd be possible to get it as a good general solution, but it's
developed so much that I'm not sure it's worth more loop development.
(I could, of course, be wrong... :)
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
> At 5:36 PM +0100 2/8/03, Leopold Toetsch wrote:
[ threaded JIT/prederef ]
>> Ouch, yes. So does JIT.
>> So JIT/prederefed code must be separated for threads.
> Yup. This was a decision made a long time ago, back when Daniel started
> the first JIT stuff.
Not necessarily so now, it seems. Addressing registers by (thread +
index), as outlined in Jerome's recent reply to the YARL (switched run
loop) thread, is equally fast on e.g. i386. Dunno if other $archs have a
similar addressing mode. But as all recent processors are now (very)
pipelined, one extra argument in the op stream only increases code size
but not execution speed.
> ... (And that was
> before everyone got it going insanely fast :)
Thank you Sir.
> It's one of the reasons I haven't been too worried about speeding up the
> core loop. I've been figuring we'll end up with three:
>
> *) JIT
> *) CGoto
> *) Old indirect dispatch
In terms of possible $arch/compiler features, the fastest are now:
- JIT
- CGP (makes CGoto obsolete)
- Switched Prederef (not in CVS)
but the plain function call core is still needed, e.g. for JIT, for now.
> and leave it at that. When Gregor was working on the prederef I figured
> we'd use it as the third, since the JIT was new and I wasn't sure it'd
> be possible to get it as a good general solution, but it's developed so
> much that I'm not sure it's worth more loop development. (I could, of
> course, be wrong... :)
JIT (known by that acronym, though it isn't just-in-time in Parrot) is a
very $arch-dependent feature. Today I sped up mul_i_ic by 2-50 for some
constants, which other $archs don't have implemented yet.
The CGP core is really fast for all compilers that have computed goto,
and honestly, the code that HLs emit will look a lot like code that is
not well suited for JIT.
The optimizer in imcc is the real *challenge* in the whole story.
leo
Right, but that's for now, not necessarily forever.
>>It's one of the reasons I haven't been too worried about speeding
>>up the core loop. I've been figuring we'll end up with three:
>>
>>*) JIT
>>*) CGoto
>>*) Old indirect dispatch
>
>
>In terms of possible $arch/compiler features, the fastest are now:
>
> - JIT
> - CGP (makes CGoto obsolete)
> - Switched Prederef (not in CVS)
>
>but the plain function call core is still needed, e.g. for JIT, for now.
Right, but that's something that'll get slowly phased out as the JIT
gets more mature and more opcodes get JITted. There's also the
potential for the JIT to get really aggressive, if we can find folks
with both the talent and the time to do it.
>
>>and leave it at that. When Gregor was working on the prederef I
>>figured we'd use it as the third, since the JIT was new and I
>>wasn't sure it'd be possible to get it as a good general solution,
>>but it's developed so much that I'm not sure it's worth more loop
>>development. (I could, of course, be wrong... :)
>
>JIT (known by that acronym, though it isn't just-in-time in Parrot) is
>a very $arch-dependent feature. Today I sped up mul_i_ic by 2-50 for
>some constants, which other $archs don't have implemented yet.
>
>The CGP core is really fast for all compilers that have computed
>goto, and honestly, the code that HLs emit will look a lot like
>code that is not well suited for JIT.
I'm not sure we'll come across anything that's less well suited to
the JIT than to a CG core, though I suppose there are potential code
density issues.
One of the big things I'm concerned about is a proliferation of core
loops, and the impact on the size of running programs. I want to keep
the number of cores that force preprocessing the input bytecode to a
minimum if at all possible. We also need to deal with those compilers
that don't have computed goto, a feature that is definitely not C89
compatible.
What I'd like is to keep us at four potential cores (since I realized
I forgot one):
1) Plain function dispatch
2) Switch core (for compilers with no computed goto)
3) CG core
4) JIT
with only the JIT allowed to rewrite the bytecode stream that it
executes. I do realize that this rules out some potential code, and
I'm not happy about that, but I'm worried that we're going to end up
with a dozen different cores, all of which are only half-maintained.
Having said that, if the core building is completely automatic, and
we can find an easy way (i.e. one that requires very little
programmer thought and effort) to build and test any random
collection of cores, I'm OK with that.
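
A minimal sketch of what "build and test any collection of cores" could look like, in portable C89-style code with no computed goto. It implements only the first two cores from the list (plain function dispatch and a switch core) for a made-up three-opcode instruction set. All names here (`op_table`, `run_function_core`, `run_switch_core`, `cores`, `all_cores_agree`) are hypothetical and not taken from the Parrot source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-opcode instruction set. */
enum { OP_END, OP_SET_I_IC, OP_ADD_I_I_I, NUM_OPS };

/* An op function returns the next pc, or a null pointer to stop. */
typedef const long *(*op_func)(long *iregs, const long *pc);

static const long *op_end(long *iregs, const long *pc)
{ (void)iregs; (void)pc; return 0; }

static const long *op_set_i_ic(long *iregs, const long *pc)
{ iregs[pc[1]] = pc[2]; return pc + 3; }

static const long *op_add_i_i_i(long *iregs, const long *pc)
{ iregs[pc[1]] = iregs[pc[2]] + iregs[pc[3]]; return pc + 4; }

/* Indexed by opcode number, in the same order as the enum above. */
static const op_func op_table[NUM_OPS] = { op_end, op_set_i_ic, op_add_i_i_i };

/* Core 1: plain function dispatch through a table of op functions. */
static void run_function_core(long *iregs, const long *pc)
{
    while (pc)
        pc = op_table[pc[0]](iregs, pc);
}

/* Core 2: switch core, for compilers without computed goto. */
static void run_switch_core(long *iregs, const long *pc)
{
    for (;;) {
        switch (pc[0]) {
        case OP_SET_I_IC:
            iregs[pc[1]] = pc[2];
            pc += 3;
            break;
        case OP_ADD_I_I_I:
            iregs[pc[1]] = iregs[pc[2]] + iregs[pc[3]];
            pc += 4;
            break;
        default:                        /* OP_END or unknown opcode */
            return;
        }
    }
}

typedef void (*core_func)(long *iregs, const long *pc);
static const core_func cores[] = { run_function_core, run_switch_core };

/* Run prog on every core with fresh registers. Returns 1 and stores I0 in
 * *result if all cores leave the same value in I0, 0 otherwise. */
static int all_cores_agree(const long *prog, long *result)
{
    long first = 0;
    size_t i;
    for (i = 0; i < sizeof cores / sizeof cores[0]; i++) {
        long regs[4] = {0, 0, 0, 0};
        cores[i](regs, prog);
        if (i == 0)
            first = regs[0];
        else if (regs[0] != first)
            return 0;
    }
    *result = first;
    return 1;
}
```

Adding a core to this kind of harness means adding one entry to `cores`; neither core here rewrites the op stream, which matches the constraint of letting only the JIT do that.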