This runloop combines the faster dispatch of opcodes via computed goto and the clever register addressing due to predereferencing registers and constants. And it's compact due to the fact that all opcode variants with constants collapse to just one implementation of the functions body. It's so compact, that my ancient gcc 2.95.2 even can compile it -O3, which didn't succeed with core_ops_cg.c.
-rw-r--r-- 1 lt users 618496 Feb 6 12:33 core_ops.c -rw-r--r-- 1 lt users 665012 Feb 6 13:10 core_ops.o -rw-r--r-- 1 lt users 219169 Feb 6 12:33 core_ops_cg.c -rw-r--r-- 1 lt users 339312 Feb 6 13:10 core_ops_cg.o -rw-r--r-- 1 lt users 154457 Feb 6 13:05 core_ops_cgp.c -rw-r--r-- 1 lt users 165520 Feb 6 13:27 core_ops_cgp.o -rw-r--r-- 1 lt users 219446 Feb 6 12:33 core_ops_prederef.c -rw-r--r-- 1 lt users 240592 Feb 6 13:10 core_ops_prederef.o
This runloop is now enabled with the -P switch. If you want to run the "normal" prederefed runloop then use '-P -g'.
This runloop combines the faster dispatch of opcodes via computed goto and the clever register addressing due to predereferencing registers and constants. And it's compact due to the fact that all opcode variants with constants collapse to just one implementation of the functions body. It's so compact, that my ancient gcc 2.95.2 even can compile it -O3, which didn't succeed with core_ops_cg.c.
-rw-r--r-- 1 lt users 618496 Feb 6 12:33 core_ops.c -rw-r--r-- 1 lt users 665012 Feb 6 13:10 core_ops.o -rw-r--r-- 1 lt users 219169 Feb 6 12:33 core_ops_cg.c -rw-r--r-- 1 lt users 339312 Feb 6 13:10 core_ops_cg.o -rw-r--r-- 1 lt users 154457 Feb 6 13:05 core_ops_cgp.c -rw-r--r-- 1 lt users 165520 Feb 6 13:27 core_ops_cgp.o -rw-r--r-- 1 lt users 219446 Feb 6 12:33 core_ops_prederef.c -rw-r--r-- 1 lt users 240592 Feb 6 13:10 core_ops_prederef.o
This runloop is now enabled with the -P switch. If you want to run the "normal" prederefed runloop then use '-P -g'.
> This runloop combines the faster dispatch of opcodes via computed goto > and the clever register addressing due to predereferencing registers and > constants.
This generates some pretty good code (for instance, for sub_i_i_i) :
Jerome Vouillon wrote: >>CGP -O3 MOps: 65 !!!1 > This generates some pretty good code (for instance, for sub_i_i_i) :
Yep. I'd have a look at it, really compact.
> Why not also predereference the operation address? This would save a > memory read. The goto would become:
> goto **(cur_opcode += 4);
This is left as an exercise to the reader :)
Improvements welcome - and I'm a really bad C programmer, I won't do it.
> It seems hard to improve upon this. One possibility is to have some > new instructions that uses an accumulator:
> accu = accu - (*(INTVAL *)cur_opcode[1]);
This looks very similar to my proposal WRT micro ops. I had 3 "accu" globals - pretty fast. But the problem is, they must be somewhere in the interpreter because of re-entrancy, threads and that. So this seems impossible.
Melvin Smith wrote: > At 10:12 PM 2/6/2003 +0100, Leopold Toetsch wrote:
>> Improvements welcome - and I'm a really bad C programmer, I won't do it.
> *cough*
> If you are a "bad" C programmer, what is your "good" language? :)
I don't have one. But IMHO I have a fair survey over the whole (except io/ and icu/ :) And - as already done in imcc - I can take existing (really good pieces) and knot them together, a la CGP.
> This runloop combines the faster dispatch of opcodes via computed goto > and the clever register addressing due to predereferencing registers and > constants.
Oooh. Shiny
> And it's compact due to the fact that all opcode variants with constants > collapse to just one implementation of the functions body. It's so > compact, that my ancient gcc 2.95.2 even can compile it -O3, which > didn't succeed with core_ops_cg.c.
Oooooh. Shinier
I had a (possibly) impractical idea - computed goto / JIT (or even computed goto / prederef / jit) core
I don't know whether this is possible:
Inside a cgoto core have 1 extra op - enter JITted section.
The bytecode is "compiled" by the JIT (at some point) - if there are a run of consecutive JIT-able ops, then issue a section (an isolated section) of machine code for those ops, and replace those ops in the bytecode with an op that calls that section. If there isn't a run of JIT-able ops, then just leave the ops as ops, and use the regular computed goto core dispatch.
I'd envisage it becoming a win if PBC often ends up with tight, isolated loops that use a lot of JITted ops, but most of the code is ops we've not written JITted versions of. The sections with the loops are converted to native code, and the rest of the ops still gets the benefit of your sterling efforts at improving the "regular" bytecode dispatch.
On Sat, Feb 08, 2003 at 03:10:58PM +0000, Nicholas Clark wrote: > The bytecode is "compiled" by the JIT (at some point) - if there are a run > of consecutive JIT-able ops, then issue a section (an isolated section) of > machine code for those ops, and replace those ops in the bytecode with an op > that calls that section. If there isn't a run of JIT-able ops, then just > leave the ops as ops, and use the regular computed goto core dispatch.
Yep. That's the sort of use I created the enternative op for. Right now it's used by the compiled C code generator. Basically the entry points to basic blocks are replaced with enternative calls so that the transition to compiled basic blocks of ops code happens transparently to the interpreter.
Nicholas Clark wrote: > I had a (possibly) impractical idea - computed goto / JIT > (or even computed goto / prederef / jit) core
> I don't know whether this is possible:
> Inside a cgoto core have 1 extra op - enter JITted section.
> The bytecode is "compiled" by the JIT (at some point) - if there are a run > of consecutive JIT-able ops, then issue a section (an isolated section) of > machine code for those ops, and replace those ops in the bytecode with an op > that calls that section. If there isn't a run of JIT-able ops, then just > leave the ops as ops, and use the regular computed goto core dispatch.
This would need non trivial changes to JIT, for building only a section of code (JIT isn't really (J)ust ;-)
But before going a rather complicated way, I would implement more JITed ops, as i386 already does with many vtable calls.
Or go the other way round: Run from JIT. If there is a sequence of non JITable ops, convert these to a CGP section, which returns to JIT when finished. This would save a lot of function calls to jit_normal_op.
> I had a (possibly) impractical idea - computed goto / JIT > (or even computed goto / prederef / jit) core
> I don't know whether this is possible:
> Inside a cgoto core have 1 extra op - enter JITted section.
> The bytecode is "compiled" by the JIT (at some point) - if there are a run > of consecutive JIT-able ops, then issue a section (an isolated section) of > machine code for those ops, and replace those ops in the bytecode with an op > that calls that section. If there isn't a run of JIT-able ops, then just > leave the ops as ops, and use the regular computed goto core dispatch.
Are you discussing some sort of unroller for native code for some opcodes ? .. basic blocks ?. Anything too complicated to unroll is replaced by a jump back into the interpreter core ?..
This strategy has been used in pnet's engine (which is still a souped up interpreter) ... the motivation was ease of porting unrollers than full JITs, ie opcode by opcode .
> I'd envisage it becoming a win if PBC often ends up with tight, isolated > loops that use a lot of JITted ops, but most of the code is ops we've not
Jerome Vouillon wrote: > On Thu, Feb 06, 2003 at 01:37:42PM +0100, Leopold Toetsch wrote:
>>This is one thing I allways wanted to try ;-)
>>fast_core MOps: 11 >>Prederef: 17.5 >>CGoto MOps: 19.4 >>CGP MOps: 27.5 >>CGP -O3 MOps: 65 !!!1 > Why not also predereference the operation address? This would save a > memory read. The goto would become:
> goto **(cur_opcode += 4);
Many thanks for this hint. I did write "I won't do it" but ehem, yes, here it is:
The new CGP code causes compilation of interpreter.c to fail if HAVE_COMPUTED_GOTO is undefined. This is because it references Parrot_DynOp_core_cgp_0_0_9, which is defined in core_ops_cgp.h, and the latter is only included when HAVE_COMPUTED_GOTO is defined. This bug appears to be responsible for most of the current breakage in the tinderbox; however, it can also be provoked by a standard Linux/x86/gcc combination if you configure without computed goto.
The patch below fixes the problem, but I'd prefer some comments from the experts before I commit it.
Cheers, Simon
--- interpreter.c.old Mon Feb 10 15:41:10 2003 +++ interpreter.c Mon Feb 10 15:41:21 2003 @@ -15,13 +15,13 @@ #include "parrot/interp_guts.h" #include "parrot/oplib/core_ops.h" #include "parrot/oplib/core_ops_prederef.h" +#include "parrot/oplib/core_ops_cgp.h" #include "parrot/runops_cores.h" #ifdef HAS_JIT # include "parrot/jit.h" #endif #ifdef HAVE_COMPUTED_GOTO # include "parrot/oplib/core_ops_cg.h" -# include "parrot/oplib/core_ops_cgp.h" #endif #include "parrot/method_util.h"
Leopold Toetsch wrote: > Nicholas Clark wrote: >> Inside a cgoto core have 1 extra op - enter JITted section. > Or go the other way round: Run from JIT. If there is a sequence of non > JITable ops, convert these to a CGP section, which returns to JIT when > finished. This would save a lot of function calls to jit_normal_op.
I have thought about this, with a little help from ddd:
This produces unoptimized almost the same code quality and speed as -O3.
The cur_opcode is in %esi, all operand access is done like in the posted -O3 example: $ parrot -P mops.pbc 82.019517 M op/s
2) There is one new opcode:
B<jmp_to_eip inconst INT>
The argument is the native_ptr in JIT code, where to return from a section of non JITed code.
goto **(cur_opcode + 1);
The address is filled in by the JIT emit functions.
3) The Parrot_jit_begin() emits code to call cgp_core (alas setting up the same stack frame as cgp_core) and B<jmp_to_eip> back to the address after the function call
4) When there is a seqence (more then 1) non JITed ops, Parrot_jit_normal_op emits code to calculate %esi (the *cur_opcode) in the prederefed jump table and *jumps* there. The end of the section is above jmp_to_eip instruction. This implies, that after a non JITed section, the JITed section is at least two opcodes sized, to have room to fill in this jump.
5) non JITed branches do not fit very nicely in this scheme, but there are several possible ways to handle these: - make all branches JITted - generate another core (cgp_jit_core), which does the right thing - always emit code by Parrot_jit_cpcf_op() for the last opcode in the section (cgp_core would only be used if the nonJITed section is >= 3 instructions) - if both ends of the branch are non JITed sections do nothing, just stay in the cgp_core and patch the ends of these branches to return to JIT (brrr)
6) and finally the prederefed B<end> opcode gets jumped to by code emitted from Parrot_end_jit, to clean up the cgp_core stack frame and return from JIT.
So we would not have any function call overhead and getting the best performance by combining the 2 fastest run cores.
This approach would of course need some architecture/compiler specific hacks, but JIT is such a hack anyway. OTOH it is almost totally encapsulated in the architecture jit file, so it *can* be implemented but there is no need to do so.
On Tue, Feb 11, 2003 at 10:49:14AM +0100, Leopold Toetsch wrote: > Leopold Toetsch wrote:
> >Nicholas Clark wrote:
> >>Inside a cgoto core have 1 extra op - enter JITted section.
> >Or go the other way round: Run from JIT. If there is a sequence of non > >JITable ops, convert these to a CGP section, which returns to JIT when > >finished. This would save a lot of function calls to jit_normal_op.
I failed to follow most of the specific details, and all of the x86 specific stuff.
> So we would not have any function call overhead and getting the best > performance by combining the 2 fastest run cores.
> This approach would of course need some architecture/compiler specific > hacks, but JIT is such a hack anyway. OTOH it is almost totally > encapsulated in the architecture jit file, so it *can* be implemented > but there is no need to do so.
> Comments welcome,
The idea actually works at all? And it goes faster than the prederef computed goto core?
Nicholas Clark wrote: > On Tue, Feb 11, 2003 at 10:49:14AM +0100, Leopold Toetsch wrote: > I failed to follow most of the specific details, and all of the x86 specific > stuff.
Basically, all non JITed opcodes, which are now called functions (Parrot_jit_normal_op) would instead get jumped to in the cgp_core and from there operation in JITcode again will continue with a jump.
> The idea actually works at all?
I think so.
> And it goes faster than the prederef computed goto core?
Should be so, yes. When there are a lot of JITed ops (mainly integers and floats, kept in registers) JIT is faster then native -O3 compiled C. (I hope that the optimizer in imcc can turn a lot of plain perl code into native ints/floats). When there are many nonJITted functions calls (e.g. string_concat), the CGP core wins compared to JIT. So combining (again :) the 2 fastest concepts give you the best of both.
> So, in comparison, how fast is Python...
A little slower then perl5 (2.0 MOps)
$ python examples/mops/mops.py M op/s: 1.70850480747
$ imcc -P examples/assembly/mops_p.pasm M op/s: 40.195826
$ imcc -j examples/assembly/mops_p.pasm M op/s: 68.099593
(I took the PMC based mops for comparison, and of course, this is
opcode dispatch only - or almost only)
parrot/imcc: -O3 compiled on i386/linux, Athlon 800.
Nicholas Clark wrote: > On Tue, Feb 11, 2003 at 10:49:14AM +0100, Leopold Toetsch wrote:
>>Leopold Toetsch wrote: >>>Or go the other way round: Run from JIT. If there is a sequence of non >>>JITable ops, convert these to a CGP section, which returns to JIT when >>>finished. This would save a lot of function calls to jit_normal_op. > The idea actually works at all?
I have it running now. JIT/i386 uses the stackframe of CGP for its own. When there is a sequence of non-JITed functions, JIT code braanches directly into the CGP core and executes the CGP ops. Going back to JIT is a asm("ret"), which is a new (the 1000th !) opcode.
> And it goes faster than the prederef computed goto core?
Yep, though I don't have many test cases: $ parrot -P life.pbc 5000 generations in 5.629500 seconds. 888.178341 generations/sec
On Thu, Feb 13, 2003 at 04:43:37PM +0100, Leopold Toetsch wrote: > Nicholas Clark wrote: > > The idea actually works at all?
> I have it running now.
Can you write an opcode that solves the halting problem? :-)
> JIT/i386 uses the stackframe of CGP for its own. When there is a > sequence of non-JITed functions, JIT code braanches directly into the > CGP core and executes the CGP ops. Going back to JIT is a asm("ret"), > which is a new (the 1000th !) opcode.
That sounds hairy. :-)
> > And it goes faster than the prederef computed goto core?
> Yep, though I don't have many test cases: > $ parrot -P life.pbc > 5000 generations in 5.629500 seconds. 888.178341 generations/sec
> $ cd languages/perl6 ; perl6 -C -O3 ../../life.pasm > 5000 generations in 5.154097 seconds. 970.102045 generations/sec
> parrot/imcc: -O3 compiled on i386/linux, Athlon 800.
Presumably life.pasm mainly uses ops that already have JIT implementations. Do you have examples of code that doesn't have many OPs, and "currently" goes faster on x86 under CGoto rather than JIT? These would be the interesting ones.
> Should I commit it or send to the list first?
I don't know. Are you confident in it? Does it pass all Parrot's regression tests?
Nicholas Clark wrote: > On Thu, Feb 13, 2003 at 04:43:37PM +0100, Leopold Toetsch wrote:
>>Nicholas Clark wrote: >>JIT/i386 uses the stackframe of CGP for its own. When there is a >>sequence of non-JITed functions, JIT code braanches directly into the >>CGP core and executes the CGP ops. Going back to JIT is a asm("ret"), >>which is a new (the 1000th !) opcode. > That sounds hairy. :-)
It's not utterly complex. Though single-stepping through a short program with ddd might be helpful ;-)
> Presumably life.pasm mainly uses ops that already have JIT implementations.
Yep.
> Do you have examples of code that doesn't have many OPs, and "currently" > goes faster on x86 under CGoto rather than JIT? These would be the > interesting ones.
IIRC perl6/examples/life.p6 was slower with JIT. Now JIT wins, but of course mainly due to more JITed opcodes. No I don't have a good test for this.
>>Should I commit it or send to the list first?
> I don't know. Are you confident in it? Does it pass all Parrot's regression > tests?
It passes parrot's, imcc's and perl6 tests, with -g and with -O3.
On Thu, Feb 13, 2003 at 06:34:55PM +0100, Leopold Toetsch wrote: > Nicholas Clark wrote:
> >On Thu, Feb 13, 2003 at 04:43:37PM +0100, Leopold Toetsch wrote: > >>Should I commit it or send to the list first?
> >I don't know. Are you confident in it? Does it pass all Parrot's regression > >tests?
> It passes parrot's, imcc's and perl6 tests, with -g and with -O3.
I'd suggest committing it, rather than sending it to the list. Unless there are extra comments you'd want to make to the list. But even then, I think it would be better to write them into a design doc (or whatever the correct place is) and just tell the list where to read it from the checkout.