fast_core MOps: 11
Prederef: 17.5
CGoto MOps: 19.4
CGP MOps: 27.5
CGP -O3 MOps: 65 !!!1
This runloop combines the faster dispatch of opcodes via computed goto
and the clever register addressing due to predereferencing registers and
constants.
And it's compact, because all opcode variants with constants
collapse to just one implementation of the function body. It's so
compact that my ancient gcc 2.95.2 can even compile it with -O3, which
didn't succeed with core_ops_cg.c.
-rw-r--r-- 1 lt users 618496 Feb 6 12:33 core_ops.c
-rw-r--r-- 1 lt users 665012 Feb 6 13:10 core_ops.o
-rw-r--r-- 1 lt users 219169 Feb 6 12:33 core_ops_cg.c
-rw-r--r-- 1 lt users 339312 Feb 6 13:10 core_ops_cg.o
-rw-r--r-- 1 lt users 154457 Feb 6 13:05 core_ops_cgp.c
-rw-r--r-- 1 lt users 165520 Feb 6 13:27 core_ops_cgp.o
-rw-r--r-- 1 lt users 219446 Feb 6 12:33 core_ops_prederef.c
-rw-r--r-- 1 lt users 240592 Feb 6 13:10 core_ops_prederef.o
This runloop is now enabled with the -P switch. If you want to run the
"normal" prederefed runloop then use '-P -g'.
Have fun,
leo
Leopold Toetsch <l...@toetsch.at>
02/06/2003 07:37 AM
To: P6I <perl6-i...@perl.org>
cc:
Subject: [CVS ci] CGP - CGoto Prederefed runloop
This generates some pretty good code (for instance, for sub_i_i_i):
(*(INTVAL *)cur_opcode[1]) =
(*(INTVAL *)cur_opcode[2]) - (*(INTVAL *)cur_opcode[3]);
goto *ops_addr[*(cur_opcode += 4)];
mov 0x4(%esi),%ecx
mov 0x8(%esi),%edx
mov 0xc(%esi),%eax
mov (%eax),%eax
mov (%edx),%edx
sub %eax,%edx
mov %edx,(%ecx)
add $0x10,%esi
mov (%esi),%eax
shl $0x2,%eax
jmp *0x80f75a0(%eax)
Why not also predereference the operation address? This would save a
memory read. The goto would become:
goto **(cur_opcode += 4);
It seems hard to improve upon this. One possibility is to have some
new instructions that use an accumulator:
accu = accu - (*(INTVAL *)cur_opcode[1]);
goto **(cur_opcode += 2);
mov 0x4(%esi),%eax
mov (%eax),%eax
sub %ebx,%eax
add $0x8,%esi
jmp *(%esi)
-- Jérôme
>>CGP -O3 MOps: 65 !!!1
> This generates some pretty good code (for instance, for sub_i_i_i) :
Yep. I had a look at it; really compact.
> Why not also predereference the operation address? This would save a
> memory read. The goto would become:
>
> goto **(cur_opcode += 4);
This is left as an exercise to the reader :)
Improvements welcome - and I'm a really bad C programmer, I won't do it.
> It seems hard to improve upon this. One possibility is to have some
> new instructions that uses an accumulator:
>
> accu = accu - (*(INTVAL *)cur_opcode[1]);
This looks very similar to my proposal WRT micro ops. I had 3 "accu"
globals, which were pretty fast. But the problem is that they must live
somewhere in the interpreter because of re-entrancy, threads and the
like. So globals seem impossible.
> -- Jérôme
leo
*cough*
If you are a "bad" C programmer, what is your "good" language? :)
-Melvin
I don't have one. But IMHO I have a fair overview of the whole codebase
(except io/ and icu/ :) And, as already done in imcc, I can take existing
(really good) pieces and knit them together, a la CGP.
> -Melvin
leo
Oooh. Shiny
> And it's compact due to the fact that all opcode variants with constants
> collapse to just one implementation of the functions body. It's so
> compact, that my ancient gcc 2.95.2 even can compile it -O3, which
> didn't succeed with core_ops_cg.c.
Oooooh. Shinier
I had a (possibly) impractical idea - computed goto / JIT
(or even computed goto / prederef / jit) core
I don't know whether this is possible:
Inside a cgoto core have 1 extra op - enter JITted section.
The bytecode is "compiled" by the JIT (at some point): if there is a run
of consecutive JIT-able ops, then issue a section (an isolated section) of
machine code for those ops, and replace those ops in the bytecode with an op
that calls that section. If there isn't a run of JIT-able ops, then just
leave the ops as ops, and use the regular computed goto core dispatch.
I'd envisage it becoming a win if PBC often ends up with tight, isolated
loops that use a lot of JITted ops, but most of the code is ops we've not
written JITted versions of. The sections with the loops are converted to
native code, and the rest of the ops still get the benefit of your sterling
efforts at improving the "regular" bytecode dispatch.
Nicholas Clark
> The bytecode is "compiled" by the JIT (at some point) - if there are a run
> of consecutive JIT-able ops, then issue a section (an isolated section) of
> machine code for those ops, and replace those ops in the bytecode with an op
> that calls that section. If there isn't a run of JIT-able ops, then just
> leave the ops as ops, and use the regular computed goto core dispatch.
Yep. That's the sort of use I created the enternative op for. Right now it's
used by the compiled C code generator. Basically the entry points to basic
blocks are replaced with enternative calls so that the transition to compiled
basic blocks of ops code happens transparently to the interpreter.
--
Jason
> I had a (possibly) impractical idea - computed goto / JIT
> (or even computed goto / prederef / jit) core
>
> I don't know whether this is possible:
>
> Inside a cgoto core have 1 extra op - enter JITted section.
>
> The bytecode is "compiled" by the JIT (at some point) - if there are a run
> of consecutive JIT-able ops, then issue a section (an isolated section) of
> machine code for those ops, and replace those ops in the bytecode with an op
> that calls that section. If there isn't a run of JIT-able ops, then just
> leave the ops as ops, and use the regular computed goto core dispatch.
This would need non-trivial changes to JIT, for building only a section
of code (JIT isn't really (J)ust ;-)
But before going a rather complicated way, I would implement more JITed
ops, as i386 already does with many vtable calls.
Or go the other way round: Run from JIT. If there is a sequence of non
JITable ops, convert these to a CGP section, which returns to JIT when
finished. This would save a lot of function calls to jit_normal_op.
> Nicholas Clark
leo
Are you discussing some sort of unroller to native code for some
opcodes? Basic blocks? Anything too complicated to unroll is
replaced by a jump back into the interpreter core?
This strategy has been used in pnet's engine (which is still a souped-up
interpreter). The motivation was that unrollers are easier to port than
full JITs, i.e. they go opcode by opcode.
> I'd envisage it becoming a win if PBC often ends up with tight, isolated
> loops that use a lot of JITted ops, but most of the code is ops we've not
See the Unrolling section in Rhys's paper for a detailed discussion of
this idea: <http://www.southern-storm.com.au/download/pnet-engine.pdf>
Truthfully, all I know is that it lifts the pnetmark score for
loops by about 10 times.
Hope it helps,
Gopal
--
The difference between insanity and genius is measured by success
> On Thu, Feb 06, 2003 at 01:37:42PM +0100, Leopold Toetsch wrote:
>
>>This is one thing I always wanted to try ;-)
>>
>>fast_core MOps: 11
>>Prederef: 17.5
>>CGoto MOps: 19.4
>>CGP MOps: 27.5
>>CGP -O3 MOps: 65 !!!1
> Why not also predereference the operation address? This would save a
> memory read. The goto would become:
>
> goto **(cur_opcode += 4);
Many thanks for this hint. I did write "I won't do it" but ehem, yes,
here it is:
CGP MOps 34.5
CGP MOps -O3: 92.3
This is now the sub_i_i_i (-O3)
0x80e9060 <cgp_core+9808>: mov 0x4(%esi),%ecx
0x80e9063 <cgp_core+9811>: mov 0x8(%esi),%edx
0x80e9066 <cgp_core+9814>: mov 0xc(%esi),%eax
0x80e9069 <cgp_core+9817>: add $0x10,%esi
0x80e906c <cgp_core+9820>: mov (%eax),%eax
0x80e906e <cgp_core+9822>: mov (%edx),%edx
0x80e9070 <cgp_core+9824>: sub %eax,%edx
0x80e9072 <cgp_core+9826>: mov %edx,(%ecx)
0x80e9074 <cgp_core+9828>: jmp *(%esi)
So this saved 2 instructions, including one memory access.
This is the branch from mops.pasm
0x80e7ff0 <cgp_core+5600>: mov 0x4(%esi),%eax
0x80e7ff3 <cgp_core+5603>: cmpl $0x0,(%eax)
0x80e7ff6 <cgp_core+5606>: je 0x80e8010 <cgp_core+5632>
0x80e7ff8 <cgp_core+5608>: mov 0x8(%esi),%eax
0x80e7ffb <cgp_core+5611>: mov (%eax),%eax
0x80e7ffd <cgp_core+5613>: shl $0x2,%eax
0x80e8000 <cgp_core+5616>: add %eax,%esi
0x80e8002 <cgp_core+5618>: jmp *(%esi)
leo
The new CGP code causes compilation of interpreter.c to fail if
HAVE_COMPUTED_GOTO is undefined. This is because it references
Parrot_DynOp_core_cgp_0_0_9, which is defined in core_ops_cgp.h,
and the latter is only included when HAVE_COMPUTED_GOTO is defined. This
bug appears to be responsible for most of the current breakage in the
tinderbox; however, it can also be provoked by a standard Linux/x86/gcc
combination if you configure without computed goto.
The patch below fixes the problem, but I'd prefer some comments from the
experts before I commit it.
Cheers,
Simon
--- interpreter.c.old Mon Feb 10 15:41:10 2003
+++ interpreter.c Mon Feb 10 15:41:21 2003
@@ -15,13 +15,13 @@
#include "parrot/interp_guts.h"
#include "parrot/oplib/core_ops.h"
#include "parrot/oplib/core_ops_prederef.h"
+#include "parrot/oplib/core_ops_cgp.h"
#include "parrot/runops_cores.h"
#ifdef HAS_JIT
# include "parrot/jit.h"
#endif
#ifdef HAVE_COMPUTED_GOTO
# include "parrot/oplib/core_ops_cg.h"
-# include "parrot/oplib/core_ops_cgp.h"
#endif
#include "parrot/method_util.h"
> Hi all.
>
> The new CGP code causes compilation of interpreter.c to fail if
> HAVE_COMPUTED_GOTO is undefined.
Ah, yes thanks. I have checked in a fix.
leo
> Nicholas Clark wrote:
>> Inside a cgoto core have 1 extra op - enter JITted section.
> Or go the other way round: Run from JIT. If there is a sequence of non
> JITable ops, convert these to a CGP section, which returns to JIT when
> finished. This would save a lot of function calls to jit_normal_op.
I have thought about this, with a little help from ddd:
1)
opcode_t *
cgp_core(opcode_t *cur_op, struct Parrot_Interp *interpreter)
{
#ifdef __GNUC__
register opcode_t *cur_opcode asm ("esi") = cur_op;
#else
opcode_t *cur_opcode = cur_op;
#endif
Even unoptimized, this produces almost the same code quality and speed
as -O3. The cur_opcode is kept in %esi, and all operand access is done
like in the posted -O3 example:
$ parrot -P mops.pbc
82.019517 M op/s
2) There is one new opcode:
B<jmp_to_eip inconst INT>
The argument is the native_ptr in JIT code, to which execution returns
after a section of non-JITed code.
goto **(cur_opcode + 1);
The address is filled in by the JIT emit functions.
3) Parrot_jit_begin() emits code to call cgp_core (i.e. setting up
the same stack frame as cgp_core) and to B<jmp_to_eip> back to the address
after the function call.
4) When there is a sequence of (more than 1) non-JITed ops,
Parrot_jit_normal_op emits code to calculate %esi (the *cur_opcode) in
the prederefed jump table and *jumps* there. The end of the section is
the jmp_to_eip instruction above. This implies that after a non-JITed
section, the JITed section is at least two opcodes in size, to have room
to fill in this jump.
5) Non-JITed branches do not fit very nicely into this scheme, but there
are several possible ways to handle them:
- make all branches JITted
- generate another core (cgp_jit_core), which does the right thing
- always emit code via Parrot_jit_cpcf_op() for the last opcode in the
section (cgp_core would only be used if the non-JITed section is >= 3
instructions)
- if both ends of the branch are non-JITed sections, do nothing, just
stay in the cgp_core and patch the ends of these branches to return to
JIT (brrr)
6) and finally the prederefed B<end> opcode gets jumped to by code
emitted from Parrot_end_jit, to clean up the cgp_core stack frame and
return from JIT.
So we would not have any function call overhead, and we would get the
best performance by combining the 2 fastest run cores.
This approach would of course need some architecture/compiler specific
hacks, but JIT is such a hack anyway. OTOH it is almost totally
encapsulated in the architecture jit file, so it *can* be implemented
but there is no need to do so.
Comments welcome,
leo
I failed to follow most of the specific details, and all of the x86 specific
stuff.
> So we would not have any function call overhead and getting the best
> performance by combining the 2 fastest run cores.
>
> This approach would of course need some architecture/compiler specific
> hacks, but JIT is such a hack anyway. OTOH it is almost totally
> encapsulated in the architecture jit file, so it *can* be implemented
> but there is no need to do so.
>
> Comments welcome,
The idea actually works at all?
And it goes faster than the prederef computed goto core?
So, in comparison, how fast is Python...
Nicholas Clark
> On Tue, Feb 11, 2003 at 10:49:14AM +0100, Leopold Toetsch wrote:
> I failed to follow most of the specific details, and all of the x86 specific
> stuff.
Basically, all non-JITed opcodes, which are now called as functions
(via Parrot_jit_normal_op), would instead be jumped to in the cgp_core,
and from there operation will again continue in JIT code with a jump.
> The idea actually works at all?
I think so.
> And it goes faster than the prederef computed goto core?
Should be so, yes. When there are a lot of JITed ops (mainly integers
and floats, kept in registers) JIT is faster than native -O3 compiled C.
(I hope that the optimizer in imcc can turn a lot of plain perl code
into native ints/floats).
When there are many non-JITted function calls (e.g. string_concat), the
CGP core wins compared to JIT. So combining (again :) the 2 fastest
concepts gives you the best of both.
> So, in comparison, how fast is Python...
A little slower than perl5 (2.0 MOps)
$ python examples/mops/mops.py
M op/s: 1.70850480747
$ imcc -P examples/assembly/mops_p.pasm
M op/s: 40.195826
$ imcc -j examples/assembly/mops_p.pasm
M op/s: 68.099593
(I took the PMC based mops for comparison, and of course, this is
opcode dispatch only - or almost only)
parrot/imcc: -O3 compiled on i386/linux, Athlon 800.
> Nicholas Clark
leo
> On Tue, Feb 11, 2003 at 10:49:14AM +0100, Leopold Toetsch wrote:
>
>>Leopold Toetsch wrote:
>>>Or go the other way round: Run from JIT. If there is a sequence of non
>>>JITable ops, convert these to a CGP section, which returns to JIT when
>>>finished. This would save a lot of function calls to jit_normal_op.
> The idea actually works at all?
I have it running now.
JIT/i386 uses the stack frame of CGP for its own. When there is a
sequence of non-JITed functions, JIT code branches directly into the
CGP core and executes the CGP ops. Going back to JIT is an asm("ret"),
which is a new (the 1000th!) opcode.
> And it goes faster than the prederef computed goto core?
Yep, though I don't have many test cases:
$ parrot -P life.pbc
5000 generations in 5.629500 seconds. 888.178341 generations/sec
$ parrot -j life.pbc
5000 generations in 5.290360 seconds. 945.115271 generations/sec
$ cd languages/perl6 ; perl6 -C -O3 ../../life.pasm
5000 generations in 5.154097 seconds. 970.102045 generations/sec
parrot/imcc: -O3 compiled on i386/linux, Athlon 800.
Should I commit it or send to the list first?
> Nicholas Clark
leo
> > The idea actually works at all?
>
>
> I have it running now.
Can you write an opcode that solves the halting problem? :-)
> JIT/i386 uses the stack frame of CGP for its own. When there is a
> sequence of non-JITed functions, JIT code branches directly into the
> CGP core and executes the CGP ops. Going back to JIT is an asm("ret"),
> which is a new (the 1000th!) opcode.
That sounds hairy. :-)
> > And it goes faster than the prederef computed goto core?
>
>
> Yep, though I don't have many test cases:
> $ parrot -P life.pbc
> 5000 generations in 5.629500 seconds. 888.178341 generations/sec
>
> $ parrot -j life.pbc
> 5000 generations in 5.290360 seconds. 945.115271 generations/sec
>
> $ cd languages/perl6 ; perl6 -C -O3 ../../life.pasm
> 5000 generations in 5.154097 seconds. 970.102045 generations/sec
>
> parrot/imcc: -O3 compiled on i386/linux, Athlon 800.
Presumably life.pasm mainly uses ops that already have JIT implementations.
Do you have examples of code that doesn't have many OPs, and "currently"
goes faster on x86 under CGoto rather than JIT? These would be the
interesting ones.
> Should I commit it or send to the list first?
I don't know. Are you confident in it? Does it pass all Parrot's regression
tests?
Nicholas Clark
> On Thu, Feb 13, 2003 at 04:43:37PM +0100, Leopold Toetsch wrote:
>
>>Nicholas Clark wrote:
>>JIT/i386 uses the stack frame of CGP for its own. When there is a
>>sequence of non-JITed functions, JIT code branches directly into the
>>CGP core and executes the CGP ops. Going back to JIT is an asm("ret"),
>>which is a new (the 1000th!) opcode.
> That sounds hairy. :-)
It's not utterly complex. Though single-stepping through a short program
with ddd might be helpful ;-)
> Presumably life.pasm mainly uses ops that already have JIT implementations.
Yep.
> Do you have examples of code that doesn't have many OPs, and "currently"
> goes faster on x86 under CGoto rather than JIT? These would be the
> interesting ones.
IIRC perl6/examples/life.p6 was slower with JIT. Now JIT wins, but of
course mainly due to more JITed opcodes.
No, I don't have a good test for this.
>>Should I commit it or send to the list first?
>>
>
> I don't know. Are you confident in it? Does it pass all Parrot's regression
> tests?
It passes parrot's, imcc's and perl6 tests, with -g and with -O3.
> Nicholas Clark
leo
I'd suggest committing it, rather than sending it to the list.
Unless there are extra comments you'd want to make to the list. But even
then, I think it would be better to write them into a design doc (or whatever
the correct place is) and just tell the list where to read it from the
checkout.
Sorry. Thinking "aloud"
Nicholas Clark
>
> I'd suggest committing it, rather than sending it to the list.
Ok, I'll do that after checking again on a second machine.
> ... I think it would be better to write them into a design doc
I'll include docs/dev/jit_i386.dev, which has most of the gory details.
> Nicholas Clark
leo