Need for fingerprinting? [was: Re: What to do if Digest::MD5 is unavailable?]


gre...@focusresearch.com

Nov 3, 2002, 12:37:28 PM
to Josh Wilmes, Brent Dax, Andy Dougherty, jo...@hitchhiker.org, Perl6 Internals
All --

FWIW, this stuff came up early on in Parrot's infancy. At one time we had
fingerprinting, then we removed it. At one time it was MD5, then we went
away from MD5. IM(NS)HO, the better approach overall is to remove the need
for fingerprinting altogether. I used to know roughly how to get from
where we were to that state, but when I just looked at the code, I
realized I'd have to catch up on a lot of history to say anything useful
about how to get there from where we are now.

If there is interest in the idea in general, I'll see if I can find
pointers to the prior discussion. Else, I'll just duck back into the
corner :) ...

On a related note, I'm working on a toy VM outside of Parrot to
demonstrate the technique I've proposed here in the past, but alas, time
is short. Anyway, if I can get something produced that is worth looking
at, I'll post here.


Regards,

-- Gregor

Josh Wilmes <jo...@hitchhiker.org>
Sent by: jo...@hitchhiker.org
11/01/2002 11:34 PM
Please respond to Josh Wilmes

To: "Brent Dax" <bren...@cpan.org>
cc: "'Andy Dougherty'" <doug...@lafayette.edu>, "'Perl6 Internals'"
<perl6-i...@perl.org>
Subject: Re: What to do if Digest::MD5 is unavailable?


I think this solution is the simplest... I'll go ahead and commit it.

--Josh

At 10:15 on 11/01/2002 PST, "Brent Dax" <bren...@cpan.org> wrote:

> Andy Dougherty:
> # At the moment, the bytecode "fingerprint" is built with
> # Digest::MD5. Alas, Digest::MD5 wasn't standard with perl
> # versions prior to 5.8.0. What should happen in those cases?
> # Anybody have any good ideas?
>
> Not sure if this qualifies as "good" :^), but we *could* package the
> pure-Perl implementation of Digest::MD5 with Parrot.
>
> --Brent Dax <bren...@cpan.org>
> @roles=map {"Parrot $_"} qw(embedding regexen Configure)
>
> Wire telegraph is a kind of a very, very long cat. You pull his tail in
> New York and his head is meowing in Los Angeles. And radio operates
> exactly the same way. The only difference is that there is no cat.
> --Albert Einstein (explaining radio)


Leopold Toetsch

Nov 3, 2002, 2:49:32 PM
to gre...@focusresearch.com, Josh Wilmes, Brent Dax, Andy Dougherty, Perl6 Internals
gre...@focusresearch.com wrote:

> All --
>
> FWIW, this stuff came up early on in Parrot's infancy.


Pointers, hints, information ...


> On a related note, I'm working on a toy VM outside of Parrot to
> demonstrate the technique I've proposed here in the past,


Pointers, hints, information ...

thanks,
leo ;-)

gre...@focusresearch.com

Nov 3, 2002, 4:59:22 PM
to Leopold Toetsch, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
Leo --

Here's one of the early fingerprinting patches, 2001-09-14:

http://archive.develooper.com/perl6-i...@perl.org/msg04063.html

Here's where Simon removed Digest::MD5, 2001-09-18:

http://archive.develooper.com/cvs-p...@perl.org/msg00151.html


Here's one of the messages about how I'd like to see us link op
implementations with their op codes:

http://archive.develooper.com/perl6-i...@perl.org/msg06193.html

You can use that to vector into the thread. My ideas have changed a bit
since then (see below), but you can get some of the idea there.

Here's another message that touched on this kind of stuff:

http://archive.develooper.com/perl6-i...@perl.org/msg06270.html

What I advocate is having possibly only one (maybe too extreme, but
doable) built-in op pre-loaded at opcode zero. This op's name is "useop",
and its arguments give an opcode (optable index) and sufficient
information for the interpreter to chase down the opinfo (and opfunc).
In the best scenario, this could even mean doing some dynamic loading of
oplibs. BTW, that was the point of my initial bloated-but-lightning-fast
oplookup switch() tree implementation, which has now been replaced with
something I expect is more sane (I went to that extreme because I was
getting push-back that the by-name lookups would be slow, and even though
I never advocated looking them up in DO_OP, I still wanted to demonstrate
that they could be *very* fast).
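
For concreteness, here is a minimal C sketch of the machinery useop
implies -- all names (op_info_t, optable_t, find_op_by_name, the opfunc
signature) are hypothetical, not Parrot's actual API:

    #include <stdio.h>
    #include <stdlib.h>

    typedef long opcode_t;

    typedef struct op_info_t {
        const char      *name;                               /* e.g. "add_i_i_i" */
        const opcode_t *(*opfunc)(const opcode_t *, void *); /* returns next pc */
    } op_info_t;

    typedef struct optable_t {
        op_info_t **entries;   /* index == opcode for this chunk of code */
        size_t      size;
    } optable_t;

    /* Assumed by-name lookup; supplied per-oplib in the sketch below. */
    extern op_info_t *find_op_by_name(const char *oplib, const char *op);

    /* useop INDEX, "oplib", "opname" -- bind slot INDEX of the running
     * interpreter's optable to the named op. */
    void do_useop(optable_t *t, size_t index, const char *oplib, const char *op)
    {
        op_info_t *info = find_op_by_name(oplib, op);
        if (info == NULL) {
            fprintf(stderr, "useop: unknown op %s::%s\n", oplib, op);
            exit(EXIT_FAILURE);
        }
        t->entries[index] = info;
    }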

Now, whether or not you statically link other oplibs, I suggest not
having every op be allocated a slot in the optable. Rather, the initial
few ops in the startup code of a chunk of Parrot code are responsible for
using useop to arrange the appropriate optable for that code. For
example, assembling mops.pasm would result in the first chunk of code
making 13 calls to useop to attach the ops used by that code. No longer
do we care what order ops are in within their oplibs, because opcodes are
not a meaningful concept at a static level (in the Parrot source).

I've noticed that the current setup concats *.ops into one big
core_ops.c, which is very different from what I was trying to move us
towards long ago. I'm an advocate of smaller and independent *.ops files,
separately compiled, and (possibly) only some actually statically linked
in. An additional by-name lookup will be needed to map oplib names (and
possibly version info, if we determine that is necessary) to the
oplibinfo structures we got from the statically or dynamically linked
oplibs. The oplibinfo structures give you the function pointer to call to
look up ops by name in that oplib.
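
A sketch of what that registry might look like, reusing the hypothetical
op_info_t from the sketch above (again, invented names, not Parrot's
actual structures):

    #include <string.h>

    typedef struct oplib_info_t {
        const char *name;                            /* e.g. "core_ops" */
        int         major, minor;                    /* version info, if needed */
        op_info_t *(*find_op)(const char *op_name);  /* by-name lookup */
    } oplib_info_t;

    /* Filled in at startup by statically linked oplibs, and at runtime by
     * dynamically loaded ones; a linear scan is fine at this scale. */
    static oplib_info_t *oplibs[64];
    static size_t        n_oplibs;

    oplib_info_t *find_oplib(const char *name)
    {
        for (size_t i = 0; i < n_oplibs; i++)
            if (strcmp(oplibs[i]->name, name) == 0)
                return oplibs[i];
        return NULL;   /* caller could then try dynamic loading */
    }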

One final interesting (at least to me) note: A chunk of code could
overwrite optable entry zero with some noo?p-equivalent op to prevent any
further changes to its optable once it has things the way it wants them.


Regards,

-- Gregor

Leopold Toetsch <l...@toetsch.at>
11/03/2002 02:49 PM

To: gre...@focusresearch.com
cc: Josh Wilmes <jo...@hitchhiker.org>, Brent Dax <bren...@cpan.org>,
"'Andy Dougherty'" <doug...@lafayette.edu>, "'Perl6 Internals'"
<perl6-i...@perl.org>

Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is
unavailable?]

Leopold Toetsch

Nov 4, 2002, 5:40:07 AM
to gre...@focusresearch.com, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
gre...@focusresearch.com wrote:

> Leo --


> Here's one of the messages about how I'd like to see us link op
> implementations with their op codes:
>
> http://archive.develooper.com/perl6-i...@perl.org/msg06193.html

Thanks for all these pointers.

I did read this thread WRT dynamic opcode loading. We will need the
ability to load different oplibs. But for core.ops I'd rather stay with
the static scheme used now.

Your proposal would solve the problem with fingerprinting, but,
especially for huge programs, the load-time overhead seems too big to me.
Invalid PBC files due to changes will become rarer and rarer during the
development of Parrot, as core.ops stabilizes.

The remaining problem is how to run ops from different oplibs _fast_.

leo

gre...@focusresearch.com

Nov 4, 2002, 7:22:20 AM
to Leopold Toetsch, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
Leo --

Your concern about speed has been raised before, but I still don't see it
(one of the reasons I've been working on a toy VM to demonstrate is that I
figure there's some piece I'm not communicating well, and being able to
point at code would help). Optable build time is not a function of program
size, but rather of optable size, since it would be (at least
conventionally/normally) built as a sort of preamble in the generated
bytecode.

My latest thinking is to divorce the two issues: dynamic optables and
dynamic loading (of anything). In the past, I've considered both pieces
important (and I still think that), but now I think even if you statically
link all the ops, still doing dynamic optables is a win (and enables us to
implement dynamic loading of oplibs eventually).

I am not against having some oplibs (core) statically linked, so we are
not in disagreement there. Over the long haul, we might disagree as to
what constitutes a "core" op, but I don't think that bears on whether or
not dynamic optables are right for Parrot.

Having dynamic optables means that we do not require core.ops to ever
stabilize entirely. That will be handy for a while yet during development
(as you mention), and will remain handy as Parrot evolves through later
versions and we discover we wish we had op X but don't (something I think
you see as unlikely, but I see as likely). Dynamic optables essentially
reduce friction -- they give us the freedom to add ops however and
wherever we want without having to overcome a barrier created by a desire
to not obsolete people's existing PBC files.

I don't think it remains a problem how to run ops from different oplibs
_fast_. Op lookup is already fast (assuming it hasn't slowed significantly
from where I left it). Here is an email that mentions the original
switch() based find_op() oplookup performance being on a par with the
interpreter's ops/sec rating:

http://archive.develooper.com/perl6-i...@perl.org/msg08629.html

So, that would mean that for the oplookup part, you would have the
equivalent of executing an extra N ops (remember that the mops.pasm idea
of "an op" is one of the fastest ops in the interpreter, too) *only at the
startup of your program*. There would be some additional cycles eaten to
do the optable build besides the oplookup, but I assume that the bits and
pieces of that are going to be fast, too, since it would likely leverage
other dynamic list code that has been optimized for speed. After the
preamble, while the program is running, the cost of having a dynamic
optable is absolutely *nil*, whether the ops in question were statically
or dynamically loaded (if you don't see that, then either I'm very wrong,
or I haven't given you the right mental picture of what I'm talking
about).
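
To make the *nil* claim concrete, here is the shape of the inner dispatch
loop under a dynamic optable (hypothetical names, matching the earlier
sketches). After the preamble it is one table load and one indirect call
per op, and nothing in it distinguishes statically installed entries from
useop-installed ones:

    void run(optable_t *t, const opcode_t *pc, void *interp)
    {
        while (pc != NULL) {
            /* *pc indexes the optable; the loop neither knows nor cares
             * how entries[*pc] got there -- static link or useop. */
            pc = t->entries[*pc]->opfunc(pc, interp);
        }
    }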

BTW, here's the email where I show the benchmarker I used to measure the
oplookup speed with the original switch() implementation of find_op(). It
looks up a single op numerous times. Of course, a better one would
randomly pick a known op name N times and look that up.

http://archive.develooper.com/perl6-i...@perl.org/msg08676.html
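
A hedged re-creation of that kind of micro-benchmark, with the
randomization suggested above (this is not the original code; the op
names and find_op_by_name are placeholders):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    extern struct op_info_t *find_op_by_name(const char *oplib, const char *op);

    int main(void)
    {
        const char *names[] = { "end", "noop", "set_i_ic", "add_i_i_i" };
        const long  N  = 10000000;
        clock_t     t0 = clock();

        for (long i = 0; i < N; i++)        /* random known-op lookups */
            (void)find_op_by_name("core_ops", names[rand() % 4]);

        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%.2f M oplookups/s\n", (double)N / secs / 1e6);
        return 0;
    }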

Regards,

-- Gregor

Leopold Toetsch <l...@toetsch.at>
11/04/2002 05:40 AM

To: gre...@focusresearch.com
cc: Brent Dax <bren...@cpan.org>, "'Andy Dougherty'"
<doug...@lafayette.edu>, Josh Wilmes <jo...@hitchhiker.org>,
"'Perl6 Internals'" <perl6-i...@perl.org>
Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is
unavailable?]

Leopold Toetsch

Nov 4, 2002, 8:45:26 AM
to gre...@focusresearch.com, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
gre...@focusresearch.com wrote:

> Leo --
>
> ... Optable build time is not a function of program
> size, but rather of optable size


Ok, I see that, but ...


> I don't think it remains a problem how to run ops from different oplibs
> _fast_.


.... the problem is that as soon as there are dynamic oplibs, they can't
be run in the CGoto core, which is normally the fastest core when
execution time depends on opcode dispatch time. JIT is (much) faster on
almost-integer-only code, e.g. mops.pasm, but for more complex programs
involving PMCs, JIT is currently slower.

> ... Op lookup is already fast ...


I rewrote find_op to build a lookup hash at runtime, when it's needed.
This is 2-3 times faster than the find_op with the static lookup table
in the core_ops.c file.
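
Roughly along these lines -- a simplified sketch, not the actual patch
(the chained buckets and the djb2 hash are placeholder choices):

    #include <stdlib.h>
    #include <string.h>

    #define N_BUCKETS 1024

    typedef struct bucket_t {
        const char      *name;    /* full op name */
        int              opcode;  /* its index in the op table */
        struct bucket_t *next;
    } bucket_t;

    static bucket_t *buckets[N_BUCKETS];
    static int       built;

    static unsigned hash_name(const char *s)
    {
        unsigned h = 5381;                 /* djb2 string hash */
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % N_BUCKETS;
    }

    int find_op(const char *const *names, int n_ops, const char *name)
    {
        if (!built) {                      /* build the hash on first use */
            for (int i = 0; i < n_ops; i++) {
                bucket_t *b = malloc(sizeof *b);
                unsigned  h = hash_name(names[i]);
                b->name    = names[i];
                b->opcode  = i;
                b->next    = buckets[h];
                buckets[h] = b;
            }
            built = 1;
        }
        for (bucket_t *b = buckets[hash_name(name)]; b; b = b->next)
            if (strcmp(b->name, name) == 0)
                return b->opcode;
        return -1;                         /* not found */
    }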


> ... After the

> preamble, while the program is running, the cost of having a dynamic
> optable is absolutely *nil*, whether the ops in question were statically
> or dynamically loaded (if you don't see that, then either I'm very wrong,
> or I haven't given you the right mental picture of what I'm talking
> about).


The cost is only almost *nil* if program execution time doesn't depend
on opcode dispatch time. E.g. mops.pasm spends ~50% of its execution
time in cg_core (i.e. the computed goto core); running the normal
fast_core slows this down by ~30%.

This might or might not be true for RL applications, but I hope that the
optimizer will bring average programs near the above ratios.

Nevertheless I see the need for dynamic oplibs. If e.g. a program pulls
in obscure.ops, it could as well pay the penalty for using those ops.


> Regards,
>
> -- Gregor


leo


gre...@focusresearch.com

Nov 4, 2002, 10:09:06 AM
to Leopold Toetsch, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
Leo --

Ah. It seems the point of divergence is slow_core vs. cg_core, et al.

As you have figured out, I've been referring to performance of the non-cg,
non-prederef, non-JIT (read: "slow" ;) core.

I don't know much about the CG core, but prederef and JIT should be able
to work with dynamic optables. For prederef and JIT, optable mucking does
expire your prederefed and JITted blocks (in general), but for
conventional use (preamble setup), you don't pay a price during mainline
execution once you've set up your optable. You only pay an additional cost
if your program is dynamic enough to muck with its optable in the middle
somewhere, so you have to pay to re-prederef or re-JIT stuff (and a use
tax like that seems appropriate to me).
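
One simple way to implement that expiry -- a hypothetical sketch, not
existing Parrot code -- is a generation counter bumped by every optable
change:

    typedef struct {
        unsigned generation;     /* bumped by every useop / optable change */
    } optable_state_t;

    typedef struct {
        void    *block;          /* prederefed or JITted code, NULL if none */
        unsigned built_at;       /* optable generation it was built against */
    } cached_block_t;

    void *get_block(const optable_state_t *t, cached_block_t *c,
                    void *(*recompile)(void))
    {
        if (c->block == NULL || c->built_at != t->generation) {
            c->block    = recompile();   /* pay only after optable mucking */
            c->built_at = t->generation;
        }
        return c->block;
    }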

Of all the cores, the CG core is the most "crystalized" (rigid), so it
stands to reason that it would not be a good match for dynamic optables.

While I don't think I'm sophisticated enough to pull it off on my own, I
do think it should be possible to use what was learned building the JIT
system to construct the equivalent of a CG core on the fly, given its
structure. I think the information and basic capabilities are already
there: The JIT system already knows how to compile a sequence of ops to
machine code -- using this plus enough know-how to plop in the right JMP
instructions pretty much gets you there. A possible limitation to the
coolness here: I think the JIT system bails out for the non-inline ops
and just calls the opfunc (please forgive me if my understanding of what
JIT does and doesn't do is out of date). I think the CG core doesn't have
to take the hit of that extra indirection for non-inline ops. If so, then
the hypothetical dynamic core construction approach just described would
approach the speed of the CG core, but would fall somewhat short on
workloads that involve lots of non-inline ops (FWIW, there are more
inline ops than not in the current *.ops files).

Then, you get CG(-esque) speed along with the dynamic capabilities. It's
cheating, to be sure, but I like that kind of cheating. :) Further, DCC
(dynamic core construction) would work with dynamically loaded oplibs
(presumably using purely the JIT-func-call technique, although I suppose
it's possible to do even better), where the CG core would not.

It would be interesting to see where DCC would fit on the performance
spectrum compared to JIT, for mops.pasm and for other examples with
broader op usage...


Regards,

-- Gregor

Leopold Toetsch <l...@toetsch.at>
11/04/2002 08:45 AM



To: gre...@focusresearch.com
cc: Brent Dax <bren...@cpan.org>, "'Andy Dougherty'"
<doug...@lafayette.edu>, Josh Wilmes <jo...@hitchhiker.org>, "'Perl6
Internals'" <perl6-i...@perl.org>
Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is
unavailable?]

Jason Gloudon

Nov 4, 2002, 11:41:28 AM
to gre...@focusresearch.com, Leopold Toetsch, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
On Sun, Nov 03, 2002 at 04:59:22PM -0500, gre...@focusresearch.com wrote:

> What I advocate is having possibly only one (maybe too extreme, but
> doable) built-in op pre-loaded at opcode zero. This op's name is "useop",
> and its arguments give an opcode (optable index) and sufficient
> information for the interpreter to chase down the opinfo (and opfunc).
> In the best scenario, this

One question this raises is: where does this initialization occur?

I think the information that would be encoded in these instructions
should normally go into a metadata section of the bytecode stored on
disk. Having to pseudo-execute the bytecode in order to disassemble it
seems unnecessary. I think keeping this information separate from the
executable section will make the code generators simpler as well.
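
Something like the following, say -- an invented layout, just to show the
shape of a metadata section the loader could walk once, before the code
segment runs (reusing the hypothetical do_useop/optable_t sketch from
earlier in the thread):

    typedef struct {
        unsigned opcode;      /* slot in this file's optable */
        char     oplib[64];   /* e.g. "core_ops" */
        char     op[64];      /* full op name, e.g. "add_i_i_i" */
    } optable_entry_t;

    typedef struct {
        unsigned        n_entries;
        optable_entry_t entries[];   /* C99 flexible array member */
    } optable_segment_t;

    /* Resolve every binding once at load time, so neither the interpreter
     * nor a disassembler has to pseudo-execute anything. */
    void load_optable(optable_t *t, const optable_segment_t *seg)
    {
        for (unsigned i = 0; i < seg->n_entries; i++)
            do_useop(t, seg->entries[i].opcode,
                     seg->entries[i].oplib, seg->entries[i].op);
    }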

--
Jason

Brent Dax

Nov 4, 2002, 11:21:04 AM
to Leopold Toetsch, gre...@focusresearch.com, Andy Dougherty, Josh Wilmes, Perl6 Internals
Leopold Toetsch:
# .... the problem is that as soon as there are dynamic
# oplibs, they can't be run in the CGoto core, which is normally the
# fastest core when execution time depends on opcode dispatch time. JIT
# is (much) faster on almost-integer-only code, e.g. mops.pasm, but for
# more complex programs involving PMCs, JIT is currently slower.

Wasn't the plan to deal with that to use the JIT to construct a new
cgoto core?

Leopold Toetsch

Nov 4, 2002, 11:36:51 AM
to gre...@focusresearch.com, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
gre...@focusresearch.com wrote:

> Leo --


>
> I don't know much about the CG core, but prederef and JIT should be able
> to work with dynamic optables. For prederef and JIT, optable mucking does
> expire your prederefed and JITted blocks (in general), but for
> conventional use (preamble setup), you don't pay a price during mainline
> execution once you've set up your optable.


Yep

[ JITlike cg_core ]

> ... If so, then the

> hypothetical dynamic core construction approach just described would
> approach the speed of the CG core, but would fall somewhat short on
> workloads that involve lots of non-inline ops (FWIW, there are more inline
> ops than not in the current *.ops files).


Exactly here is the problem. Almost all non-integer/float stuff is
unimplemented in JIT. You don't pay the price per non-inline op, but per
op not in JIT. In CG the op functions are not functions but code pieces,
which get jumped to. JITed code (as far as implemented) is a linear
sequence of the function bodies (or rather their asm equivalents).
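
I.e. the classic computed-goto shape (a toy sketch with made-up ops;
labels-as-values is a gcc extension):

    long cg_run(const long *pc, long *regs)
    {
        static void *labels[] = { &&op_end, &&op_add, &&op_branch };

        goto *labels[*pc];          /* one indirect jump, no call/return */

    op_add:                         /* add Rd, Rs1, Rs2 */
        regs[pc[1]] = regs[pc[2]] + regs[pc[3]];
        pc += 4;
        goto *labels[*pc];

    op_branch:                      /* branch OFFSET (in opcode units) */
        pc += pc[1];
        goto *labels[*pc];

    op_end:
        return regs[0];
    }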


> Then, you get CG(-esque) speed along with the dynamic capabilities. It's
> cheating, to be sure, but I like that kind of cheating. :)


If we are able to build such a system, yes.

But see "Of mops and microops" for yet another approach.

By splitting current opcodes to more fine grained pieces, we would

need less different ops alltogether, and it could be really fast.


> Regards,
>
> -- Gregor

leo

gre...@focusresearch.com

Nov 4, 2002, 12:24:36 PM
to Jason Gloudon, Brent Dax, Andy Dougherty, Josh Wilmes, Leopold Toetsch, Perl6 Internals
Jason --

Originally, I considered adding an optable segment to the packfile
format and no useop op.

After considering useop, I found the idea of a conventional (but
technically optional, TMTOWTDI) preamble and the ability to modify the
optable while running intriguing.

There is value in using an optable segment to pick the ops for the code
segment, even if none of the more dynamic stuff is done.

Retaining the ability to modify the optable at runtime is still
interesting to me. It essentially makes the optable a shortcut for the
useop preamble. However, retaining this ability means that any generic
disassembler will have to be useop-aware, which is what I think you are
trying to avoid.


Regards,

-- Gregor

Jason Gloudon <pe...@gloudon.com>
11/04/2002 11:41 AM

To: gre...@focusresearch.com
cc: Leopold Toetsch <l...@toetsch.at>, Brent Dax <bren...@cpan.org>,
"'Andy Dougherty'" <doug...@lafayette.edu>, Josh Wilmes
<jo...@hitchhiker.org>, "'Perl6 Internals'" <perl6-i...@perl.org>
Subject: Re: Need for fingerprinting? [was: Re: What to do if Digest::MD5 is
unavailable?]

Nicholas Clark

Nov 4, 2002, 4:21:06 PM
to gre...@focusresearch.com, Leopold Toetsch, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
On Mon, Nov 04, 2002 at 10:09:06AM -0500, gre...@focusresearch.com wrote:

> While I don't think I'm sophisticated enough to pull it off on my own, I
> do think it should be possible to use what was learned building the JIT
> system to construct the equivalent of a CG core on the fly, given its
> structure. I think the information and basic capabilities are already
> there: The JIT system already knows how to compile a sequence of ops to
> machine code -- using this plus enough know-how to plop in the right JMP
> instructions pretty much gets you there. A possible limitation to the

I'm not convinced. Compiling the computed goto core with any sort of
optimisation turned on *really* hurts the machine. I think it's over a
minute even on a 733 MHz PIII, and it happily pages everything else out
while it's doing it. :-(
I doubt that the CG core's stats look anywhere near as impressive for the
unoptimised case. [And I'm not at a machine where I can easily generate
some.] This makes me think that it would be hard to "just in time" it.

> coolness here: I think the JIT system bails out for the non-inline ops
> and just calls the opfunc (please forgive me if my understanding of what
> JIT does and doesn't do is out of date). I think the CG core doesn't have
> to take the hit of that extra indirection for non-inline ops. If so, then
> the hypothetical dynamic core construction approach just described would
> approach the speed of the CG core, but would fall somewhat short on
> workloads that involve lots of non-inline ops (FWIW, there are more
> inline ops than not in the current *.ops files).

I believe that your understanding of the JIT and the CG cores is still
correct. The problem would be solved if we had some nice way of getting
the C compiler to generate nice stub versions of all the non-inline op
functions, which we could then place inline. However, I suspect that part
of the speed of the CG core comes from the compiler (this is always gcc?)
being able to do away with the function call and function return
overheads between the ops it has inlined in the CG core.

I've no idea if gcc is allowed to re-order the op blocks in the CG core.
If not, then we might be able to pick apart the blocks it compiles (as
units for the JIT to use) by putting custom asm statements between each,
which our assembler (or machine code) parser spots and uses as delimiters
(hmm, particularly if we have header and trailer asm statements that are
actually just assembly language comments with marker text that gcc passes
through undigested -- this would let us annotate the assembler output of
gcc).
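
E.g. something like this around each op body (hypothetical macros; '#'
starts an assembler comment in gas syntax on x86, other targets differ):

    /* Pure-comment asm markers: gcc emits the strings into its assembler
     * output untouched, so a post-processor can find each op body's
     * boundaries. The "memory" clobber discourages motion across them. */
    #define OP_BEGIN(name) __asm__ volatile ("# PARROT_OP_BEGIN " #name ::: "memory")
    #define OP_END(name)   __asm__ volatile ("# PARROT_OP_END "   #name ::: "memory")

    /* Spliced into a computed-goto core: */
    op_add:
        OP_BEGIN(add);
        regs[pc[1]] = regs[pc[2]] + regs[pc[3]];
        pc += 4;
        OP_END(add);
        goto *labels[*pc];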

Nicholas Clark
--
Brainfuck better than perl? http://www.perl.org/advocacy/spoofathon/

Jason Gloudon

Nov 4, 2002, 8:11:16 PM
to Nicholas Clark, gre...@focusresearch.com, Leopold Toetsch, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
On Mon, Nov 04, 2002 at 09:21:06PM +0000, Nicholas Clark wrote:

> I'm not convinced. Compiling the computed goto core with any sort of
> optimisation turned on *really* hurts the machine. I think it's over a
> minute even on a 733 MHz PIII, and it happily pages everything else out
> while it's doing it. :-(
> I doubt that the CG core's stats look anywhere near as impressive for the
> unoptimised case. [And I'm not at a machine where I can easily generate
> some.] This makes me think that it would be hard to "just in time" it.

It turns out the optimization does make a difference for gcc at least, but for
a strange reason. It seems that without optimization gcc allocates a *lot*
more space on the stack for cg_core. I suspect this is because gcc does not
coalesce the stack space used for temporary values unless optimization is
enabled.

(gcc with no optimization)
../parrot mops.pbc
Iterations: 100000000
Estimated ops: 200000000
Elapsed time: 13.528871
M op/s: 14.783200

(gcc -O2)
../parrot2 mops.pbc
Iterations: 100000000
Estimated ops: 200000000
Elapsed time: 30.111252
M op/s: 6.642035

> functions, which we could then place inline. However, I suspect that part of
> the speed of the CG core comes from the compiler (this is always gcc?)

Newer versions of the Sun Workshop compiler support computed goto.

> units for the JIT to use) by putting in custom asm statements between each,
> which our assembler (or machine code) parser spots and uses as delimiters
> (hmm. particularly if we have header and trailer asm statements that are
> actually just assembly language comments with marker text that gcc passes
> through undigested. This would let us annotate the assembler output of gcc)

A lot of research has been done on this type of thing under the terms
'partial evaluation' and 'specialization'. There is a working
specializing Python compiler that might be of interest:
http://psyco.sourceforge.net/doc.htm

--
Jason

Rhys Weatherley

Nov 4, 2002, 8:52:06 PM
to perl6-i...@perl.org
Nicholas Clark wrote:

> I believe that your understanding of the JIT and the CG cores is still
> correct. The problem would be solved if we had some nice way of getting
> the C compiler to generate nice stub versions of all the non-inline op
> functions, which we could then place inline. However, I suspect that part
> of the speed of the CG core comes from the compiler (this is always gcc?)
> being able to do away with the function call and function return
> overheads between the ops it has inlined in the CG core.

You may want to check out the following two papers and their references:

I. Piumarta, F. Riccardi. Optimizing direct threaded code by selective
inlining. In ACM SIGPLAN Conference on Programming Language Design and
Implementation (PLDI), June 17-19, Montreal, Canada, 1998
ftp://ftp.inria.fr/INRIA/Projects/SOR/papers/1998/ODCSI_pldi98.ps.gz

M. Anton Ertl, A Portable Forth Engine, Proceedings euroFORTH '93,
pages 253-257. http://www.complang.tuwien.ac.at/forth/threaded-code.html

Everything you ever wanted to know about optimising threaded
interpreters, but were too afraid to ask. The "selective inlining" method
in particular talks about how to extract inline code blocks dynamically
and then paste them together.
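
The core of the selective-inlining trick, in miniature (gcc-specific:
&&label gives an op body's address; a real implementation must also check
that the copied code is position-independent and falls through):

    #include <string.h>

    typedef struct {
        void *begin;   /* &&op_X     -- label at the start of the op body */
        void *end;     /* &&op_X_end -- label placed just after it */
    } op_span_t;

    /* Paste the bodies of the ops in `program` end to end into `out`;
     * the caller appends a final jump and marks the buffer executable. */
    size_t splice_ops(unsigned char *out, const op_span_t *spans,
                      const int *program, int n)
    {
        size_t off = 0;
        for (int i = 0; i < n; i++) {
            size_t len = (size_t)((char *)spans[program[i]].end
                                  - (char *)spans[program[i]].begin);
            memcpy(out + off, spans[program[i]].begin, len);
            off += len;
        }
        return off;
    }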

Cheers,

Rhys.

Leopold Toetsch

unread,
Nov 5, 2002, 3:36:59 AM11/5/02
to Jason Gloudon, Nicholas Clark, gre...@focusresearch.com, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
Jason Gloudon wrote:

> On Mon, Nov 04, 2002 at 09:21:06PM +0000, Nicholas Clark wrote:

> It turns out the optimization does make a difference for gcc at least, but for
> a strange reason. It seems that without optimization gcc allocates a *lot*
> more space on the stack for cg_core. I suspect this is because gcc does not
> coalesce the stack space used for temporary values unless optimization is
> enabled.


I never figured out where this stack space was used. Anyway, my last
patch should improve the unoptimized case, due to faster
trace_system_stack, by putting lo_var_ptr beyond this jump table.


> (gcc with no optimization)


> M op/s: 14.783200
>
> (gcc -O2)

> M op/s: 6.642035


Numbers reversed?


leo

Leopold Toetsch

Nov 5, 2002, 3:22:31 AM
to Nicholas Clark, gre...@focusresearch.com, Brent Dax, Andy Dougherty, Josh Wilmes, Perl6 Internals
Nicholas Clark wrote:

> On Mon, Nov 04, 2002 at 10:09:06AM -0500, gre...@focusresearch.com wrote:
>

[ JIT + cg_core ]


> I'm not convinced. Compiling the computed goto core with any sort of
> optimisation turned on *really* hurts the machine.


Here gcc 2.95.2 just fails (256 MB Mem, same swap)


> I doubt that the CG core's stats look anywhere near as impressive for the
> unoptimised case. [And I'm not at a machine where I can easily generate
> some.] This makes me think that it would be hard to "just in time" it.

> ... However, I suspect that part of


> the speed of the CG core comes from the compiler (this is always gcc?)
> being able to do away with the function call and function return overheads
> overheads between the ops it has inlined in the CG core.


Yes, saving the function call overhead is the major speed win in CGoto.


> I've no idea if gcc is allowed to re-order the op blocks in the CG core.


Doesn't matter IMHO (when we annotate the source) ...


> If not, then we might be able to pick apart the blocks it compiles (for
> units for the JIT to use) by putting in custom asm statements between each,
> which our assembler (or machine code) parser spots and uses as delimiters
> (hmm. particularly if we have header and trailer asm statements that are
> actually just assembly language comments with marker text that gcc passes
> through undigested. This would let us annotate the assembler output of gcc)


.... but this is only half of the work. JIT's current outstanding
integer performance depends on explicit register allocation for the most
used IRegs in one block. Mixing JIT instructions and gcc-generated code
wouldn't work because of this register allocation.

My experiment with microops could help here, where the optimizer would
basically generate code for a 3-register machine.


> Nicholas Clark

leo


Rhys Weatherley

Nov 5, 2002, 6:29:35 PM
to Perl6 Internals
Nicholas Clark wrote:

> I'm not convinced. Compiling the computed goto core with any sort of
> optimisation turned on *really* hurts the machine. I think it's over a
> minute even on a 733 MHz PIII, and it happily pages everything else out
> while it's doing it. :-(

Use the "-fno-gcse" option to gcc, to turn off global common
subexpression elimination. That may help with the speed issue.
GCSE messes up interpreter loops anyway.

I found out the hard way on Portable.NET that GCSE makes the
code perform worse, not better. The compiler gets too greedy
about common code, and starts moving things that should stay
inline. GCSE is great in normal code, but not the central
interpreter loop.
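
E.g. (assuming the CG core is compiled as its own translation unit so the
flag can stay local to it; the file name here is made up):

    gcc -O2 -fno-gcse -c core_ops_cg.c -o core_ops_cg.o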

Cheers,

Rhys.
