Why is the fib benchmark still slow - part 1

Leopold Toetsch

Nov 5, 2004, 5:39:53 AM
to Perl 6 Internals
Below (inline/attached) is a longish analysis of fib benchmark timings.
leo
FIB1.TXT

Leopold Toetsch

Nov 5, 2004, 8:12:48 AM
to Miroslav Silovic, perl6-i...@perl.org
Miroslav Silovic wrote:
> l...@toetsch.at wrote:
>
>> a) accessing a new register frame and context
>> b) during DOD/GC
>>
> Or it would make sense to use multi-frame register chunks. I kept
> locality of access in mind but somehow never spelled it out. But I
> *think* I mentioned 64kb as a good chunk size precisely because it fits
> well into the CPU cache - without ever specifying this as the reason.

Yep, that's the idea, as originally proposed. The one-frame-per-chunk scheme
was the intermediate step. And I'm thinking of 64 KB too. The Parrot
register structure is 640 bytes on a 32-bit system with 8-byte doubles. With
that size we have a worst-case 1% overhead for allocating a new chunk.
Put differently, 100 levels of nesting happen without any allocation.
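
A minimal sketch of such a chunked frame allocator, with invented names
(frame_chunk_t, CHUNK_SIZE, FRAME_SIZE) rather than Parrot's actual ones:

    #include <stdlib.h>

    #define CHUNK_SIZE (64 * 1024)    /* 64 KB, cache-friendly chunk      */
    #define FRAME_SIZE 640            /* one register frame, 32-bit build */

    typedef struct frame_chunk_t {
        char                 *top;    /* next free byte in this chunk  */
        char                 *end;    /* one past the last usable byte */
        struct frame_chunk_t *prev;   /* previously filled chunk       */
        char                  data[CHUNK_SIZE];
    } frame_chunk_t;

    /* Push one register frame; a fresh chunk is malloced only when the
     * current one is full, i.e. roughly once per 100 nested calls.     */
    static void *
    push_frame(frame_chunk_t **chunk)
    {
        frame_chunk_t *c = *chunk;
        void          *frame;

        if (!c || c->top + FRAME_SIZE > c->end) {
            frame_chunk_t *n = malloc(sizeof (*n));
            n->top  = n->data;
            n->end  = n->data + CHUNK_SIZE;
            n->prev = c;
            *chunk  = c = n;
        }
        frame   = c->top;
        c->top += FRAME_SIZE;
        return frame;
    }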

> Anyway, if you can pop both register frames -and- context structures,
> you won't run GC too often, and everything will nicely fit into the
> cache.

I thought about that too, but I'd rather have the registers adjacent, so
that the caller can place function arguments directly into the callee's
incoming arguments.

OTOH it doesn't really matter if the context structure is in the frame
too. We'd just need to skip that gap. REG_INT(64) or I64 is as valid as
I0 or I4, as long as it's assured that it addresses exactly the
incoming argument area of the called function.
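
For illustration only (the types and the REG_INT macro below are invented,
and the exact register number that lands on the callee's I0 depends on the
real frame layout), adjacent frames could be addressed like this:

    typedef struct reg_frame_t {
        long   int_regs[32];
        double num_regs[32];
        void  *str_regs[32];
        void  *pmc_regs[32];
    } reg_frame_t;

    typedef struct frame_gap_t {   /* the context "gap" between frames */
        void *current_sub;
        void *lex_pad;
    } frame_gap_t;

    #define FRAME_STRIDE  (sizeof (reg_frame_t) + sizeof (frame_gap_t))

    /* I0..I31 address the current frame; higher numbers skip the gap
     * and address the next frame, i.e. the callee's in-arguments.    */
    #define REG_INT_PTR(bp, n)                                              \
        ((n) < 32                                                           \
            ? &((reg_frame_t *)(bp))->int_regs[(n)]                         \
            : &((reg_frame_t *)((char *)(bp) + FRAME_STRIDE))->int_regs[(n) - 32])

    #define REG_INT(bp, n)  (*REG_INT_PTR(bp, n))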

> ... Is the context structure a PMC now (and does it have to be, if
> the code doesn't specifically request access to it?)

The context structure is inside the interpreter structure. When doing a
function call, it's a malloced copy of the caller's state, hanging off
the continuation PMC.
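
A rough sketch of that arrangement, with invented names and fields rather
than Parrot's actual structures:

    #include <stdlib.h>
    #include <string.h>

    typedef struct context_t {     /* lives inside the interpreter */
        void *bp;                  /* register base pointer        */
        void *lex_pad;
        void *current_sub;
    } context_t;

    typedef struct retcont_t {     /* return continuation PMC data */
        context_t *saved_ctx;      /* malloced copy of the caller's state */
        void      *resume_addr;
    } retcont_t;

    static retcont_t *
    make_retcont(const context_t *caller_ctx, void *resume_addr)
    {
        retcont_t *rc   = malloc(sizeof (*rc));
        rc->saved_ctx   = malloc(sizeof (*rc->saved_ctx));
        memcpy(rc->saved_ctx, caller_ctx, sizeof (*caller_ctx));
        rc->resume_addr = resume_addr;
        return rc;
    }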

>> ad b)

> Is there a way to find out how many misses came out from DoD, compared
> to register frames allocation?

Sure. Cachegrind shows you the line in the C source code ;) With
incremental M&S we have (top n, rounded):

L2 write misses (all in context and registers)

    500,000  Parrot_Sub_invoke     touches interpreter->ctx.bp
    500,000  -""-                  touches registers, e.g. REG_PMC(0) = foo
    200,000  copy_registers        -""-
    700,000  ???:???               very likely JIT code writing regs first
    600,000  malloc.c:chunk_alloc

L2 read misses (DOD)

  1,300,000  Parrot_dod_sweep
  1,300,000  contained_in_pool
    600,000  get_free_object       free_list handling

plus 500,000 more in __libc_free.

> I believe that you shouldn't litter (i.e. create an immediately GCable
> object) on each function call - at least not without generational
> collector specifically optimised to work with this.

The problem isn't the object creation per se, but the sweep through the
*whole object memory* to detect dead objects. It's of course true that
we don't need the return continuation PMC for the fib benchmark. But an
HLL-translated fib would use Integer PMCs for the calculations.
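
To make the cost concrete, here is a generic sweep loop (illustrative names,
not Parrot's actual DOD code); it visits every object slot, live or dead, so
its time and cache footprint scale with the total object memory rather than
with what the benchmark allocated:

    #include <stddef.h>

    enum { FLAG_LIVE = 1 };

    typedef struct object_t {
        unsigned         flags;
        struct object_t *next_free;
    } object_t;

    typedef struct arena_t {
        struct arena_t *next;
        size_t          total_objects;
        object_t       *start;
    } arena_t;

    static void
    dod_sweep(arena_t *arenas, object_t **free_list)
    {
        arena_t *a;
        for (a = arenas; a; a = a->next) {
            object_t *obj = a->start;
            size_t    i;
            for (i = 0; i < a->total_objects; i++, obj++) {
                if (obj->flags & FLAG_LIVE)
                    obj->flags &= ~FLAG_LIVE;     /* clear for the next run  */
                else {
                    obj->next_free = *free_list;  /* dead: back on free list */
                    *free_list     = obj;
                }
            }
        }
    }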

> ... This would entail
> the first generation that fits into the CPU cache and copying out live
> objects from it. And this means copying GC for Parrot, something that
> (IMHO) would be highly nontrivial to retrofit.

A copying GC isn't really hard to implement. And it has the additional
benefit of providing better cache locality. Nontrivial to retrofit or
not, we need a generational GC.
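
For reference, the core of a semispace copying collector (textbook
Cheney-style code, not anything currently in Parrot) shows where the
locality benefit comes from - survivors end up packed together in to-space:

    #include <stddef.h>
    #include <string.h>

    typedef struct obj_t {
        size_t        size;        /* total object size in bytes         */
        struct obj_t *forward;     /* set once the object has been moved */
        /* pointer-carrying payload would follow */
    } obj_t;

    static char *to_top;           /* allocation pointer in to-space,
                                      set up by the collector per run    */

    static obj_t *
    copy_obj(obj_t *o)
    {
        obj_t *n;
        if (o->forward)            /* already evacuated */
            return o->forward;
        n = (obj_t *)to_top;
        memcpy(n, o, o->size);
        to_top    += o->size;
        n->forward = NULL;
        o->forward = n;            /* leave a forwarding pointer behind */
        return n;
    }

    /* A full collector would now scan to-space (the Cheney queue),
     * call copy_obj on every pointer field it finds, then flip spaces. */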

> Miro

leo

Miroslav Silovic

Nov 5, 2004, 11:55:09 AM
to Leopold Toetsch, perl6-i...@perl.org
Leopold Toetsch wrote:

>> I believe that you shouldn't litter (i.e. create an immediately
>> GCable object) on each function call - at least not without
>> generational collector specifically optimised to work with this.
>
>
> The problem isn't the object creation per se, but the sweep through
> the *whole object memory* to detect dead objects. It's of course true,
> that we don't need the return continuation PMC for the fib benchmark.

Well, creation is also a problem if you crawl the entire free heap
before triggering the next GC round. You get a potential cache miss on
each creation, each mark, and each destruction. To keep GC out of the
way, the entire arena has to be confined to the cache size or less.

> But a HLL translated fib would use Integer PMCs for calculations.

Hmm, I'm nitpicking here, but that's not how e.g. Psyco works. It
specialises each function to specific argument types and recompiles for
each new argument-type set. Assuming that you'll call only very few
functions with more than 1-2 type combinations, this is a good tradeoff.
It also removes a lot of consing, especially for arithmetic.
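
A hedged sketch of that style of specialization (in the spirit of Psyco,
not its actual API, and with invented names): each distinct argument-type
combination maps to its own unboxed implementation, so the hot integer
path never conses:

    #include <stddef.h>

    enum arg_type { T_INT, T_FLOAT, T_PMC };

    typedef long (*fib_impl)(long);

    typedef struct spec_entry {
        enum arg_type key;
        fib_impl      impl;
    } spec_entry;

    static long
    fib_int(long n)                          /* unboxed integer version */
    {
        return n < 2 ? n : fib_int(n - 1) + fib_int(n - 2);
    }

    /* Per-call-site cache; a real system would compile a new
     * specialization on a miss instead of returning NULL.      */
    static fib_impl
    lookup_specialization(enum arg_type t)
    {
        static const spec_entry cache[] = { { T_INT, fib_int } };
        size_t i;
        for (i = 0; i < sizeof cache / sizeof cache[0]; i++)
            if (cache[i].key == t)
                return cache[i].impl;
        return NULL;
    }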

>
>> ... This would entail the first generation that fits into the CPU
>> cache and copying out live objects from it. And this means copying GC
>> for Parrot, something that (IMHO) would be highly nontrivial to
>> retrofit.
>
>
> A copying GC isn't really hard to implement. And it has the additional
> benefit of providing better cache locality. Nontrivial to retrofit or
> not, we need a generational GC.

Copying and generational are orthogonal concepts. It's quite possible to
have a non-copying gengc (and a non-generational copying GC, but that's
beside the point). This gives you quick mark phases, but without as much
gain in locality (I think the Boehm GC can do this).

The problem with a copying GC is that pointers can change under your feet
at every opportunity. Embedded C libraries that try to manipulate GCed
objects really hate it when that happens - in particular where some
pointers get cached in CPU registers - and telling the GC to (un)protect
a pointer is a chore for C programmers (as bad as manual refcounting). I
suppose there are good solutions to this; I'm just not aware of any.

The gain is that you can guarantee that the youngest generation will be
no bigger than X KB. This can be a very good thing.

However, for the problem at hand - namely, littering during function
calls - a custom deallocator (that'd be chunks) could be enough. In
particular, it makes sense to use one in conjunction with a non-copying
gengc.

Miro


Leopold Toetsch

Nov 5, 2004, 12:51:06 PM
to Miroslav Silovic, perl6-i...@perl.org
Miroslav Silovic <mi...@puremagic.com> wrote:

> The problem with copying GC is that pointers can change under your feet
> at every opportunity.

Basically yes. We have that problem currently with the copying
collection of strings. Normally this is solved by keeping the old object
in place, so that pointer stores into the object don't fail, and a write
barrier redirects the desired action to the new object location via a
forwarding pointer.
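
A sketch of such a barrier (invented names, not the actual string GC code):
the old header stays in place and every write through it gets redirected to
the moved copy:

    #include <stddef.h>

    typedef struct str_t {
        struct str_t *forward;   /* NULL until the string has been moved */
        char         *buf;
        size_t        len;
    } str_t;

    /* Every write goes through the barrier, which chases the
     * forwarding pointer (possibly a short chain) first.       */
    static str_t *
    str_write_barrier(str_t *s)
    {
        while (s->forward)
            s = s->forward;
        return s;
    }

A store then goes through the barrier first, e.g.
str_write_barrier(s)->buf[0] = 'x';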

> However, for the problem at hand - namely, littering during function
> calls, custom deallocator (that'd be chunks) could be enough.

Yes. As said, there are several issues, and we have to address them one by one.

> Miro

leo

Piers Cawley

Nov 5, 2004, 1:27:13 PM
to Miroslav Silovic, Leopold Toetsch, perl6-i...@perl.org
Miroslav Silovic <mi...@puremagic.com> writes:

The catch with generational GC is that, once you have to guarantee that
destructors are called promptly, you still have to sweep the whole
arena every time you leave a scope.

Dan Sugalski

Nov 5, 2004, 2:24:49 PM
to Leopold Toetsch, Perl 6 Internals
At 11:39 AM +0100 11/5/04, Leopold Toetsch wrote:
>The cache misses are basically in two places

>
>a) accessing a new register frame and context
>b) during DOD/GC

b) is relatively easy -- I'd bet that the vast majority of the cache
misses are because of the copying collector. That could be cleared up
by moving to a non-copying collector. I'm perfectly fine with that.

Having said that, I do *not* want a non-copying collector to be the
default, or even enabled, in any non-production release, the same
way that the default build is done with no C compiler optimizations. When
we cut 1.0.0, or want to do a real benchmark test, it can be turned
on, but until then I want it off. The copying collector right now is
truly vicious on dangling pointer bugs, and makes parrot die at the
drop of a hat when there's a GC/DOD problem. I rather like that.

a), the register frame/context stuff: we've batted around a number of
solutions to this, and I think a variant of the "we get a new
interpreter structure on invoke" scheme will get us what we need.

If, generally speaking, invoking something gets you a new interpreter
structure with most of the contents of the current structure, we can
tidy some stuff up and, I think, make things a bit cleaner. As a side
effect, a return continuation becomes the interpreter structure
itself. If we do this we have five basic 'destination' PMCs:

1) Sub PMC

2) Closure PMC

3) Coroutine PMC

4) Continuation PMC

5) Return continuation

On invocation:

Sub PMC) Allocates a new interpreter structure from the interpreter
cache. Copies the calling convention registers into the new
interpreter. Copies most of the environment data from the old
interpreter into the new one. Copies a few bits of info for the new
sub (default namespace pointer and such) from the sub PMC into the
new interpreter.

Closure PMC) Pretty much the same as the sub PMC, except that the
closure PMC copies its cached lexical pad pointer into the new
interpreter rather than starting fresh.

Coroutine PMC) Like the closure PMC, except that it caches the top
half of the register sets and copies those in on invocation, and
holds two start addresses (the default start and the current start).

Continuation PMC) Allocates a new interpreter and copies the entire
cached interpreter into it with one massive memcpy. (Whether the low
half of the registers gets copied in is up in the air.) If the current
interpreter's not had a continuation taken on it, then it's
immediately put on the free interpreter list.

Return Continuation PMC) We just make it the interpreter we use,
copying the low half registers of the (now old) interpreter in. The
current interpreter's immediately put on the free interpreter list.

I think this covers it all. It means that returning from a sub can
immediately recycle an interpreter, and needs only a potentially
small memcpy to work. Calling a sub does require an allocation off a
special queue and some memory copying, but since each sub potentially
has a very different environment (lexicals, namespaces, opcode files,
bytecode file, security settings, and such) we're going to have to do
this regardless of what we want.
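
A hedged sketch of that scheme (all names invented, not a committed design):
an interpreter comes off a free list on invoke, and returning recycles it
immediately:

    #include <stdlib.h>

    typedef struct interp_t {
        struct interp_t *free_next;    /* free-list link                     */
        struct interp_t *return_to;    /* doubles as the return continuation */
        long             int_regs[32];
        /* ... namespace, lexicals, security settings, etc. ...              */
    } interp_t;

    static interp_t *interp_free_list;

    static interp_t *
    interp_alloc(void)
    {
        interp_t *i = interp_free_list;
        if (i)
            interp_free_list = i->free_next;
        else
            i = calloc(1, sizeof (*i));
        return i;
    }

    /* Sub invocation: new interpreter, most of the environment copied over. */
    static interp_t *
    invoke_sub(interp_t *caller)
    {
        interp_t *callee  = interp_alloc();
        *callee           = *caller;   /* one memcpy-sized copy               */
        callee->return_to = caller;    /* the old interpreter is the retcont  */
        return callee;
    }

    /* Return: switch back to the caller, recycle the callee right away
     * (only safe if no other continuation was taken on it).             */
    static interp_t *
    invoke_return(interp_t *callee)
    {
        interp_t *caller  = callee->return_to;
        callee->free_next = interp_free_list;
        interp_free_list  = callee;
        return caller;
    }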

We *could*, to cut down on some of this, split out the 'constant
across calls' parts, like the counters, memory pools, arenas, and
whatnot, into a separate structure that gets accessed indirectly.
Depending on how much there was we may or may not see a win, since
we're trading off a little extra access time (through a pointer) for
a bit less memory copying on sub invocation. It's probably worth it,
though.
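
The split-out variant could look roughly like this (again with invented
names), trading one pointer indirection for a smaller per-invoke copy:

    typedef struct interp_shared_t {   /* constant across calls */
        void *arenas;
        void *memory_pools;
        long  dod_runs;
        long  collect_runs;
    } interp_shared_t;

    typedef struct interp_t {
        interp_shared_t *shared;       /* extra indirection on access ...   */
        long             int_regs[32];
        /* ... per-call context only, so the copy on invoke stays small */
    } interp_t;
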
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Leopold Toetsch

Nov 5, 2004, 3:34:06 PM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:

[ Yet another f'up ]

> ..., except that it caches the top


> half of the register sets

[ ... ]

> copying the low half registers of the (now old)

Dan, the split into a lower and an upper half of the registers was a premature
optimization, with umpteen opcodes added to address the problem we already
had at that time - memory copying.

Please remember how these opcodes and the split came in. It was wrong and
it remains wrong.

That approach tried to address the real problem (memory copying cost and
cache pollution) by only doing half of the work. This makes either a
16x4-register machine out of Parrot or a broken 32x4-register machine,
because some needed registers might not be preserved.

This just doesn't work, as it doesn't account for the actual register
usage of either the caller or the called sub.

And it's a PITA for the register allocation code.

leo

Leopold Toetsch

Nov 5, 2004, 3:03:16 PM
to Piers Cawley, perl6-i...@perl.org
Piers Cawley <pdca...@bofh.org.uk> wrote:

> The catch with generation GC is that, once you have guaranteed
> destructors being called promptly, you still have to sweep the whole
> arena every time you leave a scope.

Yes. I hope that we can track objects needing timely destruction
directly. We have write barriers that should be able to track the
liveness of these beasts.
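
One way that could look (a sketch with invented names, not existing Parrot
code): objects carrying destructors register themselves on a short list, and
leaving a scope walks only that list instead of the whole arena:

    typedef struct pmc_t {
        void          (*destroy)(struct pmc_t *);
        int             live;               /* set by the mark phase */
        struct pmc_t   *needs_destroy_next;
    } pmc_t;

    static pmc_t *needs_destroy_list;       /* maintained at creation time
                                               or via the write barrier   */

    static void
    register_timely_destruction(pmc_t *p)
    {
        p->needs_destroy_next = needs_destroy_list;
        needs_destroy_list    = p;
    }

    /* On scope exit, walk only the registered objects. */
    static void
    run_pending_destructors(void)
    {
        pmc_t **pp = &needs_destroy_list;
        while (*pp) {
            pmc_t *p = *pp;
            if (!p->live) {
                *pp = p->needs_destroy_next;
                p->destroy(p);
            }
            else
                pp = &p->needs_destroy_next;
        }
    }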

leo

Leopold Toetsch

Nov 5, 2004, 2:57:32 PM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:
> At 11:39 AM +0100 11/5/04, Leopold Toetsch wrote:
>>The cache misses are basically in two places
>>
>>a) accessing a new register frame and context
>>b) during DOD/GC

> b) is relatively easy -- I'd bet that the vast majority of the cache
> misses are because of the copying collector.

Which copying collector? We have one for buffer memory. That isn't even
triggered. Please (re)read the thread. It's all during the M&S sweep
phase, and there are details above of where the misses are ...

[ ... ]

> If, generally speaking, invoking something gets you a new interpreter
> structure

That's cache-wise worse than the current scheme and doesn't address the
cache issues at all, sorry. The interpreter structure is bigger than a
register frame, so it'll blow the caches even more.

leo

Dan Sugalski

Nov 5, 2004, 4:28:55 PM
to l...@toetsch.at, perl6-i...@perl.org
At 9:34 PM +0100 11/5/04, Leopold Toetsch wrote:
>Dan Sugalski <d...@sidhe.org> wrote:
>
>[ Yet another f'up ]
>
>> ..., except that it caches the top
>> half of the register sets
>
>[ ... ]
>
>> copying the low half registers of the (now old)
>
>Dan, the split in lower and upper half of registers was a premature
>optimization

You're right. At this point, all this stuff is a premature optimization.

The current calling scheme stands. We're not changing it now. We're
not going to tune it now. PMC sizes stay the way they are. The DOD
sweep can pick up dead continuations when it needs to, since that
works well enough. The source changes for GC barriers can stay, and
any changes to allow later replacement of the GC or DOD systems can
go in, but if anyone wants to screw around with a generational
collector they can do it in a branch -- the existing M&S DOD system
and copying collector are fine for now.

No more performance changes. Period. We get parrot fully functional first.

Leopold Toetsch

Nov 6, 2004, 3:38:46 AM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:

> The current calling scheme stands.

[ ... ]

> No more performance changes. Period. We get parrot fully functional first.

Ok. As far as PIR code is concerned, the call internals are nicely hidden
by the function call and return syntax. But PASM could be a problem if
there is a growing amount of such code around.
We will eventually have to change the calling conventions slightly for
more speed.

leo

Dan Sugalski

Nov 6, 2004, 10:07:29 AM
to l...@toetsch.at, perl6-i...@perl.org

We can lift the requirement that registers be saved for calls. Beyond
that, the calling conventions are fixed.
