[PATCH] single item stack chunks

0 views
Skip to first unread message

Leopold Toetsch

unread,
Mar 23, 2004, 6:34:27 AM3/23/04
to P6I
I've stripped down the whole stack code to use one item per chunk. It
passes all tests (3 disabled that push infintely and check for
CHECK_LIMIT and t/pmc/eval_6 which is borken).

This slows down register saving (and other stack operations)
considerably whithout any additional measures[1]:

$ perl tools/dev/parrotbench.pl -c=parrotbench.conf -b='^oof' -t
Numbers are cpu times in seconds. (lower is better)
p-j-Oc parr-j parr-C perl-th perl python ruby
oofib 4.150s 11.530s 12.450s 4.100s 3.540s 2.140s 2.170s

p-j-Oc = parrot -j -Oc where savetop is optimized to pushtopp
parr-j = parrot -j, all 4 registers are saved (both unoptimized build)

[1] The old stack code used a next pointer to prevent stack thrashing.
This is of not much use for single items (and probably doesn't work with
Continuations), so I'm thinking of dedicated free_lists (again!) that
hold popped of stack items of one buffer size kind. They'll be kept
alive by marking them during DOD. If we cannot do this, I see not much
chance to gain old subroutine call performance.

Here is the diffstat of the patch:
classes/closure.pmc | 3
classes/coroutine.pmc | 2
imcc/pcc.c | 8 +-
include/parrot/stacks.h | 13 ---
src/debug.c | 16 +---
src/register.c | 46 +++++-------
src/stack_common.c | 142 ++++----------------------------------
src/stacks.c | 57 ++++++---------
src/sub.c | 18 +---
t/op/stacks.t | 4 -
t/pmc/eval.t | 4 +
11 files changed, 86 insertions(+), 227 deletions(-)

TODOS: cleanup, POD, debug.c stack commands.

leo

stack-items.patch

Dan Sugalski

unread,
Mar 23, 2004, 10:53:29 AM3/23/04
to Leopold Toetsch, P6I
At 12:34 PM +0100 3/23/04, Leopold Toetsch wrote:
>I've stripped down the whole stack code to use one item per chunk.
>It passes all tests (3 disabled that push infintely and check for
>CHECK_LIMIT and t/pmc/eval_6 which is borken).

From reading this patch, either something's horribly wrong here or
the stack code's mutated beyond all recognition since the last time I
looked. For this to be reasonable we need:

1) A stack chunk freelist, akin to the PMC and buffer freelists
2) The stack entry has to look something like:

struct {
struct IntStackFrame *prev;
INTVAL Registers[REGSPERFRAME];
} IntStackFrame;

It could be a union of INTVAL, NUMVAL, STRING *, and PMC * if we
don't want to maintain several free lists. Throwing a flag word in
there's probably not out of order either.

The indirection through the Buffer stuff has to go (the two
indirections are a waste, and we certainly don't need this stuff
shifted around by the GC) and skipping the allocation of the memory
for the frame is needed. (Parrot_allocate really ought not ever be
called for these fixed-sized structures, it's a huge waste)
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Leopold Toetsch

unread,
Mar 23, 2004, 11:38:23 AM3/23/04
to Dan Sugalski, P6I
Dan Sugalski wrote:

> looked. For this to be reasonable we need:
>
> 1) A stack chunk freelist, akin to the PMC and buffer freelists

Well, that did I write in my message. Please read it (again). And please
one thing and then the next.


> 2) The stack entry has to look something like:
>
> struct {
> struct IntStackFrame *prev;
> INTVAL Registers[REGSPERFRAME];
> } IntStackFrame;
>
> It could be a union of INTVAL, NUMVAL, STRING *, and PMC * if we don't
> want to maintain several free lists. Throwing a flag word in there's
> probably not out of order either.

This fixed size item doesn't really work IMHO. NUMVAL frames are of
different size, as are user-stack entries.


> The indirection through the Buffer stuff has to go

Well in your last mail WRT that stuff you had PMCs there :) The Buffer
stuff doesn't really hurt, when the chunks are coming directly from the
free lists.

> .. (Parrot_allocate really ought not ever be called for these

> fixed-sized structures, it's a huge waste)

They aren't fixed size--or only per stack kind, but with have more then one.

leo

Dan Sugalski

unread,
Mar 23, 2004, 11:55:25 AM3/23/04
to Leopold Toetsch, P6I
At 5:38 PM +0100 3/23/04, Leopold Toetsch wrote:
>Dan Sugalski wrote:
>
>>looked. For this to be reasonable we need:
>>
>>1) A stack chunk freelist, akin to the PMC and buffer freelists
>
>Well, that did I write in my message. Please read it (again).

I did. The need for a freelist, however, has been there since the
beginning. This isn't a new need--it was in my mail about this that
started this round.

A freelist is a given here.

>>2) The stack entry has to look something like:
>>
>> struct {
>> struct IntStackFrame *prev;
>> INTVAL Registers[REGSPERFRAME];
>> } IntStackFrame;
>>
>>It could be a union of INTVAL, NUMVAL, STRING *, and PMC * if we
>>don't want to maintain several free lists. Throwing a flag word in
>>there's probably not out of order either.
>
>This fixed size item doesn't really work IMHO. NUMVAL frames are of
>different size, as are user-stack entries.

User stack entries are different from backing stack frames. They need
to be treated differently, and as such aren't under consideration
here. I see that someone unified this all ages back--that was a
mistake.

>>The indirection through the Buffer stuff has to go
>
>Well in your last mail WRT that stuff you had PMCs there :)

Yeah, with no indirection. To be clear, I was proposing:

struct IntFramePMC {
pobj_t obj;
VTABLE *vtable;
struct PMC_EXT *pmc_ext;
DPOINTER *data;
PMC *_metadata;
struct _Sync *_synchronize;
PMC *_next_for_GC;
INTVAL Registers[REGISTERS_PER_FRAME];
};

for an integer register frame, with pmc_ext set to &data, and data
set to &Registers[0], and all the internal code skipping all the damn
indirection and going directly to the register store frm the pointer
to the frame.

For those folks following along that aren't familiar with what's
here, this is equivalent to:

struct IntFramePMC {
struct PMC;
struct PMC_EXT;
INTVAL Registers[REGISTERS_PER_FRAME];
}

which would be an OK way of doing it as well.

> The Buffer stuff doesn't really hurt, when the chunks are coming
>directly from the free lists.

Yeah, it does. Besides the extra indirection, memory allocated with
Parrot_allocate's movable by the GC. For fixed sized structures this
is a waste, since there's no real reason to bother with compacting
fixed-sized structures. Arenas with freelists work better there.

>>.. (Parrot_allocate really ought not ever be called for these
>>fixed-sized structures, it's a huge waste)
>
>They aren't fixed size--or only per stack kind, but with have more then one.

They're of fixed size. That there are multiple types doesn't change
that. Every integer register frame will be the exact same size as
every other one--they don't vary.

Leopold Toetsch

unread,
Mar 23, 2004, 12:32:54 PM3/23/04
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:

> User stack entries are different from backing stack frames. They need
> to be treated differently, and as such aren't under consideration
> here.

They are in the continuation context too. While the user stack may be
freshly allocated in a sub the control stack can't. It's one stack. As
COW is abandoned now, how different are stacks handled in stacks.c (User,
Control, and Pad).

Just saying "that's wrong" isn't really helpful.

And finally WRT variable sized frames: It was discussed to save only
used registers. When only P16..P19 registers are to be preserved we
could just move these 4 pointers into a chunk. When this is happening a
lot, we are wasting much memory with fixed sized frames.
Investigating that isn't a bad idea - it might not work though.

Changing the internals to use malloced memory is simple. It's all in one
subroutine. I see the point with GC, but please lets do one step and
then the next.

leo

Piers Cawley

unread,
Mar 24, 2004, 6:18:04 AM3/24/04
to Leopold Toetsch, P6I
Leopold Toetsch <l...@toetsch.at> writes:

> I've stripped down the whole stack code to use one item per chunk. It
> passes all tests (3 disabled that push infintely and check for
> CHECK_LIMIT and t/pmc/eval_6 which is borken).
>
> This slows down register saving (and other stack operations)
> considerably whithout any additional measures[1]:
>
> $ perl tools/dev/parrotbench.pl -c=parrotbench.conf -b='^oof' -t
> Numbers are cpu times in seconds. (lower is better)
> p-j-Oc parr-j parr-C perl-th perl python ruby
> oofib 4.150s 11.530s 12.450s 4.100s 3.540s 2.140s 2.170s
>
> p-j-Oc = parrot -j -Oc where savetop is optimized to pushtopp
> parr-j = parrot -j, all 4 registers are saved (both unoptimized build)

Interesting. I redid oofib.imc to only save the registers it cares
about rather than using savetop, and here are my numbers (admittedly on
a PowerMac G5:

parrot parrotj parrotC perl python ruby
oofib 3.770s 3.190s 2.950s 2.210s 1.100s 1.770s
oofibt 7.750s 7.370s 6.960s 2.210s 1.140s 1.800s


oofibt is the original version, oofib is my rewrite (attached). The
perl, python & ruby equivalents were generated with a simple copy...

For reference, here are the numbers using a CVS fresh parrot:

parrot parrotj parrotC perl python ruby
oofib 3.770s 3.150s 3.100s 2.210s 1.080s 1.960s
oofibt 6.700s 6.240s 6.170s 2.330s 1.040s 1.890s

So it looks like saving single registers is a win whichever parrot
you're using...

oofib.imc

Leopold Toetsch

unread,
Mar 24, 2004, 7:14:36 AM3/24/04
to Piers Cawley, perl6-i...@perl.org
Piers Cawley <pdca...@bofh.org.uk> wrote:

> Interesting. I redid oofib.imc to only save the registers it cares
> about rather than using savetop, and here are my numbers (admittedly on
> a PowerMac G5:

> parrot parrotj parrotC perl python ruby
> oofib 3.770s 3.190s 2.950s 2.210s 1.100s 1.770s
> oofibt 7.750s 7.370s 6.960s 2.210s 1.140s 1.800s

You can't really compare these versions. The C<savetop> saves all 4
register frame types. Using C<-Oc> replaces it with <pushtopp>.

$ parrot -j oofib.imc # savetop
fib(28) = 317811 3.903611s

$ parrot -j -Oc oofib.imc # pushtopp
fib(28) = 317811 2.569886s

$ parrot -j oofpc.imc # save/save
fib(28) = 317811 2.734809s

This is already with the new stack code using malloced memory and a
freelist for popped register frames. The two C<save> onto the user stack
take obviously longer now then memcpy()ing 16 pointers. Albeit I can
imagine that just moving 2 pointers could be faster, despite the
additional count argument. Or it needs at least 8 or such, who knows.

(unoptimized build, Python runs 2.1s, Perl 3.5s)

leo

Leopold Toetsch

unread,
Mar 25, 2004, 1:41:15 PM3/25/04
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:
> 2) The stack entry has to look something like:

> struct {
> struct IntStackFrame *prev;
> INTVAL Registers[REGSPERFRAME];
> } IntStackFrame;

Ok, now we "something like" that:

typedef struct Stack_Chunk {
pobj_t obj;
int size_class;
const char * name;
struct Stack_Chunk *prev;
struct Stack_Chunk *free_p;
void *data;
} Stack_Chunk_t;

The payload is allocated inside the buffer and refered to by:

#define STACK_DATAP(chunk) &chunk->data

So GC doesn't see stack memory any more.

leo

Reply all
Reply to author
Forward
0 new messages