
[INFO] The first pirate parrot takes to the air


Peter Gibbs

unread,
Aug 16, 2002, 10:42:40 AM8/16/02
to perl6-internals
For purely academic purposes, I have re-synchronised some of my
forbidden code with the latest CVS version of Parrot. All tests pass
without gc debug; and with gc_debug plus Steve Fink's patch.
Benchmarks follow, on a 166MHz Pentium running linux 2.2.18.

                                 Parrot        African Grey
life (5000 generations)          172 seconds   81 seconds
reverse <core_ops.c >/dev/null   193 seconds   130 seconds
hanoi 14 >/dev/null              51 seconds    37 seconds

The differences between the two versions are:
1) Use of the interpreter cycle-counter instead of stack walking.
2) Linked lists of buffer headers sorted by bufstart
3) COW-supporting code in GC (for all buffer objects)
4) Implementation of COW for string_copy and string_substr

Items 1 and 2 use techniques that have been specifically banned,
and items 3 and 4 depend on item 2, so none of this code is
usable in Parrot (which is why I haven't attached any patches).
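The cycle-counter idea in item 1 can be sketched in C roughly as
follows. All names here are hypothetical - the real Parrot headers
differ - but the gist is that each buffer records the interpreter
cycle in which it was allocated, and a DOD run then treats any buffer
born in the current cycle as live, instead of walking the C stack to
find "infant" buffers anchored only in C locals.

```c
#include <stddef.h>

typedef struct Interp {
    unsigned long cycle;   /* incremented once per op dispatched */
} Interp;

typedef struct Buffer {
    void          *bufstart;
    unsigned long  birth_cycle;   /* interpreter cycle at allocation */
    int            live;          /* mark bit set by the DOD trace */
} Buffer;

/* Allocation stamps the buffer with the current cycle. */
void buffer_init(Interp *interp, Buffer *b)
{
    b->bufstart    = NULL;
    b->birth_cycle = interp->cycle;
    b->live        = 0;
}

/* During sweep, a buffer survives if the trace marked it, or if it
 * was allocated in the cycle now executing (it may still be anchored
 * only in a C local variable, invisible to the root set). */
int buffer_survives(const Interp *interp, const Buffer *b)
{
    return b->live || b->birth_cycle == interp->cycle;
}
```

The win is that a DOD run never touches the C stack; the cost is one
counter field per buffer header, which is part of what was objected to.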

Some of the changes I made before the memory management
code was totally reorganised have not yet been re-integrated.
My last version prior to that reorganisation ran 5000 lives in
61 seconds, and I hope to get back to somewhere close to
that again.

--
Peter Gibbs
EmKel Systems

Matt Fowles

unread,
Aug 16, 2002, 11:22:45 AM8/16/02
to perl6-internals
All~

> The differences between the two versions are:
> 1) Use of the interpreter cycle-counter instead of stack walking.
> 2) Linked lists of buffer headers sorted by bufstart
> 3) COW-supporting code in GC (for all buffer objects)
> 4) Implementation of COW for string_copy and string_substr
>
> Items 1 and 2 use techniques that have been specifically banned,
> and items 3 and 4 depend on item 2, so none of this code is
> usable in Parrot (which is why I haven't attached any patches)

I am fairly new to this project and this is the first that I have heard of
these things. What are the techniques that have been banned and where can I
find a list of the things one ought not to do?

Thanks,
Matt

Peter Gibbs

unread,
Aug 16, 2002, 11:59:11 AM8/16/02
to Matt Fowles, perl6-internals
Matt Fowles wrote:

> I am fairly new to this project and this is the first that I have heard of
> these things. What are the techniques that have been banned and where can I
> find a list of the things one ought not to do?

Don't worry too much about it, Matt. It is part of an ongoing game between
myself and Dan Sugalski, where I submit patches and he rejects them for
various reasons, normally related to performance or memory usage in
critical areas within the Parrot core, such as memory management and
garbage collection.

If you are interested in the history, some references follow.

The interpreter cycle counter first arose in the following patch:
http://archive.develooper.com/perl6-i...@perl.org/msg10218.html
and was immediately rejected because it would cause a general
performance hit, rather than being targeted directly at the problem areas.

The subject of not adding additional fields to the Buffer header structure
arose in a thread started by Dan himself, with contributions from Mike
Lambert and myself:
http://archive.develooper.com/perl6-i...@perl.org/msg11424.html
The specific message from Dan prohibiting it was:
http://archive.develooper.com/perl6-i...@perl.org/msg11570.html

Mike Lambert

unread,
Aug 16, 2002, 1:36:09 PM8/16/02
to Peter Gibbs, perl6-internals
> For purely academic purposes, I have re-synchronised some of my
> forbidden code with the latest CVS version of Parrot. All tests pass
> without gc debug; and with gc_debug plus Steve Fink's patch.
> Benchmarks follow, on a 166MHz Pentium running linux 2.2.18.
>
> Parrot African Grey
> life (5000 generations) 172 seconds 81 seconds
> reverse <core_ops.c >/dev/null 193 seconds 130 seconds
> hanoi 14 >/dev/null 51 seconds 37 seconds

Rather impressive. Except that it makes me look bad. :)

> The differences between the two versions are:
> 1) Use of the interpreter cycle-counter instead of stack walking.
> 2) Linked lists of buffer headers sorted by bufstart
> 3) COW-supporting code in GC (for all buffer objects)
> 4) Implementation of COW for string_copy and string_substr

1) Yeah, the cycle-counter approach is a nice one. I also had a similar
solution involving neonate flag usage, somewhere in the archives. Both
have *significant* speed advantages versus the current codebase's stack
walking.

I tried to convince Dan of their merit, but they were rejected for
various reasons:

Your solution (ignoring the extra cycle counter byte for now) cannot
handle vtable methods implemented in Parrot. The current system for
implementing these involves the interpreter recursively calling
runops_core to handle the vtable method. If you increment cycles in the
inner loop, you risk premature collection of buffers still anchored on
the stack of the code that called the vtable method. If you don't
increment cycles, you prevent any of the memory allocated inside the
vtable method from ever being collected during the method's
execution...bad stuff when your vtable methods are multiplying
gigantic matrices or somesuch.

My neonate buffers solution fails only in the presence of longjmp.

Granted, we don't do any of this yet, so these solutions will mop the
floor with my current stackwalk code, and pass tests to boot. But it's
the planned introduction of these requirements that is the reason for
making these solutions 'forbidden'.

One of Nick's solutions was to fall back on stack walking to handle the
cases where our faster solutions fail. I can definitely see that working
with neonate buffers to handle the fixups needed after a longjmp call.
But it seems less attractive in the presence of your solution, since it
would require stack walking for all re-entrant runops calls. Do you have
another solution in mind for handling re-entrant runops calls?

As far as the extra byte in the buffer, I don't mind that one at all.
There are a lot of restrictions on the GC code in the interest of making
stuff "lightweight". Unfortunately, GC code takes a significant portion of
the execution time in any realistic application. Hopefully we can convince
Dan to allow extra fields in the buffers in the interest of speed, but I
don't think we can reduce parrot/perl6's feature set in the interest of
speed...might as well use C if that's what you want. :)

2) Currently, we use linked lists of buffer headers for freeing and
allocating headers. I'm not sure what you mean by saying that they are
sorted by bufstart. What does this buy over the current system?

3) Definitely a good one. I've been trying to merge your original COW
patch into my code here. Without GC_DEBUG, it fails one test. With
GC_DEBUG, it fails the traditional set plus that one test. Unfortunately,
the test case is rather large; I haven't been able to narrow down the
problem further, or I'd have committed it.

4) Isn't this really the same thing as item 3? I'm basing my knowledge off
your old COW patches. Has additional work been done on the string function
integration since then, or do #3 and #4 both come from those patches?

> Some of the changes I made before the memory management
> code was totally reorganised have not yet been re-integrated.
> My last version prior to that reorganisation ran 5000 lives in
> 61 seconds, and I hope to get back to somewhere close to
> that again.

I'm not sure how much of the new code you've merged with. Which of the new
files are you planning to integrate/merge with, and which have you thrown
out in favor of older versions? I'm specifically referring to any of
resources/dod/smallobject/headers.c.

Regardless of whether or not it goes in, I'd be interested in seeing a
patch. I can work on integrating a lot of your non-forbidden code into the
current codebase.

Thanks for spending the time to generate these numbers...they're a nice
eye-opener on what can be done without the current restrictions. Hopefully
they'll allow us to reconsider each restriction in the context of
the speed of our GC.

Mike Lambert

Peter Gibbs

unread,
Aug 16, 2002, 2:50:51 PM8/16/02
to Mike Lambert, perl6-internals
Mike Lambert wrote:

> Rather impressive. Except that it makes me look bad. :)

You have to follow orders - I don't any more! As I said before,
I am now doing this for fun - and when running on a slow machine,
speed becomes very important. I have been very pleased to see
that a lot of my code has made it into the official codebase, and
that you have been extending the concepts.

> Your solution, (ignoring the extra cycle counter byte for now), cannot
> handle vtable methods implemented in Parrot. The current system to

I agree that the cycle counter approach is not going to handle all
possibilities - however, it was originally dismissed because:
<< start quote from Dan
Unfortunately I don't. I'm willing to live with a performance
degredation in those spots where it's truly necessary, but only
there. The performance hit has to be localized to those places where
infant mortality is actually a problem. Those spots are currently
reasonably rare, and I'm not sure that the hoops we're building to
jump through (and I'm as guilty as anyone) are worth it.
>> end quote

I'm sure you'll agree that the stack-walking code suffers from the
same problem - we take a performance hit for every DOD run.
As Nick suggested, perhaps we need multiple methodologies.
The cycle counter approach really only has one merit - it is
transparent - and that was why I proposed it originally, because we
had multiple developers producing code that had neonate bugs.

> 2) Currently, we use linked list of buffer headers for freeing and
> allocating headers. I'm not sure what you mean by saying that they are
> sorted by bufstart? What does this buy over the current system?

I am now keeping a linked list of *allocated* buffer headers, headed
up by the Memory_Block structure, and linked in ascending order by
their bufstart value (this is how they are allocated anyway).
This is primarily used by the COW support (item 3) - when a shared
header is created, it is inserted into the linked list at the
appropriate place. Then compact_pool runs through the Memory_Block
list and finds each header in sequence. If the header is dead (this is
one of the things I have not re-implemented yet), it is freed and no
copy takes place; if the header overlaps the previous one (i.e. COW),
the header is re-addressed without a copy.
This fixes the problem of a large buffer being kept around only
because a small substring is still alive. Unfortunately, it is very
expensive in terms of memory usage, as it uses a doubly-linked
list and a reference back to the pool (although the latter could
probably be eliminated.)
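A rough C sketch of that compaction pass (structure names are
hypothetical, not Parrot's actual ones): dead headers are skipped with
no copy, and a header whose data lies entirely inside the previously
copied buffer - a COW substring - is re-addressed into the copy
already made instead of getting its own.

```c
#include <stddef.h>
#include <string.h>

typedef struct Buffer {
    char  *bufstart;
    size_t buflen;
    int    live;
    struct Buffer *next;   /* list kept in ascending bufstart order */
} Buffer;

/* Compact all live buffers into 'dest'; returns bytes used. */
size_t compact_pool(Buffer *head, char *dest)
{
    size_t used = 0;
    char  *prev_old_start = NULL;   /* source range already copied */
    char  *prev_old_end   = NULL;
    char  *prev_new_start = NULL;
    Buffer *b;

    for (b = head; b != NULL; b = b->next) {
        if (!b->live)
            continue;               /* dead header: no copy needed */
        if (prev_old_start != NULL
            && b->bufstart >= prev_old_start
            && b->bufstart + b->buflen <= prev_old_end) {
            /* COW overlap: re-address into the copy already made */
            b->bufstart = prev_new_start + (b->bufstart - prev_old_start);
            continue;
        }
        memcpy(dest + used, b->bufstart, b->buflen);
        prev_old_start = b->bufstart;
        prev_old_end   = b->bufstart + b->buflen;
        prev_new_start = dest + used;
        b->bufstart    = dest + used;
        used += b->buflen;
    }
    return used;
}
```

Because the list is sorted by bufstart, each header only needs to be
checked against the most recently copied range - which is why the
sort order matters.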

> 4) Isn't this really the same thing as item 3? I'm basing my knowledge off
> your old COW patches. Has additional work been done on the string function
> integration since then, or do #3 and #4 both come from those patches?

The logic described above is what I meant by #3, i.e. the GC code is
equipped to handle COW of any buffer type, not only strings. #4 means
that actual COW functionality is only implemented in the next layer
up for strings.
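As a sketch of that layering (the layout here is hypothetical, not the
real STRING header): the string layer's part of COW is just to share
the buffer and flag both headers, with a write barrier un-sharing
before any mutation; everything else is the GC's problem.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { FLAG_COW = 1 };

typedef struct String {
    char  *bufstart;
    size_t bufused;
    int    flags;
} String;

/* O(1) "copy": share the buffer and mark both headers COW. */
void string_copy_cow(String *dst, String *src)
{
    src->flags   |= FLAG_COW;
    dst->bufstart = src->bufstart;
    dst->bufused  = src->bufused;
    dst->flags    = src->flags;
}

/* Call before any in-place write: un-share a COW buffer. */
void string_write_barrier(String *s)
{
    if (s->flags & FLAG_COW) {
        char *copy = malloc(s->bufused);
        memcpy(copy, s->bufstart, s->bufused);
        s->bufstart = copy;
        s->flags &= ~FLAG_COW;
    }
}
```

string_substr would work the same way, except that the shared header's
bufstart points into the middle of the donor's buffer - which is what
makes the overlap handling in the compaction code necessary.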

> I'm not sure how much of the new code you've merged with. Which of the new
> files are you planning to integrate/merge with, and which have you thrown
> out in favor of older versions? I'm specifically referring to any of
> resources/dod/smallobject/headers.c.

I have basically cut-and-pasted my changes into your new files; because
otherwise I was going to be arrested for cruelty to CVS. My last diff
against current CVS produced a 24K patch file - I don't believe that
diff is quite accurate any more, but I have also installed Steve's patch
to test with GC_DEBUG, so I will have to do a bit of work to sort it out.
Once I have a working patch, I will email it to you directly, in case you
can find anything useful in it.

For interest, the other changes in my old version that I am still working
on integrating are:
1) Allocate memory pool in fixed size chunks, even during compaction.
In other words, instead of allocating a new block of memory equal to
the current size used, allocate a chunk (of say 16K or whatever), then
allocate another chunk when that one is full.
During compaction, once a chunk has been emptied, it is returned to
a free chunk list, and can then be reallocated. This reduces the peak
memory consumption, and allows the compaction process to be
suspended between chunks. I believe it would also make
generational collection easier, though I haven't tried that yet.
The downside is obviously wasted space at the end of each chunk;
but there are ways to minimise that. Chunk sizes are increased
dynamically as memory usage grows, and to accommodate large
allocation requests.
2) Postpone freeing of buffers. At present, the DOD run first marks
everything it can find, then frees what it couldn't. By updating the
cycle counter in the buffer header when a buffer is marked live,
the free_unused_buffers runs can be delayed until we actually need
free buffers in a specific resource (sorry, small object) pool. In
conjunction with the code I referred to above for freeing buffer
headers during a compaction run, this can reduce the frequency
of running free_unused_buffers. This requires that buffer_lives be
passed the interpreter, so it can access the cycle counter.
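A rough sketch of the chunked pool in item 1 (hypothetical structures;
dynamic chunk growth and oversize requests are omitted for brevity):
allocation bump-allocates from the current chunk and grabs another
fixed-size chunk when it fills, preferring a recycled chunk from the
free list over a fresh allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 16384

typedef struct Chunk {
    struct Chunk *next;
    size_t        used;
    char          data[CHUNK_SIZE];
} Chunk;

typedef struct Pool {
    Chunk *current;     /* chunk currently being filled */
    Chunk *free_list;   /* chunks emptied by compaction, for reuse */
} Pool;

/* Take a chunk from the free list if possible, else allocate one,
 * and make it the current chunk. */
static Chunk *get_chunk(Pool *p)
{
    Chunk *c = p->free_list;
    if (c != NULL)
        p->free_list = c->next;
    else
        c = malloc(sizeof(Chunk));
    c->used    = 0;
    c->next    = p->current;
    p->current = c;
    return c;
}

/* Bump-allocate; start a new chunk when the current one is full. */
void *pool_alloc(Pool *p, size_t size)
{
    void *mem;
    assert(size <= CHUNK_SIZE);   /* oversize requests not sketched */
    if (p->current == NULL || p->current->used + size > CHUNK_SIZE)
        get_chunk(p);
    mem = p->current->data + p->current->used;
    p->current->used += size;
    return mem;
}
```

During compaction, once every live buffer has been copied out of a
chunk, the chunk goes back on free_list rather than to the OS - which
is what keeps peak usage below the "old size + new size" of a single
monolithic copy.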

The other changes I made were to the string subsystem, and I
probably won't do them again because they change too many
files:
a) replace chartype and encoding by a single charset entry
This removes one pointer field from the string header, and
simplifies all the string functions, as they are forever checking
both fields to determine if transcoding is required. I do not
believe that the two fields are orthogonal, and therefore the
number of charsets would be less than #chartypes * #encodings.
b) some alterations to the single vtable thus created, in
particular the addition of a find_substring method.

--
Peter Gibbs
EmKel Systems


Dan Sugalski

unread,
Aug 16, 2002, 6:45:28 PM8/16/02
to Peter Gibbs, perl6-internals
At 4:42 PM +0200 8/16/02, Peter Gibbs wrote:
>For purely academic purposes, I have re-synchronised some of my
>forbidden code with the latest CVS version of Parrot. All tests pass
>without gc debug; and with gc_debug plus Steve Fink's patch.
>Benchmarks follow, on a 166MHz Pentium running linux 2.2.18.
>
> Parrot African Grey
>life (5000 generations) 172 seconds 81 seconds
>reverse <core_ops.c >/dev/null 193 seconds 130 seconds
>hanoi 14 >/dev/null 51 seconds 37 seconds

How much of the speed win is from the cycle count instead of stack
walking? Unless you've solved the problem of recursive interpreter
calls and setjmp, it's not a valid solution, no matter what the speed
win might be.

>The differences between the two versions are:
>1) Use of the interpreter cycle-counter instead of stack walking.

This still dies in the face of setjmp/longjmp and recursive
interpreter calls. Which, since it's how we're doing exceptions and
parrot-code vtable entries, is kind of problematic.

>2) Linked lists of buffer headers sorted by bufstart

IIRC, this required another pointer per buffer. I really, *really*
don't want to do it, but given the speed wins I'm willing. We chew up
altogether too much memory per internal structure as it is, which is
bad. Suppose we can make that up elsewhere somehow.

>3) COW-supporting code in GC (for all buffer objects)
>4) Implementation of COW for string_copy and string_substr
>
>Items 1 and 2 use techniques that have been specifically banned,
>and items 3 and 4 depend on item 2, so none of this code is
>usable in Parrot (which is why I haven't attached any patches)

Well, all banned things can be unbanned if they work and are faster.
Neither 3 nor 4 fundamentally requires 1 or 2.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Peter Gibbs

unread,
Aug 17, 2002, 3:05:05 AM8/17/02
to perl6-internals, Dan Sugalski
Dan Sugalski wrote:

> How much of the speed win is from the cycle count instead of stack
> walking? Unless you've solved the problem of recursive interpreter
> calls and setjmp, it's not a valid solution, no matter what the speed
> win might be.

According to my notes the progression (for 5000 lives) was:
CVS: 172 seconds
Cycle count instead of stack walk: 97 seconds
COW with stack walk: 158 seconds
Cycle count + COW: 81 seconds
The performance gain from COW is fairly small, as it really only
results from the decreased number of compaction runs needed.
The performance loss from the stack-walking code is the major
factor; what concerns me is that that overhead is incurred all the
time, not only when situations arise that might require it - you may
recall that your original objection to the cycle count code was based
on the same reasoning.

Anyway, as I said, this is purely an academic exercise - I am
not proposing that any changes be made to Parrot. When
features are added to Parrot that cause my current code to
fail, I can find an alternative solution, reject those features, or
simply abandon the exercise - you do not have those options,
so you have to use code that will support everything you intend
to do in the future.

> Neither 3 nor 4 fundamentally require 1 or 2,

I know that Mike is working on implementing COW in a way that
will not break anything else; I have just been experimenting with
a different approach that is based on point 2. I will be sending
my patch to Mike so he can see if there is anything he can use.

Mike Lambert

unread,
Aug 17, 2002, 4:23:24 AM8/17/02
to Peter Gibbs, perl6-internals
Peter Gibbs wrote:

> > How much of the speed win is from the cycle count instead of stack
> > walking? Unless you've solved the problem of recursive interpreter
> > calls and setjmp, it's not a valid solution, no matter what the speed
> > win might be.

> According to my notes the progression (for 5000 lives) was:
> CVS: 172 seconds
> Cycle count instead of stack walk: 97 seconds
> COW with stack walk: 158 seconds
> Cycle count + COW: 81 seconds

Just for fun, can you run Hanoi on CVS versus CVS+COW?

I just got COW implemented here, and while I get a 17% speedup on
life, I get a 5% loss on hanoi. Since you only posted life, it's a
bit hard to see if the drop on hanoi is just my fault, or the fault of
COW in general.

(More benchmarks will appear in my soon-to-be-sent COW patch email.)

Thanks,
Mike Lambert

Peter Gibbs

unread,
Aug 17, 2002, 4:59:18 AM8/17/02
to Mike Lambert, perl6-internals
Mike Lambert wrote:

> Just for fun, can you run Hanoi on CVS versus CVS+COW?
>
> I just got COW implemented here, and while I get a 17% speedup on
> life, I get a 5% loss on hanoi. Since you only posted life, it's a
> bit hard to see if the drop on hanoi is just my fault, or the fault of
> COW in general.

Hanoi is a difficult case because pretty much the only string
manipulation it does is repeat, which doesn't benefit from COW,
so all you get is the additional overhead in the compaction code
without any compensating gains.

I have started doing the timings; however, I no longer have a correct
COW-only patch, so I need to do a bit of fiddling. I will post the results
as soon as I have them.

Attached is my current 'African Grey' patch as discussed yesterday.
Also, since you mention hanoi, here (in straight code format, I'm too
lazy to do a diff) is an experimental version of string_repeat that
tries to reduce the number of calls to memcpy. I can't find my notes
at the moment as to what benefit it gave - perhaps you might like
to try it sometime. (Note that the string_make call will need to be
changed to the split chartype/encoding version)

--
Peter Gibbs
EmKel Systems

STRING *
string_repeat(struct Parrot_Interp *interpreter, const STRING *s, UINTVAL num)
{
    STRING *dest;
    UINTVAL i;
    char *ptr;

    dest = string_make(interpreter, NULL, s->bufused * num, s->charset, 0);
    if (num == 0) {
        return dest;
    }

    dest->bufused = s->bufused * num;
    dest->strlen = s->strlen * num;

    /* Copy the source once, then keep doubling: each pass copies as
     * much of the already-built result as remains needed, so the
     * number of memcpy calls is O(log num) rather than O(num). */
    mem_sys_memcopy(dest->bufstart, s->bufstart, s->bufused);
    i = 1;
    num--;
    ptr = (char *)dest->bufstart + s->bufused;
    while (num) {
        mem_sys_memcopy(ptr, dest->bufstart, s->bufused * i);
        ptr += (s->bufused * i);
        num -= i;
        i *= 2;
        if (i > num) {
            i = num;   /* last pass: copy only what is still missing */
        }
    }

    return dest;
}

grey1.patch