Copy-on-write strings

Steve Fink

unread,

May 8, 2002, 2:11:03 AM5/8/02

to Peter Gibbs, perl6-internals

On Mon, May 06, 2002 at 10:58:33AM +0200, Peter Gibbs wrote:
> This implementation handles the following:
> Copied strings and substrings are COWed instead of copied (i.e. new string
> header only); this applies to normal and constant strings.
> Functions that alter strings in-place make a new copy first if required
> (actually, in some circumstances copies are made when they could be avoided;
> this will be fixed shortly)
> The GC process will detect when a previously shared buffer is no longer
> shared, and clear the COW flag
>
> The COWing of substrings requires a major interface change:
> *** All references to the data in a STRING must use the new strstart pointer
> instead of bufstart ***
> This patch makes that change to current CVS code; all work-in-progress will
> need to be amended accordingly.
> Note that the JIT code has been broken in the process; I have left that for
> somebody who understands it.
>
> The 'string_tail' data added to the end of (non-constant) strings has been
> defined as a struct in case any other uses are found for it.

I'm more hesitant about unilaterally committing this patch, because it
makes some significant changes. I'd be happier if someone else (Jeff,
Dan) ventured an opinion too.

Though if nobody says anything for a while and the patch starts to
rot, I'll probably apply some version of it anyway, 'cause I like
change. :-)

Comments:

1. Needs documentation. For one, docs/pdd/pdd09_gc.pod needs updating
with the new meaning of BUFFER_constant_FLAG. But this is also tricky
enough stuff that the data structures and maintenance mechanisms
really need to be documented. It's probably a little out of date, but
docs/string.pod currently has documentation at a sufficient level of
detail.

2. Is it necessary for ->buflen to lie? This prevents it from being a
proper subclass of Buffer, or in other words you can't cast a STRING*
to a Buffer* and have valid fields. Hm... have any other Buffer fields
changed meaning?

3. Is there an alternative to using the tail area for the moved flag?
Just brainstorming here, but when you make a lazy copy of a string,
couldn't you store a pointer to its header in your header and then use
the pointed-at header's regular flags field to record when its buffer
moves? I'm probably just misunderstanding the data structures. Or get
even more friendly with the gc mechanism and figure out whether
something's moved or fated to be moved based on its address. Or take
it completely out of band and keep a hash table of bufstarts that have
been moved. Maybe all of these have been considered and rejected
already?

4. What's this stuff:

- mem_sys_memcopy(dest->bufstart, (char *)src->bufstart + substart_off,
- (unsigned)(subend_off - substart_off));
+ /* do in-place if possible */
+/*
+ if (!(src->flags & BUFFER_constant_FLAG)) {
+*/
+ dest = clone_header(interpreter, src);
+ dest->strstart = (char *)dest->strstart + substart_off;
+/*
+ }
+ else {
+ dest = string_make(interpreter, NULL, subend_off - substart_off,
+ src->encoding, 0, src->type);
+ mem_sys_memcopy(dest->strstart, (char *)src->strstart + substart_off,
+ (unsigned)(subend_off - substart_off));
+ }
+*/

Otherwise, it looks good to me, now that I've figured out how it works.

Peter Gibbs

unread,

May 8, 2002, 2:58:51 AM5/8/02

to Steve Fink, perl6-internals

----- Original Message -----

Steve Fink wrote:
> 1. Needs documentation. For one, docs/pdd/pdd09_gc.pod needs updating

I agree that documentation is required; however, I was delaying that until
after the implementation had been finalised, so that the documentation only
has to change once.

> 2. Is it necessary for ->buflen to lie? This prevents it from being a

This is a relic of my original implementation. Looking at it fresh, there is
no requirement for it; the only string_x function that uses buflen is
string_replace, and that can be easily fixed. The only complication it adds
is computing the position of the tail. Of course, the alternative is to make
it a header instead, with strstart initialised to bufstart +
sizeof(String_Bufffer_Header); this just breaks alignment unless the header
is an appropriate width.

> 3. Is there an alternative to using the tail area for the moved flag?
> Just brainstorming here, but when you make a lazy copy of a string,
> couldn't you store a pointer to its header in your header and then use
> the pointed-at header's regular flags field to record when its buffer
> moves? I'm probably just misunderstanding the data structures. Or get
> even more friendly with the gc mechanism and figure out whether
> something's moved or fated to be moved based on its address. Or take
> it completely out of band and keep a hash table of bufstarts that have
> been moved. Maybe all of these have been considered and rejected
> already?

I'm sure there are lots of possibilities I haven't considered; I just used
the simplest method that came to me.
One problem is that a buffer has no permanent parent header:
# assume S0 points to some string
clone S1, S0 # S0 and S1 now point to different
headers but share a buffer
clone S2, S1 # and S2's header now shares the same
buffer also
substr S0, 0, 4, "gone" # S0 now gets a new buffer; S1 and S2 continue
to share the old one
If S0's header contained any important information, it is now lost.

> 4. What's this stuff:
>
> - mem_sys_memcopy(dest->bufstart, (char *)src->bufstart + substart_off,
> - (unsigned)(subend_off - substart_off));
> + /* do in-place if possible */
> +/*
> + if (!(src->flags & BUFFER_constant_FLAG)) {
> +*/
> + dest = clone_header(interpreter, src);
> + dest->strstart = (char *)dest->strstart + substart_off;
> +/*
> + }
> + else {
> + dest = string_make(interpreter, NULL, subend_off - substart_off,
> + src->encoding, 0, src->type);
> + mem_sys_memcopy(dest->strstart, (char *)src->strstart +
substart_off,
> + (unsigned)(subend_off - substart_off));
> + }
> +*/

Whoops! At one stage during development, I wasn't sure if I could do COWed
substrings of constants, so I put in the conditional code above. When I
decided it should work, I commented the 'if' out to test - it did work, but
obviously I forgot to remove the old code!

I won't redo the patch yet - let's wait a while and see if any other
comments are forthcoming.

--
Peter Gibbs
EmKel Systems

Steve Fink

unread,

May 8, 2002, 12:56:25 PM5/8/02

to Peter Gibbs, perl6-internals

I was mostly just asking questions. Now that the ideas are settling
in, the way you did it is making more and more sense. I just want the
various memory structures we allocate to be as simple as possible,
because I foresee a number of them, and it gets tricky remembering all
of the constraints. It would be nice to limit the different cases to a
small number (currently 3, for PMC, Buffer, and STRING?).

So is this currently a valid mental model?:

There are three types of managed structures we allocate from memory.

First, there are PMCs, which are fixed-size. Subclasses differ only in
the vtable and in what use they make of the ->data, ->cache, and
private bits of the ->flags members. PMCs may manage memory directly,
but will probably point to a Buffer or an array of Buffers for that
purpose.

Second are Buffers, which are used for pointing to memory. Most will
be a certain fixed size, but the not-yet-implemented
new_tracked_header() may be used to create larger subclasses. Those
subclasses will require their own free pool, but otherwise will be
managed as Buffers.

Third are STRINGs, which may be thought of as COW-capable Buffers.
Except they're more than that; they also can do efficient subrange
operations, and they store additional fields that are probably only
useful for character data. They can probably be subclassed in more or
less the same way as Buffers. They have their own implementation of
some memory management routines in order to support their COW feature,
and some fields they share with Buffers may have different meaning.
Most importantly, the memory they point to is no longer a completely
free-form chunk of bytes. The main data area must be at least large
enough to hold a pointer, it must be aligned so that a pointer at the
very beginning can be followed, and there will be some extra space at
the end for a tail structure.

As usual, I've neglected to mention system-allocated Buffers which
don't participate in garbage collection but still interact with the
memory management subsystem.

-----

Assuming this is compatible with what you're thinking, then the only
weirdness in there is the overloading of STRINGs and COW-able Buffers.
But since right now we only have two types of Buffers (Buffer,STRING),
there doesn't seem much point in splitting it out. If we later want
other non-STRING Buffers to be COW-able, I would imagine going towards
something more like:

Buffer
COW Buffer
STRING
PMC

(that's supposed to be a "class" hierarchy)

Peter Gibbs

unread,

May 8, 2002, 1:55:35 PM5/8/02

to Steve Fink, perl6-internals

That's a very good analysis of the current state-of-the-art, and I have made
a copy of this for when I do the documentation. There is one oddity, i.e.
constant strings. These share header layout with their bovine cousins, but
do not have tails. This would have to change if we need to compact constant
string memory, although I suspect that dynamic module load/unload would be
better handled by having separate memory pools. This reminds of me of why
buflen 'lies' - this way, the tail is private to the memory management
system, and the string handling functions can remain blissfully ignorant of
it. As I mentioned earlier, this could be overcome by using a header instead
of a tail -- buflen could then be real, and the header would be hidden
between bufstart and strstart -- but this would require the header to
maintain alignment, and therefore be larger than the current tail.

One limitation in the current implementation is that a substring that
survives the death of its parent will cause the entire buffer to be
preserved; I have a few ideas for dealing with this, but for now I am
assuming that this situation will not be too common.

Following your previous comments about buflen, I had another look at
string_replace, and I think this could profit from a bit of a clean-up. COW
functionality can be better integrated; this function did not exist when I
first wrote my code, and so I did the simplest possible handling when it was
added, i.e. just copy the buffer and leave the rest alone. I will produce a
separate patch for this, and copy it to Melvin, who I believe was the
original author.

I am still undecided about system Buffers, particularly the free header
pools - I would like to make them normal buffers, just with immunity from
DOD; but this gets slightly complicated when trying to allocate the free
buffer header pool from itself! Of course I could always cheat and implement
another of my experiments, which replaces the free pool by a linked list
within the arena itself - this could be beneficial in situations where the
size of the free pool is subject to frequent growth.

Steve Fink

unread,

May 8, 2002, 2:10:02 PM5/8/02

to Peter Gibbs, perl6-internals

On Wed, May 08, 2002 at 07:55:35PM +0200, Peter Gibbs wrote:
> That's a very good analysis of the current state-of-the-art, and I have made
> a copy of this for when I do the documentation. There is one oddity, i.e.
> constant strings. These share header layout with their bovine cousins, but
> do not have tails. This would have to change if we need to compact constant

I think you've hit on the reason why we pretty much have to go with
the tail structure:

"How is a bovine Buffer different from a regular Buffer?"
"It's got a tail."

Dan Sugalski

unread,

May 9, 2002, 3:58:13 PM5/9/02

to Peter Gibbs, perl6-internals

At 10:58 AM +0200 5/6/02, Peter Gibbs wrote:
>The COWing of substrings requires a major interface change:
>*** All references to the data in a STRING must use the new strstart pointer
>instead of bufstart ***

Actually, we don't. (Sez the man catching up on altogether too much
mail) Since we're putting the COW stuff at the tail end, substrings
of COW strings are fine. You set the bufstart to where the substring
starts, buflen set so it goes to the end of the original buffer, and
the string length bit holds the real string length. That way you
start where you need to, but you can still find the COW marker off
the end of the original buffer.

Yeah, the substring loses the bits of the original string that are
before the substring's beginning, but I don't think that'll be a
problem.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Peter Gibbs

unread,

May 10, 2002, 3:10:16 AM5/10/02

to perl6-internals, Dan Sugalski

----- Original Message -----
From: "Dan Sugalski" <d...@sidhe.org>
> Actually, we don't. (Sez the man catching up on altogether too much
> mail) Since we're putting the COW stuff at the tail end, substrings
> of COW strings are fine. You set the bufstart to where the substring
> starts, buflen set so it goes to the end of the original buffer, and
> the string length bit holds the real string length. That way you
> start where you need to, but you can still find the COW marker off
> the end of the original buffer.
>

As I understand it, that means putting the original buffer length in the
tail, and using
true_bufstart = (char *)s->bufstart - (tail->buflen - s->buflen)
in compact_string_pool?

Nicholas Clark

unread,

May 10, 2002, 10:46:52 AM5/10/02

to Peter Gibbs, Steve Fink, perl6-internals

On Wed, May 08, 2002 at 08:58:51AM +0200, Peter Gibbs wrote:
> I won't redo the patch yet - let's wait a while and see if any other
> comments are forthcoming.

I'm still failing to find the time to read and understand all these bits of
parrot internals, so I've no idea of the following comment is useful.
[Blame impending 5.8 for lack of time]

I've also no idea whether it's a well known idea, or something not well
known that I've independently thought of, or new, as I've not been reading
about these sort of things.

However, I experimented with copy on write in perl 5.7. Admittedly only for
whole strings, not substrings.

The thing that I didn't want to solve because it got complex was this.

Say I have a scalar A:

A

I make a copy of it to B. B identifies itself as a "child" of A. A still
owns the buffer. B knows that if it wants to write to the the buffer, it
copies it first:

A
/
B

I make a copy of A to C:

A
/ \
B C

Now I change A. Bletch. "Easy" solution is that A donates its buffer to one
of its children, and re-parents the others

B
\
C

but that starts to feel complex, particularly if there's been some more
copying:

A
/ \
B C
/ \ \
D E F

and it means that there are two sorts of COW scalars - parents and children.
And the particular problem I didn't like was that parents have to have back
pointers to their children, and as there could be multiple children that meant
a dynamic structure.

So I went for a different approach. When a scalar was copied created a
circular linked list. Initially it's only 2 items:

+--+
v |
A->B

but if you copy A or B it becomes 3, etc

+-----+
v |
A->B->C

The advantages I saw in this over a tree-like system were

1: Fixed sized data - exactly 1 pointer per COW scalar.
2: All COW scalars are equal - there is less code
3: Changing a COW scalar is easy:

Make a new copy of the the buffer for yourself.

Then, if the scalar you point to point back to you then you know there are
only 2 scalars in the loop. Donate the existing buffer to the other scalar,
flag it as no longer COW.

Otherwise, adjust the pointers so that you aren't in the loop.

All COWs are equal. No COW is more equal than any other.

The biggest downside I can think of is that you have to run around the loop
of pointers when you are modified, and the loop may get quite big.

Anyway, it seemed to work, but it was all rather to experimental to go into
5.8, and there were other more real things to get fixed.

I hope that this may be of use. If not, sorry for wasting everyone's time.

Nicholas Clark

Peter Gibbs

unread,

May 10, 2002, 11:38:00 AM5/10/02

to Nicholas Clark, Steve Fink, perl6-internals

Hi Nicholas

The final design is now waiting on Dan, but it is always interesting to see
other ideas.

Like you, I rejected the parent/child technique. However, my proposed
solution did not use any links at all, because it relies on the garbage
collection system to determine when a shared buffer has only one reference.

It worked as follows (ignoring substrings for simplicity of explanation):

Each string header (the equivalent of a scalar in this regard, as it points
to a specific string instance) points to the shared buffer
The string header contains a COW flag, set in both source and destination
headers when a string is [not] copied
The buffer also contains a flag, used only during garbage collection and
always initialised to zero
When a string needs to be changed, we simply copy the data into a new buffer
and clear the COW flag in the header

During the memory pool compaction phase of the GC system, the first header
encountered that points to a specific buffer will cause the buffer to be
collected as normal; the special flag is set in the old copy of the buffer
to say that this buffer has been relocated, and the address of this first
header is stored in the old buffer. The COW flag is cleared at this time.
If another header is found that points to a buffer that has already been
moved, no copying takes place, and the header is simply corrected to point
to the new location of the buffer. The first header has its COW flag set
again.

The result is that the last header of a COWed string will still believe that
the buffer is shared until a GC collection run occurs, and therefore could
result in buffers being copied unnecessarily. Your system eliminates this
problem; however, I believe that Dan may be averse to using a linked list -
we'll see.

Dan Sugalski

unread,

May 10, 2002, 12:12:41 PM5/10/02

to Peter Gibbs, Nicholas Clark, Steve Fink, perl6-internals

At 5:38 PM +0200 5/10/02, Peter Gibbs wrote:
>The result is that the last header of a COWed string will still believe that
>the buffer is shared until a GC collection run occurs, and therefore could
>result in buffers being copied unnecessarily. Your system eliminates this
>problem; however, I believe that Dan may be averse to using a linked list -
>we'll see.

As long as there's no externally visible signs of the COW stuff, I
don't care as long as the code's commented well enough to be
maintained. I'd prefer commits of the code to be done only when
there's a demonstrable win to the committed code, though. (Which is
to say "No checking in code that slows things down")

If you want to take an intermediate step, it's fine to mark
substrings that don't start at the beginning of a buffer with an
extra flag of some sort (BUFFER_COW_substring_FLAG or something) that
caused the GC system to make a clean copy when it ran through the
buffer pool and collected, though that has issues of properly gaguing
how much memory's needed for the new pool.

Graham Barr

unread,

May 10, 2002, 12:17:05 PM5/10/02

to Dan Sugalski, Peter Gibbs, Nicholas Clark, Steve Fink, perl6-internals

On Fri, May 10, 2002 at 12:12:41PM -0400, Dan Sugalski wrote:
> At 5:38 PM +0200 5/10/02, Peter Gibbs wrote:
> >The result is that the last header of a COWed string will still believe that
> >the buffer is shared until a GC collection run occurs, and therefore could
> >result in buffers being copied unnecessarily. Your system eliminates this
> >problem; however, I believe that Dan may be averse to using a linked list -
> >we'll see.
>
> As long as there's no externally visible signs of the COW stuff, I
> don't care as long as the code's commented well enough to be
> maintained. I'd prefer commits of the code to be done only when
> there's a demonstrable win to the committed code, though. (Which is
> to say "No checking in code that slows things down")
>
> If you want to take an intermediate step, it's fine to mark
> substrings that don't start at the beginning of a buffer with an
> extra flag of some sort (BUFFER_COW_substring_FLAG or something) that
> caused the GC system to make a clean copy when it ran through the
> buffer pool and collected, though that has issues of properly gaguing
> how much memory's needed for the new pool.

One of the performce benefits in perl is that it holds a pointer
to the start of the buffer and an offset to the statr of the
string data. This is a great benefit when trimming the start off strings.

Why would parrot not need this ? And if it did, could you not use it
for COW substrings ?

Graham.
[You has a lot of catching up todo]