
String API


Benjamin Goldberg

Aug 19, 2003, 12:07:22 AM
to perl6-i...@perl.org
There are a number of shortcomings in the API, which I'd like to address
here, and propose improvements for.

Not so much with the string_* functions themselves, but rather with how
they work (the encoding API, the transcoding functions).

To allow user-defined encodings, and user-defined transcoding (written
in parrot), the first parameter of all of the function pointers in the
ENCODING and TYPE structures should be INTERP.

This, of course, means that the first parameter to all of the string_*
functions needs to be INTERP, too. Preferably all of them, not just
the ones which actually use ->encoding or ->type... so that we'll have
the freedom to change them so they *do* use ->encoding and ->type in the
future. (And for consistency.) Note that currently, *not* all of the
string_ functions have INTERP as their first param.

For encodings other than the built-in ones, the string data should not
*need* to be inside the string buffer itself (though of course it
still *can* be); we should be able to build it lazily, read it from
disk, get it by calling methods on a PMC, or whatever. This will likely
mean having one or more pObjs (PMCs, generally) stored in the buffer.
This, in turn, means we need a 'mark' entry in the encoding vtable, to
prevent them from being cleaned up from under us. We should also have
the option of allocating non-gced memory from the system and freeing it
when the string is freed; this means we need a 'destroy' entry in the
encoding vtable.

(For the builtin string types, 'mark' and 'destroy' will of course be
NULL. For custom strings, of course, they might not be.)

(For simplicity, we might allow the buffer to consist *only* of either
raw data *or* pointers to pObjs; then we only need a single flag to
tell the gc which we have, and don't need a 'mark' or 'destroy'.)


I *really* *really* want string iterators. The current API for
iterating through the characters of a string is, IMHO, vastly
insufficient.

The following are what I want for string iterators:

1/ Iterators won't become invalid if the string gets moved in memory.

Currently, all we've got is a void* pointer which points into the buffer
of the string; during GC, strings can get reallocated, making the
pointer invalid. For that matter, if you grow a string, while iterating
forward through it, there's a good chance that your iterator will become
invalid.

2/ Iterators should be integers, structs, or pointers into immobile
memory (memory which won't be moved during GC). They should not need to
be anchored to avoid being GCed, nor freed to avoid memory leakage.

If, for the builtin types, we changed that 'void*' pointer to an
integer, indicating a number of bytes from strstart, then conditions 1&2
would be satisfied for them.
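As a sanity check on that claim, here's a tiny stand-alone model (types and names invented for illustration, not Parrot's) of an iterator stored as a byte count from strstart rather than as a raw pointer:

```c
typedef long INTVAL;   /* stand-in for Parrot's INTVAL */

typedef struct {
    char *strstart;    /* gc may repoint this when the buffer moves */
} Str;

/* Decoding re-derives the pointer from strstart on every use, so the
 * iterator (a byte offset) survives a buffer relocation. */
static char decode_at(const Str *s, INTVAL iter) {
    return s->strstart[iter];   /* fixed-width: one byte per character */
}
```

If "gc" copies the buffer and updates strstart, the same offset keeps working.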

3/ The encoding functions (iterator creation, advancement, etc.)
should be able to call into parrot code; thus, they need an INTERP as
the first parameter.

4/ It should take O(1) time to get an iterator to the start or end of
a string.

5/ It should take O(n) time to advance an iterator n characters
(either forwards or backwards). It would be nice if it took O(1) time,
but it's not necessary.

6/ It should take O(1) time to decode whatever characters are at the
iterator.

7/ If two iterators are N characters apart, it should take O(N) time
to measure that distance.

8/ The encoding/iterator API should be sufficiently complete to allow
someone to write a character-rope string type, and have it work
seamlessly with other strings.

9/ New ops which provide access to the string iterator API.

10/ Add methods to PerlString to make it compatible with Iterator.

11/ Any string_ function which takes a character index as a
parameter, should be able to take a string iterator.

12/ The rx engine should use the new ops.

12a/ We should be able to use the rx engine to "match" a stream of
values from an Iterator PMC. Whether this Iterator is crawling over a
PerlString, or PerlArray, or something else, shouldn't matter to the rx
engine.

--
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca
);{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "$@[$a%6
]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}

Luke Palmer

Aug 19, 2003, 2:49:51 AM
to Benjamin Goldberg, perl6-i...@perl.org
Benjamin Goldberg writes:
> I *really* *really* want string iterators. The current API for
> iterating through the characters of a string is, IMHO, vastly
> insufficient.

Not only is the current API inconvenient; iterators are also essential
for doing pattern matching efficiently on some multibyte encodings, most
notably UTF-8.

> The following are what I want for string iterators:
>

> 5/ It should take O(n) time to advance an iterator n characters
> (either forwards or backwards). It would be nice if it took O(1) time,
> but it's not necessary.

And also impossible for certain encodings. O(n) should suffice.

> 6/ It should take O(1) time to decode whatever characters are at the
> iterator.
>
> 7/ If two iterators are N characters apart, it should take O(N) time
> to measure that distance.
>
> 8/ The encoding/iterator API should be sufficiently complete to allow
> someone to write a character-rope string type, and have it work
> seamlessly with other strings.

More importantly, it should be possible to write a lazy string, perhaps
generated from a filehandle. There are downsides to this, however. The
more indirection we put in, especially in the form of function calls,
the more efficiency we lose when things need to be tight and local.
This is particularly true if we're going to do pattern matching at the
bytecode level as opposed to the op level.

> 9/ New ops which provide access to the string iterator API.

Yes. What is going to be used to store an iterator? An I reg, a P reg?
If it's a PMC, would it be possible to just implement the iterator
itself as a PMC, and use the standard iterator vtable methods (which
are?) for motion and dereferencing? Again, that involves a vtable
overhead and doesn't lend itself to JIT very well (which is very, very
important).

> 10/ Add methods to PerlString to make it compatible with Iterator.
>
> 11/ Any string_ function which takes a character index as a
> parameter, should be able to take a string iterator.
>
> 12/ The rx engine should use the new ops.
>
> 12a/ We should be able to use the rx engine to "match" a stream of
> values from an Iterator PMC. Whether this Iterator is crawling over a
> PerlString, or PerlArray, or something else, shouldn't matter to the rx
> engine.

Luke

Leopold Toetsch

Aug 19, 2003, 3:47:29 AM
to Benjamin Goldberg, perl6-i...@perl.org
Benjamin Goldberg <ben.go...@hotpop.com> wrote:
> There are a number of shortcomings in the API, which I'd like to address
> here, and propose improvements for.

> To allow user-defined encodings, and user-defined transcoding, (written


> in parrot) the first parameter of all of the function pointers in the
> ENCODING and TYPE structures should be INTERP.

This belongs IMHO into PerlString (or better a class derived from that).

> I *really* *really* want string iterators. The current API for
> iterating through the characters of a string is, IMHO, vastly
> insufficient.

encoding->skip_forward(.., by_n) doesn't look all that insufficient. A
skip_one() function wouldn't hurt, though.

> 1/ Iterators won't become invalid if the string gets moved in memory.

> Currently, all we've got is a void* pointer which points into the buffer
> of the string; during GC, strings can get reallocated, making the
> pointer invalid.

You are not allowed to cache the pointer. string->strstart + idx is
always your actual character in the string.

To satisfy 1/ we would have to mark the string as "immobile" (which we
have a flag for), *but* then you can't grow such strings, and the copying
collector can't clean up the block the string is in (worse, the
collector currently just frees the block).

> 10/ Add methods to PerlString to make it compatible with Iterator.

Yep. That was in my iterator proposal.

> 11/ Any string_ function which takes a character index as a
> parameter, should be able to take a string iterator.

Bloat IMHO. While this abstraction is flexible, it IMHO doesn't belong
in the string subsystem but in a string class that implements these
functions.

> 12/ The rx engine should use the new ops.

> 12a/ We should be able to use the rx engine to "match" a stream of
> values from an Iterator PMC. Whether this Iterator is crawling over a
> PerlString, or PerlArray, or something else, shouldn't matter to the rx
> engine.

That's fine, we are at the class level here.

leo

Tim Bunce

Aug 19, 2003, 6:36:32 AM
to Benjamin Goldberg, perl6-i...@perl.org
On Tue, Aug 19, 2003 at 12:07:22AM -0400, Benjamin Goldberg wrote:
> There are a number of shortcomings in the API, which I'd like to address
> here, and propose improvements for.

Just to be sure people are keeping it in mind, I'll repost this from Larry:

On Wed, Jan 30, 2002 at 10:47:36AM -0800, Larry Wall wrote:
>
> For various reasons, some of which relate to the sequence-of-integer
> abstraction, and some of which relate to "infinite" strings and arrays,
> I think Perl 6 strings are likely to be represented by a list of
> chunks, where each chunk is a sequence of integers of the same size or
> representation, but different chunks can have different integer sizes
> or representations. The abstract string interface must hide this from
> any module that wishes to work at the abstract string level. In
> particular, it must hide this from the regex engine, which works on
> pure sequences in the abstract.

Tim.

Benjamin Goldberg

Aug 19, 2003, 5:58:47 PM
to perl6-i...@perl.org
Luke Palmer wrote:
>
> Benjamin Goldberg writes:
[snip]

> > 9/ New ops which provide access to the string iterator API.
>
> Yes. What is going to be used to store an iterator? An I reg, a P reg?
> If it's a PMC, would it be possible to just implement the iterator
> itself as a PMC, and use the standard iterator vtable methods (which
> are?) for motion and dereferencing? Again, that involves a vtable
> overhead and doesn't lend itself to JIT very well (which is very, very
> important).

I envision that an iterator for a STRING* would be an INTVAL, but an
iterator for a PerlString pmc would be an Iterator pmc.

There are a number of reasons why INTVALs are best -- the most important
ones being that they're *small*, and that we need do nothing at all in C
code to prevent them from being garbage collected.

Also, if STRING*s use INTVALs as their iterators, then PerlString can
work with the Iterator.pmc class simply by wrapping these INTVALs inside
Key.pmc objects, using key_new_integer.

There are, alas, a few drawbacks to this.
1/ The algorithm needs to be designed so that these intvals are
agnostic to the string's location in memory -- if
garbage collection moves the string's buffer, the iterator needs to
remain valid. Thus, you can't use a pointer into the buffer which has
been simply cast to an integer. (You *can*, however, subtract the start
of the buffer from a pointer into the buffer, and then cast this
ptrdiff_t into an INTVAL.)
2/ If you need data more complex than an integer -- e.g., if your
string's data is actually in some sort of tree, and you want to say,
"the node at the end of the fourth branch from the twentieth branch from
the fifth branch from the root," then you're in trouble.

Hmm, considering how much of a problem 2/ is, maybe we should use
Key.pmc objects as our iterators? This would make PerlString's
interface with Iterator even simpler.

Benjamin Goldberg

Aug 19, 2003, 6:26:54 PM
to perl6-i...@perl.org
Leopold Toetsch wrote:
>
> Benjamin Goldberg <ben.go...@hotpop.com> wrote:
> > There are a number of shortcomings in the API, which I'd like to
> > address here, and propose improvements for.
>
> > To allow user-defined encodings, and user-defined transcoding,
> > (written in parrot) the first parameter of all of the function
> > pointers in the ENCODING and TYPE structures should be INTERP.
>
> This belongs IMHO into PerlString (or better a class derived from that).

Then how do we pass a user-defined string to a function which expects an
argument in an Sreg? In particular, consider if the string is Really
Really big, and actually resides on disk instead of in memory. If we
had to convert from a PerlString derivative to a STRING*, then we'd have
to load that whole file into memory. Ugh.

> > I *really* *really* want string iterators. The current API for
> > iterating through the characters of a string is, IMHO, vastly
> > insufficient.
>
> encoding->skip_forward(.., by_n) doesn't look all that insufficient. A
> skip_one() function wouldn't hurt, though.

That wasn't precisely what I was speaking of.

> > 1/ Iterators won't become invalid if the string gets moved in
> > memory.
>
> > Currently, all we've got is a void* pointer which points into the
> > buffer of the string; during GC, strings can get reallocated, making
> > the pointer invalid.
>
> You are not allowed to cache the pointer.

That's my point. I want an iterator value which I *can* cache.

> string->strstart + idx is always your actual character in the string.

The pointer can get invalidated even without storing it for any length
of time.

Consider:

    PMC *array = pmc_new( interpreter, enum_class_Sarray );
    void *iter = str->strstart;
    INTVAL i = 0, len = string_length( str );
    VTABLE_set_integer( array, len );
    for ( i = 0; i < len; ++i ) {
        INTVAL c = str->encoding->decode( iter );
        VTABLE_set_integer_keyed_int( array, i, c );
        iter = str->encoding->skip_forward( iter, 1 );
    }

What happens if VTABLE_set_integer on that sarray causes a gc to get
run, and if that gc causes the string's buffer to get moved? Oops.

Is this code construct forbidden?

> To satisfy 1/ we would have to mark the string as "immobile" (which we
> have a flag for) *but* you can't grow such strings, the copying
> collector can't cleanup the block, where the string is in (and worse,
> the collector currently just frees the block).

Indeed. That's why using a pointer into the string is so very
insufficient.

Now, suppose that instead of a pointer, we had an integer describing the
number of bytes from strstart to where we're looking... *now* most of
the problems go away. It would no longer matter if the string got
moved, would it? The drawback of course is that we'd need to add it to
the str->strstart pointer before any time that we want to use it... but
that's not especially painful.
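To illustrate, here's a self-contained sketch (simplified stand-ins, not the real Parrot API: the Str type, store_and_maybe_move, and decode_all are all invented) of the same kind of loop written with a byte-offset iterator; a mid-loop buffer move is simulated and does no harm:

```c
#include <string.h>

typedef long INTVAL;   /* stand-in for Parrot's INTVAL */

typedef struct {
    char *strstart;
} Str;

/* Stand-in for a vtable call like VTABLE_set_integer_keyed_int: it may
 * trigger gc and relocate the string's buffer mid-loop. */
static void store_and_maybe_move(Str *s, char *newbuf, INTVAL out[],
                                 INTVAL i, INTVAL c) {
    out[i] = c;
    if (i == 2) {                       /* pretend gc ran here */
        memcpy(newbuf, s->strstart, strlen(s->strstart) + 1);
        s->strstart = newbuf;
    }
}

/* The loop from the message, rewritten with a byte-offset iterator:
 * re-deriving the pointer as strstart + iter on every use makes the
 * mid-loop buffer move harmless. */
static void decode_all(Str *s, char *newbuf, INTVAL out[], INTVAL len) {
    INTVAL iter = 0;                    /* offset from strstart, not a pointer */
    for (INTVAL i = 0; i < len; ++i) {
        INTVAL c = (INTVAL)(unsigned char)s->strstart[iter]; /* "decode" */
        store_and_maybe_move(s, newbuf, out, i, c);
        iter += 1;                      /* "skip_forward" by one char */
    }
}
```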

> > 10/ Add methods to PerlString to make it compatible with Iterator.
>
> Yep. That was in my iterator proposal.
>
> > 11/ Any string_ function which takes a character index as a
> > parameter, should be able to take a string iterator.
>
> Bloat IMHO. While this abstraction is flexible, it IMHO doesn't belong
> into the string subsystem but into a string class, that implements these
> functions.

The bloat can be avoided if the primary string_ implementations *only*
took string iterators. Then, to satisfy those who want to use character
indices, provide wrappers which take character index arguments, and
convert them into string iterators relative to those particular
strings.

Benjamin Goldberg

Aug 19, 2003, 6:55:33 PM
to perl6-i...@perl.org

*I* was certainly keeping it in mind ;).

Just for the curious, the *reasoning* behind my proposed requirements
is as follows:

1/ The regex engine uses string_index all over the place. This is an
O(n) operation for the utf8 encoding. This is bad.

If we had real string iterators, then this would be an O(1)
operation.

My requirements 4..7 are a description of the time complexity which
most normal people can expect of iterators.

2/ There's no way for a string to refer to other strings or pmcs,
making all sorts of things (including what Larry mentioned) impossible.

Or at least, if we tried, there's no way to prevent the things we're
pointing to from being cleaned up out from underneath us, since we've no
way of marking them as alive.

This is my requirement 8, and, to a lesser degree, requirement 3.

/***/

Most of the rest assumes that the solution to the current API's
failure of 1/ will actually be a string iterator.

3/ String iterator usage should be *simple*.

Making them pmcs would mean that we'd need to temporarily anchor
them. Having them as void* pointers to gc-relocatable memory means that
they can become invalid at unexpected times. Obviously these problems
can be avoided if we temporarily disable DOD/GC, but that has
other drawbacks.

These are my requirements 1 and 2. I fear that we're going to lose the
requirement that iterators be simple objects, since letting them be pmcs
gives us so much more flexibility. (Which, I fear, we need, if the
encoding is a not-so-simple data structure, like a tree, or a lazily
concatenated sequence of substrings.)

It also covers my requirements 9..12: if we're going to have them, use
them.

/***/

4/ If we've got fast string iterators, then an Iterator.pmc object
for a PerlString object won't be significantly slower than using
iterators for strings directly. So... why not make the rx engine work
on Iterator.pmc objects?

Leopold Toetsch

Aug 20, 2003, 3:18:59 AM
to Benjamin Goldberg, perl6-i...@perl.org
Benjamin Goldberg <ben.go...@hotpop.com> wrote:
> Leopold Toetsch wrote:
>>
>> Benjamin Goldberg <ben.go...@hotpop.com> wrote:
>> > There are a number of shortcomings in the API, which I'd like to
>> > address here, and propose improvements for.
>>
>> > To allow user-defined encodings, and user-defined transcoding,
>> > (written in parrot) the first parameter of all of the function
>> > pointers in the ENCODING and TYPE structures should be INTERP.
>>
>> This belongs IMHO into PerlString (or better a class derived from that).

> Then how do we pass a user-defined string to a function which expects an
> argument in an Sreg?

We have here IMHO the same relationship as with ints:
int (INTVAL) <-> Int (PerlInt)
str (STRING) <-> Str (PerlString)
You can't tie native types, you can't attach properties on these and so
on. And you can't pass some kind of active native strings in the SReg.
The "user-defined (written in parrot)" part implies that these are
PerlString PMCs.

> Now, suppose that instead of a pointer, we had an integer describing the
> number of bytes from strstart to where we're looking... *now* most of
> the problems go away.

So let's change the encoding->skip_{for,back}ward to take/return an
INTVAL being the byte-position relative to strstart.
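A possible shape for that, sketched for UTF-8 (my own illustrative code, not the existing Parrot function):

```c
typedef long INTVAL;   /* stand-in for Parrot's INTVAL */

/* skip_forward reworked to take and return a byte position relative to
 * strstart.  In UTF-8, continuation bytes match the pattern 10xxxxxx,
 * so advancing one character means stepping past the lead byte and any
 * continuation bytes that follow it. */
static INTVAL utf8_skip_forward(const char *strstart, INTVAL pos, INTVAL n)
{
    while (n-- > 0) {
        pos++;                                              /* lead byte */
        while (((unsigned char)strstart[pos] & 0xC0) == 0x80)
            pos++;                                          /* continuations */
    }
    return pos;
}
```

No pointer into the buffer survives the call, so a gc move between calls costs nothing.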

>> > 11/ Any string_ function which takes a character index as a
>> > parameter, should be able to take a string iterator.
>>
>> Bloat IMHO. While this abstraction is flexible, it IMHO doesn't belong
>> into the string subsystem but into a string class, that implements these
>> functions.

> The bloat can be avoided if the primary string_ implementations *only*
> took string iterators. Then, to satisfy those who want to use character
> indices, provide wrappers which take character index arguments, and
> converts them into string iterators relative to those particular
> strings.

Ok. I see. That's fine - except for utf8 strings. But these could be
converted to utf32 as soon as they are seen.

leo

Benjamin Goldberg

Aug 20, 2003, 7:19:42 PM
to perl6-i...@perl.org

Leopold Toetsch wrote:
>
> Benjamin Goldberg <ben.go...@hotpop.com> wrote:
> > Leopold Toetsch wrote:
> >>
> >> Benjamin Goldberg <ben.go...@hotpop.com> wrote:
> >> > There are a number of shortcomings in the API, which I'd like to
> >> > address here, and propose improvements for.
> >>
> >> > To allow user-defined encodings, and user-defined transcoding,
> >> > (written in parrot) the first parameter of all of the function
> >> > pointers in the ENCODING and TYPE structures should be INTERP.
> >>
> >> This belongs IMHO into PerlString (or better a class derived from
> >> that).
>
> > Then how do we pass a user-defined string to a function which expects
> > an argument in an Sreg?
>
> We have here IMHO the same relationship as with ints:
> int (INTVAL) <-> Int (PerlInt)
> str (STRING) <-> Str (PerlString)
> You can't tie native types, you can't attach properties on these and so
> on. And you can't pass some kind of active native strings in the SReg.
> The user-defined (written in parrot) implies, that these are
> PerlString PMCs.

Not having an INTERP argument severely limits us, even in other ways.

Even ignoring the problem of an encoding being an "active" string, what
if it needs to allocate memory (perhaps some temporary buffer to
do some computation needed to decode bytes of a string)?
sys_mem_allocate and sys_mem_free all over the place? Blech. I want to
be able to allocate a garbage-collectible Buffer :(

This also eliminates the chance of having a STRING* which is attached to
a file -- we'd only be able to do such a thing as a PerlString
derivative.

Similarly, that would eliminate the chance of a STRING* which is
actually a lazily concatenated list of other STRING*s; we'd only be able
to do this as a PerlString derivative.

> > Now, suppose that instead of a pointer, we had an integer describing
> > the number of bytes from strstart to where we're looking... *now* most
> > of the problems go away.
>
> So lets change the encoding->skip_{for,back}ward to take/return an
> INTVAL being the byte-position relative to strstart.

And they need to take str->strstart! :)

I said "most", not "all". It solves the problems incurred with the
string buffer getting moved by gc (which is good), but it doesn't solve
everything.

In particular, if we make a STRING* encoding which is a lazily concatted
list of other strings (yes, I keep going back to it, but Larry said we'll
have them. Theoretically he might have only meant as a high level type
(a PerlString derivative), but *I* think it would be nice to be able
to have this as a STRING* type), we'd want our iterator to be two
integers: the first being the index into our array, the second being
the iterator into the string we're currently iterating through.
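A sketch of that two-integer iterator (all names here are hypothetical, and chunks are assumed non-empty and single-byte-encoded for brevity):

```c
typedef long INTVAL;   /* stand-in for Parrot's INTVAL */

/* Iterator for a lazily concatenated string: one integer selects the
 * chunk, the other is the iterator within that chunk. */
typedef struct {
    INTVAL chunk;    /* index into the array of chunks      */
    INTVAL offset;   /* byte offset within that chunk       */
} ConcatIter;

typedef struct {
    const char **chunks;
    const INTVAL *chunk_len;   /* length in bytes of each chunk */
    INTVAL n_chunks;
} ConcatStr;

static char concat_decode(const ConcatStr *s, ConcatIter it) {
    return s->chunks[it.chunk][it.offset];
}

/* Advance one (single-byte) character, rolling over into the next
 * chunk when the current one is exhausted. */
static ConcatIter concat_skip_one(const ConcatStr *s, ConcatIter it) {
    it.offset++;
    if (it.chunk < s->n_chunks && it.offset >= s->chunk_len[it.chunk]) {
        it.chunk++;
        it.offset = 0;
    }
    return it;
}
```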

> >> > 11/ Any string_ function which takes a character index as a
> >> > parameter, should be able to take a string iterator.
> >>
> >> Bloat IMHO. While this abstraction is flexible, it IMHO doesn't
> >> belong into the string subsystem but into a string class, that
> >> implements these functions.
>
> > The bloat can be avoided if the primary string_ implementations *only*
> > took string iterators. Then, to satisfy those who want to use
> > character indices, provide wrappers which take character index
> > arguments, and converts them into string iterators relative to those
> > particular strings.
>
> Ok. I see. That's fine - except for utf8 strings.

Why wouldn't it work for utf8 strings?

INTVAL string_index_to_iterator(INTERP, STRING *s, INTVAL index) {
    INTVAL start = s->encoding->iterator_start(interpreter, s);
    return s->encoding->skip_forward(interpreter, s, start, index);
}

Converting an index to an iterator for a utf8 string crawls through the
string and finds the byte offset.

> But these could be converted to utf32 as soon as they are seen.

For a long string, that could be quite a bit of bloat.

Leopold Toetsch

Aug 21, 2003, 2:37:00 AM
to Benjamin Goldberg, perl6-i...@perl.org
Benjamin Goldberg wrote:

> Not having an INTERP argument severely limits us, even in other ways.

The INTERP argument is fine. The user defined encoding is/was my problem.


> Similarly, that would eliminate the chance of a STRING* which is
> actually a lazily concatenated list of other STRING*s; we'd only be able
> to do this as a PerlString derivative.

I have trouble imagining such kinds of STRINGs. They need an attached
PMC doing the work plus an attached list containing the string chunks. You
need a PMC anyway, so why not have this in a PerlString-derived class? Then
you don't have any overhead on "average" strings.

> In particular, if we make a STRING* encoding which is a lazily concatted
> list of other strings (yes, I keep going back to it, but Larry said we'll
> have them. Theoretically he might have only meant as a high level type
> (a PerlString derivative), but *I* think it would be nice to be able
> to have this as a STRING* type), we'd want our iterator to be two
> integers: the first being the index into our array, the second being
> the iterator into the string we're currently iterating through.

How many strings in a JAPP[1] might need that? Do you really want to slow
down all string access just for one very special corner case?


>>> ... provide wrappers which take character index
>>> arguments, and convert them into string iterators relative to those
>>> particular strings.
>>
>> Ok. I see. That's fine - except for utf8 strings.
>
> Why wouldn't it work for utf8 strings?

The wrapper is O(n) for utf8 strings. So converting once might be
cheaper during the first character-index access.

leo


[1] JAPP: Joe Average Perl Program

Peter Gibbs

Aug 21, 2003, 6:57:40 AM
to perl6-i...@perl.org
If the string API is to be revised, I would like to suggest that
consideration be given to having a single string vtable, merging
the current encoding and chartype structures into a single one.

This removes one pointer from each string header, and allows
a single parameter to be used instead of two for transcode, etc.
Also, the IO system will need to have a mechanism for specifying
the character set used by files; this again could then be a single
value.

I do not believe that the two existing parameters are orthogonal,
so the number of charset (or whatever) entities would be less than
the cross product. e.g. the existing 2 chartypes x 4 encodings
would really only require 4 charsets.

I actually implemented this change some time ago as part of my
'African Grey' variant; an extract from my charset.h appears below.
The get_unicode and put_unicode entries combine the get and put
operations with transcoding; this simplifies the transcode operation
significantly. find_substring was an experimental feature that simply
replaced the two calls to skip_forward used by string_substr; it also
implemented the optimisation for single-byte encodings that has
subsequently been catered for by specific code in string_substr.

----------------------------------------------------------------------------
enum {
    enum_charset_usascii,
    enum_charset_utf8,
    enum_charset_utf16,
    enum_charset_utf32,
    enum_charset_MAX
};

struct parrot_charset_t {
    INTVAL index;
    const char *name;
    Parrot_UInt max_bytes;
    Parrot_UInt (*length)(const void *ptr, Parrot_UInt bytes);
    const void *(*skip_forward)(const void *ptr, Parrot_UInt n);
    const void *(*skip_backward)(const void *ptr, Parrot_UInt n);
    Parrot_UInt (*get)(const void *ptr);
    Parrot_UInt (*get_unicode)(const void *ptr);
    void *(*put)(void *ptr, Parrot_UInt c);
    void *(*put_unicode)(void *ptr, Parrot_UInt c);
    Parrot_Int (*is_digit)(Parrot_UInt c);
    Parrot_Int (*get_digit)(Parrot_UInt c);
    void (*find_substring)(const void *ptr, Parrot_UInt *start,
                           Parrot_UInt *length);
};
----------------------------------------------------------------------------

--
Peter Gibbs
EmKel Systems

Tom Hughes

Aug 21, 2003, 8:01:59 AM
to Peter Gibbs
In message <067401c367d3$0adcaaa0$0b01...@emkel.co.za>
Peter Gibbs <pe...@emkel.co.za> wrote:

> I do not believe that the two existing parameters are orthogonal,
> so the number of charset (or whatever) entities would be less than
> the cross product. e.g. the existing 2 chartypes x 4 encodings
> would really only require 4 charsets.

The problem is that there are hundreds of character sets that
use a single-byte encoding, so you're going to wind up duplicating
the encoding-related actions for all those character sets.
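That duplication can be kept cheap if distinct charset entries share the same single-byte encoding actions as function pointers. A small sketch (struct shape loosely follows Peter's parrot_charset_t extract above; everything here is illustrative, not real Parrot code):

```c
typedef unsigned long Parrot_UInt;

/* Cut-down charset vtable for the sketch. */
typedef struct {
    const char *name;
    Parrot_UInt (*length)(const void *ptr, Parrot_UInt bytes);
    const void *(*skip_forward)(const void *ptr, Parrot_UInt n);
} charset_t;

static Parrot_UInt singlebyte_length(const void *ptr, Parrot_UInt bytes) {
    (void)ptr;
    return bytes;                       /* one character per byte */
}
static const void *singlebyte_skip(const void *ptr, Parrot_UInt n) {
    return (const char *)ptr + n;       /* fixed-width advance */
}

/* Hundreds of 8-bit charsets can reuse the same two functions. */
static const charset_t latin1 = { "iso-8859-1", singlebyte_length, singlebyte_skip };
static const charset_t koi8_r = { "koi8-r",     singlebyte_length, singlebyte_skip };
```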

Tom

--
Tom Hughes (t...@compton.nu)
http://www.compton.nu/

Nicholas Clark

Aug 21, 2003, 9:15:30 AM
to Benjamin Goldberg, perl6-i...@perl.org
On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:

> Leopold Toetsch wrote:

> > But these could be converted to utf32 as soon as they are seen.
>
> For a long string, that could be quite a bit of bloat.

Jarkko's view is that the combined hit of the size of the extra code to skip
along the variable length encoding, the time taken to execute that code,
(and I guess the cache misses it creates) is greater than the gain from
saving space. Particularly when the regexp engine is written assuming O(1)
random access. He thinks perl 5 would probably have been faster if it used
UCS32 internally. Maybe ponie will.

Nicholas Clark

Elizabeth Mattijsen

Aug 21, 2003, 11:35:31 AM
to Nicholas Clark, Benjamin Goldberg, perl6-i...@perl.org
At 14:15 +0100 8/21/03, Nicholas Clark wrote:
>On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:
> > Leopold Toetsch wrote:
> > > But these could be converted to utf32 as soon as they are seen.
> > For a long string, that could be quite a bit of bloat.
>Jarkko's view is that the combined hit of the size of the extra code to skip
>along the variable length encoding, the time taken to execute that code,
>(and I guess the cache misses it creates) is greater than the gain from
>saving space.

Indeed. I think available memory has increased more than 4 fold
since the first regexp engine that could only do 1-byte ASCII. So
relatively, I don't think that bloat is an issue. Just don't do
regexps on 256Mbyte strings when your machine has less than 1 GByte
RAM ;-)


Liz

Dan Sugalski

Aug 21, 2003, 12:34:02 PM
to Elizabeth Mattijsen, Nicholas Clark, Benjamin Goldberg, perl6-i...@perl.org

FWIW, we're not going to do string ops on UTF-8 stuff. We'll understand
it, and know how to translate it to more useful forms, but it's just a
static storage format for us. (Mainly because, while working with UTF-8
strings is a massive pain, it's foolish to transform it to UTF-16 or UTF-32
if we don't need to) Our unicode operations will be done either on UTF-16
(if we get ICU going, since that's what it uses) or UTF-32. -8 is a
legacy/storage format only so far as we're concerned.

The same thing goes for other variable-width encodings such as Shift-JIS,
FWIW.

Dan

Benjamin Goldberg

Aug 21, 2003, 6:37:52 PM
to perl6-i...@perl.org

Nicholas Clark wrote:
>
> On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:
>
> > Leopold Toetsch wrote:
>
> > > But these could be converted to utf32 as soon as they are seen.
> >
> > For a long string, that could be quite a bit of bloat.
>
> Jarkko's view is that the combined hit of the size of the extra code to
> skip along the variable length encoding,

We've already got code to skip along a variable length encoding
(skip_forward does precisely this).

> the time taken to execute that code,

With the current code, where string_ functions take indices, not
iterators, this skipping forward already needs to be done. Except that
it's done inside of each and every string_ function, instead of in
a separate string_convert_index_to_iterator function.

> (and I guess the cache misses it creates)

If we're converting indices to iterators often, then the converter will
remain in cache. If we're smart enough to convert once, and then do
everything relative to that iterator, then the cost of the cache-miss to
load the convert function into memory will be relatively minor.

> is greater than the gain from saving space.

How much gain there is in space by keeping data in utf8, I don't know.
This would have to be determined by examining samples of Real World utf8
data (in particular, samples of Real World data which can't be
downgraded to some singlebyte encoding).

> Particularly when the regexp engine is written assuming O(1) random
> access.

It doesn't *need* to assume O(1) random access; after all, it's never
accessing *randomly*, it's always accessing just one character away from
some other character that it's recently accessed. Sounds like a job for
an iterator to me. With an iterator, it needs only assume that
advancing the iterator a distance of 1 takes O(1) time.

> He thinks perl 5 would probably have been faster if it used
> UCS32 internally. Maybe ponie will.
>
> Nicholas Clark


Benjamin Goldberg

Aug 21, 2003, 7:22:38 PM
to perl6-i...@perl.org
Leopold Toetsch wrote:
>
> Benjamin Goldberg wrote:
>
> >
> > Leopold Toetsch wrote:
> > Not having an INTERP argument severely limits us, even in other ways.
>
> The INTERP argument is fine. The user defined encoding is/was my
> problem.

As in, you think we shouldn't have any, at all?

> > Similarly, that would eliminate the chance of a STRING* which is
> > actually a lazily concatenated list of other STRING*s; we'd only be
> > able to do this as a PerlString derivative.
>
> I have problems imagining such kinds of STRINGs.

You lack sufficient imagination -- Larry's suggested that Perl6 strings
may consist of a list of chunks. I can easily imagine each of those
"chunks" being full-fledged STRING* objects.

A foolish question: can you imagine strings which are lazily read from a
file?

If so, could you imagine such a string, sitting in front of a really
really big file, bigger than could fit into memory?

Not only can I imagine all of that, I can imagine the pain that would
be caused, if such file-strings are only implemented at the PMC level,
not at the STRING level, and someone unknowingly converts such a string
from a PMC into a STRING, and forces the entire file to be loaded from
disk into memory.

> They need an attached PMC doing the work + an attached list containing
> the string chunks.

Not necessarily. If we could have str->strstart as a pointer to a
vector of STRING*s, we wouldn't need any PMC to contain the chunks. And
the str->encoding api is (already) sufficient for doing the work. The
only lack is a custom mark, to keep the sub-strings alive.
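A rough C sketch of what that might look like, with stand-in types (the real Parrot `String`, `Interp`, and GC entry points differ; `ChunkedStr`, `chunked_mark`, and this `pobject_lives` are invented here for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for illustration; real Parrot types differ. */
typedef struct String { void *strstart; int gc_marked; } String;
typedef struct Interp Interp;

/* A lazily concatenated string: strstart points at a chunk vector
   instead of raw character data. */
typedef struct ChunkedStr {
    size_t  n_chunks;
    String *chunks[4];          /* fixed size here, for the sketch */
} ChunkedStr;

/* Stand-in for the GC's "this object is alive" call. */
static void pobject_lives(Interp *interp, String *s) {
    (void)interp;
    s->gc_marked = 1;
}

/* The proposed custom 'mark' entry in the encoding vtable: without
   it, the GC would never see the sub-strings and could collect them
   out from under the lazy string. */
static void chunked_mark(Interp *interp, String *s) {
    ChunkedStr *cs = (ChunkedStr *)s->strstart;
    for (size_t i = 0; i < cs->n_chunks; i++)
        pobject_lives(interp, cs->chunks[i]);
}
```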

> You need a PMC anyway. Why not have this in a PerlString derived class.

If we have it in a PerlString derived class, and do not make it part of
STRING*, then we cannot pass such strings to C functions defined to
accept strings in STRING* parameters, or to Parrot subroutines which are
defined to accept strings in S-registers, or which move the strings from
P-registers to S-registers.

We would lose the magic, similar to how moving from a PerlInt to an
INTVAL loses magic.

Well, except that when a PerlInt loses magic going to an INTVAL, the
resulting integer generally takes *less* memory than it did as a PMC,
whereas losing magic by changing from a PMC to a STRING could very
easily result in using *more* memory. (And doing lots of work, which we
wouldn't need if our string kept its magic).

> So you don't have an overhead on "average" strings.

How much speed overhead is there?

> > In particular, if we make a STRING* encoding which is a lazily
> > concatted list of other strings (yes I keep going back to it, but
> > Larry said we'll have them. Theoretically he might have only meant as
> > a high level type (a PerlString derivative) but *I* think it would be
> > nice to be able to have this as a STRING* type) we'd want our
> > iterator to be two integers: the first one being the integer into our
> > array, the second being the iterator into the string we're currently
> > iterating through.
>
> How many strings in JAPP[1] might need that?

That depends. Does concatenation in Perl6, by default, produce a lazy
concatenation, or an immediate actual concatenation?

Furthermore, consider if we allowed something like:

my str $slurp = File.new($filename).slurp();  # = File.slurp($filename)?

Sure, we could have this read in the whole file, but wouldn't it be
nicer if it would *lazily* fill in $slurp?

> Do you really want to slow down all string access, just for one very
> special corner case?

I don't believe that it *would* slow down all string access.

> >>>... provide wrappers which take character index
> >>>arguments, and converts them into string iterators relative to those
> >>>particular strings.
> >>>
> >>Ok. I see. That's fine - except for utf8 strings.
> >>
> >
> > Why wouldn't it work for utf8 strings?
>
> The wrapper is O(n) for utf8 strings. So converting once might be
> cheaper during the first character-index access.

For the current string code, we already take O(n) to get a void* pointer
into an appropriate part of a utf8 string, for each character-index.

If we factor the current code into functions taking iterators, an
index-to-iterator converter, and wrappers taking indices, then it
shouldn't be significantly slower than the current code (except for the
overhead of entering/leaving a function, which might be eliminated by
the C compiler inlining the wrapper and conversion function, if they're
small enough... or by us providing macro versions of the wrappers and
converter).

And if use of the wrappers is discouraged in favor of the iterator
versions, except in cases where random access to the string truly is
needed, then some speed improvements can be gotten.
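The proposed factoring can be shown in miniature. This is a fixed-width sketch with invented names (`StrIter`, `char_at_iter`, `iter_at`, `char_at_index`), not the actual string_ API; for utf8 the converter would be the O(n) part, and the wrapper is small enough to inline or macroize:

```c
#include <assert.h>
#include <stddef.h>

/* Invented names illustrating the factoring: an iterator-based core,
   an index->iterator converter, and a thin index-taking wrapper. */
typedef struct { const char *p; } StrIter;

/* Core: O(1), works relative to an iterator. */
static char char_at_iter(StrIter it) { return *it.p; }

/* Converter: O(1) for this fixed-width sketch, O(n) for utf8. */
static StrIter iter_at(const char *buf, size_t idx) {
    StrIter it = { buf + idx };
    return it;
}

/* Wrapper: the index-taking form the current string_ API exposes. */
static char char_at_index(const char *buf, size_t idx) {
    return char_at_iter(iter_at(buf, idx));
}
```

Callers that truly need random access use the wrapper; everyone else converts once and walks the iterator.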

Leopold Toetsch

Aug 22, 2003, 4:01:37 AM
to Benjamin Goldberg, perl6-i...@perl.org
Benjamin Goldberg <ben.go...@hotpop.com> wrote:
> Leopold Toetsch wrote:
>> I have problems imagining such kinds of STRINGs.

> You lack sufficient imagination -- Larry's suggested that Perl6 strings
> may consist of a list of chunks. I can easily imagine each of those
> "chunks" being full-fledged STRING* objects.

Did Larry speak of PerlString or STRING?

> A foolish question: can you imagine strings which are lazily read from a
> file?

Sure.

> ... If we could have str->strstart as a pointer to a
> vector of STRING*s, we wouldn't need any PMC to contain the chunks. And
> the str->encoding api is (already) sufficient for doing the work. The
> only lack is a custom mark, to keep the sub-strings alive.

So you have everything that a string *PMC* has: a list of chunks
(hanging off some pointer), custom mark, one or 2 vtables (encoding
stuff) ...

> If we have it in a PerlString derived class, and do not make it part of
> STRING*, then we cannot pass such strings to C functions defined to
> accept strings in STRING* parameters,

Such C functions must be aware of the string API anyway; they can't
assume they'll get a char * to something, they have to call the iterator
interface.

> Well, except that when a PerlInt loses magic going to an INTVAL, the
> resulting integer generally takes *less* memory than it did as a PMC,
> whereas losing magic by changing from a PMC to a STRING could very
> easily result in using *more* memory. (And doing lots of work, which we
> wouldn't need if our string kept it's magic).

That's right. But your (or Larry's) proposed list of chunks with a custom
mark is effectively a PMC, whether you call it STRING or not. It's a
string PMC with a special vtable. The chunk list contains STRING*
buffers. That's it.

> my str $slurp = File.new($filename).slurp();  # = File.slurp($filename)?

> Sure, we could have this read in the whole file, but wouldn't it be
> nicer if it would *lazily* fill in $slurp?

Isn't there a big fat warning in $doc, to avoid such kind of code?
Anyway, either the string iterator calls the file iterator to get the
string, or the above code is as illegal as tie()ing an "int".

>> Do you really want to slow down all string access, just for one very
>> special corner case?

> I don't believe that it *would* slow down all string access.

2 more indirections for the chunk buffer: it's variable-sized, so it's a
buffer header + buffer memory. And we are creating new strings all over
the place, which really hurts already now.

> For the current string code, we already take O(n) to get a void* pointer
> into an appropriate part of a utf8 string, for each character-index.

Dan said, we don't do operations on such kind of string encodings. OTOH
if the chunks all have a character count, we can quickly locate a
certain position inside such strings.
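That per-chunk character count makes locating a position O(number of chunks), independent of byte widths inside each chunk. A sketch, with invented names (`Chunk`, `Pos`, `locate`) and the chunk payload elided:

```c
#include <assert.h>
#include <stddef.h>

/* Each chunk records how many characters it holds. */
typedef struct { size_t char_len; /* payload elided */ } Chunk;

/* A position: which chunk, and the character offset within it. */
typedef struct { size_t chunk; size_t within; } Pos;

/* Walk the chunk counts to find character `index`. */
static Pos locate(const Chunk *chunks, size_t n, size_t index) {
    Pos p = { 0, 0 };
    for (size_t c = 0; c < n; c++) {
        if (index < chunks[c].char_len) {
            p.chunk = c;
            p.within = index;
            return p;
        }
        index -= chunks[c].char_len;
    }
    p.chunk = n;                /* past the end */
    return p;
}
```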

leo

Nicholas Clark

Aug 22, 2003, 4:59:39 PM
to Benjamin Goldberg, perl6-i...@perl.org
On Thu, Aug 21, 2003 at 06:37:52PM -0400, Benjamin Goldberg wrote:
>
>
> Nicholas Clark wrote:

> > Particularly when the regexp engine is written assuming O(1) random
> > access.
>
> It doesn't *need* to assume O(1) random access; after all, it's never
> accessing *randomly*, it's always accessing just one character away from
> some other character that it's recently accessed. Sounds like a job for
> an iterator for me. With an iterator, it needs only assume that
> advancing the iterator a distance of 1, takes O(1) time.

Probably true for the actual regexp engine. But I'm pretty sure that where
perl wins (or won, historically - the world is catching up) is in the
optimiser, which takes shortcuts and works out where things can't match.
I suspect that that does think of things in terms of random character
offsets, and independent of that I'm confident that thinking about an
engine in O(1) is easier than thinking about one in O(n).

But I've never written one, so who am I to say?

Nicholas Clark

Gordon Henriksen

Aug 24, 2003, 1:29:35 AM
to perl6-i...@perl.org
Now, I don't really have much of an opinion on compound strings in
general. I do want to address one particular argument, though—the lazily
slurped file string.

On Thursday, August 21, 2003, at 07:22 , Benjamin Goldberg wrote:

> A foolish question: can you imagine strings which are lazily read from
> a file?
>
> If so, could you imagine such a string, sitting in front of a really
> really big file, bigger than could fit into memory?

Having a lazily slurped file string simply delays disaster, and opens
the door for Very Big Mistakes. Such strings would have to be treated
very delicately, or the program would behave very inefficiently or
crash. (And let's be frank, a lazily concatenated STRING* is just a
tie()d string value—I thought that was leaving the core.) There's power
in such strings, no doubt. There's also TERROR of passing the string to
anything lest your program explode because some CPAN module's author
wasn't also TERRIFIED of your input being something not-just-a-string.
If I'm going to have the potential to load the entire file into memory
if I'm the least bit careless, I'd prefer to be up front about it.
Anti-action-at-a-distance. I don't need to be deluded that my code is
efficient because it reads lazily. (Fact is, it's probably faster if it
buffers the file all at once, if it's going to buffer it at all.
Certainly more memory-efficient (!). Fewer chunks. Less overhead. But
probably faster still to mmap() it.)

And what if your admittedly huge file is larger than 2**32 bytes? (A
very real possibility! You said it was too big to fit in memory!) Are
you going to suggest that all STRING* consumers on 32-bit platforms
emulate 64-bit arithmetic whenever manipulating STRING* lengths?

To efficiently process a Very Large String, you need to *stream* through
it, not buffer it. Same applies to infinite strings (generators) or
indeterminate strings (generators and sockets). Such strings don't have
representable or knowable lengths. STRING*'s *really* *really* should
reliably have lengths, I think.

IMAGINE, if you will, something absolutely crazy:

grammar HTTPServer {
    rule http {
        (<request> <commit>)*
    }
    rule request {
        <get_request> | <post_request> | ...
    }
    rule get_request {
        GET <path> <version> <crlf>
        <header>
        {
            my $file = open(...)
                or print("403 Access Denied\r\n"), fail;
            print "200 OK\r\n";
            while (<$file>) { print }
            close $file;
        }
    }
    rule post_request {
        POST <path> <version> <crlf>
        <header>
        {
            # Blahblahblah...
        }
    }
    rule crlf { \r\n }
    rule header {
        <header_line>* <crlf>
        <commit>
    }
    rule header_line {
        ([:alpha:]+): ([^\r\n]* <crlf> ([ \t]+ [^\r\n]* <crlf>)*)
        <commit>
    }
    # ... more ...
}

If perl's using a stream rather than buffering to a STRING*, then
$sock =~ /<HTTPServer::http>/ could actually work—and quite efficiently.
[1] How cool is that? Just imagine trying to apply the same pattern to a
more long-lived protocol than HTTP, though—a database connection, maybe,
or IRC. Or an HTTP client, which can download lots of data. Using chunky
strings? perl, meet rlimit. rlimit, this is perl. [2] Using streams?
Network programming becomes crazily easy.

Gordon Henriksen
mali...@mac.com

[1] Of course, this requires that the regex engine be coded to think in
sequences. The regex engine could keep its own backtracking buffer, and
trim that buffer at each commit.

[2] No doubt, unshift hacks[3] could be found to make the lazy slurpy
file string not crash. But these are just changes to make strings behave
like streams, and would impose upon STRING* consumers everywhere Very
Strange things like those strings which don't know their own length. A
string wants to be a string, and a stream wants to be a stream.

[3] Unshift hack #1: Where commit appears in the above, exit the
grammar, trim the beginning of the string, and re-enter. (But that
forces the grammar author to discard the regex state, whereas commit
would offer no such restriction.) Unshift hack #2: Tell =~ that <commit>
can trim the beginning of the string. (DWIM departs; /cgxism returns.)

Dan Sugalski

Aug 24, 2003, 3:04:59 PM
to perl6-i...@perl.org
At 12:07 AM -0400 8/19/03, Benjamin Goldberg wrote:
>There are a number of shortcomings in the API, which I'd like to address
>here, and propose improvments for.

You're conflating language level strings with low-level strings. Don't.

STRINGs, the parrot structure and what S registers point to, are
single-encoding, single-character set entities. They're designed for
as fast access as feasible while maintaining the bare minimum of
language/set/encoding abstraction. Parts of the core *will* assume
they are concrete and, once transformed into a fixed-width format,
can be accessed directly while avoiding all the overhead of the
encoding (and even character set) functions.

PMCs are full-blown language level variables that can do whatever the
heck they want when accessed. Everything you do can be mediated by C,
Parrot, or perl/python/ruby/scheme/forth/BASIC/whatever code. Lazy
strings, multi-encoding strings, tree structures with string reps,
whatever.

Most, nearly all, high-level language level functionality should use
PMCs. If a STRING is insufficiently flexible for what you want,
that's a sign you shouldn't use it.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Dan Sugalski

Aug 24, 2003, 2:43:21 PM
to l...@toetsch.at, Benjamin Goldberg, perl6-i...@perl.org
At 10:01 AM +0200 8/22/03, Leopold Toetsch wrote:
>Benjamin Goldberg <ben.go...@hotpop.com> wrote:
>> Leopold Toetsch wrote:
>>> I have problems imagining such kinds of STRINGs.
>
>> You lack sufficient imagination -- Larry's suggested that Perl6 strings
>> may consist of a list of chunks. I can easily imagine each of those
>> "chunks" being full-fledged STRING* objects.
>
>Did Larry speak of PerlString or STRING?

Larry, bluntly, doesn't care. He wants it to work at the perl level,
and whatever magic we need to do to make it work is fine with him. If
we have to have two versions of the regex core, one that works on
PMCs and one that shortcuts to STRINGs, well, that's our issue not
his.

Dan Sugalski

Aug 24, 2003, 2:57:38 PM
to perl6-i...@perl.org
At 12:57 PM +0200 8/21/03, Peter Gibbs wrote:
>If the string API is to be revised, I would like to suggest that
>consideration be given to having a single string vtable, merging
>the current encoding and chartype structures into a single one.

I think this has been addressed, but in case it hasn't... while I'd
love to have a single unified scheme, there is an NxM problem here.
There are a relatively few types of bytestream<->character mappings,
which is what the encoding handles. 8, 16, and 32 bit integers, UTF8,
UTF16 (which is variable width), Big5, and shift-JIS spring to mind.
There are also a *lot* of different character sets for at least some
of those encodings. (There are many different 8-bit character sets, a
few variants of Big 5, and IIRC a couple of different ways to use the
UTF-8 and -16 encodings)

Splitting them up means less work for whoever's writing the character
set translations, as well as for whoever writes the encoding
translations. I can, for example, write a Big5->16-bit fixed width
transform, but I'd be hard pressed to define the different character
set elements for it. We can reasonably easily provide a set of
encodings as part of parrot, and leave the actual character set stuff
for most sets to third-party folks, which strikes me as the way to go.
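A sketch of the two-table split being described, with N + M tables instead of N*M fused ones. All names and fields here are invented for illustration; Parrot's real encoding/chartype structures carry many more entries:

```c
#include <assert.h>
#include <stddef.h>

/* One table for byte<->codepoint mapping: the encoding. */
typedef struct Encoding {
    const char *name;
    unsigned (*decode)(const unsigned char *p, size_t *advance);
} Encoding;

/* A separate table for character semantics: the character set. */
typedef struct Charset {
    const char *name;
    int (*is_whitespace)(unsigned codepoint);
} Charset;

/* One extra pointer per string, but swapping the encoding doesn't
   require a new mushed-together table for each character set. */
typedef struct String {
    const Encoding      *encoding;
    const Charset       *charset;
    const unsigned char *strstart;
} String;

static unsigned fixed8_decode(const unsigned char *p, size_t *advance) {
    *advance = 1;               /* fixed-width: one byte per character */
    return *p;
}

static int ascii_is_ws(unsigned cp) {
    return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r';
}

static const Encoding enc_fixed8 = { "fixed-8", fixed8_decode };
static const Charset  cs_ascii   = { "ascii",   ascii_is_ws   };
```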

Yeah, it does mean one more pointer per string, which isn't great,
but it means fewer tables--when a string changes encodings but not
character sets we don't have to have a whole new table that mushes
together the character set and encoding.

It may turn out that we have relatively few encoding/set variants, in
which case that decision can be revisited and we can go with an
implementation that presents two tables conceptually but a single
unified table for the implementation.

Benjamin Goldberg

Aug 24, 2003, 7:03:23 PM
to perl6-i...@perl.org

Dan Sugalski wrote:
>
> At 12:07 AM -0400 8/19/03, Benjamin Goldberg wrote:
> >There are a number of shortcomings in the API, which I'd like to address
> >here, and propose improvments for.
>
> You're conflating language level strings with low-level strings. Don't.
>
> STRINGs, the parrot structure and what S registers point to, are
> single-encoding, single-character set entities. They're designed for
> as fast access as feasable while maintaining the bare minimum of
> language/set/encoding abstraction. Parts of the core *will* assume
> they are concrete and, once transformed into a fixed-width format,
> can be accessed directly while avoiding all the overhead of the
> encoding (and even character set) functions.
>
> PMCs are full-blown language level variables that can do whatever the
> heck they want when accessed. Everything you do can be mediated by C,
> Parrot, or perl/python/ruby/scheme/forth/BASIC/whatever code. Lazy
> strings, multi-encoding strings, tree structures with strin reps,
> whiatever.
>
> Most, nearly all, high-level language level functionality should use
> PMCs. If a STRING is insufficiently flexible for what you want,
> that's a sign you shouldn't use it.

Ok, I'm convinced.

Benjamin Goldberg

Aug 24, 2003, 9:03:36 PM
to perl6-i...@perl.org
Gordon Henriksen wrote:
>
> Now, I don't really have much of an opinion on compound strings in
> general. I do want to address one particular argument, though—the lazily
> slurped file string.
>
> On Thursday, August 21, 2003, at 07:22 , Benjamin Goldberg wrote:
>
> > A foolish question: can you imagine strings which are lazily read from
> > a file?
> >
> > If so, could you imagine such a string, sitting in front of a really
> > really big file, bigger than could fit into memory?
>
> Having a lazily slurped file string simply delays disaster, and opens
> the door for Very Big Mistakes. Such strings would have to be treated
> very delicately, or the program would behave very inefficiently or
> crash.

Although Dan's convinced me that STRING*s don't need to be anything
other than concrete, wholly-in-memory, non-active buffers of data, (for
various and sundry reasons), I'm not sure why a lazily slurped file
string would need to be treated "delicately".

In particular, what would make the program crash?

(I'll assume that you're thinking the program might become inefficient
due to someone grabbing bytes out of arbitrary offsets, due to having to
seek() about here and there all the time. While it's true that this
might be inefficient, it's not much worse than skipping around inside of
a utf8 string.)

> (And let's be frank, a lazily concatenated STRING* is just a
> tie()d string value—I thought that was leaving the core.) There's power
> in such strings, no doubt. There's also TERROR of passing the string to
> anything lest your program explode because some CPAN module's author
> wasn't also TERRIFIED of your input being something not-just-a-string.
> If I'm going to have the potential to load the entire file into memory
> if I'm the least bit careless,

Why would you have the potential to load the entire file into memory if
you're careless?

> I'd prefer to be up front about it.
> Anti-action-at-a-distance. I don't need to be deluded that my code is
> efficient because it reads lazily. (Fact is, it's probably faster if it
> buffers the file all at once, if it's going to buffer it at all.
> Certainly more memory-efficient (!). Fewer chunks. Less overhead. But
> probably faster still to mmap() it.)

Well, I'll certainly agree that mmaping is almost certainly faster than
any other way of bringing whole files into strings.

Hmm, does parrot's memory system use mmap when really big chunks are
requested? IIRC, perl5's does.

> And what if your admittedly huge file is larger than 2**32 bytes? (A
> very real possibility! You said it was too big to fit in memory!) Are
> you going to suggest that all STRING* consumers on 32-bit platforms
> emulate 64-bit arithmetic whenever manipulating STRING* lengths?

Blech. Yeah, that *would* be annoying. OTOH, they're already emulating
64-bit arithmetic whenever they deal with file offsets. Or perhaps I
should be saying, "bad enough that they're already ... with file
offsets, we don't want to have to do it with string lengths, too."

> To efficiently process a Very Large String, you need to *stream* through
> it, not buffer it. Same applies to infinite strings (generators) or
> indeterminate strings (generators and sockets). Such strings don't have
> representable or knowable lengths. STRING*'s *really* *really* should
> reliably have lengths, I think.
>
> IMAGINE, if you will, something absolutely crazy:
>
> grammar HTTPServer {
> rule http {
> (<request> <commit>)*
> }
> rule request {
> <get_request> | <post_request> | ...
> }
> rule get_request {
> GET <path> <version> <crlf>
> <header>

[snip]

You should have a <commit> after that CRLF there :)

> If perl's using a stream rather than buffering to a STRING*, then
> $sock =~ /<HTTPServer::http>/ could actually work—and quite efficiently.
> [1]

> [1] Of course, this requires that the regex engine be coded to think in
> sequences.

Well, PerlString already supports the Iterator interface, and if the
regex engine were redesigned to *use* the Iterator interface, and if
streams were designed to support the Iterator interface, then passing a
stream instead of a string to the regex engine would be easy.

The way that I would envision a stream's support of the iterator
interface would be as follows:

PMC * get_integer_keyed(PMC *key) {
    return key_integer(INTERP, key);
}

PMC * nextkey_keyed(PMC *key, INTVAL what) {
    PMC * ret;
    switch( what ) {
    case ITERATE_FROM_START:
        key_set_integer(INTERP, key, DYNSELF->shift_integer());
        return key;
    case ITERATE_GET_NEXT:
        ret = key_next(INTERP, key);
        if( ret ) return ret;
        ret = key_new_integer(INTERP, DYNSELF->shift_integer());
        key_set_next( key, ret );
        return key;
    case ITERATE_GET_PREV:
    case ITERATE_FROM_END:
        internal_exception(???, "Unsupported");
    }
    internal_exception(???, "Illegal iterator request");
    return NULL;
}

> The regex engine could keep its own backtracking buffer, and
> trim that buffer at each commit.

s/buffer/stack/.

The regex engine already does. It currently pushes onto its stack
pointers into the string it's matching against -- obviously, for a
switch to iterators, it would push onto its stack Key.pmc objects.

ISTM that the regex compiler can (and probably should) produce code for
both arbitrary objects supporting iteration, and for strings. That way,
if/when a regex is applied to a PerlString (which is the normal case),
it will avoid the extra indirection produced by using the iterator
interface.

> How cool is that? Just imagine trying to apply the same pattern to a
> more long-lived protocol than HTTP, though—a database connection, maybe,
> or IRC.

Through a database connection? I can envision that for the purpose of
implementing the protocol, but if you mean, examining the high-level
output of a database... I don't see what you mean.

For IRC... that can be done.

> Or an HTTP client, which can download lots of data. Using chunky
> strings? perl, meet rlimit. rlimit, this is perl. [2] Using streams?
> Network programming becomes crazily easy.
>
> —
>
> Gordon Henriksen
> mali...@mac.com
>
>

> [2] No doubt, unshift hacks[3] could be found to make the lazy slurpy
> file string not crash. But these are just changes to make strings behave
> like streams, and would impose upon STRING* consumers everywhere Very
> Strange things like those strings which don't know their own length. A
> string wants to be a string, and a stream wants to be a stream.

I wasn't considering allowing lazily slurped file strings on anything
other than plain files (ones for which perl's "-f" operator returns
true).

Thus, I can't see how the string wouldn't know its own length.

> [3] Unshift hack #1: Where commit appears in the above, exit the
> grammar, trim the beginning of the string, and re-enter. (But that
> forces the grammar author to discard the regex state, whereas commit
> would offer no such restriction.) Unshift hack #2: Tell =~ that <commit>
> can trim the beginning of the string. (DWIM departs; /cgxism returns.)

Trimming off the beginning of the string is the job of the <cut>
operator, not the <commit> operator.

Hmm... I wonder how <cut> would be done with an iterator. Bleh.

However, <commit> can easily be done, just by removing the bottom of our
stack, up to the current point. And popping off an empty stack could
raise a catchable exception, indicating that we've attempted to
backtrack past a commit.

For strings, <commit> frees up part of the backtracking stack, but since
the things *on* the stack are just pointers into the string being
iterated upon, nothing more is freed.

For iterated objects, <commit> frees up part of the backtracking stack,
and since the things on the stack are Key.pmc objects, these also get
freed.
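That stack discipline can be sketched in C. This is an invented miniature (`BtStack`, `bt_push`, `bt_commit`, `bt_pop` are not Parrot's actual regex internals): <commit> raises a floor below which popping, i.e. backtracking, is refused, which is where the "match as a whole fails" exception would be raised:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backtracking stack under <commit>. */
typedef struct {
    size_t saved[64];   /* saved match positions (or Key PMCs)   */
    size_t top;         /* next free slot                         */
    size_t floor;       /* entries below this have been committed */
} BtStack;

static void bt_push(BtStack *s, size_t pos) { s->saved[s->top++] = pos; }

/* <commit>: throw away everything saved so far. */
static void bt_commit(BtStack *s) { s->floor = s->top; }

/* Backtrack one step; returns 0 when we'd cross the commit point,
   meaning the entire match fails. */
static int bt_pop(BtStack *s, size_t *pos) {
    if (s->top == s->floor)
        return 0;
    *pos = s->saved[--s->top];
    return 1;
}
```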

Gordon Henriksen

Aug 24, 2003, 9:57:45 PM
to Benjamin Goldberg, perl6-i...@perl.org
Benjamin Goldberg wrote:

> Gordon Henriksen wrote:
>
> > Having a lazily slurped file string simply delays disaster, and
> > opens the door for Very Big Mistakes. Such strings would have to be
> > treated very delicately, or the program would behave very
> > inefficiently or crash.
>
> Although Dan's convinced me that STRING*s don't need to be anything
> other than concrete, wholly-in-memory, non-active buffers of data,
> (for various and sundry reasons), I'm not sure why a lazily slurped
> file string would need to be treated "delicately".
>
> In particular, what would make the program crash?

s/crash/use HUGE GOBS OF MEMORY and exhaust the system's swapfile/g.

> Why would you have the potential to load the entire file into
> memory if you're careless?

Mutations would remain in memory, right? uc() such a string and watch
your swapfile fill right up. Or s///g. Or just in general change it.

And character indexing a file too big to fit in memory, when char
indexing is an O(n) problem for significant cases (UTF-8)...? Very
Bad things...

Or were you thinking that changes would be written back? In which
case... each string mod would have to rewrite the (huge, remember) file
from that point forward. Way to render an API useless.

I have no doubt that p6 will have file-tied strings which will address
many of these problems--they're just very complex and don't belong
inside STRING*.


> > And what if your admittedly huge file is larger than 2**32 bytes? (A
> > very real possibility! You said it was too big to fit in memory!)
> > Are you going to suggest that all STRING* consumers on 32-bit
> > platforms emulate 64-bit arithmetic whenever manipulating STRING*
> > lengths?
>
> Blech. Yeah, that *would* be annoying. OTOH, they're already
> emulating 64-bit arithmetic whenever they deal with file offsets. Or
> perhaps I should be saying, "bad enough that they're already ... with
> file offsets, we don't want to have to do it with string lengths,
> too."

I've got my money on option #2.


> > grammar HTTPServer {
> >     rule http {
> >         (<request> <commit>)*
> >     }
> >     rule request {
> >         <get_request> | <post_request> | ...
> >     }
> >     rule get_request {
> >         GET <path> <version> <crlf>
> >         <header>
> [snip]
>
> You should have a <commit> after that CRLF there :)

Yeah, well, one could go after GET, too, and after <path>, and after
<version>, and every other non-optional protocol element. It gets noisy
after a while.


> > How cool is that? Just imagine trying to apply the same pattern to a
> > more long-lived protocol than HTTP, though-a database connection,
> > maybe, or IRC.
>
> Through a database connection? I can envision that for the purpose of

> implementing the protocol, [...]

I did indeed mean implementing the database protocol. Though, not
thRough.


> > [2] No doubt, unshift hacks[3] could be found to make the lazy
> > slurpy file string not crash. But these are just changes to make
> > strings behave like streams, and would impose upon STRING*
> > consumers everywhere Very Strange things like those strings which
> > don't know their own length. A string wants to be a string, and a
> > stream wants to be a stream.
>
> I wasn't considering allowing lazily slurped file strings on anything
> other than plain files (ones for which perl's "-f" operator returns
> true).
>
> Thus, I can't see how the string wouldn't know it's own length.

Fine, in theory--but with UTF-8 and other variable-length encodings, the
string would need to open and scan THE ENTIRE FILE at the time it was
tied in order to know its length in characters. Ouch.


> > [3] Unshift hack #1: Where commit appears in the above, exit the
> > grammar, trim the beginning of the string, and re-enter. (But that
> > forces the grammar author to discard the regex state, whereas commit
> > would offer no such restriction.) Unshift hack #2: Tell =~ that
> > <commit> can trim the beginning of the string. (DWIM departs;
> > /cgxism returns.)
>
> Trimming off the beginning of the string is the job of the <cut>
> operator, not the <commit> operator.

Indeed, my bad--been a while since I read the apocalypse.

> Hmm... I wonder how <cut> would be done with an iterator. Bleh.

Equivalent to <commit>, I say.... Then your grammar rule can work on an
iterator, or on a string that's being used as a buffer.

Here's a question: How does $iter =~ /a+b/ work on an iterator which
returns "aaaaaaack!"? Requires a putback op.
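The putback Gordon is asking about can be sketched like this: matching /a+b/ necessarily reads one character past the last 'a', and if that character isn't 'b' it must be pushed back so the rest of the stream is intact. Everything here (`Stream`, `stream_get`, `stream_putback`, `match_aplusb`) is an invented miniature, not an actual engine:

```c
#include <assert.h>
#include <stddef.h>

/* A stream with a small putback buffer. */
typedef struct {
    const char *src;    /* stands in for the underlying source   */
    size_t next;
    char   putback[8];
    size_t n_putback;
} Stream;

/* Get the next character; -1 at end of input. */
static int stream_get(Stream *s) {
    if (s->n_putback) return s->putback[--s->n_putback];
    return s->src[s->next] ? s->src[s->next++] : -1;
}

/* Push a character back so the next get returns it. */
static void stream_putback(Stream *s, char c) {
    s->putback[s->n_putback++] = c;
}

/* Match /a+b/ at the front of the stream, putting back the one
   overread character on failure. */
static int match_aplusb(Stream *s) {
    int c, n = 0;
    while ((c = stream_get(s)) == 'a') n++;
    if (n > 0 && c == 'b') return 1;
    if (c != -1) stream_putback(s, (char)c);
    return 0;
}
```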

I'm not sure about <cut> vs. <commit>. They seem so orthogonal, and they
pervasively tie a grammar to an implementation choice. It seems more
like an m:option.

--

Gordon Henriksen
IT Manager
ICLUBcentral Inc.
gor...@iclub.com


Benjamin Goldberg

Aug 25, 2003, 12:18:07 PM
to perl6-i...@perl.org
Gordon Henriksen wrote:
>
> Benjamin Goldberg wrote:
>
> > Gordon Henriksen wrote:
[snip]

> > > [3] Unshift hack #1: Where commit appears in the above, exit the
> > > grammar, trim the beginning of the string, and re-enter. (But that
> > > forces the grammar author to discard the regex state, whereas commit
> > > would offer no such restriction.) Unshift hack #2: Tell =~ that
> > > <commit> can trim the beginning of the string. (DWIM departs;
> > > /cgxism returns.)
> >
> > Trimming off the beginning of the string is the job of the <cut>
> > operator, not the <commit> operator.
>
> Indeed, my bad--been a while since I read the apocalypse.
>
> > Hmm... I wonder how <cut> would be done with an iterator. Bleh.
>
> Equivalent to <commit>, I say....

Not necessarily. I mean, consider if someone does:

my @ary := unpack "U*", $string;

Then, doing:

@ary =~ /regex/

And:

$string =~ /regex/

*Should* be equivalent, ne?

And a <cut> on the string should trim the front of the string, and a
<cut> on the array should splice off the front of the array.

Maybe we should provide some vtable entries for altering a pmc through
an iterator, so that that <cut> could say:

VTABLE_trim_front_keyed(INTERP, pmc, key);

> Then your grammar rule can work on an
> iterator, or on a string that's being used as a buffer.
>
> Here's a question: How does $iter =~ /a+b/ work on an iterator which
> returns "aaaaaaack!"? Requires a putback op.

Have you read how perl6's regex engine works?

> I'm not sure about <cut> vs. <commit>. They seem so orthogonal, and they
> pervasively tie a grammar to an implementation choice. It seems more
> like an m:option.

They are orthogonal: A <commit> says, throw away the backtracking stack
before this point, because if we *try* to backtrack before here, then
the entire match fails. A <cut> says, throw away the backtracking
stack, *AND* throw away the front of the string/aggregate we're matching
against.

However, I'm not so sure that they tie a grammar to a choice: if we
provide an interface to trim off the "front" of an aggregate using an
iterator (and we can already trim off the front of a string, using
string_replace), then we have both <commit> and <cut> for both strings
and aggregates.

Of course, it's quite possible that some pmc classes won't *provide* a
method for trimming itself, in which case <cut> will throw an exception.
