The .bytes/.codepoints/.graphemes methods

Brent 'Dax' Royal-Gordon

Jun 26, 2004, 3:27:38 PM

to perl6-l...@perl.org

As currently designed, the String::bytes, String::codepoints, and
String::graphemes methods return the number of bytes, codepoints, and
graphemes, respectively, in the string they were called on. I would
like to suggest that, when called in list context, these methods return
an array of strings split by bytes, codepoints, and graphemes, respectively.

This would make it unambiguous whether certain string operations
referred to bytes, codepoints, or graphemes:

$str.bytes[0].ord
$str.codepoints[0..4].join #substr

As well as allowing some operations that are currently much more difficult:

$str.bytes[3].ord
$str.graphemes[144].lc

Issues:
* Limits lvalue substr (doesn't allow it to be a different size)
unless splice is used (or a substr method is also provided).
* Memory consumption.
* A bit odd-looking.

Benefits:
* Removes ambiguity in an area that needs said ambiguity removed.
* Allows us to reuse constructs (e.g. slicing).
* Opens up a few previously-difficult constructs (like getting the
ord() of an arbitrary character).
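
As a rough illustration of the three views, here is a minimal sketch in Python rather than Perl 6, since the syntax above is still only a proposal. The helper names (byte_units, codepoint_units, grapheme_units) are hypothetical, UTF-8 is assumed for the byte view, and the grapheme splitter is only an approximation that attaches combining marks to the preceding codepoint; it is not full Unicode grapheme segmentation.

```python
import unicodedata

def byte_units(s, encoding="utf-8"):
    """Split a string into the single bytes of its encoded form."""
    return [bytes([b]) for b in s.encode(encoding)]

def codepoint_units(s):
    """Split a string into one-codepoint strings."""
    return list(s)

def grapheme_units(s):
    """Approximate graphemes: a base codepoint plus the combining marks after it."""
    units = []
    for ch in s:
        if units and unicodedata.combining(ch):
            units[-1] += ch          # attach the mark to the previous unit
        else:
            units.append(ch)         # start a new unit
    return units

s = "re\u0301sume\u0301"             # "résumé" written with combining accents

# The "scalar" reading: how many units of each kind?
print(len(byte_units(s)), len(codepoint_units(s)), len(grapheme_units(s)))  # 10 8 6

# The "list" reading: index and slice the units.
print(ord(codepoint_units(s)[2]))    # 769, i.e. U+0301
print("".join(grapheme_units(s)[0:2]))
```

The same string has three different lengths depending on the unit, which is the ambiguity the proposal wants to make explicit.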

--
Brent "Dax" Royal-Gordon <br...@brentdax.com>
Perl and Parrot hacker

Oceania has always been at war with Eastasia.

Larry Wall

Jun 26, 2004, 4:20:59 PM

to perl6-l...@perl.org

On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote:
: As currently designed, the String::bytes, String::codepoints, and
: String::graphemes methods return the number of bytes, codepoints,
: and graphemes, respectively, in the string they were called on. I
: would like to suggest that, when called in list context, these
: methods return an array of strings split by bytes, codepoints, and
: graphemes, respectively.
:
: This would make it unambiguous whether certain string operations
: referred to bytes, codepoints, or graphemes:
:
: $str.bytes[0].ord
: $str.codepoints[0..4].join #substr
:
: As well as allowing some operations that are currently much more
: difficult:
:
: $str.bytes[3].ord
: $str.graphemes[144].lc
:
: Issues:
: * Limits lvalue substr (doesn't allow it to be a different size)
: unless splice is used (or a substr method is also provided).

That all has to be looked at anyway. What does "5" mean when you
pass it to substr, anyway? (I've been trying to make it assume some
implicit unit based on the current lexical scope's Unicode level,
but issues remain.) We have magical string positions that have
different numeric values depending on what units you view them as,
but at what point does a number like "5" get translated to such
a magical string position?

: * Memory consumption.

Not necessarily, if the method merely returns a "view" of the string
without actually doing the split.

: * A bit odd-looking.

I dunno--it reads pretty well. Maybe these'll be heavily enough
used that we should Huffmanize them down a bit:

$str.bytes
$str.codes
$str.graphs
$str.letters

Though "letters" is a bit inadequate to describe language-dependent
graphemes, since it also divides any non-letters...I suppose we
could go with .characters if we don't mind forcing a heavily
overloaded word in one particular direction, culturally speaking.
Except, I'd kinda like to keep them starting with different letters.
(And maybe .chars should be reserved to mean whatever the default
unit is in the current lexical scope, as with substr() above.)

: Benefits:
: * Removes ambiguity in an area that needs said ambiguity removed.
: * Allows us to reuse constructs (e.g. slicing).
: * Opens up a few previously-difficult constructs (like getting the
: ord() of an arbitrary character).

I'd also point out that the scalar definitions fall out of it
naturally.

One other downside is that you might have to insert + in various
places to get the numeric interpretation. But that could be
construed as self-dedocumentation.
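
The "view" idea and the numeric reading can be sketched together. This is a minimal Python illustration, not Perl 6: the class name CodepointView is hypothetical, and it covers only codepoints. The object keeps a reference to the string and splits nothing up front; asking for its length plays the role of the numeric interpretation, and indexing plays the role of the list interpretation.

```python
from collections.abc import Sequence

class CodepointView(Sequence):
    """Read-only view over a string's codepoints; nothing is split in advance."""

    def __init__(self, text):
        self._text = text            # keep a reference, do not copy or split

    def __len__(self):
        return len(self._text)       # the count "falls out" of the view

    def __getitem__(self, index):
        # Single indexes and slices both return a plain str.
        return self._text[index]

view = CodepointView("Hello, World!")
print(len(view))                     # 13
print(view[0:5])                     # Hello
```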

Larry

Jonadab The Unsightly One

Jun 28, 2004, 11:26:32 AM

to perl6-l...@perl.org

Larry Wall <la...@wall.org> writes:

> That all has to be looked at anyway. What does "5" mean when you
> pass it to substr, anyway?

I was just going to ask about substrings, and then didn't because I
figured that had been hashed out already and I'd missed it...

> (I've been trying to make it assume some implicit unit based on the
> current lexical scope's Unicode level, but issues remain.) We have
> magical string positions that have different numeric values
> depending on what units you view them as, but at what point does a
> number like "5" get translated to such a magical string position?

It would be possible to have right-associative operators (that bind at
least more tightly than comma and possibly very tightly) and convert a
number to one of these objects, so that we can do stuff like this:

substr($string, 2 bytes, 4 bytes) = $substitute;

Then if you pass a plain number to substr it could either assume
something (possibly generating a warning) or spit an error, depending
on some feature of the current lexical scope.
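
A minimal sketch of unit-tagged positions, in Python rather than Perl 6. Everything here is an assumption made for illustration: the Pos class, the default_unit parameter standing in for "some feature of the current lexical scope", UTF-8 for the byte view, and the choice to reject mixed units outright.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pos:
    """A string position tagged with the unit it is measured in."""
    value: int
    unit: str                        # "bytes" or "codepoints" in this sketch

def substr(text, start, length, default_unit=None):
    """Return a substring; plain ints use default_unit, or raise if there is none."""
    def tagged(n):
        if isinstance(n, Pos):
            return n
        if default_unit is None:
            raise TypeError("plain number given and no default unit in scope")
        return Pos(n, default_unit)

    start, length = tagged(start), tagged(length)
    if start.unit != length.unit:
        raise ValueError("mixed units are not supported in this sketch")
    if start.unit == "bytes":
        raw = text.encode("utf-8")[start.value:start.value + length.value]
        return raw.decode("utf-8")   # raises if the cut splits a codepoint
    return text[start.value:start.value + length.value]

print(substr("Søren!", Pos(1, "bytes"), Pos(2, "bytes")))   # ø
print(substr("Søren!", 1, 2, default_unit="codepoints"))    # ør
```

The same numbers select different text depending on the unit, which is why a bare "5" needs a unit from somewhere.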

The word "bytes" is clearly much too long, though, much less
"graphemes" or "codepoints". I thought about this:

substr($string, 2b, 4b) = $substitute;

With presumably g and c for graphemes and codepoints, but I rather
suspect that might conflict with some other existing syntax (though I
can't think of anything in particular).

And I can't think of another abbreviation that would be remotely
intuitive.

There's also the possibility of bsubstr and so on, but that leads us
down the path of C, having a hillion bajillion functions with names
like fgets, stoi, and fstrnclost. Having sprintf is quite enough of
that, IMO.

> I dunno--it reads pretty well. Maybe these'll be heavily enough
> used that we should Huffmanize them down a bit:
>
> $str.bytes
> $str.codes
> $str.graphs
> $str.letters

codes and graphs is better than codepoints and graphemes, at least.

> Though "letters" is a bit inadequate to describe language-dependent
> graphemes, since it also divides any non-letters...I suppose we
> could go with .characters if we don't mind forcing a heavily
> overloaded word in one particular direction, culturally speaking.
> Except, I'd kinda like to keep them starting with different letters.
> (And maybe .chars should be reserved to mean whatever the default
> unit is in the current lexical scope, as with substr() above.)

You could coin the abbreviation ligs, for Language Independent
Graphemes. Then some ingenious rascal can create a pragma or whatever
that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/

Larry Wall

Jun 28, 2004, 12:52:43 PM

to perl6-l...@perl.org

On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote:
: You could coin the abbreviation ligs, for Language Independent
: Graphemes. Then some ingenious rascal can create a pragma or whatever
: that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

Except they'd have to be "ldgs". Graphemes are ligs in current parlance.

Larry

Dave Whipp

Jun 28, 2004, 12:55:00 PM

to perl6-l...@perl.org

"Jonadab The Unsightly One" <jon...@bright.net> wrote in message
news:8ye7r9...@jonadab.homeip.net...

> It would be possible to have right-associative operators (that bind at
> least more tightly than comma and possibly very tightly) and convert a
> number to one of these objects, so that we can do stuff like this:
>
> substr($string, 2 bytes, 4 bytes) = $substitute;

I think that the common case will use the same units for both the index and
the length. So perhaps:

substr($string, 2, 4 :bytes)

would be more appropriate. Also, by only requiring us to write the unit
once, the need for ultra-short abbreviations is reduced.

Dave.

Dan Sugalski

Jun 28, 2004, 12:54:46 PM

to perl6-l...@perl.org

And 'ligs' implies ligatures. And since that'd require font, style, and
possibly layout information, I think we'd rather not go there right now...

Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Juerd

Jun 28, 2004, 1:51:10 PM

to Dave Whipp, perl6-l...@perl.org

Dave Whipp skribis 2004-06-28 9:55 (-0700):

> > substr($string, 2 bytes, 4 bytes) = $substitute;

> substr($string, 2, 4 :bytes)

substr($string, 2 but graphemes, 4 but bytes);

I think "but" even makes sense, if substr defaults to something.

Juerd

Dan Sugalski

Jun 28, 2004, 1:52:28 PM

to Juerd, Dave Whipp, perl6-l...@perl.org

I think mixing strings, bytes, graphemes, and code points together is a
phenomenally bad idea, likely to lead to many tears, much gnashing of
teeth, and quite a few rampages with sharp objects, not to mention a lot
of code guaranteed to fail at the edge cases.

If, as a programmer, you *really* want to run with scissors then convert
your string to a binary byte buffer and go from there. At least then when
you poke out an eye you won't be nearly so surprised.

Austin Hastings

Jun 28, 2004, 2:27:34 PM

to Dan Sugalski, perl6-l...@perl.org

--- Dan Sugalski <d...@sidhe.org> wrote:
> On Mon, 28 Jun 2004, Juerd wrote:
>
> > Dave Whipp skribis 2004-06-28 9:55 (-0700):
> > > > substr($string, 2 bytes, 4 bytes) = $substitute;
> > > substr($string, 2, 4 :bytes)
> >
> > substr($string, 2 but graphemes, 4 but bytes);
> >
> > I think "but" even makes sense, if substr defaults to something.
>
> I think mixing strings, bytes, graphemes, and code points together
> is a phenomenally bad idea, likely to lead to many tears, much
> gnashing of teeth, and quite a few rampages with sharp objects,
> not to mention a lot of code guaranteed to fail at the edge cases.

Hmm. Suppose that I have a system that is friendly to 80 byte records.
I want to output "meaningful" strings, so I want to partition a buffer
into 80-ish byte substrings, but preserve any graphemes (i.e., store
the data in a legible format).

How would I do that?

The obvious answer is a gnarly little loop, but I think I'd like to
have perl do that for me. Can I say something like:

while ($buffer)
{
    $output = substr($buffer, 0, 80 but bytes, units => graphemes);
    # Keep only what follows the part just emitted.
    $buffer = substr($buffer, length $output :graphemes);

    $cout << $output << nl; # :-)
}

and get some dwimmery?

=Austin

Dan Sugalski

Jun 28, 2004, 2:36:24 PM

to Austin Hastings, perl6-l...@perl.org

On Mon, 28 Jun 2004, Austin Hastings wrote:

> --- Dan Sugalski <d...@sidhe.org> wrote:
> > On Mon, 28 Jun 2004, Juerd wrote:
> >
> > > Dave Whipp skribis 2004-06-28 9:55 (-0700):
> > > > > substr($string, 2 bytes, 4 bytes) = $substitute;
> > > > substr($string, 2, 4 :bytes)
> > >
> > > substr($string, 2 but graphemes, 4 but bytes);
> > >
> > > I think "but" even makes sense, if substr defaults to something.
> >
> > I think mixing strings, bytes, graphemes, and code points together
> > is a phenomenally bad idea, likely to lead to many tears, much
> > gnashing of teeth, and quite a few rampages with sharp objects,
> > not to mention a lot of code guaranteed to fail at the edge cases.
>
> Hmm. Suppose that I have a system that is friendly to 80 byte records.
> I want to output "meaningful" strings, so I want to partition a buffer
> into 80-ish byte substrings, but preserve any graphemes (i.e., store
> the data in a legible format).
>
> How would I do that?

You don't. Or if you do, you do it with a lot of pain, sweat, and annoying
hard work. 80 bytes gets you somewhere between three (And this may be a
*high* estimate--there may be circumstances where 80 bytes is
insufficient for *one* grapheme) and 80 graphemes.

This isn't something that can be made generically easy.
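
To make the byte arithmetic concrete, here is a small Python sketch. It assumes UTF-8 and uses a deliberately contrived cluster: one base letter carrying forty combining accents. The point is only that a single grapheme can exceed 80 bytes, while 80 ASCII graphemes fit exactly.

```python
base = "a"
marks = "\u0301" * 40                    # forty COMBINING ACUTE ACCENT codepoints
cluster = base + marks                   # one base letter plus its marks

ascii_line = "x" * 80                    # eighty one-byte graphemes

print(len(cluster.encode("utf-8")))      # 81: one grapheme, too big for the record
print(len(ascii_line.encode("utf-8")))   # 80: eighty graphemes, fits exactly
```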

Austin Hastings

Jun 28, 2004, 11:54:40 AM

to Jonadab the Unsightly One, perl6-l...@perl.org

--- Jonadab the Unsightly One <jon...@bright.net> wrote:

> Larry Wall <la...@wall.org> writes:
>
> > (I've been trying to make it assume some implicit unit based on the
> > current lexical scope's Unicode level, but issues remain.) We have
> > magical string positions that have different numeric values
> > depending on what units you view them as, but at what point does a
> > number like "5" get translated to such a magical string position?
>
> It would be possible to have right-associative operators (that bind
> at least more tightly than comma and possibly very tightly) and
> convert a number to one of these objects, so that we can do stuff
> like this:
>
> substr($string, 2 bytes, 4 bytes) = $substitute;
>
> Then if you pass a plain number to substr it could either assume
> something (possibly generating a warning) or spit an error, depending
> on some feature of the current lexical scope.

A couple of alternatives:

substr.bytes($string, 2, 4) = $substitute;

substr($string.bytes, 2, 4) = $substitute;

# Make it a pragma
use String(bytes);
substr($string, 2, 4) = $substitute;

# Make it a global mode
set_string_mode(bytes);
substr($string, 2, 4) = $substitute;

# Make it an object mode
$string.access_mode(bytes);
substr($string, 2, 4) = $substitute;

> The word "bytes" is clearly much too long, though, much less
> "graphemes" or "codepoints". I thought about this:
>
> substr($string, 2b, 4b) = $substitute;

Problems with:

substr($string, 0b, 1b) = $substitute;

Is that binary or bytes? Also:

substr($string, $start b, $end b) = $substitute;

Looks unintuitive.

> With presumably g and c for graphemes and codepoints, but I rather
> suspect that might conflict with some other existing syntax (though I
> can't think of anything in particular).

0c? 0x16c ?

> And I can't think of another abbreviation that would be remotely
> intuitive.
>
> There's also the possibility of bsubstr and so on, but that leads us
> down the path of C, having a hillion bajillion functions with names
> like fgets, stoi, and fstrnclost. Having sprintf is quite enough of
> that, IMO.
>
> > I dunno--it reads pretty well. Maybe these'll be heavily enough
> > used that we should Huffmanize them down a bit:
> >
> > $str.bytes
> > $str.codes
> > $str.graphs
> > $str.letters
>
> codes and graphs is better than codepoints and graphemes, at least.

In certain (IMO large) sectors of the Perl community, string processing
is just about all the work there is. I submit that there needs to be a
way to drive the token length to 0: either a pragma, or a global mode,
or a type definition.

>
> > Though "letters" is a bit inadequate to describe language-dependent
> > graphemes, since it also divides any non-letters...I suppose we
> > could go with .characters if we don't mind forcing a heavily
> > overloaded word in one particular direction, culturally speaking.
> > Except, I'd kinda like to keep them starting with different
> > letters.
> > (And maybe .chars should be reserved to mean whatever the default
> > unit is in the current lexical scope, as with substr() above.)
>
> You could coin the abbreviation ligs, for Language Independent
> Graphemes. Then some ingenious rascal can create a pragma or
> whatever that allows $str.b, $str.c, $str.g, and $str.l for
> fans of terseness.

As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

To me, the right thing to do is provide a 'default' way to work, and
allow for changing that default to some other way. The obvious defaults
are 'bytes', which gives C-like behavior (unpopular though that may
presently be) and imposes little or no conceptual strain but likewise
no enormous benefit, and 'graphemes'.

I like graphemes for the default because I hate and fear graphemes. The
whole *code thing just crawls right in my ear, so having the language
transparently support it would be a win. Having the language force me
to understand this stuff, if it cannot be transparently supported,
would also be a win, on a longer time scale.

=Austin

Jonadab The Unsightly One

Jun 29, 2004, 10:17:38 AM

to Dan Sugalski, Austin Hastings, perl6-l...@perl.org

Dan Sugalski <d...@sidhe.org> writes:

>> Hmm. Suppose that I have a system that is friendly to 80 byte
>> records. I want to output "meaningful" strings, so I want to
>> partition a buffer into 80-ish byte substrings, but preserve any
>> graphemes (i.e., store the data in a legible format).
>>
>> How would I do that?
>
> You don't. Or if you do, you do it with a lot of pain, sweat, and
> annoying hard work. 80 bytes gets you somewhere between three (And
> this may be a *high* estimate--there may be circumstances where 80
> bytes is insufficient for *one* grapheme) and 80 graphemes.
>
> This isn't something that can be made generically easy.

It's no worse than implementing word wrap. Someone will of course
implement it as a generic routine, something along the lines of

my @line = breakunicodestringintobytebufferchunks(
string => $string,
chunksize => 80,
keeptogether => 'graphemes',
extremelongparts => 'split',
# 'split' will try to split it at a mostly-reasonable
# place if possible, similar to word wrap that looks
# for syllable boundaries.
# 'truncate' would do the same but drop the second part,
# rather than putting it in the next line.
# 'skip' would drop the whole grapheme out.
# 'allow' would create a line longer (in bytes) than
# the chunksize, which is what a lot of word wrap
# algorithms do, but would not work if you really
# have to fit in a fixed-byte-size buffer. It would
# of course put the thing on a line by itself though,
# to minimize the overflow.
);

There are reasons for doing this, e.g. if you've got Unicode text to
send via a network protocol with an octet-oriented RFC, or if you're
interacting with some legacy C code that has fixed-size buffers.
Someone will write the routine to do as well as can be expected, and
it'll be put on the CPAN, and people who need this sort of thing will
use it.

I don't think the language needs to be designed around it though.
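
A minimal sketch of such a routine, in Python rather than Perl. The function names are hypothetical, UTF-8 is assumed, graphemes are approximated as a base codepoint plus trailing combining marks (not full Unicode segmentation), and only the 'allow' policy is implemented: a grapheme larger than the chunk size gets a chunk to itself.

```python
import unicodedata

def grapheme_units(s):
    """Approximate graphemes: a base codepoint plus the combining marks after it."""
    units = []
    for ch in s:
        if units and unicodedata.combining(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def chunk_by_bytes(text, chunksize=80, encoding="utf-8"):
    """Greedily pack whole graphemes into chunks of at most chunksize bytes.

    An oversized grapheme is placed alone in its own chunk ('allow' policy),
    so that chunk may exceed chunksize.
    """
    chunks, current, used = [], [], 0
    for unit in grapheme_units(text):
        size = len(unit.encode(encoding))
        if current and used + size > chunksize:
            chunks.append("".join(current))   # flush before overflowing
            current, used = [], 0
        current.append(unit)
        used += size
    if current:
        chunks.append("".join(current))
    return chunks

text = "e\u0301" * 30                    # thirty 3-byte graphemes, 90 bytes
lines = chunk_by_bytes(text, chunksize=80)
print([len(line.encode("utf-8")) for line in lines])   # [78, 12]
```

The first chunk stops at 78 bytes rather than 80 because the next grapheme would not fit whole.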

Jonadab The Unsightly One

Jun 29, 2004, 10:37:03 AM

to Austin_...@yahoo.com, Jonadab the Unsightly One, perl6-l...@perl.org

Austin Hastings <austin_...@yahoo.com> writes:

> A couple of alternatives:
>
> substr.bytes($string, 2, 4) = $substitute;

Well, that's arguably better than bsubstr.

> substr($string.bytes, 2, 4) = $substitute;

I could live with that, although it doesn't allow mixing units.
(Someone will pop in here and say that's to be construed as a
feature.)

> # Make it a pragma
> use String(bytes);
> substr($string, 2, 4) = $substitute;

I think a pragma should set the default unit for the current lexical
scope, at least. (The default, in the absence of the pragma, is an
open question; at worst the default could be to throw an exception if
units aren't specified; personally I think throwing exceptions
willy-nilly is unPerlish.)

> # Make it a global mode
> set_string_mode(bytes);
> substr($string, 2, 4) = $substitute;

I don't like this. It's no more useful than the pragma but has bigger
caveats.

> # Make it an object mode
> $string.access_mode(bytes);
> substr($string, 2, 4) = $substitute;

Wouldn't this add extra operations all over the place?

>> The word "bytes" is clearly much too long, though, much less
>> "graphemes" or "codepoints". I thought about this:
>>
>> substr($string, 2b, 4b) = $substitute;
>
> Problems with:
>
> substr($string, 0b, 1b) = $substitute;
>
> Is that binary or bytes? Also:

I figured it would conflict with something.

> substr($string, $start b, $end b) = $substitute;
>
> Looks unintuitive.

*shrug*. I chose it because I thought the other way around looked
unintuitive:
substr($string, b $start, b $end) = $substitute;

That looks like calling a function -- which *is* what's going on,
under the hood, but the other way around looks like tagging on units,
which seems more natural to me.

>> With presumably g and c for graphemes and codepoints, but I rather
>> suspect that might conflict with some other existing syntax (though I
>> can't think of anything in particular).
>
> 0c? 0x16c ?

Ick, yes, I missed that. (I was thinking only of numbers specified in
decimal.) I knew there'd be something.

>> codes and graphs is better than codepoints and graphemes, at least.
>
> In certain (IMO large) sectors of the Perl community, string
> processing is just about all the work there is. I submit that there
> needs to be a way to drive the token length to 0: either a pragma,
> or a global mode, or a type definition.

A pragma should set the default, IMO. I think what we're talking
about here is what the syntax would be for using a unit other than the
default, or for specifying the units if you haven't used the pragma to
set the default.

>> You could coin the abbreviation ligs, for Language Independent
>> Graphemes. Then some ingenious rascal can create a pragma or
>> whatever that allows $str.b, $str.c, $str.g, and $str.l for
>> fans of terseness.
>
> As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

I thought about that, but figured it wasn't a big deal; there are
*lots* of abbreviations with more than one possible interpretation,
and you just deal with having to know which one is meant. However, it
was then pointed out that it would actually be ldgs, which IMO is
unpronounceable and ugly. So something else is needed for those.

*shrug*. Make up a word. Call them woohickies for all I care and
abbreviate it woo or just w.

> I like graphemes for the default because I hate and fear
> graphemes. The whole *code thing just crawls right in my ear, so
> having the language transparently support it would be a win.

I can see the logic in that. Personally I don't care what the default
is. Almost none of my code will need to care one way or the other,
and that which does can use the pragma.

Have the implications of the bytes/codepoints/graphemes/woohickies
distinction for the regular expression engine been discussed already?

Austin Hastings

Jun 29, 2004, 11:34:16 AM

to Jonadab the Unsightly One, Austin_...@yahoo.com, Jonadab the Unsightly One, perl6-l...@perl.org

--- Jonadab the Unsightly One <jon...@bright.net> wrote:
>

> Have the implications of the bytes/codepoints/graphemes/woohickies
> distinction for the regular expression engine been discussed already?

Not enough.

One of my current clients just rolled on to redhat 9, and what a
steaming pile of digestive byproducts *that* turned out to be.
Apparently the default locale setting changed, so now LC_ALL="" out of
the box.

One effect of this is irritating lack of proper behavior in the
utilities. But when you switch to LC_ALL= <pick your favorite
language>, you just get really slow performance: Apparently the 'C'
locale is such a totally special case that the performance of LC_ALL=C
is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
when the data is 7bit ascii.

I think that (1) this is unacceptable: the temptation to switch to the
'C' locale has been too great, both at this site and on a lot of the RH
support forums; (2) Perl6 should equitably support all its target
locales; (3) we should set out to make sure the performance is damn
fast no matter what locale we're using.

This has no direct bearing on p6l, since performance is a p6i issue.
But perhaps in the interests of performance as well as hackery we
should explicitly provide some sort of variant regex behavior:

/a./ :bytes
/a./ :graphemes

where the first would recognize 0x61 followed by any single byte, while
the second would recognize 'a' followed by any number of bytes
composing a single grapheme.

(I'll claim that it's legitimate to want to search for, say, any MBCs
introduced via \x0F\x01, regardless of length. This is likely not
supported any other way.)

=Austin

Jonadab The Unsightly One

Jun 29, 2004, 11:54:10 AM

to perl6-l...@perl.org

Juerd <ju...@convolution.nl> writes:

> substr($string, 2 but graphemes, 4 but bytes);
>
> I think "but" even makes sense, if substr defaults to something.

That could be combined with a smart substr that only needs the units
once (err, only needs a position object for one of the args) and knows
how to convert the other number to the same units (err, same type of
position object):

substr($string, 2, 4 but bytes);

This would still allow for specifying units on both if you for some
reason wanted them different (which, as Dan S points out, sounds like
a bad idea, on the face of it).

:bytes is shorter than but bytes, though.

Jonathan Scott Duff

Jun 29, 2004, 11:52:34 AM

to Austin Hastings, Jonadab the Unsightly One, perl6-l...@perl.org

On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
> This has no direct bearing on p6l, since performance is a p6i issue.
> But perhaps in the interests of performance as well as hackery we
> should explicitly provide some sort of variant regex behavior:
>
> /a./ :bytes
> /a./ :graphemes
>
> where the first would recognize 0x61 followed by any single byte, while
> the second would recognize 'a' followed by any number of bytes
> composing a single grapheme.

Isn't that what :u0, :u1, :u2, and :u3 are for?

:u0 # use bytes (. is byte)
:u1 # level 1 support (. is codepoint)
:u2 # level 2 support (. is grapheme)
:u3 # level 3 support (. is language dependent)

These modifiers say nothing about the state of the data, but in
general internal Perl data will already be in Normalization Form
C, so even under :u1, the precomposed characters will usually do
the right thing. Note that these modifiers are for overriding
the default support level, which was probably set by pragma at
the top of the file.

Or was that to imply that a literal "a" in the RE would be
interpreted as a "grapheme a" when :u2 is active?
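
The byte/codepoint difference and the effect of Normalization Form C can be tried out today with Python's standard re and unicodedata modules. This is only an analogy for :u0 and :u1, not the Perl 6 modifiers themselves; the standard re module has no grapheme-level dot, so nothing here corresponds to :u2 or :u3. UTF-8 is assumed for the byte form.

```python
import re
import unicodedata

decomposed = "ae\u0301"                              # 'a', 'e', COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # 'a', then precomposed U+00E9

# Byte level: '.' consumes one byte, so it stops halfway through U+00E9.
byte_match = re.match(rb"a.", composed.encode("utf-8"))
print(byte_match.group())                            # b'a\xc3'

# Codepoint level on decomposed text: '.' takes 'e' and leaves the accent behind.
code_match = re.match(r"a.", decomposed)
print(code_match.group() == "ae")                    # True

# Codepoint level on NFC text: the precomposed character matches as one unit.
nfc_match = re.match(r"a.", composed)
print(nfc_match.group() == composed)                 # True
```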

-Scott
--
Jonathan Scott Duff Division of Nearshore Research
du...@lighthouse.tamucc.edu Senior Systems Analyst II

Matt Diephouse

Jun 30, 2004, 8:51:58 PM

to perl6-l...@perl.org

Larry Wall wrote:
> On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote:
> : Issues:
> : * Limits lvalue substr (doesn't allow it to be a different size)
> : unless splice is used (or a substr method is also provided).
>
> That all has to be looked at anyway. What does "5" mean when you
> pass it to substr, anyway? (I've been trying to make it assume some
> implicit unit based on the current lexical scope's Unicode level,
> but issues remain.) We have magical string positions that have
> different numeric values depending on what units you view them as,
> but at what point does a number like "5" get translated to such
> a magical string position?

While we're on the topic of substr, allow me to beg. Please, can we
replace substr with array style operations like Ruby and Python?
Please? Something like this would be nice:

my $string = "Hello, World!";
say $string[0..4]; # prints "Hello\n"
$string[7...] = "Larry!";
say $string; # prints "Hello, Larry!\n"

We already have our strings acting as objects, and we have [] as a
postcircumfix operator, so it's something that someone could define
easily. Of course, I have no idea how to reconcile this with all the
talk of unicode other than to say that the easy stuff should be easy.

It just follows this would also be nice for arrays, to replace splice.
For me, these two functions are the most bothersome part of Perl 5, and
I would love to see them go.
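
For comparison, this is roughly how the Python side of that looks. Python strings are immutable, so in this sketch the slice assignment is done on a list of codepoints and joined back afterwards; the same slice assignment on a list is also what stands in for splice.

```python
s = "Hello, World!"
print(s[0:5])                        # Hello

chars = list(s)                      # one list element per codepoint
chars[7:] = "Larry!"                 # replace everything from index 7 onward
s = "".join(chars)
print(s)                             # Hello, Larry!

items = [1, 2, 3, 4, 5]
items[1:3] = ["two", "three", "extra"]   # splice: replacement may differ in length
print(items)
```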

matt

Juerd

Jul 1, 2004, 6:59:34 AM

to Matt Diephouse, perl6-l...@perl.org

Matt Diephouse skribis 2004-06-30 20:51 (-0400):

> my $string = "Hello, World!";
> say $string[0..4]; # prints "Hello\n"
> $string[7...] = "Larry!";
> say $string; # prints "Hello, Larry!\n"

And that "array" is one of bytes? graphemes?

In general, I like the idea. In <40DDCE2A...@brentdax.com>, almost
the same was suggested, but implemented differently: a string's .bytes
method in list context (but isn't it array context, technically?) would
dwym. As would the other parts-of-string methods.

Perhaps without method, the string in array/list context can default to
the default set by a lexical pragma. Which, I hope, has a default
itself. (I like default defaults...)

Juerd

Matt Diephouse

Jul 1, 2004, 7:29:43 AM

to perl6-l...@perl.org

Juerd wrote:
> Matt Diephouse skribis 2004-06-30 20:51 (-0400):
>
>> my $string = "Hello, World!";
>> say $string[0..4]; # prints "Hello\n"
>> $string[7...] = "Larry!";
>> say $string; # prints "Hello, Larry!\n"
>
>
> And that "array" is one of bytes? graphemes?

I'm not really up on my unicode, but I think .chars is what I have in
mind. I want it to operate like a non-unicode string in Perl 5. Anything
unicode can be more complex, as I think this will be the common case.

> In general, I like the idea. In <40DDCE2A...@brentdax.com>, almost
> the same was suggested, but implemented differently: a string's .bytes
> method in list context (but isn't it array context, technically?) would
> dwym. As would the other parts-of-string methods.

Think of this as Huffmanized .chars then?

matt

John Williams

Jul 1, 2004, 4:15:24 PM

to Juerd, Matt Diephouse, perl6-l...@perl.org

On Thu, 1 Jul 2004, Juerd wrote:

> Matt Diephouse skribis 2004-06-30 20:51 (-0400):
> > my $string = "Hello, World!";
> > say $string[0..4]; # prints "Hello\n"
> > $string[7...] = "Larry!";
> > say $string; # prints "Hello, Larry!\n"
>
> And that "array" is one of bytes? graphemes?
>
> In general, I like the idea. In <40DDCE2A...@brentdax.com>, almost
> the same was suggested, but implemented differently: a string's .bytes
> method in list context (but isn't it array context, technically?) would
> dwym. As would the other parts-of-string methods.

What if you could add the slice onto the method:

my $string = "Hello, World!";

say $string.bytes[0..4]; # prints "Hello\n"
$string.codepoints[7...] = "Søren!";
say $string; # prints "Hello, Søren!\n"

The string slicing operator would have to return an array of
bytes/codepoints/etc in list context and a substr in scalar context.

~ John Williams

Aaron Sherman

Jul 2, 2004, 4:50:01 PM

to Austin Hastings, Perl6 Language List

On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:

> [...] when you switch to LC_ALL= <pick your favorite
> language>, you just get really slow performance: Apparently the 'C'
> locale is such a totally special case that the performance of LC_ALL=C
> is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
> when the data is 7bit ascii.

Well, of course. I can't imagine a way in which this would not be true.

After all, in LC_ALL="C" the number of characters in a string is equal
to the number of bytes in the string. In LC_ALL="en_US.UTF-8" the length
of a string is dependent on what exactly you mean by length, and a lot
of special cases arise. Special cases and context mean you have more
code to execute for the same logical task, which means you have more
processing to do.

Unicode support is expensive, even if you're just doing ASCII-as-UTF-8.
That doesn't mean it's a bad thing to do, it's just that it's expensive.

> I think that (1) this is unacceptable: the temptation to switch to the
> 'C' locale has been too great, both at this site and on a lot of the RH
> support forums;

And yet, in English-speaking countries (and Hawaiian and
Swahili-speaking countries for that matter) and in situations where the
fidelity of certain types of string data (such as names) is not
considered critical, this is a fine default. e.g. for general shell
work.

> (2) Perl6 should equitably support all its target
> locales; (3) we should set out to make sure the performance is damn
> fast no matter what locale we're using.

Well, that's a nice theory, but you can prove that low-level encodings
(e.g. ASCII, EBCDIC) will be more efficient than high-level encodings
(e.g. UTF-8), so the only way to accomplish what you suggest in (2) is
to break (3) by slowing down the faster handling (not what you wanted,
I'm sure).

Of course, you want to have as much performance out of string handling
as possible.

> This has no direct bearing on p6l, since performance is a p6i issue.
> But perhaps in the interests of performance as well as hackery we
> should explicitly provide some sort of variant regex behavior:
>
> /a./ :bytes
> /a./ :graphemes

As pointed out by others, this is already there, though I'm not sure
that it would be specified that way. More likely:

m :u0 /a./
[etc]

--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Perl Toolsmith
http://www.ajs.com/~ajs/resume.html

Brent 'Dax' Royal-Gordon

Jul 3, 2004, 6:37:43 AM7/3/04

to Aaron Sherman, Austin Hastings, Perl6 Language List

Aaron Sherman wrote:
> On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:
>>(2) Perl6 should equitably support all its target
>>locales; (3) we should set out to make sure the performance is damn
>>fast no matter what locale we're using.
>
> Well, that's a nice theory, but you can prove that low-level encodings
> (e.g. ASCII, EBCDIC) will be more efficient than high-level encodings
> (e.g. UTF-8), so the only way to accomplish what you suggest in (2) is
> to break (3) by slowing down the faster handling (not what you wanted,
> I'm sure).

At the Parrot level, codepoint operations will generally be the most
efficient, even on strings with exotic charsets. Parrot uses an
internal encoding that allows O(1) access to codepoints; essentially, it
uses an array of 8-, 16-, or 32-bit integers, depending on the highest
codepoint value. This is the default even for character sets with shift
characters, like Shift-JIS.
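
A rough sketch of that storage scheme, in Python rather than Parrot's actual C (the class `FixedWidthString` and its methods are hypothetical, not Parrot API): pick the smallest of 1, 2, or 4 bytes that fits the highest codepoint, store every codepoint at that width, and index by multiplication.

```python
class FixedWidthString:
    """Hypothetical sketch: every codepoint stored in 1, 2, or 4 bytes."""

    def __init__(self, text: str):
        highest = max((ord(ch) for ch in text), default=0)
        if highest < 0x100:
            self.width = 1
        elif highest < 0x10000:
            self.width = 2
        else:
            self.width = 4
        self.length = len(text)
        # One fixed-size little-endian cell per codepoint.
        self.buffer = b"".join(
            ord(ch).to_bytes(self.width, "little") for ch in text
        )

    def codepoint_at(self, index: int) -> int:
        """O(1) lookup: the cell offset is just index * width."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        start = index * self.width
        return int.from_bytes(self.buffer[start:start + self.width], "little")


s = FixedWidthString("a\u20acb")   # U+20AC EURO SIGN forces 16-bit cells
print(s.width, hex(s.codepoint_at(1)))
```

The trade-off is memory: one high codepoint widens every cell in the string, which is the price paid for constant-time indexing.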

On strings where all codepoints have values under 256, bytewise and
codepointwise lookup are equivalent; otherwise, though, bytewise lookup
will actually be *slower* than codepointwise, as Parrot will maintain
the illusion that each codepoint is stored in an integer that's the
perfect size for it.

If you force Parrot to use the UTF-8 encoding internally then bytewise
lookup becomes fastest, and codepointwise slows down a lot. But you
really shouldn't do that--UTF-8 is ill-suited for actually
*manipulating* text, unlike the Parrot internal encodings. (UTF-16 and
UTF-32 will presumably be available too, although I've seen no specific
mention of them.)
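
To illustrate why codepointwise lookup slows down on UTF-8, here is a small Python sketch (the helper `utf8_codepoint_at` is made up for this example and assumes well-formed UTF-8 input). Because each codepoint occupies one to four bytes, finding the Nth one means scanning from the start and counting lead bytes.

```python
def utf8_codepoint_at(data: bytes, index: int) -> str:
    """Return the codepoint at position `index` by scanning from byte 0."""
    seen = -1
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new codepoint.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                end = offset + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[offset:end].decode("utf-8")
    raise IndexError(index)


data = "a\u20acb".encode("utf-8")    # 1 + 3 + 1 bytes
print(data[1])                       # bytewise lookup: direct indexing
print(utf8_codepoint_at(data, 1))    # codepointwise lookup: linear scan
```

Bytewise access is a plain array index here, while the codepoint lookup is O(n) in the length of the string.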

You can also force it to use a "raw" or "bytes" encoding, where bytes
and codepoints are identical. But you can't store Unicode characters in
such a string and have them behave in a reasonable way.

(Note: this is all based on my own, possibly false, memory.)

Larry Wall

Jul 7, 2004, 11:09:51 PM7/7/04

to perl6-l...@perl.org, Austin Hastings, Jonadab the Unsightly One

On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: > This has no direct bearing on p6l, since performance is a p6i issue.
: > But perhaps in the interests of performance as well as hackery we
: > should explicitly provide some sort of variant regex behavior:
: >
: > /a./ :bytes
: > /a./ :graphemes
: >
: > where the first would recognize 0x61 followed by any single byte, while
: > the second would recognize 'a' followed by any number of bytes
: > composing a single grapheme.
:
: Isn't that what :u0, :u1, :u2, and :u3 are for?
:
: :u0 # use bytes (. is byte)
: :u1 # level 1 support (. is codepoint)
: :u2 # level 2 support (. is grapheme)
: :u3 # level 3 support (. is language dependent)

These modifiers might get renamed to match whatever b/c/g/w convention
we come up with for pragmas. The levels aren't all that intuitive, though
there is a kind of progression of semantic complexity that would get
lost with ordinary names.

: These modifiers say nothing about the state of the data, but in
: general internal Perl data will already be in Normalization Form
: C, so even under :u1, the precomposed characters will usually do
: the right thing.

These days it might be that most of the data we see will be maximally
decomposed rather than maximally composed. But the jury is still out
on that. And in any event, :u2 and :u3 should hide that distinction.

: Note that these modifiers are for overriding
: the default support level, which was probably set by pragma at
: the top of the file.

Another way of saying that is that these modifiers are, in fact,
lexically scoped pragmas with the *exact* same effect as the ordinary
Unicode level pragmas. It's just that they're lexically scoped to
the rest of a rule or group rather than to the rest of a block.

: Or was that to imply that a literal "a" in the RE would be
: interpreted as a "grapheme a" when :u2 is active?

I don't know what you mean by "grapheme a" there. If you mean, "Does
it match any grapheme that happens to be exactly U+0061?", then the
answer is yes. If you mean "Does it wildcard to any grapheme that uses
U+0061 as the base character?", then the answer is probably no. We
have not yet come up with a syntax for that kind of wildcarding, other
than dropping down to codepoints [:u1 a \pM+] or some such. That may
or may not be sufficient. It'd be pretty easy to define a <like a>
assertion in any case.
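
As a rough picture of what the codepoint-level pattern [:u1 a \pM+] is after, here is a sketch in Python (not Perl 6 rule syntax; the function `graphemes_with_base` is invented for this example). It decomposes the text to NFD and collects every occurrence of a base character followed by one or more codepoints in a Mark category.

```python
import unicodedata

def graphemes_with_base(text: str, base: str) -> list:
    """Return each run of `base` followed by one or more combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    matches = []
    i = 0
    while i < len(decomposed):
        if decomposed[i] == base:
            j = i + 1
            # Consume codepoints in general categories Mn, Mc, or Me.
            while (j < len(decomposed)
                   and unicodedata.category(decomposed[j]).startswith("M")):
                j += 1
            if j > i + 1:  # the "+" demands at least one mark
                matches.append(decomposed[i:j])
            i = j
        else:
            i += 1
    return matches


found = graphemes_with_base("a \u00e1 \u00e4 b", "a")
print([[hex(ord(ch)) for ch in m] for m in found])
```

A bare "a" is skipped because the pattern requires at least one mark; matching "a with or without marks" would need \pM* instead.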

Larry

Larry Wall

Jul 7, 2004, 11:15:30 PM7/7/04

to perl6-l...@perl.org, Austin Hastings, Jonadab the Unsightly One

On Wed, Jul 07, 2004 at 08:09:51PM -0700, Larry Wall wrote:

: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: : > This has no direct bearing on p6l, since performance is a p6i issue.
: : > But perhaps in the interests of performance as well as hackery we
: : > should explicitly provide some sort of variant regex behavior:
: : >
: : > /a./ :bytes
: : > /a./ :graphemes
: : >
: : > where the first would recognize 0x61 followed by any single byte, while
: : > the second would recognize 'a' followed by any number of bytes
: : > composing a single grapheme.
: :
: : Isn't that what :u0, :u1, :u2, and :u3 are for?
: :
: : :u0 # use bytes (. is byte)
: : :u1 # level 1 support (. is codepoint)
: : :u2 # level 2 support (. is grapheme)
: : :u3 # level 3 support (. is language dependent)
:
: These modifiers might get renamed to match whatever b/c/g/w convention
: we come up with for pragmas. The levels aren't all that intuitive, though
: there is a kind of progression of semantic complexity that would get
: lost with ordinary names.

On the flip side, a good reason to get rid of the numeric values is
that in all likelihood people will continually make the mistake of
thinking :u1 means "one byte at a time" and :u2 means "two bytes at
a time". And then they'll wonder why :u4 doesn't give them UTF-32...

Larry

Austin Hastings

Jul 8, 2004, 10:35:44 AM7/8/04

to Larry Wall, perl6-l...@perl.org

--- Larry Wall <la...@wall.org> wrote:
> On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
>
> : Or was that to imply that a literal "a" in the RE would be
> : interpreted as a "grapheme a" when :u2 is active?
>
> I don't know what you mean by "grapheme a" there. If you mean, "Does
> it match any grapheme that happens to be exactly U+0061?", then the
> answer is yes.

In my original question, I meant to differentiate between 'grapheme'
and 'possible component of a multibyte expression'.

> If you mean "Does it wildcard to any grapheme that uses
> U+0061 as the base character?", then the answer is probably no. We
> have not yet come up with a syntax for that kind of wildcarding,
> other than dropping down to codepoints [:u1 a \pM+] or some such.
> That may or may not be sufficient. It'd be pretty easy to define a
> <like a> assertion in any case.

I think this is something that we'll want as a "mode", a la
case-insensitivity. Think of it as "mark insensitivity."

I'm not sure if this should be language/locale dependent or not, but a
basic search feature for text is "fre'd" -> "fred".

Maybe it can just roll into :i?

=Austin

Jonadab The Unsightly One

Jul 10, 2004, 10:36:59 PM7/10/04

to Austin_...@yahoo.com, Larry Wall, perl6-l...@perl.org

Austin Hastings <austin_...@yahoo.com> writes:

> I think this is something that we'll want as a "mode", a la
> case-insensitivity. Think of it as "mark insensitivity."

Makes sense to me, but...

> Maybe it can just roll into :i?

It will probably get used in _conjunction_ with case-insensitivity
quite a lot, but I suspect people will want to be able to use one
without the other.
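
A small Python sketch of the two modes as independent switches (the names `fold`, `matches`, `ignore_case`, and `ignore_marks` are invented for illustration and are not proposed Perl 6 modifiers). Mark insensitivity is modelled by decomposing to NFD and dropping every codepoint in a Mark category; case insensitivity uses Python's `str.casefold`.

```python
import unicodedata

def fold(text: str, ignore_case: bool = False, ignore_marks: bool = False) -> str:
    """Normalize `text` according to the requested insensitivities."""
    if ignore_marks:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(
            ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")
        )
    if ignore_case:
        text = text.casefold()
    return text

def matches(pattern: str, subject: str, **options) -> bool:
    """Substring search after folding both sides the same way."""
    return fold(pattern, **options) in fold(subject, **options)


subject = "\u00bfQu\u00e9 m\u00e1s?"
print(matches("que mas", subject, ignore_marks=True))
print(matches("que mas", subject, ignore_marks=True, ignore_case=True))
```

With marks ignored but case respected the search fails on the capital Q; it succeeds only when both switches are on, which is the argument for keeping them separate.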

Since mark-insensitivity is probably mostly a non-issue in the ASCII
world, it would probably be a better candidate than average for being
turned on using a unicode character, if we're running low on letters
for designating these rules.

Luke Palmer

Jul 11, 2004, 1:22:53 AM7/11/04

to Jonadab the Unsightly One, Austin_...@yahoo.com, Larry Wall, perl6-l...@perl.org

Jonadab the Unsightly One writes:
> Austin Hastings <austin_...@yahoo.com> writes:
>
> > I think this is something that we'll want as a "mode", a la
> > case-insensitivity. Think of it as "mark insensitivity."
>
> Makes sense to me, but...
>
> > Maybe it can just roll into :i?
>
> It will probably get used in _conjunction_ with case-insensitivity
> quite a lot, but I suspect people will want to be able to use one
> without the other.
>
> Since mark-insensitivity is probably mostly a non-issue in the ASCII
> world, it would probably be a better candidate than average for being
> turned on using a unicode character, if we're running low on letters
> for designating these rules.

Or, god forbid, a word?

m:base/que mas/

We're not mathematicians: we're allowed to use more than one letter in a
row to designate something :-)

Luke

Austin Hastings

Jul 11, 2004, 5:22:51 PM7/11/04

to Jonadab the Unsightly One, Larry Wall, perl6-l...@perl.org

> -----Original Message-----
> From: Jonadab the Unsightly One [mailto:jon...@bright.net]
> Austin Hastings <austin_...@yahoo.com> writes:
>
> > I think this is something that we'll want as a "mode", a la
> > case-insensitivity. Think of it as "mark insensitivity."
>
> Makes sense to me, but...
>
> > Maybe it can just roll into :i?
>
> It will probably get used in _conjunction_ with case-insensitivity
> quite a lot, but I suspect people will want to be able to use one
> without the other.
>
> Since mark-insensitivity is probably mostly a non-issue in the ASCII
> world, it would probably be a better candidate than average for being
> turned on using a unicode character, if we're running low on letters
> for designating these rules.

How about :i ?

:) :) :)

=Austin

Jonadab The Unsightly One

Jul 11, 2004, 9:25:39 PM7/11/04

to Luke Palmer, Jonadab the Unsightly One, Austin_...@yahoo.com, Larry Wall, perl6-l...@perl.org

Luke Palmer <lu...@luqui.org> writes:

> Or, god forbid, a word?
>
> m:base/que mas/
>
> We're not mathematicians: we're allowed to use more than one letter
> in a row to designate something :-)

Well, if it were *me*, *I* would have voted for keeping the core
language 100% pure ASCII, untainted by rogue untypeable characters...
So naturally :base is fine by *me*...

David Green

Jul 13, 2004, 8:00:59 PM7/13/04

to perl6-l...@perl.org

In article <20040708030...@wall.org>,
la...@wall.org (Larry Wall) wrote:

>On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
>: :u0 # use bytes (. is byte)
>: :u1 # level 1 support (. is codepoint)
>: :u2 # level 2 support (. is grapheme)
>: :u3 # level 3 support (. is language dependent)
>
>These modifiers might get renamed to match whatever b/c/g/w convention
>we come up with for pragmas. The levels aren't all that intuitive, though
>there is a kind of progression of semantic complexity that would get
>lost with ordinary names.

bytes
codepts
graphemes
langdepends

That's a kind of progression. And "codepts" seems a natural enough
abbreviation, though I don't really know what to do with
language_dependent_thingummies. Though with less typing, the initials
b < c < g < l give the same progression.

-David "except for encodings where c<b, of course...." Green