The .bytes/.codepoints/.graphemes methods
Brent 'Dax' Royal-Gordon
Newsgroups: perl.perl6.language
From: br...@brentdax.com (Brent 'Dax' Royal-Gordon)
Date: Sat, 26 Jun 2004 12:27:38 -0700
Local: Sat, Jun 26 2004 3:27 pm
Subject: The .bytes/.codepoints/.graphemes methods
As currently designed, the String::bytes, String::codepoints, and
String::graphemes methods return the number of bytes, codepoints, and
graphemes, respectively, in the string they were called on.  I would
like to suggest that, when called in list context, these methods return
an array of strings split by bytes, codepoints, and graphemes, respectively.

This would make it unambiguous whether certain string operations
referred to bytes, codepoints, or graphemes:

     $str.bytes[0].ord
     $str.codepoints[0..4].join #substr

As well as allowing some operations that are currently much more difficult:

     $str.bytes[3].ord
     $str.graphemes[144].lc

Issues:
   * Limits lvalue substr (doesn't allow it to be a different size)
     unless splice is used (or a substr method is also provided).
   * Memory consumption.
   * A bit odd-looking.

Benefits:
   * Removes ambiguity in an area that needs said ambiguity removed.
   * Allows us to reuse constructs (e.g. slicing).
   * Opens up a few previously-difficult constructs (like getting the
     ord() of an arbitrary character).

--
Brent "Dax" Royal-Gordon <br...@brentdax.com>
Perl and Parrot hacker

Oceania has always been at war with Eastasia.
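
To make the three units concrete, here is a minimal sketch in Python rather than Perl 6, since the syntax discussed in this thread was still under design. The helper names `split_bytes`, `split_codepoints`, and `split_graphemes` are hypothetical, and the grapheme split is only an approximation (a base codepoint plus any combining marks that follow it), not the full Unicode segmentation rules.

```python
import unicodedata

def split_bytes(s, encoding="utf-8"):
    """Return the encoded form of s as a list of one-byte bytes objects."""
    data = s.encode(encoding)
    return [data[i:i + 1] for i in range(len(data))]

def split_codepoints(s):
    """Return one single-codepoint string per codepoint."""
    return list(s)

def split_graphemes(s):
    """Approximate split: a base codepoint plus the combining marks after it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "re\u0301sume\u0301"  # "résumé" written with combining acute accents
print(len(split_bytes(s)), len(split_codepoints(s)), len(split_graphemes(s)))
```

The same string has a different length in each unit, which is why an index such as `[3]` means nothing until the unit is named.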


 
Larry Wall
Newsgroups: perl.perl6.language
From: la...@wall.org (Larry Wall)
Date: Sat, 26 Jun 2004 13:20:59 -0700
Local: Sat, Jun 26 2004 4:20 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods
On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote:

: As currently designed, the String::bytes, String::codepoints, and
: String::graphemes methods return the number of bytes, codepoints,
: and graphemes, respectively, in the string they were called on.  I
: would like to suggest that, when called in list context, these
: methods return an array of strings split by bytes, codepoints, and
: graphemes, respectively.
:
: This would make it unambiguous whether certain string operations
: referred to bytes, codepoints, or graphemes:
:
:     $str.bytes[0].ord
:     $str.codepoints[0..4].join        #substr
:
: As well as allowing some operations that are currently much more
: difficult:
:
:     $str.bytes[3].ord
:     $str.graphemes[144].lc
:
: Issues:
:   * Limits lvalue substr (doesn't allow it to be a different size)
:     unless splice is used (or a substr method is also provided).

That all has to be looked at anyway.  What does "5" mean when you
pass it to substr, anyway?  (I've been trying to make it assume some
implicit unit based on the current lexical scope's Unicode level,
but issues remain.)  We have magical string positions that have
different numeric values depending on what units you view them as,
but at what point does a number like "5" get translated to such
a magical string position?

:   * Memory consumption.

Not necessarily, if the method merely returns a "view" of the string
without actually doing the split.

:   * A bit odd-looking.

I dunno--it reads pretty well.  Maybe these'll be heavily enough
used that we should Huffmanize them down a bit:

    $str.bytes
    $str.codes
    $str.graphs
    $str.letters

Though "letters" is a bit inadequate to describe language-dependent
graphemes, since it also divides any non-letters...I suppose we
could go with .characters if we don't mind forcing a heavily
overloaded word in one particular direction, culturally speaking.
Except, I'd kinda like to keep them starting with different letters.
(And maybe .chars should be reserved to mean whatever the default
unit is in the current lexical scope, as with substr() above.)

: Benefits:
:   * Removes ambiguity in an area that needs said ambiguity removed.
:   * Allows us to reuse constructs (e.g. slicing).
:   * Opens up a few previously-difficult constructs (like getting the
:     ord() of an arbitrary character).

I'd also point out that the scalar definitions fall out of it
naturally.

One other downside is that you might have to insert + in various
places to get the numeric interpretation.  But that could be
construed as self-dedocumentation.

Larry
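
The "view" idea and the scalar/list duality can be sketched as follows. This is Python, not Perl 6, and `CodepointView` is a hypothetical name: the object keeps a reference to the string, answers "how many" when asked for a number, and only produces pieces when indexed or sliced.

```python
class CodepointView:
    """A lazy, read-only view of a string in codepoint units."""

    def __init__(self, s):
        self._s = s                 # keep a reference; nothing is split up front

    def __len__(self):
        return len(self._s)         # the "scalar" answer: number of units

    def __getitem__(self, index):
        return self._s[index]       # the "list" answer, computed on demand

    def __int__(self):
        return len(self)            # explicit numeric coercion, like a leading +

view = CodepointView("hello world")
count = int(view)                   # numeric interpretation
prefix = view[0:5]                  # slice plays the role of substr
print(count, prefix)
```

Here a slice returns an already-joined string, so there is no separate `.join` step as in the `$str.codepoints[0..4].join` example.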


 
Jonadab The Unsightly One
Newsgroups: perl.perl6.language
From: jona...@bright.net (Jonadab The Unsightly One)
Date: Mon, 28 Jun 2004 11:26:32 -0400
Local: Mon, Jun 28 2004 11:26 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

Larry Wall <la...@wall.org> writes:
> That all has to be looked at anyway.  What does "5" mean when you
> pass it to substr, anyway?  

I was just going to ask about substrings, and then didn't because I
figured that had been hashed out already and I'd missed it...

> (I've been trying to make it assume some implicit unit based on the
> current lexical scope's Unicode level, but issues remain.)  We have
> magical string positions that have different numeric values
> depending on what units you view them as, but at what point does a
> number like "5" get translated to such a magical string position?

It would be possible to have right-associative operators (that bind at
least more tightly than comma and possibly very tightly) and convert a
number to one of these objects, so that we can do stuff like this:

substr($string, 2 bytes, 4 bytes) = $substitute;

Then if you pass a plain number to substr it could either assume
something (possibly generating a warning) or spit an error, depending
on some feature of the current lexical scope.

The word "bytes" is clearly much too long, though, much less
"graphemes" or "codepoints".  I thought about this:

substr($string, 2b, 4b) = $substitute;

With presumably g and c for graphemes and codepoints, but I rather
suspect that might conflict with some other existing syntax (though I
can't think of anything in particular).

And I can't think of another abbreviation that would be remotely
intuitive.

There's also the possibility of bsubstr and so on, but that leads us
down the path of C, having a hillion bajillion functions with names
like fgets, stoi, and fstrnclost.  Having sprintf is quite enough of
that, IMO.

> I dunno--it reads pretty well.  Maybe these'll be heavily enough
> used that we should Huffmanize them down a bit:

>     $str.bytes
>     $str.codes
>     $str.graphs
>     $str.letters

codes and graphs is better than codepoints and graphemes, at least.

> Though "letters" is a bit inadequate to describe language-dependent
> graphemes, since it also divides any non-letters...I suppose we
> could go with .characters if we don't mind forcing a heavily
> overloaded word in one particular direction, culturally speaking.
> Except, I'd kinda like to keep them starting with different letters.
> (And maybe .chars should be reserved to mean whatever the default
> unit is in the current lexical scope, as with substr() above.)

You could coin the abbreviation ligs, for Language Independent
Graphemes.  Then some ingenious rascal can create a pragma or whatever
that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/


 
Larry Wall
Newsgroups: perl.perl6.language
From: la...@wall.org (Larry Wall)
Date: Mon, 28 Jun 2004 09:52:43 -0700
Local: Mon, Jun 28 2004 12:52 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods
On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote:
: You could coin the abbreviation ligs, for Language Independent
: Graphemes.  Then some ingenious rascal can create a pragma or whatever
: that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

Except they'd have to be "ldgs".  Graphemes are ligs in current parlance.

Larry


 
Dave Whipp
Newsgroups: perl.perl6.language
From: d...@whipp.name (Dave Whipp)
Date: Mon, 28 Jun 2004 09:55:00 -0700
Subject: Re: The .bytes/.codepoints/.graphemes methods
"Jonadab The Unsightly One" <jona...@bright.net> wrote in message
news:8ye7r9ef.fsf@jonadab.homeip.net...

> It would be possible to have right-associative operators (that bind at
> least more tightly than comma and possibly very tightly) and convert a
> number to one of these objects, so that we can do stuff like this:

> substr($string, 2 bytes, 4 bytes) = $substitute;

I think that the common case will use the same units for both the index and
the length. So perhaps:

  substr($string, 2, 4 :bytes)

would be more appropriate. Also, by only requiring us to write the unit
once, the need for ultra-short abbreviations is reduced.

Dave.
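
A rough Python analogue of naming the unit once for both offset and length is sketched below. The function `substr` and its `unit` keyword are assumptions made for illustration, and the grapheme branch uses a simple base-plus-combining-marks approximation rather than full Unicode segmentation.

```python
import unicodedata

def substr(s, start, length, unit="codepoints", encoding="utf-8"):
    """Return `length` units of s starting at `start`, both measured in `unit`."""
    if unit == "codepoints":
        return s[start:start + length]
    if unit == "bytes":
        # Returns raw bytes: the slice may cut a character in half.
        return s.encode(encoding)[start:start + length]
    if unit == "graphemes":
        clusters = []
        for ch in s:
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return "".join(clusters[start:start + length])
    raise ValueError(f"unknown unit: {unit!r}")

s = "cafe\u0301 au lait"            # the accent is a separate combining codepoint
print(substr(s, 2, 4, unit="bytes"))
print(substr(s, 2, 4, unit="codepoints"))
print(substr(s, 2, 4, unit="graphemes"))
```

The same `2, 4` selects three different spans depending on the unit, so stating the unit once per call already removes the ambiguity.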


 
Dan Sugalski
Newsgroups: perl.perl6.language
From: d...@sidhe.org (Dan Sugalski)
Date: Mon, 28 Jun 2004 12:54:46 -0400 (EDT)
Local: Mon, Jun 28 2004 12:54 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods

On Mon, 28 Jun 2004, Larry Wall wrote:
> On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote:
> : You could coin the abbreviation ligs, for Language Independent
> : Graphemes.  Then some ingenious rascal can create a pragma or whatever
> : that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

> Except they'd have to be "ldgs".  Graphemes are ligs in current parlance.

And 'ligs' implies ligatures. And since that'd require font, style, and
possibly layout information, I think we'd rather not go there right now...

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk


 
Juerd
Newsgroups: perl.perl6.language
From: ju...@convolution.nl (Juerd)
Date: Mon, 28 Jun 2004 19:51:10 +0200
Local: Mon, Jun 28 2004 1:51 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods
Dave Whipp skribis 2004-06-28  9:55 (-0700):

> > substr($string, 2 bytes, 4 bytes) = $substitute;
> substr($string, 2, 4 :bytes)

substr($string, 2 but graphemes, 4 but bytes);

I think "but" even makes sense, if substr defaults to something.

Juerd


 
Dan Sugalski
Newsgroups: perl.perl6.language
From: d...@sidhe.org (Dan Sugalski)
Date: Mon, 28 Jun 2004 13:52:28 -0400 (EDT)
Local: Mon, Jun 28 2004 1:52 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods

On Mon, 28 Jun 2004, Juerd wrote:
> Dave Whipp skribis 2004-06-28  9:55 (-0700):
> > > substr($string, 2 bytes, 4 bytes) = $substitute;
> > substr($string, 2, 4 :bytes)

> substr($string, 2 but graphemes, 4 but bytes);

> I think "but" even makes sense, if substr defaults to something.

I think mixing strings, bytes, graphemes, and code points together is a
phenomenally bad idea, likely to lead to many tears, much gnashing of
teeth, and quite a few rampages with sharp objects, not to mention a lot
of code guaranteed to fail at the edge cases.

If, as a programmer, you *really* want to run with scissors then convert
your string to a binary byte buffer and go from there. At least then when
you poke out an eye you won't be nearly so surprised.

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
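
One edge case of the kind described above, sketched in Python under the assumption that the string is stored as UTF-8: applying a codepoint-sized slice to the byte form can end in the middle of a character.

```python
s = "na\u00efve"                    # "naïve"; the ï takes two bytes in UTF-8
data = s.encode("utf-8")            # b'na\xc3\xafve'

by_codepoints = s[0:3]              # three codepoints: n, a, ï

chopped = data[0:3]                 # same numbers applied to bytes: b'na\xc3'
try:
    chopped.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False              # the slice ended halfway through ï

print(by_codepoints, chopped, decoded_ok)
```

Treating the data as a plain byte buffer, as suggested, at least makes it obvious that the result is no longer guaranteed to be text.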


 
Austin Hastings
Newsgroups: perl.perl6.language
From: austin_hasti...@yahoo.com (Austin Hastings)
Date: Mon, 28 Jun 2004 11:27:34 -0700 (PDT)
Local: Mon, Jun 28 2004 2:27 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods
--- Dan Sugalski <d...@sidhe.org> wrote:

> On Mon, 28 Jun 2004, Juerd wrote:

> > Dave Whipp skribis 2004-06-28  9:55 (-0700):
> > > > substr($string, 2 bytes, 4 bytes) = $substitute;
> > > substr($string, 2, 4 :bytes)

> > substr($string, 2 but graphemes, 4 but bytes);

> > I think "but" even makes sense, if substr defaults to something.

> I think mixing strings, bytes, graphemes, and code points together
> is a phenomenally bad idea, likely to lead to many tears, much
> gnashing of teeth, and quite a few rampages with sharp objects,
> not to mention a lot of code guaranteed to fail at the edge cases.

Hmm. Suppose that I have a system that is friendly to 80 byte records.
I want to output "meaningful" strings, so I want to partition a buffer
into 80-ish byte substrings, but preserve any graphemes (i.e., store
the data in a legible format).

How would I do that?

The obvious answer is a gnarly little loop, but I think I'd like to
have perl do that for me. Can I say something like:

  while ($buffer)
  {
    $output = substr($buffer, 0, 80 but bytes, units => graphemes);
    $buffer = substr($buffer, length $output :graphemes);

    $cout << $output << nl; # :-)
  }

and get some dwimmery?

=Austin


 
Dan Sugalski
Newsgroups: perl.perl6.language
From: d...@sidhe.org (Dan Sugalski)
Date: Mon, 28 Jun 2004 14:36:24 -0400 (EDT)
Local: Mon, Jun 28 2004 2:36 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods

You don't. Or if you do, you do it with a lot of pain, sweat, and annoying
hard work. 80 bytes gets you somewhere between three (And this may be a
*high* estimate--there may be circumstances where 80 bytes is
insufficient for *one* grapheme) and 80 graphemes.

This isn't something that can be made generically easy.

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
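
The point that 80 bytes may not hold even one grapheme can be checked with a small Python sketch. It assumes UTF-8 and treats a base letter followed by combining marks as a single grapheme; the particular string is contrived for illustration.

```python
import unicodedata

# One base letter followed by 40 combining acute accents.
grapheme = "a" + "\u0301" * 40

# Every codepoint after the first is a combining mark, so under the
# base-plus-marks approximation this is one user-visible character.
all_marks = all(unicodedata.combining(ch) for ch in grapheme[1:])

size = len(grapheme.encode("utf-8"))    # 1 byte for "a" + 2 bytes per accent
print(all_marks, size)
```

At the other extreme, 80 ASCII letters are 80 graphemes in exactly 80 bytes, so the number of graphemes that fit in a fixed byte budget depends entirely on the data.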


 
Austin Hastings
Newsgroups: perl.perl6.language
From: austin_hasti...@yahoo.com (Austin Hastings)
Date: Mon, 28 Jun 2004 08:54:40 -0700 (PDT)
Local: Mon, Jun 28 2004 11:54 am
Subject: Re: The .bytes/.codepoints/.graphemes methods
--- Jonadab the Unsightly One <jona...@bright.net> wrote:

A couple of alternatives:

  substr.bytes($string, 2, 4) = $substitute;

  substr($string.bytes, 2, 4) = $substitute;

  # Make it a pragma
  use String(bytes);        
  substr($string, 2, 4) = $substitute;

  # Make it a global mode
  set_string_mode(bytes);
  substr($string, 2, 4) = $substitute;

  # Make it an object mode
  $string.access_mode(bytes);
  substr($string, 2, 4) = $substitute;

> The word "bytes" is clearly much too long, though, much less
> "graphemes" or "codepoints".  I thought about this:

> substr($string, 2b, 4b) = $substitute;

Problems with:

  substr($string, 0b, 1b) = $substitute;

Is that binary or bytes? Also:

  substr($string, $start b, $end b) = $substitute;

Looks unintuitive.

> With presumably g and c for graphemes and codepoints, but I rather
> suspect that might conflict with some other existing syntax (though I
> can't think of anything in particular).

0c? 0x16c ?

In certain (IMO large) sectors of the Perl community, string processing
is just about all the work there is. I submit that there needs to be a
way to drive the token length to 0: either a pragma, or a global mode,
or a type definition.

> > Though "letters" is a bit inadequate to describe language-dependent
> > graphemes, since it also divides any non-letters...I suppose we
> > could go with .characters if we don't mind forcing a heavily
> > overloaded word in one particular direction, culturally speaking.
> > Except, I'd kinda like to keep them starting with different
> > letters.
> > (And maybe .chars should be reserved to mean whatever the default
> > unit is in the current lexical scope, as with substr() above.)

> You could coin the abbreviation ligs, for Language Independent
> Graphemes.  Then some ingenious rascal can create a pragma or
> whatever that allows $str.b, $str.c, $str.g, and $str.l for
> fans of terseness.

As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

To me, the right thing to do is provide a 'default' way to work, and
allow for changing that default to some other way. The obvious defaults
are 'bytes', which gives C-like behavior (unpopular though that may
presently be) and imposes little or no conceptual strain but likewise
no enormous benefit, and 'graphemes'.

I like graphemes for the default because I hate and fear graphemes. The
whole *code thing just crawls right in my ear, so having the language
transparently support it would be a win. Having the language force me
to understand this stuff, if it cannot be transparently supported,
would also be a win, on a longer time scale.

=Austin
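
The "default that can be changed" idea might look like this in Python. Note the assumption: Python has no lexically scoped pragmas, so this sketch uses a context manager, which is dynamically scoped and therefore closer to the "global mode" variant that is restored on exit. The names `string_units` and `length` are hypothetical.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Default unit used when a call does not name one explicitly.
_default_unit = ContextVar("default_unit", default="codepoints")

@contextmanager
def string_units(unit):
    """Temporarily change the default unit, restoring the old one on exit."""
    token = _default_unit.set(unit)
    try:
        yield
    finally:
        _default_unit.reset(token)

def length(s, unit=None):
    """Length of s in the given unit, or in the current default unit."""
    unit = unit or _default_unit.get()
    if unit == "bytes":
        return len(s.encode("utf-8"))
    if unit == "codepoints":
        return len(s)
    raise ValueError(f"unknown unit: {unit!r}")

s = "na\u00efve"
outside = length(s)                 # default unit: codepoints
with string_units("bytes"):
    inside = length(s)              # default unit: bytes
print(outside, inside, length(s))
```

With a default in place, the per-call token length drops to zero, and an explicit `unit=` is only needed for the exceptions.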


 
Jonadab The Unsightly One
Newsgroups: perl.perl6.language
From: jona...@bright.net (Jonadab The Unsightly One)
Date: Tue, 29 Jun 2004 10:17:38 -0400
Local: Tues, Jun 29 2004 10:17 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

Dan Sugalski <d...@sidhe.org> writes:
>> Hmm. Suppose that I have a system that is friendly to 80 byte
>> records.  I want to output "meaningful" strings, so I want to
>> partition a buffer into 80-ish byte substrings, but preserve any
>> graphemes (i.e., store the data in a legible format).

>> How would I do that?

> You don't. Or if you do, you do it with a lot of pain, sweat, and
> annoying hard work. 80 bytes gets you somewhere between three (And
> this may be a *high* estimate--there may be circumstances where 80
> bytes is insufficient for *one* grapheme) and 80 graphemes.

> This isn't something that can be made generically easy.

It's no worse than implementing word wrap.  Someone will of course
implement it as a generic routine, something along the lines of

my @line = breakunicodestringintobytebufferchunks(
   string => $string,
   chunksize => 80,
   keeptogether => 'graphemes',
   extremelongparts => 'split',
    # 'split' will try to split it at a mostly-reasonable
    #   place if possible, similar to word wrap that looks
    #   for syllable boundaries.
    # 'truncate' would do the same but drop the second part,
    #   rather than putting it in the next line.
    # 'skip' would drop the whole grapheme out.
    # 'allow' would create a line longer (in bytes) than
    #   the chunksize, which is what a lot of word wrap
    #   algorithms do, but would not work if you really
    #   have to fit in a fixed-byte-size buffer.  It would
    #   of course put the thing on a line by itself though,
    #   to minimize the overflow.
   );

There are reasons for doing this, e.g. if you've got Unicode text to
send via a network protocol with an octet-oriented RFC, or if you're
interacting with some legacy C code that has fixed-size buffers.
Someone will write the routine to do as well as can be expected, and
it'll be put on the CPAN, and people who need this sort of thing will
use it.

I don't think the language needs to be designed around it though.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/
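
A small Python sketch of such a routine, covering only the 'allow' policy described above (an oversized grapheme gets a line to itself). The names `graphemes` and `chunk_by_bytes` are hypothetical, UTF-8 is assumed, and grapheme detection is approximated as a base codepoint plus following combining marks.

```python
import unicodedata

def graphemes(s):
    """Approximate grapheme clusters: base codepoint plus combining marks."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def chunk_by_bytes(s, chunksize=80, encoding="utf-8"):
    """Split s into pieces of at most chunksize bytes without breaking a grapheme.

    A single grapheme larger than chunksize is emitted alone ('allow' policy),
    so that one piece may exceed the limit.
    """
    pieces, current, used = [], "", 0
    for g in graphemes(s):
        size = len(g.encode(encoding))
        if current and used + size > chunksize:
            pieces.append(current)          # flush before overflowing
            current, used = "", 0
        current += g
        used += size
    if current:
        pieces.append(current)
    return pieces

text = "e\u0301" * 50                       # 50 graphemes of 3 bytes each
lines = chunk_by_bytes(text, chunksize=80)
print([len(line.encode("utf-8")) for line in lines])
```

The 'split', 'truncate', and 'skip' policies would each need a decision about what to do with the oversized grapheme, which is where the real work lies.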


 
Jonadab The Unsightly One
Newsgroups: perl.perl6.language
From: jona...@bright.net (Jonadab The Unsightly One)
Date: Tue, 29 Jun 2004 10:37:03 -0400
Local: Tues, Jun 29 2004 10:37 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

Austin Hastings <austin_hasti...@yahoo.com> writes:
> A couple of alternatives:

>   substr.bytes($string, 2, 4) = $substitute;

Well, that's arguably better than bsubstr.

>   substr($string.bytes, 2, 4) = $substitute;

I could live with that, although it doesn't allow mixing units.
(Someone will pop in here and say that's to be construed as a
feature.)

>   # Make it a pragma
>   use String(bytes);        
>   substr($string, 2, 4) = $substitute;

I think a pragma should set the default unit for the current lexical
scope, at least.  (The default, in the absence of the pragma, is an
open question; at worst the default could be to throw an exception if
units aren't specified; personally I think throwing exceptions willy
nilly is unPerlish.)

>   # Make it a global mode
>   set_string_mode(bytes);
>   substr($string, 2, 4) = $substitute;

I don't like this.  It's no more useful than the pragma but has bigger
caveats.

>   # Make it an object mode
>   $string.access_mode(bytes);
>   substr($string, 2, 4) = $substitute;

Wouldn't this add extra operations all over the place?

>> The word "bytes" is clearly much too long, though, much less
>> "graphemes" or "codepoints".  I thought about this:

>> substr($string, 2b, 4b) = $substitute;

> Problems with:

>   substr($string, 0b, 1b) = $substitute;

> Is that binary or bytes? Also:

I figured it would conflict with something.

>   substr($string, $start b, $end b) = $substitute;

> Looks unintuitive.

*shrug*.  I chose it because I thought the other way around looked
unintuitive:
substr($string, b $start, b $end) = $substitute;

That looks like calling a function -- which *is* what's going on,
under the hood, but the other way around looks like tagging on units,
which seems more natural to me.

>> With presumably g and c for graphemes and codepoints, but I rather
>> suspect that might conflict with some other existing syntax (though I
>> can't think of anything in particular).

> 0c? 0x16c ?

Ick, yes, I missed that.  (I was thinking only of numbers specified in
decimal.)  I knew there'd be something.

>> codes and graphs is better than codepoints and graphemes, at least.

> In certain (IMO large) sectors of the Perl community, string
> processing is just about all the work there is. I submit that there
> needs to be a way to drive the token length to 0: either a pragma,
> or a global mode, or a type definition.

A pragma should set the default, IMO.  I think what we're talking
about here is what the syntax would be for using a unit other than the
default, or for specifying the units if you haven't used the pragma to
set the default.

>> You could coin the abbreviation ligs, for Language Independent
>> Graphemes.  Then some ingenious rascal can create a pragma or
>> whatever that allows $str.b, $str.c, $str.g, and $str.l for
>> fans of terseness.

> As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

I thought about that, but figured it wasn't a big deal; there are
*lots* of abbreviations with more than one possible interpretation,
and you just deal with having to know which one is meant.  However, it
was then pointed out that it would actually be ldgs, which IMO is
unpronounceable and ugly.  So something else is needed for those.

*shrug*.  Make up a word.  Call them woohickies for all I care and
abbreviate it woo or just w.

> I like graphemes for the default because I hate and fear
> graphemes. The whole *code thing just crawls right in my ear, so
> having the language transparently support it would be a win.

I can see the logic in that.  Personally I don't care what the default
is.  Almost none of my code will need to care one way or the other,
and that which does can use the pragma.

Have the implications of the bytes/codepoints/graphemes/woohickies
distinction for the regular expression engine been discussed already?

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/


 
Austin Hastings
Newsgroups: perl.perl6.language
From: austin_hasti...@yahoo.com (Austin Hastings)
Date: Tue, 29 Jun 2004 08:34:16 -0700 (PDT)
Local: Tues, Jun 29 2004 11:34 am
Subject: Re: The .bytes/.codepoints/.graphemes methods
--- Jonadab the Unsightly One <jona...@bright.net> wrote:

> Have the implications of the bytes/codepoints/graphemes/woohickies
> distinction for the regular expression engine been discussed already?

Not enough.

One of my current clients just rolled on to redhat 9, and what a
steaming pile of digestive byproducts *that* turned out to be.
Apparently the default locale setting changed, so now LC_ALL="" out of
the box.

One effect of this is irritating lack of proper behavior in the
utilities. But when you switch to LC_ALL= <pick your favorite
language>, you just get really slow performance: Apparently the 'C'
locale is such a totally special case that the performance of LC_ALL=C
is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
when the data is 7bit ascii.

I think that (1) this is unacceptable: the temptation to switch to the
'C' locale has been too great, both at this site and on a lot of the RH
support forums; (2) Perl6 should equitably support all its target
locales; (3) we should set out to make sure the performance is damn
fast no matter what locale we're using.

This has no direct bearing on p6l, since performance is a p6i issue.
But perhaps in the interests of performance as well as hackery we
should explicitly provide some sort of variant regex behavior:

    /a./ :bytes
    /a./ :graphemes

where the first would recognize 0x61 followed by any single byte, while
the second would recognize 'a' followed by any number of bytes
composing a single grapheme.

(I'll claim that it's legitimate to want to search for, say, any MBCs
introduced via \x0F\x01, regardless of length. This is likely not
supported any other way.)

=Austin


 
Jonadab The Unsightly One
Newsgroups: perl.perl6.language
From: jona...@bright.net (Jonadab The Unsightly One)
Date: Tue, 29 Jun 2004 11:54:10 -0400
Local: Tues, Jun 29 2004 11:54 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

Juerd <ju...@convolution.nl> writes:
> substr($string, 2 but graphemes, 4 but bytes);

> I think "but" even makes sense, if substr defaults to something.

That could be combined with a smart substr that only needs the units
once (err, only needs a position object for one of the args) and knows
how to convert the other number to the same units (err, same type of
position object):

substr($string, 2, 4 but bytes);

This would still allow for specifying units on both if you for some
reason wanted them different (which, as Dan S points out, sounds like
a bad idea, on the face of it).

:bytes is shorter than but bytes, though.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/


 
Jonathan Scott Duff
Newsgroups: perl.perl6.language
From: d...@lighthouse.tamucc.edu (Jonathan Scott Duff)
Date: Tue, 29 Jun 2004 10:52:34 -0500
Local: Tues, Jun 29 2004 11:52 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
> This has no direct bearing on p6l, since performance is a p6i issue.
> But perhaps in the interests of performance as well as hackery we
> should explicitly provide some sort of variant regex behavior:

>     /a./ :bytes
>     /a./ :graphemes

> where the first would recognize 0x61 followed by any single byte, while
> the second would recognize 'a' followed by any number of bytes
> composing a single grapheme.

Isn't that what :u0, :u1, :u2, and :u3 are for?

            :u0         # use bytes       (. is byte)
            :u1         # level 1 support (. is codepoint)
            :u2         # level 2 support (. is grapheme)
            :u3         # level 3 support (. is language dependent)

        These modifiers say nothing about the state of the data, but in
        general internal Perl data will already be in Normalization Form
        C, so even under :u1, the precomposed characters will usually do
        the right thing. Note that these modifiers are for overriding
        the default support level, which was probably set by pragma at
        the top of the file.

Or was that to imply that a literal "a" in the RE would be
interpreted as a "grapheme a" when :u2 is active?
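
For illustration, the difference between the byte and codepoint levels can be shown with Python's `re` module, where a pattern compiled from `bytes` matches byte by byte and a pattern compiled from `str` matches codepoint by codepoint. This is an analogy for :u0 and :u1, assuming UTF-8 data; it is not how the Perl 6 modifiers are implemented. The standard `re` module has no grapheme-level `.`, so the last case only shows what :u2 would have to fix.

```python
import re

text = "a\u00e9"               # 'a' followed by precomposed e-acute (U+00E9)
data = text.encode("utf-8")    # b'a\xc3\xa9'

# Byte level: '.' consumes one byte, which is only half of the e-acute.
byte_match = re.match(rb"a.", data, re.DOTALL)

# Codepoint level: '.' consumes one codepoint, so the e-acute is taken whole.
codepoint_match = re.match(r"a.", text, re.DOTALL)

# Codepoint level on decomposed data: '.' takes the base 'e' and leaves
# the combining acute accent (U+0301) behind.
decomposed_match = re.match(r"a.", "ae\u0301", re.DOTALL)
```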

-Scott
--
Jonathan Scott Duff                     Division of Nearshore Research
d...@lighthouse.tamucc.edu             Senior Systems Analyst II


 
Matt Diephouse
Newsgroups: perl.perl6.language
From: m...@diephouse.com (Matt Diephouse)
Date: Wed, 30 Jun 2004 20:51:58 -0400
Local: Wed, Jun 30 2004 8:51 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods

Larry Wall wrote:
> On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote:
> : Issues:
> :   * Limits lvalue substr (doesn't allow it to be a different size)
> :     unless splice is used (or a substr method is also provided).

> That all has to be looked at anyway.  What does "5" mean when you
> pass it to substr, anyway?  (I've been trying to make it assume some
> implicit unit based on the current lexical scope's Unicode level,
> but issues remain.)  We have magical string positions that have
> different numeric values depending on what units you view them as,
> but at what point does a number like "5" get translated to such
> a magical string position?

While we're on the topic of substr, allow me to beg. Please, can we
replace substr with array-style operations like Ruby and Python?
Please? Something like this would be nice:

  my $string = "Hello, World!";
  say $string[0..4]; # prints "Hello\n"
  $string[7...] = "Larry!";
  say $string; # prints "Hello, Larry!\n"

We already have our strings acting as objects, and we have [] as a
postcircumfix operator, so it's something that someone could define
easily. Of course, I have no idea how to reconcile this with all the
talk of Unicode other than to say that the easy stuff should be easy.
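
A rough Python sketch of what such an array-style string could look like, for illustration. Python strings are immutable, so the hypothetical `MutableString` class below keeps a list with one element per codepoint; the class name and the codepoint granularity are assumptions. Python slices exclude the end index, so Perl's `0..4` becomes `0:5`.

```python
class MutableString:
    """A string that supports slice reads and slice assignment."""

    def __init__(self, text):
        self._chars = list(text)  # one list element per codepoint

    def __getitem__(self, index):
        piece = self._chars[index]
        return "".join(piece) if isinstance(index, slice) else piece

    def __setitem__(self, index, value):
        if not isinstance(index, slice):
            # Normalize a single index (including negatives) to a slice.
            i = range(len(self._chars))[index]
            index = slice(i, i + 1)
        # The replacement may be longer or shorter than the span it replaces.
        self._chars[index] = list(value)

    def __str__(self):
        return "".join(self._chars)


string = MutableString("Hello, World!")
first_word = string[0:5]
string[7:] = "Larry!"
```

After these statements `str(string)` is "Hello, Larry!", which mirrors the lvalue-substr behaviour asked for above.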

It just follows this would also be nice for arrays, to replace splice.
For me, these two functions are the most bothersome part of Perl 5, and
I would love to see them go.

matt


 
Juerd
Newsgroups: perl.perl6.language
From: ju...@convolution.nl (Juerd)
Date: Thu, 1 Jul 2004 12:59:34 +0200
Local: Thurs, Jul 1 2004 6:59 am
Subject: Re: The .bytes/.codepoints/.graphemes methods
Matt Diephouse wrote 2004-06-30 20:51 (-0400):

>  my $string = "Hello, World!";
>  say $string[0..4]; # prints "Hello\n"
>  $string[7...] = "Larry!";
>  say $string; # prints "Hello, Larry!\n"

And that "array" is one of bytes? graphemes?

In general, I like the idea. In <40DDCE2A.1080...@brentdax.com>, almost
the same was suggested, but implemented differently: a string's .bytes
method in list context (but isn't it array context, technically?) would
dwym. As would the other parts-of-string methods.

Perhaps, without a method, the string in array/list context can default to
the default set by a lexical pragma. Which, I hope, has a default
itself. (I like default defaults...)

Juerd


 
Matt Diephouse
Newsgroups: perl.perl6.language
From: m...@diephouse.com (Matt Diephouse)
Date: Thu, 01 Jul 2004 07:29:43 -0400
Local: Thurs, Jul 1 2004 7:29 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

Juerd wrote:
> Matt Diephouse wrote 2004-06-30 20:51 (-0400):

>> my $string = "Hello, World!";
>> say $string[0..4]; # prints "Hello\n"
>> $string[7...] = "Larry!";
>> say $string; # prints "Hello, Larry!\n"

> And that "array" is one of bytes? graphemes?

I'm not really up on my Unicode, but I think .chars is what I have in
mind. I want it to operate like a non-Unicode string in Perl 5. Anything
Unicode can be more complex, as I think this will be the common case.

> In general, I like the idea. In <40DDCE2A.1080...@brentdax.com>, almost
> the same was suggested, but implemented differently: a string's .bytes
> method in list context (but isn't it array context, technically?) would
> dwym. As would the other parts-of-string methods.

Think of this as Huffmanized .chars then?

matt


 
John Williams
Newsgroups: perl.perl6.language
From: willi...@tni.com (John Williams)
Date: Thu, 1 Jul 2004 14:15:24 -0600 (MDT)
Local: Thurs, Jul 1 2004 4:15 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods

On Thu, 1 Jul 2004, Juerd wrote:
> Matt Diephouse wrote 2004-06-30 20:51 (-0400):
> >  my $string = "Hello, World!";
> >  say $string[0..4]; # prints "Hello\n"
> >  $string[7...] = "Larry!";
> >  say $string; # prints "Hello, Larry!\n"

> And that "array" is one of bytes? graphemes?

> In general, I like the idea. In <40DDCE2A.1080...@brentdax.com>, almost
> the same was suggested, but implemented differently: a string's .bytes
> method in list context (but isn't it array context, technically?) would
> dwym. As would the other parts-of-string methods.

What if you could add the slice onto the method:

  my $string = "Hello, World!";
  say $string.bytes[0..4]; # prints "Hello\n"
  $string.codepoints[7...] = "Søren!";
  say $string; # prints "Hello, Søren!\n"

The string slicing operator would have to return an array of
bytes/codepoints/etc in list context and a substr in scalar context.
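
One way to picture slicing through a unit method is the Python sketch below. The names `UnitString` and `_View`, the UTF-8 encoding, and the restriction to slices are all assumptions made for illustration. Python has no list/scalar context, so the views always return the joined value (a `bytes` object or a `str`) rather than an array.

```python
class _View:
    """Slice access to a UnitString, counted in one particular unit."""

    def __init__(self, owner, unit):
        self._owner = owner
        self._unit = unit

    def _units(self):
        text = self._owner.text
        if self._unit == "bytes":
            return list(text.encode("utf-8"))  # list of ints
        return list(text)                      # list of 1-codepoint strings

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        piece = self._units()[index]
        return bytes(piece) if self._unit == "bytes" else "".join(piece)

    def __setitem__(self, index, value):
        if not isinstance(index, slice):
            raise TypeError("this sketch only supports slices")
        units = self._units()
        units[index] = list(value)
        if self._unit == "bytes":
            # Fails with UnicodeDecodeError if the edit splits a character.
            self._owner.text = bytes(units).decode("utf-8")
        else:
            self._owner.text = "".join(units)


class UnitString:
    def __init__(self, text):
        self.text = text
        self.bytes = _View(self, "bytes")
        self.codepoints = _View(self, "codepoints")


string = UnitString("Hello, World!")
head = string.bytes[0:5]
string.codepoints[7:] = "S\u00f8ren!"
```

The assignment replaces six codepoints but grows the byte view by one, because "ø" takes two bytes in UTF-8.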

~ John Williams


 
Aaron Sherman
Newsgroups: perl.perl6.language
From: a...@ajs.com (Aaron Sherman)
Date: Fri, 02 Jul 2004 16:50:01 -0400
Local: Fri, Jul 2 2004 4:50 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods

On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:
> [...] when you switch to LC_ALL= <pick your favorite
> language>, you just get really slow performance: Apparently the 'C'
> locale is such a totally special case that the performance of LC_ALL=C
> is one or more orders of magnitude better than LC_ALL=en_US.UTF-8, even
> when the data is 7bit ascii.

Well, of course. I can't imagine a way in which this would not be true.

After all, in LC_ALL="C" the number of characters in a string is equal
to the number of bytes in the string. In LC_ALL="en_US.UTF-8" the length
of a string is dependent on what exactly you mean by length, and a lot
of special cases arise. Special cases and context mean you have more
code to execute for the same logical task, which means you have more
processing to do.

Unicode support is expensive, even if you're just doing ASCII-as-UTF-8.
That doesn't mean it's a bad thing to do, it's just that it's expensive.
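
To make "what exactly you mean by length" concrete, here is a small Python sketch that counts one string three ways. The function name `lengths` is hypothetical, and the grapheme count is only an approximation (it counts codepoints that are not combining marks); full grapheme segmentation follows more rules than this.

```python
import unicodedata


def lengths(text, encoding="utf-8"):
    """Return (bytes, codepoints, approximate graphemes) for `text`."""
    n_bytes = len(text.encode(encoding))
    n_codepoints = len(text)
    # Approximation: every codepoint with combining class 0 starts a grapheme.
    n_graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return n_bytes, n_codepoints, n_graphemes


ascii_lengths = lengths("abc")          # all three counts agree
accent_lengths = lengths("e\u0301")     # 'e' plus combining acute accent
```

For pure ASCII the three numbers coincide, which is the shortcut the "C" locale gets to take; for "e" plus a combining accent they are 3, 2, and 1.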

> I think that (1) this is unacceptable: the temptation to switch to the
> 'C' locale has been too great, both at this site and on a lot of the RH
> support forums;

And yet, in English-speaking countries (and Hawaiian and
Swahili-speaking countries for that matter) and in situations where the
fidelity of certain types of string data (such as names) is not
considered critical, this is a fine default. e.g. for general shell
work.

> (2) Perl6 should equitably support all its target
> locales; (3) we should set out to make sure the performance is damn
> fast no matter what locale we're using.

Well, that's a nice theory, but you can prove that low-level encodings
(e.g. ASCII, EBCDIC) will be more efficient than high-level encodings
(e.g. UTF-8), so the only way to accomplish what you suggest in (2) is
to break (3) by slowing down the faster handling (not what you wanted,
I'm sure).

Of course, you want to have as much performance out of string handling
as possible.

> This has no direct bearing on p6l, since performance is a p6i issue.
> But perhaps in the interests of performance as well as hackery we
> should explicitly provide some sort of variant regex behavior:

>     /a./ :bytes
>     /a./ :graphemes

As pointed out by others, this is already there, though I'm not sure
that it would be specified that way. More likely:

        m :u0 /a./
        [etc]

--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Perl Toolsmith
http://www.ajs.com/~ajs/resume.html


 
Brent 'Dax' Royal-Gordon
Newsgroups: perl.perl6.language
From: br...@brentdax.com (Brent 'Dax' Royal-Gordon)
Date: Sat, 03 Jul 2004 03:37:43 -0700
Local: Sat, Jul 3 2004 6:37 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

Aaron Sherman wrote:
> On Tue, 2004-06-29 at 11:34, Austin Hastings wrote:
>>(2) Perl6 should equitably support all its target
>>locales; (3) we should set out to make sure the performance is damn
>>fast no matter what locale we're using.

> Well, that's a nice theory, but you can prove that low-level encodings
> (e.g. ASCII, EBCDIC) will be more efficient than high-level encodings
> (e.g. UTF-8), so the only way to accomplish what you suggest in (2) is
> to break (3) by slowing down the faster handling (not what you wanted,
> I'm sure).

At the Parrot level, codepoint operations will generally be the most
efficient, even on strings with exotic charsets.  Parrot uses an
internal encoding that allows O(1) access to codepoints; essentially, it
uses an array of 8-, 16-, or 32-bit integers, depending on the highest
codepoint value.  This is the default even for character sets with shift
characters, like Shift-JIS.

On strings where all codepoints have values under 256, bytewise and
codepointwise lookup are equivalent; otherwise, though, bytewise lookup
will actually be *slower* than codepointwise, as Parrot will maintain
the illusion that each codepoint is stored in an integer that's the
perfect size for it.
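
The storage scheme described above can be sketched in Python as follows. This illustrates the general technique (pick the narrowest fixed-width integer that fits the highest codepoint), not Parrot's actual code; the function name `pack_fixed_width` is hypothetical, and the `array` type codes guarantee only minimum widths (`"L"` is at least 4 bytes).

```python
from array import array


def pack_fixed_width(text):
    """Store `text` as fixed-width integers, one per codepoint."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        typecode = "B"   # 1 byte per codepoint
    elif highest < 0x10000:
        typecode = "H"   # 2 bytes per codepoint on common platforms
    else:
        typecode = "L"   # at least 4 bytes per codepoint
    return array(typecode, map(ord, text))


narrow = pack_fixed_width("abc")
medium = pack_fixed_width("a\u20ac")        # contains the euro sign
wide = pack_fixed_width("a\U0001F600")      # contains a non-BMP codepoint
```

Because every element has the same width, indexing such an array by codepoint is a constant-time operation, at the cost of extra memory once a single high codepoint appears.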

If you force Parrot to use the UTF-8 encoding internally then bytewise
lookup becomes fastest, and codepointwise slows down a lot.  But you
really shouldn't do that--UTF-8 is ill-suited for actually
*manipulating* text, unlike the Parrot internal encodings.  (UTF-16 and
UTF-32 will presumably be available too, although I've seen no specific
mention of them.)

You can also force it to use a "raw" or "bytes" encoding, where bytes
and codepoints are identical.  But you can't store Unicode characters in
such a string and have them behave in a reasonable way.

(Note: this is all based on my own, possibly false, memory.)

--
Brent "Dax" Royal-Gordon <br...@brentdax.com>
Perl and Parrot hacker

Oceania has always been at war with Eastasia.


 
Larry Wall
Newsgroups: perl.perl6.language
From: la...@wall.org (Larry Wall)
Date: Wed, 7 Jul 2004 20:09:51 -0700
Local: Wed, Jul 7 2004 11:09 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods
On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:

: On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: > This has no direct bearing on p6l, since performance is a p6i issue.
: > But perhaps in the interests of performance as well as hackery we
: > should explicitly provide some sort of variant regex behavior:
: >
: >     /a./ :bytes
: >     /a./ :graphemes
: >
: > where the first would recognize 0x61 followed by any single byte, while
: > the second would recognize 'a' followed by any number of bytes
: > composing a single grapheme.
:
: Isn't that what :u0, :u1, :u2, and :u3 are for?
:
:           :u0         # use bytes       (. is byte)
:           :u1         # level 1 support (. is codepoint)
:           :u2         # level 2 support (. is grapheme)
:           :u3         # level 3 support (. is language dependent)

These modifiers might get renamed to match whatever b/c/g/w convention
we come up with for pragmas.  The levels aren't all that intuitive, though
there is a kind of progression of semantic complexity that would get
lost with ordinary names.

:         These modifiers say nothing about the state of the data, but in
:         general internal Perl data will already be in Normalization Form
:         C, so even under :u1, the precomposed characters will usually do
:         the right thing.

These days it might be that most of the data we see will be maximally
decomposed rather than maximally composed.  But the jury is still out
on that.  And in any event, :u2 and :u3 should hide that distinction.
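
For illustration, the composed/decomposed distinction looks like this in Python, using the standard `unicodedata` module. The variable names are arbitrary; NFC and NFD are the Unicode normalization forms referred to above.

```python
import unicodedata

# Maximally composed (NFC): one codepoint, U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")

# Maximally decomposed (NFD): base 'e' followed by combining acute U+0301.
decomposed = unicodedata.normalize("NFD", "\u00e9")

# At the codepoint level the two forms differ in length and compare unequal.
same_codepoints = composed == decomposed

# After normalizing both to one form they compare equal.
same_after_nfc = composed == unicodedata.normalize("NFC", decomposed)
```

At the codepoint level the two spellings are different strings; a grapheme-level view is what lets them be treated as the same character.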

:         Note that these modifiers are for overriding
:         the default support level, which was probably set by pragma at
:         the top of the file.

Another way of saying that is that these modifiers are, in fact,
lexically scoped pragmas with the *exact* same effect as the ordinary
Unicode level pragmas.  It's just that they're lexically scoped to
the rest of a rule or group rather than to the rest of a block.

: Or was that to imply that a literal "a" in the RE would be
: interpreted as a "grapheme a" when :u2 is active?

I don't know what you mean by "grapheme a" there.  If you mean, "Does
it match any grapheme that happens to be exactly U+0061?", then the
answer is yes.  If you mean "Does it wildcard to any grapheme that uses
U+0061 as the base character?", then the answer is probably no.  We
have not yet come up with a syntax for that kind of wildcarding, other
than dropping down to codepoints [:u1 a \pM+] or some such.  That may
or may not be sufficient.  It'd be pretty easy to define a <like a>
assertion in any case.
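
A Python sketch of what a "like a" test might do, for illustration. The function name `like` echoes the assertion name floated above, but its semantics here are an assumption: a grapheme counts as "like" the base if its NFD decomposition is the base character followed by zero or more codepoints in a Mark category. Python's standard `re` has no `\pM` class, so the check uses `unicodedata.category` instead.

```python
import unicodedata


def is_mark(ch):
    """True for codepoints in a Unicode Mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def like(base, grapheme):
    """True if `grapheme` is `base` alone or `base` plus combining marks."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return decomposed[:1] == base and all(is_mark(ch) for ch in decomposed[1:])
```

So `like("a", "\u00e5")` holds for a-with-ring, while `like("a", "ab")` does not, since "b" is not a mark.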

Larry


 
Larry Wall
Newsgroups: perl.perl6.language
From: la...@wall.org (Larry Wall)
Date: Wed, 7 Jul 2004 20:15:30 -0700
Local: Wed, Jul 7 2004 11:15 pm
Subject: Re: The .bytes/.codepoints/.graphemes methods
On Wed, Jul 07, 2004 at 08:09:51PM -0700, Larry Wall wrote:

: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: : > This has no direct bearing on p6l, since performance is a p6i issue.
: : > But perhaps in the interests of performance as well as hackery we
: : > should explicitly provide some sort of variant regex behavior:
: : >
: : >     /a./ :bytes
: : >     /a./ :graphemes
: : >
: : > where the first would recognize 0x61 followed by any single byte, while
: : > the second would recognize 'a' followed by any number of bytes
: : > composing a single grapheme.
: :
: : Isn't that what :u0, :u1, :u2, and :u3 are for?
: :
: :         :u0         # use bytes       (. is byte)
: :         :u1         # level 1 support (. is codepoint)
: :         :u2         # level 2 support (. is grapheme)
: :         :u3         # level 3 support (. is language dependent)
:
: These modifiers might get renamed to match whatever b/c/g/w convention
: we come up with for pragmas.  The levels aren't all that intuitive, though
: there is a kind of progression of semantic complexity that would get
: lost with ordinary names.

On the flip side, a good reason to get rid of the numeric values is
that in all likelihood people will continually make the mistake of
thinking :u1 means "one byte at a time" and :u2 means "two bytes at
a time".  And then they'll wonder why :u4 doesn't give them UTF-32...

Larry


 
Austin Hastings
Newsgroups: perl.perl6.language
From: austin_hasti...@yahoo.com (Austin Hastings)
Date: Thu, 8 Jul 2004 07:35:44 -0700 (PDT)
Local: Thurs, Jul 8 2004 10:35 am
Subject: Re: The .bytes/.codepoints/.graphemes methods
--- Larry Wall <la...@wall.org> wrote:

> On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:

> : Or was that to imply that a literal "a" in the RE would be
> : interpreted as a "grapheme a" when :u2 is active?

> I don't know what you mean by "grapheme a" there.  If you mean, "Does
> it match any grapheme that happens to be exactly U+0061?", then the
> answer is yes.  

In my original question, I meant to differentiate between 'grapheme'
and 'possible component of a multibyte expression'.

> If you mean "Does it wildcard to any grapheme that uses
> U+0061 as the base character?", then the answer is probably no.  We
> have not yet come up with a syntax for that kind of wildcarding,
> other than dropping down to codepoints [:u1 a \pM+] or some such.
> That may or may not be sufficient.  It'd be pretty easy to define a
> <like a> assertion in any case.

I think this is something that we'll want as a "mode", a la
case-insensitivity. Think of it as "mark insensitivity."

I'm not sure if this should be language/locale dependent or not, but a
basic search feature for text is "fre'd" -> "fred".

Maybe it can just roll into :i?
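
A minimal Python sketch of "mark insensitivity", for illustration. The helper names `strip_marks` and `mark_insensitive_equal` are hypothetical. The approach (decompose, then drop every codepoint in a Mark category) is one simple, locale-independent reading of the idea; it does not handle letters such as "ø" that have no decomposition.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks so that accented letters match their base."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def mark_insensitive_equal(a, b, ignore_case=False):
    """Compare two strings while ignoring marks, and optionally case."""
    a, b = strip_marks(a), strip_marks(b)
    if ignore_case:
        a, b = a.casefold(), b.casefold()
    return a == b
```

Keeping case folding as a separate flag leaves open whether the two modes should be merged into :i or stay independent.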

=Austin


 