Message from discussion The .bytes/.codepoints/.graphemes methods
Newsgroups: perl.perl6.language
From: jona...@bright.net (Jonadab The Unsightly One)
Date: Mon, 28 Jun 2004 11:26:32 -0400
Local: Mon, Jun 28 2004 11:26 am
Subject: Re: The .bytes/.codepoints/.graphemes methods

Larry Wall <la...@wall.org> writes:
> That all has to be looked at anyway.  What does "5" mean when you
> pass it to substr, anyway?  

I was just going to ask about substrings, and then didn't because I
figured that had been hashed out already and I'd missed it...

> (I've been trying to make it assume some implicit unit based on the
> current lexical scope's Unicode level, but issues remain.)  We have
> magical string positions that have different numeric values
> depending on what units you view them as, but at what point does a
> number like "5" get translated to such a magical string position?

It would be possible to have right-associative operators (that bind at
least more tightly than comma, and possibly very tightly) that convert a
number to one of these objects, so that we can do stuff like this:

substr($string, 2 bytes, 4 bytes) = $substitute;

Then if you pass a plain number to substr it could either assume
something (possibly generating a warning) or spit an error, depending
on some feature of the current lexical scope.
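As a rough illustration of that idea, here is a small sketch in Python
(the syntax above is only a proposal, so it cannot be run as written).
The names Bytes, Codes, substr and the strict flag are all made up for
this example; strict stands in for "some feature of the current lexical
scope", and the string is assumed to be stored as UTF-8.  It only
extracts a substring and does not model assignment to substr.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class Bytes:
    """Hypothetical unit tag: a count of UTF-8 bytes."""
    n: int


@dataclass(frozen=True)
class Codes:
    """Hypothetical unit tag: a count of codepoints."""
    n: int


def to_code_offset(s, pos, strict=False):
    """Translate a unit-tagged count into a codepoint offset into s."""
    if isinstance(pos, Codes):
        return pos.n
    if isinstance(pos, Bytes):
        # Decode the first pos.n bytes; this raises UnicodeDecodeError
        # if the byte count lands in the middle of a character.
        return len(s.encode("utf-8")[:pos.n].decode("utf-8"))
    if strict:
        # Assumed policy: the enclosing scope forbids untagged numbers.
        raise TypeError("plain number passed as a string position")
    # Assumed policy: fall back to codepoints, but say so.
    warnings.warn("plain number treated as a codepoint count")
    return pos


def substr(s, start, length, strict=False):
    """Return the substring at start, of the given length, in tagged units."""
    begin = to_code_offset(s, start, strict)
    rest = s[begin:]
    return rest[:to_code_offset(rest, length, strict)]
```

With s = "na\u00efve caf\u00e9", substr(s, Codes(2), Codes(4)) gives
four codepoints ("\u00efve "), while substr(s, Bytes(2), Bytes(4)) gives
only three ("\u00efve"), because the i-diaeresis takes two bytes.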

The word "bytes" is clearly much too long, though, to say nothing of
"graphemes" or "codepoints".  I thought about this:

substr($string, 2b, 4b) = $substitute;

With presumably g and c for graphemes and codepoints, but I rather
suspect that might conflict with some other existing syntax (though I
can't think of anything in particular).

And I can't think of another abbreviation that would be remotely
intuitive.

There's also the possibility of bsubstr and so on, but that leads us
down the path of C, having a hillion bajillion functions with names
like fgets, stoi, and fstrnclost.  Having sprintf is quite enough of
that, IMO.

> I dunno--it reads pretty well.  Maybe these'll be heavily enough
> used that we should Huffmanize them down a bit:

>     $str.bytes
>     $str.codes
>     $str.graphs
>     $str.letters

codes and graphs are better than codepoints and graphemes, at least.
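
For anyone unsure how the first three units differ, a minimal sketch in
Python (not Perl 6) follows.  The function names are invented to mirror
the method names quoted above; the byte count assumes UTF-8, and the
grapheme count is only an approximation that skips combining marks
rather than doing full Unicode grapheme-cluster segmentation.

```python
import unicodedata


def bytes_len(s):
    """Length in bytes, assuming the string is encoded as UTF-8."""
    return len(s.encode("utf-8"))


def codes_len(s):
    """Length in codepoints (what Python's len() already counts)."""
    return len(s)


def graphs_len(s):
    """Approximate length in graphemes: base characters only.

    Codepoints with a nonzero canonical combining class are treated as
    attached to the preceding character.  This is a simplification.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)
```

For the two-codepoint string "e\u0301" (an e followed by a combining
acute accent), bytes_len gives 3, codes_len gives 2, and graphs_len
gives 1.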

> Though "letters" is a bit inadequate to describe language-dependent
> graphemes, since it also divides any non-letters...I suppose we
> could go with .characters if we don't mind forcing a heavily
> overloaded word in one particular direction, culturally speaking.
> Except, I'd kinda like to keep them starting with different letters.
> (And maybe .chars should be reserved to mean whatever the default
> unit is in the current lexical scope, as with substr() above.)

You could coin the abbreviation ligs, for Language Independent
Graphemes.  Then some ingenious rascal can create a pragma or whatever
that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.

--
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/


 