Question about list context for String.chars

Gcomnz

Apr 11, 2005, 2:03:47 PM

to perl6-l...@perl.org

Hi all,

I'm writing a bunch of examples for the Perl 6 PLEAC and it seems rather
natural to expect $string.chars to return a list of Unicode chars in
list context; however, I can't find anything to confirm that. (The
other alternatives are split and unpack.)

# unpack
@array = unpack("C*", $string);
# split
@array = split /./, $string;
# this too?
@array = $string.split(/./);
# and how about this?
@array = $string.chars;
# and this explicit list context?
@array = $string.chars[];

Thanks,

Marcus

Ingo Blechschmidt

Apr 11, 2005, 2:12:42 PM

to perl6-l...@perl.org

Hi,

gcomnz wrote:
> I'm writing a bunch of examples for perl 6 pleac and it seems rather
> natural to expect $string.chars to return a list of unicode chars in
> list context, however I can't find anything to confirm that. (The
> other alternatives being split and unpack.)

I like that.

If one wanted to have the *number* of chars/graphemes/whatever, one
could still use the cheap unary "+" operator.

And .keys, .values, .pairs, etc. don't return a plain number, but actual
contents, too (consistency!).

--Ingo

--
Linux, the choice of a GNU | Wissen ist Wissen, wo man es findet.
generation on a dual AMD |
Athlon! |

Aaron Sherman

Apr 11, 2005, 3:21:00 PM

to Ingo Blechschmidt, Perl6 Language List

On Mon, 2005-04-11 at 14:12, Ingo Blechschmidt wrote:

> gcomnz wrote:
> > I'm writing a bunch of examples for perl 6 pleac and it seems rather
> > natural to expect $string.chars to return a list of unicode chars in
> > list context, however I can't find anything to confirm that. (The
> > other alternatives being split and unpack.)
>
> I like that.

Same here, though I have to admit that I'm slow on this whole Unicode
thing, so I'm not sure what you mean by "Unicode chars". For example,
are you expecting to get "f", "f", "i" or "ﬃ" back when you say
"ﬃ".chars? More interestingly, what about all of the Arabic ligatures
which someone who speaks that language might reasonably expect to get
back as multiple "chars", but they have their own Unicode codepoint
(e.g. ﳳ which is "U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM"
which you might expect to get "ﹸ", "ﹽ" from)? Any Arabic speakers to
confirm or deny this behavior of ligatures?

Please be aware, I'm talking about ligatures above, NOT special letters
such as "æ", which are their own letters, and cannot be decomposed into
"a", "e" without losing information.
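A minimal sketch of the ligature-versus-letter distinction, written in Python because the Perl 6 string API was still undecided at this point in the thread. It relies on Unicode compatibility normalization (NFKD) from the standard `unicodedata` module. The helper name `compat_chars` is hypothetical and is not a proposal for what `.chars` should do.

```python
import unicodedata

def compat_chars(s: str) -> list[str]:
    """Split s into code points after compatibility decomposition (NFKD)."""
    return list(unicodedata.normalize("NFKD", s))

# U+FB03 LATIN SMALL LIGATURE FFI has a compatibility mapping to "f", "f", "i".
ffi = compat_chars("\ufb03")

# U+00E6 LATIN SMALL LETTER AE is a letter in its own right: no decomposition.
ae = compat_chars("\u00e6")

print(ffi, ae)
```

Whether a given Arabic presentation form decomposes the way a native reader would expect depends on the mappings in the Unicode Character Database, so that case is best checked per character rather than assumed.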

Given Parrot, what happens when you are presented with a Big5 string
that does not have a strict Unicode equivalent? Does .chars throw an
exception, or does it rely on the string to know how to "characterify
itself" according to its vtable?

--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback

Aaron Sherman

Apr 11, 2005, 3:55:19 PM

to gcomnz, Perl6 Language List

On Mon, 2005-04-11 at 15:40, gcomnz wrote:
> I have to say I'm slightly confused too for some languages,
> especially for syllabic alphabets. At the same time, I'm pretty clear
> for CJK, Syllabaries, and alphabets, or at least I hope I'm clear (I
> guess I'm about to find out), .chars just returns the right unicode
> level for whatever the string contents requires.

> "abc".chars would return <a b c>, which I'm guessing would be
> bytesize usually.

Fair enough.

> "日本語".chars would return <日　本　語>, which can probably be expressed with
> UTF8?

I think you're confusing UTF8 (which can represent ALL Unicode
characters) and "the UTF8 subset which consists of one-byte
representations" (which happens to overlap with 7-bit ASCII).

> From Apocalypse 5: "Under level 2 Unicode support, a character
> is assumed to mean a grapheme, that is, a sequence consisting of a
> base character followed by 0 or more combining characters."
> Marcus

Hmmm... that doesn't answer the ligature question clearly though. That
answers for the case of combining diacritical marks:

http://en.wikipedia.org/wiki/Combining_diacritical_mark

e.g. <A ̀> vs "À", which is a pre-combined example, but there are (as I
understand it) many valid combinations that do not have a pre-combined
representation in Unicode.

But not for ligatures:

http://en.wikipedia.org/wiki/Ligature_%28typography%29

which are, by definition, actually two or more unique characters which
have a special typographical representation when adjacent. So, they are
a single grapheme, but like I said: certain cultures would be shocked by
a .chars that did not decompose their ligatures (and again, I'm mostly
thinking Arabic, so I'd defer to someone who actually spoke Arabic and
knows how they deal with this).
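The combining-mark case mentioned earlier in this message can be sketched in Python with canonical composition (NFC) from the standard `unicodedata` module. This only illustrates the precomposed-versus-combining distinction; the variable names are illustrative, and nothing here reflects a decided Perl 6 behavior.

```python
import unicodedata

# "A" followed by U+0300 COMBINING GRAVE ACCENT: two code points.
decomposed = "A\u0300"

# NFC composes the pair into the single precomposed code point U+00C0.
composed = unicodedata.normalize("NFC", decomposed)

# Unicode has no precomposed "q with grave", so NFC leaves two code points,
# even though a reader sees one character.
no_precomposed = unicodedata.normalize("NFC", "q\u0300")

print(len(decomposed), len(composed), len(no_precomposed))
```

A code-point count therefore depends on normalization for some sequences and can never reach 1 for others, which is the gap a grapheme-level count is meant to close.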

Mark Reed

Apr 11, 2005, 3:53:32 PM

to gcomnz, Aaron Sherman, Ingo Blechschmidt, Perl6 Language List

On 2005-04-11 15:40, "gcomnz" <gco...@gmail.com> wrote:
> "日本語".chars would return <日　本　語>, which can probably be expressed
> with UTF8?

The string "日本語" is probably represented internally as UTF-8, but that
should have no effect on what .chars returns, which should, indeed, be
<日　本　語>, that is, an array whose elements are strings which each
represent one Unicode code point – irrespective of encoding.

I think that, in general, at the level of Perl code, 1 “character” should be
one code point, and any higher-level support for combining and splitting
should be outside the core, in Unicode::Whatever.
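A short Python sketch of the "irrespective of encoding" point: the list of code points is the same whatever byte encoding is used for storage, while the byte counts differ. Python's `str` is a sequence of code points, which happens to match the model described here. The variable names are illustrative only.

```python
s = "日本語"

# Iterating a Python str yields one element per Unicode code point.
code_points = list(s)

# The same three code points occupy different numbers of bytes per encoding.
utf8_len = len(s.encode("utf-8"))       # 3 bytes per character here
utf16_len = len(s.encode("utf-16-le"))  # 2 bytes per character here

print(code_points, utf8_len, utf16_len)
```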

Gcomnz

Apr 11, 2005, 4:08:04 PM

to Aaron Sherman, Perl6 Language List

> > "abc".chars would return <a b c>, which I'm guessing would be
> > bytesize usually.
>
> Fair enough.
>
> > "日本語".chars would return <日　本　語>, which can probably be expressed with
> > UTF8?
>
> I think you're confusing UTF8 (which can represent ALL Unicode
> characters) and "the UTF8 subset which consists of one-byte
> representations" (which happens to overlap with 7-bit ASCII).

Perhaps my confusion is that I thought, perhaps wrongly, that since
.chars returns a count that is appropriate for the given unicode
level, that would mean that if it were able to return a list in list
context then it would be with the right storage size as needed for the
given string contents. For instance, <a b c> just requires bytes for
each element, while Kanji would require more. I'm leaving very wide
room open here for me really misunderstanding how all this works.

>
> > From Apocalypse 5: "Under level 2 Unicode support, a character
> > is assumed to mean a grapheme, that is, a sequence consisting of a
> > base character followed by 0 or more combining characters."
> > Marcus
>
> Hmmm... that doesn't answer the ligature question clearly though. That
> answers for the case of combining diacritical marks:

I read "followed by 0 or more combining characters" to mean that it is
smart enough to combine the vowels in Arabic and other syllabic
alphabets that use special conjuncts. However I'm also not exactly
sure if that's even reasonably possible, or even if it makes sense in
the counting of "characters" for languages that use those.

Rod Adams

Apr 11, 2005, 11:57:52 PM

to perl6-l...@perl.org

gcomnz wrote:

Well, in general the word "chars" has come to mean whatever a character
is in the current lexical scope, typically a language level char.

It had previously been decided that C<.chars>, etc. would return the
length. I'm not about to change that without approval from @Larry.

I don't see any technical problem with saying that C<.chars> returns an
array of those chars, which then gets converted to the length of the array
in scalar context. The "creating a list just to get length" can of course
be optimized away.

My main issue is that it gives two rather different semantics to
the same method name, and leaves it to what amounts to context-based
dispatching. So I don't like this idea as written.

However, I do like the idea of treating a string as an array of chars. I
remember some discussion a while back about making [] on strings do
something useful (but not the same thing as C<substr>), but I forget how
it ended, and my brain is too fried to go hunt it down. But overall I
like that idea. Then you could just say:

@array = $string[];

Which is a lot prettier than anything you mentioned above, lets us get
rid of the .split:/<null>/ issue, has better Huffman coding, and lets
.chars have only one meaning.

For reference, what I'm thinking of having [] do is return the chars
specified as a list. This should be lvaluable, so you can hack at
individual chars to your heart's content.

This is different from substr(), since the latter returns a string of
the range of chars, not the individual chars. Consider:

$a = $b = "All good boys go to heaven.";
substr($a,9,3) = "girl";
$b[9..11] = "girl"[];
say "A: $a";
say "B: $b";

A: All good girls go to heaven.
B: All good girs go to heaven.

-- Rod Adams
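The difference between the two assignments above can be sketched in Python. This is an emulation under stated assumptions, not Perl 6: `substr_assign` and `slice_assign` are hypothetical helpers, and `slice_assign` assumes that assigning a longer list to a three-element slice silently drops the extra elements, as Perl list assignment does.

```python
def substr_assign(s: str, start: int, length: int, replacement: str) -> str:
    """Replace a substring; the result may grow or shrink."""
    return s[:start] + replacement + s[start + length:]

def slice_assign(s: str, first: int, last: int, replacement: str) -> str:
    """Overwrite chars first..last (inclusive) one slot at a time.

    Extra replacement chars are dropped, so the length never changes.
    Slots left over when the replacement is shorter are left untouched here,
    which is a simplification.
    """
    chars = list(s)
    new = list(replacement)[: last - first + 1]
    chars[first:first + len(new)] = new
    return "".join(chars)

original = "All good boys go to heaven."
a = substr_assign(original, 9, 3, "girl")
b = slice_assign(original, 9, 11, "girl")
print("A:", a)
print("B:", b)
```

The substring form splices in all four characters of "girl", while the element-wise form has only three slots to fill, which is why the "l" is lost in the second result.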

Gcomnz

Apr 12, 2005, 12:20:27 AM

to Rod Adams, perl6-l...@perl.org

> Rod wrote:
> However, I do like the idea of treating a string as an array of chars. I
> remember some discussion a while back about making [] on strings do
> something useful (but not the same thing as C<substr>), but I forget how
> it ended, and my brain is too fried to go hunt it down. But overall I
> like that idea. Then you could just say:
>
> @array = $string[];

This all sounds nice and simple. My only question then is what about
the instances where you specifically need the array of graphs, codes,
bytes, or whatever? If we can do one, why not all?

I recall that a good point Larry made previously is not to bend over
backward to let C programmers still think like C programmers in Perl
(sorry if my munging didn't get that just right). And to be honest I
only came up with this question for the cookbook (pleac) examples, but
I'm guessing there's some reasonable use for all this stuff outside of
the C-thinking world?

Gcomnz

Apr 12, 2005, 12:43:52 AM

to ma...@diephouse.com, Rod Adams, perl6-l...@perl.org

> > > However, I do like the idea of treating a string as an array of chars. I
> > > remember some discussion a while back about making [] on strings do
> > > something useful (but not the same thing as C<substr>), but I forget how
> > > it ended, and my brain is too fried to go hunt it down. But overall I
> > > like that idea. Then you could just say:
> > >
> > > @array = $string[];
> >
> > This all sounds nice and simple. My only question then is what about
> > the instances where you specifically need the array of graphs, codes,
> > bytes, or whatever? If we can do one, why not all?
>
> That's why C<$string.chars[]> was proposed -- it would be accompanied
> by .graphs, .codes, and .bytes. That is all fine and dandy, but I
> don't think I should have to think about Unicode if I don't want to.
> And if I understand correctly, that means that I want everything to
> use chars by default. And C<$string[]> would be a nice shortcut for
> that.

Yes, that's sort of what I was arguing for, in an underhanded way. I
agree that $string[] is a good shorthand for the most common usage
($string.chars[]) too.

Matt Diephouse

Apr 12, 2005, 12:41:24 AM

to gcomnz, Rod Adams, perl6-l...@perl.org

On Apr 12, 2005 12:20 AM, gcomnz <gco...@gmail.com> wrote:
> > Rod wrote:
> > However, I do like the idea of treating a string as an array of chars. I
> > remember some discussion a while back about making [] on strings do
> > something useful (but not the same thing as C<substr>), but I forget how
> > it ended, and my brain is too fried to go hunt it down. But overall I
> > like that idea. Then you could just say:
> >
> > @array = $string[];
>
> This all sounds nice and simple. My only question then is what about
> the instances where you specifically need the array of graphs, codes,
> bytes, or whatever? If we can do one, why not all?

That's why C<$string.chars[]> was proposed -- it would be accompanied
by .graphs, .codes, and .bytes. That is all fine and dandy, but I
don't think I should have to think about Unicode if I don't want to.
And if I understand correctly, that means that I want everything to
use chars by default. And C<$string[]> would be a nice shortcut for
that.

--
matt diephouse
http://matt.diephouse.com

Rod Adams

Apr 12, 2005, 1:00:00 AM

to perl6-l...@perl.org

Matt Diephouse wrote:

I've been meaning to ask what people think about having operators that
temporarily change the "current lexical Unicode level" for just one
single expression. I see them as solving all kinds of corner cases.

Unfortunately, I don't have a solid proposal handy, which has kept me
from posting it. But since there is some interest in this, I'll throw
the concept out there, and see if anyone else has a good idea what they
should look like, and exactly how they should work.

-- Rod Adams

Larry Wall

Apr 12, 2005, 7:14:07 AM

to Perl6 Language List

On Mon, Apr 11, 2005 at 03:53:32PM -0400, Mark Reed wrote:
: I think that, in general, at the level of Perl code, 1 “character” should be
: one code point, and any higher-level support for combining and splitting
: should be outside the core, in Unicode::Whatever.

I think the default should be language-independent graphemes, and
that support for all Unicode levels below that is in the core, while
all of the many level 4 ("use French") modules should come standard,
which is core by some definition.

Larry

Larry Wall

Apr 12, 2005, 7:27:36 AM

to Perl6 Language List

On Mon, Apr 11, 2005 at 01:08:04PM -0700, gcomnz wrote:
: I read "followed by 0 or more combining characters" to mean that it is
: smart enough to combine the vowels in Arabic and other syllabic
: alphabets that use special conjuncts. However I'm also not exactly
: sure if that's even reasonably possible, or even if it makes sense in
: the counting of "characters" for languages that use those.

The "0 or more combining characters" is relying on the exact
definition of combining character in Unicode, which is construed as
(somewhat) language-independent. But the language-dependent level
can split up characters in whatever way makes sense to a native
speaker of the language. That's what it's there for. But you
actually have to declare up front what language you want to work in.
Language-independent graphemes is the highest we can go by default,
and that's where I think we should go by default, because that's
closest to what the naïve user will expect. The smart people will
know to drop to codepoints or bytes when they need that.

Larry
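The three language-independent levels named above (bytes, codepoints, graphemes) can be compared in a small Python sketch. The `graphemes` helper is hypothetical and implements only the simplified definition quoted from Apocalypse 5, a base character followed by zero or more combining characters. It is not the full Unicode text segmentation algorithm, and it detects combining characters through a nonzero canonical combining class, which misses marks whose class is 0.

```python
import unicodedata

def graphemes(s: str) -> list[str]:
    """Group each base character with the combining characters that follow it."""
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

s = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

levels = {
    "bytes": len(s.encode("utf-8")),
    "codepoints": len(s),
    "graphemes": len(graphemes(s)),
}
print(levels)
```

For this string the three levels give three different lengths, and only the grapheme count matches the four characters a reader sees.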