Re: String Theory

Rod Adams

Mar 19, 2005, 6:41:30 PM
to Perl6 Language List
It's been pointed out to me that A12 mentions:

> Coercions to other classes can also be defined:
>
> multi sub *coerce:as (Us $us, Them ::to) { to.transmogrify($us) }
>
> Such coercions allow both explicit conversion:
>
> $them = $us as Them;
>
> as well as implicit conversions:
>
> my Them $them = $us;
>

I read S12 in detail (actually all the S's) before posting. Neither S12
nor S13 mentions C<coerce:<as>>, so I missed the A12 mention of it in my
prep work.

Reading it now, my C<as> is a bit different, since I'm allowing options
for defining the encoding and Unicode level. There may be other options
that make sense in some contexts. Of course one could view the different
encodings and levels as subclasses of Str, which I considered at some
point, but it felt like it was going to get rather cumbersome given the
cross product effect of the two properties.

Also, it is unclear whether C<coerce:<as>> returns an lvalue, which my
C<.as> does.

There's likely room for unification of the two ideas.

-- Rod Adams

Rod Adams

Mar 19, 2005, 6:07:49 PM
to Perl6 Language List

I propose that we make a few decisions about strings in Perl. I've read
all the synopses, several list threads on the topic, and a few web
guides to Unicode. I've also thought a lot about how to cleanly define
all the string related functions that we expect Perl to have in the face
of all this expanded Unicode support.

What I've come up with is that we need a rule that says:

A single string value has a single encoding and a single Unicode Level
associated with it, and you can only talk to that value on its own
terms. These will be the properties "encoding" and "level".

However, it should be easy to coerce that string into something that
behaves some other way.

To accomplish this, I'm hijacking the C<as> method away from the Perl 5
C<sprintf> (which can be named C<to>, and which I plan to do more with
at some later point), and making it a general purpose coercion method.
The general form of this will be something like:

multi method as ($self : ?Class $to = $self.meta.name, *%options)

The purpose of C<as> is to create a "view" of the invocant in some other
form. Where possible, it will return an lvalue that allows one to alter
the original invocant as if it were a C<$to>.

This makes several things easy.

my Str $x = 'Just Another Perl Hacker' but utf8;
my @x := $x.as(Array of uint8);
say "@x.pop() @x.pop()";
say $x;

Generates:

114 101
Just Another Perl Hack

To make things easier, I think we need new types qw/Grapheme CodePoint
LangChar/ that all C<does Character> (ick! someone come up with a
better name for this role), along with Byte. Character is a role,
not a class, so you can't go creating instances of it.

But we could write:

my Str $x = 'Just Another Perl Hacker';
my @x := $x.as(Array of Character);

And then C<@x.pop()> returns whichever of
Grapheme/CodePoint/LangChar/Byte that $x thought of itself in terms of.
In other words, it's C<chop>.
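
To illustrate (this is only a sketch; the C<but graphs> / C<but codes>
markers are hypothetical shorthand for which level the string thinks in):

my Str $g = "cafe\x[0301]" but graphs;  # thinks in graphemes
my Str $c = "cafe\x[0301]" but codes;   # thinks in code points

$g.as(Array of Character).pop;  # "e\x[0301]" -- one grapheme, an e-acute
$c.as(Array of Character).pop;  # "\x[0301]"  -- just the combining accent

Same data, but C<pop> (and hence C<chop>) hands back a different sized
unit depending on the level the string was born with.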


Since by default, C<as> assumes the invocant type, we can convert from
one string encoding/level to another with:

$str.as(encoding => 'utf8', level => 'graph');

But we'll make it where C<*%options> handles known encodings and levels
as boolean named parameters as well, so

$str.as:utf8:graph;

does the same thing: makes another Str with the same contents as $str,
only with utf8 encoding and grapheme character semantics.


What does all this buy us? Well... for one thing it all disappears if
you want the default semantics of what you're working with.

Second, it makes it where a position within a string can be thought of
as a single integer again. What that integer means is subject to the
C<level> of the string you're operating with.

We could probably even resurrect C<length> if we wanted to, making it
where people who don't care about Unicode don't have to care. Those who
do care exactly which length they are getting can say
C<length $str.as:graph>.

To the user, almost the entire string function library winds up looking
like it did in Perl 5.


Some side points:

It is an error to do things like C<index> with strings of different
levels, but not different encodings.

level and encoding should default to whatever the source code was
written in, if known.

C<pack> and C<unpack> should be able to be replaced with C<as> views of
compact structs (see S09).
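
For instance (purely a sketch -- the packed class, its field names, and
the C<is packed> trait are all made up for illustration):

class Header is packed { has uint16 $.type is rw; has uint32 $.len is rw }

my $hdr = $str.as(Header);  # roughly the moral equivalent of unpack "nN", $str
say $hdr.type;              # reads the first two bytes of $str
$hdr.len = 42;              # lvalue view, so this writes back into $str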

C<as> kills C<vec>. Or at least buries it very deeply, without oxygen.


Comments?

-- Rod Adams

Larry Wall

Mar 19, 2005, 10:11:08 PM
to Perl6 Language List
On Sat, Mar 19, 2005 at 05:07:49PM -0600, Rod Adams wrote:
: I propose that we make a few decisions about strings in Perl. I've read
: all the synopses, several list threads on the topic, and a few web
: guides to Unicode. I've also thought a lot about how to cleanly define
: all the string related functions that we expect Perl to have in the face
: of all this expanded Unicode support.
:
: What I've come up with is that we need a rule that says:
:
: A single string value has a single encoding and a single Unicode Level
: associated with it, and you can only talk to that value on its own
: terms. These will be the properties "encoding" and "level".

You've more or less described the semantics available at the "use
bytes" level, which basically comes down to a pure OO approach where
the user has to be aware of all the types (to the extent that OO
doesn't hide that). It's one approach to polymorphism, but I think
it shortchanges the natural polymorphism of Unicode, and the approach
of Perl to such natural polymorphisms as evident in autoconversion
between numbers and strings. That being said, I don't think your
view is so far off my view. More on that below.

: However, it should be easy to coerce that string into something that
: behaves some other way.

The question is, "how easy?" You're proposing a mechanism that,
frankly, looks rather intrusive and makes my eyes glaze over as a
representative of the Pooh clan. I think the typical user would rather
have at least the option of automatic coercion in a lexical scope.

But let me back up a bit. What I want to do is to just widen your
definition of a string type slightly. I see your current view as a
sort of degenerate case of my view. Instead of viewing a string as
having an exact Unicode level, I prefer to think of it as having a
natural maximum and minimum level when it's born, depending on the
type of data it's trying to represent. A memory buffer naturally has
a minimum and maximum Unicode level of "bytes". A typical Unicode
string encoded in, say, UTF-8, has a minimum Unicode level of bytes,
and maximum of "chars" (I'm using that to represent language-dependent
graphemes here.) A Unicode string revealed by an abstract interface
might not allow any bytes-level view, but use codepoints for the
natural minimum, or even graphemes, but still allow any view up
to chars, as long as it doesn't go below codepoints.

A given lexical scope chooses a default Unicode view, which can be
naturally mapped for any data types that allow that view. The question
is what to do outside of that range. (Inside that range, I suspect
we can arrange to find a version of index($str,$targ) that works
even if $str and $targ aren't the same exact type, preferably one
that works at the current Unicode level. I think the typical user
would prefer that we find such a function for him without him having
to play with coercions.)

If the current lexical view is outside the range allowed by the
current string, I think the default behavior is different looking up than
down. If I'm working at the chars level, then everything looks like
chars, even if it's something smaller. To take an extreme case,
suppose I do a chop on a string that allows the byte view as the
highest level, that is, a byte buffer. I always get the last byte
of the string, even if the data could conceivably be interpreted as
some other encoding. For that string, the bytes *are* the characters.
They're also the codepoints, and the graphemes. Likewise, a string
that is max codepoints will behave like a codepoint buffer even under
higher levels. This seems very dwimmy to me.
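
In made-up notation (C<use chars> and C<is max<bytes>> are merely
shorthand for "chars is the lexical default" and "this value's highest
natural view is bytes"):

use chars;                           # lexical default: chars level
my $buf is max<bytes> = "\xC3\xA9";  # byte buffer that happens to hold UTF-8 for e-acute
$buf.chars;                          # 2 -- for this value, the bytes *are* the chars
$buf.chop;                           # removes the single byte \xA9, not an e-acute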

Going the other way, if a lower level tries to access a string that is
minimum a higher level, it's just illegal. In a bytes lexical context,
it will force you to be more specific about what you mean if you want
to do an operation on a string that requires a higher level of abstraction.

As a limiting case, if you force all your incoming strings to be
minimum == maximum, and write your code at the bytes level, this
degenerates to your proposed semantics, more or less. I don't doubt
that many folks would prefer to program at this explicit level where
all the polymorphism is supplied by the objects, but I also think a
lot of folks would prefer to think at the graphemes or chars level
by default. It's the natural human way of chunking text.

I know this view of string polymorphism makes a bit more work for us,
but it's one of the basic Perl ideals to try to do a lot of vicarious
work in advance on behalf of the user. That was Perl's added value
over other languages when it started out, both on the level of mad
configuration and on the level of automatic str/num/int polymorphism.
I think Perl 6 can do this on the level of Str polymorphism.

When it comes to Unicode, most other OO languages are falling into
the Lisp trap of expecting the user to think like the computer rather
than the computer to think like the user. That's one of the few ideas from
Lisp I'm trying very hard *not* to steal.

Larry

Rod Adams

Mar 20, 2005, 2:05:10 AM
to Perl6 Language List
Larry Wall wrote:

>You've more or less described the semantics available at the "use
>bytes" level, which basically comes down to a pure OO approach where
>the user has to be aware of all the types (to the extent that OO
>doesn't hide that). It's one approach to polymorphism, but I think
>it shortchanges the natural polymorphism of Unicode, and the approach
>of Perl to such natural polymorphisms as evident in autoconversion
>between numbers and strings. That being said, I don't think your
>view is so far off my view. More on that below.
>

>[ rest of post snipped, not because it isn't relevant, but because it's long and my responses don't match any single part of it. -- RHA ]
>

What I see here is a need to define what it means to coerce a string
from one level to another.

First let me lay down my understanding of the different levels. I am
towards the novice end of the Unicode skill level, so it'll be pretty basic.

At the "byte" level, all you have is 8 bits, which may have some meaning
as text if treat them like ASCII.

You can take one or more bytes at a time, lump them together in a
predefined way, and generate a Code Point, which is an index into the
Unicode table of "characters".

However, Unicode has a problem with what it assigns code points to, so you
combine one or more code points together to form a proper character, or
grapheme.

But Unicode has another problem, where certain graphemes mean very
different things depending on what language you happen to be in. (Mostly
a CJK issue, from what I've read.) So we add a language dependent level,
which is basically graphemes with an implied language.

Even if I got parts of that wrong (very possible), the main point is
that in general, a higher level takes one _or_more_ units of the level
below it to construct a unit at its level.
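
To make that concrete, here is "e with acute accent" written as a plain
C<e> plus a combining acute, walked up the levels:

bytes (UTF-8):   0x65 0xCC 0x81   # three bytes
code points:     U+0065 U+0301    # two cpts: 'e' and the combining acute
grapheme:        1 grapheme       # the accented e, as a single character
language char:   1 char           # same unit; language only changes how
                                  # some graphemes are interpreted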


So now, there's the question of what it means to move something from one
level to another.

We'll start with moving "up" to a higher level. I'll use the example of
moving from Code Points (cpts) to Graphemes (grfs), but the talk should
translate to other conversions.

There are two approaches I see to this:
1) Convert every cpt into an exactly equivalent grf. The "lengths" of the
two strings are equal.
2) Scan through the string, grouping cpts into associated grfs where
possible. The resulting string "length" is less than or equal to that of
the input. In short, attempt to keep the same semantic meaning of the text.

I see both methods as being useful in certain contexts, but #2 is likely
what people want more often, and is what I have in mind.

Going "down" the chain, you stand the possibility of losing information
in method #1.
However, using #2, you simply "expand" the relevant grfs into the
associated group of cpts.
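
With the C<e> + combining acute pair again:

up by #1:    (U+0065, U+0301)  ->  two grfs, one per cpt   # "length" preserved
up by #2:    (U+0065, U+0301)  ->  one grf, an e-acute     # meaning preserved
down by #2:  that one grf      ->  (U+0065, U+0301)        # expanded back out
down by #1:  that one grf      ->  U+00E9                  # works here, but fails
                                                           # for grfs with no
                                                           # precomposed cpt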

My general approach to converting a string from one level to another
is to pick an encoding both levels understand, generate a bitstring from
the old level, and then have the new level parse that bitstring into
its level. If the start and goal don't allow this, throw an error.

I'm not certain how your views relate to all this, but I was left
with the impression that you were talking about conversions of type #1,
in which case it would make sense to outlaw downward conversions, since it's
possible the grf won't "fit" into a cpt.

It would also make sense to have an "allowable levels" parameter
in such a scheme, so you know not to store a grf that can't also be a cpt,
or at least to track that once one does, there's no going back to cpts.


Taking a step back, perhaps I didn't make it clear (or even mention)
that my coercions were DWIMish in nature, not pure bit level unions. I
covered String to String coercions above. For String -> Array, what
happens depends on the type of the array. For String -> Array of
Characters (back to my role), each element of the array corresponds to a
single unit of whatever the string thought a character was. However, String ->
Array of u?int\d+ would do bit level operations, and the encoding scheme
would matter greatly in this case.
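
For instance, the single precomposed code point U+00E9 (e-acute), viewed
as an Array of uint8, gives different elements under different encodings:

utf8:      (0xC3, 0xA9)
UTF-16BE:  (0x00, 0xE9)
Latin-1:   (0xE9)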

We/I will have to come up with a table of what these DWIMish operations
are, and how a user could define a new one. That likely will be an
extension of how you decide "tie" should happen in Perl 6.


I also see nothing wrong with most operations between strings of two
levels autocoercing one string to the higher level of the other. Things
like C<cmp>, C<~>, and many others should be fine in this regard, as
long as they default to coercing "up". I soloed C<index> out, because it
deals with two strings *and* it deals with positions within those
strings, and what a given integer position means can vary greatly with
level. But even there I suppose that we could force the target's level
onto the term, and make all positions relative to the target and its
level.


As for the exact syntax of the coercion, I'm open to suggestions.

-- Rod Adams

Chip Salzenberg

Mar 25, 2005, 2:38:10 PM
to perl6-l...@perl.org
Would this be a good time to ask for explanation for C<str> being
never Unicode, while C<Str> is always Unicode, thus leading to an
inability to box a non-Unicode string?

And might I also ask why in Perl 6 (if not Parrot) there seems to be
no type support for strings with known encodings which are not subsets
of Unicode?

If the explanations are "you have greatly misunderstood the contents
of Synopsis $foo", I will happily retire to my reading room.
--
Chip Salzenberg - a.k.a. - <ch...@pobox.com>
"What I cannot create, I do not understand." - Richard Feynman

Rod Adams

Mar 26, 2005, 3:12:51 PM
to perl6-l...@perl.org
Chip Salzenberg wrote:

>Would this be a good time to ask for explanation for C<str> being
>never Unicode, while C<Str> is always Unicode, thus leading to an
>inability to box a non-Unicode string?
>
>

That's not quite it. C<str> is a forced Unicode level of "Bytes", with
encoding "raw", which happens to not have any Unicode semantics attached
to it.

>And might I also ask why in Perl 6 (if not Parrot) there seems to be
>no type support for strings with known encodings which are not subsets
>of Unicode?
>
>

There are two different things to consider at the P6 level: Unicode
level, and encoding. Level is one of Bytes, CodePoints, Graphemes, or
Language Dependent Characters (aka LChars aka Chars). It's the way of
determining what a "character" means. This can all get a bit confusing
for people who only speak English, since our language happens to map
nicely into all the levels at once, with no "merging of multiple code
points into a grapheme" monkey business.

Encoding is how a particular string gets mapped into bits. I see P6 as
needing to support all the common encodings (raw, ASCII, UTF\d+[be|le]?,
UCS\d+) "out of the box", but then allowing the user to add more as they
see fit (EBCDIC, etc).

Level and Encoding can be mixed and matched independently, except for
the combos that don't make any sense.
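
A few combinations, just to illustrate (which ones are actually sensible
is a separate argument):

level => 'Graphemes',  encoding => 'utf8'   # ordinary text
level => 'Bytes',      encoding => 'raw'    # what C<str> gives you
level => 'CodePoints', encoding => 'UCS4'   # fixed-width processing
level => 'Graphemes',  encoding => 'raw'    # probably one of the combos
                                            # that makes no sense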

-- Rod Adams


Larry Wall

Mar 26, 2005, 10:48:58 PM
to perl6-l...@perl.org
On Fri, Mar 25, 2005 at 07:38:10PM -0000, Chip Salzenberg wrote:
: Would this be a good time to ask for explanation for C<str> being
: never Unicode, while C<Str> is always Unicode, thus leading to an
: inability to box a non-Unicode string?

As Rod said, "str" is just a way of declaring a byte buffer, for which
"characters", "graphemes", "codepoints", and "bytes" all mean the
same thing. Conversion or coercion to more abstract types must be
specified explicitly.
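
In terms of the C<as> proposal upthread, that explicit step might look
something like this (C<read-bytes> is a made-up stand-in for wherever the
raw data comes from):

my str $raw = read-bytes($fh);
my Str $txt = $raw.as(Str, :encoding<utf8>, :level<graph>);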

: And might I also ask why in Perl 6 (if not Parrot) there seems to be
: no type support for strings with known encodings which are not subsets
: of Unicode?

Well, because the main point of Unicode is that there *are* no encodings
that cannot be considered subsets of Unicode. Perl 6 considers
itself to have abstract Unicode semantics regardless of the underlying
representation of the data, which could be Latin-1 or Big5 or UTF-76.

That being said, abstract Unicode itself has varying levels of
abstraction, which is how we end up with .codes, .graphs, and .chars
in addition to .bytes.

Larry

Chip Salzenberg

Mar 28, 2005, 11:53:07 AM
to perl6-l...@perl.org
According to Larry Wall:

> On Fri, Mar 25, 2005 at 07:38:10PM -0000, Chip Salzenberg wrote:
> : And might I also ask why in Perl 6 (if not Parrot) there seems to be
> : no type support for strings with known encodings which are not subsets
> : of Unicode?
>
> Well, because the main point of Unicode is that there *are* no encodings
> that cannot be considered subsets of Unicode.

Certainly the Unicode standard makes such a claim about itself. There
are people who remain unpersuaded by Unicode's advertising. I conclude
that they will find Perl 6 somewhat disappointing.


--
Chip Salzenberg - a.k.a. - <ch...@pobox.com>

Open Source is not an excuse to write fun code
then leave the actual work to others.

Larry Wall

Mar 28, 2005, 3:23:08 PM
to perl6-l...@perl.org
On Mon, Mar 28, 2005 at 11:53:07AM -0500, Chip Salzenberg wrote:
: According to Larry Wall:
: > On Fri, Mar 25, 2005 at 07:38:10PM -0000, Chip Salzenberg wrote:
: > : And might I also ask why in Perl 6 (if not Parrot) there seems to be
: > : no type support for strings with known encodings which are not subsets
: > : of Unicode?
: >
: > Well, because the main point of Unicode is that there *are* no encodings
: > that cannot be considered subsets of Unicode.
:
: Certainly the Unicode standard makes such a claim about itself. There
: are people who remain unpersuaded by Unicode's advertising. I conclude
: that they will find Perl 6 somewhat disappointing.

If it turns out to be a Real Problem, we'll fix it. Right now I think
it's a Fake Problem, and we have more important things to worry about.
Most of the carping about Unicode is with regard to CJK unifications
that can't be represented in any one existing character set anyway.
Unicode has at least done pretty well with the round-trip guarantee for
any single existing character set. There are certainly localization
issues with regard to default input and output transformations, and
things like changing the default collation order from Unicodian to
SJISian or Big5ian or whatever. But those are good things to make
explicit in any event, and that's what the language-dependent level
is for. And people who are trying to write programs across language
boundaries are already basically screwed over by their national
character sets. You can't even go back and forth between Japanese
and English without getting all fouled up between ¥ and \. Unicode
distinguishes them, so it's a distinction that Perl 6 *always makes*.

That being said, there's no reason in the current design that a string
that is viewed on the language level as, say, French couldn't
actually be encoded in Morse code or some such. It's *only* the
abstract semantics at the current Unicode level that are required to
be Unicode semantics by default. And it's as lazy as we care to make
it--when you do s/foo/bar/ on a string, it's not required to convert
the string from any particular encoding to any other. It only has to
have the same abstract result *as if* you'd translated it to Unicode
and then back to whatever the internal form is. Even if you don't want
to emulate Unicode in the API, there are options. For some problems
it'd be more efficient to translate lazily, and for others it's
more efficient to just translate everything once on input and once
on output. (It also tends to be a little cleaner to isolate "lossy"
translations to one spot in the program. By the round-trip nature
of Unicode, most of the lossy translations would be on output.)

But anyway, a bit about my own psychology. I grew up as a preacher's
kid in a fundamentalist setting, and I heard a lot of arguments of the
form, "I'm not offended by this, but I'm afraid someone else might be
offended, so you shouldn't do it." I eventually learned to discount
such arguments to preserve my own sanity, so saying "someone might
be disappointed" is not quite sufficient to motivate me to action.
Plus there are a lot of people out there who are never happy unless
they have something to be unhappy about. If I thought that I could
design a language that will never disappoint anyone, I'd be a lot
stupider than I already think I am, I think.

All that being said, you can do whatever you like with Parrot, and
if you give a decent enough API, someone will link it into Perl 6. :-)

Larry
