Plans for string processing

Dan Sugalski

Apr 12, 2004, 11:43:45 AM

to perl6-i...@perl.org

Okay, I've not dug through all the fallout from the ICU checkin, but
I can see there's an awful lot. I'll dig through that in a bit, but...

Here's the plan. We've gone over it in the past, but I'm not sure
everything's been gathered together, so it's time to do so.

Some declarations:

1) Parrot will *not* require Unicode. Period. Ever. (Well, upon
release, at least) We will strongly recommend it, however, and use it
if we have it.
2) Parrot *will* support multiple encodings (the bytes->code points
stuff), character sets (code points->meaning of a sort), and
language-specific overrides of character set behaviour.
3) All string data can be dealt with as either a series of bytes,
code points, or characters. (Characters are potentially multiple code
points--basically combining character stuff from those standards that
do so)
4) We will *not* use ICU for core functions. (string to number or
number to string conversions, for example)
5) Parrot will autoconvert strings as needed. If a string can't be
converted, parrot will throw an exception. This goes for language,
character set, or encoding.
6) There *may* be an overriding set of rules for throwing conversion
exceptions. (They may be suppressed on lossy conversions, or required
for any conversions)
7) There *may* be an overriding language used for language-specific
operations (case folding or sorting).

I know ICU's got all sorts of nifty features, but bluntly we're not
going to use most of them.

The original split of encoding, character set, and language is one
that I want to keep. I know we've lost a good chunk of that with the
latest ICU patch, but that's only temporary and the breakage is worth
it to get Unicode actually in use. I expect I need to step up to the
plate and get an alternate encoding and charset in, so I'll probably
take a shot at JIS X 0208:1997 or CNS11643-1992. (Or whatever the
current version of those is)

As far as Parrot is concerned, a string is a series of bytes which
may, via its encoding, be turned into a series of 32 bit integer code
points. Those 32-bit integer code points can be turned, via its
character set, into a series of characters where each character is
one or more code points. Those characters may be classified and
transformed based on the language of the string.
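
A minimal sketch of those three views, written in Python rather than in terms of Parrot's internals. The function names are hypothetical, and the grouping rule (a base code point plus any combining marks that follow it, detected with `unicodedata.combining`) is a simplifying assumption, not the full set of rules a real character set would apply.

```python
import unicodedata

def code_points(raw: bytes, encoding: str) -> list[int]:
    """Encoding layer: turn a byte buffer into integer code points."""
    return [ord(ch) for ch in raw.decode(encoding)]

def characters(cps: list[int]) -> list[list[int]]:
    """Character-set layer: group code points into characters.

    Assumption: a character is a base code point followed by any
    code points with a nonzero canonical combining class.
    """
    chars: list[list[int]] = []
    for cp in cps:
        if chars and unicodedata.combining(chr(cp)) != 0:
            chars[-1].append(cp)   # attach the mark to the previous base
        else:
            chars.append([cp])     # start a new character
    return chars

raw = "e\u0301a".encode("utf-8")   # 'e' + COMBINING ACUTE ACCENT + 'a'
cps = code_points(raw, "utf-8")
chars = characters(cps)
print(len(raw), len(cps), len(chars))  # 4 3 2
```

The same data has three different lengths depending on which view is asked for, which is why an index like "character 5" only means something once the layer is named.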

The responsibilities of the three layers are:

Encoding
========

*) Transform stream of bytes to and from a set of 32-bit integers
*) Manages byte buffer (so buffer positioning and manipulation by
code point offset is handled here)

Character set
=============
*) Provides default manipulation and comparison behaviour (sorting
and case mangling)
*) Provides default character classifications (digit, word char,
space, punctuation, whatever)
*) Provides code point and character manipulation. (substring
functionality, basically)
*) Provides integrity features (exceptions if a string would be invalid)

Language
========
*) Provides language-sensitive manipulation of characters (case mangling)
*) Provides language-sensitive comparisons
*) Provides language-sensitive character overrides ('ll' treated as a
single character, for example, in Spanish if that's still desired)
*) Provides language-sensitive grouping overrides.

Since examples are good, here are a few. They're in an "If we"/"Then
Parrot" format.

IW: Mush together (either concatenate or substr replacement) two
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is thrown.
If so, we do the operation. If one string is manipulated the language
stays whatever that string was. If a new string is created either the
left side wins or the default language is used, depending on the
interpreter setting.

IW: Mush together two strings of different charsets
TP: If the two strings can be losslessly converted to one of the two
charsets, do so, otherwise transform to Unicode and mush together. If
transformation is lossy optionally throw an exception (or warning).
Language rules above still apply.

IW: Force a conversion to a different character set
TP: Does it. An exception or warning may be thrown if the conversion
is not lossless.
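
The three cases above can be sketched roughly as follows. This is an illustration under stated assumptions, not Parrot's API: `PString`, `mush`, `convert`, and `ConversionError` are hypothetical names, Python codec names stand in for charsets, and Python's own `str` stands in for the decoded code points.

```python
from dataclasses import dataclass

class ConversionError(Exception):
    """Raised when a conversion would lose data and strict mode is on."""

@dataclass
class PString:
    text: str       # stands in for the decoded code points
    charset: str    # a Python codec name, e.g. "latin-1"
    language: str   # e.g. "english", or "dunno"

def representable(text: str, charset: str) -> bool:
    """True if every code point of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

def mush(left: PString, right: PString, *, left_wins: bool = True,
         default_language: str = "dunno") -> PString:
    """Concatenate, picking a charset and language for the result."""
    combined = left.text + right.text
    for candidate in (left.charset, right.charset):
        if representable(combined, candidate):
            charset = candidate          # lossless in one of the two charsets
            break
    else:
        charset = "utf-8"                # otherwise fall back to Unicode
    language = left.language if left_wins else default_language
    return PString(combined, charset, language)

def convert(s: PString, charset: str, *, strict: bool = True) -> PString:
    """Force a conversion; raise on loss when strict, else substitute '?'."""
    if not representable(s.text, charset):
        if strict:
            raise ConversionError(f"cannot represent string in {charset}")
        lossy = s.text.encode(charset, errors="replace").decode(charset)
        return PString(lossy, charset, s.language)
    return PString(s.text, charset, s.language)
```

The `left_wins` flag plays the role of the interpreter setting mentioned above, and `strict` the role of the overriding rules for conversion exceptions.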

Please note that in most cases parrot deals with string data as
*strings* in S registers (or hiding behind PMCs) not as integers in I
registers (even though we treat strings as a series of abstract
integer code points). This is because even something as simple as
"give me character 5" may return a series of code points if character
5 is a combining character sequence. We may (possibly, but possibly not)
get a bit dirtier for the regex code for speed reasons, but we'll see
about that.

Also note that some languages, such as perl 6, have a more restricted
view of things. That's fine, and we don't really care much as long as
everything they need is provided. Larry's mandated Ux levels are a
(possibly excessively) restricted subset of what we're going to do,
so we can, and in fact should, ignore them for our purposes. Same
goes for other languages that have similar restrictions.

Finally note that, in general, the actual character set or language
of a string becomes completely irrelevant so there isn't any loss in
abstracting things--to properly support Unicode means abstracting the
heck out of so much stuff that supporting multiple encodings and
character sets is a matter of switching out table pointers, and as
such not particularly a big deal.

Yes, this does mean that some of the recent ICU integration's going
to be moved back some, and it means that string data's more complex
than you might want it to be, but it already is, so we deal.

This all is not, as of yet, entirely non-negotiable, though I've yet
to get a convincing argument for change.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Michael Scott

Apr 12, 2004, 12:50:37 PM

to P6I List

Just thought I'd mention that I'm in the process of trying to get
strings.pod updated to reflect the current state of affairs.

Mike

Matt Fowles

Apr 12, 2004, 6:49:14 PM

to Dan Sugalski, perl6-i...@perl.org

Dan~

I know that you are not technically required to defend your position,
but I would like an explanation of one part of this plan.

Dan Sugalski wrote:
> 4) We will *not* use ICU for core functions. (string to number or number
> to string conversions, for example)

Why not? It seems like we would just be reinventing a rather large
wheel here.

Matt

Jarkko Hietaniemi

Apr 13, 2004, 3:42:45 AM

to perl6-i...@perl.org, Matt_...@softhome.net, Dan Sugalski, perl6-i...@perl.org

Matt Fowles wrote:

Without having looked at what ICU supplies in this department I would
guess it's simply because of the overhead. atoi() is probably quite a
bit faster than pulling in the full support for TIBETAN HALF THREE.

(Though to be honest I think Parrot shouldn't rely on atoi() or any
of those guys: Perl 5 has taught us not to put too much trust in them.
Perl 5 these days parses all the integer formats itself.)
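
To make "parses the integer formats itself" concrete, here is a minimal sketch of a hand-rolled decimal parser. It is not Perl 5's code; `parse_int_prefix` is a hypothetical name, and the choice to accept only ASCII digits and whitespace is an assumption made for the example.

```python
def parse_int_prefix(s: str) -> tuple[int, int]:
    """Parse a leading decimal integer without any library conversion.

    Returns (value, characters_consumed); (0, 0) if no digits were found.
    Only ASCII digits count, so no Unicode digit tables are needed.
    """
    i, n = 0, len(s)
    while i < n and s[i] in " \t\n":      # skip leading whitespace
        i += 1
    sign = 1
    if i < n and s[i] in "+-":            # optional sign
        sign = -1 if s[i] == "-" else 1
        i += 1
    start = i
    value = 0
    while i < n and "0" <= s[i] <= "9":   # accumulate ASCII digits only
        value = value * 10 + (ord(s[i]) - ord("0"))
        i += 1
    if i == start:                        # sign or whitespace but no digits
        return 0, 0
    return sign * value, i

print(parse_int_prefix("  -42abc"))  # (-42, 5)
```

Nothing here depends on a locale or on a Unicode library being present, which is the property being argued for.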

Dan Sugalski

Apr 13, 2004, 9:45:57 AM

to Jarkko Hietaniemi, Matt_...@softhome.net, perl6-i...@perl.org

That's part of it, yep--if we want it done the way we want it, we'll
need to do it ourselves, and it'll likely be significantly faster.

Also, there's the issue of not requiring ICU, which makes it
difficult to do string conversion if it isn't there... :)

Aaron Sherman

Apr 13, 2004, 1:55:01 PM

to Dan Sugalski, Perl6 Internals List

Ok, I'm still lost on the language thing. I'm not arguing, I just don't
get it, and I feel that if I'm going to do some of the things that I
want to for Perl 6, I'm going to have to get it.

On Mon, 2004-04-12 at 11:43, Dan Sugalski wrote:

> Language
> ========
> *) Provides language-sensitive manipulation of characters (case mangling)
> *) Provides language-sensitive comparisons

Those two things do not seem to me to need language-specific strings at
all. They certainly need to understand the language in which they are
operating (avoiding the use of the word locale here, as per Larry's
concerns), but why does the language of origin of the string matter?

For example, in Perl5/Ponie:

@names=<NAMES>;
print "Phone Book: ", sort(@names), "\n";

In this example, I don't see why I would care that NAMES might be a
pseudo-handle that iterates over several databases, and returns strings
in the 7 different languages that those databases happen to contain. I
want my Phone Book sorted in a way that is appropriate to the language
of my phone book, with whatever special-case rules MY language has for
sorting funky foreign letters (and that might mean that even though a
comparison of two strings is POSSIBLE, in the current language it might
yield an exception, e.g. because Chinese and Japanese share a great many
characters that can be roughly converted, but neither have meaning in my
American English comparison).

More generally, an operation performed on a string (be it read
(comparison) or write (upcase, etc)) should be done in the way that the
*caller* expects, regardless of what legacy source the string came from
(I daren't even guess where that string that I got over a Parrot-enabled
CORBA might have been fetched from or if the language is still used
since it was stored in a cache somewhere 200 years ago, and it damn well
better not affect my sorting, no?)

Ok, so that's my take... what am I missing?

> *) Provides language-sensitive character overrides ('ll' treated as a
> single character, for example, in Spanish if that's still desired)
> *) Provides language-sensitive grouping overrides.

Ah, and here we come to my biggest point of confusion.

You describe logic that surrounds a given language, but you'll never
need "cmp" to know how to compare Spanish "ll" to English "ll", for
example. In fact, that doesn't even make sense to me. What you will need
is for cmp to know the Spanish comparison rules so that when it gets two
strings to compare, and it is asked to do so in Spanish, the proper
thing will happen.
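
A rough sketch of what "cmp knows the Spanish comparison rules" could look like, using the older Spanish convention in which "ch" and "ll" sort as letters of their own. The names `ALPHABET` and `spanish_key` are hypothetical, and accented vowels are deliberately left out (they simply sort last here), so this is an illustration rather than a usable collation.

```python
# Traditional Spanish alphabet, with the digraphs as separate letters.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def spanish_key(word: str) -> list[int]:
    """Sort key that treats 'ch' and 'll' as single characters."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])        # digraph counts as one letter
            i += 2
        else:
            # Unknown characters sort after everything else (assumption).
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key

words = ["luna", "llama", "loro"]
print(sorted(words))                   # ['llama', 'loro', 'luna']
print(sorted(words, key=spanish_key))  # ['loro', 'luna', 'llama']
```

Here the rules are supplied by the caller of the sort, not carried by the strings, which is the arrangement being asked about.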

I guess this boils down to two choices:

a) All strings will have the user's language by default

or

b) Strings will have different languages and behave according to their
"sources" regardless of the native rules of the user.

"b" seems to me to yield very surprising results, and not at all justify
the baggage placed inside a string. If I can be forgiven for saying so,
it's even close to Perl 4's $[, which allowed you to change the
semantics of arrays, only here, you're doing it as a property on a
string so that I can't trust that any string will behave the way I
expect unless I "untaint" it.

Again, I'm asking for corrections here.

> IW: Mush together (either concatenate or substr replacement) two
> strings of different languages but same charset

According to whose rules? Does it make sense to merge an American
English string with a Japanese string unless you have a target language?

This means that someone's rules must become dominant, and as a
programmer, I'm expecting that to be neither string a nor string b, but
the user's. If the user happens to be Portuguese, then I would expect
that some kind of exception is going to emerge, but if the user is
Japanese, then it makes sense, and American English can be treated as
romaji, and an exception thrown if non-romaji ascii characters are used.
Again, this is not something that the STRING can really have much of a
clue about. It's all context.

What is the reason for every string value carrying around such context?
Certainly numbers don't carry around their base as context, and yet
that's critical when converting to a string!

--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback

Dan Sugalski

Apr 13, 2004, 3:06:13 PM

to Aaron Sherman, Perl6 Internals List

At 1:55 PM -0400 4/13/04, Aaron Sherman wrote:
>Ok, I'm still lost on the language thing. I'm not arguing, I just don't
>get it, and I feel that if I'm going to do some of the things that I
>want to for Perl 6, I'm going to have to get it.
>
>On Mon, 2004-04-12 at 11:43, Dan Sugalski wrote:
>
>> Language
>> ========
>> *) Provides language-sensitive manipulation of characters (case mangling)
>> *) Provides language-sensitive comparisons
>
>Those two things do not seem to me to need language-specific strings at
>all. They certainly need to understand the language in which they are
>operating (avoiding the use of the word locale here, as per Larry's
>concerns), but why does the language of origin of the string matter?

Because the way a string is upcased/downcased/titlecased depends on
the language the string came from. The treatment of accents and a
number of specific character sequences depends on the language the
string came from. Ignore it and, well, you're going to find that
you're messing up the display of someone's name. That strikes me as
rather rude.

You also don't always have the means of determining what's right.
It's particularly true of library code.

>For example, in Perl5/Ponie:
>
> @names=<NAMES>;
> print "Phone Book: ", sort(@names), "\n";
>
>In this example, I don't see why I would care that NAMES might be a
>pseudo-handle that iterates over several databases, and returns strings
>in the 7 different languages that those databases happen to contain.

Then *you* don't. That's fine. Why, though, do you assume that
*nobody* will? That's the point.

You may decide that all strings shall be treated as if they were in
character set X, and language Y, whatever that is. Fine. You may
decide that the language you're designing will treat all strings as
if they're in character set X and language Y. That's fine too. Parrot
must support the capability of forcing the decision, and we will.

What I don't want to do is *force* uniformity. Some of us do care. If
we do it the way I want, then we can ultimately both do what we want.
If we do it the way you want, though, we can't--I'm screwed since the
data is just not there and can't *be* there.

We've tried the whole monoculture thing before. That didn't work with
ASCII, EBCDIC, any of the Latin-x, ISO-whatever, and it's not working
for a lot of folks with Unicode. (Granted, only a couple of billion,
so it's not *that* big a deal...) We've also tried the whole global
setting thing, and if you think that worked I dare you to walk up to
Jarkko and whisper "Locale" in his ear.

If you want to force a simplified view of things as either an app
programmer or language designer, well, great. I am OK with that. More
than OK, really, and I do understand the desire. What I'm not OK with
is mandating that simplified view on everyone.

Brent 'Dax' Royal-Gordon

Apr 13, 2004, 3:47:28 PM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski wrote:
> 1) Parrot will *not* require Unicode. Period. Ever.

My old 8MB Visor Prism thanks you.

> *) Transform stream of bytes to and from a set of 32-bit integers
> *) Manages byte buffer (so buffer positioning and manipulation by code
> point offset is handled here)

What's wrong with, *as an internal optimization only*, storing the
string in the more efficient-to-access format of the patch? I mean,
yeah, you don't want it to be externally visible, but if you're going to
treat a string as a series of ints, why not store it that way?

I really see no reason to store strings as UTF-{8,16,32} and waste CPU
cycles on decoding it when we can do a lossless conversion to a format
that's both more compact (in the most common cases) and faster.

--
Brent "Dax" Royal-Gordon <br...@brentdax.com>
Perl and Parrot hacker

Oceania has always been at war with Eastasia.

Dan Sugalski

Apr 13, 2004, 3:54:45 PM

to Brent 'Dax' Royal-Gordon, perl6-i...@perl.org

At 12:44 PM -0700 4/13/04, Brent 'Dax' Royal-Gordon wrote:

>Dan Sugalski wrote:
>>1) Parrot will *not* require Unicode. Period. Ever.
>

>My old 8MB Visor Prism thanks you.

:) As does my gameboy.

>>*) Transform stream of bytes to and from a set of 32-bit integers
>>*) Manages byte buffer (so buffer positioning and manipulation by
>>code point offset is handled here)
>

>What's wrong with, *as an internal optimization only*, storing the
>string in the more efficient-to-access format of the patch? I mean,
>yeah, you don't want it to be externally visible, but if you're
>going to treat a string as a series of ints, why not store it that
>way?
>
>I really see no reason to store strings as UTF-{8,16,32} and waste
>CPU cycles on decoding it when we can do a lossless conversion to a
>format that's both more compact (in the most common cases) and
>faster.

Erm... UTF-32 is a fixed-width encoding. (That Unicode is inherently
a variable-width character set is a separate issue, though given the
scope of the project a correct decision) I'm fine with leaving ICU to
store unicode data internally any damn way it wants, though--partly
because the IBM folks are Darned Clever and I trust their judgement,
and partly because it means we don't have to write all the code to
properly handle Unicode.

Other variable-width encodings will likely be stored internally as
fixed-width buffers, at least once the data gets manipulated some.
Assuming I'm not convinced that Unicode is the true way to go... :)

Michael Scott

Apr 13, 2004, 4:44:20 PM

to Dan Sugalski, P6I List

On 12 Apr 2004, at 17:43, Dan Sugalski wrote:

> IW: Mush together (either concatenate or substr replacement) two
> strings of different languages but same charset
> TP: Checks to see if that's allowed. If not, an exception is thrown.
> If so, we do the operation. If one string is manipulated the language
> stays whatever that string was. If a new string is created either the
> left side wins or the default language is used, depending on the
> interpreter setting.
>

Does that mean that a Parrot string will always have a specific
language associated with it?

Mike

Dan Sugalski

Apr 13, 2004, 4:48:19 PM

to Michael Scott, P6I List

Yes.

Note that the language might be "Dunno". :) There'll be a default
that's assigned to input data and suchlike things, and the language
markers in the strings can be overridden by code.

Michael Scott

Apr 13, 2004, 5:28:31 PM

to Dan Sugalski, P6I List

On 13 Apr 2004, at 22:48, Dan Sugalski wrote:

> Note that the language might be "Dunno". :) There'll be a default
> that's assigned to input data and suchlike things, and the language
> markers in the strings can be overridden by code.
>

Would this be right?

English + English = English
English + Chinese = Dunno
English + Dunno = Dunno

+ being symmetric.

How does a Dunno string know how to change case?

Mike

Dan Sugalski

Apr 13, 2004, 5:43:19 PM

to Michael Scott, P6I List

At 11:28 PM +0200 4/13/04, Michael Scott wrote:
>On 13 Apr 2004, at 22:48, Dan Sugalski wrote:
>
>>Note that the language might be "Dunno". :) There'll be a default
>>that's assigned to input data and suchlike things, and the language
>>markers in the strings can be overridden by code.
>>
>
>Would this be right?
>
>English + English = English
>English + Chinese = Dunno
>English + Dunno = Dunno
>
>+ being symmetric.

I've been assuming it's a left-side wins, as you're tacking onto an
existing string, so you'd get English in all cases. Alternately you
could get an exception. The end result of a mixed-language operation
could certainly be the Dunno language or the current default--both'd
be reasonable.

>How does a Dunno string know how to change case?

It uses the defaults provided by the character set.
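
One way to picture that fallback, as a sketch only: the language layer is a table of overrides, and anything it does not override is looked up in the character set's defaults. `CHARSET_DEFAULTS`, `LANGUAGES`, `turkish_upcase`, and `upcase` are hypothetical names; the Turkish dotted and dotless i is used as the override because it is a well-known case where uppercasing depends on language.

```python
def turkish_upcase(text: str) -> str:
    """Turkish rule: i -> dotted capital I (U+0130), dotless i (U+0131) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

# Defaults supplied by the character set (here: Python's Unicode rules).
CHARSET_DEFAULTS = {"upcase": str.upper}

# Per-language overrides; "dunno" overrides nothing.
LANGUAGES = {
    "turkish": {"upcase": turkish_upcase},
    "dunno": {},
}

def upcase(text: str, language: str = "dunno") -> str:
    """Use the language's override if it has one, else the charset default."""
    overrides = LANGUAGES.get(language, {})
    handler = overrides.get("upcase", CHARSET_DEFAULTS["upcase"])
    return handler(text)

print(upcase("istanbul"))             # ISTANBUL
print(upcase("istanbul", "turkish"))  # İSTANBUL
```

A "Dunno" string and a string in a language with no special casing rules end up on the same code path.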

Leopold Toetsch

Apr 13, 2004, 6:33:22 PM

to Brent 'Dax' Royal-Gordon, perl6-i...@perl.org

Brent 'Dax' Royal-Gordon <br...@brentdax.com> wrote:

> I really see no reason to store strings as UTF-{8,16,32} and waste CPU
> cycles on decoding it when we can do a lossless conversion to a format
> that's both more compact (in the most common cases) and faster.

The default format now isn't UTF-8. It's a series of fixed-size entries
of either uint_8, uint_16, or uint_32. These reflect the most common
encodings, which are: char*, UCS-2, and UCS-4/UTF-32 (or possibly other
32-bit encodings). This should cover "common" cases.

No cycles are wasted for storing "straight" encodings.
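
As an illustration of the fixed-size idea (not the code in the patch), the sketch below picks the narrowest of 1, 2, or 4 bytes per entry that fits every code point in a string, so that indexing by code point is a multiplication. `pack_fixed_width` and `code_point_at` are hypothetical names, and storing code points up to 0xFF in one byte is an assumption of this example.

```python
def pack_fixed_width(text: str) -> tuple[int, bytes]:
    """Store code points in the narrowest fixed width that fits them all."""
    max_cp = max((ord(ch) for ch in text), default=0)
    if max_cp <= 0xFF:
        width = 1          # uint_8 entries
    elif max_cp <= 0xFFFF:
        width = 2          # uint_16 entries
    else:
        width = 4          # uint_32 entries
    buf = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, buf

def code_point_at(buf: bytes, width: int, index: int) -> int:
    """Constant-time lookup: no decoding of the preceding entries."""
    start = index * width
    return int.from_bytes(buf[start:start + width], "little")

width, buf = pack_fixed_width("a\u20ac")   # 'a' + EURO SIGN
print(width, len(buf), hex(code_point_at(buf, width, 1)))  # 2 4 0x20ac
```

The trade-off is that one wide code point widens the whole buffer, in exchange for never having to scan a variable-width encoding to find position N.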

leo

Leopold Toetsch

Apr 13, 2004, 6:23:25 PM

to Aaron Sherman, perl6-i...@perl.org

Aaron Sherman <a...@ajs.com> wrote:
> For example, in Perl5/Ponie:

> @names=<NAMES>;
> print "Phone Book: ", sort(@names), "\n";

> In this example, I don't see why I would care that NAMES might be a
> pseudo-handle that iterates over several databases, and returns strings
> in the 7 different languages

I already did show an example where uc("i") isn't "I". Collating is still
more complex than a »simple« uc().

> More generally, an operation performed on a string (be it read
> (comparison) or write (upcase, etc)) should be done in the way that the
> *caller* expects,

Well, we don't know what the caller expects. The caller has to decide.
There are basically at least two ways: Treat all strings language
independent (from their origin) or append more information to each
string.

>> *) Provides language-sensitive character overrides ('ll' treated as a
>> single character, for example, in Spanish if that's still desired)
>> *) Provides language-sensitive grouping overrides.

> Ah, and here we come to my biggest point of confusion.

Another example:

"my dog Fiffi" eq "my dog Fi\x{fb03}"

When my program is doing typographical computations, above equation is
true. And useful. The characters "f", "f", "i" are goin' to be printed.
But the ligature "ffi" takes less space when printed as such.
This is the same character string, though, when I'm a reader of this dog
newspaper.

When I do an analysis of counting "f"s in dog names, I don't care if
it's written in one of these forms, it should be the same - or when I
search for "ffi" in the text.

It just depends who's using these features in which context.

> I guess this boils down to two choices:

> a) All strings will have the user's language by default

> or

> b) Strings will have different languages and behave according to their
> "sources" regardless of the native rules of the user.

and/or either the string's or the user's default comes in, depending on the
desired action.

>> IW: Mush together (either concatenate or substr replacement) two
>> strings of different languages but same charset

> According to whose rules?

User level - what do you want to achieve. At codepoint level the
operation is fine. It doesn't make sense above that, though.

> This means that someone's rules must become dominant,

It doesn't make much sense to do

bors S0, S1 # stringwise bitwise or

to anything that isn't singlebyte encoded. It depends.

The rules - how and when they apply - still have to be laid out.

leo

Aaron Sherman

Apr 13, 2004, 6:09:49 PM

to Dan Sugalski, Perl6 Internals List

Thanks for your response. I'm not sure that you and I are speaking about
exactly the same things, since you state that the logical extensions, if
not outright goals, of an alternate approach would be an exclusionary
monoculture. I'm not sure that's quite right....

On Tue, 2004-04-13 at 15:06, Dan Sugalski wrote:

> >> *) Provides language-sensitive manipulation of characters (case mangling)
> >> *) Provides language-sensitive comparisons
> >
> >Those two things do not seem to me to need language-specific strings at
> >all. They certainly need to understand the language in which they are
> >operating (avoiding the use of the word locale here, as per Larry's
> >concerns), but why does the language of origin of the string matter?
>
> Because the way a string is upcased/downcased/titlecased depends on
> the language the string came from. The treatment of accents and a
> number of specific character sequences depends on the language the
> string came from.

> Ignore it and, well, you're going to find that
> you're messing up the display of someone's name. That strikes me as
> rather rude.

For proper names, you may have a point (though the ordering of names in
a phone book, for example, is often according to the language of the
book, not the origin of the names), and in some forms of string
processing, that kind of deference to the origin of a word may turn out
to be useful. I do "get" that much.

What I'm not getting is

* Why do we assume that the language property of a string will be
the language from which the word correctly originates rather
than the locale of the database / web site / file server /
whatever that we received it from? That could actually result in
dealing with native words according to the rules of foreign
languages, and boy-howdy is that going to be fun to debug.
* Why is it so valuable as to attach a value to every string ever
created for it rather than creating an abstraction at a higher
level (e.g. a class)
* Why wouldn't you do the same thing for MIME type, as strings may
also (and perhaps more often) contain data which is more
appropriately tagged that way? The SpamAssassin guys would love
you for this!

> What I don't want to do is *force* uniformity. Some of us do care.

Hey, that's a bit of a low blow. I care quite a bit, or I would not ask.
I'm not saying that the guy who wants to sort names according to their
source language is wrong, I'm saying that he doesn't need core support
in Parrot to do it, so I'm curious why it's in there.

> We've tried the whole monoculture thing before.

I just don't think that moving language up a layer or two of abstraction
enforces a monoculture... again, I'm willing to see the light if someone
can explain it.

A lot of your response is about "enforcing", and I'm not sure how I gave
the impression of this being an enforcement issue (or perhaps you think
that non-localization is something that needs to be enforced?) I just
can't see how every string needs to carry around this kind of
world-view-altering context when 99% of programs that use string data
(even those that use mixed encodings) won't want to apply said context,
but rather perform all operations according to their locale. Am I wrong
about that?

One thing that was not answered, though is what happens in terms of
dominance. When sorting French and Norwegian Unicode strings, who loses
(wins?) when you try to compare them? Comparing across language
boundaries would be a monumental task, and would be instantly reviled as
wrong by every language purist in the world (to my knowledge no one has
ever published a uniform way to compare two words, much less arbitrary
text, unless you are willing to do so using the rules of one and only
one culture (and I say culture because often the rules of a culture are
mutually incompatible with those of any one source language's strict
rules)). So, if you have to convert in order to compare, whose language
do you do the comparison in? You can't really rely on LHS vs. RHS, since
a sort will reverse these many times (and C<$a cmp $b> had better be
C<-($b cmp $a)> or your sort may never terminate!)
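
The sign-flip concern can be shown with a toy model. Two made-up "languages" are assumed here, one comparing strings as written and one comparing them reversed; `KEYS`, `cmp_left_wins`, and `cmp_one_language` are hypothetical names. With left-side-wins, swapping the operands swaps the rules, and the comparator stops being antisymmetric.

```python
# Toy per-language comparison rules (assumptions for illustration only).
KEYS = {
    "plain": lambda s: s,           # compare as written
    "reversed": lambda s: s[::-1],  # compare from the last character
}

def three_way(a, b) -> int:
    """Return -1, 0, or 1, like Perl's cmp."""
    return (a > b) - (a < b)

def cmp_left_wins(a: tuple[str, str], b: tuple[str, str]) -> int:
    """Compare (text, language) pairs using the LEFT operand's rules."""
    key = KEYS[a[1]]
    return three_way(key(a[0]), key(b[0]))

def cmp_one_language(a: tuple[str, str], b: tuple[str, str],
                     language: str) -> int:
    """Compare using a single language chosen by the caller."""
    key = KEYS[language]
    return three_way(key(a[0]), key(b[0]))

x = ("ab", "plain")
y = ("ba", "reversed")
print(cmp_left_wins(x, y), cmp_left_wins(y, x))   # -1 -1
print(cmp_one_language(x, y, "plain"),
      cmp_one_language(y, x, "plain"))            # -1 1
```

In the first line each string claims to sort before the other, so no ordering of the pair is consistent; fixing one language for the whole comparison restores the expected sign flip.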

Michael Scott

Apr 14, 2004, 7:39:17 AM

to Dan Sugalski, P6I List

On 13 Apr 2004, at 23:43, Dan Sugalski wrote:

> I've been assuming it's a left-side wins, as you're tacking onto an
> existing string, so you'd get English in all cases. Alternately you
> could get an exception. The end result of a mixed-language operation
> could certainly be the Dunno language or the current default--both'd
> be reasonable.
>

Would I be right in thinking that *language* in the context of Parrot
strings is not necessarily an accurate description of the actual
language of the string, but rather a means of specifying a particular
set of idiosyncratic behavior normally associated with an actual
language?

An "english" string continues to behave in an English way regardless of
what I append to or insert into it.

Is there ever a situation where the contents of the appended/inserted
strings are altered because of the change in *language*? In other
words, are there any *language* (as distinct from character set)
transforms? And, can new *languages* be defined?

For example, will there be a way to define a *language* "toetsch" where
'ro' becomes '0r' in 'b0rken', and 'see' becomes 's.'?

Mike

Larry Wall

Apr 14, 2004, 2:16:36 PM

to P6I List

On Wed, Apr 14, 2004 at 01:39:17PM +0200, Michael Scott wrote:

I think the idea of tagging complete strings with "language" is not
terribly useful. If it's to be of much use at all, then it should
be generalized to a metaproperty system for applying any property to
any range of characters within a string, such that the properties
float along with the characters they modify. The whole point of
doing such properties is to be able to ignore them most of the time,
and then later, after you've constructed your entire XML document,
you can say, "Oh, by the way, does this character have the "toetsch"
property?" There's no point in tagging text with language if 99%
of it gets turned into "Dunno", or "English, but not really."
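
A small sketch of what such range properties might look like. `TaggedString` and its methods are hypothetical, and the representation (a list of half-open index ranges, each with a name and value) is just one possible choice, assumed for this example.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedString:
    text: str
    # Each span is (start, end, name, value), covering text[start:end].
    spans: list = field(default_factory=list)

    def tag(self, start: int, end: int, name: str, value: str) -> "TaggedString":
        """Attach a property to a range of characters."""
        self.spans.append((start, end, name, value))
        return self

    def __add__(self, other: "TaggedString") -> "TaggedString":
        """Concatenate; the right operand's spans shift with its characters."""
        shift = len(self.text)
        moved = [(s + shift, e + shift, n, v) for s, e, n, v in other.spans]
        return TaggedString(self.text + other.text, self.spans + moved)

    def properties_at(self, index: int) -> dict:
        """Ask, after the fact, which properties cover one character."""
        return {n: v for s, e, n, v in self.spans if s <= index < e}

greeting = (TaggedString("hello ").tag(0, 5, "language", "english")
            + TaggedString("mundo").tag(0, 5, "language", "spanish"))
print(greeting.properties_at(0))  # {'language': 'english'}
print(greeting.properties_at(6))  # {'language': 'spanish'}
```

Concatenation never has to pick a winner: each range keeps its own tag, and code that does not care about the tags never looks at them.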

Larry

Dan Sugalski

Apr 14, 2004, 3:02:55 PM

to Michael Scott, P6I List

At 1:39 PM +0200 4/14/04, Michael Scott wrote:
>On 13 Apr 2004, at 23:43, Dan Sugalski wrote:
>
>>I've been assuming it's a left-side wins, as you're tacking onto an
>>existing string, so you'd get English in all cases. Alternately you
>>could get an exception. The end result of a mixed-language
>>operation could certainly be the Dunno language or the current
>>default--both'd be reasonable.
>>
>
>Would I be right in thinking that *language* in the context of
>Parrot strings is not necessarily an accurate description of the
>actual language of the string, but rather a means of specifying a
>particular set of idiosyncratic behavior normally associated with an
>actual language?

Basically, yes.

>Is there ever a situation where the contents of the
>appended/inserted strings are altered because of the change in
>*language*? In other words, are there any *language* (as distinct
>from character set) transforms? And, can new *languages* be defined?

New language code could certainly be defined, yes. I'm not sure you'd
see too many explicit transforms from one to another past some sort
of initial classification.

>For example, will there be a way to define a *language* "toetsch"
>where 'ro' becomes '0r' in 'b0rken', and 'see' becomes 's.'?

Probably not, no, unless you really wanted to mangle the
upcase/downcase/titlecase transformations.

Aaron Sherman

Apr 14, 2004, 3:19:24 PM

to Leopold Toetsch, Perl6 Internals List

On Tue, 2004-04-13 at 18:23, Leopold Toetsch wrote:
> Aaron Sherman <a...@ajs.com> wrote:
> > For example, in Perl5/Ponie:
>
> > @names=<NAMES>;
> > print "Phone Book: ", sort(@names), "\n";
>
> > In this example, I don't see why I would care that NAMES might be a
> > pseudo-handle that iterates over several databases, and returns strings
> > in the 7 different languages
>
> I already did show an example where uc("i") isn't "I". Collating is still
> more complex than a »simple« uc().

Correct. I agree, and I don't think anything I said contradicted that,
did it?

> Well, we don't know what the caller expects. The caller has to decide.
> There are basically at least two ways: Treat all strings language
> independent (from their origin) or append more information to each
> string.

Hmmm... or the third, and far more common approach in all languages that
I've seen that deal with these issues: deal with the comparison
according to the rules set out by the language in which the comparison
is being done. Why is that option being passed over? Is it considered to
be, in some way, identical to ignoring language distinctions? How?
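A minimal sketch of that third approach, in Python for concreteness: the strings carry no language tag, and the caller picks one rule set for the whole sort. The `TRADITIONAL_SPANISH` table and the `make_sort_key` helper are hypothetical names invented for this illustration, and the table is a toy, not a real collation.

```python
# Toy collation table; the ordering is illustrative only.
TRADITIONAL_SPANISH = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j",
                       "k", "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s",
                       "t", "u", "v", "w", "x", "y", "z"]


def make_sort_key(alphabet):
    """Build a sort-key function from an ordered list of collation elements."""
    rank = {element: i for i, element in enumerate(alphabet)}
    longest = max(len(element) for element in alphabet)

    def sort_key(text):
        text = text.lower()
        key, pos = [], 0
        while pos < len(text):
            # Try the longest element first so "ll" is read as one unit.
            for width in range(longest, 0, -1):
                chunk = text[pos:pos + width]
                if chunk in rank:
                    key.append(rank[chunk])
                    pos += len(chunk)
                    break
            else:
                # Unknown characters sort after the alphabet, by code point.
                key.append(len(alphabet) + ord(text[pos]))
                pos += 1
        return key

    return sort_key


names = ["llama", "luz", "chico", "lodo", "cielo", "dama"]

# Plain code point order: the default when no rules are chosen.
by_codepoint = sorted(names)

# The same untagged strings, sorted by the rules the caller asked for.
by_spanish = sorted(names, key=make_sort_key(TRADITIONAL_SPANISH))
```

Here the choice of rules belongs to the sort call, not to any of the strings being sorted.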

> >> *) Provides language-sensitive character overrides ('ll' treated as a
> >> single character, for example, in Spanish if that's still desired)
> >> *) Provides language-sensitive grouping overrides.
>
> > Ah, and here we come to my biggest point of confusion.
>
> Another example:
>
> "my dog Fiffi" eq "my dog Fi\x{fb03}"
>
> When my program is doing typographical computations, above equation is
> true. And useful. The characters "f", "f", "i" are goin' to be printed.
> But the ligature "ffi" takes less space when printed as such.
> This is the same character string, though, when I'm a reader of this dog
> news paper.

Ok, so here you essentially say, "in the typographical context this
statement has one result, in a string data context it has another."

So, why is that:

"my dog Fiffi":language("blah") eq "my dog Fi\x{fb03}":language("blah")

and not

use language "blah";

"my dog Fiffi" eq "my dog Fi\x{fb03}"

and what in Parrot's name does

"james":language("blah") eq "jim":language("bletch")

mean? Should "blah"'s language rules (in which "james" and "jim" are the
same name) or "bletch"'s language rules (in which they are not) take
priority? The comparison of two different languages would have to be
done in a third context of "culture" (e.g. "culture foo holds that
blah's rules for names are used and bletch's rules for everything else
are used except when a word in bletch was derived from a word used in
blah during the third invasion and swap meet of 1233").

Then, of course, we can get into how I feel about my program telling me
(in any context) that "ffi" and "\x{fb03}" are the same for any number
of reasons, not the least of which is that I consider such
representations to be markup, not text... but that's just me, and
perhaps I'll just have to put "use language 'ignorant American geek'" at
the start of all of my programs ;)
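The two readings of the "Fiffi" comparison can be made concrete with a small Python sketch. It assumes that the "typographical" reading is approximated by Unicode NFKC normalization, which folds the U+FB03 ligature into "ffi"; the function names are made up for this illustration.

```python
import unicodedata


def eq_codepoints(a, b):
    """String-data reading: equal only if the code points match."""
    return a == b


def eq_compat(a, b):
    """Reader's reading: equal after NFKC folds compatibility characters."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


plain = "my dog Fiffi"
ligature = "my dog Fi\ufb03"  # ends with LATIN SMALL LIGATURE FFI

same_codepoints = eq_codepoints(plain, ligature)  # False
same_for_reader = eq_compat(plain, ligature)      # True
```

Neither string needs a tag for this to work: the caller picks the comparison that matches the context.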

> > b) Strings will have different languages and behave according to their
> > "sources" regardless of the native rules of the user.

Again, I have never seen any source of information that suggests that
there is a universally known way to implement the above. Don't get me
started on the impact of going to southeast Asia and suggesting that
"ok, one of your languages' rules has to win when comparing characters of
differing languages"... ha! IMHO, the only thing that CAN be done at
such a low level as Parrot is to do the work according to the language
rules that govern the rest of this execution of the program, and if a
string makes no sense in that context, an exception is thrown.

But otherwise, how do you sort \x{6728} in Japanese vs Mandarin Chinese?
The two languages have different answers, and you HAVE to pick one.

> >> IW: Mush together (either concatenate or substr replacement) two
> >> strings of different languages but same charset
>
> > According to whose rules?
>
> User level - what do you want to achieve. At codepoint level the
> operation is fine. It doesn't make sense above that, though.

So, you seem to be suggesting that a single language (that of the user,
not the 2+ involved if you tag every string) should decide? If so, why
have strings tagged with language?

> > This means that someone's rules must become dominant,
>
> It doesn't make much sense to do
>
> bors S0, S1 # stringwise bit not
>
> to anything that isn't singlebyte encoded. It depends.

Sorry, you lost me. Did I bring that up? I was asking if:

$a cmp $b

would have a result in which $b was considered with respect to $a's
language or vice versa. Most commonly (always?) there is an incomplete
intersection of rules between the two, so someone's rules will have to
"win". So you have choices:

  * If you go with LHS vs. RHS, then sort gets borked because sort
    will reverse the "sides" repeatedly as it executes. This can and
    would result in infinite sort times.
  * If you come up with a list of languages in descending order of
    dominance then there will be at least as many camps that
    disagree with you as the length of the list minus one.
  * If you try to architect a universal set of rules for applying
    all language rules to strings involving all other language
    rules, you will finish right about the time Esperanto takes over
    the world, and you will have conducted several world wars in the
    process.
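The first bullet can be shown with a deliberately contrived Python sketch. The `Tagged` class and the two toy rule sets ("forward" and "reverse" code point order) are assumptions made up for the illustration; real language rules would differ in subtler ways, but the failure has the same shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A string tagged with the (toy) language whose rules it prefers."""
    text: str
    language: str


def _three_way(a, b):
    """Classic cmp: -1, 0 or 1."""
    return (a > b) - (a < b)


# Two toy rule sets that disagree about ordering.
RULES = {
    "forward": lambda a, b: _three_way(a, b),
    "reverse": lambda a, b: _three_way(b, a),
}


def cmp_lhs_wins(left, right):
    """Compare using the rules of whichever string is on the left."""
    return RULES[left.language](left.text, right.text)


def cmp_with_context(left, right, language):
    """Compare using one rule set chosen by the caller; the tags are ignored."""
    return RULES[language](left.text, right.text)


x = Tagged("b", "forward")
y = Tagged("a", "reverse")

# With LHS-wins, each string sorts after the other.
lhs_xy = cmp_lhs_wins(x, y)
lhs_yx = cmp_lhs_wins(y, x)

# With a single context the answers are consistent.
ctx_xy = cmp_with_context(x, y, "forward")
ctx_yx = cmp_with_context(y, x, "forward")
```

Once both orderings report "greater than", no sort routine can rely on the comparator. Whether a given sort loops or merely returns an arbitrary order depends on its implementation.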

If you think I'm over-stating, then please go to Iraq and ask the Sunni
how they feel about the Shia rules being dominant in some situations
when comparing Arabic strings....

And so, I return to my premise: why is this information associated with
strings, rather than being some sort of context that is associated with
an operation (e.g. "compare these two Arabic strings WRT the rules of
Sunni Bedouins speaking the Najdi Arabic dialect")? And why, again in
response to Dan's comments, is that monoculturistic of me?

So far, the responses have sounded like "that's complex, you wouldn't
understand". I hope that I'm making it clear that I understand the
complexities just fine, and I don't see associating language with a
string as resolving so much as introducing complexity to an already
intractable problem.

Leopold Toetsch

Apr 15, 2004, 5:00:51 AM

to Aaron Sherman, perl6-i...@perl.org

Aaron Sherman <a...@ajs.com> wrote:

> So, why is that:

> "my dog Fiffi":language("blah") eq "my dog Fi\x{fb03}":language("blah")

> and not

> use language "blah";
> "my dog Fiffi" eq "my dog Fi\x{fb03}"

What, if this is:

$dog eq "my dog Fi\x{fb03}"

and C<$dog> hasn't some language info attached?

leo

Michael Scott

Apr 15, 2004, 6:17:16 AM

to Larry Wall, P6I List

On 14 Apr 2004, at 20:16, Larry Wall wrote:

> I think the idea of tagging complete strings with "language" is not
> terribly useful. If it's to be of much use at all, then it should
> be generalized to a metaproperty system for applying any property to
> any range of characters within a string, such that the properties
> float along with the characters they modify. The whole point of
> doing such properties is to be able to ignore them most of the time,
> and then later, after you've constructed your entire XML document,
> you can say, "Oh, by the way, does this character have the "toetsch"
> property?" There's no point in tagging text with language if 99%
> of it gets turned into "Dunno", or "English, but not really."
>

It seems natural to associate language with utterances. When these
utterances are written down - or as I'm doing here, skipping the
speaking part and uttering straight to text - then the association
still works. But once we start emitting written things (strings) in a
less aural way, then the notion of an associated language can easily
become forced or inaccurate.

The process whereby we read a string like

"Is <b>this</b> string in Englisch?"

is generally a kind of lossy conversion to our language of preference
for that particular string. It's very difficult for us to do otherwise.
This natural generalization means that there will always be a demand
for strings to have language associated with them, no matter how
illogical it may seem to those who reflect upon it a bit.

I think it is this user state that Dan is trying to support. And, in so
far as it models natural and common perception, I think I agree with
him.

Lossy conversion is a kind of info-sin, especially when it should be
avoided. There are circumstances where it would be more natural to read
the above string as

"Is open-bold-tag this close-bold-tag string in
the-German-word-for-English question mark"

i.e. when we are being more precise.

It is for this more precise user state that we would be preserving
information on substrings.

There are plenty of strings which are simply never intended to be
uttered, and therefore are effectively language-less. And many strings
obviously in particular languages are often treated as if they weren't.
It would be odd to subject the processing of such strings to a
requirement to preserve nonexistent or useless information. Any sensible
user would want to turn off language processing in such cases.

So, we need to ask the user their state, and have the necessary level
of support in place to be able to behave accordingly.

Looking at this from an object-oriented perspective I can't help but
wonder why we don't have a hierarchy of Parrot string types

String
LanguageString
MultiLanguageString

with a "left wins" rule for composition.
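One possible reading of that hierarchy, as a Python sketch. Everything beyond the three class names is an assumption: all three are made direct subclasses of `String`, "left wins" is taken to mean that concatenation produces a result of the left operand's kind, and the `_combine` and `_segments_of` helpers are invented for the illustration.

```python
class String:
    """Plain string with no language information."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # "Left wins": the left operand decides what the result looks like.
        return self._combine(other)

    def _combine(self, other):
        return String(self.text + other.text)


class LanguageString(String):
    """String tagged with a single language."""

    def __init__(self, text, language):
        super().__init__(text)
        self.language = language

    def _combine(self, other):
        # The right operand's language, if any, is dropped.
        return LanguageString(self.text + other.text, self.language)


class MultiLanguageString(String):
    """String made of (text, language-or-None) segments."""

    def __init__(self, segments):
        self.segments = list(segments)
        super().__init__("".join(text for text, _ in self.segments))

    def _combine(self, other):
        return MultiLanguageString(self.segments + _segments_of(other))


def _segments_of(value):
    """View any of the three kinds as a list of segments."""
    if isinstance(value, MultiLanguageString):
        return list(value.segments)
    return [(value.text, getattr(value, "language", None))]


plain = String("id-42 ") + LanguageString("Hund", "de")
tagged = LanguageString("Hallo ", "de") + LanguageString("world", "en")
mixed = MultiLanguageString([("Hallo ", "de")]) + LanguageString("world", "en")
```

Under this reading, a user who does not care about language starts from a plain `String` and never pays for tagging, while one who needs precision starts from a `MultiLanguageString` and loses nothing.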

Mike

Aaron Sherman

Apr 15, 2004, 11:43:35 AM

to Leopold Toetsch, Perl6 Internals List

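On Thu, 2004-04-15 at 05:00, Leopold Toetsch wrote:
> $dog eq "my dog Fi\x{fb03}"
>
> and C<$dog> hasn't some language info attached?

Looks good to me. Great example!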

Seriously, why is that a problem? That was my entry-point to this
conversation: I just don't see any case in which performing a comparison
of ANY two strings according to whatever arbitrary SINGLE language rules
is a problem. I cannot imagine the case where you need two or more
language rules AND could start off with any sense of what that would
mean, and even if you could contrive such a case, I would suggest that
its rarity should dictate it being attached to a class that defines a
string-like object which mutates its behavior based on the language
spoken by the maintainer of the database from which it was fetched or
somesuch.

Leopold Toetsch

Apr 15, 2004, 5:55:52 PM

to Aaron Sherman, perl6-i...@perl.org

Aaron Sherman <a...@ajs.com> wrote:
> On Thu, 2004-04-15 at 05:00, Leopold Toetsch wrote:
>> $dog eq "my dog Fi\x{fb03}"
>>
>> and C<$dog> hasn't some language info attached?

> Looks good to me. Great example!

> Seriously, why is that a problem?

Dan's problem to come up with better examples--or explanations :)

leo - resisting from further utterances WRT that topic in the absence of
"The Plan(tm)".

Dan Sugalski

Apr 15, 2004, 11:13:48 PM

to l...@toetsch.at, Aaron Sherman, perl6-i...@perl.org

At 11:55 PM +0200 4/15/04, Leopold Toetsch wrote:
>Aaron Sherman <a...@ajs.com> wrote:
>> On Thu, 2004-04-15 at 05:00, Leopold Toetsch wrote:
>>> $dog eq "my dog Fi\x{fb03}"
>>>
>>> and C<$dog> hasn't some language info attached?
>
>> Looks good to me. Great example!
>
>> Seriously, why is that a problem?
>
>Dan's problem to come up with better examples--or explanations :)

Nah, that turns out not to be the case. It's my plan, and it's
reasonable to say I'm OK with it. :) While I'd prefer to have
everyone agree, I can live with it if people don't.

>leo - resisting from further utterances WRT that topic in the absence of
>"The Plan(tm)".

The Plan is in progress, though I admit I'm tempted to hit easier and
less controversial things (like, say, threads or events) first.

Aaron Sherman

Apr 16, 2004, 1:33:44 AM

to Dan Sugalski, Perl6 Internals List

On Thu, 2004-04-15 at 23:13, Dan Sugalski wrote:

> Nah, that turns out not to be the case. It's my plan, and it's
> reasonable to say I'm OK with it. :) While I'd prefer to have
> everyone agree, I can live with it if people don't.

Perhaps, as usual, I've been too verbose and everyone just skipped over
what I thought were useful questions, but I came into this thinking "I
must just not get it"... now I'm left with the feeling that there are
some basic questions no one is asking here. Don't respond to this
message, but please keep these questions in mind as you start to
implement... whatever it is that you're going to implement for this.

1. People have referred to comparing names, but most of the things
   that make comparing names hard exist with respect to NAMES, and
   not arbitrary strings (e.g. "McLean" is very different from
   substr("358dsMcLeannbv35d",5,6)). That is not something that
   attaching metadata to a string is likely to resolve.
2. There is no universal interchange rule-set (that I have ever
   heard of) for operating on sequences of characters with respect
   to two or more different languages at once. You have to pick a
   language's (or culture's) rules to use; otherwise you are
   comparing (or operating on) apples and oranges.
3. In any given comparison-type operation, one side's rules will
   have to become dominant for that operation. Woefully, you have
   no realistic way to decide this at run-time (e.g. because going
   with LHS-wins would result in sorts potentially getting C<($a
   cmp $b) == 1> and C<($b cmp $a) == 1>, which can result in
   infinite sort times).
4. Given 1..3, you will probably have to implement some kind of
   language "context" system (in most languages, this is handled by
   locale) at some point, and it may need to take priority over the
   language property of the strings that it operates on in certain
   cases.
5. Given 4, all unary operators become, for example,
       {
           set_current_locale($s.language);
           uc($s.data)
       }
   which is, after all, what most languages do anyway, but they keep
   that language information as a piece of global state. Allowing
   just for lexical scoping of such things would be very nice.
6. Separate from 1..5, language is an interesting property to
   associate with strings, but so are a vast number of other
   properties. Why are all of them second class citizens WRT
   parrot, but not language? Why not build a class one level of
   abstraction above raw strings which can bear arbitrary
   properties?
7. Which programming language does Parrot wish to host which
   requires unique language tagging of all string data? Would this
   perhaps be better left for a 2.0 feature, once the needs of the
   client languages are better understood?
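Points 4 and 5 can be sketched in Python as a scoped language context. The `language` context manager and the `uc` function are hypothetical names for this illustration, and the only language-specific rule included is the Turkish dotted/dotless i mapping (i to U+0130, U+0131 to I). A context manager is dynamically rather than lexically scoped, so this only approximates the lexical scoping asked for in point 5.

```python
import contextvars
from contextlib import contextmanager

# The language context; "en" stands in for "no special rules".
_current_language = contextvars.ContextVar("current_language", default="en")


@contextmanager
def language(code):
    """Set the language context for the duration of a with-block."""
    token = _current_language.set(code)
    try:
        yield
    finally:
        # Restore whatever context was active before the block.
        _current_language.reset(token)


def uc(text):
    """Uppercase text according to the current language context."""
    if _current_language.get() == "tr":
        # Turkish: dotted i uppercases to dotted capital I (U+0130),
        # dotless i (U+0131) uppercases to plain I.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


default_result = uc("istanbul")

with language("tr"):
    turkish_result = uc("istanbul")

after_block = uc("i")  # the context is restored once the block ends
```

The strings themselves stay untagged; the operation consults the context, and the context ends with the block instead of lingering as global state.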

Ok, that's my piece. Thanks for taking the time. I'll be over here
watching now.

> easier and less controversial things (like, say, threads or events) first.

Hah! That's rich!

Jeff Clites

Apr 16, 2004, 3:27:34 AM

to Larry Wall, P6I List

On Apr 14, 2004, at 11:16 AM, Larry Wall wrote:

> I think the idea of tagging complete strings with "language" is not
> terribly useful. If it's to be of much use at all, then it should
> be generalized to a metaproperty system for applying any property to
> any range of characters within a string, such that the properties
> float along with the characters they modify. The whole point of
> doing such properties is to be able to ignore them most of the time,
> and then later, after you've constructed your entire XML document,
> you can say, "Oh, by the way, does this character have the "toetsch"
> property?" There's no point in tagging text with language if 99%
> of it gets turned into "Dunno", or "English, but not really."

I tend to agree, and BTW that's exactly what an NSAttributedString does
on Mac OS X. To quote the docs:

An attributed string identifies attributes by name, storing a value
under the name in an NSDictionary. You can assign any attribute
name/value pair you wish to a range of characters, in addition to the
standard attributes described in the "Constants" section....

See:
<http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/Classes/NSAttributedString.html>

(Of course, an NSDictionary is the Cocoa version of a hash.)

This is the basis of styled text handling on Mac OS X, but you can
toetsch-ify XML documents as well.
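A rough Python sketch of the idea, for readers who have not met it. This is not the NSAttributedString API: the class and method names below are invented for the illustration. It shows only the core property Larry describes, namely that attributes are attached to character ranges and float along when strings are combined.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedString:
    """Text plus a list of (start, end, attributes) runs; end is exclusive."""
    text: str
    runs: list = field(default_factory=list)

    def set_attribute(self, start, end, name, value):
        """Attach one name/value pair to the characters in [start, end)."""
        self.runs.append((start, end, {name: value}))

    def attributes_at(self, index):
        """Merge the attributes of every run that covers the given index."""
        merged = {}
        for start, end, attrs in self.runs:
            if start <= index < end:
                merged.update(attrs)
        return merged

    def __add__(self, other):
        # Shift the right operand's runs so they stay on the same characters.
        offset = len(self.text)
        shifted = [(s + offset, e + offset, dict(a)) for s, e, a in other.runs]
        return AttributedString(self.text + other.text, self.runs + shifted)


head = AttributedString("Is this string in ")
tail = AttributedString("Englisch?")
tail.set_attribute(0, 8, "language", "de")  # tag only the word "Englisch"

doc = head + tail
at_word = doc.attributes_at(len(head.text))  # the "E" of "Englisch"
at_start = doc.attributes_at(0)              # untagged text
```

Most code can ignore the runs entirely and work on `doc.text`; the tags are there only for code that asks for them.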

JEff