
ICU incorporation and string changes heads-up


Jeff Clites

Mar 17, 2004, 1:00:31 PM
to p6i List
I'm almost finished preparing a patch which incorporates the usage of
ICU, and makes some additional changes to the internal representation
of strings. These changes give us an internal representation model
which is a bit simpler, and is measurably faster. (More details to
follow with the actual patch, of course.)

The patch itself is shaping up to be rather large (mostly because I've
added an interpreter argument to the string functions which were
lacking it, and updated the places where these functions are called),
but is self-contained in the sense that (1) the functions which
actually call out directly to any ICU API are collected into one file
(for clarity, and in case we ever want to transition to a different
library), (2) there are minimal API changes, and (3) the GC behavior of
strings isn't affected.

It will take me a few more days to finish this up, but I wanted to give
everyone a heads-up that this is on the way.

JEff

Jeff Clites

Apr 7, 2004, 1:27:14 PM
to p6i List
It has taken me longer than I expected to carve out some time to work
on finishing my ICU/string patch, but it's progressing now, and I just
finished tracking down some bugs of mine that the config_lib.pasm stuff
was exercising. So I'm currently back to the state of passing all
expected tests (i.e., not the [perl #28344] test which is currently
failing for me in the unpatched source).

Just wanted to send out a quick update, to let everyone know that I
haven't completely dropped the ball here.

JEff

Dan Sugalski

Apr 7, 2004, 1:45:37 PM
to Jeff Clites, p6i List

Are you at a point where anything can get checked in? I'd rather have
a partial checkin that we can build on now than wait a while and
have it be even more difficult to integrate as the source drifts
further from what you're working on...
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Jeff Clites

Apr 8, 2004, 11:52:42 AM
to Dan Sugalski, p6i List
On Apr 7, 2004, at 10:45 AM, Dan Sugalski wrote:

> At 10:27 AM -0700 4/7/04, Jeff Clites wrote:
>> It has taken me longer than I expected to carve out some time to work
>> on finishing my ICU/string patch, but it's progressing now, and I
>> just finished tracking down some bugs of mine that the
>> config_lib.pasm stuff was exercising. So I'm currently back to the
>> state of passing all expected tests (i.e., not the [perl #28344] test
>> which is currently failing for me in the unpatched source).
>>
>> Just wanted to send out a quick update, to let everyone know that I
>> haven't completely dropped the ball here.
>
> Are you at a point where anything can get checked in? I'd rather have
> a partial checkin that we can build on now than wait a while and
> have it be even more difficult to integrate as the source drifts
> further from what you're working on...

Yes, I think so actually. I have it up-to-date with the current head of
cvs. I need to make a pass through and remove a few things that I've
//-commented out and such, and I'll work on that tonight, and with luck
I'll be able to send it in either tomorrow or this weekend.

JEff

Dan Sugalski

Apr 8, 2004, 12:02:21 PM
to Jeff Clites, p6i List

If it's just C++-style comments, send the patch now -- we can yank
those out for you :)

Jeff Clites

Apr 9, 2004, 6:42:16 AM
to Dan Sugalski, p6i List
I've sent my patch in through RT--it's [perl #28405]!

JEff

Leopold Toetsch

Apr 9, 2004, 10:19:13 AM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:
> I've sent my patch in through RT--it's [perl #28405]!

Phew, that's huge. I'd really like to have smaller patches that do it
step by step. But anyway, the patch must have been a lot of work so
let's see and make the best out of it.

Some questions:

First:
- What is string->representation? It seems only to contain
string_rep_{one,two,four}.
- How does UTF8 fit here?
- Where are encoding and chartype now?
- Where is string->language?

With this string type how do we deal with anything beyond codepoints?

And some misc remarks/questions:

- string_compare seems just to compare byte, short, or int_32
- What happened to external constant strings?
- What's the plan towards all the transcode opcodes? (And leaving these
as a noop would have been simpler)
- hash_string seems not to deal with mixed encodings anymore.
- Why does read/PIO_reads generate UTF-8 first?
- Why does PIO_putps convert to UTF-8?

(I've not applied the patch here, so I might have missed something)

> JEff

leo

Dan Sugalski

Apr 9, 2004, 11:07:20 AM
to l...@toetsch.at, Jeff Clites, perl6-i...@perl.org
At 4:19 PM +0200 4/9/04, Leopold Toetsch wrote:
>Jeff Clites <jcl...@mac.com> wrote:
>> I've sent my patch in through RT--it's [perl #28405]!
>
>Phew, that's huge. I'd really like to have smaller patches that do it
>step by step. But anyway, the patch must have been a lot of work so
>let's see and make the best out of it.

I'll add a "Wow" in here.

I've snipped all of Leo's concerns, not because I don't share them
but because I do. (ICU also still doesn't build on a debian/testing
build, though that's Debian's fault.) I also have the concern that
this moves over to Unicode-everywhere.

But... this gets us very much closer to where we want to be, and I'm
figuring that we're better off applying this and working out the
kinks than not. I'll leave this one to Leo to make final decision on,
though.

Jeff Clites

Apr 9, 2004, 12:20:57 PM
to Dan Sugalski, perl6-i...@perl.org, l...@toetsch.at
On Apr 9, 2004, at 8:07 AM, Dan Sugalski wrote:

> At 4:19 PM +0200 4/9/04, Leopold Toetsch wrote:
>> Jeff Clites <jcl...@mac.com> wrote:
>>> I've sent my patch in through RT--it's [perl #28405]!
>>
>> Phew, that's huge. I'd really like to have smaller patches that do it
>> step by step. But anyway, the patch must have been a lot of work so
>> let's see and make the best out of it.
>
> I'll add a "Wow" in here.
>
> I've snipped all of Leo's concerns, not because I don't share them but
> because I do.

I'm in the process of crafting my (appropriately) long explanation in
reply to Leo's concerns.

> I also have the concern that this moves over to Unicode-everywhere.

It does, in a sense, and I'll defend that. (And at some point I'll ask
you for a clarification of your concerns to make sure that I answer
them directly, since "Unicode" can mean different things--a character
model, a standard, a set of encodings....)

JEff

Leopold Toetsch

Apr 9, 2004, 12:29:45 PM
to Dan Sugalski, Jeff Clites, perl6-i...@perl.org
Dan Sugalski wrote:

> But... this gets us very much closer to where we want to be, and I'm
> figuring that we're better off applying this and working out the kinks
> than not. I'll leave this one to Leo to make final decision on, though.

Thanks Dan for this easter egg :)

Intermediate results:

- patch applied fine
- compiles currently modulo some C++ style variable declarations and
s/#import/#include/
- a few warnings about shadowed vars
- patched source does *not* conform to pdd07
- additional coffee breaks during compiles will ruin me finally
- compiled ok
- make test - not too bad (5/1495 subtests failed)
- JITted NCI (uses string_make for 't' sig) is b0rken
- one native_pbc test fails (more are skipped now)
- make testj fails some more (8/1432 subtests failed)

I'll not apply it before tomorrow, though. (Unless someone else is
faster :)

leo

Dan Sugalski

Apr 9, 2004, 12:38:23 PM
to Leopold Toetsch, Jeff Clites, perl6-i...@perl.org
At 6:29 PM +0200 4/9/04, Leopold Toetsch wrote:
>Dan Sugalski wrote:
>
>>But... this gets us very much closer to where we want to be, and
>>I'm figuring that we're better off applying this and working out
>>the kinks than not. I'll leave this one to Leo to make final
>>decision on, though.
>
>Thanks Dan for this easter egg :)

Hey, they don't call you the patchmonster for nothing--didn't want to
break your streak. :)

>I'll not apply it before tomorrow, though. (Unless someone else is faster :)

If you want it in I can commit it now--I've a local version and big
enough pipe for it not to be a big deal.

Dan Sugalski

Apr 9, 2004, 12:59:24 PM
to Jeff Clites, perl6-i...@perl.org, l...@toetsch.at
At 9:20 AM -0700 4/9/04, Jeff Clites wrote:
>On Apr 9, 2004, at 8:07 AM, Dan Sugalski wrote:
>
>>At 4:19 PM +0200 4/9/04, Leopold Toetsch wrote:
>>>Jeff Clites <jcl...@mac.com> wrote:
>>>> I've sent my patch in through RT--it's [perl #28405]!
>>>
>>>Phew, that's huge. I'd really like to have smaller patches that do it
>>>step by step. But anyway, the patch must have been a lot of work so
>>>let's see and make the best out of it.
>>
>>I'll add a "Wow" in here.
>>
>>I've snipped all of Leo's concerns, not because I don't share them
>>but because I do.
>
>I'm in the process of crafting my (appropriately) long explanation
>in reply to Leo's concerns.

Cool. I'll wait. :)

>>I also have the concern that this moves over to Unicode-everywhere.
>
>It does, in a sense, and I'll defend that. (And at some point I'll
>ask you for a clarification of your concerns to make sure that I
>answer them directly, since "Unicode" can mean different things--a
>character model, a standard, a set of encodings....)

We can hash that out after the other things get argued out.

Leopold Toetsch

Apr 9, 2004, 3:59:11 PM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:
> At 6:29 PM +0200 4/9/04, Leopold Toetsch wrote:

>>I'll not apply it before tomorrow, though. (Unless someone else is faster :)

> If you want it in I can commit it now--I've a local version and big
> enough pipe for it not to be a big deal.

Put it in. We'll deal with various issues step by step.

leo

Dan Sugalski

Apr 9, 2004, 4:44:30 PM
to l...@toetsch.at, perl6-i...@perl.org

Done. It'll guaranteed kill half the tinderboxen--I think my first
thing to do on monday is to patch up the build procedure to use the
system ICU if it's available.

Adam Thomason

Apr 9, 2004, 4:55:30 PM
to Dan Sugalski, l...@toetsch.at, perl6-i...@perl.org
> -----Original Message-----
> From: Dan Sugalski [mailto:d...@sidhe.org]
> Sent: Friday, April 09, 2004 1:45 PM
> To: l...@toetsch.at
> Cc: perl6-i...@perl.org
> Subject: Re: ICU incorporation and string changes heads-up
>
>
> At 9:59 PM +0200 4/9/04, Leopold Toetsch wrote:
> >Dan Sugalski <d...@sidhe.org> wrote:
> >> At 6:29 PM +0200 4/9/04, Leopold Toetsch wrote:
> >
> >>>I'll not apply it before tomorrow, though. (Unless someone else
> >>>is faster :)
> >
> >> If you want it in I can commit it now--I've a local version and big
> >> enough pipe for it not to be a big deal.
> >
> >Put it in. We'll deal with various issues step by step.
>
> Done. It'll guaranteed kill half the tinderboxen--I think my first
> thing to do on monday is to patch up the build procedure to use the
> system ICU if it's available.

Does this mean you're resigned to requiring a C++ compiler? Or will it be possible to switch ICU off?

Adam

Dan Sugalski

Apr 9, 2004, 5:00:33 PM
to Adam Thomason, l...@toetsch.at, perl6-i...@perl.org
At 1:55 PM -0700 4/9/04, Adam Thomason wrote:
> > From: Dan Sugalski [mailto:d...@sidhe.org]
> > Done. It'll guaranteed kill half the tinderboxen--I think my first
> > thing to do on monday is to patch up the build procedure to use the
> > system ICU if it's available.
>
>Does this mean you're resigned to requiring a C++ compiler? Or will
>it be possible to switch ICU off?

It'll be possible to switch it off, or to use the system ICU install
if there is one.

Jeff Clites

Apr 9, 2004, 5:10:50 PM
to l...@toetsch.at, perl6-i...@perl.org
On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:

> Jeff Clites <jcl...@mac.com> wrote:
>> I've sent my patch in through RT--it's [perl #28405]!
>
> Phew, that's huge. I'd really like to have smaller patches that do it
> step by step.

Yes, I know it got quite large--sorry about that, I know it makes
things more difficult. Mostly it was all-or-nothing, though--changes to
string.c, and then dealing with the consequences of those changes. (It
would have been smaller if I had not added the "interpreter" argument
to the string functions which lacked it--updating the places they're
called is probably half of the patch or so.)

In addition to my responses below, I hope to have a better-organized
explanation of the approach I've taken, and the rationale behind it.

> But anyway, the patch must have been a lot of work so
> let's see and make the best out of it.
>
> Some questions:
>
> First:
> - What is string->representation? It seems only to contain
> string_rep_{one,two,four}.
> - How does UTF8 fit here?
> - Where are encoding and chartype now?

The key idea is to move to a model in which a string is not represented
as a bag of bytes + associated encoding, but rather conceptually as an
array of (abstract) characters, which boils down to an array of
numbers, those numbers being the Unicode code point of the character.
The easiest way to do this would be to represent a string using an
array of 32-bit ints, but this is a large waste of space in the common
case of an all-ASCII string. So what I'm doing is sticking to the idea
of modeling a string as conceptually an array of 32-bit numbers, but
optimizing by using just a uint8_t[] if all of the characters happen to
have numerical value < 2^8, uint16_t[] if some are outside of that
range but still < 2^16, and finally uint32_t[] in the rare case it
contains characters above that range.

So internally, strings don't have an associated encoding (or chartype
or anything)--if you want to know what the N-th character is, you just
jump to index N for the appropriate datatype, and that number is your
answer.
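
To make that concrete, here's a minimal sketch in C (the enum values
echo the string_rep_{one,two,four} names you spotted; the function name
and exact declarations are illustrative, not the literal code in the
patch):

#include <stdint.h>

typedef enum {
    string_rep_one  = 1,   /* uint8_t[]:  every code point <  2^8  */
    string_rep_two  = 2,   /* uint16_t[]: every code point <  2^16 */
    string_rep_four = 4    /* uint32_t[]: anything up to U+10FFFF  */
} string_representation;

/* O(1) indexing: the N-th character is a plain array lookup,
   whatever the unit width happens to be. */
static uint32_t
string_char_at(const void *buf, string_representation rep, uint32_t n)
{
    switch (rep) {
    case string_rep_one:  return ((const uint8_t  *)buf)[n];
    case string_rep_two:  return ((const uint16_t *)buf)[n];
    default:              return ((const uint32_t *)buf)[n];
    }
}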

In my view, the fundamental question you can ask a string is "what's
your N-th character", and the fundamental thing you do with the answer
is go look up what properties that character has (eg, sort ordering,
case mapping, value as a digit, etc.).

In this model, an "encoding" is a serialization algorithm for strings,
and that's it (much like Data::Dumper defines a serialization format
for other Perl types). A particular encoding is a particular strategy
for serializing a string into a bag of bytes (or vice-versa), almost
always for interchange via I/O (which is always byte-based). The split
into separate "encoding" and "chartype" ends up being undesirable. (I'd
argue that such a split is a possible internal design option for a
transcoding library, but not part of the conceptual API of a string
library. So ICU may have such a split internally, but parrot doesn't
need to.) In particular, this metadata is invariably specified as a
single parameter--in an XML declaration it's called "encoding", in MIME
headers it's called "charset"--but I don't know of any interchange
format which actually tries to specify this sort of thing via two
separate parameters. Additionally, this split isn't reflected in other
string libraries I know of, nor is the concept universal across the
actual encoding standards themselves (though in some cases it is used
pedagogically).

So I get what the split is trying to capture, but I think it's
counterproductive, especially since developers tend to get confused
about the whole "encoding thing", and we make matters worse if we try
to maintain a type of generality that outruns actual usage.

So in this model (to recap), if I have two files which contain the same
Japanese text, one in Shift-JIS and one in UTF-8, then after reading
those into a parrot string, they are identical. (They're the same
text--right?) You could think of this as "early normalization", but my
viewpoint is that the concept of an "encoding" deals exclusively with
how to serialize a string for export (and the reverse), and not with
the in-memory manipulation (or API) of a string itself. This is very
much like defining a format to serialize objects into XML--you don't
end up thinking of the objects as "XML-based" themselves.

There are a couple of key strengths of this approach, from a
performance perspective (in addition to the conceptual benefit I am
claiming):

1) Indexing into a string is O(1). (With the existing model, if you
want to find the 1000th character in a string being represented in
UTF-8, you need to start at the beginning and scan forward.)

2) There's no memory allocation (and hence no GC) needed during string
comparisons or hash lookups. I consider that to be a major win.

3) The Boyer-Moore algorithm can be used for all string searches. (But
currently there are a couple of cases I still need to fill in.)


Incidentally, this closely matches the approach used by ObjC and Java.

> - Where is string->language?

I removed it from the string struct because I think that's the wrong
place for it (and it wasn't actually being used anywhere yet,
fortunately). The problem is that although many operations can be
language-dependent (sorting, for example), the operation doesn't depend
on the language of the strings involved, but rather on the locale of
the reader. The classic example is a list containing a mixture of
English and Swedish names; for an English reader the whole list should
be sorted in English order, and a Swedish reader would want to see them
in Swedish alphabetical order. (That example is from Richard Gillam's
book on Unicode.)

I'm assuming that was the intention of "language". If it was meant to
indicate "perl" v. "python" or something, then I should put it back.


Also, we may want to re-create an API which allows us to obtain an
integer to later use to specify an encoding (right now, encodings are
specified by name, as C-strings). But the ICU API doesn't use this sort
of mechanism, so at the moment we'd just end up turning around and
looking up the C-string to pass into the ICU API--it isn't a
performance enhancer currently.

> With this string type how do we deal with anything beyond codepoints?

Hmm, what do you mean? Even prior to this, all of our operations ended
up relying on being able to either transcode two arbitrary strings to
the same encoding, or ask arbitrary strings what their N-th character
is. Ultimately, this carried the requirement that a string be
representable as a series of code points. I don't think that there is
anything beyond that which a string has to offer. But I'm curious as to
what you had in mind.

> And some misc remarks/questions:
>
> - string_compare seems just to compare byte, short, or int_32

Yep, exactly.

(There are compare-strings-using-a-particular-normalization-form
concepts which we'll ultimately need to handle, which are much like the
simpler concept of case-insensitive comparison. I see these being
handled by separate API/ops, and there are a couple of different
directions that could take.)
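
For example, the comparison loop can walk code points on both sides with
no transcoding and no allocation--a sketch reusing string_char_at from
the illustration above (again, not the literal code in the patch):

static int
string_compare_sketch(const void *abuf, string_representation arep,
                      uint32_t alen,
                      const void *bbuf, string_representation brep,
                      uint32_t blen)
{
    uint32_t i, n = alen < blen ? alen : blen;
    for (i = 0; i < n; i++) {
        uint32_t ca = string_char_at(abuf, arep, i);
        uint32_t cb = string_char_at(bbuf, brep, i);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    /* equal up to the shorter length: the shorter string sorts first */
    return alen == blen ? 0 : (alen < blen ? -1 : 1);
}

Hashing works the same way: feed code points, not bytes, into the hash
function, and two strings that arrived in different encodings hash
identically.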

> - What happened to external constant strings?

They should still work (or could). But the only cases in which we can
optimize, and actually use "in-place" a buffer handed to string_make,
are a handful of encodings. Also, dealing with the "flags" argument
passed into string_make may still be incomplete.

> - What's the plan towards all the transcode opcodes? (And leaving these
> as a noop would have been simpler)

Basically there's no need for a transcode op on a string--it no longer
makes sense, there's nothing to transcode. What we need to add is an
"encode" which creates a bag of bytes from a string + an encoding, and
a "decode" which creates a string based on a bag of bytes + an
encoding. But we currently lack (as far as I can tell) a data type to
hold a naked bag of bytes. It's in my plans to create a PMC for that,
and once that exists to add the corresponding ops.
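
Shape-wise, I'm picturing something like this (names and signatures are
tentative--neither function exists yet, and "ByteBuffer" stands in for
the raw-bytes PMC I just mentioned):

/* string + encoding name -> bag of bytes */
ByteBuffer *string_encode(Interp *interp, STRING *src,
                          const char *encoding);

/* bag of bytes + encoding name -> string */
STRING *string_decode(Interp *interp, ByteBuffer *bytes,
                      const char *encoding);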

> - hash_string seems not to deal with mixed encodings anymore.

Yep, since we're hashing based on characters rather than bytes, there's
no such thing as mixed encodings. That means that, to use my example
from above, a string representing the Japanese word for "sushi" will
hash to the same thing no matter what encoding may have been used to
represent it on disk.

> - Why does PIO_putps convert to UTF-8?

Mostly for testing. We currently lack a way to associate an encoding
with an IO handle (or with an IO read or write), and if you want to
write out a string you have to pick an encoding--that's the recipe
needed to convert a string into something write-able. Until we've
developed that, I'm writing everything out in UTF-8, just as a sensible
thing to use to allow basically any string to be written out. Before
this, we were writing out the raw bytes used to internally represent
the string, which never really made sense.

> - Why does read/PIO_reads generate UTF-8 first?

Similar to the above, but right now that API is only being called by
the freeze/thaw stuff, and its current odd state reflects what needed
to be done to keep it working (basically, to match the write by
PIO_putps). But ultimately, for the freeze-using-opcodes case we should
be accumulating our bytes into a raw byte buffer rather than sneaking
them into the body of a string, but since we don't yet have a data type
representing a raw byte buffer (as mentioned above), this was a
workaround.

Fundamentally, I think we need two sorts of IO API (which may be
handled via IO layers or filters): byte-based and string-based. For the
string based case, an encoding always needs to be specified in some
manner (either implicitly or explicitly, and either associated with the
IO handle or just with a particular read/write).

Of course, pass along any other questions/concerns you have.

JEff

Jeff Clites

Apr 9, 2004, 5:11:41 PM
to Leopold Toetsch, Dan Sugalski, perl6-i...@perl.org
On Apr 9, 2004, at 9:29 AM, Leopold Toetsch wrote:

> Dan Sugalski wrote:
>
>> But... this gets us very much closer to where we want to be, and I'm
>> figuring that we're better off applying this and working out the
>> kinks than not. I'll leave this one to Leo to make final decision on,
>> though.
>
> Thanks Dan for this easter egg :)
>
> Intermediate results:
>
> - patch applied fine

Good! (I tested it enough, so I'm happy that was successful.)

> - compiles currently modulo some C++ style variable declarations

I'm lately being spoiled by C99 (which allows mid-block declarations
too), but these were just oversights on my part.

> and s/#import/#include/

Oops. (It's the ObjC in me.)

> - a few warnings about shadowed vars

Hmm, I'll need to figure out how to turn on warnings about this in gcc.
I would expect some "unused" warning as well.

> - patched source does *not* conform to pdd07

Not too surprising--I'm sure there are some tabs and too-long lines and
such that I need to fix.

> - additional coffee breaks during compiles will ruin me finally

Yes, no kidding! At least, the makefile as currently constructed won't
re-build ICU as a result of re-Configure-ing parrot. (And it might
make sense to remove cleaning of ICU from the default clean target, so
that you only rebuild it if you've cleaned it explicitly.) Also, be
glad I'm stopping it from building the data into a library--that takes
an additional lifetime to build.

> - compiled ok

Yeah!

> - make test - not too bad (5/1495 subtests failed)

Let me know which ones--I'm curious as to why.

> - JITted NCI (uses string_make for 't' sig) is b0rken

Ah yes, it makes sense that I wouldn't have hit this, if it's in the
i386-specific code.

> - one native_pbc test fails (more are skipped now)

Some endianness correction of the serialized string data still needs to
be added, so I wouldn't be surprised if the ppc-generated file fails the
test on i386--this might be that. There are SKIPs around the other
platform test files (they probably should be TODOs), which will need to
be regenerated on those platforms.

> - make testj fails some more (8/1432 subtests failed)

I just confirmed that for me all is passing under make testj (except
t/pmc/object-meths.t:17 of course), but it's not surprising for this to
be platform-specific.

Thanks much for the feedback!

JEff

Jeff Clites

Apr 9, 2004, 5:24:22 PM
to Dan Sugalski, Adam Thomason, l...@toetsch.at, perl6-i...@perl.org
On Apr 9, 2004, at 2:00 PM, Dan Sugalski wrote:

> At 1:55 PM -0700 4/9/04, Adam Thomason wrote:
>> > From: Dan Sugalski [mailto:d...@sidhe.org]
>> > Done. It'll guaranteed kill half the tinderboxen--I think my first
>> > thing to do on monday is to patch up the build procedure to use the
>> > system ICU if it's available.
>>
>> Does this mean you're resigned to requiring a C++ compiler? Or will
>> it be possible to switch ICU off?
>
> It'll be possible to switch it off, or to use the system ICU install
> if there is one.

The problem with switching it off is that if we can't guarantee access
to the full Unicode character properties database (and associated
data), then we'll have cases where 2 different perl installs disagree
about how to uppercase a string (for instance). ICU gives us case
folding and normalization and character property API (and locales,
actually), in addition to encoding data. (I can see how "that encoding
isn't supported by this install" could be handled, but missing the other
stuff would be more problematic.)
this could be done safely/unconfusingly.

JEff

Jens Rieks

Apr 9, 2004, 6:31:53 PM
to Jeff Clites, perl6-i...@perl.org
Hi,

On Friday 09 April 2004 23:11, Jeff Clites wrote:
> Hmm, I'll need to figure out how to turn on warnings about this in gcc.
> I would expect some "unused" warning as well.

-Wshadow and -Wunused

-Wuninitialized and -Wunreachable-code can also help
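
For example (note that -Wuninitialized only has an effect when
optimization is turned on):

gcc -O -Wall -Wshadow -Wunused -Wuninitialized -Wunreachable-code -c string.c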

jens

Jarkko Hietaniemi

Apr 10, 2004, 4:18:32 AM
to perl6-i...@perl.org, Jeff Clites, l...@toetsch.at, perl6-i...@perl.org
FWIW, the change sounds all good to me. O(1) indexing is the most
important property of a string, the 8/16/32 representation gives us that
and space savings too, going all Unicode at the heart of it is the only
sensible thing to do (anything else leads to combinatorial explosion and
instant insanity), "encodings" are just serializations and belong at the
earliest to I/O, and a "language" attribute doesn't belong in a string.
(It may belong to a string at a higher abstraction level, such as a
database, but not here.) Good work.

Leopold Toetsch

Apr 10, 2004, 4:12:41 AM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:
> On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:

> So internally, strings don't have an associated encoding (or chartype
> or anything)

How do you handle EBCDIC? UTF8 for Ponie?

>> - Where is string->language?

> I removed it from the string struct because I think that's the wrong
> place for it (and it wasn't actually being used anywhere yet,
> fortunately).

Not used *yet* - what about:

use German;
print uc("i");
use Turkish;
print uc("i");

> language-dependent (sorting, for example), the operation doesn't depend
> on the language of the strings involved, but rather on the locale of
> the reader.

And if one is working with two different languages at a time?

>> With this string type how do we deal with anything beyond codepoints?

> Hmm, what do you mean?

"\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq
"\N{LATIN CAPITAL LETTER A WITH ACUTE}"

when comparing graphemes or letters. The latter might depend on the
language too.

We'll basically need 4 levels of string support:

,--[ Larry Wall ]--------------------------------------------------------
| level 0 byte == character, "use bytes" basically
| level 1 codepoint == character, what we seem to be aiming for, vaguely
| level 2 grapheme == character, what the user usually wants
| level 3 letter == character, what the "current language" wants
`------------------------------------------------------------------------

> or ask arbitrary strings what their N-th character

The N-th character depends on the level. In the above example C<.length>
gives either 2 or 1, depending on whether the user queries at level 1 or
2. The same problem arises with positions. The current level depends on
the scope where the string came from, too. (See the example WRT the
Turkish letter "i".)

>> - What happened to external constant strings?

> They should still work (or could).

,--[ string.c:714 ]--------------------------------------------------
| else /* even if buffer is "external", we won't use it directly */
`--------------------------------------------------------------------

>> - What's the plan towards all the transcode opcodes? (And leaving these
>> as a noop would have been simpler)

> Basically there's no need for a transcode op on a string--it no longer
> makes sense, there's nothing to transcode.

I can't imagine that. I've an ASCII string and want to convert it to UTF8
and UTF16 and write it into a file. How do I do that?

>> - hash_string seems not to deal with mixed encodings anymore.

> Yep, since we're hashing based on characters rather than bytes, there's
> no such thing as mixed encodings.

s. above

> JEff

leo

Jarkko Hietaniemi

Apr 10, 2004, 5:02:28 AM
to perl6-i...@perl.org, l...@toetsch.at, Jeff Clites, perl6-i...@perl.org
> Jeff Clites <jcl...@mac.com> wrote:
>
>>On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:

I'm replying for Jeff since I've been burned by the same questions
over and over again :-)

>
>>So internally, strings don't have an associated encoding (or chartype
>>or anything)
>
>
> How do you handle EBCDIC? UTF8 for Ponie?


All character sets (like EBCDIC) or encodings (like UTF-8) are
"normalized" to the Unicode (character set) (and our own *internal*
encoding, the 8/16/32 one.)

> Not used *yet* - what about:
>
> use German;
> print uc("i");
> use Turkish;
> print uc("i");

That is implementable (and already implemented by ICU) but by something
higher level than a "string".

> And if one is working with two different languages at a time?

One goes mad. As Jeff demonstrated, there is no silver bullet in
there; one quickly gets into situations where there provably is NO
correct solution. So we shouldn't try building the impossible into the
lowest level of the string implementation.

> when comparing graphemes or letters. The latter might depend on the
> language too.
>
> We'll basically need 4 levels of string support:
>
> ,--[ Larry Wall ]--------------------------------------------------------
> | level 0 byte == character, "use bytes" basically
> | level 1 codepoint == character, what we seem to be aiming for, vaguely
> | level 2 grapheme == character, what the user usually wants
> | level 3 letter == character, what the "current language" wants
> `------------------------------------------------------------------------

Jeff's solution gives us level 1, and I assume that level 0 is trivially
deductible from that. Note, however, that not all string operations
(especially such a rich set of string ops as Perl has) can even be
defined for all those levels: e.g. bitstring boolean bit ops are rather
insane at levels higher than zero.

> The N-th character depends on the level. In the above example C<.length>
> gives either 2 or 1, depending on whether the user queries at level 1 or
> 2. The same problem arises with positions. The current level depends on
> the scope where the string came from, too. (See the example WRT the
> Turkish letter "i".)

The levels 2 and 3 depend on something higher level, like the higher
levels of ICU. I believe we have everything we need (and even more) in
ICU. Let's get the levels 0 and 1 working first.

>>>- What's the plan towards all the transcode opcodes? (And leaving these
>>> as a noop would have been simpler)
>
>
>>Basically there's no need for a transcode op on a string--it no longer
>>makes sense, there's nothing to transcode.
>
>
> I can't imagine that. I've an ASCII string and want to convert it to UTF8
> and UTF16 and write it into a file. How do I do that?

IIUC the old "transcoding" stuff was doing transcoding at run time so
that two encoding-marked strings could be compared. The new scheme
"normalizes" (not to be confused with Unicode normalization) all strings
to Unicode. If you want to do transformations like you describe above
you either call an explicit transcoding interface (which ICU no doubt
has) or your I/O layers do that implicitly (this functionality PIO does
not yet have, if I understood Jeff correctly).

Maybe it's good to refresh on the 'character hierarchy' as defined by
Unicode (and IETF, and W3C).

ACR - Abstract Character Repertoire: an unordered collection of abstract
characters, like "UPPERCASE A" or "LOWERCASE B" or "DECIMAL DIGIT SEVEN".

CCS - Coded Character Set: an ordered (numbered) list of characters,
like 65 -> "UPPERCASE A". For example: ASCII and EBCDIC.

CEF - Character Encoding Form: mapping the CCS character codes to
platform-specific numbers like bytes or integers.

CES - Character Encoding Scheme: mapping the CEF numbers to serialized
bytes, possibly adding synchronization metadata like shift codes or byte
order markers.

Why the great confusion exists is mostly because in the old way (like
ASCII or Latin-1) all these four levels were conflated into one.

ISO 8859-1 (which is a CCS) has an eight-bit CEF. UTF-8 is both a CEF
and a CES. UTF-16 is a CEF, while UTF-16LE is a CES. ISO 2022-{JP,KR}
are CES.

(Outside of Unicode) there is TES (Transfer Encoding Syntax), too, which
is application-level encoding like base64 or gzip.
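
To walk one character through the whole stack, take LATIN SMALL LETTER E
WITH ACUTE ("é"):

ACR: the abstract character itself
CCS: Unicode assigns it the code point U+00E9 (decimal 233)
CEF: in UTF-8 that code point becomes the two code units 0xC3 0xA9;
     in UTF-16 it is the single 16-bit unit 0x00E9
CES: UTF-16LE serializes that unit as the bytes 0xE9 0x00
     (UTF-16BE as 0x00 0xE9, possibly after a byte order mark)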

Leopold Toetsch

Apr 10, 2004, 4:38:06 AM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski wrote:

>
> Done. It'll guaranteed kill half the tinderboxen--I think my first thing
> to do on monday is to patch up the build procedure to use the system ICU
> if it's available.

Thanks for the checkin. And yes. What about building without ICU? I can
imagine that some embedded usage of Parrot, dealing with ASCII only,
wouldn't need it. We could provide pure ASCII fallbacks for this case
and reduce functionality.

leo

Leopold Toetsch

Apr 10, 2004, 5:40:58 AM
to Jarkko Hietaniemi, perl6-i...@perl.org
Jarkko Hietaniemi <j...@iki.fi> wrote:

>> How do you handle EBCDIC? UTF8 for Ponie?

> All character sets (like EBCDIC) or encodings (like UTF-8) are
> "normalized" to the Unicode (character set) (and our own *internal*
> encoding, the 8/16/32 one.)

Ok.

>> Not used *yet* - what about:
>>
>> use German;
>> print uc("i");
>> use Turkish;
>> print uc("i");

> That is implementable (and already implemented by ICU) but by something
> higher level than a "string".

So the first question is: Where is this higher level? Isn't Parrot
responsible for providing that? The old string type did have the
relevant information at least.

I think we can't say it's a Perl6 lib problem. HLL interoperability
comes in here too. *If* there are some more advanced string levels above
Parrot strings, they have to play together too.

So let's first concentrate on this issue. The rest is more or less
an implementation detail.

leo

Jeff Clites

Apr 10, 2004, 5:52:23 AM
to l...@toetsch.at, perl6-i...@perl.org
On Apr 10, 2004, at 1:12 AM, Leopold Toetsch wrote:

> Jeff Clites <jcl...@mac.com> wrote:
>> On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:
>
>> So internally, strings don't have an associated encoding (or chartype
>> or anything)
>
> How do you handle EBCDIC?

I'll use pseudo-C to illustrate:

string = string_make(buffer, "EBCDIC"); /* buffer came from an EBCDIC
file, for instance containing the word "hello" */

outputBuffer1 = encode(string, "EBCDIC");
outputBuffer2 = encode(string, "ASCII");
outputBuffer3 = encode(string, "UTF-8");
outputBuffer4 = encode(string, "UTF-32");

//maybe write buffers out to files now....

So far, that would all just work, and outputBuffers 1-4 all have the
word "hello" serialized using various encodings.

Now, let's say instead that we had done:

string = string_make(buffer, "ASCII"); /* buffer came from an ASCII
file, for instance containing the word "hello" */

Now the other 4 lines of code would be the same, and outputBuffers 1-4
are identical to what they would have been above, and in fact "string"
would be identical in the 2 cases as well.

> UTF8 for Ponie?

Ponie should only need to pretend that a string is represented
internally in UTF-8 in a limited number of situations--not for string
comparisons or (most?) regular expressions, etc. For the cases where it
does need that, it can on-the-fly (or, cached) create a buffer of bytes
to work from. But I think of that as backward compatibility, and not
the case we're optimizing for.

>>> - Where is string->language?
>
>> I removed it from the string struct because I think that's the wrong
>> place for it (and it wasn't actually being used anywhere yet,
>> fortunately).
>
> Not used *yet* - what about:
>
> use German;
> print uc("i");
> use Turkish;
> print uc("i");

Perfect example. The string "i" is the same in each case. What you've
done is implicitly supplied a locale argument to the uc()
operation--it's just a hidden form of:

uc(string, locale);

The important thing is that the locale is a parameter to the operation,
not an attribute of the string.
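
This is exactly how ICU's C API is shaped, by the way: the locale is an
argument to the case-mapping call. Something like (a sketch, with error
handling trimmed):

#include <unicode/ustring.h>

static void upcase_demo(void)
{
    UChar src[]  = { 0x0069, 0 };   /* "i" */
    UChar dest[8];
    UErrorCode err = U_ZERO_ERROR;

    /* German (or any non-Turkic) locale: "i" -> "I" (U+0049) */
    u_strToUpper(dest, 8, src, -1, "de", &err);

    /* Turkish locale: "i" -> U+0130, the dotted capital I */
    err = U_ZERO_ERROR;
    u_strToUpper(dest, 8, src, -1, "tr", &err);
}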

>> language-dependent (sorting, for example), the operation doesn't depend
>> on the language of the strings involved, but rather on the locale of
>> the reader.
>
> And if one is working with two different language at a time?

Hmm? The point is that if you have a list of strings, for instance some
in English, some in Greek, and some in Japanese, and you want to sort
them, then you have to pick a sort ordering. If you associate a
language with each string, that doesn't help you--how do you compare
the English and Japanese strings with one another? So again, the sort
ordering is a parameter to the sort operation:

sort_strings(array, locale);

And you could certainly have:

sortedInOneWay = sort_strings(someArrayOfStrings, locale1)
sortedInAnotherWay = sort_strings(someArrayOfStrings, locale2)

That's only awkward if we're assuming a locale is hanging around
implicitly specified--if it's an explicit parameter, it's very clear.
And in practice, the collation algorithm specified by a locale has to
be prepared to handle sorting any strings at all--so that you can sort
Japanese strings in the English sort order, for instance. That sounds
strange, but generally this tends to be modeled as a base sorting
algorithm (covering all characters/strings) plus small per-locale
variations. That means that the sort order for Kanji strings would
probably end up being the same for the English and Dutch locales,
though strings containing Latin characters might sort differently.

>>> With this string type how do we deal with anything beyond codepoints?
>
>> Hmm, what do you mean?
>
> "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq
> "\N{LATIN CAPITAL LETTER A WITH ACUTE}"
>
> when comparing graphemes or letters. The latter might depend on the
> language too.

Right, but this isn't comparing graphemes, it's this:

one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";

one eq two //false--they're different strings
normalizeFormD(one) eq normalizeFormD(two) //true

This is quite analogous to:

three = "abc"
four = "ABC"

three eq four //false
uc(three) eq uc(four) //true

So it's not comparing graphemes, it's comparing strings under
normalization. It's a much clearer and much better-worked-out
concept. If you want to think of this as comparing graphemes, you need
to invent a data type to model "a grapheme", and that's a mess and
nobody's ever done that. It's the wrong way to think about the problem.
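
And with ICU, "compare under normalization" is a library call rather
than a new data type. A sketch (fixed-size buffers for brevity; real
code would size the buffers and retry):

#include <unicode/unorm.h>
#include <unicode/ustring.h>

static int equal_under_nfd(const UChar *a, const UChar *b)
{
    UChar na[64], nb[64];
    UErrorCode err = U_ZERO_ERROR;
    int32_t la = unorm_normalize(a, -1, UNORM_NFD, 0, na, 64, &err);
    int32_t lb = unorm_normalize(b, -1, UNORM_NFD, 0, nb, 64, &err);
    return U_SUCCESS(err) && la == lb && u_memcmp(na, nb, la) == 0;
}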

> We'll basically need 4 levels of string support:
>
> ,--[ Larry Wall ]--------------------------------------------------------
> | level 0 byte == character, "use bytes" basically
> | level 1 codepoint == character, what we seem to be aiming for, vaguely
> | level 2 grapheme == character, what the user usually wants
> | level 3 letter == character, what the "current language" wants
> `------------------------------------------------------------------------

Yes, and I'm boldly arguing that this is the wrong way to go, and I
guarantee you that you can't find any other string or encoding library
out there which takes an approach like that, or anyone asking for one.
I'm eager for Larry to comment.

>> or ask arbitrary strings what their N-th character
>
> The N-th character depends on the level. In the above example C<.length>
> gives either 2 or 1, depending on whether the user queries at level 1 or
> 2. The same problem arises with positions. The current level depends on
> the scope where the string came from, too. (See the example WRT the
> Turkish letter "i".)

I'm arguing that whether you have dotted-i or dotless-i is decided when
you create the string, and how those are case-mapped depends on your
choice of case-folding algorithm. The ICU case-folding API has explicit
variants to decide how to case fold the i's, for instance.

>>> - What happened to external constant strings?
>
>> They should still work (or could).
>
> ,--[ string.c:714 ]--------------------------------------------------
> | else /* even if buffer is "external", we won't use it directly */
> `--------------------------------------------------------------------

Yes, that's the else clause for the case where we've determined that we
can't use the passed-in buffer directly. A few lines above that is the
place where we have a mem_sys_memcopy() which we can avoid (but the
code doesn't yet, I see), and use the passed-in buffer in the
external-constant case.

>>> - What's the plan towards all the transcode opcodes? (And leaving these
>>> as a noop would have been simpler)
>
>> Basically there's no need for a transcode op on a string--it no longer
>> makes sense, there's nothing to transcode.
>
> I can't imagine that. I've an ASCII string and want to convert it to UTF8
> and UTF16 and write it into a file. How do I do that?

That's the mindset shift. You don't have an ASCII string. You have a
string, which may have come from a file or a buffer representing a
string using the ASCII encoding. It's the example from above, again:

inputBuffer = read(inputHandle);
string = string_make(inputBuffer, "ASCII");
outputBuffer = encode(string, "UTF-16");
write(outputHandle, outputBuffer);

or, if we want to associate an encoding with a handle (sometimes more
convenient, sometimes less), it could go like this:

inputHandle = open("file1", "ASCII");
string = stringRead(inputHandle);
outputHandle = open("file2", "UTF-16");
stringWrite(outputHandle, string);


Basically, we need to think of strings from the top down (what concept
are they trying to capture, what API makes sense for that), rather than
from the bottom up (how are bytes stored in files). The key is to think
of strings as, by definition, representing a sequence of (logical)
characters, and to think of a character as what's trying to be captured
by things such as "LATIN CAPITAL LETTER A WITH ACUTE" or "SYRIAC
QUSHSHAYA". That's the whole cookie.

(And again, I wish I could take credit for all of this, but it's really
the general state-of-the-art with regard to strings. It's how people
are thinking about these things these days.)

JEff

Jarkko Hietaniemi

Apr 10, 2004, 6:19:39 AM
to perl6-i...@perl.org, Jeff Clites, l...@toetsch.at, perl6-i...@perl.org
>>We'll basically need 4 levels of string support:
>>
>>,--[ Larry Wall ]--------------------------------------------------------
>>| level 0 byte == character, "use bytes" basically
>>| level 1 codepoint == character, what we seem to be aiming for, vaguely
>>| level 2 grapheme == character, what the user usually wants
>>| level 3 letter == character, what the "current language" wants
>>`------------------------------------------------------------------------
>
>
> Yes, and I'm boldly arguing that this is the wrong way to go, and I
> guarantee you that you can't find any other string or encoding library
> out there which takes an approach like that, or anyone asking for one.
> I'm eager for Larry to comment.

I'm no Larry, either :-) but I think Larry is *not* saying that the
"localeness" or "languageness" should hang off each string (or *shudder*
off each substring). What I've seen is that Larry wants the "level" to
be a lexical pragma (in Perl terms). The "abstract string" stays the
same, but the operative level decides for _some_ ops what a "character"
stands for.

The default level should be somewhere between levels 1 and 2 (again, it
depends on the ops).

For example, usually /./ means "match one Unicode code point" (a CCS
character code). But one can somehow ratchet the level up to 2 and make
it mean "match one Unicode base character, followed by zero or more
modifier characters". For level 3 the language (locale) needs to be
specified.

As another example, bitstring xor does not make much sense for anything
else than level zero.

The basic idea being that we cannot and should not dictate at what level
of abstraction the user wants to operate. We will give a default level,
and ways to "zoom in" and "zoom out".

(If Larry is really saying that the "locale" should be an attribute of
the string value, I'm on the barricades with you, holding cobblestones
and Molotov cocktails...)

Larry can feel free to correct me :-)

Jeff Clites

Apr 10, 2004, 6:11:22 AM
to l...@toetsch.at, perl6-i...@perl.org, Jarkko Hietaniemi
On Apr 10, 2004, at 2:40 AM, Leopold Toetsch wrote:

> Jarkko Hietaniemi <j...@iki.fi> wrote:
>
>>> Not used *yet* - what about:
>>>
>>> use German;
>>> print uc("i");
>>> use Turkish;
>>> print uc("i");
>
>> That is implementable (and already implemented by ICU) but by something
>> higher level than a "string".
>
> So the first question is: Where is this higher level? Isn't Parrot
> responsible for providing that? The old string type did have the
> relevant information at least.

See my separate post--what's needed is a locale parameter to uc(),
giving uc("i", locale). For Parrot, we at least need ops/API which take
an explicit locale parameter. The interoperability issue comes into
play in whether we decide to let a default locale be specified at the
parrot level, and have op variants which use this. I think that we
actually don't need that, and that we can let it mostly happen at the
HLL level--that is, of Perl6 (etc.) want API such as uc() without an
explicit parameter, it just needs to compile 'print uc("i")' down to:

set S0, "i"
find_global P1, "default_locale" # some agreed-upon global
uc S1, S0, P1
print S1

So parrot can support a default locale cross-language, without needing
separate ops or anything. And if, for instance, Python didn't have a
default locale, and always make you pass a locale into operations which
need one, then everything would still be fine--just Python wouldn't be
looking up this particular global.

> I think we can't say it's a Perl6 lib problem. HLL interoperability
> comes in here too. *If* there are some more advanced string levels above
> Parrot strings, they have to play together too.
>
> So let's first concentrate on this issue. The rest is more or less
> an implementation detail.

JEff

Jarkko Hietaniemi

Apr 10, 2004, 6:37:30 AM
to l...@toetsch.at, perl6-i...@perl.org
> So the first question is: Where is this higher level? Isn't Parrot
> responsible for providing that? The old string type did have the
> relevant information at least.
>
> I think we can't say it's a Perl6 lib problem. HLL interoperability

Right. It's a Parrot lib problem. But it's not a ".c/.cpp" problem.

> comes in here too. *If* there are some more advanced string levels above
> Parrot strings, they have to play together too.
>
> So let's first concentrate on this issue. The rest is more or less
> an implementation detail.

Once we get levels 0 and 1 working, we can worry about bolting the
levels 2 and 3 from ICU to a Parrot level API. (ICU goes much further
than 2 or 3, incidentally: how about some Buddhist calendar?)

--
Jarkko Hietaniemi <j...@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

Jarkko Hietaniemi

Apr 10, 2004, 6:47:39 AM
to Jeff Clites, perl6-i...@perl.org, l...@toetsch.at

> Another example could be that at level 2 (and 3), maybe "eq"
> automatically normalizes before doing string comparisons, and at levels
> 1 and 0 it doesn't.

Exactly. People wanted implicit "eq" normalization for Perl 5 Unicode.
The problem always is "where does it end?", because the logical followup
to that would have been "cmp" to do the full Unicode collation.

Jeff Clites

Apr 10, 2004, 6:38:34 AM
to Jarkko Hietaniemi, perl6-i...@perl.org, l...@toetsch.at
On Apr 10, 2004, at 3:19 AM, Jarkko Hietaniemi wrote:

>>> We'll basically need 4 levels of string support:
>>>
>>> ,--[ Larry Wall ]--------------------------------------------------------
>>> | level 0 byte == character, "use bytes" basically
>>> | level 1 codepoint == character, what we seem to be aiming for, vaguely
>>> | level 2 grapheme == character, what the user usually wants
>>> | level 3 letter == character, what the "current language" wants
>>> `------------------------------------------------------------------------
>>
>>
>> Yes, and I'm boldly arguing that this is the wrong way to go, and I
>> guarantee you that you can't find any other string or encoding library
>> out there which takes an approach like that, or anyone asking for one.
>> I'm eager for Larry to comment.
>
> I'm no Larry, either :-) but I think Larry is *not* saying that the
> "localeness" or "languageness" should hang off each string (or
> *shudder*
> off each substring). What I've seen is that Larry wants the "level" to
> be a lexical pragma (in Perl terms). The "abstract string" stays the
> same, but the operative level decides for _some_ ops what a "character"
> stands for.

That makes a lot of sense to me, and I'd further it by saying that
levels 2 and 3 don't mean that we need to have "grapheme" or "letter"
data types, per se. (If we tried to have those, we'd need properties
databases to go with them, and we'd go crazy.)

> For example, usually /./ means "match one Unicode code point" (a CCS
> character code). But one can somehow ratchet the level up to 2 and make
> it mean "match one Unicode base character, followed by zero or more
> modifier characters". For level 3 the language (locale) needs to be
> specified.

Another example could be that at level 2 (and 3), maybe "eq"
automatically normalizes before doing string comparisons, and at levels
1 and 0 it doesn't.

> (If Larry is really saying that the "locale" should be an attribute of
> the string value, I'm on the barricades with you, holding cobblestones
> and Molotov cocktails...)

It's nice to have company!

JEff

Leopold Toetsch

Apr 10, 2004, 6:54:35 AM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:
> On Apr 10, 2004, at 1:12 AM, Leopold Toetsch wrote:

>> use German;
>> print uc("i");
>> use Turkish;
>> print uc("i");

> Perfect example. The string "i" is the same in each case. What you've
> done is implicitly supplied a locale argument to the uc()
> operation--it's just a hidden form of:

> uc(string, locale);

Ok. Now when the identical string "i" (but originating from different
locale environments) goes through a sequence of string operations later,
how do you track the locale down to the final C<uc> where it's needed?

e.g.

use German;
my $gi = "i";
use Turkish;
my $ti = "i";

my $s = $gi x 10;
...
print uc($s); # locale is what?

Where do you track the locale, if not in the string itself?

> The important thing is that the locale is a parameter to the operation,
> not an attribute of the string.

If that works ...

> Hmm? The point is that if you have a list of strings, for instance some
> in English, some in Greek, and some in Japanese, and you want to sort
> them, then you have to pick a sort ordering.

Ok. I want to uppercase the strings - no sorting (yet). I've an array of
Vienna's kebab booths. Half of these have Turkish names (at least); the
rest is a mixture of other languages. I'd like to uppercase this array
of names. How do I do it?

> one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
> two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";

> one eq two //false--they're different strings
> normalizeFormD(one) eq normalizeFormD(two) //true

Sure. But if I want to compare "letters": one eq two. I think this is
the normal case the user of Unicode wants or expects. On the surface it
doesn't matter if the internal representation is different.

OTOH normalizing all strings on input is not possible - what if they
should go into a file in unnormalized form?

> This is quite analogous to:

> three = "abc"
> four = "ABC"

No.

>> ,--[ Larry Wall ]--------------------------------------------------------
>> | level 0 byte == character, "use bytes" basically
>> | level 1 codepoint == character, what we seem to be aiming for, vaguely
>> | level 2 grapheme == character, what the user usually wants
>> | level 3 letter == character, what the "current language" wants
>> `------------------------------------------------------------------------

> Yes, and I'm boldly arguing that this is the wrong way to go, and I
> guarantee you that you can't find any other string or encoding library
> out there which takes an approach like that, or anyone asking for one.
> I'm eager for Larry to comment.

The design gods may speak up, yes.

>> I can't imagine that. I've an ASCII string and want to convert it to
>> UTF8
>> and UTF16 and write it into a file. How do I do that?

> That's the mindset shift. You don't have an ASCII string. You have a
> string, which may have come from a file or a buffer representing a
> string using the ASCII encoding. It's the example from above, again:

> inputBuffer = read(inputHandle);
> string = string_make(inputBuffer, "ASCII");
> outputBuffer = encode(string, "UTF-16");
> write(outputHandle, outputBuffer);

Ok. I should have asked: How do I do that in PASM, of course.

> JEff

leo

Jarkko Hietaniemi

Apr 10, 2004, 7:29:50 AM
to perl6-i...@perl.org, l...@toetsch.at, Jeff Clites, perl6-i...@perl.org
> Ok. Now when the identical string "i" (but originating from different
> locale environments) goes through a sequence of string operations later,
> how do you track the locale down to the final C<uc> where it's needed?
>
> e.g.
>
> use German;
> my $gi = "i";
> use Turkish;
> my $ti = "i";

$gi and $ti contain the same Unicode code points, in this case 0x69.

> my $s = $gi x 10;
> ...
> print uc($s); # locale is what?

Locale is what *you* said the level 3 locale should be. If it's not
set, it's probably according to the Unicode default casing rules, which
are language-neutral.

> Where do you track the locale, if not in the string itself?

You don't track it. It's lexical, a policy in that code block.

>>Hmm? The point is that if you have a list of strings, for instance some
>>in English, some in Greek, and some in Japanese, and you want to sort
>>them, then you have to pick a sort ordering.
>
>
> Ok. I want to uppercase the strings - no sorting (yet). I've an array of
> Vienna's kebab booths. Half of these have Turkish names (at least); the

Mmmm, kebab.

> rest is a mixture of other languages. I'd like to uppercase this array
> of names. How do I do it?

You pick a locale and you say uc().

You can't have *BOTH* Turkish and German casing rules in effect at the
same time. Well, sometimes you might get away with mixing policies, but
in the general case it cannot work (or make sense: casing is meaningless
for many Asian scripts, or be devilishly complex: "Japanese" mixes
several different "scripts" and "languages"). Take www.yahoo.co.jp:
what "language" are the "Yahoo!" strings in?

Let's throw in some more: Vienna beer houses with German names, Vienna
cafes with German names, Vienna cafes with French names, Vienna kebab
houses with Turkish names, Vienna Chinese restaurants, and Vienna Thai
restaurants. Now you want to sort them. Are you going to implement 6x5
or 30 sorting algorithms?

> OTOH normalizing all strings on input is not possible - what if they
> should go into a file in unnormalized form?

Please study the ACR-CCS-CEF-CES mantra. You say "unnormalized form"
without specifying what form you mean. If you e.g. really want the bytes
of the serialized input file/stream (a CES), mark your PIO stream as
"bytes" and read it in, and then you can operate on it at level zero.

In PASM, we need a way to say:

string_level_0
string_level_1
string_level_2
string_level_3(locale)

The string_level_2 op *might* take an argument saying which Unicode
normalization scheme should be picked, or we might just punt and pick
one as the default.
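
(Perl 5 already expresses the level-0/level-1 split as a lexical pragma
rather than a string attribute -- a small runnable sketch, where
utf8::upgrade is only there to force the UTF-8 internal representation:)

use strict;
use warnings;

my $s = "caf\x{E9}";    # 4 codepoints; e-acute is the 4th
utf8::upgrade($s);      # force the UTF-8 internal representation

{
    use bytes;                  # level 0: byte == character
    print length($s), "\n";    # 5 (e-acute is two bytes in UTF-8)
}
print length($s), "\n";        # 4 (level 1: codepoint == character)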

Jeff Clites

unread,
Apr 10, 2004, 3:35:09 PM4/10/04
to Jeff Clites, perl6-i...@perl.org, l...@toetsch.at
On Apr 10, 2004, at 12:21 PM, Jeff Clites wrote:

> On Apr 10, 2004, at 3:54 AM, Leopold Toetsch wrote:
>
>> Ok. I want to uppercase the strings - no sorting (yet). I've an array
>> of Vienna's Kebab booths. Half of these have Turkish names (at least);
>> the rest is a mixture of other languages. I'd like to uppercase this
>> array of names. How do I do it?

...
> If you are having signs painted for the vendors, and you want the
> names all in uppercase (for style) but you want to make sure that
> you are uppercasing a name in an appropriate way for that vendor's
> national origin, then on a per-string basis you need to decide on a
> locale.

Darn, forgot something else I meant to say:

On the other hand, it might not make sense to case-map with the national
origin of the vendor in mind: If you had kebab (or pretzel?) vendors
with booths in the US, some of whom are German, you might _not_ want to
case map with national origin in mind, because if you used the German
"ß" (sharp s) variant for "ss" then you'd confuse the American
customers. So you'd have a real decision to make--do you localize based
on the nationality of the vendor, or based on the nationality of the
customers? Either choice is reasonable, and your API needs to handle
both--case map all of the strings using the same locale, or case map
each in a different locale. It depends on what you want to do with your
signs, not on the contents of the strings.

And now I'm hungry.

JEff

Jeff Clites

unread,
Apr 10, 2004, 3:21:16 PM4/10/04
to l...@toetsch.at, perl6-i...@perl.org
On Apr 10, 2004, at 3:54 AM, Leopold Toetsch wrote:

> Jeff Clites <jcl...@mac.com> wrote:
>> On Apr 10, 2004, at 1:12 AM, Leopold Toetsch wrote:
>
>>> use German;
>>> print uc("i");
>>> use Turkish;
>>> print uc("i");
>
>> Perfect example. The string "i" is the same in each case. What you've
>> done is implicitly supplied a locale argument to the uc()
>> operation--it's just a hidden form of:
>
>> uc(string, locale);
>
> Ok. Now when the identical string "i" (but originating from different
> locale environments) goes through a sequence of string operations later,
> how do you track the locale down to the final C<uc> where it's needed?
>
> e.g.
>
> use German;
> my $gi = "i";
> use Turkish;
> my $ti = "i";
>
> my $s = $gi x 10;
> ...
> print uc($s); # locale is what?
>
> Where do you track the locale, if not in the string itself.

I think it's quite like file handles in perl5--there are 2 choices:

print OUT "foo"; # string is printed to file handle OUT
print "foo"; # string is printed to currently selected file handle

compare with:

uc($s, $locale); # string $s is uppercased using locale $locale
uc($s); # string is uppercased using current effective locale

I presume that "use German" would be equivalent to "set the current
locale to German".

So again, locale is an implicit (or explicit) parameter to certain
string operations, but not to string creation.
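
And the filehandle half of the analogy is already runnable in perl5:

open(OUT, '>', 'out.txt') or die "open: $!";
my $prev = select(OUT);   # OUT becomes the currently selected handle
print "foo";              # no handle named: goes to out.txt
select($prev);            # restore the previous default (STDOUT)
print "foo\n";            # goes to STDOUT again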

But let's say we did it the way you were thinking, and made locale part
of the string. Consider this:

$string = $gi.$ti;

Now what locale would $string be in? It would be quite confusing.

Another way to state my point there is that locale definitely comes
into play when certain operations are performed, and if you wanted to
find the relevant locale by attaching it to a string, then things get
instantly confusing when you need to do some operation involving two
strings with different locales attached. To use my analogy from above,
that would be similar to having the "currently selected file handle" be
an attribute of a string.

[[Side note: Although uppercase/lowercase/titlecase are
locale-dependent, there's also the separate notion of case-folding,
which is locale-independent, and in a Unicode world is the convenient
thing to use if you are just trying to discard case differences between
two strings.]]
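
To sketch that in (much later) Perl 5 -- fc() arrived in 5.16, long
after this thread, and Unicode::CaseFold on CPAN is an alternative:

use v5.16;   # enables say and fc

# Case folding is locale-independent: German sharp s folds to "ss".
say fc("STRASSE") eq fc("stra\x{DF}e") ? "same" : "different";   # same
say    "STRASSE"  eq    "stra\x{DF}e"  ? "same" : "different";   # different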

>> Hmm? The point is that if you have a list of strings, for instance
>> some
>> in English, some in Greek, and some in Japanese, and you want to sort
>> them, then you have to pick a sort ordering.
>
> Ok. I want to uppercase the strings - no sorting (yet). I've an array
> of Vienna's Kebab booths. Half of these have Turkish names (at least);
> the rest is a mixture of other languages. I'd like to uppercase this
> array of names. How do I do it?

You get to decide, for each, which locale to use for uppercasing, or
you use the German locale (for instance) to uppercase them all. What
you decide to do will depend on what your goal is--on what you are
trying to achieve by uppercasing them.

If your goal (for instance) is to just case normalize so that you can
look for duplicates in your list, then you can use case-folding and
avoid the whole locale issue.

If you are having signs painted for the vendors, and you want the names
all in uppercase (for style) but you want to make sure that you are
uppercasing a name in an appropriate way for that vendor's national
origin, then on a per-string basis you need to decide on a locale. That
might be a pain, but you'd have had the same pain if you wanted to
attach the locale at string-creation time--you would have had to
specify the locale for each one separately then as well. Now, let's say
you had decided to concatenate the names into a single string, and
uppercase that. What locale would you use? Once you start concatenating
strings, the idea of attaching the locale to a string becomes
unworkable.

[[Side note 2: There are really 2 different meanings of "attach a
locale to a string": there is (1) locale is fundamentally a property of
a string, and (2) let's hang a locale off of a string, for convenience.
I'm saying that (1) is conceptually wrong, and (2) breaks down when you
start concatenating strings.]]

>> one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
>> two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";
>
>> one eq two //false--they're different strings
>> normalizeFormD(one) eq normalizeFormD(two) //true
>
> Sure. But if I want to compare "letters": one eq two. I think this is
> the normal case the user of Unicode wants or expects. On the surface it
> doesn't matter if the internal representation is different

Right, that's fine, and I'm saying that the different "levels" boil
down to different behaviors of an equality operator, not different
types of string. Again it's quite analogous to things already in Perl5.
For instance, we have "eq" v. "==", and the decision on whether to use
string v. numeric comparison is decided by the caller, not
automatically determined based on the contents of the variables
involved. So taking your example strings, we'd have two possible
approaches:

use level 1;
one eq two; //false

use level 2;
one eq two; //true

or instead, this approach could be taken:

one eq two; //false--this is a level-1 comparison
one linguisticallyEq two; //true--this is a level two comparison
one caseInsensitiveEq two; //a different sort of "semantic" comparison

But with either approach, it boils down to deciding what comparison
algorithm to use, and the decision is based on something other than the
contents of the strings.
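
Concretely, with the core Unicode::Normalize module in Perl 5 (a sketch
of the two comparison levels, not a proposed Parrot API):

use Unicode::Normalize qw(NFD);

my $one = "A\x{301}";   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
my $two = "\x{C1}";     # LATIN CAPITAL LETTER A WITH ACUTE

print $one eq $two           ? "eq\n" : "ne\n";   # ne: codepoint comparison
print NFD($one) eq NFD($two) ? "eq\n" : "ne\n";   # eq: normalize, then compare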

> OTOH normalizing all strings on input is not possible - what if they
> should go into a file in unnormalized form.

Sure, of course--especially because there are at least 4 common
normalization forms, and you are 100% correct that which one to apply
(if any) is a decision that would be made by a programmer on a per-string
basis, depending on what they are trying to do.

>> This is quite analogous to:
>
>> three = "abc"
>> four = "ABC"
>
> No.

Yes! :)

It actually illustrates something a bit different. Really:

three caseInsensitiveEq four

is equivalent to

caseFold(three) eq caseFold(four)   # same as uc(three)... in an ASCII-only world

That is, different styles of string comparison end up being equivalent
to literal string comparison applied after some normalization process.

So there are choices as to how to expose this at a language level (esp.
an HLL level), but I'm thinking that different styles of string
comparison will (internally) correspond to different types of
normalization, not to different types of strings.

[[Side note 3: I think it will be really confusing if Perl6 has a
single "eq" operator, whose behavior depends on the "current level",
rather than either different operators or operators which take an
additional parameter. But that's a language-level design question, and
parrot can handle either approach.]]

>>> I can't imagine that. I've an ASCII string and want to convert it to
>>> UTF8
>>> and UTF16 and write it into a file. How do I do that?
>
>> That's the mindset shift. You don't have an ASCII string. You have a
>> string, which may have come from a file or a buffer representing a
>> string using the ASCII encoding. It's the example from above, again:
>
>> inputBuffer = read(inputHandle);
>> string = string_make(inputBuffer, "ASCII");
>> outputBuffer = encode(string, "UTF-16");
>> write(outputHandle, outputBuffer);
>
> Ok. I should have asked: How do I do that in PASM of course.

I'm envisioning something like this:

open P0, "file", "<"
read P1, P0 # P1 is a byte-buffer PMC
new S1, P1, "ASCII"
encode P2, S1, "UTF-16" # P2 is a byte-buffer PMC
open P3, "file2", ">"
print P3, P2

For this, we need encode/decode ops, as well as a PMC to represent
"raw" bytes.

The other variant might look like this:

open P0, "file", "<", "ASCII" # or whatever syntax we decide, maybe
"<:ASCII"
read S1, P0 # S1 is a string since P0 knows what encoding to use
open P3, "file2", ">", "UTF-16"
print P3, S1 # P3 knows what encoding to use

I think we want both variants to be available, so in the IO API we need
to support two styles of IO handles--one which reads and writes bytes,
and one which reads and writes strings (and must, therefore, have an
encoding attached to the handle). This might be done via the IO layer
approach (you could push a string-ifying layer onto the stack), or
maybe this would be what IO "filters" are for--I've seen the term
mentioned in some of the docs, but I've not yet asked what a filter is
supposed to be.

Oh, and another option would be:

open P0, "file", "<"
read S1, P0, "ASCII" # the IO operation specifies the encoding
open P3, "file2", ">"
print P3, S1, "UTF-16" # ditto

I think that option 1 is clear/explicit, and option 2 is convenient
(but less powerful), and option 3 is reasonable but somehow less
appealing. I think options 1 and 2 would co-exist nicely.
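
For comparison, both styles already co-exist in Perl 5's IO; here's a
sketch using the core Encode module and PerlIO layers (the file names
are just placeholders):

use Encode qw(decode encode);

# Style 1: byte handles plus explicit decode/encode.
open(my $in, '<:raw', 'file') or die "open: $!";
my $bytes  = do { local $/; <$in> };      # slurp the raw bytes
my $string = decode('ascii', $bytes);     # bytes -> string
open(my $out, '>:raw', 'file2') or die "open: $!";
print $out encode('UTF-16', $string);     # string -> bytes

# Style 2: the encoding is attached to the handle itself.
open(my $in2,  '<:encoding(ascii)',  'file')  or die "open: $!";
open(my $out2, '>:encoding(UTF-16)', 'file2') or die "open: $!";
print $out2 $_ while <$in2>;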


Keep the questions coming!

JEff

Larry Wall

unread,
Apr 11, 2004, 12:10:11 AM4/11/04
to perl6-i...@perl.org
On Sat, Apr 10, 2004 at 01:19:39PM +0300, Jarkko Hietaniemi wrote:
: I'm no Larry, either :-) but I think Larry is *not* saying that the
: "localeness" or "languageness" should hang off each string (or *shudder*
: off each substring). What I've seen is that Larry wants the "level" to
: be a lexical pragma (in Perl terms). The "abstract string" stays the
: same, but the operative level decides for _some_ ops what a "character
: stands for.

Yes, just as an abstract position stays the same, but it may have
different numeric interpretations in different lexical scopes.

: The default level should be somewhere between levels 1 and 2 (again, it
: depends on the ops).

Well, I don't think you can have a default between levels. And if you
can, you shouldn't... :-)

I'd really like Perl 6 to default to grapheme level because that's what
the naive user will expect. It'll be easy enough for the experts to
drop down to "use codepoints" or whatever the declaration turns out to be.

: For example, usually /./ means "match one Unicode code point" (a CCS
: character code). But one can somehow ratchet the level up to 2 and make
: it mean "match one Unicode base character, followed by zero or more
: modifier characters". For level 3 the language (locale) needs to be
: specified.

I really, really hate to call those "locales" because of all the
butchery that has happened in the name of locales. If we give any
support to "locales" at all, it'll be at the low end at level 0.
If "language" isn't a good enough name for the distinctions at level 3,
then let's find a better name. But it isn't "locale".

: As another example, bitstring xor does not make much sense for anything
: else than level zero.
:
: The basic idea being that we cannot and should not dictate at what level
: of abstraction the user wants to operate. We will give a default level,
: and ways to "zoom in" and "zoom out".

Yes, different views of a consistent semantics. It's something Parrot
has to solve anyway to support multiple languages. You might argue
that the four different levels of Unicode support in Perl 6 are really
four different languages, all called Perl. Of course, we've said all
along that any time you say "use" you're mutating the language, so this
is nothing new...

In practical terms, one tricky thing to figure out is at what point
the number 3 gets turned into "3 bytes", "3 codepoints", "3 graphemes",
or "3 letters".

: (If Larry is really saying that the "locale" should be an attribute of
: the string value, I'm on the barricades with you, holding cobblestones
: and Molotov cocktails...)

Me too, me too! Oh, wait... :-)

Larry

Leopold Toetsch

unread,
Apr 11, 2004, 7:42:49 AM4/11/04
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:
> On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:

>> - What happenend to external constant strings?

> They should still work (or could). But the only cases in which we can
> optimize, and actually use "in-place" a buffer handed to string_make,
> is for a handful of encodings. But dealing with the "flags" argument
> passed into string_make may still be incomplete.

I've changed that now. Single-byte-encoded external strings (which we get
a lot of from e.g. const_string()) are now handled, i.e. string memory
isn't allocated and copied.

This speeds up oo1.pasm by a factor of 8 (eight).

Code is still a bit slower than before, but not dramatically.

> JEff

leo
