The patch itself is shaping up to be rather large (mostly because I've
added an interpreter argument to the string functions which were
lacking it, and updated the places where these functions are called),
but is self-contained in the sense that (1) the functions which
actually call out directly to any ICU API are collected into one file
(for clarity, and in case we ever want to transition to a different
library), (2) there are minimal API changes, and (3) the GC behavior of
strings isn't affected.
It will take me a few more days to finish this up, but I wanted to give
everyone a heads-up that this is on the way.
JEff
Just wanted to send out a quick update, to let everyone know that I
haven't completely dropped the ball here.
JEff
Are you at a point where anything can get checked in? I'd rather have
a partial checkin that we can build on now than wait a while and
have it be even more difficult to integrate as the source drifts
further from what you're working on...
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
> At 10:27 AM -0700 4/7/04, Jeff Clites wrote:
>> It has taken me longer than I expected to carve out some time to work
>> on finishing my ICU/string patch, but it's progressing now, and I
>> just finished tracking down some bugs of mine that the
>> config_lib.pasm stuff was exercising. So I'm currently back to the
>> state of passing all expected tests (ie, not the [perl #28344] test
>> which is currently failing for me in the unpatched source).
>>
>> Just wanted to send out a quick update, to let everyone know that I
>> haven't completely dropped the ball here.
>
> Are you at a point where anything can get checked in? I'd rather have
> a partial checkin that we can build on now than wait a while and
> have it be even more difficult to integrate as the source drifts
> further from what you're working on...
Yes, I think so actually. I have it up-to-date with the current head of
cvs. I need to make a pass through and remove a few things that I've
//-commented out and such, and I'll work on that tonight, and with luck
I'll be able to send it in either tomorrow or this weekend.
JEff
If it's just C++-style comments, send the patch now -- we can yank
those out for you :)
JEff
Phew, that's huge. I'd really like to have smaller patches that do it
step by step. But anyway, the patch must have been a lot of work, so
let's see and make the best out of it.
Some questions:
First:
- What is string->representation? It seems only to contain
string_rep_{one,two,four}.
- How does UTF8 fit here?
- Where are encoding and chartype now?
- Where is string->language?
With this string type how do we deal with anything beyond codepoints?
And some misc remarks/questions:
- string_compare seems just to compare byte, short, or int_32
- What happened to external constant strings?
- What's the plan towards all the transcode opcodes? (And leaving these
as a noop would have been simpler)
- hash_string seems not to deal with mixed encodings anymore.
- Why does read/PIO_reads generate UTF-8 first?
- Why does PIO_putps convert to UTF-8?
(I've not applied the patch here, so I might have missed something)
> JEff
leo
I'll add a "Wow" in here.
I've snipped all of Leo's concerns, not because I don't share them
but because I do. (ICU also still doesn't build on a debian/testing
build, though that's Debian's fault) I also have the concern that
this moves over to Unicode-everywhere.
But... this gets us very much closer to where we want to be, and I'm
figuring that we're better off applying this and working out the
kinks than not. I'll leave this one to Leo to make final decision on,
though.
> At 4:19 PM +0200 4/9/04, Leopold Toetsch wrote:
>> Jeff Clites <jcl...@mac.com> wrote:
>>> I've sent my patch in through RT--it's [perl #28405]!
>>
>> Phew, that's huge. I'd really like to have smaller patches that do it
> step by step. But anyway, the patch must have been a lot of work, so
> let's see and make the best out of it.
>
> I'll add a "Wow" in here.
>
> I've snipped all of Leo's concerns, not because I don't share them but
> because I do.
I'm in the process of crafting my (appropriately) long explanation in
reply to Leo's concerns.
> I also have the concern that this moves over to Unicode-everywhere.
It does, in a sense, and I'll defend that. (And at some point I'll ask
you for a clarification of your concerns to make sure that I answer
them directly, since "Unicode" can mean different things--a character
model, a standard, a set of encodings....)
JEff
> But... this gets us very much closer to where we want to be, and I'm
> figuring that we're better off applying this and working out the kinks
> than not. I'll leave this one to Leo to make final decision on, though.
Thanks Dan for this easter egg :)
Intermediate results:
- patch applied fine
- compiles currently modulo some C++ style variable declarations and
s/#import/#include/
- a few warnings about shadowed vars
- patched source does *not* conform to pdd07
- additional coffee breaks during compiles will ruin me finally
- compiled ok
- make test - not too bad (5/1495 subtests failed)
- JIted NCI (uses string_make for 't' sig) is b0rken
- one native_pbc test fails (more are skipped now)
- make testj fails some more (8/1432 subtests failed)
I'll not apply it before tomorrow, though. (Unless someone else is
faster :)
leo
Hey, they don't call you the patchmonster for nothing--didn't want to
break your streak. :)
>I'll not apply it before tomorrow, though. (Unless someone else is faster :)
If you want it in I can commit it now--I've a local version and big
enough pipe for it not to be a big deal.
Cool. I'll wait. :)
>>I also have the concern that this moves over to Unicode-everywhere.
>
>It does, in a sense, and I'll defend that. (And at some point I'll
>ask you for a clarification of your concerns to make sure that I
>answer them directly, since "Unicode" can mean different things--a
>character model, a standard, a set of encodings....)
We can hash that out after the other things get argued out.
>>I'll not apply it before tomorrow, though. (Unless someone else is faster :)
> If you want it in I can commit it now--I've a local version and big
> enough pipe for it not to be a big deal.
Put it in. We'll deal with various issues step by step.
leo
Done. It's guaranteed to kill half the tinderboxen--I think my first
thing to do on Monday is to patch up the build procedure to use the
system ICU if it's available.
Does this mean you're resigned to requiring a C++ compiler? Or will it be possible to switch ICU off?
Adam
It'll be possible to switch it off, or to use the system ICU install
if there is one.
> Jeff Clites <jcl...@mac.com> wrote:
>> I've sent my patch in through RT--it's [perl #28405]!
>
> Phew, that's huge. I'd really like to have smaller patches that do it
> step by step.
Yes, I know it got quite large--sorry about that, I know it makes
things more difficult. Mostly it was all-or-nothing, though--changes to
string.c, and then dealing with the consequence of those changes. (It
would have been smaller if I had not added the "interpreter" argument
to the string API which lacked it--updating the places they are used is
probably half of the patch or so.)
In addition to my responses below, I hope to have a better-organized
explanation of the approach I've taken, and the rationale behind it.
> But anyway, the patch must have been a lot of work, so
> let's see and make the best out of it.
>
> Some questions:
>
> First:
> - What is string->representation? It seems only to contain
> string_rep_{one,two,four}.
> - How does UTF8 fit here?
> - Where are encoding and chartype now?
The key idea is to move to a model in which a string is not represented
as a bag of bytes + associated encoding, but rather conceptually as an
array of (abstract) characters, which boils down to an array of
numbers, those numbers being the Unicode code point of the character.
The easiest way to do this would be to represent a string using an
array of 32-bit ints, but this is a large waste of space in the common
case of an all-ASCII string. So what I'm doing is sticking to the idea
of modeling a string as conceptually an array of 32-bit numbers, but
optimizing by using just a uint8_t[] if all of the characters happen to
have numerical value < 2^8, uint16_t[] if some are outside of that
range but still < 2^16, and finally uint32_t[] in the rare case it
contains characters above that range.
So internally, strings don't have an associated encoding (or chartype
or anything)--if you want to know what the N-th character is, you just
jump to index N for the appropriate datatype, and that number is your
answer.
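Here's a rough sketch in C of what I mean--the names here are made up
purely for illustration, not the actual struct fields in the patch:

/* Sketch only, hypothetical names: one logical array of code points,
   stored in the narrowest width that holds the largest code point
   present. */
#include <stdint.h>
#include <stddef.h>

typedef enum { rep_one, rep_two, rep_four } str_rep;

typedef struct {
    str_rep representation; /* which width is in use */
    size_t  length;         /* length in characters, not bytes */
    void   *buffer;         /* uint8_t[], uint16_t[], or uint32_t[] */
} sketch_string;

/* Pick the narrowest representation that fits the largest code point. */
static str_rep
pick_rep(const uint32_t *codepoints, size_t n)
{
    uint32_t max = 0;
    size_t   i;
    for (i = 0; i < n; i++)
        if (codepoints[i] > max)
            max = codepoints[i];
    return max < 0x100 ? rep_one : (max < 0x10000 ? rep_two : rep_four);
}

/* "What's your N-th character?" is a constant-time array lookup. */
static uint32_t
char_at(const sketch_string *s, size_t n)
{
    switch (s->representation) {
        case rep_one: return ((const uint8_t  *)s->buffer)[n];
        case rep_two: return ((const uint16_t *)s->buffer)[n];
        default:      return ((const uint32_t *)s->buffer)[n];
    }
}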
In my view, the fundamental question you can ask a string is "what's
your N-th character", and the fundamental thing you do with the answer
is go look up what properties that character has (eg, sort ordering,
case mapping, value as a digit, etc.).
In this model, an "encoding" is a serialization algorithm for strings,
and that's it (much like Data::Dumper defines a serialization format
for other Perl types). A particular encoding is a particular strategy
for serializing a string into a bag of bytes (or vice-versa), almost
always for interchange via I/O (which is always byte-based). The split
into separate "encoding" and "chartype" ends up being undesirable. (I'd
argue that such a split is a possible internal design option for a
transcoding library, but not part of the conceptual API of a string
library. So ICU may have such a split internally, but parrot doesn't
need to.) In particular, this metadata is invariably specified as a
single parameter--in an XML declaration it's called "encoding", in MIME
headers it's called "charset", but I don't know of any interchange
format which actually tries to specify this sort of thing via two
separate parameters. Additionally, this split isn't reflected in other
string libraries I know of, nor is the concept universal across the
actual encoding standards themselves (though in some cases it is used
pedagogically).
So I get what the split is trying to capture, but I think it's
counterproductive, especially since developers tend to get confused
about the whole "encoding thing", and we make matters worse if we try
to maintain a type of generality that outruns actual usage.
So in this model (to recap), if I have two files which contain the same
Japanese text, one in Shift-JIS and one in UTF-8, then after reading
those into a parrot string, they are identical. (They're the same
text--right?) You could think of this as "early normalization", but my
viewpoint is that the concept of an "encoding" deals exclusively with
how to serialize a string for export (and the reverse), and not with
the in-memory manipulation (or API) of a string itself. This is very
much like defining a format to serialize objects into XML--you don't
end up thinking of the objects as "XML-based" themselves.
There are a couple of key strengths of this approach, from a
performance perspective (in addition to the conceptual benefit I am
claiming):
1) Indexing into a string is O(1). (With the existing model, if you
want to find the 1000th character in a string being represented in
UTF-8, you need to start at the beginning and scan forward.)
2) There's no memory allocation (and hence no GC) needed during string
comparisons or hash lookups. I consider that to be a major win.
3) The Boyer-Moore algorithm can be used for all string searches. (But
currently there are a couple of cases I still need to fill in.)
Incidentally, this closely matches the approach used by ObjC and Java.
> - Where is string->language?
I removed it from the string struct because I think that's the wrong
place for it (and it wasn't actually being used anywhere yet,
fortunately). The problem is that although many operations can be
language-dependent (sorting, for example), the operation doesn't depend
on the language of the strings involved, but rather on the locale of
the reader. The classic example is a list containing a mixture of
English and Swedish names; for an English reader the whole list should
be sorted in English order, and a Swedish reader would want to see them
in Swedish alphabetical order. (That example is from Richard Gillam's
book on Unicode.)
I'm assuming that was the intention of "language". If it was meant to
indicate "perl" v. "python" or something, then I should put it back.
Also, we may want to re-create an API which lets us obtain an integer
to use later to specify an encoding (right now, encodings are specified
by name, as C-strings). But the ICU API doesn't use this sort of
mechanism, so at the moment we'd just end up turning around and looking
up the C-string to pass into the ICU API--it isn't a performance
enhancer currently.
> With this string type how do we deal with anything beyond codepoints?
Hmm, what do you mean? Even prior to this, all of our operations ended
up relying on being able to either transcode two arbitrary strings to
the same encoding, or ask arbitrary strings what their N-th character
is. Ultimately, this carried the requirement that a string be
representable as a series of code points. I don't think that there is
anything beyond that which a string has to offer. But I'm curious as to
what you had in mind.
> And some misc remarks/questions:
>
> - string_compare seems just to compare byte,short, or int_32
Yep, exactly.
(There are compare-strings-using-a-particular-normalization-form
concepts which we'll ultimately need to handle, which are much like the
simpler concept of case-insensitive comparison. I see these being
handled by separate API/ops, and there are a couple of different
directions that could take.)
> - What happened to external constant strings?
They should still work (or could). But the only cases in which we can
optimize, and actually use a buffer handed to string_make "in place",
are a handful of encodings. And dealing with the "flags" argument
passed into string_make may still be incomplete.
> - What's the plan towards all the transcode opcodes? (And leaving these
> as a noop would have been simpler)
Basically there's no need for a transcode op on a string--it no longer
makes sense, there's nothing to transcode. What we need to add is an
"encode" which creates a bag of bytes from a string + an encoding, and
a "decode" which creates a string based on a bag of bytes + an
encoding. But we currently lack (as far as I can tell) a data type to
hold a naked bag of bytes. It's in my plans to create a PMC for that,
and once that exists to add the corresponding ops.
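To sketch that in pseudo-C (these names are hypothetical--neither the
byte-buffer PMC nor these functions exist yet):

/* Hypothetical, for illustration only: */
bytes  = encode(string, "UTF-8");     /* string + encoding -> raw bytes */
string = decode(bytes, "Shift-JIS");  /* raw bytes + encoding -> string */
/* "bytes" here would be the not-yet-written byte-buffer PMC, and the
   encode/decode ops would replace the old transcode ops. */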
> - hash_string seems not to deal with mixed encodings anymore.
Yep, since we're hashing based on characters rather than bytes, there's
no such thing as mixed encodings. That means that, to use my example
from above, a string representing the Japanese word for "sushi" will
hash to the same thing no matter what encoding may have been used to
represent it on disk.
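A sketch of what that implies, reusing the hypothetical
sketch_string/char_at from my earlier sketch (again, not the actual
hash_string code in the patch):

/* Hash over code points, not raw bytes, so the same text hashes
   identically whichever width it happens to be stored in (and
   whatever encoding it was originally read from). */
static size_t
hash_codepoints(const sketch_string *s)
{
    size_t hash = 5381;  /* arbitrary djb2-style seed, for illustration */
    size_t i;
    for (i = 0; i < s->length; i++)
        hash = hash * 33 + char_at(s, i);
    return hash;
}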
> - Why does PIO_putps convert to UTF-8?
Mostly for testing. We currently lack a way to associate an encoding
with an IO handle (or with an IO read or write), and if you want to
write out a string you have to pick an encoding--that's the recipe
needed to convert a string into something write-able. Until we've
developed that, I'm writing everything out in UTF-8, just as a sensible
thing to use to allow basically any string to be written out. Before
this, we were writing out the raw bytes used to internally represent
the string, which never really made sense.
> - Why does read/PIO_reads generate UTF-8 first?
Similar to the above, but right now that API is only being called by
the freeze/thaw stuff, and its current odd state reflects what needed
to be done to keep it working (basically, to match the write by
PIO_putps). But ultimately, for the freeze-using-opcodes case we should
be accumulating our bytes into a raw byte buffer rather than sneaking
them into the body of a string, but since we don't yet have a data type
representing a raw byte buffer (as mentioned above), this was a
workaround.
Fundamentally, I think we need two sorts of IO API (which may be
handled via IO layers or filters): byte-based and string-based. For the
string based case, an encoding always needs to be specified in some
manner (either implicitly or explicitly, and either associated with the
IO handle or just with a particular read/write).
Of course, pass along any other questions/concerns you have.
JEff
> Dan Sugalski wrote:
>
>> But... this gets us very much closer to where we want to be, and I'm
>> figuring that we're better off applying this and working out the
>> kinks than not. I'll leave this one to Leo to make final decision on,
>> though.
>
> Thanks Dan for this easter egg :)
>
> Intermediate results:
>
> - patch applied fine
Good! (I tested it enough, so I'm happy that was successful.)
> - compiles currently modulo some C++ style variable declarations
I'm lately being spoiled by C99 (which allows mid-block declarations
too), but these were just oversights on my part.
> and s/#import/#include/
Oops. (It's the ObjC in me.)
> - a few warnings about shadowed vars
Hmm, I'll need to figure out how to turn on warnings about this in gcc.
I would expect some "unused" warning as well.
> - patched source does *not* conform to pdd07
Not too surprising--I'm sure there are some tabs and too-long-lines and
such, that I know I need to fix.
> - additional coffee breaks during compiles will ruin me finally
Yes, no kidding! At least, the makefile as currently constructed won't
re-build ICU as a result of re-Configure-ing parrot. (And it might
make sense to remove cleaning of ICU from the default clean target, so
that you only rebuild it if you've cleaned it explicitly.) Also, be
glad I'm stopping it from building the data into a library--that takes
an additional lifetime to build.
> - compiled ok
Yeah!
> - make test - not too bad (5/1495 subtests failed)
Let me know which ones--I'm curious as to why.
> - JIted NCI (uses string_make for 't' sig) is b0rken
Ah yes, it makes sense I wouldn't have hit this, if it's in the
i386-specific code.
> - one native_pbc test fails (more are skipped now)
I know some endianness correction of the serialized string data still
needs to be added, so I wouldn't be surprised if the ppc-generated file
fails the test on i386--this might be that. There are SKIPs around the
other platform test files (which probably should be TODOs), and those
will need to be regenerated on those platforms.
> - make testj fails some more (8/1432 subtests failed)
I just confirmed that for me all is passing under make testj (except
t/pmc/object-meths.t:17 of course), but it's not surprising for this to
be platform-specific.
Thanks much for the feedback!
JEff
> At 1:55 PM -0700 4/9/04, Adam Thomason wrote:
>>> From: Dan Sugalski [mailto:d...@sidhe.org]
>>> Done. It's guaranteed to kill half the tinderboxen--I think my first
>>> thing to do on Monday is to patch up the build procedure to use the
>>> system ICU if it's available.
>>
>> Does this mean you're resigned to requiring a C++ compiler? Or will
>> it be possible to switch ICU off?
>
> It'll be possible to switch it off, or to use the system ICU install
> if there is one.
The problem with switching it off is that if we can't guarantee access
to the full Unicode character properties database (and associated
data), then we'll have cases where 2 different perl installs disagree
about how to uppercase a string (for instance). ICU gives us case
folding and normalization and character property API (and locales,
actually), in addition to encoding data. (I can see how "that encoding
isn't supported by this install" could be handled, but missing the other
stuff would be more problematic.) But maybe you have some ideas on how
this could be done safely/unconfusingly.
JEff
On Friday 09 April 2004 23:11, Jeff Clites wrote:
> Hmm, I'll need to figure out how to turn on warnings about this in gcc.
> I would expect some "unused" warning as well.
-Wshadow and -Wunused
-Wuninitialized and -Wunreachable-code can also help
jens
> So internally, strings don't have an associated encoding (or chartype
> or anything)
How do you handle EBCDIC? UTF8 for Ponie?
>> - Where is string->language?
> I removed it from the string struct because I think that's the wrong
> place for it (and it wasn't actually being used anywhere yet,
> fortunately).
Not used *yet* - what about:
use German;
print uc("i");
use Turkish;
print uc("i");
> language-dependent (sorting, for example), the operation doesn't depend
> on the language of the strings involved, but rather on the locale of
> the reader.
And if one is working with two different languages at a time?
>> With this string type how do we deal with anything beyond codepoints?
> Hmm, what do you mean?
"\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq
"\N{LATIN CAPITAL LETTER A WITH ACUTE}"
when comparing graphemes or letters. The latter might depend on the
language too.
We'll basically need 4 levels of string support:
,--[ Larry Wall ]--------------------------------------------------------
| level 0 byte == character, "use bytes" basically
| level 1 codepoint == character, what we seem to be aiming for, vaguely
| level 2 grapheme == character, what the user usually wants
| level 3 letter == character, what the "current language" wants
`------------------------------------------------------------------------
> or ask arbitrary strings what their N-th character
The N-th character depends on the level. In the above example, C<.length>
gives either 2 or 1, depending on whether the user queries at level 1 or
level 2. The same problem arises with positions. The current level also
depends on the scope where the string came from. (See the example WRT the
Turkish letter "i".)
>> - What happened to external constant strings?
> They should still work (or could).
,--[ string.c:714 ]--------------------------------------------------
| else /* even if buffer is "external", we won't use it directly */
`--------------------------------------------------------------------
>> - What's the plan towards all the transcode opcodes? (And leaving these
>> as a noop would have been simpler)
> Basically there's no need for a transcode op on a string--it no longer
> makes sense, there's nothing to transcode.
I can't imagine that. I've an ASCII string and want to convert it to UTF8
and UTF16 and write it into a file. How do I do that?
>> - hash_string seems not to deal with mixed encodings anymore.
> Yep, since we're hashing based on characters rather than bytes, there's
> no such thing as mixed encodings.
s. above
> JEff
leo
I'm replying for Jeff since I've been burned by the same questions
over and over again :-)
>
>>So internally, strings don't have an associated encoding (or chartype
>>or anything)
>
>
> How do you handle EBCDIC? UTF8 for Ponie?
All character sets (like EBCDIC) or encodings (like UTF-8) are
"normalized" to the Unicode character set (and to our own *internal*
encoding, the 8/16/32-bit one).
> Not used *yet* - what about:
>
> use German;
> print uc("i");
> use Turkish;
> print uc("i");
That is implementable (and already implemented by ICU) but by something
higher level than a "string".
> And if one is working with two different languages at a time?
One goes mad. As Jeff demonstrated, there is no silver bullet here; one
quickly gets into situations where there provably is NO correct
solution. So we shouldn't try building the impossible into the lowest
level of the string implementation.
> when comparing graphemes or letters. The latter might depend on the
> language too.
>
> We'll basically need 4 levels of string support:
>
> ,--[ Larry Wall ]--------------------------------------------------------
> | level 0 byte == character, "use bytes" basically
> | level 1 codepoint == character, what we seem to be aiming for, vaguely
> | level 2 grapheme == character, what the user usually wants
> | level 3 letter == character, what the "current language" wants
> `------------------------------------------------------------------------
Jeff's solution gives us level 1, and I assume that level 0 is trivially
deducible from that. Note, however, that not all string operations
(especially such a rich set of string ops as Perl has) can even be
defined for all those levels: e.g. bitstring boolean bit ops are rather
insane at levels higher than zero.
> The N-th character depends on the level. Above examples C<.length> gives
> either 2 or 1, when the user queries at level 1 or 2. The same problem
> arises with positions. The current level depends on the scope were the
> string was coming from too. (s. example WRT turkish letter "i")
The levels 2 and 3 depend on something higher level, like the higher
levels of ICU. I believe we have everything we need (and even more) in
ICU. Let's get the levels 0 and 1 working first.
>>>- What's the plan towards all the transcode opcodes? (And leaving these
>>> as a noop would have been simpler)
>
>
>>Basically there's no need for a transcode op on a string--it no longer
>>makes sense, there's nothing to transcode.
>
>
> I can't imagine that. I've an ASCII string and want to convert it to UTF8
> and UTF16 and write it into a file. How do I do that?
IIUC the old "transcoding" stuff was doing transcoding in run-time so
that two encoding-marked strings could be compared. The new scheme
"normalizes" (not to be confused with Unicode normalization) all strings
to Unicode. If you want to do transformations like you describe above
you either call an explicit transcoding interface (which ICU no doubt
has) or your I/O layers do that implicitly (this functionality PIO does
not yet have, if I understood Jeff correctly).
Maybe it's good to refresh on the 'character hierarchy' as defined by
Unicode (and IETF, and W3C).
ACR - Abstract Character Repertoire: an unordered collection of abstract
characters, like "UPPERCASE A" or "LOWERCASE B" or "DECIMAL DIGIT SEVEN".
CCS - Coded Character Set: an ordered (numbered) list of characters,
like 65 -> "UPPERCASE A". For example: ASCII and EBCDIC.
CEF - Character Encoding Form: mapping the CCS character codes to
platform-specific numbers like bytes or integers.
CES - Character Encoding Scheme: mapping the CEF numbers to serialized
bytes, possibly adding synchronization metadata like shift codes or byte
order markers.
The great confusion exists mostly because, in the old world (like ASCII
or Latin-1), all four of these levels were conflated into one.
ISO 8859-1 (which is a CCS) has an eight-bit CEF. UTF-8 is both a CEF
and a CES. UTF-16 is a CEF, while UTF-16LE is a CES. ISO 2022-{JP,KR}
are CES.
(Outside of Unicode) there is TES (Transfer Encoding Syntax), too, which
is application-level encoding like base64 or gzip.
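(A concrete walk-through, as an example: the abstract character LATIN
SMALL LETTER E WITH ACUTE lives in the ACR; the Unicode CCS assigns it
the number U+00E9; the UTF-8 CEF/CES turns that number into the two
bytes 0xC3 0xA9; UTF-16LE as a CES serializes it as the bytes 0xE9 0x00.)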
>
> Done. It's guaranteed to kill half the tinderboxen--I think my first thing
> to do on Monday is to patch up the build procedure to use the system ICU
> if it's available.
Thanks for the checkin. And yes. What about building without ICU? I can
imagine that some embedded usage of Parrot, dealing with ASCII only,
wouldn't need it. We could provide pure ASCII fallbacks for this case
and reduce functionality.
leo
>> How do you handle EBCDIC? UTF8 for Ponie?
> All character sets (like EBCDIC) or encodings (like UTF-8) are
> "normalized" to the Unicode (character set) (and our own *internal*
> encoding, the 8/16/32 one.)
Ok.
>> Not used *yet* - what about:
>>
>> use German;
>> print uc("i");
>> use Turkish;
>> print uc("i");
> That is implementable (and already implemented by ICU) but by something
> higher level than a "string".
So the first question is: Where is this higher level? Isn't Parrot
responsible for providing that? The old string type did have the
relevant information at least.
I think we can't say it's a Perl6 lib problem. HLL interoperability
comes in here too. *If* there are some more advanced string levels above
Parrot strings, they have to play together too.
So let's first concentrate on this issue. The rest is more or less
an implementation detail.
leo
> Jeff Clites <jcl...@mac.com> wrote:
>> On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:
>
>> So internally, strings don't have an associated encoding (or chartype
>> or anything)
>
> How do you handle EBCDIC?
I'll use pseudo-C to illustrate:
string = string_make(buffer, "EBCDIC"); /* buffer came from an EBCDIC
file, for instance containing the word "hello" */
outputBuffer1 = encode(string, "EBCDIC");
outputBuffer2 = encode(string, "ASCII");
outputBuffer3 = encode(string, "UTF-8");
outputBuffer4 = encode(string, "UTF-32");
//maybe write buffers out to files now....
So far, that would all just work, and outputBuffers 1-4 all have the
word "hello" serialized using various encodings.
Now, let's say instead that we had done:
string = string_make(buffer, "ASCII"); /* buffer came from an ASCII
file, for instance containing the word "hello" */
Now the other 4 lines of code would be the same, and outputBuffers 1-4
are identical to what they would have been above, and in fact "string"
would be identical in the 2 cases as well.
> UTF8 for Ponie?
Ponie should only need to pretend that a string is represented
internally in UTF-8 in a limited number of situations--not for string
comparisons or (most?) regular expressions, etc. For the cases where it
does need that, it can create a buffer of bytes to work from on the fly
(or cache one). But I think of that as backward compatibility, and not
the case we're optimized for.
>>> - Where is string->language?
>
>> I removed it from the string struct because I think that's the wrong
>> place for it (and it wasn't actually being used anywhere yet,
>> fortunately).
>
> Not used *yet* - what about:
>
> use German;
> print uc("i");
> use Turkish;
> print uc("i");
Perfect example. The string "i" is the same in each case. What you've
done is implicitly supplied a locale argument to the uc()
operation--it's just a hidden form of:
uc(string, locale);
The important thing is that the locale is a parameter to the operation,
not an attribute of the string.
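(ICU itself already works this way--this is from memory of its C API,
so treat the details as approximate: the locale is a parameter to the
case-mapping call, and the input string is identical in both cases.)

#include <unicode/utypes.h>
#include <unicode/ustring.h>

void uc_demo(void)
{
    UChar      src[]  = { 0x0069, 0 };   /* "i" (U+0069) */
    UChar      dest[8];
    UErrorCode status = U_ZERO_ERROR;

    /* German (like most locales): "i" -> "I" (U+0049) */
    u_strToUpper(dest, 8, src, -1, "de", &status);

    /* Turkish: "i" -> dotted capital I (U+0130) */
    u_strToUpper(dest, 8, src, -1, "tr", &status);
}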
>> language-dependent (sorting, for example), the operation doesn't depend
>> on the language of the strings involved, but rather on the locale of
>> the reader.
>
> And if one is working with two different language at a time?
Hmm? The point is that if you have a list of strings, for instance some
in English, some in Greek, and some in Japanese, and you want to sort
them, then you have to pick a sort ordering. If you associate a
language with each string, that doesn't help you--how do you compare
the English and Japanese strings with one another? So again, the sort
ordering is a parameter to the sort operation:
sort_strings(array, locale);
And you could certainly have:
sortedInOneWay = sort_strings(someArrayOfStrings, locale1)
sortedInAnotherWay = sort_strings(someArrayOfStrings, locale2)
That's only awkward if we're assuming a locale is hanging around
implicitly specified--if it's an explicit parameter, it's very clear.
And in practice, the collation algorithm specified by a locale has to
be prepared to handle sorting any strings at all--so that you can sort
Japanese strings in the English sort order, for instance. That sounds
strange, but generally this tends to be modeled as a base sorting
algorithm (covering all characters/strings) plus small per-locale
variations. That means that the sort order for Kanji strings would
probably end up being the same for the English and Dutch locales,
though strings containing Latin characters might sort differently.
>>> With this string type how do we deal with anything beyond codepoints?
>
>> Hmm, what do you mean?
>
> "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq
> "\N{LATIN CAPITAL LETTER A WITH ACUTE}"
>
> when comparing graphemes or letters. The latter might depend on the
> language too.
Right, but this isn't comparing graphemes, it's this:
one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";
one eq two //false--they're different strings
normalizeFormD(one) eq normalizeFormD(two) //true
This is quite analogous to:
three = "abc"
four = "ABC"
three eq four //false
uc(three) eq uc(four) //true
So it's not comparing graphemes, it's comparing strings under
normalization. It's a much clearer and much better worked-out
concept. If you want to think of this as comparing graphemes, you need
to invent a data type to model "a grapheme", and that's a mess and
nobody's ever done that. It's the wrong way to think about the problem.
> We'll basically need 4 levels of string support:
>
> ,--[ Larry Wall ]--------------------------------------------------------
> | level 0 byte == character, "use bytes" basically
> | level 1 codepoint == character, what we seem to be aiming for, vaguely
> | level 2 grapheme == character, what the user usually wants
> | level 3 letter == character, what the "current language" wants
> `------------------------------------------------------------------------
Yes, and I'm boldly arguing that this is the wrong way to go, and I
guarantee you that you can't find any other string or encoding library
out there which takes an approach like that, or anyone asking for one.
I'm eager for Larry to comment.
>> or ask arbitrary strings what their N-th character
>
> The N-th character depends on the level. In the above example, C<.length>
> gives either 2 or 1, depending on whether the user queries at level 1 or
> level 2. The same problem arises with positions. The current level also
> depends on the scope where the string came from. (See the example WRT
> the Turkish letter "i".)
I'm arguing that whether you have dotted-i or dotless-i is decided when
you create the string, and how those are case-mapped depend on your
choice of case-folding algorithm. The ICU case-folding API has explicit
variants to decide how to case fold the i's, for instance.
>>> - What happened to external constant strings?
>
>> They should still work (or could).
>
> ,--[ string.c:714 ]--------------------------------------------------
> | else /* even if buffer is "external", we won't use it directly */
> `--------------------------------------------------------------------
Yes, that's the else clause for the case where we've determined that we
can't use the passed-in buffer directly. A few lines above that is the
place where we have a mem_sys_memcopy() which we can avoid (but the
code doesn't yet, I see), and use the passed-in buffer in the
external-constant case.
>>> - What's the plan towards all the transcode opcodes? (And leaving these
>>> as a noop would have been simpler)
>
>> Basically there's no need for a transcode op on a string--it no longer
>> makes sense, there's nothing to transcode.
>
> I can't imagine that. I've an ASCII string and want to convert it to
> UTF8 and UTF16 and write it into a file. How do I do that?
That's the mindset shift. You don't have an ASCII string. You have a
string, which may have come from a file or a buffer representing a
string using the ASCII encoding. It's the example from above, again:
inputBuffer = read(inputHandle);
string = string_make(inputBuffer, "ASCII");
outputBuffer = encode(string, "UTF-16");
write(outputHandle, outputBuffer);
or, if we want to associate an encoding with a handle (sometimes more
convenient, sometimes less), it could go like this:
inputHandle = open("file1", "ASCII");
string = stringRead(inputHandle);
outputHandle = open("file2", "UTF-16");
stringWrite(outputHandle, string);
Basically, we need to think of strings from the top down (what concept
are they trying to capture, what API makes sense for that), rather than
from the bottom up (how are bytes stored in files). The key is to think
of strings as, by definition, representing a sequence of (logical)
characters, and to think of a character as what's trying to be captured
by things such as "LATIN CAPITAL LETTER A WITH ACUTE" or "SYRIAC
QUSHSHAYA". That's the whole cookie.
(And again, I wish I could take credit for all of this, but it's really
the general state-of-the-art with regard to strings. It's how people
are thinking about these things these days.)
JEff
I'm no Larry, either :-) but I think Larry is *not* saying that the
"localeness" or "languageness" should hang off each string (or *shudder*
off each substring). What I've seen is that Larry wants the "level" to
be a lexical pragma (in Perl terms). The "abstract string" stays the
same, but the operative level decides for _some_ ops what a "character"
stands for.
The default level should be somewhere between levels 1 and 2 (again, it
depends on the ops).
For example, usually /./ means "match one Unicode code point" (a CCS
character code). But one can somehow ratchet the level up to 2 and make
it mean "match one Unicode base character, followed by zero or more
modifier characters". For level 3 the language (locale) needs to be
specified.
As another example, bitstring xor does not make much sense for anything
else than level zero.
The basic idea being that we cannot and should not dictate at what level
of abstraction the user wants to operate. We will give a default level,
and ways to "zoom in" and "zoom out".
(If Larry is really saying that the "locale" should be an attribute of
the string value, I'm on the barricades with you, holding cobblestones
and Molotov cocktails...)
Larry can feel free to correct me :-)
> Jarkko Hietaniemi <j...@iki.fi> wrote:
>
>>> Not used *yet* - what about:
>>>
>>> use German;
>>> print uc("i");
>>> use Turkish;
>>> print uc("i");
>
>> That is implementable (and already implemented by ICU) but by something
>> higher level than a "string".
>
> So the first question is: Where is this higher level? Isn't Parrot
> responsible for providing that? The old string type did have the
> relevant information at least.
See my separate post--what's needed is a locale parameter to uc(),
giving uc("i", locale). For Parrot, we at least need ops/API which take
an explicit locale parameter. The interoperability issue comes into
play in whether we decide to let a default locale be specified at the
parrot level, and have op variants which use this. I think that we
actually don't need that, and that we can let it mostly happen at the
HLL level--that is, if Perl6 (etc.) wants an API such as uc() without an
explicit parameter, it just needs to compile 'print uc("i")' down to:
set S0, "i"
find_global P1, "default_locale" # some agreed-upon global
uc S1, S0, P1
print S1
So parrot can support a default locale cross-language, without needing
separate ops or anything. And if, for instance, Python didn't have a
default locale, and always made you pass a locale into operations which
need one, then everything would still be fine--just Python wouldn't be
looking up this particular global.
> I think we can't say it's a Perl6 lib problem. HLL interoperability
> comes in here too. *If* there are some more advanced string levels above
> Parrot strings, they have to play together too.
>
> So let's first concentrate on this issue. The rest is more or less
> an implementation detail.
JEff
Right. It's a Parrot lib problem. But it's not a ".c/.cpp" problem.
> comes in here too. *If* there are some more advanced string levels above
> Parrot strings, they have to play together too.
>
> So let's first concentrate on this issue. The rest is more or less
> an implementation detail.
Once we get levels 0 and 1 working, we can worry about bolting the
levels 2 and 3 from ICU to a Parrot level API. (ICU goes much further
than 2 or 3, incidentally: how about some Buddhist calendar?)
--
Jarkko Hietaniemi <j...@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen