Okay, I've not dug through all the fallout from the ICU checkin, but
I can see there's an awful lot. I'll dig through that in a bit, but...
Here's the plan. We've gone over it in the past, but I'm not sure
everything's been gathered together, so it's time to do so.
1) Parrot will *not* require Unicode. Period. Ever. (Well, upon
release, at least) We will strongly recommend it, however, and use it
if we have it.
2) Parrot *will* support multiple encodings (the bytes->code points
stuff), character sets (code points->meaning of a sort), and
language-specific overrides of character set behaviour.
3) All string data can be dealt with as either a series of bytes,
code points, or characters. (Characters are potentially multiple code
points--basically combining character stuff from those standards that
have it.)
4) We will *not* use ICU for core functions. (string to number or
number to string conversions, for example)
5) Parrot will autoconvert strings as needed. If a string can't be
converted, Parrot will throw an exception. This goes for language,
character set, or encoding.
6) There *may* be an overriding set of rules for throwing conversion
exceptions. (They may be suppressed on lossy conversions, or required
for any conversions)
7) There *may* be an overriding language used for language-specific
operations (case folding or sorting).
I know ICU's got all sorts of nifty features, but bluntly we're not
going to use most of them.
The original split of encoding, character set, and language is one
that I want to keep. I know we've lost a good chunk of that with the
latest ICU patch, but that's only temporary and the breakage is worth
it to get Unicode actually in use. I expect I need to step up to the
plate and get an alternate encoding and charset in, so I'll probably
take a shot at JIS X 0208:1997 or CNS11643-1992. (Or whatever the
current version of those is)
As far as Parrot is concerned, a string is a series of bytes which
may, via its encoding, be turned into a series of 32 bit integer code
points. Those 32-bit integer code points can be turned, via its
character set, into a series of characters where each character is
one or more code points. Those characters may be classified and
transformed based on the language of the string.
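
A minimal sketch of those three views of one string, in Python rather
than anything Parrot actually ships. The function names are made up
for illustration, and Python's own Unicode tables stand in for the
character set layer, so the grouping rule here (base code point plus
any trailing combining marks) is a simplification, not a complete
definition of a character.

```python
import unicodedata

def code_points(raw, encoding):
    """Encoding layer: turn a byte buffer into integer code points."""
    return [ord(ch) for ch in raw.decode(encoding)]

def characters(cps):
    """Character set layer: group each base code point with the
    combining marks that follow it (simplified rule, for illustration)."""
    chars = []
    for cp in cps:
        if chars and unicodedata.combining(chr(cp)) != 0:
            chars[-1].append(cp)   # combining mark joins the previous character
        else:
            chars.append([cp])     # anything else starts a new character
    return chars

# 'e' + COMBINING ACUTE ACCENT + 'a'
raw = "e\u0301a".encode("utf-8")
cps = code_points(raw, "utf-8")
chars = characters(cps)

print(len(raw), len(cps), len(chars))   # 4 bytes, 3 code points, 2 characters
print(chars[0])                         # first character is two code points
```

Asking for a single character hands back a list of code points, which
is why a character cannot, in general, be returned as one integer.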
The responsibilities of the three layers are:

Encoding
*) Transforms a stream of bytes to and from a set of 32-bit integers
*) Manages the byte buffer (so buffer positioning and manipulation by
code point offset is handled here)

Character set
*) Provides default manipulation and comparison behaviour (sorting
and case mangling)
*) Provides default character classifications (digit, word char,
space, punctuation, whatever)
*) Provides code point and character manipulation (substring
operations, for example)
*) Provides integrity features (exceptions if a string would be invalid)

Language
*) Provides language-sensitive manipulation of characters (case mangling)
*) Provides language-sensitive comparisons
*) Provides language-sensitive character overrides ('ll' treated as a
single character, for example, in Spanish if that's still desired)
*) Provides language-sensitive grouping overrides.
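
One way to picture the layering is as three small tables of functions,
with the language layer consulted first and the character set layer
supplying the default. The sketch below is Python, and every class,
field, and function name in it is hypothetical. It only covers case
mangling, with a Turkish override (lowercase i uppercases to U+0130)
as the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Encoding:                      # bytes <-> code points
    name: str
    decode: Callable[[bytes], list]
    encode: Callable[[list], bytes]

@dataclass
class Charset:                       # default behaviour for code points
    name: str
    upcase: Callable[[int], int]
    is_digit: Callable[[int], bool]

@dataclass
class Language:                      # overrides on top of the charset
    name: str
    upcase_overrides: Dict[int, int] = field(default_factory=dict)

def default_upcase(cp):
    """Charset default: use Python's tables, keep cp if the result
    would expand to more than one code point."""
    up = chr(cp).upper()
    return ord(up) if len(up) == 1 else cp

def upcase(cps, charset, language):
    """Language overrides win; otherwise fall back to the charset."""
    return [language.upcase_overrides.get(cp, charset.upcase(cp)) for cp in cps]

utf8 = Encoding("utf-8",
                decode=lambda raw: [ord(c) for c in raw.decode("utf-8")],
                encode=lambda cps: "".join(map(chr, cps)).encode("utf-8"))
unicode_cs = Charset("unicode", upcase=default_upcase,
                     is_digit=lambda cp: chr(cp).isdigit())
english = Language("en")
turkish = Language("tr", upcase_overrides={ord("i"): 0x0130})

cps = utf8.decode(b"hi")
print(utf8.encode(upcase(cps, unicode_cs, english)))   # b'HI'
print(utf8.encode(upcase(cps, unicode_cs, turkish)))   # b'H\xc4\xb0'
```

Swapping the encoding, character set, or language then amounts to
passing in a different table.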
Since examples are good, here are a few. They're in an "If we" (IW) / "Then Parrot" (TP) format.
IW: Mush together (either concatenate or substr replacement) two
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is thrown.
If so, we do the operation. If one string is manipulated the language
stays whatever that string was. If a new string is created either the
left side wins or the default language is used, depending on the
IW: Mush together two strings of different charsets
TP: If the two strings can be losslessly converted to one of the two
charsets, do so, otherwise transform to Unicode and mush together. If
the transformation is lossy, optionally throw an exception (or warning).
Language rules above still apply.
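
A rough Python sketch of that decision, under some loud assumptions: a
string is modelled as a (text, charset) pair, Python codec names stand
in for charsets, and "utf-8" stands in for "transform to Unicode". The
`representable` and `mush` helpers are hypothetical names, and the
language rules are left out entirely.

```python
def representable(text, charset):
    """True if text converts to charset without loss."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

def mush(left, right):
    """Concatenate two (text, charset) pairs.

    Prefer one of the two operands' charsets if both sides fit in it
    losslessly (left side tried first); otherwise fall back to Unicode.
    """
    ltext, lcs = left
    rtext, rcs = right
    for target in (lcs, rcs):
        if representable(ltext, target) and representable(rtext, target):
            return (ltext + rtext, target)
    return (ltext + rtext, "utf-8")

# ASCII text fits in Latin-1, so the result stays Latin-1.
print(mush(("abc", "ascii"), ("caf\u00e9", "latin_1")))

# Latin-1 and Greek text fit in neither charset, so we go to Unicode.
print(mush(("caf\u00e9", "latin_1"), ("\u03b1\u03b2", "iso8859_7")))
```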
IW: Force a conversion to a different character set
TP: Does it. An exception or warning may be thrown if the conversion
is not lossless.
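
The "exception or warning" choice could be modelled as a policy
argument on the conversion. This is only a sketch in Python: the
`on_loss` knob, its three values, and the `LossyConversion` exception
are invented here for illustration, and the replacement of
unconvertible characters with '?' is just what Python's codecs happen
to do.

```python
import warnings

class LossyConversion(Exception):
    """Raised when a forced conversion would lose information."""

def convert(text, charset, on_loss="raise"):
    """Force text into charset, returning the encoded bytes.

    on_loss: "raise"  -> throw LossyConversion
             "warn"   -> emit a warning, then convert lossily
             "ignore" -> convert lossily without comment
    """
    try:
        return text.encode(charset)          # lossless path
    except UnicodeEncodeError as err:
        if on_loss == "raise":
            raise LossyConversion(str(err)) from err
        if on_loss == "warn":
            warnings.warn(f"lossy conversion to {charset}: {err}")
        return text.encode(charset, errors="replace")

print(convert("caf\u00e9", "latin_1"))                  # b'caf\xe9'
print(convert("caf\u00e9", "ascii", on_loss="ignore"))  # b'caf?'
```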
Please note that in most cases Parrot deals with string data as
*strings* in S registers (or hiding behind PMCs), not as integers in I
registers (even though we treat strings as a series of abstract
integer code points). This is because even something as simple as
"give me character 5" may return a series of code points if character
5 is a combining character sequence. We may (possibly, but possibly not)
get a bit dirtier for the regex code for speed reasons, but we'll see.
Also note that some languages, such as Perl 6, have a more restricted
view of things. That's fine, and we don't really care much as long as
everything that they need is provided. The fact that Larry's
mandated the Ux levels is fine, but since they're a (possibly
excessively) restricted subset of what we're going to do, we
can, and in fact should, ignore them for
our purposes. Same goes for other languages that have similar
restrictions.
Finally note that, in general, the actual character set or language
of a string becomes completely irrelevant so there isn't any loss in
abstracting things--to properly support Unicode means abstracting the
heck out of so much stuff that supporting multiple encodings and
character sets is a matter of switching out table pointers, and as
such not particularly a big deal.
Yes, this does mean that some of the recent ICU integration's going
to be moved back some, and it means that string data's more complex
than you might want it to be, but it already is, so we deal.
This all is not, as of yet, entirely non-negotiable, though I've yet
to get a convincing argument for change.
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                        have teddy bears and even
                                      teddy bears get drunk