Dan Sugalski  
Newsgroups: perl.perl6.internals
From: d...@sidhe.org (Dan Sugalski)
Date: Mon, 12 Apr 2004 11:43:45 -0400
Local: Mon, Apr 12 2004 11:43 am
Subject: Plans for string processing
Okay, I've not dug through all the fallout from the ICU checkin, but
I can see there's an awful lot. I'll dig through that in a bit, but...

Here's the plan. We've gone over it in the past, but I'm not sure
everything's been gathered together, so it's time to do so.

Some declarations:

1) Parrot will *not* require Unicode. Period. Ever. (Well, upon
release, at least.) We will strongly recommend it, however, and use it
if we have it.
2) Parrot *will* support multiple encodings (the bytes->code points
stuff), character sets (code points->meaning of a sort), and
language-specific overrides of character set behaviour.
3) All string data can be dealt with as either a series of bytes,
code points, or characters. (Characters are potentially multiple code
points--basically combining character stuff from those standards that
do so)
4) We will *not* use ICU for core functions. (string to number or
number to string conversions, for example)
5) Parrot will autoconvert strings as needed. If a string can't be
converted, parrot will throw an exception. This goes for language,
character set, or encoding.
6) There *may* be an overriding set of rules for throwing conversion
exceptions. (They may be suppressed on lossy conversions, or required
for any conversions)
7) There *may* be an overriding language used for language-specific
operations (case folding or sorting).

I know ICU's got all sorts of nifty features, but bluntly we're not
going to use most of them.

The original split of encoding, character set, and language is one
that I want to keep. I know we've lost a good chunk of that with the
latest ICU patch, but that's only temporary and the breakage is worth
it to get Unicode actually in use. I expect I need to step up to the
plate and get an alternate encoding and charset in, so I'll probably
take a shot at JIS X 0208:1997 or CNS11643-1992. (Or whatever the
current version of those is)

As far as Parrot is concerned, a string is a series of bytes which
may, via its encoding, be turned into a series of 32 bit integer code
points. Those 32-bit integer code points can be turned, via its
character set, into a series of characters where each character is
one or more code points. Those characters may be classified and
transformed based on the language of the string.
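
To make the encoding step concrete, here is a minimal Python sketch of
two toy encodings that turn bytes into 32-bit integer code points. The
function names are made up for illustration and are not Parrot's API.

```python
import struct


def decode_fixed32(data: bytes) -> list[int]:
    """Fixed-width encoding: every 4 big-endian bytes form one code point."""
    if len(data) % 4 != 0:
        raise ValueError("byte length is not a multiple of 4")
    return list(struct.unpack(f">{len(data) // 4}I", data))


def encode_fixed32(code_points: list[int]) -> bytes:
    """Inverse of decode_fixed32."""
    return struct.pack(f">{len(code_points)}I", *code_points)


def decode_single_byte(data: bytes) -> list[int]:
    """Single-byte encoding: each byte is already a code point (0..255)."""
    return list(data)
```

Note that the integers that come out carry no meaning yet. What 0xE9
stands for is the character set's business, not the encoding's.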

The responsibilities of the three layers are:

Encoding
========

*) Transforms a stream of bytes to and from a series of 32-bit integers
*) Manages the byte buffer (so buffer positioning and manipulation by
code point offset is handled here)

Character set
=============
*) Provides default manipulation and comparison behaviour (sorting
and case mangling)
*) Provides default character classifications (digit, word char,
space, punctuation, whatever)
*) Provides code point and character manipulation. (substring
functionality, basically)
*) Provides integrity features (exceptions if a string would be invalid)

Language
========
*) Provides language-sensitive manipulation of characters (case mangling)
*) Provides language-sensitive comparisons
*) Provides language-sensitive character overrides ('ll' treated as a
single character, for example, in Spanish if that's still desired)
*) Provides language-sensitive grouping overrides.
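
As a rough illustration of the language layer, the Python sketch below
shows a character override (traditional Spanish digraphs counted as one
character) and language-sensitive case mangling (Turkish dotted i). The
language tags and the table layout are assumptions for the example, not
anything Parrot defines.

```python
from typing import Optional

# Hypothetical language tags and tables, for illustration only.
DIGRAPHS = {"es-traditional": ("ch", "ll")}


def split_characters(text: str, language: Optional[str] = None) -> list[str]:
    """Split text into characters, honouring per-language digraphs."""
    digraphs = DIGRAPHS.get(language, ())
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in digraphs:
            out.append(pair)  # the language treats this pair as one character
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out


def upcase(text: str, language: Optional[str] = None) -> str:
    """Uppercase text, with a Turkish override for dotted i."""
    if language == "tr":
        # In Turkish, "i" uppercases to dotted capital I (U+0130).
        text = text.replace("i", "\u0130")
    return text.upper()
```

With no language given, both functions fall back to the default
behaviour, which is the part the character set layer would provide.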

Since examples are good, here are a few. They're in an "If we"/"Then
Parrot" format.

IW: Mush together (either concatenate or substr replacement) two
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is thrown.
If so, we do the operation. If one string is modified in place, its
language stays whatever it was. If a new string is created, either the
left side wins or the default language is used, depending on the
interpreter setting.
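
A small Python sketch of that rule follows. The class names, the
interpreter settings, and the "und" default tag are all hypothetical
stand-ins chosen for the example.

```python
from dataclasses import dataclass


class LanguageMixError(Exception):
    """Raised when mixing languages is not allowed."""


@dataclass
class LString:
    text: str
    language: str


@dataclass
class Interpreter:
    allow_mixing: bool = True       # assumed setting
    left_wins: bool = True          # assumed setting
    default_language: str = "und"   # assumed default tag


def _check(interp: Interpreter, a: LString, b: LString) -> None:
    if a.language != b.language and not interp.allow_mixing:
        raise LanguageMixError(f"cannot mix {a.language} and {b.language}")


def concat(interp: Interpreter, a: LString, b: LString) -> LString:
    """Create a new string; its language follows the interpreter setting."""
    _check(interp, a, b)
    if a.language == b.language or interp.left_wins:
        language = a.language
    else:
        language = interp.default_language
    return LString(a.text + b.text, language)


def append_in_place(interp: Interpreter, a: LString, b: LString) -> None:
    """Modify a; its language stays whatever it was."""
    _check(interp, a, b)
    a.text += b.text
```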

IW: Mush together two strings of different charsets
TP: If the two strings can be losslessly converted to one of the two
charsets, do so, otherwise transform to Unicode and mush together. If
the transformation is lossy, optionally throw an exception (or
warning). Language rules above still apply.
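
The charset choice can be sketched in Python as below. This borrows
Python's own codecs as stand-ins for character sets and uses Python
strings to hold the text, which is a simplification. The function names
are invented for the example.

```python
def can_represent(text: str, charset: str) -> bool:
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def common_charset(a_text: str, a_charset: str,
                   b_text: str, b_charset: str,
                   unicode_charset: str = "utf-8") -> str:
    """Pick the charset to use when combining two strings."""
    if a_charset == b_charset:
        return a_charset
    if can_represent(b_text, a_charset):
        return a_charset          # b converts losslessly to a's charset
    if can_represent(a_text, b_charset):
        return b_charset          # a converts losslessly to b's charset
    return unicode_charset        # neither fits, so fall back to Unicode
```

For instance, an ASCII string and a Latin-1 string end up in Latin-1,
while a Latin-1 string with accented letters and a KOI8-R string with
Cyrillic letters both get moved to Unicode.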

IW: Force a conversion to a different character set
TP: Does it. An exception or warning may be thrown if the conversion
is not lossless.
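
One way the exception-or-warning choice could look, again as a Python
sketch with Python codecs standing in for character sets. The
`on_lossy` parameter and its three values are assumptions made for the
example, not a Parrot setting.

```python
import warnings


class LossyConversionError(Exception):
    """Raised when a conversion would lose information."""


def force_convert(text: str, charset: str, on_lossy: str = "error") -> bytes:
    """Convert text to charset; on_lossy is 'error', 'warn', or 'ignore'."""
    if on_lossy not in ("error", "warn", "ignore"):
        raise ValueError(f"unknown policy: {on_lossy}")
    try:
        return text.encode(charset)          # lossless path
    except UnicodeEncodeError as exc:
        if on_lossy == "error":
            raise LossyConversionError(str(exc)) from exc
        if on_lossy == "warn":
            warnings.warn(f"lossy conversion to {charset}")
        # Unrepresentable characters become the codec's replacement mark.
        return text.encode(charset, errors="replace")
```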

Please note that in most cases parrot deals with string data as
*strings* in S registers (or hiding behind PMCs) not as integers in I
registers (even though we treat strings as a series of abstract
integer code points). This is because even something as simple as
"give me character 5" may return a series of code points if character
5 is a combining character set. We may (possibly, but possibly not)
get a bit dirtier for the regex code for speed reasons, but we'll see
about that.
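
Here is a short Python sketch of why a single character can be several
code points. It groups each base code point with the combining marks
that follow it, using Python's `unicodedata` module. This is a
deliberate simplification (real Unicode grapheme rules are more
involved), and the function names are made up for the example.

```python
import unicodedata


def characters(code_points: list[int]) -> list[list[int]]:
    """Group code points into characters: a base plus trailing combiners."""
    chars: list[list[int]] = []
    for cp in code_points:
        if chars and unicodedata.combining(chr(cp)):
            chars[-1].append(cp)   # combining mark attaches to previous base
        else:
            chars.append([cp])     # start a new character
    return chars


def char_at(code_points: list[int], index: int) -> list[int]:
    """Return character number index (0-based) as a list of code points."""
    return characters(code_points)[index]
```

Given the code points for "cafe" followed by U+0301 (combining acute)
and "s", character 3 comes back as two code points, 0x65 and 0x301,
which is why the result has to be a string rather than an integer.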

Also note that some languages, such as Perl 6, have a more restricted
view of things. That's fine, and we don't really care much as long as
everything that they need is provided. So the fact that Larry's
mandated the Ux levels is fine, but since they're a (possibly
excessively) restricted subset of what we're going to do, we can, and
in fact should, ignore them for our purposes. The same goes for other
languages that have similar restrictions.

Finally note that, in general, the actual character set or language
of a string becomes completely irrelevant so there isn't any loss in
abstracting things--to properly support Unicode means abstracting the
heck out of so much stuff that supporting multiple encodings and
character sets is a matter of switching out table pointers, and as
such not particularly a big deal.
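
The "switching out table pointers" idea might look roughly like this in
Python. The `CharsetTable` and `PString` types are hypothetical, and
only two operations are shown. The point is just that the string
operation itself never names a particular character set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CharsetTable:
    name: str
    is_digit: Callable[[int], bool]
    upcase: Callable[[int], int]


def _ascii_upcase(cp: int) -> int:
    return cp - 0x20 if 0x61 <= cp <= 0x7A else cp


def _latin1_upcase(cp: int) -> int:
    # 0xE0..0xFE (except 0xF7, the division sign) have uppercase forms
    # exactly 0x20 below them in ISO-8859-1.
    if 0xE0 <= cp <= 0xFE and cp != 0xF7:
        return cp - 0x20
    return _ascii_upcase(cp)


def _is_ascii_digit(cp: int) -> bool:
    return 0x30 <= cp <= 0x39


ASCII = CharsetTable("ascii", _is_ascii_digit, _ascii_upcase)
LATIN1 = CharsetTable("iso-8859-1", _is_ascii_digit, _latin1_upcase)


@dataclass
class PString:
    code_points: list[int]
    charset: CharsetTable


def upcase(s: PString) -> PString:
    """Generic operation: all charset knowledge comes from the table."""
    return PString([s.charset.upcase(cp) for cp in s.code_points], s.charset)
```

Supporting another character set then means supplying another table,
with no change to `upcase` itself.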

Yes, this does mean that some of the recent ICU integration's going
to be moved back some, and it means that string data's more complex
than you might want it to be, but it already is, so we deal.

This all is not, as of yet, entirely non-negotiable, though I've yet
to get a convincing argument for change.
--
                                         Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                       teddy bears get drunk


 