
What Unicode means to us


Dan Sugalski

Aug 9, 2004, 2:14:46 PM
to perl6-i...@perl.org
Since this has been a sore spot lately, and one
we need to deal with, we might as well formally
define what Unicode means to us.

We must be able to:

*) Load in string data from an IO source,
regardless of its encoding, and treat it as
Unicode string data
*) Write string data to an IO source in any Unicode encoding
*) Collate strings per the Unicode standard
*) Convert non-Unicode string data to Unicode
properly (that is, obeying the Unicode conversion
rules)
*) Treat combining characters the same regardless
of whether they're composed or decomposed

We don't care about on-screen rendering or date/time/money formatting.

So, basically, we need to be able to read in data
regardless of whether it's UTF-8, UTF-16, or
UTF-32 encoded, and when we have it we should be
able to properly match "o" against "o" and not
"ö" (that's o with an umlaut over it) regardless
of whether the "ö" is composed (that is, one
codepoint) or decomposed (that is, two code
points), and then write it out to some IO handle
in proper UTF-8/16/32 format. When comparing two
Unicode strings we must be able to do so
properly, per the Unicode collation standard.
(With potential local overrides if we ever put
those in.) We must also be able to case-mangle
(that is, upcase, downcase, or titlecase) the
string.
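
To make the matching requirement concrete, here is a minimal sketch in Python (not Parrot code, and not a proposed API). The helper names decode_any and canon_equal are made up for illustration, and normalizing both sides to NFC is just one assumed way to make composed and decomposed forms compare equal.

```python
import unicodedata

def decode_any(raw: bytes, encoding: str) -> str:
    """Hypothetical helper: turn encoded bytes into a Unicode string."""
    return raw.decode(encoding)

def canon_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00f6"      # o with umlaut as one code point
decomposed = "o\u0308"   # "o" followed by COMBINING DIAERESIS

# The same text read from UTF-8, UTF-16, or UTF-32 gives the same string.
for enc in ("utf-8", "utf-16", "utf-32"):
    assert decode_any(composed.encode(enc), enc) == composed

# Composed and decomposed forms match each other, but neither matches "o".
assert canon_equal(composed, decomposed)
assert not canon_equal("o", composed)
assert not canon_equal("o", decomposed)

# Case-mangling has to preserve that equivalence as well.
assert canon_equal(composed.upper(), decomposed.upper())
```

Collation is left out of the sketch on purpose; comparing for equality after normalization is a much smaller problem than ordering strings per the Unicode collation standard.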

Additionally if we have source text which is
Latin-n, EBCDIC, ASCII, or whatever we must be
able to convert it with no loss to Unicode.
(Which I believe is now doable with Unicode 4.0.)
Losslessly converting Unicode to
ASCII/EBCDIC/whatever is *not* required, which is
fine as it's theoretically (and often
practically) impossible.
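
The asymmetry can be shown with a short Python sketch, again only as an illustration under assumptions. The helper names to_unicode and can_encode are hypothetical, and cp037 is used as one example of an EBCDIC code page.

```python
def to_unicode(raw: bytes, charset: str) -> str:
    """Hypothetical helper: convert legacy-charset bytes to a Unicode string."""
    return raw.decode(charset)

def can_encode(text: str, charset: str) -> bool:
    """Hypothetical helper: report whether text fits in the target charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Every Latin-1 byte maps to a code point, so the round trip loses nothing.
latin1_bytes = bytes(range(256))
assert to_unicode(latin1_bytes, "latin-1").encode("latin-1") == latin1_bytes

# The same holds for text that started out in an EBCDIC code page.
ebcdic_bytes = "HELLO".encode("cp037")
assert to_unicode(ebcdic_bytes, "cp037") == "HELLO"

# Going the other way is not always possible.
assert not can_encode("\u00f6", "ascii")     # o with umlaut has no ASCII form
assert can_encode("\u00f6", "latin-1")       # but Latin-1 does have it
assert not can_encode("\u20ac", "latin-1")   # the euro sign is not in Latin-1
```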

I think that's it. Spelling it out's made the
encoding and charset API clear. I'll type that in
and get it off next.
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Nicholas Clark

Aug 9, 2004, 2:20:44 PM
to Dan Sugalski, perl6-i...@perl.org
On Mon, Aug 09, 2004 at 02:14:46PM -0400, Dan Sugalski wrote:

> We don't care about on-screen rendering or date/time/money formatting.

And whilst every language out there might need these functions available
to its apps, this sounds like a module for the Comprehensive PIR Archive
Network. (i.e., I agree it's not core internal functionality.)

Nicholas Clark
