On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:
> CHARACTER SET - Contains meta-information about code points. This
> includes both the meaning of individual code points
> (65 is capital A, 776 is a combining diaeresis) as
> well as a set of categorizations of code
> points (alpha, numeric, whitespace, punctuation, and
> so on), and a sorting order.
I'm assuming here that you are referring to things like Shift-JIS and
ISO-8859-1 as character sets, right?
Questions (based on that assumption):
[*Note: assume everywhere below that the strings in question are not
explicitly language-tagged (or, are tagged with "Dunno"--however it's
supposed to work).]
1) ISO-8859-1 is used to represent text in several different languages,
including German and Swedish. German and Swedish differ in their sort
order, even for things they have in common. (For example, ö
(o-with-diaeresis) is considered a separate letter in Swedish, but is
just an accented "o" in German.) So (assuming my strings aren't
explicitly language-tagged, or are tagged with "Dunno"), what sort
order does ISO-8859-1 define? I'm not sure whether the national
standards themselves actually define a sort order, so are we going to
define one for every "character set"? In addition, many languages can
be represented in several different "character sets", so that seems to
mean that the sort order for "öut" vs. "out" will vary, depending on the
"character set" used for those strings?
2) In light of the above, how do you sort an array of strings, assuming
they're not all in the same "character set"?
3) If the answer to (2) is "you must upgrade them all to UTF-8", then
that means that the sort order for an array might totally change when
you add one new member, right? If the answer is, "for a given pair,
when you compare them during sorting, only upgrade if their character
sets don't match", then you open the door to non-convergent sorting
(i.e., the sort might never finish).
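
To make the concern in (3) concrete, here is a rough sketch in Python of a "only upgrade when the character sets differ" comparison. Everything in it is an assumption for illustration: the Tagged type, the rule that ISO-8859-1 sorts ö as a plain "o", and the fallback to code point order after upgrading. It is not how Parrot does or would do it. The point is only that such a comparison can produce a cycle, so no consistent sorted order exists.

```python
from collections import namedtuple

# Hypothetical string type: the text plus the name of its character set.
Tagged = namedtuple("Tagged", ["text", "charset"])

def latin1_key(s):
    # Assumed rule for illustration: ISO-8859-1 sorts "ö" as a plain "o".
    return s.replace("ö", "o")

def less_than(x, y):
    # Same character set: use that character set's own (assumed) order.
    if x.charset == y.charset == "ISO-8859-1":
        return latin1_key(x.text) < latin1_key(y.text)
    # Otherwise "upgrade" both sides and compare by Unicode code point.
    return x.text < y.text

a = Tagged("ö", "ISO-8859-1")
b = Tagged("p", "ISO-8859-1")
c = Tagged("z", "US-ASCII")

# a < b by the ISO-8859-1 rule, b < c and c < a by code point: a cycle.
cycle = (less_than(a, b), less_than(b, c), less_than(c, a))
print(cycle)
```

With a comparison like this, a < b < c < a all hold at once, so the result of a sort depends on which pairs the algorithm happens to compare, and an algorithm that relies on a consistent order may misbehave.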
My worry here is that if the semantics of the Latin Capital Letter A
("A"), for example (or pick any other character), are allowed to differ
between different "character sets", then we'll have problems for any
binary string operation.
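
As a small illustration of that worry, the sketch below (Python, with made-up and heavily simplified collation rules) applies one binary operation, comparison, to the same strings under two rule sets. The names german_key and swedish_key and the tiny alphabet table are assumptions for the example; real collation tables are much larger.

```python
# Simplified Swedish alphabet: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def german_key(s):
    # Assumed German-style rule: "ö" compares as "o"; the accent only
    # breaks ties between otherwise equal strings.
    return (s.replace("ö", "o"), s)

def swedish_key(s):
    # Assumed Swedish-style rule: each letter ranks by alphabet position.
    return [SWEDISH_ALPHABET.index(ch) for ch in s]

words = ["öut", "out", "p", "z"]

german_sorted = sorted(words, key=german_key)
swedish_sorted = sorted(words, key=swedish_key)

print(german_sorted)   # ['out', 'öut', 'p', 'z']
print(swedish_sorted)  # ['out', 'p', 'z', 'öut']
```

The same bytes give two different answers to "is öut less than p?", which is the kind of disagreement that would leak into every operation taking two strings if the rules hang off the "character set".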