Newsgroups: perl.perl6.internals
From: jcli...@mac.com (Jeff Clites)
Date: Tue, 27 Apr 2004 09:40:12 -0700
Subject: [Q1] (Re: The strings design document)
On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:

> CHARACTER SET - Contains meta-information about code points. This
>            includes both the meaning of individual code points
>            (65 is capital A, 776 is a combining diaresis) as
>            well as a set of categorizations of code
>            points (alpha, numeric, whitespace, punctuation, and
>            so on), and a sorting order.

I'm assuming here that you are referring to things like Shift-JIS and
ISO-8859-1 as character sets, right?

Questions (based on that assumption):

[*Note: assume everywhere below that the strings in question are not
explicitly language-tagged (or, are tagged with "Dunno"--however it's
supposed to work).]

1) ISO-8859-1 is used to represent text in several different languages,
including German and Swedish. German and Swedish differ in their sort
order, even for characters they have in common. (For example, ö
(o-with-diaeresis) is considered a separate letter in Swedish, but is
just an accented "o" in German.) So (assuming my strings aren't
explicitly language-tagged, or are tagged with "Dunno"), what sort
order does ISO-8859-1 define? I'm not sure whether the national
standards themselves actually define a sort order, so are we going to
define one for every "character set"? In addition, many languages can
be represented in several different "character sets", so that seems to
mean that the sort order for "öut" vs. "out" will vary, depending on the
"character set" used for those strings?
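To make question (1) concrete, here is a minimal Python sketch of the
German/Swedish difference. The two key functions are hypothetical,
hand-rolled stand-ins for real collation tables (which are far richer),
and "zebra" is just an extra sample word. The point is only that the same
ISO-8859-1 code points can sort two ways.

```python
# Hypothetical, simplified collation keys; not real German or Swedish tables.
GERMAN_FOLD = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"}  # ä, ö, ü
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"  # ... z, å, ä, ö


def german_key(text):
    """German-like: an umlaut sorts with its base letter; the accent breaks ties."""
    primary = "".join(GERMAN_FOLD.get(ch, ch) for ch in text)
    return (primary, text)


def swedish_key(text):
    """Swedish-like: å, ä, ö are separate letters after z."""
    return [SWEDISH_ALPHABET.index(ch) for ch in text]


words = ["\u00f6ut", "zebra", "out"]  # "öut", "zebra", "out"

german_sorted = sorted(words, key=german_key)
swedish_sorted = sorted(words, key=swedish_key)

print(german_sorted)   # "öut" lands next to "out"
print(swedish_sorted)  # "öut" lands after "zebra"
```

With no language tag on the strings, nothing in the bytes themselves says
which of the two keys a "character set" should pick.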

2) In light of the above, how do you sort an array of strings, assuming
they're not all in the same "character set"?

3) If the answer to (2) is "you must upgrade them all to UTF-8", then
that means that the sort order for an array might totally change when
you add one new member, right? If the answer is, "for a given pair,
when you compare them during sorting, only upgrade if their character
sets don't match", then you open the door to non-convergent sorting
(i.e., the sort might never finish).
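Both halves of (3) can be sketched in a few lines of Python. Everything
here is an assumption made for illustration: a string is modeled as a
(text, charset name) pair, the per-"character set" order is a German-like
key, "upgrading" is modeled as plain code point comparison, and the
sample strings and the "mac-roman" tag are arbitrary.

```python
# Hypothetical model: each string is a (text, charset_name) pair.
GERMAN_FOLD = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"}  # ä, ö, ü


def charset_key(text):
    """Assumed charset-defined order: umlauts sort with their base letter."""
    return ("".join(GERMAN_FOLD.get(ch, ch) for ch in text), text)


def sort_upgrade_all(strings):
    """Policy A: if the charsets are mixed, upgrade everything and
    sort by code point; otherwise use the charset's own order."""
    if len({charset for _, charset in strings}) <= 1:
        return sorted(strings, key=lambda s: charset_key(s[0]))
    return sorted(strings, key=lambda s: s[0])


def compare_pairwise(a, b):
    """Policy B: upgrade only when this particular pair's charsets differ."""
    if a[1] == b[1]:
        ka, kb = charset_key(a[0]), charset_key(b[0])
    else:
        ka, kb = a[0], b[0]  # Python compares str values by code point
    return (ka > kb) - (ka < kb)


x = ("\u00f6ut", "iso-8859-1")  # "öut"
y = ("zebra", "iso-8859-1")
z = ("zulu", "mac-roman")

# Policy A: adding z flips the relative order of x and y.
before = sort_upgrade_all([x, y])
after = sort_upgrade_all([x, y, z])

# Policy B: x < y, y < z, and z < x all hold at once.
cycle = (compare_pairwise(x, y), compare_pairwise(y, z), compare_pairwise(z, x))
print(before, after, cycle)
```

Under policy A the two ISO-8859-1 strings swap places once the third
string arrives. Under policy B the comparison is not transitive, so a
comparison sort has no consistent answer to converge on; whether it loops
or just returns an input-dependent order depends on the algorithm.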

My worry here is that if the semantics of the Latin Capital Letter A
("A"), for example (or pick any other character), are allowed to differ
between different "character sets", then we'll have problems for any
binary string operation.

JEff
