Catalan collation bug in perl or CLDR?

Tom Christiansen

Mar 6, 2011, 5:45:19 PM
to Karl Williamson, Perl5 Porters Mailing List
I believe there's a bug in Unicode::Collate::Locale's Catalan set-up. It
seems to have had too much copied into it from the es_traditional locale.

Here's ca.pl:

+{
backwards => 2,
entry => <<'ENTRY', # for DUCET v6.0.0
0063 0068 ; [.15D2.0020.0002.0063] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
0063 0048 ; [.15D2.0020.0007.0063][.0000.0000.0002.0000] # <LATIN SMALL LETTER C, LATIN CAPITAL LETTER H>
0043 0068 ; [.15D2.0020.0007.0043][.0000.0000.0008.0000] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
0043 0048 ; [.15D2.0020.0008.0043] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
006C 006C ; [.16C5.0020.0002.006C][.0000.0000.0001.0000] # <LATIN SMALL LETTER L, LATIN SMALL LETTER L>
006C 00B7 006C ; [.16C5.0020.0002.006C][.0000.0000.0007.0000] # <LATIN SMALL LETTER L, MIDDLE DOT, LATIN SMALL LETTER L>
006C 004C ; [.16C5.0020.0007.006C][.0000.0000.0002.0000][.0000.0000.0001.0000] # <LATIN SMALL LETTER L, LATIN CAPITAL LETTER L>
006C 00B7 004C ; [.16C5.0020.0007.006C][.0000.0000.0002.0000][.0000.0000.0007.0000] # <LATIN SMALL LETTER L, MIDDLE DOT, LATIN CAPITAL LETTER L>
004C 006C ; [.16C5.0020.0007.004C][.0000.0000.0008.0000][.0000.0000.0001.0000] # <LATIN CAPITAL LETTER L, LATIN SMALL LETTER L>
004C 00B7 006C ; [.16C5.0020.0007.004C][.0000.0000.0008.0000][.0000.0000.0007.0000] # <LATIN CAPITAL LETTER L, MIDDLE DOT, LATIN SMALL LETTER L>
004C 004C ; [.16C5.0020.0008.004C][.0000.0000.0001.0000] # <LATIN CAPITAL LETTER L, LATIN CAPITAL LETTER L>
004C 00B7 004C ; [.16C5.0020.0008.004C][.0000.0000.0007.0000] # <LATIN CAPITAL LETTER L, MIDDLE DOT, LATIN CAPITAL LETTER L>
ENTRY
};

And here's es_trad.pl:

+{
entry => <<'ENTRY', # for DUCET v6.0.0
0063 0068 ; [.15D2.0020.0002.0063] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
0043 0068 ; [.15D2.0020.0007.0043] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
0043 0048 ; [.15D2.0020.0008.0043] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
006C 006C ; [.16C5.0020.0002.006C] # <LATIN SMALL LETTER L, LATIN SMALL LETTER L>
004C 006C ; [.16C5.0020.0007.004C] # <LATIN CAPITAL LETTER L, LATIN SMALL LETTER L>
004C 004C ; [.16C5.0020.0008.004C] # <LATIN CAPITAL LETTER L, LATIN CAPITAL LETTER L>
00F1 ; [.1703.0020.0002.00F1] # LATIN SMALL LETTER N WITH TILDE
006E 0303 ; [.1703.0020.0002.00F1] # LATIN SMALL LETTER N WITH TILDE
00D1 ; [.1703.0020.0008.00D1] # LATIN CAPITAL LETTER N WITH TILDE
004E 0303 ; [.1703.0020.0008.00D1] # LATIN CAPITAL LETTER N WITH TILDE
ENTRY
};

However, my bilingual Castilian-Catalan dictionary (Pere Elies i Busqueta;
Barcelona, 1983) draws specific attention to how Catalan does *not* treat
"ll" and "ch" as separate letters for alphabetization the way Castilian
does/did. There are plenty of places where you can see that they are
following the more normal order; it's not like they don't understand this,
because the Castilian entries follow the other order.

So is this a Perl module bug, or is it really a CLDR bug?
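
One way to see what an installed copy of the module actually does is to sort a few strings with and without the Catalan tailoring. This is only a sketch: the sample strings are made up for illustration (they are not from the dictionary), and the result will depend on which version of Unicode::Collate is installed.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Illustrative strings only: each group has one "ch" or "ll" word
# that a digraph-as-letter tailoring would move after its neighbours.
my @words = qw(luz llama lona cuna chico cosa);

# Plain DUCET ordering, with no tailoring at all.
my @root_sorted = Unicode::Collate->new->sort(@words);

# Whatever tailoring the installed ca.pl supplies.
my $ca        = Unicode::Collate::Locale->new(locale => 'ca');
my @ca_sorted = $ca->sort(@words);

print "locale used: ", $ca->getlocale, "\n";
print "root: @root_sorted\n";
print "ca:   @ca_sorted\n";
print "ca ",
    ("@ca_sorted" eq "@root_sorted" ? "matches" : "differs from"),
    " the untailored order\n";
```

If "ch" and "ll" are being treated as separate letters, the ca line should put chico after cuna and llama after luz; if not, the two lines should agree.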

Also, I can find no support for the assertion (the "backwards => 2" in
ca.pl) that Catalan uses French-style backwards ordering of accents at
collation level 2.
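
For reference, here is a small sketch of what the backwards parameter changes. It uses plain Unicode::Collate rather than the Catalan locale, and it uses the usual French demonstration words, since those are the textbook minimal pairs; nothing here says anything about Catalan practice.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# French words, chosen only because they differ solely in where
# the accents fall.
my @words = ('côté', 'cote', 'coté', 'côte');

# Default: level-2 (accent) differences are compared left to right.
my @forward = Unicode::Collate->new->sort(@words);

# backwards => 2: level-2 differences are compared right to left,
# so an accent near the end of the word outweighs an earlier one.
my @backward = Unicode::Collate->new(backwards => 2)->sort(@words);

print "forward:  @forward\n";
print "backward: @backward\n";
```

The two middle words swap places between the two runs, which is exactly the kind of pair one would need to find in Catalan to tell the two behaviours apart.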

Catalan words *can* have grave or acute accent marks (eg: còdex,
místic), diaereses (eg: genuïnament, oïble), cedillas (eg: jovença,
providença), or middle dots (eg: col·lapse, imbecil·litat).

You can't have two stress marks on the same word, and stress marks are all
the two accents are. So I haven't been able to find any words with more than
one of the two accents or the diaeresis, let alone minimal pairs to contrast.

And although there are words with the middle dot or the cedilla plus
one of the accents (eg: il·lícit, col·lecció), here again I can find
no minimal pairs to allow me to see which way the algorithm runs.
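
A search for such pairs could be mechanized roughly as follows: group words by their level-1 sort key, so that words differing only in accents, case or ignorable punctuation land in the same bucket. This is a sketch under assumptions: the inline word list is a placeholder (a real search would read a full Catalan word list from a file), and the pairs it happens to contain carry only one accent each, so they do not settle the backwards question.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Placeholder list; substitute a real word list here.
my @words = qw(mes més dona dóna còdex místic col·lapse il·lícit);

# At level 1 only base letters count, so accents are ignored.
my $primary = Unicode::Collate->new(level => 1);

# Bucket the words by primary sort key.
my %group;
push @{ $group{ $primary->getSortKey($_) } }, $_ for @words;

# Keep buckets with more than one word: these are the candidates.
my @candidates = sort { $a->[0] cmp $b->[0] }
                 grep { @$_ > 1 } values %group;

print join(' / ', @$_), "\n" for @candidates;
```

The interesting buckets for this question would be ones whose members carry marks at two different positions in the word.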

Plus I doubt they would count the middle dot the same way (see the ENTRY),
nor even the cedilla, since they (sometimes) consider c and ç different
letters altogether.
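
Whether a given collator counts c and ç (or l·l and ll) as the same letter can be checked by comparing at level 1 only. The sketch below does that for the untailored collator and for the installed ca locale; what the ca line prints depends on the installed ca.pl, so no particular result is claimed for it.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# level => 1 restricts comparison to primary (base letter) weights.
my %collator = (
    root => Unicode::Collate->new(level => 1),
    ca   => Unicode::Collate::Locale->new(locale => 'ca', level => 1),
);

for my $name (sort keys %collator) {
    my $c = $collator{$name};
    printf "%-4s c vs ç:    %s\n", $name,
        $c->eq('c', 'ç') ? 'same at level 1' : 'different at level 1';
    printf "%-4s ll vs l·l: %s\n", $name,
        $c->eq('ll', 'l·l') ? 'same at level 1' : 'different at level 1';
}
```

In the untailored table ç is just c plus a cedilla, so the root collator reports them as the same at level 1; a tailoring that made ç its own letter would report them as different.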

--tom
