Unicode under Mac OS

al...@alfarrabio.di.uminho.pt

May 26, 2009, 4:18:59 PM

to Perl 5 Porters

Hello

My perl 5.10 (check below for full conf settings) is not working
properly with unicode:

------------------------------
#!/usr/bin/perl

use utf8;
use POSIX qw(locale_h);
setlocale(LC_COLLATE, "pt_PT.UTF-8");
setlocale(LC_CTYPE, "pt_PT.UTF-8");
use locale;
binmode(STDOUT, ":utf8");

@a = qw.ý a é í o ú ã é z y.;
print join("|",sort @a),"\n";
---------------------------------

prints a|o|y|z|ã|é|é|í|ú|ý

But under linux (perl 5.10 as well)
prints a|ã|é|é|í|o|ú|y|ý|z

Any hint?

[ambs@rachmaninoff ProjectoDicionario]$ locale -a |grep pt_PT
pt_PT
pt_PT.ISO8859-1
pt_PT.ISO8859-15
pt_PT.UTF-8

[ambs@rachmaninoff ProjectoDicionario]$ perl -V
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
Platform:
osname=darwin, osvers=9.6.0, archname=darwin-2level
uname='darwin rachmaninoff.local 9.6.0 darwin kernel version 9.6.0:
mon nov 24 17:37:00 pst 2008; root:xnu-1228.9.59~1release_i386 i386 '
config_args=''
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp
-fno-strict-aliasing -pipe -I/usr/local/include -I/opt/local/include',
optimize='-O3',
cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp
-fno-strict-aliasing -pipe -I/usr/local/include -I/opt/local/include'
ccversion='', gccversion='4.0.1 (Apple Inc. build 5465)',
gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags ='
-L/usr/local/lib -L/opt/local/lib'
libpth=/usr/local/lib /opt/local/lib /usr/lib
libs=-ldbm -ldl -lm -lutil -lc
perllibs=-ldl -lm -lutil -lc
libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup
-L/usr/local/lib -L/opt/local/lib'

Characteristics of this binary (from libperl):
Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
USE_LARGE_FILES USE_PERLIO
Built under darwin
Compiled at Jan 25 2009 17:37:36
@INC:
/opt/local/lib/perl5/5.10.0/darwin-2level
/opt/local/lib/perl5/5.10.0
/opt/local/lib/perl5/site_perl/5.10.0/darwin-2level
/opt/local/lib/perl5/site_perl/5.10.0
.

--
Alberto Simões - Departamento de Informática - Universidade do Minho
Campus de Gualtar - 4710-057 Braga - Portugal

John

May 27, 2009, 3:07:30 AM

to al...@alfarrabio.di.uminho.pt, Perl 5 Porters

Alberto Simões wrote:
> Hello
>
> My perl 5.10 (check below for full conf settings) is not working
> properly with unicode:
>
> ------------------------------
> #!/usr/bin/perl
>
> use utf8;
> use POSIX qw(locale_h);
> setlocale(LC_COLLATE, "pt_PT.UTF-8");
> setlocale(LC_CTYPE, "pt_PT.UTF-8");
> use locale;
> binmode(STDOUT, ":utf8");
>

> @a = qw.ý a é í o ú ã é z y.;

> print join("|",sort @a),"\n";
> ---------------------------------
>

> prints a|o|y|z|ã|é|é|í|ú|ý

>
> But under linux (perl 5.10 as well)

> prints a|ã|é|é|í|o|ú|y|ý|z
>
Does the Unix sort program give the same order as your code when
LC_COLLATE and LC_CTYPE are set in the environment of both systems? If
so then I suggest this is a difference in the locale data on both systems.

You could also check http://www.unicode.org/cldr for the current
accepted collation order.

See also http://github.com/ThePilgrim/perlcldr/tree/master where I am
writing code to get the entirety of the Unicode Common Locale Data
Repository into Perl.

John


al...@alfarrabio.di.uminho.pt

May 27, 2009, 7:25:00 AM

to John, Perl 5 Porters

Hello.

John wrote:

> Alberto Simões wrote:
>> Hello
>>
>> My perl 5.10 (check below for full conf settings) is not working
>> properly with unicode:
>>
>> ------------------------------
>> #!/usr/bin/perl
>>
>> use utf8;
>> use POSIX qw(locale_h);
>> setlocale(LC_COLLATE, "pt_PT.UTF-8");
>> setlocale(LC_CTYPE, "pt_PT.UTF-8");
>> use locale;
>> binmode(STDOUT, ":utf8");
>>

>> @a = qw.ý a é í o ú ã é z y.;
>> print join("|",sort @a),"\n";
>> ---------------------------------
>>
>> prints a|o|y|z|ã|é|é|í|ú|ý
>>
>> But under linux (perl 5.10 as well)
>> prints a|ã|é|é|í|o|ú|y|ý|z

>>
> Does the Unix sort program give the same order as your code when
> LC_COLLATE and LC_CTYPE are set in the environment of both systems? If
> so then I suggest this is a difference in the locale data on both systems.

In fact it seems that collation on the Mac is not correct, as I can't
get Unix sort to work properly. If I export LC_(anything) with any
value (for instance, fish), the order is exactly the same, without any
complaint from sort.

I need to go back to Linux, I see :(

Cheers
Albert

John

May 27, 2009, 1:43:42 PM

to al...@alfarrabio.di.uminho.pt, Perl 5 Porters

> In fact it seems that collation on the Mac is not correct, as I can't
> get Unix sort to work properly. If I export LC_(anything) with any
> value (for instance, fish), the order is exactly the same, without any
> complaint from sort.
>
> I need to go back to Linux, I see :(
>
> Cheers
> Albert
>

Well, Perl may be good, but it can't, yet, fix a buggy locale.

That's what the project at
http://github.com/ThePilgrim/perlcldr/tree/master will eventually solve. :-)

Tom Christiansen

May 27, 2009, 10:21:53 PM

to al...@alfarrabio.di.uminho.pt, Perl 5 Porters

SUMMARY:

* When you're using Unicode data, try Unicode sorting.
* Don't trust locales all that much. They often suck.

Alberto Simões, do Departamento de Informática,
Universidade do Minho Campus de Gualtar, Braga
Portugal, wrote:

> My perl 5.10 (check below for full conf
> settings) is not working properly with unicode:

Happens to the best of us. :-)

> #!/usr/bin/perl
>
> use utf8;
> use POSIX qw(locale_h);
> setlocale(LC_COLLATE, "pt_PT.UTF-8");
> setlocale(LC_CTYPE, "pt_PT.UTF-8");
> use locale;
> binmode(STDOUT, ":utf8");
>
> @a = qw.ý a é í o ú ã é z y.;
> print join("|",sort @a),"\n";
> ---------------------------------
> prints a|o|y|z|ã|é|é|í|ú|ý
>
> But under linux (perl 5.10 as well)
> prints a|ã|é|é|í|o|ú|y|ý|z
>
> Any hint?
>
> [ambs@rachmaninoff ProjectoDicionario]$ locale -a |grep pt_PT
> pt_PT
> pt_PT.ISO8859-1
> pt_PT.ISO8859-15
> pt_PT.UTF-8

Many--I hope.

First off, this may or may not be a problem, but your program
came across the wire with ISO8859-1 literals, and was marked as
ISO-8859-1, but was saying internally that it was in UTF-8.
Something may have been lost in translation, because otherwise
when run, it produces nonsense about malformed UTF-8.

BTW, if you but use the pt_PT.ISO8859-1 locale on the Mac
(since you seem to have just 8859-1 data), it turns out to
work perfectly well, at least with the data you provided:

use POSIX qw(locale_h);
setlocale(LC_COLLATE, "pt_PT.ISO8859-1");
setlocale(LC_CTYPE, "pt_PT.ISO8859-1");
use locale;
binmode(STDOUT, ":encoding(ISO8859-1)");

@a = qw.ý a é í o ú ã é z y.;

print join("|",sort @a),"\n";

This now prints a|ã|é|é|í|o|ú|y|ý|z for me on the Mac.

Secondly...

Alas! Saying C< use utf8 > is neither necessary
nor sufficient to guarantee that you actually
have utf8 characters--or semantics. Similarly,
so too with setting LC_COLLATE: maybe not enough.
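
A minimal sketch of the "not sufficient" half, assuming the data arrives as
raw UTF-8 bytes (for example from a handle with no encoding layer). The
helper name decode_utf8_bytes is made up for illustration; Encode::decode
is the core-module call doing the work. Until the bytes are decoded, Perl
sees two elements where a reader sees one letter:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw UTF-8 bytes into Perl characters (helper name is hypothetical).
sub decode_utf8_bytes {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# The two UTF-8 bytes for "é", as read from a handle with no encoding layer.
my $bytes = "\xC3\xA9";
my $chars = decode_utf8_bytes($bytes);

printf "before decoding: length %d\n", length $bytes;   # 2: bytes
printf "after decoding:  length %d\n", length $chars;   # 1: a character
```

Any sort, locale-based or not, that is handed the undecoded form is
comparing bytes rather than letters.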

¡¡ So very sorry !!! :-{

I'd also be a bit more comfortable seeing
something more along these lines:

#!/usr/bin/env perl5.10.0

use 5.10.0;
use strict;
use warnings;
use encoding "latin1";
use POSIX qw[ :locale_h ];

our $LOC_PT;
BEGIN {
    $LOC_PT = "pt_PT.ISO8859-1";
    my $retstr;
    if ($retstr = setlocale(LC_COLLATE, $LOC_PT)) {
        # say "setlocale LC_COLLATE to $LOC_PT returned $retstr";
    } else {
        die "can't setlocale LC_COLLATE to $LOC_PT: $!"
    }
    if ($retstr = setlocale(LC_CTYPE, $LOC_PT)) {
        # say "setlocale LC_CTYPE to $LOC_PT returned $retstr";
    } else {
        die "can't setlocale LC_CTYPE to $LOC_PT: $!"
    }
}

use locale;

my @letras = split(/\s+/, "\x{FD} a \x{E9} \x{ED} o \x{FA} \x{E3} \x{E9} z y");
printf "[ %s ] sort to [ %s ]\n",
    join(" " => @letras),
    join(" " => sort @letras);

# now show that it works externally, too

$ENV{LC_CTYPE} = $ENV{LC_COLLATE} = $LOC_PT;

open (SORTER, "| sort") || die "can't open pipe to sort: $!";
binmode(SORTER, ":encoding(latin1)")
    || die "can't binmode to :encoding(latin1): $!";
for (@letras) { say SORTER }
close(SORTER) || die "can't close pipe to sort: $!";

However, you are in a very real sense correct that
the PT UTF-8 locale under Leopard seems "broken".

It may be even worse than you thought, though.

Witness:

Mac% locate pt_PT.UTF-8
/usr/share/locale/pt_PT.UTF-8
/usr/share/locale/pt_PT.UTF-8/LC_COLLATE
/usr/share/locale/pt_PT.UTF-8/LC_CTYPE
/usr/share/locale/pt_PT.UTF-8/LC_MESSAGES
/usr/share/locale/pt_PT.UTF-8/LC_MESSAGES/LC_MESSAGES
/usr/share/locale/pt_PT.UTF-8/LC_MONETARY
/usr/share/locale/pt_PT.UTF-8/LC_NUMERIC
/usr/share/locale/pt_PT.UTF-8/LC_TIME

Now the tragic part:

Mac% ls -l /usr/share/locale/pt_PT.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Nov 7 2008 /usr/share/locale/pt_PT.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

Which is pretty (sub-)par for the course, I'm afraid.
Look how many are screwed up that same way:

Mac% find /usr/share/locale -name LC_COLLATE -ls | wc -l
206

Mac% find /usr/share/locale -name LC_COLLATE -ls | grep -c US-ASCII
122
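
The same survey can be done from Perl. This is only a sketch: the function
name count_ascii_collations is invented here, and unlike the grep above it
counts only symlinks whose *target* mentions US-ASCII, so the numbers may
differ slightly from the shell pipeline.

```perl
use strict;
use warnings;
use File::Find;

# Count LC_COLLATE entries under $root, and how many of them are symlinks
# whose target mentions US-ASCII. Returns (total, ascii_links).
sub count_ascii_collations {
    my ($root) = @_;
    my ($total, $ascii) = (0, 0);
    find(sub {
        return unless $_ eq 'LC_COLLATE';
        $total++;
        my $target = readlink $_;          # undef unless it is a symlink
        $ascii++ if defined $target && $target =~ /US-ASCII/;
    }, $root);
    return ($total, $ascii);
}

my $root = '/usr/share/locale';            # path taken from the listing above
if (-d $root) {
    my ($total, $ascii) = count_ascii_collations($root);
    print "$ascii of $total LC_COLLATE entries point at a US-ASCII table\n";
}
```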

I would never trust a vendor's LC_COLLATE. *Which* sort
of collation are they speaking of? Just for one thing,
you don't usually use the same for a dictionary as you
do for a phone book. Think about "sort -df" for example.

I HAVE GOOD NEWS FOR YOU: if you just want to do something
exceedingly simple, then there's a pretty easy way to get
that done. This works:

#!/usr/bin/env perl5.10.0

use 5.10.0;
use strict;
use warnings;

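use utf8;                               # the literals below are UTF-8
use Unicode::Collate;

binmode(STDOUT, ":encoding(UTF-8)");    # print characters as UTF-8

my @a = qw[ ý a é í o ú ã é z y ];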
my @sa = Unicode::Collate->new->sort(@a);

printf "[ %s ] sorts to [ %s ]\n",
join(" " => @a),
join(" " => @sa);

Here's a larger suite/demo where I create a PT_sorter
object and repeatedly use it.

#!/usr/bin/env perl5.10.0

use 5.10.0;
use strict;
use warnings;
use utf8;                               # source literals below are UTF-8
binmode(STDOUT, ":encoding(UTF-8)");    # replaces: use encoding "latin1", STDOUT => "utf8"

use List::Util qw[ shuffle ];

use Unicode::Collate;

my $DEBUG = 0;

my $PT_sorter = Unicode::Collate::->new();

$| = 1;

srand(42);

my @tests = (
[qw[ aba abá abacá abaçai abacalhoar abaçanar abacate ]],

[qw[ abominável abomínio abominoso abonação abonado ]],

[qw[ sequela sequencial sequências seqüências ]],

[qw[ avo avó avô avoação avoaçar avoaçasse avoaçásseis
avoaçassem avoaçássemos avondo avós ]],

[qw[ eferente eferescer efesíaco efésio éfetas
eficácia eficaz eficiência eficiente ]],

[qw[ economia económico econômico econômicos economismo ]],

[qw[ falamos falámos falar falara falará falaram
faláramos falarem falaremos falaríamos falárica
falarmos falaz falem falemos falhado falo falou ]],

[qw[ faça façanha facão facção facções facha
facílimo facneia faço faz fazeis fazem fazer
fazermonos fazermos fazes fazível ]],
);

my $testno = 0;

for ($testno = 0; $testno < @tests; $testno++) {

    print "test $testno... ";
    my @set = @{ $tests[$testno] };
    my %order;

    # remember the expected position of each word
    for (my $i = 0; $i < @set; $i++) {
        $order{ $set[$i] } = $i;
    }

    # reshuffle until the order actually differs from the expected one
    my @shuffled = ();
    my $try = 0;
    do {
        @shuffled = shuffle(@set);
        print "\b. " if "@shuffled" eq "@set";
    } until ("@shuffled" ne "@set" || ++$try > 20);

    # test the outcome itself rather than the counter, so a shuffle that
    # succeeds on the last attempt is not reported as a failure
    if ("@shuffled" eq "@set") {
        die "couldn't shuffle @set after 20 tries";
    }

    my @sorted = $PT_sorter->sort(@shuffled);

    print "\n\tsorted:\t@shuffled\n\tinto:\t@sorted\n" if $DEBUG;

    # map each sorted word back to its expected position
    my @sindices = ();
    for (@sorted) { push @sindices, $order{$_} };

    my $idx_wanted = join(" " => 0 .. $#sindices);
    my $idx_gotten = join(" " => @sindices);

    if ($idx_wanted eq $idx_gotten) {
        print "ok\n";
    } else {
        print <<"EO_OOPS";
not ok:
\twanted\t$idx_wanted
\t\t(@set)
\tbut got\t$idx_gotten
\t\t(@sorted)
EO_OOPS
    }

}

When run without debuggery enabled, produces:

test 0... ok
test 1... ok
test 2... ok
test 3... ok
test 4... ok
test 5... ok
test 6... ok
test 7... ok

or with debugging, this:

test 0...
sorted: abacate abaçai abacalhoar abá aba abacá abaçanar
into: aba abá abacá abaçai abacalhoar abaçanar abacate
ok
test 1...
sorted: abominável abonação abonado abomínio abominoso
into: abominável abomínio abominoso abonação abonado
ok
test 2...
sorted: sequela sequências seqüências sequencial
into: sequela sequencial sequências seqüências
ok
test 3...
sorted: av�s avoa�ar avoa��ssemos av� av� avoa��sseis avoa��o avondo avoa�asse avoa�assem avo
into: avo avó avô avoação avoaçar avoaçasse avoaçásseis avoaçassem avoaçássemos avondo avós
ok
test 4...
sorted: éfetas efesíaco efésio eficiente eferescer eficácia eficiência eficaz eferente
into: eferente eferescer efesíaco efésio éfetas eficácia eficaz eficiência eficiente
ok
test 5...
sorted: econ�mico econ�mico economismo economia econ�micos
into: economia económico econômico econômicos economismo
ok
test 6...
sorted: falaríamos falará falou falem falo falhado falamos falaremos falarem falaram faláramos falámos falar falarmos falaz falara falemos falárica
into: falamos falámos falar falara falará falaram faláramos falarem falaremos falaríamos falárica falarmos falaz falem falemos falhado falo falou
ok
test 7...
sorted: facha fazermonos facções fazeis fazermos fazer façanha fazes fazem faz faço facílimo faça facneia fazível facão facção
into: faça façanha facão facção facções facha facílimo facneia faço faz fazeis fazem fazer fazermonos fazermos fazes fazível
ok

However, consider mixed data where sometimes you have
mixed canonical and non-canonical forms, like this:

"economia",
"econ\N{o_acute}mica",
"econo\N{CB_circ}micas",
"econo\N{CB_acute}mico",
"econ\N{o_circ}micos",
"economismo"

where those are named character aliases enabled via:

use charnames qw[ :full :alias ] => {

o_acute => "LATIN SMALL LETTER O WITH ACUTE",
o_circ => "LATIN SMALL LETTER O WITH CIRCUMFLEX",

CB_circ => "COMBINING CIRCUMFLEX ACCENT",
CB_acute => "COMBINING ACUTE ACCENT",

};

You still want that to show up looking like:

economia económica econômicas económico econômicos economismo

Well, it will just magically work right. Yah!
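
To see why, here is a small sketch comparing a precomposed letter with its
base-plus-combining-mark spelling. The helper name same_collation is
hypothetical; eq is the comparison method Unicode::Collate provides. With
the module's default settings the two spellings should collate as equal
even though the strings differ code point by code point:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# True when two strings collate identically (helper name is hypothetical).
sub same_collation {
    my ($x, $y) = @_;
    return $collator->eq($x, $y);
}

my $precomposed = "econ\x{F3}mico";     # U+00F3 LATIN SMALL LETTER O WITH ACUTE
my $combining   = "econo\x{301}mico";   # "o" followed by U+0301 COMBINING ACUTE ACCENT

print "code points differ\n" if $precomposed ne $combining;
print "collation is equal\n" if same_collation($precomposed, $combining);
```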

You are *so* lucky you're dealing with Portuguese: you get off easy!

It orders (as of 2009) just as English does, with the 26 letters A..Z,
and it disregards diacriticals unless there's a tie, in which
case the unadorned letter precedes the one with a marking, in a
normal left to right fashion.
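
A rough illustration of that tie-breaking, assuming only what
Unicode::Collate documents: level => 1 restricts comparison to base
letters, while the default collator also looks at accents and case. The
helper name tie_broken_by_marks is invented for this sketch.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $base_letters = Unicode::Collate->new(level => 1);  # primary weights only
my $all_levels   = Unicode::Collate->new;              # accents and case count too

# True when two words tie on their base letters and only a lower level
# (accent or case) separates them. Helper name is hypothetical.
sub tie_broken_by_marks {
    my ($x, $y) = @_;
    return $base_letters->eq($x, $y) && $all_levels->ne($x, $y);
}

my ($plain, $accented) = ("avo", "av\x{F3}");          # \x{F3} is ó
print "base letters tie\n"          if tie_broken_by_marks($plain, $accented);
print "unadorned form goes first\n" if $all_levels->lt($plain, $accented);
```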

While Spanish also disregards accents except for ties, it doesn't
count the tilde over the N as an accent--it's a whole new letter.

In Portuguese, as I know you know but other readers may not,
the til(de) over an A or an O is but a diacritic ("accent
mark") for stress and nasalization, so doesn't count as
anything special there.

But not so everywhere in Iberia! In Castilian and Galician,
the letter ñ falls after N and before O, making *this* the
proper ordering of these words in Spanish:

radio ráfaga rana ranúnculo raña rápido rastrillo

You therefore must create your sorter this way:

$ES_sorter = Unicode::Collate->new(entry => <<'END_SPANISH_ENTRY');
00F1 ; [.112B.0020.0002.00F1] # n-tilde
006E 0303 ; [.112B.0020.0002.00F1] # n-tilde
00D1 ; [.112B.0020.0008.00D1] # N-tilde
004E 0303 ; [.112B.0020.0008.00D1] # N-tilde
END_SPANISH_ENTRY
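
The hex weights in a hand-written entry are tied to the collation table
the module ships with, so they may need adjusting on other versions. As an
alternative (not the approach used above), later Unicode::Collate releases
bundle Unicode::Collate::Locale with ready-made tailorings. A sketch,
assuming that module is installed; the helper name sort_for_locale is
hypothetical:

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Sort a list under the tailoring for the given locale
# (helper name is hypothetical).
sub sort_for_locale {
    my ($locale, @words) = @_;
    return Unicode::Collate::Locale->new(locale => $locale)->sort(@words);
}

# \x{F1} = ñ, \x{E1} = á, \x{FA} = ú
my @palabras = ("rastrillo", "ra\x{F1}a", "r\x{E1}pido", "rana",
                "ran\x{FA}nculo", "r\x{E1}faga", "radio");

binmode(STDOUT, ":encoding(UTF-8)");
print join(" ", sort_for_locale("es", @palabras)), "\n";
```

With the "es" tailoring, raña should land after ranúnculo and before
rápido, matching the ordering shown earlier.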

Again you luck out working in Portuguese. When you see
a ç in either Portuguese or French, well, it's just a C
with a diacritical.

No big deal.

But in Catalan, it makes for a whole new letter, one coming
after C but before D. This leads to a Catalan sort object
declared like so:

$CA_sorter = Unicode::Collate->new(entry => <<'END_CATALAN_ENTRY');
00E7 ; [.0FFC.0020.0002.0063] # c-cedilla
0063 0327 ; [.0FFC.0020.0002.0063] # c-cedilla
00C7 ; [.0FFC.0020.0002.0043] # C-cedilla
0043 0327 ; [.0FFC.0020.0002.0043] # C-cedilla
END_CATALAN_ENTRY

Similarly in Spanish aka Castilian, prior to 1997 the standard
said that CH was its own letter of the alphabet (named "che")
falling after C and before D. That means "chocolate" comes
*AFTER* "color" in dictionaries before 1997, but before it in
those published later. What fun!

Also until that year of orthographic reform, they had
historically always decreed that LL was its own letter, one
falling after L and before M. Many people would get confused
whether to write "LLave" vs "Llave". The second was and is
right, but the first was seen disturbingly often.
Still is sometimes -- "Next Exit to LLérida", or whatever.

So you'd have to create your sorter this way:

$ES_trad_sorter = Unicode::Collate->new(entry => <<'TRAD_SPANISH_ENTRY');
0063 0068 ; [.1000.0020.0002.0063] # ch
0043 0068 ; [.1000.0020.0007.0043] # Ch
0043 0048 ; [.1000.0020.0008.0043] # CH
006C 006C ; [.10F5.0020.0002.006C] # ll
004C 006C ; [.10F5.0020.0007.004C] # Ll
004C 004C ; [.10F5.0020.0008.004C] # LL
00F1 ; [.112B.0020.0002.00F1] # n-tilde
006E 0303 ; [.112B.0020.0002.00F1] # n-tilde
00D1 ; [.112B.0020.0008.00D1] # N-tilde
004E 0303 ; [.112B.0020.0008.00D1] # N-tilde
TRAD_SPANISH_ENTRY
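
For comparison, a sketch using the bundled tailorings instead of
hand-written weights. It assumes a Unicode::Collate::Locale recent enough
to know the locale names "es" and "es__traditional"; the helper name
sort_spanish is made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Sort words under a named Spanish tailoring (helper name is hypothetical).
sub sort_spanish {
    my ($locale, @words) = @_;
    return Unicode::Collate::Locale->new(locale => $locale)->sort(@words);
}

my @words = qw(color chocolate llave luz dama);

# Modern rules treat "ch" and "ll" as plain letter pairs;
# the traditional tailoring treats each as a letter of its own.
print "modern:      @{[ sort_spanish('es', @words) ]}\n";
print "traditional: @{[ sort_spanish('es__traditional', @words) ]}\n";
```

Under the traditional rules "chocolate" should follow "color" and "llave"
should follow "luz"; under the modern ones both come first.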

In French, the accent marks are disregarded save for
tie-breaking, just like in Portuguese -- EXCEPT that
instead of going left-to-right as I'm pretty sure you
do in Portuguese, in French (but which French? :-), it
appears that they resolve ties by going right to left!

# Level 2 (diacrits) tie-breakers must be
# weighted by reverse order here:

$FR_sorter = Unicode::Collate->new(backwards => 2);

Can you believe it? That means that, for example, using
made-up words:

WRONG: bebe bebé bébe bébé
RIGHT: bebe bébe bebé bébé
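
A quick way to check both orderings side by side. This is just a sketch
around the backwards => 2 option shown above; the two helper names are
hypothetical.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Accent ties resolved left to right (the default).
sub sort_accents_forward  { return Unicode::Collate->new->sort(@_) }

# Accent ties resolved right to left, as in the French rule.
sub sort_accents_backward { return Unicode::Collate->new(backwards => 2)->sort(@_) }

# \x{E9} is é
my @mots = ("b\x{E9}b\x{E9}", "beb\x{E9}", "b\x{E9}be", "bebe");

binmode(STDOUT, ":encoding(UTF-8)");
print "forward:  @{[ sort_accents_forward(@mots) ]}\n";
print "backward: @{[ sort_accents_backward(@mots) ]}\n";
```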

Then there's what to do about non-letters, like hyphens or
apostrophes. Are they part of the word? `sort -df` doesn't
think so, although it DOES count spaces. Most dictionaries
I use do not seem to, though.

That makes for sequences in PT like these:

avo avó avô avondo à-vontade avós

and

faca faça facalhão façalvo faca-marcador facaneia façanha
facão faca-sola facção facções facha facílimo faço fac-símile
faz fazeis fazê-lo fazem fazer fazermonos fazermos fazes
fazível faz-tudo
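
Whether the hyphen counts is controlled in Unicode::Collate by the
variable option. A sketch contrasting two of its documented values; the
helper names are invented for the example, and the three words are taken
from the list above.

```perl
use strict;
use warnings;
use Unicode::Collate;

# 'shifted': punctuation such as "-" is ignored at the letter level.
sub sort_ignoring_hyphens {
    return Unicode::Collate->new(variable => 'shifted')->sort(@_);
}

# 'non-ignorable': punctuation is weighed like any other character.
sub sort_counting_hyphens {
    return Unicode::Collate->new(variable => 'non-ignorable')->sort(@_);
}

# \x{E7} is ç, \x{E3} is ã
my @palavras = ("fac\x{E7}\x{E3}o", "faca-sola", "fac\x{E3}o");

binmode(STDOUT, ":encoding(UTF-8)");
print "hyphen ignored: @{[ sort_ignoring_hyphens(@palavras) ]}\n";
print "hyphen counted: @{[ sort_counting_hyphens(@palavras) ]}\n";
```

With hyphens ignored, faca-sola sorts as if it were "facasola" and so
follows facão; with hyphens counted, it moves ahead of facão.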

What about case? Upper first, or lower? Or the same?

If you're sorting book-titles or place-names, shouldn't you
disregard a leading article? That is, strip off what would be
"The", "A", and "An" if it were in English?

But in Portuguese, you may wish to strip the article
contractions, too, I imagine.

I once needed to sort a bunch of "Spanish" city names (that is:
Castilian and Galician and Catalan toponyms), so wound up
cobbling together a sorter ("cobbling" in that it was not quite
right for Galician, but was better at handling the many Catalan
names, since it counts ç as its own letter) like so:

$Pueblo_Sorter = Unicode::Collate->new( entry => <<'END_ENTRY',
0063 0068 ; [.1000.0020.0002.0063] # ch
0043 0068 ; [.1000.0020.0007.0043] # Ch
0043 0048 ; [.1000.0020.0008.0043] # CH
006C 006C ; [.10F5.0020.0002.006C] # ll
004C 006C ; [.10F5.0020.0007.004C] # Ll
004C 004C ; [.10F5.0020.0008.004C] # LL
00E7 ; [.0FFC.0020.0002.0063] # c-cedilla
0063 0327 ; [.0FFC.0020.0002.0063] # c-cedilla
00C7 ; [.0FFC.0020.0002.0043] # C-cedilla
0043 0327 ; [.0FFC.0020.0002.0043] # C-cedilla
00F1 ; [.112B.0020.0002.00F1] # n-tilde
006E 0303 ; [.112B.0020.0002.00F1] # n-tilde
00D1 ; [.112B.0020.0008.00D1] # N-tilde
004E 0303 ; [.112B.0020.0008.00D1] # N-tilde
END_ENTRY

upper_before_lower => 1,

normalization => "NFKD",

preprocess => sub { # strip leading articles
local $_ = shift;  # "my $_" worked on 5.10 but was removed in Perl 5.24

s/^L'//; # Catalan

s{ ^ # remove leading articles etc
(?:

# Castilian
El
| Los
| La
| Las

# Catalan
| Els
| Les
| Sa
| Es

# Galego
| O
| Os
| A
| As
)
\s+
}{}x;

# strip various internal, low-importance particles

s/\b[dl]'//; # Catalan

s{
\b
(?:
el | los | la | las | de | del | y # ES
| els | les | i | sa | es | dels # CA
| o | os | a | as | do | da | dos | das # GAL
)
\b
}{}gx;

return $_;
},
) || die ...

Fun, eh?! :-)

When you've mixed data from three different languages,
operating under three conflicting rule schemes,
something just has to give. Oh well.

It actually worked out well for me here though.

What's the lesson in all this?

#1: If you're using Unicode data, use Unicode sorting.
#2: Don't trust locales so much. (Well, *I* don't.)

I actually don't know that LC_COLLATE can *ever* do what
UTS #10 (the Unicode collating standard) requires, but
I've never had as much luck with it as I've had with
the slowishly multipass but Correct-As-You-Can-Code-It
full-blown collator approach outlined above.
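
If speed matters when sorting records rather than bare strings, one
option is to build each collation key once with getSortKey and compare
the keys with plain cmp. A sketch only: the function name sort_records_by
and the record fields are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# Sort hash records by one field, building each collation key only once.
# Function and field names are hypothetical.
sub sort_records_by {
    my ($field, @records) = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ $collator->getSortKey($_->{$field}), $_ ] } @records;
}

my @entries = (
    { palavra => "avondo",   n => 1 },
    { palavra => "av\x{F3}", n => 2 },    # avó
    { palavra => "avo",      n => 3 },
);

# Prints the record numbers in collated order: 3 2 1
print join(" ", map { $_->{n} } sort_records_by(palavra => @entries)), "\n";
```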

Good luck--hope this helps give some ideas.

--tom

--

##########################################
# some useful(?) PT charname aliasings #
##########################################

use charnames qw[ :full latin :alias ] => {

CB_acute => "COMBINING ACUTE ACCENT",
CB_circ => "COMBINING CIRCUMFLEX ACCENT",
CB_tilde => "COMBINING TILDE",
CB_grave => "COMBINING GRAVE ACCENT",
CB_cedil => "COMBINING CEDILLA",
CB_trema => "COMBINING DIAERESIS",

A_acute => "LATIN CAPITAL LETTER A WITH ACUTE",
a_acute => "LATIN SMALL LETTER A WITH ACUTE",
A_grave => "LATIN CAPITAL LETTER A WITH GRAVE",
a_grave => "LATIN SMALL LETTER A WITH GRAVE",
A_tilde => "LATIN CAPITAL LETTER A WITH TILDE",
a_tilde => "LATIN SMALL LETTER A WITH TILDE",

E_acute => "LATIN CAPITAL LETTER E WITH ACUTE",
E_open => "LATIN CAPITAL LETTER E WITH ACUTE",
e_acute => "LATIN SMALL LETTER E WITH ACUTE",
e_open => "LATIN SMALL LETTER E WITH ACUTE",
E_circ => "LATIN CAPITAL LETTER E WITH CIRCUMFLEX",
E_closed => "LATIN CAPITAL LETTER E WITH CIRCUMFLEX",
e_circ => "LATIN SMALL LETTER E WITH CIRCUMFLEX",
e_closed => "LATIN SMALL LETTER E WITH CIRCUMFLEX",
E_tilde => "LATIN CAPITAL LETTER E WITH TILDE",
e_tilde => "LATIN SMALL LETTER E WITH TILDE",

I_acute => "LATIN CAPITAL LETTER I WITH ACUTE",
i_acute => "LATIN SMALL LETTER I WITH ACUTE",

O_acute => "LATIN CAPITAL LETTER O WITH ACUTE",
O_open => "LATIN CAPITAL LETTER O WITH ACUTE",
o_acute => "LATIN SMALL LETTER O WITH ACUTE",
o_open => "LATIN SMALL LETTER O WITH ACUTE",
O_circ => "LATIN CAPITAL LETTER O WITH CIRCUMFLEX",
O_closed => "LATIN CAPITAL LETTER O WITH CIRCUMFLEX",
o_circ => "LATIN SMALL LETTER O WITH CIRCUMFLEX",
o_closed => "LATIN SMALL LETTER O WITH CIRCUMFLEX",
O_tilde => "LATIN CAPITAL LETTER O WITH TILDE",
o_tilde => "LATIN SMALL LETTER O WITH TILDE",

U_acute => "LATIN CAPITAL LETTER U WITH ACUTE",
u_acute => "LATIN SMALL LETTER U WITH ACUTE",
U_trema => "LATIN CAPITAL LETTER U WITH DIAERESIS",
u_trema => "LATIN SMALL LETTER U WITH DIAERESIS",

C_cedil => "LATIN CAPITAL LETTER C WITH CEDILLA",
c_cedil => "LATIN SMALL LETTER C WITH CEDILLA",

};

al...@alfarrabio.di.uminho.pt

May 28, 2009, 4:46:13 AM

to Tom Christiansen, Perl 5 Porters

Hello, Tom

Tom Christiansen wrote:
> First off, this may or may not be a problem, but your program
> came across the wire with ISO8859-1 literals, and was marked as
> ISO-8859-1, but was saying internally that it was in UTF-8.
> Something may have been lost in translation, because otherwise
> when run, it produces nonsense about malformed UTF-8.

That's because I pasted the code into the mail and did not attach the
original document, which IS in UTF-8.

> However, you are in a very real sense correct that
> the PT UTF-8 locale under Leopard seems "broken".

That is what I meant.

> Mac% find /usr/share/locale -name LC_COLLATE -ls | wc -l
> 206
>
> Mac% find /usr/share/locale -name LC_COLLATE -ls | grep -c US-ASCII
> 122

Noted that as well.

> I would never trust a vendor's LC_COLLATE. *Which* sort
> of collation are they speaking of? Just for one thing,
> you don't usually use the same for a dictionary as you
> do for a phone book. Think about "sort -df" for example.

Well, for Portuguese I always use the same sorting order :)

> I HAVE GOOD NEWS FOR YOU: if you just want to do something
> exceedingly simple, then there's a pretty easy way to get
> that done. This works:
>
> #!/usr/bin/env perl5.10.0
>
> use 5.10.0;
> use strict;
> use warnings;
>
> use Unicode::Collate;
>
> my @a = qw[ ý a é í o ú ã é z y ];
> my @sa = Unicode::Collate->new->sort(@a);
>
> printf "[ %s ] sorts to [ %s ]\n",
> join(" " => @a),
> join(" " => @sa);

I'll test it, thanks.

> Good luck--hope this helps give some ideas.

Yep. Thank you
Alberto