Linux UTF-8 locales sort SPACE at level 4

Markus Kuhn

Mar 21, 2006, 1:46:45 PM
In the file

/usr/share/i18n/locales/iso14651_t1

in many contemporary Linux distributions (e.g., SuSE 9.3), the line

<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

specifies that, with LC_COLLATE=en_GB.UTF-8 (and in many other locales),
the space character affects the sorting order only at level 4, that is,
only if there are no differences in

- base characters
- accents
- uppercase/lowercase

anywhere in the strings being compared.
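
To make the effect concrete, here is a minimal Python sketch of a four-level comparison in which SPACE is ignored at levels 1 to 3. It is a toy model, not the glibc implementation: the function name toy_key, the choice to sort lowercase before uppercase at level 3, and the use of space positions as the level-4 weight are all assumptions made for illustration. It only shows why "de luge" lands next to "deluge" rather than before "death".

```python
# Toy model of a four-level collation key (NOT the glibc algorithm).
# Assumptions for illustration: ASCII letters and SPACE only, lowercase
# sorts before uppercase at level 3, and level 4 is simply the positions
# at which spaces occur.

def toy_key(s):
    letters = [c for c in s if c != " "]       # SPACE is IGNOREd at levels 1-3
    level1 = tuple(c.lower() for c in letters)  # base characters
    level2 = ()                                 # accents (none in this demo)
    level3 = tuple(c.isupper() for c in letters)  # case: False < True
    level4 = tuple(i for i, c in enumerate(s) if c == " ")  # spaces, last resort
    return (level1, level2, level3, level4)

words = ["death", "de luge", "deluge", "de Luge", "deLuge", "demark"]

if __name__ == "__main__":
    print("codepoint order:", sorted(words))
    print("toy 4-level    :", sorted(words, key=toy_key))
```

In plain codepoint order every string containing a space sorts ahead of "death", because SPACE (0x20) is smaller than any letter. In the toy model the space is consulted only after base letters and case have failed to separate two strings.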

Is this really what most users expect? I didn't!

The UCA has lots of options, and I think some discussion is needed
on which of these options are most appropriate for a glibc locale,
possibly leading to a revision or replacement of the iso14651_t1
file.

References:

- Unicode Collation Algorithm (UCA), http://www.unicode.org/reports/tr10/

- ISO TR 14652 (draft: http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf)

- http://sources.redhat.com/bugzilla/show_bug.cgi?id=374

- https://bugzilla.novell.com/show_bug.cgi?id=152778

Example:

$ cat >demo.txt
death
de luge
de-luge
deluge
de-luge
de Luge
de-Luge
deLuge
de-Luge
demark
^D

and then try

$ LC_COLLATE=C sort demo.txt
$ LC_COLLATE=en_GB.UTF-8 sort demo.txt
$ LC_COLLATE=en_GB sort demo.txt

and see the difference with how your dictionary or phone book sorts
these.

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain


--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/


Denis Barbier

Mar 21, 2006, 6:08:14 PM
On Tue, Mar 21, 2006 at 06:46:45PM +0000, Markus Kuhn wrote:
[...]

> References:
>
> - Unicode Collation Algorithm (UCA), http://www.unicode.org/reports/tr10/
>
> - ISO TR 14652 (draft: http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf)

ISO TR 14652 does not deal with collation; GNU libc locales are based on ISO 14651.
A draft is available at http://dkuug.dk/jtc1/sc22/open/n2933.pdf
The iso14651_t1 file is intended to be the Common Template Table defined in appendix A.

> - http://sources.redhat.com/bugzilla/show_bug.cgi?id=374

This bug report does not contain any information. OTOH
http://sources.redhat.com/bugzilla/show_bug.cgi?id=388
explains that the current sorting order is wrong in Polish.

> - https://bugzilla.novell.com/show_bug.cgi?id=152778

Access denied.

> Example:
>
> $ cat >demo.txt
> death
> de luge
> de-luge
> deluge
> de-luge
> de Luge
> de-Luge
> deLuge
> de-Luge
> demark
> ^D
>
> and then try
>
> $ LC_COLLATE=C sort demo.txt
> $ LC_COLLATE=en_GB.UTF-8 sort demo.txt
> $ LC_COLLATE=en_GB sort demo.txt

Out of curiosity, do you see differences between en_GB and en_GB.UTF-8?
There should be none.
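
One way to check this without comparing sort output by eye is a short Python sketch along the following lines. The helper names sort_in_locale and same_order are hypothetical, and the word list is taken from the demo file with duplicates dropped. Whether en_GB and en_GB.UTF-8 are installed depends on the system, so the sketch reports a missing locale rather than guessing a result.

```python
import locale

# Words from the demo file (duplicates dropped).
WORDS = ["death", "de luge", "de-luge", "deluge",
         "de Luge", "de-Luge", "deLuge", "demark"]

def sort_in_locale(words, name):
    """Sort words under LC_COLLATE=name; return None if the locale is missing."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting

def same_order(words, name_a, name_b):
    """True/False if both locales exist and agree/disagree, None otherwise."""
    a = sort_in_locale(words, name_a)
    b = sort_in_locale(words, name_b)
    if a is None or b is None:
        return None
    return a == b

if __name__ == "__main__":
    for name in ("C", "en_GB", "en_GB.UTF-8"):
        print(name, "->", sort_in_locale(WORDS, name))
    print("en_GB vs en_GB.UTF-8 agree:", same_order(WORDS, "en_GB", "en_GB.UTF-8"))
```

A result of None means at least one of the two locales is not available, which is different from the two orders disagreeing.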

> and see the difference with how your dictionary or phone book sorts
> these.

My understanding is that authors of ISO 14651 tried to gather some
general rules which are relevant for several locales, and other locales
have to derive from these rules if needed. The problem is that very
few people submitted changes, and as can be seen above, it is sometimes
hard to push changes into GNU libc. But at least this is an open
process, and distributions can make up their own minds and include the
requested changes if they want.

Denis
