In /usr/share/i18n/locales/iso14651_t1, as shipped with many
contemporary Linux distributions (e.g., SuSE 9.3), the line
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
specifies that the space character affects the sorting order with
LC_COLLATE=en_GB.UTF-8 (and in many other locales) only at level 4,
that is, only if there are no differences in
- base characters
- accents
- uppercase/lowercase
anywhere in the strings being compared.
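To see why "de luge" ends up next to "deluge", the key compared at the
first level can be roughly approximated by deleting the ignored characters
and folding case. This is only a sketch: primary_key is a made-up helper
name, it handles ASCII only, and it assumes that the hyphen is ignored at
the first three levels just like the space, which the line quoted above
does not show.

```shell
#!/bin/sh
# Rough approximation of a level-1 collation key (ASCII only):
# drop the characters assumed to be IGNOREd (space, hyphen), then fold case.
primary_key() {
    printf '%s\n' "$1" | LC_ALL=C tr -d ' -' | LC_ALL=C tr 'A-Z' 'a-z'
}

# All of these variants collapse to the same approximate key.
for w in "de luge" "de-luge" "deluge" "de Luge" "deLuge"; do
    printf '%-8s -> %s\n' "$w" "$(primary_key "$w")"
done
```

Since all variants share one key here, only the later levels (case, and
finally the space or hyphen itself) can decide their relative order.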
Is this really what most users expect? I didn't!
The UCA has lots of options, and I think some discussion is needed
on which of these options are most appropriate for a glibc locale,
possibly leading to a revision or replacement of the iso14651_t1
file.
References:
- Unicode Collation Algorithm (UCA), http://www.unicode.org/reports/tr10/
- ISO TR 14652 (draft: http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf)
- http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
- https://bugzilla.novell.com/show_bug.cgi?id=152778
Example:
$ cat >demo.txt
death
de luge
de-luge
deluge
de-luge
de Luge
de-Luge
deLuge
de-Luge
demark
^D
and then try
$ LC_COLLATE=C sort demo.txt
$ LC_COLLATE=en_GB.UTF-8 sort demo.txt
$ LC_COLLATE=en_GB sort demo.txt
and see the difference with how your dictionary or phone book sorts
these.
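The same experiment can be run non-interactively. The following is a
minimal sketch, assuming a POSIX shell and that the en_GB locales are
installed; if they are not, the output for them may simply match the C
locale. LC_ALL is cleared for each call because, when set, it overrides
LC_COLLATE.

```shell
#!/bin/sh
# Recreate demo.txt. "de-luge" and "de-Luge" appear twice in the list
# above and are copied here exactly as shown.
printf '%s\n' death 'de luge' de-luge deluge de-luge \
    'de Luge' de-Luge deLuge de-Luge demark >demo.txt

# Print the sorted list once per collation locale.
for loc in C en_GB.UTF-8 en_GB; do
    echo "== LC_COLLATE=$loc =="
    LC_ALL= LC_COLLATE=$loc sort demo.txt
done
```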
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
ISO TR 14652 does not deal with collation; GNU libc locales are based on ISO 14651.
A draft is available at http://dkuug.dk/jtc1/sc22/open/n2933.pdf
The iso14651_t1 file is intended to be the Common Template Table defined in appendix A.
> - http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
This bug report does not contain any information. OTOH
http://sources.redhat.com/bugzilla/show_bug.cgi?id=388
explains that the current sorting order is wrong for Polish.
> - https://bugzilla.novell.com/show_bug.cgi?id=152778
Access denied.
> Example:
>
> $ cat >demo.txt
> death
> de luge
> de-luge
> deluge
> de-luge
> de Luge
> de-Luge
> deLuge
> de-Luge
> demark
> ^D
>
> and then try
>
> $ LC_COLLATE=C sort demo.txt
> $ LC_COLLATE=en_GB.UTF-8 sort demo.txt
> $ LC_COLLATE=en_GB sort demo.txt
Out of curiosity, do you see differences between en_GB and en_GB.UTF-8?
There should be none.
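One way to check is to sort the same input under both locales and compare
the results. This is a rough sketch, assuming both locales are installed
(otherwise sort may use the C locale for both and the comparison says
nothing); the sorted.* file names are made up for illustration.

```shell
#!/bin/sh
# Recreate the quoted word list if it is not already present.
[ -f demo.txt ] || printf '%s\n' death 'de luge' de-luge deluge de-luge \
    'de Luge' de-Luge deLuge de-Luge demark >demo.txt

# Sort once per locale; clear LC_ALL so that LC_COLLATE takes effect.
LC_ALL= LC_COLLATE=en_GB       sort demo.txt >sorted.en_GB
LC_ALL= LC_COLLATE=en_GB.UTF-8 sort demo.txt >sorted.en_GB.UTF-8

# Report whether the two orderings are byte-for-byte the same.
if cmp -s sorted.en_GB sorted.en_GB.UTF-8; then
    echo "no difference"
else
    diff sorted.en_GB sorted.en_GB.UTF-8
fi
```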
> and see the difference with how your dictionary or phone book sorts
> these.
My understanding is that authors of ISO 14651 tried to gather some
general rules which are relevant for several locales, and other locales
have to derive from these rules if needed. The problem is that very
few people submitted changes, and as can be seen above, it is sometimes
hard to push changes into GNU libc. But at least this is an open
process: other distributions can make up their own minds and include
the requested changes if they want.
Denis