|locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/20/12 11:36 AM|
As the perl folks are about to release another version with yet more
major unicode corrections (apparently they have had to redo a lot of
namespace and unicode stuff in every major version since 5.6 due to the
changing of the guard, amongst other reasons).
But perl has the ability in the new version to specify how chars shall
be handled in a search, for example --
locale specific, unicode specific, ascii rules(C I think), and a
'd'epends mode that handles most things like Unicode, except for bytes
between 128-255, where they will default to staying with Latin1 (as I
understand it -- they aren't especially forthcoming with explanations).
I tried to ask, under what circumstances would I get the collating order
that bash now defaults to in the en_US locale -- and was told NONE. I
asked them why -- how did they justify that behavior -- what documents
were they following and was told that to ask for technical
justifications was rude, with no one volunteering an answer. Sounds a
bit sketchy to me... like they have no reasons....but that could be a
I told them I thought that them taking such technical questions as
'rudeness' was their personal trip, and not related to the engineering
task(s) at hand. A few understood, a few other loud ones joined in
with those who had no answer, and said asking for any sort of reason or
justification as to why perl is the way it is, argumentative and
rude...etc....blah blah blah... (bunch of self-absorbed prima donnas -
worse than me on a bad day!)...
Anyway... so WHY does bash collate this way? Under what rules is bash
operating? I.e. justification?
If other programs claim to have locale specific sorting & character
collation, should they be sorting the same way?
So far, I've seen Microsoft do the a<A<z<Z collation, in win7 and above,
BUT.. I hardly think Bash did it to follow windows.... So.. where did
this come from and how should it apply to other "locale honoring" programs?
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Greg Wooledge||5/21/12 5:38 AM|
On Sun, May 20, 2012 at 11:36:35AM -0700, Linda Walsh wrote:
It doesn't. The operating system does. Bash just calls upon the C
library's strcoll(3) routine.
The results vary across operating systems, and even potentially across
locale definitions within a given OS implementation.
In particular, LC_COLLATE starts at
Yes. ls(1) for example sorts filenames according to the collation
order defined in the current locale (LC_ALL or LC_COLLATE or LANG).
You should get the same sorting from both ls and *.
For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
imadev:~$ mkdir /tmp/greg && cd "$_"
imadev:/tmp/greg$ touch a A á Á à À â Â ä Ä b B
imadev:/tmp/greg$ ls
A a Á á À à Â â Ä ä B b
imadev:/tmp/greg$ echo *
A a Á á À à Â â Ä ä B b
imadev:/tmp/greg$ LC_COLLATE=C ls
A B a b À Á Â Ä à á â ä
imadev:/tmp/greg$ LC_COLLATE=C; echo *; unset LC_COLLATE
A B a b À Á Â Ä à á â ä
Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:
arc3:~$ mkdir /tmp/greg && cd "$_"
arc3:/tmp/greg$ touch a A á Á à À â Â ä Ä b B
arc3:/tmp/greg$ echo *
a A á Á à À â Â ä Ä b B
As you can see, the two en_US.iso88591 implementations are not the same.
See http://mywiki.wooledge.org/locale or myriad other resources (including
the POSIX page linked earlier) for further explanations.
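
A minimal sketch for reproducing the comparison above on your own system. The scratch directory created with `mktemp -d` and the file names are illustrative choices, not part of the original transcripts. Only the C-locale result is predictable; the order under your default locale depends on the locale definitions your OS ships.

```shell
#!/bin/sh
# Compare how glob expansion, ls, and sort order the same names.
# The names are chosen to stay distinct on case-insensitive filesystems.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/collate.XXXXXX") || exit 1
cd "$scratch" || exit 1
touch a B c D

# Order under whatever locale the environment currently selects.
# This varies between systems, so no expected output is given.
echo "default locale glob: $(echo *)"

# Force the C locale: collation falls back to byte values, so the
# uppercase ASCII letters sort before the lowercase ones.
LC_ALL=C
export LC_ALL
glob_c=$(echo *)
ls_c=$(ls | tr '\n' ' ' | sed 's/ $//')
sort_c=$(printf '%s\n' a B c D | sort | tr '\n' ' ' | sed 's/ $//')
echo "C locale glob: $glob_c"
echo "C locale ls:   $ls_c"
echo "C locale sort: $sort_c"

cd / && rm -rf "$scratch"
```

In the C locale all three lines should agree, which matches the point that ls and * draw on the same collation order.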
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 12:19 PM|
> For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
> A a ... B b
> Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:
> a A ... b B
So which is correct?
Anyone wanting to reference an upper or lower case range
[a-z] or [A-Z], is gonna hurt from this.
My OS uses "en_US.UTF-8".
You'd think unicode would have something to say about collation
order that wouldn't allow such randomness, but maybe not.
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Chris F.A. Johnson||5/21/12 12:21 PM|
On Mon, 21 May 2012, Linda Walsh wrote:
Use the correct references: [:upper:] and [:lower:] or (as I do)
always use LC_ALL=C in your scripts.
My OS uses whatever I tell it to (which is C).
Chris F.A. Johnson, <http://cfajohnson.com/>
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Greg Wooledge||5/21/12 12:27 PM|
On Mon, May 21, 2012 at 12:19:26PM -0700, Linda Walsh wrote:
> >For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
> So which is correct?
Both. Locale collating order is determined by the OS. You cannot
rely on it, unless you set the LC_COLLATE variable to "C" or "POSIX",
in which case you get ASCII behavior (accented letters are not part
of the character set at all).
imadev:~$ echo Hello World | tr 'A-Z' 'a-z'
imadev:~$ echo Hello World | tr '[:upper:]' '[:lower:]'
You *cannot* use [a-z] or [A-Z] any more, except in the C/POSIX locale.
If you want to match lowercase characters, you should be using [[:lower:]],
and for uppercase characters, [[:upper:]].
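
As a sketch of the advice above, character classes can stand in for the ranges both in `tr` and in shell patterns. The helper name `is_all_lower` is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Case conversion with character classes instead of A-Z / a-z ranges.
lowered=$(printf '%s\n' 'Hello World' | tr '[:upper:]' '[:lower:]')
echo "$lowered"

# Hypothetical helper: succeed only if the argument is non-empty and
# consists solely of characters the current locale classifies as lowercase.
is_all_lower() {
    case $1 in
        '') return 1 ;;
        *[![:lower:]]*) return 1 ;;
        *) return 0 ;;
    esac
}

if is_all_lower "hello"; then echo "hello: all lowercase"; fi
if ! is_all_lower "Hello"; then echo "Hello: not all lowercase"; fi
```

Note that in a pattern the class name sits inside a bracket expression, hence the doubled brackets, while `tr` takes the class as `'[:lower:]'` directly.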
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Aharon Robbins||5/21/12 12:27 PM|
In article <email@example.com>,
Linda Walsh <ba...@tlinx.org> wrote:
Both! Isn't this fun! Current POSIX leaves this up to the implementation.
I believe that the Debian order is what earlier POSIX required.
This is why I started the Campaign For Rational Range Interpretation,
now part of gawk and I believe in the most recent grep also, which
returns us to the sane days of yesteryear, where [a-z] got only lowercase
letters and [A-Z] got only uppercase ones.
I personally have had
in my .profile / .bashrc for many years now, to keep the behavior G-d
It actually makes sense that it doesn't, since Unicode is more or less
a mapping of code points to glyphs, which is language independent. The
rules for collating depend upon the language.
Aharon (Arnold) Robbins arnold AT skeeve DOT com
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 12:37 PM|
>> Greg Wooledge wrote:
>> So which is correct?
>> Anyone wanting to reference an upper or lower case range
> Correct.
This is a prime example of Posix being stupid and bad for
They take a deterministic behavior and define it to be
non-deterministic and break 1000's of programs.
They cannot justify this... as they are supposed to document
current practice -- which has never been to consider the interpretation
of a-z/A-Z as random!
Thus they are violating their own rules! How can anyone follow
such lame directions? Who in their right mind would have voted to
make ranges "worthless"....i.e. -- established, standard practice has never
been for such ranges to be worthless -- yet that is exactly what they
How is posix following its own rules? If they don't follow
their own rules -- how can anyone be following these new specifications
which are obviously in conflict with established implementation?
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Chester Ramey||5/21/12 12:41 PM|
On 5/21/12 3:27 PM, Aharon Robbins wrote:
What's most fun is having this discussion over and over again, each time as
if it were the first.
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU ch...@case.edu http://cnswww.cns.cwru.edu/~chet/
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Chester Ramey||5/21/12 12:42 PM|
On 5/21/12 3:27 PM, Aharon Robbins wrote:
> This is why I started the Campaign For Rational Range Interpretation,
The next version of bash will have a shell option to enable this behavior.
It's in the development snapshots if anyone wants to try it out now.
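
For readers trying this later: as far as can be determined, the option described here shipped in bash 4.3 under the name `globasciiranges`. That name does not appear in the thread, so treat it as an assumption and check the `shopt` listing of your own bash. A minimal sketch that enables it when available and otherwise only reports:

```shell
#!/bin/sh
# Try to enable ASCII-order range matching in bash. Other shells and older
# bash versions skip or fail the shopt call. The option name is an assumption.
rri_status="unavailable"
if [ -n "${BASH_VERSION-}" ] && shopt -s globasciiranges 2>/dev/null; then
    rri_status="enabled"
fi
echo "globasciiranges: $rri_status"

# With the option enabled (or in the C locale), [a-z] should not match "B".
# Without it, the result depends on the locale's collation order.
case B in
    [a-z]) range_result="B matched [a-z]" ;;
    *)     range_result="B did not match [a-z]" ;;
esac
echo "$range_result"
```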
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Chester Ramey||5/21/12 12:46 PM|
On 5/21/12 3:37 PM, Linda Walsh wrote:
Try being a little less English-centric. Collating order varies by
Posix says that ranges work the way you are used to if you force the
traditional ordering using the `C' or `Posix' locale. Take a deep
breath and use LC_ALL=C in your scripts to avoid depending on whatever
your OS uses as the default.
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 12:51 PM|
FWIW, I put LC_COLLATE='C' in my System startup scripts...
So I'm NOT being bitten by this problem directly... I'm trying to
figure out how the heck this got voted on in POSIX, to be a correct
standard such that it would break current programs...
POSIX is not supposed to be prescriptive -- but **descriptive**...
I can't think of anywhere that a-z or A-Z would have included letters
from the opposite case... so how did POSIX come to *prescribe* that this
be the case... since I can't see that as being descriptive.
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Sven Mascheck||5/21/12 12:55 PM|
On Mon, May 21, 2012 at 03:46:00PM -0400, Chet Ramey wrote:
I'd suggest to use "only" LC_COLLATE=C to keep other important features
like e.g. printability (LC_CTYPE) or localized messages (LC_MESSAGES).
Or has anybody ever noticed problems with this approach?
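
A sketch of that approach, assuming a script that wants byte-order sorting while leaving LC_CTYPE and LC_MESSAGES to the user's settings. One caveat: LC_ALL, when set, overrides every individual LC_* variable, so LC_COLLATE=C only takes effect if LC_ALL is unset or empty.

```shell
#!/bin/sh
# LC_ALL takes precedence over LC_COLLATE, so clear it first. The other
# categories then come from their own LC_* variables or from LANG.
unset LC_ALL
LC_COLLATE=C
export LC_COLLATE

# Collation now follows byte values: uppercase ASCII before lowercase.
sorted=$(printf '%s\n' b A a B | sort | tr '\n' ' ' | sed 's/ $//')
echo "$sorted"
```

Unsetting LC_ALL does change behavior for users who relied on it to select their locale, so this is a trade-off rather than a free fix.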
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 12:56 PM|
Actually it makes sense that it would... and it DOES.
Bash and POSIX would appear to violate Unicode ordering.
Algorithm reference: http://www.unicode.org/reports/tr10/tr10-24.html
Collation order chart:
It shows that cap A-Z come before a-z.
If a sort order is set (as mine used to be) to
then the Unicode sort order should have been used.
In bash, it was not.
It sounds like this is a bug rooted in the C libraries?
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Eric Blake||5/21/12 1:39 PM|
On 05/21/2012 01:51 PM, Linda Walsh wrote:
POSIX 1992 was the culprit that prescribed that [A-Z] must be in
collation order across all locales, but without giving good guidance on
how to write a collation sequence, and without defining a C function to
easily get at that collation ordering. And remember, 20 years ago when
POSIX 1992 was written, there was very little implementation experience
with internationalization, compared to what has happened in the meantime
(that was back when Unicode was brand new, and most users still had
single-byte locales or used shift-lock encodings like Big5). It is
possible to write a locale definition where [A-Z] gives only upper-case
letters while still providing case-insensitive sorting, but not all
locale writers know how to do this (even now in 2012, while most glibc
locales have been corrected in this manner, there still exist several
glibc locales that aren't written very well - the complication stems
from the fact that your locale file becomes exponentially harder to
write: instead of having a single upper and lower case rule, you have to
have one rule per letter, with rules intermixed in a different order).
As soon as people started obeying POSIX 1992 to the letter, and
realizing that range expressions had unusual semantics as a result of
the 1992 specification, POSIX 2001 quickly reverted things, but by then,
the cat was out of the bag. POSIX 2001 had to continue to allow
existing implementations, by stating that range expressions in anything
but the C locale are explicitly undefined.
There is currently a movement under way to introduce 'Rational Range
Interpretation' (RRI), where [A-Z] means the 26 uppercase letters across
ALL locales, by omitting all accented letters and ignoring collation
ordering. Since POSIX 2001 and later allow this behavior, it is gaining
traction - already, GNU sed, GNU grep, and GNU awk have had patches
applied or under consideration to introduce this consistent behavior.
Search those mailing list archives if you want more details. Gnulib has
already had patches as part of this movement, and GNU coreutils and bash
should be picking up on these improvements in a future version; we also
hope to get glibc to agree to them. In other words, we recognize that
this is an issue, and eventually, we _do_ want to reach the point where
all GNU tools use RRI, since POSIX 2001 already allows RRI as part of
its recognition that the decision made in POSIX 1992 causes pain when
coupled with poorly-written locale definitions.
For example, here is an RRI patch for gnulib:
Eric Blake ebl...@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 2:02 PM|
> collation order across all locales,.....
[quickly?! 9 years later?! *cough*]
Explicitly undefined? Or locale dependent?
I.e. Unicode does specify ordering, so if your locale is set
to UTF-8 character encoding, then it is explicitly defined. This would
seem to be in conflict with unicode -- and any implementation claiming
to be unicode compatible MUST use unicode ordering when the local character
set is defined to be Unicode.
This doesn't conflict with Posix, as Posix doesn't define an order
for such -- but a different standard, (Unicode) does specify a standard. So
for those using UTF-8, shouldn't that have made the order randomization 'moot'?
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Chester Ramey||5/21/12 2:12 PM|
On 5/21/12 5:02 PM, Linda Walsh wrote:
Then write the locale definitions that way -- that's Eric's point.
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Eric Blake||5/21/12 2:14 PM|
On 05/21/2012 03:02 PM, Linda Walsh wrote:
POSIX explicitly undefined ranges for all but the C locale. _Other
standards_, such as Unicode, are free to add range requirements on top
of what POSIX requires, but alas, Unicode collation order does NOT
currently specify anything about regular expression or glob range
matching, so it is out of scope for Unicode to say what [A-Z] expands to.
Unicode may specify collation ordering, but it does NOT specify regular
expression range ordering.
Wishing doesn't make it so. The fact is that regular expression ranges
are currently unspecified in all but the C locale; the RRI project is
attempting to make it sane across all locales within the scope of GNU
programs, but it takes time to write and approve the patches necessary
to get to that point.
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 4:42 PM|
I think this is the problem.
A-Z in regular expressions is defined to expand to those characters
that are _in collating order_, >A, and <Z...
Without a collating order that expression in RE's would never have made any
sense. It requires a collating order and is dependent on it.
If there is no collating order, then you cannot expand A-Z, but if there
is, you expand it to the values between A-Z that are in the collating order.
The regex(7) man page says that [xx-xx] uses ***collating order***:
If two characters in the list
are separated by '-', this is shorthand for the full range of characters
between those two (inclusive) in the collating sequence, for example,
"[0-9]" in ASCII matches any decimal digit.
Seems pretty clear -- regex's aren't exempt from collating order, they depend on
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Eric Blake||5/21/12 7:02 PM|
On 05/21/2012 05:42 PM, Linda Walsh wrote:
Only in POSIX 1992 or in the C locale. In POSIX 2001 and POSIX 2008,
and non-C locales, [A-Z] is explicitly undefined, because the definition
of characters in collating order between A and Z did not work out.
They still don't make any sense in any locale except C, because POSIX no
longer requires collating order.
The regex(7) man page _of which system_? Just because _some_ systems
(like glibc, picking the POSIX 1992 semantics) have well-defined
semantics, doesn't mean that all systems have those same semantics.
According to POSIX, you cannot portably assume ANY semantics for ranges
except in the C locale. And if RRI gains traction, that means that you
can assume ASCII collation, across ALL locales, but this is a different
order than collation of a specific locale, and it is also a GNU
extension not guaranteed by POSIX.
Only on platforms where libc has chosen to provide an extension beyond
POSIX, and where GNU programs have not further overridden things to
avoid the unexpected glibc semantics.
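
One way to sidestep range semantics entirely, sketched here as an illustration rather than something proposed in the thread: list the wanted characters explicitly. A bracket expression without a range names just the characters written in it, so it does not depend on collation order. The function name `is_ascii_upper` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for a non-empty string made up solely
# of the 26 ASCII uppercase letters, spelled out instead of using [A-Z].
is_ascii_upper() {
    case $1 in
        '') return 1 ;;
        *[!ABCDEFGHIJKLMNOPQRSTUVWXYZ]*) return 1 ;;
        *) return 0 ;;
    esac
}

for word in HELLO Hello; do
    if is_ascii_upper "$word"; then
        echo "$word: ASCII uppercase only"
    else
        echo "$word: contains something else"
    fi
done
```

This is verbose, but it behaves the same whether or not the script forces the C locale.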
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 8:15 PM|
> Only in POSIX 1992 or in the C locale. In POSIX 2001 and POSIX 2008,
==============
ONLY under POSIX...
You may not believe this, but there are other standards than POSIX.
More than one distro and company claims unicode compliance, to V6.x.
Any programs or collections that do claim compat, need to adhere to the
stricter of the standards, which is Unicode, in this case.
POSIX only applies if you are going for the bottom of the barrel in this
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Linda Walsh||5/21/12 8:19 PM|
Well, that would be nice, but if Unicode takes off, *cough*,
and anyone claims unicode compliance (isn't UTF-8 the standard for HTML5
and XML?), they are also guaranteed ordering -- full ordering for the full
Unicode character set.
It would be VERY GOOD if RRI didn't come up with an order that
was DIFFERENT from that prescribed by Unicode -- otherwise that could open
another can of worms.
|Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?||Raph||5/25/12 1:41 AM|
On Mon, May 21, 2012 at 12:19:26PM -0700, Linda Walsh wrote:
> Anyone wanting to reference an upper or lower case range
> My OS uses "en_US.UTF-8".
I don't remember this bug having been cited here:
About $ export LC_COLLATE=C >> /etc/skel/.bashrc & co