locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?

locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/20/12 11:36 AM



The perl folks are about to release another version with yet more
major unicode corrections (apparently they have had to redo a lot of
namespace and unicode stuff in every major version since 5.6, due to
the changing of the guard, amongst other reasons).

But perl has the ability in the new version to specify how chars shall
be handled in a search, for example --
locale specific, unicode specific, ascii rules(C I think), and a
'd'epends mode that handles most things like Unicode, except for bytes
between 128-255, where they will default to staying with Latin1 (as I
understand it -- they aren't especially forthcoming with explanations).

I tried to ask, under what circumstances would I get the collating order
that bash now defaults to in the en_US locale -- and was told NONE.   I
asked them why -- how did they justify that behavior -- what documents
were they following and was told that to ask for technical
justifications was rude, with no one volunteering an answer.  Sounds a
bit sketchy to me... like they have no reasons....but that could be a
projection.

I told them I thought that them taking such technical questions as
'rudeness' was their personal trip, and not related to the engineering
task(s) at hand.   A few understood, a few other loud ones joined in
with those who had no answer, and said that asking for any sort of
reason or justification as to why perl is the way it is was
argumentative and rude...etc....blah blah blah...  (bunch of
self-absorbed prima donnas - worse than me on a bad day!)...

Anyway... so WHY does bash collate this way?   Under what rules is bash
operating?  I.e. justification?
If other programs claim to have locale specific sorting & character
collation,  should they be sorting the same way?

So far, i've seen Microsoft do the a<A<z<Z collation, in win7 and above,
BUT.. I hardly think Bash did it to follow windows....  So.. where did
this come from and how should it apply to other "locale honoring" programs?

Thanks!
Linda



Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Greg Wooledge 5/21/12 5:38 AM
On Sun, May 20, 2012 at 11:36:35AM -0700, Linda Walsh wrote:
> Anyway... so WHY does bash collate this way?

It doesn't.  The operating system does.  Bash just calls upon the C
library's strcoll(3) routine.

The results vary across operating systems, and even potentially across
locale definitions within a given OS implementation.

> Under what rules is bash
> operating?  I.e. justification?

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

In particular, LC_COLLATE starts at
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_02

> If other programs claim to have locale specific sorting & character
> collation,  should they be sorting the same way?

Yes.  ls(1) for example sorts filenames according to the collation
order defined in the current locale (LC_ALL or LC_COLLATE or LANG).
You should get the same sorting from both ls and *.

For instance, on HP-UX 10.20, in the en_US.iso88591 locale:

imadev:~$ mkdir /tmp/greg && cd "$_"
imadev:/tmp/greg$ touch a A á Á à À â Â ä Ä b B
imadev:/tmp/greg$ ls
A  a  Á  á  À  à  Â  â  Ä  ä  B  b
imadev:/tmp/greg$ echo *
A a Á á À à Â â Ä ä B b
imadev:/tmp/greg$ LC_COLLATE=C ls
A  B  a  b  À  Á  Â  Ä  à  á  â  ä
imadev:/tmp/greg$ LC_COLLATE=C; echo *; unset LC_COLLATE
A B a b À Á Â Ä à á â ä

Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:

arc3:~$ mkdir /tmp/greg && cd "$_"
arc3:/tmp/greg$ touch a A á Á à À â Â ä Ä b B
arc3:/tmp/greg$ echo *
a A á Á à À â Â ä Ä b B

As you can see, the two en_US.iso88591 implementations are not the same.

See http://mywiki.wooledge.org/locale or myriad other resources (including
the POSIX page linked earlier) for further explanations.
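
A minimal sketch of the "same sorting from both ls and *" point, pinned to
the C locale so the result does not depend on the OS's en_US definition.
The helper names glob_order_c and sort_order_c are made up for this
illustration, and the sketch assumes a case-sensitive filesystem and an
available mktemp(1).

```shell
# Hypothetical helper: expand * in a scratch directory under the C locale.
# The function body is a subshell, so the cd and the locale change do not
# leak into the calling shell.
glob_order_c() (
    dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$dir"' EXIT
    cd "$dir" || exit 1
    touch a A b B
    LC_ALL=C
    export LC_ALL
    echo *
)

# Hypothetical helper: sort the same names with sort(1) under the C locale
# and join the lines with spaces for easy comparison.
sort_order_c() {
    printf '%s\n' a A b B | LC_ALL=C sort | paste -s -d ' ' -
}

glob_order_c
sort_order_c
```

In the C locale both helpers should print the same byte-order listing
(uppercase before lowercase); in any other locale the order is whatever
that system's locale definition says.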

Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 12:19 PM


Greg Wooledge wrote:

> On Sun, May 20, 2012 at 11:36:35AM -0700, Linda Walsh wrote:

> For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
>     A  a  ...  B  b
> Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:
>     a A   ...  b B
>
> As you can see, the two en_US.iso88591 implementations are not the same.

----
        Great!...

So which is correct?

Anyone wanting to reference an upper or lower case range
[a-z] or [A-Z], is gonna hurt from this.

My OS uses "en_US.UTF-8".

You'd think unicode would have something to say about collation
order that wouldn't allow such randomness, but maybe not.




Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Chris F.A. Johnson 5/21/12 12:21 PM
On Mon, 21 May 2012, Linda Walsh wrote:

>
>
> Greg Wooledge wrote:
>
>> On Sun, May 20, 2012 at 11:36:35AM -0700, Linda Walsh wrote:
>
>> For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
>>     A  a  ...  B  b
>> Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:
>>     a A   ...  b B
>>
>> As you can see, the two en_US.iso88591 implementations are not the same.
>
> ----
>         Great!...
>
> So which is correct?
>
> Anyone wanting to reference an upper or lower case range
> [a-z] or [A-Z], is gonna hurt from this.

    Use the correct references: [:upper:] and [:lower:] or (as I do)
    always use LC_ALL=C in your scripts.

> My OS uses "en_US.UTF-8".

    My OS uses whatever I tell it to (which is C).

> You'd think unicode would have something to say about collation
> order that wouldn't allow such randomness, but maybe not.

--
    Chris F.A. Johnson, <http://cfajohnson.com/>
    Author:
    Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)
    Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)

Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Greg Wooledge 5/21/12 12:27 PM
On Mon, May 21, 2012 at 12:19:26PM -0700, Linda Walsh wrote:
> Greg Wooledge wrote:
> >For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
> >    A  a  ...  B  b
> >Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:
> >    a A   ...  b B

> So which is correct?

Both.  Locale collating order is determined by the OS.  You cannot
rely on it, unless you set the LC_COLLATE variable to "C" or "POSIX",
in which case you get ASCII behavior (accented letters are not part
of the character set at all).

> Anyone wanting to reference an upper or lower case range
> [a-z] or [A-Z], is gonna hurt from this.

Correct.

imadev:~$ echo Hello World | tr 'A-Z' 'a-z'
hÉMMÓ wÓSMÐ
imadev:~$ echo Hello World | tr '[:upper:]' '[:lower:]'
hello world

You *cannot* use [a-z] or [A-Z] any more, except in the C/POSIX locale.
If you want to match lowercase characters, you should be using [[:lower:]],
and for uppercase characters, [[:upper:]].
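
A short sketch of the character-class approach recommended above, as it
might look inside a script. The function names to_lower and is_all_lower
are invented for the example; only tr(1) and the shell's own pattern
matching are used.

```shell
# Hypothetical helper: lowercase a string using character classes
# instead of the locale-sensitive A-Z / a-z ranges.
to_lower() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: succeed only if the argument is non-empty and
# consists entirely of characters the current locale calls lowercase.
is_all_lower() {
    case $1 in
        '' | *[![:lower:]]*) return 1 ;;
        *) return 0 ;;
    esac
}

to_lower 'Hello World'
is_all_lower hello && echo "hello: all lowercase"
is_all_lower Hello || echo "Hello: not all lowercase"
```

The classes follow LC_CTYPE rather than LC_COLLATE, so they stay
meaningful whatever collation order the locale defines.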

Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Aharon Robbins 5/21/12 12:27 PM
In article <mailman.1485.1337627992.855.bug-bash@gnu.org>,
Linda Walsh  <ba...@tlinx.org> wrote:
>Greg Wooledge wrote:
>
>> On Sun, May 20, 2012 at 11:36:35AM -0700, Linda Walsh wrote:
>
>> For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
>>     A  a  ...  B  b
>> Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:
>>     a A   ...  b B
>>
>> As you can see, the two en_US.iso88591 implementations are not the same.
>
>----
>        Great!...
>
>So which is correct?

Both!  Isn't this fun!  Current POSIX leaves this up to the implementation.
I believe that the Debian order is what earlier POSIX required.

>Anyone wanting to reference an upper or lower case range
>[a-z] or [A-Z], is gonna hurt from this.

This is why I started the Campaign For Rational Range Interpretation,
now part of gawk and I believe in the most recent grep also, which
returns us to the sane days of yesteryear, where [a-z] got only lowercase
letters and [A-Z] got only uppercase ones.

>My OS uses "en_US.UTF-8".

I personally have had

        export LC_ALL=C

in my .profile / .bashrc for many years now, to keep the behavior G-d
intended.

>You'd think unicode would have something to say about collation
>order that wouldn't allow such randomness, but maybe not.

It actually makes sense that it doesn't, since Unicode is more or less
a mapping of code points to glyphs, which is language independent. The
rules for collating depend upon the language.
--
Aharon (Arnold) Robbins                         arnold AT skeeve DOT com
P.O. Box 354                Home Phone: +972  8 979-0381
Nof Ayalon                Cell Phone: +972 50 729-7545
D.N. Shimshon 99785        ISRAEL
Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 12:37 PM


Greg Wooledge wrote:

> On Mon, May 21, 2012 at 12:19:26PM -0700, Linda Walsh wrote:
>> Greg Wooledge wrote:
>>> For instance, on HP-UX 10.20, in the en_US.iso88591 locale:
>>>    A  a  ...  B  b
>>> Meanwhile, on Debian 6.0, in the en_US.iso88591 locale:
>>>    a A   ...  b B
>
>> So which is correct?
>
> Both.  Locale collating order is determined by the OS.  You cannot
> rely on it, unless you set the LC_COLLATE variable to "C" or "POSIX",
> in which case you get ASCII behavior (accented letters are not part
> of the character set at all).
>
>> Anyone wanting to reference an upper or lower case range
>> [a-z] or [A-Z], is gonna hurt from this.
>
> Correct.

----
        This is a prime example of Posix being stupid and bad for
computer science.

        They take a deterministic behavior and define it to be
non-deterministic and break 1000's of programs.

        They cannot justify this... as they are supposed to document
current practice -- which has never been to consider the interpretation
of a-z/A-Z as random!

        Thus they are violating their own rules!    How can anyone follow
such lame directions?  Who in their right mind would have voted to
make ranges "worthless"....i.e. -- established, standard practice has never
been for such ranges to be worthless -- yet that is exactly what they
voted for.

        How is posix following its own rules?   If they don't follow
their own rules -- how can anyone be following these new specifications
which are obviously in conflict with established implementation?





Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Chet Ramey 5/21/12 12:41 PM
On 5/21/12 3:27 PM, Aharon Robbins wrote:

>> So which is correct?
>
> Both!  Isn't this fun!  

What's most fun is having this discussion over and over again, each time as
if it were the first.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    ch...@case.edu    http://cnswww.cns.cwru.edu/~chet/

Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Chet Ramey 5/21/12 12:42 PM
On 5/21/12 3:27 PM, Aharon Robbins wrote:

> This is why I started the Campaign For Rational Range Interpretation,
> now part of gawk and I believe in the most recent grep also, which
> returns us to the sane days of yesteryear, where [a-z] got only lowercase
> letters and [A-Z] got only uppercase ones.

The next version of bash will have a shell option to enable this behavior.
It's in the development snapshots if anyone wants to try it out now.

Chet
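
For readers trying this later: as far as I can tell, the option described
above is the one that shipped as `globasciiranges` in bash 4.3. That
identification is an assumption on my part, not something stated in this
message. A minimal sketch, with in_lower_range as a made-up helper name:

```shell
# Hypothetical helper: test a single character against the [a-z] range
# using the shell's own pattern matching.
in_lower_range() {
    case $1 in
        [a-z]) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: bash 4.3 or later. On shells without the option the error is
# ignored and the range keeps its locale-dependent meaning.
shopt -s globasciiranges 2>/dev/null || true

in_lower_range q && echo "q is in [a-z]"
in_lower_range Q || echo "Q is not in [a-z]"
```

With the option enabled, bracket ranges in patterns are compared as in the
C locale, so Q should fall outside [a-z] regardless of LC_COLLATE.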
Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Chet Ramey 5/21/12 12:46 PM
On 5/21/12 3:37 PM, Linda Walsh wrote:

> ----
>     This is a prime example of Posix being stupid and bad for
> computer science.
>
>     They take a deterministic behavior and define it to be
> non-deterministic and break 1000's of programs.

Try being a little less English-centric.  Collating order varies by
language.

Posix says that ranges work the way you are used to if you force the
traditional ordering using the `C' or `Posix' locale.  Take a deep
breath and use LC_ALL=C in your scripts to avoid depending on whatever
your OS uses as the default.
Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 12:51 PM


Chet Ramey wrote:

> On 5/21/12 3:37 PM, Linda Walsh wrote:
>
>> ----
>>     This is a prime example of Posix being stupid and bad for
>> computer science.
>>
>>     They take a deterministic behavior and define it to be
>> non-deterministic and break 1000's of programs.
>
> Try being a little less English-centric.  Collating order varies by
> language.
>
> Posix says that ranges work the way you are used to if you force the
> traditional ordering using the `C' or `Posix' locale.  Take a deep
> breath and use LC_ALL=C in your scripts to avoid depending on whatever
> your OS uses as the default.



FWIW, I put LC_COLLATE='C' in my System startup scripts...

So I'm NOT being bitten by this problem directly... I'm trying to
figure out how the heck this got voted on in POSIX, to be a correct
standard such that it would break current programs...

POSIX is not supposed to be prescriptive -- but **descriptive**...

I can't think of anywhere that a-z or A-Z would have included letters
from the opposite case... so how did POSIX come to *prescribe* that this
be the case... since I can't see that as being descriptive.



Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Sven Mascheck 5/21/12 12:55 PM
On Mon, May 21, 2012 at 03:46:00PM -0400, Chet Ramey wrote:

> Posix says that ranges work the way you are used to if you force the
> traditional ordering using the `C' or `Posix' locale.  Take a deep
> breath and use LC_ALL=C in your scripts to avoid depending on whatever
> your OS uses as the default.

I'd suggest using "only" LC_COLLATE=C to keep other important features
like e.g. printability (LC_CTYPE) or localized messages (LC_MESSAGES).
Or has anybody ever noticed problems with this approach?

Sven
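
One thing to watch with the LC_COLLATE-only approach: POSIX gives LC_ALL
precedence over the individual LC_* variables, which in turn take
precedence over LANG, so a set LC_ALL silently defeats LC_COLLATE=C. A
rough sketch of a check a script could make; effective_collate is a
hypothetical helper name, and it only inspects the variables rather than
asking the C library.

```shell
# Hypothetical helper: report which variable decides collation, following
# the precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: the default is implementation-defined.
        printf '%s\n' '(implementation default)'
    fi
}

LC_COLLATE=C
export LC_COLLATE
if [ "$(effective_collate)" != "$LC_COLLATE" ]; then
    echo "warning: LC_ALL=${LC_ALL:-} overrides LC_COLLATE=$LC_COLLATE" >&2
fi
```

If the warning fires, the script either has to set LC_ALL=C after all or
unset LC_ALL and accept that the other categories fall back to their own
variables or to LANG.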

Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 12:56 PM


Aharon Robbins wrote:


>> You'd think unicode would have something to say about collation
>> order that wouldn't allow such randomness, but maybe not.
>
> It actually makes sense that it doesn't, since Unicode is more or less
> a mapping of code points to glyphs, which is language independant. The
> rules for collating depend upon the language.

----
        Actually it makes sense that it would... and it DOES.

Bash and POSIX would appear to violate Unicode ordering.

Algorithm reference: http://www.unicode.org/reports/tr10/tr10-24.html

Collation order chart:
http://www.unicode.org/Public/UCA/latest/allkeys.txt

----
It shows that cap A-Z come before a-z.

If a sort order is set (as mine used to be) to
en_US.UTF-8

then the Unicode sort order should have been used.

In bash, it was not.

It sounds like this is a bug rooted in the C libraries?




Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Eric Blake 5/21/12 1:39 PM
On 05/21/2012 01:51 PM, Linda Walsh wrote:

> POSIX is not supposed to be prescriptive -- but **descriptive**...
>
> I can't think of anywhere that a-z or A-Z would have included letters
> from the opposite case... so how did POSIX come to *prescribe* that this
> be the case... since I can't see that as being descriptive.

POSIX 1992 was the culprit that prescribed that [A-Z] must be in
collation order across all locales, but without giving good guidance on
how to write a collation sequence, and without defining a C function to
easily get at that collation ordering.  And remember, 20 years ago when
POSIX 1992 was written, there was very little implementation experience
with internationalization, compared to what has happened in the meantime
(that was back when Unicode was brand new, and most users still had
single-byte locales or used shift-lock encodings like Big5).  It is
possible to write a locale definition where [A-Z] gives only upper-case
letters while still providing case-insensitive sorting, but not all
locale writers know how to do this (even now in 2012, while most glibc
locales have been corrected in this manner, there still exist several
glibc locales that aren't written very well - the complication stems
from the fact that your locale file becomes exponentially harder to
write: instead of having a single upper and lower case rule, you have to
have one rule per letter, with rules intermixed in a different order).
As soon as people started obeying POSIX 1992 to the letter, and
realizing that range expressions had unusual semantics as a result of
the 1992 specification, POSIX 2001 quickly reverted things, but by then,
the cat was out of the bag.  POSIX 2001 had to continue to allow
existing implementations, by stating that range expressions in anything
but the C locale are explicitly undefined.

There is currently a movement under way to introduce 'Rational Range
Interpretation' (RRI), where [A-Z] means the 26 uppercase letters across
ALL locales, by omitting all accented letters and ignoring collation
ordering.  Since POSIX 2001 and later allow this behavior, it is gaining
traction - already, GNU sed, GNU grep, and GNU awk have had patches
applied or under consideration to introduce this consistent behavior.
Search those mailing list archives if you want more details.  Gnulib has
already had patches as part of this movement, and GNU coreutils and bash
should be picking up on these improvements in a future version; we also
hope to get glibc to agree to them.  In other words, we recognize that
this is an issue, and eventually, we _do_ want to reach the point where
all GNU tools use RRI, since POSIX 2001 already allows RRI as part of
its recognition that the decision made in POSIX 1992 causes pain when
coupled with poorly-written locale definitions.

For example, here is an RRI patch for gnulib:
https://lists.gnu.org/archive/html/bug-gnulib/2012-04/msg00185.html

--
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 2:02 PM


Eric Blake wrote:

> On 05/21/2012 01:51 PM, Linda Walsh wrote:
>
>> POSIX is not supposed to be prescriptive -- but **descriptive**...
>>
>> I can't think of anywhere that a-z or A-Z would have included letters
>> from the opposite case... so how did POSIX come to *prescribe* that this
>> be the case... since I can't see that as being descriptive.
>
> POSIX 1992 was the culprit that proscribed that [A-Z] must be in
> collation order across all locales,.....



> realizing that range expressions had unusual semantics as a result of
> the 1992 specification, POSIX 2001 quickly

---
[quickly?!  9 years later?!   *cough*]

> reverted things, but by then,
> the cat was out of the bag.  POSIX 2001 had to continue to allow
> existing implementations, by stating that range expressions in anything
> but the C locale are explicitly undefined.

---------------------


        Explicitly undefined?   Or locale dependent?

        I.e. Unicode does specify ordering, so if your locale is set
to UTF-8 character encoding, then it is explicitly defined.  This would
seem to be in conflict with unicode -- and any implementation claiming
to be unicode compatible MUST use unicode ordering when the local character
set is defined to be Unicode.

        This doesn't conflict with Posix, as Posix doesn't define an order
for such -- but a different standard, (Unicode) does specify a standard.  So
for those using UTF-8, shouldn't that have made the order randomization 'moot'?


Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Chet Ramey 5/21/12 2:12 PM
On 5/21/12 5:02 PM, Linda Walsh wrote:

>     I.e. Unicode does specify ordering, so if your locale is set
> to UTF-8 character encoding, then it is explicitly defined.  This would
> seem to be in conflict with unicode -- and any implementation claiming
> to be unicode compatible MUST use unicode ordering when the local character
> set is defined to be Unicode.
>

Then write the locale definitions that way -- that's Eric's point.
Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Eric Blake 5/21/12 2:14 PM
On 05/21/2012 03:02 PM, Linda Walsh wrote:

>> the cat was out of the bag.  POSIX 2001 had to continue to allow
>> existing implementations, by stating that range expressions in anything
>> but the C locale are explicitly undefined.
>
> ---------------------
>
>
>     Explicitly undefined?   Or locale dependent?

POSIX explicitly undefined ranges for all but the C locale.  _Other
standards_, such as Unicode, are free to add range requirements on top
of what POSIX requires, but alas, Unicode collation order does NOT
currently specify anything about regular expression or glob range
matching, so it is out of scope for Unicode to say what [A-Z] expands to.

>
>     I.e. Unicode does specify ordering, so if your locale is set
> to UTF-8 character encoding, then it is explicitly defined.  This would
> seem to be in conflict with unicode -- and any implementation claiming
> to be unicode compatible MUST use unicode ordering when the local character
> set is defined to be Unicode.

Unicode may specify collation ordering, but it does NOT specify regular
expression range ordering.

>
>     This doesn't conflict with Posix, as Posix doesn't define an order
> for such -- but a different standard, (Unicode) does specify a
> standard.  So
> for those using UTF-8, shouldn't that have made the order randomization
> 'moot'?

Wishing doesn't make it so.  The fact is that regular expression ranges
are currently unspecified in all but the C locale; the RRI project is
attempting to make it sane across all locales within the scope of GNU
programs, but it takes time to write and approve the patches necessary
to get to that point.
Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 4:42 PM


Eric Blake wrote:

> On 05/21/2012 03:02 PM, Linda Walsh wrote:
>
>>> the cat was out of the bag.  POSIX 2001 had to continue to allow
>>> existing implementations, by stating that range expressions in anything
>>> but the C locale are explicitly undefined.
>> ---------------------
>>
>>
>>     Explicitly undefined?   Or locale dependent?
>
> POSIX explicitly undefined ranges for all but the C locale.  _Other
> standards_, such as Unicode, are free to add range requirements on top
> of what POSIX requires, but alas, Unicode collation order does NOT
> currently specify anything about regular expression or glob range
> matching, so it is out of scope for Unicode to say what [A-Z] expands to.


----

I think this is the problem.

A-Z in regular expressions is defined to expand to those characters
that are _in collating order_, >A, and <Z...

Without a collating order that expression in RE's would never have made any
sense.  It requires a collating order and is dependent on it.

If there is no collating order, then you cannot expand A-Z, but if there
is, you expand it to the values between A-Z that are in the collating order.

The regex(7) man page says that [xx-xx] uses **collating order**:

        If two characters in the list are separated by '-', this is
        shorthand for the full range of characters between those two
        (inclusive) in the collating sequence, for example, "[0-9]" in
        ASCII matches any decimal digit.

----
Seems pretty clear -- regex's aren't exempt from collating order, they depend on
it...





Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Eric Blake 5/21/12 7:02 PM
On 05/21/2012 05:42 PM, Linda Walsh wrote:

>> POSIX explicitly undefined ranges for all but the C locale.  _Other
>> standards_, such as Unicode, are free to add range requirements on top
>> of what POSIX requires, but alas, Unicode collation order does NOT
>> currently specify anything about regular expression or glob range
>> matching, so it is out of scope for Unicode to say what [A-Z] expands to.
>
>
> ----
>
> I think this is the problem.
>
> A-Z in regular expressions is defined to expand to those characters
> that are _in collating order_, >A, and <Z...

Only in POSIX 1992 or in the C locale.  In POSIX 2001 and POSIX 2008,
and non-C locales, [A-Z] is explicitly undefined, because the definition
of characters in collating order between A and Z did not work out.

>
> Without a collating order that expression in RE's would never have made any
> sense.  It requires a collating order and is dependent on it.

They still don't make any sense in any locale except C, because POSIX no
longer requires collating order.

> The regex(7) man page says that [xx-xx] uses ***collating order**::

The regex(7) man page _of which system_?  Just because _some_ systems
(like glibc, picking the POSIX 1992 semantics) have well-defined
semantics, doesn't mean that all systems have those same semantics.
According to POSIX, you cannot portably assume ANY semantics for ranges
except in the C locale.  And if RRI gains traction, that means that you
can assume ASCII collation, across ALL locales, but this is a different
order than collation of a specific locale, and it is also a GNU
extension not guaranteed by POSIX.

> ----
> Seems pretty clear -- regex's aren't exempt from collating order, they
> depend on it...

Only on platforms where libc has chosen to provide an extension beyond
POSIX, and where GNU programs have not further overridden things to
avoid the unexpected glibc semantics.
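
Since ranges are only portable in the C locale, one way to use them
safely in a regular expression is to force that locale for the single
command that needs it. A minimal sketch; count_upper_initial is an
invented helper name, and the input words are just sample data.

```shell
# Hypothetical helper: count input lines that start with an ASCII
# uppercase letter. LC_ALL=C applies to this one grep invocation only,
# so the rest of the script keeps its locale.
count_upper_initial() {
    LC_ALL=C grep -c '^[A-Z]'
}

printf '%s\n' Alpha beta Gamma delta | count_upper_initial   # prints 2
```

Setting the variable per command avoids changing message language or
character classification for everything else the script does.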
Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 8:15 PM


Eric Blake wrote:

> On 05/21/2012 05:42 PM, Linda Walsh wrote:
>

> Only in POSIX 1992 or in the C locale.  In POSIX 2001 and POSIX 2008,
> and non-C locales, [A-Z] is explicitly undefined,

==============
ONLY under POSIX...

You may not believe this, but there are other standards than POSIX.

More than one distro and company claims unicode compliance, to V6.x.

Any programs or collections that do claim compat, need to adhere to the
stricter of the standards, which is Unicode, in this case.

POSIX only applies if you are going for the bottom of the barrel in this
case.



Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Linda Walsh 5/21/12 8:19 PM


Eric Blake wrote:

> They still don't make any sense in any locale except C, because POSIX no
> longer requires collating order.
>
>> The regex(7) man page says that [xx-xx] uses ***collating order**::
>
> The regex(7) man page _of which system_?  Just because _some_ systems
> (like glibc, picking the POSIX 1992 semantics) have well-defined
> semantics, doesn't mean that all systems have those same semantics.
> According to POSIX, you cannot portably assume ANY semantics for ranges
> except in the C locale.  And if RRI gains traction, that means that you
> can assume ASCII collation, across ALL locales, but this is a different
> order than collation of a specific locale, and it is also a GNU
> extension not guaranteed by POSIX.

===
        Well, that would be nice, but if Unicode takes off, *cough*,
and anyone claims unicode compliance (isn't UTF-8 the standard for HTML5
and XML?), they are also guaranteed ordering -- full ordering for the full
Unicode character set.

        It would be VERY GOOD if RRI didn't come up with an order that
was DIFFERENT from that prescribed by Unicode -- otherwise that could open
another can of worms.






Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z? Raphaël Droz 5/25/12 1:41 AM
On Mon, May 21, 2012 at 12:19:26PM -0700, Linda Walsh wrote:
> So which is correct?
>
> Anyone wanting to reference an upper or lower case range
> [a-z] or [A-Z], is gonna hurt from this.
>
> My OS uses "en_US.UTF-8".

I don't remember this bug having been cited here:
https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687

About adding "export LC_COLLATE=C" to /etc/skel/.bashrc & co
