UTF-8 character order bug?

Colin B.

May 7, 2008, 1:22:36 PM
So in keeping with the "future-thinking" way of doing things, I thought
I'd switch my locale to UTF-8. Sadly, it breaks a number of things.

I can live (unhappily) with having my upper and lower case filenames
interspersed, but that ordering causes some interesting problems.
Specifically:

Letters are ordered lower case, then upper case, for each letter in turn.
That is, the ordering sequence is aAbBcC...zZ.

Now consider what this means in doing an ls:

#ls [A-Z]*

One would intuitively expect this to give all files starting with a
capital letter, which it does in the C (= POSIX) locales. In Unicode
though, it doesn't. Worse, it doesn't give you what you probably _think_
it does. Look at that list again, and then expand the [A-Z] expression.
(Also, do it with the lower-case equivalent.)

Order: aAbBcCdD...zZ
[A-Z]:  ************
[a-z]: ************

So [A-Z] gets 51 of the 52 mixed-case letters, missing "a"! Similarly,
[a-z] captures 51, but misses Z. The "correct" range to capture all of
the characters is the rather counterintuitive [a-Z].
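
The counts above can be checked with a small sketch. It only models the interleaved order as a plain string (the variable `order` and the helper `span` are hypothetical names for illustration) and does not consult any real locale, so it shows the arithmetic rather than what a given system's collation tables actually contain.

```shell
#!/bin/sh
# Hypothetical model of the aAbB...zZ collation order described above.
order=aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

# span FROM TO: print every letter from FROM through TO in that order.
span() {
    tail=${order#*"$1"}      # everything after the first FROM
    mid=${tail%%"$2"*}       # drop everything from the first TO onward
    printf '%s\n' "$1$mid$2" # put both endpoints back
}

upper=$(span A Z)
lower=$(span a z)
both=$(span a Z)

echo "[A-Z] covers ${#upper} letters: $upper"
echo "[a-z] covers ${#lower} letters: $lower"
echo "[a-Z] covers ${#both} letters: $both"
```

In this model `[A-Z]` starts at the second position and so skips "a", while `[a-z]` stops one short of the end and so skips "Z".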

Basically, range expansion in filename globbing NO LONGER MIMICS REGULAR
EXPRESSIONS under Unicode. This in my mind is bad. The 'drop one character'
behavior of [A-Z] and [a-z] is worse.

I then turned to ISO-8859-1 as an alternative. With some testing, I thought
I was getting totally different results from EITHER of the above options,
but as it turns out, it behaves identically to UTF-8. My mysterious results
were actually a bug in Solaris. I'll include the results here, though, for
reasons to be seen.

Examine the following case. (Note that I'm using LANG as a proxy for
LC_COLLATE or LC_ALL here for simplicity. It makes no difference to the
results.)

#touch A AA Aa B Z a aA aa b z
#export LANG=C

#LANG=C ls
A AA Aa B Z a aA aa b z
#LANG=en_CA.UTF-8 ls
a A aa aA Aa AA b B z Z
#LANG=en_CA.ISO8859-1 ls
a A aa aA Aa AA b B z Z

So far this matches what we expect, given the above claims. Now let's
try some ranges.

#LANG=C
#LANG=C ls [A-Z]*
A AA Aa B Z
#LANG=en_CA.UTF-8 ls [A-Z]*
A Aa AA B Z
#LANG=en_CA.ISO8859-1 ls [A-Z]*
A Aa AA B Z

WHAT??!!! That's not what we saw earlier! But curiously, if you
permanently set the LANG variable, it works (for certain definitions of
the word "works"):

#LANG=C; ls [A-Z]*
A AA Aa B Z
#LANG=en_CA.UTF-8; ls [A-Z]*
A Aa AA b B z Z
#LANG=en_CA.ISO8859-1; ls [A-Z]*
A Aa AA b B z Z
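
One possible explanation for the difference between the two forms, which the thread itself does not confirm: in a POSIX shell, the words of a command are expanded by the shell before a prefix assignment such as `LANG=<locale>` takes effect for that command. The glob `[A-Z]*` would then be expanded in the shell's own locale, and `ls` would only sort the resulting names in the new one. The sketch below shows the same timing with an ordinary variable; `GREETING` is a hypothetical stand-in for LANG, and no locale is involved.

```shell
#!/bin/sh
# GREETING is a hypothetical stand-in for LANG.
GREETING=outer

# Prefix form: the shell expands "$GREETING" first, using its own value,
# and only then runs the command with GREETING=inner in its environment.
prefix_result=$(GREETING=inner printf '%s\n' "$GREETING")

# Separate assignment: the shell variable changes first, so the later
# expansion sees the new value.
GREETING=inner
set_result=$(printf '%s\n' "$GREETING")

echo "prefix form expanded to:   $prefix_result"
echo "assigned form expanded to: $set_result"
```

If that is what is happening here, `LANG=<locale> ls [A-Z]*` matches names under the shell's current locale, while `LANG=<locale>; ls [A-Z]*` matches them under the new one.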

So in short:

1) UTF-8 orders the alphabet aAbB...zZ, which defies 'expected' behaviour
in ranges
2) The following two lines behave differently
#LANG=<locale> ls
#LANG=<locale> ls [<range>]
3) en_CA.UTF-8 and en_CA.ISO8859-1 behave identically

Now all of this is on Solaris 10 (08/07, Generic_127111-06, locale patch
119397-07). For the sake of completeness, I thought I'd look at Solaris 9.
The first thing I find is that UTF-8 isn't present. Fine, I can happily
use ISO8859-1.

Using the same dataset:

#export LANG=C

#ls
A AA Aa B Z a aA aa b z
#LANG=en_CA.ISO8859-1 ls
A a AA Aa aA aa B b Z z

VERY interesting!!! The ordering is still in letter pairs, but upper
case is ahead of lower, opposite of Solaris 10!

#ls [A-Z]*
A AA Aa B Z
#LANG=en_CA.ISO8859-1 ls [A-Z]*
A AA Aa B Z

Same as we saw in Solaris 10.

#LANG=en_CA.ISO8859-1; ls [A-Z]*
A AA Aa B Z

Now THIS is different! Solaris 9 apparently does what we 'want' it to do.
Let's test this further:

#LANG=en_CA.ISO8859-1
#locale
LANG=en_CA.ISO8859-1
LC_CTYPE="en_CA.ISO8859-1"
LC_NUMERIC="en_CA.ISO8859-1"
LC_TIME="en_CA.ISO8859-1"
LC_COLLATE="en_CA.ISO8859-1"
LC_MONETARY="en_CA.ISO8859-1"
LC_MESSAGES="en_CA.ISO8859-1"
LC_ALL=
#ls [A-Z]*
A AA Aa B Z
#ls [a-z]*
a aA aa b z

Yep. That's exactly how Solaris 10 spectacularly fails to work. Solaris 9
got it right, Solaris 10 broke it.

The real summary here is that locales in Solaris 10 are broken in several
ways. The ordering is different, the variable parsing is inconsistent,
and ranges don't work properly.

Whew. Can we get an amen?

Colin

Chris Ridd

May 7, 2008, 1:32:55 PM
On 2008-05-07 18:22:36 +0100, "Colin B." <cbi...@somewhereelse.shaw.ca> said:

> Basically, range expansion in filename globbing NO LONGER MIMICS REGULAR
> EXPRESSIONS under Unicode. This in my mind is bad. The 'drop one character'
> behavior of [A-Z] and [a-z] is worse.

Which is why there's a POSIX spec (I think 1003.2) for
internationalized regexes, which do "the right thing". See regex(5) on
a Solaris 10 box.
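
Assuming the "right thing" refers to POSIX character classes such as `[[:upper:]]` (the post does not name them), a minimal sketch of using one in a shell pattern instead of a range follows. The helper `starts_upper` is a hypothetical name, and the sample names are the ones from the test directory earlier in the thread; matching is done with `case` so that no files need to exist.

```shell
#!/bin/sh
# Same sample names as the test directory used earlier in the thread.
names="A AA Aa B Z a aA aa b z"

# starts_upper NAME: succeed if NAME begins with an upper-case letter,
# using a character class rather than the range [A-Z].
starts_upper() {
    case $1 in
        [[:upper:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

matched=
for name in $names; do
    if starts_upper "$name"; then
        matched="${matched:+$matched }$name"
    fi
done
echo "$matched"
```

The same class can be used in a filename glob, as in `ls [[:upper:]]*`, in a shell that implements POSIX pattern matching. A class asks whether a character is upper case rather than where it sorts, so it should not depend on the collation order.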

Cheers,

Chris
