: billg@rh7.2:~$ mkdir testing
: billg@rh7.2:~$ cd testing
: billg@rh7.2:~/testing$ touch a b c d e z A B C D E Z
: billg@rh7.2:~/testing$ /bin/ls [a-z]
: a A b B c C d D e E z
: billg@rh7.2:~/testing$ /bin/ls [A-Z]
: A b B c C d D e E z Z
: billg@rh7.2:~/testing$ /bin/ls [abcdez]
: a b c d e z
: billg@rh7.2:~/testing$ /bin/ls [ABCDEZ]
: A B C D E Z
: billg@rh7.2:~/testing$ csh
: [billg@rh7.2 ~/testing]$ /bin/ls [a-z]
: a b c d e z
: [billg@rh7.2 ~/testing]$ /bin/ls [A-Z]
: A B C D E Z
It has been noticed before, and more fully described, in 1999:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=m3g0x2vsi2.fsf%40maya.linux.ca
Does this affect other distributions? (The above in SunOS/Solaris bash
works the way nature intended.) Is it a bug, a feature, or a
configuration effect?
Its effects are more, ah, annoying when rm is used instead of ls ...
--
... I imagine
> Does this affect other distributions? (The above in SunOS/Solaris bash
> works the way nature intended.) Is it a bug, a feature, or a
> configuration effect?
It is very annoying. Some clowns decided that case should be ignored
when doing a lexical sort. Real stupid, and totally impractical.
But I guess Windows people would be thrilled with it.
Set LC_COLLATE=C in your environment and lead a normal
shell life.
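For anyone wanting to try that, here is a minimal sketch, not a definitive recipe. It assumes a scratch directory from mktemp and the file names from the original post; the variable names are invented for illustration. One catch: the shell expands the glob before a per-command assignment takes effect, so "LC_COLLATE=C ls [a-z]" typed on one line would not change the match. The locale has to be in place before the shell doing the expansion sees the pattern, which is why a child shell is used below. LC_ALL is set because it overrides LC_COLLATE when both are present.

```shell
# Scratch directory holding the same file names as the original post.
demo_dir=$(mktemp -d)
( cd "$demo_dir" && touch a b c d e z A B C D E Z )

# The child shell starts with LC_ALL=C already in its environment,
# so its glob expansion uses the C (byte-order) collation.
lower=$(cd "$demo_dir" && LC_ALL=C sh -c 'echo [a-z]')
upper=$(cd "$demo_dir" && LC_ALL=C sh -c 'echo [A-Z]')

echo "$lower"
echo "$upper"

rm -rf "$demo_dir"
```

To make the setting stick for an interactive session, the usual approach is to export the variable from a shell startup file.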
>
> Its effects are more, ah, annoying when rm is used instead of ls ...
>
I reckon MS have infiltrated Posix.
> In article <6e793af4.02112...@posting.google.com>, "Chris Boyd"
> <chrisb...@netscape.net> wrote:
>
>> Does this affect other distributions? (The above in SunOS/Solaris bash
>> works the way nature intended.) Is it a bug, a feature, or a
>> configuration effect?
>
> It is very annoying. Some clowns decided that case should be ignored when
> doing a lexical sort.
Presumably those working for the ISO in Switzerland, who developed ISO
14651.
> Real stupid, and totally impractical.
>
> But I guess Windows people would be thrilled with it.
Looks like it, doesn't it?
> Set LC_COLLATE=C in your environment and lead a normal shell life.
Normal maybe, but "Non-Standard" too. If bash didn't work this way (or
rather glibc) then _someone_ would be chirping up stating that "Linux
doesn't follow agreed international standards". :(
The key, of course, is better documentation, references in places like the
bash manpages to said documentation, and notes on how to make these things
behave in ways a seasoned UNIX user would expect. Principle of least
surprise and all that (which is probably why this behaviour is the default
- for non-computer folk, this is probably the way they would expect
sorting order to work).
>> Its effects are more, ah, annoying when rm is used instead of ls ...
> I reckon MS have infiltrated Posix.
Sadly, this is a religious issue like big endian vs. little endian; it's
hard to convincingly prove that one way is the _right_ way, so you need to
pick a default and hope that the majority aren't offended by your choice.
Best Regards,
Alex.
--
Alex Butcher Brainbench MVP for Internet Security: www.brainbench.com
Bristol, UK Need reliable and secure network systems?
PGP/GnuPG ID:0x271fd950 <http://www.assursys.com/>
Also, it's not `case should be ignored', it's `sorting should follow
collation order', which for English is dictionary order.
>> Real stupid, and totally impractical.
>>
>> But I guess Windows people would be thrilled with it.
>
> Looks like it, doesn't it?
I can't see why anyone would be thrilled with collation order applying
to character classes in regexes and globs. Pretty silly.
>> Set LC_COLLATE=C in your environment and lead a normal shell life.
>
> Normal maybe, but "Non-Standard" too. If bash didn't work this way (or
It's perfectly standard; LC_COLLATE is in POSIX.
--
`I keep hearing about SF writers dying, but I never hear about SF
writers being born. So I guess eventually there'll be none left.'
-- Keith F. Lynch
> On Wed, 27 Nov 2002, Alex Butcher spake:
>> On Thu, 28 Nov 2002 00:21:17 +1100, Brian Patil wrote:
>>
>>> In article <6e793af4.02112...@posting.google.com>, "Chris
>>> Boyd" <chrisb...@netscape.net> wrote:
>>>
>>>> Does this affect other distributions? (The above in SunOS/Solaris
>>>> bash works the way nature intended.) Is it a bug, a feature, or a
>>>> configuration effect?
>>>
>>> It is very annoying. Some clowns decided that case should be ignored
>>> when doing a lexical sort.
>>
>> Presumably those working for the ISO in Switzerland, who developed ISO
>> 14651.
>
> Also, it's not `case should be ignored', it's `sorting should follow
> collation order', which for English is dictionary order.
>
>>> Real stupid, and totally impractical.
>>>
>>> But I guess Windows people would be thrilled with it.
>>
>> Looks like it, doesn't it?
>
> I can't see why anyone would be thrilled with collation order applying
> to character classes in regexes and globs. Pretty silly.
I can see their logic; if you perform "ls [a-z]*", you're actually asking
for a list of files whose first character is in the collation of letters
between a and z (refer to your first point above). This is not the same as
ASCII order, nor necessarily expected to be, unless you know ASCII and
traditional (non-i18n'ed) UNIX.
Other options would be:-
- ignore locale settings entirely when dealing with globs and suchlike.
Inconsistent with the rest of the system then.
- default to LC_COLLATE=C and get complaints from folks expecting it to be
the other way round.
Essentially, it's an arbitrary decision, with no overwhelming argument for
one behaviour in preference to the others, IMHO. Especially seeing as the
OP can simulate what they wanted (not what they asked for) by using 'ls
[abcde]' and suchlike (thereby specifying the range they want less
ambiguously).
>>> Set LC_COLLATE=C in your environment and lead a normal shell life.
>>
>> Normal maybe, but "Non-Standard" too. If bash didn't work this way (or
>
> It's perfectly standard; LC_COLLATE is in POSIX.
That's not the point I was making. The point I was making was that if
LC_COLLATE defaulted to the C locale, rather than en_GB, there'd just be a
different set of people complaining it didn't work "properly" (for their
value of "properly").
Documentation, documentation, documentation. ;-)
> I can see their logic; if you perform "ls [a-z]*", you're actually
> asking for a list of files whose first character is in the collation of
> letters between a and z (refer to your first point above). This is not
> the same as ASCII order, nor necessarily expected to be, unless you know
> ASCII and traditional (non-i18n'ed) UNIX.
I expect the ordering to be in the traditional ASCII order.
Because I see A-Z and a-z as two distinct groups, uppercase
characters and lowercase characters. Then alphabetical order
is subservient to that grouping.
My brain doesn't like a jumbled mix of uppercase and lowercase.
>
> Essentially, it's an arbitrary decision, with no overwhelming argument
> for one behaviour in preference to the others, IMHO. Especially seeing
> as the OP can simulate what they wanted (not what they asked for) by
> using 'ls [abcde]' and suchlike (thereby specifying the range they want
> less ambiguously).
The ability to change this easily is bound to cause
the occasional problem in shell scripts etc.
But some clear documentation about this subject seems
to be absent from most distributions.
>
>>>> Set LC_COLLATE=C in your environment and lead a normal shell life.
>>>
>>> Normal maybe, but "Non-Standard" too. If bash didn't work this way (or
>>
>> It's perfectly standard; LC_COLLATE is in POSIX.
>
> That's not the point I was making. The point I was making was that if
> LC_COLLATE defaulted to the C locale, rather than en_GB, there'd just be
> a different set of people complaining it didn't work "properly" (for
> their value of "properly").
Maybe the system installs should ask, "Are you migrating from Windows
and do you want to pretend you are still on Windows?" and set
the locale accordingly.
> In article <10384773...@ananke.eclipse.net.uk>, "Alex Butcher"
> <alex.butch...@assursys.co.uk> wrote:
>
>> I can see their logic; if you perform "ls [a-z]*", you're actually
>> asking for a list of files whose first character is in the collation of
>> letters between a and z (refer to your first point above). This is not
>> the same as ASCII order, nor necessarily expected to be, unless you know
>> ASCII and traditional (non-i18n'ed) UNIX.
>
> I expect the ordering to be in the traditional ASCII order. Because I see
> A-Z and a-z as two distinct groups, uppercase characters and lowercase
> characters. Then alphabetical order is subservient to that grouping.
>
> My brain doesn't like a jumbled mix of uppercase and lowercase.
I understand exactly where you're coming from, having been playing around
with computers for about 20 years now. Indeed, now that I'm aware of the
issue, I'd be slightly inclined to set LC_COLLATE=C myself, except scripts
I wrote would then probably end up breaking on other systems
periodically... Instead, I think I'll walk away with the knowledge that
ranges are (possibly) ambiguously specified and that more specific regexps
should be used.
>> Essentially, it's an arbitrary decision, with no overwhelming argument
>> for one behaviour in preference to the others, IMHO. Especially seeing
>> as the OP can simulate what they wanted (not what they asked for) by
>> using 'ls [abcde]' and suchlike (thereby specifying the range they want
>> less ambiguously).
>
> The ability to change this easily is bound to cause the occasional problem
> in shell scripts etc. But some clear documentation about this subject
> seems to be absent from most distributions.
The documentation is sort of there, but you need to rifle around quite a
lot to find it. The current behaviour in bash obviously isn't expected
by all users and yet there's little mention of why, nor how to "fix" it.
>>>>> Set LC_COLLATE=C in your environment and lead a normal shell life.
>>>>
>>>> Normal maybe, but "Non-Standard" too. If bash didn't work this way (or
>>>
>>> It's perfectly standard; LC_COLLATE is in POSIX.
>>
>> That's not the point I was making. The point I was making was that if
>> LC_COLLATE defaulted to the C locale, rather than en_GB, there'd just be
>> a different set of people complaining it didn't work "properly" (for
>> their value of "properly").
>
> Maybe the system installs should ask, "Are you migrating from Windows and
> do you want to pretend you are still on Windows?" and set the locale
> accordingly.
Arrrrgh! ;-)
*the sound of hundreds of shellscripts breaking as they don't bother
checking the value of LC_COLLATE and thus doing the Wrong Thing*
Nice idea, but I think that would just make the problem worse. Having a
standard (broken though it might be in some people's eyes) is better than
having _no_ standard.
> I understand exactly where you're coming from, having been playing
> around with computers for about 20 years now. Indeed, now that I'm aware
> of the issue, I'd be slightly inclined to set LC_COLLATE=C myself,
> except scripts I wrote would then probably end up breaking on other
> systems periodically... Instead, I think I'll walk away with the
> knowledge that ranges are (possibly) ambiguously specified and that more
> specific regexps should be used.
You should write your scripts so that they don't break.
>> Maybe the system installs should ask, "Are you migrating from Windows
>> and do you want to pretend you are still on Windows?" and set the
>> locale accordingly.
>
> Arrrrgh! ;-)
>
> *the sound of hundreds of shellscripts breaking as they don't bother
> checking the value of LC_COLLATE and thus doing the Wrong Thing*
Shell scripts don't need to check LC_COLLATE.
That is handled by libc etc.
Isn't that just "old" flavours rather than traditional? i.e. who
doesn't do this stuff?
The biggest problem is some of us learnt to do things like "tr
'a-z' 'A-Z'", rather than "tr '[:lower:]' '[:upper:]'", and many of
the old scripts and examples teach it the old way.
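A small sketch of the class-based form, for comparison. The function name to_upper is made up for this example. The classes are quoted so the shell cannot treat them as glob patterns before tr sees them.

```shell
# Upper-case a string using POSIX character classes instead of a-z / A-Z.
to_upper() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

shouted=$(to_upper "hello world")
echo "$shouted"
```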
I think backward/forward compatibility has a lot to answer for,
and that a lot of non-English-speaking users would prefer ASCII
collating for such things. At the script level you want nicely
consistent behaviour and few bugs, to avoid automated chaos. What
happens in GUIs and in new tools isn't so crucial, but
retrofitting such changes is a recipe for chaos. For a long time
HP-UX had a bug in the iso8859-1 collating tables, so I figure
they weren't being extensively used at the time, or users were
confused (like I was) by the bug's results.
> In article <10384807...@ananke.eclipse.net.uk>, "Alex Butcher"
> <alex.butch...@assursys.co.uk> wrote:
>
>> I understand exactly where you're coming from, having been playing
>> around with computers for about 20 years now. Indeed, now that I'm aware
>> of the issue, I'd be slightly inclined to set LC_COLLATE=C myself,
>> except scripts I wrote would then probably end up breaking on other
>> systems periodically... Instead, I think I'll walk away with the
>> knowledge that ranges are (possibly) ambiguously specified and that more
>> specific regexps should be used.
>
> You should write your scripts so that they don't break.
Certainly, and I generally do write defensive code (be it in shell scripts
or otherwise) - but configuring my system to be different to ~99% of
systems out there is asking for trouble in my book.
>>> Maybe the system installs should ask, "Are you migrating from Windows
>>> and do you want to pretend you are still on Windows?" and set the
>>> locale accordingly.
>>
>> Arrrrgh! ;-)
>>
>> *the sound of hundreds of shellscripts breaking as they don't bother
>> checking the value of LC_COLLATE and thus doing the Wrong Thing*
>
> Shell scripts don't need to check LC_COLLATE. That is handled by libc etc.
*bzzt*
What if they do 'rm -f /tmp/[a-z]*'?
That's right - if you use ranges in a shell script, you need to either
explicitly set LC_COLLATE before you start (so rm does what you expect it
to do) or check LC_COLLATE and use the right range according to the user's
current locale.
This actually is the best reason I can think of for having LC_COLLATE
default to the C locale; old scripts written for traditional UNIX won't be
written to check or set LC_COLLATE and will expect the C locale. Newer
scripts written for i18n'ed UNIX should be written with the expectation
that the user may not be using the C locale and should check or set
LC_COLLATE accordingly.
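The "explicitly set it before you start" option might look like the sketch below. The scratch directory stands in for /tmp and the file names are invented. Pinning LC_ALL rather than only LC_COLLATE is a judgment call; it is used here because it overrides the other LC_* variables.

```shell
# Pin the locale for the remainder of this script so that [a-z]
# means the byte range a..z and nothing else.
LC_ALL=C
export LC_ALL

# Hypothetical scratch directory standing in for /tmp.
demo_dir=$(mktemp -d)
touch "$demo_dir"/a "$demo_dir"/b "$demo_dir"/A "$demo_dir"/B

# With the C locale in force, only the lowercase names are removed.
rm -f "$demo_dir"/[a-z]*

remaining=$(cd "$demo_dir" && echo *)
echo "$remaining"

rm -rf "$demo_dir"
```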
>
> *bzzt*
>
> What if they do 'rm -f /tmp/[a-z]*'?
>
> That's right - if you use ranges in a shell script, you need to either
> explicitly set LC_COLLATE before you start (so rm does what you expect
> it to do) or check LC_COLLATE and use the right range according to the
> user's current locale.
You should be able to use 'rm -f /tmp/[:lower:]*'.
But I wouldn't hold my breath. Unfortunately most shells
do their own ad-hoc parsing, unlike the relatively standardized
regex in libc.
>
> This actually is the best reason I can think of for having LC_COLLATE
> default to the C locale; old scripts written for traditional UNIX won't
> be written to check or set LC_COLLATE and will expect the C locale.
> Newer scripts written for i18n'ed UNIX should be written with the
> expectation that the user may not be using the C locale and should check
> or set LC_COLLATE accordingly.
I agree. While [:lower:] is unambiguous, [a-z] might become
ambiguous in some people's minds. I think LC_COLLATE should
be C and never anything else.
Actually I think the best solution would be for everyone to start
to use English and have their character sets a superset of ASCII.
That way everything is nice, clear and simple and we can get
on with doing productive things.
What with UTF-*, and Unicode-* --- what a crazy mess.
What have they been putting in the water over the last few years?
> In article <10385644...@ananke.eclipse.net.uk>, "Alex Butcher"
> <alex.butch...@assursys.co.uk> wrote:
>
>
>> *bzzt*
>>
>> What if they do 'rm -f /tmp/[a-z]*'?
>>
>> That's right - if you use ranges in a shell script, you need to either
>> explicitly set LC_COLLATE before you start (so rm does what you expect
>> it to do) or check LC_COLLATE and use the right range according to the
>> user's current locale.
>
> You should be able to use 'rm -f /tmp/[:lower:]*'. But I wouldn't hold
> my breath. Unfortunately most shells do their own ad-hoc parsing, unlike
> the relatively standardized regex in libc.
Oh dear:-
bash$ cd foo/
bash$ ls
a A b B c C d D e E z Z
bash$ ls [:lower:]*
e
bash$ ls [\:lower\:]*
e
It goes from bad to worse. ;-]
By my reading of the bash manpage, at least one of those should work.
A bar of Green & Black's Maya Gold to anyone who can point out what
I've done wrong! ;-)
>> This actually is the best reason I can think of for having LC_COLLATE
>> default to the C locale; old scripts written for traditional UNIX won't
>> be written to check or set LC_COLLATE and will expect the C locale.
>> Newer scripts written for i18n'ed UNIX should be written with the
>> expectation that the user may not be using the C locale and should
>> check or set LC_COLLATE accordingly.
>
> I agree. While [:lower:] is unambiguous, [a-z] might become ambiguous
> in some people's minds. I think LC_COLLATE should be C and never
> anything else.
>
> Actually I think the best solution would be for everyone to start to use
> English and have their character sets a superset of ASCII.
Uh huh. Tell that to some folks over the border in Wales and see what
reaction you get. ;-)
> That way everything is nice, clear and simple and we can get on with
> doing productive things.
>
> What with UTF-*, and Unicode-* --- what a crazy mess. What have they
> been putting in the water over the last few years?
I think the problem is that we're still in the transitional phase; ASCII
has been around as a published ANSI standard for 39 years now, and
finalised for 34 of those. The Unicode standard was first published in
1991. It seems understandable to me that we'll encounter a few problems
until Unicode is used everywhere that ASCII is used now. Give it 40 years
and we'll wonder what the fuss was all about.
The class specifier is [:lower:] which in turn needs to be inside [], so you
need
bash$ ls [[:lower:]]*
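A quick way to see the difference, as a sketch with a throwaway directory and file names loosely based on the earlier listing (the variable names are invented):

```shell
demo_dir=$(mktemp -d)
( cd "$demo_dir" && touch a A e E z Z )

# Class inside a bracket expression: names starting with a lowercase letter.
with_class=$(cd "$demo_dir" && echo [[:lower:]]*)

# Single brackets only: an ordinary set containing : l o w e r.
without_class=$(cd "$demo_dir" && echo [:lower:]*)

echo "$with_class"
echo "$without_class"

rm -rf "$demo_dir"
```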
--
Nigel Wade
> I think the problem is that we're still in the transitional phase; ASCII
> has been around as a published ANSI standard for 39 years now, and
> finalised for 34 of those. The Unicode standard was first published in
> 1991. It seems understandable to me that we'll encounter a few problems
> until Unicode is used everywhere that ASCII is used now. Give it 40
> years and we'll wonder what the fuss was all about.
Make that 150 years.
> The class specifier is [:lower:] which in turn needs to be inside [], so
> you need
>
> bash$ ls [[:lower:]]*
>
You learn something new every day!
But I doubt that I will ever use it on a daily basis.
True. Things like LC_COLLATE are currently the source of problems,
but a properly written regex (using POSIX character classes) will work
irrespective of locale, and adapt to the charset in use. It's not
really mentioned in the documentation that using ranges like a-zA-Z
([:alpha:]) etc. is *obsolete*!
--
Roger Leigh
"Liberty and Livelihood" - Support the Countryside Alliance
Printing on GNU/Linux? http://gimp-print.sourceforge.net/
GPG Public Key: 0x25BFB848 available on public keyservers
If you want to match a-i and L-Q, you have little choice but to use
character classes (and set LC_COLLATE).
What, not even in dictionaries?
I think the best argument for non-collation of globs (at least) is that
they're normally used to match filenames, and filenames are
case-sensitive entities, so it makes no sense to jumble up the cases.
[abcdefghiLMNOPQ]
?
Not pretty, I know, but very explicit.
--
/__
\_|\/
/\
>> If you want to match a-i and L-Q, you have little choice but to use
>> character classes (and set LC_COLLATE).
>
> [abcdefghiLMNOPQ]
> ?
>
> Not pretty, I know, but very explicit.
>
It will never catch on.
Looks like we will be stuck with dual concepts for
a very long time.
>> My brain doesn't like a jumbled mix of uppercase and lowercase.
>
> What, not even in dictionaries?
A capitalized word has the same dictionary meaning
as a non-capitalized one.
A capitalized filename is COMPLETELY DIFFERENT from one of
the same name in lowercase.
My brain can handle logic.
BP>
BP> A capitalized word has the same dictionary meaning
BP> as a non-capitalized one.
BP>
god versus God.
The use of the capital letter is explained in the Shorter Oxford
Dictionary.
BP> A capitalized filename is COMPLETELY DIFFERENT from one of
BP> the same name in lowercase.
BP>
BP> My brain can handle logic.
BP>
Alan
( If replying by mail, please note that all "sardines" are canned.
There is also a password autoresponder but a "tuna" will swim
right through. )
> On Sat, 30 Nov 2002, Roger Leigh stipulated:
> > True. Things like LC_COLLATE are currently the source of problems,
> > but a properly written regex (using POSIX character classes) will work
> > irrespective of locale, and adapt to the charset in use. It's not
> > really mentioned in the documentation that using ranges like a-zA-Z
> > ([:alpha:]) etc. is *obsolete*!
>
> If you want to match a-i and L-Q, you have little choice but to use
> character classes (and set LC_COLLATE).
There are cases for both normal ranges and POSIX character classes
which don't have a direct equivalent.
[:lower:] AND [a-i] OR [:upper:] AND [L-Q]
(I haven't thought how you would actually *use* the logic though).
However, if you want portability, you need to solve the problem
without explicit ranges.
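One way the AND/OR logic above could be used, purely as a sketch: test the class first, then the range, with nested case statements. The function name in_range is hypothetical. The idea is that the class test removes the wrong-case letters a collating locale might let into the range. It still relies on ranges, so it narrows the problem and does not remove it (accented letters that sort between a and i could get through in some locales).

```shell
# Succeeds only for a lowercase letter in a-i or an uppercase letter in L-Q.
in_range() {
  case $1 in
    [[:lower:]])
      # Lowercase: now apply the a-i range.
      case $1 in [a-i]) return 0 ;; esac ;;
    [[:upper:]])
      # Uppercase: now apply the L-Q range.
      case $1 in [L-Q]) return 0 ;; esac ;;
  esac
  return 1
}

for ch in a i j L Q R; do
  if in_range "$ch"; then echo "$ch: match"; else echo "$ch: no match"; fi
done
```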
That's, um, exactly what I said in the article you quoted.
> On Mon, 2 Dec 2002, Brian Patil wrote:
>
> BP>
> BP> A capitalized word has the same dictionary meaning
> BP> as a non-capitalized one.
> BP>
>
> god versus God.
>
> The use of the capital letter is explained in the Shorter Oxford
> Dictionary.
You mean the shorter Oxford Dictionary?
BP> In article
BP> <Pine.LNX.4.50.021201...@mundungus.clifford.ac>, "Alan
BP> Clifford" <sard...@purse-seine.net> wrote:
BP>
BP> > On Mon, 2 Dec 2002, Brian Patil wrote:
BP> >
BP> > BP>
BP> > BP> A capitalized word has the same dictionary meaning
BP> > BP> as a non-capitalized one.
BP> > BP>
BP> >
BP> > god versus God.
BP> >
BP> > The use of the capital letter is explained in the Shorter Oxford
BP> > Dictionary.
BP>
BP> You mean the shorter Oxford Dictionary?
BP>
It has SHORTER OXFORD DICTIONARY on the cover.
This is somewhat misleading. The e matches because it's in the set of
characters ":elorw".
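The set can be checked character by character with a case statement. This is only an illustration; the list of test characters is arbitrary.

```shell
# Collect the characters that the pattern [:lower:] accepts.
matched=""
for ch in a e l o r w : z E; do
  case $ch in
    # Single brackets: a plain set of the characters : l o w e r.
    [:lower:]) matched="$matched$ch" ;;
  esac
done
echo "$matched"
```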
Nigel Wade then commented:
nw> The class specifier is [:lower:] which in turn needs to be inside [], so
nw> you need
nw>
nw> bash$ ls [[:lower:]]*
Mmm, on Solaris 8, /bin/sh gives me this:
$ ls [[:lower:]]*
[[:lower:]]*: No such file or directory
$ ls [a-z]*
one three
But with bash or ksh, at least I get some sense:
$ ls [[:lower:]]*
one three
$ ls [[:upper:]]*
FOUR TWO
$ ls [a-z]*
one three
Without the extra [] pair you get the same somewhat initially misleading
result from all three shells as seen by Brian (above):
$ ls [:lower:]*
one
Oh yes,
$ set | grep LOCALE | wc -l
0
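A caveat on that check: the locale variables are called LANG and LC_*, so a grep for LOCALE would print 0 whether or not any of them are set. A rough sketch of a more direct check follows. The function name effective_collate is made up, and the fallback to "C" when nothing is set is an assumption, since the default is left to the implementation.

```shell
# Report which setting governs collation, following the usual precedence:
# LC_ALL first, then LC_COLLATE, then LANG.
effective_collate() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "${LC_COLLATE:-}" ]; then
    printf '%s\n' "$LC_COLLATE"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    # Assumption: nothing set means the C/POSIX default applies.
    printf 'C\n'
  fi
}

effective_collate

# List the exported locale variables, if any.
env | grep -E '^(LANG|LC_)' || echo "no LANG or LC_* variables exported"
```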
I'm sure something somewhere's still broken, and it seems that the only
safe solution is to write scripts with special-case code that determines
what [a-z] and [[:lower:]] really do in any particular context.
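Such special-case code could be as small as the probe sketched here. The function name range_is_ascii is hypothetical, and testing a single letter is a simplification: it only detects whether an uppercase letter falls inside [a-z] in the current shell and locale.

```shell
# Succeeds if [a-z] excludes uppercase letters in the current context.
range_is_ascii() {
  case B in
    [a-z]) return 1 ;;   # B matched: the range follows a mixed-case collation
  esac
  return 0
}

if range_is_ascii; then
  echo "[a-z] matches lowercase only here"
else
  echo "[a-z] also matches uppercase here; use [[:lower:]] or set the locale"
fi
```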
Anyone else still use this kind of stuff? -
N= C='\c'; echo "$C" | grep c >/dev/null && N=-n C=
echo $N "Hello without a newline$C"
echo "<<----"
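For comparison, printf sidesteps the -n versus \c question, since it only adds a newline when told to. A minimal sketch with an invented function name:

```shell
# Print the argument with no trailing newline.
say_no_newline() {
  printf '%s' "$1"
}

out=$(say_no_newline "Hello without a newline"; echo "<<----")
echo "$out"
```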
Chris
--
@s=split(//,"Je,\nhn ersloak rcet thuarP");$k=$l=@s;for(;$k;$k--){$i=($i+1)%$l
until$s[$i];$c=$s[$i];print$c;undef$s[$i];$i=($i+(ord$c))%$l}