Testing if a directory is empty

Martijn Dekker

May 23, 2015, 10:46:34 PM
"test -s dir" or "[ -s dir ]" cannot test if a directory is empty or
not. The POSIX shell definition does not have an obvious way to do it.
When you google it, there only seem to be awful and wrong hacks using
command substitutions with 'find' or 'ls'.

I eventually found some functions using the positional parameters, such
as here: http://www.etalabs.net/sh_tricks.html and here:
https://groups.google.com/forum/#!original/comp.unix.programmer/WHHK4F0-GtQ/ZDUW6Swhry0J
but they seem convoluted and possibly broken.

So I figured out the following method to test if a directory is empty,
which does not launch any external process or subshell.

My question: is this portable? Does this break on any POSIX compliant
shell implementation or on any UNIX-like system? Meaning, does

set -- .* *

produce consistent positional parameters for an empty directory on
every system, that is: exactly three, the first '.', the second '..'
(because those two are always there) and the third '*' (due to the
unresolved glob pattern)?

So far, I've got it confirmed to work on various shells under NetBSD,
Mac OS X, Minix, and Linux.

- M.

dirisempty() {
    cd -P -- "$1" || return 2

    # Temporarily enable globbing if it's disabled;
    # set positional parameters to directory contents.

    case "$-" in
    ( *f* ) set +f; set -- .* *; set -f ;;
    ( * ) set -- .* * ;;
    esac

    # Test for number and content of parameters
    # corresponding to an empty directory.
    # (An unresolved glob pattern '*' and a directory
    # containing only one file with the literal name '*'
    # produce the same result, so test for that too.)

    test "$#" -eq 3 \
    && test "$1" = '.' \
    && test "$2" = '..' \
    && test "$3" = '*' \
    && test ! -e '*'

    case "$?" in
    ( 0 ) cd "$OLDPWD" && return 0 ;;
    ( 1 ) cd "$OLDPWD" && return 1 ;;
    esac

    echo "dirisempty: fatal error: 'test' or 'cd' failed" 1>&2
    case "$-" in ( *i* ) return 3 ;; ( * ) exit 3 ;; esac
}

Usage example:

if dirisempty "$somedir"; then echo "nothing in here"; fi
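
To see what a particular shell actually produces for 'set -- .* *', a small diagnostic along the following lines may help. This is only a sketch: the name show_glob_params is made up for illustration, and it deliberately runs in a subshell (which the function above avoids) so that the caller's working directory and positional parameters are left untouched.

```sh
# Hypothetical diagnostic: print the number of positional parameters
# that 'set -- .* *' yields in directory $1, then each one in brackets.
show_glob_params() (
    cd -P -- "$1" || exit 2
    set +f                      # make sure pathname expansion is enabled
    set -- .* *
    printf '%s\n' "$#"
    for param in "$@"; do
        printf '[%s]\n' "$param"
    done
)
```

Running it against an empty directory in each shell of interest shows directly whether the count is 3, as assumed above, or something else.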

Chris F.A. Johnson

May 24, 2015, 12:08:06 AM
Ignore my previous rubbish.

is_file()
{
[ -e "$1" ]
}

is_empty()
{
[ -d "$1" ] || return 9
is_file "$1"/*
}


--
Chris F.A. Johnson

Chris F.A. Johnson

May 24, 2015, 12:08:06 AM
On 2015-05-24, Martijn Dekker wrote:
is_empty()
{
[ -d "$1" ] || return 9
[ -e "$1"/* ]
}


--
Chris F.A. Johnson

Stephane Chazelas

May 24, 2015, 3:15:11 AM
2015-05-24 04:46:29 +0200, Martijn Dekker:
> "test -s dir" or "[ -s dir ]" cannot test if a directory is empty or
> not. The POSIX shell definition does not have an obvious way to do it.
> When you google it, there only seem to be awful and wrong hacks using
> command substitutions with 'find' or 'ls'.

isempty() (
    [ -d "$1" ] || exit 2
    content=$(ls -ALq -- "$1") || exit 2
    [ -z "$content" ]
)

would be POSIX (2008 for -A) and relatively reliable.

[...]
> dirisempty() {
> cd -P -- "$1" || return 2

Note that you don't need to be able to cd to a directory to list
its content (all you need is read permission).

If you want to check if the directory called "-" is empty,
you'll have to write it dirisempty ./- (same problem with other
values like -1, -2, +1 in some shells)

Note that -P is a POSIX addition, if you're going to use that
(which you need on POSIX shells), then you might as well use [
instead of test.

>
> # Temporarily enable globbing if it's disabled;
> # set positional parameters to directory contents.
>
> case "$-" in
> ( *f* ) set +f; set -- .* *; set -f ;;
> ( * ) set -- .* * ;;
> esac
>
> # Test for number and content of parameters
> # corresponding to an empty directory.
> # (An unresolved glob pattern '*' and a directory
> # containing only one file with the literal name '*'
> # produce the same result, so test for that too.)
>
> test "$#" -eq 3 \
> && test "$1" = '.' \
> && test "$2" = '..' \
> && test "$3" = '*' \

A few notes:

- zsh and shells based on the Forsyth shell (pdksh, mksh, posh,
probably oksh, the original Minix shell; don't know about the
current one) never expand . or .. in their globs, so .* will expand
to .* in an empty dir in those shells
- unless you fix the locale to C, you've got no guarantee that .
sorts before ..
- Not all file systems have . and .. entries and not all OSes
fake one if they're missing, so same as above, .* may expand to
.*
- if you've got search but not read permission to the directory,
that glob will expand to .* *.
- test "$x" = whatever may fail in some non-POSIX shell if $x is
! or (... for instance.

> && test ! -e '*'

test ! -e '*' will return false if * is a symlink to a
non-existent or inaccessible (as in a directory you don't have
search access for) file.

You'd want to test for ! -L '*' as well (though if you don't
have search access to the dir and end up using a solution that
doesn't involve cd, that won't help).
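
The dangling-symlink case is easy to reproduce. The sketch below combines the two tests so that a symlink counts as a directory entry even when its target is missing; the helper name entry_exists is hypothetical.

```sh
# Hypothetical helper: succeed if $1 names any directory entry,
# including a symlink whose target does not exist.
entry_exists() {
    [ -e "$1" ] || [ -L "$1" ]
}
```

After something like ln -s /no/such/target '*', a plain [ -e '*' ] reports false while entry_exists '*' reports true.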

> case "$?" in
> ( 0 ) cd "$OLDPWD" && return 0 ;;
> ( 1 ) cd "$OLDPWD" && return 1 ;;
> esac

cd "$OLDPWD" gives you no guarantee to return to where you were,
for instance if any of the path components have been renamed
(even before that function was called).

Best is to use cd in a subshell here.
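
As a minimal sketch of that approach (the function name is made up for illustration), writing the function body in parentheses instead of braces makes it run in a subshell, so the cd never affects the caller and nothing has to be restored:

```sh
# Sketch: check something from inside directory $1 without disturbing
# the caller. The ( ) body is a subshell; its 'cd' vanishes on return.
in_dir_readable() (
    cd -P -- "$1" || exit 2
    [ -r . ]
)
```

The caller's working directory is the same before and after the call, whether or not the cd inside succeeded.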


>
> echo "dirisempty: fatal error: 'test' or 'cd' failed" 1>&2
> case "$-" in ( *i* ) return 3 ;; ( * ) exit 3 ;; esac
> }
>
> Usage example:
>
> if dirisempty "$somedir"; then echo "nothing in here"; fi

--
Stephane

Stephane Chazelas

May 24, 2015, 3:20:09 AM
2015-05-23 23:54:25 -0400, Chris F.A. Johnson:
[...]
> is_file()
> {
> [ -e "$1" ]
> }

You need:

[ -e "$1" ] || [ -L "$1" ]

(-h instead of -L in some ancient systems)

> is_empty()
> {
> [ -d "$1" ] || return 9
> is_file "$1"/*
> }
[...]

That ignores hidden files. It may give you wrong information if
you don't have read or search permission to the directory.

--
Stephane

Stephane Chazelas

May 24, 2015, 5:30:09 AM
2015-05-24 08:12:18 +0100, Stephane Chazelas:
[...]
> - unless you fix the locale to C, you've got no guarantee that .
> sorts before ..
[...]

Also note that .* and * may not match file names that contain
invalid characters in the current locale.

For instance, if there's a Stéphane.txt in the current directory
with that é encoded in iso8859-1 (as the 0xe9 byte), then
yash won't include it in its expansion of * if the current
locale's charset is UTF-8.

Using LC_ALL=C won't even work there, so you're more or less
screwed unless you know the name of a locale on the system where
all byte sequences form valid characters, and in the case of
yash, you'll need to re-execute the shell:

$ touch $'St\xe9phane'
$ ls | sed -n l
St\351phane$
$ locale charmap
UTF-8
$ yash -c 'echo *'
*
$ yash -c 'LC_ALL=C; echo *'
*
$ LC_ALL=C yash -c 'echo *'
*
$ yash -c 'LC_ALL=fr_FR.iso885915@euro; echo *'
*
$ yash -c 'export LC_ALL=fr_FR.iso885915@euro; echo *'
*
$ yash -c 'LC_ALL=fr_FR.iso885915@euro eval "echo *"'
*
$ LC_ALL=fr_FR.iso885915@euro yash -c 'echo *'
Stéphane

Note that it's not only yash. GNU find:

$ find . -name '*'
.

--
Stephane

Martijn Dekker

May 24, 2015, 1:08:33 PM
In article <2015052407...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> isempty() (
> [ -d "$1" ] || exit 2
> content=$(ls -ALq -- "$1") || exit 2
> [ -z "$content" ]
> )
>
> would be POSIX (2008 for -A) and relatively reliable.

But it launches two subshells plus an external command, so it's quite
slow.

Also, unless set -f (noglob) is active, the result of anything to the
right of the = in a variable assignment is subject to pathname expansion
(globbing) depending on what's in the present working directory, which
is not the directory you're listing. So, to avoid that, the command
substitution should be quoted:

content="$(ls -ALq -- "$1")"

> [...]
> > dirisempty() {
> > cd -P -- "$1" || return 2
>
> Note that you don't need to be able to cd to a directory to list
> its content (all you need is read permission).
>
> If you want to check if the directory called "-" is empty,
> you'll have to write it dirisempty ./- (same problem with other
> values like -1, -2, +1 in some shells)
>
> Note that -P is a POSIX addition, if you're going to use that
> (which you need on POSIX shells), then you might as well use [
> instead of test.

Yes, I write for current, pure POSIX shells. I've been working on a
rather ambitious general-purpose POSIX shell library project. It needs a
reliable and fast way to test for an empty directory. I'd rather avoid
using a subshell or external command if at all possible.

'test' and '[' are completely equivalent so there is no functional
difference. But I find 'test' more legible. I also find the way that the
'[' command tries to masquerade as syntax rather evil, because it
involves rather annoying mandatory spaces where you wouldn't expect them
in syntax. Matter of taste, I suppose.

> A few notes:
>
> - zsh and shells based on the Forsyth shell (pdksh, mksh, posh,
> probably oksh, the original Minix shell; don't know about the
> current one) never expand . or .. in their globs, so .* will expand
> to .* in an empty dir in those shells

Woops, missed that one. You're absolutely right.

(The current Minix default shell is a rather broken version of the
Almquist shell. I've confirmed it to work on there, though.)

So this means there are two possible outcomes to test for, the other one
being:

test "$#" -eq 2 \
&& test "$1" = '.*' \
&& test ! -L '.*' \
&& test ! -e '.*' \
&& test "$2" = '*' \
&& test ! -L '*' \
&& test ! -e '*'

> - unless you fix the locale to C, you've got no guarantee that .
> sorts before ..

Is there actually *any* locale where a single character sorts after a
double of the same character? I find it hard to imagine.

I suppose it's better to be safe than sorry, though, so I'll set LC_ALL
to C and restore it after.

> - Not all file systems have . and .. entries and not all OSes
> fake one if they're missing, so same as above, .* may expand to
> .*

Do you know any examples of systems that don't fake them? Not that it
matters now, because it's covered.

> - if you've got search but not read permission to the directory,
> that glob will expand to .* *.

We need read permission anyway, so I'll just test for that.

> - test "$x" = whatever may fail in some non-POSIX shell if $x is
> ! or (... for instance.

Yes. Argh. I keep forgetting that.

(In my library I've got strcmp() that deals with this correctly.)

> > && test ! -e '*'
>
> test ! -e '*' will return false if * is a symlink to a
> non-existent or inaccessible (as in a directory you don't have
> search access for) file.

Great catch, thank you.

> You'd want to test for ! -L '*' as well (though if you don't
> have search access to the dir and end up using a solution that
> doesn't involve cd, that won't help).
>
> > case "$?" in
> > ( 0 ) cd "$OLDPWD" && return 0 ;;
> > ( 1 ) cd "$OLDPWD" && return 1 ;;
> > esac
>
> cd "$OLDPWD" gives you no guarantee to return to where you were,
> for instance if any of the path components have been renamed
> (even before that function was called).
>
> Best is to use cd in a subshell here.

I care quite a lot about performance, so I'd really rather not. But I
must agree -- I can't find any other way around this than using a
subshell.

So now it looks like a tradeoff: either accept that it's impossible to
return to a nonexistent PWD or accept slow performance.

How likely is it really that your PWD doesn't exist? Perhaps trying to
defend against this is futile. After all, you're hardly going to delete
your own PWD without expecting breakage -- and if it's possible for
something else to delete or manipulate your PWD while you're working in
it, you've already got bigger problems to worry about than a failure to
return to your nonexistent PWD. So I tend to doubt it's worth the
performance tradeoff. At least the function tests and exits on failure
to return to $OLDPWD.


Two new versions are below. The first one still does not use any
subshell or external command. It's getting kind of absurdly long, but
it's still generally about six times faster than anything using a
subshell or external command (on bash, only about three times faster).

Note: on zsh, the first version only works in POSIX/'emulate sh' mode
(which is all I personally care about anyway). Without it, zsh will exit
on non-matching glob patterns, or in an interactive shell, it will abort
execution and leave you with LC_ALL=C in the environment.

Also, this function does not work with 'set -e' (errexit) active. That
option makes it very hard to distinguish between 'false' and 'error'
exit statuses of commands, because any non-zero exit status is treated
as a fatal error. Ugly hacks would be needed to cope with it.

The second version uses a subshell, so is shorter and probably less
insecure, but (as said) much slower. One big advantage is the lack of
need to restore any kind of variable or setting. For zsh, it activates
POSIX emulation within the subshell so it works on default zsh too.

I'd be very interested to hear about it if either version below breaks
on some POSIX shell or system in some way.

- M.


First version (no subshell):


dirisempty() {
    cd -P -- "$1" && test -r '.' || return 2

    # Enforce C locale to ensure correct sorting.
    if test "${LC_ALL+set}" = 'set'; then
        _saveLC_ALL="$LC_ALL"
    else
        unset -v _saveLC_ALL
    fi
    LC_ALL=C

    # Temporarily enable globbing if it's disabled;
    # set positional parameters to directory contents.
    case "$-" in
    ( *f* ) set +f; set -- .* *; set -f ;;
    ( * ) set -- .* * ;;
    esac

    # Restore locale.
    if test "${_saveLC_ALL+set}" = 'set'; then
        LC_ALL="$_saveLC_ALL"
    else
        unset -v LC_ALL
    fi

    # Test for number and content of parameters
    # corresponding to an empty directory.
    {
        test "$#" -eq 3 \
        && test "$1" = '.' \
        && test "$2" = '..' \
        && test "$3" = '*' \
        && test ! -L '*' \
        && test ! -e '*'
    } || {
        test "$#" -eq 2 \
        && test "$1" = '.*' \
        && test ! -L '.*' \
        && test ! -e '.*' \
        && test "$2" = '*' \
        && test ! -L '*' \
        && test ! -e '*'
    }

    case "$?" in
    ( 0 ) cd "$OLDPWD" && return 0 ;;
    ( 1 ) cd "$OLDPWD" && return 1 ;;
    esac

    echo "dirisempty: fatal error: 'test' or 'cd' failed" 1>&2
    case "$-" in ( *i* ) return 3 ;; ( * ) exit 3 ;; esac
}


Second version (subshell):


dirisempty() (
    cd -P -- "$1" && test -r '.' || exit 2

    # Apply compatible settings within subshell.
    [ -n "$ZSH_VERSION" ] && emulate sh || POSIXLY_CORRECT=y
    set +e +f
    LC_ALL=C

    # Set positional parameters to directory contents.
    set -- .* *

    # Test for number and content of parameters
    # corresponding to an empty directory.
    {
        test "$#" -eq 3 \
        && test "$1" = '.' \
        && test "$2" = '..' \
        && test "$3" = '*' \
        && test ! -L '*' \
        && test ! -e '*'
    } || {
        test "$#" -eq 2 \
        && test "$1" = '.*' \
        && test ! -L '.*' \
        && test ! -e '.*' \
        && test "$2" = '*' \
        && test ! -L '*' \
        && test ! -e '*'
    }
)

Martijn Dekker

May 24, 2015, 1:21:54 PM
In article <2015052409...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> Also note that .* and * may not match file names that contain
> invalid characters in the current locale.
>
> For instance, if there's a Stéphane.txt in the current directory
> with that é encoded in iso8859-1 (as the 0xe9 byte), then
> yash won't include it in its expansion of * if the current
> locale's charset is UTF-8.

That sounds like a bug in yash. I don't see a reason for it to care
about what character set the file names are in for the purposes of glob
expansion. That should be the problem of the application that deals with
the result of the expansion.

- M.

Stephane Chazelas

May 24, 2015, 3:10:10 PM
2015-05-24 19:08:29 +0200, Martijn Dekker:
> In article <2015052407...@chaz.gmail.com>,
> Stephane Chazelas <stephane...@gmail.com> wrote:
>
> > isempty() (
> > [ -d "$1" ] || exit 2
> > content=$(ls -ALq -- "$1") || exit 2
> > [ -z "$content" ]
> > )
> >
> > would be POSIX (2008 for -A) and relatively reliable.
>
> But it launches two subshells plus an external command, so it's quite
> slow.

Note that subshells are not necessarily expensive. Not all
shells implement them with a fork (ksh93 doesn't). And ls or
[/test don't have to be external.

On the other hand all those solutions that get a full list of
the directory will be very slow when applied to directories
containing millions of files...


> Also, unless set -f (noglob) is active, the result of anything to the
> right of the = in a variable assignment is subject to pathname expansion
> (globbing) depending on what's in the present working directory, which
> is not the directory you're listing. So, to avoid that, the command
> substitution should be quoted:
>
> content="$(ls -ALq -- "$1")"

No, the split+glob operator only applies in list contexts. When
assigning to a scalar variable, you're assigning *one* value, so
there can't be split+glob. Maybe you're confusing this with array
assignments like the

content=($(cmd))

of ksh93/zsh/bash.

(quoting won't harm though).
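
A quick way to check this in any given shell is sketched below; the function name assign_unquoted is made up for the demonstration. The expansion on the right-hand side is left unquoted on purpose.

```sh
# Sketch: an unquoted expansion on the right-hand side of a plain
# (scalar) assignment undergoes neither field splitting nor globbing.
assign_unquoted() {
    value=$1            # no quotes, yet $1 is copied verbatim
    printf '%s\n' "$value"
}
```

Called as assign_unquoted '*' in a directory that contains files, this should still print a literal *, and assign_unquoted 'a   b' should keep all three spaces.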

[...]
> So this means there are two possible outcomes to test for, the other one
> being:
>
> test "$#" -eq 2 \
> && test "$1" = '.*' \
> && test ! -L '.*' \
> && test ! -e '.*' \
> && test "$2" = '*' \
> && test ! -L '*' \
> && test ! -e '*'

(note that in theory, you could have . and not ..).

If you want to avoid forks, you could do something like:

isempty() {
    [ -d "$1" ] && [ -r "$1" ] || return 2
    set -- "$1"/[*] "$1"/* "$1"/.[!.]* "$1"/.??*
    case $#${1##*/}${2##*/}${3##*/}${4##*/} in
    ('4[*]*.[!.]*.??*') return 0;;
    (*) return 1
    esac
}
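
For anyone puzzling over the four patterns, here is the same idea spelled out with comments. The name isempty_annotated is made up, the logic is intended to match the function above, and it assumes globbing is enabled (no 'set -f') and that no nullglob-style option is in effect.

```sh
isempty_annotated() {
    [ -d "$1" ] && [ -r "$1" ] || return 2
    # Each pattern stays literal only if nothing matches it:
    #   [*]      an entry literally named '*'
    #   *        any name not starting with a dot
    #   .[!.]*   dot names whose second character is not a dot
    #   .??*     dot names of three or more characters (e.g. '..x')
    # Together they cover every possible name except '.' and '..'.
    set -- "$1"/[*] "$1"/* "$1"/.[!.]* "$1"/.??*
    # Empty means: exactly four words, all still the literal patterns.
    # The [*] word is what tells an unmatched '*' apart from a
    # directory whose only entry is a file named '*'.
    case $#${1##*/}${2##*/}${3##*/}${4##*/} in
    ('4[*]*.[!.]*.??*') return 0 ;;
    (*) return 1 ;;
    esac
}
```

The exit status follows the original: 0 for empty, 1 for non-empty, 2 if the argument is not a readable directory.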

> > - unless you fix the locale to C, you've got no guarantee that .
> > sorts before ..
>
> Is there actually *any* locale where a single character sorts after a
> double of the same character? I find it hard to imagine.

It could be because .. is explicitly defined as a collating
element sorting the same as . (very unlikely) or because . is
ignored for the purpose of comparison in which case the order of
. and .. would be non-deterministic (more likely, but probably
not in real-life locales).

[...]
> Do you know any examples of systems that don't fake them? Not that it
> matters now, because it's covered.

I remember seeing some, but can't remember which.

> > - if you've got search but not read permission to the directory,
> > that glob will expand to .* *.
>
> We need read permission anyway, so I'll just test for that.
>
> > - test "$x" = whatever may fail in some non-POSIX shell if $x is
> > ! or (... for instance.
>
> Yes. Argh. I keep forgetting that.

Well, if you assume a POSIX shell, you can ignore that. POSIX
clearly specifies the behaviour here, which would not fail for ! or
(.

test '*' = "$1"

would be fine even with those non-POSIX old shells.


[...]
> > cd "$OLDPWD" gives you no guarantee to return to where you were,
> > for instance if any of the path components have been renamed
> > (even before that function was called).
> >
> > Best is to use cd in a subshell here.
>
> I care quite a lot about performance, so I'd really rather not. But I
> must agree -- I can't find any other way around this than using a
> subshell.
>
> So now it looks like a tradeoff: either accept that it's impossible to
> return to a nonexistent PWD or accept slow performance.
>
> How likely is it really that your PWD doesn't exist? Perhaps trying to
> defend against this is futile. After all, you're hardly going to delete
> your own PWD without expecting breakage -- and if it's possible for
> something else to delete or manipulate your PWD while you're working in
> it, you've already got bigger problems to worry about than a failure to
> return to your nonexistent PWD. So I tend to doubt it's worth the
> performance tradeoff. At least the function tests and exits on failure
> to return to $OLDPWD.
[...]

Note that ksh93 which does not fork for subshells implements the
restoring of cwd by using a close-on-exec fd and doing a
fchdir() to that upon exit of the subshell.

$PWD may become stale even before your script starts (and many
shells inherit it from the environment on startup even if
forbidden by POSIX)

See
http://unix.stackexchange.com/questions/79571/symbolic-link-recursion-what-makes-it-reset/79621#79621


[...]
> Also, this function does not work with 'set -e' (errexit) active. That
> option makes it very hard to distinguish between 'false' and 'error'
> exit statuses of commands, because any non-zero exit status is treated
> as a fatal error. Ugly hacks would be needed to cope with it.

cmd && :

avoids the exiting upon set -e.
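
A minimal sketch of how that can be used to capture a status under errexit follows. The names run_and_record and last_status are hypothetical. It relies on two things: a failing command on the left of && does not trigger the exit, and $? still holds that command's status afterwards because the : is never run.

```sh
# Sketch: run a command and record its exit status without tripping
# 'set -e'. Only the last command of an AND-OR list can cause the
# exit, and ':' is skipped when "$@" fails, so $? is preserved.
run_and_record() {
    "$@" && :
    last_status=$?
}
```

With set -e active, run_and_record false leaves last_status set to 1 and the script keeps running.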


[...]
> First version (no subshell):
>
>
> dirisempty() {
> cd -P -- "$1" && test -r '.' || return 2
[...]

You're not restoring the cwd if test -r fails.

--
Stephane

Stephane Chazelas

May 24, 2015, 3:20:09 PM
2015-05-24 19:21:49 +0200, Martijn Dekker:
[...]

* as a pattern matches 0 or more *characters*. \xe9 is not a
character in a UTF-8 locale, so it's OK for * not to match it.
I believe POSIX leaves the behaviour unspecified.

You'll notice that GNU sed's "." RE operator doesn't match those
non-characters either.

Note that yash goes further in that it won't store
non-characters in its variables, so you wouldn't be able to call
a yash script with such a file name as argument for instance.

The behaviour is a bit extreme, and kind of assumes that
character-correctness is enforced throughout the system.

Bug or not, IMO you can ignore the problem as there's not much
you can do about it.

--
Stephane

Martijn Dekker

May 24, 2015, 3:32:26 PM
In article <2015052419...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> No, the split+glob operator only applies in list contexts. When
> assigning to a scalar variable, you're assigning *one* value, so
> there can't be split+glob.

Hmm. That was my belief too for years, but then an unexpected parsing
result made me conclude there must be globbing on assignment, so I
started quoting more paranoidly. But what you say makes sense. Perhaps
it had some other cause. I'll have to look at this again.

Thanks,

- M.

Stephane Chazelas

May 25, 2015, 12:15:11 AM
2015-05-24 21:32:21 +0200, Martijn Dekker:
[...]

One case where one may think quotes are unnecessary is in
redirections; that may be the one you're thinking of:

You'd think the shell would have no reason to apply split+glob
in:

cmd > $file

but POSIX only forbids split+globbing in non-interactive shells,
and bash is not compliant there unless in POSIX mode.

So you do need:

cmd > "$file"

for bash or for interactive shells.

There are also cases where quoting is necessary even if there's
no split+glob like in:

case x in
"$var")

or x=${x#"$var"}

to prevent $var from being treated as a pattern.
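
The difference can be shown with a pair of small functions. This is just a sketch and both names are made up: one quotes the expansion in the pattern position, the other does not.

```sh
# Literal comparison: quoting "$2" stops glob characters in it from
# acting as a pattern.
str_equal() {
    case $1 in
    ("$2") return 0 ;;
    (*) return 1 ;;
    esac
}

# Pattern match: the unquoted $2 is interpreted as a shell pattern.
str_matches() {
    case $1 in
    ($2) return 0 ;;
    (*) return 1 ;;
    esac
}
```

So str_matches abc 'a*' succeeds while str_equal abc 'a*' fails; str_equal 'a*' 'a*' succeeds because the two strings are identical.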

I remember needing

a="$b"

for some bug of some old version of zsh or bash a long time ago
as well (something to do with backslash and/or IFS) but if you
start working around the bugs of all ancient shells, you end up
getting nowhere.

--
Stephane

Martijn Dekker

May 25, 2015, 8:03:25 PM
In article <2015052421...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> 2015-05-24 21:32:21 +0200, Martijn Dekker:
> > In article <2015052419...@chaz.gmail.com>,
> > Stephane Chazelas <stephane...@gmail.com> wrote:
> >
> > > No, the split+glob operator only applies in list contexts. When
> > > assigning to a scalar variable, you're assigning *one* value, so
> > > there can't be split+glob.
> >
> > Hmm. That was my belief too for years, but then an unexpected parsing
> > result made me conclude there must be globbing on assignment, so I
> > started quoting more paranoidly. But what you say makes sense. Perhaps
> > it had some other cause. I'll have to look at this again.

I think I found what bit me before!

In most versions of ash (but not the busybox one), in dash, in yash, in
some versions of pdksh, and perhaps others, exported assignments are
subject to globbing, though normal ones are not.

touch 'v2=abcde' # create this file
glob='*'
v1=$glob
echo "$v1" # outputs '*', ok
export v2=$glob
echo "$v2" # outputs 'abcde' !!!!! (in the abovementioned shells)

In other words, globbing is turned on for assignments preceded by the
"export" keyword in some shells, but not in others. Ouch! I wrote a
little test script to map the behaviour of different shells:

#! /bin/sh
touch 'v2=abcde' || exit
for shell in sh ash dash yash ksh ksh93 pdksh mksh zsh bash; do
    command -v "$shell" >/dev/null || continue
    case "$($shell -c 'glob="*"; export v2=$glob; printf %s\\n "$v2"')" in
    ( abcde )
        printf '%s\n' "$shell globs on 'export' assignments!" ;;
    ( '*' )
        printf '%s\n' "$shell does not glob on 'export' assignments." ;;
    ( * )
        printf '%s\n' "$shell: command error!" ;;
    esac
done
rm 'v2=abcde'

(EOF)

So far, it seems the Almquist-derived shells (except busybox ash) plus
yash are the offenders.

Man, globbing is dangerous -- it turns up where you least expect it. Yet
another reason to turn off globbing, along with field splitting, in
shell scripts and only turn them on temporarily where you need them.

My new shell library is provisionally called "modernish", taking
inspiration from "modernizr" for Javascript. And it's trying to turn the
POSIX shell language, if not into a modern language, at least into a
"modern-ish" language -- kinda like jQuery did for JavaScript.
Everything is completely cross platform. It's modular, having a simple
"use" command inspired by perl. And it's got a "safe" module that turns
off globbing, turns off fieldsplitting, and turns on "nounset" and
"noclobber", plus provides facilities for temporary globbing and field
splitting.

I think I now have the following working quite reliably: arbitrary local
variables and shell options/settings in a 'setlocal'..'endlocal' block.
Such a block can be inserted anywhere and is not limited to shell
functions or anything. An example (yes, it's contrived):

#! /bin/sh
. modernish
use safe

setlocal --doglobbing --dofieldsplitting somedir=/dir/name
cp /blah/*.txt "$somedir"/
endlocal

The setlocal..endlocal block is a true shell block. They use stack
functions (push and pop) to save the local variables and settings and
then restore them once the temp function exits. After alias expansion,
it's a temporary shell function, so you can 'return' from it (instead of
using 'break'). 'setlocal' and 'endlocal' are aliases that include block
delimiters { } and incorporate the code within them into a temporary
shell function, called by the 'endlocal' alias. The parameters to
setlocal can be arbitrary shell options, variables, and variable
assignments which then become local. --doglobbing is a synonym for '+f'
and --dofieldsplitting is a synonym for IFS=" ${CCn}${CCt}" (where the
CC* constants represent control characters).

I think I'm now just a few weeks off from publishing a first alpha
version. More tweaking and testing is still needed. Then I'm going to
have to figure out how to get it out there: github, or just a website?
This stuff is boring, I prefer to do programming. I hate all this
newfangled cloud stuff. Perhaps I'll put it on a gopher site, just to be
contrary.

> [...]
>
> One case where one may think quotes are unnecessary is in
> redirections; that may be the one you're thinking of:

Nope, I didn't even know about these -- I just always assumed the
quoting was necessary.

> You'd think the shell would have no reason to apply split+glob
> in:
>
> cmd > $file
>
> but POSIX only forbids split+globbing in non-interactive shells,
> and bash is not compliant there unless in POSIX mode.

How strange, that POSIX rule. I suppose this is meant to make
interactive shell use more convenient by allowing globbing to be abused
to represent a single file name without having to type it completely.
But in any decent interactive shell, that's what tab completion is for.

[...]
> There are also cases where quoting is necessary even if there's
> no split+glob like in:
>
> case x in
> "$var")
>
> or x=${x#"$var"}
>
> to prevent $var from being treated as a pattern.

Yes, I am quite familiar with that one and it at least makes sense.

> I remember needing
>
> a="$b"
>
> for some bug of some old version of zsh or bash a long time ago
> as well (something to do with backslash and/or IFS) but if you
> start working around the bugs of all ancient shells, you end up
> getting nowhere.

I strongly agree, and I don't tolerate bugs in new shells either.
Instead, I write tests for those bugs as I find them, and my library
refuses to load if any such bug is found. I'm testing on 7 such bugs so
far, including a few very obscure ones.

Last week, I got zsh to fix three bugs I found during the development of
my library, including one really obscure one with empty bracket
expressions eating shell grammar, even if the bracket expression
contains an empty variable (so it depends on the contents of the
variable whether it eats shell grammar or not!). Fun times. They only
fixed it for POSIX mode. I guess when zsh acts all bizarre in zsh mode,
it's correct by definition.

- Martijn

Stephane Chazelas

May 26, 2015, 2:30:09 AM
2015-05-26 02:03:20 +0200, Martijn Dekker:
[...]
> touch 'v2=abcde' # create this file
> glob='*'
> v1=$glob
> echo "$v1" # outputs '*', ok
> export v2=$glob
> echo "$v2" # outputs 'abcde' !!!!! (in the abovementioned shells)
>
> In other words, globbing is turned on for assignments preceded by the
> "export" keyword in some shells, but not in others. Ouch! I wrote a
> little test script to map the behaviour of different shells:
[...]

That's not an assignment, that's a builtin command (except in
bash and ksh93 where that's a hybrid between assignment and
command).

Same in env v2=$glob
or awk '...' v2=$glob

At the moment the POSIX spec specifies the ash/zsh/yash...
behaviour, I believe the next version will be amended to allow
bash/ksh behaviour (or maybe it has already).

Note that in bash/ksh as well, the glob is expanded in:

export \v2=$glob

as that's not recognised as an assignment anymore.
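
Whichever way a given shell parses export, either of the two forms sketched below should avoid the glob. The function name export_demo and the variable v3 are made up for illustration; v2 and glob follow the example quoted above.

```sh
# Sketch: two ways to export a value containing glob characters
# without depending on how the shell treats 'export name=value'.
export_demo() {
    glob='*'

    # Option 1: quote the expansion, so 'export' gets one literal word.
    export v2="$glob"

    # Option 2: plain assignment first (no globbing there), then export.
    v3=$glob
    export v3
}
```

Run in a directory containing a file named v2=abcde, both variables should still end up holding a literal *.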

See also:

http://unix.stackexchange.com/questions/193095/where-is-export-var-value-not-available/193137#193137
http://unix.stackexchange.com/questions/171346/security-implications-of-forgetting-to-quote-a-variable-in-bash-posix-shells

And this thread:

http://thread.gmane.org/gmane.comp.shells.bash.bugs/22737

[...]
> So far, it seems the Almquist-derived shells (except busybox ash) plus
> yash are the offenders.

And zsh (in sh emulation). Though I'd rather say bash and ksh
are the offenders and the others do what I'd expect.

>
> Man, globbing is dangerous -- it turns up where you least expect it. Yet
> another reason to turn off globbing, along with field splitting, in
> shell scripts and only turn them on temporarily where you need them.

Or use zsh (though it still does splitting upon command
substitution and empty removal).

Turning off globbing globally IMO is more likely to cause more
harm. There was a discussion about it here a few weeks ago.

[...]
> #! /bin/sh
> . modernish
> use safe
>
> setlocal --doglobbing --dofieldsplitting somedir=/dir/name
> cp /blah/*.txt "$somedir"/
> endlocal
>
> The setlocal..endlocal block is a true shell block. They use stack
> functions (push and pop) to save the local variables and settings and
> then restore them once the temp function exits.
[...]

I once did one like that as a PoC though never used it:
https://github.com/stephane-chazelas/misc-scripts/blob/master/locvar.sh
(also local context for options)

My view is that when you start needing things like arrays or
functions or local scope, chances are you need a real
programming language, not a shell which is better left as a
fancy command-line interpreter.

[...]
> I think I'm now just a few weeks off from publishing a first alpha
> version. More tweaking and testing is still needed. Then I'm going to
> have to figure out how to get it out there: github, or just a website?
[...]

Looking forward to seeing it.

[...]
> > cmd > $file
> >
> > but POSIX only forbids split+globbing in non-interactive shells,
> > and bash is not compliant there unless in POSIX mode.
>
> How strange, that POSIX rule. I suppose this is meant to make
> interactive shell use more convenient by allowing globbing to be abused
> to represent a single file name without having to type it completely.
> But in any decent interactive shell, that's what tab completion is for.

Yes.

See also:

file=' /etc/passwd '
echo test > $file

in bash.

--
Stephane

Martijn Dekker

May 26, 2015, 10:35:39 AM
In article <2015052606...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> 2015-05-26 02:03:20 +0200, Martijn Dekker:
> [...]
> > touch 'v2=abcde' # create this file
> > glob='*'
> > v1=$glob
> > echo "$v1" # outputs '*', ok
> > export v2=$glob
> > echo "$v2" # outputs 'abcde' !!!!! (in the abovementioned shells)
> >
> > In other words, globbing is turned on for assignments preceded by the
> > "export" keyword in some shells, but not in others. Ouch! I wrote a
> > little test script to map the behaviour of different shells:
> [...]
>
> That's not an assignment, that's a builtin command (except in
> bash and ksh93 where that's a hybrid between assignment and
> command).

Yeah. Got that. Yet, the result is that the variables get assigned
values, so it's an assignment... just a different form of it.

Many times I've decided that a variable needs to be exported after
writing the assignment statement, so you'd think it ought to be as
simple as just prefixing the assignment with the "export" keyword.
That's how this bit me.
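
For reference, a minimal sketch of the pitfall and the quoting that avoids it. The function name demo_export_quoting and the use of mktemp -d (not specified by POSIX, but widely available) are assumptions made for this illustration only; the file name and variables are the ones from the example quoted above.

```shell
# Hypothetical demo: with the value quoted, "export" never globs it,
# whether the shell treats export as an assignment or as a plain builtin.
demo_export_quoting() {
    # Work in a scratch directory so the odd file name cannot hurt anything.
    tmpdir=$(mktemp -d) || return 1
    (
        cd "$tmpdir" || exit 1
        : > 'v2=abcde'        # the file that the unquoted form can glob to
        glob='*'
        export v2="$glob"     # quoted: the word stays v2=* in every shell
        printf '%s\n' "$v2"
    )
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

Calling demo_export_quoting prints a literal asterisk. Dropping the quotes on the export line is what yields 'abcde' in the shells listed above.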

Meh, I'll just keep quoting everything, including assignments (unless in
my proposed 'safe' mode with IFS=''; set -f -u -C). Quoting never does
any harm. In 2015, the wastage of two bytes per assignment is definitely
insignificant. And it protects me from forgetting to add them in obscure
and convoluted contexts where they may be needed in some shells and not
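others.

> Turning off globbing globally IMO is more likely to cause more
> harm. There was a discussion about it here a few weeks ago.

Whether it's harmful or not depends on whether your script needs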
globbing or not. If it doesn't, turning it off is neutral at best and
can otherwise only do good.

Most of my scripts take filenames as arguments from the (usually
interactive) command line. Globbing is already resolved by the
interactive shell, so my script has no need to do globbing itself.

In any case, in my library I've made this as part of a 'safe' mode that
needs to be specified explicitly ('use safe'), so it's not the default.
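
A rough sketch of that way of working, using only plain POSIX constructs rather than the library itself: the settings are the ones named above (IFS=''; set -f -u -C), and globbing is switched on for a single line. The helper name list_txt is hypothetical.

```shell
# Hypothetical helper: print the *.txt names in directory $1.
# The whole body runs in a subshell, so the settings do not leak out.
list_txt() (
    IFS=''; set -f -u -C      # no splitting, no globbing, no unset vars, no clobber
    cd -- "$1" || exit 2
    set +f                    # globbing on, only for the next line
    set -- *.txt
    set -f                    # and off again
    [ -e "$1" ] || exit 1     # the pattern matched nothing
    printf '%s\n' "$@"
)
```

Because the function body is a subshell, the caller's own options and IFS are left untouched, which is the same effect a save/restore block aims for.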


> My view is that when you start needing things like arrays or
> functions or local scope, chances are you need a real
> programming language, not a shell which is better left as a
> fancy command-line interpreter.

That is certainly a legit view to take, and it's a common one.

My own view is that the shell, as defined by POSIX, has now been
developed enough to be a real programming language, and that denying
this is futile -- you can program anything in it, so people will program
anything in it, just because they can. That's just how life works. It's
just a real programming language with some rather annoying and obscure
deficiencies.

But the shell language has some very powerful mechanisms for extending
the language (particularly powerful is that aliases are resolved before
the shell grammar is even parsed). So, I also take the view that most of
these deficiencies are nothing that can't be solved with a good library
plus some updated programming practices. All programming languages have
standard libraries and established coding best practices, except the
shell language. Why should the shell language be an exception?

Whether this view is correct, only time will tell. I've set myself the
challenge to prove it by attempting to write this standard library, and
to make it 100% POSIX compatible.

The shell language also has some huge and unique advantages over other
programming languages: trivial launching of external commands, trivial
use of input/output redirection and pipes, trivial parallel processing
(just add '&' to any command or block), and more. It can do very
powerful things with very little programming. In my view, that makes it
worth attempting to fix the deficiencies.

Perhaps I should start a thread about this. We all like the shell
language here, or we wouldn't be here. What is it we all like about it
the most?


[...]
> See also:
>
> file=' /etc/passwd '
> echo test > $file
>
> in bash.

Hmm. By default, this does field splitting, so removes the blanks and
attempts to write to /etc/passwd, just as I would expect. What's the
point you're trying to make with this?

- M.

Stephane Chazelas

May 26, 2015, 11:15:10 AM
2015-05-26 16:35:35 +0200, Martijn Dekker:
[...]
> > See also:
> >
> > file=' /etc/passwd '
> > echo test > $file
> >
> > in bash.
>
> Hmm. By default, this does field splitting, so removes the blanks and
> attempts to write to /etc/passwd, just as I would expect. What's the
> point you're trying to make with this?
[...]

Just that it's not only about globbing. While I can understand
there's value in being able to do:

foo > foo*.txt

at a shell prompt, and agree with you that this is probably why POSIX
allows it for interactive shells, I can see no value in allowing
splitting there.

But then again, I see no value in doing split+glob upon
variable assignment or glob upon command-substitution either,
but that's just another misdesign we've been carrying over from
shell to shell for the past 45 years for the sake of backward
compatibility.

And any attempt like yours at fixing some of them is certainly a
laudable endeavor.

--
Stephane

Martijn Dekker

May 27, 2015, 7:11:47 AM
In article <2015052615...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> Just that it's not only about globbing. While I can understand
> there's value in being able to do:
>
> foo > foo*.txt
>
> at a shell prompt, and agree with you that this is probably why POSIX
> allows it for interactive shells, I can see no value in allowing
> splitting there.

Ah. True. Except it would introduce yet another annoying grammatical
inconsistency if it didn't. Two commands like

test -s $file || echo test > $file

would no longer be guaranteed to refer to the same file name.

Just turning off field splitting by default fixes stuff like this
nicely, in my experience.
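
A small sketch of that claim, assuming a POSIX-style shell or bash. The function name demo_no_split, the file name ' out.txt ' and the use of mktemp -d are made up for the illustration; IFS is changed inside a subshell only.

```shell
# Hypothetical demo: with IFS empty, an unquoted $file names the same
# file in the test and in the redirection, blanks included.
demo_no_split() {
    tmpdir=$(mktemp -d) || return 1
    (
        cd "$tmpdir" || exit 1
        IFS=''                        # field splitting off (this subshell only)
        file=' out.txt '              # leading and trailing blank
        test -s $file || echo test > $file
        # The file with the blanks exists; no blank-stripped twin was created.
        test -s ' out.txt ' && test ! -e out.txt && echo 'same file both times'
    )
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

Note that this only deals with splitting. An unquoted $file containing glob characters would still be subject to pathname expansion in the test command unless globbing is off as well.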

> But then again, I see no value in doing split+glob upon
> variable assignment

Wait, I thought you just established that *doesn't* happen?

> or glob upon command-substitution either,
> but that's just another misdesign we've been carrying over from
> shell to shell for the past 45 years for the sake of backward
> compatibility.

Agreed.

> And any attempt like yours at fixing some of them is certainly a
> laudable endeavor.

Thanks. We'll see if it pans out.

- M.

Stephane Chazelas

May 27, 2015, 8:10:09 AM
2015-05-27 13:11:43 +0200, Martijn Dekker:
[...]
> > But then again, I see no value in doing split+glob upon
> > variable assignment
>
> Wait, I thought you just established that *doesn't* happen?

Sorry, typo. I meant "upon variable expansion", actually
"parameter expansion" (variable or other).

The Thompson shell (the shell from the first Unix), which had no
variables, only $1, $2... positional parameters, already did
split+glob upon their expansion.

Well, it was more like macro expansion then, which kind of made
more sense than the Bourne way, even though it was a lot more
dangerous.

With a script like:

echo $1

calling it as:

sh ./the-script 'foo; echo test'

would output:

foo
test

You could make it:

echo "$1"

(effectively disabling globs) though that would break if you ran
it as

sh ./the-script \"

That looks like crap design nowadays, but remember that was
written in the early 70s by computing pioneers on $multi-million
massive computers with kilobytes of memory, at a time when
administrators, software authors and users of a machine were the
same people.

I suppose Bourne tried to keep some kind of backward portability
with the Thompson shell, which is why it still did split+glob upon
parameter expansion.

It's a shame that this behaviour designed in another era
survived until now.

"rc" (probably the best Unix shell design ever and the shell we
would all be using if the decision was based on technical
grounds), "zsh" and "fish" have fixed that but never achieved
widespread usage probably for the only reason that they're not
Bourne compatible...

--
Stephane

Martijn Dekker

Jun 23, 2015, 4:10:38 PM
In article <2015052419...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> If you want to avoid forks, you could do something like:
>
> isempty() {
>   [ -d "$1" ] && [ -r "$1" ] || return 2
>   set -- "$1"/[*] "$1"/* "$1"/.[!.]* "$1"/.??*
>   case $#${1##*/}${2##*/}${3##*/}${4##*/} in
>     ('4[*]*.[!.]*.??*') return 0;;
>     (*) return 1
>   esac
> }

Hmm. How did I miss this reply a month ago?

The above seems like a brilliant technique: it avoids subshell forking,
the potential problem of locale-dependent sorting of "." and "..", and
the problem with shell-dependent glob matching of "." and "..".
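
To see the three return codes in action, here is a small harness around the quoted function (repeated so the snippet is self-contained). The name isempty_demo, the directory names and the use of mktemp -d are assumptions made for this sketch, and it assumes the temporary path itself contains no glob characters.

```shell
# The function as quoted above.
isempty() {
    [ -d "$1" ] && [ -r "$1" ] || return 2
    set -- "$1"/[*] "$1"/* "$1"/.[!.]* "$1"/.??*
    case $#${1##*/}${2##*/}${3##*/}${4##*/} in
        ('4[*]*.[!.]*.??*') return 0 ;;
        (*) return 1 ;;
    esac
}

# Hypothetical harness: print each test directory with isempty's status.
isempty_demo() {
    tmpdir=$(mktemp -d) || return 1
    mkdir "$tmpdir/empty" "$tmpdir/star" "$tmpdir/hidden"
    : > "$tmpdir/star/*"        # a file literally named *
    : > "$tmpdir/hidden/.x"     # a two-character dot file
    for d in empty star hidden missing; do
        isempty "$tmpdir/$d"
        printf '%s %s\n' "$d" "$?"
    done
    rm -rf "$tmpdir"
}
```

Running isempty_demo should report 0 for the empty directory, 1 for the two non-empty ones, and 2 for the path that does not exist.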

Just one question: what's the "$1"/[*] glob pattern test for? I can't
figure out how it's not redundant.

Thanks,

- Martijn

Barry Margolin

Jun 23, 2015, 5:18:44 PM
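In article <martijn-C9E94C...@news.individual.net>,
Martijn Dekker <mar...@inlv.demon.nl> wrote:

> Just one question: what's the "$1"/[*] glob pattern test for? I can't
> figure out how it's not redundant.

It only matches a file named *. This allows you to distinguish the case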
of the * glob returning * because there were no matches from it
returning * because there was exactly one file named *.

If the directory is empty, the first two globs will expand to

[*] *

because neither of them matches anything, so they expand to themselves.

If the directory just contains a file named *, they will expand to

* *

because both of them match that file.

If it just contains a file named [*], they will expand to

[*] [*]

again because both of them match that file.

And if it contains files xxx, yyy, and zzz, they'll expand to:

[*] xxx yyy zzz

The first glob expands to itself because it doesn't match, and * expands
to the filenames.
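
The four cases above can be reproduced with a short sketch. The helper name show_expansion and the use of mktemp -d are assumptions for the illustration, and it assumes a shell that leaves unmatched patterns unchanged (the POSIX default).

```shell
# Hypothetical helper: create the named files in a scratch directory,
# then print what the two patterns [*] and * expand to there.
show_expansion() {
    tmpdir=$(mktemp -d) || return 1
    (
        cd "$tmpdir" || exit 1
        for name do : > "./$name"; done   # one empty file per argument
        set -- [*] *                      # unmatched patterns stay as they are
        printf '%s\n' "$*"
    )
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

For example, show_expansion with no arguments covers the empty directory, show_expansion '*' the directory holding only a file named *, and show_expansion xxx yyy zzz the ordinary case.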

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Martijn Dekker

Jun 26, 2015, 6:15:40 PM
In article <barmar-941097....@88-209-239-213.giganet.hu>,
Barry Margolin <bar...@alum.mit.edu> wrote:

> It only matches a file named *. This allows you to distinguish the case
> of the * glob returning * because there were no matches from it
> returning * because there was exactly one file named *.

That makes sense, thanks.

I was noticing the same thing is not done for the other glob patterns,
so I thought it was relying on the non-existence of the (admittedly
unlikely) literal filenames '.[!.]*' and '.??*'. But each of these two
patterns matches the other as a literal string, so this is covered. Very
clever.
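
The cross-matching is easy to check with a case statement, which uses the same pattern syntax as pathname expansion. The helper name matches is made up for this sketch.

```shell
# Hypothetical helper: succeed if the literal string $1 is matched by
# the shell pattern $2 (unquoted in the case label, so it acts as a pattern).
matches() {
    case $1 in
        ($2) return 0 ;;
        (*)  return 1 ;;
    esac
}

# Each pattern, taken as a literal file name, is matched by the other one,
# so a file with either name changes at least one of the two expansions.
matches '.??*' '.[!.]*' && echo "'.[!.]*' matches the name '.??*'"
matches '.[!.]*' '.??*' && echo "'.??*' matches the name '.[!.]*'"
```

Neither pattern matches the name '..', which is why the parent directory entry never shows up in the expansions.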

- M.

Barry Margolin

Jun 26, 2015, 7:31:39 PM
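In article <martijn-5A1B4B...@news.individual.net>,
Martijn Dekker <mar...@inlv.demon.nl> wrote:

> I was noticing the same thing is not done for the other glob patterns,
> so I thought it was relying on the non-existence of the (admittedly
> unlikely) literal filenames '.[!.]*' and '.??*'. But each of these two
> patterns matches the other as a literal string, so this is covered. Very
> clever.

Yeah, whoever came up with that put some good thought into it, I think
it's bulletproof.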

Martijn Dekker

Jun 26, 2015, 8:17:55 PM
In article <barmar-7DE3E2....@88-209-239-213.giganet.hu>,
Barry Margolin <bar...@alum.mit.edu> wrote:

> In article <martijn-5A1B4B...@news.individual.net>,
> Martijn Dekker <mar...@inlv.demon.nl> wrote:
>
> > I was noticing the same thing is not done for the other glob patterns,
> > so I thought it was relying on the non-existence of the (admittedly
> > unlikely) literal filenames '.[!.]*' and '.??*'. But each of these two
> > patterns matches the other as a literal string, so this is covered. Very
> > clever.
>
> Yeah, whoever came up with that put some good thought into it, I think
> it's bulletproof.

It was Stéphane Chazelas. (Hat off and thanks to him!)

- M.

Stephane Chazelas

Jun 28, 2015, 3:20:09 AM
2015-06-27 02:17:49 +0200, Martijn Dekker:
[...]

Thanks,

But the initial idea of using * [*] was IIRC Laura Fairhead's
as posted here a few years ago.

--
Stephane