Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Solaris race condition with FIFOs?

67 views
Skip to first unread message

Martijn Dekker

unread,
Dec 24, 2018, 11:20:29 AM12/24/18
to
You'd think that establishing a pipe between two processes is a very
basic UNIX feature that should work reliably on all UNIX variants.

But on my Solaris evaluation VirtualBox VMs (10.3, 11.3 and 11.4) the
following script, that tests named pipes (FIFOs), breaks consistently
when executed by ksh93, bash or dash. All it does is make 100 FIFOs and
read a line from each -- it should be trivial.

And it does work fine on Linux, macOS, and all the BSDs, on all shells.

#! /bin/sh
tmpdir=/tmp/FIFOs$$
trap "exec rm -rf $tmpdir" EXIT INT PIPE TERM
mkdir "$tmpdir" || exit
i=0; while test "$((i+=1))" -le 100; do
fifo=$tmpdir/FIFO$i
mkfifo "$fifo" || exit
echo "this is FIFO $i" >"$fifo" &
read foo <"$fifo" && echo "$foo"
done

On Solaris, ksh93 (/bin/sh) and bash simply hang after the first few;
dash goes through all of them bust most fail with 'interrupted system call'.

ksh93, bash and dash execute this script just fine on any other OS, so I
think this exposes a kernel-level race condition involving FIFOs on Solaris.

However, both zsh and /usr/xpg4/bin/sh (ksh88) execute the script
correctly on Solaris -- but about 20 times as slowly as on other OSs.

I also tried an illumos distro, OpenIndiana, and it breaks on that too,
in identical ways.

I'd like to know if this strange bug is limited to VirtualBox or that
"real" Solaris/illumos machines suffer from this as well.

Can anyone try the script above on a real Solaris and/or illumos machine
and let me know the results?

Thanks,

- M.

Barry Margolin

unread,
Dec 24, 2018, 2:47:55 PM12/24/18
to
In article <g8cfa9...@mid.individual.net>,
Martijn Dekker <mar...@inlv.demon.nl> wrote:

> #! /bin/sh
> tmpdir=/tmp/FIFOs$$
> trap "exec rm -rf $tmpdir" EXIT INT PIPE TERM
> mkdir "$tmpdir" || exit
> i=0; while test "$((i+=1))" -le 100; do
> fifo=$tmpdir/FIFO$i
> mkfifo "$fifo" || exit
> echo "this is FIFO $i" >"$fifo" &
> read foo <"$fifo" && echo "$foo"
> done
>
> On Solaris, ksh93 (/bin/sh) and bash simply hang after the first few;
> dash goes through all of them bust most fail with 'interrupted system call'.
>
> ksh93, bash and dash execute this script just fine on any other OS, so I
> think this exposes a kernel-level race condition involving FIFOs on Solaris.
>
> However, both zsh and /usr/xpg4/bin/sh (ksh88) execute the script
> correctly on Solaris -- but about 20 times as slowly as on other OSs.

Have you tried making a system call trace to see where they're hanging?
Although if it's a race condition, the timing change that tracing
introduces could mask the problem.

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Martijn Dekker

unread,
Dec 24, 2018, 4:08:00 PM12/24/18
to
Op 24-12-18 om 19:47 schreef Barry Margolin:
> In article <g8cfa9...@mid.individual.net>,
> Martijn Dekker <mar...@inlv.demon.nl> wrote:
>
>> #! /bin/sh
>> tmpdir=/tmp/FIFOs$$
>> trap "exec rm -rf $tmpdir" EXIT INT PIPE TERM
>> mkdir "$tmpdir" || exit
>> i=0; while test "$((i+=1))" -le 100; do
>> fifo=$tmpdir/FIFO$i
>> mkfifo "$fifo" || exit
>> echo "this is FIFO $i" >"$fifo" &
>> read foo <"$fifo" && echo "$foo"
>> done
>>
>> On Solaris, ksh93 (/bin/sh) and bash simply hang after the first few;
>> dash goes through all of them bust most fail with 'interrupted system call'.
[...]
> Have you tried making a system call trace to see where they're hanging?

Here is the output of 'truss -p'.

Note there is about half a second between the second-to-last line and
the last one ("openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO10",
O_RDONLY|O_XPG4OPEN) (sleeping...)"), where it hangs.

- M.

$ /bin/sh test.sh & truss -p $!
this is FIFO 1
[1] 28056
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO2", O_RDONLY|O_XPG4OPEN) = 3
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28061 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 2
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A4) = 28063
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28063 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28064
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO3", O_RDONLY|O_XPG4OPEN) = 3
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
Received signal #18, SIGCLD, in read() [caught]
siginfo: SIGCLD CLD_EXITED pid=28064 status=0x0000
read(0, 0x7FE87FC67050, 65536) Err#4 EINTR
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C710,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C710,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C780, 0x7FF4BA19C800) = 0
setcontext(0x7FF4BA19C530)
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 3
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A4) = 28065
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28065 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28066
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO4", O_RDONLY|O_XPG4OPEN) = 3
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28066 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 4
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A4) = 28067
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28067 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28068
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO5", O_RDONLY|O_XPG4OPEN) = 3
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28068 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 5
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A4) = 28069
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28069 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28070
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO6", O_RDONLY|O_XPG4OPEN) = 3
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28070 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 6
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A4) = 28071
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28071 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28072
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO7", O_RDONLY|O_XPG4OPEN) = 3
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28072 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 7
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A4) = 28073
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28073 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28074
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO8", O_RDONLY|O_XPG4OPEN) = 3
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28074 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 8
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A4) = 28075
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28075 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28076
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO9", O_RDONLY|O_XPG4OPEN) = 3
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28076 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
fcntl(0, F_DUPFD, 0x0000000A) = 11
fcntl(11, F_SETFD, 0x00000001) = 0
close(0) = 0
ioctl(3, TCGETS, 0x7FF4BA19DA50) Err#22 EINVAL
lseek(3, 0, SEEK_CUR) Err#29 ESPIPE
fstatat(3, NULL, 0x7FF4BA19DA90, 0) = 0
fstatat(3, NULL, 0x7FF4BA19D9F0, 0) = 0
fstatat(3, NULL, 0x7FF4BA19DA20, 0) = 0
fcntl(3, F_DUPFD, 0x00000000) = 0
close(3) = 0
read(0, " t h i s i s F I F O".., 65536) = 15
close(0) = 0
fcntl(11, F_DUPFD, 0x00000000) = 0
close(11) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DC60, 0) = 0
fstatat(AT_FDCWD, "/usr/bin/mkfifo", 0x7FF4BA19DCE0,
AT_SYMLINK_NOFOLLOW) = 0
this is FIFO 9
write(1, " t h i s i s F I F O".., 15) = 15
getpid() = 28056 [28055]
spawn("/usr/bin/mkfifo", 0x7FE87FD51000, 0x00000090, 0x7FE87FC80000,
0x000005A5) = 28077
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=28077 status=0x0000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
setcontext(0x7FF4BA19AA40)
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19DB20,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19DB90, 0x7FF4BA19DC10) = 0
ioctl(2, TIOCGSID, 0x7FF4BA19DD3C) = 0
getsid(0) = 28055
ioctl(2, TIOCGPGRP, 0x7FF4BA19DD5C) = 0
forkx(0) = 28078
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
Received signal #18, SIGCLD, in openat() [caught]
siginfo: SIGCLD CLD_EXITED pid=28078 status=0x0000
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO10", O_RDONLY|O_XPG4OPEN) Err#4 EINTR
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000, 0x00000000, 0x00000000)
= 0xFFBFFEFF [0xFFFFFFFF]
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) = 0
waitid(P_ALL, 0, 0x7FF4BA19C7A0,
WEXITED|WTRAPPED|WSTOPPED|WCONTINUED|WNOHANG) Err#10 ECHILD
sigaction(SIGCLD, 0x7FF4BA19C810, 0x7FF4BA19C890) = 0
setcontext(0x7FF4BA19C5C0)
openat(AT_FDCWD, "/tmp/FIFOs28056/FIFO10", O_RDONLY|O_XPG4OPEN)
(sleeping...)


Kaz Kylheku

unread,
Dec 25, 2018, 1:59:36 PM12/25/18
to
On 2018-12-24, Martijn Dekker <mar...@inlv.demon.nl> wrote:
> Can anyone try the script above on a real Solaris and/or illumos machine
> and let me know the results?

Merry Christmas!

I tried it on a Solaris VM, not real machine, with bash. Here is what
is going wrong:

> echo "this is FIFO $i" >"$fifo" &

This is forked off and the parent opens the pipe for reading.

Well, sort of!

In any case, the reader is able to proceed. It writes to the pipe
successfully and dies.

> read foo <"$fifo" && echo "$foo"

But the parent hasn't actually finished the open64 system call! It gets a
SIGCHLD and so the open64 returns with ERESTART. Then, the system call
is automatically restarted: again it opens the pipe for writing, and
this time hangs forever waiting for a reader to arrive.

(It hangs in an poorly chosen sleep that can't be interrupted with
Ctrl-C/SIGTERM.)

Kaz Kylheku

unread,
Dec 25, 2018, 2:32:06 PM12/25/18
to
On 2018-12-25, Kaz Kylheku <157-07...@kylheku.com> wrote:
> On 2018-12-24, Martijn Dekker <mar...@inlv.demon.nl> wrote:
>> Can anyone try the script above on a real Solaris and/or illumos machine
>> and let me know the results?
>
> Merry Christmas!
>
> I tried it on a Solaris VM, not real machine, with bash. Here is what
> is going wrong:

P.S. Solaris 10 in a VirtualBox.

Martijn Dekker

unread,
Jan 1, 2019, 6:33:01 PM1/1/19
to
Op 25-12-18 om 19:59 schreef Kaz Kylheku:
> Merry Christmas!

Happy New Year!

> I tried it on a Solaris VM, not real machine, with bash. Here is what
> is going wrong:
[...]

Thanks for trying and analysing it. Meanwhile I also brought it up on
the bug-bash list and an Oracle employee replied that the bug only shows
up on VirtualBox and not on real machines.

I was hoping as much -- it would have been hard to believe that
something this basic is broken on Solaris in general.

- M.

Casper H.S. Dik

unread,
Jan 2, 2019, 9:41:14 AM1/2/19
to
It's is an actual Solaris issue; it seems to being a problem only
in Virtual Box but that is likely because you configured it with one
CPU.

If you add more CPUs the problem will likely go away and if you run
this on a process bound to one CPU, you will likely also see it on
bare metal.

"truss" will likely hide the problem but dtrace was able to
point to the cause.

Casper

Casper H.S. Dik

unread,
Jan 2, 2019, 9:43:37 AM1/2/19
to
Right.

Unfortunately, the system has already consumed one writer so as it
is restarted, it cannot be properly restarted.

As we already made the rendezvous, fifo_open should not have
returned EINTR/ERESTART; it should have returned 0.


Casper

Casper H.S. Dik

unread,
Jan 2, 2019, 9:44:19 AM1/2/19
to
Try more CPUs and the problem goes away, I think.

(Likely why we did not find it earlier)

Casper

Kaz Kylheku

unread,
Jan 2, 2019, 3:03:06 PM1/2/19
to
["Followup-To:" header set to comp.unix.shell.]
On 2019-01-01, Martijn Dekker <mar...@inlv.demon.nl> wrote:
> Op 25-12-18 om 19:59 schreef Kaz Kylheku:
>> Merry Christmas!
>
> Happy New Year!
>
>> I tried it on a Solaris VM, not real machine, with bash. Here is what
>> is going wrong:
> [...]
>
> Thanks for trying and analysing it. Meanwhile I also brought it up on
> the bug-bash list and an Oracle employee replied that the bug only shows
> up on VirtualBox and not on real machines.

Then they should investigate it with VirtualBox; any target that
helps reproduce a race is valuable.

> I was hoping as much -- it would have been hard to believe that
> something this basic is broken on Solaris in general.

Something this basic *is* broken in Solaris in general.

"Doesn't repro readily" isn't the same as "not broken".

What's required is an root cause analysis of how it happens on
VirtualBox. Only then can a believable argument be made that the
situation is impossible on real hardware, if that happens to be true.

It's hard to believe that there would be VirtualBox-specific code in
areas of the kernel dealing with FIFOS, signals and system call
restarting. Inter-process communication, signals and process exits are
purely software abstractions; they are not close to the hardware.

Kaz Kylheku

unread,
Jan 2, 2019, 3:12:01 PM1/2/19
to
["Followup-To:" header set to comp.unix.shell.]
Will it go away even if those additional CPU's are loaded down with
work, unavailable for scheduling the tasks involved in the test case?

YTC#1

unread,
Jan 2, 2019, 3:38:57 PM1/2/19
to
On 02/01/2019 20:03, Kaz Kylheku wrote:
> ["Followup-To:" header set to comp.unix.shell.]
> On 2019-01-01, Martijn Dekker <mar...@inlv.demon.nl> wrote:
>> Op 25-12-18 om 19:59 schreef Kaz Kylheku:
>>> Merry Christmas!
>>
>> Happy New Year!
>>
>>> I tried it on a Solaris VM, not real machine, with bash. Here is what
>>> is going wrong:
>> [...]
>>
>> Thanks for trying and analysing it. Meanwhile I also brought it up on
>> the bug-bash list and an Oracle employee replied that the bug only shows
>> up on VirtualBox and not on real machines.
>
> Then they should investigate it with VirtualBox; any target that
> helps reproduce a race is valuable.

In the good old days, someone probably would.
If you were to raise a support call and state the case then the time may
be allocated to it.

>
>> I was hoping as much -- it would have been hard to believe that
>> something this basic is broken on Solaris in general.
>
> Something this basic *is* broken in Solaris in general.

Possibly yes. And if that is the case it will probably be apparent in
OpenIndiana

maybe the OP can test with that in VB ? And if so, get it looked at there ?





--
Bruce Porter
"The internet is a huge and diverse community but mainly friendly"
http://ytc1.blogspot.co.uk/
There *is* an alternative! http://www.openoffice.org/

Casper H.S. Dik

unread,
Jan 3, 2019, 4:09:53 AM1/3/19
to
Probably not; I've filed a bug:

29161556 race condition in fifo_open

and I'll be fixing it shortly (the fix is simple).

Casper

Casper H.S. Dik

unread,
Jan 3, 2019, 4:12:50 AM1/3/19
to
YTC#1 <b...@ytc1-spambin.co.uk> writes:

>Possibly yes. And if that is the case it will probably be apparent in
>OpenIndiana

>maybe the OP can test with that in VB ? And if so, get it looked at there ?

It's even to reproduce it in bare metal as long as you bound it
to one CPU.

Casper

Martijn Dekker

unread,
Jan 3, 2019, 6:03:00 PM1/3/19
to
Op 02-01-19 om 21:11 schreef Kaz Kylheku:
> ["Followup-To:" header set to comp.unix.shell.]
> On 2019-01-02, Casper H.S Dik <Caspe...@OrSPaMcle.COM> wrote:
>> Kaz Kylheku <157-07...@kylheku.com> writes:
>>
>>> On 2018-12-25, Kaz Kylheku <157-07...@kylheku.com> wrote:
>>>> On 2018-12-24, Martijn Dekker <mar...@inlv.demon.nl> wrote:
>>>>> Can anyone try the script above on a real Solaris and/or illumos machine
>>>>> and let me know the results?
>>>>
>>>> I tried it on a Solaris VM, not real machine, with bash. Here is what
>>>> is going wrong:
>>
>>> P.S. Solaris 10 in a VirtualBox.
>>
>> Try more CPUs and the problem goes away, I think.

Adding more CPUs makes the problem "go away" for me (see below) on
Solaris 10.1 and 11.3, but, interestingly, *not* on Solaris 11.4.

> Will it go away even if those additional CPU's are loaded down with
> work, unavailable for scheduling the tasks involved in the test case?

Kaz called it. Simply running a couple of
while :; do :; done &
makes the problem come right back, even with 4 CPUs.

- M.

[comp.unix.solaris added back in; it's entirely relevant]

Martijn Dekker

unread,
Jan 3, 2019, 6:25:55 PM1/3/19
to
[comp.unix.solaris added back in; this is entirely relevant to that group]

Op 03-01-19 om 10:09 schreef Casper H.S. Dik:
> Kaz Kylheku <157-07...@kylheku.com> writes:
>> Will it go away even if those additional CPU's are loaded down with
>> work, unavailable for scheduling the tasks involved in the test case?
>
> Probably not; I've filed a bug:

Yes, a little experimenting shows that maxing out the CPUs makes the
problem come right back.

> 29161556 race condition in fifo_open
>
> and I'll be fixing it shortly (the fix is simple).

I've found that illumos inherited this bug from OpenSolaris. In the
interest of portability, and since I just did Oracle a free favour by
finding and reporting a long-standing bug in basic SunOS kernel
functionality, it would be nice if you could describe in general terms
what this simple fix entails, so I can pass on these hints to the
illumos bug tracker.

TIA,

- Martijn

Kaz Kylheku

unread,
Jan 3, 2019, 10:05:31 PM1/3/19
to
Don't you get an eerie feeling about some SunOS FIFO things being fixed
in 2019?

I'd love to know how far back this goes.

Who broke it and what were they mucking around with for what purpose,
kind of thing. :)

Casper H.S. Dik

unread,
Jan 4, 2019, 2:28:12 AM1/4/19
to
The function fifo_open() uses cv_wait_sig_swap(); it returns
"0" when it finds that a signal was received and "1" when
no signal was received but when the cv was signalled.

But if both a signal (SIGCHLD in our test case) and a cv was
signalled, it still returns "0" and the code returns EINTR; it
should verify if the condition is actually met and return 0
because otherwise we have consumed the peer.

Generally, it would wakeup the sleeping thread immediately so
you would either wakeup on the signal or on the cv_signal but
if the wakeup is delayed, you get both cv_signal & the signal
and the cv_wait_sig_* then returns that it received a signal.

I'm not sure if that is actually the proper behaviour of
cv_wait_sig*(), but generally a caller of cv_wait*() should
work on the assumption that a wakeup can be spurious and it
should check the condition.

Now generally a system call can return EINTR and it can be
restarted; however in the case of fifo_open() it cannot be
restarted: the peer has been consumed.

Casper

Casper H.S. Dik

unread,
Jan 4, 2019, 2:56:31 AM1/4/19
to
Looking at the code, it seems it was broken since Solaris 2.5

The history of cv_wait_sig*() makes clear that the intention to return
a signal was received even when the cv_signal() was also seen.
(And that cv_wait_sig*() was changed in Solaris 2.8 to make sure that
the cv_signal() was not consumed when it returned a signal was received.

To be honest, I think this is backward. If the code find that both
a unix signal and a cv_signal have been received, it should pretend
the signal did not happen; cv_wait_sig*() would be called again, anyway,
and the next call would return finally return a unix signal was found.
(Or in the case of fifo_open(), it would just succeed and won't return
EINTR)

But changing cv_wait_sig*() semantics is much more risky than fixing
fifo_open().

Casper

Kaz Kylheku

unread,
Jan 4, 2019, 3:04:15 PM1/4/19
to
On 2019-01-04, Casper H.S Dik <Caspe...@OrSPaMcle.COM> wrote:
> I'm not sure if that is actually the proper behaviour of
> cv_wait_sig*(), but generally a caller of cv_wait*() should

It doesn't sound as if the return value of that function can be used for
any reliable purpose, even if it is a true indication of the low-level
wake up reason.

I don't know about Solaris' implementation, but it can easily work like
this: a signal wakes up the thread, and then the thread has to acquire
the mutex as if by a locking operation, competing for it with other
threads. Suppose the peer thread gets the mutex first; it will make the
condition true, and signal the condition variable. The previous thread
will get the mutex, and return from the condition wait, still indicating
that it had been woken by a signal.

> work on the assumption that a wakeup can be spurious and it
> should check the condition.

That pretty much applies to the caller of any condition variable wait
in just about any system, anywhere.

Martijn Dekker

unread,
Jan 4, 2019, 7:30:50 PM1/4/19
to
Op 04-01-19 om 04:05 schreef Kaz Kylheku:
> Don't you get an eerie feeling about some SunOS FIFO things being fixed
> in 2019?

To be honest: yes.

I've been spending about three years developing a portable library to
improve and extend the shell language as a scripting language, fixing
flaws in some of its basic paradigms. I'm nearly ready for a first alpha
release. While developing this I've been finding and reporting many
shell bugs, including 25-year-old corner case bugs in some shells like
bash -- but nothing like this. FIFOs are utterly broken during high
system load or on single-CPU systems as of Solaris 2.5 (1995). Yikes.

In case anyone is curious, I found this bug by combining very basic Unix
functionality (block-local file descriptors and aliases on shells,
FIFOS, and background processes) in new ways to do something unorthodox:
a robust and extensible shell loop construct.
https://github.com/modernish/modernish#user-content-use-varloop
https://github.com/modernish/modernish/blob/master/libexec/modernish/var/loop.mm
https://github.com/modernish/modernish/tree/master/libexec/modernish/var/loop
This allows doing interesting things like turning the 'find' utility
into a loop, making complex processing of 'find' results much easier.

Atomicity requires that FIFOs be made on the fly when initialising such
a loop, then used immediately and unlinked early. This breaks on Solaris
and illumos when using ksh93 or bash under high system load or on a
single CPU. Dash also breaks, but in a way I can recover from (see code
in var/loop.mm) as 'read' fails with an interrupted system call error
instead of blocking.

It's not the fault of any of these shells, of course. Still I think it's
interesting that yash and zsh manage to cope with this SunOS kernel race
condition (by slowing down drastically).

- M.

Martijn Dekker

unread,
Jan 5, 2019, 7:19:44 AM1/5/19
to
Op 04-01-19 om 08:28 schreef Casper H.S. Dik:
> Martijn Dekker <mar...@inlv.demon.nl> writes:
>> [...] it would be nice if you could describe in general terms
>> what this simple fix entails, so I can pass on these hints to the
>> illumos bug tracker.
>
> The function fifo_open() uses cv_wait_sig_swap(); it returns
> "0" when it finds that a signal was received and "1" when
> no signal was received but when the cv was signalled.
>
> But if both a signal (SIGCHLD in our test case) and a cv was
> signalled, it still returns "0" and the code returns EINTR; it
> should verify if the condition is actually met and return 0
> because otherwise we have consumed the peer.
>
> Generally, it would wakeup the sleeping thread immediately so
> you would either wakeup on the signal or on the cv_signal but
> if the wakeup is delayed, you get both cv_signal & the signal
> and the cv_wait_sig_* then returns that it received a signal.
>
> I'm not sure if that is actually the proper behaviour of
> cv_wait_sig*(), but generally a caller of cv_wait*() should
> work on the assumption that a wakeup can be spurious and it
> should check the condition.
>
> Now generally a system call can return EINTR and it can be
> restarted; however in the case of fifo_open() it cannot be
> restarted: the peer has been consumed.

Thanks.

Here's the illumos bug for reference:

https://www.illumos.org/issues/10179

- Martijn

Richard L. Hamilton

unread,
Jan 14, 2019, 6:25:08 PM1/14/19
to
In article <201901041...@kylheku.com>,
The condvar(9f) man page seems to correctly describe the actual behavior
of cv_wait_sig(), if I'm correctly understanding this discussion. In
particular, either 0 or >0 return could happen even if no cv_signal()
or cv_broadcast() was actually sent; about all you could say is that
if 0 was returned, that was due to a Unix signal.

The NOTES section of the aforementioned man page says
"Code that calls these functions must always recheck the reason for
blocking and call again if the reason for blocking is still true."

i.e. that presumably applies regardless of the return value.

--
ftp> get |fortune
377 I/O error: smart remark generator failed

Bogonics: the primary language inside the Beltway

Lasik/PRK theme music:
"In the Hall of the Mountain King", from "Peer Gynt"
(read act 2, scene 6 of the play if that doesn't make sense)
0 new messages