
Bug#719273: sysvinit-utils: /bin/pidof fails when there are stuck NFS mount points, preventing shutdown


Daniel Povey

Aug 9, 2013, 7:40:02 PM
Package: sysvinit-utils
Version: 2.88dsf-41
Severity: important
Tags: d-i upstream

Dear Maintainer,

This bug is essentially the same as https://bugzilla.redhat.com/show_bug.cgi?id=138788,
which appears to have been resolved in Red Hat. The current version of "pidof" examines
entries in /proc that it does not really need, which leads it to access files or
directories on broken NFS mount points (if any exist), which in turn leads to a hang. This
prevents normal shutdown of a machine with broken NFS mount points.


*** Please consider answering these questions, where appropriate ***

* What led up to the situation? We have a system with many NFS mounts, some of
which sometimes go down.
* What exactly did you do (or not do) that was effective (or
ineffective)? No satisfactory resolution yet.
* What was the outcome of this action?
* What outcome did you expect instead?

*** End of the template - remove these lines ***


-- System Information:
Debian Release: 7.1
APT prefers stable-updates
APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-4-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=ANSI_X3.4-1968) (ignored: LC_ALL set to C)
Shell: /bin/sh linked to /bin/bash

Versions of packages sysvinit-utils depends on:
ii libc6 2.13-38
ii libselinux1 2.1.9-5

sysvinit-utils recommends no packages.

Versions of packages sysvinit-utils suggests:
pn bootlogd <none>
pn sash <none>

-- no debconf information



Daniel Povey

Aug 10, 2013, 5:20:01 PM
I am sending a patch to the source of "killall5" that I am using
locally. It ignores all processes in a "D" or "Z" state (or
states where D or Z appears in the state string). This is of course not
ideal, but it works for me. I found that not every machine with
stuck processes caused problems for start-stop-daemon; this
patch is only for "pidof". I may later replicate the problem with
start-stop-daemon and figure out a fix.


root@a04:~# diff -c sysvinit-2.88dsf/src/killall5.c sysvinit-2.88dsf-modified/src/killall5.c
*** sysvinit-2.88dsf/src/killall5.c Sat Aug 10 17:05:31 2013
--- sysvinit-2.88dsf-modified/src/killall5.c Sat Aug 10 16:50:27 2013
***************
*** 1,5 ****
/*
! * kilall5.c Kill all processes except processes that have the
* same session id, so that the shell that called us
* won't be killed. Typically used in shutdown scripts.
*
--- 1,5 ----
/*
! * killall5.c Kill all processes except processes that have the
* same session id, so that the shell that called us
* won't be killed. Typically used in shutdown scripts.
*
***************
*** 536,548 ****
p->statname = (char *)xmalloc(strlen(s)+1);
strcpy(p->statname, s);

/* Get session, startcode, endcode. */
startcode = endcode = 0;
! if (sscanf(q, "%*c %*d %*d %d %*d %*d %*u %*u "
"%*u %*u %*u %*u %*u %*d %*d "
"%*d %*d %*d %*d %*u %*u %*d "
"%*u %lu %lu",
! &p->sid, &startcode, &endcode) != 3) {
p->sid = 0;
nsyslog(LOG_ERR, "can't read sid from %s\n",
path);
--- 536,550 ----
p->statname = (char *)xmalloc(strlen(s)+1);
strcpy(p->statname, s);

+ char status[11];
+
/* Get session, startcode, endcode. */
startcode = endcode = 0;
! if (sscanf(q, "%10s %*d %*d %d %*d %*d %*u %*u "
"%*u %*u %*u %*u %*u %*d %*d "
"%*d %*d %*d %*d %*u %*u %*d "
"%*u %lu %lu",
! status, &p->sid, &startcode,
&endcode) != 4) {
p->sid = 0;
nsyslog(LOG_ERR, "can't read sid from %s\n",
path);
***************
*** 555,560 ****
--- 557,571 ----
if (startcode == 0 && endcode == 0)
p->kernel = 1;
fclose(fp);
+ if (strchr(status, 'D') != NULL || strchr(status, 'Z') != NULL) {
+ /* Ignore zombie processes or processes in disk sleep, as attempts
+ to access the stats of these will sometimes fail. */
+ if (p->argv0) free(p->argv0);
+ if (p->argv1) free(p->argv1);
+ if (p->statname) free(p->statname);
+ free(p);
+ continue;
+ }
} else {
/* Process disappeared.. */
if (p->argv0) free(p->argv0);
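
To try the idea outside killall5.c, here is a minimal standalone sketch of the same check. It is not part of sysvinit: the function names proc_state_from_stat and should_skip_state are made up for illustration, and the sketch assumes the usual /proc/<pid>/stat layout, where the state letter is the first field after the parenthesised command name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of sysvinit): return the one-character
 * process state from a /proc/<pid>/stat line, or '\0' if the line is
 * malformed. The command name is wrapped in parentheses and may itself
 * contain spaces or ')', so we search for the LAST ')' in the line.
 */
char proc_state_from_stat(const char *stat_line)
{
    const char *p;

    assert(stat_line != NULL);
    p = strrchr(stat_line, ')');
    if (p == NULL)
        return '\0';
    p++;                      /* step past ')' */
    while (*p == ' ')
        p++;                  /* skip the separating space(s) */
    return *p;                /* '\0' if the line ended early */
}

/*
 * Hypothetical policy mirroring the patch above: skip processes in
 * uninterruptible sleep ('D') or zombie ('Z') state.
 */
int should_skip_state(char state)
{
    return state == 'D' || state == 'Z';
}
```

A caller would read the single line from /proc/<pid>/stat into a buffer, pass it to proc_state_from_stat, and drop the process from its list when should_skip_state returns nonzero. The sketch checks the state before touching anything else belonging to the process, which is the point of the patch.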




On Fri, Aug 9, 2013 at 7:33 PM, Debian Bug Tracking System
<ow...@bugs.debian.org> wrote:
> Thank you for filing a new Bug report with Debian.
>
> This is an automatically generated reply to let you know your message
> has been received.
>
> Your message is being forwarded to the package maintainers and other
> interested parties for their attention; they will reply in due course.
>
> Your message has been sent to the package maintainer(s):
> Debian sysvinit maintainers <pkg-sysvi...@lists.alioth.debian.org>
>
> If you wish to submit further information on this problem, please
> send it to 719...@bugs.debian.org.
>
> Please do not send mail to ow...@bugs.debian.org unless you wish
> to report a problem with the Bug-tracking system.
>
> --
> 719273: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719273
> Debian Bug Tracking System
> Contact ow...@bugs.debian.org with problems

Daniel Povey

Dec 30, 2017, 6:10:03 PM
Guys,

No-one ever responded to this thread (years ago).
We have just noticed the same problem on a newer version of Debian and we are going to dig up our old patch and use it.
This is an issue for services like rpcbind for which systemd uses the older SysV init scripts.  

Dan



Daniel Povey

Dec 30, 2017, 6:30:03 PM
We're going to check it out early next week, whether it still applies.

The bug it fixes is a situation where pidof reaches a process that's in a bad state such as D or Z, that might have an executable on a stuck mount point or something like that, and gets stuck, preventing system maintenance tasks or shutdown.





On Sat, Dec 30, 2017 at 3:20 PM, Ian Jackson <ijac...@chiark.greenend.org.uk> wrote:
Control: tags -1 + patch

Daniel Povey writes ("Bug#719273: Acknowledgement (sysvinit-utils: /bin/pidof fails when there are stuck NFS mount points, preventing shutdown)"):

> No-one ever responded to this thread (years ago).
> We have just noticed the same problem on a newer version of Debian and we are
> going to dig up our old patch and use it.
> This is an issue for services like rpcbind for which systemd uses the older
> SysV init scripts.

Thanks for pinging this bug.

Do you know if the patch still applies to current sysvinit source ?

Reading a non-unified diff certainly takes me back some years...

Regards,
Ian.

Ian Jackson

Dec 30, 2017, 6:30:03 PM
Control: tags -1 + patch

Daniel Povey writes ("Bug#719273: Acknowledgement (sysvinit-utils: /bin/pidof fails when there are stuck NFS mount points, preventing shutdown)"):
> No-one ever responded to this thread (years ago).
> We have just noticed the same problem on a newer version of Debian and we are
> going to dig up our old patch and use it.
> This is an issue for services like rpcbind for which systemd uses the older
> SysV init scripts.

Petter Reinholdtsen

Dec 31, 2017, 10:00:03 AM
[Daniel Povey]
> We're going to check it out early next week, whether it still applies.
>
> The bug it fixes is a situation where pidof reaches a process that's in a
> bad state such as D or Z, that might have an executable on a stuck mount
> point or something like that, and gets stuck, preventing system maintenance
> tasks or shutdown.

Perhaps something to send upstream,
<URL: https://savannah.nongnu.org/projects/sysvinit/ >? Someone
recently showed up on the upstream mailing list and offered to update
upstream source. :)

--
Happy hacking
Petter Reinholdtsen

Ian Jackson

Dec 31, 2017, 12:40:03 PM
Daniel Povey writes ("Bug#719273: Acknowledgement (sysvinit-utils: /bin/pidof fails when there are stuck NFS mount points, preventing shutdown)"):
> We're going to check it out early next week, whether it still applies.
>
> The bug it fixes is a situation where pidof reaches a process that's in a bad
> state such as D or Z, that might have an executable on a stuck mount point or
> something like that, and gets stuck, preventing system maintenance tasks or
> shutdown.

Yes. The principle of the patch LGTM. Hence my tagging the bug
accordingly.

Thanks for your persistence.

Ian.

Daniel Povey

Dec 31, 2017, 2:50:02 PM
Guys, I just want to give a bit more context that I've found out since then which might be relevant.

As you probably know (but to document it here), if you set the variable PIDOF_NETFS or set the flag -n for 'pidof', it will avoid trying to stat files on NFS partitions.  This will solve some of these types of issues, but not all of them, because a binary that's local can still be in D state due to accessing a stuck partition.  I'd also like to point out, though I don't know whether this really happens, that a process that's not in a D or Z state might still have a binary on a stuck mount point where calling "stat" on it would fail.  But I don't want to complicate things too much: a partial fix is better than nothing.
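
For the last case, where stat itself may hang regardless of the process state, one possible workaround is to run the stat call in a short-lived child process and give up after a timeout. The following is only a sketch under that assumption. It is not what pidof or the patch does, and stat_with_timeout is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in sysvinit): run stat(2) on `path` in a child
 * process and wait at most roughly `timeout_ms` milliseconds for it.
 * Returns 0 if stat succeeded, 1 if stat failed or the child died
 * abnormally, 2 on timeout, and -1 if the probe could not be run.
 */
int stat_with_timeout(const char *path, int timeout_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int waited_ms = 0;
    int status;
    pid_t child, r;

    assert(path != NULL);
    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: only this process can get stuck on a dead mount. */
        struct stat st;
        _exit(stat(path, &st) == 0 ? 0 : 1);
    }

    for (;;) {
        r = waitpid(child, &status, WNOHANG);   /* poll, never block */
        if (r == child)
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
        if (r < 0 && errno != EINTR)
            return -1;
        if (waited_ms >= timeout_ms)
            break;
        nanosleep(&tick, NULL);
        waited_ms += 10;
    }

    /* Timed out: request a kill and make one non-blocking reap attempt,
       so the caller never waits on a child stuck in D state. */
    kill(child, SIGKILL);
    (void)waitpid(child, &status, WNOHANG);
    return 2;
}
```

A caller could treat a return value of 2 as "skip this process". The trade-off is one fork per probed path, and a child that is truly stuck may linger until the mount recovers, so this would only make sense for paths already suspected to be on a network file system.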

Also, at the current time (and IIRC this wasn't the case when we submitted the original patch), start-stop-daemon is a binary not a script, and it doesn't call pidof or killall.  Instead it uses its own code, and that code is subject to the same issue where it hangs on stuck NFS partitions.  Therefore, as it stands, applying this patch to 'pidof' will no longer resolve the issue; similar changes would have to be made to 'killall'.

We have a cluster of Linux servers running GridEngine, and they each export volumes via NFS.  We have a problem when one of the NFS servers dies.  The stuck mount point on other nodes causes user processes there to enter the D state (e.g. if they have pending output to the disk that's down).  This makes it impossible to do routine maintenance on the other nodes, so they pretty soon will need a reboot, and very often in this condition those reboots will fail because a shutdown task hangs.  Their unavailable NFS volumes cause failures on other nodes in turn, unless someone is available to physically reboot.  (We're planning to eventually use virtualization to prevent this failure; previously there were complications relating to NVidia GPUs).

Dan


Jesse Smith

Oct 27, 2018, 7:40:02 PM
The patch for killall5.c (pidof) has been applied upstream. Assuming
testing goes well, it'll be included in version 2.92 of sysvinit.

- Jesse (upstream dev)