Hi Trey!
On Wed, Oct 5, 2016 at 1:33 PM, Dockendorf, Trey <tdock...@osc.edu> wrote:
> Something yet to be solved has gone wrong with IB on some systems and this
> particular failure is causing NHC to hang. The underlying cause is that
> access to /sys/class/infiniband/mlx5_0/ports/1/state is hanging which is
> checked by the data gathering in lbnl_hw.nhc. If I try to cat the state
> file, things like CTRL+C and CTRL+Z have no effect. Using kill -9 on the cat
> process has no effect either.
Yes, this can happen sometimes. I'm not a kernel expert, but I
suspect the process has gotten itself into D-state; i.e.,
uninterruptible sleep/IOWAIT. You can see this via the `ps` command,
and in particular I recommend looking at the S (state) and WCHAN fields. An uber-spiffy
walk-through of soup-to-nuts troubleshooting of uninterruptible I/O
wait states can be found here:
http://blog.tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/
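
As a rough illustration of the `ps` check described above, here is a minimal sketch that lists processes whose state starts with "D". It assumes a Linux system with the procps `ps` (the `-eo pid,stat,wchan:32,comm` format string); the function names `dstate_rows` and `list_dstate` are made up for this example and are not part of NHC.

```python
import subprocess


def dstate_rows(ps_output):
    """Return (pid, stat, wchan, comm) tuples for rows whose STAT starts with 'D'.

    Expects the text produced by: ps -eo pid,stat,wchan:32,comm
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)  # comm is last, so it may contain spaces
        if len(fields) == 4 and fields[1].startswith("D"):
            rows.append((int(fields[0]), fields[1], fields[2], fields[3]))
    return rows


def list_dstate():
    """Run ps (assumption: procps ps is installed) and keep only D-state rows."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan:32,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return dstate_rows(result.stdout)
```

Calling `list_dstate()` on an affected node should show the hung `cat` (or NHC) process along with the kernel function it is waiting in, which is usually the most useful clue.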
> Executing NHC from command line results in the same style of hang, and NHC
> never times out. Is there a way to make this kind of failure something NHC
> can handle and mark a node offline?
There is, unfortunately, no way to terminate the NHC process which is
locked in the D state until whatever system call it's stuck in
returns. It's just not possible due to the way the kernel must
protect itself from data structure corruption by blocking signals in
certain functions.
However, the watchdog timer, despite not being able to kill the hung
NHC process, *should* (in theory) be able to exit cleanly and offline
the node itself. But if you look at the logic in the
nhc_watchdog_timer() function for handling this situation
(https://github.com/mej/nhc/blob/1.4.2/nhc#L493), you can see that
it's currently assuming an unkillable main NHC process is simply
defunct -- meaning that it exited properly and just hasn't been
wait()'d on yet by its parent process (e.g., pbs_mom or slurmd).
I don't know if there's a way for the NHC watchdog to determine the
difference between the two cases (zombie vs. D-state), but I'm
certainly open to ideas or suggestions on how to do better. Perhaps
the contents of /proc/$NHC_PID/wchan might be revealing?
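
One possible way to tell the two cases apart, sketched here under assumptions rather than taken from NHC: on Linux, the third field of /proc/PID/stat is the process state, where "Z" means zombie and "D" means uninterruptible sleep. The helper names `parse_state` and `classify`, and the `proc_root` parameter, are hypothetical and exist only for this example.

```python
def parse_state(stat_text):
    """Extract the one-letter state from the contents of /proc/PID/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split after the LAST closing parenthesis.
    """
    rest = stat_text[stat_text.rfind(")") + 1:].split()
    return rest[0]


def classify(pid, proc_root="/proc"):
    """Return (kind, wchan); kind is 'zombie', 'uninterruptible', 'gone', or 'other'.

    proc_root is a hypothetical parameter so the logic can be tried
    against a fake directory tree instead of the real /proc.
    """
    try:
        with open(f"{proc_root}/{pid}/stat") as f:
            state = parse_state(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return ("gone", "")

    try:
        with open(f"{proc_root}/{pid}/wchan") as f:
            wchan = f.read().strip()
    except OSError:
        wchan = ""  # wchan may be missing or unreadable; treat as unknown

    if state == "Z":
        return ("zombie", wchan)
    if state == "D":
        return ("uninterruptible", wchan)
    return ("other", wchan)
```

With something like this, the watchdog could treat "zombie" as the harmless not-yet-reaped case and "uninterruptible" as grounds for offlining the node, reporting the wchan value in the note. Whether that fits cleanly into nhc_watchdog_timer() is an open question.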
I would certainly like to address this situation as D-state hangups
are one of the primary reasons I wrote the watchdog timer in the first
place, and while most of the hung node situations I've encountered are
handled correctly by NHC, clearly not all of them are. :-)
Michael
--
Michael Jennings (KainX)
https://medium.com/@mej0/ <m...@eterm.org>
Linux/HPC Systems Engineer, LBL.gov Author, Eterm (www.eterm.org)
-----------------------------------------------------------------------
"The trouble with doing something right the first time is that nobody
appreciates how difficult it was." -- Walt West