nhc keeps appending to note


John Griffin-Wiesner

Feb 15, 2017, 8:06:00 PM
to n...@lbl.gov
Been using nhc with moab and torque for a few years. After
updating to torque 6.1 we've been having trouble with pbs node
notes. Whenever I set a non-nhc note and that node is also
failing an nhc check, the error from that check gets appended to
the node note every time nhc runs.

Some other trouble between torque and nhc was caused by the fact
that the torque executables moved to /usr/local/bin with this
update. We added a "PATH" line to nhc.conf and that resolved
everything but this problem.

Has anyone else run into this?

Thanks.

--
John Griffin-Wiesner
HPC Systems Administrator
Minnesota Supercomputing Institute
http://www.msi.umn.edu
joh...@msi.umn.edu

Dockendorf, Trey

Feb 16, 2017, 10:54:48 AM
to n...@lbl.gov
We ran into this on our Torque 6.0 system. Is the note that gets appended prefixed with “NHC” or something like “ERROR” or some other prefix? In our case it was actually Torque that kept appending the note, not NHC. The note that got appended had NHC output text when NHC failed, but the actual appending was done by Torque. We have since set things up so only NHC marks a node offline when NHC fails.

We set down_on_error=False on pbs_server and $down_on_error to false on pbs_mom.

- Trey

--
Trey Dockendorf

HPC Systems Engineer
Ohio Supercomputer Center

Michael Jennings

Feb 16, 2017, 11:27:17 AM
to LBNL Node Health Check
Hi John!

Trey is correct. NHC will, if $MARK_OFFLINE is set, invoke `pbsnodes
-o -N '...' node` to mark the node offline directly -- provided that
either there is no Note field present, or the Note field that *is*
present starts with "NHC:" -- otherwise, NHC will leave it alone.
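
The rule above can be sketched in a few lines of Python. This is only an illustration of the decision as described in this message, not NHC's actual code (NHC itself is a shell script). The function names, the `mark_offline` parameter, and the way the note text is built from "NHC: " plus a message are all assumptions made for the example.

```python
from typing import List, Optional


def nhc_may_set_note(current_note: Optional[str]) -> bool:
    """Return True if the Note field is empty/absent or already starts with "NHC:"."""
    if not current_note:
        return True
    return current_note.startswith("NHC:")


def offline_command(node: str, message: str,
                    current_note: Optional[str],
                    mark_offline: bool) -> Optional[List[str]]:
    """Build the pbsnodes argument list, or return None if the node is left alone.

    Hypothetical helper: the "NHC: " + message note format is an assumption.
    """
    if not mark_offline:
        return None
    if not nhc_may_set_note(current_note):
        # A note set by someone else is present, so it is not touched.
        return None
    return ["pbsnodes", "-o", "-N", "NHC: " + message, node]


if __name__ == "__main__":
    # "n01" and the note/message strings are made-up example values.
    print(offline_command("n01", "example check failed", None, True))
    print(offline_command("n01", "example check failed", "bad DIMM, ticket open", True))
```

With a note that somebody else set, the sketch returns `None`, which corresponds to NHC leaving the existing note alone.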

It *also* returns the string "ERROR Health check failed: ..." to the
pbs_mom on the file descriptor it was provided as STDOUT (fd 1). If
$down_on_error is set to true, TORQUE will mark the node as being down
and will assign the returned string to the Note field. And TORQUE
doesn't have the logic to do this only in cases where the Note field
was already set by NHC, so that's why you're seeing NHC's note get
tacked onto the end of the note you set. It's TORQUE doing it, not
NHC directly.

There was/is a bug in certain versions of TORQUE 6.x which caused the
string returned by the health check program to be *appended*, rather
than assigned, to the Note field. So each time NHC returned failure,
that field would get longer and longer. This was supposed to have
been resolved in a previous version, but it sounds like maybe there's
a regression? In any event, Trey's solution should work by having NHC
handle everything and bypassing TORQUE's assignment of the Note field.
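
To make the difference between assigning and appending concrete, here is a small Python simulation. It does not reproduce TORQUE's code; the separator, the admin note, and the text after "ERROR Health check failed:" are placeholders chosen for the example, and `update_note` and `simulate` are hypothetical helpers.

```python
from typing import Optional

# Placeholder message; the detail after the colon is made up for this example.
HEALTH_CHECK_MESSAGE = "ERROR Health check failed: example_check"
SEPARATOR = " - "  # assumption: the separator TORQUE really uses is not stated here


def update_note(current_note: Optional[str], message: str, append: bool) -> str:
    """Return the new Note value after one failed health check."""
    if append and current_note:
        # Appending behavior: the note grows on every failing run.
        return current_note + SEPARATOR + message
    # Assigning behavior: the note is replaced by the latest message.
    return message


def simulate(initial_note: Optional[str], runs: int, append: bool) -> str:
    """Apply the same failing health check result `runs` times."""
    note = initial_note
    for _ in range(runs):
        note = update_note(note, HEALTH_CHECK_MESSAGE, append)
    return note


if __name__ == "__main__":
    print(simulate("bad DIMM", 3, append=True))
    print(simulate("bad DIMM", 3, append=False))
```

In the appending case the admin's note survives but gains one copy of the message per failing run; in the assigning case the note is simply overwritten with the latest message.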

Hope that helps!
Michael
--
Michael Jennings (KainX) https://medium.com/@mej0/ <m...@eterm.org>
Linux/HPC Systems Engineer, LANL.gov Author, Eterm (www.eterm.org)
-----------------------------------------------------------------------
"The trouble with doing something right the first time is that nobody
appreciates how difficult it was." -- Walt West

Bidwell, Matt

Feb 16, 2017, 11:51:45 AM
to n...@lbl.gov
David Whiteside here at NREL submitted a patch to Torque that was merged at some point, but you still have to explicitly 'set server note_append_on_error = False'; otherwise it defaults to 'set server note_append_on_error = True'. I'm not positive when the patch made it in, but I believe it's in 6.1.
Matt

John Griffin-Wiesner

Feb 17, 2017, 4:25:23 PM
to n...@lbl.gov
Thanks everyone for the feedback.

Turns out we had 6 settings on pbs_server that we'd set for our system that were reset to the defaults when we upgraded pbs_server. Setting "down_on_error = False" (as it was before the upgrade) on the server was all that was needed to get the node notes working correctly. By the way, the NHC docs at https://github.com/mej/nhc still say to set "$down_on_error 1" on the compute node. It's unclear to me whether that means the mom is supposed to check with the server for its setting or something else. Whatever the case, the server setting is winning.

Thanks.

John Griffin-Wiesner
763-568-8885