Hi John!
Trey is correct. NHC will, if $MARK_OFFLINE is set, invoke `pbsnodes
-o -N '...' node` to mark the node offline directly -- provided that
either there is no Note field present, or the Note field that *is*
present starts with "NHC:" -- otherwise, NHC will leave it alone.
It *also* returns the string "ERROR Health check failed: ..." to the
pbs_mom on the file descriptor it was provided as STDOUT (fd 1). If
$down_on_error is set to true, TORQUE will mark the node as being down
and will assign the returned string to the Note field. TORQUE has no
logic to restrict this to cases where the Note field was already set
by NHC, which is why you're seeing NHC's note get tacked onto the end
of the note you set. It's TORQUE doing it, not NHC directly.
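
To make the rule NHC applies to the Note field concrete, here is a rough sketch in shell. This is not NHC's actual source; the function name `note_is_nhc_owned` and the sample notes are made up for illustration. It only shows the decision described above: touch the note when it is empty or starts with "NHC:", otherwise leave it alone.

```shell
# Illustrative sketch only; NOT taken from NHC's source.
# note_is_nhc_owned NOTE
#   Succeeds (returns 0) when NOTE is empty or begins with "NHC:",
#   i.e. the cases in which NHC would consider the note its own to set.
note_is_nhc_owned() {
    case "$1" in
        ""|NHC:*) return 0 ;;   # no note, or a note NHC wrote earlier
        *)        return 1 ;;   # someone else's note: leave it alone
    esac
}

# Hypothetical sample notes, just to exercise the function.
for note in "" "NHC: sample failure text" "Admin: bad DIMM, do not use"; do
    if note_is_nhc_owned "$note"; then
        echo "would update note: '$note'"
    else
        echo "would leave alone: '$note'"
    fi
done
```

The point is that NHC checks who owns the note before calling `pbsnodes`, whereas TORQUE's own assignment (when $down_on_error is true) does no such check.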
There was/is a bug in certain versions of TORQUE 6.x which caused the
string returned by the health check program to be *appended*, rather
than assigned, to the Note field. So each time NHC returned failure,
that field would get longer and longer. This was supposed to have
been resolved in a previous version, but it sounds like maybe there's
a regression? In any event, Trey's solution should work by having NHC
handle everything and bypassing TORQUE's assignment of the Note field.
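
If it helps to picture the difference between "assigned" and "appended", here is a toy simulation in shell. It is not TORQUE code, and the message text is a placeholder I made up; it just shows why the Note field keeps growing when each failed health check appends instead of replaces.

```shell
# Toy simulation only; NOT TORQUE code.
# The message below is a hypothetical placeholder, not real NHC output.
msg="ERROR Health check failed: placeholder"

note_assign=""   # what the Note field looks like when the string is assigned
note_append=""   # what it looks like when the string is appended (the bug)

# Pretend the health check fails three times in a row.
for run in 1 2 3; do
    note_assign="$msg"
    note_append="${note_append}${msg}"
done

echo "assigned note length: ${#note_assign}"
echo "appended note length: ${#note_append}"
```

With assignment the note stays the same size no matter how many failures occur; with appending it grows by one full message per failure.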
Hope that helps!
Michael
--
Michael Jennings (KainX)
https://medium.com/@mej0/ <m...@eterm.org>
Linux/HPC Systems Engineer, LANL.gov Author, Eterm (www.eterm.org)
-----------------------------------------------------------------------
"The trouble with doing something right the first time is that nobody
appreciates how difficult it was." -- Walt West