NHC and SGE implementations


Cam

Apr 22, 2016, 4:16:16 PM

to LBNL Node Health Check

While NHC seems to be geared more towards SLURM and TORQUE, the documentation and the driver itself show that there is some SGE integration as well.

Dave Love has written up a load sensor how-to for integrating NHC with SGE, but for now we'd like to use it more as a "harness". That is to say, we'd like a central, upstream monitoring server to run NHC periodically, parse the output for unhealthy states and their associated diagnoses, and then adjust the queue status accordingly.
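To make the "harness" idea concrete, here is a rough sketch of the parsing and decision step we have in mind. It rests on assumptions rather than anything confirmed: that a failed check produces a line containing "Health check failed:" followed by the diagnosis, that a non-zero exit status means unhealthy, and that disabling or enabling all queue instances on a host with qmod is the desired reaction. The function names `parse_nhc_result` and `queue_command` are made up for illustration and are not part of NHC.

```python
import re

# Assumed shape of an NHC failure line, for example:
# "ERROR:  nhc:  Health check failed:  check_fs_mount:  /tmp not mounted"
FAILURE_RE = re.compile(r"Health check failed:\s*(.+)")


def parse_nhc_result(exit_code, output):
    """Return (healthy, diagnosis) for one NHC run (hypothetical helper)."""
    match = FAILURE_RE.search(output)
    if exit_code == 0 and match is None:
        return True, ""
    if match:
        # Keep only the text after the marker, up to the end of that line.
        return False, match.group(1).strip()
    # Non-zero exit without a recognizable failure line.
    return False, f"nhc exited with status {exit_code}"


def queue_command(host, healthy):
    """Build, but do not run, a qmod call covering every queue instance on host."""
    flag = "-e" if healthy else "-d"  # -e enables, -d disables
    return ["qmod", flag, f"*@{host}"]
```

The monitoring server would feed in the exit status and captured output from each node, however it obtains them, and hand the resulting argument list to something like `subprocess.run`. Building the command separately from running it keeps the decision logic testable without a live SGE cluster.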

There seem to be some quirks with SGE integration, however. For example, why does nhc seem to need something on stdin, supplied through a UNIX pipe, before it produces any output? We've shell-traced the process and followed its execution through the driver, but we still don't understand why this is the case or what the design reasons behind it are. Because of this, the nhc-wrapper script does not work. In addition, the NHC environment variable TIMEOUT is always set to 0 when SGE is detected as the resource manager, which prevents functions like check_cmd_output from working at all.

There was a post on the Warewulf list about doing checks by sourcing the NHC scripts and then checking the return values, but this seems to sidestep the central configuration file and rely on calling individual NHC functions instead.

That said, it's a great framework, and is exactly what we'd like to use, but we can't find many SGE use cases out there in the wild.
