Running GPFS commands with check_cmd_output


Dockendorf, Trey

Oct 31, 2018, 11:21:49 AM
to n...@lbl.gov

I would like to add the command “/usr/lpp/mmfs/bin/mmfsadm test verbs status” to NHC and match the string “started” in its output, to verify that systems are using RDMA for GPFS and not falling back to Ethernet.  It seems that setting any value for the environment variable DEBUG causes the GPFS command to dump a huge amount of output, much like “set -x”.  If I set DEBUG= (no value) before the command, the normal output is given.  I tried putting DEBUG= at the beginning of the command passed to -e with check_cmd_output, but that didn’t work [1].  Is there a way to set DEBUG= with check_cmd_output that won’t break normal debugging, i.e. without putting something like “* || export DEBUG=” in the configuration?  I have verified that putting the mmfsadm command in a wrapper script and using that for -e works around the problem, but I would prefer not to use a wrapper script if possible.

 

Thanks,

- Trey

 

[1]:

Running check:  "check_cmd_output -t 3 -m /started/ -e 'DEBUG= /usr/lpp/mmfs/bin/mmfsadm test verbs status'"

DEBUG=: line 0: exec: DEBUG=: not found

ERROR:  nhc-monitor:  Health check failed:  check_cmd_output:  127 returned by "DEBUG= /usr/lpp/mmfs/bin/mmfsadm test verbs status".
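
The "exec: DEBUG=: not found" message above suggests that the -e string is handed to exec, which treats a leading DEBUG= as the program name rather than as a variable assignment. One possible workaround, not confirmed anywhere in this thread, is to put env(1) in front of the command, since env is a real executable that accepts NAME=VALUE arguments. The sketch below reproduces the behaviour with a hypothetical stand-in for the GPFS command (fake_gpfs_cmd and its output are assumptions, not real mmfsadm output); the NHC configuration line in the final comment is untested.

```shell
#!/bin/bash
# Hypothetical stand-in for a GPFS command: it prints extra trace output
# whenever DEBUG is non-empty, then its normal status line.
fake_gpfs_cmd='[ -n "$DEBUG" ] && echo "trace: lots of debug output"; echo "VERBS RDMA status: started"'

# Simulate a caller that exports DEBUG (the thread guesses NHC sets DEBUG=0).
export DEBUG=0

# Inherited DEBUG=0: the trace output appears alongside the status line.
noisy=$(bash -c "$fake_gpfs_cmd")

# exec treats "DEBUG=" as the program to run, so the lookup fails with 127,
# matching the error reported by check_cmd_output.
rc=0
( exec DEBUG= true ) 2>/dev/null || rc=$?

# env is an ordinary program, so exec can run it; env then clears DEBUG
# for the child process only, leaving the caller's DEBUG untouched.
quiet=$(env DEBUG= bash -c "$fake_gpfs_cmd")

printf 'with DEBUG=0:\n%s\n\n' "$noisy"
printf 'exec with DEBUG= prefix returned: %s\n\n' "$rc"
printf 'with env DEBUG=:\n%s\n' "$quiet"

# Untested idea for the NHC configuration, based on the check shown above:
#   * || check_cmd_output -t 3 -m /started/ -e 'env DEBUG= /usr/lpp/mmfs/bin/mmfsadm test verbs status'
```

Because env only changes the environment of the child it launches, this approach would leave NHC's own DEBUG setting alone, which is the property asked for in the question.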

 

-- 

Trey Dockendorf

HPC Systems Engineer

Ohio Supercomputer Center

John Hearns

Oct 31, 2018, 11:35:04 AM
to n...@lbl.gov
Trey, there are a lot of GPFS health checks. You can get the health status by running mmhealth.
I do not know if there is a specific check for verbs. I do not recall seeing one, so I suspect not.

I wrote Bright CM health checks for GPFS status, which are Python scripts that parse the output of mmhealth,
and it was quite easy.


John Hearns

Oct 31, 2018, 11:39:51 AM
to n...@lbl.gov
"mmhealth node show network" shows the network status, but the documentation does not mention RDMA.
As I recall, it gives the network up/down status and reports if something is misconfigured.

Dockendorf, Trey

Oct 31, 2018, 11:40:37 AM
to n...@lbl.gov

Thanks for pointing me at mmhealth.  It looks like a system not using RDMA will show as unhealthy, which is good.  The DEBUG issue exists with mmhealth too: any value set for the DEBUG environment variable produces huge amounts of output, and I’m guessing NHC sets DEBUG=0 internally.

 

Thanks,

- Trey

 


 
