NHC Configuration on SGE


Eisa Hedayati

Jan 11, 2017, 1:24:31 PM
to n...@lbl.gov, Gowtham S
Dear all,

I am trying to use NHC on the Rocks Cluster Distribution, which uses SGE as its job scheduler.

The instructions in the NHC git repository lean heavily toward SLURM and TORQUE. It would be nice if you could add a mini section, "SGE Integration". Right now I'm not sure what the best practice is for using NHC on SGE.

I have tried a couple of approaches. Currently I install NHC in a shared directory, run it on every node every now and again, and then have my own script change the status reported by my load_sensor. Finally, I check which nodes are in alert status. I don't think this is the best practice, though.
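
For reference, the load-sensor approach described above could be sketched roughly as follows. This is only an illustration, not the recommended integration. It assumes NHC is installed at `/usr/sbin/nhc` (a shared-directory install would use a different path) and that a load value named `nhc_status` has been defined in the SGE complex configuration; both the path and the name are assumptions made for the example. The sketch follows the usual SGE load sensor protocol: each line read on stdin triggers one report framed by `begin` and `end`, and the line `quit` ends the sensor.

```python
import subprocess


def nhc_healthy(command=("/usr/sbin/nhc",)):
    """Return True if the NHC command exits with status 0.

    The default path is an assumption; adjust it for a shared install.
    """
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # NHC is missing or not executable: report the node as unhealthy.
        return False
    return result.returncode == 0


def format_report(host, name, value):
    """Build one load report in the begin / host:name:value / end format."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def run_sensor(stdin, stdout, check, host, name="nhc_status"):
    """Answer each line on stdin with one report; stop on 'quit' or EOF.

    'nhc_status' is a hypothetical complex name: 0 means healthy, 1 means
    the check failed.
    """
    for line in stdin:
        if line.strip() == "quit":
            break
        value = 0 if check() else 1
        stdout.write(format_report(host, name, value))
        stdout.flush()
```

To use this as an actual sensor, the script would end with a call such as `run_sensor(sys.stdin, sys.stdout, nhc_healthy, socket.gethostname())`, assuming the name returned by `socket.gethostname()` matches the host name SGE knows the node by. Passing the streams and the check in as arguments keeps the loop easy to exercise without a running cluster.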

The real question is: what is the correct way to use NHC on SGE clusters?

Sincerely, 
Eisa
--
Research Computing Student Intern, IT

Michael Jennings

Jan 11, 2017, 1:32:41 PM
to LBNL Node Health Check, Gowtham S
By far the most authoritative guide to NHC with SGE was written by Dave Love, author/maintainer of the "Son of Grid Engine" SGE incarnation.  You can find it here:  https://arc.liv.ac.uk/SGE/howto/nhc-recipe.html

I thought I had added a link to Dave's page in the NHC docs, but it seems to have gotten lost in my recent laptop/workstation/job shuffling.  Please feel free to open an issue at https://github.com/mej/nhc asking for this link, and I will get it added as soon as I am able.

Thanks as always for your feedback!
Michael


--
Michael Jennings (KainX)   https://medium.com/@mej0/    <m...@eterm.org>
Linux/HPC Systems Engineer, LANL.gov      Author, Eterm (www.eterm.org)
-----------------------------------------------------------------------
 "The trouble with doing something right the first time is that nobody
  appreciates how difficult it was."                      -- Walt West

Gowtham

Jan 11, 2017, 1:52:35 PM
to Michael Jennings, LBNL Node Health Check
Thank you for helping us out, Michael. Much appreciated!


Best regards,
Gowtham

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics and ECE
Michigan Technological University

(906) 487-4096
https://it.mtu.edu
https://hpc.mtu.edu