Simple NHC test install on an SGE managed node


Brian Kircher

Jul 6, 2017, 4:42:13 PM
to n...@lbl.gov

Good morning,

 

I have what I hope is a simple question on an initial configuration issue I ran into.

 

We have been looking for a good baseline test script to run on our clusters (outside of SGE for now) and stumbled across NHC, which looks like it would fit the bill perfectly.

 

I did an initial RPM install on our test CentOS 6 and 7 systems, ran the auto config, and got the expected output.  Of course, these test nodes aren’t in an SGE environment, so the next step was to do the same on some of our SGE dev nodes.

 

On those nodes, even a basic run always hung: no output, no logs, nothing, until I hit Return. It then appeared to go through the checks, but dropped back into the same waiting state. Each Return produced the same output, and the run never finished unless I killed the process.

 

I guess at this point I have two questions.

 

1. Is there an option to set NHC_RM to none in the /etc/nhc/nhc.conf file so that it doesn’t try to detect the resource manager?

2. What would be a good way to see where this is hanging? The -d and -x options aren’t showing much at this point. I suspect we just have a strange SGE environment, but I would like to get this working for pre/post job checks as designed.

 

I have tried running as root, as our SGE admin, and as an SGE user, all with the same results (once permissions were set so that the SGE accounts could access the root-owned NHC files).  I did get it to run by changing the instance of SGE_ROOT to SGE_ROOT2 and qselect to qselect2, so that it doesn’t do RM detection on our SGE.  Not exactly a good workaround, though.
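For reference, the trace in this post shows an empty-string test (`[[ -z '' ]]`) immediately before `nhcmain_find_rm` is called, which suggests detection only runs when no resource manager has been preset. The following is a minimal sketch of that kind of guard, not NHC's actual source: `find_rm` and `select_rm` are hypothetical names, and the detection rule is reduced to a single SGE_ROOT test for illustration.

```sh
#!/bin/sh
# Sketch only: imitates the guard suggested by the trace, not NHC's code.

# Hypothetical stand-in for nhcmain_find_rm: picks SGE when SGE_ROOT is set.
find_rm() {
    if [ -n "$SGE_ROOT" ]; then
        NHC_RM=sge
    fi
}

# Run detection only when NHC_RM has not been preset by the caller.
select_rm() {
    if [ -z "$NHC_RM" ]; then
        find_rm
    fi
    echo "${NHC_RM:-undetected}"
}
```

Calling `NHC_RM=none SGE_ROOT=/opt/sge select_rm` prints the preset value and never reaches `find_rm`. If NHC behaves the same way, passing `NHC_RM=...` on the command line, the way `MARK_OFFLINE=0` is passed in the command shown in this post, might avoid renaming variables in the script. Whether NHC accepts a value such as `none` is an assumption that would need to be checked against its documentation.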

 

Below are the command and the output I am getting.  The repeated blocks appear only when I hit Enter, and the run stops only with Ctrl+C.

 

 

[root@hptb006 sbin]# nhc -x -v -a MARK_OFFLINE=0

+ dbg 'BASH tracing active.'

+ local PREFIX=

+ [[ '' == \1 ]]

+ getopts :D:ac:de:fhl:n:qt:vx OPTION

+ case "$OPTION" in

+ VERBOSE=1

+ dbg 'Verbose mode activated via -v option.'

+ local PREFIX=

+ [[ '' == \1 ]]

+ getopts :D:ac:de:fhl:n:qt:vx OPTION

+ case "$OPTION" in

+ NHC_CHECK_ALL=1

+ dbg 'Force running of all checks.'

+ local PREFIX=

+ [[ '' == \1 ]]

+ getopts :D:ac:de:fhl:n:qt:vx OPTION

+ shift 3

+ [[ ! -z MARK_OFFLINE=0 ]]

+ eval MARK_OFFLINE=0

++ MARK_OFFLINE=0

+ shift

+ [[ ! -z '' ]]

+ return 0

+ nhcmain_load_sysconfig

+ [[ -f /etc/sysconfig/nhc ]]

+ nhcmain_finalize_env

+ CONFDIR=/etc/nhc

+ CONFFILE=/etc/nhc/nhc.conf

+ INCDIR=/etc/nhc/scripts

+ HELPERDIR=/usr/libexec/nhc

+ ONLINE_NODE=/usr/libexec/nhc/node-mark-online

+ OFFLINE_NODE=/usr/libexec/nhc/node-mark-offline

+ LOGFILE='>>/var/log/nhc.log 2>&1'

+ RESULTFILE=/var/run/nhc/nhc.status

+ DEBUG=0

+ TS=0

+ SILENT=0

+ VERBOSE=1

+ MARK_OFFLINE=0

+ DETACHED_MODE=0

+ DETACHED_MODE_FAIL_NODATA=0

+ TIMEOUT=30

+ NHC_CHECK_ALL=1

+ NHC_CHECK_FORKED=0

+ export NHC_SID=0

+ NHC_SID=0

+ kill -s 0 -- -23299

+ [[ 0 -eq 0 ]]

+ dbg 'NHC process 23299 is session leader.'

+ local PREFIX=

+ [[ 0 == \1 ]]

+ NHC_SID=-23299

+ [[ >>/var/log/nhc.log 2>&1 != \>\>\/\v\a\r\/\l\o\g\/\n\h\c\.\l\o\g\ \2\>\&\1 ]]

+ [[ >>/var/log/nhc.log 2>&1 == \- ]]

+ [[ -z '' ]]

+ nhcmain_find_rm

+ local DIR

+ local -a DIRLIST

+ [[ -d /var/spool/torque ]]

+ [[ -n /opt/sge ]]

+ [[ -x /opt/sge/util/arch ]]

+ NHC_RM=sge

+ type -a -p -f -P pbsnodes

+ type -a -p -f -P scontrol

+ type -a -p -f -P badmin

+ type -a -p -f -P qselect

+ [[ -z sge ]]

+ [[ sge == \s\g\e ]]

+ ONLINE_NODE=:

+ OFFLINE_NODE=:

+ MARK_OFFLINE=0

+ DETACHED_MODE=0

+ TIMEOUT=0

+ [[ 0 -ne 0 ]]

+ [[ -n '' ]]

+ [[ 0 -eq 1 ]]

+ export NAME CONFDIR CONFFILE INCDIR HELPERDIR ONLINE_NODE OFFLINE_NODE LOGFILE DEBUG TS SILENT TIMEOUT NHC_RM

+ [[ -n '' ]]

+ nhcmain_redirect_output

+ [[ -n >>/var/log/nhc.log 2>&1 ]]

+ exec

 

begin

hptb006:healthy:false

hptb006:diagnosis:NHC: check_fs_free:  /boot has only 24MB free, minimum is 40MB

end

begin

hptb006:healthy:true

hptb006:diagnosis:HEALTHY

end

 

 

 

begin

hptb006:healthy:false

hptb006:diagnosis:NHC: check_fs_free:  /boot has only 24MB free, minimum is 40MB

end

begin

hptb006:healthy:true

hptb006:diagnosis:HEALTHY

end

 

 

 

begin

hptb006:healthy:false

hptb006:diagnosis:NHC: check_fs_free:  /boot has only 24MB free, minimum is 40MB

end

begin

hptb006:healthy:true

hptb006:diagnosis:HEALTHY

end

^Cbegin

hptb006:healthy:false

hptb006:diagnosis:NHC: Terminated by signal SIGINT.

end
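The pattern above (`begin`/`end` reports printed only after Enter, repeating until interrupted) resembles how a Grid Engine load sensor behaves: it blocks on standard input, emits a report for each line it reads, and exits when it reads `quit`. Whether NHC deliberately enters such a mode once it sets NHC_RM=sge is an assumption here, although the `TIMEOUT=0` in the trace would be consistent with it. A minimal sketch of that protocol follows, with hypothetical `report` and `sensor_loop` functions standing in for the real checks.

```sh
#!/bin/sh
# Sketch of a load-sensor style loop; illustrative only, not NHC's code.

# Hypothetical stand-in for running the health checks and printing a report.
report() {
    host=$(uname -n)
    echo "begin"
    echo "$host:healthy:true"
    echo "$host:diagnosis:HEALTHY"
    echo "end"
}

# Block on stdin: one report per line read, stop on "quit" or end of input.
sensor_loop() {
    while read -r line; do
        if [ "$line" = "quit" ]; then
            return 0
        fi
        report
    done
}
```

Running `printf '\n\nquit\n' | sensor_loop` prints two reports and then returns. Under the same assumption, a process in this mode would look hung on an interactive terminal simply because it is waiting for input.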

 

 


Brian Kircher
Senior Engineer
Imaging & Engineering | Imaging

Telephone: +1 281 509 8525
Direct: +1 281 509 8531       
Mobile: +1 713 385 8999
Email: brian....@pgs.com

A Clearer Image | www.pgs.com


Address: 5150 Westway Park Boulevard, Suite 120, Houston, Texas 77041, United States


 
