Good morning,
I have what I hope is a simple question on an initial configuration issue I ran into.
We have been looking for a good baseline test script to run on our clusters (outside of SGE for now) and happened to stumble across NHC, which would probably fit the bill perfectly.
I did an initial RPM install on our test CentOS 6 and 7 systems, ran the auto config, and got the expected output. Of course these test nodes aren’t in an SGE environment, so the next step was to do the same on some of our SGE dev nodes.
On those nodes, even a basic run always hung: no output, no logs, nothing until I hit Return. It then seemed to go through the checks, but put itself back in the same waiting state. Each further Return gave the same output, and the process never finished until I killed it.
I guess at this point I have two questions.
1. Is there an option to set NHC_RM to none in the /etc/nhc/nhc.conf file so that it doesn’t try to detect the resource manager?
2. What would be a good way to see where this is hanging? The -d and -x options aren’t showing much at this point. I suspect we just have a strange SGE environment, but I would like to get this to work for pre/post job checks as designed.
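Since output appears only after a keypress, one thing worth ruling out for question 2 is a blocking read on standard input. The sketch below is only an illustration of that check. The function report_per_line is a hypothetical stand-in that mimics the symptom; it is not NHC code, and the report text is made up.

```shell
# Hypothetical stand-in for a program that emits one report per line of input.
# It is NOT NHC; it only mimics the "output appears after Enter" symptom.
report_per_line() {
    while read -r line; do
        echo "begin"
        echo "demo:healthy:true"
        echo "end"
    done
}

# With stdin pointed at /dev/null, a stdin-driven loop ends at once
# instead of waiting for a keypress.
no_input=$(report_per_line </dev/null)

# With exactly one newline on stdin, it emits exactly one report.
one_line=$(printf '\n' | report_per_line)

# grep -c prints the count; "|| true" absorbs its non-zero exit on a zero count.
echo "reports with no input: $(printf '%s\n' "$no_input" | grep -c '^begin$' || true)"
echo "reports with one line: $(printf '%s\n' "$one_line" | grep -c '^begin$' || true)"
```

Applied to the real command, the same idea would be to run it with stdin redirected from /dev/null and under the coreutils timeout utility. If it then returns promptly, the wait is on input rather than on one of the checks.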
I have tried running as root, as our SGE admin, and as an SGE user, all with the same results (once permissions were set so the SGE accounts could access the root NHC files). I did get it to run by changing the instances of SGE_ROOT to SGE_ROOT2 and qselect to qselect2, so that it doesn’t do an RM detection on our SGE. That is not exactly a good workaround, though.
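In the trace further down, a `[[ -z '' ]]` test appears immediately before `nhcmain_find_rm`, and the trailing `MARK_OFFLINE=0` argument is evaluated as a variable assignment. Reading that as "detection runs only when no value is already set" is an assumption drawn from the trace, not something confirmed against the NHC source. The sketch below shows that general guard pattern with hypothetical helper names (detect_rm, choose_rm); which values NHC itself accepts, and whether one of them disables the integration, is not something the trace shows.

```shell
# Hypothetical stand-ins illustrating a "detect only if unset" guard.
# These are NOT NHC functions.
detect_rm() {
    # Report "sge" when an SGE root is configured; otherwise print nothing.
    if [ -n "${SGE_ROOT:-}" ]; then
        echo sge
    fi
}

choose_rm() {
    # $1 is a preset value, possibly empty. Detect only when it is empty.
    if [ -z "$1" ]; then
        detect_rm
    else
        echo "$1"
    fi
}

SGE_ROOT=/opt/sge
echo "auto:   $(choose_rm "")"
echo "preset: $(choose_rm pbs)"
```

If the real script follows this pattern, supplying a value ahead of time would avoid having to rename SGE_ROOT and qselect.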
Below are the command and the output I am getting. The repeated blocks appear only when I hit Enter, and the run stops only with Ctrl+C.
[root@hptb006 sbin]# nhc -x -v -a MARK_OFFLINE=0
+ dbg 'BASH tracing active.'
+ local PREFIX=
+ [[ '' == \1 ]]
+ getopts :D:ac:de:fhl:n:qt:vx OPTION
+ case "$OPTION" in
+ VERBOSE=1
+ dbg 'Verbose mode activated via -v option.'
+ local PREFIX=
+ [[ '' == \1 ]]
+ getopts :D:ac:de:fhl:n:qt:vx OPTION
+ case "$OPTION" in
+ NHC_CHECK_ALL=1
+ dbg 'Force running of all checks.'
+ local PREFIX=
+ [[ '' == \1 ]]
+ getopts :D:ac:de:fhl:n:qt:vx OPTION
+ shift 3
+ [[ ! -z MARK_OFFLINE=0 ]]
+ eval MARK_OFFLINE=0
++ MARK_OFFLINE=0
+ shift
+ [[ ! -z '' ]]
+ return 0
+ nhcmain_load_sysconfig
+ [[ -f /etc/sysconfig/nhc ]]
+ nhcmain_finalize_env
+ CONFDIR=/etc/nhc
+ CONFFILE=/etc/nhc/nhc.conf
+ INCDIR=/etc/nhc/scripts
+ HELPERDIR=/usr/libexec/nhc
+ ONLINE_NODE=/usr/libexec/nhc/node-mark-online
+ OFFLINE_NODE=/usr/libexec/nhc/node-mark-offline
+ LOGFILE='>>/var/log/nhc.log 2>&1'
+ RESULTFILE=/var/run/nhc/nhc.status
+ DEBUG=0
+ TS=0
+ SILENT=0
+ VERBOSE=1
+ MARK_OFFLINE=0
+ DETACHED_MODE=0
+ DETACHED_MODE_FAIL_NODATA=0
+ TIMEOUT=30
+ NHC_CHECK_ALL=1
+ NHC_CHECK_FORKED=0
+ export NHC_SID=0
+ NHC_SID=0
+ kill -s 0 -- -23299
+ [[ 0 -eq 0 ]]
+ dbg 'NHC process 23299 is session leader.'
+ local PREFIX=
+ [[ 0 == \1 ]]
+ NHC_SID=-23299
+ [[ >>/var/log/nhc.log 2>&1 != \>\>\/\v\a\r\/\l\o\g\/\n\h\c\.\l\o\g\ \2\>\&\1 ]]
+ [[ >>/var/log/nhc.log 2>&1 == \- ]]
+ [[ -z '' ]]
+ nhcmain_find_rm
+ local DIR
+ local -a DIRLIST
+ [[ -d /var/spool/torque ]]
+ [[ -n /opt/sge ]]
+ [[ -x /opt/sge/util/arch ]]
+ NHC_RM=sge
+ type -a -p -f -P pbsnodes
+ type -a -p -f -P scontrol
+ type -a -p -f -P badmin
+ type -a -p -f -P qselect
+ [[ -z sge ]]
+ [[ sge == \s\g\e ]]
+ ONLINE_NODE=:
+ OFFLINE_NODE=:
+ MARK_OFFLINE=0
+ DETACHED_MODE=0
+ TIMEOUT=0
+ [[ 0 -ne 0 ]]
+ [[ -n '' ]]
+ [[ 0 -eq 1 ]]
+ export NAME CONFDIR CONFFILE INCDIR HELPERDIR ONLINE_NODE OFFLINE_NODE LOGFILE DEBUG TS SILENT TIMEOUT NHC_RM
+ [[ -n '' ]]
+ nhcmain_redirect_output
+ [[ -n >>/var/log/nhc.log 2>&1 ]]
+ exec
begin
hptb006:healthy:false
hptb006:diagnosis:NHC: check_fs_free: /boot has only 24MB free, minimum is 40MB
end
begin
hptb006:healthy:true
hptb006:diagnosis:HEALTHY
end
begin
hptb006:healthy:false
hptb006:diagnosis:NHC: check_fs_free: /boot has only 24MB free, minimum is 40MB
end
begin
hptb006:healthy:true
hptb006:diagnosis:HEALTHY
end
begin
hptb006:healthy:false
hptb006:diagnosis:NHC: check_fs_free: /boot has only 24MB free, minimum is 40MB
end
begin
hptb006:healthy:true
hptb006:diagnosis:HEALTHY
end
^Cbegin
hptb006:healthy:false
hptb006:diagnosis:NHC: Terminated by signal SIGINT.
end
Brian Kircher