I’ve hit an issue that occurs when pbs_mom executes NHC and NHC restarts a failed service. The restarted service inherits pbs_mom’s open file descriptors, including its listening sockets, so later restarts of pbs_mom fail because the other service is holding ports 15003 and 15002 open.
This is happening with NHC 1.4.1 on RHEL 6.8 with Torque 6.0.3. It does not seem to occur on RHEL 7.3 with the same version of Torque and NHC 1.4.2. Has anyone run into this problem, or does anyone have suggestions on how to work around it?
/r0220|r0506/ || check_ps_service -m /Xorg/ -u root -r Xorg
[root@r0220 ~]# lsof -i :15003
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
pbs_mom 32232 root 6u IPv4 1294036 0t0 TCP *:pbs_resmom (LISTEN)
[root@r0220 ~]# service Xorg stop
Shutting down Xorg: [ OK ]
[root@r0220 ~]# service pbs_mom_quick restart
Shutting down PBS mom: [ OK ]
Starting PBS mom: [ OK ]
[root@r0220 ~]# lsof -i :15003
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
pbs_mom 33411 root 6u IPv4 1299507 0t0 TCP *:pbs_resmom (LISTEN)
Xorg 33496 root 6u IPv4 1299507 0t0 TCP *:pbs_resmom (LISTEN)
[root@r0220 ~]# ps auxf
<SNIP>
root 33411 0.0 0.0 88197732 60052 ? SLsl 16:16 0:00 pbs_mom -p -d /var/spool/batch/torque_quick -H r0220
root 33425 3.7 0.0 0 0 ? Zs 16:16 0:01 \_ [nhc] <defunct>
root 33496 0.5 0.0 141352 38448 tty8 Ss+ 16:16 0:00 /usr/bin/Xorg :0 -nolisten tcp
[root@r0220 ~]# service pbs_mom_quick restart
Shutting down PBS mom: [FAILED]
Starting PBS mom: pbs_mom: LOG_ERROR::Resource temporarily unavailable (11) in pbs_mom, cannot lock '/var/spool/batch/torque_quick/mom_priv/mom.lock' - another mom running
cannot lock '/var/spool/batch/torque_quick/mom_priv/mom.lock' - another mom running
[FAILED]
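One possible workaround, sketched here under assumptions and not tested against Torque or NHC: start the service through a wrapper that closes every inherited descriptor above stderr before exec'ing the real command, so the new daemon cannot keep pbs_mom's sockets. The function name run_with_clean_fds is hypothetical, and the sketch assumes bash and a Linux /proc filesystem.

```shell
#!/bin/bash
# Hypothetical helper (not part of NHC or Torque): run a command with every
# inherited file descriptor above stderr closed.
run_with_clean_fds() {
    (
        # The subshell expands this glob itself, so /proc/self is the subshell.
        for fd_path in /proc/self/fd/*; do
            fd=${fd_path##*/}
            # Skip anything that is not a plain descriptor number.
            case $fd in
                ''|*[!0-9]*) continue ;;
            esac
            if [ "$fd" -gt 2 ]; then
                # Closing an already-closed descriptor is harmless in bash.
                eval "exec $fd>&-"
            fi
        done
        # Replace the subshell with the requested command.
        exec "$@"
    )
}
```

For illustration only, something like `run_with_clean_fds service Xorg start` would then start Xorg without descriptor 6 from the lsof output above. The descriptors are closed only in the subshell, so the calling process keeps its own. Whether NHC's restart path can be pointed at such a wrapper is an open question here.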