File descriptor issue with services started by NHC when invoked by pbs_mom

Dockendorf, Trey

Mar 28, 2017, 4:20:09 PM
to n...@lbl.gov
I’ve hit an issue that occurs when pbs_mom executes NHC and NHC restarts a failed service. The restarted service ends up holding the same open file descriptors as pbs_mom, so later restarts of pbs_mom fail because the other service still holds ports 15003 and 15002 open. This is happening with NHC 1.4.1 on RHEL 6.8 with Torque 6.0.3. The issue does not seem to occur on RHEL 7.3 with the same version of Torque and NHC 1.4.2. Has anyone run into this problem, or does anyone have suggestions on how to work around it?

Thanks,
- Trey

Example of what happens:

 /r0220|r0506/ || check_ps_service -m /Xorg/ -u root -r Xorg

[root@r0220 ~]# lsof -i :15003
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
pbs_mom 32232 root    6u  IPv4 1294036      0t0  TCP *:pbs_resmom (LISTEN)

[root@r0220 ~]# service Xorg stop
Shutting down Xorg: [ OK ]

[root@r0220 ~]# service pbs_mom_quick restart
Shutting down PBS mom: [ OK ]
Starting PBS mom: [ OK ]

[root@r0220 ~]# lsof -i :15003
COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
pbs_mom 33411 root    6u  IPv4 1299507      0t0  TCP *:pbs_resmom (LISTEN)
Xorg    33496 root    6u  IPv4 1299507      0t0  TCP *:pbs_resmom (LISTEN)

[root@r0220 ~]# ps auxf
<SNIP>
root     33411  0.0  0.0 88197732 60052 ?      SLsl 16:16   0:00 pbs_mom -p -d /var/spool/batch/torque_quick -H r0220
root     33425  3.7  0.0      0     0 ?        Zs   16:16   0:01  \_ [nhc] <defunct>
root     33496  0.5  0.0 141352 38448 tty8     Ss+  16:16   0:00 /usr/bin/Xorg :0 -nolisten tcp

[root@r0220 ~]# service pbs_mom_quick restart
Shutting down PBS mom: [FAILED]
Starting PBS mom: pbs_mom: LOG_ERROR::Resource temporarily unavailable (11) in pbs_mom, cannot lock '/var/spool/batch/torque_quick/mom_priv/mom.lock' - another mom running
cannot lock '/var/spool/batch/torque_quick/mom_priv/mom.lock' - another mom running
[FAILED]
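
In the second lsof output above, Xorg holds FD 6u on the same DEVICE (1299507) as pbs_mom, which looks like ordinary descriptor inheritance down the chain pbs_mom, nhc, restarted service. One way to test that theory might be to close everything above stderr before the service is started. The following is only a sketch under that assumption: `run_without_inherited_fds` is a hypothetical helper name, it is not part of NHC or Torque, it relies on Linux `/proc`, and it has not been tried against NHC 1.4.1.

```shell
#!/bin/bash
# Hypothetical helper, not part of NHC or Torque: run a command with every
# inherited descriptor above stderr closed. Assumes Linux with /proc mounted.
run_without_inherited_fds() {
    (
        # The glob is expanded by this subshell, so /proc/self is the subshell.
        for fd_path in /proc/self/fd/*; do
            fd=${fd_path##*/}
            # Skip anything that is not a plain number (e.g. an unmatched glob).
            case $fd in
                ''|*[!0-9]*) continue ;;
            esac
            # Keep stdin, stdout and stderr; close everything else.
            if [ "$fd" -gt 2 ]; then
                eval "exec $fd>&-"
            fi
        done
        # Replace the subshell with the command, which then starts without
        # the extra descriptors it would otherwise have inherited.
        exec "$@"
    )
}
```

Calling it as `run_without_inherited_fds service Xorg start` from whatever script performs the restart would, if inheritance really is the cause, leave Xorg without the pbs_mom listening sockets. Where exactly such a wrapper would hook into NHC's restart path is not something shown here.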

-- 
Trey Dockendorf
HPC Systems Engineer
Ohio Supercomputer Center