hanging processes when using SLURM + saga

Anna Kostikova

Oct 5, 2014, 11:36:29 AM
to saga-...@googlegroups.com
Dear list,

We are using SLURM + python/saga connector. When a job submitted to
SLURM through saga is done, we still see hanging processes, e.g.:
user1 23473 0.0 0.0 113252 3200 ? S oct04 0:00 /bin/sh /home/user1/.saga/adaptors/shell_job/wrapper.sh 23449
user2 24274 0.0 0.0 113252 3196 ? S oct04 0:00 /bin/sh /home/user2/.saga/adaptors/shell_job/wrapper.sh 24250
user1 28031 0.0 0.1 44392 4136 ? Ss oct03 0:00 /usr/lib/systemd/systemd --user
user2 28239 0.0 0.1 44392 4028 ? Ss oct03 0:00 /usr/lib/systemd/systemd --user

Why is this happening, and why do finished (done) jobs still leave
processes hanging around in the system?
On the server side we use Fedora 20 and saga-python 0.20.

Thanks a lot,
Anna

Andre Merzky

Oct 5, 2014, 12:52:12 PM
to saga-...@googlegroups.com
Dear Anna,

On Sun, Oct 5, 2014 at 5:36 PM, Anna Kostikova <anna.ko...@gmail.com> wrote:
> Dear list,
>
> We are using SLURM + python/saga connector. When a job submitted to
> SLURM through saga is done, we still see hanging processes, e.g.:
> user1 23473 0.0 0.0 113252 3200 ? S oct04 0:00 /bin/sh /home/user1/.saga/adaptors/shell_job/wrapper.sh 23449
> user2 24274 0.0 0.0 113252 3196 ? S oct04 0:00 /bin/sh /home/user2/.saga/adaptors/shell_job/wrapper.sh 24250

Those two indeed look like SAGA processes (or, more specifically, like
two threads of the same process).


> user1 28031 0.0 0.1 44392 4136 ? Ss oct03 0:00 /usr/lib/systemd/systemd --user
> user2 28239 0.0 0.1 44392 4028 ? Ss oct03 0:00 /usr/lib/systemd/systemd --user

Those two I cannot associate with SAGA really -- 'ps -ef --forest' or
'pstree' may help to find out what spawned them.


> Why is this happening, and why do finished (done) jobs still leave
> processes hanging around in the system?
> On the server side we use Fedora 20 and saga-python 0.20.

The 'shell_job/wrapper.sh' process should not originate from the slurm
submission, at least I do not see a code path which would start it.
That process is rather associated with the shell adaptor, i.e. when a
job gets submitted via an 'ssh://host.name.net/' or
'fork://localhost/' type URL. Did you happen to also run tests
against those URLs?
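
For reference, it is a job service created along these lines (just a
sketch, with the host name as a placeholder) which starts the
'~/.saga/adaptors/shell_job/wrapper.sh' process on the target host:

    import saga
    js = saga.job.Service('ssh://host.name.net/')   # or 'fork://localhost/'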

Either way, even then the processes should not linger around for long
-- the script includes a monitoring thread (see [1]) which will kill
it after 30 seconds of idle time. Did you see the script disappear
after a while, or did it continue to run?

I'll run some tests to ensure that the watcher thread is still working
as expected.

Thanks, Andre.


[1] https://github.com/radical-cybertools/saga-python/blob/master/saga/adaptors/shell/shell_wrapper.sh#L47



> Thanks a lot,
> Anna
>

--
It was a sad and disappointing day when I discovered that my Universal
Remote Control did not, in fact, control the Universe. (Not even
remotely.)

trollius

Oct 5, 2014, 1:36:43 PM
to saga-...@googlegroups.com
Dear Andre,

Thanks a lot, great help!
Indeed, we are submitting our jobs from a web2py application that is connected to SAGA, and SAGA then submits to SLURM. We use an ssh connection to SLURM in the SAGA script. However, the processes hang for much longer than 30 seconds (about 2 days already).

Which program or process is responsible for monitoring/killing these processes after 30 seconds? If you could kindly let us know whether your own tests are successful, that would be great -- then we could narrow down the potential issue.

Thanks a lot again,
Anna

Andre Merzky

Oct 5, 2014, 2:12:16 PM
to saga-...@googlegroups.com
The idle check seems indeed broken!

If the process is still running, would you mind sending me the output of

ps -ef -www --forest | grep -C5 wrapper.sh

Thanks!
Andre.

Anna Kostikova

Oct 5, 2014, 2:37:09 PM
to saga-...@googlegroups.com
Hi Andre,


That's our trace log:
[root@main idna]# ps -ef -www --forest | grep -C5 wrapper.sh
bob 31251 31250 0 14:26 pts/2 00:00:00 \_ -bash
root 31273 31251 0 14:26 pts/2 00:00:00 \_ sudo su
root 31274 31273 0 14:26 pts/2 00:00:00 \_ su
root 31275 31274 0 14:26 pts/2 00:00:00 \_ bash
root 17842 31275 0 18:32 pts/2 00:00:00 \_ ps -ef -www --forest
root 17843 31275 0 18:32 pts/2 00:00:00 \_ grep --color=auto -C5 wrapper.sh
root 1314 1 0 15:17 ? 00:00:00 /usr/sbin/sshd -D
user1 17563 1 0 18:27 ? 00:00:00 /bin/sh /data/13/.saga/adaptors/shell_job/wrapper.sh 17529
user1 17738 1 0 18:30 ? 00:00:00 /bin/sh /data/13/.saga/adaptors/shell_job/wrapper.sh 17714

Andre Merzky

Oct 5, 2014, 2:44:00 PM
to saga-...@googlegroups.com
Thanks Anna,

It's a zombie at that point, and you are safe to kill it. I'll
continue debugging and will ping back.

Cheers, Andre.

FWIW, I opened a ticket at
https://github.com/radical-cybertools/saga-python/issues/379

Anna Kostikova

Oct 5, 2014, 2:49:18 PM
to saga-...@googlegroups.com
Thanks a lot, it's amazing that you are helping us!

Andre Merzky

Oct 5, 2014, 6:51:06 PM
to saga-...@googlegroups.com
Dear Anna,

alas, I have trouble reproducing the problem -- I can't convince the
wrapper script to linger. Just staring at the code did not help
either; for all I can see, it should finish immediately once it is
no longer used.

I mentioned that the script has an idle checker: I forgot that this
check was disabled a while ago, when we removed the functionality to
reconnect on interrupted ssh sessions (that was rather complex code
for a rather rare problem), so that idle checker is not the root of
the problem either.

Is the problem happening regularly? If so, is it possible to get a
copy of the code which performs the ssh connection, i.e. which spawns
the 'ssh://some.host.net/' job service? Also interesting would be how
that code shuts down -- does it close the job service object, or is it
left for self destruct?

Thanks, Andre.

Anna Kostikova

Oct 7, 2014, 8:54:29 AM
to saga-...@googlegroups.com

Dear Andre,

Thanks a lot for your help.
Yes, we have this problem each time someone connects via SAGA by ssh to SLURM, and the process stays hanging forever. This is the code that does it (we actually took it from the SAGA documentation):

**************
import saga
import sys

try:
    REMOTE_HOST = "slurm-server"

    ctx = saga.Context("UserPass")
    ctx.user_id = "user1"
    ctx.user_pass = "pwd1"

    session = saga.Session()
    session.add_context(ctx)

    js = saga.job.Service('ssh://%s' % REMOTE_HOST, session=session)

    jd = saga.job.Description()
    jd.executable = "srun sleep 5s"

    myjob = js.create_job(jd)

    # Check our job's id and state
    print "Job ID : %s" % (myjob.id)
    print "Job State : %s" % (myjob.state)
    print "\n...starting job...\n"

    myjob.run()

    print "Job full ID : %s" % (myjob.id)
    print "Job State : %s" % (myjob.state)
    print "\n...waiting for job...\n"

    myjob.wait()

    print "Job short ID : %s" % (myjob.id).split("-")[1]
    print "Job State : %s" % (myjob.state)
    print "Exitcode : %s" % (myjob.exit_code)

    sys.exit(0)

except saga.SagaException, ex:
    # Catch all saga exceptions
    print "An exception occurred: (%s) %s " % (ex.type, (str(ex)))
    # Trace back the exception. That can be helpful for debugging.
    print " \n*** Backtrace:\n %s" % ex.traceback
    sys.exit(-1)
**************

This code works fine and sends a job ("sleep 5s") to a SLURM node, but after that the opened ssh connection does not close and a couple of wrapper.sh processes are left hanging. This happens for each task/job we submit.
When I run "who" on the Linux server, all users appear logged in even though their tasks finished quite some time ago -- meaning the ssh connection is not closing properly.

"Also interesting would be how that code shuts down -- does it close the job service object, or is it left for self destruct?" -- could you elaborate a bit more on this question? I am not sure I got it :)

Thanks a lot again
Anna

Andre Merzky

Oct 7, 2014, 9:44:09 AM
to saga-...@googlegroups.com
>> Also interesting would be how that code shuts down -- does it close the job service object, or is it left for self destruct?
> could you elaborate more on this question? I am not sure I got it:)

Sure, and that is also the workaround I would like to suggest: please
add 'js.close()' just before returning. The job service should not
depend on that and should clean up without an explicit call to close()
-- but close() should help, I hope.
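
In your script from above, that would look roughly like this (just a
sketch reusing your variable names; a try/finally makes sure close()
also runs on the error path):

    js = saga.job.Service('ssh://%s' % REMOTE_HOST, session=session)
    try:
        jd = saga.job.Description()
        jd.executable = "srun sleep 5s"
        myjob = js.create_job(jd)
        myjob.run()
        myjob.wait()
    finally:
        js.close()  # explicitly shut down the job service and its remote wrapper.sh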

Many thanks for the code -- I'll debug with that later today, and try
to fix the default behaviour.

Best, Andre.

Anna Kostikova

Oct 10, 2014, 3:17:59 AM
to saga-...@googlegroups.com
Hello Andre,

We've tried js.close(), but it didn't help: the hanging wrapper.sh stays
in the system. One more thing: each newly submitted task via SAGA
creates a new wrapper.sh -- i.e. if one user sends 10 tasks via SAGA,
10 wrapper.sh processes are created.
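
To illustrate, our web2py code essentially does something like this for
every submitted task (a simplified sketch, the same pattern as the
script I posted earlier), and each such round leaves one wrapper.sh behind:

    js = saga.job.Service('ssh://%s' % REMOTE_HOST, session=session)
    myjob = js.create_job(jd)
    myjob.run()
    myjob.wait()
    js.close()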

Thanks a lot,
Anna

Andre Merzky

Oct 10, 2014, 6:54:24 AM
to saga-...@googlegroups.com
Dear Anna,

alas, I am unable to reproduce the problem. I am running the
following script against a pbs server:

---------------------------------------------
import saga
import sys
import os

try:
    js = saga.job.Service('ssh://localhost/')
    jd = saga.job.Description()

    for i in range (100) :
        jd.executable = "echo 'sleep 5' | qsub"
        myjob = js.create_job(jd)

        # Check our job's id and state
        # print "Job ID : %s" % (myjob.id)
        # print "Job State : %s" % (myjob.state)
        # print "\n...starting job...\n"
        myjob.run()

        # print "Job full ID : %s" % (myjob.id)
        # print "Job State : %s" % (myjob.state)
        # print "\n...waiting for job...\n"
        myjob.wait()

        # print "Job short ID : %s" % (myjob.id).split("-")[1]
        # print "Job State : %s" % (myjob.state)
        # print "Exitcode : %s" % (myjob.exit_code)

        os.system ('echo -n "%5d : " ; ps -ef | grep -v grep | grep -i wrapper | wc -l '% i)

    js.close ()

    os.system ('echo -n "%5d : " ; ps -ef | grep -v grep | grep -i wrapper | wc -l '% 0)

    sys.exit (0)

except saga.SagaException, ex:
    # Catch all saga exceptions
    print "An exception occurred: (%s) %s " % (ex.type, (str(ex)))
    # Trace back the exception. That can be helpful for debugging.
    print " \n*** Backtrace:\n %s" % ex.traceback
    sys.exit (1)

------------------------------------------------------------

and see the following output:

------------------------------------------------------------
(ve)merzky@tutorial:~ $ python test.py
0 : 2
1 : 2
2 : 2
3 : 2
4 : 2
5 : 2
6 : 2
[...]
98 : 2
99 : 2
0 : 0
------------------------------------------------------------

The process list seems to confirm that exactly two wrapper.sh
instances are alive during the submission phase:

------------------------------------------------------------
merzky@tutorial:~ $ ps -ef --forest | grep -v grep | grep -i -e saga -e wrapper -e ssh -e sleep -C 5
mongodb 1632 1 0 10:15 ? 00:00:02 /usr/bin/mongod --unixSocketPrefix=/var/run/mongodb --config /etc/mongodb.conf run
root 1652 1 0 10:15 ? 00:00:00 /usr/sbin/rsyslogd -c5
root 1655 1 1 10:15 ? 00:00:09 /usr/sbin/pbs_mom
merzky 16212 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16218 16212 0 10:30 ? 00:00:00 | \_ -bash
merzky 16219 16218 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16271 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16278 16271 0 10:30 ? 00:00:00 | \_ -bash
merzky 16279 16278 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16328 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16334 16328 0 10:30 ? 00:00:00 | \_ -bash
merzky 16335 16334 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16388 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16394 16388 0 10:30 ? 00:00:00 | \_ -bash
merzky 16395 16394 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16444 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16450 16444 0 10:30 ? 00:00:00 | \_ -bash
merzky 16451 16450 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16504 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16510 16504 0 10:30 ? 00:00:00 | \_ -bash
merzky 16511 16510 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16560 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16566 16560 0 10:30 ? 00:00:00 | \_ -bash
merzky 16567 16566 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16620 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16626 16620 0 10:30 ? 00:00:00 | \_ -bash
merzky 16627 16626 0 10:30 ? 00:00:00 | \_ sleep 5
merzky 16676 1655 0 10:30 ? 00:00:00 \_ -bash
merzky 16682 16676 0 10:30 ? 00:00:00 \_ -bash
merzky 16683 16682 0 10:30 ? 00:00:00 \_ sleep 5
root 1676 1 0 10:15 ? 00:00:01 /usr/sbin/pbs_server
root 1760 1 0 10:15 ? 00:00:00 /usr/sbin/cron
102 1785 1 0 10:15 ? 00:00:00 /usr/bin/dbus-daemon --system
root 1803 1 0 10:15 ? 00:00:00 /usr/sbin/sshd
root 1890 1803 0 10:17 ? 00:00:00 \_ sshd: merzky [priv]
merzky 1966 1890 0 10:17 ? 00:00:00 | \_ sshd: merzky@pts/0,pts/1
merzky 1967 1966 0 10:17 pts/0 00:00:00 | \_ -bash
merzky 2951 1967 0 10:26 pts/0 00:00:00 | | \_ vim test.py
merzky 11631 1967 3 10:29 pts/0 00:00:01 | | \_ python test.py
merzky 11642 11631 0 10:29 pts/2 00:00:00 | | \_ /usr/bin/ssh -t -o IdentityFile=/home/merzky/.ssh/id_rsa -o ControlMaster=yes -o ControlPath=/tmp/saga_ssh_merzky_%h_%p.ctrl localhost
merzky 11667 11631 0 10:29 pts/4 00:00:00 | | \_ /usr/bin/ssh -t -o IdentityFile=/home/merzky/.ssh/id_rsa -o ControlMaster=no -o ControlPath=/tmp/saga_ssh_merzky_%h_%p.ctrl localhost
merzky 2152 1966 0 10:24 pts/1 00:00:00 | \_ -bash
merzky 16684 2152 0 10:30 pts/1 00:00:00 | \_ ps -ef --forest
root 11643 1803 0 10:29 ? 00:00:00 \_ sshd: merzky [priv]
merzky 11648 11643 0 10:29 ? 00:00:00 \_ sshd: merzky@pts/3,pts/5
merzky 11649 11648 0 10:29 pts/3 00:00:00 \_ -bash
merzky 11668 11648 0 10:29 pts/5 00:00:00 \_ /bin/sh -i
merzky 11688 11668 1 10:29 pts/5 00:00:00 \_ /bin/sh /home/merzky/.saga/adaptors/shell_job/wrapper.sh 11668
merzky 12208 11688 0 10:29 pts/5 00:00:00 \_ /bin/sh /home/merzky/.saga/adaptors/shell_job/wrapper.sh 11668
merzky 15183 12208 0 10:30 pts/5 00:00:00 \_ sleep 30
root 1867 1 0 10:15 ? 00:00:00 /usr/sbin/pbs_sched
root 1893 1 0 10:17 ? 00:00:00 /usr/sbin/console-kit-daemon --no-daemon
root 1960 1 0 10:17 ? 00:00:00 /usr/lib/policykit-1/polkitd --no-debug
------------------------------------------------------------

This run used the saga-python release from PyPI, so I am not sure
what to make of this. Would it, by any chance, be possible to get
access to your slurm-server in order to debug this?

Best, Andre.

PS.: mail formatting will likely screw up the pasted texts above --
you may want to check this out on the ticket at
https://github.com/radical-cybertools/saga-python/issues/379

Anna Kostikova

Oct 13, 2014, 2:38:20 PM
to saga-...@googlegroups.com
Hello Andre,

Sorry for the slow response (I was away at a conference). I've checked
with our admin and he did not allow us access to the server, but it
would be great if you could tell us which OS and which version of
Python you used.

After running all the tests, everything works (from the console); here
are our results for CentOS 7, CentOS 6.5 and Fedora 20.
For each task sent via SAGA to the SLURM server, there are two
wrapper.sh processes with identical numbers at the end:
/bin/sh /user/bob/.saga/adaptors/shell_job/wrapper.sh 26258
/bin/sh /user/bob/.saga/adaptors/shell_job/wrapper.sh 26258
One of them finishes after SLURM has done its job, but the second one
keeps hanging in the system. That's how they multiply. This
happens only if we use js.close() before returning. If we remove
js.close(), then both wrapper.sh processes finish successfully.

What seems a bit weird is that when we created this topic, both
processes were hanging even when run from the console, but now they
work just fine. As we run tasks via SAGA from the browser (we have a
Python application), we need a bit more time to figure out what is
going on with the hanging processes.

Thanks a lot again for all your help and have a good evening,
Anna