[slurm-users] how can users start their worker daemons using srun?


Priedhorsky, Reid

Aug 27, 2018, 6:17:25 PM
to slurm...@lists.schedmd.com
Folks,

I am trying to figure out how to advise users on starting worker daemons in their allocations using srun. That is, I want to be able to run “srun foo”, where foo starts some child process and then exits, and the child process(es) persist and wait for work.

Use cases for this include Apache Spark and FUSE mounts. In general, it seems that there are a number of newer computing frameworks that have this model, in particular for the data science space.

We are on Slurm 17.02.10 with the proctrack/cgroup plugin.

I’m using a Python script foo.py to test this (included at the end of this e-mail as Appendix 1). After forking, the parent exits immediately; the child writes the numbers 0 through 9 at one-second intervals to /tmp/foo, then the word “done”, and then exits.

Desired behavior in a one-node allocation:

$ srun ./foo.py && sleep 12 && cat /tmp/foo
starting cn001.localdomain 79615
0
1
2
3
4
5
6
7
8
9
done

Actual behavior:

$ srun ./foo.py && sleep 12 && cat /tmp/foo
starting cn001.localdomain 79615
0

As far as I can tell, what is going on is that when foo.py exits, Slurm concludes that the job step is over and kills the child; see the debug log in Appendix 2.

I have considered the following:

(1) Various command line options, none of which has any effect here: --kill-on-bad-exit=0, --no-kill, --mpi=none, --overcommit, --oversubscribe, --wait=0.

(2) srun --task-prolog=./foo.py true

Instead of killing foo.py’s child, this invocation waits for it to exit. Also, this seems to require a single executable rather than a command line.

One can work around the wait by putting the entire command in the background, but then subsequent sruns block until the child completes anyway (with the warning “Job step creation temporarily disabled, retrying”). Adding --overcommit to the 1st, 2nd, or both sruns has no effect.
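To illustrate (a hypothetical session, output abbreviated):

$ srun --task-prolog=./foo.py true &
$ srun hostname
srun: Job step creation temporarily disabled, retrying
srun: Job step created
cn001.localdomain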

Recall that for real-world tasks, the child will run indefinitely waiting for work, so we can’t wait for it to finish.

(3) srun sh -c './foo.py && sleep 15': same behavior as item (2).

(4) Teach Slurm how to deal with the worker daemons somehow.

This doesn’t generalize. We want users to be able to bring whatever compute framework they want, without waiting for Slurm support, so they can innovate faster.

(5) Put the worker daemons in their own job. For example, one could start the Spark worker daemons in one job, with the Spark coordinator daemon and user work submission in a second one-node job.

This doesn’t solve the general use case. For example, in the case of Spark, I’ve a large test suite where starting and stopping a Spark cluster is only one of many tests. For FUSE, which depends on a worker daemon to implement filesystem operations, the mount is there to serve the needs of the rest of the job script.

(6) Change the software to not daemonize. For example, one can start Spark by invoking the .jar files directly, bypassing the daemonizing start script, or in newer versions by setting SPARK_NO_DAEMONIZE=1.

This again doesn’t generalize. I need to be able to support imperfect scientific software as it arrives, without hacking or framework-specific workarounds.

(7) Don’t launch with srun. For example, pdsh can interpret Slurm environment variables and uses SSH to launch tasks on my allocated nodes.

This works, and is what I’m doing currently, but it doesn’t scale: one or two dozen SSH processes on the first node of my allocation are fine, but 1,000 or 10,000 are not. Also, it’s a kludge, since srun is specifically provided and optimized to launch tasks on a Slurm cluster.
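For reference, the sort of invocation I mean (a sketch; it expands the Slurm hostlist explicitly rather than relying on a pdsh built with Slurm support):

$ pdsh -R ssh -w "$(scontrol show hostnames "$SLURM_JOB_NODELIST" | paste -sd, -)" ./foo.py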

My question: is there any way I can convince Slurm to let a job step’s children keep running beyond the end of the step, and kill them at the end of the job if needed? Or, less preferably, to overlap job steps?

Much appreciated,
Reid


Appendix 1: foo.py

#!/usr/bin/env python3

# Try to find a way to run daemons under srun.

import os
import socket
import sys
import time

print("starting %s %d" % (socket.gethostname(), os.getpid()))

# One fork is enough for Slurm to kill the child: once the parent (the task
# srun launched) exits, the job step is considered complete and cleanup begins.
if os.fork() > 0:
    sys.exit(0)

# Child: write a counter to /tmp/foo once per second, then "done".
with open("/tmp/foo", "w") as fp:
    for i in range(10):
        fp.write("%d\n" % i)
        fp.flush()
        time.sleep(1)
    fp.write("done\n")

Appendix 2: debug log showing that job step cleanup kills the worker daemon

slurmstepd: debug level = 6
slurmstepd: debug:  IO handler started pid=62147
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: starting 1 tasks
slurmstepd: task 0 (62153) started 2018-08-27T11:03:33
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: mpi/pmi2: _tree_listen_readable
slurmstepd: debug2: mpi/pmi2: _task_readable
slurmstepd: debug2: adding task 0 pid 62153 on node 0 to jobacct
slurmstepd: debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 206670 stepid 62 taskid 0 max_task_id 0
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_1001' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670' already exists
slurmstepd: debug:  jobacct_gather_cgroup_memory_attach_task: jobid 206670 stepid 62 taskid 0 max_task_id 0
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_1001' already exists
slurmstepd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_1001/job_206670' already exists
slurmstepd: debug2: jag_common_poll_data: 62153 mem size 0 290852 time 0.000000(0+0)
slurmstepd: debug2: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq
slurmstepd: debug2:  cpu 1 freq= 2101000
slurmstepd: debug:  jag_common_poll_data: Task average frequency = 2101000 pid 62153 mem size 0 290852 time 0.000000(0+0)
slurmstepd: debug2: energycounted = 0
slurmstepd: debug2: getjoules_task energy = 0
slurmstepd: debug:  Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/cpuset' entry '/sys/fs/cgroup/cpuset/slurm/system' properties: No such file or directory
slurmstepd: debug2: xcgroup_load: unable to get cgroup '/sys/fs/cgroup/memory' entry '/sys/fs/cgroup/memory/slurm/system' properties: No such file or directory
slurmstepd: debug:  Sending launch resp rc=0
slurmstepd: debug:  mpi type = (null)
slurmstepd: debug:  [job 206670] attempting to run slurm task_prolog [/opt/slurm/task_prolog]
slurmstepd: debug:  Handling REQUEST_STEP_UID
slurmstepd: debug:  Handling REQUEST_SIGNAL_CONTAINER
slurmstepd: debug:  _handle_signal_container for step=206670.62 uid=0 signal=995
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 18446744073709551615
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 16384
slurmstepd: debug2: _set_limit: RLIMIT_RSS    : max:inf cur:inf req:257698037760
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_RSS succeeded
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NPROC no change in value: 8192
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in value: 65536
slurmstepd: debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
slurmstepd: debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
slurmstepd: debug2: Set task rss(245760 MB)
starting fg001.localdomain 62153
slurmstepd: debug:  Step 206670.62 memory used:0 limit:251658240 KB
slurmstepd: debug2: removing task 0 pid 62153 from jobacct
slurmstepd: task 0 (62153) exited with exit code 0.
slurmstepd: debug:  [job 206670] attempting to run slurm task_epilog [/opt/slurm/task_epilog]
slurmstepd: debug2: Using gid list sent by slurmd
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62/task_0 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62/task_0 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670/step_62 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001/job_206670 Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_1001): Device or resource busy
slurmstepd: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_1001 Device or resource busy
slurmstepd: debug2: step_terminate_monitor will run for 60 secs
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62): Device or resource busy
slurmstepd: debug:  _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_1001/job_206670/step_62: Device or resource busy
slurmstepd: debug2: killing process 62158 (inherited_task) with signal 9
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001/job_206670): Device or resource busy
slurmstepd: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_1001): Device or resource busy
slurmstepd: debug:  step_terminate_monitor_stop signalling condition
slurmstepd: debug2: step_terminate_monitor is stopping
slurmstepd: debug2: Sending SIGKILL to pgid 62147
slurmstepd: debug:  Waiting for IO
slurmstepd: debug:  Closing debug channel

Chris Samuel

Aug 27, 2018, 8:22:32 PM
to slurm...@lists.schedmd.com
On Tuesday, 28 August 2018 8:15:55 AM AEST Priedhorsky, Reid wrote:

> I am trying to figure out how to advise users on starting worker daemons in
> their allocations using srun. That is, I want to be able to run “srun foo”,
> where foo starts some child process and then exits, and the child
> process(es) persist and wait for work.

That won't happen on a well-configured Slurm system, as it is Slurm's role to
clean up any processes left around from a job once that job exits. This is
why cgroups and pam_slurm_adopt are so useful: they make it far easier to
track those processes and kill them off.

If you want processes to stick around, you either need to ask for enough time
in the job and ensure that the script doesn't exit (and thus signal the end of
the job) until those daemons are done, or you need to find a way to do it
outside of Slurm.

One possible way to do the latter would be to configure something like systemd
to allow specific users to run daemons as themselves. Then you could let them
submit a job where they run:

systemctl start --user mydaemon.service

to start it up (and check it has started successfully before exiting).

There's a bit about how to do this here (which I've just started using for a
side radio-astronomy project at the observatory I volunteer at):

https://www.brendanlong.com/systemd-user-services-are-amazing.html
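As a rough sketch of the moving parts (the unit and binary names are
placeholders, and enabling lingering needs root or suitable polkit rules):

# Once, as root: let the user's systemd instance outlive their sessions.
loginctl enable-linger alice

# As the user: define the service...
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/mydaemon.service <<'EOF'
[Unit]
Description=Example worker daemon

[Service]
ExecStart=/home/alice/bin/worker --wait-for-work

[Install]
WantedBy=default.target
EOF

# ...then, from the job script, start it and check it came up.
systemctl --user daemon-reload
systemctl --user start mydaemon.service
systemctl --user is-active mydaemon.service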

Hope this helps!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC




Chris Samuel

Aug 28, 2018, 8:36:01 AM
to slurm...@lists.schedmd.com
On Tuesday, 28 August 2018 10:21:45 AM AEST Chris Samuel wrote:

> That won't happen on a well-configured Slurm system, as it is Slurm's role
> to clean up any processes left around from a job once that job exits.

Sorry Reid, for some reason I misunderstood your email and the fact you were
talking about job steps! :-(

One other option in this case is that you can, say, add 2 cores per node for
the daemons to the overall job request and then, in your job script, run:

srun --ntasks-per-node=1 -c 2 ./foo.py &

and ensure that foo.py doesn't exit after the daemons launch (if you are using
cgroups then those daemons should be contained within the job step's cgroup, so
you should be able to spot their PIDs easily enough).

That then gives you the rest of the cores to play with, so you would launch
future job steps on n-2 cores per node (you could use the environment variables
SLURM_CPUS_PER_TASK and SLURM_NTASKS_PER_NODE to avoid having to hard-code
these, for instance).

Of course, at the end your batch script would need to kill off that first job
step, along these lines:
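(An untested sketch; the core counts are illustrative, and it assumes the
daemon step is step 0 so that scancel can target it.)

---------------8< snip snip 8<---------------
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=36   # e.g. whole 36-core nodes: 34 work + 2 daemons

# Step 0: the daemons, on 2 reserved cores per node. foo.py (or the real
# daemon wrapper) must keep running rather than exit after the launch.
srun --ntasks-per-node=1 -c 2 ./foo.py &

# Later steps use the remaining cores (hard-coded here for clarity; derive
# from SLURM_NTASKS_PER_NODE and SLURM_CPUS_PER_TASK to avoid magic numbers).
srun --ntasks-per-node=34 -c 1 ./do_work

# The daemon step never exits on its own, so kill it off explicitly.
scancel "${SLURM_JOB_ID}.0"
---------------8< snip snip 8<---------------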

Would that help?

Priedhorsky, Reid

Aug 28, 2018, 7:11:32 PM
to Slurm User Community List

> On Aug 28, 2018, at 6:35 AM, Chris Samuel <ch...@csamuel.org> wrote:
>
> On Tuesday, 28 August 2018 10:21:45 AM AEST Chris Samuel wrote:
>
>> That won't happen on a well-configured Slurm system, as it is Slurm's role
>> to clean up any processes left around from a job once that job exits.
>
> Sorry Reid, for some reason I misunderstood your email and the fact you were
> talking about job steps! :-(
>
> One other option in this case is that you can say add 2 cores per node for the
> daemons to the overall job request and then do in your jobs
>
> srun --ntasks-per-node=1 -c 2 ./foo.py &

Thanks Chris.

I tried the following:

$ srun --ntasks-per-node=1 -c1 -- sleep 15 &
[1] 180948
$ srun --ntasks-per-node=1 -c1 -- hostname
srun: Job step creation temporarily disabled, retrying
srun: Job step created
cn001.localdomain
[1]+ Done srun --ntasks-per-node=1 -c1 -- sleep 15

and the second srun still waits until the first is complete.

This is surprising to me, as my interpretation is that the first run should allocate only one CPU, leaving 35 for the second srun, which also only needs one CPU and need not wait.

Is this behavior expected?
Am I missing something?

Thanks,
Reid


Christopher Samuel

Aug 28, 2018, 8:13:53 PM
to slurm...@lists.schedmd.com
On 29/08/18 09:10, Priedhorsky, Reid wrote:

> This is surprising to me, as my interpretation is that the first run
> should allocate only one CPU, leaving 35 for the second srun, which
> also only needs one CPU and need not wait.
>
> Is this behavior expected? Am I missing something?

That's odd - and I can reproduce what you see here with Slurm 17.11.7!

However, on an older system I have access to, where I know this technique is
used, it does work with 16.05.8.

My test script is:

---------------8< snip snip 8<---------------
#!/bin/bash
#SBATCH -n2
#SBATCH -c2
#SBATCH --mem-per-cpu=2g

srun -n1 --mem-per-cpu=500m sleep 5 &
srun -n1 --mem-per-cpu=1g hostname
---------------8< snip snip 8<---------------

On the older system it just prints the hostname; on the newer system
I get the warning:

srun: Job 1241799 step creation temporarily disabled, retrying

Very odd...

Brian W. Johanson

Aug 30, 2018, 1:48:10 PM
to Slurm User Community List
On 08/29/2018 04:59 PM, Chris Samuel wrote:
> On Thursday, 30 August 2018 12:45:51 AM AEST Brian W. Johanson wrote:
>
>> In your example, you do not have enough memory for both sruns at the same
>> time.
> Nice spot; I think I was thinking in mem-per-task (which doesn't exist) then!
>
> Unfortunately fixing it doesn't seem to resolve the issue; both these changed
> versions have the same result:
>
> ---------------8< snip snip 8<---------------
> #!/bin/bash
> #SBATCH -n2
> #SBATCH -c2
> #SBATCH --mem-per-cpu=4g
>
> srun -n1 --mem-per-cpu=500m sleep 5 &
> srun -n1 --mem-per-cpu=1g hostname
> ---------------8< snip snip 8<---------------
>
> john1
> srun: Job 1244182 step creation temporarily disabled, retrying
>
> ---------------8< snip snip 8<---------------
> #!/bin/bash
> #SBATCH -n2
> #SBATCH -c2
> #SBATCH --mem-per-cpu=4g
>
> srun -n1 --mem-per-cpu=250m sleep 5 &
> srun -n1 --mem-per-cpu=500m hostname
> ---------------8< snip snip 8<---------------
>
> john1
> srun: Job 1244183 step creation temporarily disabled, retrying
> srun: Step created for job 1244183
>
> All the best,
> Chris

That's interesting: those examples work for me on 17.11.7. I am not sure what's
stopping you now.
-b

Priedhorsky, Reid

Aug 31, 2018, 12:34:39 PM
to slurm...@lists.schedmd.com

> On Aug 28, 2018, at 6:13 PM, Christopher Samuel <ch...@csamuel.org> wrote:
>
> On 29/08/18 09:10, Priedhorsky, Reid wrote:
>
>> This is surprising to me, as my interpretation is that the first run
>> should allocate only one CPU, leaving 35 for the second srun, which
>> also only needs one CPU and need not wait.
>> Is this behavior expected? Am I missing something?
>
> That's odd - and I can reproduce what you see here with Slurm 17.11.7!
>
> However, on an older system I have access to where I know this technique
> is used with 16.05.8 it does work.
>
> My test script is:
>
> ---------------8< snip snip 8<---------------
> #!/bin/bash
> #SBATCH -n2
> #SBATCH -c2
> #SBATCH --mem-per-cpu=2g
>
> srun -n1 --mem-per-cpu=500m sleep 5 &
> srun -n1 --mem-per-cpu=1g hostname
> ---------------8< snip snip 8<---------------

Adding in memory seems to work (Bash job control chatter removed):

$ srun -n1 -c1 --mem=1K sh -c './bar.py && sleep 30' &
$ srun -n1 -c1 --mem=1K hostname
cn001.localdomain
$

hostname runs immediately, and I don’t get the warning about waiting anymore.

bar.py is another test script that forks one child per CPU that allocates 128MiB of memory and then busy-loops for about 20 seconds. I confirmed with top that it’s really running on all 36 CPUs.

That is, it exceeds both the CPU count (1) and memory (1KiB) that I told Slurm it would use. This is what I want. Is allowing such exceedance a common configuration? I don’t want to rely on quirks of our site.

The drawback here is that for real daemons, I’ll need “sleep infinity”, so I’ll need to manually kill the srun. So, this is still a workaround. The ideal behavior would be for Slurm to clean up processes not when the job step completes, but at the end of the job.
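Concretely, something like this sketch, where daemon.sh stands in for the real start script:

srun -n1 -c1 --mem=1K sh -c './daemon.sh && sleep infinity' &
daemon_srun=$!
# ... rest of the job script ...
kill $daemon_srun   # tear down the daemon step by hand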

Thanks,
Reid

Chris Samuel

Aug 31, 2018, 9:46:51 PM
to slurm...@lists.schedmd.com
On Saturday, 1 September 2018 2:33:39 AM AEST Priedhorsky, Reid wrote:

> That is, it exceeds both the CPU count (1) and memory (1KiB) that I told
> Slurm it would use. This is what I want. Is allowing such exceedance a
> common configuration? I don’t want to rely on quirks of our site.

I think you can configure Slurm to do that, but in my experience sites are
always doing their best to constrain jobs to what they ask for, and so we use
cgroups for this (tasks can only access the cores, memory, and GPUs they
request; the kernel prevents them from accessing anything else).

For your situation, using CR_Core as your SelectTypeParameters basically
tells Slurm to ignore memory for scheduling.
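i.e., something along these lines in slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Core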

> The drawback here is that for real daemons, I’ll need “sleep infinity”, so
> I’ll need to manually kill the srun. So, this is still a workaround. The
> ideal behavior would be to have Slurm not clean up processes when the job
> step completes, but instead at the end of the job.

You've got a race condition there, though: the job doesn't complete until all
the steps are done, and if you've got a step with processes that never end,
then the job will keep running until it hits its time limit (unless, as you
say, you manually kill that step yourself).