[slurm-users] Problem launching interactive jobs using srun


Andy Georges

Mar 9, 2018, 12:21:10 PM
to slurm...@lists.schedmd.com
Hi,


I am trying to get interactive jobs to work from the machine we use as a login node, i.e., the machine the cluster's users log into and from which they typically submit jobs.


I submit the job as follows:

vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun bash -i
salloc: Granted job allocation 41
salloc: Waiting for resource configuration
salloc: Nodes node2801 are ready for job


At this point the command hangs.


On node2801, the slurmd log has the following information:


[2018-03-09T18:16:08.820] _run_prolog: run job script took usec=10379
[2018-03-09T18:16:08.820] _run_prolog: prolog with lock for job 41 ran for 0 seconds
[2018-03-09T18:16:08.829] [41.extern] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
[2018-03-09T18:16:08.830] [41.extern] task/cgroup: /slurm/uid_2540075/job_41/step_extern: alloc=800MB mem.limit=800MB memsw.limit=880MB
[2018-03-09T18:16:11.824] launch task 41.0 request from 2540075...@10.141.21.202 (port 61928)
[2018-03-09T18:16:11.824] lllp_distribution jobid [41] implicit auto binding: cores,one_thread, dist 1
[2018-03-09T18:16:11.824] _task_layout_lllp_cyclic
[2018-03-09T18:16:11.824] _lllp_generate_cpu_bind jobid [41]: mask_cpu,one_thread, 0x1
[2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41: alloc=800MB mem.limit=800MB memsw.limit=880MB
[2018-03-09T18:16:11.834] [41.0] task/cgroup: /slurm/uid_2540075/job_41/step_0: alloc=800MB mem.limit=800MB memsw.limit=880MB
[2018-03-09T18:16:11.836] [41.0] error: connect io: Connection refused
[2018-03-09T18:16:11.836] [41.0] error: IO setup failed: Connection refused
[2018-03-09T18:16:11.905] [41.0] _oom_event_monitor: oom-kill event count: 1
[2018-03-09T18:16:11.905] [41.0] error: job_manager exiting abnormally, rc = 4021
[2018-03-09T18:16:11.905] [41.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
[2018-03-09T18:16:11.907] [41.0] done with job


We are running slurm 17.11.4.


When I switch to the same user on the master node (running slurmctld) or on the worker nodes (running slurmd) and run the same command there, things work just fine. I assume I should not need to run slurmd on the login node for this to work?


Any pointers are appreciated,
— Andy

Pickering, Roger (NIH/NIAAA) [E]

Mar 9, 2018, 12:47:32 PM
to Slurm User Community List
I'm confused. Why would you want to run an interactive program using srun?

Roger

Michael Robbert

Mar 9, 2018, 1:04:38 PM
to slurm...@lists.schedmd.com
I think the piece you may be missing is --pty, but I also don't think that salloc is necessary.

The simplest command I typically use is:

srun -N1 -n1 --pty bash -i

Mike

Andy Georges

Mar 9, 2018, 1:18:08 PM
to Slurm User Community List
Hi,

Adding --pty makes no difference. I do not get a prompt, and on the node the logs show an error. With --pty the error is somewhat different from the one without it, but the end result is the same.

My main issue is that giving the same command on the machines running slurmd and slurmctld just works.

As far as srun is concerned, that’s what is advised for an interactive job, no?

— Andy.

Sent from my iPhone

Mark M

Mar 9, 2018, 2:02:59 PM
to Slurm User Community List
I'm having the same issue. The salloc command hangs on my login nodes, but works fine on the head node. My default salloc command is:

SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env $SHELL"

I'm on the OpenHPC slurm 17.02.9-69.2.

The log says the job is assigned, and then it eventually times out. I have tried srun directly with various tweaks, but it hangs every time. You can't Ctrl-C or Ctrl-Z out of it either, though the shell returns after the job times out. I killed the firewall on the login nodes, but that made no difference.

Andy Georges

Mar 9, 2018, 3:46:48 PM
to Slurm User Community List
Hi all,

I cranked up the debug level a bit.

The job was not started when using:

vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
salloc: Granted job allocation 42
salloc: Waiting for resource configuration
salloc: Nodes node2801 are ready for job

For comparison, running the same command on the master (head?) node succeeds:

vsc40075@master23 () ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
salloc: Granted job allocation 43
salloc: Waiting for resource configuration
salloc: Nodes node2801 are ready for job
vsc40075@node2801 () ~>


Below is some more debug output from the hanging job.

Kind regards,
— Andy

[2018-03-09T21:27:52.251] [42.0] debug: _oom_event_monitor: started.
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.251] [42.0] debug2: Entering _setup_normal_io
[2018-03-09T21:27:52.251] [42.0] debug: stdin uses a pty object
[2018-03-09T21:27:52.251] [42.0] debug: init pty size 23:119
[2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33698: Connection refused
[2018-03-09T21:27:52.252] [42.0] error: slurm_open_msg_conn(pty_conn) 10.141.21.202,33698: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug4: adding IO connection (logical node rank 0)
[2018-03-09T21:27:52.252] [42.0] debug4: connecting IO back to 10.141.21.202:33759
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug3: Error connecting, picking new stream port
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33759: Connection refused
[2018-03-09T21:27:52.252] [42.0] error: connect io: Connection refused
[2018-03-09T21:27:52.252] [42.0] debug2: Leaving _setup_normal_io
[2018-03-09T21:27:52.253] [42.0] error: IO setup failed: Connection refused
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42/step_0'
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/freezer'
[2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.253] [42.0] debug: step_terminate_monitor_stop signaling condition
[2018-03-09T21:27:52.253] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.253] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor will run for 60 secs
[2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor is stopping
[2018-03-09T21:27:52.253] [42.0] debug2: Sending SIGKILL to pgid 6414
[2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/cpuset'
[2018-03-09T21:27:52.265] [42.0] debug3: Took 1038 checks before stepd pid was removed from the step cgroup.
[2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing job cpuset : Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing user cpuset : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug3: _oom_event_monitor: res: 1
[2018-03-09T21:27:52.315] [42.0] _oom_event_monitor: oom-kill event count: 1
[2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075/job_42): Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075): Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing user memcg : Device or resource busy
[2018-03-09T21:27:52.315] [42.0] debug2: Before call to spank_fini()
[2018-03-09T21:27:52.315] [42.0] debug2: After call to spank_fini()
[2018-03-09T21:27:52.315] [42.0] error: job_manager exiting abnormally, rc = 4021
[2018-03-09T21:27:52.315] [42.0] debug: Sending launch resp rc=4021
[2018-03-09T21:27:52.315] [42.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T21:27:52.315] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:37053: Connection refused
[2018-03-09T21:27:52.315] [42.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
[2018-03-09T21:27:52.315] [42.0] debug2: Rank 0 has no children slurmstepd
[2018-03-09T21:27:52.315] [42.0] debug2: _one_step_complete_msg: first=0, last=0
[2018-03-09T21:27:52.315] [42.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
[2018-03-09T21:27:52.317] [42.0] debug4: eio: handling events for 1 objects
[2018-03-09T21:27:52.317] [42.0] debug3: Called _msg_socket_readable
[2018-03-09T21:27:52.317] [42.0] debug2: false, shutdown
[2018-03-09T21:27:52.317] [42.0] debug: Message thread exited
[2018-03-09T21:27:52.317] [42.0] done with job

Nicholas McCollum

Mar 9, 2018, 3:58:57 PM
to slurm...@lists.schedmd.com
Connection refused makes me think it's a firewall issue.

Assuming this is a test environment, could you try on the compute node:

# iptables-save > iptables.bak
# iptables -F && iptables -X

Then test to see if it works. To restore the firewall use:

# iptables-restore < iptables.bak

If you use firewalld, you may have to stop it instead, and start it again afterwards:

# systemctl stop firewalld
# systemctl start firewalld
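
If firewalld needs to stay up, a sketch of an alternative would be to open the Slurm daemon ports rather than stopping the service. 6817 and 6818 are only the defaults for slurmctld and slurmd (check SlurmctldPort/SlurmdPort in your slurm.conf), and the srun callback ports are site-dependent, so treat these numbers as assumptions:

```shell
# Permanently open the default Slurm control and daemon ports,
# then reload so the rules take effect on the running firewall.
firewall-cmd --permanent --add-port=6817/tcp
firewall-cmd --permanent --add-port=6818/tcp
firewall-cmd --reload
```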

---

Nicholas McCollum - HPC Systems Expert
Alabama Supercomputer Authority - CSRA

Mark M

Mar 9, 2018, 4:11:02 PM
to Slurm User Community List

In my case I tested the firewall. But I'm wondering whether the login nodes need to appear in slurm.conf, and also whether slurmd needs to be running on the login nodes in order for them to be submit hosts? Either or both could be my issue.

Mark M

Mar 9, 2018, 4:45:52 PM
to Slurm User Community List
OK, I'm eating my words now. Perhaps I had multiple issues before, but at the moment stopping the firewall allows salloc to work. Can anyone suggest an iptables rule specific to Slurm? Or a way to restrict Slurm communications to the right network?
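
One sketch of an answer: srun picks a random ephemeral port to listen on for the I/O callback, which is why there is no single port to allow. Slurm's SrunPortRange option (slurm.conf, available since 15.08) pins those listeners to a fixed range, which can then be allowed only from the cluster network. The range and the 10.141.0.0/16 subnet below are assumptions (the subnet is guessed from the addresses in the logs in this thread); adjust to your site:

```shell
# In slurm.conf on the submit hosts: pin srun's callback listeners
# to a known range instead of random ephemeral ports.
#   SrunPortRange=60001-63000

# Then allow only that range, and only from the cluster network:
iptables -A INPUT -p tcp -s 10.141.0.0/16 --dport 60001:63000 -j ACCEPT
```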

Andy Georges

Mar 9, 2018, 4:45:52 PM
to Slurm User Community List
Hi,



> On 9 Mar 2018, at 21:58, Nicholas McCollum <nmcc...@asc.edu> wrote:
>
> Connection refused makes me think a firewall issue.
>
> Assuming this is a test environment, could you try on the compute node:
>
> # iptables-save > iptables.bak
> # iptables -F && iptables -X
>
> Then test to see if it works. To restore the firewall use:
>
> # iptables-restore < iptables.bak
>
> You may have to use...
>
> # systemctl stop firewalld
> # systemctl start firewalld
>
> If you use firewalld.

We’re using shorewall …


There is an srun process listening on the login node:

srun 8500 vsc40075 13u IPv4 597473 0t0 TCP *:36506 (LISTEN)


And slurmd on the worker node is trying to connect to it:

[2018-03-09T22:00:44.908] [47.0] debug4: adding IO connection (logical node rank 0)
[2018-03-09T22:00:44.908] [47.0] debug4: connecting IO back to 10.141.21.202:36506
[2018-03-09T22:00:44.908] [47.0] debug: _oom_event_monitor: started.
[2018-03-09T22:00:44.908] [47.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T22:00:44.908] [47.0] debug3: Error connecting, picking new stream port
[2018-03-09T22:00:44.909] [47.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T22:00:44.909] [47.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T22:00:44.909] [47.0] debug2: slurm_connect failed: Connection refused
[2018-03-09T22:00:44.909] [47.0] debug2: Error connecting slurm stream socket at 10.141.21.202:36506: Connection refused
[2018-03-09T22:00:44.909] [47.0] error: connect io: Connection refused
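
A quick cross-check that the firewall is what blocks this path would be to probe the listener from the worker node; the address and port here are taken from the lsof output and log lines above, so substitute your own:

```shell
# Run on node2801: does the srun listener on the login node answer?
# -v: verbose, -z: just scan, don't send data.
nc -vz 10.141.21.202 36506
```

If this reports "Connection refused" while srun is listening, something between the two hosts (or on the login node itself) is rejecting the connection.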


Opening ports 30000-50000 seems to do the trick. I will try to figure out what's different on the other machines.
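
Since shorewall is in play here, a sketch of the equivalent rule might look like the entry below. The zone names are invented for illustration ("clust" for the cluster-internal network, $FW for the firewall host itself), so adapt them to your /etc/shorewall/zones:

```shell
# /etc/shorewall/rules (sketch; columns: ACTION SOURCE DEST PROTO DPORT)
# Allow the srun callback port range from the cluster network only.
ACCEPT   clust   $FW   tcp   30000:50000
```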

Thanks for the pointers and help!

— Andy