[slurm-users] Cannot run interactive jobs


Sajesh Singh

Mar 25, 2020, 2:22:16 AM
to slurm...@schedmd.com

CentOS 7.7.1908

Slurm 18.08.8

 

When trying to run an interactive job I am getting the following error:

 

srun: error: task 0 launch failed: Slurmd could not connect IO

 

Checking the log file on the compute node I see the following error:

 

[2020-03-25T01:42:08.262] launch task 13.0 request from UID:1326 GID:50000 HOST:192.168.229.254 PORT:14980

[2020-03-25T01:42:08.262] lllp_distribution jobid [13] implicit auto binding: cores,one_thread, dist 8192

[2020-03-25T01:42:08.262] _task_layout_lllp_cyclic

[2020-03-25T01:42:08.262] _lllp_generate_cpu_bind jobid [13]: mask_cpu,one_thread, 0x0000000000000001

[2020-03-25T01:42:08.262] _run_prolog: run job script took usec=5

[2020-03-25T01:42:08.262] _run_prolog: prolog with lock for job 13 ran for 0 seconds

[2020-03-25T01:42:08.272] [13.0] Considering each NUMA node as a socket

[2020-03-25T01:42:08.310] [13.0] error: stdin openpty: Operation not permitted

[2020-03-25T01:42:08.311] [13.0] error: IO setup failed: Operation not permitted

[2020-03-25T01:42:08.311] [13.0] error: job_manager exiting abnormally, rc = 4021

[2020-03-25T01:42:08.315] [13.0] done with job

 

When doing the same on a CentOS 7.3 / Slurm 18.08.4 cluster, the interactive job runs as expected.

 

Any advice on how to remedy this would be appreciated.

 

-Sajesh-

Manalo, Kevin L

Apr 6, 2021, 9:55:51 AM
to Slurm User Community List

Sajesh,

 

For other users who may have run into this: I found a reason why srun cannot run interactive jobs, and it is not necessarily related to RHEL/CentOS 7.

 

If you strace the slurmd process, you may see the following (note arg 3, the GID):

 

chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)

 

In my case I saw something similar:

 

chown("/dev/pts/1", 1326, 0) = -1 EPERM (Operation not permitted)
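In case it is useful, here is a sketch of how that trace can be taken and the offending GID pulled out of the captured line. The attach command and sample line are illustrative (they assume a running slurmd and the log line quoted above):

```shell
# Sketch only: attach to a running slurmd with, e.g.,
#   strace -f -e trace=chown -p "$(pgrep -o slurmd)"
# and look for a failing chown on /dev/pts/*. The third argument is the
# GID slurmd resolved for the tty group. Using the captured line from
# above as sample input:
line='chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)'
echo "$line" | sed -E 's/.*, ([0-9]+)\) = .*/\1/'   # prints the GID: 7
```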

 

For our site, this bug report was also helpful:

https://bugs.schedmd.com/show_bug.cgi?id=8729

 

The tty group was mapped to GID 7 in Sajesh’s case. It should always be mapped to GID 5. At our site, the problem was that /etc/group was large and the tty group was not being read in properly.
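The GID the tty group resolves to is easy to check. A minimal sketch, using a throwaway sample file in place of the real /etc/group (on a live system, `getent group tty` does the same lookup through NSS):

```shell
# Sketch: look up the GID of the tty group. /tmp/group.sample stands in
# for /etc/group here; on a healthy system the answer should be 5.
cat > /tmp/group.sample <<'EOF'
root:x:0:
daemon:x:2:
tty:x:7:
EOF
awk -F: '$1 == "tty" { print $3 }' /tmp/group.sample   # prints 7, the bad mapping
```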

 

The fix for us was to re-sort the group file by GID, so that the tty line would fall on line 5.
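For what it is worth, that re-sort is a one-liner. A sketch against a throwaway copy (the paths are illustrative; back up the real file, or use vigr for proper locking, before touching /etc/group itself):

```shell
# Sketch: sort a group file numerically by GID (field 3, colon-delimited).
# Never overwrite /etc/group in place without a backup.
cat > /tmp/group.unsorted <<'EOF'
users:x:100:
tty:x:5:
root:x:0:
EOF
sort -t: -k3,3n /tmp/group.unsorted > /tmp/group.sorted
cat /tmp/group.sorted
```

After sorting, root (GID 0) comes first and the tty line sits among the low-numbered system groups where it belongs.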

 

Hope this helps,

Kevin
