[slurm-users] srun --x11 connection rejected because of wrong authentication

Christopher Benjamin Coffey

Jun 7, 2018, 6:27:38 PM
to slurm-users
Hi,

I've compiled Slurm 17.11.7 with X11 support. We can ssh to a node from the login node and get xeyes to work, etc. However, srun --x11 xeyes results in:

[cbc@wind ~ ]$ srun --x11 --reservation=root_58 xeyes
X11 connection rejected because of wrong authentication.
Error: Can't open display: localhost:60.0
srun: error: cn100: task 0: Exited with exit code 1

On the node in slurmd.log it says:

[2018-06-07T15:04:29.932] _run_prolog: run job script took usec=1
[2018-06-07T15:04:29.932] _run_prolog: prolog with lock for job 11806306 ran for 0 seconds
[2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: /slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: /slurm/uid_3301/job_11806306/step_extern: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:30.138] [11806306.extern] X11 forwarding established on DISPLAY=cn100:60.0
[2018-06-07T15:04:30.239] launch task 11806306.0 request from 3301...@172.16.3.21 (port 32453)
[2018-06-07T15:04:30.240] lllp_distribution jobid [11806306] implicit auto binding: cores,one_thread, dist 1
[2018-06-07T15:04:30.240] _task_layout_lllp_cyclic
[2018-06-07T15:04:30.240] _lllp_generate_cpu_bind jobid [11806306]: mask_cpu,one_thread, 0x0000001
[2018-06-07T15:04:30.268] [11806306.0] task/cgroup: /slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:30.268] [11806306.0] task/cgroup: /slurm/uid_3301/job_11806306/step_0: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
[2018-06-07T15:04:30.303] [11806306.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: remote disconnected
[2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: exiting thread
[2018-06-07T15:04:30.376] [11806306.0] done with job
[2018-06-07T15:04:30.413] [11806306.extern] x11 forwarding shutdown complete
[2018-06-07T15:04:30.443] [11806306.extern] _oom_event_monitor: oom-kill event count: 1
[2018-06-07T15:04:30.508] [11806306.extern] done with job

It seems like it's close: srun and the node agree on the port to connect on, but slurmd registers the display under the node name (cn100:60.0) while the client end tries to connect via localhost on the same port. Maybe I have an ssh setting wrong somewhere? I believe I've tried every relevant combination in ssh_config and sshd_config. /home isn't the issue either: it's a shared filesystem that each node mounts, and we even tried no_root_squash so root can write to the .Xauthority file, as some folks have suggested.
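One thing I may try next, to see what the client end actually resolves inside the step (just a diagnostic sketch, reusing the reservation and display number from above):

[cbc@wind ~ ]$ srun --x11 --reservation=root_58 sh -c 'echo $DISPLAY; xauth list'

If the cookie slurmd wrote for cn100:60 never shows up there, that would point at the authentication side rather than the port.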

Also, xauth list shows that there was no magic cookie written for host cn100:

[cbc@wind ~ ]$ xauth list
wind.hpc.nau.edu/unix:14 MIT-MAGIC-COOKIE-1 ac4a0f1bfe9589806f81dd45306ee33d

Is something preventing root from writing the magic cookie? The file is definitely writable:

[root@cn100 ~]# touch /home/cbc/.Xauthority
[root@cn100 ~]#
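To rule out xauth itself rather than just file permissions, I suppose one could try writing a throwaway cookie as root; a sketch, where mcookie comes from util-linux and cn100:60 is the display from the log above:

[root@cn100 ~]# XAUTHORITY=/home/cbc/.Xauthority xauth add cn100:60 MIT-MAGIC-COOKIE-1 $(mcookie)
[root@cn100 ~]# XAUTHORITY=/home/cbc/.Xauthority xauth list

If that succeeds, the filesystem isn't blocking the write and the cookie is simply never being added.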

Anyone have any ideas? Thanks!

Best,
Chris


Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167


Hadrian Djohari

Jun 7, 2018, 9:49:10 PM
to Slurm User Community List
Hi,

I do not remember whether we had the same error message.
But if the user's known_hosts file has a stale entry for the node he is trying to connect to, X11 won't connect properly.
Once the known_hosts entry has been deleted, X11 connects just fine.
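Clearing the stale entry is a one-liner (standard ssh-keygen usage; substitute the node name):

ssh-keygen -R cn100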

Hadrian
--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490

Christopher Benjamin Coffey

Jun 11, 2018, 2:06:44 PM
to Slurm User Community List
Hi Hadrian,

Thank you, but unfortunately that is not the issue. We can connect to the nodes outside of Slurm and X11 works properly.
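For example, roughly what we tested from the login node works fine:

[cbc@wind ~ ]$ ssh -X cn100 xeyes

so a stale known_hosts entry doesn't seem to be in play here.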

Best,
Chris


Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167


Hadrian Djohari

Jun 11, 2018, 2:46:38 PM
to Slurm User Community List
Yes, X11 also worked for us outside of Slurm. Well, good luck finding your issue.