[slurm-users] SRUN and SBATCH network issues on configless login node.

Bruno Bruzzo via slurm-users

Aug 28, 2025, 2:34:24 PM
to slurm...@schedmd.com
Hi all, first of all, sorry for my English, it's not my native language.

We are currently experiencing an issue with srun and salloc on our
login nodes, while sbatch works properly.

Slurm version 23.11.4.

slurmctld runs on management node mmgt01.
srun and salloc fail intermittently on the login node: we can successfully
use srun there from time to time, but then it stops working for a while
without any configuration change on our side.

The login node reports the following to the user:

$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted

uid 202 is the slurm user.

On the server side, slurmctld logs show:

sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed

We suspect a network issue, given the "Zero Bytes were transmitted or
received" error.

The configless setup is working properly: slurmd on the login node picks up
changes made to slurm.conf after a scontrol reconfig.

srun runs successfully from the management nodes and from the compute nodes;
the issue only occurs from the login node.

scontrol ping always shows DOWN from the login node, even when we can
successfully run srun or salloc.

$ scontrol ping
Slurmctld(primary) at mmgt01 is DOWN

We also checked munge for consistency.
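
(For reference, the kind of round-trip check we mean is shown below; cn065 is
just an example node.)

$ munge -n | unmunge             # encode and decode locally on the login node
$ munge -n | ssh cn065 unmunge   # encode here, decode on a remote node

Both should report STATUS: Success (0) when the keys and clocks agree.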

The management and login nodes have each other's hostnames in /etc/hosts and
can communicate with each other.

We would really appreciate some tips on what we could be missing.

Best regards,
Bruno Bruzzo
System Administrator - Clementina XXI

Bjørn-Helge Mevik via slurm-users

Aug 29, 2025, 2:43:19 AM
to slurm...@schedmd.com
Bruno Bruzzo via slurm-users <slurm...@lists.schedmd.com> writes:

> slurmctld runs on management node mmgt01.
> srun and salloc fail intermittently on the login node: we can successfully
> use srun there from time to time, but then it stops working for a while
> without any configuration change on our side.

This, to me, sounds like there could be a problem on the compute nodes,
or with the communication between logins and computes. One thing that has
bitten me several times over the years is compute nodes missing from
/etc/hosts on other compute nodes. Slurmctld often sends messages
to computes via other computes, and if a message happens to go via a
node that does not have the target compute in its /etc/hosts, it cannot
forward the message.
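
A quick way to sweep for that (just a sketch; it assumes pdsh is available
and that your computes are named cn001-cn080):

$ pdsh -w cn[001-080] 'getent hosts cn065 || echo MISSING' | grep MISSING

Any node that prints MISSING cannot resolve cn065 and would not be able to
forward messages to it.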

Another thing to look out for is whether any of the nodes running slurmd
(computes or logins) have their slurmd port blocked by firewalld or
something else.
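
For example (just a sketch, assuming the default ports 6817 for slurmctld and
6818 for slurmd; check SlurmctldPort and SlurmdPort in slurm.conf for the
values actually in use):

$ nc -zv mmgt01 6817        # slurmctld reachable from the login node?
$ nc -zv cn065 6818         # slurmd reachable on a compute node?
$ firewall-cmd --list-all   # on each node, if firewalld is running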

> scontrol ping always shows DOWN from the login node, even when we can
> successfully run srun or salloc.

This might indicate that the slurmctld port on mmgt01 is blocked, or the
slurmd port on the logins.

It might be something completely different, but I'd at least check /etc/hosts
on all nodes (controller, logins, computes) and check that all needed
ports are unblocked.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

Bruno Bruzzo via slurm-users

Sep 24, 2025, 2:19:13 PM
to Bjørn-Helge Mevik, slurm...@schedmd.com
Hi, sorry for the late reply.

We tested your proposal and can confirm that all nodes have each other in their respective /etc/hosts files. We can also confirm that the slurmd port is not blocked.

One thing we found useful for reproducing the issue: if we run srun -w <node x> in one session and then srun -w <node x> in another, the second srun waits for resources while the first one gets a shell on <node x>. If we exit the first session, the waiting srun fails with the security violation / invalid job credential errors instead of getting onto <node x>.
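
In other words (cn065 is just an example node; same invocation as above):

# terminal 1: gets a shell on cn065
$ srun -N 1 -n 64 -w cn065 --partition=gpunode --pty /bin/bash
# terminal 2: queues behind it
$ srun -N 1 -n 64 -w cn065 --partition=gpunode --pty /bin/bash
srun: job <id> queued and waiting for resources
# exiting the shell in terminal 1 should hand the node to terminal 2,
# but instead terminal 2 fails with "Invalid job credential".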

We also found that scontrol ping not only fails on the login node but also on the nodes of a specific partition, where it shows the longer message:

Slurmctld(primary) at <headnode> is DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************

Still, Slurm is able to assign jobs to those nodes.

We also raised the slurmctld debug level to the maximum, and when running scontrol ping we get these log entries:
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202

I find it suspicious that the date munge shows is Wed Dec 31 21:00:00 1969. I checked that munge.key has the correct ownership and that all nodes have the same file.
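
For what it's worth, that timestamp is simply Unix epoch 0 rendered in our
UTC-3 timezone, which presumably means the credential's encode/decode times
were never filled in because the decode failed:

$ date -d @0    # GNU date, on a host in a UTC-3 timezone
Wed Dec 31 21:00:00 -03 1969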

Does anyone have more documentation on what scontrol ping does? We haven't found detailed information in the docs.

Best regards,
Bruno Bruzzo
System Administrator - Clementina XXI

John Hearns via slurm-users

Sep 24, 2025, 2:30:40 PM
to Bruno Bruzzo, Bjørn-Helge Mevik, slurm...@schedmd.com
Err... are all your nodes on the same time?

Actually, slurmd will not start if a compute node is too far away in time from the controller node, so you should be OK.

I would still check that the times on all nodes are in agreement.
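
For example (a sketch; assumes a parallel shell like pdsh, chrony on the
nodes, and computes named cn001-cn080):

$ pdsh -w mmgt01,cn[001-080] 'date +%s'                               # epoch seconds should agree
$ pdsh -w mmgt01,cn[001-080] 'chronyc tracking | grep "System time"'  # offset from NTP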

Bruno Bruzzo via slurm-users

Sep 24, 2025, 2:45:45 PM
to John Hearns, Bjørn-Helge Mevik, slurm...@schedmd.com
Yes, all nodes are synchronized with chrony.

John Hearns via slurm-users

Sep 24, 2025, 2:53:13 PM
to Bruno Bruzzo, Bjørn-Helge Mevik, slurm...@schedmd.com
Shot down in 🔥🔥

Bruno Bruzzo via slurm-users

Sep 30, 2025, 3:05:58 PM
to John Hearns, Bjørn-Helge Mevik, slurm...@schedmd.com
Update:
We have solved the issue.

Our problem was that even though we have a configless setup, our provisioning
system was serving an unconfigured slurm.conf file to /etc/slurm on the nodes.

On the failing nodes, we could see:
scontrol show config | grep -i "hash_val"
cn080: HASH_VAL                = Different Ours=<...> Slurmctld=<...>

While on working nodes we saw:
scontrol show config | grep -i "hash_val"
cn044: HASH_VAL                = Match

Note: The failing nodes could still get jobs scheduled via sbatch. The issue was with srun/salloc.

We removed the slurm.conf file, restarted services, and for now, everything works fine.
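
In case it is useful to someone else, the check and the fix boiled down to
something like this (a sketch, run as root; assumes pdsh, systemd-managed
slurmd, and computes named cn001-cn080, and it must not be run on the
controller node, which needs its slurm.conf):

$ pdsh -w cn[001-080] 'scontrol show config | grep HASH_VAL' | grep -v Match
$ pdsh -w cn[001-080] 'rm -f /etc/slurm/slurm.conf && systemctl restart slurmd'

The first command lists the nodes whose local config hash differs from
slurmctld's; the second removes the stray file and restarts slurmd so those
nodes fetch their configuration from slurmctld again.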

Thanks for the support.

Bruno Bruzzo
System Administrator - Clementina XXI

Ole Holm Nielsen via slurm-users

Oct 1, 2025, 2:36:37 AM
to slurm...@lists.schedmd.com
On 9/30/25 20:52, Bruno Bruzzo via slurm-users wrote:
> Update:
> We have solved the issue.
>
> Our problem was that even though we have a configless setup, our provisioning
> system was serving an unconfigured slurm.conf file to /etc/slurm on the nodes.

FYI: the Configless Slurm documentation describes the order of precedence for
which slurm.conf is used. The configless-provided slurm.conf has the lowest
priority of all; see https://slurm.schedmd.com/configless_slurm.html#NOTES

Best regards,
Ole