Bruno Bruzzo via slurm-users <slurm...@lists.schedmd.com> writes:
> slurmctld runs on management node mmgt01.
> srun and salloc fail intermittently on login node, that means
> we can successfully use srun on login node from time to time, but it
> stops working for a while without us changing any configuration.
This, to me, sounds like there could be a problem on the compute nodes,
or in the communication between logins and computes. One thing that has
bitten me several times over the years is compute nodes missing from
/etc/hosts on other compute nodes. Slurmctld often sends messages
to computes via other computes, and if a message happens to go via a
node that does not have the target compute in its /etc/hosts, that node
cannot forward the message.
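A minimal sketch of such a check, assuming name resolution really does go
through /etc/hosts (not DNS) on your nodes. The function names and the node
names in the usage note are made up for illustration; they are not part of
Slurm.

```python
def parse_hosts(text):
    """Return the set of all hostnames and aliases in /etc/hosts-style text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def missing_hosts(hosts_text, expected):
    """Return the expected node names that are absent from the hosts text."""
    known = parse_hosts(hosts_text)
    return sorted(name for name in expected if name not in known)
```

Run on each node, something like
missing_hosts(open("/etc/hosts").read(), ["mmgt01", "login01", "c001"])
(hypothetical node names) should return an empty list; any name it returns
is one that node cannot resolve from its /etc/hosts.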
Another thing to check is whether any nodes running slurmd (computes or
logins) have their slurmd port blocked by firewalld or something else.
> scontrol ping always shows DOWN from login node, even when we can
> successfully run srun or salloc.
This might indicate that the slurmctld port on mmgt01 is blocked, or the
slurmd port on the logins.
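A rough way to test this is a plain TCP connect from the node in question.
The sketch below assumes the default ports (6817 for slurmctld, 6818 for
slurmd); if your slurm.conf sets SlurmctldPort or SlurmdPort, use those
values instead. The helper names are my own. A successful connect only shows
that the port is reachable, not that Slurm itself is healthy.

```python
import socket

# Slurm's default ports; a site may override them with
# SlurmctldPort / SlurmdPort in slurm.conf.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. a firewall silently
        # dropping packets) and name-resolution failures.
        return False


def check_ports(targets, timeout=3.0):
    """Map each (host, port) pair to its reachability as True/False."""
    return {(h, p): port_open(h, p, timeout) for h, p in targets}
```

From a login node, check_ports([("mmgt01", SLURMCTLD_PORT)]) tests the
controller; from mmgt01, the same call with a login or compute hostname and
SLURMD_PORT tests the other direction.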
It might be something completely different, but I'd at least check /etc/hosts
on all nodes (controller, logins, computes) and check that all needed
ports are unblocked.
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo