[slurm-users] what is the possible reason for secondary slurmctld node not allocate job after takeover?

taleint...@sjtu.edu.cn

Jun 3, 2022, 6:17:43 AM

to slurm...@lists.schedmd.com

Hi, all:

Our cluster is set up with two Slurm control nodes, and scontrol show config reports the following:

> scontrol show config

…

SlurmctldHost[0] = slurm1

SlurmctldHost[1] = slurm2

StateSaveLocation = /etc/slurm/state

…

Of course, we have made sure that both nodes have the same Slurm configuration, mount the same NFS share at StateSaveLocation, and can read and write it. (Their operating systems differ, though: slurm1 runs CentOS 7 and slurm2 runs CentOS 8.)

When slurm1 controls the cluster and slurm2 runs in standby mode, the cluster has no problems.

But when we run “scontrol takeover” on slurm2 to switch the primary role, all newly submitted jobs get stuck in the PD (pending) state.

slurm2 never allocates resources to any job, no matter how long we wait. Meanwhile, jobs that were already running complete without problems, and query commands such as “sinfo” and “sacct” all work well.

squeue first shows the pending reason as “priority”, but after we manually update the priority, the reason becomes “none” and the job still stays in the PD state.

While slurm2 is primary, there are no significant errors in slurmctld.log. Only after we restart the slurm1 service, so that slurm2 returns to the standby role, does the log report many errors such as:

error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode

error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode

error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in standby mode
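
To see which RPC types dominate these standby-mode errors, a small tally like the following might help. It is only a sketch: the function name `count_standby_rpcs` is made up for illustration, and the log path in the usage comment is an assumption, since the thread does not say where slurmctld.log lives.

```shell
# count_standby_rpcs (hypothetical helper): tally the RPC names that appear in
# "Invalid RPC received <NAME> while in standby mode" log lines.
# Reads the files given as arguments, or stdin when no file is given.
count_standby_rpcs() {
    sed -n 's/.*Invalid RPC received \([A-Z_]*\) while in standby mode.*/\1/p' "$@" |
        sort | uniq -c |
        awk '{ print $1, $2 }' |      # normalise uniq's padded count column
        sort -k1,1nr -k2,2            # most frequent first, then by name
}

# Example (the path is an assumption, adjust to your SlurmctldLogFile):
#   count_standby_rpcs /var/log/slurmctld.log
```

Each output line is a count followed by the RPC name, so a flood of node registrations would stand out from job-completion messages.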

So, are there any suggestions for finding out why slurm2 behaves abnormally as the primary controller?

Brian Andrus

Jun 3, 2022, 9:16:23 AM

to slurm...@lists.schedmd.com

Offhand, I would suggest double-checking munge and the versions of slurmd/slurmctld.
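
A minimal sketch of such a version check, assuming the version string of each host has already been collected (for example with `slurmctld -V` or `slurmd -V`). The helper name `check_versions` and the "host version" input layout are hypothetical, and the ssh loop in the comment assumes passwordless ssh between the hosts.

```shell
# check_versions (hypothetical helper): reads lines of the form
#   <host> <version string>
# on stdin and reports whether every host shows the same version string.
check_versions() {
    awk '
        NF >= 2 {
            host = $1
            $1 = ""; sub(/^ +/, "")          # $0 is now just the version string
            hosts[$0] = hosts[$0] " " host   # group hosts by version string
        }
        END {
            n = 0
            for (v in hosts) n++
            if (n == 0) { print "no input"; exit 2 }
            if (n == 1) { print "consistent"; exit 0 }
            print "mismatch"
            for (v in hosts) print v ":" hosts[v]
            exit 1
        }'
}

# Example (host names taken from the thread; assumes passwordless ssh):
#   for h in slurm1 slurm2; do
#       printf '%s %s\n' "$h" "$(ssh "$h" slurmctld -V)"
#   done | check_versions
```

The same helper could be fed the output of `munged --version` per host to compare munge builds.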

Brian Andrus

taleint...@sjtu.edu.cn

Jun 4, 2022, 4:24:00 AM

to Brian Andrus, slurm...@lists.schedmd.com

Well, after increasing the slurmctld log level to debug, we did find some errors related to munge, such as:

[2022-06-04T15:17:21.258] debug: auth/munge: _decode_cred: Munge decode failed: Failed to connect to "/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)

But when we test munge manually, it works well between slurm2 and the compute nodes.

> munge -n | ssh node010 unmunge

The authenticity of host 'node010 (192.168.1.10)' can't be established.

RSA key fingerprint is SHA256:/fx4zQPDDPHj7df6ml0Fd0kn8cIKkSO0OgKpF+qcRDI.

Are you sure you want to continue connecting (yes/no/[fingerprint])? yes

Warning: Permanently added 'node010,192.168.1.10' (RSA) to the list of known hosts.

Password:

STATUS: Success (0)

ENCODE_HOST: slurm2 (192.168.0.33)

ENCODE_TIME: 2022-06-04 16:11:35 +0800 (1654330295)

DECODE_TIME: 2022-06-04 16:11:52 +0800 (1654330312)

TTL: 300

CIPHER: aes128 (4)

MAC: sha256 (5)

ZIP: none (0)

UID: root (0)

GID: root (0)

LENGTH: 0

Of course, running munge on the compute nodes and unmunge on slurm2 also works well.

So what else does slurmctld require from munge? Or how does Slurm's auth/munge differ from a manual munge/unmunge test?
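
One possible difference, offered as an assumption rather than a diagnosis: the manual test above encodes a single credential and decodes it on a remote node, whereas the logged failure is a connection to the local munged socket on slurm2 being refused with "Resource temporarily unavailable". A rough sketch for exercising the local socket repeatedly is below. The function names and the default count of 100 are hypothetical; only `munge -n` and `unmunge` are real commands, and a sequential loop like this is far gentler than whatever load slurmctld generates.

```shell
# roundtrip (hypothetical helper): encode and decode one credential through
# the local munged socket; succeeds only if unmunge reports success.
roundtrip() {
    munge -n | unmunge >/dev/null 2>&1
}

# stress_local_munge (hypothetical helper): repeat the local round trip N
# times (default 100, an arbitrary choice) and report how many failed.
stress_local_munge() {
    n=${1:-100}
    fail=0
    i=0
    while [ "$i" -lt "$n" ]; do
        roundtrip || fail=$((fail + 1))
        i=$((i + 1))
    done
    echo "$fail of $n local round trips failed"
}

# Example, run on slurm2 while it is acting as primary:
#   stress_local_munge 500
```

If local round trips fail intermittently while the remote test succeeds, that would point at the munged daemon on slurm2 rather than at the key or the clocks.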

From: Brian Andrus <>
Sent: June 3, 2022 21:16
To: slurm...@lists.schedmd.com
Subject: Re: [slurm-users] what is the possible reason for secondary slurmctld node not allocate job after takeover?
