[slurm-users] slurmctld doesn't start at boot


Riccardo Sucapane

Jul 23, 2021, 6:30:02 AM
to slurm...@lists.schedmd.com
Hello everyone,
I am using Slurm as the workload manager on a system
with a master and 3 nodes.
The operating system is the recent Rocky Linux 8.4,
and Slurm is version 20.11.8 from the EPEL
repository.
Everything works correctly, and once the system is up the command
"systemctl start slurmctld" works fine, but at boot the slurmctld
daemon does not start on the master machine and reports a series of errors.
Without reproducing the whole slurmctld.log, the recurring error is the following:

[2021-07-23T09:58:01.932] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-07-23T09:58:01.932] error: slurm_set_addr: Unable to resolve "blade01"
[2021-07-23T09:58:01.932] error: slurm_get_port: Address family '0' not supported
[2021-07-23T09:58:01.932] error: _set_slurmd_addr: failure on blade01


In this case I have set, for simplicity,
"AccountingStorageType=accounting_storage/none" in the slurm.conf file. Using
slurmdbd/MariaDB support instead also works without problems, but slurmctld
still does not start on boot.
Note that blade01, which appears in the log, is the hostname of one of the nodes.

I have already read some messages that reported a similar problem,
but none of the considerations I read helped me to overcome the problem.
Is there anyone who can help me find a solution?
Greetings to all
Riccardo

--
**********************************************************
Riccardo Sucapane
Dip. MEMOTEF - Sapienza Università di Roma
Via del Castro Laurenziano, 9 - 00161 - Roma
Tel. 06 4976 6846
**********************************************************


Ole Holm Nielsen

Jul 23, 2021, 6:53:14 AM
to slurm...@lists.schedmd.com
On 7/23/21 12:29 PM, Riccardo Sucapane wrote:
> I am using Slurm as a workload manager on a system
> with a master and 3 nodes.
> The operating system used is the recent rocky linux 8.4
> while for slurm, is used the version 20.11.8 taken from EPEL
> repository.
> Everything works correctly and when the system is started the command
> "systemctl start slurmctld" works fine, but at boot the daemon
> slurmctld does not start on the master machine, reporting a series of errors.
> Without reporting all the slurmctld.log the recurring error is the following:
>
> [2021-07-23T09:58:01.932] error: get_addr_info: getaddrinfo() failed: Name or service not known
> [2021-07-23T09:58:01.932] error: slurm_set_addr: Unable to resolve "blade01"
> [2021-07-23T09:58:01.932] error: slurm_get_port: Address family '0' not supported
> [2021-07-23T09:58:01.932] error: _set_slurmd_addr: failure on blade01

This seems to be a DNS name resolution error.

This could be due to slurmctld starting before the server's network is
completely up! We have seen this with slurmd on EL 8.4 nodes, and I found
a solution, see https://bugs.schedmd.com/show_bug.cgi?id=11878#c5. This
will be fixed in Slurm 21.08.

In /usr/lib/systemd/system/slurmd.service and
/usr/lib/systemd/system/slurmctld.service you should replace
"network.target" with "network-online.target". Reboot to test it.

> In this case I have set it in the slurm.conf file, for simplicity,
> "AccountingStorageType=accounting_storage/none", but also using the
> slurmdbd/mariadb support is all right with no problems, but slurmctld
> still does not start on boot.
> Also in the log reported blade01 is the hostname of one of the nodes.

You should probably fix /usr/lib/systemd/system/slurmdbd.service as well.

/Ole

Diego Zuccato

Jul 23, 2021, 6:59:08 AM
to Slurm User Community List, Riccardo Sucapane
Hi Riccardo.

I've had a similar problem (slurm.conf is served via an NFS share). I just
modified the slurmd unit:
# systemctl edit slurmd
[Unit]
Requires=network-online.target
After=home.mount

HIH

Diego

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Diego Zuccato

Jul 23, 2021, 7:01:18 AM
to Slurm User Community List, Ole Holm Nielsen
We answered in parallel :)
I usually prefer to avoid modifying system-managed files because system
updates could reset 'em. Since systemd allows overrides, I chose to use
'em :)

Ole Holm Nielsen

Jul 23, 2021, 7:04:24 AM
to Slurm User Community List
On 7/23/21 1:00 PM, Diego Zuccato wrote:
> We answered in parallel :)
> I usually prefer to avoid modifying system-managed files because system
> updates could reset 'em. Since systemd allows overrides, I chose to use
> 'em :)

I agree with you! The permanent fix will change those systemd unit files in
21.08.

Copy the Slurm service files to /etc/systemd/system/ and edit them, or use
systemctl edit --full <service-name>.
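
The copy-and-edit route could be scripted roughly as follows. A minimal sketch: the function name copy_unit_with_network_online is hypothetical, and it assumes the packaged unit file mentions "network.target" literally, as described earlier in the thread.

```shell
# Hypothetical helper: copy a packaged unit file to an override location,
# replacing network.target with network-online.target along the way.
# $1 = source unit file, $2 = destination unit file.
copy_unit_with_network_online() {
    sed 's/network\.target/network-online.target/g' "$1" > "$2"
}

# Usage (as root), then reload systemd and reboot to test:
#   copy_unit_with_network_online /usr/lib/systemd/system/slurmctld.service \
#       /etc/systemd/system/slurmctld.service
#   systemctl daemon-reload
```

Running it on a file that already uses network-online.target leaves the file content unchanged, because the string "network.target" no longer occurs in it.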

/Ole

Riccardo Sucapane

Jul 23, 2021, 7:09:48 AM
to Slurm User Community List
Yes, that was the problem. Thanks, everyone, for the help.
Greetings, Riccardo