
Bug#984928: slurmctld: fails to start on reboot


David Bremner
Mar 10, 2021, 7:10:04 AM

Package: slurmctld
Version: 20.11.4-1
Severity: normal

I have a slurm cluster set up on a single node. This node is running
slurmctld, munge, and slurmd. When I reboot the node it seems that
there is some race condition with slurmctld and/or slurmd trying to
restart before networking is fully available. By the time I can ssh
into the machine, manually restarting slurmctld and slurmd works. I
replaced "localhost" with "127.0.0.1", but that does not seem to change
anything.

slurmctld.log has

[2021-03-10T07:13:08.118] slurmctld version 20.11.4 started on cluster cluster
[2021-03-10T07:13:08.132] No memory enforcing mechanism configured.
[2021-03-10T07:13:08.137] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.137] error: slurm_set_addr: Unable to resolve "127.0.0.1"
[2021-03-10T07:13:08.137] error: slurm_get_port: Address family '0' not supported
[2021-03-10T07:13:08.137] error: _set_slurmd_addr: failure on 127.0.0.1
[2021-03-10T07:13:08.137] Recovered state of 1 nodes
[2021-03-10T07:13:08.138] Recovered JobId=1651 Assoc=0
[2021-03-10T07:13:08.138] Recovered information about 1 jobs
[2021-03-10T07:13:08.138] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.140] Recovered state of 0 reservations
[2021-03-10T07:13:08.140] read_slurm_conf: backup_controller not specified
[2021-03-10T07:13:08.140] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2021-03-10T07:13:08.140] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.141] Running as primary controller
[2021-03-10T07:13:08.141] No parameter for mcs plugin, default values set
[2021-03-10T07:13:08.141] mcs: MCSParameters = (null). ondemand set.
[2021-03-10T07:13:08.142] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.142] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.142] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.144] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.144] fatal: slurm_init_msg_engine_port error Address family not supported by protocol


slurmd.log has



[2021-03-10T07:13:08.195] cgroup namespace 'freezer' is now mounted
[2021-03-10T07:13:08.198] slurmd version 20.11.4 started
[2021-03-10T07:13:08.199] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.199] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.199] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.200] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.200] error: Unable to bind listen port (6818): Address family not supported by protocol


-- System Information:
Debian Release: bullseye/sid
APT prefers unstable-debug
APT policy: (500, 'unstable-debug'), (500, 'testing-security'), (500, 'testing-proposed-updates-debug'), (500, 'testing-debug'), (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/8 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8), LANGUAGE=en_CA:en
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages slurmctld depends on:
ii libc6 2.31-9
ii lsb-base 11.1.0
pn munge <none>
pn slurm-client <none>
pn slurm-wlm-basic-plugins <none>
ii ucf 3.0043

slurmctld recommends no packages.

slurmctld suggests no packages.

[attachment: slurm.conf]

David Bremner
Mar 27, 2021, 7:00:05 PM

As a workaround, I noticed that setting the main ethernet interface to
"auto" instead of "allow-hotplug" seems to fix the problem. By way of
confirmation, on a different (virtual) machine, changing "auto" to
"allow-hotplug" on the main ethernet interface causes the same problem
to manifest.
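
For concreteness, the change amounts to something like this in
/etc/network/interfaces (the interface name "eno1" and the dhcp method
are just placeholders):

# /etc/network/interfaces (sketch; "eno1" and dhcp are placeholders)
# allow-hotplug only brings the interface up when udev announces it,
# which can race with services started at boot:
#allow-hotplug eno1
#iface eno1 inet dhcp
# auto interfaces are brought up by networking.service during boot:
auto eno1
iface eno1 inet dhcp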

This is still a bit mysterious, since the messages complain about
127.0.0.1, which is of course on the loopback interface, already marked
"auto" and presumably up pretty early.



David Bremner
Aug 6, 2021, 10:10:03 AM

I think (one) underlying problem is that the systemd unit file for
slurmctld is incorrect. The details are in [1], but it seems that
network.target is not the right target to order against (I think it is
very rarely useful). I added the following override:

# /etc/systemd/system/slurmctld.service.d/override.conf
[Unit]
After=network-online.target munge.service
Wants=network-online.target

And it seems to help. I didn't check if the second mention of
munge.service is really needed.
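
For anyone wanting to try this, the drop-in can be applied and checked
with the usual systemd steps (nothing slurm-specific, just standard
commands):

# create/edit the override shown above
systemctl edit slurmctld.service
# reload units and restart the daemon
systemctl daemon-reload
systemctl restart slurmctld.service
# confirm that network-online.target now appears in the ordering
systemd-analyze critical-chain slurmctld.service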

I've switched to systemd-networkd on the hosts in question, so I can't
easily test how this works with ifupdown, but I notice ifupdown provides

/lib/systemd/system/ifupdown-wait-online.service

which (guessing based on the name) should provide functionality similar
to that documented in [1] for NetworkManager and systemd-networkd.

[1]: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

Gennaro Oliva
Jan 27, 2022, 5:30:05 PM

Hi David,
sorry for getting back to you so late. Thanks to your valuable
contribution I managed to find a working solution.

On Fri, Aug 06, 2021 at 11:01:48AM -0300, David Bremner wrote:
> I think (one) underlying problem is that the systemd unit file for
> slurmctld is incorrect. The details are in [1], but it seems like
> network.target is not correct (I think it very rarely is a useful
> target). I added the following
>
> # /etc/systemd/system/slurmctld.service.d/override.conf
> [Unit]
> After=network-online.target munge.service
> Wants=network-online.target

Yes, this change is now part of the service file.

> I've switched to systemd-networkd on the hosts in question, so I can't
> easily test how this works with ifupdown, but I notice ifupdown provides
>
> /lib/systemd/system/ifupdown-wait-online.service
>
> which (guessing based on the name) should provide similar functionality
> to those documented in [1] for NetworkManager and systemd-networkd.
>
> [1]: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

Unfortunately, using ifupdown-wait-online didn't help when I use
ifupdown with allow-hotplug interfaces, but I did not test it
thoroughly since I want a solution that works out of the box.

Therefore I decided to patch the failing slurm code to retry
getaddrinfo before giving up on starting the daemons.
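
Roughly, the idea is a small retry wrapper around getaddrinfo, along
these lines (a sketch of the approach only, not the actual patch):

#include <netdb.h>
#include <stddef.h>
#include <unistd.h>

/* Sketch only: retry getaddrinfo() a few times when the resolver
 * is not ready yet, as happens right after boot. */
static int getaddrinfo_retry(const char *node, const char *serv,
                             const struct addrinfo *hints,
                             struct addrinfo **res)
{
        int rc = EAI_AGAIN;

        for (int attempt = 0; attempt < 10; attempt++) {
                rc = getaddrinfo(node, serv, hints, res);
                if (rc == 0)
                        break;
                /* EAI_AGAIN/EAI_NONAME are the transient failures
                 * seen in the logs while the network comes up */
                if ((rc != EAI_AGAIN) && (rc != EAI_NONAME))
                        break;
                sleep(1);
        }
        return rc;
}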

Best regards,
--
Gennaro Oliva

Jerome BENOIT
Jan 1, 2024, 7:20:04 AM

Hi,

I have a setup similar to that of the original reporter.
My NodeName is localhost.

The error messages at boot time scared me, so I dug into the issue.
I also related this issue to my observation that slurm fails to launch jobs
when my standalone computer is disconnected (the router provided by my ISP
is very unstable). I could reproduce the issue with a simple C program
that mimics slurm's get_addr_info function. After some trials, it appears
that the issue disappears when hints.ai_flags does not include the
AI_ADDRCONFIG flag (see getaddrinfo(3) for more information). So the
current workaround patch `retry-getaddrinfo` only fixes the issue
partially.

The following patch neutralizes the setting of the AI_ADDRCONFIG flag:

============================8><--------------------------------------------------------
--- a/src/common/conmgr.c
+++ b/src/common/conmgr.c
@@ -1807,7 +1807,7 @@
 	struct addrinfo hints = { .ai_family = AF_UNSPEC,
 				  .ai_socktype = SOCK_STREAM,
 				  .ai_protocol = 0,
-				  .ai_flags = AI_PASSIVE | AI_ADDRCONFIG };
+				  .ai_flags = AI_PASSIVE /*| AI_ADDRCONFIG */ };
 	struct addrinfo *addrlist = NULL;
 	parsed_host_port_t *parsed_hp;

--- a/src/common/util-net.c
+++ b/src/common/util-net.c
@@ -261,7 +261,7 @@
 	else
 		hints.ai_family = AF_UNSPEC;

-	hints.ai_flags = AI_ADDRCONFIG | AI_NUMERICSERV | AI_PASSIVE;
+	hints.ai_flags = /* AI_ADDRCONFIG | */ AI_NUMERICSERV | AI_PASSIVE;
 	if (hostname)
 		hints.ai_flags |= AI_CANONNAME;
 	hints.ai_socktype = SOCK_STREAM;
----------------------------><8========================================================

I guess that this patch is too brutal and that it must be refined.
In particular, it may be that the AI_ADDRCONFIG flag only needs to be
dropped on standalone computers.
However, I am not familiar enough with slurm and networking to go further.
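
One possible refinement (only a sketch; the helper name is mine, not
slurm's) would be to keep AI_ADDRCONFIG for a first attempt and drop it
only as a fallback when resolution fails:

#include <netdb.h>
#include <stddef.h>

/* Sketch: try with AI_ADDRCONFIG first, and retry without it when
 * resolution fails, so that a host whose only usable address is on
 * the loopback interface can still resolve localhost/127.0.0.1. */
static int getaddrinfo_addrconfig_fallback(const char *node,
                                           const char *serv,
                                           struct addrinfo *hints,
                                           struct addrinfo **res)
{
        int rc;

        hints->ai_flags |= AI_ADDRCONFIG;
        rc = getaddrinfo(node, serv, hints, res);
        if (rc != 0) {
                hints->ai_flags &= ~AI_ADDRCONFIG;
                rc = getaddrinfo(node, serv, hints, res);
        }
        return rc;
}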

Here is the simple C program that helped me isolate the issue:

============================8><--------------------------------------------------------
// `example-getaddrinfo-00.c' C source file

// gcc -Wall -o example-getaddrinfo-00 example-getaddrinfo-00.c
// $ ./example-getaddrinfo-00
// $ ./example-getaddrinfo-00 localhost
// $ ./example-getaddrinfo-00 debian.org

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <arpa/inet.h>

int main(int nargs, char *args[]) {
        char nodename[1024] = "localhost";
        const char serv[6] = "6817";
        struct addrinfo hints;
        struct addrinfo *result = NULL;
        struct addrinfo *rdx = NULL;
        struct sockaddr_in *ai_addr_v4 = NULL;
        char sa_str[INET6_ADDRSTRLEN];
        char *xnodename = NULL;
        int status = 0;

        /* optional first argument overrides the node name;
         * passing the literal string "NULL" queries with a NULL node */
        if (1 < nargs) {
                snprintf(nodename, sizeof(nodename), "%s", args[1]);
        }
        if (strcmp(nodename, "NULL")) {
                xnodename = nodename;
        }

        /* hints similar to slurm's get_addr_info(); flip the #if to 1
         * to add AI_ADDRCONFIG and reproduce the failure on a
         * disconnected host */
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;
        hints.ai_flags = AI_NUMERICSERV | AI_PASSIVE | AI_CANONNAME;
#if 0
        hints.ai_flags |= AI_ADDRCONFIG;
#endif
        hints.ai_socktype = SOCK_STREAM;

        status = getaddrinfo(xnodename, serv, &hints, &result);
        if (status) {
                fprintf(stderr, "FAIL:getaddrinfo: ``%s''\n", gai_strerror(status));
        }

        /* print the canonical name and every returned IPv4 address */
        for (rdx = result; rdx != NULL; rdx = rdx->ai_next) {
                ai_addr_v4 = (struct sockaddr_in *)(rdx->ai_addr);
                inet_ntop(AF_INET, &(ai_addr_v4->sin_addr), sa_str, sizeof(sa_str));
                fprintf(stdout, ">%s< >%s<\n", result->ai_canonname, sa_str);
        }
        freeaddrinfo(result);
        result = NULL;

        return (status);
}
----------------------------><8========================================================

hth,
Jerome
--
Jerome BENOIT | calculus+at-rezozer^dot*net
https://qa.debian.org/developer.php?login=calc...@rezozer.net
AE28 AE15 710D FF1D 87E5 A762 3F92 19A6 7F36 C68B