[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Ole Holm Nielsen

Oct 30, 2023, 8:50:43 AM
to Slurm User Community List
I'm fighting this strange scenario where slurmd is started before the
Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
by slurmd then fails the node (as it should). This happens only on EL8
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd
Systemd service until the Infiniband/OPA network is fully up?

Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not
Mellanox IB. On AlmaLinux 8.8 we use the in-distro OPA drivers since the
CornelisNetworks drivers are not available for RHEL 8.8.

The details:

The slurmd service is started by the service file
/usr/lib/systemd/system/slurmd.service after the "network-online.target"
has been reached.
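
(The relevant ordering stanza in the slurmd unit looks something like this, depending on the Slurm version:)

[Unit]
After=munge.service network-online.target remote-fs.target
Wants=network-online.target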

It seems that NetworkManager reports "network-online.target" BEFORE the
Infiniband/OPA device ib0 is actually up, and this seems to be the cause
of our problems!

Here are some important sequences of events from the syslog showing that
the network goes online before the Infiniband/OPA network (hfi1_0 adapter)
is up:

Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: set_link_state: current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: physical state changed to PHYS_LINKUP (0x5), phy 0x50

I tried to delay the NetworkManager "network-online.target" by setting a
wait on the ib0 device and rebooting, but that seems to be ignored:

$ nmcli -p connection modify "System ib0" connection.wait-device-timeout 20
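
(A possible reason this is ignored: the nm-settings man page documents
connection.wait-device-timeout in milliseconds, so a value of 20 would
mean 20 ms; presumably a value like the following would be needed:

$ nmcli connection modify "System ib0" connection.wait-device-timeout 20000
)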

I'm hoping that other sites using Omni-Path have seen this and maybe can
share a fix or workaround?

Of course we could remove the Infiniband check in Node Health Check (NHC),
but that would not really be acceptable during operations.

Thanks for sharing any insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

Max Rutkowski

Oct 30, 2023, 9:31:30 AM
to slurm...@lists.schedmd.com

Hi,

we're not using Omni-Path but also had issues with Infiniband taking too long and slurmd failing to start due to that.

Our solution was to implement a little wait-for-interface systemd service which delays the network.target until the ib interface has come up.

Our discovery was that the network-online.target is triggered by the NetworkManager as soon as the first interface is connected.

I've put the solution we use on my GitHub: https://github.com/maxlxl/network.target_wait-for-interfaces

You may need to do small adjustments, but it's pretty straightforward in general.
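
For reference, the general shape of the setup is roughly this (a minimal
sketch, not the exact contents of the repo; the interface name ib0, the
timeout and the paths are assumptions):

# /etc/systemd/system/wait-for-interfaces.service
[Unit]
Description=Wait for ib0 before network-online.target
After=NetworkManager.service
Before=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/wait-for-interfaces.sh
RemainAfterExit=yes

[Install]
WantedBy=network-online.target

#!/bin/bash
# /usr/local/sbin/wait-for-interfaces.sh
# Poll NetworkManager until ib0 reports "100 (connected)", up to a timeout.
for (( i = 0; i < 60; i++ )); do
    state=$(nmcli -g GENERAL.STATE device show ib0 2>/dev/null)
    [[ $state == 100* ]] && exit 0
    echo "Waiting for interface ib0 to come online: $state"
    sleep 1
done
exit 0   # give up after the timeout so booting can continue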


Kind regards
Max

--
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.ru...@gfz-potsdam.de
___________________________________

Helmholtz-Zentrum Potsdam
Deutsches GeoForschungsZentrum GFZ
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam

Ole Holm Nielsen

Oct 30, 2023, 10:12:18 AM
to slurm...@lists.schedmd.com
Hi Max,

Thanks so much for your fast response with a solution! I didn't know that
NetworkManager (falsely) claims that the network is online as soon as the
first interface comes up :-(

Your solution of a wait-for-interfaces Systemd service makes a lot of
sense, and I'm going to try it out.

Best regards,
Ole

On 10/30/23 14:30, Max Rutkowski wrote:
> Hi,
>
> we're not using Omni-Path but also had issues with Infiniband taking too
> long and slurmd failing to start due to that.
>
> Our solution was to implement a little wait-for-interface systemd service
> which delays the network.target until the ib interface has come up.
>
> Our discovery was that the network-online.target is triggered by the
> NetworkManager as soon as the first interface is connected.
>
> I've put the solution we use on my GitHub:
> https://github.com/maxlxl/network.target_wait-for-interfaces
>
> You may need to do small adjustments, but it's pretty straightforward in
> general.
>
>
> Kind regards
> Max
>
> On 30.10.23 13:50, Ole Holm Nielsen wrote:
>> I'm fighting this strange scenario where slurmd is started before the
>> Infiniband/OPA network is fully up.  The Node Health Check (NHC)
>> executed by slurmd then fails the node (as it should).  This happens
>> only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes
>> with Infiniband/OPA network work without problems.
>>
>> Question: Does anyone know how to reliably delay the start of the slurmd
>> Systemd service until the Infiniband/OPA network is fully up?
>>
>> Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not
>> Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since
>> the CornelisNetworks drivers are not available for RHEL 8.8.
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H....@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

Jens Elkner

Oct 30, 2023, 10:52:43 AM
to Slurm User Community List
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:
Hi Max & friends,
...
> Thanks so much for your fast response with a solution! I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(

IIRC it is documented in the man page.

> Your solution of a wait-for-interfaces Systemd service makes a lot of sense,
> and I'm going to try it out.

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.

I.e. 'ExecStart=/lib/systemd/systemd-networkd-wait-online -i ib0:routable'
or something like that should handle it. E.g. on my laptop the complete
/etc/systemd/system/systemd-networkd-wait-online.service looks like
this:
---schnipp---
[Unit]
Description=Wait for Network to be Configured
Documentation=man:systemd-networkd-wait-online.service(8)
DefaultDependencies=no
Conflicts=shutdown.target
Requires=systemd-networkd.service
After=systemd-networkd.service
Before=network-online.target shutdown.target

[Service]
Type=oneshot
ExecStart=/lib/systemd/systemd-networkd-wait-online -i eth0:routable -i wlan0:routable --any
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
---schnapp---

Have fun,
jel.
--
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 52768

Ole Holm Nielsen

Oct 30, 2023, 2:56:53 PM
to slurm...@lists.schedmd.com
Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:
> Actually there is no need for such a script since
> /lib/systemd/systemd-networkd-wait-online should be able to handle it.

It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.

/Ole


Jeffrey R. Lang

Oct 30, 2023, 3:16:18 PM
to Ole.H....@fysik.dtu.dk, Slurm User Community List
The service is available in RHEL 8 via the EPEL package repository as systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 epel


-----Original Message-----
From: slurm-users <slurm-use...@lists.schedmd.com> On Behalf Of Ole Holm Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm...@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?


Ole Holm Nielsen

Oct 31, 2023, 6:00:38 AM
to Slurm User Community List
Hi Jeffrey,

On 10/30/23 20:15, Jeffrey R. Lang wrote:
> The service is available in RHEL 8 via the EPEL package repository as system-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 epel

Thanks for the info. We can install the systemd-networkd RPM from the
EPEL repo as you suggest.

I tried to understand the properties of systemd-networkd before
implementing it in our compute nodes. While there are lots of networkd
man-pages, it's harder to find an overview of the actual properties of
networkd. This is what I found:

* Networkd is a service included in recent versions of Systemd. It seems
to be an alternative to NetworkManager.

* Red Hat has stated that systemd-networkd is NOT going to be implemented
in RHEL 8 or 9.

* Comparing systemd-networkd and NetworkManager:
https://fedoracloud.readthedocs.io/en/latest/networkd.html

* Networkd is described in the Wikipedia article
https://en.wikipedia.org/wiki/Systemd

While networkd seems to be really nifty, I hesitate to replace
NetworkManager by networkd on our EL8 and EL9 systems because this is an
unsupported and only lightly tested setup, and it may require additional
work to keep our systems up-to-date in the future.

It seems to me that Max Rutkowski's solution in
https://github.com/maxlxl/network.target_wait-for-interfaces is less
intrusive than converting to systemd-networkd.

Best regards,
Ole

Jens Elkner

Oct 31, 2023, 12:50:41 PM
to Slurm User Community List
On Tue, Oct 31, 2023 at 10:59:56AM +0100, Ole Holm Nielsen wrote:
Hi Ole,

TL;DR: systemd-networkd stuff below, only.

> On 10/30/23 20:15, Jeffrey R. Lang wrote:
> > The service is available in RHEL 8 via the EPEL package repository as system-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 epel
>
> Thanks for the info. We can install the systemd-networkd RPM from the EPEL
> repo as you suggest.

Strange that it is not installed by default. We use Ubuntu only. The
first LTS which includes it is Xenial (16.04) - released in April 2016.
Anyway, we have never installed any NetworkManager stuff (too inflexible,
unreliable, buggy - last eval ~5 years ago and ditched forever); even
before 16.04, and on desktops as well, I ditched it (IMHO just overhead).

> I tried to understand the properties of systemd-networkd before implementing
> it in our compute nodes. While there are lots of networkd man-pages, it's
> harder to find an overview of the actual properties of networkd. This is
> what I found:

Basically you just need for each interface a *.netdev and a *.network
file in /etc/systemd/network/. Optionally symlink /etc/resolv.conf to
/run/systemd/resolve/resolv.conf. If you want to rename your
interface[s] (e.g. we use ${hostname}${ifidx}), and parameter
'net.ifnames=0' gets passed to the kernel, you can use a *.link file to
accomplish this. That's it. See example 1 below.

Some distros have obscure bloatware to manage them (e.g. Ubuntu installs
'netplan.io' by default, aka another layer of indirection), but we ditch
those packages immediately and manage them "manually" as needed.

> * Comparing systemd-networkd and NetworkManager:
> https://fedoracloud.readthedocs.io/en/latest/networkd.html

Pretty good - shows all you probably need. Actually, within containers we
have just /etc/systemd/network/40-${hostname}0.network, because the
lxc.net.* config already describes what *.link and *.netdev would do.
See example 2.

...
> While networkd seems to be really nifty, I hesitate to replace

Does/can do all we need w/o a lot of overhead.

> NetworkManager by networkd on our EL8 and EL9 systems because this is an
> unsupported and only lightly tested setup,

We have used it for ~5 years on all machines, ~7 years on most of our
machines: multihomed, containers, simple and complex setups (i.e. a lot
of NICs, VLANs), w/o any problems ...

> and it may require additional
> work to keep our systems up-to-date in the future.

I doubt that. The /etc/systemd/network/*.{link,netdev,network} interface
seems to be pretty stable. Haven't seen/noticed any stuff which got
removed so far.

> It seems to me that Max Rutkowski's solution in
> https://github.com/maxlxl/network.target_wait-for-interfaces is less
> intrusive than converting to systemd-networkd.

Depends on your setup/environment. But I guess sooner or later you need
to get in touch with it anyway. So here are some examples:

Example 1:
----------
# /etc/systemd/network/10-mb0.link
# we rename usually eth0, the 1st NIC on the motherboard to mb0 using
# its PCI Address to identify it
[Match]
Path=pci-0000:00:19.0

[Link]
Name=mb0
MACAddressPolicy=persistent


# /etc/systemd/network/25-phys-2-vlans+vnics.network
[Match]
Name=mb0

[Link]
ARP=false

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=true
MACVLAN=node1_0
#VLAN=vlan2
#VLAN=vlan3


# /etc/systemd/network/40-node1_0.netdev
[NetDev]
Name=node1_0
Kind=macvlan
# Optional: we use fix mac addr on vnics
MACAddress=00:01:02:03:04:00

[MACVLAN]
Mode=bridge


# /etc/systemd/network/40-node1_0.network
[Match]
Name=node1_0

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=no
Address=10.11.12.13/24
Gateway=10.11.12.200
# stuff which gets copied to /run/systemd/resolve/resolv.conf, when ready
Domains=my.do.main an.other.do.main
DNS=10.11.12.100 10.11.12.101


Example 2 (LXC):
----------------
# /zones/n00-00/config
...
lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.flags = up
lxc.net.0.link = mb0
lxc.net.0.name = n00-00_0
lxc.net.0.hwaddr = 00:01:02:03:04:01
...


# /zones/n00-00/rootfs/etc/systemd/network/40-n00-00_0.network
[Match]
Name=n00-00_0

[Network]
LLMNR=false
LLDP=no
LinkLocalAddressing=no
IPv6AcceptRouterAdvertisements=no
Address=10.12.11.0/16
Gateway=10.12.11.2
Domains=gpu.do.main


Have fun,
jel.

Paulo Jose Braga Estrela

Oct 31, 2023, 8:13:04 PM
to Slurm User Community List
I think that you should use NetworkManager-wait-online.service in RHEL 8. Take a look at its man page. It only allows the system to reach network-online after all network interfaces are online. So, if your OPA interfaces are managed by NetworkManager, you can use it.
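
A quick way to check what that service actually runs, and whether
NetworkManager considers startup complete, is presumably:

systemctl cat NetworkManager-wait-online.service
nm-online -s -q; echo $?   # 0 once NM reports "startup complete"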



Ole Holm Nielsen

Nov 1, 2023, 4:19:16 AM
to slurm...@lists.schedmd.com
Hi Paulo,

On 11/1/23 01:12, Paulo Jose Braga Estrela wrote:
> I think that you should use NetworkManager-wait-online.service in RHEL 8. Take a look at its man page. It only allows the system to reach network-online after all network interfaces are online. So, if your OPA interfaces are managed by NetworkManager, you can use it.

Unfortunately, NetworkManager-wait-online.service returns as soon as one
network interface is up. It doesn't wait for any other networks,
including the Infiniband/OPA network :-(

You can see that the NetworkManager-wait-online.service file executes:

ExecStart=/usr/bin/nm-online -s -q

and this is causing our problems with Infiniband/OPA networks. This is
the reason why we need Max's workaround wait-for-interfaces.service.

/Ole

Rémi Palancher

Nov 1, 2023, 4:45:22 AM
to slurm...@lists.schedmd.com
Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the
> Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
> by slurmd then fails the node (as it should). This happens only on EL8
> Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
> Infiniband/OPA network work without problems.
>
> Question: Does anyone know how to reliably delay the start of the slurmd
> Systemd service until the Infiniband/OPA network is fully up?
>
> …

FWIW, after a while struggling with systemd dependencies to wait for
availability of networks and shared filesystems, we ended up with a
customer writing a patch in Slurm to delay slurmd registration (and jobs
start) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged in Slurm and then reverted[1]
for reasons I did not fully explore.

This approach is far from your original idea; it is clearly not ideal and
should be used with caution, but it has worked for years for this customer.

[1]
https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/


Ole Holm Nielsen

Nov 1, 2023, 5:17:21 AM
to slurm...@lists.schedmd.com
Hi Rémi,

Thanks for the feedback! The patch revert[1] explains SchedMD's reason:

> The reasoning is that sysadmins who see nodes with Reason "Not Responding"
> but they can manually ping/access the node end up confused. That reason
> should only be set if the node is trully not responding, but not if the
> HealthCheckProgram execution failed or returned non-zero exit code. For
> that case, the program itself would take the appropiate actions, such
> as draining the node and setting an appropiate Reason.

We speculate that there may be an issue with slurmd starting up at boot
time and starting new jobs while NHC is running in a separate thread,
possibly failing the node AFTER a job has started! NHC might fail, for
example, if an Infiniband/OPA network or NVIDIA GPUs have not yet started
up completely.

I still need to verify whether this observation is correct and
reproducible. Does anyone have evidence that jobs start before NHC is
complete when slurmd starts up?

IMHO, slurmd ought to start up without delay at boot time, then execute
the NHC and wait for it to complete. Only after NHC has succeeded without
errors should slurmd begin accepting new jobs.

We should configure NHC to make site-specific hardware and network checks,
for example for Infiniband/OPA network or NVIDIA GPUs.
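
One crude way to approximate this with systemd alone would be an
ExecStartPre gate (a sketch; the nhc path is an assumption, and this
blocks slurmd's whole startup rather than only its registration):

# /etc/systemd/system/slurmd.service.d/nhc-gate.conf
[Service]
# Run NHC before the daemon; a non-zero exit keeps slurmd from starting.
ExecStartPre=/usr/sbin/nhc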

Best regards,
Ole

Paulo Jose Braga Estrela

Nov 1, 2023, 9:04:30 AM
to Slurm User Community List, Ole.H....@fysik.dtu.dk
Ole,

Look at the NetworkManager-wait-online.service man page below (from RHEL 8.8). Maybe your IB interfaces aren't properly configured in NetworkManager. The *** were added by me.

" NetworkManager-wait-online.service blocks until NetworkManager logs "startup complete" and announces startup
complete on D-Bus. How long that takes depends on the network and the NetworkManager configuration. If it
takes longer than expected, then the reasons need to be investigated in NetworkManager.

There are various reasons what affects NetworkManager reaching "startup complete" and how long
NetworkManager-wait-online.service blocks.

· In general, ***startup complete is not reached as long as NetworkManager is busy activating a device and as
long as there are profiles in activating state ***. During boot, NetworkManager starts autoactivating
suitable profiles that are ***configured to autoconnect***. If activation fails, NetworkManager might retry
right away (depending on connection.autoconnect-retries setting). While trying and retrying,
NetworkManager is busy until all profiles and devices either reached an activated or disconnected state
and no further events are expected.

***Basically, as long as there are devices and connections in activating state visible with nmcli device
and nmcli connection, startup is still pending. ***"



Ole Holm Nielsen

Nov 1, 2023, 11:50:53 AM
to Slurm User Community List
I would like to report how the Infiniband/OPA network device starts up
step by step as reported by Max's Systemd service from
https://github.com/maxlxl/network.target_wait-for-interfaces

This is the sequence of events during boot:

$ grep wait-for-interfaces.sh /var/log/messages
Nov 1 16:13:39 d064 wait-for-interfaces.sh[1610]: Wait for network devices
Nov 1 16:13:39 d064 wait-for-interfaces.sh[1610]: Available connections are:
Nov 1 16:13:40 d064 wait-for-interfaces.sh[1613]: NAME  UUID  TYPE  DEVICE
Nov 1 16:13:40 d064 wait-for-interfaces.sh[1613]: eno8403  1108d0aa-8841-4f2e-b42e-bd9509a2aba0  ethernet  --
Nov 1 16:13:40 d064 wait-for-interfaces.sh[1613]: System eno8303  44931a14-005a-415d-a82b-8c1a2007a118  ethernet  --
Nov 1 16:13:40 d064 wait-for-interfaces.sh[1613]: System ib0  2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89  infiniband  --
Nov 1 16:13:40 d064 wait-for-interfaces.sh[2011]: Error: Device 'ib0' not found.
Nov 1 16:13:41 d064 wait-for-interfaces.sh[2127]: Error: Device 'ib0' not found.
Nov 1 16:13:41 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online:
Nov 1 16:13:42 d064 wait-for-interfaces.sh[2134]: Error: Device 'ib0' not found.
Nov 1 16:13:42 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online:
Nov 1 16:13:43 d064 wait-for-interfaces.sh[2148]: Error: Device 'ib0' not found.
Nov 1 16:13:43 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online:
Nov 1 16:13:44 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:45 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:46 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:47 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:48 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:49 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:50 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:51 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:52 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 20 (unavailable)
Nov 1 16:13:53 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 80 (connecting (checking IP connectivity))
Nov 1 16:13:54 d064 wait-for-interfaces.sh[1610]: Waiting for interface ib0 to come online: 100 (connected)

As you can see there are many intermediate steps before the "100
(connected)" status reports that ib0 is up.

The slurmd service will only start after this, which is what we wanted.

Best regards,
Ole

On 11/1/23 14:03, Paulo Jose Braga Estrela wrote:
> Ole,
>
> Look at the NetworkManager-wait-online.service man page bellow (from RHEL 8.8). Maybe your IB interfaces aren't properly configured in NetworkManager. The *** were added by me.
>
> " NetworkManager-wait-online.service blocks until NetworkManager logs "startup complete" and announces startup
> complete on D-Bus. How long that takes depends on the network and the NetworkManager configuration. If it
> takes longer than expected, then the reasons need to be investigated in NetworkManager.
>
> There are various reasons what affects NetworkManager reaching "startup complete" and how long
> NetworkManager-wait-online.service blocks.
>
> · In general, ***startup complete is not reached as long as NetworkManager is busy activating a device and as
> long as there are profiles in activating state ***. During boot, NetworkManager starts autoactivating
> suitable profiles that are ***configured to autoconnect***. If activation fails, NetworkManager might retry
> right away (depending on connection.autoconnect-retries setting). While trying and retrying,
> NetworkManager is busy until all profiles and devices either reached an activated or disconnected state
> and no further events are expected.
>
> ***Basically, as long as there are devices and connections in activating state visible with nmcli device
> and nmcli connection, startup is still pending. ***"
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H....@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

Ward Poelmans

Nov 1, 2023, 3:09:57 PM
to slurm...@lists.schedmd.com
Hi,

We have a slightly different script to do the same. It only relies on /sys:

#!/bin/bash
# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE

if [[ ! -d /sys/class/infiniband ]]
then
    logger "No infiniband found"
    exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
    for port in ${ports}; do
        if grep -qc ACTIVE $port; then
            logger "Infiniband online at $port"
            exit 0
        fi
    done
    sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1


Ward

Ole Holm Nielsen

Nov 2, 2023, 4:29:15 AM
to slurm...@lists.schedmd.com
Hi Ward,

Thanks a lot for the feedback! The method of probing
/sys/class/infiniband/*/ports/*/state is also used in the NHC script
lbnl_hw.nhc and has the advantage of not depending on the nmcli command
from the NetworkManager package.

Can I ask you how you implement your script as a service in the Systemd
booting process, perhaps similar to Max's solution in
https://github.com/maxlxl/network.target_wait-for-interfaces ?

Thanks,
Ole

Ward Poelmans

Nov 5, 2023, 3:33:19 PM
to Ole Holm Nielsen, slurm...@lists.schedmd.com
Hi Ole,

Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

And we add it as a dependency for slurmd:

$ cat /etc/systemd/system/slurmd.service.d/wait.conf

[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity

[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service


So far this has worked flawlessly.


Ward

Ole Holm Nielsen

Nov 10, 2023, 9:05:22 AM
to slurm...@lists.schedmd.com
Hi Ward,

On 11/5/23 21:32, Ward Poelmans wrote:
> Yes, it's very similar. I've put our systemd unit file also online on
> https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

This looks really good! However, I was testing the waitforib.sh script on
a SuperMicro server WITHOUT Infiniband and only a dual-port Ethernet NIC
(Intel Corporation Ethernet Connection X722 for 10GBASE-T).

The EL8 drivers in kernel 4.18.0-477.27.2.el8_8.x86_64 seem to think that
the Ethernet ports are also Infiniband ports:

# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma0 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.0/infiniband/irdma0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma1 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.1/infiniband/irdma1

This might disturb the logic in waitforib.sh, or at least cause some
confusion?

One advantage of Max's script using NetworkManager is that nmcli isn't
fooled by the fake irdma Infiniband device:

# nmcli connection show
NAME UUID TYPE DEVICE
eno1 cb0937f8-1902-48f7-8139-37cf0c4077b2 ethernet eno1
eno2 98130354-9215-412e-ab26-032c76c2dbe4 ethernet --

I found a discussion of the mysterious irdma device in
https://github.com/prometheus/node_exporter/issues/2769
with this explanation:

>> The irdma module is Intel's replacement for the legacy i40iw module, which was the iWARP driver for the Intel X722. The irdma module is a complete rewrite, which landed in mainline kernel 5.14, and which also now supports the Intel E810 (iWARP & RoCE).

The Infiniband commands also work on the fake device, claiming that it
runs 100 Gbit/s:

# ibstatus
Infiniband device 'irdma0' port 1 status:
        default gid:     3cec:ef38:d960:0000:0000:0000:0000:0000
        base lid:        0x1
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (4X EDR)
        link_layer:      Ethernet

Infiniband device 'irdma1' port 1 status:
        default gid:     3cec:ef38:d961:0000:0000:0000:0000:0000
        base lid:        0x1
        sm lid:          0x0
        state:           1: DOWN
        phys state:      3: Disabled
        rate:            100 Gb/sec (4X EDR)
        link_layer:      Ethernet

IMHO, this seems quite confusing.

Regarding the slurmd service:

> And we add it as a dependency for slurmd:
>
> $ cat /etc/systemd/system/slurmd.service.d/wait.conf
>
> [Service]
> Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
> LimitMEMLOCK=infinity
>
> [Unit]
> After=waitforib.service
> Requires=munge.service
> Wants=waitforib.service

An alternative to this extra service would be a unit like Max's service file
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service
which has:

Before=network-online.target

What do you think of these considerations?

Best regards,
Ole

Ward Poelmans

Nov 10, 2023, 1:45:50 PM
to Ole Holm Nielsen, slurm...@lists.schedmd.com
Hi Ole,

On 10/11/2023 15:04, Ole Holm Nielsen wrote:
> On 11/5/23 21:32, Ward Poelmans wrote:
>> Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
>
> This might disturb the logic in waitforib.sh, or at least cause some confusion?

I had never heard of these cards. But if they behave like infiniband cards, is there also an .../ports/1/state file present in /sys with the state? In that case it should work just as well.

We could also change the glob '/sys/class/infiniband/*/ports/*/state' to only look at devices starting with mlx. I have no clue how much diversity is out there; we only have Mellanox cards (or rebrands of those). That change would presumably be just the glob, as sketched below.
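
ports=$(ls /sys/class/infiniband/mlx*/ports/*/state)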

> IMHO, this seems quite confusing.

Yes, I agree.

> Regarding the slurmd service:

> An alternative to this extra service would be like Max's service file https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:
> Before=network-online.target
>
> What do you think of these considerations?

I think Max's approach is the better one. We only do it for slurmd, while his is completely general for everything that waits on the network. The downside is probably that if you have issues with your IB network, this will make it worse ;)

Ward

Max Rutkowski

Nov 10, 2023, 2:28:14 PM
to slurm...@lists.schedmd.com
Hi Ward,
That's why we have a timeout in there, which allows the boot to complete
even without the network coming up, in case we need to log in and check
the server. The script only delays the boot until the timeout is reached.
And yes, we used a more general approach since our issue actually was the
network not coming up fast enough for our NFS mounts, which are also used
by slurmd at our site.

Ole Holm Nielsen

Nov 13, 2023, 8:28:35 AM
to slurm...@lists.schedmd.com
Hi Max and Ward,

I've made a variation of your scripts which waits for at least one
Infiniband port to come up before starting services such as slurmd or NFS
mounts.

I prefer Max's Systemd service, which comes before the Systemd
network-online.target. And I like Ward's script, which checks the
Infiniband status in /sys/class/infiniband/ instead of relying on
NetworkManager being installed.

At our site there are different types of compute nodes with different
types of NICs:

1. Mellanox Infiniband.
2. Cornelis Omni-Path behaving just like Infiniband.
3. Intel X722 Ethernet NICs presenting a "fake" iRDMA Infiniband.
4. Plain Ethernet only.

I've written some modified scripts which are available in
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/InfiniBand
and which have been tested on the 4 types of NICs listed above.
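
For reference, the resulting service is roughly of this shape (the actual
files are in the repository above; the script path here is illustrative):

# /etc/systemd/system/waitforib.service
[Unit]
Description=Wait for an active InfiniBand/OPA port
Before=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/waitforib.sh
RemainAfterExit=yes

[Install]
WantedBy=network-online.target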

Case 3 is particularly troublesome, as reported earlier, because it's an
Ethernet port which presents an iRDMA InfiniBand interface. My
waitforib.sh script skips NICs whose link_layer type is not equal to
InfiniBand.
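
The core of that check is roughly the following (the exact script is in
the repo above):

# Skip RDMA devices whose link layer is Ethernet (e.g. Intel irdma);
# only accept real InfiniBand/OPA ports:
for port in /sys/class/infiniband/*/ports/*; do
    [[ $(cat "$port/link_layer") == InfiniBand* ]] || continue
    grep -q ACTIVE "$port/state" && exit 0
done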

Comments and suggestions would be most welcome.

Best regards,
Ole