Warewulf4.4 Installing Mellanox OFED in container

huang jonney

Apr 22, 2023, 12:46:51 AM
to Warewulf
Hello,

I have a node with a ConnectX3 card, running Rocky 8.6. If I boot a clean image and install OFED after boot, it works.
But if I install the OFED drivers into the image and then boot, I can no longer SSH into the compute node.
And because it's a stateless boot, I can't log in through IPMI to identify the cause (or is there a way that I'm not aware of?).

Has anyone successfully installed OFED in a container? If so, can you share your node settings and overlays so I can identify which configuration I'm missing?

Cheers
Jonney

huang jonney

Apr 22, 2023, 12:49:21 AM
to Warewulf, huang jonney
I forgot to mention there are no error messages when booting.
So I'm guessing the cause is related to ifconfig or NetworkManager.
But as I mentioned in the previous email, I do not have the ability to log into the node to diagnose.

Cheers
Jonney

John Hanks

Apr 22, 2023, 6:38:14 AM
to ware...@lbl.gov, huang jonney
A couple of things that might be it:
  • Is the MOFED one that supports ConnectX3? I think the newer releases do not, so it has to be an older branch/version.
  • On my nodes, using MOFED changes the names of the Ethernet devices, appending 'npN', where N so far has been 1 or 2. If you are using the actual device names, then it may be trying to start the wrong device. That might not matter if you are using 'ethX' style names.
MOFED shouldn't impact IPMI in any way. I'm not sure why that would change, unless the node's IPMI config changed in WW and it had problems updating the node.
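
The device-name point above can be checked on a node that does boot (for example, one where MOFED was installed after boot). What follows is a minimal sketch, not a Warewulf tool: the helper name `missing_netdevs` is hypothetical, and it only compares a list of configured names against the interfaces the Linux kernel exposes under /sys/class/net.

```python
from pathlib import Path


def missing_netdevs(configured, sys_class_net="/sys/class/net"):
    """Return the configured device names that the kernel does not expose.

    `configured` is a list of names such as the netdev values set in Warewulf.
    `sys_class_net` is a parameter only so the check can be pointed at a
    copy of the directory; on a live node the default is the right place.
    """
    base = Path(sys_class_net)
    # Each entry under /sys/class/net is named after one network interface.
    present = {p.name for p in base.iterdir()} if base.is_dir() else set()
    return [name for name in configured if name not in present]
```

Calling `missing_netdevs(["eth0", "ib0"])` on the node returns whichever of those names has no matching interface, which would point to a renamed device of the kind described above.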

griznog

huang jonney

May 1, 2023, 5:44:10 PM
to John Hanks, ware...@lbl.gov
Hello,

The OFED is the one that matches the adapter. I've tried installing it after the node has booted, and it works that way.

What I mean about IPMI is that when I run "sol activate", it only shows the Warewulf provisioning messages, and I cannot log in to the system the way I can with a stateful installation.
Is this perhaps related to console redirection? I cannot find any documentation about these settings for WW4.0.

Cheers
Jonney

huang jonney

May 1, 2023, 6:37:41 PM
to Warewulf, huang jonney, ware...@lbl.gov, John Hanks
These are the settings I have for the node:

####
[root@master warewulf]# wwctl node ls cn02 -a
NODE  FIELD            PROFILE     VALUE
=====================================================================================
cn02  Id               --          cn02
cn02  comment          default     This profile is automatically included for each node
cn02  cluster          --          --
cn02  container        SUPERSEDED  hostbuild_rocky8.6_3_ofed
cn02  ipxe             --          (default)
cn02  runtime          default     generic,syncfile
cn02  wwinit           --          wwinit
cn02  root             --          (initramfs)
cn02  discoverable     --          --
cn02  init             --          (/sbin/init)
cn02  asset            --          --
cn02  kerneloverride   --          --
cn02  kernelargs       --          (quiet crashkernel=no vga=791 net.naming-scheme=v238)
cn02  ipmiaddr         --          --
cn02  ipminetmask      --          --
cn02  ipmiport         --          --
cn02  ipmigateway      --          --
cn02  ipmiuser         --          --
cn02  ipmipass         --          --
cn02  ipmiinterface    --          --
cn02  ipmiwrite        --          --
cn02  profile          --          default
cn02  IBNet:type       --          infiniband
cn02  IBNet:onboot     --          yes
cn02  IBNet:netdev     --          ib0
cn02  IBNet:hwaddr     --          --
cn02  IBNet:ipaddr     --          192.168.201.102
cn02  IBNet:ipaddr6    --          --
cn02  IBNet:netmask    --          255.255.255.0
cn02  IBNet:gateway    --          --
cn02  IBNet:mtu        --          --
cn02  IBNet:primary    --          false
cn02  default:type     --          (ethernet)
cn02  default:onboot   --          --
cn02  default:netdev   SUPERSEDED  eth0
cn02  default:hwaddr   --
cn02  default:ipaddr   --          192.168.200.102
cn02  default:ipaddr6  --          --
cn02  default:netmask  SUPERSEDED  255.255.255.0
cn02  default:gateway  default     192.168.200.1
cn02  default:mtu      --          --
cn02  default:primary  --          true
####

Am I missing some essential configs?
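
On the console redirection question: the kernelargs value in the listing has no console= entry, and on Linux a login prompt over serial-over-LAN generally needs the kernel console pointed at the serial port the BMC redirects. Whether that is the cause here is not confirmed in this thread. The sketch below only builds the argument string; the helper name `with_serial_console` is hypothetical, and the port (ttyS0) and baud rate (115200) are assumptions that depend on the board's BIOS/BMC settings.

```python
def with_serial_console(kernelargs, port="ttyS0", baud=115200):
    """Return kernelargs with console= entries added for a serial console.

    port and baud are assumptions; many boards redirect ttyS1 instead.
    """
    args = kernelargs.split()
    # Leave the arguments alone if this serial port is already a console.
    if any(a == f"console={port}" or a.startswith(f"console={port},") for a in args):
        return " ".join(args)
    # Keep the local display as a console as well.
    if "console=tty0" not in args:
        args.append("console=tty0")
    # The last console= entry is the one the kernel uses for /dev/console.
    args.append(f"console={port},{baud}")
    return " ".join(args)
```

The resulting string would then go into the node's or profile's kernelargs field, followed by a reboot of the node.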

huang jonney

May 18, 2023, 9:18:52 AM
to Warewulf, huang jonney, ware...@lbl.gov, John Hanks
Just in case anyone else is having similar issues:
There may be a compatibility issue with older kernels.
I was using the out-of-the-box kernel for Rocky 8.6 (375).
After upgrading to 425, I successfully compiled new RPMs with --add-kernel-support, then installed them and booted without issue.

(The node settings stayed the same; I didn't change any of them.)
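
For anyone repeating this, one thing worth checking before booting is whether the rebuilt driver modules exist for every kernel installed in the container. This is a rough sketch under stated assumptions: the helper name `kernels_without_module` is hypothetical, mlx4_core is the ConnectX-3 driver module, and the "extra" subdirectory is an assumption about where the out-of-tree MOFED modules land. The container root is passed in by the caller; on a default Warewulf 4 install it is typically /var/lib/warewulf/chroots/<name>/rootfs, but that path is also an assumption.

```python
from pathlib import Path


def kernels_without_module(rootfs, module="mlx4_core", subdir="extra"):
    """List kernel versions under rootfs/lib/modules with no out-of-tree module.

    subdir="extra" is an assumption about where the MOFED kmod packages
    install their modules; adjust it if your build puts them elsewhere.
    """
    missing = []
    modules_dir = Path(rootfs) / "lib" / "modules"
    if not modules_dir.is_dir():
        return missing
    for kdir in sorted(p for p in modules_dir.iterdir() if p.is_dir()):
        search_root = kdir / subdir
        # Modules may be plain (.ko) or compressed (.ko.xz), so match the prefix.
        found = search_root.is_dir() and any(search_root.rglob(f"{module}.ko*"))
        if not found:
            missing.append(kdir.name)
    return missing
```

Any kernel version the function returns has no rebuilt module, so a node booting that kernel would fall back to the in-box driver or none at all.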