I have a 100-node cluster. When I recently tried to add a new node with slightly different hardware, it failed to boot with the following error messages:
Loading drivers: uci-hcd ohci-hcd ehci-hcd whci-hcd isp1362-hcd ci-hc4 s1811-hcd sd_mod
Detecting hardware: mlx4_core ahci
Bringing up local loopback network:
ERROR: Network hardware was not recognized!
In a debug shell, I see only the loopback interface:
ls -l /sys/class/net/
lo -> ../../devices/virtual/net/lo
cat /sys/devices/virtual/net/lo/address
00:00:00:00:00:00
This new Gigabyte Z690 UD DDR4 V2 motherboard has a Realtek 2.5GbE NIC instead of the 1GbE NICs on my other nodes. (I use the onboard NIC only for the PXE network; all nodes have separate high-throughput InfiniBand cards.) I tried installing a different network card, but the motherboard can't PXE boot from a discrete NIC unless CSM Support is enabled, and I have had trouble enabling it. I think enabling it requires a discrete graphics card, and even then I'm not sure this path would work. It also seems unnecessary for something I should be able to fix in software.
I tried to figure out which driver we need by booting from an Ubuntu live USB and running ethtool:
ethtool -i eth0
driver: r8169
version: 5.15.0-43-generic
firmware-version: rtl8125b-2_0.0.2 07/13/20
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
I believe the r8169 driver is already installed, because on a different node using the same image this file is present:
/lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/net/ethernet/realtek/r8169.ko.xz
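One thing I have not verified is whether the r8169 build in the 3.10.0 image actually lists this NIC's PCI ID, since the Ubuntu live system that reported r8169 runs a much newer kernel (5.15). Below is a minimal sketch of that check. It assumes the NIC still sits at bus address 0000:03:00.0 (taken from the ethtool output above) and that modinfo is available on the machine where it runs. The helper function names are hypothetical, just for illustration.

```shell
#!/bin/sh
# Build the modalias prefix for a PCI vendor/device pair.
# Arguments are hex strings as found in sysfs, e.g. 0x10ec 0x8125.
pci_alias_prefix() {
    vendor=$(printf '%s' "${1#0x}" | tr 'a-f' 'A-F')
    device=$(printf '%s' "${2#0x}" | tr 'a-f' 'A-F')
    printf 'pci:v0000%sd0000%s' "$vendor" "$device"
}

# Succeed if the alias list read from stdin contains the given prefix.
alias_list_supports() {
    grep -qF "$1"
}

# Assumption: the NIC is at the bus address reported by ethtool.
dev=/sys/bus/pci/devices/0000:03:00.0

# Only run the check where both the device and modinfo are present.
if [ -r "$dev/vendor" ] && command -v modinfo >/dev/null 2>&1; then
    prefix=$(pci_alias_prefix "$(cat "$dev/vendor")" "$(cat "$dev/device")")
    # modinfo -F alias prints the device IDs the module claims to support.
    if modinfo -F alias r8169 | alias_list_supports "$prefix"; then
        echo "r8169 lists $prefix"
    else
        echo "r8169 does not list $prefix"
    fi
fi
```

If the new node's debug shell has no modinfo, the vendor and device values could be read there from sysfs and then passed to pci_alias_prefix on an existing node that runs the same image. If the alias is missing, loading the module with modprobe would not be expected to bind the NIC.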
I have also edited /etc/warewulf/bootstrap.conf and added a line like this:
modprobe += r8169
Then I rebuilt the bootstrap with wwsh bootstrap rebuild <mybootstrapname>
Still, the result is the same: the node fails to boot with the same error message.
Any advice on how to troubleshoot this further, or on what else I could try?