Boot problem diagnosis


Jonathan Boswell

Oct 17, 2022, 7:49:27 AM
to ware...@lbl.gov
I need some help.  Normally, when I update my compute node image, I reboot hundreds of them all at once and they all come back online with no issue. 
 
But lately (the last few months?), rebooting them often results in them going offline instead.  When I watch what's happening on the console, they reboot and pull a correct IP address.  Then, most of the time, I see a "No more network devices" message and they stop trying.  After rebooting again 1-10 times, they boot up and run normally.
 
Is it the bootloader that's failing to come down much of the time?  How can I diagnose this problem?
 
 - JB


John Hanks

Oct 17, 2022, 9:14:54 AM
to ware...@lbl.gov
Is that "no more network devices" for your PXE or iPXE part of the boot? It sounds suspiciously like it's giving up after not getting a DHCP reply, implying your DHCP server is overwhelmed by a thundering herd but out of the box pretty much any DHCP server should be able to easily handle hundreds of nodes per second. I'd look in the dhcp server logs first, maybe do a test where they are rebooted with a 1 second delay between each node reboot and if all else fails, start looking at packet traces to see if the dhcp requests are actually arriving at the master node. 

There may be some clues in the messages leading up to this as well. E.g., did the bootstrap of iPXE from the master work but iPXE then fail, or did the initial PXE in the system/NIC BIOS fail because nothing ever reached the server?

griznog


Jason Stover

Oct 17, 2022, 11:44:40 AM
to ware...@lbl.gov
I have to assume it's a PXE message. The closest message in WW3 is "Network hardware was not recognized", printed when the initramfs cannot bring up the NIC. WW3 will reboot and try again if it cannot find the NIC (fun if you get into a reboot cycle because of a missing module).

In my case, I could generally bring up around 20-30 nodes before saturating the network. So I'd bring up 20, wait 10-15 seconds, bring up 20 more, and so on. A bunch of hosts all trying to transfer the VNFS didn't take long to eat up the 1 Gb network they provisioned over.
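
Something like this rough sketch, assuming passwordless ssh and node names of the form n0001..n0300 (both assumptions; adapt to your naming scheme):

  # Reboot nodes in batches of 20, pausing between batches
  i=0
  for node in $(seq -f 'n%04g' 1 300); do
      ssh "$node" reboot &
      i=$((i + 1))
      [ $((i % 20)) -eq 0 ] && sleep 15
  done
  wait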

-J


Jonathan

Oct 18, 2022, 8:04:15 AM
to Warewulf, jason....@gmail.com
We are only using regular PXE/tftp to my knowledge.

I have most definitely seen the "Network hardware was not recognized" error before, because I have 24 nodes that use the forcedeth driver.  But that's not my latest error, which affects all the remaining, formerly working nodes.  I can try rebooting them one by one and get the same issue, until each finally works.  The largest number of manual reboots I have seen before a node finally works is 10.  Occasionally it is 0 (the reboot command works on the first try).  My head node is very lightly loaded, so I am nonplussed.

As I say, the rest of my nodes pull the correct IP address immediately, as they always have, so I think dhcpd is working correctly.  It's the next step that results in the "No more network interfaces" error and a hang.  I am running Warewulf 3.8.1.  Where would I find the logs for PXE?

John Hanks

Oct 18, 2022, 9:19:06 AM
to ware...@lbl.gov, jason....@gmail.com
Unless you build iPXE with logging support, I don't think you can get logs from it. There is some debugging you can enable, but AFAIK it just goes to the console. For the NIC/system PXE stack, the console is the only place to get any info.

Backing up a bit: do you have a managed switch for your cluster interconnect, and do all the nodes with this problem connect to the same switch? If the switch is not configured to treat the node ports as edge ports, so that STP is disabled on them, it can take up to 30 seconds for the link to come up on a port. That can lead to random failure behavior like you are seeing. You might be able to work around it in a given card's BIOS setup by extending how long it waits for link, but the easiest and most reliable fix is to set the node ports on the switch to edge ports, or however the given switch disables STP on a port.
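
As a hypothetical illustration, on a Cisco-style switch that would look something like the lines below; the exact syntax and port range vary by vendor, so treat this as a sketch only:

  ! Mark compute-node ports as edge ports so they skip STP listening/learning
  interface range GigabitEthernet1/0/1-48
   spanning-tree portfast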

griznog

Jeffrey Layton

Oct 18, 2022, 10:17:27 AM
to ware...@lbl.gov, jason....@gmail.com
I would recommend plugging a crash cart into one of the nodes. Watch it get the IP address and download the image via tftp.
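
You can watch the same transfer from the head node side too; a quick sketch, assuming eth0 is the provisioning interface there:

  # TFTP read requests arrive on UDP port 69 (the data then moves to ephemeral ports)
  tcpdump -i eth0 -n udp port 69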

I'm not sure about the error message, but I had an issue with WW 4 on some old hardware where I had to use Legacy booting and the image was too large to be uncompressed on the compute node. Did you perhaps change the compute node image so that it grew? Check the BIOS of a compute node to see whether it is set to Legacy or UEFI.
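
If I remember right, wwsh can show you the VNFS size to compare against node RAM (the image gets unpacked into memory on the node); roughly:

  # On the master: list VNFS images and their sizes
  wwsh vnfs list

  # On a node that did boot: see how much of the tmpfs root is in use
  df -h /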

Jeff


Jonathan

Oct 21, 2022, 12:28:45 PM
to Warewulf, layt...@gmail.com, jason....@gmail.com
Yes; I plug in a monitor and keyboard (mouse not needed) to see what's going on with a compute node that will not boot.  They do not drop off the net; I reboot them and they fail to come back online.  Since I have no IPMI, I have to go down to my computer room to diagnose.  I believe they are all legacy BIOS.  Since I can boot my oldest 9-core nodes, I don't think this is due to a bigger VNFS.  I certainly do not see the VNFS coming down when watching a failing node boot on the console.

Jeffrey Layton

Oct 21, 2022, 12:32:26 PM
to Jonathan, Warewulf, jason....@gmail.com
I'm a bit confused about what is happening. You reboot a node, the node boots up, and you can ping it and even ssh into it, yet after a period of time you can no longer ping it or ssh into it?

Apologies, I'm lost on what is happening.

Jeff

Jonathan

Oct 21, 2022, 12:48:18 PM
to Warewulf, layt...@gmail.com, Warewulf, jason....@gmail.com, Jonathan
It would help if I said what I meant!  I meant to type "8-core nodes", which have only 16 GB of RAM.  Yet they almost always boot up just fine despite a slightly larger VNFS.

If I successfully boot up a node, it stays online and runs fine.  If I have to reboot it, such as for changes I make to my VNFS, the 48- and 64-core nodes very often fail to reboot at all on the first try.  I have had to reboot some as many as 10 times before the boot finally completes; after that, they too stay online and run fine.

Jason Stover

Oct 21, 2022, 1:20:37 PM
to Jonathan, Warewulf, layt...@gmail.com
> Yet they almost always boot up just fine despite a slightly larger VNFS. 
> the 48 and 64-core nodes very often fail to reboot at all the first time

... What are the 48/64-core nodes, hardware-wise? I read this as the hardware not being initialized by the time the system tries PXE booting. That's just me throwing something at the wall to see if it sticks, but it _seems_ like a hardware issue where the NICs on the 48/64-core machines aren't actually up by the time the system tries to boot from them.

-J

Jonathan

Oct 21, 2022, 1:31:47 PM
to Warewulf, jason....@gmail.com, Warewulf, layt...@gmail.com, Jonathan
How can they not be up if
  1. they pull the correct IP address, and
  2. they get the first (and only?) tftp download?

Jason Stover

Oct 21, 2022, 7:37:23 PM
to Jonathan, Warewulf, layt...@gmail.com
Okay.... so they *are* pulling the tftp download... 

Do you have remote logging enabled on your provisioner's syslogd?  If it does accept remote logs, the nodes will log to the provisioner once they have their NIC up.
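
A minimal sketch of enabling that on the provisioner, assuming it runs rsyslog (syslog-ng or old sysklogd use different syntax):

  # /etc/rsyslog.conf -- accept remote syslog over UDP port 514
  module(load="imudp")
  input(type="imudp" port="514")

  # then restart the daemon
  systemctl restart rsyslog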

Set DEBUG on the nodes to a value of 3 or more (I... honestly don't remember if we ever exposed this in wwsh, or if it's a `wwsh object modify` thing). This will start a debug shell after the initramfs tries bringing up the NIC and before it gets into the core of the provisioning. syslog isn't started yet at that point, but if it did detect the NIC, there should be a /tmp/wwdev file.

Look under /sys/class/net/ for the network devices that are being detected. You should be able to check dmesg, etc... and do normal poking around. You'll be in a busybox shell. 
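
For example, from that busybox shell (nothing here is Warewulf-specific beyond /tmp/wwdev):

  # Did the kernel register any NICs?
  ls /sys/class/net/

  # Driver / firmware messages around NIC bring-up
  dmesg | grep -i -e eth -e firmware

  # Present only if Warewulf detected a usable NIC
  cat /tmp/wwdev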

-J

Jonathan

Oct 24, 2022, 11:22:33 AM
to Warewulf, jason....@gmail.com, Warewulf, layt...@gmail.com, Jonathan
Using "wwsh help" I don't see that DEBUG is exposed anywhere!  I tried "wwsh object print -p debug" and got UNDEF for all nodes.

Jason Stover

Oct 24, 2022, 12:32:34 PM
to ware...@lbl.gov
Then I *think* the syntax is (I don't have any way to actually check right now):

  wwsh object modify [node set] -s DEBUG=4

The above should set the DEBUG variable on the nodes in [node set] to 4. 

  wwsh object modify [node set] -D DEBUG

And that should remove the variable from the object.
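
To confirm either change took, the print you already tried should show it (same caveat: I can't check the exact syntax right now):

  wwsh object print [node set] -p DEBUG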

The object subcommand does *not* trigger the usual events, because you're modifying the serialized Perl data directly. It lets you modify any object (the base for nodes, files, etc.). Useful for some things, but it can be dangerous. ;)

Just remember to remove the value when you're done, as it will cause the initramfs to enter a debug shell...

-J

