[Rocks-Discuss] New Compute Nodes Don't Get DHCP Address

Mike Hanby

Oct 1, 2009, 1:37:35 PM
to Discussion of Rocks Clusters
Howdy,

I'm adding some nodes to an existing Rocks 5.1 cluster (see my previous post about creating a new appliance verari-compute).

I start insert-ethers and select my new appliance type, then boot the node specifying the network as the boot device.

The NIC tries to obtain an IP address for about a minute and ultimately reports that it was not successful.

On the head node the insert-ethers screen does not show that any new devices were detected.

I've tried this with several of the nodes, and all of them fail the DHCP request. I tried restarting dhcpd on the head node with the same result, and I don't see anything in /var/log/messages or /var/log/daemon that sheds any light (i.e. no requests show up).

If I reboot an existing compute node and watch the /var/log/daemon file, I do see the DHCP request:
Oct 1 12:34:05 rockshn dhcpd: DHCPDISCOVER from 00:1e:c9:ce:56:3f via eth0
Oct 1 12:34:05 rockshn dhcpd: DHCPOFFER on 172.20.20.238 to 00:1e:c9:ce:56:3f via eth0
Oct 1 12:34:09 rockshn dhcpd: DHCPREQUEST for 172.20.20.238 (172.20.20.1) from 00:1e:c9:ce:56:3f via eth0
Oct 1 12:34:09 rockshn dhcpd: DHCPACK on 172.20.20.238 to 00:1e:c9:ce:56:3f via eth0
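
[Editor's note] When several nodes are booting at once, it can help to pull only the dhcpd lines for one MAC address out of the log. The following is a minimal sketch, not part of the original post: the function name dhcp_events_for_mac is made up for illustration, and it assumes the log lives at /var/log/daemon and uses the "dhcpd: DHCP..." line format shown above.

```shell
# Hypothetical helper: print the dhcpd DISCOVER/OFFER/REQUEST/ACK lines
# that mention one MAC address.
# Usage: dhcp_events_for_mac MAC [LOGFILE]
# LOGFILE defaults to /var/log/daemon (the file named in this thread).
dhcp_events_for_mac() {
    mac="$1"
    log="${2:-/var/log/daemon}"
    # First grep keeps only dhcpd protocol lines; second grep matches the MAC
    # case-insensitively (-i) as a fixed string (-F).
    grep 'dhcpd: DHCP' "$log" | grep -i -F -- "$mac"
}
```

For example, `dhcp_events_for_mac 00:1e:c9:ce:56:3f` should print the four lines above for the existing node. Empty output for a new node's MAC would match the "no requests" symptom.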

The new nodes are connected to the same switch as the existing nodes.

Any suggestions? I'd plug the new node directly into the head node's eth0; however, there are jobs running on the other compute nodes and I don't want to kill the NFS share.

Thanks, Mike

Vlad Manea

Oct 1, 2009, 1:54:04 PM
to Discussion of Rocks Clusters
Check network cables and/or reboot switch.

V


--
*Dr. Vlad Constantin Manea*
*Professor of Geophysics*
Computational Geodynamics Lab. <http://www.geociencias.unam.mx/geodinamica>
Centro de Geociencias,
Campus UNAM, Juriquilla,
Blvd Juriquilla 3001,
Juriquilla, Querétaro, 76230,
México.
phone: +52 55 5623 4104/ext.133
fax: (55) 5623-4129

Greg Bruno

Oct 1, 2009, 1:55:23 PM
to Discussion of Rocks Clusters


the above DHCP messages indicate that there is an entry in
/etc/dhcpd.conf for the mac address 00:1e:c9:ce:56:3f.

what is the output of:

# rocks list host
# rocks list host interface

- gb

Scott L. Hamilton

Oct 1, 2009, 2:11:56 PM
to Discussion of Rocks Clusters
Try using a network port on the switch that has an existing node plugged
in, just in case it is a switch issue. If that does not work, check
the MAC address of the node and grep /etc/dhcpd.conf for it to make
sure the node is not already in the database and set for OS boot.
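
[Editor's note] The grep check described above could be wrapped as follows. This is only a sketch under stated assumptions: mac_in_dhcpd_conf is a hypothetical name, and the default path is the /etc/dhcpd.conf mentioned in this thread. The match is case-insensitive because the MAC printed on a node or in its BIOS may use different letter case than the config file.

```shell
# Hypothetical helper: succeed (exit 0) if a MAC address already appears
# in a dhcpd configuration file, fail (non-zero) otherwise.
# Usage: mac_in_dhcpd_conf MAC [CONF]
# CONF defaults to /etc/dhcpd.conf.
mac_in_dhcpd_conf() {
    mac="$1"
    conf="${2:-/etc/dhcpd.conf}"
    # -i: ignore case, -F: fixed string (no regex), -q: quiet, exit status only
    grep -i -F -q -- "$mac" "$conf"
}
```

Usage would look like `mac_in_dhcpd_conf 00:1e:c9:ce:56:3f && echo "already in dhcpd.conf"`.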

You can also try running tcpdump on the head node to capture the dhcp
request traffic to see if the node is making a request that is being
rejected by the server.

The site below gives very detailed information about troubleshooting
DHCP on Debian Linux; the commands for this process are the same on CentOS.

http://debianclusters.cs.uni.edu/index.php/Troubleshooting_DHCP

Scott

Mike Hanby

Oct 1, 2009, 3:38:13 PM
to Discussion of Rocks Clusters
I may not have been clear regarding that output from the /var/log/daemon log file. That's what shows up in the log when an existing node, compute-1-1 in this case, reboots and requests its address on startup.

When I attempt to PXE boot a node that is not yet managed by Rocks, the DHCP request doesn't look like it ever makes it to the head node. I've tried multiple network cables and verified that the link is up on both ends.

Scott, I'll try again and take a look at tcpdump to see if anything is coming across.

This is strange behavior indeed. I'll post back once I get some output from tcpdump.

Thanks, Mike

Mike Hanby

Oct 1, 2009, 4:43:52 PM
to Discussion of Rocks Clusters
This is baffling. I've tried different ports on the switch, and none of the compute-nodes-to-be generate any DHCP traffic that reaches the head node.

Here's the output for an existing compute node to prove that DHCP is working on the head node and that the compute nodes can get to it via the switch:

# tcpdump -i eth0 -n port 67 or port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
14:45:52.385768 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:1e:c9:ce:56:3f, length: 548
14:45:52.385973 IP 172.20.20.1.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length: 300
14:45:56.422745 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:1e:c9:ce:56:3f, length: 548
14:45:56.422920 IP 172.20.20.1.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length: 300

I left that running and made several attempts to PXE boot the new compute nodes, and not one peep on the tcpdump output.
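
[Editor's note] On a busy cluster, the capture above can be narrowed to one new node by adding its MAC address to the filter. A minimal sketch follows; the function names are hypothetical, eth0 is assumed to be the private interface as in the command above, and tcpdump normally needs root. The `ether host` primitive and the `-e` flag (print the link-level header) are standard tcpdump/pcap features.

```shell
# Hypothetical helper: build a pcap filter that matches DHCP/BOOTP traffic
# (UDP ports 67 and 68) to or from a single Ethernet address.
dhcp_filter_for_mac() {
    printf 'ether host %s and (port 67 or port 68)' "$1"
}

# Hypothetical wrapper: run tcpdump with that filter.
# Usage: watch_dhcp_from_mac MAC [INTERFACE]   (INTERFACE defaults to eth0)
watch_dhcp_from_mac() {
    # -n: no name resolution, -e: show source/destination MAC on each line
    tcpdump -i "${2:-eth0}" -n -e "$(dhcp_filter_for_mac "$1")"
}
```

If nothing appears while the new node is PXE booting, the request is probably being lost before it reaches the head node (NIC, cable, or switch) rather than being rejected by dhcpd.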

I guess I'll have to wait until the jobs complete to try a direct connection to remove the switch from the equation.

Scott L. Hamilton

Oct 2, 2009, 10:45:17 AM
to Discussion of Rocks Clusters
Mike,

Try booting a Linux live CD on one of the nodes and see if the live
operating system requests and gets a DHCP address. I have had some Dell
compute nodes that failed to PXE boot because of a firmware issue with
VLANs. I know you are probably not using VLANs on your cluster
switches, but firmware on the new nodes could still be the issue.

If you are able to get an IP address on the node using a live CD, then
you have narrowed it down to the network card firmware failing to
PXE boot correctly.

Scott
