[Rocks-Discuss] admin question regarding node status down


János Löbb

Jan 6, 2011, 3:23:08 PM
to Discussion of Rocks Clusters
Hi,

When I run a "rocks run host compute" command, some nodes report "down". For example:

[root@rocks ~]# rocks run host compute "ls -al /tmp"
compute-0-0: down
compute-0-1: total 20
compute-0-1: drwxrwxrwt 5 root root 4096 Jan 6 14:42 .
compute-0-1: drwxr-xr-x 26 root root 4096 Jan 3 01:05 ..
compute-0-1: drwxrwxrwt 2 root root 4096 Jan 3 01:05 .ICE-unix
compute-0-1: -rw-r--r-- 1 root root 0 Jan 3 01:05 post-99-done.debug
compute-0-1: -rw-r--r-- 1 root root 0 Jan 3 01:05 pre-09-prep-kernel-source.debug
compute-0-1: -rw-r--r-- 1 root root 0 Jan 3 01:05 pre-10-src-install.debug
compute-0-1: drwxr-xr-x 2 root root 4096 Nov 28 22:44 RCS
compute-0-1: drwx------ 2 root root 4096 Jan 6 14:42 ssh-duNhaD8571
compute-0-2: total 20
compute-0-2: drwxrwxrwt 5 root root 4096 Jan 6 14:42 .
compute-0-2: drwxr-xr-x 26 root root 4096 Jan 3 01:05 ..
compute-0-2: drwxrwxrwt 2 root root 4096 Jan 3 01:05 .ICE-unix
compute-0-2: -rw-r--r-- 1 root root 0 Jan 3 16:06 post-99-done.debug
compute-0-2: -rw-r--r-- 1 root root 0 Jan 3 01:05 pre-09-prep-kernel-source.debug
compute-0-2: -rw-r--r-- 1 root root 0 Jan 3 01:05 pre-10-src-install.debug
compute-0-2: drwxr-xr-x 2 root root 4096 Nov 28 22:39 RCS
compute-0-2: drwx------ 2 root root 4096 Jan 6 14:42 ssh-gaqCOf8564

Of course compute-0-0 is powered on, but somehow its networking is not up, so the "down" status is correct. I verified this with ping from the frontend and from the affected node.

[root@rocks ~]# ping compute-0-0
PING compute-0-0.local (192.168.131.254) 56(84) bytes of data.

--- compute-0-0.local ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 8999ms


On compute-0-0 itself, ping reports "unknown host".


What is the usual way to troubleshoot this kind of situation?

When I do ifconfig on compute-0-0, I see that the parameters for inet are missing, like inet addr, Bcast, Mask.

Also, arp -a on a good node reports rocks.local (ip address) at hardware address [ether] on eth0. On the down node it reports nothing.
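
The ifconfig check described above can be scripted so that it is easy to repeat on each node's console. This is only a rough sketch: inet_addr_of is a made-up helper name, and it assumes the older net-tools output format in which the address appears as "inet addr:".

```shell
# Reads `ifconfig eth0` style output on stdin and prints the IPv4 address,
# or "none" when the interface has no "inet addr:" line.
# Assumption: older net-tools format ("inet addr:A.B.C.D  Bcast:...  Mask:...").
inet_addr_of() {
    awk '
        /inet addr:/ {
            sub(/^.*inet addr:/, "")   # drop everything up to the address
            print $1                   # first remaining field is the address
            found = 1
            exit
        }
        END { if (!found) print "none" }
    '
}
```

On the console of a node this would be used as "ifconfig eth0 | inet_addr_of"; a healthy node prints its private address, a node in the state described here prints "none".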

What command should kick the down nodes into an up state?

I tried on the frontend:

rocks sync dns
rocks sync config
rocks sync host network compute-0-0

but no cigar.

Thanks ahead,

János

Mason J. Katz

Jan 6, 2011, 4:14:02 PM
to Discussion of Rocks Clusters
If the node is off the network like this, you'll need to log into the
console.

I would try restarting the network (service network restart) and see if this
fixes things, after that a reboot. If this still fails re-install the node.

The only thing I can think of that would have triggered this is having a
duplicate IP address on the same network. You'll have to catch it happening
again to diagnose it.

mason j. katz
+1.619.800.0655


János Löbb

Jan 6, 2011, 5:42:53 PM
to Discussion of Rocks Clusters
Hi Mason,

Yes, it is the duplicate IP address issue on the private network. It is still version 5.1 :-)

Of course there is nothing else on the private network just the front end with ip: 192.168.131.128 and the three nodes with 192.168.131.[254,253,252]

When compute-0-0 is booting it reports that it cannot bring network up because some other device is already using the IP address 192.168.131.254.

Bringing up interface eth0: Error, some other host already uses address 192.168.131.254. [FAILED]

Now, this is a VMware Fusion based private network, and all the nodes use the "Host Only" network setting. So I am a little puzzled.

On the frontend:
[root@rocks ~]# rocks list host interface
HOST         SUBNET  IFACE MAC               IP              NETMASK       GATEWAY       MODULE NAME               VLANID
rocks:       private eth0  00:0c:29:c9:ff:3c 192.168.131.128 255.255.255.0 ------------- e1000  rocks              ------
rocks:       public  eth1  ----------------- 172.16.24.130   255.255.255.0 172.16.24.255 ------ rocks.yalepath.org ------
compute-0-0: private eth0  00:50:56:33:d1:38 192.168.131.254 255.255.255.0 ------------- e1000  compute-0-0        ------
compute-0-1: private eth0  00:50:56:21:cc:7c 192.168.131.253 255.255.255.0 ------------- e1000  compute-0-1        ------
compute-0-2: private eth0  00:50:56:37:fa:46 192.168.131.252 255.255.255.0 ------------- e1000  compute-0-2        ------
[root@rocks ~]#

I have already rebooted the node twice, but it did not help. Running service network restart on compute-0-0 results in the same message as the one I listed above for boot.

If I do an nslookup on the frontend or on the other compute nodes I see this:
[root@rocks ~]# nslookup 192.168.131.254
Server: 127.0.0.1
Address: 127.0.0.1#53

254.131.168.192.in-addr.arpa name = compute-0-0.local.

and that looks OK to me, so the DNS on the private network looks fine.

Next step is to reinstall the node. How do I do that with 5.1 in such a way that I do not reinstall the other two ? I tried

/boot/kickstart/cluster-kickstart-pxe

on the node. It removed everything, but now the node does not PXE boot and reports that there is no OS. Next I will delete it as a VM, use
rocks remove host compute-0-0
on the front end to remove it from the database, then recreate it in the VM and do a first time install on it. I wish I could find a How To for resolving a duplicate IP issue on the private network.

Thanks,

János

János Löbb

Jan 6, 2011, 6:50:19 PM
to Discussion of Rocks Clusters
Well,

I recreated the VMware device, removed it from the database with

rocks remove host compute-0-0

started up insert-ethers as:

insert-ethers --rack=0 --rank=0

Booted the node, insert-ethers found it, and I managed not to fall into the trap of being asked for keyboard, language, etc. The installation went nicely. I logged in, but there is still no network on the node, and service network restart reports that some other host is already using the IP address 192.168.131.254.

Here is now the output of:


[root@rocks ~]# rocks list host interface
HOST         SUBNET  IFACE MAC               IP              NETMASK       GATEWAY       MODULE NAME               VLANID
rocks:       private eth0  00:0c:29:c9:ff:3c 192.168.131.128 255.255.255.0 ------------- e1000  rocks              ------
rocks:       public  eth1  ----------------- 172.16.24.130   255.255.255.0 172.16.24.255 ------ rocks.yalepath.org ------
compute-0-1: private eth0  00:50:56:21:cc:7c 192.168.131.253 255.255.255.0 ------------- e1000  compute-0-1        ------
compute-0-0: private eth0  00:50:56:32:71:9e 192.168.131.254 255.255.255.0 ------------- e1000  compute-0-0        ------
[root@rocks ~]#

I do not see any problem here. The hardware (MAC) addresses match. (In the meantime I also got rid of compute-0-2.)

So, how can I avoid this phantom duplicate IP address?

Thanks ahead,

János

János Löbb

Jan 7, 2011, 12:03:53 PM
to Discussion of Rocks Clusters
Hi,

What file should be modified on the frontend so that, when nodes are installed, the first node gets the IP address x.y.z.253 instead of x.y.z.254? Is this determined when the frontend is set up, or is it independent of that and coded in an auxiliary file? I do not remember being able to specify it when I set up my little test cluster.

Thanks ahead,

János

Gladu, Charles

Jan 7, 2011, 7:36:10 PM
to Discussion of Rocks Clusters
You said:

>Of course there is nothing else on the private network just the front
> end with ip: 192.168.131.128 and the three nodes with
> 192.168.131.[254,253,252]

and

> Now, this is a Vmware Fusion based private network and for networking all the nodes have the "Host Only" network setting.

Well, if you set the network up as "host-only" in VMware then there is more than just your frontend and your three nodes on the network. By definition "host-only" means that your HOST is also on that network. It also means that by default your host is acting as a DHCP server for that network, which can conflict with Rocks.

Either change from "host-only" to a "custom" network in VMware (I'm not sure where you do this in Fusion, but I know it's supported in all other VMware products), or verify your host's address on that network and reserve it before you install compute nodes: use the rocks command line to define a dummy device with that IP address (make sure you pick a non-managed device type). Then disable the "host-only" DHCP server. On Windows you'd stop a particular service; on Linux, and I suspect on a Mac as well, you'd kill a daemon. See the VMware docs or their support forums.

Mason J. Katz

Jan 10, 2011, 2:13:36 PM
to Discussion of Rocks Clusters
Rocks assumes it controls the entire private network IP space, and starts
allocating addresses at the top of the range.
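
To make "the top of the range" concrete, here is a small sketch that computes the highest usable host address of a network, which is the broadcast address minus one. highest_host is a hypothetical helper written for illustration, not a Rocks command; it only shows the arithmetic that explains why the first node in this thread received 192.168.131.254.

```shell
# Prints the highest usable host address of an IPv4 network,
# i.e. the broadcast address minus one.
# usage: highest_host NETWORK NETMASK   (hypothetical helper, not part of Rocks)
highest_host() {
    old_ifs=$IFS
    IFS=.
    # Split both dotted quads: $1-$4 are network octets, $5-$8 netmask octets.
    set -- $1 $2
    IFS=$old_ifs
    # Broadcast octet = network octet OR inverted netmask octet.
    b1=$(( $1 | (255 - $5) ))
    b2=$(( $2 | (255 - $6) ))
    b3=$(( $3 | (255 - $7) ))
    b4=$(( $4 | (255 - $8) ))
    echo "$b1.$b2.$b3.$(( b4 - 1 ))"
}

highest_host 192.168.131.0 255.255.255.0   # prints 192.168.131.254
```

If anything else on the private network already holds that top address, the first compute node is the one that collides with it.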

If you wish to change the settings of any of the installed nodes you can do
the following:

# rocks dump host interface > config.sh

Edit config.sh

# sh config.sh


mason j. katz
+1.619.800.0655


János Löbb

Jan 11, 2011, 3:32:44 PM
to Discussion of Rocks Clusters
Hi Mason,

Just to make sure I did not make any stupid mistake: after I assigned, let's say, the private network gateway to 192.168.131.254, I reinstalled the frontend named rocks and I also reinstalled two nodes, compute-0-0 and compute-0-1. Compute-0-0 STILL does not have networking and complains that another host is using that IP.

It is all on my Intel Mac, with VMware Fusion virtual devices as machines. I selected the 192.168.131.0 network because VMware assigns this by default to a device when the networking is set to "Host only".

Well, now I will follow your advice, wash my hands, and do the "brain surgery" :-)

I am still puzzled about which device is holding that IP address on this private network. When I ping 192.168.131.254 from the frontend, it reports 100% packet loss, which to me means that NO device is listening on the local subnet with that IP. If that is the case, compute-0-0 should NOT complain. But it does :(
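
One possible explanation for the mismatch: as far as I know, the Red Hat style ifup scripts of that era do not use ping for this test. They send an ARP duplicate-address probe with arping, and a device can answer ARP while ignoring ICMP echo. The sketch below repeats that probe by hand. The function names and the ARPING variable are made up for illustration, and the arping options are the ones I believe the network scripts use, so treat them as an assumption.

```shell
# ARPING can be overridden, e.g. to point at a full path to arping.
ARPING=${ARPING:-arping}

# Duplicate address detection (-D): arping exits 0 when nobody answers
# for the address, non-zero when some host already claims it.
# usage: address_is_free IFACE IP
address_is_free() {
    "$ARPING" -q -c 2 -w 3 -D -I "$1" "$2"
}

# usage: report_address IFACE IP
report_address() {
    if address_is_free "$1" "$2"; then
        echo "$2 looks free on $1"
    else
        echo "$2 is already claimed by another host on $1"
    fi
}
```

Run as root on a machine attached to the private network, for example "report_address eth0 192.168.131.254". Running arping without -q and -D should also print the MAC address of whatever answers, which helps identify the device.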

Thanks,

János

János Löbb

Jan 13, 2011, 2:21:41 PM
to Discussion of Rocks Clusters
Just for the record, in case others run into the same issue...

I dumped the interface on the frontend, put the result into a config.sh file, decreased the last octet of each node's IP address by one, reloaded this config file into the database, and set the nodes to reinstall. After the reinstall all of my nodes have good networking: ifconfig shows IP addresses, and from any node I can ping the frontend and vice versa.
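
For a larger cluster the edit could be scripted rather than done by hand. The following is a rough sketch only: shift_compute_ips is a hypothetical helper, and the layout of the dumped lines (one command per line, host names containing "compute-", the address as a separate word or as ip=A.B.C.D) is an assumption, so inspect the real config.sh before trusting it.

```shell
# Lowers the last octet by one for every address under PREFIX, but only on
# lines that mention a compute node; all other lines pass through unchanged.
# Note: rewritten lines have runs of whitespace collapsed to single spaces.
# usage: shift_compute_ips PREFIX < config.sh > config.new.sh
shift_compute_ips() {
    awk -v prefix="$1" '
        /compute-/ {
            for (i = 1; i <= NF; i++) {
                field = $i
                key = ""
                # Accept both a bare address and the ip=ADDRESS form.
                if (index(field, "ip=") == 1) {
                    key = "ip="
                    field = substr(field, 4)
                }
                if (index(field, prefix) == 1) {
                    last = substr(field, length(prefix) + 1)
                    if (last ~ /^[0-9]+$/ && last + 0 > 1)
                        $i = key prefix (last - 1)
                }
            }
        }
        { print }
    '
}
```

With the addresses from this thread the call would be "shift_compute_ips 192.168.131. < config.sh > config.new.sh", followed by a manual review of config.new.sh before loading it with sh.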

Thanks Mason !!

János

János Löbb

Feb 1, 2012, 12:11:19 PM
to Discussion of Rocks Clusters
Mason,

Today I tried to do the same, but when I issued

./config.sh

as root, Rocks told me :
bash: ./config.sh: Permission denied

Is there any other way to load back this config.sh content into the rocks repository ?

Thanks ahead,

János

Bart Brashers

Feb 1, 2012, 12:18:43 PM
to Discussion of Rocks Clusters
Mason suggested you do

# sh config.sh

and yet you did

# ./config.sh

If you look at the permissions of the file:

# ls -lF config.sh

You'll see it's not set with execute permissions. Either do exactly as Mason suggested, or

# chmod u+x config.sh
# ./config.sh

Bart


János Löbb

Feb 1, 2012, 12:48:54 PM
to Discussion of Rocks Clusters
Bart,

Looks like I need a good walk :-) It is lunchtime anyway. In the meantime I used

rocks set host interface ip compute-0-3 eth0 192.168.131.253

and it looks like it worked. Now I just have to reinstall compute-0-3 so that it gets the right IP and does not try to grab the DHCP server's IP on the subnet.

Thanks a lot, as always.

János
