[Rocks-Discuss] Torque problems fresh 5.4 install

jean-francois prieur

Jan 21, 2011, 5:45:46 PM
to Discussion of Rocks Clusters, Guillaume Lamoureux
Hello,
I have been tasked with rebuilding our cluster on Rocks 5.4 with Torque 5.4.
After re-installing several times with and without a restore roll,
installing the torque roll during or after installation, and every
permutation of these, I still have a non-working Torque installation.
AFAIK, Torque has never worked properly on this cluster; SGE works
flawlessly, but we can't use it here... Starting to wonder if there is
something about this cluster that the Torque roll does not like...

So I re-installed from scratch, ran insert-ethers and inserted
all the nodes, copied our old users and groups back, and synced the
config. Up to this point, the cluster works as it is supposed to. The only
change to a vanilla config is that I have enabled web access to our
front end, following the directions in the documentation
(http://www.rocksclusters.org/roll-documentation/base/5.4/enable-www.html).


I then downloaded and installed the torque roll as instructed on this
page (http://www.rocksclusters.org/wordpress/?p=125), rebooted, synced
the config, and kickstarted the nodes. As soon as the nodes come back
up, syslog fills up with these messages (excerpt):


Jan 21 16:53:33 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-0-7 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 16:53:54 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-0-5 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 16:54:05 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-0-4 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 16:54:19 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-0-1 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 16:54:22 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-0-18 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 16:54:34 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-0-16 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 16:54:47 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-1-1 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 16:56:35 compute-0-7.local pbs_mom: LOG_ERROR::mom_server_add,
host golem.local not found
Jan 21 16:56:55 compute-0-5.local pbs_mom: LOG_ERROR::mom_server_add,
host golem.local not found
Jan 21 16:57:04 compute-0-4.local pbs_mom: LOG_ERROR::mom_server_add,
host golem.local not found
Jan 21 16:57:23 compute-0-18.local pbs_mom: LOG_ERROR::mom_server_add,
host golem.local not found
Jan 21 16:57:27 compute-0-1.local pbs_mom: LOG_ERROR::mom_server_add,
host golem.local not found
Jan 21 16:57:39 compute-0-16.local pbs_mom: LOG_ERROR::mom_server_add,
host golem.local not found
Jan 21 16:58:29 golem PBS_Server: LOG_ERROR::stream_eof, connection to
compute-0-6 is bad, remote service may be down, message may be
corrupt, or connection may have been dropped remotely (End of File).
setting node state to down
Jan 21 17:01:30 compute-0-6.local pbs_mom: LOG_ERROR::mom_server_add,
host golem.local not found
Jan 21 17:32:47 golem PBS_Server: LOG_ERROR::Access from host not
allowed, or unknown host (15008) in send_job, child failed in previous
commit request for job 0.golem.concordia.ca


As a test, I submitted a "sleep 30" job as a non-root user; that is job
0 above. It stays queued indefinitely, probably because the nodes are
not registered with the server.

[jfprieur@golem ~]$ qstat -q
server: golem.concordia.ca
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
default            --      --       --      --    0   1 --   E R
                                               ----- -----
                                                   0     1
[jfprieur@golem ~]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.golem                   STDIN            jfprieur               0 Q default

pbsnodes shows the nodes to be free:
compute-1-5
     state = free
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux compute-1-5.local
2.6.18-194.17.4.el5 #1 SMP Mon Oct 25 15:50:53 EDT 2010
x86_64,sessions=? 0,nsessions=?
0,nusers=0,idletime=3007,totmem=13310848kb,availmem=13238376kb,physmem=12290732kb,ncpus=8,loadave=0.00,gres=,netload=809170,state=free,jobs=,varattr=,rectime=1295649668

I saw in another thread that the routing has to go over the public
interface from the nodes; I believe my setup is correct:

[root@compute-0-0 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
255.255.255.255 0.0.0.0         255.255.255.255 UH    0      0        0 eth1
132.205.24.239  10.1.1.1        255.255.255.255 UGH   0      0        0 eth1
224.0.0.0       0.0.0.0         255.255.255.0   U     0      0        0 eth1
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 eth1
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth1
0.0.0.0         10.1.1.1        0.0.0.0         UG    0      0        0 eth1

I am pretty much at an impasse; I can't see what the problem is, and I am
getting tired of re-installing this cluster in slightly different
ways. It may take less time to convince people to use SGE. I just
can't see what I am doing wrong, and repeatedly at that. These errors
are constant whatever I do...
Thanks for any help,
JF

jean-francois prieur

Jan 21, 2011, 6:51:29 PM
to Discussion of Rocks Clusters
Looking at the Maui logs, that part of the system looks to be working OK:

01/21 18:40:34 MPBSLoadQueueInfo(base,compute-1-3,SC)
01/21 18:40:34 INFO: queue 'default' started state set to True
01/21 18:40:34 INFO: class to node not mapping enabled for queue
'default' adding class to all nodes
01/21 18:40:34 __MPBSGetNodeState(Name,State,PNode)
01/21 18:40:34 INFO: PBS node compute-1-4 set to state Idle (free)
01/21 18:40:34 MPBSNodeUpdate(compute-1-4,compute-1-4,Idle,base)
01/21 18:40:34 MPBSLoadQueueInfo(base,compute-1-4,SC)
01/21 18:40:34 INFO: queue 'default' started state set to True
01/21 18:40:34 INFO: class to node not mapping enabled for queue
'default' adding class to all nodes
01/21 18:40:34 __MPBSGetNodeState(Name,State,PNode)
01/21 18:40:34 INFO: PBS node compute-1-5 set to state Idle (free)
01/21 18:40:34 MPBSNodeUpdate(compute-1-5,compute-1-5,Idle,base)
01/21 18:40:34 MPBSLoadQueueInfo(base,compute-1-5,SC)
01/21 18:40:34 INFO: queue 'default' started state set to True
01/21 18:40:34 INFO: class to node not mapping enabled for queue
'default' adding class to all nodes
01/21 18:40:34 INFO: 26 PBS resources detected on RM base
01/21 18:40:34 INFO: resources detected: 26
01/21 18:40:34 MRMWorkloadQuery()
01/21 18:40:34 MPBSWorkloadQuery(base,JCount,SC)
01/21 18:40:34 MPBSJobUpdate(0,0.golem.concordia.ca,TaskList,0)
01/21 18:40:34 INFO: 1 PBS jobs detected on RM base
01/21 18:40:34 INFO: jobs detected: 1
01/21 18:40:34 MStatClearUsage(node,Active)
01/21 18:40:34 MClusterUpdateNodeState()
01/21 18:40:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
01/21 18:40:34 INFO: job '0' Priority: 8
01/21 18:40:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
01/21 18:40:34 MStatClearUsage([NONE],Active)
01/21 18:40:34 MResDestroy(NULL)
01/21 18:40:34 INFO: total jobs selected (ALL): 0/1 [EState: 1]


Will try to troubleshoot some more tomorrow.

I realise the tone of my last email may have been a little off; I
sincerely appreciate the time and effort all the devs put into Rocks,
and it has worked very well in our lab for many years. It has just been a
long week, and I have stared at the screen for too long...

Regards,
JF

Don Winsor

Jan 21, 2011, 10:36:08 PM
to Discussion of Rocks Clusters, Guillaume Lamoureux
We had what I think was this same issue after we did a fresh install
of Rocks 5.4 and Torque. One of our local experts (an expert with
Linux clusters, but without previous Rocks experience) was able to get
it working by editing the /opt/torque/mom_priv/config file on each
compute node to add a $pbsclient line followed by the fully qualified
host name of the head node. In other words, here is the file before, as it was
"out of the box" with Rocks and Torque:

$pbsserver myname.local
$usecp myname.mydept.umich.edu:/home /home

and here is our /opt/torque/mom_priv/config after the edit:

$pbsserver myname.local
$pbsclient myname.mydept.umich.edu
$usecp myname.mydept.umich.edu:/home /home

After making this change, we ran:
/etc/init.d/pbs restart
on each compute node, and everything started working again.
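The same edit can be scripted when there are many nodes. Below is a minimal sketch of the idea, run against a simulated copy of the stock config file (the FQDN and file contents are the placeholder values from the example above, not real ones); on a live node the target would be /opt/torque/mom_priv/config, followed by the pbs restart.

```shell
#!/bin/sh
# Sketch: add a "$pbsclient <frontend FQDN>" line right after the
# $pbsserver line of a Torque mom config, if one is not already present.
# Here we operate on a temp copy of the stock file for illustration.
fqdn=myname.mydept.umich.edu   # placeholder frontend FQDN from the example

cfg=$(mktemp)
cat > "$cfg" <<'EOF'
$pbsserver myname.local
$usecp myname.mydept.umich.edu:/home /home
EOF

if ! grep -q '^\$pbsclient' "$cfg"; then
  awk -v fqdn="$fqdn" '
    { print }
    /^\$pbsserver/ { print "$pbsclient " fqdn }  # insert after $pbsserver
  ' "$cfg" > "$cfg.new" && mv "$cfg.new" "$cfg"
fi

result=$(cat "$cfg")
rm -f "$cfg"
printf '%s\n' "$result"
```

On a running cluster you would push this to each compute node (e.g. with cluster-fork) and then restart pbs as Don describes; treat it as a sketch rather than a drop-in tool.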

I'm not at all sure this is the best fix (or why it wasn't working out
of the box in the first place), but in any case it was good enough for us.

Don Winsor
University of Michigan, Electrical Engineering and Computer Science

Roy Dragseth

Jan 22, 2011, 7:17:08 AM
to Discussion of Rocks Clusters
On Saturday, January 22, 2011 04:36:08 Don Winsor wrote:
> > Jan 21 16:57:39 compute-0-16.local pbs_mom: LOG_ERROR::mom_server_add,
> > host golem.local not found

You have a DNS problem: the compute nodes cannot find your frontend's IP
address associated with golem.local.

Both forward and reverse DNS lookups must be correct on both the frontend and
the compute nodes for torque to work properly. From my dev cluster:

[root@hpc1 ~]# host compute-0-0
compute-0-0.local has address 10.1.255.254
[root@hpc1 ~]# host 10.1.255.254
254.255.1.10.in-addr.arpa domain name pointer compute-0-0.local.

[root@compute-0-0 ~]# host hpc1.local
hpc1.local has address 10.1.1.1
[root@compute-0-0 ~]# host hpc1
hpc1.local has address 10.1.1.1
[root@compute-0-0 ~]# host 10.1.1.1
1.1.1.10.in-addr.arpa domain name pointer hpc1.local.
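The lookups above generalize to a small round-trip check (name → IP → name). The sketch below uses stub lookup functions loaded with sample data so the logic can be exercised anywhere; on a real frontend you would replace the stubs with calls to `host`, and the function names (`fwd_lookup`, `rev_lookup`, `check_roundtrip`) are my own, not part of Rocks or torque.

```shell
#!/bin/sh
# Sketch: round-trip DNS check (name -> IP -> name) for cluster hosts.
# fwd_lookup/rev_lookup are stubs with sample data; on a real frontend
# they would wrap `host`, e.g.  host "$1" | awk '/has address/ {print $4}'

fwd_lookup() {  # name -> IP
  case "$1" in
    compute-0-0|compute-0-0.local) echo 10.1.255.254 ;;
    hpc1|hpc1.local)               echo 10.1.1.1 ;;
    *)                             echo "" ;;
  esac
}

rev_lookup() {  # IP -> canonical name (trailing dot stripped)
  case "$1" in
    10.1.255.254) echo compute-0-0.local ;;
    10.1.1.1)     echo hpc1.local ;;
    *)            echo "" ;;
  esac
}

check_roundtrip() {
  name=$1
  ip=$(fwd_lookup "$name")
  back=$(rev_lookup "$ip")
  # The reverse name should be the short name plus the .local suffix.
  if [ -n "$ip" ] && [ "$back" = "${name%.local}.local" ]; then
    echo "OK $name -> $ip -> $back"
  else
    echo "BAD $name (ip='$ip' reverse='$back')"
  fi
}

check_roundtrip compute-0-0
check_roundtrip hpc1.local
```

A BAD line for any host is exactly the condition that produces the mom_server_add "host not found" errors quoted above.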

You must have something non-standard in your network config. I'm pretty sure
the torque roll works with standard configs, or else I would have been bogged
down by bug reports by now...

r.


--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dr...@uit.no

jean-francois prieur

Jan 22, 2011, 10:32:24 AM
to Discussion of Rocks Clusters
Thanks, yes I am pretty sure the problem is on my end as well ;)

The weird thing is that if I ping golem.local from the nodes, it responds
with the proper address; I didn't even bother mentioning it, as that was
the first thing I checked:

[root@compute-0-0 ~]# ping golem.local
PING golem.local (10.1.1.1) 56(84) bytes of data.
64 bytes from golem.local (10.1.1.1): icmp_seq=1 ttl=64 time=0.110 ms
64 bytes from golem.local (10.1.1.1): icmp_seq=2 ttl=64 time=0.132 ms
64 bytes from golem.local (10.1.1.1): icmp_seq=3 ttl=64 time=0.108 ms


[root@golem jfprieur]# host compute-0-0
compute-0-0.local has address 10.1.255.253
[root@golem jfprieur]# host 10.1.255.253
253.255.1.10.in-addr.arpa domain name pointer compute-0-0.local.

[root@compute-0-0 ~]# host golem.local
golem.local has address 10.1.1.1
[root@compute-0-0 ~]# host golem
golem.local has address 10.1.1.1


[root@compute-0-0 ~]# host 10.1.1.1
1.1.1.10.in-addr.arpa domain name pointer golem.local.

Appreciate the help; I will try the solution posted before yours as well.

JF

Roy Dragseth

Jan 22, 2011, 6:37:12 PM
to Discussion of Rocks Clusters
On Saturday, January 22, 2011 16:32:24 jean-francois prieur wrote:
> [root@golem jfprieur]# host compute-0-0
> compute-0-0.local has address 10.1.255.253

Just out of curiosity, who's got the address 10.1.255.254? In a fresh
out-of-the-box install, compute-0-0 usually has this address.

jean-francois prieur

Jan 23, 2011, 3:46:19 PM
to Discussion of Rocks Clusters
It is our managed switch; the Rocks documentation says to insert it
first, before the nodes. Is that a problem?

[root@golem jfprieur]# host 10.1.255.254
254.255.1.10.in-addr.arpa domain name pointer network-0-0.local.

JF

jean-francois prieur

Jan 24, 2011, 11:41:00 AM
to Discussion of Rocks Clusters
Thank you so much, this did the trick.

Are there any drawbacks to editing this manually? Was I wrong to
detect our switch first, so that it is the one that "received" the .254
address?

Once again thanks for helping me out, much appreciated.
JF

Don Winsor

Jan 24, 2011, 2:05:24 PM
to Discussion of Rocks Clusters
Very strange. My DNS lookups appear to be working identically to
yours, Roy; here is a direct cut-and-paste from running the same
commands as you:

[root@mnc2 ~]# host compute-0-0
compute-0-0.local has address 10.1.255.254
[root@mnc2 ~]# host 10.1.255.254
254.255.1.10.in-addr.arpa domain name pointer compute-0-0.local.

[root@compute-0-0 ~]# host mnc2.local
mnc2.local has address 10.1.1.1
[root@compute-0-0 ~]# host mnc2
mnc2.local has address 10.1.1.1

[root@compute-0-0 ~]# host 10.1.1.1
1.1.1.10.in-addr.arpa domain name pointer mnc2.local.

We have just about as plain a vanilla Rocks 5.4 install with
torque as there could be, I think, with the possible exception that we
included all the OS disks. For network configuration, I believe I
very carefully and precisely followed the Rocks Users Guide, "2.2.
Install and Configure Your Frontend".

Here's our list of rolls:

[root@mnc2 ~]# rocks list roll
NAME          VERSION ARCH   ENABLED
base:         5.4     x86_64 yes
ganglia:      5.4     x86_64 yes
hpc:          5.4     x86_64 yes
kernel:       5.4     x86_64 yes
os:           5.4     x86_64 yes
os:           5.4     x86_64 yes
os:           5.4     x86_64 yes
os:           5.4     x86_64 yes
os:           5.4     x86_64 yes
os:           5.4     x86_64 yes
os:           5.4     x86_64 yes
web-server:   5.4     x86_64 yes
torque:       5.4.0   x86_64 yes
service-pack: 5.4.2   x86_64 yes

So I really have no idea why we seem to need a "$pbsclient
fully-qualified.host.name" in our /opt/torque/mom_priv/config file, but
(most) others seem to be getting by just fine without it.

Don

Roy Dragseth

Jan 24, 2011, 4:35:41 PM
to Discussion of Rocks Clusters
On Monday, January 24, 2011 17:41:00 jean-francois prieur wrote:
> Thank you so much, this did the trick.
>
> Are there any drawbacks to editing this manually?

No, as long as you can launch parallel jobs and get back the stdout and stderr
files, everything is OK as far as torque goes.

> Was I wrong to
> detect our switch first and it is the one that "received" the .254
> address?

No, that shouldn't matter at all; this setup should be fine. One thing though:
the network switch, what are its rack and rank numbers in the database? Are they
0 and 0? I have seen problems with DNS when you have two hosts in the database
with the same rack and rank number, so if compute-0-0 also has rack=0 and
rank=0, it could be the cause of the problems.

Every host needs to have a unique combination of rack and rank, or else it
might goof up the DNS database. I believe this is still true. Greg? Mason?
I was able to break my test cluster the same way as yours by adding a
network-0-0 with rack=0 and rank=0.

If

rocks dump | grep rank= | grep rack=

lists hosts with the same numbering, you might have found the root of your
problems. This shouldn't happen if you add hosts with insert-ethers.
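That eyeball check can be automated. The awk sketch below flags any two hosts sharing a rack/rank pair; here it runs over a few hard-coded sample dump lines (hypothetical data), while on a frontend you would pipe `rocks dump` into the same awk program.

```shell
#!/bin/sh
# Sketch: flag hosts that share a rack/rank pair in `rocks dump` output.
# Real use would be:  rocks dump | awk '...'  -- here the dump is simulated.
dump='/opt/rocks/bin/rocks add host network-0-0 cpus=1 rack=0 rank=0 membership=Ethernet\ Switch
/opt/rocks/bin/rocks add host compute-0-0 cpus=8 rack=0 rank=0 membership=Compute
/opt/rocks/bin/rocks add host compute-0-1 cpus=8 rack=0 rank=1 membership=Compute'

result=$(printf '%s\n' "$dump" | awk '
  $2 == "add" && $3 == "host" {
    rack = rank = "?"
    for (i = 5; i <= NF; i++) {        # scan the key=value attributes
      split($i, kv, "=")
      if (kv[1] == "rack") rack = kv[2]
      if (kv[1] == "rank") rank = kv[2]
    }
    key = rack "/" rank
    if (key in seen)
      print "rack/rank collision at " key ": " seen[key] " vs " $4
    else
      seen[key] = $4
  }')
printf '%s\n' "$result"
```

If the pipeline prints nothing, no two hosts share a rack/rank pair.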


>
> Once again thanks for helping me out, much appreciated.

I'm quite surprised that this solution works; the setup with

$pbsserver frontend.local

is there because I could not get it to work with the FQDN address. Hmm,
really strange, but if it works for you, keep it.


r.

jean-francois prieur

Jan 24, 2011, 5:02:46 PM
to Discussion of Rocks Clusters
Hi Roy,

As you can see, there are both network-0-0 and compute-0-0 at rack=0
rank=0; this was done by insert-ethers. I did not change anything until
I inserted cabinet 1 later in the install:
[root@golem QueueTest]# rocks dump | grep rank= | grep rack=
/opt/rocks/bin/rocks add host network-0-0 cpus=1 rack=0 rank=0
membership=Ethernet\ Switch
/opt/rocks/bin/rocks add host compute-0-0 cpus=8 rack=0 rank=0
membership=Compute
/opt/rocks/bin/rocks add host compute-0-1 cpus=8 rack=0 rank=1
membership=Compute
/opt/rocks/bin/rocks add host compute-0-2 cpus=8 rack=0 rank=2
membership=Compute
...
/opt/rocks/bin/rocks add host compute-1-0 cpus=8 rack=1 rank=0
membership=Compute
/opt/rocks/bin/rocks add host compute-1-1 cpus=8 rack=1 rank=1
membership=Compute
/opt/rocks/bin/rocks add host compute-1-2 cpus=8 rack=1 rank=2
membership=Compute
/opt/rocks/bin/rocks add host compute-1-3 cpus=8 rack=1 rank=3
membership=Compute
/opt/rocks/bin/rocks add host compute-1-4 cpus=8 rack=1 rank=4
membership=Compute
/opt/rocks/bin/rocks add host compute-1-5 cpus=8 rack=1 rank=5
membership=Compute

JF

Roy Dragseth

Jan 24, 2011, 7:12:10 PM
to Discussion of Rocks Clusters
I think we might be onto something here. Could you try the following?

(First, take a backup of your cluster config:

rocks dump > /tmp/cluster_config.backup
)

rocks dump | grep network-0-0 > /tmp/network-0-0.rocksdump
rocks remove host network-0-0

Edit /tmp/network-0-0.rocksdump and change rank=0 to something you are never
going to use, for instance rank=10. (Of course, the true geek uses sed:
sed -i 's/rank=0/rank=10/' /tmp/network-0-0.rocksdump
)

sh -x /tmp/network-0-0.rocksdump

rocks sync dns
rocks sync hosts

This will remove network-0-0 and reinsert it with a new rack/rank combination
that does not collide with compute-0-0. You can do the same thing by editing
the database directly, but you should be pretty sure about your SQL
skills before you attempt this.

r.

--

jean-francois prieur

Jan 25, 2011, 1:42:51 PM
to Discussion of Rocks Clusters
Hello,

I did everything you listed; I get errors when I try to reinsert the switch:

[root@golem jfprieur]# sh -x /tmp/network-0-0.rocksdump
+ /opt/rocks/bin/rocks add host network-0-0 cpus=1 rack=0 rank=10
'membership=Ethernet Switch'
+ /opt/rocks/bin/rocks set host runaction network-0-0 action=os
+ /opt/rocks/bin/rocks set host installaction network-0-0 action=install
+ /opt/rocks/bin/rocks set host boot network-0-0 action=install
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 270, in ?
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1984, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/set/host/boot/__init__.py", line 147, in run
    self.runPlugins(host)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1735, in runPlugins
    plugin.run(args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/set/host/boot/plugin_physical_host.py", line 359, in run
    self.writePxebootCfg(host, nodeid)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/set/host/boot/plugin_physical_host.py", line 235, in writePxebootCfg
    ip, = self.db.fetchone()
TypeError: unpack non-sequence
+ /opt/rocks/bin/rocks add host interface network-0-0 00:1e:c1:8f:e4:c0
+ /opt/rocks/bin/rocks add host interface network-0-0 00:1e:c1:8f:e4:c0
+ /opt/rocks/bin/rocks set host interface ip network-0-0
00:1e:c1:8f:e4:c0 10.1.255.254
+ /opt/rocks/bin/rocks set host interface name network-0-0
00:1e:c1:8f:e4:c0 network-0-0
+ /opt/rocks/bin/rocks set host interface mac network-0-0
00:1e:c1:8f:e4:c0 00:1e:c1:8f:e4:c0
+ /opt/rocks/bin/rocks set host interface subnet network-0-0
00:1e:c1:8f:e4:c0 private
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 270, in ?
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1984, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/set/host/interface/subnet/__init__.py", line 174, in run
    name, = self.db.fetchone()
TypeError: unpack non-sequence

But when I do rocks list host, it seems to have worked:

[root@golem jfprieur]# rocks list host
HOST         MEMBERSHIP      CPUS RACK RANK RUNACTION INSTALLACTION
golem:       Frontend        8    0    0    os        install
network-0-0: Ethernet Switch 1    0    10   os        install
compute-0-0: Compute         8    0    0    os        install

What do you want me to test next?

JF

Roy Dragseth

Jan 25, 2011, 5:54:09 PM
to Discussion of Rocks Clusters
On Tuesday, January 25, 2011 19:42:51 jean-francois prieur wrote:
>
> What do you want me to test next?

Next do

rocks sync dns
rocks sync hosts

I think that should be all that is needed. You should recheck that the host
lookups still work correctly on the compute nodes and the frontend. Also check
that network-0-0 has got the correct IP in /etc/hosts on the frontend.

I do not think you need to reinstall the compute nodes to make the changes
take effect; it's all DNS related.

r.

jean-francois prieur

Jan 25, 2011, 8:08:57 PM
to Discussion of Rocks Clusters
I did the two rocks sync commands; there is no entry in /etc/hosts for
network-0-0, and compute-0-0 is still at .253.

Roy Dragseth

Jan 26, 2011, 3:41:46 AM
to Discussion of Rocks Clusters

But does torque work as expected? Can you run jobs on the cluster?

r.

jean-francois prieur

Jan 26, 2011, 10:34:17 AM
to Discussion of Rocks Clusters
Sorry for the confusion; here is the state:

a) Our cluster works with the manual modifications to the nodes' config
files suggested by Don ($pbsclient golem.concordia.ca).

b) After performing your instructions, I have not yet rebuilt a node to
see if I still need to do the manual modification. The cluster has been
at 99% usage since coming back up; if I tell them I have more
stuff to test...

I will report back after shooting a node, and will keep this in mind when
rebuilding the cluster. I guess if you start the compute nodes in
cabinet 1 (e.g., my first inserted node would be compute-1-0 instead of
compute-0-0), this would avoid the conflict with network-0-0 in the DNS
records.

Although I am surprised that more people do not run into this, since I
am following the Rocks documentation verbatim when using insert-ethers,
and it states to start by capturing your switch's MAC address, which
then gets assigned network-0-0.

Sorry if I am being obtuse or missing some simple thing in this
whole discussion... help is very much appreciated.

JF

Roy Dragseth

Jan 26, 2011, 4:24:22 PM
to Discussion of Rocks Clusters
OK, good. Keep us posted when you have the chance to test this.

I agree that this is a problem that many people might see, but I think it only
manifests itself in torque because it depends on the host lookups being
absolutely correct. SGE is somewhat more robust in that respect, as I believe
its trust is based on certificates. Right?

Maybe the Rocks deities could descend upon us and implement a warning or
something that would help us avoid this problem?

r.

Sudarshan Wadkar

Feb 2, 2011, 1:37:43 AM
to Discussion of Rocks Clusters
On Tue, Jan 25, 2011 at 3:05 AM, Roy Dragseth <roy.dr...@uit.no> wrote:
> Every host needs to have an unique combination of rack and rank or else it
> might goof up the DNS database. I believe this still is true. Greg? Mason?
> I was able to break my test cluster the same way as yours by adding a
> network-0-0 with rack=0 and rank=0.
>
> If,
>
> rocks dump | grep rank= | grep rack=
>
> list hosts with the same numbering you might have the root of your problems.
> This shouldn't happen if you add hosts with insert-ethers.
>
May I disagree with you on this one, Roy?
[root@hydra new]# rocks dump | grep rank= | grep rack=

/opt/rocks/bin/rocks add host network-0-0 cpus=1 rack=0 rank=0
membership=Ethernet\ Switch
/opt/rocks/bin/rocks add host compute-0-0 cpus=8 rack=0 rank=0
membership=Compute

and the system has been working flawlessly for the past few months now.
The hosts were added using insert-ethers.

--
~$udhi
"Success is getting what you want. Happiness is wanting what you get."
- Dale Carnegie
"It's always our decision who we are"
- Robert Solomon in Waking Life

Roy Dragseth

Feb 2, 2011, 5:52:15 PM
to Discussion of Rocks Clusters
On Wednesday, February 02, 2011 07:37:43 Sudarshan Wadkar wrote:
> On Tue, Jan 25, 2011 at 3:05 AM, Roy Dragseth <roy.dr...@uit.no> wrote:
> > Every host needs to have an unique combination of rack and rank or else
> > it might goof up the DNS database. I believe this still is true. Greg?
> > Mason? I was able to break my test cluster the same way as yours by
> > adding a network-0-0 with rack=0 and rank=0.
> >
> > If,
> >
> > rocks dump | grep rank= | grep rack=
> >
> > list hosts with the same numbering you might have the root of your
> > problems. This shouldn't happen if you add hosts with insert-ethers.
>
> May I disagree with you on this one Roy?
> [root@hydra new]# rocks dump | grep rank= | grep rack=
> /opt/rocks/bin/rocks add host network-0-0 cpus=1 rack=0 rank=0
> membership=Ethernet\ Switch
> /opt/rocks/bin/rocks add host compute-0-0 cpus=8 rack=0 rank=0
> membership=Compute
>
> and the system has been working flawlessly for past few months now.
> The hosts were added using insert-ethers.

Hmm, then I have no clue. When I inserted a network-0-0 on my dev cluster,
torque stopped working immediately, and it started working again after I
removed network-0-0.

Thanks for the info; truly an odd problem.
