[Rocks-Discuss] Not all nodes get added to sge queue


David M Noriega

Nov 24, 2009, 11:02:22 AM
to npaci-rocks...@sdsc.edu
I come from an OSCAR-based cluster and am trying out Rocks as we plan
our cluster upgrade.

I installed Rocks and the sge roll, then used insert-ethers to capture
the switch first and then some nodes, five for now. All five were
captured by insert-ethers, and 'rocks list hosts' shows them. However,
'qstat -f' shows only three of the five nodes. Also, when I run a
simple job script that reports the hostname, it appears to execute
fine and shows up in the queue, but no output logs are created, even
after I add the '-o' and '-e' options.

David

--
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters

Greg Bruno

Nov 30, 2009, 12:58:51 PM
to Discussion of Rocks Clusters
On Tue, Nov 24, 2009 at 8:02 AM, David M Noriega <tsk...@my.utsa.edu> wrote:
>
> I installed Rocks and the sge roll, used insert-ethers to first get
> the switch, and then some nodes, five for now. All five were captured
> by insert-ethers and using 'rocks list hosts' shows this. Though when
> I then use 'qstat -f' I only see three of the five nodes. Also to note

are you sure the installation completed on the 2 nodes that are not
reported by qstat?

> when I run a simple job script, it reports the hostname, it looks like
> it executes just fine, shows up in the queue. But no output logs are
> created. Even after I put the '-o' and '-e' options.

there are a couple of examples here:

http://www.rocksclusters.org/roll-documentation/sge/5.2/submitting-batch-jobs.html

let us know if you see similar behavior.

also, what is the output of:

# rocks list roll

- gb

David M Noriega

Dec 2, 2009, 12:16:47 PM
to Discussion of Rocks Clusters
Yes, the installation completed: I can run "rocks run host compute
hostname" and get the hostname of all the nodes.

# rocks run host compute hostname | sort -n
compute-0-0: compute-0-0.local
compute-0-1: compute-0-1.local
compute-0-2: compute-0-2.local
compute-0-3: compute-0-3.local
compute-0-4: compute-0-4.local
compute-0-5: compute-0-5.local
compute-0-6: compute-0-6.local
compute-0-7: compute-0-7.local
compute-0-8: compute-0-8.local
compute-0-9: compute-0-9.local

Yet when I run qstat -f, only five of them show up:

# qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
al...@compute-0-1.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-2.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-3.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-6.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-8.local BIP 0/0/8 0.03 lx26-amd64
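
For anyone comparing the two listings by hand, here is a minimal sketch that diffs them. It is not a Rocks or SGE tool: the function names are made up for illustration, the sample strings are copied from the output above (including the list archive's truncated queue name), and it assumes a modern Python 3 run against saved command output rather than the Python 2.4 shipped with Rocks 5.2.

```python
import re

# Sample text copied from this thread; on a live frontend these strings
# would be the saved output of the two commands.
ROCKS_OUTPUT = """\
compute-0-0: compute-0-0.local
compute-0-1: compute-0-1.local
compute-0-2: compute-0-2.local
compute-0-3: compute-0-3.local
compute-0-4: compute-0-4.local
compute-0-5: compute-0-5.local
compute-0-6: compute-0-6.local
compute-0-7: compute-0-7.local
compute-0-8: compute-0-8.local
compute-0-9: compute-0-9.local
"""

QSTAT_OUTPUT = """\
queuename qtype resv/used/tot. load_avg arch states
----------------------------------------------------
al...@compute-0-1.local BIP 0/0/8 0.00 lx26-amd64
al...@compute-0-2.local BIP 0/0/8 0.00 lx26-amd64
al...@compute-0-3.local BIP 0/0/8 0.00 lx26-amd64
al...@compute-0-6.local BIP 0/0/8 0.00 lx26-amd64
al...@compute-0-8.local BIP 0/0/8 0.03 lx26-amd64
"""


def hosts_from_rocks_run(text):
    """Collect host names from lines of the form 'compute-0-0: ...'."""
    hosts = set()
    for line in text.splitlines():
        name, sep, _ = line.partition(":")
        if sep and name.strip():
            hosts.add(name.strip())
    return hosts


def hosts_from_qstat_f(text):
    """Collect host names from queue-instance lines of the form 'queue@host ...'."""
    hosts = set()
    for line in text.splitlines():
        match = re.match(r"^\S+@(\S+)", line)
        if match:
            # Drop the '.local' domain so names compare with the Rocks names.
            hosts.add(match.group(1).split(".")[0])
    return hosts


def missing_from_sge(rocks_text, qstat_text):
    """Return, sorted, the hosts Rocks reports that have no queue instance."""
    return sorted(hosts_from_rocks_run(rocks_text) - hosts_from_qstat_f(qstat_text))


print(missing_from_sge(ROCKS_OUTPUT, QSTAT_OUTPUT))
# prints ['compute-0-0', 'compute-0-4', 'compute-0-5', 'compute-0-7', 'compute-0-9']
```

The header and separator lines of qstat -f contain no '@', so the pattern skips them without special handling.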

I submit the sleep.sh job and I can see it in the queue with qstat,
though I don't know whether it executes, as no log files are created.

# rocks list roll
NAME VERSION ARCH ENABLED
base: 5.2 x86_64 yes
bio: 5.2 x86_64 yes
ganglia: 5.2 x86_64 yes
hpc: 5.2 x86_64 yes
java: 5.2 x86_64 yes
kernel: 5.2 x86_64 yes
os: 5.2 x86_64 yes
sge: 5.2 x86_64 yes
web-server: 5.2 x86_64 yes

One thing to note: we are just testing Rocks, so the head node does
not have a DNS entry (the University controls that). Looking in
/opt/gridengine/default/spool/qmaster/messages, I can see errors about
not being able to resolve the name of the head node, even though the
private IP I gave it is in /etc/hosts.

Any ideas?

Thanks
David

David M Noriega

Dec 2, 2009, 1:10:17 PM
to Discussion of Rocks Clusters
Just want to quickly add that I forgot I was submitting the job as
root. When I tried again as a regular user (after running 'rocks sync
users'), I got my output logs. That still doesn't explain why not all
nodes are showing up in the queue, though.

Strangely, I can see that not all nodes have the init script installed:

# rocks run host compute "ls /etc/init.d/sge*" | sort -n
compute-0-0: ls: /etc/init.d/sge*: No such file or directory
compute-0-1: /etc/init.d/sgeexecd.cluster
compute-0-2: /etc/init.d/sgeexecd.cluster
compute-0-3: /etc/init.d/sgeexecd.cluster
compute-0-4: ls: /etc/init.d/sge*: No such file or directory
compute-0-5: ls: /etc/init.d/sge*: No such file or directory
compute-0-6: /etc/init.d/sgeexecd.cluster
compute-0-7: ls: /etc/init.d/sge*: No such file or directory
compute-0-8: /etc/init.d/sgeexecd.cluster
compute-0-9: ls: /etc/init.d/sge*: No such file or directory
compute-1-0: ls: /etc/init.d/sge*: No such file or directory
compute-1-1: ls: /etc/init.d/sge*: No such file or directory
compute-1-2: ls: /etc/init.d/sge*: No such file or directory
compute-1-3: /etc/init.d/sgeexecd.cluster
compute-1-4: /etc/init.d/sgeexecd.cluster
compute-1-5: /etc/init.d/sgeexecd.cluster
compute-1-6: down << Bad node, ignore
compute-1-7: /etc/init.d/sgeexecd.cluster
compute-1-8: ls: /etc/init.d/sge*: No such file or directory
compute-1-9: /etc/init.d/sgeexecd.cluster
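
A listing like this is easier to act on once it is sorted into groups. The following is only a sketch, with a made-up function name and a handful of lines copied from the output above; it assumes the same "host: text" layout and a modern Python 3 run on saved output.

```python
# A few lines copied from the listing above, one per kind of result.
LS_OUTPUT = """\
compute-0-0: ls: /etc/init.d/sge*: No such file or directory
compute-0-1: /etc/init.d/sgeexecd.cluster
compute-1-3: /etc/init.d/sgeexecd.cluster
compute-1-6: down
compute-1-8: ls: /etc/init.d/sge*: No such file or directory
"""


def classify_nodes(text):
    """Split the listing into (installed, missing, other) host lists."""
    installed, missing, other = [], [], []
    for line in text.splitlines():
        # Split on the first colon only; the rest of the line may contain more.
        host, sep, rest = line.partition(":")
        if not sep:
            continue
        host, rest = host.strip(), rest.strip()
        if "No such file or directory" in rest:
            missing.append(host)
        elif rest.startswith("/etc/init.d/sge"):
            installed.append(host)
        else:
            # Anything else, e.g. a node reported as down.
            other.append(host)
    return installed, missing, other


installed, missing, other = classify_nodes(LS_OUTPUT)
print("missing init script:", missing)
# prints missing init script: ['compute-0-0', 'compute-1-8']
```

In the compute-0-* rack, the nodes missing the init script are the same ones absent from qstat -f.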

David

Ian Kaufman

Dec 2, 2009, 1:18:12 PM
to Discussion of Rocks Clusters
SGE uses DNS, using gethostbyname() to figure things out. However, by
default, the frontend should be using a private address for the SGE
master - SGE should not be listening elsewhere, i.e. frontend.local
10.0.0.1 or whatever is your SGE master info. ROCKS sets up its own
internal DNS. Did you set up your frontend with both public and
private interfaces?
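
One rough way to see what the system resolver returns for a name is sketched below. It uses Python's socket.gethostbyname, which goes through the system resolver and so should be close to, though not guaranteed identical to, what SGE sees. The helper name and the stub resolver are made up for illustration; the stub only lets the example run without a cluster, and the name and address in it are the ones from the message above.

```python
import socket


def check_resolution(names, resolver=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError.
            results[name] = None
    return results


def stub_resolver(name):
    """Stand-in resolver for the demo; a real check would use the default."""
    known = {"frontend.local": "10.0.0.1"}
    if name in known:
        return known[name]
    raise socket.gaierror("name not known: " + name)


print(check_resolution(["frontend.local", "no-such-host.local"], resolver=stub_resolver))
# prints {'frontend.local': '10.0.0.1', 'no-such-host.local': None}
```

On the frontend itself, calling check_resolution([socket.gethostname()]) with the default resolver would show whether the head node's own name resolves, which is what the qmaster messages complain about.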

Ian

--
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu

David M Noriega

Dec 2, 2009, 1:48:27 PM
to Discussion of Rocks Clusters
Yes, both interfaces are set up; the Rocks installer doesn't let you
get past that step without doing so. I have the head node plugged into
a private (i.e. non-cluster) switch so that I can at least ssh out.

Yu Fu

Dec 2, 2009, 3:59:07 PM
to Discussion of Rocks Clusters
I got the following errors when generating a kickstart file for a node
in a ROCKS 5.2 frontend server:

[root@fe install]# ROCKSDEBUG=y rocks list host graph gridftp
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 264, in ?
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1733, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/host/graph/__init__.py", line 241, in run
    handler = rocks.profile.GraphHandler(attrs)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/profile.py", line 283, in __init__
    self.os = attrs['os']
KeyError: 'os'

The XML files for this node are fine: I get the same problem after
copying the XML files over from other, correctly working nodes. What
could be wrong?

Thanks,

Yu

Mason J. Katz

Dec 2, 2009, 4:42:28 PM
to Discussion of Rocks Clusters
This is a bug/feature of Rocks 5.2. You need to actually kickstart
the node using insert-ethers before you can run "rocks list host
graph" for a node.
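
To illustrate the failure mode in the traceback, here is a hypothetical stand-in, not the Rocks class. It assumes attrs behaves like a plain dict of host attributes and that 'os' is simply absent for a node that has not been kickstarted yet; the traceback suggests this, but the thread does not confirm it.

```python
class GraphHandlerSketch:
    """Hypothetical stand-in that only mirrors the failing line of the traceback."""

    def __init__(self, attrs):
        # Direct indexing raises KeyError when the 'os' attribute is absent.
        self.os = attrs['os']


def describe(attrs):
    """Report the 'os' attribute, or say which attribute is missing."""
    try:
        return "os=" + GraphHandlerSketch(attrs).os
    except KeyError as err:
        return "missing host attribute: " + str(err)


print(describe({"os": "linux"}))
# prints os=linux
print(describe({}))
# prints missing host attribute: 'os'
```

This would also be consistent with copying XML files from a working node making no difference: in this sketch the error depends on the attribute lookup, not on the contents of the node files.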

mason katz
+1.240.724.6825

David M Noriega

Dec 4, 2009, 10:29:58 AM
to Discussion of Rocks Clusters