I installed Rocks with the SGE roll and used insert-ethers to capture
the switch first, and then some nodes, five for now. All five were captured
by insert-ethers, and 'rocks list hosts' shows them. But when
I run 'qstat -f' I see only three of the five nodes. Also,
when I submit a simple job script that reports the hostname, it appears
to execute fine and shows up in the queue, but no output logs are
created, even when I pass the '-o' and '-e' options.
David
--
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
are you sure the installation completed on the 2 nodes that are not
reported by qstat?
> when I run a simple job script, it reports the hostname, it looks like
> it executes just fine, shows up in the queue. But no output logs are
> created. Even after I put the '-o' and '-e' options.
there are a couple of examples here:
http://www.rocksclusters.org/roll-documentation/sge/5.2/submitting-batch-jobs.html
let us know if you see similar behavior.
also, what is the output of:
# rocks list roll
- gb
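For the missing-log question, here is a minimal sketch of a job script in the spirit of the documented examples. The job name `hosttest`, the output file names, and the scratch directory are made up for illustration; `-S`, `-cwd`, `-N`, `-o` and `-e` are standard SGE options. The script uses `uname -n` to report the host name.

```shell
#!/bin/sh
# Write a small SGE job script into a scratch directory (location is arbitrary).
workdir=$(mktemp -d)
cat > "$workdir/hosttest.sh" <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N hosttest
#$ -o hosttest.out
#$ -e hosttest.err
echo "ran on $(uname -n)"
EOF

# Outside the scheduler the "#$" lines are ordinary comments, so the script
# can be checked by running it directly before it is submitted.
local_out=$(sh "$workdir/hosttest.sh")
echo "$local_out"
```

If the local run prints a host name, the script could then be submitted from a shared directory with `qsub hosttest.sh`. With `-cwd` the output files are written relative to the directory the job was submitted from. Without it, SGE writes them to the user's home directory as seen from the execution host, which may not be where you are looking on the frontend.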
# rocks run host compute hostname | sort -n
compute-0-0: compute-0-0.local
compute-0-1: compute-0-1.local
compute-0-2: compute-0-2.local
compute-0-3: compute-0-3.local
compute-0-4: compute-0-4.local
compute-0-5: compute-0-5.local
compute-0-6: compute-0-6.local
compute-0-7: compute-0-7.local
compute-0-8: compute-0-8.local
compute-0-9: compute-0-9.local
Yet when I run qstat -f, only five of these nodes are listed:
# qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
al...@compute-0-1.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-2.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-3.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-6.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-8.local BIP 0/0/8 0.03 lx26-amd64
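To avoid comparing the two listings by eye, the difference can be computed. This is a rough sketch, assuming the output of `rocks run host compute hostname` and of `qstat -f` has been saved to two files. The function name `missing_from_sge` is made up. It relies only on the `queue@host` form of the first column in the `qstat -f` output.

```shell
#!/bin/sh
# missing_from_sge ROCKS_FILE QSTAT_FILE
# Print every host named in ROCKS_FILE (lines like "compute-0-0: compute-0-0.local")
# that has no "queue@host" line in QSTAT_FILE.
missing_from_sge() {
    awk '
        # First file on the awk command line: the saved qstat -f output.
        FILENAME == ARGV[1] {
            if (index($1, "@")) { split($1, part, "@"); in_sge[part[2]] = 1 }
            next
        }
        # Second file: the saved rocks output; column 2 is the reported host name.
        NF >= 2 && !($2 in in_sge) { print $2 }
    ' "$2" "$1"
}
```

Usage would be along the lines of `rocks run host compute hostname > rocks.txt; qstat -f > qstat.txt; missing_from_sge rocks.txt qstat.txt`. For the two listings above this reports compute-0-0, compute-0-4, compute-0-5, compute-0-7 and compute-0-9.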
I submit the sleep.sh job and I can see it in the queue with qstat,
but I don't know whether it executes, as no log files are created.
# rocks list roll
NAME VERSION ARCH ENABLED
base: 5.2 x86_64 yes
bio: 5.2 x86_64 yes
ganglia: 5.2 x86_64 yes
hpc: 5.2 x86_64 yes
java: 5.2 x86_64 yes
kernel: 5.2 x86_64 yes
os: 5.2 x86_64 yes
sge: 5.2 x86_64 yes
web-server: 5.2 x86_64 yes
One thing to note: we are just testing Rocks, so the
head node does not have a DNS entry (the University controls that).
Looking in /opt/gridengine/default/spool/qmaster/messages, I can see
errors about not being able to resolve the name of the head node, even
though the private IP I gave it is in /etc/hosts.
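One way to narrow this down is to check exactly which names /etc/hosts maps to the private IP, since SGE needs to resolve the name it has recorded for the qmaster. Below is a small sketch of such a lookup. The function name `hosts_lookup` is hypothetical, and it only inspects a hosts-format file, not the full resolver path.

```shell
#!/bin/sh
# hosts_lookup FILE NAME
# Print the address that FILE (in /etc/hosts format) assigns to NAME.
# Returns non-zero if NAME does not appear as a host name or alias.
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                      # drop trailing comments
        {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit !found }
    ' "$1"
}
```

Assuming the usual SGE layout, the name to test is the one stored in /opt/gridengine/default/common/act_qmaster, for example `hosts_lookup /etc/hosts "$(cat /opt/gridengine/default/common/act_qmaster)"`. If that name is the public one and only the private name is in /etc/hosts, the lookup fails, which would fit the errors in the messages file.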
Any ideas?
Thanks
David
--
Strangely, I can see that not all nodes have the init script installed:
# rocks run host compute "ls /etc/init.d/sge*" | sort -n
compute-0-0: ls: /etc/init.d/sge*: No such file or directory
compute-0-1: /etc/init.d/sgeexecd.cluster
compute-0-2: /etc/init.d/sgeexecd.cluster
compute-0-3: /etc/init.d/sgeexecd.cluster
compute-0-4: ls: /etc/init.d/sge*: No such file or directory
compute-0-5: ls: /etc/init.d/sge*: No such file or directory
compute-0-6: /etc/init.d/sgeexecd.cluster
compute-0-7: ls: /etc/init.d/sge*: No such file or directory
compute-0-8: /etc/init.d/sgeexecd.cluster
compute-0-9: ls: /etc/init.d/sge*: No such file or directory
compute-1-0: ls: /etc/init.d/sge*: No such file or directory
compute-1-1: ls: /etc/init.d/sge*: No such file or directory
compute-1-2: ls: /etc/init.d/sge*: No such file or directory
compute-1-3: /etc/init.d/sgeexecd.cluster
compute-1-4: /etc/init.d/sgeexecd.cluster
compute-1-5: /etc/init.d/sgeexecd.cluster
compute-1-6: down << Bad node, ignore
compute-1-7: /etc/init.d/sgeexecd.cluster
compute-1-8: ls: /etc/init.d/sge*: No such file or directory
compute-1-9: /etc/init.d/sgeexecd.cluster
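The listing above can be reduced to just the affected node names with a short filter. This is a sketch only: the function name `nodes_missing_sge` is made up, and it assumes the "No such file or directory" wording that `ls` prints here.

```shell
#!/bin/sh
# nodes_missing_sge
# Read "node: message" lines on stdin and print the nodes where ls found
# no SGE init script. Nodes reported as down are not printed.
nodes_missing_sge() {
    awk '/No such file or directory/ { sub(/:$/, "", $1); print $1 }'
}
```

For example: `rocks run host compute "ls /etc/init.d/sge*" 2>&1 | nodes_missing_sge`. The nodes it prints are the ones where the SGE execution daemon was apparently never set up, so they are the candidates for a closer look or a reinstall.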
David
Ian
--
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
[root@fe install]# ROCKSDEBUG=y rocks list host graph gridftp
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 264, in ?
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1733, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/host/graph/__init__.py", line 241, in run
    handler = rocks.profile.GraphHandler(attrs)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/profile.py", line 283, in __init__
    self.os = attrs['os']
KeyError: 'os'
The XML files for this node appear to be fine: I got the same error after
copying the XML files over from other, correctly working nodes. What could be wrong?
Thanks,
Yu
mason katz
+1.240.724.6825