We recently upgraded the RAM on our compute nodes. The qhost command shows
the increased RAM. Our head node is also configured as an execution host.
However, the queue instances on all nodes are permanently in the E (error)
state; please see the output below. We are also unable to qlogin. The output
of the qlogin command is:
Your job 1739 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your "qlogin" request could not be scheduled, try again later.
error: error shutting down the connection: undefined commlib error code
We have stopped and restarted SGE many times but it didn't help. Please help
us in this regard.
We would highly appreciate your suggestions.
Many thanks in advance.
With best regards,
Pooja
=======================================================
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
al...@compute-0-0.local        BIP   0/0/16         0.00     lx26-amd64    E
---------------------------------------------------------------------------------
al...@compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64    E
---------------------------------------------------------------------------------
al...@compute-0-2.local        BIP   0/0/16         0.00     lx26-amd64    E
---------------------------------------------------------------------------------
al...@compute-0-3.local        BIP   0/0/16         0.02     lx26-amd64    E
---------------------------------------------------------------------------------
al...@nanda.local              BIP   0/0/8          0.00     lx26-amd64    E
=======================================================
Thanks for the reply. Here is the output of the
qstat -explain E command.
I have checked that the RAM is configured as consumable and was updated
correctly. The nodes are reachable via ssh. There are no jobs in the queue.
I have also tried to kill the jobs mentioned in the command output and
restarted sgeexecd, but it didn't work.
Please help.
Thanks and regards,
Pooja
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
al...@compute-0-0.local        BIP   0/0/16         0.01     lx26-amd64    E
    queue all.q marked QERROR as result of job 1711's failure at host compute-0-0.local
---------------------------------------------------------------------------------
al...@compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64    E
    queue all.q marked QERROR as result of job 1707's failure at host compute-0-1.local
---------------------------------------------------------------------------------
al...@compute-0-2.local        BIP   0/0/16         0.00     lx26-amd64    E
    queue all.q marked QERROR as result of job 1709's failure at host compute-0-2.local
---------------------------------------------------------------------------------
al...@compute-0-3.local        BIP   0/0/16         0.00     lx26-amd64    E
    queue all.q marked QERROR as result of job 1706's failure at host compute-0-3.local
---------------------------------------------------------------------------------
al...@nanda.local              BIP   0/0/8          0.01     lx26-amd64    E
    queue all.q marked QERROR as result of job 1705's failure at host nanda.local
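
Each explanation line in the output above names the queue, the failed job, and the host behind an E state. The following is a minimal Python sketch, assuming the line format shown above, that extracts those fields and prints one `qmod -cq` command per queue instance (`qmod -cq` clears a queue's error state in SGE 6.x). The names `parse_qerror` and `sample` are illustrative only, and the thread does not record which fix was actually applied; clearing the state only helps once the cause of the job failures has been addressed.

```python
import re

# Matches the explanation lines printed by `qstat -f -explain E`.
QERROR_RE = re.compile(
    r"queue (?P<queue>\S+) marked QERROR as result of job (?P<job>\d+)'s "
    r"failure at host\s+(?P<host>\S+)"
)


def parse_qerror(text):
    """Return (queue, job_id, host) tuples for every QERROR explanation."""
    # Mail wrapping can push the host name onto the next line,
    # so collapse all whitespace before matching.
    flat = " ".join(text.split())
    return [(m["queue"], int(m["job"]), m["host"])
            for m in QERROR_RE.finditer(flat)]


# Illustrative excerpt of the output quoted in this thread.
sample = """\
queue all.q marked QERROR as result of job 1711's failure at host
compute-0-0.local
---------------------------------
queue all.q marked QERROR as result of job 1705's failure at host
nanda.local
"""

for queue, job, host in parse_qerror(sample):
    # Assumption: the admin wants to clear each affected queue instance.
    print(f"qmod -cq {queue}@{host}  # error caused by job {job}")
```

The printed commands are meant to be reviewed and run by hand as an SGE manager or operator; the script itself does not modify anything on the cluster.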
Thank you for your help. It works now.
With best regards,
Pooja