[Rocks-Discuss] SGE: Nodes in E state


pooja gupta

Oct 18, 2010, 9:06:00 AM
to npaci-rocks...@sdsc.edu
Dear All,

We recently upgraded the RAM in our compute nodes, and the qhost command shows
the increased RAM. Our head node also serves as an execution host. However, all
of our nodes are stuck in a permanent E (error) state; please have a look at the
output below. We are also unable to qlogin. The output of the qlogin command is:

Your job 1739 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...

Your "qlogin" request could not be scheduled, try again later.
error: error shutting down the connection: undefined commlib error code


We have stopped and restarted SGE many times, but it didn't help. We would
greatly appreciate your suggestions.
Many thanks in advance.

With best regards,
Pooja

=======================================================

queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
al...@compute-0-0.local        BIP   0/0/16         0.00     lx26-amd64 E
---------------------------------------------------------------------------------
al...@compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64 E
---------------------------------------------------------------------------------
al...@compute-0-2.local        BIP   0/0/16         0.00     lx26-amd64 E
---------------------------------------------------------------------------------
al...@compute-0-3.local        BIP   0/0/16         0.02     lx26-amd64 E
---------------------------------------------------------------------------------
al...@nanda.local              BIP   0/0/8          0.00     lx26-amd64 E

=======================================================

Mike Hanby

Oct 18, 2010, 11:51:40 AM
to npaci-rocks...@sdsc.edu
Try using:

qstat -explain E

That may reveal the root problem.

Also, if you previously made memory / virtual memory consumable, you may need to update the execution hosts with the new RAM totals.
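
For example, if h_vmem is your consumable (the complex name, host, and 64G value here are just placeholders for your setup), you could update each execution host with qconf:

# edit the exec host entry interactively and adjust the complex_values line
qconf -me compute-0-0.local

# or set it non-interactively:
qconf -mattr exechost complex_values h_vmem=64G compute-0-0.local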

pooja gupta

Oct 18, 2010, 5:34:32 PM
to Discussion of Rocks Clusters
Hello!

Thanks for the reply. Here is the output of the qstat -explain E command.

I have checked that the RAM is consumable and that the totals are updated
correctly. The nodes are reachable via ssh, and there are no jobs in the queue.
I have also tried killing the jobs mentioned in the command output and
restarting sge_execd, but it didn't work.
Please help.
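
For reference, I verified the consumable settings along these lines (compute-0-0 as an example host):

qconf -sc | grep mem          # memory complexes; the consumable column should read YES
qconf -se compute-0-0.local   # complex_values should list the new RAM total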

Thanks and regards,
Pooja

queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
al...@compute-0-0.local        BIP   0/0/16         0.01     lx26-amd64 E
        queue all.q marked QERROR as result of job 1711's failure at host compute-0-0.local
---------------------------------------------------------------------------------
al...@compute-0-1.local        BIP   0/0/16         0.00     lx26-amd64 E
        queue all.q marked QERROR as result of job 1707's failure at host compute-0-1.local
---------------------------------------------------------------------------------
al...@compute-0-2.local        BIP   0/0/16         0.00     lx26-amd64 E
        queue all.q marked QERROR as result of job 1709's failure at host compute-0-2.local
---------------------------------------------------------------------------------
al...@compute-0-3.local        BIP   0/0/16         0.00     lx26-amd64 E
        queue all.q marked QERROR as result of job 1706's failure at host compute-0-3.local
---------------------------------------------------------------------------------
al...@nanda.local              BIP   0/0/8          0.01     lx26-amd64 E
        queue all.q marked QERROR as result of job 1705's failure at host nanda.local


Mike Hanby

Oct 18, 2010, 6:15:18 PM
to Discussion of Rocks Clusters
You can try clearing the error state and then running a test job. The queue instances were marked in an error state because of job failures that SGE determined might be due to compute node instability; it automatically marks them offline to prevent other jobs from starting there and failing.

To clear the error state for all nodes in all.q, run:

qmod -cq all.q

Or, to clear only a single node:

qmod -cq al...@compute-0-0.local

Prior to doing so, I'd put a hold on all jobs in the queue, clear the error state, then submit a test job and verify that it runs. If a node immediately goes back into the error state, the job log may point to the cause.
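
A rough sketch of that sequence (the wildcard job list and the test queue instance are examples; adjust to your site):

qhold "*"                                             # hold all pending jobs
qmod -cq all.q                                        # clear the error state on every all.q instance
qsub -b y -q all.q@compute-0-0.local /bin/hostname    # quick test job on one node
qrls "*"                                              # release the holds once the test runs cleanly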

Mike

pooja gupta

Oct 19, 2010, 5:27:40 AM
to Discussion of Rocks Clusters
Dear Mike,

Thank you for your help. It works now.

With best regards,
Pooja

