[Rocks-Discuss] Torque Maui: Job deferred, RM failure

vigh...@aero.iitb.ac.in

Oct 31, 2009, 3:29:21 AM
to npaci-rocks...@sdsc.edu
Hi,
My cluster has 3 nodes running the Rocks 5.2 base. I have also installed
the Torque roll for Rocks 5.2, but for some reason whenever I submit a
job in interactive mode, it gets submitted in the deferred state and never
enters execution. There is no SGE installed.
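For reference, the interactive submission looks roughly like this (the exact
resource request may differ slightly, but it asks for the 8 tasks shown in the
checkjob output below):

$ qsub -I -l nodes=2:ppn=4
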
If I run 'checkjob <jobid>', it gives the following output:
-----------------------------------------------------------------
checking job <jobid>

State: Idle EState: Deferred
Creds: user:vighnesh group:vighnesh class:default qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Sat Oct 31 15:24:36
(Time Queued Total: 00:01:04 Eligible: 00:00:01)

StartDate: -00:01:02 Sat Oct 31 15:24:38
Total Tasks: 8

Req[0] TaskCount: 8 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]


IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc:
15041, msg: 'Execution server rejected request MSG=cannot send job to mom,
state=PRERUN')
Holds: Defer (hold reason: RMFailure)
PE: 8.00 StartPriority: 1
cannot select job 3 for partition DEFAULT (job hold active)
-------------------------------------------------------------------

So it is some kind of RM failure.

Can anyone please help me solve this problem?

Thank you,

Regards,
Vighnesh

Bart Brashers

Nov 2, 2009, 12:41:20 PM
to Discussion of Rocks Clusters

What is the output of the following, done as root on the frontend:

# pbsnodes -a | grep "state ="
# ps aux | grep maui
# ps aux | grep pbs

Bart

vigh...@aero.iitb.ac.in

Nov 2, 2009, 11:41:31 PM
to npaci-rocks...@sdsc.edu
Hi Bart,
Here is the output:
---------------------------------------------------------

# pbsnodes -a | grep "state ="
state = free

# ps aux | grep maui

maui 5133 0.0 0.3 51484 31524 ? Ss Nov02 0:00
/opt/maui/sbin/maui
root 27040 0.0 0.0 61144 668 pts/1 S+ 12:36 0:00 grep maui

# ps aux | grep pbs

root 22086 0.0 0.0 10416 1344 ? Ss Nov02 0:00
/opt/torque/sbin/pbs_server
root 27042 0.0 0.0 61144 672 pts/1 S+ 12:36 0:00 grep pbs
---------------------------------------------------------

Regards,
Vighnesh

Bart Brashers

Nov 3, 2009, 11:29:16 AM
to Discussion of Rocks Clusters

This tells us that both maui and pbs_server are running.

I'm curious, though, why only _one_ node responded to the "pbsnodes -a |
grep 'state ='" command. You said you have three nodes, but only one is
listed as free. Can you post the full output of "pbsnodes -a"?

Also, do you get any warnings of interest from "mdiag -S -v" or "mdiag
-j JOBID" (where JOBID is the job id of the interactive job you just
submitted)?
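Something like this, run as root on the frontend, should do it (JOBID here
is whatever checkjob or qstat reports for the deferred job), and it should
show any scheduler-side messages about the defer hold:

# mdiag -S -v
# mdiag -j JOBID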

You might also check the pbs_mom logs on the nodes, just after you
submit the interactive job and it goes into the RMFailure state. Look
in /opt/torque/mom_logs/ on the compute nodes for the latest file, and
look at the end of it.
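
For example, something along these lines on each compute node (compute-0-0
is just a placeholder for your node names; the newest file in
/opt/torque/mom_logs/ is the one to read):

# ssh compute-0-0
# cd /opt/torque/mom_logs
# tail -50 $(ls -t | head -1)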

Bart
