Following are the outputs that you asked for.
----------------------------------------------------------------
# pbsnodes -a
compute-0-0
state = free
np = 8
ntype = cluster
status = opsys=linux,uname=Linux compute-0-0.local
2.6.18-128.1.14.el5 #1 SMP Wed Jun 17 06:38:05 EDT 2009
x86_64,sessions=? 0,nsessions=?
0,nusers=0,idletime=16027,totmem=9202340kb,availmem=9091448kb,physmem=8182224kb,ncpus=8,loadave=0.00,netload=22222015,state=free,jobs=,varattr=,rectime=1257469245
# mdiag -S -v <jobid>
Initialized: S:FALSE/I:FALSE CCount: 0 FCount: 0 QCount: 0 JCount: 0
RCount: 0
end of mom log file:
11/06/2009 00:46:10;0008; pbs_mom;Job;process_request;request type
QueueJob from host hyperx.local rejected (host not authorized)
11/06/2009 00:46:10;0080; pbs_mom;Req;req_reject;Reject reply
code=15008(Access from host not allowed, or unknown host MSG=request not
authorized), aux=0, type=QueueJob, from PBS_S...@hyperx.local
11/06/2009 00:47:33;0008; pbs_mom;Job;process_request;request type
QueueJob from host hyperx.local rejected (host not authorized)
11/06/2009 00:47:33;0080; pbs_mom;Req;req_reject;Reject reply
code=15008(Access from host not allowed, or unknown host MSG=request not
authorized), aux=0, type=QueueJob, from PBS_S...@hyperx.local
-----------------------------------------------------------------------
Does the mounting of user dirs from the NAS to the frontend cause this trouble?
The command "qmgr -c 'p s'" gives:
create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = hyperx.iitb.ac.in
set server managers = ma...@hyperx.iitb.ac.in
set server managers += ro...@hyperx.iitb.ac.in
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 5
Please help me.
Thank you,
Regards,
Vighnesh
This tells us that both maui and pbs_server are running.
I'm curious though why only _one_ node responded to the "pbsnodes -a |
grep 'state ='" command. You said you had three nodes, but only one is
listed as free? Can you post the full output of "pbsnodes -a"?
Also, do you get any warnings of interest from "mdiag -S -v" or "mdiag
-j JOBID" (where JOBID is the job id of your interactive job you just
submitted).
You might also check the pbs_mom logs on the nodes, just after you
submit the interactive job and it goes into the RMFailure state. Look
in /opt/torque/mom_logs/ on the compute nodes for the latest file, and
look at the end of it.
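A sketch of that check: print the name of the newest file in the mom_logs directory so you can tail it. The `latest_log` helper name is my own; the /opt/torque/mom_logs path is the one from this thread.

```shell
# Find the most recently modified mom log so its tail can be inspected.
latest_log() {
    # print the name of the newest file in the given directory
    ls -t "$1" | head -n 1
}

# On a compute node (path from the Rocks torque roll layout):
#   tail -n 20 "/opt/torque/mom_logs/$(latest_log /opt/torque/mom_logs)"
```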
Bart
> Hi Bart,
> Following is the output:
> ---------------------------------------------------------
> # pbsnodes -a | grep "state ="
> state = free
>
> # ps aux | grep maui
> maui 5133 0.0 0.3 51484 31524 ? Ss Nov02 0:00
> /opt/maui/sbin/maui
> root 27040 0.0 0.0 61144 668 pts/1 S+ 12:36 0:00 grep
maui
>
> # ps aux | grep pbs
> root 22086 0.0 0.0 10416 1344 ? Ss Nov02 0:00
> /opt/torque/sbin/pbs_server
> root 27042 0.0 0.0 61144 672 pts/1 S+ 12:36 0:00 grep
pbs
> ---------------------------------------------------------
>
> Regards,
> Vighnesh
>
>
> What is the output of the following, done as root on the frontend:
>
> # pbsnodes -a | grep "state ="
> # ps aux | grep maui
> # ps aux | grep pbs
>
> Bart
>
> > Hi,
> > My cluster has 3 nodes with a Rocks 5.2 base. I have also installed
> > the torque 5.2 roll in it, but for some reason, whenever I try to
> > submit a job (interactive mode), it gets submitted in a deferred
> > state and never enters execution. There is no SGE.
> > If I do 'checkjob <jobid>', it gives the following output:
> > -----------------------------------------------------------------
> > checking job <jobid>
> >
> > State: Idle EState: Deferred
> > Creds: user:vighnesh group:vighnesh class:default qos:DEFAULT
> > WallTime: 00:00:00 of 99:23:59:59
> > SubmitTime: Sat Oct 31 15:24:36
> > (Time Queued Total: 00:01:04 Eligible: 00:00:01)
> >
> > StartDate: -00:01:02 Sat Oct 31 15:24:38
> > Total Tasks: 8
> >
> > Req[0] TaskCount: 8 Partition: ALL
> > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> > Opsys: [NONE] Arch: [NONE] Features: [NONE]
> >
> >
> > IWD: [NONE] Executable: [NONE]
> > Bypass: 0 StartCount: 1
> > PartitionMask: [ALL]
> > job is deferred. Reason: RMFailure (cannot start job - RM
failure,
> rc:
> > 15041, msg: 'Execution server rejected request MSG=cannot send job
to
> mom,
> > state=PRERUN')
> > Holds: Defer (hold reason: RMFailure)
> > PE: 8.00 StartPriority: 1
> > cannot select job 3 for partition DEFAULT (job hold active)
> > -------------------------------------------------------------------
> >
> > some kind of RM failure.
> >
> > Can anyone please help me solve this problem?
> >
> > Thank you,
> >
> > Regards,
> > Vighnesh
>
Everything looks good. Can you post the job script, and the command
line you used to submit it?
Bart
I have 8 compute nodes and 1 frontend server on a freshly installed Rocks 5.2 cluster with the 5.2.2 service pack roll installed.
pbsnodes -a | grep "state ="
---cut---
[root@chiron ~]# pbsnodes -a | grep "state ="
state = free
state = free
state = free
state = free
state = free
state = free
state = free
state = free
---cut---
ps aux |grep maui
---cut---
[root@chiron ~]# ps aux |grep maui
maui 13607 0.0 3.5 51528 31568 ? Ss Nov05 0:04 /opt/maui/sbin/maui
root 14873 0.0 0.0 61116 660 pts/1 R+ 12:19 0:00 grep maui
---cut---
ps aux |grep pbs
---cut---
[root@chiron ~]# ps aux|grep pbs
root 13740 0.0 0.1 10944 1400 ? Ss Nov05 0:07 /opt/torque/sbin/pbs_server
root 14877 0.0 0.0 61116 660 pts/1 R+ 12:19 0:00 grep pbs
---cut---
cat simple-jobscript.sh
---cut---
[preachermanx@chiron ~]$ cat simple-jobscript.sh
#!/bin/bash
#PBS -lwalltime=0:10:0
echo starting
sleep 10
echo ending
---cut---
checkjob $JOBNUMBER_FROM_QSUB
---cut---
[preachermanx@chiron ~]$ checkjob 6
checking job 6
State: Idle EState: Deferred
Creds: user:preachermanx group:preachermanx class:default qos:DEFAULT
WallTime: 00:00:00 of 00:10:00
SubmitTime: Mon Nov 9 12:16:28
(Time Queued Total: 00:00:02 Eligible: 00:00:01)
StartDate: 00:00:00 Mon Nov 9 12:16:30
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 1
cannot select job 6 for partition DEFAULT (job hold active)
---cut---
mdiag -S -v
---cut---
[root@chiron ~]# mdiag -S -v
Initialized: S:FALSE/I:FALSE CCount: 0 FCount: 0 QCount: 0 JCount: 0 RCount: 0
---cut---
mdiag -j 6
---cut---
[root@chiron ~]# mdiag -j 6
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
6 Idle ALL 1 DEF 00:10:00 0 1 preacher preacher - 00:18:18 [NONE] [NONE] [NONE] >=0 >=0 NC0 [default:1] [NONE]
---cut---
qmgr -c 'p s'
---cut---
[root@chiron ~]# qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = chiron.centtech.com
set server managers = ma...@chiron.centtech.com
set server managers += ro...@chiron.centtech.com
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 7
---cut---
Thanks in advance for any assistance that can be provided; this is only the second rocks+torque cluster I have assembled, and the first did not encounter this issue. :) Hopefully it's just something that I have missed or overlooked.
--
__ __
/ / Patrick S. Roberts / / 512.924.4039(c)
/ / Sr. Systems Admin / / 512.418.5792(o)
/_/ Centaur Technology /_/ "Бережёного Бог бережёт" (God protects the careful)
Doing a "tentakel /etc/init.d/pbs restart" after the nodes had been up for a minute seems to have fixed this. Reading the mom logs via "tentakel cat /opt/torque/mom_logs/*" suggests that when the nodes start, they are unable to talk to the frontend and give up, leaving the nodes in a state where they cannot authenticate the frontend.
As you can see below, host chiron.local, which is the frontend, was not found. I waited 5 minutes, then restarted pbs, and voilà: it is able to find chiron.local and add it as an authorized submitter.
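The manual "wait 5 minutes, then restart" workaround could be scripted. This is only a sketch: the `wait_for_host` helper and its retry/delay parameters are my own invention; only the host name and the pbs init script path come from this thread.

```shell
# Retry name resolution before restarting pbs, instead of waiting by hand.
wait_for_host() {
    host=$1; tries=${2:-30}; delay=${3:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # succeeds once the name resolves via NSS (/etc/hosts or DNS)
        getent hosts "$host" >/dev/null 2>&1 && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# On a compute node at boot, something like:
#   wait_for_host chiron.local && /etc/init.d/pbs restart
```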
---cut---
11/09/2009 12:47:02;0002; pbs_mom;Svr;Log;Log opened
11/09/2009 12:47:02;0002; pbs_mom;Svr;setpbsserver;chiron.local
11/09/2009 12:47:02;0002; pbs_mom;Svr;mom_server_add;server chiron.local added
11/09/2009 12:47:17;0001; pbs_mom;Svr;pbs_mom;mom_server_add, host chiron.local not found
11/09/2009 12:47:17;0002; pbs_mom;Svr;usecp;chiron.centtech.com:/home /home
11/09/2009 12:47:17;0002; pbs_mom;n/a;initialize;independent
11/09/2009 12:47:17;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
11/09/2009 12:47:18;0002; pbs_mom;Svr;pbs_mom;Is up
11/09/2009 12:47:18;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/torque/sbin/pbs_mom 1246994885
11/09/2009 12:47:18;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server chiron.local
11/09/2009 12:52:30;0002; pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
11/09/2009 12:52:30;0002; pbs_mom;Svr;pbs_mom;Is down
11/09/2009 12:52:30;0002; pbs_mom;Svr;Log;Log closed
11/09/2009 12:52:30;0002; pbs_mom;Svr;Log;Log opened
11/09/2009 12:52:30;0002; pbs_mom;Svr;setpbsserver;chiron.local
11/09/2009 12:52:30;0002; pbs_mom;Svr;mom_server_add;server chiron.local added
11/09/2009 12:52:30;0002; pbs_mom;Svr;usecp;chiron.centtech.com:/home /home
11/09/2009 12:52:30;0002; pbs_mom;n/a;initialize;independent
11/09/2009 12:52:30;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
11/09/2009 12:52:30;0002; pbs_mom;Svr;pbs_mom;Is up
11/09/2009 12:52:30;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/torque/sbin/pbs_mom 1246994885
11/09/2009 12:52:30;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server chiron.local
[preachermanx@chiron ~]$
---cut---
---cut---
#PBS -lnodes=1,walltime=0:10:0
echo starting
sleep 10
echo ending
---cut---
I'm assuming you submitted using a command like this:
# qsub simple-jobscript.sh
You requested a certain amount of time, but did not request any nodes. Presumably you want to do some computation...
Bart
Scott
Thank you,
Regards,
Vighnesh
# rocks sync config
# rocks sync users
And look in /etc/hosts on the compute node. Do you find an entry for the frontend? If not, we'll have to look at why...
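A quick way to test what Bart asks you to look for: does a hosts file map any address to the frontend's name? The `has_hosts_entry` helper name is mine; the file and host name are parameters so it works against any node's /etc/hosts.

```shell
# Succeed if the given hosts file maps some address to the given name.
has_hosts_entry() {
    name=$1; file=$2
    # skip comment lines; match the name against any hostname/alias column
    awk -v n="$name" '$0 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"
}

# On a compute node:
#   has_hosts_entry hyperx.local /etc/hosts && echo present || echo missing
```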
Bart
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-discussion-
> bou...@sdsc.edu] On Behalf Of vigh...@aero.iitb.ac.in
>
> Hi Bart,
> The problem got solved. What I did is, I added the private IP address
> of my frontend to the /etc/hosts file of the compute-0-0 node. Now the
> compute node is accepting the job and it is not deferred.
> But I want to ask whether what I did is correct. Will it cause any
> hassle during future upgrades or package additions?
>
> Thank you,
>
> Regards,
> Vighnesh
>
>
>
>
>
> > The "#PBS -lnodes=1,walltime=0:10:0" denotes I want one node to do this,
> > yes? :) Also, it did work after using the workaround of restarting pbs
> > 5 min after rebooting the nodes.
> >
I then added two more lines in the /etc/hosts file, as follows
---------------------------------------------------
127.0.0.1 localhost.localdomain localhost
192.168.2.1 hyperx.local hyperx # originally frontend-0-0
192.168.2.254 nas-0-0.local nas-0-0
192.168.2.253 compute-0-0.local compute-0-0
10.101.2.98 hyperx.iitb.ac.in
---------------------------------------------------
Regards,
Vighnesh
We just installed a Rocks 5.3 cluster with the PBS roll and ran into the
problem described below. Reviewing the /etc/hosts files on the compute
nodes reveals that there's no alias for the frontend's private
interface, just the public interface. Adding the private interface and
restarting pbs on the compute nodes seems to fix things (jobs run now).
I don't want to have to worry about re-adding this every time a node
gets reinstalled etc. Does anyone know the right way to add this alias
to rocks so that it automatically gets added to the nodes when they
install or sync with the frontend?
To summarize: I'm looking for /etc/hosts on the compute nodes to contain
something like this:
10.1.1.1 frontend.local frontend
where "frontend" is the short name of the cluster's frontend server.
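Until someone posts the Rocks-native way to do this, a generic idempotent append could go in a node post-install step. This is not the Rocks answer Ed is asking for: `ensure_hosts_line` is a made-up helper name, and the example values are the ones from the message above.

```shell
# Append a hosts line only if that exact line is not already present,
# so re-running on sync/reinstall does not duplicate it.
ensure_hosts_line() {
    line=$1; file=$2
    grep -qxF "$line" "$file" || echo "$line" >> "$file"
}

# ensure_hosts_line "10.1.1.1 frontend.local frontend" /etc/hosts
```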
Thanks much.
-Ed