Following are the outputs that you asked for.
----------------------------------------------------------------
# pbsnodes -a
compute-0-0
state = free
np = 8
ntype = cluster
status = opsys=linux,uname=Linux compute-0-0.local
2.6.18-128.1.14.el5 #1 SMP Wed Jun 17 06:38:05 EDT 2009
x86_64,sessions=? 0,nsessions=?
0,nusers=0,idletime=16027,totmem=9202340kb,availmem=9091448kb,physmem=8182224kb,ncpus=8,loadave=0.00,netload=22222015,state=free,jobs=,varattr=,rectime=1257469245
# mdiag -S -v <jobid>
Initialized: S:FALSE/I:FALSE CCount: 0 FCount: 0 QCount: 0 JCount: 0
RCount: 0
end of mom log file:
11/06/2009 00:46:10;0008; pbs_mom;Job;process_request;request type
QueueJob from host hyperx.local rejected (host not authorized)
11/06/2009 00:46:10;0080; pbs_mom;Req;req_reject;Reject reply
code=15008(Access from host not allowed, or unknown host MSG=request not
authorized), aux=0, type=QueueJob, from PBS_S...@hyperx.local
11/06/2009 00:47:33;0008; pbs_mom;Job;process_request;request type
QueueJob from host hyperx.local rejected (host not authorized)
11/06/2009 00:47:33;0080; pbs_mom;Req;req_reject;Reject reply
code=15008(Access from host not allowed, or unknown host MSG=request not
authorized), aux=0, type=QueueJob, from PBS_S...@hyperx.local
-----------------------------------------------------------------------
Does the mounting of user dirs from the NAS to the frontend cause this trouble?
The command "qmgr -c 'p s'" gives:
create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = hyperx.iitb.ac.in
set server managers = ma...@hyperx.iitb.ac.in
set server managers += ro...@hyperx.iitb.ac.in
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 5
Please help me.
Thank you,
Regards,
Vighnesh
This tells us that both maui and pbs_server are running.
I'm curious though why only _one_ node responded to the "pbsnodes -a |
grep 'state ='" command. You said you had three nodes, but only one is
listed as free? Can you post the full output of "pbsnodes -a"?
Also, do you get any warnings of interest from "mdiag -S -v" or "mdiag
-j JOBID" (where JOBID is the job id of your interactive job you just
submitted).
You might also check the pbs_mom logs on the nodes, just after you
submit the interactive job and it goes into the RMFailure state. Look
in /opt/torque/mom_logs/ on the compute nodes for the latest file, and
look at the end of it.
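A sketch of that check: print the name of the newest file in the mom_logs directory so you can tail it. The `latest_log` helper name is my own; the /opt/torque/mom_logs path is the one from this thread.

```shell
# Find the most recently modified mom log so its tail can be inspected.
latest_log() {
    # print the name of the newest file in the given directory
    ls -t "$1" | head -n 1
}

# On a compute node (path from the Rocks torque roll layout):
#   tail -n 20 "/opt/torque/mom_logs/$(latest_log /opt/torque/mom_logs)"
```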
Bart
> Hi Bart,
> Following is the output:
> ---------------------------------------------------------
> # pbsnodes -a | grep "state ="
> state = free
>
> # ps aux | grep maui
> maui 5133 0.0 0.3 51484 31524 ? Ss Nov02 0:00
> /opt/maui/sbin/maui
> root 27040 0.0 0.0 61144 668 pts/1 S+ 12:36 0:00 grep
maui
>
> # ps aux | grep pbs
> root 22086 0.0 0.0 10416 1344 ? Ss Nov02 0:00
> /opt/torque/sbin/pbs_server
> root 27042 0.0 0.0 61144 672 pts/1 S+ 12:36 0:00 grep
pbs
> ---------------------------------------------------------
>
> Regards,
> Vighnesh
>
>
> What is the output of the following, done as root on the frontend:
>
> # pbsnodes -a | grep "state ="
> # ps aux | grep maui
> # ps aux | grep pbs
>
> Bart
>
> > Hi,
> > My cluster has 3 nodes with a Rocks 5.2 base. I have also installed
> > the torque 5.2 roll in it, but for some reason, whenever I try to
> > submit a job (interactive mode), it gets submitted in a deferred
> > state and never enters execution. There is no SGE.
> > If I do 'checkjob <jobid>', it gives the following output:
> > -----------------------------------------------------------------
> > checking job <jobid>
> >
> > State: Idle EState: Deferred
> > Creds: user:vighnesh group:vighnesh class:default qos:DEFAULT
> > WallTime: 00:00:00 of 99:23:59:59
> > SubmitTime: Sat Oct 31 15:24:36
> > (Time Queued Total: 00:01:04 Eligible: 00:00:01)
> >
> > StartDate: -00:01:02 Sat Oct 31 15:24:38
> > Total Tasks: 8
> >
> > Req[0] TaskCount: 8 Partition: ALL
> > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> > Opsys: [NONE] Arch: [NONE] Features: [NONE]
> >
> >
> > IWD: [NONE] Executable: [NONE]
> > Bypass: 0 StartCount: 1
> > PartitionMask: [ALL]
> > job is deferred. Reason: RMFailure (cannot start job - RM
failure,
> rc:
> > 15041, msg: 'Execution server rejected request MSG=cannot send job
to
> mom,
> > state=PRERUN')
> > Holds: Defer (hold reason: RMFailure)
> > PE: 8.00 StartPriority: 1
> > cannot select job 3 for partition DEFAULT (job hold active)
> > -------------------------------------------------------------------
> >
> > some kind of RM failure.
> >
> > Can anyone please help me solve this problem?
> >
> > Thank you,
> >
> > Regards,
> > Vighnesh
>
Everything looks good. Can you post the job script, and the command
line you used to submit it?
Bart
I have 8 compute nodes and 1 frontend server on a freshly installed Rocks 5.2 cluster with the 5.2.2 service pack roll installed.
pbsnodes -a | grep "state ="
---cut---
[root@chiron ~]# pbsnodes -a | grep "state ="
state = free
state = free
state = free
state = free
state = free
state = free
state = free
state = free
---cut---
ps aux |grep maui
---cut---
[root@chiron ~]# ps aux |grep maui
maui 13607 0.0 3.5 51528 31568 ? Ss Nov05 0:04 /opt/maui/sbin/maui
root 14873 0.0 0.0 61116 660 pts/1 R+ 12:19 0:00 grep maui
---cut---
ps aux |grep pbs
---cut---
[root@chiron ~]# ps aux|grep pbs
root 13740 0.0 0.1 10944 1400 ? Ss Nov05 0:07 /opt/torque/sbin/pbs_server
root 14877 0.0 0.0 61116 660 pts/1 R+ 12:19 0:00 grep pbs
---cut---
cat simple-jobscript.sh
---cut---
[preachermanx@chiron ~]$ cat simple-jobscript.sh
#!/bin/bash
#PBS -lwalltime=0:10:0
echo starting
sleep 10
echo ending
---cut---
checkjob $JOBNUMBER_FROM_QSUB
---cut---
[preachermanx@chiron ~]$ checkjob 6
checking job 6
State: Idle EState: Deferred
Creds: user:preachermanx group:preachermanx class:default qos:DEFAULT
WallTime: 00:00:00 of 00:10:00
SubmitTime: Mon Nov 9 12:16:28
(Time Queued Total: 00:00:02 Eligible: 00:00:01)
StartDate: 00:00:00 Mon Nov 9 12:16:30
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 1
cannot select job 6 for partition DEFAULT (job hold active)
---cut---
mdiag -S -v
---cut---
[root@chiron ~]# mdiag -S -v
Initialized: S:FALSE/I:FALSE CCount: 0 FCount: 0 QCount: 0 JCount: 0 RCount: 0
---cut---
mdiag -j 6
---cut---
[root@chiron ~]# mdiag -j 6
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
6 Idle ALL 1 DEF 00:10:00 0 1 preacher preacher - 00:18:18 [NONE] [NONE] [NONE] >=0 >=0 NC0 [default:1] [NONE]
---cut---
qmgr -c 'p s'
---cut---
[root@chiron ~]# qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = chiron.centtech.com
set server managers = ma...@chiron.centtech.com
set server managers += ro...@chiron.centtech.com
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 7
---cut---
Thanks in advance for any assistance that can be provided; this is only the second rocks+torque cluster I have assembled, and the first did not encounter this issue. :) Hopefully it's just something that I have missed or overlooked.
--
__ __
/ / Patrick S. Roberts / / 512.924.4039(c)
/ / Sr. Systems Admin / / 512.418.5792(o)
/_/ Centaur Technology /_/ "Бережёного Бог бережёт" (God protects the careful)
Doing a "tentakel /etc/init.d/pbs restart" after the nodes had been up for a minute seems to have fixed this. Reading the mom logs via "tentakel cat /opt/torque/mom_logs/*" suggests that when the nodes start, they are unable to talk to the frontend and give up, leaving the nodes in a state where they cannot authenticate the frontend.
As you can see below, host chiron.local, which is the frontend, was not found. I waited 5 minutes, then restarted pbs, and voilà: it is able to find chiron.local and add it as an authorized submitter.
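The manual "wait 5 minutes, then restart" workaround could be scripted. This is only a sketch: the `wait_for_host` helper and its retry/delay parameters are my own invention; only the host name and the pbs init script path come from this thread.

```shell
# Retry name resolution before restarting pbs, instead of waiting by hand.
wait_for_host() {
    host=$1; tries=${2:-30}; delay=${3:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # succeeds once the name resolves via NSS (/etc/hosts or DNS)
        getent hosts "$host" >/dev/null 2>&1 && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# On a compute node at boot, something like:
#   wait_for_host chiron.local && /etc/init.d/pbs restart
```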
---cut---
11/09/2009 12:47:02;0002; pbs_mom;Svr;Log;Log opened
11/09/2009 12:47:02;0002; pbs_mom;Svr;setpbsserver;chiron.local
11/09/2009 12:47:02;0002; pbs_mom;Svr;mom_server_add;server chiron.local added
11/09/2009 12:47:17;0001; pbs_mom;Svr;pbs_mom;mom_server_add, host chiron.local not found
11/09/2009 12:47:17;0002; pbs_mom;Svr;usecp;chiron.centtech.com:/home /home
11/09/2009 12:47:17;0002; pbs_mom;n/a;initialize;independent
11/09/2009 12:47:17;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
11/09/2009 12:47:18;0002; pbs_mom;Svr;pbs_mom;Is up
11/09/2009 12:47:18;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/torque/sbin/pbs_mom 1246994885
11/09/2009 12:47:18;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server chiron.local
11/09/2009 12:52:30;0002; pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
11/09/2009 12:52:30;0002; pbs_mom;Svr;pbs_mom;Is down
11/09/2009 12:52:30;0002; pbs_mom;Svr;Log;Log closed
11/09/2009 12:52:30;0002; pbs_mom;Svr;Log;Log opened
11/09/2009 12:52:30;0002; pbs_mom;Svr;setpbsserver;chiron.local
11/09/2009 12:52:30;0002; pbs_mom;Svr;mom_server_add;server chiron.local added
11/09/2009 12:52:30;0002; pbs_mom;Svr;usecp;chiron.centtech.com:/home /home
11/09/2009 12:52:30;0002; pbs_mom;n/a;initialize;independent
11/09/2009 12:52:30;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
11/09/2009 12:52:30;0002; pbs_mom;Svr;pbs_mom;Is up
11/09/2009 12:52:30;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/torque/sbin/pbs_mom 1246994885
11/09/2009 12:52:30;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server chiron.local
[preachermanx@chiron ~]$
---cut---
---cut---
#PBS -lnodes=1,walltime=0:10:0
echo starting
sleep 10
echo ending
---cut---
I'm assuming you submitted using a command like this:
# qsub simple-jobscript.sh
You requested a certain amount of time, but did not request any nodes. Presumably you want to do some computation...
Bart
Scott
Thank you,
Regards,
Vighnesh
# rocks sync config
# rocks sync users
And look in /etc/hosts on the compute node. Do you find an entry for the frontend? If not, we'll have to look at why...
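A quick way to test what Bart asks you to look for: does a hosts file map any address to the frontend's name? The `has_hosts_entry` helper name is mine; the file and host name are parameters so it works against any node's /etc/hosts.

```shell
# Succeed if the given hosts file maps some address to the given name.
has_hosts_entry() {
    name=$1; file=$2
    # skip comment lines; match the name against any hostname/alias column
    awk -v n="$name" '$0 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"
}

# On a compute node:
#   has_hosts_entry hyperx.local /etc/hosts && echo present || echo missing
```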
Bart
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-discussion-
> bou...@sdsc.edu] On Behalf Of vigh...@aero.iitb.ac.in
>
> Hi Bart,
> The problem got solved. What I did is, I added the private IP address
> of my frontend to the /etc/hosts file of the compute-0-0 node. Now the
> compute node is accepting the job and it is not deferred.
> But I want to ask whether what I did is correct. Will it cause any
> hassle during future upgrades or package additions?
>
> Thank you,
>
> Regards,
> Vighnesh
>
>
>
>
>
> > The "#PBS -lnodes=1,walltime=0:10:0" denotes I want one node to do this,
> > yes? :) Also, it did work after using the workaround of restarting pbs
> > 5 min after rebooting the nodes.
> >
I then added two more lines in the /etc/hosts file, as follows
---------------------------------------------------
127.0.0.1 localhost.localdomain localhost
192.168.2.1 hyperx.local hyperx # originally frontend-0-0
192.168.2.254 nas-0-0.local nas-0-0
192.168.2.253 compute-0-0.local compute-0-0
10.101.2.98 hyperx.iitb.ac.in
---------------------------------------------------
Regards,
Vighnesh
We just installed a Rocks 5.3 cluster with the PBS roll and ran into the
problem described below. Reviewing the /etc/hosts files on the compute
nodes reveals that there's no alias for the frontend's private
interface, just the public interface. Adding the private interface and
restarting pbs on the compute nodes seems to fix things (jobs run now).
I don't want to have to worry about re-adding this every time a node
gets reinstalled etc. Does anyone know the right way to add this alias
to rocks so that it automatically gets added to the nodes when they
install or sync with the frontend?
To summarize: I'm looking for /etc/hosts on the compute nodes to contain
something like this:
10.1.1.1 frontend.local frontend
where "frontend" is the short name of the cluster's frontend server.
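Until someone posts the Rocks-native way to do this, a generic idempotent append could go in a node post-install step. This is not the Rocks answer Ed is asking for: `ensure_hosts_line` is a made-up helper name, and the example values are the ones from the message above.

```shell
# Append a hosts line only if that exact line is not already present,
# so re-running on sync/reinstall does not duplicate it.
ensure_hosts_line() {
    line=$1; file=$2
    grep -qxF "$line" "$file" || echo "$line" >> "$file"
}

# ensure_hosts_line "10.1.1.1 frontend.local frontend" /etc/hosts
```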
Thanks much.
-Ed