[Rocks-Discuss] Torque/PBS crash


wmi...@mail.usp.edu

Aug 20, 2009, 6:51:37 PM
to Rocks Discussion

Greetings all,
I was wondering if someone could help me out. Torque seems to have crashed, and I keep getting the same error when I try to restart pbs and pbs_server in either order (output below). Does anyone have any suggestions, or a way to completely restart Torque? Any suggestions are welcome and appreciated!

Thanks,
Whelton

[root@indium init.d]# ./pbs_server start
Starting pbs_server: PBS_Server: process_host_name_part, host compute-0-0:2 not found
PBS_Server: pbsd_init(setup_nodes), could not create node "compute-0-0:2", error = 15062
PBS_Server: PBS_Server, pbsd_init failed
[FAILED]
[root@indium init.d]# ./pbs start
Starting PBS
PBS_Server: process_host_name_part, host compute-0-0:2 not found
PBS_Server: pbsd_init(setup_nodes), could not create node "compute-0-0:2", error = 15062
PBS_Server: PBS_Server, pbsd_init failed
PBS server
Warning: can not open holidays file, assuming 24hr primetime: No such file or directory
Error opening file dedicated_time: No such file or directory
Warning: resource group file error, fair share will not work: No such file or directory
In token_acct_open filed to open file /opt/torque/sched_priv/accounting/20090820
acct_open: No such file or directory
PBS sched


Bart Brashers

Aug 20, 2009, 7:04:54 PM
to Discussion of Rocks Clusters

You should not start /etc/init.d/pbs, it's just "there".
/etc/init.d/pbs_server and /etc/init.d/maui are the ones to use.

It looks like your /opt/torque/server_priv/nodes file has some oddities.
What does it contain? Mine looks like this:

# cat /opt/torque/server_priv/nodes
compute-0-0 np=4
compute-0-1 np=4
compute-0-2 np=4
compute-0-3 np=4

etc...

Bart



wmi...@mail.usp.edu

Aug 20, 2009, 7:49:06 PM
to Rocks Discussion

Thanks... this is what I'm getting.

[root@indium ~]# cat /opt/torque/server_priv/nodes
compute-0-0:2
compute-0-1:2
compute-0-2:2
compute-0-3:2
compute-0-4:2
compute-0-5:2
compute-0-6:2
compute-0-7:2
compute-0-8:2
compute-0-9:2
compute-0-10:2
compute-0-11:2
compute-0-12:2



Bart Brashers

Aug 20, 2009, 8:17:03 PM
to Discussion of Rocks Clusters

Very different from mine, so unless Torque has undergone some major
revisions since Rocks 5.0, your file is messed up. I assume pbs_server
isn't running (use "ps" to check, and "service pbs_server stop" if it
is). Edit that file, and make it look like mine. Then run "service
pbs_server start" again.

Note that my computes are quad-core, thus the np=4 (np means number of
processors). Obviously, if you have more cores, use np=8 or whatever is
right.
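
Before editing, it can help to see exactly which entries are malformed. The following is only a sketch, assuming the bad entries all take the host:count form shown earlier in the thread. The function name find_colon_entries is hypothetical, and the demo runs on a scratch file rather than the live /opt/torque/server_priv/nodes.

```shell
# Print, with line numbers, every entry whose host field ends in ":<digits>".
# Comment lines (starting with "#") are ignored.
find_colon_entries() {
    grep -nE '^[^#[:space:]]+:[0-9]+([[:space:]]|$)' "$1"
}

# Demo on a scratch file: one colon-style entry, one well-formed entry.
sample=$(mktemp)
printf 'compute-0-0:2\ncompute-0-1 np=4\n' > "$sample"
find_colon_entries "$sample" || echo "no colon-style entries found"
rm -f "$sample"
```

On the frontend the same function could be pointed at the real nodes file, after making a backup copy of it.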

This file should get re-generated when you do a "rocks sync config".
Have you tried that already? If you did, and that's what it wrote, then
somehow your database has gotten messed up. What's the output of "rocks
list host compute"? It should look like this:

# rocks list host compute
HOST MEMBERSHIP CPUS RACK RANK COMMENT
compute-0-0: Compute 4 0 0 -------
compute-0-1: Compute 4 0 1 -------
compute-0-2: Compute 4 0 2 -------
compute-0-3: Compute 4 0 3 -------
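
As a cross-check, the listing above can be turned into the lines the nodes file would be expected to hold. This is a rough sketch, assuming the column layout shown here (host name with a trailing colon in column 1, CPU count in column 3). The function name host_list_to_nodes is hypothetical, and the demo feeds it sample text instead of calling rocks.

```shell
# Turn "rocks list host compute" style output into Torque nodes-file lines.
# Assumes the column layout shown above: HOST, MEMBERSHIP, CPUS, ...
host_list_to_nodes() {
    awk 'NR > 1 && NF >= 3 {
        host = $1
        sub(/:$/, "", host)      # drop the trailing colon after the host name
        print host " np=" $3     # third column is the CPU count
    }'
}

# Demo with sample text copied from the listing above.
printf '%s\n' \
    'HOST MEMBERSHIP CPUS RACK RANK COMMENT' \
    'compute-0-0: Compute 4 0 0 -------' \
    'compute-0-1: Compute 4 0 1 -------' | host_list_to_nodes
```

On a live frontend the input would come from "rocks list host compute" itself, and the result could be compared with the existing nodes file before changing anything.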

Bart

Gus Correa

Aug 20, 2009, 10:30:23 PM
to Discussion of Rocks Clusters
Hi Whelton

Just follow Bart's suggestions and his Torque node file syntax,
and restart the pbs_server.
That's the syntax Torque expects.

The syntax you used, with colons, suggests
that you copied an MPICH2 "machines" file to the Torque nodes file.
The two files use different syntaxes, although they look similar.

The MPICH2 machines file uses colons to separate the
node name from the number of processors/cores,
followed optionally by space and
ifhn=IP-address of the interface to use for MPI.

OTOH, the Torque node file uses the node name
followed by spaces,
and then the sequence

np=number_of_processors/cores on your nodes,

followed optionally by any
character strings you may want
to use to set the nodes' "properties/features".
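
To make the difference concrete, here is a small sketch that rewrites MPICH2-style host:count lines into the Torque form. It assumes the count after the colon is the value wanted for np, which may not match the real core count of the nodes, and it drops anything after the count (such as ifhn=...). The function name machines_to_torque_nodes is hypothetical.

```shell
# Convert MPICH2-style "host:count" lines into Torque "host np=count" lines.
# Lines that do not start with "name:digits" are passed through unchanged.
machines_to_torque_nodes() {
    sed -E 's/^([^[:space:]:]+):([0-9]+).*$/\1 np=\2/'
}

# Demo on two of the entries from the broken nodes file.
printf 'compute-0-0:2\ncompute-0-1:2\n' | machines_to_torque_nodes
```

Writing the result to a new file and reviewing it before replacing the nodes file keeps the original available if the np values need adjusting.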

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

wmi...@mail.usp.edu

Aug 21, 2009, 11:58:12 AM
to Rocks Discussion


Thanks a lot Bart and Gus!
That was exactly the problem. I know someone was trying to configure mpich2 and may have accidentally changed the node file.

Thanks again,
Whelton

