[Rocks-Discuss] rocks 5.4 - pbs jobs deferred

Edward Walter

Jun 20, 2011, 1:18:20 PM
to Discussion of Rocks Clusters
Hello List,

We recently upgraded (reinstalled) one of our Rocks 5.0 clusters with
Rocks 5.4. Now, we're seeing a problem where jobs automatically go into a
deferred state rather than running even though there are free
resources. We didn't see this problem with PBS on our previous
installation so I don't believe this is a hardware or network issue.

Here are the rolls we're running:

NAME              VERSION  ARCH    ENABLED
base:             5.4      x86_64  yes
hpc:              5.4      x86_64  yes
os:               5.4      x86_64  yes
kernel:           5.4      x86_64  yes
ganglia:          5.4      x86_64  yes
web-server:       5.4      x86_64  yes
service-pack:     5.4.2    x86_64  yes
torque:           5.4.0    x86_64  yes
intel-developer:  5.0      x86_64  yes

Doing a 'qstat -f' on a deferred job shows which node it has been
assigned to (let's say compute-3-3 for example).

Running 'pbsnodes -l' does not list the node (compute-3-3) as down.

When I check the node in question (compute-3-3), it shows that pbs is running:

> # ssh compute-3-3 'ps auxwww |grep pbs'
> root 3762 0.0 0.0 17376 4924 ? SLs Jun14 6:48 /opt/torque/sbin/pbs_mom
> root 21902 0.0 0.0 65892 1272 ? Ss 13:01 0:00 bash -c ps auxwww |grep pbs
> root 21928 0.0 0.0 61124 676 ? S 13:01 0:00 grep pbs

Attempting to query the mom on the node directly from the frontend seems
to fall over though:
> # momctl -h compute-3-3 -d3
> ERROR: query[0] 'diag3' failed on compute-3-3 (errno=0-Success: 5-Input/output error)

Some nodes accept the momctl query while others do not (so things are
not broken globally).
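
To get a list of which nodes refuse the query, the momctl call above can be wrapped in a loop. This is only a rough sketch: the function name find_wedged_moms and the MOMCTL override variable are made up for illustration, and it treats either a non-zero exit status or an "ERROR:" line in the output (as in the failure shown above) as a refusal.

```shell
#!/bin/sh
# Read hostnames on stdin, one per line, and print the ones whose pbs_mom
# does not answer a momctl diagnostic query.
# MOMCTL is a hypothetical override so the loop can be tried without a cluster.
find_wedged_moms() {
    while read -r node; do
        [ -n "$node" ] || continue
        # </dev/null keeps the query from consuming the host list on stdin.
        out=$(${MOMCTL:-momctl} -h "$node" -d 3 2>&1 </dev/null) || {
            echo "$node"
            continue
        }
        # Some failures may still exit 0, so also look for an ERROR line.
        case $out in
            *ERROR:*) echo "$node" ;;
        esac
    done
}
```

It can be fed from any host list, for example a plain text file with one node name per line: find_wedged_moms < nodes.txt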

We're also seeing messages like these in the logs on the frontend:
> Jun 20 12:50:08 <frontend> PBS_Server: LOG_ERROR::Access from host not
> allowed, or unknown host (15008) in send_job, child failed in previous
> commit request for job 1651268.<frontend>
> Jun 20 12:50:09 warp PBS_Server: LOG_ERROR::Access from host not
> allowed, or unknown host (15008) in send_job, child failed in previous
> commit request for job 1651269.<frontend>
> Jun 20 12:54:00 compute-3-3.local pbs_mom: LOG_ERROR::Success (0) in
> rm_request, bad attempt to connect - unauthorized (port: 1022)
> message refused from port 1022 addr 10.1.1.1
> Jun 20 13:01:01 compute-3-31.local pbs_mom: LOG_ERROR::Connection
> refused (111) in scan_for_exiting, cannot connect to port 1023 in
> client_to_svr - connection refused
> Jun 20 13:02:33 compute-3-3.local pbs_mom: LOG_ERROR::Success (0) in
> rm_request, bad attempt to connect - unauthorized (port: 1022)
> message refused from port 1022 addr 10.1.1.1

Sometimes restarting pbs on the node seems to fix things, but restarting
pbs on the wedged nodes gives inconsistent results. In some cases,
'/etc/init.d/pbs restart' claims to restart the daemon but the PID
doesn't change (which makes me suspicious). If I restart the daemon on
only one node, it seems to start accepting jobs properly. If I restart
pbs on all of the wedged nodes (using rocks run host or tentakel), I
still end up with a bunch of nodes that won't accept jobs from the
frontend.

Looking for help or advice on where to go next with this. Is anyone
else seeing similar behavior?

Thanks much.

-Ed Walter
Carnegie Mellon University

Roy Dragseth

Jun 23, 2011, 8:29:44 AM
to Discussion of Rocks Clusters

You wouldn't happen to have another appliance assigned the same rack and rank
as one or more of your compute nodes? I've seen cases in the past where this
messes up DNS and causes problems with the reverse host lookups that torque
needs in order to work.

Take a look at the output of

rocks list host

and check that all RACK and RANK combinations are unique.

r.

--

The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dr...@uit.no

Edward Walter

Jun 23, 2011, 12:47:13 PM
to npaci-rocks...@sdsc.edu
Hi Roy,

I verified that we did not have any rack/rank collisions. We worked
around this by adding the frontend's public and private interfaces to
/opt/torque/mom_priv/config. I haven't had any jobs deferred since I
made that change.

For the frontend's internal interface, I used the IP (10.1.1.1). For
the external interface, I used the FQDN.
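
For anyone wanting to script the same change across nodes, here is a minimal sketch. The message above does not name the directive that was added, so $pbsserver is an assumption here (it is the standard pbs_mom directive for naming a trusted server host); the function name add_mom_server and the hostname frontend.example.edu are likewise placeholders.

```shell
#!/bin/sh
# Append a "$pbsserver <host>" line to a pbs_mom config file unless that
# exact line is already present, so the function is safe to run repeatedly.
# ASSUMPTION: $pbsserver is the directive wanted; the thread does not say.
add_mom_server() {
    config=$1
    host=$2
    line="\$pbsserver $host"
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -e "$line" "$config" 2>/dev/null ||
        printf '%s\n' "$line" >> "$config"
}

# Example (placeholder FQDN), run on each compute node:
#   add_mom_server /opt/torque/mom_priv/config 10.1.1.1
#   add_mom_server /opt/torque/mom_priv/config frontend.example.edu
```

pbs_mom reads its config at startup, so the daemon on each node needs a restart ('/etc/init.d/pbs restart') before the new entries take effect.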


I had one other observation while working with the mom_priv config
file... the $usecp directive gets automatically added for NAS appliances
that have been added to the cluster. The directive that gets added
seems to make the assumption that the NAS appliances are mounting things
in /home. It's not clear to me what the correct location is for NAS
exports (we've been mounting them under /share/data*). Given this
ambiguity, it might warrant a note in the roll docs so that people can
configure appropriate defaults for their sites.

-Ed

ps. Thanks for all the effort you put in to make this roll available.
We find it really useful.
