[Rocks-Discuss] PBS error


H P

Aug 21, 2009, 1:55:42 PM8/21/09
to npaci-rocks...@sdsc.edu

Dear All,

We are running Rocks cluster 4.2 with a frontend and 3 compute nodes. During data analysis we found that 2 jobs are hung and not moving. We are not able to kill those jobs as they are in running status. Is there any way to kill running PBS jobs?

Please provide a solution.

Regards
HP


Bart Brashers

Aug 21, 2009, 2:10:56 PM8/21/09
to Discussion of Rocks Clusters
Did you mean to write "qdel NNNNN (where NNNNN is the job id number)
doesn't work"?

You could, as root, ssh to the compute nodes, figure out the process id
of the main thing running (use ps), and use kill to make it die.

Or perhaps the compute nodes have frozen? Can you ssh to them? Ping
them?

To force a job to be removed regardless of whether the frontend can
communicate with the compute nodes (i.e. pbs_server can't talk to
pbs_mom on a compute node), use "qdel -p NNNNN".
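As a sketch, the ssh-and-kill approach might look like this, using a local "sleep" process as a stand-in for the hung job (on a real cluster you would first ssh to the compute node as root and identify the job's actual process with ps):

```shell
# Local stand-in for the hung job -- on a real cluster, run the ps/kill
# steps on the compute node itself (e.g. after "ssh compute-0-1") as root.
sleep 300 &                       # pretend this is the stuck job's process
pid=$!
ps -p "$pid" -o pid,comm          # confirm it is running, as "ps" would show
kill "$pid"                       # send SIGTERM; escalate to "kill -9" if it ignores TERM
wait "$pid" 2>/dev/null || true   # reap the process so the PID table is clean
```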

Bart

> We are running Rocks cluster 4.2 with a frontend and 3 compute nodes.
> During data analysis we found that 2 jobs are hung and not moving. We
> are not able to kill those jobs as they are in running status. Is
> there any way to kill running PBS jobs?
>
> Please provide a solution.
>
> Regards
> HP



Chris Powell

Aug 21, 2009, 5:32:35 PM8/21/09
to npaci-rocks...@sdsc.edu
Good Afternoon all,

I have been working on my Rocks 5.1 installation again. I have the
Torque/Maui roll installed and have not changed any settings. Jobs
are currently queuing but are not getting sent to the compute nodes. In
the pbs_server logs, I am getting a strange error about the maui account
not having permission to modify jobs. Any suggestions are welcome. Once
again, this is a fresh installation. No changes have been made.

08/21/2009 08:18:47;0020;PBS_Server;Job;2.cluster.domain.com;Unauthorized Request, request type: 11, Object: Job, Name: 2.cluster.domain.com, request from: ma...@cluster.domain.com
08/21/2009 08:18:47;0080;PBS_Server;Req;req_reject;Reject reply code=15007(Unauthorized Request MSG=operation not permitted), aux=0, type=ModifyJob, from ma...@cluster.domain.com


Chris Powell
IT Systems Technician
Arete Associates
(818)885-2470 Office
(818)640-8509 Cell
(818)541-6194 Pager
cpo...@arete.com




Bart Brashers

Aug 21, 2009, 6:05:04 PM8/21/09
to Discussion of Rocks Clusters

Two things come to mind:

1. Check if the maui user is in /etc/passwd (grep maui /etc/passwd). I
think it defaults to being uid 500, which may collide with your existing
users. Check if /opt/maui is owned by maui (ls -lF /opt/maui).

2. In the output of "qmgr -c 'print server'", do you see a line like
this:

set server managers += ma...@frontend.company.com

If not, then set it (pass that line to qmgr -c).
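A sketch of the two checks above at the shell ("frontend.company.com" is a placeholder for your frontend's hostname; these commands must run on a live Torque frontend, as root for the qmgr "set"):

```shell
# 1. Does the maui user exist, and does it own /opt/maui?
#    (A default uid of 500 may collide with existing users.)
grep maui /etc/passwd
ls -lF /opt/maui

# 2. Is maui listed among the server managers? If not, add it.
qmgr -c 'print server' | grep managers
qmgr -c 'set server managers += maui@frontend.company.com'
```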

Bart

H P

Aug 22, 2009, 4:06:42 AM8/22/09
to npaci-rocks...@sdsc.edu

I have tried to ssh to compute-0-1 and it's not working.

I have also tried ps on the head node and it seems no jobs are running on the head node.

Is there any other way I can trace a job id and delete it?

Regards

HP

> Date: Fri, 21 Aug 2009 11:10:56 -0700
> From: bbra...@environcorp.com
> To: npaci-rocks...@sdsc.edu
> Subject: Re: [Rocks-Discuss] PBS error


Vanush Misha

Aug 24, 2009, 5:10:54 AM8/24/09
to npaci-rocks...@sdsc.edu
On Sat, 22 Aug 2009 13:36:42 +0530
H P <linuxc...@msn.com> wrote:

>
> i have tried to do ssh to compute-0-1 and its not working.
>

Are you saying compute-0-1 is down at the moment? If it is, then PBS
still thinks those jobs are running, but any attempt to do anything
else with them fails (as the node is not responding). Those jobs will
disappear once you bring the node back.

Misha.


--
Vanush "Misha" Paturyan
Senior Technical Officer
Computer Science Department
NUI Maynooth

Bart Brashers

Aug 24, 2009, 12:30:29 PM8/24/09
to Discussion of Rocks Clusters

To delete a job when the compute node is not up, use "qdel -p JobID"
where JobID is the job id number -- printed by "qstat".

Bart

Chris Powell

Aug 24, 2009, 7:47:23 PM8/24/09
to Discussion of Rocks Clusters
Bart,

Thanks for the tip. After checking the output of qmgr -c 'print server',
I found it did not have any server managers set, yet maui and root are
listed in /opt/torque/pbs.default. I did a little more poking around and
found that for some reason the /opt/torque/server_priv/acl_svr/managers
file was not created. I simply touched the file, restarted pbs_server,
and it was populated with the maui and root account info. The server is
now working properly.

Thanks,
Chris


-----Original Message-----
From: Bart Brashers <bbra...@environcorp.com>
Reply-to: Discussion of Rocks Clusters <npaci-rocks...@sdsc.edu>
To: Discussion of Rocks Clusters <npaci-rocks...@sdsc.edu>
Subject: Re: [Rocks-Discuss] PBS error


H P

Aug 25, 2009, 10:55:14 AM8/25/09
to npaci-rocks...@sdsc.edu

I have tried qdel -p JobID, which didn't work for me.

After that problem I switched off that particular compute node and re-submitted the jobs. I then found the same problem: qstat and showq both show the jobs running. To verify, I used ssh and ps to check for running jobs, and could not find any job running.

So now when I do pbsnodes it says 1 node down and 2 nodes free, but qstat shows jobs assigned to it.

Can I go to /opt/torque/server_priv/jobs and get rid of the running status by removing the files in that folder?

Regards
HP

> Date: Mon, 24 Aug 2009 09:30:29 -0700


Gus Correa

Aug 25, 2009, 11:28:46 AM8/25/09
to Discussion of Rocks Clusters
Hi HP, list

At the very worst, for very sticky hung jobs,
what I do is root-ssh to the "Mother Superior" node,
i.e. the first node listed when you do "qstat -f jobnum"
on the hung job.
Then cd to $PBSHOME/mom_priv/jobs
and remove the three files associated
with that job there, something like:
$JOBNAME.JB, $JOBNAME.SC, and $JOBNAME.TK.
Then restart pbs_mom on that particular node (service pbs restart).

Note that the "Mother Superior" node may or may not be
the same node that failed / is offline now.

This is a dirty, brute-force method,
which should be avoided by trying "qdel" first, then "qdel -p",
and the other clean methods suggested by Bart.
However, for very sticky jobs it may be the only way to clean up.
On ancient versions of PBS that I used,
this was the only way to clean up,
as there was no "-p" option to qdel back then.
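A scratch-directory simulation of that cleanup (PBSHOME and the job name here are stand-ins; on a real node PBSHOME is typically /opt/torque, and the commands must run as root on the Mother Superior node):

```shell
# Simulate the mom_priv cleanup in a throwaway directory.
PBSHOME=$(mktemp -d)                  # stand-in for /opt/torque on the node
JOBNAME="1234.cluster"                # hypothetical job name from "qstat -f"
mkdir -p "$PBSHOME/mom_priv/jobs"
touch "$PBSHOME/mom_priv/jobs/$JOBNAME.JB" \
      "$PBSHOME/mom_priv/jobs/$JOBNAME.SC" \
      "$PBSHOME/mom_priv/jobs/$JOBNAME.TK" # fake the three job files
ls "$PBSHOME/mom_priv/jobs"              # the files a hung job leaves behind
rm -f "$PBSHOME/mom_priv/jobs/$JOBNAME".*  # the actual cleanup step
# On the real node, follow with: service pbs restart
```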

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Bart Brashers

Aug 25, 2009, 12:05:58 PM8/25/09
to Discussion of Rocks Clusters

To be able to qdel a job, the pbs_server process on the frontend must
communicate with the pbs_mom process on the compute node. So when the
compute node to which a job has been assigned is down, the communication
can't happen, and the frontend refuses to delete the job. That's why
qstat still shows the job "running" on that compute node. It's not
running now, but when the node comes up again, it will automatically
re-start the job.

Using "qdel -p 1234" for job id 1234 is supposed to ignore that failing
connection between frontend (pbs_server) and compute node (pbs_mom).

Try turning on the compute node again, letting it boot, then "qdel 1234"
the job.

Yes, you can delete the files on the frontend in
/opt/torque/server_priv/jobs/, the ones with the correct job id number.
Then do a "service pbs_server restart" and they should disappear from
the output of "qstat".

Bart
