[Rocks-Discuss] Jobs stalled in queue (Torque)


Huiqun Zhou

Dec 16, 2010, 5:31:40 AM
to npaci-rocks...@sdsc.edu
Hi list-users,

After a power outage, I had to reinstall most of my compute nodes. Everything looks
OK except Torque, which always leaves jobs in the "Q" state. Restarting maui and
pbs_server on the frontend, and/or pbs_mom on the nodes, does not help.
I then ran checkjob and diagnose and got the results below. They report that
the problem is NoResources, but in fact all my nodes are available and I'm the only
user. Running "pbsnodes -a | grep 'state ='" shows that the nodes are all in the "free"
state.
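
For a per-state count instead of a list of matching lines, the same pbsnodes output can be summarized with a small shell function. This is a minimal sketch and not part of the original report: the name count_states is hypothetical, and it assumes the usual "state = <value>" lines that "pbsnodes -a" prints for each node.

```shell
#!/bin/sh
# count_states (hypothetical helper): read "pbsnodes -a" style output on
# stdin and print one line per node state with the number of nodes in it.
# Only lines whose first field is exactly "state" are counted, so the
# longer "status = ..." line of each node is ignored.
count_states() {
    awk '$1 == "state" && $2 == "=" { n[$3]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

On the frontend it would be used as "pbsnodes -a | count_states". If every node is healthy from Torque's point of view, the only line printed should be the one for "free".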

Any idea?


zhou huiqun
@earth sciences, nanjing university, china

============================================================
# checkjob 639
checking job 639

State: Idle EState: Deferred
Creds: user:zhou_huiqun group:quantum class:default qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Dec 16 17:47:49
(Time Queued Total: 00:02:36 Eligible: 00:00:00)

Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]


IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE

job is deferred. Reason: NoResources (exceeds available partition procs)
Holds: Defer (hold reason: NoResources)
PE: 1.00 StartPriority: 2
cannot select job 639 for partition DEFAULT (job hold active)

=============================================================
# diagnose -j 639
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features

639 Idle ALL 1 DEF 99:23:59:59 0 1 hqzhou quantum - 00:36:44 [NONE] [NONE] [NONE] >=0 >=0 NC0 [default:1] [NONE]

Roy Dragseth

Dec 16, 2010, 7:07:45 AM
to Discussion of Rocks Clusters
Maui detected that all nodes were down, so no job had enough resources
available to start and the jobs were deferred. The fastest way to fix this is to run
releasehold on all queued jobs:

qselect -s Q | cut -f 1 -d . | xargs -i releasehold {}
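
The one-liner above does three things: it lists the queued jobs, cuts each job ID down to its numeric part, and calls releasehold once per job. The same steps can be sketched as two small shell functions. This is an illustration under stated assumptions, not the author's script: the names job_numbers and release_all, the DRY_RUN switch, and the example ID "639.frontend.local" are all hypothetical.

```shell
#!/bin/sh
# job_numbers (hypothetical helper): turn Torque job IDs such as
# "639.frontend.local", one per line on stdin, into bare job numbers
# by keeping only the text before the first dot.
job_numbers() {
    cut -d . -f 1
}

# release_all (hypothetical helper): call Maui's releasehold once for
# every job number read from stdin. With DRY_RUN=1 the commands are
# printed instead of being executed.
release_all() {
    while read -r job; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "releasehold $job"
        else
            releasehold "$job"
        fi
    done
}
```

Usage on the frontend would be "qselect -s Q | job_numbers | release_all". Running it first with DRY_RUN=1 exported shows which jobs would be touched without changing any holds.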

r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dr...@uit.no

hqzhou

Dec 16, 2010, 9:00:07 AM
to Discussion of Rocks Clusters
Roy,

Thanks for your quick response.

I have tried running releasehold on the jobs, but nothing changed, even though
the message said "job holds adjusted". Both the jobs already in the
queue and newly submitted ones are still in the "Q" state.


Huiqun


----- Original Message -----
From: Roy Dragseth <roy.dr...@uit.no>
To: Discussion of Rocks Clusters <npaci-rocks...@sdsc.edu>
Sent: Thu, 16 Dec 2010 20:07:45 +0800 (CST)
Subject: Re: [Rocks-Discuss] Jobs stalled in queue (Torque)

Huiqun Zhou

Dec 17, 2010, 12:40:24 AM
to Discussion of Rocks Clusters
Roy,

Thank you very much for your quick response.

I have tried running releasehold, but nothing changed. Both the jobs already
in the queue and newly submitted ones are still in the "Q" state.

I apologize to the list if you received two mails from me with similar content.
I can't tell whether my postings reached the list, because I neither receive
my own mail via the list nor find my posts in the mailing list archive.


huiqun

