[slurm-dev] job requeued in held state

Anatoliy Kovalenko

Jul 6, 2015, 4:55:57 PM
to slurm-dev
Hello. We have a job that is pending with the reason "job requeued in held state".
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      8     part1     test      bob PD       0:00      1 (job requeued in held state)
What does it mean? Other jobs run fine, but this one hangs. "scontrol resume" and "scontrol requeue" don't help. In the Slurm log we see:
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: 8: Job is pending execution
[2015-07-06T20:31:18.469] Processing RPC: REQUEST_SUSPEND(resume) from uid=0
[2015-07-06T20:31:18.469] _slurm_rpc_suspend(resume) for 8 Job is pending execution
What can we do to let the job continue without breaking or cancelling it?

Qianqian Sha

Jul 6, 2015, 9:45:01 PM
to slurm-dev
Hi, Anatoliy

Maybe you can try "scontrol release <job_id>".
A node failure can cause batch jobs to be requeued.
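
Before releasing by hand, it can help to list which pending jobs are actually in this state. The following is a minimal sketch, not an official tool: the function name held_requeued_ids is made up for illustration, and it assumes the reason text matches the string shown in the squeue output above.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of jobs whose pending reason is
# "job requeued in held state". It reads "<jobid> <reason>" lines on stdin,
# for example the output of:  squeue -h -t PD -o "%i %r"
# Assumption: the reason text is exactly the string seen in this thread.
held_requeued_ids() {
    awk '{
        id = $1
        $1 = ""                      # drop the job ID, leaving only the reason
        gsub(/^[ (]+|[ )]+$/, "")    # trim spaces and optional parentheses
        if ($0 == "job requeued in held state") print id
    }'
}
```

On a cluster, the output of `squeue -h -t PD -o "%i %r" | held_requeued_ids` would be the job IDs to pass to "scontrol release <job_id>".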

Sean Blanton

Jul 7, 2015, 2:25:52 AM
to slurm-dev
Yes, this has been a big problem for us since upgrading from 2.6.1 to 14.11.7. We have had to schedule a cron job that runs every 5 minutes to release held jobs. We notice this may happen when there is a WIFEXITED status of zero, but we haven't nailed down the cause(s). When we simply release the jobs, they requeue and most often succeed normally. We are wondering if there is a configuration setting that would make them requeue without being held.

Regards,
Sean
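
A periodic release like the cron workaround described above could be sketched roughly as follows. This is an illustration under assumptions, not the script actually used at that site: the function name, the SQUEUE/SCONTROL override variables, and the script path in the crontab comment are all hypothetical, and the reason string is assumed to match the one reported in this thread.

```shell
#!/bin/sh
# Hypothetical workaround script: release every pending job whose reason is
# "job requeued in held state". SQUEUE and SCONTROL can be overridden,
# which makes a dry run possible without touching the scheduler.
SQUEUE=${SQUEUE:-squeue}
SCONTROL=${SCONTROL:-scontrol}

release_requeued_held() {
    # -h: no header, -t PD: pending jobs only, %i: job ID, %r: reason
    "$SQUEUE" -h -t PD -o "%i|%r" |
    while IFS='|' read -r id reason; do
        if [ "$reason" = "job requeued in held state" ]; then
            "$SCONTROL" release "$id" && echo "released $id"
        fi
    done
}

# A real script would end with a call to release_requeued_held and could be
# run from a crontab entry such as (path is a placeholder):
#   */5 * * * * /path/to/release_requeued_held.sh
```

Matching on the exact reason string keeps the loop from releasing jobs that a user or administrator has held on purpose.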

Sean Blanton

Jul 7, 2015, 4:02:59 AM
to slurm-dev
Also, I'm not sure why my emails are being posted twice. Apologies, I'll look into it.

Regards,
Sean

Anatoliy Kovalenko

Jul 7, 2015, 7:31:56 AM
to slurm-dev

2015-07-07 4:45 GMT+03:00 Qianqian Sha <qqsh...@gmail.com>:
Maybe you can try "scontrol release <job_id>".

Thank you for your advice. The problem is solved now.

Moe Jette

Jul 7, 2015, 11:34:55 AM
to slurm-dev

Do you have either of these configured in slurm.conf?

$ scontrol show config | grep Requeue
RequeueExit = (null)
RequeueExitHold = (null)
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support

Anatoliy Kovalenko

Jul 7, 2015, 12:06:55 PM
to slurm-dev

2015-07-07 18:34 GMT+03:00 Moe Jette <je...@schedmd.com>:
Do you have either of these configured in slurm.conf?

$ scontrol show config | grep Requeue
RequeueExit             = (null)
RequeueExitHold         = (null)

$ scontrol show config | grep Requeue
JobRequeue              = 1

Sean Blanton

Jul 7, 2015, 1:22:59 PM
to slurm-dev

I have:

$ scontrol show config | grep -i requeue
JobRequeue              = 1
RequeueExit             = (null)
RequeueExitHold         = (null)

Also, I'm only sending each mail once to 'slurm-dev@schedmd.com', but it appears to be double-posted from within Gmail...


Regards,
Sean

Anatoliy Kovalenko

Jul 19, 2015, 8:32:52 AM
to slurm-dev
Can the "PrivateData=accounts,users,usage,jobs" option in slurmdbd.conf influence this behavior?
--
¯\_(ツ)_/¯