Hello. We have a job that squeue reports with the reason "job requeued in held state":
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    8     part1 test  bob PD 0:00     1 (job requeued in held state)
What does this mean? Other jobs run fine, but this one is stuck. Neither scontrol resume nor scontrol requeue helps. In the Slurm log we see:
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: 8: Job is pending execution
[2015-07-06T20:31:18.469] Processing RPC: REQUEST_SUSPEND(resume) from uid=0
[2015-07-06T20:31:18.469] _slurm_rpc_suspend(resume) for 8 Job is pending execution
What can we do to let the job continue running without breaking or cancelling it?