[slurm-dev] Duplicate jobid, launch failed?

Robbert Eggermont

Mar 16, 2016, 6:21:27 AM
to slurm-dev

Hello all,

Twice now I've found a node draining with reason "Duplicate jobid".

The slurmctld log shows:
backfill: Started JobId=X in <...> on Y
_slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
email msg to <...>: SLURM Job_id=X Name=<...> Failed, Run time 00:00:00, PENDING, ExitCode 0
drain_nodes: node Y state set to DRAIN
error: Duplicate jobid on nodes Y, set to state DRAIN
Requeuing JobID=X State=0x0 NodeCnt=0

Job X shows reason "launch failed requeued held".

I'm guessing job X is the offending job here.
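
To check how often this happens and which job goes with which drained node, the log could be scanned with something like the sketch below. The patterns are taken only from the two lines quoted above; the function name is made up for illustration, and real slurmctld log lines carry a timestamp prefix and may be worded differently in other versions, so treat this as an assumption-laden starting point rather than a tested tool.

```python
import re

# Patterns based only on the log lines quoted in this message; other
# Slurm versions may word these messages differently (assumption).
DRAIN_RE = re.compile(r"error: Duplicate jobid on nodes (\S+), set to state DRAIN")
REQUEUE_RE = re.compile(r"Requeuing JobID=(\S+)")


def find_duplicate_jobid_events(lines):
    """Pair each "Duplicate jobid" drain with the next requeued job ID.

    Returns a list of (node, job_id) tuples in the order they appear.
    """
    events = []
    pending_node = None
    for line in lines:
        drain = DRAIN_RE.search(line)
        if drain:
            # Remember the drained node until the matching requeue shows up.
            pending_node = drain.group(1)
            continue
        requeue = REQUEUE_RE.search(line)
        if requeue and pending_node is not None:
            events.append((pending_node, requeue.group(1)))
            pending_node = None
    return events


if __name__ == "__main__":
    sample = [
        "drain_nodes: node Y state set to DRAIN",
        "error: Duplicate jobid on nodes Y, set to state DRAIN",
        "Requeuing JobID=X State=0x0 NodeCnt=0",
    ]
    for node, job_id in find_duplicate_jobid_events(sample):
        print(f"node {node} drained, job {job_id} requeued")
```

Pairing a drain with the next requeue line is a heuristic based on the ordering seen here; on a busy controller, unrelated requeues could land in between.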

Is it expected behaviour that a failed job launch is handled as a
duplicate jobid? If so, can anybody elaborate on this, and do I need to
do anything (besides resuming the node)?

Or is this a bug? (Caused by the timing of the requeue?)

Best,

Robbert

--
Robbert Eggermont Intelligent Systems
R.Egg...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology