[slurm-dev] Confusing JobState Reason for Pending due to TimeLimit

E.M. Dragowsky

Jul 5, 2016, 10:24:01 AM

to slurm-dev

Greetings --

A few users have had jobs pending with the reason ReqNodeNotAvail(Unavailable:<nodename1>,<nodename2>,...).

We have determined that the jobs were in fact pending because the requested TimeLimit exceeded the time remaining before a maintenance shutdown, which we manage by making a global reservation on all nodes, all partitions.

This reason code does little to explain why the jobs are pending. Resubmitting the jobs with an appropriate TimeLimit allows them to run immediately. The jobs were therefore pending due to "excessive time requested".
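For illustration, here is a minimal Python sketch of the comparison described above: does the requested TimeLimit fit in the time remaining before the maintenance reservation starts? The function names and the dates in the usage note are hypothetical, and treating a job that would end exactly at the reservation start as fitting is an assumption, not confirmed scheduler behavior. The time formats handled are those documented for sbatch --time.

```python
from datetime import datetime, timedelta


def parse_slurm_time(spec: str) -> timedelta:
    """Parse a Slurm time-limit string into a timedelta.

    Handles the forms documented for sbatch --time: "minutes",
    "minutes:seconds", "hours:minutes:seconds", "days-hours",
    "days-hours:minutes" and "days-hours:minutes:seconds".
    """
    days = 0
    if "-" in spec:
        # With a day count, the remaining fields read hours[:minutes[:seconds]].
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        else:
            hours, minutes, seconds = fields
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def fits_before_reservation(time_limit: str, now: datetime,
                            reservation_start: datetime) -> bool:
    """Return True if a job started now would finish by reservation_start.

    Assumption: a job ending exactly at the reservation start still fits.
    """
    return now + parse_slurm_time(time_limit) <= reservation_start
```

With a hypothetical reservation starting at 08:00 on Jul 6 and a submission at 10:00 on Jul 5, fits_before_reservation("1-00:00:00", ...) is False while fits_before_reservation("20:00:00", ...) is True, which matches the behavior we saw when users resubmitted with a shorter TimeLimit.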

To perhaps help knowledgeable developers understand why the Reason ReqNodeNotAvail appeared, I note that the nodes listed above are actually being drained in advance of updates. I am happy to provide further information as needed.

Best wishes,

~ Emily

----------------------------------

E.M. Dragowsky, Ph.D.

ITS -- Research Computing

Case Western Reserve University

(216) 368-0082

John DeSantis

Sep 20, 2016, 3:21:21 PM

to slurm-dev

Emily,

What version of SLURM are you running?

We are running version 15.08.4 and have just run into the same issue.

There was a bug report filed [1], and it states that the issue was
corrected in version 14.08.11.

Thanks,
John DeSantis

[1] https://bugs.schedmd.com/show_bug.cgi?id=1614

Ryan Novosielski

Sep 20, 2016, 3:31:11 PM

to slurm-dev

It’s pretty certain that setting a maintenance reservation instead of draining nodes in advance of the reservation would at least change the message. I’m not sure if it will make more sense or not, but I’d think it might.

> On Jul 5, 2016, at 10:23 AM, E.M. Dragowsky <drag...@case.edu> wrote:
>
> Greetings --
>
> A few users have experienced jobs pending due to the reason ReqNodeNotAvail(Unavailable:<nodename1>,<nodename2>,...)
> We have determined that the jobs in fact were pending due to asking for TimeLimit > "time remaining before maintenance shutdown" -- managed by making a global reservation on all nodes, all partitions.
>
> This reason code is not very helpful in understanding the reason for the jobs being pending. Resubmitting the jobs with appropriate TimeLimit allows the jobs to run immediately. The jobs were therefore pending due to "excessive time requested".
>
> To perhaps help knowledgeable developers understand why the Reason ReqNodeNotAvail appeared, I note that the nodes listed in the above are actually being drained in advance of updates. Happy to provide further information as needed.

--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'
