[slurm-dev] Confusing JobState Reason for Pending due to TimeLimit


E.M. Dragowsky

Jul 5, 2016, 10:24:01 AM
to slurm-dev
Greetings --

A few users have had jobs pending with the reason ReqNodeNotAvail(Unavailable:<nodename1>,<nodename2>,...).
We have determined that the jobs were in fact pending because the requested TimeLimit exceeded the time remaining before a maintenance shutdown, which we manage with a global reservation on all nodes in all partitions.

This reason code does not help users understand why their jobs are pending. Resubmitting the jobs with a TimeLimit that fits before the reservation allows them to run immediately, so the jobs were effectively pending due to "excessive time requested".
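
For anyone wanting to check a job by hand, here is a minimal sketch of the comparison described above: does the requested TimeLimit fit in the time left before the reservation starts? It assumes the time-limit formats listed in the sbatch man page. The function names and the example dates are hypothetical, and it does not query Slurm or reproduce the scheduler's actual logic.

```python
from datetime import datetime, timedelta


def parse_time_limit(text):
    """Convert a Slurm time-limit string to a timedelta.

    Forms handled (as listed in the sbatch man page): "minutes",
    "minutes:seconds", "hours:minutes:seconds", "days-hours",
    "days-hours:minutes", "days-hours:minutes:seconds".
    """
    days = 0
    if "-" in text:
        day_part, rest = text.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        # After a day count the fields read hours[:minutes[:seconds]].
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in text.split(":")]
        if len(parts) == 1:
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:
            hours = 0
            minutes, seconds = parts
        else:
            hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def fits_before_reservation(time_limit, now, reservation_start):
    """Return True if a job started at `now` would end before the reservation."""
    return now + parse_time_limit(time_limit) <= reservation_start


if __name__ == "__main__":
    # Hypothetical times: 22 hours remain before the maintenance reservation.
    now = datetime(2016, 7, 5, 10, 0, 0)
    reservation_start = datetime(2016, 7, 6, 8, 0, 0)
    for limit in ("2-00:00:00", "12:00:00"):
        print(limit, fits_before_reservation(limit, now, reservation_start))
```

With these example times, a two-day request does not fit and a twelve-hour request does, which matches the behavior we saw when users resubmitted with a shorter TimeLimit.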

To perhaps help knowledgeable developers understand why the Reason ReqNodeNotAvail appeared, I note that the nodes listed in the reason string above are actually being drained in advance of updates. Happy to provide further information as needed.

Best wishes,
~ Emily

----------------------------------
E.M. Dragowsky, Ph.D.
ITS -- Research Computing
Case Western Reserve University

John DeSantis

Sep 20, 2016, 3:21:21 PM
to slurm-dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Emily,

What version of SLURM are you running?

We are running version 15.08.4 and have just run into the same issue.

There was a bug report filed [1], and it states that the issue was
corrected in version 14.08.11.

Thanks,
John DeSantis

[1] https://bugs.schedmd.com/show_bug.cgi?id=1614
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCAAGBQJX4YtrAAoJEEmckBqrs5nBaA4H/0OevfFtiMbfynmhb1d9tKAF
HEXM9T4REp5MimrTJoD/W9rshvAIYLD+hOVlLflsKjQ2E63KEu2UlzcVdZ6maI/x
os4u1oAVtolfpecGfdj9cG+qFubkiu8+6lPzioBay2lSFZa0EJbm8p6eJDub+jHV
CWfi+yaP9V1YYVkehb+0Rbwp47d+xsSfA2Lgs89rw3+O1bUXp9tLgOCbHjA1B1kx
HaSm+uhXTOFi61N3YnoYsnFoHuFjP+XTDHy5Mh5QQobYSMwyrFwvL1HZ76bjZ2kc
vWUJEB3wX8y1nHM0Hxmsax4wD0pRdr3oi63cvuDlth7o9UV5dn3RC3PVcwDtI6I=
=xcQr
-----END PGP SIGNATURE-----

Ryan Novosielski

Sep 20, 2016, 3:31:11 PM
to slurm-dev
It’s pretty certain that setting a maintenance reservation instead of draining nodes in advance of the reservation would at least change the message. I’m not sure if it will make more sense or not, but I’d think it might.

> On Jul 5, 2016, at 10:23 AM, E.M. Dragowsky <drag...@case.edu> wrote:
>
> Greetings --
>
> A few users have experienced jobs pending due to the reason ReqNodeNotAvail(Unavailable:<nodename1>,<nodename2>,...)
> We have determined that the jobs in fact were pending due to asking for TimeLimit > "time remaining before maintenance shutdown" -- managed by making a global reservation on all nodes, all partitions.
>
> This reason code is not very helpful in understanding the reason for the jobs being pending. Resubmitting the jobs with appropriate TimeLimit allows the jobs to run immediately. The jobs were therefore pending due to "excessive time requested".
>
> To perhaps help knowledgeable developers understand why the Reason ReqNodeNotAvail appeared, I note that the nodes listed in the above are actually being drained in advance of updates. Happy to provide further information as needed.

--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'
