On 3/8/22 10:20 pm, Gerhard Strangar wrote:
> With a fake license called reboot?
It's a neat idea, but I think there is a catch:
* 3 jobs start, each taking 1 license
* Other reboot jobs are all blocked
* Running reboot jobs trigger node reboot
* Running reboot jobs end when either the script exits and slurmd cleans
  it up before the reboot kills it, or it gets killed as NODE_FAIL when
  the node has been unresponsive for too long and is marked as down
* Licenses for those jobs are released
* 3 more reboot jobs start whilst the original 3 are rebooting
* 6 nodes are now rebooting
* Filesystem fall down go boom
* Also your rebooted nodes are now drained as "Node unexpectedly rebooted"
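
To make the sequence above concrete, here is a rough toy model in Python. It is not Slurm code and does not talk to Slurm at all; the function name, the tick-based timing and the numbers are all made up for illustration. The only assumption it encodes is the one from the list: a license is held for as long as the job runs, not for as long as the node takes to come back.

```python
def simulate(num_nodes, licenses, job_ticks, reboot_ticks):
    """Toy model: return the peak number of nodes rebooting at once.

    job_ticks    -- how long a reboot job holds its license (hypothetical)
    reboot_ticks -- how long a node takes to come back (hypothetical)
    """
    pending = num_nodes   # queued reboot jobs, one per node
    running = []          # remaining ticks for each job holding a license
    rebooting = []        # remaining ticks for each node still rebooting
    peak = 0
    while pending or running or rebooting:
        # The scheduler starts jobs for as long as licenses are free.
        while pending and len(running) < licenses:
            pending -= 1
            running.append(job_ticks)
            rebooting.append(reboot_ticks)  # the job triggers the reboot at once
        peak = max(peak, len(rebooting))
        # Advance one tick; finished jobs release their license here.
        running = [t - 1 for t in running if t > 1]
        rebooting = [t - 1 for t in rebooting if t > 1]
    return peak


# Jobs end after 1 tick but nodes need 2 ticks to come back:
print(simulate(12, 3, 1, 2))  # 6
# License held for the whole reboot: the limit is respected.
print(simulate(12, 3, 2, 2))  # 3
```

In this model the license count only bounds the number of rebooting nodes when the job lasts at least as long as the reboot, which is exactly what does not happen when the job dies with the node.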
I guess you could change your Slurm config to not mark nodes as down if
they stop responding and make sure the job that's launched, but that
feels wrong to me.