Dear Slurm User list,
Using https://slurm.schedmd.com/power_save.html, we had one case out of many (>242) node starts that resulted in the error
slurm_update error: Invalid node state specified
when we called:
scontrol update NodeName="$1" state=RESUME reason=FailedStartup
in the fail script. We run this to make absolutely sure that the instances, which are created on demand, are `~idle` again after being removed by the fail program. The nodes are set to RESUME before the actual instance gets destroyed. I remember running into this error before when issuing the command manually, but I don't remember under which circumstances it occurs.
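One way to narrow this down might be to retry the update a few times and log the node's state each time it is rejected. The following is only a minimal sketch under assumptions: the function name `resume_node`, the `SCONTROL` override variable, and the retry count and delay are hypothetical and not part of our current script; only `scontrol update` and `scontrol show node` are real commands.

```shell
#!/bin/bash
# Hypothetical helper for the fail script (names and defaults are assumptions).
# SCONTROL can be overridden so the logic can be tried without a cluster.
SCONTROL="${SCONTROL:-scontrol}"

# Try to set a node to RESUME, retrying a few times. On every rejected
# update, print the State= field reported by scontrol to stderr so the
# state that caused the rejection can be identified afterwards.
resume_node() {
    local node="$1" attempts="${2:-3}" delay="${3:-5}" i
    for ((i = 1; i <= attempts; i++)); do
        if "$SCONTROL" update NodeName="$node" state=RESUME reason=FailedStartup; then
            return 0
        fi
        # Log the current node state for diagnosis; ignore a missing match.
        "$SCONTROL" show node "$node" | grep -o 'State=[^ ]*' >&2 || true
        # Wait before the next attempt, but not after the last one.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}
```

In the fail script this could be called as `resume_node "$1" || exit 1`. The logged state would at least show which node state makes Slurm reject the RESUME request.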
Maybe someone has a good idea of how to tackle this problem.
Best regards
Xaver Stiensmeier