Xaver,
You may want to look at the ResumeRate option in slurm.conf:
ResumeRate
The rate at which nodes in power save mode are returned to normal operation by ResumeProgram. The value is a number of nodes per minute and it can be used to prevent power surges if a large number of nodes in power save mode are assigned work at the same time (e.g. a large job starts). A value of zero results in no limits being imposed. The default value is 300 nodes per minute.
Thank you Brian,
While ResumeRate might keep the CPU usage within an acceptable margin, it is a workaround rather than a fix. I would prefer a solution that groups resume requests, so that at most one Ansible playbook run is started per second instead of up to ResumeRate separate runs.
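To make the grouping idea concrete, here is a rough Python sketch of what I have in mind. It is not something Slurm provides. The function names and the playbook name setup_worker.yml are made up for illustration, and it assumes that each resume request has already been expanded into a plain list of node names (for example with scontrol show hostnames). It only merges the requests and builds one ansible-playbook command line for the whole group; collecting the requests over a time window and actually running the command are left out.

```python
from typing import Iterable, List


def merge_requests(requests: Iterable[Iterable[str]]) -> List[str]:
    """Merge several resume requests into one node list.

    Duplicates are dropped and the first-seen order is kept, so a node
    requested twice is only set up once.
    """
    seen = set()
    merged: List[str] = []
    for request in requests:
        for node in request:
            if node not in seen:
                seen.add(node)
                merged.append(node)
    return merged


def build_playbook_command(nodes: List[str],
                           playbook: str = "setup_worker.yml") -> List[str]:
    """Build a single ansible-playbook call covering all given nodes.

    The playbook name is a placeholder (assumption). --limit restricts
    the run to the comma-separated hosts.
    """
    if not nodes:
        raise ValueError("no nodes to resume")
    return ["ansible-playbook", "--limit", ",".join(nodes), playbook]


if __name__ == "__main__":
    # Two resume requests that arrived close together (hypothetical names).
    pending = [["worker1", "worker2"], ["worker2", "worker3"]]
    print(build_playbook_command(merge_requests(pending)))
```

The point of the design is that the number of playbook runs then depends on how often the batch is flushed, not on how many resume requests come in.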
As we completely destroy our instances when powering down, we
need to set them up from scratch using Ansible. Running Ansible on
the worker nodes would be possible, but that comes with additional
steps to save all log files on the master in case the
startup fails and you want to investigate. For now I feel that
using the master to set up the workers is the better structure.
Best regards,
Xaver