[slurm-users] Slurm cloud scheduling/power saving

Steve Brasier

Apr 1, 2021, 4:54:06 AM
to slurm...@schedmd.com
Hi all, anyone have suggestions for debugging cloud nodes not resuming? I've had this working before but I'm now using "configless" mode so wondering if that's an issue.

If I log in as SlurmUser and run the ResumeProgram manually, the specified node(s) boot, and if I log into them `sinfo` works, although it only shows the "static" nodes, not the newly booted "cloud" nodes. So that at least shows the program works, the image works, and new nodes can contact the slurmctld.

However, if I run a job that requires cloud nodes, it immediately goes Pending, showing "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions". Looking at SlurmctldLogFile with SlurmdDebug=debug5 I don't see any attempt to boot the nodes at all :-(.

I can post slurm.conf if anyone wants to look but I think the important parameters are probably that I've got:

SlurmctldParameters=enable_configless,idle_on_node_suspend,cloud_dns,power_save_interval=10,power_save_min_interval=0

That look right?

thanks for any suggestions!

Steve

Please note I work Tuesday to Friday.

Brian Andrus

Apr 1, 2021, 12:58:22 PM
to slurm...@lists.schedmd.com

Run 'sinfo -R' to see if any of your nodes are out of the mix.

If so, resume them and see if things work.
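
A minimal sketch of those two steps as shell functions, assuming the affected nodes only need to be returned to service. The function names `list_reason_nodes` and `resume_reason_nodes` are made up for illustration, and it is worth reading the reasons that plain `sinfo -R` prints before resuming anything:

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Slurm.

# Print the node names (or hostlist expressions) that sinfo reports
# with a reason set: -R lists reasons, -h drops the header,
# -o "%N" prints only the node names. Duplicates are collapsed.
list_reason_nodes() {
    sinfo -R -h -o "%N" | sort -u
}

# Ask slurmctld to return each listed node (or hostlist) to service.
# Reading line by line avoids shell globbing on names like cloud-[0-3].
resume_reason_nodes() {
    list_reason_nodes | while IFS= read -r node; do
        if [ -n "$node" ]; then
            scontrol update NodeName="$node" State=RESUME
        fi
    done
}
```

Running `list_reason_nodes` first shows what would be touched; `resume_reason_nodes` then needs to be run as a user allowed to update node state (typically root or SlurmUser).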

Brian Andrus
