How does SLURM GCP handle preemptible instances near their 24h limit


Bo Langgaard Lind

unread,
Jan 12, 2021, 10:08:14 AM1/12/21
to google-cloud-slurm-discuss
Let's say my expected job duration is on the high end, 23 hours for the sake of argument, and let's assume I'm using preemptible instances.

Is there a mechanism to ensure that compute nodes are restarted near the 24h limit? I think we can all agree that starting a job with an estimated duration longer than the time "left" on the instance is a waste of resources, as it's certain to not complete.

I found scontrol reboot ASAP [node names] which could probably help, but I don't see how it's invoked.

Joseph Schoonover

unread,
Jan 12, 2021, 10:12:54 AM1/12/21
to Bo Langgaard Lind, google-cloud-slurm-discuss
Hey Bo,
Last I recall, there is a script called slurm_gcp_sync.py that is set up to run as a cron job once per minute, checking for down nodes in Slurm and stopped compute instances in GCE. Nodes that appear in both lists are assumed to have been preempted; they are deleted and re-created, and Slurm automatically reschedules the batch script.
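If it helps to picture the mechanism: the matching step boils down to intersecting two node lists. A minimal sketch, with made-up node names, and not the actual slurm_gcp_sync.py code:

```python
# Illustrative sketch of the preemption-detection idea, NOT the real
# slurm_gcp_sync.py: nodes reported down by Slurm AND stopped in GCE
# are assumed to have been preempted.

def find_preempted(slurm_down_nodes, gce_stopped_instances):
    """Return node names that are down in Slurm and stopped in GCE."""
    return sorted(set(slurm_down_nodes) & set(gce_stopped_instances))

# Hypothetical example; the node names are invented for illustration.
down = ["compute-0-1", "compute-0-3"]
stopped = ["compute-0-3", "compute-0-7"]
print(find_preempted(down, stopped))  # ['compute-0-3']
```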




Bo Langgaard Lind

unread,
Jan 12, 2021, 10:18:40 AM1/12/21
to Joseph Schoonover, google-cloud-slurm-discuss
Hi Joseph

Thanks for your quick response.

It's not exactly what I'm asking for though. I'm concerned about the case where it's 100% predictable that a job will not complete before a preemptible instance will reach its 24h end-of-life.

Joseph Schoonover

unread,
Jan 12, 2021, 10:43:06 AM1/12/21
to Bo Langgaard Lind, google-cloud-slurm-discuss
Hey Bo,
That same script will kick in at the 24-hour mark, as far as I can tell, and restart the job. Ideally, you'd write your batch script to pick up from the latest state of execution of your application.



Dr. Joseph Schoonover
Chief Executive Officer
Senior Research Software Engineer
j...@fluidnumerics.com

Bo Langgaard Lind

unread,
Jan 12, 2021, 11:01:25 AM1/12/21
to Joseph Schoonover, google-cloud-slurm-discuss
Hi Joseph

I should perhaps have made this clearer, but the current state of our algorithm implementation does not allow for picking up where we left off. Any job that's cut short, whether by preemption or by hitting the 24h limit, is wasted effort.

Additionally, there's a potentially expensive setup step, which means we'll probably run long jobs, estimated at around 12 hours.

Thus, we can, with a high degree of certainty, predict when it's a fool's errand to start a job.
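The check being described here can be written down in a few lines once you know the instance's uptime. A hedged sketch; the 24h lifetime is the GCE preemptible limit, but the helper, its parameters, and the safety margin are assumptions, and how you obtain the uptime (metadata server, /proc/uptime, etc.) is left out:

```python
# Sketch of the "is it futile to start this job?" decision. The 24-hour
# maximum lifetime of preemptible instances is fixed by GCE; uptime is
# passed in as seconds, obtained however your setup allows.

LIFETIME_SECONDS = 24 * 3600

def worth_starting(uptime_seconds, est_job_seconds, safety_margin=300):
    """True if the estimated job fits in the instance's remaining lifetime."""
    remaining = LIFETIME_SECONDS - uptime_seconds
    return est_job_seconds + safety_margin <= remaining

# A 12h job on an instance that is already 13 hours old cannot finish
# before the 24h limit.
print(worth_starting(13 * 3600, 12 * 3600))  # False
# The same job on a fresh instance fits comfortably.
print(worth_starting(0, 12 * 3600))  # True
```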

Alex Chekholko

unread,
Jan 12, 2021, 12:09:28 PM1/12/21
to Bo Langgaard Lind, Joseph Schoonover, google-cloud-slurm-discuss
Hey Bo,

Preemptible instances can get killed at any time, so if you don't want to re-run jobs, you'll want to use regular instances.

Run some scenarios with your pricing to see which way is cheaper.

Regards,
Alex

Bo Langgaard Lind

unread,
Jan 20, 2021, 10:29:21 AM1/20/21
to google-cloud-slurm-discuss
I think I figured it out.

There is indeed a script (slurmsync.py) that runs every minute on the controller and restarts nodes that have been preempted.

That was not, however, my concern.

It turns out that RebootProgram is not set in slurm.conf.tpl. Setting it to the same program as SuspendProgram (i.e., suspend.py) makes scontrol reboot work.

After that, it's as simple as issuing scontrol reboot ASAP <nodename> once it's certain that it no longer makes sense to schedule the next job on that node.
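For anyone finding this later, the change amounts to something like the following in the rendered slurm.conf; the script path is illustrative, so check where your deployment actually installs suspend.py:

```
# slurm.conf fragment -- path is an assumption; adjust to your install
RebootProgram=/slurm/scripts/suspend.py
```

followed by, e.g., scontrol reboot ASAP compute-0-3 on the controller.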
