Minimal "Burst to SLURM" Setup

Sidd Karamcheti

Feb 24, 2022, 1:18:42 AM
to google-cloud-slurm-discuss
Hey folks, I just read through the SLURM on GCP documentation and have a few questions:
  • Is there a reason we need two separate "persistent" nodes (the login node AND the controller node)?

  • I'm trying to set SLURM up for my university lab, to be used as a fallback only when we're out of capacity (before deadlines, for large-scale, long-running jobs). Can I set my "persistent nodes" to 0, with a large capacity for ephemeral nodes?
    • Extra context: For the most part, this cluster won't be used... but when it will get used, it'll be hammered (at least 20+ multi-GPU instances simultaneously). 

  • When it comes to setting up an NFS server, is there a way to set up a persistent disk that attaches to each node? This is possible with the Kubernetes Engine (https://medium.com/platformer-blog/nfs-persistent-volumes-with-kubernetes-a-case-study-ce1ed6e2c266), but I'm not sure how to translate those steps to the Slurm setup.
Finally - how do folks estimate the cost of compute + persistent disks + ingress/egress? The pricing calculator is non-intuitive, and I'd really like not to accidentally spend $$$.

Thank you so much!

Alex Chekholko

Feb 25, 2022, 1:59:15 PM
to Sidd Karamcheti, google-cloud-slurm-discuss
Hi Sidd,

The controller node holds the disk volume that gets served out via NFS to the compute nodes. I recommend making the controller node "large enough" that it can handle the I/O from all the compute nodes. It's standard NFS.

The login node is separate so that users can't interfere with the controller node functions (nfs server, slurmctld, etc).  You can make the login node fairly small.
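Since it is plain NFS under the hood, the setup on the controller is the usual export-plus-mount arrangement. A minimal sketch, assuming paths `/home` and `/apps` and a `10.0.0.0/24` cluster subnet (all of which are placeholders, not values from the slurm-gcp deployment):

```
# /etc/exports on the controller (subnet is an assumption)
/home    10.0.0.0/24(rw,no_subtree_check,async)
/apps    10.0.0.0/24(ro,no_subtree_check)

# /etc/fstab entry on each login/compute node
controller:/home  /home  nfs  defaults,hard  0 0
controller:/apps  /apps  nfs  defaults,hard  0 0
```

This is why controller sizing matters: every compute node's `/home` I/O funnels through that one machine and its disk.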

You can have 0 compute nodes by default, and all compute nodes will be created on demand.
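In stock Slurm terms, on-demand nodes use the power-saving / cloud scheduling machinery: nodes marked `State=CLOUD` exist only while jobs need them, and Slurm calls site scripts to create and delete the backing VMs. A minimal slurm.conf sketch, where the node names, counts, sizes, and script paths are all assumptions for illustration:

```
# slurm.conf fragment: compute nodes exist only on demand
SuspendProgram=/opt/slurm/scripts/suspend.sh   # deletes the GCP instance
ResumeProgram=/opt/slurm/scripts/resume.sh     # creates the GCP instance
SuspendTime=300                                # tear down after 5 idle minutes
ResumeTimeout=600

NodeName=gpu-compute-[0-19] State=CLOUD Gres=gpu:4 CPUs=48 RealMemory=190000
PartitionName=gpu Nodes=gpu-compute-[0-19] Default=YES MaxTime=7-00:00:00
```

The slurm-gcp deployment wires up these suspend/resume hooks for you; you only choose the maximum node count and machine shape.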

There is a "persistent disk" on each compute node; just imagine a standard physical HPC cluster, where the compute nodes all have a local OS disk. The k8s concepts don't really apply here.

In terms of pricing, IME the most practical way to determine it is trial and error: run a standard test workload, then check the billing results a day later. Then you know your real total costs. It will not be cheap.
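Before running anything, simple arithmetic against the published per-hour rates at least gives an upper bound for a burst. A back-of-the-envelope sketch, where every rate below is a made-up placeholder (substitute current GCP list prices):

```python
# Back-of-the-envelope burst cost estimate.
# All rates are placeholders -- check the current GCP price list.

N_INSTANCES = 20          # simultaneous multi-GPU nodes during a burst
HOURS = 72                # length of the deadline crunch
INSTANCE_PER_HOUR = 3.00  # $/hr for one multi-GPU VM (assumed)
DISK_GB = 500             # persistent disk per node (assumed)
DISK_PER_GB_MONTH = 0.04  # $/GB-month, pd-standard (assumed)

compute = N_INSTANCES * HOURS * INSTANCE_PER_HOUR
# Pro-rate the monthly disk price over the hours actually used (~730 h/month).
disk = N_INSTANCES * DISK_GB * DISK_PER_GB_MONTH * (HOURS / 730)

print(f"compute ~ ${compute:,.0f}, disk ~ ${disk:,.0f}")
```

Even with rough numbers, this makes the dominant term obvious: GPU instance-hours dwarf persistent disk, so the calculator details matter far less than how long jobs run.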

Regards,
Alex
