Slurm preemptible VMs and quotas

Arthur Gilly

May 25, 2021, 11:14:25 PM5/25/21
to google-cloud-slurm-discuss
Hi there,

As a disclaimer, I am new to GCP. I'm trying to set up an auto-scaling Slurm cluster as a demo, to show that it is a nice alternative/extension to our on-premises HPC cluster. I am currently on the $300 free trial.

I am following the tutorial at https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp. Briefly, everything goes well until my preemptible compute nodes refuse to spin up. Since I want Singularity support, I did the following:

gcloud cloud-shell ssh --authorize-session
cd ~/slurm-gcp/tf/examples/singularity
cp basic.tfvars.example basic.tfvars
nano basic.tfvars
#the above file is changed as follows:
## cluster_name is set
## project is set
## zone         = "us-central1-a"
## controller_image = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-4-hpc-centos-7"
## login_image set to the same
## partitions = [
##   { name                 = "normal"
##     machine_type         = "e2-highmem-16"
##     static_node_count    = 0
##     max_node_count       = 10
##     zone                 = "us-central1-a"
##     image                = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-4-hpc-centos-7"
##     preemptible_bursting = true
##   },
## ]

The rest of the file is unchanged. I changed custom-controller-install to include versions of Go and Singularity:

export GOLANG_VERSION=1.14.12
export SINGULARITY_VERSION=3.7.0
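
The script then uses these variables to install Go and build Singularity from source. For anyone following along, here is a rough sketch of those steps on the CentOS-based image; this is my reconstruction from memory, not a verbatim copy of the script:

# Sketch only: URLs and package list are assumptions, not copied from the script.
# Install Go (needed to build Singularity)
curl -sSL -o /tmp/go.tar.gz https://golang.org/dl/go${GOLANG_VERSION}.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf /tmp/go.tar.gz
export PATH=/usr/local/go/bin:$PATH

# Build dependencies, then compile and install Singularity
sudo yum install -y gcc libuuid-devel openssl-devel libseccomp-devel squashfs-tools cryptsetup
curl -sSL -o /tmp/singularity.tar.gz https://github.com/hpcng/singularity/releases/download/v${SINGULARITY_VERSION}/singularity-${SINGULARITY_VERSION}.tar.gz
tar -C /tmp -xzf /tmp/singularity.tar.gz
cd /tmp/singularity
./mconfig && make -C builddir && sudo make -C builddir install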

Then I run:

terraform apply -var-file=basic.tfvars

which completes successfully. Then I log in to the login node:

gcloud compute ssh simcluster-login0 --zone us-central1-a

and wait for Slurm to finish its configuration, re-logging in if required. sinfo then gives:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite     10  idle~ simcluster-compute-0-[0-9]

Then I submit a test job:
$> sbatch --wrap "hostname"
Submitted batch job 2
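
While waiting, I poll the queue and node states (standard Slurm commands, nothing GCP-specific):

watch -n 10 'squeue; sinfo'        # refresh job and node state every 10s
squeue -o '%.8i %.10T %.30R'       # the %R column shows why a job is pending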

The issue is that the node required to run this never spins up:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1  down# simcluster-compute-0-0
normal*      up   infinite      9  idle~ simcluster-compute-0-[1-9]

and scontrol show node gives:
NodeName=simcluster-compute-0-0 CoresPerSocket=1
   CPUAlloc=0 CPUTot=8 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=simcluster-compute-0-0 NodeHostName=simcluster-compute-0-0
   RealMemory=126832 AllocMem=0 FreeMem=N/A Sockets=8 Boards=1
   State=DOWN#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=normal
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=8,mem=126832M,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=<HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/testslurmcluster/zones/us-west1-a/instances/bulkInsert?alt=json returned "Quota PREEMPTIBLE_CPUS exceeded. Limit: 0.0 in region us-west1.". Details: "Quota PREEMPTIBLE_CPUS exceeded. Limit: 0.0 in region us-west1."> [slurm@2021-05-26T03:08:50]
   Comment=(null)

So it seems this is a quota problem. I contacted support, who gave the following answer: I have to request a quota increase for PREEMPTIBLE_CPUS, but I cannot do so until I upgrade my account. Does this mean that, in effect, I cannot use preemptible VMs on the $300 free trial? It seems strange that GCP would effectively disable testing of preemptibility, which is a very important feature. Or should this work, in which case is something else at play on the slurm-gcp side?
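
For what it's worth, the regional quota can also be inspected from the shell. I believe something like the following works (the grep context is approximate; the 0.0 limit matches the error above):

gcloud compute regions describe us-west1 | grep -B1 -A1 PREEMPTIBLE_CPUS
# - limit: 0.0
#   metric: PREEMPTIBLE_CPUS
#   usage: 0.0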

Cheers,

Arthur

Joseph Schoonover

May 25, 2021, 11:19:02 PM5/25/21
to Arthur Gilly, google-cloud-slurm-discuss
Hey Arthur,
You will need to upgrade your account to use preemptible instances. The same applies to GPU-accelerated instances.

However, you should be able to demo the cluster using standard GCE nodes.
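
For the demo, something like this in basic.tfvars should do it (the same partition block you posted, with bursting turned off), followed by another terraform apply -var-file=basic.tfvars:

partitions = [
  { name                 = "normal"
    machine_type         = "e2-highmem-16"
    static_node_count    = 0
    max_node_count       = 10
    zone                 = "us-central1-a"
    image                = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-4-hpc-centos-7"
    preemptible_bursting = false   # use on-demand instances instead
  },
]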


Arthur Gilly

May 26, 2021, 4:53:47 AM5/26/21
to google-cloud-slurm-discuss
Thank you. I disabled the preemption option and it works now. I will just assume that preemptible VMs will work out of the box once the account is upgraded...

Cheers,

Arthur
