Hi there,
as a disclaimer, I am new to GCP. I'm trying to set up an auto-scaling Slurm cluster as a demo, to show that it could be a nice alternative to (or extension of) our on-premises HPC cluster. I am currently on the $300 free trial.
gcloud cloud-shell ssh --authorize-session
cd ~/slurm-gcp/tf/examples/singularity
cp basic.tfvars.example basic.tfvars
nano basic.tfvars
# The above file is changed as follows:
## cluster_name is set
## project is set
## zone = "us-central1-a"
## controller_image = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-4-hpc-centos-7"
## login_image set to the same
## partitions = [
##   { name                 = "normal"
##     machine_type         = "e2-highmem-16"
##     static_node_count    = 0
##     max_node_count       = 10
##     zone                 = "us-central1-a"
##     image                = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-4-hpc-centos-7"
##     preemptible_bursting = true
##   }
## ]
The rest of the file is unchanged. I also changed custom-controller-install to install specific versions of Go and Singularity:
export GOLANG_VERSION=1.14.12
export SINGULARITY_VERSION=3.7.0
Then I run:
terraform apply -var-file=basic.tfvars
which completes successfully. Then I log in to the login node:
gcloud compute ssh simcluster-login0 --zone us-central1-a
and wait for Slurm to finish its configuration, logging in again if required. sinfo then gives:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 10 idle~ simcluster-compute-0-[0-9]
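As an aside, the wait-and-relogin step can be scripted; here is a minimal sketch (the wait_for helper is my own, not part of slurm-gcp):

```shell
# Poll every 10 s until the given command succeeds, i.e. until
# slurmctld is up and answering queries.
wait_for() {
  until "$@" >/dev/null 2>&1; do
    sleep 10
  done
}

# Usage: wait_for sinfo
```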
Then I submit a test job:
$> sbatch --wrap "hostname"
Submitted batch job 2
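While the job is stuck, the scheduler's reason for not starting it can be queried (columns are job id, state, and pending reason; the format string is just one way to slice it):

```shell
# Show job id, state, and pending reason for job 2
squeue -j 2 -o "%.10i %.10T %.30R"
```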
The issue is that the node required to run this never spins up:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 1 down# simcluster-compute-0-0
normal* up infinite 9 idle~ simcluster-compute-0-[1-9]
and scontrol show node gives:
NodeName=simcluster-compute-0-0 CoresPerSocket=1
CPUAlloc=0 CPUTot=8 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=simcluster-compute-0-0 NodeHostName=simcluster-compute-0-0
RealMemory=126832 AllocMem=0 FreeMem=N/A Sockets=8 Boards=1
State=DOWN#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=normal
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=8,mem=126832M,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
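Before concluding anything, the node's Reason field and the controller-side resume log are worth checking; the log path below is an assumption based on slurm-gcp's defaults and may differ between versions:

```shell
# On the login node: reasons recorded for down/drained nodes
sinfo -R

# On the controller: slurm-gcp's resume script logs instance-creation
# errors here (quota failures should show up); path may vary by version
sudo tail -n 50 /var/log/slurm/resume.log
```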
So it seems this is a quota problem. I contacted support, who gave the following answer: I have to request quota for PREEMPTIBLE_CPUS, but I cannot do so until I upgrade my account. Does this mean that, in effect, I cannot use preemptible VMs during the $300 free trial? It seems strange that GCP would effectively disable testing of preemptibility, which is a very important feature. Or should this work, in which case something else is at play on the slurm-gcp side?
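For reference, the region's preemptible-CPU quota can be inspected from Cloud Shell (region derived from the zone above; the grep just pulls out the limit/usage lines around the metric):

```shell
# Print limit and usage for the PREEMPTIBLE_CPUS quota in us-central1
gcloud compute regions describe us-central1 \
  --format="yaml(quotas)" | grep -B1 -A1 "metric: PREEMPTIBLE_CPUS"
```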
Cheers,
Arthur