Inconsistency between the number of CPUs of the compute node machine types and of the compute nodes in the Slurm cluster

Christoph Gorgulla

Apr 3, 2021, 8:36:42 AM4/3/21
to google-cloud-...@googlegroups.com
Hi everyone, 

The general problem is that the compute nodes of the Slurm cluster provide fewer CPUs than their machine type actually offers, both with the Deployment Manager and the Terraform versions. In addition, the Deployment Manager version seems to ignore the partition section of the configuration entirely (details below).

Terraform version

When using the latest Terraform version of the Slurm cluster, the compute nodes in the Slurm cluster provide only half the number of CPUs of the machine type specified in the terraform.tfvars file via the variable machine_type in the partitions section.
For instance, when machine_type is set to n2d-highcpu-64, the compute nodes in Slurm have only 32 cores:

$ scontrol show node g1-compute-0-0                                                          
NodeName=g1-compute-0-0 CoresPerSocket=1
   CPUAlloc=1 CPUTot=32 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=g1-compute-0-0 NodeHostName=g1-compute-0-0
   RealMemory=63216 AllocMem=800 FreeMem=N/A Sockets=32 Boards=1
   State=MIXED#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=32,mem=63216M,billing=32
   AllocTRES=cpu=1,mem=800M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

And indeed, jobs which request 64 cores per node are not accepted by Slurm; only up to 32 cores per node per job are possible. I have attached the terraform.tfvars file.
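
For reference, the relevant part of the partitions definition in that file is along the following lines (trimmed and with illustrative values; the attached terraform.tfvars has the exact contents):

partitions = [
  {
    name         = "debug"            # partition name, as in the scontrol output above
    machine_type = "n2d-highcpu-64"   # 64 vCPUs expected per compute node
    # ... remaining partition fields omitted here, see the attached terraform.tfvars
  },
]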

Deployment Manager version

With the previous version of slurm-gcp we had no problems with the Deployment Manager, but with the latest version we run into these problems. When using the Deployment Manager instead of Terraform via the command

gcloud deployment-manager deployments create g2 --config slurm-cluster.yaml

the entire section specifying the partition seems to be ignored, and the default partition values specified in the file schedmd-slurm-gcp.jinja.schema are used instead.

I have attached the slurm-cluster.yaml file. 
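
For reference, the partition section in that file has roughly the following shape (field names and values here are only illustrative; the attached slurm-cluster.yaml is authoritative):

# under the properties of the cluster resource in slurm-cluster.yaml:
partitions:
- name: p1                      # partition name, as in the scontrol output below
  machine_type: n2d-highcpu-64  # illustrative; the machine type we would like to use
  max_node_count: 10            # illustrative value
  zone: us-central1-a           # illustrative value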

Even with the default values in use, we have the same problem with the number of CPUs per compute node as with Terraform. The default compute node machine type, n1-highcpu-2, is what the Slurm cluster now uses (since the partition section of slurm-cluster.yaml is ignored), but the compute nodes in Slurm have only a single CPU available:

$ scontrol show node g2-compute-0-0
NodeName=g2-compute-0-0 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.23
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=g2-compute-0-0 NodeHostName=g2-compute-0-0 Version=20.11.4
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 
   RealMemory=1413 AllocMem=0 FreeMem=1163 Sockets=1 Boards=1
   State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=p1 
   BootTime=2021-04-03T11:40:54 SlurmdStartTime=2021-04-03T11:41:23
   CfgTRES=cpu=1,mem=1413M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

The actual VMs used for the compute nodes are of type n1-highcpu-2, so 2 CPUs should be available.
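
In case it helps with diagnosing this, one way to compare what the VM itself reports with what Slurm has configured is roughly the following (ZONE is a placeholder for the compute node's zone):

# CPUs as seen by the guest OS on the compute node (n1-highcpu-2 should report 2)
$ gcloud compute ssh g2-compute-0-0 --zone="${ZONE}" --command='nproc; lscpu | grep -E "Socket|Core|Thread"'

# CPUs as configured in Slurm for the same node
$ scontrol show node g2-compute-0-0 | grep -E 'CPUTot|Sockets|ThreadsPerCore'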

Any ideas on how to solve these problems with either the Terraform or the Deployment Manager version?

Many thanks,
Christoph
slurm-cluster.yaml
terraform.tfvars