Hi everyone,
The general problem is that the compute nodes of the Slurm cluster provide fewer CPUs than their machine type actually offers, both with the Deployment Manager and the Terraform versions. In addition, the Deployment Manager version seems to ignore the partition configuration entirely (details below).
Terraform version
When using the latest Terraform version of the Slurm cluster, the compute nodes provide only half the number of CPUs that the machine type specified in terraform.tfvars (via the variable machine_type in the partitions section) should have.
For instance, when machine_type is set to n2d-highcpu-64 (64 vCPUs), Slurm reports only 32 CPUs for the compute nodes:
$ scontrol show node g1-compute-0-0
NodeName=g1-compute-0-0 CoresPerSocket=1
CPUAlloc=1 CPUTot=32 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=g1-compute-0-0 NodeHostName=g1-compute-0-0
RealMemory=63216 AllocMem=800 FreeMem=N/A Sockets=32 Boards=1
State=MIXED#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=32,mem=63216M,billing=32
AllocTRES=cpu=1,mem=800M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
And indeed, jobs which request 64 cores per node are rejected by Slurm; only up to 32 cores per node per job are accepted. I have attached the terraform.tfvars file.
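For reference, the relevant fragment of my terraform.tfvars looks roughly like this. Only machine_type and the partition name "debug" are taken from the cluster above; the other fields and values are illustrative placeholders, not my exact configuration:

```hcl
# Sketch of the partitions section in terraform.tfvars (slurm-gcp).
# machine_type and the partition name match the cluster shown above;
# everything else here is a placeholder.
partitions = [
  {
    name           = "debug"
    machine_type   = "n2d-highcpu-64"  # 64 vCPUs, but Slurm shows only 32
    max_node_count = 10                # placeholder
    zone           = "us-central1-a"   # placeholder
  }
]
```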
Deployment manager version
With the previous version of slurm-gcp we had no problems with the Deployment Manager version, but with the latest version we do. When using Deployment Manager instead of Terraform via the command
gcloud deployment-manager deployments create g2 --config slurm-cluster.yaml
the entire section specifying the partition seems to be ignored, and the default partition values from the file schedmd-slurm-gcp.jinja.schema are used instead.
I have attached the slurm-cluster.yaml file.
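The partitions section of my slurm-cluster.yaml is structured roughly as follows. The cluster name g2 and the partition name p1 match the output below; the remaining fields and values are illustrative placeholders following the layout described by schedmd-slurm-gcp.jinja.schema:

```yaml
# Sketch of slurm-cluster.yaml for Deployment Manager; this partitions
# section is what appears to be ignored. Values other than the cluster
# and partition names are placeholders.
resources:
- name: slurm-cluster
  type: schedmd-slurm-gcp.jinja
  properties:
    cluster_name: g2
    partitions:
    - name: p1
      machine_type: n2d-highcpu-64   # placeholder machine type
      max_node_count: 10             # placeholder
      zone: us-central1-a            # placeholder
```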
Even with the default values in place, we see the same CPU-count problem as with Terraform. Since the partitions section of slurm-cluster.yaml is ignored, the cluster now uses the default compute node machine type, n1-highcpu-2, yet the compute nodes in Slurm have only a single CPU available:
$ scontrol show node g2-compute-0-0
NodeName=g2-compute-0-0 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=1 CPULoad=0.23
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=g2-compute-0-0 NodeHostName=g2-compute-0-0 Version=20.11.4
OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
RealMemory=1413 AllocMem=0 FreeMem=1163 Sockets=1 Boards=1
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=p1
BootTime=2021-04-03T11:40:54 SlurmdStartTime=2021-04-03T11:41:23
CfgTRES=cpu=1,mem=1413M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
The actual VMs backing the compute nodes are n1-highcpu-2, so 2 CPUs should be available.
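One way to see the discrepancy directly (a diagnostic sketch, run on an affected compute VM, e.g. via gcloud compute ssh) is to compare what the OS reports with what slurmd detects:

```shell
# Run on the compute node itself.
nproc                                   # vCPUs visible to the OS (2 on n1-highcpu-2)
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'  # CPU topology details
# Node configuration slurmd would report to the controller
# (CPUs=..., Sockets=..., ThreadsPerCore=...), if slurmd is installed:
command -v slurmd >/dev/null && slurmd -C
```

If nproc and slurmd -C disagree on the CPU count, the problem is in how slurm-gcp configures the node rather than in the VM itself.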
Any ideas on how to solve the problems with either the Terraform or the Deployment Manager versions?
Many thanks,
Christoph