n2 machine cpu number different between slurm.conf file and deployment config file

Will

May 25, 2021, 10:39:25 AM
to google-cloud-slurm-discuss

Hi, 

I installed a new Slurm cluster from the Marketplace. I set up three partitions with n2 machines: p2 with n2-standard-2, p4 with n2-standard-4, and p8 with n2-standard-8 (see the full deployment config at the end of this message).

For example, partition p2:

      "compute1_partition_name": "p2",
      "compute1_max_node_count": 10000.0,
      "compute1_static_node_count": 0.0,
      "compute1_preemptible": true,
      "compute1_machine_type": "n2-standard-2",
      "compute1_disk_type": "pd-standard",
      "compute1_disk_size_gb": 60.0,
      "compute1_gpu_count": 0.0,


The Slurm cluster works pretty well, except that I can't use the full number of CPUs for my jobs. Submission fails with an error saying the requested CPU count is not available; jobs only go through when I divide the number of CPUs by 2.

#!/usr/bin/env bash
#SBATCH -p p2
#SBATCH -n 1
#SBATCH --cpus-per-task=2   # full vCPU count of n2-standard-2; fails unless halved
#SBATCH --mem=8G
#SBATCH -t 00:30:00
srun hostname

Error:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
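
For reference, this is roughly how I have been checking what Slurm itself advertises for the partition (a sketch; p2 is the partition from my config above):

# Per-partition node count, CPUs per node, and memory per node
sinfo -p p2 -o "%P %D %c %m"
# Full partition definition as Slurm sees it
scontrol show partition p2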

I found that in the slurm.conf file (/usr/local/etc/slurm/slurm.conf) the number of CPUs defined is indeed different from the number of CPUs available on the machine (the p2 excerpt is below; the full file is at the end of this message).

NodeName=DEFAULT CPUs=1 RealMemory=7552 State=UNKNOWN
NodeName=slurm-hpc-compute-0-[0-9999] State=CLOUD
PartitionName=p2 Nodes=slurm-hpc-compute-0-[0-9999] MaxTime=INFINITE State=UP DefMemPerCPU=7552 LLN=yes Default=YES

For instance, for my partition p2, which uses n2-standard-2 (2 vCPUs), slurm.conf shows CPUs=1. For every partition, the CPU count seems to be divided by 2.

Is there a reason for that? Should I modify slurm.conf to get access to the full number of CPUs? Note that when I set up a Slurm cluster from Terraform with n1 machines, everything worked as expected, so I'm wondering whether this discrepancy (if it is one) is specific to n2 machines, to deploying from the Marketplace vs. Terraform, or to something I don't understand.
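
In case it helps, this is how I would compare what the hardware reports against what slurm.conf defines (a sketch; run on a compute node once one is up):

# What the OS reports: total CPUs, threads per core, cores per socket
lscpu | grep -E '^CPU\(s\)|Thread|Core|Socket'
# What slurmd detects on this node, printed in slurm.conf syntax
slurmd -C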

Thanks a lot for your help!

Will


Deployment config (from "View config" in the deployment section):

"resources": [{
    "name": "schedmd-slurm-gcp",
    "type": "schedmd-slurm-gcp.jinja",
    "properties": {
      "cluster_name": "slurm-hpc",
      "zone": "us-central1-a",
      "login_labels": [],
      "network": [“-”],
      "subnetwork": [“-“],
      "controller_external_ip": true,
      "login_external_ip": true,
      "compute_external_ip": false,
      "netstore_enabled": false,
      "netstore_server_ip": "",
      "netstore_remote_mount": "",
      "netstore_local_mount": "",
      "netstore_fs_type": "nfs",
      "netstore_mount_options": "defaults,_netdev",
      "controller_machine_type": "n2-standard-32",
      "controller_disk_type": "pd-standard",
      "controller_disk_size_gb": 180.0,
      "suspend_time": 300.0,
      "login_machine_type": "n1-standard-2",
      "login_disk_type": "pd-standard",
      "login_disk_size_gb": 30.0,
      "compute1_partition_name": "p2",
      "compute1_max_node_count": 10000.0,
      "compute1_static_node_count": 0.0,
      "compute1_preemptible": true,
      "compute1_machine_type": "n2-standard-2",
      "compute1_disk_type": "pd-standard",
      "compute1_disk_size_gb": 60.0,
      "compute1_gpu_count": 0.0,
      "compute1_gpu_type": "",
      "compute2_enabled": true,
      "compute2_partition_name": "p4",
      "compute2_max_node_count": 8000.0,
      "compute2_static_node_count": 0.0,
      "compute2_preemptible": true,
      "compute2_machine_type": "n2-standard-4",
      "compute2_disk_type": "pd-standard",
      "compute2_disk_size_gb": 80.0,
      "compute2_gpu_count": 0.0,
      "compute2_gpu_type": "",
      "compute3_enabled": true,
      "compute3_partition_name": "p8",
      "compute3_max_node_count": 6000.0,
      "compute3_static_node_count": 0.0,
      "compute3_preemptible": true,
      "compute3_machine_type": "n2-standard-8",
      "compute3_disk_type": "pd-standard",
      "compute3_disk_size_gb": 110.0,
      "compute3_gpu_count": 0.0,
      "compute3_gpu_type": ""
    }
  }]


/usr/local/etc/slurm/slurm.conf: 
NodeName=DEFAULT CPUs=1 RealMemory=7552 State=UNKNOWN
NodeName=slurm-hpc-compute-0-[0-9999] State=CLOUD
PartitionName=p2 Nodes=slurm-hpc-compute-0-[0-9999] MaxTime=INFINITE State=UP DefMemPerCPU=7552 LLN=yes Default=YES

NodeName=DEFAULT CPUs=2 RealMemory=15504 State=UNKNOWN
NodeName=slurm-hpc-compute-1-[0-7999] State=CLOUD
PartitionName=p4 Nodes=slurm-hpc-compute-1-[0-7999] MaxTime=INFINITE State=UP DefMemPerCPU=7752 LLN=yes

NodeName=DEFAULT CPUs=4 RealMemory=31408 State=UNKNOWN
NodeName=slurm-hpc-compute-2-[0-5999] State=CLOUD
PartitionName=p8 Nodes=slurm-hpc-compute-2-[0-5999] MaxTime=INFINITE State=UP DefMemPerCPU=7852 LLN=yes

Alex Chekholko

May 25, 2021, 1:05:20 PM
to Will, google-cloud-slurm-discuss
Yes, the new default "hpc" image has hyperthreading disabled; you can use the image without 'hpc' in the name to get the regular GCP vCPU -> Slurm CPU behavior.
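
Something like this should list the public Slurm image families so you can spot the non-hpc variants (illustrative; family names change between releases):

gcloud compute images list \
    --project schedmd-slurm-public \
    --no-standard-images \
    --filter="family ~ schedmd-slurm"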



Nick Ihli

May 25, 2021, 1:44:06 PM
to Alex Chekholko, Will, google-cloud-slurm-discuss

Google released a new HPC image with hyperthreading enabled, and we are updating the Slurm image to use it. You should see that in the Terraform scripts very soon; the Marketplace will be updated with it afterwards as well.

--Nick




Nick Ihli
Director, Cloud and Sales Engineering
ni...@schedmd.com


Will

May 25, 2021, 7:50:52 PM
to google-cloud-slurm-discuss
Thanks a lot for your answers!

If I understand correctly, I can't enable hyperthreading when deploying Slurm from the Marketplace (the form doesn't let me choose the image), but I can do it when deploying with Terraform.

In the latter case, would a config like the following work for all machines (login, controller, compute)?
image = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-4-centos-7"
or 
image = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-4-debian-10"

image_hyperthreads = true
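
If that is right, I would sanity-check after deployment with something like this (a sketch):

# p2 nodes should now advertise 2 CPUs instead of 1
sinfo -p p2 -o "%P %c"
# and a 2-CPU request should now be accepted
srun -p p2 -n 1 --cpus-per-task=2 hostname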

Thanks a lot!

Nick Ihli

May 28, 2021, 12:52:17 PM
to Will, google-cloud-slurm-discuss
Will,

Yes, that is correct. We pushed the new release of the Terraform scripts, which also includes the new image.
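
If you want to grab it directly, something along these lines should work (example path only; the layout may differ between releases):

git clone https://github.com/SchedMD/slurm-gcp.git
cd slurm-gcp/tf/examples/basic   # example directory; adjust to the release you use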

Thanks,
Nick





Nick Ihli
Director, Cloud and Sales Engineering
ni...@schedmd.com
