Hi,
I deployed a new Slurm cluster from the Marketplace. I set up 3 partitions with N2 machines: p2 with n2-standard-2, p4 with n2-standard-4, and p8 with n2-standard-8 (see the "View config" in the deployment, included at the end of this message).
e.g., partition p2:
"compute1_partition_name": "p2",
"compute1_max_node_count": 10000.0,
"compute1_static_node_count": 0.0,
"compute1_preemptible": true,
"compute1_machine_type": "n2-standard-2",
"compute1_disk_type": "pd-standard",
"compute1_disk_size_gb": 60.0,
"compute1_gpu_count": 0.0,
The Slurm cluster works pretty well, except that I can't use the full number of CPUs for my jobs. Submission fails with an error saying the requested CPU count is not available; it only works when I halve the number of CPUs.
#!/usr/bin/env bash
#SBATCH -p p2
#SBATCH -n 1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH -t 00:30:00
srun hostname
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
I found in the slurm.conf file (/usr/local/etc/slurm/slurm.conf) that the number of CPUs defined there differs from the number of CPUs actually available on the machine. For instance, for my partition p2, which uses n2-standard-2 (2 vCPUs), slurm.conf shows CPUs=1. For every partition, the CPU count seems to be divided by 2.
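For reference, this is roughly how I compared the two values. The NodeName line below is an illustrative stand-in for the real one in my slurm.conf (the exact node names and other fields are assumed; CPUs=1 is the value I actually observed):

```shell
# Compare the CPU count Slurm is configured with vs. what the VM reports.
# Illustrative slurm.conf line (field layout assumed, CPUs value as observed):
conf_line='NodeName=slurm-hpc-compute-0-[0-9999] CPUs=1 RealMemory=7000 State=CLOUD'

# CPUs Slurm will schedule per node, extracted from the config line:
slurm_cpus=$(echo "$conf_line" | sed -n 's/.*CPUs=\([0-9]*\).*/\1/p')
echo "slurm.conf CPUs: $slurm_cpus"

# vCPUs the OS reports when run on an n2-standard-2 compute node:
#   nproc   -> 2
```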
Is there a reason for this? Should I modify the slurm.conf file to access the full number of CPUs? Note that when I set up a Slurm cluster from Terraform using N1 machines, it worked fine, so I'm wondering whether this discrepancy (if it is one) is specific to N2 machines, to Marketplace vs. Terraform deployment, or to something I don't understand.
Thanks a lot for your help!
Will
"View config" from the deployment section:
"resources": [{
"name": "schedmd-slurm-gcp",
"type": "schedmd-slurm-gcp.jinja",
"properties": {
"cluster_name": "slurm-hpc",
"zone": "us-central1-a",
"login_labels": [],
"network": ["-"],
"subnetwork": ["-"],
"controller_external_ip": true,
"login_external_ip": true,
"compute_external_ip": false,
"netstore_enabled": false,
"netstore_server_ip": "",
"netstore_remote_mount": "",
"netstore_local_mount": "",
"netstore_fs_type": "nfs",
"netstore_mount_options": "defaults,_netdev",
"controller_machine_type": "n2-standard-32",
"controller_disk_type": "pd-standard",
"controller_disk_size_gb": 180.0,
"suspend_time": 300.0,
"login_machine_type": "n1-standard-2",
"login_disk_type": "pd-standard",
"login_disk_size_gb": 30.0,
"compute1_partition_name": "p2",
"compute1_max_node_count": 10000.0,
"compute1_static_node_count": 0.0,
"compute1_preemptible": true,
"compute1_machine_type": "n2-standard-2",
"compute1_disk_type": "pd-standard",
"compute1_disk_size_gb": 60.0,
"compute1_gpu_count": 0.0,
"compute1_gpu_type": "",
"compute2_enabled": true,
"compute2_partition_name": "p4",
"compute2_max_node_count": 8000.0,
"compute2_static_node_count": 0.0,
"compute2_preemptible": true,
"compute2_machine_type": "n2-standard-4",
"compute2_disk_type": "pd-standard",
"compute2_disk_size_gb": 80.0,
"compute2_gpu_count": 0.0,
"compute2_gpu_type": "",
"compute3_enabled": true,
"compute3_partition_name": "p8",
"compute3_max_node_count": 6000.0,
"compute3_static_node_count": 0.0,
"compute3_preemptible": true,
"compute3_machine_type": "n2-standard-8",
"compute3_disk_type": "pd-standard",
"compute3_disk_size_gb": 110.0,
"compute3_gpu_count": 0.0,
"compute3_gpu_type": ""
}
}]
--
You received this message because you are subscribed to the Google Groups "google-cloud-slurm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-slurm-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-slurm-discuss/aa984ffe-3525-4c8a-89f6-f3b3c606f88dn%40googlegroups.com.