We use c2-standard-30 VMs for our SLURM compute nodes and, because of multiple parallel jobs, we started more VMs than usual. We discovered that at 166 compute nodes (4980 cores) we could not spin up any more, getting the following error in resume.log:
Fair enough. So we went to Google and got our quota increased from 5k to 20k cores. This is reflected in the GCP Quota panel.
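As a sanity check on those numbers, here is a minimal sketch of the arithmetic, assuming each c2-standard-30 node counts as 30 vCPUs against the core quota (the helper name max_nodes is hypothetical, just for illustration):

```python
# Assumption: each c2-standard-30 node consumes 30 vCPUs of core quota.
VCPUS_PER_NODE = 30


def max_nodes(core_quota: int, vcpus_per_node: int = VCPUS_PER_NODE) -> int:
    """Largest whole number of nodes that fits under a given core quota."""
    return core_quota // vcpus_per_node


print(166 * VCPUS_PER_NODE)  # cores used by 166 nodes -> 4980
print(max_nodes(5_000))      # ceiling under the old 5k quota -> 166
print(max_nodes(20_000))     # ceiling under the new 20k quota -> 666
```

Under the old quota the ceiling works out to exactly the 166 nodes where we got stuck, while the 20k figure should leave room for 666 nodes of this type.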
The problem is that we still cannot get more than 166 compute nodes, and now the logs are silent about it.
[bo@gcp0-controller ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
europe-west4-a-c2-standard-30* up infinite 1000 idle~ gcp0-compute-0-[0-999]
europe-west4-b-c2-standard-30 up infinite 834 idle~ gcp0-compute-1-[4,13,20-21,39,55,62-64,72,90,92-94,97,102,105,107,109-110,116-118,123-125,129-130,132,134-136,142-143,150,162,167-174,176-184,188-189,191-192,196-206,209-211,213-216,218-221,223-224,228-233,237-238,241-244,249-259,264-268,270,272-277,279-288,290-354,356-539,541-999]
europe-west4-b-c2-standard-30 up infinite 166 alloc gcp0-compute-1-[0-3,5-12,14-19,22-38,40-54,56-61,65-71,73-89,91,95-96,98-101,103-104,106,108,111-115,119-122,126-128,131,133,137-141,144-149,151-161,163-166,175,185-187,190,193-195,207-208,212,217,222,225-227,234-236,239-240,245-248,260-263,269,271,278,289,355,540]
[bo@gcp0-controller ~]$ sudo tail /var/log/slurm/resume.log
2024-03-19 09:47:22,032 2715 47782678399296 resume.py INFO: done adding instances: gcp0-compute-1-[91,106,108,131,133,175,190,212,217,222] 0
2024-03-19 09:51:05,909 3927 47284840417600 resume.py INFO: done adding instances: gcp0-compute-1-[0-3,95-96,98-101,103-104,119-121,126-128,185-187,193-195,207-208,225-227,234-236,239-240,269,271,278,289,355,540] 0
2024-03-19 09:52:26,749 5254 47397429188352 resume.py ERROR: group operation failed: Requested minimum count of 1 VMs could not be created.
2024-03-19 09:52:27,163 5254 47397429188352 resume.py ERROR: insert requests failed: {"Operation was canceled by user ''.": ['gcp0-compute-1-7']}
2024-03-19 09:52:27,202 5254 47397173823808 resume.py INFO: done adding instances: gcp0-compute-1-7 0
2024-03-19 09:52:42,753 4508 47316170742080 resume.py INFO: done adding instances: gcp0-compute-1-[5-6,14-19,56-61,65-71,111-115,122,137-141,144-149,163-166,245-248,260-263] 0
2024-03-19 09:53:11,404 5469 47682778970432 resume.py INFO: done adding instances: gcp0-compute-1-8 0
2024-03-19 09:54:52,164 6467 47315044907328 resume.py INFO: done adding instances: gcp0-compute-1-[9-12,40-41,151-161] 0
2024-03-19 09:55:15,038 6748 47327566039360 resume.py INFO: done adding instances: gcp0-compute-1-[22-28,42-54] 0
2024-03-19 10:00:32,490 10992 47028322044224 resume.py INFO: done adding instances: gcp0-compute-1-[7,29-38,73-89,269] 0
The SLURM version is 21.08.4.
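The bracketed node lists in the sinfo and resume.log output above are hard to count by eye. On the controller, scontrol show hostnames expands such an expression; as an offline alternative, here is a rough sketch of a counter. The function count_hosts is a hypothetical helper, and it assumes only the simple single-bracket form seen in this output (one prefix, comma-separated numbers or ranges), not the full Slurm hostlist syntax:

```python
import re


def count_hosts(expr: str) -> int:
    """Count hosts in a simple hostlist such as 'name-[0-3,7]' or 'name-7'."""
    match = re.search(r"\[([^\]]+)\]", expr)
    if match is None:
        return 1  # a plain host name with no bracket expression
    total = 0
    for part in match.group(1).split(","):
        lo, _, hi = part.partition("-")
        # 'lo-hi' is an inclusive range; a bare number is a single host.
        total += (int(hi) - int(lo) + 1) if hi else 1
    return total


print(count_hosts("gcp0-compute-1-[9-12,40-41,151-161]"))  # 17
print(count_hosts("gcp0-compute-1-7"))                     # 1
```

Summing the counts across the resume.log lines gives a quick way to compare how many instances resume.py reported adding with the NODES column in sinfo.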