We're stuck at 5k cores even after having our quota increased


Bo Langgaard Lind

Mar 19, 2024, 7:24:25 AM
to google-cloud-slurm-discuss
We use c2-standard-30 VMs for our SLURM compute nodes, and due to multiple parallel jobs we started more VMs than usual. We discovered that at 166 compute nodes (166 × 30 vCPUs = 4,980 cores) we were unable to spin up any more, getting the following error in resume.log:

2024-03-18 19:07:53,422 9729 47212950984448 resume.py ERROR: failed to add gcp0-compute-2-200*39 to slurm, <HttpError 403 when requesting https://compute.googleapis.com/compute/beta/projects/redacted-slurm-cluster-eu-2/zones/europe-west4-c/instances/bulkInsert?alt=json returned "Quota 'PREEMPTIBLE_CPUS' exceeded. Limit: 5000.0 in region europe-west4.". Details: "[{'message': "Quota 'PREEMPTIBLE_CPUS' exceeded. Limit: 5000.0 in region europe-west4.", 'domain': 'usageLimits', 'reason': 'quotaExceeded'}]">

Fair enough. So we went to Google and got our PREEMPTIBLE_CPUS quota in europe-west4 increased from 5,000 to 20,000 cores. The new limit is reflected in the GCP Quotas panel.
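In case it helps anyone else: the limit the API actually enforces can be read back with gcloud, which is a quicker sanity check than the console. A rough sketch, assuming the gcloud CLI is configured for the project named in the error above (the format/grep window may need tweaking):

# Dump the regional quota list and pull out the preemptible-CPU entry;
# the metric name matches the one in the 403 error, and the grep window
# is just wide enough to show the limit and current usage next to it.
gcloud compute regions describe europe-west4 \
    --project=redacted-slurm-cluster-eu-2 \
    --format="yaml(quotas)" | grep -B1 -A1 PREEMPTIBLE_CPUS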

The problem is that we still don't get more than 166 compute nodes, but now the logs are just silent.

[bo@gcp0-controller ~]$ sinfo
PARTITION                      AVAIL  TIMELIMIT  NODES  STATE NODELIST
europe-west4-a-c2-standard-30*    up   infinite   1000  idle~ gcp0-compute-0-[0-999]
europe-west4-b-c2-standard-30     up   infinite    834  idle~ gcp0-compute-1-[4,13,20-21,39,55,62-64,72,90,92-94,97,102,105,107,109-110,116-118,123-125,129-130,132,134-136,142-143,150,162,167-174,176-184,188-189,191-192,196-206,209-211,213-216,218-221,223-224,228-233,237-238,241-244,249-259,264-268,270,272-277,279-288,290-354,356-539,541-999]
europe-west4-b-c2-standard-30     up   infinite    166  alloc gcp0-compute-1-[0-3,5-12,14-19,22-38,40-54,56-61,65-71,73-89,91,95-96,98-101,103-104,106,108,111-115,119-122,126-128,131,133,137-141,144-149,151-161,163-166,175,185-187,190,193-195,207-208,212,217,222,225-227,234-236,239-240,245-248,260-263,269,271,278,289,355,540]

[bo@gcp0-controller ~]$ sudo tail /var/log/slurm/resume.log
2024-03-19 09:47:22,032 2715 47782678399296 resume.py INFO: done adding instances: gcp0-compute-1-[91,106,108,131,133,175,190,212,217,222] 0
2024-03-19 09:51:05,909 3927 47284840417600 resume.py INFO: done adding instances: gcp0-compute-1-[0-3,95-96,98-101,103-104,119-121,126-128,185-187,193-195,207-208,225-227,234-236,239-240,269,271,278,289,355,540] 0
2024-03-19 09:52:26,749 5254 47397429188352 resume.py ERROR: group operation failed: Requested minimum count of 1 VMs could not be created.
2024-03-19 09:52:27,163 5254 47397429188352 resume.py ERROR: insert requests failed: {"Operation was canceled by user ''.": ['gcp0-compute-1-7']}
2024-03-19 09:52:27,202 5254 47397173823808 resume.py INFO: done adding instances: gcp0-compute-1-7 0
2024-03-19 09:52:42,753 4508 47316170742080 resume.py INFO: done adding instances: gcp0-compute-1-[5-6,14-19,56-61,65-71,111-115,122,137-141,144-149,163-166,245-248,260-263] 0
2024-03-19 09:53:11,404 5469 47682778970432 resume.py INFO: done adding instances: gcp0-compute-1-8 0
2024-03-19 09:54:52,164 6467 47315044907328 resume.py INFO: done adding instances: gcp0-compute-1-[9-12,40-41,151-161] 0
2024-03-19 09:55:15,038 6748 47327566039360 resume.py INFO: done adding instances: gcp0-compute-1-[22-28,42-54] 0
2024-03-19 10:00:32,490 10992 47028322044224 resume.py INFO: done adding instances: gcp0-compute-1-[7,29-38,73-89,269] 0

SLURM version is 21.08.4

Bo Langgaard Lind

Mar 19, 2024, 7:50:15 AM
to google-cloud-slurm-discuss
Update: the problem appears to be isolated to europe-west4-b.

Bo Langgaard Lind

Mar 20, 2024, 6:22:13 AM
to google-cloud-slurm-discuss
I think I figured it out, somewhat.

We have configured 1000 nodes in our partition, which is just an arbitrary number and way higher than our actual quota. It turns out that when we tried to provision node 167, Google denied the request, and rightly so. This made the SLURM controller put node 167 into a NOT_RESPONDING state, and that node then stays tainted until it is manually reset.
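The taint is visible if you query the node directly; something along these lines (the node name is just an example of one that never came up) shows NOT_RESPONDING among the state flags:

# Show the controller's view of a node that never powered up; the
# NOT_RESPONDING flag (plus the cloud power state) appears on the State= line.
scontrol show node gcp0-compute-1-167 | grep -E 'State=|Reason='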

One fix is to run:

scontrol update NodeName=gcp0-compute-1-167 State=POWER_UP

which powers up the node and eventually clears the NOT_RESPONDING state.
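Since more than one node can end up in this state, note that scontrol accepts a hostlist expression, so the reset can be done in one go; roughly (the range is just an example covering the nodes that never came up):

# Power up, and thereby un-taint, a whole range of stuck cloud nodes at once.
scontrol update NodeName=gcp0-compute-1-[167-999] State=POWER_UP

The longer-term fix on our side is presumably to size the partition (that 1000-node range in slurm.conf) to something the regional quota can actually satisfy.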