maximum 8 compute nodes while submitting 90 jobs on Slurm GCP


Al

Jan 6, 2021, 9:24:23 PM
to google-cloud-slurm-discuss
Hi,
I am trying to submit around 90 jobs to Slurm on GCP and would like to see 90 compute CPUs running at the same time, but I see at most 8 CPUs running while the other jobs wait in the queue.

I deployed Slurm on GCP with a maximum of 100 nodes.
Even this toy example cannot use more than 8 CPUs when I set nodes to 90.

What might be the cause of this issue? Is it possible that a setting on my GCP account limits it?

thanks
Altay

TJ

Jan 7, 2021, 5:17:58 AM
to google-cloud-slurm-discuss
Hello Altay, you might want to check whether you are running into any resource quota limits and request a higher quota if necessary:

Al

Jan 7, 2021, 3:31:28 PM
to google-cloud-slurm-discuss
Thanks TJ,

I ran the gcloud command and got the result below, which is not very clear to me. Does this explain the maximum of 8 CPUs I have?

- limit: 1000.0 metric: SNAPSHOTS usage: 0.0
- limit: 5.0 metric: NETWORKS usage: 2.0
- limit: 100.0 metric: FIREWALLS usage: 6.0
- limit: 100.0 metric: IMAGES usage: 1.0
- limit: 8.0 metric: STATIC_ADDRESSES usage: 0.0
- limit: 200.0 metric: ROUTES usage: 2.0
- limit: 15.0 metric: FORWARDING_RULES usage: 0.0
- limit: 50.0 metric: TARGET_POOLS usage: 0.0
- limit: 50.0 metric: HEALTH_CHECKS usage: 0.0
- limit: 8.0 metric: IN_USE_ADDRESSES usage: 0.0
- limit: 50.0 metric: TARGET_INSTANCES usage: 0.0
- limit: 10.0 metric: TARGET_HTTP_PROXIES usage: 0.0
- limit: 10.0 metric: URL_MAPS usage: 0.0
- limit: 5.0 metric: BACKEND_SERVICES usage: 0.0
- limit: 100.0 metric: INSTANCE_TEMPLATES usage: 0.0
- limit: 5.0 metric: TARGET_VPN_GATEWAYS usage: 0.0
- limit: 10.0 metric: VPN_TUNNELS usage: 0.0
- limit: 3.0 metric: BACKEND_BUCKETS usage: 0.0
- limit: 10.0 metric: ROUTERS usage: 1.0
- limit: 10.0 metric: TARGET_SSL_PROXIES usage: 0.0
- limit: 10.0 metric: TARGET_HTTPS_PROXIES usage: 0.0
- limit: 10.0 metric: SSL_CERTIFICATES usage: 0.0
- limit: 100.0 metric: SUBNETWORKS usage: 25.0
- limit: 10.0 metric: TARGET_TCP_PROXIES usage: 0.0
- limit: 32.0 metric: CPUS_ALL_REGIONS usage: 0.0
- limit: 10.0 metric: SECURITY_POLICIES usage: 0.0
- limit: 100.0 metric: SECURITY_POLICY_RULES usage: 0.0
- limit: 20.0 metric: PACKET_MIRRORINGS usage: 0.0
- limit: 100.0 metric: NETWORK_ENDPOINT_GROUPS usage: 0.0
- limit: 6.0 metric: INTERCONNECTS usage: 0.0
- limit: 5000.0 metric: GLOBAL_INTERNAL_ADDRESSES usage: 0.0
- limit: 5.0 metric: VPN_GATEWAYS usage: 0.0
- limit: 100.0 metric: MACHINE_IMAGES usage: 0.0
- limit: 20.0 metric: SECURITY_POLICY_CEVAL_RULES usage: 0.0
- limit: 0.0 metric: GPUS_ALL_REGIONS usage: 0.0
- limit: 5.0 metric: EXTERNAL_VPN_GATEWAYS usage: 0.0
- limit: 1.0 metric: PUBLIC_ADVERTISED_PREFIXES usage: 0.0
- limit: 10.0 metric: PUBLIC_DELEGATED_PREFIXES usage: 0.0
- limit: 128.0 metric: STATIC_BYOIP_ADDRESSES usage: 0.0
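
Each entry in this output is written as limit, then metric, then usage, so a limit belongs to the metric that follows it. When entries run together in a pasted message it is easy to pair a limit with the wrong metric. Below is a minimal Python sketch that pairs them up with a regular expression. The function name parse_quotas is hypothetical, and the sample string holds only three entries copied from the output above, not the full list.

```python
import re

# One quota entry: "limit: <number> metric: <NAME> usage: <number>".
ENTRY = re.compile(
    r"limit:\s*(\d+(?:\.\d+)?)\s*metric:\s*(\w+)\s*usage:\s*(\d+(?:\.\d+)?)"
)

def parse_quotas(text):
    """Return {metric: (limit, usage)} from pasted quota text (hypothetical helper)."""
    return {
        m.group(2): (float(m.group(1)), float(m.group(3)))
        for m in ENTRY.finditer(text)
    }

# Three entries copied from the output above, run together as in a pasted message.
sample = (
    "- limit: 10.0 metric: TARGET_TCP_PROXIES usage: 0.0"
    "- limit: 32.0 metric: CPUS_ALL_REGIONS usage: 0.0"
    "- limit: 10.0 metric: SECURITY_POLICIES usage: 0.0"
)

quotas = parse_quotas(sample)
limit, usage = quotas["CPUS_ALL_REGIONS"]
print(f"CPUS_ALL_REGIONS: limit={limit}, usage={usage}")
```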

TJ

Jan 7, 2021, 3:38:36 PM
to google-cloud-slurm-discuss
I think so. In the second half it says:
CPUS_ALL_REGIONS usage: 0.0- limit: 10.0 

So when you submit your job, you have 2 CPUs being used by login and controller and 8 CPUs being used by compute, adding up to 10, which is your current limit. You probably need to request more CPUS_ALL_REGIONS. There's an easier way to look at your quotas, though: just go to this page,


and click "Quotas" when you see this sentence on the page:
"If you expect a notable upcoming increase in usage, you can proactively request quota adjustments from the Quotas page in the Cloud Console."
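
The arithmetic in this reply can be written as a small Python sketch. It uses the figures stated in the reply (a CPU limit of 10, with 2 CPUs taken by the login and controller nodes) and assumes one CPU per compute node, which the thread does not state. The function name max_compute_nodes is hypothetical.

```python
def max_compute_nodes(cpu_quota, overhead_cpus, cpus_per_node=1):
    """How many compute nodes fit under a CPU quota (hypothetical helper).

    cpu_quota:      total CPUs the project may use
    overhead_cpus:  CPUs already taken by login and controller nodes
    cpus_per_node:  CPUs per compute node (assumed to be 1 here)
    """
    if cpus_per_node <= 0:
        raise ValueError("cpus_per_node must be positive")
    available = cpu_quota - overhead_cpus
    if available <= 0:
        return 0
    return available // cpus_per_node

# Figures from the reply: limit of 10, 2 CPUs for login and controller.
print(max_compute_nodes(10, 2))
```

The same function can be called again with a raised quota to estimate how large a limit to request for 90 concurrent single-CPU jobs.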

Al

Jan 7, 2021, 3:47:02 PM
to google-cloud-slurm-discuss
Thank you for your help TJ!