Assistance Required Provisioning Slurm Cluster on shared VPC


Martin Gordon

Apr 14, 2022, 6:18:22 PM
to google-cloud-slurm-discuss
Hi all,

I'm new to both Terraform and GCP. I am trying to create a Slurm cluster on a shared VPC using Terraform, following the codelabs tutorial. The deployment seemed to complete without errors and the controller and login nodes were created, but when I ssh into the machine, setup seems to be stalled on "*** Slurm is currently being configured in the background. ***".

All firewall rules are defined on the host project VPC, so I added network tag(s) to the controller and login nodes after creation (additional question: can this be implemented during creation?). Both the Compute Engine and Deployment Manager APIs are enabled on the service project. Please find attached the log files from the setup and my test.tfvars file.

Thanks for your time,

Serena Lien

Apr 21, 2022, 5:40:34 AM
to google-cloud-slurm-discuss
I have also been trying to do this, and have had the same problem as you, but I got a little bit further before encountering another problem. Have you managed to solve this?

I found that it stalled on "*** Slurm is currently being configured in the background. ***" because the instances had no internet access through the shared VPC and therefore could not complete the setup. This happens when you set up a shared VPC but the nodes are configured without external IP addresses. In that case, you also need to configure a Cloud NAT gateway so the instances can reach the internet through it; see the Google VPC docs.
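For anyone who wants to try the Cloud NAT route, here is a minimal Terraform sketch, assuming the NAT is created in the host project that owns the shared VPC network. The variable names and the resource names ("slurm-nat-router", "slurm-nat") are hypothetical placeholders and are not part of the slurm-gcp tfvars. It is not something tested in this thread.

```hcl
# Hypothetical input variables; the names are illustrative only.
variable "host_project_id" { type = string } # Shared VPC host project
variable "region" { type = string }          # region of the cluster subnet
variable "network_name" { type = string }    # name of the shared VPC network

# Cloud Router in the host project; Cloud NAT needs a router to attach to.
resource "google_compute_router" "nat_router" {
  project = var.host_project_id
  name    = "slurm-nat-router" # placeholder name
  region  = var.region
  network = var.network_name
}

# NAT gateway giving instances without external IPs outbound internet access.
resource "google_compute_router_nat" "nat" {
  project                            = var.host_project_id
  name                               = "slurm-nat" # placeholder name
  router                             = google_compute_router.nat_router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```

This covers every subnet of the network in that region and lets Google allocate the NAT IPs automatically. A narrower setup would list specific subnetworks instead.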
Alternatively, and this is what I did because I didn't know how to set up the tfvars to use an existing Cloud NAT router, just use external IP addresses by setting these in the tfvars:
disable_controller_public_ips = false
disable_login_public_ips      = false
disable_compute_public_ips    = false

After this, the controller and login nodes successfully completed setup and can access both the internet and the NFS share, which I set up on another VM in the shared VPC network. However, my problem now is that Slurm can't allocate any compute node resources. It gets to:
salloc: Granted job allocation 2
salloc: Waiting for resource configuration
and then hangs indefinitely; no compute node instances ever start or appear. I don't think there are any problems with quotas or machine types.
I have tried setting both vpc_net and vpc_subnet in the partition options, and even a cidr setting of "" which matches my network addressing range, but none of this works.

Can anyone assist in how I can troubleshoot this further to get it working?