I have also been trying to do this, and have had the same problem as you, but I got a little bit further before encountering another problem. Have you managed to solve this?
I found that the reason it stalled on *** Slurm is currently being configured in the background. *** was that the instances had no internet access through the shared VPC and therefore could not complete the setup. This happens when you set up a shared VPC but configure the nodes without external IP addresses. In that case you also need to configure a Cloud NAT gateway so the instances can reach the internet through it; see the Google VPC docs on Cloud NAT.
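For reference, creating that NAT gateway by hand should look roughly like the commands below. I haven't verified this end to end (I went the external IP route instead), and the project, network, and region names are placeholders:

# Cloud Router in the shared VPC host project (names/region are placeholders)
gcloud compute routers create slurm-nat-router \
    --project=HOST_PROJECT_ID \
    --network=SHARED_VPC_NETWORK \
    --region=REGION

# Cloud NAT config so instances without external IPs can reach the internet
gcloud compute routers nats create slurm-nat \
    --project=HOST_PROJECT_ID \
    --router=slurm-nat-router \
    --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges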
Alternatively, and this is what I did since I didn't know how to set up the tfvars to use an existing Cloud NAT router, you can just use external IP addresses by setting the following in tfvars:
disable_controller_public_ips = false
disable_login_public_ips = false
disable_compute_public_ips = false
After this, the controller and login nodes successfully completed setup and can access both the internet and the NFS share I set up on another VM in the shared VPC network. However, my problem now is that Slurm can't allocate any compute node resources. It gets to:
salloc: Granted job allocation 2
salloc: Waiting for resource configuration
and hangs indefinitely, and no compute node instances ever start or appear. I don't think there are any problems with quotas or machine types.
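For anyone looking at the same thing, the checks that seem relevant look roughly like this (the resume log path is my assumption based on the slurm-gcp scripts, so it may differ):

# Show node states; powered-down cloud nodes normally show with an idle~ suffix
sinfo

# Show any reason Slurm records for nodes being down or drained
sinfo -R
scontrol show node <compute-node-name>

# On the controller: the resume script should log instance creation attempts here
# (log path is an assumption, it may live elsewhere in your image)
sudo tail -f /var/log/slurm/resume.log

# Look for failed instance-insert operations in the compute project
gcloud compute operations list --filter="operationType=insert" --limit=20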
I have tried setting both vpc_net and vpc_subnet in the partition options, and even a cidr setting of "10.152.0.0/20", which matches my network's addressing range, but none of this works.
Can anyone suggest how I can troubleshoot this further to get it working?
Thanks