I'm trying to follow the tutorial "Slurm-GCP - V5 - Codelab guide for PDF".
I chose to deploy the basic example from slurm-gcp/terraform/slurm_cluster/examples/slurm_cluster/cloud/basic. I had just installed Terraform and cloned slurm-gcp, so both should be up to date. I modified the example.tfvars file slightly, but nothing that seems relevant to the errors below.
After logging into the controller and login nodes, I see that Slurm setup failed. The setup.log contents from the controller node and the login node, respectively, are pasted at the end of this message.
The controller log ends with "[Errno 101] Network is unreachable" and the login node's NFS mounts are denied by the controller, so this looks network-related, but I'm new to this.
I also noticed that the login and controller VMs have no public IPs, and I had to request roles/iap.tunnelResourceAccessor from my org's GCP admin to SSH into them at all. Could that be related? I found some posts in this group that mention setting disable_controller_public_ips=false and similar, but I could not find these variables anywhere in slurm-gcp, so I'm not sure how to use them.
I'd be grateful for any help.
Thank you,
Kuba
[kuba_cohere_ai@kubatfsl1-controller ~]$ sudo cat /slurm/scripts/setup.log
2022-06-29 18:04:03,962 util DEBUG: run: wall -n '*** Slurm is currently being configured in the background. ***'
2022-06-29 18:04:04,007 util ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/kubatfsl1-slurm-devel
2022-06-29 18:04:04,011 setup.py INFO: Setting up controller
2022-06-29 18:04:04,016 setup.py INFO: installing custom scripts: compute.d/hello_compute.sh
2022-06-29 18:04:04,017 setup.py DEBUG: compute.d/hello_compute.sh
2022-06-29 18:04:04,027 util DEBUG: Using version=v1 of Google Compute Engine API
2022-06-29 18:06:11,306 util ERROR: [Errno 101] Network is unreachable
Traceback (most recent call last):
  File "/slurm/scripts/util.py", line 662, in ensure_execute
    return request.execute()
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 932, in execute
    headers=self.headers,
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 191, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google_auth_httplib2.py", line 225, in request
    **kwargs
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 1872, in new_request
    connection_type=connection_type,
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1702, in request
    conn, authority, uri, request_uri, method, body, headers, redirections, cachekey,
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1421, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1343, in _conn_request
    conn.connect()
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1179, in connect
    raise socket_err
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1133, in connect
    sock.connect((self.host, self.port))
OSError: [Errno 101] Network is unreachable
[kuba_cohere_ai@kubatfsl1-login-8z3fw97p-001 ~]$ sudo tail /slurm/scripts/setup.log -n 20
2022-06-29 18:13:45,629 setup.py INFO: Waiting for '/home' to be mounted...
2022-06-29 18:13:45,630 util DEBUG: run: mount /home
2022-06-29 18:13:45,636 setup.py INFO: Waiting for '/etc/munge' to be mounted...
2022-06-29 18:13:45,636 util DEBUG: run: mount /etc/munge
2022-06-29 18:13:45,647 setup.py INFO: Waiting for '/opt/apps' to be mounted...
2022-06-29 18:13:45,647 util DEBUG: run: mount /opt/apps
2022-06-29 18:13:45,652 setup.py ERROR: mount of path '/usr/local/etc/slurm' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/usr/local/etc/slurm']' returned non-zero exit status 32.
2022-06-29 18:13:45,661 setup.py ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2022-06-29 18:13:45,667 setup.py ERROR: mount of path '/etc/munge' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/etc/munge']' returned non-zero exit status 32.
2022-06-29 18:13:45,676 setup.py ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2022-06-29 18:13:45,676 setup.py ERROR: CalledProcessError:
command=['mount', '/usr/local/etc/slurm']
returncode=32
stdout:
stderr:
mount.nfs: access denied by server while mounting kubatfsl1-controller:/usr/local/etc/slurm
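
In case it helps narrow this down, here is a minimal sketch of a connectivity check I could run on the controller to see which endpoints are actually reachable. The endpoint list is my assumption: metadata.google.internal comes from the log above, and www.googleapis.com:443 is my guess at what the Compute Engine API client is trying to reach when it fails. The function names (can_reach, report) are just illustrative and not part of slurm-gcp.

```python
import socket
from typing import Iterable, Tuple

# Assumed endpoints: the metadata host appears in setup.log; the
# googleapis host is a guess at what the API client connects to.
ENDPOINTS = [
    ("metadata.google.internal", 80),
    ("www.googleapis.com", 443),
]


def can_reach(host: str, port: int, timeout: float = 5.0) -> Tuple[bool, str]:
    """Try to open a TCP connection; return (success, detail message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers DNS failures, timeouts, refused connections, and
        # "[Errno 101] Network is unreachable".
        return False, str(exc)


def report(endpoints: Iterable[Tuple[str, int]]) -> None:
    """Print one line per endpoint showing whether it is reachable."""
    for host, port in endpoints:
        ok, detail = can_reach(host, port)
        status = "OK" if ok else "FAIL"
        print("{:4} {}:{} ({})".format(status, host, port, detail))


# Example, to be run on the controller node:
# report(ENDPOINTS)
```

If the metadata host connects but the googleapis host fails, that would suggest the VM has no route to external Google APIs, which would fit with it having no public IP.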