'Network is unreachable' issues when following tutorial "Slurm-GCP - V5 - Codelab guide for PDF"

Kuba Perlin

Jun 29, 2022, 2:35:19 PM
to google-cloud-slurm-discuss
Hi,

I'm trying to follow the tutorial "Slurm-GCP - V5 - Codelab guide for PDF".
I chose to deploy the basic version from slurm-gcp/terraform/slurm_cluster/examples/slurm_cluster/cloud/basic. I just installed terraform and cloned slurm-gcp so those should be up to date. I modified the example.tfvars file a little bit but nothing that seems significant, given the errors below.

After logging into the controller and login nodes, I see that slurm failed to be set up. I find the following setup.log contents in the controller node and login node, respectively (pasted at the end).

It seems to have to do with network issues, but I'm a newbie with these things. 

I also noticed that the login & controller VMs did not have public IPs, and I had to request roles/iap.tunnelResourceAccessor from my org's gcp admin to ssh into them at all. Could that be related? I found some posts in this group mentioning setting disable_controller_public_ips=false etc. but did not find these variables anywhere in slurm-gcp, so I'm not sure how to use those.

I'd be grateful for any help.
Thank you,
Kuba

[kuba_cohere_ai@kubatfsl1-controller ~]$ sudo cat /slurm/scripts/setup.log
2022-06-29 18:04:03,962 util DEBUG: run: wall -n '*** Slurm is currently being configured in the background. ***'
2022-06-29 18:04:04,007 util ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/kubatfsl1-slurm-devel
2022-06-29 18:04:04,011 setup.py INFO: Setting up controller
2022-06-29 18:04:04,016 setup.py INFO: installing custom scripts: compute.d/hello_compute.sh
2022-06-29 18:04:04,017 setup.py DEBUG: compute.d/hello_compute.sh
2022-06-29 18:04:04,027 util DEBUG: Using version=v1 of Google Compute Engine API
2022-06-29 18:06:11,306 util ERROR: [Errno 101] Network is unreachable
Traceback (most recent call last):
  File "/slurm/scripts/util.py", line 662, in ensure_execute
    return request.execute()
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 932, in execute
    headers=self.headers,
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 191, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google_auth_httplib2.py", line 225, in request
    **kwargs
  File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 1872, in new_request
    connection_type=connection_type,
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1702, in request
    conn, authority, uri, request_uri, method, body, headers, redirections, cachekey,
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1421, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1343, in _conn_request
    conn.connect()
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1179, in connect
    raise socket_err
  File "/usr/local/lib/python3.6/site-packages/httplib2/__init__.py", line 1133, in connect
    sock.connect((self.host, self.port))
OSError: [Errno 101] Network is unreachable


[kuba_cohere_ai@kubatfsl1-login-8z3fw97p-001 ~]$ sudo tail /slurm/scripts/setup.log -n 20
2022-06-29 18:13:45,629 setup.py INFO: Waiting for '/home' to be mounted...
2022-06-29 18:13:45,630 util DEBUG: run: mount /home
2022-06-29 18:13:45,636 setup.py INFO: Waiting for '/etc/munge' to be mounted...
2022-06-29 18:13:45,636 util DEBUG: run: mount /etc/munge
2022-06-29 18:13:45,647 setup.py INFO: Waiting for '/opt/apps' to be mounted...
2022-06-29 18:13:45,647 util DEBUG: run: mount /opt/apps
2022-06-29 18:13:45,652 setup.py ERROR: mount of path '/usr/local/etc/slurm' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/usr/local/etc/slurm']' returned non-zero exit status 32.
2022-06-29 18:13:45,661 setup.py ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2022-06-29 18:13:45,667 setup.py ERROR: mount of path '/etc/munge' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/etc/munge']' returned non-zero exit status 32.
2022-06-29 18:13:45,676 setup.py ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2022-06-29 18:13:45,676 setup.py ERROR: CalledProcessError:
    command=['mount', '/usr/local/etc/slurm']
    returncode=32
    stdout:

    stderr:
mount.nfs: access denied by server while mounting kubatfsl1-controller:/usr/local/etc/slurm

Olivier Martin

Jun 29, 2022, 6:12:49 PM
to Kuba Perlin, google-cloud-slurm-discuss
Hi,

Are you seeing logs in Stackdriver? Those could give some more GCP-friendly messages. It could very well be that the example deploys VMs with public IPs and that, if your org policies forbid that, it will fail. Messages from Stackdriver could be more indicative of that issue, if that's the case.

Cheers,
Olivier



Kuba Perlin

Jun 30, 2022, 3:41:59 PM
to google-cloud-slurm-discuss
Thanks for your suggestion Olivier,

I actually couldn't find any logs containing the name of my slurm cluster. I wasn't expecting to find nothing at all. Perhaps this is due to some limited visibility (I'm not an admin)...? Not sure.

Anyways, I also realised that I didn't have all the IAM permissions listed as required in the tutorial, so I'm trying to get those and retry.
Will update here.

Kuba

Olivier Martin

Jun 30, 2022, 9:29:51 PM
to Kuba Perlin, google-cloud-slurm-discuss
The service account probably needs at least the monitoring and logging writer roles (roles/monitoring.metricWriter and roles/logging.logWriter) so that you can see things directly in Stackdriver. That can help. Let us know what you find with the IAM permissions.
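
A minimal sketch of how those two roles might be granted with `gcloud`, assuming the standard role IDs `roles/logging.logWriter` and `roles/monitoring.metricWriter` are the ones meant here. The project ID and service-account address are placeholders, not values from this thread, and the helper only prints the commands so they can be reviewed before being run:

```shell
# role_binding_commands PROJECT_ID SA_EMAIL
# Prints (does not run) one add-iam-policy-binding command per role.
role_binding_commands() {
  for role in roles/logging.logWriter roles/monitoring.metricWriter; do
    echo gcloud projects add-iam-policy-binding "$1" \
      --member="serviceAccount:$2" --role="$role"
  done
}

# Placeholder values (assumptions); substitute your own.
role_binding_commands my-project slurm-sa@my-project.iam.gserviceaccount.com
```

Running the printed lines requires permission to change the project's IAM policy, so this step may need to go through an org admin.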

Olivier

Kuba Perlin

Jul 5, 2022, 5:22:56 PM
to google-cloud-slurm-discuss
Hi Olivier,

Indeed I could get the stackdriver (Logs Explorer) logs to show up now. Thanks!

I don't see anything with non-info severity, but I went to look at the 'instance templates' that terraform created (in https://console.cloud.google.com/compute/instanceTemplates)
and see that those already have External IPs set to None, if that's relevant.

I am also still curious about the `disable_controller_public_ips` etc. variables. They show up in a few places on the internet, including https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp#2
but they don't seem to be accepted by my slurm-gcp terraform setup:
> The root module does not declare a variable named "disable_login_public_ips" but a value was found in file "example.tfvars".

I would still be grateful if anyone has suggestions on how to try to solve this.
Best wishes,
Kuba

Olivier Martin

Jul 6, 2022, 10:21:20 AM
to Kuba Perlin, google-cloud-slurm-discuss
I suspect the VMs on your network might have problems reaching other VMs or the internet. Can you SSH to the box and try accessing the internet? Something like "curl http://google.com", does it work? Can you curl http://metadata.google.internal/computeMetadata/v1/project/attributes/kubatfsl1-slurm-devel ? Or simply run "curl -v metadata.google.internal" and see what happens? This looks like a network error, so searching around this might help. You could have stringent networking policies in your organization.
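
The checks above could be wrapped in a small helper, sketched below. `probe` is a hypothetical function name, the 5-second timeout is an arbitrary choice, and the `Metadata-Flavor: Google` header is included because the metadata server expects it on `/computeMetadata/v1/` requests:

```shell
# probe NAME URL [extra curl args...]
# Reports whether curl could connect to URL at all (any HTTP status counts).
probe() {
  name="$1"; url="$2"; shift 2
  # --max-time stops a blocked route from hanging; the body is discarded.
  if curl --silent --max-time 5 --output /dev/null "$@" "$url"; then
    echo "$name: reachable"
  else
    echo "$name: NOT reachable (curl exit $?)"
  fi
}

# On the controller or login VM, one might then run:
#   probe internet http://google.com
#   probe metadata http://metadata.google.internal/computeMetadata/v1/project/attributes/ \
#     -H "Metadata-Flavor: Google"
```

A curl exit code of 7 (failed to connect) or 28 (timed out) on both probes would point at VPC routing or firewall rules rather than at Slurm itself.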

Also, if you're working with a Google account team, could you reach out to the Customer Engineer and ask for a bit more guidance?

Good luck,
Olivier


Kuba Perlin

Jul 6, 2022, 11:11:33 AM
to google-cloud-slurm-discuss
Thanks Olivier,

Update: Slurm started working once I had enough IAM permissions! :) My bad for not making sure those were all present to begin with...
I still haven't gotten external IPs to appear for any of the VMs, but I found a way (https://stackoverflow.com/a/71452930) to connect VS Code over SSH to a VM without an external IP, so I have no need for external IPs, at least for now.
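
For reference, a VM without an external IP can usually also be reached from a terminal through an IAP tunnel, which is what the `roles/iap.tunnelResourceAccessor` role mentioned earlier permits. This is a sketch and not necessarily the method in the linked answer; the zone is a placeholder, and the helper only prints the command:

```shell
# iap_ssh_command VM_NAME ZONE
# Prints the gcloud command for an SSH session tunnelled through IAP.
iap_ssh_command() {
  echo gcloud compute ssh "$1" --zone="$2" --tunnel-through-iap
}

# The zone below is an assumption; use the zone the cluster was deployed in.
iap_ssh_command kubatfsl1-controller us-central1-a
```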

Thank you for your help! I learned from your advice, even if it didn't end up being the solution here.

Best wishes,
Kuba

Gonzalo Sáez

Jul 13, 2022, 4:04:11 AM
to google-cloud-slurm-discuss
Hi Kuba,

I'm seeing the same errors when using a custom VM image built with Packer. Everything works well when using the pre-built Slurm CentOS image. Could you please explain a bit more about what you did to add the IAM permissions? Which principal did you assign the IAM permissions to?

Thank you,

Gonzalo

Kuba Perlin

Jul 13, 2022, 11:25:33 AM
to google-cloud-slurm-discuss
Hi Gonzalo,


I made a service account (SA) to act as the terraform user, so that I don't permanently have all those admin permissions. I added a key to the SA and downloaded the private key. Then you can run
`gcloud auth activate-service-account x...@xxx.iam.gserviceaccount.com --key-file ~/.ssh/slurm_sa_key` where the x...@xxx part is your SA's address.

Then you're acting as the SA and can run `terraform apply`. You can later switch back to your own identity using some gcloud commands.
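
The whole round trip might look like the sketch below. The function names, the key path, and the e-mail addresses are placeholders, and `gcloud config set account` is one way (an assumption about which gcloud commands are meant) to switch back. Commands are only printed unless `DRY_RUN=0` is set:

```shell
# run: print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# apply_as_sa SA_EMAIL KEY_FILE USER_EMAIL
apply_as_sa() {
  # Make the service account the active gcloud identity.
  run gcloud auth activate-service-account "$1" --key-file="$2"
  run terraform apply -var-file=example.tfvars
  # Switch gcloud back to the personal account afterwards.
  run gcloud config set account "$3"
}

# Placeholder arguments (assumptions); substitute your own.
apply_as_sa slurm-sa@my-project.iam.gserviceaccount.com ~/.ssh/slurm_sa_key me@example.com
```

`gcloud auth list` shows which account is active at any point.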

I hope that helps

Gonzalo Sáez

Jul 17, 2022, 7:36:19 AM
to google-cloud-slurm-discuss
Thanks Kuba, that was very helpful indeed!