Why does the login node connect to external networks but the allocated compute node fail in Slurm-GCP?

Abhilash Mathews

May 22, 2023, 2:38:36 PM
to google-cloud-slurm-discuss
I've noticed that connecting to the internet from the allocated compute node via Slurm-GCP keeps failing. For example, using `wget` from the login node works successfully:

    [me@gcp-login0 ~]$ wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz
    --2023-05-11 19:06:34--  https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz
    Resolving cdn.kernel.org (cdn.kernel.org)... 111.111.1.111, 111.111.11.111, 111.111.111.111, ...
    Connecting to cdn.kernel.org (cdn.kernel.org)|111.111.1.111|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 102167060 (97M) [application/x-xz]
    Saving to: ‘linux-4.17.2.tar.xz’
   
    100%[======================================>] 102,167,060  277MB/s   in 0.4s  
   
    2023-05-11 19:06:35 (277 MB/s) - ‘linux-4.17.2.tar.xz’ saved [102167060/102167060]


But on an allocated compute node (a single GPU), `wget` stalls and fails:

    [me@gcp-compute-0-0 ~]$ wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz
    Resolving cdn.kernel.org (cdn.kernel.org)... 111.111.1.111, 111.111.11.111, 111.111.111.111, ...
    Connecting to cdn.kernel.org (cdn.kernel.org)|151.101.1.176|:443...
    failed: Connection timed out.

Accordingly, I was wondering if there's a way to solve this network issue for compute nodes on Slurm-GCP. I tried modifying firewall settings and VPC network details, but the changes seem to affect only the login node, and I've been unable to target the compute node settings on GCP.

Olivier Martin

May 22, 2023, 2:50:23 PM
to Abhilash Mathews, google-cloud-slurm-discuss
Is there a Cloud NAT gateway, with a corresponding Cloud Router, created on the VPC?
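
One way to check this from the command line is sketched below. The helper function names are made up for illustration, and the network, router, and region are passed as arguments because they depend on the deployment.

```shell
# List the Cloud Routers whose network URL matches the given VPC name.
# Usage: list_routers NETWORK_NAME
list_routers() {
  gcloud compute routers list --filter="network~$1"
}

# List the Cloud NAT gateways configured on one Cloud Router.
# Usage: list_nats ROUTER_NAME REGION
list_nats() {
  gcloud compute routers nats list --router="$1" --region="$2"
}
```

For example, `list_routers my-vpc` followed by `list_nats ROUTER_NAME us-central1`, where `my-vpc` and `ROUTER_NAME` are placeholders. An empty result from either command would suggest that the VPC has no NAT path for VMs without an external IP.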

Olivier Martin, HPC Customer Engineer

Abhilash Mathews

May 22, 2023, 4:15:11 PM
to Olivier Martin, google-cloud-slurm-discuss
Yes, the Cloud NAT gateway `cloud-nat-us-central1` on VPC network `slurm-gcp-v5-net` is currently running.

And the VM instances `gcp-controller` and `gcp-login0` associated with Slurm-GCP do have both external and internal IP addresses. 

Accordingly, is there anything else that I should be further checking/creating? 

Olivier Martin

May 22, 2023, 4:39:03 PM
to Abhilash Mathews, google-cloud-slurm-discuss
Hi Abhilash,

Can you confirm whether the gcp-controller and gcp-login0 nodes are both on the same VPC and in the same region as the compute nodes? Firewall rules can be configured to apply to any hosts you'd like, depending on how you set them up. You could use a "Connectivity Test" to check whether the compute nodes' IP range can reach the web; if it can't, the test may give you a signal as to what is wrong.

Olivier

Abhilash Mathews

May 22, 2023, 6:41:01 PM
to Olivier Martin, google-cloud-slurm-discuss
I can confirm that gcp-controller, gcp-login0, and the compute nodes are all in the same zone (us-central1-a, in region us-central1) and, to the best of my knowledge, use the same VPC network (slurm-gcp-v5-net).

But I did just notice that HTTP traffic and HTTPS traffic are oddly set to `Off` for the compute nodes, whilst `On` for gcp-controller and gcp-login0. (I naively assumed the firewall settings would be the same for the compute nodes as for gcp-controller and gcp-login0.) But even after switching HTTP traffic and HTTPS traffic to `On` for the compute nodes, I'm still unable to use `wget` on the allocated compute node.
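
For comparing the two kinds of node directly, a small sketch (the function name is made up for illustration) that prints each VM's network tags and network interfaces. The HTTP/HTTPS toggles in the console correspond to the `http-server` and `https-server` network tags, which are matched by ingress rules, so they are not expected to change outbound `wget` behaviour.

```shell
# Print a VM's network tags and network interfaces.
# An interface with no accessConfigs entry has no external IP.
# Usage: show_node_network INSTANCE_NAME ZONE
show_node_network() {
  gcloud compute instances describe "$1" --zone="$2" \
    --format="yaml(name, tags.items, networkInterfaces)"
}
```

Running `show_node_network gcp-login0 us-central1-a` and `show_node_network gcp-compute-0-0 us-central1-a` side by side should show whether the `network` and `subnetwork` fields really match, and which node has an external IP.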

And, following your suggestion I did try the Connectivity Test:

    [me@gcp-compute-0-0 ~]$ gcloud network-management connectivity-tests create my-test \
        --source-instance=projects/my-project/zones/us-central1-a/instances/my-instance \
        --destination-ip-address=10.142.0.2 \
        --destination-network=projects/my-project/global/networks/peering-network


But when I run the above, it simply stalls with no output. Is there anything further that I should try/check?

Olivier Martin

May 22, 2023, 7:36:14 PM
to Abhilash Mathews, google-cloud-slurm-discuss
Hi Abhilash,

Using the UI, I see something like the screenshot below (I'm not sure how to get the output with the CLI). Perhaps you'll find your test in the UI and can view its output there, or you can run the test from the UI altogether. Choose a destination that is a public IP; in your example, the destination is a private IP (10.142.0.2). You could also run the same test from the gcp-login0 node to see what the difference might be.
image.png
Please let me know what you get!
Olivier
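
For the CLI route, a rough sketch of the same test against a public destination. The function names are hypothetical; the source instance URI, destination IP, and port are arguments to fill in for the deployment.

```shell
# Create a connectivity test from a VM to a public IP and TCP port.
# Usage: create_egress_test TEST_NAME SOURCE_INSTANCE_URI DEST_IP DEST_PORT
create_egress_test() {
  gcloud network-management connectivity-tests create "$1" \
    --source-instance="$2" \
    --destination-ip-address="$3" \
    --destination-port="$4" \
    --protocol=TCP
}

# Print the reachability result recorded for an existing test.
# Usage: show_test_result TEST_NAME
show_test_result() {
  gcloud network-management connectivity-tests describe "$1" \
    --format="value(reachabilityDetails.result)"
}
```

For example, `create_egress_test compute-egress projects/my-project/zones/us-central1-a/instances/gcp-compute-0-0 151.101.1.176 443`, then `show_test_result compute-egress`. Here `my-project` is a placeholder, and 151.101.1.176 is the address from the failing `wget` output above.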

Abhilash Mathews

May 22, 2023, 8:20:48 PM
to google-cloud-slurm-discuss
Thank you for the response. Here (attached) are the results from the test with both the login and compute nodes. It seems that the login node works, but the compute node oddly fails. Any thoughts on why this may be arising? 

(I also received the following message for the compute node test: "This IP address is from a RFC 1918 range")
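
For reference, RFC 1918 reserves 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 for private use, so a VM that only has such an address needs NAT (or an external IP) to reach the internet. A minimal sketch of that check, assuming a well-formed dotted-quad IPv4 address as input (the function name is made up for illustration):

```shell
# Return success (0) if the IPv4 address lies in an RFC 1918 private range.
# Assumes a well-formed dotted-quad address; no validation is done.
is_rfc1918() {
  case "$1" in
    10.*|192.168.*)
      return 0 ;;
    172.*)
      # 172.16.0.0/12 covers second octets 16 through 31.
      octet=${1#172.}
      octet=${octet%%.*}
      if [ "$octet" -ge 16 ] 2>/dev/null && [ "$octet" -le 31 ] 2>/dev/null; then
        return 0
      fi ;;
  esac
  return 1
}
```

With this, `is_rfc1918 10.142.0.2` succeeds (private), while `is_rfc1918 151.101.1.176` fails (public).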

compute_node_connectivity_test.png
login_node_connectivity_test.png

Olivier Martin

May 22, 2023, 8:34:58 PM
to Abhilash Mathews, google-cloud-slurm-discuss
Intuitively, I feel like I gave you an IP that belongs to Google (8.8.8.8 is one of our DNS IPs), so the test wants to get there using Private Google Access. If you don't mind, can you try 151.101.1.176 and port 443 as the destination instead, and post the output? It's an IP I picked up when doing a wget on my machine to the same URL (https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz). I suspect the problem is NAT not being properly configured for the nodes (and any node without a public IP on your VPC would, I suspect, be affected, regardless of whether or not it is related to Slurm).

Olivier

Abhilash Mathews

May 22, 2023, 9:47:02 PM
to google-cloud-slurm-discuss
Interesting and thank you for letting me know. Attached are the new outputs when using your recommended inputs (i.e. destination 151.101.1.176 and port 443). It still looks like the login node works successfully, whilst the compute node does not work. Any further thoughts on fixing the compute node's settings would be great!
Screenshot 2023-05-22 at 21.42.33.png
Screenshot 2023-05-22 at 21.44.11.png

Olivier Martin

May 22, 2023, 9:54:43 PM
to Abhilash Mathews, google-cloud-slurm-discuss
It's a bit challenging to troubleshoot over email, but it seems like you have an issue with your NAT gateway configuration, which likely has nothing to do with your Slurm deployment. You could try to stand up a standalone VM without a public IP in the same VPC and subnet (so same region), and make sure this works for you. I feel like this part isn't working or configured properly: either a Cloud NAT gateway is missing on this specific VPC, or said Cloud NAT is improperly configured.
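
Both checks could be sketched with gcloud roughly as follows. The function names are hypothetical, and the NAT, router, region, zone, and subnet names are arguments to fill in for the deployment.

```shell
# Show which subnet IP ranges a Cloud NAT gateway is configured to translate.
# Usage: show_nat_coverage NAT_NAME ROUTER_NAME REGION
show_nat_coverage() {
  gcloud compute routers nats describe "$1" --router="$2" --region="$3" \
    --format="yaml(name, sourceSubnetworkIpRangesToNat, subnetworks)"
}

# Create a throwaway VM with no external IP on the given subnet.
# Usage: create_private_vm VM_NAME ZONE SUBNET_NAME
create_private_vm() {
  gcloud compute instances create "$1" --zone="$2" \
    --network-interface="subnet=$3,no-address"
}
```

If `sourceSubnetworkIpRangesToNat` is `LIST_OF_SUBNETWORKS`, the `subnetworks` list should include the subnet the compute nodes use; otherwise their traffic would not be translated. Running `wget` from the throwaway VM then tests the NAT path independently of Slurm.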


Abhilash Mathews

May 23, 2023, 11:00:38 AM
to google-cloud-slurm-discuss
The reason I believe it's related to the Slurm deployment is that the compute nodes (which are the ones not functioning) are fully instantiated by Slurm and are the only ones failing the Connectivity Test.

As recommended, I created a new standalone VM with the external IPv4 address set to None, using slurm-gcp-v5-primary-subnet, and it does pass the Connectivity Test (attached). Thus I'm led to believe it may be something specific to Slurm-GCP's setup of the compute nodes upon allocation with Slurm, but I'm glad to try anything that you think can help overcome this issue.

(Not sure if it's relevant, e.g. firewall troubleshooting, but in past usage of HPC resources, I could directly ssh into the compute nodes from the login node without the key. For example, `ssh gcp-compute-0-0` would work, but for Slurm-GCP, I find that `ssh -i ~/.ssh/id_rsa_gcp gcp-compute-0-0` is necessary by default.)
Screenshot 2023-05-23 at 10.49.53.png