Assistance Needed: FATAL ERROR - Network Connectivity Issue

Naveen M

Feb 26, 2024, 5:52:02 AM
to google-cloud-slurm-discuss
Hi Team,

We are currently encountering a critical issue with our Slurm workloads. The base image for our Slurm nodes comes from the source project 'schedmd-slurm-public', image family 'slurm-gcp-5-7-hpc-centos-7'.

The issue manifests as a FATAL ERROR with the message: "Network error: Software caused connection abort." Our setup involves using the Plink CLI to establish connections from our on-premises Windows server to the GCP cloud. The communication traverses both on-premises and cloud firewalls.

Upon investigation, we observed the error occurring on the on-premises Windows server. We checked the on-premises firewall logs and identified a TCP reset from the destination side. However, on the cloud firewall, we found no denial of traffic.

To provide additional context, our analysis of CPU and memory utilization on both machines revealed that the on-premises VM's CPU did not exceed 30%, and memory usage remained below 70%. Similarly, on the GCP cloud side, the server's CPU utilization did not surpass 30%, and memory usage remained below 40%.

We have already raised a ticket with Google Cloud Support to investigate any potential network-related issues. Their examination found no issues, either with the guest OS "slurm-gcp-5-7-hpc-centos-7" (CentOS 7) or at the network level.

Given the complexity of our setup and the involvement of Slurm, we are reaching out to seek assistance. Have any team members encountered similar issues before? We would greatly appreciate any insights, suggestions, or resolutions that could help us address this matter and improve the stability of our Slurm workloads.
Screenshot 2024-02-26 at 4.20.34 PM.png
For better understanding, I have attached screenshots related to the issue.

Your prompt attention and support on this matter are highly appreciated.

Regards,
Naveen M

Olivier Martin

Feb 26, 2024, 8:30:26 AM
to Naveen M, google-cloud-slurm-discuss
For what it's worth: if at all possible, I would try assigning a temporary public IP address to this endpoint (10.236.164.78) and allowing access to it only from the public IP of your Windows VM, then see whether the connection works. That may let you rule out the VPN devices as the source of the TCP reset. If it works this way, you know the cause is one of the devices that builds the VPN or connects the sites together (routers or similar).
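
To compare the two paths (through the VPN versus the temporary public IP), it may help to take Plink out of the picture and watch a bare TCP connection. Below is a minimal Python sketch, not a definitive diagnostic: it opens a connection, leaves it idle, and reports how it ends. The function name `probe_idle_connection`, the default port 22, and the result labels are my own assumptions for illustration. On Windows, the "Software caused connection abort" condition normally surfaces in Python as `ConnectionAbortedError`, which is why it is reported separately from a plain reset.

```python
import socket
import time


def probe_idle_connection(host, port=22, hold_seconds=60.0, connect_timeout=5.0):
    """Open a TCP connection, keep it idle, and report how it ends.

    Returns one of: "refused", "connect-timeout", "unreachable",
    "reset", "aborted", "closed-by-peer", "still-open".
    (Hypothetical labels chosen for this sketch.)
    """
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except ConnectionRefusedError:
        return "refused"            # peer or a firewall answered with RST
    except socket.timeout:
        return "connect-timeout"    # SYN was silently dropped somewhere
    except OSError:
        return "unreachable"        # routing or name-resolution failure

    with sock:
        deadline = time.monotonic() + hold_seconds
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return "still-open"
            sock.settimeout(remaining)
            try:
                data = sock.recv(4096)
            except socket.timeout:
                return "still-open"      # survived the whole idle period
            except ConnectionResetError:
                return "reset"           # RST received on an established connection
            except ConnectionAbortedError:
                return "aborted"         # local stack gave up on the connection
            if data == b"":
                return "closed-by-peer"  # orderly FIN from the far side
            # Any data received (e.g. the SSH banner) is ignored; keep waiting.
```

For example, running `probe_idle_connection("10.236.164.78", hold_seconds=600)` from the Windows server over the VPN, and then against the temporary public IP, would show whether the drop happens on only one of the paths. A "reset" on the VPN path together with "still-open" on the direct path would point at a device in between.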


