Hi Team,
We are encountering a critical issue with our Slurm workloads. The base image for Slurm is the machine image from the source project 'schedmd-slurm-public' and the source family 'slurm-gcp-5-7-hpc-centos-7'.
The issue manifests as a FATAL ERROR with the message: "Network error: Software caused connection abort."
We use the Plink CLI to establish connections from our on-premises Windows server to the GCP cloud. The traffic traverses both the on-premises firewall and the cloud firewall.
The error is reported on the on-premises Windows server. The on-premises firewall logs show a TCP reset (RST) coming from the destination side, but the cloud firewall logs show no denied traffic.
For additional context, we checked CPU and memory utilization on both machines. On the on-premises VM, CPU did not exceed 30% and memory usage remained below 70%. On the GCP server, CPU did not exceed 30% and memory usage remained below 40%.
We have already raised a ticket with Google Cloud Support to investigate potential network-related issues. Their investigation found no problems, either with the guest OS "slurm-gcp-5-7-hpc-centos-7" (CentOS 7) or at the network level.
Given the complexity of our setup and the involvement of Slurm, we are reaching out for assistance. Has anyone encountered a similar issue before? We would greatly appreciate any insights or suggestions that could help us resolve this and improve the stability of our Slurm workloads.
I have attached screenshots of the issue for reference.
Your prompt attention and support on this matter are highly appreciated.
Regards,
Naveen M