Hi
I have made some progress in understanding the problem I reported a few weeks ago:
https://lists.schedmd.com/pipermail/slurm-users/2023-May/010027.html
I noticed that the intermittent connection timeout I am experiencing occurs only
when the TCP-based direct connection is used for communication between the stepd
processes on different nodes.
When I disable the optimized direct connection with
export SLURM_PMIX_DIRECT_CONN=false
the submission of hetjobs is stable and the
connection timeout no longer occurs.
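
A minimal sketch of the workaround in context, assuming a two-component hetjob
launched with srun and the pmix MPI plugin. The binaries ./app_a and ./app_b and
the node/task counts are placeholders, not the actual job from my report:

```shell
#!/bin/sh
# Workaround: turn off the direct stepd-to-stepd connection of the PMIx plugin.
export SLURM_PMIX_DIRECT_CONN=false

# Hypothetical hetjob with two components separated by ":".
# ./app_a and ./app_b are placeholder binaries (assumption).
HETJOB_CMD="srun --mpi=pmix -N1 -n1 ./app_a : -N1 -n1 ./app_b"

# Launch only where srun is available; otherwise just show the command.
if command -v srun >/dev/null 2>&1; then
    $HETJOB_CMD
else
    echo "srun not found; would run: $HETJOB_CMD"
fi
```

The variable has to be exported in the environment from which srun is started so
that it reaches the stepd processes.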
Any idea what can go wrong when using the TCP-based direct connection together with hetjobs?