[slurm-users] srun : Communication connection failure


Durai Arasan

Jan 20, 2022, 9:41:15 AM
to Slurm User Community List
Hello Slurm users,

We are suddenly encountering strange errors while trying to launch interactive jobs on our cpu partitions. Have you encountered this problem before? Kindly let us know.

[darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G  --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash
srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Best regards,
Durai Arasan
MPI Tuebingen

Durai Arasan

Jan 20, 2022, 10:06:49 AM
to Slurm User Community List
Hello slurm users,

I forgot to mention that an identical interactive job works successfully on the gpu partitions (in the same cluster). So this is really puzzling.

Best,
Durai Arasan
MPI Tuebingen

Michael Robbert

Jan 20, 2022, 11:06:08 AM
to Slurm User Community List

It looks like it could be some kind of network problem, but it could also be DNS. Can you ping and resolve DNS for the host involved?

What does slurmctld.log say? How about slurmd.log on the node in question?
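
For reference, those checks could look something like the following (the log paths are whatever SlurmctldLogFile and SlurmdLogFile point to in your slurm.conf; /var/log/slurm/ below is only an assumed default, and the ID is taken from your error message):

# from the login/controller side: name resolution and reachability of the node
getent hosts slurm-cpu-hm-7
ping -c 3 slurm-cpu-hm-7

# on the controller: what slurmctld logged for that job
grep 1137134 /var/log/slurm/slurmctld.log

# on slurm-cpu-hm-7: what slurmd logged for the task launch
grep 1137134 /var/log/slurm/slurmd.log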

 

Mike

 



Durai Arasan

Jan 21, 2022, 6:31:12 AM
to Slurm User Community List
Hello Mike,

I am able to ping the nodes from the Slurm master without any problem. There is actually nothing of interest in slurmctld.log or slurmd.log (you can trust me on this), which is why I posted here.

Best,
Durai Arasan
MPI Tuebingen

Doug Meyer

Jan 21, 2022, 8:13:54 AM
to Slurm User Community List
Hi,
Did you recently add nodes?  We have seen that when we add nodes beyond the TreeWidth count, the most recently added nodes lose communication (an asterisk appears next to the node name in sinfo).  We have to make sure the TreeWidth setting in slurm.conf matches or exceeds the number of nodes.
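
A rough sketch of what that looks like (the value 128 here is only an example node count, not your cluster):

# slurm.conf, kept identical on the controller and all nodes
TreeWidth=128

# after pushing the change (scontrol reconfigure, or a daemon restart if your
# version requires it), check that no node carries the trailing '*'
# (not responding) state flag
scontrol reconfigure
sinfo -N -l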

Doug

Durai Arasan

Jan 25, 2022, 8:42:12 AM
to Slurm User Community List
Hello Mike, Doug:

The issue was somehow resolved. My colleagues say the addresses in slurm.conf on the login nodes were incorrect; it could also have been a temporary network issue.
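
For anyone hitting this later, the kind of mismatch meant above would be in the NodeName/NodeAddr entries of slurm.conf on the login node (the address below is made up for illustration), and can be cross-checked against what the controller reports:

# slurm.conf on the login node - a stale/incorrect NodeAddr here (example value)
NodeName=slurm-cpu-hm-7 NodeAddr=10.0.1.17

# compare with the controller's view of the node
scontrol show node slurm-cpu-hm-7 | grep -i -E 'NodeAddr|NodeHostName'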

Best,
Durai Arasan
MPI Tübingen

Doug Meyer

Jan 25, 2022, 7:19:40 PM
to Slurm User Community List
Always hate those odd problems.  Glad you are up!
Doug

Ryan Novosielski

Jan 25, 2022, 7:27:17 PM
to Slurm User Community List
I’m coming to this question late, and this is not the answer to your problem (well, maybe tangentially), but it may help someone else: my recollection is that for interactive jobs, the compute node that gets assigned the job must be able to contact the node you’re starting the job from (so bg-slurmb-login1 here) on a wide range of ports. We had a firewall config that didn’t allow for that, and all interactive jobs failed until we fixed it. I guess having the wrong address somewhere could mimic that behavior.
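
One common way to make that manageable (sketched here with assumed values; not necessarily what we run) is to pin the ports srun listens on with SrunPortRange in slurm.conf and open only that range toward the submit hosts:

# slurm.conf (cluster-wide)
SrunPortRange=60001-63000

# on the login/submit node, e.g. with firewalld
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload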

--
#BlackLivesMatter
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'