[slurm-users] Submit job using srun fails but sbatch works


Alexander Åhman

May 29, 2019, 9:00:40 AM
to Slurm User Community List
Hi,
I have a very strange problem. The cluster had been working just fine
until one node died, and now I can't submit jobs to 2 of the nodes using
srun from the login machine. Using sbatch works just fine, and so does
srun when run from the same host as slurmctld.
All the other nodes work just fine, as they always have; only 2 nodes
are experiencing this problem. Very strange...

I have checked network connectivity and DNS, and both are OK. I can ping
and ssh to all nodes just fine. All nodes are identical and run Slurm 18.08.
I have also rebooted the 2 nodes and slurmctld, but the problem remains.

[alex@li1 ~]$ srun -w cn7 hostname
srun: error: fwd_tree_thread: can't find address for host cn7, check
slurm.conf
srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an
address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

[alex@li1 ~]$ srun -w cn6 hostname
cn6.ydesign.se

What is this error "can't find address for host" about? I have searched
the web but can't find any good information about what the problem is or
how to resolve it.

Any kind soul out there who knows what to do next?

Regards,
Alexander Åhman


Ole Holm Nielsen

May 29, 2019, 9:13:38 AM
to slurm...@lists.schedmd.com
Hi Alexander,

The error "can't find address for host cn7" would indicate a DNS
problem. What is the output of "host cn7" from the srun host li1?

How many network devices are in your subnet? It may be that the Linux
kernel is doing "ARP cache thrashing" if the number of devices approaches
512. What is the result of "arp cn7"?

To fix ARP cache thrashing, see my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
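
A minimal sketch of the kind of tuning that page describes: the ARP
(neighbour) cache limits are kernel sysctls. The threshold values below
are illustrative only, not recommendations; see the wiki page above for
properly sized settings.

```shell
# Illustrative values only -- size them for your own network.
# Entries are never garbage-collected while the cache holds fewer than
# gc_thresh1 entries; gc_thresh3 is the hard maximum cache size.
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
EOF
sysctl -p
```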

Best regards,
Ole

Alexander Åhman

May 29, 2019, 10:46:25 AM
to slurm...@lists.schedmd.com
I have tried to find a network error but can't see anything. Every node I've tested has the same (and correct) view of things.

On node cn7: (the problematic one)
em1: link/ether 50:9a:4c:79:31:4d inet 10.28.3.137/24

On login machine:
[alex@li1 ~]$ host cn7
cn7.ydesign.se has address 10.28.3.137
[alex@li1 ~]$ arp cn7
Address                  HWtype  HWaddress           Flags Mask            Iface
cn7.ydesign.se           ether   50:9a:4c:79:31:4d   C                     em1

On slurmctld machine:
[alex@cmgr1 ~]$ host cn7
cn7.ydesign.se has address 10.28.3.137
[alex@cmgr1 ~]$ arp cn7
Address                  HWtype  HWaddress           Flags Mask            Iface
cn7.ydesign.se           ether   50:9a:4c:79:31:4d   C                     em1


Yes, I have seen your pages and must say they have been pure gold on many occasions, thanks a lot Ole! But our cluster is still tiny, and the whole cluster is located in its own network segment. The number of ARP entries is far below 512 (actually more like ~30).

I just don't understand why sbatch works but srun doesn't.
Could this be some error in the state files, perhaps? Something that got corrupted when the node (cn7) unexpectedly died?
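
If it comes to inspecting the controller's state files, their location
can be read from the running configuration. A sketch (the path itself is
site-specific):

```shell
# Ask slurmctld where it keeps its saved state (node/job/step records).
scontrol show config | grep -i StateSaveLocation
```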

Regards,
Alexander

Alex Chekholko

May 29, 2019, 1:25:06 PM
to Slurm User Community List
I think this error usually means that node cn7 has either the wrong /etc/hosts or the wrong /etc/slurm/slurm.conf.

E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'
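
One quick way to verify the files really are identical is to compare
checksums from the login node. A sketch, assuming the node names and
paths used in this thread:

```shell
# Identical checksums mean identical files on the login node and cn7.
for f in /etc/slurm/slurm.conf /etc/hosts; do
    md5sum "$f"
    ssh cn7 md5sum "$f"
done
```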

Alexander Åhman

Jun 3, 2019, 10:54:03 AM
to slurm...@lists.schedmd.com
That was my first thought too, but... no. Both /etc/hosts (not used) and slurm.conf are identical on all nodes, both working and non-working nodes.

From login machine:
[alex@li1 ~]$ srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118071 queued and waiting for resources
srun: job 1118071 has been allocated resources

srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
srun: error: Task launch for 1118071.0 failed on node cn7: Can't find an address, check slurm.conf

srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

From slurmctld machine:
[root@cmgr1 ~]# srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118076 queued and waiting for resources
srun: job 1118076 has been allocated resources
PING cn7.ydesign.se (10.28.3.137) 56(84) bytes of data.
64 bytes from cn7.ydesign.se (10.28.3.137): icmp_seq=1 ttl=64 time=0.012 ms

--- cn7.ydesign.se ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.012/0.012/0.012/0.000 ms


I guess that some state file somewhere got corrupted. I think the next mission will be to try resetting the correct state file and trying again, or if that fails: clean it with fire! ;-)

Regards,
Alexander Åhman

Chris Samuel

Jun 6, 2019, 2:42:37 AM
to slurm...@lists.schedmd.com
On Monday, 3 June 2019 7:53:39 AM PDT Alexander Åhman wrote:

> That was my first thought too, but... no. Both /etc/hosts (not used) and
> slurm.conf are identical on all nodes, both working and non-working nodes.

I think Slurm caches things like that, so it might be worth restarting
slurmctld to see if that helps.
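
A sketch of that, together with a check of the address the controller
currently has recorded for the node (the systemd unit name may differ
per site):

```shell
# Show the address and hostname slurmctld has recorded for cn7.
scontrol show node cn7 | grep -E -i 'NodeAddr|NodeHostName'

# Restart the controller so it re-resolves node addresses, then make
# all daemons re-read slurm.conf.
systemctl restart slurmctld
scontrol reconfigure
```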

Good luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA



