[slurm-users] srun problem -- Can't find an address, check slurm.conf

Scott Hazelhurst

Nov 7, 2018, 8:35:04 AM
to slurm...@lists.schedmd.com


Dear list

We have a relatively new installation of SLURM and have started to have a problem with some of the nodes when using srun:

[scott@cream-ce ~]$ srun --pty -w n38 hostname
srun: error: fwd_tree_thread: can't find address for host n38, check slurm.conf
srun: error: Task launch for 18710.0 failed on node n38: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


I’ve spent most of the day following up on others who’ve had similar problems and checking everything, but I haven’t made any progress:

— Using sbatch there is no problem: jobs launch on the given node and finish normally and reliably

— srun works fine for most of the nodes

— the slurm.conf file is identical on all nodes (checked by diffing, and there are no complaints in the logs); see the sketch just after this list

— both the slurmctld and the slurmd start cleanly with no obvious errors or warnings (e.g. about slurm.conf)

— sinfo reports that all our nodes are up, some busy some not. The problem is independent of load on the nodes

— I’ve increased the log level on the control daemon and there’s no obvious additional information when the srun happens

— we use Puppet to maintain our infrastructure, so while there must be a difference between the machines that work and those that don’t, I can’t see it

— all nodes run ntpd and the times appear the same when checked manually

— all nodes have plenty of disk space

— I’ve tried restarting both slurmd and slurmctld and this has no effect, not even briefly

— hostname on working and problematic nodes gives the expected results, in the same format as the others

— all hostnames are in /etc/hosts on all machines

— we currently have just under 40 worker nodes, and TreeWidth=50
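
For reference, this is roughly the kind of check I mean above (the config path is an assumption for our install; n02 and n38 are a working and a failing node from the examples further down):

# compare slurm.conf between a working and a failing node
for h in n02 n38; do ssh $h md5sum /etc/slurm/slurm.conf; done

# see what address and hostname slurmctld has on record for a failing node
scontrol show node n38 | grep -Ei 'NodeAddr|NodeHostName'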


We’re running SLURM 17.11.10 under CentOS 7.5


This is the final part of the slurm.conf file

NodeName=n[02,08,10,29-40,42-45] RealMemory=131072 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[05] RealMemory=256000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[07] RealMemory=45000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[15] RealMemory=48000 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
NodeName=n[16] RealMemory=31000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[17] RealMemory=215000 Sockets=16 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=n[18] RealMemory=90000 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
NodeName=n[19] RealMemory=515000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[20] RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[21] RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[22] RealMemory=56000 Sockets=16 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
NodeName=n[23] RealMemory=225500 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
NodeName=n[27] RealMemory=65536 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=n[28] RealMemory=65536 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN

PartitionName=batch Nodes=n[02,05,07-08,10,15-23,27-40,42-45] Default=YES MaxTime=40320 State=UP

As examples:
— fails: n15, n17, n27, n28, n38, n45
— succeeds: n02, n10, n16, n18, n29 onwards except for n38, n45


Many thanks for any help

Scott



Paul Edmon

Nov 7, 2018, 9:58:53 AM
to slurm...@lists.schedmd.com
This smacks of either the submission host, the destination host, or the
master not being able to resolve the name to an IP.  I would triple
check that to ensure that resolution is working.
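
For example, something like this on the submit host, the failing node, and the slurmctld host should all agree (n38 is just the node from your example; the second command needs the node's actual IP):

getent hosts n38            # forward lookup via the system resolver (nsswitch: files then DNS, typically)
getent hosts <ip-of-n38>    # reverse lookup, from /etc/hosts or a DNS PTR record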

-Paul Edmon-

Scott Hazelhurst

Nov 7, 2018, 10:21:04 AM
to Slurm User Community List

Thanks, Paul, yes, it does seem a likely cause, but I can’t see the problem. All machines have the same /etc/hosts file and the worker nodes are just listed one after the other. I’ve checked that the problem nodes are there — no obvious difference. I’ve checked that the IP address is correct.

Moreover, I can ping and ssh using either the node name (e.g. n38) or the FQDN.

Scott




> On 07 Nov 2018, at 16:57, Paul Edmon <ped...@cfa.harvard.edu> wrote:
>
> This smacks of either the submission host, the destination host, or the master not being able to resolve the name to an IP. I would triple check that to ensure that resolution is working.
>
> -Paul Edmon-

Paul Edmon

Nov 7, 2018, 10:24:20 AM
to slurm...@lists.schedmd.com
Yeah, these are frustrating ones to troubleshoot.  When I have seen this
in the past it was usually a missing forward or reverse record in DNS
that caused the problem.  You could try dialing the verbosity all the
way up and see what you can spot.  Otherwise I might recommend filing a
ticket with the SchedMD folks to see if they have any more insight.
Then again, someone on this list might have seen the same issue.
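
For the run itself you can crank the client verbosity right up, and the controller's log level can be raised on the fly without a restart (both are stock Slurm commands; debug3 is just an example level):

srun -vvvv -w n38 hostname
scontrol setdebug debug3    # revert with: scontrol setdebug info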

-Paul Edmon-

Scott Hazelhurst

Nov 13, 2018, 3:26:22 AM
to Slurm User Community List

Dear all

I still haven’t found the cause of the problem I raised last week, where srun -w xx runs for some nodes but not for others — thanks for the ideas.

One intriguing result I’ve had while pursuing this, which I thought I’d share in case it sparks some ideas: if I give the full path to srun, then it works.


# show path
[scott@cream-ce ~]$ which srun
/opt/exp_soft/bin/srun


# Node n37 is good (as are most of our nodes)
[scott@cream-ce ~]$ srun -w n37 --pty bash
[scott@n37 ~]$


# Node n38 is not (and a few others)
[scott@cream-ce ~]$ srun -w n38 --pty bash
srun: error: fwd_tree_thread: can't find address for host n38, check slurm.conf
srun: error: Task launch for 20094.0 failed on node n38: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

But if I give the full path name — it works!

[scott@cream-ce ~]$ /opt/exp_soft/slurm/bin/srun -w n38 --pty bash
[scott@n38 ~]$


Scott




mercan

Nov 13, 2018, 3:42:29 AM
to Slurm User Community List, Scott Hazelhurst
Hi;

Are there some typos, or are these really different paths:

/opt/exp_soft/slurm/bin/srun

vs.

which srun
/opt/exp_soft/bin/srun
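
If unsure, something like this will show every srun on the PATH, in the order the shell searches it (type -a is a bash builtin, nothing Slurm-specific):

type -a srun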

Ahmet Mercan



On 13.11.2018 at 11:24, Scott Hazelhurst wrote:

Scott Hazelhurst

Nov 13, 2018, 6:01:06 AM
to Slurm User Community List

Dear Mercan

Thank you! Yes, different paths, so different behaviour. Amazing how you can spend so much time looking at something and not see it.

On Sunday I did an upgrade from 17.11.10 to 17.11.12 to try to fix the problem, but I had left old binaries in a directory I should not have, so I kept getting the same behaviour.


I can’t be sure, but I think the problem I reported last week was in 17.11.10 and has gone away in 17.11.12
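
In hindsight, comparing the two binaries directly (the two paths from my earlier mail; -V/--version is a standard srun option) would probably have shown the mismatch straight away:

/opt/exp_soft/bin/srun --version         # the stale copy that was first on my PATH
/opt/exp_soft/slurm/bin/srun --version   # the upgraded install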

All good now


Again, thanks for the help

Scott



> Are there some typo errors or they are really different paths:
>
> /opt/exp_soft/slurm/bin/srun
>
> vs.
>
> which srun
> /opt/exp_soft/bin/srun