[slurm-users] enable_configless, srun and DNS vs. hosts file

Mark Dixon

Nov 10, 2021, 10:14:15 AM
to slurm...@lists.schedmd.com
Hi,

I'm using the "enable_configless" mode to avoid the need for a shared
slurm.conf file, and am having similar trouble to others when running
"srun", e.g.

srun: error: fwd_tree_thread: can't find address for host cn120, check slurm.conf
srun: error: Task launch for StepId=113.0 failed on node cn120: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

I understand that the accepted solution is to add the nodenames to DNS. Is
that really correct?

I ask because it would be a great help if Slurm instead used the more
usual mechanism and consulted the sources listed in /etc/nsswitch.conf. We
use a large /etc/hosts file instead of DNS for our cluster and would
rather not start running named if we can help it.
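
One way to check whether the node names are visible through the sources listed in /etc/nsswitch.conf, independently of Slurm, is to ask the system resolver directly. The sketch below is an illustration and not something from this thread: it assumes Python 3 on a glibc-based system, where `socket.getaddrinfo` goes through the C library resolver and so sees /etc/hosts entries as well as DNS. The function name `unresolved_hosts` and its `resolver` parameter are made up for this example, and the result says nothing about how srun itself looks up addresses.

```python
import socket


def unresolved_hosts(names, resolver=socket.getaddrinfo):
    """Return the names the system resolver cannot turn into an address."""
    missing = []
    for name in names:
        try:
            # The port is irrelevant here; None asks only for name resolution.
            resolver(name, None)
        except socket.gaierror:
            # Raised when no configured source knows the name.
            missing.append(name)
    return missing
```

Calling `unresolved_hosts(["cn120"])` on a submit host should return an empty list if the name resolves there; a non-empty list would point at the host's own name resolution rather than at Slurm.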

Thanks,

Mark

PS Adding a line like "NodeName=cn[001-999]" to the submit/compute host
slurm.conf file makes this go away (I hope skipping the node detail, or
adding nodes that don't exist [yet] won't cause other problems).
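
To make concrete which names a line like "NodeName=cn[001-999]" covers, here is a minimal Python sketch that expands a single bracketed range. It is an illustration only: the function name `expand_simple_range` is hypothetical, and it handles just one `prefix[lo-hi]` range, not the full Slurm hostlist syntax (on a host with Slurm installed, `scontrol show hostnames` does the real expansion).

```python
import re


def expand_simple_range(expr):
    """Expand one 'prefix[lo-hi]' expression, keeping the zero padding."""
    match = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    prefix, lo, hi = match.groups()
    # Pad every number to the width of the lower bound, e.g. 001 -> 3 digits.
    width = len(lo)
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]
```

For example, `expand_simple_range("cn[001-003]")` gives `["cn001", "cn002", "cn003"]`, so the range in the PS stands for 999 names whether or not the corresponding nodes exist yet.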

Paul Brunk

Nov 12, 2021, 9:38:10 AM
to Slurm User Community List
Hi:

We run configless. If we add a node to slurm.conf and don't restart slurmd on our submit nodes, then attempts to submit to that new node will get the error you saw. Restarting slurmd on the submit node fixes it. This is the documented behavior (adding nodes needs slurmd restarted everywhere). Could this be what you're seeing (as opposed to /etc/hosts vs DNS)?

--
Wishing that I'd just listened this time,
Paul Brunk, system administrator, Workstation Support Group
GACRC (formerly RCC)
UGA EITS (formerly UCNS)


Diego Zuccato

Nov 15, 2021, 1:59:05 AM
to Slurm User Community List, Paul Brunk
I'm not yet using configless slurm, but shouldn't it be slurmctld on the
submit node?

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Ole Holm Nielsen

Nov 15, 2021, 1:53:00 PM
to slurm...@lists.schedmd.com
Links to the official Slurm documentation and presentations about
adding/removing nodes can be found in the Wiki page
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes

/Ole

Mark Dixon

Nov 16, 2021, 9:01:04 AM
to Slurm User Community List
Hi Paul,

Thanks for the thought but no, we'd restarted all slurmctld, slurmdbd and
slurmd daemons since changing any of the slurm config files.

I have a very cut-down slurm.conf on the non-slurmctld nodes, which seems
to be consulted when running srun (regardless of whether slurmd is running
or not).

Removing the simplified NodeName lines from the cut-down slurm.conf causes
srun to immediately return to the "can't find address for host" behaviour
I outlined at the start. I've seen this both on clients running slurmd and
on those that don't.

The cut-down slurm.conf is slowly growing: I've found that I also need to
add GresTypes, otherwise srun/sbatch don't know what users can put in
their "--gres" flag and so reject it. I guess at least that makes sense -
the tools need to get that information from somewhere.
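
As a rough illustration of the two settings mentioned so far, the Python sketch below renders them as slurm.conf text. The helper name `render_client_fragment` is hypothetical and the resource name "gpu" is an assumed example; the thread does not say which GRES types are in use, and a real cut-down file may need further settings that are not listed here.

```python
def render_client_fragment(node_expr, gres_types):
    """Build the NodeName and GresTypes lines discussed above as text."""
    lines = [f"NodeName={node_expr}"]
    if gres_types:
        # GresTypes takes a comma-separated list of resource names.
        lines.append("GresTypes=" + ",".join(gres_types))
    return "\n".join(lines) + "\n"


# "gpu" is an assumption for illustration, not a value from the thread.
print(render_client_fragment("cn[001-999]", ["gpu"]), end="")
```

Generating the fragment from one place, rather than editing it by hand on each client, is only one possible way to keep the growing file consistent across submit and compute hosts.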

Interesting!

Best,

Mark
