[slurm-users] Jobs fail on specific nodes.

Roger Mason

May 24, 2022, 11:00:00 AM
to slurm...@lists.schedmd.com
Hello,

I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes
with this written to slurm-*.out:

less 1x1x1_220524_121358/slurm-1368_1.out
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.conf
srun: error: Task launch for 1368.0 failed on node node012: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

The same job runs correctly on either of two other nodes.

sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
macpro*   up    infinite  1     idle  node012
macpro*   up    infinite  3     down  node[001-002,004]

I can ssh into node012 and the above sinfo suggests no communication
problems. I have not modified slurm.conf recently.

I would appreciate any suggestions on what might be causing this problem
or what I can do to diagnose it.

Thanks,
Roger

Roger Mason

May 25, 2022, 11:09:54 AM
to slurm...@lists.schedmd.com

Roger Mason <rma...@mun.ca> writes:

> I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes

I forgot some information:

slurm 20.02.7 on FreeBSD 12.2.

New information:

Running this from the controller succeeds on both machines:

srun -w node[002,012] hostname

Gerhard Strangar

May 25, 2022, 12:06:42 PM
to slurm...@lists.schedmd.com
Roger Mason wrote:

> I would appreciate any suggestions on what might be causing this problem
> or what I can do to diagnose it.

Run getent hosts node012 on all hosts to see which one can't resolve it.
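
To check several node names in one pass, the same lookup can be scripted. The following is only a minimal sketch, not part of Slurm: the function name check_hosts and its resolve parameter are hypothetical, and it assumes that Python's socket.gethostbyname goes through the same system resolver (hosts file, then DNS) that getent hosts consults, which may not hold for every nsswitch configuration.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def check_hosts(
    names: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each hostname to its resolved IPv4 address, or None if lookup fails.

    `resolve` is injectable so the logic can be exercised without real
    name resolution; by default the system resolver is used.
    """
    results: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are both OSError subclasses.
            results[name] = None
    return results
```

Running check_hosts(["node001", "node002", "node004", "node012"]) on each machine in turn should show a None entry on any host whose hosts file or DNS setup is missing that node.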

Roger Mason

May 25, 2022, 2:26:29 PM
to slurm...@lists.schedmd.com

Gerhard Strangar <g...@arcor.de> writes:

> Run getent hosts node012 on all hosts to see which one can't resolve
> it.

Thank you, that located a problem with the hosts file on some nodes.
Fixed.

Best wishes,
Roger

