Roger Mason
May 24, 2022, 11:00:00 AM
to slurm...@lists.schedmd.com
Hello,
I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes
with this written to slurm-*.out:
less 1x1x1_220524_121358/slurm-1368_1.out
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.conf
srun: error: Task launch for 1368.0 failed on node node012: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
The same job runs correctly on either of two other nodes.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
macpro*   up    infinite  1     idle  node012
macpro*   up    infinite  3     down  node[001-002,004]
I can ssh into node012 and the above sinfo suggests no communication
problems. I have not modified slurm.conf recently.
I would appreciate any suggestions on what might be causing this problem
or what I can do to diagnose it.
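Since srun reports 'Unable to resolve "node012"', one check that follows directly is whether the node names resolve on the host where srun runs. Below is a minimal sketch of such a check, assuming Python 3 is available on that host. The helper name resolve_hosts is hypothetical, and it only queries the system resolver (/etc/hosts, DNS); it does not read slurm.conf.

```python
import socket


def resolve_hosts(hostnames):
    """Map each hostname to a sorted list of resolved addresses.

    A hostname that the system resolver cannot resolve maps to None.
    (Hypothetical helper for illustration; not part of Slurm.)
    """
    results = {}
    for name in hostnames:
        try:
            # getaddrinfo goes through the same resolver libc uses,
            # so /etc/hosts and DNS are both consulted.
            infos = socket.getaddrinfo(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            results[name] = None
    return results
```

Calling resolve_hosts(["node001", "node002", "node004", "node012"]) on the submit host and again on a compute node would show whether any name maps to None on one machine but not the other.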
Thanks,
Roger