Dear all
I still haven’t found the cause of the problem I raised last week, where srun -w xx works for some nodes but not for others. Thanks for the ideas so far.
One intriguing result from pursuing this, which I thought I’d share in case it sparks some ideas: if I give the full path to srun, it works.
# show path
[scott@cream-ce ~]$ which srun
/opt/exp_soft/bin/srun
# Node n37 is good (as are most of our nodes)
[scott@cream-ce ~]$ srun -w n37 --pty bash
[scott@n37 ~]$
# Node n38 is not (nor are a few others)
[scott@cream-ce ~]$ srun -w n38 --pty bash
srun: error: fwd_tree_thread: can't find address for host n38, check slurm.conf
srun: error: Task launch for 20094.0 failed on node n38: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
But if I give the full path name, it works!
[scott@cream-ce ~]$ /opt/exp_soft/slurm/bin/srun -w n38 --pty bash
[scott@n38 ~]$
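Note that in the transcript above, which reports /opt/exp_soft/bin/srun, while the full path that works is /opt/exp_soft/slurm/bin/srun. One check that may be worth running is whether those two resolve to the same file. A minimal sketch follows; resolve_cmd and same_binary are hypothetical helper names (not part of Slurm), and it assumes a readlink that supports -f (GNU coreutils).

```shell
# Hypothetical helpers for comparing two command names or paths.
# Assumes GNU readlink (-f follows symlinks to the canonical path).

resolve_cmd() {
    # Look the command up the way the shell would, then canonicalise it.
    p=$(command -v "$1") || return 1
    readlink -f "$p"
}

same_binary() {
    # Exit 0 if both arguments resolve to the same file,
    # 1 if they differ, 2 if either one cannot be resolved.
    a=$(resolve_cmd "$1") || return 2
    b=$(resolve_cmd "$2") || return 2
    [ "$a" = "$b" ]
}
```

On the login node this could be used as: same_binary srun /opt/exp_soft/slurm/bin/srun && echo same || echo different. If they differ, comparing the output of srun --version for each might show whether two separate installs (possibly picking up different slurm.conf files) are involved, though that is only a guess at this point.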
Scott