[slurm-users] launch failed requeued held


sportlecon sportlecon via slurm-users

Jan 7, 2025, 6:30:00 AM
to slurm...@lists.schedmd.com
Slurm 24.11: squeue displays the reason "launch failed requeued held".

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

John Hearns via slurm-users

Jan 8, 2025, 2:52:12 AM
to sportlecon sportlecon, slurm...@lists.schedmd.com
You need to find the node which the job started on.
Then look at the slurmd log on that node. You may find an indication of the reason for the failure.
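
A minimal sketch of those two steps, in case it helps. The function names job_nodes and job_log_lines are made up for illustration, the job ID in the usage note is a placeholder, and the log path is the common default rather than something confirmed for this cluster (check SlurmdLogFile in slurm.conf).

```shell
# Print the node(s) used by every recorded attempt of a job.
# sacct -D includes the attempt that ran before the requeue;
# pending attempts report "None assigned" and are filtered out.
job_nodes() {
    sacct -D -n -X -P -j "$1" --format=NodeList \
        | grep -v 'None assigned' \
        | sort -u
}

# Print slurmd log lines that contain the job ID. Run this on the node itself.
# The default path is an assumption; pass another path as the second argument.
# This is a plain substring match, so unrelated lines with the same digits
# can show up too.
job_log_lines() {
    grep -- "$1" "${2:-/var/log/slurm/slurmd.log}"
}
```

For example, run job_nodes 12345 on a login node, then job_log_lines 12345 on the node it prints. Once the cause is fixed, scontrol release 12345 should take the job out of the held state.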

John Hearns via slurm-users

Jan 8, 2025, 6:43:34 AM
to sportlecon sportlecon, slurm...@lists.schedmd.com
Generally, the troubleshooting steps which you should take for Slurm are:

squeue to look at the list of running/queued or held jobs

sinfo to show which nodes are idle, busy or down

scontrol show node <nodename> to get more detailed information on a node

For problem nodes, log into the node and run the commands below. It is also worth logging into any healthy node to see what normal output looks like:
systemctl status slurmd
cat /var/log/slurm/slurmd.log

On your Slurm controller, look at the slurmctld and slurmdbd logs.
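
The steps above could be wrapped up roughly like this. It is only a sketch: the function names are hypothetical, and the squeue format string is just one reasonable choice of columns (%i job ID, %T state, %r reason).

```shell
# Jobs whose reason column mentions "held", e.g. "launch failed requeued held".
held_jobs() {
    squeue -h -o "%i|%T|%r" | grep -i 'held'
}

# Nodes that are down or drained, with the reason recorded for each.
down_nodes() {
    sinfo -R
}

# Only the State= and Reason= lines from the detailed node record.
node_state() {
    scontrol show node "$1" | grep -E 'State=|Reason='
}

# Where the running configuration says the slurmctld and slurmd logs are.
# The slurmdbd log is set separately (LogFile in slurmdbd.conf).
log_paths() {
    scontrol show config | grep -i 'LogFile'
}
```

Running held_jobs and down_nodes first usually narrows things down to a node, and node_state <nodename> then shows why Slurm considers it unhealthy.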



