[slurm-users] Prolog error causing node to drain


Ratnasamy, Fritz via slurm-users

Oct 7, 2025, 10:55:21 PM
to Slurm User Community List
Hi,

We have a running Slurm cluster, and users have been submitting jobs for the past 3 months without any issues. Recently, some nodes are randomly being drained with the reason "prolog error".
Our slurm.conf has these two lines regarding the prolog:
PrologFlags=Contain,Alloc,X11
Prolog=/slurm_stuff/bin/prolog.d/prolog*
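
For reference, a quick way to confirm which scripts the glob matches and that they are executable (paths taken from our slurm.conf above):

# List everything the Prolog glob expands to, with permissions
ls -l /slurm_stuff/bin/prolog.d/prolog*
# Syntax-check each script without executing it
for f in /slurm_stuff/bin/prolog.d/prolog*; do bash -n "$f"; done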

Inside the prolog.d folder there are two scripts, which run with no errors as far as I can tell. Is there a way to debug why the nodes occasionally go into draining mode because of a "prolog error"? It seems to happen at random times and on random nodes.

From the log file, I can see only this:

Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: prolog failed: rc:230 output:Successfully started proces>

Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: [job 20398] prolog failed status=230:0

Oct 06 00:57:43 pgpu008 slurmd[3709622]: slurmd: Job 20398 already killed, do not launch batch job

Oct 06 13:06:23 pgpu008 systemd[1]: Stopping Slurm node daemon...

Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Caught SIGTERM. Shutting down.

Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Slurmd shutdown completing


Job 20398, the one killed in the log above, is currently in the state "Launch failed requeue held" after I resume the node.
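
For reference, this is roughly what I run to bring the node back and release the held job (node name and job ID taken from the logs above):

# Return the drained node to service
scontrol update NodeName=pgpu008 State=RESUME
# Release the requeue-held job so it can run again
scontrol release 20398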

Fritz Ratnasamy
Data Scientist
Information Technology


Laura Hild via slurm-users

Oct 8, 2025, 9:23:21 AM
to Ratnasamy, Fritz, Slurm User Community List
230 is a strange exit status. Are you sure there's nothing in the prolog scripts and nothing called by the prolog scripts that could be returning that?
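
For instance, something like this might turn up an explicit exit or a helper that could propagate a 230 (path taken from your slurm.conf):

# Search the prolog scripts for explicit exit/return codes
grep -rnE 'exit|return' /slurm_stuff/bin/prolog.d/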

Do you know why systemd is stopping slurmd about twelve hours later?

Is there anything in the general host log (e.g. /var/log/messages) or in dmesg during either of those times that might indicate why the prolog is failing or slurmd is stopping?
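
Something along these lines, with the window taken from your log excerpt, might help correlate (assuming journalctl is available on the node):

# System journal around the time of the prolog failure
journalctl --since "2025-10-06 00:50" --until "2025-10-06 01:00"
# Kernel messages with human-readable timestamps
dmesg -T | less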



________________________________________
From: Ratnasamy, Fritz via slurm-users <slurm...@lists.schedmd.com>
Sent: Tuesday, 07 October 2025 22:53
To: Slurm User Community List
Subject: [slurm-users] Prolog error causing node to drain

Chris Samuel via slurm-users

Oct 11, 2025, 7:20:43 AM
to slurm...@lists.schedmd.com
On 7/10/25 10:53 pm, Ratnasamy, Fritz via slurm-users wrote:

> Inside the prolog.d folder, there are 2 scripts which run with no errors
> as far as I can see but is there a way to debug why the nodes are going
> in draining mode once in a while because of "prolog error"? That seems
> to happen at random times and on random nodes.

You could try adding some logging at the start of your prolog to
capture execution and errors. Something like this:

~/tmp/test$ cat prolog.sh
#!/bin/bash

exec 1>>"/tmp/prolog.log.${SLURM_JOB_ID}.${$}"
exec 2>&1

set -x

echo hello
fooo
~/tmp/test$ SLURM_JOB_ID=1234 ./prolog.sh
~/tmp/test$ echo $?
127
~/tmp/test$ cat /tmp/prolog.log.1234.10512
+ echo hello
hello
+ fooo
./prolog.sh: line 9: fooo: command not found
~/tmp/test$
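
If you also want the final status recorded, one small (untested) addition to the same sketch would be a trap placed just after the exec lines, so it lands in the same log file:

trap 'echo "prolog exiting with status $?"' EXIT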



Best of luck!
Chris