[slurm-users] Odd prolog Error?

Jason Simms

Apr 11, 2023, 12:49:19 PM
to Slurm User Community List
Hello all,

I'm regularly seeing array jobs fail, and the only log information from the compute node is this:

[2023-04-11T11:41:12.336] error: /opt/slurm/prolog.sh: exited with status 0x0100
[2023-04-11T11:41:12.336] error: [job 26090] prolog failed status=1:0
[2023-04-11T11:41:12.336] Job 26090 already killed, do not launch batch job
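(A note on decoding that first line: 0x0100 is a raw wait status, and the exit code lives in the high byte, so it is just a plain exit 1, consistent with the status=1:0 on the second line. A quick shell check:)

echo $(( 0x0100 >> 8 ))    # prints 1, the prolog's exit code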

The contents of prolog.sh are incredibly simple:

#!/bin/bash
loginctl enable-linger $SLURM_JOB_USER

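(For what it's worth, a more defensive variant of this prolog, just a sketch with an illustrative syslog tag, would record why it failed before returning nonzero, so slurmd's bare "status=1:0" has something to correlate with:)

#!/bin/bash
# Sketch of a hardened prolog: confirm the user resolves before
# calling loginctl, and log the failure reason to syslog.
if ! id "$SLURM_JOB_USER" >/dev/null 2>&1; then
    logger -t slurm-prolog "job $SLURM_JOB_ID: user $SLURM_JOB_USER does not resolve on $(hostname)"
    exit 1
fi
loginctl enable-linger "$SLURM_JOB_USER" || {
    logger -t slurm-prolog "job $SLURM_JOB_ID: enable-linger failed for $SLURM_JOB_USER"
    exit 1
}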
I can't sort out what's going on here. Here is an example submission script that can trigger the error:

#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -p compute
#SBATCH --array=1-100
#SBATCH -o tempOut/MSO-%j-%a.log

module load python3/python3
python3 runVoltage.py $SLURM_ARRAY_TASK_ID

Any insight would be welcome! This is really frustrating because it's constantly causing nodes to drain.

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
Schedule a meeting: https://calendly.com/jlsimms

Brian Andrus

Apr 11, 2023, 12:55:15 PM
to slurm...@lists.schedmd.com

From the documentation:

Parameter:     Prolog (from slurm.conf)
Location:      Compute or front end node
Invoked by:    slurmd daemon
User:          SlurmdUser (normally user root)
When executed: First job or job step initiation on that node (by default); PrologFlags=Alloc will force the script to be executed at job allocation
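(For orientation, the matching slurm.conf entries would look something like this; the path is taken from the log above, and the flag is optional:)

Prolog=/opt/slurm/prolog.sh
#PrologFlags=Alloc    # uncomment to run the prolog at allocation time rather than at first launch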

So ensure:
1) /opt/slurm/prolog.sh exists on the node(s)
2) the slurmd user is able to execute it

I would connect to the node and try to run the command as the slurmd user.
Also, ensure the user exists on the node, however you are propagating UIDs.
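(A quick smoke test along those lines, using a hypothetical username and assuming SlurmdUser is root; substitute your own values:)

test -x /opt/slurm/prolog.sh && echo "prolog present and executable"
sudo env SLURM_JOB_USER=jsmith SLURM_JOB_ID=26090 /opt/slurm/prolog.sh; echo "exit: $?"
getent passwd jsmith    # confirms the account resolves on this node (e.g. via sssd/LDAP)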

Brian Andrus

Jason Simms

Apr 11, 2023, 1:29:14 PM
to Slurm User Community List
Thanks, Brian, helpful as always. Yes, /opt/slurm/prolog.sh is on a share mounted over InfiniBand on all nodes, so it's reachable from everywhere. And the slurmd user can execute it.

I'll keep mucking around with it...

Warmest regards,
Jason

Jason Simms

Apr 11, 2023, 4:01:27 PM
to Slurm User Community List
Brian, your prompt about the user not being present on the node was what I needed. To close the loop on this: the error was due to an expired vendor SSL certificate for LDAP, which was causing sssd on the nodes to balk. Once the certificate was replaced, all is well again.
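(For anyone hitting the same wall, one quick way to spot an expired LDAP certificate, with a hypothetical server name and assuming LDAPS on port 636:)

openssl s_client -connect ldap.example.edu:636 </dev/null 2>/dev/null | openssl x509 -noout -dates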

Thanks,
Jason