[slurm-users] Restarting slurmd kills still-running jobs


Griebel, Christian via slurm-users

Feb 12, 2026, 2:58:18 PM
to slurm...@lists.schedmd.com

Dear community,


Trying to implement the latest fix/patch for munged, we restarted the updated munged locally on the compute nodes with "systemctl restart munged", resulting in the sudden death of a lot of compute nodes' slurmd.


Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.:

[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though having survived the death of their parent slurmd) were killed and re-queued...
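
To list which jobs a restart treated as stray, the log lines above can be filtered for the "Cleaning up stray StepId=" message. This is a minimal sketch that assumes the log format shown above; the function name `stray_job_ids` and the log path in the usage note are hypothetical, not taken from this thread.

```shell
#!/bin/sh
# Print the unique job IDs that slurmd reported as stray steps.
# Reads slurmd log text on stdin. Assumes lines of the form
#   [timestamp] Cleaning up stray StepId=<jobid>.<step>
stray_job_ids() {
    sed -n 's/.*Cleaning up stray StepId=\([0-9][0-9]*\)\..*/\1/p' | sort -u
}
```

On a node this could be run as `stray_job_ids < /var/log/slurm/slurmd.log`, where the path is only an example; the real location is whatever SlurmdLogFile points to.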


We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured. 
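
For anyone comparing their own nodes against this setup, the three settings can be read back on a compute node roughly as follows. This is a sketch, not a diagnosis: it assumes the unit is named "slurmd" and that scontrol is on the PATH, and the function names are hypothetical.

```shell
#!/bin/sh
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup mode.
# cgroup2fs indicates a pure v2 (unified) hierarchy; tmpfs is what v1
# and hybrid layouts typically show.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Print the cgroup mode, the Delegate setting of the slurmd unit
# (assumed unit name), and the configured ProctrackType.
report_node_settings() {
    echo "cgroup mode: $(classify_cgroup_fs "$(stat -fc %T /sys/fs/cgroup)")"
    systemctl show slurmd --property=Delegate
    scontrol show config | grep -i '^ProctrackType'
}
```

Calling `report_node_settings` on a node prints one line per setting, which makes it easy to diff a node where jobs survive a slurmd restart against one where they do not.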


Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.


Anyone experienced similar problems and got them solved...?


Thanks in advance -

--
___________________________
Christian Griebel/HPC


Christopher Samuel via slurm-users

Feb 12, 2026, 3:27:08 PM
to slurm...@lists.schedmd.com
On 2/12/26 2:56 pm, Griebel, Christian via slurm-users wrote:

> Anyone experienced similar problems and got them solved...?

No, sorry, updating the munge RPM for us across 5000+ nodes with running
jobs went without a hitch.

You weren't trying to change the munge key at the same time were you?

All the best,
Chris

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

William Brown via slurm-users

Feb 12, 2026, 4:38:26 PM
to Griebel, Christian, Slurm User Community List
I think the service name is munge, not munged, although the binary is munged.

Or was your 'systemctl restart munged' a typo?

William 



Brian Andrus via slurm-users

Feb 12, 2026, 7:03:02 PM
to slurm...@lists.schedmd.com

That smells like the munge key was changed, which would cause the behavior you see.


Brian Andrus

Christian Griebel, HRZ/HPC via slurm-users

Feb 12, 2026, 7:19:21 PM
to slurm...@lists.schedmd.com
... thanks for your first answers -


>  Or was your 'systemctl restart munged' a typo?

... yes, that was a typo - it wreaked havoc without the "d"...


>  You weren't trying to change the munge key at the same time were you?

No, that was planned for a later (down) time, though -
our /etc/munge/munge.key was untouched during the package update & restart.
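
One way to confirm that a key really is unchanged is to compare its checksum with a copy known to be good and to round-trip a credential through the local daemon. The sketch below assumes root access (the key file is not world-readable) and a reference copy of the key at a path of your choosing; the function names are hypothetical.

```shell
#!/bin/sh
# Succeed if two key files have identical content (compared by SHA-256).
same_key() {
    [ "$(sha256sum < "$1")" = "$(sha256sum < "$2")" ]
}

# Succeed if the local munged can encode and decode a credential.
check_local_munge() {
    munge -n | unmunge > /dev/null
}
```

For example, `same_key /etc/munge/munge.key /path/to/reference.key && check_local_munge` returns success only if both checks pass; the reference path is a placeholder. Piping `munge -n` into `unmunge` on another host over ssh tests the same key across nodes.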


>  That smells like the munge key was changed,

... it wasn't - unless a restart of the munge service causes a new key
to be created which I doubt ;-)



I have also asked next door @ bugs.schedmd.com, yet without a contract I
have little hope of being helped there.


--
___________________________
Christian Griebel/HPC


Ole Holm Nielsen via slurm-users

Feb 13, 2026, 5:00:29 AM
to slurm...@lists.schedmd.com
Dear Christian,

On 2/12/26 20:56, Griebel, Christian via slurm-users wrote:
> Trying to implement the latest fix/patch for munged, we restarted the
> updated munged locally on the compute nodes with "systemctl restart
> munged", resulting in the sudden death of a lot of compute nodes' slurmd.

What is your OS? What method did you use for updating the Munge software?

If you use the RPM package installation method, updating the munge*
packages will automatically restart the "munge" Systemd service without
any other user intervention. This worked perfectly for us (700 nodes).
The slurmd service on the compute nodes isn't affected by the restarted
munge service.

Best regards,
Ole