[slurm-users] pam_slurm_adopt malfunctions after Slurm upgrade to 25.11


hermes via slurm-users

Dec 16, 2025, 6:04:40 AM
to slurm...@lists.schedmd.com

Hello everyone:

 

We recently upgraded Slurm to 25.11 and found that pam_slurm_adopt is broken. As a result, users cannot SSH to the compute nodes where their jobs are running.

We have, of course, made sure that all Slurm-related packages were upgraded together.

The simplest test to reproduce it is:

```
> cat test1.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --nodelist=cas639
sleep 6000

> sbatch test1.sh
Submitted batch job 51047091

> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          51047091     debug     test   tester  R       0:28      1 cas639

> ssh cas639
(wait for a long time...)
Connection closed by 172.16.3.129 port 22 (finally failed to ssh)
```
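For anyone trying to reproduce this: the module is enabled in the usual way through the sshd PAM stack. A typical entry looks like the following sketch (the exact file and module arguments vary by distribution; `action_unknown=deny` here is illustrative, not our exact configuration):

```
# /etc/pam.d/sshd (illustrative placement, after other account modules)
# Deny SSH logins from users who have no running job on this node.
account    required    pam_slurm_adopt.so action_unknown=deny
```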

On the target compute node, we see the following debug messages from pam_slurm_adopt.so:

```
cas639 pam_slurm_adopt[1007301]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default)
cas639 pam_slurm_adopt[1007301]: debug:  slurm_conf_init: using config_file=/etc/slurm/slurm.conf
cas639 pam_slurm_adopt[1007301]: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
cas639 pam_slurm_adopt[1007301]: PreemptMode=GANG is a cluster-wide option and cannot be set at partition level, option ignored.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug:  auth/munge: init: loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/certgen_script.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Certificate generation script plugin type:certgen/script version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug:  certgen/script: init: loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/tls_none.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Null tls plugin type:tls/none version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug:  tls/none: init: tls/none loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_slurmdbd.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Accounting storage SLURMDBD plugin type:accounting_storage/slurmdbd version:0x190b00
cas639 pam_slurm_adopt[1007301]: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/cred_munge.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x190b00
cas639 pam_slurm_adopt[1007301]: cred/munge: init: Munge credential signature plugin loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.extern
cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.batch
cas639 pam_slurm_adopt[1007301]: Connection by user tester: user has only one job 51047091
cas639 pam_slurm_adopt[1007301]: debug:  _adopt_process: trying to get StepId=51047091.extern to adopt 1007301
cas639 pam_slurm_adopt[1007301]: debug:  Leaving stepd_add_extern_pid
cas639 pam_slurm_adopt[1007301]: debug:  Leaving stepd_get_x11_display
cas639 pam_slurm_adopt[1007301]: debug:  entering stepd_get_namespace_fd
```

It looks like something blocks during stepd_get_namespace_fd. We also found nothing in the slurmd log, even with SlurmdDebug=debug5, so I guess the PAM module never reached the point of talking to slurmd (if it should at all).

Could this be a compatibility problem between Slurm 25.11 and the EL8 system or cgroup/v1?

Or can anyone suggest how to further locate the fault?
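One thing we may try next is attaching strace to the hung sshd process, to see which syscall stepd_get_namespace_fd is blocked in. A sketch only; the PID below is the one from the pam_slurm_adopt log above, so substitute the sshd child handling your connection:

```
# On the compute node, while the ssh attempt is hanging:
strace -f -p 1007301 -e trace=connect,recvmsg,read
```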

 

Best regards,

Hermes

Brian Andrus via slurm-users

Dec 16, 2025, 2:00:28 PM
to slurm...@lists.schedmd.com

pam_slurm_adopt is a PAM module, so it does not talk to slurmd.

It looks like it is having trouble matching the uid info for your tester user. Is that a local account? It needs to be available with the same uid/gid on both the submitting node and the node it is trying to run on.
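A quick way to check is to run the same NSS lookup pam_slurm_adopt relies on, on both nodes, and compare the output. A sketch ("root" is just a placeholder that exists everywhere; use the actual job owner's name):

```shell
# Run on both the submit node and the compute node; the entry,
# uid, and gid must match exactly across the two.
getent passwd root
id -u root
id -g root
```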

Brian Andrus

Christopher J Orr via slurm-users

Dec 16, 2025, 2:35:56 PM
to slurm...@lists.schedmd.com

Hello!

I'm also experiencing this problem, on Rocky 8 machinery. I did test
switching a node over to cgroupv2, and it still failed similarly. Note
that the slurmd from v25.05.5 works fine. I've not tried isolating it
further, but likely will. (Unless 25.11.1 hits with a fix soon!)



On Tue, 2025-12-16 at 10:58 -0800, Brian Andrus via slurm-users wrote:
> > Connection closed by 172.16.3.129 port 22 (finally failed to ssh)
> > found nothing in slurmd log even with SlurmdDebug = debug5, so I
> > guess the pam module had not run to the step to talk with slurmd
> > (if it should).
> > Would it be a compatibility problem between slurm25.11 and EL8
> > system or cgroup/v1?
> > Or can anyone help to give some suggestion on how to further locate
> > the fault point?
> >
> > Best regards,
> > Hermes

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

hermes via slurm-users

Dec 16, 2025, 8:26:53 PM
to slurm...@lists.schedmd.com
Our test nodes are all connected to LDAP through SSSD, and I can confirm the test user exists on both the submit and compute nodes (getent passwd XXX shows exactly the same result on both).
> > (wait for a long time...)
> >
> > Connection closed by 172.16.3.129 port 22 (finally failed to ssh)
> > found nothing in slurmd log even with SlurmdDebug = debug5, so I
> > guess the pam module had not run to the step to talk with slurmd (if
> > it should).
> > Would it be a compatibility problem between slurm25.11 and EL8 system
> > or cgroup/v1?
> > Or can anyone help to give some suggestion on how to further locate
> > the fault point?
> > Best regards,
> > Hermes


Christopher Samuel via slurm-users

Dec 16, 2025, 8:27:45 PM
to slurm...@lists.schedmd.com
On 12/16/25 8:24 pm, hermes via slurm-users wrote:

> Our test nodes are all connected to LDAP through SSSD, and I can make sure the test user exist on both submit and compute node (getent passwd XXX shows exactly the same result on both nodes).

25.11.1 came out today with a fix for pam_slurm_adopt, so worth a try.

--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA

Christopher J Orr via slurm-users

Dec 16, 2025, 8:47:09 PM
to slurm...@lists.schedmd.com
On Tue, 2025-12-16 at 20:26 -0500, Christopher Samuel via slurm-users
wrote:
> On 12/16/25 8:24 pm, hermes via slurm-users wrote:
>
> > Our test nodes are all connected to LDAP through SSSD, and I can
> > make sure the test user exist on both submit and compute node
> > (getent passwd XXX shows exactly the same result on both nodes).
>
> 25.11.1 came out today with a fix for pam_slurm_adopt, so worth a
> try.
>

Slurm 25.11.1 ended up solving this for us. That was released just at
the right time!

hermes via slurm-users

Dec 16, 2025, 9:54:34 PM
to slurm...@lists.schedmd.com
Good job!
It seems the latest commit, 4efc12d, to pam_slurm_adopt.c addresses exactly this bug.

Christopher Samuel wrote:
> On 12/16/25 8:24 pm, hermes via slurm-users wrote:
> > Our test nodes are all connected to LDAP through SSSD, and I can make sure the test user exist on both submit and compute node (getent passwd XXX shows exactly the same result on both nodes).
> 25.11.1 came out today with a fix for pam_slurm_adopt, so worth a try.
