[slurm-users] How to make TLS and PMIx v4 work together?


Grigory Shamov via slurm-users

Sep 25, 2025, 12:57:23 PM
to slurm...@lists.schedmd.com
Hi All,

We have updated SLURM to the current 25.05.x and tried to enable TLS on it. The OS is Alma 8.10, with cgroups v1 and PMIx v4.

We see that srun fails for MPI jobs across nodes, with TLS-related errors, when using PMIx (the default), but it passes with srun --mpi=pmi2 or with mpirun.

TLSType = tls/s2n
TLSParameters = ca_cert_file= (has all the certs here under /etc/slurm/certs)

And the errors when using PMIx are

[2025-09-25T11:04:43.894] error: con_close_on_poll_error: [n388:6818(fd:15)] socket error encountered while polling: Connection reset by peer
[2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: s2n_negotiate() failed S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/tls/s2n_x509_validator.c:494
(a couple of these)
[2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: s2n_recv() failed S2N_ERR_CLOSED[134217728]: connection is closed -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:37
[2025-09-25T11:05:57.883] [6451416.0] error: tls_p_send: s2n_send() failed S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:28
(a couple of these)
[2025-09-25T11:05:59.076] error: wrap_on_data: [unix:/var/spool/slurmd/slurmd.socket(fd:17)] on_data returned rc: Unable to proxy slurmstepd message
[2025-09-25T11:05:59.076] [6451416.0] error: _stepd_send_recv_msg: slurmd was unable to proxy request message to its final destination
[2025-09-25T11:05:59.878] error: _slurmd_send_recv_msg: Failed to send/recv slurmstepd message MESSAGE_TASK_EXIT using proxy_type PROXY_TO_NODE_SEND_RECV

[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: pmixp_p2p_send: n388 [0]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: _slurm_send: n388 [0]: pmixp_server.c:1586: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.6451416.0, size = 27679, hostlist: (null)
(and a couple more PMIx errors). It looks like PMIx cannot talk to its peers now?
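
When triaging logs like the ones above, it can help to tally which s2n error codes appear and how often. The following is a minimal Python sketch, not part of Slurm: the function name count_s2n_errors and the regular expression are assumptions based only on the log lines quoted in this message.

```python
import re
import sys
from collections import Counter

# Matches s2n error names followed by their numeric code, e.g.
# "S2N_ERR_CERT_UNTRUSTED[335544366]" as seen in the slurmd log above.
S2N_ERROR = re.compile(r"\b(S2N_ERR_[A-Z0-9_]+)\[\d+\]")


def count_s2n_errors(lines):
    """Return a Counter mapping each s2n error name to its number of occurrences."""
    counts = Counter()
    for line in lines:
        for name in S2N_ERROR.findall(line):
            counts[name] += 1
    return counts


# Only runs when a log path is passed on the command line (path is up to you).
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for name, n in count_s2n_errors(log).most_common():
            print(f"{n:6d}  {name}")
```

Run against the slurmd log of each affected node, this shows whether S2N_ERR_CERT_UNTRUSTED is the first and dominant failure, with the closed-connection and I/O errors following from it.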

We did not add any specific configuration for the certgen plugin, because the SLURM documentation seems to say it is optional(?).

I wonder what we are missing to get SLURM 25.05 running with TLS enabled and PMIx working. Any advice is appreciated! Thanks!

-- 
Grigory Shamov
Site Lead / HPC Specialist
University of Manitoba and DRI Alliance Canada

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Grigory Shamov via slurm-users

Sep 25, 2025, 2:13:55 PM
to slurm...@lists.schedmd.com
Forgot to add: the s2n-tls comes from EPEL and is ver 1.5.10.




Brian Andrus via slurm-users

Sep 25, 2025, 7:02:48 PM
to slurm...@lists.schedmd.com

Grigory, 

You likely need to add your CA to the trust store on the nodes and then update the store. Under Ubuntu, you would:

  • Put your CA public key file in /usr/local/share/ca-certificates/
  • Run /usr/sbin/update-ca-certificates
This should then create a pem file in /etc/ssl/certs for that CA and you can then trust certs signed by it.

You will need to do that on all your systems that need to trust your CA.
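
Before changing the trust store on every node, one way to check whether a given certificate actually chains to the CA is `openssl verify -CAfile`. Below is a minimal Python sketch that wraps that command. Both file paths are supplied by the caller (the thread does not state them), the helper names are hypothetical, and it assumes the `openssl` CLI is installed and that its exit status reflects the verification result. Note also that the original poster is on Alma rather than Ubuntu; on RHEL-family systems the analogous trust-store steps are placing the CA in /etc/pki/ca-trust/source/anchors/ and running update-ca-trust, though whether Slurm's tls/s2n plugin consults the system store at all is not established in this thread.

```python
import subprocess
import sys


def build_verify_cmd(ca_file, cert_file):
    """Return the argv for `openssl verify` checking cert_file against ca_file."""
    return ["openssl", "verify", "-CAfile", ca_file, cert_file]


def cert_is_trusted(ca_file, cert_file):
    """Run openssl and return (ok, combined output). Assumes openssl is on PATH."""
    result = subprocess.run(
        build_verify_cmd(ca_file, cert_file),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, (result.stdout + result.stderr).strip()


# Usage (paths are placeholders): python check_cert.py <ca_file> <cert_file>
if __name__ == "__main__" and len(sys.argv) == 3:
    ok, detail = cert_is_trusted(sys.argv[1], sys.argv[2])
    print(detail)
    sys.exit(0 if ok else 1)
```

If this check fails for the certificate a node presents, the S2N_ERR_CERT_UNTRUSTED error is at least consistent with a CA mismatch rather than with a PMIx problem.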

Brian Andrus

Grigory Shamov via slurm-users

Sep 28, 2025, 9:55:06 AM
to Brian Andrus, slurm...@lists.schedmd.com

Hi Brian,

Thank you very much! We will try it.

Another thing we have noticed is a massive decrease in slurmctld performance. We had to quadruple the VM's memory and CPU cores compared to 24.11 so that 25.05 would run without freezing.

Does everyone see this, or did we misconfigure some settings of the new RPC connection manager?

Timony, Mick via slurm-users

Sep 30, 2025, 2:11:30 PM
to slurm...@lists.schedmd.com, Grigory Shamov
Regarding performance, have a look at the release notes:


Maybe you are being bitten by this change?

  • Autodetect the available cpus, and automatically set the connection management thread pool size to 2x that value. The “conmgr_threads” settings can be used to override this.
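
To make the quoted rule concrete, here is a small Python sketch of the arithmetic only. It is not Slurm code: the function name is hypothetical, and it simply applies the "2x the detected CPUs" default described in the release note.

```python
import os


def default_conmgr_threads(cpu_count):
    """Thread pool size implied by the release note: twice the detected CPU count."""
    return 2 * cpu_count


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to 1 in that case.
    cpus = os.cpu_count() or 1
    print(f"{cpus} CPUs detected, implied default pool size: {default_conmgr_threads(cpus)}")
```

Under this rule, quadrupling the VM's cores also quadruples the default thread pool, so the behaviour before and after resizing the VM is not directly comparable unless "conmgr_threads" is set explicitly.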

Docs at:


The sdiag command might help you debug this issue or performance in general.

Kind Regards

--
Mick Timony
Senior DevOps Engineer
LASER, Longwood, & O2 Cluster Admin
Harvard Medical School
--
