Hi All,
We have updated SLURM to the current 25.05.x release and tried to enable TLS on it. The OS is Alma 8.10, with cgroups v1 and PMIx v4.
We see that srun fails for MPI jobs that span multiple nodes, with TLS-related errors, when using PMIx (the default), but succeeds with srun --mpi=pmi2 or with mpirun. The TLS settings are:
TLSType = tls/s2n
TLSParameters = ca_cert_file= (has all the certs here under /etc/slurm/certs)
The errors when using PMIx are:
[2025-09-25T11:04:43.894] error: con_close_on_poll_error: [n388:6818(fd:15)] socket error encountered while polling: Connection reset by peer
[2025-09-25T11:04:50.102] [6451416.0] error: _negotiate: s2n_negotiate() failed S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/tls/s2n_x509_validator.c:494
(a couple of these)
[2025-09-25T11:05:57.878] [6451416.0] error: tls_p_recv: s2n_recv() failed S2N_ERR_CLOSED[134217728]: connection is closed -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:37
[2025-09-25T11:05:57.883] [6451416.0] error: tls_p_send: s2n_send() failed S2N_ERR_IO[67108864]: underlying I/O operation failed, check system errno -> Error encountered in /builddir/build/BUILD/s2n-tls-1.5.10/utils/s2n_io.c:28
(a couple of these)
[2025-09-25T11:05:59.076] error: wrap_on_data: [unix:/var/spool/slurmd/slurmd.socket(fd:17)] on_data returned rc: Unable to proxy slurmstepd message
[2025-09-25T11:05:59.076] [6451416.0] error: _stepd_send_recv_msg: slurmd was unable to proxy request message to its final destination
[2025-09-25T11:05:59.878] error: _slurmd_send_recv_msg: Failed to send/recv slurmstepd message MESSAGE_TASK_EXIT using proxy_type PROXY_TO_NODE_SEND_RECV
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: pmixp_p2p_send: n388 [0]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
[2025-09-25T11:07:36.335] [6451416.0] error: mpi/pmix_v4: _slurm_send: n388 [0]: pmixp_server.c:1586: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.6451416.0, size = 27679, hostlist: (null)
(and a couple more PMIx errors). It looks like the PMIx servers can no longer talk to their peers?
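For anyone who wants to triage a longer log of this kind, here is a minimal sketch that tallies the s2n failures by call and error name. It is not part of our setup: the function name `tally_s2n_errors` and the regular expression are assumptions for illustration, based only on the message shape seen in the lines above.

```python
import re
from collections import Counter

# Assumed message shape, taken from the excerpts above:
#   "<s2n_call>() failed S2N_ERR_<NAME>[<numeric code>]: ..."
S2N_ERROR = re.compile(r"(s2n_\w+)\(\) failed (S2N_ERR_\w+)\[(\d+)\]")


def tally_s2n_errors(lines):
    """Count (s2n call, error name) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = S2N_ERROR.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Shortened copies of the log lines quoted above.
    sample = [
        "[6451416.0] error: _negotiate: s2n_negotiate() failed "
        "S2N_ERR_CERT_UNTRUSTED[335544366]: Certificate is untrusted",
        "[6451416.0] error: tls_p_recv: s2n_recv() failed "
        "S2N_ERR_CLOSED[134217728]: connection is closed",
        "[6451416.0] error: tls_p_send: s2n_send() failed "
        "S2N_ERR_IO[67108864]: underlying I/O operation failed",
    ]
    for (call, err), n in tally_s2n_errors(sample).items():
        print(f"{call}: {err} x{n}")
```

In a log like ours, the point would be to see whether S2N_ERR_CERT_UNTRUSTED is the first failure and the closed/I/O errors merely follow from it.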
We did not configure the certgen plugin specifically, because the SLURM documentation seems to say it is optional(?).
What are we missing to get SLURM 25.05 working with both TLS enabled and PMIx? Any advice appreciated! Thanks!
--
Grigory Shamov
Site Lead / HPC Specialist
University of Manitoba and DRI Alliance Canada
--
slurm-users mailing list -- slurm...@lists.schedmd.com