[slurm-users] Slurm versions 21.08.8 and 20.11.9 are now available (CVE-2022-29500, 29501, 29502)

Tim Wickberg

May 4, 2022, 3:51:44 PM
to slurm...@schedmd.com, slurm-a...@schedmd.com
Slurm versions 21.08.8 and 20.11.9 are now available to address a
critical security issue with Slurm's authentication handling.

SchedMD customers were informed on April 20th and provided a patch on
request; this process is documented in our security policy [1].

For SchedMD customers: please note that there are additional changes
included in these releases to address recently reported problems with
PMIx, and to fix communication issues between patched and unpatched
slurmd processes.

--------

CVE-2022-29500:

An architectural flaw with how credentials are handled can be exploited
to allow an unprivileged user to impersonate the SlurmUser account.
Access to the SlurmUser account can be used to execute arbitrary
processes as root.

This issue impacts all Slurm releases since at least Slurm 1.0.0.

Systems remain vulnerable until all slurmdbd, slurmctld, and slurmd
processes have been restarted in the cluster.

Once all daemons have been upgraded, sites are encouraged to add
"block_null_hash" to CommunicationParameters. That new option provides
additional protection against a potential exploit.
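
As a minimal sketch, assuming CommunicationParameters is not already set
elsewhere in your slurm.conf (if it is, append block_null_hash to the
existing comma-separated list), the change is a single line:

   CommunicationParameters=block_null_hash

followed by an "scontrol reconfigure" (or a restart of the daemons) so
the running daemons pick up the new setting.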

CVE-2022-29501:

An issue was discovered with a network RPC handler in the slurmd daemon
used for PMI2 and PMIx support. This vulnerability could allow an
unprivileged user to send data to an arbitrary unix socket on the host
as the root user.

CVE-2022-29502:

An issue was found with the I/O key validation logic in the srun client
command that could permit an attacker to attach to the user's terminal,
and intercept process I/O. (Slurm 21.08 only.)

--------

Due to the severity of the CVE-2022-29500 issue, SchedMD has removed all
prior Slurm releases from our download site.

SchedMD only issues security fixes for the supported releases (currently
21.08 and 20.11). Due to the complexity of these fixes, we do not
recommend attempting to backport the fixes to older releases, and
strongly encourage sites to upgrade to fixed versions immediately.

Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security.php

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

> * Changes in Slurm 21.08.8
> ==========================
> -- openapi/dbv0.0.37 - fix slurmrestd fatal() when deleting an association.
> -- Allow scontrol update <job> Gres=... to not require "gres:".
> -- Fix inconsistent reboot message appending behavior.
> -- Fix incorrect reason_time and reason_uid on reboot message.
> -- Fix "scontrol reboot" clearing node reason on ResumeTimeout.
> -- Fix ResumeTimeout error message missing when node already has reason set.
> -- Avoid "running with local config" error when conf server is provided by DNS.
> -- openapi/v0.0.37 - resolve job user name when not sent by slurmctld.
> -- openapi/dbv0.0.37 - Correct OpenAPI specification for diag request.
> -- Ignore power_down request when node is already powering down.
> -- CVE-2022-29500 - Prevent credential abuse.
> -- CVE-2022-29501 - Prevent abuse of REQUEST_FORWARD_DATA.
> -- CVE-2022-29502 - Correctly validate io keys.

> * Changes in Slurm 20.11.9
> ==========================
> -- burst_buffer - add missing common directory to the Makefile SUBDIRS.
> -- sacct - fix truncation when printing jobidraw field.
> -- GRES - Fix loading state of jobs using --gpus to request gpus.
> -- Fix minor logic error in health check node state output
> -- Fix GCC 11.1 compiler warnings.
> -- Delay steps when memory already used instead of rejecting step request.
> -- Fix memory leak in the slurmdbd when requesting wckeys from all clusters.
> -- Fix determining if a reservation is used or not.
> -- openapi/v0.0.35 - Honor kill_on_invalid_dependency as job parameter.
> -- openapi/v0.0.36 - Honor kill_on_invalid_dependency as job parameter.
> -- Fix various issues dealing with updates on magnetic reservations that could
> lead to abort slurmctld.
> -- openapi/v0.0.36 - Avoid setting default values of min_cpus, job name, cwd,
> mail_type, and contiguous on job update.
> -- openapi/v0.0.36 - Clear user hold on job update if hold=false.
> -- Fix slurmctld segfault due to a bit_test() call with a MAINT+ANY_NODES
> reservation NULL node_bitmap.
> -- Fix slurmctld segfault due to a bit_copy() call with a REPLACE+ANY_NODES
> reservation NULL node_bitmap.
> -- Fix error in GPU frequency validation logic.
> -- Fix error in pmix logic dealing with the incorrect size of buffer.
> -- PMIx v1.1.4 and below are no longer supported.
> -- Fix shutdown of slurmdbd plugin to correctly notice when the agent thread
> finishes.
> -- Fix slurmctld segfault due to job array --batch features double free.
> -- CVE-2022-29500 - Prevent credential abuse.
> -- CVE-2022-29501 - Prevent abuse of REQUEST_FORWARD_DATA.

Ole Holm Nielsen

May 5, 2022, 7:54:48 AM
to slurm...@lists.schedmd.com
Just a heads-up regarding setting CommunicationParameters=block_null_hash
in slurm.conf:

On 5/4/22 21:50, Tim Wickberg wrote:
> CVE-2022-29500:
>
> An architectural flaw with how credentials are handled can be exploited to
> allow an unprivileged user to impersonate the SlurmUser account. Access to
> the SlurmUser account can be used to execute arbitrary processes as root.
>
> This issue impacts all Slurm releases since at least Slurm 1.0.0.
>
> Systems remain vulnerable until all slurmdbd, slurmctld, and slurmd
> processes have been restarted in the cluster.
>
> Once all daemons have been upgraded sites are encouraged to add
> "block_null_hash" to CommunicationParameters. That new option provides
> additional protection against a potential exploit.

The block_null_hash option still needs to be documented in the slurm.conf
man page, but in https://bugs.schedmd.com/show_bug.cgi?id=14002 I was
assured that it is OK to use it now.

I upgraded 21.08.7 to 21.08.8 using RPM packages while the cluster was
running production jobs. This is perhaps not recommended (see
https://slurm.schedmd.com/quickstart_admin.html#upgrade), but it worked
without a glitch in this case as well.
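
For sites doing the same, the per-node step is essentially just a normal
RPM transaction followed by a slurmd restart, roughly along these lines
(exact package names depend on how the RPMs were built at your site):

   $ sudo yum upgrade slurm*-21.08.8-*.rpm
   $ sudo systemctl restart slurmd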

However, when I defined CommunicationParameters=block_null_hash in
slurm.conf later today, I started getting RPC errors on the compute nodes
and in slurmctld when jobs were completing, see bug 14002.

I would recommend that sites hold off on
CommunicationParameters=block_null_hash until a resolution has been found
in bug 14002. Draining all jobs from the cluster before setting this
parameter may be the safe approach(?).

/Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

Marcus Boden

May 5, 2022, 8:46:37 AM
to slurm...@lists.schedmd.com
Hi Ole,

we had a similar issue on our systems. As I understand from the bug you
linked, we just need to wait until all the old jobs are finished (and the
old slurmstepd processes are gone). So a full drain should not be
necessary?

Best,
Marcus
Marcus Vincent Boden, M.Sc. (he/him)
AG Computing
Tel.: +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-------------------------------------------------------------------------
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
(GWDG) Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de

Support: Tel.: +49 551 39-30000, URL: https://gwdg.de/support
Sekretariat: Tel.: +49 551 39-30001, E-Mail: gw...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Norbert Lossau
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001
-------------------------------------------------------------------------

Ole Holm Nielsen

May 5, 2022, 9:44:03 AM
to slurm...@lists.schedmd.com
Hi Marcus,

On 5/5/22 14:45, Marcus Boden wrote:
> we had a similar issue on our systems. As I understand from the bug you
> linked, we just need to wait until all the old jobs are finished (and the
> old slurmstepd processes are gone). So a full drain should not be
> necessary?

Yes, I believe that sounds right.

I've been thinking about how to determine the start time of the oldest job
running on the cluster, and then make sure it is later than the time when
all slurmd daemons were upgraded to 21.08.8.

This command will tell you the oldest running jobs:

$ squeue -t running -O StartTime | sort | head

You can add more -O options to get JobIDs etc., as long as you sort on the
StartTime column (Slurm ISO 8601 timestamps[1] can simply be sorted in
lexicographical order).
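
If you want to script the check, something along the following lines
should work (a rough bash sketch; the cutoff value is an assumption you
fill in yourself with the time the last slurmd was restarted, and the
string comparison relies on the sortable ISO 8601 format):

   $ oldest=$(squeue -t running -h -O StartTime | sort | head -n 1 | tr -d ' ')
   $ cutoff="2022-05-04T21:00:00"   # example: when the last slurmd was restarted
   $ [[ "$oldest" > "$cutoff" ]] && echo "no pre-upgrade jobs left" \
        || echo "jobs from before the upgrade are still running"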

I hope this helps.

/Ole


[1] https://en.wikipedia.org/wiki/ISO_8601

Tim Wickberg

May 5, 2022, 1:44:22 PM
to slurm...@schedmd.com, slurm-a...@schedmd.com
I wanted to provide some elaboration on the new
CommunicationParameters=block_null_hash option based on initial feedback.

The original email said it was safe to enable after all daemons had been
restarted. Unfortunately that statement was incomplete - the flag can
only be safely enabled after all daemons have been restarted *and* all
currently running jobs have completed.

The new maintenance releases - with or without this new option enabled -
do fix the reported issues. The option is not required to secure your
system.

This option provides an additional - redundant - layer of security
within the cluster, and we do encourage sites to enable it at their
earliest convenience, but only after currently running jobs (with an
associated unpatched slurmstepd process) have all completed.

- Tim

Tim Wickberg

May 5, 2022, 4:28:42 PM
to slurm...@schedmd.com, slurm-a...@schedmd.com
And, what is hopefully my final update on this:

Unfortunately I missed including a single last-minute commit in the
21.08.8 release. That missing commit fixes a communication issue between
a mix of patched and unpatched slurmd processes that could lead to nodes
being incorrectly marked as offline.

That patch was included in 20.11.9. For 21.08, the missing commit is
included in a new 21.08.8-2 release, which is on our download page now.

If you've already started rolling out 21.08.8 on your systems, the best
path forward is to restart all slurmd processes in the cluster immediately.
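
For example, with a parallel shell such as ClusterShell's clush (the node
range below is just a placeholder, and this assumes slurmd is managed by
systemd on the compute nodes):

   $ clush -bw 'node[001-100]' systemctl restart slurmd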

- Tim
