[slurm-users] Slurm version 24.05.1 is now available

Tim Wickberg via slurm-users

Jun 27, 2024, 6:07:21 PM
to slurm...@schedmd.com, slurm-a...@schedmd.com
We are pleased to announce the availability of Slurm version 24.05.1.

This release addresses a number of minor-to-moderate issues since the
24.05 release was first announced a month ago.

Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim


> * Changes in Slurm 24.05.1
> ==========================
> -- Fix slurmctld and slurmdbd potentially stopping instead of performing a
> logrotate when receiving SIGUSR2 when using auth/slurm.
> -- switch/hpe_slingshot - Fix slurmctld crash when upgrading from 23.02.
> -- Fix "Could not find group" errors from validate_group() when using
> AllowGroups with large /etc/group files.
> -- Prevent an assertion in debugging builds when triggering log rotation
> in a backup slurmctld.
> -- Add AccountingStoreFlags=no_stdio which, when set, prevents the stdio
> paths of the job from being recorded.
> -- slurmrestd - Prevent a slurmrestd segfault when parsing the crontab field,
> which was never usable. Now it explicitly ignores the value and emits a
> warning if it is used for the following endpoints:
> 'POST /slurm/v0.0.39/job/{job_id}'
> 'POST /slurm/v0.0.39/job/submit'
> 'POST /slurm/v0.0.40/job/{job_id}'
> 'POST /slurm/v0.0.40/job/submit'
> 'POST /slurm/v0.0.41/job/{job_id}'
> 'POST /slurm/v0.0.41/job/submit'
> 'POST /slurm/v0.0.41/job/allocate'
> -- mpi/pmi2 - Fix communication issue leading to task launch failure with
> "invalid kvs seq from node".
> -- Fix getting user environment when using sbatch with "--get-user-env" or
> "--export=" when there is a user profile script that reads /proc.
> -- Prevent slurmd from crashing if acct_gather_energy/gpu is configured but
> GresTypes is not configured.
> -- Do not log the following errors when AcctGatherEnergyType plugins are used
> but a node does not have or cannot find sensors:
> "error: _get_joules_task: can't get info from slurmd"
> "error: slurm_get_node_energy: Zero Bytes were transmitted or received"
> However, the following error will continue to be logged:
> "error: Can't get energy data. No power sensors are available. Try later"
> -- sbatch, srun - Set SLURM_NETWORK environment variable if --network is set.
> -- Fix cloud nodes not being able to forward to nodes that restarted with new
> IP addresses.
> -- Fix cwd not being set correctly when running a SPANK plugin with a
> spank_user_init() hook and the new "contain_spank" option set.
> -- slurmctld - Avoid deadlock during shutdown when auth/slurm is active.
> -- Fix segfault in slurmctld with topology/block.
> -- sacct - Fix printing of job group for job steps.
> -- scrun - Log when an invalid environment variable causes the job submission
> to be rejected.
> -- accounting_storage/mysql - Fix problem where listing or modifying an
> association when specifying a qos list could hang or take a very long time.
> -- gpu/nvml - Fix gpuutil/gpumem only tracking last GPU in step. Now,
> gpuutil/gpumem will record sums of all GPUs in the step.
> -- Fix error in scrontab jobs when using slurm.conf:PropagatePrioProcess=1.
> -- Fix slurmctld crash on a batch job submission with "--nodes 0,...".
> -- Fix dynamic IP address fanout forwarding when using auth/slurm.
> -- Restrict listening sockets in the mpi/pmix plugin and sattach to the
> SrunPortRange.
> -- slurmrestd - Limit mime types returned from query to 'GET /openapi/v3' to
> only return one mime type per serializer plugin to fix issues with OpenAPI
> client generators that are unable to handle multiple mime type aliases.
> -- Fix many commands possibly reporting an "Unexpected Message Received" when
> in reality the connection timed out.
> -- Prevent slurmctld from starting if there is not a json serializer present
> and the extra_constraints feature is enabled.
> -- Fix heterogeneous job components not being signaled with scancel --ctld and
> 'DELETE /slurm/v0.0.40/jobs' if the job ids are not explicitly given,
> the heterogeneous job components match the given filters, and the
> heterogeneous job leader does not match the given filters.
> -- Fix regression from 23.02 impeding job licenses from being cleared.
> -- Move the _get_joules_task error to a log_flag. Previously it was logged to
> the user when too many RPCs were queued in slurmd for gathering energy.
> -- For scancel --ctld and the associated REST API endpoints:
> 'DELETE /slurm/v0.0.40/jobs'
> 'DELETE /slurm/v0.0.41/jobs'
> Fix canceling the final array task in a job array when the task is pending
> and all array tasks have been split into separate job records. Previously
> this task was not canceled.
> -- Fix power_save operation after recovering from a failed reconfigure.
> -- slurmctld - Skip removing the pidfile when running under systemd. In that
> situation it is never created in the first place.
> -- Fix issue where altering the flags on a Slurm account (UsersAreCoords)
> caused several limits on the account's association to be set to 0 in
> Slurm's internal cache.
> -- Fix memory leak in the controller when relaying stepmgr step accounting to
> the dbd.
> -- Fix segfault when submitting stepmgr jobs within an existing allocation.
> -- Added "disable_slurm_hydra_bootstrap" as a possible MpiParams parameter in
> slurm.conf. Using this will disable env variable injection to allocations
> for the following variables: I_MPI_HYDRA_BOOTSTRAP,
> I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS, HYDRA_BOOTSTRAP,
> HYDRA_LAUNCHER_EXTRA_ARGS.
> -- scrun - Delay shutdown until after start is requested. Previously, scrun
> could fail to start or shut down and hang forever when using --tty.
> -- Fix backup slurmctld potentially not running the agent when taking over as
> the primary controller.
> -- Fix primary controller not running the agent when a reconfigure of the
> slurmctld fails.
> -- slurmd - Fix premature timeout waiting for REQUEST_LAUNCH_PROLOG with large
> array jobs causing node to drain.
> -- jobcomp/{elasticsearch,kafka} - Avoid sending fields with invalid date/time.
> -- jobcomp/elasticsearch - Fix slurmctld memory leak from curl usage.
> -- acct_gather_profile/influxdb - Fix slurmstepd memory leak from curl usage.
> -- Fix 24.05.0 regression not deleting job hash dirs after MinJobAge.
> -- Fix filtering arguments being ignored when using squeue --json.
> -- switch/nvidia_imex - Move setup call after spank_init() to allow namespace
> manipulation within the SPANK plugin.
> -- switch/nvidia_imex - Skip plugin operation if nvidia-caps-imex-channels
> device is not present rather than preventing slurmd from starting.
> -- switch/nvidia_imex - Skip plugin operation if job_container/tmpfs
> is configured due to incompatibility.
> -- switch/nvidia_imex - Remove any pre-existing channels when slurmd starts.
> -- rpc_queue - Add support for an optional rpc_queue.yaml configuration file.
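
The two new slurm.conf settings named in the changelog above, AccountingStoreFlags=no_stdio and MpiParams=disable_slurm_hydra_bootstrap, are values of options that may already be set on a site. The following is only a minimal sketch of appending a value to an existing option without overwriting what is there. It assumes both options take comma-separated lists and that option lines carry no trailing comments. The helper add_list_option and the ClusterName=demo line are hypothetical and not part of Slurm.

```python
def add_list_option(conf_text, key, value):
    """Return conf_text with `value` present in the comma-separated option `key`.

    Hypothetical helper: works on the text of a slurm.conf-style file and
    assumes "Key=a,b,c" lines without trailing comments.
    """
    out = []
    found = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        name, sep, rest = stripped.partition("=")
        is_option = sep and not stripped.startswith("#")
        if is_option and name.strip().lower() == key.lower():
            found = True
            items = [v for v in rest.strip().split(",") if v]
            if value not in items:
                items.append(value)  # keep existing values, add the new one
            line = f"{name.strip()}={','.join(items)}"
        out.append(line)
    if not found:
        out.append(f"{key}={value}")  # option was absent: add it at the end
    return "\n".join(out) + "\n"


# Example input: a made-up two-line configuration.
conf = "ClusterName=demo\nAccountingStoreFlags=job_comment\n"
conf = add_list_option(conf, "AccountingStoreFlags", "no_stdio")
conf = add_list_option(conf, "MpiParams", "disable_slurm_hydra_bootstrap")
print(conf, end="")
```

Running the helper twice with the same arguments leaves the text unchanged, so it can be re-applied safely. Whether the daemons need a restart or a reconfigure afterwards is not covered here; check the slurm.conf man page.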


--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com