[slurm-users] Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)


William VINCENT via slurm-users

Jul 15, 2024, 4:45:45 AM
to slurm...@lists.schedmd.com
Hello

I am writing to report an issue with the Slurmctld process on our RHEL 9
(Rocky Linux) cluster.

Twice in the past 5 days, the Slurmctld process has encountered an error
that resulted in the service stopping. The error message displayed was
"double free or corruption (out)". This error has caused significant
disruption to our jobs, and we are concerned about its recurrence.

We have tried troubleshooting the issue, but we have not been able to
identify the root cause of the problem. We would appreciate any
assistance or guidance you can provide to help us resolve this issue.

Please let us know if you need any additional information or if there
are any specific steps we should take to diagnose the problem further.
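
For reference, a rough sketch of how we could try to capture a backtrace
the next time slurmctld aborts (assuming systemd-coredump is enabled and
matching debug symbols are available; the debuginfo package name below is
only illustrative and may differ for EPEL builds):

coredumpctl list slurmctld      # check whether a core was captured
coredumpctl gdb slurmctld       # open the most recent core in gdb
# debug symbols help a lot here, e.g. "dnf debuginfo-install slurm"
# inside gdb: thread apply all bt full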

Thank you for your attention to this matter.

Best regards,

_________________________

Jul 09 22:12:01 admin slurmctld[711010]: double free or corruption
(fasttop)
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Main process
exited, code=killed, status=6/ABRT
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Failed with result
'signal'.
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Consumed 11min
26.451s CPU time.

.....

Jul 14 10:15:01 admin slurmctld[1633720]: double free or corruption (out)
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Main process
exited, code=killed, status=6/ABRT
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Failed with result
'signal'.
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Consumed 7min
27.596s CPU time.

_________________________

slurmctld -V
slurm 22.05.9

________________________

cat /etc/slurm/slurm.conf |grep -v '#'


ClusterName=xxx
SlurmctldHost=admin
SlurmctldParameters=enable_configless
SlurmUser=slurm
AuthType=auth/munge
CryptoType=crypto/munge


SlurmctldPort=6817
StateSaveLocation=/var/spool/slurmctld
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=verbose
DebugFlags=NO_CONF_HASH


SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=verbose

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core,CR_LLN
DefMemPerCPU=1024
MaxMemPerCPU=4096
GresTypes=gpu


ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=15
JobCompType=jobcomp/none

TaskPlugin=task/cgroup
LaunchParameters=use_interactive_step

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=admin
AccountingStoragePort=6819
AccountingStorageEnforce=associations
AccountingStorageTRES=gres/gpu



MailProg=/usr/bin/mailx
EnforcePartLimits=YES
MaxArraySize=200000
MaxJobCount=500000
MpiDefault=none
ReturnToService=2
SwitchType=switch/none
TmpFS=/tmpslurm/
UsePAM=1



InactiveLimit=0
KillWait=30
MessageTimeout=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0



PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE,MAX_TRES
PriorityDecayHalfLife=1-0
PriorityWeightFairshare=10000




NodeName=xxx  NodeHostname=xxx  CPUs=4 Sockets=4 RealMemory=3500
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx  NodeHostname=xxx  CPUs=2 Sockets=2 RealMemory=1700
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx  NodeHostname=xxx  CPUs=4 Sockets=4 RealMemory=1700
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx  NodeHostname=xxx  CPUs=4 Sockets=4 RealMemory=3500
TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN


NodeName=r9nc-24-[1-12] NodeHostname=r9nc-24-[1-12] Sockets=2
CoresPerSocket=12 ThreadsPerCore=1 CPUs=24 RealMemory=180000 State=UNKNOWN
NodeName=r9nc-48-[1-4]  NodeHostname=r9nc-48-[1-4] Sockets=2
CoresPerSocket=24 ThreadsPerCore=1 CPUs=48 RealMemory=480000 State=UNKNOWN
NodeName=r9ng-1080-[1-7]   NodeHostname=r9ng-1080-[1-7] Sockets=2
CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=180000
State=UNKNOWN Gres=gpu:1080ti:4
NodeName=r9ng-1080-8   NodeHostname=r9ng-1080-8 Sockets=2
CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=176687
State=UNKNOWN Gres=gpu:1080ti:1

PartitionName=24CPUNodes      Nodes=r9nc-24-[1-12]        State=UP
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=7500 DefMemPerCPU=7500
TRESBillingWeights="CPU=1.0,Mem=0.125G" Default=YES
PartitionName=48CPUNodes      Nodes=r9nc-48-[1-4]         State=UP
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=10000 DefMemPerCPU=8000
TRESBillingWeights="CPU=1.0,Mem=0.125G"
PartitionName=GPUNodes   Nodes=r9ng-1080-[1-7]            State=UP
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000
PartitionName=GPUNodes1080-dev   Nodes=r9ng-1080-8        State=UP
MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000
Hidden=Yes

_________________________

sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
24CPUNodes*         up   infinite     12   idle r9nc-24-[1-12]
48CPUNodes          up   infinite      2   idle r9nc-48-[1-2]
GPUNodes            up   infinite      4   idle r9ng-1080-[4-7]
GPUNodes1080-dev    up   infinite      1   idle r9ng-1080-8


--
William VINCENT
Systems and network administrator

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Ole Holm Nielsen via slurm-users

Jul 15, 2024, 5:25:38 AM
to slurm...@lists.schedmd.com
On 7/15/24 10:43, William VINCENT via slurm-users wrote:
> I am writing to report an issue with the Slurmctld process on our RHEL 9
> (Rocky Linux) .
>
> Twice in the past 5 days, the Slurmctld process has encountered an error
> that resulted in the service stopping. The error message displayed was
> "double free or corruption (out)". This error has caused significant
> disruption to our jobs, and we are concerned about its recurrence.
>
> We have tried troubleshooting the issue, but we have not been able to
> identify the root cause of the problem. We would appreciate any assistance
> or guidance you can provide to help us resolve this issue.
>
> Please let us know if you need any additional information or if there are
> any specific steps we should take to diagnose the problem further.

You're running Slurm 22.05.9 on Rocky Linux 9 (is that Rocky 9.4 or what?).
Such an old Slurm version probably hasn't been tested much on EL9 systems.

For security reasons you ought to upgrade to a recent Slurm version; just
search for "CVE" in https://github.com/SchedMD/slurm/blob/master/NEWS to
find out about security holes in older versions.
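
For example, something like this gives a quick (unofficial) overview of
the CVE entries in the changelog:

curl -s https://raw.githubusercontent.com/SchedMD/slurm/master/NEWS | grep -i cve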

You can upgrade by 2 major releases in a single step, so you can go to
23.11.8. Upgrading Slurm is fairly easy, and I've collected various
pieces of advice in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
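
Roughly, the usual order is slurmdbd first, then slurmctld, then the
slurmd's on the compute nodes. A rough sketch only (package names assume
locally built RPMs; take backups of the accounting database and of
StateSaveLocation before you start):

# 1. the database daemon first -- it converts the accounting database on startup
systemctl stop slurmdbd
dnf upgrade slurm-slurmdbd
systemctl start slurmdbd
# 2. then the controller
systemctl stop slurmctld
dnf upgrade slurm slurm-slurmctld
systemctl start slurmctld
# 3. finally slurmd on the compute nodes, e.g. as a rolling restart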

Hopefully a newer Slurm version is going to solve your issue.

I hope this helps,
Ole

William V via slurm-users

Jul 15, 2024, 5:37:43 AM
to slurm...@lists.schedmd.com
Thank you for your response; I hadn't considered that version 22 could be the problem.

I am aware that we are not up to date, but we use the EPEL repo for our RPM packages. Originally, we did not want to install the RPMs directly because our policy is to apply security updates every night via the repositories, but unfortunately, in this case, that does not work. I think it is because only one person is responsible for maintaining the Slurm packages for RHEL.

I have already reported the security issue, but at the moment it does not seem possible to update: https://bugzilla.redhat.com/show_bug.cgi?id=2280545

It appears from another ticket that the compilation fails for version 24: https://bugzilla.redhat.com/show_bug.cgi?id=2259935

If the compilation fails, will the RPM package work on RHEL 9?

Ole Holm Nielsen via slurm-users

Jul 15, 2024, 6:46:50 AM
to slurm...@lists.schedmd.com
On 7/15/24 11:35, William V via slurm-users wrote:
> Thank you for your response, I hadn't considered that version 22 could be the problem.
>
> I am aware that we are not up to date, but we use the EPEL repo for our RPM packages. Originally, we did not want to install .rpm directly because our policy is to apply security updates every night via the repositories, but unfortunately, in this case, it does not work. I think it is because only one person is responsible for maintaining the packages for RHEL.

You should *NOT* use Slurm packages from the EPEL repository!! The Slurm
documentation recommends excluding those packages, see
https://slurm.schedmd.com/upgrades.html#epel_repository
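
That page suggests excluding the Slurm packages from EPEL in the repo
configuration, something along these lines:

# /etc/yum.repos.d/epel.repo
[epel]
# ... existing repo settings ...
excludepkgs=slurm*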

> I have already reported the security issue, but at the moment it does not seem possible to update: https://bugzilla.redhat.com/show_bug.cgi?id=2280545

Red Hat doesn't provide support for Slurm, and if necessary you should
contact SchedMD to obtain Slurm support.

> It appears from another ticket that the compilation fails for version 24: https://bugzilla.redhat.com/show_bug.cgi?id=2259935

I think this ticket only reports problems regarding older Slurm releases?

> If the compilation fails, will the RPM package work on RHEL 9?

You should build your own Slurm RPM packages, and compilation failure
would indicate a bug somewhere!

Just as a test, I've now built RPM packages of the currently supported
Slurm releases 23.11.8 and 24.05.1 on a Rocky Linux 9.4 system. The RPMs
built without any issues or compilation errors at all! I haven't
tested these RPMs on our production cluster, which runs EL8 :-)
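
For reference, the build itself is essentially just this (assuming the
prerequisite -devel packages from the Wiki are already installed):

curl -O https://download.schedmd.com/slurm/slurm-23.11.8.tar.bz2
rpmbuild -ta slurm-23.11.8.tar.bz2
ls ~/rpmbuild/RPMS/x86_64/slurm-*.rpm   # the resulting packages end up here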

I recommend that you consult the Slurm documentation page [1] and my Wiki
page for Slurm installation:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/
Remember to install all prerequisite packages before building Slurm, as
explained in the Wiki!

Best regards,
Ole

[1] https://slurm.schedmd.com/documentation.html

William V via slurm-users

Jul 15, 2024, 7:11:04 AM
to slurm...@lists.schedmd.com
Wow, thank you so much for all this information and the installation wiki.
I have a lot of work to do to change the infrastructure; I hope it will go smoothly.