[slurm-users] Slurmd enabled crash with CgroupV2


Tristan LEFEBVRE

Mar 10, 2023, 10:41:55 AM
to slurm...@lists.schedmd.com

Hello to all,

I'm trying to install Slurm with cgroup v2 enabled.

But I'm facing an odd problem: when slurmd is enabled, it crashes at the next reboot and will not start again unless I disable it.

Here is a full example of the situation:


[root@compute ~]# systemctl start slurmd
[root@compute ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-03-10 15:57:00 CET; 967ms ago
 Main PID: 8053 (slurmd)
    Tasks: 1
   Memory: 3.1M
   CGroup: /system.slice/slurmd.service
           └─8053 /opt/slurm_bin/sbin/slurmd -D --conf-server XXXXX:6817 -s

mars 10 15:57:00 compute.cluster.lab systemd[1]: Started Slurm node daemon.
mars 10 15:57:00 compute.cluster.lab slurmd[8053]: slurmd: slurmd version 23.02.0 started
mars 10 15:57:00 compute.cluster.lab slurmd[8053]: slurmd: slurmd started on Fri, 10 Mar 2023 15:57:00 +0100
mars 10 15:57:00 compute.cluster.lab slurmd[8053]: slurmd: CPUs=48 Boards=1 Sockets=2 Cores=24 Threads=1 Memory=385311 TmpDisk=19990 Uptime=12>

[root@compute ~]# systemctl enable slurmd
Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.

[root@compute ~]#  reboot now

> [ reboot of the node]

[adm@compute ~]$ sudo systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2023-03-10 16:00:33 CET; 1min 0s ago
  Process: 2659 ExecStart=/opt/slurm_bin/sbin/slurmd -D --conf-server XXXX:6817 -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 2659 (code=exited, status=1/FAILURE)
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: slurmd version 23.02.0 started
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: Controller cpuset is not enabled!
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: Controller cpu is not enabled!
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: cpu cgroup controller is not available.
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: There's an issue initializing memory or cpu controller
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init()>
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup
mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: fatal: Unable to initialize jobacct_gather
mars 10 16:00:33 compute.cluster.lab systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
mars 10 16:00:33 compute.cluster.lab systemd[1]: slurmd.service: Failed with result 'exit-code'.

[adm@compute ~]$ sudo systemctl start slurmd
[adm@compute ~]$ sudo systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2023-03-10 16:01:37 CET; 1s ago
  Process: 3321 ExecStart=/opt/slurm_bin/sbin/slurmd -D --conf-server XXXX:6817 -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 3321 (code=exited, status=1/FAILURE)
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: slurmd version 23.02.0 started
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: error: Controller cpuset is not enabled!
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: error: Controller cpu is not enabled!
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: error: cpu cgroup controller is not available.
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: error: There's an issue initializing memory or cpu controller
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init()>
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup
mars 10 16:01:37 compute.cluster.lab slurmd[3321]: slurmd: fatal: Unable to initialize jobacct_gather
mars 10 16:01:37 compute.cluster.lab systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
mars 10 16:01:37 compute.cluster.lab systemd[1]: slurmd.service: Failed with result 'exit-code'.

[adm@compute ~]$ sudo systemctl disable slurmd
Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
[adm@compute ~]$ sudo systemctl start slurmd
[adm@compute ~]$ sudo systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-03-10 16:01:45 CET; 1s ago
 Main PID: 3358 (slurmd)
    Tasks: 1
   Memory: 6.1M
   CGroup: /system.slice/slurmd.service
           └─3358 /opt/slurm_bin/sbin/slurmd -D --conf-server XXXX:6817 -s
mars 10 16:01:45 compute.cluster.lab systemd[1]: Started Slurm node daemon.
mars 10 16:01:45 compute.cluster.lab slurmd[3358]: slurmd: slurmd version 23.02.0 started
mars 10 16:01:45 compute.cluster.lab slurmd[3358]: slurmd: slurmd started on Fri, 10 Mar 2023 16:01:45 +0100
mars 10 16:01:45 compute.cluster.lab slurmd[3358]: slurmd: CPUs=48 Boards=1 Sockets=2 Cores=24 Threads=1 Memory=385311 TmpDisk=19990 Uptime=84>

As you can see, after a reboot slurmd starts successfully only when it is not enabled.

- I'm using Rocky Linux 8, and I've configured cgroup v2 with grubby:

> grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 cgroup_no_v1=all"

- Slurm 23.02 is built with rpmbuild, and slurmd on the compute node is installed from the RPM.

- Here is my cgroup.conf:

CgroupPlugin=cgroup/v2
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=no

And my slurm.conf has:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup


- If I run "systemctl start slurmd" on a compute node, it succeeds.

- If I run "systemctl enable slurmd" and then "systemctl restart slurmd", it still works.

- If I enable it and reboot, slurmd reports these errors:

slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: cpu cgroup controller is not available.
slurmd: error: There's an issue initializing memory or cpu controller

- I've done some research and read about cgroup.subtree_control. On the node it contains:

cat /sys/fs/cgroup/cgroup.subtree_control
memory pids

So I tried following the Red Hat documentation and its example (link to the Red Hat page):

echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control

cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu memory pids

And indeed I can then restart slurmd.

But at the next boot it fails again, and /sys/fs/cgroup/cgroup.subtree_control is back to "memory pids" only.

Strangely, I found that if slurmd is enabled and I then disable it, the value of /sys/fs/cgroup/cgroup.subtree_control changes:

[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control 
memory pids
[root@compute ~]# systemctl disable slurmd
Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control 
cpuset cpu io memory pids
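
A quick way to compare what slurmd asks for with what the parent cgroup currently delegates is sketched below. The function name missing_controllers is hypothetical (not part of Slurm or systemd), and the directory is passed as an argument so it can be tried on a scratch copy.

```shell
#!/bin/bash
# Print each controller from the argument list that is NOT enabled in
# <cgroup_dir>/cgroup.subtree_control. Prints nothing if all are enabled.
# (missing_controllers is a hypothetical helper name.)
missing_controllers() {
    local cgroup_dir=$1; shift
    local enabled ctrl
    # The file holds one space-separated line, e.g. "memory pids".
    # Padding with spaces keeps "cpu" from matching inside "cpuset".
    enabled=" $(cat "$cgroup_dir/cgroup.subtree_control") "
    for ctrl in "$@"; do
        case "$enabled" in
            *" $ctrl "*) ;;         # already enabled for child cgroups
            *) echo "$ctrl" ;;      # not enabled
        esac
    done
}

# Example call on a cgroup v2 node:
# missing_controllers /sys/fs/cgroup cpuset cpu memory
```

With the "memory pids" content shown above, asking for cpuset, cpu and memory would list cpuset and cpu, which matches the two "is not enabled" errors from slurmd.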


As a dirty fix, I've made a script that runs at launch time, using ExecStartPre in slurmd.service:

ExecStartPre=/opt/slurm_bin/dirty_fix_slurmd.sh

with dirty_fix_slurmd.sh:

#!/bin/bash
echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu" >> /sys/fs/cgroup/system.slice/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/system.slice/cgroup.subtree_control

(I'm not sure whether this is a good thing to do.)
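
A slightly more defensive variant of such a pre-start script might first check that the kernel offers each controller at that level before asking for it. This is only a sketch: enable_controllers is a hypothetical helper name, and it does not address why the value is reset at boot.

```shell
#!/bin/bash
# Enable the given controllers for the children of <cgroup_dir>, but only
# those listed as available in <cgroup_dir>/cgroup.controllers.
# (enable_controllers is a hypothetical helper name.)
enable_controllers() {
    local cgroup_dir=$1; shift
    local available request="" ctrl
    available=" $(cat "$cgroup_dir/cgroup.controllers") "
    for ctrl in "$@"; do
        case "$available" in
            *" $ctrl "*) request="$request +$ctrl" ;;
            *) echo "controller $ctrl not available in $cgroup_dir" >&2 ;;
        esac
    done
    # Nothing to write if none of the requested controllers is available.
    [ -n "$request" ] || return 0
    # A single write carrying several "+name" tokens.
    echo "${request# }" > "$cgroup_dir/cgroup.subtree_control"
}

# Parent first, then the slice that contains slurmd.service:
# enable_controllers /sys/fs/cgroup cpu cpuset
# enable_controllers /sys/fs/cgroup/system.slice cpu cpuset
```

The order matters because a controller can only be enabled in system.slice once the root cgroup has enabled it for its children.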


If you have any idea how to correct this situation, I would be glad to hear it.

Have a nice day

Thank you

Tristan LEFEBVRE


Brian Andrus

Mar 10, 2023, 2:08:56 PM
to slurm...@lists.schedmd.com

I'm not sure which specific item to look at, but this seems like a race condition.
You likely need to add an override to your slurmd startup (/etc/systemd/system/slurmd.service.d/override.conf) and put a dependency there so slurmd won't start until that is done.

I have mine wait for a few things:

[Unit]
After=autofs.service getty.target sssd.service


That makes it wait for all of those before trying to start.
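
Creating such a drop-in from the shell could look like the sketch below. The function name write_slurmd_dropin and the DROPIN_ROOT variable are hypothetical, added only so the snippet can be tried in a scratch directory; the unit names are the ones from the example above.

```shell
#!/bin/bash
# Write an ordering drop-in for slurmd.service.
# DROPIN_ROOT is a hypothetical knob for testing; on a real node the
# default /etc/systemd/system is the directory systemd reads.
write_slurmd_dropin() {
    local root=${DROPIN_ROOT:-/etc/systemd/system}
    local dir="$root/slurmd.service.d"
    mkdir -p "$dir"
    # "$*" joins the unit names with spaces into a single After= line.
    printf '[Unit]\nAfter=%s\n' "$*" > "$dir/override.conf"
}

# write_slurmd_dropin autofs.service getty.target sssd.service
# systemctl daemon-reload   # make systemd re-read the unit files
```

Note that After= only orders startup relative to units that are started anyway; it does not pull the listed units in.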

Brian Andrus

Alan Orth

May 23, 2023, 6:46:54 AM
to Slurm User Community List
I notice the exact same behavior as Tristan. My CentOS Stream 8 system is in full unified cgroupv2 mode, the slurmd.service has a "Delegate=Yes" override added to it, and all cgroup stuff is added to slurm.conf and cgroup.conf, yet slurmd does not start after reboot. I don't understand what is happening, but I see the exact same behavior regarding the cgroup subtree_control with disabling / re-enabling slurmd.

[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control                                                                                                                                          
memory pids                                                                                                                                                                                            
[root@compute ~]# systemctl disable slurmd
Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
[root@compute ~]# systemctl enable slurmd                        
Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids

After this slurmd starts up successfully (until the next reboot). We are on version 22.05.9.

Regards,


--

Josef Dvoracek via slurm-users

Apr 11, 2024, 11:16:14 AM
to slurm...@lists.schedmd.com

I observe the same behavior with Slurm 23.11.5 on Rocky Linux 8.9.

> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control                                                                                                                                         
> memory pids                                                                                                                                                                                            
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids
> [root@compute ~]# systemctl enable slurmd                        
> Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids

I see this thread is about a year old. Is there a better or newer understanding of this by now?

cheers

josef

Williams, Jenny Avis via slurm-users

Apr 11, 2024, 1:55:24 PM
to Josef Dvoracek, slurm...@lists.schedmd.com

There needs to be a slurmstepd infinity process running before slurmd starts.

This doc goes into it:
https://slurm.schedmd.com/cgroup_v2.html

 

There is probably a better way to do this, but this is what we do to deal with that:

 

::::::::::::::
files/slurm-cgrepair.service
::::::::::::::
[Unit]
Before=slurmd.service slurmctld.service
After=nas-longleaf.mount remote-fs.target system.slice

[Service]
Type=oneshot
ExecStart=/callback/slurm-cgrepair.sh

[Install]
WantedBy=default.target
::::::::::::::
files/slurm-cgrepair.sh
::::::::::::::
#!/bin/bash
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control

/usr/sbin/slurmstepd infinity &

Josef Dvoracek via slurm-users

Apr 11, 2024, 3:29:46 PM
to slurm...@lists.schedmd.com

Thanks for the hint.

So do you also end up with two "slurmstepd infinity" processes, as I did when I tried this workaround?

[root@node ~]# ps aux | grep slurm
root        1833  0.0  0.0  33716  2188 ?        Ss   21:02   0:00 /usr/sbin/slurmstepd infinity
root        2259  0.0  0.0 236796 12108 ?        Ss   21:02   0:00 /usr/sbin/slurmd --systemd
root        2331  0.0  0.0  33716  1124 ?        S    21:02   0:00 /usr/sbin/slurmstepd infinity
root        2953  0.0  0.0 221944  1092 pts/0    S+   21:12   0:00 grep --color=auto slurm
[root@node ~]#

BTW, I found a mention of a change in the Slurm cgroup v2 code in the changelog for the next release. One can see the commit, which refers to a bug, but as the bug is private, I cannot see the bug description.

So perhaps with the Slurm 24.xx release we'll see something new.

cheers

josef

Williams, Jenny Avis via slurm-users

Apr 11, 2024, 5:24:48 PM
to Josef Dvoracek, slurm...@lists.schedmd.com

The end goal is to see the following two things:

- jobs under the slurmstepd cgroup path, and

- at least cpu, cpuset, and memory listed in the cgroup.controllers file of each job's cgroup.

 

The pattern you have looks like the processes left after boot: the first slurmd service start fails and leaves a slurmstepd infinity process behind, and then the second slurmd starts. In your case there is a second slurmstepd infinity process. As to why, I can't answer that from here without poking at it more.

 

 

Having that slurmstepd infinity running with the needed cgroup controllers (for us, at a minimum, cpuset, cpu, and memory; YMMV depending on the cgroup.conf settings) before slurmd tries to start is what enables slurmd to start.

The necessary piece for this to work is that the required controllers are available at the parent of the path before slurmd, and in particular slurmstepd infinity, starts.

 

Our cgroup.conf file is:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
CgroupPlugin=cgroup/v2
AllowedSwapSpace=1
ConstrainDevices=yes
ConstrainSwapSpace=yes

 

So the missing piece for getting slurmd to start at boot is supplied by applying these changes to the cgroup controls before the slurmd service attempts to start. As a test, on your system as it is now, without adding anything I've mentioned, try a cgroup.conf with zero Constrain statements. My bet is that slurmd then starts cleanly on boot. I hope the bug fix does not make slurmd more liberal about checking the cgroup controller list. It took a while before I trusted that the controllers were actually there, so knowing that the controllers are there whenever slurmd starts is great.

 

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control

 

 

 

The job cgroup propagation (the contents of the cgroup.controllers files along the cgroup path) after slurmd and slurmstepd infinity start goes via the cgroup path established under slurmstepd.scope. If there is no slurmstepd infinity, Slurm will start one; if slurmstepd infinity is already running and has set up at least the cgroup controllers slurmd needs based on what is in cgroup.conf, then slurmd doesn't end up starting more slurmstepd infinity processes. My recollection is that the first slurmstepd infinity does set up the needed cgroup controllers, which is why a second slurmd attempt then starts.

 

To see what slurmd is complaining about in detail, try disabling the slurmd service, rebooting, setting SLURM_DEBUG_FLAGS = cgroups, and then running slurmd -D -vvv manually. I am fairly sure that helps show the particulars better.

 

In theory, in our setup the slurm-cgrepair.service forces a slurmstepd infinity process to be running prior to the slurmd service finishing its start (I don't know for sure; the PID order says otherwise).

# systemctl show slurmd |egrep cgrepair
After=network-online.target systemd-journald.socket slurm-cgrepair.service remote-fs.target system.slice sysinit.target nvidia.service munge.service basic.target

 

The resulting behavior of this setup is as we expect – the slurmd service is running on nodes after reboot without intervention.  Our steps may not be all necessary, but they are sufficient.


The list of cgroup controllers for processes further down the cgroup path (cpu, cpuset, memory for slurmstepd.scope/job_xxxx) can only be a subset of that of any parent in the cgroup path (cpu, cpuset, memory, pids for slurmstepd.scope).
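
Because of that subset rule, it can help to check every level of the path rather than only the job's cgroup. The sketch below walks from the cgroup root down a relative path and reports where a required controller first goes missing; check_cgroup_path is a hypothetical helper name.

```shell
#!/bin/bash
# For the root and each level of <relative_path> below it, report which of
# the required controllers are absent from that level's cgroup.controllers.
# A level whose file cannot be read is reported as missing everything.
# Returns non-zero if anything is missing.
# (check_cgroup_path is a hypothetical helper name.)
check_cgroup_path() {
    local dir=$1 rel=$2; shift 2
    local parts part available ctrl status=0
    IFS=/ read -r -a parts <<< "$rel"
    # The empty first element makes the loop check the root itself.
    for part in "" "${parts[@]}"; do
        if [ -n "$part" ]; then dir="$dir/$part"; fi
        available=" $(cat "$dir/cgroup.controllers" 2>/dev/null) "
        for ctrl in "$@"; do
            case "$available" in
                *" $ctrl "*) ;;
                *) echo "$dir: $ctrl missing"; status=1 ;;
            esac
        done
    done
    return $status
}

# check_cgroup_path /sys/fs/cgroup system.slice/slurmstepd.scope cpuset cpu memory
```

The first level that reports a missing controller is the one whose parent needs the corresponding "+name" written to its cgroup.subtree_control.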

 

You asked what our process tree looks like, so here is that information. I add the cgroup field in top for ongoing assurance that user processes are under the slurmstepd.scope path.

 

This is the process tree on our nodes.

 

# ps aux |grep slurm |head -n 15 |sed 's/xxxx/aUser/g'
root        8687  0.0  0.0 6471088 34044 ?       Ss   Apr03   0:29 /usr/sbin/slurmd -D -s
root        8694  0.0  0.0  33668  1080 ?        S    Apr03   0:00 /usr/sbin/slurmstepd infinity
root     2942928  0.0  0.0 311804  7416 ?        Sl   Apr06   0:42 slurmstepd: [35400562.extern]
root     2942930  0.0  0.0 311804  7164 ?        Sl   Apr06   0:43 slurmstepd: [35400563.extern]
root     2942933  0.0  0.0 311804  7144 ?        Sl   Apr06   0:45 slurmstepd: [35400564.extern]
root     2942935  0.0  0.0 311804  7280 ?        Sl   Apr06   0:38 slurmstepd: [35400565.extern]
root     2942953  0.0  0.0 312164  7496 ?        Sl   Apr06   0:45 slurmstepd: [35400564.batch]
root     2942958  0.0  0.0 312164  7620 ?        Sl   Apr06   0:41 slurmstepd: [35400562.batch]
root     2942960  0.0  0.0 312164  7636 ?        Sl   Apr06   0:43 slurmstepd: [35400563.batch]
root     2942962  0.0  0.0 312164  7728 ?        Sl   Apr06   0:41 slurmstepd: [35400565.batch]
aUser    2942972  0.0  0.0  12868  3072 ?        SN   Apr06   0:00 /bin/bash /var/spool/slurmd/job35400562/slurm_script
aUser    2942973  0.0  0.0  12868  2868 ?        SN   Apr06   0:00 /bin/bash /var/spool/slurmd/job35400564/slurm_script
aUser    2942974  0.0  0.0  12868  3000 ?        SN   Apr06   0:00 /bin/bash /var/spool/slurmd/job35400565/slurm_script
aUser    2942975  0.0  0.0  12868  2980 ?        SN   Apr06   0:00 /bin/bash /var/spool/slurmd/job35400563/slurm_script
root     2944250  0.0  0.0 311804  7248 ?        Sl   Apr06   0:44 slurmstepd: [35400838.extern]




 

# pgrep slurm |head -n4  |xargs |sed 's/ /,/g' |xargs -L1 -I{} top -bn1 -Hi -p {}
top - 16:24:16 up 7 days, 21:09,  2 users,  load average: 48.07, 47.39, 46.90
Threads:  10 total,   0 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s): 76.6 us,  1.1 sy,  1.8 ni, 12.1 id,  7.6 wa,  0.4 hi,  0.4 si,  0.0 st
KiB Mem : 39548588+total, 48296324 free, 28092556 used, 31909702+buff/cache
KiB Swap:  2097148 total,  1354112 free,   743036 used. 32913958+avail Mem

    PID USER        RES    SHR S  %CPU  %MEM     TIME+  P CGROUPS                                                                                            COMMAND
   8687 root      34044   7664 S   0.0   0.0   0:04.28  8 0::/system.slice/slurmd.service                                                                    slurmd
   8694 root       1080    936 S   0.0   0.0   0:00.00 46 0::/system.slice/slurmstepd.scope/system                                                           slurmstepd
2942928 root       7412   6488 S   0.0   0.0   0:00.01 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm                                   slurmstepd
2942936 root       7412   6488 S   0.0   0.0   0:34.29 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm                                    `- acctg
2942937 root       7412   6488 S   0.0   0.0   0:08.27 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm                                    `- acctg_prof
2942938 root       7412   6488 S   0.0   0.0   0:00.05 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm                                    `- slurmstepd
2942930 root       7164   6236 S   0.0   0.0   0:00.01 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm                                   slurmstepd
2942939 root       7164   6236 S   0.0   0.0   0:36.40 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm                                    `- acctg
2942940 root       7164   6236 S   0.0   0.0   0:07.10 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm                                    `- acctg_prof
2942941 root       7164   6236 S   0.0   0.0   0:00.04 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm                                    `- slurmstepd
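
The CGROUPS column above can also be read directly from /proc, which is handy for scripted checks. This is a minimal sketch assuming a unified (v2-only) hierarchy, where /proc/<pid>/cgroup has a single line of the form "0::/path"; pid_cgroup_path is a hypothetical helper name, and the optional second argument exists only so it can be pointed at a scratch directory.

```shell
#!/bin/bash
# Print the cgroup v2 path of a process, read from <proc_root>/<pid>/cgroup.
# (pid_cgroup_path is a hypothetical helper name.)
pid_cgroup_path() {
    local pid=$1 proc_root=${2:-/proc}
    # Keep only the unified-hierarchy line and strip its "0::" prefix.
    sed -n 's/^0:://p' "$proc_root/$pid/cgroup"
}

# Flag job step daemons that ended up outside slurmstepd.scope:
# for pid in $(pgrep slurmstepd); do
#     case "$(pid_cgroup_path "$pid")" in
#         */slurmstepd.scope/*) ;;
#         *) echo "PID $pid is outside slurmstepd.scope" ;;
#     esac
# done
```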

 

 

# sacct -j 35400562 -p
JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode|
35400547_10|run_s111.sh|general|slurm_account|1|RUNNING|0:0|
35400547_10.batch|batch||slurm_account|1|RUNNING|0:0|
35400547_10.extern|extern||slurm_account|1|RUNNING|0:0|

# scontrol listpids 35400562
PID      JOBID    STEPID   LOCALID GLOBALID
-1       35400562 extern   0       0
2942928  35400562 extern   -       -
2942946  35400562 extern   -       -
2942972  35400562 batch    0       0
2942958  35400562 batch    -       -
2943039  35400562 batch    -       -

 


# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/cgroup.controllers
cpuset cpu memory pids

# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm/cgroup.controllers
cpuset cpu memory
