[slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5


Angel de Vicente

Apr 21, 2023, 8:33:39 AM
to Slurm User Community List
Hello,

I've installed Slurm on a workstation (a single-node install) running
Ubuntu 22.04. The version is 21.08.5 (I didn't compile it myself, just
installed it with "apt install").

In the slurm.conf file I have:

,----
| ProctrackType=proctrack/cgroup
| TaskPlugin=task/affinity,task/cgroup
`----
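
(For anyone reproducing this: a minimal cgroup.conf for such a setup
typically looks roughly like the following; just an illustrative sketch,
and the path may differ depending on the packaging.)

,----
| ### /etc/slurm/cgroup.conf (illustrative)
| ConstrainCores=yes
| ConstrainRAMSpace=yes
`----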

When I submit a job, the "slurmd.log" shows:

,----
| [2023-04-21T12:22:14.128] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 127018
| [2023-04-21T12:22:14.128] task/affinity: batch_bind: job 127018 CPU input mask for node: 0x00000001
| [2023-04-21T12:22:14.129] task/affinity: batch_bind: job 127018 CPU final HW mask for node: 0x00000001
| [2023-04-21T12:22:14.156] [127018.extern] error: cgroup namespace 'cpuset' not mounted. aborting
| [2023-04-21T12:22:14.156] [127018.extern] error: unable to create cpuset cgroup namespace
| [2023-04-21T12:22:14.156] [127018.extern] error: cgroup namespace 'memory' not mounted. aborting
| [2023-04-21T12:22:14.156] [127018.extern] error: unable to create memory cgroup namespace
| [2023-04-21T12:22:14.156] [127018.extern] error: failure enabling memory enforcement: Unspecified error
| [2023-04-21T12:22:14.156] [127018.extern] error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
| [2023-04-21T12:22:14.156] [127018.extern] error: cannot create task context for task/cgroup
| [2023-04-21T12:22:14.156] [127018.extern] error: job_manager: exiting abnormally: Plugin initialization failed
`----

If I change TaskPlugin to be just

,----
| TaskPlugin=task/affinity
`----

then the job executes without any problems.

Do you know how I could fix this while keeping the cgroup plugin? My
intuition tells me that I should probably get the latest version of
Slurm and compile it myself, but I thought I would ask here before going
that route.

Any ideas/pointers? Many thanks,
--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.: http://research.iac.es/proyecto/polmag/

GPG: 0x8BDC390B69033F52

Hermann Schwärzler

Apr 21, 2023, 9:03:43 AM
to slurm...@lists.schedmd.com
Hi Ángel,

which version of cgroups does Ubuntu 22.04 use?

What is the output of
mount | grep cgroup
on your system?

Regards,
Hermann

Angel de Vicente

Apr 21, 2023, 11:59:24 AM
to Hermann Schwärzler, slurm...@lists.schedmd.com
Hello,

Hermann Schwärzler <hermann.s...@uibk.ac.at> writes:

> which version of cgroups does Ubuntu 22.04 use?

I'm a cgroups noob, but my understanding is that both v2 and v1 coexist
in Ubuntu 22.04
(https://manpages.ubuntu.com/manpages/jammy/man7/cgroups.7.html). I have
another machine with Ubuntu 18.04, which also has (AFAIK) both versions
(https://manpages.ubuntu.com/manpages/jammy/man7/cgroups.7.html) and
where Slurm (slurm-wlm) 21.08.8-2 is installed, and I have no cgroups
issues there.

> What is the output of "mount | grep cgroup" on your system?

,----
| mount | grep cgroup
| cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
| cpuacct on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
| freezer on /cgroup/freezer type cgroup (rw,relatime,freezer)
| cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
`----
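
(In case it helps, I believe the filesystem type of /sys/fs/cgroup also
shows the mode: it should print "cgroup2fs" on a unified-only system and
"tmpfs" on a legacy/hybrid one.)

,----
| stat -fc %T /sys/fs/cgroup
| cgroup2fs
`----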

Thanks for any help/pointers,

Michael Gutteridge

Apr 21, 2023, 12:12:57 PM
to Slurm User Community List

Does this link help? 

> Debian and derivatives (e.g. Ubuntu) usually exclude the memory and 
> memsw (swap) cgroups by default. To include them, add the following 
> parameters to the kernel command line: cgroup_enable=memory swapaccount=1

I'm using Bionic (18) and after applying those changes it seems to be working OK for me. I don't believe that Ubuntu has changed memory cgroup configuration between 18 and 22, but we're only starting to use 22.
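
On Ubuntu the usual way to add those parameters is via the GRUB config,
roughly like this (a sketch; adjust to your own setup):

,----
| # append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
| GRUB_CMDLINE_LINUX_DEFAULT="... cgroup_enable=memory swapaccount=1"
|
| sudo update-grub && sudo reboot
`----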

 - Michael

Angel de Vicente

Apr 21, 2023, 1:16:37 PM
to Michael Gutteridge, Slurm User Community List
Hello,

Michael Gutteridge <michael.g...@gmail.com> writes:

> Does this link help? 
>
>> Debian and derivatives (e.g. Ubuntu) usually exclude the memory and 
>> memsw (swap) cgroups by default. To include them, add the following 
>> parameters to the kernel command line: cgroup_enable=memory swapaccount=1

On the old machine (Ubuntu 18.04) we don't set those kernel parameters
and Slurm seems to have no issues with cgroups. (What happens if they
are not set? Would you get something like what I was reporting, i.e. the
plugin cannot be loaded, or simply that cgroup would not be able to
enforce memory policies?)

> I'm using Bionic (18) and after applying those changes it seems to be
> working OK for me. I don't believe that Ubuntu has changed memory
> cgroup configuration between 18 and 22, but we're only starting to use
> 22.

I see that there are some differences in cgroup between 18.04 and 22.04,
but I don't understand them well enough to be able to figure out what
could be the issue...

Cheers,

Angel de Vicente

Apr 22, 2023, 6:47:22 AM
to Slurm User Community List
Hello,

Angel de Vicente <angel.de...@iac.es> writes:

> Do you know how I could fix this while keeping the cgroup plugin? My
> intuition tells me that I should probably get the latest version of
> Slurm and compile it myself, but I thought I would ask here before going
> that route.

I followed my intuition and got much closer to fixing it.

So far I learned the following:

+ the older Slurm seems to only support cgroup v1
+ Ubuntu 18.04 seems to be in a "Hybrid" cgroup mode
+ Ubuntu 22.04 is in "V2" mode.

So, on Ubuntu 22.04 I compiled a newer Slurm version (22.05.8), which
supports both cgroup v1 and v2.
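
(As an aside, I understand that in these newer versions the cgroup
plugin can also be selected explicitly in cgroup.conf, e.g.:)

,----
| # cgroup.conf
| CgroupPlugin=cgroup/v2     # or cgroup/v1, or autodetect (the default)
`----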

But when starting this version, slurmd complains with:

,----
| slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
| 5:freezer:/
| 3:cpuacct:/
`----

If I unmount /cgroup/freezer and /cgroup/cpuacct then everything works
fine with Slurm, but I'm not sure what/who is mounting these, or whether
it is safe to unmount them. These machines were upgraded from 18.04, so
I'm not sure whether these are leftovers from that version or whether a
completely fresh 22.04 installation would also have them.

Cheers,

Angel de Vicente

May 3, 2023, 6:31:24 PM
to Slurm User Community List
Hello,

Angel de Vicente <angel.de...@iac.es> writes:

> ,----
> | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
> | 5:freezer:/
> | 3:cpuacct:/
> `----

in the end I learnt that, despite Ubuntu 22.04 reporting that it was
using only cgroup v2, it was also using v1 and creating those mount
points, and Slurm 23.02.01 then complained that it could not work with
cgroups in hybrid mode.

So, the "solution" (as far as you don't need V1 for some reason) was to
add "cgroup_no_v1=all" to the Kernel parameters and reboot: no more V1
mount points and Slurm was happy with that.

[In case somebody is interested in the future: I needed this so that I
could limit the resources given to users not using Slurm. We have some
shared workstations with many cores and users were oversubscribing the
CPUs, so I installed Slurm to bring some order to the executions there.
But these machines are not an actual cluster with a login node: the
login node is the same as the execution node! So with cgroups I make
sure that users connecting via ssh get only the equivalent of 3/4 of a
core (enough to edit files, etc.) until they submit their jobs via
Slurm, at which point they get the full allocation they requested.]
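
(For anyone wanting to do something similar: one way to get that
per-user ssh limit is a systemd slice drop-in along these lines; just a
sketch, the file name and values are examples:)

,----
| # /etc/systemd/system/user-.slice.d/50-limits.conf
| # (then run "systemctl daemon-reload"; applies to new user sessions)
| [Slice]
| CPUQuota=75%        # roughly 3/4 of one core per logged-in user
`----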

Cristóbal Navarro

Jul 24, 2023, 8:10:43 PM
to Slurm User Community List
Hello Angel and Community,
I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on Ubuntu 22.04 LTS) and Slurm 23.02.
When I start the `slurmd` service, its status shows "failed" with the information below.
As of today, what is the best solution to this problem? I am really not sure whether the DGX A100 might break if I disable cgroups v1.
Any suggestions are welcome.

➜  slurm-23.02.3 systemctl status slurmd.service                                                                
× slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
    Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 3680019 (code=exited, status=1/FAILURE)
        CPU: 40ms

jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  Log file re-opened
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
➜  slurm-23.02.3      


--
Cristóbal A. Navarro

ralf.u...@physik.uni-augsburg.de

Jul 27, 2023, 8:09:20 AM
to slurm...@lists.schedmd.com
On 26.07.23 at 11:38, Ralf Utermann wrote:
> On 25.07.23 at 02:09, Cristóbal Navarro wrote:
>> Hello Angel and Community,
>> I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on Ubuntu 22.04 LTS) and Slurm 23.02.
>> When I start the `slurmd` service, its status shows "failed" with the information below.
>
> Hello Cristobal,
>
> we see similar problems, not on a DGX but on standard server nodes
> running Ubuntu 22.04 (kernel 5.15.0-76-generic) and Slurm 23.02.3.
>
> The first start of the slurmd service always fails, with lots of errors
> in the slurmd.log like:
>   error: cpu cgroup controller is not available.
>   error: There's an issue initializing memory or cpu controller
> After 90 seconds this slurmd service start times out and is marked as failed.
>
> BUT: One process is still running:
>   /usr/local/slurm/23.02.3/sbin/slurmstepd infinity
>
> This looks like the process started to handle cgroup v2 as described in
>   https://slurm.schedmd.com/cgroup_v2.html
>
> When we keep this slurmstepd infinity running, and just start
> the slurmd service a second time, everything comes up running.
>
> So our current workaround is: we configure the slurmd service
> with a Restart=on-failure in the [Service] section.
>
>
> Are there real solutions to this initial timeout failure?
>
> best regards, Ralf
>
>
>
>> As of today, what is the best solution to this problem? I am really not sure whether the DGX A100 might break if I disable cgroups v1.
>> Any suggestions are welcome.
>>
>> ➜  slurm-23.02.3 systemctl status slurmd.service
>> × slurmd.service - Slurm node daemon
>>       Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>>       Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
>>      Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>     Main PID: 3680019 (code=exited, status=1/FAILURE)
>>          CPU: 40ms
>>
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  Log file re-opened
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
>> jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
>> jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>> ➜  slurm-23.02.3
>>
>>
>>
>> On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de...@iac.es <mailto:angel.de...@iac.es>> wrote:
>>
>>     Hello,
>>
>>       Web.: http://research.iac.es/proyecto/polmag/ <http://research.iac.es/proyecto/polmag/>
>>
>>       GPG: 0x8BDC390B69033F52
>>
>>
>>
>> --
>> Cristóbal A. Navarro
>
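
(For reference, the Restart workaround mentioned above can be set with a
systemd drop-in roughly like this; a sketch, adjust to your installation:)

,----
| sudo systemctl edit slurmd
|
| # in the override file:
| [Service]
| Restart=on-failure
| RestartSec=10
`----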

--
Ralf Utermann

Universität Augsburg
Rechenzentrum
D-86135 Augsburg

ralf.u...@uni-a.de
https://www.rz.uni-augsburg.de


Angel de Vicente

Sep 7, 2023, 2:09:54 PM
to Cristóbal Navarro, Slurm User Community List
Hello Cristobal,

Cristóbal Navarro <cristobal...@gmail.com> writes:

> Hello Angel and Community,

> I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on
> Ubuntu 22.04 LTS) and Slurm 23.02.
> When I start the `slurmd` service, its status shows "failed" with the
> information below.
> As of today, what is the best solution to this problem? I am really
> not sure whether the DGX A100 might break if I disable cgroups v1.
> Any suggestions are welcome.

did you manage to find a solution to this without disabling cgroups v1?

In our case:

,----
| slurm 23.02.3
| Ubuntu 22.04.3 LTS
|
| # cat /proc/cmdline
| BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=... ro quiet splash cgroup_no_v1=all vt.handoff=7
`----

disabling cgroups v1 has been working reliably, but it would be nice to
find a solution that doesn't require modifying the kernel parameters.