On 26.07.23 at 11:38, Ralf Utermann wrote:
> On 25.07.23 at 02:09, Cristóbal Navarro wrote:
>> Hello Angel and Community,
>> I am facing a similar problem on a DGX A100 with DGX OS 6 (based on Ubuntu 22.04 LTS) and Slurm 23.02.
>> When I start the `slurmd` service, its status shows failed with the following information:
>
> Hello Cristobal,
>
> we see similar problems, not on a DGX but on standard server nodes running
> Ubuntu 22.04 (kernel 5.15.0-76-generic) and Slurm 23.02.3.
>
> The first start of the slurmd service always fails, with lots of errors
> in slurmd.log like:
> error: cpu cgroup controller is not available.
> error: There's an issue initializing memory or cpu controller
> After 90 seconds the slurmd service start times out and is marked failed.
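For reference, the two errors above typically mean the cpu and memory controllers are not enabled/delegated at the cgroup v2 root. This can be checked directly (a sketch, assuming cgroup v2 is mounted at the usual /sys/fs/cgroup path):

```shell
# Controllers available at the cgroup v2 root; slurmd's cgroup/v2 plugin
# needs at least "cpu" and "memory" to appear here.
cat /sys/fs/cgroup/cgroup.controllers

# Controllers actually delegated to child cgroups (what slurmstepd will see).
cat /sys/fs/cgroup/cgroup.subtree_control
```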
>
> BUT: One process is still running:
> /usr/local/slurm/23.02.3/sbin/slurmstepd infinity
>
> This looks like the process started to handle cgroup v2, as described in
> https://slurm.schedmd.com/cgroup_v2.html
>
> When we keep this slurmstepd infinity process running and just start
> the slurmd service a second time, everything comes up fine.
>
> So our current workaround is to configure the slurmd service
> with Restart=on-failure in the [Service] section.
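The Restart=on-failure workaround described above can be applied as a systemd drop-in override, which leaves the packaged unit file untouched (a sketch; the drop-in path follows the standard systemd convention for a unit named slurmd.service, and RestartSec=5 is an assumption, not part of the original workaround):

```shell
# Create a drop-in override directory for slurmd.service
sudo mkdir -p /etc/systemd/system/slurmd.service.d

# Add the restart policy in a [Service] section
cat <<'EOF' | sudo tee /etc/systemd/system/slurmd.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=5
EOF

# Make systemd pick up the new drop-in
sudo systemctl daemon-reload
```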
>
>
> Are there real solutions to this initial timeout failure?
>
> best regards, Ralf
>
>
>
>> As of today, what is the best solution to this problem? I am really not sure whether the DGX A100 would break if cgroups v1 were disabled.
>> Any suggestions are welcome.
>>
>> ➜ slurm-23.02.3 systemctl status slurmd.service
>> × slurmd.service - Slurm node daemon
>> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>> Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
>> Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>> Main PID: 3680019 (code=exited, status=1/FAILURE)
>> CPU: 40ms
>>
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug: Log file re-opened
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
>> jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
>> jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
>> jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>> ➜ slurm-23.02.3
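The "Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/" fatal in the log above means both a cgroup v1 hierarchy (the freezer controller) and cgroup v2 are mounted at the same time. A quick way to check the mode, and a sketch of forcing a pure cgroup v2 boot via the kernel command line (the GRUB steps are an assumption about the bootloader; cgroup_no_v1=all and systemd.unified_cgroup_hierarchy=1 are standard kernel/systemd parameters, but whether disabling v1 is safe on a DGX is exactly the open question here):

```shell
# Check which cgroup versions are mounted: only "cgroup2" lines means
# pure v2; a mix of "cgroup" and "cgroup2" lines means hybrid mode.
mount | grep -E '^cgroup2? '

# To force a pure cgroup v2 boot, disable all v1 controllers on the
# kernel command line. In /etc/default/grub, append to GRUB_CMDLINE_LINUX:
#   cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1
sudo update-grub   # then reboot the node
```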
>>
>>
>>
>> On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de...@iac.es> wrote:
>>
>> Hello,
>>
>> Web.: http://research.iac.es/proyecto/polmag/
>>
>> GPG: 0x8BDC390B69033F52
>>
>>
>>
>> --
>> Cristóbal A. Navarro
>
--
Ralf Utermann
Universität Augsburg
Rechenzentrum
D-86135 Augsburg
ralf.u...@uni-a.de
https://www.rz.uni-augsburg.de