[slurm-users] Nodes stay drained no matter what I do


Patrick Goetz

Aug 24, 2023, 11:27:57 AM
to Slurm User Community List

Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)

This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system. I
re-used the original slurm.conf (suspecting this might cause issues);
the hardware is the same. The Master and nodes all use the same
slurm.conf, gres.conf, and cgroup.conf files, which are symlinked into
/etc/slurm-llnl from an NFS-mounted filesystem.

As per the subject, the nodes refuse to revert to idle:

-----------------------------------------------------------
root@hypnotoad:~# sinfo -N -l
Thu Aug 24 10:01:20 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
dgx-2 1 dgx drained 80 80:1:1 500000 0 1 (null) gres/gpu count repor
dgx-3 1 dgx drained 80 80:1:1 500000 0 1 (null) gres/gpu count repor
dgx-4 1 dgx drained 80 80:1:1 500000 0 1 (null) gres/gpu count
...
titan-3 1 titans* drained 40 40:1:1 250000 0 1 (null) gres/gpu count report
...
-----------------------------------------------------------

Neither of these commands has any effect:

scontrol update NodeName=dgx-[2-6] State=RESUME
scontrol update state=idle nodename=dgx-[2-6]
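
Side note for anyone following along: the full, untruncated drain reason is
visible via scontrol, e.g.

scontrol show node dgx-2 | grep Reason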


When I check the slurmctld log I find this helpful information:

-----------------------------------------------------------
...
[2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument
[2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument
[2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument
...
-----------------------------------------------------------

Googling suggests this indicates a resource mismatch between the actual
hardware and what is specified in slurm.conf. Note that the existing
configuration worked for Slurm 17, and having re-checked it, it still
looks fine to me:

Relevant parts of slurm.conf:

-----------------------------------------------------------
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED
PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED

GresTypes=gpu
NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80
NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80
-----------------------------------------------------------

All the nodes in the titan partition are identical hardware, as are the
nodes in the dgx partition save for dgx-2, which lost a GPU and is no
longer under warranty. So, using a couple of representative nodes:

root@dgx-4:~# slurmd -C
NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846

root@titan-8:~# slurmd -C
NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811


I'm at a loss for how to debug this and am looking for suggestions.
Since the resources on these machines are strictly dedicated to Slurm
jobs, would it be best to use the output of `slurmd -C` directly for the
right-hand side of NodeName, reducing the memory a bit for OS overhead?
Is there any way to get better debugging output? "Invalid argument"
doesn't tell me much.
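
By "use the output of `slurmd -C` directly" I mean something like this
(untested; values copied from the dgx-4 output above, with RealMemory
rounded down for OS overhead):

NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000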

Thanks.







Groner, Rob

Aug 24, 2023, 12:17:20 PM
to Slurm User Community List
Ya, I agree about the invalid argument not being much help.  

In times past when I encountered issues like that, I typically tried the following (a quick sketch of the first step is below):
  • Restart slurmd on the compute node and watch its log to see what it complains about. Usually it's about memory.
  • Set the configuration of the node to whatever slurmd -C says, or set config_overrides in slurm.conf.
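
For the first step, something along these lines on the node (the slurmd log location depends on your packaging and your SlurmdLogFile setting, so treat the path below as a guess):

# on the compute node
systemctl restart slurmd
tail -f /var/log/slurm-llnl/slurmd.log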
Rob



Timony, Mick

Aug 24, 2023, 12:24:47 PM
to Slurm User Community List
Hi Patrick,

You may want to review the release notes for 19.05 and any intermediate versions:

https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES

I'd also check the slurmd.log on the compute nodes. It's usually in /var/log/slurm/slurmd.log.

I'm not 100% sure your gres.conf is correct. We use one gres.conf for all our nodes; it looks something like this:

NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]
NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]
NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]

SchedMD's docs example is a little different, as they use a unique gres.conf per node in their example at:

https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5

Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1

I don't see Name in your gres.conf?

Kind regards

-- 
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--



Patrick Goetz

Aug 24, 2023, 6:14:29 PM
to slurm...@lists.schedmd.com
Hi Rob -

Thanks for this suggestion. I'm sure I had restarted slurmd on the nodes
multiple times with nothing showing up in the node's slurm log file, but after

# tail -f /var/slurm-llnl/slurmd.log
# systemctl restart slurmd

I started to get errors in the log which eventually led me to the solution.

To save future users the days of frustration I just experienced, here is
what I discovered.

All the problems were confined to the shared slurm.conf file. As a
reminder, all this just worked in Slurm 17.x.

Slurm 19.05 no longer likes this syntax:

NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40

The log file on the node included this error message:

error: Node configuration differs from hardware: Procs=40:40(hw) Boards=1:1(hw) SocketsPerBoard=40:2(hw) CoresPerSocket=1:10(hw) ThreadsPerCore=1:2(hw)

Notice that the configured topology (inferred from CPUs=40 alone) doesn't
match the hardware that slurmd actually detects. The solution was to just
use precisely what's reported by

slurmd -C

on the node:

NodeName=titan-[3-15] Gres=gpu:titanv:8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000


But that wasn't the only issue. There was also this:

WARNING: A line in gres.conf for GRES gpu has 8 more configured than expected in slurm.conf. Ignoring extra GRES.

It's calling this a warning, but

# scontrol show node titan-11 | grep Reason

revealed that this mismatch was causing the node to drain immediately after
being set to idle. The problem was this:

Gres=gpu:titanv:8
         ^^^^^^
(specifically, the "titanv" type)
For some reason this syntax was acceptable to Slurm 17 but not Slurm 19.
The fix was:

Gres=gpu:titanv:8 --> Gres=gpu:8

Final correct NodeName syntax:

NodeName=titan-[3-15] Gres=gpu:8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000
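
The dgx lines got the analogous treatment, i.e. something like this (dgx-2
being the node with only 7 working GPUs; other values taken from slurmd -C
on dgx-4, memory rounded down):

NodeName=dgx-2 Gres=gpu:7 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000
NodeName=dgx-[3-6] Gres=gpu:8 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=500000

I suspect the typed form (Gres=gpu:titanv:8) would also have been accepted
if gres.conf had declared a matching Type=titanv for those devices, but I
haven't tested that.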

Researching all this raised a number of questions (e.g. do I need to
express CPU affinity in gres.conf?), but at least the users now have the
functionality they enjoyed previously.


Patrick Goetz

Aug 24, 2023, 6:18:16 PM
to slurm...@lists.schedmd.com
Hi Mick -

Thanks for these suggestions. I read over both release notes, but
didn't find anything helpful.

Note that I didn't include gres.conf in my original post. That would be
this:

NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-7]
NodeName=dgx-2 Name=gpu File=/dev/nvidia[0-6]
NodeName=dgx-[3-6] Name=gpu File=/dev/nvidia[0-7]

Everything is working now, but a SchedMD comment alerted me to this
highly useful command:

# nvidia-smi topo -m

Now I'm wondering if I should be expressing CPU affinity explicitly in
the gres.conf file.
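
Purely as an illustration (I haven't checked which GPUs actually sit on
which socket, so the core ranges below are made up), I mean something like
the COREs syntax from the gres.conf man page:

NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-3] COREs=0-19
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[4-7] COREs=20-39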



Tina Friedrich

Aug 25, 2023, 5:51:45 AM
to slurm...@lists.schedmd.com
Hi Patrick,

We certainly use that information to set affinity, yes. Our gres.conf
files are node-specific (our config management creates them locally from
'nvidia-smi topo -m') and look like this:

Name=gpu Type=a100 File=/dev/nvidia0 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia1 CPUs=0-23
Name=gpu Type=a100 File=/dev/nvidia2 CPUs=24-47
Name=gpu Type=a100 File=/dev/nvidia3 CPUs=24-47

which means that the processor affinity is known, and you can request
GPUs as '--gres=gpu:a100:X'.
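
For example, something like

srun --gres=gpu:a100:2 <your command>

requests two of them, and I believe adding --gres-flags=enforce-binding
will then make Slurm respect that affinity information strictly when
placing the tasks.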

Tina

Patrick Goetz

Aug 25, 2023, 10:27:01 AM
to slurm...@lists.schedmd.com
Hi Tina -

Thanks for the confirmation! I will make this adjustment to gres.conf.
