[slurm-users] _node_config_validate: gres/gpu: Count changed on node (0 != 2)


Xaver Stiensmeier via slurm-users

Mar 20, 2026, 12:24:19 PM
to slurm...@lists.schedmd.com

Hey Slurm-users list,

while our regular GPU nodes work fine, our on-demand GPU nodes have a weird issue: they power up, I can ssh onto them and run nvidia-smi without any problem, but they are marked invalid and slurmctld logs

_node_config_validate: gres/gpu: Count changed on node (0 != 2)

however, "scontrol show node" reports the GPUs as recognized, the gres.conf stored on the worker nodes looks as expected, and the node entries in slurm.conf are fine, too:

# slurm.conf
NodeName=my_worker_node SocketsPerBoard=16 CoresPerSocket=1 RealMemory=64075 MemSpecLimit=4000 State=CLOUD Gres=gpu:L4:2 # openstack

# gres.conf on my_worker_node
ubuntu@my_node:~$ cat /etc/slurm/gres.conf 
# GRES CONFIG
Name=gpu Type=L4 File=/dev/nvidia0
Name=gpu Type=L4 File=/dev/nvidia1

Thankful for any ideas on what causes this and how to debug it.

Best,
Xaver

PS:
By executing:

sudo scontrol update NodeName=$(bibiname 0) Gres=
sudo scontrol reconfigure
sudo scontrol update NodeName=$(bibiname 0) state=RESUME reason=None

the node can be resumed. However, this is not a real solution.

Xaver Stiensmeier via slurm-users

Mar 23, 2026, 9:13:39 AM
to slurm...@lists.schedmd.com

Hey Slurm-users list,

in the meantime I found the following in the slurmd debug log:

[2026-03-23T12:58:16.105] debug:  gres/gpu: init: loaded
[2026-03-23T12:58:16.105] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2026-03-23T12:58:16.105] warning: Ignoring file-less GPU gpu:L4 from final GRES list
[2026-03-23T12:58:16.105] debug:  skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia0
[2026-03-23T12:58:16.105] debug:  skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia1
[2026-03-23T12:58:16.105] debug:  gres/gpu: init: loaded
[2026-03-23T12:58:16.106] debug:  skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia0
[2026-03-23T12:58:16.106] debug:  skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia1
[2026-03-23T12:58:16.106] debug:  gres/gpu: init: loaded

so I am wondering whether that warning is the issue. I also noticed that powering up the node without requesting a GPU works, and after that, scheduling a job to the node that does request a GPU is not an issue either.

Best,
Xaver

Hermann Schwärzler via slurm-users

Mar 23, 2026, 10:15:40 AM
to slurm...@lists.schedmd.com
Hi everyone,

On 3/23/26 14:11, Xaver Stiensmeier via slurm-users wrote:
[...]
> so I am wondering whether that is the issue. I also noticed that after
> powering up the node without requesting a gpu (works), scheduling to the
> node by requesting a GPU is not an issue.
[...]

We noticed this as well: after powering up a node, the GPU device files
(/dev/nvidia*) are not created (immediately).

What we did:
we changed the slurmd.service file and added

ExecStartPre=-/path/to/nvidia-smi -L

to the [Service] section.
This creates the device files and a failure (e.g. on non-GPU nodes) is
ignored by systemd (due to the "-" before the command).
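
For reference, a sketch of the same change as a systemd drop-in, so the packaged unit file stays untouched (the nvidia-smi path is an assumption, adjust it for your install):

# /etc/systemd/system/slurmd.service.d/nvidia-devices.conf
[Service]
# the "-" prefix makes systemd ignore a non-zero exit status,
# e.g. on nodes without GPUs
ExecStartPre=-/usr/bin/nvidia-smi -L

After adding it, run "systemctl daemon-reload" and restart slurmd.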

Maybe this helps?

Kind regards,
Hermann

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Xaver Stiensmeier via slurm-users

Mar 23, 2026, 11:41:27 AM
to slurm...@lists.schedmd.com
Hey,

I am not 100% sure yet, as this needs further testing (in case it is a
race condition), but I think I was able to fix my issue by switching to
the NodeName= gres.conf format and distributing that same file to every
node, instead of placing a node-specific gres.conf only on the nodes
that have GRES.
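
For illustration, a sketch of what such a shared gres.conf could look like using the NodeName= format (node name taken from the earlier messages; on a real cluster this would typically be a hostlist range):

# gres.conf, identical on every node
NodeName=my_worker_node Name=gpu Type=L4 File=/dev/nvidia[0-1]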

Best,
Xaver