Hi Rob -
Thanks for this suggestion. I'm sure I had restarted slurmd on the nodes
multiple times with nothing appearing in the slurmd log file on the node,
but after running
# tail -f /var/slurm-llnl/slurmd.log
# systemctl restart slurmd
I started to get errors in the log, which eventually led me to the solution.
To save future users the days of frustration I just experienced, here is
what I discovered.
All the problems were confined to the shared slurm.conf file. As a
reminder, all this just worked in Slurm 17.x.
Slurm 19.05 no longer likes this syntax:
NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40
The log file on the node included this error message:
error: Node configuration differs from hardware: Procs=40:40(hw)
Boards=1:1(hw) SocketsPerBoard=40:2(hw) CoresPerSocket=1:10(hw)
ThreadsPerCore=1:2(hw)
Notice the format: the numbers tagged (hw) are what slurmd actually detects
on the node, while the first number in each pair is what Slurm infers from
a config that only specifies CPUs=40 (i.e. 40 sockets with 1 core and 1
thread each), and the two no longer match.
The solution was to just use precisely what's reported by
slurmd -C
on the node:
NodeName=titan-[3-15] Gres=gpu:titanv:8 CPUs=40 Boards=1
SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000
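For reference, slurmd -C prints that information in exactly the form
slurm.conf wants, so it can be pasted in almost verbatim. On these nodes
the output looks something like the following (the RealMemory and UpTime
figures here are illustrative rather than the exact values; RealMemory was
rounded down in slurm.conf):
NodeName=titan-3 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
ThreadsPerCore=2 RealMemory=257832
UpTime=5-04:12:36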
But that wasn't the only issue. There was also this:
WARNING: A line in gres.conf for GRES gpu has 8 more configured than
expected in slurm.conf. Ignoring extra GRES.
It's calling this a warning, but
# scontrol show node titan-11 | grep Reason
revealed that this mismatch was causing the node to drain immediately after
being set to idle. The problem was this:
Gres=gpu:titanv:8
         ^
         |
For some reason this syntax was acceptable to Slurm 17, but not Slurm
19. The fix was
Gres=gpu:titanv:8 --> Gres=gpu:8
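(Presumably the typed syntax could be kept by also declaring the type in
gres.conf, i.e. something along the lines of
NodeName=titan-[3-15] Name=gpu Type=titanv File=/dev/nvidia[0-7]
where the device paths are an assumption I haven't tested. Dropping the
type from slurm.conf was the smallest change that restored the old
behavior.)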
Final correct NodeName syntax:
NodeName=titan-[3-15] Gres=gpu:8 CPUs=40 Boards=1 SocketsPerBoard=2
CoresPerSocket=10 ThreadsPerCore=2 RealMemory=250000
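One more note for anyone following along: a drained node stays drained even
after the configuration is corrected, so the new slurm.conf has to be picked
up (scontrol reconfigure, or restarting slurmctld/slurmd) and the nodes
resumed, e.g.
# scontrol update NodeName=titan-[3-15] State=RESUME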
Researching all this raised a number of questions, e.g. do I need to
express CPU affinity in gres.conf (see the sketch below), but at least
the users now have the functionality they enjoyed previously.
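For anyone curious about the affinity question, a gres.conf along these
lines is the direction I would try next; the device paths and core ranges
below are assumptions for illustration (the cores local to each GPU's
socket), not a tested configuration:
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[0-3] Cores=0-19
NodeName=titan-[3-15] Name=gpu File=/dev/nvidia[4-7] Cores=20-39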
On 8/24/23 11:16, Groner, Rob wrote:
> Ya, I agree about the invalid argument not being much help.
>
> In times past when I encountered issues like that, I typically tried:
>
> * restart slurmd on the compute node. Watch its log to see what it
> complains about. Usually it's about memory.
> * Set the configuration of the node to whatever slurmd -C says, or set
> config_overrides in slurm.conf
>
> Rob
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-use...@lists.schedmd.com> on behalf of
> Patrick Goetz <pgo...@math.utexas.edu>
> *Sent:* Thursday, August 24, 2023 11:27 AM
> *To:* Slurm User Community List <slurm...@lists.schedmd.com>
> *Subject:* [slurm-users] Nodes stay drained no matter what I do