[slurm-users] exempting a node from Gres Autodetect

Paul Brunk

Feb 19, 2021, 11:32:36 AM
to slurm...@lists.schedmd.com
Hi all:

(I hope plague and weather are being visibly less than maximally cruel
to you all.)

In short, I was trying to exempt a node from NVML Autodetect, and
apparently introduced a syntax error in gres.conf. This is not an
urgent matter for us now, but I'm curious what went wrong. Thanks for
lending any eyes to this!

More info:

Slurm 20.02.6, CentOS 7.

We've historically had only this in our gres.conf:
AutoDetect=nvml

Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its
NodeName entry (GPU models vary across them).
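
(For reference, a full NodeName line looks roughly like the one
below; the CPU and memory figures are illustrative, not our real
ones:)

NodeName=a1-10 Gres=gpu:V100:1 CPUs=32 RealMemory=192000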

I wanted to exempt one GPU node from the autodetect (I was curious
about the presence or absence of the GPU model subtype designation,
e.g. 'V100' vs. 'v100s'), so I changed gres.conf to this (modelled
after the 'gres.conf' man page):

AutoDetect=nvml
NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0

I restarted slurmctld, then ran "scontrol reconfigure". Each node
hit a fatal error parsing gres.conf, which broke RPC between
slurmctld and the nodes, so slurmctld considered the nodes failed.

Here's how it looked to slurmctld:

[2021-02-04T13:36:30.482] backfill: Started JobId=1469772_3(1473148) in batch on ra3-6
[2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6 RPC:REQUEST_PING : Communication connection failure
[2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure of node ra3-6

And to the slurmds:

[2021-02-04T15:14:50.730] Message aggregation disabled
[2021-02-04T15:14:50.742] error: Parsing error at unrecognized key: AutoDetect
[2021-02-04T15:14:50.742] error: Parse error in file /var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off Name=gpu File=/dev/nvidia0"
[2021-02-04T15:14:50.742] fatal: error opening/reading /var/lib/slurmd/conf-cache/gres.conf

Reverting to the original one-line gres.conf returned the cluster to production state.

--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center
Enterprise IT Svcs, the University of Georgia


Robert Kudyba

Feb 19, 2021, 3:45:44 PM
to Slurm User Community List
Have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7, fixed in 20.06.1

Prentice Bisbal

Feb 23, 2021, 3:35:31 PM
to slurm...@lists.schedmd.com

I don't see how that bug is related. That bug is about requiring the libnvidia-ml.so library for an RPM that was built with NVML Autodetect enabled. His problem is the opposite - he's already using NVML autodetect, but wants to disable that feature on a single node, where it looks like that node isn't using RPMs with NVML support.

Prentice

-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov

Prentice Bisbal

Feb 23, 2021, 3:39:30 PM
to slurm...@lists.schedmd.com

How many nodes are we talking about here? What if you gave each node its own gres.conf file, where all of them said

AutoDetect=nvml

Except the one you want to exclude, which would have this in its gres.conf:

NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0

It seems to me like a global AutoDetect=nvml and a per-node AutoDetect=off are mutually exclusive in the same gres.conf file, but maybe my suggestion would work. If you have a small number of GPU nodes, or use a configuration management tool like Ansible, Chef, or Puppet, it might be worth a shot; a rough sketch of the distribution step follows.
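
Something like this shell loop could push the per-node files out (the
node names other than a1-10, the /etc/slurm/gres.conf path, and
passwordless ssh are all assumptions here; adjust for your site):

# rough sketch: give every GPU node its own gres.conf
for n in a1-1 a1-2 a1-10; do
  if [ "$n" = "a1-10" ]; then
    # the one node excluded from autodetection
    echo 'NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0' | ssh "$n" 'cat > /etc/slurm/gres.conf'
  else
    echo 'AutoDetect=nvml' | ssh "$n" 'cat > /etc/slurm/gres.conf'
  fi
done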

Prentice

-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov

Prentice Bisbal

Feb 23, 2021, 3:47:11 PM
to slurm...@lists.schedmd.com

Correction/addendum: If the node you want to exclude has RPMs that were built without NVML autodetection, you probably want that gres.conf to look like this:

NodeName=a1-10 Name=gpu File=/dev/nvidia0

I'm guessing that if slurmd was built without autodetection, the AutoDetect=off option wouldn't be understood, or would be pointless.
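
One way to sanity-check a node after a change like this is slurmd's
-G flag, which prints the GRES configuration the daemon would use
(assuming your slurmd version supports -G; check its man page). Run
on the node itself:

slurmd -G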

I'm hardly an expert on GRES configuration, so just spitballing here...

Prentice

Paul Brunk

Mar 4, 2021, 6:10:57 PM
to Slurm User Community List
Hi all:

Prentice wrote:
> I don't see how that bug is related. That bug is about requiring the
> libnvidia-ml.so library for an RPM that was built with NVML
> Autodetect enabled. His problem is the opposite - he's already using
> NVML autodetect, but wants to disable that feature on a single node,
> where it looks like that node isn't using RPMs with NVML support.

Indeed, this was a PEBCAK problem: I'd been reading a version of the
'gres.conf' man page that documents per-node AutoDetect, a key our
20.02 slurmd doesn't recognize. I was not heeding the classic "read
the right fine version of the fine manual" (RTRFVOTFM?) advice.
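
For anyone finding this thread later: before editing gres.conf, a
quick way to check what your locally installed version actually
documents (just a sketch):

man gres.conf | grep -C2 AutoDetect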

Thanks all for your replies.

--
Jesting grimly,