[slurm-dev] CPU/GPU Affinity Not Working


Dave Sizer

Oct 26, 2017, 12:24:13 AM
to slurm-dev, Vipin Sirohi

Hi,

 

We are running slurm 17.02.7

 

For some reason, we are observing that the preferred CPUs defined in gres.conf for GPU devices are being ignored when running jobs.  That is, in our gres.conf we have gpu resource lines, such as:

 

Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23

 

and

 

Name=gpu Type=kepler File=/dev/nvidia4 CPUs=8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31

 

but when we run a job with the second GPU allocated, /sys/fs/cgroup/cpuset/slurm/…./cpuset.cpus reports that the job has been allocated CPUs from the first GPU’s set.  It seems as if the CPU/GPU affinity in gres.conf is being completely ignored, and slurmd.log doesn’t mention anything about it, even at maximum debug verbosity.

 

We have tried the following TaskPlugin settings: “task/affinity,task/cgroup” and just “task/cgroup”.  In both cases we have tried setting TaskPluginParam to “Cpuset”.  All of these configurations produced the same incorrect results. 

 

Is there some special configuration that is needed to get CPU/GPU affinity through gres.conf to work as described in the documentation?
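
For reference, I would also expect the cpuset confinement to depend on cgroup.conf; a minimal sketch of what I believe is needed there (our actual file may differ):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes

If something beyond this is required, please let me know.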

 

Thanks



Kilian Cavalotti

Oct 26, 2017, 1:40:16 PM
to slurm-dev, Vipin Sirohi

Hi Dave,

On Wed, Oct 25, 2017 at 9:23 PM, Dave Sizer <dsi...@nvidia.com> wrote:
> For some reason, we are observing that the preferred CPUs defined in
> gres.conf for GPU devices are being ignored when running jobs. That is, in
> our gres.conf we have gpu resource lines, such as:
>
> Name=gpu Type=kepler File=/dev/nvidia0
> CPUs=0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23
> Name=gpu Type=kepler File=/dev/nvidia4
> CPUs=8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31

In passing, you can use range notation for CPU indexes, and make it
more compact:

Name=gpu Type=kepler File=/dev/nvidia0 CPUs=[0-7,16-23]
Name=gpu Type=kepler File=/dev/nvidia4 CPUs=[8-15,24-31]

> but when we run a job with the second gpu allocated,
> /sys/fs/cgroup/cpuset/slurm/…./cpuset.cpus reports that the job has been
> allocated cpus from the first gpu’s set. It seems as if the CPU/GPU
> affinity in gres.conf is being completely ignored. Slurmd.log doesn’t seem
> to mention anything about it with maximum debug verbosity.

You can try to use DebugFlags=CPU_Bind,gres in your slurm.conf for more details.
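
For example (parameter names from memory, please double-check the slurm.conf man page):

-- 8< -----------------------------------------------------------------------
# slurm.conf
DebugFlags=CPU_Bind,Gres
SlurmdDebug=debug2
-- 8< -----------------------------------------------------------------------

then "scontrol reconfigure" and watch slurmd.log on the node while a test job starts.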

> We have tried the following TaskPlugin settings: “task/affinity,task/cgroup”
> and just “task/cgroup”. In both cases we have tried setting TaskPluginParam
> to “Cpuset”. All of these configurations produced the same incorrect
> results.

We use this:

SelectType=select/cons_res
SelectTypeParameters=CR_CORE_MEMORY
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

and for a 4-GPU node which has a gres.conf like this (don't ask, some
vendors like their CPU ids alternating between sockets):

NodeName=sh-114-03 name=gpu File=/dev/nvidia[0-1] CPUs=0,2,4,6,8,10,12,14,16,18
NodeName=sh-114-03 name=gpu File=/dev/nvidia[2-3] CPUs=1,3,5,7,9,11,13,15,17,19

we can submit 4 jobs using 1 GPU each, which end up getting a CPU id
that matches the allocated GPU:

$ sbatch --array=1-4 -p gpu -w sh-114-03 --gres=gpu:1 --wrap="sleep 100"
Submitted batch job 2669681

$ scontrol -dd show job 2669681 | grep CPU_ID | sort
Nodes=sh-114-03 CPU_IDs=0 Mem=12800 GRES_IDX=gpu(IDX:0)
Nodes=sh-114-03 CPU_IDs=1 Mem=12800 GRES_IDX=gpu(IDX:2)
Nodes=sh-114-03 CPU_IDs=2 Mem=12800 GRES_IDX=gpu(IDX:1)
Nodes=sh-114-03 CPU_IDs=3 Mem=12800 GRES_IDX=gpu(IDX:3)

How do you check which GPU your job has been allocated?
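
For instance, do you look at something like this from inside the job (illustrative commands only)?

-- 8< -----------------------------------------------------------------------
$ srun --gres=gpu:1 --pty bash
$ echo $CUDA_VISIBLE_DEVICES
$ nvidia-smi -L
-- 8< -----------------------------------------------------------------------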

Cheers,
--
Kilian

Dave Sizer

Oct 26, 2017, 9:45:01 PM
to slurm-dev, Vipin Sirohi
Thanks for the tips, Kilian, this really pointed me in the right direction.

It turns out the issue was that the CPU IDs we were using in gres.conf were the ones our system reports, when they really needed to be in the platform-agnostic format (CPU_ID = Board_ID x threads_per_board + Socket_ID x threads_per_socket + Core_ID x threads_per_core + Thread_ID, from the gres.conf docs).
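
For example, on a hypothetical single-board node with two sockets, 8 cores per socket and 2 threads per core (so threads_per_board=32, threads_per_socket=16, threads_per_core=2), socket 1 / core 0 / thread 0 gets CPU_ID = 0 x 32 + 1 x 16 + 0 x 2 + 0 = 16, regardless of how the kernel happens to number that hardware thread.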

Michael Di Domenico

Oct 27, 2017, 7:45:35 AM
to slurm-dev

On Thu, Oct 26, 2017 at 1:39 PM, Kilian Cavalotti
<kilian.cav...@gmail.com> wrote:
> and for a 4-GPU node which has a gres.conf like this (don't ask, some
> vendors like their CPU ids alternating between sockets):
>
> NodeName=sh-114-03 name=gpu File=/dev/nvidia[0-1]
> CPUs=0,2,4,6,8,10,12,14,16,18
> NodeName=sh-114-03 name=gpu File=/dev/nvidia[2-3]
> CPUs=1,3,5,7,9,11,13,15,17,19

as an aside, is there some tool which provides the optimal mapping of
CPU id's to GPU cards?

Kilian Cavalotti

Oct 27, 2017, 11:14:24 AM
to slurm-dev

Hi Michael,

On Fri, Oct 27, 2017 at 4:44 AM, Michael Di Domenico
<mdidom...@gmail.com> wrote:
> as an aside, is there some tool which provides the optimal mapping of
> CPU id's to GPU cards?

We use nvidia-smi:

-- 8< -----------------------------------------------------------------------------------------
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV1     NV1     NV2     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU1    NV1      X      NV2     NV1     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU2    NV1     NV2      X      NV1     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU3    NV2     NV1     NV1      X      PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
mlx5_0  PHB     PHB     PHB     PHB      X

Legend:

X   = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
-- 8< -----------------------------------------------------------------------------------------

and hwloc (https://www.open-mpi.org/projects/hwloc/):
-- 8< -----------------------------------------------------------------------------------------
# hwloc-ls --ignore misc
Machine (256GB total)
NUMANode L#0 (P#0 128GB)
Package L#0 + L3 L#0 (25MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
HostBridge L#0
PCIBridge
PCIBridge
PCIBridge
PCI 10de:1b02
GPU L#0 "card1"
GPU L#1 "renderD128"
PCIBridge
PCI 10de:1b02
GPU L#2 "card2"
GPU L#3 "renderD129"
PCIBridge
PCIBridge
PCIBridge
PCI 10de:1b02
GPU L#4 "card3"
GPU L#5 "renderD130"
PCIBridge
PCI 10de:1b02
GPU L#6 "card4"
GPU L#7 "renderD131"
PCI 8086:8d62
Block(Disk) L#8 "sda"
PCIBridge
PCIBridge
PCI 1a03:2000
GPU L#9 "card0"
GPU L#10 "controlD64"
PCI 8086:8d02
NUMANode L#1 (P#1 128GB)
Package L#1 + L3 L#1 (25MB)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
HostBridge L#11
PCIBridge
PCI 8086:1521
Net L#11 "enp129s0f0"
PCI 8086:1521
Net L#12 "enp129s0f1"
PCIBridge
PCI 15b3:1013
Net L#13 "ib0"
OpenFabrics L#14 "mlx5_0"
PCIBridge
PCIBridge
PCIBridge
PCI 10de:1b02
GPU L#15 "card5"
GPU L#16 "renderD132"
PCIBridge
PCI 10de:1b02
GPU L#17 "card6"
GPU L#18 "renderD133"
PCIBridge
PCIBridge
PCIBridge
PCI 10de:1b02
GPU L#19 "card7"
GPU L#20 "renderD134"
PCIBridge
PCI 10de:1b02
GPU L#21 "card8"
GPU L#22 "renderD135"
-- 8< -----------------------------------------------------------------------------------------

Both will show which CPU ids are associated to which GPUs.
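
Note that hwloc prints both numberings: the L# values are hwloc's logical indexes, and the P# values in parentheses are the OS ids, which is also what nvidia-smi reports. If the two differ on your hardware, hwloc-calc should be able to translate between them; from memory (please check the man page), something like this prints the logical index of the PU the OS calls 16:

-- 8< -----------------------------------------------------------------------
$ hwloc-calc --physical-input --logical-output --intersect PU pu:16
-- 8< -----------------------------------------------------------------------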

Cheers,
--
Kilian

Dave Sizer

Oct 27, 2017, 3:45:56 PM
to slurm-dev
Also, supposedly adding the "--accel-bind=g" option to srun will enforce this CPU/GPU affinity, though we are observing that it is broken and causes jobs to hang.

Can anyone confirm this?

-----Original Message-----
From: Kilian Cavalotti [mailto:kilian.cav...@gmail.com]
Sent: Friday, October 27, 2017 8:13 AM
To: slurm-dev <slur...@schedmd.com>
Subject: [slurm-dev] Re: CPU/GPU Affinity Not Working



Kilian Cavalotti

Oct 27, 2017, 5:44:37 PM
to slurm-dev

On Fri, Oct 27, 2017 at 12:45 PM, Dave Sizer <dsi...@nvidia.com> wrote:
> Also, supposedly adding the "--accel-bind=g" option to srun will do this, though we are observing that this is broken and causes jobs to hang.
>
> Can anyone confirm this?

Not really, it doesn't seem to be hanging for us:

-- 8< -----------------------------------------------------------------------
$ srun --gres=gpu:1 --accel-bind=g --pty bash
srun: job 2682093 queued and waiting for resources
srun: job 2682093 has been allocated resources
[kilian@sh-113-01 ~]$
[kilian@sh-113-01 ~]$ nvidia-smi topo -m
GPU0 mlx5_0 CPU Affinity
GPU0 X PHB 10-10
mlx5_0 PHB X
[kilian@sh-113-01 ~]$
-- 8< -----------------------------------------------------------------------

How do you submit your job? You can try with "srun -vvv" to display
some more information about the submission process.

Cheers,
--
Kilian

Dave Sizer

Oct 27, 2017, 6:58:14 PM
to slurm-dev, Vipin Sirohi
Kilian, when you specify your CPU bindings in gres.conf, are you using the same IDs that show up in nvidia-smi?

We noticed that our CPU IDs were being remapped from their nvidia-smi (OS) values by SLURM according to hwloc's logical numbering, so to get affinity working we needed to use these remapped values in gres.conf.

I'm wondering if --accel-bind=g is not using these same remappings, because when our jobs hang with the option, slurmd.log reports "fatal: Invalid gres data for gpu, CPUs=16-31". But when we omit the option, we get no such error and everything seems to work fine, including GPU affinity.

Thanks
Dave

-----Original Message-----
From: Kilian Cavalotti [mailto:kilian.cav...@gmail.com]
Sent: Friday, October 27, 2017 2:44 PM
To: slurm-dev <slur...@schedmd.com>
Subject: [slurm-dev] Re: CPU/GPU Affinity Not Working


Kilian Cavalotti

Oct 30, 2017, 12:53:21 PM
to slurm-dev, Vipin Sirohi

Hi Dave,

On Fri, Oct 27, 2017 at 3:57 PM, Dave Sizer <dsi...@nvidia.com> wrote:
> Kilian, when you specify your CPU bindings in gres.conf, are you using the same IDs that show up in nvidia-smi?

Yes:

$ srun -p gpu -c 4 --gres gpu:1 --pty bash

sh-114-01 $ cat /etc/slurm/gres.conf
name=gpu File=/dev/nvidia[0-1] CPUs=0,2,4,6,8,10,12,14,16,18
name=gpu File=/dev/nvidia[2-3] CPUs=1,3,5,7,9,11,13,15,17,19

sh-114-01 $ nvidia-smi topo -m
GPU0 mlx5_0 CPU Affinity
GPU0 X PHB 0-0,4-4,8-8,12-12
mlx5_0 PHB X

> We noticed that our CPU IDs were being remapped from their nvidia-smi values by SLURM according to hwloc, so to get affinity working we needed to use these remapped values.

I don't think there's any remapping happening. Both Slurm (through
hwloc) and nvidia-smi get the CPU IDs from the kernel, which takes
them from the DMI pages and the BIOS. So they should all match, as
they're all coming from the same source.
Could you please elaborate on what makes you think the CPU ids are
remapped somehow?

> I'm wondering if --accel-bind=g is not using these same remappings, because when our jobs hang with the option, slurmd.log reports "fatal: Invalid gres data for gpu, CPUs=16-31".
> But when we omit the option, we get no such error and everything seems to work fine, including GPU affinity.

We don't see such a hang, nor any similar error in slurmd.log, with or
without --accel-bind=g. Do you have hyperthreading enabled by any
chance? Are you positive you have all 32 CPUs available on that node?
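
Something like this should show what Slurm actually detects on that node (node name is just a placeholder):

-- 8< -----------------------------------------------------------------------
# on the compute node: the hardware layout slurmd sees
$ slurmd -C
# and what the controller currently believes
$ scontrol show node node01 | grep -i cpu
-- 8< -----------------------------------------------------------------------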

Cheers,
--
Kilian

zhangta...@126.com

Oct 31, 2017, 4:35:06 AM
to slurm-dev

Dear slurm developers,
   I have noticed that slurm v17.11 will support federated clusters, but I can't find detailed documentation about it.
   Now, I have two questions about federated clusters:
   (1) When configuring a federated cluster, should I configure the two slurmctld daemons to communicate with the same slurmdbd (or make each cluster's slurmctld/slurmdbd work with the same MySQL database)?
   (2) I have built a one-node slurm cluster (v17.11), added a federation with sacctmgr, and added "FederationParameters=fed_display" to slurm.conf. After submitting a job with sbatch, squeue cannot display the federation info. How can I resolve this problem?

   Can anyone help me? Thanks for your help!

   Best regards.


Ole Holm Nielsen

Oct 31, 2017, 7:08:08 AM
to slurm-dev

On 10/31/2017 09:34 AM, zhangta...@126.com wrote:
> I have noticed that slurm v17.11 will support federated clusters, but I
> can't find detailed documentation about it.
> Now, I have two questions about federated clusters:
> (1) When configuring a federated cluster, should I configure the two
> slurmctld daemons to communicate with the same slurmdbd (or make each
> cluster's slurmctld/slurmdbd work with the same MySQL database)?

Federation support was described at the Slurm User Group Meeting last
month. PDFs of the presentations are online at
http://slurm.schedmd.com/publications.html
See the talk: Technical: Federated Cluster Support, Brian Christiansen
and Danny Auble, SchedMD.

Maybe this will help you?

/Ole

zhangta...@126.com

Oct 31, 2017, 1:36:50 PM
to slurm-dev
Thank you very much, Ole
I have read the PDF document, but I'm not sure about the configuration.
I guess the two slurmctld daemons should be configured to use the same slurmdbd.
Is that right, or what is the recommended way?
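That is, something like this is what I have in mind (just my guess from the docs, not tested; host and cluster names are made up):

In both clusters' slurm.conf:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd.example.com

Then create the federation once with sacctmgr:
sacctmgr add federation myfed clusters=cluster1,cluster2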
Thanks, regards


Ole Holm Nielsen

Nov 1, 2017, 3:35:23 AM
to slurm-dev

I'm pretty sure that a single, central slurmdbd service is required for
multiple, federated clusters. I think that's what ties multiple
clusters together into a single "federation".

You mention a problem with squeue, but you don't list the error
messages. Are you sure that all nodes have identical slurm.conf, and
that daemons have been restarted after changes? You may want to consult
my Slurm Wiki at https://wiki.fysik.dtu.dk/niflheim/SLURM for
configuration details.

Caveat: I just heard the talk at the SLUG conference, but I have no
intention of working with federated clusters myself. So I can't help
you. Commercial support from SchedMD is recommended, see
https://www.schedmd.com/services.php

/Ole

On 10/31/2017 06:36 PM, zhangta...@126.com wrote:
> Thank you very much, Ole
> I have read the PDF document, but I'm not sure about the configuration.
> I guess the two slurmctld daemons should be configured to use the same slurmdbd.
> Is that right, or what is the recommended way?
> Thanks, regards
>
> ------------------------------------------------------------------------
> zhangta...@126.com
>
> From: Ole Holm Nielsen <mailto:Ole.H....@fysik.dtu.dk>
> Date: 2017-10-31 19:08
> To: slurm-dev <mailto:slur...@schedmd.com>
> Subject: [slurm-dev] Re: question about federation

zhangta...@126.com

Nov 1, 2017, 11:39:23 PM
to slurm-dev
Hi,
I'll try to test it again.
Thank you for your help, Ole
Best regards.


