[slurm-users] Strange error, submission denied


Marcus Wagner

Feb 13, 2019, 7:48:34 AM
to Slurm User Community List
Hi all,

I have a strange behaviour here.
We are using slurm 18.08.5-2 on CentOS 7.6.

Let me first describe our computenodes:
NodeName=ncm[0001-1032] CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=185000 Feature=skx8160,hostok,hpcwork Weight=10541 State=UNKNOWN

we have the following config set:

$>scontrol show config | grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE


So, I have 48 cores on one node. According to the manpage of sbatch, I
should be able to do the following:

#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=48

But I get the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
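
For completeness, a minimal reproducer looks roughly like this (the shebang and the hostname payload are just placeholders, any script shows the same behaviour):

#!/bin/bash
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=48
# the payload is irrelevant, hostname is enough to trigger the rejection
hostname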


Does anyone have an explanation for this?


Best
Marcus

--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de


Marcus Wagner

Feb 14, 2019, 12:20:38 AM
to slurm...@lists.schedmd.com
Hi all,

I have narrowed this down a little bit.

The really astonishing thing is that if I use

--ntasks=48

I can submit the job, and it gets scheduled onto one host:

   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182400M,node=1,billing=48

but as soon as I change --ntasks to --ntasks-per-node (which should amount to the same thing, since --ntasks=48 already lands on one host), I get the error:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
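
In command form (the --wrap payload is just a placeholder), that is:

$> sbatch --ntasks=48 --wrap hostname            # accepted, lands on one node
$> sbatch --ntasks-per-node=48 --wrap hostname   # rejected with the error above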


Does no one else observe this behaviour?
Any explanations?


Best
Marcus

Henkel, Andreas

Feb 14, 2019, 1:10:46 AM
to Slurm User Community List
Hi Marcus,

What just came to my mind: if you don't set --ntasks, isn't the default just 1? All the examples I know that use --ntasks-per-node also set --ntasks, with ntasks >= ntasks-per-node.
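
That is, something along the lines of:

#SBATCH --nodes=2
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=48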

Best,
Andreas

Chris Samuel

Feb 14, 2019, 1:15:55 AM
to slurm...@lists.schedmd.com
On Wednesday, 13 February 2019 4:48:05 AM PST Marcus Wagner wrote:

> #SBATCH --ntasks-per-node=48

I wouldn't mind betting that if you set that to 24 it will work, and each
task will be assigned a single core with the 2 thread units on it.
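
In other words, something like (untested on your setup, of course):

$> sbatch --ntasks-per-node=24 --wrap hostname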

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA




Marcus Wagner

Feb 14, 2019, 2:14:28 AM
to slurm...@lists.schedmd.com
Hi Andreas,

I get the same result if I set --ntasks-per-node=48 and --ntasks=48, or
96, or whatever.

What we want to achieve is that exactly ntasks-per-node tasks get scheduled onto one host.


Best
Marcus

Marcus Wagner

Feb 14, 2019, 2:27:24 AM
to slurm...@lists.schedmd.com
Hi Chris,


these are 96-thread nodes with 48 cores. You are right that if we set it to 24, the job gets scheduled. But then only half of the node is used. On the other hand, if I only use --ntasks=48, Slurm schedules all tasks onto the same node. The hyperthread of each core is included in the cgroup, and the task/affinity plugin also correctly binds the hyperthread together with the core (output of a small, ugly test script of ours; the last two numbers are the core and its hyperthread):

ncm0728.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2 +pemap 0,48
ncm0728.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2 +pemap 26,74
ncm0728.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2 +pemap 29,77
ncm0728.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE: <#> unlimited+p2 +pemap 6,54
ncm0728.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE: <#> unlimited+p2 +pemap 9,57
ncm0728.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE: <#> unlimited+p2 +pemap 30,78
ncm0728.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE: <#> unlimited+p2 +pemap 33,81
ncm0728.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE: <#> unlimited+p2 +pemap 7,55
ncm0728.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE: <#> unlimited+p2 +pemap 10,58
ncm0728.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE: <#> unlimited+p2 +pemap 31,79
ncm0728.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE: <#> unlimited+p2 +pemap 34,82
ncm0728.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2 +pemap 3,51
ncm0728.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE: <#> unlimited+p2 +pemap 8,56
ncm0728.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE: <#> unlimited+p2 +pemap 11,59
ncm0728.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE: <#> unlimited+p2 +pemap 32,80
ncm0728.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE: <#> unlimited+p2 +pemap 35,83
ncm0728.hpc.itc.rwth-aachen.de <24> OMP_STACKSIZE: <#> unlimited+p2 +pemap 12,60
ncm0728.hpc.itc.rwth-aachen.de <25> OMP_STACKSIZE: <#> unlimited+p2 +pemap 15,63
ncm0728.hpc.itc.rwth-aachen.de <26> OMP_STACKSIZE: <#> unlimited+p2 +pemap 36,84
ncm0728.hpc.itc.rwth-aachen.de <27> OMP_STACKSIZE: <#> unlimited+p2 +pemap 39,87
ncm0728.hpc.itc.rwth-aachen.de <28> OMP_STACKSIZE: <#> unlimited+p2 +pemap 13,61
ncm0728.hpc.itc.rwth-aachen.de <29> OMP_STACKSIZE: <#> unlimited+p2 +pemap 16,64
ncm0728.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2 +pemap 24,72
ncm0728.hpc.itc.rwth-aachen.de <30> OMP_STACKSIZE: <#> unlimited+p2 +pemap 37,85
ncm0728.hpc.itc.rwth-aachen.de <31> OMP_STACKSIZE: <#> unlimited+p2 +pemap 40,88
ncm0728.hpc.itc.rwth-aachen.de <32> OMP_STACKSIZE: <#> unlimited+p2 +pemap 14,62
ncm0728.hpc.itc.rwth-aachen.de <33> OMP_STACKSIZE: <#> unlimited+p2 +pemap 17,65
ncm0728.hpc.itc.rwth-aachen.de <34> OMP_STACKSIZE: <#> unlimited+p2 +pemap 38,86
ncm0728.hpc.itc.rwth-aachen.de <35> OMP_STACKSIZE: <#> unlimited+p2 +pemap 41,89
ncm0728.hpc.itc.rwth-aachen.de <36> OMP_STACKSIZE: <#> unlimited+p2 +pemap 18,66
ncm0728.hpc.itc.rwth-aachen.de <37> OMP_STACKSIZE: <#> unlimited+p2 +pemap 21,69
ncm0728.hpc.itc.rwth-aachen.de <38> OMP_STACKSIZE: <#> unlimited+p2 +pemap 42,90
ncm0728.hpc.itc.rwth-aachen.de <39> OMP_STACKSIZE: <#> unlimited+p2 +pemap 45,93
ncm0728.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2 +pemap 27,75
ncm0728.hpc.itc.rwth-aachen.de <40> OMP_STACKSIZE: <#> unlimited+p2 +pemap 19,67
ncm0728.hpc.itc.rwth-aachen.de <41> OMP_STACKSIZE: <#> unlimited+p2 +pemap 22,70
ncm0728.hpc.itc.rwth-aachen.de <42> OMP_STACKSIZE: <#> unlimited+p2 +pemap 43,91
ncm0728.hpc.itc.rwth-aachen.de <43> OMP_STACKSIZE: <#> unlimited+p2 +pemap 46,94
ncm0728.hpc.itc.rwth-aachen.de <44> OMP_STACKSIZE: <#> unlimited+p2 +pemap 20,68
ncm0728.hpc.itc.rwth-aachen.de <45> OMP_STACKSIZE: <#> unlimited+p2 +pemap 23,71
ncm0728.hpc.itc.rwth-aachen.de <46> OMP_STACKSIZE: <#> unlimited+p2 +pemap 44,92
ncm0728.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE: <#> unlimited+p2 +pemap 47,95
ncm0728.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2 +pemap 1,49
ncm0728.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2 +pemap 4,52
ncm0728.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2 +pemap 25,73
ncm0728.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2 +pemap 28,76
ncm0728.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2 +pemap 2,50
ncm0728.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2 +pemap 5,53
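
(For reference, a hypothetical stand-in for that test script could look like the sketch below; the real one formats its output differently, but the idea, printing the host, the task id and the CPU set the task is pinned to, is the same:)

#!/bin/bash
# print hostname, Slurm task id and the CPU list this task is bound to
echo "$(hostname) <${SLURM_PROCID}> pemap $(taskset -cp $$ | awk '{print $NF}')"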


--ntasks=48:

   NodeList=ncm0728
   BatchHost=ncm0728
   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182400M,node=1,billing=48


--ntasks=48
--ntasks-per-node=24:

   NodeList=ncm[0438-0439]
   BatchHost=ncm0438
   NumNodes=2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182400M,node=2,billing=48


--ntasks=48
--ntasks-per-node=48:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available


Isn't the first essentially the same as the last, with the difference that I want to force Slurm to put all tasks onto one node?



Best
Marcus


On 2/14/19 7:15 AM, Chris Samuel wrote:
> On Wednesday, 13 February 2019 4:48:05 AM PST Marcus Wagner wrote:
>
>> #SBATCH --ntasks-per-node=48
> I wouldn't mind betting that if you set that to 24 it will work, and each
> task will be assigned a single core with the 2 thread units on it.
>
> All the best,
> Chris

--

Henkel, Andreas

Feb 14, 2019, 2:57:22 AM
to Slurm User Community List
Hi Marcus,

More ideas:
CPUs doesn't always mean cores; it may also mean threads, which makes a difference.
Maybe the behaviour of CR_ONE_TASK_PER_CORE is still not solid nor properly documented, and ntasks and ntasks-per-node are honoured differently internally. If so, solely using ntasks could mean Slurm counts all threads, even if the binding itself looks correct.
Obviously, in your results Slurm handles the two options differently.

Have you tried configuring the node with CPUs=96? What output do you get from slurmd -C?
Is this a new architecture like Skylake? In the case of sub-NUMA layouts, Slurm cannot handle it without hwloc 2.
Have you tried using srun -v(vv) instead of sbatch? Maybe you can get a glimpse of what Slurm actually does with your options.

Best,
Andreas

Marcus Wagner

Feb 14, 2019, 3:23:20 AM
to slurm...@lists.schedmd.com
Hi Andreas,



On 2/14/19 8:56 AM, Henkel, Andreas wrote:
> Hi Marcus,
>
> More ideas:
> CPUs doesn't always mean cores; it may also mean threads, which makes a difference.
> Maybe the behaviour of CR_ONE_TASK_PER_CORE is still not solid nor properly documented, and ntasks and ntasks-per-node are honoured differently internally. If so, solely using ntasks could mean Slurm counts all threads, even if the binding itself looks correct.
> Obviously, in your results Slurm handles the two options differently.
>
> Have you tried configuring the node with CPUs=96? What output do you get from slurmd -C?
Not yet, as this is not the desired behaviour. We want to schedule by
cores. But I will try that. slurmd -C output is the following:

NodeName=ncm0708 slurmd: Considering each NUMA node as a socket
CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905
UpTime=6-21:30:02

> Is this a new architecture like Skylake? In the case of sub-NUMA layouts, Slurm cannot handle it without hwloc 2.
Yes, we have Skylake, and as you can see in the above output, we have sub-NUMA clustering enabled. Still, we only use the hwloc coming with CentOS 7: hwloc-1.11.8-4.el7.x86_64.
Where did you get the information that hwloc 2 is needed?
> Have you tried using srun -v(vv) instead of sbatch? Maybe you can get a glimpse of what Slurm actually does with your options.
The only strange thing I can observe is the following:
srun: threads        : 60

What threads is srun talking about there?
Nonetheless, here the full output:

$> srun --ntasks=48 --ntasks-per-node=48 -vvv hostname
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user           : `mw445520'
srun: uid            : 40574
srun: gid            : 40574
srun: cwd            : /rwthfs/rz/cluster/home/mw445520/tests/slurm/cgroup
srun: ntasks         : 48 (set)
srun: nodes          : 1 (default)
srun: jobid          : 4294967294 (default)
srun: partition      : default
srun: profile        : `NotSet'
srun: job name       : `hostname'
srun: reservation    : `(null)'
srun: burst_buffer   : `(null)'
srun: wckey          : `(null)'
srun: cpu_freq_min   : 4294967294
srun: cpu_freq_max   : 4294967294
srun: cpu_freq_gov   : 4294967294
srun: switches       : -1
srun: wait-for-switches : -1
srun: distribution   : unknown
srun: cpu-bind       : default (0)
srun: mem-bind       : default (0)
srun: verbose        : 3
srun: slurmd_debug   : 0
srun: immediate      : false
srun: label output   : false
srun: unbuffered IO  : false
srun: overcommit     : false
srun: threads        : 60
srun: checkpoint_dir : /w0/slurm/checkpoint
srun: wait           : 0
srun: nice           : -2
srun: account        : (null)
srun: comment        : (null)
srun: dependency     : (null)
srun: exclusive      : false
srun: bcast          : false
srun: qos            : (null)
srun: constraints    :
srun: reboot         : yes
srun: preserve_env   : false
srun: network        : (null)
srun: propagate      : NONE
srun: prolog         : (null)
srun: epilog         : (null)
srun: mail_type      : NONE
srun: mail_user      : (null)
srun: task_prolog    : (null)
srun: task_epilog    : (null)
srun: multi_prog     : no
srun: sockets-per-node  : -2
srun: cores-per-socket  : -2
srun: threads-per-core  : -2
srun: ntasks-per-node   : 48
srun: ntasks-per-socket : -2
srun: ntasks-per-core   : -2
srun: plane_size        : 4294967294
srun: core-spec         : NA
srun: power             :
srun: cpus-per-gpu      : 0
srun: gpus              : (null)
srun: gpu-bind          : (null)
srun: gpu-freq          : (null)
srun: gpus-per-node     : (null)
srun: gpus-per-socket   : (null)
srun: gpus-per-task     : (null)
srun: mem-per-gpu       : 0
srun: remote command    : `hostname'
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0007
srun: debug2: srun PMI messages to port=34521
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 35465
srun: debug:  Entering _msg_thr_internal
srun: debug:  Munge authentication plugin loaded
srun: error: CPU count per node can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available



Best
Marcus

Henkel, Andreas

Feb 14, 2019, 5:28:25 AM
to Slurm User Community List
Hi Marcus,

We have Skylake too, and it didn't work for us. We used cgroups only, and process binding went completely haywire with sub-NUMA enabled.
While searching for solutions I found that hwloc supports sub-NUMA only from version 2 on (when looking for Skylake in hwloc you will get hits in the version 2 branches only). At least hwloc 2.x made NUMA blocks children objects, whereas hwloc 1.x has NUMA blocks as parents only. I think that was the reason why there was a special branch in hwloc for handling the sub-NUMA layouts of Xeon Phi.
But I'll be happy if you prove me wrong.

Best,
Andreas

Marcus Wagner

Feb 14, 2019, 6:55:07 AM
to slurm...@lists.schedmd.com
Hi Andreas,


as slurmd -C shows, it detects 4 NUMA nodes and takes these as sockets. This is also the way we configured Slurm.

numactl -H clearly shows the four domains and which domain belongs to which socket:

node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  21  10  11
  3:  21  21  11  10


This is fairly the same with hwloc:

$> hwloc-distances
Relative latency matrix between 4 NUMANodes (depth 3) by logical indexes
(below Machine L#0):
  index     0     1     2     3
      0 1.000 1.100 2.100 2.100
      1 1.100 1.000 2.100 2.100
      2 2.100 2.100 1.000 1.100
      3 2.100 2.100 1.100 1.000

We use the task/affinity plugin together with task/cgroup, but in cgroup.conf we set the affinity handling to off, so that the task/affinity plugin does the binding (the relevant config lines are sketched further below). With Slurm configured that way, we also see a round robin over the NUMA nodes by default (12 tasks on a 48-core machine):

ncm0071.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2 +pemap 0,48
ncm0071.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2 +pemap 3,51
ncm0071.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2 +pemap 24,72
ncm0071.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2 +pemap 27,75
ncm0071.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2 +pemap 1,49
ncm0071.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2 +pemap 4,52
ncm0071.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2 +pemap 25,73
ncm0071.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2 +pemap 28,76
ncm0071.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2 +pemap 2,50
ncm0071.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2 +pemap 5,53
ncm0071.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2 +pemap 26,74
ncm0071.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2 +pemap 29,77


Using #SBATCH -m block:block results in all tasks on one NUMA node:

ncm0071.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2 +pemap 0,48
ncm0071.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2 +pemap 1,49
ncm0071.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2 +pemap 2,50
ncm0071.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2 +pemap 6,54
ncm0071.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2 +pemap 7,55
ncm0071.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2 +pemap 8,56
ncm0071.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2 +pemap 12,60
ncm0071.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2 +pemap 13,61
ncm0071.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2 +pemap 14,62
ncm0071.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2 +pemap 18,66
ncm0071.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2 +pemap 19,67
ncm0071.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2 +pemap 20,68


Isn't that exactly what would be needed, or am I missing something? What would be "better" with hwloc 2?


Besides my original problem, we are fairly happy with slurm so far, but
that one gives me grey hair :/


Best
Marcus

Andreas Henkel

Feb 14, 2019, 8:37:42 AM
to slurm...@lists.schedmd.com

Hi Marcus,

for us, slurmd -C as well as numactl -H looked fine, too. But we're using task/cgroup only, and every job starting on a Skylake node gave us

error("task/cgroup: task[%u] infinite loop broken while trying "
		    "to provision compute elements using %s (bitmap:%s)",

from src/plugins/task/cgroup/task_cgroup_cpuset.c and the process placement was wrong.

Once we deactivated subnuma everythings running fine.

But for completeness: I tested that on Slurm 17 (and maybe the core was partly 16 at that time). We're using Slurm 17.11.13, and I'll check the behaviour there in the coming days.
I'm hesitant to switch to 18 because of the latest bugs that appeared with every minor release.

Best,

Andreas

Marcus Wagner

Feb 14, 2019, 8:52:48 AM
to slurm...@lists.schedmd.com
Hi Andreas,


It might be that this is one of the bugs in Slurm 18.

I think I will open a bug report and see what they say.


Thank you very much, nonetheless.


Best
Marcus

Christopher Samuel

Feb 14, 2019, 11:35:38 AM
to slurm...@lists.schedmd.com
On 2/14/19 12:22 AM, Marcus Wagner wrote:

> CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2
> RealMemory=191905

That's different to what you put in your config in the original email
though. There you had:

CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2

This config tells Slurm there are just 24 cores for a total of 48
threads. Try updating your config with what slurmd detected and see if
that helps.
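
In other words, something like (adjusted from your original line, untested):

NodeName=ncm[0001-1032] CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=185000 Feature=skx8160,hostok,hpcwork Weight=10541 State=UNKNOWN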

All the best,
Chris

Marcus Wagner

Feb 15, 2019, 12:26:49 AM
to slurm...@lists.schedmd.com
Hi Chris,

that can't be right, or there is some bug elsewhere:

We have configured CR_ONE_TASK_PER_CORE, so a core and its hyperthread will never be shared by two tasks.
According to your theory, I configured 24 cores, i.e. 48 threads. But then using just --ntasks=48 would give me two nodes, right?

But Slurm schedules these 48 tasks onto one node:

   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182400M,node=1,billing=48

Here you can also see that CPUs/Task=1, so the tasks really are scheduled one per core. Essentially, --ntasks=48 --ntasks-per-node=48 should do the same.
Obviously they don't, because in this case the submission gets denied.
Nonetheless, you can see in the cgroups and in the binding done by the task/affinity plugin that every task not only gets a core, but also its hyperthread.
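
(This can be checked directly on the node; the exact cgroup path depends on the cgroup setup, so take this as a sketch:)

$> cat /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus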

I think I'll have to file a bug at SchedMD.


Best
Marcus

Marcus Wagner

Feb 15, 2019, 1:16:00 AM
to slurm...@lists.schedmd.com
I have filed a bug:

https://bugs.schedmd.com/show_bug.cgi?id=6522


Let's see what SchedMD has to tell us ;)


Best
Marcus

On 2/15/19 6:25 AM, Marcus Wagner wrote:
> NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=48,mem=182400M,node=1,billing=48

Andreas Henkel

Feb 18, 2019, 12:03:14 AM
to slurm...@lists.schedmd.com
Not the answer you hoped for there I guess...

Marcus Wagner

Feb 18, 2019, 1:10:22 AM
to slurm...@lists.schedmd.com
No, but that was expected ;)

Thanks nonetheless.


Best
Marcus

Prentice Bisbal

Feb 19, 2019, 8:59:48 AM
to slurm...@lists.schedmd.com

--ntasks-per-node is meant to be used in conjunction with --nodes option. From https://slurm.schedmd.com/sbatch.html:

--ntasks-per-node=<ntasks>
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option...

If you don't specify --ntasks, it defaults to --ntasks=1, as Andreas said. https://slurm.schedmd.com/sbatch.html:

-n, --ntasks=<number>
sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.

So the correct way to specify your job is either like this

--ntasks=48

or

--nodes=1 --ntasks-per-node=48

Specifying both --ntasks-per-node and --ntasks at the same time is not correct.
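
In batch-script form, that would be one of the following (sketch):

#SBATCH --ntasks=48

or

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48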


Prentice 

Marcus Wagner

Feb 20, 2019, 12:09:42 AM
to slurm...@lists.schedmd.com
Hi Prentice,



On 2/19/19 2:58 PM, Prentice Bisbal wrote:

--ntasks-per-node is meant to be used in conjunction with --nodes option. From https://slurm.schedmd.com/sbatch.html:

--ntasks-per-node=<ntasks>
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option...
Yes, but used together with --ntasks it would mean to use e.g. at most 48 tasks per node. I don't see where the difference lies regarding submission of the job. Even if the semantics (how many cores get scheduled onto how many hosts) might be incorrect, at least the syntax should be accepted.

If you don't specify --ntasks, it defaults to --ntasks=1, as Andreas said. https://slurm.schedmd.com/sbatch.html:

-n, --ntasks=<number>
sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.

So the correct way to specify your job is either like this

--ntasks=48

or

--nodes=1 --ntasks-per-node=48

Specifying both --ntasks-per-node and --ntasks at the same time is not correct.

Funnily enough, the result is the same:

$> sbatch -N 1 --ntasks-per-node=48 --wrap hostname

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

whereas just using --ntasks=48 gets submitted and it gets scheduled onto one host:

$> sbatch --ntasks=48 --wrap hostname
sbatch: [I] No output file given, set to: output_%j.txt
sbatch: [I] No runtime limit given, set to: 15 minutes
Submitted batch job 199784
$> scontrol show job 199784 | egrep "NumNodes|TRES"

   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182400M,node=1,billing=48

To me, this still looks like a bug, not like wrong usage of the submission parameters.

Does no one else use nodes in this shared way?
If nodes are shared, do you schedule by hardware threads or by cores?
If you schedule by cores, how did you implement this in Slurm?


Best
Marcus

Marcus Wagner

Feb 20, 2019, 1:15:22 AM
to slurm...@lists.schedmd.com
I did a little bit of debugging, setting the debug level to debug5 during submission.
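
(For example, this can be done on the fly with scontrol and set back afterwards:)

$> scontrol setdebug debug5
   ... submit the test jobs ...
$> scontrol setdebug info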

I submitted (or at least tried to) two jobs:

sbatch -n 48 --wrap hostname
got submitted, got jobid 199801


sbatch -N 1 --ntasks-per-node=48 --wrap hostname
submission denied, got jobid 199805

The only difference found in the logs between these two jobs was:
199801:
debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
199805:
debug3:    ntasks_per_node=48 ntasks_per_socket=-1 ntasks_per_core=-1

Nonetheless, slurm schedules 199801 onto one host:
sched: Allocate JobId=199801 NodeList=ncm0288 #CPUs=48 Partition=c18m


Best
Marcus



On 2/19/19 2:58 PM, Prentice Bisbal wrote:

Chris Samuel

Feb 20, 2019, 1:50:05 AM
to slurm...@lists.schedmd.com
On Tuesday, 19 February 2019 10:14:21 PM PST Marcus Wagner wrote:

> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
> submission denied, got jobid 199805

On one of our 40 core nodes with 2 hyperthreads:

$ srun -C gpu -N 1 --ntasks-per-node=80 hostname | uniq -c
80 nodename02

The spec is:

CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2

Hope this helps!

All the best,
Chris
--

Marcus Wagner

Feb 20, 2019, 3:55:49 AM
to slurm...@lists.schedmd.com
Hi Chris,

I assume you have not set

CR_ONE_TASK_PER_CORE

From the SelectTypeParameters section of the slurm.conf man page:

    CR_ONE_TASK_PER_CORE
        Allocate one task per core by default. Without this option, by default one task will be allocated per thread on nodes with more than one ThreadsPerCore configured. NOTE: This option cannot be used with CR_CPU*.

$> scontrol show config | grep CR_ONE_TASK_PER_CORE
SelectTypeParameters    = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE

$> srun --export=all -N 1 --ntasks-per-node=24  hostname | uniq -c
srun: error: CPU count per node can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

I even reconfigured one node, such that there is no difference between the slurmd -C output and the config.

nodeconfig lnm596:
NodeName=lnm596          CPUs=48  Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=120000  Feature=bwx2650,hostok,hpcwork                        Weight=10430 State=UNKNOWN

The result is still the same.


Seems to be related to the parameter CR_ONE_TASK_PER_CORE

... short testing ...

OK, it IS related to this parameter.
But now Slurm distributes the tasks rather unluckily across the hosts.

The background is that we wanted to have only one task per core, exactly what CR_ONE_TASK_PER_CORE promises to do.
So, normally, I would let the user ask for at most half of the number of CPUs, so one typical job would look like

sbatch -p test -n 24 -w lnm596 --wrap "srun --cpu-bind=verbose ./mpitest.sh"

resulting in a job which uses both sockets (good!) but only half of the cores of each socket, as it uses the first 6 cores and their hyperthreads:
cpuinfo of the host:

=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0,1,2,3,4,5,8,9,10,11,12,13             (0,24)(1,25)(2,26)(3,27)(4,28)(5,29)(6,30)(7,31)(8,32)(9,33)(10,34)(11,35)
1               0,1,2,3,4,5,8,9,10,11,12,13             (12,36)(13,37)(14,38)(15,39)(16,40)(17,41)(18,42)(19,43)(20,44)(21,45)(22,46)(23,47)


Output from job:
cpu-bind=MASK - lnm596, task 20 20 [8572]: mask 0x20 set
cpu-bind=MASK - lnm596, task  4  4 [8556]: mask 0x2 set
cpu-bind=MASK - lnm596, task  3  3 [8555]: mask 0x1000000000 set
cpu-bind=MASK - lnm596, task 12 12 [8564]: mask 0x8 set
cpu-bind=MASK - lnm596, task  2  2 [8554]: mask 0x1000000 set
cpu-bind=MASK - lnm596, task  9  9 [8561]: mask 0x4000 set
cpu-bind=MASK - lnm596, task 10 10 [8562]: mask 0x4000000 set
cpu-bind=MASK - lnm596, task 15 15 [8567]: mask 0x8000000000 set
cpu-bind=MASK - lnm596, task 18 18 [8570]: mask 0x10000000 set
cpu-bind=MASK - lnm596, task  7  7 [8559]: mask 0x2000000000 set
cpu-bind=MASK - lnm596, task  1  1 [8553]: mask 0x1000 set
cpu-bind=MASK - lnm596, task  6  6 [8558]: mask 0x2000000 set
cpu-bind=MASK - lnm596, task  8  8 [8560]: mask 0x4 set
cpu-bind=MASK - lnm596, task 14 14 [8566]: mask 0x8000000 set
cpu-bind=MASK - lnm596, task 21 21 [8573]: mask 0x20000 set
cpu-bind=MASK - lnm596, task  5  5 [8557]: mask 0x2000 set
cpu-bind=MASK - lnm596, task  0  0 [8552]: mask 0x1 set
cpu-bind=MASK - lnm596, task 11 11 [8563]: mask 0x4000000000 set
cpu-bind=MASK - lnm596, task 13 13 [8565]: mask 0x8000 set
cpu-bind=MASK - lnm596, task 16 16 [8568]: mask 0x10 set
cpu-bind=MASK - lnm596, task 17 17 [8569]: mask 0x10000 set
cpu-bind=MASK - lnm596, task 19 19 [8571]: mask 0x10000000000 set
cpu-bind=MASK - lnm596, task 22 22 [8574]: mask 0x20000000 set
cpu-bind=MASK - lnm596, task 23 23 [8575]: mask 0x20000000000 set
lnm596.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p1 +pemap 1
lnm596.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p1 +pemap 14
lnm596.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE: <#> unlimited+p1 +pemap 3
lnm596.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE: <#> unlimited+p1 +pemap 27
lnm596.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE: <#> unlimited+p1 +pemap 39
lnm596.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE: <#> unlimited+p1 +pemap 4
lnm596.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE: <#> unlimited+p1 +pemap 28
lnm596.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE: <#> unlimited+p1 +pemap 17
lnm596.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE: <#> unlimited+p1 +pemap 29
lnm596.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p1 +pemap 38
lnm596.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE: <#> unlimited+p1 +pemap 5
lnm596.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE: <#> unlimited+p1 +pemap 15
lnm596.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE: <#> unlimited+p1 +pemap 41
lnm596.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p1 +pemap 2
lnm596.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p1 +pemap 26
lnm596.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p1 +pemap 24
lnm596.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p1 +pemap 25
lnm596.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p1 +pemap 0
lnm596.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p1 +pemap 36
lnm596.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p1 +pemap 12
lnm596.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p1 +pemap 13
lnm596.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p1 +pemap 37
lnm596.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE: <#> unlimited+p1 +pemap 40
lnm596.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE: <#> unlimited+p1 +pemap 16


What we wanted to achieve, and what worked very well apart from the --ntasks-per-node problem, was to schedule by core, putting only one task onto each core.

The cgroups contain the cores and the hyperthreads, and the task/affinity plugin gives each task one core together with its hyperthread. So we schedule by core and the user gets the corresponding hyperthread for free. Perfect! Exactly what we wanted (submitted again to the unmodified nodes):
$> sbatch -p test -n 48 --wrap "srun --cpu-bind=verbose ./mpitest.sh"

ncm0400.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2 +pemap 27,75
ncm0400.hpc.itc.rwth-aachen.de <27> OMP_STACKSIZE: <#> unlimited+p2 +pemap 39,87
ncm0400.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2 +pemap 26,74
ncm0400.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2 +pemap 2,50
ncm0400.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2 +pemap 29,77
ncm0400.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2 +pemap 24,72
ncm0400.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2 +pemap 4,52
ncm0400.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2 +pemap 0,48
ncm0400.hpc.itc.rwth-aachen.de <44> OMP_STACKSIZE: <#> unlimited+p2 +pemap 20,68
ncm0400.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2 +pemap 1,49
ncm0400.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2 +pemap 25,73
ncm0400.hpc.itc.rwth-aachen.de <26> OMP_STACKSIZE: <#> unlimited+p2 +pemap 36,84
ncm0400.hpc.itc.rwth-aachen.de <45> OMP_STACKSIZE: <#> unlimited+p2 +pemap 23,71
ncm0400.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE: <#> unlimited+p2 +pemap 47,95
ncm0400.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE: <#> unlimited+p2 +pemap 9,57
ncm0400.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2 +pemap 3,51
ncm0400.hpc.itc.rwth-aachen.de <24> OMP_STACKSIZE: <#> unlimited+p2 +pemap 12,60
ncm0400.hpc.itc.rwth-aachen.de <32> OMP_STACKSIZE: <#> unlimited+p2 +pemap 14,62
ncm0400.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE: <#> unlimited+p2 +pemap 6,54
ncm0400.hpc.itc.rwth-aachen.de <25> OMP_STACKSIZE: <#> unlimited+p2 +pemap 15,63
ncm0400.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE: <#> unlimited+p2 +pemap 32,80
ncm0400.hpc.itc.rwth-aachen.de <34> OMP_STACKSIZE: <#> unlimited+p2 +pemap 38,86
ncm0400.hpc.itc.rwth-aachen.de <46> OMP_STACKSIZE: <#> unlimited+p2 +pemap 44,92
ncm0400.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2 +pemap 28,76
ncm0400.hpc.itc.rwth-aachen.de <35> OMP_STACKSIZE: <#> unlimited+p2 +pemap 41,89
ncm0400.hpc.itc.rwth-aachen.de <31> OMP_STACKSIZE: <#> unlimited+p2 +pemap 40,88
ncm0400.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE: <#> unlimited+p2 +pemap 34,82
ncm0400.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE: <#> unlimited+p2 +pemap 30,78
ncm0400.hpc.itc.rwth-aachen.de <33> OMP_STACKSIZE: <#> unlimited+p2 +pemap 17,65
ncm0400.hpc.itc.rwth-aachen.de <28> OMP_STACKSIZE: <#> unlimited+p2 +pemap 13,61
ncm0400.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE: <#> unlimited+p2 +pemap 8,56
ncm0400.hpc.itc.rwth-aachen.de <37> OMP_STACKSIZE: <#> unlimited+p2 +pemap 21,69
ncm0400.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE: <#> unlimited+p2 +pemap 11,59
ncm0400.hpc.itc.rwth-aachen.de <36> OMP_STACKSIZE: <#> unlimited+p2 +pemap 18,66
ncm0400.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE: <#> unlimited+p2 +pemap 33,81
ncm0400.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2 +pemap 5,53
ncm0400.hpc.itc.rwth-aachen.de <30> OMP_STACKSIZE: <#> unlimited+p2 +pemap 37,85
ncm0400.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE: <#> unlimited+p2 +pemap 7,55
ncm0400.hpc.itc.rwth-aachen.de <43> OMP_STACKSIZE: <#> unlimited+p2 +pemap 46,94
ncm0400.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE: <#> unlimited+p2 +pemap 35,83
ncm0400.hpc.itc.rwth-aachen.de <40> OMP_STACKSIZE: <#> unlimited+p2 +pemap 19,67
ncm0400.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE: <#> unlimited+p2 +pemap 10,58
ncm0400.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE: <#> unlimited+p2 +pemap 31,79
ncm0400.hpc.itc.rwth-aachen.de <41> OMP_STACKSIZE: <#> unlimited+p2 +pemap 22,70
ncm0400.hpc.itc.rwth-aachen.de <29> OMP_STACKSIZE: <#> unlimited+p2 +pemap 16,64
ncm0400.hpc.itc.rwth-aachen.de <38> OMP_STACKSIZE: <#> unlimited+p2 +pemap 42,90
ncm0400.hpc.itc.rwth-aachen.de <39> OMP_STACKSIZE: <#> unlimited+p2 +pemap 45,93
ncm0400.hpc.itc.rwth-aachen.de <42> OMP_STACKSIZE: <#> unlimited+p2 +pemap 43,91

cpuinfo of this node:
Package Id.     Core Id.        Processors
0               0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29         (0,48)(1,49)(2,50)(3,51)(4,52)(5,53)(6,54)(7,55)(8,56)(9,57)(10,58)(11,59)(12,60)(13,61)(14,62)(15,63)(16,64)(17,65)(18,66)(19,67)(20,68)(21,69)(22,70)(23,71)
1               0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29         (24,72)(25,73)(26,74)(27,75)(28,76)(29,77)(30,78)(31,79)(32,80)(33,81)(34,82)(35,83)(36,84)(37,85)(38,86)(39,87)(40,88)(41,89)(42,90)(43,91)(44,92)(45,93)(46,94)(47,95)


Best
Marcus


On 2/20/19 7:49 AM, Chris Samuel wrote:
On Tuesday, 19 February 2019 10:14:21 PM PST Marcus Wagner wrote:

sbatch -N 1 --ntasks-per-node=48 --wrap hostname
submission denied, got jobid 199805
On one of our 40 core nodes with 2 hyperthreads:

$ srun -C gpu -N 1 --ntasks-per-node=80 hostname | uniq -c
     80 nodename02

The spec is:

CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2

Hope this helps!

All the best,
Chris

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383

Marcus Wagner

Feb 20, 2019, 4:20:10 AM
to slurm...@lists.schedmd.com
Dear all,

I did a little bit more testing.

* I have reenabled CR_ONE_TASK_PER_CORE.
* My test node is still configured as slurmd -C reports it.
* "--ntasks=24" or "--ntasks=24 --ntasks-per-node=24" can both be submitted, resulting in a job with the "free" hyperthread per task. Nearly perfect.

BUT:
The node has 48 CPUs:
NodeName=lnm596 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=0.04

but I cannot submit the following:
sbatch -p test -n 24 --ntasks-per-node=24 --cpus-per-task=2 -w lnm596

24*2 is 48, so I'm asking for 48 CPUs.


There is still something wrong with CR_ONE_TASK_PER_CORE.


Best
Marcus

On 2/20/19 7:49 AM, Chris Samuel wrote:
> On Tuesday, 19 February 2019 10:14:21 PM PST Marcus Wagner wrote:
>
>> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
>> submission denied, got jobid 199805
> On one of our 40 core nodes with 2 hyperthreads:
>
> $ srun -C gpu -N 1 --ntasks-per-node=80 hostname | uniq -c
> 80 nodename02
>
> The spec is:
>
> CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
>
> Hope this helps!
>
> All the best,
> Chris

--

Henkel

Feb 20, 2019, 6:00:53 AM
to slurm...@lists.schedmd.com
Hi Chris,
Hi Marcus,

Just want to understand the cause, too. I'll try to sum it up.

Chris you have

CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2

and

srun -C gpu -N 1 --ntasks-per-node=80 hostname

works.

Marcus has configured

CPUs=48  Sockets=4 CoresPerSocket=12 ThreadsPerCore=2
(slurmd -C says CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2)

and

CR_ONE_TASK_PER_CORE

and

srun -n 48 WORKS

srun -N 1 --ntasks-per-node=48 DOESN'T WORK.

I'm not sure if it's caused by CR_ONE_TASK_PER_CORE, but at least that's one of the major differences. I'm wondering if the effort to force the use of only physical cores is doubled by removing the 48 threads from the config AND setting CR_ONE_TASK_PER_CORE. My impression is that with CR_ONE_TASK_PER_CORE, ntasks-per-node accounts for threads (you have set ThreadsPerCore=2), hence only 24 may work, while CR_ONE_TASK_PER_CORE doesn't affect the 'cores only' selection done with ntasks.

We don't use CR_ONE_TASK_PER_CORE; our users either set -c 2 or --hint=nomultithread, which results in core-only allocation, e.g.:
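
srun -n 24 -c 2 ./my_app                   # one core (both hardware threads) per task
srun -n 24 --hint=nomultithread ./my_app   # use only one thread per core

(./my_app is just a placeholder.)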

You could also enforce this with a job-submit plugin, e.g. the Lua one.

The fact that CR_ONE_TASK_PER_CORE is described as "under change" in the public bugs, and that there is a non-accessible bug about it, probably means it is better not to use it unless you have to.

Best,

Andreas

On 2/20/19 7:49 AM, Chris Samuel wrote:
> On Tuesday, 19 February 2019 10:14:21 PM PST Marcus Wagner wrote:
>
>> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
>> submission denied, got jobid 199805
> On one of our 40 core nodes with 2 hyperthreads:
>
> $ srun -C gpu -N 1 --ntasks-per-node=80 hostname | uniq -c
> 80 nodename02
>
> The spec is:
>
> CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
>
> Hope this helps!
>
> All the best,
> Chris

--
Dr. Andreas Henkel
Operativer Leiter HPC
Zentrum für Datenverarbeitung
Johannes Gutenberg Universität
Anselm-Franz-von-Bentzelweg 12
55099 Mainz
Telefon: +49 6131 39 26434
OpenPGP Fingerprint: FEC6 287B EFF3
7998 A141 03BA E2A9 089F 2D8E F37E


Prentice Bisbal

unread,
Feb 20, 2019, 10:09:58 AM2/20/19
to slurm...@lists.schedmd.com

On 2/20/19 12:08 AM, Marcus Wagner wrote:

Hi Prentice,


On 2/19/19 2:58 PM, Prentice Bisbal wrote:

--ntasks-per-node is meant to be used in conjunction with --nodes option. From https://slurm.schedmd.com/sbatch.html:

--ntasks-per-node=<ntasks>
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option...
Yes, but used together with --ntasks it would mean to use e.g. at most 48 tasks per node. I don't see where the difference lies regarding submission of the job. Even if the semantics (how many cores get scheduled onto how many hosts) might be incorrect, at least the syntax should be accepted.

The difference would be in how Slurm looks at those specifications internally. To us humans, what you say should work seems logical, but if Slurm wasn't programmed to behave that way, it won't. I provided the quote from the documentation, since that implies, to me at least, that Slurm isn't programmed to behave like that. Looking at the source code or asking SchedMD could confirm that.


If you don't specify --ntasks, it defaults to --ntasks=1, as Andreas said. https://slurm.schedmd.com/sbatch.html:

-n, --ntasks=<number>
sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.

So the correct way to specify your job is either like this

--ntasks=48

or

--nodes=1 --ntasks-per-node=48

Specifying both --ntasks-per-node and --ntasks at the same time is not correct.

Funnily enough, the result is the same:

$> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

whereas just using --ntasks=48 gets submitted and it gets scheduled onto one host:

$> sbatch --ntasks=48 --wrap hostname
sbatch: [I] No output file given, set to: output_%j.txt
sbatch: [I] No runtime limit given, set to: 15 minutes
Submitted batch job 199784
$> scontrol show job 199784 | egrep "NumNodes|TRES"
   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182400M,node=1,billing=48

To me, this still looks like a bug, not like the wrong usage of submission parameters.

Either a bug, or there's something subtly wrong with your slurm.conf. I would continue troubleshooting by simplifying both your node definition and SelectType options as much as possible, and see if the problem still persists. Also, look at 'scontrol show node <node name>' to see if your definition in slurm.conf lines up with how Slurm actually sees the node. I don't think I saw that output anywhere in this thread yet.

Marcus Wagner

unread,
Feb 21, 2019, 1:18:34 AM2/21/19
to slurm...@lists.schedmd.com
Hi Andreas,

I'll try to sum this up ;)

First of all, I now used a Broadwell node, so there is no interference from Skylake and sub-NUMA clustering.

We are using slurm 18.08.5-2

I have configured the node as slurmd -C tells me:
NodeName=lnm596 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=120000 Feature=bwx2650,hostok,hpcwork Weight=10430 State=UNKNOWN

This is, what slurmctld knows about the node:
NodeName=lnm596 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=0.03
   AvailableFeatures=bwx2650,hostok,hpcwork
   ActiveFeatures=bwx2650,hostok,hpcwork
   Gres=(null)
   GresDrain=N/A
   GresUsed=gpu:0
   NodeAddr=lnm596 NodeHostName=lnm596 Version=18.08
   OS=Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019
   RealMemory=120000 AllocMem=0 FreeMem=125507 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=10430 Owner=N/A
MCS_label=N/A
   Partitions=future
   BootTime=2019-02-19T07:43:33 SlurmdStartTime=2019-02-20T12:08:54
   CfgTRES=cpu=48,mem=120000M,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=120 LowestJoules=714879 ConsumedJoules=8059263
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Lets first begin with half of the node:

--ntasks=12 -> 12 CPUs asked. I implicitly get the hyperthread for free
(besides the accounting).
   NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=120000M,energy=46,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=120000M MinTmpDiskNode=0

--ntasks=12 --cpus-per-task=2 -> 24 CPUs asked. I now have explicitly
asked for 24 CPUs
   NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=120000M,energy=55,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=120000M MinTmpDiskNode=0

--ntasks=12 --ntasks-per-node=12 --cpus-per-task=2 -> 24 CPUs asked.
Additional constraint: All 12 tasks should be on one node. I also asked
here for 24 CPUs.
   NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=120000M,energy=55,node=1,billing=24
   Socks/Node=* NtasksPerN:B:S:C=12:0:*:1 CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=120000M MinTmpDiskNode=0

Everything good up to now. Now I'll try to use the full node:

--ntasks=24 -> 24 CPUs asked, implicitly got 48.
   NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=120000M MinTmpDiskNode=0

--ntasks=24 --cpus-per-task=2 -> 48 CPUs explicitly asked.
   NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=120000M MinTmpDiskNode=0

And now the funny thing I don't understand:
--ntasks=24 --ntasks-per-node=24 --cpus-per-task=2 -> 48 CPUs asked,
all 24 tasks on one node. Slurm tells me:
sbatch: error: Batch job submission failed: Requested node configuration
is not available

I would have expected the following job, which would have fit onto the node:
   NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryNode=120000M MinTmpDiskNode=0

part of the sbatch -vvv output:
sbatch: ntasks            : 24 (set)
sbatch: cpus_per_task     : 2
sbatch: nodes             : 1 (set)
sbatch: sockets-per-node  : -2
sbatch: cores-per-socket  : -2
sbatch: threads-per-core  : -2
sbatch: ntasks-per-node   : 24
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core   : -2

So, again, I see 24 tasks per node, 2 CPUs per task and 1 node. That is altogether 48 CPUs on one node, which fits perfectly, as one can see from the last two examples. In other words: 24 tasks per node, 2 CPUs per task, 1 node, which to me still makes 48 CPUs.


I just ask explicitly for what Slurm already gives me implicitly, or have I misunderstood something?

We will have to look into this further internally. It might be that we have to give up CR_ONE_TASK_PER_CORE.


Best
Marcus

P.S.:
Sorry for the lengthy post

Marcus Wagner

Feb 21, 2019, 2:13:25 AM
to slurm...@lists.schedmd.com
ahh, ...

one thing I forgot: the following is working again ...

--ntasks=24 --ntasks-per-node=24
   NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=120000M,energy=63,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:1 CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=120000M MinTmpDiskNode=0


Best
Marcus