[slurm-users] Multiple Program Runs using srun in one Slurm batch Job on one node


Guillaume De Nayer

Jun 15, 2022, 8:30:35 AM
to slurm...@lists.schedmd.com
Dear all,

I'm new to this list. I'm responsible for several small clusters at our
chair.

I set up Slurm 21.08.8-2 on a small cluster (CentOS 7) with 8 nodes:
NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1

One colleague has to run 20,000 jobs on this machine. Every job starts
his program with mpirun on 12 cores. The standard Slurm behavior is that
the node running such a job is blocked entirely (28 cores sit idle).
Since the small cluster has only 8 nodes, only 8 jobs can run in parallel.

In order to solve this problem I'm trying to start some subtasks with
srun inside a batch job (without mpirun for now):

#!/bin/bash
#SBATCH --job-name=test_multi_prog_srun
#SBATCH --nodes=1
#SBATCH --partition=short
#SBATCH --time=02:00:00
#SBATCH --exclusive

srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
wait


However, only one task runs at a time: the second waits for the first to
complete before starting.

Can someone explain to me what I'm doing wrong?


Thx in advance,
Regards,
Guillaume


# slurm.conf file
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmUser=root
SwitchType=switch/none
TaskPlugin=task/none
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageEnforce=limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobAcctGatherFrequency=30
SlurmctldDebug=error
SlurmdDebug=error
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log

NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 State=UNKNOWN
PartitionName=short Nodes=node[01-08] Default=NO MaxTime=0-02:00:00 State=UP DefaultTime=00:00:00 MinNodes=1 PriorityTier=100



Tina Friedrich

Jun 15, 2022, 8:48:49 AM
to slurm...@lists.schedmd.com
Hi Guillaume,

in that example you wouldn't need the 'srun' to run more than one task,
I think.


I'm not 100% sure, but to me it sounds like you're currently assigning
whole nodes to jobs rather than cores (i.e. have
'SelectType=select/linear' and no OverSubscribe) and find that to be
wasteful - is that correct?

If it is, I'd say the more obvious solution to that would be to change
the SelectType to either select/cons_res or select/cons_tres, so that
cores (not nodes) are allocated to jobs?

Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

Guillaume De Nayer

Jun 15, 2022, 9:09:11 AM
to slurm...@lists.schedmd.com
On 06/15/2022 02:48 PM, Tina Friedrich wrote:
> Hi Guillaume,
>

Hi Tina,

> in that example you wouldn't need the 'srun' to run more than one task,
> I think.
>

You are correct. To start a program like sleep I could simply run:
sleep 20s &
sleep 30s &
wait

However, my objective is to use mpirun in combination with srun, to
avoid having to define a rankfile manually.
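
For context, the end goal looks roughly like the sketch below (assuming
our MPI library can be launched directly by srun via a PMI plugin, so
that no rankfile is needed; ./solver_case1 and ./solver_case2 are
placeholders):

srun --exact --mpi=pmi2 -n12 -c1 ./solver_case1 > case1.log 2>&1 &
srun --exact --mpi=pmi2 -n12 -c1 ./solver_case2 > case2.log 2>&1 &
wait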

>
> I'm not 100% sure, but to me it sounds like you're currently assigning
> whole nodes to jobs rather than cores (i.e. have
> 'SelectType=select/linear' and no OverSubscribe) and find that to be
> wasteful - is that correct?
>

In my first email I copied parts of my slurm.conf. I'm using
"SelectType=select/cons_res"

with

"SelectTypeParameters=CR_Core_Memory"

And until now no OverSubscribe. I tried activating
"OverSubscribe=YES" on the partition with

PartitionName=short Nodes=node[01-08] Default=NO MaxTime=0-02:00:00 State=UP DefaultTime=00:00:00 MinNodes=1 PriorityTier=100 OverSubscribe=YES

But it did not solve the issue with

srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
wait


> If it is, I'd say the more obvious solution to that would be to change
> the SelectType to either select/cons_res or select/cons_tres, so that
> cores (not nodes) are allocated to jobs?
>

How can I be sure that Slurm is actually using the "select/cons_res"
setting defined in my /etc/slurm/slurm.conf?
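
(One way to check this, I believe, is to query the controller's live
configuration:

scontrol show config | grep -i select

which should print the SelectType and SelectTypeParameters values that
slurmctld actually loaded.)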

Thx a lot
Guillaume

Frank Lenaerts

Jun 15, 2022, 9:50:16 AM
to slurm...@lists.schedmd.com
On Wed, Jun 15, 2022 at 02:20:56PM +0200, Guillaume De Nayer wrote:
> In order to solve this problem I'm trying to start some subtasks with
> srun inside a batch job (without mpirun for now):
>
> #!/bin/bash
> #SBATCH --job-name=test_multi_prog_srun
> #SBATCH --nodes=1
> #SBATCH --partition=short
> #SBATCH --time=02:00:00
> #SBATCH --exclusive
>
> srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
> srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
> wait
>
>
> However, only one task runs at a time: the second waits for the first
> to complete before starting.
>
> Can someone explain to me what I'm doing wrong?

I think this is because the value for sbatch(1)'s -n (or --ntasks) is
1 (per node) by default.

> Thx in advance,
> Regards,

Kind regards

> Guillaume

Frank

Frank Lenaerts

Jun 15, 2022, 9:53:21 AM
to slurm...@lists.schedmd.com
On Wed, Jun 15, 2022 at 02:20:56PM +0200, Guillaume De Nayer wrote:
> One colleague has to run 20,000 jobs on this machine. Every job starts
> his program with mpirun on 12 cores. The standard Slurm behavior is that
> the node running such a job is blocked entirely (28 cores sit idle).
> Since the small cluster has only 8 nodes, only 8 jobs can run in parallel.

If your colleague also uses sbatch(1)'s --exclusive option, only one
job can run on a node...

> In order to solve this problem I'm trying to start some subtasks with
> srun inside a batch job (without mpirun for now):
>
> #!/bin/bash
> #SBATCH --job-name=test_multi_prog_srun
> #SBATCH --nodes=1
> #SBATCH --partition=short
> #SBATCH --time=02:00:00
> #SBATCH --exclusive

Guillaume De Nayer

Jun 15, 2022, 10:59:59 AM
to slurm...@lists.schedmd.com
On 06/15/2022 03:53 PM, Frank Lenaerts wrote:
> On Wed, Jun 15, 2022 at 02:20:56PM +0200, Guillaume De Nayer wrote:
>> One colleague has to run 20,000 jobs on this machine. Every job starts
>> his program with mpirun on 12 cores. The standard Slurm behavior is that
>> the node running such a job is blocked entirely (28 cores sit idle).
>> Since the small cluster has only 8 nodes, only 8 jobs can run in parallel.
>
> If your colleague also uses sbatch(1)'s --exclusive option, only one
> job can run on a node...
>

Perhaps I misunderstand the Slurm documentation...

I thought that the --exclusive option used in combination with sbatch
reserves the whole node (40 cores) for the job submitted with sbatch.
This part works fine; I can check it with sacct.

Then, this job starts subtasks on the reserved 40 cores with srun. That
is why I'm using "-n1 -c1" in combination with "srun": I thought it was
possible to use the reserved cores inside this job via srun.


The following slightly modified job, without --exclusive and with
--ntasks=2, leads to a similar problem: only one srun runs at a time.
The second starts directly after the first one finishes.

#!/bin/bash
#SBATCH --job-name=test_multi_prog_srun
#SBATCH --ntasks=2
#SBATCH --partition=short
#SBATCH --time=02:00:00

srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
wait


Kind regards
Guillaume


Ward Poelmans

Jun 15, 2022, 11:26:30 AM
to slurm...@lists.schedmd.com
Hi Guillaume,

On 15/06/2022 16:59, Guillaume De Nayer wrote:
>
> Perhaps I misunderstand the Slurm documentation...
>
> I thought that the --exclusive option used in combination with sbatch
> reserves the whole node (40 cores) for the job submitted with sbatch.
> This part works fine; I can check it with sacct.
>
> Then, this job starts subtasks on the reserved 40 cores with srun. That
> is why I'm using "-n1 -c1" in combination with "srun": I thought it was
> possible to use the reserved cores inside this job via srun.

You're correct. --exclusive will give you all cores on the nodes but only as much memory as requested.
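
For example (assuming the nodes have RealMemory configured in
slurm.conf), a job can ask for all of a node's memory explicitly with:

#SBATCH --mem=0

--mem=0 is sbatch's documented shorthand for "all the memory on each
node".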


> The following slightly modified job, without --exclusive and with
> --ntasks=2, leads to a similar problem: only one srun runs at a time.
> The second starts directly after the first one finishes.
>
> #!/bin/bash
> #SBATCH --job-name=test_multi_prog_srun
> #SBATCH --ntasks=2
> #SBATCH --partition=short
> #SBATCH --time=02:00:00
>
> srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
> srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
> wait

This should work... It works on our cluster. Are you sure they don't run in parallel?

We usually recommend using GNU parallel or xargs, like:

xargs -P $SLURM_NTASKS srun -N 1 -n 1 -c 1 --exact sleep 30
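
For example, fed with a list of arguments on stdin (case_* and
./run_case are placeholders here):

printf '%s\n' case_* | xargs -P $SLURM_NTASKS -I{} srun -N 1 -n 1 -c 1 --exact ./run_case {}

or, with GNU parallel:

parallel -j $SLURM_NTASKS srun -N 1 -n 1 -c 1 --exact ./run_case {} ::: case_*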


Ward

Guillaume De Nayer

Jun 15, 2022, 11:37:18 AM
to slurm...@lists.schedmd.com
Yes, I'm pretty sure it does not run in parallel: sacct shows me only
one subtask "RUNNING". Then, when this subtask is marked "COMPLETED",
the second one appears and is marked "RUNNING".

Moreover, if I connect directly to the node, only one "sleep" process
is running.

OK. If it works on your cluster, perhaps I have a problem in my Slurm
config. Which version of Slurm are you using on your cluster? And can
you share your slurm.conf?

> We usually recommend to use gnu parallel or xargs like:
>
> xargs -P $SLURM_NTASKS srun -N 1 -n 1 -c 1 --exact sleep 30
>

OK. I will install GNU parallel and also test your xargs command.

Thx a lot!
Guillaume


Williams, Gareth (IM&T, Black Mountain)

Jun 15, 2022, 5:21:11 PM
to Slurm User Community List
I think the problem might be that you are not requesting memory: with CR_Core_Memory, a job (or job step) that does not specify memory is allocated all the memory on the node by default, so "cons_res" will not place a second one there. That comes up quite often.
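
A common companion fix (a sketch only; the DefMemPerCPU and RealMemory
values below are made-up examples that must be adapted to the real
hardware) is to define memory explicitly in slurm.conf, so that jobs
without a memory request get a bounded default:

DefMemPerCPU=4096
NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=192000 State=UNKNOWN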

Gareth

Guillaume De Nayer

Jun 16, 2022, 3:25:22 AM
to slurm...@lists.schedmd.com
Hi Gareth,

I think you solved the problem. In my slurm.conf no memory setting was
present (neither in the node definition nor in the partition). I changed
that and also added "--mem-per-cpu 1" to the srun calls. It seems to
work. I will now test it with mpirun.
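
For completeness, a sketch of what the working test script now looks
like (the "1" is just the minimal 1 MB per-CPU request mentioned above):

#!/bin/bash
#SBATCH --job-name=test_multi_prog_srun
#SBATCH --nodes=1
#SBATCH --partition=short
#SBATCH --time=02:00:00
#SBATCH --exclusive

srun -vvv --exact -n1 -c1 --mem-per-cpu=1 sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 --mem-per-cpu=1 sleep 30 > srun2.log 2>&1 &
wait

Both steps can now run concurrently, because each claims only 1 MB
instead of the node's entire memory.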

Thx a lot for your help!
Regards
Guillaume