[slurm-users] Nested sruns


Raymond Norris

May 25, 2018, 5:01:16 PM
to slurm...@schedmd.com

Slurm: 17.11.4

 

I want to run an interactive job on a compute node.  I know that I’m going to need to run an MPI app, so I request a bunch of tasks up front:

 

srun -n 16 --gres=gpu:4 --pty $SHELL

 

This creates a job with 4 nodes.

 

                …

SLURM_CPUS_ON_NODE=4

SLURM_DISTRIBUTION=block

SLURM_JOB_CPUS_PER_NODE=4(x4)

SLURM_JOB_NODELIST=n1,n2,n3,n4

SLURM_JOB_NUM_NODES=16

SLURM_NNODES=4

SLURM_NPROCS=16

SLURM_NTASKS=16

SLURM_STEP_NODELIST=n1,n2,n3,n4

SLURM_STEP_NUM_NODES=4

SLURM_STEP_NUM_TASKS=16

SLURM_STEP_TASKS_PER_NODE=4(x4)

SLURM_TASKS_PER_NODE=4(x4)

                …
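
For anyone trying to reproduce this, here is a minimal sketch of how the node and task counts could be checked from inside the interactive shell. The function name check_slurm_nodes is made up for illustration and is not part of Slurm. It relies only on the variables shown in the listing above, and on the srun man page's description of SLURM_NNODES as a backwards-compatible alias for SLURM_JOB_NUM_NODES, so the two would be expected to agree.

```shell
#!/bin/sh
# Sketch only: check_slurm_nodes is a hypothetical helper, not a Slurm command.
# It prints the node/task count variables Slurm exported into this shell and
# reports whether the two node-count variables disagree.
check_slurm_nodes() {
    # Show only the variables relevant to node and task counts.
    env | grep -E '^SLURM_(NNODES|JOB_NUM_NODES|NTASKS|STEP_NUM_NODES)=' | sort

    # Compare the two node-count variables; treat a missing one as "unset".
    if [ "${SLURM_NNODES:-unset}" != "${SLURM_JOB_NUM_NODES:-unset}" ]; then
        echo "mismatch: SLURM_NNODES=${SLURM_NNODES:-unset} SLURM_JOB_NUM_NODES=${SLURM_JOB_NUM_NODES:-unset}"
        return 1
    fi
    return 0
}
```

With the values from the listing above (SLURM_NNODES=4, SLURM_JOB_NUM_NODES=16), the function would report a mismatch and return a non-zero status.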

 

Once on the compute node, I run an application that makes a system call to a shell script.  Of my 16 cores, I want to make use of 10 of them.  In that shell script is the following call:

 

srun -l -n 10 a.out

 

However, Slurm comes back with

 

srun: Warning: can't run 10 processes on 16 nodes, setting nnodes to 10

srun: error: Only allocated 4 nodes asked for 10

 

Exiting with code: 1
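
To rule out the simplest cause (asking for more tasks than the job holds), the call inside the script could be wrapped in a guard like the sketch below. The helper name run_inner_step is hypothetical, and the sketch assumes SLURM_NTASKS is set in the script's environment as in the listing earlier in this post. It only validates the task count before launching; it is not claimed to fix the node-count error shown above.

```shell
#!/bin/sh
# Sketch only: run_inner_step is a hypothetical helper, not a Slurm command.
# Usage: run_inner_step <ntasks> <program> [args...]
run_inner_step() {
    ntasks=$1
    shift
    # SLURM_NTASKS comes from the outer srun (16 in this post).
    if [ -z "${SLURM_NTASKS:-}" ]; then
        echo "SLURM_NTASKS is not set" >&2
        return 2
    fi
    # Refuse to ask for more tasks than the job was given.
    if [ "$ntasks" -gt "$SLURM_NTASKS" ]; then
        echo "requested $ntasks tasks, job has $SLURM_NTASKS" >&2
        return 1
    fi
    # Same call as in the post: -l labels output lines, -n sets the task count.
    srun -l -n "$ntasks" "$@"
}
```

In the script, the original line would then become `run_inner_step 10 a.out`.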

 

If I run

 

                srun -l -n 16 a.out

 

the app hangs and the debug output shows:

 

            srun: jobid 1234: nodes(4):`n1,n2,n3,n4', cpu counts: 4(x4)

            srun: error: SLURM_NNODES environment variable conflicts with allocated node count (16 != 4).

            srun: debug:  requesting job 1234, user 5678, nodes 4 including ((null))

            srun: debug:  cpus 16, tasks 16, name a.out, relative 65534

            srun: Job 1234 step creation temporarily disabled, retrying

           

            srun: debug:  Got signal 2

            srun: Cancelled pending job step with signal 2

            srun: error: Unable to create step for job 1234: Job/step already completing or completed

 

I think there are two issues:

  1. I’m asking for a gres that is being consumed by the outer srun (but my inner srun is going to need the GPUs, so I need to ensure I’m asking for them up front).
  2. Even without the gres, Slurm still doesn’t seem to like srun calling srun.

 

How do I make this work?

 

Thanks,

Raymond

 
