[slurm-users] CPU binding outside of job step allocation


Rutledge, Chris

Jun 10, 2022, 9:49:13 AM
to slurm...@schedmd.com
Hello Everyone,

I'm having an odd issue with the latest version of Slurm (22.05.0) when submitting jobs to the queue from a compute node. Not every job triggers the issue, but I've got a few that do so reliably. Here's one case that consistently errors when trying to launch. I've not been able to reproduce the issue when submitting jobs from the login node.

Anyone seen anything like this?

##############################
# start interactive session
##############################
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/

##############################
# job details
##############################
[crutledge@largemem-5-1 gpu-6]$ cat job
#!/bin/bash -l
#
#SBATCH --job-name=HPCC
#SBATCH -n 48
#SBATCH -p gpu
#SBATCH --mem-per-cpu=3975

module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel

srun ./hpcc

mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}

##############################
# submit the job
##############################
[crutledge@largemem-5-1 gpu-6]$ sbatch job
Submitted batch job 8533

##############################
# resulting error
##############################
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out
Loading icc version 2022.0.2
Loading compiler-rt version 2022.0.2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***
srun: error: gpu-5-1: tasks 0-46: Killed
mv: cannot stat ‘hpccoutf.txt’: No such file or directory
[crutledge@largemem-5-1 gpu-6]$

Will Furnass

Nov 14, 2022, 11:22:24 AM
to slurm...@schedmd.com
Hi Chris, all,

We've been having similar issues, seemingly since upgrading to Slurm 22.05.x, where job steps in batch jobs submitted from interactive sessions fail sporadically:

1. User SSHs to login node.
2. User runs 'srun --pty /bin/bash' to get an interactive session on a worker node
3. From that interactive session the user submits a batch job containing >=1 explicit job step
4. The job step then _might_ fail with something like:

    srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x2.
    srun: error: Task launch for StepId=372.0 failed on node px01: Unable to satisfy cpu bind request

    srun: error: Application launch failed: Unable to satisfy cpu bind request
    srun: Job step aborted

This seems to be caused by SLURM_CPU_BIND_* environment variables that are set in the interactive job and then (undesirably) propagate to the batch job, where they cause problems if the CPUs allocated to the batch job conflict with the inherited SLURM_CPU_BIND_* values.
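
For reading the errors above: the "allocated CPUs" value appears to be a hexadecimal bitmask in which bit N stands for CPU N, as in Slurm's mask_cpu binding format. A minimal bash sketch for decoding such a mask into CPU indices follows; the function name mask_to_cpus is hypothetical and not part of Slurm.

```bash
# Hypothetical helper (not a Slurm tool): list the CPU indices set in a hex mask.
# Assumes bit N of the mask corresponds to CPU N.
mask_to_cpus() {
    local hex=${1#0x}   # strip the optional 0x prefix
    local len=${#hex}
    local cpus=()
    local i digit bit
    # Walk the hex digits from the least significant (rightmost) one.
    for (( i = 0; i < len; i++ )); do
        digit=$(( 16#${hex:len-1-i:1} ))
        for (( bit = 0; bit < 4; bit++ )); do
            if (( digit & (1 << bit) )); then
                cpus+=( $(( i * 4 + bit )) )
            fi
        done
    done
    echo "${cpus[*]}"
}

mask_to_cpus 0x2    # prints: 1
```

Running it on the longer mask from the first report shows which CPUs the failing step was actually given on that node.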

Unsetting those env vars at the top of the job submission script seems to prevent the issue from occurring, but isn't something we want to recommend to users. We're also concerned that propagation of other env vars from the interactive job to the batch job might cause other issues.
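
A minimal sketch of that workaround, assuming bash and assuming that every inherited variable whose name begins with SLURM_CPU_BIND should be dropped. The helper name clear_cpu_bind_env is made up for illustration. It would sit in the batch script after the #SBATCH lines and before the first srun:

```bash
# Hypothetical helper: unset every variable whose name starts with SLURM_CPU_BIND
# (assumption: these are the inherited values that conflict with the batch job).
clear_cpu_bind_env() {
    local var
    # "${!PREFIX@}" expands to the names of all variables beginning with PREFIX.
    for var in "${!SLURM_CPU_BIND@}"; do
        unset "$var"
    done
}

clear_cpu_bind_env
# The job step (e.g. "srun ./hpcc") would follow here.
```

Other SLURM_* variables are left untouched, so this only addresses the CPU binding symptom and not the wider propagation concern.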

We thought that SLURM_EXPORT_ENV / SBATCH_EXPORT could help here but the docs for those features say: "Note that SLURM_* variables are always propagated."

Has anything changed in 22.05 that could explain this? The only changelog entries I can spot that might be related are:

 -- Fail srun when using invalid `--cpu-bind` options (e.g. `--cpu-bind=map_cpu:99` when only 10 cpus are allocated).
 -- `srun --overlap` now allows the step to share all resources (CPUs, memory, and GRES), where previously `--overlap` only allowed the step to share CPUs with other steps.

NB this has also been discussed on the Slurm Bugzilla (https://bugs.schedmd.com/show_bug.cgi?id=14298).

Regards,

Will

--
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield 
+44 (0)114 22 29693 