[slurm-users] --no-alloc breaks mpi?

O'Grady, Paul Christopher

Mar 8, 2021, 3:57:08 PM
to slurm...@lists.schedmd.com
Hi,

I’m having an issue with srun's --no-alloc flag and MPI, which I can reproduce with a fairly simple example. When I run a simple one-core MPI test program as “slurmUser” (the account that has the --no-alloc privilege), it succeeds:

srun -p psfehq -n 1 -o logs/test.log -w psana1507 python ~/ipsana/mpi_simpletest.py

However, when I add the --no-alloc flag, it fails in a way that appears to break MPI (see logfile output and other Slurm/MPI info below). It fails similarly on 2 cores.

srun --no-alloc -p psfehq -n 1 -o logs/test.log -w psana1507 python ~/ipsana/mpi_simpletest.py
srun: do not allocate resources
srun: error: psana1507: task 0: Exited with exit code 1

Would anyone have any suggestions for how I could make the “--no-alloc” flag work with MPI? Thanks!

chris

------------------------------------------------------------------------------------------------------

Logfile error with --no-alloc flag:

(ana-4.0.12) psanagpu105:batchtest_slurm$ more logs/test.log
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[psana1507:13884] Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to guarantee that all 
other processes were killed!
(ana-4.0.12) psanagpu105:batchtest_slurm$ 

System information:

(ana-4.0.12) psanagpu105:batchtest_slurm$ conda list | grep mpi
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.0.3            py27h9ab638b_1    conda-forge
openmpi                   4.1.0                h9b22176_1    conda-forge


(ana-4.0.12) psanagpu105:batchtest_slurm$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3
(ana-4.0.12) psanagpu105:batchtest_slurm$ srun --version
slurm 20.11.3
(ana-4.0.12) psanagpu105:batchtest_slurm$ 



Pritchard Jr., Howard

Mar 8, 2021, 4:35:24 PM
to Slurm User Community List

Hi Chris,

 

What’s happening, I speculate (since I don’t have the permissions to use --no-alloc myself), is that SLURM_JOBID is not set but SLURM_NODELIST may be set, and that combination is confusing ORTE.

Could you list which SLURM environment variables are set in the shell in which you’re running the srun command?
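
For example, something along these lines should capture them (any equivalent way of dumping the environment is fine):

env | grep '^SLURM'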

 

Howard

O'Grady, Paul Christopher

Mar 8, 2021, 9:37:16 PM
to slurm...@lists.schedmd.com


On Mar 8, 2021, at 1:35 PM, slurm-use...@lists.schedmd.com wrote:

What’s happening, I speculate (since I don’t have the permissions to use --no-alloc myself), is that SLURM_JOBID is not set but SLURM_NODELIST may be set, and that combination is confusing ORTE.

Could you list which SLURM environment variables are set in the shell in which you’re running the srun command?

Howard,

I believe you are correct. Once I set SLURM_JOBID, ORTE starts functioning again with the --no-alloc option. Since you asked (and for completeness), I include below the list of environment variables that differed with and without --no-alloc, but my tests show that the job id seems to be the magic one, as you predicted.

I guess I will manufacture an artificial job id for our “--no-alloc” runs, but if anyone is aware of any dangers lurking in that approach, I would be interested to hear about them.
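
Concretely, what I have in mind is roughly the following (only lightly tested; the job id value is just a placeholder, and I haven’t verified whether srun itself needs the variable or only the launched tasks do):

export SLURM_JOBID=12345            # arbitrary placeholder id
export SLURM_JOB_ID=$SLURM_JOBID    # keep both spellings consistent
srun --no-alloc -p psfehq -n 1 -o logs/test.log -w psana1507 python ~/ipsana/mpi_simpletest.py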

Thanks for the guidance ... impressive that you could identify the issue so quickly!

chris

----------------------------------------------------------

SLURM_JOB_CPUS_PER_NODE=1
SLURM_JOB_ID=25300
SLURM_JOBID=25300
SLURM_JOB_NUM_NODES=1
SLURM_JOB_PARTITION=psfehq
SLURM_JOB_QOS=normal
SLURM_CPUS_ON_NODE=1
