[slurm-users] srun and Intel MPI 2020 Update 4


Ciaron Linstead

Nov 5, 2020, 10:41:56 AM
to slurm...@lists.schedmd.com
Hello all

I've been trying to run a simple MPI application (the Intel MPI
Benchmark) using the latest Intel Parallel Studio (2020 Update 4) and
srun. Version 2019 Update 4 runs this example correctly, as does mpirun.

SLURM is 17.11.7

The error I get is the following, unless I use --exclusive:


MPI startup(): Could not import some environment variables. Intel MPI
process pinning will not be used.
Possible reason: Using the Slurm srun command. In this
case, Slurm pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to
/p/system/slurm/lib/libpmi.so
Abort(2664079) on node 19 (rank 19 in comm 0): Fatal error in
PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(136).........:
MPID_Init(1127)...............:
MPIDI_SHMI_mpi_init_hook(29)..:
MPIDI_POSIX_mpi_init_hook(141):
MPIDI_POSIX_eager_init(2109)..:
MPIDU_shm_seg_commit(296).....: unable to allocate shared memory



I have a ticket open with Intel, who suggested increasing /dev/shm on
the nodes to 64GB (the size of the RAM on the nodes), but this had no
effect.
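As a rough way to confirm whether the enlarged /dev/shm actually took effect on the compute nodes, something along these lines could be run; this is only a sketch, and the node count and remount size are assumptions based on the ticket:

# Report tmpfs size and usage of /dev/shm on two allocated nodes
# (node count is illustrative):
srun -N 2 df -h /dev/shm

# On a node itself (as root), a tmpfs can be grown in place;
# making the change permanent via /etc/fstab is a site-specific choice:
mount -o remount,size=64G /dev/shm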

Here's my submit script:



#!/bin/bash

#SBATCH --ntasks=25 # fails, unless exclusive
##SBATCH --exclusive

source \
  /p/system/packages/intel/parallel_studio_xe_2020_update4/impi/2019.9.304/intel64/bin/mpivars.sh \
  -ofi_internal=1

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
export FI_VERBS_IFACE=ib0
export FI_LOG_LEVEL=trace

export I_MPI_PMI_LIBRARY=/p/system/slurm/lib/libpmi.so
export I_MPI_DEBUG=5

# Fails for any MPI program, not just this one
srun -v -n $SLURM_NTASKS /home/linstead/imb_2019.5/IMB-MPI1 barrier
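
# One further check worth trying is which PMI plugin srun actually uses.
# The lines below are only a sketch: they assume this Slurm build ships the
# PMI-2 plugin and that a libpmi2.so exists next to the libpmi.so above.

# Show which MPI/PMI plugins this Slurm installation supports:
srun --mpi=list

# Launch the benchmark again, explicitly requesting PMI-2
# (library path is an assumption, mirroring the libpmi.so path above):
export I_MPI_PMI_LIBRARY=/p/system/slurm/lib/libpmi2.so
srun --mpi=pmi2 -v -n $SLURM_NTASKS /home/linstead/imb_2019.5/IMB-MPI1 barrier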



Do you have any ideas about where/how to investigate this further?

Many thanks

Ciaron

Ole Holm Nielsen

Nov 6, 2020, 2:57:22 AM
to slurm...@lists.schedmd.com
On 11/5/20 4:41 PM, Ciaron Linstead wrote:
> I've been trying to run a simple MPI application (the Intel MPI Benchmark)
> using the latest Intel Parallel Studio (2020 Update 4) and srun. Version
> 2019 Update 4 runs this example correctly, as does mpirun.
>
> SLURM is 17.11.7
>
> The error I get is the following, unless I use --exclusive:
...

I think you would be well advised to upgrade your ancient Slurm 17.11
installation! Numerous bugs have been fixed in the last 3 years.

FWIW, I have written some Slurm upgrade instructions in
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Best regards,
Ole

Malte Thoma

Nov 6, 2020, 3:31:20 AM
to Slurm User Community List, Ciaron Linstead
Hi Ciaron,

On our Omni-Path network, we encountered a similar problem:

The MPI needs exclusive access to the interconnect.

Cray once provided a workaround, but it was not worth implementing (a terrible effort-to-gain ratio for us).

Conclusion: you might have to live with this limitation.

Kind regards,
Malte



On 05.11.20 at 16:41, Ciaron Linstead wrote:
--
Malte Thoma Tel. +49-471-4831-1828
HSM Documentation: https://spaces.awi.de/x/YF3-Eg (User)
https://spaces.awi.de/x/oYD8B (Admin)
HPC Documentation: https://spaces.awi.de/x/Z13-Eg (User)
https://spaces.awi.de/x/EgCZB (Admin)
AWI, Geb.E (3125)
Am Handelshafen 12
27570 Bremerhaven
Tel. +49-471-4831-1828
