debugging Cray MPICH Hybrid PMI_Init /var/run errors

Jason Addison

Feb 3, 2022, 12:47:07 PM
to singularity
Does anyone have experience with MPICH Hybrid containers on Cray systems or ideas on how to debug?

With my latest attempt using the "MPICH Hybrid Container" example from the Apptainer documentation, I get errors like:

jobStarter

Wed Feb 2 18:41:35 2022: [unset]:_pmi_pals_init:Couldn't open /var/run/palsd/f20b1e77-55e4-499c-a020-b20a1e6476c7/apinfo: No such file or directory
Wed Feb 2 18:41:35 2022: [unset]:_pmi_init:_pmi_pals_init returned -1
Wed Feb 2 18:41:35 2022: [unset]:_pmi_pals_init:Couldn't open /var/run/palsd/f20b1e77-55e4-499c-a020-b20a1e6476c7/apinfo: No such file or directory
Wed Feb 2 18:41:35 2022: [unset]:_pmi_init:_pmi_pals_init returned -1
MPICH ERROR [Rank 0] [job id unknown] [Wed Feb 2 18:41:35 2022] [x1001c7s7b1n1] - Abort(1112591) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(144):
MPID_Init(430).......:
MPIR_pmi_init(78)....: PMI_Init returned 1
...

This is a Cray system using the PBS batch scheduler.

My approach (batch file) looks like:

...
export SINGULARITYENV_LD_LIBRARY_PATH=\
/lib64:\
/opt/cray/pe/gcc-libs:\
/opt/cray/pe/lib64:\
/opt/cray/pe/lib64/cce:\
/usr/lib64:\
/opt/cray/libfabric/1.11.0.4.67/lib64

module swap PrgEnv-cray PrgEnv-gnu
module swap cray-mpich cray-mpich-abi

DPATH_HOST_HOME=$HOME
DPATH_CNTR_HOME=/home/$USER
FPATH_HOST_IMAG=$PBS_O_WORKDIR/image.sif
FPATH_CNTR_APP=/opt/mpitest

/usr/bin/time -p \
aprun -n 8 \
singularity run \
--bind /opt/cray \
--bind /usr/lib64 \
--home $DPATH_HOST_HOME:$DPATH_CNTR_HOME \
$FPATH_HOST_IMAG \
$FPATH_CNTR_APP
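
The error above says the PALS `apinfo` file cannot be opened from inside the container, so one way to narrow it down is to compare what the host and the container each see under `/var/run/palsd`. The following is a minimal sketch, not a confirmed fix: `path_status` is a hypothetical helper name, and it assumes the check is run on a compute node inside the PBS job, since the PALS directory may only exist there.

```shell
# Report whether a path is visible from wherever this function runs.
# (path_status is a hypothetical helper, not part of Singularity or Cray PE.)
path_status() {
    if [ -e "$1" ]; then
        printf 'present: %s\n' "$1"
    else
        printf 'missing: %s\n' "$1"
    fi
}

# On the host side of the job:
path_status /var/run/palsd

# Inside the container, for comparison. Left commented out because it
# needs the image from the batch file and a working singularity install:
# singularity exec "$PBS_O_WORKDIR/image.sif" ls -ld /var/run/palsd
```

If the host reports the directory as present and the container does not, the directory is simply not being bound into the container.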

Tru Huynh

Feb 3, 2022, 3:11:42 PM
to singu...@lbl.gov
Hi,

On Thu, Feb 03, 2022 at 09:47:07AM -0800, Jason Addison wrote:
> Does anyone have experience with MPICH Hybrid containers on Cray systems
> or ideas on how to debug?
> With my latest attempt using the "MPICH Hybrid Container" example from the
> Apptainer documentation, I get errors like:
> jobStarter
>
> Wed Feb 2 18:41:35 2022: [unset]:_pmi_pals_init:Couldn't open
> /var/run/palsd/f20b1e77-55e4-499c-a020-b20a1e6476c7/apinfo: No such file
> or directory
> ...

maybe add "--bind /run" along with the other ones?
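
A sketch of why binding `/run` could help with a path under `/var/run`: on many Linux systems `/var/run` is a symlink to `/run`. This assumes the same is true on the Cray compute nodes and inside the container image, which the thread does not confirm. `resolve` is a hypothetical helper name.

```shell
# Show where a path really points after following symlinks.
# (resolve is a hypothetical helper; it relies on readlink -f being available.)
resolve() {
    readlink -f "$1"
}

# On many Linux hosts this prints /run.
resolve /var/run || echo "/var/run could not be resolved"

# The launch line from the original batch file with the extra bind added.
# Left commented out because it only works inside the PBS job:
# aprun -n 8 \
#   singularity run \
#     --bind /opt/cray \
#     --bind /usr/lib64 \
#     --bind /run \
#     --home "$DPATH_HOST_HOME:$DPATH_CNTR_HOME" \
#     "$FPATH_HOST_IMAG" \
#     "$FPATH_CNTR_APP"
```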

> ...
> /usr/bin/time -p \
> aprun -n 8 \
> singularity run \
> --bind /opt/cray \
> --bind /usr/lib64 \
> --home $DPATH_HOST_HOME:$DPATH_CNTR_HOME \
> $FPATH_HOST_IMAG \
> $FPATH_CNTR_APP

Cheers

Tru
--
Tru Huynh (PhD) | mailto:t...@pasteur.fr | tel +33 1 45 68 87 37
https://research.pasteur.fr/en/team/structural-bioinformatics/
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France

Jason Addison

Feb 4, 2022, 2:23:47 PM
to singularity, Tru Huynh
Hi,

On Thursday, February 3, 2022 at 1:11:42 PM UTC-7 Tru Huynh wrote:

On Thu, Feb 03, 2022 at 09:47:07AM -0800, Jason Addison wrote:
> Does anyone have experience with MPICH Hybrid containers on Cray systems
> or ideas on how to debug?
> With my latest attempt using the "MPICH Hybrid Container" example from the
> Apptainer documentation, I get errors like:
> jobStarter
>
> Wed Feb 2 18:41:35 2022: [unset]:_pmi_pals_init:Couldn't open
> /var/run/palsd/f20b1e77-55e4-499c-a020-b20a1e6476c7/apinfo: No such file
> or directory
> ...

maybe add "--bind /run" along with the other ones?


I think that got me past the /run issue. I can now run it on a single node. When running across nodes, I get:

jobStarter
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
MPICH ERROR [Rank 0] [job id 0880974a-ee84-4906-af7b-53a69cacb18b] [Fri Feb 4 19:16:03 2022] [x1002c4s0b1n0] - Abort(2161295) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(144).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(606):
open_fabric(1318)...........:
find_provider(1601).........: OFI fi_getinfo() failed (ofi_init.c:1601:find_provider:No data available)

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(144).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(606):
open_fabric(1318)...........:
find_provider(1601).........: OFI fi_getinfo() failed (ofi_init.c:1601:find_provider:No data available)
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
MPICH ERROR [Rank 0] [job id 0880974a-ee84-4906-af7b-53a69cacb18b] [Fri Feb 4 19:16:03 2022] [x1002c4s0b1n0] - Abort(2161295) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(144).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(606):
open_fabric(1318)...........:
find_provider(1601).........: OFI fi_getinfo() failed (ofi_init.c:1601:find_provider:No data available)

...


> ...
> /usr/bin/time -p \
> aprun -n 8 \
> singularity run \
> --bind /opt/cray \
> --bind /usr/lib64 \
> --home $DPATH_HOST_HOME:$DPATH_CNTR_HOME \
> $FPATH_HOST_IMAG \
> $FPATH_CNTR_APP



Thanks,
Jason
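
One possible next debugging step for the `fi_getinfo()` failure is to ask libfabric for more verbose logging on the next run. The sketch below is only a way to gather more information, not a fix. It reuses the `SINGULARITYENV_` prefix already used in the batch file for `LD_LIBRARY_PATH`, together with libfabric's `FI_LOG_LEVEL` variable. `container_env` is a hypothetical helper name.

```shell
# Export NAME=VALUE into the container through the SINGULARITYENV_ prefix,
# the same mechanism the batch file uses for LD_LIBRARY_PATH.
# (container_env is a hypothetical helper, not a Singularity command.)
container_env() {
    export "SINGULARITYENV_$1=$2"
}

# Ask libfabric inside the container for debug-level log output.
container_env FI_LOG_LEVEL debug

# Then rerun the same aprun ... singularity run ... line from the batch file.
```

With this set, the rerun should print libfabric's provider discovery messages to stderr, which may show why no provider matched across nodes.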
 