Does anyone have experience with MPICH Hybrid containers on Cray systems, or ideas on how to debug them?
In my latest attempt, using the "MPICH Hybrid Container" example from the Apptainer documentation, I get errors like:
jobStarter
Wed Feb 2 18:41:35 2022: [unset]:_pmi_pals_init:Couldn't open /var/run/palsd/f20b1e77-55e4-499c-a020-b20a1e6476c7/apinfo: No such file or directory
Wed Feb 2 18:41:35 2022: [unset]:_pmi_init:_pmi_pals_init returned -1
Wed Feb 2 18:41:35 2022: [unset]:_pmi_pals_init:Couldn't open /var/run/palsd/f20b1e77-55e4-499c-a020-b20a1e6476c7/apinfo: No such file or directory
Wed Feb 2 18:41:35 2022: [unset]:_pmi_init:_pmi_pals_init returned -1
MPICH ERROR [Rank 0] [job id unknown] [Wed Feb 2 18:41:35 2022] [x1001c7s7b1n1] - Abort(1112591) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(144):
MPID_Init(430).......:
MPIR_pmi_init(78)....: PMI_Init returned 1
...
This is a Cray system using the PBS batch scheduler.
My batch file looks like this:
...
export SINGULARITYENV_LD_LIBRARY_PATH=\
/lib64:\
/opt/cray/pe/gcc-libs:\
/opt/cray/pe/lib64:\
/opt/cray/pe/lib64/cce:\
/usr/lib64:\
/opt/cray/libfabric/1.11.0.4.67/lib64
module swap PrgEnv-cray PrgEnv-gnu
module swap cray-mpich cray-mpich-abi
DPATH_HOST_HOME=$HOME
DPATH_CNTR_HOME=/home/$USER
FPATH_HOST_IMAG=$PBS_O_WORKDIR/image.sif
FPATH_CNTR_APP=/opt/mpitest
/usr/bin/time -p \
aprun -n 8 \
singularity run \
--bind /opt/cray \
--bind /usr/lib64 \
--home $DPATH_HOST_HOME:$DPATH_CNTR_HOME \
$FPATH_HOST_IMAG \
$FPATH_CNTR_APP
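
One debugging idea, offered as a rough sketch rather than a known fix: pull the path that `_pmi_pals_init` failed to open out of the job output, then check whether that path is visible from inside the container. The helper name `missing_pmi_paths` is made up for illustration, and the pattern assumes the "Couldn't open <path>: <reason>" format seen in the errors above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Cray or Singularity tooling):
# read job output on stdin and print each unique path that
# _pmi_pals_init reported it could not open.
# Assumes the log format "..._pmi_pals_init:Couldn't open <path>: <reason>".
missing_pmi_paths() {
    sed -n "s/.*_pmi_pals_init:Couldn't open \([^:]*\):.*/\1/p" | sort -u
}

# Example using one of the error lines from the log above.
printf '%s\n' \
    "Wed Feb 2 18:41:35 2022: [unset]:_pmi_pals_init:Couldn't open /var/run/palsd/f20b1e77-55e4-499c-a020-b20a1e6476c7/apinfo: No such file or directory" \
    | missing_pmi_paths
```

With the path in hand, running something like `singularity exec $FPATH_HOST_IMAG ls -ld /var/run/palsd` under the same `aprun` launch, with the same `--bind` options, would show whether that directory exists inside the container at all. If it does not, that would be consistent with the "No such file or directory" message, since the batch file above binds only `/opt/cray` and `/usr/lib64`. Whether binding that directory in resolves the error is something I have not confirmed.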