Hi Paul & co,
We're trying to run Chapel over GASNet on OLCF Summit and have so far been unsuccessful. We suspect the way we're using the gasnetrun_mpi.pl script is the problem, so I hope you can offer some advice. We've read your notes on running UPC++ on Summit and are trying to replicate that setup as closely as possible with Chapel. Our Chapel configuration is:
CHPL_LLVM=bundled
CHPL_COMM=gasnet
CHPL_COMM_SUBSTRATE=ibv
CHPL_LAUNCHER=gasnetrun_ibv
We're also running with
GASNET_IBV_SPAWNER=mpi
as we don't believe it's possible to spawn processes on Summit using SSH.
If I run a simple 'hello world' program with Chapel on two nodes:
GASNET_IBV_SPAWNER=mpi ./hello4-datapar-dist -nl 2 --verbose
I get the following error:
Not enough hosts LSB_MCPU_HOSTS to satisfy '-N 2'
The first problem we're encountering is that the batch environment variables on Summit use '%020' in place of the space character, and consequently the gasnetrun_mpi.pl script has trouble parsing the host string. At gasnetrun_mpi.pl:573, the script splits the host string on spaces:
my @tmp = split(" ", $ENV{'LSB_MCPU_HOSTS'});
If I change this to
my @tmp = split("%020", $ENV{'LSB_MCPU_HOSTS'});
I no longer get the 'Not enough hosts' error. Did you encounter a similar issue running UPC++ on Summit?
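For what it's worth, an alternative to patching the script that we've been considering is decoding the separators in the job script before the launcher runs. A minimal sketch (the hostnames and per-node widths below are made up for illustration, not real Summit output):

```shell
# Hypothetical LSB_MCPU_HOSTS value as we observe it on Summit, with
# '%020' in place of each space separator:
LSB_MCPU_HOSTS='batch4%0201%020a01n01%02042%020a02n02%02042'

# Decode '%020' back to spaces so downstream tools can split normally.
LSB_MCPU_HOSTS="${LSB_MCPU_HOSTS//%020/ }"
export LSB_MCPU_HOSTS

echo "$LSB_MCPU_HOSTS"
```

We haven't confirmed whether Summit's batch system would tolerate this, so we'd welcome your thoughts on whether fixing it in the script or in the environment is the better place.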
After fixing the split issue above, I get the following error:
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: batch4
Framework: pml
Component: pami
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[batch4:2823868] *** An error occurred in MPI_Init_thread
[batch4:2823868] *** reported by process [2947153921,0]
[batch4:2823868] *** on a NULL communicator
[batch4:2823868] *** Unknown error
[batch4:2823868] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[batch4:2823868] *** and potentially your MPI job)
[batch4:2823788] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[batch4:2823788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
batch4 is a Summit launcher node (not a compute node), so it looks to me like MPI is trying to run a Chapel process on the launcher node. I notice that gasnetrun_mpi.pl contains logic for systems like Summit that use launcher nodes: it attempts to rewrite the LSB_MCPU_HOSTS environment variable to remove the first node if its width differs from the others. However, this logic only takes effect if jsrun is detected. The is_jsrun test at gasnetrun_mpi.pl:113 relies on mpirun --help printing a line containing "jsrun --usage", which is not the case with IBM Spectrum MPI 10.4 on Summit, so the LSB_MCPU_HOSTS variable never gets rewritten.
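To make sure we've understood the intent correctly: here is a sketch, in shell, of the rewrite we believe the script would apply if jsrun were detected. It drops the leading host/width pair when its width differs from the next one, on the assumption that it names the launcher node (hostnames and widths are again made up):

```shell
# Hypothetical decoded LSB_MCPU_HOSTS: launcher node 'batch4' with width 1,
# followed by two compute nodes with width 42 each.
LSB_MCPU_HOSTS='batch4 1 a01n01 42 a02n02 42'

# Split into positional parameters: host1 width1 host2 width2 ...
set -- $LSB_MCPU_HOSTS

# If the first width differs from the second, assume the first entry is the
# launcher node and remove it along with its width.
if [ "$2" != "$4" ]; then
    shift 2
fi

LSB_MCPU_HOSTS="$*"
export LSB_MCPU_HOSTS

echo "$LSB_MCPU_HOSTS"
```

If that reading is right, then applying this rewrite manually (or relaxing the is_jsrun detection) might be a reasonable workaround for us.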
Did you encounter any of these issues in configuring UPC++ to run on Summit? Are we doing something obviously silly?