running Chapel over GASNet on Summit


Josh Milthorpe

Feb 16, 2023, 4:33:05 PM
to gasnet...@lbl.gov
Hi Paul & co,

We're trying to run Chapel over GASNet on OLCF Summit and have so far been unsuccessful. We think the way we're using the gasnetrun-mpi.pl script may be the problem, so I hope you can offer some advice. We've read your notes for UPC++ and we're trying to replicate that setup as far as possible with Chapel.

Chapel includes the gasnetrun-mpi.pl script, which seems to be a direct lift from the GASNet repository. We're building Chapel with the following settings:

CHPL_LLVM=bundled
CHPL_COMM=gasnet
CHPL_COMM_SUBSTRATE=ibv
CHPL_LAUNCHER=gasnetrun_ibv


We're also running with 

GASNET_IBV_SPAWNER=mpi

as we don't believe it's possible to spawn processes on Summit using SSH.

If I run a simple 'hello world' program with Chapel on two nodes:

GASNET_IBV_SPAWNER=mpi ./hello4-datapar-dist -nl 2 --verbose

I get the following error:

 Not enough hosts LSB_MCPU_HOSTS to satisfy '-N 2'

The first problem we're encountering is that the batch environment variables on Summit are encoded with '%020' instead of the space character, and consequently the gasnetrun-mpi.pl script seems to have trouble separating the host string. At gasnetrun-mpi.pl:573, it tries to split the host string by spaces:

my @tmp = split(" ", $ENV{'LSB_MCPU_HOSTS'});

If I change this to 

my @tmp = split("%020", $ENV{'LSB_MCPU_HOSTS'});

I no longer get the 'Not enough hosts' error. Did you encounter a similar issue running UPC++ on Summit?
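
To illustrate why the separator matters, here is a rough shell sketch of the same kind of split (not the actual Perl in gasnetrun-mpi.pl). The function name and the sample host names are made up for illustration; the only assumption about the variable is the usual LSF layout of alternating host names and slot counts.

```shell
# Hypothetical sample in the LSF "host slots host slots ..." layout.
sample='batch4 1 h41n01 42 h41n02 42'

# Print one host name per line, i.e. every other whitespace-separated field.
list_hosts() {
    # $1 is left unquoted on purpose so the shell splits it on whitespace.
    set -- $1
    while [ "$#" -ge 2 ]; do
        printf '%s\n' "$1"
        shift 2
    done
}

list_hosts "$sample"
```

If the string contains no literal spaces, the split yields a single field and no hosts are listed, which would be consistent with the 'Not enough hosts' message.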

After fixing the split issue above, I get the following error:

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      batch4
Framework: pml
Component: pami
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[batch4:2823868] *** An error occurred in MPI_Init_thread
[batch4:2823868] *** reported by process [2947153921,0]
[batch4:2823868] *** on a NULL communicator
[batch4:2823868] *** Unknown error
[batch4:2823868] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[batch4:2823868] ***    and potentially your MPI job)
[batch4:2823788] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[batch4:2823788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


batch4 is a Summit launcher node (not a compute node), and it looks to me like MPI is trying to run a Chapel process on the launcher node. I notice that gasnetrun-mpi.pl contains logic for systems like Summit that use launcher nodes, which attempts to rewrite the LSB_MCPU_HOSTS environment variable to remove the first node if it is of a different width than the others. However, this logic only takes effect if jsrun is detected. The is_jsrun test at gasnetrun-mpi.pl:113 relies on mpirun --help returning a line of text including "jsrun --usage", which is not the case with IBM Spectrum MPI 10.4 on Summit, hence the LSB_MCPU_HOSTS variable does not get rewritten.
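
For reference, the detection described above amounts to something like the following shell check. This is only a sketch of the idea, not the Perl code from gasnetrun-mpi.pl, and the function name is hypothetical.

```shell
# Succeeds if the help text supplied on stdin mentions "jsrun --usage".
help_mentions_jsrun() {
    grep -q -F -e 'jsrun --usage'
}

# Intended use (commented out because it needs an MPI installation):
# mpirun --help 2>&1 | help_mentions_jsrun && echo "looks like jsrun"
```

When the help text does not contain that marker, the check fails and the host list is left untouched.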

Did you encounter any of these issues in configuring UPC++ to run on Summit? Are we doing something obviously silly?


Elliott Slaughter

Feb 16, 2023, 4:44:56 PM
to Josh Milthorpe, gasnet...@lbl.gov
Josh,

For what it's worth, we launch Legion on Summit using jsrun, and we build GASNet with --enable-mpi-compat to ensure we can connect to the job launcher.

We launch the job like you'd launch any other MPI+X job: jsrun -n $NRANKS ... ./application

Overall this has worked pretty well for us. The only issue I can recall has to do with the GPU shim, but that's mostly unrelated.

--
You received this message because you are subscribed to the Google Groups "gasnet-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gasnet-users...@lbl.gov.
To view this discussion on the web visit https://groups.google.com/a/lbl.gov/d/msgid/gasnet-users/CAGf3CU-8mOko0sqWDr%2Bz-DZQKFkANHJwNxJ4bu%2BSpRRigTowxw%40mail.gmail.com.


--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay

Paul H. Hargrove

Feb 16, 2023, 5:25:41 PM
to Josh Milthorpe, gasnet...@lbl.gov
Josh,

You are right that `batch4` is not a compute node, but if things were working "properly" then you'd not have reached that `split` of a variable that included it.

If you configure using `--with-mpirun-cmd="jsrun -p %N %C"` you should be able to get past the spawn issues using `gasnetrun_ibv`.
I actually recommend the following configure options for Summit:

  --with-cxx=mpicxx --with-cc=mpicc  --with-mpirun-cmd="jsrun -p %N %C" \
  --disable-pshm-posix --enable-pshm-sysv  --disable-smp --enable-udp --enable-mpi \
  --with-default-network=ibv  --enable-ibv  --with-ibv-physmem-max=2/3  --enable-ibv-odp \
  --enable-ibv-multirail  --with-ibv-max-hcas=2  --with-ibv-ports="mlx5_0+mlx5_3" \
  --disable-ibv-conn-thread  --disable-ibv-rcv-thread


-Paul



--
Paul H. Hargrove <PHHar...@lbl.gov>
Pronouns: he, him, his
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department
Lawrence Berkeley National Laboratory

Josh Milthorpe

Feb 16, 2023, 7:34:40 PM
to Paul H. Hargrove, gasnet...@lbl.gov
Thanks Paul and Elliott!

I reconfigured GASNet as Paul suggested. Running with gasnetrun_ibv I got the following error, which I think is coming from IBM Spectrum mpirun:

Error: Request for environment variable HOSTNAME to be propagated (-E) cannot be satisfied.
Exporting this envionment variable is not allowed.


It seems like we're not allowed to propagate any of the environment variables HOSTNAME,USER,SHELL,PWD.

I was able to hack around this (and successfully run my Chapel program) by explicitly removing these variables from the -E environment variable propagation string, that is, by changing line gasnetrun.pl:113 to:

            my $tmpenv = $ARGV[0];
            $tmpenv =~ s/HOSTNAME,//g;
            $tmpenv =~ s/USER,//g;
            $tmpenv =~ s/PWD,//g;
            $tmpenv =~ s/SHELL,//g;
            push @mpi_args, $tmpenv;


I feel like carving up the environment string like this shouldn't be necessary, or maybe there is a better way?


Paul H. Hargrove

Feb 16, 2023, 9:20:57 PM
to Josh Milthorpe, gasnet...@lbl.gov
Josh,

If I am not mistaken, it is some part of Chapel that is making the questionable request to forward `HOSTNAME` (and the others?) via `-E`.
It would probably be better to consider fixing the behavior there if possible.

Looking at some of our CI infrastructure for testing Chapel over GASNet, I see we prefix `env -u HOSTNAME ` to the executable to be run, and the following is the commit message from the commit which added that behavior:

    Chapel CI: Fix for runs on Summit

    This commit add `HOSTNAME` to a list of black-listed env vars, because
    Summit sets this and `jsrun` considers it a fatal error when Chapel's
    launcher requests it to be forwarded to the compute nodes.

So, unless you actually see issues with the other three, I'd suggest something like the following as the simplest solution:

env -u HOSTNAME ./my_app -nl [locales] [...]

or even just `unset HOSTNAME` inside your batch script.

-Paul

Josh Milthorpe

Feb 17, 2023, 12:34:56 AM
to Paul H. Hargrove, gasnet...@lbl.gov
Thanks - adding env -u HOSTNAME -u SHELL -u USER -u PWD works, and is the least invasive solution for Summit.
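
In case it helps anyone who finds this thread later, the workaround can be wrapped in a small shell function inside the batch script. This is a minimal sketch: the function name is hypothetical and `./my_app` stands in for whatever Chapel executable is being launched.

```shell
# Run a command with the variables that jsrun refuses to forward
# removed from its environment; everything else is passed through.
run_without_blocked_vars() {
    env -u HOSTNAME -u USER -u SHELL -u PWD "$@"
}

# Example (commented out; ./my_app is a placeholder):
# run_without_blocked_vars ./my_app -nl 2
```

This keeps the change local to the launch line, so neither the launcher script nor the GASNet build needs to be patched.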

Josh Milthorpe

Mar 1, 2023, 8:09:30 PM
to Paul H. Hargrove, gasnet...@lbl.gov
A further note: if I build with --disable-ibv-rcv-thread as suggested, then at runtime I get the error:

*** FATAL ERROR (proc 0): in gasnetc_load_settings() at third-party/gasnet/gasnet-src/ibv-conduit/gasnet_core.c:930: AM receive thread enabled by environment variable GASNET_RCV_THREAD, but was disabled at GASNet build time

This seems odd because I hadn't set GASNET_RCV_THREAD in my environment, and I still get the error even if I explicitly set GASNET_RCV_THREAD=0. As a workaround, I can remove --disable-ibv-rcv-thread from the GASNet build options.

Paul H. Hargrove

Mar 1, 2023, 9:05:45 PM
to Josh Milthorpe, gasnet...@lbl.gov
Josh,

Looking at their sources, I see that Chapel is setting GASNET_RCV_THREAD=1 and I am not aware of any means to disable that behavior (which is not to say there is none). So, removing --disable-ibv-rcv-thread from the GASNet build options is probably the right choice.

-Paul