FW: Building UPC++ on HIP


Powell, Amy Jo

Sep 23, 2024, 5:32:18 PM
to gasnet...@lbl.gov, gasnet...@lbl.gov, Elliott, James John, ServiceNow

Hello,


I’m trying to build UPC++ on an MI250, and then an MI300a.

 

I’m not getting correct UPC++ behavior (scaling across nodes), because I believe my configuration is wrong, specifically:

 

--with-pmirun-cmd='flux run -n %N %C' \

 

I do not think this is the correct UPC++ configure option for Flux machines.

 

I’m including LLNL, because I’m trying to run on their machines (Tioga, RZVernal).

 

Many thanks for any guidance you can offer!

 

Best,


AJP

 

 

 

 

From: Amy Jo Powell <ajp...@sandia.gov>
Date: Monday, September 23, 2024 at 3:16 PM
To: Rob Egan <rse...@lbl.gov>, Steven Hofmeyr <shof...@lbl.gov>
Subject: Building UPC++ on HIP

 

Hi Rob and Steve,

 

I don’t suppose you know how to build UPC++ on AMD machines, do you?

 

I have a historical record of something that was “kinda sorta” working, but not really.

 

At issue is getting UPC++ to scale correctly over nodes.

 

The `--with-pmirun-cmd` argument is something I cobbled together a while ago, without actually knowing whether it was correct (see the configure statement below).

 

If you guys run on AMD machines, how do you configure?

 

 

 

../configure \
  --with-cxx=mpicxx \
  --with-cc=mpicc \
  --with-pmi-rpath \
  --with-pmirun-cmd='flux run -n %N %C' \
  --disable-hugetlbfs \
  --disable-smp \
  --enable-udp \
  --enable-hip \
  --enable-mpi \
  --enable-ofi \
  --with-cxxflags=-std=c++17 \
  --with-gasnet=https://bitbucket.org/berkeleylab/gasnet/downloads/GASNet-stable.tar.gz \
  --with-default-network=ofi \
  --with-ofi-provider=cxi \
  --prefix=${INSTALL_DIR}

 

Paul H. Hargrove

Sep 25, 2024, 3:04:27 PM
to Powell, Amy Jo, gasnet...@lbl.gov, gasnet...@lbl.gov, Elliott, James John, ServiceNow
Amy,

I do not believe there is any connection between the use of AMD hardware (CPU or GPU) and the issues you are having.
These seem to be related to the system's job spawner, which may involve SSH, PMI, or MPI.

For an HPE Cray EX system like Tioga or RZVernal, my experience is limited to `srun` over SLURM (PMI-based launch).

My suggestions include:
  1. Regardless of using `srun` versus `flux run`, I suggest adding `--with-ofi-spawner=pmi` to the configure command line to ensure MPI is not used for job launch.
  2. The `--with-pmirun-cmd` setting looks reasonable and might be right for your system, but (as noted above) it is outside my own experience.
  3. I recommend trying `PMIRUN_CMD='srun -n %N -- %C'` in your environment (without rerunning configure) to see whether srun is a better option than `flux run`.  This environment variable overrides the default set using the `--with-pmirun-cmd` configure option.  If that solves the problem(s), then you can go back to the configure step to make srun the default.
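For anyone unsure what the `%N`/`%C` placeholders in these templates mean: `%N` is replaced by the process count and `%C` by the program command line at launch time. A quick sketch of that substitution (the template is suggestion 3 above; the program name `./test-upcxx` is purely illustrative, and this uses `sed` to mimic the expansion rather than any actual GASNet tooling):

```shell
# Illustrative expansion of a PMIRUN_CMD template:
# %N -> process count, %C -> program command line.
TEMPLATE='srun -n %N -- %C'
NPROCS=4
PROG='./test-upcxx'   # hypothetical test binary

# Substitute the placeholders (| delimiter avoids clashing with the ./ in PROG).
LAUNCH=$(printf '%s' "$TEMPLATE" | sed -e "s/%N/$NPROCS/" -e "s|%C|$PROG|")
echo "$LAUNCH"
# -> srun -n 4 -- ./test-upcxx
```

This is only meant to show what the spawner will ultimately execute, which can help sanity-check a template before wiring it into configure.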

-Paul

--
You received this message because you are subscribed to the Google Groups "gasnet-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gasnet-devel...@lbl.gov.
To view this discussion on the web visit https://groups.google.com/a/lbl.gov/d/msgid/gasnet-devel/99603A2D-02C9-4C32-96D0-B96E9364DD5D%40sandia.gov.


--
Paul H. Hargrove <PHHar...@lbl.gov>
Pronouns: he, him, his
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department
Lawrence Berkeley National Laboratory