Greetings,
I'm a CS Ph.D. student at ASU working on several projects that use UPC++. I've built UPC++ myself as well as used a module that the IT team here installed on our cluster for a variety of projects, but each install (including the module version) has its own set of problems. What I'm looking to do is use UPC++ with its memory kinds feature to do multi-node runs where each node owns one or more GPUs.
I'm currently using ASU's Agave supercomputer, where a subset of the nodes are connected with InfiniBand, and a subset of those nodes have GPUs.
Previously I did an IB-network build of UPC++ without GPUs, and I thought I had it working with GPUs as well, but with that build I was actually just running more ranks on a single node, and multi-node runs were failing. So I set about rebuilding UPC++, and here is my situation.
Here is the UPC++ configuration I am trying:
$UPCXX_SOURCE/configure \
  --prefix=$UPCXX_INSTALL \
  --enable-cuda \
  --with-cxx=mpicxx
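After configuring, I follow the usual UPC++ build steps; the trouble shows up at the make check stage:

make all       # build the UPC++ runtime and tools
make check     # compile and run the bundled tests
make install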
I load the following modules for the compilers to use with UPC++:
module load gcc/10.3.0 (the most recent one on the cluster with a corresponding openmpi module)
module load openmpi/4.1.1-gcc-10.3.0
module load cuda/11.6.0
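For reference, after loading these I check which toolchain the wrappers actually resolve to with something like:

which mpicxx
mpicxx --version           # should report g++ 10.3.0
mpicxx --showme:command    # Open MPI wrapper: prints the underlying compiler invocation
nvcc --version             # should report CUDA 11.6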
I have a few environment variables set as well:
export UPCXX_NETWORK=ibv
export UPCXX_GASNET_CONDUIT=ibv
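With those set, a typical multi-node launch of one of my programs looks roughly like this (the binary name and process counts are just placeholders):

# 4 processes over 2 nodes, the idea being one GPU per process on the GPU nodes
upcxx-run -n 4 -N 2 ./my-upcxx-test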
This build process runs into different issues depending on which nodes I use. For instance, I can get all the way to the "make check" phase, and then a few different things can happen.
First, when running the tests I often hit the problem that the compiled binaries require a version of libstdc++ that cannot be found. Here is a sample error message:
"/lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found"
The correct version of libstdc++.so.6 is present in the cluster's package tree (something like /packages/gcc/gcc-10.3.0/ and so on), but the binaries are looking in /lib64/ for some reason. Interestingly, I have never had this issue when building SMP executables. If I explicitly point to the correct library files (using LD_LIBRARY_PATH) or change the makefiles, the compile will succeed; a sketch of the workaround is below.
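Concretely, the workaround amounts to something like this (the exact library subdirectory under the gcc 10.3.0 package tree is approximate):

# point the loader at the gcc 10.3.0 runtime instead of the system /lib64
export LD_LIBRARY_PATH=/packages/gcc/gcc-10.3.0/lib64:$LD_LIBRARY_PATH

In other cases I get a timeout in the run phase of the tests, and see the following messages: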
"
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   s76-2
  Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
  Local host: s76-3
  PID:        18468
  Message:    connect() to 169.254.0.2:1026 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
[s76-2.agave.rc.asu.edu:13898] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[s76-2.agave.rc.asu.edu:13898] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[13867,1],0]
  Exit code:    124
"
Another strange error occurs when I attempt the make check step on nodes with GPUs. During the compile step I get this error:

error while loading shared libraries: libXNVCtrl.so.0: cannot open shared object file: No such file or directory
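The check I would use to confirm which library is not being resolved is something like the following (the binary name is just a placeholder):

# list unresolved shared-library dependencies of one of the test binaries
ldd ./some-test-binary | grep "not found"
# look for the missing NVIDIA library anywhere under the package tree
find /packages -name 'libXNVCtrl*' 2>/dev/null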
There are IB-connected nodes with GPUs on the cluster I am working on, so what I am trying to do should be possible in principle, but I keep running into hang-ups.
Any ideas what is going wrong? I would greatly appreciate any and all help!
Regards,
Kirtus Leyba