Issue with OpenCoarrays and MVAPICH2-2.2

Sayan Ghosh

Jun 21, 2018, 5:57:04 PM
to OpenCoarrays
I built the latest OpenCoarrays master from the git repository (using install.sh) against MVAPICH2-2.2 on the Argonne LCRC Blues cluster (MVAPICH2 was built with ch3:psm). I use the GNU 6.1 compilers. OpenCoarrays built without issues, but my program segfaults when I try to execute it. The same code runs on my laptop, which uses MPICH 3.2, GNU 6.1 and the latest OpenCoarrays. Is this a known problem?

[sghosh@blogin3 put-get]$ cafrun -np 2 ./get_bw

get_bw:78930 terminated with signal 11 at PC=2b0e7306d601 SP=7ffe0f245870.  Backtrace:

get_bw:78931 terminated with signal 11 at PC=2aea06545601 SP=7ffe80131320.  Backtrace:
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPID_Isend+0x131)[0x2b0e7306d601]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPID_Isend+0x131)[0x2aea06545601]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIC_Sendrecv+0x119)[0x2aea064ddb99]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIC_Sendrecv+0x119)[0x2b0e73005b99]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_pt2pt_rd_MV2+0x39b)[0x2b0e72f034ab]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_pt2pt_rd_MV2+0x39b)[0x2aea063db4ab]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_index_tuned_intra_MV2+0x50e)[0x2b0e72f082de]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_index_tuned_intra_MV2+0x50e)[0x2aea063e02de]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_impl+0x26)[0x2b0e72eaaa56]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_impl+0x26)[0x2aea06382a56]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid_sparse_group+0x3b6)[0x2b0e730291a6]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid_sparse_group+0x3b6)[0x2aea065011a6]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid+0x10)[0x2aea06501630]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid+0x10)[0x2b0e73029630]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_copy+0x32)[0x2b0e7302a592]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_copy+0x32)[0x2aea06502592]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_dup_impl+0x44)[0x2b0e72fa0ff4]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_dup_impl+0x44)[0x2aea06478ff4]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPI_Comm_dup+0x158)[0x2b0e72fa11c8]
/home/sghosh/builds/opencaf/lib64/libcaf_mpi.so.1(_gfortran_caf_init+0x94)[0x2b0e72c41a04]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPI_Comm_dup+0x158)[0x2aea064791c8]
/home/sghosh/builds/opencaf/lib64/libcaf_mpi.so.1(_gfortran_caf_init+0x94)[0x2aea06119a04]
./get_bw[0x4019d9]
./get_bw[0x4019d9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b0e74068d1d]
./get_bw[0x401029]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2aea07540d1d]
./get_bw[0x401029]
Error: Command:
   `/home/sghosh/builds/mvapich222-gnu-psm/bin/mpirun -np 2 --disable-auto-cleanup ./get_bw`
failed to run.

Alessandro Fanfarillo

Jun 21, 2018, 7:18:25 PM
to Sayan Ghosh, OpenCoarrays
Can you post your code?

--
You received this message because you are subscribed to the Google Groups "OpenCoarrays" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencoarrays...@googlegroups.com.
Visit this group at https://groups.google.com/group/opencoarrays.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencoarrays/eda92b83-4f64-4fcd-8f50-8d0132eed8a5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sayan Ghosh

Jun 21, 2018, 7:30:48 PM
to OpenCoarrays
After a closer inspection with Valgrind, it appears that the bug is somewhere in my code: I am doing an illegal write, caused by passing a wrong index to a coarray, which in turn corrupts the displacement parameter of the corresponding MPI_Put. Sorry for the misinterpretation.

//Sayan  

Sayan Ghosh

Jun 21, 2018, 9:36:50 PM
to OpenCoarrays
Even after fixing the code and checking that Valgrind no longer complains of any leaks on my laptop, I am still unable to run it with the OpenCoarrays+MVAPICH2 configuration on the cluster (commenting out the communication part produces the same error). One small correction: I use GCC 6.3 on my laptop (not 6.1), whereas on Blues I use GCC 6.1. I have attached the code here.

[sghosh@blogin3 put-get]$ mpif90 -fcoarray=lib -I/home/sghosh/builds/opencaf/include -Wall -o get_bw get_bw.f90 -L/home/sghosh/builds/opencaf/lib64 -lcaf_mpi -L/home/sghosh/builds/mvapich222-gnu-psm/lib -lmpi -lmpifort -lmpich
[sghosh@blogin3 put-get]$ cafrun -n 2 ./get_bw

get_bw:89652 terminated with signal 11 at PC=2b852443d601 SP=7ffc600491b0.  Backtrace:
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPID_Isend+0x131)[0x2b852443d601]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIC_Sendrecv+0x119)[0x2b85243d5b99]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_pt2pt_rd_MV2+0x39b)[0x2b85242d34ab]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_index_tuned_intra_MV2+0x50e)[0x2b85242d82de]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_impl+0x26)[0x2b852427aa56]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid_sparse_group+0x3b6)[0x2b85243f91a6]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid+0x10)[0x2b85243f9630]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_copy+0x32)[0x2b85243fa592]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_dup_impl+0x44)[0x2b8524370ff4]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPI_Comm_dup+0x158)[0x2b85243711c8]
/home/sghosh/builds/opencaf/lib64/libcaf_mpi.so.1(_gfortran_caf_init+0x94)[0x2b8524011a04]
./get_bw[0x40195c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b8525438d1d]
./get_bw[0x4010a9]

get_bw:89653 terminated with signal 11 at PC=2b30ff2e2601 SP=7fff6c1bc7c0.  Backtrace:
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPID_Isend+0x131)[0x2b30ff2e2601]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIC_Sendrecv+0x119)[0x2b30ff27ab99]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_pt2pt_rd_MV2+0x39b)[0x2b30ff1784ab]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_index_tuned_intra_MV2+0x50e)[0x2b30ff17d2de]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Allreduce_impl+0x26)[0x2b30ff11fa56]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid_sparse_group+0x3b6)[0x2b30ff29e1a6]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Get_contextid+0x10)[0x2b30ff29e630]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_copy+0x32)[0x2b30ff29f592]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPIR_Comm_dup_impl+0x44)[0x2b30ff215ff4]
/home/sghosh/builds/mvapich222-gnu-psm/lib/libmpi.so.12(MPI_Comm_dup+0x158)[0x2b30ff2161c8]
/home/sghosh/builds/opencaf/lib64/libcaf_mpi.so.1(_gfortran_caf_init+0x94)[0x2b30feeb6a04]
./get_bw[0x40195c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b31002ddd1d]
./get_bw[0x4010a9]
Error: Command:
   `/home/sghosh/builds/mvapich222-gnu-psm/bin/mpirun -np 2 --disable-auto-cleanup ./get_bw`
failed to run.


get_bw.f90

Zaak Beekman

Jun 22, 2018, 5:00:12 PM
to OpenCoarrays
Hi Sayan, a few quick notes:

1. The official bug reporting location for OpenCoarrays is https://github.com/sourceryinstitute/OpenCoarrays/issues/new. It would be much appreciated if you could file this bug report there, including the information requested in the bug report template.
2. MVAPICH is not officially supported; if you can test with MPICH or even Open MPI (Open MPI is not *officially* supported, but it is more thoroughly tested than MVAPICH), you may have more luck.
3. GFortran 6.1 is quite old (for us), and it may be missing numerous bug fixes. If you can upgrade to 6.4, or preferably to a more recent release, that would help.
4. How did you build/install OpenCoarrays? (With what flags/options, and using which build technique, e.g., install.sh, CMake, or a makefile?)

Thanks,
Zaak

Sayan Ghosh

Jun 22, 2018, 9:29:55 PM
to OpenCoarrays
A quick note: with MPICH 3.2 (built with the ch3:nemesis:ofi device), I am able to run OpenCoarrays. I'll investigate the MVAPICH2 problem a bit more before filing an issue.

Zaak Beekman

Jun 23, 2018, 4:57:09 PM
to Sayan Ghosh, OpenCoarrays
Great, thanks for the update and please keep us posted with any further details you may learn!