ABySS with SGI's MPT hangs


Haruna Cofer

Nov 28, 2011, 12:32:33 PM
to ABySS
Hi Shaun!

ABySS seems to hang when used with SGI's optimized MPI library (called
MPT). This appears to be due to an MPI_Irecv that does not have a
corresponding MPI_Isend, so when MPI_Finalize is called, MPT waits on
the outstanding receive, while OpenMPI ignores it.

I think I found the outstanding MPI_Irecv, but I can't follow the code
well enough to suggest an appropriate fix. Could you please take a
look? Using the ABySS 1.3.1 source, I believe the MPI_Irecv is in
Parallel/CommLayer.cpp on line 22:

MPI_Irecv(m_rxBuffer, RX_BUFSIZE,
        MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
        MPI_COMM_WORLD, &m_request);

And it is called during this constructor statement in
Parallel/parallelAbyss.cpp on line 84:

NetworkSequenceCollection networkSeqs;

The trace looks like this (the line numbers are close but not exact,
because this trace was from ABySS 1.3.0 and I also added extra print
statements for myself in the code):

(gdb) where
#0  0x0000000000407d73 in CommLayer::CommLayer (this=0x7fffffffda00) at CommLayer.cpp:14
#1  0x0000000000418ccc in MessageBuffer::MessageBuffer (this=0x7fffffffda70) at MessageBuffer.cpp:8
#2  0x0000000000406dfe in NetworkSequenceCollection::NetworkSequenceCollection (this=0x7fffffffd9f0) at NetworkSequenceCollection.h:47
#3  0x000000000040570c in main (argc=10, argv=0x7fffffffde38) at parallelAbyss.cpp:99

Thank you, Shaun! I appreciate your insight on this!

-- Haruna :)


Shaun Jackman

Nov 28, 2011, 6:59:05 PM
to Haruna Cofer, ABySS
Hi Haruna,

The MPI_Irecv call is paired with MPI_Send. The MPI calls of ABySS are
not perfectly portable. ABySS requires that MPI_Send never block, which
is true for OpenMPI when the messages are small, but may not be true for
other implementations.

ABySS's messages are smaller than 4 kB. The default eager send limit of
OpenMPI is 4 kB for shared memory and 64 kB for TCP. Perhaps you can
find similar parameters for MPT.

$ ompi_info | grep eager_limit
MCA btl: parameter "btl_sm_eager_limit" (current value: "4096", data source: default value)
MCA btl: parameter "btl_tcp_eager_limit" (current value: "65536", data source: default value)

Cheers,
Shaun

Shaun Jackman

Feb 28, 2012, 7:47:40 PM
to Haruna Cofer, ABySS
Thanks for the patch, Haruna! I’ve applied your patch, and it will be included with the release of ABySS 1.3.3.

Cheers,
Shaun

On 2012-02-28, at 12:06 , Haruna Cofer wrote:

> Hi Shaun!
>
> It's been a few months, but I think one of our engineers fixed this! The original problem was that ABySS compiled with SGI's MPT would hang and run forever. The fix is two-fold:
>
> 1. In ABySS 1.3.2, add a call to MPI_Cancel(&m_request) in ~CommLayer(). This frees the last pending receive, before deleting the buffer:
>
> --- abyss-1.3.2/Parallel/CommLayer.cpp	2011-09-16 15:39:48.000000000 -0700
> +++ abyss-1.3.2.fix/Parallel/CommLayer.cpp	2012-02-08 21:50:30.029148001 -0800
> @@ -26,6 +26,7 @@
>
>  CommLayer::~CommLayer()
>  {
> +	MPI_Cancel(&m_request);
>  	delete[] m_rxBuffer;
>  	logger(1) << "Sent " << m_msgID << " control, "
>  		<< m_txPackets << " packets,
>
> 2. In SGI's MPT, add a call to the progress engine in MPI_Request_get_status(), because this was not being done.
>
> -----
>
> The MPT fix will be in MPT 2.06, which will be released in May 2012. Would it be OK for you to add the MPI_Cancel() call in ABySS? Thank you, Shaun!
>
> -- Haruna :)

Christopher Fields

Apr 17, 2012, 5:53:16 PM
to abyss...@googlegroups.com, Haruna Cofer
We are attempting to compile ABySS on our local SGI Altix and have had no luck, mainly due to what appear to be MPI linkage issues. Any hints? Simply pointing at the MPI include files doesn't seem to help.

chris

Shaun Jackman

Apr 17, 2012, 5:55:48 PM
to Christopher Fields, abyss...@googlegroups.com, Haruna Cofer
Hi Chris,

Please report the command line and error message.

Cheers,
Shaun

Haruna Cofer

Apr 17, 2012, 6:09:24 PM
to ABySS
Hi Chris,

You need to compile with the -DMPI_NO_CPPBIND option. For example
(setting the CFLAGS and CXXFLAGS variables as below):

./configure --prefix=$PBS_O_WORKDIR/dist.gcc.mpt --with-mpi=$MPI_ROOT \
	--disable-openmp --with-boost=$PBS_O_WORKDIR/../boost_1_47_0/boost \
	CPPFLAGS=-I$PBS_O_WORKDIR/../sparsehash-1.9/dist/include \
	CFLAGS="-DMPI_NO_CPPBIND" CXXFLAGS="-DMPI_NO_CPPBIND"

-- Haruna :)

Fields, Christopher J

Apr 17, 2012, 6:13:42 PM
to Shaun Jackman, abyss...@googlegroups.com, Haruna Cofer
Shaun,

Initial configuration:

cfields@ember:~/ev6/builds/abyss-1.3.3> CPPFLAGS="-I/u/ac/cfields/ev6/opt/include" CXXFLAGS="-lmpi" ./configure --prefix=/u/ac/cfields/ev6/opt/ --with-mpi=/usr/local/sgi/mpt/mpt-2.02

Configuration works fine; sparsehash, boost, and MPT are found. One thing our local docs mention about the '-lmpi' flag is that it should be added at the end of the command line:

gcc myprog.c -lmpi

I'm not sure if this is possible using simple environment variables, though; it seems this would require some retooling of the Makefile. Note also that we're using an older MPT than the one Haruna mentions, but the error occurs during compilation, not execution. If I simply run 'make' after configuration, I get this (truncated to the Parallel part):

...
Making all in Parallel
make[2]: Entering directory `/gpfs2/projects/ev6/builds/abyss-1.3.3/Parallel'
g++ -DHAVE_CONFIG_H -I. -I.. -I.. -I../Assembly -I../Common -I../DataLayer -I. -I/usr/local/sgi/mpt/mpt-2.02/include -I/u/ac/cfields/ev6/opt/include -Wall -Wextra -Werror -lmpi -MT ABYSS_P-parallelAbyss.o -MD -MP -MF .deps/ABYSS_P-parallelAbyss.Tpo -c -o ABYSS_P-parallelAbyss.o `test -f 'parallelAbyss.cpp' || echo './'`parallelAbyss.cpp
cc1plus: warnings being treated as errors
In file included from /usr/local/sgi/mpt/mpt-2.02/include/mpi.h:1383,
                 from CommLayer.h:5,
                 from NetworkSequenceCollection.h:7,
                 from parallelAbyss.cpp:3:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In copy constructor ‘PMPI::Op::Op(const PMPI::Op&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:651: error: ‘PMPI::Op::mpi_op’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:648: error: ‘void (* PMPI::Op::op_user_function)(const void*, void*, int, const PMPI::Datatype&)’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:627: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Prequest::Prequest(const MPI::Request&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2541: error: ‘MPI::Prequest::pmpi_request’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2506: error: base ‘MPI::Request’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2506: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Prequest::Prequest(const PMPI::Prequest&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2541: error: ‘MPI::Prequest::pmpi_request’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2508: error: base ‘MPI::Request’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2508: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Prequest::Prequest(const MPI_Request&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2541: error: ‘MPI::Prequest::pmpi_request’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2510: error: base ‘MPI::Request’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2510: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In copy constructor ‘MPI::Comm::Comm(const MPI::Comm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3117: error: ‘MPI::Comm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2824: error: base ‘MPI::Comm_Null’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2824: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Comm::Comm(const MPI_Comm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3117: error: ‘MPI::Comm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2827: error: base ‘MPI::Comm_Null’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2827: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Comm::Comm(const PMPI::Comm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3117: error: ‘MPI::Comm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2829: error: base ‘MPI::Comm_Null’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:2829: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In copy constructor ‘MPI::Intracomm::Intracomm(const MPI::Intracomm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3464: error: ‘MPI::Intracomm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3278: error: base ‘MPI::Comm’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3278: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Intracomm::Intracomm(const MPI_Comm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3464: error: ‘MPI::Intracomm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3286: error: base ‘MPI::Comm’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3286: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Intracomm::Intracomm(const PMPI::Intracomm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3464: error: ‘MPI::Intracomm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3288: error: base ‘MPI::Comm’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3288: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Cartcomm::Cartcomm(const PMPI::Cartcomm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3543: error: ‘MPI::Cartcomm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3482: error: base ‘MPI::Intracomm’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3482: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Graphcomm::Graphcomm(const PMPI::Graphcomm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3622: error: ‘MPI::Graphcomm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3563: error: base ‘MPI::Intracomm’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3563: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In copy constructor ‘MPI::Intercomm::Intercomm(const MPI::Intercomm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3704: error: ‘MPI::Intercomm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3645: error: base ‘MPI::Comm’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3645: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In constructor ‘MPI::Intercomm::Intercomm(const PMPI::Intercomm&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3704: error: ‘MPI::Intercomm::pmpi_comm’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3646: error: base ‘MPI::Comm’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:3646: error: when initialized here
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In static member function ‘static int PMPI::Comm::NULL_COPY_FN(const PMPI::Comm&, int, void*, void*, void*, MPI2CPP_BOOL_T&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:4885: error: the address of ‘int MPI_NULL_COPY_FN(MPI_Comm, int, void*, void*, void*, int*)’ will never be NULL
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In static member function ‘static int PMPI::Comm::NULL_DELETE_FN(PMPI::Comm&, int, void*, void*)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:4935: error: the address of ‘int MPI_NULL_DELETE_FN(MPI_Comm, int, void*, void*)’ will never be NULL
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h: In copy constructor ‘PMPI::Errhandler::Errhandler(const PMPI::Errhandler&)’:
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:1187: error: ‘PMPI::Errhandler::mpi_errhandler’ will be initialized after
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:1184: error: ‘void (* PMPI::Errhandler::handler_fn)(PMPI::Comm&, int*, ...)’
/usr/local/sgi/mpt/mpt-2.02/include/mpi++.h:5542: error: when initialized here
make[2]: *** [ABYSS_P-parallelAbyss.o] Error 1
make[2]: Leaving directory `/gpfs2/projects/ev6/builds/abyss-1.3.3/Parallel'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/gpfs2/projects/ev6/builds/abyss-1.3.3'
make: *** [all] Error 2

chris

Christopher Fields

Apr 17, 2012, 9:45:33 PM
to abyss...@googlegroups.com
That fixed it; I never would have worked that one out on my own. Thanks, Haruna!

chris

Matthew MacManes

Sep 3, 2012, 6:14:29 PM
to abyss...@googlegroups.com, Shaun Jackman
Did you ever figure this 'make' issue out?
Matt

amit upadhyay

Jun 25, 2014, 11:30:03 AM
to abyss...@googlegroups.com, sjac...@bcgsc.ca, har...@sgi.com
Hello,

I am having the same issue using MPT. Has anyone resolved it?

Thank you,
Amit Upadhyay.