runtests.py fails with TestCCOObjInter ERROR when -np 5


Bennet

May 20, 2012, 9:42:19 AM
to mpi4py
I installed mpi4py 1.3 from the tarball download at code.google. The
compiler was gcc 4.6.2, and the OpenMPI version is 1.4.4. Following the
example, I ran runtests.py using -np 5 and got an error. Looking
through it, the error comes from this test:

$ mpirun -np 5 python test_cco_obj_inter.py --verbose
testAllgather (__main__.TestCCOObjInter) ... testAllgather
(__main__.TestCCOObjInter) ... testAllgather
(__main__.TestCCOObjInter) ... testAllgather
(__main__.TestCCOObjInter) ... testAllgather
(__main__.TestCCOObjInter) ... ERROR
testAllreduce (__main__.TestCCOObjInter) ... ERROR
ERROR

What is odd about this is that the error seems to appear only when
using five processes. I've run it with -np 2, 3, 4, 6, 7, and 8, and it
consistently succeeds with every count except 5.
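
For reference, I assume the test builds its intercommunicator by splitting COMM_WORLD in half; my rough sketch of the kind of operation testAllgather exercises (an approximation, not the actual test code) is:

from mpi4py import MPI

world = MPI.COMM_WORLD
rank, size = world.Get_rank(), world.Get_size()

# Split the world into two halves; with 5 processes the groups are uneven (2 + 3).
color = 0 if rank < size // 2 else 1
local = world.Split(color, rank)

# Rank 0 of each half is the local leader; the remote leader is the world
# rank of the other half's leader.
remote_leader = size // 2 if color == 0 else 0
inter = local.Create_intercomm(0, world, remote_leader, 0)

# On an intercommunicator, allgather returns the objects contributed by the
# remote group.
data = inter.allgather(("hello from", rank))
print(rank, data)

inter.Free()
local.Free()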

I'm at a bit of a loss to see why 5 processes is special. I found that
someone else posted in September with a similar problem, but I didn't
see any conclusion or solution posted. There was a suggestion to use
the SVN version, but I can only find a Mercurial repository on code.google.

Suggestions, anyone?

Thanks, -- bennet

Aron Ahmadia

May 20, 2012, 10:29:17 AM
to mpi...@googlegroups.com
Start with the development version (the Mercurial repository) and see if the problem is reproducible there.

A

Lisandro Dalcin

May 20, 2012, 10:42:43 AM
to mpi...@googlegroups.com
Can you run all the tests like this:

$ mpiexec -n 5 python test/runtests.py --verbose --no-threads --include cco_obj_inter

and tell us the outcome? I remember having issues with OpenMPI and
collectives on intercommunicators when using MPI_Init_thread.
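
As a quick check (just a small sketch, not part of the test suite), you can print the thread support level MPI was actually initialized with:

from mpi4py import MPI

# Map the MPI thread-level constants to readable names.
names = {
    MPI.THREAD_SINGLE: "THREAD_SINGLE",
    MPI.THREAD_FUNNELED: "THREAD_FUNNELED",
    MPI.THREAD_SERIALIZED: "THREAD_SERIALIZED",
    MPI.THREAD_MULTIPLE: "THREAD_MULTIPLE",
}

# Query_thread() reports the level provided by MPI_Init_thread.
if MPI.COMM_WORLD.Get_rank() == 0:
    print("initialized with", names[MPI.Query_thread()])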

If the error still appears, please try mpi4py-dev from the Mercurial
repo. BTW, I cannot reproduce the issue using MPICH2.


--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169

Bennet

May 20, 2012, 1:27:56 PM
to mpi4py
I downloaded this version:

hg clone https://code.google.com/p/mpi4py/

and rebuilt using gcc-4.6.2 and openmpi-1.4.4

$ python setup.py build
$ python setup.py install --prefix=/tmp/bennet
$ export PYTHONPATH=/tmp/bennet/lib/python2.7/site-packages
$ python
$ mpirun -np 3 python test/runtests.py
$ mpirun -np 5 python test/runtests.py

and the error is the same and in the same place.

To fold in a reply to Lisandro's suggestion, I ran

$ mpirun -np 5 python test/runtests.py --verbose --no-threads --include cco_obj_inter

and the first time it ran to completion with no errors. The second
and third times it errored with:

[0...@host.engin.umich.edu] Python 2.7 (/home/software/rhel5/lsa/epd/7.2/
bin/python)
[0...@host.engin.umich.edu] MPI 2.1 (Open MPI 1.4.4)
[0...@host.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/
mpi4py)
[2...@host.engin.umich.edu] Python 2.7 (/home/software/rhel5/lsa/epd/7.2/
bin/python)
[2...@host.engin.umich.edu] MPI 2.1 (Open MPI 1.4.4)
[2...@host.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/
mpi4py)
[3...@host.engin.umich.edu] Python 2.7 (/home/software/rhel5/lsa/epd/7.2/
bin/python)
[3...@host.engin.umich.edu] MPI 2.1 (Open MPI 1.4.4)
[3...@host.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/
mpi4py)
[1...@host.engin.umich.edu] Python 2.7 (/home/software/rhel5/lsa/epd/7.2/
bin/python)
[1...@host.umich.edu] MPI 2.1 (Open MPI 1.4.4)
[1...@host.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/
mpi4py)
[4...@host.engin.umich.edu] Python 2.7 (/home/software/rhel5/lsa/epd/7.2/
bin/python)
[4...@host.engin.umich.edu] MPI 2.1 (Open MPI 1.4.4)
[4...@host.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/
mpi4py)
testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... ERROR

That is with the version built from the Mercurial repository. Again,
-np 4 runs fine, as does -np 12.

-- bennet

Bennet

May 20, 2012, 1:48:56 PM
to mpi4py
Since I also have access to an RH 6 machine, I decided to try
installing there. I built with the RH-included Python 2.6.6, using
gcc 4.7.0 and openmpi-1.6.0. I get some additional messages about
InfiniBand that didn't show up on RH 5. Here's the output of
Lisandro's suggested command line (I only have OpenMPI available to
use, not MPICH):

$ mpirun -np 5 python test/runtests.py --verbose --no-threads --include cco_obj_inter
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to open /dev/infiniband/rdma_cm
CMA: unable to open /dev/infiniband/rdma_cm
CMA: unable to open /dev/infiniband/rdma_cm
CMA: unable to open /dev/infiniband/rdma_cm
CMA: unable to open /dev/infiniband/rdma_cm
[4...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
[4...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
[4...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/
mpi4py)
[2...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
[2...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
[2...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/
mpi4py)
[1...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
[1...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
[1...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/
mpi4py)
[0...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
[0...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
[0...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/
mpi4py)
[3...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
[3...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
[3...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/
mpi4py)
testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... ERROR

system information:

$ python
Python 2.6.6 (r266:84292, Sep 12 2011, 14:03:14)
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/software/rhel6/gcc/4.7.0/libexec/gcc/x86_64-
unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.7.0/configure --prefix=/home/software/rhel6/
gcc/4.7.0 --with-mpfr=/home/software/rhel6/gcc/mpfr-3.1.0/ --with-mpc=/
home/software/rhel6/gcc/mpc-0.9/ --with-gmp=/home/software/rhel6/gcc/
gmp-5.0.5/ --disable-multilib
Thread model: posix
gcc version 4.7.0 (GCC)

Again, there seems to be something 'magical' about -np 5, as it passes
all tests with any other number of processes.

I'll also note that it does not always error on the task with rank 4;
it sometimes errors on rank 3 (I think I have the terminology right).

[3...@host.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/
mpi4py)
testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... ERROR
ERROR

-- bennet

On May 20, 1:27 pm, Bennet <justben...@gmail.com> wrote:
> I downloaded this version:
>
> hg clone https://code.google.com/p/mpi4py/

Aron Ahmadia

May 20, 2012, 1:51:15 PM
to mpi...@googlegroups.com
I'm not sure this is worth tracking down; I can't reproduce it on MPICH2 either.

A

justb...@gmail.com

May 20, 2012, 1:56:24 PM
to mpi...@googlegroups.com
Do you have access to an OpenMPI version that you can try to replicate it with?

I'd be curious to know if you can replicate it with OpenMPI.  If not, then I might try to install MPICH2 and see whether I can confirm it does not replicate here.  If I can, perhaps it would be worth pursuing this in the context of OpenMPI?

-- bennet

Lisandro Dalcin

May 20, 2012, 3:21:23 PM
to mpi...@googlegroups.com
On 20 May 2012 14:56, <justb...@gmail.com> wrote:
> Do you have access to an openmpi version that you can or cannot replicate it
> with...?
>

I can replicate it with OpenMPI 1.5.4 from the Fedora 16 package. This is
not the first time I've hit issues with OpenMPI that did not appear with
MPICH2, so I'm inclined to say this issue is on the OpenMPI side and is
not mpi4py's fault.

Figuring out what's going on will require writing a self-contained
C example, crossing fingers to reproduce the issue, then bugging the
Open MPI folks about it, and so on...

$ PYTHONPATH=./build/lib.linux-x86_64-2.7 mpiexec -n 5 python
test/test_cco_obj_inter.py -v TestCCOObjInter.testGather

testGather (__main__.TestCCOObjInter) ... testGather
(__main__.TestCCOObjInter) ... testGather (__main__.TestCCOObjInter)
... testGather (__main__.TestCCOObjInter) ... testGather
(__main__.TestCCOObjInter) ... ERROR

======================================================================
ERROR: testGather (__main__.TestCCOObjInter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_cco_obj_inter.py", line 94, in testGather
    rmess = self.INTERCOMM.gather(smess, root=MPI.ROOT)
  File "Comm.pyx", line 869, in mpi4py.MPI.Comm.gather (src/mpi4py.MPI.c:72146)
  File "pickled.pxi", line 613, in mpi4py.MPI.PyMPI_gather (src/mpi4py.MPI.c:32972)
Exception: MPI_ERR_COUNT: invalid count argument

> I'd be curious to know if you can replicate with openmpi.  If not, then I
> might try to install mpich2 and see if I can not replicate it here.

I cannot reproduce it with MPICH2, and Aron cannot either. I bet you
will not be able to reproduce it.

> If I
> can do that, perhaps it would be worth pursing in the context of openmpi...?
>

BTW, we should test with OpenMPI 1.6, released a few days ago.
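
In case it helps as a starting point for a self-contained reproducer, here is a minimal mpi4py sketch of the failing pattern (a pickled gather over an intercommunicator using MPI.ROOT/MPI.PROC_NULL); the split of COMM_WORLD and the leader ranks below are my assumptions, not the test's exact setup:

from mpi4py import MPI

world = MPI.COMM_WORLD
rank, size = world.Get_rank(), world.Get_size()

# Two halves of COMM_WORLD joined into an intercommunicator (uneven with -n 5).
color = 0 if rank < size // 2 else 1
local = world.Split(color, rank)
remote_leader = size // 2 if color == 0 else 0
inter = local.Create_intercomm(0, world, remote_leader, 0)

# Gather to rank 0 of group 0: that process passes MPI.ROOT, the other
# processes in its group pass MPI.PROC_NULL, and every process in group 1
# passes the root's rank within the remote group (0 here).
if color == 0:
    root = MPI.ROOT if local.Get_rank() == 0 else MPI.PROC_NULL
else:
    root = 0
result = inter.gather(("hello from", rank), root=root)
if result is not None:
    print("gathered from remote group:", result)

inter.Free()
local.Free()

Run it with something like "mpiexec -n 5 python gather_inter.py" (the filename is just an example).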

Thomas Spura

May 20, 2012, 3:48:23 PM
to mpi...@googlegroups.com
On Sun, May 20, 2012 at 7:51 PM, Aron Ahmadia <ar...@ahmadia.net> wrote:
> I'm not sure this is worth tracing down, I can't reproduce it on mpich2
> either.

It also happens over here with the current hg version of mpi4py and
these packages:

$ rpm -qa openmpi gcc python
python-2.7.3-6.fc17.x86_64
gcc-4.7.0-5.fc17.x86_64
openmpi-1.5.4-5.fc17.1.x86_64

$ PYTHONPATH=build/lib.linux-x86_64-2.7/ mpirun -np 5 python test/runtests.py --verbose --no-threads --include cco_obj_inter
[0@leonidas] Python 2.7 (/usr/bin/python)
[4@leonidas] Python 2.7 (/usr/bin/python)
[4@leonidas] MPI 2.1 (Open MPI 1.5.4)
[4@leonidas] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[0@leonidas] MPI 2.1 (Open MPI 1.5.4)
[0@leonidas] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[2@leonidas] Python 2.7 (/usr/bin/python)
[2@leonidas] MPI 2.1 (Open MPI 1.5.4)
[2@leonidas] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[3@leonidas] Python 2.7 (/usr/bin/python)
[3@leonidas] MPI 2.1 (Open MPI 1.5.4)
[3@leonidas] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[1@leonidas] Python 2.7 (/usr/bin/python)
[1@leonidas] MPI 2.1 (Open MPI 1.5.4)
[1@leonidas] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather
(test_cco_obj_inter.TestCCOObjInter) ... ERROR
ERROR
testAllreduce (test_cco_obj_inter.TestCCOObjInter) ... testAllreduce
(test_cco_obj_inter.TestCCOObjInter) ... ERROR
testAllreduce (test_cco_obj_inter.TestCCOObjInter) ... ^Cmpirun: killing job...


It works with MPICH2, though...

When running into the bug, all 5 processes are busy-waiting:
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
epoll_wait(5, {}, 32, 0) = 0
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0

Greetings,
Tom