Re: failure to execute demo

Art Poon

Dec 8, 2009, 2:30:07 AM
to mpi4py
Dear colleagues,

First, my thanks to the developers for creating and maintaining a
well-documented package.

I am trying to get mpi4py up and running on my cluster (x86 running
CentOS 5.4 with Scyld Clusterware). I've been using MPI-enabled
source code without any issues so the MPI implementations seem to be
functional. mpi4py-1.1.0 seems to have compiled successfully (I used the
--configure flag and double-checked the settings to be sure).

The demo file 'helloworld.py' executes without problems for 2 or 3
processors (I've tried many times):
[art@cluster demo]$ mpirun -np 3 python helloworld.py
Hello, World! I am process 0 of 3 on cluster.
Hello, World! I am process 2 of 3 on n1.
Hello, World! I am process 1 of 3 on n0.

However, once I try 4 processors I run into the dreaded p4_error:
Hello, World! I am process 0 of 4 on strong_badia.cfenet.ubc.ca.
Hello, World! I am process 2 of 4 on n1.
Hello, World! I am process 3 of 4 on n2.
Hello, World! I am process 1 of 4 on n0.
p1_30215: p4_error: net_recv read: probable EOF on socket: 1
rm_l_1_30217: (1.171875) net_send: could not write to fd=4, errno = 32
p3_30218: p4_error: net_recv read: probable EOF on socket: 1
rm_l_3_30220: (0.058594) net_send: could not write to fd=4, errno = 32

This is driving me nuts. My wild guess is that the main Python
instance on the head node terminates before one of the child processes
can send its message to standard console. But I've tried getting the
head node to pause (with time.sleep) to no effect. And that the
expected output makes it to the console is not consistent with this
hypothesis. So I'm left feeling pretty dumb.

I've even run into the same issue with pypar!

Your expertise would be very much appreciated.
Thanks,
- Art.

Lisandro Dalcin

Dec 8, 2009, 5:39:12 PM
to mpi...@googlegroups.com
On Tue, Dec 8, 2009 at 4:30 AM, Art Poon <art...@gmail.com> wrote:
>
> I am trying to get mpi4py up and running on my cluster (x86 running
> CentOS 5.4 with Scyld Clusterware).  I've been using MPI-enabled
> source code without any issues so the MPI implementations seem to be
> functional.  mpi4py-1.1.0 seems to have compiled successfully (used
> --configure flag and double-checked settings to be sure).

What MPI implementation? MPICH(1) or MPICH2?

>
> The demo file 'helloworld.py' executes without problems for 2 or 3
> processors (I've tried many times):
>  [art@cluster demo]$ mpirun -np 3 python helloworld.py
>  Hello, World! I am process 0 of 3 on cluster.
>  Hello, World! I am process 2 of 3 on n1.
>  Hello, World! I am process 1 of 3 on n0.
>
> However, once I try 4 processors I run into the dreaded p4_error:
>  Hello, World! I am process 0 of 4 on strong_badia.cfenet.ubc.ca.
>  Hello, World! I am process 2 of 4 on n1.
>  Hello, World! I am process 3 of 4 on n2.
>  Hello, World! I am process 1 of 4 on n0.

What is the hostname of the front-end node? In the first run, it seems
to be "cluster", but in the second it seems to be
"strong_badia.cfenet.ubc.ca" ... What's going on there?


>  p1_30215:  p4_error: net_recv read:  probable EOF on socket: 1
>  rm_l_1_30217: (1.171875) net_send: could not write to fd=4, errno = 32
>  p3_30218:  p4_error: net_recv read:  probable EOF on socket: 1
>  rm_l_3_30220: (0.058594) net_send: could not write to fd=4, errno = 32
>

So you never ever saw this error before with other MPI applications?
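
For reference, the "errno = 32" in the net_send lines is EPIPE ("Broken pipe") on Linux: the peer had already closed the socket when the write was attempted. A minimal sketch for checking this on the cluster, assuming a Linux host (errno numbers are platform-specific, and describe_errno is only an illustrative helper name):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, OS message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 32 is the value reported in the net_send lines above.
name, message = describe_errno(32)
print("errno 32 -> %s (%s)" % (name, message))
```

An EPIPE on the sending side together with "probable EOF on socket" on the receiving side would fit one process going away while others still hold connections to it. It does not say why that process went away.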

> This is driving me nuts.

Indeed.

> My wild guess is that the main Python
> instance on the head node terminates before one of the child processes
> can send its message to standard console.

No, I do not think so.

> But I've tried getting the
> head node to pause (with time.sleep) to no effect.  And that the
> expected output makes it to the console is not consistent with this
> hypothesis.  So I'm left feeling pretty dumb.
>

The helloworld example is so simple that no communication at all is involved.

Could you try adding an MPI.COMM_WORLD.Barrier() call at the end of the script?

Could you also explicitly call MPI.Finalize() at the end of the script?
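
Something along these lines would do. This is a minimal sketch, not the demo file verbatim: the function names say_hello and main are only for illustration, and the mpi4py import sits inside main() so that the module can be loaded without MPI.

```python
import sys


def say_hello(comm, processor_name, out=sys.stdout):
    """Print the greeting, then wait until every process has done the same."""
    rank = comm.Get_rank()
    size = comm.Get_size()
    out.write("Hello, World! I am process %d of %d on %s.\n"
              % (rank, size, processor_name))
    out.flush()
    # No process leaves this call until all of them have entered it.
    comm.Barrier()


def main():
    from mpi4py import MPI  # importing this module initializes MPI

    say_hello(MPI.COMM_WORLD, MPI.Get_processor_name())
    # Shut MPI down explicitly rather than at interpreter exit.
    MPI.Finalize()
```

Call main() as the last line of the script and launch it as before, e.g. mpirun -np 4 python helloworld.py. If the p4_error still shows up after the barrier, then processes exiting at different times is probably not the cause.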

>
> I've even run into the same issue with pypar!
>

OK. So this seems to be a Python-related issue. Please make the
modifications I suggested above and report back. If your MPI is MPICH2,
send me the output of "mpich2version" and "mpicc -show". Also try
running the other demos to see whether the error always happens at the
end of the run.


--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

Art Poon

Dec 8, 2009, 7:43:28 PM
to mpi...@googlegroups.com
Hi Lisandro,

Thanks for your reply. Please bear with me as I am a biologist by training and am scrambling to fill the gaping holes in my knowledge of Linux.

> On Dec 8, 2009, at 2:39 PM, Lisandro Dalcin wrote:
> What MPI implementation? MPICH(1) or MPICH2?

I wish I knew which implementation is active. Scyld Clusterware claims to install MPICH and OpenMPI libraries, among others. I found some OpenMPI binaries under /usr/openmpi and am trying to rebuild mpi4py with them.

> What is the hostname of the front-end node? In the first run, it seems
> to be "cluster", but in the second it seems to be
> "strong_badia.cfenet.ubc.ca" ... What's going on there?

That was my poorly-executed attempt at securing a little privacy :-(

> So you never ever saw this error before with other MPI applications?

Not with the open-source C++ application that I co-develop, nor "Hello World"-grade C code snippets, nor any other example C code that I've tried.

> The helloworld example is so simple that no communication at all is involved.
I'm guessing that should tell me that the problem is systemic, i.e., the MPI implementation is at fault or the hardware is at fault.

> Could you try adding an MPI.COMM_WORLD.Barrier() call at the end of the script?
>
> Could you also explicitly call MPI.Finalize() at the end of the script?

OK, I tried both suggestions, but the outcome did not change.


> OK. So this seems to be a Python-related issue. Please make the
> modifications I suggested above and report back. If your MPI is MPICH2,
> send me the output of "mpich2version" and "mpicc -show". Also try
> running the other demos to see whether the error always happens at the
> end of the run.

mpicc -show:
gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -lmpi -lbproc

No mpich2version in /usr/bin, so I guess MPICH2 is not installed. :-/

With more than 3 processors, cpi-cco.py crashes with similar error messages immediately after prompting the user for the number of intervals:

> mpirun -np 4 python cpi-cco.py
Enter the number of intervals: (0 quits) p1_10566: p4_error: net_recv read: probable EOF on socket: 1
p2_10567: p4_error: net_recv read: probable EOF on socket: 1
rm_l_3_10571: (0.769531) net_send: could not write to fd=4, errno = 32
[art@strong_badia compute-pi]$ rm_l_1_10568: (1.871094) net_send: could not write to fd=4, errno = 32
rm_l_2_10570: (1.320312) net_send: could not write to fd=4, errno = 32
p3_10569: (6.773438) net_send: could not write to fd=4, errno = 32


Thanks again,
- Art.



Lisandro Dalcin

Dec 8, 2009, 9:28:34 PM
to mpi...@googlegroups.com
Indeed. It seems you have MPICH(1). I have done my best to support
the old MPICH, but I hope you understand that this is hard.
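
The "mpicc -show" line is what gives it away: the library directory contains MPICH/p4, and p4 is the MPICH(1) device that the p4_error messages come from. Below is a rough sketch of pulling the relevant pieces out of such a line and checking which helper tools are on the PATH. The names parse_wrapper_line and find_tools are made up for illustration, the tool list is only an assumption about what might be installed, and shutil.which needs Python 3.3 or later.

```python
import shlex
import shutil


def parse_wrapper_line(line):
    """Split a compiler-wrapper command line into its main parts."""
    tokens = shlex.split(line)
    flags = tokens[1:]
    return {
        "compiler": tokens[0] if tokens else None,
        "lib_dirs": [t[2:] for t in flags if t.startswith("-L")],
        "include_dirs": [t[2:] for t in flags if t.startswith("-I")],
        "libs": [t[2:] for t in flags if t.startswith("-l")],
    }


def find_tools(names=("mpicc", "mpirun", "mpich2version", "ompi_info")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# The line reported earlier in this thread.
info = parse_wrapper_line(
    "gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -lmpi -lbproc")
print(info["lib_dirs"])
print(find_tools())
```

Since both MPICH and OpenMPI appear to be installed, it may also be worth checking that the mpicc used to build mpi4py and the mpirun used to launch it come from the same installation.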

> cpi-cco.py crashes with more than 3 processors immediately after prompting user to enter the number of intervals with similar error messages:
>
>> mpirun -np 4 python cpi-cco.py
> Enter the number of intervals: (0 quits) p1_10566:  p4_error: net_recv read:  probable EOF on socket: 1
> p2_10567:  p4_error: net_recv read:  probable EOF on socket: 1
> rm_l_3_10571: (0.769531) net_send: could not write to fd=4, errno = 32
> [art@strong_badia compute-pi]$ rm_l_1_10568: (1.871094) net_send: could not write to fd=4, errno = 32
> rm_l_2_10570: (1.320312) net_send: could not write to fd=4, errno = 32
> p3_10569: (6.773438) net_send: could not write to fd=4, errno = 32
>

I'm running out of ideas. I've sent you a chat invite. Let's proceed
off-list and try to sort out your issues.