EOFError when attempting to receive large numpy arrays


julio.h...@gmail.com

unread,
May 3, 2014, 12:28:55 PM5/3/14
to mpi...@googlegroups.com
Dear all,

Consider this fairly simple snippet of code:

import emcee
import numpy as np

# Set up an MPI-backed process pool (emcee's MPIPool wraps mpi4py).
pool = emcee.utils.MPIPool(debug=True)

# Each task returns a fairly large array (2500 float64 values).
def f(x): return np.empty(2500)

X = np.empty([100, 50000])

# Map f over the rows of X across the MPI workers.
Y = np.array(pool.map(f, [x for x in X]))

pool.close()

The pool class implementation is found here: https://github.com/dfm/emcee/blob/master/emcee/mpi_pool.py#L32

Now consider the following scenarios, assuming each node in the cluster has 8 cores:
  • If I run the code with 8 MPI processes, it works just fine.
  • If I run with 16 MPI processes, it fails with EOFError (see the traceback below).
  • Finally, with 16 MPI processes but 2500 replaced by, for instance, 100, it works again.
Traceback (most recent call last):
  File "main.py", line 10, in <module>
    Y = np.array(pool.map(f, [x for x in X]))
  File "/net/home/A-home/ramiro/julio/.local/lib/python2.7/site-packages/emcee-2.1.0-py2.7.egg/emcee/mpi_pool.py", line 185, in map
    result = self.comm.recv(source=worker, tag=i)
  File "Comm.pyx", line 816, in mpi4py.MPI.Comm.recv (src/mpi4py.MPI.c:72032)
  File "pickled.pxi", line 250, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:29545)
  File "pickled.pxi", line 111, in mpi4py.MPI._p_Pickle.load (src/mpi4py.MPI.c:28058)
EOFError

These scenarios suggest the issue is somehow related to the size of the MPI messages being received, or to a limitation of the cPickle module (huge memory consumption).

Do you have any suggestions or comments on how I can solve this problem?

I really appreciate any help,
Júlio.

Lisandro Dalcin

unread,
May 4, 2014, 8:03:54 AM5/4/14
to mpi4py
While communicating pickled streams does have a 2GB size limit, I do
not think you are reaching that point. I've tried to run your code
with 16 processes on my workstation (8 cores, 24GB RAM) and it runs
just fine (using the development mpi4py from Bitbucket). Could you try
with a smaller X array, let's say X = np.empty([100,500])? Do you
still get the error? How much RAM do you have on the compute nodes?
What mpi4py version and backend MPI implementation are you using?
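
For reference, a quick sketch along these lines should report all of
that at once; get_vendor() and Get_version() are standard mpi4py
calls, though the exact output format may differ between versions:

import mpi4py
from mpi4py import MPI

# mpi4py package version
print(mpi4py.__version__)
# backend MPI implementation name and version, e.g. ('Open MPI', (1, 6, 5))
print(MPI.get_vendor())
# version of the MPI standard supported by the backend
print(MPI.Get_version())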



--
Lisandro Dalcin
---------------
CIMEC (UNL/CONICET)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1016)
Tel/Fax: +54-342-4511169

julio.h...@gmail.com

unread,
May 6, 2014, 11:15:40 AM5/6/14
to mpi...@googlegroups.com
Hi Lisandro,
 
While communicating pickled streams does have a 2GB size limit, I do not
think you are reaching that point. I've tried to run your code with 16
processes on my workstation (8 cores, 24GB RAM) and it runs just fine
(using the development mpi4py from Bitbucket).

When I request 16 processes I use internode communication on the cluster (8 + 8 cores = 2 nodes), which is different from the intranode communication happening on your workstation. Could that be the cause?
 
Could you try with a smaller X array, let's say X = np.empty([100,500])?
Do you still get the error?

Yes, the error happens the same way. I guess the size of interest here is 2500 returned by f.
 
How much RAM do you have at compute nodes? What mpi4py
version and backend MPI implementation are you using?

I'll confirm as soon as possible, but the cluster website says 16GB of RAM per node.

I compiled the latest mpi4py revision from last month, if I remember correctly. I also need to check that.

Bull MPI + InfiniBand is what I'm using.

Sorry for the delay; I'm very busy trying to finish my master's dissertation. I would really appreciate any help in solving this issue. The deadline is May 28 and I have no clue how to overcome this error in time. It's needed for the second case study of the thesis.

I'll try recompiling the latest mpi4py to see if something has changed along the way.

Best,
Júlio.

Lisandro Dalcin

unread,
May 6, 2014, 12:34:37 PM5/6/14
to mpi4py
On 6 May 2014 14:15, <julio.h...@gmail.com> wrote:
> Hi Lisandro,
>
>>
>> While communicating pickled streams does have a 2GB size limit, I do not
>> think you are reaching that point. I've tried to run your code with 16
>> processes on my workstation (8 cores, 24GB RAM) and it runs just fine
>> (using the development mpi4py from Bitbucket).
>
>
> When I request 16 processes I use internode communication on the cluster
> (8 + 8 cores = 2 nodes), which is different from the intranode
> communication happening on your workstation. Could that be the cause?
>

From the MPI end-user point of view, nothing is different. Of course,
the MPI implementation might (and the good ones will) use different
communication mechanisms for intra-node and inter-node messages.

>>
>> Could you try with a smaller X array, let's say X = np.empty([100,500])?
>>
>> Do you still get the error?
>
>
> Yes, the error happens the same way. I guess the size of interest here is
> 2500 returned by f.
>

That size is quite small, so I guess it is unrelated to any out-of-memory issue.

Your error message smells like the process received a truncated
message and pickle then complained about it with EOFError. I'm
wondering whether this is a bug in your MPI implementation.

>>
>> How much RAM do you have at compute nodes? What mpi4py
>> version and backend MPI implementation are you using?
>
>
> I'll confirm as soon as possible, but the cluster website says 16GB
> of RAM per node.
>

That should be enough.

Have you tried launching the 16 processes on a single node (i.e.,
oversubscribing the node)?

> I compiled the latest mpi4py revision from last month, if I remember
> correctly. I also need to check that.
>
> Bull MPI + infiniband is what I'm using.
>

Is it possible to install an evaluation version of this MPI implementation?

Final suggestion to try:

At the very beginning of your main script, add the following lines:

import mpi4py
mpi4py.rc.threaded = False

This will not ask for multiple-thread support in your MPI backend, and
things could magically start working. As Bull MPI is based on Open
MPI, that would not surprise me :-).
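
Note that this rc setting only takes effect if it is applied before
mpi4py.MPI is first imported, since MPI gets initialized at import
time. A minimal sketch of the intended ordering for main.py (the rest
of the script stays as it is):

import mpi4py
mpi4py.rc.threaded = False   # must be set before mpi4py.MPI is first imported

import emcee                 # emcee's MPIPool pulls in mpi4py.MPI, so it comes after
import numpy as np
# ... the rest of main.py stays unchanged ...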

Lisandro Dalcin

unread,
May 6, 2014, 12:37:44 PM5/6/14
to mpi4py
On 6 May 2014 15:34, Lisandro Dalcin <dal...@gmail.com> wrote:
> On 6 May 2014 14:15, <julio.h...@gmail.com> wrote:
>> Hi Lisandro,
>>

Oops. After you told me you are using Bull MPI, I tried running your
code using Open MPI 1.6.5. Now I can reproduce the error! I'll
investigate a little further to see if I can figure out the issue.

Lisandro Dalcin

unread,
May 6, 2014, 1:12:47 PM5/6/14
to mpi4py
On 6 May 2014 15:37, Lisandro Dalcin <dal...@gmail.com> wrote:
> On 6 May 2014 15:34, Lisandro Dalcin <dal...@gmail.com> wrote:
>> On 6 May 2014 14:15, <julio.h...@gmail.com> wrote:
>>> Hi Lisandro,
>>>
>
> Oops. After you told me you are using Bull MPI, I tried running your
> code using Open MPI 1.6.5. Now I can reproduce the error! I'll
> investigate a little further to see if I can figure out the issue.
>

Looking more carefully at the code in mpi_pool.py, I would say it is
broken. There are calls to comm.isend() that never wait for the
communication to complete (that is, r = comm.isend(...) ... r.wait()).
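
In plain mpi4py terms, the pattern in question looks roughly like
this; a generic two-process sketch, not the emcee code itself:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Keep the request returned by isend() and wait on it; a bare
    # comm.isend(...) that drops the request gives no completion guarantee.
    req = comm.isend({"task": 42}, dest=1, tag=0)
    req.wait()
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print(data)

Run with, e.g., mpiexec -n 2 python isend_example.py.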

Júlio Hoffimann

unread,
May 6, 2014, 1:53:04 PM5/6/14
to mpi...@googlegroups.com

Lisandro, thank you so much for your effort! I really appreciate it. I'm far from my computer right now, but as soon as I have time I'll check the pool bug you mentioned.

I'll reply with more news.

Best,
Júlio.

Júlio Hoffimann

unread,
May 7, 2014, 10:48:49 AM5/7/14
to mpi...@googlegroups.com
Hi Lisandro,

Could you please point out where exactly in the source code you think the bug is? All occurrences of isend() are followed by waitall(): https://github.com/dfm/emcee/blob/master/emcee/mpi_pool.py#L151-L188

I checked on the cluster and the version I'm currently using is mpi4py 1.3.1.

Do you think the issue might be related to the fact that I'm spawning processes (with Python multiprocessing) in the MPI master process?

Júlio.

Yury V. Zaytsev

unread,
May 7, 2014, 12:22:51 PM5/7/14
to mpi...@googlegroups.com
On Wed, 2014-05-07 at 07:48 -0300, Júlio Hoffimann wrote:
>
> Do you think the issue might be related to the fact that I'm spawning
> processes (with python multiprocessing) in the MPI master process?

From what I know, this is generally a very dangerous thing to do. Not
all MPI implementations support forking in the first place, and in any
case, you must not call any MPI functions from the child process.

See e.g.

http://www.open-mpi.de/faq/?category=tuning#fork-warning

--
Sincerely yours,
Yury V. Zaytsev


Lisandro Dalcin

unread,
May 7, 2014, 2:18:13 PM5/7/14
to mpi4py
On 7 May 2014 13:48, Júlio Hoffimann <julio.h...@gmail.com> wrote:
> Hi Lisandro,
>
> Could you please point in the source code where exactly do you think the bug
> is? All occurrences of isend() are followed by waitall():
> https://github.com/dfm/emcee/blob/master/emcee/mpi_pool.py#L151-L188
>

Oh, I got confused. I'm talking about the branch that does load balancing:

https://github.com/dfm/emcee/blob/master/emcee/mpi_pool.py#L190-L225


>
> Do you think the issue might be related to the fact that I'm spawning
> processes (with python multiprocessing) in the MPI master process?
>

Oh, that's always VERY dangerous.

But anyway, this does not explain why things are failing on my side:
the example you posted, and that I'm testing, is not using multiprocessing.

Lisandro Dalcin

unread,
May 7, 2014, 2:26:43 PM5/7/14
to mpi4py
On 7 May 2014 17:18, Lisandro Dalcin <dal...@gmail.com> wrote:
> On 7 May 2014 13:48, Júlio Hoffimann <julio.h...@gmail.com> wrote:
>> Hi Lisandro,
>>
>> Could you please point in the source code where exactly do you think the bug
>> is? All occurrences of isend() are followed by waitall():
>> https://github.com/dfm/emcee/blob/master/emcee/mpi_pool.py#L151-L188
>>
>
> Oh, I got confused. I'm talking about the branch that does load balance,
>
> https://github.com/dfm/emcee/blob/master/emcee/mpi_pool.py#L190-L225
>
>
>>
>> Do you think the issue might be related to the fact that I'm spawning
>> processes (with python multiprocessing) in the MPI master process?
>>
>
> Oh, that's always VERY dangerous.
>
> But anyway, this does not explain why things are failing on my side,
> the example you posted and I'm testing is not using multiprocessing.
>

Well, after a closer look at the code, I've found a problem:

https://github.com/dfm/emcee/blob/master/emcee/mpi_pool.py#L176

When you oversubscribe the workers, i.e., when you have more tasks
than workers, this code relies on MPI buffering to make progress, and
MPI provides no such guarantee.
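
For illustration only (this is a generic sketch of the idea, not the
pool.py attached later in this thread; the function and variable names
below are made up for the example): a master loop that hands out one
task per worker at a time and explicitly completes every send does not
depend on MPI buffering.

from mpi4py import MPI

def master_map(comm, tasks):
    # Distribute tasks to workers one at a time; every send is a blocking
    # send, so nothing relies on the MPI implementation buffering messages.
    size = comm.Get_size()
    results = [None] * len(tasks)
    status = MPI.Status()
    pending = 0
    next_task = 0
    # Seed each worker with one task (tag 1 = work, tag 0 = stop).
    for worker in range(1, size):
        if next_task < len(tasks):
            comm.send((next_task, tasks[next_task]), dest=worker, tag=1)
            next_task += 1
            pending += 1
    # Collect results and hand out the remaining work as workers become free.
    while pending:
        index, value = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results[index] = value
        pending -= 1
        worker = status.Get_source()
        if next_task < len(tasks):
            comm.send((next_task, tasks[next_task]), dest=worker, tag=1)
            next_task += 1
            pending += 1
    # Tell all workers to stop.
    for worker in range(1, size):
        comm.send(None, dest=worker, tag=0)
    return results

def worker_loop(comm, func):
    # Receive (index, item) pairs, apply func, send back (index, result).
    status = MPI.Status()
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == 0:
            break
        index, item = task
        comm.send((index, func(item)), dest=0, tag=2)

Rank 0 would call master_map() and every other rank worker_loop() with
the same function, much like the MPIPool interface.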

Júlio Hoffimann

unread,
May 7, 2014, 3:55:13 PM5/7/14
to mpi...@googlegroups.com

Hi Lisandro,

Are you sure this is the source of the issue? Do you suggest any patch for it?

Anyway, incredible catch. I forgot that MPI buffers the messages in such situations.

Júlio.

Lisandro Dalcin

unread,
May 8, 2014, 4:00:40 PM5/8/14
to mpi4py
On 7 May 2014 18:55, Júlio Hoffimann <julio.h...@gmail.com> wrote:
> Hi Lisandro,
>
> Are you sure this is the source of the issue?

No, I'm not sure. But if you cannot always trust MPI implementations
to do the right thing for correct code, I have no expectations at all
for incorrect code :-)

>
> Do you suggest any patch to that?
>

Uff, proposing a patch is probably much harder for me than
implementing this master/worker approach from scratch. Please find
attached my own version. I've not tested it, but I think it is right.
As usual: no comments, no docs, there are better ways to do it, etc.
Look at the end for a quick example of how to use it.
pool.py

Júlio Hoffimann

unread,
May 9, 2014, 12:01:08 AM5/9/14
to mpi...@googlegroups.com
Uff, proposing a patch is probably much harder for me than
implementing this master/worker approach from scratch. Please find
attached my own version. I've not tested it, but I think it is right.

Thank you very much, Lisandro, you're being incredibly helpful. I'll check whether I can use your pool class in my actual code. Anyway, is there no other solution if we stick with isend()?

Best,
Júlio.

Júlio Hoffimann

unread,
May 10, 2014, 7:16:57 PM5/10/14
to mpi...@googlegroups.com
Lisandro, you're my hero!

Can I open a pull request for the emcee project to consider your pool implementation? It's under the MIT open source license.

Now I can proceed with the case study for my dissertation, thanks!

Best,
Júlio.

Lisandro Dalcin

unread,
May 10, 2014, 8:33:45 PM5/10/14
to mpi4py
On 10 May 2014 22:16, Júlio Hoffimann <julio.h...@gmail.com> wrote:
> Lisandro, you're my hero!
>
> Can I open a pull request for the emcee project to consider your pool
> implementation? It's under the MIT open source license.
>

Yes, of course. However, warn the maintainers of emcee that my
implementation was put together in 15 minutes just to get something
working; it should be seriously reviewed and profiled. I do not vouch
for the correctness, cleverness, or performance of this piece of code.

bachm...@gmail.com

unread,
May 8, 2015, 9:01:12 PM5/8/15
to mpi...@googlegroups.com
We ran into this exact same problem and found that switching emcee to loadbalance mode fixed the issue. It doesn't look like this issue was previously mentioned to the emcee users/developers, so I have submitted an issue in the emcee repo describing it:

https://github.com/dfm/emcee/issues/151

Lisandro's pool.py implementation also fixed the issue. However, to elaborate on his performance disclaimer, we noticed a performance penalty when using it with emcee: the PTLikePrior object (which contains the likelihood/prior functions and their associated arguments, which can include substantial amounts of data) gets pickled with each message between master and worker. In emcee's MPIPool implementation the function is stored on each worker once at initialization time, which eliminates this inefficiency.
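
As a rough sketch of that "store the callable once" pattern (this is
illustrative only, not the emcee or pool.py source; the LogLike class
below is made up for the example): the expensive object is pickled a
single time in a broadcast at start-up, so per-task messages stay small.

from mpi4py import MPI
import numpy as np

class LogLike(object):
    # Hypothetical callable that carries a large array around.
    def __init__(self, data):
        self.data = data
    def __call__(self, x):
        return float(x) + self.data.sum()

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Build the heavy object on the master only, then broadcast it once;
# this is the only time it gets pickled.
func = LogLike(np.arange(10**6, dtype=float)) if rank == 0 else None
func = comm.bcast(func, root=0)

# From here on, messages between master and workers only need to carry
# the small per-task inputs and outputs, not the likelihood object.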