bcast returning different objects in different ranks

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Dec 7, 2022, 3:41:01 PM12/7/22
to mpi...@googlegroups.com
After years of using mpi4py happily, I'm having a very strange problem. I'm calling `bcast`, and the object that's returned on different ranks is not the same type.  It's supposed to be a `np.random.Generator.bit_generator.state`, but on rank=1 (size = 2) it's coming back as an ndarray with a single integer.

It's not happening elsewhere in the program, including doing a bcast on very similar data structures, so I'm guessing something is getting corrupted, but I have zero idea what. I'm happy to try to debug, but does anyone have any suggestions as to where to start looking? I don't even know how this could happen at all.
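
For reference, the call is essentially the following (a minimal sketch; in the real code comm and rng are set up elsewhere):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rng = np.random.default_rng()   # each rank has its own generator

# broadcast the root's bit generator state (a dict) to all ranks
state = rng.bit_generator.state if comm.rank == 0 else None
state = comm.bcast(state, root=0)

# expected on every rank: a dict; observed on rank 1: an ndarray with one integer
print(comm.rank, type(state))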

Lisandro Dalcin

Dec 7, 2022, 4:15:09 PM12/7/22
to mpi...@googlegroups.com
On Wed, 7 Dec 2022 at 17:41, 'Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)' via mpi4py <mpi...@googlegroups.com> wrote:
After years of using mpi4py happily, I'm having a very strange problem.

All good things must come to an end...
 
I'm calling `bcast`, and the object that's returned on different ranks is not the same type.   
It's supposed to be a `np.random.Generator.bit_generator.state`,

And that's a dict, right? Are you sure you are communicating the state dict, rather than the full Generator instance?
 
but on rank=1 (size = 2) it's coming back as an ndarray with a single integer.

What mpi4py version are you using? Could you provide a minimal reproducer?
  
It's not happening elsewhere in the program, including doing a bcast on very similar data structures, so I'm guessing something is getting corrupted, but I have zero idea what. I'm happy to try to debug, but does anyone have any suggestions as to where to start looking? I don't even know how this could happen at all.

Of course, it could be that the message gets corrupted due to an issue in your MPI implementation, but then it would be a bit surprising that other MPI calls are working fine.

Perhaps it is some issue/bug related to pickling? Can you try checking:

pickle.loads(pickle.dumps(rng, 5)).bit_generator.state == rng.bit_generator.state

where rng is the np.random.Generator instance of the same type you are using?
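
Something self-contained along these lines (a sketch, assuming default_rng; substitute whatever generator/seed you actually construct):

import pickle
import numpy as np

rng = np.random.default_rng()            # stand-in for your actual generator
copy = pickle.loads(pickle.dumps(rng, 5))
print(copy.bit_generator.state == rng.bit_generator.state)   # should print True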

--
Lisandro Dalcin
============
Senior Research Scientist
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Dec 7, 2022, 4:15:21 PM12/7/22
to mpi...@googlegroups.com
I've determined that it starts failing after a sequence of send/recv pairs of a complex python object. Before I do that set of sends/recvs it's fine; after it, it's broken.

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Dec 7, 2022, 5:08:13 PM12/7/22
to mpi...@googlegroups.com
Thanks a lot for taking the time to think about this.

 
I'm calling `bcast`, and the object that's returned on different ranks is not the same type.   
It's supposed to be a `np.random.Generator.bit_generator.state`,

And that's a dict, right? Are you sure you are communicating the state dict, rather than the full Generator instance?

Yes, it's a dict, and I'm definitely passing `bit_generator.state`. I can try to bcast the entire generator, I guess, but haven't done it yet.

 
but on rank=1 (size = 2) it's coming back as an ndarray with a single integer.

What mpi4py version are you using? Could you provide a minimal reproducer?

3.1.4 (Linux Rocky 8, OpenMPI 4.something, gcc 9.4).  I'm trying to make a minimal reproducer, but haven't been able to so far.  I've got it narrowed down to:
- bcast works
- a bunch of comm.send/comm.recv calls on complex python class instance objects
- bcast fails
but when I try to extract just that bit and synthetically generate the same kind of class instances, the second bcast works.
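
A rough sketch of the kind of thing I'm trying (with a made-up `Thing` class standing in for the real, much more complicated, objects):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rng = np.random.default_rng(5)

class Thing:                              # stand-in for the real class
    def __init__(self, n):
        self.pos = np.random.random((n, 3))
        self.info = {"n": n}

state = comm.bcast(rng.bit_generator.state, root=0)    # works

if comm.rank == 0:
    for dest in range(1, comm.size):
        comm.send([Thing(100) for _ in range(10)], dest=dest, tag=0)
else:
    things = comm.recv(source=0, tag=0)

state = comm.bcast(rng.bit_generator.state, root=0)    # fails in the real code, works here
print(comm.rank, type(state))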

  
It's not happening elsewhere in the program, including doing a bcast on very similar data structures, so I'm guessing something is getting corrupted, but I have zero idea what. I'm happy to try to debug, but does anyone have any suggestions as to where to start looking? I don't even know how this could happen at all.

Of course, it could be that the message gets corrupted due to an issue in your MPI implementation, but then it would be a bit surprising that other MPI calls are working fine.

It's hard to imagine how it would corrupt the message but still leave it as something that is unpickled into an ndarray with a sensible integer value (2260). 


Perhaps it is some issue/bug related to pickling? Can you try checking:

pickle.loads(pickle.dumps(rng, 5)).bit_generator.state == rng.bit_generator.state

where rng is the np.random.Generator instance of the same type you are using?

This equality is fine, and pickling/unpickling just the rng.bit_generator.state also works fine.

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Dec 7, 2022, 7:12:15 PM12/7/22
to mpi...@googlegroups.com
By the way, I'm happy to grab the source, put in debugging prints, and run a modified version if that's what it takes. I just need a suggestion of where to start.  I tried putting print statements in mpi4py.util.pkl5.Comm.recv, but nothing was printed, and I'm not sure whether that's not the right place to look or whether I did something else wrong.
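
If I'm reading the docs right, pkl5 is only used when the communicator is explicitly wrapped, something like this (a sketch), so perhaps a plain comm.recv never goes through that code path at all:

from mpi4py import MPI
from mpi4py.util import pkl5

# pkl5 only kicks in via an explicitly wrapped communicator like this;
# MPI.COMM_WORLD.recv() itself does not call pkl5.Comm.recv
comm = pkl5.Intracomm(MPI.COMM_WORLD)
if comm.rank == 0:
    comm.send({"hello": "world"}, dest=1, tag=0)
elif comm.rank == 1:
    data = comm.recv(source=0, tag=0)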

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Dec 7, 2022, 9:33:37 PM12/7/22
to mpi...@googlegroups.com
Finally, not sure if this provides any hints, but if I repeat the bcast a few times after it first goes wrong, it alternates between one of two things: a numpy array with a single int64 value, 2260, or an array of 2260 int8 values between -128 and 127.  That value (2260) is not the length of the pickle of the bit_generator.state dict that's being bcast, as far as I can tell, so I'm not sure where it comes from, but the fact that it shows up as an integer or as the length of an integer array seems like it should give some clue.

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Dec 8, 2022, 8:47:19 AM12/8/22
to mpi...@googlegroups.com
Aha - problem solved. In between the first bcast that worked and the second that didn't, I was doing some file I/O on these complex objects by calling ase.io.read, from https://wiki.fysik.dtu.dk/ase/ase/io/io.html. What I forgot was that, by default, if mpi4py is available that function automatically does the read on a single process followed by a bcast, implicitly assuming that all the MPI processes call it at the same time in the same way. That's not what I was doing, so the root process was probably doing bcasts without matching bcast calls on the other processes.

The send/recv pairs after the 1st bcast still worked, but the following bcast calls on rank >= 1 must have been matching the bcast from inside ase.io.read(). Once I realized that ase.io.read was doing its own mpi4py calls, I figured out what was happening.
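
A contrived sketch of the kind of mismatch I mean (intentionally erroneous, so it may also hang, but it shows the cross-talk):

from mpi4py import MPI
comm = MPI.COMM_WORLD

# what ase.io.read effectively did, because only rank 0 called it:
if comm.rank == 0:
    comm.bcast("file contents", root=0)    # no matching call on the other ranks

# my own bcast, called on all ranks: rank 1's call matches the hidden
# bcast above and returns "file contents"; every collective on rank >= 1
# is shifted by one from here on
state = comm.bcast({"some": "state dict"}, root=0)
print(comm.rank, state)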

No problem with mpi4py, so sorry for the false alarm, and thanks again for all your work.

Lisandro Dalcin

Dec 8, 2022, 9:58:47 AM12/8/22
to mpi...@googlegroups.com
On Thu, 8 Dec 2022 at 10:47, 'Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)' via mpi4py <mpi...@googlegroups.com> wrote:
Aha - problem solved. In between the first bcast that worked and the second that didn't I was doing so file I/O on these complex objects by calling ase.io.read, from https://wiki.fysik.dtu.dk/ase/ase/io/io.html.

I see. It looks like ASE is using MPI.COMM_WORLD by default. I consider this approach bad practice: the world communicator is something for USERS to use, not LIBRARIES. If every parallel library out there uses COMM_WORLD, things can get really messed up, just as happened to you.
But if a parallel library uses COMM_WORLD anyway, then you as a user should try to work around this deficiency by configuring/forcing ASE to use a duplicate of MPI.COMM_WORLD, so you never again have to worry about ASE's MPI calls interfering with other parts of your code.
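
Something along these lines (the mpi4py part is just Comm.Dup; how to actually hand the duplicate to ASE depends on ASE's own configuration, which I have not checked):

from mpi4py import MPI

# a duplicate communicator has its own context: collectives and point-to-point
# messages on it can never be matched against calls made on COMM_WORLD itself
ase_comm = MPI.COMM_WORLD.Dup()

# ... configure ASE to use ase_comm instead of COMM_WORLD (see ASE's docs) ...

ase_comm.Free()   # when done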
