After years of using mpi4py happily, I'm having a very strange problem.
I'm calling `bcast`, and the object that's returned on different ranks is not the same type.
It's supposed to be a `np.random.Generator.bit_generator.state`,
but on rank=1 (size = 2) it's coming back as an ndarray with a single integer.
It's not happening elsewhere in the program, including doing a bcast on very similar data structures, so I'm guessing something is getting corrupted, but I have zero idea what. I'm happy to try to debug, but does anyone have any suggestions as to where to start looking? I don't even know how this could happen at all.
I'm calling `bcast`, and the object that's returned on different ranks is not the same type. It's supposed to be a `np.random.Generator.bit_generator.state`,
And that's a dict, right? Are you sure you are communicating the state dict, rather than the full Generator instance?
but on rank=1 (size = 2) it's coming back as an ndarray with a single integer.
What mpi4py version are you using? Could you provide a minimal reproducer?
but when I try to just extract that bit and synthetically generate the same kind of class instances, the second bcast works.
- bcast works
- a bunch of comm.send/comm.recv on a complex python class instance object
- bcast fails
It's not happening elsewhere in the program, including doing a bcast on very similar data structures, so I'm guessing something is getting corrupted, but I have zero idea what. I'm happy to try to debug, but does anyone have any suggestions as to where to start looking? I don't even know how this could happen at all.
Of course it could be that the message gets corrupted because of an issue in your MPI implementation, but then it would be a bit surprising that other MPI calls are working fine.
Perhaps it is some issue/bug related to pickling? Can you try checking:
pickle.loads(pickle.dumps(rng, 5)).bit_generator.state == rng.bit_generator.state
where rng is the np.random.Generator instance of the same type you are using?
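If it helps, here is that check spelled out as a small standalone script. It is only a sketch: `check_roundtrip` is a name made up for illustration, and `np.random.default_rng` stands in for whatever Generator your program actually builds. It also confirms that the state really is a dict and that the dict survives pickling on its own, which is roughly what `bcast` has to do with it.

```python
import pickle

import numpy as np


def check_roundtrip(rng, protocol=5):
    """Hypothetical helper: pickle round-trip checks for a Generator and its state."""
    state = rng.bit_generator.state

    # The state should be a plain dict, not the Generator itself.
    is_dict = isinstance(state, dict)

    # Round-trip the whole Generator, as in the one-liner above.
    restored_rng = pickle.loads(pickle.dumps(rng, protocol))
    rng_ok = restored_rng.bit_generator.state == state

    # Round-trip only the state dict, i.e. the object being broadcast.
    restored_state = pickle.loads(pickle.dumps(state, protocol))
    state_ok = restored_state == state

    return is_dict, rng_ok, state_ok


# Assumption: a default (PCG64) generator; substitute your own instance here.
rng = np.random.default_rng(12345)
print(check_roundtrip(rng))
```

If all three values come back True, pickling is probably not the culprit. One caveat: the plain `==` works here because the PCG64 state holds only ints and strings. For a bit generator whose state contains arrays (MT19937, as far as I know), comparing the dicts this way may raise instead, so you would need to compare the entries individually.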
Aha - problem solved. In between the first bcast that worked and the second that didn't, I was doing some file I/O on these complex objects by calling ase.io.read, from https://wiki.fysik.dtu.dk/ase/ase/io/io.html.