Issues with MPI send/recv

Min Yun

Nov 19, 2008, 5:48:42 PM

to cse6230-hpcta-fa08

We solved the initial problem we posted to the Google group, but we
now have a new issue. My project group is seeing a very specific and
confusing problem in our parallel program. In short, when we send a
chunk of data to the other nodes, most of it arrives correctly, but
NOT the full amount we specified in the MPI_Send/MPI_Recv calls: the
tail end of the data sent is not received correctly.

Specific explanation:
We narrowed down the problem and wrote a small sample program to
figure things out. In it, we create a dynamically allocated 2D matrix
of doubles called "rho" of size 256x256, with every element
initialized to 1.0 on the master node only. We then split the matrix
into four parts and send three of them out, so each node should end up
with its own 64x256 block. The master node handles rows 0-63, node 1
rows 64-127, node 2 rows 128-191, and node 3 rows 192-255. After
sending the data to the other three nodes, we simply check that every
matrix element equals 1.0; any element that does not tells us the data
was not received correctly at that position.

What we've found is that for EACH of the three nodes we send data to,
the data is received correctly EXCEPT for exactly 126 values at the
end of the last row. That is, nothing is received for row 127, columns
130-255; row 191, columns 130-255; and row 255, columns 130-255.
This is odd because 126 is not a power of 2.

Interestingly, if we instead declare the "rho" variable as a STATIC
matrix of size 256x256, there are no errors. This leads us to believe
there's something fishy about the way dynamic matrices are
constructed.
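
For comparison, here is a minimal sketch of a dynamic matrix that is
laid out like a static one. It assumes the dynamic "rho" is currently
built with one malloc per row (the sample code is not shown here, so
that is a guess); in that case the rows are not guaranteed to sit back
to back in memory. The helper names alloc_matrix_contiguous and
free_matrix_contiguous are hypothetical, made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a rows x cols matrix of doubles whose
 * elements live in ONE contiguous block, the way a static
 * double rho[256][256] does. m[i][j] indexing still works because
 * m[i] points at the start of row i inside that block. */
double **alloc_matrix_contiguous(int rows, int cols)
{
    assert(rows > 0 && cols > 0);

    double *data = malloc((size_t)rows * (size_t)cols * sizeof *data);
    double **m = malloc((size_t)rows * sizeof *m);
    if (data == NULL || m == NULL) {
        free(data);
        free(m);
        return NULL;
    }
    for (int i = 0; i < rows; i++)
        m[i] = data + (size_t)i * (size_t)cols;
    return m;
}

/* Release a matrix obtained from alloc_matrix_contiguous. */
void free_matrix_contiguous(double **m)
{
    if (m != NULL) {
        free(m[0]); /* the single data block */
        free(m);    /* the array of row pointers */
    }
}
```

With this layout, a call such as MPI_Send(&rho[64][0], 64 * 256,
MPI_DOUBLE, 1, tag, MPI_COMM_WORLD) would cover exactly rows 64-127,
since those 64 rows occupy 64 * 256 consecutive doubles.
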
If we run the code on only two nodes (so half of the 256x256 matrix
is sent to the other node), then exactly 254 values are missing on the
receiving side. This is confusing because that is not 126x2 = 252, so
the shortfall does not simply scale with the amount of data sent
either.
If we simply send our original chunk size of data + 126 values, the
program works. We are just confused about where the missing 126 values
come from and how this can be attributed to dynamic matrix memory
allocation, since the program works for static matrices.

Any help?

Richard Vuduc

Nov 19, 2008, 7:17:02 PM

to cse6230-h...@googlegroups.com

Which MPI are you using? I believe I built two MPIs -- MPICH2 and
OpenMPI. So if you are having problems with one and haven't tried the
other, you might give that a shot to see if you have the same problem.