Dumping a DistMatrix<double,MC,MR> to a single file

ANJU KAMBADUR

May 16, 2013, 2:45:19 PM

to <elemental-dev@googlegroups.com>

I have a distributed matrix that I have to dump to a *single* file. I have tried two methods:

(1) Use the Axpy interface to get the DistMatrix to one place and then write out the entries in binary.
(2) Use the "Print()" method to write out the entries in text.

As expected, on a vanilla cluster, both of these methods are a major bottleneck. Does anyone have any suggestions (other than to abandon writing out distributed matrices --- that is a requirement I cannot circumvent as of now)?

Just some context: I am using DistMatrix<double,MC,MR> to perform some BLAS-3 and BLAS-2 operations in parallel on rectangular matrices. Is it faster to get the DistMatrix in one place if I use some other distribution? If I use some other distribution, will there be an internal conversion to <MC,MR> for operations such as GEMM, GEMV, SYRK, etc.?

- Anju
--------------------------------------------
Prabhanjan Kambadur
Research Staff Member
Business Analytics and Mathematical Sciences
IBM TJ Watson Research Center
Room 30-229 A

Jack Poulson

May 16, 2013, 3:11:44 PM

to elemen...@googlegroups.com

Dear Anju,

Ignoring any potential parallel filesystem magic, I wager that one cannot do much better than performing a single MPI_Gather(v) call and then writing to disk from that process. The cost should be roughly:

alpha lg p + beta n^2

with p processes, an n x n matrix, a message latency of alpha, and a bandwidth of 1/beta.

The method currently used for the [MC,MR] PrintBase routine, which is used under the hood of Print (for writing to stdout) and Write (for writing to a file), is, with a simple cost model, only off by a lg p factor in the bandwidth, as it performs an MPI_Reduce on an n x n matrix, resulting in a cost of:

alpha lg p + beta n^2 lg p

Practically speaking, the expense is a bit worse than this, as, in the MPI_Gather(v) approach, only the root process allocates a buffer of size n^2, and all other processes only need n^2/p memory. In the simple MPI_Reduce approach, *every* process allocates (and fills) a buffer of size n^2, and the root process allocates a second one for the receive. This code was written a long time ago and could certainly be easily optimized (most trivially, by using MPI_IN_PLACE to avoid a second n^2 buffer on the root), as I have not worried much about Elemental's file I/O performance.
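
The two cost expressions and the buffer sizes described above can be put side by side in a few lines of Python. This is only a sketch of the simple alpha/beta model from this message; the function names and the machine parameters are assumptions chosen for illustration, not measurements or Elemental code.

```python
import math

def gather_cost(n, p, alpha, beta):
    """Modelled time to gather an n x n matrix onto one process: alpha lg p + beta n^2."""
    return alpha * math.log2(p) + beta * n * n

def reduce_cost(n, p, alpha, beta):
    """Modelled time to reduce full n x n buffers: alpha lg p + beta n^2 lg p."""
    return alpha * math.log2(p) + beta * n * n * math.log2(p)

def buffer_entries(n, p, approach):
    """Per-process buffer sizes (in matrix entries) as described in the message."""
    if approach == "gather":
        return {"root": n * n, "others": n * n // p}
    if approach == "reduce":
        # every process fills an n^2 buffer; the root also needs a receive buffer
        return {"root": 2 * n * n, "others": n * n}
    raise ValueError(f"unknown approach: {approach}")

# Hypothetical parameters (assumptions): 2 microseconds per message,
# 8 nanoseconds per double, a 20000 x 20000 matrix on 1024 processes.
alpha, beta = 2e-6, 8e-9
n, p = 20000, 1024

print("gather:", gather_cost(n, p, alpha, beta), "s")
print("reduce:", reduce_cost(n, p, alpha, beta), "s")
print("gather buffers:", buffer_entries(n, p, "gather"))
print("reduce buffers:", buffer_entries(n, p, "reduce"))
```

When the bandwidth term dominates, the ratio of the two modelled times approaches lg p, which is the "off by a lg p factor" statement above.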

Patches are welcome!

Jack

--
You received this message because you are subscribed to the Google Groups "elemental-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elemental-de...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jeff Hammond

May 16, 2013, 3:18:11 PM

to elemen...@googlegroups.com

> Ignoring any potential parallel filesystem magic, I wager that one cannot do
> much better than performing a single MPI_Gather(v) call and then writing to
> disk from that process. The cost should be roughly:
> alpha lg p + beta n^2
> with p processes, and n x n matrix, a message latency of alpha, and
> bandwidth of 1/beta.

Ignoring any potential MPI collective magic, I wager that one cannot
do much better than performing N send-recv pair calls to implement the
MPI_Gather(v)...

(If my point is not transparent above, I would be happy to be more explicit.)

> The method currently used for the [MC,MR] PrintBase routine, which is used
> under the hood of Print (for writing to stdout) and Write (for writing to a
> file), is, with a simple cost model, only off by a lg p factor in the
> bandwidth, as it performs an MPI_Reduce on an n x n matrix, resulting in a
> cost of:
> alpha lg p + beta n^2 lg p
>
> Practically speaking, the expense is a bit worse than this, as, in the
> MPI_Gather(v) approach, only the root process allocated a buffer of size
> n^2, and all other processes only need n^2/p memory. In the simple
> MPI_Reduce approach, *every* process allocated (and fills) a buffer of size
> n^2, and the root process allocates a second one for the receive. This code
> was written a long time ago and could certainly be easily optimized (most
> trivially, by using MPI_IN_PLACE to avoid a second n^2 buffer on the root),
> as I have not worried much about Elemental's file I/O performance.
>
> Patches are welcome!

MPI-IO or HDF5 are the right way to implement IO in Elemental without
reinventing the wheel...

Jeff

--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jham...@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides

Jack Poulson

May 16, 2013, 3:26:06 PM

to elemen...@googlegroups.com

On Thu, May 16, 2013 at 12:18 PM, Jeff Hammond <jham...@alcf.anl.gov> wrote:

>> Ignoring any potential parallel filesystem magic, I wager that one cannot do
>> much better than performing a single MPI_Gather(v) call and then writing to
>> disk from that process. The cost should be roughly:
>> alpha lg p + beta n^2
>> with p processes, and n x n matrix, a message latency of alpha, and
>> bandwidth of 1/beta.
>
> Ignoring any potential MPI collective magic, I wager that one cannot
> do much better than performing N send-recv pair calls to implement the
> MPI_Gather(v)...
>
> (If my point is not transparent above, I would be happy to be more explicit.)

I see your point, but what I wanted to avoid was defining what it means to "write a file". One might demand that it be readable from a single location. If so, then it has to exist in a single location, which requires the equivalent of an MPI_Gather(v) to have been performed at some point, and I would much rather trust MPI to do this properly.

Notice that my cost model ignores the, say, delta n^2 cost that would be required to write n^2 entries to file from a single process and focuses on the cost of the communication. I suppose that delta >> alpha,beta, so maybe the model needs to be extended for rational discussion.

>> The method currently used for the [MC,MR] PrintBase routine, which is used
>> under the hood of Print (for writing to stdout) and Write (for writing to a
>> file), is, with a simple cost model, only off by a lg p factor in the
>> bandwidth, as it performs an MPI_Reduce on an n x n matrix, resulting in a
>> cost of:
>> alpha lg p + beta n^2 lg p
>>
>> Practically speaking, the expense is a bit worse than this, as, in the
>> MPI_Gather(v) approach, only the root process allocated a buffer of size
>> n^2, and all other processes only need n^2/p memory. In the simple
>> MPI_Reduce approach, *every* process allocated (and fills) a buffer of size
>> n^2, and the root process allocates a second one for the receive. This code
>> was written a long time ago and could certainly be easily optimized (most
>> trivially, by using MPI_IN_PLACE to avoid a second n^2 buffer on the root),
>> as I have not worried much about Elemental's file I/O performance.
>>
>> Patches are welcome!
>
> MPI-IO or HDF5 are the right way to implement IO in Elemental without
> reinventing the wheel...

I am admittedly not very familiar with either MPI-IO or HDF5. If someone else wants to contribute code for this, I would happily incorporate it.

Jack

Jed Brown

May 16, 2013, 3:27:32 PM

to Jeff Hammond, elemen...@googlegroups.com

Jeff Hammond <jham...@alcf.anl.gov> writes:

> MPI-IO or HDF5 are the right way to implement IO in Elemental without
> reinventing the wheel...

Agreed, and defining the cyclic ordering is easy to do in either system.

Jeff, can you point to any benchmarks that compare the IO bandwidth of
cyclic ordering versus contiguous blocks? I know that use of MPI-IO
collectives is much more important for cyclic ordering, but I don't know
how far its peak is away from that of block ordering.

Jeff Hammond

May 16, 2013, 3:34:15 PM

to elemen...@googlegroups.com

I have no idea. Sorry. You might know this guy named Rob Latham... :-)

Jeff

ANJU KAMBADUR

May 16, 2013, 3:45:15 PM

to elemen...@googlegroups.com

Thanks Guys. I figured that there was no way around this bottleneck :-/.

This dumping-to-file solution is not ideal, but it is one that occurs commonly. In my case, I have a really old cluster with NFS and some legacy MATLAB code that does mixed sparse and dense BLAS operations. I am trying to quickly scale up some of the critical operations using an "MPI server" that basically accepts requests and uses Elemental under the covers to do the necessary work.

I certainly did not mean to say that these are the kind of operations Elemental should support.

- Anju
--------------------------------------------
Prabhanjan Kambadur
Research Staff Member
Business Analytics and Mathematical Sciences
IBM TJ Watson Research Center
Room 30-229 A

Jack Poulson

May 16, 2013, 3:45:52 PM

to elemen...@googlegroups.com

On Thu, May 16, 2013 at 12:34 PM, Jeff Hammond <jham...@alcf.anl.gov> wrote:

> On Thu, May 16, 2013 at 2:27 PM, Jed Brown <fiv...@gmail.com> wrote:
>> Jeff Hammond <jham...@alcf.anl.gov> writes:
>>
>>> MPI-IO or HDF5 are the right way to implement IO in Elemental without
>>> reinventing the wheel...
>>
>> Agreed, and defining the cyclic ordering is easy to do in either system.
>>
>> Jeff, can you point to any benchmarks that compare the IO bandwidth of
>> cyclic ordering versus contiguous blocks? I know that use of MPI-IO
>> collectives is much more important for cyclic ordering, but I don't know
>> how far its peak is away from that of block ordering.
>
> I have no idea. Sorry. You might know this guy named Rob Latham... :-)

Worst-case scenario, a shuffle (via MPI_Alltoall) can be performed beforehand in order to write in whatever the most efficient manner is for MPI-IO. Such a shuffle should be cheap in comparison with the file IO.

So far, this is the best discussion I have found of MPI-IO:
http://www.sdsc.edu/us/training/workshops/institute2005/docs/Thakur-MPI-IO.ppt

Jed Brown

May 16, 2013, 3:53:10 PM

to Jeff Hammond, elemen...@googlegroups.com, Rob Latham

Jeff Hammond <jham...@alcf.anl.gov> writes:

> On Thu, May 16, 2013 at 2:27 PM, Jed Brown <fiv...@gmail.com> wrote:
>> Jeff Hammond <jham...@alcf.anl.gov> writes:
>>
>>> MPI-IO or HDF5 are the right way to implement IO in Elemental without
>>> reinventing the wheel...
>>
>> Agreed, and defining the cyclic ordering is easy to do in either system.
>>
>> Jeff, can you point to any benchmarks that compare the IO bandwidth of
>> cyclic ordering versus contiguous blocks? I know that use of MPI-IO
>> collectives is much more important for cyclic ordering, but I don't know
>> how far its peak is away from that of block ordering.
>
> I have no idea. Sorry. You might know this guy named Rob Latham... :-)

Well, yes. It seems like something that should be in all the tutorials,
as the obvious "parallel IO" analog to demonstrating bandwidth for
strided access.

Jeff Hammond

May 16, 2013, 3:56:25 PM

to elemen...@googlegroups.com, Rob Latham

MPI-IO frequently does an MPI_Alltoall(v) internally to remap the data
appropriately so it would be a very bad idea to do the MPI_Alltoall(v)
approach _and_ the MPI-IO approach. Of course, if one is using the
p2p (i.e., non-collective) MPI-IO calls, then maybe rolling one's own is okay.

I defer to RobL in all MPI-IO matters though.

Jeff

Jed Brown

May 16, 2013, 3:59:47 PM

to Jeff Hammond, elemen...@googlegroups.com, Rob Latham

Jeff Hammond <jham...@alcf.anl.gov> writes:

> MPI-IO frequently does an MPI_Alltoall(v) internally to remap the data
> appropriately so it would be a very bad idea to do the MPI_Alltoall(v)
> approach _and_ the MPI-IO approach. Of course, if one is using the
> p2p i.e. non-coll. MPI-IO calls, then maybe rolling ones own is okay.

Yes, but writing cyclic ordering independently has got to be the worst
possible case. I don't have a sense for the relative cost of the
Alltoall compared to the expected IO bandwidth afterward.

Jeff Hammond

May 16, 2013, 4:30:02 PM

to elemen...@googlegroups.com, Rob Latham

Cyclic is bad if one insists upon writing out the matrix in contiguous
ordering, yes, but is this actually required? The advantage of the
alltoall(v) is that you can put the data in the right places such that
you can - at least in theory - maximize your IO to disk. Of course,
hitting theoretical peak IO performance on most systems requires
numerous human sacrifices and the linear alignment of at least 5
planets.

Jeff

Jed Brown

May 16, 2013, 4:34:32 PM

to Jeff Hammond, elemen...@googlegroups.com, Rob Latham

Jeff Hammond <jham...@alcf.anl.gov> writes:

> Cyclic is bad if one insists upon writing out the matrix in contiguous
> ordering, yes, but is this actually required?

Do you want to be able to read it back using a different number of
processes? If you'll always read back with exactly the same
distribution, then you can just write out each process' chunk in a
contiguous block. That doesn't strike me as being attractive to many
users, but I could be wrong.
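
To make the trade-off concrete, here is a minimal Python sketch of where one process's entries land in a contiguous column-major file under an element-cyclic 2D distribution. The r x c grid shape, the column-major ranking of the grid, and all function names are assumptions for illustration; this is not Elemental's API.

```python
def owner_and_local(i, j, r, c):
    """Element-cyclic 2D layout: entry (i, j) lives at grid position (i % r, j % c)."""
    grid_row, grid_col = i % r, j % c
    rank = grid_row + grid_col * r      # assumption: column-major ranking of the grid
    return rank, (i // r, j // c)       # owning rank and local (row, col) index

def file_offset(i, j, m, itemsize=8):
    """Byte offset of entry (i, j) in a column-major file of an m-row matrix of doubles."""
    return (i + j * m) * itemsize

def offsets_for_rank(rank, m, n, r, c, itemsize=8):
    """Sorted file offsets that one process touches when writing its own entries."""
    return sorted(
        file_offset(i, j, m, itemsize)
        for j in range(n)
        for i in range(m)
        if owner_and_local(i, j, r, c)[0] == rank
    )

# An 8 x 8 matrix on a hypothetical 2 x 2 process grid.
m, n, r, c = 8, 8, 2, 2
for rank in range(r * c):
    print(rank, offsets_for_rank(rank, m, n, r, c)[:6])
```

On a 2 x 2 grid each process owns every other entry of every other column, so an independent write of the globally ordered file touches isolated 8-byte fragments. Writing each process's chunk as one contiguous block avoids this, at the price of tying the file to the distribution that produced it.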

> The advantage of the alltoall(v) is that you can put the data in the
> right places such that you can - at least in theory - maximize your IO
> to disk.

Right, hence the question about relative cost of the Alltoall(v) versus
IO bandwidth. I suspect IO bandwidth is still the limiting factor on
most systems.

Jack Poulson

May 16, 2013, 4:44:16 PM

to elemen...@googlegroups.com, Jeff Hammond, Rob Latham

On Thu, May 16, 2013 at 1:34 PM, Jed Brown <jedb...@mcs.anl.gov> wrote:

> Jeff Hammond <jham...@alcf.anl.gov> writes:
>
>> The advantage of the alltoall(v) is that you can put the data in the
>> right places such that you can - at least in theory - maximize your IO
>> to disk.
>
> Right, hence the question about relative cost of the Alltoall(v) versus
> IO bandwidth. I suspect IO bandwidth is still the limiting factor on
> most systems.

The AllToAll cost will be ignorable, as its cost (assuming a low-bandwidth algorithm) will be of the form:

alpha p + beta n^2/p

and the cost of writing the file will be *at least*

delta n^2/p,

where delta is the cost of writing each entry to file (though there should probably be a file latency term as well). Since delta should be significantly larger than beta, and the latency cost, alpha p, will almost certainly be in the sub-second range for even very large p, the file IO should dominate.
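
The comparison can be sketched numerically. The Python below evaluates the two expressions from this message for one set of parameters; alpha, beta, delta, n, and p are hypothetical values picked for illustration (with delta five times beta), and the model deliberately includes nothing beyond the latency, bandwidth, and per-entry write terms.

```python
def alltoall_cost(n, p, alpha, beta):
    """p pairwise exchanges of n^2/p^2 entries each, i.e. alpha p + beta n^2 / p."""
    return p * (alpha + beta * n * n / (p * p))

def write_cost_lower_bound(n, p, delta):
    """Lower bound delta n^2 / p, assuming all p processes write their share at once."""
    return delta * n * n / p

# Hypothetical parameters (assumptions, not measurements).
alpha = 2e-6    # seconds per message
beta = 4e-9     # seconds per entry over the network
delta = 2e-8    # seconds per entry written to file
n, p = 200_000, 4096

shuffle = alltoall_cost(n, p, alpha, beta)
write = write_cost_lower_bound(n, p, delta)
print(f"alltoall: {shuffle:.4f} s, write lower bound: {write:.4f} s")
print("latency share of alltoall:", alpha * p / shuffle)
```

With a smaller matrix on the same number of processes the alpha p term takes a larger share of the shuffle, so it is worth rerunning the sketch with the sizes of interest before trusting the conclusion.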

Jack

Jed Brown

May 16, 2013, 5:38:38 PM

to Jack Poulson, elemen...@googlegroups.com, Jeff Hammond, Rob Latham

Jack Poulson <jack.p...@gmail.com> writes:

> The AllToAll cost will be ignorable, as its cost (assuming a low-bandwidth
> algorithm) will be of the form:
> alpha p + beta n^2/p
> and the cost of writing the file will be *at least*
> delta n^2/p,
> where delta is the cost of writing each entry to file (though there should
> probably be a file latency term as well). Since delta should be
> significantly larger than beta, and the latency cost, alpha p, will almost
> certainly be in the sub-second range for even very large p, the file IO
> should dominate.

You do these redistributions more than I, but 'beta n^2/p' is extremely
unrealistic. For large messages, MPICH's MPI_Alltoall does pairwise
exchange using MPI_Sendrecv, thus requiring p steps. A 4096 byte
Alltoall takes more than one second on 10k processes of Hopper using
Cray's optimized implementation (beta=4 kB/s in your model) and 6
seconds using the unoptimized implementation. See slide 25.

https://www.nersc.gov/assets/NUG-Meetings/2012/HowardP-MPI-NUG2012.pdf

The total amount of data in that 4kB all-to-all is only 40 MB, which my
laptop can write to disk in 100 ms. The observed scaling is superlinear
in P, so presumably the effective bandwidth is significantly worse with
100k procs.

Converting from 2D cyclic to 1D contiguous is not this bad because a lot
of the messages in the Alltoallv will be empty, but I think it's
premature to expect the Alltoallv to always be fast compared to IO
bandwidth.

Jed Brown

May 16, 2013, 5:39:34 PM

to Rob Latham, Jeff Hammond, elemen...@googlegroups.com

Rob Latham <ro...@mcs.anl.gov> writes:

> On Thu, May 16, 2013 at 02:53:10PM -0500, Jed Brown wrote:
>> Jeff Hammond <jham...@alcf.anl.gov> writes:
>>
>> > On Thu, May 16, 2013 at 2:27 PM, Jed Brown <fiv...@gmail.com> wrote:
>> >> Jeff Hammond <jham...@alcf.anl.gov> writes:
>> >>
>> >>> MPI-IO or HDF5 are the right way to implement IO in Elemental without
>> >>> reinventing the wheel...
>> >>
>> >> Agreed, and defining the cyclic ordering is easy to do in either system.
>> >>
>> >> Jeff, can you point to any benchmarks that compare the IO bandwidth of
>> >> cyclic ordering versus contiguous blocks? I know that use of MPI-IO
>> >> collectives is much more important for cyclic ordering, but I don't know
>> >> how far its peak is away from that of block ordering.
>> >
>> > I have no idea. Sorry. You might know this guy named Rob Latham... :-)
>>
>> Well, yes. It seems like something that should be in all the tutorials,
>> as the obvious "parallel IO" analog to demonstrating bandwidth for
>> strided access.
>

> Hi Jed, all: we had an old benchmark to demonstrate how performance varies
> with the contiguity of data: one 8 MiB block vs. two 4 MiB blocks ... vs. 8
> thousand 1 KiB blocks. It's fragile and I hate it.
>
> In tutorials we don't bother quantifying the difference between
> contiguous parallel i/o and cyclic (interleaved among processes?)
> parallel i/o. We probably should.
>
> Let me scrounge up an IOR recipe for this and cook some Mira numbers.

Thanks, Rob. That would be really helpful to keep our mental
performance models in check.

Jed Brown

May 16, 2013, 5:41:53 PM

to Rob Latham, Jeff Hammond, elemen...@googlegroups.com

Rob Latham <ro...@mcs.anl.gov> writes:

> On Thu, May 16, 2013 at 03:34:32PM -0500, Jed Brown wrote:
>> Jeff Hammond <jham...@alcf.anl.gov> writes:
>>
>> > Cyclic is bad if one insists upon writing out the matrix in contiguous
>> > ordering, yes, but is this actually required?
>>
>> Do you want to be able to read it back using a different number of
>> processes? If you'll always read back with exactly the same
>> distribution, then you can just write out each process' chunk in a
>> contiguous block. That doesn't strike me as being attractive to many
>> users, but I could be wrong.
>

> Since you mentioned reading back the data, something like HDF5 or
> pnetcdf might make sense: you're dumping arrays to a file so it's a
> good fit for the API and file model, and you get machine portability
> (so you can read back your bluegene dump on your laptop or linux
> cluster and not re-implement a byteswapping routine -- or maybe you
> can skip re-implementing the reader altogether using the tools that
> already exist in the HDF5 ecosystem).

Kinda heavy libraries, but yes, they are worth considering.

Jeff Hammond

May 16, 2013, 5:44:27 PM

to elemen...@googlegroups.com, Rob Latham

>>> > Cyclic is bad if one insists upon writing out the matrix in contiguous
>>> > ordering, yes, but is this actually required?
>>>
>>> Do you want to be able to read it back using a different number of
>>> processes? If you'll always read back with exactly the same
>>> distribution, then you can just write out each process' chunk in a
>>> contiguous block. That doesn't strike me as being attractive to many
>>> users, but I could be wrong.
>>
>> Since you mentioned reading back the data, something like HDF5 or
>> pnetcdf might make sense: you're dumping arrays to a file so it's a
>> good fit for the API and file model, and you get machine portability
>> (so you can read back your bluegene dump on your laptop or linux
>> cluster and not re-implement a byteswapping routine -- or maybe you
>> can skip re-implementing the reader altogether using the tools that
>> already exist in the HDF5 ecosystem).
>
> Kinda heavy libraries, but yes, they are worth considering.

Having worked with multiple users trying to move binary files from x86
to BG, I can say that there are few things I enjoy less than
explaining endian conversion. HDF5 deals with this and that's worth
the weight.

Jeff

Jack Poulson

May 16, 2013, 6:14:40 PM

to Jed Brown, elemen...@googlegroups.com, Jeff Hammond, Rob Latham

On Thu, May 16, 2013 at 2:38 PM, Jed Brown <jedb...@mcs.anl.gov> wrote:

> Jack Poulson <jack.p...@gmail.com> writes:
>
>> The AllToAll cost will be ignorable, as its cost (assuming a low-bandwidth
>> algorithm) will be of the form:
>> alpha p + beta n^2/p
>> and the cost of writing the file will be *at least*
>> delta n^2/p,
>> where delta is the cost of writing each entry to file (though there should
>> probably be a file latency term as well). Since delta should be
>> significantly larger than beta, and the latency cost, alpha p, will almost
>> certainly be in the sub-second range for even very large p, the file IO
>> should dominate.
>
> You do these redistributions more than I, but 'beta n^2/p' is extremely
> unrealistic. For large messages, MPICH's MPI_Alltoall does pairwise
> exchange using MPI_Sendrecv, thus requiring p steps.

Nothing about what I just said contradicts your claim. The key is to note that, with dense linear algebra, it is standard to keep p <= n. Thus n^2 >= p^2, and this implies that n^2/p >= p, which allows us to send n^2/p^2 entries in each of the p pairwise exchanges, which results in an overall (modelled) cost matching my claim:

p (alpha + beta n^2/p^2) = alpha p + beta n^2/p.

> A 4096 byte
> Alltoall takes more than one second on 10k processes of Hopper using
> Cray's optimized implementation (beta=4 kB/s in your model) and 6
> seconds using the unoptimized implementation. See slide 25.
>
> https://www.nersc.gov/assets/NUG-Meetings/2012/HowardP-MPI-NUG2012.pdf
>
> The total amount of data in that 4kB all-to-all is only 40 MB, which my
> laptop can write to disk in 100 ms. The observed scaling is superlinear
> in P, so presumably the effective bandwidth is significantly worse with
> 100k procs.

Continuing with the above discussion, a 4096 byte AllToAll corresponds to 512 double-precision entries, and this would be equal to n^2/p^2 in our matrix example. Thus, n ~= 22 p, and, when p=10,000, n ~= 220,000.

Your disk timings correspond to 400 MB/s, whereas BGP has a network bandwidth of 2250 MB/s (the easiest number I had access to). So delta >> beta isn't quite true, but typically delta > beta. Message latencies are typically on the order of a few microseconds, and so, with p exchanges, one should expect roughly p microseconds of latency. Thus, unless about a million processes are used, the latency cost should be under a second.

So, what gives? The obvious answer is that the model I cited isn't sufficient. What it ignores is network contention, which is known to typically be aggravated by AllToAll communication patterns. Since, if you stuck your hard drive into BGP, delta would be only roughly 5 beta, it only takes five messages competing for the same link before the effective bandwidth drops below the hard disk write speed. I have not given network conflicts for AllToAll on multi-dimensional tori much thought, but I would assume that there is a paper discussing this issue somewhere.
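
The back-of-the-envelope numbers in this message can be reproduced with a few lines of Python. The inputs are the figures quoted in the thread (4096-byte messages, 10,000 processes, 40 MB written in 100 ms, 2250 MB/s for BG/P); the one-microsecond latency per exchange is an assumed round number, and the variable names are illustrative.

```python
import math

message_bytes = 4096
doubles_per_message = message_bytes // 8           # 8-byte doubles per message
n_over_p = math.sqrt(doubles_per_message)          # from n^2 / p^2 = entries per message
p = 10_000
n = n_over_p * p                                   # matrix dimension implied by the model

disk_mb_per_s = 40 / 0.100                         # 40 MB written in 100 ms
network_mb_per_s = 2250                            # BG/P figure quoted in the message
bandwidth_ratio = network_mb_per_s / disk_mb_per_s # network bandwidth relative to disk

latency_per_exchange_s = 1e-6                      # assumption: 1 microsecond per exchange
total_latency_s = p * latency_per_exchange_s       # p pairwise exchanges

print("entries per message:", doubles_per_message)
print("n / p:", round(n_over_p, 1), "-> n:", round(n))
print("disk MB/s:", disk_mb_per_s, "network/disk ratio:", bandwidth_ratio)
print("modelled latency (s):", total_latency_s)
```

The ratio of roughly five to six is what makes contention plausible as the missing term: once about that many messages share a link, the per-message network bandwidth is no better than the quoted disk rate.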

> Converting from 2D cyclic to 1D contiguous is not this bad because a lot
> of the messages in the Alltoallv will be empty, but I think it's
> premature to expect the Alltoallv to always be fast compared to IO
> bandwidth.

I would agree given the cited numbers. My conclusion is that network contention must be taken into account here if one wants to draw qualitative conclusions about the coefficient in the runtime.

Jack

Jed Brown

May 16, 2013, 6:31:28 PM

to Rob Latham, Jeff Hammond, elemen...@googlegroups.com, jack.p...@gmail.com

Rob Latham <ro...@mcs.anl.gov> writes:

> On Thu, May 16, 2013 at 04:39:34PM -0500, Jed Brown wrote:
>> Thanks, Rob. That would be really helpful to keep our mental
>> performance models in check.
>

> Can you give me an idea about how much data per process is getting
> dumped here? Want to make sure the synthetic benchmark at least
> somewhat approximates what you are trying to do.

Typical problem sizes in the Elemental paper [1] are 50k*50k on 8192
cores of BG/P. With double precision reals, that would be a total of 20
GB. The parallel distribution is 2D cyclic.

[1] http://doi.acm.org/10.1145/2427023.2427030
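
As a quick check of these sizes, a short Python sketch (the function name is illustrative; reading "50k" as 50*1024 in the second calculation is an assumption, chosen because it reproduces the 2 560 000 bytes per process quoted later in the thread):

```python
def matrix_bytes(rows, cols, itemsize=8):
    """Storage for a dense rows x cols matrix of double-precision reals."""
    return rows * cols * itemsize

cores = 8192

total = matrix_bytes(50_000, 50_000)
print("total bytes:", total)                       # 20 GB with decimal prefixes
print("bytes per core:", total / cores)            # roughly 2.4 MB each

total_binary = matrix_bytes(50 * 1024, 50 * 1024)  # "50k" read as 50 * 1024
print("bytes per core (binary 50k):", total_binary // cores)
```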

Jed Brown

May 24, 2013, 9:57:21 AM

to Rob Latham, Jeff Hammond, elemen...@googlegroups.com, jack.p...@gmail.com

Rob Latham <ro...@mcs.anl.gov> writes:

> On Thu, May 23, 2013 at 12:59:57PM -0500, Rob Latham wrote:
>> So, that's your three configurations. in each case, the processes
>> write 2 367 488 bytes (which was as close as I could come to
>> 50*1024*50*1024*8 / 8192 = 2 560 000 bytes)
>>
> Here's a fourth experiment: independent I/O once something has
> already put the data in contiguous format. (I didn't bother with the
> independent i/o to non-contig regions in file -- that's just going to
> be abysmal).

Thanks, Rob. This is very useful.

> If the data is already contiguous and fairly large, one might well
> wonder what advantage collective i/o offers. Well, two things:
>
> - collective i/o will aggregate data down to a few processes. Blue
> Gene has I/O proxies, so instead of 8192 clients the file system
> only sees 32 proxies. Those proxies in independent mode will be
> representing 256 clients. In collective I/O, however, they only
> represent 8 clients (by default). So, there is less contention for
> the i/o proxy service.
>
> - collective i/o on blue gene will block-align writes, which is
> probably the single biggest optimization one can do on GPFS.
>
> Let's add a fourth line to that table then:
> case                                    write    write open-close   read      read open-close
> collective I/O, non interleaved         744.04   701.59             1859.86   1819.64
> collective I/O, interleaved, alltoall   599.82   572.82             4140.56   3385.18
> collective I/O, interleaved, alltoall   383.46   378.28             3135.08   2973.80

Is this actually point-to-point, not alltoall?

I'm curious about the cost of doing the naive alltoallv up-front, before
calling MPI-IO. Do we expect it to exactly match MPI-IO (i.e., MPI-IO
does exactly the alltoallv on the compute nodes, then calls the
contiguous code), or does MPI-IO do something like an inter-communicator
alltoallv directly to the IONs?

The take-home here seems to be that redistributing to a contiguous
representation is not free, but not that expensive either (not enough to
justify ADIOS, for example). So unless you are micro-optimizing and you
can do the redistribution significantly faster than Alltoallv using
application knowledge, standard collective MPI-IO is just fine. That's
a pretty good position to be in.

> independent i/o, non-interleaved 256.47 253.44 791.82i 763.72
>
> It's entirely possible that Elemental, since it has more
> application-level knowledge about data layout, can rearrange the array
> data more efficiently than MPI-IO can. Once the data hits the MPI-IO
> layer it's just "stream of bytes". I don't know how to quantify that
> for you, though.
>
> I'd say describe to MPI-IO your data with an MPI datatype and throw as
> much data at MPI-IO in one collective function call as you can, and if
> we are spending more than 25% of runtime in I/O we can look at ways to
> tune that up.
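
For a rough sense of scale, the "write" column of the table can be normalized against the contiguous collective case. This Python sketch assumes the figures are throughputs where larger is better; the thread does not state the units, and the dictionary keys are just labels for the four rows in the order they appear (the two interleaved rows carry the same label in the original).

```python
# "write" column of the table, in the order the rows appear in the thread.
write_rates = {
    "collective, non-interleaved": 744.04,
    "collective, interleaved (first row)": 599.82,
    "collective, interleaved (second row)": 383.46,
    "independent, non-interleaved": 256.47,
}

baseline = write_rates["collective, non-interleaved"]
relative = {case: rate / baseline for case, rate in write_rates.items()}

for case, fraction in relative.items():
    print(f"{case:40s} {fraction:5.2f}")
```

Under that assumption the first interleaved collective row reaches about four fifths of the contiguous collective rate, which fits the reading that the redistribution is not free but not prohibitive either.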
