I have a distributed matrix that I have to dump to a *single* file. I have tried two methods:
(1) Use the Axpy interface to get the DistMatrix to one place and then write out the entries in binary.
(2) Use the "Print()" method to write out the entries in text.
As expected, on a vanilla cluster, both these methods are a major bottleneck. Has anyone any suggestions (other than to abandon writing out distributed matrices --- that is a requirement I cannot circumvent as of now)?
Just some context: I am using DistMatrix<double,MC,MR> to perform some BLAS-3 and BLAS-2 operations in parallel on rectangular matrices. Is it faster to gather the DistMatrix in one place if I use some other distribution? If I use some other distribution, will there be an internal conversion to <MC,MR> for operations such as GEMM, GEMV, SYRK, etc.?
- Anju
--------------------------------------------
Prabhanjan Kambadur
Research Staff Member
Business Analytics and Mathematical Sciences
IBM TJ Watson Research Center
Room 30-229 A
--
You received this message because you are subscribed to the Google Groups "elemental-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elemental-de...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
> Ignoring any potential parallel filesystem magic, I wager that one cannot do
> much better than performing a single MPI_Gather(v) call and then writing to
> disk from that process. The cost should be roughly:
> alpha lg p + beta n^2
> with p processes, an n x n matrix, a message latency of alpha, and
> bandwidth of 1/beta.
Ignoring any potential MPI collective magic, I wager that one cannot
do much better than performing N send-recv pair calls to implement the
MPI_Gather(v)...
(If my point is not transparent above, I would be happy to be more explicit.)
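For concreteness, here is a minimal sketch of the per-rank counts and displacements that a single MPI_Gatherv call would need on the root. It assumes an element-wise cyclic [MC,MR] layout with zero alignments and ranks numbered column-major over an r x c grid; `LocalLength`, `GatherLayout`, and `MakeGatherLayout` are hypothetical names for illustration, not Elemental's API.

```cpp
#include <vector>

// Number of indices k in [0, n) with k % stride == shift.
// Assumes an element-wise cyclic distribution with zero alignment.
inline int LocalLength(int n, int shift, int stride) {
    return n > shift ? (n - shift - 1) / stride + 1 : 0;
}

struct GatherLayout {
    std::vector<int> counts;  // entries contributed by each rank
    std::vector<int> displs;  // offset of each rank's block in the root buffer
};

// Hypothetical helper: layout for gathering an m x n matrix spread over an
// r x c process grid. Assumes rank = gridRow + gridCol * r (column-major).
inline GatherLayout MakeGatherLayout(int m, int n, int r, int c) {
    GatherLayout layout;
    layout.counts.resize(r * c);
    layout.displs.resize(r * c);
    int offset = 0;
    for (int col = 0; col < c; ++col) {
        for (int row = 0; row < r; ++row) {
            const int rank = row + col * r;
            layout.counts[rank] =
                LocalLength(m, row, r) * LocalLength(n, col, c);
            layout.displs[rank] = offset;
            offset += layout.counts[rank];
        }
    }
    return layout;
}
```

The two vectors would be passed as the receive counts and displacements on the root. Note that the gathered buffer then holds each process's local block back to back, so the root still has to permute entries into global order before writing.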
> The method currently used for the [MC,MR] PrintBase routine, which is used
> under the hood of Print (for writing to stdout) and Write (for writing to a
> file), is, with a simple cost model, only off by a lg p factor in the
> bandwidth, as it performs an MPI_Reduce on an n x n matrix, resulting in a
> cost of:
> alpha lg p + beta n^2 lg p
>
> Practically speaking, the expense is a bit worse than this, as, in the
> MPI_Gather(v) approach, only the root process allocates a buffer of size
> n^2, and all other processes only need n^2/p memory. In the simple
> MPI_Reduce approach, *every* process allocates (and fills) a buffer of size
> n^2, and the root process allocates a second one for the receive. This code
> was written a long time ago and could certainly be easily optimized (most
> trivially, by using MPI_IN_PLACE to avoid a second n^2 buffer on the root),
> as I have not worried much about Elemental's file I/O performance.
>
> Patches are welcome!
MPI-IO or HDF5 are the right way to implement IO in Elemental without
reinventing the wheel...
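To put numbers on the two cost expressions quoted above, a small sketch of the same latency/bandwidth model follows. The struct and function names are made up for illustration, and the model is only the simple one from the quoted text (no contention, no file latency).

```cpp
#include <cmath>

// Simple model from the quoted discussion:
//   alpha = latency per message, beta = time per matrix entry (1/bandwidth).
struct CostModel {
    double alpha;
    double beta;
};

// Gather(v) to one root: alpha lg p + beta n^2
inline double GatherCost(const CostModel& m, double n, double p) {
    return m.alpha * std::log2(p) + m.beta * n * n;
}

// Reduce of a full n x n buffer: alpha lg p + beta n^2 lg p
inline double ReduceCost(const CostModel& m, double n, double p) {
    return std::log2(p) * (m.alpha + m.beta * n * n);
}

// Buffer size (in entries) on a non-root process under each approach.
inline double GatherNonRootEntries(double n, double p) { return n * n / p; }
inline double ReduceNonRootEntries(double n) { return n * n; }
```

With alpha = 0 the ratio ReduceCost / GatherCost is exactly lg p, which is the "off by a lg p factor in the bandwidth" statement; the memory helpers show the n^2/p versus n^2 gap on non-root processes.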
Thanks, guys. I figured that there was no way around this bottleneck :-/.
This dump-to-file solution is not ideal, but it is a situation that occurs commonly. In my case, I have a really old cluster with NFS and some legacy MATLAB code that does mixed sparse and dense BLAS operations. I am trying to quickly scale up some of the critical operations using an "MPI server" that basically accepts requests and uses Elemental under the covers to do the necessary work.
I certainly did not mean to say that these are the kind of operations Elemental should support.
- Anju
| Jeff Hammond <jham...@alcf.anl.gov> |
| elemen...@googlegroups.com |
| 05/16/2013 03:34 PM |
| Re: [elemental] Dumping a DistMatrix<double,MC,MR> to a single file |
| elemen...@googlegroups.com |
On Thu, May 16, 2013 at 2:27 PM, Jed Brown <fiv...@gmail.com> wrote:
> Jeff Hammond <jham...@alcf.anl.gov> writes:
>
>> MPI-IO or HDF5 are the right way to implement IO in Elemental without
>> reinventing the wheel...
>
> Agreed, and defining the cyclic ordering is easy to do in either system.
>
> Jeff, can you point to any benchmarks that compare the IO bandwidth of
> cyclic ordering versus contiguous blocks? I know that use of MPI-IO
> collectives is much more important for cyclic ordering, but I don't know
> how far its peak is away from that of block ordering.
I have no idea. Sorry. You might know this guy named Rob Latham... :-)
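To make the cyclic-versus-contiguous distinction concrete, the sketch below lists the byte offsets one process would touch in a column-major file of doubles. It assumes an element-wise cyclic layout with zero alignments on an r x c grid; `OwnedFileOffsets` is a hypothetical helper, not part of Elemental, MPI-IO, or HDF5.

```cpp
#include <cstdint>
#include <vector>

// Byte offsets, within a column-major file of doubles holding an m x n
// matrix, of the entries owned by grid position (row, col) on an r x c grid.
// Assumes entry (i, j) is owned by (i % r, j % c).
inline std::vector<std::int64_t> OwnedFileOffsets(int m, int n, int r, int c,
                                                  int row, int col) {
    std::vector<std::int64_t> offsets;
    const std::int64_t entryBytes = static_cast<std::int64_t>(sizeof(double));
    for (int j = col; j < n; j += c) {
        for (int i = row; i < m; i += r) {
            offsets.push_back((static_cast<std::int64_t>(j) * m + i) * entryBytes);
        }
    }
    return offsets;
}
```

For a 4 x 4 matrix on a 2 x 2 grid, position (1, 0) owns offsets 8, 24, 72 and 88: single entries separated by gaps, rather than one contiguous run. In MPI-IO this pattern would normally be described once through a derived datatype and a file view instead of being enumerated, which is presumably why collective I/O matters more for this ordering.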
Jeff Hammond <jham...@alcf.anl.gov> writes:
> The advantage of the alltoall(v) is that you can put the data in the
> right places such that you can - at least in theory - maximize your IO
> to disk.
Right, hence the question about relative cost of the Alltoall(v) versus
IO bandwidth. I suspect IO bandwidth is still the limiting factor on
most systems.
Jack Poulson <jack.p...@gmail.com> writes:
> The AllToAll cost will be ignorable, as its cost (assuming a low-bandwidth
> algorithm) will be of the form:
> alpha p + beta n^2/p
> and the cost of writing the file will be *at least*
> delta n^2/p,
> where delta is the cost of writing each entry to file (though there should
> probably be a file latency term as well). Since delta should be
> significantly larger than beta, and the latency cost, alpha p, will almost
> certainly be in the sub-second range for even very large p, the file IO
> should dominate.
You do these redistributions more than I, but 'beta n^2/p' is extremely
unrealistic. For large messages, MPICH's MPI_Alltoall does pairwise
exchange using MPI_Sendrecv, thus requiring p steps. A 4096 byte
Alltoall takes more than one second on 10k processes of Hopper using
Cray's optimized implementation (beta=4 kB/s in your model) and 6
seconds using the unoptimized implementation. See slide 25.
https://www.nersc.gov/assets/NUG-Meetings/2012/HowardP-MPI-NUG2012.pdf
The total amount of data in that 4kB all-to-all is only 40 MB, which my
laptop can write to disk in 100 ms. The observed scaling is superlinear
in P, so presumably the effective bandwidth is significantly worse with
100k procs.
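As an illustration of why a pairwise-exchange all-to-all takes a number of steps proportional to the process count, here is a sketch of one common way to schedule it. This is a generic formulation with a hypothetical function name, not a copy of MPICH's source.

```cpp
#include <utility>
#include <vector>

// For each of the p - 1 remote steps, the (destination, source) pair that
// `rank` would hand to a combined send-receive call. At step s every rank
// sends to (rank + s) % p and receives from (rank - s + p) % p, so the
// sends and receives match up across ranks. The local copy to self is the
// remaining step.
inline std::vector<std::pair<int, int>> PairwiseSchedule(int rank, int p) {
    std::vector<std::pair<int, int>> steps;
    for (int s = 1; s < p; ++s) {
        steps.emplace_back((rank + s) % p, (rank - s + p) % p);
    }
    return steps;
}
```

Each rank walks through every other rank one at a time, so the latency term grows linearly with the number of processes regardless of how small the individual messages are.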
Converting from 2D cyclic to 1D contiguous is not this bad because a lot
of the messages in the Alltoallv will be empty, but I think it's
premature to expect the Alltoallv to always be fast compared to IO
bandwidth.
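A rough way to see how many Alltoallv messages are empty is to count them for one sender. The sketch below assumes an element-wise cyclic source layout on an r x c grid and a target layout of contiguous column blocks, one block per process; the thread does not specify the 1D layout, so that choice and the function names are assumptions.

```cpp
#include <algorithm>
#include <vector>

// Entries that the process at grid position (row, col) sends to each of the
// p = r * c target processes, where target t owns the contiguous columns
// [t * width, (t + 1) * width) with width = ceil(n / p).
inline std::vector<int> SendCounts(int m, int n, int r, int c,
                                   int row, int col) {
    const int p = r * c;
    const int width = (n + p - 1) / p;
    // Rows owned locally: indices i in [0, m) with i % r == row.
    const int localHeight = m > row ? (m - row - 1) / r + 1 : 0;
    std::vector<int> counts(p, 0);
    for (int j = col; j < n; j += c) {
        counts[j / width] += localHeight;  // whole local column goes to one target
    }
    return counts;
}

inline int NonEmptyMessages(const std::vector<int>& counts) {
    return static_cast<int>(std::count_if(
        counts.begin(), counts.end(), [](int x) { return x > 0; }));
}
```

Under these assumptions, an 8 x 8 matrix on a 2 x 4 grid gives each sender only 2 non-empty messages out of 8, whereas widening the matrix to 64 columns makes all 8 non-empty. So the fraction of empty messages depends on how the block width compares with the grid width.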