Many networks do end-to-end completion, so you will not see much
improvement from nonblocking GA calls. I rewrote part of NWChem to
use nonblocking calls and observed zero improvement, even though there
was significant computation available to overlap, because the network
I tested on (InfiniBand) does not allow much overlap.
If you're on a Blue Gene/P machine, you can overlap computation and
communication very effectively in the hardware, but there are some
software issues that prevent doing this effectively in GA right now.
Jeff
On Wed, Dec 22, 2010 at 12:46 AM, Karthik <karthi...@gmail.com> wrote:
> Hi all,
>
> I have been trying to parallelize a Jacobi algorithm using Global
> Arrays, using the primitive get and put methods to read from and write
> into the global array. The core implementation is as follows. I tried
> the non-blocking get and put, but there is no significant computation
> that can overlap the latency of the read and write, so it doesn't
> provide any major improvement in performance. Is there a way to
> improve the performance by eliminating the blocking get and put calls?
>
> while (iter < maxiter) {
>     NGA_Get(g_xold, low, high, xold, ld);
>
>     /* compute xnew from xold */
>
>     NGA_Put(g_xold, lo2, hi2, xnew, ld1);
>     GA_Sync();
>
>     iter++;
> }
--
Jeff Hammond
Argonne Leadership Computing Facility
jham...@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
If you don't have significant computation, then you might not see much
benefit from non-blocking communication. However, let us consider the
following scenario, where computation time is small compared to
communication time.
The most you can overlap is bounded by your computation time. If
computation takes 10 microseconds and communication takes 40
microseconds, then perfect overlap cuts an iteration from 50 to 40
microseconds using non-blocking calls: a 1.25x speedup, or 20% less
time. In this case, the overlap is small compared to the communication,
but you could still get *some* improvement.
Even in the worst case, you will see the same or better performance when
using non-blocking communication (compared to blocking).
Hope it helps.
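The arithmetic can be sketched in a few lines of C. This is only an idealized model, assuming perfect overlap and ignoring any overhead from issuing or waiting on non-blocking calls; the function names are hypothetical and not part of GA.

```c
#include <assert.h>

/* Hypothetical helper: time per iteration when communication blocks,
 * so compute and communication simply add up. */
double blocking_time(double compute_us, double comm_us)
{
    assert(compute_us >= 0.0 && comm_us >= 0.0);
    return compute_us + comm_us;
}

/* Hypothetical helper: best-case time per iteration when communication
 * is fully overlapped with compute (the longer of the two dominates). */
double overlapped_time(double compute_us, double comm_us)
{
    assert(compute_us >= 0.0 && comm_us >= 0.0);
    return compute_us > comm_us ? compute_us : comm_us;
}

/* Fraction of the blocking time saved by perfect overlap.
 * It is 0 when there is nothing to overlap and at most 0.5,
 * which is reached when compute and communication take equally long. */
double overlap_saving(double compute_us, double comm_us)
{
    double t_block = blocking_time(compute_us, comm_us);
    if (t_block <= 0.0)
        return 0.0;
    return (t_block - overlapped_time(compute_us, comm_us)) / t_block;
}
```

With 10 microseconds of compute and 40 microseconds of communication, blocking_time gives 50, overlapped_time gives 40, and overlap_saving gives 0.2. With no independent compute at all, the saving is 0, which matches the "same performance" worst case.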
-Manoj.
---------------------------------------------------------------
Manojkumar Krishnan | High Performance Computing | (509) 372-4206
http://hpc.pnl.gov/people/manoj
This is not an appropriate use case for non-blocking (NB) calls unless there is computation that does not depend on the preceding Get, or the Jacobi step itself can be overlapped with something else, in which case the Put could be NB.
Jeff
It is well known that Jacobi iteration is a terrible solver for many
types of PDEs. You should be using multigrid, which is many orders of
magnitude more efficient than Jacobi.
If you design your code properly, you should be able to use
nonblocking communication to hide the ghost cell update.
For example:
1. start nonblocking update of ghost cells
2. start direct access to interior region
3. do interior stencil pass
4. end direct access
5. end nonblocking ghost cell update with wait
6. do ghost cell stencil pass
I don't know if the GA function for ghost cell update is nonblocking,
but you could just as well implement ghost cells on your own.
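The ordering in the six steps above can be illustrated without any GA calls. The following is a minimal 1D sketch, assuming each process owns cells 1..n of a local buffer with one ghost cell on each side (index 0 and n+1) and a simple averaging stencil. All function names are hypothetical, and the "arrival" of ghost data is simulated by plain assignments where a real code would wait on its nonblocking update.

```c
#include <assert.h>
#include <stddef.h>

/* Update the cells that never read a ghost cell: i = 2 .. n-1. */
void jacobi_interior(const double *xold, double *xnew, size_t n)
{
    for (size_t i = 2; i < n; i++)
        xnew[i] = 0.5 * (xold[i - 1] + xold[i + 1]);
}

/* Update the edge cells, which read the ghosts xold[0] and xold[n+1]. */
void jacobi_edges(const double *xold, double *xnew, size_t n)
{
    assert(n >= 1);
    xnew[1] = 0.5 * (xold[0] + xold[2]);
    if (n > 1)
        xnew[n] = 0.5 * (xold[n - 1] + xold[n + 1]);
}

/* One sweep in the order of the list: interior first, edges last.
 * left_ghost and right_ghost stand in for data delivered by the
 * nonblocking ghost update (an assumption of this sketch). */
void jacobi_sweep_overlapped(double *xold, double *xnew, size_t n,
                             double left_ghost, double right_ghost)
{
    /* Step 1: the nonblocking ghost update would be started here. */
    jacobi_interior(xold, xnew, n);   /* Steps 2-4: no ghost data needed. */
    xold[0] = left_ghost;             /* Step 5: ghost values are available */
    xold[n + 1] = right_ghost;        /*         only after the wait.       */
    jacobi_edges(xold, xnew, n);      /* Step 6: cells next to the ghosts.  */
}
```

Because the interior pass never touches xold[0] or xold[n+1], those entries may hold stale values while the update is in flight. The overlap is only worthwhile when the interior is large relative to the edge cells.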
Jeff
This error suggests that you have not properly constructed the map array in your call to nga_create_ghosts_irreg (or whatever call you are using to construct your global array). Note that if the nblock array has values M, N, ..., then the first M values of map need to be monotonic, the next N values of map need to be monotonic, etc. Check the documentation on the create_irreg calls in GA for more information on the nblock and map arrays.
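A quick sanity check along these lines can catch a bad map before the create call. The checker below is a hypothetical helper, not part of GA. It assumes the C convention of 0-based starting indices, and it assumes each segment must be strictly increasing and within the array extent; whether GA tolerates repeated values is not something this sketch verifies.

```c
#include <assert.h>

/*
 * Hypothetical checker (not a GA function). Returns 1 if map[] is
 * consistent with nblock[] for an array with extents dims[], else 0.
 * map[] holds nblock[0] starting indices for dimension 0, followed by
 * nblock[1] starting indices for dimension 1, and so on.
 */
int map_is_valid(int ndim, const int dims[], const int nblock[],
                 const int map[])
{
    int offset = 0;  /* where this dimension's segment begins in map[] */

    assert(ndim > 0 && dims && nblock && map);
    for (int d = 0; d < ndim; d++) {
        if (nblock[d] < 1)
            return 0;
        for (int b = 0; b < nblock[d]; b++) {
            int start = map[offset + b];
            if (start < 0 || start >= dims[d])
                return 0;  /* starting index outside the array */
            if (b > 0 && start <= map[offset + b - 1])
                return 0;  /* segment is not increasing */
        }
        offset += nblock[d];
    }
    return 1;
}
```

For an 8x10 array with nblock = {3, 2}, the map {0, 2, 6, 0, 5} passes, while {0, 6, 2, 0, 5} fails because the first three values are not increasing.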
Bruce
This is fundamentally due to the performance gap between these two communication protocols: shared memory and RDMA.
-Manoj.
> -----Original Message-----
> From: HPC TOOLS On Behalf Of Karthik
> Sent: Saturday, December 25, 2010 2:01 PM
> To: HPC TOOLS
> Subject: [hpctools] Re: Global arrays - Performance optimization
>
The last two arguments in the NGA_Create_ghosts_irreg call have been switched. They should be:
    g_xold = NGA_Create_ghosts_irreg(C_DBL, ndim, dims, width, "xold", block, map);
I think this will fix the problem.
Bruce
-----Original Message-----
From: HPC TOOLS On Behalf Of Karthik
Sent: Tuesday, December 28, 2010 12:07 AM
*snip*
Shouldn't our API be consistent? Is this a bug then?
--
-Manoj.
Bruce
Since we do not want to break the codes using NGA_Create_irreg, let us not
modify its API or documentation. However, I vote for modifying
NGA_Create_ghosts_irreg to be consistent with NGA_Create_irreg
(since it is a relatively new API and its usage is presumably
limited or zero), in both Fortran and C.
In this way we ensure backward compatibility with minimal (or zero)
damage.
--
-Manoj.