Global arrays - Performance optimization

Karthik

Dec 22, 2010, 1:50:11 AM

to hpctools

Hi all,

I have been trying to parallelize a Jacobi algorithm with Global
Arrays, using the primitive get and put calls to read from and write to
the global array. The core of the implementation is shown below. I tried
the nonblocking get and put, but there is no significant computation to
overlap with the read and write latency, so they give no major
improvement in performance. Is there a way to improve performance by
eliminating the blocking get and put calls?

while (iter < maxiter) {
    NGA_Get(g_xold, low, high, xold, ld);

    {compute xnew from xold}

    NGA_Put(g_xnew, lo2, hi2, xnew, ld1);
    copy g_xnew to g_xold
    iter++
}

Thanks and Regards
Karthik

Jeff Hammond

Dec 22, 2010, 8:15:56 AM

to hpct...@googlegroups.com

As you say, if there is no significant computation to overlap with,
nonblocking isn't going to help you. Based upon the structure of this
code, there is absolutely nothing you can do to improve the
performance without redesigning the algorithm. An alternative which
might improve performance would be to exploit locality to eliminate
the NGA_Get in favor of memcpy or direct local access.
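
A minimal sketch of the locality test this suggests, in plain C with no GA dependency: before issuing a get, check whether the requested patch lies entirely inside the block this process owns. The helper name `patch_is_local` is hypothetical and not part of GA. In a real GA code the owned bounds would presumably come from `NGA_Distribution`, and a fully local patch could then be read through `NGA_Access` rather than `NGA_Get`.

```c
#include <assert.h>

/* Hypothetical helper (not part of GA): returns 1 if the patch [lo, hi]
 * lies entirely inside the locally owned block [own_lo, own_hi] in every
 * dimension, 0 otherwise. All bounds are inclusive. */
int patch_is_local(int ndim, const int lo[], const int hi[],
                   const int own_lo[], const int own_hi[])
{
    assert(ndim > 0);
    for (int d = 0; d < ndim; d++) {
        if (lo[d] > hi[d])
            return 0;   /* empty or malformed patch */
        if (lo[d] < own_lo[d] || hi[d] > own_hi[d])
            return 0;   /* reaches into another process's block */
    }
    return 1;
}
```

For a Jacobi sweep, the patch that includes the neighbor rows will fail this test, which is why only the interior of the owned block can be handled by direct local access.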

Many networks do end-to-end completion, so you will not see much
improvement from nonblocking GA calls. I rewrote part of NWChem to
use nonblocking calls and observed zero improvement, despite the
presence of significant computation to overlap with, because the
network I tested it on (InfiniBand) doesn't allow for much overlap.

If you're on a Blue Gene/P machine, you can overlap computation and
communication very effectively in the hardware, but there are some
software issues that prevent doing this effectively in GA right now.

Jeff

On Wed, Dec 22, 2010 at 12:46 AM, Karthik <karthi...@gmail.com> wrote:
> Hi all,
>
> I have been trying to parallelize a jacobi algorithm using global
> arrays using primitive get and put methods to read from and write into
> the global array. The core implementation is as follows. I tried the
> non blocking get and put but there is no significant computation which
> can overlap the latency involved in read and write and hence doesn't
> provide any major improvement in performance. Is there a way to
> improve the performance by eliminating the blocking get and put calls?
>
> while(iter<maxiter){
> NGA_Get(g_xold, low, high, xold, ld);
>
> {compute xnew from xold}
>
>

> NGA_Put(g_xold, lo2, hi2, xnew, ld1);
> GA_Sync();
>
> iter++
> }

--
Jeff Hammond
Argonne Leadership Computing Facility
jham...@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond

Manojkumar Krishnan

Dec 22, 2010, 2:41:49 PM

to Jeff Hammond, HPC TOOLS

Karthik,

If you don't have significant computation, then you might not see the
benefit of nonblocking communication. However, let us consider the
following scenario, where computation time is small compared to
communication time.

The most you can overlap is bounded by your computation time. If
computation takes 10 microseconds and communication takes 40
microseconds, then with perfect overlap an iteration drops from 50 to
40 microseconds, so you will see up to ~25% improvement in performance
using nonblocking calls. In this case, the overlap is small compared
to the communication; however, you could still get *some* improvement.

Even in the worst case, you will see the same or better performance
when using nonblocking communication (compared to blocking).
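
The arithmetic behind that estimate can be sketched as follows. This is an idealized model, not a measurement: it assumes perfect overlap and that the computation does not depend on the data in flight. The function names are made up for illustration.

```c
#include <assert.h>

/* Time per iteration when communication and computation run back to back. */
double blocking_time(double compute_us, double comm_us)
{
    assert(compute_us >= 0.0 && comm_us >= 0.0);
    return compute_us + comm_us;
}

/* Idealized time per iteration with perfect overlap:
 * the longer phase completely hides the shorter one. */
double overlapped_time(double compute_us, double comm_us)
{
    assert(compute_us >= 0.0 && comm_us >= 0.0);
    return compute_us > comm_us ? compute_us : comm_us;
}

/* Speedup of perfect overlap over blocking; 1.25 means 25% faster. */
double overlap_speedup(double compute_us, double comm_us)
{
    double t = overlapped_time(compute_us, comm_us);
    assert(t > 0.0);
    return blocking_time(compute_us, comm_us) / t;
}
```

With 10 microseconds of computation and 40 of communication, `blocking_time` gives 50, `overlapped_time` gives 40, and `overlap_speedup` gives 1.25. With no computation at all the speedup is exactly 1, which is the "same performance" worst case.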

Hope it helps.

-Manoj.
---------------------------------------------------------------
Manojkumar Krishnan | High Performance Computing | (509) 372-4206
http://hpc.pnl.gov/people/manoj

Jeff Hammond

Dec 22, 2010, 5:22:12 PM

to Manojkumar Krishnan, HPC TOOLS

No, he cannot use NB Get because of the data dependency on xold in compute and cannot use NB Put because no compute is done after it.

This is not an appropriate use case for NB unless there is compute which does not depend on the preceding Get or the Jacobi itself can be overlapped with something else, in which case the Put could be NB.

Jeff

Sent from my iPhone

Karthik

Dec 23, 2010, 2:28:14 AM

to hpctools

I tried using the nonblocking put, but since there was no significant
computation to overlap with it, there was no improvement in
performance. Direct access to the global array proves costlier than
the get operation as well.

Jeff Hammond

Dec 23, 2010, 2:38:44 PM

to hpct...@googlegroups.com

If direct access is slower than GA_Get, then there is a bug in the
implementation or you are doing something wrong.

Jeff

Karthik

Dec 24, 2010, 1:15:41 PM

to hpctools

Jeff

The problem with direct access is that I cannot access the 'non-local'
elements directly, so I resorted to direct access with ghost cells
(since each process needs to communicate its boundary rows to its
neighbors). Updating the ghost cells on every iteration doesn't
provide the expected improvement in performance.

On Dec 23, 2:38 pm, Jeff Hammond <jhamm...@mcs.anl.gov> wrote:
> If direct access is slow than GA_Get then there is a bug in the
> implementation or you are doing something wrong.
>
> Jeff
>

> On Thu, Dec 23, 2010 at 1:28 AM, Karthik <karthikra...@gmail.com> wrote:
> > I tried using the Non Blocking Put but since there was no significant
> > computation to overlap it there was no improvement in performance.
> > Direct access to the global array proves costlier than the get
> > operation as well.
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility

> jhamm...@alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond

Jeff Hammond

Dec 24, 2010, 1:26:15 PM

to hpct...@googlegroups.com

True, but if you're doing Jacobi iteration to solve PDEs on a uniform
grid, communication is not even close to your biggest problem.

It is well known that Jacobi iteration is a terrible solver for many
types of PDEs. You should be using multigrid, which is many orders of
magnitude more efficient than Jacobi.

If you design your code properly, you should be able to use
nonblocking communication to hide the ghost cell update.

For example:

1. start nonblocking update of ghost cells
2. start direct access to interior region
3. do interior stencil pass
4. end direct access
5. end nonblocking ghost cell update with wait
6. do ghost cell stencil pass

I don't know if the GA function for ghost cell update is nonblocking,
but you could just as well implement ghost cells on your own.
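
The six steps above can be sketched in plain C as a sweep split into an interior pass and a ghost-dependent pass. This is only an illustration under stated assumptions: the local block is row-major with one ghost row above and below, the function names are hypothetical, and the GA calls are left as comments (a hand-rolled version might use `NGA_NbGet` and `NGA_NbWait`, but that choice is an assumption here).

```c
#include <assert.h>

/* Jacobi update of rows r0..r1 (inclusive) of a row-major local block
 * with ncols columns. Columns 0 and ncols-1 hold fixed boundary values.
 * Reads only from old_, writes only to new_. */
void jacobi_rows(const double *old_, double *new_, int ncols, int r0, int r1)
{
    for (int i = r0; i <= r1; i++)
        for (int j = 1; j < ncols - 1; j++)
            new_[i * ncols + j] = 0.25 * (old_[(i - 1) * ncols + j] +
                                          old_[(i + 1) * ncols + j] +
                                          old_[i * ncols + j - 1] +
                                          old_[i * ncols + j + 1]);
}

/* One sweep over a local block with owned rows 1..nrows and ghost rows
 * 0 and nrows+1, ordered so the ghost update can be hidden. */
void jacobi_split_sweep(const double *old_, double *new_, int nrows, int ncols)
{
    assert(nrows >= 1 && ncols >= 3);

    /* Step 1: a nonblocking fetch of the neighbors' boundary rows into
     * ghost rows 0 and nrows+1 would be started here (assumption). */

    /* Steps 2-4: interior rows 2..nrows-1 never read a ghost row,
     * so they can be updated while the fetch is in flight. */
    jacobi_rows(old_, new_, ncols, 2, nrows - 1);

    /* Step 5: the wait on the nonblocking fetch would go here (assumption). */

    /* Step 6: rows 1 and nrows read the ghost rows, so they go last. */
    jacobi_rows(old_, new_, ncols, 1, 1);
    if (nrows > 1)
        jacobi_rows(old_, new_, ncols, nrows, nrows);
}
```

The split sweep produces the same values as a single pass over rows 1..nrows, because every update reads only the old array. How much latency it hides depends on how many interior rows each process owns.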

Jeff

Karthik

Dec 25, 2010, 5:00:37 PM

to hpctools

Even the nonblocking calls don't improve performance significantly
for inter-node communication. Intra-node communication is 4 times
faster than inter-node communication for any of the GA constructs
(both blocking and nonblocking). Is there any way to reduce this
latency?

Regards
Karthik

Karthik

Dec 26, 2010, 7:13:06 PM

to hpctools

To balance the communication latency for an irregular workload, I
tried using "NGA_Create_ghosts_irreg", but it throws the following
runtime error.

1:1:ngai_create_ghosts_irreg_config:ga_set_irreg_distr:Mapc entries
are not properly monotonic:: 9
(rank:1 hostname:opt0717.ten.osc.edu pid:2119):ARMCI DASSERT fail. src/
armci.c:ARMCI_Error():276 cond:0

Can someone kindly clarify the possible causes of this?

Regards
Karthik Raj

Palmer, Bruce J

Dec 27, 2010, 11:07:31 AM

to HPC TOOLS

Karthik,

This error suggests that you have not properly constructed the map array in your call to nga_create_ghosts_irreg (or whatever call you are using to construct your global array). Note that if the nblock array has values M,N,..., then the first M values of map need to be monotonic, the next N values of map need to be monotonic, etc. Check the documentation on the create_irreg calls in GA for more information on the nblock and map arrays.
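
The rule above can be sketched as a small standalone check. `check_irreg_map` is a hypothetical helper, not a GA function. It assumes 0-based C indexing, that "monotonic" means strictly increasing within each dimension, and that each block start must fall inside the array. GA's own check may differ in detail.

```c
#include <assert.h>

/* Hypothetical checker (not part of GA). map holds nblock[0] block start
 * indices for dimension 0, then nblock[1] for dimension 1, and so on.
 * Returns -1 if the map passes, otherwise the index into map of the
 * first offending entry. */
int check_irreg_map(int ndim, const int dims[], const int nblock[],
                    const int map[])
{
    assert(ndim > 0);
    int k = 0;                      /* running index into map */
    for (int d = 0; d < ndim; d++) {
        for (int b = 0; b < nblock[d]; b++, k++) {
            if (map[k] < 0 || map[k] >= dims[d])
                return k;           /* start index outside the array */
            if (b > 0 && map[k] <= map[k - 1])
                return k;           /* not increasing within this dimension */
        }
    }
    return -1;
}
```

For an 8-by-1 block grid, the first eight entries must increase and the ninth belongs to the second dimension, so it may legitimately restart at 0.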

Bruce

Krishnan, Manojkumar

Dec 27, 2010, 2:58:41 PM

to HPC TOOLS, karthi...@gmail.com

Intra-node communication is always faster since it uses shared memory (so the cost is just a memcpy). Inter-node communication, however, uses the network, which is relatively slower.

This is fundamentally due to the performance gap between these two communication protocols: shared memory and RDMA.

-Manoj.
-----------------------------------------------------------------

Manojkumar Krishnan | High Performance Computing | (509) 372-4206
http://hpc.pnl.gov/people/manoj


Karthik

Dec 28, 2010, 3:07:08 AM

to hpctools

Bruce

But the same map works fine for the NGA_Create_irreg() call.

The following is the code snippet:

int ndim = 2;
int dims[] = {N+2,N+2};
int width[] = {1,0};
int block[] = {8,1};
int map[] = {0,13,26,39,52,65,78,91,0};

g_xold = NGA_Create_ghosts_irreg(C_DBL, ndim,dims,width,"xold", map,
block);

So what could be the difference in the usage of map between the
NGA_Create_irreg() and NGA_Create_ghosts_irreg() calls?

Regards
Karthik

Palmer, Bruce J

Jan 3, 2011, 3:43:17 PM

to HPC TOOLS

Okay, I've created a little test code that tries to create an array with this configuration on 8 processors and it tanks with the same error message. This nga_create routine is currently being used in a number of places with the fortran interface and seems to work okay, so I think it may be a problem with the C interface. I'll take a closer look and see if I can debug it.

Palmer, Bruce J

Jan 3, 2011, 4:07:02 PM

to HPC TOOLS

Karthik,

The last two arguments in the NGA_Create_ghosts_irreg call have been switched. They should be

g_xold = NGA_Create_ghosts_irreg(C_DBL, ndim,dims,width,"xold", block, map);

I think this will fix the problem.

Bruce


Daily, Jeff A

Jan 3, 2011, 5:14:16 PM

to Palmer, Bruce J, Karthik, HPC TOOLS

> -----Original Message-----
> From: HPC TOOLS On Behalf Of Palmer, Bruce J
> Sent: Monday, January 03, 2011 1:07 PM
> To: HPC TOOLS
> Subject: RE: [hpctools] Re: Global arrays - Performance optimization
>
> Karthik,
>
> The last two arguments in the NGA_Create_ghosts_irreg call have been
> switched. They should be
>
> g_xold = NGA_Create_ghosts_irreg(C_DBL, ndim,dims,width,"xold", block,
> map);
>
> I think this will fix the problem.

*snip*

Shouldn't our API be consistent? Is this a bug then?

Manojkumar Krishnan

Jan 3, 2011, 5:19:36 PM

to HPC TOOLS, Palmer, Bruce J, Karthik

"block" followed by "map" seems to be consistent (in comparison to
NGA_Create_irreg). Looks like a bug in the documentation (and API).

--

-Manoj.
---------------------------------------------------------------
Manojkumar Krishnan | High Performance Computing | (509) 372-4206
http://hpc.pnl.gov/people/manoj

Palmer, Bruce J

Jan 3, 2011, 5:31:30 PM

to Krishnan, Manojkumar, HPC TOOLS, Karthik

Okay, the main inconsistency seems to be between the Fortran and C interfaces: the Fortran interfaces are all "map, block" and the C interfaces are all "block, map". The C interface documentation for some of the calls seems to have been cut and pasted from the Fortran documentation, so it is wrong for the C interface. Do we want to fix up the C interface so it is consistent with the Fortran interface, or just fix up the documentation so that it is correct for the C interface?

Bruce

Manojkumar Krishnan

Jan 3, 2011, 5:43:12 PM

to Palmer, Bruce J, HPC TOOLS, Karthik

Just to be cautious and backward compatible, so that we are not
breaking things, how about just replicating the usage of
NGA_Create_irreg in NGA_Create_ghosts_irreg?

Since we do not want to break the codes using NGA_Create_irreg, let us not
modify its API or documentation. However, I vote for modifying
NGA_Create_ghosts_irreg to be consistent with NGA_Create_irreg
(since it is a relatively newer API and assuming its usage is
limited/zero) both in Fortran and C.

In this way we ensure backward compatibility with minimal (or zero)
damage.

--
-Manoj.
---------------------------------------------------------------
Manojkumar Krishnan | High Performance Computing | (509) 372-4206
http://hpc.pnl.gov/people/manoj

Palmer, Bruce J

Jan 3, 2011, 6:12:49 PM

to Krishnan, Manojkumar, HPC TOOLS, Karthik

The actual implementations of NGA_Create_irreg and NGA_Create_ghosts_irreg are the same (both use "block, map"). The documentation for NGA_Create_ghosts_irreg is wrong. So I guess what you are saying is to just fix up the documentation for NGA_Create_ghosts_irreg (and NGA_Create_ghosts_irreg_config) and leave everything else alone. The create_irreg calls in the C and Fortran interfaces will not match each other, but probably not that many people are using both.
