Hi Thomas,
It certainly takes some getting used to thinking in terms of coarrays and PGAS languages when coming from more traditional parallelism models like MPI and OpenMP!
My general recommendations for performance and programmability are:
1) Avoid having coindexed items on both the RHS and LHS of an assignment statement, or as more than a single argument to subroutines and functions; this potentially creates a 3-way interaction where the local image is both "put"ting and "get"ting values from remote images. In OpenCoarrays this is implemented in some special `sendget` routines which are not well optimized.
2) For large simulations with reductions, you probably can't do better than the built-in collectives (`co_sum`, `co_reduce`, etc.).
3) You can and should use `sync images(...)` over `sync all` where only a subset of images needs to wait for each other but the entire code doesn't need to synchronize, e.g., halo exchanges.
4) For the best ability to overlap work and communication, use early "puts" instead of "gets". In principle a smart optimizing compiler could start a fetch early, but "puts" are easier for the compiler to reason about.
5) Use separate communication buffers (i.e., coarray variables for halo regions) to help loosen the coupling of the communication, and to ensure local operations act on contiguous, memory-aligned arrays without strange strides and gaps.
6) Events should be quite cheap and performant in most implementations; you can use them for latency hiding to overlap work and communication.
7) Remember that `event post` and `event wait` ARE image control statements, but `event_query` is *NOT* an image control statement. So in any spin-work-wait loops you can use `event_query` to test whether the event has been posted, but then you need either `sync memory` to induce user-defined ordering, or `event wait`, once the query tells you that the event has been posted. If the event has not yet posted after the first query, you can go do some other work until it does.
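To illustrate point 2, here's a minimal sketch of an intrinsic collective reduction, assuming a Fortran 2018 compiler (the `result_image` argument means only image 1 receives the final sum):

```Fortran
program collective_sum
  implicit none
  real :: total

  total = real(this_image())          ! each image contributes its own value
  call co_sum(total, result_image=1)  ! only image 1 receives the full sum
  if (this_image() == 1) print *, 'sum =', total
end program collective_sum
```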
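As a sketch of points 3 and 5 combined, a 1-D halo exchange might use a small dedicated coarray buffer and pairwise `sync images` like this (the field size `n` and the array `u` are made up for illustration):

```Fortran
integer, parameter :: n = 100
real :: u(0:n+1)        ! local field with ghost cells
real :: halo(2)[*]      ! small, contiguous communication buffer
integer :: me, left, right

me = this_image()
left = me - 1
right = me + 1
u = real(me)

! "put" my boundary values into my neighbours' halo buffers
if (left >= 1)             halo(2)[left]  = u(1)
if (right <= num_images()) halo(1)[right] = u(n)

! only neighbouring images need to wait for each other
if (left >= 1 .and. right <= num_images()) then
   sync images([left, right])
else if (left >= 1) then
   sync images(left)
else if (right <= num_images()) then
   sync images(right)
end if

! copy the received values into the ghost cells
if (left >= 1)             u(0)   = halo(1)
if (right <= num_images()) u(n+1) = halo(2)
```

Each interior image syncs only with its two neighbours, so distant images never wait on each other.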
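And a minimal sketch of the spin-work-wait pattern from point 7 (the event name `ready` and the two-image setup are just for illustration):

```Fortran
use iso_fortran_env, only: event_type
implicit none
type(event_type) :: ready[*]
integer :: count

if (this_image() == 2) then
   event post(ready[1])                ! signal image 1
else if (this_image() == 1) then
   do
      call event_query(ready, count)   ! NOT an image control statement
      if (count > 0) exit
      ! ... do some other useful local work while waiting ...
   end do
   event wait(ready)                   ! image control statement: orders segments
end if
```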
For your initial example, if you need the reduction accumulated on the first image, then you could do something like:
```Fortran
...
real, allocatable :: sum_reduction(:)[*]
...
allocate(sum_reduction(num_images())) ! implicit synchronization
sum_reduction = 0.0
...
if ( this_image() == 1 ) then
   sync images(*)
else
   sum_reduction(this_image())[1] = val
   sync images(1)
end if
if ( this_image() == 1 ) then
   ! Accumulate the sum sent by everyone else
   sum_reduction(1) = sum(sum_reduction(2:))
else
   ! Do some other work
end if
! co_broadcast sum_reduction(1) to all the other images if need be
! or use a strategy with sync images(...) like the one above to "put" or "get" sum_reduction(1)[1] ...
```
Anyway, I hope these suggestions don't muddy the waters and are helpful. I think your first attempts should be with `sync images(...)`. Without a fully working teams implementation this might be hard if only subsets of images need to coordinate, because `sync images(*)` will wait for all other images in the team, IIRC. However, a `sync images(*)` posted on image `n`, paired with a `sync images(n)` posted on each of the other images, releases the other images without them having to wait for each other. But your case is sufficiently complex that, without robust teams support, custom logic using events is probably a more complicated but more reliable approach.
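For the event-based approach, a sketch of the reduction above rewritten with events might look like the following (using `until_count`, a Fortran 2018 feature, to wait for all contributions at once; `val` is assumed declared and set as in the earlier example):

```Fortran
use iso_fortran_env, only: event_type
...
real, allocatable :: sum_reduction(:)[*]
type(event_type) :: posted[*]
...
allocate(sum_reduction(num_images())) ! implicit synchronization
sum_reduction = 0.0

if (this_image() /= 1) then
   sum_reduction(this_image())[1] = val  ! "put" my contribution
   event post(posted[1])                 ! then signal image 1
else
   ! wait until every other image has posted exactly once
   event wait(posted, until_count=num_images() - 1)
   sum_reduction(1) = sum(sum_reduction(2:))
end if
```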
Best of luck,
Zaak