Problem with communications

thomas.re...@gmail.com

Apr 18, 2019, 8:17:50 AM
to OpenCoarrays
Hello,

The problem below is probably due to my understanding of the standard rather than to OpenCoarrays, but I could not find any help on this specific topic.

I have a problem during communications with Gfortran + OpenCoarrays in a simple program:

------------

program test
 
  implicit none

  integer, codimension[*] :: val
  integer :: ii

  val = 0
  if (this_image() /= 1) then
    do ii = 1, 10
      val = val + ii
    end do
  end if

  !sync all
 
  if (this_image() /= 1) then
    val[1] = val[1] + val
  end if
 
  sync all
 
  write(*,*) 'image = ',this_image(),'val = ',val
 
end program test

------------

My understanding is that using 'val[1] = val[1] + val' will require each image to finish its local job (the summation val = val + ii) before the communication takes place. I did something similar in a larger code to avoid any synchronisation and to overlap calculations with communications. However, I get inconsistent results with 4 images (55 or 110, whereas it should be 165). What troubles me even more is that even with the first "sync all" statement (commented out in the example), the value is wrong. My question is therefore: is the statement "val[1] = val[1] + val" (a "push" approach) valid? Could somebody explain to me what the machinery behind this statement is in OpenCoarrays?

If I use co_sum, everything is fine, but I have read that it implies a synchronization barrier. The following code (a "pull" approach) also works, but it needs "sync all" statements.

Thanks in advance (and thanks for the great job on OpenCoarrays!)

Thomas

------------

program test
  
  implicit none

  integer, codimension[*] :: val
  integer :: ii

  val = 0
  if (this_image() /= 1) then
    do ii = 1, 10
      val = val + ii
    end do
  end if

  sync all
   
  if (this_image() == 1) then
    do ii = 2, num_images()
      val = val + val[ii]
    end do
  end if

  sync all
 
  write(*,*) 'image = ',this_image(),'val = ',val
 
end program test


------------

Zaak Beekman

Apr 18, 2019, 10:42:05 PM
to OpenCoarrays

Hi Thomas,

The semantics of coarrays are a bit tricky; I highly recommend investing in Modern Fortran Explained by Metcalf, Reid, and Cohen (https://amzn.to/2GvboAq) and reading the chapters about coarrays, image control statements, and segment ordering.

TL;DR: Image control statements define execution segment boundaries, and segments are only ordered with respect to one another if there is an image control statement. Allocations of coarrays, sync all and sync images(...), event wait, etc. are image control statements and let you order segments across images and reason about the execution of your code.
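
As a minimal illustration of segment ordering (my sketch, not from the book), run with at least two images:

```Fortran
program segment_order
  implicit none
  integer :: flag[*]

  flag = 0
  sync all                                 ! ensure the initialisation precedes the remote put
  if (this_image() == 2) flag[1] = 42      ! "put" executed in a segment before the next sync all
  sync all                                 ! image control statement: segment boundary on every image
  if (this_image() == 1) write(*,*) flag   ! ordered after the put, so this prints 42
end program segment_order
```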

The problem here is that `val[1] = val[1] + val` is not an atomic operation; First each image (other than 1) must fetch the value of `val` from image one (the `val[1]` on the LHS of the equals sign). Then each image performs the addition locally of `val + val[1]` where `val[1]` is whatever value was fetched from image 1. Next, each image stores the result of that sum back into `val` on image 1, i.e., `val[1]`. The problem occurs because for each image, there is no ordering when it fetches `val[1]` on the left hand side of the assignment, and when it "puts" `val[1]` back on the right hand side. Therefore, you have a race condition with non-deterministic behavior.
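
In other words, conceptually the single statement behaves like this three-step sequence on each image (a sketch only, with `tmp` a hypothetical local temporary, not code OpenCoarrays literally generates):

```Fortran
tmp = val[1]      ! remote "get" of val from image 1 (unordered across images)
tmp = tmp + val   ! local addition
val[1] = tmp      ! remote "put" back to image 1; may overwrite another image's update
```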

Whenever there is a coindexed object on the RHS of an assignment, or corresponding to an intent(in) dummy argument, you effectively have a "get" operation: that value is fetched from the remote image. If a coindexed object is on the LHS of an assignment (or an actual argument corresponding to an intent(out) dummy argument) then it is effectively a "put". The problem in your initial code is that, with or without the commented-out sync all statement, the coindexed coarray `val[1]` appears on both sides of the assignment, and is accessed from multiple images.

Since the semantics of your statement `val[1] = val[1] + val` are completely undefined, I think 55 is a completely reasonable answer: this would imply every image fetched `val[1]` while its value was still zero, then added its local `val` to this fetched value (0), and finally overwrote `val[1]` with the result.

If you want to atomically update `val[1]` you could try wrapping that statement in a `critical` ... `end critical` block; this will ensure only one image can execute it at a time. However, I still have my doubts about whether this will correctly order segments across the images and would need to further consult MRC or the standard to verify this.
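
For what it's worth, a minimal sketch (mine) of what that would look like in your example, with the caveat above that I have not verified it against the standard:

```Fortran
if (this_image() /= 1) then
  critical                   ! only one image at a time executes this block
    val[1] = val[1] + val    ! successive executions of the block are in ordered segments
  end critical
end if
sync all                     ! make the accumulated value visible before val is read on image 1
```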

At the end of the day, the algorithms in collectives like `co_sum()` are likely to be highly optimized (or call through to MPI collectives or other parallel runtime collectives that are highly optimized) and in most circumstances will do much better than any implementation that you hand code. Using the optional result_image argument to co_sum() to tell the runtime that the reduced value is only required on one image may further improve things by releasing the other images from the collective operation early. I agree that it's too bad that we don't (yet) have non-blocking collectives, but in your example, I doubt there is a better way to do it than calling a collective, blocking or not.
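
For example, a short sketch of the collective approach using that optional argument (here `val` is the local partial sum from your original example):

```Fortran
call co_sum(val, result_image=1)                   ! collective sum; result only needed on image 1
if (this_image() == 1) write(*,*) 'total = ', val  ! on the other images val is undefined after the call
```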

I hope my explanation makes sense and has been helpful.

Thanks,
Zaak

Zaak Beekman

Apr 18, 2019, 10:45:05 PM
to OpenCoarrays


On Thursday, April 18, 2019 at 10:42:05 PM UTC-4, Zaak Beekman wrote:

The problem here is that `val[1] = val[1] + val` is not an atomic operation; First each image (other than 1) must fetch the value of `val` from image one (the `val[1]` on the LHS of the equals sign). Then each image performs the addition locally of `val + val[1]` where `val[1]` is whatever value was fetched from image 1. Next, each image stores the result of that sum back into `val` on image 1, i.e., `val[1]`. The problem occurs because for each image, there is no ordering when it fetches `val[1]` on the left hand side of the assignment, and when it "puts" `val[1]` back on the right hand side. Therefore, you have a race condition with non-deterministic behavior.

After pressing "send" I wanted to clarify a little bit more. The root problem is that the "fetch" of `val[1]` on the LHS of the assignment on image m, is un-ordered with respect to the "put" of `val[1]` on the RHS of the assignment on image n.

Zaak Beekman

Apr 18, 2019, 10:47:15 PM
to OpenCoarrays


After pressing "send" I wanted to clarify a little bit more. The root problem is that the "fetch" of `val[1]` on the LHS of the assignment on image m, is un-ordered with respect to the "put" of `val[1]` on the RHS of the assignment on image n.

Gah! My dyslexia got the best of me there: the "fetch" should correspond to the *RHS* of the assignment, the "put" corresponds to the *LHS* of the assignment. Sorry about that.

-Zaak 

thomas.re...@gmail.com

Apr 19, 2019, 7:21:41 AM
to OpenCoarrays

Hi Zaak,

Thanks a lot for your comprehensive and insightful answer. I was not aware that the sum "val[1] + val" would be done on the local image n (/= 1) and thus would require a fetch. But it makes perfect sense.

My feeling with coarrays is that they are quite powerful, but that we do not have the same safeguards as in the rest of the Fortran standard (perhaps for the sake of efficiency). Quite disturbing for people used to writing Fortran!

Anyway, let's go with collectives!

Thomas

we...@iastate.edu

Apr 19, 2019, 8:00:56 AM
to OpenCoarrays
Also, see ATOMIC_ADD, which OpenCoarrays supports.
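
A minimal sketch (mine, not from the OpenCoarrays documentation) of atomic_add applied to the integer example above; note that atomic_add is only defined for integers of kind atomic_int_kind:

```Fortran
program atomic_sum
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind) :: total[*]
  integer :: local_sum, ii

  total = 0
  sync all                                ! initialisation must precede the remote atomic updates

  local_sum = 0
  if (this_image() /= 1) then
    do ii = 1, 10
      local_sum = local_sum + ii
    end do
    call atomic_add(total[1], local_sum)  ! atomic update of total on image 1; no critical section needed
  end if

  sync all                                ! order all updates before image 1 reads the result
  if (this_image() == 1) write(*,*) 'total = ', total
end program atomic_sum
```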


-- 
Nathan

thomas.re...@gmail.com

Apr 19, 2019, 9:30:19 AM
to OpenCoarrays
Thanks. Unfortunately, unlike the test, my code handles reals, not integers (and I could not find an appropriate atomic routine for reals in the standard). In the larger code, the work (the loop on variable ii in the test) is performed several times ("niter" times), by a small subset of images which can be different from one iteration to another. "val" is actually an allocatable array of size "niter". If a barrier is set at each iteration, images that could already start calculating the next iteration are blocked. That's why co_sum does the job but is perhaps not optimal for a large number of images.

By looking more closely at events, I think they could be interesting for this. If I define an array like this:
type(event_type), allocatable :: Events(:)[:]

and allocate it to the total number of iterations:
allocate(Events(niter)[*])

then I can do the following:
do jj = 1, niter
  ...
  ! calculation by a subset of images; calculation must be collected on image n
  ...
  event post(Events(jj)[n])
  if (this_image() == n) then
    event wait(Events(jj), until_count=num_images())
    do im = 1, num_images()
      if (im /= n) then
        val(jj) = val(jj) + val(jj)[im]
      end if
    end do
  end if
end do

The scheme could be improved by having only the images of the subset post events, and by iterating im only over the images of the subset. If communications are non-blocking, I would expect this to be better than co_sum.

I will try to do some tests and see if it is worth adding this complexity.

Zaak you are right, the conclusion is that I should have read Metcalf-Reid-Cohen more carefully.

Cheers,

Thomas

Zaak Beekman

Apr 19, 2019, 11:09:08 AM
to OpenCoarrays
Hi Thomas,

It certainly takes some getting used to thinking in terms of coarrays and PGAS languages when coming from more traditional parallelism models like MPI and OpenMP!

My general recommendation for performance and programmability is:

1) Avoid having coindexed items on both the RHS and LHS of an assignment statement, or as more than a single argument to subroutines and functions; this potentially creates a 3-way interaction where the local image is both "put"ting and "get"ting values from remote images. In OpenCoarrays this is implemented in some special `sendget` routines which are not well optimized.
2) For large simulations with reductions you probably can't do better than the built-in collectives.
3) You can and should use `sync images(...)` instead of `sync all` where a subset of images need to wait for each other but the whole set of images does not, e.g. halo exchanges.
4) For the best ability to overlap work and communication, use early "puts" instead of "gets". In principle a smart optimizing compiler could start a fetch early, but it is easier for the compiler to reason about "puts".
5) Use separate communication buffers (i.e., coarray variables for halo regions) to help loosen the coupling of the communication, and ensure local operations are on memory-aligned arrays without strange strides and gaps.
6) Events should be quite cheap and performant in most implementations; you can use them for latency hiding to overlap work and communication.
7) Remember that `event post` and `event wait` ARE image control statements, but `event_query` is *NOT* an image control statement. So in a spin-work-wait loop you can use `event_query` to test whether the event has been posted, but you then need either `sync memory` to induce user-defined ordering, or `event wait` once the query tells you that the event has been posted (see the sketch after this list). If the event has not been posted after the first query, you can go do some other work until it does.
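
Here is a sketch of the spin-work-wait pattern from item 7; the event coarray `ready[*]` (posted by some other image) and the `do_other_work()` routine are hypothetical placeholders:

```Fortran
integer :: posted_count
do
  call event_query(ready, posted_count)  ! NOT an image control statement: just peeks at the count
  if (posted_count > 0) then
    event wait (ready)                   ! image control statement: orders the posting segment before this one
    exit
  end if
  call do_other_work()                   ! latency hiding: useful local work while waiting
end do
```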

For your initial example, if you need the reduction accumulated on the first image, then you could do something like:

```Fortran

...

real, allocatable :: sum_reduction(:)[:]  ! allocatable coarrays need a deferred coshape

...

allocate(sum_reduction(num_images())[*]) ! implicit synchronization
sum_reduction = 0.0

...

if ( this_image() == 1 ) then
    sync images(*)
else
    sum_reduction(this_image())[1] = val
    sync images(1)
end if

if ( this_image() == 1 ) then
    ! Accumulate the sum sent by everyone else
    sum_reduction(1) = sum(sum_reduction(2:))
else
    ! Do some other work
end if

! co_broadcast sum_reduction(1) to all the other images if need be
! or use a strategy with sync images(...) like the one above to "put" or "get" sum_reduction(1)[1] ...

```

Anyway, I hope these suggestions don't muddy the waters and are helpful. I think your first attempts should be with `sync images(...)`, although without a fully working teams implementation this might be hard if only subsets of images need to coordinate: `sync images(*)` will wait for all other images in the team, IIRC, but a `sync images(*)` posted on image `n` together with a `sync images(n)` posted on the other images should release the other images without them having to wait for each other. Your case is sufficiently complex, though, that without robust teams support, custom logic using events is probably a more complicated but more reliable approach.

Best of luck,
Zaak

Michael Siehl

Apr 20, 2019, 4:48:38 AM
to OpenCoarrays
Solving your problem is very simple:

program test

  implicit none

  integer, dimension(1:5), codimension[*] :: val ! to compile/run with max. 5 images
  integer :: ii

  val = 0

  if (this_image() /= 1) then
    val(this_image()) = 55
  end if

  sync all

  if (this_image() /= 1) then
    val(this_image())[1] = val(this_image())
  end if

  sync all

  write(*,*) 'image = ',this_image(),'val = ',sum(val)

end program test

You can use this with other data types as well, even character (except that you cannot use sum() then). Combining Fortran's array and coarray syntax is the simple solution and also the main strength of this PGAS approach. With Fortran 2008 the arrays were required to be impractically large (i.e. of size num_images()). Thanks to Fortran 2018 coarray teams the arrays can be kept small (i.e. of size num_images() within the current team).
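
For what it's worth, a hypothetical sketch of the Fortran 2018 teams idea (splitting images into two teams by parity is just an example, and it assumes a compiler/runtime with working teams support):

```Fortran
program teams_sketch
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: subset
  integer, allocatable :: val(:)[:]

  form team (1 + mod(this_image(), 2), subset)  ! e.g. split into two teams by image parity
  change team (subset)
    allocate(val(num_images())[*])  ! num_images() now returns the size of the current team
    val = 0
    ! ... work and accumulate within the smaller team ...
  end team                          ! val is deallocated automatically when the construct ends
end program teams_sketch
```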

cheers 

thomas.re...@gmail.com

Apr 23, 2019, 6:56:32 AM
to OpenCoarrays
Michael, Zaak,

Thanks a lot for your examples and hints. It is interesting to see that you both use "puts" instead of "gets". I was not aware that they can give better performance (actually I read a paper by Ashby and Reid - CUG 2008 proceedings - where the results seem to be the opposite, but that older paper used a different coarray implementation).

Cheers,

Thomas

Zaak Beekman

Apr 23, 2019, 12:01:43 PM
to thomas.re...@gmail.com, OpenCoarrays
Puts vs. gets depends both on the implementation of the parallel runtime library (OpenCoarrays vs. others) and on how the algorithm is implemented. I like puts (into dedicated communication buffers) because, with a suitable implementation, and paired with events (or, hopefully in the future, notified access), you can "send" data from the local image to the remote image and, in theory, you don't need to wait on the remote image to finish receiving the data before you proceed. Similarly, you can query whether your inbox buffer has the data you need with something like events, and then make decisions about what to work on if you're still waiting for remote data. IMO, this makes it easier to conceptualize ways to achieve latency hiding by overlapping communication with computation.
