
reduction in do concurrent


Dominik Gronkiewicz
Oct 13, 2018, 1:21:46 PM

Is the following code valid? It does compile and give the expected results, but it has a dependency between iterations (the variable s), even though the order of execution does not matter since it is a reduction operation.

https://paste.fedoraproject.org/paste/rsfEsp5895~SQF7vMtCRpQ

real :: a(1024), s
integer :: i

call random_seed()
call random_number(a)

!-----------------------------------------------

s = 0

do concurrent (i = 1:size(a))
  s = s + (2 * a(i) - 1)
end do

print *, s

!-----------------------------------------------

s = 0

do i = 1, size(a)
  s = s + (2 * a(i) - 1)
end do

print *, s

!-----------------------------------------------

end

Tim Prince
Oct 13, 2018, 2:06:16 PM

The SUM intrinsic makes more sense.
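
For the toy loop in the original post that is a one-liner (a sketch; the rearranged second form avoids the array temporary that the first one creates):

s = sum(2.0*a - 1.0)
! or, rearranged to avoid the array temporary:
s = 2.0*sum(a) - size(a)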

--
Tim Prince

Steve Lionel
Oct 13, 2018, 2:14:17 PM

On 10/13/2018 1:21 PM, Dominik Gronkiewicz wrote:
> Is the following code valid? It does compile and give the expected results, but it has a dependency between iterations (the variable s), even though the order of execution does not matter since it is a reduction operation.

The standard says only that "The executions may occur in any order". It
is up to the programmer to determine whether or not that is acceptable.
I assume you know that, computationally, you might get different results
depending on the sequence and/or the compiler's choice of grouping
operations (vectorization, etc.)

--
Steve Lionel
Retired Intel Fortran developer/support
Email: firstname at firstnamelastname dot com
Twitter: @DoctorFortran
LinkedIn: https://www.linkedin.com/in/stevelionel
Blog: http://intel.com/software/DrFortran

Ron Shepard
Oct 13, 2018, 2:41:59 PM

On 10/13/18 12:21 PM, Dominik Gronkiewicz wrote:
> Is the following code valid? It does compile and give the expected results, but it has a dependency between iterations (the variable s), even though the order of execution does not matter since it is a reduction operation.

What exactly is your symptom? Do the two printed results differ? Or do
the results from one execution differ from the next?

If it is the former, then something is wrong with the compiler. For only
1024 values, all with the same order of magnitude, the two values should
differ in at most a few of the least significant bits, regardless of the
order of execution in the do concurrent loop.

If it is the latter, then the answer is that you need to initialize the
random number generator with a fixed seed each time. If your compiler
does some kind of randomization of the initial seed, then you would
expect to see different values printed on different runs. You are
calling random_seed(), but apparently you are expecting it to do
something specific without telling it what to do.

If you do want to initialize it to a fixed value, then you need to do
something like this (untested)


integer, allocatable :: seed(:)
...
call random_seed(size=i)    ! query how many integers the seed array holds
allocate( seed(i) )
seed(:) = 20181013          ! any fixed value makes runs reproducible
call random_seed(put=seed)
...

This is just from memory, so I hope I have the keywords spelled correctly.

$.02 -Ron Shepard

Dominik Gronkiewicz
Oct 13, 2018, 4:55:31 PM

On Saturday, October 13, 2018 at 8:14:17 PM UTC+2, Steve Lionel wrote:
> On 10/13/2018 1:21 PM, Dominik Gronkiewicz wrote:
> > Is the following code valid? It does compile and give the expected results, but it has a dependency between iterations (the variable s), even though the order of execution does not matter since it is a reduction operation.
>
> The standard says only that "The executions may occur in any order". It
> is up to the programmer to determine whether or not that is acceptable.
> I assume you know that, computationally, you might get different results
> depending on the sequence and/or the compiler's choice of grouping
> operations (vectorization, etc.)

Thanks, this is exactly the question I had in mind: whether this code violates any of the restrictions on "do concurrent" in the standard, which might produce unexpected results with compilers other than the one I tested (gfortran). I am aware that in this trivial example the sum() intrinsic would be the obvious choice, but the actual code I am working on is much more complex and involves several nested loops, with no dependencies other than the reduction, so DO CONCURRENT is the obvious choice there.

So DO CONCURRENT and reductions mix well. Thanks! :)

Dick Hendrickson
Oct 13, 2018, 5:09:36 PM

I don't think you can do this. One of the restrictions on the DO
CONCURRENT is

"A variable that is referenced in an iteration shall either be
previously defined during that iteration, or shall not be defined or
become undefined during any other iteration. A variable that is defined
or becomes defined by more than one iteration becomes undefined when the
loop terminates."

Here, s is referenced before it is defined on each iteration.
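
A minimal sketch of the distinction, assuming a work array t of the same
size as a:

! Not conforming: every iteration references and defines the same s.
do concurrent (i = 1:size(a))
  s = s + (2.0*a(i) - 1.0)
end do

! Conforming: each iteration defines only its own element t(i).
do concurrent (i = 1:size(a))
  t(i) = 2.0*a(i) - 1.0
end do
s = sum(t)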

Dick Hendrickson

Dominik Gronkiewicz
Oct 13, 2018, 5:13:29 PM

Yeah I remembered exactly that paragraph (it comes up often in discussions) but I wasn't sure if it refers to this situation, hence my question.

Tim Prince
Oct 13, 2018, 5:31:56 PM

Yes, a problem with SUM is that compilers aren't likely to fuse it
with neighboring array assignments, or even with DOT_PRODUCT or another
SUM, so the ability to express a reduction in DO CONCURRENT is
potentially useful.
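
For instance (a sketch, with b assumed to be a work array of the same size
as a): the intrinsic formulation makes two passes over the data, while a
plain loop makes one fused pass, and that fused pass is what one would like
to write as a DO CONCURRENT reduction.

! Two passes: the array assignment, then the SUM.
b = 2.0*a - 1.0
s = sum(b)

! One fused pass:
s = 0.0
do i = 1, size(a)
  s = s + (2.0*a(i) - 1.0)
end do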

--
Tim Prince

Steve Lionel
Oct 13, 2018, 6:37:41 PM

On 10/13/2018 5:13 PM, Dominik Gronkiewicz wrote:
>> I don't think you can do this. One of the restrictions on the DO
>> CONCURRENT is
>>
>> "A variable that is referenced in an iteration shall either be
>> previously defined during that iteration, or shall not be defined or
>> become undefined during any other iteration. A variable that is defined
>> or becomes defined by more than one iteration becomes undefined when the
>> loop terminates."
>>
>> Here, s is referenced before it is defined on each iteration.
>>
>> Dick Hendrickson
> Yeah I remembered exactly that paragraph (it comes up often in discussions) but I wasn't sure if it refers to this situation, hence my question.

Dick is correct; I had forgotten that there was an entire other section
of the text with additional restrictions.

Damian Rouson
Oct 13, 2018, 9:16:53 PM

FWIW, the Fortran committee is considering a proposal for adding reductions to DO CONCURRENT in 202X. I support it and at least one vendor supports it so I'm hopeful that the proposal will succeed. If you have other use cases, you might want to put them in writing and share them with a committee member or consider attending a committee meeting if possible.
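
The form being discussed is a REDUCE locality spec on the loop header,
roughly along these lines (a sketch only; the syntax was not final at the
time of writing):

s = 0.0
do concurrent (i = 1:size(a)) reduce(+:s)
  s = s + (2.0*a(i) - 1.0)
end do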

FortranFan
Oct 13, 2018, 10:04:25 PM

On Saturday, October 13, 2018 at 9:16:53 PM UTC-4, Damian Rouson wrote:

> FWIW, the Fortran committee is considering a proposal for adding reductions to DO CONCURRENT in 202X. ..


Interesting.

On Saturday, October 13, 2018 at 1:21:46 PM UTC-4, Dominik Gronkiewicz wrote:
> Is the following code valid? ..

@Dominik Gronkiewicz,

If by "valid" you mean standard-conforming, you already have several answers: the code in the original post does not conform.

Since reductions are not part of the DO CONCURRENT construct in the current standard (nor in the upcoming 2018 revision), and given the language's convenient array support and the fact that memory is cheap and plentiful nowadays, array temporaries can serve as a stop-gap measure. This is shown below with your "toy" example, but it is also an option you can try with your "actual code .. much more complex and involves several nested loops but with no dependencies other than the reduction":

--- begin code ---
real :: a(1024), s
real, allocatable :: work_array(:)
integer :: i

call random_seed()
call random_number(a)

!-----------------------------------------------
s = 0.0
do i = 1, size(a)
  s = s + compute_intensive_calcs( a(i) )
end do
print *, "DO: s = ", s

!-----------------------------------------------
s = sum( compute_intensive_calcs(a) )
print *, "SUM intrinsic: s = ", s

!-----------------------------------------------
allocate( work_array, mold=a )
do concurrent (i = 1:size(a))
  work_array(i) = compute_intensive_calcs( a(i) )
end do
s = sum( work_array )
print *, "DO CONCURRENT: s = ", s

!-----------------------------------------------
stop

contains

  elemental function compute_intensive_calcs( x ) result( y )
    real, intent(in) :: x
    ! function result
    real :: y
    ! Substitute actual computations here
    y = 2.0*x - 1.0
  end function

end
--- end code ---

Note that with suitable encapsulation of one's 'complex' computations in 'classes' (derived types) whose 'data' includes the problem sizes, one can set up the array temporaries so that any performance impact due to repeated allocations and deallocations is minimized.
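
A minimal sketch of that idea (the type and procedure names here are
illustrative only), reusing the compute_intensive_calcs kernel from the
example above and keeping the work array allocated across calls:

module reduction_m
  implicit none
  type :: reducer_t
    real, allocatable :: work(:)   ! allocated once, reused on later calls
  contains
    procedure :: total
  end type
contains
  elemental function compute_intensive_calcs( x ) result( y )
    real, intent(in) :: x
    real :: y
    ! Substitute actual computations here
    y = 2.0*x - 1.0
  end function
  function total( self, a ) result( s )
    class(reducer_t), intent(inout) :: self
    real, intent(in) :: a(:)
    real :: s
    integer :: i
    ! (Re)allocate the work array only when the problem size grows.
    if (.not. allocated(self%work)) then
      allocate( self%work(size(a)) )
    else if (size(self%work) < size(a)) then
      deallocate( self%work )
      allocate( self%work(size(a)) )
    end if
    do concurrent (i = 1:size(a))
      self%work(i) = compute_intensive_calcs( a(i) )
    end do
    s = sum( self%work(1:size(a)) )
  end function
end module

Usage would then be along the lines of declaring "type(reducer_t) :: r" once and calling "s = r%total(a)" wherever the reduction is needed.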

Dick Hendrickson
Oct 13, 2018, 10:58:26 PM

On 10/13/18 4:09 PM, Dick Hendrickson wrote:
> On 10/13/18 12:21 PM, Dominik Gronkiewicz wrote:
>> Is the following code valid? It does compile and give the expected
>> results, but it has a dependency between iterations (the variable s),
>> even though the order of execution does not matter since it is a
>> reduction operation.
>>
>> https://paste.fedoraproject.org/paste/rsfEsp5895~SQF7vMtCRpQ
>>
>>          real :: a(1024), s
>>          integer :: i
>>
>>          call random_seed()
>>          call random_number(a)
>>
>>          !-----------------------------------------------
>>
>>          s = 0
>>
>>          do concurrent (i = 1:size(a))
>>            s = s + (2 * a(i) - 1)
>>          end do
>>
Most of the trouble with examples like this is that they are too small.
Assuming DO CONCURRENT makes some sort of operating system call to fire
up many other threads and then continues to do useful work, the original
thread will almost surely have completed the loop before any other
threads can get organized. On today's computers the "concurrent" isn't
interesting for loops like this. If you change the array size to a
bazillion and run your program, with multi-threading fully enabled, many
many times during the busiest time of the day, you'll see significantly
different results. I guarantee it. ;)

Nevertheless, it is surprising that a compiler wouldn't give some sort
of comment about the use of "s" in this do concurrent loop. Sure, the
standard is clear; you've written a loop that, in general, has a
(possibly) undefined result ("s") and it's all YOUR fault! But,
shouldn't compilers try to be helpful?

Dick Hendrickson

Thomas Koenig
Oct 14, 2018, 7:27:26 AM

Dick Hendrickson <dick.hen...@att.net> wrote:

>>>          real :: a(1024), s
>>>          integer :: i
>>>
>>>          call random_seed()
>>>          call random_number(a)
>>>
>>>          !-----------------------------------------------
>>>
>>>          s = 0
>>>
>>>          do concurrent (i = 1:size(a))
>>>            s = s + (2 * a(i) - 1)
>>>          end do
>>>
[...]

> Nevertheless, it is surprising that a compiler wouldn't give some sort
> of comment about the use of "s" in this do concurrent loop. Sure, the
> standard is clear; you've written a loop that, in general, has a
> (possibly) undefined result ("s") and it's all YOUR fault! But,
> shouldn't compilers try to be helpful?

Compilers (or rather, compiler writers) do try to be helpful,
but not all helpful things are created equal (or rather, equally
easy to write).

However, a question:

>> "A variable that is referenced in an iteration shall either be
>> previously defined during that iteration, or shall not be defined or
>> become undefined during any other iteration. A variable that is defined
>> or becomes defined by more than one iteration becomes undefined when the
>> loop terminates."

I think these requirements would allow the compiler to mark variables
which occur in a variable definition context (left-hand side of an
assignment, INTENT(OUT) argument) without a branch around them as
undefined on loop entry (in gcc jargon, to clobber them).

So, the code above could be translated into

do concurrent (i = 1:size(a))
  s = {clobber}
  s = s + (2*a(i) - 1)
end do

which would let later stages of the compiler know that s is, in effect,
undefined at this stage, so a warning could be issued.

No such transformation should occur for

s = 0
do concurrent (i = 1:size(a))
  if (i == 21) s = s + 2
end do

Would this be allowed by the standard?

steve kargl
Oct 14, 2018, 12:57:10 PM

Thomas Koenig wrote:

(attribution of who wrote what has gone missing)

>>> "A variable that is referenced in an iteration shall either be
>>> previously defined during that iteration, or shall not be defined or
>>> become undefined during any other iteration. A variable that is defined
>>> or becomes defined by more than one iteration becomes undefined when the
>>> loop terminates."
>
> I think these requirements would allow the compiler to mark variables
> which occur in a variable definition context (left-hand side of an
> assignment, INTENT(OUT) argument) without a branch around them as
> undefined on loop entry (in gcc jargon, to clobber them).
>
> So, the code above could be translated into
>
> do concurrent (i= 1:size(a))
> s = {clobber}
> s = s + (2*a(i) - 1)
> end do
>
> which would let later stages of the compiler know that s is, in effect,
> undefined at this stage, so a warning could be issued.
>
> No such transformation should occur for
>
> s = 0
> do concurrent (i = 1:size(a))
> if (i == 21) s = s + 2
> end do
>
> Would this be allowed by the standard?

The Fortran standard does not address implementation details.
For example, the Fortran standard states that RANDOM_NUMBER
returns a value in the range [0,1). It does not state how the
value is determined. Similarly, I think the standard does not
address how a compiler may or may not mark entities that
are undefined.

--
steve

Ev. Drikos
Oct 14, 2018, 3:36:12 PM

On 13/10/2018 8:21 PM, Dominik Gronkiewicz wrote:


Hello,

Here is another variant of the program without "do concurrent" loops.

In case you care about portability: the loop that prints the last (3rd)
line is perhaps compact, but it doesn't look portable, or I have an outdated
version (18.4) of the PGI compiler.


Regards,
Ev. Drikos


----------------------------------------------------------
$ gfortran7 -fopenmp -O3 do-concurrent-4.f90
$ ./a.out
size(a)= 1024000
102.71127629280090 passed 1138000
102.71127629280090 passed 750000
102.71127629280090 passed 399000
$ pgfortran -mp=allcores -O3 do-concurrent-4.f90
$ ./a.out
size(a)= 1024000
367.0113028883934 passed 0
367.0113028883934 passed 1
241.2957104444504 passed 0
$ cat do-concurrent-4.f90
use omp_lib
use, intrinsic :: iso_fortran_env

integer , parameter :: threads=4

integer(int64) :: t0, t1
real*4 :: a(1024000)
real*8 :: s=0, s1=0, s2=0, s3=0, s4=0, v(threads)
integer :: i, i1, i2, i3, i4


call random_seed()
call random_number(a)

print *, "size(a)=", size(a)

!-----------------------------------------------

! s = 0
! call system_clock(t0)
! do concurrent (i = 1:1024000)
! s = s + (2 * a(i) - 1)
! end do

! call system_clock(t1)
! print *, s , " passed " , t1 - t0

!-----------------------------------------------
call system_clock(t0)
s = 0
do i = 1, 1024000
  s = s + (2 * a(i) - 1)
end do
call system_clock(t1)
print *, s , " passed " , t1 - t0


!-----------------------------------------------
! s=0
! call system_clock(t0)
! s = 2 * sum(a,1) - 1024000 !rounding errors here?
!
! call system_clock(t1)
! print *, s , " passed " , t1 - t0



call system_clock(t0)
!$omp parallel num_threads(4)
select case( omp_get_thread_num() )

case ( 0 )
  s1 = 0
  DO i1 = 1, 1024000, 4
    s1 = s1 + (2 * a(i1) - 1)
  end do

case ( 1 )
  s2 = 0
  DO i2 = 2, 1024000, 4
    s2 = s2 + (2 * a(i2) - 1)
  end do

case ( 2 )
  s3 = 0
  DO i3 = 3, 1024000, 4
    s3 = s3 + (2 * a(i3) - 1)
  end do

case ( 3 )
  s4 = 0
  DO i4 = 4, 1024000, 4
    s4 = s4 + (2 * a(i4) - 1)
  end do

case default
  print *, "error thread number"

end select

!$omp end parallel
s=s1+s2+s3+s4

call system_clock(t1)
print *, s , " passed " , t1 - t0


!-----------------------------------------------
call system_clock(t0)
!$omp parallel num_threads(threads)
associate ( p => v( omp_get_thread_num() + 1 ) )
  p = 0
  DO i1 = omp_get_thread_num()+1, 1024000, threads
    p = p + (2 * a(i1) - 1)
  end do
end associate
!$omp end parallel
s=sum(v)

call system_clock(t1)
print *, s , " passed " , t1 - t0
!-----------------------------------------------

end

FortranFan
Oct 14, 2018, 10:37:24 PM

On Sunday, October 14, 2018 at 3:36:12 PM UTC-4, Ev. Drikos wrote:

> .. the loop that prints the last (3rd)
> line is perhaps compact, but it doesn't look portable, or I have an outdated
> version (18.4) of the PGI compiler.
> ..


@Ev. Drikos,

Your cases involving OpenMP make little sense; one would think the equivalent of what the OP has in the original post would simply be:

!$omp parallel do reduction(+:s)
do i = 1, size(a)
  s = s + (2.0 * a(i) - 1.0)
end do

JCampbell
Oct 14, 2018, 11:45:55 PM

I cannot imagine any value of size(a) where the use of multiple threads would be helpful.
For small sizes, the thread initiation overhead would be excessive, while for large sizes,
memory access would probably be the bottleneck; the loop calculation is too trivial.
Better to use "s = 2.0*sum(a) - size(a)" and hope for vector instructions for the sum.
In general, to use !$OMP effectively there needs to be sufficient calculation per iteration.
There are too many examples presented of !$OMP with trivial calculations.
I imagine DO CONCURRENT has a similar problem.

Ev. Drikos
Oct 15, 2018, 6:35:08 AM

Honestly, I had the impression that PGI doesn't fully support OpenMP
(4.5) reductions, and that's why I avoided using the keyword "reduction"
anywhere. In addition, I've never used that feature before.

Of course your example is fully supported by PGI and produces the
expected results.

My point also was that the OP needs an accumulator (i.e., s) that is wider
than each added element (i.e., a(i)). That's why I've coded in my example:

real*4 :: a(1024000)
real*8 :: s

Further, the intrinsic "sum" doesn't really function here as expected.
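
If the concern is that SUM accumulates in single precision, one workaround
(a sketch, assuming real64 from iso_fortran_env) is to widen the operands
before summing, at the cost of an array temporary:

use, intrinsic :: iso_fortran_env, only: real64
! ...
s = sum( 2.0_real64*real(a, real64) - 1.0_real64 )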



--------------------------------------------------------------------
$ gfortran7 -fopenmp -O3 do-concurrent-5.f90
$ pgfortran -mp=allcores -O3 do-concurrent-5.f90
$ cat do-concurrent-5.f90
use omp_lib
use, intrinsic :: iso_fortran_env


integer(int64) :: t0, t1
real*4 :: a(1024000)
real*8 :: s
integer :: i


call random_seed()
call random_number(a)

print *, "size(a)=", size(a)

!-----------------------------------------------
call system_clock(t0)
s = 0
do i = 1, size(a)
  s = s + (2 * a(i) - 1)
end do
call system_clock(t1)
print *, s , "time passed " , t1 - t0


!-----------------------------------------------

call system_clock(t0)
s=0
!$omp parallel do reduction(+:s)
do 4 i = 1, size(a)
4 s = s + (2.0 * a(i) - 1.0)

call system_clock(t1)
print *, s , "time passed " , t1 - t0


end

Dick Hendrickson
Oct 15, 2018, 11:12:32 AM

I think it would. I think the key phrase "a variable ... shall not be
defined ... during any other iteration" is a run-time condition. "s" is
only defined once. Code that isn't executed doesn't do anything and the
standard doesn't impose rules on the values or results. That's why IF
(X .GT. 0) Y = SQRT(X) is standard conforming, whereas IF (Y .GT. 0) Y =
SQRT(X) might or might not be standard conforming.

Dick Hendrickson

FortranFan
Oct 15, 2018, 1:14:25 PM

On Monday, October 15, 2018 at 6:35:08 AM UTC-4, Ev. Drikos wrote:

> .. the intrinsic "sum" doesn't really function here as expected. ..


Care to explain what you mean by "as expected"? The intrinsic will return a compiler-dependent approximate result, and that is consistent with the standard.

Ev. Drikos
Oct 15, 2018, 8:15:40 PM

On 15/10/2018 1:35 PM, Ev. Drikos wrote:
> gfortran7 -fopenmp     -O3 do-concurrent-5.f90
$ ./a.out
size(a)= 1024000
852.70759236812592 time passed 1310000
852.70759236812592 time passed 729000

Thomas Koenig
Oct 18, 2018, 4:11:22 PM

Dick Hendrickson <dick.hen...@att.net> wrote:
Thanks for your analysis.

I have added this as an enhancement possibility at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87648