
Shouldn't intrinsic SUM avoid overflow?


evan

Sep 2, 2016, 1:29:11 PM
Some intrinsic routines, e.g. NORM2, try to avoid overflow, but SUM does not. Shouldn't it? Consider the program below:

double precision, parameter :: x = huge(1d0)
print*, sum((/x,-x/))
print*, sum((/x,x,-x,-x/))
print*, sum((/x,-x,1d0/))
print*, sum((/1d0,x,-x/))
end


With gfortran 4.6.3 I get:

print*, sum((/x,x,-x,-x/))
1
Error: Arithmetic overflow at (1)
f951: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.6/README.Bugs> for instructions.


With ifort 14.0.0 20130728 the program compiles but gives the wrong answer in some cases:

./a.out
0.000000000000000E+000
Infinity
1.00000000000000
0.000000000000000E+000

Richard Maine

Sep 2, 2016, 2:05:13 PM
evan <evan...@gmail.com> wrote:

> Some intrinsic routines, e.g. NORM2 tries to avoid overflow but not SUM.
> Shouldn't it?

What's the cost/benefit tradeoff? For SUM, it seems like a pretty rare
corner case to have something that would overflow if done naively, but
could be massaged to give a meaningful result. I'd have thought that
avoiding loss of precision in SUM would be a more important issue in
practice if one were to go after more sophisticated implementations.

NORM2, on the other hand, can much more easily overflow.

I assume, by the way, that your "shouldn't" refers to quality of
implementation rather than to requirements of the standard.
If you are actually asking about whether the standard requires such
things, that answer is easy. No, it doesn't.

--
Richard Maine
email: last name at domain . net
domain: summer-triangle

kargl

Sep 2, 2016, 2:51:13 PM
evan wrote:

> Some intrinsic routines, e.g. NORM2 tries to avoid overflow but not SUM. Shouldn't it?

What algorithm do you propose a compiler should use? How slow and
memory hungry an implementation are you willing to endure?

--
steve

Louis Krupp

Sep 2, 2016, 4:59:00 PM
On Fri, 2 Sep 2016 10:29:08 -0700 (PDT), evan <evan...@gmail.com>
wrote:

>Some intrinsic routines, e.g. NORM2 tries to avoid overflow but not SUM. Shouldn't it? Consider the program below
>
>double precision, parameter :: x = huge(1d0)
>print*, sum((/x,-x/))
>print*, sum((/x,x,-x,-x/))
>print*, sum((/x,-x,1d0/))
>print*, sum((/1d0,x,-x/))
>end
>
>
>With gfortran 4.6.3 I get:
>
>print*, sum((/x,x,-x,-x/))
> 1
>Error: Arithmetic overflow at (1)
>f951: internal compiler error: Segmentation fault
>Please submit a full bug report,
>with preprocessed source if appropriate.
>See <file:///usr/share/doc/gcc-4.6/README.Bugs> for instructions.

The same internal compiler error happens with gfortran version 7.0.0
20160902 (experimental).

See bug 77641.

Louis

Louis Krupp

Sep 2, 2016, 6:05:54 PM
Which was a duplicate of 77640, so see bug 77640 instead.

Louis

robert....@oracle.com

Sep 6, 2016, 10:58:06 PM
All of the sums printed seem reasonable to me. Remember that floating-point numbers are not the real numbers of mathematics.

Robert Corbett

herrman...@gmail.com

Sep 8, 2016, 3:47:02 PM
On Friday, September 2, 2016 at 10:29:11 AM UTC-7, evan wrote:
> Some intrinsic routines, e.g. NORM2 tries to avoid overflow but not SUM.
> Shouldn't it? Consider the program below

For NORM2, there is a convenient way to avoid the overflow of the intermediates from doing the square, and also it is fairly easy to overflow when squaring.

As Richard notes for SUM, I suspect that the standard doesn't require that, but it is an implementation consideration. But you say "tries". Presumably it can still overflow.

Somewhat in general, if you want something more than the usual way of doing
something, you should do it yourself. It might be nice if SUM did something
like the Kahan summation:

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

especially since it isn't always so easy to do yourself.
(Some optimizations can mess up the algorithm.)

I don't think that helps your case, though.
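
The "convenient way" is usually to scale by the largest magnitude before
squaring. A minimal sketch of that idea (not what any particular
compiler's NORM2 actually does; the function name is just for
illustration):

   real*8 function scaled_norm2 (x, n)
   integer*4 n, i
   real*8 x(n), m, s
   m = maxval(abs(x))
   if (m == 0) then
      scaled_norm2 = 0
      return
   end if
   s = 0
   do i = 1,n
      s = s + (x(i)/m)**2        ! each scaled term is <= 1, so squaring cannot overflow
   end do
   scaled_norm2 = m * sqrt(s)    ! overflows only if the true norm itself overflows
   end function scaled_norm2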

robert....@oracle.com

Sep 9, 2016, 1:48:16 AM
On Thursday, September 8, 2016 at 12:47:02 PM UTC-7, herrman...@gmail.com wrote:
> On Friday, September 2, 2016 at 10:29:11 AM UTC-7, evan wrote:
> > Some intrinsic routines, e.g. NORM2 tries to avoid overflow but not SUM.
> > Shouldn't it? Consider the program below
>
> For NORM2, there is a convenient way to avoid the overflow of the intermediates from doing the square, and also it is fairly easy to overflow when squaring.
>
> As Richard notes for SUM, I suspect that the standard doesn't require that, but it is an implementation consideration. But you say "tries". Presumably it can still overflow.
>
> Somewhat in general, if you want something more than the usual way of doing
> something, you should do it yourself. It might be nice if SUM did something
> like the Kahan summation:

I tried something like that in release 2.0 of Sun Fortran 90. It proved not to be viable. Sun's support engineers reported that we were being killed on customer benchmarks, which led to a patch that put an end to that feature.

Robert Corbett

campbel...@gmail.com

Sep 10, 2016, 12:53:55 AM
I am interested to see how Kahan's algorithm might conflict with a "smart"
compiler so this is my testing:

Chasing round-off errors with SUM can lead to a number of options with limited
practical gains.
* Kahan is one of them.
* Using a higher precision accumulator is another approach (probably simpler).

I have written a simple test of Kahan's algorithm using real*4 precision
(listed below) and compared this to a real*8 accumulator.

I have used gFortran 6.1.0 to see what smart compiler conflicts there may be
and also tested a number of different compiler options:
set options=-O1 -mavx
set options=-O2 -mavx
set options=-O3 -mavx
set options=-O3 -mavx -ffast-math
set options=-O3 -mavx -ffast-math -funroll-loops --param max-unroll-times=2

My .bat test loop is:
del %1.exe
del %1.o
gfortran %1.f90 %options% -o %1.exe
set options >> %1.log
%1 >> %1.log

My test tried 4 approaches to summing a list of real*4 values.
si = SUM intrinsic
s4 = real*4 DO loop
k4 = Kahan's algorithm for real*4
s8 = real*8 accumulator

The results I found were
-ffast-math messed with Kahan's algorithm, but
changing optimisation with -Ox doesn't (no smarts identified)

From this test case, using a real*8 accumulator appears to be the most robust
approach.

In the past when using 8086/8087 there was the uncertainty of what values were
retained in 80-bit registers.
These latest tests with -mavx appear to show no mixing of 4-byte and 8-byte
registers.

When considering other cases, the results are very sensitive to the type of
accumulated rounding error that is occurring. The error-analysis claims in the
Wiki article look wrong to me, as I find that the cases where you may want to
improve accuracy are those where the error accumulation is unusual.
Another problem is what you should make of the result, given the
accuracy of the original data.

The test case I used is summing a set of +ve random numbers in the range 0,1.
Round-off error swamps the result at about 10^7 values.
Others may be interested in using the following test or adapting to a different
set of values with different round-off characteristics.

The test code I used is:
! program to test KahanSum
!
real*4 function KahanSum (values, n)
! https://en.wikipedia.org/wiki/Kahan_summation_algorithm
   integer*4 n
   real*4 values(n)
!
   integer*4 i
   real*4 sum, c, y, t
!
   c = 0
   sum = 0
   do i = 1,n
      y = values(i) - c    ! So far, so good : c is zero
      t = sum + y          ! Alas, sum is big, Y small, so low-order digits of y are lost
      c = (t - sum) - y    ! (t-sum) cancels the high-order part of y; subtracting y recovers negative (low part of y)
      sum = t              ! Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers !
   end do                  ! Next time around, the low part will be added to y in a fresh attempt.
!
   KahanSum = sum
end function KahanSum

real*4 function Sum8 (values, n)
   integer*4 n
   real*4 values(n)
!
   integer*4 i
   real*8 sum, y
!
   sum = 0
   do i = 1,n
      y = values(i)
      sum = sum + y
   end do
!
   Sum8 = sum
end function Sum8

real*4 function Sum4 (values, n)
   integer*4 n
   real*4 values(n)
!
   integer*4 i
   real*4 sum, y
!
   sum = 0
   do i = 1,n
      y = values(i)
      sum = sum + y
   end do
!
   Sum4 = sum
end function Sum4

Program Kahan_test
!
   integer*4 n, k
   real*4, allocatable :: values(:)
   real*4 sum4, sum8, KahanSum, s8, k4, s4, si
!
   call report_options
!
   do k = 2,30
      n = 2**k
      allocate ( values(n) )
      call random_number ( values )
      s8 = sum8 ( values, n )
      k4 = KahanSum ( values, n )
      s4 = sum4 ( values, n )
      si = sum ( values )
      write (*,fmt='(i10,7es11.3)') n, si, s4, k4, s8, abs(si-s8), abs(s4-s8), abs(k4-s8)
      deallocate (values)
   end do
!
end

subroutine report_options
   use ISO_FORTRAN_ENV
   write (*,*) 'Compiler: ',compiler_version ()
   write (*,*) 'Options : ',compiler_options ()
end subroutine report_options

kargl

Sep 10, 2016, 12:59:41 AM
campbel...@gmail.com wrote:

> The results I found were
> -ffast-math messed with Kahan's algorithm, but
> changing optimisation with -Ox doesn't (no smarts identified)

Using -ffast-math is quite possibly the dumbest thing for anyone
to use if she or he is concerned about numerical issues. The
math may indeed be fast, but it is typically wrong.

--
steve

campbel...@gmail.com

Sep 10, 2016, 1:27:44 AM
Steve,

All floating point calculations are "wrong" but how wrong is more important and is it tolerable. There are no absolutes with this.
I have been using -ffast-math and getting acceptable accuracy, so no it is not dumb.

kargl

Sep 10, 2016, 1:51:36 AM
Either you don't understand what -ffast-math does or you don't
understand the carefully written Kahan summation algorithm.
Yes, FP arithmetic must be done with care to get any reasonable
approximation to the "correct" answer. Using a compiler option
that by design undoes a carefully written FP algorithm, because
the option allows a compiler to violate the semantics of the
language, and then appearing to be surprised by one's findings
is, well, dumb.

If you take the time to read the GNU GCC documentation, you'll
find that none of the -Ox options turn on -ffast-math except for
-Ofast, which explicitly states that it does. If you then read about
-ffast-math, you'll find that it is explicitly stated that this option
can lead to wrong results.

--
steve

Ron Shepard

Sep 10, 2016, 1:05:11 PM
do i = 1, n
y = values(i) - c
t = sum + y
c = (t - sum) - y
sum = t
end do

Just because an option CAN lead to wrong results does not mean that it
necessarily DOES lead to wrong results for a particular code. In this
case, the results were wrong with this option, which is a useful piece
of information, and I don't think anyone following this thread is
"surprised" by this outcome.

-Ofast and -ffast-math do several things that would not affect the
results of this particular code, such as ignoring interrupts, Inf, NaN,
reciprocal approximation, and gradual underflow arithmetic, all of which
do affect performance. One thing that it enables that does affect this
code is to allow fortran parentheses to be ignored during subexpression
optimization. The references to and the evaluation of "c" within the
loop might be completely removed as the result of such an optimization.
I'm guessing that a scan of the intermediate code or of the assembler
code by someone who is fluent in those things might show this.

The Kahan summation algorithm replaces a loop with a single fp addition
with a loop that requires four additions, and those four additions
cannot be optimized away while retaining the benefit of the algorithm.
There are other ways for a compiler to screw up the algorithm too. For
example, if there are multiple functional units with different rounding
or truncation properties (such as some fp instructions done with
registers while others are done with sse/avx hardware, or some fp
operations done with 80-bit registers and others done with 64-bit
registers), then the carryover of the rounding errors from one cycle to
the next would not be performed correctly.

$.02 -Ron Shepard
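
In other words, once reassociation is allowed, c = ((sum + y) - sum) - y
simplifies algebraically to zero, so what is effectively left is a plain
sum (a sketch of the simplification Ron describes, not actual compiler
output):

   do i = 1,n
      sum = sum + values(i)   ! the correction term c has been folded away
   end do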

campbel...@gmail.com

Sep 10, 2016, 10:52:14 PM
I tested the following gFortran compile options:
set options=-O1
set options=-O1 -mavx
set options=-O2 -mavx
set options=-O3 -mavx
set options=-O3 -mavx -ffast-math
set options=-O3 -mavx -ffast-math -funroll-loops --param max-unroll-times=2

What is interesting about the testing I did was that -ffast-math was the only compile option that changed the results.
While it nullified the KahanSum algorithm approach, the use of -ffast-math actually improved the precision of the SUM intrinsic and the real*4 DO loop.
This outcome may be limited to the particular case I have selected, where all values are +ve, but it appears -ffast-math may be rounding up, which is good for this test.

I have identified few practical calculations where -ffast-math is a problem. Apart from describing -ffast-math as a dumb approach, or the optimising it applies to Kahan, can anyone suggest where this option should not be used?
For real*8 computation involving large vectors, there is a doubling of speed with options=-O3 -mavx -ffast-math. This makes it a preferred choice for most computation, and the failure of the Kahan algorithm is the only code example I have of it failing.

The testing also demonstrates that choosing a higher precision accumulator is a simpler approach for real*4 calculation, although there is a high performance price for real*8 calculation when using a real*16 accumulator. When real*10 was available via 8087, this was a viable option.

As with most cases, if overflow is a problem, the blame lies with the algorithm and not the processor.

Tim Prince

Sep 11, 2016, 7:47:41 AM
-ffast-math (along with -O3 or -O2 -ftree-vectorize) enables simd
parallel optimization of sum reduction. The batched sums usually
improve accuracy, but not by a predictable degree. Other compilers, like
ifort, batch sums more aggressively than gfortran.
gfortran should not violate parentheses even with -ffast-math unless
-fno-protect-parens is set. gcc, however, doesn't have the option to
observe parentheses under fast-math.
icc only recently introduced an option to observe parentheses while
performing default optimizations. Intel compiler users trip up
frequently over the default violation of parentheses.
Inconsistent violation of parentheses breaks Kahan's algorithm entirely,
as opposed to ignoring them in a consistent manner as Cray Fortran used
to do, which could reduce Kahan's algorithm to a correct unprotected sum.

herrman...@gmail.com

Sep 11, 2016, 8:19:57 PM
On Friday, September 9, 2016 at 10:27:44 PM UTC-7, campbel...@gmail.com wrote:
> On Saturday, September 10, 2016 at 2:59:41 PM UTC+10, kargl wrote:

(snip)

> > Using -ffast-math is quite possibly the dumbest thing for anyone
> > to use if she or he is concerned about numerical issues. The
> > math may indeed be fast, but it is typically wrong.


> All floating point calculations are "wrong" but how wrong is more important
> and is it tolerable. There are no absolutes with this. I have been using
> -ffast-math and getting acceptable accuracy, so no it is not dumb.

Some problems are much more sensitive to numerical instability than
others. Sometimes you know about the problem you are working on,
and other times you don't.

I do find it interesting that the precision we have for single and double
has stayed about the same for so many years, yet the problems are
getting bigger.

Many matrix problems are nice and stable when small, but much less so
when they get large. Some improvement in algorithms over the years
has allowed avoiding the need for even more precision.

It is probably best not to use -ffast-math until you actually know that
you need the speed, and then check to be sure that the problem
is still stable.

herrman...@gmail.com

Sep 11, 2016, 8:24:42 PM
On Friday, September 9, 2016 at 9:53:55 PM UTC-7, campbel...@gmail.com wrote:
> I am interested to see how Kahan's algorithm might conflict with a "smart"
> compiler so this is my testing:

> Chasing round-off errors with SUM can lead to a number of options
> with limited practical gains.

> * Kahan is one of them.
> * Using a higher precision accumulator is another approach (probably simpler).

I believe that the extended (temporary real) format Intel implemented in the 8087
was suggested by Kahan.

Note that the x87 form has more exponent bits, so, for the OP's problem, you
can add many huge(1.d0) without overflowing.

There are also summation algorithms that reorder (sort) the data before adding,
to minimize lost bits.

Tim Prince

Sep 12, 2016, 8:36:12 AM
On 9/11/2016 8:24 PM, herrman...@gmail.com wrote:
> On Friday, September 9, 2016 at 9:53:55 PM UTC-7, campbel...@gmail.com wrote:
>> I am interested to see how Kahan's algorithm might conflict with a "smart"
>> compiler so this is my testing:
>
>> Chasing round-off errors with SUM can lead to a number of options
>> with limited practical gains.
>
>> * Kahan is one of them.
>> * Using a higher precision accumulator is another approach (probably simpler).
>
> I believe that the extended (temporary real) format Intel implemented in the 8087
> was suggested by Kahan.
The default violation of parentheses in Intel compilers seems to be
designed for 8087 extra precision, even though 8087 is no longer
supported in the compilers for 64-bit mode, and normal default practice
is to set 53-bit precision. Cray Fortran could remove the Kahan guards
and vectorize, getting the same results as a plain vectorizable reduction.
>
> Note that the x87 form has more exponent bits, so, for the OP's problem, you
> can add many huge(1.d0) without overflowing.
Windows X64 sets x87 precision mode to 53 bits before launching a .exe
so you will see the large exponent range but not the extra precision in
expression evaluation.
>
> There are also summation algorithms that reorder (sort) the data before adding,
> to minimize lost bits.
>
Sorting also is time consuming and doesn't appear to give as good
results as Kahan sum.

herrman...@gmail.com

Sep 12, 2016, 11:39:24 AM
On Monday, September 12, 2016 at 5:36:12 AM UTC-7, Tim Prince wrote:

(snip, I wrote)

> > I believe that the extended (temporary real) format Intel
> > implemented in the 8087 was suggested by Kahan.

> The default violation of parentheses in Intel compilers seems to be
> designed for 8087 extra precision, even though 8087 is no longer
> supported in the compilers for 64-bit mode, and normal default practice
> is to set 53-bit precision.

The precision specifier doesn't apply to all operations, but I
forget now which ones it applies to.

The additional complication with the 8087 is the number of stack
registers. The original design was for a virtual stack, where it would
spill to/from memory at overflow/underflow. That never worked for
the 8087. I don't know that it was ever fixed. So, compilers always
work with only an 8 register stack.

> Cray Fortran could remove the Kahan guards
> and vectorize, getting the same results as a plain vectorizable reduction.

(I also wrote)
> > Note that the x87 form has more exponent bits, so, for the OP's problem, you
> > can add many huge(1.d0) without overflowing.

> Windows X64 sets x87 precision mode to 53 bits before launching a .exe
> so you will see the large exponent range but not the extra precision in
> expression evaluation.

Again, I forget which operations that applies to.

> > There are also summation algorithms that reorder (sort) the
> > data before adding, to minimize lost bits.

> Sorting also is time consuming and doesn't appear to give as good
> results as Kahan sum.

For the more usual case, I suspect sorting isn't best, but for cases with
exact cancelation, like the OP, it might be better.


Tim Prince

Sep 12, 2016, 11:54:32 AM
On 9/12/2016 11:39 AM, herrman...@gmail.com wrote:
> On Monday, September 12, 2016 at 5:36:12 AM UTC-7, Tim Prince wrote:
>
> (snip, I wrote)
>
>>> I believe that the extended (temporary real) format Intel
>>> implemented in the 8087 was suggested by Kahan.
>
>> The default violation of parentheses in Intel compilers seems to be
>> designed for 8087 extra precision, even though 8087 is no longer
>> supported in the compilers for 64-bit mode, and normal default practice
>> is to set 53-bit precision.
>
> The precision specifier doesn't apply to all operations, but I
> forget now which ones it applies to.
Mainly, the x87 math intrinsics aren't affected by precision
specification, so can't be speeded up by reducing precision, as divide
and sqrt can.

evan

Sep 12, 2016, 2:02:30 PM
Thank you for pointing out the Kahan algorithm.

I thought I may run into overflow problems with SUM when writing a program to compute the average of M numbers stored in the array X.

I ended up computing the SUM(X) and testing if it overflows. If not, then I divide by DBLE(M); otherwise I compute a cumulative moving average:
https://en.wikipedia.org/wiki/Moving_average#Cumulative_moving_average
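
For reference, a minimal sketch of that fallback (the function name is
made up here; IEEE_IS_FINITE is the standard test from the intrinsic
module IEEE_ARITHMETIC):

   function safe_mean (x, m) result (avg)
   use ieee_arithmetic, only : ieee_is_finite
   integer m, i
   double precision x(m), avg, s
   s = sum(x)
   if (ieee_is_finite(s)) then
      avg = s / dble(m)
   else
      ! cumulative moving average: avg_i = avg_{i-1} + (x_i - avg_{i-1}) / i
      avg = 0d0
      do i = 1,m
         avg = avg + (x(i) - avg) / dble(i)
      end do
   end if
   end function safe_mean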

Gordon Sande

Sep 12, 2016, 2:47:17 PM
On 2016-09-12 18:02:24 +0000, evan said:

> Thank you for pointing out the Kahan algorithm.
>
> I thought I may run into overflow problems with SUM when writing a
> program to compute the average of M numbers stored in the array X.

There are algorithms based on finding tentative averages which go a long way
towards solving this problem. The more common variant is the algorithm for
finding the average and variance using a tentative mean.

Thomas Koenig

Sep 12, 2016, 3:00:44 PM
Ron Shepard <nos...@nowhere.org> schrieb:

> do i = 1, n
> y = values(i) - c
> t = sum + y
> c = (t - sum) - y
> sum = t
> end do

Well, you _could_ get the effect with this if you specified y,
t and c (and possibly sum) as VOLATILE, even with -ffast-math.

Looking at the generated assembly, this seems to do something,
but I am not sure it is 100% correct, because I don't have
test data to try the algorithm on.
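
For concreteness, a sketch of what that would look like for the KahanSum
function posted earlier (whether this fully survives -ffast-math would
still need checking, as noted above):

   real(4) function KahanSum_v (values, n)
   integer n, i
   real(4) values(n)
   real(4), volatile :: sum, c, y, t
   c = 0
   sum = 0
   do i = 1,n
      y = values(i) - c
      t = sum + y
      c = (t - sum) - y   ! VOLATILE forces each intermediate to be stored and
      sum = t             ! reloaded, so the correction cannot be folded away
   end do
   KahanSum_v = sum
   end function KahanSum_v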

Richard Maine

Sep 12, 2016, 3:46:20 PM
evan <evan...@gmail.com> wrote:

> I thought I may run into overflow problems with SUM when writing a program
> to compute the average of M numbers stored in the array X.

You had some serious basis for thinking this might happen or was it just
an idle thought that it might happen? You need to be summing awfully big
values in the first place for this to come up. It's hard to hit just
based on a large number of values because you are unlikely to have more
than a few billion values to sum even in extreme cases with current
hardware. That still leaves about 30 orders of magnitude or so to go,
which has to be covered by the individual values being pretty big. Your
example reasonably uses HUGE to get something to demonstrate the
question, but you expect it to be an issue with real data? Note that if
the real data does something like use HUGE as a flag value, the average
isn't likely to make much sense anyway. An average also isn't going to
make much sense if you have something like IEEE infinities running
around.

Overflow is a different issue than loss of precision. Sure loss of
precision can come up easily enough with SUM, and that's what the Kahan
algorithm is mostly about. But running into overflow seems a lot less
likely in practice, except in cases where the whole computation is
diverging and addressing just SUM wouldn't be of much help in the end.

Yes, it can happen in theory. It just doesn't seem likely to come up a
lot in practical use. As in rare enough not to be worth the overhead of
dealing with it in an intrinsic used for lots of things. If you have
that rare a case, better to roll your own code instead of just hoping
that a general-purpose intrinsic would address your rare case.

herrman...@gmail.com

Sep 13, 2016, 9:33:29 AM
On Monday, September 12, 2016 at 12:46:20 PM UTC-7, Richard Maine wrote:
> evan <evange...@gmail.com> wrote:

> > I thought I may run into overflow problems with SUM when writing a program
> > to compute the average of M numbers stored in the array X.

> You had some serious basis for thinking this might happen or was it just
> an idle thought that it might happen? You need to be summing awfully big
> values in the first place for this to come up. It's hard to hit just
> based on a large number of values because you are unlikely to have more
> than a few billion values to sum even in extreme cases with current
> hardware.

Single precision on many machines isn't that hard to overflow, but
you should probably always sum into a double precision variable, even
when summing single precision. (The SUM function might not do that.)

> That still leaves about 30 orders of magnitude or so to go,
> which has to be covered by the individual values being pretty big. Your
> example reasonably uses HUGE to get something to demonstrate the
> question, but you expect it to be an issue with real data? Note that if
> the real data does something like use HUGE as a flag value, the average
> isn't likely to make much sense anyway. An average also isn't going to
> make much sense if you have something like IEEE infinities running
> around.

Since we don't know at all where the data comes from, it is hard
to say. If, for example, you are doing a linear least squares fit to
a 10th degree polynomial (not always a good idea), then, if I remember,
it needs up to the 19th power of the x values.

Routines I remember from years ago, and sometimes on machines
with smaller HUGE(1.D0), would compute the mean, divide all the
x values by the mean, do the polynomial fit, then rescale the results.

Summing powers of input data, it isn't hard to overflow, but the
OP didn't really explain the data source.

> Overflow is a different issue than loss of precision. Sure loss of
> precision can come up easily enough with SUM, and that's what the Kahan
> algorithm is mostly about. But running into overflow seems a lot less
> likely in practice, except in cases where the whole computation is
> diverging and addressing just SUM wouldn't be of much help in the end.

It often helps to explain the actual problem. That is, the overall problem,
not just the sum. Is it high powers of some input data?

> Yes, it can happen in theory. It just doesn't seem likely to come up a
> lot in practical use. As in rare enough not to be worth the overhead of
> dealing with it in an intrinsic used for lots of things. If you have
> that rare a case, better to roll your own code instead of just hoping
> that a general-purpose intrinsic would address your rare case.

It might be nice for SUM to do a double precision sum on single
precision data, but otherwise an ordinary sum should be fine.

Gary Scott

Sep 13, 2016, 9:37:50 AM
I recently made a change to use SUM into a double precision result
array. I was concerned whether the summation process was actually
performed in single precision. Still don't know. I guess I could test it.

Tim Prince

Sep 13, 2016, 10:24:49 AM
You must promote the operand of SUM to double. Simple enough that I
don't see the point in any additional machinery.

herrman...@gmail.com

Sep 13, 2016, 11:48:26 AM
On Tuesday, September 13, 2016 at 7:24:49 AM UTC-7, Tim Prince wrote:
> On 9/13/2016 9:37 AM, Gary Scott wrote:

(snip, I wrote)
> >> It might be nice for SUM to do a double precision sum on single
> >> precision data, but otherwise an ordinary sum should be fine.

> > I recently made a change to use SUM into a double precision result
> > array. I was concerned whether the summation process was actually
> > performed in single precision. Still don't know. I guess I could test
> > it.

> You must promote the operand of SUM to double. Simple enough that I
> don't see the point in any additional machinery.

Not so easy if it is a large array.

Note that I didn't mean that the result of SUM should be double, only
that the sum should be accumulated in a double. And the sum for double
should be accumulated in a larger type, if available.

This is a quality of implementation issue, though as previously noted
it can slow down benchmarks. People do have to realize that getting
the right answer can take longer. Getting the wrong answer fast is
not usually the best way.

Gordon Sande

Sep 13, 2016, 1:14:11 PM
This used to be known as fl_2 arithmetic: double precision accumulation
of single precision operands. There is DPROD for doing double precision
accumulation of inner products. I suppose one could always do a DPROD with 1.0
to avoid space, at the price of spurious computation. (It might even get
optimized out!)
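
A minimal sketch of that trick (DPROD is elemental and returns double
precision, so the accumulation happens in double; whether a temporary
double precision array actually gets materialized is a quality of
implementation matter):

   program dprod_sum
   implicit none
   real x(1000000)
   double precision s
   call random_number(x)
   s = sum(dprod(x, 1.0))   ! element-wise dp products, summed in dp
   print *, s
   end program dprod_sum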

Gary Scott

Sep 13, 2016, 8:18:29 PM
These are multi-hundred megabyte arrays of single precision reals summed
either column wise or row wise depending. Didn't want to make a
duplicate as these calcs happen in real time while the data is edited,
following a GUI change callback.

Thomas Koenig

Sep 14, 2016, 2:37:17 AM
herrman...@gmail.com <herrman...@gmail.com> schrieb:

> Single precision on many machines isn't that hard to overflow, but
> you should probably always sum into a double precision variable, even
> when summing single precision. (The SUM function might not do that.)

It is certainly not required by the standard, and, since the result
characteristics are of the same type and kind parameters as its
argument, it is not to be expected either.

It might be a good idea to add a KIND argument, so that

REAL(kind=sp), dimension(:) :: a
REAL(kind=dp) :: b
b = SUM(a,KIND=dp)

does the automatic conversion of each value before summing up,
to get better precision and avoid overflow.

Is this on anybody's agenda?

herrman...@gmail.com

Sep 14, 2016, 11:14:03 AM
On Tuesday, September 13, 2016 at 11:37:17 PM UTC-7, Thomas Koenig wrote:

(I wrote)
> > Single precision on many machines isn't that hard to overflow, but
> > you should probably always sum into a double precision variable, even
> > when summing single precision. (The SUM function might not do that.)

> It is certainly not required by the standard, and, since the result
> characteristics are of the same type and kind parameters as its
> argument, it is not to be expected either.

Seems to me a quality of implementation issue. There are many
things not required by the standard, that we expect anyway.
This might not be quite as high on the list. On many machines,
it isn't any slower, and on some there is extra work to reduce the
precision.

Kahan summation is enough slower, or so it seems, that
one might not expect that.

I think you lose about log2(N)/2-1 bits, from random walk in
rounding the bits at each step. Even for ordinary sized N,
that can be pretty significant for single precision, and
noticeable for larger N. That is, even if you don't have
catastrophic bit loss, there is still statistical bit loss.

> It might be a good idea to add a KIND argument, so that

> REAL(kind=sp), dimension(:) :: a
> REAL(kind=dp) :: b
> b = SUM(a,KIND=dp)

> does the automatic conversion of each value before summing up,
> to get better precision and avoid overflow.

Yes, I wondered about that, too.

> Is this on anybody's agenda?

-- glen

Pascal

Sep 14, 2016, 11:25:19 AM
On 12/09/16 19:02, evan wrote:
> Thank you for pointing out the Kahan algorithm.
>
> I thought I may run into overflow problems with SUM when writing a program to compute the average of M numbers stored in the array X.
>
> I ended up computing the SUM(X) and testing if it overflows. If not then I divide by DBLE(M) otherwise I am computing a cumulative moving average
> https://en.wikipedia.org/wiki/Moving_average#Cumulative_moving_average

It is possible to calculate the average without summing all the elements:
<https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm>

<x_n> = <x_{n-1}> + ( x_n - <x_{n-1}> ) / n

Pascal

herrman...@gmail.com

Sep 14, 2016, 11:39:19 AM
On Wednesday, September 14, 2016 at 8:25:19 AM UTC-7, Pascal wrote:

(snip on summation methods)

> It is possible to calculate the average without summing all the elements:
> <https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm>

> <x_n> = <x_{n-1}> + ( x_n - <x_{n-1}> ) / n

Note that this can also improve the precision of the calculation.

Something similar is used for the last step of a Newton-Raphson
square root calculation on a system with floating point base other
than two. Specifically, for IBM S/360 and S/370 with hexadecimal
floating point.

campbel...@gmail.com

Sep 15, 2016, 5:49:45 AM
Based on later posts, I have now expanded the test program I posted earlier.

This later version tests a number of approaches for summing a large set of
real*4 numbers, using:
# intrinsic SUM
# Sum_4 : DO loop with real*4 accumulator
# Sum_8 : DO loop with real*8 accumulator
# Kahan_Sum error corrector
# mov_avg : moving average calculation
# Sum_OMP : DO loop with real*4 accumulator and !$OMP

I tested using gFortran Ver 6.1.0 with the following compile options:
set options=-O1
set options=-O1 -mavx -fopenmp
set options=-O2 -mavx -fopenmp
set options=-O3 -mavx -fopenmp
set options=-O3 -mavx -ffast-math -fopenmp
set options=-O3 -mavx -ffast-math -funroll-loops --param max-unroll-times=2 -fopenmp
set options=-O3 -mavx -ffast-math -funroll-loops --param max-unroll-times=4 -fopenmp

What is interesting is how the different coding approaches interact with the
different compile options.

In summary,
# loop unrolling did not change the result (unlike !$OMP)
# -ffast-math actually improved the accuracy for most cases except for Kahan_Sum
# both !$OMP and mov_avg had a similar improvement when combined with -ffast-math
# the SUM intrinsic and SUM_4 both perform poorly and are not affected by
-ffast-math or loop unrolling
# the only uniform improvement was to use a real*8 accumulator

I should qualify this summary, by noting that the set of values are random numbers in the range 0:1. This is a well behaved set of numbers and so only
tests this type of round-off error.
I have assumed that real*8 sum gives the most accurate estimate of the sum.
I have also not carried out an error analysis to identify the significance of
the inaccuracy of the real*4 values being summed.

In conclusion, while other approaches partially worked, the use of a higher
precision accumulator is the most practical solution for this type of
round-off error.
I would be interested to read any contrary views.

The following link provides:
the test code
the windows .bat file
the run .log file
Hopefully others can reproduce the results I have provided.

https://www.dropbox.com/s/h2kamnh6bk2fjen/Kahan_Sum.zip?dl=0

Tim Prince

Sep 15, 2016, 8:05:07 AM
Who asked for a duplicate array? Promote element-wise within SUM. It
should vectorize on any simd CPU (still requiring gfortran -ffast-math
or the like).

Gary Scott

Sep 15, 2016, 9:05:01 AM
:) if I knew what this means, I could respond...lol I don't know how to
get the SUM intrinsic to "promote element wise". My return array is
double precision, but I don't know that that has any bearing on how SUM
behaves in determining the sum for each row or column.

Tim Prince

Sep 15, 2016, 10:05:12 AM
SUM(real(x,selected_real_kind(12)))
Evidently, the promotion increases latency (where there are actual
operations involved), but doesn't necessarily hurt throughput.

herrman...@gmail.com

Sep 15, 2016, 12:44:29 PM
On Thursday, September 15, 2016 at 7:05:12 AM UTC-7, Tim Prince wrote:
> On 9/15/2016 9:05 AM, Gary Scott wrote:
> > On 9/15/2016 7:02 AM, Tim Prince wrote:

(snip)
> >> Who asked for a duplicate array? Promote element-wise within SUM. It
> >> should vectorize on any simd CPU (still requiring gfortran -ffast-math
> >> or the like).

> > :) if I knew what this means, I could respond...lol I don't know how to
> > get the SUM intrinsic to "promote element wise". My return array is
> > double precision, but I don't know that that has any bearing on how SUM
> > behaves in determining the sum for each row or column.

> SUM(real(x,selected_real_kind(12)))
> Evidently, the promotion increases latency (where there are actual
> operations involved), but doesn't necessarily hurt throughput.

It is, then, a quality of implementation issue not to generate a
temporary array of the appropriate kind, to pass to sum.

In the early days of array expressions, compilers weren't so good
at optimizing them, especially more complicated expressions.

I don't know at all how compilers are with this one.

-- glen

herrman...@gmail.com

Sep 15, 2016, 12:54:22 PM
On Thursday, September 15, 2016 at 2:49:45 AM UTC-7, campbel...@gmail.com wrote:
> Based on later posts, I have now expanded the test program I posted earlier.

> This later version tests a number of approaches for summing a large set of
> real*4 numbers, using:
> # intrinsic SUM
> # Sum_4 : DO loop with real*4 accumulator
> # Sum_8 : DO loop with real*8 accumulator
> # Kahan_Sum error corrector
> # mov_avg : moving average calculation
> # Sum_OMP : DO loop with real*4 accumulator and !$OMP

There is another one that would be interesting to see, which
is the sum that is done by the FFT algorithm.

It is often said to be more accurate than a direct sum, though
that might depend statistically on the data that people use
the FFT for.

That is, sum pairs, then pairs of those pairs, then pairs of ...
With an array of O(log n) elements to store intermediates,
it seems that it should be possible to do it in one pass down the
input array. I didn't try to write that, though.
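
For what it's worth, a sketch of that one-pass scheme, using a small
array of partial sums as a binary counter (the names and the fixed
64-level limit are just for illustration):

   real*4 function tree_sum (values, n)
   integer*4 n, i, level
   real*4 values(n)
   real*4 partial(64), s
   logical occupied(64)
   occupied = .false.
   do i = 1,n
      s = values(i)
      level = 1
      do while (occupied(level))   ! merge equal-sized partial sums, like binary carries
         s = s + partial(level)
         occupied(level) = .false.
         level = level + 1
      end do
      partial(level) = s
      occupied(level) = .true.
   end do
   s = 0   ! fold whatever partial sums remain (n need not be a power of two)
   do level = 1,64
      if (occupied(level)) s = s + partial(level)
   end do
   tree_sum = s
   end function tree_sum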

(snip)

> # the only uniform improvement was to use a real*8 accumulator

> I should qualify this summary, by noting that the set of values are random
> numbers in the range 0:1. This is a well behaved set of numbers and so only
> tests this type of round-off error.
> I have assumed that real*8 sum gives the most accurate estimate of the sum.

For values of similar magnitude, that should be true until about 2**(53-24)
values are being summed.

> I have also not carried out an error analysis to identify the significance of
> the inaccuracy of the real*4 values being summed.

> In conclusion, while other approaches partially worked, the use of a higher
> precision accumulator is the most practical solution for this type of
> round-off error.
> I would be interested to read any contrary views.

-- glen

Gordon Sande

Sep 15, 2016, 1:04:00 PM
On 2016-09-15 16:54:20 +0000, herrman...@gmail.com said:

> On Thursday, September 15, 2016 at 2:49:45 AM UTC-7,
> campbel...@gmail.com wrote:
>> Based on later posts, I have now expanded the test program I posted earlier.
>
>> This later version tests a number of approaches for summing a large set of
>> real*4 numbers, using:
>> # intrinsic SUM
>> # Sum_4 : DO loop with real*4 accumulator
>> # Sum_8 : DO loop with real*8 accumulator
>> # Kahan_Sum error corrector
>> # mov_avg : moving average calculation
>> # Sum_OMP : DO loop with real*4 accumulator and !$OMP
>
> There is another one that would be interesting to see, which
> is the sum that is done by the FFT algorithm.
>
> It is often said to be more accurate than a direct sum, though
> that might depend statistically on the data that people use
> the FFT for.
>
> That is, sum pairs, then pairs of those pairs, then pairs of ...
> With an array of O(log n) elements to store intermediates,
> it seems that it should be possible to do going down the
> input array. I didn't try to write that, though.

If you just want to do a tree summation then take a look at quick sort
for an example of doing tree recursion. Converting the sort to a sum
is left as an exercise for the reader. But you will still have the
recursion overhead.

herrman...@gmail.com

Sep 15, 2016, 5:29:34 PM
On Thursday, September 15, 2016 at 10:04:00 AM UTC-7, Gordon Sande wrote:

(snip, I wrote)

> > There is another one that would be interesting to see, which
> > is the sum that is done by the FFT algorithm.

> > It is often said to be more accurate than a direct sum, though
> > that might depend statistically on the data that people use
> > the FFT for.

> > That is, sum pairs, then pairs of those pairs, then pairs of ...
> > With an array of O(log n) elements to store intermediates,
> > it seems that it should be possible to do going down the
> > input array. I didn't try to write that, though.

> If you just want to do a tree summation then take a look at quick sort
> for an example of doing tree recursion. Converting the sort to a sum
> is left as an exercise for the reader. But you will still have the
> recursion overhead.

It is usual not to write quicksort using actual recursion, but instead
with loops and a small array to keep track of the recursion points.
Since the maximum stack depth, done right, is O(log N) it doesn't
take a big array.

For this, too, it would be nice to avoid the recursion call overhead,
though I didn't actually try writing one. I had forgotten that it
is called tree summation.

-- glen

Gordon Sande

Sep 15, 2016, 7:00:03 PM
On 2016-09-15 21:29:22 +0000, herrman...@gmail.com said:

> On Thursday, September 15, 2016 at 10:04:00 AM UTC-7, Gordon Sande wrote:
>
> (snip, I wrote)
>
>>> There is another one that would be interesting to see, which
>>> is the sum that is done by the FFT algorithm.
>
>>> It is often said to be more accurate than a direct sum, though
>>> that might depend statistically on the data that people use
>>> the FFT for.
>
>>> That is, sum pairs, then pairs of those pairs, then pairs of ...
>>> With an array of O(log n) elements to store intermediates,
>>> it seems that it should be possible to do going down the
>>> input array. I didn't try to write that, though.
>
>> If you just want to do a tree summation then take a look at quick sort
>> for an example of doing tree recursion. Converting the sort to a sum
>> is left as an exercise for the reader. But you will still have the
>> recursion overhead.
>
> It is usual not to write quicksort using actual recursion, but instead
> with loops and a small array to keep track of the recursion points.
> Since the maximum stack depth, done right, is O(log N) it doesn't
> take a big array.

Even managing the simulated recursion involves overhead.

herrman...@gmail.com

Sep 15, 2016, 7:21:10 PM
On Thursday, September 15, 2016 at 4:00:03 PM UTC-7, Gordon Sande wrote:

(snip)
> > On Thursday, September 15, 2016 at 10:04:00 AM UTC-7, Gordon Sande wrote:

(snip on tree summation, where I forgot the name of it)

> >> If you just want to do a tree summation then take a look at quick sort
> >> for an example of doing tree recursion. Converting the sort to a sum
> >> is left as an exercise for the reader. But you will still have the
> >> recursion overhead.

> > It is usual not to write quicksort using actual recursion, but instead
> > with loops and a small array to keep track of the recursion points.
> > Since the maximum stack depth, done right, is O(log N) it doesn't
> > take a big array.

> Even managing the simulated recursion involves overhead.

It does, but it should be less than real recursion. Maybe too many
people are still basing buying decisions on benchmarks, though.

I suppose less overhead if you do higher than a radix 2 (binary) tree.

campbel...@gmail.com

Sep 15, 2016, 7:58:14 PM
The recursive sum actually works very well with little identifiable overhead.
I used the following, although "4" could be changed.
if ( n > 4 ) then
   k = n/2
   s = Rsum (values,k) + Rsum (values(k+1),n-k)
else
   s = sum (values)
end if

The results are comparable to sum_8, although the selected "values" are optimal for this type of calculation.
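
Presumably that fragment sits inside a recursive function along these
lines (the declarations here are a guess at the wrapper, relying on
sequence association to pass values(k+1) as the tail of the array):

   recursive function Rsum (values, n) result (s)
   integer*4 n, k
   real*4 values(n), s
   if ( n > 4 ) then
      k = n/2
      s = Rsum (values,k) + Rsum (values(k+1),n-k)
   else
      s = sum (values)
   end if
   end function Rsum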

Gary Scott

Sep 15, 2016, 9:16:30 PM
ah, of course. I would have probably thought that might be time
consuming. My intent in switching from a loop to the SUM intrinsic was
to try to get a speedup, which it did, quite a bit; I'm not fully sure
why, though. Perhaps my optimization settings were not very aggressive.

herrman...@gmail.com

Sep 15, 2016, 10:14:27 PM
On Thursday, September 15, 2016 at 6:16:30 PM UTC-7, Gary Scott wrote:
> On 9/15/2016 9:02 AM, Tim Prince wrote:

(snip on SUM, timing, and result accuracy)

> >> :) if I knew what this means, I could respond...lol I don't know how to
> >> get the SUM intrinsic to "promote element wise". My return array is
> >> double precision, but I don't know that that has any bearing on how SUM
> >> behaves in determining the sum for each row or column.
> > SUM(real(x,selected_real_kind(12)))
> > Evidently, the promotion increases latency (where there are actual
> > operations involved), but doesn't necessarily hurt throughput.

> ah, of course. I would have probably thought that might be time
> consuming. My intent in switching from a loop to the SUM intrinsic was
> to try to get a speedup, which it did quite a bit, unsure fully why
> though, perhaps my optimization settings were not very aggressive.

Compilers are usually pretty good at optimizing DO loops, if the options
are turned on. In the beginning, and for some years after array expressions
were added, they weren't all that good at optimizing the more complicated
array expressions. They might be better now.

But partly it is that complicated array expressions, as written,
can be much harder to evaluate.

Say you want to find the first element of an array x whose sine
is greater than 0.7. You might:

n = findloc(sin(x) .gt. 0.7, .true., dim=1)

(Maybe there is a better array expression, that is the one
that I thought of first.)

With a DO loop, I can be pretty sure that it will exit the loop, and not
calculate any more sines, after the first one is found. I am much less
sure that the array expression will do that.
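
For comparison, the DO loop version would be something like this
(illustrative only), and it clearly stops computing sines at the first
hit:

   n = 0
   do i = 1, size(x)
      if (sin(x(i)) .gt. 0.7) then
         n = i
         exit   ! no further sines are evaluated
      end if
   end do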

If the optimizer does figure that out, I might make something
more complicated that it doesn't figure out.

-- glen

Tim Prince

Sep 16, 2016, 8:35:48 AM
In the basic gfortran vectorized case, double precision sum could take
twice as long as single, due to the vector width.

robin....@gmail.com

Sep 17, 2016, 9:16:32 AM
On Friday, September 16, 2016 at 2:44:29 AM UTC+10, herrman...@gmail.com wrote:
> On Thursday, September 15, 2016 at 7:05:12 AM UTC-7, Tim Prince wrote:
> > On 9/15/2016 9:05 AM, Gary Scott wrote:
> > > On 9/15/2016 7:02 AM, Tim Prince wrote:
>
> (snip)
> > >> Who asked for a duplicate array? Promote element-wise within SUM. It
> > >> should vectorize on any simd CPU (still requiring gfortran -ffast-math
> > >> or the like).
>
> > > :) if I knew what this means, I could respond...lol I don't know how to
> > > get the SUM intrinsic to "promote element wise". My return array is
> > > double precision, but I don't know that that has any bearing on how SUM
> > > behaves in determining the sum for each row or column.
>
> > SUM(real(x,selected_real_kind(12)))
> > Evidently, the promotion increases latency (where there are actual
> > operations involved), but doesn't necessarily hurt throughput.
>
> It is, then, a quality of implementation issue not to generate a
> temporary array of the appropriate kind, to pass to sum.

As the kind is being changed, a temporary array will be created.

If you want to avoid that, write your own SUM function,
using a dp accumulator and default real data.

robin....@gmail.com

Sep 17, 2016, 9:21:06 AM
On Friday, September 16, 2016 at 7:29:34 AM UTC+10, herrman...@gmail.com wrote:
> On Thursday, September 15, 2016 at 10:04:00 AM UTC-7, Gordon Sande wrote:
>
> (snip, I wrote)
>
> > > There is another one that would be interesting to see, which
> > > is the sum that is done by the FFT algorithm.
>
> > > It is often said to be more accurate than a direct sum, though
> > > that might depend statistically on the data that people use
> > > the FFT for.
>
> > > That is, sum pairs, then pairs of those pairs, then pairs of ...
> > > With an array of O(log n) elements to store intermediates,
> > > it seems that it should be possible to do going down the
> > > input array. I didn't try to write that, though.
>
> > If you just want to do a tree summation then take a look at quick sort
> > for an example of doing tree recursion. Converting the sort to a sum
> > is left as an exercise for the reader. But you will still have the
> > recursion overhead.
>
> It is usual not to write quicksort using actual recursion, but instead
> with loops and a small array to keep track of the recursion points.
> Since the maximum stack depth, done right, is O(log N) it doesn't
> take a big array.

As quick sort can degenerate, a large array could be needed.

Gordon Sande

Sep 17, 2016, 10:55:17 AM
On 2016-09-17 13:20:46 +0000, robin....@gmail.com said:

> On Friday, September 16, 2016 at 7:29:34 AM UTC+10, herrman...@gmail.com wrote:
>> On Thursday, September 15, 2016 at 10:04:00 AM UTC-7, Gordon Sande wrote:
>>
>> (snip, I wrote)
>>
>>>> There is another one that would be interesting to see, which
>>>> is the sum that is done by the FFT algorithm.
>>
>>>> It is often said to be more accurate than a direct sum, though
>>>> that might depend statistically on the data that people use
>>>> the FFT for.
>>
>>>> That is, sum pairs, then pairs of those pairs, then pairs of ...
>>>> With an array of O(log n) elements to store intermediates,
>>>> it seems that it should be possible to do going down the
>>>> input array. I didn't try to write that, though.
>>
>>> If you just want to do a tree summation then take a look at quick sort
>>> for an example of doing tree recursion. Converting the sort to a sum
>>> is left as an exercise for the reader. But you will still have the
>>> recursion overhead.
>>
>> It is usual not to write quicksort using actual recursion, but instead
>> with loops and a small array to keep track of the recursion points.
>> Since the maximum stack depth, done right, is O(log N) it doesn't
>> take a big array.
>
> As quick sort can degenerate, a large array could be needed.

Always stacking the longer segment is the standard fix for keeping the stack
at log(N). If you don't, you might stack N segments of size one in the worst case.