
FMA in implementation of FFT - reaction to last year's post of Mitch


Michael S
Sep 14, 2021, 11:22:59 AM

On Friday, April 24, 2020 at 12:10:58 AM UTC+3, MitchAlsup wrote:
> On Thursday, April 23, 2020 at 2:54:54 PM UTC-5, Terje Mathisen wrote:
> > David Brown wrote:
> > > On 23/04/2020 10:25, Tom Gardner wrote:
> > >> On 23/04/20 08:44, David Brown wrote:
> > >>> On 22/04/2020 20:49, Tom Gardner wrote:
> > >>>>
> > >>>> The HPC mob have complained that it is problematic for
> > >>>> them w.r.t. floating point behaviour.
> > >>>>
> > >>>
> > >>> If the language makes restrictions on floating point details, then I
> > >>> can see how that might annoy people. Some floating point users need
> > >>> highly predictable results that match IEEE specifications across
> > >>> different platforms - others want the highest possible speeds and
> > >>> will accept minor differences in results. If a language is
> > >>> specified to suit one of these groups, it is bound to annoy the other.
> > >>
> > >> There's an argument that floating point is never exact anyway, and
> > >> it does no harm to expose that inexactitude.
> > >
> > > Yes. That's my attitude for the code I write.
> >
> > Good for you!
> >
> > >
> > > There is another argument that says floating point should work exactly
> > > the same each time, and on all platforms - it is not an exact model of
> > > real numbers, but needs to be consistent and tightly specified nonetheless.
> > >
> > > Both attitudes are fine IMHO, and have their uses.
> >
> > Individual operations need to be both consistent and reproducible, but
> > HPC-style programming in the large has to accept that when the order of
> > operations is effectively random, there are limits to the
> > reproducibility of the final results.
> > >
> > >>
> > >> Caution: I am not an FP expert; so this is from memory
> > >> and is possibly misunderstood...
> > >> More importantly, if you have higher "internal precision"
> > >> during operations (e.g. 80-bit for 64-bit FP numbers), then
> > >> the end result of a calculation can be more exact - but the
> > >> IEEE spec precludes that.
> >
> > It does not! Java, however, does.
> >
> > I.e. partly because the IEEE 754 spec was written in parallel with the
> > 8087 FPU, more or less everything Intel's 80-bit format can do is legal. :-)
> >
> >
> > OTOH, since all x64 FP math is done using the SSE regs, even that
> > platform is now in the exact reproducibility camp.
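
[Terje's point about wider internal formats is easy to demonstrate. A
minimal C sketch, mine and not from the thread, assuming a platform
where long double is the x87 80-bit extended format (e.g. x86 gcc);
elsewhere both results may print as 0:

#include <stdio.h>

int main(void)
{
    double a = 1.0, b = 1e-17;

    /* In pure 64-bit arithmetic a + b rounds straight back to a,
       so the small term vanishes. */
    double lost = (a + b) - a;                    /* 0.0 */

    /* With the 64-bit significand of the 80-bit format the sum keeps
       enough bits that subtracting a recovers (most of) b. */
    long double kept = ((long double)a + b) - a;  /* ~1e-17 on x87 */

    printf("double: %g, long double: %Lg\n", lost, kept);
    return 0;
}
]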
> >
> > Actually, there are still a few open slots for ulp offsets to occur
> > legally: The one I know best is related to rounding of
> > subnormal/denormal numbers where you can round first, before detecting
> > that the result is subnormal, or you can do it after normalization.
> >
> > The latter is the only "Right Way" (TM) to do it imho, but since both
> > exist in current CPUs, both options are grandfathered in.
> >
> >
> > > Yes. 80-bit FP dates mainly back to floating point coprocessors, I
> > > think (I'm sure others here know vastly more of floating point history
> > > than I do), which are pretty much gone now. But you do get other cases
> > > like fused MAC that can give more accurate results than separate
> > > operations.
> >
> > FMAC is completely specified, but compilers can't really use it unless
> > language lawyers allow them to, i.e. when you do
> >
> > s = a*b+c;
> >
> > without an internal rounding after the multiplication, instead keeping
> > the exact product and feeding it into the adder, you are explicitly
> > breaking the "as-if" rule since the results you get are different.
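
[In C terms, the contrast Terje describes looks like the sketch below;
fma() is the C99 single-rounding fused operation, and the standard (if
not universally honored) FP_CONTRACT pragma is how a program forbids the
compiler from fusing on its own. My illustration, not from the thread:

#include <math.h>

#pragma STDC FP_CONTRACT OFF      /* forbid silent fusing of a*b + c */

double separate(double a, double b, double c)
{
    double p = a * b;             /* product rounded to 53 bits */
    return p + c;                 /* second rounding */
}

double fused(double a, double b, double c)
{
    return fma(a, b, c);          /* exact product, single rounding */
}
]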
> Fast Fourier Transforms have an accuracy dependency that precludes
> using FMACs for the last 2 Multiply ADDs in the inner loop::
>
> DO 30 I = J, N, N1
> XT = RE(I) - RO(I)
> YT = IM(I) - IN(I)
> RE(I) = RE(I) + RO(I)
> IM(I) = IM(I) + IN(I)
> RO(I) = XT*C - YT*S ! no FMAC here
> IN(I) = XT*S + YT*C ! no FMAC here
> 30 CONTINUE
>
> The problem is that the COS() term and the SIN() term are balanced in
> a way that causes FMAC to lose accuracy over the 2-MUL and 1-ADD
> version. This balance is inherent in S**2 + C**2 = 1.000000000000000
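
[For concreteness, the last two statements of Mitch's loop rendered in
C (my transcription, not his code), contrasting the all-rounded
evaluation with the mixed one a contracting compiler might emit:

#include <math.h>

/* all-rounded: both products rounded to 53 bits, then one subtract */
double ro_mul(double xt, double yt, double c, double s)
{
    return xt * c - yt * s;
}

/* mixed: yt*s is rounded to 53 bits, while xt*c enters the
   subtraction as an exact 106-bit product inside the FMA */
double ro_fma(double xt, double yt, double c, double s)
{
    return fma(xt, c, -(yt * s));
}
]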

I don't understand why what's written above is a bigger problem than any other loss of accuracy that inevitably happens when implementing an FFT in fixed-width FP arithmetic.

Mitch, can you provide an example of an input vector for which FFT with FMA really breaks accuracy in a significant/unusual way?

Also, your loop appears to be of the Radix-2 Decimation-In-Frequency variety.
If the problem really exists, does it apply to other common FFT algorithms,
e.g. Radix-4 Decimation-In-Frequency or Radix-2 Decimation-In-Time?

MitchAlsup
Sep 14, 2021, 1:34:08 PM

It has to do with the balance between the SIN terms and the COS terms.
Performing an FMUL followed by an FMAC results in the first term being
carried out with 53 bits of precision, while the second is performed with
106 bits of FMUL precision and 53 bits of FADD precision.
<
This creates a bias in the roundings that the FMUL+FMUL version does not.
>
> Mitch, can you provide an example of an input vector for which FFT with FMA really breaks accuracy in a significant/unusual way?
>
> Also, your loop appears to be of the Radix-2 Decimation-In-Frequency variety.
> If the problem really exists, does it apply to other common FFT algorithms,
> e.g. Radix-4 Decimation-In-Frequency or Radix-2 Decimation-In-Time?
<
My guess is that the bias is reduced, but intermixing of FMUL and FMAC
tends to create opportunities for rounding bias.
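
[One way to check the claim empirically would be a harness along these
lines. Entirely my sketch: the inputs, iteration count, and the use of
long double as a reference are assumptions, and it needs a platform
where long double is wider than double:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double bias_mul = 0.0, bias_fma = 0.0;
    const int n = 1000000;
    srand(1);

    for (int i = 0; i < n; i++) {
        double theta = 6.283185307179586 * rand() / RAND_MAX;
        double c = cos(theta), s = sin(theta);
        double xt = (double)rand() / RAND_MAX - 0.5;
        double yt = (double)rand() / RAND_MAX - 0.5;

        /* reference value computed in extended precision */
        long double ref = (long double)xt * c - (long double)yt * s;

        bias_mul += (double)((xt * c - yt * s) - ref);
        bias_fma += (double)(fma(xt, c, -(yt * s)) - ref);
    }
    printf("mean signed error, MUL+MUL: %g\n", bias_mul / n);
    printf("mean signed error, FMA mix: %g\n", bias_fma / n);
    return 0;
}

A nonzero mean signed error in the FMA column, against a near-zero one
for MUL+MUL, would be the bias Mitch describes.]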

Michael S
Sep 14, 2021, 1:39:58 PM

So far, I am not convinced at all.
Did you come to the conclusion that the problem exists by yourself, or were you told about it by somebody else?

MitchAlsup
Sep 14, 2021, 4:53:24 PM

I was told of the problem by someone else.

Bernd Linsel
Sep 14, 2021, 5:00:51 PM

On 14.09.2021 22:53, MitchAlsup wrote:
> On Tuesday, September 14, 2021 at 12:39:58 PM UTC-5, Michael S wrote:
> <
> <
> I was told of the problem by someone else.
>

Wouldn't a simple remedy be to start with 0 and then do two successive FMACs?

e.g.

RO(I) = 0 + XT*C - YT*S
IN(I) = 0 + XT*S + YT*C
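
[If I read the suggestion right, in C it amounts to this (my rendering,
not Bernd's code). Note that the inner FMA's addend is zero, so it still
performs one 53-bit rounding of its product, just as a plain multiply
would:

#include <math.h>

double ro_two_fma(double xt, double yt, double c, double s)
{
    /* inner FMA: -yt*s + 0.0, rounded to 53 bits;
       outer FMA: exact xt*c product added to that rounded value */
    return fma(xt, c, fma(-yt, s, 0.0));
}
]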

MitchAlsup
Sep 14, 2021, 6:20:41 PM

No matter how the compiler (effectively) inserts parentheses {determines order},
you are adding a 53-bit fraction to a 106-bit fraction. To avoid rounding error
bias you want to be adding a 106-bit product to a 106-bit product and performing
a single rounding.
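
[For what it's worth, the single-rounding effect Mitch asks for can be
approximated on ordinary FMA hardware with Kahan's well-known
difference-of-products trick, which spends one extra FMA to recover the
exact rounding error of one product. A sketch, not anything posted in
the thread:

#include <math.h>

/* computes a*b - c*d accurately (to a couple of ulps): e captures the
   exact rounding error of the c*d product, so both 106-bit products
   effectively enter the final sum */
double diff_of_products(double a, double b, double c, double d)
{
    double w = c * d;
    double e = fma(c, d, -w);   /* e = c*d - w, exact */
    double f = fma(a, b, -w);   /* f = a*b - w, one rounding */
    return f - e;
}

In the loop above this would read:

RO(I) = diff_of_products(XT, C, YT, S)
IN(I) = diff_of_products(XT, S, -YT, C)   (the sum written as a difference)
]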