pm86...@longs.LANCE.ColoState.Edu (Peter J. McKinney) wrote:
> This is part one of a two part summary to the net concerning FFT code in C.
>I had asked for the best, fastest implementation of the FFT in C or for a
>"standard" piece of code if such a thing existed.
As I'm sure you realize, "best" depends quite a bit on what you
consider best, and "fastest" depends largely on the number of points,
the compiler, and the architecture you're using. If it's amazingly
critical to have the best and fastest, you're probably best off
writing your own to fit your specific application. Short of
specialized hardware, if your data are real, using a real-valued FFT
is your best single optimization (it should be faster by a factor of 2).
It seems that two of the source code sections in your summary used "a
duhamel-holman split radix fft". Perhaps this is considered a
'"standard" piece of code', but I do know of two different FFTs which
in my opinion qualify as both "faster" and "better".
I tried comparing my code to the code by Dave Edelblute that you
included in your previous posting; but the times for that code were so
much worse (twice as slow) that it's unlikely such code would
have been in a posting labeled best and fastest, so I probably made an
error in compiling it. In the nearly identical test programs I
include in my other posting, they both seem to produce the same
results; but who knows. I'll include the duhamel-holman code I
tested in the second part of this posting so someone can point out
where I screwed up. (If anyone tries it, please let me know if you
get the same results...)
The "best" published FFT routine I have studied recently is based on
Singleton's mixed-radix FFT algorithm. His routine has a number of
optimizations, including a decent trig function generator, quite
optimized radix-4 transforms, and only 0 or 1 radix-2 transforms for
any power of 2. It is also neat in that it works with arrays of any
size, not just powers of 2! I believe the published version is in
Fortran; but someone at work translated it to C, and it seems to be
~25% faster than the duhamel-holman routine you posted in your summary
(if I compiled it correctly). I can probably dig up a reference to
this code next time I dig through all my school stuff if anyone really
needs it.
The "fastest" (for typical computers: single processor, non-vector,
fast-integer-math, slower-floating-point-math, slow-trig-function) FFT
routine I have ever seen is one I wrote myself, trying to incorporate
as many optimizations as I could find in various IEEE publications
while I was at college. As you can see in the file "summary.sparc"
included below, it is nearly twice as fast as the duhamel-holman
routine you posted in your summary.
The routine I came up with includes the following optimizations:
1) It is a wrapper around a highly optimized FHT (Hartley transform).
An FHT better localizes memory accesses for large transforms,
thereby avoiding paging. Hartley transforms are also much easier
to add optimization tricks to, more than making up for the
overhead of converting the Hartley transform to a Fourier
transform. Another advantage is that the transformation from
FHT -> real_valued_FFT
is faster than the transformation from
1/2pointFFT -> real_valued_FFT
so my real-valued FFT is even better when compared to most
published real-valued FFTs.
2) Avoid multiplications by 1 and zero (and sometimes sqrt2).
Many published routines such as Numerical Recipes seem to spend
a lot of time multiplying by cos(0), cos(pi), etc. and almost all
seem to do 1/sqrt_2*x+1/sqrt_2*y instead of 1/sqrt_2*(x+y).
3) Faster trig generation.
Most algorithms use 1 "sin" library call for each level of the
transform, and 2 real multiplications and 2 real additions for
each needed trig value within its loop.
I use a stable algorithm to generate each trig value using 1 real
multiplication and 1 real addition for each value using a small
(log(n)) table of trig values. The tradeoff is that I require
much more integer arithmetic for this calculation, including a
(n*log(n)) loop; but for multiples of pi/16 or so, my routine
still seems faster. By taking advantage of the fact that
values required for FFTs are for evenly spaced angles, I avoid
all calls to slow trig library functions which are unnecessarily
complex because they need to work for arbitrary values.
4) Generate fewer trig values
I use the identities sin(x)=sin(pi-x)=-sin(pi+x)=-sin(-x),etc. to
avoid excessive trig calculations, and sin(2x) = 2*cos(x)*sin(x)
to allow simpler trig calculations when accuracy permits. A more
stable than average trig generator mentioned in (3) above allows
me to use the unstable sin(2x) = 2*cos(x)*sin(x) hack for every
other "level" in the FFT without the usual loss of accuracy.
5) Mixed 2,4-radix inner loop.
By doing two levels in the inner loop, I gain all the advantages
of a radix-4 algorithm; primarily reducing integer arithmetic and
memory access. This has a great effect on large arrays when
paging occurs.
6) Unrolling loops and arranging variable storage to optimize register
allocation. I try not to require storing too many values in
variables at any one time, to ease the compiler's register
allocation.
7) Pull the low levels of the transform out of the loop. It's
significantly faster to do 8-point transforms explicitly, rather
than using the general loop.
One catch to this routine is that at least two of the algorithms it
uses are patented(!) (just about any FHT is patented by R. Bracewell,
and the stable trig generator is patented by O. Buneman; both at
Stanford Univ.) Who owns the copyright is also probably debatable:
it is a derivative work of a GNU-licensed routine, so it is subject to
their restrictions; it was worked on for a Stanford project, so they
have a claim on it; and I optimized it further while working for this
company, so they probably claim parts of it.
Considering Gauss apparently used the equivalent of real-valued FFTs
in 1805, and Euler was doing Fourier transforms in the mid-1700s, I'm
amazed that people still want to claim this math.
Ron Mayer
ma...@acuson.com
My code is in part 2.