FFTW.REDFT00 and FFTW.RODFT00 are ~10x slower than other routines?


Sheehan Olver

Sep 22, 2015, 6:51:50 AM
to julia-users
I get the following timings with the various FFTW routines, where I ran each line multiple times to make sure it was accurate.  Why are REDFT00 and RODFT00 almost 10x slower?


r=rand(100000)
@time FFTW.r2r(r,FFTW.REDFT00) #0.26s
@time FFTW.r2r(r,FFTW.REDFT01) #0.0033s
@time FFTW.r2r(r,FFTW.REDFT10) #0.0035s
@time FFTW.r2r(r,FFTW.REDFT11) #0.0033s
@time FFTW.r2r(r,FFTW.RODFT00) #0.017s
@time FFTW.r2r(r,FFTW.RODFT01) #0.0035s
@time FFTW.r2r(r,FFTW.RODFT10) #0.0035s
@time FFTW.r2r(r,FFTW.RODFT11) #0.0035s
@time fft(r)                   #0.0033s

Sheehan Olver

Sep 22, 2015, 6:52:10 AM
to julia-users
This is in 0.4rc2 on OS X.

Páll Haraldsson

Sep 22, 2015, 11:41:45 AM
to julia-users
I'm just curious, so I looked into it. I'm familiar with what the FFT, DCT, IDCT, and Fourier transform are for, but not in much depth, and my math is fading.

I got similar numbers to yours (I'm on 0.3.11), and then I tried with (as I recall, powers of two are what you should be using):

r=rand(2^20)

The numbers are then much closer, but there is still a gap.


You can look at (and edit!) the code with edit(FFTW.r2r).

I'm not sure whether either set of numbers is what should be expected (or at least whether a slowdown is introduced by Julia).

See here:

That is the (GPL) library that does the heavy lifting (it is not written in Julia), and there you will find the equations behind what you are doing.


You could do something like @profile FFTW.r2r(r,FFTW.REDFT00)

and look into how the profiler works (to see the results), though I doubt you'll see much of interest if it only shows the Julia side.


By just scanning the code, I found code like this:

execute(precision::fftwTypeDouble, plan) =
    ccall((:fftw_execute,libfftw), Void, (Ptr{Void},), plan)

which I assume does all the work (by calling into the FFTW library).

-- 
Palli.

Steven G. Johnson

Sep 22, 2015, 1:52:38 PM
to julia-users


On Tuesday, September 22, 2015 at 6:51:50 AM UTC-4, Sheehan Olver wrote:
I get the following timings with the various FFTW routines, where I ran each line multiple times to make sure it was accurate.  Why are REDFT00 and RODFT00 almost 10x slower?

This is because a REDFT00 (DCT-I) of n points is exactly equivalent to a DFT of N=2(n-1) points with real-even data.  Although FFTW uses a different algorithm, not a size-N complex FFT, the fast algorithms are essentially equivalent to pruned versions of the FFT algorithms for N, and depend on the factorization of N, not of n.  Similarly for RODFT00 (DST-I), except N=2(n+1) in that case.  In contrast, the other DCT/DST types correspond to DFTs of length N=2n or N=4n or N=8n, so their speed depends on the factorization of n.
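To see this equivalence concretely, here is a small check (in Python rather than Julia, using a naive O(N^2) DFT; the dct1 function below follows FFTW's unnormalized REDFT00 definition):

```python
import cmath
import math

def dct1(x):
    """Unnormalized DCT-I, following FFTW's REDFT00 definition."""
    n = len(x)
    return [x[0] + (-1) ** k * x[-1]
            + 2 * sum(x[j] * math.cos(math.pi * j * k / (n - 1))
                      for j in range(1, n - 1))
            for k in range(n)]

def dft(x):
    """Naive O(N^2) complex DFT; fine for a small check."""
    N = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / N) for j in range(N))
            for k in range(N)]

n = 6
x = [0.5, 1.0, -2.0, 3.5, 0.25, -1.0]
ext = x + x[-2:0:-1]        # even symmetric extension, length N = 2(n-1) = 10
spec = dft(ext)

# The first n outputs of the length-N DFT are real and equal the DCT-I of x.
assert all(abs(spec[k].imag) < 1e-9 for k in range(n))
assert all(abs(spec[k].real - y) < 1e-9 for k, y in enumerate(dct1(x)))
```

So a fast DCT-I of n points is, in effect, a pruned FFT of N = 2(n-1) points, and inherits N's factorization properties.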

If n=100000 as in your example, then n-1 has large prime factors (41, 271), so the FFT algorithms are much slower, although they are still O(n log n).   If you try n=100001, you will notice that REDFT00 becomes much faster (but other DCT types become slower).
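A quick way to check the factorizations in question (a hypothetical `factor` helper in Python, for brevity):

```python
def factor(m):
    """Trial-division factorization; fine at this size."""
    fs, p = [], 2
    while p * p <= m:
        while m % p == 0:
            fs.append(p)
            m //= p
        p += 1
    if m > 1:
        fs.append(m)
    return fs

# REDFT00 of length n behaves like an FFT of length N = 2(n-1),
# so it is n-1 whose factorization matters:
print(factor(100000 - 1))   # [3, 3, 41, 271]  -> large prime factors, slow
print(factor(100001 - 1))   # [2, 2, 2, 2, 2, 5, 5, 5, 5, 5]  -> smooth, fast
```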

This is mentioned in the FFTW manual:


"FFTW is most efficient when N is a product of small factors; note that this differs from the factorization of the physical size n for REDFT00 and RODFT00!"

Sheehan Olver

Sep 23, 2015, 3:47:43 AM
to julia...@googlegroups.com
Hi Steven,

OK that makes sense.  But why is Julia 2x slower than Matlab for some values of n  (see below, the timing difference is consistent when looped over)?  Shouldn’t they both be using FFTW?


julia> r=rand(2000000);plan=plan_fft(r); @time plan*r;
  0.080122 seconds (12 allocations: 61.035 MB, 12.91% gc time)

julia> r=rand(2000001);plan=plan_fft(r); @time plan*r;
  0.287002 seconds (12 allocations: 61.036 MB, 3.76% gc time)

julia> r=rand(1999999);plan=plan_fft(r); @time plan*r;
  0.218389 seconds (12 allocations: 61.035 MB, 4.76% gc time)


>> r=rand(2000000,1);tic();fft(r);toc()
Elapsed time is 0.023956 seconds.
>> r=rand(2000001,1);tic();fft(r);toc()
Elapsed time is 0.303320 seconds.
>> r=rand(1999999,1);tic();fft(r);toc()
Elapsed time is 0.131212 seconds.



Steven G. Johnson

Sep 23, 2015, 8:59:41 PM
to julia-users


On Wednesday, September 23, 2015 at 3:47:43 AM UTC-4, Sheehan Olver wrote:
OK that makes sense.  But why is Julia 2x slower than Matlab for some values of n  (see below, the timing difference is consistent when looped over)?  Shouldn’t they both be using FFTW?

I think that maybe Matlab defaults to using FFTW's real-data routines when the data is real, whereas Julia only uses the real-data routines if you request them via rfft etc.   (The rfft functions have the advantage of requiring half of the storage for the output compared to a complex FFT, whereas I think Matlab pads the output back to the full length using the mirror symmetries.)
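To illustrate the redundancy that the real-data routines exploit: for real input, the second half of the DFT output is the conjugate mirror of the first half, so an rfft only needs to store n/2+1 values. A pure-Python sketch of the identity (not the FFTW code; naive O(n^2) DFT, even n):

```python
import cmath

def dft(x):
    """Naive O(N^2) complex DFT; enough to show the symmetry."""
    N = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / N) for j in range(N))
            for k in range(N)]

x = [1.0, -2.5, 3.0, 0.5, 4.25, -1.0]   # real input, n = 6
X = dft(x)
half = X[:len(x) // 2 + 1]              # all an rfft stores: n//2 + 1 values

# The rest is redundant: X[n - k] == conj(X[k]), so it can be reconstructed.
rest = [half[k].conjugate() for k in range(len(x) // 2 - 1, 0, -1)]
recon = half + rest
assert all(abs(a - b) < 1e-9 for a, b in zip(X, recon))
```

Padding the output back to full length, as Matlab apparently does, just applies this reconstruction after the half-size transform.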

Sheehan Olver

Sep 23, 2015, 9:07:51 PM
to julia...@googlegroups.com
OK that makes sense.  But then why is rfft on a vector of length 2*(n-1) more than 2x faster than FFTW.REDFT00?

julia> r=rand(100000);@time for k=1:100 FFTW.r2r(r,FFTW.REDFT00) end;
  2.496130 seconds (8.30 k allocations: 76.703 MB, 0.76% gc time)
julia> r=rand(2*(100000-1));@time for k=1:100 rfft(r) end;
  0.943706 seconds (8.30 k allocations: 152.985 MB, 1.58% gc time)



PS  Why doesn't fft(::Vector{Float64}) automatically call rfft and re-interpret the output?

Steven G. Johnson

Sep 25, 2015, 11:40:50 AM
to julia-users


On Wednesday, September 23, 2015 at 9:07:51 PM UTC-4, Sheehan Olver wrote:
OK that makes sense.  But then why is rfft on a vector of length 2*(n-1) more than 2x faster than FFTW.REDFT00?

julia> r=rand(100000);@time for k=1:100 FFTW.r2r(r,FFTW.REDFT00) end;
  2.496130 seconds (8.30 k allocations: 76.703 MB, 0.76% gc time)
julia> r=rand(2*(100000-1));@time for k=1:100 rfft(r) end;
  0.943706 seconds (8.30 k allocations: 152.985 MB, 1.58% gc time)

The short answer is that rfft is much more optimized in FFTW than the r2r transforms.

The long answer is that REDFT00 transforms of even length n are especially bad, because the algorithms for this problem are not very attractive.  For an even length, the options are:

    1) Use an algorithm from FFTPACK or Numerical Recipes that turns it into an rfft of the same length plus O(n) pre/post-processing.  We implemented this, but don't use it because it seems to suffer from severe accuracy problems: https://github.com/FFTW/fftw3/blob/master/reodft/redft00e-r2hc.c
    2) Pad symmetrically to an rfft of twice the length.  This is accurate, but sacrifices a factor of 2 in speed, as you noticed: https://github.com/FFTW/fftw3/blob/master/reodft/redft00e-r2hc-pad.c
    3) If n-1 has a small prime factor r, implement a pruned version of the analogous radix-r Cooley-Tukey algorithm.  This would work and be accurate, but is a lot of work to implement, and we didn't bother except in the case where n is odd (so that we can use radix r=2, or actually split radix): https://github.com/FFTW/fftw3/blob/master/reodft/reodft00e-splitradix.c
  
PS  Why doesn't fft(::Vector{Float64}) automatically call rfft and re-interpret the output? 

We could, but my feeling was that if you really care about factors of two in performance, you should be using rfft directly.  That also lets you save a factor of two in memory, and a factor of two in any post-processing computations as well, and additionally lets you save a factor of two in any subsequent inverse transform (if you need it, e.g. for convolutions or filtering).

Matlab, in contrast, doesn't expose an rfft interface, so you can't get all of the savings that are possible in this case.  So they might as well get what they can from fft(x).
 