FFT performance compared to MATLAB


SYoon

Nov 4, 2013, 6:50:14 PM
to julia...@googlegroups.com
I am trying to optimize FFT performance following the suggestion in the discussion.
I have the following testing function:

function test()
    a = float32([1:38192])
    b = Array(Complex64, length(a))

    tic()
    for i = 1:100
        b = fft(a)
    end
    toc();
end

This takes about 0.146 seconds (Matlab takes about 0.06 seconds).

Now, I have implemented three helper functions:

function get_fft_plan(ri)
    ro = Array(Complex64, length(ri)>>1 + 1)
    r2c = Base.FFTW.Plan(ri, ro, 1, Base.FFTW.PATIENT, Base.FFTW.NO_TIMELIMIT)
    return ro, r2c
end
 
function execute_fft_plan(ri, ro, r2c, symmetry)
    Base.FFTW.execute(r2c.plan, ri, ro)

    if !symmetry
        return ro
    else
        roo = make_fft_symmetry(ro)
        return roo
    end
end

function make_fft_symmetry(ro)
    len = 2*(length(ro) - 1)
    roo = Array(Complex64, len)
    roo[1] = ro[1]

    for i = 2:div(len,2)+1
        # Hermitian symmetry of a real-input FFT: X[len-i+2] == conj(X[i])
        roo[i] = ro[i]
        roo[end-i+2] = conj(ro[i])
    end
    return roo
end

Now, my updated test function looks like

function test()
    a = float32([1:38192])
    b = Array(Complex64, length(a))
    ro, r2c = get_fft_plan(a)

    tic()
    for i = 1:100
        b = execute_fft_plan(a, ro, r2c, true)
    end
    toc();
end

This takes about 0.074 seconds. When I change "true" to "false", it takes about 0.053 seconds. So making the FFT output full length costs about 0.02 seconds. Can somebody suggest a better way to implement the function "make_fft_symmetry" so that the 0.074 seconds gets closer to the Matlab performance (0.06 seconds), or better?
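For readers of this archive: in current Julia the symmetry fill can be written as a preallocated, `@inbounds` loop. The sketch below uses modern syntax and a hypothetical `fill_symmetry!` name, with the usual Hermitian convention for a real-input FFT, X[n-k+2] == conj(X[k]):

```julia
# Reconstruct the full length-n spectrum from the half spectrum `half`
# (length n÷2 + 1, as returned by a real-to-complex FFT), writing into
# the preallocated buffer `full`. Assumes n is even, as in the thread's example.
function fill_symmetry!(full::Vector{<:Complex}, half::Vector{<:Complex})
    n = length(full)
    @assert length(half) == n ÷ 2 + 1
    @inbounds begin
        full[1] = half[1]                     # DC term
        for k in 2:(n ÷ 2 + 1)
            full[k] = half[k]
            full[n - k + 2] = conj(half[k])   # Hermitian symmetry
        end
    end
    return full
end

# Tiny usage example with a hand-made half spectrum (n = 4):
half = ComplexF32[4, 1 + 2im, -1]
full = fill_symmetry!(Vector{ComplexF32}(undef, 4), half)
# full is now ComplexF32[4, 1+2im, -1, 1-2im]
```

Preallocating `full` once and reusing it across calls avoids the per-call allocation that dominates in tight loops.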

Thanks.

Simon Kornblith

Nov 4, 2013, 7:24:39 PM
to julia...@googlegroups.com
Are you sure your benchmark is fair? MATLAB uses multiple threads for FFTW by default, whereas I think Julia only uses a single thread unless told to use more. So unless you use Base.FFTW.set_num_threads() to set the number of threads in Julia, or launch MATLAB with -singleCompThread, the performance difference might just be due to parallelization.
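For readers using current Julia: the FFTW bindings have since moved from Base.FFTW into the FFTW.jl package, where the thread count is set explicitly. A minimal sketch, assuming FFTW.jl is installed:

```julia
using FFTW

# Let FFTW use one thread per CPU core, roughly matching MATLAB's default.
FFTW.set_num_threads(Sys.CPU_THREADS)

a = Float32.(1:38192)
b = fft(a)   # subsequent plans/executions may now use multiple threads
```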

SYoon

Nov 4, 2013, 10:45:38 PM
to julia...@googlegroups.com
Following Simon's suggestion, I modified the "get_fft_plan" function

function get_fft_plan(ri; nthread=1)
    if nthread > 1
        Base.FFTW.set_num_threads(nthread)
    end

    ro = Array(Complex64, length(ri)>>1 + 1)
    r2c = Base.FFTW.Plan(ri, ro, 1, Base.FFTW.PATIENT, Base.FFTW.NO_TIMELIMIT)
    return ro, r2c
end

Now, the execution time is about 0.05 seconds with nthread=8 (Matlab takes about 0.06 seconds).

When symmetry is turned off in the function "execute_fft_plan" with nthread=8, the execution time is about 0.02 seconds. So whether threading is used or not, optimizing the function "make_fft_symmetry" will help. Any idea how to improve its performance?

SYoon

Nov 4, 2013, 11:15:01 PM
to julia...@googlegroups.com
Would it be possible to improve the for-loop performance in "make_fft_symmetry" by making use of Julia's parallel computing features? My application uses lots of FFTs, so I'd like to push FFT performance to the limit.



Billou Bielour

Nov 5, 2013, 5:09:17 AM
to julia...@googlegroups.com
Have you tried to profile your code?

Maybe you should pre-allocate roo outside of the main loop. Also look at memory allocation with @time.

Steven G. Johnson

Nov 5, 2013, 6:25:13 PM
to julia...@googlegroups.com
What is the point of making the FFT output full length?  Anything you might want to do you can do with the half-length array.

SYoon

Nov 5, 2013, 6:25:18 PM
to julia...@googlegroups.com
Following Billou's suggestion, I moved the roo memory allocation out of the for loop. My modified functions are

function get_fft_plan(ri; nthread=1)
    ro = Array(Complex64, length(ri)>>1 + 1)
    roo = Array(Complex64, length(ri))

    if nthread > 1
        Base.FFTW.set_num_threads(nthread)
    end

    r2c = Base.FFTW.Plan(ri, ro, 1, Base.FFTW.PATIENT, Base.FFTW.NO_TIMELIMIT)
    return ro, roo, r2c
end

function execute_fft_plan!(ri, ro, roo, r2c; full=true)
    Base.FFTW.execute(r2c.plan, ri, ro)

    if full
        len = 2*(length(ro) - 1)
        roo[1] = ro[1]

        for i = 2:div(len,2)+1
            # Hermitian symmetry: X[len-i+2] == conj(X[i]) for real input
            roo[i] = ro[i]
            roo[end-i+2] = conj(ro[i])
        end
    end
end

function test()
    FF = fast_fft
    a = float32([1:38192])
    ro, roo, r2c = FF.get_fft_plan(a, nthread=8)

    tic()
    for i = 1:100
        FF.execute_fft_plan!(a, ro, roo, r2c)
    end
    toc();
end

The variable "roo" holds the full-length FFT result.

Now the execution time is 0.063 seconds (nthread=1) and 0.032 seconds (nthread=8). This is a big improvement over the standard fft (0.146 seconds; see the first post in this thread) and Matlab (0.060 seconds).

SYoon

Nov 5, 2013, 6:34:44 PM
to julia...@googlegroups.com
On Tuesday, November 5, 2013 6:25:13 PM UTC-5, Steven G. Johnson wrote:
What is the point of making the FFT output full length?  Anything you might want to do you can do with the half-length array.

I am translating existing Matlab code to Julia. The existing code assumes full-length FFT results. Once the translation is verified against the full-length result, I will modify the code to use the half-length result. Using the half-length array (full=false with the execute_fft_plan! function) will roughly halve the execution time in my case.

Steven G. Johnson

Nov 6, 2013, 11:34:17 AM
to julia...@googlegroups.com

If the full-length results are only for validation, it seems a bit pointless to optimize their performance.

SYoon

Nov 7, 2013, 12:12:15 AM
to julia...@googlegroups.com

If the full-length results are only for validation, it seems a bit pointless to optimize their performance.

I guess you got the point.

BTW, Julia manual says the following about "rfft": Multidimensional FFT of a real array A, exploiting the fact that the transform has conjugate symmetry in order to save roughly half the computational time and storage costs compared with fft()

julia> a = float32([1:256]);
julia> @time for i=1:10^5;fft(a);end
elapsed time: 2.980890546 seconds (404707424 bytes allocated)
julia> @time for i=1:10^5;rfft(a);end
elapsed time: 3.504742832 seconds (430317472 bytes allocated)

Using Julia's FFT capabilities correctly seems nontrivial.
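For reference, the relationship between the two can be checked directly: rfft returns only the first n÷2 + 1 elements of the full transform (a sketch in current FFTW.jl syntax):

```julia
using FFTW

a = Float32.(1:16)
full = fft(a)      # 16 complex values, half of them redundant for real input
half = rfft(a)     # 9 values: the non-redundant part

# For real input: full[k] == half[k] for k ≤ 9,
# and full[16 - k + 2] == conj(half[k]) for k = 2..9.
```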

Simon Byrne

Nov 7, 2013, 5:49:02 AM
to julia...@googlegroups.com
That's interesting: some profiling seems to indicate that rfft spends much more time building plans. I guess that means the garbage collector is being more aggressive in this case (since plans are only rebuilt when the gc is called).

This can be avoided by explicitly building one plan; in this case the benefit from using rfft is marginal:

julia> a = float32([1:256]);
julia> @time (p = plan_fft(a); for i=1:10^5;p(a);end)
elapsed time: 1.463772197 seconds (294848204 bytes allocated)
julia> @time (p = plan_rfft(a); for i=1:10^5;p(a);end)
elapsed time: 1.398978816 seconds (251455708 bytes allocated)

However for larger arrays, the benefit can be greater than twofold:

julia> a = float32([1:2^20]);
julia> @time (p = plan_fft(a); for i=1:10^2;p(a);end)
elapsed time: 6.208034155 seconds (851779540 bytes allocated)
julia> @time (p = plan_rfft(a); for i=1:10^2;p(a);end)
elapsed time: 2.079043354 seconds (425584604 bytes allocated)
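In current Julia the same experiment reads as follows (plan objects are applied with *; assuming FFTW.jl):

```julia
using FFTW

a  = Float32.(1:2^10)
p  = plan_fft(a)    # complex plan: 1024 outputs
pr = plan_rfft(a)   # real-input plan: only 513 outputs computed and stored

b  = p * a
br = pr * a
```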

Tim Holy

Nov 7, 2013, 6:40:27 AM
to julia...@googlegroups.com
On Wednesday, November 06, 2013 09:12:15 PM SYoon wrote:
> Using Julia's FFT capability correctly seems not trivial.

Only if you're trying to extract the absolute optimal performance. If you want
to use it as in Matlab, it's roughly as easy as in Matlab---it's just that
Julia offers more opportunities for fiddling than Matlab does.

One difference I have noticed, however, is the importance of doing planning---
without it, Julia's FFTs are slower than Matlab's. (With sufficient time devoted
to planning, I find that Julia's FFTs can be faster than Matlab's.) I'm
guessing Matlab ships with a huge pile of pre-generated wisdom?

Best,
--Tim

Viral Shah

Nov 7, 2013, 12:54:42 PM
to julia...@googlegroups.com
Perhaps we should do a blog post on FFTs in julia that captures the discussion in this thread. Some of this stuff can also be included in the performance section of the manual.

-viral

Stefan Karpinski

Nov 7, 2013, 5:24:51 PM
to Julia Users
Hmm. I wonder who should write that post...

SYoon

Nov 8, 2013, 12:07:07 AM
to julia...@googlegroups.com

 I'm guessing Matlab ships with a huge pile of pre-generated wisdom?  
 

Great idea! What I noticed was that creating an FFT plan could take as long as several tens of seconds with Base.FFTW.PATIENT and Base.FFTW.NO_TIMELIMIT. Would it be possible to save the created plan to a file on the first run of the application, and then just load the pre-created plan from the file on subsequent runs? Can I achieve this with the serialize/deserialize functions?

SYoon

Nov 8, 2013, 12:33:28 AM
to julia...@googlegroups.com

On Thursday, November 7, 2013 12:54:42 PM UTC-5, Viral Shah wrote:
Perhaps we should do a blog post on FFTs in julia that captures the discussion in this thread. Some of this stuff can also be included in the performance section of the manual.

-viral


Great idea!! This kind of information should be easy for Julia beginners to find.

SYoon

Nov 8, 2013, 12:56:05 AM
to julia...@googlegroups.com

On Thursday, November 7, 2013 5:24:51 PM UTC-5, Stefan Karpinski wrote:
Hmm. I wonder who should write that post...



I am hesitant to nominate someone publicly since it could be considered rude, but I will take the risk, since having a blog post on this topic written by an authoritative person seems important. I wish Steven G. Johnson would write a blog post about the right way to use Julia's FFT capabilities. He is one of the two FFTW developers, and his name can also be found in this discussion thread. I am sorry, Professor Johnson, for being rude.

Simon Byrne

Nov 8, 2013, 5:56:16 AM
to julia...@googlegroups.com
While we're discussing this, I still think it's a bit strange that plan_fft returns an anonymous function; it just somehow doesn't seem very Julian. In fact, I think there's room to improve the whole fft interface, perhaps in a way that more closely matches the underlying FFTW interface, rather than mimicking the matlab interface.

Some ideas:
* Perhaps it would be useful if users were required to explicitly create a plan, as it would make clear that this operation does incur overhead in addition to the fft operation itself.

* Users would interact with plans as julia objects: this would also make it easy to expose the interface for users to save and load plans.

* There would be one function that would apply any plan to an array (similar to the non-exported execute function, but with checking). We could overload * for this, though the analogy with multiplication doesn't quite hold in the multi-dimensional case (though it is still a linear operator if we view the whole array as a vector space, rather than just the columns).

* The matlab-ish interface could be kept in a submodule, so that users who wanted it could import it easily. Perhaps we could also pre-generate some common plans for this as well (say, all powers of 2 up to 2^10?)

Thoughts?

Simon

Tim Holy

Nov 8, 2013, 6:50:54 AM
to julia...@googlegroups.com
There are the (non-exported) functions FFTW.export_wisdom and
FFTW.import_wisdom. You should be able to use those to your advantage. You
could insert lines into your .juliarc.jl file to have it automatically loaded
on startup.
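In current FFTW.jl those functions take a filename, so the startup-file approach described above can be sketched as follows (the file path here is an assumption, not a convention):

```julia
using FFTW

# Hypothetical location for the accumulated wisdom.
wisdom_file = joinpath(homedir(), ".julia_fftw_wisdom")

# At startup: reuse wisdom from earlier sessions, if any.
isfile(wisdom_file) && FFTW.import_wisdom(wisdom_file)

# Expensive planning now benefits from (and adds to) the wisdom cache.
p = plan_rfft(zeros(Float32, 256); flags = FFTW.MEASURE)

# Before exiting: save whatever wisdom has accumulated.
FFTW.export_wisdom(wisdom_file)
```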

More long-term, I wonder if the right approach is to have this happen
automatically. The main caveat is that I have no idea how well wisdom
generalizes from one machine to the next, so others will have to comment here.
If it does generalize reasonably well, then perhaps we should ship Julia with
a wisdom file. I shudder to think of how long it might take to generate that
file; I could easily imagine a machine cranking for a month just to cover
dimensions 1-3.

But it looks like FFTW is already set up to make this easy, see
http://www.fftw.org/fftw-wisdom.1.html.

Best,
--Tim

Steven G. Johnson

Nov 8, 2013, 12:57:50 PM
to julia...@googlegroups.com
On Friday, November 8, 2013 5:56:16 AM UTC-5, Simon Byrne wrote:
While we're discussing this, I still think it's a bit strange that fft_plan returns an anonymous function, it just somehow doesn't seem very julian. In fact, I think there's room to improve the whole fft interface, perhaps in a way that more closely matches the underlying FFTW interface, rather than mimicking the matlab interface.

The thinking was that the only useful thing to do with a plan, under normal circumstances, is to execute it.  But maybe that was the wrong decision.

The FFTW.Plan interface was never intended to be public.  I didn't realize that there would be a performance problem with the anonymous-function interface (https://github.com/JuliaLang/julia/issues/1864), and I asked Jeff a while back whether we should change the interface to work around this, but he assured me that there was no deep technical reason for the issue and that we should just wait for it to be fixed.   For the same reason, I'm reluctant to publicize the low-level stuff too much (e.g. in a blog post) if it is just a hack to work around a temporary performance glitch. 

You can't save individual plans in FFTW; you can only save the accumulated cache of all planning information ("wisdom").

The high-level matlab-ish "fft(X)" interface should definitely not be in a submodule.  That interface should be the default choice for most people doing FFTs.... you should only screw around with plans (and real-data FFTs for that matter) if it turns out to be performance-critical.  Premature optimization is the root of much evil...

Shipping some pregenerated wisdom is not too crazy, e.g. for powers of two and 10 up to some reasonable maximum.  It's unlikely to be worse than the FFTW_ESTIMATE heuristic plan.  But we might have to make it easier to override if you explicitly plan with FFTW_PATIENT.

Steven G. Johnson

Nov 8, 2013, 5:01:33 PM
to julia...@googlegroups.com


On Friday, November 8, 2013 12:57:50 PM UTC-5, Steven G. Johnson wrote:
The thinking was that the only useful thing to do with a plan, under normal circumstances, is to execute it.  But maybe that was the wrong decision.

In hindsight, having it be some type (not an anonymous function) would at the very least allow pretty-printing.  Not to mention having different methods to allow it to be applied to not only new arrays, but also to pre-allocated output arrays (currently an annoying omission).  And perhaps niceties like overloading `*` to make it look more like a linear operator.

Something to think about for 0.3 ...
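This is in fact what later happened: in current Julia a plan is a proper type, * applies it like a linear operator, and LinearAlgebra.mul! writes into a preallocated output array. A quick sketch:

```julia
using FFTW, LinearAlgebra

a = rand(ComplexF64, 1024)
p = plan_fft(a)        # a dedicated plan type, with pretty-printing

b = p * a              # apply like a linear operator
out = similar(b)
mul!(out, p, a)        # fill a preallocated output array instead
```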

Stefan Karpinski

Nov 8, 2013, 5:05:39 PM
to julia...@googlegroups.com
That was who I was thinking of as well, but it may be unnecessary in light of the ensuing discussion.

Stefan Karpinski

Nov 8, 2013, 5:06:59 PM
to julia...@googlegroups.com
Is there some technical reason you can't save a single plan? Would it make sense to add the ability to save and load individual plans to FFTW?

Steven G. Johnson

Nov 8, 2013, 8:01:35 PM
to julia...@googlegroups.com


On Friday, November 8, 2013 5:06:59 PM UTC-5, Stefan Karpinski wrote:
Is there some technical reason you can't save a single plan? Would it make sense to add the ability to save and load individual plans to FFTW?

Technically, the information in the wisdom cache, which is a map from problem hashes to id's in a solver lookup table, is quite different from the datastructure for a plan, which is basically a tree of opaque (solver-dependent) data structures.  It's not really set up to be able to look up the former from the latter.

Of course, you can (sort-of) save a single plan by forgetting the wisdom (perhaps saving it first), creating the plan, saving the resulting wisdom, and then perhaps reloading the former wisdom, but this is not really ideal.
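The forget/save/reload dance described above can be sketched with current FFTW.jl names (FFTW.forget_wisdom, FFTW.export_wisdom, FFTW.import_wisdom):

```julia
using FFTW

backup = tempname()
FFTW.export_wisdom(backup)     # save the current wisdom cache
FFTW.forget_wisdom()           # start from a clean slate

p = plan_fft(zeros(ComplexF64, 256); flags = FFTW.MEASURE)

single = tempname()
FFTW.export_wisdom(single)     # wisdom now covers (essentially) just this plan

FFTW.import_wisdom(backup)     # restore the earlier wisdom
```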

However, as far as I recall, there's never really been much demand for the ability to save a single plan as opposed to saving the whole wisdom cache, so we haven't really looked into it much.

--SGJ

PS. One difficulty with shipping pre-generated wisdom is that if you generate wisdom on a machine with (e.g.) AVX instructions and then run it on a machine without these instructions, the wisdom won't really help you.  (It shouldn't hurt, but basically none of the wisdom will be usable.)   It may be possible to improve this so that a single wisdom file can contain wisdom for both AVX and non-AVX processors in a future version of FFTW.  (I suspect that Matlab may install different wisdom files depending on the architecture, but I'm not sure.)

Alessandro "Jake" Andrioni

Nov 8, 2013, 8:18:12 PM
to julia...@googlegroups.com
On 8 November 2013 23:01, Steven G. Johnson <steve...@gmail.com> wrote:
> PS. One difficulty with shipping pre-generated wisdom is that if you
> generate wisdom on a machine with (e.g.) AVX instructions and then run it on
> a machine without these instructions, the wisdom won't really help you. (It
> shouldn't hurt, but basically none of the wisdom will be usable.) It may
> be possible to improve this so that a single wisdom file can contain wisdom
> for both AVX and non-AVX processors in a future version of FFTW. (I suspect
> that Matlab may install different wisdom files depending on the
> architecture, but I'm not sure.)

How big are wisdom files, in general?

Steven G. Johnson

Nov 9, 2013, 8:09:43 AM
to julia...@googlegroups.com, jake...@gmail.com


On Friday, November 8, 2013 8:18:12 PM UTC-5, Alessandro Andrioni wrote:
How big are wisdom files, in general?

The wisdom file for all 1d/2d/3d real and complex forward and backward transforms up to 2^20 elements, for all powers of 2 and 10, is about 150kB.

Tim Holy

Nov 9, 2013, 8:17:22 AM
to julia...@googlegroups.com
I looked into this a bit, and now I doubt that Matlab uses a big pile of pre-
built wisdom. Here's a section from the help on fftw(), their interface for
working with wisdom:

"Note on large powers of 2 For FFT dimensions that are powers of 2, between
2^14 and 2^22, MATLAB software uses special preloaded information in its
internal database to optimize the FFT computation. No tuning is performed when
the dimension of the FFT is a power of 2, unless you clear the database using
the command fftw('wisdom', [])."

To me this suggests that they're using the wisdom mechanism, but that it's
pretty restricted in terms of sizes.

So I ran some tests. The files are in this gist:
https://gist.github.com/timholy/7384851
I tried Julia with 1, 2, and 4 FFTW threads, and just used the default with
Matlab.

The results are attached as a plot to this email. Every third task is a new
size, and the next two are repeated runs on the same size. (There's a couple
of odd "blips" in the julia timings at position 8 that are presumably due to
garbage-collection.) From the gist you can see what the tasks are.

The bottom line seems to be that (1) we're all in the same ballpark, although
there are factors-of-2 in many cases, (2) Matlab typically invests more time
than Julia in planning the initial FFT, and the later ones of the same size
run faster. It's interesting that in some cases we can't quite match Matlab
even with lots of planning, although in other cases we beat it even with
minimal planning. Of course, some of this could be due to differences in
allocation/garbage collection performance.

--Tim
fft_wisdom_plot.png

Viral Shah

Nov 9, 2013, 10:37:46 AM
to julia...@googlegroups.com
This was the original discussion on the FFT interface, just for those who may not be aware of it.


-viral

Juan Carlos Cuevas Bautista

Jun 14, 2015, 5:25:21 PM
to julia...@googlegroups.com, spy...@gmail.com
Hi Everybody,

I am new to Julia programming. I have been following this post to optimize my time-series code, in which I have big, heavy data files. I changed float32([1:38192]) to float64([1:38192]), but the script does not work. I know this is related to this method:

Base.FFTW.Plan(ri, ro, 1, Base.FFTW.PATIENT, Base.FFTW.NO_TIMELIMIT)

I was wondering whether this works only for Float32 arrays? If so, could anyone recommend another way to optimize the FFT routine?

Thank you so much; I like Julia a lot.



Simon Byrne

Jun 15, 2015, 8:24:07 AM
to julia...@googlegroups.com, spy...@gmail.com
Hi Juan,

You would also need to change Complex64 to Complex128.

-Simon

Juan Carlos Cuevas Bautista

Jun 15, 2015, 11:33:17 AM
to julia...@googlegroups.com, spy...@gmail.com
Hi Simon,

Thanks for your help; it worked correctly. However, the reduction in time is not very significant: about 1/6 of the total (~300 secs). I think my arrays are too big (a 6000000-element Array{Float64,1}) to use PATIENT or even MEASURE, given that MATLAB spends just 60 secs.