Inplace FFT timing issues

knightd94

Nov 2, 2015, 3:50:29 PM

to ArrayFire Users

I have an application that needs to run many FFTs under tight time constraints. With a single transform size, everything ran quickly. Once I started changing sizes between iterations, the first several iterations were still fast, but then the time per transform ramped up: sometimes by a factor of 10, sometimes doubling or tripling, depending on the sizes and on the number of operations. I moved to a separate array for each size, but the issue remained.
After some tinkering I noticed that this only happens with the in-place transforms (fftInPlace, ifftInPlace). It also happens with the out-of-place transforms when the result is stored back into the input array (i.e. data = af::fft(data)). Is this a bug, or am I missing something that is causing this?

Here is some simple code to demonstrate the issue.

Output of af::info

ArrayFire v3.1.3 (CUDA, 64-bit Linux, build 35c89f5)
Platform: CUDA Toolkit 7.5, Driver: 352.55
[0] GeForce GTX 680, 2043 MB, CUDA Compute 3.0

Source code:

#include <arrayfire.h>

int main()
{
    af::info();
    const size_t numIter = 1000;  // const, so the arrays below have a compile-time size
    af::timer start, start2;
    float stop[numIter], stop2[numIter];
    af::array complexArray, complexArray2;

    for (size_t i = 0; i < numIter; i++)
    {
        // Time an in-place FFT of the first size
        complexArray = af::randu(256, 256, 16, 1, c32);
        start = af::timer::start();
        af::fftInPlace(complexArray);
        stop[i] = af::timer::stop(start);

        // Time an in-place FFT of the second size
        complexArray2 = af::randu(128, 64, 8, 1, c32);
        start2 = af::timer::start();
        af::fftInPlace(complexArray2);
        stop2[i] = af::timer::stop(start2);
    }

    // Copy the host-side timings into ArrayFire arrays for printing
    af::array timing1 = af::array(numIter, 1, stop);
    af::array timing2 = af::array(numIter, 1, stop2);
    af::print("", timing1, 6);
    af::print("", timing2, 6);
}

If you don't want to filter through the times yourself, you could replace the last two lines with min and median to see the discrepancy like this:

af::print("First fft times min value:", af::min(timing1), 6);
af::print("First fft times median value:", af::median(timing1), 6);
af::print("Second fft times min value:", af::min(timing2), 6);
af::print("Second fft times median value:", af::median(timing2), 6);

The output of that for me is as follows:
First fft times min value:
[1 1 1 1]
    0.000024
First fft times median value:
[1 1 1 1]
    0.000026
Second fft times min value:
[1 1 1 1]
    0.000025
Second fft times median value:
[1 1 1 1]
    0.000121

In this case, the first set of times stayed fairly consistent, but the second set ran at about 24 microseconds for a while and then jumped up to about 123 microseconds.
This is not specific to one GPU (I also see it on a Tesla) or one OS (I have tried both Debian and Red Hat).

Shehzan Mohammed

Nov 2, 2015, 3:56:52 PM

to ArrayFire Users

Please see this post from before: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/arrayfire-users/kC2ZcinM30c

There are a lot of factors that govern how much time is taken including memory operations, transfers, FFT kernel caching etc.

To remove these variables it is best to use timeit. Benchmarking with timer will only give you times for your run with all the overhead included.

Alternatively, you can run the FFT multiple times and take the average time. For example:

start = af::timer::start();

for(int j = 0; j < iters; j++) af::fftInPlace(complexArray);
stop[i] = af::timer::stop(start) / iters; // Now stop[i] holds the average time per FFT.

-Shehzan

knightd94

Nov 2, 2015, 4:45:49 PM

to ArrayFire Users

I did look at the other post before posting this one, and I thought I had explained fairly clearly that those issues are not the same as mine. Note that I am running the timer 1,000 times. I can clearly see the first iteration setting up the memory, plans, etc., before things speed up on the second and third iterations.

timeit will hide the slowdown I am talking about, since it only gives the average over a set number of iterations. If I want 1,000 iterations and every iteration after some threshold is a factor of ten slower than the rest, that is an issue. I am not looking for the average time; I am asking why fftInPlace slows down after x iterations, and whether this is a bug or a problem in my code. The min/median example is there to show quickly that the min time and the median time, which should be very close together since memory has been allocated and the plans cached after the first run, differ hugely.
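To make the point about averages concrete, here is a minimal sketch of how the recorded per-iteration times could be summarized on the host so that a step change stands out. The names `TimingSummary` and `summarize`, the `warmup` count, and the `factor` threshold are all assumptions made for illustration; none of them come from ArrayFire.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Summary of a series of per-iteration timings, in seconds.
struct TimingSummary {
    double minTime;                // fastest sample after warm-up
    double medianTime;             // median sample after warm-up
    std::ptrdiff_t firstSlowIter;  // index of the first slow sample, -1 if none
};

// Hypothetical helper: skips the first `warmup` iterations (allocation and
// plan creation), then reports min, median, and the first iteration whose
// time exceeds `factor` times the minimum.
TimingSummary summarize(const std::vector<double>& times,
                        std::size_t warmup = 1, double factor = 3.0)
{
    if (times.size() <= warmup) return {0.0, 0.0, -1};

    // Sort a copy of the post-warm-up samples to get min and median.
    std::vector<double> sorted(times.begin() + static_cast<std::ptrdiff_t>(warmup),
                               times.end());
    std::sort(sorted.begin(), sorted.end());

    const std::size_t n = sorted.size();
    const double minTime = sorted.front();
    const double medianTime = (n % 2 == 1)
        ? sorted[n / 2]
        : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);

    // Scan in original order to find where the slowdown starts.
    std::ptrdiff_t firstSlow = -1;
    for (std::size_t i = warmup; i < times.size(); ++i) {
        if (times[i] > factor * minTime) {
            firstSlow = static_cast<std::ptrdiff_t>(i);
            break;
        }
    }
    return {minTime, medianTime, firstSlow};
}
```

A large gap between `minTime` and `medianTime`, together with a non-negative `firstSlowIter`, is the signature described above; a single averaged figure would blur it.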

knightd94

Nov 3, 2015, 10:49:49 AM

to ArrayFire Users

Here is a plot of each iteration's timing. You can see the jump I am talking about at around iteration 250. My concern is that, after hundreds of iterations, the timing jumps like this for in-place transforms of more than one size.
If you add more transforms, the jump is more significant and/or happens earlier.

Out-of-place FFTs, or in-place FFTs of only one size, show very little variation in their times (maybe a couple of microseconds). I can attach plots of those as well if needed.

Pavan Yalamanchili

Nov 3, 2015, 11:49:54 AM

to ArrayFire Users

Hi,

You are not performing af::sync() before timer::stop(). All ArrayFire operations are asynchronous and will only be synchronized when the output is read back to host or if the user asks for synchronization.

Can you please repeat the experiment after putting af::sync() before each timer::stop()?
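With the code from the first post, that amounts to calling af::sync() on the line just before each af::timer::stop(...). The same pattern can be sketched generically as below. `timeSynced` is a hypothetical helper, not an ArrayFire function, and it uses std::chrono rather than af::timer so that the sketch does not depend on the library.

```cpp
#include <chrono>

// Hypothetical helper (not part of ArrayFire): runs `op`, then `sync`, and
// returns the elapsed wall-clock time in seconds. Calling `sync` before the
// clock is read means the measurement covers the work itself, not just the
// time needed to enqueue it.
template <typename Op, typename Sync>
double timeSynced(Op op, Sync sync)
{
    const auto t0 = std::chrono::steady_clock::now();
    op();    // enqueue the work; may return before the device has finished
    sync();  // block until the queued work is complete
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Intended use with ArrayFire (assumes <arrayfire.h> and an af::array `a`):
//   double t = timeSynced([&] { af::fftInPlace(a); }, [] { af::sync(); });
```

Without the synchronization step, the timer can stop while earlier work is still queued, so a later iteration may end up paying for it. That would be consistent with timings that look fine for a while and then jump.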

knightd94

Nov 3, 2015, 3:44:38 PM

to ArrayFire Users

Thanks; this does make the timing consistent. I'll test it out some more and post back if there are any more issues. Thanks again!
