Fwd: Question about vDSP DFT performance on iPhone (6)


Vesa Peltonen

Dec 16, 2015, 8:01:48 PM
to perfoptimi...@lists.apple.com
Hi all,

I noticed some slightly strange behaviour concerning vDSP DFT performance on iPhone (tested on iPhone 6 and 6s) when using doubles (vDSP_DFT_ExecuteD).

When a power-of-two DFT length is used, the performance is worse than with a DFT size that has a multiplier (2^n * {3, 5, 15}). I would expect the powers of two to be faster than the DFT sizes that are not powers of two. That is indeed the case for floats on iPhone, and also for both floats and doubles on Mac OS X.

See this screenshot, which shows the phenomenon: http://i.imgur.com/VmMlc6O.png . In this experiment I'm doing 1000 FFTs and IFFTs for varying FFT lengths. The DFT is initialized to the next supported DFT size. Notice the peaks on the double curve when the FFT size is rounded to a power of two, while the float curve has small valleys there, as it "should".

Can someone explain what might be happening here? I was thinking about cache misses slowing down the calculation, but that should happen equally with all DFT sizes in my test.

Thank you,
Vesa Peltonen
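
The rounding to "the next supported DFT size" described above can be sketched as follows. This is a minimal illustration, not the poster's test code: the function name `next_supported_size` is hypothetical, and the minimum exponent `min_exp=3` is an assumption about the setup routine's length restriction rather than something stated in the thread. Only the 2^n * {1, 3, 5, 15} family of sizes comes from the message itself.

```python
def next_supported_size(length, factors=(1, 3, 5, 15), min_exp=3):
    """Return the smallest size f * 2**n (f in factors, n >= min_exp) that is >= length.

    min_exp is an assumed lower bound on the exponent, not taken from the thread.
    """
    candidates = []
    for f in factors:
        size = f * 2 ** min_exp
        # Double until this factor's size reaches the requested length.
        while size < length:
            size *= 2
        candidates.append(size)
    return min(candidates)


# All supported sizes, powers of two included.
print(next_supported_size(2000))
# Same rounding with the pure powers of two excluded.
print(next_supported_size(2000, factors=(3, 5, 15)))
```

Under these assumptions, a request for length 2000 rounds to 2048 when powers of two are allowed and to 2560 when they are excluded, which are the two sizes compared later in the thread.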


Anand, Christopher

Dec 16, 2015, 8:14:44 PM
to Vesa Peltonen, perfoptimi...@lists.apple.com
Congratulations to the people who got such consistent performance out of this processor.

This is what you would expect. Random sizes have much worse complexity. I've looked at this before and it never made sense to do non-power-of-two sizes: any improvement elsewhere from using less data was lost by using a less efficient FFT size.

The steps are what you would expect from cache effects but I wouldn't have expected so many steps. Some steps may be the result of changing the factorization of n.

Christopher 
_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list      (PerfOptimi...@lists.apple.com)
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/perfoptimization-dev/anandc%40mcmaster.ca

This email sent to ana...@mcmaster.ca

Vesa Peltonen

Dec 16, 2015, 8:20:56 PM
to Anand, Christopher, perfoptimi...@lists.apple.com
Hi Christopher, others!

This is exactly my problem: I assumed that non-power-of-two lengths should be slower than power-of-two lengths. This is the case for floats, but for doubles the performance is worse with the powers of two! Check the link to the image; the peaks are at powers of two. And this happens (as I said) only on iPhone. On Mac, the powers of two are faster than the non-powers of two for both doubles and floats.

Cheers,
Vesa

Vesa Peltonen

Dec 16, 2015, 10:15:19 PM
to Anand, Christopher, perfoptimi...@lists.apple.com
One more test that I just did:
20000 FFT and IFFT calls:

vDSP_DFT_ExecuteD
fftsize: 2560 took 2945 ms
fftsize: 2048 took 5001 ms

vDSP_DFT_Execute
fftsize: 2560 took 835 ms
fftsize: 2048 took 460 ms

So for float precision the DFT works as expected, but for doubles there is something fishy going on. I might have a bug in my code, but the same test code is run in all cases. The same code "behaves" on OS X.

thanks,
Vesa
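
To make the anomaly in these numbers easier to see, the ratios can be worked out directly. The snippet below is only arithmetic on the four timings quoted in the message; the dictionary layout and the helper names `slowdown` and `double_over_float` are illustrative and not part of the original test.

```python
# Wall-clock times in milliseconds for 20000 FFT and IFFT calls,
# copied from the message above.
timings_ms = {
    "double": {2560: 2945, 2048: 5001},
    "float": {2560: 835, 2048: 460},
}


def slowdown(precision, size, reference_size):
    """How many times slower `size` is than `reference_size` at one precision."""
    return timings_ms[precision][size] / timings_ms[precision][reference_size]


def double_over_float(size):
    """How many times slower the double run is than the float run at one size."""
    return timings_ms["double"][size] / timings_ms["float"][size]


print(f"double: 2048 vs 2560 -> {slowdown('double', 2048, 2560):.2f}x")
print(f"float:  2560 vs 2048 -> {slowdown('float', 2560, 2048):.2f}x")
print(f"double vs float at 2560 -> {double_over_float(2560):.2f}x")
print(f"double vs float at 2048 -> {double_over_float(2048):.2f}x")
```

In float precision the larger size 2560 is about 1.8 times slower than 2048, as expected, while in double precision the smaller size 2048 is about 1.7 times slower than 2560. The double runs are also roughly 3.5 times (at 2560) and nearly 11 times (at 2048) slower than the float runs.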

Jonathan Taylor

Dec 17, 2015, 4:03:43 AM
to ve...@resapphealth.com.au, perfoptimi...@lists.apple.com
Hi,

I had a look at your graph, which has some interesting features. However, isn't there something a bit funny about your data? (Unless I'm missing something). Your blue 2^n*{3,5,15} curve does not show the anomalous peaks, but if you compare it to your "double all supported" curve, that should also take in the points on your blue curve. It does not - for example there is clearly a point just after the small "bump" on the blue curve around 2000, but the equivalent point on the red curve is much higher. They should share that datapoint, and have the same value.

I don't know what that means but as Christopher says, I wonder if on your red curve you are seeing cache-related anomalies due to the sequence in which you are generating the datapoints?

Cheers
Jonny



Vesa Peltonen

Dec 17, 2015, 4:11:42 AM
to Jonathan Taylor, perfoptimi...@lists.apple.com
Thanks Jonny,

I think I was not clear enough in the labels. The blue curve uses all supported DFT sizes but NOT powers of two (it uses the next bigger size instead). The red curve uses all supported sizes, including powers of two. This test was meant to show that powers of two in double precision consume more CPU than bigger non-power-of-two sizes, which sounds like a bug. The same test code uses less CPU for floats on powers of two.

There is obviously small variation in the results, but those peaks on the red curve are consistent and reproducible every time. It is using almost 50-60% more CPU.

Cheers,
Vesa






Paul Russell

Dec 17, 2015, 6:41:30 AM
to perfoptimi...@lists.apple.com
It might be cache thrashing (or a similar super-alignment problem) - try the double test with 1280 and 1024, so that the data sets are the same size (in bytes) as for the single precision test?

Paul
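
The sizes suggested above follow from simple byte arithmetic, sketched here. The helper name `array_bytes` is hypothetical, and the element sizes are whatever the local C `float` and `double` types report (4 and 8 bytes on common platforms); how many such arrays a given transform touches is not stated in the thread, so only the per-array size is computed.

```python
import struct

# Size in bytes of one C float / double element on this machine.
BYTES_PER_ELEMENT = {
    "float": struct.calcsize("f"),
    "double": struct.calcsize("d"),
}


def array_bytes(length, precision):
    """Bytes occupied by one array of `length` elements at the given precision."""
    return length * BYTES_PER_ELEMENT[precision]


# Pair each single-precision size from the earlier test with the
# double-precision size suggested in the message above.
for float_len, double_len in ((2048, 1024), (2560, 1280)):
    print(
        f"float[{float_len}] = {array_bytes(float_len, 'float')} bytes, "
        f"double[{double_len}] = {array_bytes(double_len, 'double')} bytes"
    )
```

Halving the length in double precision gives arrays with the same footprint as the float test, so any effect that depends purely on data size or alignment should show up at the same byte counts.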

Justin Voo

Dec 17, 2015, 3:15:45 PM
to Vesa Peltonen, perfoptimi...@lists.apple.com
Hi Vesa,

We haven’t been able to reproduce this in our own timing harnesses. We see DFT_ExecuteD performing closer to what you see for the float case, i.e. the performance actually improves substantially at powers of two relative to the nearby powers of 3/5. Could you please file a bug report and include your entire timing program and how you build it, basically as much info as possible, so we can try to reproduce what you are seeing?

Thanks,
- Justin
Vesa Peltonen

Dec 17, 2015, 8:12:40 PM
to Justin Voo, perfoptimi...@lists.apple.com
Thanks very much, Justin and others, for the replies.

That prompted me to prepare an example project, and of course I found the issue while doing that. Our project was targeted only for armv7 and not arm64. After changing the architecture, everything works fine. So it was not a bug in the code, but "a bug" in the project.
Next time it should ring bells for me when doubles take 4x more time than floats...

Thanks again,
Vesa 