2D array FFT - ios Accelerate performance gains nullified by it's limited API

Kiran Pradeep

unread,

Apr 22, 2015, 3:21:21 PM4/22/15

to perfoptimi...@lists.apple.com

Hi,

The aim is to do correlation/convolutions(flip) of two 2D arrays using ios Accelerate framework for gaining speed.

My first attempt was with vImageConvolve_PlanarF/vdsp_imgfir which was good for lower sized arrays. But as array size increased, performance dropped drastically as it was an O(n²) implementation as mentioned by Accelerate developers here(1).
I moved to FFT implementation(2) for reducing complexity to O(nlog₂n). Using vDSP_fft2d_zip, gained speed to an extent. But using vDSP_fft2d_zip on 2D arrays at non powers of 2, we need to pad zeros. For e.g. on a 2D array of size 640 * 480, we need to pad zeros to make it 1024 * 512. Other FFT implementations like FFTW or OpenCV's DFT allow sizes which could be expressed as size = 2^p * 3^p * 5^r. That allows, FFTW/OpenCV to do FFT of 640 * 480 2D array at the same size.

So for 2D arrays at size 640*480, in an Accelerate vs FFTW/OpenCV comparison, it is effectively between 1024*512 and 640*480 dimensions. So what ever performance gains I get from Accelerate's FFT on 2D arrays is nullified by it's limited API which prevent performing FFT at reasonable dimensions like size = 2^p * 3^p * 5^r

2 queries.

Am I missing any Accelerate functionality to perform this easily ? For e.g. any Accelerate function which could perform 2D array FFT at size = 2^p * 3^p * 5^r. I assume vDSP_DFT_Execute performs only 1D FFT.
Better approaches to 2D FFT or correlation. Like in this answer(3), which asks to split arrays like 480 = 256 + 128 + 64 + 32 with repeated 1D FFT's over rows and then over columns. But this will need too many functions calls and I assume, will not help with performance.

_{Of lesser importance: I am doing correlation in tiles as one of the 2D arrays is far bigger then another. Say like 1920*1024 vs 150*100.}

_{Thanks for reading,}

-Kiran.

Eric Postpischil

unread,

Apr 23, 2015, 11:53:18 AM4/23/15

to Kiran Pradeep, perfoptimi...@lists.apple.com

On Apr 22, 2015, at 15:19, Kiran Pradeep <kiran...@gmail.com> wrote:

So for 2D arrays at size 640*480, in an Accelerate vs FFTW/OpenCV comparison, it is effectively between 1024*512 and 640*480 dimensions. So what ever performance gains I get from Accelerate's FFT on 2D arrays is nullified by it's limited API which prevent performing FFT at reasonable dimensions like size = 2^p * 3^p * 5^r

I would not say the gains from the 2D FFTs are nullified by the lack of support for non-power-of-two sizes. Although there are algorithms for non-power-of-two sizes, they require different calculations and data motions that do not benefit from some opportunities in the power-of-two algorithms. Thus, we might find that the power-of-two sizes perform reasonably well compared to what non-power-of-two two-dimensional functions might provide.

Am I missing any Accelerate functionality to perform this easily ? For e.g. any Accelerate function which could perform 2D array FFT at size = 2^p * 3^p * 5^r. I assume vDSP_DFT_Execute performs only 1D FFT.

No, except you might consider using the DFT (vDSP_DFT_zop_CreateSetup, vDSP_DFT_Execute, and vDSP_DFT_DestroySetup) repeatedly to transform rows (or vDSP_DFT_zrop_CreateSetup if you are doing real-to-complex, but then you will have to deal with some issues of mixed real and complex data packed in the results), then transposing the data with vDSP_mtrans, then using the DFT to transform the former columns (now rows). Then you could multiply (by the transform of the filter, produced in the same way) and then reverse the transformation to produce the convolved data. However, I would not predict this would perform better than vDSP_fft2d_zop or vDSP_fft2d_zrop with padding to power-of-two sizes.

In the future, it would be reasonable for us to add a two-dimensional DFT API that supported some of these lengths, and might potentially perform faster than the sequence described above. The current DFT API is new, and we could use customer feedback about directions to go. We have not received a lot of demand for large transforms, let alone large transforms of non-power-of-two sizes. Please use the bug reporting system (https://developer.apple.com/bug-reporting/) to request new features. (It is very much for requesting new features, not just reporting bugs.)

Better approaches to 2D FFT or correlation. Like in this answer(3), which asks to split arrays like 480 = 256 + 128 + 64 + 32 with repeated 1D FFT's over rows and then over columns. But this will need too many functions calls and I assume, will not help with performance.

I do not see a sensible algorithm described by that answer. It might be misguided.

If your filter has symmetries, it might be possible to construct custom code that takes advantage of those.

—edp

Olivier Tristan

unread,

Apr 23, 2015, 12:03:30 PM4/23/15

to perfoptimi...@lists.apple.com

Following this, I wouldn't mind a 1D DFT Forward and Inverse
(vDSP_DFT_zop_CreateSetup)
that works for any size like Intel Performance Primitive does. (see
ippsDFTInitAlloc_C_32f, ippsDFTFwd_CToC_32f and ippsDFTInv_CToC_32f)

Right now it's " The implemented cases are where the length is f * 2**n,
where f is 3, 5, or 15 and n is at least 4."

:D

Thanks,

--
Olivier Tristan
Research & Development
www.uvi.net

_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list (PerfOptimi...@lists.apple.com)
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/perfoptimization-dev/perfoptimization-dev-garchive-8409%40googlegroups.com

This email sent to perfoptimization-...@googlegroups.com

Stephen Canon

unread,

Apr 23, 2015, 1:08:44 PM4/23/15

to Olivier Tristan, perfoptimi...@lists.apple.com

Like Eric said, the best thing you can do is report a bug to request the feature. I know it’s a bit of a pain, but there are lots things we can work on, and (for better or worse) bug reports are the major way that we prioritize that work.

Eric and I see your emails here, and might well implement the feature in our free time, but we very rarely have free time =). If you file a bug report, that makes it much easier for us to go to managers and say “our clients need this feature” and carve out real time in our schedules to work on it.

The bug report needn’t be elaborate, and it needn’t be long:

“IPP supports arbitrary size DFTs. Please add this useful feature to Accelerate so I can port my app to iOS."

If you’re feeling ambitious, add a paragraph about how you would use it, what sizes are most important to you, and if you just want it to work passably or it needs to be as fast as possible to be useful:

“Even if these are somewhat slower than the currently supported sizes, they’d still be useful because I wouldn’t need to pad my data out myself, and this would also make it easier for novices to adopt the APIs. I’m especially interested in performance for lengths of the form 25*2^n because they come up when handling $(some imaging task or whatever), but being able to simply pass in any size and have it deliver correct results would be really useful for prototyping new algorithms.”

I know that it takes a couple minutes from your busy day to do this, and I’m sorry for the inconvenience, but it really does make it much more likely that your pet feature will be implemented.

– Steve

Kiran Pradeep

unread,

Apr 23, 2015, 2:35:32 PM4/23/15

to Stephen Canon, perfoptimi...@lists.apple.com

Hi Steve, I just created two feature/bug id's 20670360 and 20669820.
Kindly let me know if the details filled needs change to be considered
as real feature/bug requests.

Thanks,
Kiran.

> - Steve

>
>> On Apr 23, 2015, at 12:02 PM, Olivier Tristan <o.tr...@uvi.net

>> <mailto:o.tr...@uvi.net>> wrote:
>>
>> Following this, I wouldn't mind a 1D DFT Forward and Inverse
>> (vDSP_DFT_zop_CreateSetup)
>> that works for any size like Intel Performance Primitive does. (see
>> ippsDFTInitAlloc_C_32f, ippsDFTFwd_CToC_32f and ippsDFTInv_CToC_32f)
>>
>> Right now it's " The implemented cases are where the length is f * 2**n,
>> where f is 3, 5, or 15 and n is at least 4."
>>
>> :D
>>
>> Thanks,
>>
>> --
>> Olivier Tristan
>> Research & Development

>> www.uvi.net <http://www.uvi.net/>

Stephen Canon

unread,

Apr 23, 2015, 2:36:40 PM4/23/15

to Kiran Pradeep, perfoptimi...@lists.apple.com

These are both excellent feature requests. Thanks very much for filing them.

Stephen Canon

unread,

Apr 23, 2015, 2:46:45 PM4/23/15

to PerfOpt-Dev

More generally I want to make a note that we (Accelerate team) absolutely love to get feature requests. Many recent API additions have been in response to feature requests from outside developers. The worst thing that can happen is that we choose not to implement your feature for some reason or another. When this happens we try to respond with an explanation of why it’s out of scope for Accelerate, how you can already accomplish the same task using existing APIs, or why there’s a better way to do it.

In my experience, the most common scenario is that we get up at WWDC at some point in the future to announce new features, and yours is on the list. I can’t guarantee that we implement everything anyone asks for, but we do give serious consideration to all requests, and if it’s a good fit for Accelerate, we usually add it.

So next time you’re using Accelerate and none of the APIs are quite what you want, or you find yourself thinking “wouldn’t it be cool if …”, take a minute to drop those thoughts into https://developer.apple.com/bug-reporting/ and we’ll see what we can do.

– Steve

Olivier Tristan

unread,

Apr 24, 2015, 4:19:19 AM4/24/15

to perfoptimi...@lists.apple.com

Le 23/04/2015 20:46, Stephen Canon a écrit :
> More generally I want to make a note that we (Accelerate team) absolutely love to get feature requests. Many recent API additions have been in response to feature requests from outside developers. The worst thing that can happen is that we choose not to implement your feature for some reason or another. When this happens we try to respond with an explanation of why it’s out of scope for Accelerate, how you can already accomplish the same task using existing APIs, or why there’s a better way to do it.
>
> In my experience, the most common scenario is that we get up at WWDC at some point in the future to announce new features, and yours is on the list. I can’t guarantee that we implement everything anyone asks for, but we do give serious consideration to all requests, and if it’s a good fit for Accelerate, we usually add it.
>
> So next time you’re using Accelerate and none of the APIs are quite what you want, or you find yourself thinking “wouldn’t it be cool if …”, take a minute to drop those thoughts into https://developer.apple.com/bug-reporting/ and we’ll see what we can do.
>

Here it is: 20682193

Many thanks !

--
Olivier Tristan
Research & Development
www.uvi.net

_______________________________________________

Sean McBride

unread,

May 1, 2015, 5:21:28 PM5/1/15

to Stephen Canon, PerfOpt-Dev

On Thu, 23 Apr 2015 14:46:27 -0400, Stephen Canon said:

>So next time you’re using Accelerate and none of the APIs are quite what
>you want, or you find yourself thinking “wouldn’t it be cool if …”, take
>a minute to drop those thoughts into https://developer.apple.com/bug-
>reporting/ and we’ll see what we can do.

Thought about this for a bit... before filing Radars, since I could easily be missing something...

1) I often find myself wanting both the min and max and thus calling vDSP_minv and vDSP_maxv back to back. Is there a function that can find both together? I imagine that could be faster...

2) Is there a way, given a large buffer of uint8, to count how many elements are not equal to zero? Doing this is the only place I currently resort to emmintrin.h

3) Is there a way to byte swap a large buffer? (for uint64, uint32, uint16 cases)

Thanks,

--
____________________________________________________________
Sean McBride, B. Eng se...@rogue-research.com
Rogue Research www.rogue-research.com
Mac Software Developer Montréal, Québec, Canada

Ian Ollmann

unread,

May 1, 2015, 5:26:24 PM5/1/15

to Sean McBride, Stephen Canon, PerfOpt-Dev

On May 1, 2015, at 2:21 PM, Sean McBride <se...@rogue-research.com> wrote:

3) Is there a way to byte swap a large buffer? (for uint64, uint32, uint16 cases)

16: vImageByteSwap_Planar16U

32: vImagePermuteChannels_ARGB8888

64: vImagePermuteChannels_ARGB16U + vImageByteSwap_Planar16U

Make sure you tile the 64-bit case. Otherwise, performance may be not so nice.

It seems like this is something that might also be handled by a memcpy() variant, kind of like how we have memset_pattern variants.

Ian

Demitri Muna

unread,

May 1, 2015, 5:28:31 PM5/1/15

to Sean McBride, PerfOpt-Dev

On May 1, 2015, at 5:21 PM, Sean McBride <se...@rogue-research.com> wrote:

1) I often find myself wanting both the min and max and thus calling vDSP_minv and vDSP_maxv back to back. Is there a function that can find both together? I imagine that could be faster...

2) Is there a way, given a large buffer of uint8, to count how many elements are not equal to zero? Doing this is the only place I currently resort to emmintrin.h

If the answer is "file a Radar", please post the ID so I can +1 it. I have the same needs.

_________________________________________
Demitri Muna

Department of Astronomy
Ohio State University

http://trillianverse.org
http://scicoder.org

Sean McBride

unread,

May 1, 2015, 5:35:48 PM5/1/15

to Ian Ollmann, Stephen Canon, PerfOpt-Dev

Ian,

Thanks for your reply. Seems odd to have something generic like byte swapping in vImage. In my case, my data is not pixels. What would I set height, width, and rowBytes to? Could I do:
width = buffer size
rowBytes = buffer size
height = 1

Cheers,

Ian Ollmann

unread,

May 1, 2015, 5:50:45 PM5/1/15

to Sean McBride, PerfOpt-Dev

On May 1, 2015, at 2:35 PM, Sean McBride <se...@rogue-research.com> wrote:

On Fri, 1 May 2015 14:26:06 -0700, Ian Ollmann said:

On May 1, 2015, at 2:21 PM, Sean McBride <se...@rogue-research.com> wrote:

3) Is there a way to byte swap a large buffer? (for uint64, uint32,
uint16 cases)

16: vImageByteSwap_Planar16U
32: vImagePermuteChannels_ARGB8888
64: vImagePermuteChannels_ARGB16U + vImageByteSwap_Planar16U

Make sure you tile the 64-bit case. Otherwise, performance may be not so
nice.

It seems like this is something that might also be handled by a memcpy()
variant, kind of like how we have memset_pattern variants.

Ian,

Thanks for your reply. Seems odd to have something generic like byte swapping in vImage. In my case, my data is not pixels. What would I set height, width, and rowBytes to? Could I do:
width = buffer size
rowBytes = buffer size
height = 1

Width is in units of pixels, not bytes. The size of the pixels are given by the function name and will be 2,4 or 8 bytes.

rowBytes = # of bytes in buffer
width = rowBytes / 2 (or 4 or 8)

Most of the time, vImage only multithreads along scanlines, so a single scanline will probably leave you stuck with one thread.

That said, multithreading is generally only helpful when the function does non-trivial amounts of work. This function is entirely trivial.

Once you observe the existence of things like kCGBitmapByteOrder16Big, the reason for such functions should be evident.

Ian

Hristo Hristov

unread,

May 1, 2015, 6:10:53 PM5/1/15

to Sean McBride, PerfOpt-Dev

1) There is no single function that combines vDSP_minv and vDSP_maxv in vDSP. File a request radar.

2) You can use vDSP_vclipc to find the number of negatives and the number of positives using zeroes as thresholds.
vDSP_vcmprs could be used also but it will probably be slower than vDSP_vclipc.
Unfortunately both of these functions do additional work and require output array to store the results.
Again, you can file a request radar.

Hristo

> https://lists.apple.com/mailman/options/perfoptimization-dev/hristo%40apple.com
>
> This email sent to hri...@apple.com

Ian Ollmann

unread,

May 1, 2015, 6:13:10 PM5/1/15

to Sean McBride, PerfOpt-Dev

> On May 1, 2015, at 3:10 PM, Hristo Hristov <hri...@apple.com> wrote:
>
> 1) There is no single function that combines vDSP_minv and vDSP_maxv in vDSP. File a request radar.
>
> 2) You can use vDSP_vclipc to find the number of negatives and the number of positives using zeroes as thresholds.
> vDSP_vcmprs could be used also but it will probably be slower than vDSP_vclipc.
> Unfortunately both of these functions do additional work and require output array to store the results.
> Again, you can file a request radar.

A histogram would achieve more or less 2, though this is probably an expensive way to go about it.

Ian

Sean McBride

unread,

May 4, 2015, 4:12:18 PM5/4/15

to PerfOpt-Dev

On Fri, 1 May 2015 17:21:08 -0400, Sean McBride said:

>1) I often find myself wanting both the min and max and thus calling
>vDSP_minv and vDSP_maxv back to back. Is there a function that can find
>both together? I imagine that could be faster...

See: <rdar://20804746>

>2) Is there a way, given a large buffer of uint8, to count how many
>elements are not equal to zero? Doing this is the only place I
>currently resort to emmintrin.h

See: <rdar://20804940>

>3) Is there a way to byte swap a large buffer? (for uint64, uint32,
>uint16 cases)

See: <rdar://20804846>

Cheers,

Ian Ollmann

unread,

May 4, 2015, 4:27:56 PM5/4/15

to Sean McBride, PerfOpt-Dev

> On May 4, 2015, at 1:11 PM, Sean McBride <se...@rogue-research.com> wrote:
>
> On Fri, 1 May 2015 17:21:08 -0400, Sean McBride said:
>
>> 1) I often find myself wanting both the min and max and thus calling
>> vDSP_minv and vDSP_maxv back to back. Is there a function that can find
>> both together? I imagine that could be faster...
>
> See: <rdar://20804746>
>
>> 2) Is there a way, given a large buffer of uint8, to count how many
>> elements are not equal to zero? Doing this is the only place I
>> currently resort to emmintrin.h
>
> See: <rdar://20804940>
>
>> 3) Is there a way to byte swap a large buffer? (for uint64, uint32,
>> uint16 cases)
>
> See: <rdar://20804846>