The aim is to do correlation/convolutions(flip) of two 2D arrays using ios Accelerate framework for gaining speed.
My first attempt was with vImageConvolve_PlanarF/vdsp_imgfir which was good for lower sized arrays. But as array size increased, performance dropped drastically as it was an O(n2) implementation as mentioned by Accelerate developers here(1).
I moved to FFT implementation(2) for reducing complexity to O(nlog2n). Using vDSP_fft2d_zip, gained speed to an extent. But using vDSP_fft2d_zip on 2D arrays at non powers of 2, we need to pad zeros. For e.g. on a 2D array of size 640 * 480, we need to pad zeros to make it 1024 * 512. Other FFT implementations like FFTW or OpenCV's DFT allow sizes which could be expressed as size = 2p * 3p * 5r. That allows, FFTW/OpenCV to do FFT of 640 * 480 2D array at the same size.
So for 2D arrays at size 640*480, in an Accelerate vs FFTW/OpenCV comparison, it is effectively between 1024*512 and 640*480 dimensions. So what ever performance gains I get from Accelerate's FFT on 2D arrays is nullified by it's limited API which prevent performing FFT at reasonable dimensions like size = 2p * 3p * 5r
2 queries.
vDSP_DFT_Execute performs only 1D FFT.Of lesser importance: I am doing correlation in tiles as one of the 2D arrays is far bigger then another. Say like 1920*1024 vs 150*100.
Thanks for reading,
-Kiran.
So for 2D arrays at size 640*480, in an Accelerate vs FFTW/OpenCV comparison, it is effectively between 1024*512 and 640*480 dimensions. So what ever performance gains I get from Accelerate's FFT on 2D arrays is nullified by it's limited API which prevent performing FFT at reasonable dimensions like size = 2p * 3p * 5r
- Am I missing any Accelerate functionality to perform this easily ? For e.g. any Accelerate function which could perform 2D array FFT at size = 2p * 3p * 5r. I assume
vDSP_DFT_Executeperforms only 1D FFT.
- Better approaches to 2D FFT or correlation. Like in this answer(3), which asks to split arrays like 480 = 256 + 128 + 64 + 32 with repeated 1D FFT's over rows and then over columns. But this will need too many functions calls and I assume, will not help with performance.
Right now it's " The implemented cases are where the length is f * 2**n,
where f is 3, 5, or 15 and n is at least 4."
:D
Thanks,
--
Olivier Tristan
Research & Development
www.uvi.net
_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list (PerfOptimi...@lists.apple.com)
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/perfoptimization-dev/perfoptimization-dev-garchive-8409%40googlegroups.com
This email sent to perfoptimization-...@googlegroups.com
“IPP supports arbitrary size DFTs. Please add this useful feature to Accelerate so I can port my app to iOS."
I know that it takes a couple minutes from your busy day to do this, and I’m sorry for the inconvenience, but it really does make it much more likely that your pet feature will be implemented.“Even if these are somewhat slower than the currently supported sizes, they’d still be useful because I wouldn’t need to pad my data out myself, and this would also make it easier for novices to adopt the APIs. I’m especially interested in performance for lengths of the form 25*2^n because they come up when handling $(some imaging task or whatever), but being able to simply pass in any size and have it deliver correct results would be really useful for prototyping new algorithms.”
Thanks,
Kiran.
> - Steve
>
>> On Apr 23, 2015, at 12:02 PM, Olivier Tristan <o.tr...@uvi.net
>> <mailto:o.tr...@uvi.net>> wrote:
>>
>> Following this, I wouldn't mind a 1D DFT Forward and Inverse
>> (vDSP_DFT_zop_CreateSetup)
>> that works for any size like Intel Performance Primitive does. (see
>> ippsDFTInitAlloc_C_32f, ippsDFTFwd_CToC_32f and ippsDFTInv_CToC_32f)
>>
>> Right now it's " The implemented cases are where the length is f * 2**n,
>> where f is 3, 5, or 15 and n is at least 4."
>>
>> :D
>>
>> Thanks,
>>
>> --
>> Olivier Tristan
>> Research & Development
>> www.uvi.net <http://www.uvi.net/>
In my experience, the most common scenario is that we get up at WWDC at some point in the future to announce new features, and yours is on the list. I can’t guarantee that we implement everything anyone asks for, but we do give serious consideration to all requests, and if it’s a good fit for Accelerate, we usually add it.
So next time you’re using Accelerate and none of the APIs are quite what you want, or you find yourself thinking “wouldn’t it be cool if …”, take a minute to drop those thoughts into https://developer.apple.com/bug-reporting/ and we’ll see what we can do.
– Steve
Many thanks !
--
Olivier Tristan
Research & Development
www.uvi.net
_______________________________________________
>So next time you’re using Accelerate and none of the APIs are quite what
>you want, or you find yourself thinking “wouldn’t it be cool if …”, take
>a minute to drop those thoughts into https://developer.apple.com/bug-
>reporting/ and we’ll see what we can do.
Thought about this for a bit... before filing Radars, since I could easily be missing something...
1) I often find myself wanting both the min and max and thus calling vDSP_minv and vDSP_maxv back to back. Is there a function that can find both together? I imagine that could be faster...
2) Is there a way, given a large buffer of uint8, to count how many elements are not equal to zero? Doing this is the only place I currently resort to emmintrin.h
3) Is there a way to byte swap a large buffer? (for uint64, uint32, uint16 cases)
Thanks,
--
____________________________________________________________
Sean McBride, B. Eng se...@rogue-research.com
Rogue Research www.rogue-research.com
Mac Software Developer Montréal, Québec, Canada
On May 1, 2015, at 2:21 PM, Sean McBride <se...@rogue-research.com> wrote:
3) Is there a way to byte swap a large buffer? (for uint64, uint32, uint16 cases)
1) I often find myself wanting both the min and max and thus calling vDSP_minv and vDSP_maxv back to back. Is there a function that can find both together? I imagine that could be faster...
2) Is there a way, given a large buffer of uint8, to count how many elements are not equal to zero? Doing this is the only place I currently resort to emmintrin.h
Ian,
Thanks for your reply. Seems odd to have something generic like byte swapping in vImage. In my case, my data is not pixels. What would I set height, width, and rowBytes to? Could I do:
width = buffer size
rowBytes = buffer size
height = 1
Cheers,
On May 1, 2015, at 2:35 PM, Sean McBride <se...@rogue-research.com> wrote:On Fri, 1 May 2015 14:26:06 -0700, Ian Ollmann said:On May 1, 2015, at 2:21 PM, Sean McBride <se...@rogue-research.com> wrote:uint16 cases)
3) Is there a way to byte swap a large buffer? (for uint64, uint32,
16: vImageByteSwap_Planar16U
32: vImagePermuteChannels_ARGB8888
64: vImagePermuteChannels_ARGB16U + vImageByteSwap_Planar16U
Make sure you tile the 64-bit case. Otherwise, performance may be not so
nice.
It seems like this is something that might also be handled by a memcpy()
variant, kind of like how we have memset_pattern variants.
Ian,
Thanks for your reply. Seems odd to have something generic like byte swapping in vImage. In my case, my data is not pixels. What would I set height, width, and rowBytes to? Could I do:
width = buffer size
rowBytes = buffer size
height = 1
rowBytes = # of bytes in bufferwidth = rowBytes / 2 (or 4 or 8)
2) You can use vDSP_vclipc to find the number of negatives and the number of positives using zeroes as thresholds.
vDSP_vcmprs could be used also but it will probably be slower than vDSP_vclipc.
Unfortunately both of these functions do additional work and require output array to store the results.
Again, you can file a request radar.
Hristo
> https://lists.apple.com/mailman/options/perfoptimization-dev/hristo%40apple.com
>
> This email sent to hri...@apple.com
A histogram would achieve more or less 2, though this is probably an expensive way to go about it.
Ian
>1) I often find myself wanting both the min and max and thus calling
>vDSP_minv and vDSP_maxv back to back. Is there a function that can find
>both together? I imagine that could be faster...
See: <rdar://20804746>
>2) Is there a way, given a large buffer of uint8, to count how many
>elements are not equal to zero? Doing this is the only place I
>currently resort to emmintrin.h
See: <rdar://20804940>
>3) Is there a way to byte swap a large buffer? (for uint64, uint32,
>uint16 cases)
See: <rdar://20804846>
Cheers,
Thanks! Bug reports like these make it easier to move the ball forward.
Ian