Best way to perform multiple FFT (same shape and dtype)

florian...@ensta-bretagne.org

unread,

Apr 20, 2017, 6:05:22 AM4/20/17

to reikna

Hi!

I will have to compute several FFTs, always with the same shape and dtype.

Thus, I am wondering which function calls should be done once and for all to save some computation time.

I am also wondering what is the best way to chain operation in Reikna (e.g. fftshift(fft(X))).

Here is my code:

from reikna.cluda import dtypes, cuda_api
import reikna.fft as cudaFFT
from reikna.core import Annotation, Type, Transformation, Parameter
import time
import scipy.misc as misc
import numpy as np
import matplotlib.pyplot as plt

plt.close('all')

# Pick the CUDA GPGPU API (OpenCL not available on this computer)
# and make a Thread on it.
api = cuda_api()
thr = api.Thread.create()

# host input
input_im = misc.imresize(misc.lena(), size=(2048, 2048)).astype(np.complex64)

t = time.time()

# device input
input_dev = thr.to_device(input_im)

# prepare output array
output_complex = thr.array(input_im.shape, dtype=np.complex64)

fft = cudaFFT.FFT(input_dev)
fft_compiled = fft.compile(thr)

fft_compiled(output_complex, input_dev, inverse=0)

result = output_complex.get()
print('%f' % (time.time() - t) )

t = time.time()
reference = np.fft.fft2(input_im)
print('%f' % (time.time() - t) )

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(np.log(np.absolute(result)))
ax2.imshow(np.log(np.absolute(reference)))

# release the thread to be able to launch another computation
thr.release()

assert np.linalg.norm(result - reference) / np.linalg.norm(reference) < 1e-6

Thanks!

Florian

Bogdan Opanchuk

unread,

Apr 21, 2017, 10:43:38 PM4/21/17

to reikna

Hi Florian,

> Thus, I am wondering which function calls should be done once and for all to save some computation time.

If your arguments are always of the same size and dtype, everything besides `fft_compiled(output_complex, input_dev, inverse=0)` only needs to be called once (plus, whatever code you use to actually create the `input_dev` array, of course). So the correct way to compare the performance would be

t = time.time()
input_dev = thr.to_device(input_im)


fft_compiled(output_complex, input_dev, inverse=0)
result = output_complex.get()
print('%f' % (time.time() - t) )

if you're comparing a single call to GPU FFT with a single call to a CPU FFT, where the input is initially in the CPU memory, and the output should be placed there as well. Alternatively, if (as it often happens) you perform many operations on the GPU and the GPU-CPU transfer is relatively rare, you may want to compare just the performance of just the FFT algorithms:

t = time.time()
thr.synchronize() # to finalize whatever operations were happening before
fft_compiled(output_complex, input_dev, inverse=0)
thr.synchronize() # wait for `fft_compiled` to finish


print('%f' % (time.time() - t) )

> I am also wondering what is the best way to chain operation in Reikna (e.g. fftshift(fft(X))).

You can create a subclass of Computation that bundles several computations and transformations. For an example you can see, e.g. demo_specgram.py in the `examples` folder in the repo. Some explanation can be found in the docs here: http://reikna.publicfields.net/en/latest/tutorial-advanced.html#writing-a-computation

This bundling gives several advantages:

- slight decrease of the computation call overhead (it's not very significant; mainly it's just binding of the positional arguments/keywords to specific parameter identifiers)

- the temporary buffers you use to pass the result of one computation to the other will be added to the Thread's temporary buffer pool and reused by other kernel calls. That is, in your example you will need to do something like

temp = thr.array(...)
fft(temp, input)
fftshift(output, temp)

This way the `temp` buffer will not be used automatically by other computations, and will just sit there wasting memory until you delete it. Conversely, if you allocate it inside a computation, it will know when the buffer is used, and when it can be safely overwritten. If your memory buffers are large, it can give a significant reduction of the memory footprint.

- If you want to attach some transformations to your computations, these will also be hidden inside and the user will not need to worry about them. The specgram.py example illustrates that as well.

If these considerations are not important to you (and currently bundling several computations, regrettably, requires one to write some boilerplate code), you can just call the computations sequentially, as shown in the code snippet above.

Also, regarding your code, note that the `input_im` array has the shape `(2048, 2048, 3)`, and reikna's FFT transforms all axes by default, while `fft2` only transforms the first two, so the comparison is hardly fair. Although I suspect it is a mistake and you did not actually plan to FFT the RGB dimension :)

Bogdan Opanchuk

unread,

Apr 21, 2017, 10:50:25 PM4/21/17

to reikna

P.S. Possibly scratch that last note, you used `misc.lena()` as a source, to which my `scipy` complained that it's deprecated and I should use `misc.face()` instead, which does have the RGB dimension. I do not remember whether `lena()` was grayscale or not, and if it was, your code is correct.

florian...@ensta-bretagne.org

unread,

Apr 25, 2017, 11:27:51 AM4/25/17

to reikna

Thanks for your answer Bogdan, it was very clear!

Indeed, misc.lena() returns a gray level image.

Reply all

Reply to author

Forward