I am performing FFTs using Reikna on images acquired from a camera, and since I am new to GPU programming I wanted some guidance on my approach.
I am running my code on an NVIDIA Jetson TX2 running Ubuntu Linux.
I have installed CUDA 9.0 and also PyCUDA.
Here is my code:
# This code is executed once during initialization.
# self.imageWidth = 1920, self.imageHeight = 1080
self.api = cluda.cuda_api()
self.dev = self.api.get_platforms()[0].get_devices()[0]
self.thr = self.api.Thread(self.dev)
imgData = np.zeros(self.imageWidth * self.imageHeight, dtype=np.complex64)
self.imgfft = fft.FFT(imgData, axes=(0,))
self.imgfftc = self.imgfft.compile(self.thr, fast_math=True)
# The following is executed for each incoming image
imgData_dev = self.thr.to_device(imgData)
self.imgfftc(imgData_dev, imgData_dev)
self.thr.synchronize()
imgData = imgData_dev.get()
# This code is executed once during cleanup.
self.thr.release()
A couple of observations:
1. Compiling the FFT takes a very long time - almost 5 seconds, even with fast_math=True.
2. It takes a long time to get the data after the FFT has been computed (imgData_dev.get()).
3. My application crashes, presumably because the PyCUDA context is not being released.
Some questions:
1. How can I significantly increase the speed of the compile?
2. How can I reduce the processing time for the FFT?
3. How do I fix the crash caused by the PyCUDA context not being released?
Any help would be appreciated.
Thank you.
I now have it working.
I added a Reikna fftshift computation and then some extra processing to allow me to view the 2-D FFT on the screen.
But I want to move some of the auxiliary computations to the GPU.
Here are the auxiliary computations that I perform after the fft and fftshift:
imgTransformed = np.log2(np.abs(imgFFTShifted))
imgNormalized = (imgTransformed - np.min(imgTransformed)) / (np.max(imgTransformed) - np.min(imgTransformed)) * 255.
There are several operations here:
1. np.abs
2. np.log2
3. np.min
4. np.max
Is it possible to use Reikna to perform my auxiliary computations on the GPU instead of using the CPU based numpy?
However, when I attach it, even though the signature says that the output type is real32, the actual output array I obtain is complex64.
Here is my code:
self.imgfftshift = fft.FFTShift(testimg)
fft_norm = transformations.norm_const(self.imgfftshift.parameter.output, 1)
self.imgfftshift.parameter.output.connect(fft_norm, fft_norm.input, fft_shifted=fft_norm.output)
self.imgfftshiftc = self.imgfftshift.compile(self.thr, fast_math=True)
# in some other function
imgDataShiftHandle = self.thr.to_device(imgFFT)
self.imgfftshiftc(imgDataShiftHandle, imgDataShiftHandle)
imgFFTShifted = imgDataShiftHandle.get()
What am I doing wrong?
Sorry.
I didn't see one in the docs.
If not, is there some other way I could use existing Reikna functionality to implement the log2 computation on the GPU?
But I really need your help regarding the following error:
PyCUDA ERROR: The context stack was not empty upon module cleanup.
I am only using one Thread object.
Here is the basic code sequence:
api = cluda.cuda_api()
dev = api.get_platforms()[0].get_devices()[0]
thr = api.Thread(dev)
So, I am passing the device handle to the Thread constructor.
I am not able to root-cause the crash.
Any help would be very much appreciated.
The program is not just those three lines.
I am trying to restructure my code to get rid of the crash.
I was able to get rid of the crash.
But now I have a new problem: whenever I try to call any Thread object function, I get a PyCUDA "invalid device context" error.
I am creating a Thread object by passing in the device object returned by Platform.get_devices().
I then save the Thread object handle for later use.
I have a separate Python thread which then attempts to call a Thread function such as empty_like(), and then my app crashes with an invalid device context error.
Here is a minimal example:
import numpy as np

from reikna import cluda
from reikna.core import Type, Annotation, Parameter
from reikna.algorithms import PureParallel
from threading import Thread

def processBuffer(computation, thr):
    in_array = np.zeros((2, 16), dtype=np.float32)
    print(in_array)
    in_dev = thr.to_device(in_array)
    computation(in_dev, in_dev, 2.0, 3.0)
    out_array = in_dev.get()
    print(out_array)

arr_t = Type(np.float32, shape=(2, 16))

api = cluda.cuda_api()
dev = api.get_platforms()[0].get_devices()[0]
thr = api.Thread(dev)

comp = PureParallel(
    [Parameter('out', Annotation(arr_t, 'o')),
     Parameter('in1', Annotation(arr_t, 'i')),
     Parameter('param', Annotation(np.float32)),
     Parameter('param2', Annotation(np.float32))],
    """
    VSIZE_T idx = ${idxs[0]};
    VSIZE_T idx2 = ${idxs[1]};
    ${out.store_idx}(
        idx, idx2, (${in1.load_idx}(idx, idx2) + ${param} + 5.0) / ${param2});
    """)

compc = comp.compile(thr)

t = Thread(target=processBuffer, args=(compc, thr))
t.start()
t.join()
I used that knowledge to fix my problem.
There are certain aspects of CUDA that are quite stupid (statefulness and per-thread limitations).
Sheesh.