I am performing FFTs using Reikna on images acquired from a camera, and since I am new to GPU programming I wanted some guidance on my approach.
I am running my code on an NVIDIA Jetson TX2 running Ubuntu Linux.
I have installed CUDA 9.0 and also PyCUDA.
Here is my code:
# This code is executed once during initialization.
# self.imageWidth = 1920, self.imageHeight = 1080
self.api = cluda.cuda_api()
self.dev = self.api.get_platforms()[0].get_devices()[0]
self.thr = self.api.Thread(self.dev)
imgData = np.zeros(self.imageWidth * self.imageHeight, dtype=np.complex64)
self.imgfft = fft.FFT(imgData, axes=(0,))
self.imgfftc = self.imgfft.compile(self.thr, fast_math=True)
# The following is executed for each incoming image
imgData_dev = self.thr.to_device(imgData)
self.imgfftc(imgData_dev, imgData_dev)
self.thr.synchronize()
imgData = imgData_dev.get()
# This code is executed once during cleanup.
self.thr.release()
A couple of observations:
1. Compiling the FFT takes a very long time - almost 5 seconds, even with fast_math=True.
2. It takes a long time to get the data after the FFT has been computed (imgData_dev.get()).
3. My application crashes, presumably because the PyCUDA context is not being released.
Some questions:
1. How can I significantly increase the speed of the compile?
2. How can I reduce the processing time for the FFT?
3. How do I fix the crash caused by the PyCUDA context not being released?
Any help would be appreciated.
Thank you.
I now have it working.
I added a Reikna fftshift computation and then some extra processing to allow me to view the 2-D FFT on the screen.
But I want to move some of the auxiliary computations to the GPU.
Here are the auxiliary computations that I perform after the fft and fftshift:
imgTransformed = np.log2(np.abs(imgFFTShifted))
imgNormalized = (imgTransformed - np.min(imgTransformed)) / (np.max(imgTransformed) - np.min(imgTransformed)) * 255.
There are several operations here:
1. np.abs
2. np.log2
3. np.min
4. np.max
Is it possible to use Reikna to perform my auxiliary computations on the GPU instead of using the CPU based numpy?
However, when I attach it, even though the signature says that the output type is real32, the actual output array I obtain is complex64.
Here is my code:
self.imgfftshift = fft.FFTShift(testimg)
fft_norm = transformations.norm_const(self.imgfftshift.parameter.output, 1)
self.imgfftshift.parameter.output.connect(fft_norm, fft_norm.input, fft_shifted=fft_norm.output)
self.imgfftshiftc = self.imgfftshift.compile(self.thr, fast_math=True)
# in some other function
imgDataShiftHandle = self.thr.to_device(imgFFT)
self.imgfftshiftc(imgDataShiftHandle, imgDataShiftHandle)
imgFFTShifted = imgDataShiftHandle.get()
What am I doing wrong?
Sorry.
I didn't see one in the docs.
If not, is there some other way I could use existing Reikna functionality to implement the log2 computation on the GPU?
But I really need your help regarding the following error:
PyCUDA ERROR: The context stack was not empty upon module cleanup.
I am only using one Thread object.
Here is the basic code sequence:
api = cluda.cuda_api()
dev = api.get_platforms()[0].get_devices()[0]
thr = api.Thread(dev)
So, I am passing the device handle to the Thread constructor.
I am not able to root-cause the crash.
Any help would be very much appreciated.
The program is not just those three lines.
I am trying to restructure my code to get rid of the crash.
I was able to get rid of the crash.
But now I have a new problem: whenever I try to call any Thread object function, I get a PyCUDA "invalid device context" error.
I am creating a Thread object by passing in the device object returned by Platform.get_devices().
I then save the Thread object handle for later use.
I have a separate Python thread which then attempts to call a Thread function such as empty_like(), and then my app crashes with an invalid device context error.
Here is a minimal example:
import numpy as np

from reikna import cluda
from reikna.core import Type, Annotation, Parameter
from reikna.algorithms import PureParallel
from threading import Thread

def processBuffer(computation, thr):
    in_array = np.zeros((2, 16), dtype=np.float32)
    print(in_array)
    in_dev = thr.to_device(in_array)
    computation(in_dev, in_dev, 2.0, 3.0)
    out_array = in_dev.get()
    print(out_array)

arr_t = Type(np.float32, shape=(2, 16))

api = cluda.cuda_api()
dev = api.get_platforms()[0].get_devices()[0]
thr = api.Thread(dev)

comp = PureParallel(
    [Parameter('out', Annotation(arr_t, 'o')),
     Parameter('in1', Annotation(arr_t, 'i')),
     Parameter('param', Annotation(np.float32)),
     Parameter('param2', Annotation(np.float32))],
    """
    VSIZE_T idx = ${idxs[0]};
    VSIZE_T idx2 = ${idxs[1]};
    ${out.store_idx}(
        idx, idx2, (${in1.load_idx}(idx, idx2) + ${param} + 5.0) / ${param2});
    """)

compc = comp.compile(thr)

t = Thread(target=processBuffer, args=(compc, thr))
t.start()
t.join()
I used that knowledge to fix my problem.
There are certain aspects of CUDA that are quite stupid (statefulness and per-thread limitations).
Sheesh.