transformation question

46 views

jori...@gmail.com

Apr 5, 2019, 1:41:46 AM4/5/19
to reikna
Hello

I have a question. I'd like to make a very simple thing. I need to make FFT (say 2 points), but my signal has 6 points. Before making FFT I'd like to modify my input data so that I have 2 points instead of 6, just summing it. So instead of a0,a1,a2,a3,a4,a5 I want to have a0+a1+a2, a3+a4+a5 and use these two point as FFT input.

I wrote a transformation for FFT:

import numpy as np

from reikna.fft import FFT
from reikna.core import Annotation, Type, Transformation, Parameter
from reikna.cluda import dtypes, ocl_api, functions

api = ocl_api()
thr = api.Thread.create(device_filters={'include_platforms': 'NVIDIA'})

a = np.arange(3 * 2, dtype=np.complex64)
a.imag = a.real

tr = Transformation(
[
Parameter('out', Annotation(Type(np.complex64, shape=(2,)), 'o')),
Parameter('indata', Annotation(Type(np.complex64, shape=(2,)), 'i')),
],
"""
VSIZE_T idx1 = \${idxs[0]};

##\${indata.ctype} integr = \${indata.load_idx}(idx1) + \${indata.load_idx}(idx1 + 1) + \${indata.load_idx}(idx1 + 2);
\${indata.ctype} integr = \${indata.load_idx}(idx1);

\${out.store_idx}(idx1, integr);
##\${out.store_same}(integr);

""",
connectors=['indata']
)

fft = FFT(np.empty(2, dtype=a.dtype))
fft.parameter.output.connect(tr, tr.indata, out=tr.out)
fftc = fft.compile(thr)

data_dev = thr.to_device(a)
res_dev = thr.to_device(np.zeros(2, dtype=np.complex64))

fftc(res_dev, data_dev)
thr.synchronize()

myfft = res_dev.get()

pass

In the code I have two comments. First I replaced sum of 3 just by one (like decimation) and since I want to avoid FFT right now (for debugging), I replaced store_same by store_idx, which doesn't invoke FFT afaik.
But the problem is that I have an Exception: Template rendering failed
It doesn't like \${indata.load_idx}(idx1);
Can't understand why? How can I access current input point and a few after?
Thanks

Bogdan Opanchuk

Apr 5, 2019, 3:16:24 AM4/5/19
to reikna
There are several problems with your code.

First, you want to pre-process the FFT input, but you are attaching a transformation to its output (`fft.parameter.output.connect()`). You need to connect it to `input`.

Correspondingly, you want the connector of the transformation to be its `out` parameter, and it will get the `store_same()` call (in other words, the transformation code will be called once for each time the FFT code tries to read something from its input; that is when you will sum three values and feed them to FFT).

A minor thing, the shape of `indata` parameter should be (6,), not (2,). But since the shape of arguments is not checked on call to a computation, it does not produce any errors in this code (but if it's a more than 1D array, declared shape and strides will be important for memory access).

Finally, since `idxs[0]` now spans the FFT size (that is, 0 to 1), you need to modify the `indata.load_idx()` calls so that they load data from correct indices (unless you plan to do some kind of a rolling window). I added a hardcoded `stride=3`, which you can alternatively pass via `render_kwds`, a scalar parameter to the transformation, or take from the shape of `indata` (that is, make it a 2D (2,3) array instead of 1D (2*3,) array).

The modified code will then look like:

a = np.arange(3 * 2, dtype=np.complex64)
a.imag = a.real

tr = Transformation(
[
Parameter('out', Annotation(Type(np.complex64, shape=(2,)), 'o')),
Parameter('indata', Annotation(Type(np.complex64, shape=(6,)), 'i')),
],
"""
VSIZE_T idx1 = \${idxs[0]};

int stride = 3;

\${indata.ctype} integr =
\${indata.load_idx}(idx1 * stride) +
\${indata.load_idx}(idx1 * stride + 1) +
\${indata.load_idx}(idx1 * stride + 2);

\${out.store_same}(integr);
""",
connectors=['out']
)

fft = FFT(np.empty(2, dtype=a.dtype))
fft.parameter.input.connect(tr, tr.out, indata=tr.indata)

jori...@gmail.com

Apr 5, 2019, 11:03:53 PM4/5/19
to reikna
Thanks, Bogdan.

Now it works... I feel myself like a blind kitten since OpenCL, Mako and reikna are new for me.
How would you recommend to debug myself? Foe example this transformation, it always ends up (by definition) with FFT. So if I want to figure out whether my TRANSFORMATION is correct, I have no chance to see its output. Maybe only is I take a result and make IFFT :) But is there any proper way to debug different stages of transformation or even computation?

Bogdan Opanchuk

Apr 8, 2019, 12:45:19 AM4/8/19
to reikna
Admittedly Reikna is kind of hard to debug now. For me it's usually involves just a lot of staring at the code, but that requires experience. I'm working on a better API for computations/transformations that will be easier to decompose and debug by parts (and which will allow automatic correctness checks), but it goes slowly.

You can see what your transformation is doing by turning it into a PureParallel computation (http://reikna.publicfields.net/en/latest/api/computations.html#reikna.algorithms.PureParallel.from_trf), which is essentially the same as attaching it to a trivial computation that just copies its input to its output. Perhaps that will help a bit. Sometimes it also helps to see the rendered code that is fed to the compiler. For a compiled computation you can access it as

for kernel_call in compiled_comp._kernel_calls:
print(kernel_call._kernel._program.source)

which will iterate over all kernels in a computation (yep, all of this except Program.source is technically not documented). Then you can just take this code, compile it separately using PyCUDA, PyOpenCL or low-level Reikna and debug it that way.

jori...@gmail.com

Apr 20, 2019, 9:48:05 PM4/20/19
to reikna
Bogdan, I have one more question related to transformation.
I'd like to make FFT, like I said and I also have to make some transformation before. That's work. But during profiling I noticed that host to device data transferring takes essential amount of time. My data by nature is binary (0/1) or maybe 2 bits per sample, but anyway it can fit in one byte. Therefore I want to transfer not two floats per sample (4 bytes I and 4 bytes Q), but two chars (1 byte I and 1 byte Q).
My plan was to feed an array of int8 (I,Q) to my transformation, where I could convert it to floats (along with other operations) and then pass to FFT how it's done right now.
The problem is that FFT computation requires complex data type, which is said in the beginning:
if not dtypes.is_complex(arr_t.dtype):
raise ValueError("FFT computation requires array of a complex dtype")
There is no complex int8 in numpy unfortunately. SO I can't create FFT object with my array of int.
I wanted to cheat FFT and create a class of FFT with a fake array of complex64 dtype. But as far as I understand it will affect strides and everything else related to computation of address of each certain element. In other words if I write load_idx(1) it will read 4 bytes with 4 bytes offset from the beginning of the array. But not the second byte. Right?
What is the best way to cheat FFT?
Thanks

Bogdan Opanchuk

Apr 21, 2019, 2:01:33 AM4/21/19
to reikna
> I'd like to make FFT, like I said and I also have to make some transformation before. That's work. But during profiling I noticed that host to device data transferring takes essential amount of time.

I suspect that you forgot to synchronize the thread before measuring time. By default, Reikna runs kernels in asynchronous mode, which means that Computation.__call__() schedules a kernel to run and returns immediately. Consequently, when you request data transfer from GPU, it forcibly synchronizes, waiting for all the scheduled kernels to finish, creating an illusion that it is the part that takes the most time. Try adding `thr.synchronize()` after a computation call and see if it changes the results.

I'm not sure I understand your problem with the inputs though. Why can't you write a transformation that creates a complex value out of a single int8 array, or out of two int8 arrays if that's more convenient?

jori...@gmail.com

Apr 21, 2019, 8:51:31 PM4/21/19
to reikna
Bogdan,
I guess I might have confused you asking this question in this topic after I asked about FFT of 2 points. In reality I have much longer size of my problem and I transfer hundreds of megabytes of data, therefore it has essential transfer time. I did not forget about synchronization, measurements are correct and confirmed by NVidia profilers.
Regarding your suggestion... I can easily create a transformation that converts types. Or I can do that within my main transformation. But (and here I might have a wrong understanding of something) how reikna knows input data format? I see that if I use a transformation connected to the input of FFT computation I specify input data twice. First time when I describe my transformation. The second time when I create FFT object. Reikna needs information about what type of data it has as an input. It needs to know that to know what shift in bytes it needs to add to get, say, second element. Where does it get this information? From the data at the input of FFT object constructor? Or from transformation description?
Since I am going to change types of data going through my transformation from int8 to complex64 I predict a problem. If reikna takes this information from FFT constructor I guess when I will write \${in.load_idx}(2) it will read 8 bytes (since complex64) with 2*8 bytes offset. But I need to read just 2 bytes with 2*2 bytes offset.