Call consecutive kernels correctly

Jorge Aguilar

Mar 9, 2021, 1:31:02 PM
to reikna
Hello,

I'm quite new to Reikna, and maybe the question is a bit silly, but I haven't been able to find the answer on my own, so I'm asking for help. Basically, I have a dataset to which I want to apply several operations in sequence (some of them are Reikna operations, like FFT, and others are kernels written by me). In doing so, I am forced to keep all the intermediate results in memory, so I run out of space if the input dataset is too long. The "writing a computation" section of the documentation contains what looks like an example of something similar to what I'm after, but I can't get it working and I don't think I fully understand it. So I'm writing to ask whether this problem can really be solved without compromising performance, and if so, what the correct way to do it is.

To be a little more clear, my actual approach is something like this:

program = thr.compile(open("kernel_1.cl").read())
kernel1 = program.kernel_1
program = thr.compile(open("kernel_2.cl").read())
kernel2 = program.kernel_2
...
program = thr.compile(open("kernel_N.cl").read())
kernelN = program.kernel_N

output_1 = thr.empty_like(numpy_array_1)
output_2 = thr.empty_like(numpy_array_2)
...
output_N = thr.empty_like(numpy_array_N)

for i in range(NumberOfDatasets):
    input_array = load(data_i)
    input = thr.to_device(input_array)

    kernel1(output_1, input)
    kernel2(output_2, output_1)
    ...
    kernelN(output_N, output_N_minus_1)

    result[i] = output_N.get()

Thank you very much in advance

Bogdan Opanchuk

Mar 9, 2021, 7:12:27 PM
to reikna
You are correct, managing temporary arrays is one of the main advantages of bundling kernel/computation calls into a bigger computation (although it is possible to do that manually). Admittedly, it is not always straightforward. What problems are you experiencing with the example? Since it's not a doctest, it may have fallen out of date with the current API (although at a glance it looks fine). Have you tried looking at the full examples, e.g. https://github.com/fjarri/reikna/blob/develop/examples/demo_specgram.py ?

In your case it will look something like

class MyComputation(Computation):

    def __init__(self, input_shape, input_dtype):
        Computation.__init__(
            self,
            [
                Parameter('output', Annotation(Type(input_dtype, input_shape), 'o')),
                Parameter('input', Annotation(Type(input_dtype, input_shape), 'i')),
            ])

    def _build_plan(self, plan_factory, device_params, output, input):

        plan = plan_factory()

        output_1 = plan.temp_array_like(input)
        plan.kernel_call(kernel1_source, [output_1, input], global_size=..., local_size=...)
        output_2 = plan.temp_array_like(input)
        plan.kernel_call(kernel2_source, [output_2, output_1], global_size=..., local_size=...)
        ...
        plan.kernel_call(kernelN_source, [output, output_N_minus_1], global_size=..., local_size=...)

        return plan


comp = MyComputation(input_shape, input_dtype).compile(thr)
output = thr.array(input_shape, input_dtype)  # allocate the output array once and reuse it
for i in range(NumberOfDatasets):
    input_array = load(data_i)
    input = thr.to_device(input_array)
    comp(output, input)
    result[i] = output.get()

Note that:
- You need to supply the kernel source during _build_plan(), since at that moment the Thread is not available yet.
- You need to be able to derive your computation's signature (input/output parameters, their shapes and dtypes) at construction time. Pass whatever arguments you need to the constructor; you can also create class attributes to use in _build_plan() later.

Refer to the API reference for the full list of parameters for the various calls used here.

The default temporary memory manager (http://reikna.publicfields.net/en/latest/api/cluda.html#reikna.cluda.tempalloc.ZeroOffsetManager) will attempt to pack everything created with temp_array() or temp_array_like() such that independent temporary arrays (the ones that don't need to hold data at the same time, as judged by their usage in kernel calls) can point to the same physical memory.

Of course, you don't have to do that. If temporary memory is the only problem, you can manage it manually with just the ZeroOffsetManager linked above. It won't be much easier, but it may be more flexible if your dataflow cannot fit into Computation's limitations. You will need to create your temporary arrays with http://reikna.publicfields.net/en/latest/api/cluda.html#reikna.cluda.tempalloc.TemporaryManager.array and specify the dependencies manually. Then you can use them as you would use regular arrays.
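
For illustration, a minimal sketch of the manual approach (the shapes and the dependency structure here are made up; `dependencies` lists the arrays that must hold live data at the same time as the new one):

import numpy as np
import reikna.cluda as cluda
from reikna.cluda.tempalloc import ZeroOffsetManager

api = cluda.ocl_api()
thr = api.Thread.create()

temp_manager = ZeroOffsetManager(thr, pack_on_alloc=True)

# `b` is computed from `a`, so both must hold data at the same time:
# declare the dependency explicitly.
a = temp_manager.array((4096, 32, 32), np.complex64)
b = temp_manager.array((4096, 32, 32), np.complex64, dependencies=[a])

# `c` only depends on `b`, so the manager is free to back `c`
# with the same physical memory as `a`.
c = temp_manager.array((4096, 32, 32), np.complex64, dependencies=[b])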

Hope that helps.

Jorge Aguilar

Mar 19, 2021, 6:58:02 AM
to reikna
Hello, thank you very much for your help; it has been very useful for understanding how this works. I can only dedicate weekends to this project, so it takes me some time to experiment with the code, sorry for the delay. Now that I understand a little better how it works, I have some new questions:
  1. Is there any performance advantage in using computations over the manual solution with ZeroOffsetManager?
  2. Does the kernel source mean a relative path? For example: plan.kernel_call('./GPU_Kernels/Slice.cl', ...), or do I need to add anything else?
  3. How can I send normal numpy variables to the computation, to set for example the global and local sizes?
  4. Due to my workflow I will probably need to write several computations. Would a persistent array (which is neither input nor output) inside the computation keep its memory space and name even when the computation ends, or is the memory space freed once the computation is finished? If that is the behaviour of arrays inside the computation, it may be just what I need.
  5. This is related to the previous question. Do temporary arrays mean that inside the computation you can overwrite their memory with another array, or are they meant to free memory outside the computation?
Once again, thank you for your help and the amazing library you have developed.

Bogdan Opanchuk

Mar 20, 2021, 2:02:22 AM
to reikna
> 1. Is there any performance advantage in using computations over the manual solution with ZeroOffsetManager?

There is slightly less overhead, because all the kernel calls from nested computations are flattened. But mainly it's a convenience advantage when it comes to nesting computations, attaching transformations, and so on.
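
For illustration, attaching a transformation looks something like this (a sketch in the style of the demo_specgram example linked earlier; the parameter names input_prime and coeff are illustrative):

import numpy
from reikna.fft import FFT
from reikna.transformations import mul_param

# `arr` is assumed to be a complex64 device array (or its Type) defined elsewhere.
fft = FFT(arr)

# Multiply the input by a scalar parameter; the multiplication is fused
# into the FFT's first kernel instead of running as a separate pass.
scale = mul_param(arr, numpy.float32)
fft.parameter.input.connect(scale, scale.output, input_prime=scale.input, coeff=scale.param)

# The computation now exposes `input_prime` and `coeff` in place of `input`;
# the resulting signature can be inspected via fft.signature.
fftc = fft.compile(thr)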

> 2. Does the kernel source mean a relative path? For example: plan.kernel_call('./GPU_Kernels/Slice.cl', ...), or do I need to add anything else?

No, it has to be a specially written template definition (so that the Computation internals know how to construct the C signature of the kernel, and so on). See the example in http://reikna.publicfields.net/en/latest/tutorial-advanced.html#writing-a-computation . It can be created from a string (as in the example) or from a file (as in Reikna's own algorithms), but it has to be a template.
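
For reference, a template definition might look something like this (a sketch adapted from the pattern in the tutorial linked above; the kernel name and body are purely illustrative):

from reikna.helpers import template_from

# A trivial copy kernel, written as a Mako template definition.
kernel1_template = template_from("""
<%def name='kernel1(kernel_declaration, k_output, k_input)'>
${kernel_declaration}
{
    VIRTUAL_SKIP_THREADS;
    const VSIZE_T idx = virtual_global_id(0);
    ${k_output.store_idx}(idx, ${k_input.load_idx}(idx));
}
</%def>
""")

# Inside _build_plan(): pass the template def, not a file path.
plan.kernel_call(
    kernel1_template.get_def('kernel1'),
    [output_1, input],
    global_size=output_1.shape)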

> 3. How can I send normal numpy variables to the computation, to set for example the global and local sizes?

All kernels in a computation have to have static global and local sizes (the local sizes can be None). I have some ideas about "freeing" chosen dimensions, but that is currently not supported.
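
That said, anything that is fixed per compiled computation (including the sizes) can be passed to the constructor, as noted earlier. A sketch, reusing the hypothetical kernel1_template from the previous snippet:

from reikna.core import Computation, Parameter, Annotation, Type

class MyComputation(Computation):

    def __init__(self, input_shape, input_dtype, local_size=None):
        # Plain Python/numpy values become attributes, fixed at construction time.
        self._local_size = local_size
        Computation.__init__(
            self,
            [
                Parameter('output', Annotation(Type(input_dtype, input_shape), 'o')),
                Parameter('input', Annotation(Type(input_dtype, input_shape), 'i')),
            ])

    def _build_plan(self, plan_factory, device_params, output, input):
        plan = plan_factory()
        plan.kernel_call(
            kernel1_template.get_def('kernel1'),
            [output, input],
            global_size=output.shape,
            local_size=self._local_size)
        return plan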

> 4. Due to my workflow I will probably need to write several computations. Would a persistent array (which is neither input nor output) inside the computation keep its memory space and name even when the computation ends, or is the memory space freed once the computation is finished? If that is the behaviour of arrays inside the computation, it may be just what I need.

When you call `plan.persistent_array()`, it remembers that `numpy` array, and when the computation is compiled, the array is copied to the device. As long as the compiled computation object is alive, the array stays on the device.
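
A minimal sketch of what that might look like inside _build_plan() (the window coefficients are a made-up example):

import numpy

def _build_plan(self, plan_factory, device_params, output, input):
    plan = plan_factory()

    # Uploaded once when the computation is compiled; stays on the device
    # for the lifetime of the compiled computation object.
    window = plan.persistent_array(numpy.hamming(1024).astype(numpy.float32))

    # `window` can now be passed to plan.kernel_call() like any other array.
    ...
    return plan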

> 5. This is related to the previous question. Do temporary arrays mean that inside the computation you can overwrite their memory with another array, or are they meant to free memory outside the computation?

You can do whatever you want with them, but when the computation ends (that is, when the call to it finishes), they may be overwritten by someone else. The physical allocation they're based on may also be freed by the temporary memory manager, if it decides to repack things.

Jorge Aguilar

Mar 21, 2021, 5:06:24 AM
to reikna
Thanks Bogdan, very clear explanation. I will try to put these ideas into practice.

Jorge Aguilar

Jul 15, 2021, 5:31:36 AM
to reikna

Hello again, I've worked on trying to implement this when I've had time, but I'm finding it impossible to get it to work. Are there any examples of using ZeroOffsetManager manually, so I could figure out how to do this? So far I haven't found anything similar in the examples or other discussions, and it seems impossible to me to fit my workflow into a computation.

Bogdan Opanchuk

Jul 18, 2021, 1:49:37 PM
to reikna
I added an example, https://github.com/fjarri/reikna/blob/develop/examples/demo_tempalloc.py . Hopefully it will be helpful. Please tell me if you have any follow-up questions.

Jorge Aguilar

Jul 19, 2021, 8:22:57 AM
to reikna
Thank you very much, awesome, it is just what I needed. I almost understand how it works now, but I'm running into a bit of a problem:

Is it possible to apply cluda functions to an array generated in this way? I cannot execute conj() on a temp_manager array. Here is a silly example:

conj = cluda.functions.conj(np.complex64)
a1 = temp_manager.array([200,200], np.complex64, dependencies=[conj])
a1.conj()

This simple example won't execute and returns an error. I don't know if I'm doing something wrong, or if there is maybe some deep incompatibility in what I'm trying to do.

Bogdan Opanchuk

Jul 20, 2021, 1:18:08 AM
to reikna
Hm, I'm not sure what your code above is supposed to do. `cluda.functions.conj` returns a Module object. `dependencies` keyword takes either an `Array` or something with a `__tempalloc__` attribute, which `Module` does not have (since it does not allocate anything). As for the last line, `Array` objects do not have a `conj()` method.
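
For reference, a dependency declaration along these lines would be accepted (a hypothetical second array; per the constraint above, dependencies must be arrays, not Modules):

a1 = temp_manager.array([200, 200], np.complex64)
a2 = temp_manager.array([200, 200], np.complex64, dependencies=[a1])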

Jorge Aguilar

Jul 20, 2021, 5:46:47 AM
to reikna
Thank you very much for all your help, Bogdan. You are right, the first line does nothing; it still works, but it's pointless, sorry for that. About the final remark, which I think is my actual problem: I currently define the variables on the GPU as:

numpy_temp = np.zeros((4096,32,32),  dtype = np.complex64)
gpu_subImg = thr.empty_like(numpy_temp)

The only problem I have with my algorithm is the memory usage; otherwise it works fine. I thought that the line `gpu_subImg = temp_manager.array([4096, 32, 32], np.complex64)` would be equivalent to `thr.empty_like(numpy_temp)`, but obviously I got that wrong, because gpu_subImg.conj() does not work when using the temp_manager array. If I understand correctly, gpu_subImg is then not an array, so I think the real question is how I can send this 3-dimensional array to the temp_manager. My goal is to use it as before, and just control the packing of the different variables that don't intersect into the same physical memory allocations.

Thank you again for your time.

Bogdan Opanchuk

Jul 20, 2021, 5:18:41 PM
to reikna
No, both `thr.empty_like()` and `temp_manager.array()` produce an `Array` object; the only difference is the underlying buffer. Sorry, I got confused myself earlier when saying that `Array` objects don't have `.conj()` - they do, since they inherit from the corresponding PyCUDA/PyOpenCL array class. I just checked in the demo I wrote for you earlier, and `.conj()` works on all the arrays created there. How exactly is it not working for you? Could you post a reproducing example and the error with the backtrace?

Jorge Aguilar

Jul 27, 2021, 4:50:49 AM
to reikna
This is a minimal example to reproduce the error:

import numpy as np
import reikna.cluda as cluda
from reikna.cluda import functions, dtypes
from reikna.cluda.tempalloc import ZeroOffsetManager

dtype = np.complex64
api = cluda.ocl_api()
thr = api.Thread.create(0)

# This works
input_1 = np.ones((200, 200)).astype(np.complex64)
i1 = thr.empty_like(input_1)
i1.conj()

# This does not work
temp_manager = ZeroOffsetManager(thr, pack_on_alloc=True, pack_on_free=True)
a1 = temp_manager.array([200, 200], np.complex64)
a1.conj()


Traceback (most recent call last):
  File "error.py", line 19, in <module>
    a1.conj()
  File "/home/jorge/Virtual_Env/dpivsoft/lib/python3.7/site-packages/pyopencl/array.py", line 1674, in conj
    result.add_event(self._conj(result, self))
  File "/home/jorge/Virtual_Env/dpivsoft/lib/python3.7/site-packages/pyopencl/array.py", line 214, in kernel_runner
    return knl(queue, gs, ls, *args, wait_for=wait_for)
  File "<generated code>", line 18, in enqueue_knl_cfloat_conj_kernel
pyopencl._cl.LogicError: Kernel.set_arg failed: INVALID_VALUE - when processing arg#1 (1-based): invalid kernel argument

Bogdan Opanchuk

Jul 29, 2021, 1:59:52 AM
to reikna
Thanks, I understand what is happening now.

When I tried it, I called `conj()` on an `int32` array, and I guess PyOpenCL takes a shortcut when it sees that it doesn't actually need to do a conjugation, so I didn't notice the problem. When `temp_manager.array()` is called and allocates the actual array inside, it substitutes the allocator with a dummy one, so that the buffer isn't allocated right away. But when `.conj()` is called, it needs to allocate the result, the dummy allocator is still there, and so the call fails.

I pushed a fix to https://github.com/fjarri/reikna/tree/tempalloc-fix - could you try it and see if it works for you? Thank you for discovering this.

Jorge Aguilar

Aug 16, 2021, 11:18:23 AM
to reikna
Hi again Bogdan,

sorry for the delay, due to the holidays. I cloned the repository from the link and installed it in a virtual environment using pip install -e . (I think this step should be fine), but I get the same error with the minimal example I posted. As you pointed out, there is no error if I use an int32 array.

Jorge Aguilar

Nov 12, 2021, 10:49:33 AM
to reikna
Hi again, I came back to this problem after a while, and I discovered that I had been using the wrong branch. Stupid mistake on my side. The fix is working flawlessly.

It would be great when the fixed version gets uploaded to PyPI, so that I can publish the changes and the correct dependencies with my, now quite optimized, package.

Thank you very much,
Jorge

Bogdan Opanchuk

Nov 20, 2021, 3:27:36 AM
to reikna
Hi Jorge,

I published 0.7.6 - please check if it works for you. Unfortunately, there's still an unfixed issue with PyOpenCL versions newer than 2021.2.6: the parameter list of Array has changed, along with the way new arrays are created on arithmetic operations/indexing.

Jorge Aguilar

Dec 1, 2021, 12:03:11 PM
to reikna
Hi Bogdan,

Thank you so much, it is working!!