ArrayFire interoperability with NVRTC


Ville-Veikko
Mar 31, 2021, 7:51:02 AM
to ArrayFire Users

What is the correct/best way to use AF arrays in a CUDA kernel when using the NVRTC framework, and how do I create AF arrays from data produced by NVRTC-compiled kernels?

So far I've used AF arrays in CUDA kernels with

CUdeviceptr* cudaArray = AFarray.device<CUdeviceptr>();
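
which I then pass to a Driver API launch, roughly like this (a simplified sketch; kernel, n, gridDim, and blockDim are placeholders for my actual NVRTC function and launch configuration):

void* args[] = { &cudaArray, &n };  // the kernel expects (float*, int)
cuLaunchKernel(kernel, gridDim, 1, 1, blockDim, 1, 1, 0, 0, args, 0);
// note: this launches on the default stream; syncing with ArrayFire's stream may be needed
AFarray.unlock();                   // return control of the buffer to ArrayFire afterwards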

However, this produces odd behavior: it works fine in a MEX file in MATLAB, but causes illegal memory accesses in a MEX file in Octave.

For the second case, using NVRTC data in AF computations, I've simply transferred the data to the host first and then created an AF array from it. This is obviously very inefficient, though.

Pradeep Garigipati
Apr 7, 2021, 12:48:05 AM
to Ville-Veikko, ArrayFire Users
ArrayFire by itself doesn't impose any restrictions beyond what CUDA does when an application mixes CUDA's Driver API and Runtime API. As mentioned in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#interoperability-between-runtime-and-driver-apis, you need to make sure the right context is current at any given time.

As for the via-host transfer: if the right CUDA context is active, you can easily cast between, let's say, a float* pointer and a CUdeviceptr without issues. If such a cast pointer is passed to the array constructor, ArrayFire will take care of the device-side data. I am not entirely clear on why you needed an intermediate host buffer for this transfer. If you can share a standalone example code snippet that does what you need to do, we might be able to suggest how you can avoid the intermediate host buffer.
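
If you want ArrayFire to own its own copy and leave your original buffer untouched, one option is to allocate the array first and issue a device-to-device copy yourself. A rough sketch (srcPtr and n are placeholder names for your buffer and its element count):

af::array a(n);                  // f32 by default, uninitialized
float* dst = a.device<float>();  // evaluates and locks the destination buffer
cuMemcpyDtoD(reinterpret_cast<CUdeviceptr>(dst), srcPtr, n * sizeof(float));
a.unlock();                      // hand the buffer back to ArrayFire
// depending on which streams are in play, a synchronization may be needed around the copy

No host buffer is involved either way.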

If given code works in MATLAB and not in Octave, I would suspect something on the Octave side. But again, I am not an expert in either tool.

Hope that helps.
Pradeep.


Ville-Veikko
Apr 7, 2021, 4:28:30 AM
to ArrayFire Users
In OpenCL you can do:

af::array AFarray = afcl::array(dim1, OCLBuffer, f32, true);

What is the corresponding way when using a CUdeviceptr?

Pradeep Garigipati
Apr 7, 2021, 4:55:26 AM
to Ville-Veikko, ArrayFire Users
Assuming you have the right CUDA context active:

CUdeviceptr dapiDevPtr = ....
float* rtDevPtr = (float*)dapiDevPtr;
af::array a(size, rtDevPtr, afDevice); // note that ArrayFire takes ownership of this pointer
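
For example, to wrap and reduce a buffer of n floats under the same assumptions (d_ptr and n are placeholders):

CUdeviceptr d_ptr = ....
af::array a(static_cast<dim_t>(n), reinterpret_cast<float*>(d_ptr), afDevice);
float total = af::sum<float>(a); // the reduction runs entirely on the device

This is the CUDA counterpart of the afcl::array one-liner you quoted.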




Ville-Veikko
Apr 7, 2021, 10:42:59 AM
to ArrayFire Users
That doesn't seem to work if the CUdeviceptr is later used outside of AF (it causes an illegal memory access).

What would be the best way to compute the sum of a CUdeviceptr's contents with AF when the buffer is otherwise used outside of AF?

Pradeep Garigipati
Apr 7, 2021, 11:37:39 AM
to Ville-Veikko, ArrayFire Users
Do you mean it doesn't work if you use the same device pointer later, after the ArrayFire array that used this pointer has been destroyed?




Ville-Veikko
Apr 12, 2021, 3:41:12 AM
to ArrayFire Users
Yes, that's right. Currently I first transfer the data to the host before constructing the AF array.

Pradeep Garigipati
Apr 12, 2021, 3:51:01 AM
to Ville-Veikko, ArrayFire Users
Please send us a standalone source file that we can use to reproduce the problem. That way, we will be on the same page about what we are trying to debug.

Thank you,
Pradeep




Ville-Veikko
May 6, 2021, 5:50:36 AM
to ArrayFire Users
With the code below I'm getting a crash, rather than an error, when CUDAVector is used; the crash happens in the second iteration.

// the function below is looped over N_iterations
function(CUDAVector, dim) {
    float a_Summa = 1.f;
    float* hOut = reinterpret_cast<float*>(CUDAVector);
    af::array uu;
    if (condition) {
        uu = af::array(dim, hOut, afDevice);
        a_Summa = af::sum<float>(uu);
    }

    // use CUDAVector (the crash happens here during the 2nd iteration)

    if (condition) {
        // use a_Summa
    }
}

More complete code can be found here (it currently uses the workaround I mentioned): https://github.com/villekf/OMEGA/blob/master/source/compute_OS_estimates_subiter_CUDA.cpp#L20

Pradeep Garigipati
May 10, 2021, 7:58:07 AM
to Ville-Veikko, ArrayFire Users
Hello Ville-Veikko,

I believe I have shared this info with you earlier: ArrayFire's memory manager (the GPU memory manager) assumes control over the device pointer you pass to it. In this case, I think the following is happening: in the first iteration, the ArrayFire memory manager assumes control of the pointer and then releases it when the corresponding af::array object goes out of scope. If you don't want that to happen, you need to call the lock method on the af::array inside the function, which informs the memory manager not to mark the buffer as free for reuse or deallocation.
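
Applied to your snippet, that would look roughly like this (a sketch, reusing your placeholder names):

float* hOut = reinterpret_cast<float*>(CUDAVector);
af::array uu(dim, hOut, afDevice); // the memory manager now tracks this buffer
uu.lock();                         // do not free or reuse it when uu is destroyed
float a_Summa = af::sum<float>(uu);
// CUDAVector stays valid for later driver-API use; you remain responsible for freeing it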

Hope this helps,
Pradeep.


