Hi,
I am using the ArrayFire C++ API with the CUDA backend.
For a C = AB matrix multiplication, cuBLAS has you preallocate memory for C and pass an output pointer into the GEMM call, which writes the result there.
So if I want to perform C = AB iteratively (e.g. reusing output buffers across forward passes of a neural network), there is only a one-time cudaMalloc cost.
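For concreteness, the cuBLAS pattern I mean looks roughly like this (sizes, names, and the surrounding setup are hypothetical; d_A and d_B are assumed to be device buffers filled elsewhere):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical sizes and iteration count for illustration.
const int n = 1024, iters = 100;

void run(const float* d_A, const float* d_B) {
    const float alpha = 1.0f, beta = 0.0f;

    float* d_C = nullptr;
    cudaMalloc(&d_C, n * n * sizeof(float));  // one-time allocation for the output

    cublasHandle_t handle;
    cublasCreate(&handle);

    for (int i = 0; i < iters; ++i) {
        // Every iteration writes into the same preallocated d_C;
        // no device allocations happen inside the loop.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    }

    cublasDestroy(handle);
    cudaFree(d_C);
}
```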
However, in ArrayFire you don't preallocate: you just write af::array C = matmul(A, B).
I was wondering how much of a performance cost ArrayFire pays by not preallocating, in situations where output buffers could be reused.
---
I've tried reading the code on GitHub, and if I understand it correctly, allocating an AF array proceeds roughly as follows:
Array Constructor -> memAlloc<>() -> memoryManager().alloc() -> current.free_map.find(alloc_bytes)
However, I can't seem to find the find(alloc_bytes) method on GitHub.
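If free_map is a standard container keyed on allocation size, then find() would just be std::map::find rather than a method defined in ArrayFire's source, which might explain why it doesn't show up in a search. A minimal sketch of that size-keyed caching pattern (my own assumption about the design, not AF's actual code; std::malloc stands in for cudaMalloc):

```cpp
#include <cstdlib>
#include <map>
#include <vector>

// Sketch of a size-keyed caching allocator: freed buffers are kept in
// free_map and handed back out on the next request of the same size.
struct CachingAllocator {
    std::map<size_t, std::vector<void*>> free_map;  // size -> cached buffers
    int raw_allocs = 0;                             // counts real allocations

    void* alloc(size_t alloc_bytes) {
        auto it = free_map.find(alloc_bytes);       // plain std::map::find
        if (it != free_map.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            return p;                               // cache hit: no real allocation
        }
        ++raw_allocs;
        return std::malloc(alloc_bytes);            // cache miss: the slow path
    }

    void release(size_t bytes, void* p) {
        free_map[bytes].push_back(p);               // cache instead of freeing
    }
};
```

Under this scheme, repeatedly computing C = matmul(A, B) in a loop would hit the real allocator only on the first iteration, since C's buffer is released back to free_map and found again at the same size.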
---
From what I've seen in nvvp, cudaMalloc() is also very slow relative to kernel launch overhead.
However, if ArrayFire's memory manager is lightweight enough in how it recycles the buffers it retains, then I guess it's possible that getting the memory for af::array C = matmul(A, B) is negligible compared to the matmul itself, or even on par with the time to launch a kernel.
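One way I imagine checking this empirically is to watch the allocation counters across iterations: if the manager recycles C's buffer, the allocated-buffer count should stay flat after the first iteration. A sketch using af::deviceMemInfo (my own test idea, with arbitrary sizes):

```cpp
#include <arrayfire.h>
#include <cstdio>

int main() {
    af::array A = af::randu(512, 512);
    af::array B = af::randu(512, 512);

    for (int i = 0; i < 5; ++i) {
        af::array C = af::matmul(A, B);
        C.eval();  // force the kernel (and any allocation) to actually happen

        size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
                          &lock_bytes, &lock_buffers);
        // If buffers are being recycled, these numbers should stop
        // growing after the first iteration.
        std::printf("iter %d: %zu bytes in %zu buffers\n",
                    i, alloc_bytes, alloc_buffers);
    }
    return 0;
}
```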
Is there any general information on how AF's allocator should perform relative to cudaMalloc?
Thanks!
Philip