Wondering about performance cost for AF to allocate memory for matmul return


Philip Murray

Jul 16, 2021, 5:35:56 AM
to ArrayFire Users
Hi, 

I am using ArrayFire C++ with the CUDA backend.

For a C = AB matrix multiplication, in cuBLAS you preallocate memory for C and pass an output pointer into the GEMM call so the result is stored there.

So if I want to perform C = AB iteratively (e.g. reusing output buffers across the forward passes of a neural network), there is only a one-time cost of cudaMalloc.
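To make the pattern I mean concrete, here is a minimal sketch of the cuBLAS version (the dimensions, handle setup, and device pointers d_A/d_B are placeholders I've assumed for illustration):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: preallocate C once, then reuse it across iterations.
void iterate_gemm(cublasHandle_t handle, const float* d_A, const float* d_B,
                  int m, int n, int k, int iterations) {
    float* d_C = nullptr;
    cudaMalloc(&d_C, sizeof(float) * m * n);  // one-time allocation cost

    const float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < iterations; ++i) {
        // Column-major GEMM: C = alpha*A*B + beta*C, written into d_C each time;
        // no further cudaMalloc happens inside the loop.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
    }
    cudaFree(d_C);
}
```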


However, in ArrayFire you don't preallocate memory; you write af::array C = matmul(A, B).
I was wondering: how much of a performance cost is it for ArrayFire to operate without preallocating memory, in situations where output buffers could be reused?
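The ArrayFire equivalent of the loop, for comparison (a sketch; sizes and iteration count are arbitrary). Each assignment conceptually requests a fresh output buffer, but my understanding is that the buffer freed when the previous C goes out of scope is returned to AF's memory manager rather than cudaFree'd, so it can be handed back on the next iteration:

```cpp
#include <arrayfire.h>

int main() {
    af::setBackend(AF_BACKEND_CUDA);
    af::array A = af::randu(512, 512);
    af::array B = af::randu(512, 512);

    for (int i = 0; i < 100; ++i) {
        // New af::array each iteration; whether this costs a cudaMalloc
        // or just a cache lookup is exactly what I'm asking about.
        af::array C = af::matmul(A, B);
        C.eval();
    }
    af::sync();
    return 0;
}
```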

---
I've tried looking at the GitHub code, and if I understand it correctly, the process for allocating an AF array is as follows:

Array constructor -> memAlloc<>() -> memoryManager().alloc() -> current.free_map.find(alloc_bytes)

However, I can't seem to find where that find(alloc_bytes) lookup is implemented on GitHub.
---
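My mental model of that free_map step is a size-keyed cache of freed buffers, something like the toy sketch below. This is my own illustration, not ArrayFire's actual code; alloc_fn and free_fn stand in for the slow backend calls (cudaMalloc/cudaFree):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Toy buffer cache: free() parks buffers in a size-keyed map instead of
// releasing them; alloc() first tries free_map_.find(bytes) and only falls
// back to the expensive backend allocation on a miss.
class BufferCache {
public:
    BufferCache(std::function<void*(size_t)> alloc_fn,
                std::function<void(void*)> free_fn)
        : alloc_fn_(std::move(alloc_fn)), free_fn_(std::move(free_fn)) {}

    void* alloc(size_t bytes) {
        auto it = free_map_.find(bytes);    // the find(alloc_bytes) step
        if (it != free_map_.end() && !it->second.empty()) {
            void* ptr = it->second.back();  // cache hit: no backend call
            it->second.pop_back();
            return ptr;
        }
        return alloc_fn_(bytes);            // cache miss: slow path
    }

    void free(void* ptr, size_t bytes) {
        free_map_[bytes].push_back(ptr);    // keep the buffer for reuse
    }

    ~BufferCache() {
        for (auto& [bytes, ptrs] : free_map_)
            for (void* p : ptrs) free_fn_(p);
    }

private:
    std::function<void*(size_t)> alloc_fn_;
    std::function<void(void*)> free_fn_;
    std::map<size_t, std::vector<void*>> free_map_;
};
```

If something like this is what AF does, then the second and later same-sized allocations are just a map lookup, which would explain why they could be negligible next to the GEMM itself.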

From what I've seen in nvvp, cudaMalloc() is also very slow relative to kernel launch overhead.

However, if ArrayFire's memory manager is lightweight enough in how it caches and hands back freed buffers, then I guess it's possible that acquiring the memory for af::array C = matmul(A, B) is negligible compared to the matmul itself, or even on par with the time to launch a kernel.

Is there any general info on how AF's allocator should perform relative to cudaMalloc?


Thanks!
Philip