Hi,
I am using the ArrayFire C++ API with the CUDA backend.
For a C = AB matrix multiplication, cuBLAS has you preallocate memory for C and pass an output pointer into the GEMM call, which writes the result there.
So if I want to perform C = AB iteratively (e.g. reusing output buffers across forward passes of a neural network), there is only a one-time cudaMalloc cost.
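For concreteness, the cuBLAS pattern I mean looks roughly like this (sizes, names, and the surrounding setup are hypothetical; d_A and d_B are assumed to be device buffers filled elsewhere):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical sizes and iteration count for illustration.
const int n = 1024, iters = 100;

void run(const float* d_A, const float* d_B) {
    const float alpha = 1.0f, beta = 0.0f;

    float* d_C = nullptr;
    cudaMalloc(&d_C, n * n * sizeof(float));  // one-time allocation for the output

    cublasHandle_t handle;
    cublasCreate(&handle);

    for (int i = 0; i < iters; ++i) {
        // Every iteration writes into the same preallocated d_C;
        // no device allocations happen inside the loop.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    }

    cublasDestroy(handle);
    cudaFree(d_C);
}
```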
However, in ArrayFire you don't preallocate: you just write af::array C = matmul(A, B).
I was wondering how much of a performance cost ArrayFire pays by not preallocating, in situations where output buffers could be reused.
---
I've tried reading the code on GitHub, and if I understand it correctly, allocating an AF array proceeds roughly as follows:
Array Constructor -> memAlloc<>() -> memoryManager().alloc() -> current.free_map.find(alloc_bytes)
However, I can't seem to find the find(alloc_bytes) method on GitHub.
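If free_map is a standard container keyed on allocation size, then find() would just be std::map::find rather than a method defined in ArrayFire's source, which might explain why it doesn't show up in a search. A minimal sketch of that size-keyed caching pattern (my own assumption about the design, not AF's actual code; std::malloc stands in for cudaMalloc):

```cpp
#include <cstdlib>
#include <map>
#include <vector>

// Sketch of a size-keyed caching allocator: freed buffers are kept in
// free_map and handed back out on the next request of the same size.
struct CachingAllocator {
    std::map<size_t, std::vector<void*>> free_map;  // size -> cached buffers
    int raw_allocs = 0;                             // counts real allocations

    void* alloc(size_t alloc_bytes) {
        auto it = free_map.find(alloc_bytes);       // plain std::map::find
        if (it != free_map.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            return p;                               // cache hit: no real allocation
        }
        ++raw_allocs;
        return std::malloc(alloc_bytes);            // cache miss: the slow path
    }

    void release(size_t bytes, void* p) {
        free_map[bytes].push_back(p);               // cache instead of freeing
    }
};
```

Under this scheme, repeatedly computing C = matmul(A, B) in a loop would hit the real allocator only on the first iteration, since C's buffer is released back to free_map and found again at the same size.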
---
From what I've seen in nvvp, cudaMalloc() is also very slow relative to kernel launch overhead.
However, if ArrayFire's memory manager is lightweight enough in how it recycles the buffers it retains, then I guess it's possible that getting the memory for af::array C = matmul(A, B) is negligible compared to the matmul itself, or even on par with the time to launch a kernel.
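One way I imagine checking this empirically is to watch the allocation counters across iterations: if the manager recycles C's buffer, the allocated-buffer count should stay flat after the first iteration. A sketch using af::deviceMemInfo (my own test idea, with arbitrary sizes):

```cpp
#include <arrayfire.h>
#include <cstdio>

int main() {
    af::array A = af::randu(512, 512);
    af::array B = af::randu(512, 512);

    for (int i = 0; i < 5; ++i) {
        af::array C = af::matmul(A, B);
        C.eval();  // force the kernel (and any allocation) to actually happen

        size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
                          &lock_bytes, &lock_buffers);
        // If buffers are being recycled, these numbers should stop
        // growing after the first iteration.
        std::printf("iter %d: %zu bytes in %zu buffers\n",
                    i, alloc_bytes, alloc_buffers);
    }
    return 0;
}
```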
Is there any general information on how AF's allocator should perform relative to cudaMalloc?
Thanks!
Philip