Another question about the profile: I am seeing some latency between the data transfers.
Label 1: There are two data transfers (one big, and one small right after it) that are initiated by my code.
Label 2: There is a 112-byte transfer taking place at this point. I am not sure what it is.
Label 3: This is where MAGMA's own computation and data transfers begin.
Label 4: These are mallocs that are not initiated by my code (see the nomem_alloc wrapper in the email above).
We are not sure what causes the latency between label 1 and label 3. Would this latency also be observed for the batched call?
I am now trying to use the batched version to increase the workload, and I am also trying to minimize data transfers to the GPU by keeping data on the GPU for longer. Could you please help me understand these calls so that we can plan accordingly for our code?