I have a large matrix, and I want to run dgetrf on one panel of it on the CPU while, in parallel, running a dgemm on another part of the matrix on the GPU. The two operations are not related to each other, so the dgemm does not need to wait for the result from the host.
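// Copy the N x P panel from the GPU to the host (asynchronous on the queue).
magma_dgetmatrix_async( N, P, dA(j, j), ldda, work, w_ldda, queues );
// Wait for the copy to finish before touching the panel on the CPU.
magma_queue_sync( queues );
// Factorize the panel on the CPU, without pivoting.
magma_dgetf2_nopiv( N, P, work, w_ldda, &info_h );
// Independent update on the GPU, submitted to the same queue.
magma_dgemm( MagmaNoTrans, MagmaNoTrans,
             N, M, nb,
             -1, dA(j, j+k), ldda,
                 dA(j, j+z), ldda,
              1, dA(j, j+w), ldda, queues );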
I expected dgetf2 and dgemm to run in parallel, but I see that dgemm, which does not depend on dgetf2, is stalled until dgetf2 finishes. Why is this happening, and how can I solve it? I think the A100 is powerful enough.
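To make the expectation concrete, here is a minimal host-only sketch of the two timelines. It is not MAGMA code: `std::async` is only a hypothetical stand-in for the GPU queue, the sleeps stand in for dgetf2 and dgemm, and the function names `observed_timeline_ms` and `expected_timeline_ms` are made up for illustration. It says nothing about how MAGMA or CUDA actually schedule work; it only shows the difference between the timeline I see and the one I was hoping for.

```cpp
#include <chrono>
#include <future>
#include <thread>

using ms = std::chrono::milliseconds;
using clock_type = std::chrono::steady_clock;

// Stand-in for the blocking CPU routine (dgetf2 in the real code).
void cpu_panel_work(ms d) { std::this_thread::sleep_for(d); }

// Stand-in for the independent work (dgemm on the GPU in the real code).
void independent_work(ms d) { std::this_thread::sleep_for(d); }

// Timeline I observe: the independent work only starts after the CPU
// routine has finished, so the two durations add up.
double observed_timeline_ms(ms cpu, ms other) {
    auto t0 = clock_type::now();
    cpu_panel_work(cpu);
    auto f = std::async(std::launch::async, independent_work, other);
    f.get();
    auto t1 = clock_type::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Timeline I expect: the independent work is already in flight while the
// CPU routine runs, so the total is roughly the longer of the two.
double expected_timeline_ms(ms cpu, ms other) {
    auto t0 = clock_type::now();
    auto f = std::async(std::launch::async, independent_work, other);
    cpu_panel_work(cpu);
    f.get();
    auto t1 = clock_type::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling both functions with equal durations, for example `ms(100)` each, should give a total close to the sum for the observed timeline and close to a single duration for the expected one.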
The gap is clearly visible, and I am sure that it is not related to the device-to-host memory copy: if I remove dgetf2 from the program and only do the d2h memcpy, I don't see the gap.
If I remove dgetf2:
Should I enable something somewhere?