Parallel execution of Device and Host

aran nokan

Jun 15, 2021, 7:53:41 PM

to MAGMA User

Hi,

I have a large matrix, and I want to run dgetrf on one panel of it on the CPU while, in parallel, running a dgemm on another part of the matrix on the GPU. The two operations are independent of each other, and the dgemm does not need to wait for the result from the host.

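// Copy the N x P panel from the GPU to the host and wait for the copy to finish.
magma_dgetmatrix_async( N, P, dA(j, j), ldda, work, w_ldda, queues[4] );
magma_queue_sync( queues[4] );

int info_h = 0;

// Factor the panel on the CPU without pivoting.
magma_dgetf2_nopiv( N, P, work, w_ldda, &info_h );

// Independent update on the GPU, submitted to a different queue.
magma_dgemm( MagmaNoTrans, MagmaNoTrans,
             N, M, nb,
             -1, dA(j, j+k), ldda,
                 dA(j, j+z), ldda,
              1, dA(j, j+w), ldda, queues[2] );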

I expected dgetf2 and dgemm to run in parallel, but the dgemm, which does not depend on dgetf2, stalls and waits for dgetf2 to finish. Why is this happening, and how can I solve it? I think an A100 is powerful enough.

Here the gap is clear, and I am sure it is not related to the d2h memory copy: if I remove dgetf2 from the program and only do the d2h memcpy, I don't see the gap.

If I remove dgetf2:

Should I enable something somewhere?

Best regards,

A.N.

Mark Gates

Jun 17, 2021, 10:24:03 AM

to aran nokan, MAGMA User

Try switching the order: do the dgemm first, then the getf2. MAGMA routines that don't take a queue are synchronous; they don't return until the operation is finished. BLAS routines like dgemm take a queue and return immediately. You need to synchronize on the queue before using the results. Or you can use 2 threads.

Mark

Innovative Computing Laboratory

University of Tennessee, Knoxville
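
The ordering advice in the reply can be illustrated with a small model that uses only the C++ standard library. This is a sketch under stated assumptions, not MAGMA code: `enqueue_gemm_standin` and `host_getf2_standin` are hypothetical stand-ins for a queue-taking call such as magma_dgemm and a synchronous host routine such as magma_dgetf2_nopiv, and `std::async` plays the role of the device queue. The only thing it shows is that the host routine can overlap with the queued work when the queued work is submitted first.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using namespace std::chrono_literals;

// Shared flags used only to detect whether the two pieces of work overlapped.
struct OverlapProbe {
    std::atomic<bool> host_running{false};
    std::atomic<bool> device_saw_host{false};
};

// Poll a flag until it becomes true or the timeout expires.
bool wait_for_flag(const std::atomic<bool>& flag, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!flag.load()) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(1ms);
    }
    return true;
}

// Hypothetical stand-in for a queue-taking call: it returns immediately
// and the work proceeds on another thread.
std::future<void> enqueue_gemm_standin(OverlapProbe& probe) {
    return std::async(std::launch::async, [&probe] {
        if (wait_for_flag(probe.host_running, 300ms)) {
            probe.device_saw_host = true;
        }
    });
}

// Hypothetical stand-in for a synchronous host routine: it does not return
// until its work is finished. Returns true if the queued work was observed
// running at the same time.
bool host_getf2_standin(OverlapProbe& probe) {
    probe.host_running = true;
    const bool overlapped = wait_for_flag(probe.device_saw_host, 300ms);
    probe.host_running = false;
    return overlapped;
}

// gemm_first == true : submit the asynchronous work, then run the host routine.
// gemm_first == false: run the host routine first (the order in the original post).
bool host_overlaps_device(bool gemm_first) {
    OverlapProbe probe;
    bool overlapped = false;
    if (gemm_first) {
        std::future<void> gemm = enqueue_gemm_standin(probe);
        overlapped = host_getf2_standin(probe);
        gemm.get();  // plays the role of synchronizing on the queue
    } else {
        overlapped = host_getf2_standin(probe);
        std::future<void> gemm = enqueue_gemm_standin(probe);
        gemm.get();
    }
    return overlapped;
}
```

Calling `host_overlaps_device(true)` models the suggested order, and `host_overlaps_device(false)` models the order in the original post, where the queued work is not even submitted until the host routine has returned. In real MAGMA code the `gemm.get()` step would correspond to a magma_queue_sync on the queue that received the dgemm.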
