Parallel execution of Device and Host


aran nokan

Jun 15, 2021, 7:53:41 PM
to MAGMA User

I have a large matrix, and I want to run dgetrf on one panel of it on the CPU while, in parallel, running a dgemm on another part of the matrix on the GPU. The two operations are independent of each other, so the dgemm does not need to wait for the result from the host.

magma_dgetmatrix_async( N, P, dA(j, j), ldda, work, w_ldda, queues[4] );
magma_queue_sync( queues[4] );

int info_h=0;

magma_dgetf2_nopiv(N, P, work, w_ldda, &info_h);

magma_dgemm( MagmaNoTrans, MagmaNoTrans,
             N, M, nb,
             -1, dA(j, j+k), ldda,
                 dA(j, j+z), ldda,
              1, dA(j, j+w), ldda, queues[2] );

I expected dgetf2 and dgemm to run in parallel, but instead the dgemm, which does not depend on dgetf2, stalls and waits for dgetf2 to finish. Why is this happening, and how can I solve it? I think an A100 is powerful enough.

Here the gap is clear, and I am sure it is not caused by the device-to-host memory copy: if I remove dgetf2 from the program and do only the d2h memcpy, I don't see the gap.


If I remove dgetf2:

Should I enable something somewhere?

Best regards,

Mark Gates

Jun 17, 2021, 10:24:03 AM
to aran nokan, MAGMA User
Try switching the order, do the dgemm first, then the getf2. MAGMA routines that don't take a queue are synchronous — they don't return until the operation is finished. BLAS routines like dgemm take a queue and return immediately. You need to synchronize on the queue before using the results. Or you can use 2 threads.


Innovative Computing Laboratory
University of Tennessee, Knoxville