Calling magmablas_sgemm_vbatched in different streams

Christian Reiser

Dec 30, 2020, 1:23:40 PM
to MAGMA User
Hello all,

I want to make multiple calls to "magmablas_sgemm_vbatched" in different CUDA streams. With the help of the Visual Profiler I saw that the corresponding kernels are being serialized:

[Image: Visual Profiler timeline, not preserved]

If you look closely, you can see that there is only a small overlap between memory copies and kernel execution. I wonder if this is something that could be improved, e.g. by using asynchronous memory copies. Or is it more likely that these kernels fully occupy the GPU, so that no concurrent execution is possible?

You might wonder why I use multiple calls to vbatched instead of doing all GEMMs in a single call. I noticed that vbatched performance declines strongly if the batch contains matrices with very different dimensions, so I group the matrices by dimension as a preprocessing step. Although the kernels are almost serialized, I already get a 1.2x speedup with this technique.
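
The grouping step mentioned above might look roughly like the following C++ sketch. It is only an illustration under stated assumptions: `GemmDims`, `group_by_size`, and the `granularity` parameter are hypothetical names that are not part of MAGMA, and bucketing each dimension by a fixed granularity is just one possible way to define "similar dimensions". The post does not say which criterion was actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical helper type (not part of MAGMA): dimensions of one GEMM problem.
struct GemmDims {
    int m, n, k;
};

// Bucket key: each dimension divided by the granularity, rounded up.
using SizeKey = std::tuple<int, int, int>;

// Hypothetical helper (not part of MAGMA): returns, for each size bucket,
// the indices of the problems that fall into it. Each index list could then
// be turned into one vbatched call.
std::map<SizeKey, std::vector<std::size_t>>
group_by_size(const std::vector<GemmDims>& dims, int granularity)
{
    assert(granularity > 0);
    auto bucket = [granularity](int d) {
        return (d + granularity - 1) / granularity;  // ceiling division
    };

    std::map<SizeKey, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < dims.size(); ++i) {
        SizeKey key = std::make_tuple(bucket(dims[i].m),
                                      bucket(dims[i].n),
                                      bucket(dims[i].k));
        groups[key].push_back(i);
    }
    return groups;
}
```

With a granularity of 1, only problems with identical (m, n, k) end up in the same group; larger values produce fewer, more mixed groups.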

Thanks in advance,
Christian

Ahmad Abdelfattah

Dec 30, 2020, 2:33:33 PM
to Christian Reiser, MAGMA User
You can definitely overlap the kernels by using a lower-level expert interface. 

Under magmablas/sgemm_vbatched.cpp, there are three different interfaces other than magmablas_sgemm_vbatched. The one with almost no overhead or memory copies would be:

magmablas_sgemm_vbatched_max_nocheck(
    magma_trans_t transA, magma_trans_t transB,
    magma_int_t* m, magma_int_t* n, magma_int_t* k,
    float alpha,
    float const * const * dA_array, magma_int_t* ldda,
    float const * const * dB_array, magma_int_t* lddb,
    float beta,
    float **dC_array, magma_int_t* lddc,
    magma_int_t batchCount,
    magma_int_t max_m, magma_int_t max_n, magma_int_t max_k,
    magma_queue_t queue );

As you can see, there are three extra parameters that hold the maximum values of (m, n, k) across the batch. You don't have to pass the exact maximums. Upper bounds are OK, but the tighter the better. Note that this routine does not perform any error checks, so you have to be sure about the dimensions you pass.
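
As a rough illustration of the three extra arguments, the sketch below computes the bounds on the host. It assumes that the per-matrix dimensions of one batch are available in host memory and uses plain `int` in place of `magma_int_t`; `MaxDims` and `compute_max_dims` are hypothetical helpers, not MAGMA functions. The MAGMA call itself appears only in a comment, with argument names taken from the signature above and a placeholder queue name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper type (not part of MAGMA): the three bounds that the
// _max_nocheck interface expects in addition to the usual arguments.
struct MaxDims {
    int max_m, max_n, max_k;
};

// Hypothetical helper (not part of MAGMA): scans host copies of the
// dimension arrays of one batch and returns the exact maximums.
MaxDims compute_max_dims(const std::vector<int>& m,
                         const std::vector<int>& n,
                         const std::vector<int>& k)
{
    assert(!m.empty() && m.size() == n.size() && n.size() == k.size());
    MaxDims r;
    r.max_m = *std::max_element(m.begin(), m.end());
    r.max_n = *std::max_element(n.begin(), n.end());
    r.max_k = *std::max_element(k.begin(), k.end());
    return r;
}

// Intended use, once per batch, each batch with its own queue
// (queue_for_this_batch is a placeholder name):
//
//   MaxDims b = compute_max_dims(h_m, h_n, h_k);
//   magmablas_sgemm_vbatched_max_nocheck(
//       transA, transB, m, n, k,
//       alpha, dA_array, ldda, dB_array, lddb,
//       beta,  dC_array, lddc,
//       batchCount, b.max_m, b.max_n, b.max_k,
//       queue_for_this_batch );
```

Because the routine skips error checks, a bound that is too small is not detected; passing exact maximums, as computed here, avoids that case.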

Ahmad



Christian Reiser

Dec 31, 2020, 8:39:42 AM
to MAGMA User, ah...@icl.utk.edu, MAGMA User, Christian Reiser
Thanks. With the _max_nocheck function the kernels execute concurrently, although it does not make a huge difference because the overlap is very small. In general, using the _max_nocheck function made things a bit faster, thanks for that!