Can you please post a code example of what you are trying to achieve? It would also help to know the range of sizes for each DGEMM.
If you are calling the MAGMA wrapper for cuBLAS (magma_dgemm), then the GPU may simply not have enough free resources to run the two DGEMMs concurrently. cuBLAS tries to fill up the GPU even when the sizes are relatively small, so the first DGEMM can leave little or no room for the second one to start before it finishes.
Another suggestion is to experiment with the sizes (e.g. make them really small) and check whether any overlap shows up in the trace.
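To get a feel for why the sizes matter, here is a rough back-of-the-envelope sketch in C++. It is not how cuBLAS actually schedules work: the tile size, the SM count, and the number of resident blocks per SM are all hypothetical values you would have to supply yourself, and the helper names (gemm_blocks, may_overlap) are made up for illustration. It only counts thread blocks and ignores registers, shared memory, and everything else that limits occupancy.

```cpp
#include <cassert>

// Estimate how many thread blocks one GEMM launches, ASSUMING each block
// computes one tile_m x tile_n tile of the output matrix C (m x n).
// The real cuBLAS kernels and their tile sizes are not known here.
long long gemm_blocks(long long m, long long n,
                      long long tile_m, long long tile_n) {
    assert(tile_m > 0 && tile_n > 0);
    long long blocks_m = (m + tile_m - 1) / tile_m;  // ceil(m / tile_m)
    long long blocks_n = (n + tile_n - 1) / tile_n;  // ceil(n / tile_n)
    return blocks_m * blocks_n;
}

// Judging by block count alone, could both GEMMs be resident on the GPU
// at the same time? num_sms and blocks_per_sm are assumed values for your
// device and kernel, not something queried from the hardware.
bool may_overlap(long long blocks_a, long long blocks_b,
                 long long num_sms, long long blocks_per_sm) {
    return blocks_a + blocks_b <= num_sms * blocks_per_sm;
}
```

For example, with an assumed 64x64 tile, a 2048x2048 DGEMM gives gemm_blocks(2048, 2048, 64, 64) = 1024 blocks, which is far more than an assumed 80 SMs with 2 resident blocks each (160 slots) can hold, so a second DGEMM would have to wait. A 128x128 DGEMM gives only 4 blocks, so two of them could in principle be resident together. That is the kind of size at which you would expect to start seeing overlap.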