We don’t have it yet. It would be great if you could contribute it!
I don’t see any problem with the approach. I suppose you modified transpose.cu
(rather than the in-place versions).
I would double check the types, leading dimensions, and the printing.
If you try small enough matrices, e.g., smaller than 32x8, there will be no blocking in the code
and everything will be done by a single thread block, with each thread transposing just one element.
You can, for example, make thread (i,j) print HA(i,j) and HA(j,i) from the kernel and, after the transposition, HAT(i,j).
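
Something along these lines is what I have in mind. It is only a sketch, not the code in transpose.cu: the kernel name transpose_debug, the 6x6 size, and the padded leading dimensions lda = 8 and ldat = 7 are all made up for illustration, and I assume column-major storage and double precision. Using leading dimensions that differ from the matrix size and from each other makes an indexing mistake show up right away.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical debug kernel (not the one in transpose.cu).
// Column-major storage: X(i,j) is stored at X[i + j*ldx].
__global__ void transpose_debug(int n, const double *A, int lda,
                                double *AT, int ldat)
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    bool active = (i < n && j < n);

    if (active)
        AT[j + i * ldat] = A[i + j * lda];   // AT(j,i) = A(i,j)

    // Reached by every thread of the block, so AT(i,j), which is written
    // by thread (j,i), is visible before it is printed below.
    __syncthreads();

    if (active)
        printf("thread (%d,%d): A(i,j) = %g  A(j,i) = %g  AT(i,j) = %g\n",
               i, j, A[i + j * lda], A[j + i * lda], AT[i + j * ldat]);
}

int main()
{
    const int n = 6;              // small enough for a single thread block
    const int lda = 8, ldat = 7;  // padded on purpose: lda >= n, ldat >= n
    double hA[lda * n], hAT[ldat * n];

    // A(i,j) = 10*i + j; padding rows are marked with -1.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < lda; ++i)
            hA[i + j * lda] = (i < n) ? 10.0 * i + j : -1.0;
    for (int k = 0; k < ldat * n; ++k)
        hAT[k] = 0.0;

    double *dA = nullptr, *dAT = nullptr;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dAT, sizeof(hAT));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dAT, hAT, sizeof(hAT), cudaMemcpyHostToDevice);

    transpose_debug<<<1, dim3(n, n)>>>(n, dA, lda, dAT, ldat);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hAT, dAT, sizeof(hAT), cudaMemcpyDeviceToHost);

    // Host-side check: AT(j,i) must equal A(i,j) for every element.
    int mismatches = 0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            if (hAT[j + i * ldat] != hA[i + j * lda])
                ++mismatches;
    printf("%d mismatches\n", mismatches);

    cudaFree(dA);
    cudaFree(dAT);
    return mismatches != 0;
}
```

In each printed line A(j,i) and AT(i,j) should agree. If they only agree inside a leading block, I would look first at the leading dimensions passed to the kernel and at the element type used on the host versus the device.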
It is interesting that a 4x4 leading block got transposed correctly, and the rest stayed the same (except those zeroes).