I've played around with modifying the ScaLAPACK API for the gesv routine, and found that a significant speedup can be achieved by changing the layout of the matrix in memory.
Presently a SLATE matrix is mapped directly onto the local matrices provided in the ScaLAPACK layout (columns in contiguous memory). But SLATE seems to perform best (at least gesv does, in my experience) when the local matrices are stored in the native SLATE layout, with each tile in contiguous memory.
See attached a performance comparison of ScaLAPACK against the SLATE ScaLAPACK API in both its original and modified forms (times given as a function of N for an N×N matrix).
What I do (inside the gesv ScaLAPACK API function) is reorder the local matrix in place so that each block is in contiguous memory, then define a standard SLATE matrix and insert tiles with pointers to the blocks in the local matrix.
This does require a small amount of extra memory for the reordering: in my implementation the temporary buffer is one block wide and as tall as the local matrix.
In principle, if the user needs the factored form of the matrix back intact, the matrix can be reordered back to the ScaLAPACK layout after gesv has finished.
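For anyone who wants to see the idea concretely, here is a minimal sketch of the reordering and its inverse. It is not the actual patch: the function names (`colmajor_to_tiles`, `tiles_to_colmajor`, `tile_ptr`) are made up for illustration, and it assumes the local matrix has no padding (local leading dimension equal to the local row count `mloc`), so each block column is one contiguous chunk that can be permuted using a single `mloc` × `nb` work buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: reorder a column-major local matrix (mloc x nloc,
// leading dimension assumed equal to mloc) in place so that every
// mb x nb block is stored contiguously, column-major within the block.
template <typename T>
void colmajor_to_tiles(T* A, int64_t mloc, int64_t nloc, int64_t mb, int64_t nb)
{
    assert(mb > 0 && nb > 0);
    // Work buffer: one block column (height of local matrix, width of one block).
    std::vector<T> work(mloc * std::min(nb, nloc));
    for (int64_t j0 = 0; j0 < nloc; j0 += nb) {
        int64_t jb = std::min(nb, nloc - j0);   // width of this block column
        T* panel = A + j0 * mloc;               // start of the block column
        std::copy(panel, panel + mloc * jb, work.begin());
        T* dst = panel;
        for (int64_t i0 = 0; i0 < mloc; i0 += mb) {
            int64_t ib = std::min(mb, mloc - i0);   // height of this block
            for (int64_t j = 0; j < jb; ++j) {
                const T* src = work.data() + i0 + j * mloc;
                dst = std::copy(src, src + ib, dst);
            }
        }
    }
}

// Hypothetical helper: inverse of colmajor_to_tiles, restoring the
// ScaLAPACK-style column-major layout.
template <typename T>
void tiles_to_colmajor(T* A, int64_t mloc, int64_t nloc, int64_t mb, int64_t nb)
{
    assert(mb > 0 && nb > 0);
    std::vector<T> work(mloc * std::min(nb, nloc));
    for (int64_t j0 = 0; j0 < nloc; j0 += nb) {
        int64_t jb = std::min(nb, nloc - j0);
        T* panel = A + j0 * mloc;
        std::copy(panel, panel + mloc * jb, work.begin());
        const T* src = work.data();
        for (int64_t i0 = 0; i0 < mloc; i0 += mb) {
            int64_t ib = std::min(mb, mloc - i0);
            for (int64_t j = 0; j < jb; ++j) {
                std::copy(src, src + ib, panel + i0 + j * mloc);
                src += ib;
            }
        }
    }
}

// Hypothetical helper: pointer to local block (i, j) after colmajor_to_tiles.
// The block's leading dimension is its own height, min(mb, mloc - i*mb).
template <typename T>
T* tile_ptr(T* A, int64_t mloc, int64_t nloc, int64_t mb, int64_t nb,
            int64_t i, int64_t j)
{
    int64_t jb = std::min(nb, nloc - j * nb);
    return A + j * nb * mloc + i * mb * jb;
}
```

After `colmajor_to_tiles`, each pointer returned by `tile_ptr` (together with the block height as its leading dimension) is what would be handed to SLATE when inserting the corresponding tile into the matrix. Calling `tiles_to_colmajor` after gesv returns puts the factored matrix back in the ScaLAPACK layout.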
I'm wondering if this is something that has been considered but not implemented because it requires allocating some additional memory?