Please help me check whether I'm reading the code correctly.
As I read it, source matrices are always expected to be row-major or column-major, and are then rearranged into the cell format and number of cells that the kernel wants. Does that mean constant matrices can't be pre-formatted in a kernel-favorable order? Or is the compiler expected to figure this out?
Similarly, when the result of one matrix multiply feeds another, is there no facility to reduce the data rearrangement? Are we expected to do the full unpack -> pack round trip each time?
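
To make the question concrete, here is a minimal sketch of what I mean by pack and unpack. It is only an illustration of my understanding, not the library's actual layout or API: the function names `pack_cells` and `unpack_cells`, the row-major ordering inside each cell, and the requirement that dimensions be exact multiples of the cell size are all assumptions on my part.

```python
def pack_cells(flat, rows, cols, cell_rows, cell_cols):
    """Rearrange a row-major matrix into cell-by-cell storage.

    Hypothetical layout: cells are visited left to right, top to bottom,
    and the entries inside each cell are stored row-major.
    """
    assert rows % cell_rows == 0 and cols % cell_cols == 0
    packed = []
    for cell_r in range(0, rows, cell_rows):
        for cell_c in range(0, cols, cell_cols):
            for r in range(cell_r, cell_r + cell_rows):
                for c in range(cell_c, cell_c + cell_cols):
                    packed.append(flat[r * cols + c])
    return packed


def unpack_cells(packed, rows, cols, cell_rows, cell_cols):
    """Inverse of pack_cells: restore the row-major matrix."""
    assert rows % cell_rows == 0 and cols % cell_cols == 0
    out = [0] * (rows * cols)
    i = 0
    for cell_r in range(0, rows, cell_rows):
        for cell_c in range(0, cols, cell_cols):
            for r in range(cell_r, cell_r + cell_rows):
                for c in range(cell_c, cell_c + cell_cols):
                    out[r * cols + c] = packed[i]
                    i += 1
    return out


# A 2x4 row-major matrix packed into 2x2 cells.
matrix = list(range(8))
packed = pack_cells(matrix, 2, 4, 2, 2)
restored = unpack_cells(packed, 2, 4, 2, 2)
```

In these terms, my first question is whether a constant matrix could go through something like `pack_cells` once, ahead of time, instead of on every multiply. My second is whether a chained multiply really has to run the equivalent of `unpack_cells` on the first result and then `pack_cells` again for the next kernel.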