here's some basic matrix transpose code

Jason Stratos Papadopoulos

unread,

Apr 7, 1998, 3:00:00 AM4/7/98

to

Hello. First of all, thanks to all the people who helped me out
in my quest for a really fast matrix transpose. Pending the use
of something faster, this is the result. It was really surprising
how much things can be sped up at the source level; too bad it only
works for power of two square matrices (i.e. the easy case, all that I
need). Let me know what you think.

jasonp

--------------CUT HERE INCLUDING THIS LINE----------------------------
/*
The functions here are for a very fast matrix transpose, needed
for Bailey's four-step FFT. Some assumptions: the matrix must
be square, must be a power of two on a side, and must be stored
as a 1-D array. It uses block transposition; divide the matrix
into blocks of a certain optimal size; then transpose the diagonal
blocks and transpose/switch the off-diagonal blocks. BL below is the
power of two defining block size.

On a 90MHz Pentium, blocktranspose() can transpose a 1024x1024
matrix in .097 seconds, more than six times faster than a naive
transpose routine. There's plenty of room for assembly optimization,
but right now the time is dominated by cache misses, and tweaking
code here and there won't help too much.
*/

#define BL 5

/*----------------------------------------------------------------------*/
void copyblock( long *startpt, long startlength,
long *destpt, long destlength ) {
long i,j;

for(i=0; i<(1<<BL); i++) {
for(j=0; j<(1<<BL); j+=8) {

/* Move a BLxBL block from matrix "startpt" (of size
1<<startlength on a side) to matrix "destpt" (of
size 1<<destlength on a side). The inline assembly
below preloads the rows of destptr so they're in
cache when data movement takes place. Then the block
gets loaded, row by row. For non-DOS users the assembly
below isn't vital, it shaves off a few milliseconds
at most and its effects can be duplicated with C code */

asm("
movl (%0,%1,4), %%eax
":
: "r"(destpt), "r"(j)
: "%eax");

destpt[j]=startpt[j]; /* unrolled 8 times, or */
destpt[j+1]=startpt[j+1]; /* one cache line worth */
destpt[j+2]=startpt[j+2];
destpt[j+3]=startpt[j+3];
destpt[j+4]=startpt[j+4];
destpt[j+5]=startpt[j+5];
destpt[j+6]=startpt[j+6];
destpt[j+7]=startpt[j+7];
}
startpt += startlength; /* next row of startpt */
destpt += destlength; /* next row of destpt */
}
}

/*-------------------------------------------------------------------*/
void copytranspose( long *startpt, long *destpt, long destlength ) {

long i, j, *row, temp;

/* this function switches the columns in the BLxBL block
"startpt" with the rows of the BLxBL block in "destpt".
This also has the effect of transposing the block in
destpt, which need only then be copied into its appropriate
place. The code below pairs nicely, and performs the
copying while all the data involved is in cache. */

for(i=0; i<(1<<BL); i++) {
row=startpt;
for(j=0; j<(1<<BL); j+=8, row+=8*(1<<BL)) {

temp = destpt[j];
destpt[j] = row[0];
row[0] = temp;
/* gcc turns the BL stuff here */
temp = destpt[j+1]; /* into numbers, saving lots of */
destpt[j+1] = row[1<<BL]; /* pointer addition and AGIs */
row[1<<BL] = temp;

temp = destpt[j+2];
destpt[j+2] = row[2*(1<<BL)];
row[2*(1<<BL)] = temp;

temp = destpt[j+3];
destpt[j+3] = row[3*(1<<BL)];
row[3*(1<<BL)] = temp;

temp = destpt[j+4];
destpt[j+4] = row[4*(1<<BL)];
row[4*(1<<BL)] = temp;

temp = destpt[j+5];
destpt[j+5] = row[5*(1<<BL)];
row[5*(1<<BL)] = temp;

temp = destpt[j+6];
destpt[j+6] = row[6*(1<<BL)];
row[6*(1<<BL)] = temp;

temp = destpt[j+7];
destpt[j+7] = row[7*(1<<BL)];
row[7*(1<<BL)] = temp;
}
startpt++; /* move to next column */
destpt+=destlength; /* move to next row */
}
}

/*-------------------------------------------------------------------*/
void blocktranspose( long *x , long *scratch, long side ) {
long *corner = &x[0], col, row, blocks = side-BL;

/* transpose an array "x". "scratch" must be the size
of one full block (BL*BL), and "size" is the power of two x
is on a side (e.g. 512x512 matrix has side=9). The loop
below transposes block (1,1) and then moves down the first
row and first column of blocks, switching their transposes.
Then the process repeats for the second row and column, etc */

for(col=1; col<=1<<blocks; col++) {

copyblock( corner, 1<<side, scratch, 1<<BL ); /* diagonals */
copytranspose( scratch, corner, 1<<side );

for( row=1; row< (1<<blocks)-col+1; row++) {

copyblock( corner+(row<<BL), 1<<side,
scratch, 1<<BL ); /* off-diagonals */

copytranspose( scratch, corner+(row<<(side+BL)),
1<<side );

copyblock( scratch, 1<<BL,
corner+(row<<BL), 1<<side);

}
corner += (1<<(side+BL)) + (1<<BL);
}
/* move down and to the right */
}

David A. Heiser

unread,

Apr 9, 1998, 3:00:00 AM4/9/98

to

I have a really, really, really dumb question
Why can't you just change the index loop parameters, one set for normal, the other set for a
transpose.
DAH

Steven G. Johnson

unread,

Apr 9, 1998, 3:00:00 AM4/9/98

to

dahe...@gvn.net (David A. Heiser) wrote:
>I have a really, really, really dumb question
>Why can't you just change the index loop parameters, one set for normal,
>the other set for a transpose.

He wants to perform computations on the columns of the data, and just
changing the indexing would mean that he would be accessing the array with
a large stride between subsequent accesses. This is really hard on the
cache, and memory access is the bottleneck in this sort of program. The
hope is that the cost of a true transpose will be offset by the performance
advantage of accessing the columns as contiguous blocks.

Cordially,
Steven G. Johnson

PS. Please don't quote long postings, especially for a 3 line message.